Scoring rule

{{Short description|Measure for evaluating probabilistic forecasts}} {{Distinguish|Score voting}} thumb|Visualization of the expected score under various predictions from some common scoring functions. Dashed black line: forecaster's true belief, red: linear, orange: spherical, purple: quadratic, green: log. In decision theory, both a '''scoring rule'''<ref name="GneitingRaftery2007"> {{cite journal | last1=Gneiting | first1=Tilmann | last2=Raftery | first2=Adrian E. | author2-link=Adrian Raftery | title=Strictly Proper Scoring Rules, Prediction, and Estimation | journal=Journal of the American Statistical Association | year=2007 | volume=102 | issue=447 | pages=359–378 | doi=10.1198/016214506000001437 | url=https://sites.stat.washington.edu/raftery/Research/PDF/Gneiting2007jasa.pdf | s2cid=1878582 }}</ref> as well as a '''scoring function'''<ref name="Gneiting2011"> {{cite journal | last1=Gneiting | first1=Tilmann | title=Making and Evaluating Point Forecasts | journal=Journal of the American Statistical Association|year=2011 | volume=106 | issue=494 | pages=746–762 | doi=10.1198/jasa.2011.r10138 | arxiv=0912.0902 | s2cid=88518170 }}</ref> provide an ''ex post'' summary measure for the evaluation of the quality of a prediction or forecast. They assign a numeric ''score'' to a single prediction given the actual outcome. Depending on the sign convention, this score can be interpreted as a loss or a reward for the forecaster. '''Scoring rules''' assess probabilistic predictions or forecasts, i.e. predictions of the whole probability distribution <math>F</math> of the outcome. On the other hand, '''scoring functions''' assess point predictions, i.e. predictions of a property or functional <math>T(F)</math> of the probability distribution <math>F</math> of the outcome. Examples of such a property are the expectation and the median.

thumb|The average logarithmic score of 10 points i.i.d. sampled from a standard normal distribution (blue histogram), evaluated on a variety of distributions (red line). Although not necessarily true for individual samples, on average, a proper scoring rule will give the lowest score if the predicted distribution matches the data distribution.[[File:Calibration plot.png|thumb|A calibration curve shows how well model predictions are calibrated by comparing the predicted quantiles to the observed quantiles. Blue is the best calibrated model, see calibration (statistics).]]Scoring rules answer the question "how good is a predicted probability distribution given the observation of the actual outcome?" Scoring rules that are '''(strictly) proper''' are proven to have the lowest expected score if the predicted distribution equals the underlying distribution of the target variable. For an individual observation a different forecast may score better, but in expectation the score is minimized by predicting the "correct" distribution.

In the same way, scoring functions answer the question "how good is a point prediction given the observation of the actual outcome?". Scoring functions that are '''(strictly) consistent''' (for the functional <math>T</math>) are proven to have the lowest expected score if the point prediction equals (or is among) the true functional of the underlying distribution of the target variable.

Scoring rules and scoring functions are often used as "cost functions" or "loss functions" of forecasting models. If a sample of forecasts and observations of the outcome is collected, they can be evaluated as the empirical mean of the given sample, often also called the "score". Scores of predictions of different models or forecasters can then be compared to conclude which model or forecaster is best.

For example, consider a probabilistic model that predicts (based on an input <math>x</math>) a Gaussian distribution <math>\mathcal{N}(\mu, \sigma^2)</math> with mean <math>\mu \in \mathbb{R}</math> and standard deviation <math>\sigma \in \mathbb{R}_+</math>. A common interpretation of probabilistic models is that they aim to quantify their own predictive uncertainty. In this example, an observed target variable <math>y \in \mathbb{R}</math> is compared to the predicted distribution <math>\mathcal{N}(\mu, \sigma^2)</math> and assigned a score <math>\mathbf{S}(\mathcal{N}(\mu, \sigma^2), y) \in \mathbb{R}</math>. Training a probabilistic model on a proper scoring rule should "teach" the model to predict when its uncertainty is low and when it is high, and should result in '''calibrated''' predictions while minimizing the predictive uncertainty.

Although the example given concerns the probabilistic forecasting of a real valued target variable, a variety of different scoring rules have been designed with different target variables in mind. Scoring rules exist for binary and categorical probabilistic classification, as well as for univariate and multivariate probabilistic regression.

== Definitions ==

Consider a sample space or observation domain, <math>\Omega</math>, which comprises the potential outcomes of a future observation; a σ-algebra <math>\mathcal A</math> of subsets of <math>\Omega</math> and a convex class <math>\mathcal F</math> of probability measures on <math>(\Omega, \mathcal A)</math>. A function defined on <math>\Omega</math> and taking values in the extended real line, <math>\overline{\mathbb{R}} = [-\infty, \infty]</math>, is <math>\mathcal F</math>-quasi-integrable if it is measurable with respect to <math>\mathcal A</math> and is quasi-integrable with respect to all <math>F \in \mathcal{F}</math>.

A (statistical) functional <math>T</math> is a potentially set-valued mapping from the class of probability distributions <math>\mathcal F</math> to a Euclidean space, i.e. <math>T: \mathcal F \rightarrow \mathbb{R}^d</math> with <math>F \mapsto T(F)</math>.

=== Probabilistic forecast ===

A probabilistic forecast is any probability measure <math>F \in \mathcal{F}</math>, i.e. a distribution of potential future observations.

=== Point forecast ===

A point forecast for the functional <math>T</math> is any value <math>x \in \mathbb{R}^d</math>.

=== Scoring rule ===

A scoring rule is any extended real-valued function <math>\mathbf{S}: \mathcal{F} \times \Omega \rightarrow \overline{\mathbb{R}}</math> such that <math>\mathbf{S}(F, \cdot)</math> is <math>\mathcal F</math>-quasi-integrable for all <math>F \in \mathcal{F}</math>. The value <math>\mathbf{S}(F, y)</math> represents the loss or penalty when the forecast <math>F \in \mathcal{F}</math> is issued and the observation <math>y \in \Omega</math> materializes.

=== Scoring function ===

A scoring function is any real-valued function <math>S: \mathbb{R}^d \times \Omega \rightarrow \mathbb{R}</math> where <math>S(x, y)</math> represents the loss or penalty when the point forecast <math>x \in \mathbb{R}^d</math> is issued and the observation <math>y \in \Omega</math> materializes.

=== Orientation / Sign convention ===

Scoring rules <math>\mathbf{S}(F,y)</math> and scoring functions <math>S(x, y)</math> are negatively (positively) oriented if smaller (larger) values mean better. Changing the convention can be accomplished by multiplying the score by <math>-1</math>. Here we adhere to the negative orientation, hence the association with "loss".

=== Expected score ===

The expected score of a probabilistic prediction <math>F \in \mathcal F</math> with respect to the underlying distribution <math>Q \in \mathcal{F}</math> is written as:

: <math>\mathbb{E}_{Y \sim Q}[\mathbf{S}(F,Y)]= \int \mathbf{S}(F, \omega) \mathrm{d}Q(\omega)</math>

Similarly, the expected score of a point prediction <math>x \in \mathbb{R}^d</math> with respect to the underlying distribution <math>Q \in \mathcal{F}</math> is:

: <math>\mathbb{E}_{Y \sim Q}[S(x,Y)]= \int S(x, \omega) \mathrm{d}Q(\omega)</math>

=== Sample average score ===

The expected score can be estimated by the sample average score. Given a sample of prediction-observation pairs, <math>(F_i, y_i)</math> for probabilistic predictions <math>F_i</math> or <math>(x_i, y_i)</math> for point predictions <math>x_i</math>, with observations <math>y_i \in \Omega</math>, <math>i=1,\ldots, n</math>, the average score is calculated as

* for scoring rules: <math>\widehat{E[\mathbf{S}]} = \frac{1}{n}\sum_{i=1}^n \mathbf{S}(F_i, y_i)</math>

* for scoring functions: <math>\widehat{E[S]} = \frac{1}{n}\sum_{i=1}^n S(x_i, y_i)</math>

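By the law of large numbers, the sample average scores are consistent estimators of the expected score, provided the prediction-observation pairs satisfy its assumptions (for example, when they are independent and identically distributed).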
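As a minimal sketch of the sample average score, the following Python snippet averages two negatively oriented scores over a small sample of binary forecasts. The helper names (`log_score`, `brier_score`, `sample_average_score`) and the data are made up for illustration, the two scores are the logarithmic and Brier scores defined in the examples section of this article, and outcomes are indexed from 0 rather than 1.

```python
import math

def log_score(p, i):
    """Negatively oriented logarithmic score: -ln of the probability of outcome i."""
    return -math.log(p[i])

def brier_score(p, i):
    """Brier score: squared distance between p and the one-hot vector of outcome i."""
    return sum(((1.0 if j == i else 0.0) - pj) ** 2 for j, pj in enumerate(p))

def sample_average_score(score, forecasts, outcomes):
    """Empirical mean of score(F_i, y_i) over prediction-observation pairs."""
    if len(forecasts) != len(outcomes):
        raise ValueError("need exactly one observation per forecast")
    return sum(score(F, y) for F, y in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical sample: three probability vectors and the outcomes that occurred.
forecasts = [(0.7, 0.3), (0.2, 0.8), (0.5, 0.5)]
outcomes = [0, 1, 1]

mean_brier = sample_average_score(brier_score, forecasts, outcomes)
mean_log = sample_average_score(log_score, forecasts, outcomes)
print(mean_brier, mean_log)
```

Passing the score as a function keeps the averaging step independent of the particular scoring rule, which mirrors the two bullet formulas above.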

== Properties ==

=== Propriety and consistency ===

Strictly proper scoring rules and strictly consistent scoring functions encourage honest forecasts by maximization of the expected reward: if a forecaster is given a reward of <math>-\mathbf{S}(F, y)</math> if <math>y</math> realizes (e.g. <math>y=\text{rain}</math>), then the highest expected reward (lowest score) is obtained by reporting the true probability distribution.<ref name="GneitingRaftery2007"/>

==== {{anchor|ProperScoringRules}}Proper scoring rules ====

A scoring rule <math>\mathbf{S}</math> is '''proper''' relative to <math>\mathcal{F}</math> if (assuming negative orientation) its expected score is minimized when the forecasted distribution matches the distribution of the observation:

: <math> \mathbb{E}_{Y \sim Q}[\mathbf{S}(Q, Y)] \leq \mathbb{E}_{Y \sim Q}[\mathbf{S}(F, Y)]</math> for all <math>F,Q \in\mathcal{F}</math>.

{{anchor|StrictlyProperScoringRules}}It is '''strictly proper''' if the above inequality holds with equality if and only if <math>F=Q</math>.
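The defining inequality can be checked numerically for a binary event. The sketch below assumes a true event probability <math>q = 0.7</math> and searches a grid of reported probabilities for the one with the lowest expected score. The binary forms of the Brier and logarithmic scores are used, together with a negatively oriented linear score (the negative of the probability assigned to the outcome) as an example of a rule that is not proper. All function names and the grid are illustrative choices.

```python
import math

def brier(p, y):
    """Binary Brier score for reported event probability p and outcome y in {0, 1}."""
    return (y - p) ** 2

def log_loss(p, y):
    """Negatively oriented logarithmic score."""
    return -math.log(p if y == 1 else 1.0 - p)

def linear(p, y):
    """Negatively oriented linear score; included as a rule that is NOT proper."""
    return -(p if y == 1 else 1.0 - p)

def expected_score(score, p, q):
    """Expected score of reporting p when the event truly occurs with probability q."""
    return q * score(p, 1) + (1.0 - q) * score(p, 0)

def best_report(score, q, grid):
    """Reported probability on the grid with the lowest expected score."""
    return min(grid, key=lambda p: expected_score(score, p, q))

grid = [k / 100 for k in range(1, 100)]  # 0.01, ..., 0.99 (avoids log(0))
q = 0.7
rules = [("brier", brier), ("log", log_loss), ("linear", linear)]
best = {name: best_report(rule, q, grid) for name, rule in rules}
print(best)
```

Under these assumptions the Brier and logarithmic scores are minimized by the honest report, while the linear score rewards pushing the report to the edge of the grid.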

==== Consistent scoring functions ====

A scoring function <math>S</math> is '''consistent''' for the functional <math>T</math> relative to the class <math>\mathcal F</math> if

: <math> \mathbb{E}_{Y \sim F}[S(t, Y)] \leq \mathbb{E}_{Y \sim F}[S(x, Y)]</math> for all <math>F \in \mathcal{F}</math>, all <math>t \in T(F)</math> and all <math>x \in \mathbb{R}^d</math>.

It is '''strictly consistent''' if it is consistent and equality in the above inequality implies that <math>x \in T(F)</math>.

=== Affine transformation ===

A strictly proper scoring rule remains strictly proper after an affine transformation, and a strictly consistent scoring function (for some functional <math>T</math>) remains strictly consistent.<ref name="Bickel " /> That is, if <math>\mathbf{S}(F,y)</math> is a strictly proper scoring rule then <math>a+b\mathbf{S}(F,y)</math> with <math>b \neq 0</math> is also a strictly proper scoring rule, though if <math>b < 0</math> then the optimization sense of the scoring rule switches between maximization and minimization. For scoring functions the same statement applies with the obvious changes.

=== Locality ===

A proper scoring rule is said to be ''local'' if the score it assigns depends only on the probability that the forecast gave to the event that actually occurred, and not on how the remaining probability was spread over the other events. In most cases this can be read as follows: the optimal solution of the scoring problem "at a specific event" is invariant to all changes in the observation distribution that leave the probability of that event unchanged. All binary scores are local because the probability assigned to the event that did not occur is determined by the probability of the event that did, so there is no remaining freedom to vary.

Affine functions of the logarithmic scoring rule are the only strictly proper local scoring rules on a finite set that is not binary.

=== Decomposition ===

The expectation value of a proper scoring rule <math>S</math> can be decomposed into the sum of three components, called ''uncertainty'', ''reliability'', and ''resolution'',<ref name="Murphy"> {{Cite journal | last = Murphy | first= A.H. | year = 1973 | title = A new vector partition of the probability score | journal = Journal of Applied Meteorology | volume = 12 | issue = 4 | pages = 595–600 | doi = 10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2 | bibcode= 1973JApMe..12..595M | doi-access = free }}</ref><ref name="Broecker"> {{Cite journal | last = Bröcker | first= J. | year = 2009 | title = Reliability, sufficiency, and the decomposition of proper scores | journal = Quarterly Journal of the Royal Meteorological Society | volume = 135 | issue = 643 | pages = 1512–1519 | url = http://www.personal.reading.ac.uk/~pt904209/publications/decomposition_qjrms.pdf | doi = 10.1002/qj.456 | arxiv = 0806.0813 | bibcode = 2009QJRMS.135.1512B | s2cid= 15880012 }}</ref> which characterize different attributes of probabilistic forecasts:

:<math> E(S) = \mathrm{UNC} + \mathrm{REL} - \mathrm{RES}. </math>

If a score is proper and negatively oriented (such as the Brier Score), all three terms are non-negative. The uncertainty component is equal to the expected score of the forecast which constantly predicts the average event frequency. The reliability component penalizes poorly calibrated forecasts, in which the predicted probabilities do not coincide with the event frequencies.

The equations for the individual components depend on the particular scoring rule. For the Brier Score, they are given by

:<math> \mathrm{UNC} = \bar{x}(1-\bar{x}) </math>
:<math> \mathrm{REL} = E(p-\pi(p))^2 </math>
:<math> \mathrm{RES} = E(\pi(p)-\bar{x})^2 </math>

where <math>\bar{x}</math> is the average probability of occurrence of the binary event <math>x</math>, and <math>\pi(p)</math> is the conditional event probability, given <math>p</math>, i.e. <math>\pi(p) = P(x=1\mid p)</math>.
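The decomposition can be verified on a finite sample. The sketch below assumes that forecasts take only a few distinct values, so that <math>\pi(p)</math> can be estimated by the observed event frequency among all cases with the same forecast <math>p</math>. It uses the one-component binary Brier score <math>(p - x)^2</math>, which matches <math>\mathrm{UNC} = \bar{x}(1-\bar{x})</math>. The function name and the data are hypothetical.

```python
from collections import defaultdict

def brier_decomposition(probs, outcomes):
    """Return (UNC, REL, RES) for binary outcomes, grouping cases by forecast value."""
    n = len(probs)
    base_rate = sum(outcomes) / n  # average event frequency, x-bar
    groups = defaultdict(list)
    for p, x in zip(probs, outcomes):
        groups[p].append(x)
    # pi(p) is estimated by the observed frequency within each forecast group.
    rel = sum(len(xs) * (p - sum(xs) / len(xs)) ** 2 for p, xs in groups.items()) / n
    res = sum(len(xs) * (sum(xs) / len(xs) - base_rate) ** 2 for xs in groups.values()) / n
    unc = base_rate * (1.0 - base_rate)
    return unc, rel, res

# Hypothetical sample: ten forecasts with three distinct probability values.
probs = [0.1] * 4 + [0.5] * 2 + [0.9] * 4
outcomes = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0]

unc, rel, res = brier_decomposition(probs, outcomes)
mean_brier = sum((p - x) ** 2 for p, x in zip(probs, outcomes)) / len(probs)
print(mean_brier, unc + rel - res)
```

With forecasts grouped by exact value the identity holds exactly on the sample; with binned continuous forecasts it would hold only approximately.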

== Examples of proper scoring rules ==

There are infinitely many scoring rules, including entire parameterized families of strictly proper scoring rules. The ones shown below are simply popular examples.

=== Categorical variables ===

For a categorical response variable with <math>m</math> mutually exclusive events, <math>Y \in \Omega = \{1, \ldots, m\}</math>, a probabilistic forecaster or algorithm will return a probability vector <math>\mathbf{p} \in [0,1]^m</math> with probabilities for each of the <math>m</math> outcomes.

If <math>y=i</math> materializes, one often abbreviates the score as <math>\mathbf{S}(\mathbf{p}, i)</math>.

==== Logarithmic score ====

{{See also|Deviance (statistics)}}
thumb|upright=1.25|Expected value of logarithmic rule. When Event 1 is expected to occur with probability of 0.8, the blue line is described by the function <math>0.8 \log(x)+(1-0.8)\log(1-x)</math>.

The logarithmic scoring rule is a strictly proper and local scoring rule. It is the negative of the surprisal <math>-\ln(p_i)</math>, which is commonly used as a scoring criterion in Bayesian inference; its expected value under an honest forecast is the negative of the Shannon entropy. This scoring rule has strong foundations in information theory.

:<math>\mathbf{S}(\mathbf{p}, i) = \ln(p_i) </math>

Here, the score is calculated as the logarithm of the probability estimate for the actual outcome. In this form the score is positively oriented (larger is better); multiplying by <math>-1</math> gives the negatively oriented version. For example, a prediction of 80% that correctly proved true would receive a score of {{math|ln(0.8) {{=}} −0.22}}. This same prediction also assigns 20% likelihood to the opposite case, and so if the prediction proves false, it would receive a score based on the 20%: {{math|ln(0.2) {{=}} −1.6}}. The goal of a forecaster is to make the score as large as possible, and −0.22 is indeed larger than −1.6.

If one treats the truth or falsity of the prediction as a variable {{math|''x''}} with value 1 or 0 respectively, and the expressed probability as {{math|''p''}}, then one can write the logarithmic scoring rule as {{math|''x'' ln(''p'') + (1 − ''x'') ln(1 − ''p'')}}. Note that any logarithmic base may be used, since strictly proper scoring rules remain strictly proper under linear transformation. That is:

:<math>L(\mathbf{p},i) = \log_b(p_i) </math>

is strictly proper for all <math>b>1</math>.

==== Brier/Quadratic score ====

The quadratic scoring rule is a strictly proper scoring rule

:<math>\mathbf{S}_Q(\mathbf{p},i) = 2p_i - \mathbf{p}\cdot \mathbf{p} = 2p_i -\sum_{j=1}^m p_j^2 </math>

where <math>p_i</math> is the probability assigned to the correct answer <math>i</math>.

The Brier score, originally proposed by Glenn W. Brier in 1950,<ref name="Brier"> {{Cite journal | last = Brier | first= G.W. | year = 1950 | title = Verification of forecasts expressed in terms of probability | journal = Monthly Weather Review | volume = 78 | issue = 1 | pages = 1–3 | url = http://docs.lib.noaa.gov/rescue/mwr/078/mwr-078-01-0001.pdf | doi = 10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2 | bibcode= 1950MWRv...78....1B }}</ref> can be obtained by an affine transform from the quadratic scoring rule.

:<math>\mathbf{S}_B(\mathbf{p},i) = \sum_{j=1}^m (y_j-p_j)^2 </math>

where <math>y_j = 1</math> when the <math>j</math>th event is correct and <math>y_j = 0</math> otherwise. It can be thought of as a generalization of mean squared error to probabilistic forecasts.

An important difference between these two rules is that a forecaster should strive to maximize the quadratic score <math>\mathbf{S}_Q</math> yet minimize the Brier score <math>\mathbf{S}_B</math>. This is due to a negative sign in the linear transformation between them.
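The affine relation between the two rules can be made explicit: expanding the square in the Brier score and using <math>\sum_j p_j = 1</math> gives <math>\mathbf{S}_B = 1 - \mathbf{S}_Q</math>. The following sketch checks this numerically; the function names and the probability vector are illustrative, and outcomes are indexed from 0.

```python
def quadratic_score(p, i):
    """Positively oriented quadratic score: 2 * p_i minus the squared norm of p."""
    return 2.0 * p[i] - sum(pj * pj for pj in p)

def brier_score(p, i):
    """Negatively oriented Brier score against the one-hot vector of outcome i."""
    return sum(((1.0 if j == i else 0.0) - pj) ** 2 for j, pj in enumerate(p))

# Hypothetical forecast over three classes; the entries must sum to 1.
p = (0.2, 0.5, 0.3)
for i in range(len(p)):
    # S_B = 1 - S_Q holds for every possible outcome i.
    print(i, brier_score(p, i), 1.0 - quadratic_score(p, i))
```

The negative coefficient in front of <math>\mathbf{S}_Q</math> is what flips the orientation from maximization to minimization.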

==== Spherical score ====

The spherical scoring rule is also a strictly proper scoring rule

:<math>\mathbf{S}(\mathbf{p},i) = \frac{p_i}{\lVert \mathbf{p} \rVert} = \frac{p_i}{\sqrt{p_1^2 + \cdots + p_m^2}} </math>

Its generalization with <math>\alpha > 1</math> is also strictly proper

:<math>\mathbf{S}(\mathbf{p},i) = \frac{p_i^{\alpha-1}}{\left(\sum_{j=1}^m p_j^\alpha\right)^{(\alpha-1)/\alpha}}</math>

==== Ranked Probability Score ====

The ranked probability score<ref name="Epstein 1969 pp. 985–987">{{cite journal | last=Epstein | first=Edward S. | title=A Scoring System for Probability Forecasts of Ranked Categories | journal=Journal of Applied Meteorology and Climatology | publisher=American Meteorological Society | volume=8 | issue=6 | date=1969-12-01 | doi=10.1175/1520-0450(1969)008<0985:ASSFPF>2.0.CO;2 | pages=985–987 | url=https://journals.ametsoc.org/view/journals/apme/8/6/1520-0450_1969_008_0985_assfpf_2_0_co_2.xml | access-date=2024-05-02}}</ref> (RPS) is a strictly proper scoring rule that can be expressed as:

:<math>RPS(\mathbf{p},i) = \sum_{k=1}^{m-1} \left(\sum_{j=1}^k (p_j - y_j)\right)^2</math>

where <math>y_j = 1</math> when the <math>j</math>th event is correct and <math>y_j = 0</math> otherwise, and <math>m</math> is the number of classes. Unlike the scoring rules above, the ranked probability score considers the distance between classes, i.e. classes 1 and 2 are considered closer than classes 1 and 3. The score assigns better scores to probabilistic forecasts with high probabilities assigned to classes close to the correct class. For example, when considering probabilistic forecasts <math> \mathbf{p}_1 = (0.5, 0.5, 0)</math> and <math> \mathbf{p}_2 = (0.5, 0, 0.5)</math>, we find that <math>RPS(\mathbf{p}_1,1) = 0.25</math>, while <math>RPS(\mathbf{p}_2,1) = 0.5</math>, despite both probabilistic forecasts assigning identical probability to the correct class.
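The worked example can be reproduced with a few lines of Python. This is a sketch under the reading that the inner sum accumulates the differences <math>p_j - y_j</math>, i.e. it compares cumulative forecast and cumulative outcome; the function name `rps` is an illustrative choice and classes are indexed from 0, so class 1 of the text is index 0.

```python
def rps(p, i):
    """Ranked probability score of probability vector p when class i (0-based) occurs."""
    total = 0.0
    cum_forecast = 0.0
    cum_outcome = 0.0
    # The last cumulative term is always 1 - 1 = 0, so k runs over the first m - 1 classes.
    for k in range(len(p) - 1):
        cum_forecast += p[k]
        cum_outcome += 1.0 if k == i else 0.0
        total += (cum_forecast - cum_outcome) ** 2
    return total

p1 = (0.5, 0.5, 0.0)
p2 = (0.5, 0.0, 0.5)
print(rps(p1, 0))  # forecast mass next to the correct class
print(rps(p2, 0))  # forecast mass two classes away
```

The first forecast scores 0.25 and the second 0.5, matching the values quoted above.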

==== Comparison of categorical strictly proper scoring rules ====

Shown below on the left is a graphical comparison of the Logarithmic, Quadratic, and Spherical scoring rules for a binary classification problem. The ''x''-axis indicates the reported probability for the event that actually occurred.

Note that each of the scores has a different magnitude and location. The magnitude differences are not relevant, however, as scores remain proper under affine transformation. Therefore, to compare different scores it is necessary to move them to a common scale. A reasonable choice of normalization is shown in the picture where all scores intersect the points (0.5,0) and (1,1). This ensures that they yield 0 for a uniform distribution (two probabilities of 0.5 each), reflecting no cost or reward for reporting what is often the baseline distribution. All normalized scores below also yield 1 when the true class is assigned a probability of 1.
{|style="margin:1em auto;"
| thumb|right|upright=1.25|Score of a binary classification for the true class showing logarithmic (blue), spherical (green), and quadratic (red)
| thumb|left|upright=1.25|Normalized score of a binary classification for the true class showing logarithmic (blue), spherical (green), and quadratic (red)
|}

=== Univariate continuous variables ===

The scoring rules listed below aim to evaluate probabilistic predictions when the predicted distributions are univariate continuous probability distributions, i.e. the predicted distributions <math>F</math> are defined over a univariate target variable <math>Y \in \mathbb{R}</math> and have a probability density function <math>f: \mathbb{R} \to \mathbb{R}_+</math>. They can be categorized into three groups:

* Scoring rules for predictions of the probability density <math>f</math>
* Scoring rules for predictions of the CDF <math>F</math>
* Scoring rules depending on the first and second moments only

==== Logarithmic score for continuous variables ====

The logarithmic score is a local, strictly proper scoring rule. It is defined as

:<math>L(F,y) = - \ln(f(y)).</math>

The logarithmic score for continuous variables has strong ties to maximum likelihood estimation and to the Kullback–Leibler divergence.

==== Quadratic score for continuous variables ====

The quadratic scoring rule for continuous variables reads

:<math>S(f,y)= 2 f(y) - \|f\|_2^2 </math>

It is strictly proper for densities for which the norm <math>\|f\|_2 = \left(\int f(y)^2 dy\right)^{\frac{1}{2}} </math> is finite.

==== Continuous ranked probability score ==== thumb|Illustration of the continuous ranked probability score (CRPS). Given a sample y and a predicted cumulative distribution F, the CRPS is given by computing the difference between the curves at each point x of the support, squaring it and integrating it over the whole support.

The continuous ranked probability score (CRPS)<ref>{{cite journal |last1=Matheson |first1=James E. |last2=Winkler |first2=Robert L. |title=Scoring Rules for Continuous Probability Distributions |journal=Management Science |date=June 1976 |volume=22 |issue=10 |pages=1087–1096 |doi=10.1287/mnsc.22.10.1087}}</ref> is a strictly proper scoring rule much used in meteorology. It is closely related to the one-dimensional energy distance, and is defined as

:<math>CRPS(F,y)=\int_\mathbb{R} ( F(x) - H(x - y) ) ^2 dx</math>

where <math>H</math> is the Heaviside step function and <math>y \in \mathbb R</math> is the observation. For distributions with finite first moment, the continuous ranked probability score can be written as:<ref name="GneitingRaftery2007"/>

:<math>CRPS(F, y) = \mathbb{E}_{X \sim F}|X-y| - \frac{1}{2}\mathbb{E}_{X,X' \sim F}|X-X'|</math>

where <math>X</math> and <math>X'</math> are independent random variables, both sampled from the distribution <math>F</math>. This is the ''energy form'' of CRPS and opens the door to estimating the CRPS via Monte Carlo sampling (through approximating the expectation value).
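A minimal sketch of such a Monte Carlo estimate is shown below. It draws two independent samples from a standard normal forecast, approximates both expectations of the energy form by sample means, and compares the result with the well-known closed form for a Gaussian forecast, <math>\sigma\left[z(2\Phi(z)-1) + 2\varphi(z) - 1/\sqrt{\pi}\right]</math> with <math>z = (y-\mu)/\sigma</math>, which is not derived in this article. The function names, sample size and seed are illustrative assumptions.

```python
import math
import random

def crps_monte_carlo(sampler, y, n, rng):
    """Estimate CRPS(F, y) from the energy form using two independent samples of size n."""
    xs = [sampler(rng) for _ in range(n)]
    xs_prime = [sampler(rng) for _ in range(n)]
    mean_abs_error = sum(abs(x - y) for x in xs) / n                  # E|X - y|
    mean_spread = sum(abs(a - b) for a, b in zip(xs, xs_prime)) / n   # E|X - X'|
    return mean_abs_error - 0.5 * mean_spread

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of the Gaussian forecast N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

rng = random.Random(0)  # fixed seed so that the run is reproducible
estimate = crps_monte_carlo(lambda r: r.gauss(0.0, 1.0), 0.5, 100_000, rng)
exact = crps_gaussian(0.0, 1.0, 0.5)
print(estimate, exact)
```

The estimate only requires the ability to sample from the forecast, which is why the energy form is convenient for models without a tractable CDF.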

Furthermore, when the cumulative probability function <math>F</math> is continuous, the continuous ranked probability score can also be written as<ref name="Taillardat Mestre Zamo Naveau 2016 pp. 2375–2393">{{cite journal | last1=Taillardat | first1=Maxime | last2=Mestre | first2=Olivier | last3=Zamo | first3=Michaël | last4=Naveau | first4=Philippe | title=Calibrated Ensemble Forecasts Using Quantile Regression Forests and Ensemble Model Output Statistics | journal=Monthly Weather Review | publisher=American Meteorological Society | volume=144 | issue=6 | date=2016-06-01 | issn=0027-0644 | doi=10.1175/mwr-d-15-0260.1 | pages=2375–2393| url=https://hal-meteofrance.archives-ouvertes.fr/meteo-03544106/file/%5B15200493%20-%20Monthly%20Weather%20Review%5D%20Calibrated%20Ensemble%20Forecasts%20Using%20Quantile%20Regression%20Forests%20and%20Ensemble%20Model%20Output%20Statistics.pdf }}</ref>

:<math>CRPS(F, y) = \mathbb{E}_{X \sim F}|X-y| + \mathbb{E}_{X \sim F}[X] - 2 \mathbb{E}_{X \sim F}[X \cdot F(X)]</math>

The continuous ranked probability score can be seen both as a continuous extension of the ranked probability score and of quantile regression. The continuous ranked probability score of the empirical distribution <math>\hat F_q</math> of an ordered set of points <math>q_1 \leq \ldots \leq q_n</math> (i.e. every point has <math>1/n</math> probability of occurring) is equal to twice the mean '''quantile loss''' applied on those points with evenly spread quantiles <math>(\tau_1, \ldots, \tau_n) = (1/(2n), \ldots, (2n-1)/(2n))</math>:<ref name="Bröcker 2012 pp. 1611–1617">{{cite journal | last=Bröcker | first=Jochen | title=Evaluating raw ensembles with the continuous ranked probability score | journal=Quarterly Journal of the Royal Meteorological Society | volume=138 | issue=667 | date=2012 | issn=0035-9009 | doi=10.1002/qj.1891 | pages=1611–1617}}</ref>

:<math>CRPS\left(\hat F_q, y\right) = \frac{2}{n} \sum_{i=1}^n \left[ \tau_i (y - q_i)_+ + (1 - \tau_i) (q_i - y)_+ \right]</math>
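The equivalence with the quantile loss can be checked numerically. The sketch below evaluates the CRPS of an empirical distribution in two ways: through the energy form <math>\mathbb{E}|X-y| - \tfrac{1}{2}\mathbb{E}|X-X'|</math> with both expectations taken exactly over the <math>n</math> points, and through twice the mean quantile loss at the levels <math>\tau_i = (2i-1)/(2n)</math>. The function names and test points are illustrative.

```python
import random

def crps_empirical(points, y):
    """CRPS of the empirical distribution of the points, via the energy form."""
    n = len(points)
    mean_abs_error = sum(abs(q - y) for q in points) / n
    half_mean_spread = sum(abs(a - b) for a in points for b in points) / (2.0 * n * n)
    return mean_abs_error - half_mean_spread

def crps_quantile_form(points, y):
    """Twice the mean quantile loss with levels tau_i = (2i - 1) / (2n)."""
    qs = sorted(points)  # the identity assumes q_1 <= ... <= q_n
    n = len(qs)
    total = 0.0
    for i, q in enumerate(qs, start=1):
        tau = (2 * i - 1) / (2 * n)
        total += tau * max(y - q, 0.0) + (1.0 - tau) * max(q - y, 0.0)
    return 2.0 * total / n

rng = random.Random(1)
points = [rng.uniform(-3.0, 3.0) for _ in range(25)]
for y in (-4.0, 0.3, 5.0):
    print(y, crps_empirical(points, y), crps_quantile_form(points, y))
```

The double sum in the energy form costs <math>O(n^2)</math> operations, whereas the quantile form needs only a sort, which is one practical reason to prefer it for large ensembles.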

For many popular families of distributions, closed-form expressions for the continuous ranked probability score have been derived. The continuous ranked probability score has been used as a loss function for artificial neural networks, in which weather forecasts are postprocessed to a Gaussian probability distribution.<ref name="Rasp Lerch 2018 pp. 3885–3900">{{cite journal | last1=Rasp | first1=Stephan | last2=Lerch | first2=Sebastian | title=Neural Networks for Postprocessing Ensemble Weather Forecasts | journal=Monthly Weather Review | publisher=American Meteorological Society | volume=146 | issue=11 | date=2018-10-31 | issn=0027-0644 | doi=10.1175/mwr-d-18-0187.1 | pages=3885–3900| arxiv=1805.09091 }}</ref><ref name="Grönquist Yao Ben-Nun Dryden 2021 p. 20200092">{{cite journal | last1=Grönquist | first1=Peter | last2=Yao | first2=Chengyuan | last3=Ben-Nun | first3=Tal | last4=Dryden | first4=Nikoli | last5=Dueben | first5=Peter | last6=Li | first6=Shigang | last7=Hoefler | first7=Torsten | title=Deep learning for post-processing ensemble weather forecasts | journal=Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences | volume=379 | issue=2194 | date=2021-04-05 | issn=1364-503X | doi=10.1098/rsta.2020.0092 | article-number=20200092| pmid=33583263 | arxiv=2005.08748 }}</ref>

CRPS was also adapted to survival analysis to cover censored events.<ref>Countdown Regression: Sharp and Calibrated Survival Predictions, https://arxiv.org/abs/1806.08324</ref>

The CRPS can be thought of as the generalization of the mean absolute error (MAE) to probabilistic forecasts: for a deterministic forecast, i.e. a point mass at a single value, it reduces to the absolute error. Another way to think of it is as the Brier/quadratic score of the forecast cumulative distribution function <math>F</math> for the binary event <math>\{Y \leq x\}</math>, integrated over all thresholds <math>x</math>.

CRPS is a special case of the Cramér distance and can be seen as an improvement of Wasserstein distance often used in machine learning. Cramér distance performed better in ordinal regression than KL distance or the Wasserstein metric.<ref>The Cramer Distance as a Solution to Biased Wasserstein Gradients https://arxiv.org/abs/1705.10743</ref>

CRPS is widely used for evaluating probabilistic forecasts and compared against other scoring rules, see for example <ref name="Bjerregård Møller Madsen 2021 p. 100058">{{cite journal | last1=Bjerregård | first1=Mathias Blicher | last2=Møller | first2=Jan Kloppenborg | last3=Madsen | first3=Henrik | title=An introduction to multivariate probabilistic forecast evaluation | journal=Energy and AI | publisher=Elsevier BV | volume=4 | year=2021 | issn=2666-5468 | doi=10.1016/j.egyai.2021.100058 | article-number=100058| doi-access=free }}</ref>. It also has some critical theoretical limitations: It has been shown that CRPS can produce systematically misleading evaluations by favoring probabilistic forecasts whose medians are close to the observed outcome, regardless of the actual probability assigned to that region, potentially resulting in higher scores for forecasts that allocate negligible (or even zero) probability mass to the true outcome. Furthermore, CRPS is not invariant under smooth transformations of the forecast variable, and its ranking of forecast systems may reverse under such transformations, raising concerns about its consistency for evaluation purposes.<ref>Beyond Strictly Proper Scoring Rules: The Importance of Being Local https://doi.org/10.1175/WAF-D-19-0205.1</ref>

==== Dawid-Sebastiani score ====

The Dawid-Sebastiani score (DSS)<ref>{{cite journal |last1=Dawid |first1=A. Philip |last2=Sebastiani |first2=Paola |title=Coherent dispersion criteria for optimal experimental design |journal=The Annals of Statistics |date=1 March 1999 |volume=27 |issue=1 |doi=10.1214/AOS/1018031101}}</ref> only depends on the first and second moments, or equivalently, the mean <math>\mu_F</math> and standard deviation <math>\sigma_F</math> of the distribution <math>F</math>:

:<math>S(F, y) = \left(\frac{y-\mu_F}{\sigma_F}\right)^2 + \log\sigma_F^2</math>
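For a Gaussian forecast the Dawid-Sebastiani score is an affine function of the logarithmic score, namely <math>2L(F,y) - \ln(2\pi)</math>, which follows directly from the normal density. The sketch below computes both and checks this relation; the function names and the numbers are illustrative, and the natural logarithm is assumed.

```python
import math

def dawid_sebastiani(mu, sigma, y):
    """Dawid-Sebastiani score from the forecast mean and standard deviation."""
    return ((y - mu) / sigma) ** 2 + math.log(sigma ** 2)

def gaussian_log_score(mu, sigma, y):
    """Negatively oriented logarithmic score -ln f(y) of the forecast N(mu, sigma^2)."""
    z = (y - mu) / sigma
    return 0.5 * z * z + math.log(sigma) + 0.5 * math.log(2.0 * math.pi)

# Hypothetical forecast and observation.
mu, sigma, y = 1.0, 2.0, 0.3
dss = dawid_sebastiani(mu, sigma, y)
log_score = gaussian_log_score(mu, sigma, y)
print(dss, 2.0 * log_score - math.log(2.0 * math.pi))
```

For non-Gaussian forecasts the two scores differ, because the Dawid-Sebastiani score ignores everything about the forecast beyond its mean and variance.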

=== Multivariate continuous variables ===

The scoring rules listed below aim to evaluate probabilistic predictions when the predicted distributions are multivariate continuous probability distributions, i.e. the predicted distributions are defined over a multivariate target variable <math>X \in \mathbb{R}^n</math> and have a probability density function <math>f: \mathbb{R}^n \to \mathbb{R}_+</math>.

==== Multivariate logarithmic score ====

The multivariate logarithmic score is similar to the univariate logarithmic score:

:<math>L(D,y) = - \ln(f_D(y))</math>

where <math>f_D</math> denotes the probability density function of the predicted multivariate distribution <math>D</math>. It is a local, strictly proper scoring rule.

==== Hyvärinen scoring rule ====

The Hyvärinen scoring function (of a density <math>p</math>) is defined by<ref name=":0">{{Cite journal|last=Hyvärinen|first=Aapo|date=2005|title=Estimation of Non-Normalized Statistical Models by Score Matching|url=http://jmlr.org/papers/v6/hyvarinen05a.html|journal=Journal of Machine Learning Research|volume=6|issue=24|pages=695–709|issn=1533-7928}}</ref>

:<math>s(p) = 2 \Delta_y \log p(y) + \|\nabla_y \log p(y)\|_2^2 </math>

Where <math>\Delta</math> denotes the Hessian trace and <math>\nabla</math> denotes the gradient. This scoring rule can be used to computationally simplify parameter inference and address Bayesian model comparison with arbitrarily-vague priors.<ref name=":0" /><ref>{{Cite journal|last1=Shao|first1=Stephane|last2=Jacob|first2=Pierre E.|last3=Ding|first3=Jie|last4=Tarokh|first4=Vahid|date=2019-10-02|title=Bayesian Model Comparison with the Hyvärinen Score: Computation and Consistency|journal=Journal of the American Statistical Association|volume=114|issue=528|pages=1826–1837|doi=10.1080/01621459.2018.1518237|issn=0162-1459|arxiv=1711.00136|s2cid=52264864}}</ref> It was also used to introduce new information-theoretic quantities beyond the existing information theory.<ref>{{Cite journal|last1=Ding|first1=Jie|last2=Calderbank|first2=Robert|last3=Tarokh|first3=Vahid|date=2019|title=Gradient Information for Representation and Modeling|journal=Advances in Neural Information Processing Systems|volume=32|url=http://papers.neurips.cc/paper/8510-gradient-information-for-representation-and-modeling|pages=2396–2405}}</ref>

The Hyvärinen scoring rule is local of order 2 (meaning it locally takes into account derivatives up to second order).
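The following sketch approximates the Hyvärinen score with central finite differences and checks it against the closed form for an isotropic Gaussian. It assumes only that a (possibly unnormalized) log density can be evaluated pointwise; the helper names and the step size <code>h</code> are hypothetical choices made for this illustration.

```python
def hyvarinen_score(log_p, y, h=1e-3):
    """Approximate 2 * Laplacian(log p)(y) + ||grad log p(y)||^2.

    Uses central finite differences along each coordinate axis.
    """
    base = log_p(y)
    laplacian = 0.0
    grad_sq = 0.0
    for i in range(len(y)):
        up = list(y)
        up[i] += h
        down = list(y)
        down[i] -= h
        f_up, f_down = log_p(up), log_p(down)
        grad_sq += ((f_up - f_down) / (2.0 * h)) ** 2
        laplacian += (f_up - 2.0 * base + f_down) / (h * h)
    return 2.0 * laplacian + grad_sq

def unnormalized_log_gaussian(y, sigma=2.0):
    # Log density of N(0, sigma^2 I) without its normalizing constant.
    return -sum(v * v for v in y) / (2.0 * sigma ** 2)

def shifted(y):
    # Same density with a different (arbitrary) additive constant.
    return unnormalized_log_gaussian(y) + 123.0

y = [1.0, -0.5, 2.0]
score = hyvarinen_score(unnormalized_log_gaussian, y)
# Closed form for sigma = 2: -2n / sigma^2 + ||y||^2 / sigma^4.
closed_form = 2.0 * (-len(y) / 4.0) + sum(v * v for v in y) / 16.0
print(score, closed_form, hyvarinen_score(shifted, y))
```

Adding a constant to the log density leaves the result unchanged up to rounding, which illustrates why the score can be evaluated for non-normalized models.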

==== Energy score ==== {{See also|Energy distance#Goodness-of-fit}} The energy score is a multivariate extension of the continuous ranked probability score:<ref name="GneitingRaftery2007"/> :<math>ES_\beta(D, Y) = \mathbb{E}_{X \sim D}[\lVert X - Y \rVert_2^\beta] - \frac{1}{2} \mathbb{E}_{X,X' \sim D}[\lVert X - X' \rVert_2^\beta]</math> Here, <math>\beta \in (0, 2)</math>, <math>\lVert \cdot \rVert_2</math> denotes the Euclidean norm on <math>\mathbb{R}^n</math>, and <math>X, X'</math> are independent random variables drawn from the probability distribution <math>D</math>. The energy score is strictly proper for distributions <math>D</math> for which <math>\mathbb{E}_{X \sim D}[\lVert X \rVert_2^\beta]</math> is finite. It has been suggested that the energy score is somewhat ineffective when evaluating the intervariable dependency structure of the forecasted multivariate distribution.<ref name="s594">{{cite web | last1=Pinson | first1=Pierre | last2=Tastu | first2=Julija | title=Discrimination ability of the Energy score | publisher=Technical University of Denmark | date=2013 | url=https://orbit.dtu.dk/en/publications/discrimination-ability-of-the-energy-score | access-date=2024-05-11}}</ref> Apart from a term that depends only on the distribution of the observation, the energy score is equal to twice the energy distance between the predicted distribution and the empirical{{explain|reason=What do we mean by empirical here? Can the distribution of the observation be anything else than empirical?|date=January 2026}} distribution of the observation.
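In practice the two expectations in the energy score are often approximated from draws of the forecast distribution. The sketch below is one such Monte Carlo estimate, assuming the forecast is available only as a list of sample points; the function name, sample sizes and the shifted "bad" forecast are illustrative assumptions.

```python
import math
import random

def energy_score(samples, y, beta=1.0):
    """Monte Carlo estimate of ES_beta(D, y) from draws of the forecast D."""
    m = len(samples)
    # First term: mean distance between forecast draws and the observation.
    to_obs = sum(math.dist(x, y) ** beta for x in samples) / m
    # Second term: mean distance over all ordered pairs of forecast draws
    # (this includes the zero-distance pairs a == b, so it is slightly biased).
    between = sum(math.dist(a, b) ** beta for a in samples for b in samples) / (m * m)
    return to_obs - 0.5 * between

rng = random.Random(0)
# Forecast draws from a standard bivariate normal, and a shifted copy.
good = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
bad = [(a + 3.0, b + 3.0) for a, b in good]
# Observations drawn from the same standard bivariate normal.
observations = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]

mean_good = sum(energy_score(good, y) for y in observations) / len(observations)
mean_bad = sum(energy_score(bad, y) for y in observations) / len(observations)
print(mean_good, mean_bad)
```

Averaged over the observations, the forecast that matches the data-generating distribution receives the lower (better) score.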

==== Variogram score ==== The variogram score of order <math>p</math> is given by:<ref name="c352">{{cite journal | last1=Scheuerer | first1=Michael | last2=Hamill | first2=Thomas M. | title=Variogram-Based Proper Scoring Rules for Probabilistic Forecasts of Multivariate Quantities | journal=Monthly Weather Review | publisher=American Meteorological Society | volume=143 | issue=4 | date=2015-03-31 | issn=0027-0644 | doi=10.1175/mwr-d-14-00269.1 | pages=1321–1334}}</ref> :<math>VS_p(D, Y) = \sum_{i,j=1}^n w_{ij} (|Y_i - Y_j|^p - \mathbb{E}_{X \sim D}[|X_i - X_j|^p])^2</math> Here, <math>w_{ij}</math> are non-negative weights, often set to 1, and <math>p > 0</math> can be chosen freely, with <math>p = 0.5</math>, <math>1</math> or <math>2</math> being common choices. <math>X_i</math> denotes the <math>i</math>-th component of <math>X</math>. The variogram score is proper for distributions for which the <math>(2p)</math>-th moment is finite for all components, but it is never strictly proper, since it depends on the forecast only through the expectations <math>\mathbb{E}_{X \sim D}[|X_i - X_j|^p]</math>. Compared to the energy score, the variogram score is reported to be more discriminative with respect to the predicted correlation structure.<ref name="c352"/>
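A sample-based sketch of the variogram score is given below. It assumes the forecast is represented by a list of draws and replaces the expectation by a sample mean; the function name and the tiny example data are hypothetical.

```python
def variogram_score(samples, y, p=0.5, weights=None):
    """Sample-based variogram score of order p (all weights 1 if none are given)."""
    n = len(y)
    m = len(samples)
    total = 0.0
    for i in range(n):
        for j in range(n):
            w = 1.0 if weights is None else weights[i][j]
            # Sample mean standing in for E|X_i - X_j|^p under the forecast.
            expected = sum(abs(x[i] - x[j]) ** p for x in samples) / m
            total += w * (abs(y[i] - y[j]) ** p - expected) ** 2
    return total

samples = [(0.0, 1.0), (0.0, 3.0)]
# The same draws with every component shifted by 10.
shifted_samples = [(a + 10.0, b + 10.0) for a, b in samples]

print(variogram_score(samples, (0.0, 2.0), p=1.0))
print(variogram_score(samples, (0.0, 0.0), p=1.0))
print(variogram_score(shifted_samples, (0.0, 2.0), p=1.0))
```

The shifted forecast receives exactly the same score as the original one. A forecast with a badly wrong mean can therefore score as well as the correct one, which is a concrete instance of the score not being strictly proper.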

==== Conditional continuous ranked probability score ==== The conditional continuous ranked probability score (Conditional CRPS or CCRPS) is a family of (strictly) proper scoring rules. Conditional CRPS evaluates a forecasted multivariate distribution <math>D</math> by evaluating the CRPS over a prescribed set of univariate conditional probability distributions of the predicted multivariate distribution:<ref name="h713">{{cite book | last=Roordink | first=Daan | last2=Hess | first2=Sibylle | title=Machine Learning and Knowledge Discovery in Databases: Research Track | chapter=Scoring Rule Nets: Beyond Mean Target Prediction in Multivariate Regression | publisher=Springer Nature Switzerland | publication-place=Cham | volume=14170 | date=2023 | isbn=978-3-031-43414-3 | doi=10.1007/978-3-031-43415-0_12 | pages=190–205}}</ref> :<math>CCRPS_{\mathcal{T}}(D,Y) = \sum_{i=1}^k CRPS(P_{X \sim D}(X_{v_i} | X_j = Y_j \text{ for } j \in \mathcal{C}_i), Y_{v_i})</math> Here, <math>X_i</math> is the <math>i</math>-th marginal variable of <math>X \sim D</math>, <math>\mathcal{T} = (v_i, \mathcal{C}_i)_{i=1}^k</math> is a set of tuples that defines a conditional specification (with <math>v_i \in \{1, \ldots, n\}</math> and <math>\mathcal{C}_i \subseteq \{1, \ldots, n\} \setminus \{v_i\}</math>), and <math>P_{X \sim D}(X_{v_i} | X_j = Y_j \text{ for } j \in \mathcal{C}_i)</math> denotes the conditional probability distribution for <math>X_{v_i}</math> given that all variables <math>X_j</math> for <math>j \in \mathcal{C}_i</math> are equal to their respective observations. In the case that <math>P_{X \sim D}(X_{v_i} | X_j = Y_j \text{ for } j \in \mathcal{C}_i)</math> is ill-defined (i.e. its conditional event has zero likelihood), CRPS scores over this distribution are defined as infinite. 
Conditional CRPS is strictly proper for distributions with finite first moment, if the chain rule is included in the conditional specification, meaning that there exists a permutation <math>\phi_1, \ldots, \phi_n</math> of <math>1, \ldots, n</math> such that for all <math>1 \leq i \leq n</math>: <math>(\phi_i, \{\phi_1, \ldots, \phi_{i-1}\}) \in \mathcal{T}</math>.
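As an illustrative sketch, Conditional CRPS can be written out for a bivariate Gaussian forecast with the chain-rule specification <math>\mathcal{T} = ((1, \emptyset), (2, \{1\}))</math>. The example assumes the well-known closed form of the CRPS for a univariate Gaussian and the standard formula for the conditional distribution of a bivariate Gaussian; the function names are hypothetical and indices in the code are zero-based.

```python
import math

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of the univariate Gaussian N(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def ccrps_chain_bivariate(mu, sigma, rho, y):
    """CCRPS for a bivariate Gaussian with specification ((1, {}), (2, {1}))."""
    # Term for (v=1, C={}): CRPS of the marginal distribution of X_1.
    marginal = crps_gaussian(mu[0], sigma[0], y[0])
    # Term for (v=2, C={1}): X_2 given X_1 = y_1 is again Gaussian.
    cond_mu = mu[1] + rho * sigma[1] / sigma[0] * (y[0] - mu[0])
    cond_sigma = sigma[1] * math.sqrt(1.0 - rho * rho)
    conditional = crps_gaussian(cond_mu, cond_sigma, y[1])
    return marginal + conditional

y = (1.0, 0.9)
correlated = ccrps_chain_bivariate((0.0, 0.0), (1.0, 1.0), 0.9, y)
independent = ccrps_chain_bivariate((0.0, 0.0), (1.0, 1.0), 0.0, y)
print(correlated, independent)
```

Both forecasts have identical marginals, so the difference in score comes entirely from the conditional term, which rewards the forecast whose dependence structure is compatible with the observation.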

=== Interpretation of proper scoring rules === All proper scoring rules are equal to weighted sums (integral with a non-negative weighting functional) of the losses in a set of simple two-alternative decision problems that ''use'' the probabilistic prediction, each such decision problem having a particular combination of associated cost parameters for false positive and false negative decisions. A ''strictly'' proper scoring rule corresponds to having a nonzero weighting for all possible decision thresholds. Any given proper scoring rule is equal to the expected losses with respect to a particular probability distribution over the decision thresholds; thus the choice of a scoring rule corresponds to an assumption about the probability distribution of decision problems for which the predicted probabilities will ultimately be employed, with for example the quadratic loss (or Brier) scoring rule corresponding to a uniform probability of the decision threshold being anywhere between zero and one. The classification accuracy score (percent classified correctly), a single-threshold scoring rule which is zero or one depending on whether the predicted probability is on the appropriate side of 0.5, is a proper scoring rule but not a strictly proper scoring rule because it is optimized (in expectation) not only by predicting the true probability but by predicting ''any'' probability on the same side of 0.5 as the true probability.<ref>Leonard J. Savage. Elicitation of personal probabilities and expectations. J. of the American Stat. Assoc., 66(336):783–801, 1971.</ref><ref>Schervish, Mark J. (1989). "A General Method for Comparing Probability Assessors", ''Annals of Statistics'' '''17'''(4) 1856–1879, https://projecteuclid.org/euclid.aos/1176347398</ref><ref>{{cite conference | title =How good were those probability predictions? The expected recommendation loss (ERL) scoring rule | first =David B. | last =Rosen | year =1996 | conference = | editor =Heidbreder, G. 
| book-title =Maximum Entropy and Bayesian Methods (Proceedings of the Thirteenth International Workshop, August 1993) | publisher =Kluwer, Dordrecht, The Netherlands | url =https://doi.org/10.1007/978-94-015-8729-7_33 }}</ref><ref>Roulston, M. S., & Smith, L. A. (2002). Evaluating probabilistic forecasts using information theory. Monthly Weather Review, 130, 1653–1660. See APPENDIX "Skill Scores and Cost–Loss". [https://journals.ametsoc.org/downloadpdf/journals/mwre/130/6/1520-0493_2002_130_1653_epfuit_2.0.co_2.xml]</ref><ref>"Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications", Andreas Buja, Werner Stuetzle, Yi Shen (2005) http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.184.5203</ref><ref>Hernandez-Orallo, Jose; Flach, Peter; and Ferri, Cesar (2012). "A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss." ''Journal of Machine Learning Research'' '''13''' 2813–2869. http://www.jmlr.org/papers/volume13/hernandez-orallo12a/hernandez-orallo12a.pdf</ref>
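The correspondence between the Brier score and a uniform distribution over decision thresholds can be checked numerically. The sketch below makes one normalization assumption that is not spelled out in the text: at threshold <math>c</math>, a false positive costs <math>c</math> and a false negative costs <math>1 - c</math>, so that acting exactly when the predicted probability exceeds <math>c</math> is the optimal decision. All function names are hypothetical.

```python
def cost_weighted_loss(p, y, c):
    """Loss of the two-alternative decision 'act if and only if p > c'."""
    act = p > c
    if y == 1 and not act:
        return 1.0 - c  # false negative
    if y == 0 and act:
        return c  # false positive
    return 0.0

def threshold_averaged_loss(p, y, steps=20000):
    """Average the loss over thresholds c uniform on (0, 1) (midpoint rule)."""
    width = 1.0 / steps
    return sum(cost_weighted_loss(p, y, (k + 0.5) * width) for k in range(steps)) * width

for p, y in [(0.3, 1), (0.3, 0), (0.8, 1), (0.8, 0)]:
    brier = (p - y) ** 2
    print(p, y, brier, 2.0 * threshold_averaged_loss(p, y))
```

Up to the factor 2 and the discretization error of the numerical integral, the threshold-averaged loss reproduces the quadratic score.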

== Examples of consistent scoring functions == There are infinitely many scoring functions, including entire parameterized families of strictly consistent scoring functions for certain functionals <math>T</math>. The ones shown below are a selection of well-known examples.

=== Expectation === The following scoring functions are strictly consistent for the expected value, i.e. <math>T(F) = E_{Y\sim F}[Y]</math>.

==== Squared error ==== :<math>S(x, y) = (x - y)^2</math>
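A short derivation, assuming <math>Y \sim F</math> has a finite second moment, shows why the squared error is strictly consistent for the expectation. Expanding the expected score around <math>\mathbb{E}_{Y\sim F}[Y]</math> gives
:<math>\mathbb{E}_{Y\sim F}[(x - Y)^2] = (x - \mathbb{E}_{Y\sim F}[Y])^2 + \operatorname{Var}_{Y\sim F}(Y)</math>
The variance term does not depend on the prediction <math>x</math>, so the expected score is minimized only at <math>x = \mathbb{E}_{Y\sim F}[Y]</math>.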

=== Quantiles === The following scoring functions are strictly consistent for the <math>\alpha</math>-quantile, i.e. <math>T(F)=q_\alpha</math> is the set of values <math>q</math> satisfying <math>\lim_{y\uparrow q} F(y) \leq \alpha \leq F(q)</math>.

==== Quantile loss / pinball loss ==== :<math>S(x, y) = (\mathbf{1}\{x \geq y\} - \alpha) (x - y)</math>
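The consistency of the pinball loss for a quantile can be illustrated with a small grid search. The sketch below uses the integers 1 to 100 as a stand-in for the distribution of <math>y</math> and a coarse grid of candidate predictions; both are arbitrary choices for this example.

```python
def pinball_loss(x, y, alpha):
    """Pinball loss of the point prediction x for the observation y."""
    indicator = 1.0 if x >= y else 0.0
    return (indicator - alpha) * (x - y)

observations = list(range(1, 101))  # empirical distribution on 1, ..., 100
candidates = [k / 2.0 for k in range(0, 221)]  # 0.0, 0.5, ..., 110.0

def mean_loss(x, alpha):
    return sum(pinball_loss(x, y, alpha) for y in observations) / len(observations)

# Candidate prediction with the smallest average pinball loss for alpha = 0.9.
best = min(candidates, key=lambda x: mean_loss(x, 0.9))
print(best)
```

For this empirical distribution every value in the interval from 90 to 91 is a 0.9-quantile in the sense of the definition above, and the minimizer found by the grid search falls in that interval.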

=== Intervals === The point prediction consists of a central <math>(1-\alpha)</math> prediction interval, <math>x=(l, u)</math>, where the lower endpoint <math>l</math> predicts the <math>\tfrac{\alpha}{2}</math>-quantile and the upper endpoint <math>u</math> predicts the <math>(1-\tfrac{\alpha}{2})</math>-quantile.

==== Interval score ==== The interval score is a combination of the two pinball losses for the corresponding quantiles, scaled by <math>\tfrac{2}{\alpha}</math>. :<math> S_\alpha(l,u;y) = (u - l) + \frac{2}{\alpha}(l - y)\,\mathbf{1}\{y < l\} + \frac{2}{\alpha}(y - u)\,\mathbf{1}\{y > u\} </math>

"The forecaster is rewarded for narrow prediction intervals, and he or she incurs a penalty, the size of which depends on α, if the observation misses the interval"<ref name="GneitingRaftery2007"/>
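The interval score can be sketched in a few lines of Python. The example also checks numerically that it equals the sum of the pinball losses of the two endpoints, at levels <math>\tfrac{\alpha}{2}</math> and <math>1-\tfrac{\alpha}{2}</math>, multiplied by <math>\tfrac{2}{\alpha}</math>. The function names and the example interval are illustrative assumptions.

```python
def pinball_loss(x, y, alpha):
    indicator = 1.0 if x >= y else 0.0
    return (indicator - alpha) * (x - y)

def interval_score(lower, upper, y, alpha):
    """Interval score of the central (1 - alpha) prediction interval [lower, upper]."""
    score = upper - lower  # width of the interval
    if y < lower:
        score += (2.0 / alpha) * (lower - y)  # observation below the interval
    elif y > upper:
        score += (2.0 / alpha) * (y - upper)  # observation above the interval
    return score

def interval_score_from_pinball(lower, upper, y, alpha):
    """The same score, assembled from the two endpoint pinball losses."""
    total = pinball_loss(lower, y, alpha / 2.0) + pinball_loss(upper, y, 1.0 - alpha / 2.0)
    return (2.0 / alpha) * total

for y in (-3.0, 5.0, 12.0):
    print(y, interval_score(0.0, 10.0, y, 0.1), interval_score_from_pinball(0.0, 10.0, y, 0.1))
```

An observation inside the interval is charged only the interval width, while a miss adds a penalty proportional to the distance from the nearest endpoint.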

== Applications == thumb|upright=1.25|The logarithmic rule

=== Meteorological weather forecasts === An example of probabilistic forecasting is in meteorology, where a weather forecaster may give the probability of rain on the next day. One could note the number of times that a 25% probability was quoted over a long period and compare this with the actual proportion of times that rain fell. If the actual proportion differs substantially from the stated probability, the forecaster is said to be poorly calibrated. A poorly calibrated forecaster might be encouraged to do better by a bonus system. A bonus system designed around a proper scoring rule will incentivize the forecaster to report probabilities equal to their personal beliefs.<ref name="Bickel"> {{Cite journal | last = Bickel | first =E.J. | year = 2007 | title = Some Comparisons among Quadratic, Spherical, and Logarithmic Scoring Rules | journal = Decision Analysis | volume = 4 | issue = 2 | pages = 49–65 | id = | url = http://faculty.engr.utexas.edu/bickel/Papers/QSL_Comparison.pdf | doi= 10.1287/deca.1070.0089 }}</ref>

In addition to the simple case of a binary decision, such as assigning probabilities to 'rain' or 'no rain', scoring rules may be used for multiple classes, such as 'rain', 'snow', or 'clear', or continuous responses like the amount of rain per day.

The image shows an example of a scoring rule, the logarithmic scoring rule, as a function of the probability reported for the event that actually occurred. One way to use this rule is as a cost: the forecaster or algorithm assigns probabilities to the possible events, and once the outcome is known, the cost is computed from the probability that was assigned to the event that actually occurred.
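The incentive property of the logarithmic rule can be illustrated for a binary event such as rain. The sketch below uses the reward orientation (higher is better; negating it gives a cost) and assumes a forecaster whose true belief is 25%. The function names and the grid of candidate reports are hypothetical.

```python
import math

def log_score(p_event, occurred):
    """Reward-oriented logarithmic score: ln of the probability given to what happened."""
    return math.log(p_event if occurred else 1.0 - p_event)

def expected_log_score(report, belief):
    """Expected score of a report when the event occurs with probability `belief`."""
    return belief * log_score(report, True) + (1.0 - belief) * log_score(report, False)

belief = 0.25
reports = [k / 100.0 for k in range(1, 100)]  # 0.01, 0.02, ..., 0.99
# Report that maximizes the forecaster's expected score under their own belief.
best_report = max(reports, key=lambda r: expected_log_score(r, belief))
print(best_report)
```

Among the candidate reports, the expected score is highest when the reported probability equals the forecaster's belief, which is what makes the rule proper.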

Beyond serving as evaluation metrics, scoring rules can be used directly as loss functions for constructing estimators.<ref name="GneitingRaftery2007"/>

== See also ==
* Coherence
* Decision rule

== Literature ==
* Gneiting, Tilmann; Raftery, Adrian E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation". ''Journal of the American Statistical Association'' '''102''' (447): 359–378. https://doi.org/10.1198/016214506000001437, [https://stat.uw.edu/research/tech-reports/strictly-proper-scoring-rules-prediction-and-estimation#:~:text=A%20scoring%20rule%20is%20strictly%20proper%20if%20the,to%20make%20careful%20assessments%20and%20to%20be%20honest pdf]
* Matheson, James E.; Winkler, Robert L. (1976). "Scoring Rules for Continuous Probability Distributions". ''Management Science'' '''22''' (10): 1087–1096. http://dx.doi.org/10.1287/mnsc.22.10.1087

== References == {{Reflist}}

== External links ==
* [http://www.decisionsciencenews.com/?p=963 Video comparing spherical, quadratic and logarithmic scoring rules]
* [https://www.stat.washington.edu/research/reports/2009/tr551.pdf Local Proper Scoring Rules]
* [http://faculty.engr.utexas.edu/bickel/Papers/Scoring_Rules_Education.pdf Scoring Rules and Decision Analysis Education]
* [http://www.stat.washington.edu/research/reports/2004/tr463.pdf Strictly Proper Scoring Rules]
* [https://www.jstor.org/discover/10.2307/1402448?uid=16779064&uid=3737864&uid=2129&uid=2&uid=70&uid=16734048&uid=3&uid=67&uid=62&sid=21101527707467 Scoring Rules and uncertainty]
* [https://www.fharrell.com/post/class-damage/ Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules]
* [http://cran.nexr.com/web/packages/scoringRules/vignettes/crpsformulas.html Closed-form expressions of the continuous ranked probability score]
{{Decision theory}}
Category:Decision theory
Category:Probability assessment