SCORING RULES

ROBERT L. WINKLER
Fuqua School of Business, Duke University, Durham, North Carolina

VICTOR RICHMOND R. JOSE
McDonough School of Business, Georgetown University, Washington, D.C.

INTRODUCTION

Uncertainty is a pervasive feature of our world, and fields such as decision analysis and statistics provide methods to help us make decisions, forecasts, and inferences in the face of uncertainty. Our everyday language includes many terms that relate to the degree of uncertainty in a situation: for example, rain is unlikely today, the chances are good that a surgical procedure will be successful, the prospects for an improved economic situation are not favorable, and so on. As the mathematical language of uncertainty, probability theory provides a structure to quantify uncertainty. Probabilities are encountered in the media (e.g., the probability of rain this afternoon is 20%) and widely used in modeling. Although probability forecasts are formulated and used extensively, very often they are never evaluated after the event or variable of interest is observed. Scoring rules provide such evaluations by giving a numerical score based on the probabilities and on the actual observation. For example, a probability of rain of 40% in a simple two-state setting of rain versus no rain will receive a higher score than a probability of 20% if it rains, and a lower score if it does not rain. In this manner, we can use scoring rules to compare the sources of the probabilities, which might be experts, models, or simply past data.

The first scoring rule used on a regular basis was a quadratic rule developed by Brier [1] to evaluate probabilistic weather forecasts. Indeed, weather forecasting is the area in which scoring rules have been used most extensively. The presence of such an ex post evaluation using suitably designed scoring rules also provides ex ante incentives for careful formulation of probability forecasts. Much of the early development of scoring rules emphasized this ex ante role of scoring rules (e.g., [2-6]). Attention was focused on strictly proper scoring rules, for which a forecaster can maximize his or her expected score only by honestly reporting the probabilities and also has the incentive to obtain further information to increase the accuracy of the probabilities. This ex ante motivation yields rules that reward probabilities that have good characteristics ex post, as we shall see. For a general discussion of scoring rules and reviews of the scoring rule literature, see Winkler [7] and Gneiting and Raftery [8].

We discuss some basic properties of scoring rules in the second section, focusing on the aspects related to ex ante incentives, and present some commonly encountered rules. In the next section, we turn to ex post evaluation, showing how some notions involving strictly proper scoring rules relate to ex post evaluation. The next two sections involve scoring rules with special characteristics, namely those that provide evaluations of probabilities relative to baseline distributions and those that take into account any ordering of the events of interest. A brief summary and discussion, including some connections with other fields, is presented in the final section.

STRICTLY PROPER SCORING RULES

We begin by considering the simplest possible situation, that of a single event A and its complement. Suppose that an expert is assessing a probability for A and is being evaluated with a scoring rule S.
If a probability r is reported for A, then the score will be S(r, e), where e = 1 if A occurs and e = 0 if A does not occur. Furthermore, assume that the expert's best judgment about the probability of A is denoted by p. Then the expected score is

S(r, p) = pS(r, 1) + (1 − p)S(r, 0).

The scoring rule S is said to be strictly proper if

S(p, p) > S(r, p) for any r ≠ p. (1)

To maximize the expected score with a strictly proper rule, the expert should set r = p, thereby reporting the probability honestly. The scoring rules discussed here are oriented such that a higher score is better. Some rules in the literature, such as the Brier score [1], are oriented with a negative score being better, in which case the expert should set r = p to minimize the expected score. Rules such as the Brier score can be converted to a positive orientation by changing the sign of the score, so the focus on scores with a positive orientation here is not restrictive.

The expected score S(p, p) for honest reporting from a strictly proper scoring rule is strictly convex, and conversely, a strictly proper scoring rule can be generated from any strictly convex function of p that is taken as the expected score function S(p, p) for honest reporting [6]. Thus, there are an infinite number of rules satisfying Equation (1). Three commonly used rules are as follows:

Quadratic: S(r, e) = 1 − 2(e − r)^2, (2)
Logarithmic: S(r, e) = log[re + (1 − r)(1 − e)], (3)
Spherical: S(r, e) = [re + (1 − r)(1 − e)] / [r^2 + (1 − r)^2]^{1/2}. (4)

These and any other strictly proper rules can be scaled as desired (e.g., to avoid negative scores), because any positive affine transformation of a strictly proper rule is itself strictly proper.

Figure 1 shows S(r, 1), S(r, 0), and S(p, p) for the quadratic scoring rule. Note that for this simple two-event setting, S(r, 1) and S(r, 0) are mirror images of each other, with S(r, 1) increasing in r and S(r, 0) decreasing in r.

[Figure 1. (a) Score functions S(r, 1) and S(r, 0) and (b) expected score S(p, p) under honest reporting for the quadratic scoring rule.]
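As a concrete illustration (not part of the original article), here is a minimal Python sketch of Equations (2)-(4), using NumPy, together with a numerical check of strict propriety: for a judged probability p, the expected score of each rule is maximized by reporting r = p.

```python
import numpy as np

def quadratic(r, e):
    """Quadratic score, Eq. (2): 1 - 2(e - r)^2."""
    return 1 - 2 * (e - r) ** 2

def logarithmic(r, e):
    """Logarithmic score, Eq. (3): log of the probability assigned to the outcome."""
    return np.log(r * e + (1 - r) * (1 - e))

def spherical(r, e):
    """Spherical score, Eq. (4)."""
    return (r * e + (1 - r) * (1 - e)) / np.sqrt(r ** 2 + (1 - r) ** 2)

def expected_score(score, r, p):
    """Expected score S(r, p) when the expert judges the probability of A to be p."""
    return p * score(r, 1) + (1 - p) * score(r, 0)

# Strict propriety: for each rule, the expected score is maximized at r = p.
p = 0.7
grid = np.linspace(0.01, 0.99, 99)
for rule in (quadratic, logarithmic, spherical):
    best_r = grid[np.argmax([expected_score(rule, r, p) for r in grid])]
    print(rule.__name__, round(best_r, 2))   # each rule prints 0.7
```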

With the general concept of a strictly proper scoring rule established for the case of a single event (and its complement), we next generalize to the case of a set of mutually exclusive and exhaustive events {A_1, ..., A_k}, for which the expert's probabilities are given by the vector p = (p_1, ..., p_k) and the reported probabilities are r = (r_1, ..., r_k). With a scoring rule S, the expert's score is S(r, e_i) if A_i occurs, where e_i is a vector with the ith element equal to 1 and the other elements all equal to 0. The expected score from the perspective of the expert is S(r, p) = Σ_{i=1}^{k} p_i S(r, e_i), and the scoring rule is strictly proper if S(p, p) > S(r, p) for any r ≠ p. Quadratic, logarithmic, and spherical rules for this case are

S(r, e_i) = 2r_i − Σ_{j=1}^{k} r_j^2, (5)
S(r, e_i) = log r_i, (6)
and S(r, e_i) = r_i / (Σ_{j=1}^{k} r_j^2)^{1/2}, (7)

respectively. Note that this setup could be used when we are considering a discrete distribution of a random variable, which could include a discretization of a continuous random variable into a set of intervals.

Finally, we present scoring rules for probability distributions of a continuous random variable x. Let p denote the expert's probability density function for x, and let r denote the corresponding reported density function. Then, for a scoring rule that gives a score S(r, x) when the value x is observed, the expert's expected score is S(r, p) = ∫ S(r, x)p(x)dx. Quadratic, logarithmic, and spherical scoring rules in the continuous case are

S(r, x) = 2r(x) − ∫ r^2(x)dx, (8)
S(r, x) = log r(x), (9)
and S(r, x) = r(x) / (∫ r^2(x)dx)^{1/2}. (10)

Our focus has been on strictly proper scoring rules, which have been developed with the goal of providing the expert with an incentive to report honestly. But if the expert is not well informed with respect to the situation, reporting probabilities honestly may not mean reporting good probabilities. Not all probability forecasts are necessarily good forecasts. Fortunately, assuming honest forecasting, strictly proper scoring rules will reward forecasts by providing higher expected scores to forecasts for which p is closer to 0 or 1.

To see how this works for the quadratic, logarithmic, and spherical rules that have been presented here, we note that they share an important characteristic: they are symmetric. In the single-event case, that means that the expected score for an honestly reported probability of r is the same as the expected score for an honestly reported probability of 1 − r. As noted earlier, reporting r = p under a strictly proper scoring rule results in an expected score function S(p, p) that is strictly convex in p. This convexity, combined with the symmetry, means that S(p, p) is minimized at p = 0.5 and increases as p → 0 or p → 1 from p = 0.5. These features of S(p, p) are illustrated for the quadratic rule in Fig. 1. That means that under honest reporting, the expected score is higher for probability forecasts that are sharper, where sharpness refers to the degree that p is closer to 0 or 1. For example, a probability of 0 or 1 is perfectly sharp, whereas a probability of 0.5 admits a lot of uncertainty about the outcome.

To illustrate the incentives from strictly proper scoring rules for both honesty and sharpness, consider a decomposition of the expected score for the quadratic rule in the case of a probability for a single event. The expected score is S(r, p) = p[1 − 2(1 − r)^2] + (1 − p)(1 − 2r^2). Expanding, adding and subtracting p^2, and rearranging yields

S(r, p) = 1 − 2(p − r)^2 − 2p(1 − p). (11)

The second term on the right-hand side of Equation (11) can be viewed as a penalty (because of the negative sign) for not setting r = p, and it thus provides an incentive for honesty. The last term is a penalty for lack of sharpness, because p(1 − p) is maximized at p = 0.5 and decreases as p → 0 or p → 1. The best possible expected score is one, and dishonesty (r ≠ p) or lack of perfect sharpness (0 < p < 1) will reduce the expected score. Other strictly proper rules (e.g., the logarithmic and spherical rules) can be decomposed in a similar manner.
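A quick numerical check of Equation (11) (an illustrative Python sketch, not from the article): the expected quadratic score computed directly agrees with the honesty-plus-sharpness decomposition, and a sharper honest judgment earns a higher expected score.

```python
def expected_quadratic(r, p):
    """Expected quadratic score p[1 - 2(1 - r)^2] + (1 - p)(1 - 2r^2)."""
    return p * (1 - 2 * (1 - r) ** 2) + (1 - p) * (1 - 2 * r ** 2)

def decomposed(r, p):
    """Eq. (11): 1 minus the honesty penalty 2(p - r)^2 minus the sharpness penalty 2p(1 - p)."""
    return 1 - 2 * (p - r) ** 2 - 2 * p * (1 - p)

print(expected_quadratic(0.7, 0.7), decomposed(0.7, 0.7))  # honest report: 0.58
print(expected_quadratic(0.9, 0.7), decomposed(0.9, 0.7))  # dishonest report: drops to 0.50
print(expected_quadratic(0.9, 0.9), decomposed(0.9, 0.9))  # sharper honest judgment: 0.82
```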
Keep in mind that to maximize expected score, the expert has to report honestly, so attempting to have probabilities look sharp artificially (i.e., sharp reported probabilities that are not consistent with the expert's judgments) will decrease the expected score, not increase it. Note from Equation (11) that the sharpness term relates to the sharpness of p, not the sharpness of r.

The primary aspect of strictly proper scoring rules is to encourage honesty, and the nature of strictly proper scoring rules is such that honest reporting by experts who have sharper probabilities will yield higher expected scores than honest reporting by experts who have probabilities that are not so sharp. In the final analysis, then, strictly proper scoring rules reward both honesty and sharpness.

Strictly proper scoring rules differ in some characteristics. We will present different types of strictly proper rules in later sections and discuss some of their characteristics. As we shall see, not all strictly proper rules are symmetric in the sense discussed above, and in some cases it may be desirable to use a rule that is not symmetric. Another characteristic of note is that the logarithmic rule is the only rule for which the score depends only on the probability or density that has been assigned to the event or value of the variable that actually occurs. It does not depend on the probability or density assigned to other events or values. For example, if we consider scoring rules for k mutually exclusive and exhaustive events and event A_i occurs, the logarithmic score, log r_i, depends only on r_i, whereas the quadratic score, 2r_i − Σ_{j=1}^{k} r_j^2, depends on all of the probabilities r_1, ..., r_k. This property, unique to the logarithmic rule when k > 2, is called locality, and it is consistent in spirit with the likelihood principle that plays a major role in statistics.

Although the quadratic, logarithmic, and spherical rules given above are the usual suspects when we think about scoring rules, they are special cases of two rich families of strictly proper scoring rules [9]. When probabilities for a set of k mutually exclusive and exhaustive events are reported, scores for the pseudospherical and power families are given by

S^S_β(r, e_i) = (1/(β − 1)) [ ( r_i / E_r(r^{β−1})^{1/β} )^{β−1} − 1 ] (12)

and

S^P_β(r, e_i) = (r_i^{β−1} − 1)/(β − 1) − (E_r(r^{β−1}) − 1)/β, (13)

respectively, where E_r(r^{β−1}) = Σ_{i=1}^{k} r_i r_i^{β−1} and −∞ < β < ∞. When β = 2, Equations (12) and (13) yield the spherical and quadratic rules, respectively. When β → 1, both families converge to the logarithmic rule.
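The following sketch (Python with NumPy; illustrative only) implements Equations (12) and (13). With β = 2 the scores reproduce the spherical and quadratic rules up to a positive affine transformation, and with β near 1 the power score approaches the logarithmic score log r_i.

```python
import numpy as np

def pseudospherical(r, i, beta):
    """Eq. (12); beta = 2 matches the spherical rule up to a positive affine transformation."""
    r = np.asarray(r, dtype=float)
    e_r = np.sum(r ** beta)                     # E_r(r^(beta-1)) = sum_j r_j^beta
    return ((r[i] / e_r ** (1 / beta)) ** (beta - 1) - 1) / (beta - 1)

def power(r, i, beta):
    """Eq. (13); beta = 2 matches the quadratic rule up to a positive affine transformation."""
    r = np.asarray(r, dtype=float)
    return (r[i] ** (beta - 1) - 1) / (beta - 1) - (np.sum(r ** beta) - 1) / beta

r, i = [0.3, 0.4, 0.2, 0.1], 1                  # event A_2 (0-indexed 1) occurs
print(power(r, i, beta=2))                      # -0.25
print(pseudospherical(r, i, beta=2))            # about -0.27
print(power(r, i, beta=1.0001), np.log(0.4))    # both about -0.916: the logarithmic limit
```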
In summary, the most important characteristic of scoring rules in an ex ante sense related to probability assessment is that they should be strictly proper, and there are many such rules from which to choose. If a rule is indeed strictly proper, then it should provide incentives for an expert to honestly report probabilities and to invest effort in an attempt to make those probabilities sharper. Such effort might be directed toward such things as gathering more data, using more powerful methods to analyze the data, and learning more about the processes affecting the events in question or about forecasts provided by others.

EX-POST EVALUATION WITH STRICTLY PROPER SCORING RULES

We shift now from the ex ante perspective of the previous section to consider the ex post evaluation of probability forecasts. For a given situation, we assume that an expert has probabilities that are based on the information available and consistent with the expert's best judgment. The ex ante viewpoint involves rules that provide incentives for the expert to attempt to come up with good probabilities and to report those probabilities honestly. Many of the characteristics discussed in the preceding section have counterparts when we consider their use for evaluation purposes.

As is the case with statistical analysis of data in general, a single observation is not very informative. To have a reliable evaluation (when comparing experts or models in terms of their probabilities, for example), we would like to have a large number of observations. Thus, instead of considering a given situation with a single probability or set of probabilities, we have a set of data consisting of many situations with different probabilities.

Suppose that we have a sample of probability forecasts and the corresponding observations for the occurrence of an event. For example, this might consist of probabilities of default for loans (generated by a model or assessed by a bank officer) or probabilities of rain for different days at a given location. First, we can look at all of the occasions in the data set for which a particular value of the reported probability r (say, 0.30) was used, and determine the relative frequency of occurrence of the event of interest on those occasions. Denote this relative frequency by f_r. With the quadratic scoring rule, the average score on all of the occasions with this value of r is S(r, f_r) = f_r[1 − 2(1 − r)^2] + (1 − f_r)(1 − 2r^2). Ex ante, the expected score from the perspective of the expert is a function of r and p, with p being known only to the expert. Ex post, the average score is a function of r and f_r, where we are able to observe f_r:

S(r, f_r) = 1 − 2(f_r − r)^2 − 2f_r(1 − f_r). (14)

Note that this is simply Equation (11) with f_r used in place of p. The second term on the right-hand side of Equation (14) is a measure of calibration, which involves the correspondence between the reported probability and the relative frequency of occurrence of the event when that probability is used. If f_r = r, then the reported probabilities of r are perfectly calibrated. The more the relative frequency deviates from r, the worse the calibration is. The last term on the right-hand side of Equation (14) is a measure of sharpness, which is better as f_r → 0 or f_r → 1. Poorer calibration and less sharpness lead to lower average scores.

In data with reported probabilities and outcomes, different probabilities will be used on different occasions. The overall average score S̄ for an expert is found by aggregating the average scores for the different values of r. If we let n_r represent the number of times a reported probability of r is used in the data set and let n = Σ_r n_r represent the overall sample size, the overall average score can be expressed as

S̄ = Σ_r (n_r/n) S(r, f_r) = 1 − 2 Σ_r (n_r/n)(f_r − r)^2 − 2 Σ_r (n_r/n) f_r(1 − f_r). (15)

The first summation on the right-hand side of Equation (15) is an overall measure of calibration for the data set, and the second summation is an overall measure of sharpness. This decomposition into calibration and sharpness components can be generalized beyond the quadratic rule to any strictly proper scoring rule [10].

One convenient way to think about calibration and sharpness is to think of the probability assessment process for a single event as a two-step process. First, the expert puts forecast situations in equivalence classes, or boxes, such that the expert feels that the events in a given box have roughly the same probability of occurrence. Second, the expert assigns numbers (probability values) to the boxes. Calibration is then an evaluation of how well the expert assigns the numbers. Sharpness, on the other hand, is unrelated to the probability values. Instead, it measures how effective the expert is in creating boxes for which the relative frequency of occurrence of the events is close to 0 or 1. This is unlikely to be the way an expert really thinks about the forecasting process, but it is a convenient way to emphasize key differences between calibration and sharpness.
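The decomposition in Equation (15) is easy to compute from a sample of forecasts and outcomes. The following Python sketch (illustrative only, with made-up data) returns the overall average quadratic score along with its calibration and sharpness summations.

```python
import numpy as np

def decompose_quadratic(reported, outcomes):
    """Average quadratic score with its calibration and sharpness terms, Eq. (15)."""
    reported = np.asarray(reported, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(reported)
    calibration, sharpness = 0.0, 0.0
    for r in np.unique(reported):
        mask = reported == r
        n_r = mask.sum()
        f_r = outcomes[mask].mean()                    # relative frequency for this r
        calibration += (n_r / n) * (f_r - r) ** 2      # miscalibration penalty
        sharpness += (n_r / n) * f_r * (1 - f_r)       # lack-of-sharpness penalty
    return 1 - 2 * calibration - 2 * sharpness, calibration, sharpness

# Made-up data: r = 0.3 used five times (event occurred twice), r = 0.8 used five times (four times).
reported = [0.3] * 5 + [0.8] * 5
outcomes = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
print(decompose_quadratic(reported, outcomes))         # (0.59, 0.005, 0.20)
```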
For one thing, we can always attempt to correct for miscalibration. If an expert always gives probabilities that are too high, for example, a decision maker using those probabilities can reduce reported probabilities from that expert. Correcting for poor sharpness is a much trickier business. We have illustrated the notion of decomposition of an average score using the quadratic scoring rule, but the same idea can be applied to other rules. Also, an ex post average score or an ex ante expected score can be decomposed in different ways.

The decomposition into terms measuring calibration and sharpness is the most frequently used decomposition, and it is arguably the most important decomposition. Gneiting and Raftery [8] comment that the goal of probabilistic forecasting is to maximize the sharpness of the (probabilities) subject to calibration. Although that seems reasonable, we feel that in reported evaluations, too much emphasis is typically given to calibration and not enough to sharpness.

Ex post evaluation with strictly proper scoring rules involves most of the same ideas encountered in the role of scoring rules in terms of ex ante incentives. Ex ante incentives for honesty translate into ex post evaluations of calibration, and ex ante measures of sharpness based on the probabilities translate into ex post measures based on the relative frequencies of occurrence of the events of interest for given probability values (given boxes). The use of scoring rules and their decompositions to evaluate probabilities ex post can be thought of as exploratory data analysis. Such evaluations can be used to compare experts or models. In the case of a single expert or model, they can be used to learn more about that expert's or model's characteristics and abilities as a probability forecaster. Feedback can also help the expert understand his or her own characteristics as a probability forecaster and attempt to improve them in the future.

SCORING RULES WITH BASELINE DISTRIBUTIONS

Scoring rules such as the quadratic, logarithmic, and spherical rules given earlier can be thought of as providing an absolute evaluation of probabilities. Often, we would like to have a relative evaluation by comparing how good a probability or probability distribution is, relative to some baseline. When assessing probabilities of rain, for example, it is easier to get a high score in a location where rain seldom occurs than it is in a location where it rains reasonably often. Does that mean that the probability forecasts in the drier area are better?

As noted earlier, the most commonly used scoring rules are symmetric in the sense that any permutation of the labels on the events and their associated probabilities does not change the expected score. One implication of this symmetry, when combined with the convexity of the expected score function, is that the expected score is minimized for a uniform distribution that gives a probability of 1/k to each event in the k-event case. Thus, these scores are implicitly being evaluated relative to a uniform baseline distribution. To avoid comparison with a uniform distribution, we could consider the percentage improvement in average scores over the scores for a baseline distribution. In assessing a probability of rain, for example, we might use climatology, which is the long-term relative frequency of rain in a given location at a specific time of year, as a baseline. However, a percentage improvement in the score over the baseline, which is called a skill score, is not strictly proper. For a strictly proper rule with a baseline distribution that is not uniform, we can choose a desired convex expected score function and generate a strictly proper rule that yields that expected score function [11]. For example, we might choose a function that is minimized at what might be viewed as a least skillful forecast. In forecasting rain, climatology might be considered least skillful among forecasts that seem reasonable, since it just involves looking up some past data and does not require any weather-forecasting expertise.
In contrast, although a uniform distribution requires no expertise, it may not seem at all reasonable, as in the case of a dry location with a very low climatological relative frequency of rain.

Asymmetric rules can be generated from symmetric rules. For example, in the single-event case for which it is felt that the expected score with honest reporting should be minimized at a probability of q, we can take any symmetric strictly proper scoring rule S and create a new rule S*(r, e | q) = [S(r, e) − S(q, e)]/T(q), where T(q) = S(1, 1) − S(q, 1) if r ≥ q and T(q) = S(0, 0) − S(q, 0) if r < q. More generally, the families of scoring rules given by Equations (12) and (13) can be generalized to pseudospherical and power families of strictly proper scoring rules that allow for the incorporation of baseline distributions [9].

If the baseline distribution for a set of k mutually exclusive and collectively exhaustive events is denoted by q = (q_1, ..., q_k), then we can define the pseudospherical and power families of scoring rules with baselines as follows:

S^S_β(r, e_i | q) = (1/(β − 1)) [ ( (r_i/q_i) / E_r[(r/q)^{β−1}]^{1/β} )^{β−1} − 1 ] (16)

and

S^P_β(r, e_i | q) = [ (r_i/q_i)^{β−1} − 1 ]/(β − 1) − [ E_r[(r/q)^{β−1}] − 1 ]/β, (17)

where E_r[(r/q)^{β−1}] = Σ_{i=1}^{k} r_i (r_i/q_i)^{β−1} and −∞ < β < ∞. These scoring rules are scaled so that they yield scores of 0 when r = q. Thus, a positive score represents improvement over the baseline and a negative score indicates a forecast worse than the baseline. (The expert's expected score with honest reporting is positive except at r = q, where it is 0.) As with Equations (12) and (13), β = 2 corresponds to spherical and quadratic rules, respectively, in Equations (16) and (17), and both families converge to a logarithmic rule when β → 1. Figure 2 shows S(r, e_1 | q), S(r, e_2 | q), and S(p, p | q) for the power scoring rule with β = 2 and q = (0.2, 0.8).

[Figure 2. (a) Score functions S(r, e_1 | q) and S(r, e_2 | q) and (b) expected score S(p, p | q) under honest reporting for the power scoring rule with β = 2 and q = (0.2, 0.8).]

The consideration of baseline distributions provides a relative evaluation as opposed to an absolute evaluation, and relative evaluations are often of great interest. In addition, evaluations with baseline distributions can be useful in evaluating probabilities (and evaluating the forecasters providing the probabilities) that are made under different circumstances. For example, if one weather forecaster assesses probabilities of rain in a very dry climate (say, with a climatology of 0.05) and another forecaster assesses probabilities in a more moist climate (climatology 0.40), then it is much easier for the first forecaster to obtain higher scores in an absolute evaluation. This is because the first forecaster is able, on average, to make sharper forecasts. If we use a relative evaluation with climatology as the baseline, then we are comparing the two forecasters in terms of how effective they are at improving upon a forecast based solely on climatology, thereby adjusting for the differences in the forecast situations. While not perfect, this will tend to even the playing field somewhat and make for fairer comparisons.
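As an illustration (a Python sketch with NumPy; the climatological baseline of 0.05 is an assumption for the example), Equations (16) and (17) can be computed directly. The scores are 0 when r = q, and a forecast that moves probability toward rain scores above the baseline when it rains and below it when it does not.

```python
import numpy as np

def power_baseline(r, i, q, beta):
    """Power score relative to baseline q, Eq. (17); equals 0 when r = q."""
    r, q = np.asarray(r, dtype=float), np.asarray(q, dtype=float)
    ratio = r / q
    e_r = np.sum(r * ratio ** (beta - 1))              # E_r[(r/q)^(beta-1)]
    return (ratio[i] ** (beta - 1) - 1) / (beta - 1) - (e_r - 1) / beta

def pseudospherical_baseline(r, i, q, beta):
    """Pseudospherical score relative to baseline q, Eq. (16); equals 0 when r = q."""
    r, q = np.asarray(r, dtype=float), np.asarray(q, dtype=float)
    ratio = r / q
    e_r = np.sum(r * ratio ** (beta - 1))
    return ((ratio[i] / e_r ** (1 / beta)) ** (beta - 1) - 1) / (beta - 1)

q = [0.05, 0.95]                                       # assumed climatology: rain is rare
print(power_baseline(q, 0, q, beta=2))                 # 0.0: reporting the baseline itself
print(power_baseline([0.40, 0.60], 0, q, beta=2))      # positive when rain (event 0) occurs
print(power_baseline([0.40, 0.60], 1, q, beta=2))      # negative when no rain occurs
print(pseudospherical_baseline([0.40, 0.60], 0, q, beta=2))   # also positive when rain occurs
```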

SCORING RULES THAT ARE SENSITIVE TO DISTANCE

In some situations, the events of interest are ordered. For example, in a soccer match, a team can win, lose, or tie. A win is better than a tie, which in turn is better than a loss, so there is an ordering. If we are giving probabilities for x, the amount of rain in inches on a given day, we might assess probabilities for x = 0, 0 < x ≤ 0.5, 0.5 < x ≤ 1, and x > 1. These four events are ordered.

The scoring rules for multiple events, discussed in the preceding sections, ignore any ordering. Suppose that two experts report probabilities of (0.3, 0.4, 0.2, 0.1) and (0.3, 0.1, 0.2, 0.4). If there is no rain, then the experts will receive the same score. They both gave a probability of 0.3 for the event that occurred, and they gave probabilities of 0.4, 0.2, and 0.1 for the other three events. The fact that the latter three probabilities were in different orders does not affect the score because the scoring rule does not take ordering of the events into account. Some might argue that the first expert gave more probability to the event closest to the event that occurred and should therefore receive a higher score. Scoring rules have been developed that would take ordering into account in this way, and we say that such rules are sensitive to distance. Informally, this means that for the probability not assigned to the event that occurs, a higher score will result if more probability is given to events closer to the event that occurs and less probability to more distant events.

The first strictly proper sensitive-to-distance scoring rule was a quadratic rule called the ranked probability score [12]:

S(r, e_i) = −Σ_{j=1}^{i−1} R_j^2 − Σ_{j=i}^{k−1} (1 − R_j)^2,

where R_j = Σ_{l=1}^{j} r_l is a cumulative probability. By connecting the score to cumulative probabilities, the rule is able to take sensitivity to distance into consideration. As probability moves from events more distant from the event that occurs to events closer to the event that occurs, the cumulative probabilities change accordingly and result in an increase in the score.

The same idea that is used to generate the ranked probability score from the quadratic score can be used to obtain a sensitive-to-distance scoring rule S* based on any strictly proper rule S for a single event and its complement:

S*(r, e_i) = Σ_{j=1}^{i−1} S(R_j, 0) + Σ_{j=i}^{k−1} S(R_j, 1). (18)

The corresponding expected score is S*(r, p) = Σ_{i=1}^{k−1} [P_i S(R_i, 1) + (1 − P_i) S(R_i, 0)], using a vector P of cumulative probabilities P_j = Σ_{l=1}^{j} p_l based on p. Note that Equation (18) can be used to generate new pseudospherical and power families of strictly proper scoring rules, with or without baseline distributions [13]. If baseline distributions are used, they will be expressed in cumulative form, with a vector Q of cumulative probabilities Q_j = Σ_{l=1}^{j} q_l representing the cumulative baseline distribution.

It is important to mention that properties of the scoring rule S in Equation (18), other than the fact that it is strictly proper, are not necessarily inherited by S*. For example, if S is logarithmic, S* will not inherit the property of locality mentioned earlier. The score is based on the cumulative probabilities R_1, ..., R_{k−1}, so it clearly depends on more than just r_i. Also, if S* is determined from Equation (18) using a symmetric, strictly proper S without a baseline distribution, then the expected score S*(p, p) is minimized at p = (0.5, 0, ..., 0, 0.5). That is, if a baseline distribution is not chosen, the default baseline distribution is (0.5, 0, ..., 0, 0.5), not (1/k, ..., 1/k), and the score for a uniform distribution will not be the same for all events because the ordering of the events is relevant and some events are more distant from others.
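The construction in Equation (18) is straightforward to implement. The sketch below (Python with NumPy; illustrative, not from the article) uses the binary quadratic rule as the building block S and reproduces the rain example above: when no rain occurs, the forecast (0.3, 0.4, 0.2, 0.1) scores higher than (0.3, 0.1, 0.2, 0.4) because its remaining probability is closer to the observed event.

```python
import numpy as np

def binary_quadratic(r, e):
    """Strictly proper binary rule used as the building block S in Eq. (18)."""
    return 1 - 2 * (e - r) ** 2

def sensitive_to_distance(r, i, S=binary_quadratic):
    """Eq. (18): apply S to the cumulative probabilities R_1, ..., R_{k-1}.

    r : reported probabilities for the ordered events A_1, ..., A_k
    i : 0-based index of the event that occurred
    """
    R = np.cumsum(np.asarray(r, dtype=float))[:-1]     # R_1, ..., R_{k-1}
    e = (np.arange(len(R)) >= i).astype(float)         # cumulative outcome indicators
    return sum(S(R_j, e_j) for R_j, e_j in zip(R, e))

r1 = [0.3, 0.4, 0.2, 0.1]
r2 = [0.3, 0.1, 0.2, 0.4]
print(sensitive_to_distance(r1, 0))   # 1.82: remaining probability is near the observed event
print(sensitive_to_distance(r2, 0))   # 0.98: remaining probability is far from the observed event
```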
Note that the default baseline distribution of (0.5, 0, ..., 0, 0.5) for S* translates to (0.5, ..., 0.5) when expressed in terms of the cumulative probabilities (R_1, ..., R_{k−1}). Since the relevant probabilities are cumulative in a score that is sensitive to distance, the baseline distribution is uniform in the cumulative probabilities. Furthermore, this distribution will give the same score S* regardless of which event occurs, because R_j = 0.5 in each of the S(R_j, 0) and S(R_j, 1) terms in Equation (18).

Commonly encountered scoring rules ignore any ordering of the events. When the events of interest are ordered, however, the ordering may be important in terms of the underlying real-world situation.

For instance, with forecasts of returns on an investment, high probabilities for values that are not identical to the returns that actually occur but are close to those returns would seem to be more valuable for investment purposes than high probabilities for values that are quite distant from the actual returns. In such a setting, a scoring rule that is sensitive to distance might provide incentives and an ex post evaluation that are more consistent with the decision-making problem.

SUMMARY AND DISCUSSION

Probability forecasts are important inputs to quantify uncertainty in inferential and decision-making problems. It is therefore important to have appropriate incentives for careful formulation of probability forecasts and to have measures to evaluate the forecasts once the uncertainty is resolved and we see what actually happens. Those are exactly the roles that are played by scoring rules. In particular, strictly proper scoring rules provide incentives for making good forecasts (i.e., sharp forecasts) and reporting them honestly. In terms of ex post evaluation, the incentive for honest reporting ex ante translates into measures of the calibration of the forecasts, and given good calibration, sharper forecasts will earn higher scores on average.

A few scoring rules tend to be used most often, but rich families of strictly proper scoring rules have been developed. Beyond the basic rules, there are options that add more flexibility while maintaining the strictly proper nature of the rules. Some rules allow the evaluation of probabilities relative to a chosen baseline distribution. Among other things, this makes scores for probabilities made in different situations more comparable. Other rules take into account any ordering of the events and are sensitive to distance in the sense of giving higher scores to probability distributions assigning higher probabilities to events near the event that occurs, all other things being equal. This feature is relevant when being close with the probability forecast can lead to better decisions or inferences.

How might a user choose a scoring rule in a given situation? Among the basic rules, some have different properties from others, and different rules can lead to different scores and different rankings of experts [14]. Thus, the choice of a rule might depend on how one feels about those properties and about the situation at hand. For example, with probabilistic answers for a multiple-choice test, the locality property of the logarithmic rule might have strong appeal. At the same time, the possibility of a score of negative infinity with the logarithmic score is a potential concern; some claim that the logarithmic rule has undesirable properties [15], and in certain settings, locality becomes less important. For example, in a two-event setting, all rules satisfy locality, and in a setting where sensitivity to distance is considered important, a sensitive-to-distance logarithmic rule no longer has the locality property. There is no general agreement on a single best rule for all situations. The use of a scoring rule that involves the choice of a baseline distribution depends on whether a relative evaluation is desired and whether there is a specific baseline distribution against which to compare probabilities. An important thing to keep in mind is that using the basic rules without choosing a baseline distribution means that probabilities are being evaluated relative to the default distribution, which is a uniform distribution.
Another choice when events are ordered is whether to use a rule that is sensitive to distance, and this choice is related to whether giving higher probabilities to events close to the event that occurs is viewed as important for the situation at hand.

What about practical issues in using the rules? The use of the rules ex post to evaluate probabilities is straightforward, just involving the computation of scores using the formula for any scoring rule that is chosen. Scores can then be used as feedback to enable experts and modelers to see their performance and perhaps learn from it. The use of scoring rules ex ante means that they should be part of a general probability assessment process that might include some training regarding probability if necessary. For many experts, the connection of the scoring rule formulas with the incentives is too opaque to make it valuable to dwell on the formulas. Discussing the incentives in an intuitive fashion is generally more effective. One option for relatively simple cases (e.g., a probability for a single event) is to present the possible scores in graphical or tabular form.

The incentive to maximize expected score with strictly proper scoring rules is probably reasonable in most cases. In the context of thinking of the expert as wanting to maximize expected utility, it implies that the utility function is linear in the score (or linear in money if the score is translated into a monetary reward). If the expert's utility function U(S) for the score is known, a modification of the score to S* = U^{−1}(S) with a strictly proper S will adjust for U, since U(S*) = U[U^{−1}(S)] = S, and will thereby encourage honest reporting.
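As an illustration of this adjustment (a hedged sketch; the exponential utility and the risk-tolerance value are assumptions for the example, not part of the article), paying the expert S* = U^{−1}(S) makes the utility of the payment equal to the underlying proper score, so expected utility is maximized by honest reporting.

```python
import numpy as np

RHO = 5.0                                   # assumed risk-tolerance parameter (illustrative)

def utility(s):
    """Hypothetical exponential utility for the score, U(s) = 1 - exp(-s/RHO)."""
    return 1 - np.exp(-s / RHO)

def utility_inverse(u):
    """U^{-1}; paying S* = U^{-1}(S) gives U(S*) = S, the original proper score."""
    return -RHO * np.log(1 - u)

def quadratic(r, e):
    """Binary quadratic score, Eq. (2)."""
    return 1 - 2 * (e - r) ** 2

s = quadratic(0.7, 1)                       # proper score when r = 0.7 and the event occurs
payment = utility_inverse(s)                # adjusted payment S*
print(s, payment, utility(payment))         # the utility of the payment equals the score: 0.82
```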

A practical problem with this is that we are not likely to know U, and eliciting it from the expert is not easy. In many cases, we feel that the importance of the score is probably not large enough to cause major violations of such linear utility due to risk aversion or risk taking, for example. However, if there are other stakes related to the probability forecasts, those stakes might cause significant shifting of the probabilities away from honest reporting. For example, if the situation is viewed as a contest against other experts (rightly or wrongly), the consideration of strategic play might lead the expert to give more extreme probabilities (i.e., probabilities closer to and often equal to 0 or 1) than justified by the expert's best judgments, in order to try to win the contest [16]. In most cases, however, we would expect that experts will try to come up with the best set of probabilities given the information that is available to them, and will not think strategically. In any event, strictly proper scoring rules can always be used for ex post evaluation purposes, and any hedging of reported probabilities can be expected to lead to lower average scores.

In closing, we note that work on scoring rules has interesting connections to other fields. Scoring rules are closely connected to decision theory/decision analysis. A decision maker may hire an expert to report probabilities for events related to the decision and might like to tailor a scoring rule to the decision-making problem, in the spirit of Savage's "share of the business" notion [6]. The expert's reported probabilities can be viewed as new information by the decision maker, and connections between scores and the value of that information are of interest. These notions are related to the literature on incentives and mechanism design in economics and especially to agency theory. On a different tack, expected scores from strictly proper scoring rules are related to information measures from signal processing and information theory [9]. For example, the expected score for honest reporting under a logarithmic scoring rule is the negative Shannon entropy of the expert's probability distribution p and is the Kullback-Leibler divergence of p with respect to q if the baseline distribution q is chosen. Finally, extensive experimental work by psychologists has investigated the degree to which individuals' probability assessments are well calibrated and has led to various theories of the calibration of subjective probabilities [17]. Judgments about others' probabilities are important in competitive situations, and economics and psychology are both relevant for this issue.
Given the importance of probability forecasts in decision modeling and statistics as well as the connections with these different (and somewhat disparate) fields, we expect the interest in scoring rules and the application of such rules in practice to grow.

REFERENCES

1. Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev 1950;78(1).
2. Good IJ. Rational decisions. J R Stat Soc Ser B 1952;14(1).
3. McCarthy J. Measures of the value of information. Proc Natl Acad Sci USA 1956;42(9).
4. de Finetti B. Does it make sense to speak of good probability appraisers? In: Good IJ, editor. The scientist speculates: an anthology of partly-baked ideas. New York: Wiley.
5. Winkler RL, Murphy AH. Good probability assessors. J Appl Meteorol 1968;7(5).
6. Savage LJ. Elicitation of personal probabilities and expectations. J Am Stat Assoc 1971;66(336).

7. Winkler RL. Scoring rules and the evaluation of probabilities. Test 1996;5(1).
8. Gneiting T, Raftery A. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc 2007;102(477).
9. Jose VRR, Nau RF, Winkler RL. Scoring rules, generalized entropy, and utility maximization. Oper Res 2008;56(5).
10. DeGroot MH, Fienberg SE. Assessing probability assessors: calibration and refinement. In: Gupta SS, Berger JO, editors. Statistical decision theory and related topics. New York: Academic Press.
11. Winkler RL. Evaluating probabilities: asymmetric scoring rules. Manage Sci 1994;40(11).
12. Epstein ES. A scoring system for probability forecasts of ranked categories. J Appl Meteorol 1969;8(6).
13. Jose VRR, Nau RF, Winkler RL. Sensitivity to distance and baseline distributions in forecast evaluation. Manage Sci 2009;55(4).
14. Bickel JE. Some comparisons among quadratic, spherical, and logarithmic rules. Decis Anal 2007;4(2).
15. Selten R. Axiomatic characterization of the quadratic scoring rule. Exp Econ 1998;1(1).
16. Lichtendahl KC, Winkler RL. Probability elicitation, scoring rules, and competition among forecasters. Manage Sci 2007;53(11).
17. O'Hagan A, Buck CE, Daneshkhah A, et al. Uncertain judgements: eliciting experts' probabilities. Chichester: Wiley; 2006.


2. Probability. Chris Piech and Mehran Sahami. Oct 2017 2. Probability Chris Piech and Mehran Sahami Oct 2017 1 Introduction It is that time in the quarter (it is still week one) when we get to talk about probability. Again we are going to build up from first

More information

Are Probabilities Used in Markets? 1

Are Probabilities Used in Markets? 1 Journal of Economic Theory 91, 8690 (2000) doi:10.1006jeth.1999.2590, available online at http:www.idealibrary.com on NOTES, COMMENTS, AND LETTERS TO THE EDITOR Are Probabilities Used in Markets? 1 Larry

More information

Foundations of Mathematics 11 and 12 (2008)

Foundations of Mathematics 11 and 12 (2008) Foundations of Mathematics 11 and 12 (2008) FOUNDATIONS OF MATHEMATICS GRADE 11 [C] Communication Measurement General Outcome: Develop spatial sense and proportional reasoning. A1. Solve problems that

More information

Preliminary Statistics Lecture 2: Probability Theory (Outline) prelimsoas.webs.com

Preliminary Statistics Lecture 2: Probability Theory (Outline) prelimsoas.webs.com 1 School of Oriental and African Studies September 2015 Department of Economics Preliminary Statistics Lecture 2: Probability Theory (Outline) prelimsoas.webs.com Gujarati D. Basic Econometrics, Appendix

More information

The Laplace Rule of Succession Under A General Prior

The Laplace Rule of Succession Under A General Prior 1 The Laplace Rule of Succession Under A General Prior Kalyan Raman University of Michigan in Flint School of Management Flint, MI 48502 May 2000 ------------------------------------------------------------------------------------------------

More information

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 CS 70 Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 Introduction One of the key properties of coin flips is independence: if you flip a fair coin ten times and get ten

More information

Cómo contar la predicción probabilistica? Part I: The problem with probabilities

Cómo contar la predicción probabilistica? Part I: The problem with probabilities Cómo contar la predicción probabilistica? How to tell probabilities in weather forecasting? Part I: The problem with probabilities 1 -Why all this talk about uncertainty and probabilities? The computer

More information

Microeconomics II Lecture 4: Incomplete Information Karl Wärneryd Stockholm School of Economics November 2016

Microeconomics II Lecture 4: Incomplete Information Karl Wärneryd Stockholm School of Economics November 2016 Microeconomics II Lecture 4: Incomplete Information Karl Wärneryd Stockholm School of Economics November 2016 1 Modelling incomplete information So far, we have studied games in which information was complete,

More information

PHY 123 Lab 1 - Error and Uncertainty and the Simple Pendulum

PHY 123 Lab 1 - Error and Uncertainty and the Simple Pendulum To print higher-resolution math symbols, click the Hi-Res Fonts for Printing button on the jsmath control panel. PHY 13 Lab 1 - Error and Uncertainty and the Simple Pendulum Important: You need to print

More information

Transitive Regret over Statistically Independent Lotteries

Transitive Regret over Statistically Independent Lotteries Transitive Regret over Statistically Independent Lotteries April 2, 2012 Abstract Preferences may arise from regret, i.e., from comparisons with alternatives forgone by the decision maker. We show that

More information

SUPPORTING INFORMATION ALGEBRA II. Texas Education Agency

SUPPORTING INFORMATION ALGEBRA II. Texas Education Agency SUPPORTING INFORMATION ALGEBRA II Texas Education Agency The materials are copyrighted (c) and trademarked (tm) as the property of the Texas Education Agency (TEA) and may not be reproduced without the

More information

The Complexity of Forecast Testing

The Complexity of Forecast Testing The Complexity of Forecast Testing LANCE FORTNOW and RAKESH V. VOHRA Northwestern University Consider a weather forecaster predicting the probability of rain for the next day. We consider tests that given

More information

Measuring the Standard of Living: Uncertainty about Its Development

Measuring the Standard of Living: Uncertainty about Its Development Measuring the Standard of Living: Uncertainty about Its Development Wulf Gaertner Department of Economics, University of Osnabrück D 49069 Osnabrück, Germany E-mail: WGaertner@oec.uni-osnabrueck.de Yongsheng

More information

A Eliciting Predictions for Discrete Decision Making

A Eliciting Predictions for Discrete Decision Making A Eliciting Predictions for Discrete Decision Making YILING CHEN, Harvard University IAN A. KASH, Microsoft Research MIKE RUBERRY, Harvard University VICTOR SHNAYDER, Harvard University We consider a decision

More information

Notes and Solutions #6 Meeting of 21 October 2008

Notes and Solutions #6 Meeting of 21 October 2008 HAVERFORD COLLEGE PROBLEM SOLVING GROUP 008-9 Notes and Solutions #6 Meeting of October 008 The Two Envelope Problem ( Box Problem ) An extremely wealthy intergalactic charitable institution has awarded

More information

Extended Range Truing, Why and How

Extended Range Truing, Why and How Extended Range Truing, Why and How For a hundred and fifty years, ballisticians have sought to improve their predictions of the path of the bullet after it leaves the gun. The bullet s behavior is primarily

More information

STAT Chapter 3: Probability

STAT Chapter 3: Probability Basic Definitions STAT 515 --- Chapter 3: Probability Experiment: A process which leads to a single outcome (called a sample point) that cannot be predicted with certainty. Sample Space (of an experiment):

More information

Deceptive Advertising with Rational Buyers

Deceptive Advertising with Rational Buyers Deceptive Advertising with Rational Buyers September 6, 016 ONLINE APPENDIX In this Appendix we present in full additional results and extensions which are only mentioned in the paper. In the exposition

More information

Averaging Probability Forecasts: Back to the Future

Averaging Probability Forecasts: Back to the Future Averaging Probability Forecasts: Back to the Future Robert L. Winkler Kenneth C. Lichtendahl Jr. Yael Grushka-Cockayne Victor Richmond R. Jose Working Paper 19-039 Averaging Probability Forecasts: Back

More information

Math 710 Homework 1. Austin Mohr September 2, 2010

Math 710 Homework 1. Austin Mohr September 2, 2010 Math 710 Homework 1 Austin Mohr September 2, 2010 1 For the following random experiments, describe the sample space Ω For each experiment, describe also two subsets (events) that might be of interest,

More information

II. Analysis of Linear Programming Solutions

II. Analysis of Linear Programming Solutions Optimization Methods Draft of August 26, 2005 II. Analysis of Linear Programming Solutions Robert Fourer Department of Industrial Engineering and Management Sciences Northwestern University Evanston, Illinois

More information

Recitation 7: Uncertainty. Xincheng Qiu

Recitation 7: Uncertainty. Xincheng Qiu Econ 701A Fall 2018 University of Pennsylvania Recitation 7: Uncertainty Xincheng Qiu (qiux@sas.upenn.edu 1 Expected Utility Remark 1. Primitives: in the basic consumer theory, a preference relation is

More information

A New Interpretation of Information Rate

A New Interpretation of Information Rate A New Interpretation of Information Rate reproduced with permission of AT&T By J. L. Kelly, jr. (Manuscript received March 2, 956) If the input symbols to a communication channel represent the outcomes

More information

Evidence with Uncertain Likelihoods

Evidence with Uncertain Likelihoods Evidence with Uncertain Likelihoods Joseph Y. Halpern Cornell University Ithaca, NY 14853 USA halpern@cs.cornell.edu Riccardo Pucella Cornell University Ithaca, NY 14853 USA riccardo@cs.cornell.edu Abstract

More information

COHERENCE AND PROBABILITY. Rosangela H. Loschi and Sergio Wechsler. Universidade Federal de Minas Gerais e Universidade de S~ao Paulo ABSTRACT

COHERENCE AND PROBABILITY. Rosangela H. Loschi and Sergio Wechsler. Universidade Federal de Minas Gerais e Universidade de S~ao Paulo ABSTRACT COHERENCE AND PROBABILITY Rosangela H. Loschi and Sergio Wechsler Universidade Federal de Minas Gerais e Universidade de S~ao Paulo ABSTRACT A process of construction of subjective probability based on

More information

May 4, Statement

May 4, Statement Global Warming Alarm Based on Faulty Forecasting Procedures: Comments on the United States Department of State s U.S. Climate Action Report 2010. 5 th ed. Submitted by: May 4, 2010 J. Scott Armstrong (Ph.D.,

More information

HOW GOOD WERE THOSE PROBABILITY PREDICTIONS? THE EXPECTED RECOMMENDATION LOSS (ERL) SCORING RULE

HOW GOOD WERE THOSE PROBABILITY PREDICTIONS? THE EXPECTED RECOMMENDATION LOSS (ERL) SCORING RULE HOW GOOD WERE THOSE PROBABILITY PREDITIONS? THE EXPETED REOMMENDATION LOSS (ERL) SORING RULE David B. Rosen enter for Biomedical Modeling Research University of Nevada, Reno Present address: Department

More information

7.5 Partial Fractions and Integration

7.5 Partial Fractions and Integration 650 CHPTER 7. DVNCED INTEGRTION TECHNIQUES 7.5 Partial Fractions and Integration In this section we are interested in techniques for computing integrals of the form P(x) dx, (7.49) Q(x) where P(x) and

More information

CHAPTER 3. THE IMPERFECT CUMULATIVE SCALE

CHAPTER 3. THE IMPERFECT CUMULATIVE SCALE CHAPTER 3. THE IMPERFECT CUMULATIVE SCALE 3.1 Model Violations If a set of items does not form a perfect Guttman scale but contains a few wrong responses, we do not necessarily need to discard it. A wrong

More information

Microeconomic theory focuses on a small number of concepts. The most fundamental concept is the notion of opportunity cost.

Microeconomic theory focuses on a small number of concepts. The most fundamental concept is the notion of opportunity cost. Microeconomic theory focuses on a small number of concepts. The most fundamental concept is the notion of opportunity cost. Opportunity Cost (or "Wow, I coulda had a V8!") The underlying idea is derived

More information

Verification of ensemble and probability forecasts

Verification of ensemble and probability forecasts Verification of ensemble and probability forecasts Barbara Brown NCAR, USA bgb@ucar.edu Collaborators: Tara Jensen (NCAR), Eric Gilleland (NCAR), Ed Tollerud (NOAA/ESRL), Beth Ebert (CAWCR), Laurence Wilson

More information

Combining Forecasts: The End of the Beginning or the Beginning of the End? *

Combining Forecasts: The End of the Beginning or the Beginning of the End? * Published in International Journal of Forecasting (1989), 5, 585-588 Combining Forecasts: The End of the Beginning or the Beginning of the End? * Abstract J. Scott Armstrong The Wharton School, University

More information

Decision Making Beyond Arrow s Impossibility Theorem, with the Analysis of Effects of Collusion and Mutual Attraction

Decision Making Beyond Arrow s Impossibility Theorem, with the Analysis of Effects of Collusion and Mutual Attraction Decision Making Beyond Arrow s Impossibility Theorem, with the Analysis of Effects of Collusion and Mutual Attraction Hung T. Nguyen New Mexico State University hunguyen@nmsu.edu Olga Kosheleva and Vladik

More information

Conjectural Variations in Aggregative Games: An Evolutionary Perspective

Conjectural Variations in Aggregative Games: An Evolutionary Perspective Conjectural Variations in Aggregative Games: An Evolutionary Perspective Alex Possajennikov University of Nottingham January 2012 Abstract Suppose that in aggregative games, in which a player s payoff

More information

KDF2C QUANTITATIVE TECHNIQUES FOR BUSINESSDECISION. Unit : I - V

KDF2C QUANTITATIVE TECHNIQUES FOR BUSINESSDECISION. Unit : I - V KDF2C QUANTITATIVE TECHNIQUES FOR BUSINESSDECISION Unit : I - V Unit I: Syllabus Probability and its types Theorems on Probability Law Decision Theory Decision Environment Decision Process Decision tree

More information

Preliminary Results on Social Learning with Partial Observations

Preliminary Results on Social Learning with Partial Observations Preliminary Results on Social Learning with Partial Observations Ilan Lobel, Daron Acemoglu, Munther Dahleh and Asuman Ozdaglar ABSTRACT We study a model of social learning with partial observations from

More information

ECO 317 Economics of Uncertainty Fall Term 2009 Notes for lectures 1. Reminder and Review of Probability Concepts

ECO 317 Economics of Uncertainty Fall Term 2009 Notes for lectures 1. Reminder and Review of Probability Concepts ECO 317 Economics of Uncertainty Fall Term 2009 Notes for lectures 1. Reminder and Review of Probability Concepts 1 States and Events In an uncertain situation, any one of several possible outcomes may

More information

Chapter 4 - Introduction to Probability

Chapter 4 - Introduction to Probability Chapter 4 - Introduction to Probability Probability is a numerical measure of the likelihood that an event will occur. Probability values are always assigned on a scale from 0 to 1. A probability near

More information

Section 13.3 Probability

Section 13.3 Probability 288 Section 13.3 Probability Probability is a measure of how likely an event will occur. When the weather forecaster says that there will be a 50% chance of rain this afternoon, the probability that it

More information

A Note on the Existence of Ratifiable Acts

A Note on the Existence of Ratifiable Acts A Note on the Existence of Ratifiable Acts Joseph Y. Halpern Cornell University Computer Science Department Ithaca, NY 14853 halpern@cs.cornell.edu http://www.cs.cornell.edu/home/halpern August 15, 2018

More information