The Calibration Simplex: A Generalization of the Reliability Diagram for Three-Category Probability Forecasts

Size: px

Start display at page:

Download "The Calibration Simplex: A Generalization of the Reliability Diagram for Three-Category Probability Forecasts"

Tyler Goodwin
5 years ago
Views:

1 1210 W E A T H E R A N D F O R E C A S T I N G VOLUME 28 The Calibration Simplex: A Generalization of the Reliability Diagram for Three-Category Probability Forecasts DANIEL S. WILKS Department of Earth and Atmospheric Sciences, Cornell University, Ithaca, New York (Manuscript received 20 February 2013, in final form 10 May 2013) ABSTRACT Full exposition of the performance of a set of forecasts requires examination of the joint frequency distribution of those forecasts and their corresponding observations. In settings involving probability forecasts, this joint distribution has a high dimensionality, and communication of its information content is often best achieved graphically. This paper describes an extension of the well-known reliability diagram, which displays the joint distribution for probability forecasts of dichotomous events, to the case of probability forecasts for three disjoint events, such as below, near, and above normal. The resulting diagram, called the calibration simplex, involves a discretization of the 2-simplex, which is an equilateral triangle. Characteristics and interpretation of the calibration simplex are illustrated using both idealized verification datasets, and the and 8 14-day temperature and precipitation forecasts produced by the U.S. Climate Prediction Center. 1. Introduction Forecast verification is the process of systematically and quantitatively describing the relationships between forecasts and the events they are meant to predict. It is an essential process both for optimal use of forecasts in decision making (Katz and Murphy 1997), and as a part of the process of improving the forecasts themselves (e.g., Murphy 1997; Wilks 2011; Jolliffe and Stephenson 2012). Most often forecast quality is characterized using scalar (i.e., single number) verification statistics such as mean squared error, which has been referred to as the measures-oriented approach to verification (Murphy 1997). Although restricting attention to one or a small number of scalar statistics simplifies the verification process both computationally and conceptually, this practice inevitably masks some aspects of forecast performance even in the most restricted verification settings. The problem arises because the dimensionality (Murphy 1991) of the joint distribution of forecasts and observations (Murphy and Winkler 1987) is large, especially in settings involving probability forecasts. Distributions-oriented (Murphy 1997), or diagnostic Corresponding author address: Daniel S. Wilks, Dept. of Earth and Atmospheric Sciences, Bradfield Hall, Rm. 1113, Cornell University, Ithaca, NY dsw5@cornell.edu (Murphy et al. 1989; Murphy and Winkler 1992), verification methods communicate the richness of verification datasets by respecting their dimensionality, but at the same time must overcome the problem of how best to portray the high-dimensional information. Arguably the most effective approaches to communicating the full information content of a joint distribution of forecasts and observations have involved the use of well-designed graphics, although only a small number of these have yet been devised. Murphy et al. (1989) used bivariate histograms to convey the joint distribution of nonprobabilistic daily temperature forecasts and their corresponding observations, and also introduced the conditional quantile plot to portray the same information in terms of both the calibration refinement, and the likelihood-base rate factorizations (Murphy and Winkler 1987) of the joint distribution. By far the most commonly used diagnostic verification graphic is the reliability diagram (Murphy and Winkler 1977; Wilks 2011). Reliability diagrams show the calibration-refinement factorization of the joint distribution of probability forecasts for dichotomous (yes no) events, by plotting the subsample event relative frequency as a function of forecast probability (the calibration), and including also a plot of the frequency of use of each of the possible scalar probability forecasts (the refinement). Probability forecasts are most frequently issued for dichotomous events, so that the reliability diagram, or DOI: /WAF-D Ó 2013 American Meteorological Society

2 OCTOBER 2013 W I L K S 1211 the closely allied attributes diagram (Murphy and Hsu 1986), is appropriate and effective for graphically displaying the relevant joint distribution. However, probability forecasts may also be issued jointly for more than two predictand categories, notably including threecategory temperature or precipitation forecasts at lead times of a week and longer (e.g., Van den Dool 2007; Livezey and Timofeyeva 2008; Barnston et al. 2010). In this format the predictand categories are often defined in terms of the climatological terciles, so that the belownormal (cold or dry), near-normal, and above-normal (warm or wet) categories each have climatological occurrence probabilities of 1 /3. Diagnostic verification of such forecasts has been approached by reducing the three-element probability forecast vectors to collections of dichotomous probabilities (e.g., Wilks 2000; Wilks and Godfrey 2002; Barnston et al. 2010), but that approach is not fully satisfying because it neglects the relationships among probabilities assigned to the different events (Murphy and Hsu 1986). This paper proposes an extension of the well-known reliability diagram, to verification of probability forecast vectors pertaining to three distinct outcome categories using a two-dimensional graphic called the calibration simplex. The calibration simplex graphically represents the calibration-refinement factorization of the full joint distribution of these forecasts and their corresponding observations. Section 2 defines the structure of the calibration simplex in terms of the joint distribution of the forecasts and observations, and in relation to the conventional reliability diagram. Characteristic calibration simplex forms that are diagnostic for important aspects of forecast performance are illustrated in section 3, using artificial data. Section 4 shows example results using and 8 14-day forecasts for U.S. temperature and precipitation, and section 5 concludes the paper. 2. Structure of the calibration simplex in relation to the reliability diagram Diagnostic verification methods are those that communicate the joint frequency distribution of the forecasts and their corresponding observations (Murphy and Winkler 1987): Pr ff i, o j g 5 Prff i \ o j g; i 5 1,..., I; j 5 1,..., J. (1) Here, there are I distinct forecasts f i possible, each of which pertains to J possible observations or outcomes o j. In the case of forecasts for dichotomous outcomes, J 5 2. For the purposes of constructing a reliability diagram, the forecast probabilities are rounded to a discrete set of I values if they have not already been issued as such. For example, I 5 11 for probability forecasts rounded to 10ths, in which case the forecasts range from f through f It can be more informative to work with the joint distribution in terms of one of its factorizations (Murphy and Winkler 1987). The reliability diagram is based on the calibration-refinement factorization: Prff i, o j g 5 Prfo j j f i gprff i g; i 5 1,..., I; j 5 1,..., J. (2) Thus, the full joint distribution can be decomposed into a collection of I conditional distributions Prfo j j f i g of the observations given each of the possible forecasts f i, called the calibration distributions, and a single I-element frequency distribution Prff i g specifying the frequencies of use of the possible forecasts, called the refinement distribution. In the case of the reliability diagram, the calibration distributions are Bernoulli (i.e., binomial with N 5 1) distributions, defined by the probability distribution function: Prfo j f i g 5 p o i (1 2 p i )12o, o 5 0, 1, (3) where each p i is estimated by its empirical (in the verification dataset) conditional relative frequency ^p i of the yes event being forecast on occasions following the corresponding forecast f i. Each of the I calibration distributions is fully characterized by its estimated Bernoulli probability ^p i, and, collectively these define the vertical positions of the plotted points in the main portion of the reliability diagram. A histogram, or other quantitative representation of the refinement distribution, completes the graphical portrayal of Eq. (2). Probability vectors f i 5 [f B,i, f N,i, f A,i ] T pertaining to three mutually exclusive and collectively exhaustive outcomes (e.g., below, near, and above normal) can be plotted in two dimensions because the three forecast probabilities in each vector must sum to 1. The geometrically appropriate graph in this case is the regular 2- simplex (Epstein and Murphy 1965; Murphy 1972), which takes the shape of an equilateral triangle. Each of the corners of the 2-simplex corresponds to forecast certainty (i.e., 100% probability) for one of the three outcomes being forecast. The point within the simplex at which a three-element probability vector f i is plotted is located at distances proportional to the probabilities for each of the three outcomes, perpendicularly from the sides of the simplex opposite the respective corners. This barycentric plotting system (a plotting position is at the center of masses corresponding to the probabilities,

3 1212 W E A T H E R A N D F O R E C A S T I N G VOLUME 28 FIG. 1. Calibration simplexes illustrating well-calibrated forecasts exhibiting (a) lower and (b) higher sharpness. placed at the simplex vertices) generalizes the reliability diagram because the 1-simplex appropriate to two-element probability forecasts for dichotomous events is the unit interval on the real line, which is the horizontal axis of the reliability diagram. Figure 1 illustrates the plotting of discretized forecast vectors onto the simplex, which has been rendered as a tessellation of hexagons. Each scalar forecast probability has been rounded to one of the K 5 10 values 0/9, 1/9,..., 9/9, yielding I 5 K(K 1 1)/ distinct possible vector forecasts f i, each of which is represented by one of the hexagons. This choice for the discretization seems natural for forecasts pertaining to tercile-based outcome definitions, although any regular discretization (e.g., the K 5 11 scalar forecast elements 0.0, 0.1,...,1.0) may be used. The hexagons at the three vertices represent forecasts assigning all probability to the outcome labeled at that corner. Hexagons representing other forecast vectors are located at perpendicular distances from the respective opposite sides, which are indicated by the probability labels in the margins. For example, forecasts of equal probability for the middle category, f N, are located along the same horizontal row of hexagons (perpendicularly upward from the horizontal bottom edge of the simplex), as indicated by the probability labels increasing upward along the left edge of the figure. Probability labels for the above-normal category, f A, increase downward along the right edge of the simplex, and probability labels for the below-normal category, f B, increase to the left along the bottom edge of the simplex. The result, for example, is that the large dot in the center of Fig. 1a locates the climatological forecast vector f clim 5 [1/3, 1/3, 1/3] T. Similarly, the forecasts in Fig. 1a having the largest above-normal probabilities are [2/9, 2/9, 5/9] T and [1/9, 3/9, 5/9] T, which are represented by the glyphs located farthest to the right in that diagram. Any two of the three forecast vector elements are sufficient to locate that vector s position on the simplex. Other choices for arranging the labels on the simplex edges are possible and equally valid, but those in Fig. 1 have been used in order for the outcomes from below to above normal to increase from left to right. Figure 1 illustrates the graphical representation of refinement distributions Prff i g as glyph scatterplots (e.g., Wilks 2011), where the circle areas are proportional to the subsample sizes. Empty hexagons represent forecast vectors that were never used in the verification datasets under consideration. The two refinement distributions represented in Fig. 1 will be defined quantitatively in section 3. Generalizing Eq. (3) for the calibration distributions, in the case of three-element vector forecasts each of the I calibration distributions is multinomial, again with N 5 1: Prfo j f i g 5 p o B B,i po A A,i (1 2 p B,i 2 p A,i )12o 2o B A, o B, o A 5 0, 1, and o B 1 o A # 1. (4) Thus, each of the I calibration distributions are fully determined by any two of the three empirical (within the verification dataset) conditional relative frequencies ^p B,i, ^p A,i, and ^p N,i ^p B,i 2 ^p A,i of the three events being forecast by f i, and therefore can be represented by a two-dimensional vector defined by, for example, ^p B,i and ^p A,i within the ith hexagon of the simplex. Figure 2 illustrates schematically how these conditional relative frequencies for the below- and above-normal

4 OCTOBER 2013 W I L K S 1213 FIG. 2. Plotting schematic showing locations of refinement distribution glyphs for calibrated forecasts (zero miscalibration errors, gray dashed arrows and point), and miscalibrated forecasts (nonzero calibration errors, black solid arrows and point). toward the A vertex of the simplex, consistent with the conditional outcome vector having a higher relative frequency than forecast for the above-normal category. It is necessary to choose a scale for these refinement distribution glyph displacements, which should be indicated in a figure s legend. Here, this scale has been chosen as 2 /3 of a probability unit corresponding to a long (vertex to vertex) diameter of the hexagons. With this choice, displacement of a plotting glyph of 1 /3 the distance from the hexagon center to one of the corners equates to a miscalibration of 1 /9, which corresponds to one discrete probability category when K Of course the scale in Fig. 2 could be increased to one or as much as two probability units in order to deemphasize random variations due to small subsample sizes, or to accommodate large conditional miscalibration errors. Similarly, the scale could be decreased in order to better discern subtle miscalibrations exhibited by a generally well-calibrated set of forecasts with large subsample sizes. It may be more convenient when programming plotting software to use the horizontal and vertical Cartesian displacements from the hexagon centers, which are proportional (depending on the chosen scaling) to x 52tan(308)e B 1 tan(308)e A e B e A, and y 52e B 2 e A, respectively. 3. Idealized forecast examples outcomes are represented within each hexagon, using the corresponding miscalibration errors, or differences between the conditional average observations ^p B,i and ^p A,i, and the respective forecast vector elements: " eb,i e A,i # ^p B,i 2 f B,i 5. (5) ^p A,i 2 f A,i These vector conditional miscalibration errors are plotted within their respective hexagons, using barycentric coordinates similar to those in the overall simplex, but with the origins at the centers of the hexagons. For a well-calibrated forecast subsample f i, both elements of Eq. (5) will be zero, and the dot representing the corresponding subsample size will be plotted at the center of the ith hexagon, as illustrated by the gray dot in Fig. 2, and by all subsamples in both panels of Fig. 1. In contrast, the solid dot in Fig. 2 shows the plotting position when the above-normal category has been conditionally underforecast by 0.2 probability units, together with overforecasting of the below- and near-normal categories by 0.1 probability units each. In this case the location for the glyph representing the corresponding element of the refinement distribution will be displaced Characteristic patterns of simplex glyph sizes and placements are diagnostic for different aspects of overall forecast performance, analogously to the situation for the reliability diagram (Wilks 2011, p. 335). These are illustrated in this section using simple idealized statistical models for the calibration and refinement distributions, designed to demonstrate results for a range of cases with known expected behaviors. a. Sharpness: Refinement distributions Sharpness is a characteristic of the refinement distribution, independently of the relationship between the forecasts and observations. Forecast sharpness is commonly characterized using a measure of the dispersion of the refinement distribution, such as the standard deviation or variance (e.g., Wilks 2001). Forecasts that deviate frequently, including relatively large differences from, the climatological forecast are referred to as sharp. Forecasts exhibiting poor sharpness deviate rarely and relatively little from the climatological forecast. Although in the present setting the refinement distribution is trivariate, it is sufficient to define it in terms of the joint distribution of two of the three forecast probabilities because they must all sum to 1. Here, joint distributions of the log-odds transformations of f B and f A,

5 1214 W E A T H E R A N D F O R E C A S T I N G VOLUME 28 (f ) 5 ln f 1 2 f, (6) will be modeled using bivariate normal distributions, in order to ensure 0, f B, 1 and 0, f A, 1. This is an arbitrary but convenient assumption that allows an easy exposition of the characteristic diagnostics for the diagram. The climatological forecast of f B 5 f A 5 1/3 maps to E[ (f B )] 5 E[ (f A )] for the log-odds bivariate normal distributions. In addition, a strong negative correlation of yields vanishingly small probabilities for the impossible f B 1 f A. 1, as well as relatively small deviations of the near-normal forecast f N from its mean value of 1 /3. Within this framework, relatively sharp forecasts are generated using var[ (f B )] 5 var[ (f A )] 5 1, and forecasts with low sharpness are generated using var[ (f B )] 5 var[ (f A )] Some 10 6 independent bivariate normal realizations were generated to construct each of the two refinement distributions. Figure 1 shows these two refinement distributions, plotted as glyph histogram scatterplots on the simplex. In both cases the resulting forecasts are shown as being perfectly calibrated, so all of the circles representing subsample relative frequencies are plotted at the centers of their hexagons. For the low-sharpness forecasts in Fig. 1a, it is clear that the most common forecast is f clim 5 ( 1 /3, 1 /3, 1 /3) T, which accounts for 68.3% of the 10 6 realizations. Furthermore, the individual forecast elements deviate no more than 2 /9 from f clim, and only very rarely are those deviations larger than 1 /9. In contrast, the sharper forecasts in Fig. 1b often deviate quite strongly from f clim (which accounts for only 10.7% of the 10 6 realizations), and take on values throughout the unit interval for the extreme-category probabilities f B and f A. b. Conditional and unconditional biases: Calibration distributions Figure 3 shows calibration simplexes illustrating overconfident forecasts (Fig. 3a), underconfident forecasts (Fig. 3a), and forecasts exhibiting an unconditional overforecasting bias (Fig. 3c)for the above-normal category. Each of these cases is illustrated using the highersharpness refinement distribution shown in Fig. 1b. For the overconfident forecasts shown in Fig. 3a, the parameters of the multinomial calibration distributions [Eq. (4)] are based on those used to define overconfident idealized reliability diagrams in Wilks (2001): ^p B,i (f B,i ) 5 (f B,i 1 1/3)/2 ^p N,i (f N,i ) 5 2/3 2 (f B,i 1 f A,i )/2 ^p A,i (f A,i ) 5 (f A,i 1 1/3)/2: (7) FIG. 3. Calibration simplexes illustrating (a) overconfident [Eq. (7)], (b) underconfident [Eq. (8)], and (c) unconditionally biased [Eq. (9)] forecasts. In each case the refinement distribution is the same as in Fig. 1b.

6 OCTOBER 2013 W I L K S 1215 These functional relationships describe conditionally biased forecasts, which exhibit overforecasting for forecast probabilities above the climatological 1 /3, andunderforecasting for forecast probabilities below 1 /3. These were chosen in Wilks (2001) to correspond to the no skill line (Murphy and Hsu 1986) on the attributes diagram. If such forecasts had been generated by a dynamical ensemble forecasting system, the overconfidence would be diagnostic for underdispersion (Wilks 2011, p. 372). Figure 3a shows that the signature for overconfidence in the calibration simplex is a displacement of the relative frequency glyphs toward the center of the diagram, analogously to the calibration function in conventional reliability or attributes diagrams being tilted from the ideal 458 diagonal toward the horizontal, climatological, no resolution (Murphy and Hsu 1986) line. It can be seen in Fig. 3a that the degree of miscalibration increases as the forecast probabilities become more extreme, so that the glyphs for the most extreme probabilities in the two lower corners of the diagram (f B 5 9/9 and f A 5 9/9) have been plotted at the hexagon vertices nearest the center of the diagram, indicating conditional miscalibrations of 1 /3 of a probability unit. Figure 3b shows the calibration simplex for conditionally biased forecasts that are underconfident, in the sense that probabilities smaller than 1 /3 are overforecast and probabilities larger than 1 /3 are underforecast, according to ^p B,i (f B,i ) 5 2f B,i 2 1/3, ^p N,i (f N,i ) 5 2(1 2 f B,i 2 f A,i ) 2 1/3, and ^p A,i (f A,i ) 5 2f A,i 2 1/3: (8) Here, conditional average observations more extreme than the unit interval are truncated at 0 or 1, and the values for the three outcomes are renormalized to sum to 1 if necessary. The relationships in Eq. (8) also generalize the underconfident calibration function used in Wilks (2001), and would be diagnostic for overdispersion of a dynamical ensemble. Figure 3b shows that the signature for underconfidence in the calibration simplex is the displacement of the relative frequency glyphs away from the climatological forecast at the center of the diagram, and toward the corners of the simplex. Again, this is analogous to the signature for underconfidence in the reliability diagram, in which the calibration function is tilted at an angle steeper than 458, and away from the climatological horizontal no-resolution line. Finally, Fig. 3c shows the calibration simplex for unconditionally biased forecasts, exhibiting uniform overforecasting of the above-normal category equally at the expense of both the below- and near-normal categories: ^p B,i (f B,i ) 5 f B,i 1 1/9, ^p N,i (f N,i ) 5 f N,i 1 1/9, and ^p A,i (f A,i ) 5 f A,i 2 2/9. (9) Again, results outside the unit interval are truncated and renormalized if necessary. In Fig. 3c the relative frequency glyphs are displaced upward and to the left, away from the A corner of the simplex. Equivalently, the below- and near-normal categories have been uniformly and equally underforecast, at the expense of the abovenormal category, and the result is that the glyphs are displaced toward the simplex edge connecting the B and N corners. 4. Real forecast examples This section illustrates the use of the calibration simplex to understand the performance of the Climate Prediction Center (CPC) extended range forecasts for average temperature and accumulated precipitation, at lead times of 6 10 and 8 14 days. These are subjective probability forecasts generated on weekdays, during for the 6 10-day forecasts, and for the 8 14-day forecasts, and interpolated to station locations from the graphical map products posted operationally online ( The text-format versions and general information on these forecasts are also available online ( products/archives/short_range/ and noaa.gov/products/archives/short_range/readme.6-10day.txt, respectively). Figure 4 shows the calibration simplexes for the temperature forecasts, which include n six-toten-day forecasts (Fig. 4a) and n eight-tofourteen-day forecasts (Fig. 4b). Glyphs are plotted only for forecasts having subsample sizes of 20 or more, and verifying categories have been determined relative to the normals for forecasts made through April 2011, and using the normals thereafter. At each of the lead times, the overwhelming majority of forecasts include the climatological or near-climatological near-normal forecast f N 5 1 /3, consistent with the conventional expectation that forecasts of the near-normal category will exhibit intrinsically weak skill (Van den Dool and Toth 1991). Thus, the sharpness for the nearnormal probabilities is quite low. The most frequent forecast for each of the lead times is the climatological

7 1216 W E A T H E R A N D F O R E C A S T I N G VOLUME 28 FIG. 4. Calibration simplexes for (a) and (b) 8 14-day temperature forecasts. probability vector f clim, which was issued for 34.9% of the 6 10-day forecasts and 35.4% of the 8 14-day forecasts. These percentages correspond to n clim and for the climatological forecast f clim (largest dots in the middles of the plots) in Figs. 4a and 4b, respectively (cf. glyph sizes in the legend). The corresponding error vectors [e B, e N, e A ] T are [20.085, 0.056, 0.029] T (Fig. 4a) and [20.88, 0.036, 0.052] T (Fig. 4b). Accordingly, both of these glyphs have been displaced away from the B vertices, indicating too few belownormal verifications when the climatological temperature forecast was issued. The more extreme probability vectors were used correspondingly less frequently at the 8 14-day lead time, but overall the 6 10-day forecasts are only slightly sharper. For both lead times, the temperature forecasts are only moderately well calibrated, with typical miscalibration errors in the range of 1/9 2/9. The abovenormal outcome is underforecast for the larger ($4/9) above-normal and near-normal probabilities, as these subsample-size glyphs are displaced toward the A corner of the simplex. The glyph representing the very large subsample of f clim forecasts is displaced toward the edge connecting the N and A vertices, indicating a smaller fraction of below-normal outcomes than the climatologically expected 1 /3. Both of these results are consistent with the baseline 30-yr normals lagging the quasi-linear warming trend that has been evident in U.S. temperature data since the mid-1970s (e.g., Livezey et al. 2007; Wilks 2013), together with the mean forecasts not fully tracking the warming, as has also been observed for seasonal tercile forecasts (Wilks 2000; Wilks and Godfrey 2002; Barnston et al. 2010). On the other hand, displacement of glyphs away from the simplex center for below-normal probabilities of 4/9 and larger indicates an overall underconfidence (consistent with the pattern of glyph dispersion away from the center of Fig. 3b), suggesting that these forecasts could be somewhat sharper without degrading their skill. Calibration simplexes for the precipitation forecasts are shown in Fig. 5, which include n six-to-tenday forecasts (Fig. 5a) and n eight-to-fourteenday forecasts (Fig. 5b). The precipitation forecasts exhibit notably less sharpness than their temperature counterparts in Fig. 4, with 44.9% of the 6 10-day forecasts and 44.7% of the 8 14-day forecasts using the climatological probabilities f clim. As was the case for the temperature forecasts, f N 5 1 /3 is overwhelmingly the most common near-normal forecast. The most extreme probabilities have been used somewhat less frequently in the 8 14-day precipitation forecasts, but overall these are only slightly less sharp that the 6 10-day forecasts. At both lead times there is a clear tendency for the subsample-size glyphs to be displaced toward the B vertex, indicating underforecasting of the below-normal category, and corresponding overforecasting in roughly equal proportions of the near-normal and above-normal outcomes. Figure 5 also indicates strong overconfidence in the larger ($4/9) probabilities for the near-normal category, for both lead times. 5. Summary and conclusions Gaining a full appreciation of the performance of a set of forecasts requires investigation of the joint distribution of the forecasts and corresponding observations (Murphy and Winkler 1987), and a graphical approach to this exposition will often be the most immediately

8 OCTOBER 2013 W I L K S 1217 FIG. 5. As in Fig. 4, but for precipitation forecasts. informative. The well-known reliability diagram for probability forecasts of dichotomous events is the most commonly encountered such graphical device. This paper has defined and illustrated a natural extension of the reliability diagram to probability forecasts for three disjoint events, called the calibration simplex. It displays the refinement distribution for such forecasts using a glyph scatterplot histogram of the K(K 1 1)/2 possible vector forecasts that result when each probability element has been rounded to one of K discrete values. Simultaneously, it shows the two empirical outcome relative frequencies conditional on each forecast vector that are necessary to fully characterize the corresponding refinement distributions, using displacements of the glyphs from central plotting locations. Two somewhat similar approaches to this same problem have been proposed previously, although apparently neither has been used subsequently. Epstein and Murphy (1965) suggested that the information in the calibration distributions (although not using this terminology) could be sketched by printing numerical values of the scalar lengths of the calibration error vectors [Eq. (5)] inside each of a tessellation of equilateral triangles within the 2-simpex. However, this approach neither expressed the vector nature of the miscalibration errors, nor did it communicate the refinement distribution in any way. Murphy and Daan (1985, p. 418) proposed to draw line segments on the 2-simplex connecting points locating each distinct forecast vector f i with its corresponding average outcome vector ^p i, and indicating the corresponding subsample sizes with numerals printed at the midpoints of these line segments. This approach does indeed express the full information content of the joint distribution, but overall the diagram is somewhat difficult to read. Recently, this latter idea has been independently rediscovered by Jupp et al. (2012), who call it the ternary reliability diagram. They also propose a color scheme for representing the threeelement probability vectors. In principle, the calibration simplex could be extended to settings having more than three predictand categories, although this would be difficult or impossible to realize in practice because of graphical and cognitive limitations with respect to visually rendering the highdimensional geometries. For example the 3-simplex, which would be appropriate for probability forecasts of four distinct events, is a three-dimensional regular tetrahedron, within the volume of which the spherical glyphs corresponding to possible four-element forecast vectors would reside. It is difficult to imagine how such a geometrical object could be presented graphically in an understandable way. Use of the calibration simplex has been illustrated using the CPC and 8 14-day subjective temperature and precipitation forecasts. These temperature forecasts exhibit an unconditional bias consistent with the average forecasts lagging the ongoing climate warming, as has been observed also for seasonal forecasts, but these graphs also indicate that greater sharpness for the below- and above-normal elements of the forecast vectors could be employed without degrading the overall accuracy or skill. The near-normal element of the precipitation forecast vectors were seen to be strongly overconfident, with overall underforecasting of the below-normal category, and corresponding overforecasting in roughly equal proportions of the near-normal and above-normal outcomes.

9 1218 W E A T H E R A N D F O R E C A S T I N G VOLUME 28 Acknowledgments. I thank Scott Handel for supplying the CPC forecasts and corresponding verifications. This research was supported by the National Science Foundation under Grant AGS REFERENCES Barnston, A. G., S. Li, S. J. Mason, D. G. DeWitt, L. Goddard, and X. Gong, 2010: Verification of the first 11 years of IRI s seasonal climate forecasts. J. Appl. Climatol. Meteor., 49, Epstein, E. S., and A. H. Murphy, 1965: A note on the attributes of probabilistic predictions and the probability score. J. Appl. Meteor., 4, Jolliffe, I. T., and D. B. Stephenson, 2012: Introduction. Forecast Verification: A Practitioner s Guide in Atmospheric Sciences, I. T. Jolliffe and D. B. Stephenson, Eds., Wiley-Blackwell, 1 9. Jupp, T. E., R. Lowe, C. A. S. Coelho, and D. B. Stephenson, 2012: On the visualization, verification and recalibration of ternary probabilistic forecasts. Philos. Trans. Roy. Soc. London, A370, Katz, R. W., and A. H. Murphy, Eds., 1997: Economic Value of Weather and Climate Forecasts. Cambridge University Press, 222 pp. Livezey, R. E., and M. M. Timofeyeva, 2008: The first decade of long-lead U.S. seasonal forecasts Insights from a skill analysis. Bull. Amer. Meteor. Soc., 89, , K. Y. Vinnikov, M. M. Timofeyeva, R. Tinker, and H. M. Van den Dool, 2007: Estimation and extrapolation of climate normals and climatic trends. J. Appl. Meteor. Climatol., 46, Murphy, A. H., 1972: Scalar and vector partitions of the probability score: Part II. N-state situation. J. Appl. Meteor., 11, , 1991: Forecast verification: Its complexity and dimensionality. Mon. Wea. Rev., 119, , 1997: Forecast verification. Economic Value of Weather and Climate Forecasts, R.W. Katz and A.H. Murphy, Eds., Cambridge University Press, , and R. L. Winkler, 1977: Reliability of subjective probability forecasts of precipitation and temperature. Appl. Stat., 26, , and H. Daan, 1985: Forecast evaluation. Probability, Statistics, and Decision Making in the Atmospheric Sciences, A. H. Murphy and R. W. Katz, Eds., Westview, , and W.-R. Hsu, 1986: The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. Int. J. Forecasting, 2, , and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, , and, 1992: Diagnostic verification of probability forecasts. Int. J. Forecasting, 7, , B. G. Brown, and Y.-S. Chen, 1989: Diagnostic verification of temperature forecasts. Wea. Forecasting, 4, Van den Dool, H., 2007: Empirical Methods in Short-Term Climate Prediction. Oxford University Press, 215 pp., and Z. Toth, 1991: Why do forecasts for near normal often fail? Wea. Forecasting, 6, Wilks, D. S., 2000: Diagnostic verification of the Climate Prediction Center long-lead outlooks, J. Climate, 13, , 2001: A skill score based on economic value for probability forecasts. Meteor. Appl., 8, , 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed. Academic Press, 676 pp., 2013: Projecting normals in a nonstationary climate. J. Appl. Meteor. Climatol., 52, , and C. M. Godfrey, 2002: Diagnostic verification of the IRI net assessment forecasts, J. Climate, 15,

Ensemble Verification Metrics

Ensemble Verification Metrics Debbie Hudson (Bureau of Meteorology, Australia) ECMWF Annual Seminar 207 Acknowledgements: Beth Ebert Overview. Introduction 2. Attributes of forecast quality 3. Metrics: