Comparing Probabilistic Forecasting Systems with the Brier Score


WEATHER AND FORECASTING, VOLUME 22, 1076

CHRISTOPHER A. T. FERRO
School of Engineering, Computing and Mathematics, University of Exeter, Exeter, United Kingdom

(Manuscript received 13 April 2006, in final form 16 January 2007)

ABSTRACT
This article considers the Brier score for verifying ensemble-based probabilistic forecasts of binary events. New estimators for the effect of ensemble size on the expected Brier score, and associated confidence intervals, are proposed. An example with precipitation forecasts illustrates how these estimates support comparisons of the performances of competing forecasting systems with possibly different ensemble sizes.

1. Introduction
Verification scores are commonly used as numerical summaries for the quality of weather forecasts. General introductions to forecast verification are given by Jolliffe and Stephenson (2003) and Wilks (2006, chapter 7). There are many situations in which we may wish to compare the values of a verification score computed for two sets of forecasts: to compare the quality of forecasts from a single forecasting system at different times or locations, or in different meteorological conditions, or to compare the quality of forecasts from two forecasting systems.

Several authors have recommended that measures of uncertainty for the scores, such as standard errors or confidence intervals, should be computed to aid such comparisons. Woodcock (1976), Seaman et al. (1996), Kane and Brown (2000), Stephenson (2000), Thornes and Stephenson (2001), and Wilks (2006, section 7.9) propose confidence intervals for scores of deterministic binary-event forecasts. Bradley et al. (2003) use simulation to compare the biases and standard errors of different estimators for several scores of probabilistic binary-event forecasts, but do not discuss estimators for the standard errors.
Hamill (1999) takes a different approach and proposes hypothesis tests for comparing the scores of two sets of deterministic or probabilistic forecasts; see also Briggs and Ruppert (2005). Jolliffe (2007) reviews this work and related ideas, and also presents confidence intervals for the correlation coefficient.

Woodcock (1976) explains the motivation for these attempts to quantify uncertainty. The value of a score depends on the choice of target observations, so the superiority of one forecasting system over another, as gauged by their verification scores computed for only finite samples of forecasts and observations, cannot be definitive: the superiority may be reduced or even reversed were the systems applied to new target observations. The methods listed previously estimate the variation that would arise in the value of a verification score were forecasts made for different sets of potential target observations, and thereby quantify the uncertainty about some true value that would be known were forecasts available for all potential observations.

We shall consider uncertainty in expected values of the Brier verification score (Brier 1950) in the case of ensemble-based probabilistic binary-event forecasts. Unbiased estimators for the expected Brier score that would be obtained for any ensemble size are defined in section 2. Standard errors and confidence intervals are developed in section 3, and their performance is assessed with a simulation study in section 4. Methods for comparing the Brier scores of two forecasting systems are presented in section 5. Confidence intervals appropriate for comparing Brier scores of two systems simultaneously for multiple events and sites are described in section 6.

Corresponding author address: C. Ferro, School of Engineering, Computing and Mathematics, University of Exeter, Harrison Bldg., North Park Rd., Exeter EX4 4QF, United Kingdom. E-mail: c.a.t.ferro@exeter.ac.uk
DOI: /WAF American Meteorological Society

The methods are illustrated throughout the paper with seasonal precipitation forecasts from the Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER) project

(Palmer et al. 2004). In particular, 3-month-ahead, nine-member ensemble forecasts of total October precipitation produced by the European Centre for Medium-Range Weather Forecasts (ECMWF) and Météo-France (MF) global atmosphere-ocean coupled models are compared with observations recorded at Jakarta (6.17°S, E) for the years 1958 to 1995. Data are missing for 1983, leaving 37 yr. The ensembles were generated by sampling independent perturbations of the initial ocean state. The Jakarta observations and the forecasts from the corresponding grid box are shown in Fig. 1.

FIG. 1. Observed (vertical lines) October rainfall (mm) in Jakarta from 1958 to 1995 plotted between both the ECMWF (filled circles) and MF (open circles) nine-member ensemble forecasts.

OCTOBER 2007, FERRO, 1077

2. Expected Brier scores
a. Definitions
We define the Brier score together with notation that will be used throughout the rest of the paper. Let $\{X_t : t = 1, \ldots, n\}$ be a set of $n$ observations, and let $\{(Y_{t,1}, \ldots, Y_{t,m}) : t = 1, \ldots, n\}$ be a corresponding set of $m$-member ensemble forecasts. For each time $t$, suppose that we issue a probabilistic forecast, $\hat{Q}_t$, for the event that observation $X_t$ exceeds a threshold $u$. The Brier score for this set of forecasts, equal to one-half of the score originally proposed by Brier (1950), is the mean squared difference between the forecasts and the indicator variables for the event; that is,

$$\hat{B}_{m,n} = \frac{1}{n} \sum_{t=1}^{n} (\hat{Q}_t - I_t)^2,$$

where $I_t = I(X_t > u)$, $I(A) = 1$ if $A$ is true, and $I(A) = 0$ if $A$ is false. All summations will be over $t = 1, \ldots, n$ unless otherwise specified.

We mentioned in the previous section that the variation in verification scores caused by the choice of target observations leads to uncertainty about the quality of the forecasting system. One quantity of interest that we may be uncertain about is the expected Brier score, denoted

$$B_m = E(\hat{B}_{m,n}), \qquad (1)$$

and defined as the average Brier score over repeated samples from a population of observations and forecasts.
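The Brier score defined above is simple to compute once forecast probabilities are available. The following is a minimal sketch, not the author's R code: the function name `brier_score` and the toy numbers are hypothetical, and the forecasts are passed in as ready-made probabilities.

```python
def brier_score(forecasts, observations, u):
    """Mean squared difference between forecast probabilities and event indicators.

    forecasts    -- forecast probabilities for the event {X_t > u}
    observations -- observed values X_t
    u            -- event threshold
    """
    if len(forecasts) != len(observations) or not forecasts:
        raise ValueError("need equally many forecasts and observations, at least one")
    total = 0.0
    for q_hat, x in zip(forecasts, observations):
        indicator = 1.0 if x > u else 0.0  # I_t = I(X_t > u)
        total += (q_hat - indicator) ** 2
    return total / len(observations)


# Hypothetical toy data (not from the paper): four forecasts and four observations.
toy_forecasts = [1 / 3, 0.0, 1.0, 1 / 3]
toy_observations = [1.2, 0.3, 2.5, 0.1]
toy_score = brier_score(toy_forecasts, toy_observations, u=1.0)
```

With these toy values the squared errors are 4/9, 0, 0 and 1/9, so the score is 5/36.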
This population can be defined implicitly by assuming that the available sample of observations and forecasts is in some sense representative of the larger population. We assume that the population is a stationary sequence of which our data form a partial realization. This is likely to be a good approximation in a stable climate and could be generalized for a changing climate by assuming, for example, that the data are a partial realization of a nonstationary, multivariate time series model chosen to represent climatic change. We shall concentrate on $B_m$, but other quantities, such as the conditional expected Brier score discussed in appendix A, may also be of interest.

b. The effects of ensemble size
We investigate how $B_m$ depends on the ensemble size $m$. By stationarity, the expectation of $(\hat{Q}_t - I_t)^2$ is the same for all $t$, so we can write

$$B_m = E[(\hat{Q} - I)^2],$$

where $\hat{Q}$ and $I$ are the forecast and the event indicator for an arbitrary time. This expectation is an average over all possible values of the observation variable $X$ and the ensemble $Y = (Y_1, \ldots, Y_m)$. Now let $Z$ denote sufficient information about the forecasting model to determine a probability distribution for $Y$, given which $Y$ is independent of $X$. This information might be the model specification plus a probability distribution for its initial conditions, for example. The law of total expectation (e.g., Severini 2005, p. 55) says that we can obtain $B_m$ in two stages: first by taking the expectation with respect to $X$ and $Y$ when $Z$ is held fixed, and then averaging over $Z$. This is written

$$B_m = E\{E[(\hat{Q} - I)^2 \mid Z]\}.$$

The interior, conditional expectation is

$$E[(\hat{Q} - I)^2 \mid Z] = E(\hat{Q}^2 \mid Z) - 2E(\hat{Q} I \mid Z) + E(I \mid Z) = E(\hat{Q}^2 \mid Z) - 2E(\hat{Q} \mid Z)P + P, \qquad (2)$$

where $P = E(I \mid Z) = \Pr(X > u \mid Z)$ is the probability with which the event occurs; the second equality uses the independence of $Y$ and $X$ given $Z$.

We must specify how $\hat{Q}$ is formed from the ensemble members in order to reveal the effects of ensemble size. We choose forecasts equal to the proportion of members that exceed a threshold $v$, possibly different from $u$; that is,

$$\hat{Q} = \frac{K}{m} = \frac{1}{m} \sum_{i=1}^{m} I(Y_i > v), \qquad (3)$$

where $K$ is the number of members exceeding $v$. Alternative forecasts could be considered, for example, $(K + a)/(m + b)$ with $b \ge a \ge 0$, although these lead to more complicated formulas later.

For simplicity, we assume that the members within any particular ensemble are exchangeable. Exchangeability means that the members are indistinguishable by their statistical properties: their joint distribution function is invariant to relabeling the members. This admits homogeneous dependence between members and includes the special case of independent and identically distributed members. Exchangeability implies that all members of an ensemble exceed $v$ with the same probability,

$$\Pr(Y_i > v \mid Z) = Q \quad \text{for all } i,$$

and all pairs of members jointly exceed $v$ with the same probability,

$$\Pr(Y_i > v, Y_j > v \mid Z) = R \quad \text{for all } i \ne j.$$

Taken together with the forecast definition [(3)], we have

$$E(\hat{Q} \mid Z) = \frac{1}{m} \sum_{i=1}^{m} \Pr(Y_i > v \mid Z) = Q, \qquad (4)$$

$$E(\hat{Q}^2 \mid Z) = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \Pr(Y_i > v, Y_j > v \mid Z) = \frac{Q}{m} + \frac{(m - 1)R}{m}, \qquad (5)$$

and the conditional expectation [(2)] equals

$$R - 2PQ + P + \frac{1}{m}(Q - R).$$

Finally, we take the expectation with respect to $Z$ to obtain

$$B_m = E(R) - 2E(PQ) + E(P) + \frac{1}{m} E(Q - R). \qquad (6)$$

Because $P$, $Q$, and $R$ are independent of $m$, this expression describes completely the effects of ensemble size. Moreover, the final term is non-negative because $R \le Q$. As the ensemble size increases, $B_m$ therefore decreases monotonically to the expected Brier score, $B_\infty$, that would be obtained for an infinite ensemble size, and we can write

$$B_M = B_\infty + \frac{1}{M} E(Q - R), \qquad (7)$$

where $B_M$ is the expected Brier score that would be obtained for an ensemble of size $M$. This generalizes the relationship found by Richardson [2001, Eq.
(9)] for independent ensemble members, in which case $R = Q^2$.

c. Unbiased estimators
The Brier score $\hat{B}_{m,n}$ is an unbiased estimator for $B_m$ by definition [(1)] but is biased for $B_M$ when $M \ne m$. Estimating $B_M$ from ensembles of size $m$ is useful for comparing forecasting systems with different ensemble sizes or for assessing the potential benefit of larger ensembles (cf. Buizza and Palmer 1998). Equations (4) and (5) can be used to show that an unbiased estimator for $B_M$ is

$$\hat{B}_{M,n} = \hat{B}_{m,n} - \frac{M - m}{M(m - 1)n} \sum_t \hat{Q}_t (1 - \hat{Q}_t), \qquad (8)$$

and letting $M \to \infty$ yields an unbiased estimator for $B_\infty$.

1) REMARK 1
The new estimator [(8)] is undefined if $m = 1$, in which case an unbiased estimator for $B_M$ ($M > 1$) does not exist because the forecasts contain no information about the effects of ensemble size. Mathematically, for any function $h(K, I)$ independent of $R$,

$$E[h(K, I) \mid Z] = h(0, I)(1 - Q) + h(1, I)Q$$

cannot contain the required $R$ term. Richardson (2001) does, however, develop a method for estimating a skill score based on $B_M$ given an ensemble of any size, even $m = 1$. He achieves this by assuming independent ensemble members ($R = Q^2$) and perfect reliability ($Q = P$), in which case the expression [(6)] for $B_m$ becomes

$$B_m = \frac{m + 1}{m} E[Q(1 - Q)],$$

and so

$$B_M = \frac{m(M + 1)}{M(m + 1)} B_m. \qquad (9)$$

In this special case, an unbiased estimator can therefore be obtained for $B_M$ even when $m = 1$ by substituting $\hat{B}_{m,n}$ for $B_m$ in the right-hand side of (9).

2) REMARK 2
The adjustment term in the definition [(8)] of $\hat{B}_{M,n}$ depends on only the forecasts and is a measure of sharpness (e.g., Potts 2003). Let

$$S = \frac{1}{n} \sum_t \left(\hat{Q}_t - \tfrac{1}{2}\right)^2$$

be the sample variance of the forecasts around one-half: as $S$ decreases from its maximum value (1/4) to its minimum value (0), forecasts become more concentrated around one-half and the sharpness decreases. Now,

$$\hat{B}_{M,n} = \hat{B}_{m,n} - \frac{M - m}{M(m - 1)} \left(\tfrac{1}{4} - S\right),$$

so $\hat{B}_{M,n}$ reduces the Brier score by amounts that depend on the estimated sharpness and the ensemble size, $m$. For fixed sharpness, the improvement in forecast quality from increasing the ensemble size decreases as $m$ increases: the law of diminishing returns. For fixed $m$, the improvement decreases as the sharpness increases, suggesting that the improvement may be attributed to the opportunity to shift forecasts slightly farther away from one-half.

3) REMARK 3
The Brier score $\hat{B}_{m,n}$ is proper (e.g., Wilks 2006, p. 298) because, if our belief in the occurrence of the event $\{I_t = 1\}$ equals $p \in [0, 1]$, then the expected contribution to the Brier score with respect to this belief from issuing forecast $\hat{Q}_t$, that is,

$$E[(\hat{Q}_t - I_t)^2] = \hat{Q}_t^2 (1 - p) + (\hat{Q}_t - 1)^2 p,$$

is minimized by choosing $\hat{Q}_t = p$. Similar calculations show that $\hat{B}_{M,n}$ is improper when $M \ne m$ because the optimum forecast is then

$$p + \frac{M - m}{2m(M - 1)} (1 - 2p).$$

Therefore, $\hat{B}_{M,n}$ should not be used in situations where it could be hedged.

d. Exchangeability
We assumed that ensemble members were exchangeable and independent of observations given suitable information, $Z$. The latter assumption is hard to contest because $Z$ can include the full specification of the forecasting model and its inputs. Exchangeability is more restrictive and would be violated were one member biased relative to the other members, for example, or were one pair of members more strongly correlated than other pairs.
Exchangeability might be justified by the process generating the ensemble. For example, exchangeability will hold if the initial conditions for the members are randomly sampled from a probability distribution. Exchangeability is also likely to hold if the forecast lead time is long enough for any initial ordering or dependence between the members to be lost. This latter argument seems appropriate for our 3-month-ahead rainfall forecasts.

Exchangeability might also be justified by empirical assessment. Romano (1988) describes a bootstrap test for exchangeability based on the maximum distance between the empirical distribution functions of the members and permuted members. Applying this test for the ECMWF and MF ensemble forecasts of Jakarta rainfall gave p values of 0.26 and 0.24, which is only weak evidence for rejecting exchangeability.

The effect of ensemble size on the expected Brier score can still be estimated even when exchangeability is unjustifiable. If we wish to estimate $B_M$ for a subset of $M < m$ members, then an unbiased estimator is simply the Brier score evaluated for the forecasts constructed from those $M$ members. This approach is straightforward to implement for any verification score, but is inapplicable when $M > m$.

e. Data example
We estimate the expected Brier scores $B_m$ and $B_\infty$ for the ECMWF and MF rainfall forecasts at a range of event thresholds $u$ and $v$. These are shown in Fig. 2, where we set $v = u$ and let $u$ range from the 10% to the 90% quantiles of the observed rainfall. The MF forecasts appear to have significantly lower Brier scores than do those of the ECMWF for thresholds below 90 mm (about the median observed rainfall), and the two systems have similar Brier scores at higher thresholds. The estimated difference between $B_m$ for MF and $B_\infty$ for ECMWF is also large below 90 mm, suggesting that increasing the ECMWF ensemble size would not be sufficient to match the MF Brier score.
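The ensemble-size adjustment of section 2c can be sketched in a few lines, together with an exact check of its unbiasedness in one special case. This is an illustrative sketch only, not the author's R code. The names `size_adjusted_brier`, `expected_brier_iid`, and `exact_mean_of_estimator` are hypothetical, `M=None` is used here to stand for an infinite ensemble, and the check assumes independent members (so $R = Q^2$) with fixed $P$ and $Q$.

```python
from math import comb


def size_adjusted_brier(counts, events, m, M=None):
    """Estimator (8): Brier score from m-member ensembles adjusted to size M.

    counts -- K_t, the number of members exceeding the threshold at each time
    events -- I_t, the 0/1 event indicators
    M      -- target ensemble size; None stands for an infinite ensemble
    """
    if m < 2:
        raise ValueError("estimator (8) is undefined for m < 2")
    n = len(events)
    q = [k / m for k in counts]  # proportion forecasts, Eq. (3)
    brier = sum((qt - it) ** 2 for qt, it in zip(q, events)) / n
    spread = sum(qt * (1 - qt) for qt in q) / n
    factor = 1 / (m - 1) if M is None else (M - m) / (M * (m - 1))
    return brier - factor * spread


def expected_brier_iid(m, p, q):
    """Expression (6) for fixed P = p, Q = q and independent members (R = q**2)."""
    r = q * q
    base = r - 2 * p * q + p  # expected score for an infinite ensemble
    return base if m is None else base + (q - r) / m


def exact_mean_of_estimator(m, M, p, q):
    """Exact expectation of one summand of (8): K ~ Binomial(m, q), I ~ Bernoulli(p)."""
    total = 0.0
    for k in range(m + 1):
        prob_k = comb(m, k) * q ** k * (1 - q) ** (m - k)
        for event, prob_event in ((0, 1 - p), (1, p)):
            total += prob_k * prob_event * size_adjusted_brier([k], [event], m, M)
    return total
```

For example, `exact_mean_of_estimator(9, 50, 0.3, 0.4)` agrees with `expected_brier_iid(50, 0.3, 0.4)` up to rounding, which is the unbiasedness property in this special case. Setting `M` equal to `m` returns the ordinary Brier score.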

FIG. 2. Estimates $\hat{B}_{m,n}$ (upper thick) and $\hat{B}_{\infty,n}$ (lower thick) for the ECMWF (solid) and MF (dashed) forecasts of October rainfall at Jakarta exceeding different thresholds during the period 1958 to 1995. Thresholds are marked as quantiles (lower axis) and absolute values (mm, upper axis) of the observed rainfall. Standard errors (upper thin) and their conditional versions (lower thin; see appendix A) are shown for the ECMWF (solid) and MF (dashed) forecasts, and are indistinguishable for $\hat{B}_{m,n}$ and $\hat{B}_{\infty,n}$.

3. Sampling variation
a. Standard errors
Point estimates of expected Brier scores were presented in the previous section. In this section, we estimate the uncertainty associated with these estimates due to sampling variation. In particular, we shall estimate standard errors and construct confidence intervals for the expected scores. We assume only that the data are stationary; we no longer need to assume exchangeability or a particular form [(3)] for the forecasts.

We shall consider only $\hat{B}_{M,n}$, the estimator [(8)] for $B_M$, because other estimators can be obtained as special cases by changing $M$. We can write $\hat{B}_{M,n}$ as the sample mean of the summands

$$W_t = (\hat{Q}_t - I_t)^2 - \frac{M - m}{M(m - 1)} \hat{Q}_t (1 - \hat{Q}_t).$$

If the interval between successive times $t$ is large, then we may be justified in making the further assumption that these summands are independent, in which case the standard error of $\hat{B}_{M,n}$ is estimated by

$$\hat{\sigma}_{M,n} = \sqrt{\frac{1}{n(n - 1)} \sum_t (W_t - \hat{B}_{M,n})^2}.$$

If the summands are dependent, then estimates of serial correlation may be incorporated into the standard error (e.g., Wilks 1997). There is little evidence for serial dependence in the summands of the Brier scores for our ECMWF and MF rainfall forecasts. For example, a two-sided test for the lag-one autocorrelation (e.g., Chatfield 2004, p. 56) was conducted for both the ECMWF and MF data at each of nine thresholds $u$ ranging from the 10% to the 90% quantiles of the observed rainfall, and only one p value was smaller than 0.1. We assume serial independence hereafter.

Standard errors for the estimates of $B_m$ and $B_\infty$ are shown in Fig. 2 and are large enough to call into question the statistical significance of the differences noted previously in the quality of the ECMWF and MF forecasts. These differences are assessed more formally in section 5.

b. Confidence intervals
More informative descriptions of uncertainty are afforded by confidence intervals, which we now construct. Unless the summands of $\hat{B}_{M,n}$ exhibit long-range dependence, we can expect a central limit theorem to hold and imply that $\hat{B}_{M,n}$ is approximately normally distributed when $n$ is large. An approximate $(1 - 2\alpha)$ confidence interval for $B_M$ would then be

$$\hat{B}_{M,n} \pm \hat{\sigma}_{M,n} z_\alpha,$$

where $z_\alpha$ is the $\alpha$ quantile of the standard normal distribution.

An alternative approximation to the distribution of $\hat{B}_{M,n}$ is available via the bootstrap method (e.g., Davison and Hinkley 1997). To obtain confidence intervals for $B_M$, the distribution of the studentized statistic

$$T_n = \frac{\hat{B}_{M,n} - B_M}{\hat{\sigma}_{M,n}}$$

is approximated by a bootstrap sample $\{T^*_{n,i} : i = 1, \ldots, r\}$. If summands are independent, then this sample can be formed by repeating the following steps for each $i = 1, \ldots, r$:
1) Resample $W^*_s$ uniformly and with replacement from $\{W_t : t = 1, \ldots, n\}$ for each $s = 1, \ldots, n$.
2) Set $\hat{B}^*_{M,n} = \sum_t W^*_t / n$ and $\hat{\sigma}^*_{M,n} = \sqrt{\sum_t (W^*_t - \hat{B}^*_{M,n})^2 / [n(n - 1)]}$.
3) Set $T^*_{n,i} = (\hat{B}^*_{M,n} - \hat{B}_{M,n}) / \hat{\sigma}^*_{M,n}$.

Block bootstrapping (e.g., Wilks 1997) can be employed if the summands are serially dependent. Bootstrap $(1 - 2\alpha)$ confidence intervals are then defined by the limits

$$\hat{B}_{M,n} - \hat{\sigma}_{M,n} T^*_{n(r+1-k)} \quad \text{and} \quad \hat{B}_{M,n} - \hat{\sigma}_{M,n} T^*_{n(k)},$$

where $k = \lfloor r\alpha \rfloor$ and $T^*_{n(1)} \le \cdots \le T^*_{n(r)}$ are the order statistics of the bootstrap sample. Neither the normal nor the bootstrap confidence limits are guaranteed to fall in the interval [0, 1], so they will always hereafter be truncated at the end points.

These confidence intervals can be used to test hypotheses of the form $B_M = b$, for some reference value $b$ that represents minimal forecast quality. If a two-sided $(1 - \alpha)$ confidence interval for $B_M$ does not contain $b$, then the hypothesis is rejected in favor of the two-sided alternative hypothesis $B_M \ne b$ at the $100\alpha\%$ level. One common reference value for $b$ is the Brier score, $q^2 + (1 - 2q) \sum_t I_t / n$, obtained if the same probability $q$ is forecast at every time $t$. Another is the expected Brier score, $(2m + 1)/(6m)$ or 1/3, obtained if the forecast at each time $t$ is selected independently from a uniform distribution on either the set $\{i/m : i = 0, \ldots, m\}$ or the interval [0, 1].

The dark gray bands in the top two panels of Fig. 3 are bootstrapped 90% confidence intervals (using $r = 5000$) for $B_m$ for the ECMWF and MF rainfall forecasts. The ECMWF forecasts are significantly worse than climatology (the constant forecast $q = \sum_t I_t / n$) at the 10% level for a few thresholds, but are significantly better than random forecasts except between the 30% and 70% quantiles. The MF forecasts are not significantly different from climatology at any threshold, but are significantly better than random forecasts except between the 45% and 65% quantiles.

FIG. 3. (top) Brier scores $\hat{B}_{m,n}$ (solid) for the ECMWF forecasts, with bootstrapped 90% confidence intervals for $B_m$ (dark gray) and $B_{m,n}$ (light gray; see appendix A) at each threshold. Expected Brier scores are also shown for random forecasts (dotted) and climatology (dashed). (middle) The same as in the top panel but for the MF forecasts. (bottom) The difference (solid) in $\hat{B}_{m,n}$ between the ECMWF and MF forecasts, with bootstrapped 90% confidence intervals for the differences between $B_m$ (dark gray) and $B_{m,n}$ (light gray).

4. Simulation study
a. Serial independence
We compare the performances of the proposed normal and bootstrap confidence intervals for $B_m$ with a simulation study. The performance of an equitailed $(1 - 2\alpha)$ confidence interval is commonly assessed by its achieved coverage and average length in repeated simulated datasets for which the true value of $B_m$ is known. Let $\hat{B}_i$ be the point estimate and let $L_i$ and $U_i$ be the lower and upper confidence limits computed from the $i$th of $N$ datasets. The achieved lower and upper coverages are the proportions of times that $B_m$ falls above and below the lower and upper limits; that is,

$$\frac{1}{N} \sum_{i=1}^{N} I(B_m > L_i) \quad \text{and} \quad \frac{1}{N} \sum_{i=1}^{N} I(B_m < U_i),$$

which should both equal $1 - \alpha$. The average length is the mean distance between the lower and upper limits; that is,

$$\frac{1}{N} \sum_{i=1}^{N} (U_i - L_i),$$

which should be as small as possible.

The performance of the confidence intervals depends on the ensemble size $m$, the sample size $n$, the thresholds $u$ and $v$, the target coverage defined by $\alpha$, and the joint distribution of the observations and forecasts. We examine the effects of all of these factors in this simulation study, although a complete investigation is impossible. Serially independent observations are simulated from a standard normal distribution. Ensemble members are also normally distributed, and each has a correlation $\rho$ with its contemporary observation but is otherwise independent of the other members. Forecasts are simple proportions [(3)] and we use thresholds $u = v$ equal to $p$ quantiles of the standard normal distribution. We consider the following values for the various factors: $m$ = 2, 4, 8; $n$ = 10, 20, 40; $p$ = 0.5, 0.7, 0.9; $\rho$ = 0, 0.4, 0.8; and a range of values of $\alpha$. Results for $p$ = 0.1 and 0.3 would be the same as for $p$ = 0.9 and 0.7, respectively, because the former could be obtained by redefining events as deficits below thresholds, which

does not alter the Brier score. We use $N$ datasets and $r = 1000$ bootstrap samples throughout.

We show results for $m = 8$ and $n = 40$ only; results are qualitatively similar for different values. Figure 4 shows the lower and upper coverage errors for the normal and bootstrap confidence intervals as $\alpha$ varies and with different values for $\rho$ and $p$. Figure 5 shows the corresponding lengths. The coverage errors of the lower limits are usually positive (the lower limits are too low and overcover) while the upper errors are often negative (the upper limits are too low and undercover). The errors are always smaller for the bootstrap limits than for the normal limits. The bootstrap achieves this by shrinking the lower limits and extending the upper limits compared to the normal limits (not shown) to capture asymmetry in the sampling distribution of the Brier score, producing wider intervals for the bootstrap, as revealed by Fig. 5. The interval lengths decrease as both $\alpha$ and $\rho$ increase.

FIG. 4. Monte Carlo estimates of normal (solid) and bootstrap (dashed) lower (left) and upper (right) coverage errors plotted against $\alpha$ when $m = 8$; $n = 40$; $\rho$ = 0 (thin), 0.4 (medium), and 0.8 (thick); and $p$ = (top) 0.9, (middle) 0.7, and (bottom) 0.5. Solid horizontal lines mark zero error.

b. Serial dependence
To investigate the sensitivity of the results to the presence of serial dependence, the simulations were repeated with observations generated from a first-order moving-average process with correlation 0.5 at lag one. The standard errors were adjusted for the lag-one correlation and the block bootstrap was employed with blocks of size two. Results (not shown) were qualitatively similar to those for serial independence, except that both positive and negative lower coverage errors were found. Errors were larger and intervals wider because of the smaller effective sample sizes.

c. Modified bootstrap intervals
The bootstrap coverage errors in Fig. 4 are typically less than $\alpha/2$.
Errors decrease as $n$ increases, so these intervals may be acceptable for many applications. Improvements are desirable, however, particularly for small sample sizes and rare events. Several modifications have been explored by the author, specifically basic and percentile bootstrap intervals and bootstrap calibration (Davison and Hinkley 1997, chapter 5) and a continuity correction (Hall 1987) to account for the discrete nature of the summands of the Brier score. None of these methods improved significantly on the studentized intervals presented above. A variance-stabilizing transformation proposed by DiCiccio et al. (2006) was also applied and found to reduce the coverage error in the lower limit, especially for rare events for which errors were approximately halved. The effect on the upper limits was small.

A large part of the coverage error in small samples arises from the fact that the Brier score can take only a small number of distinct values. One way to reduce these errors is to smooth the Brier score by adding a small amount of random noise (Lahiri 1993). Investigations unreported here show that this can indeed reduce coverage errors significantly at the expense of widening the confidence intervals. However, results depend strongly on the amount of smoothing employed and making general recommendations is difficult. An alternative solution could be to fit joint probability distributions to the observations and forecasts before determining the forecast probabilities (Bradley et al. 2003). This would allow the forecasts, and hence the Brier score, to take any values on the interval [0, 1], and so avoid discretization errors. Another advantage would be the avoidance of intervals with zero length, which occurs for both the normal and bootstrap intervals described above when all summands of the Brier score are equal.

FIG. 5. Monte Carlo estimates of normal (solid) and bootstrap (dashed) interval lengths plotted against $\alpha$ when $m = 8$; $n = 40$; $\rho$ = 0 (thin), 0.4 (medium), and 0.8 (thick); and $p$ = (top) 0.9, (middle) 0.7, and (bottom) 0.5.

5. Comparing Brier scores
Consider the task of comparing the Brier scores of two forecasting systems, the first with ensemble size $m$ and the second with ensemble size $m'$, verified against the same set of observations. Quantities pertaining to the second system will be distinguished with primes. We can compare the two systems by constructing a confidence interval for the difference, $B_M - B'_M$, between the Brier scores that would be expected were both ensemble sizes equal to $M$. Such a comparison may help to identify whether or not a perceived superiority of one system is due only to its larger ensemble size. If the $(1 - \alpha)$ confidence interval does not contain zero, then the null hypothesis of equal scores is rejected in favor of the two-sided alternative at the $100\alpha\%$ level.

We estimate the difference between the Brier scores using unbiased estimators [(8)], though the subsampling approach described at the end of section 2d could also be used. Normal confidence intervals are defined by

$$\hat{B}_{M,n} - \hat{B}'_{M,n} \pm \hat{\tau}_{M,n} z_\alpha,$$

where, if there is no serial dependence,

$$\hat{\tau}^2_{M,n} = \frac{1}{n(n - 1)} \sum_t \left[ W_t - W'_t - (\hat{B}_{M,n} - \hat{B}'_{M,n}) \right]^2.$$

As in section 3, this can be adjusted to account for serial dependence. Bootstrap intervals approximate the distribution of

$$D_n = \frac{(\hat{B}_{M,n} - \hat{B}'_{M,n}) - (B_M - B'_M)}{\hat{\tau}_{M,n}}$$

with a bootstrap sample $\{D^*_{n,i} : i = 1, \ldots, r\}$. If the summands are serially independent, then this sample can be formed by repeating the following steps for each $i = 1, \ldots, r$.
1) Resample pairs $(W^*_s, W'^*_s)$ uniformly and with replacement from $\{(W_t, W'_t) : t = 1, \ldots, n\}$ for each $s = 1, \ldots, n$.
2) Compute $\hat{B}^*_{M,n}$, $\hat{B}'^*_{M,n}$, and $\hat{\tau}^*_{M,n}$ for the resampled data.
3) Set $D^*_{n,i} = [(\hat{B}^*_{M,n} - \hat{B}'^*_{M,n}) - (\hat{B}_{M,n} - \hat{B}'_{M,n})] / \hat{\tau}^*_{M,n}$.

The first step preserves dependence between contemporary summands of the two scores. Block bootstrapping may again be employed if the summands are serially dependent, and confidence limits take the form

$$\hat{B}_{M,n} - \hat{B}'_{M,n} - \hat{\tau}_{M,n} D^*_{n(l)} \quad \text{and} \quad \hat{B}_{M,n} - \hat{B}'_{M,n} - \hat{\tau}_{M,n} D^*_{n(k)},$$

with $l = r + 1 - k$ and $k$ chosen as in section 3b.

Bootstrapped 90% confidence intervals for the difference between $B_m$ for the ECMWF and MF forecasts are illustrated by the dark gray bands in Fig. 3 (bottom panel). The scores are significantly different at the 10% level between only the 0.3- and 0.4-quantile thresholds (50 to 70 mm).


The statistical significance of the differences between Brier scores can also be quantified using hypothesis tests. The powers of four such tests are investigated in appendix B, where the permutation test is found to be an attractive alternative to the bootstrap test presented above. The permutation test yields similar results for our data, however, with p values less than 0.1 between only the 0.3- and 0.4-quantile thresholds.

6. Multiple comparisons
We have so far constructed confidence intervals separately for each threshold $u$. These intervals are designed to contain the quantity of interest, such as an expected score or the difference between two expected scores, with a certain probability at each individual threshold. We may wish, however, to construct confidence intervals such that the quantity of interest is contained simultaneously within the intervals at all thresholds with a certain probability. We describe how to construct such confidence intervals in this section.

Denote by $B(u)$ the quantity of interest at threshold $u$ and suppose that we want to consider a collection $S$ of thresholds. We aim to find confidence limits $L(u)$ and $U(u)$ for each $u \in S$ such that

$$\Pr\{L(u) \le B(u) \le U(u) \text{ for all } u \in S\} = 1 - 2\alpha. \qquad (10)$$

If we used the $(1 - 2\alpha)$ confidence limits proposed in previous sections for $L(u)$ and $U(u)$, then this probability would be less than $1 - 2\alpha$. For example, if scores were independent between thresholds, then the probability would be $(1 - 2\alpha)^{|S|}$, where $|S|$ is the number of thresholds.

Simultaneous confidence limits can be obtained using a bootstrap method described by Davison and Hinkley (1997, section 4.2.4). Suppose that equitailed confidence intervals at each threshold $u$ are based on a bootstrap sample $\{T^*_i(u) : i = 1, \ldots, r\}$ and have the form

$$L(u) = \hat{B}(u) - \hat{\sigma}(u) T^*_{(r+1-k)}(u), \qquad U(u) = \hat{B}(u) - \hat{\sigma}(u) T^*_{(k)}(u)$$

for some $1 \le k \le r/2$.
If we use these limits to form simultaneous intervals, then the bootstrap estimate of the coverage probability [(10)] is

$$\frac{1}{r} \sum_{i=1}^{r} I\{T^*_{(k)}(u) \le T^*_i(u) \le T^*_{(r+1-k)}(u) \text{ for all } u \in S\}.$$

It is sufficient, therefore, to choose $k$ such that this estimate is as close as possible to $1 - 2\alpha$. The resampling must preserve dependence between thresholds: the statistics $\{T^*_i(u) : u \in S\}$ should be computed from the same data for each $i$. So, resampling schemes take the following form.
1) Resample $(X^*_s, Y^*_{s,1}, \ldots, Y^*_{s,m})$ from $\{(X_t, Y_{t,1}, \ldots, Y_{t,m}) : t = 1, \ldots, n\}$ for each $s = 1, \ldots, n$.
2) Compute $T^*_i(u)$ for all $u \in S$.

The size, $r$, of the bootstrap sample may also need to be larger for simultaneous intervals. If scores were independent across thresholds, the worst case, then the bootstrap estimate of the coverage would be approximately $(1 - 2k/r)^{|S|}$. If this is to equal $1 - 2\alpha$, then we require $r = 2k/[1 - (1 - 2\alpha)^{1/|S|}] \approx k|S|/\alpha$ for large $|S|$. If $\alpha = 0.05$ and we want $k \ge 5$ for example, then $r \ge 100|S|$.

FIG. 6. (top) Brier scores $\hat{B}_{m,n}$ (solid) for the ECMWF forecasts, with bootstrapped simultaneous 90% confidence intervals for $B_m$ (dark gray) and $B_{m,n}$ (light gray) at each threshold. Expected Brier scores are also shown for random forecasts (dotted) and climatology (dashed). (middle) As in the top panel but for the MF forecasts. (bottom) The difference (solid) in $\hat{B}_{m,n}$ between the ECMWF and MF forecasts, with bootstrapped simultaneous 90% confidence intervals for the differences between $B_m$ (dark gray) and $B_{m,n}$ (light gray).

The dark gray bands in Fig. 6 are bootstrapped, simultaneous 90% confidence intervals for $B_m$ for the ECMWF and MF rainfall forecasts. Considering all thresholds together, then, we find that, at the 10% level, neither the ECMWF nor MF forecasts differ significantly from climatology. The evidence for a difference between the ECMWF and MF forecasts is marginal at the 10% level.

7. Discussion
This article identified the effect of ensemble size on the expected Brier score [(7)] and, given ensembles of

10 OCTOBER 2007 F E R R O 1085 size, an unbiased estiator [(8)] for the expected Brier score that would be obtained for any other enseble size. We assued that enseble ebers were exchangeable, an acceptable assuption when the forecast lead tie is long enough for systeatic differences between ebers to be lost. We proposed standard errors and confidence intervals for the expected Brier scores and found that bootstrap intervals perfored well in a siulation study. When coparing the Brier scores fro two forecasting systes, we proposed coparing estiates of expected Brier scores that would be obtained were the enseble sizes equal, and described confidence intervals for their difference. We showed that if the Brier scores for several event definitions are of interest, then it is possible to construct confidence intervals that siultaneously contain with a specified probability the expected scores for all events. We applied our ethods to two sets of rainfall forecasts. For forecasting low rainfall totals, MF forecasts had lower Brier scores than ECMWF forecasts, even after estiating the effect of increasing the ECMWF enseble size to infinity. Standard errors and confidence intervals suggested that the scores were only arginally significantly different at the 10% level for a few thresholds, and neither set of forecasts perfored better than forecasting cliatology. Müller et al. (2005) have ais siilar to ours but for the ore general quadratic ranked probability score (RPS). They note that the expected RPS for perfectly calibrated but rando enseble forecasts exceeds the RPS obtained by forecasting cliatology, which is equivalent to a perfectly calibrated rando enseble forecast with infinite enseble size. This is analogous to B exceeding B. Instead of using cliatology as the reference forecast in RPS skill scores, they therefore propose using a perfectly calibrated rando enseble forecast with an enseble size equal to that of the forecasts being assessed. 
This is equivalent to our proposal of comparing $\hat{B}_{m,n}$ with $B_m$ instead of $B_\infty$. Müller et al. (2005) also produce confidence bands representing the sampling variation in the RPS skill score for random forecasts that arises among different observation-forecast datasets. Comparing a forecast system's skill score with these bands provides a guide to its statistical significance relative to a random forecast, but does not provide a formal statistical test because the sampling variation in the system's skill score is ignored. Our confidence intervals differ substantially: they are confidence intervals for the expected score of the forecast system being employed and can, therefore, be used to make comparisons with the expected score of any reference forecast, not only random forecasts, and can also be used to compare the expected scores of two forecasting systems.

The methods presented in this article can be extended in several ways. We have defined events as exceedances of thresholds for simplicity, but the same methods could be applied for events defined by membership of general sets. We have also considered scalar observations and forecasts for simplicity, but multivariate data can be handled with the same methods; for example, events could be defined by membership of multidimensional sets. The methods presented here can also be extended to multiple-category Brier scores (Brier 1950) and to the RPS.

Computer code for the procedures presented in this article, written in the statistical programming language R, is available from the author.

Acknowledgments. Conversations with Professor I. T. Jolliffe and Drs. C. A. S. Coelho, D. B. Stephenson, and G. J. van Oldenborgh (who provided the data), plus comments from the referees, helped to motivate and improve this work.

APPENDIX A

Conditional Brier Scores

a. Unbiased estimators

We discussed the expected Brier score [(1)] in the main part of the paper, where the expectation was taken over repeated sampling of forecasts and observations.
We investigate the conditional expected Brier score in this appendix, where the expectation is taken over repeated sampling of forecasts, but where the observations remain fixed. This quantity would be of interest if we wanted to know how a forecasting system would have performed on a particular set of target observations for different ensemble sizes. As before, we shall find the effects of ensemble size and construct unbiased estimators, standard errors, and confidence intervals.

The only source of variation in the conditional case is the generation of ensemble members: each observation $X_t$, and the corresponding model details $Z_t$, are fixed. Consequently, we no longer need to assume stationarity, and the conditional expected Brier score is

$$B_{m,n} = \frac{1}{n} \sum_{t=1}^{n} E\{(\hat{Q}_t - I_t)^2 \mid X_t, Z_t\} = \frac{1}{n} \sum_{t=1}^{n} \{E(\hat{Q}_t^2 \mid Z_t) - 2E(\hat{Q}_t \mid Z_t) I_t + I_t\}$$

since $X_t$ determines $I_t$. To see the effects of ensemble size, we assume again that the forecasts $\hat{Q}_t$ are simple proportions [(3)] and that ensemble members are exchangeable. Then, for each $t$,

$$E(\hat{Q}_t \mid Z_t) = Q_t \quad \text{and} \quad E(\hat{Q}_t^2 \mid Z_t) = \frac{Q_t}{m} + \frac{(m - 1) R_t}{m}$$

for some $Q_t$ and $R_t$ independent of $m$, and

$$B_{m,n} = \frac{1}{n} \sum_{t=1}^{n} (R_t - 2 I_t Q_t + I_t) + \frac{1}{mn} \sum_{t=1}^{n} (Q_t - R_t) = B_{\infty,n} + \frac{1}{mn} \sum_{t=1}^{n} (Q_t - R_t),$$

where $B_{\infty,n}$ is the conditional counterpart of $B_\infty$. The adjusted Brier score [(8)] is again unbiased for $B_{M,n}$.

b. Standard errors

Estimating the uncertainty about $B_{M,n}$ due to sampling variation is harder than for $B_M$ because we no longer assume stationarity. The contribution to the sampling variation must therefore be quantifiable for each ensemble separately. This is easier if we strengthen our assumption of exchangeability to one of independent and identically distributed members. This assumption is difficult to test empirically for ensemble forecasts with a distribution that changes through time and requires further investigation (cf. Bergsma 2004), so we appeal to the long lead time of our rainfall forecasts for justification. In this case, the number $K_t$ of members that forecast the event in the ensemble at time $t$ has a binomial distribution with mean $mQ_t$ and variance $mQ_t(1 - Q_t)$. After some lengthy algebra, the conditional variance, $\sigma^2_{M,n}$, of $\hat{B}_{M,n}$ can be shown to satisfy

$$\sigma^2_{M,n} = \frac{1}{mn^2} \sum_{t=1}^{n} \left\{ \frac{(M-1)^2}{M^2 (m-1)} \left[ 2Q_t^2 + 4(m-2) Q_t^3 - (4m-6) Q_t^4 \right] + \frac{M-1}{M} \left( 8 I_t Q_t^3 - 12 I_t Q_t^2 + 4 I_t Q_t \right) + \frac{4(M-1)}{M^2} \left( Q_t^2 - Q_t^3 \right) + \frac{1}{M^2} \left( Q_t - Q_t^2 \right) \right\}.$$

This variance decreases as $1/m$ for large $m$, so $\hat{B}_{M,n}$ is consistent for $B_{M,n}$ as $m \to \infty$. An unbiased estimator for $\sigma^2_{M,n}$ can be constructed if $m > 3$ by replacing each $Q_t^s$ in the previous equation with

$$\frac{K_t (K_t - 1) \cdots (K_t - s + 1)}{m (m - 1) \cdots (m - s + 1)}$$

for positive integers $s$. The square root, $\hat{\sigma}_{M,n}$, of this unbiased estimator is then an estimator for the conditional standard error of $\hat{B}_{M,n}$. If $m \le 3$, then we can replace $Q_t^s$ with $(K_t/m)^s$ instead, but note that if $m = 1$, then $\hat{\sigma}_{M,n}$ is always zero.

Estimates of these conditional standard errors are shown in Fig. 2 for the ECMWF and MF rainfall forecasts.
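The substitution of $K_t(K_t - 1) \cdots (K_t - s + 1)/[m(m - 1) \cdots (m - s + 1)]$ for $Q_t^s$ can be checked numerically. The sketch below is a minimal illustration in Python rather than the article's R code: the function names, the ensemble size $m = 9$, and the probability $q = 0.3$ are hypothetical choices. It computes exact expectations under a binomial distribution for the member count and compares the falling-factorial ratio with the naive plug-in $(K_t/m)^s$.

```python
from math import comb


def falling_factorial_estimate(k, m, s):
    """Estimate Q**s by k(k-1)...(k-s+1) / [m(m-1)...(m-s+1)]; needs m >= s."""
    numerator, denominator = 1.0, 1.0
    for j in range(s):
        numerator *= k - j
        denominator *= m - j
    return numerator / denominator


def binomial_expectation(f, m, q):
    """Exact expectation of f(K) when K has a Binomial(m, q) distribution."""
    return sum(comb(m, k) * q**k * (1 - q) ** (m - k) * f(k) for k in range(m + 1))


m, q = 9, 0.3  # hypothetical ensemble size and event probability
for s in (1, 2, 3, 4):
    unbiased = binomial_expectation(lambda k: falling_factorial_estimate(k, m, s), m, q)
    plug_in = binomial_expectation(lambda k: (k / m) ** s, m, q)
    print(f"s={s}: Q^s={q**s:.6f}  falling-factorial mean={unbiased:.6f}  plug-in mean={plug_in:.6f}")
```

The falling-factorial means agree with $Q^s$ up to rounding error, whereas the plug-in means are too large for $s \ge 2$. For $s = 2$, for example, the plug-in mean is $Q^2 + Q(1 - Q)/m$, which is the same ensemble-size effect that the adjusted Brier score removes.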
As expected, these conditional standard errors are smaller than their unconditional counterparts, which reflect the additional variation from sampling observations. In fact, they are small enough to suggest that the superiority of the MF forecasts at thresholds below 90 mm is statistically significant and would remain for these particular observations even if different ensemble members were sampled.

c. Confidence intervals

A normal $(1 - 2\alpha)$ confidence interval for $B_{M,n}$ is $\hat{B}_{M,n} \pm \hat{\sigma}_{M,n} z_\alpha$. Bootstrap intervals approximate the distribution of

$$T_{M,n} = \frac{\hat{B}_{M,n} - B_{M,n}}{\hat{\sigma}_{M,n}}$$

by a bootstrap sample $\{T^{*i}_{M,n} : i = 1, \ldots, r\}$. This sample can be formed by repeating the following steps for each $i = 1, \ldots, r$.

1) Resample $Y^*_{t,j}$ from $\{Y_{t,i} : i = 1, \ldots, m\}$ for each $j = 1, \ldots, m$, and repeat for each $t = 1, \ldots, n$.
2) Form $\hat{B}^*_{M,n}$ and $\hat{\sigma}^*_{M,n}$ from these resampled ensembles in the same way that the original ensembles were used to form $\hat{B}_{M,n}$ and $\hat{\sigma}_{M,n}$.
3) Set $T^{*i}_{M,n} = (\hat{B}^*_{M,n} - B^*_{M,n})/\hat{\sigma}^*_{M,n}$, where

$$B^*_{M,n} = \frac{1}{n} \sum_{t=1}^{n} \left\{ \hat{Q}_t^2 - 2 I_t \hat{Q}_t + I_t + \frac{1}{M} \hat{Q}_t (1 - \hat{Q}_t) \right\}.$$

Bootstrap $(1 - 2\alpha)$ confidence limits then take the form $\hat{B}_{M,n} - \hat{\sigma}_{M,n} T^{*(l)}_{M,n}$ and $\hat{B}_{M,n} - \hat{\sigma}_{M,n} T^{*(k)}_{M,n}$.

Bootstrapped 90% confidence intervals for $B_{m,n}$ are illustrated for the ECMWF and MF forecasts in Fig. 3. Again, the intervals are narrower than those for $B_m$. The ECMWF forecasts are now significantly worse than climatology for many thresholds, that is, they are unlikely to do as well as climatology for these observations were new ensemble members to be sampled, but are significantly better than random forecasts except between the 30% and 50% quantiles (50 to 90 mm). The MF forecasts are not significantly different from climatology at most thresholds, and are significantly better than random forecasts at all thresholds.

d. Simulation study

The simulation study of section 4 was repeated for $B_{m,n}$. Results are not shown but were qualitatively similar to those reported in section 4 for $B_m$ except for rare events ($p = 0.9$). In that case, bootstrap intervals remain preferable to normal intervals except when $\rho = 0$, for which both intervals have large coverage errors.

e. Comparing Brier scores

Confidence intervals for the difference, $B_{M,n} - B'_{M,n}$, between the conditional expected Brier scores of two systems are easy to construct if the forecasts from the two systems at any time $t$ can be considered independent once the model details $Z_t$ and $Z'_t$ are fixed. This assumption might be violated if the ensemble generation process causes pairing of members between the two systems, though any such dependence is likely to diminish with lead time. The distribution of

$$D_{M,n} = \frac{(\hat{B}_{M,n} - \hat{B}'_{M,n}) - (B_{M,n} - B'_{M,n})}{\hat{\tau}_{M,n}},$$

where $\hat{\tau}^2_{M,n} = \hat{\sigma}^2_{M,n} + \hat{\sigma}'^2_{M,n}$, can be approximated by a bootstrap sample of the quantity

$$D^*_{M,n} = \frac{(\hat{B}^*_{M,n} - \hat{B}'^*_{M,n}) - (B^*_{M,n} - B'^*_{M,n})}{\hat{\tau}^*_{M,n}}$$

to obtain confidence limits

$$(\hat{B}_{M,n} - \hat{B}'_{M,n}) - \hat{\tau}_{M,n} D^{*(l)}_{M,n} \quad \text{and} \quad (\hat{B}_{M,n} - \hat{B}'_{M,n}) - \hat{\tau}_{M,n} D^{*(k)}_{M,n}.$$

Resampling follows the scheme described earlier in the section independently for each system.

Figure 3 (bottom panel) shows bootstrapped 90% confidence intervals for the difference between $B_{m,n}$ for the ECMWF and MF forecasts. The MF score is significantly lower than the ECMWF score for most thresholds below 90 mm.

APPENDIX B

Hypothesis Tests

We used confidence intervals in section 5 to test null hypotheses of equal expected Brier scores. Using the normal confidence interval is equivalent to a z test [e.g., the test labeled $S_1$ by Diebold and Mariano (1995)] and using the bootstrap interval is equivalent to a bootstrap test (e.g., Davison and Hinkley 1997, p. 171). Confidence intervals are useful for quantifying uncertainty even when no comparative test is attempted, but if comparison is the goal, then other test procedures might also be employed. Hypothesis tests such as the sign and signed-rank tests (Hamill 1999) test for differences between medians and are inappropriate for testing differences between Brier scores, which are sample means. Instead, we compare the powers of the z and bootstrap tests with those of a t test and a permutation test (Hamill 1999) in a simulation study.

The study design is similar to that in section 4, except that two sets of forecasts are simulated, one uncorrelated with the observations ($\rho_1 = 0$) while the correlation for the other set is varied from $\rho_2 = 0$ to $\rho_2 = 1$. The sets have the same expected Brier score when $\rho_2 = 0$ and the scores diverge as $\rho_2$ increases. For each value of $\rho_2$, datasets are generated and subjected to the four tests at the 10% significance level. Figure B1 (left panel) shows Monte Carlo estimates of power for the four tests when $m = 8$, $n = 20$, and $p = 0.5$. All four tests have similar powers, although the z test is slightly oversized and the bootstrap test has slightly lower power far from the null hypothesis.

FIG. B1. Monte Carlo estimates of the powers of the bootstrap (solid), permutation (dashed), z (dotted), and t tests (dotted dashed) against correlation $\rho_2$ (see text) for (left) serially independent and (right) dependent observations.

The z test in Diebold and Mariano (1995) is adapted to handle serial dependence, and block resampling can be used for the permutation and bootstrap tests. The power study is repeated with observations simulated from a first-order moving-average process with correlation 0.5 at lag one. Powers for these three tests are plotted in Fig. B1 (right panel) and show that the z test and, to a lesser extent, the bootstrap test are oversized, while the permutation test has remained well sized and its power has reduced only slightly from the independent case. From this limited study, the permutation test appears to be an attractive alternative to the bootstrap test for differences between Brier scores.

REFERENCES

Bergsma, W. P., 2004: Testing conditional independence for continuous random variables. EURANDOM Tech. Rep., 19 pp.
Bradley, A. A., T. Hashino, and S. S. Schwartz, 2003: Distributions-oriented verification of probability forecasts for small data samples. Wea. Forecasting, 18.
Brier, G. W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1-3.
Briggs, W., and D. Ruppert, 2005: Assessing the skill of yes/no predictions. Biometrics, 61.
Buizza, R., and T. N. Palmer, 1998: Impact of ensemble size on ensemble prediction. Mon. Wea. Rev., 126.
Chatfield, C., 2004: The Analysis of Time Series: An Introduction. Chapman and Hall, 333 pp.
Davison, A. C., and D. V. Hinkley, 1997: Bootstrap Methods and Their Application. Cambridge University Press, 592 pp.
DiCiccio, T. J., A. C. Monti, and G. A. Young, 2006: Variance stabilization for a scalar parameter. J. Roy. Stat. Soc., 68B.
Diebold, F. X., and R. S. Mariano, 1995: Comparing predictive accuracy. J. Bus. Econ. Stat., 13.
Hall, P., 1987: On the bootstrap and continuity correction. J. Roy. Stat. Soc., 49B.
Hamill, T. M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Wea. Forecasting, 14.
Jolliffe, I. T., 2007: Uncertainty and inference for verification measures. Wea. Forecasting, 22.
Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner's Guide in Atmospheric Science. John Wiley and Sons, 240 pp.
Kane, T. L., and B. G. Brown, 2000: Confidence intervals for some verification measures: A survey of several methods. Preprints, 15th Conf. on Probability and Statistics, Asheville, NC, Amer. Meteor. Soc.
Lahiri, S. N., 1993: Bootstrapping the Studentized sample mean of lattice variables. J. Mult. Anal., 45.
Müller, W. A., C. Appenzeller, F. J. Doblas-Reyes, and M. A. Liniger, 2005: A debiased ranked probability skill score to evaluate probabilistic ensemble forecasts with small ensemble sizes. J. Climate, 18.
Palmer, T. N., and Coauthors, 2004: Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER). Bull. Amer. Meteor. Soc., 85.
Potts, J. M., 2003: Basic concepts. Forecast Verification: A Practitioner's Guide in Atmospheric Science, I. T. Jolliffe and D. B. Stephenson, Eds., John Wiley and Sons.
Richardson, D. S., 2001: Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Quart. J. Roy. Meteor. Soc., 127.
Romano, J. P., 1988: A bootstrap revival of some nonparametric distance tests. J. Amer. Stat. Assoc., 83.
Seaman, R., I. Mason, and F. Woodcock, 1996: Confidence intervals for some performance measures of yes-no forecasts. Aust. Meteor. Mag., 45.
Severini, T. A., 2005: Elements of Distribution Theory. Cambridge University Press, 515 pp.
Stephenson, D. B., 2000: Use of the odds ratio for diagnosing forecast skill. Wea. Forecasting, 15.
Thornes, J. E., and D. B. Stephenson, 2001: How to judge the quality and value of weather forecast products. Meteor. Appl., 8.
Wilks, D. S., 1997: Resampling hypothesis tests for autocorrelated fields. J. Climate, 10.
Wilks, D. S., 2006: Statistical Methods in the Atmospheric Sciences. 2d ed. Academic Press, 627 pp.
Woodcock, F., 1976: The evaluation of yes/no forecasts for scientific and administrative purposes. Mon. Wea. Rev., 104.


Comparing Probabilistic Forecasting Systems with the Brier Score Christopher A. T. Ferro Walker Institute Department of Meteorology University of Reading, UK January 16, 2007 Corresponding author address: