Using Bayesian Model Averaging to Calibrate Forecast Ensembles 1

Size: px

Start display at page:

Download "Using Bayesian Model Averaging to Calibrate Forecast Ensembles 1"

Edward Franklin
6 years ago
Views:

1 Using Bayesian Mdel Averaging t Calibrate Frecast Ensembles 1 Adrian E. Raftery, Fadua Balabdaui, Tilmann Gneiting and Michael Plakwski Department f Statistics, University f Washingtn, Seattle, Washingtn Technical Reprt n. 440 Department f Statistics University f Washingtn December 15, Crrespnding authr address: Adrian E. Raftery, Department f Statistics, University f Washingtn, Bx , Seattle, WA

2 Reprt Dcumentatin Page Frm Apprved OMB N Public reprting burden fr the cllectin f infrmatin is estimated t average 1 hur per respnse, including the time fr reviewing instructins, searching existing data surces, gathering and maintaining the data needed, and cmpleting and reviewing the cllectin f infrmatin. Send cmments regarding this burden estimate r any ther aspect f this cllectin f infrmatin, including suggestins fr reducing this burden, t Washingtn Headquarters Services, Directrate fr Infrmatin Operatins and Reprts, 1215 Jeffersn Davis Highway, Suite 1204, Arlingtn VA Respndents shuld be aware that ntwithstanding any ther prvisin f law, n persn shall be subject t a penalty fr failing t cmply with a cllectin f infrmatin if it des nt display a currently valid OMB cntrl number. 1. REPORT DATE 15 DEC REPORT TYPE 3. DATES COVERED t TITLE AND SUBTITLE Using Bayesian Mdel Averaging t Calibrate Frecast Ensembles 5a. CONTRACT NUMBER 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER 6. AUTHOR(S) 5d. PROJECT NUMBER 5e. TASK NUMBER 5f. WORK UNIT NUMBER 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) University f Washingtn,Department f Statistics,Bx ,Seattle,WA, PERFORMING ORGANIZATION REPORT NUMBER 9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR S ACRONYM(S) 12. DISTRIBUTION/AVAILABILITY STATEMENT Apprved fr public release; distributin unlimited 13. SUPPLEMENTARY NOTES The riginal dcument cntains clr images. 14. ABSTRACT 15. SUBJECT TERMS 11. SPONSOR/MONITOR S REPORT NUMBER(S) 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT a. REPORT unclassified b. ABSTRACT unclassified c. THIS PAGE unclassified 18. NUMBER OF PAGES 32 19a. NAME OF RESPONSIBLE PERSON Standard Frm 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18

3 Abstract Ensembles used fr prbabilistic weather frecasting ften exhibit a spread-skill relatinship, but they tend t be underdispersive. This paper prpses a principled statistical methd fr pstprcessing ensembles based n Bayesian mdel averaging (BMA), which is a standard methd fr cmbining predictive distributins frm different surces. The BMA predictive prbability density functin (PDF) f any quantity f interest is a weighted average f PDFs centered arund the individual (pssibly bias-crrected) frecasts, where the weights are equal t psterir prbabilities f the mdels generating the frecasts, and reflect the mdels skill ver the training perid. The BMA PDF can be represented as an unweighted ensemble f any desired size, by simulating frm the BMA predictive distributin. The BMA weights can be used t assess the usefulness f ensemble members, and this can be used as a basis fr selecting ensemble members; this can be useful given the cst f running large ensembles. The BMA predictive variance can be decmpsed int tw cmpnents, ne crrespnding t the between-frecast variability, and the secnd t the within-frecast variability. Predictive PDFs r intervals based slely n the ensemble spread incrprate the first cmpnent but nt the secnd. Thus BMA prvides a theretical explanatin f the tendency f ensembles t exhibit a spread-skill relatinship but yet t be underdispersive. The methd was applied t 48-hur frecasts f sea-level pressure in the Pacific Nrthwest in January June, 2000 using the University f Washingtn MM5 messcale ensemble. The predictive PDFs were much better calibrated than the raw ensemble, the BMA frecasts were sharp in that 90% BMA predictin intervals were 62% shrter n average than thse prduced by sample climatlgy. As a byprduct, BMA yields a deterministic pint frecast, and this had RMSE 11% lwer than any f the ensemble members, and 6% lwer than the ensemble mean. Similar results were btained fr frecasts f surface temperature.

4 Cntents 1 Intrductin 1 2 Bayesian Mdel Averaging 5 a. Basic Ideas b. Estimatin by Maximum Likelihd and the EM Algrithm c. The BMA Predictive Variance Decmpsitin and the Spread-Skill Relatinship 8 d. Example f BMA Predictive PDF Results 12 a. Length f Training Perid b. Results c. Selecting Ensemble Members: Results fr a Reduced Ensemble d. Results fr Surface Temperature Discussin 21 List f Figures 1 Spread-Skill Relatinship fr Daily Average Abslute Errrs in Sea-Level Pressure Frecast Verificatin Rank Histgram fr the UW Ensemble 48-Hur Frecasts f Sea- Level Pressure BMA Predictive PDF and its Five Cmpnents fr an Example Frecast f Sea-Level Pressure Ensemble f 100 Equally Likely Values frm the BMA PDF fr the Example Sea-Level Pressure Frecast BMA Prbabilistic 48-Hur Frecast f Sea-Level Pressure fr the Pacific Nrthwest, February 23, Cntur plts f (a) BMA deterministic frecast; (b) BMA margin f errr; (c) Lwer and (d) upper bund f the BMA 90% predictin interval Cmparisn f Training Perid Lengths fr Sea-Level Pressure BMA Frecast 15 7 Calibratin Rank Histgram fr Bayesian Mdel Averaging fr Sea-Level Pressure Bayesian Mdel Averaging Weights fr the Five Mdels Over the Evaluatin Perid fr Sea-Level Pressure i

5 List f Tables 1 One Example Frecast: UW-MM5 Ensemble Frecasts, Bias-Crrected Frecasts, BMA Weights, and Verifying Observatin Cverage f Predictin Intervals fr Sea-Level Pressure (%) Average Width f Predictin Intervals fr Sea-Level Pressure RMSEs f Deterministic Frecasts fr Sea-Level Pressure Cmparisn f BMA PDFs frm the Five-Member Ensemble and the Reduced Three-Member Ensemble fr Sea-Level Pressure Cverage f Predictin Intervals fr Surface Temperature (%) Average Width f Predictin Intervals fr Surface Temperature RMSEs f Deterministic Frecasts fr Surface Temperature ii

6 1 Intrductin The dminant apprach t prbabilistic weather frecasting has been the use f ensembles in which a mdel is run several times with different initial cnditins r mdel physics (Epstein 1969; Leith 1974). Ensembles based n glbal mdels have been fund useful fr medium-range prbabilistic frecasting (Tth and Kalnay 1993; Mlteni, Buizza, Palmer, and Petrliagis 1996; Hutekamer and Derme 1995; Hamill, Snyder, and Mrss 2000). Typically the ensemble mean utperfrms all r mst f the individual ensemble members, and in sme studies a spread-skill relatinship has been bserved, in which the spread in the ensemble frecasts is crrelated with the magnitude f the frecast errr. Often, hwever, the ensemble is underdispersive and thus nt calibrated. Here we fcus n shrt-range messcale frecasting. Several authrs have studied the use f a synptic ensemble, the 15-member NCEP Eta-RSM ensemble, fr shrt-range frecasting (Hamill and Clucci 1997; Hamill and Clucci 1998; Stensrud, Brks, Du, Tractn, and Rgers 1999). As was the case fr medium-range frecasting, the ensemble mean was mre skillful fr shrt-range frecasting than the individual ensemble members, but the spreadskill relatinship was weak. The first shrt-range messcale ensemble frecasting experiment was SAMEX (Hu, Kalnay, and Dregemeier 2001). This fund that the ensemble mean was mre skillful than the individual frecasts, and that there was a significant spread-skill relatinship, with a crrelatin n the rder f 0.4. Hwever, the ensemble was nt well calibrated; see Figures 8 and 9 f Hu, Kalnay, and Dregemeier (2001). Grimit and Mass (2002) described the University f Washingtn messcale shrt-range ensemble system fr the Pacific Nrthwest hereafter referred t as the UW ensemble. This is a five-member multianalysis ensemble cnsisting f different runs f the MM5 mdel, in which initial cnditins are taken frm different peratinal centers. The UW ensemble was run at 36km and 12km grid spacing, while the NCEP SREF has been run at 48km. Like ther authrs, Grimit and Mass (2002) fund the ensemble mean t be mre skillful than the individual frecasts, and they reprted a strnger spread-skill crrelatin than ther studies, ranging up t 0.6. Figure 1 is a scatterplt shwing the spread-skill relatinship fr sea-level pressure fr the UW ensemble fr the same perid as that n which Grimit and Mass (2002) s reprt was based, namely January June, The spread-skill crrelatin fr daily average abslute errrs, averaging spatially acrss the Pacific Nrthwest, was 0.42, which was highly statistically significant. Hwever, the verificatin rank histgram fr the same data, shwn in Figure 2, shws the ensemble t be underdispersive and hence uncalibrated. In this case, the 1

7 Abslute Errr Frecast Range Figure 1: Spread-Skill Relatinship fr Daily Average Abslute Errrs in the 48-hur Frecast f Sea-Level Pressure in the UW Ensemble, January-June, The vertical axis shws the daily average f the abslute errrs f the ensemble mean frecast, and the hrizntal axis shws the daily average f the difference between the highest and lwest frecasts in the ensemble. The slid line is the least-squares regressin line. The crrelatin is 0.42, and is highly statistically significant. ensemble range based n five members wuld cntain 4/6, r 66.7% f the bserved values if the ensemble were calibrated, i.e. if the ensemble frecasts were a sample frm the predictive prbability density functin (PDF), whereas in fact it cntained nly 51% f them. This behavir an ensemble that yields a significant spread-skill relatinship and useful predictins f frecast skill, and yet is uncalibrated is nt unique t the UW ensemble, as we have nted, and may seem cntradictry. On reflectin, thugh, it is nt s surprising. There are several surces f uncertainty in numerical weather frecasts, including uncertainty abut initial cnditins, lateral bundary cnditins, mdel physics, as well as discretizatin and integratin methds. Mst ensembles capture nly sme f these uncertainties, and then prbably nly partially. Thus it seems inevitable that ensembles based purely n perturbing initial and lateral bundary cnditins, mdel physics and integratin mdels will 2

8 Relative Frequency Rank f Observatin in Ensemble Figure 2: Verificatin Rank Histgram fr the UW Ensemble 48-Hur Frecasts f Sea-Level Pressure, January-June,

9 be underdispersive t sme extent. Because they d capture sme f the imprtant surces f uncertainty, hwever, it is reasnable t expect a significant spread-skill relatinship, even when the ensemble is uncalibrated. T btain a calibrated frecast PDF, therefre, it seems necessary t carry ut sme frm f statistical pst-prcessing, as suggested by Hamill and Clucci (1997, 1998). Our gal in this article is t prpse an apprach t btaining calibrated and sharp predictive PDFs f future weather quantities r events frm the utput f ensembles that may nt be themselves calibrated. By calibrated we mean simply that intervals r events that we declare t have prbability P happen a prprtin P f the time n average in the lng run. Sharpness is a functin f the widths f predictin intervals. Fr example, a 90% predictin interval verifying at a given time and place is defined by a lwer bund and an upper bund, such that the prbability that the verifying bservatin lies between the tw bunds is declared t be 90%. By sharp we mean that predictin intervals are narrwer n average than thse btained frm climatlgy. Clearly, the sharper the better. We adpt the principle that the gal f prbabilistic frecasting is t maximize sharpness subject t calibratin (Gneiting, Raftery, Balabdaui, and Westveld 2003). T achieve this, we prpse a principled statistical apprach t pstprcessing ensemble frecasts, based n Bayesian Mdel Averaging (BMA). This is a standard statistical apprach t inference in the presence f multiple cmpeting statistical mdels, and has been widely applied in the scial and health sciences; here we extend it t frecasts frm dynamical mdels. In BMA, the verall frecast PDF is a weighted average f frecast PDFs based n each f the individual frecasts; the weights are the estimated psterir mdel prbabilities and reflect the mdels frecast skill in the training perid. The weights can als prvide a basis fr selecting ensemble members: when they are small there is little t be lst by remving the crrespnding ensemble member. This can be useful given the cmputatinal cst f running ensembles. The BMA deterministic frecast is just a weighted average f the frecasts (pssibly bias-crrected) frm the ensemble. The BMA frecast PDF can be written as an analytic expressin, and it can als be represented as an equally weighted ensemble f any desired size, by simulating ptential bservatins frm the frecast PDF. The BMA frecast variance decmpses int tw cmpnents, crrespnding t between-mdel and within-mdel variance. The ensemble spread captures nly the first cmpnent. This decmpsitin prvides a theretical explanatin and quantificatin f the behavir bserved in several ensembles, in which a significant spread-skill relatinship cexists with a lack f calibratin. 4

10 In sectin 2 we describe BMA, shw hw the BMA mdel can be estimated using the EM algrithm, and give an example f BMA in actin. In sectin 3 we give BMA results fr the UW ensemble, and in sectin 4 we make sme cncluding remarks. While ur experiments are with the UW ensemble, i.e. a messcale, single-mdel, multi-analysis ensemble system, the idea applies t ther situatins, including synptic, perturbed bservatins, singular vectr, bred and multi-mdel ensembles, with small changes, as indicated belw. 2 Bayesian Mdel Averaging a. Basic Ideas Standard statistical analysis, such as, fr example, regressin analysis, typically prceeds cnditinally n ne assumed statistical mdel. Often this mdel has been selected frm amng several pssible cmpeting mdels fr the data, and the data analyst is nt sure that it is the best ne. Other plausible mdels culd give different answers t the scientific questin at hand. This is a surce f uncertainty in drawing cnclusins, and the typical apprach, that f cnditining n a single mdel deemed t be best, ignres this surce f uncertainty, thus underestimating uncertainty. Bayesian mdel averaging (Leamer 1978; Kass and Raftery 1995; Heting, Madigan, Raftery, and Vlinsky 1999) vercmes this prblem by cnditining, nt n a single best mdel, but n the entire ensemble f statistical mdels first cnsidered. a quantity y t be frecast n the basis f training data y T In the case f using K statistical mdels {M 1,..., M K }, the law f ttal prbability tells us that the frecast PDF, p(y), is given by K p(y) = p(y M k )p(m k y T ), (1) k=1 where p(y M k ) is the frecast PDF based n mdel M k alne, and p(m k y T ) is the psterir prbability f mdel M k being crrect given the training data, and reflects hw well mdel M k fits the training data. The psterir mdel prbabilities add up t ne, s that Kk=1 p(m k y T ) = 1, and they can thus be viewed as weights. The BMA PDF is a weighted average f the PDFs given the individual mdels, weighted by their psterir mdel prbabilities. BMA pssesses a range f theretical ptimality prperties and has shwn gd perfrmance in a variety f simulated and real data situatins (Raftery and Zheng 2003). We nw extend BMA frm statistical mdels t dynamical mdels. The basic idea is that fr any given frecast there is a best mdel, but we d nt knw what it is, and 5

11 ur uncertainty abut the best mdel is quantified by BMA. Once again, we dente by y the quantity t be frecast. Each deterministic frecast, f k, can be bias-crrected, yielding a bias-crrected frecast f k. The frecast f k is then assciated with a cnditinal PDF, g k (y f k ), which can be interpreted as the cnditinal PDF f y cnditinal n f k, given that f k is the best frecast in the ensemble. The BMA predictive mdel is then K p(y f 1,..., f K ) = w k g k (y f k ), (2) k=1 where w k is the psterir prbability f frecast k being the best ne, and is based n frecast k s skill in the training perid. The w k s are prbabilities and s they add up t 1, i.e. K k=1 w k = 1. We will describe hw t estimate w k in the next subsectin. When frecasting temperature and sea-level pressure, it ften seems reasnable t apprximate the cnditinal PDF by a nrmal distributin centered at f k, s that g k (y f k ) is a nrmal PDF with mean f k and an ensemble-member-specific standard deviatin, σ k. We dente this situatin by y f k N( f k, σk 2 ), (3) and we will describe hw t estimate σk 2 in the next subsectin. In that case, the BMA predictive mean is just the cnditinal expectatin f y given the frecasts, namely K E[y f 1,..., f K ] = w k fk. (4) k=1 This can be viewed as a deterministic frecast in its wn right, and can be cmpared with the individual frecasts in the ensemble r the ensemble mean. b. Estimatin by Maximum Likelihd and the EM Algrithm Fr cnvenience, we restrict attentin t the situatin where the cnditinal PDFs are apprximated by nrmal distributins. This seems t be reasnable fr sme variables, such as temperature and sea level pressure, but nt fr thers, such as wind speed and precipitatin; ther distributins wuld be needed fr the latter. The basic ideas carry acrss directly t ther distributins als. We nw cnsider hw t estimate the mdel parameters, w k and σk 2, k = 1,..., K, n the basis f training bservatinal data. We dente the set f BMA mdel parameters t be estimated by θ. We dente space and time by subscripts s and t, s that f kst dentes the kth frecast in the ensemble fr place s and time t, and y st dentes the crrespnding verificatin. Here we will take the frecast hrizn t be fixed; in practice we will estimate different mdels fr each frecast hrizn. 6

12 We estimate θ by maximum likelihd (Fisher 1922) frm the training data. The likelihd functin is defined as the prbability f the training data given θ, viewed as a functin f θ. The maximum likelihd estimatr is the value f θ that maximizes the likelihd functin, i.e. the value f the parameter under which the bserved data were mst likely t have been bserved. The maximum likelihd estimatr has many ptimality prperties (Casella and Berger 2001). It is cnvenient t maximize the lgarithm f the likelihd functin (r lg-likelihd functin) rather than the likelihd functin itself, fr reasns f bth algebraic simplicity and numerical stability; the same parameter value that maximizes ne als maximizes the ther. The lg-likelihd functin fr mdel (2) is l(θ) = s,t ( K ) lg w k g k (y st f kst ), (5) k=1 where the summatin is ver values f s and t that index bservatins in the training set. This cannt be maximized analytically, and it is cmplex t maximize numerically using direct nnlinear maximizatin methds such as Newtn-Raphsn and its variants. Instead, we maximize it using the expectatin-maximizatin, r EM algrithm (Dempster, Laird, and Rubin 1977; McLachlan and Krishnan 1997). The EM algrithm is a methd fr finding the maximum likelihd estimatr when the prblem can be recast in terms f missing data such that, if we knew the missing data, the estimatin prblem wuld be straightfrward. The missing data d nt have t be actual data that are missing; instead, they are ften latent r unbserved quantities, knwledge f which wuld simplify the estimatin prblem. The BMA mdel (2) is a finite mixture mdel (McLachlan and Peel 2000). Here we intrduce missing data z kst where z kst = 1 if ensemble member k is the best frecast fr verificatin place s and time t, and z kst = 0 therwise. Fr each (s, t), nly ne f {z 1st,..., z Kst } is equal t 1; the thers are all zer. The EM algrithm is iterative, and alternates between tw steps, the E (r expectatin) step, and the M (r maximizatin) step. It starts with an initial guess, θ (0), fr the parameter vectr θ. In the E step, the z kst are estimated given the current guess fr the parameters; the estimates f the z kst are nt necessarily integers, even thugh the true values are 0 r 1. In the M step, θ is estimated given the current values f the z kst. Fr the nrmal BMA mdel given by (2) and (3), the E step is ẑ (j) kst = g(y st f kst, σ (j 1) k ) Ki=1 g(y st f ist, σ (j 1) i ), (6) 7

13 where the superscript j refers t the jth iteratin f the EM algrithm, and g(y st f kst, σ (j 1) k ) is a nrmal density with mean f kst and standard deviatin σ (j 1) k evaluated at y st. The M step then cnsists f estimating the w k and the σ k using as weights the current estimates f z kst, i.e. ẑ (j) kst. Thus w (j) k = 1 n σ 2(j) k = ẑ (j) kst, s,t s,t ẑ (j) kst (y st f kst ) 2, s,t ẑ (j) kst where n is the number f bservatins (i.e. f distinct values f (s, t)). The E and M steps are then iterated t cnvergence, which we defined as changes n greater than sme small tlerances in any f the lg-likelihd, the parameter values, r the ẑ (j) kst in ne iteratin. The lg-likelihd is guaranteed t increase at each EM iteratin (Wu 1983), which implies that in general it cnverges t a lcal maximum f the likelihd. Cnvergence t a glbal maximum cannt be guaranteed, s the slutin reached by the algrithm can be sensitive t the starting values. Starting values based n past experience usually give gd slutins. In ur implementatin, the training set cnsists f frecasts and bservatins fr the previus m days. We will discuss the chice f m later. c. The BMA Predictive Variance Decmpsitin and the Spread-Skill Relatinship The BMA predictive variance f y st given the ensemble f frecasts can be written as ( ) 2 Var(y st f 1st,..., f K K K Kst ) = w k f kst w i fist + w k σk 2 (7) k=1 i=1 k=1 (Raftery 1993). The right-hand side has tw terms, the first f which summarizes betweenfrecast spread, and the secnd measures the expected uncertainty cnditinal n ne f the frecasts being best. We can summarize this verbally as Predictive Variance = Between-Frecast Variance + Within-Frecast Variance. (8) The first term represents the ensemble spread. Thus ne wuld expect t see a spreadskill relatinship, since the predictive variance includes the spread as a cmpnent. But it als implies that using the ensemble spread alne wuld underestimate uncertainty, because 8

14 it ignres the secnd term n the right-hand side f (7) r (8). The secnd term wuld be zer nly if the best frecast were always exact, which is nt the case. Thus BMA predicts a spread-skill relatinship, but als predicts ensembles t be underdispersive. This is exactly what we bserved in the UW ensemble, and it is als the case in ther ensembles. BMA prvides a theretical framewrk fr understanding these apparently cntradictry phenmena. d. Example f BMA Predictive PDF T illustrate the peratin f BMA, we first describe the predictin f just ne quantity at ne place and time; later we will give aggregate perfrmance results. We cnsider the 48- hur frecast f sea-level pressure near Pwell River, B.C., Canada, initialized at 0000 UTC n February 23, 2000 and verifying at 0000 UTC n February 25, As described belw, a 40-day training perid was used, in this case cnsisting f frecasts and bservatins in the 0000 UTC cycle frm January 14 t February 23, A simple linear bias crrectin was applied, with the cefficients estimated frm the training perid. Table 1 shws the frecasts and the bias-crrected frecasts, the BMA weights, and the bservatin fr the five members f the UW MM5 ensemble. There was strng disagreement amng the ensemble members: fur f the five frecasts were in clse agreement with ne anther, while the frecast btained with the CMC-GEM initializatin differed frm all the thers by at least 11 mb. The verifying bservatin turned ut t be utside the ensemble range, as happened fr 49% f the cases in ur dataset. The fur frecasts that agreed clsely were far frm the verificatin, while the utlying CMC-GEM-MM5 frecast was clse, and its bias-crrected versin was very clse indeed. The highest BMA weight was fr the CMC-GEM-MM5 frecast, which als turned ut t be the best frecast in this case. Figure 3 shws the BMA predictive PDF. This PDF (shwn as the thick curve in the figure), is a weighted sum f five nrmal PDFs (the cmpnents are the five thin lines). The distributin is multimdal, reflecting the disagreement between the frecasts. In particular, the brad right mde is centered arund the bias-crrected CMC-GEM-MM5 frecast, which is far frm the thers. The bservatin fell within the 90% BMA predictin interval, even thugh it was utside the ensemble range. The BMA PDF can als be represented as an unweighted ensemble f any size desired, simply by simulating frm the predictive distributin (2). T simulate M values frm the distributin (2), ne can prceed as fllws: Repeat M times: 9

15 Table 1: Frty-Eight Hur UW-MM5 Ensemble Frecasts f Sea-Level Pressure at Pwell River, B.C. Initialized at 0000 UTC n February 23, 2000, Bias-Crrected Frecasts, BMA Weights, and Verifying Observatin. The ensemble members are shwn in increasing rder f frecast value. MM5 Initializatin ETA NGM NOGAPS AVN GEM (Surce) (NCEP) (NCEP) (FNMOC) (NCEP) (CMC) Frecast Bias-crrected frecast BMA weight Observatin BMA Predictive PDF Sea Level Pressure Figure 3: BMA Predictive PDF (thick curve) and its Five Cmpnents (thin curves) fr the 48-Hur Sea-Level Pressure Frecast at Pwer River, B.C., Initialized at 0000 UTC n February 23, Als shwn are the ensemble member frecasts and range (slid hrizntal line and bullets), the BMA 90% predictin interval (dtted lines), and the verifying bservatin (slid vertical line). 10

16 Sea Level Pressure Figure 4: Ensemble f 100 Equally Likely Values frm the BMA PDF (2) fr the Pwell River Sea-Level Pressure Frecast. Als shwn are the ensemble member frecasts and range (slid hrizntal line and bullets), the BMA 90% predictin interval (dtted lines), and the verifying bservatin (slid vertical line). 1. Generate a value f k frm the numbers {1,..., K} with prbabilities {w 1,..., w K }. 2. Generate a value f y frm the PDF g k (y f k ). In the present case this will be a N( f k, σk 2) distributin. Figure 4 shws a BMA ensemble f size M = 100 generated in this way. In this case, 88 f the 100 ensemble members lay within the exact 90% predictin interval. This differs frm the expected number f 90, but the difference is well within the range f what wuld be bserved by chance. Figure 5 shws the BMA prbabilistic 48-hur frecast f sea-level pressure fr the same day fr the entire Pacific Nrthwest. Figure 5(a) shws the deterministic BMA frecast. Figure 5(b) shws the margin f errr f the 90% predictin interval, defined as half the 11

17 width f the interval. Rughly speaking, the predictin interval is apprximately equal t the deterministic frecast plus r minus the margin f errr, and the margin f errr indicates where the uncertainty is greatest. A cmparisn t Table 3 shws that this was a day with unusually high frecast uncertainty. Figures 5(c) and (d) display the lwer and upper bunds f the 90% predictin intervals. These fur plts shw ne way t summarize a prbabilistic frecast. 3 Results We nw give results f the applicatin f BMA t 48-hur sea-level pressure frecasts in the Pacific Nrthwest fr the 0000 UTC cycle in January June 2000, using the UW-MM5 ensemble described by Grimit and Mass (2002). The unit used is the millibar (mb). We first describe hw we chse the length f the training perid, then we give the main results, and finally we utline hw the results culd be used t select the members f a pssibly reduced ensemble. We als give summary results fr surface temperature ver the same perid. a. Length f Training Perid Hw many days shuld be used in the training perid t estimate the BMA weights, variances and bias crrectin cefficients? There is a trade-ff here, and n autmatic way f making it. Bth weather patterns and mdel specificatin change ver time, and there is advantage t using a shrt training perid s as t be able t adapt rapidly t such changes. In particular, the relative perfrmance f the mdels changes. On the ther hand, the lnger the training perid, the better the BMA parameters are estimated. In making ur chice, we were guided by the principle that prbabilistic frecasting methds shuld be designed t maximize sharpness subject t calibratin, i.e. t make the predictin intervals as shrt as pssible subject t their having the right cverage (Gneiting, Raftery, Balabdaui, and Westveld 2003). We als tend twards making the training perid as shrt as pssible s as t be able t adapt as quickly as pssible t the changing relative perfrmance f ensemble members, lengthening it nly if ding s seems t yield a clear advantage. Here we fcus n 90% predictin intervals. We cnsidered training perids f lengths 10, 20,..., 100 calendar days. Fr cmparability, the same verificatins were used in evaluating all the training perids, s the verificatins fr the first 100 days were nt used. Fr sme days the data were missing (Grimit and Mass 2002), s that the effective number f days in 12

18 Latitude Latitude Lngitude Lngitude (a) BMA Deterministic Frecast (b) BMA Margin f Errr Latitude Latitude Lngitude Lngitude (c) BMA Lwer bund (d) BMA Upper bund Figure 5: BMA Prbabilistic 48-Hur Frecast f Sea-Level Pressure fr the Pacific Nrthwest, Initialized at 0000 UTC n February 23, (a) BMA deterministic frecast. (b) BMA margin f errr, defined as half the width f the 90% predictin interval. (c) Lwer and (d) upper bund f the 90% predictin interval. 13

19 the training data was typically smaller than the nminal number f calendar days. Figure 6(a) shws the cverage f BMA 90% predictin intervals. It seems clear that 10 and 20-day training perids yield underdispersive PDFs, as d the 30-day perids, althugh less s. The 40-day training perid gives accurate cverage, and lnger perids give BMA PDFs that are slightly verdispersive. Figure 6(b) shws the average lengths f the 90% predictin intervals. These increase beynd 40 days, suggesting that there is little t be gained by increasing the length beynd 40 days. Thus a 40-day training perid seems clsest t maximizing sharpness subject t calibratin: amng thse that are reasnably well calibrated (40 days and abve), it has the shrtest average intervals. Figure 6(c) shws the RMSE f the BMA deterministic frecast given by (4). This decreases as the length f the training perid increases frm 10 t 40 days, and cntinues t decrease thereafter up t 80 days, but mre slwly. Finally, Figure 6(d) shws the ignrance scre fr BMA. This is the average f the negative f the natural lgarithms f the BMA PDFs evaluated at the bservatins. It was prpsed by Gd (1952), and its use in the present cntext was suggested by Rulstn and Smith (2002). Smaller scres are preferred. Like the RMSE, this decreases rapidly as the training perid increases frm 10 t 40 days, and mre slwly as the training perid is increased beynd 40 days. T summarize these results, it seems that there are substantial gains in increasing the training perid up t 40 days, and that beynd that there is little gain. We have therefre used 40 days here. It seems likely that different training perids wuld be best fr ther variables, frecast hrizns, time perids and regins. Further research n hw best t chse the length f the training perid is needed, and a gd autmatic way f ding this wuld be useful. b. Results We nw give results fr BMA, using the same evaluatin dataset as was used t cmpare the different training perids. The calibratin rank histgram fr BMA is shwn in Figure 7. T cmpute the calibratin rank histgram we prceeded as fllws. Fr each frecast initializatin time at each statin, we cmputed the BMA cumulative distributin functin (CDF), and we fund its value at the verifying bservatin. We then frmed the histgram f these BMA CDF values. This shuld be unifrm if the predictive PDF is calibrated; it can be cmpared directly with the verificatin rank histgram f the underlying ensemble in Figure 2. As can be seen, it was reasnably well calibrated, and clearly much mre s than the ensemble itself. 14

20 Cverage f BMA 90% Predictin Interval Average Length f BMA 90% Predictin Interval Days in Training Perid Days in Training Perid RMSE f BMA Deterministic Frecast Ignrance Scre Days in Training Perid Days in Training Perid Figure 6: Cmparisn f Training Perid Lengths fr Sea-Level Pressure: (a) Cverage f 90% predictin intervals. (b) Average width f 90% predictin intervals. (c) RMSE f BMA deterministic frecasts. (d) Ignrance scre. 15

21 Relative Frequency Rank Figure 7: Calibratin Rank Histgram fr Bayesian Mdel Averaging fr Sea-Level Pressure 16

22 Table 2: Cverage f Predictin Intervals fr Sea-Level Pressure (%) Interval 66.7% Interval 90% Interval Sample climatlgy Ensemble range 53.9 BMA Figure 8 shws the BMA weights fr the five ensemble members ver the evaluatin perid. The weights fr ETA-MM5 and NGM-MM5 were cnsistently lw thrughut, while the weights fr AVN-MM5 were cnsistently substantial. There seemed t be a tradeff between CMC-GEM-MM5 and NOGAPS-MM5: in the first half f the evaluatin perid CMC-GEM-MM5 had high weight and NOGAPS-MM5 had lw weight, while in the secnd half f the perid the ppsite was the case. The average weights f the five ensemble members ver the evaluatin perid were as fllws: AVN-MM5: 0.38; CMC-GEM-MM5: 0.16; ETA-MM5: 0.01; NGM-MM5: 0.01; NOGAPS-MM5: This suggests that the ETA-MM5 and NGM-MM5 frecasts were nt useful during this perid, relative t the ther three ensemble members. Hwever, the ETA-MM5 and NGM-MM5 frecasts were useful in January and February 2000, as is seen frm the BMA weights fr the Pwell River, B.C. frecasts in Table 1, which were issued well befre the evaluatin perid in the current experiment. Table 2 shws the cverage f varius predictin intervals. We included the predictin interval frm sample climatlgy, i.e. frm the marginal distributin f ur full dataset; this interval is the same fr each day and is useful as a baseline fr methds that use the numerical weather predictins. There were n strng seasnal effects in ur data and the distributin f sea-level pressure was rughly statinary in time, and spatial effects were a small part f verall variability, making it mre reasnable t use sample climatlgy as a baseline. The climatlgical frecast is f curse well calibrated, but at the expense f prducing very wide intervals, as we will see. The ensemble range is underdispersive, as we have already seen. The BMA intervals are clse t having the right cverage. Table 3 shws the average widths f the predictin intervals cnsidered. The ensemble range was much narrwer n average than the climatlgical 66.7% interval, but the price f this was that the ensemble range was far frm being a calibrated interval and was underdispersive. The BMA 66.7% interval was nt much wider n average than the ensemble range, and it is calibrated. The climatlgical and BMA 90% intervals were bth apprximately 17

23 AVN-MM5 frecast CMC-GEM-MM5 frecast BMA weight BMA weight Day index Day index ETA-MM5 frecast NGM-MM5 frecast BMA weight BMA weight Day index Day index NOGAPS-MM5 frecast BMA weight Day index Figure 8: Bayesian Mdel Averaging Weights fr the Five Mdels Over the Evaluatin Perid fr Sea-Level Pressure 18

24 Table 3: Average Width f Predictin Intervals fr Sea-Level Pressure Interval 66.7% Interval 90% Interval Sample climatlgy Ensemble range 3.9 BMA Table 4: RMSEs f Deterministic Frecasts fr Sea-Level Pressure Frecast RMSE Sample climatlgy 5.70 AVN-MM CMC-GEM-MM ETA-MM NGM-MM NOGAPS-MM Ensemble mean 2.72 BMA 2.57 calibrated, but the BMA intervals were ver 60% narrwer n average. Table 4 shws the RMSEs f the varius deterministic frecasts cnsidered ver the evaluatin perid. The numerical weather predictin frecasts perfrmed much better than the apprximate climatlgical frecast (with all frecasts equal t the sample mean), and amng these the AVN-MM5 frecast was best n average. The ensemble mean perfrmed better than any f the individual ensemble members, with RMSE 6% lwer than that fr the best ensemble member. This agrees with published results fr several ensembles, including the UW ensemble n which ur wrk is based (Grimit and Mass 2002). The BMA deterministic frecast given by (4) in turn perfrmed better than the ensemble mean, with RMSE 6% lwer than that f the ensemble mean, and was the best frecast cnsidered fr these data. c. Selecting Ensemble Members: Results fr a Reduced Ensemble Ensemble frecasts are very demanding in terms f cmputatinal time, and s it is imprtant that the members f the ensemble be carefully selected. Often the number f ensemble runs 19

25 Table 5: Cmparisn f BMA PDFs frm the Five-Member Ensemble and the Reduced Three-Member Ensemble fr Sea-Level Pressure 90% Predictin Interval Average Ignrance Cverage Width RMSE Scre 5-member ensemble 90.2% member ensemble 90.5% Nte: The ignrance scre is the mean negative lg predictive prbability f the bserved values. that can be dne by an rganizatin is limited, and large ensembles make demands n cmputer and persnnel resurces that culd be used fr ther purpses. Our apprach prvides a way f selecting ensemble members in situatins where the individual ensemble members cme frm different, identifiable surces. The BMA weights prvide a measure f the relative usefulness f the ensemble members, and s it wuld seem reasnable t cnsider eliminating ensemble members that cnsistently have lw weights. Over ur evaluatin perid, tw f the ensemble members, ETA-MM5 and NGM-MM5 had very lw weights n average, bth averaging One might then cnsider eliminating these tw members, and using instead a reduced three-member ensemble. Table 5 cmpares the results fr the five-member and the reduced three-member ensemble ver the evaluatin perid. They are almst indistinguishable, and the ignrance scre actually imprves slightly when the tw less useful members are remved. This wuld suggest that these members can be remved with little cst in terms f perfrmance, and the peratinal gain culd be cnsiderable. Befre making such a decisin, hwever, it wuld be necessary t study the BMA weights ver a lnger perid and fr all the variables and frecast hrizns f interest. variable might be useful fr thers. d. Results fr Surface Temperature Ensemble members that cntribute little t frecasting ne We nw briefly summarize the results fr 48-hur frecasts f surface temperature fr the same regin and time perid. These are qualitatively similar t thse fr sea-level pressure. The results fr chice f training perid are very similar t thse fr sea-level pressure, and again pint t a 40-day training perid as being best. The unit used is degrees Celsius. Table 6 shws the cverage f the varius predictin intervals. The ensemble range is 20

26 Table 6: Cverage f Predictin Intervals fr Surface Temperature (%) Interval 66.7% Interval 90% Interval Sample climatlgy Ensemble range 26.1 BMA Table 7: Average Width f Predictin Intervals fr Surface Temperature Interval 66.7% Interval 90% Interval Sample climatlgy Ensemble range 2.2 BMA strikingly underdispersed, mre than fr sea-level pressure, including nly 26.1% f the actual bservatins, as against the theretical 66.7% if the ensemble were well calibrated. BMA is well calibrated. Table 7 shws the average widths f the predictin intervals fr temperature. The ensemble range is very narrw but, as we have seen, this is at the cst f being prly calibrated. The BMA intervals are at least 66% shrter than the intervals frm sample climatlgy. Evidently, sample climatlgy is less useful as a baseline fr surface temperature than it is fr sea-level pressure, given seasnal and tpgraphic effects. Table 8 shws the RMSEs f the varius deterministic frecasts. The ensemble mean is as gd as the best single mdel frecast (again frm AVN-MM5), and the BMA deterministic frecast has an RMSE that is 11% lwer. 4 Discussin We have prpsed a new methd fr statistical pstprcessing f ensemble utput t prduce calibrated and sharp predictive PDFs. It is based n Bayesian mdel averaging, a statistically principled way f cmbining frecasts frm different mdels and analyses, and prvides a theretical explanatin f the empirical phenmenn f ensembles exhibiting a spreadskill relatinship while still being underdispersive. In ur case study, the BMA PDFs were much better calibrated than the ensemble itself, and prduced predictin intervals that were 21

27 Table 8: RMSEs f Deterministic Frecasts fr Surface Temperature Frecast RMSE Sample climatlgy 8.53 AVN-MM CMC-GEM-MM ETA-MM NGM-MM NOGAPS-MM Ensemble mean 3.31 BMA 2.94 much sharper than thse prduced by apprximate climatlgy. In additin, the BMA deterministic frecast had a lwer RMSE than any f the individual ensemble members, and als than the ensemble mean, althugh the latter was als better than any f the ensemble members. Our apprach uses bservatins and frecasts t estimate the BMA mdel fr a spatial regin, and is thus applicable t the prductin f prbabilistic frecasts n a grid. In ur experiment we applied it t the UW-MM5 ensemble s 12-km dmain, the Pacific Nrthwest, and it wuld seem desirable that the mdel be estimated separately fr different spatial regins. Clearly such regins shuld be fairly hmgeneus with respect t the variable being frecast, but precisely hw t determine them needs further research. We have used bservatins t estimate the mdel, but it wuld be pssible t d s als using an analysis, and this may be preferable in regins where there are few bservatinal assets. Our experiments here have been with shrt-range messcale prbabilistic frecasting frm multianalysis ensembles, but it wuld seem feasible als t apply the idea t ther situatins, including medium-range and synptic frecasting, and t perturbed bservatins, singular vectr, bred and pr man s ensembles. Our implementatin has been fr the situatin where the ensemble members cme frm clearly distinguishable surces. In ther cases, such as the current synptic NCEP and ECMWF ensembles, it may be mre apprpriate t cnsider sme r all f the ensemble members as being frm the same surce, and hence t treat them equally. This can be accmmdated within ur apprach with a small change in the mdel: fr ensemble members viewed as equivalent, the BMA weights w k in (2) wuld be 22

28 cnstrained t be equal. The EM algrithm can still be used, with a small mdificatin. In ur experiments we have assumed that the cnditinal densities g k (y f k ) in the BMA mdel (2) can reasnably be taken t be nrmal densities. This wrks well fr sea-level pressure, and in experiments that we summarized abve it als wrked well fr surface temperature data. Hwever, this may nt apply s directly t wind speed and precipitatin data, because they tend t have a psitive prbability f being equal t r very clse t zer, and because their distributin tends t be skewed. The BMA apprach itself culd be extended t these situatins by using a nn-nrmal cnditinal distributin g k (y f k ) in (2). It has been cmmn t mdel wind speed using a Weibull distributin and precipitatin using a Gamma distributin, and it may be necessary t augment these with a cmpnent representing a psitive prbability f being equal t zer. This can be dne within the framewrk f generalized linear mdels (McCullagh and Nelder 1989), and ne example f hw t mdel precipitatin in this way was given by Stern and Ce (1984). One way t imprve the perfrmance f this methd is t imprve the bias crrectin methd. In ur experiment we used a very simple linear bias crrectin methd. Mdel utput statistics (MOS) is the dminant apprach t bias crrectin, and may give imprved results (Wilks 1995). Appraches based n spatial and tempral neighbrhds have als been prpsed, fr example Eckel and Mass (2003) and Gel, Raftery, and Gneiting (2003b). Nte that t be useful in ur cntext, bias crrectin methds need t be applicable t grid-based frecasts and nt just t frecasts at bservatin sites. It is clear frm (2) that the MOS and neighbrhd bias crrectin methds mentined can be cmbined with BMA t prduce prbabilistic frecasts. Our methd prduces a predictive PDF fr ne lcatin, but it des nt reprduce the spatial crrelatin prperties f errr fields. Varius ways f creating ensembles f entire fields that reprduce the spatial crrelatin f the errr field have been prpsed fr the situatin where just ne numerical weather predictin mdel and initializatin is used (Hutekamer and Mitchell 1998; Hutekamer and Mitchell 2001; Gel, Raftery, and Gneiting 2003a). Such methds culd be cmbined with the present prpsal t prduce multimdel and/r multianalysis ensembles that reprduce spatial crrelatin f errr fields by creating ensembles f fields crrespnding t each ensemble member, and simulating a number f fields frm each f these ensembles that is prprtinal t the crrespnding BMA weight. Hamill and Clucci (1997, 1998) prpsed a statistical pstprcessing methd based n directly adjusting the prbabilities in the rank histgram; this methd was applied by Eckel and Walters (1998). This wrked well, but differs frm the present apprach in nt 23

29 being based n a principled prbabilistic mdel. Wilks (2002) prpsed t smth frecast ensembles by fitting mixtures f Gaussian distributins t ensemble data. This is related t the BMA apprach, but des nt accunt fr ensemble underdispersin. A different apprach t pstprcessing an ensemble called best member dressing has been prpsed by Rulstn and Smith (2003). This cnsists f identifying the best member f an ensemble fr each element f a histric recrd, finding the errr in that ensemble member frecast, finding the empirical distributin f such errrs, and then dressing each frecast in the ensemble with the empirical errr distributin fund in this way. Viewed in this way, BMA culd als be viewed as a way f dressing an ensemble f frecasts. The appraches differ in sme ways, hwever. The methd f Rulstn and Smith (2003) is designed fr the situatin where all the ensemble members can be treated equally, as exchangeable, and, fr example, it treats the ECMWF cntrl frecast in the same way as the ther 50 members f the ECMWF ensemble. In cntrast, BMA is applicable t the situatin where the ensemble members cme frm different, identifiable surces, but is als applicable t the exchangeable situatin, as we have nted. Fr example, BMA wuld allw different treatment f the cntrl and ther ECMWF ensemble members in a straightfrward way. Best member dressing is based n the assumptin that the best member can be identified with high prbability, and as such des nt take uncertainty abut the identificatin f the best member int accunt. Usually, hwever, there is cnsiderable uncertainty abut which is the best member. In cntrast, BMA takes accunt f this uncertainty thrugh the use f the mixture likelihd, and it is estimated explicitly by the ẑ kst that are prduced by the EM algrithm. BMA is designed t prduce prbabilistic frecasts, but as a byprduct it als prduces a deterministic frecast, and this utperfrmed all the ensemble members as well as the ensemble mean in ur experiment. It has als been prpsed that frecasts be cmbined using multiple linear regressin t prduce a single deterministic frecast r superensemble (van den Dl and Rukhvets 1994; Krishnamurti, Kishtawal, LaRw, Bachichi, Zhange, Willifrd, Gadgil, and Surendan 1999; Kharin and Zwiers 2002). It seems likely that BMA and regressin wuld give similar frecasts. Hwever, ne difference is that the weights in BMA are cnstrained t be psitive, whereas thse in regressin are nt; see, fr example, Tables 2, 4, 5 and 6 in van den Dl and Rukhvets (1994). Negative weights seem hard t interpret in this cntext; they imply that, all else equal, temperature (fr example) is predicted t be higher when the frecast with the negative weight is lwer. Stefanva and Krishnamurti (2002) have prpsed a way f using superensembles t estimate the prbability 24

30 f a dichtmus event. This des nt appear t apply t estimating the PDFs f cntinuus weather quantities, and the prblem f interpreting negative cefficients cntinues t apply in this case. Acknwledgements: The authrs are grateful t Mark Albright, Eric Grimit and Cliffrd Mass fr helpful discussins and useful cmments, and fr prviding data. This research was supprted by the DD Multidisciplinary University Research Initiative (MURI) prgram administered by the Office f Naval Research under Grant N References Casella, G. and R. L. Berger (2001). Statistical Inference (2nd ed.). Brks Cle. Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihd frm incmplete data via the EM algrithm. Jurnal f the Ryal Statistical Sciety, Series B 39, Eckel, F. and C. F. Mass (2003). Twards an effective shrt-range ensemble frecast system. In Prceedings f the Wrkshp n Ensemble Frecasting, Val-Mrin, Québec. Available at hamill/ef wrkshp 2003 schedule.html. Eckel, F. and M. Walters (1998). Calibrated prbabilistic quantitative precipitatin frecasts based n the MRF ensemble. Weather and Frecasting 13, Epstein, E. S. (1969). Stchastic dynamic predictin. Tellus 21, Fisher, R. A. (1922). On the mathematical fundatins f theretical statistics. Philsphical Transactins f the Ryal Sciety f Lndn, Series A 222, Gel, Y., A. E. Raftery, and T. Gneiting (2003a). Calibrated prbabilistic messcale weather field frecasting: The gestatistical utput perturbatin (GOP) methd. Technical Reprt 427, Department f Statistics, University f Washingtn. Available at Gel, Y., A. E. Raftery, and T. Gneiting (2003b). Cmbining lcal and glbal gridbased bias remval fr messcale numerical weather predictin mdels. Unpublished manuscript, Department f Statistics, University f Washingtn. Gneiting, T., A. E. Raftery, F. Balabdaui, and A. Westveld (2003). Verifying prbabilistic frecasts: Calibratin and sharpness. In Prceedings f 25

Using Bayesian Model Averaging to Calibrate Forecast Ensembles

Using Bayesian Model Averaging to Calibrate Forecast Ensembles Using Bayesian Mdel Averaging t Calibrate Frecast Ensembles Adrian E. Raftery, Fadua Balabdaui, Tilmann Gneiting and Michael Plakwski Department f Statistics, University f Washingtn, Seattle, Washingtn