Ensemble Bayesian model averaging using Markov Chain Monte Carlo sampling


DOI 10.1007/s

ORIGINAL ARTICLE

Ensemble Bayesian model averaging using Markov Chain Monte Carlo sampling

Jasper A. Vrugt · Cees G. H. Diks · Martyn P. Clark

Received: 7 June 2008 / Accepted: 25 September 2008
© Springer Science+Business Media B.V. 2008

Abstract Bayesian model averaging (BMA) has recently been proposed as a statistical method to calibrate forecast ensembles from numerical weather models. Successful implementation of BMA, however, requires accurate estimates of the weights and variances of the individual competing models in the ensemble. In their seminal paper (Mon Weather Rev 133, 2005), Raftery et al. recommended the Expectation Maximization (EM) algorithm for BMA model training, even though global convergence of this algorithm cannot be guaranteed. In this paper, we compare the performance of the EM algorithm and the recently developed DiffeRential Evolution Adaptive Metropolis (DREAM) Markov Chain Monte Carlo (MCMC) algorithm for estimating the BMA weights and variances. Simulation experiments using 48-hour ensemble data of surface temperature and multi-model streamflow forecasts show that both methods produce similar results, and that their performance is unaffected by the length of the training data set. However, MCMC simulation with DREAM is capable of efficiently handling a wide variety of BMA predictive distributions, and provides useful information about the uncertainty associated with the estimated BMA weights and variances.

Keywords Bayesian model averaging · Markov Chain Monte Carlo · Maximum likelihood · DiffeRential Evolution Adaptive Metropolis · Temperature forecasting · Streamflow forecasting

J. A. Vrugt (B) Los Alamos National Laboratory (LANL), Center for NonLinear Studies (CNLS), Mail Stop B258, Los Alamos, NM 87545, USA; vrugt@lanl.gov
C. G. H. Diks, Center for Nonlinear Dynamics in Economics and Finance, University of Amsterdam, Amsterdam, The Netherlands
M. P. Clark, NIWA, P.O. Box 862, Riccarton, Christchurch, New Zealand

1 Introduction and scope

During the last decade, multi-model ensemble prediction systems have become the basis for probabilistic weather and climate forecasts at many operational centers throughout the world [2,1,16,18]. Multi-model ensemble predictions aim to capture several sources of uncertainty in numerical weather forecasts, including uncertainty about the initial conditions, lateral boundary conditions and model physics, and have convincingly demonstrated improvements to numerical weather and climate forecasts and production of more skilful estimates of forecast probability density functions (pdfs) [4,8,9,13,15,2,21]. However, because the current generation of ensemble systems does not explicitly account for all sources of forecast uncertainty, some form of postprocessing is necessary to provide predictive ensemble pdfs that are meaningful and can be used to provide accurate forecasts [7,9,12,2,22,29]. Ensemble Bayesian model averaging (BMA) has recently been proposed by Raftery et al. [2] as a statistical postprocessing method for producing probabilistic forecasts from ensembles. The BMA predictive pdf of any future weather quantity of interest is a weighted average of pdfs centered on the bias-corrected forecasts from a set of individual models. The weights are the estimated posterior model probabilities, representing each model's relative forecast skill in the training period. Studies applying the BMA method to a range of different forecasting problems have demonstrated that BMA produces more accurate and reliable predictions than other available multi-model techniques [1,17,19,2,23,3]. Successful implementation of BMA, however, requires estimates of the weights and variances of the individual competing models in the ensemble. In their seminal paper, Raftery et al. [2] recommend the use of the Expectation Maximization (EM) algorithm to estimate the BMA weights and variances. The advantages of the EM algorithm are not difficult to enumerate.
The method is relatively easy to implement, computationally efficient, and the algorithmic steps are designed in such a way that they always satisfy the constraint that the BMA weights are positive and add up to one. Various contributions to the literature have indeed demonstrated that the EM algorithm works well, and provides robust estimates of the weights and variances of the individual ensemble members at little computational cost. Notwithstanding this progress, convergence of the EM algorithm to the globally optimal BMA weights and variances cannot be guaranteed, and becomes especially problematic for high-dimensional problems (ensembles consisting of many different member forecasts). Also, algorithmic modifications are required to adapt the EM method for predictive distributions other than the normal distribution. This will be necessary for variables such as precipitation, and all other quantities primarily driven by precipitation, such as streamflow. In this paper, we propose using Markov Chain Monte Carlo (MCMC) simulation to estimate the most likely values of the BMA weights and variances, and their underlying posterior pdf. This approach overcomes some of the limitations of the EM approach and has three distinct advantages. First, MCMC simulation does not require algorithmic modifications when using different conditional probability distributions for the individual ensemble members. Second, MCMC simulation provides a full view of the posterior distribution of the BMA weights and variances. This information is helpful to assess the usefulness of individual ensemble members. A small ensemble has important computational advantages, since it requires calibrating and running the smallest possible number of models. Finally, MCMC simulation can handle a relatively high number of BMA parameters, allowing large ensemble sizes containing the predictions of many different constituent models.
We use the recently developed DiffeRential Evolution Adaptive Metropolis (DREAM) MCMC algorithm. This adaptive proposal updating method has recently been introduced [27] and maintains detailed balance and ergodicity while showing excellent efficiency on complex, multi-modal and high-dimensional sampling problems. To illustrate our approach, we use two different data sets: (a) 48-h forecasts of surface temperature and sea level pressure in the Pacific Northwest in January–June 2000, obtained using the University of Washington mesoscale short-range ensemble system [1], and (b) multi-model streamflow forecasts for the Leaf River basin in Mississippi [24].

The remainder of this paper is organized as follows. In Sect. 2, we briefly describe the BMA method, and Sect. 3 provides a short description of the EM and DREAM algorithms for estimating the values of the BMA weights and variances. In Sect. 4, we compare the results of both algorithms for the two data sets. Here we are especially concerned with algorithm robustness and flexibility. Finally, a summary with conclusions is presented in Sect. 5.

2 Ensemble Bayesian model averaging (BMA)

To explicate the BMA method, let f = (f_1, ..., f_K) denote an ensemble of predictions obtained from K different models, and let Δ be the quantity of interest. In BMA, each ensemble member forecast, f_k, k = 1, ..., K, is associated with a conditional pdf, g_k(Δ | f_k), which can be interpreted as the conditional pdf of Δ given f_k, given that f_k is the best forecast in the ensemble. The BMA predictive model for dynamic ensemble forecasting can then be expressed as a finite mixture model [19,2]

p(Δ | f_1, ..., f_K) = Σ_{k=1}^{K} w_k g_k(Δ | f_k),    (1)

where w_k denotes the posterior probability of forecast k being the best one. The w_k's are nonnegative and add up to one, and they can be viewed as weights reflecting an individual model's relative contribution to predictive skill over the training period. The original BMA method described in [19,2] assumes that the conditional pdfs, g_k(Δ | f_k), of the different ensemble members can be approximated by a normal distribution centered at a linear function of the original forecast, a_k + b_k f_k, with standard deviation σ:

Δ | f_k ~ N(a_k + b_k f_k, σ²).    (2)

The values for a_k and b_k are bias-correction terms that are derived by simple linear regression of Δ on f_k for each of the K ensemble members. The BMA predictive mean can be computed as

E[Δ | f_1, ..., f_K] = Σ_{k=1}^{K} w_k (a_k + b_k f_k),    (3)

which is a deterministic forecast whose predictive performance can be compared with the individual forecasts in the ensemble, or with the ensemble mean. If we denote space and time with subscripts s and t respectively, so that f_kst is the kth forecast in the ensemble for location s and time t, the associated variance of Eq. 3 can be computed as [2]

var[Δ_st | f_1st, ..., f_Kst] = Σ_{k=1}^{K} w_k ( (a_k + b_k f_kst) − Σ_{l=1}^{K} w_l (a_l + b_l f_lst) )² + σ².    (4)

The variance of the BMA prediction defined in Eq. 4 consists of two terms, the first representing the between-model ensemble spread, and the second representing the within-ensemble forecast variance.
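As an illustrative sketch (not the authors' code), Eqs. 3 and 4 for a single location and time can be written as follows, assuming bias-corrected normal kernels with a common variance σ²:

```python
import numpy as np

def bma_mean_and_variance(f, w, a, b, sigma2):
    """BMA deterministic forecast (Eq. 3) and predictive variance (Eq. 4).

    f : (K,) raw ensemble forecasts for one location and time
    w : (K,) BMA weights (nonnegative, summing to one)
    a, b : (K,) linear bias-correction terms
    sigma2 : common within-model variance
    """
    fc = a + b * f                          # bias-corrected forecasts
    mean = np.sum(w * fc)                   # Eq. 3: weighted mean
    spread = np.sum(w * (fc - mean) ** 2)   # between-model spread
    return mean, spread + sigma2            # Eq. 4: total predictive variance
```

When all bias-corrected forecasts coincide, the between-model spread vanishes and the predictive variance reduces to σ² alone, matching the two-term decomposition described above.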

In this development, it is assumed that the pdf from each model at location s and time t can be approximated as a normal distribution with mean equal to the deterministic prediction, f_k, and variance σ². This formulation assumes a constant variance across models, which requires estimation of K + 1 parameters. An alternative approach is to vary σ² across models. The last term on the right-hand side of Eq. 4, σ², then needs to be replaced with Σ_{k=1}^{K} w_k σ_k². This increases the number of parameters to be estimated from K + 1 to 2K. The assumption of a normal distribution for the BMA predictive distribution of the individual ensemble members works well for weather quantities whose predictive pdfs are approximately normal, such as temperature and sea-level pressure. For other variables such as precipitation, the normal distribution is not appropriate and other conditional pdfs need to be implemented [23]. The second case study reported in this paper considers the gamma distribution for streamflow forecasting; further details on how to implement this distribution are given in that case study.

3 Computation of the BMA weights and variances

Successful implementation of the BMA method described in the previous section requires estimates of the weights and variances of the pdfs of the individual forecasts. Following [2], we estimate the values for w_k, k = 1, ..., K, and σ² by maximum likelihood (ML) from a calibration data set. Assuming independence of forecast errors in space and time, the log-likelihood function, ℓ, for the BMA predictive model defined in Eq. 1 is

ℓ(w_1, ..., w_K, σ² | f_1, ..., f_K, Δ) = Σ_{s,t} log( Σ_{k=1}^{K} w_k g_k(Δ_st | f_kst) ),    (5)

where the sum over (s, t) runs over the n measurements in the training data set. For reasons of numerical stability we optimize the log-likelihood function rather than the likelihood function itself. Unfortunately, no analytical solutions exist that conveniently maximize Eq. 5.
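For concreteness, Eq. 5 with normal conditional pdfs can be evaluated numerically as in the following sketch (our illustration; the small floor on the mixture density guards against log(0)):

```python
import numpy as np

def bma_log_likelihood(w, sigma2, F, y):
    """Log-likelihood of Eq. 5 with normal conditional pdfs.

    w : (K,) BMA weights
    sigma2 : common variance of the normal kernels
    F : (n, K) bias-corrected ensemble forecasts
    y : (n,) verifying observations
    """
    var = max(sigma2, 1e-300)
    dens = np.exp(-0.5 * (y[:, None] - F) ** 2 / var) / np.sqrt(2 * np.pi * var)
    mix = dens @ w                                   # inner sum over members (Eq. 1)
    return np.sum(np.log(np.maximum(mix, 1e-300)))   # outer sum over (s, t)
```

This function is exactly what an iterative optimizer (EM or MCMC) needs to evaluate repeatedly for candidate values of the weights and variance.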
Instead, we need to resort to iterative techniques to find the ML values for the BMA weights and variances. In the next two sections, we discuss two different methods to obtain the ML estimates of w_k, k = 1, ..., K, and σ².

3.1 The Expectation Maximization (EM) algorithm

The Expectation Maximization algorithm is a broadly applicable approach to the iterative computation of ML estimates, useful in a variety of incomplete-data problems [3,14]. To implement the EM algorithm for the BMA method we use the unobserved quantity z_kst, where this variable has value 1 if ensemble member k is the best forecast at location s and time t, and value 0 otherwise. Hence, for each observation only one of {z_1st, ..., z_Kst} is equal to 1; all the others are zero. After initialization of the weights and variances of the individual ensemble members, the EM algorithm alternates iteratively between an expectation and a maximization step until convergence is achieved. In the expectation step, the values of z_kst are re-estimated given the current values for the weights and variances according to

ẑ_kst^(j) = w_k g(Δ_st | f_kst, σ^(j−1)) / Σ_{l=1}^{K} w_l g(Δ_st | f_lst, σ^(j−1)),    (Expectation step)  (6)

where the superscript j refers to the iteration number, and g(Δ_st | f_kst, σ^(j−1)) is the conditional pdf of ensemble member k centered at observation Δ_st. Here we implement a normal density with mean f_kst and standard deviation σ^(j−1). In the subsequent maximization step, the values for the weights and variances are updated using the current estimates of z_kst, ẑ_kst^(j):

w_k^(j) = (1/n) Σ_{s,t} ẑ_kst^(j),
σ²^(j) = (1/n) Σ_{s,t} Σ_{k=1}^{K} ẑ_kst^(j) (Δ_st − f_kst)².    (Maximization step)  (7)

By iterating between Eqs. 6 and 7, the EM algorithm iteratively improves the values of w_k, k = 1, ..., K, and σ². Convergence of the BMA weights and variances is achieved when the changes in ℓ(w_1, ..., w_K, σ² | f_1, ..., f_K, Δ), w_k, k = 1, ..., K, σ², and ẑ_kst^(j) become smaller than some predefined tolerance values for each of these individual quantities. This is similar to what [2] proposed. The EM method exhibits many desirable properties: it is relatively easy to implement, computationally efficient, and the maximization step in Eq. 7 is designed such that the weights are always positive and add up to one, Σ_{k=1}^{K} w_k = 1. Yet, because a single search trajectory is used, the EM algorithm is prone to getting stuck in a locally optimal solution. Of course, multiple different trajectories can be used, but this is a major source of inefficiency because each of these individual optimization trials operates completely independently of the others, with no sharing of information [5]. Another disadvantage of the EM method is that it requires different analytical solutions for different formulations of g_k(Δ | f_k). For instance, it might be necessary to replace the normal distribution in Eq. 2 by a more appropriate distribution for variables such as streamflow and precipitation that are known to be highly skewed [23,24]. Finally, the EM method does not provide any information about the uncertainty associated with the final optimized BMA weights and variances.
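The E-step and M-step of Eqs. 6 and 7 can be sketched as follows for normal conditional pdfs (an illustration with our own initialization and stopping choices, not the exact scheme of [2]):

```python
import numpy as np

def em_bma(F, y, iters=500, tol=1e-8):
    """EM estimation of BMA weights and common variance (Eqs. 6-7).

    F : (n, K) bias-corrected ensemble forecasts
    y : (n,) verifying observations
    """
    n, K = F.shape
    w = np.full(K, 1.0 / K)                       # start from equal weights
    s2 = np.var(y - F.mean(axis=1)) + 1e-6        # crude initial variance
    for _ in range(iters):
        # E-step (Eq. 6): posterior membership probabilities
        dens = np.exp(-0.5 * (y[:, None] - F) ** 2 / s2) / np.sqrt(2 * np.pi * s2)
        num = w * dens
        z = num / num.sum(axis=1, keepdims=True)
        # M-step (Eq. 7): weights are always nonnegative and sum to one
        w_new = z.mean(axis=0)
        s2_new = np.sum(z * (y[:, None] - F) ** 2) / n
        converged = np.max(np.abs(w_new - w)) < tol and abs(s2_new - s2) < tol
        w, s2 = w_new, s2_new
        if converged:
            break
    return w, s2
```

On synthetic data in which one member tracks the observations much more closely than the others, the estimated weight of that member approaches one, as expected from the interpretation of w_k as the probability of being the best forecast.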
Such uncertainty estimates are useful not only to assess BMA model uncertainty, but also to understand the importance of individual ensemble members. A search algorithm that can efficiently handle a wide variety of different conditional distributions, g_k(Δ | f_k), and simultaneously provides estimates of BMA parameter uncertainty is therefore desirable.

3.2 Markov Chain Monte Carlo sampling with DREAM

Another approach to estimate the BMA weights and variances is MCMC simulation. Unlike EM, MCMC simulation uses multiple different trajectories (also called Markov chains) simultaneously to sample w_k, k = 1, ..., K, and σ² preferentially, based on their weight in the likelihood function. Vrugt et al. [27,28] recently presented a novel adaptive MCMC algorithm to efficiently estimate the posterior pdf of parameters in complex, high-dimensional sampling problems. This method, entitled DREAM, runs multiple chains simultaneously for global exploration, and automatically tunes the scale and orientation of the proposal distribution during the evolution to the posterior distribution. This scheme is an adaptation of the Shuffled Complex Evolution Metropolis [25] global optimization algorithm and has the advantage of maintaining detailed balance and ergodicity while showing excellent efficiency on complex, highly nonlinear, and multimodal target distributions [27]. In DREAM, N different Markov chains are run simultaneously in parallel. If the state of a single chain is given by a d-dimensional vector θ = {w_1, ..., w_K, σ²}, then at

each generation the N chains in DREAM define a population, which corresponds to an N × d matrix, with each chain as a row. Jumps in each chain i = 1, ..., N are generated by taking a fixed multiple of the differences of randomly chosen pairs of other members (chains) of the population (selected without replacement):

ϑ^i = θ^i + γ(δ) Σ_{j=1}^{δ} θ^{r(j)} − γ(δ) Σ_{n=1}^{δ} θ^{r(n)} + e,    (8)

where δ signifies the number of pairs used to generate the proposal, and r(j), r(n) ∈ {1, ..., N−1} with r(j) ≠ r(n). The Metropolis ratio is used to decide whether to accept candidate points or not. This series of operations results in an MCMC sampler that conducts a robust and efficient search of the parameter space. Because the joint pdf of the N chains factorizes to π(θ^1) × ... × π(θ^N), the states θ^1, ..., θ^N of the individual chains are independent at any generation after DREAM has become independent of its initial value. After this so-called burn-in period, the convergence of a DREAM run can thus be monitored with the R-statistic of [6]. At every step, the points in the population contain the most relevant information about the search, and this population of points is used to globally share information about the progress of the search of the individual chains. This information exchange enhances the survivability of individual chains, and facilitates adaptive updating of the scale and orientation of the proposal distribution. Note that the method maintains pointwise detailed balance, as the pair (θ^{r1}_{t−1}, θ^{r2}_{t−1}) is as likely as (θ^{r2}_{t−1}, θ^{r1}_{t−1}), and the value of e ~ N_d(0, b) is drawn from a symmetric distribution with small b. Following the guidelines of the Random Walk Metropolis algorithm, the optimal choice of γ is 2.4/√(2δd). Every 5th generation, γ = 1.0 to facilitate jumping between different disconnected modes [27]. To gear the search towards the high-density region of the parameter space, DREAM does not require the unobserved quantity z_kst used in the EM algorithm, but directly evaluates ℓ(w_1, ..., w_K, σ² | f_1, ..., f_K, Δ) defined in Eq.
5 for given values of w_k, k = 1, ..., K, and σ². There are several advantages to adopting this approach. First, the formulation of ℓ(·) is similar for different choices of g_k(Δ | f_k), requiring no modifications to the BMA software if conditional probability distributions other than the normal are used for the individual ensemble members. Second, MCMC simulation with DREAM provides a full view of the posterior distribution of the BMA weights and variances. Finally, MCMC simulation with DREAM can handle a relatively high number of BMA parameters, allowing for large ensemble sizes containing the predictions of many different constituent models. The only potential drawback is that MCMC simulation generally requires many more function evaluations than the EM algorithm to estimate w_k, k = 1, ..., K, and σ². This can be especially cumbersome for ensembles that contain probabilistic forecasts of individual members calibrated through stochastic optimization. To date, most multi-model forecasting applications presented in the literature consider ensemble sizes that are relatively small, typically on the order of K = 5 to K = 20. The methodology presented herein should therefore have widespread applicability.

4 Case studies

We compare the robustness and efficiency of the EM and DREAM algorithms for two different case studies. The first case study considers the 48-h forecasts of surface temperature and sea level pressure in the Pacific Northwest in January–June 2000 using the University of

Washington (UW) mesoscale short-range ensemble system. This is a five-member multianalysis ensemble (hereafter referred to as the UW ensemble) consisting of different runs of the fifth-generation Pennsylvania State University–National Center for Atmospheric Research Mesoscale Model (MM5), in which initial conditions are taken from different operational centers. In the second case study, we apply the BMA method to probabilistic streamflow forecasting using a 36-year calibrated multi-model ensemble of daily streamflow forecasts for the Leaf River basin in Mississippi. This study deals with a variable that is highly skewed and not well described by a normal distribution. In both studies, the EM and DREAM algorithms were allowed a maximum of 15,000 log-likelihood function evaluations. In the first case study, uniform prior distributions of w ∈ [0, 1]^K and σ² ∈ [0, 3 var(Δ)] were used for the BMA weights and variance, respectively. The log-likelihood function in Eq. 5 then equates directly to the posterior density function. In case study 2, we again used w ∈ [0, 1]^K for the BMA weights, but the gamma shape and scale parameters were sampled between α, β ∈ [0, 1]. These initial ranges were established through trial-and-error, encompass the posterior pdf, and have been shown to work well for a range of problems.

4.1 UW ensemble forecasts of temperature and sea level pressure

We consider the 48-h forecasts of surface temperature and sea level pressure using data from the UW ensemble consisting of forecasts and observations in the 0000 UTC cycle from 12 January to 9 June 2000. Following [2], a 25-day training period between 16 April and 9 June 2000 is used for BMA model calibration, whereas the remainder of the data set (12 January–16 April) is used to evaluate the model performance. For some days the data were missing, so that the number of calendar days spanned by the training data set is larger than the number of training days used.
Note that the performance statistics for the evaluation period were derived without a sliding-window training approach, in contrast to what was done in [2]. The emphasis of this paper is on the calibration performance of the EM and DREAM algorithms for the training data period rather than the evaluation data period. The evaluation period is simply used to test the statistical consistency of the calibrated BMA weights and variances. Figure 1 presents histograms of the DREAM derived posterior marginal probability density functions of the BMA weights and variance of the individual ensemble members for the 25-day training period for surface temperature (panels a–f) and sea level pressure (panels g–l). The DREAM algorithm generated N = 10 parallel sequences, each with 1,500 samples. The first 1,000 samples in each chain are used for burn-in, leaving a total of 5,000 samples to draw inferences from. The optimal values derived with the EM algorithm are separately indicated in each panel with the 'x' symbol. The results in Fig. 1 generally show excellent agreement between the modes of the histograms derived with MCMC simulation and the ML estimates of the EM algorithm, which lie within this high-density region. Previous applications of the EM algorithm are therefore likely to have yielded robust parameters for the UW ensemble data set. However, DREAM retains a desirable feature: it not only correctly identifies the ML values of the parameters, but simultaneously also samples the underlying probability distribution. This is helpful information to assess the usefulness of individual ensemble members in the BMA prediction, and the correlation among ensemble members. For instance, most of the histograms exhibit an approximately Gaussian distribution with relatively small dispersion around their modes. This indicates that there is high confidence in the weights applied to each of the individual models.
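The parallel-chain jump of Eq. 8 can be sketched as follows (a simplified illustration of the differential-evolution proposal only; the full DREAM algorithm adds crossover, outlier handling, and the periodic γ = 1.0 jumps described in Sect. 3.2):

```python
import numpy as np

rng = np.random.default_rng(0)

def de_proposal(theta, i, delta=1, gamma=None, b=1e-6):
    """Differential-evolution jump of Eq. 8 for chain i.

    theta : (N, d) current states of the N parallel chains.
    delta pairs of other chains are drawn without replacement; gamma
    defaults to the Random Walk Metropolis guideline 2.4 / sqrt(2*delta*d).
    """
    N, d = theta.shape
    if gamma is None:
        gamma = 2.4 / np.sqrt(2 * delta * d)
    others = [j for j in range(N) if j != i]
    idx = rng.choice(others, size=2 * delta, replace=False)
    r1, r2 = idx[:delta], idx[delta:]
    e = rng.normal(0.0, b, size=d)   # small symmetric perturbation, e ~ N_d(0, b)
    return theta[i] + gamma * (theta[r1].sum(axis=0) - theta[r2].sum(axis=0)) + e
```

The candidate point ϑ^i returned here would then be accepted or rejected with the Metropolis ratio against the BMA log-likelihood of Eq. 5. Because the jump is built from differences of the chains themselves, its scale and orientation adapt automatically to the current shape of the sampled population.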

[Figure 1 omitted: histogram panels of the posterior density of w_1, ..., w_5 and σ² for the UW ensemble members AVN (NCEP), GEM (CMC), ETA (NCEP), NGM (NCEP) and NOGAPS (FNMOC).]
Fig. 1 Marginal posterior pdfs of the DREAM derived BMA weights and variance of the individual ensemble members for the surface temperature (a–f) and sea level pressure (g–l) data sets. The EM derived solution is separately indicated in each panel with the symbol 'x'

[Figure 2 omitted: trace plots of sampled values in three Markov chains over 0–1,500 samples per panel.]
Fig. 2 Transitions of the value of the BMA weight of ensemble members AVN (a), ETA (b), NGM (c), GEM (d), NOGAPS (e), and the BMA variance (f) in three different parallel Markov chains during the evolution of the DREAM algorithm to the posterior target distribution. To facilitate comparison, the maximum likelihood estimates computed with the EM algorithm are separately indicated at the right-hand side of each panel

The evolution of the sampled BMA weights, w_k, k = 1, ..., K, of the (a) AVN (NCEP), (b) ETA (NCEP), (c) NGM (NCEP), (d) GEM (CMC), and (e) NOGAPS (FNMOC) ensemble members and the BMA variance, σ² (f), to the posterior distribution for the 25-day surface temperature data set is shown in Fig. 2. We randomly selected three different Markov chains, and coded each of them with a different color and symbol. The 1D scatter plots of the sampled parameter space indicate that during the initial stages of the evolution the different sequences occupy different parts of the parameter space, resulting in a relatively high value for the R-statistic (not shown). After about 250 draws in each individual Markov chain, the three trajectories settle down in approximately the same region of the parameter space and successively visit solutions stemming from a stable distribution. This demonstrates convergence to a limiting distribution.

To explore whether the performance of the EM and DREAM sampling methods is affected by the length of the training data set, we sequentially increased the length of the training set, considering training periods of lengths 5, 10, 15, 20 and 25 days. For each fixed length, we ran both methods 3 different times, using a randomly sampled (without replacement) calibration period from the original 25-day training data set.
Each time, a bias correction was first applied to the ensemble using simple linear regression of Δ_st on f_kst for the training data set. Figure 3 presents the outcome of the 3 individual trials as a function of the length of the training set. The top panels (Fig. 3a–c) present the results for the surface temperature data set, while the bottom panels (Fig. 3d–f) depict the results for the sea level pressure data set. The solid lines denote the Root Mean Square Error (RMSE) of the forecast error of the BMA predictive mean (blue) and the average width of the associated 95% prediction uncertainty intervals (green) obtained with the EM algorithm, whereas the square symbols represent their DREAM derived counterparts. The panels distinguish between the calibration and evaluation periods.

[Figure 3 omitted: panels show, as a function of the length of the training set [d], the ratio of the maximum likelihood of EM to DREAM (a, d), and the RMSE and average interval width for calibration and evaluation ([K] for surface temperature, b–c; [hPa] for sea level pressure, e–f).]
Fig. 3 Comparison of training period lengths for surface temperature and sea level pressure. The reported results represent averages of 3 independent trials with randomly selected training data sets from the original 25-day data set. Solid lines in panels b, c, e and f denote the EM derived forecast error of the BMA predictive mean (blue), and associated width of the prediction intervals (green) for the calibration (left side) and evaluation period (right side), respectively; square symbols are DREAM derived counterparts

The results presented here highlight several important observations. First, the ratio between the ML values derived with the EM and DREAM algorithms closely approximates 1, and appears to be rather unaffected by the length of the training set. This provides strong empirical evidence that the performance of the EM and MCMC sampling methods is quite similar, and not affected by the length of the training data set. A ratio significantly deviating from 1 would indicate inconsistent results, where one of the two methods would consistently find better estimates of w_k, k = 1, ..., K, and σ². Second, the RMSE of the BMA deterministic (mean) forecast (indicated in blue) generally increases with increasing length of the calibration period. This is to be expected, as longer calibration time series are likely to contain a larger variety of weather events.
Third, notice that the average width of the BMA 95% prediction uncertainty intervals (indicated in green) increases with the number of training days. This again has to do with the larger diversity of weather events observed in longer calibration time series. Finally, the BMA forecast pdf derived through MCMC simulation with DREAM is sharper (less spread) than its counterpart estimated with the EM method. This is consistent with the results depicted in Fig. 1, which show larger ML values of σ² estimated with the EM method. To summarize these results, it seems that there are small advantages to using MCMC sampling for BMA model training. The DREAM algorithm finds similar values of the log-likelihood function as the multi-start EM algorithm for both the UW surface temperature and sea level pressure data sets. However, MCMC simulation with DREAM yields

BMA forecast pdfs that are slightly sharper, and provides estimates of the parameter uncertainty of w_k, k = 1, ..., K, and σ².

4.2 Probabilistic ensemble streamflow forecasting

In this study, we apply the BMA method to probabilistic ensemble streamflow forecasting using historical data from the Leaf River watershed (1,950 km²) located north of Collins, Mississippi. In a previous study [24] we generated a 36-year ensemble of daily streamflow forecasts using eight different conceptual watershed models. In this study, the first eight years on record (WY ) are used for model evaluation purposes, whereas the other 28 years (WY ) of the available record are used for model training purposes. To discern the probabilistic properties of the streamflow ensemble, consider Fig. 4, which presents the deterministic forecasts of the individual, calibrated models of the ensemble for a representative period between 3 September and 6 June. Note that the spread of the ensemble generally brackets the observations (solid circles). This is a desirable characteristic and a prerequisite to facilitate accurate ensemble streamflow forecasting with the BMA method. Because the individual hydrologic models were calibrated first using the training data set and the SCE-UA algorithm, a linear bias correction of the individual ensemble members was deemed unnecessary prior to optimization of the BMA weights and variance. Moreover, as argued in [24], a (global) linear bias correction, as suggested by [2], is too simple to be useful in hydrologic modeling. Such an approach is essentially not powerful enough to remove heteroscedastic and non-Gaussian model errors from the individual forecasts in the ensemble. For more information about the watershed models used and the generation of the streamflow ensemble, please refer to [24]. The original development of BMA by [2] was for weather quantities whose predictive pdfs are approximately normal, such as temperature and sea-level pressure.

[Figure 4 omitted: rainfall hyetograph (precipitation [mm/day], top) and ensemble streamflow predictions [m³/s] against day of WY 1953 (bottom).]
Fig. 4 Rainfall hyetograph (top panel) and streamflow predictions of the individual models of the BMA ensemble for a representative portion of the historical period (bottom panel). The solid circles represent the verifying streamflow observations

However, one would not expect streamflow data, and more generally any other quantity primarily driven by precipitation, to be normally distributed at short time scales. This is mainly

because the distribution of daily streamflow is highly skewed. The normal distribution does not fit data of this kind, and we must therefore modify the conditional pdf, g_k(Δ | f_k), in Eq. 5. We here implement the gamma distribution. This distribution with shape parameter α and scale parameter β has the pdf

g_k(Δ | f_k) = Δ^(α−1) exp(−Δ/β) / (β^α Γ(α)),    (9)

for Δ > 0, and g_k(Δ | f_k) = 0 for Δ ≤ 0. The mean of this distribution is µ = αβ, and its variance is σ² = αβ². We derive α_k = µ_k²/σ_k² and β_k = σ_k²/µ_k of the gamma distribution from the original forecast, f_k, of the individual ensemble members through µ_k = f_k and σ_k² = b f_k + c. The gamma distribution is thus centered on the individual forecasts, with an associated variance that is heteroscedastic and depends directly on the actual streamflow prediction. We estimate the ML values for w_k, k = 1, ..., K; b; and c using the EM and DREAM algorithms and the 28-year training ensemble. When implementing the EM algorithm, there are no analytical solutions for the ML estimates of b and c, so we estimated them numerically at each iteration using the Simplex algorithm, optimizing the likelihood function in Eq. 5 using the current guesses of w_k, k = 1, ..., K, derived in Eq. 7.

Figure 5 depicts the evolution of w_k, k = 1, ..., K; b and c to the posterior target distribution using MCMC sampling with DREAM. We plotted three different parallel Markov chains, and coded each of them with a different symbol and color. The ML values derived with the EM algorithm are separately indicated in each panel. The individual Markov chains mix well and require about 200 samples in each sequence to locate the high probability density region of the parameter space for this 10-dimensional estimation problem. This indicates the relative efficiency of DREAM. Also notice that the posterior pdf does not narrow greatly, especially for the BMA weights of the GR4J, HBV and SAC-SMA models.
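The heteroscedastic gamma kernel of Eq. 9 can be sketched as follows (our illustration; the forecast f_k and parameters b, c must satisfy f_k > 0 and b·f_k + c > 0):

```python
import math

def gamma_conditional_pdf(y, f_k, b, c):
    """Gamma conditional pdf g_k(y | f_k) of Eq. 9.

    The kernel is centered on the forecast through mu_k = f_k, with
    heteroscedastic variance sigma_k^2 = b * f_k + c, so that
    alpha_k = mu_k^2 / sigma_k^2 and beta_k = sigma_k^2 / mu_k.
    """
    if y <= 0:
        return 0.0                 # g_k(y | f_k) = 0 for y <= 0
    var = b * f_k + c
    alpha = f_k ** 2 / var
    beta = var / f_k
    return (y ** (alpha - 1) * math.exp(-y / beta)) / (beta ** alpha * math.gamma(alpha))
```

By construction the kernel's mean equals αβ = f_k and its variance equals αβ² = b·f_k + c, so larger streamflow forecasts automatically receive wider conditional pdfs.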
Most of this uncertainty is caused by significant correlation between the optimized weights of the individual models (not shown), suggesting that one or more members can be removed from the streamflow ensemble with little loss of performance. The correlation structure induced between the individual weights in the posterior pdf therefore provides useful and objective information for selecting the most appropriate ensemble members, which is valuable given the computational cost of running the models and creating the ensembles. Also for this example, the modes of the posterior distribution of $w_k$, $k = 1,\ldots,K$, $b$, and $c$ derived through MCMC simulation with DREAM are in very close agreement with the ML estimates of the EM algorithm. This suggests that both methods will achieve similar calibration (and thus evaluation) performance, and that the EM algorithm suffices for finding reliable estimates of the BMA weights and variance for this eight-member streamflow ensemble.

Table 1 presents summary statistics of the one-day-ahead streamflow forecasts during the eight-year evaluation period for the eight individual conceptual watershed models considered in this study. The forecast statistics of the BMA predictive model and the associated weights are also listed. The results in this table suggest that it is not possible to select a single best watershed model that minimizes the RMSE and %BIAS of the forecast error while simultaneously maximizing the correlation (CORR) between the simulated and measured streamflow time series. In this analysis, the Sacramento Soil Moisture Accounting (SAC-SMA) model has the most consistent performance, with the lowest quadratic forecast error during both the calibration and evaluation periods. However, the SAC-SMA forecasts exhibit considerable bias. Note that the theoretical benefit of using multi-model averaging is not realized in this instance: the BMA deterministic forecast given by Eq. 3 has an average quadratic forecast error of similar magnitude to that of the best performing model (SAC-SMA) in the ensemble. This suggests that the BMA approach does not necessarily improve predictive capabilities.

Fig. 5 Transitions of the BMA weights of ensemble members ABC (a), GR4J (b), HYMOD (c), TOPMO (d), AWBM (e), NAM (f), HBV (g), and SAC-SMA (h), and of the gamma pdf parameters b (i) and c [m³/s] (j), in three different parallel Markov chains during the evolution of the DREAM algorithm to the posterior target distribution (x-axes: sample number in Markov chain). To facilitate comparison, the maximum likelihood estimates computed with the EM algorithm are indicated separately at the right-hand side of each panel.

Table 1 Summary statistics (RMSE, CORR, and %BIAS) of the one-day-ahead streamflow forecasts using the individual calibrated models (ABC, GR4J, HYMOD, TOPMO, AWBM, NAM, HBV, and SAC-SMA) and the BMA predictive model for the evaluation period. The BMA weights for the individual watershed models are also listed.
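The posterior correlation between the weights invoked above can be read directly off the DREAM output; a small helper, assuming the chains are stored as an (n_chains, n_samples, n_params) array with the K weights in the leading columns (a hypothetical layout, not the toolbox's actual format):

```python
import numpy as np

def weight_correlation(chains, burn_in=0.5):
    """Posterior correlation matrix of the sampled parameters.

    chains: array of shape (n_chains, n_samples, n_params).
    The first `burn_in` fraction of each chain is discarded as warm-up,
    and the remaining draws are pooled before computing correlations.
    """
    n_chains, n_samples, n_params = chains.shape
    start = int(burn_in * n_samples)
    pooled = chains[:, start:, :].reshape(-1, n_params)
    return np.corrcoef(pooled, rowvar=False)
```

Strongly negative off-diagonal entries between two weight columns flag members that trade off against each other, i.e. the redundancy that motivates pruning the ensemble.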

It is interesting to observe that the optimized BMA weights do not necessarily reflect the predictive skill of the individual ensemble members. For instance, the GR4J model ranks fourth in predictive skill, but receives the second highest weight in the BMA model. Hamill [11] suggests that this de-weighting of some ensemble members and over-weighting of others is a serious flaw of the BMA method, caused by overfitting of the EM method when using small sample sizes. We agree that it is rather counter-intuitive to assign better ensemble members a lower weight than worse-performing members. However, we disagree that this is caused by overfitting. First, our DREAM-derived posterior pdfs typically exhibit small dispersion around the ML values, indicating that the BMA weights are well defined by calibration against the observed streamflow data. Second, we argue that it is not the search method that causes the seemingly strange values of the BMA weights. As demonstrated in this paper, the EM and DREAM methods do their job well: they correctly and consistently find the ML value of the log-likelihood function defined in Eq. 5. The problem is instead the strong correlation among the individual predictors in the ensemble, which causes a significant amount of redundancy and therefore de-weighting of some ensemble members. One approach to reduce correlation among ensemble members is to apply principal component analysis to the time series of predictions of the individual models of the ensemble prior to fitting the BMA model. Another approach would be to describe $g_k(y \mid f_k)$ with flexible (Gaussian) mixture models or through sequential Monte Carlo (SMC) methods. This is the subject of future work, and the results will be presented in due course.
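One way the suggested PCA preprocessing could look is sketched below; this is an illustration, not the authors' procedure. Note that the resulting component series are mutually uncorrelated but can be negative, so a shift or transformation would be needed before refitting a gamma-based BMA model to them:

```python
import numpy as np

def decorrelate_members(F, var_kept=0.99):
    """Replace the (T, K) matrix of raw member forecasts by the leading
    principal components of its columns, which are mutually uncorrelated."""
    Fc = F - F.mean(axis=0)                       # center each member's series
    U, s, _ = np.linalg.svd(Fc, full_matrices=False)
    frac = np.cumsum(s**2) / np.sum(s**2)         # explained-variance fractions
    m = int(np.searchsorted(frac, var_kept)) + 1  # smallest m reaching var_kept
    return U[:, :m] * s[:m]                       # uncorrelated component series
```

Fitting BMA weights to such components rather than to the raw members would remove the redundancy-driven trade-offs between weights, at the cost of the weights no longer referring to identifiable models.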
Fig. 6 Comparison of training-period lengths for streamflow forecasting: (a) ratio of the maximum likelihood value derived with the EM algorithm to that derived with DREAM; (b) RMSE of the BMA deterministic forecast (blue) and average width of the 95% prediction intervals (green) [m³/s] for the calibration period; (c) as in (b), but for the evaluation period. Solid lines in panels (b) and (c) represent the EM-derived results, while the square symbols correspond to the DREAM results. The reported results represent averages of 3 independent trials with randomly selected training data sets from the original 28-year calibration data set.

To determine whether the EM and DREAM algorithms consistently find the same location of the mode of the posterior pdf of the BMA weights and gamma distribution parameters, we compared both algorithms for different lengths of the training period. We followed the same procedure as described previously, and report in Fig. 6 the average outcome of 3 independent trials as a function of the length of the training data set. The results can be summarized as follows. First, Fig. 6a demonstrates that the EM and DREAM algorithms provide consistent results in terms of the ML estimates for this 8-member streamflow ensemble. The results are almost indistinguishable, and suggest that both approaches are viable for optimizing BMA model-specific predictive distributions other than the normal distributions implemented previously. Second, and consistent with our earlier results, longer training data sets generally result in BMA predictive pdfs that exhibit more spread, and in better performance of the deterministic point predictor during the evaluation period. However, the differences again appear minor. Note that the evaluation period exhibits much lower values

of the RMSE because the average flow level is significantly lower than during the training period. Altogether, the results in Fig. 6c suggest that about 1 year of streamflow data is sufficient to obtain BMA model parameters that will achieve good evaluation performance.

5 Summary and conclusions

Bayesian model averaging (BMA) has recently been proposed as a statistical method to calibrate forecast ensembles from numerical weather models. Successful implementation of BMA, however, requires accurate estimates of the weights and variances of the individual competing models in the ensemble. In their seminal paper [2], Raftery et al. recommended the Expectation-Maximization (EM) algorithm for BMA model training. This algorithm is relatively simple to use and computationally efficient, but global convergence to the appropriate BMA model cannot be guaranteed. Algorithmic modifications are also required to adapt the EM method to conditional pdfs of the individual ensemble members other than the normal distribution, which is not very user-friendly. In this paper, we have compared the performance of the EM algorithm and the recently developed DREAM MCMC algorithm for estimating the BMA weights and variances. Unlike EM, MCMC simulation uses multiple different trajectories (Markov chains) simultaneously to sample the BMA parameters preferentially, based on the underlying probability mass of the likelihood function. Simulation experiments using multi-model ensembles of surface temperature, sea level pressure and streamflow forecasts have demonstrated that EM and DREAM achieve very similar performance, irrespective of the number of ensemble members and the length of the calibration data set.
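The multiple parallel chains that DREAM evolves also make the standard Gelman and Rubin [6] convergence diagnostic directly applicable; a minimal single-parameter sketch (illustrative, not the diagnostic as implemented in the authors' MATLAB code):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one parameter.

    chains: array of shape (m, n) holding m parallel chains with n
    post-warm-up draws each; values near 1 indicate the chains have
    converged to the same target distribution.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return float(np.sqrt(var_plus / W))
```

Monitoring R-hat for each of the BMA weights and the gamma parameters gives an objective stopping rule for the sampler.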
However, MCMC simulation accommodates a large variety of BMA conditional distributions without the need to modify existing source code, provides a full posterior view of the BMA weights and variances, and efficiently handles the high dimensionality of large ensembles containing the predictions of many different constituent models. The various simulation experiments have also demonstrated that the ML values of the BMA weights generally do not follow the inverse rank order of the average quadratic errors of the individual ensemble members. This appears counter-intuitive, because one would expect models with good predictive skill always to receive higher weight than ensemble members that exhibit a poorer fit to the observations. The explanation is that the predictions of the individual ensemble members are highly correlated. One approach to reduce correlation among ensemble members is to use principal component analysis; this is the subject of future work. Finally, in our experiments we have optimized the BMA weights and the other parameters of the model-specific predictive distributions by ML estimation. There are, however, various other summary statistics, such as the Continuous Ranked Probability Score (CRPS), Mean Absolute Error (MAE), and Ignorance Score, that provide additional and perhaps conflicting information about the optimal parameters of the BMA model, but are deemed important for accurate probabilistic weather forecasts. In another paper [26], we have posed the BMA inverse problem in a multiobjective framework, and have examined the Pareto set of solutions between these various objectives. The source codes of BMA and DREAM are written in MATLAB and can be obtained from the first author (vrugt@lanl.gov) upon request.

Acknowledgments The first author is supported by a J. Robert Oppenheimer Fellowship from the LANL postdoctoral program. Support for the third author is from the New Zealand Foundation for Research Science and Technology, Contract C1X41.
We thank Adrian Raftery from the Department of Statistics at the University of Washington, Seattle, for providing the MM5 ensemble weather data, and Richard Ibbitt, Tom Hamill and

two anonymous reviewers for providing detailed and insightful reviews that have considerably improved the current manuscript.

References

1. Ajami NK, Duan Q, Sorooshian S (2007) An integrated hydrologic Bayesian multimodel combination framework: confronting input, parameter, and model structural uncertainty in hydrologic prediction. Water Resour Res 43:W01403
2. Barnston AG, Mason SJ, Goddard L, DeWitt DF, Zebiak SE (2003) Multimodel ensembling in seasonal climate forecasting at IRI. Bull Am Meteorol Soc 84
3. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
4. Doblas-Reyes FJ, Hagedorn R, Palmer TN (2005) The rationale behind the success of multi-model ensembles in seasonal forecasting II. Calibration and combination. Tellus 57
5. Duan Q, Sorooshian S, Gupta V (1992) Effective and efficient global optimization for conceptual rainfall-runoff models. Water Resour Res 28(4):1015–1031
6. Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Stat Sci 7:457–472
7. Georgakakos KP, Seo DJ, Gupta H, Schaake J, Butts MB (2004) Characterizing streamflow simulation uncertainty through multi-model ensembles. J Hydrol 298(1–4)
8. Gneiting T, Raftery AE (2005) Weather forecasting with ensemble methods. Science 310:248–249
9. Gneiting T, Raftery AE, Westveld AH III, Goldman T (2005) Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon Weather Rev 133:1098–1118
10. Grimit EP, Mass CF (2002) Initial results of a mesoscale short-range ensemble forecasting system over the Pacific Northwest. Wea Forecast 17
11. Hamill TM (2007) Comments on "Calibrated surface temperature forecasts from the Canadian ensemble prediction system using Bayesian model averaging". Mon Weather Rev 135
12. Hamill TM, Colucci SJ (1997) Verification of Eta-RSM short-range ensemble forecasts. Mon Weather Rev 125
13. Krishnamurti TN, Kishtawal CM, LaRow TE, Bachiochi D, Zhang Z, Williford CE, Gadgil S, Surendan S (1999) Improved weather and seasonal climate forecasts from multimodel superensemble. Science 285:1548–1550
14. McLachlan GJ, Krishnan T (1997) The EM algorithm and extensions. Wiley, New York, 274 pp
15. Min S, Hense A (2006) A Bayesian approach to climate model evaluation and multi-model averaging with an application to global mean surface temperatures from IPCC AR4 coupled climate models. Geophys Res Lett 33
16. Molteni F, Buizza R, Palmer TN, Petroliagis T (1996) The ECMWF ensemble prediction system: methodology and validation. Q J R Meteorol Soc 122:73–119
17. Neuman SP (2003) Maximum likelihood Bayesian averaging of uncertain model predictions. Stoch Environ Res Risk Assess 17
18. Palmer TN, Alessandri A, Andersen U, Cantelaube P, Davey M, Délécluse P, Déqué M, Diez E, Doblas-Reyes J, Feddersen H, Graham R, Gualdi S, Guérémy J-F, Hagedorn R, Hoshen M, Keenlyside N, Latif M, Lazar A, Maisonnave E, Marletto V, Morse AP, Orfila B, Rogel P, Terres J-M, Thomson MC (2004) Development of a European multi-model ensemble system for seasonal-to-interannual prediction (DEMETER). Bull Am Meteorol Soc 85:853–872
19. Raftery AE, Madigan D, Hoeting JA (1997) Bayesian model averaging for linear regression models. J Am Stat Assoc 92:179–191
20. Raftery AE, Gneiting T, Balabdaoui F, Polakowski M (2005) Using Bayesian model averaging to calibrate forecast ensembles. Mon Weather Rev 133:1155–1174
21. Rajagopalan B, Lall U, Zebiak SE (2002) Categorical climate forecasts through regularization and optimal combination of multiple GCM ensembles. Mon Weather Rev 130
22. Richardson DS (2001) Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of sample size. Q J R Meteorol Soc 127
23. Sloughter JM, Raftery AE, Gneiting T (2006) Probabilistic quantitative precipitation forecasting using Bayesian model averaging. Technical Report 496, Department of Statistics, University of Washington, Seattle, WA

24. Vrugt JA, Robinson BA (2007) Treatment of uncertainty using ensemble methods: comparison of sequential data assimilation and Bayesian model averaging. Water Resour Res 43:W01411
25. Vrugt JA, Gupta HV, Bouten W, Sorooshian S (2003) A Shuffled Complex Evolution Metropolis algorithm for optimization and uncertainty assessment of hydrologic model parameters. Water Resour Res 39(8):1201. doi:10.1029/2002WR001642
26. Vrugt JA, Clark MP, Diks CGH, Duan Q, Robinson BA (2006) Multi-objective calibration of forecast ensembles using Bayesian model averaging. Geophys Res Lett 33:L19817
27. Vrugt JA, ter Braak CJF, Diks CGH, Robinson BA, Hyman JM, Higdon D (2008a) Accelerating Markov chain Monte Carlo simulation by self-adaptive differential evolution with randomized subspace sampling. Int J Nonlinear Sci Numer Simul (in press)
28. Vrugt JA, ter Braak CJF, Clark MP, Hyman JM, Robinson BA (2008b) Treatment of input uncertainty in hydrologic modeling: doing hydrology backwards with Markov chain Monte Carlo simulation. Water Resour Res
29. Wöhling T, Vrugt JA (2008) Combining multi-objective optimization and Bayesian model averaging to calibrate forecast ensembles of soil hydraulic models. Water Resour Res
30. Ye M, Neuman SP, Meyer PD (2004) Maximum likelihood Bayesian averaging of spatial variability models in unsaturated fractured tuff. Water Resour Res 40:W05113. doi:10.1029/2003WR002557


More information

Application and verification of ECMWF products in Austria

Application and verification of ECMWF products in Austria Application and verification of ECMWF products in Austria Central Institute for Meteorology and Geodynamics (ZAMG), Vienna Alexander Kann 1. Summary of major highlights Medium range weather forecasts in

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

w c X b 4.8 ADAPTIVE DATA FUSION OF METEOROLOGICAL FORECAST MODULES

w c X b 4.8 ADAPTIVE DATA FUSION OF METEOROLOGICAL FORECAST MODULES 4.8 ADAPTIVE DATA FUSION OF METEOROLOGICAL FORECAST MODULES Shel Gerding*and Bill Myers National Center for Atmospheric Research, Boulder, CO, USA 1. INTRODUCTION Many excellent forecasting models are

More information

A log-sinh transformation for data normalization and variance stabilization

A log-sinh transformation for data normalization and variance stabilization WATER RESOURCES RESEARCH, VOL. 48, W05514, doi:10.1029/2011wr010973, 2012 A log-sinh transformation for data normalization and variance stabilization Q. J. Wang, 1 D. L. Shrestha, 1 D. E. Robertson, 1

More information

Application and verification of ECMWF products in Austria

Application and verification of ECMWF products in Austria Application and verification of ECMWF products in Austria Central Institute for Meteorology and Geodynamics (ZAMG), Vienna Alexander Kann, Klaus Stadlbacher 1. Summary of major highlights Medium range

More information

Numerical Weather Prediction. Medium-range multi-model ensemble combination and calibration

Numerical Weather Prediction. Medium-range multi-model ensemble combination and calibration Numerical Weather Prediction Medium-range multi-model ensemble combination and calibration Forecasting Research Technical Report No. 517 Christine Johnson and Richard Swinbank c Crown Copyright email:nwp

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information

A Note on the Particle Filter with Posterior Gaussian Resampling

A Note on the Particle Filter with Posterior Gaussian Resampling Tellus (6), 8A, 46 46 Copyright C Blackwell Munksgaard, 6 Printed in Singapore. All rights reserved TELLUS A Note on the Particle Filter with Posterior Gaussian Resampling By X. XIONG 1,I.M.NAVON 1,2 and

More information

Proper Scores for Probability Forecasts Can Never Be Equitable

Proper Scores for Probability Forecasts Can Never Be Equitable APRIL 2008 J O L LIFFE AND STEPHENSON 1505 Proper Scores for Probability Forecasts Can Never Be Equitable IAN T. JOLLIFFE AND DAVID B. STEPHENSON School of Engineering, Computing, and Mathematics, University

More information

Enhancing Weather Information with Probability Forecasts. An Information Statement of the American Meteorological Society

Enhancing Weather Information with Probability Forecasts. An Information Statement of the American Meteorological Society Enhancing Weather Information with Probability Forecasts An Information Statement of the American Meteorological Society (Adopted by AMS Council on 12 May 2008) Bull. Amer. Meteor. Soc., 89 Summary This

More information

Upscaled and fuzzy probabilistic forecasts: verification results

Upscaled and fuzzy probabilistic forecasts: verification results 4 Predictability and Ensemble Methods 124 Upscaled and fuzzy probabilistic forecasts: verification results Zied Ben Bouallègue Deutscher Wetterdienst (DWD), Frankfurter Str. 135, 63067 Offenbach, Germany

More information

Forecasting wave height probabilities with numerical weather prediction models

Forecasting wave height probabilities with numerical weather prediction models Ocean Engineering 32 (2005) 1841 1863 www.elsevier.com/locate/oceaneng Forecasting wave height probabilities with numerical weather prediction models Mark S. Roulston a,b, *, Jerome Ellepola c, Jost von

More information

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution MH I Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution a lot of Bayesian mehods rely on the use of MH algorithm and it s famous

More information

István Ihász, Máté Mile and Zoltán Üveges Hungarian Meteorological Service, Budapest, Hungary

István Ihász, Máté Mile and Zoltán Üveges Hungarian Meteorological Service, Budapest, Hungary Comprehensive study of the calibrated EPS products István Ihász, Máté Mile and Zoltán Üveges Hungarian Meteorological Service, Budapest, Hungary 1. Introduction Calibration of ensemble forecasts is a new

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

A stochastic method for improving seasonal predictions

A stochastic method for improving seasonal predictions GEOPHYSICAL RESEARCH LETTERS, VOL. 39,, doi:10.1029/2012gl051406, 2012 A stochastic method for improving seasonal predictions L. Batté 1 and M. Déqué 1 Received 17 February 2012; revised 2 April 2012;

More information

BUREAU OF METEOROLOGY

BUREAU OF METEOROLOGY BUREAU OF METEOROLOGY Building an Operational National Seasonal Streamflow Forecasting Service for Australia progress to-date and future plans Dr Narendra Kumar Tuteja Manager Extended Hydrological Prediction

More information

Theory and Practice of Data Assimilation in Ocean Modeling

Theory and Practice of Data Assimilation in Ocean Modeling Theory and Practice of Data Assimilation in Ocean Modeling Robert N. Miller College of Oceanic and Atmospheric Sciences Oregon State University Oceanography Admin. Bldg. 104 Corvallis, OR 97331-5503 Phone:

More information

IDŐJÁRÁS Quarterly Journal of the Hungarian Meteorological Service Vol. 118, No. 3, July September, 2014, pp

IDŐJÁRÁS Quarterly Journal of the Hungarian Meteorological Service Vol. 118, No. 3, July September, 2014, pp IDŐJÁRÁS Quarterly Journal of the Hungarian Meteorological Service Vol. 118, No. 3, July September, 2014, pp. 217 241 Comparison of the BMA and EMOS statistical methods in calibrating temperature and wind

More information

Skill of multi-model ENSO probability forecasts. October 19, 2007

Skill of multi-model ENSO probability forecasts. October 19, 2007 Skill of multi-model ENSO probability forecasts MICHAEL K. TIPPETT AND ANTHONY G. BARNSTON International Research Institute for Climate and Society, Palisades, NY, USA October 19, 2007 Corresponding author

More information

Stochastic Hydrology. a) Data Mining for Evolution of Association Rules for Droughts and Floods in India using Climate Inputs

Stochastic Hydrology. a) Data Mining for Evolution of Association Rules for Droughts and Floods in India using Climate Inputs Stochastic Hydrology a) Data Mining for Evolution of Association Rules for Droughts and Floods in India using Climate Inputs An accurate prediction of extreme rainfall events can significantly aid in policy

More information

Convective scheme and resolution impacts on seasonal precipitation forecasts

Convective scheme and resolution impacts on seasonal precipitation forecasts GEOPHYSICAL RESEARCH LETTERS, VOL. 30, NO. 20, 2078, doi:10.1029/2003gl018297, 2003 Convective scheme and resolution impacts on seasonal precipitation forecasts D. W. Shin, T. E. LaRow, and S. Cocke Center

More information

Package ensemblebma. R topics documented: January 18, Version Date

Package ensemblebma. R topics documented: January 18, Version Date Version 5.1.5 Date 2018-01-18 Package ensemblebma January 18, 2018 Title Probabilistic Forecasting using Ensembles and Bayesian Model Averaging Author Chris Fraley, Adrian E. Raftery, J. McLean Sloughter,

More information

Categorical Climate Forecasts through Regularization and Optimal Combination of Multiple GCM Ensembles*

Categorical Climate Forecasts through Regularization and Optimal Combination of Multiple GCM Ensembles* 1792 MONTHLY WEATHER REVIEW VOLUME 130 Categorical Climate Forecasts through Regularization and Optimal Combination of Multiple GCM Ensembles* BALAJI RAJAGOPALAN Department of Civil, Environmental, and

More information

Package ensemblebma. July 3, 2010

Package ensemblebma. July 3, 2010 Package ensemblebma July 3, 2010 Version 4.5 Date 2010-07 Title Probabilistic Forecasting using Ensembles and Bayesian Model Averaging Author Chris Fraley, Adrian E. Raftery, J. McLean Sloughter, Tilmann

More information

Exploring ensemble forecast calibration issues using reforecast data sets

Exploring ensemble forecast calibration issues using reforecast data sets NOAA Earth System Research Laboratory Exploring ensemble forecast calibration issues using reforecast data sets Tom Hamill and Jeff Whitaker NOAA Earth System Research Lab, Boulder, CO tom.hamill@noaa.gov

More information

Operational Short-Term Flood Forecasting for Bangladesh: Application of ECMWF Ensemble Precipitation Forecasts

Operational Short-Term Flood Forecasting for Bangladesh: Application of ECMWF Ensemble Precipitation Forecasts Operational Short-Term Flood Forecasting for Bangladesh: Application of ECMWF Ensemble Precipitation Forecasts Tom Hopson Peter Webster Climate Forecast Applications for Bangladesh (CFAB): USAID/CU/GT/ADPC/ECMWF

More information

Model error and seasonal forecasting

Model error and seasonal forecasting Model error and seasonal forecasting Antje Weisheimer European Centre for Medium-Range Weather Forecasts ECMWF, Reading, UK with thanks to Paco Doblas-Reyes and Tim Palmer Model error and model uncertainty

More information

NOAA Earth System Research Lab, Boulder, Colorado, USA

NOAA Earth System Research Lab, Boulder, Colorado, USA Exploring ensemble forecast calibration issues using reforecast data sets Thomas M. Hamill 1 and Renate Hagedorn 2 1 NOAA Earth System Research Lab, Boulder, Colorado, USA 80303,Tom.hamill@noaa.gov; http://esrl.noaa.gov/psd/people/tom.hamill

More information

Basic Verification Concepts

Basic Verification Concepts Basic Verification Concepts Barbara Brown National Center for Atmospheric Research Boulder Colorado USA bgb@ucar.edu Basic concepts - outline What is verification? Why verify? Identifying verification

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

Bayesian Semiparametric GARCH Models

Bayesian Semiparametric GARCH Models Bayesian Semiparametric GARCH Models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics xibin.zhang@monash.edu Quantitative Methods

More information

Spatio-temporal precipitation modeling based on time-varying regressions

Spatio-temporal precipitation modeling based on time-varying regressions Spatio-temporal precipitation modeling based on time-varying regressions Oleg Makhnin Department of Mathematics New Mexico Tech Socorro, NM 87801 January 19, 2007 1 Abstract: A time-varying regression

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals

More information

Bayesian System Identification based on Hierarchical Sparse Bayesian Learning and Gibbs Sampling with Application to Structural Damage Assessment

Bayesian System Identification based on Hierarchical Sparse Bayesian Learning and Gibbs Sampling with Application to Structural Damage Assessment Bayesian System Identification based on Hierarchical Sparse Bayesian Learning and Gibbs Sampling with Application to Structural Damage Assessment Yong Huang a,b, James L. Beck b,* and Hui Li a a Key Lab

More information

A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling

A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling G. B. Kingston, H. R. Maier and M. F. Lambert Centre for Applied Modelling in Water Engineering, School

More information

Improvement of Mesoscale Numerical Weather Prediction For Coastal Regions of Complex Terrain FY2003

Improvement of Mesoscale Numerical Weather Prediction For Coastal Regions of Complex Terrain FY2003 Improvement of Mesoscale Numerical Weather Prediction For Coastal Regions of Complex Terrain FY2003 Clifford F. Mass Department of Atmospheric Sciences Box 351640 University of Washington Seattle, Washington

More information

Impact of hydrological model uncertainty on predictability of seasonal streamflow forecasting in the River Rhine Basin

Impact of hydrological model uncertainty on predictability of seasonal streamflow forecasting in the River Rhine Basin Impact of hydrological model uncertainty on predictability of seasonal streamflow forecasting in the River Rhine Basin Bastian Klein, Dennis Meißner Department M2 - Water Balance, Forecasting and Predictions

More information

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model Thai Journal of Mathematics : 45 58 Special Issue: Annual Meeting in Mathematics 207 http://thaijmath.in.cmu.ac.th ISSN 686-0209 The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for

More information

Probabilistic Wind Speed Forecasting using Ensembles and Bayesian Model Averaging

Probabilistic Wind Speed Forecasting using Ensembles and Bayesian Model Averaging Probabilistic Wind Speed Forecasting using Ensembles and Bayesian Model Averaging J. McLean Sloughter, Tilmann Gneiting, and Adrian E. Raftery 1 Department of Statistics, University of Washington, Seattle,

More information

Extending a Metric Developed in Meteorology to Hydrology

Extending a Metric Developed in Meteorology to Hydrology Extending a Metric Developed in Meteorology to Hydrology D.M. Sedorovich Department of Agricultural and Biological Engineering, The Pennsylvania State University, University Park, PA 680, USA dms5@psu.edu

More information

Downscaled Climate Change Projection for the Department of Energy s Savannah River Site

Downscaled Climate Change Projection for the Department of Energy s Savannah River Site Downscaled Climate Change Projection for the Department of Energy s Savannah River Site Carolinas Climate Resilience Conference Charlotte, North Carolina: April 29 th, 2014 David Werth Atmospheric Technologies

More information

Geostatistical model averaging for locally calibrated probabilistic quantitative precipitation forecasting

Geostatistical model averaging for locally calibrated probabilistic quantitative precipitation forecasting Geostatistical model averaging for locally calibrated probabilistic quantitative precipitation forecasting William Kleiber, Adrian E. Raftery Department of Statistics, University of Washington, Seattle,

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Multi-model forecast skill for mid-summer rainfall over southern Africa

Multi-model forecast skill for mid-summer rainfall over southern Africa INTERNATIONAL JOURNAL OF CLIMATOLOGY Int. J. Climatol. 32: 303 314 (2012) Published online 9 December 2010 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/joc.2273 Multi-model forecast skill

More information

Reminder of some Markov Chain properties:

Reminder of some Markov Chain properties: Reminder of some Markov Chain properties: 1. a transition from one state to another occurs probabilistically 2. only state that matters is where you currently are (i.e. given present, future is independent

More information

Exploring ensemble forecast calibration issues using reforecast data sets

Exploring ensemble forecast calibration issues using reforecast data sets Exploring ensemble forecast calibration issues using reforecast data sets Thomas M. Hamill (1) and Renate Hagedorn (2) (1) NOAA Earth System Research Lab, Boulder, Colorado, USA 80303 Tom.hamill@noaa.gov

More information

Hydrologic Ensemble Prediction: Challenges and Opportunities

Hydrologic Ensemble Prediction: Challenges and Opportunities Hydrologic Ensemble Prediction: Challenges and Opportunities John Schaake (with lots of help from others including: Roberto Buizza, Martyn Clark, Peter Krahe, Tom Hamill, Robert Hartman, Chuck Howard,

More information

Measuring the Ensemble Spread-Error Relationship with a Probabilistic Approach: Stochastic Ensemble Results

Measuring the Ensemble Spread-Error Relationship with a Probabilistic Approach: Stochastic Ensemble Results Measuring the Ensemble Spread-Error Relationship with a Probabilistic Approach: Stochastic Ensemble Results Eric P. Grimit and Clifford F. Mass Department of Atmospheric Sciences University of Washington,

More information

Operational Hydrologic Ensemble Forecasting. Rob Hartman Hydrologist in Charge NWS / California-Nevada River Forecast Center

Operational Hydrologic Ensemble Forecasting. Rob Hartman Hydrologist in Charge NWS / California-Nevada River Forecast Center Operational Hydrologic Ensemble Forecasting Rob Hartman Hydrologist in Charge NWS / California-Nevada River Forecast Center Mission of NWS Hydrologic Services Program Provide river and flood forecasts

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

New Insights into History Matching via Sequential Monte Carlo

New Insights into History Matching via Sequential Monte Carlo New Insights into History Matching via Sequential Monte Carlo Associate Professor Chris Drovandi School of Mathematical Sciences ARC Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS)

More information

Probabilistic Wind Speed Forecasting Using Ensembles and Bayesian Model Averaging

Probabilistic Wind Speed Forecasting Using Ensembles and Bayesian Model Averaging Probabilistic Wind Speed Forecasting Using Ensembles and Bayesian Model Averaging J. McLean SLOUGHTER, Tilmann GNEITING, and Adrian E. RAFTERY The current weather forecasting paradigm is deterministic,

More information

Bayesian Semiparametric GARCH Models

Bayesian Semiparametric GARCH Models Bayesian Semiparametric GARCH Models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics xibin.zhang@monash.edu Quantitative Methods

More information