BAYESIAN PROCESSOR OF OUTPUT: PROBABILITY OF PRECIPITATION OCCURRENCE. Roman Krzysztofowicz. University of Virginia. Charlottesville, Virginia.

BAYESIAN PROCESSOR OF OUTPUT: PROBABILITY OF PRECIPITATION OCCURRENCE

By

Roman Krzysztofowicz
University of Virginia
Charlottesville, Virginia

and

Coire J. Maranzano
Johns Hopkins University
Baltimore, Maryland

Research Paper RK
January 2006
Revised October 2006

Copyright © 2006 by R. Krzysztofowicz and C.J. Maranzano

Corresponding author address: Professor Roman Krzysztofowicz, University of Virginia, P.O. Box , Charlottesville, VA; rk@virginia.edu

ABSTRACT

The Bayesian Processor of Output (BPO) is a theoretically-based technique for probabilistic forecasting of weather variates. The first version of the BPO described herein is for a binary predictand; it is illustrated by producing the probability of precipitation (PoP) occurrence forecast. This PoP is a posterior probability obtained through Bayesian fusion of a prior (climatic) probability and a realization of predictors output from a numerical weather prediction (NWP) model. The strength of the BPO derives from (i) the theoretic structure of the forecasting equation (which is Bayes theorem), (ii) the flexibility of the meta-Gaussian family of likelihood functions (which allows any form of the marginal distribution functions of predictors, and a non-linear and heteroscedastic dependence structure between predictors), (iii) the simplicity of estimation, and (iv) the effective use of asymmetric samples (typically, a long climatic sample of the predictand and a short operational sample of the NWP model output). Modeling and estimation of the BPO are explained in a setup parallel to that of the Model Output Statistics (MOS) technique used operationally by the National Weather Service. The performance of the prototype BPO system is compared with the performance of the operational MOS system in terms of calibration and informativeness on two samples (estimation and validation). These preliminary results highlight the advantages of the BPO in terms of (i) performance for a specific location (and hence a user), (ii) efficiency of extracting predictive information from the NWP model output (fewer predictors needed), and (iii) parsimony of the predictors (no need for experimentation to find suitable transformations of the NWP model output). Potential implications for operational forecasting and ensemble processing are discussed.

TABLE OF CONTENTS

ABSTRACT
1. INTRODUCTION
   1.1 Towards Bayesian Forecasting Techniques
   1.2 BPO for Binary Predictand
2. BAYESIAN THEORY
   2.1 Variates
   2.2 Samples
   2.3 Input Elements
   2.4 Theoretic Structure
3. META-GAUSSIAN MODEL
   3.1 Input Elements
   3.2 Forecasting Equation
   3.3 Basic Properties
   3.4 Model Validation
4. EXAMPLE WITH ONE PREDICTOR
   4.1 Prior Probability
   4.2 Conditional Density Functions
   4.3 Informativeness of Predictor
   4.4 Posterior Probability
   4.5 Another Predictor
   4.6 Binary-Continuous Predictor
   4.7 Monotonicity of Likelihood Ratio
5. EXAMPLE WITH TWO PREDICTORS
   5.1 Conditional Correlation Coefficients
   5.2 Conditional Dependence Measures
   5.3 Conditional Dependence Structures
   5.4 Second Example
   5.5 Predictors Selection
6. MOS SYSTEM
   6.1 Forecasting Equation
   6.2 Grid-Binary Transform
   6.3 Estimation
   6.4 Predictors Selection
7. COMPARISON OF BPO WITH MOS
   7.1 System Versus Technique
   7.2 Performance Measures
   7.3 Comparative Verifications
   7.4 Explanations
8. SUMMARY
   8.1 Bayesian Technique
   8.2 Preliminary Results
   8.3 Potential Implications
ACKNOWLEDGMENTS
APPENDIX A: NUMERICAL APPROXIMATION TO Q^{-1}
REFERENCES
TABLES
FIGURES

1. INTRODUCTION

1.1 Towards Bayesian Forecasting Techniques

Rational decision making by industries, agencies, and the public in anticipation of heavy precipitation, a snow storm, a flood, or another disruptive weather phenomenon requires information about the degree of certitude that the user can place in a weather forecast. It is vital, therefore, to advance the meteorologist's capability of quantifying forecast uncertainty to meet society's rising expectations for reliable information.

Our objective is to develop and test a coherent set of theoretically-based techniques for probabilistic forecasting of weather variates. The basic technique, called the Bayesian Processor of Output (BPO), processes output from a numerical weather prediction (NWP) model and optimally fuses it with climatic data in order to quantify uncertainty about a predictand. As is well known, Bayes theorem provides the optimal theoretic framework for fusing information from different sources and for obtaining the probability distribution of a predictand, conditional on a realization of predictors, or conditional on an ensemble of realizations.

The optimality of Bayes theorem for fusing information, or updating uncertainty, or revising probability, rests on logical and mathematical arguments (see, for example, Savage, 1954; DeGroot, 1970; de Finetti, 1974). These arguments have long ago been adopted by engineers and decision theorists for information, or signal, or forecast processing, and for decision making based on forecasts (see, for example, Edwards et al., 1968; Sage and Melsa, 1971; Krzysztofowicz, 1983; Alexandridis and Krzysztofowicz, 1985). Introducing what we would call today a Bayesian processor of forecast for a binary predictand, DeGroot (1988) explains: The argument in favor of the Bayesian approach proceeds in two steps: (1) The quantitative assessment of uncertainty is in itself a sterile exercise unless that

assessment is to be used to make decisions. (2) The Bayesian approach provides the only coherent methodology for decision making under uncertainty. Lindley (1987), defending the inevitability of probability as a measure of uncertainty, presents logical arguments and a succinct verdict: Most intelligent behavior is simply obeying Bayes theorem. Any other procedure is incoherent. The challenge lying before us is to develop and test Bayesian procedures suitable for operational forecasting in meteorology.

1.2 BPO for Binary Predictand

The present article describes the BPO for a binary predictand. This BPO is illustrated by producing the probability of precipitation (PoP) occurrence forecast. The overall setup for the illustration is parallel to the operational setup for the Model Output Statistics (MOS) technique (Glahn and Lowry, 1972) used in operational forecasting by the National Weather Service (NWS). In the currently deployed AVN-MOS system (Antolik, 2000), the predictors for the MOS forecasting equations are based on output fields from the Global Spectral Model run under the code name AVN. The performance of the operational AVN-MOS system is the primary benchmark for evaluation of the performance of the BPO.

The article is organized as follows. Section 2 presents the gist of the Bayesian theory of forecasting for a binary predictand. Section 3 details the input elements, the forecasting equation, and the basic properties of the BPO. Section 4 presents a tutorial example of the BPO for PoP using a single predictor. Section 5 presents a tutorial example of the BPO for PoP using two predictors. The prototype BPO system is compared and contrasted with the operational MOS system in terms of the structure of the forecasting equations in Section 6, and in terms of performance on matched verifications in Section 7. Section 8 summarizes implications of these comparisons and potential advantages of the BPO.

2. BAYESIAN THEORY

2.1 Variates

Let V be the predictand: a binary variate serving as the indicator of some future event, such that V = 1 if and only if the event occurs, and V = 0 otherwise; its realization is denoted v, where v ∈ {0, 1}. Let X_i be the predictor: a variate whose realization x_i is used to forecast V. Let X = (X_1, ..., X_I) be the vector of I predictors; its realization is denoted x = (x_1, ..., x_I). Each X_i (i = 1, ..., I) is assumed to be a continuous variate, an assumption that simplifies the presentation but can be relaxed if necessary.

2.2 Samples

Suppose the forecasting problem has already been structured, and the task is to develop the forecasting equation in a setup similar to that of the MOS technique (Antolik, 2000). In the examples throughout the article, the event to be forecasted is the occurrence of precipitation (accumulation of at least mm of water) in Buffalo, New York, during the 6-h period UTC, beginning 60 h after the run of the AVN model at 0000 UTC. The predictors are the variates whose realizations are output from the AVN model. Forecasts are to be made every day in the cool season (October-March).

Let {v} denote the climatic sample of the predictand. The climatic sample comes from the database of the National Climatic Data Center (NCDC). This database contains hourly precipitation observations in Buffalo from over 56 years; however, the record is heterogeneous and must be processed in order to obtain a homogeneous sample. To avoid this task, only observations recorded by the Automated Surface Observing System (ASOS) are included in the prior sample. In effect, it is a 7-year long sample extending from 1 January 1997 through 31 December 2003. Each day provides one realization. The sample size for the cool season is M =

Let {(x, v)} denote the joint sample of the predictor vector and the predictand. The joint sample comes from the database that the Meteorological Development Laboratory (MDL) used to estimate the operational forecasting equations of the AVN-MOS system. It is a 4-year long sample extending from 1 April 1997 through 31 March 2001. The sample size for the cool season is N = 698.

The point of the above example is that typically the joint sample is much shorter than the climatic sample: N << M. Classical statistical methods, such as the MOS technique, deal with this sample asymmetry by simply ignoring the long climatic sample. In effect, these methods ignore vast amounts of information about the predictand. In contrast, the BPO uses both samples; it extracts information from each sample and then optimally fuses information according to the laws of probability. (Pooling of samples from different months and stations in order to increase the sample size is a separate issue.)

2.3 Input Elements

With P denoting the probability and p denoting a generic density function, define the following objects.

g = P(V = 1) is the prior probability of event V = 1; it is to be estimated from the climatic sample {v}. Probability g quantifies the uncertainty about the predictand V that exists before the NWP model output is available. Equivalently, it characterizes the natural variability of the predictand.

f_v(x) = p(x | V = v) for v = 0, 1; function f_v is the I-variate density function of the predictor vector X, conditional on the hypothesis that the event is V = v. The two conditional density functions, f_0 and f_1, are to be estimated from the joint sample {(x, v)}. For a fixed realization X = x, object f_v(x) is the likelihood of event V = v. Thus (f_0, f_1) comprises the family

of likelihood functions. This family quantifies the stochastic dependence between the predictor vector X and the predictand V. Equivalently, it characterizes the informativeness of the predictors with respect to the predictand. (The informativeness is defined in Section 4.3.)

2.4 Theoretic Structure

The probability g and the family of likelihood functions (f_0, f_1) carry information about the prior uncertainty and the informativeness of the predictors into the Bayesian revision procedure. The expected density function κ of the predictor vector X is given by the total probability law:

    κ(x) = f_0(x) (1 - g) + f_1(x) g,   (1)

and the posterior probability π = P(V = 1 | X = x) of event V = 1, conditional on a realization of the predictor vector X = x, is given by Bayes theorem:

    π = f_1(x) g / κ(x).   (2)

By inserting (1) into (2), one obtains an alternative expression:

    π = [1 + ((1 - g)/g) (f_0(x)/f_1(x))]^{-1},   (3)

where (1 - g)/g is the prior odds against event V = 1, and f_0(x)/f_1(x) is the likelihood ratio against event V = 1. Equation (3) defines the theoretic structure of the BPO for a binary predictand. Inasmuch as Eqs. (1) and (2) follow directly from the axioms of probability theory (sans additional assumptions), Eq. (3) is the most general solution for the conditional probability π. In that sense, it provides the optimal theoretic framework for fusing model output (which supplies a value of x) with climatic data (which supply a value of g).
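Equation (3) is simple to compute. A minimal sketch in Python (the function name and the illustrative numbers are ours, not the paper's):

```python
def posterior_probability(g, f0_x, f1_x):
    """Posterior probability of event V = 1 via Eq. (3).

    g    -- prior (climatic) probability of the event, 0 < g < 1
    f0_x -- likelihood f_0(x) of the predictor realization under V = 0
    f1_x -- likelihood f_1(x) of the predictor realization under V = 1
    """
    prior_odds_against = (1.0 - g) / g
    likelihood_ratio_against = f0_x / f1_x
    return 1.0 / (1.0 + prior_odds_against * likelihood_ratio_against)

# When the realization is uninformative (f_0(x) = f_1(x), likelihood ratio 1),
# the posterior probability reduces to the prior probability g.
pi = posterior_probability(0.27, 0.4, 0.4)
```

Because the expectation of the posterior probability equals the prior probability, an uninformative predictor simply reproduces the climatic probability, as the comment indicates.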

3. META-GAUSSIAN MODEL

To implement the BPO, a flexible and convenient model is needed for each multivariate conditional density function, f_0 and f_1. We employ the meta-Gaussian model developed by Kelly and Krzysztofowicz (1994, 1995, 1997) and used successfully in probabilistic river stage forecasting (Krzysztofowicz and Herr, 2001; Krzysztofowicz, 2002) and probabilistic rainfall modeling (Herr and Krzysztofowicz, 2005).

3.1 Input Elements

A multivariate meta-Gaussian distribution is constructed from specified marginal distributions, a correlation matrix, and the Gaussian dependence structure. To obtain expressions for f_0 and f_1, this construction must be replicated twice; every element of the construction must be duplicated, with one copy being conditioned on event V = 0 and another copy being conditioned on event V = 1. Accordingly, the input elements are defined as follows.

f_iv(x_i) = p(x_i | V = v) for i = 1, ..., I; v = 0, 1. For fixed i ∈ {1, ..., I} and v ∈ {0, 1}, function f_iv is the marginal density function of the predictor X_i, conditional on the hypothesis that the event is V = v. For a fixed realization X_i = x_i of the predictor, object f_iv(x_i) is the marginal likelihood of event V = v.

F_iv(x_i) = P(X_i ≤ x_i | V = v) for i = 1, ..., I; v = 0, 1. For fixed i ∈ {1, ..., I} and v ∈ {0, 1}, function F_iv is the marginal distribution function of the predictor X_i, conditional on the hypothesis that the event is V = v; this F_iv corresponds to f_iv.

γ_ijv = Cor(Z_i, Z_j | V = v) for i = 1, ..., I - 1; j = i + 1, ..., I; v = 0, 1. This is the Pearson's product-moment correlation coefficient between the standard normal predictors Z_i and Z_j, conditional on the hypothesis that the event is V = v. The standard normal predictor Z_i, conditional on event V = v, is obtained from the original predictor X_i, conditional on event

V = v, through the normal quantile transform (NQT): Z_i = Q^{-1}(F_iv(X_i)), i = 1, ..., I; v = 0, 1; where Q is the standard normal distribution function, and Q^{-1} is the inverse of Q.

The conditional correlation coefficients are arranged in two conditional correlation matrices Γ_v = [γ_ijv], v = 0, 1, whose elements have the following properties: γ_iiv = 1 for i = 1, ..., I; -1 < γ_ijv < 1 for i ≠ j; and γ_ijv = γ_jiv for i, j = 1, ..., I. It follows that matrix Γ_v has dimension I × I; is square, symmetric, and positive definite; and is uniquely determined by its I(I - 1)/2 upper diagonal elements.

3.2 Forecasting Equation

When each of the two multivariate conditional density functions, f_0 and f_1, is meta-Gaussian, the BPO defined by Eq. (3) takes the following form. Given a prior probability g of event V = 1, and given a realization x = (x_1, ..., x_I) of the predictor vector, the posterior probability of event V = 1 is specified by the equation

    π = [1 + ((1 - g)/g) Π_{i=1}^{I} (f_{i0}(x_i)/f_{i1}(x_i)) λ(x)]^{-1},   (4)

where λ is the likelihood ratio weighting function defined by the equation

    λ(x) = sqrt(det Γ_1 / det Γ_0) exp{-(1/2) [z_0^T Γ_0^{-1} z_0 - z_0^T z_0 - z_1^T Γ_1^{-1} z_1 + z_1^T z_1]},   (5)

and where the mapping of the vector x = (x_1, ..., x_I) into two vectors z_0 = (z_{10}, ..., z_{I0}) and z_1 = (z_{11}, ..., z_{I1}) is defined by the NQT:

    z_iv = Q^{-1}(F_iv(x_i)), i = 1, ..., I; v = 0, 1.   (6)
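For two predictors, the determinant and the quadratic forms in Eq. (5) can be written out explicitly, since a 2x2 correlation matrix inverts in closed form. A sketch (function and variable names are ours; for general I one would use a linear-algebra library):

```python
import math

def likelihood_ratio_weight(z0, z1, gamma0, gamma1):
    """Likelihood ratio weight lambda(x) of Eq. (5) for I = 2 predictors.

    z0 = (z_10, z_20), z1 = (z_11, z_21) -- NQT-transformed realizations, Eq. (6)
    gamma0, gamma1 -- conditional correlations gamma_120 and gamma_121
    """
    def quad_form(z, gamma):
        # z^T Gamma_v^{-1} z for the 2x2 correlation matrix [[1, gamma], [gamma, 1]]
        a, b = z
        return (a * a - 2.0 * gamma * a * b + b * b) / (1.0 - gamma * gamma)

    det0 = 1.0 - gamma0 * gamma0   # det Gamma_0
    det1 = 1.0 - gamma1 * gamma1   # det Gamma_1
    exponent = -0.5 * (quad_form(z0, gamma0) - (z0[0]**2 + z0[1]**2)
                       - quad_form(z1, gamma1) + (z1[0]**2 + z1[1]**2))
    return math.sqrt(det1 / det0) * math.exp(exponent)
```

When Γ_0 and Γ_1 are both identity matrices (conditionally independent predictors), the weight collapses to 1, recovering the simplification noted after Eq. (6).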

In numerical calculations, Q^{-1} is approximated by a rational function (Abramowitz and Stegun, 1972), which is reproduced in Appendix A.

Equation (4) reveals that the posterior probability is determined by the product of the prior odds (1 - g)/g against event V = 1, the marginal likelihood ratios f_{i0}(x_i)/f_{i1}(x_i) against event V = 1, and the likelihood ratio weight λ(x). The marginal likelihood ratio function f_{i0}/f_{i1} carries information from predictor X_i; the likelihood ratio weighting function λ accounts for the conditional dependence among the predictors X_1, ..., X_I. If the predictors X_1, ..., X_I are independent, conditional on event V = 0 and conditional on event V = 1, then each of the two conditional correlation matrices, Γ_0 and Γ_1, simplifies to the identity matrix; consequently λ(x) = 1 at every point x, and the multivariate likelihood ratio f_0(x)/f_1(x) simplifies to the product of the marginal likelihood ratios f_{i0}(x_i)/f_{i1}(x_i) for i = 1, ..., I.

3.3 Basic Properties

The meta-Gaussian model for f_0 and f_1, which is embedded in Eqs. (4)-(6), offers these properties.

1. The marginal conditional distribution function F_iv of predictor X_i may take any form; this form may be different for each i ∈ {1, ..., I} and each v ∈ {0, 1}; the marginal conditional density function f_iv is simply derived from F_iv.

2. The two transforms (the NQTs) for each predictor X_i are uniquely specified once its marginal conditional distribution functions, F_{i0} and F_{i1}, have been estimated.

3. The conditional dependence structure among the predictors X_1, ..., X_I is pairwise; the degree of dependence is quantified by the conditional correlation matrix Γ_v.

4. The conditional dependence structure between any two predictors X_i and X_j, i ≠ j, may

be non-linear (in the conditional mean) and heteroscedastic (in the conditional variance).

5. The probabilistic forecast (the posterior probability π) is given by an analytic expression.

Properties 1 and 4 imply the flexibility in fitting the model to data, an attribute necessary to produce forecasts that are well calibrated and most informative. Properties 2 and 3 imply the simplicity of estimation. Property 5 implies the computational efficiency, an important attribute for operational forecasting.

3.4 Model Validation

The meta-Gaussian model can be validated on a given joint sample, an advantage because one can gain additional insight into the BPO and the data. First, each marginal conditional distribution function F_iv should be tested for the goodness of fit to the empirical distribution function of X_i, conditional on V = v. Second, the two conditional dependence structures should be validated based on the following fact (Kelly and Krzysztofowicz, 1997): conditional on V = v (v = 0, 1), the joint distribution of X_1, ..., X_I is meta-Gaussian if and only if the joint distribution of Z_1, ..., Z_I is Gaussian. The NQT guarantees that the marginal distribution of each Z_i is standard normal. Therefore, the validation amounts to testing the hypothesis that the distribution of each pair (Z_i, Z_j), for i = 1, ..., I - 1 and j = i + 1, ..., I, is bivariate standard normal. This test can be broken down into three tests of the following requirements:

1. Linearity: the regression of Z_i on Z_j must be linear.

2. Homoscedasticity: the variance of the residual Θ_ij = Z_i - γ_ijv Z_j must be independent of Z_j.

3. Normality: the distribution of Θ_ij must be normal with mean 0 and variance 1 - γ_ijv².

Inasmuch as these are requirements of a linear model, testing procedures are well known.

4. EXAMPLE WITH ONE PREDICTOR

When there is only one predictor, the BPO is given by Eq. (3), with the vector x being replaced by the variable x, which denotes a realization of predictor X. In effect, three elements are needed for forecasting: a prior probability g, and two univariate conditional density functions, f_0 and f_1.

4.1 Prior Probability

The prior probability g is estimated for the cool season from the climatic sample. It is g = 0.27, the value to be used for forecasting every day during the cool season.

In general, when the climatic sample is large, the prior probability g could be estimated for a subseason, even for a day, by applying a moving window to the climatic sample. For instance, Table 1 shows that g varies from month to month. Thus, using a given g every day during a month and changing g from month to month would improve the calibration of the forecast probability within each month. (This statement is justified because the expectation of the posterior probability equals the prior probability.) Despite this potential advantage, all examples reported herein use g for the cool season because this parallels the setup of the operational MOS system (whose equations are estimated for the cool season) and because the available validation sample (from 2 1/2 years) is too short to verify the calibration of forecasts for each month.

Overall, the prior probability has four attributes important for application. (i) It may be location-specific and season-specific (or day-specific) and thereby can capture the micro-climate. (ii) It may be estimated from a large climatic sample. (iii) It is independent of the choice of the predictors and the length of the NWP model output available for estimation (the size of the joint sample). (iv) It need not be re-estimated when the NWP model changes; thus it ensures a stable calibration of the forecast probabilities for as long as the climate remains stationary.
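Estimating g is a relative-frequency computation. A per-month version, in the spirit of the month-to-month variation shown in Table 1 (the function name and the toy data are ours):

```python
from collections import defaultdict

def monthly_prior(months, occurrences):
    """Relative-frequency estimate of the prior probability g for each month.

    months      -- month number (1-12) for each day in the climatic sample {v}
    occurrences -- matching indicator v in {0, 1} for each day
    """
    events = defaultdict(int)
    days = defaultdict(int)
    for m, v in zip(months, occurrences):
        days[m] += 1
        events[m] += v
    return {m: events[m] / days[m] for m in days}

# Toy climatic sample: four October days (one wet), two November days (both wet).
g_by_month = monthly_prior([10, 10, 10, 10, 11, 11], [1, 0, 0, 0, 1, 1])
```

A moving-window variant would pool the days within a window centered on each date; the season-wide g = 0.27 used in the paper is the same computation applied to the whole cool season.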

4.2 Conditional Density Functions

The single predictor X is the mean relative humidity of a variable depth layer from sigma 1.0 to sigma 0.44 at 60 h after the 0000 UTC model run (for short, mean relative humidity at 60 h). Its two conditional density functions, f_0 and f_1, are for the cool season; they are derived from the corresponding conditional distribution functions, F_0 and F_1. The procedure for modeling and estimation of F_0 and F_1 is as follows.

1. The joint sample {(x, v)} of 698 realizations is stratified into two subsamples: {(x, 0)} containing 518 realizations and {(x, 1)} containing 180 realizations.

2. From each subsample, an empirical distribution function of X is constructed (Fig. 1).

3. A parametric model for F_v is chosen, its parameters are estimated, and its goodness of fit to the empirical distribution function is evaluated. Here, both F_0 and F_1 are ratio type II log-Weibull; each is defined on the interval (0, 100) and is specified by two parameters (Fig. 1).

The above procedure has been automated by creating a catalog of parametric models and by developing algorithms for estimation of the parameters and choice of the best model. The catalog includes expressions for the distribution functions and for the density functions. In effect, once a parametric model for F_v is chosen, the expression for f_v is known. Fig. 2 shows f_0 and f_1.

4.3 Informativeness of Predictor

A predictor X used in the BPO is characterized in terms of its informativeness. Intuitively, the informativeness of predictor X may be visualized by judging the degree of separation between the two conditional density functions, f_0 and f_1, shown in Fig. 2: the larger the separation, the more informative the predictor. Formally, the informativeness of predictor X is characterized by the Receiver Operating Characteristic (ROC): a graph of the probability of detection P(D|x) versus the probability of false alarm P(F|x) for all x. When the likelihood ratio L(x) = f_0(x)/f_1(x)

is a strictly monotonic function of x, the ROC may be constructed directly from the conditional distribution functions F_0 and F_1: if L = f_0/f_1 is strictly increasing, then

    P(D|x) = F_1(x),  P(F|x) = F_0(x);   (7a)

if L = f_0/f_1 is strictly decreasing, then

    P(D|x) = 1 - F_1(x),  P(F|x) = 1 - F_0(x).   (7b)

Given f_0 and f_1 shown in Fig. 2, Eq. (7b) holds; the resultant ROC is shown in Fig. 3. Clearly, the mean relative humidity X is an informative predictor of the precipitation occurrence indicator V, as the ROC lies decisively above the diagonal line (which characterizes an uninformative predictor); but X is far from being a perfect predictor of V, as the ROC passes far from the upper left corner of the graph (which characterizes a perfect predictor).

When there are two or more alternative predictors, they can be compared (and possibly ranked) in terms of a binary relation of informativeness. This relation derives from the Bayesian theory of sufficient comparisons, the essence of which is as follows (Blackwell, 1951, 1953; Krzysztofowicz and Long, 1990, 1991). Let X_i and X_j be two alternative predictors of V. Suppose a rational decision maker will use the probabilistic forecast of V from the BPO in a Bayesian decision procedure with a prior probability g and a loss function l. Let VA_i(g, l) denote the value of a probabilistic forecast generated by predictor X_i (as defined in the Bayesian theory).

Definition. Predictor X_i is said to be more informative than predictor X_j if and only if the value of a forecast generated by predictor X_i is at least as high as the value of a forecast generated by predictor X_j, for every prior probability g and every loss function l; formally, if and only if VA_i(g, l) ≥ VA_j(g, l) for every g, l.
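Equations (7) make the ROC mechanical to construct once F_0 and F_1 are in hand. A sketch under assumed Gaussian conditional distributions, purely for illustration (the fitted families in this paper differ, and all names and numbers below are ours):

```python
import math

def normal_cdf(x, mu, sigma):
    """Gaussian distribution function, standing in for an estimated F_v."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def roc_points(F0, F1, grid):
    """ROC per Eq. (7b): with L = f_0/f_1 strictly decreasing in x,
    P(D|x) = 1 - F1(x) and P(F|x) = 1 - F0(x) for each threshold x."""
    return [(1.0 - F0(x), 1.0 - F1(x)) for x in grid]

# Hypothetical conditional distributions of a humidity-like predictor on (0, 100):
# under V = 0 the predictor tends to be lower than under V = 1.
F0 = lambda x: normal_cdf(x, 50.0, 20.0)
F1 = lambda x: normal_cdf(x, 75.0, 15.0)
points = roc_points(F0, F1, range(0, 101, 5))
```

For an informative predictor the resulting (P(F|x), P(D|x)) pairs lie above the diagonal, as in Fig. 3; a perfect predictor would pass through the upper left corner.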

Inasmuch as any two rational decision makers may employ different prior probabilities and different loss functions, the condition for every g, l is synonymous with the statement for every rational decision maker. Blackwell (1953) proved the following.

Theorem. Predictor X_i is more informative than predictor X_j if and only if the ROC of X_i is superior to the ROC of X_j.

The binary relation of informativeness establishes a quasi order on a set of predictors X_1, ..., X_I. The quasi order is reflexive and transitive, but is not complete. That is, there may exist two predictors such that neither is more informative than the other, which is the case when one ROC crosses the other. Also there may exist two predictors that are equally informative, which is the case when one ROC is identical to the other.

In summary, an advantage of the BPO is that its elements, F_0 and F_1, enable us to characterize the informativeness of a predictor for a given predictand. When two or more predictors are available, they can easily be compared (and possibly ranked) in terms of the informativeness relation.

4.4 Posterior Probability

Once the three elements (g, f_0, f_1) are specified, the posterior probability π of precipitation occurrence may be calculated from Eq. (3), given any value x of the mean relative humidity output from the AVN model. Figure 4 shows the plot of π versus x for three values of g. Regardless of the value of g, the posterior probability π is an increasing, non-linear function of x. The basic shape of this function is determined by the conditional density functions f_0, f_1. The prior probability g scales (nonlinearly) this basic shape.

This has two practical implications. First, it illustrates the assertion made in Section 4.1 that the role of the prior probability g is to calibrate the forecast probability. Thus by estimating g from a large climatic sample (and by

properly modeling and estimating f_0 and f_1), the meteorologist can ensure the necessary condition for the forecast probability to be well calibrated against the climatic probability of precipitation occurrence at a specific location and within a specific season. Second, even though the conditional density functions f_0, f_1 remain fixed during a season (here the cool season), the prior probability g may change from month to month (or some other subseason). Consequently, the posterior probability π can be calibrated against the climatic probability for each month (or subseason) rather than for the 6-month long cool season.

4.5 Another Predictor

Different predictors behave differently. That is why each predictor should be modeled individually, and the catalog of parametric models from which the conditional distribution functions F_0, F_1 are drawn should be large enough to afford flexibility. To underscore this point, let us model another predictor: the relative vorticity on the isobaric surface of 850 hPa at 63 h after the 0000 UTC model run (for short, 850 hPa relative vorticity at 63 h). Figure 5 shows the empirical conditional distribution functions and the parametric conditional distribution functions F_0, F_1; here, F_0 is Weibull and F_1 is log-logistic; each is defined on the interval (-5, ∞) and is specified by two parameters. Figure 6 shows the conditional density functions f_0, f_1. Figure 7 shows the plot of π versus x for three values of g. Clearly, this predictor behaves differently than the previous one.

A comparison of the ROCs (Fig. 3) reveals that the mean relative humidity at 60 h is approximately more informative than the 850 hPa relative vorticity at 63 h for predicting precipitation occurrence during the period 60-66 h. (The adverb approximately is inserted because one ROC crosses the other near the left end.) Would a combination of the two predictors be more informative than either predictor alone? This question is answered at the end of Section 5.
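Blackwell's criterion reduces the comparison of two predictors to a pointwise comparison of their ROCs. A sketch, assuming both ROCs have been sampled at the same probability-of-false-alarm values (the function name and the interface are ours):

```python
def compare_roc(pd_i, pd_j):
    """Blackwell comparison of two predictors X_i and X_j whose ROCs are
    sampled at the same probability-of-false-alarm values.

    pd_i, pd_j -- probabilities of detection of the two predictors at those
                  matched false-alarm probabilities
    Returns 'i', 'j', 'equal', or 'incomparable' (the ROCs cross).
    """
    i_dominates = all(a >= b for a, b in zip(pd_i, pd_j))
    j_dominates = all(a <= b for a, b in zip(pd_i, pd_j))
    if i_dominates and j_dominates:
        return "equal"
    if i_dominates:
        return "i"
    if j_dominates:
        return "j"
    return "incomparable"
```

The "incomparable" outcome corresponds to the crossing ROCs discussed above: neither predictor is more informative than the other under the quasi order, which is why the text qualifies the humidity predictor as only approximately more informative.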

4.6 Binary-Continuous Predictor

Another informative predictor X of precipitation occurrence is an estimate of the total precipitation amount during a specified period, output from the NWP model with a specified lead time. Typically, X is a binary-continuous variate: it takes on the value zero on some days, and positive values on other days. Thus, the sample space of X is the interval [0, ∞), and the probability distribution of X assigns a nonzero probability to the event X = 0 and spreads the complementary probability over the interval (0, ∞) according to some density function.

When the probability of event X = 0 is small, X may be modeled approximately as a continuous variate. When the probability of event X = 0 is large, X should be modeled as a binary-continuous variate in order to extract from it all information. The BPO can be suitably modified to incorporate a binary-continuous predictor, alone or in combination with other continuous predictors. The case with a single binary-continuous predictor is described by Maranzano and Krzysztofowicz (2004).

4.7 Monotonicity of Likelihood Ratio

When there exists a physical or a logical requirement for the posterior probability π to be a monotone function of the predictor value x, as is the case in Figs. 4 and 7, this requirement can be enforced via the likelihood ratio function L = f_0/f_1. As may be inferred from Eq. (3), if L(x) decreases with x, then π increases with x; if L(x) increases with x, then π decreases with x. A monotonicity requirement may not be satisfied automatically by L simply because f_0 and f_1 are obtained without any constraint on their ratio f_0/f_1. Thus, when a monotonicity requirement exists, it is necessary to check that L satisfies it. Algorithms have been developed to perform this checking and to force a monotonicity requirement on L.
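On a grid of predictor values, the monotonicity check reduces to comparing consecutive likelihood ratios. A sketch (the names and grid are ours; the paper's algorithms for enforcing monotonicity are not reproduced here):

```python
import math

def likelihood_ratio_is_monotone(f0, f1, grid):
    """Check the monotonicity requirement of Section 4.7: evaluate
    L(x) = f0(x)/f1(x) on an increasing grid of x values and test whether
    it is non-increasing or non-decreasing throughout."""
    L = [f0(x) / f1(x) for x in grid]
    pairs = list(zip(L, L[1:]))
    return all(a >= b for a, b in pairs) or all(a <= b for a, b in pairs)

# Example: L(x) = exp(-x)/exp(-x/2) = exp(-x/2) is strictly decreasing,
# so this pair of densities passes the check.
grid = [0.5 * i for i in range(11)]
ok = likelihood_ratio_is_monotone(lambda x: math.exp(-x),
                                  lambda x: math.exp(-x / 2), grid)
```

A pair of fitted densities that fails this check would, per the text, be adjusted so that the forced monotonicity carries over to the posterior probability π.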

5. EXAMPLE WITH TWO PREDICTORS

5.1 Conditional Correlation Coefficients

Let X_1 denote the mean relative humidity predictor analyzed in Section 4.2, and let X_2 denote the relative vorticity predictor analyzed in Section 4.5. The analyses of individual predictors supply the conditional distribution functions (F_10, F_11; F_20, F_21) and the conditional density functions (f_10, f_11; f_20, f_21). In order to obtain the BPO with two predictors (X_1, X_2), it is necessary to estimate two conditional correlation coefficients (γ_120, γ_121) from which the two conditional correlation matrices, Γ_0 and Γ_1, are constructed. The estimation procedure, applicable to any number of predictors I ≥ 2, is as follows.

1. The joint sample {(x_1, ..., x_I, v)} is stratified into two conditional joint samples {(x_1, ..., x_I, 0)} and {(x_1, ..., x_I, 1)} according to the value of the precipitation indicator v. Every step that follows is performed twice, for v = 0 and v = 1.

2. Each conditional joint realization (x_1, ..., x_I, v) is processed through the NQT, z_iv = Q^{-1}(F_iv(x_i)), i = 1, ..., I, to obtain a transformed conditional joint realization (z_1v, ..., z_Iv).

3. The transformed conditional joint sample {(z_1v, ..., z_Iv)} is used to estimate the conditional Pearson's product-moment correlation coefficients γ_ijv for i = 1, ..., I - 1; j = i + 1, ..., I.

When applied to the joint sample at hand, the above estimation procedure yields γ_120 = 0.577 and γ_121 = . Thereby all input elements have been estimated, and the BPO is ready for forecasting.

5.2 Conditional Dependence Measures

Under the I-variate meta-Gaussian density function f_v, the parameter γ_ijv characterizes the stochastic dependence between X_i and X_j, conditional on the hypothesized precipitation event

V = v. For the purpose of interpretation, γ_ijv may be transformed into the Spearman's rank correlation coefficient ρ_ijv between X_i and X_j, conditional on the hypothesized precipitation event V = v. The transformation is given by (Kelly and Krzysztofowicz, 1997)

ρ_ijv = (6/π) arcsin(γ_ijv / 2).   (8)

In the present example, ρ_120 = 0.559 and ρ_121 = . From the estimates of γ_120 and γ_121 (or ρ_120 and ρ_121) one can infer that the mean relative humidity X_1 and the 850 hPa relative vorticity X_2 are stochastically dependent, conditional on the predictand V, and that the degree of dependence is somewhat stronger when precipitation occurs, V = 1, than when precipitation does not occur, V = 0.

5.3 Conditional Dependence Structures

The purpose of the NQT is to transform a given dependence structure of the predictors into the Gaussian dependence structure. To learn the dependence structure and to judge the performance of the NQT, scatterplots of the conditional joint samples are examined. There are two scatterplots of the original sample points (x_1, x_2), conditional on V = 0 and V = 1 (Figs. 8a and 8b). Each exhibits a non-Gaussian dependence structure: the scatters are not elliptic, especially the one conditional on V = 1, and the right-most points form a vertical frontier, an implication of X_1 being bounded above by 100%. Likewise, there are two scatterplots of the transformed sample points (z_1v, z_2v), for v = 0 and v = 1 (Figs. 8c and 8d). In each case, the scatter is elliptic, and the hypothesis of the Gaussian dependence structure cannot be refuted. Thus the NQT performs well. When the number of predictors I > 2, the analysis of the scatterplots should be performed for every pair of variates (X_i, X_j), i = 1, ..., I−1; j = i+1, ..., I. Pairwise analyses are sufficient to validate the I-variate meta-Gaussian dependence structure.
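The three-step estimation procedure of Section 5.1, together with the transformation in Eq. (8), can be sketched as follows. The joint sample below is synthetic (the AVN samples are not reproduced here), and the NQT uses the Weibull plotting position rank/(n + 1) as the empirical distribution function, which is an assumption; the paper's estimation details may differ.

```python
import numpy as np
from statistics import NormalDist
from math import pi, asin

def nqt(sample):
    """Normal quantile transform: z = Q^{-1}(F(x)), with F estimated by the
    Weibull plotting position rank/(n + 1) to keep probabilities in (0, 1)."""
    n = len(sample)
    ranks = np.argsort(np.argsort(sample)) + 1
    nd = NormalDist()
    return np.array([nd.inv_cdf(r / (n + 1.0)) for r in ranks])

def conditional_gammas(x1, x2, v):
    """Steps 1-3: stratify on v, transform each margin through the NQT, and
    compute the conditional Pearson correlation of the normal scores."""
    out = {}
    for val in (0, 1):
        m = (v == val)
        out[val] = float(np.corrcoef(nqt(x1[m]), nqt(x2[m]))[0, 1])
    return out

def spearman_from_gamma(g):
    """Eq. (8): rho_ijv = (6/pi) * arcsin(gamma_ijv / 2)."""
    return (6.0 / pi) * asin(g / 2.0)

# Synthetic joint sample: X1, X2 dependent more strongly when V = 1.
rng = np.random.default_rng(7)
n = 400
v = rng.integers(0, 2, n)
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + 0.9 * x1 * v + rng.normal(size=n)

gam = conditional_gammas(x1, x2, v)
print(gam[1] > gam[0])                       # dependence stronger given V = 1
print(round(spearman_from_gamma(0.577), 3))  # reproduces rho_120 = 0.559
```

Because the NQT is strictly increasing, it preserves ranks; the Pearson correlation of the normal scores is therefore a dependence measure of the copula rather than of the raw marginals.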

5.4 Second Example

The event to be forecasted is the occurrence of precipitation in Quillayute, Washington, during the 24-h period UTC, beginning 36 h after the 0000 UTC model run in the warm season (April–September). Let X_1 denote the relative humidity on the isobaric surface of 850 hPa at 36 h, estimated by the AVN model. Let X_2 denote the relative vorticity on the isobaric surface of 850 hPa at 36 h, estimated by the AVN model. The scatterplots are shown in Fig. 9. As in the first example, the NQT performs well: each of the two non-Gaussian dependence structures of the original sample points (especially the one in Fig. 9b) is transformed into the Gaussian dependence structure.

What makes this example different from the previous one is the vastly different degrees of conditional dependence: X_1 and X_2 are (approximately) independent (ρ_120 = 0.011), conditional on precipitation nonoccurrence, V = 0; X_1 and X_2 are positively dependent (ρ_121 = 0.358), conditional on precipitation occurrence, V = 1. The BPO takes the two conditional correlation coefficients explicitly into account, but non-Bayesian techniques (such as MOS regression and logistic regression) fail to do so. When ρ_120 and ρ_121 are significantly different, this may be one of the reasons for the superior performance of the BPO.

5.5 Predictors Selection

For every predictand, 34 potential predictors are defined by appropriately concatenating five variables (total precipitation amount, mean relative humidity, relative vorticity, relative humidity, and vertical velocity), three lead times, and four isobaric surfaces. From this set, the best combination of no more than five predictors is selected. The selection is accomplished via an algorithm that (i) maximizes RS (the area under the ROC, defined in Section 7.1) subject to the constraint that an additional predictor must increase RS by at least a specified threshold, (ii) employs objective optimization and heuristic search, and (iii) estimates the parameters of the BPO and the performance score RS from a given joint sample (an estimation sample here from 4 years).

In the examples for Buffalo with one predictor, X_1 (mean relative humidity at 60 h) or X_2 (850 hPa relative vorticity at 63 h), and with two predictors, X_1 and X_2, the scores are as follows: RS(X_1) = 0.818, RS(X_2) = 0.742, RS(X_1, X_2) = . Although the combination of two predictors (X_1, X_2) outperforms each of the single predictors, X_1 and X_2, the gain is below a threshold of significance. Thus, given only these two potential predictors, it is best to select the single predictor X_1.
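The selection algorithm of Section 5.5 can be sketched as a greedy forward search over predictors. Everything below is illustrative: the scoring rule is a stand-in (the mean of standardized candidate predictors, not the BPO posterior probability), and min_gain plays the role of the significance threshold mentioned above.

```python
import numpy as np

def roc_area(score, v):
    """RS: area under the ROC, computed via the rank (Mann-Whitney) identity,
    which equals the trapezoidal-rule area under the empirical ROC."""
    pos, neg = score[v == 1], score[v == 0]
    d = pos[:, None] - neg[None, :]
    return float((d > 0).mean() + 0.5 * (d == 0).mean())

def stand_in_score(X):
    # Illustrative scoring rule only: average of standardized predictors.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return Z.mean(axis=1)

def greedy_select(X, v, max_k=5, min_gain=0.01):
    """Forward selection: add the predictor giving the largest increase in RS;
    stop when the best gain falls below min_gain or max_k is reached."""
    chosen, best = [], 0.5                      # RS of an uninformative forecast
    while len(chosen) < max_k:
        candidates = [(roc_area(stand_in_score(X[:, chosen + [j]]), v), j)
                      for j in range(X.shape[1]) if j not in chosen]
        if not candidates:
            break
        rs, j = max(candidates)
        if rs - best < min_gain:
            break
        chosen.append(j)
        best = rs
    return chosen, best

# Synthetic example: predictor 0 informative, predictor 1 pure noise.
rng = np.random.default_rng(3)
n = 500
v = rng.integers(0, 2, n)
X = np.column_stack([v + rng.normal(size=n), rng.normal(size=n)])

chosen, rs = greedy_select(X, v)
print(chosen)  # the noise predictor does not clear the threshold
```

As in the Buffalo example, a second predictor is admitted only if it raises RS by more than the threshold; here the noise predictor fails that test and the single informative predictor is retained.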

6. MOS SYSTEM

6.1 Forecasting Equation

The primary benchmark for evaluation of the BPO is the MOS system (Glahn and Lowry, 1972; Antolik, 2000) currently used in operational forecasting by the NWS. For a binary predictand, the MOS forecasting equation has the general form

π = a_0 + Σ_{i=1}^{I} a_i t_i(x_i),   (9)

where t_i is some transform determined experientially for each predictor X_i (i = 1, ..., I), and a_0, a_1, ..., a_I are regression coefficients. The predictand and the predictors are defined at a station. For the predictand defined in Section 2.2, the MOS utilizes five predictors:

1. Total precipitation amount during 6-h period, h; cutoff 2.54 mm.
2. Total precipitation amount during 3-h period, h; cutoff mm.
3. Relative humidity at the pressure level of 700 hPa at 66 h; cutoff 70%.
4. Relative humidity at the pressure level of 850 hPa at 60 h; cutoff 90%.
5. Vertical velocity at the pressure level of 850 hPa at 57 h; cutoff .

6.2 Grid-Binary Transform

In some cases, a predictor enters Eq. (9) untransformed, i.e., t_i(x_i) = x_i. In the present case, each predictor is subjected to a grid-binary transformation, which is specified in terms of a heuristic algorithm (Jensenius, 1992). The algorithm takes the gridded field of predictor values and performs on it three operations: (i) mapping of each gridpoint value into "1" or "0", which indicates the exceedance or nonexceedance of a specified cutoff level; (ii) smoothing of the resultant binary field; and (iii) interpolation of the gridpoint values to the value t_i(x_i) at a station. It follows that the transformed predictor value t_i(x_i) at a station depends upon the original predictor values at all grid points in a vicinity. Thus, when viewed as a transform of the original predictor X_i into a

grid-binary predictor t_i(X_i) at a fixed station, the transform t_i is nonlinear and nonstationary (from one forecast time to the next). The grid-binary predictor t_i(X_i) is dimensionless, and its sample space is the closed unit interval [0, 1].

6.3 Estimation

The regression coefficients in Eq. (9) are estimated from a joint sample {(t_1(x_1), ..., t_I(x_I), v)} of realizations of the transformed predictors and the predictand. Like the sample for the BPO, this sample includes all daily realizations in the cool season (October–March) in 4 years. Unlike the sample for the BPO, this sample includes not only the realizations at the Buffalo station, but the realizations at all stations within the region to which Buffalo belongs. The pooling of station samples into a regional sample is needed to ensure a stable estimation of the MOS regression coefficients (Antolik, 2000). The estimates obtained by the MDL are: a_0 = , a_1 = , a_2 = , a_3 = , a_4 = , a_5 = . These estimates are assumed to be valid for every station within the region.

6.4 Predictors Selection

For every predictand, there are about 176 potential predictors. The main reason for this number being about five times larger than the 34 in the BPO is that MOS employs the grid-binary predictors: for each variable there are several cutoff levels, each of which generates a new predictor. The best predictors are selected sequentially according to the maximum variance reduction criterion of linear regression and the stopping criterion whereby an additional predictor must reduce variance by at least a specified threshold. Up to 15 predictors can be selected.
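The three operations of the grid-binary transform described in Section 6.2 can be sketched as follows. The smoother below is a simple nine-point box average applied twice, an assumption for illustration; the operational algorithm (Jensenius, 1992) uses its own smoothing and interpolation, and the humidity field here is hypothetical.

```python
import numpy as np

def grid_binary(field, cutoff, passes=2):
    """Sketch of the grid-binary transform: (i) threshold the gridded field at
    the cutoff, (ii) smooth the binary field (nine-point box average, repeated),
    yielding gridpoint values in [0, 1] that (iii) would then be interpolated
    to the station location."""
    b = (field >= cutoff).astype(float)          # (i) exceedance indicator
    nrow, ncol = b.shape
    for _ in range(passes):                      # (ii) smoothing
        p = np.pad(b, 1, mode='edge')
        b = sum(p[i:i + nrow, j:j + ncol]
                for i in range(3) for j in range(3)) / 9.0
    return b

# Hypothetical 5x5 relative-humidity field (percent), cutoff 70%.
rh = np.array([[55, 60, 65, 72, 80],
               [58, 62, 68, 75, 85],
               [60, 66, 71, 78, 88],
               [63, 69, 74, 81, 90],
               [65, 72, 77, 84, 92]], float)

t = grid_binary(rh, 70.0)
print(bool(t.min() >= 0.0 and t.max() <= 1.0))  # True: values lie in [0, 1]
```

The sketch makes the key property of Section 6.2 visible: the value at any grid point after smoothing depends on the original field at all neighboring grid points, so the transform cannot be reproduced from a single station's data.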

7. COMPARISON OF BPO WITH MOS

7.1 System Versus Technique

There is a fundamental distinction between a forecasting technique and a forecasting system, which for our purposes is this. A forecasting technique is essentially a forecasting equation with a generic statistical interpretation: Eqs. (4)–(6) for the BPO and Eq. (9) for MOS. A forecasting system is a conjunction of a forecasting technique and a processing software that an organization employs to process real-time data into operational forecasts. For instance, any comparison involving the MOS technique, as defined by Eq. (9) but outside its processing software, would be a sterile experiment, unrepresentative of the actual MOS system of the NWS. For, as explained in Section 6.2, the grid-binary transformations are an intrinsic, though often overlooked, part of that system: they require processing of the entire gridded fields of model outputs, they cannot be reproduced except through software, and they cannot be executed on data from an isolated station or an isolated grid point at which a comparison of techniques might be undertaken.

Whereas it is of scientific interest to compare the BPO technique against the MOS technique and other traditional statistical techniques (several such comparisons have already been performed and will be reported in future publications), it is far more important to mission-oriented agencies to compare the prototype BPO system with the operational MOS system. In his review, C. Doswell concurred: "... it probably would be revealing to compare forecasts generated by the BPO method against the real operational MOS ... it would be a more convincing yardstick for comparison and contrast."

7.2 Performance Measures

It is apparent that each system, BPO and MOS, processes information in a totally different manner. The objective of the following experiment is to compare the two systems with respect to

the efficiency of extracting the predictive information from the same data record: the archive of the AVN model output. Towards this end, two comparative verifications of forecasts are performed based on two input samples: (i) the estimation joint sample {(x, v)} from 4 years (April 1997 – March 2001); this is the same joint sample that was used for estimation of the family of likelihood functions (f_0, f_1) of the BPO; and (ii) the validation joint sample {(x, v)} from 2 1/2 years (April 2001 – September 2003); this joint sample is used solely for validation.

Given an input sample (either the estimation sample or the validation sample), each system (BPO and MOS) is used to calculate the forecast probability π based on every realization of its predictors. (The MOS forecasts calculated from the validation sample are actually the operational AVN-MOS forecasts produced by the NWS during the 2 1/2 years; we simply re-calculated them.) Then the joint sample {(π, v)} of realizations of the forecast probability and the predictand is used to calculate the following performance measures.

The calibration function (CF): a graph of the conditional probability η(π) = P(V = 1 | Π = π) versus the forecast probability π.

The receiver operating characteristic (ROC): a graph of the probability of detection versus the probability of false alarm.

The calibration score (CS): the Euclidean distance (the square root of the expected quadratic difference) between the line of perfect calibration and the calibration function:

CS = {E([Π − η(Π)]^2)}^{1/2};   0 ≤ CS ≤ 1.

The ROC score (RS): the area under the ROC (calculated from a piecewise linear estimate of the ROC using the trapezoidal rule); 1/2 ≤ RS ≤ 1. Some basic facts pertaining to this performance measure are as follows: (i) System A is more informative than system B if and only if the ROC of A is superior to the ROC of B. (ii) If system A is more informative than system B,

then the RS of A is not smaller than the RS of B.

7.3 Comparative Verifications

Complete results are presented for the 6-h forecast period, h after the model run. The BPO uses one predictor (mean relative humidity at 60 h, as detailed in Section 4); the MOS uses five predictors (as detailed in Section 6).

Figure 11 shows the CF and the CS from every verification. Both BPO and MOS exhibit stable calibration across the two samples, the estimation sample and the validation sample. The MOS probabilities smaller than 0.4 are well calibrated, but those greater than 0.4 are poorly calibrated on both samples. The BPO probabilities are well calibrated on both samples. Based on the CS from the validation sample, the BPO is calibrated better than MOS, by on average (on the probability scale).

Figure 12 shows the ROC and the RS from every verification. Both BPO and MOS exhibit stable informativeness across the two samples, the estimation sample and the validation sample. For each sample, the two ROCs cross each other. Thus neither system is more informative than the other. For each sample, the RS of the BPO is slightly higher than the RS of MOS.

A summary of results is presented for three forecast periods, 6-h, 12-h, 24-h, each beginning 60 h after the model run. Table 2 lists the predictors used by the BPO; Tables 3 and 4 report the scores from verifications on the estimation samples and on the validation samples. In all six cases, the BPO is calibrated significantly better than MOS: the CS of the BPO is at least 50% smaller than the CS of MOS. In five out of six cases, the RS of the BPO is slightly higher than the RS of MOS.

Finally, there is a consistent difference in terms of the number of optimal predictors selected for each system during its development: the BPO uses 1–2 predictors, which are always extracted directly from the output fields of the AVN model; MOS uses 4–5 predictors, most of which are

obtained through grid-binary transformations of the output fields of the AVN model (Section 6.2).

7.4 Explanations

Calibration. Why is it that the BPO is calibrated significantly better than MOS? Why is it that MOS is poorly calibrated, contrary to the verification results of past studies? The explanation is twofold. First, as elaborated in Sections 4.1 and 4.4, the theoretic structure of the BPO forecasting equation (3) ensures the necessary condition for the forecast probability to be well calibrated against the prior (climatic) probability g input into the equation for a specific location and season. The ad-hoc structure of the MOS forecasting equation (9) does not offer this property. Second, the good calibrations of MOS reported in past studies (e.g., Murphy and Brown, 1984; Antolik, 2000) may have been an artifact of the analyses. For these studies did not verify the calibration of MOS at any specific location (which is of import to the users of forecasts at that location), but instead pooled the verification samples from many locations into one national sample from which verification statistics were calculated. If the prior probability and the degree of calibration varied across locations, then the verification statistics obtained from a pooled sample did not pertain to any location and therefore would be misleading to users.

Informativeness. Why is it that MOS needs two to four additional predictors to barely match the informativeness of the BPO? The explanation once again is twofold. First, the laws of probability theory, from which the BPO is derived, ensure the optimal structure of the BPO forecasting equation (3). The structure of the MOS forecasting equation (9) is different. Therefore, given any single predictor, the BPO system, if properly operationalized, can never be less informative than the MOS system (or any other non-Bayesian system, for that matter). To make up for the non-optimal theoretic structure, a non-Bayesian system needs additional

predictors (which are conditionally informative in that system). Second, the grid-binary transform (Jensenius, 1992) was invented to improve the calibration of the MOS system. But by mapping an original predictor (which is binary-continuous or continuous) into a binary predictor, this transform also removes part of the predictive information contained in the original predictor. In the examples reported herein, two to four additional predictors are needed to make up for the lost information and the non-optimal structure of the MOS forecasting equation.

To dissect the predictive performance of the grid-binary transform, each system, MOS and BPO, was estimated and evaluated twice: first utilizing an original predictor, and next utilizing the grid-binary transformation of that predictor. There were two findings. (i) The use of the grid-binary transform in MOS leads to a compromise: the transform improves the CS but deteriorates the RS. (ii) The use of the grid-binary transform in the BPO is unnecessary for calibration (because the BPO automatically calibrates the posterior probability against the specified prior probability) and is detrimental for informativeness (because it removes part of the predictive information contained in the original predictor).
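The two verification scores of Section 7.2 can be computed from a joint verification sample {(π, v)} as sketched below. The binning of forecast probabilities used to estimate the calibration function η(π) is an implementation choice, not taken from the paper, and the synthetic forecasts are perfectly calibrated by construction (V drawn as Bernoulli(π)).

```python
import numpy as np

def calibration_score(p, v, bins=10):
    """CS = sqrt(E[(Pi - eta(Pi))^2]), with eta(pi) = P(V=1 | Pi=pi) estimated
    by the conditional relative frequency within probability bins (the binning
    is an assumption of this sketch)."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(p, edges) - 1, 0, bins - 1)
    se = 0.0
    for b in range(bins):
        m = (idx == b)
        if m.any():
            eta = v[m].mean()                  # empirical calibration function
            se += ((p[m] - eta) ** 2).sum()
    return float(np.sqrt(se / len(p)))

def roc_score(p, v):
    """RS: area under the empirical ROC; the rank identity used here equals
    the trapezoidal-rule area mentioned in the text."""
    pos, neg = p[v == 1], p[v == 0]
    d = pos[:, None] - neg[None, :]
    return float((d > 0).mean() + 0.5 * (d == 0).mean())

# Perfectly calibrated synthetic forecasts: V ~ Bernoulli(pi).
rng = np.random.default_rng(11)
n = 2000
p = rng.uniform(size=n)
v = (rng.uniform(size=n) < p).astype(int)

cs = calibration_score(p, v)
rs = roc_score(p, v)
print(round(cs, 3), round(rs, 3))
```

For these calibrated forecasts CS is near its ideal value of zero (it is not exactly zero because η must be estimated from a finite sample), while RS sits well above 1/2, illustrating that calibration and informativeness are distinct attributes.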

8. SUMMARY

8.1 Bayesian Technique

1. The BPO for a binary predictand described herein is the first technique of its kind for probabilistic forecasting of weather variates: it produces the posterior probability of an event through Bayesian fusion of a prior (climatic) probability and a realization of predictors output from an NWP model.

2. The BPO implements Bayes theorem, which provides the correct theoretic structure of the forecasting equation, and employs the meta-Gaussian family of multivariate density functions, which provides a flexible and convenient parametric model. It can be estimated effectively from asymmetric samples: the climatic sample of the predictand (which is typically long), and the joint sample of the predictor vector and the predictand (which is typically short).

3. The development of the BPO has focused on quality of modeling and simplicity of estimation. The BPO allows (i) the marginal conditional distribution functions of the predictors to be of any form (as typically they are non-Gaussian), and (ii) the conditional dependence structure between any two predictors to be non-linear and heteroscedastic (as typically is the case in meteorology). Despite this flexibility, the BPO requires the estimation of only distribution parameters and correlation coefficients. And the entire process of selecting predictors, choosing parametric distribution functions, and estimating parameters can be automated.

8.2 Preliminary Results

1. The PoP produced by the prototype BPO system is better calibrated than, and at least as informative as, the PoP produced by the operational MOS system for a specific location (and hence for a specific user).

2. The BPO utilizing one or two predictors performs, in terms of both calibration and informativeness,


More information

NOTES AND CORRESPONDENCE. Improving Week-2 Forecasts with Multimodel Reforecast Ensembles

NOTES AND CORRESPONDENCE. Improving Week-2 Forecasts with Multimodel Reforecast Ensembles AUGUST 2006 N O T E S A N D C O R R E S P O N D E N C E 2279 NOTES AND CORRESPONDENCE Improving Week-2 Forecasts with Multimodel Reforecast Ensembles JEFFREY S. WHITAKER AND XUE WEI NOAA CIRES Climate

More information

Adaptation for global application of calibration and downscaling methods of medium range ensemble weather forecasts

Adaptation for global application of calibration and downscaling methods of medium range ensemble weather forecasts Adaptation for global application of calibration and downscaling methods of medium range ensemble weather forecasts Nathalie Voisin Hydrology Group Seminar UW 11/18/2009 Objective Develop a medium range

More information

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

Stochastic Population Forecasting based on a Combination of Experts Evaluations and accounting for Correlation of Demographic Components

Stochastic Population Forecasting based on a Combination of Experts Evaluations and accounting for Correlation of Demographic Components Stochastic Population Forecasting based on a Combination of Experts Evaluations and accounting for Correlation of Demographic Components Francesco Billari, Rebecca Graziani and Eugenio Melilli 1 EXTENDED

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 Contents Preface to Second Edition Preface to First Edition Abbreviations xv xvii xix PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 1 The Role of Statistical Methods in Modern Industry and Services

More information

J11.5 HYDROLOGIC APPLICATIONS OF SHORT AND MEDIUM RANGE ENSEMBLE FORECASTS IN THE NWS ADVANCED HYDROLOGIC PREDICTION SERVICES (AHPS)

J11.5 HYDROLOGIC APPLICATIONS OF SHORT AND MEDIUM RANGE ENSEMBLE FORECASTS IN THE NWS ADVANCED HYDROLOGIC PREDICTION SERVICES (AHPS) J11.5 HYDROLOGIC APPLICATIONS OF SHORT AND MEDIUM RANGE ENSEMBLE FORECASTS IN THE NWS ADVANCED HYDROLOGIC PREDICTION SERVICES (AHPS) Mary Mullusky*, Julie Demargne, Edwin Welles, Limin Wu and John Schaake

More information

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline. Practitioner Course: Portfolio Optimization September 10, 2008 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y ) (x,

More information

Application and verification of ECMWF products 2016

Application and verification of ECMWF products 2016 Application and verification of ECMWF products 2016 Icelandic Meteorological Office (www.vedur.is) Bolli Pálmason and Guðrún Nína Petersen 1. Summary of major highlights Medium range weather forecasts

More information

Discussion of Predictive Density Combinations with Dynamic Learning for Large Data Sets in Economics and Finance

Discussion of Predictive Density Combinations with Dynamic Learning for Large Data Sets in Economics and Finance Discussion of Predictive Density Combinations with Dynamic Learning for Large Data Sets in Economics and Finance by Casarin, Grassi, Ravazzolo, Herman K. van Dijk Dimitris Korobilis University of Essex,

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

DS-GA 1002 Lecture notes 12 Fall Linear regression

DS-GA 1002 Lecture notes 12 Fall Linear regression DS-GA Lecture notes 1 Fall 16 1 Linear models Linear regression In statistics, regression consists of learning a function relating a certain quantity of interest y, the response or dependent variable,

More information

Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes

Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Ellida M. Khazen * 13395 Coppermine Rd. Apartment 410 Herndon VA 20171 USA Abstract

More information

Basic Verification Concepts

Basic Verification Concepts Basic Verification Concepts Barbara Brown National Center for Atmospheric Research Boulder Colorado USA bgb@ucar.edu Basic concepts - outline What is verification? Why verify? Identifying verification

More information

Probability and Information Theory. Sargur N. Srihari

Probability and Information Theory. Sargur N. Srihari Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal

More information

Denver International Airport MDSS Demonstration Verification Report for the Season

Denver International Airport MDSS Demonstration Verification Report for the Season Denver International Airport MDSS Demonstration Verification Report for the 2015-2016 Season Prepared by the University Corporation for Atmospheric Research Research Applications Division (RAL) Seth Linden

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

Towards Operational Probabilistic Precipitation Forecast

Towards Operational Probabilistic Precipitation Forecast 5 Working Group on Verification and Case Studies 56 Towards Operational Probabilistic Precipitation Forecast Marco Turco, Massimo Milelli ARPA Piemonte, Via Pio VII 9, I-10135 Torino, Italy 1 Aim of the

More information

Prediction of Snow Water Equivalent in the Snake River Basin

Prediction of Snow Water Equivalent in the Snake River Basin Hobbs et al. Seasonal Forecasting 1 Jon Hobbs Steve Guimond Nate Snook Meteorology 455 Seasonal Forecasting Prediction of Snow Water Equivalent in the Snake River Basin Abstract Mountainous regions of

More information

Threshold estimation in marginal modelling of spatially-dependent non-stationary extremes

Threshold estimation in marginal modelling of spatially-dependent non-stationary extremes Threshold estimation in marginal modelling of spatially-dependent non-stationary extremes Philip Jonathan Shell Technology Centre Thornton, Chester philip.jonathan@shell.com Paul Northrop University College

More information

Prentice Hall Mathematics, Geometry 2009 Correlated to: Connecticut Mathematics Curriculum Framework Companion, 2005 (Grades 9-12 Core and Extended)

Prentice Hall Mathematics, Geometry 2009 Correlated to: Connecticut Mathematics Curriculum Framework Companion, 2005 (Grades 9-12 Core and Extended) Grades 9-12 CORE Algebraic Reasoning: Patterns And Functions GEOMETRY 2009 Patterns and functional relationships can be represented and analyzed using a variety of strategies, tools and technologies. 1.1

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Application and verification of ECMWF products 2012

Application and verification of ECMWF products 2012 Application and verification of ECMWF products 2012 Instituto Português do Mar e da Atmosfera, I.P. (IPMA) 1. Summary of major highlights ECMWF products are used as the main source of data for operational

More information

Basic Verification Concepts

Basic Verification Concepts Basic Verification Concepts Barbara Brown National Center for Atmospheric Research Boulder Colorado USA bgb@ucar.edu May 2017 Berlin, Germany Basic concepts - outline What is verification? Why verify?

More information

The benefits and developments in ensemble wind forecasting

The benefits and developments in ensemble wind forecasting The benefits and developments in ensemble wind forecasting Erik Andersson Slide 1 ECMWF European Centre for Medium-Range Weather Forecasts Slide 1 ECMWF s global forecasting system High resolution forecast

More information

The Naïve Bayes Classifier. Machine Learning Fall 2017

The Naïve Bayes Classifier. Machine Learning Fall 2017 The Naïve Bayes Classifier Machine Learning Fall 2017 1 Today s lecture The naïve Bayes Classifier Learning the naïve Bayes Classifier Practical concerns 2 Today s lecture The naïve Bayes Classifier Learning

More information

Standardized Anomaly Model Output Statistics Over Complex Terrain.

Standardized Anomaly Model Output Statistics Over Complex Terrain. Standardized Anomaly Model Output Statistics Over Complex Terrain Reto.Stauffer@uibk.ac.at Outline statistical ensemble postprocessing introduction to SAMOS new snow amount forecasts in Tyrol sub-seasonal

More information

BAYESIAN ENSEMBLE FORECAST OF RIVER STAGES AND ENSEMBLE SIZE REQUIREMENTS. Henry D. Herr. Office of Hydrologic Development. National Weather Service

BAYESIAN ENSEMBLE FORECAST OF RIVER STAGES AND ENSEMBLE SIZE REQUIREMENTS. Henry D. Herr. Office of Hydrologic Development. National Weather Service BAYESIAN ENSEMBLE FORECAST OF RIVER STAGES AND ENSEMBLE SIZE REQUIREMENTS By Henry D. Herr Office of Hydrologic Development National Weather Service 1325 East-West Highway Silver Spring, MD 20910, USA

More information

Der SPP 1167-PQP und die stochastische Wettervorhersage

Der SPP 1167-PQP und die stochastische Wettervorhersage Der SPP 1167-PQP und die stochastische Wettervorhersage Andreas Hense 9. November 2007 Overview The priority program SPP1167: mission and structure The stochastic weather forecasting Introduction, Probabilities

More information

Available online at ScienceDirect. Procedia Engineering 119 (2015 ) 13 18

Available online at   ScienceDirect. Procedia Engineering 119 (2015 ) 13 18 Available online at www.sciencedirect.com ScienceDirect Procedia Engineering 119 (2015 ) 13 18 13th Computer Control for Water Industry Conference, CCWI 2015 Real-time burst detection in water distribution

More information

The Development of Guidance for Forecast of. Maximum Precipitation Amount

The Development of Guidance for Forecast of. Maximum Precipitation Amount The Development of Guidance for Forecast of Maximum Precipitation Amount Satoshi Ebihara Numerical Prediction Division, JMA 1. Introduction Since 198, the Japan Meteorological Agency (JMA) has developed

More information

Strategy for Using CPC Precipitation and Temperature Forecasts to Create Ensemble Forcing for NWS Ensemble Streamflow Prediction (ESP)

Strategy for Using CPC Precipitation and Temperature Forecasts to Create Ensemble Forcing for NWS Ensemble Streamflow Prediction (ESP) Strategy for Using CPC Precipitation and Temperature Forecasts to Create Ensemble Forcing for NWS Ensemble Streamflow Prediction (ESP) John Schaake (Acknowlements: D.J. Seo, Limin Wu, Julie Demargne, Rob

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

EVALUATION OF NDFD AND DOWNSCALED NCEP FORECASTS IN THE INTERMOUNTAIN WEST 2. DATA

EVALUATION OF NDFD AND DOWNSCALED NCEP FORECASTS IN THE INTERMOUNTAIN WEST 2. DATA 2.2 EVALUATION OF NDFD AND DOWNSCALED NCEP FORECASTS IN THE INTERMOUNTAIN WEST Brandon C. Moore 1 *, V.P. Walden 1, T.R. Blandford 1, B. J. Harshburger 1, and K. S. Humes 1 1 University of Idaho, Moscow,

More information

VERFICATION OF OCEAN WAVE ENSEMBLE FORECAST AT NCEP 1. Degui Cao, H.S. Chen and Hendrik Tolman

VERFICATION OF OCEAN WAVE ENSEMBLE FORECAST AT NCEP 1. Degui Cao, H.S. Chen and Hendrik Tolman VERFICATION OF OCEAN WAVE ENSEMBLE FORECAST AT NCEP Degui Cao, H.S. Chen and Hendrik Tolman NOAA /National Centers for Environmental Prediction Environmental Modeling Center Marine Modeling and Analysis

More information

Downscaling in Time. Andrew W. Robertson, IRI. Advanced Training Institute on Climate Variability and Food Security, 12 July 2002

Downscaling in Time. Andrew W. Robertson, IRI. Advanced Training Institute on Climate Variability and Food Security, 12 July 2002 Downscaling in Time Andrew W. Robertson, IRI Advanced Training Institute on Climate Variability and Food Security, 12 July 2002 Preliminaries Crop yields are driven by daily weather variations! Current

More information

Introduction to Signal Detection and Classification. Phani Chavali

Introduction to Signal Detection and Classification. Phani Chavali Introduction to Signal Detection and Classification Phani Chavali Outline Detection Problem Performance Measures Receiver Operating Characteristics (ROC) F-Test - Test Linear Discriminant Analysis (LDA)

More information

Machine Learning, Midterm Exam: Spring 2009 SOLUTION

Machine Learning, Midterm Exam: Spring 2009 SOLUTION 10-601 Machine Learning, Midterm Exam: Spring 2009 SOLUTION March 4, 2009 Please put your name at the top of the table below. If you need more room to work out your answer to a question, use the back of

More information

Warwick Business School Forecasting System. Summary. Ana Galvao, Anthony Garratt and James Mitchell November, 2014

Warwick Business School Forecasting System. Summary. Ana Galvao, Anthony Garratt and James Mitchell November, 2014 Warwick Business School Forecasting System Summary Ana Galvao, Anthony Garratt and James Mitchell November, 21 The main objective of the Warwick Business School Forecasting System is to provide competitive

More information

Forecasting of Optical Turbulence in Support of Realtime Optical Imaging and Communication Systems

Forecasting of Optical Turbulence in Support of Realtime Optical Imaging and Communication Systems Forecasting of Optical Turbulence in Support of Realtime Optical Imaging and Communication Systems Randall J. Alliss and Billy Felton Northrop Grumman Corporation, 15010 Conference Center Drive, Chantilly,

More information

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Berkman Sahiner, a) Heang-Ping Chan, Nicholas Petrick, Robert F. Wagner, b) and Lubomir Hadjiiski

More information

Probabilistic temperature post-processing using a skewed response distribution

Probabilistic temperature post-processing using a skewed response distribution Probabilistic temperature post-processing using a skewed response distribution Manuel Gebetsberger 1, Georg J. Mayr 1, Reto Stauffer 2, Achim Zeileis 2 1 Institute of Atmospheric and Cryospheric Sciences,

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Statistics for extreme & sparse data

Statistics for extreme & sparse data Statistics for extreme & sparse data University of Bath December 6, 2018 Plan 1 2 3 4 5 6 The Problem Climate Change = Bad! 4 key problems Volcanic eruptions/catastrophic event prediction. Windstorms

More information

Extreme Value Analysis and Spatial Extremes

Extreme Value Analysis and Spatial Extremes Extreme Value Analysis and Department of Statistics Purdue University 11/07/2013 Outline Motivation 1 Motivation 2 Extreme Value Theorem and 3 Bayesian Hierarchical Models Copula Models Max-stable Models

More information

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric?

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/

More information

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26 Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21 CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about

More information

JP3.7 SHORT-RANGE ENSEMBLE PRECIPITATION FORECASTS FOR NWS ADVANCED HYDROLOGIC PREDICTION SERVICES (AHPS): PARAMETER ESTIMATION ISSUES

JP3.7 SHORT-RANGE ENSEMBLE PRECIPITATION FORECASTS FOR NWS ADVANCED HYDROLOGIC PREDICTION SERVICES (AHPS): PARAMETER ESTIMATION ISSUES JP3.7 SHORT-RANGE ENSEMBLE PRECIPITATION FORECASTS FOR NWS ADVANCED HYDROLOGIC PREDICTION SERVICES (AHPS): PARAMETER ESTIMATION ISSUES John Schaake*, Mary Mullusky, Edwin Welles and Limin Wu Hydrology

More information

High Wind and Energy Specific Models for Global. Production Forecast

High Wind and Energy Specific Models for Global. Production Forecast High Wind and Energy Specific Models for Global Production Forecast Carlos Alaíz, Álvaro Barbero, Ángela Fernández, José R. Dorronsoro Dpto. de Ingeniería Informática and Instituto de Ingeniería del Conocimiento

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Reduced Overdispersion in Stochastic Weather Generators for Statistical Downscaling of Seasonal Forecasts and Climate Change Scenarios

Reduced Overdispersion in Stochastic Weather Generators for Statistical Downscaling of Seasonal Forecasts and Climate Change Scenarios Reduced Overdispersion in Stochastic Weather Generators for Statistical Downscaling of Seasonal Forecasts and Climate Change Scenarios Yongku Kim Institute for Mathematics Applied to Geosciences National

More information

Upscaled and fuzzy probabilistic forecasts: verification results

Upscaled and fuzzy probabilistic forecasts: verification results 4 Predictability and Ensemble Methods 124 Upscaled and fuzzy probabilistic forecasts: verification results Zied Ben Bouallègue Deutscher Wetterdienst (DWD), Frankfurter Str. 135, 63067 Offenbach, Germany

More information

A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling

A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling G. B. Kingston, H. R. Maier and M. F. Lambert Centre for Applied Modelling in Water Engineering, School

More information

DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective

DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective Second Edition Scott E. Maxwell Uniuersity of Notre Dame Harold D. Delaney Uniuersity of New Mexico J,t{,.?; LAWRENCE ERLBAUM ASSOCIATES,

More information

FORECASTING: A REVIEW OF STATUS AND CHALLENGES. Eric Grimit and Kristin Larson 3TIER, Inc. Pacific Northwest Weather Workshop March 5-6, 2010

FORECASTING: A REVIEW OF STATUS AND CHALLENGES. Eric Grimit and Kristin Larson 3TIER, Inc. Pacific Northwest Weather Workshop March 5-6, 2010 SHORT-TERM TERM WIND POWER FORECASTING: A REVIEW OF STATUS AND CHALLENGES Eric Grimit and Kristin Larson 3TIER, Inc. Pacific Northwest Weather Workshop March 5-6, 2010 Integrating Renewable Energy» Variable

More information

PRICING AND PROBABILITY DISTRIBUTIONS OF ATMOSPHERIC VARIABLES

PRICING AND PROBABILITY DISTRIBUTIONS OF ATMOSPHERIC VARIABLES PRICING AND PROBABILITY DISTRIBUTIONS OF ATMOSPHERIC VARIABLES TECHNICAL WHITE PAPER WILLIAM M. BRIGGS Abstract. Current methods of assessing the probability distributions of atmospheric variables are

More information