Analysis of Regression and Bayesian Predictive Uncertainty Measures

Dan Lu, Mary C. Hill, Ming Ye
Florida State University, dl7f@fsu.edu, mye@fsu.edu, Tallahassee, FL, USA
U.S. Geological Survey, mchill@usgs.gov, Boulder, CO, USA

ABSTRACT

Predictive uncertainty can be quantified using confidence and probability intervals constructed around predictions. Confidence intervals are based on regression inferential theory; probability intervals are based on Bayesian theory. For the confidence intervals, this work considers linear and nonlinear confidence intervals obtained using methods that require tens and hundreds of model runs, respectively. The probability intervals are obtained using Markov chain Monte Carlo (MCMC) methods that require 10,000s of model runs. Confidence and probability intervals are conceptually different and are mathematically equivalent only under certain conditions. We use simple test cases to show that, for linear models, the two types of intervals are mathematically equivalent with proper choices of prior probability. For nonlinear models, however, the two types of intervals always differ, regardless of the choice of prior probability. The discrepancy depends on the total nonlinearity of the model. It is therefore inappropriate to use the two intervals to validate each other, as has been done in previous practice.

INTRODUCTION

Groundwater modeling is often used to predict the effects of future anthropogenic or natural occurrences. Since model predictions are inherently uncertain, quantification of predictive uncertainty is necessary. Confidence intervals and probability intervals constructed around the predictions can be used as measures of predictive uncertainty. Confidence intervals are based on inferential statistical theory from regression; both linear and nonlinear intervals can be calculated. Probability intervals are based on Bayesian theory, and Markov chain Monte Carlo (MCMC) methods have been popular for estimating them.
While comparative studies of the two types of predictive uncertainty measures have been conducted (e.g., Vrugt and Bouten, 2002; Gallagher and Doherty, 2007), the underlying theoretical differences remain unclear. The purpose of this work is to compare the two kinds of predictive uncertainty measures by investigating their theoretical differences. To illustrate these differences, we consider a set of simple test cases with linear and nonlinear models.

CONFIDENCE INTERVALS AND PROBABILITY INTERVALS

From the frequentist point of view, the confidence level of a confidence interval is the percentage of the time, in repeated sampling, that the interval contains the true prediction. To understand this better, consider the procedure for evaluating confidence intervals. It involves first sampling $N$ sets of observations based on the distribution of errors and then calculating the confidence interval (with confidence level $1-\alpha$) for a given prediction function from each of the $N$ generated sets of observations. Of the $N$ intervals, $(1-\alpha)\times 100\%$ contain the true value. For linear intervals, the fractions of cases in which the true value is larger than the upper confidence limit or smaller than the lower confidence limit are equal, each being $\alpha/2$. For nonlinear intervals, the two fractions are not necessarily equal.

The probability intervals, inferred from Bayesian theory, represent the posterior probability that the prediction lies in the interval. In Bayesian statistics, a prediction is treated as a random variable with its own distribution. The posterior distribution summarizes the state of knowledge about the unknown prediction conditional on the prior and the current data. The narrower the distribution, the greater our knowledge about the prediction. This knowledge is measured by the probability interval, which is a probabilistic region around a posterior statistic such as the posterior mean.
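The repeated-sampling procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a straight-line model with a known error standard deviation, and the parameter values, data locations, prediction location, and the function name `coverage` are all hypothetical choices made for the example.

```python
import random
from statistics import NormalDist


def coverage(n_sets=2000, alpha=0.05, seed=0):
    """Fraction of linear confidence intervals that contain the true prediction."""
    rng = random.Random(seed)
    a_true, b_true, sigma = 2.0, 3.0, 1.0      # hypothetical true values
    xs = [float(i) for i in range(1, 21)]      # hypothetical data locations
    x0 = 30.0                                  # hypothetical prediction location
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    true_pred = a_true * x0 + b_true

    # Known sigma: half-width is z * sqrt(Z (X^T C^-1 X)^-1 Z^T), written out
    # here for a straight line with independent, equal-variance errors.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sigma * (1 / n + (x0 - xbar) ** 2 / sxx) ** 0.5

    below = above = 0
    for _ in range(n_sets):
        # One synthetic set of observations drawn from the error distribution.
        ys = [a_true * x + b_true + rng.gauss(0.0, sigma) for x in xs]
        ybar = sum(ys) / n
        a_hat = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
        pred = ybar + a_hat * (x0 - xbar)      # least-squares prediction at x0
        if true_pred < pred - half:
            below += 1                         # true value under the lower limit
        elif true_pred > pred + half:
            above += 1                         # true value over the upper limit
    return 1 - (below + above) / n_sets, below / n_sets, above / n_sets


covered, frac_below, frac_above = coverage()
print(covered, frac_below, frac_above)
```

With a 95% level, roughly 95% of the simulated intervals should contain the true prediction, with the misses split about evenly between the two sides because the model is linear. The probability intervals discussed above make a different kind of statement: a posterior probability conditional on the prior and the one observed data set.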
Probability intervals are calculated here using Markov chain Monte Carlo (MCMC) methods, which generate the entire posterior probability distribution from which the intervals are determined.
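To make the sampling idea concrete, the sketch below runs a random-walk Metropolis sampler, a basic MCMC variant used here only for illustration; it is not claimed to be the algorithm implemented in any particular MCMC code. It assumes a straight-line model with known error standard deviation and a flat prior, and the data, step sizes, chain length, and function name `metropolis_interval` are hypothetical.

```python
import math
import random


def metropolis_interval(xs, ys, x0, sigma=1.0, n_iter=40000, burn=5000,
                        steps=(0.02, 0.2), alpha=0.05, seed=1):
    """Percentile probability interval for the prediction a*x0 + b."""
    rng = random.Random(seed)
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    # Start the chain at the least-squares estimate to shorten burn-in.
    a = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b = ybar - a * xbar
    ls_pred = a * x0 + b

    def log_post(a_, b_):
        # Flat prior: the log posterior equals the log likelihood up to a constant.
        return -0.5 * sum((y - a_ * x - b_) ** 2 for x, y in zip(xs, ys)) / sigma ** 2

    lp = log_post(a, b)
    preds = []
    for i in range(n_iter):
        a_new = a + rng.gauss(0.0, steps[0])
        b_new = b + rng.gauss(0.0, steps[1])
        lp_new = log_post(a_new, b_new)
        # Metropolis rule: accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            a, b, lp = a_new, b_new, lp_new
        if i >= burn:
            preds.append(a * x0 + b)           # prediction for the current sample

    # Sort the sampled predictions and read off the percentile limits.
    preds.sort()
    m = len(preds)
    return preds[int(alpha / 2 * m)], preds[int((1 - alpha / 2) * m)], ls_pred


data_rng = random.Random(0)
xs = [float(i) for i in range(1, 21)]                       # hypothetical data
ys = [2.0 * x + 3.0 + data_rng.gauss(0.0, 1.0) for x in xs]
lower, upper, ls_pred = metropolis_interval(xs, ys, x0=30.0)
print(lower, ls_pred, upper)
```

The interval limits are simply the 2.5% and 97.5% percentiles of the sampled predictions, so their accuracy depends on chain length and mixing; a production analysis would also check convergence, which this sketch omits.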

RELATIONSHIP BETWEEN CONFIDENCE AND PROBABILITY INTERVALS

First consider a linear model, $y = X\beta + \varepsilon$, with $n$ observations in the vector $y$, $p$ unknown true parameters in the vector $\beta$, and true random errors in the vector $\varepsilon$. The random errors are assumed to be multivariate Gaussian, i.e., $\varepsilon \sim N_n(0, C_\varepsilon)$, where $C_\varepsilon = \sigma^2 \omega^{-1}$ and $\omega$ is the weight matrix used in the objective function of inverse modeling. The estimates of $\beta$ are multivariate Gaussian, i.e., $\hat\beta \sim N_p(\beta, \sigma^2 (X^T \omega X)^{-1})$, where $X$ is the sensitivity matrix. Consider a linear prediction function $g(\beta) = Z\beta$. Using regression theory, the $(1-\alpha)\times 100\%$ confidence interval on the prediction (assuming that the model correctly represents reality) is given for two circumstances: unknown and known $\sigma^2$.

When $\sigma^2$ is unknown and is estimated by the calculated variance $s^2 = (y - X\hat\beta)^T \omega (y - X\hat\beta)/(n-p)$, the distribution of $g(\hat\beta)$ is a t-distribution, and the confidence interval is (Hill and Tiedeman, 2007)

$$ g(\hat\beta) \pm t_{1-\alpha/2}(n-p)\,\left[ s^2 Z (X^T \omega X)^{-1} Z^T \right]^{1/2} \tag{1} $$

where $t_{1-\alpha/2}(n-p)$ is a t statistic with significance level $\alpha$ and degrees of freedom equal to $(n-p)$. When $\sigma^2$ is known, the distribution of $g(\hat\beta)$ is normal, and the confidence interval is (McClave and Sincich, 2000)

$$ g(\hat\beta) \pm z_{1-\alpha/2}\,\left[ Z (X^T C_\varepsilon^{-1} X)^{-1} Z^T \right]^{1/2} \tag{2} $$

where $z_{1-\alpha/2}$ is the z statistic with significance level $\alpha$.

In Bayesian statistics with noninformative priors for which $p(\beta) \propto \text{constant}$ and $p(\sigma^2) \propto 1/\sigma^2$, the posterior distribution of $g(\beta)$ is a multivariate t-distribution. Thus, the $(1-\alpha)\times 100\%$ probability intervals for $g(\beta)$ are the same as those of equation (1) derived from regression theory. In the same context, for an informative conjugate prior with $p(\beta) \sim N_p(\beta_p, C_p)$, and assuming $\sigma^2$ is known, the posterior distribution of $g(\beta)$ is multivariate normal. Thus, the $(1-\alpha)\times 100\%$ probability interval for $g(\beta)$ (assuming that the model correctly represents reality) is given as (McLaughlin and Townley, 1996)

$$ g(\beta'_p) \pm z_{1-\alpha/2}\,\left[ Z (X^T C_\varepsilon^{-1} X + C_p^{-1})^{-1} Z^T \right]^{1/2} \tag{3} $$

As $C_p \to \infty I$ (so that $C_p^{-1} \to 0$), equation (3) reduces to equation (2).
The only difference is that the prediction is evaluated for $\beta'_p$, the posterior mean, as determined from Bayesian theory, instead of $\hat\beta$, the least-squares estimate, as determined from regression theory. For a linear problem, the two sets of parameter values are the same, and equations (2) and (3) produce the same intervals.

For a nonlinear model $y = f(\beta) + \varepsilon$ with parameters $\beta$ and errors $\varepsilon \sim N_n(0, C_\varepsilon)$ with known $C_\varepsilon$, Bayes' theorem with a noninformative prior gives the posterior density of the parameters $\beta$ as (Berger, 1985)

$$ p(\beta \mid y) = \frac{\exp[\log p(y \mid \beta)]}{\int \exp[\log p(y \mid \beta)]\, d\beta} \tag{4} $$

Consider a Taylor series expansion of $\log p(y \mid \beta)$ about $\hat\beta$ to the second-order term, where $\hat\beta$ maximizes the log likelihood, $\log p(y \mid \beta)$. Then equation (4) is approximated by:

$$ p(\beta \mid y) \approx \frac{\exp\left[\log p(y \mid \hat\beta) - \tfrac{1}{2}(\beta - \hat\beta)^T I(\hat\beta)(\beta - \hat\beta)\right]}{\int \exp\left[\log p(y \mid \hat\beta) - \tfrac{1}{2}(\beta - \hat\beta)^T I(\hat\beta)(\beta - \hat\beta)\right] d\beta} = \frac{\exp\left[-\tfrac{1}{2}(\beta - \hat\beta)^T I(\hat\beta)(\beta - \hat\beta)\right]}{(2\pi)^{p/2}\, |I(\hat\beta)|^{-1/2}} \tag{5} $$

where

$$ I(\hat\beta) = -\left.\frac{\partial^2 \log p(y \mid \beta)}{\partial \beta\, \partial \beta^T}\right|_{\beta = \hat\beta} $$

is the Fisher information matrix. The first-order term of the expansion vanishes because $\hat\beta$ maximizes the log likelihood. When the model is linear (i.e., $f(\beta) = X\beta$), the posterior density $p(\beta \mid y) = N_p(\hat\beta, [I(\hat\beta)]^{-1})$ is exact, with $[I(\hat\beta)]^{-1} = (X^T C_\varepsilon^{-1} X)^{-1}$. In this case, the probability interval of $g(\beta)$ from the posterior distribution is mathematically equivalent to its confidence interval in regression, as shown in equation (2). However, if the model is highly nonlinear, as indicated by large total nonlinearity, ignoring the higher-order terms can cause significant error. In this case, confidence and probability intervals can be very different. The difference depends on the size of the higher-order terms, which is reflected in the skew of the distribution.

In addition to the linear confidence intervals above, nonlinear confidence intervals are also available from regression theory (Vecchia and Cooley, 1987; Cooley, 2004; Hill and Tiedeman, 2007); these should be able to account for the higher-order terms resulting from model linearization. Nonlinear intervals can be calculated using the likelihood method of Vecchia and Cooley (1987), which determines the minimum and maximum values of the prediction over a confidence region on the parameter set. The confidence region is defined in $p$-dimensional parameter space and has a specified probability of containing the true set of parameter values, as illustrated in Figure 1.

Figure 1: Geometry of a nonlinear confidence interval on prediction $g(b)$. The parameter confidence region (shaded area), contours of constant $g(b)$ (dashed lines), and locations of the minimum ($g(b) = c_1$, with $b = b_L$) and maximum ($g(b) = c_4$, with $b = b_U$) values of the prediction on the confidence region are shown. The lower and upper limits of the nonlinear confidence interval on prediction $g(b)$ are thus $c_1$ and $c_4$, respectively. (Adapted from Hill and Tiedeman, 2007, Figure 8.3.)
The method for computing nonlinear confidence intervals involves first defining the $(1-\alpha)\times 100$-percent parameter confidence region. This region is defined as the set of parameter values $b$ for which the objective-function values, $S(b)$, satisfy the following condition:

$$ S(b) \le S(b') + s^2\, t^2_{1-\alpha/2}(n-p) \tag{6} $$

where $b'$ is the set of optimized parameter values. Nonlinear intervals are also shown in the results below for simple test cases.
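The two steps just described (define the parameter confidence region, then find the extreme predictions over it) can be sketched with a brute-force grid scan. This is only an illustration under stated assumptions: practical implementations use constrained optimization rather than a grid, unit weights are assumed, the model has exactly two parameters, and the critical value `t_crit` is supplied by the caller (2.101 is the tabulated 97.5% t value for 18 degrees of freedom). The names `nonlinear_interval` and `line`, the data, and the grid bounds are hypothetical.

```python
import random


def nonlinear_interval(model, xs, ys, x0, a_range, b_range, t_crit, n_grid=301):
    """Min and max prediction over the region S(b) <= S_min + s^2 * t_crit^2."""
    n, p = len(xs), 2

    def axis(lo, hi):
        return [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]

    rows = []
    for a in axis(*a_range):
        for b in axis(*b_range):
            # Unit-weight sum of squared residuals, S(b), and the prediction g(b).
            s_val = sum((y - model(a, b, x)) ** 2 for x, y in zip(xs, ys))
            rows.append((s_val, model(a, b, x0)))
    s_min = min(s for s, _ in rows)            # grid approximation of S(b')
    s2 = s_min / (n - p)                       # calculated error variance
    limit = s_min + s2 * t_crit ** 2           # objective-function goal
    inside = [g for s, g in rows if s <= limit]
    return min(inside), max(inside)


def line(a, b, x):
    # Hypothetical stand-in model; swap in a nonlinear function of (a, b, x).
    return a * x + b


data_rng = random.Random(0)
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x + 3.0 + data_rng.gauss(0.0, 1.0) for x in xs]

# Least-squares estimates, used only to center the search grid.
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
a_hat = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b_hat = ybar - a_hat * xbar

lower, upper = nonlinear_interval(
    line, xs, ys, 30.0,
    (a_hat - 0.15, a_hat + 0.15), (b_hat - 1.8, b_hat + 1.8), t_crit=2.101)
print(lower, upper)
```

For a linear model such as `line`, the limits found this way should agree with the linear interval of the same confidence level up to the grid resolution, which gives a convenient check before applying the scan to a nonlinear model. The grid bounds must be wide enough to enclose the whole confidence region.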

SIMPLE TEST CASES

To compare the predictive uncertainty measures of confidence intervals in regression and probability intervals from Bayesian theory, we apply the three measures (linear confidence intervals, nonlinear confidence intervals, and probability intervals) to two simple test cases. In both test cases, we employ MCMC as implemented in the MICA code (Doherty, 2003) to calculate the probability intervals.

Linear Test Case

The linear model is y = ax + b + ε, with parameters a= and b=3 and true errors ε_i ~ N(,). We consider a conjugate prior of the two parameters with $C_p$ a multiple of the identity matrix $I$. Twenty data (x=,, ) are used to calibrate the model, and the calibrated model is used to predict the point at x=3.

Nonlinear Test Case

In the nonlinear test problem, the model is y = x/ asin( abx) + ε. All the other conditions are the same as those of the linear test problem.

[Figure 2 panels: cumulative distribution functions F(a), F(b), and F(y) plotted against parameter a, parameter b, and prediction y. Left column, Linear Model: (a), (b), (c). Right column, Nonlinear Model: (d), (e), (f).]

Figure 2: Cumulative distribution functions of parameters and prediction based on regression and Bayesian theory for parameters a and b and prediction y in the linear simple test case (a, b, and c) and the nonlinear simple test case (d, e, and f).

Figure 3: The nonlinear confidence interval limits (red dots), the minimum and maximum values of the prediction (red lines), the confidence region of the parameter set bounded by the objective function goal (black contour), and the probability interval limits (blue dots). The upper 2.5% and lower 2.5% of prediction values correspond to the MCMC parameter samples indicated by green dots, and the middle 95% of prediction values correspond to the samples indicated by yellow dots.
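For a linear test case of this kind, the relationship between equations (2) and (3) can be checked directly, without MCMC. The sketch below is a hedged illustration rather than a reproduction of the paper's runs: the true parameter values, error standard deviation, data locations, prediction location, prior settings, and the function names `inv2` and `linear_interval` are assumptions introduced for the example.

```python
import random
from statistics import NormalDist


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


def linear_interval(xs, ys, x0, sigma, prior_mean=None, prior_var=None, alpha=0.05):
    """Equation (2) when no prior is given; equation (3) with prior N(prior_mean, prior_var * I)."""
    n = len(xs)
    w = 1.0 / sigma ** 2                       # C_eps^-1 = I / sigma^2
    # X has rows [x_i, 1], so X^T C^-1 X and X^T C^-1 y are small sums.
    prec = [[w * sum(x * x for x in xs), w * sum(xs)],
            [w * sum(xs), w * n]]
    rhs = [w * sum(x * y for x, y in zip(xs, ys)), w * sum(ys)]
    if prior_var is not None:
        for i in range(2):
            prec[i][i] += 1.0 / prior_var      # add C_p^-1
            rhs[i] += prior_mean[i] / prior_var
    cov = inv2(prec)
    beta = [cov[0][0] * rhs[0] + cov[0][1] * rhs[1],
            cov[1][0] * rhs[0] + cov[1][1] * rhs[1]]
    z_row = [x0, 1.0]                          # prediction g(beta) = Z beta
    g = z_row[0] * beta[0] + z_row[1] * beta[1]
    var_g = sum(z_row[i] * cov[i][j] * z_row[j] for i in range(2) for j in range(2))
    half = NormalDist().inv_cdf(1 - alpha / 2) * var_g ** 0.5
    return g - half, g + half


data_rng = random.Random(0)
xs = [float(i) for i in range(1, 21)]                       # assumed data locations
ys = [2.0 * x + 3.0 + data_rng.gauss(0.0, 1.0) for x in xs] # assumed a=2, b=3, sigma=1

ci = linear_interval(xs, ys, 30.0, 1.0)
pi_diffuse = linear_interval(xs, ys, 30.0, 1.0, prior_mean=[0.0, 0.0], prior_var=1e8)
pi_tight = linear_interval(xs, ys, 30.0, 1.0, prior_mean=[2.0, 3.0], prior_var=0.01)
print(ci, pi_diffuse, pi_tight)
```

With a very diffuse prior the probability interval coincides with the confidence interval to within rounding, while an informative prior narrows it, which is why the choice of prior matters when the two are compared for a linear model.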
Figure 2 plots the cumulative distribution functions (CDFs) of the parameters and prediction for the linear and nonlinear test cases. The left panels of Figure 2 confirm that the distributions of the parameters and prediction from regression and Bayesian theory are identical in the linear model case, as the mathematical theory above indicates. Therefore, for the linear model, the confidence and probability intervals are equivalent. For the nonlinear model, however, the two intervals are distinct because of the nonzero higher-order derivatives of the likelihood function that are discarded in equation (5). In this case, the probability interval is smaller than the linear confidence interval, as shown in the right panels of Figure 2. It is also smaller than the nonlinear confidence interval, as illustrated in Figure 3. In Figure 3, the

black ellipse represents the 95% confidence region of the true parameters (the black star at the center of the ellipse). The red lines are model evaluations that intersect the ellipse, and the intersections give the minimum and maximum values of the prediction (specific to the confidence region). The yellow and green dots are parameter samples obtained from MCMC simulation. Model predictions for these samples are first sorted, and the threshold parameter values at the 2.5% and 97.5% percentiles of the predictions are identified. Their corresponding model evaluations are plotted as blue lines in Figure 3. Figure 3 shows the discrepancy between the nonlinear confidence interval, determined by the minimum and maximum values of the prediction over a confidence region on the parameter set, and the probability interval obtained from MCMC samples.

CONCLUSIONS

This work includes theoretical analysis and numerical experiments (using simple test cases) for comparing confidence intervals based on regression theory and probability intervals based on Bayesian theory. For linear models, the two types of intervals are mathematically and numerically equivalent only with noninformative prior information. For nonlinear models, however, the confidence intervals and probability intervals are distinct both mathematically and numerically. Their discrepancy depends on the total nonlinearity of the model. For groundwater models, which are always nonlinear, it is not appropriate to use confidence intervals and probability intervals to validate each other.

ACKNOWLEDGMENTS

The authors thank John Doherty for providing the MICA code for MCMC simulation. This work was supported in part by NSF-EAR grant 974 and DOE-SBR grant DE-SC687.

REFERENCES

Berger, J.O., 1985. Statistical decision theory and Bayesian analysis, 2nd edition, Springer.
Cooley, R.L., 2004. A theory for modelling groundwater flow in heterogeneous media, U.S. Geological Survey Professional Paper 979.
Doherty, J., 2003. MICA: model-independent Markov Chain Monte Carlo analysis, Watermark Numerical Computing, Brisbane, Australia.
Gallagher, M., Doherty, J., 2007. Parameter estimation and uncertainty analysis for a watershed model, Environmental Modelling and Software, 22, 1000-1020.
Hill, M.C., Tiedeman, C., 2007. Effective calibration of ground water models, with analysis of data, sensitivities, predictions, and uncertainty, John Wiley, New York.
McClave, J.T., Sincich, T., 2000. Statistics, 8th edition, Prentice Hall.
McLaughlin, D., Townley, L.R., 1996. A reassessment of the groundwater inverse problem, Water Resour. Res., 32(5), 1131-1161.
Vecchia, A.V., Cooley, R.L., 1987. Simultaneous confidence and prediction intervals for nonlinear regression models with application to a groundwater flow model, Water Resour. Res., 23(7), 1237-1250.
Vrugt, J.A., Bouten, W., 2002. Validity of first-order approximations to describe parameter uncertainty in soil hydraulic models, Soil Sci. Soc. Am. J., 66, 1740-1751.