A Kriging Approach to the Analysis of Climate Model Experiments

Dorin Drignei
Department of Mathematics and Statistics, Oakland University, Rochester, MI 48309, USA

Abstract. A climate model is a computer implementation of a mathematical model for the physical processes underlying the climate. An immediate use of a climate model is performing climate model experiments, where uncertain input quantities, such as greenhouse gas and aerosol concentrations, are systematically varied in order to understand their effects on the climate system. Climate models, however, are computationally intensive, so only small experiments can be conducted. This paper presents a multidimensional kriging method to predict climate model variables at new inputs, based on the experimental data available. The method is particularly suitable for situations where the climate model data sets share a common pattern across the input space, such as surface temperatures that are lower at the Poles, higher at the Equator and possibly increasing over time. The results demonstrate the potential of the kriging methodology presented in this paper as an exploratory tool in climate science.

Keywords: Computer experiments; Equilibrium climate sensitivity; Nonstationary models; Surface temperature data.

1 Introduction

The study of climate has become increasingly important due to its effects on the planet's environmental and ecological systems. Most often, the climate is defined in terms of weather variables, such as temperature, precipitation and wind, averaged over a time span (e.g. 30 years, as defined by the World Meteorological Organization) and over the whole Earth or a region. Numerical models that simulate the climate based on physical processes are important tools used by scientists in understanding the climate system. A typical climate model, termed an atmosphere-ocean general circulation model (AOGCM), is a complex computer code including atmosphere, ocean, land and ice components.
Such a model may be used for experimentation, where uncertain input quantities (or inputs) are varied systematically in order to study their effects on the climate system. For example, the Third and Fourth Assessment Reports (TAR and AR4, respectively) of the Intergovernmental Panel on Climate Change (IPCC) discuss climate projections in terms of temperature, sea level change and precipitation over time and space, resulting from climate model experiments under various hypothetical future emission scenarios driven by demographic, technological and economic factors. While this is perhaps one of the most widely publicized examples of climate model experiments in recent years, many other examples are discussed in the climate literature. For example, Sokolov and Stone (1998) proposed a simplified climate model and varied climate model parameters such as climate sensitivity and the rate of heat uptake by the deep ocean in order to study their effect on modeled temperature and sea level change over space and/or time.

Climate models are in general computationally intensive, each run (corresponding to an input) taking hours or even days on high-performance computers. Therefore, only a small number of climate model runs, corresponding to a small set of inputs, can be performed in climate model experiments. Such computational constraints have limited, for example, the size of the experiments considered in TAR and AR4. If the possible effect of a new, hypothetical greenhouse gas emission scenario on the time series of future global mean surface temperatures needs to be investigated, new climate model runs must be performed.

This paper develops a computationally efficient statistical method to explore the experimental (or input) space by predicting variables of the climate system at any new input, thereby avoiding new runs with the computationally intensive climate model. The method is illustrated with a set of model runs whose inputs are parameters associated with a simplified climate model. More precisely, a limited number of inputs are carefully selected in the input space and the output from the corresponding slow climate model runs is obtained. Then an adequate statistical model for the output data set is postulated and kriging-type methods are used to interpolate statistically across the input space, thereby predicting the climate model output at new inputs.

This paper extends in several directions a univariate statistical methodology called design and analysis of computer experiments, described, among others, by Sacks et al (1989), Currin et al (1991), Santner et al (2003) and Fang et al (2006). Many computer models output time series or space-time data sets for each input. However, significant research on the analysis of computer experiments with multidimensional output has advanced only recently (Fang et al 2006, Chapter 7). Bayarri et al (2007) and Higdon et al (2007) used basis representations for multidimensional output in a Bayesian framework and in the context of computer model calibration. Drignei (2006) proposed a computationally fast statistical model as a surrogate for a geophysical ocean model with high dimensional output.
Drignei and Morris (2006) pointed out possible computational difficulties with likelihood optimization for large, multidimensional output data sets and suggested a statistical model underpinned by the output data generating mechanism. This paper proposes computationally efficient statistical models for multidimensional output data sets. The statistical models are particularly suitable for situations where the output data sets share a common pattern across the input space, such as low/high model surface temperatures at the Poles/Equator and possibly increasing over time.

The paper is organized as follows. Section 2 presents an experiment with the MIT 2D climate model and discusses the inputs and the output data set. Section 3 outlines the development of the statistical models and Section 4 describes the kriging prediction and model validation. This methodology is then applied in Section 5 to analyze the experiment with the MIT 2D climate model. Some concluding remarks are presented in Section 6.

2 A climate model experiment

There is a suite of climate models in use today. The most realistic include three spatial dimensions (3D) but these are also very computationally intensive. For example, the AOGCM developed at the National Center for Atmospheric Research, called the Community Climate System Model (CCSM), requires weeks of computational time on a massively parallel supercomputer to simulate 50 model years. Because of the tremendous computational resources required, these models are only suitable for very small experiments. Simpler climate models that reproduce the large scale behavior of 3D AOGCMs may be more appropriate for experimentation. One such climate model is used in this paper: a two dimensional (latitude and vertical) atmospheric model coupled with a diffusive ocean model, developed at the Massachusetts Institute of Technology (MIT) Joint Program on the Science and Policy of Global Change (Sokolov and Stone, 1998).
This two dimensional climate model, called the MIT 2D climate model, reproduces many of the nonlinear interactions occurring in simulations with 3D AOGCMs while requiring far fewer computational resources. The MIT 2D climate model currently runs at about 4 hours of computational time per 50 model years on a 3 GHz Pentium 4 Linux workstation. Technical details about the MIT 2D model can be found in Sokolov and Stone (1998) who showed, for example, that there is wide disagreement among more complex coupled AOGCMs on the rate of heat uptake by the ocean (with corresponding uncertainty in surface warming), and that the MIT 2D model can match their transient behavior if appropriate values for the deep ocean diffusion coefficients are chosen. Stone and Yao (1987, 1990) and Yao and Stone (1987) also provide technical details. For the purpose of this research, the MIT 2D climate model is considered a black-box model, in which the inputs are uncertain parameters and the resulting output data sets are recorded over space and time. Therefore, no information about the physics underlying the MIT 2D model is included in the statistical methodology.

2.1 The input parameters

Climate models involve a number of parameters that are a priori unknown, but here we focus on three of them. An important and yet uncertain parameter in climatology is the equilibrium climate sensitivity, $S$, defined as the equilibrium global mean temperature response to a doubling of CO$_2$ from preindustrial levels. To predict the climate, one must also know how quickly the oceans will equilibrate to additional warming. The rate of warming is governed by how quickly the oceans can mix excess heat into the deeper layers. In 3D AOGCMs, multiple processes affect the net mixing of heat into the deep ocean. In simpler models, such as the MIT 2D climate model, these processes can be set by a single parameter, the rate of heat uptake by the deep ocean, $K_v$. The third uncertain parameter considered here is the strength of the anthropogenic aerosol forcing, $F_{aer}$. These three parameters are collectively denoted by $\theta = [S, K_v, F_{aer}]$, which defines the input vector.
A total of $D = 20$ inputs $\theta_i$ are sampled in the input space $[0.4, 10.5] \times [0.40, 12.65] \times [-1.55, 0]$ according to a maximin distance design (Johnson et al 1990) from a large list of inputs of interest from a climatological point of view. A maximin distance design ensures that the sampled inputs cover the input space and are spread out, in the sense that no two sampled inputs are too close. Additionally, $P = 5$ more inputs are chosen for prediction validation purposes. These inputs are given in Table 1. A similar input space has been considered by Forest et al (2002), who used observational records in combination with output data sets to calibrate the MIT 2D climate model.

Table 1. The design inputs (left four columns) and the validation inputs (fifth column). Each column lists triples $(S, K_v, F_{aer})$. [Table entries missing in the source.]

2.2 The output data sets

Among the output variables, probably the most popular is the Earth's surface temperature, which will be considered in this paper. For each input, the model surface temperature analyzed here is a matrix of size $24 \times 56$, corresponding to 24 latitudes and 56 years [time interval missing in the source].
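The maximin distance design of Section 2.1 can be illustrated with a short sketch. The code below is not the construction of Johnson et al (1990); it is a greedy farthest-point heuristic that pursues the same goal (no two selected inputs too close) on inputs rescaled to $[0, 1]^3$. The candidate list is hypothetical (uniform random draws), since the paper's list of climatologically interesting inputs is not reproduced here, and the names `greedy_maximin` and `min_pairwise_distance` are illustrative.

```python
import numpy as np

# Bounds of the input space for (S, K_v, F_aer), as given in Section 2.1.
LOWER = np.array([0.4, 0.40, -1.55])
UPPER = np.array([10.5, 12.65, 0.0])

def greedy_maximin(candidates, n_select, start=0):
    """Greedy farthest-point selection, a heuristic for a maximin distance design.

    candidates : (n, 3) array of admissible inputs.
    Returns the indices of the n_select selected rows.
    """
    scaled = (candidates - LOWER) / (UPPER - LOWER)   # rescale each input to [0, 1]
    chosen = [start]
    # Distance from every candidate to its nearest already-selected point.
    d_near = np.linalg.norm(scaled - scaled[start], axis=1)
    while len(chosen) < n_select:
        nxt = int(np.argmax(d_near))                  # candidate farthest from the current design
        chosen.append(nxt)
        d_near = np.minimum(d_near, np.linalg.norm(scaled - scaled[nxt], axis=1))
    return np.array(chosen)

def min_pairwise_distance(points):
    """Smallest distance between two distinct rows, in rescaled coordinates."""
    scaled = (points - LOWER) / (UPPER - LOWER)
    diff = scaled[:, None, :] - scaled[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return dist[np.triu_indices(len(points), k=1)].min()

# Hypothetical candidate list: 500 uniform draws from the input space.
rng = np.random.default_rng(0)
candidates = LOWER + (UPPER - LOWER) * rng.random((500, 3))
design = candidates[greedy_maximin(candidates, n_select=20)]
print(design.shape, min_pairwise_distance(design))
```

The greedy rule does not guarantee the optimal maximin design, but it is cheap and usually spreads the points well; an exchange algorithm could refine the result if needed.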

Also available are four replicates for each of these surface temperature data sets, called ensemble members, obtained by changing the initial conditions in the climate model. This is a standard method for generating realizations of the same climate system. This paper follows the common approach in climatology of analyzing the means over the ensemble members, called ensemble means. The ensemble variability will also be accounted for in the statistical model. Unless otherwise specified in the rest of the paper, the climate model surface temperatures are the ensemble means.

Figure 1 shows key features of the model surface temperature data set for one of the $D = 20$ sampled inputs, $\theta = (10.5, 0.40, -0.26)$. The left panel shows the surface temperature across the 24 latitudes for the first of the 56 years, with lower temperatures at the Poles and higher temperatures at the Equator. While the MIT 2D model and the data sets considered here do not have a longitude component, it is instructive to draw the left plot with a fictitious longitude dimension in order to better see this temperature pattern across the Earth. The right panel shows the time series of surface temperature at the Equator, which may have an increasing trend. The patterns noted in the two plots (quadratic in latitude and linear in time) are representative of the temperature data sets at all other inputs, and this characteristic will be exploited in the next section in the development of the statistical models.

The goal in this paper is to predict the model surface temperature at any new input in the input space, thereby avoiding new runs with the computationally intensive climate model. This will be accomplished by developing a statistical model for the output data set at the $D = 20$ sampled inputs and then using kriging methodology to predict output data at new inputs.

Figure 1. Model surface temperatures at sampled input $\theta = (10.5, 0.40, -0.26)$. Left: Temperatures across latitude at year 1 (temperature is constant across longitude). Right: Time series of temperatures at the Equator.

3 Statistical models

For many computer models there is a common output pattern across the input space. For example, Bayarri et al (2007) discuss an engineering application in which the output time series of model load data at each input appear to be strikingly similar. In another engineering application, Fang et al (2006), Chapter 7, show plots of log(engine noise) curves for each sampled input, which again appear very similar in shape. In the current climate application, for each sampled input, the surface temperatures are higher at the Equator, lower at the Poles and an increasing temporal trend appears to be present. Therefore, a statistical model with input-free mean seems reasonable and is intended to capture the general features common to all inputs, whereas the input-to-input variation is modeled by the covariance matrix, which will be assumed separable in the input, space and temporal dimensions.

As will be pointed out later in this section, the specification of a general statistical model with unstructured, input-dependent mean and/or non-separable covariance has computational drawbacks for larger data sets.

3.1 The mean

The climate model surface temperature data set can be organized as an array of dimension $N_L \times N_T \times D$ with $N_L = 24$, $N_T = 56$, $D = 20$. To better describe the statistical model, this three dimensional data set is stacked as a vector of length $N_L N_T D$ and denoted by $Y$. A multivariate normal distribution is assumed for $Y$, with general mean vector $\mu(Y) = 1_D \otimes \nu$ and covariance matrix $\Gamma$. Under the assumption of separability, the covariance matrix can be written as a Kronecker product of smaller covariance matrices, $\Gamma = \Omega_\Theta \otimes C_T \otimes C_L$, reflecting the various dimensions of the data set (inputs, time and latitudes), so that $\Gamma$ has elements $\Omega_\Theta(i_1, i_2)\, C_T(j_1, j_2)\, C_L(k_1, k_2)$. Therefore, here separability refers to the multiplicative decomposition of the covariance into purely spatial, temporal and input components. The maximum likelihood estimator of $\nu$ has the relatively simple analytical formula

$$\hat{\nu} = \frac{\sum_{i,j=1}^{D} (\Omega_\Theta^{-1})_{i,j}\, Y^r_{\cdot,i}}{\sum_{i,j=1}^{D} (\Omega_\Theta^{-1})_{i,j}},$$

a weighted average of output data over the sampled inputs, where $Y^r$ is the output data vector $Y$ reorganized as an $N_L N_T \times D$ matrix with columns $Y^r_{\cdot,i}$.

The above statistical model with general mean, however, cannot answer all the climatologically important statistical questions. For example, is there a statistically significant increasing temporal trend in the modeled surface temperature data set? For applications where regression variables are available, an alternative model $\nu = X\beta$ may be considered, leading to $\mu(Y) = 1_D \otimes X\beta$. The maximum likelihood estimator of $\beta$ is

$$\hat{\beta} = [X^\top (C_T^{-1} \otimes C_L^{-1}) X]^{-1} X^\top (C_T^{-1} \otimes C_L^{-1})\, \frac{\sum_{i,j=1}^{D} (\Omega_\Theta^{-1})_{i,j}\, Y^r_{\cdot,i}}{\sum_{i,j=1}^{D} (\Omega_\Theta^{-1})_{i,j}}.$$

The estimators $\hat{\beta}$, $\hat{\nu}$ and their variances are derived in Appendix A.
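The two estimators above can be made concrete with a small numerical sketch. It uses toy dimensions and made-up covariance matrices (the values of `omega`, the correlation parameters and `beta_true` are assumptions for illustration only), and the polynomial regression matrix is the one with columns $1, L, L^2, T, LT, L^2T$ used later in this section.

```python
import numpy as np

def mean_weights(omega):
    """Weights w_j = sum_i (Omega^-1)_{ij} / sum_{i,j} (Omega^-1)_{ij}."""
    # Omega^-1 @ 1 gives the row sums of Omega^-1 (equal to column sums by symmetry).
    s = np.linalg.solve(omega, np.ones(omega.shape[0]))
    return s / s.sum()

def nu_hat(Yr, omega):
    """General-mean estimate: weighted average of the D columns of Yr (N_L*N_T x D)."""
    return Yr @ mean_weights(omega)

def beta_hat(Yr, omega, X, C_T, C_L):
    """Generalized least squares estimate of beta for the input-free regression nu = X beta."""
    W = np.kron(np.linalg.inv(C_T), np.linalg.inv(C_L))   # C_T^-1 (x) C_L^-1
    XtW = X.T @ W
    return np.linalg.solve(XtW @ X, XtW @ nu_hat(Yr, omega))

def exp_corr(n, eta):
    """Exponential correlation on n equally spaced coordinates rescaled to [0, 1]."""
    coords = np.linspace(0.0, 1.0, n)
    return np.exp(-eta * np.abs(coords[:, None] - coords[None, :]))

# Toy sizes (the paper uses N_L = 24, N_T = 56, D = 20).
N_L, N_T, D = 4, 5, 3
lat = np.kron(np.ones(N_T), np.arange(1, N_L + 1))    # L = 1_{N_T} (x) (1:N_L)
year = np.kron(np.arange(1, N_T + 1), np.ones(N_L))   # T = (1:N_T) (x) 1_{N_L}
X = np.column_stack([np.ones(N_L * N_T), lat, lat**2, year, lat * year, lat**2 * year])

# Made-up, positive definite input covariance and correlation matrices.
omega = np.array([[1.1, 0.3, 0.2], [0.3, 1.1, 0.4], [0.2, 0.4, 1.1]])
C_T, C_L = exp_corr(N_T, 2.0), exp_corr(N_L, 3.0)

# Synthetic data: a common polynomial pattern plus small noise at each input.
beta_true = np.array([1.0, 0.5, -0.1, 0.02, 0.01, -0.003])
rng = np.random.default_rng(1)
Yr = (X @ beta_true)[:, None] + 0.01 * rng.standard_normal((N_L * N_T, D))
print(beta_hat(Yr, omega, X, C_T, C_L))
```

Note that `nu_hat` needs only $\Omega_\Theta^{-1}$, while `beta_hat` also needs $C_T^{-1} \otimes C_L^{-1}$, which is the source of the computational savings of the general mean model mentioned in Section 3.3.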
The general mean model may be more flexible than the regression model, but it could have a larger number of mean parameters. The two models will be compared by testing their prediction capabilities on new climate model output data. In this application a polynomial regression with second order latitude terms, a linear temporal term and their interactions will be considered, so that $X = [1, L, L^2, T, LT, L^2T]$, with $L = 1_{N_T} \otimes (1:N_L)$ and $T = (1:N_T) \otimes 1_{N_L}$, where $(1:N)$ denotes the column vector $(1, \ldots, N)^\top$ and powers and products of $L$ and $T$ are taken elementwise. In order to investigate if an input-dependent mean leads to an improved fit, an extended regression has been considered, with the regression matrix having the following structured form

$$X_e = \left[\; 1_D \otimes X \quad U \otimes 1_{N_T N_L} \;\right],$$

where the matrix $X$ is given in the polynomial regression above and the $D \times 3$ matrix $U$ is given by $U_{i,\cdot} = [\theta_{i,1}, \theta_{i,2}, \theta_{i,3}]$, $i = 1, \ldots, D$. Such a partitioned structure of the extended regression matrix leads to computationally efficient formulas for the generalized least squares estimates of the regression coefficients and their covariance matrix.

3.2 The covariance matrix

While the data analyzed in this paper are the ensemble means, their model actually originates in a model for the ensemble members. This relationship is explained in Appendix B, with emphasis on the covariance structure. The covariance matrix for the ensemble means is $\Gamma = \Omega_\Theta \otimes C_T \otimes C_L$. The covariance of inputs is given by $\Omega_\Theta = \sigma^2 C_\Theta + \tau^2 I$, reflecting the decomposition of the model surface temperatures into a climate signal and ensemble noise. An unbiased estimate of the parameter $\tau^2$ (see Appendix B) is $\hat{\tau}^2 =$ [value missing in the source]. In an effort to reduce the computational burden for likelihood optimization, $\tau^2$ will be fixed at its estimated value. The matrix $C_\Theta$ describes a smooth input correlation in the climate signal and is given by

$$C_\Theta(i, j) = \exp\!\left(-\eta_1 (l_S(\theta_i) - l_S(\theta_j))^2 - \eta_2 (l_K(\theta_i) - l_K(\theta_j))^2 - \eta_3 (l_F(\theta_i) - l_F(\theta_j))^2\right), \quad i, j = 1, \ldots, D,$$

where $l_S$, $l_K$, $l_F$ are the coordinates of the inputs $\theta$, rescaled to $[0, 1]$. This correlation is commonly used in the computer experiments literature to describe smooth dependence across the input space (e.g. Sacks et al 1989, Santner et al 2003, Fang et al 2006). The temporal correlation matrix $C_T$ considered here has elements

$$C_T(i, j) = \exp(-\eta_T\, |l_T(i) - l_T(j)|), \quad i, j = 1, \ldots, N_T,$$

and the latitude correlation matrix $C_L$ has elements

$$C_L(i, j) = \exp(-\eta_L\, |l_L(i) - l_L(j)|), \quad i, j = 1, \ldots, N_L,$$

where $l_T$ and $l_L$ are the time and latitude coordinates, rescaled to $[0, 1]$ for better numerical stability. The more general power exponential correlation (e.g. Sacks et al 1989) for all dimensions may be considered, but the likelihood optimization is more computationally intensive since it would include five additional unknown statistical parameters.

The correlations, however, need not be stationary. There are examples in the climate literature (e.g. Rauthe et al 2004, among others) where a nonparametric and non-stationary covariance for the spatial dimension is estimated from control climate model runs. In order to study the sensitivity of the proposed statistical model to the stationarity assumption on the covariance, here a parametric non-stationary correlation (e.g. Hughes-Oliver et al 1998, Schabenberger and Gotway 2005, p. 422) for latitudes is fitted from the data described in Section 2,

$$C_L(i, j) = \exp\!\left(-\eta_{s1}\, |l_L(i) - l_L(j)|\, \exp\!\left(\eta_{s2}\, |c_i - c_j| + \eta_{s3} \min(c_i, c_j)\right)\right), \quad i, j = 1, \ldots, N_L,$$

including a point source at location $c$, where $c_i = |l_L(i) - c|$. Here the Equator will be considered a temperature point source (due to various circulation and transfer mechanisms) and therefore $c = 0.5$ is chosen.

3.3 Likelihood optimization

The estimated values of the parameters appearing in the covariance matrix $\Gamma$ for the general mean model are obtained by minimizing the function $-2 \log(\text{Likelihood}) / (D N_T N_L)$, given by (ignoring some constants)

$$\frac{\log\det(\Omega_\Theta)}{D} + \frac{\log\det(C_T)}{N_T} + \frac{\log\det(C_L)}{N_L} + \frac{(Y - 1_D \otimes \hat{\nu})^\top \Gamma^{-1} (Y - 1_D \otimes \hat{\nu})}{D N_T N_L}.$$

For the regression model, the covariance parameter estimates are obtained similarly, with $X\hat{\beta}$ instead of $\hat{\nu}$, whereas for the extended regression model $1_D \otimes \hat{\nu}$ is replaced with $X_e \hat{\beta}$. The nonlinear likelihood optimization is done iteratively after some starting values are chosen. At each iteration an updated value of the maximum likelihood estimator $\hat{\nu}$ (or $\hat{\beta}$) is used. The statistical parameters will then be fixed at their final values throughout the rest of the statistical analysis. Besides occurring naturally in many applications, the two models for the mean described above also have computational advantages, due mainly to the simplicity of the analytical formulas for $\hat{\nu}$ and $\hat{\beta}$. When comparing these two models, note that $\hat{\nu}$ does not include $C_T^{-1} \otimes C_L^{-1}$, whereas $\hat{\beta}$ does, which leads to further computational savings when using the general mean model. While the maximum likelihood approach is perhaps the most popular method for parameter estimation in computer experiments, other estimation methods could be used, such as penalized likelihood (Li and Sudjianto, 2005), cross-validation, REML or posterior mode (e.g. Santner et al 2003, section 3.3.2).

There are computational difficulties with general statistical models having unstructured input-dependent regression variables (Drignei and Morris 2006), such as coarser numerical solutions of the same climate model or output from simpler but faster climate models. For example, for the data set considered here, the likelihood of a statistical model with six unstructured, input-dependent regression variables would be about 25 times more computationally intensive than the likelihood of the regression model with six input-free variables. This computational time increases even further if, in addition, the covariance is unstructured and non-separable (e.g. Genton, 2007). The differences among the computational times could become substantial when the objective functions are optimized nonlinearly and a large number of function evaluations are required. These differences would be magnified even further for larger data sets including all three spatial dimensions, a possibly longer time interval and/or a larger set of statistical parameters.

4 Prediction at new inputs and validation

The statistical model described above is used to predict the climate model output at an arbitrary set $\Pi$ of $P$ new inputs in the input space. Let $Y_\Pi$ be the corresponding climate model output reorganized in a vector of length $N_L N_T P$.
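Returning briefly to the objective function of Section 3.3, the following sketch shows one way the separable structure can be exploited: the determinant splits into three small determinants and the quadratic form is computed by applying $\Omega_\Theta^{-1}$, $C_T^{-1}$ and $C_L^{-1}$ along the input, time and latitude axes, so that $\Gamma$ is never formed. The data and parameter values are synthetic assumptions; a dense computation with the full $\Gamma$ is included only as a check at toy sizes.

```python
import numpy as np

def gauss_corr(a, b, eta):
    """exp(-sum_k eta_k (a_ik - b_jk)^2) for inputs rescaled to [0, 1]."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-(diff ** 2 * eta).sum(axis=2))

def exp_corr(n, eta):
    """Exponential correlation on n equally spaced coordinates rescaled to [0, 1]."""
    coords = np.linspace(0.0, 1.0, n)
    return np.exp(-eta * np.abs(coords[:, None] - coords[None, :]))

def logdet(M):
    return np.linalg.slogdet(M)[1]

def objective(Yarr, omega, C_T, C_L):
    """-2 log(Likelihood) / (D N_T N_L), up to constants, for the general mean model.

    Yarr has shape (D, N_T, N_L), i.e. Y stacked with inputs varying slowest.
    """
    D, N_T, N_L = Yarr.shape
    w = np.linalg.solve(omega, np.ones(D))
    nu = np.einsum('d,dtl->tl', w / w.sum(), Yarr)   # nu-hat, the weighted average over inputs
    R = Yarr - nu                                    # residual array
    # Apply Omega^-1, C_T^-1 and C_L^-1 along the input, time and latitude axes.
    G = np.einsum('ad,bt,cl,dtl->abc', np.linalg.inv(omega),
                  np.linalg.inv(C_T), np.linalg.inv(C_L), R, optimize=True)
    quad = np.sum(R * G)
    return (logdet(omega) / D + logdet(C_T) / N_T + logdet(C_L) / N_L
            + quad / (D * N_T * N_L))

def objective_dense(Yarr, omega, C_T, C_L):
    """Same quantity computed with the full Gamma (feasible only for toy sizes)."""
    D, N_T, N_L = Yarr.shape
    w = np.linalg.solve(omega, np.ones(D))
    r = (Yarr - np.einsum('d,dtl->tl', w / w.sum(), Yarr)).ravel()
    gamma = np.kron(omega, np.kron(C_T, C_L))
    n = D * N_T * N_L
    return logdet(gamma) / n + r @ np.linalg.solve(gamma, r) / n

# Synthetic toy problem with assumed parameter values.
rng = np.random.default_rng(2)
D, N_T, N_L = 5, 6, 4
theta01 = rng.random((D, 3))
omega = 1.5 * gauss_corr(theta01, theta01, np.array([2.0, 1.0, 3.0])) + 0.1 * np.eye(D)
C_T, C_L = exp_corr(N_T, 2.0), exp_corr(N_L, 4.0)
Yarr = rng.standard_normal((D, N_T, N_L))
print(objective(Yarr, omega, C_T, C_L), objective_dense(Yarr, omega, C_T, C_L))
```

In an actual fit, `objective` would be wrapped as a function of the covariance parameters and passed to a nonlinear optimizer, with $\hat{\nu}$ recomputed at every evaluation as done here.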
Two versions of kriging will be considered: simple and universal. These will lead to the same computed point prediction, but the prediction covariance matrices will be different, with the universal kriging having an extra term depending on regression variables. The prediction distribution is multivariate normal with mean vector

$$\tilde{Y}_\Pi = 1_P \otimes \hat{\nu} + \Gamma_{\Pi\Theta} \Gamma^{-1} (Y - 1_D \otimes \hat{\nu}) = 1_P \otimes \hat{\nu} + \left\{ \left[ (\sigma^2 C_{\Pi\Theta})(\sigma^2 C_\Theta + \tau^2 I_D)^{-1} \right] \otimes I_{N_L N_T} \right\} (Y - 1_D \otimes \hat{\nu})$$

(replace $\hat{\nu}$ with $X\hat{\beta}$ for the input-free regression model). The simple kriging prediction covariance matrix is

$$V_\Pi^s = \Gamma_\Pi - \Gamma_{\Pi\Theta} \Gamma^{-1} \Gamma_{\Pi\Theta}^\top = \left[ (\sigma^2 C_\Pi + \tau^2 I_P) - (\sigma^2 C_{\Pi\Theta})(\sigma^2 C_\Theta + \tau^2 I_D)^{-1} (\sigma^2 C_{\Pi\Theta})^\top \right] \otimes C_T \otimes C_L,$$

where $\Gamma_\Pi = (\sigma^2 C_\Pi + \tau^2 I_P) \otimes C_T \otimes C_L$ and $\Gamma_{\Pi\Theta} = \sigma^2 (C_{\Pi\Theta} \otimes C_T \otimes C_L)$. Here

$$C_\Pi(i, j) = \exp\!\left(-\eta_1 (l_S(\pi_i) - l_S(\pi_j))^2 - \eta_2 (l_K(\pi_i) - l_K(\pi_j))^2 - \eta_3 (l_F(\pi_i) - l_F(\pi_j))^2\right), \quad i, j = 1, \ldots, P,$$

where $l_S$, $l_K$, $l_F$ are the coordinates of the new inputs $\pi$ rescaled to $[0, 1]$, and

$$C_{\Pi\Theta}(j, i) = \exp\!\left(-\eta_1 (l_S(\theta_i) - l_S(\pi_j))^2 - \eta_2 (l_K(\theta_i) - l_K(\pi_j))^2 - \eta_3 (l_F(\theta_i) - l_F(\pi_j))^2\right)$$

for $i = 1, \ldots, D$, $j = 1, \ldots, P$, so that $C_{\Pi\Theta}$ is a $P \times D$ matrix, as the products above require. The formula (5.34) in Schabenberger and Gotway (2005) for the universal kriging prediction covariance matrix becomes

$$V_\Pi^u = V_\Pi^s + \left[ 1_P \otimes X - \Gamma_{\Pi\Theta} \Gamma^{-1} (1_D \otimes X) \right] \left[ (1_D \otimes X)^\top \Gamma^{-1} (1_D \otimes X) \right]^{-1} \left[ 1_P \otimes X - \Gamma_{\Pi\Theta} \Gamma^{-1} (1_D \otimes X) \right]^\top$$

$$= \left[ (\sigma^2 C_\Pi + \tau^2 I_P) - (\sigma^2 C_{\Pi\Theta})(\sigma^2 C_\Theta + \tau^2 I_D)^{-1} (\sigma^2 C_{\Pi\Theta})^\top \right] \otimes C_T \otimes C_L + E \otimes \left\{ X [X^\top (C_T^{-1} \otimes C_L^{-1}) X]^{-1} X^\top \right\}$$

for the input-free regression model, where

$$E = \frac{1}{\sum_{i,j=1}^{D} (\Omega_\Theta^{-1})_{i,j}} \left\{ 1_P - \sum_{j=1}^{D} \left[ (\sigma^2 C_{\Pi\Theta})(\sigma^2 C_\Theta + \tau^2 I_D)^{-1} \right]_{\cdot,j} \right\} \left\{ 1_P - \sum_{j=1}^{D} \left[ (\sigma^2 C_{\Pi\Theta})(\sigma^2 C_\Theta + \tau^2 I_D)^{-1} \right]_{\cdot,j} \right\}^\top.$$

Setting $X = I_{N_T N_L}$ in the previous formula, one obtains

$$V_\Pi^u = \left[ (\sigma^2 C_\Pi + \tau^2 I_P) - (\sigma^2 C_{\Pi\Theta})(\sigma^2 C_\Theta + \tau^2 I_D)^{-1} (\sigma^2 C_{\Pi\Theta})^\top + E \right] \otimes C_T \otimes C_L,$$

the universal kriging covariance matrix for the input-free general mean model. To avoid possible confusion, one should note that while the data sets considered here have a spatial component (the latitudes), the kriging prediction is in fact done over the input space.

The measures of validation are based on the output data and their predictions at a set $\Pi$ of $P$ new inputs. These measures are the root mean square error

$$\mathrm{RMSE} = \sqrt{ \frac{1}{N_L N_T P} \sum_{i,t,p=1}^{N_L, N_T, P} (Y_\Pi - \tilde{Y}_\Pi)^2_{i,t,p} },$$

the maximum absolute value of the prediction error

$$\mathrm{MaxErr} = \max_{i,t,p} \left| (Y_\Pi - \tilde{Y}_\Pi)_{i,t,p} \right|,$$

and the actual coverage of prediction intervals with a nominal coverage (e.g. 95%)

$$\mathrm{COVER} = \frac{1}{N_L N_T P} \sum_{i,t,p=1}^{N_L, N_T, P} \delta\!\left[ (Y_\Pi)_{i,t,p} \in (\mathrm{INT})_{i,t,p} \right],$$

where $(\mathrm{INT})_{i,t,p}$ is the prediction interval of $(Y_\Pi)_{i,t,p}$ at each point $(i, t, p)$, and $\delta$ is the indicator function. Here we distinguish between prediction intervals based on simple or universal kriging. In addition, residual analysis based on the standardized prediction residuals $V_\Pi^{-1/2} (Y_\Pi - \tilde{Y}_\Pi)$ can be performed separately for simple and universal kriging to check the normality of the prediction distribution. Ideally, output data from at least a few model runs should be used for validation purposes. However, if this is not possible, cross-validation methods may be used instead.

5 Analysis of MIT 2D climate model experiment

This section presents results for the MIT 2D climate model experiment discussed in Section 2, based on the methodology presented in Sections 3 and 4.
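Before turning to the results, the point predictor and the validation measures of Section 4 can be sketched in a few lines. This is a minimal illustration on synthetic data under assumed parameter values (`eta`, `sigma2`, `tau2` and the toy output function are all hypothetical), using the general mean model and simple kriging intervals. Because $C_T$ and $C_L$ are correlation matrices with unit diagonals, the simple kriging variance at a new input $p$ is the same for every time and latitude, namely the $p$-th diagonal element of the bracketed $P \times P$ matrix in $V_\Pi^s$.

```python
import numpy as np

def gauss_corr(a, b, eta):
    """exp(-sum_k eta_k (a_ik - b_jk)^2) for inputs rescaled to [0, 1]."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-(diff ** 2 * eta).sum(axis=2))

def krige(Yarr, theta01, pi01, eta, sigma2, tau2):
    """Point predictions and simple-kriging standard deviations at new inputs.

    Yarr    : (D, N) outputs, one row per design input (N = N_T * N_L).
    theta01 : (D, 3) design inputs; pi01 : (P, 3) new inputs, both rescaled to [0, 1].
    """
    D = len(theta01)
    omega = sigma2 * gauss_corr(theta01, theta01, eta) + tau2 * np.eye(D)
    w = np.linalg.solve(omega, np.ones(D))
    nu = (w / w.sum()) @ Yarr                           # nu-hat
    cross = sigma2 * gauss_corr(pi01, theta01, eta)     # sigma^2 C_{Pi Theta}, shape P x D
    A = np.linalg.solve(omega, cross.T).T               # cross @ omega^-1 (omega is symmetric)
    pred = nu + A @ (Yarr - nu)
    S = sigma2 * gauss_corr(pi01, pi01, eta) + tau2 * np.eye(len(pi01)) - A @ cross.T
    sd = np.sqrt(np.clip(np.diag(S), 0.0, None))        # one standard deviation per new input
    return pred, sd

def validate(y_true, y_pred, sd, z=1.96):
    """RMSE, MaxErr and actual coverage of nominal 95% intervals (z = 1.96)."""
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    maxerr = np.max(np.abs(err))
    cover = np.mean(np.abs(err) <= z * sd[:, None])
    return rmse, maxerr, cover

# Synthetic stand-in for the climate model: a common pattern scaled smoothly by the input.
rng = np.random.default_rng(3)
pattern = np.linspace(-1.0, 1.0, 12) ** 2               # N = 12 output values per input

def toy_model(inputs01):
    return (1.0 + 0.5 * np.sin(inputs01.sum(axis=1)))[:, None] * pattern

theta01, pi01 = rng.random((8, 3)), rng.random((5, 3))  # D = 8 design inputs, P = 5 new inputs
eta = np.array([5.0, 5.0, 5.0])
Yarr = toy_model(theta01)
pred, sd = krige(Yarr, theta01, pi01, eta, sigma2=1.0, tau2=1e-4)
print(validate(toy_model(pi01), pred, sd))
```

With `tau2 = 0` the predictor interpolates the design outputs exactly, which is a convenient sanity check; universal kriging intervals would add the $E$ term to `S` before taking the diagonal.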
The estimates of the parameters $\beta$ and their standard errors for the input-free polynomial regression are shown in Table 2, whereas those for the extended, input-dependent regression are shown in Table 3, with the index $s$ referring to the model including the latitude stationary covariance and the index $n$ referring to the model including the latitude non-stationary covariance. There is perceptible evidence of an increasing time trend in the surface temperatures and of positive association between the climate sensitivity parameter and the surface temperatures. However, there is not much difference between these estimated values when comparing the stationary versus non-stationary latitude covariances. A thorough visual inspection of Figure 1 (right panel) reveals that the linear trend is not perfect, with approximately the first and last thirds of the time series increasing whereas the middle third is roughly constant, which may be better fitted by the general mean model. Table 4 shows the estimates of the covariance parameters for the input-free general mean model (GenMean), the input-free polynomial regression model (Polyreg) and the extended, input-dependent regression model (ExtdReg).

Table 2. Coefficient estimates and their standard errors for the input-free polynomial regression. Columns (variables): $1$, $L$, $L^2$, $T$, $L \cdot T$, $L^2 \cdot T$. Rows: $\hat{\beta}_s$, $SE(\hat{\beta}_s)$, $\hat{\beta}_n$, $SE(\hat{\beta}_n)$. [Table entries missing in the source.]

Table 3. Coefficient estimates and their standard errors for the extended regression. Columns (variables): $1$, $L$, $L^2$, $T$, $L \cdot T$, $L^2 \cdot T$, $S$, $K_v$, $F_{aer}$. Rows: $\hat{\beta}_s$, $SE(\hat{\beta}_s)$, $\hat{\beta}_n$, $SE(\hat{\beta}_n)$. [Table entries missing in the source.]

Table 4. Variance parameter estimates. Columns (parameters): $\eta_1$, $\eta_2$, $\eta_3$, $\eta_T$, $\eta_L$, $\eta_{s1}$, $\eta_{s2}$, $\eta_{s3}$, $\sigma^2$. Rows: GenMean$_s$, GenMean$_n$, Polyreg$_s$, Polyreg$_n$, ExtdReg$_s$, ExtdReg$_n$. [Table entries missing in the source.]

Table 5. Results. Columns: RMSE (°C), MaxErr (°C), COVER$_{sp}$, COVER$_{uv}$. Rows: GenMean$_s$, GenMean$_n$, Polyreg$_s$, Polyreg$_n$, ExtdReg$_s$, ExtdReg$_n$. [Table entries missing in the source.]

The set $\Pi$ of $P = 5$ new inputs shown in Table 1 is used for validation. The validation measures are given in Table 5 and they show that the general mean model is at least twice as accurate as any of the regression models, with respect to RMSE and MaxErr. The choice of stationary or non-stationary latitude correlation again does not seem to have an important effect. The actual coverage is close to the nominal 95% for all statistical models, except perhaps for the general mean model with non-stationary covariance, which appears to have a lower actual coverage rate. Here COVER$_{sp}$ refers to simple kriging and COVER$_{uv}$ refers to universal kriging, and one can notice only very little difference between their values. The results for the two regression models do not differ much, although the input-free regression model seems to have a larger RMSE but a lower MaxErr than the extended regression model.

Figure 2 shows results for the input-free general mean model and for the input-free polynomial regression, with stationary covariances only. The plots from the models with latitude non-stationary covariance as well as the input-dependent extended regression model (with both stationary and non-stationary covariances) are similar to those presented in Figure 2, and are therefore omitted. The left panels in Figure 2 show the true new output data plotted against the predicted new output data, and one can see that the general input-free mean model (upper left) is more accurate than the input-free polynomial regression model (lower left). In such plots, the closer the scatterplot is to the main diagonal, the more accurate the point prediction is. The right panels of Figure 2 show normal probability plots of standardized prediction residuals based on simple kriging covariance formulas for the two models (plots based on universal kriging are similar). These normal probability plots reveal that the normality assumption is not perfectly satisfied by either model, although it seems much more appropriate for the general mean model.

Figure 2. True versus predicted new output data (left panels) and normal probability plots of standardized prediction residuals (right panels). The upper row corresponds to the general input-free mean model and the lower row corresponds to the input-free polynomial regression.

6 Concluding remarks

This paper has presented a kriging approach to the analysis of climate model experiments. The basic strategy was to perform a relatively small number of runs with the computationally intensive climate model and then obtain the output data. An appropriate statistical model was considered for these data, which was then used to predict climate model output at new inputs. The statistical model has a mean vector suitable for the climate application discussed here, where there is a common pattern for the output data set at each input. Prediction results from statistical models with two different input-free means have been presented. While a polynomial regression model appears to be more intuitive for the climate application discussed, the general mean model is more flexible and appears to predict new data more accurately. Since it is not always possible to find good regression variables, the latter model is also expected to be more widely used in applications. An extended regression model which includes input-dependent variables does not seem to improve significantly on the previous fit. Similarly, results based on universal kriging and/or a non-stationary covariance model for latitudes do not differ significantly from those based on simple kriging and/or stationary covariance models.

Computationally intensive numerical models are not uncommon in environmental sciences. For example, the U.S. Environmental Protection Agency (EPA) recommends using CALPUFF, a complex dispersion model that simulates the effects of space-time meteorological conditions on pollution transport, transformation and removal. Its User's Guide (available as a link from the EPA's web page) points out that CALPUFF can take hours of computational time for some applications. While the focus in this paper was on a particular problem important to climate science, there are some useful general principles underlying this work which may be applied to other environmental problems involving computationally intensive numerical models and associated numerical experiments.

Appendix A

The maximum likelihood estimator of the parameter vector $\beta$ and its covariance matrix are derived in this Appendix.
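The formulas for the maximum likelihood estimator of $\nu$ and its variance follow as a particular case of the derivation below, with $X = I$ and $\beta = \nu$. Let $Y_{\cdot,i}$ denote the block of $Y$ corresponding to the $i$-th input, $i = 1, \dots, D$. The part of $-2\log(\text{Likelihood})$ that contains $\beta$ can be rewritten as

\[
\begin{aligned}
(Y - 1_D \otimes X\beta)'\,\Gamma^{-1}\,(Y - 1_D \otimes X\beta)
&= \sum_{i,j=1}^{D} (\Omega_\Theta^{-1})_{i,j}\,(Y_{\cdot,i} - X\beta)'\,(C_T^{-1} \otimes C_L^{-1})\,(Y_{\cdot,j} - X\beta) \\
&= \sum_{i,j=1}^{D} (\Omega_\Theta^{-1})_{i,j}\,Y_{\cdot,i}'\,(C_T^{-1} \otimes C_L^{-1})\,Y_{\cdot,j}
 - 2\beta' X' (C_T^{-1} \otimes C_L^{-1}) \sum_{i,j=1}^{D} (\Omega_\Theta^{-1})_{i,j}\,Y_{\cdot,j} \\
&\quad + \beta' X' (C_T^{-1} \otimes C_L^{-1}) X\beta \sum_{i,j=1}^{D} (\Omega_\Theta^{-1})_{i,j}.
\end{aligned}
\]

Taking partial derivatives with respect to $\beta$ and setting them equal to zero, one obtains

\[
X' (C_T^{-1} \otimes C_L^{-1}) X \left[\sum_{i,j=1}^{D} (\Omega_\Theta^{-1})_{i,j}\right] \beta
= X' (C_T^{-1} \otimes C_L^{-1}) \sum_{i,j=1}^{D} (\Omega_\Theta^{-1})_{i,j}\,Y_{\cdot,j},
\]

and hence the formula for $\hat\beta$. If one denotes

\[
w_j = \frac{\sum_{i=1}^{D} (\Omega_\Theta^{-1})_{i,j}}{\sum_{i,j=1}^{D} (\Omega_\Theta^{-1})_{i,j}},
\]

then

\[
\hat\beta = \left[X' (C_T^{-1} \otimes C_L^{-1}) X\right]^{-1} X' (C_T^{-1} \otimes C_L^{-1}) \sum_{j=1}^{D} w_j\,Y_{\cdot,j}
\]

and therefore

\[
\begin{aligned}
\operatorname{var}(\hat\beta)
&= \left[X' (C_T^{-1} \otimes C_L^{-1}) X\right]^{-1} X' (C_T^{-1} \otimes C_L^{-1})
 \left[\sum_{i,j=1}^{D} w_i w_j \operatorname{cov}(Y_{\cdot,i}, Y_{\cdot,j})\right]
 (C_T^{-1} \otimes C_L^{-1}) X \left[X' (C_T^{-1} \otimes C_L^{-1}) X\right]^{-1} \\
&= \left[X' (C_T^{-1} \otimes C_L^{-1}) X\right]^{-1} X' (C_T^{-1} \otimes C_L^{-1})
 \left[\sum_{i,j=1}^{D} w_i w_j (\Omega_\Theta)_{i,j}\,(C_T \otimes C_L)\right]
 (C_T^{-1} \otimes C_L^{-1}) X \left[X' (C_T^{-1} \otimes C_L^{-1}) X\right]^{-1} \\
&= \left(\sum_{i,j=1}^{D} w_i w_j (\Omega_\Theta)_{i,j}\right) \left[X' (C_T^{-1} \otimes C_L^{-1}) X\right]^{-1}.
\end{aligned}
\]

Appendix B

This Appendix details the covariance structure of the statistical models presented in this paper. Denote by $Y_{k,i}$ the data set of ensemble members, with $k = 1, \dots, R$ ($R = 4$ is the number of ensemble members in this application) and $i = 1, \dots, N_T N_L D$. A mixed model for $Y$ is

\[
Y_{k,i} = \mu(y)_i + Z_i + \epsilon_{k,i},
\]

where the climate signal $Z$ and the (partially colored) ensemble noise $\epsilon$ are independent of each other and have mean zero, with covariance matrices $\sigma^2 \Sigma_Z$ and $\gamma^2 \Sigma_\epsilon \otimes I_R$, respectively. In vector format, $Y$ has covariance matrix $\sigma^2 \Sigma_Z \otimes J_R + \gamma^2 \Sigma_\epsilon \otimes I_R$, where $J_R$ is the $R \times R$ matrix of ones. The model for the ensemble means is

\[
Y_i = \bar{Y}_{\cdot,i} = \mu(y)_i + Z_i + \bar{\epsilon}_{\cdot,i},
\]

with covariance matrix $\Gamma = \sigma^2 \Sigma_Z + \tau^2 \Sigma_\epsilon$ and $\tau^2 = \gamma^2 / R$.

In practice it would be difficult to work with general matrices $\Sigma_Z$ and $\Sigma_\epsilon$, and therefore one needs to make some assumptions. First, under the assumption of separability, each matrix is written as a Kronecker product of smaller matrices that reflect the input, time and space (latitude) dimensions. Second, each ensemble member is assumed to have a time-space correlation matrix $C_T \otimes C_L$, which is then inherited by the climate signal and the ensemble noise. However, the climate signal varies smoothly from input to input, whereas the ensemble noise does not depend on the inputs. These assumptions lead to

\[
\Gamma = (\sigma^2 C_\Theta + \tau^2 I) \otimes C_T \otimes C_L.
\]

Finally, note that

\[
\hat\gamma^2 = \frac{1}{N_T N_L D} \sum_{i=1}^{N_T N_L D} \left[\frac{1}{R-1} \sum_{k=1}^{R} \left(Y_{k,i} - \bar{Y}_{\cdot,i}\right)^2\right]
\]

is an unbiased estimator of $\gamma^2$, and $\hat\tau^2 = \hat\gamma^2 / R$ is an unbiased estimator of $\tau^2$.

References

[1] Bayarri, M.J., Walsh, D., Berger, J.O., Cafeo, J., Garcia-Donato, G., Liu, F., Palomo, J., Parthasarathy, R.J., Paulo, R. and Sacks, J.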
(2007), Computer Model Validation with Functional Output, Annals of Statistics, 35,

[2] Currin, C., Mitchell, T., Morris, M. and Ylvisaker, D. (1991), Bayesian Prediction of Deterministic Functions, with Applications to the Design and Analysis of Computer Experiments, Journal of the American Statistical Association, 86,
[3] Drignei, D. (2006), Empirical Bayesian Analysis for High-Dimensional Computer Output, Technometrics, 48,
[4] Drignei, D. and Morris, M.D. (2006), Empirical Bayesian Analysis for Computer Experiments Involving Finite-Difference Codes, Journal of the American Statistical Association, 101, p
[5] Fang, K.T., Li, R. and Sudjianto, A. (2006), Design and Modeling for Computer Experiments, Boca Raton, FL: Chapman and Hall/CRC Press.
[6] Forest, C.E., Stone, P.H., Sokolov, A.P., Allen, M.R. and Webster, M.D. (2002), Quantifying Uncertainties in Climate System Properties with the Use of Recent Climate Observations, Science, 295,
[7] IPCC, 2001: Climate Change 2001: The Scientific Basis. Contribution of Working Group I to the Third Assessment Report of the Intergovernmental Panel on Climate Change [Houghton, J.T., Y. Ding, D.J. Griggs, M. Noguer, P.J. van der Linden, X. Dai, K. Maskell and C.A. Johnson (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, 881 pp.
[8] IPCC, 2007: Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change [Solomon, S., D. Qin, M. Manning, Z. Chen, M. Marquis, K.B. Averyt, M. Tignor and H.L. Miller (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, 996 pp.
[9] Genton, M.G. (2007), Separable Approximations of Space-time Covariance Matrices, Environmetrics, 18,
[10] Higdon, D., Gattiker, J., Williams, B. and Rightley, M. (2007), Computer Model Validation Using High Dimensional Outputs, in Bayesian Statistics 8, eds. Bernardo, J., Bayarri, M.J., Dawid, A.P., Berger, J.O., Heckerman, D., Smith, A.F.M. and West, M., London: Oxford University Press.
[11] Hughes-Oliver, J.M., Gonzalez-Farias, G., Lu, J.C. and Chen, D. (1998), Parametric Nonstationary Correlation Models, Statistics and Probability Letters, 40,
[12] Johnson, M., Moore, L. and Ylvisaker, D. (1990), Minimax and Maximin Distance Designs, Journal of Statistical Planning and Inference, 26,
[13] Li, R. and Sudjianto, A. (2005), Analysis of Computer Experiments Using Penalized Likelihood in Gaussian Kriging Models, Technometrics, 47,
[14] Rauthe, H., Hense, A. and Paeth, H. (2004), A Model Intercomparison Study of Climate Change Signals in Extratropical Circulation, International Journal of Climatology, 24,
[15] Sacks, J., Welch, W.J., Mitchell, T.J. and Wynn, H.P. (1989), Design and Analysis of Computer Experiments, Statistical Science, 4,

[16] Santner, T.J., Williams, B.J. and Notz, W.I. (2003), The Design and Analysis of Computer Experiments, New York: Springer.
[17] Schabenberger, O. and Gotway, C.A. (2005), Statistical Methods for Spatial Data Analysis, Boca Raton, FL: Chapman and Hall/CRC Press.
[18] Sokolov, A.P. and Stone, P.H. (1998), A Flexible Climate Model for Use in Integrated Assessments, Climate Dynamics, 14,
[19] Stone, P.H. and Yao, M.S. (1987), Development of a Two-dimensional Zonally Averaged Statistical-dynamical Model. II: the Role of Eddy Momentum Fluxes in the General Circulation and their Parametrization, Journal of the Atmospheric Sciences, 44,
[20] Stone, P.H. and Yao, M.S. (1990), Development of a Two-dimensional Zonally Averaged Statistical-dynamical Model. III: Parametrization of the Eddy Fluxes of Heat and Moisture, Journal of Climate, 3,
[21] Yao, M.S. and Stone, P.H. (1987), Development of a Two-dimensional Zonally Averaged Statistical-dynamical Model. I: the Parameterization of Moist Convection and its Role in the General Circulation, Journal of the Atmospheric Sciences, 44,
