A Bayesian Approach to Prediction and Variable Selection Using Nonstationary Gaussian Processes


A Bayesian Approach to Prediction and Variable Selection Using Nonstationary Gaussian Processes

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Casey Davis, B.S., M.A., M.S.

Graduate Program in Statistics
The Ohio State University
2015

Dissertation Committee:
Dr. Christopher M. Hans, Co-Advisor
Dr. Thomas J. Santner, Co-Advisor
Dr. Matt Pratola

Copyright by Casey Davis, 2015

Abstract

This research proposes a Bayesian formulation of the composite Gaussian process (CGP) model of Ba and Joseph (2012). The composite Gaussian process model generalizes the regression plus stationary GP model in both a stationary and a nonstationary manner. The likelihood stage of the model combines two independent Gaussian processes, and the remaining stages place priors on the means, variances, and correlation parameters of the Gaussian processes. Markov chain Monte Carlo methods are used to estimate posterior predictions and prediction intervals, which are compared with predictions from the composite GP model, a treed GP model, and a universal kriging approach.

This research also develops screening methodology for experiments with many inputs that is based on a hierarchical Bayesian Gaussian process model. This flexible model is able to describe output functions having varying range and patterns of fluctuation. Screening is accomplished by identifying inputs with small posterior probability of being correlated with the output, using a Bayesian variable selection prior for the correlation parameters.

This is dedicated to those who have helped me get here.

Acknowledgments

I would like to thank my co-advisors, Professors Christopher Hans and Thomas Santner, for their ideas, support, patience, and willingness to help. I am extremely grateful for the effort they expended in helping me do the research for this dissertation and in editing and proofreading my writing. I truly lucked out with my advisors. I would also like to thank my parents, John and Beverly, for encouraging me through all twelve years of college. Without your support, I would not have made it through, and I might be living under a bridge somewhere. Thanks to my sister Katie for letting me whine and complain about grad school in general and dissertation writing in particular. Now it's your turn. To my brother Jamie and his wife Sara, thanks for letting me come over and play with the kids instead of putting me to work. To my stepmother Judy, thanks for making me feel welcome on my rare trips down South. I was lucky to have my cousins, Matt, Adam, and Annie, and sort-of cousin, Jackie, in Columbus during my time in school. You all made my time in school much more enjoyable. To Uncle Jim, thanks for the encouragement. To Chris, Grant, and John, thanks for the lunches, the emails, and the text messages that gave me a break in otherwise long days. To Agniva, Jingjing, Matt, and Rob, thanks for hanging out.

Vita

January 26 . . . . . . Born, Columbus, NC, USA
B.S. Mathematics, University of North Carolina Greensboro
M.A. Applied Economics, University of North Carolina Greensboro
M.S. Statistics, The Ohio State University
2-present . . . . . . Graduate Teaching Associate, The Ohio State University

Fields of Study

Major Field: Statistics

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   Gaussian Process Models
      Stationary vs. Nonstationary Processes
   Applications of Gaussian Process Models
   Prediction Methodologies
      Kriging
      Treed Gaussian Processes
      Composite Gaussian Processes
   Variable Selection Methodologies
      Spike-and-Slab and Closely Related Methods
      Reference Distribution Variable Selection
      Two-Stage Sensitivity-Based Group Screening
   Markov Chain Monte Carlo Overview
   Overview of Dissertation

2. Bayesian Composite Gaussian Processes for Prediction
   The Stationary BCGP Model
      Priors
      Examples of Draws from the Stationary BCGP
      Computational Methods for Sampling the Posterior from the Stationary BCGP Model
   Prediction Based on the Stationary BCGP Model
      Stationary Examples
      Nonstationary Examples
   The Nonstationary BCGP Model
      Priors
      Examples of Draws from the Nonstationary BCGP
      Computational Methods for the Nonstationary BCGP Model
   Prediction Based on the Nonstationary BCGP Model
      Stationary Examples
      Nonstationary Examples
   Prediction Comparisons

3. Bayesian Composite Gaussian Processes for Variable Selection
   The BCGP Model for Variable Selection
      Priors
      Examples of Draws from the Nonstationary BCGP
      Computational Methods for the Nonstationary BCGP Model for Variable Selection
   Determining Those Variables That Are Active
   Examples
   Discussion

4. Contributions and Future Research
   Contributions
   Future Research

Appendices

A. Proof of Minimum MSPE Predictor and Derivations of Full Conditional Distributions
   A.1 Proof of Theorem 1
   A.2 Derivation of the Full Conditional Distribution of $\mu \mid \Lambda_{-\mu}, y$
   A.3 Derivation of the Full Conditional Distribution of $\mu_V \mid \Lambda_{-\mu_V}, y$
   A.4 Derivation of the Full Conditional Distribution of $\sigma_k^2 \mid \Lambda_{-\sigma_k^2}, y$
   A.5 Derivation of the Full Conditional Distribution of $p_i \mid \Lambda_{-p_i}, y$

B. Software Manuals
   B.1 Stationary BCGP for Prediction
      B.1.1 The MATLAB function BCGPpredStat.m
      B.1.2 Functions Called by BCGPpredStat
      B.1.3 An Example
   B.2 Nonstationary BCGP for Prediction
      B.2.1 The MATLAB function BCGPpredNonStat.m
      B.2.2 Functions Called by BCGPpredNonStat
      B.2.3 An Example
   B.3 Nonstationary BCGP for Variable Selection
      B.3.1 The MATLAB function BCGPvarSel.m
      B.3.2 Functions Called by BCGPvarSel
      B.3.3 An Example

List of Tables

2.1 MSPEs for Example Functions with Noise-Free Data

List of Figures

1.1 Ten draws from each of a stationary GP and a nonstationary GP, and sample covariances of $Y(.2)$ and $Y(.5)$ and of $Y(.5)$ and $Y(.8)$ plotted against each other
1.2 Example draws from the process in (1.7)
1.3 Example draws from the process in (1.7)
1.4 Kriging predictors for the function in (1.10) with 95% prediction intervals
1.5 Kriging predictors and plots of errors for the function in (1.11)
1.6 Kriging predictors for the function in (1.12) with 95% prediction intervals
1.7 Kriging predictors for the function in (1.13) with 95% prediction intervals
1.8 Kriging predictors and plots of errors for the function in (1.14)
1.9 An example partition
1.10 TGP predictors for the functions in (1.10), (1.12), and (1.13) with 95% prediction intervals and small $n$
1.11 TGP predictors for the functions in (1.10), (1.12), and (1.13) with 95% prediction intervals and larger $n$
1.12 TGP predictor and plot of errors on a grid of test locations for (1.11)
1.13 TGP predictor and plot of errors on a grid of test locations for (1.14)
1.14 Examples of two different fixed $v(x)$ functions and draws from the process in (1.16)
1.15 CGP predictors for the functions in (1.10), (1.12), and (1.13) with 95% prediction intervals (left column) and the estimated $v(x)$ functions (right column)
1.16 CGP predictor and plot of errors on a grid of test locations
1.17 CGP predictor and plot of errors on a grid of test locations
1.18 $N(0, v_0\gamma_i)$ and $N(0, v_1\gamma_i)$ densities, with intersections at $\pm\delta_{i\gamma}$
1.19 For (1.25), boxplots of posterior draws of correlation parameters for one iteration (left) and $m$ iterations combined (right)
1.20 For (1.26), boxplots of posterior draws of correlation parameters for one iteration (left) and $m$ iterations combined (right)
1.21 For (1.27), boxplots of posterior draws of correlation parameters for one iteration (left) and $m$ iterations combined (right)
1.22 For (1.28), boxplots of posterior draws of correlation parameters for one iteration (left) and $m$ iterations combined (right)
2.1 Example draws from the process in (2.1) for each $w \in \{.5, .75, 1\}$
2.2 An example draw from the process in (2.1) for each $w \in \{.5, .75, 1\}$
2.3 Stationary BCGP predictor and 95% prediction intervals for the function in (1.10) when the data is noise-free (left) and noisy (right)
2.4 Stationary BCGP predictor and plot of errors for the function in (1.10) when the data is noise-free (left) and noisy (right)
2.5 Stationary BCGP predictor and 95% prediction intervals for the function in (1.12) when the data is noise-free (left) and noisy (right)
2.6 Stationary BCGP predictor and 95% prediction intervals for the function in (1.13) when the data is noise-free (left) and noisy (right)
2.7 Stationary BCGP predictor and plot of errors for the function in (1.14) when the data is noise-free (left) and noisy (right)
2.8 Examples of two different fixed $\sigma^2(x)$ functions and draws from the process in (2.8) for each $w \in \{.5, .75, 1\}$
2.9 An example of a fixed $\sigma^2(x)$ function and two draws from the process in (2.8) for each $w \in \{.5, .75, 1\}$
2.10 An example draw from the process in (2.12) for $w \in \{.5, .75, 1\}$
2.11 Nonstationary BCGP predictor and 95% prediction intervals for the function in (1.10) and posterior mean of $\sigma^2(x)$ when the data is noise-free (left) and noisy (right)
2.12 Nonstationary BCGP predictor and plot of errors for the function in (1.10) when the data is noise-free (left) and noisy (right)
2.13 Posterior mean of $\sigma^2(x)$ when the data is noise-free (left) and noisy (right)
2.14 Nonstationary BCGP predictor and 95% prediction intervals for the function in (1.12) and posterior mean of $\sigma^2(x)$ when the data is noise-free (left) and noisy (right)
2.15 Nonstationary BCGP predictor and 95% prediction intervals for the function in (1.13) and posterior mean of $\sigma^2(x)$ when the data is noise-free (left) and noisy (right)
2.16 Nonstationary BCGP predictor and plot of errors for the function in (1.14) when the data is noise-free (left) and noisy (right)
2.17 Posterior mean of $\sigma^2(x)$ when the data is noise-free (left) and noisy (right)
3.1 An example of a fixed $\sigma^2(x)$ function and two draws from the process in (3.1) for each $w \in \{.5, .75, 1\}$
3.2 Two draws from the process in (3.1) for a fixed $\sigma^2(x)$ function and $w \in \{.5, .75, 1\}$
3.3 Boxplots of global correlation parameters (left) and local correlation parameters (right) for one iteration
3.4 Boxplots of global correlation parameters (left) and local correlation parameters (right) for $m$ iterations combined
3.5 Boxplots of global correlation parameters (left) and local correlation parameters (right) for one iteration
3.6 Boxplots of global correlation parameters (left) and local correlation parameters (right) for $m$ iterations combined
3.7 Boxplots of global correlation parameters (left) and local correlation parameters (right) for $m$ iterations combined
3.8 Boxplots of global correlation parameters (left) and local correlation parameters (right) for one iteration
3.9 Boxplots of global correlation parameters (left) and local correlation parameters (right) for $m$ iterations combined
3.10 Boxplots of global correlation parameters (left) and local correlation parameters (right) for one iteration
3.11 Boxplots of global correlation parameters (left) and local correlation parameters (right) for $m$ iterations combined
3.12 The true function in (3.2) in $x_1$ and $x_2$, and the training data
3.13 Boxplots of global correlation parameters (left) and local correlation parameters (right) for one iteration
3.14 Boxplots of global correlation parameters (left) and local correlation parameters (right) for $m$ iterations combined
B.1 Franke function
B.2 Franke function
B.3 True function

Chapter 1: Introduction

The goal of this chapter is to introduce Gaussian process models and some of their applications. A review of previous methods that have employed these models for both prediction and variable selection is presented in Sections 1.3 and 1.4, along with a general overview of Markov chain Monte Carlo methods in Section 1.5. This review will motivate and provide a foundation for the methodology presented later in this dissertation.

1.1 Gaussian Process Models

A Gaussian process (GP) can be thought of as an infinite-dimensional generalization of a multivariate normal distribution. A stochastic process $Y(x)$, $x \in \mathcal{X} \subseteq \mathbb{R}^d$, with underlying probability space $(\Omega, \mathcal{B}, P)$, is a GP if for any $n$ and any $x_1, \ldots, x_n$ in $\mathcal{X}$, the vector $\mathbf{Y} = (Y(x_1), \ldots, Y(x_n))^\top$ has a multivariate normal distribution (Santner et al., 2003). That is,

$$\mathbf{Y} \sim N_n(\boldsymbol{\mu}, \mathbf{C}), \qquad (1.1)$$

where $\boldsymbol{\mu} = (\mu(x_1), \ldots, \mu(x_n))^\top$ and $\mathbf{C}$ is an $n \times n$ covariance matrix whose $ij^{th}$ element is obtained from the covariance function, $C(x_i, x_j) = \mathrm{Cov}(Y(x_i), Y(x_j))$. A GP is fully specified by its mean function, $\mu(x) = E[Y(x)]$, and its covariance function. A function $C(\cdot,\cdot)$ is said to be a valid covariance function if it is a symmetric, positive semi-definite function; that is, if

$$\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j C(x_i, x_j) \ge 0 \quad\text{and}\quad C(x, x') = C(x', x)$$

hold for any choice of $m$, $\alpha \in \mathbb{R}^m$, and $x_1, \ldots, x_m \in \mathcal{X}$ (see pg. 27 in Kuß, 2006). It is also common to work with correlation functions. A function $R(x, x') = \mathrm{Cor}(Y(x), Y(x'))$ is a valid correlation function if it is positive semi-definite, symmetric ($R(x, x') = R(x', x)$), and satisfies $R(x, x) = 1$.

Valid covariance and correlation functions are typically difficult to construct. However, there are some properties that make this construction easier. Let $C_i(x, x')$, $i = 1, 2$, be valid covariance functions and let $R_i(x, x')$, $i = 1, 2$, be valid correlation functions. Then:

1. $C(x, x') = C_1(x, x') + C_2(x, x')$ is a valid covariance function.
2. $C(x, x') = C_1(x, x')\,C_2(x, x')$ is a valid covariance function, and $R(x, x') = R_1(x, x')\,R_2(x, x')$ is a valid correlation function.
3. For $0 < \alpha < 1$, $C(x, x') = \alpha C_1(x, x') + (1 - \alpha) C_2(x, x')$ is a valid covariance function, and $R(x, x') = \alpha R_1(x, x') + (1 - \alpha) R_2(x, x')$ is a valid correlation function.

These three properties provide many methods for constructing valid correlation and covariance functions.
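As a quick numerical illustration of property 2, the following MATLAB sketch (not part of the dissertation's software; the grid and parameter values are arbitrary choices) builds two valid one-dimensional correlation matrices, multiplies them elementwise, and confirms that the product is still positive semi-definite by checking its eigenvalues.

```matlab
% Illustration: the elementwise product of two valid correlation matrices
% is again a valid correlation matrix (property 2 above).
x  = linspace(0, 1, 50)';            % arbitrary 1-D input grid
D2 = (x - x').^2;                    % squared distances |x_i - x_j|^2

R1 = exp(-25 * D2);                              % Gaussian correlation, theta = 25
R2 = (1 + 10*sqrt(D2)) .* exp(-10*sqrt(D2));     % Matern (nu = 3/2) correlation

Rprod  = R1 .* R2;                   % elementwise (Schur) product
minEig = min(eig((Rprod + Rprod')/2));
fprintf('smallest eigenvalue of R1 .* R2: %g\n', minEig)  % >= 0 up to round-off
```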

For example, many correlation functions used in practice are separable. A correlation function is separable if $R(x, x') = \prod_{i=1}^{d} R_i(x_i, x_i')$, where each $R_i(x_i, x_i')$ is a valid correlation function on $\mathbb{R}$. The validity of separable correlation functions follows directly from property 2 above. Some of the more common families of correlation functions include the Matérn family (very common in the geostatistical literature), the cubic correlation family, the spherical correlation family, and the power exponential family. The power exponential family has the form

$$R(x, x' \mid \theta) = \exp\left\{ -\sum_{j=1}^{d} \theta_j |x_j - x_j'|^{p_j} \right\}, \qquad 0 < p_j \le 2 \text{ and } \theta_j > 0,\ j = 1, \ldots, d. \qquad (1.2)$$

A special case of the power exponential family is when $p_j = 2$, $j = 1, \ldots, d$:

$$R(x, x' \mid \theta) = \exp\left\{ -\sum_{j=1}^{d} \theta_j (x_j - x_j')^2 \right\}. \qquad (1.3)$$

Equation (1.3) is the separable Gaussian correlation function. An equivalent parameterization for this correlation function is to let $\theta_j = -k \ln(\rho_j)$ for some positive constant $k$, so that

$$R(x, x' \mid \rho) = \prod_{j=1}^{d} \rho_j^{\,k (x_j - x_j')^2}, \qquad 0 < \rho_j < 1,\ j = 1, \ldots, d. \qquad (1.4)$$

This parameterization is often preferred because $\rho_j$ has the interpretation of the correlation between outputs at two inputs that differ only in the $j^{th}$ dimension by $1/\sqrt{k}$ of a unit (or $1/\sqrt{k}$ of the domain if the inputs have been scaled to $[0, 1]^d$). For example, this thesis will let $k = 16$, so that $\rho_j$ is the correlation between outputs at two inputs that differ only in the $j^{th}$ dimension by $1/4$ of a unit. Another common choice is $k = 4$, which has the same interpretation for $1/2$ of a unit. The Gaussian correlation function leads to smooth sample paths that are continuous and infinitely differentiable.
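A minimal MATLAB sketch of the correlation function in (1.4) is given below. The function name and the example values of $\rho$ are illustrative choices of my own; only the form of the correlation and the thesis's choice $k = 16$ come from the text.

```matlab
function R = gauss_corr_rho(X1, X2, rho, k)
% Separable Gaussian correlation in the rho-parameterization of (1.4):
%   R(x, x') = prod_j rho_j^( k * (x_j - x'_j)^2 )
% X1: n1-by-d, X2: n2-by-d, rho: 1-by-d with 0 < rho_j < 1, k: positive scalar.
if nargin < 4, k = 16; end            % the thesis's choice, so rho_j is the
                                      % correlation at a 1/4-unit separation
[n1, d] = size(X1);  n2 = size(X2, 1);
R = ones(n1, n2);
for j = 1:d
    D2 = (X1(:, j) - X2(:, j)').^2;   % squared separations in dimension j
    R  = R .* rho(j).^(k * D2);       % product over dimensions
end
end
```

For example, `gauss_corr_rho([0; 0.25], [0; 0.25], 0.6, 16)` returns a 2-by-2 matrix whose off-diagonal entries equal 0.6, the correlation of outputs 1/4 of a unit apart.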

Covariance and correlation functions may be either isotropic or anisotropic. A covariance function that satisfies $C(x, x') = C(\|x - x'\|)$, where $\|x - x'\|$ is Euclidean distance, is isotropic. This means that the covariance between two values depends only on the Euclidean distance between the two locations, so the covariance decays in the same manner in every direction. An example would be the Gaussian correlation function from (1.3) with $\theta_1 = \cdots = \theta_d = \theta$. A valid covariance function that satisfies $C(x, x') = C(\|x - x'\|_K)$, where $\|x - x'\|_K = [(x - x')^\top K (x - x')]^{1/2}$, is anisotropic. To ensure validity of the covariance function $C(\|x - x'\|_K)$, the isotropic correlation function $C(\|x - x'\|)$ must be positive definite and $K$ must be a symmetric, positive semi-definite matrix (see Abrahamsen). Commonly, $K = \mathrm{diag}(\theta_1, \ldots, \theta_d)$, $\theta_i > 0$, $i = 1, \ldots, d$. The anisotropic property is less restrictive in that the covariance does not have to decay at the same rate in every direction. An example of this is the Gaussian correlation function in (1.3) with at least one $\theta_i$ different from the others.

1.1.1 Stationary vs. Nonstationary Processes

A covariance function is stationary if, for any $x, x' \in \mathcal{X}$ and any translation $h \in \mathbb{R}^d$ such that $x + h, x' + h \in \mathcal{X}$, $C(x, x') = C(x + h, x' + h)$. A GP is stationary if, for any $n$, any $x_1, \ldots, x_n \in \mathcal{X}$, and any translation $h \in \mathbb{R}^d$ such that $x_1 + h, \ldots, x_n + h \in \mathcal{X}$, $(Y(x_1), \ldots, Y(x_n))$ and $(Y(x_1 + h), \ldots, Y(x_n + h))$ have the same mean vector and covariance matrix. This is equivalent to saying that a GP is stationary if its mean function is constant and its covariance function is stationary. In this case, $\mathrm{Var}(Y(x)) = \sigma^2$ for all $x \in \mathcal{X}$, and $C(\cdot, \cdot) = \sigma^2 R(\cdot, \cdot)$. Stationary GPs have the property that the covariance is the same for all pairs of locations that have the same relative orientation and the same Euclidean distance from each other. For an isotropic stationary GP, the covariance is a function only of the distance between points. In practice, the outputs $y = (y(x_1), \ldots, y(x_n)) = (Y(x_1, \omega), \ldots, Y(x_n, \omega))$, for some $\omega \in \Omega$, are observed from a single sample path. Stationary GPs have a property called ergodicity as long as $C(h) \to 0$ as $\|h\| \to \infty$. This property allows inference about the process as a whole based on a single draw from the process, $y(x)$. See Cressie (1993) for more details.

A GP is nonstationary if either the mean function is not constant or the covariance function is nonstationary. A nonconstant mean function means that it is not necessarily the case that $E[Y(x)] = E[Y(x + h)]$. A common technique for generating a nonconstant mean function is to let the mean depend on $x$, much like a regression model. This process has the form

$$Y(x) = \sum_{i=1}^{p} f_i(x)\beta_i + Z(x) = f^\top(x)\beta + Z(x),$$

where $f(x) = (f_1(x), \ldots, f_p(x))^\top$ is a vector of known regression functions, $\beta = (\beta_1, \ldots, \beta_p)^\top$ is a vector of unknown regression coefficients, and $Z(x)$ is a zero-mean stationary Gaussian process. If a covariance function is nonstationary, then it is not necessarily the case that $C(x, x') = C(x + h, x' + h)$. The covariance between two locations then depends not only on the orientation of and distance between the points, but also on the location of the points in $\mathcal{X}$. As mentioned previously, a stationary covariance function can be written as $\sigma^2 R(\cdot, \cdot)$. One possible method for constructing a nonstationary covariance function is to multiply the correlation function $R(\cdot, \cdot)$ by a nonconstant variance function $\sigma^2(\cdot)$ such that

$$C(x, x') = \sigma(x)\,\sigma(x')\,R(x, x'). \qquad (1.5)$$

A stationary covariance function is the special case where $\sigma^2(x) = \sigma^2$ for all $x$.
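The following MATLAB sketch illustrates the construction in (1.5). It draws sample paths from a stationary GP and from a nonstationary GP obtained by multiplying the same Gaussian correlation function by a nonconstant variance function; the particular variance function and all numeric settings here are illustrative choices of my own, not the ones used for Figure 1.1 below.

```matlab
% Draws from a stationary GP and from a nonstationary GP built as in (1.5):
%   C(x, x') = sigma(x) * sigma(x') * R(x, x')
x   = linspace(0, 1, 200)';                 % grid of input locations
rho = 0.6;  k = 16;                         % Gaussian correlation in the form (1.4)
R   = rho.^(k * (x - x').^2);               % stationary correlation matrix

sig2_stat = ones(size(x));                  % constant variance (stationary case)
sig2fun   = @(x) 0.25 + 2.0 * x.^2;         % illustrative nonconstant variance function
sig2_nons = sig2fun(x);

Cstat = sqrt(sig2_stat) .* sqrt(sig2_stat') .* R;   % sigma(x) sigma(x') R(x,x')
Cnons = sqrt(sig2_nons) .* sqrt(sig2_nons') .* R;

jit = 1e-8 * eye(numel(x));                 % small nugget for numerical stability
Ls  = chol(Cstat + jit, 'lower');
Ln  = chol(Cnons + jit, 'lower');

Z = randn(numel(x), 10);                    % ten independent standard normal vectors
Ystat = Ls * Z;                             % ten draws with constant variance
Ynons = Ln * Z;                             % ten draws with variance growing in x

plot(x, Ystat, 'b', x, Ynons, 'r');         % compare the two sets of sample paths
```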

An example is shown in Figure 1.1. The GP in the left column has a constant mean and a stationary Gaussian correlation function (1.4) with $\rho = .6$ and $\sigma^2(x) = \sigma^2 = 1$. The GP in the right column has a constant mean and correlation function (1.4) with $\rho = .6$, multiplied by the nonconstant variance function pictured in the top right panel in the manner of (1.5).

Figure 1.1: Ten draws from each of a stationary GP and a nonstationary GP, and sample covariances of $Y(.2)$ and $Y(.5)$ and of $Y(.5)$ and $Y(.8)$ plotted against each other.

The second row of Figure 1.1 shows that the variance at a given value of $x$ is the same for every $x$ in the left panel, but varies as $x$ varies in the right panel. The third row shows the values of $Y(.2)$ and $Y(.5)$ plotted against each other for repeated draws from each of the processes. Overlaid on each plot is a mean-zero bivariate normal density with covariance $C(.2, .5)$. The fourth row does the same for $x = .5$ and $x = .8$. In the left column, the empirical covariances of the samples in the third and fourth rows look the same, which is expected because the covariance there is a function only of the distance between the locations.

In the right column, the empirical covariances of the samples in the third and fourth rows look very different, emphasizing that the covariance structure varies throughout $\mathcal{X}$.

1.2 Applications of Gaussian Process Models

The general idea of modeling with GPs is to represent an unknown function as a realization of a Gaussian process. There are many possible functions that are consistent with a given dataset. A Gaussian process is used as a prior distribution over the infinite-dimensional space of functions (see Rasmussen and Williams, 2006). O'Hagan (1978) first introduced these priors over functions in a Bayesian regression context, although his approach was not fully Bayesian in that he did not use prior distributions on the correlation parameters. A fully Bayesian approach to modeling with Gaussian processes involves setting a prior for the mean function, $\mu(x)$, assuming a covariance function, $C(x, x')$, and assuming priors for the hyperparameters in $\Lambda$, where $\Lambda$ contains the correlation and variance parameters in $C(\cdot, \cdot)$. For example, for the stationary Gaussian covariance function $\sigma^2 R(\cdot, \cdot)$, where $R(\cdot, \cdot)$ is the Gaussian correlation function in (1.4), $\Lambda = (\sigma^2, \rho_1, \ldots, \rho_d)$. A mean function is often specified as a constant, $\mu(x) = \mu$, or as a linear model

$$\mu(x) = f^\top(x)\beta, \qquad (1.6)$$

where $f(x) = (f_1(x), \ldots, f_p(x))^\top$ is a vector of known regression functions and $\beta = (\beta_1, \ldots, \beta_p)^\top$ is a vector of unknown coefficients. The shape of the model in (1.6) is meant to approximate the trend in $Y(x)$. For example, if $x$ is one-dimensional, then a possible $f(x)$ might be $(1, x, x^2, \ldots, x^{p-1})^\top$, in which case the overall trend in $Y(x)$ can be approximated by a polynomial of degree $p - 1$. The constant mean is often given an improper uniform prior, $p(\mu) \propto 1$, or a $N(m, \sigma_\mu^2)$ prior distribution with $m$ and $\sigma_\mu^2$ assumed known. For a regression-type mean function, $\beta$ is often given the prior $\beta \sim N_p(b, B)$, where $b$ and $B$ are considered known, or an improper uniform prior with $p(\beta) \propto 1$.

When $\Lambda = (\sigma^2, \theta_1, \ldots, \theta_d)$, as in the Gaussian covariance function in (1.3), $\sigma^2$ can be given an inverse gamma prior, $\sigma^2 \sim IG(\alpha, \beta)$, or equivalently, if the covariance function is defined in terms of the precision $\lambda = 1/\sigma^2$, then $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$. The hyperparameters in $\Lambda$ that correspond to correlation function parameters are more difficult to assign appropriate priors. It is often the case that there is little prior information available, so a vague prior is desired. However, it is inadvisable to assign improper priors to correlation parameters, since they will often produce an improper posterior (Neal, 1998; Banerjee et al., 2004). For the Gaussian correlation function (1.3), the correlation parameters are often given Gamma priors with large variances when little prior information is available. In the re-parameterized Gaussian correlation function in (1.4), a natural choice of prior for the new parameter is $\rho_i \sim \mathrm{Beta}(\alpha_i, \beta_i)$. Oakley (2002) describes the process of obtaining information from a scientific expert to form useful priors in the field of computer experiments. It involves initially making sure that a GP is appropriate, then obtaining information about the differentiability of $Y(x)$ so that an appropriate covariance function can be chosen. The expert should then propose an approximation of the shape of the function so that an appropriate $f(x)$ can be chosen. The expert can also make informed guesses about the correlation and variance of the process, leaving the statistician to formulate this information into a useful prior.
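As a small illustration of how these prior choices combine in practice, the sketch below evaluates the log prior density of $(\mu, \sigma^2, \rho_1, \ldots, \rho_d)$ under a normal prior on the constant mean, an inverse gamma prior on $\sigma^2$, and independent Beta priors on the $\rho_j$. The function name and the hyperparameter values in the example call are illustrative assumptions, not the settings used later in the dissertation; a log posterior for MCMC would add the GP log likelihood to this quantity.

```matlab
function lp = log_prior(mu, sig2, rho, hyp)
% Log prior density for a constant-mean stationary GP with the
% rho-parameterized Gaussian correlation (1.4):
%   mu ~ N(m, s2mu),  sig2 ~ IG(a, b),  rho_j ~ Beta(al_j, be_j) independently.
% hyp is a struct of assumed-known hyperparameters.
lp = -0.5 * log(2*pi*hyp.s2mu) - 0.5 * (mu - hyp.m)^2 / hyp.s2mu;      % N(m, s2mu)
lp = lp + hyp.a*log(hyp.b) - gammaln(hyp.a) ...
        - (hyp.a + 1)*log(sig2) - hyp.b/sig2;                          % IG(a, b)
for j = 1:numel(rho)                                                   % Beta(al_j, be_j)
    lp = lp + (hyp.al(j) - 1)*log(rho(j)) + (hyp.be(j) - 1)*log(1 - rho(j)) ...
            - betaln(hyp.al(j), hyp.be(j));
end
end

% Example call with illustrative hyperparameters for d = 2 inputs:
%   hyp = struct('m', 0, 's2mu', 100, 'a', 2, 'b', 1, 'al', [1 1], 'be', [1 1]);
%   lp  = log_prior(0.5, 1.2, [0.6 0.3], hyp);
```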

In Bayesian analysis, inferences are made using the posterior distribution $[\Lambda \mid y]$, where $y = (y(x_1), \ldots, y(x_n))^\top$ is the observed training data. Generally, when the model has unknown hyperparameters in the correlation function, this posterior distribution is very difficult to work with directly. Markov chain Monte Carlo (MCMC) methods can be used to sample from this posterior. An overview of MCMC in general, and of Gibbs sampling and the Metropolis-Hastings algorithm in particular, is provided in Section 1.5. Neal (1998) advocates hybrid Monte Carlo due to its efficiency when properly implemented.

Gaussian process computations can be fairly difficult. In particular, the inversion of the $n \times n$ covariance matrix of the training data causes problems. This matrix can become ill-conditioned, either because the sample size is large or because training data points are very close together, causing the computations to be unstable or unreliable due to round-off errors. The most common approach to avoiding this problem is to add a small nugget, $\sigma_\epsilon^2$, to the diagonal elements. This makes the matrix better conditioned while having a negligible effect on the model. Also, inverting an $n \times n$ matrix is computationally expensive when $n$ is large, having complexity of order $n^3$. This can be especially prohibitive in an iterative algorithm like MCMC, where the inversion must be done at each iteration. To reduce some of the computing issues, some authors have used the maximum likelihood type II (ML-II) estimate of the hyperparameters as a plug-in estimate while performing Bayesian inference on the other unknown parameters (Kuß, 2006). The ML-II estimate, $\hat{\Lambda}$, is given by

$$\hat{\Lambda} = \arg\max_{\Lambda}\, [y \mid \Lambda].$$

For more on the ML-II approach, see Berger (1985). Csato and Opper (2002) present another approach, sparse approximation, for large training data sets, which performs the time-consuming matrix operations only on a representative subset of size $p < n$ of the training data. Chapter 8 in Rasmussen and Williams (2006) presents other methods for large datasets, including a Nyström approximation to the covariance matrix.
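The numerical issues described above are usually handled by adding a nugget to the diagonal and working with a Cholesky factor instead of an explicit inverse. The sketch below (illustrative only, with an arbitrary correlation matrix and nugget size) shows the standard pattern: one $O(n^3)$ Cholesky factorization followed by cheap triangular solves for the quantities that appear in the GP log likelihood.

```matlab
% Stabilizing and speeding up GP computations: add a small nugget and use a
% Cholesky factorization rather than inv().
n   = 500;
x   = linspace(0, 1, n)';
R   = exp(-25 * (x - x').^2);          % Gaussian correlation matrix (can be ill-conditioned)
y   = sin(5 * x) + 0.01 * randn(n, 1); % some observed responses (illustrative)

nug = 1e-6;                            % small nugget sigma_eps^2
L   = chol(R + nug * eye(n), 'lower'); % one O(n^3) factorization

alpha  = L' \ (L \ y);                 % (R + nug*I)^{-1} y via two triangular solves
quadft = y' * alpha;                   % quadratic form y' C^{-1} y
logdet = 2 * sum(log(diag(L)));        % log determinant of (R + nug*I)
loglik = -0.5 * (quadft + logdet + n * log(2*pi));   % Gaussian log likelihood (unit process variance)
```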

Gaussian processes have been used in many applications, including physical experiments, computer experiments, and machine learning. Kolmogorov (1941) presented GPs for use in time series analysis. The use of GPs for prediction in a spatial context can be traced back to Matheron (1973) and is often called kriging in that context; Cressie (1993) also presents this method. GPs in a regression context were first presented by O'Hagan (1978) and have since been used in that context for models with measurement error (physical experiments) and without measurement error (computer experiments), for both prediction and uncertainty quantification. GPs are also used in classification problems, where the responses are categorical rather than continuous as in the normal linear regression context; Williams and Barber (1998) present a Bayesian method for this application. GPs in a machine learning context were presented by Williams and Rasmussen (1996) after Neal (1996) described the connection between GPs and infinite neural networks. Sacks et al. (1989) apply GPs to both prediction and design in computer experiments; they propose a sequential design strategy that involves minimizing the integrated mean squared error. GPs have also been used in sensitivity analysis (Oakley and O'Hagan, 2004) to determine how an output changes in response to changes in the inputs. Kennedy and O'Hagan (2001) used GPs in calibration, a process in which unknown parameters in the computer model are adjusted so that the simulator output fits observed data.

1.3 Prediction Methodologies

One common goal in using a GP as a model of an unknown function is to predict the value of the function at a previously unobserved location based on observed values of the process. This section presents a few of the methods that have been proposed in the past.

1.3.1 Kriging

One method for prediction using GPs is commonly known as universal kriging. The GP, $Y(x)$, is specified as follows:

$$Y(x) = \sum_{i=1}^{p} f_i(x)\beta_i + Z(x) = f^\top(x)\beta + Z(x), \qquad (1.7)$$

where $f(x) = (f_1(x), \ldots, f_p(x))^\top$ is a vector of known regression functions, $\beta = (\beta_1, \ldots, \beta_p)^\top$ is a vector of unknown regression coefficients, and $Z(x)$ is a zero-mean stationary Gaussian process with stationary covariance function $\mathrm{Cov}(Z(x), Z(x + h)) = \sigma^2 R(h)$. This model is sometimes called "regression plus stationary GP" and is nonstationary in the sense that it allows a global trend $f^\top(x)\beta$ to be fit (allowing the mean to vary across the input space) while allowing for local deviations from this trend. It should be noted that when this model has a constant trend, it reduces to a stationary model (the mean does not vary across the input space) and is often called ordinary kriging. Some examples of draws from this process can be seen in Figure 1.2 and Figure 1.3. Figure 1.2 shows three draws from the process in (1.7) for a constant trend with $f(x) = 1$, $\beta = .5$; a linear trend with $f(x) = (1, x)^\top$, $\beta = (.5, .65)^\top$;

a quadratic trend with $f(x) = (1, x, x^2)^\top$, $\beta = (.7, 2.2, .6)^\top$; and a cubic trend with $f(x) = (1, x, x^2, x^3)^\top$, $\beta = (.75, 2.4, 4.5, 2.9)^\top$, in one dimension with $\sigma^2 = 0.1$ and $\rho = .2$. This figure clearly shows how the mean of the process varies across the input space. Each draw follows roughly the overall trend with smaller, local deviations around the trend.

Figure 1.2: Example draws from the process in (1.7).

Figure 1.3 shows one draw from the process in (1.7) for a constant trend with $f(x) = 1$, $\beta = .5$; a linear trend in two dimensions with $f(x) = (1, x_1, x_2)^\top$;

a quadratic trend in two dimensions with $f(x) = (1, x_1, x_2, x_1^2, x_2^2)^\top$, $\beta = (.7, 2.2, .6, 4.5, 2.9)^\top$; and a quadratic trend in two dimensions with an interaction, $f(x) = (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2)^\top$, $\beta = (.7, 2.2, .6, 4.5, 2.9, 3.5)^\top$. For these draws, $\sigma^2 = .25$ and $\rho = (.2, .4)$.

Figure 1.3: Example draws from the process in (1.7).

Now consider training data $y = (y(x_1), \ldots, y(x_n))^\top$, a set of data measured at $n$ different input settings $\{x_1, \ldots, x_n\} \subset \mathbb{R}^d$. It is often desired to make a prediction at a new input setting, $x_0$, given the training data. The following theorem gives a result about choosing a best predictor.

Theorem 1. Let $Y(x_0)$ and $\mathbf{Y} = (Y(x_1), \ldots, Y(x_n))^\top$ be jointly distributed as follows:

$$\begin{pmatrix} Y(x_0) \\ \mathbf{Y} \end{pmatrix} \sim G,$$

where $G$ is some distribution, and suppose the conditional mean $E[Y(x_0) \mid \mathbf{Y} = y] = \hat{y}(x_0)$ exists. Then $\hat{y}(x_0)$ is the minimum mean squared prediction error (MSPE) predictor.

Proof. See Appendix A.1.

In particular, Theorem 1 is true for Gaussian processes, which are being used as the model for the unknown function. In this case, $G$ will be a multivariate normal distribution. Now define the $n \times p$ matrix $F = (f(x_1), \ldots, f(x_n))^\top$, $f_0 = (f_1(x_0), \ldots, f_p(x_0))^\top$, $r(x_0) = (R(x_0 - x_1), \ldots, R(x_0 - x_n))^\top$, a vector containing the correlations between the process at the new prediction location and the process at each of the training data locations, and $R$ to be the $n \times n$ matrix with $ij^{th}$ element $R(x_i - x_j)$, the correlation of the process between the $i^{th}$ and $j^{th}$ locations. Then

$$\begin{pmatrix} Y(x_0) \\ \mathbf{Y} \end{pmatrix} \sim N_{n+1}\left( \begin{pmatrix} f_0^\top \\ F \end{pmatrix}\beta,\; \sigma^2 \begin{pmatrix} 1 & r^\top(x_0) \\ r(x_0) & R \end{pmatrix} \right),$$

and, by Theorem 1 and multivariate normal theory, the minimum MSPE predictor is

$$\hat{y}(x_0) = E[Y(x_0) \mid \mathbf{Y} = y] = f_0^\top \beta + r^\top(x_0) R^{-1} (y - F\beta).$$

Now $\beta$ is generally unknown, but, as shown in Santner et al. (2003), the best linear unbiased predictor is

$$\hat{y}_{UK}(x_0) = f_0^\top \hat{\beta} + r^\top(x_0) R^{-1} (y - F\hat{\beta}), \qquad (1.8)$$

where $\hat{\beta} = (F^\top R^{-1} F)^{-1} F^\top R^{-1} y$ is the usual generalized least squares estimator. The variance of $\hat{y}(x_0)$, as shown in Santner et al. (2003), is

$$\mathrm{Var}(\hat{y}(x_0)) = \sigma^2 \left( 1 - r^\top(x_0) R^{-1} r(x_0) + h^\top (F^\top R^{-1} F)^{-1} h \right),$$

where $h = f_0 - F^\top R^{-1} r(x_0)$. $100(1 - \alpha)\%$ prediction intervals can then be given by

$$\hat{y}_{UK}(x_0) \pm z_{\alpha/2} \sqrt{\mathrm{Var}(\hat{y}(x_0))}. \qquad (1.9)$$

This regression plus GP approach is relatively straightforward. A maximum likelihood (ML) or restricted maximum likelihood (REML) approach can be used to estimate the parameters in the model. There is a MATLAB function called MPeRK and a MATLAB toolbox called DACE that can be used to estimate the parameters and to make predictions at new input locations using this regression plus GP approach. After the correlation parameters are estimated, the kriging predictor and prediction intervals are found as in (1.8) and (1.9).
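To make (1.8) and (1.9) concrete, here is a small MATLAB sketch of an ordinary kriging predictor (constant trend, so $f(x) = 1$) with the Gaussian correlation (1.3), applied to the test function $y(x) = \sin(5x)$ used in the examples below. It is an illustrative implementation with a fixed correlation parameter rather than an ML estimate, and it is not the MPeRK or DACE code referred to above.

```matlab
% Ordinary kriging (constant trend) with the Gaussian correlation (1.3).
ytrue = @(x) sin(5 * x);
xtr   = linspace(0, 3, 12)';           % training inputs on [0, 3]
y     = ytrue(xtr);                    % noise-free training responses
x0    = linspace(0, 3, 200)';          % prediction locations

theta = 4;                             % correlation parameter (fixed, not estimated)
corr  = @(a, b) exp(-theta * (a - b').^2);

n  = numel(xtr);
R  = corr(xtr, xtr) + 1e-10 * eye(n);  % correlation matrix plus a tiny nugget
F  = ones(n, 1);  f0 = ones(size(x0)); % constant-trend regressors

Ri_y = R \ y;   Ri_F = R \ F;
bhat = (F' * Ri_F) \ (F' * Ri_y);      % generalized least squares estimate of beta

r    = corr(x0, xtr);                  % rows are r(x0)' for each prediction location
yhat = f0 * bhat + r * (R \ (y - F * bhat));        % predictor (1.8)

sig2 = (y - F * bhat)' * (R \ (y - F * bhat)) / n;  % plug-in process variance
h    = f0' - F' * (R \ r');                         % h = f0 - F' R^{-1} r(x0), per location
vhat = sig2 * (1 - sum(r .* (R \ r')', 2) + (h.^2)' / (F' * Ri_F));  % variance in (1.9)
vhat = max(vhat, 0);                                % guard against round-off
PI   = [yhat - 1.96 * sqrt(vhat), yhat + 1.96 * sqrt(vhat)];   % 95% prediction intervals
```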

Kriging Examples

Consider the simple test function

$$y(x) = \sin(5x), \qquad x \in [0, 3]. \qquad (1.10)$$

Figure 1.4 shows the true function (black), a kriging predictor (red), and 95% prediction intervals (yellow) for each of a constant trend (top left), a linear trend (top right), a quadratic trend (bottom left), and a cubic trend (bottom right); the MSPEs, calculated over a grid of test locations, were all on the order of $10^{-9}$. The training data here indicate that a stationary process is appropriate, as there appears to be no trend and a constant variance. This function is fairly easy to predict: the true function and the kriging predictor overlap nearly perfectly, as indicated by the small MSPEs. There is also very little uncertainty, as indicated by the miniscule prediction intervals in all four plots.

Figure 1.4: Kriging predictors for the function in (1.10) with 95% prediction intervals.

A two-dimensional example is the Franke function (Franke, 1979):

$$\begin{aligned} y(x) = y(x_1, x_2) = {}& .75\exp\!\left( -\frac{(9x_1 - 2)^2}{4} - \frac{(9x_2 - 2)^2}{4} \right) + .75\exp\!\left( -\frac{(9x_1 + 1)^2}{49} - \frac{9x_2 + 1}{10} \right) \\ &+ .5\exp\!\left( -\frac{(9x_1 - 7)^2}{4} - \frac{(9x_2 - 3)^2}{4} \right) - .2\exp\!\left( -(9x_1 - 4)^2 - (9x_2 - 7)^2 \right), \end{aligned} \qquad x_1, x_2 \in [0, 1]. \qquad (1.11)$$

Figure 1.5 shows the true function and the training data from a 24-run maximin Latin hypercube design (top), along with a kriging predictor with a constant trend (middle left), a kriging predictor with a cubic trend and interactions (middle right), and a plot for each that shows the degree of prediction error across the surface. The MSPEs were calculated over a grid of test locations. It is clear that the kriging predictor with a constant trend performs much better than the kriging predictor with a cubic trend and interactions; a misspecification of the overall trend can lead to poor predictions.

Figure 1.5: Kriging predictors and plots of errors for the function in (1.11).

Both of these functions could be modeled well using a GP with a stationary covariance. Now consider the test function presented in Ba and Joseph (2012), originally from Xiong et al. (2007). The true function is

$$y(x) = \sin\!\big(30(x - .9)^4\big)\cos\!\big(2(x - .9)\big) + \frac{x - .9}{2}, \qquad x \in [0, 1]. \qquad (1.12)$$

By looking at the true function in Figure 1.6, it can be seen that the mean of the function in the region $x \in [0, .4]$ is smaller than the mean in the region $x \in (.4, 1]$. Also, the volatility is larger in the region $x \in [0, .4]$ than in the region $x \in (.4, 1]$. For these reasons, it seems that a nonstationary model may be more appropriate. This figure shows the true function (black), a kriging predictor (red) for each of a constant trend (top left, MSPE = .5), a linear trend (top right, MSPE = .7), a quadratic trend (bottom left, MSPE = .6), and a cubic trend (bottom right, MSPE = .3), where the MSPEs were calculated over a grid of test locations, and 95% prediction intervals (yellow) for the kriging predictors. As mentioned previously, the volatility is smaller for large $x$ than for small $x$, and so it would make sense for the prediction intervals to narrow there to account for this. The model with stationary covariance does not allow for this adjustment, and so the prediction intervals seem to be too wide for large $x$.

Figure 1.6: Kriging predictors for the function in (1.12) with 95% prediction intervals.

This phenomenon can also be seen by considering the test function

$$y(x) = e^{-2x}\sin(4\pi x^2), \qquad x \in [0, 3]. \qquad (1.13)$$

It can be seen in Figure 1.7 that the true function has a fairly constant mean over the input space, but the volatility decreases as $x$ gets larger, so a nonstationary model might be more appropriate to account for this. This figure shows the true function (black), a kriging predictor (red) for each of a constant trend (top left, MSPE = .82), a linear trend (top right, MSPE = .82),

a quadratic trend (bottom left, MSPE = .82), and a cubic trend (bottom right, MSPE = .83), where the MSPEs were calculated over a grid of test locations, and 95% prediction intervals (yellow) for the kriging predictors. Again, it seems as though the prediction intervals should be narrower where the volatility is low (large $x$), but there is no reduction in the width of these prediction intervals.

Consider the following 2-dimensional test function:

$$y(x_1, x_2) = \sin\!\big(2(x_1 - .9)^2\big)\cos\!\big(2(x_1 - .9)^2\big)\,\frac{x_1 - .9}{2}\,\sin(2x_1)\cos\!\big(2(x_2 - .9)\big)\,\frac{x_2 - .9}{2}, \qquad x_1, x_2 \in [0, 1]. \qquad (1.14)$$

Figure 1.7: Kriging predictors for the function in (1.13) with 95% prediction intervals.

Figure 1.8 below shows this function. The function is very volatile when $x_1$ and $x_2$ are both near 0 and is smoother elsewhere in the space. A 40-run design with more densely populated design points for $x_1 < .4$ was used for this example, and the predictions were tested over a grid of test locations. Figure 1.8 shows the MPeRK predictors with a linear trend (left column, MSPE = .78) and a cubic trend with interactions (right column, MSPE = .87), along with a plot that shows the errors of each of these predictors across the input space.

Figure 1.8: Kriging predictors and plots of errors for the function in (1.14).

The stationary covariance is not able to capture the high volatility when $x_1$ and $x_2$ are small, and so the predictors are too smooth. This model also requires the specification of the overall trend (constant, linear, quadratic, interactions, etc.). A misspecification of this trend can lead to poor predictions, particularly in areas that are far from the training data, where predictions tend towards the overall trend. This can best be seen in Figure 1.5: choosing a constant trend leads to relatively good predictions, while choosing a cubic trend with interactions leads to poor predictions. The large errors in prediction can be seen particularly at the edges, where there is less training data. It may be desired to have a more flexible trend that does not need to be specified.

1.3.2 Treed Gaussian Processes

A proposed method for handling this nonstationary mean and nonstationary covariance structure is the treed Gaussian process (TGP) model of Gramacy and Lee (2008). The basic idea behind this model is to assume that the input space can be partitioned into $R$ rectangular regions such that a GP with a linear trend and stationary covariance structure is appropriate in each region. Treed partition models partition the input space by making binary splits on the value of one input variable at a time, so that the partition boundaries are parallel to the coordinate axes. Also, each new partition is a subpartition of a previous partition. The input space is partitioned into $R$ regions, $\{r_\nu\}_{\nu=1}^{R}$. In the $\nu^{th}$ region, there are $n_\nu$ training data locations and their corresponding responses, $D_\nu = \{X_\nu, Y_\nu\}$. For example, in a two-dimensional input space, $x = (x_1, x_2) \in [0, 1]^2$, a first partition may divide the input space by whether $x_1 \le .4$ or $x_1 > .4$. A second partition on whether $x_2 \le .6$ or $x_2 > .6$ will then divide only one of the previous rectangles. An example of this partitioning method is shown in Figure 1.9. The data in each region are then used to fit models independently across the regions. Classification and regression trees (CART; Breiman et al., 1984) fit a constant surface in each region. Chipman et al. (1998) fit a Bayesian hierarchical linear model in each region. The treed Gaussian process model extends the model of Chipman et al. (1998) by fitting a GP with a linear trend and stationary covariance structure in each region. This leads to different mean and covariance structures across the space as a whole.
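A minimal MATLAB sketch of the partitioning rule just described is given below; it simply assigns points of $[0,1]^2$ to the three rectangles produced by a first split at $x_1 = .4$ and a second split at $x_2 = .6$. Which rectangle is subdivided by the second split is an arbitrary choice here, and the region labels are my own, not taken from Figure 1.9.

```matlab
% Toy illustration of a treed partition of [0,1]^2:
% first split at x1 = 0.4; then split the x1 > 0.4 rectangle at x2 = 0.6.
assign_region = @(x1, x2) 1 * (x1 <= 0.4) ...
                        + 2 * (x1 >  0.4 & x2 <= 0.6) ...
                        + 3 * (x1 >  0.4 & x2 >  0.6);

% Assign a random set of inputs to regions r1, r2, r3:
X   = rand(500, 2);
reg = assign_region(X(:, 1), X(:, 2));
for nu = 1:3
    fprintf('region r%d: n_nu = %d points\n', nu, sum(reg == nu));
end
% Each region's data {X_nu, Y_nu} would then get its own linear-trend GP.
```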

Figure 1.9: An example partition.

As mentioned in the previous paragraph, each region, $r_\nu$, contains data, $D_\nu$, at $n_\nu$ locations. Let $m = d + 1$ be the number of covariates (recall that there is a linear trend term for each dimension plus an intercept). The hierarchical model is set up as follows:

$$\begin{aligned}
Y_\nu \mid \beta_\nu, \sigma_\nu^2, K_\nu &\sim N_{n_\nu}\!\big(F_\nu \beta_\nu,\ \sigma_\nu^2 K_\nu\big) \\
\beta_\nu \mid \sigma_\nu^2, \tau_\nu^2, W, \beta_0 &\sim N_m\!\big(\beta_0,\ \sigma_\nu^2 \tau_\nu^2 W\big) \\
\beta_0 &\sim N_m(\mu, B) \\
\tau_\nu^2 &\sim IG(\alpha_\tau / 2,\ q_\tau / 2) \\
\sigma_\nu^2 &\sim IG(\alpha_\sigma / 2,\ q_\sigma / 2) \\
W^{-1} &\sim W\!\big((\rho V)^{-1}, \rho\big)
\end{aligned}$$

for $\nu = 1, \ldots, R$, where $N_p$, $IG$, and $W$ are the $p$-variate normal, inverse-gamma, and Wishart distributions, respectively, $F_\nu = (1, X_\nu)$, $K_\nu = R_\nu + \sigma^2_{\epsilon_\nu} I_{n_\nu}$ is a correlation matrix defined as in (1.3) plus a nugget, $W$ is an $m \times m$ matrix, and the hyperparameters $\mu$, $B$, $V$, $\rho$, $\alpha_\sigma$, $q_\sigma$, $\alpha_\tau$, and $q_\tau$ are assumed known.

To make predictions, samples from the posterior are obtained using MCMC methods. Given a tree, $T$, that is, a specific partitioning of the input space, the parameters can be sampled from the posterior using Gibbs and Metropolis-Hastings steps. Sampling from the posterior of the tree structure is performed by reversible jump MCMC. For more details, see Gramacy and Lee (2008).

Point predictions for the TGP model have a form similar to that of universal kriging. Given a region, $r_\nu$, the point prediction has the form

$$\begin{aligned}
\hat{y}_{TGP}(x_0) &= E[Y(x_0) \mid \mathbf{Y} = y,\ x_0 \in r_\nu] = E\big[\, E[Y(x_0) \mid \mathbf{Y} = y,\ x_0 \in r_\nu, \Lambda] \,\big] \\
&= E\big[\, f_0^\top \tilde{\beta}_\nu + r_\nu^\top(x_0)\, K_\nu^{-1}\big(Y_\nu - F_\nu \tilde{\beta}_\nu\big) \,\big|\, x_0 \in r_\nu \big] \\
&\approx \frac{1}{n_{mcmc}} \sum_{i=1}^{n_{mcmc}} \Big[ f_0^\top \tilde{\beta}_\nu^{[i]} + r_\nu^{[i]\top}(x_0) \big(K_\nu^{[i]}\big)^{-1}\big(Y_\nu - F_\nu \tilde{\beta}_\nu^{[i]}\big) \Big], \qquad (1.15)
\end{aligned}$$

where $\Lambda$ contains all of the parameters, $\tilde{\beta}_\nu = \big(F_\nu^\top K_\nu^{-1} F_\nu + W^{-1}/\tau_\nu^2\big)^{-1}\big(F_\nu^\top K_\nu^{-1} Y_\nu + W^{-1}\beta_0/\tau_\nu^2\big)$, $f_0 = (1, x_{01}, \ldots, x_{0d})^\top$, $r_\nu(x_0) = \big(K_\nu(x_0 - x_1), \ldots, K_\nu(x_0 - x_{n_\nu})\big)^\top$ is a vector containing the correlations between the process at the new prediction location and the process at each of the training data locations in $r_\nu$, and the superscript $[i]$ indicates the use of the parameters from the $i^{th}$ draw of the MCMC algorithm to calculate the respective estimators.

A major advantage of this method is computational. The usual stationary GP model tends to run into trouble when the number of training data points, $n$, is large, since it requires calculating the inverse of an $n \times n$ covariance matrix. This leads to two computing issues. First, large matrices are often ill-conditioned, and so are numerically unstable.

Second, even if a large matrix is well-conditioned, inverting an $n \times n$ matrix is computationally intensive ($O(n^3)$). In an iterative algorithm such as MCMC, inverting a large matrix many times leads to slow-running computer code. TGP uses a divide-and-conquer approach, partitioning the data so that there are $R$ smaller $n_\nu \times n_\nu$, $\nu = 1, \ldots, R$, matrices. These smaller matrices are more likely to be well-conditioned and are not as computationally intensive to invert.

The tgp package in R implements the treed Gaussian process model. One of its characteristics is that it assigns zero probability to trees with partitions containing fewer than $\min\{10, n + 1\}$ data points. For example, for $x \in \mathbb{R}$, at least 20 training data points are needed for the input space to be partitioned into two regions. If there were only 19 training data points, one of the regions would have to have 9 or fewer points, and so the input space would not be partitioned at all. This is done to ensure that there is enough data in each region to make useful predictions.

TGP Examples

Figure 1.10 below shows three known functions (black), the TGP predictor (red), and 95% prediction intervals (yellow). For the function in the top row, $y(x) = \sin(5x)$, $x \in [0, 3]$, it looks as though a stationary model would be appropriate. The TGP method does not partition the input space in this example and fits a stationary model whose predictor has a very small MSPE over a grid of test locations. As mentioned previously, this function is fairly easy to predict: the true function and the TGP predictor overlap nearly perfectly, as indicated by the small MSPE. There is also very little uncertainty, as indicated by the miniscule prediction intervals.

For the functions in the second row,

$$y(x) = \sin\!\big(30(x - .9)^4\big)\cos\!\big(2(x - .9)\big) + \frac{x - .9}{2}, \qquad x \in [0, 1]$$

in the left column, and

$$y(x) = e^{-2x}\sin(4\pi x^2), \qquad x \in [0, 3]$$

in the right column, it looks as though the mean changes across the input space for the function in the left column, and the variance decreases as $x$ gets larger in both functions.

Figure 1.10: TGP predictors for the functions in (1.10), (1.12), and (1.13) with 95% prediction intervals and small $n$.

A nonstationary model might therefore seem appropriate for the data from both of these functions. In the left column there are $n = 7$ data points, and in the right column there are $n = 5$ data points. With this small amount of data, the input space cannot be partitioned into more than one region, and so a stationary model is fit for each function. It should be noted that the prediction intervals remain fairly wide over the entire space. The MSPE over a grid of test locations for the predictor in the left column is .45, and the MSPE for the predictor in the right column is .7.

Figure 1.11 below shows the same known functions but with $n = 25$ training data observations. Having more data allows the input space to be partitioned if the data call for it. The data in the plot in the first row indicate that a stationary model is appropriate, and so there is no partitioning of the data. The key thing to notice in the plots in the second row is the prediction intervals (yellow) becoming much narrower at approximately $x = .5$ in the left panel and $x = .6$ in the right panel. In both cases the input space is partitioned into two regions, one with a larger variance and one with a smaller variance.

Figure 1.12 below shows the function in (1.11) and the same 24-run maximin Latin hypercube design as in the previous section, along with the TGP predictor and a plot that shows the errors across the input space. This predictor has an MSPE of .9 over a grid of test locations. Figure 1.13 below shows the function in (1.14) and the same 40-run design as in the previous section, along with the TGP predictor and a plot that shows the errors across the input space. The predictor has an MSPE of .57 over a grid of test locations.

Figure 1.11: TGP predictors for the functions in (1.10), (1.12), and (1.13) with 95% prediction intervals and larger $n$.

It should be noted here that this model estimates a rather large nugget, and so it is not an interpolator. The nugget can be fixed to a very small number, but the predictions then often become unstable. This often leads to the model interpreting volatility in the data as noise rather than signal, so the predictor remains smooth through the area of volatility, as can be seen in both Figure 1.12 and Figure 1.13.

1.3.3 Composite Gaussian Processes

Another proposed method for handling a nonstationary mean and nonstationary covariance structure is the composite Gaussian process (CGP) model of Ba and Joseph (2012).

Figure 1.12: TGP predictor and plot of errors on a grid of test locations for (1.11).

This model incorporates a flexible global trend and a variance model to account for a changing variance throughout the input space. Given the process parameters, $\Lambda$, the CGP model is expressed as a sum of two Gaussian processes as follows:

$$Y(x) = Y_g(x) + \sigma(x)\, Y_l(x), \qquad (1.16)$$

$$Y_g(x) \mid \Lambda \sim GP\big(\mu,\ \tau^2 g(\cdot)\big), \qquad Y_l(x) \mid \Lambda \sim GP\big(0,\ l(\cdot)\big).$$

Figure 1.13: TGP predictor and plot of errors on a grid of test locations for (1.14).

Without loss of generality, write $\sigma^2(x) = \sigma^2 v(x)$. Then $\Lambda = (\mu, \tau^2, \sigma^2, \theta, \kappa, v(x))$. The functions $g(\cdot)$ and $l(\cdot)$ are Gaussian correlation functions with unknown correlation parameter vectors $\theta$ and $\kappa$, each of length equal to the number of inputs, $d$. So

$$g(h \mid \theta) = \exp\left\{-\sum_{j=1}^{d} \theta_j h_j^2\right\}, \qquad l(h \mid \kappa) = \exp\left\{-\sum_{j=1}^{d} \kappa_j h_j^2\right\},$$

and $v(x)$ is a function that allows the volatility of $Y_l(x)$ to change throughout the input space. Typically, $v(x)$ is normalized so that the average magnitude of the variance is $\sigma^2$, while $v(x)$ adjusts the magnitude of the variance at each $x \in \mathcal{X}$.

$Y_g(x)$ and $Y_l(x)$ are independent Gaussian processes; $Y_g(x)$ is a smooth, stationary process that captures the global trend, while $Y_l(x)$ makes local adjustments to the trend. To ensure that the global process, $Y_g(x)$, is smoother than the local process, $\theta$ is given an upper bound, $\kappa_l$, so that $\theta \le \kappa_l \le \kappa$. Note that $Y_l(x)$ is augmented by a variance model that allows the local variability to change throughout the input space. Informally, the model in (1.16) can be thought of as $Y(x) \mid \Lambda \sim GP\big(\mu,\ \tau^2 g + \sigma^2(x)\, l\big)$.

Let $V = \mathrm{diag}\{v(x_1), \ldots, v(x_n)\}$ contain the local variances at each of the training data sites, and let $G$ and $L$ be $n \times n$ correlation matrices with $ij^{th}$ elements $g(x_i - x_j)$ and $l(x_i - x_j)$, respectively. Then, because of the properties of Gaussian processes, the joint distribution of any $Y(x_0)$ and the training data $\mathbf{Y} = (Y(x_1), \ldots, Y(x_n))^\top$ is multivariate normal:

$$\begin{pmatrix} Y(x_0) \\ \mathbf{Y} \end{pmatrix} \Big|\, \Lambda \sim N_{1+n}\left( \begin{pmatrix} \mu \\ \mu \mathbf{1} \end{pmatrix},\ \begin{pmatrix} \tau^2 + \sigma^2 v(x_0) & C_0^\top \\ C_0 & C \end{pmatrix} \right), \qquad (1.17)$$

where $C = \tau^2 G + \sigma^2 V^{1/2} L V^{1/2}$ is the $n \times n$ covariance matrix for the training data, and $C_0 = \tau^2 g(x_0) + \sigma^2 v^{1/2}(x_0) V^{1/2} l(x_0)$, with $g(x_0) = (g(x_0 - x_1), \ldots, g(x_0 - x_n))^\top$ and $l(x_0) = (l(x_0 - x_1), \ldots, l(x_0 - x_n))^\top$.

The first row of Figure 1.14 shows two possible $v(x)$ functions. The second row shows six draws from the process in (1.16) for each $v(x)$, with $\sigma^2 = .5$ and fixed values of $\mu$, $\tau^2$, $\theta$, and $\kappa$, and the third row shows a single draw from the process along with its components, the global process and the local process. The key thing to notice in the second row of Figure 1.14 is the increase in the volatility of each draw from the process where the local volatility function, $v(x)$, is larger. In the third row of Figure 1.14, it can be seen that the global process (green) is relatively smooth, looks stationary, and captures the overall trend of $Y(x)$ fairly well. The local process (red) has varying volatility, becoming more volatile where $v(x)$ is larger.
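The sketch below illustrates the covariance structure in (1.17) by building $C = \tau^2 G + \sigma^2 V^{1/2} L V^{1/2}$ on a grid and drawing sample paths from the resulting composite process. The variance function and all parameter values are illustrative choices (they are not the settings used for Figure 1.14), and the draws are generated directly from the joint normal rather than by simulating the two component processes separately.

```matlab
% Draws from the composite GP (1.16): Y(x) = Y_g(x) + sigma(x) * Y_l(x),
% using the joint covariance C = tau2*G + sig2 * V^{1/2} * L * V^{1/2} from (1.17).
x    = linspace(0, 1, 200)';
mu   = 0;  tau2 = 1;  sig2 = 0.5;      % illustrative parameter values
th   = 2;  ka   = 80;                  % global (smooth) and local (rough) correlation parameters
vfun = @(x) 0.1 + 1.8 * (x > 0.5);     % illustrative v(x): more local variability for x > 0.5

D2 = (x - x').^2;
G  = exp(-th * D2);                    % global Gaussian correlation g(.)
L  = exp(-ka * D2);                    % local Gaussian correlation l(.)

V12 = diag(sqrt(vfun(x)));             % V^{1/2}: local variance adjustments
C   = tau2 * G + sig2 * (V12 * L * V12);            % composite covariance
Lc  = chol(C + 1e-8 * eye(numel(x)), 'lower');      % small nugget for numerical stability

Y = mu + Lc * randn(numel(x), 6);      % six draws from the composite process
plot(x, Y);                            % volatility increases where v(x) is larger
```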


More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Nonstationary spatial process modeling Part II Paul D. Sampson --- Catherine Calder Univ of Washington --- Ohio State University

Nonstationary spatial process modeling Part II Paul D. Sampson --- Catherine Calder Univ of Washington --- Ohio State University Nonstationary spatial process modeling Part II Paul D. Sampson --- Catherine Calder Univ of Washington --- Ohio State University this presentation derived from that presented at the Pan-American Advanced

More information

Nonparametric Regression With Gaussian Processes

Nonparametric Regression With Gaussian Processes Nonparametric Regression With Gaussian Processes From Chap. 45, Information Theory, Inference and Learning Algorithms, D. J. C. McKay Presented by Micha Elsner Nonparametric Regression With Gaussian Processes

More information

Bayesian Dynamic Linear Modelling for. Complex Computer Models

Bayesian Dynamic Linear Modelling for. Complex Computer Models Bayesian Dynamic Linear Modelling for Complex Computer Models Fei Liu, Liang Zhang, Mike West Abstract Computer models may have functional outputs. With no loss of generality, we assume that a single computer

More information

Introduction to emulators - the what, the when, the why

Introduction to emulators - the what, the when, the why School of Earth and Environment INSTITUTE FOR CLIMATE & ATMOSPHERIC SCIENCE Introduction to emulators - the what, the when, the why Dr Lindsay Lee 1 What is a simulator? A simulator is a computer code

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Neil D. Lawrence GPSS 10th June 2013 Book Rasmussen and Williams (2006) Outline The Gaussian Density Covariance from Basis Functions Basis Function Representations Constructing

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Bayesian treed Gaussian process models

Bayesian treed Gaussian process models Bayesian treed Gaussian process models Robert B. Gramacy and Herbert K. H. Lee rbgramacy,herbie}@ams.ucsc.edu Department of Applied Math & Statistics University of California, Santa Cruz Abstract This

More information

Bayesian data analysis in practice: Three simple examples

Bayesian data analysis in practice: Three simple examples Bayesian data analysis in practice: Three simple examples Martin P. Tingley Introduction These notes cover three examples I presented at Climatea on 5 October 0. Matlab code is available by request to

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS 1. THE CLASS OF MODELS y t {y s, s < t} p(y t θ t, {y s, s < t}) θ t = θ(s t ) P[S t = i S t 1 = j] = h ij. 2. WHAT S HANDY ABOUT IT Evaluating the

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Prediction of double gene knockout measurements

Prediction of double gene knockout measurements Prediction of double gene knockout measurements Sofia Kyriazopoulou-Panagiotopoulou sofiakp@stanford.edu December 12, 2008 Abstract One way to get an insight into the potential interaction between a pair

More information

SEQUENTIAL ADAPTIVE DESIGNS IN COMPUTER EXPERIMENTS FOR RESPONSE SURFACE MODEL FIT

SEQUENTIAL ADAPTIVE DESIGNS IN COMPUTER EXPERIMENTS FOR RESPONSE SURFACE MODEL FIT SEQUENTIAL ADAPTIVE DESIGNS IN COMPUTER EXPERIMENTS FOR RESPONSE SURFACE MODEL FIT DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate

More information

Gaussian processes for inference in stochastic differential equations

Gaussian processes for inference in stochastic differential equations Gaussian processes for inference in stochastic differential equations Manfred Opper, AI group, TU Berlin November 6, 2017 Manfred Opper, AI group, TU Berlin (TU Berlin) inference in SDE November 6, 2017

More information

Gaussian with mean ( µ ) and standard deviation ( σ)

Gaussian with mean ( µ ) and standard deviation ( σ) Slide from Pieter Abbeel Gaussian with mean ( µ ) and standard deviation ( σ) 10/6/16 CSE-571: Robotics X ~ N( µ, σ ) Y ~ N( aµ + b, a σ ) Y = ax + b + + + + 1 1 1 1 1 1 1 1 1 1, ~ ) ( ) ( ), ( ~ ), (

More information

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University

More information

Limit Kriging. Abstract

Limit Kriging. Abstract Limit Kriging V. Roshan Joseph School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA 30332-0205, USA roshan@isye.gatech.edu Abstract A new kriging predictor is proposed.

More information

Some general observations.

Some general observations. Modeling and analyzing data from computer experiments. Some general observations. 1. For simplicity, I assume that all factors (inputs) x1, x2,, xd are quantitative. 2. Because the code always produces

More information

Better Simulation Metamodeling: The Why, What and How of Stochastic Kriging

Better Simulation Metamodeling: The Why, What and How of Stochastic Kriging Better Simulation Metamodeling: The Why, What and How of Stochastic Kriging Jeremy Staum Collaborators: Bruce Ankenman, Barry Nelson Evren Baysal, Ming Liu, Wei Xie supported by the NSF under Grant No.

More information

Hierarchical Modelling for Univariate Spatial Data

Hierarchical Modelling for Univariate Spatial Data Spatial omain Hierarchical Modelling for Univariate Spatial ata Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A.

More information

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω ECO 513 Spring 2015 TAKEHOME FINAL EXAM (1) Suppose the univariate stochastic process y is ARMA(2,2) of the following form: y t = 1.6974y t 1.9604y t 2 + ε t 1.6628ε t 1 +.9216ε t 2, (1) where ε is i.i.d.

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Gaussian Processes Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Gaussian processes for spatial modelling in environmental health: parameterizing for flexibility vs. computational efficiency

Gaussian processes for spatial modelling in environmental health: parameterizing for flexibility vs. computational efficiency Gaussian processes for spatial modelling in environmental health: parameterizing for flexibility vs. computational efficiency Chris Paciorek March 11, 2005 Department of Biostatistics Harvard School of

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Models for spatial data (cont d) Types of spatial data. Types of spatial data (cont d) Hierarchical models for spatial data

Models for spatial data (cont d) Types of spatial data. Types of spatial data (cont d) Hierarchical models for spatial data Hierarchical models for spatial data Based on the book by Banerjee, Carlin and Gelfand Hierarchical Modeling and Analysis for Spatial Data, 2004. We focus on Chapters 1, 2 and 5. Geo-referenced data arise

More information

Modeling and Interpolation of Non-Gaussian Spatial Data: A Comparative Study

Modeling and Interpolation of Non-Gaussian Spatial Data: A Comparative Study Modeling and Interpolation of Non-Gaussian Spatial Data: A Comparative Study Gunter Spöck, Hannes Kazianka, Jürgen Pilz Department of Statistics, University of Klagenfurt, Austria hannes.kazianka@uni-klu.ac.at

More information

Kriging by Example: Regression of oceanographic data. Paris Perdikaris. Brown University, Division of Applied Mathematics

Kriging by Example: Regression of oceanographic data. Paris Perdikaris. Brown University, Division of Applied Mathematics Kriging by Example: Regression of oceanographic data Paris Perdikaris Brown University, Division of Applied Mathematics! January, 0 Sea Grant College Program Massachusetts Institute of Technology Cambridge,

More information

A Process over all Stationary Covariance Kernels

A Process over all Stationary Covariance Kernels A Process over all Stationary Covariance Kernels Andrew Gordon Wilson June 9, 0 Abstract I define a process over all stationary covariance kernels. I show how one might be able to perform inference that

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

On Bayesian Computation

On Bayesian Computation On Bayesian Computation Michael I. Jordan with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang Previous Work: Information Constraints on Inference Minimize the minimax risk under constraints

More information

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:

More information

Chapter 4 - Fundamentals of spatial processes Lecture notes

Chapter 4 - Fundamentals of spatial processes Lecture notes TK4150 - Intro 1 Chapter 4 - Fundamentals of spatial processes Lecture notes Odd Kolbjørnsen and Geir Storvik January 30, 2017 STK4150 - Intro 2 Spatial processes Typically correlation between nearby sites

More information

CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes

CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes Roger Grosse Roger Grosse CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes 1 / 55 Adminis-Trivia Did everyone get my e-mail

More information

Efficient MCMC Sampling for Hierarchical Bayesian Inverse Problems

Efficient MCMC Sampling for Hierarchical Bayesian Inverse Problems Efficient MCMC Sampling for Hierarchical Bayesian Inverse Problems Andrew Brown 1,2, Arvind Saibaba 3, Sarah Vallélian 2,3 CCNS Transition Workshop SAMSI May 5, 2016 Supported by SAMSI Visiting Research

More information

On Gaussian Process Models for High-Dimensional Geostatistical Datasets

On Gaussian Process Models for High-Dimensional Geostatistical Datasets On Gaussian Process Models for High-Dimensional Geostatistical Datasets Sudipto Banerjee Joint work with Abhirup Datta, Andrew O. Finley and Alan E. Gelfand University of California, Los Angeles, USA May

More information

Models for models. Douglas Nychka Geophysical Statistics Project National Center for Atmospheric Research

Models for models. Douglas Nychka Geophysical Statistics Project National Center for Atmospheric Research Models for models Douglas Nychka Geophysical Statistics Project National Center for Atmospheric Research Outline Statistical models and tools Spatial fields (Wavelets) Climate regimes (Regression and clustering)

More information

Introduction to Geostatistics

Introduction to Geostatistics Introduction to Geostatistics Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore,

More information

Nonparametric Bayesian Methods - Lecture I

Nonparametric Bayesian Methods - Lecture I Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics

More information

GWAS V: Gaussian processes

GWAS V: Gaussian processes GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011

More information

Reliability Monitoring Using Log Gaussian Process Regression

Reliability Monitoring Using Log Gaussian Process Regression COPYRIGHT 013, M. Modarres Reliability Monitoring Using Log Gaussian Process Regression Martin Wayne Mohammad Modarres PSA 013 Center for Risk and Reliability University of Maryland Department of Mechanical

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu Lecture: Gaussian Process Regression STAT 6474 Instructor: Hongxiao Zhu Motivation Reference: Marc Deisenroth s tutorial on Robot Learning. 2 Fast Learning for Autonomous Robots with Gaussian Processes

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets

Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets Abhirup Datta 1 Sudipto Banerjee 1 Andrew O. Finley 2 Alan E. Gelfand 3 1 University of Minnesota, Minneapolis,

More information

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Elizabeth C. Mannshardt-Shamseldin Advisor: Richard L. Smith Duke University Department

More information

Gaussian Processes for Regression. Carl Edward Rasmussen. Department of Computer Science. Toronto, ONT, M5S 1A4, Canada.

Gaussian Processes for Regression. Carl Edward Rasmussen. Department of Computer Science. Toronto, ONT, M5S 1A4, Canada. In Advances in Neural Information Processing Systems 8 eds. D. S. Touretzky, M. C. Mozer, M. E. Hasselmo, MIT Press, 1996. Gaussian Processes for Regression Christopher K. I. Williams Neural Computing

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Dynamic System Identification using HDMR-Bayesian Technique

Dynamic System Identification using HDMR-Bayesian Technique Dynamic System Identification using HDMR-Bayesian Technique *Shereena O A 1) and Dr. B N Rao 2) 1), 2) Department of Civil Engineering, IIT Madras, Chennai 600036, Tamil Nadu, India 1) ce14d020@smail.iitm.ac.in

More information

Gaussian Processes. 1 What problems can be solved by Gaussian Processes?

Gaussian Processes. 1 What problems can be solved by Gaussian Processes? Statistical Techniques in Robotics (16-831, F1) Lecture#19 (Wednesday November 16) Gaussian Processes Lecturer: Drew Bagnell Scribe:Yamuna Krishnamurthy 1 1 What problems can be solved by Gaussian Processes?

More information

Use of Design Sensitivity Information in Response Surface and Kriging Metamodels

Use of Design Sensitivity Information in Response Surface and Kriging Metamodels Optimization and Engineering, 2, 469 484, 2001 c 2002 Kluwer Academic Publishers. Manufactured in The Netherlands. Use of Design Sensitivity Information in Response Surface and Kriging Metamodels J. J.

More information

Variable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning

Variable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning Variable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning Robert B. Gramacy University of Chicago Booth School of Business faculty.chicagobooth.edu/robert.gramacy

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information