
4 Bayesian Prediction Methodology

…from

$$
\begin{bmatrix} Y^{te} \\ Y^{tr} \end{bmatrix} \Bigg|\, \vartheta \;\sim\; N_{n_e+n_s}\!\left( \begin{bmatrix} F^{te} \\ F^{tr} \end{bmatrix} \beta,\; \frac{1}{\lambda_Z} \begin{bmatrix} R^{te} & R^{te,tr} \\ \left(R^{te,tr}\right)^{\!\top} & R \end{bmatrix} \right) \qquad (4.1.3)
$$

where ϑ depends on the case being studied. A second stage specifies the prior distribution [ϑ] of ϑ.

The following notation is used throughout the chapter. [W] denotes the distribution of W; where needed explicitly, π(w), E{W}, and Cov(W) denote the joint probability density function of W, the mean of W, and the variance-covariance matrix of W, respectively. F^te is the n_e × p matrix whose ith row consists of the known regression functions for the input x^te_i, i = 1, …, n_e; F^tr denotes the n_s × p matrix of known regression functions for the n_s training data inputs; β denotes the unknown p × 1 vector of regression coefficients; λ_Z denotes the precision (the reciprocal of the variance) of the GP that describes deviations from the regression; R^te is the n_e × n_e correlation matrix Cor(Y^te, Y^te); R^{te,tr} is the n_e × n_s cross-correlation matrix Cor(Y^te, Y^tr); R is the n_s × n_s correlation matrix Cor(Y^tr, Y^tr); κ denotes the vector of parameters that determine a given correlation function; and ϑ denotes the vector of all unknown model parameters for the model under discussion: ϑ is β in Section 4.2.1, ϑ is (β, λ_Z) in Section 4.2.2, and ϑ is (β, λ_Z, κ) in Section 4.3.

For easy reference, note that when (4.1.3) holds, then conditionally

$$
Y^{te} \mid Y^{tr} = y^{tr}, \vartheta \;\sim\; N_{n_e}\!\left( F^{te}\beta + R^{te,tr} R^{-1}\!\left( y^{tr} - F^{tr}\beta \right),\; \frac{1}{\lambda_Z} \left( R^{te} - R^{te,tr} R^{-1} \left(R^{te,tr}\right)^{\!\top} \right) \right) \qquad (4.1.4)
$$

where ϑ depends on the case being studied.

The chapter is organized as follows. Section 4.2 presents conjugate cases where analytic expressions can be derived for π(y^te | y^tr), E{Y^te | Y^tr}, and Var(Y^te | Y^tr). The case ϑ = β is a toy example that is sufficiently simple to illustrate the calculations that become more involved in the remaining conjugate and non-conjugate cases. The conjugate case ϑ = (β, λ_Z) assumes that the local min/max structure of the residual process is known and hence is usually not directly useful in applications. However, in conjunction with the conjugate results for ϑ = (β, λ_Z), Section 4.3 describes Bayesian methodology for the case ϑ = (β, λ_Z, κ), which is used in practical settings.
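Because everything that follows manipulates the conditional distribution (4.1.4), a small numerical sketch may help fix the notation. The fragment below is an editorial illustration rather than code from the chapter; the Gaussian correlation family, the helper names, and all inputs are assumptions.

```python
import numpy as np

def gauss_corr(x1, x2, theta):
    """Gaussian correlation: element (i, j) is exp(-theta * (x1_i - x2_j)^2)."""
    return np.exp(-theta * (x1[:, None] - x2[None, :]) ** 2)

def conditional_moments(x_te, x_tr, y_tr, F_te, F_tr, beta, lam_Z, theta):
    """Mean and covariance of Y_te | Y_tr = y_tr, (beta, lam_Z) from (4.1.4)."""
    R = gauss_corr(x_tr, x_tr, theta)          # n_s x n_s training correlations
    R_te = gauss_corr(x_te, x_te, theta)       # n_e x n_e test correlations
    R_te_tr = gauss_corr(x_te, x_tr, theta)    # n_e x n_s cross-correlations
    resid = y_tr - F_tr @ beta                 # training residuals from the regression
    mean = F_te @ beta + R_te_tr @ np.linalg.solve(R, resid)
    cov = (R_te - R_te_tr @ np.linalg.solve(R, R_te_tr.T)) / lam_Z
    return mean, cov
```

Any other positive definite correlation family could be substituted for gauss_corr without changing the rest of the computation.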

4.2 Examples of Conjugate Bayesian Models and Prediction

The idea used in this and the following sections is that the predictive density π(y^te | y^tr) can be obtained as

$$
\pi\!\left( y^{te} \mid y^{tr} \right) = \int \pi\!\left( y^{te}, \vartheta \mid y^{tr} \right) d\vartheta = \int \pi\!\left( y^{te} \mid \vartheta, y^{tr} \right) \pi\!\left( \vartheta \mid y^{tr} \right) d\vartheta \qquad (4.2.1)
$$

where ϑ are the unknown model parameters. Inference about the unknown model parameters ϑ can be obtained from the conditional distribution of ϑ given the training data, i.e., from [ϑ | y^tr]. For example, the mean of [ϑ | y^tr] is a Bayesian estimate of ϑ, while the standard deviation of this posterior measures the uncertainty in the estimated ϑ. Theorems 4.1 and 4.2 also provide the [ϑ | y^tr] conditional distributions.

4.2.1 Predictive Distributions When ϑ = β

This subsection illustrates the application of (4.2.1) in a toy example that is sufficiently simple that the calculations can be provided in full. It is assumed that the training and test data can be described as draws from the regression + stationary GP model in which the regression coefficient is unknown but the process precision and correlations are known, i.e., ϑ = β. The predictive distribution [Y^te | Y^tr] is derived; as a consequence, the Bayesian predictor E{Y^te | Y^tr} and the uncertainty quantification Var(Y^te | Y^tr) are immediately available. Section 4.2.2 will consider a more challenging model that has application to the usual case where all parameters of the regression + stationary GP model are unknown.

The following theorem provides the predictive distribution of Y^te for two different conjugate choices of second-stage priors for β. The first choice is the normal prior, which can be regarded as an informative choice, while the second can be thought of as a non-informative prior. Formally, the non-informative prior and its predictive distribution are obtained by letting the prior precision λ_β → 0 in the normal prior of case (a).
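The decomposition (4.2.1) also suggests a generic Monte Carlo approximation for models where the integral has no closed form: draw ϑ from [ϑ | y^tr] and average the conditional densities. A minimal sketch, with hypothetical callables cond_density and posterior_sampler supplied by the user:

```python
import numpy as np

def predictive_density_mc(y_te, cond_density, posterior_sampler, n_draws=2000, seed=0):
    """Monte Carlo version of (4.2.1): average pi(y_te | theta, y_tr)
    over draws theta ~ [theta | y_tr]."""
    rng = np.random.default_rng(seed)
    values = [cond_density(y_te, posterior_sampler(rng)) for _ in range(n_draws)]
    return float(np.mean(values))
```

In the conjugate cases treated next, this machinery is unnecessary because the integral is available analytically.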

Theorem 4.1. Suppose (Y^te, Y^tr) follows a two-stage model with first stage

$$
\begin{bmatrix} Y^{te} \\ Y^{tr} \end{bmatrix} \Bigg|\, \beta \;\sim\; N_{n_e+n_s}\!\left( \begin{bmatrix} F^{te} \\ F^{tr} \end{bmatrix} \beta,\; \frac{1}{\lambda_Z} \begin{bmatrix} R^{te} & R^{te,tr} \\ \left(R^{te,tr}\right)^{\!\top} & R \end{bmatrix} \right)
$$

where β is unknown while λ_Z and all correlations are known.

(a) If

$$
[\beta] \sim N_p\!\left( b_\beta,\; \frac{1}{\lambda_\beta} V_\beta \right) \qquad (4.2.2)
$$

is the second-stage model, where V_β is a known positive definite correlation matrix with b_β and λ_β also known, then the posterior distribution of β is β | Y^tr ~ N_p[µ_β, Σ_β], where

$$
\mu_\beta = \left( \lambda_Z \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} + \lambda_\beta V_\beta^{-1} \right)^{-1} \left( \lambda_Z \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr}\, \widehat{\beta} + \lambda_\beta V_\beta^{-1} b_\beta \right)
$$

and

$$
\Sigma_\beta = \left( \lambda_Z \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} + \lambda_\beta V_\beta^{-1} \right)^{-1}.
$$

Here β̂ is the generalized least squares (BLUP) estimator of β defined in part (b). The predictive distribution of Y^te is Y^te | Y^tr = y^tr ~ N_{n_e}[µ^te, Σ^te], with mean and covariance matrix

$$
\mu^{te} = F^{te} \mu_\beta + R^{te,tr} R^{-1}\!\left( y^{tr} - F^{tr} \mu_\beta \right), \qquad (4.2.5)
$$

$$
\Sigma^{te} = \frac{1}{\lambda_Z} \left( R^{te} - \begin{bmatrix} F^{te} & R^{te,tr} \end{bmatrix} \begin{bmatrix} -\frac{\lambda_\beta}{\lambda_Z} V_\beta^{-1} & \left(F^{tr}\right)^{\!\top} \\ F^{tr} & R \end{bmatrix}^{-1} \begin{bmatrix} \left(F^{te}\right)^{\!\top} \\ \left(R^{te,tr}\right)^{\!\top} \end{bmatrix} \right). \qquad (4.2.6)
$$

(b) If π(β) ≡ 1 on ℝ^p, then the posterior distribution of β is

$$
\beta \mid Y^{tr} \sim N_p\!\left[ \widehat{\beta} = \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} \right)^{-1} \left(F^{tr}\right)^{\!\top} R^{-1} y^{tr},\;\; \frac{1}{\lambda_Z} \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} \right)^{-1} \right].
$$

The predictive distribution of Y^te is Y^te | Y^tr = y^tr ~ N_{n_e}[µ^te, Σ^te], where the mean µ^te and covariance Σ^te are modifications of (4.2.5) and (4.2.6), respectively, in which µ_β is replaced by β̂ and (λ_β/λ_Z)V_β^{-1} is replaced by the p × p matrix of zeroes.

The proof of Theorem 4.1 is given in Section 4.5. It requires straightforward calculations to implement the right-hand integral in (4.2.1).
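As an illustration of Theorem 4.1(a), the posterior mean and covariance of β can be computed directly from the formulas above; the helper name is invented, and the inputs are assumed to be NumPy arrays of conforming dimensions.

```python
import numpy as np

def posterior_beta(F_tr, R, y_tr, lam_Z, lam_beta, V_beta, b_beta):
    """mu_beta and Sigma_beta of Theorem 4.1(a)."""
    FtRF = F_tr.T @ np.linalg.solve(R, F_tr)                     # F'R^{-1}F
    beta_hat = np.linalg.solve(FtRF, F_tr.T @ np.linalg.solve(R, y_tr))
    V_inv = np.linalg.inv(V_beta)
    precision = lam_Z * FtRF + lam_beta * V_inv                  # posterior precision
    mu_beta = np.linalg.solve(precision,
                              lam_Z * FtRF @ beta_hat + lam_beta * V_inv @ b_beta)
    Sigma_beta = np.linalg.inv(precision)
    return mu_beta, Sigma_beta
```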

Consider first some observations about inferences concerning β provided by the posterior distribution [β | Y^tr]. The posterior mean of β depends only on the ratio λ_β/λ_Z because

$$
\begin{aligned}
\mu_\beta &= \left( \lambda_Z \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} + \lambda_\beta V_\beta^{-1} \right)^{-1} \left( \lambda_Z \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr}\, \widehat{\beta} + \lambda_\beta V_\beta^{-1} b_\beta \right) \\
&= \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} + V_\beta^{-1}\, \lambda_\beta/\lambda_Z \right)^{-1} \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr}\, \widehat{\beta} + V_\beta^{-1} b_\beta\, \lambda_\beta/\lambda_Z \right).
\end{aligned}
$$

In the special case of (a) where the prior variance of β is identical to the Y(x) process variance, i.e., λ_β = λ_Z, both the posterior mean and covariance simplify further. In this situation, the Bayes estimator of β can be written as

$$
\mu_\beta = \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} + V_\beta^{-1} \right)^{-1} \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr}\, \widehat{\beta} + V_\beta^{-1} b_\beta \right) = \Omega\, \widehat{\beta} + \left( I_p - \Omega \right) b_\beta,
$$

where Ω = ((F^tr)^⊤ R^{-1} F^tr + V_β^{-1})^{-1} (F^tr)^⊤ R^{-1} F^tr, which is a matrix convex combination of the BLUP β̂ and the prior mean b_β. In contrast, the posterior covariance,

$$
\Sigma_\beta = \left( \lambda_Z \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} + \lambda_\beta V_\beta^{-1} \right)^{-1} = \frac{1}{\lambda_Z} \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} + V_\beta^{-1}\, \lambda_\beta/\lambda_Z \right)^{-1},
$$

depends on the individual precision parameters whether or not λ_β = λ_Z.

Prediction at a Single Test Input x^te

To provide additional insight about the nature of the Bayesian predictor µ^te and its uncertainty quantification σ²_te given by Theorem 4.1, we restrict attention to the case of a single test input x^te to simplify the discussion. Examining first the predictive mean, the following properties hold for both cases (a) and (b). Algebra shows that µ^te = µ^te(x^te) is linear in y^tr and, with additional calculation, that it is an unbiased predictor of Y(x^te); i.e., µ^te(x^te) is a linear unbiased predictor of Y(x^te). The continuity and other smoothness properties of µ^te are inherited from those of the correlation function R(·) and the regressors {f_j}_{j=1}^p because

$$
\mu^{te} = f^{\top}\!\left(x^{te}\right) \mu_\beta + r_{te}^{\top} R^{-1}\!\left( y^{tr} - F^{tr} \mu_\beta \right) = \sum_{j=1}^{p} f_j\!\left(x^{te}\right) \mu_{\beta,j} + \sum_{i=1}^{n_s} d_i\, R\!\left( x^{te} - x_i \right),
$$

where µ_{β,j} is the jth element of µ_β and d = (d_1, …, d_{n_s})^⊤ = R^{-1}(y^tr − F^tr µ_β). Thus in case (a), µ^te depends only on the ratio of λ_β and λ_Z because this is true of µ_β. Lastly, µ^te interpolates the training data. This is true because when x^te = x_i for some i ∈ {1, …, n_s}, f_te = f(x^te) = f(x_i) and r_te^⊤ R^{-1} = e_i^⊤, the ith unit vector. Thus

$$
\mu^{te} = f^{\top}\!\left(x_i\right) \mu_\beta + r_{te}^{\top} R^{-1}\!\left( Y^{tr} - F^{tr} \mu_\beta \right) = f^{\top}\!\left(x_i\right) \mu_\beta + e_i^{\top}\!\left( Y^{tr} - F^{tr} \mu_\beta \right) = f^{\top}\!\left(x_i\right) \mu_\beta + Y_i - f^{\top}\!\left(x_i\right) \mu_\beta = Y_i.
$$
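The interpolation property is easy to verify numerically. The following sketch uses assumed values throughout (a constant-mean model, Gaussian correlation with θ = 10, λ_β = λ_Z = 1, V_β = I, and an arbitrary b_β; the seven training inputs are placeholders, not the design used for the book's figures):

```python
import numpy as np

theta = 10.0
x_tr = np.linspace(0.05, 0.95, 7)
y_tr = np.exp(-1.4 * x_tr) * np.cos(3.5 * np.pi * x_tr)
F_tr = np.ones((7, 1))                                # constant mean, p = 1
b_beta = np.array([0.5])                              # arbitrary prior mean

R = np.exp(-theta * (x_tr[:, None] - x_tr[None, :]) ** 2)
FtRF = F_tr.T @ np.linalg.solve(R, F_tr)
beta_hat = np.linalg.solve(FtRF, F_tr.T @ np.linalg.solve(R, y_tr))

# With lam_beta = lam_Z and V_beta = I: mu_beta = Omega beta_hat + (I - Omega) b_beta.
Omega = np.linalg.solve(FtRF + np.eye(1), FtRF)
mu_beta = Omega @ beta_hat + (np.eye(1) - Omega) @ b_beta

x_te = x_tr[3]                                        # predict at a training input
r_te = np.exp(-theta * (x_te - x_tr) ** 2)
mu_te = mu_beta.item() + r_te @ np.linalg.solve(R, y_tr - F_tr @ mu_beta)
print(np.isclose(mu_te, y_tr[3]))                     # True: the predictor interpolates
```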

For prior (b), at a single input x^te the n_e × n_e posterior covariance Σ^te reduces to

$$
\sigma^2_{te}\!\left(x^{te}\right) = \frac{1}{\lambda_Z}\left( 1 - r_{te}^{\top} R^{-1} r_{te} + h^{\top} \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} \right)^{-1} h \right) \qquad (4.2.10)
$$

where h = f_te − (F^tr)^⊤ R^{-1} r_te; equation (4.2.10) was given previously as the variance of the BLUP. For prior (a),

$$
\begin{aligned}
\sigma^2_{te} &= \frac{1}{\lambda_Z}\left( 1 - \begin{bmatrix} f_{te}^{\top} & r_{te}^{\top} \end{bmatrix} \begin{bmatrix} -\frac{\lambda_\beta}{\lambda_Z} V_\beta^{-1} & \left(F^{tr}\right)^{\!\top} \\ F^{tr} & R \end{bmatrix}^{-1} \begin{bmatrix} f_{te} \\ r_{te} \end{bmatrix} \right) \\
&= \frac{1}{\lambda_Z}\left( 1 - \left\{ - f_{te}^{\top} Q^{-1} f_{te} + 2 f_{te}^{\top} Q^{-1} \left(F^{tr}\right)^{\!\top} R^{-1} r_{te} + r_{te}^{\top} R^{-1}\!\left( R - F^{tr} Q^{-1} \left(F^{tr}\right)^{\!\top} \right) R^{-1} r_{te} \right\} \right) \\
&= \frac{1}{\lambda_Z}\left( 1 - r_{te}^{\top} R^{-1} r_{te} + f_{te}^{\top} Q^{-1} f_{te} - 2 f_{te}^{\top} Q^{-1} \left(F^{tr}\right)^{\!\top} R^{-1} r_{te} + r_{te}^{\top} R^{-1} F^{tr} Q^{-1} \left(F^{tr}\right)^{\!\top} R^{-1} r_{te} \right) \qquad (4.2.11) \\
&= \frac{1}{\lambda_Z}\left( 1 - r_{te}^{\top} R^{-1} r_{te} + h^{\top} Q^{-1} h \right), \qquad (4.2.12)
\end{aligned}
$$

where h = f(x^te) − (F^tr)^⊤ R^{-1} r_te and

$$
Q = \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} + \frac{\lambda_\beta}{\lambda_Z} V_\beta^{-1};
$$

the equality in (4.2.11) follows from Lemma C.3.

Intuitively, the variance of the posterior of Y(x^te) given the training data should be zero whenever x^te = x_i, 1 ≤ i ≤ n_s, because we know exactly the response at each of the training data sites and there is no measurement error term in the stochastic process model. This was shown previously for (4.2.10), prior (b). To see that this is also the case for prior (a), fix x^te = x_1, say. In this case, recall that r_te^⊤ R^{-1} = e_1^⊤, and observe that f(x^te) = f(x_1). From (4.2.12),

$$
\begin{aligned}
\sigma^2_{te}\!\left(x_1\right) &= \frac{1}{\lambda_Z}\left( 1 - r_{te}^{\top} R^{-1} r_{te} + \left( f\!\left(x^{te}\right) - \left(F^{tr}\right)^{\!\top} R^{-1} r_{te} \right)^{\!\top} Q^{-1} \left( f\!\left(x^{te}\right) - \left(F^{tr}\right)^{\!\top} R^{-1} r_{te} \right) \right) \\
&= \frac{1}{\lambda_Z}\left( 1 - e_1^{\top} r_{te} + \left( f\!\left(x_1\right) - \left(F^{tr}\right)^{\!\top} e_1 \right)^{\!\top} Q^{-1} \left( f\!\left(x_1\right) - \left(F^{tr}\right)^{\!\top} e_1 \right) \right) \\
&= \frac{1}{\lambda_Z}\left( 1 - 1 + \left( f\!\left(x_1\right) - f\!\left(x_1\right) \right)^{\!\top} Q^{-1} \left( f\!\left(x_1\right) - f\!\left(x_1\right) \right) \right) = \frac{1}{\lambda_Z}\left( 0 + 0 \right) = 0,
\end{aligned}
$$

where Q is given in (4.2.12).

Perhaps the most important use of Theorem 4.1 is to provide pointwise uncertainty bands about the predictor µ^te(x^te) in (4.2.5). The bands can be obtained by using the fact that

$$
\frac{ Y\!\left(x^{te}\right) - \mu^{te}\!\left(x^{te}\right) }{ \sigma_{te}\!\left(x^{te}\right) } \sim N(0, 1).
$$
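A numerical check of the single-input variance formula (4.2.12), including the fact that it vanishes at a training site, can be sketched as follows (sigma2_te is an invented helper; the constant-mean model and Gaussian correlation are assumptions):

```python
import numpy as np

def sigma2_te(x_te, x_tr, F_tr, f_te, lam_Z, lam_beta, V_beta, theta):
    """Predictive variance (4.2.12) at a single input x_te under prior (a)."""
    R = np.exp(-theta * (x_tr[:, None] - x_tr[None, :]) ** 2)
    r = np.exp(-theta * (x_te - x_tr) ** 2)
    Q = F_tr.T @ np.linalg.solve(R, F_tr) + (lam_beta / lam_Z) * np.linalg.inv(V_beta)
    h = f_te - F_tr.T @ np.linalg.solve(R, r)
    return (1.0 - r @ np.linalg.solve(R, r) + h @ np.linalg.solve(Q, h)) / lam_Z

x_tr = np.linspace(0.05, 0.95, 7)
F_tr = np.ones((7, 1))
# At a training site the variance collapses to zero (up to rounding error):
print(sigma2_te(x_tr[2], x_tr, F_tr, np.ones(1), 1.0, 1.0, np.eye(1), 10.0))
```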

Fig. 4.1 The function y(x) = exp{−1.4x} cos(3.5πx) (solid line); a seven-point training data set (solid circles); and the Bayesian predictor µ^te = µ_β + r_te^⊤ R^{-1}(y^tr − 1_{n_s} µ_β) of (4.2.14) for three values of the prior precision λ_β (blue, red, and green curves).

This gives the posterior prediction interval

$$
P\!\left\{ Y\!\left(x^{te}\right) \in \mu^{te}\!\left(x^{te}\right) \pm \sigma_{te}\!\left(x^{te}\right)\, z^{\alpha/2} \;\middle|\; Y^{tr} \right\} = 1 - \alpha,
$$

where z^{α/2} is the upper α/2 critical point of the standard normal distribution (see Appendix A). As a special case, if the input x^te ∈ (a, b), then µ^te(x^te) ± σ_te(x^te) z^{α/2} are pointwise 100(1 − α)% prediction bands for Y(x^te), a < x^te < b. Below, we illustrate the prediction band calculation for the hierarchical [Y^te, Y^tr] model presented in Theorem 4.2.

Example 4.1 (Damped Sine Curve). This example illustrates the effect of the prior [β] on the mean µ^te(x^te) of the predictive distribution in Theorem 4.1. Consider the damped cosine function

$$
y(x) = e^{-1.4x} \cos\!\left( 7\pi x / 2 \right), \quad 0 < x < 1,
$$

and n_s = 7 points of training data, which are shown as the solid curve and dots in Figure 4.1, respectively. For any x^te ∈ (0, 1), the predictive distribution of Y(x^te) is based on the hierarchical Bayes model whose first stage is the stationary stochastic process

$$
Y(x) = \beta_0 + Z(x), \quad 0 < x < 1,
$$

where β_0 ∈ ℝ and, for the purpose here of illustrating the case where only β_0 is unknown, R(h) is taken to be the Gaussian correlation R(h) = exp{−θh²} with θ = 10.0. Suppose in part (a) of Theorem 4.1 it is assumed that β_0 ~ N(b_β, v²_β/λ_β) with v_β = 1 and with known prior mean b_β and known prior precision λ_β. For any x^te ∈ (0, 1), the Bayesian predictor of y(x^te) is the posterior mean

$$
\mu^{te} = \mu^{te}\!\left(x^{te}\right) = \mu_\beta + r_{te}^{\top} R^{-1}\!\left( y^{tr} - 1_{n_s} \mu_\beta \right), \qquad (4.2.14)
$$

where µ_β is the posterior mean of β_0 given Y^tr.
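Given µ^te and σ²_te evaluated on a grid of inputs, the normal prediction band is a one-line computation; a sketch with an invented helper name, using the SciPy normal quantile:

```python
import numpy as np
from scipy.stats import norm

def normal_band(mu_te, s2_te, alpha=0.05):
    """Pointwise 100(1 - alpha)% band: mu_te +/- z^{alpha/2} * sigma_te."""
    z = norm.ppf(1.0 - alpha / 2.0)        # upper alpha/2 critical point
    half = z * np.sqrt(np.asarray(s2_te))
    return mu_te - half, mu_te + half
```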

Fig. 4.2 The factor 1 − r_te^⊤ R^{-1} 1_{n_s} versus x^te ∈ (0, 1).

That posterior mean is

$$
\mu_\beta = \frac{ 1_{n_s}^{\top} R^{-1} y^{tr} + b_\beta\, \lambda_\beta/\lambda_Z }{ 1_{n_s}^{\top} R^{-1} 1_{n_s} + \lambda_\beta/\lambda_Z } = \omega\, b_\beta + \left( 1 - \omega \right) \left( 1_{n_s}^{\top} R^{-1} 1_{n_s} \right)^{-1} 1_{n_s}^{\top} R^{-1} y^{tr} = \omega\, b_\beta + \left( 1 - \omega \right) \widehat{\beta}_0, \qquad (4.2.15)
$$

where ω = λ_β / (λ_Z 1_{n_s}^⊤ R^{-1} 1_{n_s} + λ_β) ∈ (0, 1). In words, (4.2.15) can be interpreted as saying that the posterior mean of β_0 given Y^tr is a convex combination of the prior mean b_β and the MLE of β_0. For fixed process precision λ_Z, ω → 1 and µ_β → b_β as the β_0 prior precision λ_β → ∞; the predictor guesses the prior mean and ignores the data. Similarly, ω → 0 and µ_β → β̂_0 as the β_0 prior precision λ_β → 0; the predictor uses only the data and ignores the prior information.

However, the impact of the prior precision of β_0 on the predictor µ^te(x^te) of y(x^te) can be relatively minor. Consider the data shown in Figure 4.1 by the solid circles that are superimposed on the damped sine curve. Calculation gives the MLE β̂_0 of β_0 for these data. Suppose b_β = 5, θ = 10.0, and λ_Z = 6; then µ_β → 5 as λ_β → ∞ and µ_β → β̂_0 as λ_β → 0. Figure 4.1 shows that the effect on µ^te(x^te) of changing the prior precision is relatively minor. The calculation that shows this effect is to observe that

$$
\mu^{te}\!\left(x^{te}\right) = r_{te}^{\top} R^{-1} y^{tr} + \left( 1 - r_{te}^{\top} R^{-1} 1_{n_s} \right) \mu_\beta,
$$

which shows that the magnitude of the Bayes predictor depends on the posterior mean µ_β only through the factor 1 − r_te^⊤ R^{-1} 1_{n_s}. Figure 4.2 shows that this factor is small near the center of the training data but increases as x^te moves away from the training data.
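A sketch of the convex-combination formula (4.2.15) for the constant-mean case (the helper name is invented); passing a large lam_beta drives the result to b_beta, and a small lam_beta drives it to the generalized least squares estimate, as described above:

```python
import numpy as np

def mu_beta_constant_mean(y_tr, R, b_beta, lam_beta, lam_Z):
    """Posterior mean (4.2.15): omega * prior mean + (1 - omega) * MLE of beta_0."""
    one = np.ones_like(y_tr)
    Rinv_one = np.linalg.solve(R, one)
    omega = lam_beta / (lam_Z * (one @ Rinv_one) + lam_beta)
    beta0_hat = (Rinv_one @ y_tr) / (one @ Rinv_one)   # generalized least squares mean
    return omega * b_beta + (1.0 - omega) * beta0_hat
```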

4.2.2 Predictive Distributions When ϑ = (β, λ_Z)

Theorem 4.2 states the predictive distribution of Y^te given Y^tr for informative and non-informative (β, λ_Z) priors. The informative prior is stated in terms of the factors [β, λ_Z] = [β | λ_Z] × [λ_Z]. In both cases, the [Y^te | Y^tr] posterior is a location-shifted and scaled multivariate t distribution having degrees of freedom that are increased by informative prior information from either β or λ_Z (see Appendix C.4 for the definition of the non-central m-variate t distribution T_m(ν, µ, Σ)).

The informative conditional [β | λ_Z] choice is the multivariate normal distribution with known mean b_β and known correlation matrix V_β; lacking more definitive information, V_β is often taken to be diagonal, if not simply the identity matrix. This prior model makes strong assumptions; for example, it says that each component of β is equally likely to be less than or greater than the corresponding component of b_β. The non-informative β prior is the intuitive choice π(β) ≡ 1 used in Theorem 4.1. The informative [λ_Z] prior is the gamma distribution with specified mean and variance. This prior can be made quite diffuse and hence is also stated as an option for the non-informative prior case. A second non-informative prior for λ_Z is the Jeffreys prior π(λ_Z) = 1/λ_Z, λ_Z > 0 (see Jeffreys (1961), who gives arguments for this choice).

Theorem 4.2. Suppose (Y^te, Y^tr) follows a two-stage model in which the conditional distribution [Y^te, Y^tr | β, λ_Z] is given by (4.1.3) and all correlations are known.

(a) If [β, λ_Z] = [β | λ_Z] × [λ_Z] has prior specified by

$$
[\beta \mid \lambda_Z] \sim N_p\!\left( b_\beta,\; \frac{1}{\lambda_Z} V_\beta \right) \quad \text{and} \quad [\lambda_Z] \sim \Gamma(c, d)
$$

with known prior parameters, then the posterior distributions of β and λ_Z are

$$
\lambda_Z \mid y^{tr} \sim \Gamma\!\left( c_a, d_a \right) \quad \text{and} \quad \beta \mid y^{tr} \sim T_p\!\left( 2c + n_s,\; \mu_\beta,\; \frac{d_a}{c_a}\, \Sigma_\beta \right),
$$

where

$$
\begin{aligned}
c_a &= \left( 2c + n_s \right) / 2, \\
d_a &= \left( 2d + \left( y^{tr} - F^{tr} \widehat{\beta} \right)^{\!\top} R^{-1} \left( y^{tr} - F^{tr} \widehat{\beta} \right) + \left( \widehat{\beta} - b_\beta \right)^{\!\top} \Sigma_\pi^{-1} \left( \widehat{\beta} - b_\beta \right) \right) \Big/ 2, \\
\widehat{\beta} &= \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} \right)^{-1} \left(F^{tr}\right)^{\!\top} R^{-1} y^{tr}, \\
\Sigma_\pi &= V_\beta + \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} \right)^{-1}, \\
\mu_\beta &= \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} + V_\beta^{-1} \right)^{-1} \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr}\, \widehat{\beta} + V_\beta^{-1} b_\beta \right), \\
\Sigma_\beta &= \left( V_\beta^{-1} + \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} \right)^{-1}.
\end{aligned}
$$
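The posterior quantities just defined are direct to compute. A sketch from the formulas above (the helper name is invented and the inputs are assumed to be conforming arrays):

```python
import numpy as np

def posterior_lambda_a(y_tr, F_tr, R, V_beta, b_beta, c, d):
    """Gamma(c_a, d_a) posterior parameters for lam_Z in Theorem 4.2(a)."""
    FtRF = F_tr.T @ np.linalg.solve(R, F_tr)
    beta_hat = np.linalg.solve(FtRF, F_tr.T @ np.linalg.solve(R, y_tr))
    resid = y_tr - F_tr @ beta_hat
    Sigma_pi = V_beta + np.linalg.inv(FtRF)
    diff = beta_hat - b_beta
    c_a = c + len(y_tr) / 2.0                            # (2c + n_s) / 2
    d_a = d + 0.5 * (resid @ np.linalg.solve(R, resid)
                     + diff @ np.linalg.solve(Sigma_pi, diff))
    return c_a, d_a
```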

The predictive distribution of Y^te is

$$
Y^{te} \mid Y^{tr} = y^{tr} \sim T_{n_e}\!\left( 2c + n_s,\; \mu^{te},\; \frac{d_a}{c_a}\, M^{te} \right)
$$

where

$$
\mu^{te} = F^{te} \mu_\beta + R^{te,tr} R^{-1}\!\left( y^{tr} - F^{tr} \mu_\beta \right), \qquad M^{te} = R^{te} - R^{te,tr} R^{-1} \left(R^{te,tr}\right)^{\!\top} + H^{te}\, \Sigma_\beta \left(H^{te}\right)^{\!\top},
$$

and H^te = F^te − R^{te,tr} R^{-1} F^tr.

(b) Suppose that β and λ_Z are independent with [β] ≡ 1 and [λ_Z] having prior (b.1) or (b.2) in the following table:

    [λ_Z] prior     Case designation
    Γ(c, d)         (b.1)
    1/λ_Z           (b.2)

For (b.1) the posterior distributions of β and λ_Z are

$$
\lambda_Z \mid y^{tr} \sim \Gamma\!\left( c_{b.1}, d_{b.1} \right) \quad \text{and} \quad \beta \mid y^{tr} \sim T_p\!\left( 2c + n_s - p,\; \widehat{\beta},\; \frac{d_{b.1}}{c_{b.1}} \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} \right)^{-1} \right),
$$

where β̂ is defined in (a) and

$$
c_{b.1} = \left( 2c + n_s - p \right) / 2, \qquad d_{b.1} = \left( 2d + \left( y^{tr} - F^{tr} \widehat{\beta} \right)^{\!\top} R^{-1} \left( y^{tr} - F^{tr} \widehat{\beta} \right) \right) \Big/ 2.
$$

The predictive distribution of Y^te is

$$
Y^{te} \mid Y^{tr} = y^{tr} \sim T_{n_e}\!\left( 2c + n_s - p,\; \mu^{te},\; \frac{d_{b.1}}{c_{b.1}}\, M^{te} \right)
$$

where

$$
\mu^{te} = F^{te} \widehat{\beta} + R^{te,tr} R^{-1}\!\left( y^{tr} - F^{tr} \widehat{\beta} \right), \qquad M^{te} = R^{te} - R^{te,tr} R^{-1} \left(R^{te,tr}\right)^{\!\top} + H^{te} \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} \right)^{-1} \left(H^{te}\right)^{\!\top}.
$$

For (b.2) the posterior distributions of β and λ_Z are

$$
\lambda_Z \mid y^{tr} \sim \Gamma\!\left( c_{b.2}, d_{b.2} \right) \quad \text{and} \quad \beta \mid y^{tr} \sim T_p\!\left( n_s - p,\; \widehat{\beta},\; \frac{d_{b.2}}{c_{b.2}} \left( \left(F^{tr}\right)^{\!\top} R^{-1} F^{tr} \right)^{-1} \right),
$$

where

$$
c_{b.2} = \left( n_s - p \right) / 2, \qquad d_{b.2} = \left( y^{tr} - F^{tr} \widehat{\beta} \right)^{\!\top} R^{-1} \left( y^{tr} - F^{tr} \widehat{\beta} \right) \Big/ 2.
$$
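For the Jeffreys-prior case (b.2), the posterior reduces to generalized least squares quantities; a sketch with an invented helper:

```python
import numpy as np

def posterior_lambda_b2(y_tr, F_tr, R):
    """Case (b.2): beta_hat and the Gamma(c_b2, d_b2) posterior for lam_Z."""
    FtRF = F_tr.T @ np.linalg.solve(R, F_tr)
    beta_hat = np.linalg.solve(FtRF, F_tr.T @ np.linalg.solve(R, y_tr))
    resid = y_tr - F_tr @ beta_hat
    n_s, p = F_tr.shape
    c_b2 = (n_s - p) / 2.0
    d_b2 = 0.5 * (resid @ np.linalg.solve(R, resid))
    return beta_hat, c_b2, d_b2
```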

The predictive distribution of Y^te is

$$
Y^{te} \mid Y^{tr} = y^{tr} \sim T_{n_e}\!\left( n_s - p,\; \mu^{te},\; \frac{d_{b.2}}{c_{b.2}}\, M^{te} \right)
$$

where µ^te and M^te are defined as in (b.1).

The formulas above for the degrees of freedom, location shift, and scale factor in the Y^te predictive distribution all have intuitive interpretations. The base value for the degrees of freedom is n_s − p, which is augmented by p additional degrees of freedom when β has the informative Gaussian prior of case (a), and is further augmented by 2c degrees of freedom when λ_Z has the informative gamma prior (cases (a) and (b.1)).

The mean of the Y^te predictive distribution is the same for the two λ_Z cases of Theorems 4.1 and 4.2. In the case of Theorem 4.1, which has known λ_Z, (4.2.5) and its case (b) modification give the predictive mean to be

$$
\mu^{te} = F^{te} \mu_\beta + R^{te,tr} R^{-1}\!\left( y^{tr} - F^{tr} \mu_\beta \right),
$$

where µ_β is the mean of the conditional distribution of β for the informative or non-informative prior. In the case of Theorem 4.2, which takes λ_Z to be a hierarchical parameter, the predictive mean is the location parameter of the t distribution. Examination of µ^te shows it is identical to that of Theorem 4.1. Thus the Bayesian predictor is the same for the two cases.

The uncertainty quantifications of Y^te, for known λ_Z as given in Theorem 4.1 and for unknown λ_Z with a prior as given in Theorem 4.2, are related. To simplify the discussion, consider the case of a single input x^te at which prediction is desired. When λ_Z is known and it is assumed that λ_β = λ_Z, Theorem 4.1 gives the predictive variance of Y^te to be

$$
\sigma^2_{te} = \frac{1}{\lambda_Z}\left( 1 - r_{te}^{\top} R^{-1} r_{te} + h^{\top} Q^{-1} h \right),
$$

where h = f(x^te) − (F^tr)^⊤ R^{-1} r_te and Q = (F^tr)^⊤ R^{-1} F^tr + V_β^{-1} or Q = (F^tr)^⊤ R^{-1} F^tr as the informative or non-informative β prior is assumed. When λ_Z is unknown but has the gamma or Jeffreys prior, Theorem 4.2 gives the predictive variance of Y^te to be

$$
\sigma^2_{te} = \frac{d_{b.i}}{c_{b.i}}\, K \left( 1 - r_{te}^{\top} R^{-1} r_{te} + h^{\top} Q^{-1} h \right), \quad i = 1, 2,
$$

where h is as above and K = (2c + n_s − p)/(2c + n_s − p − 2) or K = (n_s − p)/(n_s − p − 2), assuming the denominator is positive. The final factor is the same as in the known-λ_Z case. The product of the first two factors in the second display should be thought of as an estimator of the factor 1/λ_Z in the first. This is because the posterior of λ_Z is gamma with parameters c_{b.i} and d_{b.i}, i = 1, 2; thus c_{b.i}/d_{b.i} is the mean of the λ_Z posterior and d_{b.i}/c_{b.i} is a naive guess of 1/λ_Z.
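Since a T_ν(µ, σ²) variable has variance σ²ν/(ν − 2), the predictive variance above is a one-line computation; a sketch (hypothetical helper; base is the factor 1 − r'R^{-1}r + h'Q^{-1}h):

```python
def t_predictive_variance(d_b, c_b, dof, base):
    """(d_b / c_b) * K * base with K = dof / (dof - 2); see the display above."""
    if dof <= 2:
        raise ValueError("variance is undefined for dof <= 2")
    return (d_b / c_b) * dof / (dof - 2.0) * base
```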

Fig. 4.3 95% pointwise prediction bands for y(x) at n_e = 103 equally spaced x^te values over (0, 1). The solid black curve is the damped sine curve used to generate the n_s = 7 training data points (solid circles). The left panel assumes θ = 10 and the right panel assumes θ = 75 in the Gaussian correlation function R(h | θ) below.

The degrees-of-freedom correction K will be near unity for a small regression model (small p) and a large training sample and/or substantial λ_Z prior information.

Again considering the case of a single input x^te, Theorem 4.2 can be used to place pointwise prediction bands about y(x^te). Using the fact that, given Y^tr,

$$
\frac{ Y\!\left(x^{te}\right) - \mu^{te}\!\left(x^{te}\right) }{ \sigma_{te}\!\left(x^{te}\right) } \sim T\!\left( \text{d.o.f.}, 0, 1 \right),
$$

where d.o.f. is either 2c + n_s − p or n_s − p for the informative or non-informative λ_Z prior, respectively, gives

$$
P\!\left\{ Y\!\left(x^{te}\right) \in \mu^{te}\!\left(x^{te}\right) \pm \sigma_{te}\!\left(x^{te}\right)\, t^{\alpha/2}_{\text{d.o.f.}} \;\middle|\; Y^{tr} \right\} = 1 - \alpha,
$$

where t^{α/2}_ν is the upper α/2 critical point of the univariate central t distribution with ν degrees of freedom (see Appendix A). When x^te ∈ (a, b), then µ^te(x^te) ± σ_te(x^te) t^{α/2}_ν for a < x^te < b are pointwise 100(1 − α)% prediction bands for y(x^te).

Example 4.1 [Continued] Figure 4.3 plots the 95% pointwise prediction bands

$$
\mu^{te} \pm t^{0.025}_{6} \sqrt{ \frac{d_{b.2}}{c_{b.2}} \left( 1 - r_{te}^{\top} R^{-1} r_{te} + h^2 / Q \right) }
$$

obtained for the prior (b.2) of Theorem 4.2, based on the n_s = 7 point training data set obtained from the damped sine curve and shown as filled circles. Here h = 1 − 1_7^⊤ R^{-1} r_te, Q = 1_7^⊤ R^{-1} 1_7, and c_{b.2} and d_{b.2} are specified in Theorem 4.2. The first stage of the model is a GP with constant mean (p = 1) and Gaussian correlation function

$$
R(h \mid \theta) = \exp\!\left\{ -\theta h^2 \right\}.
$$
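A sketch that mirrors this band computation under stated assumptions (the seven training inputs are placeholders rather than the book's design; θ = 10 corresponds to the left panel of Figure 4.3):

```python
import numpy as np
from scipy.stats import t as t_dist

theta = 10.0                                      # use 75.0 for the right panel
x_tr = np.linspace(0.05, 0.95, 7)                 # placeholder 7-point design
y_tr = np.exp(-1.4 * x_tr) * np.cos(3.5 * np.pi * x_tr)
R = np.exp(-theta * (x_tr[:, None] - x_tr[None, :]) ** 2)
one = np.ones(7)

Rinv_one = np.linalg.solve(R, one)
Q = one @ Rinv_one                                # 1'R^{-1}1 (scalar since p = 1)
beta_hat = (Rinv_one @ y_tr) / Q
resid = y_tr - beta_hat * one
c_b2, d_b2 = (7 - 1) / 2.0, 0.5 * (resid @ np.linalg.solve(R, resid))
tcrit = t_dist.ppf(0.975, df=6)                   # n_s - p = 6 degrees of freedom

x_te = np.linspace(0.0, 1.0, 103)                 # n_e = 103 prediction sites
r = np.exp(-theta * (x_te[:, None] - x_tr[None, :]) ** 2)     # 103 x 7
Rinv_rT = np.linalg.solve(R, r.T)                 # columns are R^{-1} r_te
mu_te = beta_hat + r @ np.linalg.solve(R, y_tr - beta_hat * one)
h = 1.0 - one @ Rinv_rT
s2 = (d_b2 / c_b2) * (1.0 - np.sum(r * Rinv_rT.T, axis=1) + h ** 2 / Q)
s2 = np.maximum(s2, 0.0)                          # guard tiny negative rounding error
lower, upper = mu_te - tcrit * np.sqrt(s2), mu_te + tcrit * np.sqrt(s2)
```

The arrays lower and upper collapse to the data values at the training sites, matching the zero-width property noted in the surrounding discussion.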

The bands have been computed for θ ∈ {10, 75} to show the effect of assuming a stronger correlation structure between the Y(x) values (θ = 10) and a weaker correlation structure between the Y(x) values (θ = 75). Intuitively, when the model assumption allows Y(x) to vary more over a given interval of x, the confidence bands should be wider, as seen in the right panel of the figure. For any θ, the bands have zero width at each of the true data points. Finally, the pointwise y(x) predictor, µ^te, is relatively insensitive to the correlation structure while interpolating the data, although the different left-hand prediction values show that this need not be the case when extrapolating.

4.3 Examples of Non-conjugate Bayesian Models and Prediction

4.3.1 Introduction

Subsections 4.2.1 and 4.2.2 assumed that the correlations among the observations are known, i.e., R and r_0 are known. Now we assume that y(·) has a hierarchical Gaussian random field prior with parametric correlation function R(· | ψ) having an unknown vector of parameters ψ, as introduced in Subsection ?? and previously considered in Subsection ?? for predictors. To facilitate the discussion below, suppose that the mean and variance of the normal predictive distribution in ?? are denoted by µ_{0|n}(x_0) = µ_{0|n}(x_0 | ψ) and σ²_{0|n}(x_0) = σ²_{0|n}(x_0 | ψ), where ψ was known in these earlier sections. Similarly, recall that the location and scale parameters of the predictive t distributions in ?? are denoted by µ_i(x_0) = µ_i(x_0 | ψ) and σ²_i(x_0) = σ²_i(x_0 | ψ), for i ∈ {1, 2, 3, 4}.

We consider two issues. The first is the assessment of the standard error of the plug-in predictor µ_{0|n}(x_0 | ψ̂) of Y_0(x_0), which is derived from µ_{0|n}(x_0 | ψ) by substituting ψ̂, an estimator of ψ that might be the MLE or the REML estimator. This question is implicitly stated from the frequentist viewpoint. The second issue is Bayesian; we describe the Bayesian approach to uncertainty in ψ, which is to model it by a prior distribution.

When ψ is known, recall that σ²_{0|n}(x_0 | ψ) is the MSPE of µ_{0|n}(x_0 | ψ). This suggests estimating the MSPE of µ_{0|n}(x_0 | ψ̂) by the plug-in MSPE σ²_{0|n}(x_0 | ψ̂). The correct expression for the MSPE of µ_{0|n}(x_0 | ψ̂) is

$$
\mathrm{MSPE}\!\left( \mu_{0|n}\!\left( x_0 \mid \widehat{\psi} \right), \psi \right) = E_\psi\!\left\{ \left( \mu_{0|n}\!\left( x_0 \mid \widehat{\psi} \right) - Y\!\left( x_0 \right) \right)^{2} \right\}. \qquad (4.3.1)
$$
