Bayesian Prediction of Code Output
ASA Albuquerque Chapter Short Course
October 2014
Abstract

This presentation summarizes Bayesian prediction methodology for the Gaussian process (GP) surrogate representation of computer model output. The conditional predictive distribution for predicting unsampled code output is derived, followed by posterior distributions for GP regression and precision parameters based on two common prior distribution assumptions. These results lead to a description of the predictive distribution itself when the GP correlation parameters are assumed known. In the case of unknown correlation parameters, their posterior distribution is provided. Markov chain Monte Carlo (MCMC) techniques are introduced as a means of sampling this posterior, and this results in a Monte Carlo method of sampling the predictive distribution when the GP correlation parameters are unknown. Two examples of Bayesian prediction of unsampled code output are provided to illustrate the methods discussed.
Bayesian Prediction: Framework

Inference is based on the predictive distribution.

Training data: $y^s_{tr} = \left\{(x^{tr}_1, y^s(x^{tr}_1)), \ldots, (x^{tr}_n, y^s(x^{tr}_n))\right\}$
Test data: $y^s_{te} = \left\{(x^{te}_1, y^s(x^{te}_1)), \ldots, (x^{te}_{n_p}, y^s(x^{te}_{n_p}))\right\}$

Predictive distribution:
$$p(y^s_{te} \mid y^s_{tr}) = \int p(y^s_{te}, \xi \mid y^s_{tr})\, d\xi = \int p(y^s_{te} \mid y^s_{tr}, \xi)\, \pi(\xi \mid y^s_{tr})\, d\xi$$
where $\xi$ denotes all uncertain model parameters. The conditional density $p(y^s_{te} \mid y^s_{tr}, \xi)$ is derived from process modeling assumptions (e.g., Gaussian process); the posterior $\pi(\xi \mid y^s_{tr})$ is derived analytically (conjugate prior) or sampled via MCMC.
Bayesian Prediction: Sampling Distribution

Joint Sampling Distribution (test and training data): given $(\beta, \lambda, R)$,
$$(Y^s_{te}, Y^s_{tr}) = \left(Y^s(x^{te}_1), \ldots, Y^s(x^{te}_{n_p}), Y^s(x^{tr}_1), \ldots, Y^s(x^{tr}_n)\right) \sim N_{n_p+n}\!\left(F\beta,\ \tfrac{1}{\lambda}R\right),$$
where $F = \begin{pmatrix} F_{te} \\ F_{tr} \end{pmatrix}$ is the $(n_p+n) \times k$ regression matrix and $R = \begin{pmatrix} R_{te} & R_{te,tr} \\ R^T_{te,tr} & R_{tr} \end{pmatrix}$ is the $(n_p+n) \times (n_p+n)$ correlation matrix.

Conditional Distribution (test data given training data): $p(y^s_{te} \mid y^s_{tr}, \beta, \lambda)$ is $N_{n_p}\!\left(m(y^s_{te} \mid y^s_{tr}, \beta, \lambda),\ V(y^s_{te} \mid y^s_{tr}, \beta, \lambda)\right)$, with
$$m = F_{te}\beta + R_{te,tr}R^{-1}_{tr}\left(y^s_{tr} - F_{tr}\beta\right), \qquad V = \tfrac{1}{\lambda}\left(R_{te} - R_{te,tr}R^{-1}_{tr}R^T_{te,tr}\right).$$
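To make the conditional formulas concrete, here is a minimal R sketch that evaluates $m$ and $V$ for known $(\beta, \lambda)$. The function and argument names (`conditional_gp`, `F_te`, `R_tetr`, and so on) are illustrative assumptions, not objects from the slides; the regression matrices and correlation blocks are presumed to have been built elsewhere.

```r
# Conditional distribution of test output given training output for known
# (beta, lambda): the mean m and covariance V from the slide above.
# F_te (np x k), F_tr (n x k), R_te (np x np), R_tetr (np x n),
# R_tr (n x n), y_tr (length n) are assumed to be built elsewhere.
conditional_gp <- function(F_te, F_tr, R_te, R_tetr, R_tr, y_tr, beta, lambda) {
  resid <- y_tr - drop(F_tr %*% beta)                 # y_tr - F_tr beta
  m <- drop(F_te %*% beta + R_tetr %*% solve(R_tr, resid))
  V <- (R_te - R_tetr %*% solve(R_tr, t(R_tetr))) / lambda
  list(mean = m, cov = V)
}
```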
Bayesian Prediction: Priors and Posteriors I

Prior Distributions: Case I (Informative)
- $\pi(\beta \mid \lambda)$ is $N_k\!\left(b_0,\ \tfrac{1}{\lambda}V_0\right)$
- $\pi(\lambda)$ is $\mathrm{Gamma}(a, b)$

Posterior Distributions
- $\pi(\lambda \mid y^s_{tr})$ is $\mathrm{Gamma}(a_1, b_1)$, where $a_1 = (2a + n)/2$ and
$$b_1 = \left[2b + \left(y^s_{tr} - F_{tr}\hat\beta\right)^T R^{-1}_{tr}\left(y^s_{tr} - F_{tr}\hat\beta\right) + \left(\hat\beta - b_0\right)^T W^{-1}\left(\hat\beta - b_0\right)\right]/2,$$
with $\hat\beta = \left(F^T_{tr}R^{-1}_{tr}F_{tr}\right)^{-1}F^T_{tr}R^{-1}_{tr}\,y^s_{tr}$ and $W = V_0 + \left(F^T_{tr}R^{-1}_{tr}F_{tr}\right)^{-1}$.
- $\pi(\beta \mid y^s_{tr})$ is $t_k\!\left(2a_1,\ \hat\mu_\beta,\ (b_1/a_1)\hat V_\beta\right)$, where $\hat V_\beta^{-1} = V_0^{-1} + F^T_{tr}R^{-1}_{tr}F_{tr}$ and $\hat\mu_\beta = \hat V_\beta\left[F^T_{tr}R^{-1}_{tr}F_{tr}\,\hat\beta + V_0^{-1}b_0\right]$.
Bayesian Prediction: Priors and Posteriors II

Prior Distributions: Case II (Noninformative)
- $\pi(\beta) \propto 1$, independent of $\pi(\lambda)$, which is $\mathrm{Gamma}(a, b)$

Posterior Distributions
- $\pi(\lambda \mid y^s_{tr})$ is $\mathrm{Gamma}(a_1, b_1)$, where $a_1 = (2a + n - k)/2$ and
$$b_1 = \left[2b + \left(y^s_{tr} - F_{tr}\hat\beta\right)^T R^{-1}_{tr}\left(y^s_{tr} - F_{tr}\hat\beta\right)\right]/2, \qquad \hat\beta = \left(F^T_{tr}R^{-1}_{tr}F_{tr}\right)^{-1}F^T_{tr}R^{-1}_{tr}\,y^s_{tr}.$$
- $\pi(\beta \mid y^s_{tr})$ is $t_k\!\left(2a_1,\ \hat\beta,\ (b_1/a_1)\left(F^T_{tr}R^{-1}_{tr}F_{tr}\right)^{-1}\right)$.

Results for the Jeffreys prior distribution $\pi(\beta, \lambda) \propto 1/\lambda$: set $a = b = 0$.
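The Case II posterior quantities amount to a generalized least squares computation. A hedged R sketch, with assumed variable and function names, that returns $\hat\beta$, $a_1$, $b_1$, and $F^T_{tr}R^{-1}_{tr}F_{tr}$ for reuse later:

```r
# Case II (noninformative) posterior quantities; the Jeffreys prior
# pi(beta, lambda) propto 1/lambda corresponds to a = b = 0.
posterior_case2 <- function(F_tr, R_tr, y_tr, a = 0, b = 0) {
  n <- length(y_tr); k <- ncol(F_tr)
  Rinv_F   <- solve(R_tr, F_tr)                          # R_tr^{-1} F_tr
  FRF      <- crossprod(F_tr, Rinv_F)                    # F^T R^{-1} F
  beta_hat <- drop(solve(FRF, crossprod(Rinv_F, y_tr)))  # GLS estimate
  resid    <- y_tr - drop(F_tr %*% beta_hat)
  list(beta_hat = beta_hat,
       a1  = (2 * a + n - k) / 2,
       b1  = (2 * b + drop(crossprod(resid, solve(R_tr, resid)))) / 2,
       FRF = FRF)
}
```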
Bayesian Prediction: Predictive Distribution

$p(y^s_{te} \mid y^s_{tr})$ is $t_{n_p}\!\left(2a_1,\ \mu_{te},\ (b_1/a_1)\Sigma_{te}\right)$, where
$$\mu_{te} = F_{te}\hat\mu_\beta + R_{te,tr}R^{-1}_{tr}\left(y^s_{tr} - F_{tr}\hat\mu_\beta\right)$$
$$\Sigma_{te} = R_{te} - R_{te,tr}R^{-1}_{tr}R^T_{te,tr} + H_{te}\,\hat V_\beta\, H^T_{te}, \qquad H_{te} = F_{te} - R_{te,tr}R^{-1}_{tr}F_{tr}.$$

- For Case I priors, $\hat\mu_\beta$ and $\hat V_\beta$ are given on the Case I slide above.
- For Case II priors, $\hat\mu_\beta = \hat\beta$ and $\hat V_\beta = \left(F^T_{tr}R^{-1}_{tr}F_{tr}\right)^{-1}$.
- For the Jeffreys prior, $\mu_{te}$ is the BLUP and $\Sigma_{te}$ carries the associated prediction uncertainty.
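Continuing the hypothetical Case II sketch, the predictive location and scale follow directly from the formulas above (`post` is the list returned by `posterior_case2`; all other names are illustrative):

```r
# Predictive t parameters for Case II priors: df = 2*a1, location mu_te,
# scale (b1/a1) * Sigma_te, with post = posterior_case2(F_tr, R_tr, y_tr).
predictive_t <- function(F_te, F_tr, R_te, R_tetr, R_tr, y_tr, post) {
  mu_hat <- post$beta_hat                      # Case II: mu_hat = beta_hat
  mu_te  <- drop(F_te %*% mu_hat +
                 R_tetr %*% solve(R_tr, y_tr - drop(F_tr %*% mu_hat)))
  H_te   <- F_te - R_tetr %*% solve(R_tr, F_tr)
  V_hat  <- solve(post$FRF)                    # (F^T R^{-1} F)^{-1}
  Sigma_te <- R_te - R_tetr %*% solve(R_tr, t(R_tetr)) +
              H_te %*% V_hat %*% t(H_te)
  list(df = 2 * post$a1, mean = mu_te,
       scale = (post$b1 / post$a1) * Sigma_te)
}
```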
Bayesian Prediction: Uncertain Correlation Parameters

Parametric Correlation Functions
- $\phi$ denotes the uncertain correlation function parameters.
- For example, consider the Gaussian correlation function parameterized by correlation parameters $0 < \rho_i < 1$:
$$R(u, v) = \exp\left[4\sum_{i=1}^k \log(\rho_i)\,(u_i - v_i)^2\right]; \qquad u, v \in [0, 1]^k, \quad \phi = (\rho_1, \ldots, \rho_k).$$
Here $\rho_i$ is the correlation between outputs at two inputs that differ by $1/2$ in coordinate $i$ alone.

Predictive Distribution
$$p(y^s_{te} \mid y^s_{tr}) = \int p(y^s_{te} \mid y^s_{tr}, \phi)\,\pi(\phi \mid y^s_{tr})\, d\phi$$
The conditional density $p(y^s_{te} \mid y^s_{tr}, \phi)$ is obtained from the predictive distribution slide above; the posterior $\pi(\phi \mid y^s_{tr})$ is obtained from the results on the next slide.
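A small R helper for the Gaussian correlation matrix in this $\rho$ parameterization (the name and looping style are illustrative; a vectorized version would be preferred for large designs):

```r
# Gaussian correlation matrix in the rho parameterization above:
# R[i, j] = exp(4 * sum(log(rho) * (X1[i, ] - X2[j, ])^2)).
corr_gauss <- function(X1, X2, rho) {
  stopifnot(all(rho > 0 & rho < 1))
  R <- matrix(0, nrow(X1), nrow(X2))
  for (i in seq_len(nrow(X1)))
    for (j in seq_len(nrow(X2)))
      R[i, j] <- exp(4 * sum(log(rho) * (X1[i, ] - X2[j, ])^2))
  R
}
```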
Bayesian Prediction: Correlation Posterior

Correlation Parameter Posterior Distribution
$$\pi(\phi \mid y^s_{tr}) \propto \pi(\phi)\, b_1^{-a_1}\, |R_{tr}|^{-1/2}\, \left|F^T_{tr}R^{-1}_{tr}F_{tr}\right|^{-1/2} |W|^{-1/2}$$
For $\pi(\beta) \propto 1$, use $|W| = 1$.

Example: Gaussian Correlation
- $\rho_1, \ldots, \rho_k$ independent, with $\rho_i \sim \mathrm{Beta}(a_\rho, b_\rho)$
- $a_\rho = 1$, $b_\rho = 0.1$ $\Rightarrow$ effect sparsity: this prior concentrates near $\rho_i = 1$, so an input is treated as inactive unless the data indicate otherwise
- $(\phi_1, \ldots, \phi_M)$ sampled from $\pi(\phi \mid y^s_{tr})$ via MCMC
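For MCMC it is convenient to evaluate the log posterior. A minimal sketch under the noninformative case ($|W| = 1$), reusing the hypothetical helpers `corr_gauss` and `posterior_case2` defined earlier:

```r
# Log posterior of rho (up to an additive constant) under pi(beta) propto 1,
# i.e. |W| = 1: Beta(1, 0.1) log prior plus the marginal likelihood terms.
log_post_rho <- function(rho, X_tr, F_tr, y_tr, a = 0, b = 0) {
  if (any(rho <= 0 | rho >= 1)) return(-Inf)
  R_tr <- corr_gauss(X_tr, X_tr, rho)
  post <- posterior_case2(F_tr, R_tr, y_tr, a, b)
  sum(dbeta(rho, 1, 0.1, log = TRUE)) -
    post$a1 * log(post$b1) -
    0.5 * as.numeric(determinant(R_tr)$modulus) -     # log |R_tr|
    0.5 * as.numeric(determinant(post$FRF)$modulus)   # log |F^T R^{-1} F|
}
```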
MCMC: Metropolis-Hastings

Objective: generate a sample from the target $\pi(x)$.

Algorithm: repeat for $j = 1, 2, \ldots, M$:
1. Generate $y$ from the proposal density $q(x_j, \cdot)$ and $u$ from $\mathrm{Uniform}(0, 1)$.
2. If $u \le \alpha(x_j, y)$, where
$$\alpha(x, y) = \begin{cases} \min\left\{\dfrac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)},\ 1\right\} & \text{if } \pi(x)\, q(x, y) > 0, \\ 1 & \text{otherwise,} \end{cases}$$
set $x_{j+1} = y$; else, set $x_{j+1} = x_j$.

Return the values $x_1, \ldots, x_M$.

Implementation notes (a sketch follows the list):
- Discard the initial $m_0$ samples as "burn-in".
- Metropolis: with a symmetric proposal, $q(y, x) = q(x, y)$, so $\alpha(x, y) = \min\left\{\pi(y)/\pi(x),\ 1\right\}$.
- The challenge is choosing $q(x, \cdot)$ for effective mixing; standard target acceptance rates are 23.4% (multi-parameter random walk), 44% (single parameter), and 57.4% (Langevin diffusion).
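A bare-bones random-walk Metropolis sampler in R, working on the log scale so the ratio $\pi(y)/\pi(x)$ never overflows. The Gaussian proposal and its scale are illustrative choices, not prescriptions from the slides:

```r
# Bare-bones random-walk Metropolis: symmetric Gaussian proposal, so the
# acceptance probability reduces to min{pi(y)/pi(x), 1}.
metropolis <- function(log_post, x0, M, sd_prop = 0.05, ...) {
  k <- length(x0)
  draws <- matrix(NA_real_, M, k)
  x  <- x0
  lp <- log_post(x, ...)
  n_acc <- 0
  for (j in seq_len(M)) {
    y   <- x + rnorm(k, sd = sd_prop)   # proposal draw from q(x, .)
    lpy <- log_post(y, ...)
    if (log(runif(1)) <= lpy - lp) {    # u <= pi(y)/pi(x), on the log scale
      x <- y; lp <- lpy; n_acc <- n_acc + 1
    }
    draws[j, ] <- x
  }
  list(draws = draws, accept_rate = n_acc / M)
}
# Hypothetical usage; tune sd_prop toward roughly 23% acceptance:
# out <- metropolis(log_post_rho, x0 = rep(0.5, ncol(X_tr)), M = 5000,
#                   X_tr = X_tr, F_tr = F_tr, y_tr = y_tr)
```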
Bayesian Prediction: Estimation of Predictive Distribution

Predictive Distribution
$$p(y^s_{te} \mid y^s_{tr}) = \int p(y^s_{te} \mid y^s_{tr}, \phi)\, \pi(\phi \mid y^s_{tr})\, d\phi$$

Predictive Distribution Estimation: given $(\phi_1, \ldots, \phi_M)$ sampled from $\pi(\phi \mid y^s_{tr})$ via MCMC,
$$p(y^s_{te} \mid y^s_{tr}) \approx \frac{1}{M}\sum_{i=1}^M p(y^s_{te} \mid y^s_{tr}, \phi_i),$$
with moments
$$\hat\mu^s_{te} = \frac{1}{M}\sum_{i=1}^M \mu^{(i)}_{te}, \qquad \hat\Sigma^s_{te} = \frac{1}{M}\sum_{i=1}^M \left[\Sigma^{(i)}_{te} + \left(\mu^{(i)}_{te} - \hat\mu^s_{te}\right)\left(\mu^{(i)}_{te} - \hat\mu^s_{te}\right)^T\right].$$
With uncertain correlation parameters ($\xi = \phi$ in the framework notation), the estimated predictive distribution is a mixture of t-distributions.
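A sketch of the mixture computation, looping over hypothetical MCMC draws `rho_draws` (for example, the post-burn-in rows of `out$draws` from the Metropolis sketch). The $\mathrm{df}/(\mathrm{df}-2)$ factor converting the t scale matrix to a covariance is my addition; the slide's $\Sigma^{(i)}_{te}$ leaves that detail implicit:

```r
# Mixture-of-t estimate of the predictive distribution: one set of
# conditional t parameters per retained MCMC draw of rho.
mix <- lapply(seq_len(nrow(rho_draws)), function(i) {
  rho    <- rho_draws[i, ]
  R_tr   <- corr_gauss(X_tr, X_tr, rho)
  R_te   <- corr_gauss(X_te, X_te, rho)
  R_tetr <- corr_gauss(X_te, X_tr, rho)
  post   <- posterior_case2(F_tr, R_tr, y_tr)
  pt     <- predictive_t(F_te, F_tr, R_te, R_tetr, R_tr, y_tr, post)
  list(mu = pt$mean, Sigma = pt$scale * pt$df / (pt$df - 2))  # t covariance
})
mu_bar    <- Reduce(`+`, lapply(mix, `[[`, "mu")) / length(mix)
Sigma_bar <- Reduce(`+`, lapply(mix, function(m) {
  d <- m$mu - mu_bar
  m$Sigma + tcrossprod(d)        # within-draw cov + between-mean spread
})) / length(mix)
```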
Bayesian Prediction: Damped Sine Example

$$p(y^s_{te} \mid y^s_{tr}) = \int p(y^s_{te} \mid y^s_{tr}, \phi)\, \pi(\phi \mid y^s_{tr})\, d\phi$$

Priors: $\beta = 0$; $\lambda \sim \mathrm{Gamma}(5, 5)$; $\rho \sim \mathrm{Beta}(1, 0.1)$.

[Figure: left panel, predictive realizations overlaid on the training data ($y$ vs. $x$); right panel, predictive standard error vs. $x$ for the MAP and fully Bayesian analyses.]

There is small variation in the conditional means compared with the average conditional variances:
$$s_{te} = \sqrt{\frac{1}{M}\sum_{i=1}^M\left[\left(\sigma^{(i)}_{te}\right)^2 + \left(\mu^{(i)}_{te} - \hat\mu^s_{te}\right)^2\right]}$$
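The pointwise pooled standard error $s_{te}$ can be computed from the same hypothetical `mix` list, reusing `mu_bar` from the previous sketch:

```r
# Pointwise predictive standard error s_te: pool the within-draw variances
# with the squared deviations of the conditional means (formula above).
mu_mat  <- t(sapply(mix, `[[`, "mu"))                # M x np conditional means
var_mat <- t(sapply(mix, function(m) diag(m$Sigma))) # M x np conditional variances
s_te    <- sqrt(colMeans(var_mat + sweep(mu_mat, 2, mu_bar)^2))
```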
Bayesian Prediction: Sheet Metal Pockets Example

- 6 code inputs; 60 training runs; 174 test runs
- Prior distributions: $\beta = 0$; $\lambda \sim \mathrm{Gamma}(5, 5)$; $\rho \sim \mathrm{Beta}(5, 1)$

[Figure: left panel, observed failure depth vs. Bayesian prediction of $y(x)$ for the 174 test-run failure depths (RMSPE = 17.3 and 17.9, r = 0.98); right panel, estimated Bayesian prediction standard error vs. REML-EBLUP prediction standard error.]

- Bayesian prediction standard errors tend to be larger than REML-EBLUP standard errors.
- Frequentist coverage of nominal 95% prediction intervals: REML-EBLUP 61%, Bayes 71%.
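For completeness, a hedged sketch of how the RMSPE, observed-vs-predicted correlation, and interval coverage on a held-out test set might be computed. Here `y_te` denotes the observed test outputs (a hypothetical object), and the normal quantile is an approximation to the t mixture:

```r
# Accuracy and nominal 95% interval coverage on the test set; y_te is the
# vector of observed test outputs, mu_bar and s_te as in the sketches above.
rmspe    <- sqrt(mean((y_te - mu_bar)^2))     # root mean squared prediction error
r        <- cor(y_te, mu_bar)                 # observed-vs-predicted correlation
z        <- qnorm(0.975)                      # normal approx to the t mixture
coverage <- mean(y_te >= mu_bar - z * s_te &
                 y_te <= mu_bar + z * s_te)   # compare with the nominal 0.95
```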
Bayesian Prediction: Summary

- Bayesian prediction is based on the predictive distribution $\pi(y_{new} \mid y_{current})$; derive it analytically when possible.
- Otherwise, realizations are generated from the parameter posterior (via MCMC) and from conditional predictive samples.
- Many MCMC algorithms are implemented in software:
  - MCMCpack, mcmc, adaptMCMC, AMCMC in R: http://cran.r-project.org
  - OpenBUGS, WinBUGS: http://www.mrc-bsu.cam.ac.uk/software/bugs/
  - Delayed Rejection Adaptive Metropolis (DRAM): http://helios.fmi.fi/~lainema/dram/
- The R package coda provides a suite of MCMC diagnostic tools; an example follows.
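As an illustration of the packaged samplers, the hand-rolled Metropolis sketch above could be replaced by `metrop()` from the R package mcmc, with diagnostics from coda. The target `log_post_rho` and the data objects are the hypothetical ones defined earlier:

```r
library(mcmc)   # provides metrop(): random-walk Metropolis
library(coda)   # MCMC diagnostics

# metrop() takes a function returning the log unnormalized density.
out <- metrop(function(rho) log_post_rho(rho, X_tr, F_tr, y_tr),
              initial = rep(0.5, 6), nbatch = 5000, scale = 0.05)
out$accept                    # acceptance rate; tune 'scale' toward ~23%

chain <- mcmc(out$batch)      # coda object (discard burn-in before use)
effectiveSize(chain)          # effective sample size per parameter
geweke.diag(chain)            # Geweke convergence diagnostic
```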
References

- Andrieu, C. and Thoms, J. (2008). A tutorial on adaptive MCMC. Statistics and Computing, 18, 343-373.
- Casella, G. and George, E. (1992). Explaining the Gibbs sampler. The American Statistician, 46, 167-174.
- Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49, 327-335.
- Currin, C., Mitchell, T.J., Morris, M.D., and Ylvisaker, D. (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. Journal of the American Statistical Association, 86, 953-963.
- Haario, H., Laine, M., and Mira, A. (2006). DRAM: Efficient adaptive MCMC. Statistics and Computing, 16, 339-354.
- O'Hagan, A. (1994). Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference. London: Edward Arnold.
- Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods. New York: Springer.
- Santner, T.J., Williams, B.J., and Notz, W.I. (2003). The Design and Analysis of Computer Experiments. New York: Springer.