Optimal Designs for Gaussian Process Models via Spectral Decomposition. Ofir Harari
1 Optimal Designs for Gaussian Process Models via Spectral Decomposition. Ofir Harari, Department of Statistics & Actuarial Science, Simon Fraser University. September 2014. Dynamic Computer Experiments.
2 Outline. 1 Introduction: Gaussian Process Regression; Karhunen-Loève Decomposition of a GP. 2 IMSPE-optimal Designs: Application of the K-L Decomposition; Approximate Minimum IMSPE Designs; Adaptive Designs; Extending to Models with Unknown Regression Parameters. 3 Other Optimality Criteria. 4 Concluding Remarks.
4 Introduction. Gaussian Process Regression. The Gaussian Process Model. Suppose that we observe y = [y(x_1), ..., y(x_n)]^T, where y(x) = f(x)^T β + Z(x). Here f(x) is a vector of regression functions evaluated at x, and Z(x) is a zero-mean stationary Gaussian process with marginal variance σ² and correlation function R(·; θ). This is an extremely popular modeling approach in spatial statistics, geostatistics, computer experiments and more. Sometimes measurement error (i.e. a nugget term) is added.
5 The Gaussian Process Model (cont.) Notation: R_ij = R(x_i − x_j), the correlation matrix at the design D = {x_1, ..., x_n}; y = [y(x_1), ..., y(x_n)]^T, the vector of observations at D; r(x) = [R(x − x_1), ..., R(x − x_n)]^T, the vector of correlations at site x; F_ij = f_j(x_i), the design matrix at D.
6 The Gaussian Process Model (cont.) Universal Kriging. Assuming π(β) ∝ 1,

ŷ(x) = E{y(x) | y} = f^T(x) β̂ + r^T(x) R^{-1} (y − F β̂),   (1)

is the minimizer of the mean squared prediction error (MSPE), which is then

E[{ŷ(x) − y(x)}² | y] = var{y(x) | y} = σ² (1 − [f^T(x), r^T(x)] [0, F^T; F, R]^{-1} [f(x); r(x)]).

If the assumption of Gaussianity is dropped, (1) is still the BLUP for y(x).
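The bordered-matrix form of the MSPE lends itself to a direct numerical check. Below is a minimal sketch (not from the slides) of universal kriging with a constant trend and an assumed anisotropic Gaussian correlation; the design, response and parameter values are all illustrative. At a design point the BLUP interpolates the data and the MSPE vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)

def corr(a, b, theta=(1.0, 2.0)):
    # anisotropic Gaussian correlation R(a - b) = exp{-sum_l theta_l (a_l - b_l)^2}
    d = np.asarray(a) - np.asarray(b)
    return float(np.exp(-np.dot(theta, d ** 2)))

def kriging_predict(X, y, x):
    """Universal kriging with constant trend f(x) = 1: returns (BLUP, MSPE/sigma^2)."""
    n = len(X)
    R = np.array([[corr(X[i], X[j]) for j in range(n)] for i in range(n)])
    r = np.array([corr(x, X[i]) for i in range(n)])
    F, f = np.ones((n, 1)), np.ones(1)
    # MSPE/sigma^2 = 1 - [f^T, r^T] [[0, F^T], [F, R]]^-1 [f; r]
    B = np.block([[np.zeros((1, 1)), F.T], [F, R]])
    v = np.concatenate([f, r])
    mspe = 1.0 - v @ np.linalg.solve(B, v)
    # GLS estimate of the trend, then the BLUP of equation (1)
    Rinv_F, Rinv_y = np.linalg.solve(R, F), np.linalg.solve(R, y)
    beta = np.linalg.solve(F.T @ Rinv_F, F.T @ Rinv_y)
    yhat = f @ beta + r @ np.linalg.solve(R, y - F @ beta)
    return float(yhat), float(mspe)

X = rng.uniform(size=(8, 2))                # an arbitrary 8-point design
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2      # a toy response
yhat0, mspe0 = kriging_predict(X, y, X[0])  # at a design point: interpolation
yhat1, mspe1 = kriging_predict(X, y, np.array([0.5, 0.5]))
```

At `X[0]` the predictor reproduces `y[0]` exactly and the scaled MSPE is zero, while at an untried site the MSPE is strictly positive.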
7 Karhunen-Loève Decomposition of a GP. Spectral Decomposition of a GP. Mercer's Theorem. If X is compact and R (the covariance function of a GP) is continuous on X², then

R(x, y) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(y),

where the φ_i's and the λ_i's are the solutions of the homogeneous Fredholm integral equation of the second kind,

∫_X R(x, y) φ_i(y) dy = λ_i φ_i(x),   with   ∫_X φ_i(x) φ_j(x) dx = δ_ij.
8 Spectral Decomposition of a GP (cont.) The Karhunen-Loève Decomposition. Under the aforementioned conditions,

Z(x) = Σ_{i=1}^∞ α_i φ_i(x)   (in the MSE sense),

where the α_i's are independent N(0, λ_i) random variables. The best approximation (in the MSE sense) of Z(x) is the truncated series in which the eigenvalues are arranged in decreasing order. For piecewise-analytic R(x, y), λ_k ≤ c_1 exp(−c_2 k^{1/d}) for some constants c_1 and c_2 (Frauenfelder et al. 2005). Numerical solutions involve Galerkin-type methods.
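The Fredholm eigenproblem can be approximated numerically; the slides mention Galerkin-type methods, but a simple Nyström/quadrature scheme keeps the sketch short. The snippet below, with an assumed Gaussian kernel on [0, 1] and illustrative grid-size and truncation choices, computes eigenpairs and draws one truncated K-L realization.

```python
import numpy as np

# Nystrom approximation of the Fredholm eigenproblem on X = [0, 1], followed by
# a truncated K-L draw; m, theta and M are illustrative choices
m, theta, M = 200, 10.0, 20
x = (np.arange(m) + 0.5) / m                # midpoint quadrature nodes
w = np.full(m, 1.0 / m)                     # quadrature weights
K = np.exp(-theta * (x[:, None] - x[None, :]) ** 2)   # R(x, y) on the grid

# symmetrized eigenproblem: sqrt(W) K sqrt(W) u = lambda u
sw = np.sqrt(w)
lam, U = np.linalg.eigh(sw[:, None] * K * sw[None, :])
lam, U = lam[::-1], U[:, ::-1]              # eigenvalues in decreasing order
Phi = U / sw[:, None]                       # phi_k values at the nodes

# independent alpha_k ~ N(0, lambda_k) give a truncated realization of Z
rng = np.random.default_rng(1)
alpha = rng.normal(0.0, np.sqrt(np.clip(lam[:M], 0.0, None)))
z = Phi[:, :M] @ alpha
energy = lam[:M].sum() / lam.sum()          # fraction of process variance retained
```

Because R(x, x) = 1, the eigenvalues sum to vol(X) = 1, and for this kernel the first 20 terms retain well over 99% of the process energy, illustrating the rapid eigenvalue decay stated above.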
9 A Worked Example. R(x, w) = exp{−(x_1 − w_1)² − 2(x_2 − w_2)²}. [Figure: eigenvalues vs. index, log scale.]
10 A Worked Example (cont.) [Figure: eigenfunctions #1 through #6 with their eigenvalues λ.]
11 A Worked Example (cont.) A single realization based on the first 17 K-L terms (≈ 99.5% of the process energy). [Figure: surface of the realization over (x_1, x_2).]
13 IMSPE-optimal Designs. Minimum IMSPE Designs. Integrated Mean Squared Prediction Error. Sacks et al. (1989b) suggested that a suitable design for computer experiments should minimize

J(D, ŷ) = (1/σ²) ∫_X E[{ŷ(x) − y(x)}² | y] dx = vol(X) − tr{ [0, F^T; F, R]^{-1} ∫_X [f(x) f^T(x), f(x) r^T(x); r(x) f^T(x), r(x) r^T(x)] dx },

where vol(X) is the volume of X.
14 Application of the K-L Decomposition. Consider the standard Gaussian process y(x) = f(x)^T β + Z(x), where β is (for now) a known vector. The kriging predictor is now ŷ(x) = f(x)^T β + r^T(x) R^{-1} (y − F β). Proposition. For any compact X and continuous covariance kernel R,

J(D, ŷ) = vol(X) − Σ_{k=1}^∞ λ_k² φ_k^T R^{-1} φ_k,   (2)

where φ_p = [φ_p(x_1), ..., φ_p(x_n)]^T.
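The proposition can be checked numerically. The sketch below, with a Nyström approximation of the eigenpairs on an assumed one-dimensional X = [0, 1] and an arbitrary 4-point design, compares direct quadrature of the simple-kriging IMSPE against the eigenvalue series; all numerical choices are illustrative.

```python
import numpy as np

# Numerical check of Proposition (2) on X = [0, 1] (illustrative sketch)
m, theta = 500, 10.0
xg = (np.arange(m) + 0.5) / m
Kg = np.exp(-theta * (xg[:, None] - xg[None, :]) ** 2)
mu, U = np.linalg.eigh(Kg / m)
U = U[:, ::-1]                                  # eigenvectors, decreasing eigenvalues

D = np.array([0.1, 0.35, 0.6, 0.85])            # an arbitrary 4-point design
R = np.exp(-theta * (D[:, None] - D[None, :]) ** 2)
r = np.exp(-theta * (xg[:, None] - D[None, :]) ** 2)    # r(x) on the grid

# left side: vol(X) - int r(x)^T R^-1 r(x) dx by quadrature
lhs = np.mean(1.0 - np.einsum('ij,ij->i', r, np.linalg.solve(R, r.T).T))

# right side: vol(X) - sum_k lam_k^2 phi_k^T R^-1 phi_k, truncated at M = 30;
# column k of V holds lam_k * phi_k(D), avoiding division by tiny eigenvalues
M = 30
V = (r.T / m) @ (np.sqrt(m) * U[:, :M])
rhs = 1.0 - np.trace(V.T @ np.linalg.solve(R, V))
```

The two evaluations agree to numerical precision once M is large enough that the discarded eigenvalues are negligible, which is exactly what motivates the truncated criterion on the next slides.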
15 Application of the K-L Decomposition (cont.) Theorem.
1. For any design D = {x_1, ..., x_n}, J(D, ŷ) ≥ Σ_{k=n+1}^∞ λ_k.   (3)
2. The lower bound (3) is achieved if D satisfies λ_k φ_k^T R^{-1} φ_k = 1 for 1 ≤ k ≤ n, and λ_k φ_k^T R^{-1} φ_k = 0 for k > n.
3. If such an ideal D exists, the prediction ŷ(x) at any x ∈ X is identical to the prediction based on the finite-dimensional Bayesian linear regression model y(x) = α_1 φ_1(x) + ... + α_n φ_n(x), where the α_j ~ N(0, λ_j) are independent.
16 Application of the K-L Decomposition (cont.) By definition, IMSPE-optimal designs are prediction-oriented. What the Theorem tells us is that, in some sense, they are also useful for variable selection (w.r.t. the K-L basis functions). In reality, such ideal designs do not exist - but IMSPE-optimal designs get pretty close. [Figure: λ_k φ_k^T R^{-1} φ_k vs. k, close to 1 for k ≤ n and dropping toward 0 for k > n.]
17 Approximate Minimum IMSPE Designs. Approximate IMSPE:

J_M(D, ŷ) = vol(X) − Σ_{k=1}^M λ_k² φ_k^T R^{-1} φ_k = vol(X) − tr{Λ² Φ^T R^{-1} Φ},

where Λ = diag(λ_1, ..., λ_M) and Φ_ij = φ_j(x_i), 1 ≤ i ≤ n, 1 ≤ j ≤ M. Approximate minimum IMSPE designs:

D* = argmin_{D ⊂ X} J_M(D, ŷ) = argmax_{D ⊂ X} tr{Λ² Φ^T R^{-1} Φ}.
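Evaluating the truncated criterion for a candidate design only needs the first M eigenpairs. A sketch, again on an assumed X = [0, 1] with Nyström eigenpairs (the two candidate designs and all parameters are illustrative): a space-filling design should capture more process energy, i.e. achieve a smaller J_M, than a clustered one.

```python
import numpy as np

# J_M(D) = vol(X) - tr{Lambda^2 Phi^T R^-1 Phi} on X = [0, 1]; column k of V
# below holds lam_k * phi_k(D), so tr{V^T R^-1 V} equals the trace criterion
# while avoiding division by tiny eigenvalues (sketch, illustrative settings)
m, theta, M = 400, 10.0, 15
xg = (np.arange(m) + 0.5) / m
Kg = np.exp(-theta * (xg[:, None] - xg[None, :]) ** 2)
_, U = np.linalg.eigh(Kg / m)
U = np.sqrt(m) * U[:, ::-1][:, :M]              # phi_k on the grid, k = 1..M

def J_M(D):
    D = np.asarray(D, dtype=float)
    R = np.exp(-theta * (D[:, None] - D[None, :]) ** 2)
    Kd = np.exp(-theta * (D[:, None] - xg[None, :]) ** 2)
    V = (Kd / m) @ U                            # Nystrom: V[:, k] = lam_k phi_k(D)
    return 1.0 - np.trace(V.T @ np.linalg.solve(R, V))

spread = [0.1, 0.3, 0.5, 0.7, 0.9]              # space-filling candidate
packed = [0.40, 0.45, 0.50, 0.55, 0.60]         # clustered candidate
j_spread, j_packed = J_M(spread), J_M(packed)
```

In an actual design search, `J_M` would be handed to a numerical optimizer over the n design points rather than compared over a fixed pair of candidates.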
18 A Worked Example (cont.) [Figure: size 7, 15 and 25 approximate minimum IMSPE designs for R(x, w) = exp{−(x_1 − w_1)² − 2(x_2 − w_2)²}.]
19 Controlling the Relative Truncation Error. Proposition. Under some conditions (which are easily met),

r_M = (Σ_{k=M+1}^∞ λ_k² φ_k^T R^{-1} φ_k) / (Σ_{j=1}^∞ λ_j² φ_j^T R^{-1} φ_j) ≤ (Σ_{k=M+1}^∞ λ_k) / (Σ_{j=1}^∞ λ_j).

By conserving enough of the process energy, we are guaranteed to have a design as close to optimal as we wish.
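Since the right-hand side depends on the eigenvalues alone, the truncation level M can be fixed before any design is computed. A short sketch (grid size, θ and the tolerance ε are illustrative choices) picking the smallest M whose eigenvalue bound falls below a target relative error:

```python
import numpy as np

# Choosing the truncation level M from the eigenvalue bound on r_M (sketch):
# the bound is the fraction of process energy left out of the first M terms
m, theta = 400, 10.0
xg = (np.arange(m) + 0.5) / m
Kg = np.exp(-theta * (xg[:, None] - xg[None, :]) ** 2)
lam = np.sort(np.linalg.eigvalsh(Kg / m))[::-1]

eps = 0.005                                     # tolerated relative error
bound = 1.0 - np.cumsum(lam) / lam.sum()        # bound on r_M for M = 1, 2, ...
M_star = int(np.argmax(bound <= eps)) + 1       # smallest M with bound <= eps
```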
20 Controlling the Relative Truncation Error (cont.) [Figure: relative truncation error vs. sample size for M = 17 (≈ 0.492% energy loss), ideal vs. true.]
21 Adaptive Designs. Suppose now that we have the opportunity (or are forced) to run a multi-stage experiment: n_1 runs at stage 1, n_2 runs at stage 2, and so on. We can use this to learn/re-estimate the essential parameters and plug in the new estimates, hopefully improving the design from one stage to the next. Such designs are called adaptive or batch-sequential designs.
22 Adaptive Designs (cont.) Denote

R_aug = [R_new, R_cross^T; R_cross, R_old],

where R_new (n_2 × n_2) is the matrix of correlations among the new inputs, R_old (n_1 × n_1) the matrix of correlations among the original design, R_cross (n_1 × n_2) the cross-correlation matrix between the old and new inputs, and R_aug the augmented correlation matrix; and

Φ_aug = [Φ_new (n_2 × M); Φ_old (n_1 × M)].

Parameters may be re-estimated in between batches.
23 Adaptive Designs (cont.) Using basic block-matrix properties, we look to maximize

tr{Λ² Φ_aug^T R_aug^{-1} Φ_aug} = tr[Λ² {(Φ_new^T − Φ_old^T R_old^{-1} R_cross) Q_1^{-1} Φ_new + (Φ_old^T − Φ_new^T R_new^{-1} R_cross^T) Q_2^{-1} Φ_old}],

where Q_1 = R_new − R_cross^T R_old^{-1} R_cross and Q_2 = R_old − R_cross R_new^{-1} R_cross^T, and R_old and Φ_old remain unchanged.
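The block form can be verified numerically against direct evaluation on the augmented matrices. In the sketch below the basis values Phi and the diagonal weights standing in for Λ² are arbitrary stand-ins (not the true K-L eigenpairs), and the old/new input locations are illustrative; the point is only the algebraic identity via the two Schur complements.

```python
import numpy as np

# Numerical check of the block-matrix identity above (illustrative stand-ins)
theta, M = 30.0, 6

def corrmat(A, B):
    # Gaussian correlations between two 1-D point sets
    return np.exp(-theta * (A[:, None] - B[None, :]) ** 2)

x_old = np.array([0.0, 0.15, 0.3, 0.45, 0.6, 0.75, 0.9])   # n1 = 7 original runs
x_new = np.array([0.08, 0.38, 0.68, 0.97])                  # n2 = 4 new runs
R_old, R_new = corrmat(x_old, x_old), corrmat(x_new, x_new)
R_cross = corrmat(x_old, x_new)                             # n1 x n2

Phi_old = np.cos(np.outer(x_old, np.arange(1, M + 1)))      # stand-in basis values
Phi_new = np.cos(np.outer(x_new, np.arange(1, M + 1)))
L2 = np.diag(1.0 / np.arange(1, M + 1) ** 2)                # plays the role of Lambda^2

# direct evaluation on the augmented matrices
R_aug = np.block([[R_new, R_cross.T], [R_cross, R_old]])
Phi_aug = np.vstack([Phi_new, Phi_old])
direct = np.trace(L2 @ Phi_aug.T @ np.linalg.solve(R_aug, Phi_aug))

# block form via the Schur complements Q1 and Q2
Q1 = R_new - R_cross.T @ np.linalg.solve(R_old, R_cross)
Q2 = R_old - R_cross @ np.linalg.solve(R_new, R_cross.T)
T1 = (Phi_new.T - Phi_old.T @ np.linalg.solve(R_old, R_cross)) @ np.linalg.solve(Q1, Phi_new)
T2 = (Phi_old.T - Phi_new.T @ np.linalg.solve(R_new, R_cross.T)) @ np.linalg.solve(Q2, Phi_old)
blockwise = np.trace(L2 @ (T1 + T2))
```

Only the `Q1`-solve and the term in `Phi_new` change as the new batch moves, which is what makes the batch-sequential optimization cheap.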
24 Adaptive Designs (cont.) Adding 7 new runs to an existing 8-run design (λ = 0.1). [Figure: size 8+7 design with its IMSPE; 1st batch vs. 2nd batch.]
26 Extending to Models with Unknown Regression Parameters. Approximate IMSPE Criterion for the Universal Kriging Model. Expanding the covariance kernel using Mercer's theorem and truncating after the first M terms yields

H_M(D, ŷ) = J_M(D, ŷ) + tr{ (F^T R^{-1} F)^{-1} [ ∫_X f(x) f^T(x) dx − 2 A^T ∫_X φ(x) f^T(x) dx + A^T A ] },   (4)

where J_M is the criterion previously derived for the simple kriging model, A = Λ Φ^T R^{-1} F and φ(x) = [φ_1(x), ..., φ_M(x)]^T.
27 Approximate IMSPE Criterion for the Universal Kriging Model (cont.) Expression (4) simplifies somewhat by substituting f(x) = 1 and F = 1_n, to

H_M(D, ŷ) = J_M(D, ŷ) + (1_n^T R^{-1} 1_n)^{-1} { vol(X) − 2 a^T γ + a^T a },

where a = Λ Φ^T R^{-1} 1_n and γ = [∫_X φ_1(x) dx, ..., ∫_X φ_M(x) dx]^T. The vector of integrals γ only needs to be evaluated once. The resulting designs are almost identical to those obtained for the simple kriging model, for reasonably large n.
28 With and Without Estimating the Intercept. [Figure: designs obtained with and without estimation of the intercept.]
29 With and Without Estimating the Intercept (cont.) [Figure: deviation from the optimum vs. sample size n.]
31 Other Optimality Criteria. Bayesian A-optimal Designs. In classical DOE, a design is called A-optimal if it minimizes tr{var(θ̂)}, where θ̂ is the vector of estimators of the model parameters. For our model

y(x) = f(x)^T β + Z(x) = f(x)^T β + Σ_{k=1}^∞ α_k φ_k(x),   α_k ~ N(0, λ_k),

we may consider the Bayesian A-optimality criterion Q(D, ŷ) = Σ_{k=1}^∞ var{α_k | y}. When β is a known vector, the IMSPE and the A-optimality criteria coincide.
32 Bayesian A-optimal Designs (cont.) Proposition. If π(β) ∝ 1,

Q(D, ŷ) = vol(X) − tr{ [0, F^T; F, R]^{-1} [0, 0; 0, ∫_X r(x) r^T(x) dx] }

(and hence we don't even need to derive the Mercer expansion). [Figure: IMSPE-optimal vs. A-optimal designs.]
33 Bayesian D-optimal Designs. In classical DOE, a design is called D-optimal if it minimizes the determinant of var(θ̂) (typically the inverse Fisher information). In our framework, an approximate Bayesian D-optimal design is

D* = argmin_D det(var{α_1, ..., α_M | y}),

where

var{α_1, ..., α_M | y} = Λ − Λ Φ^T [R^{-1} − R^{-1} F (F^T R^{-1} F)^{-1} F^T R^{-1}] Φ Λ.

[Figure: D-optimal, IMSPE-optimal and A-optimal designs.]
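Both Bayesian criteria can be read off the same posterior covariance matrix of the truncated K-L coefficients: its trace gives the (approximate) A-criterion and its determinant the D-criterion. A sketch, with Nyström eigenpairs on an assumed X = [0, 1], a flat prior on an intercept-only trend, and an illustrative 5-point design:

```python
import numpy as np

# Posterior covariance of the first M K-L coefficients under a flat prior on
# the intercept; grid size, theta, M and the design are illustrative choices
m, theta, M = 300, 10.0, 10
xg = (np.arange(m) + 0.5) / m
Kg = np.exp(-theta * (xg[:, None] - xg[None, :]) ** 2)
lam, U = np.linalg.eigh(Kg / m)
lam, U = lam[::-1][:M], np.sqrt(m) * U[:, ::-1][:, :M]

D = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
Kd = np.exp(-theta * (D[:, None] - xg[None, :]) ** 2)
Phi = (Kd / m) @ U / lam                        # Nystrom extension of phi_k to D
R = np.exp(-theta * (D[:, None] - D[None, :]) ** 2)
F = np.ones((len(D), 1))

# var{alpha | y} = Lambda - Lambda Phi^T [R^-1 - R^-1 F (F^T R^-1 F)^-1 F^T R^-1] Phi Lambda
Rinv_Phi, Rinv_F = np.linalg.solve(R, Phi), np.linalg.solve(R, F)
G_Phi = Rinv_Phi - Rinv_F @ np.linalg.solve(F.T @ Rinv_F, F.T @ Rinv_Phi)
Lam = np.diag(lam)
post = Lam - Lam @ Phi.T @ G_Phi @ Lam          # var{alpha_1, ..., alpha_M | y}

a_crit = np.trace(post)                         # Bayesian A-optimality value
d_crit = np.linalg.det(post)                    # Bayesian D-optimality value
```

Observing the design strictly reduces the total prior uncertainty, so the trace of `post` falls below the prior trace Σλ_k, while the matrix itself stays positive definite.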
35 Concluding Remarks. A few comments before we go. If Z(x) is a zero-mean, non-Gaussian, weakly stationary process with the same correlation structure, the kriging predictor is no longer the posterior mean, but is still the BLUP; the MSPE is no longer the posterior variance, but all the results still hold. Inclusion of measurement error (i.e. a nugget term),

y(x) = f(x)^T β + Z(x) + ε(x),   ε ~ N(0, τ² I),   λ = τ²/σ²,

simply results in replacing R with R + λI everywhere, although the theory will cease to apply. Great computational benefits for small values of λ.
36 Thank you!
37 References I
1. Andrianakis, Ioannis, & Challenor, Peter G. The effect of the nugget on Gaussian process emulators of computer models. Computational Statistics and Data Analysis, 56(12).
2. Atkinson, A. C., Donev, A. N., & Tobias, R. D. Optimum Experimental Designs, with SAS. Oxford Statistical Science Series.
3. Bursztyn, D., & Steinberg, D. M. Data analytic tools for understanding random field regression models. Technometrics, 46(4).
4. Cressie, Noel A. C. Statistics for Spatial Data. Wiley-Interscience.
5. Frauenfelder, Philipp, Schwab, Christoph, & Todor, Radu Alexandru. 2005. Finite Elements for Elliptic Problems with Stochastic Coefficients. Computer Methods in Applied Mechanics and Engineering, 194(2).
6. Guenther, Ronald B., & Lee, John W. Partial Differential Equations of Mathematical Physics and Integral Equations. Courier Dover Publications.
7. Jones, Brad, & Johnson, Rachel T. The Design and Analysis of the Gaussian Process Model. Quality and Reliability Engineering International, 25(5).
8. Journel, A., & Rossi, M. When Do We Need a Trend Model in Kriging? Mathematical Geology, 21(7).
38 References II
9. Linkletter, Crystal, Bingham, Derek, Hengartner, Nicholas, Higdon, David, & Ye, Kenny Q. Variable Selection for Gaussian Process Models in Computer Experiments. Technometrics, 48(4).
10. Loeppky, Jason L., Sacks, Jerome, & Welch, William J. Choosing the Sample Size of a Computer Experiment: A Practical Guide. Technometrics, 51(4).
11. Loeppky, Jason L., Moore, Leslie M., & Williams, Brian J. Batch Sequential Designs for Computer Experiments. Journal of Statistical Planning and Inference, 140(6).
12. Picheny, Victor, Ginsbourger, David, Roustant, Olivier, Haftka, Raphael T., & Kim, Nam H. Adaptive Designs of Experiments for Accurate Approximation of a Target Region. Journal of Mechanical Design, 132(7).
13. Rasmussen, Carl Edward, & Williams, Christopher K. I. Gaussian Processes for Machine Learning. The MIT Press.
14. Sacks, Jerome, Welch, William J., Mitchell, Toby J., & Wynn, Henry P. 1989a. Design and Analysis of Computer Experiments. Statistical Science, 4(4).
15. Sacks, Jerome, Schiller, Susannah B., & Welch, William J. 1989b. Designs for Computer Experiments. Technometrics, 31(1).
39 References III
16. Santner, Thomas J., Williams, Brian J., & Notz, William I. The Design and Analysis of Computer Experiments. Springer.
17. Shewry, Michael C., & Wynn, Henry P. Maximum Entropy Sampling. Journal of Applied Statistics, 14(2).
18. Singhee, Amith, Singhal, Sonia, & Rutenbar, Rob A. Exploiting Correlation Kernels for Efficient Handling of Intra-Die Spatial Correlation, with Application to Statistical Timing. In: Proceedings of the Conference on Design, Automation and Test in Europe.
19. Wahba, Grace. Spline Models for Observational Data. Society for Industrial and Applied Mathematics.
More informationFollow-Up Experimental Designs for Computer Models and Physical Processes
This article was downloaded by: [Acadia University] On: 13 June 2012, At: 07:19 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer
More informationNew Developments in LS-OPT - Robustness Studies
5. LS-DYNA - Anwenderforum, Ulm 2006 Robustheit / Optimierung I New Developments in LS-OPT - Robustness Studies Ken Craig, Willem Roux 2, Nielen Stander 2 Multidisciplinary Design Optimization Group University
More informationCross Validation and Maximum Likelihood estimations of hyper-parameters of Gaussian processes with model misspecification
Cross Validation and Maximum Likelihood estimations of hyper-parameters of Gaussian processes with model misspecification François Bachoc Josselin Garnier Jean-Marc Martinez CEA-Saclay, DEN, DM2S, STMF,
More informationA Magiv CV Theory for Large-Margin Classifiers
A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector
More informationMustafa H. Tongarlak Bruce E. Ankenman Barry L. Nelson
Proceedings of the 0 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. RELATIVE ERROR STOCHASTIC KRIGING Mustafa H. Tongarlak Bruce E. Ankenman Barry L.
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes 1 Objectives to express prior knowledge/beliefs about model outputs using Gaussian process (GP) to sample functions from the probability measure defined by GP to build
More informationLinear Models for Regression
Linear Models for Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationChapter 4 - Fundamentals of spatial processes Lecture notes
TK4150 - Intro 1 Chapter 4 - Fundamentals of spatial processes Lecture notes Odd Kolbjørnsen and Geir Storvik January 30, 2017 STK4150 - Intro 2 Spatial processes Typically correlation between nearby sites
More informationCS 195-5: Machine Learning Problem Set 1
CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationKarhunen-Loève decomposition of Gaussian measures on Banach spaces
Karhunen-Loève decomposition of Gaussian measures on Banach spaces Jean-Charles Croix GT APSSE - April 2017, the 13th joint work with Xavier Bay. 1 / 29 Sommaire 1 Preliminaries on Gaussian processes 2
More informationDesign of Screening Experiments with Partial Replication
Design of Screening Experiments with Partial Replication David J. Edwards Department of Statistical Sciences & Operations Research Virginia Commonwealth University Robert D. Leonard Department of Information
More informationThe Expectation Maximization or EM algorithm
The Expectation Maximization or EM algorithm Carl Edward Rasmussen November 15th, 2017 Carl Edward Rasmussen The EM algorithm November 15th, 2017 1 / 11 Contents notation, objective the lower bound functional,
More informationGaussian Process Regression: Active Data Selection and Test Point. Rejection. Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer
Gaussian Process Regression: Active Data Selection and Test Point Rejection Sambu Seo Marko Wallat Thore Graepel Klaus Obermayer Department of Computer Science, Technical University of Berlin Franklinstr.8,
More informationVariational Model Selection for Sparse Gaussian Process Regression
Variational Model Selection for Sparse Gaussian Process Regression Michalis K. Titsias School of Computer Science University of Manchester 7 September 2008 Outline Gaussian process regression and sparse
More informationInversion Base Height. Daggot Pressure Gradient Visibility (miles)
Stanford University June 2, 1998 Bayesian Backtting: 1 Bayesian Backtting Trevor Hastie Stanford University Rob Tibshirani University of Toronto Email: trevor@stat.stanford.edu Ftp: stat.stanford.edu:
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More informationNonparmeteric Bayes & Gaussian Processes. Baback Moghaddam Machine Learning Group
Nonparmeteric Bayes & Gaussian Processes Baback Moghaddam baback@jpl.nasa.gov Machine Learning Group Outline Bayesian Inference Hierarchical Models Model Selection Parametric vs. Nonparametric Gaussian
More informationMonte Carlo Methods for Uncertainty Quantification
Monte Carlo Methods for Uncertainty Quantification Mike Giles Mathematical Institute, University of Oxford Contemporary Numerical Techniques Mike Giles (Oxford) Monte Carlo methods 1 / 23 Lecture outline
More informationDesign of Computer Experiments for Optimization, Estimation of Function Contours, and Related Objectives
7 Design of Computer Experiments for Optimization, Estimation of Function Contours, and Related Objectives Derek Bingham Simon Fraser University, Burnaby, BC Pritam Ranjan Acadia University, Wolfville,
More informationModelling geoadditive survival data
Modelling geoadditive survival data Thomas Kneib & Ludwig Fahrmeir Department of Statistics, Ludwig-Maximilians-University Munich 1. Leukemia survival data 2. Structured hazard regression 3. Mixed model
More informationGaussian Processes. 1 What problems can be solved by Gaussian Processes?
Statistical Techniques in Robotics (16-831, F1) Lecture#19 (Wednesday November 16) Gaussian Processes Lecturer: Drew Bagnell Scribe:Yamuna Krishnamurthy 1 1 What problems can be solved by Gaussian Processes?
More informationStochastic Spectral Approaches to Bayesian Inference
Stochastic Spectral Approaches to Bayesian Inference Prof. Nathan L. Gibson Department of Mathematics Applied Mathematics and Computation Seminar March 4, 2011 Prof. Gibson (OSU) Spectral Approaches to
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Neil D. Lawrence GPSS 10th June 2013 Book Rasmussen and Williams (2006) Outline The Gaussian Density Covariance from Basis Functions Basis Function Representations Constructing
More informationLinear Regression (9/11/13)
STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter
More informationGaussian Process Regression and Emulation
Gaussian Process Regression and Emulation STAT8810, Fall 2017 M.T. Pratola September 22, 2017 Today Experimental Design; Sensitivity Analysis Designing Your Experiment If you will run a simulator model,
More informationExam: high-dimensional data analysis January 20, 2014
Exam: high-dimensional data analysis January 20, 204 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question not the subquestions on a separate piece of paper. - Finish
More informationSEQUENTIAL DESIGN OF COMPUTER EXPERIMENTS TO MINIMIZE INTEGRATED RESPONSE FUNCTIONS
Statistica Sinica 10(2000), 1133-1152 SEQUENTIAL DESIGN OF COMPUTER EXPERIMENTS TO MINIMIZE INTEGRATED RESPONSE FUNCTIONS Brian J. Williams, Thomas J. Santner and William I. Notz The Ohio State University
More informationKriging models with Gaussian processes - covariance function estimation and impact of spatial sampling
Kriging models with Gaussian processes - covariance function estimation and impact of spatial sampling François Bachoc former PhD advisor: Josselin Garnier former CEA advisor: Jean-Marc Martinez Department
More informationBayesian Dynamic Linear Modelling for. Complex Computer Models
Bayesian Dynamic Linear Modelling for Complex Computer Models Fei Liu, Liang Zhang, Mike West Abstract Computer models may have functional outputs. With no loss of generality, we assume that a single computer
More informationBayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes
Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Andrew O. Finley 1 and Sudipto Banerjee 2 1 Department of Forestry & Department of Geography, Michigan
More informationSEQUENTIAL ADAPTIVE DESIGNS IN COMPUTER EXPERIMENTS FOR RESPONSE SURFACE MODEL FIT
SEQUENTIAL ADAPTIVE DESIGNS IN COMPUTER EXPERIMENTS FOR RESPONSE SURFACE MODEL FIT DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate
More informationPhysics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester
Physics 403 Parameter Estimation, Correlations, and Error Bars Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Best Estimates and Reliability
More informationProbabilistic and Bayesian Machine Learning
Probabilistic and Bayesian Machine Learning Day 4: Expectation and Belief Propagation Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London http://www.gatsby.ucl.ac.uk/
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationMultivariate Bayesian Linear Regression MLAI Lecture 11
Multivariate Bayesian Linear Regression MLAI Lecture 11 Neil D. Lawrence Department of Computer Science Sheffield University 21st October 2012 Outline Univariate Bayesian Linear Regression Multivariate
More informationTutorial on Gaussian Processes and the Gaussian Process Latent Variable Model
Tutorial on Gaussian Processes and the Gaussian Process Latent Variable Model (& discussion on the GPLVM tech. report by Prof. N. Lawrence, 06) Andreas Damianou Department of Neuro- and Computer Science,
More information