Bayesian optimal design for Gaussian process models


1 Bayesian optimal design for Gaussian process models. Maria Adamou and Dave Woods, Southampton Statistical Sciences Research Institute, University of Southampton

2 Outline. Background and motivation: computer experiments; spatial data collection; physical experiments. Bayesian optimal design for GP regression: decision-theoretic design for prediction and estimation; approximating the objective function; numerical examples.

3 Background. Experiments are used to investigate the impact of an intervention ("treatment") on a process or system through its application to a number of objects ("units"). Design of experiments: the active selection of the treatments (settings of the controllable variables) to be applied, taking account of multiple sources of uncertainty. Usually, the aim of the experiment is to collect data to answer scientific questions through the estimation of a statistical model. The design can be selected to maximise the information gained for a given resource; information is typically measured with respect to uncertainty about the model and its parameters.

4 Motivation. We consider design of experiments from a Bayesian perspective. For example, prior information is used to choose the factors to study, their possible values, and plausible models relating factors to the response. This information may be subjective, empirical or mechanistic. A Bayesian approach allows uncertainty in these decisions to be incorporated into the selection of an appropriate design for the experiment. For some experiments, it is natural or even necessary to be Bayesian.

5 Computational modelling. Many physical and social processes can be approximated by computer codes which encapsulate mathematical descriptions, for example partial differential equations solved using e.g. finite element methods: inputs x → computer code g(x) → outputs y. A computer experiment is performed when g(x) is computationally expensive, and hence evaluation of g(x) for all x is infeasible. Choose a design x_1, ..., x_n, evaluate g(x_1), ..., g(x_n) and fit a statistical model (emulator) to these data. The most common emulator is the Gaussian process.

6 Ride optimisation. Jaguar Land Rover offers a range of customisations for each model. Aim: use a computer model to ensure satisfactory ride characteristics of 10 vehicle variants.

7 Spatial statistics: sensor placement. Spatial data are collected in a broad range of applications, e.g. environmental science, climatology and agriculture. [Figure: monitoring network map; legend: prediction, candidate, optimal.] Monitoring spatial sulphate deposition in the eastern USA: 122 candidate locations, 10 prediction points and 40 design points. A Gaussian process model is commonly employed. Where should monitoring stations be sited for best prediction? Cressie (1993)

8 Physical experiments. Choose treatment settings in the presence of e.g. spatial or temporal trends, for example field trials in agriculture or lab experiments in well-plates. Assume random effects for nuisance factors (fields, plates, ...), with the covariance of the observed response depending on the distance between units.

9 Gaussian process model

10 Gaussian process regression. Bayesian Gaussian process models: a nonparametric approach to modelling noisy data,

y(x) = g(x) + ε,  ε ~ N(0, σ²_ε)   (1)

Assume a Gaussian process prior g(x) ~ GP{µ(x), k_λ(x, x′)}, so that g(x), g(x′) are jointly Gaussian with

E{g(x)} = µ(x) = fᵀ(x)β   (trend)

cov{ (g(x), g(x′))ᵀ } = σ²_g [ 1, k_λ(x, x′); k_λ(x, x′), 1 ]   (covariance)

Rasmussen & Williams (2006)
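
As a concrete, hedged illustration of this model (not from the slides), a short Python sketch that assumes a squared-exponential correlation for k_λ and draws three functions from the GP prior; the kernel choice, seed and grid are illustrative.

```python
import numpy as np

def sq_exp_corr(x1, x2, lam=1.0):
    """Correlation k_lambda(x, x') = exp(-lam^2 (x - x')^2) for 1-d inputs."""
    return np.exp(-lam**2 * (x1[:, None] - x2[None, :])**2)

rng = np.random.default_rng(1)
x = np.linspace(-3.0, 3.0, 200)
beta0, sigma2_g = 0.0, 1.0                      # constant trend and GP variance
K = sigma2_g * sq_exp_corr(x, x)                # prior covariance sigma_g^2 k_lambda
# Three draws from the GP prior g(x) ~ GP(mu(x), sigma_g^2 k_lambda); the small
# jitter on the diagonal keeps the covariance numerically positive definite.
draws = rng.multivariate_normal(beta0 + np.zeros_like(x),
                                K + 1e-10 * np.eye(len(x)), size=3)
```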

11 Inference. Updated uncertainty, incorporating the data, is given by the posterior distribution g | y. Assuming the Gaussian process prior, it is straightforward to derive the posterior conditional distribution

g(x) | y, β, λ, σ²_g, σ²_ε ~ N[ m(x), σ²_g s² ]

where

m(x) = fᵀ(x)β + kᵀ L_n⁻¹ (y − Fβ),   k = [k_λ(x, x_i)], i = 1, ..., n,

L_n = [k_λ(x_i, x_j) + τ² δ_ij], i, j = 1, ..., n,   s² = 1 − kᵀ L_n⁻¹ k,

with τ² = σ²_ε/σ²_g and δ_ij = 1 if i = j, 0 otherwise.
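
A minimal sketch of these posterior formulae for a constant trend f(x) = 1 and a squared-exponential correlation, conditional on (β, λ, τ²); the test function sin(x) anticipates the example on the next slide, and all names are illustrative.

```python
import numpy as np

def gp_posterior(x_new, X, y, beta0=0.0, lam=1.0, tau2=1e-6):
    """Posterior mean m(x) and scaled variance s^2 at a scalar input x_new,
    for a constant trend f(x) = 1 and squared-exponential correlation."""
    n = len(X)
    Ln = np.exp(-lam**2 * (X[:, None] - X[None, :])**2) + tau2 * np.eye(n)
    k = np.exp(-lam**2 * (X - x_new)**2)             # k = [k_lambda(x, x_i)]
    m = beta0 + k @ np.linalg.solve(Ln, y - beta0)   # f(x)^T beta + k^T Ln^{-1}(y - F beta)
    s2 = 1.0 - k @ np.linalg.solve(Ln, k)            # s^2 = 1 - k^T Ln^{-1} k
    return m, s2

# Noisy observations of sin(x), as in the simple example on the next slide
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, 8)
y = np.sin(X) + 0.05 * rng.normal(size=8)
print(gp_posterior(0.5, X, y, lam=1.0, tau2=0.01))
```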

12 A simple example. g(x) = sin(x); µ(x) = β₀ (constant); k_λ(x, x′) = exp{−λ²(x − x′)²}. Posterior uncertainty approaches 0 at the design points. [Figure: prediction with 95% intervals, prior mean and sin(x) against x.] Learning λ and τ² requires numerical methods: empirical Bayes (estimate λ, τ², e.g. as the posterior mode, and plug in), or Markov chain Monte Carlo or sequential Monte Carlo. Titsias et al. (2011); Gramacy & Polson (2011)

13 Design for learning and prediction using GPs. Space-filling: Latin hypercubes, maximin and minimax, ... McKay et al. (1979), Johnson et al. (1990), ... Prediction: kriging variance (usually assuming known λ), IMSPE, Bayesian approaches, ... O'Hagan (1978), Sacks et al. (1989), Zhou & Stein (2006), Diggle & Lophaven (2006), Gorodetsky & Marzouk (2016), Leatherman et al. (2017), ... Estimation: entropy, D-optimality, ... Sebastiani & Wynn (2000), Zhou & Stein (2005), Xia et al. (2006), Boukouvalas et al. (2014), ... Sequential design: optimisation (expected improvement), prediction, ... Jones et al. (1998), Wikle & Royle (2005), Gramacy & Lee (2009), Beck & Guillas (2016), ...

14 Bayesian Optimal Design

15 Decision-theoretic design. A loss function L(d(y), θ; ξ) defines the cost of decision d(y) when the true state of nature is θ, where data y are obtained from design ξ = {x_1, ..., x_n}. Example loss functions include squared error loss, the negative log-score and 0-1 loss. Make the decision that minimises the expected posterior loss:

d*(y) = argmin_{d ∈ D} ∫_Θ L(d(y), θ; ξ) π(θ | y, ξ) dθ

d*(y) is the Bayes decision for data y.

16 Decision-theoretic design. Bayes risk:

Ψ(ξ) = ∫_Y ∫_Θ L(d*(y), θ; ξ) π(θ, y | ξ) dθ dy

the loss evaluated at the Bayes decision for each y ∈ Y, where π(θ, y | ξ) is the joint distribution of model parameters and data. A Bayesian optimal design is then given by

ξ* = argmin_{ξ ∈ Ξ} Ψ(ξ)

c.f. Berger (1985, Ch. 7, preposterior analysis); Lindley (1972), Chaloner & Verdinelli (1995)
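
A generic Monte Carlo sketch of this preposterior calculation: draw (θ, y) from π(θ, y | ξ) by sampling θ from the prior and y from the model, then average the loss at the Bayes decision. The sampler, decision rule and loss below are placeholders to be supplied for a given application.

```python
import numpy as np

def bayes_risk_mc(design, sample_prior, sample_data, bayes_decision, loss,
                  B=1000, rng=None):
    """Monte Carlo estimate of Psi(xi) = E_{theta, y}[L(d*(y), theta; xi)]."""
    rng = rng or np.random.default_rng()
    total = 0.0
    for _ in range(B):
        theta = sample_prior(rng)             # theta ~ pi(theta)
        y = sample_data(theta, design, rng)   # y ~ pi(y | theta, xi)
        d_star = bayes_decision(y, design)    # Bayes decision for this y
        total += loss(d_star, theta, design)  # loss at the Bayes decision
    return total / B
```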

17 Prediction

18 Prediction. For some experiments, the aim is accurate model-free prediction of the response, e.g. spatial studies, computer experiments. Bayesian nonparametrics: place a Gaussian process prior on the unknown response function; the correlation function allows predictions to smoothly follow the data. Optimal design by minimising prediction uncertainty. O'Hagan (1978), Müller et al. (2004), Diggle & Lophaven (2006), Adamou (2014)

19 Space-filling designs. Space-filling designs are the designs most commonly applied to experiments tailored for prediction. They don't rely on the functional form of the relationship between inputs and outputs. Good coverage is important for prediction, which is strongly influenced by nearby points.

20 Space-filling designs. For a Gaussian process model with (i) constant mean function and (ii) known values of the correlation parameters: maximin optimal designs are asymptotically D-optimal; minimax optimal designs are asymptotically G-optimal. Johnson et al. (1990). A D-optimal design minimises the generalised variance of the parameter estimators; a G-optimal design minimises the maximum predictive variance. Atkinson et al. (2007)
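
For illustration, a crude random-search stand-in for a maximin design in [0, 1]^d (it only approximates the constructions of Johnson et al. (1990)): keep the candidate design whose smallest inter-point distance is largest.

```python
import numpy as np
from scipy.spatial.distance import pdist

def random_maximin(n, d, n_tries=5000, rng=None):
    """Return the random design in [0, 1]^d whose minimum inter-point
    distance is largest among n_tries candidates."""
    rng = rng or np.random.default_rng(0)
    best, best_dist = None, -np.inf
    for _ in range(n_tries):
        X = rng.uniform(size=(n, d))
        min_dist = pdist(X).min()             # smallest pairwise distance
        if min_dist > best_dist:
            best, best_dist = X, min_dist
    return best

design = random_maximin(n=10, d=2)
```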

21 Bayesian optimal design for prediction. For prediction of a future observation y(x_{n+}) at an unobserved point x_{n+}, we apply the squared error loss function

L(ỹ, y(x_{n+}); ξ) = {ỹ − y(x_{n+})}²

The best prediction of y(x_{n+}) is ŷ = E(y(x_{n+}) | y) (the Bayes decision). The objective function to be minimised is

Ψ(ξ) = ∫∫∫ L(ŷ, y(x_{n+}); ξ) π(y(x_{n+}) | y; ξ) π(y; ξ) dy(x_{n+}) dy dx_{n+}
     = ∫∫ E[{ŷ − y(x_{n+})}² | y; ξ] π(y; ξ) dy dx_{n+}
     = ∫∫ var(y(x_{n+}) | y; ξ) π(y; ξ) dy dx_{n+}

22 Evaluating the objective function. For conjugate priors on β and σ²_g:

1. Assume φ = (λ, τ²) known:

Ψ(ξ; φ) = ∫∫ var(y(x_{n+}) | φ, y; ξ) π(y | φ; ξ) dy dx_{n+}

This is a tractable integral with respect to the data, and is common in the literature, e.g. Diggle & Lophaven (2006), Zimmerman (2006).

2. Unknown φ:

Ψ(ξ) = ∫∫∫ var(y(x_{n+}) | y; ξ) π(y | φ; ξ) π(φ) dφ dy dx_{n+}

The integrand is intractable and numerical methods are required.

23 An approximation. Decompose the objective function:

Ψ(ξ) = ∫∫ E_{φ|y}[var(y(x_{n+}) | φ, y; ξ)] π(y; ξ) dy dx_{n+} + ∫∫ var_{φ|y}[E(y(x_{n+}) | φ, y; ξ)] π(y; ξ) dy dx_{n+}
     = Ψ₁(ξ) + Ψ₂(ξ)

The integrand in Ψ₁(ξ) is analytically tractable with respect to the data y. Options to approximate Ψ₂(ξ) include: Markov chain Monte Carlo simulation and Monte Carlo integration, which give low bias but high variance and high computational cost; or fixing Ψ₂(ξ) = 0, with potentially high bias but zero variance and zero computational cost.

24 An approximation. Set Ψ₂(ξ) = 0, and use the approximation

Ψ(ξ) ≈ Ψ₁(ξ) = ∫∫ E_{φ|y}[var(y(x_{n+}) | φ, y; ξ)] π(y; ξ) dy dx_{n+}
             = ∫∫∫ var(y(x_{n+}) | φ, y; ξ) π(y | φ; ξ) π(φ) dy dφ dx_{n+}
             ∝ ∫∫ S²(x_{n+}; φ, ξ) π(φ) dφ dx_{n+}

where S²(x_{n+}; φ, ξ) is the generalised kriging variance at the point x_{n+} (see slide 56), and the integral with respect to φ can be approximated using quadrature.
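
A hedged sketch of this approximation: average the generalised kriging variance over quadrature nodes and weights for φ = (λ, τ²), and over a fixed grid of prediction points. Here kriging_var_s2 is a stand-in for S²(x_{n+}; φ, ξ) as given on slide 56.

```python
import numpy as np

def psi1_quadrature(design, x_pred_grid, phi_nodes, phi_weights, kriging_var_s2):
    """Approximate Psi_1(xi) by quadrature over phi, averaging the
    generalised kriging variance over a grid of prediction points."""
    total = 0.0
    for phi, w in zip(phi_nodes, phi_weights):
        # integral over x_{n+} for this phi, via the prediction grid
        s2_vals = [kriging_var_s2(x_new, phi, design) for x_new in x_pred_grid]
        total += w * float(np.mean(s2_vals))
    return total
```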

25 Assessing the approximation. Numerical sensitivity study: generated random designs, for various n and prior hyperparameters, and evaluated Ψ₁ and Ψ₂ (using simulation). Consistently found Ψ₂ << Ψ₁ (i.e. Ψ ≈ Ψ₁); the ranking of designs is very similar under Ψ₁ and Ψ, especially for small Ψ. Theoretical insights: Ψ₂ depends on π(φ | y; ξ) ∝ π(y | φ; ξ) π(φ); under some assumptions, the integrated likelihood π(y | φ; ξ) is bounded by a polynomial in φ, e.g. as (λ, τ²) → (0, 0), π(y | φ; ξ) → 0 at a faster rate. In the literature: Berger et al. (2001); Wu & Kaufman (2013) studied predictive performance with respect to the choice of prior and found var_{φ|y}[E(y(x_{n+}) | φ, y; ξ)] ≈ 0; Leatherman et al. (2017a, 2017b): Bayesian IMSPE (frequentist derivation, non-informative prior).

26 An approximation. Approximate the integral in Ψ(ξ) with respect to the prediction points: a fixed grid gives zero variance and high bias, with low computational cost; a Monte Carlo grid gives high variance and low (zero) bias, with high computational cost. For special cases of the correlation function, e.g. the Gaussian correlation function, the integral with respect to the prediction points is analytically tractable.

27 Computer experiments example. Computer model: a simple simulator of a helical compression spring. Three variables: wire diameter (mm), spring index and spring coefficient. Linear mean function fᵀ(x)β = β₀ + Σ³_{k=1} β_k x_k; correlation function k_λ(x, x′) = ∏³_{j=1} exp(−λ_j |x_j − x′_j|) for x, x′ ∈ [−1, 1]³. The prior for β, σ²_g is Normal-Inverse-Gamma. Bayesian optimal designs with n = 10 runs for two different prior distributions: Prior 1: τ² = 0, λ₁ ~ Unif(1, 3), λ₂ ~ Unif(3, 5), λ₃ → 0; Prior 2: τ² = 0.5, λ₁ ~ Unif(1, 3), λ₂ ~ Unif(1, 3), λ₃ = 0. Tudose & Jucan (2007); Forrester et al. (2008)
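
A small sketch of the separable correlation function used in this example; the rate values in the call are illustrative midpoints of the prior ranges, not values from the slides.

```python
import numpy as np

def product_exp_corr(A, B, lam):
    """Correlation matrix between points in the rows of A (m, 3) and B (k, 3):
    k_lambda(x, x') = prod_j exp(-lambda_j |x_j - x'_j|)."""
    diffs = np.abs(A[:, None, :] - B[None, :, :])       # |x_j - x'_j|
    return np.exp(-(diffs * np.asarray(lam)).sum(axis=-1))

# Example: correlations among three random points in [-1, 1]^3
rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(3, 3))
print(product_exp_corr(X, X, lam=[2.0, 4.0, 0.0]))
```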

28 Computer experiments example x x 2 x x x 2 x Bayesian Ψ-optimal design and maximin Latin hypercube design for prior 1 (left) and prior 2 (right). 28

29 Computer experiments example. For prior 1: the Bayesian optimal design has similar space-filling properties to the LHD (average inter-point distance 1.43 vs 1.40), and the Bayesian design has 30% smaller average posterior predictive variance than that of the LHD. For prior 2: design points in the x₃ dimension collapse onto the extremes, and the Bayesian design has 18% smaller average posterior predictive variance than that of the LHD. Two-dimensional example.

30 Ride optimisation: design and response variables. Each variant can be described by different mass properties, and has different ride characteristics. Mass properties (design variables) and units: Mass (kg); Mass distribution (% front); Centre of gravity (mm); Roll inertia (kg m²); Pitch inertia (kg m²); Damping. Damping can be used to tune the ride performance of a given variant.

31 10 variants for study. [Table: mass, mass distribution, centre of gravity and roll inertia for each variant.] Running physical tests to assess the performance of each of these variants is time consuming and expensive. Instead, a computer model is run that simulates ride performance.

32 Model-based design. [Figure: design plotted against mass, mass distribution, CG, roll inertia and damping.] Tailor the design for Gaussian process prediction for the given variants.

33 Spatial examples. Example 1: known noise-to-signal ratio τ² = σ²_ε/σ²_g. Linear mean function fᵀ(x)β = β₀ + β₁x₁ + β₂x₂; correlation function k_λ(x, x′) = exp(−λ ||x − x′||₂), with x_i, x_j ∈ [−1, 1]². Prior distributions: β | σ²_g ~ N(0, σ²_g I), σ²_g ~ Inverse-Gamma(3, 1) and λ ~ Uniform(0.1, 1); τ² takes fixed values 0, 0.5, 1, 2.5. Optimal designs with n = 10 points and average correlation contours.

34 [Figure: optimal designs in (x₁, x₂) for τ² = 0, 0.5, 1 and 2.5.]

35 Spatial examples. Example 2: unknown noise-to-signal ratio τ². Linear mean function fᵀ(x)β = β₀ + β₁x₁ + β₂x₂; Matérn correlation function

k_λ(x, x′) = {2^{λ₂−1} Γ(λ₂)}⁻¹ (2λ₂^{1/2} ||x − x′||₂ / λ₁)^{λ₂} K_{λ₂}(2λ₂^{1/2} ||x − x′||₂ / λ₁)

where Γ is the gamma function and K_{λ₂} is the modified Bessel function of the second kind. Prior distributions: β | σ²_g ~ N(0, σ²_g I), σ²_g ~ Inverse-Gamma(3, 1), λ₁ ~ Uniform(0.1, 1) and τ² ~ Uniform(0, 1). Optimal designs with n = 10 points for λ₂ = 0.5 (exponential correlation) and λ₂ = ...
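
A sketch of this Matérn correlation using scipy's modified Bessel function K_{λ₂}; setting λ₂ = 0.5 recovers the exponential correlation, which gives a quick check of the implementation.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_corr(d, lam1, lam2):
    """Matern correlation with range lam1 and smoothness lam2, as a
    function of distance d = ||x - x'||_2."""
    d = np.atleast_1d(np.asarray(d, dtype=float))
    u = 2.0 * np.sqrt(lam2) * d / lam1
    out = np.ones_like(u)                        # correlation 1 at distance 0
    pos = u > 0
    out[pos] = (2.0**(1.0 - lam2) / gamma(lam2)) * u[pos]**lam2 * kv(lam2, u[pos])
    return out

# lam2 = 0.5 recovers the exponential correlation exp(-sqrt(2) d / lam1)
print(matern_corr([0.0, 0.3, 1.0], lam1=0.5, lam2=0.5))
```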

36 [Figure: optimal designs in (x₁, x₂) for the two values of λ₂.]

37 [Figure: densities of the correlation values across the region for τ² = 0, 0.5, 1 and 2.5.]


39 Key points. The new designs compromise between minimax (coverage) and maximin (spread) designs. Different prior information on signal and noise can lead to different designs (space-filling or linear-model-like). The spread of the design points depends crucially on the distribution of the correlations: problems with less uniform correlations across the spatial region produce designs with better space-filling properties.

40 Sensor placement (1). [Figure: prediction, candidate and optimal locations.] Linear mean; low correlation.

41 Sensor placement (2). [Figure: prediction, candidate and optimal locations.] Linear mean; medium correlation.

42 Sensor placement (3). [Figure: prediction, candidate and optimal locations.] Linear mean; high correlation.

43 Estimation

44 Bayesian optimal design for estimation. For estimating the trend parameters β, we apply the squared error loss function

L(β̃, β; ξ) = ||β̃ − β||²

The best estimate of β is β̂ = E(β | y) (the Bayes decision). The objective function to be minimised is

Ψ(ξ) = ∫_Y E[{β − E(β | y)}ᵀ {β − E(β | y)}] π(y) dy
     = ∫_Y E[tr{(β − E(β | y))(β − E(β | y))ᵀ}] π(y) dy
     = ∫_Y tr[var(β | y)] π(y) dy

45 An approximation. Decompose the objective function:

Ψ(ξ) = ∫ E_{φ|y}[var(β | φ, y; ξ)] π(y; ξ) dy + ∫ var_{φ|y}[E(β | φ, y; ξ)] π(y; ξ) dy
     = Ψ₁(ξ) + Ψ₂(ξ)

Set Ψ₂(ξ) = 0, and use the approximation

Ψ(ξ) ≈ Ψ₁(ξ) = ∫ E_{φ|y}[var(β | φ, y; ξ)] π(y; ξ) dy
             = ∫∫ var(β | φ, y; ξ) π(y | φ; ξ) π(φ) dy dφ
             ∝ ∫ tr[(Fᵀ L_n⁻¹ F + R)⁻¹] π(φ) dφ
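
A hedged sketch of the integrand tr[(Fᵀ L_n⁻¹ F + R)⁻¹] for a given φ = (λ, τ²), using the taxicab-distance correlation of slide 47; the grid of locations, intercept-only trend and R = I in the example call are illustrative simplifications.

```python
import numpy as np

def trace_criterion(X, lam, tau2, F, R):
    """tr[(F^T L_n^{-1} F + R)^{-1}] for a design X, with the
    taxicab-distance correlation exp(-lam ||x - x'||_1)."""
    d1 = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)   # ||x_i - x_j||_1
    Ln = np.exp(-lam * d1) + tau2 * np.eye(len(X))
    M = F.T @ np.linalg.solve(Ln, F) + R
    return float(np.trace(np.linalg.inv(M)))

# Illustrative call: nine locations on a grid in [0, 1]^2, intercept-only trend
X = np.array([[i / 2.0, j / 2.0] for i in range(3) for j in range(3)])
F = np.ones((9, 1))
print(trace_criterion(X, lam=1.0, tau2=0.5, F=F, R=np.eye(1)))
```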

46 Example. We assume the model

y(s, x) = m(s) + Z(x) + ε

where s corresponds to treatment factors, x corresponds to nuisance factors, m(s) is the mean function and Z(x) has a mean-zero Gaussian process prior. We consider a well-plate with nine fixed locations; in each location a different treatment is applied. See also work on design for nearest-neighbour and autocorrelation structures: Kiefer and Wynn (1981, 1984)

47 Example. Make a choice of treatments and where to apply them. Four treatment factors; linear mean function fᵀ(s)β = β₀ + Σ⁴_{k=1} β_k s_k. Treatments have three levels: −1, 0, 1. Taxicab distance between locations: k_λ(x, x′) = exp(−λ ||x − x′||₁) for x, x′ ∈ [0, 1]². The prior for β, σ²_g is Normal-Inverse-Gamma; the priors for λ, τ² are uniform.

48 Example. [Table: optimal choice of treatment levels for factors s₁, s₂, s₃, s₄ at the nine locations.]

49 Example. [Figure: histogram of the value of the objective function (frequency against value).]

50 Estimation. There are identifiability and consistency problems when estimating trend parameters in a Gaussian process model: the mean parameters are not identifiable, as they are not orthogonal to the Gaussian process; Kennedy and O'Hagan (2001). The identifiability problem can be avoided by making the mean function orthogonal to the Gaussian process prior, achieved by defining a new correlation function that incorporates the orthogonality condition. We can modify the correlation function and apply our methodology; Plumlee and Joseph (2015)

51 Summary. New model-based designs: tailored for prediction but taking account of uncertainty in parameters that require estimation; strongly influenced by the range and degree of correlation and by the choice of mean function; largely insensitive to the choice of prior hyperparameters for σ²_g; suitable for computer experiments, spatial and spatio-temporal studies, machine learning applications, ... The methodology can also be used for robust product design.

52 References I. Adamou, M. (2014). PhD thesis, University of Southampton. Atkinson, A., Donev, A. & Tobias, R. (2007). Optimum Experimental Designs, with SAS. Oxford University Press, 2nd edn. Beck, J. & Guillas, S. (2016). JUQ, 4(1). Berger, J., De Oliveira, V. & Sanso, B. (2001). JASA, 96. Boukouvalas, A., Cornford, D. & Stehlik, M. (2014). CSDA, 71. Chaloner, K. & Verdinelli, I. (1995). Statistical Science, 10. Cressie, N.A.C. (1993). Statistics for Spatial Data. Wiley. Diggle, P. & Lophaven, S. (2006). Scand. J. Statist., 33. Forrester, A., Sobester, A. & Keane, A. (2008). Engineering Design via Surrogate Modelling. Wiley. Gorodetsky, A. & Marzouk, Y. (2016). JUQ, 4(1). Gramacy, R.B. & Lee, H.K.H. (2009). Technometrics, 51. Gramacy, R.B. & Polson, N.G. (2011). JCGS, 20. Johnson, M.E., Moore, L.M. & Ylvisaker, D. (1990). JSPI, 26.

53 References II. Jones, D.R., Schonlau, M. & Welch, W.J. (1998). J. Glob. Opt., 13. Kiefer, J. & Wynn, H.P. (1981). Annals of Statistics, 9. Kiefer, J. & Wynn, H.P. (1984). Annals of Statistics, 12. Leatherman, E.R., Dean, A.M. & Santner, T.J. (2017). Computational Statistics & Data Analysis, 113. Leatherman, E.R., Santner, T.J. & Dean, A.M. (in press). Statistics and Computing. McKay, M.D., Beckman, R.J. & Conover, W.J. (1979). Technometrics, 21. Meyer, R. & Nachtsheim, C. (1995). Technometrics, 37. Müller, P. & Parmigiani, G. (1995). JASA, 90. O'Hagan, A. (1978). JRSSB, 40. Rasmussen, C.E. & Williams, C.K.I. (2006). Gaussian Processes for Machine Learning. MIT Press. Sacks, J., Welch, W.J., Mitchell, T.J. & Wynn, H.P. (1989). Statistical Science, 4.

54 References III. Santner, T.J., Williams, B.J. & Notz, W.I. (2003). The Design and Analysis of Computer Experiments. Springer. Sebastiani, P. & Wynn, H.P. (2000). JRSSB, 62. Tudose, L. & Jucan, D. (2007). Ann. Oradea Uni. Wikle, C.K. & Royle, J.A. (2005). Environmetrics, 16. Wu, R. & Kaufman, C.G. (2013). Submitted. Xia, G., Miranda, M.L. & Gelfand, A.E. (2006). Environmetrics, 17. Zimmerman, D.L. (2006). Environmetrics, 17. Zhou, Z. & Stein, M.L. (2005). JSPI, 134. Zhou, Z. & Stein, M.L. (2006). JABES, 11.


56 Generalised kriging variance S²(x_{n+}; φ, ξ):

S²(x_{n+}; φ, ξ) = L_{n+} − Lᵀ_{n,n+} L_n⁻¹ L_{n,n+} + (F_{n+} − Lᵀ_{n,n+} L_n⁻¹ F)(Fᵀ L_n⁻¹ F + R)⁻¹ (F_{n+} − Lᵀ_{n,n+} L_n⁻¹ F)ᵀ

Derived from conjugate priors on β and σ²_g; R is the prior precision matrix for β (up to the scale σ²_g). F is the model matrix; F_{n+} is the model matrix for the prediction points.

L_n = [k_λ(x_i, x_j) + τ² δ_ij], i, j = 1, ..., n;  L_{n+} = [k_λ(x_{i,n+}, x_{j,n+}) + τ² δ_ij] over the prediction points;  L_{n,n+} = [k_λ(x_i, x_{j,n+})], i = 1, ..., n, j over the prediction points.
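
A hedged implementation of S²(x_{n+}; φ, ξ) as reconstructed above, for a single prediction point; corr is any correlation function k_λ, and the exponential correlation, random design and R = I in the example are illustrative.

```python
import numpy as np

def gen_kriging_var(x_new, X, F, f_new, corr, tau2, R):
    """S^2 at a single prediction point x_new; corr(A, B) returns the
    correlation matrix between the rows of A and B."""
    n = len(X)
    Ln = corr(X, X) + tau2 * np.eye(n)              # L_n
    ell = corr(X, x_new[None, :]).ravel()           # L_{n,n+} (one column)
    Ln_inv_ell = np.linalg.solve(Ln, ell)
    h = f_new - F.T @ Ln_inv_ell                    # F_{n+} - L_{n,n+}^T L_n^{-1} F
    A = F.T @ np.linalg.solve(Ln, F) + R            # F^T L_n^{-1} F + R
    L_new = 1.0 + tau2                              # L_{n+} for one point
    return float(L_new - ell @ Ln_inv_ell + h @ np.linalg.solve(A, h))

# Illustrative use with an isotropic exponential correlation and R = I
corr = lambda A, B: np.exp(-np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1))
X = np.random.default_rng(3).uniform(-1.0, 1.0, size=(10, 2))
F = np.column_stack([np.ones(10), X])               # linear trend f(x) = (1, x1, x2)
x_new = np.array([0.2, -0.4])
f_new = np.concatenate([[1.0], x_new])
print(gen_kriging_var(x_new, X, F, f_new, corr, tau2=0.5, R=np.eye(3)))
```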

57 Comparison with IMSPE. Leatherman et al. (2017) applied a pseudo-Bayesian IMSPE objective function

Φ(ξ) = σ² ∫∫ S²(x_{n+}; φ, ξ) π(φ) dφ dx_{n+}

where

S² = L_{n+} − Lᵀ_{n,n+} L_n⁻¹ L_{n,n+} + (F_{n+} − Lᵀ_{n,n+} L_n⁻¹ F)(Fᵀ L_n⁻¹ F)⁻¹ (F_{n+} − Lᵀ_{n,n+} L_n⁻¹ F)ᵀ

with L_n, L_{n+} and L_{n,n+} correlation matrices, and F_{n+} and F model matrices. This is derived from a frequentist perspective and is equivalent to a non-informative prior for β. Our approximation Ψ₁(ξ) demonstrates that Φ(ξ) is also a good approximation to the decision-theoretic Bayesian approach.

58 Coordinate exchange algorithm. Overview of the algorithm: 1. Choose a random starting design. 2. For each point, perform a one-dimensional optimisation for each coordinate in turn, with all the other coordinates (in all points) remaining fixed. 3. Select the coordinate value that minimises the objective function. 4. Repeat 2-3 for each point and coordinate, until no non-negligible decrease in the objective function is obtained after one complete iteration through the design. Meyer and Nachtsheim (1995)
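
A minimal sketch of the algorithm, assuming a grid search for the one-dimensional optimisation in step 2 and designs in [0, 1]^d; objective can be any criterion to minimise, e.g. a quadrature approximation of Ψ₁(ξ) or, as in the example, the negative minimum inter-point distance for a maximin design.

```python
import numpy as np

def coordinate_exchange(objective, n, d, n_grid=21, tol=1e-8, rng=None):
    """Minimise objective(design) over n x d designs in [0, 1]^d."""
    rng = rng or np.random.default_rng(0)
    design = rng.uniform(size=(n, d))             # step 1: random starting design
    grid = np.linspace(0.0, 1.0, n_grid)
    current = objective(design)
    improved = True
    while improved:                               # step 4: iterate to convergence
        improved = False
        for i in range(n):                        # step 2: each point...
            for j in range(d):                    # ...and each coordinate in turn
                best_val, best_obj = design[i, j], current
                for v in grid:                    # step 3: pick minimising value
                    design[i, j] = v
                    obj = objective(design)
                    if obj < best_obj - tol:
                        best_val, best_obj = v, obj
                design[i, j] = best_val
                if best_obj < current - tol:
                    current, improved = best_obj, True
    return design, current

# Example: a maximin design via the negative minimum inter-point distance
from scipy.spatial.distance import pdist
design, value = coordinate_exchange(lambda D: -pdist(D).min(), n=10, d=2)
```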
