Introduction to Gaussian Processes

1 Introduction to Gaussian Processes 1

2 Objectives
- to express prior knowledge/beliefs about model outputs using a Gaussian process (GP)
- to sample functions from the probability measure defined by a GP
- to build a Bayesian surrogate of a model using a GP
- to use the GP model for uncertainty propagation
- to use the GP model for global optimization

3 The Best Book on the Subject Gaussian Processes for Machine Learning Carl Edward Rasmussen and Christopher K. I. Williams The MIT Press, ISBN X. Free online at With Matlab code. 3

4 The Best Code on the Subject GPy (in Python) from the group of N. University of Sheffield My lab s 4

5 Motivation
Input parameters $x$ → physical model $f$ → quantities of interest $y$.
We'll think about it as a mathematical function: $y = f(x)$

6 p(something) = probability that something is true
"The essence of the present theory is that no probability, direct, prior, or posterior, is simply a frequency." H. Jeffreys (1939)
Probability Theory: The Logic of Science, by E. T. Jaynes

7 Some of the Problems of Uncertainty Quantification
- Uncertainty propagation: $p(x) \to f \to p(y)$
- Model calibration: $y \to f \to p(x \mid y)$
- Design optimization under uncertainty: $x^* = \arg\max_x \mathbb{E}_\xi\left[O(f(x;\xi))\right]$

8 Why are these problems difficult? High computational cost of models. High-dimensionality of inputs/outputs. Fusion of information from multiple sources. Quantification of model-form uncertainties. 8

9 The Surrogate Idea
- Do a finite number of simulations.
- Replace the model with an approximation: $y \approx \hat{f}(x)$
- The surrogate is usually cheap to evaluate.
- Solve the UQ problem with the surrogate.

10 The Surrogate Idea 10

11 Classic Approach to Surrogates
Usually $\hat{f}(x) = \sum_{j=1}^{M} w_j \phi_j(x)$, with the weights $w_j$ found by looking at the data $D = \{(x_i, y_i)\}_{i=1}^{N}$, using either a quadrature rule (orthogonal basis), least squares, or machine learning techniques.
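As an illustration of the least-squares route, here is a minimal NumPy sketch. The monomial basis $\phi_j(x) = x^j$ (indexed from 0 rather than 1), the function names, and the cubic stand-in for the expensive model are all assumptions made for this example, not part of the original slides.

```python
import numpy as np

def fit_surrogate(x, y, degree):
    """Least-squares weights w_j for the assumed basis phi_j(x) = x**j, j = 0..degree."""
    Phi = np.vander(x, degree + 1, increasing=True)  # design matrix, shape (N, degree + 1)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def evaluate_surrogate(w, x):
    """Evaluate f_hat(x) = sum_j w_j * phi_j(x)."""
    return np.vander(x, len(w), increasing=True) @ w

# Hypothetical stand-in for the expensive model; it only generates the data set D.
x_obs = np.linspace(-1.0, 1.0, 20)
y_obs = 1.0 + 2.0 * x_obs - 0.5 * x_obs**3

w = fit_surrogate(x_obs, y_obs, degree=3)
y_hat = evaluate_surrogate(w, np.array([0.25]))
```

Because the stand-in model lies exactly in the span of the basis, the fitted weights reproduce its coefficients; a real model generally would not, which is the "limited expressivity" issue raised on the next slides.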

12 Examples of Surrogates
- generalized polynomial chaos
- Fourier expansions
- splines
- wavelets
- neural networks
- support vector machines
- compressive sensing

13 Limitations of Surrogates
- limited expressivity
- inability to quantify epistemic uncertainties due to a limited number of observations
- high-dimensionality

14 Questions of interest
You can do 5 simulations. What is the best you can say about the solution of the X problem with this budget? If you could do one more simulation, where should it be?

15 The Bayesian surrogate idea
- Put a prior on functions.
- Evaluate the model output on a finite set of inputs.
- Compute the posterior on functions.
- Use Bayes' rule to solve UQ problems.
"Most people, even Bayesians, think that this sounds crazy when they first hear about it." Persi Diaconis (1988)

16 Bayesian surrogate 16

17 Bayesian Surrogate 17

18 Bayesian Surrogate Bayesian surrogate = Gaussian process 18

19 Gaussian Process Regression
- is extremely expressive, since it is equivalent to an infinite expansion $\hat{f}(x) = \sum_{j=1}^{\infty} w_j \phi_j(x)$ with basis functions that can be tuned
- includes many standard methods as sub-cases
- is fully Bayesian
Ch. 7, Rasmussen (2006)

20 Let's set up our workspace before we start going into the mathematical details.

21 Definition of a Gaussian process
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. Let's just explain in plain English what it is.

22 Definition of a Gaussian process
Input parameters $x$ → physical model $f$ → quantities of interest $y$.
- Treat $f$ as unknown.
- Unknown = uncertain = random, i.e., described with probabilities.
- Let us denote our beliefs about $f$ as follows: $f(\cdot) \sim p(f(\cdot))$

23 Definition of a Gaussian process
A Gaussian process needs two ingredients: a mean function and a covariance function. It uses them to define a probability measure on the space of functions. We write:
$f(\cdot) \sim p(f(\cdot)) = \mathrm{GP}\left(f(\cdot) \mid m(\cdot), k(\cdot,\cdot)\right)$

24 The mean function
What do you think $f(x)$ could be? Define the mean function by:
$m(x) = \mathbb{E}[f(x)]$
It models your expectation about $f(x)$.

25 The covariance function
How sure are you about this prediction? Consider the variance:
$k(x,x) = \mathbb{E}\left[\left(f(x) - m(x)\right)^2\right]$
It models your uncertainty about $f(x)$.

26 The covariance function
Now, consider two inputs $x$ and $x'$. How close do you think the corresponding outputs are? Consider the covariance function:
$k(x,x') = \mathbb{E}\left[\left(f(x) - m(x)\right)\left(f(x') - m(x')\right)\right]$
It models your beliefs about the similarity of $f(x)$ and $f(x')$.

27 To wrap it up
We write $f(\cdot) \sim \mathrm{GP}\left(f(\cdot) \mid m(\cdot), k(\cdot,\cdot)\right)$ and we interpret:
- $m(x)$: What do I think $f(x)$ could be?
- $k(x,x)$: How sure am I about my expectation of $f(x)$?
- $k(x,x')$: How similar are $f(x)$ and $f(x')$?

28 The most common covariance function: Squared Exponential (SE)
Also known as the radial basis function (RBF):
$k(x,x') = v \exp\left\{-\frac{1}{2}\sum_{i=1}^{d} \frac{(x_i - x_i')^2}{\ell_i^2}\right\}$
- The variance $v$ models uncertainty about $f(x)$.
- The length-scale $\ell_i$ models similarity along the specific input dimension $i$.
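The SE covariance with one length-scale per input dimension can be sketched in a few lines of NumPy. The function name `se_kernel` and its argument names are choices made for this illustration; libraries such as GPy provide their own implementations.

```python
import numpy as np

def se_kernel(X1, X2, variance, lengthscales):
    """Squared exponential covariance between the rows of X1 (n1, d) and X2 (n2, d)."""
    ell = np.asarray(lengthscales, dtype=float)
    A = np.asarray(X1, dtype=float) / ell
    B = np.asarray(X2, dtype=float) / ell
    # Scaled squared distances sum_i (x_i - x_i')**2 / ell_i**2, shape (n1, n2)
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K = se_kernel(X, X, variance=2.0, lengthscales=[0.5, 1.0])
```

With these length-scales, a unit step along the first dimension decorrelates the outputs much more than a unit step along the second, which is the sense in which the length-scale encodes similarity per input dimension.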

29 Example 1.1: Drawing covariance functions You have 15 minutes 29

30 The covariance matrix
Consider an arbitrary selection of input points and their corresponding outputs:
$X = \{x_1, \dots, x_n\}$, $f = \{f(x_1), \dots, f(x_n)\}$
The covariance matrix is defined to be:
$K := \mathbb{E}\left[(f - m)(f - m)^T\right] = \begin{bmatrix} k(x_1,x_1) & \cdots & k(x_1,x_n) \\ \vdots & \ddots & \vdots \\ k(x_n,x_1) & \cdots & k(x_n,x_n) \end{bmatrix}$

31 Restrictions on the covariance functions
The covariance function has to be positive definite. That is, for any finite collection of inputs, the covariance matrix must be positive definite:
$K := \begin{bmatrix} k(x_1,x_1) & \cdots & k(x_1,x_n) \\ \vdots & \ddots & \vdots \\ k(x_n,x_1) & \cdots & k(x_n,x_n) \end{bmatrix}$

32 Covariance function factory
The sum of two covariance functions is a covariance function:
$k(x,x') = k_1(x,x') + k_2(x,x')$
What does this model? The belief that the response comes from two sources:
$f(x) = f_1(x) + f_2(x)$, with $f_i(\cdot) \sim \mathrm{GP}\left(f_i(\cdot) \mid m_i(\cdot), k_i(\cdot,\cdot)\right)$, $i = 1, 2$

33 Covariance function factory
The product of two covariance functions is a covariance function:
$k(x,x') = k_1(x,x')\,k_2(x,x')$
What does this model? The belief that the response comes from two sources:
$f(x) = f_1(x)\,f_2(x)$, with $f_i(\cdot) \sim \mathrm{GP}\left(f_i(\cdot) \mid m_i(\cdot), k_i(\cdot,\cdot)\right)$, $i = 1, 2$
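A quick numerical sanity check of the sum and product rules: build two covariance matrices on the same inputs and confirm that their sum and their elementwise product have no negative eigenvalues (up to round-off). The linear kernel used as the second ingredient is an assumption of this sketch; it does not appear in the slides.

```python
import numpy as np

def se_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    """Squared exponential covariance for 1D inputs."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def linear_kernel(x1, x2):
    """Linear covariance k(x, x') = x * x' (assumed here only as a second example)."""
    return x1[:, None] * x2[None, :]

x = np.linspace(-1.0, 1.0, 15)
K1 = se_kernel(x, x, lengthscale=0.3)
K2 = linear_kernel(x, x)

K_sum = K1 + K2    # k(x, x') = k1(x, x') + k2(x, x')
K_prod = K1 * K2   # k(x, x') = k1(x, x') * k2(x, x'), elementwise product

min_eig_sum = np.linalg.eigvalsh(K_sum).min()
min_eig_prod = np.linalg.eigvalsh(K_prod).min()
```

This is only a spot check on one set of inputs, not a proof; the general statements follow from the closure of positive semi-definite matrices under addition and under the elementwise (Schur) product.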

34 Example 1.2: The covariance matrix and some properties of covariance functions 34

35 Sampling a Gaussian process
A Gaussian process defines a probability measure over a function space:
$f(\cdot) \sim \mathrm{GP}\left(f(\cdot) \mid m(\cdot), k(\cdot,\cdot)\right)$
How can we sample functions from it? Sample $f$ at a finite, albeit large, set of inputs.

36 Sampling a Gaussian process
Take a finite number of inputs $X = \{x_1, \dots, x_n\}$ and consider the model output on them, $f = \{f(x_1), \dots, f(x_n)\}$. We believe that they are distributed according to:
$f \sim N(f \mid m, K)$

37 Sampling a Gaussian process
OK, so we need to be able to sample from $f \sim N(f \mid m, K)$, with
$m = \begin{bmatrix} m(x_1) \\ \vdots \\ m(x_n) \end{bmatrix}$, $K := \begin{bmatrix} k(x_1,x_1) & \cdots & k(x_1,x_n) \\ \vdots & \ddots & \vdots \\ k(x_n,x_1) & \cdots & k(x_n,x_n) \end{bmatrix}$

38 Sampling a Gaussian process
To sample from $f \sim N(f \mid m, K)$:
- Take the lower Cholesky decomposition $L$ of $K$: $K = L L^T$
- Sample a standard normal: $z \sim N(z \mid 0_n, I_n)$
- Set: $f = m + L z$
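The three steps (Cholesky factor, standard normal draw, affine map) can be sketched as follows. The function names, the SE covariance with length-scale 0.2, the zero mean, and the small diagonal jitter added before factorizing are assumptions of this example.

```python
import numpy as np

def se_kernel(x1, x2, variance=1.0, lengthscale=0.2):
    """Squared exponential covariance for 1D inputs."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gp(x, mean_fn, kernel_fn, n_samples, jitter=1e-6, seed=0):
    """Draw n_samples functions, evaluated at x, from GP(mean_fn, kernel_fn)."""
    rng = np.random.default_rng(seed)
    m = mean_fn(x)
    # The jitter keeps the factorization stable; it is not part of the model.
    K = kernel_fn(x, x) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)                    # K = L L^T, L lower triangular
    z = rng.standard_normal((len(x), n_samples)) # z ~ N(0, I), one column per sample
    return m[:, None] + L @ z                    # f = m + L z

x = np.linspace(0.0, 1.0, 50)
samples = sample_gp(x, mean_fn=np.zeros_like, kernel_fn=se_kernel, n_samples=3)
```

Each column of `samples` is one function draw on the grid `x`; plotting the columns against `x` gives pictures like the ones on the following slides.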

39 Sampling from a Gaussian process 39

40 Sampling from a Gaussian process 40

41 Changing the length scale 41

42 The samples are as smooth as the covariance Infinitely smooth SE covariance 42

43 The samples are as smooth as the covariance Matern 2-3, 2 times differentiable 43

44 The samples are as smooth as the covariance Exponential, continuous, nowhere differentiable 44

45 Invariances may be built into covariance functions. Periodic Exponential, period =

46 Example 2.1 & 2.2: Drawing samples from a Gaussian process 46

47 Example 1: Motivation 47

48 Selection of the starting pool of input points
Random (e.g., uniform) selection is a good starting point. A Latin hypercube design is a much better choice. Code developed by our lab here:
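For readers without access to the lab's code, the following is a generic, minimal Latin hypercube sampler on the unit cube. It is a sketch written for this note (the name `latin_hypercube` and its signature are hypothetical) and is not the lab's implementation.

```python
import numpy as np

def latin_hypercube(n, d, seed=0):
    """n points in [0, 1)^d with exactly one point per stratum in every dimension."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, d))
    for j in range(d):
        strata = rng.permutation(n)      # assign each point its own stratum of width 1/n
        offsets = rng.random(n)          # uniform position inside the stratum
        X[:, j] = (strata + offsets) / n
    return X

X0 = latin_hypercube(10, 3)
```

Unlike plain uniform sampling, every one-dimensional projection of the design covers all `n` strata, so no region of any single input is left unexplored.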

49 Adaptive selection: What is your goal? Demo: selecting observations with maximum predictive uncertainty 49

50 Gaussian process regression
Assume that we have observed:
$X = \{x_1, \dots, x_N\}$, $f = \{f(x_1), \dots, f(x_N)\}$
and that we want to make predictions at an arbitrary set of test inputs:
$X^* = \{x^*_1, \dots, x^*_{N^*}\}$, $f^* = \{f(x^*_1), \dots, f(x^*_{N^*})\}$

51 Gaussian process regression
Since we have assumed a priori that $f(\cdot) \sim \mathrm{GP}\left(f(\cdot) \mid m(\cdot), k(\cdot,\cdot)\right)$, then by definition:
$\begin{bmatrix} f \\ f^* \end{bmatrix} \sim N\left(\begin{bmatrix} f \\ f^* \end{bmatrix} \,\middle|\, \begin{bmatrix} m \\ m^* \end{bmatrix}, \begin{bmatrix} K(X,X) & K(X,X^*) \\ K(X^*,X) & K(X^*,X^*) \end{bmatrix}\right)$

52 Gaussian process regression
In the joint distribution above:
- $m$: mean on observations
- $m^*$: mean on test inputs
- $K(X,X)$: covariance matrix of observations
- $K(X^*,X)$: cross covariance matrix (test-observed)
- $K(X^*,X^*)$: covariance matrix of test inputs

53 Gaussian process regression
Starting from the joint distribution of $f$ and $f^*$, Bayes' rule should give us: $f^* \mid X^*, X, f \sim\ ?$

54 Gaussian process regression
Applying Bayes' rule to the joint distribution gives:
$f^* \mid X^*, X, f \sim N\left(f^* \mid \tilde{m}, \tilde{K}\right)$,
$\tilde{m} = m^* + K(X^*,X) K^{-1} (f - m)$,
$\tilde{K} = K(X^*,X^*) - K(X^*,X) K^{-1} K(X,X^*)$,
where $K = K(X,X)$. Proof in Ch. 2.3, Bishop (2006).

55 The posterior Gaussian process
Since the choice of test points was arbitrary, the procedure actually defines a posterior Gaussian process:
$f(\cdot) \mid X, f \sim \mathrm{GP}\left(f(\cdot) \mid \tilde{m}(\cdot), \tilde{k}(\cdot,\cdot)\right)$,
$\tilde{m}(x) = m(x) + K(x,X) K^{-1} (f - m)$,
$\tilde{k}(x,x') = k(x,x') - K(x,X) K^{-1} K(X,x')$
This encodes our beliefs about the model output after seeing the data. Predictions require a Cholesky decomposition.
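The posterior mean and covariance can be computed with one Cholesky factorization of $K$ followed by triangular solves, which avoids forming $K^{-1}$ explicitly. The sketch below assumes 1D inputs, an SE covariance with fixed hyperparameters, a zero prior mean by default, and a small jitter; the function names are illustrative.

```python
import numpy as np

def se_kernel(x1, x2, variance=1.0, lengthscale=0.5):
    """Squared exponential covariance for 1D inputs."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_obs, f_obs, x_test, mean_fn=np.zeros_like, jitter=1e-8):
    """Posterior mean vector and covariance matrix of f at x_test given (x_obs, f_obs)."""
    K = se_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    K_s = se_kernel(x_test, x_obs)        # K(x*, X)
    K_ss = se_kernel(x_test, x_test)      # K(x*, x*)
    L = np.linalg.cholesky(K)
    # alpha = K^{-1} (f - m), obtained from two solves with the Cholesky factor
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f_obs - mean_fn(x_obs)))
    V = np.linalg.solve(L, K_s.T)         # V^T V = K(x*, X) K^{-1} K(X, x*)
    post_mean = mean_fn(x_test) + K_s @ alpha
    post_cov = K_ss - V.T @ V
    return post_mean, post_cov

x_obs = np.array([0.0, 0.5, 1.0])
f_obs = np.sin(2.0 * np.pi * x_obs)
mean_at_obs, cov_at_obs = gp_posterior(x_obs, f_obs, x_obs)
```

Evaluated at the observed inputs, the posterior mean reproduces the data and the posterior variance collapses to roughly the jitter, as expected for noise-free observations.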

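56 Gaussian process regression
[Figure: prior GP → Bayes' rule → posterior GP]

57 Posterior GP: The point predictive distribution
$f(\cdot) \mid X, f \sim \mathrm{GP}\left(f(\cdot) \mid \tilde{m}(\cdot), \tilde{k}(\cdot,\cdot)\right)$
Looking at just one point, we get the point predictive distribution:
$y \mid x, X, f \sim N\left(y \mid \tilde{m}(x), \tilde{\sigma}^2(x)\right)$, $\tilde{\sigma}^2(x) = \tilde{k}(x,x)$
You may use the mean as a surrogate.

58 Gaussian process regression
$y \mid x, X, f \sim N\left(y \mid \tilde{m}(x), \tilde{\sigma}^2(x)\right)$, i.e., $f(x) = \tilde{m}(x) \pm 2\tilde{\sigma}(x)$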

59 Gaussian process regression - Noisy observations
Assume that we have observed:
$X = \{x_1, \dots, x_N\}$, $y = \{y_1, \dots, y_N\}$
where $y_i$ is a noisy measurement of the ideal $f(x_i)$ (e.g., an MD simulation). We need to model the measurement process using a likelihood (typically Gaussian):
$y_i \mid f(x_i) \sim N\left(y_i \mid f(x_i), \sigma^2\right)$
Here $\sigma^2$ is the noise (likelihood) variance.

60 Gaussian process regression - Noisy observations
The posterior GP changes to:
$f(\cdot) \mid X, y, \sigma^2 \sim \mathrm{GP}\left(f(\cdot) \mid \tilde{m}(\cdot), \tilde{k}(\cdot,\cdot)\right)$,
$\tilde{m}(x) = m(x) + K(x,X)\left(K + \sigma^2 I_N\right)^{-1}(y - m)$,
$\tilde{k}(x,x') = k(x,x') - K(x,X)\left(K + \sigma^2 I_N\right)^{-1} K(X,x')$
and the point predictive distribution to:
$y \mid x, X, y, \sigma^2 \sim N\left(y \mid \tilde{m}(x), \tilde{\sigma}^2(x)\right)$, $\tilde{\sigma}^2(x) = \tilde{k}(x,x) + \sigma^2$

61 Gaussian process regression - Noisy observations Each choice of the noise corresponds to a different interpretation of the data. 61

62 Noise improves numerical stability
It is common to use a small noise even if there is not any in the data. The Cholesky decomposition fails when the covariance matrix is close to being only positive semi-definite (i.e., nearly singular). Adding a small noise improves numerical stability. In this case it is known as the jitter or the nugget.
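One common way to apply the jitter is to retry the factorization with an increasing diagonal term until it succeeds. The helper below is a sketch of that pattern; its name, the starting jitter of `1e-10`, and the factor-of-ten schedule are assumptions of this example rather than a prescription from the slides.

```python
import numpy as np

def cholesky_with_jitter(K, initial_jitter=1e-10, max_tries=8):
    """Return (L, jitter) with L L^T = K + jitter * I, growing jitter only if needed."""
    jitter = 0.0
    for attempt in range(max_tries):
        try:
            L = np.linalg.cholesky(K + jitter * np.eye(len(K)))
            return L, jitter
        except np.linalg.LinAlgError:
            # Factorization failed: move to the next (larger) jitter and retry.
            jitter = initial_jitter * 10.0**attempt
    raise np.linalg.LinAlgError("Cholesky failed even with jitter %g" % jitter)

# Two identical inputs give a singular covariance matrix.
K_singular = np.array([[1.0, 1.0], [1.0, 1.0]])
L, used_jitter = cholesky_with_jitter(K_singular)
```

The returned jitter should be reported alongside the model, since it acts exactly like a small noise variance in the posterior formulas.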

63 Example 3.1, Questions 1-5: Gaussian process regression 63

64 Model Selection for GP regression
Our prior assumptions were conditional on the mean and covariance parameters:
$f(\cdot) \mid \theta \sim \mathrm{GP}\left(f(\cdot) \mid m(\cdot;\theta), k(\cdot,\cdot;\theta)\right)$
Observations are conditional on the noise level:
$y_i \mid f(x_i), \sigma^2 \sim N\left(y_i \mid f(x_i), \sigma^2\right)$
Thus, the likelihood of all the observations is:
$y \mid X, \theta, \sigma^2 \sim p(y \mid X, \theta, \sigma^2) = N\left(y \mid m, K + \sigma^2 I_N\right)$

65 Model Selection for GP regression
The (marginal) likelihood of all the observations is:
$y \mid X, \theta, \sigma^2 \sim p(y \mid X, \theta, \sigma^2) = N\left(y \mid m, K + \sigma^2 I_N\right)$
To complete the prior specification, we must give:
$\theta, \sigma^2 \sim p(\theta, \sigma^2)$
Then, after seeing the data, our beliefs about the parameters should change to:
$\theta, \sigma^2 \mid X, y \sim p(\theta, \sigma^2 \mid X, y) \propto p(y \mid X, \theta, \sigma^2)\, p(\theta, \sigma^2)$

66 Model Selection for GP regression
After seeing the data, our beliefs about the parameters are:
$\theta, \sigma^2 \mid X, y \sim p(\theta, \sigma^2 \mid X, y) \propto p(y \mid X, \theta, \sigma^2)\, p(\theta, \sigma^2)$
Ideally, we would sample from this posterior with MCMC. Alternatively, we can find the MAP estimate of the parameters:
$\theta^*, (\sigma^*)^2 = \arg\max_{\theta,\sigma} \left\{\log p(y \mid X, \theta, \sigma^2) + \log p(\theta, \sigma^2)\right\}$

67 Model Selection for GP regression
MAP estimate of the parameters:
$\theta^*, (\sigma^*)^2 = \arg\max_{\theta,\sigma} \left\{\log p(y \mid X, \theta, \sigma^2) + \log p(\theta, \sigma^2)\right\}$
If our prior assumptions are vague, then $\log p(\theta, \sigma^2) = \text{const}$ and we are effectively just maximizing the likelihood.
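To make the vague-prior case concrete, the sketch below evaluates the log marginal likelihood $\log N(y \mid 0, K + \sigma^2 I_N)$ through a Cholesky factor and maximizes it by brute-force grid search over the length-scale and noise variance. The zero mean, the fixed signal variance, the synthetic data, and the grid search (in place of a gradient-based optimizer) are all simplifying assumptions of this example.

```python
import numpy as np

def se_kernel(x1, x2, variance, lengthscale):
    """Squared exponential covariance for 1D inputs."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def log_marginal_likelihood(x, y, variance, lengthscale, noise_var):
    """log N(y | 0, K + noise_var * I) for a zero-mean GP (assumed in this sketch)."""
    n = len(x)
    C = se_kernel(x, x, variance, lengthscale) + noise_var * np.eye(n)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # C^{-1} y
    # log det C = 2 * sum(log(diag(L)))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2.0 * np.pi)

# Synthetic noisy data, for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

lengthscales = np.logspace(-2, 1, 30)
noise_vars = np.logspace(-4, 0, 30)
scores = np.array([[log_marginal_likelihood(x, y, 1.0, ell, s2) for s2 in noise_vars]
                   for ell in lengthscales])
i, j = np.unravel_index(np.argmax(scores), scores.shape)
best_lengthscale, best_noise_var = lengthscales[i], noise_vars[j]
```

The `scores` array is the kind of surface shown in the contour plots on the next slides; it can have several local maxima, so the result of any local optimizer depends on where it starts.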

68 Model Selection for GP regression
[Figure: contour plot of the marginal likelihood for a specific example in Rasmussen (2006). Axes: characteristic lengthscale, noise standard deviation.]

69 Careful: Different optima correspond to different interpretations
[Figure: (a) contour plot of the marginal likelihood for a specific example in Rasmussen (2006), axes: characteristic lengthscale, noise standard deviation; (b), (c) output, y versus input, x.]

70 Example 3.1, Questions 6-8, Example

71 Bayesian global optimization - The problem
Problem: $x^* = \arg\min_x f(x)$ when the objective:
- is very expensive to evaluate
- comes without gradients
- might be noisy
- has dimensionality < 30 parameters

72 Bayesian global optimization - The Idea
Assume that we have observed:
$X = \{x_1, \dots, x_N\}$, $\mathbf{y} = \{y_1, \dots, y_N\}$
and that we can make one more observation. Which observation do we choose?

73 Bayesian global optimization - The Idea
Let's say that we make an observation at $x$ and we see $y$. The improvement we would observe is:
$I(x, y) = \begin{cases} 0, & y > \min_n y_n \\ \min_n y_n - y, & \text{otherwise} \end{cases}$
But $y$ could be anything.

74 Bayesian global optimization - The Idea
Use the data to build a GP that represents our state of knowledge about the model output. The point predictive distribution summarizes everything:
$y \mid x, X, \mathbf{y} \sim p(y \mid x, X, \mathbf{y}) = N\left(y \mid \tilde{m}(x), \tilde{\sigma}^2(x)\right)$
Integrate to get rid of $y$ from the improvement:
$\mathrm{EI}(x) = \int I(x, y)\, p(y \mid x, X, \mathbf{y})\, dy$

75 Bayesian global optimization - The Idea
The integration is analytically available:
$\mathrm{EI}(x) = \left(\min_n y_n - \tilde{m}(x)\right) \Phi\left(\frac{\min_n y_n - \tilde{m}(x)}{\tilde{\sigma}(x)}\right) + \tilde{\sigma}(x)\, \phi\left(\frac{\min_n y_n - \tilde{m}(x)}{\tilde{\sigma}(x)}\right)$
Jones et al. (1998)
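The closed-form expected improvement needs only the standard normal CDF $\Phi$ and PDF $\phi$, both available from the Python standard library. The sketch below handles a single point; the function name and the special case for zero predictive standard deviation are choices made for this example.

```python
import math

def expected_improvement(mean, std, y_min):
    """EI at one point with predictive mean `mean`, std `std`, and current best y_min."""
    if std <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(y_min - mean, 0.0)
    u = (y_min - mean) / std
    Phi = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))          # standard normal CDF
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return (y_min - mean) * Phi + std * phi

ei_at_tie = expected_improvement(mean=0.0, std=1.0, y_min=0.0)
```

When the predictive mean equals the current best value, the first term vanishes and the EI reduces to $\tilde{\sigma}(x)\,\phi(0)$: all of the expected gain comes from the uncertainty.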

76 Bayesian global optimization - The algorithm
1. Observe an initial pool of inputs-outputs (e.g., randomly selected or whatever is available).
2. Given the current observations, build a GP representing our state of knowledge about the output.
3. Select the input with maximum expected improvement. If the maximum is below a threshold, STOP. Otherwise, do a new simulation and GO TO 2.

77 Minimizing the energy of a binary molecule
Consider the O2 molecule. Let $r$ be the distance between the two O atoms. We wish to find the interatomic distance that minimizes the energy:
$r^* = \arg\min_r V(r)$
We run BGO starting with 1 randomly chosen simulation.
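The loop can be sketched end to end on a toy problem. Everything specific here is an assumption of the example: the 1D quadratic objective, the candidate grid, a zero-mean GP with a fixed SE covariance (a fuller treatment would re-fit the hyperparameters at step 2), the jitter, and the stopping threshold.

```python
import math
import numpy as np

def se_kernel(x1, x2, variance=1.0, lengthscale=0.3):
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict(x_obs, y_obs, x_new, jitter=1e-6):
    """Zero-mean GP predictive mean and standard deviation at x_new."""
    K = se_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    K_s = se_kernel(x_new, x_obs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    V = np.linalg.solve(L, K_s.T)
    var = np.diag(se_kernel(x_new, x_new)) - np.sum(V**2, axis=0)
    return K_s @ alpha, np.sqrt(np.maximum(var, 0.0))

def expected_improvement(mean, std, y_min):
    std = np.maximum(std, 1e-12)
    u = (y_min - mean) / std
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(u / math.sqrt(2.0)))
    phi = np.exp(-0.5 * u**2) / math.sqrt(2.0 * math.pi)
    return (y_min - mean) * Phi + std * phi

def bgo(f, x_cand, x_init, max_iter=20, tol=1e-3):
    # Step 1: initial pool of inputs-outputs.
    x_obs = np.array(x_init, dtype=float)
    y_obs = np.array([f(x) for x in x_obs])
    for _ in range(max_iter):
        # Step 2: GP representing the current state of knowledge.
        mean, std = gp_predict(x_obs, y_obs, x_cand)
        # Step 3: candidate with maximum expected improvement, or stop.
        ei = expected_improvement(mean, std, y_obs.min())
        best = int(np.argmax(ei))
        if ei[best] < tol:
            break
        x_obs = np.append(x_obs, x_cand[best])
        y_obs = np.append(y_obs, f(x_cand[best]))
    k = int(np.argmin(y_obs))
    return x_obs[k], y_obs[k], x_obs, y_obs

def objective(x):
    """Hypothetical cheap objective standing in for an expensive simulation."""
    return (x - 0.6) ** 2

x_cand = np.linspace(0.0, 1.0, 101)
x_best, y_best, x_hist, y_hist = bgo(objective, x_cand, x_init=[0.0, 1.0])
```

Maximizing EI over a fixed candidate grid keeps the sketch short; in higher dimensions the inner maximization is itself an optimization problem and is usually handled with a multi-start local optimizer.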

78 Minimizing the energy of a binary molecule 78

79 Example 4: BGO application to finding the minimum energy structure 79

80 BGO for solving inverse problems
Suppose we observe a model output $y$ and want to find the input $x$ that gave rise to it. The simplest mathematical formulation is via a loss function:
$x^* = \arg\min_x L(x) := \arg\min_x \left\| y - f(x) \right\|_2^2$
We represent the loss function with a GP and we employ BGO.

81 Example 5: BGO for solving inverse problems 81

82 Cool stuff you did not learn about today 82

83 Detecting discontinuities
[Figure: axes $x_1$, $x_2$.]

84 Detecting important inputs
[Figure: L2 error norm versus number of samples for SGC, ASGC (several values of ε), RVM SE, and RVM GPC with N=20 and N=40; number of splits versus dimension for RVM GPC with N=20 and N=40, P=2, and several values of δ.]

85 Getting predictive error bars for EVERYTHING
[Figure panels: (a) Single, 20 Obs. (b) Semi, 20 Obs. (c) Full, 20 Obs. Page header in source: "verse Problems 30 (2014) I Bilionis and N Zabara".]

86 Learning non-linear dynamics from data (recursive GPs) 86

87 Doing multi-fidelity optimization under budget constraints
[Figure: meV/atom versus % Al in NiAl. Caption: "FIG. 2. (Color) Six different stages, set up in two columns, in the Bayesian global optimization algorithm for learning"]

88 Doing multi-objective optimization with limited simulations 88

89 Doing high-dimensions & encoding physics 89

90 Unifying all UQ problems 90

91 Incomplete References
- Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge, MA: MIT Press.
- Jones, D. R., Schonlau, M., & Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4).
...and many, many more.

92 Thanks! 92
