
Data Analysis and Probabilistic Inference
Gaussian Processes
Recommended reading: Rasmussen & Williams, Chapters 1, 2, 4, 5; Deisenroth & Ng (2015) [3]
Marc Deisenroth
Department of Computing, Imperial College London
February 22, 2017


Problem Setting
Objective: For a set of observations $y_i = f(x_i) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2)$, find a distribution over functions $p(f)$ that explains the data. This is a probabilistic regression problem.
(Figures: observed data points in the $x$ vs. $f(x)$ plane.)

Recap from CO-496: Bayesian Linear Regression
Linear regression model: $f(x) = \phi(x)^\top w$, $y = f(x) + \epsilon$, with prior $w \sim \mathcal{N}(0, \Sigma_p)$ and noise $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$.
Integrating out the parameters when predicting leads to a distribution over functions:
$p(f(x_*) \mid x_*, X, y) = \int p(f(x_*) \mid x_*, w)\, p(w \mid X, y)\, dw = \mathcal{N}\big(\mu(x_*), \sigma^2(x_*)\big)$
$\mu(x_*) = \phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} y$
$\sigma^2(x_*) = \phi_*^\top \Sigma_p \phi_* - \phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \Phi^\top \Sigma_p \phi_*, \qquad K = \Phi^\top \Sigma_p \Phi$
where $\phi_* = \phi(x_*)$ and $\Phi$ is the feature matrix of the training inputs.
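To make the predictive equations above concrete, here is a minimal NumPy sketch (not from the slides; the function name and argument layout are illustrative) that evaluates $\mu(x_*)$ and $\sigma^2(x_*)$:

    import numpy as np

    def blr_predict(phi_star, Phi, y, Sigma_p, sigma_n2):
        """Predictive mean and variance of Bayesian linear regression.

        phi_star : (M,)   feature vector phi(x_*) of a test input
        Phi      : (M, N) feature matrix of the N training inputs
        y        : (N,)   training targets
        Sigma_p  : (M, M) prior covariance of the weights
        sigma_n2 : scalar observation-noise variance
        """
        K = Phi.T @ Sigma_p @ Phi                  # Gram matrix Phi^T Sigma_p Phi (N x N)
        G = K + sigma_n2 * np.eye(Phi.shape[1])    # K + sigma_n^2 I
        a = phi_star @ Sigma_p @ Phi               # phi_*^T Sigma_p Phi (N,)
        mu = a @ np.linalg.solve(G, y)             # predictive mean
        var = phi_star @ Sigma_p @ phi_star - a @ np.linalg.solve(G, a)  # predictive variance
        return mu, var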

Sampling from the Prior over Functions
Consider a linear regression setting $y = a + bx + \epsilon$, with $p(a, b) = \mathcal{N}(0, I)$ and $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$.
(Figures: samples $(a, b)$ drawn from the prior and the corresponding functions in the $x$ vs. $y$ plane.)

Sampling from the Posterior over Functions
Consider a linear regression setting $y = a + bx + \epsilon$, with $p(a, b) = \mathcal{N}(0, I)$ and $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$.
(Figures: samples $(a, b)$ drawn from the posterior and the corresponding functions in the $x$ vs. $y$ plane.)

Fitting Nonlinear Functions
Fit nonlinear functions using (Bayesian) linear regression: a linear combination of nonlinear features.
Example: radial-basis-function (RBF) network
$f(x) = \sum_{i=1}^{n} w_i \phi_i(x), \qquad w_i \sim \mathcal{N}(0, \sigma_p^2)$
where $\phi_i(x) = \exp\big(-\tfrac{1}{2}(x - \mu_i)^\top (x - \mu_i)\big)$ for given centers $\mu_i$.

Illustration: Fitting a Radial-Basis-Function Network
$\phi_i(x) = \exp\big(-\tfrac{1}{2}(x - \mu_i)^\top (x - \mu_i)\big)$
Place Gaussian-shaped basis functions $\phi_i$ at 25 input locations $\mu_i$, linearly spaced in the interval $[-5, 3]$.

Samples from the RBF Prior
$f(x) = \sum_{i=1}^{n} w_i \phi_i(x), \qquad p(w) = \mathcal{N}(0, I)$
(Figure: function samples drawn from the RBF prior.)

Samples from the RBF Posterior
$f(x) = \sum_{i=1}^{n} w_i \phi_i(x), \qquad p(w \mid X, y) = \mathcal{N}(m_N, S_N)$
(Figure: function samples drawn from the RBF posterior.)
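A small NumPy sketch of these prior and posterior samples (the training data, noise level, and 25 centers are made up for the example; $m_N, S_N$ follow the standard Bayesian linear regression formulas with a unit-variance weight prior, matching $p(w) = \mathcal{N}(0, I)$ above):

    import numpy as np

    def rbf_features(x, centers):
        """Phi[i, j] = exp(-0.5 * (x_j - mu_i)^2) for scalar inputs."""
        return np.exp(-0.5 * (x[None, :] - centers[:, None]) ** 2)

    centers = np.linspace(-5, 3, 25)           # basis-function centers mu_i
    x_plot = np.linspace(-6, 4, 200)
    Phi_plot = rbf_features(x_plot, centers)   # (25, 200)

    # Samples from the prior p(w) = N(0, I)
    w_prior = np.random.randn(5, len(centers))
    f_prior = w_prior @ Phi_plot               # five prior function samples

    # Posterior p(w | X, y) = N(m_N, S_N) for toy data (X, y) and noise sigma_n^2
    X = np.array([-4.0, -2.5, 0.0, 1.5]); y = np.sin(X); sigma_n2 = 0.01
    Phi = rbf_features(X, centers)             # (25, 4)
    S_N = np.linalg.inv(np.eye(len(centers)) + Phi @ Phi.T / sigma_n2)
    m_N = S_N @ Phi @ y / sigma_n2
    w_post = np.random.multivariate_normal(m_N, S_N, size=5)
    f_post = w_post @ Phi_plot                 # five posterior function samples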

RBF Posterior
(Figure: the RBF posterior.)

Limitations
Feature engineering; finite number of features.
Above: without basis functions on the right-hand side of the input range, we cannot express any variability of the function there.
Ideally: add more (infinitely many) basis functions.

Approach
Instead of sampling parameters, which induce a distribution over functions, sample functions directly.
Make assumptions about the distribution of functions.
Intuition: a function is an infinitely long vector of function values. Make assumptions about the distribution of function values.

Gaussian Process
We will place a distribution $p(f)$ on functions $f$. Informally, a function can be considered an infinitely long vector of function values $f = [f_1, f_2, f_3, \ldots]$.
A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.
Definition: A Gaussian process (GP) is a collection of random variables $f_1, f_2, \ldots$, any finite number of which is Gaussian distributed.
A Gaussian distribution is specified by a mean vector $\mu$ and a covariance matrix $\Sigma$. A Gaussian process is specified by a mean function $m(\cdot)$ and a covariance function (kernel) $k(\cdot, \cdot)$.

Covariance Function
The covariance function (kernel) is symmetric and positive semi-definite. It allows us to compute covariances between (unknown) function values by just looking at the corresponding inputs:
$\mathrm{Cov}[f(x_i), f(x_j)] = k(x_i, x_j)$

GP Regression as a Bayesian Inference Problem
Objective: For a set of observations $y_i = f(x_i) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma_n^2)$, find a (posterior) distribution over functions $p(f \mid X, y)$ that explains the data.
Training data: $X, y$. Bayes' theorem yields
$p(f \mid X, y) = \dfrac{p(y \mid f, X)\, p(f)}{p(y \mid X)}$
Prior: $p(f) = GP(m, k)$. Specify the mean function $m$ and kernel $k$.
Likelihood (noise model): $p(y \mid f, X) = \mathcal{N}\big(f(X), \sigma_n^2 I\big)$
Marginal likelihood (evidence): $p(y \mid X) = \int p(y \mid f(X))\, p(f \mid X)\, df$
Posterior: $p(f \mid y, X) = GP(m_{\text{post}}, k_{\text{post}})$

Prior over Functions
Treat a function as a long vector of function values: $f = [f_1, f_2, \ldots]$. Look at a distribution over function values $f_i = f(x_i)$.
Consider a finite number of $N$ function values $f$ and all other (infinitely many) function values $\tilde f$. Informally,
$p(f, \tilde f) = \mathcal{N}\!\left(\begin{bmatrix}\mu_f \\ \mu_{\tilde f}\end{bmatrix},\ \begin{bmatrix}\Sigma_{ff} & \Sigma_{f\tilde f}\\ \Sigma_{\tilde f f} & \Sigma_{\tilde f\tilde f}\end{bmatrix}\right)$
where $\Sigma_{\tilde f\tilde f} \in \mathbb{R}^{m\times m}$ and $\Sigma_{f\tilde f} \in \mathbb{R}^{N\times m}$, $m \to \infty$, and $\Sigma_{ff}^{(i,j)} = \mathrm{Cov}[f(x_i), f(x_j)] = k(x_i, x_j)$.
Key property: the marginal remains finite,
$p(f) = \int p(f, \tilde f)\, d\tilde f = \mathcal{N}\big(\mu_f, \Sigma_{ff}\big)$
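The key property above, that the finite marginal is an ordinary multivariate Gaussian whose covariance is given by the kernel, can be checked numerically. In the following sketch (grid, kernel, and indices are purely illustrative) marginalizing a Gaussian simply means picking out the relevant sub-block of the covariance matrix:

    import numpy as np

    def k_gauss(A, B):
        """Gaussian kernel k(x, x') = exp(-(x - x')^2) for 1-D inputs."""
        return np.exp(-(A[:, None] - B[None, :]) ** 2)

    # Joint Gaussian over many function values (stand-in for the "other" values)
    x_all = np.linspace(-5, 5, 500)
    K_all = k_gauss(x_all, x_all)

    # Marginalizing a Gaussian = keeping the rows/columns of the values we care about
    idx = [50, 200, 350]                   # the N function values of interest
    K_marg = K_all[np.ix_(idx, idx)]       # p(f) = N(0, K_marg): finite and Gaussian

    # Identical to evaluating the kernel only at those inputs
    assert np.allclose(K_marg, k_gauss(x_all[idx], x_all[idx]))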

Training and Test Marginal
In practice, we always have finite training and test inputs $x_{\text{train}}, x_{\text{test}}$. Define $f_* := f_{\text{test}}$ and $f := f_{\text{train}}$.
Then we obtain the finite marginal
$p(f, f_*) = \int p(f, f_*, f_{\text{other}})\, df_{\text{other}} = \mathcal{N}\!\left(\begin{bmatrix}\mu_f \\ \mu_*\end{bmatrix},\ \begin{bmatrix}\Sigma_{ff} & \Sigma_{f*}\\ \Sigma_{*f} & \Sigma_{**}\end{bmatrix}\right)$

GP Regression as a Bayesian Inference Problem (ctd.)
Posterior over functions (with training data $X, y$):
$p(f \mid X, y) = \dfrac{p(y \mid f, X)\, p(f \mid X)}{p(y \mid X)}$
Using the properties of Gaussians, we obtain
$p(y \mid f, X)\, p(f \mid X) = \mathcal{N}\big(y \mid f(X), \sigma_n^2 I\big)\, \mathcal{N}\big(f(X) \mid m(X), K\big)$
$= Z\, \mathcal{N}\Big(f(X) \,\Big|\, \underbrace{m(X) + K(K + \sigma_n^2 I)^{-1}\big(y - m(X)\big)}_{\text{posterior mean}},\ \underbrace{K - K(K + \sigma_n^2 I)^{-1}K}_{\text{posterior covariance}}\Big)$
with $K = k(X, X)$.
Marginal likelihood:
$Z = p(y \mid X) = \int p(y \mid f, X)\, p(f \mid X)\, df = \mathcal{N}\big(y \mid m(X), K + \sigma_n^2 I\big)$

GP Predictions (1)
$y = f(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2)$
Objective: find $p(f(X_*) \mid X, y)$ for training data $X, y$ and test inputs $X_*$.
GP prior: $p(f \mid X) = \mathcal{N}\big(m(X), K\big)$. Gaussian likelihood: $p(y \mid f(X)) = \mathcal{N}\big(f(X), \sigma_n^2 I\big)$.
With $f \sim GP$, it follows that $f$ and $f_*$ are jointly Gaussian distributed:
$p(f, f_* \mid X, X_*) = \mathcal{N}\!\left(\begin{bmatrix}m(X)\\ m(X_*)\end{bmatrix},\ \begin{bmatrix}K & k(X, X_*)\\ k(X_*, X) & k(X_*, X_*)\end{bmatrix}\right)$
Due to the Gaussian likelihood, we also get ($f$ is unobserved)
$p(y, f_* \mid X, X_*) = \mathcal{N}\!\left(\begin{bmatrix}m(X)\\ m(X_*)\end{bmatrix},\ \begin{bmatrix}K + \sigma_n^2 I & k(X, X_*)\\ k(X_*, X) & k(X_*, X_*)\end{bmatrix}\right)$

GP Predictions (2)
Prior: $p(y, f_* \mid X, X_*) = \mathcal{N}\!\left(\begin{bmatrix}m(X)\\ m(X_*)\end{bmatrix},\ \begin{bmatrix}K + \sigma_n^2 I & k(X, X_*)\\ k(X_*, X) & k(X_*, X_*)\end{bmatrix}\right)$
The posterior predictive distribution $p(f_* \mid X, y, X_*)$ at test inputs $X_*$ is obtained by Gaussian conditioning:
$p(f_* \mid X, y, X_*) = \mathcal{N}\big(\mathbb{E}[f_* \mid X, y, X_*],\ \mathbb{V}[f_* \mid X, y, X_*]\big)$
$\mathbb{E}[f_* \mid X, y, X_*] = m_{\text{post}}(X_*) = \underbrace{m(X_*)}_{\text{prior mean}} + k(X_*, X)(K + \sigma_n^2 I)^{-1}\big(y - m(X)\big)$
$\mathbb{V}[f_* \mid X, y, X_*] = k_{\text{post}}(X_*, X_*) = \underbrace{k(X_*, X_*)}_{\text{prior variance}} - k(X_*, X)(K + \sigma_n^2 I)^{-1} k(X, X_*)$
From now on: set the prior mean function $m \equiv 0$.
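A minimal NumPy implementation of these predictive equations (a sketch with zero prior mean, a Gaussian kernel, and made-up toy data; function and parameter names are illustrative):

    import numpy as np

    def k_gauss(A, B, sigma_f=1.0, ell=1.0):
        """Gaussian kernel k(x, x') = sigma_f^2 exp(-(x - x')^2 / ell^2) for 1-D inputs."""
        return sigma_f ** 2 * np.exp(-(A[:, None] - B[None, :]) ** 2 / ell ** 2)

    def gp_predict(X_star, X, y, sigma_n2, sigma_f=1.0, ell=1.0):
        """Posterior predictive mean and covariance at test inputs X_star (zero prior mean)."""
        K = k_gauss(X, X, sigma_f, ell)
        K_s = k_gauss(X_star, X, sigma_f, ell)          # k(X_*, X)
        K_ss = k_gauss(X_star, X_star, sigma_f, ell)    # k(X_*, X_*)
        G = K + sigma_n2 * np.eye(len(X))
        alpha = np.linalg.solve(G, y)
        mean = K_s @ alpha                              # m_post(X_*)
        cov = K_ss - K_s @ np.linalg.solve(G, K_s.T)    # k_post(X_*, X_*)
        return mean, cov

    # Example usage with toy data
    X = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
    y = np.sin(X)
    X_star = np.linspace(-5, 5, 100)
    mean, cov = gp_predict(X_star, X, y, sigma_n2=0.01)
    std = np.sqrt(np.diag(cov))                         # marginal predictive standard deviation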

Illustration: Inference with Gaussian Processes
Prior belief about the function. Predictive (marginal) mean and variance:
$\mathbb{E}[f(x_*) \mid x_*] = m(x_*) = 0$
$\mathbb{V}[f(x_*) \mid x_*] = \sigma^2(x_*) = k(x_*, x_*)$
(Figures: samples and uncertainty of the GP prior.)

Illustration: Inference with Gaussian Processes
Posterior belief about the function. Predictive (marginal) mean and variance:
$\mathbb{E}[f(x_*) \mid x_*, X, y] = m(x_*) = k(X, x_*)^\top (K + \sigma_\epsilon^2 I)^{-1} y$
$\mathbb{V}[f(x_*) \mid x_*, X, y] = \sigma^2(x_*) = k(x_*, x_*) - k(X, x_*)^\top (K + \sigma_\epsilon^2 I)^{-1} k(X, x_*)$
(Figure sequence: the GP posterior as more and more observations are added.)

Covariance Function
A Gaussian process is fully specified by a mean function $m$ and a kernel/covariance function $k$.
The covariance function (kernel) is symmetric and positive semi-definite.
The covariance function encodes high-level structural assumptions about the latent function $f$ (e.g., smoothness, differentiability, periodicity).

Gaussian Covariance Function
$k_{\text{Gauss}}(x_i, x_j) = \sigma_f^2 \exp\big(-(x_i - x_j)^\top (x_i - x_j)/l^2\big)$
$\sigma_f$: amplitude of the latent function.
$l$: length-scale (smoothness parameter); how far do we have to move in input space before the function value changes significantly?
Assumption on the latent function: smooth (infinitely differentiable).

Length-Scales
Length-scales determine how wiggly the function is and how much information we can transfer to other function values.
(Figures: GP fits to the same data with different length-scales.)
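One way to see the effect of the length-scale is to draw prior samples from the Gaussian kernel for a few values of $l$. A small illustrative NumPy sketch (not from the slides; the grid and length-scale values are made up):

    import numpy as np

    def k_gauss(A, B, sigma_f=1.0, ell=1.0):
        """Gaussian kernel with amplitude sigma_f and length-scale ell (1-D inputs)."""
        return sigma_f ** 2 * np.exp(-(A[:, None] - B[None, :]) ** 2 / ell ** 2)

    x = np.linspace(-5, 5, 200)
    samples = {}
    for ell in (0.3, 1.0, 3.0):                 # short to long length-scales
        K = k_gauss(x, x, ell=ell) + 1e-10 * np.eye(len(x))   # jitter for stability
        samples[ell] = np.random.multivariate_normal(np.zeros(len(x)), K, size=2)
    # short length-scales give wiggly samples; long length-scales give slowly varying samples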

Matérn Covariance Function
$k_{\text{Mat},3/2}(x_i, x_j) = \sigma_f^2 \left(1 + \dfrac{\sqrt{3}\,\|x_i - x_j\|}{l}\right) \exp\!\left(-\dfrac{\sqrt{3}\,\|x_i - x_j\|}{l}\right)$
$\sigma_f$: amplitude of the latent function.
$l$: length-scale; how far do we have to move in input space before the function value changes significantly?
Assumption on the latent function: once differentiable.

Periodic Covariance Function
$k_{\text{per}}(x_i, x_j) = \sigma_f^2 \exp\!\left(-\dfrac{2\sin^2\!\big(\tfrac{\kappa(x_i - x_j)}{2}\big)}{l^2}\right) = k_{\text{Gauss}}\big(u(x_i), u(x_j)\big), \qquad u(x) = \begin{bmatrix}\cos(\kappa x)\\ \sin(\kappa x)\end{bmatrix}$
$\kappa$: periodicity parameter.
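Sketches of the Matérn-3/2 and periodic covariance functions above as NumPy functions for 1-D inputs (parameter names are illustrative; the periodic kernel is written in its sin-squared form, which corresponds to the warped-input construction $u(x) = [\cos\kappa x, \sin\kappa x]^\top$ up to the length-scale convention):

    import numpy as np

    def k_matern32(A, B, sigma_f=1.0, ell=1.0):
        """Matern-3/2 kernel: sigma_f^2 (1 + sqrt(3) r / ell) exp(-sqrt(3) r / ell), r = |x - x'|."""
        r = np.abs(A[:, None] - B[None, :])
        s = np.sqrt(3.0) * r / ell
        return sigma_f ** 2 * (1.0 + s) * np.exp(-s)

    def k_periodic(A, B, sigma_f=1.0, ell=1.0, kappa=1.0):
        """Periodic kernel in sin^2 form, with periodicity parameter kappa."""
        d = A[:, None] - B[None, :]
        return sigma_f ** 2 * np.exp(-2.0 * np.sin(kappa * d / 2.0) ** 2 / ell ** 2)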

Meta-Parameters of a GP
The GP possesses a set of hyper-parameters:
Parameters of the mean function
Hyper-parameters of the covariance function (e.g., length-scales and signal variance)
Likelihood parameters (e.g., noise variance $\sigma_n^2$)
Train a GP to find a good set of hyper-parameters.
Model selection to find good mean and covariance functions (can also be automated: Automatic Statistician, Lloyd et al., 2014 [9]).

Gaussian Process Training: Hyper-Parameters
Find good GP hyper-parameters $\theta$ (kernel and mean-function parameters).
Place a prior $p(\theta)$ on the hyper-parameters.
Posterior over hyper-parameters:
$p(\theta \mid X, y) = \dfrac{p(\theta)\, p(y \mid X, \theta)}{p(y \mid X)}, \qquad p(y \mid X, \theta) = \int p(y \mid f(X))\, p(f \mid X, \theta)\, df$
Choose hyper-parameters $\theta^*$ such that
$\theta^* \in \arg\max_\theta\ \log p(\theta) + \log p(y \mid X, \theta)$
Maximize the marginal likelihood if $p(\theta) = \mathcal{U}$ (uniform prior).

Training via Marginal Likelihood Maximization
Maximize the evidence/marginal likelihood (the probability of the data given the hyper-parameters, where the unwieldy $f$ has been integrated out). Also called maximum likelihood type II.
Marginal likelihood:
$p(y \mid X, \theta) = \int p(y \mid f(X))\, p(f \mid X, \theta)\, df = \int \mathcal{N}\big(y \mid f(X), \sigma_n^2 I\big)\, \mathcal{N}\big(f(X) \mid 0, K\big)\, df = \mathcal{N}\big(y \mid 0, K + \sigma_n^2 I\big)$
Learning the GP hyper-parameters:
$\theta^* \in \arg\max_\theta\ \log p(y \mid X, \theta)$
$\log p(y \mid X, \theta) = -\tfrac{1}{2} y^\top K_\theta^{-1} y - \tfrac{1}{2}\log|K_\theta| + \text{const}, \qquad K_\theta := K + \sigma_n^2 I$
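A direct NumPy sketch of this log-marginal-likelihood computation (the constant is written out as $-\tfrac{N}{2}\log 2\pi$; a Cholesky factorization handles the inverse and the log-determinant):

    import numpy as np

    def log_marginal_likelihood(y, K, sigma_n2):
        """log p(y | X, theta) = -1/2 y^T K_theta^{-1} y - 1/2 log|K_theta| - N/2 log(2 pi),
        with K_theta = K + sigma_n^2 I."""
        N = len(y)
        K_theta = K + sigma_n2 * np.eye(N)
        L = np.linalg.cholesky(K_theta)                      # K_theta = L L^T
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K_theta^{-1} y
        log_det = 2.0 * np.sum(np.log(np.diag(L)))
        return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * N * np.log(2 * np.pi)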

Training via Marginal Likelihood Maximization
Log-marginal likelihood:
$\log p(y \mid X, \theta) = -\tfrac{1}{2} y^\top K_\theta^{-1} y - \tfrac{1}{2}\log|K_\theta| + \text{const}, \qquad K_\theta := K + \sigma_n^2 I$
Automatic trade-off between data fit and model complexity.
Gradient-based optimization of the hyper-parameters $\theta$:
$\dfrac{\partial \log p(y \mid X, \theta)}{\partial \theta_i} = \tfrac{1}{2}\, y^\top K_\theta^{-1} \dfrac{\partial K_\theta}{\partial \theta_i} K_\theta^{-1} y - \tfrac{1}{2}\,\mathrm{tr}\!\left(K_\theta^{-1}\dfrac{\partial K_\theta}{\partial \theta_i}\right) = \tfrac{1}{2}\,\mathrm{tr}\!\left(\big(\alpha\alpha^\top - K_\theta^{-1}\big)\dfrac{\partial K_\theta}{\partial \theta_i}\right), \qquad \alpha := K_\theta^{-1} y$
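As an illustration of the gradient formula, the following sketch computes the gradients of the log-marginal likelihood with respect to the log-hyper-parameters of a 1-D Gaussian kernel (the choice of log-parameterization and the kernel are assumptions for the example, not part of the slides):

    import numpy as np

    def lml_gradients(y, X, sigma_f, ell, sigma_n):
        """Gradients of log p(y | X, theta) w.r.t. (log sigma_f, log ell, log sigma_n)
        using d log p / d theta_i = 1/2 tr((alpha alpha^T - K_theta^{-1}) dK_theta/d theta_i)."""
        D2 = (X[:, None] - X[None, :]) ** 2
        K = sigma_f ** 2 * np.exp(-D2 / ell ** 2)
        K_theta = K + sigma_n ** 2 * np.eye(len(X))
        K_inv = np.linalg.inv(K_theta)
        alpha = K_inv @ y
        M = np.outer(alpha, alpha) - K_inv
        # Derivatives of K_theta w.r.t. the log-hyper-parameters
        dK_dlog_sigma_f = 2.0 * K
        dK_dlog_ell = K * (2.0 * D2 / ell ** 2)
        dK_dlog_sigma_n = 2.0 * sigma_n ** 2 * np.eye(len(X))
        grads = [0.5 * np.trace(M @ dK)
                 for dK in (dK_dlog_sigma_f, dK_dlog_ell, dK_dlog_sigma_n)]
        return np.array(grads)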

Example: Training Data
(Figure: the training data set in the $x$ vs. $y$ plane.)

Example: Marginal Likelihood Contour
(Figure: log-marginal-likelihood contour over log-noise and log-length-scales.)

Example: Exploring the Modes
(Figures: GP fits corresponding to different modes of the log-marginal likelihood.)

Marginal Likelihood (1)-(9)
(Figure sequence: log-marginal-likelihood contours over log-noise and log-length-scales.)

Marginal Likelihood and Parameter Learning
The marginal likelihood is non-convex. In particular in the very-small-data regime, a GP can end up in three different modes when optimizing the hyper-parameters:
Overfitting (unlikely, but possible)
Underfitting (everything is considered noise)
Good fit
Re-start the hyper-parameter optimization from random initializations to mitigate the problem.
With increasing data set size, the GP typically ends up in the good-fit mode. Overfitting (indicator: small length-scales and small noise variance) is very unlikely.
Ideally, we would integrate the hyper-parameters out. Why can we not do this easily?

Model Selection: Mean Function and Kernel
Assume we have a finite set of models $M_i$, each specifying a mean function $m_i$ and a kernel $k_i$. How do we find the best one?
Some options:
BIC, AIC (see CO-496)
Compare marginal likelihood values (assuming a uniform prior on the set of models)
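A sketch of the second option, comparing the evidence of two candidate kernels on the same data (the kernels, toy data, and fixed hyper-parameters here are purely illustrative; in practice one would compare optimized models):

    import numpy as np

    def log_evidence(y, K, sigma_n2):
        """log N(y | 0, K + sigma_n^2 I), i.e. the GP marginal likelihood."""
        G = K + sigma_n2 * np.eye(len(y))
        L = np.linalg.cholesky(G)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)

    X = np.linspace(-3, 3, 20)
    y = np.sin(X) + 0.1 * np.random.randn(len(X))
    D2 = (X[:, None] - X[None, :]) ** 2

    K_gauss = np.exp(-D2 / 1.0 ** 2)       # Gaussian kernel, length-scale 1
    K_linear = np.outer(X, X)              # linear kernel x * x'

    lml_gauss = log_evidence(y, K_gauss, sigma_n2=0.01)
    lml_linear = log_evidence(y, K_linear, sigma_n2=0.01)
    # Under a uniform model prior, prefer the kernel with the larger marginal likelihood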

Example
Four different kernels: constant, linear, Matérn, Gaussian (mean function fixed to $m \equiv 0$).
MAP hyper-parameters for each kernel.
Log-marginal-likelihood (LML) values for each (optimized) model.
(Figures: the GP fit and LML value for each of the four kernels.)

Application Areas
Reinforcement learning and robotics: model value functions and/or dynamics with GPs
Bayesian optimization (experimental design): model unknown utility functions with GPs
Geostatistics: spatial modeling (e.g., landscapes, resources)
Sensor networks
Time-series modeling and forecasting

Limitations of Gaussian Processes
Computational and memory complexity (training set size $N$):
Training scales in $O(N^3)$
Prediction (variances) scales in $O(N^2)$
Memory requirement: $O(ND + N^2)$
Practical limit: $N \approx 10{,}000$

Tips and Tricks for Practitioners
To set initial hyper-parameters, use domain knowledge if possible.
Standardize the input data and set initial length-scales $l$ to $\approx 0.5$.
Standardize the targets $y$ and set the initial signal variance to $\sigma_f \approx 1$.
Often useful: set the initial noise level relatively high (e.g., $\sigma_n \approx 0.5 \times \sigma_f$), even if you think your data have low noise. The optimization surface for the other parameters will be easier to move in.
When optimizing hyper-parameters, random restarts or other tricks to avoid local optima are advised.
Mitigate numerical instability (Cholesky decomposition of $K + \sigma_n^2 I$) by penalizing high signal-to-noise ratios $\sigma_f/\sigma_n$.
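A minimal sketch of the standardization and initialization suggested above (the numeric values are the rules of thumb from the slide, not universal defaults):

    import numpy as np

    def standardize(X, y):
        """Standardize inputs (column-wise) and targets to zero mean and unit variance."""
        X_s = (X - X.mean(axis=0)) / X.std(axis=0)
        y_s = (y - y.mean()) / y.std()
        return X_s, y_s

    # Initial hyper-parameters for the standardized data (illustrative values)
    ell0 = 0.5                  # initial length-scale
    sigma_f0 = 1.0              # initial signal amplitude
    sigma_n0 = 0.5 * sigma_f0   # deliberately high initial noise level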

Appendix

The Gaussian Distribution
$p(x \mid \mu, \Sigma) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\big(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\big)$
Mean vector $\mu$: average of the data.
Covariance matrix $\Sigma$: spread of the data.
(Figures: univariate and bivariate Gaussian densities with data, mean, and 95% confidence bounds.)

Sampling from a Multivariate Gaussian
Objective: Generate a random sample $y \sim \mathcal{N}(\mu, \Sigma)$ from a $D$-dimensional joint Gaussian with covariance matrix $\Sigma$ and mean vector $\mu$. However, we only have access to a random number generator that can sample $x$ from $\mathcal{N}(0, I)$.
Exploit that affine transformations $y = Ax + b$ of a Gaussian random variable $x$ remain Gaussian:
Mean: $\mathbb{E}_x[Ax + b] = A\,\mathbb{E}_x[x] + b$
Covariance: $\mathbb{V}_x[Ax + b] = A\,\mathbb{V}_x[x]\,A^\top$
1. Find conditions on $A, b$ to match the mean of $y$
2. Find conditions on $A, b$ to match the covariance of $y$

Sampling from a Multivariate Gaussian (2)
Objective: Generate a random sample $y \sim \mathcal{N}(\mu, \Sigma)$ from a $D$-dimensional joint Gaussian with covariance matrix $\Sigma$ and mean vector $\mu$.
x = randn(D,1);        % sample x ~ N(0, I)
y = chol(Sigma)'*x + mu;  % scale x and add offset
Here chol(Sigma) is the Cholesky factor $L$, such that $L^\top L = \Sigma$.
Therefore, the mean and covariance of $y$ are
$\mathbb{E}[y] = \bar y = \mathbb{E}[L^\top x + \mu] = L^\top\mathbb{E}[x] + \mu = \mu$
$\mathrm{Cov}[y] = \mathbb{E}[(y - \bar y)(y - \bar y)^\top] = \mathbb{E}[L^\top x x^\top L] = L^\top\mathbb{E}[x x^\top] L = L^\top L = \Sigma$
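The same recipe in NumPy (note that np.linalg.cholesky returns the lower-triangular factor $L$ with $LL^\top = \Sigma$, so the transpose sits on the other side than in the MATLAB-style snippet above; the example numbers are made up):

    import numpy as np

    def sample_mvn(mu, Sigma, n_samples=1):
        """Draw samples from N(mu, Sigma) using only standard-normal random numbers."""
        D = len(mu)
        L = np.linalg.cholesky(Sigma)        # lower-triangular L with L @ L.T == Sigma
        x = np.random.randn(D, n_samples)    # x ~ N(0, I)
        return (mu[:, None] + L @ x).T       # affine transform: y = L x + mu

    # Example
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
    samples = sample_mvn(mu, Sigma, n_samples=1000)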

Conditional
$p(x, y) = \mathcal{N}\!\left(\begin{bmatrix}\mu_x\\ \mu_y\end{bmatrix},\ \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy}\\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$
$p(x \mid y) = \mathcal{N}\big(\mu_{x|y}, \Sigma_{x|y}\big)$
$\mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$
$\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$
The conditional $p(x \mid y)$ is also Gaussian, which is computationally convenient.
(Figure: joint density, an observation of $y$, and the resulting conditional $p(x \mid y)$.)
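A small NumPy sketch of this Gaussian conditioning rule (block shapes and the example numbers are made up for illustration):

    import numpy as np

    def condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y_obs):
        """Compute p(x | y = y_obs) for a joint Gaussian with the given blocks."""
        K = S_xy @ np.linalg.inv(S_yy)
        mu_cond = mu_x + K @ (y_obs - mu_y)   # mu_x + S_xy S_yy^{-1} (y - mu_y)
        S_cond = S_xx - K @ S_xy.T            # S_xx - S_xy S_yy^{-1} S_yx
        return mu_cond, S_cond

    # Example: 2-D joint Gaussian, condition on a scalar observation y = 1.5
    mu_x, mu_y = np.array([0.0]), np.array([0.0])
    S_xx, S_xy, S_yy = np.array([[1.0]]), np.array([[0.8]]), np.array([[1.0]])
    m, S = condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, np.array([1.5]))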

Marginal
$p(x, y) = \mathcal{N}\!\left(\begin{bmatrix}\mu_x\\ \mu_y\end{bmatrix},\ \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy}\\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$
Marginal distribution:
$p(x) = \int p(x, y)\, dy = \mathcal{N}\big(\mu_x, \Sigma_{xx}\big)$
The marginal of a joint Gaussian distribution is Gaussian. Intuitively: ignore (integrate out) everything you are not interested in.

The Gaussian Distribution in the Limit
Consider the joint Gaussian distribution $p(x, \tilde x)$, where $x \in \mathbb{R}^D$ and $\tilde x \in \mathbb{R}^k$, $k \to \infty$, are random variables. Then
$p(x, \tilde x) = \mathcal{N}\!\left(\begin{bmatrix}\mu_x\\ \mu_{\tilde x}\end{bmatrix},\ \begin{bmatrix}\Sigma_{xx} & \Sigma_{x\tilde x}\\ \Sigma_{\tilde x x} & \Sigma_{\tilde x\tilde x}\end{bmatrix}\right)$
where $\Sigma_{\tilde x\tilde x} \in \mathbb{R}^{k\times k}$ and $\Sigma_{x\tilde x} \in \mathbb{R}^{D\times k}$, $k \to \infty$.
However, the marginal remains finite:
$p(x) = \int p(x, \tilde x)\, d\tilde x = \mathcal{N}\big(\mu_x, \Sigma_{xx}\big)$
where we integrate out an infinite number of random variables $\tilde x_i$.

Marginal and Conditional in the Limit
In practice, we consider finite training and test data $x_{\text{train}}, x_{\text{test}}$. Then $x = \{x_{\text{train}}, x_{\text{test}}, x_{\text{other}}\}$, where $x_{\text{other}}$ plays the role of $\tilde x$ from the previous slide:
$p(x) = \mathcal{N}\!\left(\begin{bmatrix}\mu_{\text{train}}\\ \mu_{\text{test}}\\ \mu_{\text{other}}\end{bmatrix},\ \begin{bmatrix}\Sigma_{\text{train}} & \Sigma_{\text{train,test}} & \Sigma_{\text{train,other}}\\ \Sigma_{\text{test,train}} & \Sigma_{\text{test}} & \Sigma_{\text{test,other}}\\ \Sigma_{\text{other,train}} & \Sigma_{\text{other,test}} & \Sigma_{\text{other}}\end{bmatrix}\right)$
$p(x_{\text{train}}, x_{\text{test}}) = \int p(x_{\text{train}}, x_{\text{test}}, x_{\text{other}})\, dx_{\text{other}}$
$p(x_{\text{test}} \mid x_{\text{train}}) = \mathcal{N}(\mu_*, \Sigma_*)$
$\mu_* = \mu_{\text{test}} + \Sigma_{\text{test,train}}\Sigma_{\text{train}}^{-1}(x_{\text{train}} - \mu_{\text{train}})$
$\Sigma_* = \Sigma_{\text{test}} - \Sigma_{\text{test,train}}\Sigma_{\text{train}}^{-1}\Sigma_{\text{train,test}}$

Gaussian Process Training: Hierarchical Inference
Level-1 inference (posterior on $f$):
$p(f \mid X, y, \theta) = \dfrac{p(y \mid X, f)\, p(f \mid X, \theta)}{p(y \mid X, \theta)}, \qquad p(y \mid X, \theta) = \int p(y \mid f, X)\, p(f \mid X, \theta)\, df$
Level-2 inference (posterior on $\theta$):
$p(\theta \mid X, y) = \dfrac{p(y \mid X, \theta)\, p(\theta)}{p(y \mid X)}$

GP as the Limit of an Infinite RBF Network
Consider the universal function approximator
$f(x) = \lim_{N\to\infty} \frac{1}{N}\sum_{i\in\mathbb{Z}}\sum_{n=1}^{N} \gamma_n \exp\!\left(-\frac{\big(x - (i + \tfrac{n}{N})\big)^2}{\lambda^2}\right), \qquad x \in \mathbb{R},\ \lambda \in \mathbb{R}^+$
with $\gamma_n \sim \mathcal{N}(0, 1)$ (random weights): Gaussian-shaped basis functions (with variance $\lambda^2/2$) everywhere on the real axis.
$f(x) = \sum_{i\in\mathbb{Z}} \int_i^{i+1} \gamma(s)\exp\!\left(-\frac{(x-s)^2}{\lambda^2}\right) ds = \int_{-\infty}^{\infty} \gamma(s)\exp\!\left(-\frac{(x-s)^2}{\lambda^2}\right) ds$
Mean: $\mathbb{E}[f(x)] = 0$. Covariance: $\mathrm{Cov}[f(x), f(x')] = \theta_1^2 \exp\!\left(-\frac{(x-x')^2}{2\lambda^2}\right)$
This is a GP with mean 0 and Gaussian covariance function for a suitable $\theta_1^2$.

References
[1] N. A. C. Cressie. Statistics for Spatial Data. Wiley-Interscience.
[2] M. P. Deisenroth and S. Mohamed. Expectation Propagation in Gaussian Process Dynamical Systems. In Advances in Neural Information Processing Systems.
[3] M. P. Deisenroth and J. W. Ng. Distributed Gaussian Processes. In Proceedings of the International Conference on Machine Learning, 2015.
[4] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian Process Dynamic Programming. Neurocomputing, 72(7-9).
[5] M. P. Deisenroth, R. Turner, M. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust Filtering and Smoothing with Gaussian Processes. IEEE Transactions on Automatic Control, 57(7).
[6] R. Frigola, F. Lindsten, T. B. Schön, and C. E. Rasmussen. Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems. Curran Associates, Inc.
[7] J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and A. Girard. Gaussian Process Model Based Predictive Control. In Proceedings of the 2004 American Control Conference (ACC 2004), Boston, MA, USA.
[8] A. Krause, A. Singh, and C. Guestrin. Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. Journal of Machine Learning Research, 9.
[9] J. R. Lloyd, D. Duvenaud, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Automatic Construction and Natural-Language Description of Nonparametric Regression Models. In AAAI Conference on Artificial Intelligence, pages 1-11, 2014.
[10] M. A. Osborne, S. J. Roberts, A. Rogers, S. D. Ramchurn, and N. R. Jennings. Towards Real-Time Information Processing of Sensor Network Data Using Computationally Efficient Multi-output Gaussian Processes. In Proceedings of the International Conference on Information Processing in Sensor Networks. IEEE Computer Society.
[11] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6(2).
[12] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, MA, USA.
[13] S. Roberts, M. A. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain. Gaussian Processes for Time Series Modelling. Philosophical Transactions of the Royal Society (Part A), 371(1984).


More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

MODEL BASED LEARNING OF SIGMA POINTS IN UNSCENTED KALMAN FILTERING. Ryan Turner and Carl Edward Rasmussen

MODEL BASED LEARNING OF SIGMA POINTS IN UNSCENTED KALMAN FILTERING. Ryan Turner and Carl Edward Rasmussen MODEL BASED LEARNING OF SIGMA POINTS IN UNSCENTED KALMAN FILTERING Ryan Turner and Carl Edward Rasmussen University of Cambridge Department of Engineering Trumpington Street, Cambridge CB PZ, UK ABSTRACT

More information

Tutorial on Gaussian Processes and the Gaussian Process Latent Variable Model

Tutorial on Gaussian Processes and the Gaussian Process Latent Variable Model Tutorial on Gaussian Processes and the Gaussian Process Latent Variable Model (& discussion on the GPLVM tech. report by Prof. N. Lawrence, 06) Andreas Damianou Department of Neuro- and Computer Science,

More information

Statistical Techniques in Robotics (16-831, F12) Lecture#20 (Monday November 12) Gaussian Processes

Statistical Techniques in Robotics (16-831, F12) Lecture#20 (Monday November 12) Gaussian Processes Statistical Techniques in Robotics (6-83, F) Lecture# (Monday November ) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan Applications of Gaussian Processes (a) Inverse Kinematics

More information

Introduction to Gaussian Process

Introduction to Gaussian Process Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

More information

Bayesian Deep Learning

Bayesian Deep Learning Bayesian Deep Learning Mohammad Emtiyaz Khan AIP (RIKEN), Tokyo http://emtiyaz.github.io emtiyaz.khan@riken.jp June 06, 2018 Mohammad Emtiyaz Khan 2018 1 What will you learn? Why is Bayesian inference

More information

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

More information

Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Advanced Introduction to Machine Learning CMU-10715 Gaussian Processes Barnabás Póczos http://www.gaussianprocess.org/ 2 Some of these slides in the intro are taken from D. Lizotte, R. Parr, C. Guesterin

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

STAT 518 Intro Student Presentation

STAT 518 Intro Student Presentation STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible

More information

Bayesian Interpretations of Regularization

Bayesian Interpretations of Regularization Bayesian Interpretations of Regularization Charlie Frogner 9.50 Class 15 April 1, 009 The Plan Regularized least squares maps {(x i, y i )} n i=1 to a function that minimizes the regularized loss: f S

More information

Probabilistic numerics for deep learning

Probabilistic numerics for deep learning Presenter: Shijia Wang Department of Engineering Science, University of Oxford rning (RLSS) Summer School, Montreal 2017 Outline 1 Introduction Probabilistic Numerics 2 Components Probabilistic modeling

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 89 Part II

More information

COMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation

COMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation COMP 55 Applied Machine Learning Lecture 2: Bayesian optimisation Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp55 Unless otherwise noted, all material posted

More information

Probabilistic Reasoning in Deep Learning

Probabilistic Reasoning in Deep Learning Probabilistic Reasoning in Deep Learning Dr Konstantina Palla, PhD palla@stats.ox.ac.uk September 2017 Deep Learning Indaba, Johannesburgh Konstantina Palla 1 / 39 OVERVIEW OF THE TALK Basics of Bayesian

More information

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55

More information

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric?

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Afternoon Meeting on Bayesian Computation 2018 University of Reading

Afternoon Meeting on Bayesian Computation 2018 University of Reading Gabriele Abbati 1, Alessra Tosi 2, Seth Flaxman 3, Michael A Osborne 1 1 University of Oxford, 2 Mind Foundry Ltd, 3 Imperial College London Afternoon Meeting on Bayesian Computation 2018 University of

More information

Least Squares Regression

Least Squares Regression CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

Machine Learning. Bayesian Regression & Classification. Marc Toussaint U Stuttgart

Machine Learning. Bayesian Regression & Classification. Marc Toussaint U Stuttgart Machine Learning Bayesian Regression & Classification learning as inference, Bayesian Kernel Ridge regression & Gaussian Processes, Bayesian Kernel Logistic Regression & GP classification, Bayesian Neural

More information

Lecture 5 September 19

Lecture 5 September 19 IFT 6269: Probabilistic Graphical Models Fall 2016 Lecture 5 September 19 Lecturer: Simon Lacoste-Julien Scribe: Sébastien Lachapelle Disclaimer: These notes have only been lightly proofread. 5.1 Statistical

More information

Relevance Vector Machines

Relevance Vector Machines LUT February 21, 2011 Support Vector Machines Model / Regression Marginal Likelihood Regression Relevance vector machines Exercise Support Vector Machines The relevance vector machine (RVM) is a bayesian

More information

Lecture 9. Time series prediction

Lecture 9. Time series prediction Lecture 9 Time series prediction Prediction is about function fitting To predict we need to model There are a bewildering number of models for data we look at some of the major approaches in this lecture

More information

Lecture 5: GPs and Streaming regression

Lecture 5: GPs and Streaming regression Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1 Recall: Non-parametric regression Input space X

More information

ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN ,

ESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN , Sparse Kernel Canonical Correlation Analysis Lili Tan and Colin Fyfe 2, Λ. Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong. 2. School of Information and Communication

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:

More information

BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS

BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS Oloyede I. Department of Statistics, University of Ilorin, Ilorin, Nigeria Corresponding Author: Oloyede I.,

More information

STK-IN4300 Statistical Learning Methods in Data Science

STK-IN4300 Statistical Learning Methods in Data Science STK-IN4300 Statistical Learning Methods in Data Science Riccardo De Bin debin@math.uio.no STK-IN4300: lecture 2 1/ 38 Outline of the lecture STK-IN4300 - Statistical Learning Methods in Data Science Linear

More information

Gaussian processes for inference in stochastic differential equations

Gaussian processes for inference in stochastic differential equations Gaussian processes for inference in stochastic differential equations Manfred Opper, AI group, TU Berlin November 6, 2017 Manfred Opper, AI group, TU Berlin (TU Berlin) inference in SDE November 6, 2017

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray School of Informatics, University of Edinburgh The problem Learn scalar function of vector values f(x).5.5 f(x) y i.5.2.4.6.8 x f 5 5.5 x x 2.5 We have (possibly

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

More information

Optimization of Gaussian Process Hyperparameters using Rprop

Optimization of Gaussian Process Hyperparameters using Rprop Optimization of Gaussian Process Hyperparameters using Rprop Manuel Blum and Martin Riedmiller University of Freiburg - Department of Computer Science Freiburg, Germany Abstract. Gaussian processes are

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

More information

Support Vector Machines and Bayes Regression

Support Vector Machines and Bayes Regression Statistical Techniques in Robotics (16-831, F11) Lecture #14 (Monday ctober 31th) Support Vector Machines and Bayes Regression Lecturer: Drew Bagnell Scribe: Carl Doersch 1 1 Linear SVMs We begin by considering

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

CS 7140: Advanced Machine Learning

CS 7140: Advanced Machine Learning Instructor CS 714: Advanced Machine Learning Lecture 3: Gaussian Processes (17 Jan, 218) Jan-Willem van de Meent (j.vandemeent@northeastern.edu) Scribes Mo Han (han.m@husky.neu.edu) Guillem Reus Muns (reusmuns.g@husky.neu.edu)

More information

Copula Regression RAHUL A. PARSA DRAKE UNIVERSITY & STUART A. KLUGMAN SOCIETY OF ACTUARIES CASUALTY ACTUARIAL SOCIETY MAY 18,2011

Copula Regression RAHUL A. PARSA DRAKE UNIVERSITY & STUART A. KLUGMAN SOCIETY OF ACTUARIES CASUALTY ACTUARIAL SOCIETY MAY 18,2011 Copula Regression RAHUL A. PARSA DRAKE UNIVERSITY & STUART A. KLUGMAN SOCIETY OF ACTUARIES CASUALTY ACTUARIAL SOCIETY MAY 18,2011 Outline Ordinary Least Squares (OLS) Regression Generalized Linear Models

More information

Predictive Variance Reduction Search

Predictive Variance Reduction Search Predictive Variance Reduction Search Vu Nguyen, Sunil Gupta, Santu Rana, Cheng Li, Svetha Venkatesh Centre of Pattern Recognition and Data Analytics (PRaDA), Deakin University Email: v.nguyen@deakin.edu.au

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

An Introduction to Statistical and Probabilistic Linear Models

An Introduction to Statistical and Probabilistic Linear Models An Introduction to Statistical and Probabilistic Linear Models Maximilian Mozes Proseminar Data Mining Fakultät für Informatik Technische Universität München June 07, 2017 Introduction In statistical learning

More information

Modelling Transcriptional Regulation with Gaussian Processes

Modelling Transcriptional Regulation with Gaussian Processes Modelling Transcriptional Regulation with Gaussian Processes Neil Lawrence School of Computer Science University of Manchester Joint work with Magnus Rattray and Guido Sanguinetti 8th March 7 Outline Application

More information

The Variational Gaussian Approximation Revisited

The Variational Gaussian Approximation Revisited The Variational Gaussian Approximation Revisited Manfred Opper Cédric Archambeau March 16, 2009 Abstract The variational approximation of posterior distributions by multivariate Gaussians has been much

More information

Variational Model Selection for Sparse Gaussian Process Regression

Variational Model Selection for Sparse Gaussian Process Regression Variational Model Selection for Sparse Gaussian Process Regression Michalis K. Titsias School of Computer Science University of Manchester 7 September 2008 Outline Gaussian process regression and sparse

More information