Probabilistic Graphical Models Lecture 20: Gaussian Processes

Probabilistic Graphical Models Lecture 20: Gaussian Processes Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 30, 2015 1 / 53

What is Machine Learning? Machine learning algorithms adapt with data versus having fixed decision rules. Machine learning aims not only to equip people with tools to analyse data, but to create algorithms which can learn and make decisions without human intervention. 1,2 In order for a model to automatically learn and make decisions, it must be able to discover patterns and extrapolate those patterns to new situations. 1 E.g., N.D. Lawrence (2010), What is Machine Learning? 2 T.M. Mitchell (2006), What is Machine Learning and Where Is it Headed? 2 / 53

Function Learning Example 700 Airline Passengers (Thousands) 600 500 400 300 200 Train 100 1949 1951 1953 1955 1957 1959 1961 Year 3 / 53

Function Learning Example 700 Airline Passengers (Thousands) 600 500 400 300 200 Train Human? 100 1949 1951 1953 1955 1957 1959 1961 Year 4 / 53

Function Learning Example 700 Airline Passengers (Thousands) 600 500 400 300 200 Train Alien? 100 1949 1951 1953 1955 1957 1959 1961 Year 5 / 53

Function Learning Example Airline Passengers (Thousands) 700 600 500 400 300 200 Train Human? 100 1949 1951 1953 1955 1957 1959 1961 Year 6 / 53

Function Learning Example Airline Passengers (Thousands) 700 600 500 400 300 200 Train Human? Alien? 100 1949 1951 1953 1955 1957 1959 1961 Year 7 / 53

Function Learning Example Airline Passengers (Thousands) 700 600 500 400 300 200 Train Alien? Test Human? 100 1949 1951 1953 1955 1957 1959 1961 Year 8 / 53

Building an Intelligent Model The ability for a model to learn from data depends on its: 1. Support: what solutions we think are a priori possible. 2. Inductive biases: what solutions we think are a priori likely. Examples: Function Learning, Character Recognition Human ability to make remarkable generalisations from data could derive from an expressive prior combined with Bayesian inference. 9 / 53

Statistics from Scratch Basic Regression Problem Training set of N targets (observations) y = (y(x 1 ),..., y(x N )) T. Observations evaluated at inputs X = (x 1,..., x N ) T. Want to predict the value of y(x ) at a test input x. For example: Given CO 2 concentrations y measured at times X, what will the CO 2 concentration be for x = 2024, 10 years from now? Just knowing high school math, what might you try? 10 / 53

Statistics from Scratch 700 Airline Passengers (Thousands) 600 500 400 300 200 100 1949 1951 1953 1955 1957 1959 1961 Year 11 / 53

Statistics from Scratch Guess the parametric form of a function that could fit the data f (x, w) = w T x [Linear function of w and x] f (x, w) = w T φ(x) Model) f (x, w) = g(w T φ(x)) [Linear function of w] (Linear Basis Function [Non-linear in x and w] (E.g., Neural Network) φ(x) is a vector of basis functions. For example, if φ(x) = (1, x, x 2 ) and x R 1 then f (x, w) = w 0 + w 1 x + w 2 x 2 is a quadratic function. Choose an error measure E(w), minimize with respect to w E(w) = N i=1 [f (x i, w) y(x i )] 2 12 / 53

Statistics from Scratch A probabilistic approach We could explicitly account for noise in our model. y(x) = f (x, w) + ɛ(x), where ɛ(x) is a noise function. One commonly takes ɛ(x) = N (0, σ 2 ) for i.i.d. additive Gaussian noise, in which case p(y(x) x, w, σ 2 ) = N (y(x); f (x, w), σ 2 ) Observation Model (1) N p(y x, w, σ 2 ) = N (y(x i ); f (x i, w), σ 2 ) Likelihood (2) i=1 Maximize the likelihood of the data p(y x, w, σ 2 ) with respect to σ 2, w. For a Gaussian noise model, this approach will make the same predictions as using a squared loss error function: log p(y X, w, σ 2 ) 1 2σ 2 N [f (x i, w) y(x i )] 2 (3) i=1 13 / 53

Statistics from Scratch The probabilistic approach helps us interpret the error measure in a deterministic approach, and gives us a sense of the noise level σ 2. Probabilistic methods thus provide an intuitive framework for representing uncertainty, and model development. Both approaches are prone to over-fitting for flexible f (x, w): low error on the training data, high error on the test set. Regularization Use a penalized log likelihood (or error function), such as model fit {}}{ complexity penalty log p(y X, w) 1 n {}}{ 2σ 2 (f (x i, w) y(x i ) 2 ) λw T w. (4) i=1 But how should we define complexity, and how much should we penalize complexity? Can set λ using cross-validation. 14 / 53

Bayesian Inference Bayes Rule p(a b) = p(b a)p(a)/p(b), p(a b) p(b a)p(a). (5) posterior = likelihood prior marginal likelihood, p(w y, X, σ2 ) = p(y X, w, σ2 )p(w) p(y X, σ 2 ) Predictive Distribution p(y x, y, X) = p(y x, w)p(w y, X)dw. (7). (6) Average of infinitely many models weighted by their posterior probabilities. No over-fitting, automatically calibrated complexity. Typically more interested in distribution over functions than in parameters w. 15 / 53

Representing Uncertainty Different types of uncertainty: Uncertainty through lack of knowledge Intrinsic uncertainty; e.g., radioactive decay. Uncertainty through lack of knowledge can seem like intrinsic uncertainty (e.g., rolling dice). Regardless of whether or not the universe is deterministic whether there is some underlying true answer we will always have uncertainty. We can represent this belief using probability distributions (Bayesian methods, probabilistic modelling). 16 / 53

Parametric Regression Review Deterministic E(w) = N (f (x i, w) y i ) 2. (8) i=1 Maximum Likelihood Bayesian p(y(x) x, w) = N (y(x); f (x, w), σn) 2, (9) N p(y X, w) = N (y(x i ); f (x i, w), σn) 2. (10) i=1 posterior = likelihood prior marginal likelihood, p(y X, w)p(w) p(w y, X) =. (11) p(y X) 17 / 53

Model Selection and Marginal Likelihood p(y M 1, X) = p(y f 1 (x, w))p(w)dw (13) Complex Model Simple Model Appropriate Model p(y M) y All Possible Datasets 18 / 53

Blackboard: Examples of Occam s Razor in Everyday Inferences For further reading, see MacKay (2003) textbook, Information Theory, Inference, and Learning Algorithms. 19 / 53

Occam s Razor Example -1, 3, 7, 11,??,?? H 1 : the sequence is an arithmetic progression, add n, where n is an integer. H 2 : the sequence is generated by a cubic function of the form cx 3 + dx 2 + e, where c, d, and e are fractions. ( 1 11 x3 + 9 11 x2 + 23 11 ) 20 / 53

Model Selection 2 1.5 Outputs, y(x) 1 0.5 0 0.5 1 0 20 40 60 80 100 Inputs, x Observations y(x). Assume p(y(x) f (x)) N (y(x); f (x), σ 2 ). Consider polynomials of different orders. As always, observations are out of the chosen model class! Which model should we choose? f 0 (x) = a 0, (14) f 1 (x) = a 0 + a 1 x, (15) f 2 (x) = a 0 + a 1 x + a 2 x 2, (16). (17) f J (x) = a 0 + a 1 x + a 2 x 2 + + a J x J. (18) 21 / 53

Model Selection: Occam s Hill 0.25 Marginal Likelihood (Evidence) 0.2 0.15 0.1 0.05 0 1 2 3 4 5 6 7 8 9 10 11 Model Order Marginal likelihood (evidence) as a function of model order, using an isotropic prior p(a) = N (0, σ 2 I). 22 / 53

Model Selection: Occam s Asymptote 0.25 Marginal Likelihood (Evidence) 0.2 0.15 0.1 0.05 0 1 2 3 4 5 6 7 8 9 10 11 Model Order Marginal likelihood (evidence) as a function of model order, using an anisotropic prior p(a i ) = N (0, γ i ), with γ learned from the data. 23 / 53

Occam s Razor 0.25 0.25 0.2 0.2 Marginal Likelihood (Evidence) 0.15 0.1 0.05 Marginal Likelihood (Evidence) 0.15 0.1 0.05 0 1 2 3 4 5 6 7 8 9 10 11 Model Order 0 1 2 3 4 5 6 7 8 9 10 11 Model Order (a) Isotropic Gaussian Prior (b) Anisotropic Gaussian Prior For further reading, see Rasmussen and Ghahramani (2001) (Occam s Razor) and Kass and Raftery (1995) (Bayes Factors) 24 / 53

Linear Basis Models Consider the simple linear model, f (x) = a 0 + a 1 x, (19) a 0, a 1 N (0, 1). (20) 25 20 15 10 Output, f(x) 5 0 5 10 15 20 25 10 8 6 4 2 0 2 4 6 8 10 Input, x 25 / 53

Linear Models We are interested in the induced distribution over functions, not the parameters... Let s characterise the properties of these functions directly: f (x a 0, a 1 ) = a 0 + a 1 x, a 0, a 1 N (0, 1). (21) E[f (x)] = E[a 0 ] + E[a 1 ]x = 0. (22) cov[f (x b ), f (x c )] = E[f (x b )f (x c )] E[f (x b )]E[f (x c )] (23) = E[a 2 0 + a 0 a 1 (x b + x c ) + a 2 1x b x c ] 0 (24) = E[a 2 0] + E[a 2 1x b x c ] + E[a 0 a 1 (x b + x c )] (25) = 1 + x b x c + 0 (26) = 1 + x b x c. (27) 26 / 53

Linear Models Therefore any collection of values has a joint Gaussian distribution [f (x 1 ),..., f (x N )] N (0, K), (28) By definition, f (x) is a Gaussian process. Definition K ij = cov(f (x i ), f (x j )) = k(x i, x j ) = 1 + x b x c. (29) A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution. We write f (x) GP(m, k) to mean [f (x 1 ),..., f (x N )] N (µ, K) (30) µ i = m(x i ) (31) K ij = k(x i, x j ), (32) for any collection of input values x 1,..., x N. In other words, f is a GP with mean function m(x) and covariance kernel k(x i, x j ). 27 / 53

Linear Basis Function Models Model Specification f (x, w) = w T φ(x) (33) p(w) = N (0, Σ w ) (34) Moments of Induced Distribution over Functions E[f (x, w)] = m(x) = E[w T ]φ(x) = 0 (35) cov(f (x i ), f (x j )) = k(x i, x j ) = E[f (x i )f (x j )] E[f (x i )]E[f (x j )] (36) = φ(x i ) T E[ww T ]φ(x j ) 0 (37) = φ(x i ) T Σ w φ(x j ) (38) f (x, w) is a Gaussian process, f (x) N (m, k) with mean function m(x) = 0 and covariance kernel k(x i, x j ) = φ(x i ) T Σ w φ(x j ). The entire basis function model of Eqs. (33) and (34) is encapsulated as a distribution over functions with kernel k(x, x ). 28 / 53

Gaussian Processes We are ultimately more interested in and have stronger intuitions about the functions that model our data than weights w in a parametric model, and we can express those intuitions using a covariance kernel. The kernel controls the support and inductive biases of our model, and thus its ability to generalise. 29 / 53

Example: RBF Kernel k RBF (x, x ) = cov(f (x), f (x )) = a 2 exp( x x 2 Far and above the most popular kernel. 2l 2 ) (39) Expresses the intuition that function values at nearby inputs are more correlated than function values at far away inputs. The kernel hyperparameters a and l control amplitudes and wiggliness of these functions. GPs with an RBF kernel have large support and are universal approximators. 30 / 53

Sampling from a GP with an RBF Kernel x = [-10:0.2:10] ; % inputs (where we query the GP) N = numel(x); % number of inputs K = zeros(n,n); % covariance matrix % very inefficient way of creating K in Matlab for i=1:n for j=1:n K(i,j) = k_rbf(x(i),x(j)); end end K = K + 1e-6*eye(N); CK = chol(k); f = CK *randn(n,1); % add jitter for conditioning % draws from N(0,K) plot(x,f); 31 / 53

Samples from a GP with an RBF Kernel 3 2 1 output, f(x) 0 1 2 3 10 8 6 4 2 0 2 4 6 8 input, x 32 / 53

1D RBF Kernel with Different Length-scales k RBF (x, x ) = cov(f (x), f (x )) = a 2 exp( x x 2 2l 2 ) (40) 1 0.9 0.8 0.7 SE kernels with Different Length scales l=7 l = 0.7 l = 2.28 0.6 k(τ) 0.5 0.4 0.3 0.2 0.1 0 0 2 4 6 8 10 12 14 16 18 20 τ Figure: SE kernels with different length-scales, as a function of τ = x x. 33 / 53

RBF Kernel Covariance Matrix k RBF (x, x ) = cov(f (x), f (x )) = a 2 exp( x x 2 2l 2 ) (41) The covariance matrix K for ordered inputs on a 1D grid. K ij = k RBF (x i, x j ). 20 40 60 80 100 20 40 60 34 / 53

Gaussian Process Inference Observed noisy data y = (y(x 1 ),..., y(x N )) T at input locations X. Start with the standard regression assumption: N (y(x); f (x), σ 2 ). Place a Gaussian process distribution over noise free functions f (x) GP(0, k θ ). The kernel k is parametrized by θ. Infer p(f y, X, X ) for the noise free function f evaluated at test points X. Joint distribution [ ] ( [ y Kθ (X, X) + σ 2 I K θ (X, X ) N 0, K θ (X, X) K θ (X, X ) f Conditional predictive distribution ]). (42) f X, X, y, θ N ( f, cov(f )), (43) f = K θ (X, X)[K θ (X, X) + σ 2 I] 1 y, (44) cov(f ) = K θ (X, X ) K θ (X, X)[K θ (X, X) + σ 2 I] 1 K θ (X, X ). (45) 35 / 53

Inference using an RBF kernel Specify f (x) GP(0, k). Choose k RBF (x, x ) = a 2 0 exp( x x 2 ). Choose values for a 0 and l 0. Observe data, look at the prior and posterior over functions. 2l 2 0 4 Samples from GP Prior 4 Samples from GP Posterior 3 3 2 2 Output, f(x) 1 0 1 Output, f(x) 1 0 1 2 2 3 3 4 10 5 0 5 10 Input, x (a) 4 10 5 0 5 10 Input, x (b) Does something look strange about these functions? 36 / 53

Inference using an RBF kernel Increase the length-scale l. 4 Samples from GP Prior 3 2 Output, f(x) 3 2 1 0 1 Output, f(x) 1 0 1 2 3 2 3 4 10 5 4 10 5 0 5 10 Input, x (a) (b) 37 / 53

Learning and Model Selection p(m i y) = p(y M i)p(m i ) (46) p(y) We can write the evidence of the model as p(y M i ) = p(y f, M i )p(f)df, (47) Complex Model Simple Model Appropriate Model 4 3 2 Data Simple Complex Appropriate p(y M) Output, f(x) 1 0 1 2 3 y All Possible Datasets (a) 4 10 8 6 4 2 0 2 4 6 8 10 Input, x (b) 38 / 53

Learning and Model Selection We can integrate away the entire Gaussian process f (x) to obtain the marginal likelihood, as a function of kernel hyperparameters θ alone. p(y θ, X) = p(y f, X)p(f θ, X)df. (48) model fit {}}{ log p(y θ, X) = 1 2 yt (K θ + σ 2 I) 1 y complexity penalty {}}{ 1 2 log K θ + σ 2 I N 2 log(2π). (49) An extremely powerful mechanism for kernel learning. 4 Samples from GP Prior 4 Samples from GP Posterior 3 3 2 2 Output, f(x) 1 0 1 Output, f(x) 1 0 1 2 2 3 3 4 10 5 0 5 10 Input, x 4 10 5 0 5 10 Input, x 39 / 53

Learning and Model Selection A fully Bayesian treatment would integrate away kernel hyperparameters θ. p(f X, X, y) = p(f X, X, y, θ)p(θ y)dθ (50) For example, we could specify a prior p(θ), use MCMC to take J samples from p(θ y) p(y θ)p(θ), and then find p(f X, X, y) 1 J J p(f X, X, y, θ (i) ), θ (i) p(θ y). (51) i=1 If we have a non-gaussian noise model, and thus cannot integrate away f, the strong dependencies between Gaussian process f and hyperparameters θ make sampling extremely difficult. In my experience, the most effective solution is to use a deterministic approximation for the posterior p(f y) which enables one to work with an approximate marginal likelihood. 40 / 53

Gaussian Process Covariance Kernels Let τ = x x : k SE (τ) = exp( 0.5τ 2 /l 2 ) (52) 3τ 3τ k MA (τ) = a(1 + ) exp( ) (53) l l k RQ (τ) = (1 + τ 2 2 α l 2 ) α (54) k PE (τ) = exp( 2 sin 2 (π τ ω)/l 2 ) (55) 41 / 53

Inference and Learning 1. Learning: Optimize marginal likelihood, model fit {}}{ log p(y θ, X) = 1 2 yt (K θ + σ 2 I) 1 y with respect to kernel hyperparameters θ. complexity penalty {}}{ 1 2 log K θ + σ 2 I N 2 log(2π), 2. Inference: Conditioned on kernel hyperparameters θ, form the predictive distribution for test inputs X : f X, X, y, θ N ( f, cov(f )), f = K θ (X, X)[K θ (X, X) + σ 2 I] 1 y, cov(f ) = K θ (X, X ) K θ (X, X)[K θ (X, X) + σ 2 I] 1 K θ (X, X ). 42 / 53

Gaussian process graphical model Squared are observed, circles are latent, the thick bar is a set of fully connected nodes. Each y i is conditionally independent given f i. Because of the marginalization property of a GP, addition of further inputs x and unobserved targets y does not change the distribution of any other variables. Figure from GPML, Rasmussen and Williams (2006) 43 / 53

Worked Example: Combining Kernels, CO 2 Data CO 2 Concentration (ppm) 400 380 360 340 320 1968 1977 1986 1995 2004 Year Example from Rasmussen and Williams (2006), Gaussian Processes for Machine Learning. 44 / 53

Worked Example: Combining Kernels, CO 2 Data 45 / 53

Worked Example: Combining Kernels, CO 2 Data Long rising trend: k 1 (x p, x q ) = θ 2 1 exp ( (xp xq)2 2θ2 2 Quasi-periodic seasonal changes: ( k 2 (x p, x q ) = k RBF (x p, x q )k PER (x p, x q ) = θ3 2 exp (xp xq) Multi-scale medium term irregularities: ) k 3 (x p, x q ) = θ6 (1 2 θ8 + (xp xq)2 2θ 8θ7 2 ( Correlated and i.i.d. noise: k 4 (x p, x q ) = θ9 2 exp 2θ 2 4 ) 2 sin2 (π(x p x q)) θ 2 5 (xp xq)2 2θ10 2 k total (x p, x q ) = k 1 (x p, x q ) + k 2 (x p, x q ) + k 3 (x p, x q ) + k 4 (x p, x q ) ) ) + θ 2 11 δ pq 46 / 53

Worked Example: Combining Kernels, CO 2 Data Hand crafted a kernel combination to perform extrapolation Confidence in the extrapolation is high (suggests that model is well specified). Can interpret the learned kernel hyperparameters θ to learn information about our dataset. A lot of the interesting pattern recognition has been done by a human in this example. We would like to completely automate this modelling procedure. 47 / 53

Non-Gaussian Likelihoods We can no longer analytically integrate away the Gaussian process. But we can use a simple Monte carlo sum: p(f y, X, x ) = p(f f, x )p(f y)df 1 J J p(f f (j), x ), j=1 f (j) p(f y) But how do we sample from p(f y)? 48 / 53

Non-Gaussian Likelihoods But what about hyperparameters? It s easy to implement Gibbs sampling: p(f y, θ) p(y f)p(f θ) (56) p(θ f, y) p(f θ)p(θ). (57) But this won t work because of strong correlations between f and θ. 50 / 53

Non-Gaussian Likelihoods But what about hyperparameters? It s easy to implement Gibbs sampling: p(f y, θ) p(y f)p(f θ) (58) p(θ f, y) p(f θ)p(θ). (59) But this won t work because of strong correlations between f and θ. Transform into a whitened space, f = Lν, and sample from ν and θ, which decouples correlations. 51 / 53

Non-Gaussian Likelihoods But what about hyperparameters? It s easy to implement Gibbs sampling: p(f y, θ) p(y f)p(f θ) (60) p(θ f, y) p(f θ)p(θ). (61) But this won t work because of strong correlations between f and θ. Transform into a whitened space, f = Lν, and sample from ν and θ, which decouples correlations. Use a deterministic approach to approximately integrate away f to access a marginal likelihood, conditioned only on kernel hyperparameters θ: p(y θ) = p(y f)p(f θ)df (62) The Laplace approximation, for example, approximates p(f y) as a Gaussian. 52 / 53

Readings for Next Time C. Rasmussen and C. Williams, GPML, Ch. 4, 5 Y. Saatchi, PhD Thesis, 2011. Chapter 5 J. Candela and C.E. Rasmussen, A unifying view of sparse approximation Gaussian process regression, JMLR 2005. A.G. Wilson and R.P. Adams. Gaussian process kernels for pattern discovery and extrapolation, ICML 2013. 53 / 53