Probabilistic Graphical Models Lecture 21: Advanced Gaussian Processes


1 Probabilistic Graphical Models Lecture 21: Advanced Gaussian Processes. Andrew Gordon Wilson, Carnegie Mellon University, April 1. 1 / 100

2 Gaussian process review
Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.
Nonparametric regression model. Prior: f(x) ~ GP(m(x), k(x, x')), meaning (f(x_1), ..., f(x_N)) ~ N(µ, K), with µ_i = m(x_i) and K_ij = cov(f(x_i), f(x_j)) = k(x_i, x_j).
GP posterior ∝ likelihood × GP prior: p(f(x) | D) ∝ p(D | f(x)) p(f(x)).
[Figure: Gaussian process sample prior functions and sample posterior functions, output f(t).]
2 / 100

3 Gaussian Process Inference
Observed noisy data y = (y(x_1), ..., y(x_N))^T at input locations X. Start with the standard regression assumption: N(y(x); f(x), σ²). Place a Gaussian process distribution over noise-free functions, f(x) ~ GP(0, k_θ). The kernel k is parametrized by θ. Infer p(f_* | y, X, X_*) for the noise-free function f_* evaluated at test points X_*.
Joint distribution:
[y, f_*] ~ N(0, [[K_θ(X, X) + σ²I, K_θ(X, X_*)], [K_θ(X_*, X), K_θ(X_*, X_*)]]). (1)
Conditional predictive distribution:
f_* | X_*, X, y, θ ~ N(f̄_*, cov(f_*)), (2)
f̄_* = K_θ(X_*, X)[K_θ(X, X) + σ²I]^{-1} y, (3)
cov(f_*) = K_θ(X_*, X_*) - K_θ(X_*, X)[K_θ(X, X) + σ²I]^{-1} K_θ(X, X_*). (4)
3 / 100
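A minimal numpy sketch of the predictive equations (2)-(4), assuming an RBF kernel and made-up 1-D data; the kernel function, lengthscale, and noise level below are illustrative choices, not taken from the slides.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, amplitude=1.0):
    # Squared-exponential kernel between two sets of 1-D inputs.
    d = X1[:, None] - X2[None, :]
    return amplitude**2 * np.exp(-0.5 * d**2 / lengthscale**2)

# Toy data (illustrative only).
X = np.linspace(0, 5, 20)
y = np.sin(X) + 0.1 * np.random.randn(20)
Xs = np.linspace(0, 6, 100)                    # test inputs X_*
noise = 0.1**2

K = rbf_kernel(X, X) + noise * np.eye(len(X))  # K_theta(X, X) + sigma^2 I
Ks = rbf_kernel(Xs, X)                         # K_theta(X_*, X)
Kss = rbf_kernel(Xs, Xs)                       # K_theta(X_*, X_*)

# Predictive mean and covariance, Eqs. (3) and (4), via a Cholesky solve.
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
f_mean = Ks @ alpha
v = np.linalg.solve(L, Ks.T)
f_cov = Kss - v.T @ v
```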

4 Learning and Model Selection
p(M_i | y) = p(y | M_i) p(M_i) / p(y) (5)
We can write the evidence of the model as
p(y | M_i) = ∫ p(y | f, M_i) p(f) df. (6)
[Figure: (a) p(y | M) over all possible datasets for simple, complex, and appropriate models; (b) data y and model fits, output f(x) vs input x.]
4 / 100

5 Learning and Model Selection
We can integrate away the entire Gaussian process f(x) to obtain the marginal likelihood, as a function of kernel hyperparameters θ alone:
p(y | θ, X) = ∫ p(y | f, X) p(f | θ, X) df. (7)
log p(y | θ, X) = -(1/2) y^T (K_θ + σ²I)^{-1} y [model fit] - (1/2) log|K_θ + σ²I| [complexity penalty] - (N/2) log(2π). (8)
An extremely powerful mechanism for kernel learning.
[Figure: samples from the GP prior and samples from the GP posterior, output f(x) vs input x.]
5 / 100

6 Inference and Learning
1. Learning: Optimize the marginal likelihood,
log p(y | θ, X) = -(1/2) y^T (K_θ + σ²I)^{-1} y [model fit] - (1/2) log|K_θ + σ²I| [complexity penalty] - (N/2) log(2π),
with respect to kernel hyperparameters θ.
2. Inference: Conditioned on kernel hyperparameters θ, form the predictive distribution for test inputs X_*:
f_* | X_*, X, y, θ ~ N(f̄_*, cov(f_*)),
f̄_* = K_θ(X_*, X)[K_θ(X, X) + σ²I]^{-1} y,
cov(f_*) = K_θ(X_*, X_*) - K_θ(X_*, X)[K_θ(X, X) + σ²I]^{-1} K_θ(X, X_*).
6 / 100
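A sketch of the marginal likelihood evaluation in step 1; it works for any kernel matrix K (for example the rbf_kernel from the earlier sketch). In practice θ would be optimized with gradients of this quantity, which is not shown here.

```python
import numpy as np

def log_marginal_likelihood(y, K, noise_var):
    # log p(y | theta, X) = -1/2 y^T (K + s^2 I)^{-1} y - 1/2 log|K + s^2 I| - N/2 log(2 pi)
    N = len(y)
    Ky = K + noise_var * np.eye(N)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    model_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))   # equals -1/2 log|K + s^2 I|
    return model_fit + complexity - 0.5 * N * np.log(2 * np.pi)

# Example: log_marginal_likelihood(y, rbf_kernel(X, X), 0.1**2)
```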

7 Learning and Model Selection
A fully Bayesian treatment would integrate away kernel hyperparameters θ:
p(f_* | X_*, X, y) = ∫ p(f_* | X_*, X, y, θ) p(θ | y) dθ. (9)
For example, we could specify a prior p(θ), use MCMC to take J samples from p(θ | y) ∝ p(y | θ) p(θ), and then find
p(f_* | X_*, X, y) ≈ (1/J) Σ_{i=1}^J p(f_* | X_*, X, y, θ^(i)), θ^(i) ~ p(θ | y). (10)
If we have a non-Gaussian noise model, and thus cannot integrate away f, the strong dependencies between the Gaussian process f and the hyperparameters θ make sampling extremely difficult. In my experience, the most effective solution is to use a deterministic approximation for the posterior p(f | y), which enables one to work with an approximate marginal likelihood.
7 / 100

8 Popular Kernels
Let τ = x - x':
k_SE(τ) = exp(-0.5 τ²/l²) (11)
k_MA(τ) = a(1 + √3 |τ|/l) exp(-√3 |τ|/l) (12)
k_RQ(τ) = (1 + τ²/(2αl²))^{-α} (13)
k_PE(τ) = exp(-2 sin²(π τ ω)/l²) (14)
8 / 100
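The four kernels above, written as plain numpy functions of τ; the default hyperparameter values are arbitrary and only for illustration.

```python
import numpy as np

def k_se(tau, l=1.0):
    return np.exp(-0.5 * tau**2 / l**2)

def k_ma(tau, a=1.0, l=1.0):
    # Matern-3/2 form used on this slide.
    r = np.sqrt(3) * np.abs(tau) / l
    return a * (1 + r) * np.exp(-r)

def k_rq(tau, alpha=1.0, l=1.0):
    return (1 + tau**2 / (2 * alpha * l**2))**(-alpha)

def k_pe(tau, omega=1.0, l=1.0):
    return np.exp(-2 * np.sin(np.pi * np.abs(tau) * omega)**2 / l**2)
```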

9 Worked Example: Combining Kernels, CO2 Data
[Figure: CO2 concentration (ppm) vs year.]
Example from Rasmussen and Williams (2006), Gaussian Processes for Machine Learning. 9 / 100

10 Worked Example: Combining Kernels, CO 2 Data 10 / 100

11 Worked Example: Combining Kernels, CO2 Data
Long rising trend: k_1(x_p, x_q) = θ_1² exp(-(x_p - x_q)²/(2θ_2²))
Quasi-periodic seasonal changes: k_2(x_p, x_q) = k_RBF(x_p, x_q) k_PER(x_p, x_q) = θ_3² exp(-(x_p - x_q)²/(2θ_4²) - 2 sin²(π(x_p - x_q))/θ_5²)
Multi-scale medium term irregularities: k_3(x_p, x_q) = θ_6² (1 + (x_p - x_q)²/(2θ_8 θ_7²))^{-θ_8}
Correlated and i.i.d. noise: k_4(x_p, x_q) = θ_9² exp(-(x_p - x_q)²/(2θ_10²)) + θ_11² δ_pq
k_total(x_p, x_q) = k_1(x_p, x_q) + k_2(x_p, x_q) + k_3(x_p, x_q) + k_4(x_p, x_q)
11 / 100
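A hedged sketch of the composite kernel k_total above, with hyperparameter values invented purely for illustration (the slide gives no numerical values); the function name and the array-based parameterization are conventions of this sketch.

```python
import numpy as np

def co2_kernel(xp, xq, th, same_inputs=False):
    # th = (theta_1, ..., theta_11); the i.i.d. noise term delta_pq is only
    # added when the kernel is evaluated on a set of inputs against itself.
    d = xp[:, None] - xq[None, :]
    k1 = th[0]**2 * np.exp(-d**2 / (2 * th[1]**2))                    # long rising trend
    k2 = th[2]**2 * np.exp(-d**2 / (2 * th[3]**2)
                           - 2 * np.sin(np.pi * d)**2 / th[4]**2)     # quasi-periodic seasonal changes
    k3 = th[5]**2 * (1 + d**2 / (2 * th[7] * th[6]**2))**(-th[7])     # medium-term irregularities
    k4 = th[8]**2 * np.exp(-d**2 / (2 * th[9]**2))                    # correlated noise
    K = k1 + k2 + k3 + k4
    if same_inputs:
        K = K + th[10]**2 * np.eye(len(xp))                           # i.i.d. noise
    return K

x = np.linspace(1958, 2000, 100)                                      # toy yearly inputs
theta = np.array([60, 90, 2, 100, 1.3, 0.7, 1.2, 0.8, 0.2, 1.6, 0.2]) # made-up values
K = co2_kernel(x, x, theta, same_inputs=True)
```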

12 What is a kernel?
Informally, k describes the similarities between pairs of data points. For example, far away points may be considered less similar than nearby points.
K_ij = ⟨φ(x_i), φ(x_j)⟩, and so tells us the overlap between the features (basis functions) φ(x_i) and φ(x_j).
We have seen that all linear basis function models f(x) = w^T φ(x), with p(w) = N(0, Σ_w), correspond to Gaussian processes with kernel k(x, x') = φ(x)^T Σ_w φ(x').
We have also accumulated some experience with the RBF kernel k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)).
The kernel controls the generalisation behaviour of a kernel machine. For example, a kernel controls the support and inductive biases of a Gaussian process: which functions are a priori likely.
A kernel is also known as a covariance function or covariance kernel in the context of Gaussian processes.
12 / 100

13 Candidate Kernel
k(x, x') = 1 if |x - x'| ≤ 1, and 0 otherwise.
Symmetric. Provides information about proximity of points.
Exercise: Is it a valid kernel? 13 / 100

14 Candidate Kernel
k(x, x') = 1 if |x - x'| ≤ 1, and 0 otherwise.
Try the points x_1 = 1, x_2 = 2, x_3 = 3. Compute the kernel matrix K. (15)
14 / 100

15 Candidate Kernel
k(x, x') = 1 if |x - x'| ≤ 1, and 0 otherwise.
Try the points x_1 = 1, x_2 = 2, x_3 = 3. Compute the kernel matrix
K = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]. (16)
The eigenvalues of K are (√2 + 1), 1, and (1 - √2). Therefore K is not positive semidefinite.
15 / 100
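A quick numerical check of the eigenvalue claim, using numpy:

```python
import numpy as np

# Kernel matrix for x = 1, 2, 3 under k(x, x') = 1 if |x - x'| <= 1, else 0.
K = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
# Approximately [-0.414, 1.0, 2.414]: one negative eigenvalue, so K is not PSD.
print(np.linalg.eigvalsh(K))
```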

16 Representer Theorem
A decision function f(x) can be written as
f(x) = ⟨w, φ(x)⟩ = Σ_{i=1}^N α_i ⟨φ(x_i), φ(x)⟩ = Σ_{i=1}^N α_i k(x_i, x). (17)
The representer theorem says this function exists with finitely many coefficients α_i even when φ is infinite dimensional (an infinite number of basis functions).
Initially viewed as a strength of kernel methods, for datasets not exceeding e.g. ten thousand points. Unfortunately, the number of nonzero α_i often grows linearly in the size of the training set N.
Example: In GP regression, the predictive mean is
E[f_* | y, X, x_*] = k_*^T (K + σ²I)^{-1} y = Σ_{i=1}^N α_i k(x_i, x_*), (18)
where α = (K + σ²I)^{-1} y.
16 / 100

17 Making new kernels from old
Suppose k_1(x, x') and k_2(x, x') are valid. Then the following covariance functions are also valid:
k(x, x') = g(x) k_1(x, x') g(x') (19)
k(x, x') = q(k_1(x, x')) (20)
k(x, x') = exp(k_1(x, x')) (21)
k(x, x') = k_1(x, x') + k_2(x, x') (22)
k(x, x') = k_1(x, x') k_2(x, x') (23)
k(x, x') = k_3(φ(x), φ(x')) (24)
k(x, x') = x^T A x' (25)
k(x, x') = k_a(x_a, x_a') + k_b(x_b, x_b') (26)
k(x, x') = k_a(x_a, x_a') k_b(x_b, x_b') (27)
where g is any function, q is a polynomial with nonnegative coefficients, φ(x) is a function from x to R^M, k_3 is a valid covariance function in R^M, A is a symmetric positive definite matrix, x_a and x_b are not necessarily disjoint variables with x = (x_a, x_b)^T, and k_a and k_b are valid kernels in their respective spaces.
17 / 100

18 Stationary Kernels
A stationary kernel is invariant to translations of the input space. Equivalently, k = k(x - x') = k(τ).
All distance kernels, k = k(||x - x'||), are examples of stationary kernels.
The RBF kernel k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)) is a stationary kernel.
The polynomial kernel k_POL(x, x') = (x^T x' + σ_0²)^p is an example of a non-stationary kernel.
Stationarity provides a useful inductive bias.
18 / 100

19 Bochner's Theorem
Theorem (Bochner). A complex-valued function k on R^P is the covariance function of a weakly stationary mean square continuous complex-valued random process on R^P if and only if it can be represented as
k(τ) = ∫_{R^P} e^{2πi s^T τ} ψ(ds), (28)
where ψ is a positive finite measure.
If ψ has a density S(s), then S is called the spectral density or power spectrum of k, and k and S are Fourier duals:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (29)
S(s) = ∫ k(τ) e^{-2πi s^T τ} dτ. (30)
19 / 100

20 Review: Linear Basis Function Models
Model specification:
f(x, w) = w^T φ(x) (31)
p(w) = N(0, Σ_w) (32)
Moments of the induced distribution over functions:
E[f(x, w)] = m(x) = E[w^T] φ(x) = 0 (33)
cov(f(x_i), f(x_j)) = k(x_i, x_j) = E[f(x_i) f(x_j)] - E[f(x_i)] E[f(x_j)] (34)
= φ(x_i)^T E[w w^T] φ(x_j) - 0 (35)
= φ(x_i)^T Σ_w φ(x_j) (36)
f(x, w) is a Gaussian process, f(x) ~ GP(m, k), with mean function m(x) = 0 and covariance kernel k(x_i, x_j) = φ(x_i)^T Σ_w φ(x_j).
The entire basis function model of Eqs. (31) and (32) is encapsulated as a distribution over functions with kernel k(x, x').
20 / 100

21 Deriving the RBF Kernel
Start with the basis model
f(x) = Σ_{i=1}^J w_i φ_i(x), (37)
w_i ~ N(0, σ²/J), (38)
φ_i(x) = exp(-(x - c_i)²/(2l²)). (39)
Equations (37)-(39) define a radial basis function regression model, with radial basis functions centred at the points c_i.
Using our result for the kernel of a generalised linear model,
k(x, x') = (σ²/J) Σ_{i=1}^J φ_i(x) φ_i(x'). (40)
21 / 100

22 Deriving the RBF Kernel
f(x) = Σ_{i=1}^J w_i φ_i(x), w_i ~ N(0, σ²/J), φ_i(x) = exp(-(x - c_i)²/(2l²)) (41)
k(x, x') = (σ²/J) Σ_{i=1}^J φ_i(x) φ_i(x') (42)
Letting c_{i+1} - c_i = Δc = 1/J, and J → ∞, the kernel in Eq. (42) becomes a Riemann sum:
k(x, x') = lim_{J→∞} (σ²/J) Σ_{i=1}^J φ_i(x) φ_i(x') = σ² ∫_{c_0}^{c_∞} φ_c(x) φ_c(x') dc. (43)
By setting c_0 = -∞ and c_∞ = ∞, we spread the infinitely many basis functions across the whole real line, each a distance Δc → 0 apart:
k(x, x') = σ² ∫_{-∞}^{∞} exp(-(x - c)²/(2l²)) exp(-(x' - c)²/(2l²)) dc (44)
= √π l σ² exp(-(x - x')²/(4l²)). (45)
22 / 100
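A numerical check of the limit in Eqs. (44)-(45), assuming a dense but finite grid of basis-function centres standing in for the J → ∞ limit; the particular inputs and hyperparameters are arbitrary.

```python
import numpy as np

sigma2, l = 1.0, 0.5
x, xp = 0.3, 1.1

# Finite RBF basis: centres on a dense grid covering the inputs generously.
centres = np.linspace(-10, 10, 20001)
dc = centres[1] - centres[0]
phi  = np.exp(-(x  - centres)**2 / (2 * l**2))
phip = np.exp(-(xp - centres)**2 / (2 * l**2))
k_finite = sigma2 * dc * np.sum(phi * phip)      # Riemann-sum approximation of Eq. (44)

# Analytic limit, Eq. (45).
k_limit = np.sqrt(np.pi) * l * sigma2 * np.exp(-(x - xp)**2 / (4 * l**2))
print(k_finite, k_limit)                         # should agree to several decimal places
```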

23 Deriving the RBF Kernel
It is remarkable that we can work with infinitely many basis functions with finite amounts of computation using the kernel trick: replacing inner products of basis functions with kernels.
The RBF kernel, also known as the Gaussian or squared exponential kernel, is by far the most popular kernel: k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)).
Recall Bochner's theorem. If we take the Fourier transform of the RBF kernel we recover a Gaussian spectral density, S(s) = (2πl²)^{D/2} exp(-2π²l²s²) for x ∈ R^D. Therefore the RBF kernel does not have much support for high frequency functions, since a Gaussian does not have heavy tails.
Functions drawn from a GP with an RBF kernel are infinitely differentiable. For this reason, the RBF kernel is accused of being overly smooth and unrealistic. Nonetheless it has nice theoretical properties.
23 / 100

24 The RBF Kernel
[Figure: SE kernels with different length-scales (legend includes l = 7 and l = 0.7), plotted as k(τ) against τ = x - x'.]
24 / 100

25 Representer Theorem
A decision function f(x) can be written as
f(x) = ⟨w, φ(x)⟩ = Σ_{i=1}^N α_i ⟨φ(x_i), φ(x)⟩ = Σ_{i=1}^N α_i k(x_i, x). (46)
The representer theorem says this function exists with finitely many coefficients α_i even when φ is infinite dimensional (an infinite number of basis functions).
Initially viewed as a strength of kernel methods, for datasets not exceeding e.g. ten thousand points. Unfortunately, the number of nonzero α_i often grows linearly in the size of the training set N.
Example: In GP regression, the predictive mean is
E[f_* | y, X, x_*] = k_*^T (K + σ²I)^{-1} y = Σ_{i=1}^N α_i k(x_i, x_*), (47)
where α = (K + σ²I)^{-1} y.
25 / 100

26 Polynomial Kernel
We have already shown that the simple linear model
f(x, w) = w^T x + b, (48)
p(w) = N(0, α²I), (49)
p(b) = N(0, β²), (50)
corresponds to a Gaussian process with kernel
k_LIN(x, x') = α² x^T x' + β². (51)
Samples from a GP with k_LIN(x, x') will thus be straight lines.
Recall that the product of two kernels is a valid kernel. The product of two linear kernels is a quadratic kernel, which gives rise to quadratic functions:
k_QUAD(x, x') = k_LIN(x, x') k_LIN(x, x'). (52)
For example, if β = 0, α = 1, and x ∈ R², then k_QUAD(x, x') = φ(x)^T φ(x') with φ(x) = (x_1², x_2², √2 x_1 x_2)^T, where x = (x_1, x_2).
We can generalize to the polynomial kernel
k_POL(x, x') = (α² x^T x' + β²)^p. (53)
26 / 100

27 The Rational Quadratic Kernel What if we want data varying at multiple scales? 27 / 100

28 The Rational Quadratic Kernel
Try a scale mixture of RBF kernels. Let r = ||x - x'||:
k(r) = ∫ exp(-r²/(2l²)) p(l) dl.
For example, we can consider a Gamma density for p(l). Letting γ = l^{-2}, g(γ | α, β) ∝ γ^{α-1} exp(-αγ/β), with β^{-1} = l², the rational quadratic (RQ) kernel is derived as
k_RQ(r) = ∫_0^∞ k_RBF(r | γ) g(γ | α, β) dγ = (1 + r²/(2αl²))^{-α}. (54)
One could derive other interesting covariance functions using different (non-Gamma) functions for p(l).
28 / 100
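A Monte Carlo check that a Gamma scale mixture of RBF kernels reproduces the RQ form; the parameterization (γ = l^{-2} with a Gamma density of shape α and rate αl²) is one consistent choice assumed by this sketch rather than read off the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, l = 2.0, 1.0
r = np.linspace(0, 5, 6)

# gamma here is the inverse squared lengthscale; with shape alpha and rate alpha*l^2,
# E[exp(-gamma r^2 / 2)] = (1 + r^2 / (2 alpha l^2))^(-alpha).
gammas = rng.gamma(shape=alpha, scale=1.0 / (alpha * l**2), size=200_000)
k_mc = np.exp(-0.5 * np.outer(r**2, gammas)).mean(axis=1)   # Monte Carlo mixture of RBF kernels
k_rq = (1 + r**2 / (2 * alpha * l**2))**(-alpha)
print(np.max(np.abs(k_mc - k_rq)))                          # small (Monte Carlo error)
```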

29 The Rational Quadratic Kernel
k_RQ(r) = (1 + r²/(2αl²))^{-α} (55)
r = τ = ||x - x'||. (56)
[Figure: (a) RQ kernels k(τ) with different α (e.g. α = 40, α = 0.1); (b) sample GP RQ functions f(x) vs input x.]
29 / 100

30 Neural Network Kernel
The neural network kernel (Neal, 1996) is famous for triggering research on Gaussian processes in the machine learning community.
Consider a neural network with one hidden layer:
f(x) = b + Σ_{i=1}^J v_i h(x; u_i). (57)
b is a bias, v_i are the hidden to output weights, h is any bounded hidden unit transfer function, u_i are the input to hidden weights, and J is the number of hidden units. Let b and v_i be independent with zero mean and variances σ_b² and σ_v²/J, respectively, and let the u_i have independent identical distributions. Collecting all free parameters into the weight vector w,
E_w[f(x)] = 0, (58)
cov[f(x), f(x')] = E_w[f(x) f(x')] = σ_b² + (1/J) Σ_{i=1}^J σ_v² E_u[h_i(x; u_i) h_i(x'; u_i)] (59)
= σ_b² + σ_v² E_u[h(x; u) h(x'; u)]. (60)
30 / 100

31 Neural Network Kernel
f(x) = b + Σ_{i=1}^J v_i h(x; u_i). (61)
Let h(x; u) = erf(u_0 + Σ_{j=1}^P u_j x_j), where erf(z) = (2/√π) ∫_0^z e^{-t²} dt.
Choose u ~ N(0, Σ). Then we obtain
k_NN(x, x') = (2/π) sin^{-1}( 2 x̃^T Σ x̃' / √((1 + 2 x̃^T Σ x̃)(1 + 2 x̃'^T Σ x̃')) ), (62)
where x ∈ R^P and x̃ = (1, x^T)^T.
31 / 100

32 Neural Network Kernel
Figure: Draws from a GP with a neural network kernel with varying σ. From Rasmussen and Williams (2006). 32 / 100

33 Gibbs Kernel
Recall the RBF kernel
k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)). (63)
What if we want to make the length-scale l input dependent, so that the resulting function is biased to vary more quickly in parts of the input space than in others?
33 / 100

34 Gibbs Kernel
Recall the RBF kernel
k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)). (64)
What if we want to make the length-scale l input dependent, so that the resulting function is biased to vary more quickly in parts of the input space than in others?
Just letting l → l(x) doesn't produce a valid kernel.
34 / 100

35 Gibbs Kernel
k_Gibbs(x, x') = Π_{p=1}^P ( 2 l_p(x) l_p(x') / (l_p(x)² + l_p(x')²) )^{1/2} exp( -Σ_{p=1}^P (x_p - x_p')² / (l_p(x)² + l_p(x')²) ), (65)
where x_p is the p-th component of x.
[Figure: (a) input-dependent length-scale; (b) sample output. Rasmussen and Williams (2006).]
35 / 100

36 Periodic Kernel
Transform the inputs through a vector-valued function: u(x) = (cos(x), sin(x)).
Apply the RBF kernel in u space: k_RBF(x, x') → k_RBF(u(x), u(x')).
Recover the periodic kernel
k_PER(x, x') = exp(-2 sin²((x - x')/2) / l²). (66)
Can you see anything unusual about this kernel?
36 / 100
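A small check that applying the RBF kernel (with a = 1) to the warped inputs u(x) = (cos x, sin x) gives exactly the periodic kernel of Eq. (66); the test points are arbitrary.

```python
import numpy as np

def k_rbf_on_circle(x, xp, l=1.0):
    # RBF kernel applied to u(x) = (cos x, sin x).
    u, up = np.array([np.cos(x), np.sin(x)]), np.array([np.cos(xp), np.sin(xp)])
    return np.exp(-np.sum((u - up)**2) / (2 * l**2))

def k_per(x, xp, l=1.0):
    return np.exp(-2 * np.sin((x - xp) / 2)**2 / l**2)

x, xp = 0.7, 3.9
# Identical, because ||u(x) - u(x')||^2 = 4 sin^2((x - x')/2).
print(k_rbf_on_circle(x, xp), k_per(x, xp))
```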

37 Periodic Kernel
[Figure: (a) covariance as a function of input distance r; (b) sample output y(x) vs input x.]
37 / 100

38 Non-Stationary Kernels
A stationary kernel is invariant to translations of the input space: k = k(τ), τ = x - x'. Intuitively, this means the properties of the function are similar across different regions of the input domain.
How might we make other non-stationary kernels, besides the Gibbs kernel?
38 / 100

39 Non-Stationary Kernels Figure: Non-stationary function 39 / 100

40 Non-Stationary Kernels
[Figure: (a), (b)] Figure: Warp the inputs (in this case, x → x²) to go from a non-stationary function to a stationary function. E.g., apply k(g(x), g(x')) to the data, where g is a warping function.
40 / 100

41 Non-Stationary Kernels 41 / 100

42 Non-Stationary Kernels
Warp the input space: k(x, x') → k(g(x), g(x')), where g is an arbitrary warping function.
Modulate the amplitude of the kernel. If f(x) ~ GP(0, k(x, x')), then a(x) f(x) has kernel a(x) k(x, x') a(x'), conditioned on a(x).
What would happen if we tried w_1(x) f_1(x) + w_2(x) f_2(x), where f_1 and f_2 are GPs with different kernels? How about σ(w_1(x)) f_1(x) + (1 - σ(w_1(x))) f_2(x)?
42 / 100

43 Matérn Kernel
The RBF kernel k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)) is criticized for being too smooth. How might we create a drop-in replacement, while retaining useful inductive biases?
43 / 100

44 Matérn Kernel
The RBF kernel k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)) is criticized for being too smooth. How might we create a drop-in replacement, while retaining useful inductive biases?
We could replace the Euclidean distance measure with an absolute distance measure... then we recover the Ornstein-Uhlenbeck kernel: k_OU(x, x') = exp(-|x - x'|/l). The velocity of a particle undergoing Brownian motion is described by a GP with the OU kernel.
44 / 100

45 Matérn Kernel
The RBF kernel k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)) is criticized for being too smooth. How might we create a drop-in replacement, while retaining useful inductive biases?
We could replace the Euclidean distance measure with an absolute distance measure... then we recover the Ornstein-Uhlenbeck kernel: k_OU(x, x') = exp(-|x - x'|/l). The velocity of a particle undergoing Brownian motion is described by a GP with the OU kernel.
Recall that stationary kernels k(τ), τ = x - x', and spectral densities are Fourier duals of one another:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (67)
S(s) = ∫ k(τ) e^{-2πi s^T τ} dτ. (68)
If we take the Fourier transform of the RBF kernel, we recover a Gaussian spectral density... But we can go from spectral densities to kernels too.
45 / 100

46 Matérn Kernel
The RBF kernel k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)) is criticized for being too smooth. How might we create a drop-in replacement, while retaining useful inductive biases?
We could replace the Euclidean distance measure with an absolute distance measure... then we recover the Ornstein-Uhlenbeck kernel: k_OU(x, x') = exp(-|x - x'|/l). The velocity of a particle undergoing Brownian motion is described by a GP with the OU kernel.
Recall that stationary kernels k(τ), τ = x - x', and spectral densities are Fourier duals of one another:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (69)
S(s) = ∫ k(τ) e^{-2πi s^T τ} dτ. (70)
If we take the Fourier transform of the RBF kernel, we recover a Gaussian spectral density... But we can go from spectral densities to kernels too... If we use a Student-t spectral density for S(s), and take the inverse Fourier transform, we recover the Matérn kernel.
46 / 100

47 Matérn Kernel
k_Matérn(x, x') = (2^{1-ν}/Γ(ν)) (√(2ν) ||x - x'|| / l)^ν K_ν(√(2ν) ||x - x'|| / l), (71)
where K_ν is a modified Bessel function.
In one dimension, and when ν + 1/2 = p for some natural number p, the corresponding GP is a continuous time AR(p) process.
By setting ν = 1/2, we obtain the Ornstein-Uhlenbeck (OU) kernel,
k_OU(x, x') = exp(-||x - x'||/l). (72)
The Matérn kernel does not have concentration of measure problems for high dimensional inputs to the extent of the RBF (Gaussian) kernel (Fastfood: Le, Sarlos, Smola, ICML 2013). The kernel gives rise to a Markovian process (and classical filtering and smoothing algorithms can be applied).
47 / 100
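A sketch of the Matérn kernel of Eq. (71) using scipy's modified Bessel function, with a check that ν = 1/2 recovers the OU (exponential) kernel; hyperparameter values are illustrative.

```python
import numpy as np
from scipy.special import kv, gamma

def matern_kernel(r, nu=1.5, l=1.0):
    # General Matern kernel; r = |x - x'|.
    r = np.atleast_1d(np.abs(r)).astype(float)
    k = np.ones_like(r)                 # k(0) = 1
    m = r > 0
    z = np.sqrt(2 * nu) * r[m] / l
    k[m] = (2**(1 - nu) / gamma(nu)) * z**nu * kv(nu, z)
    return k

r = np.linspace(0, 3, 7)
# With nu = 1/2 the Matern kernel reduces to the OU / exponential kernel exp(-r / l).
print(np.max(np.abs(matern_kernel(r, nu=0.5) - np.exp(-r))))
```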

48 Matérn Kernel
[Figure: (a) OU and RBF kernels, both with the same length-scale l, covariance vs distance r; (b) sample functions from GP-OU and GP-RBF vs inputs x.]
48 / 100

49 Matérn Kernel
From Yunus Saatchi's PhD thesis, Scalable Inference for Structured Gaussian Process Models. 49 / 100

50 Gaussian Processes Are Gaussian processes Bayesian nonparametric models? 50 / 100

51 Nonparametric Kernels
For a Gaussian process f(x) to be non-parametric, f(x_i) conditioned on f_{-i}, where f_{-i} is any collection of function values excluding f(x_i), must be free to take any value in R.
For this freedom to be possible, it is a necessary (but not sufficient) condition for the kernel of the Gaussian process to be derived from an infinite basis function expansion.
Nonparametric kernels allow for a great amount of flexibility: the amount of information the model can represent grows with the amount of available data.
51 / 100

52 Nonparametric RBF vs Finite Dimensional Analogue
The parametric analogue to a GP with a non-parametric RBF kernel becomes more confident in its predictions the further away we get from the data! Rasmussen, MLSS Cambridge. 52 / 100

53 Simple Random Walk
Discrete time auto-regressive model:
f(t) = a f(t - 1) + ε(t), (73)
ε(t) ~ N(0, 1), (74)
a ∈ R, (75)
t = 1, 2, 3, 4, ... (76)
Is this model a Gaussian process?
53 / 100
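The answer is yes: for fixed a (and a fixed starting value), every finite collection (f(1), ..., f(T)) is a linear transformation of Gaussian noise, hence jointly Gaussian. A sketch for the a = 1 random walk started at f(0) = 0, where the implied kernel is k(t, t') = min(t, t'); the starting condition is an assumption of this sketch, not stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_draws = 10, 200_000

# Simple random walk (the a = 1 case), started at f(0) = 0: f(t) = sum_{s <= t} eps(s).
eps = rng.standard_normal((n_draws, T))
f = np.cumsum(eps, axis=1)

emp_cov = np.cov(f, rowvar=False)                 # empirical covariance of (f(1), ..., f(T))
t = np.arange(1, T + 1)
analytic = np.minimum(t[:, None], t[None, :])     # cov(f(t), f(t')) = min(t, t')
print(np.max(np.abs(emp_cov - analytic)))         # small: consistent with a GP with kernel min(t, t')
```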

54 Gaussian Process Covariance Kernels
Let τ = x - x':
k_SE(τ) = exp(-0.5 τ²/l²) (78)
k_MA(τ) = a(1 + √3 |τ|/l) exp(-√3 |τ|/l) (79)
k_RQ(τ) = (1 + τ²/(2αl²))^{-α} (80)
k_PE(τ) = exp(-2 sin²(π τ ω)/l²) (81)
54 / 100

55 CO2 Extrapolation with Standard Kernels
[Figure: CO2 concentration (ppm) vs year.]
55 / 100

56 Gaussian processes
How can Gaussian processes possibly replace neural networks? Did we throw the baby out with the bathwater? David MacKay. 56 / 100

57 More Expressive Covariance Functions
[Diagram: Gaussian process regression network, with node functions f̂_1(x), ..., f̂_q(x), weight functions W_11(x), ..., W_pq(x), and outputs y_1(x), ..., y_p(x).]
Gaussian Process Regression Networks. Wilson et al., ICML 2012. 57 / 100

58 Gaussian Process Regression Network
[Figure: spatial predictions over latitude and longitude.]
58 / 100

59 Expressive Covariance Functions
GPs in Bayesian neural network like architectures (Salakhutdinov and Hinton, 2008; Wilson et al., 2012; Damianou and Lawrence, 2012). Task specific, difficult inference, no closed form kernels.
Compositions of kernels (Archambeau and Bach, 2011; Durrande et al., 2011; Rasmussen and Williams, 2006). In the general case, difficult to interpret, difficult inference, struggle with over-fitting.
One can learn almost nothing about the covariance function of a stochastic process from a single realization, if we assume that the covariance function could be any positive definite function. Most commonly one assumes a restriction to stationary kernels, meaning that covariances are invariant to translations in the input space.
59 / 100

60 Bochner's Theorem
Theorem (Bochner). A complex-valued function k on R^P is the covariance function of a weakly stationary mean square continuous complex-valued random process on R^P if and only if it can be represented as
k(τ) = ∫_{R^P} e^{2πi s^T τ} ψ(ds), (82)
where ψ is a positive finite measure.
If ψ has a density S(s), then S is called the spectral density or power spectrum of k, and k and S are Fourier duals:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (83)
S(s) = ∫ k(τ) e^{-2πi s^T τ} dτ. (84)
60 / 100

61 Idea
k and S are Fourier duals:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (85)
S(s) = ∫ k(τ) e^{-2πi s^T τ} dτ. (86)
If we can approximate S(s) to arbitrary accuracy, then we can approximate any stationary kernel to arbitrary accuracy.
We can model S(s) to arbitrary accuracy, since scale-location mixtures of Gaussians can approximate any distribution to arbitrary accuracy.
A scale-location mixture of Gaussians can flexibly model many distributions, and thus many covariance kernels, even with a small number of components.
61 / 100

62 Kernels for Pattern Discovery
Let τ = x - x' ∈ R^P. From Bochner's theorem,
k(τ) = ∫_{R^P} S(s) e^{2πi s^T τ} ds. (87)
For simplicity, assume τ ∈ R^1 and let
S(s) = [N(s; µ, σ²) + N(-s; µ, σ²)]/2. (88)
Then
k(τ) = exp{-2π²τ²σ²} cos(2πτµ). (89)
More generally, if S(s) is a symmetrized mixture of diagonal covariance Gaussians on R^P, with covariance matrix M_q = diag(v_q^(1), ..., v_q^(P)), then
k(τ) = Σ_{q=1}^Q w_q Π_{p=1}^P cos(2πτ_p µ_q^(p)) exp{-2π²τ_p² v_q^(p)}. (90)
62 / 100
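A sketch of the spectral mixture kernel of Eq. (90) in numpy; the example weights, means, and variances are made up.

```python
import numpy as np

def spectral_mixture_kernel(tau, w, mu, v):
    # tau: (N, P) array of input differences; w: (Q,) weights; mu, v: (Q, P) means and variances.
    tau = np.atleast_2d(tau)
    cos_part = np.cos(2 * np.pi * tau[:, None, :] * mu[None, :, :])          # (N, Q, P)
    exp_part = np.exp(-2 * np.pi**2 * tau[:, None, :]**2 * v[None, :, :])    # (N, Q, P)
    return np.sum(w[None, :] * np.prod(cos_part * exp_part, axis=-1), axis=-1)

# One-dimensional, two-component example with made-up hyperparameters.
tau = np.linspace(-3, 3, 7)[:, None]
w = np.array([0.7, 0.3])
mu = np.array([[0.5], [2.0]])
v = np.array([[0.1], [0.02]])
print(spectral_mixture_kernel(tau, w, mu, v))
```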

63 GP Model for Pattern Extrapolation
Observations y(x) ~ N(y(x); f(x), σ²) (can easily be relaxed).
f(x) ~ GP(0, k_SM(x, x' | θ)) (f(x) is a GP with SM kernel).
k_SM(x, x' | θ) can approximate many different kernels with different settings of its hyperparameters θ.
Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS):
log p(y | θ, X) = -(1/2) y^T (K_θ + σ²I)^{-1} y [model fit] - (1/2) log|K_θ + σ²I| [complexity penalty] - (N/2) log(2π). (91)
Once hyperparameters are trained as θ̂, we make predictions using p(f_* | y, X, θ̂), which can be expressed in closed form.
63 / 100

64 Results, CO2
[Figure: (c) CO2 concentration (ppm) vs year, with train/test split and 95% CR, comparing MA, RQ, PER, SE, and SM kernels; (d) log spectral density vs frequency (1/month) for SM, SE, and the empirical spectrum.]
64 / 100

65 Results, Reconstructing Standard Covariances
[Figure: (a), (b) correlation vs τ for reconstructed standard covariance functions.]
65 / 100

66 Results, Negative Covariances
[Figure: (e) observations vs input x (truth and SM); (f) covariance vs τ; (g) log spectral density vs frequency for SM and the empirical spectrum.]
66 / 100

67 Results, Sinc Pattern
[Figure: (h), (i) observations vs input x, with train/test split, comparing MA, RQ, SE, PER, and SM kernels; (j) correlation vs τ for MA and SM; (k) log spectral density vs frequency for SM, SE, and the empirical spectrum.]
67 / 100

68 Results, Airline Passengers
[Figure: (l) airline passengers (thousands) vs year, with train/test split, comparing PER, SE, RQ, MA, SM, and SSGP; (m) log spectral density vs frequency (1/month) for SM, SE, and the empirical spectrum.]
68 / 100

69 Scaling up kernel machines
Expressive kernels will be most valuable on large datasets.
Computational bottlenecks for GPs:
Inference: (K_θ + σ²I)^{-1} y for an n × n matrix K.
Learning: log|K_θ + σ²I|, for the marginal likelihood evaluations needed to learn θ.
Both inference and learning naively require O(n³) operations and O(n²) storage (typically from computing a Cholesky decomposition of K). Afterwards, the predictive mean and variance cost O(n) and O(n²) per test point.
69 / 100

70 Inference and Learning
1. Learning: Optimize the marginal likelihood,
log p(y | θ, X) = -(1/2) y^T (K_θ + σ²I)^{-1} y [model fit] - (1/2) log|K_θ + σ²I| [complexity penalty] - (N/2) log(2π),
with respect to kernel hyperparameters θ.
2. Inference: Conditioned on kernel hyperparameters θ, form the predictive distribution for test inputs X_*:
f_* | X_*, X, y, θ ~ N(f̄_*, cov(f_*)),
f̄_* = K_θ(X_*, X)[K_θ(X, X) + σ²I]^{-1} y,
cov(f_*) = K_θ(X_*, X_*) - K_θ(X_*, X)[K_θ(X, X) + σ²I]^{-1} K_θ(X, X_*).
70 / 100

71 Scaling up kernel machines
Three families of approaches:
Approximate non-parametric kernels in a finite basis dual space. Requires O(m²n) computations and O(m) storage for m basis functions. Examples: SSGP, Random Kitchen Sinks, Fastfood, À la Carte.
Inducing point based sparse approximations. Examples: SoR, FITC, KISS-GP.
Exploit existing structure in K to quickly (and exactly) solve linear systems and log determinants. Examples: Toeplitz and Kronecker methods.
71 / 100

72 Parametric Expansions via Random Basis Functions
Return to Bochner's theorem:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (92)
S(s) = ∫ k(τ) e^{-2πi s^T τ} dτ. (93)
We can treat S(s) as a probability distribution and sample from it, to approximate the integral for k(τ)! It is a valid Monte Carlo procedure to sample the pairs {s_j, -s_j} from S(s):
k(τ) ≈ (1/(2J)) Σ_{j=1}^J [exp(2πi s_j^T τ) + exp(-2πi s_j^T τ)], s_j ~ S(s) (94)
= (1/J) Σ_{j=1}^J cos(2π s_j^T τ). (95)
This is exactly the covariance function we get if we use a linear basis function model with trigonometric basis functions! Use the basis function representation with finite J for computational efficiency.
72 / 100
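A sketch of this random-feature approximation for the 1-D RBF kernel; the sampling standard deviation 1/(2πl) follows from the Gaussian spectral density under the Fourier convention on the Bochner slide, and is an assumption of this sketch rather than something stated here.

```python
import numpy as np

rng = np.random.default_rng(0)
l, J = 1.0, 5000
tau = np.linspace(0, 4, 9)

# For the 1-D RBF kernel, S(s) is a Gaussian with standard deviation 1/(2 pi l).
s = rng.normal(0.0, 1.0 / (2 * np.pi * l), size=J)
k_mc = np.mean(np.cos(2 * np.pi * np.outer(tau, s)), axis=1)   # (1/J) sum_j cos(2 pi s_j tau)
k_rbf = np.exp(-0.5 * tau**2 / l**2)
print(np.max(np.abs(k_mc - k_rbf)))                            # small Monte Carlo error, shrinks with J
```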

73 Scaling a Gaussian process: inducing inputs
Gaussian process f and f_* evaluated at n training points and J testing points.
m ≪ n inducing points u, p(u) = N(0, K_{u,u}).
p(f_*, f) = ∫ p(f_*, f, u) du = ∫ p(f_*, f | u) p(u) du
Assume that f and f_* are conditionally independent given u:
p(f_*, f) ≈ q(f_*, f) = ∫ q(f | u) q(f_* | u) p(u) du (96)
Exact conditional distributions:
p(f | u) = N(K_{f,u} K_{u,u}^{-1} u, K_{f,f} - Q_{f,f}) (97)
p(f_* | u) = N(K_{f_*,u} K_{u,u}^{-1} u, K_{f_*,f_*} - Q_{f_*,f_*}) (98)
Q_{a,b} = K_{a,u} K_{u,u}^{-1} K_{u,b} (99)
Cost for predictions reduced from O(n³) to O(m²n), where m ≪ n. Different inducing approaches correspond to different additional assumptions about q(f | u) and q(f_* | u).
73 / 100

74 Inducing Point Methods
The inducing points act as a communication channel between the GP evaluated at the training and test points, f and f_*. 74 / 100

75 Subset of Regressors (SoR)
The subset of regressors method uses deterministic conditional distributions with exact means:
q(f | u) = N(K_{X,U} K_{U,U}^{-1} u, 0) (100)
q(f_* | u) = N(K_{X_*,U} K_{U,U}^{-1} u, 0) (101)
Integrate away u via
p(f_*, f) ≈ q(f_*, f) = ∫ q(f | u) q(f_* | u) p(u) du (103)
to obtain the joint distribution
q_SoR(f, f_*) = N(0, [[Q_{X,X}, Q_{X,X_*}], [Q_{X_*,X}, Q_{X_*,X_*}]]), (104)
Q_{a,b} = K_{a,u} K_{u,u}^{-1} K_{u,b}. (105)
75 / 100

76 Subsets of Regressors
The predictive conditional can then be derived as before:
q_SoR(f_* | y) = N(µ_*, A) (106)
µ_* = Q_{X_*,X} (Q_{X,X} + σ²I)^{-1} y (107)
A = Q_{X_*,X_*} - Q_{X_*,X} (Q_{X,X} + σ²I)^{-1} Q_{X,X_*} (108)
This method can be viewed as replacing the exact covariance function k with an approximate covariance function which admits fast computations:
k_SoR(x_i, x_j) = k(x_i, U) K_{U,U}^{-1} k(U, x_j). (109)
76 / 100
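A sketch of the SoR approximate kernel of Eq. (109) on toy data; the inducing input locations and all hyperparameters are arbitrary choices, and the jitter term is a standard numerical device not mentioned on the slide.

```python
import numpy as np

def rbf(X1, X2, l=1.0):
    return np.exp(-0.5 * (X1[:, None] - X2[None, :])**2 / l**2)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 200))        # training inputs (toy)
U = np.linspace(0, 10, 15)                  # m = 15 inducing inputs
y = np.sin(X) + 0.1 * rng.standard_normal(200)
noise = 0.1**2

Kxu = rbf(X, U)
Kuu = rbf(U, U) + 1e-8 * np.eye(len(U))     # jitter for numerical stability
K_sor = Kxu @ np.linalg.solve(Kuu, Kxu.T)   # k_SoR(x_i, x_j) = k(x_i, U) K_UU^{-1} k(U, x_j)

# SoR predictive mean at the training inputs (Eq. 107 with X_* = X).
mu = K_sor @ np.linalg.solve(K_sor + noise * np.eye(len(X)), y)
print(np.abs(K_sor - rbf(X, X)).max())      # low-rank approximation error to the exact kernel
```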

77 Subsets of Regressors
The SoR covariance matrix is
K_SoR(X, X) = K_{X,U} K_{U,U}^{-1} K_{U,X}, (110)
the product of an n × m, an m × m, and an m × n matrix.
For m < n, this is a low rank covariance matrix, corresponding to a degenerate (finite basis) Gaussian process. As a result, for n large, SoR tends to underestimate uncertainty.
77 / 100

78 FITC
FITC, the most popular inducing point method, uses the exact test conditional, and a factorized training conditional:
q_FITC(f | u) = Π_{i=1}^n p(f_i | u) (111)
q_FITC(f_* | u) = p(f_* | u). (112)
Integrating away u, we can derive the FITC approximate kernel as:
k_SoR(x, z) = K_{x,U} K_{U,U}^{-1} K_{U,z}, (113)
k_FITC(x, z) = k_SoR(x, z) + δ_{xz} (k(x, z) - k_SoR(x, z)). (114)
FITC replaces the diagonal of the SoR approximation with the true diagonal of k. FITC corresponds to a non-parametric GP.
78 / 100

79 Kronecker methods
Suppose that for x ∈ R^P, k decomposes as a product of kernels across each input dimension: k(x_i, x_j) = Π_{p=1}^P k^p(x_i^p, x_j^p) (e.g., the RBF kernel has this property).
Suppose the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P.
Then K decomposes into a Kronecker product of matrices over each input dimension: K = K_1 ⊗ ... ⊗ K_P.
The eigendecomposition of K into QVQ^T also decomposes: Q = Q_1 ⊗ ... ⊗ Q_P, V = V_1 ⊗ ... ⊗ V_P.
Assuming equal cardinality for each input dimension, we can thus eigendecompose an N × N matrix K in O(P N^{3/P}) operations instead of O(N³) operations.
79 / 100

80 Kronecker methods
Suppose that for x ∈ R^P, k decomposes as a product of kernels across each input dimension: k(x_i, x_j) = Π_{p=1}^P k^p(x_i^p, x_j^p) (e.g., the RBF kernel has this property).
Suppose the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P.
Then K decomposes into a Kronecker product of matrices over each input dimension: K = K_1 ⊗ ... ⊗ K_P.
The eigendecomposition of K into QVQ^T also decomposes: Q = Q_1 ⊗ ... ⊗ Q_P, V = V_1 ⊗ ... ⊗ V_P.
Assuming equal cardinality for each input dimension, we can thus eigendecompose an N × N matrix K in O(P N^{3/P}) operations instead of O(N³) operations.
Then inference and learning are highly efficient:
(K + σ²I)^{-1} y = (QVQ^T + σ²I)^{-1} y = Q(V + σ²I)^{-1} Q^T y, (115)
log|K + σ²I| = log|QVQ^T + σ²I| = Σ_{i=1}^N log(λ_i + σ²). (116)
80 / 100
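A sketch of Eqs. (115)-(116) for a two-dimensional grid; forming np.kron(Q1, Q2) explicitly below is only for clarity of the check, since a practical implementation would use Kronecker matrix-vector products instead of ever building the full matrices.

```python
import numpy as np

def rbf(x1, x2, l=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / l**2)

# Inputs on a 2-D grid X = X1 x X2 with a product kernel, so K = K1 kron K2.
x1, x2 = np.linspace(0, 1, 30), np.linspace(0, 1, 40)
K1, K2 = rbf(x1, x1), rbf(x2, x2)
noise = 0.1**2
y = np.random.default_rng(0).standard_normal(len(x1) * len(x2))

# Eigendecompose the small factors instead of the full 1200 x 1200 matrix.
V1, Q1 = np.linalg.eigh(K1)
V2, Q2 = np.linalg.eigh(K2)
lam = np.kron(V1, V2)                                   # eigenvalues of K1 kron K2
Q = np.kron(Q1, Q2)                                     # built explicitly here only for the check
alpha = Q @ ((Q.T @ y) / (lam + noise))                 # (K + sigma^2 I)^{-1} y, Eq. (115)
logdet = np.sum(np.log(lam + noise))                    # log|K + sigma^2 I|, Eq. (116)

# Compare against the direct dense computation.
K = np.kron(K1, K2)
print(np.max(np.abs(alpha - np.linalg.solve(K + noise * np.eye(len(y)), y))))
```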

81 Kronecker Methods
We assumed that the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P.
How might we relax this assumption, to use Kronecker methods if there are gaps (missing data) in our multidimensional grid?
81 / 100

82 Kronecker Methods
Assume imaginary points that complete the grid. Place infinite noise on these points so they have no effect on inference. The relevant matrices are no longer Kronecker, but we can get around this using preconditioned conjugate gradients, an iterative linear solver.
82 / 100

83 Kronecker Methods with Missing Data
Assuming we have a dataset of M observations which are not necessarily on a grid, we propose to form a complete grid using W imaginary observations, y_W ~ N(f_W, ε^{-1} I_W), ε → 0.
The total observation vector y = [y_M, y_W]^T has N = M + W entries: y ~ N(f, D_N), where the noise covariance matrix D_N = diag(D_M, ε^{-1} I_W), D_M = σ² I_M.
The imaginary observations y_W have no corrupting effect on inference: the moments of the resulting predictive distribution are exactly the same as for the standard predictive distribution, namely lim_{ε→0} (K_N + D_N)^{-1} y = (K_M + D_M)^{-1} y_M.
83 / 100

84 Kronecker Methods with Missing Inputs
We use preconditioned conjugate gradients to compute (K_N + D_N)^{-1} y. We use the preconditioning matrix C = D_N^{-1/2} to solve C^T (K_N + D_N) C z = C^T y. The preconditioning matrix C speeds up convergence by ignoring the imaginary observations y_W.
For the log determinant complexity term in the marginal likelihood (used in hyperparameter learning),
log|K_M + D_M| = Σ_{i=1}^M log(λ_i^M + σ²) ≈ Σ_{i=1}^M log(λ̃_i^M + σ²), (117)
where λ̃_i^M = (M/N) λ_i^N for i = 1, ..., M.
84 / 100

85 Spectral Mixture Product Kernel
The spectral mixture kernel, in its standard form, does not quite have Kronecker structure. Introduce a spectral mixture product kernel, which takes a product across input dimensions of one-dimensional spectral mixture kernels:
k_SMP(τ | θ) = Π_{p=1}^P k_SM(τ_p | θ_p). (118)
85 / 100

86 GPatt
Observations y(x) ~ N(y(x); f(x), σ²) (can easily be relaxed).
f(x) ~ GP(0, k_SMP(x, x' | θ)) (f(x) is a GP with SMP kernel).
k_SMP(x, x' | θ) can approximate many different kernels with different settings of its hyperparameters θ.
Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS):
log p(y | θ, X) = -(1/2) y^T (K_θ + σ²I)^{-1} y [model fit] - (1/2) log|K_θ + σ²I| [complexity penalty] - (N/2) log(2π). (119)
Once hyperparameters are trained as θ̂, we make predictions using p(f_* | y, X, θ̂), which can be expressed in closed form.
Exploit Kronecker structure for fast exact inference and learning (and extend Kronecker methods to allow for non-grid data). Exact inference and learning requires O(P N^{(P+1)/P}) operations and O(P N^{2/P}) storage, compared to O(N³) operations and O(N²) storage, for N datapoints and P input dimensions.
86 / 100

87 Results (a) Train (b) Test (c) Full (d) GPatt (e) SSGP (f) FITC (g) GP-SE (h) GP-MA (i) GP-RQ 87 / 100

88 Results: Extrapolation and Interpolation with Shadows (a) Train (b) GPatt (c) GP-MA (d) Train (e) GPatt (f) GP-MA 88 / 100

89 Automatic Model Selection via Marginal Likelihood
[Figure: learned spectral mixture components (frequencies µ_1, µ_2) from a simple initialisation.]
The marginal likelihood shrinks weights of extraneous components to zero through the log|K| complexity penalty.
89 / 100

90 Results (a) Train (b) Test (c) Full (d) GPatt (e) SSGP (f) FITC (g) GP-SE (h) GP-MA (i) GP-RQ (j) GPatt Initialisation (k) Train (l) GPatt (m) GP-MA (n) Train (o) GPatt (p) GP-MA 90 / 100

91 More Patterns (a) Rubber mat (b) Tread plate (c) Pores (d) Wood (e) Chain mail (f) Cone 91 / 100

92 Speed and Accuracy Stress Tests (a) Runtime Stress Test (b) Accuracy Stress Test 92 / 100

93 Image Inpainting 93 / 100

94 Recovering Sophisticated Out of Class Kernels
[Figure: true and recovered kernels k1, k2, k3 as functions of τ.]
94 / 100

95 Video Extrapolation GPatt makes almost no assumptions about the correlation structures across input dimensions: it can automatically discover both temporal and spatial correlations! Top row: True frames taken from the middle of a movie. Bottom row: Predicted sequence of frames (all are forecast together). 112,500 datapoints. GPatt training time is under 5 minutes. 95 / 100

96 Land Surface Temperature Forecasting
Train using 9 years of temperature data. The first two rows are the last 12 months of training data; the last two rows are a 12 month ahead forecast. 300,000 data points, with 40% missing data (from ocean). Predictions using GP-SE (a GP with an SE or RBF kernel) and Kronecker inference. 96 / 100

97 Land Surface Temperature Forecasting
Train using 9 years of temperature data. The first two rows are the last 12 months of training data; the last two rows are a 12 month ahead forecast. 300,000 data points, with 40% missing data (from ocean). Predictions using GPatt. Training time < 30 minutes. 97 / 100

98 Learned Kernels for Land Surface Temperatures
[Figure: (a) learned GPatt kernel for temperatures; (b) learned GP-SE kernel for temperatures; correlations shown against X [km], Y [km], and time [months].]
The learned GPatt kernel tells us interesting properties of the data. In this case, the learned kernels are heavy tailed and quasi-periodic.
98 / 100

99 Building Gauss-Markov Processes 99 / 100

100 Generalising inducing point methods Blackboard discussion 100 / 100


More information

Building an Automatic Statistician

Building an Automatic Statistician Building an Automatic Statistician Zoubin Ghahramani Department of Engineering University of Cambridge zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ ALT-Discovery Science Conference, October

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Gaussian Process Latent Variable Models for Dimensionality Reduction and Time Series Modeling

Gaussian Process Latent Variable Models for Dimensionality Reduction and Time Series Modeling Gaussian Process Latent Variable Models for Dimensionality Reduction and Time Series Modeling Nakul Gopalan IAS, TU Darmstadt nakul.gopalan@stud.tu-darmstadt.de Abstract Time series data of high dimensions

More information

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview

More information

LA Support for Scalable Kernel Methods. David Bindel 29 Sep 2018

LA Support for Scalable Kernel Methods. David Bindel 29 Sep 2018 LA Support for Scalable Kernel Methods David Bindel 29 Sep 2018 Collaborators Kun Dong (Cornell CAM) David Eriksson (Cornell CAM) Jake Gardner (Cornell CS) Eric Lee (Cornell CS) Hannes Nickisch (Phillips

More information

Introduction to Gaussian Process

Introduction to Gaussian Process Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

More information

Prediction of Data with help of the Gaussian Process Method

Prediction of Data with help of the Gaussian Process Method of Data with help of the Gaussian Process Method R. Preuss, U. von Toussaint Max-Planck-Institute for Plasma Physics EURATOM Association 878 Garching, Germany March, Abstract The simulation of plasma-wall

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning 10-715 Advanced Introduction to Machine Learning Homework Due Oct 15, 10.30 am Rules Please follow these guidelines. Failure to do so, will result in loss of credit. 1. Homework is due on the due date

More information

Probabilistic & Bayesian deep learning. Andreas Damianou

Probabilistic & Bayesian deep learning. Andreas Damianou Probabilistic & Bayesian deep learning Andreas Damianou Amazon Research Cambridge, UK Talk at University of Sheffield, 19 March 2019 In this talk Not in this talk: CRFs, Boltzmann machines,... In this

More information

Kernel methods, kernel SVM and ridge regression

Kernel methods, kernel SVM and ridge regression Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;

More information

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55

More information

State Space Gaussian Processes with Non-Gaussian Likelihoods

State Space Gaussian Processes with Non-Gaussian Likelihoods State Space Gaussian Processes with Non-Gaussian Likelihoods Hannes Nickisch 1 Arno Solin 2 Alexander Grigorievskiy 2,3 1 Philips Research, 2 Aalto University, 3 Silo.AI ICML2018 July 13, 2018 Outline

More information

Random Projections. Lopez Paz & Duvenaud. November 7, 2013

Random Projections. Lopez Paz & Duvenaud. November 7, 2013 Random Projections Lopez Paz & Duvenaud November 7, 2013 Random Outline The Johnson-Lindenstrauss Lemma (1984) Random Kitchen Sinks (Rahimi and Recht, NIPS 2008) Fastfood (Le et al., ICML 2013) Why random

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Bayesian Hidden Markov Models and Extensions

Bayesian Hidden Markov Models and Extensions Bayesian Hidden Markov Models and Extensions Zoubin Ghahramani Department of Engineering University of Cambridge joint work with Matt Beal, Jurgen van Gael, Yunus Saatci, Tom Stepleton, Yee Whye Teh Modeling

More information

Tokamak profile database construction incorporating Gaussian process regression

Tokamak profile database construction incorporating Gaussian process regression Tokamak profile database construction incorporating Gaussian process regression A. Ho 1, J. Citrin 1, C. Bourdelle 2, Y. Camenen 3, F. Felici 4, M. Maslov 5, K.L. van de Plassche 1,4, H. Weisen 6 and JET

More information

Computational Science and Engineering (Int. Master s Program)

Computational Science and Engineering (Int. Master s Program) Computational Science and Engineering (Int. Master s Program) Technische Universität München Master s Thesis Derivative-Free Stochastic Constrained Optimization using Gaussian Processes, with Application

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning 12. Gaussian Processes Alex Smola Carnegie Mellon University http://alex.smola.org/teaching/cmu2013-10-701 10-701 The Normal Distribution http://www.gaussianprocess.org/gpml/chapters/

More information

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop Music and Machine Learning (IFT68 Winter 8) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

More information