Probabilistic Graphical Models Lecture 21: Advanced Gaussian Processes


1 Probabilistic Graphical Models Lecture 21: Advanced Gaussian Processes. Andrew Gordon Wilson, Carnegie Mellon University, April 1. 1 / 100

2 Gaussian process review
Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.
Nonparametric regression model. Prior: f(x) ~ GP(m(x), k(x, x')), meaning (f(x_1), ..., f(x_N)) ~ N(µ, K), with µ_i = m(x_i) and K_ij = cov(f(x_i), f(x_j)) = k(x_i, x_j).
GP posterior ∝ likelihood × GP prior: p(f(x) | D) ∝ p(D | f(x)) p(f(x)).
[Figure: Gaussian process sample prior functions and sample posterior functions, output f(t).]
2 / 100

3 Gaussian Process Inference
Observed noisy data y = (y(x_1), ..., y(x_N))^T at input locations X. Start with the standard regression assumption: N(y(x); f(x), σ²). Place a Gaussian process distribution over noise-free functions, f(x) ~ GP(0, k_θ). The kernel k is parametrized by θ. Infer p(f_* | y, X, X_*) for the noise-free function f_* evaluated at test points X_*.
Joint distribution:
[y, f_*] ~ N(0, [[K_θ(X, X) + σ²I, K_θ(X, X_*)], [K_θ(X_*, X), K_θ(X_*, X_*)]]). (1)
Conditional predictive distribution:
f_* | X_*, X, y, θ ~ N(f̄_*, cov(f_*)), (2)
f̄_* = K_θ(X_*, X)[K_θ(X, X) + σ²I]^{-1} y, (3)
cov(f_*) = K_θ(X_*, X_*) - K_θ(X_*, X)[K_θ(X, X) + σ²I]^{-1} K_θ(X, X_*). (4)
3 / 100
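A minimal numpy sketch of the predictive equations (2)-(4), assuming an RBF kernel and made-up 1-D data; the kernel function, lengthscale, and noise level below are illustrative choices, not taken from the slides.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, amplitude=1.0):
    # Squared-exponential kernel between two sets of 1-D inputs.
    d = X1[:, None] - X2[None, :]
    return amplitude**2 * np.exp(-0.5 * d**2 / lengthscale**2)

# Toy data (illustrative only).
X = np.linspace(0, 5, 20)
y = np.sin(X) + 0.1 * np.random.randn(20)
Xs = np.linspace(0, 6, 100)                    # test inputs X_*
noise = 0.1**2

K = rbf_kernel(X, X) + noise * np.eye(len(X))  # K_theta(X, X) + sigma^2 I
Ks = rbf_kernel(Xs, X)                         # K_theta(X_*, X)
Kss = rbf_kernel(Xs, Xs)                       # K_theta(X_*, X_*)

# Predictive mean and covariance, Eqs. (3) and (4), via a Cholesky solve.
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
f_mean = Ks @ alpha
v = np.linalg.solve(L, Ks.T)
f_cov = Kss - v.T @ v
```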

4 Learning and Model Selection
p(M_i | y) = p(y | M_i) p(M_i) / p(y) (5)
We can write the evidence of the model as
p(y | M_i) = ∫ p(y | f, M_i) p(f) df. (6)
[Figure: (a) p(y | M) over all possible datasets for simple, complex, and appropriate models; (b) data y and model fits, output f(x) vs input x.]
4 / 100

5 Learning and Model Selection
We can integrate away the entire Gaussian process f(x) to obtain the marginal likelihood, as a function of kernel hyperparameters θ alone:
p(y | θ, X) = ∫ p(y | f, X) p(f | θ, X) df. (7)
log p(y | θ, X) = -(1/2) y^T (K_θ + σ²I)^{-1} y [model fit] - (1/2) log|K_θ + σ²I| [complexity penalty] - (N/2) log(2π). (8)
An extremely powerful mechanism for kernel learning.
[Figure: samples from the GP prior and samples from the GP posterior, output f(x) vs input x.]
5 / 100

6 Inference and Learning
1. Learning: Optimize the marginal likelihood,
log p(y | θ, X) = -(1/2) y^T (K_θ + σ²I)^{-1} y [model fit] - (1/2) log|K_θ + σ²I| [complexity penalty] - (N/2) log(2π),
with respect to kernel hyperparameters θ.
2. Inference: Conditioned on kernel hyperparameters θ, form the predictive distribution for test inputs X_*:
f_* | X_*, X, y, θ ~ N(f̄_*, cov(f_*)),
f̄_* = K_θ(X_*, X)[K_θ(X, X) + σ²I]^{-1} y,
cov(f_*) = K_θ(X_*, X_*) - K_θ(X_*, X)[K_θ(X, X) + σ²I]^{-1} K_θ(X, X_*).
6 / 100
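A sketch of the marginal likelihood evaluation in step 1; it works for any kernel matrix K (for example the rbf_kernel from the earlier sketch). In practice θ would be optimized with gradients of this quantity, which is not shown here.

```python
import numpy as np

def log_marginal_likelihood(y, K, noise_var):
    # log p(y | theta, X) = -1/2 y^T (K + s^2 I)^{-1} y - 1/2 log|K + s^2 I| - N/2 log(2 pi)
    N = len(y)
    Ky = K + noise_var * np.eye(N)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    model_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))   # equals -1/2 log|K + s^2 I|
    return model_fit + complexity - 0.5 * N * np.log(2 * np.pi)

# Example: log_marginal_likelihood(y, rbf_kernel(X, X), 0.1**2)
```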

7 Learning and Model Selection
A fully Bayesian treatment would integrate away kernel hyperparameters θ:
p(f_* | X_*, X, y) = ∫ p(f_* | X_*, X, y, θ) p(θ | y) dθ. (9)
For example, we could specify a prior p(θ), use MCMC to take J samples from p(θ | y) ∝ p(y | θ) p(θ), and then find
p(f_* | X_*, X, y) ≈ (1/J) Σ_{i=1}^J p(f_* | X_*, X, y, θ^(i)), θ^(i) ~ p(θ | y). (10)
If we have a non-Gaussian noise model, and thus cannot integrate away f, the strong dependencies between the Gaussian process f and the hyperparameters θ make sampling extremely difficult. In my experience, the most effective solution is to use a deterministic approximation for the posterior p(f | y), which enables one to work with an approximate marginal likelihood.
7 / 100

8 Popular Kernels
Let τ = x - x':
k_SE(τ) = exp(-0.5 τ²/l²) (11)
k_MA(τ) = a(1 + √3 |τ|/l) exp(-√3 |τ|/l) (12)
k_RQ(τ) = (1 + τ²/(2αl²))^{-α} (13)
k_PE(τ) = exp(-2 sin²(π τ ω)/l²) (14)
8 / 100
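The four kernels above, written as plain numpy functions of τ; the default hyperparameter values are arbitrary and only for illustration.

```python
import numpy as np

def k_se(tau, l=1.0):
    return np.exp(-0.5 * tau**2 / l**2)

def k_ma(tau, a=1.0, l=1.0):
    # Matern-3/2 form used on this slide.
    r = np.sqrt(3) * np.abs(tau) / l
    return a * (1 + r) * np.exp(-r)

def k_rq(tau, alpha=1.0, l=1.0):
    return (1 + tau**2 / (2 * alpha * l**2))**(-alpha)

def k_pe(tau, omega=1.0, l=1.0):
    return np.exp(-2 * np.sin(np.pi * np.abs(tau) * omega)**2 / l**2)
```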

9 Worked Example: Combining Kernels, CO2 Data
[Figure: CO2 concentration (ppm) vs year.]
Example from Rasmussen and Williams (2006), Gaussian Processes for Machine Learning. 9 / 100

10 Worked Example: Combining Kernels, CO 2 Data 10 / 100

11 Worked Example: Combining Kernels, CO2 Data
Long rising trend: k_1(x_p, x_q) = θ_1² exp(-(x_p - x_q)²/(2θ_2²))
Quasi-periodic seasonal changes: k_2(x_p, x_q) = k_RBF(x_p, x_q) k_PER(x_p, x_q) = θ_3² exp(-(x_p - x_q)²/(2θ_4²) - 2 sin²(π(x_p - x_q))/θ_5²)
Multi-scale medium term irregularities: k_3(x_p, x_q) = θ_6² (1 + (x_p - x_q)²/(2θ_8 θ_7²))^{-θ_8}
Correlated and i.i.d. noise: k_4(x_p, x_q) = θ_9² exp(-(x_p - x_q)²/(2θ_10²)) + θ_11² δ_pq
k_total(x_p, x_q) = k_1(x_p, x_q) + k_2(x_p, x_q) + k_3(x_p, x_q) + k_4(x_p, x_q)
11 / 100
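A hedged sketch of the composite kernel k_total above, with hyperparameter values invented purely for illustration (the slide gives no numerical values); the function name and the array-based parameterization are conventions of this sketch.

```python
import numpy as np

def co2_kernel(xp, xq, th, same_inputs=False):
    # th = (theta_1, ..., theta_11); the i.i.d. noise term delta_pq is only
    # added when the kernel is evaluated on a set of inputs against itself.
    d = xp[:, None] - xq[None, :]
    k1 = th[0]**2 * np.exp(-d**2 / (2 * th[1]**2))                    # long rising trend
    k2 = th[2]**2 * np.exp(-d**2 / (2 * th[3]**2)
                           - 2 * np.sin(np.pi * d)**2 / th[4]**2)     # quasi-periodic seasonal changes
    k3 = th[5]**2 * (1 + d**2 / (2 * th[7] * th[6]**2))**(-th[7])     # medium-term irregularities
    k4 = th[8]**2 * np.exp(-d**2 / (2 * th[9]**2))                    # correlated noise
    K = k1 + k2 + k3 + k4
    if same_inputs:
        K = K + th[10]**2 * np.eye(len(xp))                           # i.i.d. noise
    return K

x = np.linspace(1958, 2000, 100)                                      # toy yearly inputs
theta = np.array([60, 90, 2, 100, 1.3, 0.7, 1.2, 0.8, 0.2, 1.6, 0.2]) # made-up values
K = co2_kernel(x, x, theta, same_inputs=True)
```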

12 What is a kernel?
Informally, k describes the similarities between pairs of data points. For example, far away points may be considered less similar than nearby points.
K_ij = ⟨φ(x_i), φ(x_j)⟩, and so tells us the overlap between the features (basis functions) φ(x_i) and φ(x_j).
We have seen that all linear basis function models f(x) = w^T φ(x), with p(w) = N(0, Σ_w), correspond to Gaussian processes with kernel k(x, x') = φ(x)^T Σ_w φ(x').
We have also accumulated some experience with the RBF kernel k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)).
The kernel controls the generalisation behaviour of a kernel machine. For example, a kernel controls the support and inductive biases of a Gaussian process: which functions are a priori likely.
A kernel is also known as a covariance function or covariance kernel in the context of Gaussian processes.
12 / 100

13 Candidate Kernel
k(x, x') = 1 if |x - x'| ≤ 1, and 0 otherwise.
Symmetric. Provides information about proximity of points.
Exercise: Is it a valid kernel? 13 / 100

14 Candidate Kernel
k(x, x') = 1 if |x - x'| ≤ 1, and 0 otherwise.
Try the points x_1 = 1, x_2 = 2, x_3 = 3. Compute the kernel matrix K. (15)
14 / 100

15 Candidate Kernel
k(x, x') = 1 if |x - x'| ≤ 1, and 0 otherwise.
Try the points x_1 = 1, x_2 = 2, x_3 = 3. Compute the kernel matrix
K = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]. (16)
The eigenvalues of K are (√2 + 1), 1, and (1 - √2). Therefore K is not positive semidefinite.
15 / 100
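A quick numerical check of the eigenvalue claim, using numpy:

```python
import numpy as np

# Kernel matrix for x = 1, 2, 3 under k(x, x') = 1 if |x - x'| <= 1, else 0.
K = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
# Approximately [-0.414, 1.0, 2.414]: one negative eigenvalue, so K is not PSD.
print(np.linalg.eigvalsh(K))
```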

16 Representer Theorem
A decision function f(x) can be written as
f(x) = ⟨w, φ(x)⟩ = Σ_{i=1}^N α_i ⟨φ(x_i), φ(x)⟩ = Σ_{i=1}^N α_i k(x_i, x). (17)
The representer theorem says this function exists with finitely many coefficients α_i even when φ is infinite dimensional (an infinite number of basis functions).
Initially viewed as a strength of kernel methods, for datasets not exceeding e.g. ten thousand points. Unfortunately, the number of nonzero α_i often grows linearly in the size of the training set N.
Example: In GP regression, the predictive mean is
E[f_* | y, X, x_*] = k_*^T (K + σ²I)^{-1} y = Σ_{i=1}^N α_i k(x_i, x_*), (18)
where α = (K + σ²I)^{-1} y.
16 / 100

17 Making new kernels from old
Suppose k_1(x, x') and k_2(x, x') are valid. Then the following covariance functions are also valid:
k(x, x') = g(x) k_1(x, x') g(x') (19)
k(x, x') = q(k_1(x, x')) (20)
k(x, x') = exp(k_1(x, x')) (21)
k(x, x') = k_1(x, x') + k_2(x, x') (22)
k(x, x') = k_1(x, x') k_2(x, x') (23)
k(x, x') = k_3(φ(x), φ(x')) (24)
k(x, x') = x^T A x' (25)
k(x, x') = k_a(x_a, x_a') + k_b(x_b, x_b') (26)
k(x, x') = k_a(x_a, x_a') k_b(x_b, x_b') (27)
where g is any function, q is a polynomial with nonnegative coefficients, φ(x) is a function from x to R^M, k_3 is a valid covariance function in R^M, A is a symmetric positive definite matrix, x_a and x_b are not necessarily disjoint variables with x = (x_a, x_b)^T, and k_a and k_b are valid kernels in their respective spaces.
17 / 100

18 Stationary Kernels
A stationary kernel is invariant to translations of the input space. Equivalently, k = k(x - x') = k(τ).
All distance kernels, k = k(||x - x'||), are examples of stationary kernels.
The RBF kernel k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)) is a stationary kernel.
The polynomial kernel k_POL(x, x') = (x^T x' + σ_0²)^p is an example of a non-stationary kernel.
Stationarity provides a useful inductive bias.
18 / 100

19 Bochner's Theorem
Theorem (Bochner). A complex-valued function k on R^P is the covariance function of a weakly stationary mean square continuous complex-valued random process on R^P if and only if it can be represented as
k(τ) = ∫_{R^P} e^{2πi s^T τ} ψ(ds), (28)
where ψ is a positive finite measure.
If ψ has a density S(s), then S is called the spectral density or power spectrum of k, and k and S are Fourier duals:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (29)
S(s) = ∫ k(τ) e^{-2πi s^T τ} dτ. (30)
19 / 100

20 Review: Linear Basis Function Models
Model specification:
f(x, w) = w^T φ(x) (31)
p(w) = N(0, Σ_w) (32)
Moments of the induced distribution over functions:
E[f(x, w)] = m(x) = E[w^T] φ(x) = 0 (33)
cov(f(x_i), f(x_j)) = k(x_i, x_j) = E[f(x_i) f(x_j)] - E[f(x_i)] E[f(x_j)] (34)
= φ(x_i)^T E[w w^T] φ(x_j) - 0 (35)
= φ(x_i)^T Σ_w φ(x_j) (36)
f(x, w) is a Gaussian process, f(x) ~ GP(m, k), with mean function m(x) = 0 and covariance kernel k(x_i, x_j) = φ(x_i)^T Σ_w φ(x_j).
The entire basis function model of Eqs. (31) and (32) is encapsulated as a distribution over functions with kernel k(x, x').
20 / 100

21 Deriving the RBF Kernel
Start with the basis model
f(x) = Σ_{i=1}^J w_i φ_i(x), (37)
w_i ~ N(0, σ²/J), (38)
φ_i(x) = exp(-(x - c_i)²/(2l²)). (39)
Equations (37)-(39) define a radial basis function regression model, with radial basis functions centred at the points c_i.
Using our result for the kernel of a generalised linear model,
k(x, x') = (σ²/J) Σ_{i=1}^J φ_i(x) φ_i(x'). (40)
21 / 100

22 Deriving the RBF Kernel
f(x) = Σ_{i=1}^J w_i φ_i(x), w_i ~ N(0, σ²/J), φ_i(x) = exp(-(x - c_i)²/(2l²)) (41)
k(x, x') = (σ²/J) Σ_{i=1}^J φ_i(x) φ_i(x') (42)
Letting c_{i+1} - c_i = Δc = 1/J, and J → ∞, the kernel in Eq. (42) becomes a Riemann sum:
k(x, x') = lim_{J→∞} (σ²/J) Σ_{i=1}^J φ_i(x) φ_i(x') = σ² ∫_{c_0}^{c_∞} φ_c(x) φ_c(x') dc. (43)
By setting c_0 = -∞ and c_∞ = ∞, we spread the infinitely many basis functions across the whole real line, each a distance Δc → 0 apart:
k(x, x') = σ² ∫_{-∞}^{∞} exp(-(x - c)²/(2l²)) exp(-(x' - c)²/(2l²)) dc (44)
= √π l σ² exp(-(x - x')²/(4l²)). (45)
22 / 100
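A numerical check of the limit in Eqs. (44)-(45), assuming a dense but finite grid of basis-function centres standing in for the J → ∞ limit; the particular inputs and hyperparameters are arbitrary.

```python
import numpy as np

sigma2, l = 1.0, 0.5
x, xp = 0.3, 1.1

# Finite RBF basis: centres on a dense grid covering the inputs generously.
centres = np.linspace(-10, 10, 20001)
dc = centres[1] - centres[0]
phi  = np.exp(-(x  - centres)**2 / (2 * l**2))
phip = np.exp(-(xp - centres)**2 / (2 * l**2))
k_finite = sigma2 * dc * np.sum(phi * phip)      # Riemann-sum approximation of Eq. (44)

# Analytic limit, Eq. (45).
k_limit = np.sqrt(np.pi) * l * sigma2 * np.exp(-(x - xp)**2 / (4 * l**2))
print(k_finite, k_limit)                         # should agree to several decimal places
```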

23 Deriving the RBF Kernel
It is remarkable that we can work with infinitely many basis functions with finite amounts of computation using the kernel trick: replacing inner products of basis functions with kernels.
The RBF kernel, also known as the Gaussian or squared exponential kernel, is by far the most popular kernel: k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)).
Recall Bochner's theorem. If we take the Fourier transform of the RBF kernel we recover a Gaussian spectral density, S(s) = (2πl²)^{D/2} exp(-2π²l²s²) for x ∈ R^D. Therefore the RBF kernel does not have much support for high frequency functions, since a Gaussian does not have heavy tails.
Functions drawn from a GP with an RBF kernel are infinitely differentiable. For this reason, the RBF kernel is accused of being overly smooth and unrealistic. Nonetheless it has nice theoretical properties.
23 / 100

24 The RBF Kernel
[Figure: SE kernels with different length-scales (legend includes l = 7 and l = 0.7), plotted as k(τ) against τ = x - x'.]
24 / 100

25 Representer Theorem
A decision function f(x) can be written as
f(x) = ⟨w, φ(x)⟩ = Σ_{i=1}^N α_i ⟨φ(x_i), φ(x)⟩ = Σ_{i=1}^N α_i k(x_i, x). (46)
The representer theorem says this function exists with finitely many coefficients α_i even when φ is infinite dimensional (an infinite number of basis functions).
Initially viewed as a strength of kernel methods, for datasets not exceeding e.g. ten thousand points. Unfortunately, the number of nonzero α_i often grows linearly in the size of the training set N.
Example: In GP regression, the predictive mean is
E[f_* | y, X, x_*] = k_*^T (K + σ²I)^{-1} y = Σ_{i=1}^N α_i k(x_i, x_*), (47)
where α = (K + σ²I)^{-1} y.
25 / 100

26 Polynomial Kernel
We have already shown that the simple linear model
f(x, w) = w^T x + b, (48)
p(w) = N(0, α²I), (49)
p(b) = N(0, β²), (50)
corresponds to a Gaussian process with kernel
k_LIN(x, x') = α² x^T x' + β². (51)
Samples from a GP with k_LIN(x, x') will thus be straight lines.
Recall that the product of two kernels is a valid kernel. The product of two linear kernels is a quadratic kernel, which gives rise to quadratic functions:
k_QUAD(x, x') = k_LIN(x, x') k_LIN(x, x'). (52)
For example, if β = 0, α = 1, and x ∈ R², then k_QUAD(x, x') = φ(x)^T φ(x') with φ(x) = (x_1², x_2², √2 x_1 x_2)^T, where x = (x_1, x_2).
We can generalize to the polynomial kernel
k_POL(x, x') = (α² x^T x' + β²)^p. (53)
26 / 100

27 The Rational Quadratic Kernel What if we want data varying at multiple scales? 27 / 100

28 The Rational Quadratic Kernel
Try a scale mixture of RBF kernels. Let r = ||x - x'||:
k(r) = ∫ exp(-r²/(2l²)) p(l) dl.
For example, we can consider a Gamma density for p(l). Letting γ = l^{-2}, g(γ | α, β) ∝ γ^{α-1} exp(-αγ/β), with β^{-1} = l², the rational quadratic (RQ) kernel is derived as
k_RQ(r) = ∫_0^∞ k_RBF(r | γ) g(γ | α, β) dγ = (1 + r²/(2αl²))^{-α}. (54)
One could derive other interesting covariance functions using different (non-Gamma) functions for p(l).
28 / 100
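A Monte Carlo check that a Gamma scale mixture of RBF kernels reproduces the RQ form; the parameterization (γ = l^{-2} with a Gamma density of shape α and rate αl²) is one consistent choice assumed by this sketch rather than read off the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, l = 2.0, 1.0
r = np.linspace(0, 5, 6)

# gamma here is the inverse squared lengthscale; with shape alpha and rate alpha*l^2,
# E[exp(-gamma r^2 / 2)] = (1 + r^2 / (2 alpha l^2))^(-alpha).
gammas = rng.gamma(shape=alpha, scale=1.0 / (alpha * l**2), size=200_000)
k_mc = np.exp(-0.5 * np.outer(r**2, gammas)).mean(axis=1)   # Monte Carlo mixture of RBF kernels
k_rq = (1 + r**2 / (2 * alpha * l**2))**(-alpha)
print(np.max(np.abs(k_mc - k_rq)))                          # small (Monte Carlo error)
```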

29 The Rational Quadratic Kernel
k_RQ(r) = (1 + r²/(2αl²))^{-α} (55)
r = τ = ||x - x'||. (56)
[Figure: (a) RQ kernels k(τ) with different α (e.g. α = 40, α = 0.1); (b) sample GP RQ functions f(x) vs input x.]
29 / 100

30 Neural Network Kernel
The neural network kernel (Neal, 1996) is famous for triggering research on Gaussian processes in the machine learning community.
Consider a neural network with one hidden layer:
f(x) = b + Σ_{i=1}^J v_i h(x; u_i). (57)
b is a bias, v_i are the hidden to output weights, h is any bounded hidden unit transfer function, u_i are the input to hidden weights, and J is the number of hidden units. Let b and v_i be independent with zero mean and variances σ_b² and σ_v²/J, respectively, and let the u_i have independent identical distributions. Collecting all free parameters into the weight vector w,
E_w[f(x)] = 0, (58)
cov[f(x), f(x')] = E_w[f(x) f(x')] = σ_b² + (1/J) Σ_{i=1}^J σ_v² E_u[h_i(x; u_i) h_i(x'; u_i)] (59)
= σ_b² + σ_v² E_u[h(x; u) h(x'; u)]. (60)
30 / 100

31 Neural Network Kernel
f(x) = b + Σ_{i=1}^J v_i h(x; u_i). (61)
Let h(x; u) = erf(u_0 + Σ_{j=1}^P u_j x_j), where erf(z) = (2/√π) ∫_0^z e^{-t²} dt.
Choose u ~ N(0, Σ). Then we obtain
k_NN(x, x') = (2/π) sin^{-1}( 2 x̃^T Σ x̃' / √((1 + 2 x̃^T Σ x̃)(1 + 2 x̃'^T Σ x̃')) ), (62)
where x ∈ R^P and x̃ = (1, x^T)^T.
31 / 100

32 Neural Network Kernel
Figure: Draws from a GP with a neural network kernel with varying σ. From Rasmussen and Williams (2006). 32 / 100

33 Gibbs Kernel
Recall the RBF kernel
k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)). (63)
What if we want to make the length-scale l input dependent, so that the resulting function is biased to vary more quickly in parts of the input space than in others?
33 / 100

34 Gibbs Kernel
Recall the RBF kernel
k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)). (64)
What if we want to make the length-scale l input dependent, so that the resulting function is biased to vary more quickly in parts of the input space than in others?
Just letting l → l(x) doesn't produce a valid kernel.
34 / 100

35 Gibbs Kernel
k_Gibbs(x, x') = Π_{p=1}^P ( 2 l_p(x) l_p(x') / (l_p(x)² + l_p(x')²) )^{1/2} exp( -Σ_{p=1}^P (x_p - x_p')² / (l_p(x)² + l_p(x')²) ), (65)
where x_p is the p-th component of x.
[Figure: (a) input-dependent length-scale; (b) sample output. Rasmussen and Williams (2006).]
35 / 100

36 Periodic Kernel
Transform the inputs through a vector-valued function: u(x) = (cos(x), sin(x)).
Apply the RBF kernel in u space: k_RBF(x, x') → k_RBF(u(x), u(x')).
Recover the periodic kernel
k_PER(x, x') = exp(-2 sin²((x - x')/2) / l²). (66)
Can you see anything unusual about this kernel?
36 / 100
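A small check that applying the RBF kernel (with a = 1) to the warped inputs u(x) = (cos x, sin x) gives exactly the periodic kernel of Eq. (66); the test points are arbitrary.

```python
import numpy as np

def k_rbf_on_circle(x, xp, l=1.0):
    # RBF kernel applied to u(x) = (cos x, sin x).
    u, up = np.array([np.cos(x), np.sin(x)]), np.array([np.cos(xp), np.sin(xp)])
    return np.exp(-np.sum((u - up)**2) / (2 * l**2))

def k_per(x, xp, l=1.0):
    return np.exp(-2 * np.sin((x - xp) / 2)**2 / l**2)

x, xp = 0.7, 3.9
# Identical, because ||u(x) - u(x')||^2 = 4 sin^2((x - x')/2).
print(k_rbf_on_circle(x, xp), k_per(x, xp))
```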

37 Periodic Kernel
[Figure: (a) covariance as a function of input distance r; (b) sample output y(x) vs input x.]
37 / 100

38 Non-Stationary Kernels
A stationary kernel is invariant to translations of the input space: k = k(τ), τ = x - x'. Intuitively, this means the properties of the function are similar across different regions of the input domain.
How might we make other non-stationary kernels, besides the Gibbs kernel?
38 / 100

39 Non-Stationary Kernels Figure: Non-stationary function 39 / 100

40 Non-Stationary Kernels
[Figure: (a), (b)] Figure: Warp the inputs (in this case, x → x²) to go from a non-stationary function to a stationary function. E.g., apply k(g(x), g(x')) to the data, where g is a warping function.
40 / 100

41 Non-Stationary Kernels 41 / 100

42 Non-Stationary Kernels
Warp the input space: k(x, x') → k(g(x), g(x')), where g is an arbitrary warping function.
Modulate the amplitude of the kernel. If f(x) ~ GP(0, k(x, x')), then a(x) f(x) has kernel a(x) k(x, x') a(x'), conditioned on a(x).
What would happen if we tried w_1(x) f_1(x) + w_2(x) f_2(x), where f_1 and f_2 are GPs with different kernels? How about σ(w_1(x)) f_1(x) + (1 - σ(w_1(x))) f_2(x)?
42 / 100

43 Matérn Kernel
The RBF kernel k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)) is criticized for being too smooth. How might we create a drop-in replacement, while retaining useful inductive biases?
43 / 100

44 Matérn Kernel
The RBF kernel k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)) is criticized for being too smooth. How might we create a drop-in replacement, while retaining useful inductive biases?
We could replace the Euclidean distance measure with an absolute distance measure... then we recover the Ornstein-Uhlenbeck kernel: k_OU(x, x') = exp(-|x - x'|/l). The velocity of a particle undergoing Brownian motion is described by a GP with the OU kernel.
44 / 100

45 Matérn Kernel
The RBF kernel k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)) is criticized for being too smooth. How might we create a drop-in replacement, while retaining useful inductive biases?
We could replace the Euclidean distance measure with an absolute distance measure... then we recover the Ornstein-Uhlenbeck kernel: k_OU(x, x') = exp(-|x - x'|/l). The velocity of a particle undergoing Brownian motion is described by a GP with the OU kernel.
Recall that stationary kernels k(τ), τ = x - x', and spectral densities are Fourier duals of one another:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (67)
S(s) = ∫ k(τ) e^{-2πi s^T τ} dτ. (68)
If we take the Fourier transform of the RBF kernel, we recover a Gaussian spectral density... But we can go from spectral densities to kernels too.
45 / 100

46 Matérn Kernel
The RBF kernel k_RBF(x, x') = a² exp(-||x - x'||²/(2l²)) is criticized for being too smooth. How might we create a drop-in replacement, while retaining useful inductive biases?
We could replace the Euclidean distance measure with an absolute distance measure... then we recover the Ornstein-Uhlenbeck kernel: k_OU(x, x') = exp(-|x - x'|/l). The velocity of a particle undergoing Brownian motion is described by a GP with the OU kernel.
Recall that stationary kernels k(τ), τ = x - x', and spectral densities are Fourier duals of one another:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (69)
S(s) = ∫ k(τ) e^{-2πi s^T τ} dτ. (70)
If we take the Fourier transform of the RBF kernel, we recover a Gaussian spectral density... But we can go from spectral densities to kernels too... If we use a Student-t spectral density for S(s), and take the inverse Fourier transform, we recover the Matérn kernel.
46 / 100

47 Matérn Kernel
k_Matérn(x, x') = (2^{1-ν}/Γ(ν)) (√(2ν) ||x - x'|| / l)^ν K_ν(√(2ν) ||x - x'|| / l), (71)
where K_ν is a modified Bessel function.
In one dimension, and when ν + 1/2 = p for some natural number p, the corresponding GP is a continuous time AR(p) process.
By setting ν = 1/2, we obtain the Ornstein-Uhlenbeck (OU) kernel,
k_OU(x, x') = exp(-||x - x'||/l). (72)
The Matérn kernel does not have concentration of measure problems for high dimensional inputs to the extent of the RBF (Gaussian) kernel (Fastfood: Le, Sarlos, Smola, ICML 2013). The kernel gives rise to a Markovian process (and classical filtering and smoothing algorithms can be applied).
47 / 100
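A sketch of the Matérn kernel of Eq. (71) using scipy's modified Bessel function, with a check that ν = 1/2 recovers the OU (exponential) kernel; hyperparameter values are illustrative.

```python
import numpy as np
from scipy.special import kv, gamma

def matern_kernel(r, nu=1.5, l=1.0):
    # General Matern kernel; r = |x - x'|.
    r = np.atleast_1d(np.abs(r)).astype(float)
    k = np.ones_like(r)                 # k(0) = 1
    m = r > 0
    z = np.sqrt(2 * nu) * r[m] / l
    k[m] = (2**(1 - nu) / gamma(nu)) * z**nu * kv(nu, z)
    return k

r = np.linspace(0, 3, 7)
# With nu = 1/2 the Matern kernel reduces to the OU / exponential kernel exp(-r / l).
print(np.max(np.abs(matern_kernel(r, nu=0.5) - np.exp(-r))))
```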

48 Matérn Kernel
[Figure: (a) OU and RBF kernels, both with the same length-scale l, covariance vs distance r; (b) sample functions from GP-OU and GP-RBF vs inputs x.]
48 / 100

49 Matérn Kernel
From Yunus Saatchi's PhD thesis, Scalable Inference for Structured Gaussian Process Models. 49 / 100

50 Gaussian Processes Are Gaussian processes Bayesian nonparametric models? 50 / 100

51 Nonparametric Kernels
For a Gaussian process f(x) to be non-parametric, f(x_i) conditioned on f_{-i}, where f_{-i} is any collection of function values excluding f(x_i), must be free to take any value in R.
For this freedom to be possible, it is a necessary (but not sufficient) condition for the kernel of the Gaussian process to be derived from an infinite basis function expansion.
Nonparametric kernels allow for a great amount of flexibility: the amount of information the model can represent grows with the amount of available data.
51 / 100

52 Nonparametric RBF vs Finite Dimensional Analogue
The parametric analogue to a GP with a non-parametric RBF kernel becomes more confident in its predictions the further away we get from the data! Rasmussen, MLSS Cambridge. 52 / 100

53 Simple Random Walk
Discrete time auto-regressive model:
f(t) = a f(t - 1) + ε(t), (73)
ε(t) ~ N(0, 1), (74)
a ∈ R, (75)
t = 1, 2, 3, 4, ... (76)
Is this model a Gaussian process?
53 / 100
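The answer is yes: for fixed a (and a fixed starting value), every finite collection (f(1), ..., f(T)) is a linear transformation of Gaussian noise, hence jointly Gaussian. A sketch for the a = 1 random walk started at f(0) = 0, where the implied kernel is k(t, t') = min(t, t'); the starting condition is an assumption of this sketch, not stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_draws = 10, 200_000

# Simple random walk (the a = 1 case), started at f(0) = 0: f(t) = sum_{s <= t} eps(s).
eps = rng.standard_normal((n_draws, T))
f = np.cumsum(eps, axis=1)

emp_cov = np.cov(f, rowvar=False)                 # empirical covariance of (f(1), ..., f(T))
t = np.arange(1, T + 1)
analytic = np.minimum(t[:, None], t[None, :])     # cov(f(t), f(t')) = min(t, t')
print(np.max(np.abs(emp_cov - analytic)))         # small: consistent with a GP with kernel min(t, t')
```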

54 Gaussian Process Covariance Kernels
Let τ = x - x':
k_SE(τ) = exp(-0.5 τ²/l²) (78)
k_MA(τ) = a(1 + √3 |τ|/l) exp(-√3 |τ|/l) (79)
k_RQ(τ) = (1 + τ²/(2αl²))^{-α} (80)
k_PE(τ) = exp(-2 sin²(π τ ω)/l²) (81)
54 / 100

55 CO2 Extrapolation with Standard Kernels
[Figure: CO2 concentration (ppm) vs year.]
55 / 100

56 Gaussian processes
How can Gaussian processes possibly replace neural networks? Did we throw the baby out with the bathwater? David MacKay. 56 / 100

57 More Expressive Covariance Functions
[Diagram: Gaussian process regression network, with node functions f̂_1(x), ..., f̂_q(x), weight functions W_11(x), ..., W_pq(x), and outputs y_1(x), ..., y_p(x).]
Gaussian Process Regression Networks. Wilson et al., ICML 2012. 57 / 100

58 Gaussian Process Regression Network
[Figure: spatial predictions over latitude and longitude.]
58 / 100

59 Expressive Covariance Functions
GPs in Bayesian neural network like architectures (Salakhutdinov and Hinton, 2008; Wilson et al., 2012; Damianou and Lawrence, 2012). Task specific, difficult inference, no closed form kernels.
Compositions of kernels (Archambeau and Bach, 2011; Durrande et al., 2011; Rasmussen and Williams, 2006). In the general case, difficult to interpret, difficult inference, struggle with over-fitting.
One can learn almost nothing about the covariance function of a stochastic process from a single realization, if we assume that the covariance function could be any positive definite function. Most commonly one assumes a restriction to stationary kernels, meaning that covariances are invariant to translations in the input space.
59 / 100

60 Bochner's Theorem
Theorem (Bochner). A complex-valued function k on R^P is the covariance function of a weakly stationary mean square continuous complex-valued random process on R^P if and only if it can be represented as
k(τ) = ∫_{R^P} e^{2πi s^T τ} ψ(ds), (82)
where ψ is a positive finite measure.
If ψ has a density S(s), then S is called the spectral density or power spectrum of k, and k and S are Fourier duals:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (83)
S(s) = ∫ k(τ) e^{-2πi s^T τ} dτ. (84)
60 / 100

61 Idea
k and S are Fourier duals:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (85)
S(s) = ∫ k(τ) e^{-2πi s^T τ} dτ. (86)
If we can approximate S(s) to arbitrary accuracy, then we can approximate any stationary kernel to arbitrary accuracy.
We can model S(s) to arbitrary accuracy, since scale-location mixtures of Gaussians can approximate any distribution to arbitrary accuracy.
A scale-location mixture of Gaussians can flexibly model many distributions, and thus many covariance kernels, even with a small number of components.
61 / 100

62 Kernels for Pattern Discovery
Let τ = x - x' ∈ R^P. From Bochner's theorem,
k(τ) = ∫_{R^P} S(s) e^{2πi s^T τ} ds. (87)
For simplicity, assume τ ∈ R^1 and let
S(s) = [N(s; µ, σ²) + N(-s; µ, σ²)]/2. (88)
Then
k(τ) = exp{-2π²τ²σ²} cos(2πτµ). (89)
More generally, if S(s) is a symmetrized mixture of diagonal covariance Gaussians on R^P, with covariance matrix M_q = diag(v_q^(1), ..., v_q^(P)), then
k(τ) = Σ_{q=1}^Q w_q Π_{p=1}^P cos(2πτ_p µ_q^(p)) exp{-2π²τ_p² v_q^(p)}. (90)
62 / 100
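A sketch of the spectral mixture kernel of Eq. (90) in numpy; the example weights, means, and variances are made up.

```python
import numpy as np

def spectral_mixture_kernel(tau, w, mu, v):
    # tau: (N, P) array of input differences; w: (Q,) weights; mu, v: (Q, P) means and variances.
    tau = np.atleast_2d(tau)
    cos_part = np.cos(2 * np.pi * tau[:, None, :] * mu[None, :, :])          # (N, Q, P)
    exp_part = np.exp(-2 * np.pi**2 * tau[:, None, :]**2 * v[None, :, :])    # (N, Q, P)
    return np.sum(w[None, :] * np.prod(cos_part * exp_part, axis=-1), axis=-1)

# One-dimensional, two-component example with made-up hyperparameters.
tau = np.linspace(-3, 3, 7)[:, None]
w = np.array([0.7, 0.3])
mu = np.array([[0.5], [2.0]])
v = np.array([[0.1], [0.02]])
print(spectral_mixture_kernel(tau, w, mu, v))
```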

63 GP Model for Pattern Extrapolation
Observations y(x) ~ N(y(x); f(x), σ²) (can easily be relaxed).
f(x) ~ GP(0, k_SM(x, x' | θ)) (f(x) is a GP with SM kernel).
k_SM(x, x' | θ) can approximate many different kernels with different settings of its hyperparameters θ.
Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS):
log p(y | θ, X) = -(1/2) y^T (K_θ + σ²I)^{-1} y [model fit] - (1/2) log|K_θ + σ²I| [complexity penalty] - (N/2) log(2π). (91)
Once hyperparameters are trained as θ̂, we make predictions using p(f_* | y, X, θ̂), which can be expressed in closed form.
63 / 100

64 Results, CO2
[Figure: (c) CO2 concentration (ppm) vs year, with train/test split and 95% CR, comparing MA, RQ, PER, SE, and SM kernels; (d) log spectral density vs frequency (1/month) for SM, SE, and the empirical spectrum.]
64 / 100

65 Results, Reconstructing Standard Covariances
[Figure: (a), (b) correlation vs τ for reconstructed standard covariance functions.]
65 / 100

66 Results, Negative Covariances
[Figure: (e) observations vs input x (truth and SM); (f) covariance vs τ; (g) log spectral density vs frequency for SM and the empirical spectrum.]
66 / 100

67 Results, Sinc Pattern
[Figure: (h), (i) observations vs input x, with train/test split, comparing MA, RQ, SE, PER, and SM kernels; (j) correlation vs τ for MA and SM; (k) log spectral density vs frequency for SM, SE, and the empirical spectrum.]
67 / 100

68 Results, Airline Passengers
[Figure: (l) airline passengers (thousands) vs year, with train/test split, comparing PER, SE, RQ, MA, SM, and SSGP; (m) log spectral density vs frequency (1/month) for SM, SE, and the empirical spectrum.]
68 / 100

69 Scaling up kernel machines
Expressive kernels will be most valuable on large datasets.
Computational bottlenecks for GPs:
Inference: (K_θ + σ²I)^{-1} y for an n × n matrix K.
Learning: log|K_θ + σ²I|, for the marginal likelihood evaluations needed to learn θ.
Both inference and learning naively require O(n³) operations and O(n²) storage (typically from computing a Cholesky decomposition of K). Afterwards, the predictive mean and variance cost O(n) and O(n²) per test point.
69 / 100

70 Inference and Learning
1. Learning: Optimize the marginal likelihood,
log p(y | θ, X) = -(1/2) y^T (K_θ + σ²I)^{-1} y [model fit] - (1/2) log|K_θ + σ²I| [complexity penalty] - (N/2) log(2π),
with respect to kernel hyperparameters θ.
2. Inference: Conditioned on kernel hyperparameters θ, form the predictive distribution for test inputs X_*:
f_* | X_*, X, y, θ ~ N(f̄_*, cov(f_*)),
f̄_* = K_θ(X_*, X)[K_θ(X, X) + σ²I]^{-1} y,
cov(f_*) = K_θ(X_*, X_*) - K_θ(X_*, X)[K_θ(X, X) + σ²I]^{-1} K_θ(X, X_*).
70 / 100

71 Scaling up kernel machines
Three families of approaches:
Approximate non-parametric kernels in a finite basis dual space. Requires O(m²n) computations and O(m) storage for m basis functions. Examples: SSGP, Random Kitchen Sinks, Fastfood, À la Carte.
Inducing point based sparse approximations. Examples: SoR, FITC, KISS-GP.
Exploit existing structure in K to quickly (and exactly) solve linear systems and log determinants. Examples: Toeplitz and Kronecker methods.
71 / 100

72 Parametric Expansions via Random Basis Functions
Return to Bochner's theorem:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (92)
S(s) = ∫ k(τ) e^{-2πi s^T τ} dτ. (93)
We can treat S(s) as a probability distribution and sample from it, to approximate the integral for k(τ)! It is a valid Monte Carlo procedure to sample the pairs {s_j, -s_j} from S(s):
k(τ) ≈ (1/(2J)) Σ_{j=1}^J [exp(2πi s_j^T τ) + exp(-2πi s_j^T τ)], s_j ~ S(s) (94)
= (1/J) Σ_{j=1}^J cos(2π s_j^T τ). (95)
This is exactly the covariance function we get if we use a linear basis function model with trigonometric basis functions! Use the basis function representation with finite J for computational efficiency.
72 / 100
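A sketch of this random-feature approximation for the 1-D RBF kernel; the sampling standard deviation 1/(2πl) follows from the Gaussian spectral density under the Fourier convention on the Bochner slide, and is an assumption of this sketch rather than something stated here.

```python
import numpy as np

rng = np.random.default_rng(0)
l, J = 1.0, 5000
tau = np.linspace(0, 4, 9)

# For the 1-D RBF kernel, S(s) is a Gaussian with standard deviation 1/(2 pi l).
s = rng.normal(0.0, 1.0 / (2 * np.pi * l), size=J)
k_mc = np.mean(np.cos(2 * np.pi * np.outer(tau, s)), axis=1)   # (1/J) sum_j cos(2 pi s_j tau)
k_rbf = np.exp(-0.5 * tau**2 / l**2)
print(np.max(np.abs(k_mc - k_rbf)))                            # small Monte Carlo error, shrinks with J
```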

73 Scaling a Gaussian process: inducing inputs
Gaussian process f and f_* evaluated at n training points and J testing points.
m ≪ n inducing points u, p(u) = N(0, K_{u,u}).
p(f_*, f) = ∫ p(f_*, f, u) du = ∫ p(f_*, f | u) p(u) du
Assume that f and f_* are conditionally independent given u:
p(f_*, f) ≈ q(f_*, f) = ∫ q(f | u) q(f_* | u) p(u) du (96)
Exact conditional distributions:
p(f | u) = N(K_{f,u} K_{u,u}^{-1} u, K_{f,f} - Q_{f,f}) (97)
p(f_* | u) = N(K_{f_*,u} K_{u,u}^{-1} u, K_{f_*,f_*} - Q_{f_*,f_*}) (98)
Q_{a,b} = K_{a,u} K_{u,u}^{-1} K_{u,b} (99)
Cost for predictions reduced from O(n³) to O(m²n), where m ≪ n. Different inducing approaches correspond to different additional assumptions about q(f | u) and q(f_* | u).
73 / 100

74 Inducing Point Methods
The inducing points act as a communication channel between the GP evaluated at the training and test points, f and f_*. 74 / 100

75 Subset of Regressors (SoR)
The subset of regressors method uses deterministic conditional distributions with exact means:
q(f | u) = N(K_{X,U} K_{U,U}^{-1} u, 0) (100)
q(f_* | u) = N(K_{X_*,U} K_{U,U}^{-1} u, 0) (101)
Integrate away u via
p(f_*, f) ≈ q(f_*, f) = ∫ q(f | u) q(f_* | u) p(u) du (103)
to obtain the joint distribution
q_SoR(f, f_*) = N(0, [[Q_{X,X}, Q_{X,X_*}], [Q_{X_*,X}, Q_{X_*,X_*}]]), (104)
Q_{a,b} = K_{a,u} K_{u,u}^{-1} K_{u,b}. (105)
75 / 100

76 Subsets of Regressors
The predictive conditional can then be derived as before:
q_SoR(f_* | y) = N(µ_*, A) (106)
µ_* = Q_{X_*,X} (Q_{X,X} + σ²I)^{-1} y (107)
A = Q_{X_*,X_*} - Q_{X_*,X} (Q_{X,X} + σ²I)^{-1} Q_{X,X_*} (108)
This method can be viewed as replacing the exact covariance function k with an approximate covariance function which admits fast computations:
k_SoR(x_i, x_j) = k(x_i, U) K_{U,U}^{-1} k(U, x_j). (109)
76 / 100
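A sketch of the SoR approximate kernel of Eq. (109) on toy data; the inducing input locations and all hyperparameters are arbitrary choices, and the jitter term is a standard numerical device not mentioned on the slide.

```python
import numpy as np

def rbf(X1, X2, l=1.0):
    return np.exp(-0.5 * (X1[:, None] - X2[None, :])**2 / l**2)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 200))        # training inputs (toy)
U = np.linspace(0, 10, 15)                  # m = 15 inducing inputs
y = np.sin(X) + 0.1 * rng.standard_normal(200)
noise = 0.1**2

Kxu = rbf(X, U)
Kuu = rbf(U, U) + 1e-8 * np.eye(len(U))     # jitter for numerical stability
K_sor = Kxu @ np.linalg.solve(Kuu, Kxu.T)   # k_SoR(x_i, x_j) = k(x_i, U) K_UU^{-1} k(U, x_j)

# SoR predictive mean at the training inputs (Eq. 107 with X_* = X).
mu = K_sor @ np.linalg.solve(K_sor + noise * np.eye(len(X)), y)
print(np.abs(K_sor - rbf(X, X)).max())      # low-rank approximation error to the exact kernel
```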

77 Subsets of Regressors
The SoR covariance matrix is
K_SoR(X, X) = K_{X,U} K_{U,U}^{-1} K_{U,X}, (110)
the product of an n × m, an m × m, and an m × n matrix.
For m < n, this is a low rank covariance matrix, corresponding to a degenerate (finite basis) Gaussian process. As a result, for n large, SoR tends to underestimate uncertainty.
77 / 100

78 FITC
FITC, the most popular inducing point method, uses the exact test conditional, and a factorized training conditional:
q_FITC(f | u) = Π_{i=1}^n p(f_i | u) (111)
q_FITC(f_* | u) = p(f_* | u). (112)
Integrating away u, we can derive the FITC approximate kernel as:
k_SoR(x, z) = K_{x,U} K_{U,U}^{-1} K_{U,z}, (113)
k_FITC(x, z) = k_SoR(x, z) + δ_{xz} (k(x, z) - k_SoR(x, z)). (114)
FITC replaces the diagonal of the SoR approximation with the true diagonal of k. FITC corresponds to a non-parametric GP.
78 / 100

79 Kronecker methods
Suppose that for x ∈ R^P, k decomposes as a product of kernels across each input dimension: k(x_i, x_j) = Π_{p=1}^P k^p(x_i^p, x_j^p) (e.g., the RBF kernel has this property).
Suppose the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P.
Then K decomposes into a Kronecker product of matrices over each input dimension: K = K_1 ⊗ ... ⊗ K_P.
The eigendecomposition of K into QVQ^T also decomposes: Q = Q_1 ⊗ ... ⊗ Q_P, V = V_1 ⊗ ... ⊗ V_P.
Assuming equal cardinality for each input dimension, we can thus eigendecompose an N × N matrix K in O(P N^{3/P}) operations instead of O(N³) operations.
79 / 100

80 Kronecker methods
Suppose that for x ∈ R^P, k decomposes as a product of kernels across each input dimension: k(x_i, x_j) = Π_{p=1}^P k^p(x_i^p, x_j^p) (e.g., the RBF kernel has this property).
Suppose the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P.
Then K decomposes into a Kronecker product of matrices over each input dimension: K = K_1 ⊗ ... ⊗ K_P.
The eigendecomposition of K into QVQ^T also decomposes: Q = Q_1 ⊗ ... ⊗ Q_P, V = V_1 ⊗ ... ⊗ V_P.
Assuming equal cardinality for each input dimension, we can thus eigendecompose an N × N matrix K in O(P N^{3/P}) operations instead of O(N³) operations.
Then inference and learning are highly efficient:
(K + σ²I)^{-1} y = (QVQ^T + σ²I)^{-1} y = Q(V + σ²I)^{-1} Q^T y, (115)
log|K + σ²I| = log|QVQ^T + σ²I| = Σ_{i=1}^N log(λ_i + σ²). (116)
80 / 100
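A sketch of Eqs. (115)-(116) for a two-dimensional grid; forming np.kron(Q1, Q2) explicitly below is only for clarity of the check, since a practical implementation would use Kronecker matrix-vector products instead of ever building the full matrices.

```python
import numpy as np

def rbf(x1, x2, l=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / l**2)

# Inputs on a 2-D grid X = X1 x X2 with a product kernel, so K = K1 kron K2.
x1, x2 = np.linspace(0, 1, 30), np.linspace(0, 1, 40)
K1, K2 = rbf(x1, x1), rbf(x2, x2)
noise = 0.1**2
y = np.random.default_rng(0).standard_normal(len(x1) * len(x2))

# Eigendecompose the small factors instead of the full 1200 x 1200 matrix.
V1, Q1 = np.linalg.eigh(K1)
V2, Q2 = np.linalg.eigh(K2)
lam = np.kron(V1, V2)                                   # eigenvalues of K1 kron K2
Q = np.kron(Q1, Q2)                                     # built explicitly here only for the check
alpha = Q @ ((Q.T @ y) / (lam + noise))                 # (K + sigma^2 I)^{-1} y, Eq. (115)
logdet = np.sum(np.log(lam + noise))                    # log|K + sigma^2 I|, Eq. (116)

# Compare against the direct dense computation.
K = np.kron(K1, K2)
print(np.max(np.abs(alpha - np.linalg.solve(K + noise * np.eye(len(y)), y))))
```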

81 Kronecker Methods
We assumed that the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P.
How might we relax this assumption, to use Kronecker methods if there are gaps (missing data) in our multidimensional grid?
81 / 100

82 Kronecker Methods
Assume imaginary points that complete the grid. Place infinite noise on these points so they have no effect on inference. The relevant matrices are no longer Kronecker, but we can get around this using preconditioned conjugate gradients, an iterative linear solver.
82 / 100

83 Kronecker Methods with Missing Data
Assuming we have a dataset of M observations which are not necessarily on a grid, we propose to form a complete grid using W imaginary observations, y_W ~ N(f_W, ε^{-1} I_W), ε → 0.
The total observation vector y = [y_M, y_W]^T has N = M + W entries: y ~ N(f, D_N), where the noise covariance matrix D_N = diag(D_M, ε^{-1} I_W), D_M = σ² I_M.
The imaginary observations y_W have no corrupting effect on inference: the moments of the resulting predictive distribution are exactly the same as for the standard predictive distribution, namely lim_{ε→0} (K_N + D_N)^{-1} y = (K_M + D_M)^{-1} y_M.
83 / 100

84 Kronecker Methods with Missing Inputs
We use preconditioned conjugate gradients to compute (K_N + D_N)^{-1} y. We use the preconditioning matrix C = D_N^{-1/2} to solve C^T (K_N + D_N) C z = C^T y. The preconditioning matrix C speeds up convergence by ignoring the imaginary observations y_W.
For the log determinant complexity term in the marginal likelihood (used in hyperparameter learning),
log|K_M + D_M| = Σ_{i=1}^M log(λ_i^M + σ²) ≈ Σ_{i=1}^M log(λ̃_i^M + σ²), (117)
where λ̃_i^M = (M/N) λ_i^N for i = 1, ..., M.
84 / 100

85 Spectral Mixture Product Kernel
The spectral mixture kernel, in its standard form, does not quite have Kronecker structure. Introduce a spectral mixture product kernel, which takes a product across input dimensions of one-dimensional spectral mixture kernels:
k_SMP(τ | θ) = Π_{p=1}^P k_SM(τ_p | θ_p). (118)
85 / 100

86 GPatt
Observations y(x) ~ N(y(x); f(x), σ²) (can easily be relaxed).
f(x) ~ GP(0, k_SMP(x, x' | θ)) (f(x) is a GP with SMP kernel).
k_SMP(x, x' | θ) can approximate many different kernels with different settings of its hyperparameters θ.
Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS):
log p(y | θ, X) = -(1/2) y^T (K_θ + σ²I)^{-1} y [model fit] - (1/2) log|K_θ + σ²I| [complexity penalty] - (N/2) log(2π). (119)
Once hyperparameters are trained as θ̂, we make predictions using p(f_* | y, X, θ̂), which can be expressed in closed form.
Exploit Kronecker structure for fast exact inference and learning (and extend Kronecker methods to allow for non-grid data). Exact inference and learning requires O(P N^{(P+1)/P}) operations and O(P N^{2/P}) storage, compared to O(N³) operations and O(N²) storage, for N datapoints and P input dimensions.
86 / 100

87 Results (a) Train (b) Test (c) Full (d) GPatt (e) SSGP (f) FITC (g) GP-SE (h) GP-MA (i) GP-RQ 87 / 100

88 Results: Extrapolation and Interpolation with Shadows (a) Train (b) GPatt (c) GP-MA (d) Train (e) GPatt (f) GP-MA 88 / 100

89 Automatic Model Selection via Marginal Likelihood
[Figure: learned spectral mixture components (frequencies µ_1, µ_2) from a simple initialisation.]
The marginal likelihood shrinks weights of extraneous components to zero through the log|K| complexity penalty.
89 / 100

90 Results (a) Train (b) Test (c) Full (d) GPatt (e) SSGP (f) FITC (g) GP-SE (h) GP-MA (i) GP-RQ (j) GPatt Initialisation (k) Train (l) GPatt (m) GP-MA (n) Train (o) GPatt (p) GP-MA 90 / 100

91 More Patterns (a) Rubber mat (b) Tread plate (c) Pores (d) Wood (e) Chain mail (f) Cone 91 / 100

92 Speed and Accuracy Stress Tests (a) Runtime Stress Test (b) Accuracy Stress Test 92 / 100

93 Image Inpainting 93 / 100

94 Recovering Sophisticated Out of Class Kernels
[Figure: true and recovered kernels k1, k2, k3 as functions of τ.]
94 / 100

95 Video Extrapolation GPatt makes almost no assumptions about the correlation structures across input dimensions: it can automatically discover both temporal and spatial correlations! Top row: True frames taken from the middle of a movie. Bottom row: Predicted sequence of frames (all are forecast together). 112,500 datapoints. GPatt training time is under 5 minutes. 95 / 100

96 Land Surface Temperature Forecasting
Train using 9 years of temperature data. The first two rows are the last 12 months of training data; the last two rows are a 12 month ahead forecast. 300,000 data points, with 40% missing data (from ocean). Predictions using GP-SE (a GP with an SE or RBF kernel) and Kronecker inference. 96 / 100

97 Land Surface Temperature Forecasting
Train using 9 years of temperature data. The first two rows are the last 12 months of training data; the last two rows are a 12 month ahead forecast. 300,000 data points, with 40% missing data (from ocean). Predictions using GPatt. Training time < 30 minutes. 97 / 100

98 Learned Kernels for Land Surface Temperatures
[Figure: (a) learned GPatt kernel for temperatures; (b) learned GP-SE kernel for temperatures; correlations shown against X [km], Y [km], and time [months].]
The learned GPatt kernel tells us interesting properties of the data. In this case, the learned kernels are heavy tailed and quasi-periodic.
98 / 100

99 Building Gauss-Markov Processes 99 / 100

100 Generalising inducing point methods Blackboard discussion 100 / 100


More information

Building an Automatic Statistician

Building an Automatic Statistician Building an Automatic Statistician Zoubin Ghahramani Department of Engineering University of Cambridge zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ ALT-Discovery Science Conference, October

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Gaussian Process Latent Variable Models for Dimensionality Reduction and Time Series Modeling

Gaussian Process Latent Variable Models for Dimensionality Reduction and Time Series Modeling Gaussian Process Latent Variable Models for Dimensionality Reduction and Time Series Modeling Nakul Gopalan IAS, TU Darmstadt nakul.gopalan@stud.tu-darmstadt.de Abstract Time series data of high dimensions

More information

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview

More information

LA Support for Scalable Kernel Methods. David Bindel 29 Sep 2018

LA Support for Scalable Kernel Methods. David Bindel 29 Sep 2018 LA Support for Scalable Kernel Methods David Bindel 29 Sep 2018 Collaborators Kun Dong (Cornell CAM) David Eriksson (Cornell CAM) Jake Gardner (Cornell CS) Eric Lee (Cornell CS) Hannes Nickisch (Phillips

More information

Introduction to Gaussian Process

Introduction to Gaussian Process Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

More information

Prediction of Data with help of the Gaussian Process Method

Prediction of Data with help of the Gaussian Process Method of Data with help of the Gaussian Process Method R. Preuss, U. von Toussaint Max-Planck-Institute for Plasma Physics EURATOM Association 878 Garching, Germany March, Abstract The simulation of plasma-wall

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning 10-715 Advanced Introduction to Machine Learning Homework Due Oct 15, 10.30 am Rules Please follow these guidelines. Failure to do so, will result in loss of credit. 1. Homework is due on the due date

More information

Probabilistic & Bayesian deep learning. Andreas Damianou

Probabilistic & Bayesian deep learning. Andreas Damianou Probabilistic & Bayesian deep learning Andreas Damianou Amazon Research Cambridge, UK Talk at University of Sheffield, 19 March 2019 In this talk Not in this talk: CRFs, Boltzmann machines,... In this

More information

Kernel methods, kernel SVM and ridge regression

Kernel methods, kernel SVM and ridge regression Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;

More information

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55

More information

State Space Gaussian Processes with Non-Gaussian Likelihoods

State Space Gaussian Processes with Non-Gaussian Likelihoods State Space Gaussian Processes with Non-Gaussian Likelihoods Hannes Nickisch 1 Arno Solin 2 Alexander Grigorievskiy 2,3 1 Philips Research, 2 Aalto University, 3 Silo.AI ICML2018 July 13, 2018 Outline

More information

Random Projections. Lopez Paz & Duvenaud. November 7, 2013

Random Projections. Lopez Paz & Duvenaud. November 7, 2013 Random Projections Lopez Paz & Duvenaud November 7, 2013 Random Outline The Johnson-Lindenstrauss Lemma (1984) Random Kitchen Sinks (Rahimi and Recht, NIPS 2008) Fastfood (Le et al., ICML 2013) Why random

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Bayesian Hidden Markov Models and Extensions

Bayesian Hidden Markov Models and Extensions Bayesian Hidden Markov Models and Extensions Zoubin Ghahramani Department of Engineering University of Cambridge joint work with Matt Beal, Jurgen van Gael, Yunus Saatci, Tom Stepleton, Yee Whye Teh Modeling

More information

Tokamak profile database construction incorporating Gaussian process regression

Tokamak profile database construction incorporating Gaussian process regression Tokamak profile database construction incorporating Gaussian process regression A. Ho 1, J. Citrin 1, C. Bourdelle 2, Y. Camenen 3, F. Felici 4, M. Maslov 5, K.L. van de Plassche 1,4, H. Weisen 6 and JET

More information

Computational Science and Engineering (Int. Master s Program)

Computational Science and Engineering (Int. Master s Program) Computational Science and Engineering (Int. Master s Program) Technische Universität München Master s Thesis Derivative-Free Stochastic Constrained Optimization using Gaussian Processes, with Application

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning 12. Gaussian Processes Alex Smola Carnegie Mellon University http://alex.smola.org/teaching/cmu2013-10-701 10-701 The Normal Distribution http://www.gaussianprocess.org/gpml/chapters/

More information

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop Music and Machine Learning (IFT68 Winter 8) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

More information