Probabilistic Graphical Models Lecture 21: Advanced Gaussian Processes
1 Probabilistic Graphical Models Lecture 21: Advanced Gaussian Processes. Andrew Gordon Wilson, Carnegie Mellon University. April 1. 1 / 100
2 Gaussian process review. Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.
Nonparametric regression model. Prior: f(x) ~ GP(m(x), k(x, x')), meaning (f(x_1), ..., f(x_N)) ~ N(µ, K), with µ_i = m(x_i) and K_ij = cov(f(x_i), f(x_j)) = k(x_i, x_j).
GP posterior p(f(x) | D) ∝ likelihood p(D | f(x)) × GP prior p(f(x)).
[Figure: Gaussian process sample prior functions (left) and sample posterior functions (right), output f(t).] 2 / 100
3 Gaussian Process Inference. Observed noisy data y = (y(x_1), ..., y(x_N))^T at input locations X. Start with the standard regression assumption: y(x) ~ N(f(x), σ²). Place a Gaussian process distribution over noise-free functions, f(x) ~ GP(0, k_θ). The kernel k is parametrized by θ. Infer p(f_* | y, X, X_*) for the noise-free function f evaluated at test points X_*.
Joint distribution:
[y; f_*] ~ N(0, [K_θ(X,X) + σ²I, K_θ(X,X_*); K_θ(X_*,X), K_θ(X_*,X_*)]).   (1)
Conditional predictive distribution:
f_* | X_*, X, y, θ ~ N(f̄_*, cov(f_*)),   (2)
f̄_* = K_θ(X_*,X)[K_θ(X,X) + σ²I]^{-1} y,   (3)
cov(f_*) = K_θ(X_*,X_*) − K_θ(X_*,X)[K_θ(X,X) + σ²I]^{-1} K_θ(X,X_*).   (4)
3 / 100
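A minimal NumPy sketch of the predictive equations (2)-(4), assuming a toy RBF kernel and synthetic data (all names and values here are illustrative, not from the lecture):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, amplitude=1.0):
    # k(x, x') = a^2 exp(-(x - x')^2 / (2 l^2)), for 1-D inputs
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * d2 / lengthscale**2)

# toy training data and test inputs
X = np.linspace(-3, 3, 20)
y = np.sin(X) + 0.1 * np.random.randn(20)
Xs = np.linspace(-4, 4, 100)
sigma = 0.1                                        # noise standard deviation

Kxx = rbf_kernel(X, X) + sigma**2 * np.eye(len(X)) # K(X,X) + sigma^2 I
Kxs = rbf_kernel(X, Xs)                            # K(X, X_*)
Kss = rbf_kernel(Xs, Xs)                           # K(X_*, X_*)

alpha = np.linalg.solve(Kxx, y)                    # [K + sigma^2 I]^{-1} y
f_mean = Kxs.T @ alpha                             # Eq. (3)
f_cov = Kss - Kxs.T @ np.linalg.solve(Kxx, Kxs)    # Eq. (4)
```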
4 Learning and Model Selection.
p(M_i | y) = p(y | M_i) p(M_i) / p(y).   (5)
We can write the evidence of the model as
p(y | M_i) = ∫ p(y | f, M_i) p(f) df.   (6)
[Figure: (a) evidence p(y | M) over all possible datasets for simple, complex, and appropriate models; (b) corresponding fits f(x) to data y as a function of input x.] 4 / 100
5 Learning and Model Selection. We can integrate away the entire Gaussian process f(x) to obtain the marginal likelihood, as a function of kernel hyperparameters θ alone:
p(y | θ, X) = ∫ p(y | f, X) p(f | θ, X) df.   (7)
log p(y | θ, X) = −(1/2) y^T (K_θ + σ²I)^{-1} y  [model fit]  − (1/2) log |K_θ + σ²I|  [complexity penalty]  − (N/2) log(2π).   (8)
An extremely powerful mechanism for kernel learning.
[Figure: samples from the GP prior (left) and GP posterior (right), output f(x) vs. input x.] 5 / 100
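A small self-contained sketch of evaluating Eq. (8) with a Cholesky decomposition, assuming an RBF kernel and toy data (all names are illustrative):

```python
import numpy as np

def rbf_kernel(X1, X2, l=1.0, a=1.0):
    return a**2 * np.exp(-0.5 * (X1[:, None] - X2[None, :])**2 / l**2)

def log_marginal_likelihood(l, a, X, y, sigma=0.1):
    K = rbf_kernel(X, X, l, a) + sigma**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                              # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # K^{-1} y
    fit = -0.5 * y @ alpha                                 # model-fit term
    complexity = -np.sum(np.log(np.diag(L)))               # -0.5 log|K|
    return fit + complexity - 0.5 * len(X) * np.log(2 * np.pi)

X = np.linspace(-3, 3, 25)
y = np.sin(X) + 0.1 * np.random.randn(25)
print(log_marginal_likelihood(1.0, 1.0, X, y))   # compare across hyperparameter settings
```

In practice one would hand this quantity (and its gradients) to an optimizer such as BFGS to learn θ.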
6 Inference and Learning.
1. Learning: optimize the marginal likelihood,
log p(y | θ, X) = −(1/2) y^T (K_θ + σ²I)^{-1} y  [model fit]  − (1/2) log |K_θ + σ²I|  [complexity penalty]  − (N/2) log(2π),
with respect to kernel hyperparameters θ.
2. Inference: conditioned on kernel hyperparameters θ, form the predictive distribution for test inputs X_*:
f_* | X_*, X, y, θ ~ N(f̄_*, cov(f_*)),
f̄_* = K_θ(X_*,X)[K_θ(X,X) + σ²I]^{-1} y,
cov(f_*) = K_θ(X_*,X_*) − K_θ(X_*,X)[K_θ(X,X) + σ²I]^{-1} K_θ(X,X_*).
6 / 100
7 Learning and Model Selection. A fully Bayesian treatment would integrate away kernel hyperparameters θ:
p(f_* | X_*, X, y) = ∫ p(f_* | X_*, X, y, θ) p(θ | y) dθ.   (9)
For example, we could specify a prior p(θ), use MCMC to take J samples from p(θ | y) ∝ p(y | θ) p(θ), and then find
p(f_* | X_*, X, y) ≈ (1/J) Σ_{i=1}^J p(f_* | X_*, X, y, θ^(i)),   θ^(i) ~ p(θ | y).   (10)
If we have a non-Gaussian noise model, and thus cannot integrate away f, the strong dependencies between the Gaussian process f and hyperparameters θ make sampling extremely difficult. In my experience, the most effective solution is to use a deterministic approximation for the posterior p(f | y), which enables one to work with an approximate marginal likelihood. 7 / 100
8 Popular Kernels. Let τ = x − x':
k_SE(τ) = exp(−0.5 τ²/ℓ²)   (11)
k_MA(τ) = a(1 + √3 τ/ℓ) exp(−√3 τ/ℓ)   (12)
k_RQ(τ) = (1 + τ²/(2αℓ²))^{−α}   (13)
k_PE(τ) = exp(−2 sin²(π τ ω)/ℓ²)   (14)
8 / 100
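A sketch of these four kernels as NumPy functions of τ = x − x' (parameter names and defaults are illustrative):

```python
import numpy as np

def k_se(tau, l=1.0):
    return np.exp(-0.5 * tau**2 / l**2)

def k_ma(tau, l=1.0, a=1.0):          # Matern-3/2 form
    r = np.sqrt(3) * np.abs(tau) / l
    return a * (1 + r) * np.exp(-r)

def k_rq(tau, l=1.0, alpha=1.0):
    return (1 + tau**2 / (2 * alpha * l**2)) ** (-alpha)

def k_pe(tau, l=1.0, omega=1.0):      # periodic, frequency omega
    return np.exp(-2 * np.sin(np.pi * np.abs(tau) * omega)**2 / l**2)
```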
9 Worked Example: Combining Kernels, CO₂ Data. [Figure: CO₂ concentration (ppm) vs. year.] Example from Rasmussen and Williams (2006), Gaussian Processes for Machine Learning. 9 / 100
10 Worked Example: Combining Kernels, CO 2 Data 10 / 100
11 Worked Example: Combining Kernels, CO₂ Data.
Long rising trend: k_1(x_p, x_q) = θ_1² exp(−(x_p − x_q)²/(2θ_2²)).
Quasi-periodic seasonal changes: k_2(x_p, x_q) = k_RBF(x_p, x_q) k_PER(x_p, x_q) = θ_3² exp(−(x_p − x_q)²/(2θ_4²) − 2 sin²(π(x_p − x_q))/θ_5²).
Multi-scale medium term irregularities: k_3(x_p, x_q) = θ_6² (1 + (x_p − x_q)²/(2θ_8 θ_7²))^{−θ_8}.
Correlated and i.i.d. noise: k_4(x_p, x_q) = θ_9² exp(−(x_p − x_q)²/(2θ_10²)) + θ_11² δ_pq.
k_total(x_p, x_q) = k_1(x_p, x_q) + k_2(x_p, x_q) + k_3(x_p, x_q) + k_4(x_p, x_q).
11 / 100
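A sketch of the composite CO₂ kernel, following the decomposition above; the θ values below are placeholders rather than trained hyperparameters:

```python
import numpy as np

def k_co2(xp, xq, th):
    # th[1]..th[11] play the roles of theta_1..theta_11 above (th[0] unused);
    # in practice these would be learned by maximizing the marginal likelihood.
    d = xp - xq
    k1 = th[1]**2 * np.exp(-d**2 / (2 * th[2]**2))                    # long rising trend
    k2 = th[3]**2 * np.exp(-d**2 / (2 * th[4]**2)
                           - 2 * np.sin(np.pi * d)**2 / th[5]**2)     # quasi-periodic seasonal
    k3 = th[6]**2 * (1 + d**2 / (2 * th[8] * th[7]**2))**(-th[8])     # medium-term irregularities
    k4 = th[9]**2 * np.exp(-d**2 / (2 * th[10]**2)) \
         + th[11]**2 * float(np.isclose(xp, xq))                      # correlated + i.i.d. noise
    return k1 + k2 + k3 + k4

theta = np.ones(12)                     # placeholder hyperparameters
print(k_co2(1998.5, 1998.0, theta))
```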
12 What is a kernel? Informally, k describes the similarities between pairs of data points. For example, far away points may be considered less similar than nearby points. K_ij = ⟨φ(x_i), φ(x_j)⟩, and so tells us the overlap between the features (basis functions) φ(x_i) and φ(x_j). We have seen that all linear basis function models f(x) = w^T φ(x), with p(w) = N(0, Σ_w), correspond to Gaussian processes with kernel k(x, x') = φ(x)^T Σ_w φ(x'). We have also accumulated some experience with the RBF kernel k_RBF(x, x') = a² exp(−||x − x'||²/(2ℓ²)). The kernel controls the generalisation behaviour of a kernel machine. For example, a kernel controls the support and inductive biases of a Gaussian process: which functions are a priori likely. A kernel is also known as a covariance function or covariance kernel in the context of Gaussian processes. 12 / 100
13 Candidate Kernel. Symmetric:
k(x, x') = 1 if |x − x'| ≤ 1, and 0 otherwise.
Provides information about proximity of points. Exercise: is it a valid kernel? 13 / 100
14 Candidate Kernel. k(x, x') = 1 if |x − x'| ≤ 1, and 0 otherwise. Try the points x_1 = 1, x_2 = 2, x_3 = 3. Compute the kernel matrix
K = [? ? ?; ? ? ?; ? ? ?].   (15)
14 / 100
15 Candidate Kernel. k(x, x') = 1 if |x − x'| ≤ 1, and 0 otherwise. Try the points x_1 = 1, x_2 = 2, x_3 = 3. Compute the kernel matrix
K = [1 1 0; 1 1 1; 0 1 1].   (16)
The eigenvalues of K are (√2 + 1), 1, and (1 − √2). Therefore K is not positive semidefinite. 15 / 100
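A quick numerical check of this calculation (illustrative only):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
K = (np.abs(x[:, None] - x[None, :]) <= 1).astype(float)
print(K)
# [[1. 1. 0.]
#  [1. 1. 1.]
#  [0. 1. 1.]]
print(np.linalg.eigvalsh(K))   # approx [1 - sqrt(2), 1, 1 + sqrt(2)]; one eigenvalue is negative
```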
16 Representer Theorem. A decision function f(x) can be written as
f(x) = ⟨w, φ(x)⟩ = Σ_{i=1}^N α_i ⟨φ(x_i), φ(x)⟩ = Σ_{i=1}^N α_i k(x_i, x).   (17)
The representer theorem says this function exists with finitely many coefficients α_i even when φ is infinite dimensional (an infinite number of basis functions). Initially viewed as a strength of kernel methods, for datasets not exceeding e.g. ten thousand points. Unfortunately, the number of nonzero α_i often grows linearly in the size of the training set N. Example: in GP regression, the predictive mean is
E[f_* | y, X, x_*] = k_*^T (K + σ²I)^{-1} y = Σ_{i=1}^N α_i k(x_i, x_*),   (18)
where α = (K + σ²I)^{-1} y. 16 / 100
17 Making new kernels from old. Suppose k_1(x, x') and k_2(x, x') are valid. Then the following covariance functions are also valid:
k(x, x') = g(x) k_1(x, x') g(x')   (19)
k(x, x') = q(k_1(x, x'))   (20)
k(x, x') = exp(k_1(x, x'))   (21)
k(x, x') = k_1(x, x') + k_2(x, x')   (22)
k(x, x') = k_1(x, x') k_2(x, x')   (23)
k(x, x') = k_3(φ(x), φ(x'))   (24)
k(x, x') = x^T A x'   (25)
k(x, x') = k_a(x_a, x_a') + k_b(x_b, x_b')   (26)
k(x, x') = k_a(x_a, x_a') k_b(x_b, x_b')   (27)
where g is any function, q is a polynomial with nonnegative coefficients, φ(x) is a function from x to R^M, k_3 is a valid covariance function in R^M, A is a symmetric positive definite matrix, x_a and x_b are not necessarily disjoint variables with x = (x_a, x_b)^T, and k_a and k_b are valid kernels in their respective spaces. 17 / 100
18 Stationary Kernels. A stationary kernel is invariant to translations of the input space. Equivalently, k = k(x − x') = k(τ). All distance kernels, k = k(||x − x'||), are examples of stationary kernels. The RBF kernel k_RBF(x, x') = a² exp(−||x − x'||²/(2ℓ²)) is a stationary kernel. The polynomial kernel k_POL(x, x') = (x^T x' + σ_0²)^p is an example of a non-stationary kernel. Stationarity provides a useful inductive bias. 18 / 100
19 Bochner's Theorem. Theorem (Bochner). A complex-valued function k on R^P is the covariance function of a weakly stationary mean square continuous complex-valued random process on R^P if and only if it can be represented as
k(τ) = ∫_{R^P} e^{2πi s^T τ} ψ(ds),   (28)
where ψ is a positive finite measure. If ψ has a density S(s), then S is called the spectral density or power spectrum of k, and k and S are Fourier duals:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds,   (29)
S(s) = ∫ k(τ) e^{−2πi s^T τ} dτ.   (30)
19 / 100
20 Review: Linear Basis Function Models.
Model specification:
f(x, w) = w^T φ(x)   (31)
p(w) = N(0, Σ_w)   (32)
Moments of the induced distribution over functions:
E[f(x, w)] = m(x) = E[w^T] φ(x) = 0   (33)
cov(f(x_i), f(x_j)) = k(x_i, x_j) = E[f(x_i) f(x_j)] − E[f(x_i)] E[f(x_j)]   (34)
= φ(x_i)^T E[w w^T] φ(x_j) − 0   (35)
= φ(x_i)^T Σ_w φ(x_j)   (36)
f(x, w) is a Gaussian process, f(x) ~ GP(m, k), with mean function m(x) = 0 and covariance kernel k(x_i, x_j) = φ(x_i)^T Σ_w φ(x_j). The entire basis function model of Eqs. (31) and (32) is encapsulated as a distribution over functions with kernel k(x, x'). 20 / 100
21 Deriving the RBF Kernel. Start with the basis model
f(x) = Σ_{i=1}^J w_i φ_i(x),   (37)
w_i ~ N(0, σ²/J),   (38)
φ_i(x) = exp(−(x − c_i)²/(2ℓ²)).   (39)
Equations (37)-(39) define a radial basis function regression model, with radial basis functions centred at the points c_i. Using our result for the kernel of a generalised linear model,
k(x, x') = (σ²/J) Σ_{i=1}^J φ_i(x) φ_i(x').   (40)
21 / 100
22 Deriving the RBF Kernel.
f(x) = Σ_{i=1}^J w_i φ_i(x),  w_i ~ N(0, σ²/J),  φ_i(x) = exp(−(x − c_i)²/(2ℓ²))   (41)
k(x, x') = (σ²/J) Σ_{i=1}^J φ_i(x) φ_i(x')   (42)
Letting c_{i+1} − c_i = Δc = 1/J, and J → ∞, the kernel in Eq. (42) becomes a Riemann sum:
k(x, x') = lim_{J→∞} (σ²/J) Σ_{i=1}^J φ_i(x) φ_i(x') = σ² ∫_{c_0}^{c_∞} φ_c(x) φ_c(x') dc.   (43)
By setting c_0 = −∞ and c_∞ = ∞, we spread the infinitely many basis functions across the whole real line, each an infinitesimal distance Δc apart:
k(x, x') = σ² ∫_{−∞}^{∞} exp(−(x − c)²/(2ℓ²)) exp(−(x' − c)²/(2ℓ²)) dc   (44)
= √π ℓ σ² exp(−(x − x')²/(4ℓ²)).   (45)
22 / 100
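A numerical sanity check of the limit (44)-(45): a dense grid of RBF basis functions standing in for J → ∞ reproduces the closed-form kernel (toy values assumed):

```python
import numpy as np

l, sigma = 0.7, 1.3
x, xp = 0.3, 1.1
centers = np.linspace(-15, 15, 60001)       # dense grid of basis-function centres
dc = centers[1] - centers[0]

phi = lambda z: np.exp(-(z - centers)**2 / (2 * l**2))
k_num = sigma**2 * np.sum(phi(x) * phi(xp)) * dc           # Riemann sum over centres
k_closed = np.sqrt(np.pi) * l * sigma**2 * np.exp(-(x - xp)**2 / (4 * l**2))
print(k_num, k_closed)                                     # agree to high precision
```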
23 Deriving the RBF Kernel. It is remarkable that we can work with infinitely many basis functions using only finite amounts of computation, via the kernel trick: replacing inner products of basis functions with kernels. The RBF kernel, also known as the Gaussian or squared exponential kernel, is by far the most popular kernel:
k_RBF(x, x') = a² exp(−||x − x'||²/(2ℓ²)).
Recall Bochner's theorem. If we take the Fourier transform of the RBF kernel we recover a Gaussian spectral density, S(s) = (2πℓ²)^{D/2} exp(−2π²ℓ²s²), for x ∈ R^D. Therefore the RBF kernel does not have much support for high frequency functions, since a Gaussian does not have heavy tails. Functions drawn from a GP with an RBF kernel are infinitely differentiable. For this reason, the RBF kernel is accused of being overly smooth and unrealistic. Nonetheless it has nice theoretical properties. 23 / 100
24 The RBF Kernel. [Figure: SE kernels with different length-scales (e.g. ℓ = 7 and ℓ = 0.7), plotted as k(τ) against τ = x − x'.] 24 / 100
25 Representer Theorem. A decision function f(x) can be written as
f(x) = ⟨w, φ(x)⟩ = Σ_{i=1}^N α_i ⟨φ(x_i), φ(x)⟩ = Σ_{i=1}^N α_i k(x_i, x).   (46)
The representer theorem says this function exists with finitely many coefficients α_i even when φ is infinite dimensional (an infinite number of basis functions). Initially viewed as a strength of kernel methods, for datasets not exceeding e.g. ten thousand points. Unfortunately, the number of nonzero α_i often grows linearly in the size of the training set N. Example: in GP regression, the predictive mean is
E[f_* | y, X, x_*] = k_*^T (K + σ²I)^{-1} y = Σ_{i=1}^N α_i k(x_i, x_*),   (47)
where α = (K + σ²I)^{-1} y. 25 / 100
26 Polynomial Kernel. We have already shown that the simple linear model
f(x, w) = w^T x + b,   (48)
p(w) = N(0, α²I),   (49)
p(b) = N(0, β²),   (50)
corresponds to a Gaussian process with kernel
k_LIN(x, x') = α² x^T x' + β².   (51)
Samples from a GP with k_LIN(x, x') will thus be straight lines. Recall that the product of two kernels is a valid kernel. The product of two linear kernels is a quadratic kernel, which gives rise to quadratic functions:
k_QUAD(x, x') = k_LIN(x, x') k_LIN(x, x').   (52)
For example, if β = 0, α = 1, and x ∈ R², then k_QUAD(x, x') = φ(x)^T φ(x') with φ(x) = (x_1², x_2², √2 x_1 x_2)^T, where x = (x_1, x_2). We can generalize to the polynomial kernel
k_POL(x, x') = (α² x^T x' + β²)^p.   (53)
26 / 100
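A small check, under the stated assumptions β = 0, α = 1, x ∈ R², that the quadratic kernel equals the inner product of the explicit features φ(x):

```python
import numpy as np

def k_lin(x, z):                 # linear kernel with alpha = 1, beta = 0
    return x @ z

def phi(x):                      # explicit quadratic feature map
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([0.5, -1.2])
z = np.array([2.0, 0.3])
print(k_lin(x, z)**2, phi(x) @ phi(z))    # identical values
```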
27 The Rational Quadratic Kernel What if we want data varying at multiple scales? 27 / 100
28 The Rational Quadratic Kernel. Try a scale mixture of RBF kernels. Let r = x − x':
k(r) = ∫ exp(−r²/(2ℓ²)) p(ℓ) dℓ.
For example, we can consider a Gamma density for p(ℓ). Letting γ = ℓ^{−2}, with g(γ | α, β) ∝ γ^{α−1} exp(−αγ/β) and β^{−1} = ℓ², the rational quadratic (RQ) kernel is derived as
k_RQ(r) = ∫_0^∞ k_RBF(r | γ) g(γ | α, β) dγ = (1 + r²/(2αℓ²))^{−α}.   (54)
One could derive other interesting covariance functions using different (non-Gamma) densities for p(ℓ). 28 / 100
29 The Rational Quadratic Kernel.
k_RQ(r) = (1 + r²/(2αℓ²))^{−α}   (55)
r = τ = x − x'.   (56)
[Figure: (a) RQ kernels k(τ) with different α (e.g. α = 40 and α = 0.1); (b) sample GP functions f(x) drawn with RQ kernels.] 29 / 100
30 Neural Network Kernel. The neural network kernel (Neal, 1996) is famous for triggering research on Gaussian processes in the machine learning community. Consider a neural network with one hidden layer:
f(x) = b + Σ_{i=1}^J v_i h(x; u_i).   (57)
b is a bias, v_i are the hidden-to-output weights, h is any bounded hidden unit transfer function, u_i are the input-to-hidden weights, and J is the number of hidden units. Let b and v_i be independent with zero mean and variances σ_b² and σ_v²/J, respectively, and let the u_i have independent identical distributions. Collecting all free parameters into the weight vector w,
E_w[f(x)] = 0,   (58)
cov[f(x), f(x')] = E_w[f(x) f(x')] = σ_b² + (1/J) Σ_{i=1}^J σ_v² E_u[h(x; u_i) h(x'; u_i)]   (59)
= σ_b² + σ_v² E_u[h(x; u) h(x'; u)].   (60)
30 / 100
31 Neural Network Kernel.
f(x) = b + Σ_{i=1}^J v_i h(x; u_i).   (61)
Let h(x; u) = erf(u_0 + Σ_{j=1}^P u_j x_j), where erf(z) = (2/√π) ∫_0^z e^{−t²} dt. Choose u ~ N(0, Σ). Then we obtain
k_NN(x, x') = (2/π) sin^{-1}( 2 x̃^T Σ x̃' / √((1 + 2 x̃^T Σ x̃)(1 + 2 x̃'^T Σ x̃')) ),   (62)
where x ∈ R^P and x̃ = (1, x^T)^T. 31 / 100
32 Neural Network Kernel Figure: Draws from a GP with a Neural Network Kernel with Varying σ Rasmussen and Williams (2006) 32 / 100
33 Gibbs Kernel. Recall the RBF kernel
k_RBF(x, x') = a² exp(−||x − x'||²/(2ℓ²)).   (63)
What if we want to make the length-scale ℓ input dependent, so that the resulting function is biased to vary more quickly in some parts of the input space than in others? 33 / 100
34 Gibbs Kernel. Recall the RBF kernel
k_RBF(x, x') = a² exp(−||x − x'||²/(2ℓ²)).   (64)
What if we want to make the length-scale ℓ input dependent, so that the resulting function is biased to vary more quickly in some parts of the input space than in others? Just letting ℓ → ℓ(x) doesn't produce a valid kernel. 34 / 100
35 Gibbs Kernel.
k_Gibbs(x, x') = Π_{p=1}^P ( 2 ℓ_p(x) ℓ_p(x') / (ℓ_p(x)² + ℓ_p(x')²) )^{1/2} exp( −Σ_{p=1}^P (x_p − x_p')² / (ℓ_p(x)² + ℓ_p(x')²) ),   (65)
where x_p is the p-th component of x.
[Figure: (a) input-dependent length-scale; (b) a sample function drawn with the Gibbs kernel. Rasmussen and Williams (2006).] 35 / 100
36 Periodic Kernel. Transform the inputs through a vector-valued function u(x) = (cos(x), sin(x)). Apply the RBF kernel in u space: k_RBF(x, x') → k_RBF(u(x), u(x')). Recover the periodic kernel
k_PER(x, x') = exp(−2 sin²((x − x')/2) / ℓ²).   (66)
Can you see anything unusual about this kernel? 36 / 100
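A sketch verifying the warping view numerically (with RBF amplitude a = 1): since ||u(x) − u(x')||² = 4 sin²((x − x')/2), the warped RBF kernel and Eq. (66) coincide:

```python
import numpy as np

def k_rbf(u, v, l=1.0):
    return np.exp(-np.sum((u - v)**2) / (2 * l**2))

def k_per(x, xp, l=1.0):
    return np.exp(-2 * np.sin((x - xp) / 2)**2 / l**2)

u = lambda x: np.array([np.cos(x), np.sin(x)])   # warp onto the unit circle
x, xp = 0.4, 2.9
print(k_rbf(u(x), u(xp)), k_per(x, xp))          # equal values
```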
37 Periodic Kernel. [Figure: (a) the periodic covariance as a function of input distance r; (b) sample functions y(x) drawn with the periodic kernel, as a function of input x.] 37 / 100
38 Non-Stationary Kernels A stationary kernel is invariant to translations of the input space: k = k(τ), τ = x x. Intuitively, this means the properties of the function are similar across different regions of the input domain. How might we make other non-stationary kernels, besides the Gibbs kernel? 38 / 100
39 Non-Stationary Kernels Figure: Non-stationary function 39 / 100
40 Non-Stationary Kernels. [Figure: (a), (b).] Figure: warp the inputs (in this case, x → x²) to go from a non-stationary function to a stationary function. E.g., apply k(g(x), g(x')) to the data, where g is a warping function. 40 / 100
41 Non-Stationary Kernels / 100
42 Non-Stationary Kernels Warp the input space: k(x, x ) k(g(x), g(x )) where g is an arbitrary warping function. Modulate the amplitude of the kernel. If f (x) GP(0, k(x, x )) then a(x)f (x) has kernel a(x)k(x, x )a(x ), conditioned on a(x). What would happen if we tried w 1 (x)f 1 (x) + w 2 (x)f 2 (x) where f 1 and f 2 are GPs with different kernels? How about σ(w 1 (x))f 1 (x) + (1 σ(w 1 (x)))f 2 (x)? 42 / 100
43 Matérn Kernel. The RBF kernel k_RBF(x, x') = a² exp(−||x − x'||²/(2ℓ²)) is criticized for being too smooth. How might we create a drop-in replacement, while retaining useful inductive biases? 43 / 100
44 Matérn Kernel. The RBF kernel k_RBF(x, x') = a² exp(−||x − x'||²/(2ℓ²)) is criticized for being too smooth. How might we create a drop-in replacement, while retaining useful inductive biases? We could replace the squared Euclidean distance with an absolute distance... then we recover the Ornstein-Uhlenbeck kernel: k_OU(x, x') = exp(−|x − x'|/ℓ). The velocity of a particle undergoing Brownian motion is described by a GP with the OU kernel. 44 / 100
45 Matérn Kernel. The RBF kernel k_RBF(x, x') = a² exp(−||x − x'||²/(2ℓ²)) is criticized for being too smooth. How might we create a drop-in replacement, while retaining useful inductive biases? We could replace the squared Euclidean distance with an absolute distance... then we recover the Ornstein-Uhlenbeck kernel: k_OU(x, x') = exp(−|x − x'|/ℓ). The velocity of a particle undergoing Brownian motion is described by a GP with the OU kernel. Recall that stationary kernels k(τ), τ = x − x', and spectral densities are Fourier duals of one another:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds,   (67)
S(s) = ∫ k(τ) e^{−2πi s^T τ} dτ.   (68)
If we take the Fourier transform of the RBF kernel, we recover a Gaussian spectral density... But we can go from spectral densities to kernels too. 45 / 100
46 Matérn Kernel. The RBF kernel k_RBF(x, x') = a² exp(−||x − x'||²/(2ℓ²)) is criticized for being too smooth. How might we create a drop-in replacement, while retaining useful inductive biases? We could replace the squared Euclidean distance with an absolute distance... then we recover the Ornstein-Uhlenbeck kernel: k_OU(x, x') = exp(−|x − x'|/ℓ). The velocity of a particle undergoing Brownian motion is described by a GP with the OU kernel. Recall that stationary kernels k(τ), τ = x − x', and spectral densities are Fourier duals of one another:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds,   (69)
S(s) = ∫ k(τ) e^{−2πi s^T τ} dτ.   (70)
If we take the Fourier transform of the RBF kernel, we recover a Gaussian spectral density... But we can go from spectral densities to kernels too... If we use a Student-t spectral density for S(s), and take the inverse Fourier transform, we recover the Matérn kernel. 46 / 100
47 Matérn Kernel.
k_Matérn(x, x') = (2^{1−ν}/Γ(ν)) (√(2ν) ||x − x'||/ℓ)^ν K_ν(√(2ν) ||x − x'||/ℓ),   (71)
where K_ν is a modified Bessel function. In one dimension, and when ν + 1/2 = p for some natural number p, the corresponding GP is a continuous time AR(p) process. By setting ν = 1/2, we obtain the Ornstein-Uhlenbeck (OU) kernel,
k_OU(x, x') = exp(−||x − x'||/ℓ).   (72)
The Matérn kernel does not have concentration of measure problems for high dimensional inputs to the extent of the RBF (Gaussian) kernel (Fastfood: Le, Sarlos, Smola, ICML 2013). The kernel gives rise to a Markovian process (and classical filtering and smoothing algorithms can be applied). 47 / 100
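A sketch of Eq. (71) using SciPy's modified Bessel function of the second kind; setting ν = 1/2 recovers the OU kernel (toy inputs assumed):

```python
import numpy as np
from scipy.special import kv, gamma

def k_matern(r, nu=1.5, l=1.0):
    r = np.maximum(np.abs(r), 1e-12)          # avoid 0/0 at r = 0 (the limit is 1)
    z = np.sqrt(2 * nu) * r / l
    return (2**(1 - nu) / gamma(nu)) * z**nu * kv(nu, z)

r = np.linspace(0.01, 3, 5)
print(k_matern(r, nu=0.5), np.exp(-r))        # Matern-1/2 matches the OU kernel exp(-r / l)
```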
48 Matérn Kernel. [Figure: (a) OU and RBF covariances k_OU and k_RBF, both with the same length-scale ℓ, as a function of distance r; (b) sample functions f(x) drawn from a GP with the OU kernel and from a GP with the RBF kernel.] 48 / 100
49 Matérn Kernel. From Yunus Saatchi's PhD thesis, Scalable Inference for Structured Gaussian Process Models. 49 / 100
50 Gaussian Processes Are Gaussian processes Bayesian nonparametric models? 50 / 100
51 Nonparametric Kernels. For a Gaussian process f(x) to be non-parametric, f(x_i) conditioned on f_i, where f_i is any collection of function values excluding f(x_i), must be free to take any value in R. For this freedom to be possible, it is a necessary (but not sufficient) condition that the kernel of the Gaussian process be derived from an infinite basis function expansion. Nonparametric kernels allow for a great amount of flexibility: the amount of information the model can represent grows with the amount of available data. 51 / 100
52 Nonparametric RBF vs Finite Dimensional Analogue The parametric analogue to a GP with a non-parametric RBF kernel becomes more confident in its predictions, the further away we get from the data! Rasmussen, MLSS Cambridge, / 100
53 Simple Random Walk. Discrete time auto-regressive model:
f(t) = a f(t−1) + ε(t),   (73)
ε(t) ~ N(0, 1),   (74)
a ∈ R,   (75)
t = 1, 2, 3, 4, ...   (76)
Is this model a Gaussian process? 53 / 100
54 Gaussian Process Covariance Kernels. Let τ = x − x':
k_SE(τ) = exp(−0.5 τ²/ℓ²)   (78)
k_MA(τ) = a(1 + √3 τ/ℓ) exp(−√3 τ/ℓ)   (79)
k_RQ(τ) = (1 + τ²/(2αℓ²))^{−α}   (80)
k_PE(τ) = exp(−2 sin²(π τ ω)/ℓ²)   (81)
54 / 100
55 CO₂ Extrapolation with Standard Kernels. [Figure: CO₂ concentration (ppm) vs. year.] 55 / 100
56 Gaussian processes How can Gaussian processes possibly replace neural networks? Did we throw the baby out with the bathwater? David MacKay, / 100
57 More Expressive Covariance Functions. [Figure: Gaussian process regression network architecture: input x feeds latent GP nodes f̂_1(x), ..., f̂_q(x), which are mixed by weight functions W_11(x), ..., W_pq(x) to produce outputs y_1(x), ..., y_p(x).] Gaussian Process Regression Networks. Wilson et al., ICML 2012. 57 / 100
58 Gaussian Process Regression Network latitude longitude / 100
59 Expressive Covariance Functions GPs in Bayesian neural network like architectures. (Salakhutdinov and Hinton, 2008; Wilson et. al, 2012; Damianou and Lawrence, 2012). Task specific, difficult inference, no closed form kernels. Compositions of kernels. (Archambeau and Bach, 2011; Durrande et. al, 2011; Rasmussen and Williams, 2006). In the general case, difficult to interpret, difficult inference, struggle with over-fitting. Can learn almost nothing about the covariance function of a stochastic process from a single realization, if we assume that the covariance function could be any positive definite function. Most commonly one assumes a restriction to stationary kernels, meaning that covariances are invariant to translations in the input space. 59 / 100
60 Bochner's Theorem. Theorem (Bochner). A complex-valued function k on R^P is the covariance function of a weakly stationary mean square continuous complex-valued random process on R^P if and only if it can be represented as
k(τ) = ∫_{R^P} e^{2πi s^T τ} ψ(ds),   (82)
where ψ is a positive finite measure. If ψ has a density S(s), then S is called the spectral density or power spectrum of k, and k and S are Fourier duals:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds,   (83)
S(s) = ∫ k(τ) e^{−2πi s^T τ} dτ.   (84)
60 / 100
61 Idea. k and S are Fourier duals:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds,   (85)
S(s) = ∫ k(τ) e^{−2πi s^T τ} dτ.   (86)
If we can approximate S(s) to arbitrary accuracy, then we can approximate any stationary kernel to arbitrary accuracy. We can model S(s) to arbitrary accuracy, since scale-location mixtures of Gaussians can approximate any distribution to arbitrary accuracy. A scale-location mixture of Gaussians can flexibly model many distributions, and thus many covariance kernels, even with a small number of components. 61 / 100
62 Kernels for Pattern Discovery. Let τ = x − x' ∈ R^P. From Bochner's theorem,
k(τ) = ∫_{R^P} S(s) e^{2πi s^T τ} ds.   (87)
For simplicity, assume τ ∈ R^1 and let
S(s) = [N(s; µ, σ²) + N(−s; µ, σ²)] / 2.   (88)
Then
k(τ) = exp(−2π²τ²σ²) cos(2πτµ).   (89)
More generally, if S(s) is a symmetrized mixture of diagonal covariance Gaussians on R^P, with the q-th component having covariance matrix M_q = diag(v_q^(1), ..., v_q^(P)), then
k(τ) = Σ_{q=1}^Q w_q Π_{p=1}^P cos(2π τ_p µ_q^(p)) exp(−2π² τ_p² v_q^(p)).   (90)
62 / 100
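A minimal sketch of the one-dimensional spectral mixture kernel of Eqs. (88)-(90), with illustrative weights, spectral means, and variances:

```python
import numpy as np

def k_sm(tau, weights, means, variances):
    # weighted sum over Q components of exp(-2 pi^2 tau^2 v_q) cos(2 pi tau mu_q)
    tau = np.asarray(tau)[..., None]          # broadcast tau against components
    comp = np.exp(-2 * np.pi**2 * tau**2 * variances) * np.cos(2 * np.pi * tau * means)
    return comp @ weights

tau = np.linspace(-4, 4, 9)
print(k_sm(tau,
           weights=np.array([0.7, 0.3]),
           means=np.array([0.25, 1.0]),
           variances=np.array([0.02, 0.1])))
```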
63 GP Model for Pattern Extrapolation. Observations y(x) ~ N(y(x); f(x), σ²) (can easily be relaxed). f(x) ~ GP(0, k_SM(x, x' | θ)) (f(x) is a GP with SM kernel). k_SM(x, x' | θ) can approximate many different kernels with different settings of its hyperparameters θ. Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS):
log p(y | θ, X) = −(1/2) y^T (K_θ + σ²I)^{-1} y  [model fit]  − (1/2) log |K_θ + σ²I|  [complexity penalty]  − (N/2) log(2π).   (91)
Once hyperparameters are trained as θ̂, we make predictions using p(f_* | y, X, θ̂), which can be expressed in closed form. 63 / 100
64 Results, CO₂. [Figure: (c) CO₂ concentration (ppm) vs. year, showing training and test data, the 95% credible region, and extrapolations with MA, RQ, PER, SE, and SM kernels; (d) log spectral density vs. frequency (1/month) for the SM kernel, the SE kernel, and the empirical spectrum.] 64 / 100
65 Results, Reconstructing Standard Covariances. [Figure: (a), (b) recovered correlation functions plotted against τ.] 65 / 100
66 Results, Negative Covariances. [Figure: (e) observations vs. input x; (f) true and SM-learned covariance vs. τ; (g) log spectral density of the SM kernel and the empirical spectrum vs. frequency.] 66 / 100
67 Results, Sinc Pattern. [Figure: (h), (i) observations vs. input x, with train/test regions and extrapolations using MA, RQ, SE, PER, and SM kernels; (j) correlation functions of the MA and SM kernels vs. τ; (k) log spectral density of the SM kernel, the SE kernel, and the empirical spectrum vs. frequency.] 67 / 100
68 Results, Airline Passengers. [Figure: (l) airline passenger counts (thousands) vs. year, with train/test split and extrapolations using PER, SE, RQ, MA, SM, and SSGP; (m) log spectral density of the SM kernel, the SE kernel, and the empirical spectrum vs. frequency (1/month).] 68 / 100
69 Scaling up kernel machines. Expressive kernels will be most valuable on large datasets. Computational bottlenecks for GPs: Inference: (K_θ + σ²I)^{-1} y for an n × n matrix K. Learning: log |K_θ + σ²I|, for the marginal likelihood evaluations needed to learn θ. Both inference and learning naively require O(n³) operations and O(n²) storage (typically from computing a Cholesky decomposition of K). Afterwards, the predictive mean and variance cost O(n) and O(n²) per test point. 69 / 100
70 Inference and Learning.
1. Learning: optimize the marginal likelihood,
log p(y | θ, X) = −(1/2) y^T (K_θ + σ²I)^{-1} y  [model fit]  − (1/2) log |K_θ + σ²I|  [complexity penalty]  − (N/2) log(2π),
with respect to kernel hyperparameters θ.
2. Inference: conditioned on kernel hyperparameters θ, form the predictive distribution for test inputs X_*:
f_* | X_*, X, y, θ ~ N(f̄_*, cov(f_*)),
f̄_* = K_θ(X_*,X)[K_θ(X,X) + σ²I]^{-1} y,
cov(f_*) = K_θ(X_*,X_*) − K_θ(X_*,X)[K_θ(X,X) + σ²I]^{-1} K_θ(X,X_*).
70 / 100
71 Scaling up kernel machines Three Families of Approaches Approximate non-parametric kernels in a finite basis dual space. Requires O(m 2 n) computations and O(m) storage for m basis functions. Examples: SSGP, Random Kitchen Sinks, Fastfood, À la Carte. Inducing point based sparse approximations. Examples: SoR, FITC, KISS-GP. Exploit existing structure in K to quickly (and exactly) solve linear systems and log determinants. Examples: Toeplitz and Kronecker methods. 71 / 100
72 Parametric Expansions via Random Basis Functions. Return to Bochner's theorem:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds,   (92)
S(s) = ∫ k(τ) e^{−2πi s^T τ} dτ.   (93)
We can treat S(s) as a probability distribution and sample from it, to approximate the integral for k(τ)! It is a valid Monte Carlo procedure to sample the pairs {s_j, −s_j} from S(s):
k(τ) ≈ (1/2J) Σ_{j=1}^J [exp(2πi s_j^T τ) + exp(−2πi s_j^T τ)],   s_j ~ S(s)   (94)
= (1/J) Σ_{j=1}^J cos(2π s_j^T τ).   (95)
This is exactly the covariance function we get if we use a linear basis function model with trigonometric basis functions! Use the basis function representation with finite J for computational efficiency. 72 / 100
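A Monte Carlo sketch of Eq. (95) for the RBF kernel, whose spectral density is Gaussian; the frequencies and sample size below are illustrative:

```python
import numpy as np

l, J = 1.0, 5000
rng = np.random.default_rng(0)
s = rng.normal(0.0, 1.0 / (2 * np.pi * l), size=J)   # s_j ~ S(s) for the RBF kernel

tau = np.linspace(0, 3, 7)
k_hat = np.cos(2 * np.pi * np.outer(tau, s)).mean(axis=1)   # (1/J) sum_j cos(2 pi s_j tau)
k_true = np.exp(-0.5 * tau**2 / l**2)
print(np.max(np.abs(k_hat - k_true)))                       # small Monte Carlo error
```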
73 Scaling a Gaussian process: inducing inputs. Gaussian process f and f_* evaluated at n training points and J testing points. m ≪ n inducing points u, with p(u) = N(0, K_{u,u}).
p(f_*, f) = ∫ p(f_*, f, u) du = ∫ p(f_*, f | u) p(u) du.
Assume that f and f_* are conditionally independent given u:
p(f_*, f) ≈ q(f_*, f) = ∫ q(f_* | u) q(f | u) p(u) du.   (96)
Exact conditional distributions:
p(f | u) = N(K_{f,u} K_{u,u}^{-1} u, K_{f,f} − Q_{f,f}),   (97)
p(f_* | u) = N(K_{f_*,u} K_{u,u}^{-1} u, K_{f_*,f_*} − Q_{f_*,f_*}),   (98)
Q_{a,b} = K_{a,u} K_{u,u}^{-1} K_{u,b}.   (99)
Cost for predictions reduced from O(n³) to O(m²n), where m ≪ n. Different inducing approaches correspond to different additional assumptions about q(f | u) and q(f_* | u). 73 / 100
74 Inducing Point Methods The inducing points act as a communication channel between the GP evaluated at the training and test points, f and f : 74 / 100
75 Subset of Regressors (SoR). The subset of regressors method uses deterministic conditional distributions with exact means:
q(f | u) = N(K_{X,U} K_{U,U}^{-1} u, 0),   (100)
q(f_* | u) = N(K_{X_*,U} K_{U,U}^{-1} u, 0).   (101)
Integrate away u via
p(f_*, f) ≈ q(f_*, f) = ∫ q(f_* | u) q(f | u) p(u) du   (103)
to obtain the joint distribution
q_SoR(f, f_*) = N(0, [Q_{X,X}, Q_{X,X_*}; Q_{X_*,X}, Q_{X_*,X_*}]),   (104)
Q_{a,b} = K_{a,u} K_{u,u}^{-1} K_{u,b}.   (105)
75 / 100
76 Subsets of Regressors. The predictive conditional can then be derived as before:
q_SoR(f_* | y) = N(µ_*, A_*),   (106)
µ_* = Q_{X_*,X} (Q_{X,X} + σ²I)^{-1} y,   (107)
A_* = Q_{X_*,X_*} − Q_{X_*,X} (Q_{X,X} + σ²I)^{-1} Q_{X,X_*}.   (108)
This method can be viewed as replacing the exact covariance function k with an approximate covariance function which admits fast computations:
k_SoR(x_i, x_j) = k(x_i, U) K_{U,U}^{-1} k(U, x_j).   (109)
76 / 100
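A sketch of the SoR predictive mean, Eq. (107), built from the approximate kernel (109) with hypothetical inducing inputs U and toy data; a practical implementation would additionally use the matrix inversion lemma to reach O(m²n) cost rather than solving an n × n system directly:

```python
import numpy as np

def rbf(A, B, l=1.0):
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2 / l**2)

n, m, sigma = 500, 15, 0.1
X = np.linspace(-3, 3, n)                          # training inputs
y = np.sin(2 * X) + sigma * np.random.randn(n)
Xs = np.linspace(-4, 4, 200)                       # test inputs
U = np.linspace(-3, 3, m)                          # inducing inputs

Kuu = rbf(U, U) + 1e-8 * np.eye(m)                 # jitter for numerical stability
Kxu, Ksu = rbf(X, U), rbf(Xs, U)
Qxx = Kxu @ np.linalg.solve(Kuu, Kxu.T)            # n x n but rank m
Qsx = Ksu @ np.linalg.solve(Kuu, Kxu.T)
mu = Qsx @ np.linalg.solve(Qxx + sigma**2 * np.eye(n), y)    # SoR predictive mean
```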
77 Subsets of Regressors. The SoR covariance matrix is
K_SoR(X, X) = K_{X,U} K_{U,U}^{-1} K_{U,X},   (110)
where K_SoR(X, X) is n × n, K_{X,U} is n × m, K_{U,U} is m × m, and K_{U,X} is m × n. For m < n, this is a low rank covariance matrix, corresponding to a degenerate (finite basis) Gaussian process. As a result, for n large, SoR tends to underestimate uncertainty. 77 / 100
78 FITC. FITC, the most popular inducing point method, uses the exact test conditional, and a factorized training conditional:
q_FITC(f | u) = Π_{i=1}^n p(f_i | u),   (111)
q_FITC(f_* | u) = p(f_* | u).   (112)
Integrating away u, we can derive the FITC approximate kernel as:
k_SoR(x, z) = K_{x,U} K_{U,U}^{-1} K_{U,z},   (113)
k_FITC(x, z) = k_SoR(x, z) + δ_{xz} (k(x, z) − k_SoR(x, z)).   (114)
FITC replaces the diagonal of the SoR approximation with the true diagonal of k. FITC corresponds to a non-parametric GP. 78 / 100
79 Kronecker methods. Suppose that for x ∈ R^P, k decomposes as a product of kernels across each input dimension: k(x_i, x_j) = Π_{p=1}^P k^p(x_i^p, x_j^p) (e.g., the RBF kernel has this property). Suppose the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P. Then K decomposes into a Kronecker product of matrices over each input dimension: K = K_1 ⊗ ... ⊗ K_P. The eigendecomposition of K into QVQ^T also decomposes: Q = Q_1 ⊗ ... ⊗ Q_P, V = V_1 ⊗ ... ⊗ V_P. Assuming equal cardinality for each input dimension, we can thus eigendecompose an N × N matrix K in O(PN^{3/P}) operations instead of O(N³) operations. 79 / 100
80 Kronecker methods. Suppose that for x ∈ R^P, k decomposes as a product of kernels across each input dimension: k(x_i, x_j) = Π_{p=1}^P k^p(x_i^p, x_j^p) (e.g., the RBF kernel has this property). Suppose the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P. Then K decomposes into a Kronecker product of matrices over each input dimension: K = K_1 ⊗ ... ⊗ K_P. The eigendecomposition of K into QVQ^T also decomposes: Q = Q_1 ⊗ ... ⊗ Q_P, V = V_1 ⊗ ... ⊗ V_P. Assuming equal cardinality for each input dimension, we can thus eigendecompose an N × N matrix K in O(PN^{3/P}) operations instead of O(N³) operations. Then inference and learning are highly efficient:
(K + σ²I)^{-1} y = (QVQ^T + σ²I)^{-1} y = Q(V + σ²I)^{-1} Q^T y,   (115)
log |K + σ²I| = log |QVQ^T + σ²I| = Σ_{i=1}^N log(λ_i + σ²).   (116)
80 / 100
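A sketch of the Kronecker solve in Eqs. (115)-(116) for P = 2, assuming a product RBF kernel on a toy grid; the matrix-vector products with Q = Q_1 ⊗ Q_2 are carried out via reshapes rather than forming Q explicitly:

```python
import numpy as np

n1, n2, sigma = 30, 40, 0.1
x1, x2 = np.linspace(0, 1, n1), np.linspace(0, 1, n2)
K1 = np.exp(-0.5 * (x1[:, None] - x1[None, :])**2 / 0.2**2)   # per-dimension kernel matrices
K2 = np.exp(-0.5 * (x2[:, None] - x2[None, :])**2 / 0.2**2)

V1, Q1 = np.linalg.eigh(K1)                       # small per-dimension eigendecompositions
V2, Q2 = np.linalg.eigh(K2)
lam = np.kron(V1, V2)                             # eigenvalues of K = K1 kron K2

y = np.random.randn(n1 * n2)
Qty = (Q1.T @ y.reshape(n1, n2) @ Q2).ravel()     # Q^T y without forming Q
alpha = (Q1 @ (Qty / (lam + sigma**2)).reshape(n1, n2) @ Q2.T).ravel()  # (K + s^2 I)^{-1} y
logdet = np.sum(np.log(lam + sigma**2))           # log |K + sigma^2 I|, Eq. (116)
```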
81 Kronecker Methods We assumed that the inputs x X are on a multidimensional grid X = X 1 X P R P. How might we relax this assumption, to use Kronecker methods if there are gaps (missing data) in our multidimensional grid? 81 / 100
82 Kronecker Methods Assume imaginary points that complete the grid Place infinite noise on these points so they have no effect on inference The relevant matrices are no longer Kronecker, but we can get around this using pre-conditioned conjugate gradients, an iterative linear solver. 82 / 100
83 Kronecker Methods with Missing Data. Assuming we have a dataset of M observations which are not necessarily on a grid, we propose to form a complete grid using W imaginary observations, y_W ~ N(f_W, ε^{-1} I_W), ε → 0. The total observation vector y = [y_M, y_W]^T has N = M + W entries: y ~ N(f, D_N), where the noise covariance matrix D_N = diag(D_M, ε^{-1} I_W), with D_M = σ² I_M. The imaginary observations y_W have no corrupting effect on inference: the moments of the resulting predictive distribution are exactly the same as for the standard predictive distribution, namely
lim_{ε→0} (K_N + D_N)^{-1} y = (K_M + D_M)^{-1} y_M.
83 / 100
84 Kronecker Methods with Missing Inputs. We use preconditioned conjugate gradients to compute (K_N + D_N)^{-1} y. We use the preconditioning matrix C = D_N^{-1/2} to solve C^T (K_N + D_N) C z = C^T y. The preconditioning matrix C speeds up convergence by ignoring the imaginary observations y_W. For the log determinant complexity penalty in the marginal likelihood (used in hyperparameter learning),
log |K_M + D_M| = Σ_{i=1}^M log(λ_i^M + σ²) ≈ Σ_{i=1}^M log(λ̃_i^M + σ²),   (117)
where λ̃_i^M = (M/N) λ_i^N for i = 1, ..., M. 84 / 100
85 Spectral Mixture Product Kernel. The spectral mixture kernel, in its standard form, does not quite have Kronecker structure. Introduce a spectral mixture product kernel, which takes a product across input dimensions of one-dimensional spectral mixture kernels:
k_SMP(τ | θ) = Π_{p=1}^P k_SM(τ_p | θ_p).   (118)
85 / 100
86 GPatt. Observations y(x) ~ N(y(x); f(x), σ²) (can easily be relaxed). f(x) ~ GP(0, k_SMP(x, x' | θ)) (f(x) is a GP with SMP kernel). k_SMP(x, x' | θ) can approximate many different kernels with different settings of its hyperparameters θ. Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS):
log p(y | θ, X) = −(1/2) y^T (K_θ + σ²I)^{-1} y  [model fit]  − (1/2) log |K_θ + σ²I|  [complexity penalty]  − (N/2) log(2π).   (119)
Once hyperparameters are trained as θ̂, we make predictions using p(f_* | y, X, θ̂), which can be expressed in closed form. Exploit Kronecker structure for fast exact inference and learning (and extend Kronecker methods to allow for non-grid data). Exact inference and learning requires O(P N^{(P+1)/P}) operations and O(P N^{2/P}) storage, compared to O(N³) operations and O(N²) storage, for N datapoints and P input dimensions. 86 / 100
87 Results (a) Train (b) Test (c) Full (d) GPatt (e) SSGP (f) FITC (g) GP-SE (h) GP-MA (i) GP-RQ 87 / 100
88 Results: Extrapolation and Interpolation with Shadows (a) Train (b) GPatt (c) GP-MA (d) Train (e) GPatt (f) GP-MA 88 / 100
89 Automatic Model Selection via Marginal Likelihood. [Figure: learned spectral mixture components starting from a simple initialisation.] The marginal likelihood shrinks the weights of extraneous components to zero through the log |K| complexity penalty. 89 / 100
90 Results (a) Train (b) Test (c) Full (d) GPatt (e) SSGP (f) FITC 0 µ (g) GP-SE (h) GP-MA (i) GP-RQ (j) GPatt Initialisation (k) Train (l) GPatt (m) GP-MA (n) Train (o) GPatt (p) GP-MA 90 / 100
91 More Patterns (a) Rubber mat (b) Tread plate (c) Pores (d) Wood (e) Chain mail (f) Cone 91 / 100
92 Speed and Accuracy Stress Tests (a) Runtime Stress Test (b) Accuracy Stress Test 92 / 100
93 Image Inpainting 93 / 100
94 Recovering Sophisticated Out of Class Kernels. [Figure: true and recovered kernels k1, k2, k3, plotted against τ.] 94 / 100
95 Video Extrapolation GPatt makes almost no assumptions about the correlation structures across input dimensions: it can automatically discover both temporal and spatial correlations! Top row: True frames taken from the middle of a movie. Bottom row: Predicted sequence of frames (all are forecast together). 112,500 datapoints. GPatt training time is under 5 minutes. 95 / 100
96 Land Surface Temperature Forecasting Train using 9 years of temperature data. First two rows are the last 12 months of training data, last two rows is a 12 month ahead forecast. 300, 000 data points, with 40% missing data (from ocean). Predictions using GP-SE (GP with an SE or RBF kernel), and Kronecker Inference / 100
97 Land Surface Temperature Forecasting Train using 9 years of temperature data. First two rows are the last 12 months of training data, last two rows is a 12 month ahead forecast. 300, 000 data points, with 40% missing data (from ocean). Predictions using GPatt. Training time < 30 minutes / 100
98 Learned Kernels for Land Surface Temperatures. [Figure: (a) learned GPatt kernel for temperatures, showing correlations against distance in X [km], Y [km], and time [months]; (b) learned GP-SE kernel for comparison.] The learned GPatt kernel tells us interesting properties of the data. In this case, the learned kernels are heavy tailed and quasi-periodic. 98 / 100
99 Building Gauss-Markov Processes 99 / 100
100 Generalising inducing point methods Blackboard discussion 100 / 100