Kernel Methods for Large Scale Representation Learning

1 Kernel Methods for Large Scale Representation Learning Andrew Gordon Wilson Carnegie Mellon University November 2014, University of Oxford 1 / 70

2 Themes and Conclusions (High Level) Machine learning aims to develop algorithms which can automatically discover rich statistical representations of data and make impressive generalisations. Flexibility and scalability are two sides of one coin: larger datasets provide relatively more information for learning rich statistical representations. We therefore wish to simultaneously increase the flexibility and scalability of machine learning methods. Bayesian nonparametric methods are natural for big datasets, since they automatically scale their information capacity with the amount of available data. 2 / 70

3 Themes and Conclusions (Kernels) The generalisation properties of a kernel method are entirely controlled by a kernel function, which represents an inner product of arbitrarily many basis functions. In other words, the support and inductive biases of a kernel method (which enable learning) are determined by the kernel. The main advantage of a Gaussian process, over other kernel machines, is access to a marginal likelihood, which provides a powerful probabilistic framework for kernel learning. This advantage is currently underappreciated, because we typically use very simple kernels with only a few hyperparameters. In order to enable powerful representation learning, we can introduce highly expressive kernels, and learn the properties of these kernels through marginal likelihood optimisation. The best way to scale up expressive kernel learning approaches is by exploiting the existing structure in the kernel (e.g., Kronecker methods). Nonparametric kernel methods correspond to infinite basis function expansions; these methods are well suited to performing ambitious generalisations from large datasets. 3 / 70

4 Outline (Specific), Part 1 Introduce expressive spectral mixture kernels by modelling a spectral density (the Fourier transform of a kernel) with scale-location Gaussian mixtures. Adapt these kernels for Kronecker structure, and exploit this structure for exact inference and learning, which costs O(PN^((P+1)/P)) computations and O(PN^(2/P)) storage, for N datapoints and P input dimensions, compared to the standard O(N^3) computations and O(N^2) storage associated with GPs. Extend Kronecker methods to account for non-grid data, whilst preserving exact inference. 4 / 70

5 Outline (Specific), Part 2 Show that i) truly nonparametric representations, ii) expressive kernels, and iii) structure exploiting inference, when used in combination, distinctly enable large scale pattern extrapolation and representation learning, including: long range spatiotemporal forecasting, image inpainting, video extrapolation, and kernel discovery. This is the first time, as far as we are aware, that highly expressive non-parametric kernels with in some cases hundreds of hyperparameters, on datasets exceeding N = 10^5 training instances, can be learned from the marginal likelihood of a GP, in only minutes. Such experiments show that one can, to some extent, solve kernel selection, and automatically extract useful features from the data, on large datasets, using a special combination of expressive kernels and scalable inference. 5 / 70

6 Gaussian processes
Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.
Nonparametric Regression Model
Prior: f(x) ~ GP(m(x), k(x, x')), meaning (f(x_1), ..., f(x_N)) ~ N(µ, K), with µ_i = m(x_i) and K_ij = cov(f(x_i), f(x_j)) = k(x_i, x_j).
GP posterior p(f(x) | D) ∝ likelihood p(D | f(x)) × GP prior p(f(x)).
[Figure: (a) sample prior functions and (b) sample posterior functions f(t) from a Gaussian process, as functions of input t.] 6 / 70

7 Gaussian Process Inference
Observed noisy data y = (y(x_1), ..., y(x_N))^T at input locations X.
Start with the standard regression assumption: y(x) ~ N(y(x); f(x), σ^2).
Place a Gaussian process distribution over noise-free functions, f(x) ~ GP(0, k_θ). The kernel k is parametrized by θ.
Infer p(f_* | y, X, X_*) for the noise-free function f_* evaluated at test points X_*.
Joint distribution:
[y; f_*] ~ N(0, [K_θ(X, X) + σ^2 I, K_θ(X, X_*); K_θ(X_*, X), K_θ(X_*, X_*)]). (1)
Conditional predictive distribution:
f_* | X, X_*, y, θ ~ N(f̄_*, cov(f_*)), (2)
f̄_* = K_θ(X_*, X)[K_θ(X, X) + σ^2 I]^{-1} y, (3)
cov(f_*) = K_θ(X_*, X_*) − K_θ(X_*, X)[K_θ(X, X) + σ^2 I]^{-1} K_θ(X, X_*). (4)
7 / 70
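
For concreteness, a minimal numpy sketch of the predictive equations (3)-(4), assuming an SE kernel and synthetic data; the function and variable names (se_kernel, gp_predict) are illustrative, not code from these slides.

```python
import numpy as np

def se_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    # Squared exponential kernel: k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2)).
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X, y, Xstar, noise_var=0.1):
    # Predictive mean and covariance at test inputs Xstar, Eqs. (3) and (4).
    K = se_kernel(X, X) + noise_var * np.eye(len(X))   # K_theta(X, X) + sigma^2 I
    Ks = se_kernel(X, Xstar)                           # K_theta(X, X_*)
    Kss = se_kernel(Xstar, Xstar)                      # K_theta(X_*, X_*)
    mean = Ks.T @ np.linalg.solve(K, y)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, cov

X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)
Xstar = np.linspace(0, 6, 50)[:, None]
mean, cov = gp_predict(X, y, Xstar)
```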

8 Learning and Model Selection
p(M_i | y) = p(y | M_i) p(M_i) / p(y) (5)
We can write the evidence of the model as
p(y | M_i) = ∫ p(y | f, M_i) p(f) df. (6)
[Figure: (a) evidence p(y | M) of simple, complex, and appropriate models across all possible datasets; (b) fits of the simple, complex, and appropriate models to data y as a function of input x.] 8 / 70

9 Learning and Model Selection
We can integrate away the entire Gaussian process f(x) to obtain the marginal likelihood, as a function of kernel hyperparameters θ alone:
p(y | θ, X) = ∫ p(y | f, X) p(f | θ, X) df. (7)
log p(y | θ, X) = −(1/2) y^T (K_θ + σ^2 I)^{-1} y [model fit] − (1/2) log |K_θ + σ^2 I| [complexity penalty] − (N/2) log(2π). (8)
An extremely powerful mechanism for kernel learning.
[Figure: (a) samples from the GP prior; (b) samples from the GP posterior.] 9 / 70
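
A corresponding sketch of Eq. (8), computing the log marginal likelihood from a Cholesky factorisation; K_theta here is an N x N kernel matrix computed elsewhere (illustrative code, not the implementation behind these slides).

```python
import numpy as np

def log_marginal_likelihood(K_theta, y, noise_var):
    # Eq. (8): model fit + complexity penalty + normalising constant.
    N = len(y)
    L = np.linalg.cholesky(K_theta + noise_var * np.eye(N))   # K_theta + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))       # (K_theta + sigma^2 I)^{-1} y
    model_fit = -0.5 * y @ alpha
    complexity_penalty = -np.sum(np.log(np.diag(L)))          # -0.5 log|K_theta + sigma^2 I|
    return model_fit + complexity_penalty - 0.5 * N * np.log(2 * np.pi)
```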

10 Learning and Model Selection
A fully Bayesian treatment would integrate away kernel hyperparameters θ:
p(f_* | X, X_*, y) = ∫ p(f_* | X, X_*, y, θ) p(θ | y) dθ. (9)
For example, we could specify a prior p(θ), use MCMC to take J samples from p(θ | y) ∝ p(y | θ) p(θ), and then find
p(f_* | X, X_*, y) ≈ (1/J) Σ_{i=1}^J p(f_* | X, X_*, y, θ^(i)), θ^(i) ~ p(θ | y). (10)
If we have a non-Gaussian noise model, and thus cannot integrate away f, the strong dependencies between the Gaussian process f and hyperparameters θ make sampling extremely difficult. In my experience, the most effective solution is to use a deterministic approximation for the posterior p(f | y) which enables one to work with an approximate marginal likelihood. 10 / 70

11 Linear Basis Function Models
Model Specification:
f(x, w) = w^T φ(x) (11)
p(w) = N(0, Σ_w) (12)
Moments of Induced Distribution over Functions:
E[f(x, w)] = m(x) = E[w^T] φ(x) = 0 (13)
cov(f(x_i), f(x_j)) = k(x_i, x_j) = E[f(x_i) f(x_j)] − E[f(x_i)] E[f(x_j)] (14)
= φ(x_i)^T E[w w^T] φ(x_j) − 0 (15)
= φ(x_i)^T Σ_w φ(x_j) (16)
f(x, w) is a Gaussian process, f(x) ~ GP(m, k), with mean function m(x) = 0 and covariance kernel k(x_a, x_b) = φ(x_a)^T Σ_w φ(x_b). The entire basis function model of Eqs. (11) and (12) is encapsulated as a distribution over functions with kernel k(x, x'). 11 / 70
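
A quick Monte Carlo check of Eqs. (13)-(16), using made-up RBF basis functions: the empirical covariance of f(x, w) over random weight draws should match φ(x)^T Σ_w φ(x').

```python
import numpy as np

rng = np.random.default_rng(0)
centres = np.linspace(-3, 3, 5)
phi = lambda x: np.exp(-0.5 * (x - centres)**2)      # 5 RBF basis functions

Sigma_w = np.diag(rng.uniform(0.5, 2.0, size=5))     # prior covariance of the weights
x1, x2 = 0.3, 1.1

k_analytic = phi(x1) @ Sigma_w @ phi(x2)             # Eq. (16)

W = rng.multivariate_normal(np.zeros(5), Sigma_w, size=200_000)
f1, f2 = W @ phi(x1), W @ phi(x2)                    # f(x, w) = w^T phi(x) for each draw
k_empirical = np.mean(f1 * f2)                       # mean function is zero, so this is the covariance

print(k_analytic, k_empirical)                       # the two should agree closely
```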

12 Deriving the RBF Kernel
Start with the generalised linear model
f(x) = Σ_{i=1}^J w_i φ_i(x), (17)
w_i ~ N(0, σ^2/J), (18)
φ_i(x) = exp(−(x − c_i)^2 / (2l^2)). (19)
Equations (17)-(19) define a radial basis function regression model, with radial basis functions centred at the points c_i. Using our result for the kernel of a generalised linear model,
k(x, x') = (σ^2/J) Σ_{i=1}^J φ_i(x) φ_i(x'). (20)
12 / 70

13 Deriving the SE (aka RBF, Gaussian) Kernel
f(x) = Σ_{i=1}^J w_i φ_i(x), w_i ~ N(0, σ^2/J), φ_i(x) = exp(−(x − c_i)^2 / (2l^2)) (21)
k(x, x') = (σ^2/J) Σ_{i=1}^J φ_i(x) φ_i(x') (22)
Letting c_{i+1} − c_i = Δc = 1/J, and J → ∞, the kernel in Eq. (22) becomes a Riemann sum:
k(x, x') = lim_{J→∞} (σ^2/J) Σ_{i=1}^J φ_i(x) φ_i(x') = ∫_{c_0}^{c_∞} φ_c(x) φ_c(x') dc. (23)
By setting c_0 = −∞ and c_∞ = ∞, we spread the infinitely many basis functions across the whole real line, each a distance Δc → 0 apart:
k(x, x') = σ^2 ∫ exp(−(x − c)^2/(2l^2)) exp(−(x' − c)^2/(2l^2)) dc (24)
= √π l σ^2 exp(−(x − x')^2 / (2(√2 l)^2)). (25)
13 / 70
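
A numerical check of the limit in Eqs. (23)-(25): a Riemann sum over densely spaced basis-function centres approaches the closed-form SE kernel (a sketch; the grid limits below are simply chosen wide enough that the Gaussian tails are negligible).

```python
import numpy as np

l, sigma2 = 0.7, 1.3
x, xp = 0.4, 1.5

phi = lambda z, c: np.exp(-(z - c)**2 / (2 * l**2))

c = np.linspace(-20, 20, 200_001)               # dense grid of basis-function centres
dc = c[1] - c[0]
k_riemann = sigma2 * np.sum(phi(x, c) * phi(xp, c)) * dc      # Eq. (24) as a Riemann sum

k_closed = np.sqrt(np.pi) * l * sigma2 * np.exp(-(x - xp)**2 / (2 * (np.sqrt(2) * l)**2))  # Eq. (25)
print(k_riemann, k_closed)                      # agree to several decimal places
```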

14 Gaussian Process Covariance Kernels
Let τ = x − x':
k_SE(τ) = exp(−0.5 τ^2 / l^2) (26)
k_MA(τ) = a(1 + √3 τ / l) exp(−√3 τ / l) (27)
k_RQ(τ) = (1 + τ^2 / (2 α l^2))^{−α} (28)
k_PE(τ) = exp(−2 sin^2(π τ ω) / l^2) (29)
14 / 70
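
The four kernels above as one-dimensional functions of τ = x − x' (a sketch; the Matérn amplitude a and periodic frequency ω are written as explicit arguments).

```python
import numpy as np

def k_se(tau, l=1.0):                        # squared exponential, Eq. (26)
    return np.exp(-0.5 * tau**2 / l**2)

def k_ma(tau, l=1.0, a=1.0):                 # Matern-3/2, Eq. (27)
    r = np.sqrt(3) * np.abs(tau) / l
    return a * (1 + r) * np.exp(-r)

def k_rq(tau, l=1.0, alpha=1.0):             # rational quadratic, Eq. (28)
    return (1 + tau**2 / (2 * alpha * l**2)) ** (-alpha)

def k_pe(tau, l=1.0, omega=1.0):             # periodic, Eq. (29)
    return np.exp(-2 * np.sin(np.pi * tau * omega)**2 / l**2)
```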

15 CO2 Extrapolation with Standard Kernels — [Figure: CO2 concentration (ppm) vs. year.] 15 / 70

16 Gaussian processes "How can Gaussian processes possibly replace neural networks? Did we throw the baby out with the bathwater?" David MacKay 16 / 70

17 More Expressive Covariance Functions — [Diagram: Gaussian process regression network, in which the input x feeds latent GP node functions f̂_1(x), ..., f̂_q(x), which are mixed by GP weight functions W_11(x), ..., W_pq(x) to produce outputs y_1(x), ..., y_p(x).] Gaussian Process Regression Networks. Wilson et al., ICML 2012. 17 / 70

18 Gaussian Process Regression Network — [Figure: predictions over latitude and longitude.] 18 / 70

19 Expressive Covariance Functions GPs in Bayesian neural-network-like architectures (Salakhutdinov and Hinton, 2008; Wilson et al., 2012; Damianou and Lawrence, 2012): task specific, difficult inference, no closed form kernels. Compositions of kernels (Archambeau and Bach, 2011; Durrande et al., 2011; Rasmussen and Williams, 2006): in the general case, difficult to interpret, difficult inference, struggle with over-fitting. One can learn almost nothing about the covariance function of a stochastic process from a single realization, if we assume that the covariance function could be any positive definite function. Most commonly one assumes a restriction to stationary kernels, meaning that covariances are invariant to translations in the input space. 19 / 70

20 Bochner's Theorem
Theorem (Bochner): A complex-valued function k on R^P is the covariance function of a weakly stationary mean square continuous complex-valued random process on R^P if and only if it can be represented as
k(τ) = ∫_{R^P} e^{2πi s^T τ} ψ(ds), (30)
where ψ is a positive finite measure.
If ψ has a density S(s), then S is called the spectral density or power spectrum of k, and k and S are Fourier duals:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (31)
S(s) = ∫ k(τ) e^{−2πi s^T τ} dτ. (32)
20 / 70

21 Idea
k and S are Fourier duals:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (33)
S(s) = ∫ k(τ) e^{−2πi s^T τ} dτ. (34)
If we can approximate S(s) to arbitrary accuracy, then we can approximate any stationary kernel to arbitrary accuracy. We can model S(s) to arbitrary accuracy, since scale-location mixtures of Gaussians can approximate any distribution to arbitrary accuracy. A scale-location mixture of Gaussians can flexibly model many distributions, and thus many covariance kernels, even with a small number of components. 21 / 70

22 Kernels for Pattern Discovery
Let τ = x − x' ∈ R^P. From Bochner's Theorem,
k(τ) = ∫_{R^P} S(s) e^{2πi s^T τ} ds. (35)
For simplicity, assume τ ∈ R^1 and let
S(s) = [N(s; µ, σ^2) + N(−s; µ, σ^2)] / 2. (36)
Then
k(τ) = exp{−2π^2 τ^2 σ^2} cos(2πτµ). (37)
More generally, if S(s) is a symmetrized mixture of diagonal covariance Gaussians on R^P, with the q-th component having covariance matrix M_q = diag(v_q^(1), ..., v_q^(P)), then
k(τ) = Σ_{q=1}^Q w_q cos(2π τ^T µ_q) Π_{p=1}^P exp{−2π^2 τ_p^2 v_q^(p)}. (38)
22 / 70
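
A sketch of the one-dimensional spectral mixture kernel, Eq. (38) with P = 1 (equivalently Eq. (37) summed over Q weighted components); this is illustrative code, not the authors' released implementation.

```python
import numpy as np

def sm_kernel(tau, w, mu, v):
    # tau: array of input distances x - x'; w, mu, v: length-Q arrays of
    # component weights, spectral means, and spectral variances.
    tau = np.asarray(tau)[..., None]                 # broadcast over the Q components
    comps = w * np.exp(-2 * np.pi**2 * tau**2 * v) * np.cos(2 * np.pi * tau * mu)
    return comps.sum(-1)

# Example: one component for a slowly varying trend (mu ~ 0) and one for a
# periodic pattern with frequency 0.5.
tau = np.linspace(-10, 10, 401)
k = sm_kernel(tau, w=np.array([1.0, 0.5]), mu=np.array([0.0, 0.5]), v=np.array([0.01, 0.001]))
```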

23 GP Model for Pattern Extrapolation
Observations y(x) ~ N(y(x); f(x), σ^2) (can easily be relaxed).
f(x) ~ GP(0, k_SM(x, x' | θ)) (f(x) is a GP with SM kernel).
k_SM(x, x' | θ) can approximate many different kernels with different settings of its hyperparameters θ.
Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS):
log p(y | θ, X) = −(1/2) y^T (K_θ + σ^2 I)^{-1} y [model fit] − (1/2) log |K_θ + σ^2 I| [complexity penalty] − (N/2) log(2π). (39)
Once the hyperparameters are trained as ˆθ, we make predictions using p(f_* | y, X, ˆθ), which can be expressed in closed form. 23 / 70
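
A minimal sketch of this training loop, reusing the sm_kernel and log_marginal_likelihood helpers sketched above and scipy's L-BFGS-B optimiser with numerical gradients (a toy setup; the names, initial values, and synthetic data are illustrative, not the configuration used for the experiments in these slides).

```python
import numpy as np
from scipy.optimize import minimize

X = np.linspace(0, 10, 100)
y = np.sin(2 * np.pi * 0.5 * X) + 0.1 * np.random.randn(100)

def neg_lml(log_params, Q=2):
    # Unpack log-transformed hyperparameters: weights, means, variances, noise.
    w, mu = np.exp(log_params[:Q]), np.exp(log_params[Q:2*Q])
    v, noise_var = np.exp(log_params[2*Q:3*Q]), np.exp(log_params[-1]) + 1e-6
    K = sm_kernel(X[:, None] - X[None, :], w, mu, v)
    try:
        return -log_marginal_likelihood(K, y, noise_var)
    except np.linalg.LinAlgError:
        return 1e10                                  # reject numerically unstable settings

theta0 = np.log(np.array([1.0, 1.0, 0.1, 0.6, 0.01, 0.01, 0.05]))
result = minimize(neg_lml, theta0, method='L-BFGS-B')
print(np.exp(result.x))                              # learned w, mu, v, and noise variance
```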

24 Results, CO2 — [Figure: (c) CO2 concentration (ppm) vs. year, with training and test data, a credible region (CR), and extrapolations using the MA, RQ, PER, SE, and SM kernels; (d) log spectral density vs. frequency (1/month) for the SM and SE kernels and the empirical spectrum.] 24 / 70

25 Results, Reconstructing Standard Covariances — [Figure: (a), (b) reconstructed correlation functions vs. τ.] 25 / 70

26 Results, Negative Covariances — [Figure: (e) observations vs. input x; (f) true and SM-learned covariance vs. τ; (g) log spectral density vs. frequency for the SM kernel and the empirical spectrum.] 26 / 70

27 Results, Sinc Pattern — [Figure: (h), (i) observations vs. input x, with training and test data and MA, RQ, SE, PER, and SM extrapolations; (j) correlation functions of the MA and SM kernels vs. τ; (k) log spectral density vs. frequency for the SM and SE kernels and the empirical spectrum.] 27 / 70

28 Results, Airline Passengers — [Figure: (l) airline passengers (thousands) vs. year, with training and test data and PER, SE, RQ, MA, SM, and SSGP extrapolations; (m) log spectral density vs. frequency (1/month) for the SM and SE kernels and the empirical spectrum.] 28 / 70

29 Results Table: We compare the test performance of the proposed spectral mixture (SM) kernel with squared exponential (SE), Matérn (MA), rational quadratic (RQ), and periodic (PE) kernels, reporting mean squared error (MSE) and log likelihood (L) on the CO2, negative covariance, sinc, and airline datasets. The SM kernel consistently has the lowest MSE and highest L. 29 / 70

30 Scaling up the Spectral Mixture Kernel The flexibility of the spectral mixture kernel will be most useful on large datasets... Scaling an expressive kernel learning approach poses different challenges than scaling a standard Gaussian process model. One faces additional computational constraints, and the need to retain significant model structure for expressing the rich information available in a large dataset. So far we have considered just univariate (time series) problems. But the spectral mixture model is general purpose. How do we scale to multiple input dimensions, and what sort of multidimensional pattern extrapolation problems can we imagine? 30 / 70

31 Scaling a Gaussian process: inducing inputs
Gaussian process f and f_* evaluated at N training points and J testing points. M ≪ N inducing points u, with p(u) = N(0, K_{u,u}).
p(f_*, f) = ∫ p(f_*, f, u) du = ∫ p(f_*, f | u) p(u) du
Assume that f and f_* are conditionally independent given u:
p(f_*, f) ≈ q(f_*, f) = ∫ q(f_* | u) q(f | u) p(u) du (40)
Under this assumption,
p(f | u) = N(K_{f,u} K_{u,u}^{-1} u, K_{f,f} − Q_{f,f}), (41)
p(f_* | u) = N(K_{f_*,u} K_{u,u}^{-1} u, K_{f_*,f_*} − Q_{f_*,f_*}), (42)
Q_{a,b} = K_{a,u} K_{u,u}^{-1} K_{u,b}. (43)
Cost for predictions reduced from O(N^3) to O(M^2 N), where M ≪ N. Different inducing approaches correspond to different additional assumptions about q(f | u) and q(f_* | u). For further reading, see Quinonero-Candela and Rasmussen (2005). 31 / 70
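
As an illustration of this family of approximations, a sketch of the subset-of-regressors (SoR) predictive mean with an assumed SE kernel, costing O(M^2 N) rather than O(N^3); all names here are illustrative.

```python
import numpy as np

def se(A, B, l=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / l**2)

def sor_predict_mean(X, y, Xstar, Z, noise_var=0.1):
    # Z: M inducing inputs (M << N).  mu_* = sigma^{-2} K_{*u} Sigma K_{uf} y,
    # with Sigma = (K_{uu} + sigma^{-2} K_{uf} K_{fu})^{-1}.
    Kuu = se(Z, Z) + 1e-8 * np.eye(len(Z))      # jitter for numerical stability
    Kuf = se(Z, X)
    Ksu = se(Xstar, Z)
    Sigma = np.linalg.inv(Kuu + Kuf @ Kuf.T / noise_var)
    return Ksu @ Sigma @ Kuf @ y / noise_var

X = np.random.uniform(0, 10, (500, 1))
y = np.sin(X).ravel() + 0.1 * np.random.randn(500)
Z = np.linspace(0, 10, 20)[:, None]             # M = 20 inducing points
mu_star = sor_predict_mean(X, y, np.linspace(0, 10, 100)[:, None], Z)
```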

32 Kronecker methods Suppose k decomposes as a product of kernels across each input dimension x ∈ R^P: k(x_i, x_j) = Π_{p=1}^P k^p(x_i^p, x_j^p) (e.g., the RBF kernel has this property). Suppose also that the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P. Then K decomposes into a Kronecker product of matrices over each input dimension: K = K_1 ⊗ ... ⊗ K_P. The eigendecomposition of K into Q V Q^T also decomposes: Q = Q_1 ⊗ ... ⊗ Q_P, V = V_1 ⊗ ... ⊗ V_P. Assuming equal cardinality for each input dimension, we can thus eigendecompose an N × N matrix K in O(P N^{3/P}) operations instead of O(N^3) operations. 32 / 70

33 Kronecker methods Suppose k decomposes as a product of kernels across each input dimension x ∈ R^P: k(x_i, x_j) = Π_{p=1}^P k^p(x_i^p, x_j^p) (e.g., the RBF kernel has this property). Suppose also that the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P. Then K decomposes into a Kronecker product of matrices over each input dimension: K = K_1 ⊗ ... ⊗ K_P. The eigendecomposition of K into Q V Q^T also decomposes: Q = Q_1 ⊗ ... ⊗ Q_P, V = V_1 ⊗ ... ⊗ V_P. Assuming equal cardinality for each input dimension, we can thus eigendecompose an N × N matrix K in O(P N^{3/P}) operations instead of O(N^3) operations.
Then inference and learning are highly efficient:
(K + σ^2 I)^{-1} y = (Q V Q^T + σ^2 I)^{-1} y = Q (V + σ^2 I)^{-1} Q^T y, (44)
log |K + σ^2 I| = log |Q V Q^T + σ^2 I| = Σ_{i=1}^N log(λ_i + σ^2), (45)
where λ_i are the eigenvalues of K. Saatchi (2011) 33 / 70
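
A sketch of Eqs. (44)-(45) on a 2D grid: eigendecompose each small K_p separately, then apply Q and Q^T through Kronecker matrix-vector products without ever forming the N × N matrix (illustrative code).

```python
import numpy as np

def se(a, b, l=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

x1, x2 = np.linspace(0, 5, 30), np.linspace(0, 5, 40)      # grid: N = 30 * 40 points
K1, K2 = se(x1, x1), se(x2, x2)
y = np.random.randn(len(x1) * len(x2))                      # stand-in observations
noise_var = 0.1

lam1, Q1 = np.linalg.eigh(K1)                               # per-dimension eigendecompositions
lam2, Q2 = np.linalg.eigh(K2)
lam = np.kron(lam1, lam2)                                   # eigenvalues of K = K1 kron K2

def kron_mvp(A, B, v):
    # (A kron B) v without forming the Kronecker product (row-major ordering).
    V = v.reshape(A.shape[1], B.shape[1])
    return (A @ V @ B.T).reshape(-1)

Qt_y = kron_mvp(Q1.T, Q2.T, y)                              # Q^T y
alpha = kron_mvp(Q1, Q2, Qt_y / (lam + noise_var))          # Eq. (44): (K + sigma^2 I)^{-1} y
log_det = np.sum(np.log(lam + noise_var))                   # Eq. (45): log|K + sigma^2 I|
```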

34 Kronecker Methods We assumed that the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P. How might we relax this assumption, to use Kronecker methods if there are gaps (missing data) in our multidimensional grid? 34 / 70

35 Kronecker Methods Assume imaginary points that complete the grid. Place infinite noise on these points so they have no effect on inference. The relevant matrices are no longer Kronecker, but we can get around this using preconditioned conjugate gradients, an iterative linear solver. 35 / 70

36 Kronecker Methods with Missing Data
Assuming we have a dataset of M observations which are not necessarily on a grid, we propose to form a complete grid using W imaginary observations, y_W ~ N(f_W, ε^{-1} I_W), ε → 0.
The total observation vector y = [y_M, y_W]^T has N = M + W entries: y ~ N(f, D_N), where the noise covariance matrix D_N = diag(D_M, ε^{-1} I_W), with D_M = σ^2 I_M.
The imaginary observations y_W have no corrupting effect on inference: the moments of the resulting predictive distribution are exactly the same as for the standard predictive distribution, namely lim_{ε→0} (K_N + D_N)^{-1} y = (K_M + D_M)^{-1} y_M. 36 / 70

37 Kronecker Methods with Missing Inputs
We use preconditioned conjugate gradients to compute (K_N + D_N)^{-1} y, with the preconditioning matrix C = D_N^{-1/2}, solving C^T (K_N + D_N) C z = C^T y. The preconditioning matrix C speeds up convergence by ignoring the imaginary observations y_W.
For the log determinant (complexity) term in the marginal likelihood, used in hyperparameter learning,
log |K_M + D_M| = Σ_{i=1}^M log(λ_i^M + σ^2) ≈ Σ_{i=1}^M log(λ̃_i^M + σ^2), (46)
where λ̃_i^M = (M/N) λ_i^N for i = 1, ..., M, and λ_i^N are the eigenvalues of K_N. 37 / 70
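
A small sketch of this construction (illustrative, in 1D, with the kernel matrix formed densely so the limiting behaviour is easy to verify): complete the grid with imaginary observations carrying noise 1/ε, and solve (K_N + D_N)^{-1} y by conjugate gradients preconditioned with D_N^{-1}, which is equivalent to the C = D_N^{-1/2} transformation above.

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def se(a, b, l=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

x = np.linspace(0, 10, 200)                    # complete grid of N points
observed = np.random.rand(200) > 0.4           # roughly 60% of the grid is observed
sigma2, eps = 0.1, 1e-8

K_N = se(x, x)
d = np.where(observed, sigma2, 1.0 / eps)      # diagonal of D_N
y = np.where(observed, np.sin(x) + 0.1 * np.random.randn(200), 0.0)

A = LinearOperator((200, 200), matvec=lambda v: K_N @ v + d * v)   # K_N + D_N
precond = LinearOperator((200, 200), matvec=lambda v: v / d)       # D_N^{-1}
alpha, info = cg(A, y, M=precond)

# The imaginary observations have (essentially) no effect: on the observed
# entries, alpha matches the solve that uses only the M real observations.
K_M = K_N[np.ix_(observed, observed)]
alpha_M = np.linalg.solve(K_M + sigma2 * np.eye(observed.sum()), y[observed])
print(np.abs(alpha[observed] - alpha_M).max())  # small (limited by eps and the CG tolerance)
```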

38 Spectral Mixture Product Kernel The spectral mixture kernel, in its standard form, does not quite have Kronecker structure. We therefore introduce a spectral mixture product kernel, which takes a product across input dimensions of one-dimensional spectral mixture kernels:
k_SMP(τ | θ) = Π_{p=1}^P k_SM(τ_p | θ_p). (47)
38 / 70
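
A sketch of Eq. (47), reusing the one-dimensional sm_kernel sketched earlier; each input dimension p gets its own hyperparameters (w_p, µ_p, v_p).

```python
import numpy as np

def smp_kernel(tau, params):
    # tau: (..., P) array of componentwise distances x - x';
    # params: list of P (w, mu, v) hyperparameter triples, one per dimension.
    k = 1.0
    for p, (w, mu, v) in enumerate(params):
        k = k * sm_kernel(tau[..., p], w, mu, v)    # product over input dimensions
    return k

# Example: a 2D SMP kernel with Q = 2 components per dimension.
params = [(np.ones(2), np.array([0.0, 0.2]), np.array([0.01, 0.005]))] * 2
g = np.linspace(-5, 5, 50)
tau = np.stack(np.meshgrid(g, g, indexing='ij'), axis=-1)   # grid of pairwise distances
K = smp_kernel(tau, params)
```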

39 GPatt
Observations y(x) ~ N(y(x); f(x), σ^2) (can easily be relaxed).
f(x) ~ GP(0, k_SMP(x, x' | θ)) (f(x) is a GP with SMP kernel).
k_SMP(x, x' | θ) can approximate many different kernels with different settings of its hyperparameters θ.
Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS):
log p(y | θ, X) = −(1/2) y^T (K_θ + σ^2 I)^{-1} y [model fit] − (1/2) log |K_θ + σ^2 I| [complexity penalty] − (N/2) log(2π). (48)
Once the hyperparameters are trained as ˆθ, we make predictions using p(f_* | y, X, ˆθ), which can be expressed in closed form.
Exploit Kronecker structure for fast exact inference and learning (and extend Kronecker methods to allow for non-grid data). Exact inference and learning requires O(PN^((P+1)/P)) operations and O(PN^(2/P)) storage, compared to O(N^3) operations and O(N^2) storage, for N datapoints and P input dimensions. 39 / 70

40 Results (a) Train (b) Test (c) Full (d) GPatt (e) SSGP (f) FITC (g) GP-SE (h) GP-MA (i) GP-RQ 40 / 70

41 Results: Extrapolation and Interpolation with Shadows (a) Train (b) GPatt (c) GP-MA (d) Train (e) GPatt (f) GP-MA 41 / 70

42 Automatic Model Selection via Marginal Likelihood — [Figure: spectral mixture components under a simple initialisation.] The marginal likelihood shrinks the weights of extraneous components to zero through the log |K_θ + σ^2 I| complexity penalty. 42 / 70

43 Results (a) Train (b) Test (c) Full (d) GPatt (e) SSGP (f) FITC (g) GP-SE (h) GP-MA (i) GP-RQ (j) GPatt Initialisation (k) Train (l) GPatt (m) GP-MA (n) Train (o) GPatt (p) GP-MA 43 / 70

44 More Patterns (a) Rubber mat (b) Tread plate (c) Pores (d) Wood (e) Chain mail (f) Cone 44 / 70

45 More Patterns Table: SMSE and MSLL for GPatt, SSGP, and GPs with SE, MA, and RQ kernels on the texture patterns: rubber mat (train = 12675, test = 4225), tread plate (train = 12675, test = 4225), pores (train = 12675, test = 4225), wood (train = 14259, test = 4941), and chain mail (train = 14101, test = 4779). 45 / 70

46 Speed and Accuracy Stress Tests (a) Runtime Stress Test (b) Accuracy Stress Test 46 / 70

47 Image Inpainting 47 / 70

48 Recovering Sophisticated Out of Class Kernels — [Figure: true and recovered kernels k1, k2, k3 as functions of τ.] 48 / 70

49 Video Extrapolation GPatt makes almost no assumptions about the correlation structures across input dimensions: it can automatically discover both temporal and spatial correlations! Top row: True frames taken from the middle of a movie. Bottom row: Predicted sequence of frames (all are forecast together). 112,500 datapoints. GPatt training time is under 5 minutes. 49 / 70

50 Land Surface Temperature Forecasting Train using 9 years of temperature data. The first two rows are the last 12 months of training data; the last two rows are a 12 month ahead forecast. 300,000 data points, with 40% missing data (from ocean). Predictions using GP-SE (a GP with an SE or RBF kernel) and Kronecker inference. 50 / 70

51 Land Surface Temperature Forecasting Train using 9 years of temperature data. The first two rows are the last 12 months of training data; the last two rows are a 12 month ahead forecast. 300,000 data points, with 40% missing data (from ocean). Predictions using GPatt. Training time < 30 minutes. 51 / 70

52 Learned Kernels for Land Surface Temperatures — [Figure: (a) learned GPatt kernel and (b) learned GP-SE kernel for temperatures, shown as correlations over X [km], Y [km], and time [months].] The learned GPatt kernel tells us interesting properties of the data. In this case, the learned kernels are heavy tailed and quasi-periodic. 52 / 70

53 Summary of Findings We have separately understood the effects of many different kernels, inference and learning algorithms, and parametric vs non-parametric representations. We found truly nonparametric representations, scalable inference exploiting structure, and highly expressive kernels, when used in combination, distinctly enable pattern extrapolation on large multidimensional problems. Applications such as image inpainting were previously inaccessible to Gaussian processes and kernel machines. Moreover, conventional inpainting algorithms are often model free and highly specialised. On the other hand, the proposed kernel approaches are quite general, and can learn, for example, spatial and temporal correlations. Examples of this generality are demonstrated on video extrapolation and land surface temperature forecasting problems. We can recover sophisticated out of class kernels. We can interpret the learned kernels to understand the fundamental properties of our data, and make new scientific discoveries. 53 / 70

54 Distribution over Kernels A Gaussian process can be viewed as a distribution over functions. We can sample prior functions, and posterior functions. Why not create an analogous process over kernels, where we have a prior over kernels (with large support), concentrated on something reasonable, such as stationarity? This process would contain many types of kernels, and represent uncertainty over the kernels. 54 / 70

55 Distribution over Spectral Mixture Kernels Induce a process prior over stationary kernels by placing a (vague) prior distribution over the parameters θ = {a_q, σ_q, µ_q}_{q=1}^Q of the spectral mixture kernel. We can then sample from the prior over kernels, condition on these kernels, and sample from the corresponding Gaussian process. [Figure: (a) sample kernels (covariance vs. τ); (b) sample functions y(x) vs. input x.] 55 / 70

56 Non-Stationary Extensions We can say almost nothing about the kernel of a stochastic process, from a single realisation, if we make no assumptions. Stationarity provides a powerful inductive bias, but sometimes we still want to allow for non-stationarity. It would be promising, therefore, to concentrate our process over kernels around stationary kernels, but still have support for non-stationarity. Idea: Have the weights in the spectral mixture become random functions of the input. Then at different points in the input space, the kernel is described by a different spectral mixture. If we concentrate the support of these functions around constant functions, then this process is concentrated around stationarity, but still has support for a massive range of non-stationary kernels! 56 / 70

57 Non-Stationary Extensions — [Figure: (a) sample kernels (covariance vs. τ); (b) sample functions y(x); (c) samples w(x) from the amplitude modulator; (d) samples from the non-stationary process.] 57 / 70

58 Non-Stationary Extensions (a) Black (b) Blue (c) Red (d) Magenta (e) Cov SE 58 / 70

59 The Spectral Mixture Kernel
k_GSM(τ) = Σ_{i=1}^Q a_i exp{−2π^2 τ^2 σ_i^2} cos(2πτµ_i) (49)
Can approximate a wide range of kernels, even with a small number of components Q.
Described in:
Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. Wilson, PhD Thesis, January 2014.
Fast kernel learning for multidimensional pattern extrapolation. Wilson, Gilboa, Nehorai, Cunningham, NIPS 2014.
A process over all stationary kernels. Wilson, 2012.
Gaussian process kernels for pattern discovery and extrapolation. Wilson and Adams, ICML 2013.
59 / 70

60 Themes and Conclusions (High Level) Machine learning aims to develop algorithms which can automatically discover rich statistical representations of data and make impressive generalisations. Flexibility and scalability are two sides of one coin: larger datasets provide relatively more information for learning rich statistical representations. We therefore wish to simultaneously increase the flexibility and scalability of machine learning methods. Bayesian nonparametric methods are natural for big datasets, since they automatically scale their information capacity with the amount of available data. 60 / 70

61 Themes and Conclusions (Kernels) The generalisation properties of a kernel method are entirely controlled by a kernel function, which represents an inner product of arbitrarily many basis functions. In other words, the support and inductive biases of a kernel method (which enable learning) are determined by the kernel. The main advantage of a Gaussian process, over other kernel machines, is access to a marginal likelihood, which provides a powerful probabilistic framework for kernel learning. This advantage is currently underappreciated, because we typically use very simple kernels with only a few hyperparameters. In order to enable powerful representation learning, we can introduce highly expressive kernels, and learn the properties of these kernels through marginal likelihood optimisation. The best way to scale up expressive kernel learning approaches is by exploiting the existing structure in the kernel (e.g., Kronecker methods). Nonparametric kernel methods correspond to infinite basis function expansions; these methods are well suited to performing ambitious generalisations from large datasets. 61 / 70

62 Outline (Specific), Part 1 Introduce expressive spectral mixture kernels by modelling a spectral density (the Fourier transform of a kernel) with scale-location Gaussian mixtures. Adapt these kernels for Kronecker structure. Exploit this Kronecker structure for exact inference and learning with Gaussian processes, which costs O(PN^((P+1)/P)) computations and O(PN^(2/P)) storage, for N datapoints and P input dimensions, compared to the standard O(N^3) computations and O(N^2) storage associated with GPs. Extend Kronecker methods to account for non-grid data, whilst preserving exact inference. 62 / 70

63 Outline (Specific), Part 2 Show that i) truly nonparametric representations, ii) expressive kernels, and iii) structure exploiting inference, when used in combination, distinctly enable large scale pattern extrapolation and representation learning, including: long range spatiotemporal forecasting, image inpainting, video extrapolation, and kernel discovery. This is the first time, as far as we are aware, that highly expressive non-parametric kernels with in some cases hundreds of hyperparameters, on datasets exceeding N = 10^5 training instances, can be learned from the marginal likelihood of a GP, in only minutes. Such experiments show that one can, to some extent, solve kernel selection, and automatically extract useful features from the data, on large datasets, using a special combination of expressive kernels and scalable inference. 63 / 70

64 Worked Example: Combining Kernels, CO2 Data — [Figure: CO2 concentration (ppm) vs. year.] Example from Rasmussen and Williams (2006), Gaussian Processes for Machine Learning. 64 / 70

65 Worked Example: Combining Kernels, CO 2 Data 65 / 70

66 Worked Example: Combining Kernels, CO2 Data
Long rising trend: k_1(x_p, x_q) = θ_1^2 exp(−(x_p − x_q)^2 / (2θ_2^2))
Quasi-periodic seasonal changes: k_2(x_p, x_q) = k_RBF(x_p, x_q) k_PER(x_p, x_q) = θ_3^2 exp(−(x_p − x_q)^2 / (2θ_4^2) − 2 sin^2(π(x_p − x_q)) / θ_5^2)
Multi-scale medium term irregularities: k_3(x_p, x_q) = θ_6^2 (1 + (x_p − x_q)^2 / (2θ_8 θ_7^2))^{−θ_8}
Correlated and i.i.d. noise: k_4(x_p, x_q) = θ_9^2 exp(−(x_p − x_q)^2 / (2θ_10^2)) + θ_11^2 δ_pq
k_total(x_p, x_q) = k_1(x_p, x_q) + k_2(x_p, x_q) + k_3(x_p, x_q) + k_4(x_p, x_q)
66 / 70
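
A sketch of this composite kernel as code; th is the hyperparameter vector θ_1, ..., θ_11, and the numerical values below are placeholders, not the fitted values from Rasmussen and Williams.

```python
import numpy as np

def k_total(xp, xq, th):
    r = xp - xq
    k1 = th[1]**2 * np.exp(-r**2 / (2 * th[2]**2))                     # long rising trend
    k2 = th[3]**2 * np.exp(-r**2 / (2 * th[4]**2)
                           - 2 * np.sin(np.pi * r)**2 / th[5]**2)      # quasi-periodic seasonal changes
    k3 = th[6]**2 * (1 + r**2 / (2 * th[8] * th[7]**2)) ** (-th[8])    # medium-term irregularities
    k4 = (th[9]**2 * np.exp(-r**2 / (2 * th[10]**2))
          + th[11]**2 * (xp == xq))                                    # correlated and i.i.d. noise
    return k1 + k2 + k3 + k4

th = np.array([np.nan, 60., 90., 2., 100., 1.3, 0.7, 1.2, 0.8, 0.2, 1.6, 0.2])  # th[0] unused
print(k_total(1995.5, 1995.5, th), k_total(1995.5, 2000.0, th))
```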

67 Worked Example: Combining Kernels, CO2 Data We hand-crafted a kernel combination to perform extrapolation. Confidence in the extrapolation is high (suggesting the model is well specified). We can interpret the learned kernel hyperparameters θ to learn information about our dataset. A lot of the interesting pattern recognition has been done by a human in this example. We would like to completely automate this modelling procedure. 67 / 70

68 Function Learning Example — [Figure: airline passengers (thousands) vs. year, showing training data and a candidate extrapolation labelled "Human?".] 68 / 70

69 Function Learning Example — [Figure: airline passengers (thousands) vs. year, showing training data and two candidate extrapolations labelled "Human?" and "Alien?".] 69 / 70

70 Function Learning Example — [Figure: airline passengers (thousands) vs. year, showing training data, test data, and the candidate extrapolations labelled "Alien?" and "Human?".] 70 / 70
