Kernel Methods for Large Scale Representation Learning
|
|
- Allison Strickland
- 6 years ago
- Views:
Transcription
1 Kernel Methods for Large Scale Representation Learning Andrew Gordon Wilson Carnegie Mellon University November, 2014 University of Oxford 1 / 70
2 Themes and Conclusions (High Level) Machine learning aims to develop algorithms which can automatically discover rich statistical representations of data and make impressive generalisations. Flexibility and scalability are two sides of one coin: larger datasets provide relatively more information for learning rich statistical representations. We therefore wish to simultaneously increase the flexibility and scalability of machine learning methods. Bayesian nonparametric methods are natural for big datasets, since they automatically scale their information capacity with the amount of available data. 2 / 70
3 Themes and Conclusions (Kernels) The generalisation properties of a kernel method are entirely controlled by a kernel function, which represents an inner product of arbitrarily many basis functions. In other words, the support and inductive biases of a kernel method (which enables learning) are determined by the kernel. The main advantage of a Gaussian process, over other kernel machines, is access to a marginal likelihood, which provides a powerful probabilistic framework for kernel learning. This advantage is currently underappreciated, because we typically use very simple kernels with only a few hyperparameters. In order to enable powerful representation learning, we can introduce highly expressive kernels, and learn the properties of these kernels through marginal likelihood optimisation. The best way to scale up expressive kernel learning approaches is by exploiting the existing structure in the kernel (e.g., Kronecker methods). Nonparametric kernel methods corresponds to infinite basis function expansions; these methods are well suited to performing ambitious generalisations from large datasets. 3 / 70
4 Outline (Specific), Part 1 Introduce expressive spectral mixture kernels by modelling a spectral density (the Fourier transform of a kernel) with scale location Gaussian mixtures. Adapt these kernels for Kronecker structure, exploit this structure for exact inference and learning, which costs O(PN P+1 2 P ) computations and O(PN P ) storage, for N datapoints and P input dimensions, compared to the standard O(N 3 ) computations and O(N 2 ) storage associated with GPs. Extend Kronecker methods to account for non-grid data, whilst preserving exact inference. 4 / 70
5 Outline (Specific), Part 2 Show that i) truly nonparametric representations, ii) expressive kernels, and iii) structure exploiting inference, when used in combination, distinctly enables large scale pattern extrapolation, and representation learning including: long range spatiotemporal forecasting, image inpainting, video extrapolation, and kernel discovery. This is the first time, as far as we are aware, that highly expressive non-parametric kernels with in some cases hundreds of hyperparameters, on datasets exceeding N = 10 5 training instances, can be learned from the marginal likelihood of a GP, in only minutes. Such experiments show that one can, to some extent, solve kernel selection, and automatically extract useful features from the data, on large datasets, using a special combination of expressive kernels and scalable inference. 5 / 70
6 Gaussian processes Definition A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution. Nonparametric Regression Model Prior: f (x) GP(m(x), k(x, x )), meaning (f (x 1),..., f (x N)) N (µ, K), with µ i = m(x i) and K ij = cov(f (x i), f (x j)) = k(x i, x j). GP posterior {}}{ p(f (x) D) Likelihood GP prior {}}{{}}{ p(d f (x)) p(f (x)) output, f(t) Gaussian process sample prior functions input, t a) output, f(t) Gaussian process sample posterior functions input, t b) 6 / 70
7 Gaussian Process Inference Observed noisy data y = (y(x 1),..., y(x N)) T at input locations X. Start with the standard regression assumption: N (y(x); f (x), σ 2 ). Place a Gaussian process distribution over noise free functions f (x) GP(0, k θ ). The kernel k is parametrized by θ. Infer p(f y, X, X ) for the noise free function f evaluated at test points X. Joint distribution [ y f ] ( [ Kθ (X, X) + σ 2 I K θ (X, X ) N 0, K θ (X, X) K θ (X, X ) ]). (1) Conditional predictive distribution f X, X, y, θ N ( f, cov(f )), (2) f = K θ (X, X)[K θ (X, X) + σ 2 I] 1 y, (3) cov(f ) = K θ (X, X ) K θ (X, X)[K θ (X, X) + σ 2 I] 1 K θ (X, X ). (4) 7 / 70
8 Learning and Model Selection p(m i y) = p(y Mi)p(Mi) (5) p(y) We can write the evidence of the model as p(y M i) = p(y f, M i)p(f)df, (6) Complex Model Simple Model Appropriate Model Data Simple Complex Appropriate p(y M) Output, f(x) y All Possible Datasets (a) Input, x (b) 8 / 70
9 Learning and Model Selection We can integrate away the entire Gaussian process f (x) to obtain the marginal likelihood, as a function of kernel hyperparameters θ alone. p(y θ, X) = p(y f, X)p(f θ, X)df. (7) model fit complexity penalty {}}{{}}{ log p(y θ, X) = 1 2 yt (K θ + σ 2 I) 1 1 y 2 log K θ + σ 2 I N log(2π). (8) 2 An extremely powerful mechanism for kernel learning. 4 Samples from GP Prior 4 Samples from GP Posterior Output, f(x) Output, f(x) Input, x Input, x (a) (b) 9 / 70
10 Learning and Model Selection A fully Bayesian treatment would integrate away kernel hyperparameters θ. p(f X, X, y) = p(f X, X, y, θ)p(θ y)dθ (9) For example, we could specify a prior p(θ), use MCMC to take J samples from p(θ y) p(y θ)p(θ), and then find p(f X, X, y) 1 J J p(f X, X, y, θ (i) ), θ (i) p(θ y). (10) i=1 If we have a non-gaussian noise model, and thus cannot integrate away f, the strong dependencies between Gaussian process f and hyperparameters θ make sampling extremely difficult. In my experience, the most effective solution is to use a deterministic approximation for the posterior p(f y) which enables one to work with an approximate marginal likelihood. 10 / 70
11 Linear Basis Function Models Model Specification f (x, w) = w T φ(x) (11) p(w) = N (0, Σ w) (12) Moments of Induced Distribution over Functions E[f (x, w)] = m(x) = E[w T ]φ(x) = 0 (13) cov(f (x i), f (x j)) = k(x i, x j) = E[f (x i)f (x j)] E[f (x i)]e[f (x j)] (14) = φ(x i) T E[ww T ]φ(x j) 0 (15) = φ(x i) T Σ wφ(x j) (16) f (x, w) is a Gaussian process, f (x) N (m, k) with mean function m(x) = 0 and covariance kernel k(x a, x b) = φ(x i) T Σ wφ(x j). The entire basis function model of Eqs. (11) and (12) is encapsulated as a distribution over functions with kernel k(x, x ). 11 / 70
12 Deriving the RBF Kernel Start with the generalised linear model f (x) = J w iφ i(x), (17) i=1 ) w i N (0, σ2, (18) J ) (x ci)2 φ i(x) = exp (. (19) 2l 2 Equations (17)-(19) define a radial basis function regression model, with radial basis functions centred at the points c i. Using our result for the kernel of a generalised linear model, k(x, x ) = σ2 J J φ i(x)φ i(x ). (20) i=1 12 / 70
13 Deriving the SE (aka RBF, Gaussian) Kernel f (x) = J w iφ i(x), i=1 w i N k(x, x ) = σ2 J ) ) (0, σ2 (x ci)2, φ i(x) = exp ( J 2l 2 (21) J φ i(x)φ i(x ) (22) i=1 Letting c i+1 c i = c = 1, and J, the kernel in Eq. (22) becomes a J Riemann sum: k(x, x σ 2 J c ) = lim φ J i(x)φ i(x ) = φ c(x)φ c(x )dc (23) J c 0 i=1 By setting c 0 = and c =, we spread the infinitely many basis functions across the whole real line, each a distance c 0 apart: k(x, x ) = exp( x c 2l ) exp( x c )dc (24) 2 2l 2 = πlσ 2 exp( (x x ) 2 2( 2l) 2 ). (25) 13 / 70
14 Gaussian Process Covariance Kernels Let τ = x x : k SE(τ) = exp( 0.5τ 2 /l 2 ) (26) 3τ 3τ k MA(τ) = a(1 + ) exp( ) (27) l l k RQ(τ) = (1 + τ 2 2 α l 2 ) α (28) k PE(τ) = exp( 2 sin 2 (π τ ω)/l 2 ) (29) 14 / 70
15 CO 2 Extrapolation with Standard Kernels CO 2 Concentration (ppm) Year 15 / 70
16 Gaussian processes How can Gaussian processes possibly replace neural networks? Did we throw the baby out with the bathwater? David MacKay, / 70
17 More Expressive Covariance Functions ˆf 1 (x) W 11(x) y 1 (x). W1q(x). x ˆf i (x) y j (x).. Wp1(x) ˆf q (x) W pq(x) y p (x) Gaussian Process Regression Networks. Wilson et. al, ICML / 70
18 Gaussian Process Regression Network latitude longitude / 70
19 Expressive Covariance Functions GPs in Bayesian neural network like architectures. (Salakhutdinov and Hinton, 2008; Wilson et. al, 2012; Damianou and Lawrence, 2012). Task specific, difficult inference, no closed form kernels. Compositions of kernels. (Archambeau and Bach, 2011; Durrande et. al, 2011; Rasmussen and Williams, 2006). In the general case, difficult to interpret, difficult inference, struggle with over-fitting. Can learn almost nothing about the covariance function of a stochastic process from a single realization, if we assume that the covariance function could be any positive definite function. Most commonly one assumes a restriction to stationary kernels, meaning that covariances are invariant to translations in the input space. 19 / 70
20 Bochner s Theorem Theorem (Bochner) A complex-valued function k on R P is the covariance function of a weakly stationary mean square continuous complex-valued random process on R P if and only if it can be represented as k(τ) = where ψ is a positive finite measure. R P e 2πis T τ ψ(ds), (30) If ψ has a density S(s), then S is called the spectral density or power spectrum of k, and k and S are Fourier duals: k(τ) = S(s)e 2πisTτ ds, (31) S(s) = k(τ)e 2πisTτ dτ. (32) 20 / 70
21 Idea k and S are Fourier duals: k(τ) = S(s) = S(s)e 2πisTτ ds, (33) k(τ)e 2πisTτ dτ. (34) If we can approximate S(s) to arbitrary accuracy, then we can approximate any stationary kernel to arbitrary accuracy. We can model S(s) to arbitrary accuracy, since scale-location mixtures of Gaussians can approximate any distribution to arbitrary accuracy. A scale-location mixture of Gaussians can flexibly model many distributions, and thus many covariance kernels, even with a small number of components. 21 / 70
22 Kernels for Pattern Discovery Let τ = x x R P. From Bochner s Theorem, k(τ) = For simplicity, assume τ R 1 and let Then R P S(s)e 2πis T τ ds (35) S(s) = [N (s; µ, σ 2 ) + N ( s; µ, σ 2 )]/2. (36) k(τ) = exp{ 2π 2 τ 2 σ 2 } cos(2πτµ). (37) More generally, if S(s) is a symmetrized mixture of diagonal covariance Gaussians on R p, with covariance matrix M q = diag(v (1) q,..., v q (P) ), then k(τ) = Q P w qcos(2πτ T µ q) exp{ 2π 2 τp 2 v (p) q }. (38) q=1 p=1 22 / 70
23 GP Model for Pattern Extrapolation Observations y(x) N (y(x); f (x), σ 2 ) f (x) GP(0, k SM(x, x θ)) (can easily be relaxed). (f (x) is a GP with SM kernel). k SM(x, x θ) can approximate many different kernels with different settings of its hyperparameters θ. Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS) model fit {}}{ log p(y θ, X) = 1 2 yt (K θ + σ 2 I) 1 y complexity penalty {}}{ 1 2 log K θ + σ 2 I N log(2π). (39) 2 Once hyperparameters are trained as ˆθ, making predictions using p(f y, X, ˆθ), which can be expressed in closed form. 23 / 70
24 Results, CO 2 CO 2 Concentration (ppm) % CR Train Test MA RQ PER SE SM (c) Year Log Spectral Density SM SE Empirical Frequency (1/month) (d) 24 / 70
25 Results, Reconstructing Standard Covariances Correla 0 Correla τ a) τ b) 25 / 70
26 Results, Negative Covariances Truth SM Observation 0 Covariance x (input) (e) τ (f) 20 Log Spectral Density SM Empirical Frequency (g) 26 / 70
27 Results, Sinc Pattern Observation Observation Train Test MA RQ SE PER SM x (input) (h) x (input) (i) 1 MA SM 0 Correlation Log Spectral Density SM SE Empirical τ (j) Frequency (k) 27 / 70
28 Results, Airline Passengers Airline Passengers (Thousa PER Train Test SE RQ MA SM SSGP Year (l) Log Spectral Density SM SE Empirical Frequency (1/month) (m) 28 / 70
29 Results Table: We compare the test performance of the proposed spectral mixture (SM) kernel with squared exponential (SE), Matérn (MA), rational quadratic (RQ), and periodic (PE) kernels. The SM kernel consistently has the lowest mean squared error (MSE) and highest log likelihood (L). CO 2 SM SE MA RQ PE MSE L NEG COV MSE L SINC MSE L AIRLINE MSE L / 70
30 Scaling up the Spectral Mixture Kernel The flexibility of the spectral mixture kernel will be most useful on large datasets... Scaling an expressive kernel learning approach poses different challenges than scaling a standard Gaussian process model. One faces additional computational constraints, and the need to retain significant model structure for expressing the rich information available in a large dataset. So far we have considered just univariate (time series) problems. But the spectral mixture model is general purpose. How do we scale to multiple input dimensions, and what sort of multidimensional pattern extrapolation problems can we imagine? 30 / 70
31 Scaling a Gaussian process: inducing inputs Gaussian process f and f evaluated at N training points and J testing points. M N inducing points u, p(u) = N (0, K u,u) p(f, f) = p(f, f, u)du = p(f, f u)p(u)du Assume that f and f are conditionally independent given u: p(f, f) q(f, f) = q(f u)q(f u)p(u)du (40) Under this assumption, p(f u) = N (K f,u K 1 u,uu, K f,f Q f,f ) (41) p(f u) = N (K f,uk 1 u,uu, K f,f Q f,f ) (42) Q a,b = K a,uk 1 u,uk u,b (43) Cost for predictions reduced from O(N 3 ) to O(M 2 N) where M N. Different inducing approaches correspond to different additional assumptions about q(f u) and q(f u). For further reading, see Quinonero-Candela and Rasmussen (2005) 31 / 70
32 Kronecker methods Suppose If x R P, k decomposes as a product of kernels across each input dimension: k(x i, x j) = P p=1 kp (x p i, xp j ) (e.g., the RBF kernel has this property). Suppose the inputs x X are on a multidimensional grid X = X 1 X P R P. Then K decomposes into a Kronecker product of matrices over each input dimension K = K 1 K P. The eigendecomposition of K into QVQ also decomposes: Q = Q 1 Q P, V = Q 1 Q P. Assuming equal cardinality for each input dimension, we can thus eigendecompose an N N matrix K in O(PN 3/P ) operations instead of O(N 3 ) operations. 32 / 70
33 Kronecker methods Suppose If x R P, k decomposes as a product of kernels across each input dimension: k(x i, x j) = P p=1 kp (x p i, xp j ) (e.g., the RBF kernel has this property). Suppose the inputs x X are on a multidimensional grid X = X 1 X P R P. Then K decomposes into a Kronecker product of matrices over each input dimension K = K 1 K P. The eigendecomposition of K into QVQ also decomposes: Q = Q 1 Q P, V = Q 1 Q P. Assuming equal cardinality for each input dimension, we can thus eigendecompose an N N matrix K in O(PN 3/P ) operations instead of O(N 3 ) operations. Then inference and learning are highly efficient: (K + σ 2 I) 1 y = (QVQ T + σ 2 I) 1 y = Q(V + σ 2 I) 1 Q T y, (44) N log K + σ 2 I = log QVQ T + σ 2 I = log(λ i + σ 2 ), (45) where λ i are the eigenvalues of K. Saatchi (2011) i=1 33 / 70
34 Kronecker Methods We assumed that the inputs x X are on a multidimensional grid X = X 1 X P R P. How might we relax this assumption, to use Kronecker methods if there are gaps (missing data) in our multidimensional grid? 34 / 70
35 Kronecker Methods Assume imaginary points that complete the grid Place infinite noise on these points so they have no effect on inference The relevant matrices are no longer Kronecker, but we can get around this using pre-conditioned conjugate gradients, an iterative linear solver. 35 / 70
36 Kronecker Methods with Missing Data Assuming we have a dataset of M observations which are not necessarily on a grid, we propose to form a complete grid using W imaginary observations, y W N (f W, ɛ 1 I W), ɛ 0. The total observation vector y = [y M, y W ] T has N = M + W entries: y = N (f, D N), where the noise covariance matrix D N = diag(d M, ɛ 1 I W), D M = σ 2 I M. The imaginary observations y W have no corrupting effect on inference: the moments of the resulting predictive distribution are exactly the same as for the standard predictive distribution, namely lim ɛ 0(K N + D N) 1 y = (K M + D M) 1 y M. 36 / 70
37 Kronecker Methods with Missing Inputs We use preconditioned conjugate gradients to compute (K N + D N) 1 y. We use the preconditioning matrix C = D 1/2 N to solve C T (K N + D N) Cz = C T y. The preconditioning matrix C speeds up convergence by ignoring the imaginary observations y W. For the log complexity in the marginal likelihood (used in hyperparameter learning), log K M + D M = M log(λ M i i=1 + σ 2 ) M log( λ M i + σ 2 ), (46) i=1 where λ M i = M N λn i for i = 1,..., M. 37 / 70
38 Spectral Mixture Product Kernel The spectral mixture kernel, in its standard form, does not quite have Kronecker structure. Introduce a spectral mixture product kernel, which takes a product of across input dimensions of one dimensional spectral mixture kernels. k SMP(τ θ) = P k SM(τ p θ p). (47) p=1 38 / 70
39 GPatt Observations y(x) N (y(x); f (x), σ 2 ) f (x) GP(0, k SMP(x, x θ)) (can easily be relaxed). (f (x) is a GP with SMP kernel). k SMP(x, x θ) can approximate many different kernels with different settings of its hyperparameters θ. Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS) model fit {}}{ log p(y θ, X) = 1 2 yt (K θ + σ 2 I) 1 y complexity penalty {}}{ 1 2 log K θ + σ 2 I N log(2π). (48) 2 Once hyperparameters are trained as ˆθ, making predictions using p(f y, X, ˆθ), which can be expressed in closed form. Exploit Kronecker structure for fast exact inference and learning (and extend Kronecker methods to allow for non-grid data). Exact inference and learning requires O(PN P+1 2 P ) operations and O(PN P ) storage, compared to O(N 3 ) operations and O(N 2 ) storage, for N datapoints, and P input dimensions. 39 / 70
40 Results (a) Train (b) Test (c) Full (d) GPatt (e) SSGP (f) FITC (g) GP-SE (h) GP-MA (i) GP-RQ 40 / 70
41 Results: Extrapolation and Interpolation with Shadows (a) Train (b) GPatt (c) GP-MA (d) Train (e) GPatt (f) GP-MA 41 / 70
42 Automatic Model Selection via Marginal Likelihood 0 µ µ 2 Simple initialisation The marginal likelihood shrinks weights of extraneous components to zero through the log K complexity penalty. 42 / 70
43 Results (a) Train (b) Test (c) Full (d) GPatt (e) SSGP (f) FITC 0 µ (g) GP-SE (h) GP-MA (i) GP-RQ (j) GPatt Initialisation (k) Train (l) GPatt (m) GP-MA (n) Train (o) GPatt (p) GP-MA 43 / 70
44 More Patterns (a) Rubber mat (b) Tread plate (c) Pores (d) Wood (e) Chain mail (f) Cone 44 / 70
45 More Patterns GPatt SSGP SE MA RQ Rubber mat (train = 12675, test = 4225) SMSE MSLL Tread plate (train = 12675, test = 4225) SMSE MSLL Pores (train = 12675, test = 4225) SMSE MSLL Wood (train = 14259, test = 4941) SMSE MSLL Chain mail (train = 14101, test = 4779) SMSE MSLL / 70
46 Speed and Accuracy Stress Tests (a) Runtime Stress Test (b) Accuracy Stress Test 46 / 70
47 Image Inpainting 47 / 70
48 Recovering Sophisticated Out of Class Kernels True Recovered k1 0.5 k2 0.5 k τ τ τ 48 / 70
49 Video Extrapolation GPatt makes almost no assumptions about the correlation structures across input dimensions: it can automatically discover both temporal and spatial correlations! Top row: True frames taken from the middle of a movie. Bottom row: Predicted sequence of frames (all are forecast together). 112,500 datapoints. GPatt training time is under 5 minutes. 49 / 70
50 Land Surface Temperature Forecasting Train using 9 years of temperature data. First two rows are the last 12 months of training data, last two rows is a 12 month ahead forecast. 300, 000 data points, with 40% missing data (from ocean). Predictions using GP-SE (GP with an SE or RBF kernel), and Kronecker Inference / 70
51 Land Surface Temperature Forecasting Train using 9 years of temperature data. First two rows are the last 12 months of training data, last two rows is a 12 month ahead forecast. 300, 000 data points, with 40% missing data (from ocean). Predictions using GPatt. Training time < 30 minutes / 70
52 Learned Kernels for Land Surface Temperatures Correlations X [Km] Correlations Y [Km] (a) Learned GPatt Kernel for Temperatures X [Km] Y [Km] Time [mon] 0 5 Time [mon] (b) Learned GP-SE Kernel for Temperatures The learned GPatt kernel tells us interesting properties of the data. In this case, the learned kernels are heavy tailed and quasi-periodic. 52 / 70
53 Summary of Findings We have separately understood the effects of many different kernels, inference and learning algorithms, and parametric vs non-parametric representations. We found truly nonparametric representations, scalable inference exploiting structure, and highly expressive kernels, when used in combination, distinctly enable pattern extrapolation on large multidimensional problems. Applications such as image inpainting were previously inaccessible to Gaussian processes and kernel machines. Moreover, conventional inpainting algorithms are often model free and highly specialised. On the other hand, the proposed kernel approaches are quite general, and can learn, for example, spatial and temporal correlations. Examples of this generality are demonstrated on video extrapolation and land surface temperature forecastng problems. We can recover sophisticated out of class kernels. We can interpret the learned kernels to understand the fundamental properties of our data, and make new scientific discoveries. 53 / 70
54 Distribution over Kernels A Gaussian process can be viewed as a distribution over functions. We can sample prior functions, and posterior functions. Why not create an analogous process over kernels, where we have prior over kernels (with large support), concentrated on something reasonable, such as stationarity? This process would contain many types of kernels, and represent uncertainty over the kernels. 54 / 70
55 Distribution over Spectral Mixture Kernels Induce a process prior over stationary kernels, by placing a (vague) prior distribution over the parameters θ = {a q, σ q, µ q} Q q=1 of the spectral mixture kernel. We can then sample for the prior over kernels, and condition on these kernels and sample from the corresponding Gaussian process. 15 Sample kernels 10 Sample functions 10 5 Covariance 5 y(x) τ (a) Input, x (b) 55 / 70
56 Non-Stationary Extensions If we can say almost nothing about the kernel of a stochastic process, from a single realisation, if we make no assumptions. Stationarity provides a powerful inductive bias, but sometimes we still want to allow for non-stationarity. It would be promising, therefore, to concentrate our process over kernels around stationary kernels, but still have support for non-stationarity. Idea: Have the weights in the spectral mixture become random functions of the input. Therefore at any different point in the input space, the kernel is described by a diffrent spectral mixture. If we concentrate the support of these functions around constant functions, then this process is concentrated around stationarity, but still has support for a massive range of non-stationary kernels! 56 / 70
57 Non-Stationary Extensions 15 Sample kernels 10 Sample functions 10 5 Covariance 5 y(x) τ (a) Input, x (b) w(x) Samples from Amplitude Modulator y(x) Samples from Non Stationary Process Input, x (c) Input, x (d) 57 / 70
58 Non-Stationary Extensions (a) Black (b) Blue (c) Red (d) Magenta (e) Cov SE 58 / 70
59 The Spectral Mixture Kernel k GSM(τ) = Q exp{ 2π 2 τ 2 σi 2 } cos(2πτµ i) (49) i=1 Can approximate a wide range of kernels, even with a small number of components Q. Described in: Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. Wilson, PhD Thesis, January Fast kernel learning for multidimensional pattern extrapolation. Wilson, Gilboa, Nehorai, Cunningham, NIPS A process over all stationary kernels. Wilson, Gaussian process kernels for pattern discovery and extrapolation. Wilson and Adams, ICML, / 70
60 Themes and Conclusions (High Level) Machine learning aims to develop algorithms which can automatically discover rich statistical representations of data and make impressive generalisations. Flexibility and scalability are two sides of one coin: larger datasets provide relatively more information for learning rich statistical representations. We therefore wish to simultaneously increase the flexibility and scalability of machine learning methods. Bayesian nonparametric methods are natural for big datasets, since they automatically scale their information capacity with the amount of available data. 60 / 70
61 Themes and Conclusions (Kernels) The generalisation properties of a kernel method are entirely controlled by a kernel function, which represents an inner product of arbitrarily many basis functions. In other words, the support and inductive biases of a kernel method (which enables learning) are determined by the kernel. The main advantage of a Gaussian process, over other kernel machines, is access to a marginal likelihood, which provides a powerful probabilistic framework for kernel learning. This advantage is currently underappreciated, because we typically use very simple kernels with only a few hyperparameters. In order to enable powerful representation learning, we can introduce highly expressive kernels, and learn the properties of these kernels through marginal likelihood optimisation. The best way to scale up expressive kernel learning approaches is by exploiting the existing structure in the kernel (e.g., Kronecker methods). Nonparametric kernel methods corresponds to infinite basis function expansions; these methods are well suited to performing ambitious generalisations from large datasets. 61 / 70
62 Outline (Specific), Part 1 Introduce expressive spectral mixture kernels by modelling a spectral density (the Fourier transform of a kernel) with scale location Gaussian mixtures. Adapt these kernels for Kronecker structure. Exploit this Kronecker structure for exact inference and learning with Gaussian processes, which costs O(PN P+1 2 P ) computations and O(PN P ) storage, for N datapoints and P input dimensions, compared to the standard O(N 3 ) computations and O(N 2 ) storage associated with GPs. Extend Kronecker methods to account for non-grid data, whilst preserving exact inference. 62 / 70
63 Outline (Specific), Part 2 Show that i) truly nonparametric representations, ii) expressive kernels, and iii) structure exploiting inference, when used in combination, distinctly enable large scale pattern extrapolation, and representation learning including: long range spatiotemporal forecasting, image inpainting, video extrapolation, and kernel discovery. This is the first time, as far as we are aware, that highly expressive non-parametric kernels with in some cases hundreds of hyperparameters, on datasets exceeding N = 10 5 training instances, can be learned from the marginal likelihood of a GP, in only minutes. Such experiments show that one can, to some extent, solve kernel selection, and automatically extract useful features from the data, on large datasets, using a special combination of expressive kernels and scalable inference. 63 / 70
64 Worked Example: Combining Kernels, CO 2 Data CO 2 Concentration (ppm) Year Example from Rasmussen and Williams (2006), Gaussian Processes for Machine Learning. 64 / 70
65 Worked Example: Combining Kernels, CO 2 Data 65 / 70
66 Worked Example: Combining Kernels, CO 2 Data Long rising trend: k 1(x p, x q) = θ 2 1 exp Quasi-periodic seasonal changes: ( (xp xq)2 k 2(x p, x q) = k RBF(x p, x q)k PER(x p, x q) = θ 2 3 exp Multi-scale medium term irregularities: k 3(x p, x q) = θ 2 6 Correlated and i.i.d. noise: k 4(x p, x q) = θ 2 9 exp 2θ 2 2 ) ( ) (xp xq) 2 sin2 (π(x p x q)) 2θ 4 2 θ 5 2 ) θ8 (1 + (xp xq)2 ( (xp xq)2 k total(x p, x q) = k 1(x p, x q) + k 2(x p, x q) + k 3(x p, x q) + k 4(x p, x q) 2θ θ 8 θ 2 7 ) + θ 2 11δ pq 66 / 70
67 Worked Example: Combining Kernels, CO 2 Data Hand crafted a kernel combination to perform extrapolation Confidence in the extrapolation is high (suggests that model is well specified). Can interpret the learned kernel hyperparameters θ to learn information about our dataset. A lot of the interesting pattern recognition has been done by a human in this example. We would like to completely automate this modelling procedure. 67 / 70
68 Function Learning Example Airline Passengers (Thousands) Train Human? Year 68 / 70
69 Function Learning Example Airline Passengers (Thousands) Train Human? Alien? Year 69 / 70
70 Function Learning Example Airline Passengers (Thousands) Train Alien? Test Human? Year 70 / 70
Kernels for Automatic Pattern Discovery and Extrapolation
Kernels for Automatic Pattern Discovery and Extrapolation Andrew Gordon Wilson agw38@cam.ac.uk mlg.eng.cam.ac.uk/andrew University of Cambridge Joint work with Ryan Adams (Harvard) 1 / 21 Pattern Recognition
More informationProbabilistic Graphical Models Lecture 21: Advanced Gaussian Processes
Probabilistic Graphical Models Lecture 21: Advanced Gaussian Processes Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University April 1, 2015 1 / 100 Gaussian process review Definition
More informationProbabilistic Graphical Models Lecture 20: Gaussian Processes
Probabilistic Graphical Models Lecture 20: Gaussian Processes Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 30, 2015 1 / 53 What is Machine Learning? Machine learning algorithms
More information20: Gaussian Processes
10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction
More informationGaussian Process Regression Networks
Gaussian Process Regression Networks Andrew Gordon Wilson agw38@camacuk mlgengcamacuk/andrew University of Cambridge Joint work with David A Knowles and Zoubin Ghahramani June 27, 2012 ICML, Edinburgh
More informationBayesian Deep Learning
Bayesian Deep Learning Andrew Gordon Wilson Assistant Professor https://people.orie.cornell.edu/andrew Cornell University Toronto Deep Learning Summer School July 31, 218 1 / 66 Model Selection 7 Airline
More informationModel Selection for Gaussian Processes
Institute for Adaptive and Neural Computation School of Informatics,, UK December 26 Outline GP basics Model selection: covariance functions and parameterizations Criteria for model selection Marginal
More informationA Process over all Stationary Covariance Kernels
A Process over all Stationary Covariance Kernels Andrew Gordon Wilson June 9, 0 Abstract I define a process over all stationary covariance kernels. I show how one might be able to perform inference that
More informationGaussian Process Kernels for Pattern Discovery and Extrapolation
Andrew Gordon Wilson Department of Engineering, University of Cambridge, Cambridge, UK Ryan Prescott Adams School of Engineering and Applied Sciences, Harvard University, Cambridge, USA agw38@cam.ac.uk
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More informationGaussian Process Regression
Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process
More informationGaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature
More informationGWAS V: Gaussian processes
GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011
More informationProbabilistic & Unsupervised Learning
Probabilistic & Unsupervised Learning Gaussian Processes Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London
More informationAn Introduction to Gaussian Processes for Spatial Data (Predictions!)
An Introduction to Gaussian Processes for Spatial Data (Predictions!) Matthew Kupilik College of Engineering Seminar Series Nov 216 Matthew Kupilik UAA GP for Spatial Data Nov 216 1 / 35 Why? When evidence
More informationThe Automatic Statistician
The Automatic Statistician Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ James Lloyd, David Duvenaud (Cambridge) and
More informationTutorial on Gaussian Processes and the Gaussian Process Latent Variable Model
Tutorial on Gaussian Processes and the Gaussian Process Latent Variable Model (& discussion on the GPLVM tech. report by Prof. N. Lawrence, 06) Andreas Damianou Department of Neuro- and Computer Science,
More informationHow to build an automatic statistician
How to build an automatic statistician James Robert Lloyd 1, David Duvenaud 1, Roger Grosse 2, Joshua Tenenbaum 2, Zoubin Ghahramani 1 1: Department of Engineering, University of Cambridge, UK 2: Massachusetts
More informationCSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes
CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes Roger Grosse Roger Grosse CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes 1 / 55 Adminis-Trivia Did everyone get my e-mail
More informationNonparameteric Regression:
Nonparameteric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,
More informationCovariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes
Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes Andrew Gordon Wilson Trinity College University of Cambridge This dissertation is submitted for the degree
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More informationComputer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression
Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationProbabilistic Graphical Models Lecture 17: Markov chain Monte Carlo
Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,
More informationCSci 8980: Advanced Topics in Graphical Models Gaussian Processes
CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Neil D. Lawrence GPSS 10th June 2013 Book Rasmussen and Williams (2006) Outline The Gaussian Density Covariance from Basis Functions Basis Function Representations Constructing
More informationGaussian Processes (10/16/13)
STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 3 Stochastic Gradients, Bayesian Inference, and Occam s Razor https://people.orie.cornell.edu/andrew/orie6741 Cornell University August
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationLarge-scale Collaborative Prediction Using a Nonparametric Random Effects Model
Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model Kai Yu Joint work with John Lafferty and Shenghuo Zhu NEC Laboratories America, Carnegie Mellon University First Prev Page
More informationComputer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression
Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:
More informationReliability Monitoring Using Log Gaussian Process Regression
COPYRIGHT 013, M. Modarres Reliability Monitoring Using Log Gaussian Process Regression Martin Wayne Mohammad Modarres PSA 013 Center for Risk and Reliability University of Maryland Department of Mechanical
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September
More informationPILCO: A Model-Based and Data-Efficient Approach to Policy Search
PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol
More informationLecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu
Lecture: Gaussian Process Regression STAT 6474 Instructor: Hongxiao Zhu Motivation Reference: Marc Deisenroth s tutorial on Robot Learning. 2 Fast Learning for Autonomous Robots with Gaussian Processes
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationMultiple-step Time Series Forecasting with Sparse Gaussian Processes
Multiple-step Time Series Forecasting with Sparse Gaussian Processes Perry Groot ab Peter Lucas a Paul van den Bosch b a Radboud University, Model-Based Systems Development, Heyendaalseweg 135, 6525 AJ
More informationNonparametric Regression With Gaussian Processes
Nonparametric Regression With Gaussian Processes From Chap. 45, Information Theory, Inference and Learning Algorithms, D. J. C. McKay Presented by Micha Elsner Nonparametric Regression With Gaussian Processes
More informationLearning Gaussian Process Models from Uncertain Data
Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada
More informationADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II
1 Non-linear regression techniques Part - II Regression Algorithms in this Course Support Vector Machine Relevance Vector Machine Support vector regression Boosting random projections Relevance vector
More informationFast Kronecker Inference in Gaussian Processes with non-gaussian Likelihoods
Fast Kronecker Inference in Gaussian Processes with non-gaussian Likelihoods Seth R. Flaxman 1, Andrew Gordon Wilson 1, Daniel B. Neill 1, Hannes Nickisch 2 and Alexander J. Smola 1 1 Carnegie Mellon University,
More informationDEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE
Data Provided: None DEPARTMENT OF COMPUTER SCIENCE Autumn Semester 203 204 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE 2 hours Answer THREE of the four questions. All questions carry equal weight. Figures
More informationNonparmeteric Bayes & Gaussian Processes. Baback Moghaddam Machine Learning Group
Nonparmeteric Bayes & Gaussian Processes Baback Moghaddam baback@jpl.nasa.gov Machine Learning Group Outline Bayesian Inference Hierarchical Models Model Selection Parametric vs. Nonparametric Gaussian
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Iain Murray School of Informatics, University of Edinburgh The problem Learn scalar function of vector values f(x).5.5 f(x) y i.5.2.4.6.8 x f 5 5.5 x x 2.5 We have (possibly
More informationStochastic Variational Inference for Gaussian Process Latent Variable Models using Back Constraints
Stochastic Variational Inference for Gaussian Process Latent Variable Models using Back Constraints Thang D. Bui Richard E. Turner tdb40@cam.ac.uk ret26@cam.ac.uk Computational and Biological Learning
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationGaussian processes for inference in stochastic differential equations
Gaussian processes for inference in stochastic differential equations Manfred Opper, AI group, TU Berlin November 6, 2017 Manfred Opper, AI group, TU Berlin (TU Berlin) inference in SDE November 6, 2017
More informationMachine Learning - MT & 5. Basis Expansion, Regularization, Validation
Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationGaussian Process Dynamical Models Jack M Wang, David J Fleet, Aaron Hertzmann, NIPS 2005
Gaussian Process Dynamical Models Jack M Wang, David J Fleet, Aaron Hertzmann, NIPS 2005 Presented by Piotr Mirowski CBLL meeting, May 6, 2009 Courant Institute of Mathematical Sciences, New York University
More informationIntroduction to Gaussian Process
Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression
More informationProbabilistic & Bayesian deep learning. Andreas Damianou
Probabilistic & Bayesian deep learning Andreas Damianou Amazon Research Cambridge, UK Talk at University of Sheffield, 19 March 2019 In this talk Not in this talk: CRFs, Boltzmann machines,... In this
More informationScalable kernel methods and their use in black-box optimization
with derivatives Scalable kernel methods and their use in black-box optimization David Eriksson Center for Applied Mathematics Cornell University dme65@cornell.edu November 9, 2018 1 2 3 4 1/37 with derivatives
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate
More informationCOMP 551 Applied Machine Learning Lecture 20: Gaussian processes
COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationCOMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation
COMP 55 Applied Machine Learning Lecture 2: Bayesian optimisation Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp55 Unless otherwise noted, all material posted
More informationGaussian process for nonstationary time series prediction
Computational Statistics & Data Analysis 47 (2004) 705 712 www.elsevier.com/locate/csda Gaussian process for nonstationary time series prediction Soane Brahim-Belhouari, Amine Bermak EEE Department, Hong
More informationBuilding an Automatic Statistician
Building an Automatic Statistician Zoubin Ghahramani Department of Engineering University of Cambridge zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ ALT-Discovery Science Conference, October
More informationFully Bayesian Deep Gaussian Processes for Uncertainty Quantification
Fully Bayesian Deep Gaussian Processes for Uncertainty Quantification N. Zabaras 1 S. Atkinson 1 Center for Informatics and Computational Science Department of Aerospace and Mechanical Engineering University
More informationState Space Gaussian Processes with Non-Gaussian Likelihoods
State Space Gaussian Processes with Non-Gaussian Likelihoods Hannes Nickisch 1 Arno Solin 2 Alexander Grigorievskiy 2,3 1 Philips Research, 2 Aalto University, 3 Silo.AI ICML2018 July 13, 2018 Outline
More informationGaussian with mean ( µ ) and standard deviation ( σ)
Slide from Pieter Abbeel Gaussian with mean ( µ ) and standard deviation ( σ) 10/6/16 CSE-571: Robotics X ~ N( µ, σ ) Y ~ N( aµ + b, a σ ) Y = ax + b + + + + 1 1 1 1 1 1 1 1 1 1, ~ ) ( ) ( ), ( ~ ), (
More informationOptimization of Gaussian Process Hyperparameters using Rprop
Optimization of Gaussian Process Hyperparameters using Rprop Manuel Blum and Martin Riedmiller University of Freiburg - Department of Computer Science Freiburg, Germany Abstract. Gaussian processes are
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationVariational Model Selection for Sparse Gaussian Process Regression
Variational Model Selection for Sparse Gaussian Process Regression Michalis K. Titsias School of Computer Science University of Manchester 7 September 2008 Outline Gaussian process regression and sparse
More informationBayesian Hidden Markov Models and Extensions
Bayesian Hidden Markov Models and Extensions Zoubin Ghahramani Department of Engineering University of Cambridge joint work with Matt Beal, Jurgen van Gael, Yunus Saatci, Tom Stepleton, Yee Whye Teh Modeling
More informationPattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods
Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter
More informationLecture 5: GPs and Streaming regression
Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1 Recall: Non-parametric regression Input space X
More informationDD Advanced Machine Learning
Modelling Carl Henrik {chek}@csc.kth.se Royal Institute of Technology November 4, 2015 Who do I think you are? Mathematically competent linear algebra multivariate calculus Ok programmers Able to extend
More informationMTTTS16 Learning from Multiple Sources
MTTTS16 Learning from Multiple Sources 5 ECTS credits Autumn 2018, University of Tampere Lecturer: Jaakko Peltonen Lecture 6: Multitask learning with kernel methods and nonparametric models On this lecture:
More informationGaussian Processes for Machine Learning
Gaussian Processes for Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics Tübingen, Germany carl@tuebingen.mpg.de Carlos III, Madrid, May 2006 The actual science of
More informationState Space Representation of Gaussian Processes
State Space Representation of Gaussian Processes Simo Särkkä Department of Biomedical Engineering and Computational Science (BECS) Aalto University, Espoo, Finland June 12th, 2013 Simo Särkkä (Aalto University)
More informationLog Gaussian Cox Processes. Chi Group Meeting February 23, 2016
Log Gaussian Cox Processes Chi Group Meeting February 23, 2016 Outline Typical motivating application Introduction to LGCP model Brief overview of inference Applications in my work just getting started
More informationStatistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes
Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan 1, M. Koval and P. Parashar 1 Applications of Gaussian
More informationModeling human function learning with Gaussian processes
Modeling human function learning with Gaussian processes Thomas L. Griffiths Christopher G. Lucas Joseph J. Williams Department of Psychology University of California, Berkeley Berkeley, CA 94720-1650
More informationIntroduction. Chapter 1
Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics
More informationVariational Gaussian Process Dynamical Systems
Variational Gaussian Process Dynamical Systems Andreas C. Damianou Department of Computer Science University of Sheffield, UK andreas.damianou@sheffield.ac.uk Michalis K. Titsias School of Computer Science
More informationGaussian Process Latent Variable Models for Dimensionality Reduction and Time Series Modeling
Gaussian Process Latent Variable Models for Dimensionality Reduction and Time Series Modeling Nakul Gopalan IAS, TU Darmstadt nakul.gopalan@stud.tu-darmstadt.de Abstract Time series data of high dimensions
More informationAnomaly Detection and Removal Using Non-Stationary Gaussian Processes
Anomaly Detection and Removal Using Non-Stationary Gaussian ocesses Steven Reece Roman Garnett, Michael Osborne and Stephen Roberts Robotics Research Group Dept Engineering Science Oford University, UK
More informationAnomaly Detection and Removal Using Non-Stationary Gaussian Processes
Anomaly Detection and Removal Using Non-Stationary Gaussian ocesses Steven Reece, Roman Garnett, Michael Osborne and Stephen Roberts Robotics Research Group Dept Engineering Science Oxford University,
More informationParametric Techniques Lecture 3
Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to
More informationAfternoon Meeting on Bayesian Computation 2018 University of Reading
Gabriele Abbati 1, Alessra Tosi 2, Seth Flaxman 3, Michael A Osborne 1 1 University of Oxford, 2 Mind Foundry Ltd, 3 Imperial College London Afternoon Meeting on Bayesian Computation 2018 University of
More informationProbabilistic Reasoning in Deep Learning
Probabilistic Reasoning in Deep Learning Dr Konstantina Palla, PhD palla@stats.ox.ac.uk September 2017 Deep Learning Indaba, Johannesburgh Konstantina Palla 1 / 39 OVERVIEW OF THE TALK Basics of Bayesian
More informationJoint Emotion Analysis via Multi-task Gaussian Processes
Joint Emotion Analysis via Multi-task Gaussian Processes Daniel Beck, Trevor Cohn, Lucia Specia October 28, 2014 1 Introduction 2 Multi-task Gaussian Process Regression 3 Experiments and Discussion 4 Conclusions
More informationLA Support for Scalable Kernel Methods. David Bindel 29 Sep 2018
LA Support for Scalable Kernel Methods David Bindel 29 Sep 2018 Collaborators Kun Dong (Cornell CAM) David Eriksson (Cornell CAM) Jake Gardner (Cornell CS) Eric Lee (Cornell CS) Hannes Nickisch (Phillips
More informationSystem identification and control with (deep) Gaussian processes. Andreas Damianou
System identification and control with (deep) Gaussian processes Andreas Damianou Department of Computer Science, University of Sheffield, UK MIT, 11 Feb. 2016 Outline Part 1: Introduction Part 2: Gaussian
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationPrediction of Data with help of the Gaussian Process Method
of Data with help of the Gaussian Process Method R. Preuss, U. von Toussaint Max-Planck-Institute for Plasma Physics EURATOM Association 878 Garching, Germany March, Abstract The simulation of plasma-wall
More informationNonparametric Bayesian Methods - Lecture I
Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics
More informationCS-E3210 Machine Learning: Basic Principles
CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction
More informationNon Linear Latent Variable Models
Non Linear Latent Variable Models Neil Lawrence GPRS 14th February 2014 Outline Nonlinear Latent Variable Models Extensions Outline Nonlinear Latent Variable Models Extensions Non-Linear Latent Variable
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationApproximate Inference using MCMC
Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov
More information