Kernel Methods for Large Scale Representation Learning

1 Kernel Methods for Large Scale Representation Learning Andrew Gordon Wilson Carnegie Mellon University November 2014, University of Oxford 1 / 70

2 Themes and Conclusions (High Level) Machine learning aims to develop algorithms which can automatically discover rich statistical representations of data and make impressive generalisations. Flexibility and scalability are two sides of one coin: larger datasets provide relatively more information for learning rich statistical representations. We therefore wish to simultaneously increase the flexibility and scalability of machine learning methods. Bayesian nonparametric methods are natural for big datasets, since they automatically scale their information capacity with the amount of available data. 2 / 70

3 Themes and Conclusions (Kernels) The generalisation properties of a kernel method are entirely controlled by a kernel function, which represents an inner product of arbitrarily many basis functions. In other words, the support and inductive biases of a kernel method (which enable learning) are determined by the kernel. The main advantage of a Gaussian process, over other kernel machines, is access to a marginal likelihood, which provides a powerful probabilistic framework for kernel learning. This advantage is currently underappreciated, because we typically use very simple kernels with only a few hyperparameters. In order to enable powerful representation learning, we can introduce highly expressive kernels, and learn the properties of these kernels through marginal likelihood optimisation. The best way to scale up expressive kernel learning approaches is by exploiting the existing structure in the kernel (e.g., Kronecker methods). Nonparametric kernel methods correspond to infinite basis function expansions; these methods are well suited to performing ambitious generalisations from large datasets. 3 / 70

4 Outline (Specific), Part 1 Introduce expressive spectral mixture kernels by modelling a spectral density (the Fourier transform of a kernel) with scale-location Gaussian mixtures. Adapt these kernels for Kronecker structure, and exploit this structure for exact inference and learning, which costs O(PN^((P+1)/P)) computations and O(PN^(2/P)) storage, for N datapoints and P input dimensions, compared to the standard O(N^3) computations and O(N^2) storage associated with GPs. Extend Kronecker methods to account for non-grid data, whilst preserving exact inference. 4 / 70

5 Outline (Specific), Part 2 Show that i) truly nonparametric representations, ii) expressive kernels, and iii) structure exploiting inference, when used in combination, distinctly enable large scale pattern extrapolation and representation learning, including: long range spatiotemporal forecasting, image inpainting, video extrapolation, and kernel discovery. This is the first time, as far as we are aware, that highly expressive non-parametric kernels with in some cases hundreds of hyperparameters, on datasets exceeding N = 10^5 training instances, can be learned from the marginal likelihood of a GP, in only minutes. Such experiments show that one can, to some extent, solve kernel selection, and automatically extract useful features from the data, on large datasets, using a special combination of expressive kernels and scalable inference. 5 / 70

6 Gaussian processes
Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.
Nonparametric Regression Model
Prior: f(x) ~ GP(m(x), k(x, x')), meaning (f(x_1), ..., f(x_N)) ~ N(µ, K), with µ_i = m(x_i) and K_ij = cov(f(x_i), f(x_j)) = k(x_i, x_j).
GP posterior p(f(x) | D) ∝ likelihood p(D | f(x)) × GP prior p(f(x)).
[Figure: (a) sample prior functions and (b) sample posterior functions f(t) from a Gaussian process, as functions of input t.] 6 / 70

7 Gaussian Process Inference
Observed noisy data y = (y(x_1), ..., y(x_N))^T at input locations X.
Start with the standard regression assumption: y(x) ~ N(y(x); f(x), σ^2).
Place a Gaussian process distribution over noise-free functions, f(x) ~ GP(0, k_θ). The kernel k is parametrized by θ.
Infer p(f_* | y, X, X_*) for the noise-free function f_* evaluated at test points X_*.
Joint distribution:
[y; f_*] ~ N(0, [K_θ(X, X) + σ^2 I, K_θ(X, X_*); K_θ(X_*, X), K_θ(X_*, X_*)]). (1)
Conditional predictive distribution:
f_* | X, X_*, y, θ ~ N(f̄_*, cov(f_*)), (2)
f̄_* = K_θ(X_*, X)[K_θ(X, X) + σ^2 I]^{-1} y, (3)
cov(f_*) = K_θ(X_*, X_*) − K_θ(X_*, X)[K_θ(X, X) + σ^2 I]^{-1} K_θ(X, X_*). (4)
7 / 70
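
For concreteness, a minimal numpy sketch of the predictive equations (3)-(4), assuming an SE kernel and synthetic data; the function and variable names (se_kernel, gp_predict) are illustrative, not code from these slides.

```python
import numpy as np

def se_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    # Squared exponential kernel: k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2)).
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def gp_predict(X, y, Xstar, noise_var=0.1):
    # Predictive mean and covariance at test inputs Xstar, Eqs. (3) and (4).
    K = se_kernel(X, X) + noise_var * np.eye(len(X))   # K_theta(X, X) + sigma^2 I
    Ks = se_kernel(X, Xstar)                           # K_theta(X, X_*)
    Kss = se_kernel(Xstar, Xstar)                      # K_theta(X_*, X_*)
    mean = Ks.T @ np.linalg.solve(K, y)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, cov

X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)
Xstar = np.linspace(0, 6, 50)[:, None]
mean, cov = gp_predict(X, y, Xstar)
```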

8 Learning and Model Selection
p(M_i | y) = p(y | M_i) p(M_i) / p(y) (5)
We can write the evidence of the model as
p(y | M_i) = ∫ p(y | f, M_i) p(f) df. (6)
[Figure: (a) evidence p(y | M) of simple, complex, and appropriate models across all possible datasets; (b) fits of the simple, complex, and appropriate models to data y as a function of input x.] 8 / 70

9 Learning and Model Selection
We can integrate away the entire Gaussian process f(x) to obtain the marginal likelihood, as a function of kernel hyperparameters θ alone:
p(y | θ, X) = ∫ p(y | f, X) p(f | θ, X) df. (7)
log p(y | θ, X) = −(1/2) y^T (K_θ + σ^2 I)^{-1} y [model fit] − (1/2) log |K_θ + σ^2 I| [complexity penalty] − (N/2) log(2π). (8)
An extremely powerful mechanism for kernel learning.
[Figure: (a) samples from the GP prior; (b) samples from the GP posterior.] 9 / 70
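
A corresponding sketch of Eq. (8), computing the log marginal likelihood from a Cholesky factorisation; K_theta here is an N x N kernel matrix computed elsewhere (illustrative code, not the implementation behind these slides).

```python
import numpy as np

def log_marginal_likelihood(K_theta, y, noise_var):
    # Eq. (8): model fit + complexity penalty + normalising constant.
    N = len(y)
    L = np.linalg.cholesky(K_theta + noise_var * np.eye(N))   # K_theta + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))       # (K_theta + sigma^2 I)^{-1} y
    model_fit = -0.5 * y @ alpha
    complexity_penalty = -np.sum(np.log(np.diag(L)))          # -0.5 log|K_theta + sigma^2 I|
    return model_fit + complexity_penalty - 0.5 * N * np.log(2 * np.pi)
```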

10 Learning and Model Selection
A fully Bayesian treatment would integrate away kernel hyperparameters θ:
p(f_* | X, X_*, y) = ∫ p(f_* | X, X_*, y, θ) p(θ | y) dθ. (9)
For example, we could specify a prior p(θ), use MCMC to take J samples from p(θ | y) ∝ p(y | θ) p(θ), and then find
p(f_* | X, X_*, y) ≈ (1/J) Σ_{i=1}^J p(f_* | X, X_*, y, θ^(i)), θ^(i) ~ p(θ | y). (10)
If we have a non-Gaussian noise model, and thus cannot integrate away f, the strong dependencies between the Gaussian process f and hyperparameters θ make sampling extremely difficult. In my experience, the most effective solution is to use a deterministic approximation for the posterior p(f | y) which enables one to work with an approximate marginal likelihood. 10 / 70

11 Linear Basis Function Models
Model Specification:
f(x, w) = w^T φ(x) (11)
p(w) = N(0, Σ_w) (12)
Moments of Induced Distribution over Functions:
E[f(x, w)] = m(x) = E[w^T] φ(x) = 0 (13)
cov(f(x_i), f(x_j)) = k(x_i, x_j) = E[f(x_i) f(x_j)] − E[f(x_i)] E[f(x_j)] (14)
= φ(x_i)^T E[w w^T] φ(x_j) − 0 (15)
= φ(x_i)^T Σ_w φ(x_j) (16)
f(x, w) is a Gaussian process, f(x) ~ GP(m, k), with mean function m(x) = 0 and covariance kernel k(x_a, x_b) = φ(x_a)^T Σ_w φ(x_b). The entire basis function model of Eqs. (11) and (12) is encapsulated as a distribution over functions with kernel k(x, x'). 11 / 70
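
A quick Monte Carlo check of Eqs. (13)-(16), using made-up RBF basis functions: the empirical covariance of f(x, w) over random weight draws should match φ(x)^T Σ_w φ(x').

```python
import numpy as np

rng = np.random.default_rng(0)
centres = np.linspace(-3, 3, 5)
phi = lambda x: np.exp(-0.5 * (x - centres)**2)      # 5 RBF basis functions

Sigma_w = np.diag(rng.uniform(0.5, 2.0, size=5))     # prior covariance of the weights
x1, x2 = 0.3, 1.1

k_analytic = phi(x1) @ Sigma_w @ phi(x2)             # Eq. (16)

W = rng.multivariate_normal(np.zeros(5), Sigma_w, size=200_000)
f1, f2 = W @ phi(x1), W @ phi(x2)                    # f(x, w) = w^T phi(x) for each draw
k_empirical = np.mean(f1 * f2)                       # mean function is zero, so this is the covariance

print(k_analytic, k_empirical)                       # the two should agree closely
```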

12 Deriving the RBF Kernel
Start with the generalised linear model
f(x) = Σ_{i=1}^J w_i φ_i(x), (17)
w_i ~ N(0, σ^2/J), (18)
φ_i(x) = exp(−(x − c_i)^2 / (2l^2)). (19)
Equations (17)-(19) define a radial basis function regression model, with radial basis functions centred at the points c_i. Using our result for the kernel of a generalised linear model,
k(x, x') = (σ^2/J) Σ_{i=1}^J φ_i(x) φ_i(x'). (20)
12 / 70

13 Deriving the SE (aka RBF, Gaussian) Kernel
f(x) = Σ_{i=1}^J w_i φ_i(x), w_i ~ N(0, σ^2/J), φ_i(x) = exp(−(x − c_i)^2 / (2l^2)) (21)
k(x, x') = (σ^2/J) Σ_{i=1}^J φ_i(x) φ_i(x') (22)
Letting c_{i+1} − c_i = Δc = 1/J, and J → ∞, the kernel in Eq. (22) becomes a Riemann sum:
k(x, x') = lim_{J→∞} (σ^2/J) Σ_{i=1}^J φ_i(x) φ_i(x') = ∫_{c_0}^{c_∞} φ_c(x) φ_c(x') dc. (23)
By setting c_0 = −∞ and c_∞ = ∞, we spread the infinitely many basis functions across the whole real line, each a distance Δc → 0 apart:
k(x, x') = σ^2 ∫ exp(−(x − c)^2/(2l^2)) exp(−(x' − c)^2/(2l^2)) dc (24)
= √π l σ^2 exp(−(x − x')^2 / (2(√2 l)^2)). (25)
13 / 70
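
A numerical check of the limit in Eqs. (23)-(25): a Riemann sum over densely spaced basis-function centres approaches the closed-form SE kernel (a sketch; the grid limits below are simply chosen wide enough that the Gaussian tails are negligible).

```python
import numpy as np

l, sigma2 = 0.7, 1.3
x, xp = 0.4, 1.5

phi = lambda z, c: np.exp(-(z - c)**2 / (2 * l**2))

c = np.linspace(-20, 20, 200_001)               # dense grid of basis-function centres
dc = c[1] - c[0]
k_riemann = sigma2 * np.sum(phi(x, c) * phi(xp, c)) * dc      # Eq. (24) as a Riemann sum

k_closed = np.sqrt(np.pi) * l * sigma2 * np.exp(-(x - xp)**2 / (2 * (np.sqrt(2) * l)**2))  # Eq. (25)
print(k_riemann, k_closed)                      # agree to several decimal places
```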

14 Gaussian Process Covariance Kernels
Let τ = x − x':
k_SE(τ) = exp(−0.5 τ^2 / l^2) (26)
k_MA(τ) = a(1 + √3 τ / l) exp(−√3 τ / l) (27)
k_RQ(τ) = (1 + τ^2 / (2 α l^2))^{−α} (28)
k_PE(τ) = exp(−2 sin^2(π τ ω) / l^2) (29)
14 / 70
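
The four kernels above as one-dimensional functions of τ = x − x' (a sketch; the Matérn amplitude a and periodic frequency ω are written as explicit arguments).

```python
import numpy as np

def k_se(tau, l=1.0):                        # squared exponential, Eq. (26)
    return np.exp(-0.5 * tau**2 / l**2)

def k_ma(tau, l=1.0, a=1.0):                 # Matern-3/2, Eq. (27)
    r = np.sqrt(3) * np.abs(tau) / l
    return a * (1 + r) * np.exp(-r)

def k_rq(tau, l=1.0, alpha=1.0):             # rational quadratic, Eq. (28)
    return (1 + tau**2 / (2 * alpha * l**2)) ** (-alpha)

def k_pe(tau, l=1.0, omega=1.0):             # periodic, Eq. (29)
    return np.exp(-2 * np.sin(np.pi * tau * omega)**2 / l**2)
```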

15 CO2 Extrapolation with Standard Kernels — [Figure: CO2 concentration (ppm) vs. year.] 15 / 70

16 Gaussian processes "How can Gaussian processes possibly replace neural networks? Did we throw the baby out with the bathwater?" David MacKay 16 / 70

17 More Expressive Covariance Functions — [Diagram: Gaussian process regression network, in which the input x feeds latent GP node functions f̂_1(x), ..., f̂_q(x), which are mixed by GP weight functions W_11(x), ..., W_pq(x) to produce outputs y_1(x), ..., y_p(x).] Gaussian Process Regression Networks. Wilson et al., ICML 2012. 17 / 70

18 Gaussian Process Regression Network — [Figure: predictions over latitude and longitude.] 18 / 70

19 Expressive Covariance Functions GPs in Bayesian neural-network-like architectures (Salakhutdinov and Hinton, 2008; Wilson et al., 2012; Damianou and Lawrence, 2012): task specific, difficult inference, no closed form kernels. Compositions of kernels (Archambeau and Bach, 2011; Durrande et al., 2011; Rasmussen and Williams, 2006): in the general case, difficult to interpret, difficult inference, struggle with over-fitting. One can learn almost nothing about the covariance function of a stochastic process from a single realization, if we assume that the covariance function could be any positive definite function. Most commonly one assumes a restriction to stationary kernels, meaning that covariances are invariant to translations in the input space. 19 / 70

20 Bochner's Theorem
Theorem (Bochner): A complex-valued function k on R^P is the covariance function of a weakly stationary mean square continuous complex-valued random process on R^P if and only if it can be represented as
k(τ) = ∫_{R^P} e^{2πi s^T τ} ψ(ds), (30)
where ψ is a positive finite measure.
If ψ has a density S(s), then S is called the spectral density or power spectrum of k, and k and S are Fourier duals:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (31)
S(s) = ∫ k(τ) e^{−2πi s^T τ} dτ. (32)
20 / 70

21 Idea
k and S are Fourier duals:
k(τ) = ∫ S(s) e^{2πi s^T τ} ds, (33)
S(s) = ∫ k(τ) e^{−2πi s^T τ} dτ. (34)
If we can approximate S(s) to arbitrary accuracy, then we can approximate any stationary kernel to arbitrary accuracy. We can model S(s) to arbitrary accuracy, since scale-location mixtures of Gaussians can approximate any distribution to arbitrary accuracy. A scale-location mixture of Gaussians can flexibly model many distributions, and thus many covariance kernels, even with a small number of components. 21 / 70

22 Kernels for Pattern Discovery
Let τ = x − x' ∈ R^P. From Bochner's Theorem,
k(τ) = ∫_{R^P} S(s) e^{2πi s^T τ} ds. (35)
For simplicity, assume τ ∈ R^1 and let
S(s) = [N(s; µ, σ^2) + N(−s; µ, σ^2)] / 2. (36)
Then
k(τ) = exp{−2π^2 τ^2 σ^2} cos(2πτµ). (37)
More generally, if S(s) is a symmetrized mixture of diagonal covariance Gaussians on R^P, with the q-th component having covariance matrix M_q = diag(v_q^(1), ..., v_q^(P)), then
k(τ) = Σ_{q=1}^Q w_q cos(2π τ^T µ_q) Π_{p=1}^P exp{−2π^2 τ_p^2 v_q^(p)}. (38)
22 / 70
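
A sketch of the one-dimensional spectral mixture kernel, Eq. (38) with P = 1 (equivalently Eq. (37) summed over Q weighted components); this is illustrative code, not the authors' released implementation.

```python
import numpy as np

def sm_kernel(tau, w, mu, v):
    # tau: array of input distances x - x'; w, mu, v: length-Q arrays of
    # component weights, spectral means, and spectral variances.
    tau = np.asarray(tau)[..., None]                 # broadcast over the Q components
    comps = w * np.exp(-2 * np.pi**2 * tau**2 * v) * np.cos(2 * np.pi * tau * mu)
    return comps.sum(-1)

# Example: one component for a slowly varying trend (mu ~ 0) and one for a
# periodic pattern with frequency 0.5.
tau = np.linspace(-10, 10, 401)
k = sm_kernel(tau, w=np.array([1.0, 0.5]), mu=np.array([0.0, 0.5]), v=np.array([0.01, 0.001]))
```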

23 GP Model for Pattern Extrapolation
Observations y(x) ~ N(y(x); f(x), σ^2) (can easily be relaxed).
f(x) ~ GP(0, k_SM(x, x' | θ)) (f(x) is a GP with SM kernel).
k_SM(x, x' | θ) can approximate many different kernels with different settings of its hyperparameters θ.
Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS):
log p(y | θ, X) = −(1/2) y^T (K_θ + σ^2 I)^{-1} y [model fit] − (1/2) log |K_θ + σ^2 I| [complexity penalty] − (N/2) log(2π). (39)
Once the hyperparameters are trained as ˆθ, we make predictions using p(f_* | y, X, ˆθ), which can be expressed in closed form. 23 / 70
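
A minimal sketch of this training loop, reusing the sm_kernel and log_marginal_likelihood helpers sketched above and scipy's L-BFGS-B optimiser with numerical gradients (a toy setup; the names, initial values, and synthetic data are illustrative, not the configuration used for the experiments in these slides).

```python
import numpy as np
from scipy.optimize import minimize

X = np.linspace(0, 10, 100)
y = np.sin(2 * np.pi * 0.5 * X) + 0.1 * np.random.randn(100)

def neg_lml(log_params, Q=2):
    # Unpack log-transformed hyperparameters: weights, means, variances, noise.
    w, mu = np.exp(log_params[:Q]), np.exp(log_params[Q:2*Q])
    v, noise_var = np.exp(log_params[2*Q:3*Q]), np.exp(log_params[-1]) + 1e-6
    K = sm_kernel(X[:, None] - X[None, :], w, mu, v)
    try:
        return -log_marginal_likelihood(K, y, noise_var)
    except np.linalg.LinAlgError:
        return 1e10                                  # reject numerically unstable settings

theta0 = np.log(np.array([1.0, 1.0, 0.1, 0.6, 0.01, 0.01, 0.05]))
result = minimize(neg_lml, theta0, method='L-BFGS-B')
print(np.exp(result.x))                              # learned w, mu, v, and noise variance
```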

24 Results, CO2 — [Figure: (c) CO2 concentration (ppm) vs. year, with training and test data, a credible region (CR), and extrapolations using the MA, RQ, PER, SE, and SM kernels; (d) log spectral density vs. frequency (1/month) for the SM and SE kernels and the empirical spectrum.] 24 / 70

25 Results, Reconstructing Standard Covariances — [Figure: (a), (b) reconstructed correlation functions vs. τ.] 25 / 70

26 Results, Negative Covariances — [Figure: (e) observations vs. input x; (f) true and SM-learned covariance vs. τ; (g) log spectral density vs. frequency for the SM kernel and the empirical spectrum.] 26 / 70

27 Results, Sinc Pattern — [Figure: (h), (i) observations vs. input x, with training and test data and MA, RQ, SE, PER, and SM extrapolations; (j) correlation functions of the MA and SM kernels vs. τ; (k) log spectral density vs. frequency for the SM and SE kernels and the empirical spectrum.] 27 / 70

28 Results, Airline Passengers — [Figure: (l) airline passengers (thousands) vs. year, with training and test data and PER, SE, RQ, MA, SM, and SSGP extrapolations; (m) log spectral density vs. frequency (1/month) for the SM and SE kernels and the empirical spectrum.] 28 / 70

29 Results Table: We compare the test performance of the proposed spectral mixture (SM) kernel with squared exponential (SE), Matérn (MA), rational quadratic (RQ), and periodic (PE) kernels, reporting mean squared error (MSE) and log likelihood (L) on the CO2, negative covariance, sinc, and airline datasets. The SM kernel consistently has the lowest MSE and highest L. 29 / 70

30 Scaling up the Spectral Mixture Kernel The flexibility of the spectral mixture kernel will be most useful on large datasets... Scaling an expressive kernel learning approach poses different challenges than scaling a standard Gaussian process model. One faces additional computational constraints, and the need to retain significant model structure for expressing the rich information available in a large dataset. So far we have considered just univariate (time series) problems. But the spectral mixture model is general purpose. How do we scale to multiple input dimensions, and what sort of multidimensional pattern extrapolation problems can we imagine? 30 / 70

31 Scaling a Gaussian process: inducing inputs
Gaussian process f and f_* evaluated at N training points and J testing points. M ≪ N inducing points u, with p(u) = N(0, K_{u,u}).
p(f_*, f) = ∫ p(f_*, f, u) du = ∫ p(f_*, f | u) p(u) du
Assume that f and f_* are conditionally independent given u:
p(f_*, f) ≈ q(f_*, f) = ∫ q(f_* | u) q(f | u) p(u) du (40)
Under this assumption,
p(f | u) = N(K_{f,u} K_{u,u}^{-1} u, K_{f,f} − Q_{f,f}), (41)
p(f_* | u) = N(K_{f_*,u} K_{u,u}^{-1} u, K_{f_*,f_*} − Q_{f_*,f_*}), (42)
Q_{a,b} = K_{a,u} K_{u,u}^{-1} K_{u,b}. (43)
Cost for predictions reduced from O(N^3) to O(M^2 N), where M ≪ N. Different inducing approaches correspond to different additional assumptions about q(f | u) and q(f_* | u). For further reading, see Quinonero-Candela and Rasmussen (2005). 31 / 70
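
As an illustration of this family of approximations, a sketch of the subset-of-regressors (SoR) predictive mean with an assumed SE kernel, costing O(M^2 N) rather than O(N^3); all names here are illustrative.

```python
import numpy as np

def se(A, B, l=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / l**2)

def sor_predict_mean(X, y, Xstar, Z, noise_var=0.1):
    # Z: M inducing inputs (M << N).  mu_* = sigma^{-2} K_{*u} Sigma K_{uf} y,
    # with Sigma = (K_{uu} + sigma^{-2} K_{uf} K_{fu})^{-1}.
    Kuu = se(Z, Z) + 1e-8 * np.eye(len(Z))      # jitter for numerical stability
    Kuf = se(Z, X)
    Ksu = se(Xstar, Z)
    Sigma = np.linalg.inv(Kuu + Kuf @ Kuf.T / noise_var)
    return Ksu @ Sigma @ Kuf @ y / noise_var

X = np.random.uniform(0, 10, (500, 1))
y = np.sin(X).ravel() + 0.1 * np.random.randn(500)
Z = np.linspace(0, 10, 20)[:, None]             # M = 20 inducing points
mu_star = sor_predict_mean(X, y, np.linspace(0, 10, 100)[:, None], Z)
```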

32 Kronecker methods Suppose k decomposes as a product of kernels across each input dimension x ∈ R^P: k(x_i, x_j) = Π_{p=1}^P k^p(x_i^p, x_j^p) (e.g., the RBF kernel has this property). Suppose also that the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P. Then K decomposes into a Kronecker product of matrices over each input dimension: K = K_1 ⊗ ... ⊗ K_P. The eigendecomposition of K into Q V Q^T also decomposes: Q = Q_1 ⊗ ... ⊗ Q_P, V = V_1 ⊗ ... ⊗ V_P. Assuming equal cardinality for each input dimension, we can thus eigendecompose an N × N matrix K in O(P N^{3/P}) operations instead of O(N^3) operations. 32 / 70

33 Kronecker methods Suppose k decomposes as a product of kernels across each input dimension x ∈ R^P: k(x_i, x_j) = Π_{p=1}^P k^p(x_i^p, x_j^p) (e.g., the RBF kernel has this property). Suppose also that the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P. Then K decomposes into a Kronecker product of matrices over each input dimension: K = K_1 ⊗ ... ⊗ K_P. The eigendecomposition of K into Q V Q^T also decomposes: Q = Q_1 ⊗ ... ⊗ Q_P, V = V_1 ⊗ ... ⊗ V_P. Assuming equal cardinality for each input dimension, we can thus eigendecompose an N × N matrix K in O(P N^{3/P}) operations instead of O(N^3) operations.
Then inference and learning are highly efficient:
(K + σ^2 I)^{-1} y = (Q V Q^T + σ^2 I)^{-1} y = Q (V + σ^2 I)^{-1} Q^T y, (44)
log |K + σ^2 I| = log |Q V Q^T + σ^2 I| = Σ_{i=1}^N log(λ_i + σ^2), (45)
where λ_i are the eigenvalues of K. Saatchi (2011) 33 / 70
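
A sketch of Eqs. (44)-(45) on a 2D grid: eigendecompose each small K_p separately, then apply Q and Q^T through Kronecker matrix-vector products without ever forming the N × N matrix (illustrative code).

```python
import numpy as np

def se(a, b, l=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

x1, x2 = np.linspace(0, 5, 30), np.linspace(0, 5, 40)      # grid: N = 30 * 40 points
K1, K2 = se(x1, x1), se(x2, x2)
y = np.random.randn(len(x1) * len(x2))                      # stand-in observations
noise_var = 0.1

lam1, Q1 = np.linalg.eigh(K1)                               # per-dimension eigendecompositions
lam2, Q2 = np.linalg.eigh(K2)
lam = np.kron(lam1, lam2)                                   # eigenvalues of K = K1 kron K2

def kron_mvp(A, B, v):
    # (A kron B) v without forming the Kronecker product (row-major ordering).
    V = v.reshape(A.shape[1], B.shape[1])
    return (A @ V @ B.T).reshape(-1)

Qt_y = kron_mvp(Q1.T, Q2.T, y)                              # Q^T y
alpha = kron_mvp(Q1, Q2, Qt_y / (lam + noise_var))          # Eq. (44): (K + sigma^2 I)^{-1} y
log_det = np.sum(np.log(lam + noise_var))                   # Eq. (45): log|K + sigma^2 I|
```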

34 Kronecker Methods We assumed that the inputs x ∈ X are on a multidimensional grid X = X_1 × ... × X_P ⊂ R^P. How might we relax this assumption, to use Kronecker methods if there are gaps (missing data) in our multidimensional grid? 34 / 70

35 Kronecker Methods Assume imaginary points that complete the grid. Place infinite noise on these points so they have no effect on inference. The relevant matrices are no longer Kronecker, but we can get around this using preconditioned conjugate gradients, an iterative linear solver. 35 / 70

36 Kronecker Methods with Missing Data
Assuming we have a dataset of M observations which are not necessarily on a grid, we propose to form a complete grid using W imaginary observations, y_W ~ N(f_W, ε^{-1} I_W), ε → 0.
The total observation vector y = [y_M, y_W]^T has N = M + W entries: y ~ N(f, D_N), where the noise covariance matrix D_N = diag(D_M, ε^{-1} I_W), with D_M = σ^2 I_M.
The imaginary observations y_W have no corrupting effect on inference: the moments of the resulting predictive distribution are exactly the same as for the standard predictive distribution, namely lim_{ε→0} (K_N + D_N)^{-1} y = (K_M + D_M)^{-1} y_M. 36 / 70

37 Kronecker Methods with Missing Inputs
We use preconditioned conjugate gradients to compute (K_N + D_N)^{-1} y, with the preconditioning matrix C = D_N^{-1/2}, solving C^T (K_N + D_N) C z = C^T y. The preconditioning matrix C speeds up convergence by ignoring the imaginary observations y_W.
For the log determinant (complexity) term in the marginal likelihood, used in hyperparameter learning,
log |K_M + D_M| = Σ_{i=1}^M log(λ_i^M + σ^2) ≈ Σ_{i=1}^M log(λ̃_i^M + σ^2), (46)
where λ̃_i^M = (M/N) λ_i^N for i = 1, ..., M, and λ_i^N are the eigenvalues of K_N. 37 / 70
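
A small sketch of this construction (illustrative, in 1D, with the kernel matrix formed densely so the limiting behaviour is easy to verify): complete the grid with imaginary observations carrying noise 1/ε, and solve (K_N + D_N)^{-1} y by conjugate gradients preconditioned with D_N^{-1}, which is equivalent to the C = D_N^{-1/2} transformation above.

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def se(a, b, l=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / l**2)

x = np.linspace(0, 10, 200)                    # complete grid of N points
observed = np.random.rand(200) > 0.4           # roughly 60% of the grid is observed
sigma2, eps = 0.1, 1e-8

K_N = se(x, x)
d = np.where(observed, sigma2, 1.0 / eps)      # diagonal of D_N
y = np.where(observed, np.sin(x) + 0.1 * np.random.randn(200), 0.0)

A = LinearOperator((200, 200), matvec=lambda v: K_N @ v + d * v)   # K_N + D_N
precond = LinearOperator((200, 200), matvec=lambda v: v / d)       # D_N^{-1}
alpha, info = cg(A, y, M=precond)

# The imaginary observations have (essentially) no effect: on the observed
# entries, alpha matches the solve that uses only the M real observations.
K_M = K_N[np.ix_(observed, observed)]
alpha_M = np.linalg.solve(K_M + sigma2 * np.eye(observed.sum()), y[observed])
print(np.abs(alpha[observed] - alpha_M).max())  # small (limited by eps and the CG tolerance)
```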

38 Spectral Mixture Product Kernel The spectral mixture kernel, in its standard form, does not quite have Kronecker structure. We therefore introduce a spectral mixture product kernel, which takes a product across input dimensions of one-dimensional spectral mixture kernels:
k_SMP(τ | θ) = Π_{p=1}^P k_SM(τ_p | θ_p). (47)
38 / 70
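
A sketch of Eq. (47), reusing the one-dimensional sm_kernel sketched earlier; each input dimension p gets its own hyperparameters (w_p, µ_p, v_p).

```python
import numpy as np

def smp_kernel(tau, params):
    # tau: (..., P) array of componentwise distances x - x';
    # params: list of P (w, mu, v) hyperparameter triples, one per dimension.
    k = 1.0
    for p, (w, mu, v) in enumerate(params):
        k = k * sm_kernel(tau[..., p], w, mu, v)    # product over input dimensions
    return k

# Example: a 2D SMP kernel with Q = 2 components per dimension.
params = [(np.ones(2), np.array([0.0, 0.2]), np.array([0.01, 0.005]))] * 2
g = np.linspace(-5, 5, 50)
tau = np.stack(np.meshgrid(g, g, indexing='ij'), axis=-1)   # grid of pairwise distances
K = smp_kernel(tau, params)
```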

39 GPatt
Observations y(x) ~ N(y(x); f(x), σ^2) (can easily be relaxed).
f(x) ~ GP(0, k_SMP(x, x' | θ)) (f(x) is a GP with SMP kernel).
k_SMP(x, x' | θ) can approximate many different kernels with different settings of its hyperparameters θ.
Learning involves training these hyperparameters through maximum marginal likelihood optimization (using BFGS):
log p(y | θ, X) = −(1/2) y^T (K_θ + σ^2 I)^{-1} y [model fit] − (1/2) log |K_θ + σ^2 I| [complexity penalty] − (N/2) log(2π). (48)
Once the hyperparameters are trained as ˆθ, we make predictions using p(f_* | y, X, ˆθ), which can be expressed in closed form.
Exploit Kronecker structure for fast exact inference and learning (and extend Kronecker methods to allow for non-grid data). Exact inference and learning requires O(PN^((P+1)/P)) operations and O(PN^(2/P)) storage, compared to O(N^3) operations and O(N^2) storage, for N datapoints and P input dimensions. 39 / 70

40 Results (a) Train (b) Test (c) Full (d) GPatt (e) SSGP (f) FITC (g) GP-SE (h) GP-MA (i) GP-RQ 40 / 70

41 Results: Extrapolation and Interpolation with Shadows (a) Train (b) GPatt (c) GP-MA (d) Train (e) GPatt (f) GP-MA 41 / 70

42 Automatic Model Selection via Marginal Likelihood — [Figure: spectral mixture components under a simple initialisation.] The marginal likelihood shrinks the weights of extraneous components to zero through the log |K_θ + σ^2 I| complexity penalty. 42 / 70

43 Results (a) Train (b) Test (c) Full (d) GPatt (e) SSGP (f) FITC (g) GP-SE (h) GP-MA (i) GP-RQ (j) GPatt Initialisation (k) Train (l) GPatt (m) GP-MA (n) Train (o) GPatt (p) GP-MA 43 / 70

44 More Patterns (a) Rubber mat (b) Tread plate (c) Pores (d) Wood (e) Chain mail (f) Cone 44 / 70

45 More Patterns Table: SMSE and MSLL for GPatt, SSGP, and GPs with SE, MA, and RQ kernels on the texture patterns: rubber mat (train = 12675, test = 4225), tread plate (train = 12675, test = 4225), pores (train = 12675, test = 4225), wood (train = 14259, test = 4941), and chain mail (train = 14101, test = 4779). 45 / 70

46 Speed and Accuracy Stress Tests (a) Runtime Stress Test (b) Accuracy Stress Test 46 / 70

47 Image Inpainting 47 / 70

48 Recovering Sophisticated Out of Class Kernels — [Figure: true and recovered kernels k1, k2, k3 as functions of τ.] 48 / 70

49 Video Extrapolation GPatt makes almost no assumptions about the correlation structures across input dimensions: it can automatically discover both temporal and spatial correlations! Top row: True frames taken from the middle of a movie. Bottom row: Predicted sequence of frames (all are forecast together). 112,500 datapoints. GPatt training time is under 5 minutes. 49 / 70

50 Land Surface Temperature Forecasting Train using 9 years of temperature data. The first two rows are the last 12 months of training data; the last two rows are a 12 month ahead forecast. 300,000 data points, with 40% missing data (from ocean). Predictions using GP-SE (a GP with an SE or RBF kernel) and Kronecker inference. 50 / 70

51 Land Surface Temperature Forecasting Train using 9 years of temperature data. The first two rows are the last 12 months of training data; the last two rows are a 12 month ahead forecast. 300,000 data points, with 40% missing data (from ocean). Predictions using GPatt. Training time < 30 minutes. 51 / 70

52 Learned Kernels for Land Surface Temperatures — [Figure: (a) learned GPatt kernel and (b) learned GP-SE kernel for temperatures, shown as correlations over X [km], Y [km], and time [months].] The learned GPatt kernel tells us interesting properties of the data. In this case, the learned kernels are heavy tailed and quasi-periodic. 52 / 70

53 Summary of Findings We have separately understood the effects of many different kernels, inference and learning algorithms, and parametric vs non-parametric representations. We found truly nonparametric representations, scalable inference exploiting structure, and highly expressive kernels, when used in combination, distinctly enable pattern extrapolation on large multidimensional problems. Applications such as image inpainting were previously inaccessible to Gaussian processes and kernel machines. Moreover, conventional inpainting algorithms are often model free and highly specialised. On the other hand, the proposed kernel approaches are quite general, and can learn, for example, spatial and temporal correlations. Examples of this generality are demonstrated on video extrapolation and land surface temperature forecasting problems. We can recover sophisticated out of class kernels. We can interpret the learned kernels to understand the fundamental properties of our data, and make new scientific discoveries. 53 / 70

54 Distribution over Kernels A Gaussian process can be viewed as a distribution over functions. We can sample prior functions, and posterior functions. Why not create an analogous process over kernels, where we have a prior over kernels (with large support), concentrated on something reasonable, such as stationarity? This process would contain many types of kernels, and represent uncertainty over the kernels. 54 / 70

55 Distribution over Spectral Mixture Kernels Induce a process prior over stationary kernels by placing a (vague) prior distribution over the parameters θ = {a_q, σ_q, µ_q}_{q=1}^Q of the spectral mixture kernel. We can then sample from the prior over kernels, condition on these kernels, and sample from the corresponding Gaussian process. [Figure: (a) sample kernels (covariance vs. τ); (b) sample functions y(x) vs. input x.] 55 / 70

56 Non-Stationary Extensions We can say almost nothing about the kernel of a stochastic process, from a single realisation, if we make no assumptions. Stationarity provides a powerful inductive bias, but sometimes we still want to allow for non-stationarity. It would be promising, therefore, to concentrate our process over kernels around stationary kernels, but still have support for non-stationarity. Idea: Have the weights in the spectral mixture become random functions of the input. Then at different points in the input space, the kernel is described by a different spectral mixture. If we concentrate the support of these functions around constant functions, then this process is concentrated around stationarity, but still has support for a massive range of non-stationary kernels! 56 / 70

57 Non-Stationary Extensions — [Figure: (a) sample kernels (covariance vs. τ); (b) sample functions y(x); (c) samples w(x) from the amplitude modulator; (d) samples from the non-stationary process.] 57 / 70

58 Non-Stationary Extensions (a) Black (b) Blue (c) Red (d) Magenta (e) Cov SE 58 / 70

59 The Spectral Mixture Kernel
k_GSM(τ) = Σ_{i=1}^Q a_i exp{−2π^2 τ^2 σ_i^2} cos(2πτµ_i) (49)
Can approximate a wide range of kernels, even with a small number of components Q.
Described in:
Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. Wilson, PhD Thesis, January 2014.
Fast kernel learning for multidimensional pattern extrapolation. Wilson, Gilboa, Nehorai, Cunningham, NIPS 2014.
A process over all stationary kernels. Wilson, 2012.
Gaussian process kernels for pattern discovery and extrapolation. Wilson and Adams, ICML 2013.
59 / 70

60 Themes and Conclusions (High Level) Machine learning aims to develop algorithms which can automatically discover rich statistical representations of data and make impressive generalisations. Flexibility and scalability are two sides of one coin: larger datasets provide relatively more information for learning rich statistical representations. We therefore wish to simultaneously increase the flexibility and scalability of machine learning methods. Bayesian nonparametric methods are natural for big datasets, since they automatically scale their information capacity with the amount of available data. 60 / 70

61 Themes and Conclusions (Kernels) The generalisation properties of a kernel method are entirely controlled by a kernel function, which represents an inner product of arbitrarily many basis functions. In other words, the support and inductive biases of a kernel method (which enable learning) are determined by the kernel. The main advantage of a Gaussian process, over other kernel machines, is access to a marginal likelihood, which provides a powerful probabilistic framework for kernel learning. This advantage is currently underappreciated, because we typically use very simple kernels with only a few hyperparameters. In order to enable powerful representation learning, we can introduce highly expressive kernels, and learn the properties of these kernels through marginal likelihood optimisation. The best way to scale up expressive kernel learning approaches is by exploiting the existing structure in the kernel (e.g., Kronecker methods). Nonparametric kernel methods correspond to infinite basis function expansions; these methods are well suited to performing ambitious generalisations from large datasets. 61 / 70

62 Outline (Specific), Part 1 Introduce expressive spectral mixture kernels by modelling a spectral density (the Fourier transform of a kernel) with scale-location Gaussian mixtures. Adapt these kernels for Kronecker structure. Exploit this Kronecker structure for exact inference and learning with Gaussian processes, which costs O(PN^((P+1)/P)) computations and O(PN^(2/P)) storage, for N datapoints and P input dimensions, compared to the standard O(N^3) computations and O(N^2) storage associated with GPs. Extend Kronecker methods to account for non-grid data, whilst preserving exact inference. 62 / 70

63 Outline (Specific), Part 2 Show that i) truly nonparametric representations, ii) expressive kernels, and iii) structure exploiting inference, when used in combination, distinctly enable large scale pattern extrapolation and representation learning, including: long range spatiotemporal forecasting, image inpainting, video extrapolation, and kernel discovery. This is the first time, as far as we are aware, that highly expressive non-parametric kernels with in some cases hundreds of hyperparameters, on datasets exceeding N = 10^5 training instances, can be learned from the marginal likelihood of a GP, in only minutes. Such experiments show that one can, to some extent, solve kernel selection, and automatically extract useful features from the data, on large datasets, using a special combination of expressive kernels and scalable inference. 63 / 70

64 Worked Example: Combining Kernels, CO2 Data — [Figure: CO2 concentration (ppm) vs. year.] Example from Rasmussen and Williams (2006), Gaussian Processes for Machine Learning. 64 / 70

65 Worked Example: Combining Kernels, CO 2 Data 65 / 70

66 Worked Example: Combining Kernels, CO2 Data
Long rising trend: k_1(x_p, x_q) = θ_1^2 exp(−(x_p − x_q)^2 / (2θ_2^2))
Quasi-periodic seasonal changes: k_2(x_p, x_q) = k_RBF(x_p, x_q) k_PER(x_p, x_q) = θ_3^2 exp(−(x_p − x_q)^2 / (2θ_4^2) − 2 sin^2(π(x_p − x_q)) / θ_5^2)
Multi-scale medium term irregularities: k_3(x_p, x_q) = θ_6^2 (1 + (x_p − x_q)^2 / (2θ_8 θ_7^2))^{−θ_8}
Correlated and i.i.d. noise: k_4(x_p, x_q) = θ_9^2 exp(−(x_p − x_q)^2 / (2θ_10^2)) + θ_11^2 δ_pq
k_total(x_p, x_q) = k_1(x_p, x_q) + k_2(x_p, x_q) + k_3(x_p, x_q) + k_4(x_p, x_q)
66 / 70
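
A sketch of this composite kernel as code; th is the hyperparameter vector θ_1, ..., θ_11, and the numerical values below are placeholders, not the fitted values from Rasmussen and Williams.

```python
import numpy as np

def k_total(xp, xq, th):
    r = xp - xq
    k1 = th[1]**2 * np.exp(-r**2 / (2 * th[2]**2))                     # long rising trend
    k2 = th[3]**2 * np.exp(-r**2 / (2 * th[4]**2)
                           - 2 * np.sin(np.pi * r)**2 / th[5]**2)      # quasi-periodic seasonal changes
    k3 = th[6]**2 * (1 + r**2 / (2 * th[8] * th[7]**2)) ** (-th[8])    # medium-term irregularities
    k4 = (th[9]**2 * np.exp(-r**2 / (2 * th[10]**2))
          + th[11]**2 * (xp == xq))                                    # correlated and i.i.d. noise
    return k1 + k2 + k3 + k4

th = np.array([np.nan, 60., 90., 2., 100., 1.3, 0.7, 1.2, 0.8, 0.2, 1.6, 0.2])  # th[0] unused
print(k_total(1995.5, 1995.5, th), k_total(1995.5, 2000.0, th))
```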

67 Worked Example: Combining Kernels, CO2 Data We hand-crafted a kernel combination to perform extrapolation. Confidence in the extrapolation is high (suggesting the model is well specified). We can interpret the learned kernel hyperparameters θ to learn information about our dataset. A lot of the interesting pattern recognition has been done by a human in this example. We would like to completely automate this modelling procedure. 67 / 70

68 Function Learning Example — [Figure: airline passengers (thousands) vs. year, showing training data and a candidate extrapolation labelled "Human?".] 68 / 70

69 Function Learning Example — [Figure: airline passengers (thousands) vs. year, showing training data and two candidate extrapolations labelled "Human?" and "Alien?".] 69 / 70

70 Function Learning Example — [Figure: airline passengers (thousands) vs. year, showing training data, test data, and the candidate extrapolations labelled "Alien?" and "Human?".] 70 / 70
