Inter-domain Gaussian Processes for Sparse Inference using Inducing Features

Miguel Lázaro-Gredilla and Aníbal R. Figueiras-Vidal
Dep. Signal Processing & Communications
Universidad Carlos III de Madrid, SPAIN
{miguel,arfv}@tsc.uc3m.es

Abstract

We present a general inference framework for inter-domain Gaussian Processes (GPs) and focus on its usefulness for building sparse GP models. The state-of-the-art sparse GP model introduced by Snelson and Ghahramani in [1] relies on finding a small, representative pseudo data set of m elements (from the same domain as the n available data elements) which is able to explain existing data well, and then uses it to perform inference. This reduces inference and model selection computation time from O(n^3) to O(m^2 n), where m ≪ n. Inter-domain GPs can be used to find a (possibly more compact) representative set of features lying in a different domain, at the same computational cost. Being able to specify a different domain for the representative features allows us to incorporate prior knowledge about relevant characteristics of data and detaches the functional form of the covariance and basis functions. We will show how previously existing models fit into this framework and will use it to develop two new sparse GP models. Tests on large, representative regression data sets suggest that significant improvement can be achieved, while retaining computational efficiency.

1 Introduction and previous work

Over the past decade there has been a growing interest in the application of Gaussian Processes (GPs) to machine learning tasks. GPs are probabilistic non-parametric Bayesian models that combine a number of attractive characteristics: they achieve state-of-the-art performance on supervised learning tasks, provide probabilistic predictions, have a simple and well-founded model selection scheme, present no overfitting (since parameters are integrated out), etc.

Unfortunately, the direct application of GPs to regression problems (with which we will be concerned here) is limited due to their training time being O(n^3). To overcome this limitation, several sparse approximations have been proposed [2, 3, 4, 5, 6]. In most of them, sparsity is achieved by projecting all available data onto a smaller subset of size m ≪ n (the active set), which is selected according to some specific criterion. This reduces computation time to O(m^2 n). However, active set selection interferes with hyperparameter learning, due to its non-smooth nature (see [1, 3]).

These proposals have been superseded by the Sparse Pseudo-inputs GP (SPGP) model, introduced in [1]. In this model, the constraint that the samples of the active set (which are called pseudo-inputs) must be selected among training data is relaxed, allowing them to lie anywhere in the input space. This allows both pseudo-inputs and hyperparameters to be selected in a joint continuous optimisation and increases flexibility, resulting in much superior performance.

In this work we introduce Inter-Domain GPs (IDGPs) as a general tool to perform inference across domains. This allows us to remove the constraint that the pseudo-inputs must remain within the same domain as the input data. This added flexibility results in increased performance and allows us to encode prior knowledge about other domains where data can be represented more compactly.

2 Review of GPs for regression

We will briefly state here the main definitions and results for regression with GPs. See [7] for a comprehensive review.

Assume we are given a training set with n samples, D ≡ {x_j, y_j}_{j=1}^n, where each D-dimensional input x_j is associated to a scalar output y_j. The regression task goal is, given a new input x_*, to predict the corresponding output y_* based on D.

The GP regression model assumes that the outputs can be expressed as some noiseless latent function plus independent noise, y = f(x) + ε, and then sets a zero-mean GP prior¹ on f(x), with covariance k(x, x'), and a zero-mean Gaussian prior on ε, with variance σ^2 (the noise power hyperparameter). The covariance function encodes prior knowledge about the smoothness of f(x). The most common choice for it is the Automatic Relevance Determination Squared Exponential (ARD SE):

k(x, x') = σ_0^2 \exp\left( -\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{l_d^2} \right),   (1)

with hyperparameters σ_0^2 (the latent function power) and {l_d}_{d=1}^D (the length-scales, defining how rapidly the covariance decays along each dimension). It is referred to as ARD SE because, when coupled with a model selection method, non-informative input dimensions can be removed automatically by growing the corresponding length-scale. The set of hyperparameters that define the GP is θ = {σ^2, σ_0^2, {l_d}_{d=1}^D}. We will omit the dependence on θ for the sake of clarity.

If we evaluate the latent function at X = {x_j}_{j=1}^n, we obtain a set of latent variables following a joint Gaussian distribution p(f|X) = \mathcal{N}(f | 0, K_{ff}), where [K_{ff}]_{ij} = k(x_i, x_j). Using this model it is possible to express the joint distribution of training and test cases and then condition on the observed outputs to obtain the predictive distribution for any test case:

p_{GP}(y_*|x_*, D) = \mathcal{N}(y_* \mid k_{f*}^\top (K_{ff} + σ^2 I_n)^{-1} y,\; σ^2 + k_{**} - k_{f*}^\top (K_{ff} + σ^2 I_n)^{-1} k_{f*}),   (2)

where y = [y_1, ..., y_n]^\top, k_{f*} = [k(x_1, x_*), ..., k(x_n, x_*)]^\top, and k_{**} = k(x_*, x_*). I_n is used to denote the identity matrix of size n. The O(n^3) cost of these equations arises from the inversion of the n × n covariance matrix. Predictive distributions for additional test cases take O(n^2) time each. These costs make standard GPs impractical for large data sets.

To select hyperparameters θ, Type-II Maximum Likelihood (ML-II) is commonly used. This amounts to selecting the hyperparameters that correspond to a (possibly local) maximum of the log-marginal likelihood, also called log-evidence.

¹ We follow the common approach of subtracting the sample mean from the outputs and then assume a zero-mean model.
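As a reference point for the sparse approximations developed below, the following is a minimal NumPy sketch of the exact ARD SE GP predictor of equations (1)-(2). It is our own illustration, not the authors' code; the function names and the tiny synthetic data set are assumptions made purely for the example.

```python
import numpy as np

def ard_se_kernel(X1, X2, sigma0_sq, lengthscales):
    """ARD squared-exponential covariance, eq. (1): one length-scale per input dimension."""
    diff = X1[:, None, :] - X2[None, :, :]                      # shape (n1, n2, D)
    return sigma0_sq * np.exp(-0.5 * np.sum((diff / lengthscales) ** 2, axis=-1))

def gp_predict(X, y, Xstar, sigma0_sq, lengthscales, noise_var):
    """Exact GP predictive mean and variance, eq. (2). O(n^3) because of the n x n solve."""
    Kff = ard_se_kernel(X, X, sigma0_sq, lengthscales)
    Kfs = ard_se_kernel(X, Xstar, sigma0_sq, lengthscales)      # k_{f*} for each test point
    kss = sigma0_sq * np.ones(len(Xstar))                       # k(x_*, x_*) for the ARD SE
    A = Kff + noise_var * np.eye(len(X))
    alpha = np.linalg.solve(A, y)                               # (K_ff + sigma^2 I)^{-1} y
    mean = Kfs.T @ alpha
    var = noise_var + kss - np.sum(Kfs * np.linalg.solve(A, Kfs), axis=0)
    return mean, var

# Tiny usage example on synthetic 1-D data (illustrative values only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
mu, var = gp_predict(X, y, np.linspace(-3, 3, 5)[:, None], 1.0, np.array([1.0]), 0.01)
```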

3 Inter-domain GPs

In this section we will introduce Inter-Domain GPs (IDGPs) and show how they can be used as a framework for computationally efficient inference. Then we will use this framework to express two previous relevant models and develop two new ones.

3.1 Definition

Consider a real-valued GP f(x) with x ∈ R^D and some deterministic real function g(x, z), with z ∈ R^H. We define the following transformation:

u(z) = \int_{R^D} f(x) g(x, z) dx.   (3)

There are many examples of transformations that take this form, the Fourier transform being one of the best known. We will discuss possible choices for g(x, z) in Section 3.3; for the moment we will deal with the general form. Since u(z) is obtained by a linear transformation of the GP f(x), it is also a GP. This new GP may lie in a different domain of possibly different dimension. This transformation is not invertible in general, its properties being defined by g(x, z).

IDGPs arise when we jointly consider f(x) and u(z) as a single, extended GP. The mean and covariance function of this extended GP are overloaded to accept arguments from both the input and transformed domains and treat them accordingly. We refer to each version of an overloaded function as an instance, which will accept a different type of arguments. If the distribution of the original GP is f(x) ~ GP(m(x), k(x, x')), then it is possible to compute the remaining instances that define the distribution of the extended GP over both domains. The transformed-domain instance of the mean is

m(z) = E[u(z)] = \int_{R^D} E[f(x)] g(x, z) dx = \int_{R^D} m(x) g(x, z) dx.

The inter-domain and transformed-domain instances of the covariance function are:

k(x, z') = E[f(x) u(z')] = E\left[ f(x) \int_{R^D} f(x') g(x', z') dx' \right] = \int_{R^D} k(x, x') g(x', z') dx'   (4)

k(z, z') = E[u(z) u(z')] = E\left[ \int_{R^D} f(x) g(x, z) dx \int_{R^D} f(x') g(x', z') dx' \right] = \int_{R^D} \int_{R^D} k(x, x') g(x, z) g(x', z') dx dx'.   (5)

Mean m(·) and covariance function k(·, ·) are therefore defined both by the values and the domains of their arguments. This can be seen as if each argument had an additional domain indicator used to select the instance. Apart from that, they define a regular GP, and all standard properties hold. In particular, k(a, b) = k(b, a). This approach is related to [8], but here the latent space is defined as a transformation of the input space, and not the other way around. This allows us to pre-specify the desired input-domain covariance. The transformation is also more general: any g(x, z) can be used.

We can sample an IDGP at n input-domain points f = [f_1, f_2, ..., f_n]^\top (with f_j = f(x_j)) and m transformed-domain points u = [u_1, u_2, ..., u_m]^\top (with u_i = u(z_i)). With the usual assumption of f(x) being a zero-mean GP and defining Z = {z_i}_{i=1}^m, the joint distribution of these samples is

p\left( \begin{bmatrix} f \\ u \end{bmatrix} \middle| X, Z \right) = \mathcal{N}\left( \begin{bmatrix} f \\ u \end{bmatrix} \middle| 0, \begin{bmatrix} K_{ff} & K_{fu} \\ K_{fu}^\top & K_{uu} \end{bmatrix} \right),   (6)

with [K_{ff}]_{pq} = k(x_p, x_q), [K_{fu}]_{pq} = k(x_p, z_q), [K_{uu}]_{pq} = k(z_p, z_q), which allows us to perform inference across domains. We will only be concerned with one input domain and one transformed domain, but IDGPs can be defined for any number of domains.

3.2 Sparse regression using inducing features

In the standard regression setting, we are asked to perform inference about the latent function f(x) from a data set D lying in the input domain. Using IDGPs, we can use data from any domain to perform inference in the input domain. Some latent functions might be better defined by a set of data lying in some transformed space rather than in the input space. This idea is used for sparse inference.

Following [1] we introduce a pseudo data set, but here we place it in the transformed domain: {Z, u}. The following derivation is analogous to that of [1]. We will refer to Z as the inducing features and u as the inducing variables. The key approximation leading to sparsity is to set m ≪ n and assume that f(x) is well-described by the pseudo data set, so that any two samples (either from the training or test set) f_p and f_q with p ≠ q will be independent given x_p, x_q and the pseudo data set. With this simplifying assumption², the prior over f can be factorised as a product of marginals:

p(f|X, Z, u) ≈ \prod_{j=1}^{n} p(f_j|x_j, Z, u).   (7)
² Alternatively, (7) can be obtained by proposing a generic factorised form for the approximate conditional p(f|X, Z, u) ≈ q(f|X, Z, u) = \prod_{j=1}^{n} q_j(f_j|x_j, Z, u) and then choosing the set of functions {q_j(·)}_{j=1}^n so as to minimise the Kullback-Leibler (KL) divergence from the exact joint prior, KL(p(f|X, Z, u) p(u|Z) || q(f|X, Z, u) p(u|Z)), as noted in [9], Section 2.3.6.

Marginals are in turn obtained from (6): p(f_j|x_j, Z, u) = \mathcal{N}(f_j \mid k_j^\top K_{uu}^{-1} u, λ_j), where k_j^\top is the j-th row of K_{fu} and λ_j is the j-th element of the diagonal of the matrix Λ_f = diag(K_{ff} - K_{fu} K_{uu}^{-1} K_{uf}). The operator diag(·) sets all off-diagonal elements to zero, so that Λ_f is a diagonal matrix.

Since p(u|Z) is readily available and also Gaussian, the inducing variables can be integrated out from (7), yielding a new, approximate prior over f(x):

p(f|X, Z) = \int p(f, u|X, Z) du ≈ \int \prod_{j=1}^{n} p(f_j|x_j, Z, u) \, p(u|Z) du = \mathcal{N}(f | 0, K_{fu} K_{uu}^{-1} K_{uf} + Λ_f).

Using this approximate prior, the posterior distribution for a test case is:

p_{IDGP}(y_*|x_*, D, Z) = \mathcal{N}(y_* \mid k_{*u}^\top Q^{-1} K_{fu}^\top Λ_y^{-1} y,\; σ^2 + k_{**} + k_{*u}^\top (Q^{-1} - K_{uu}^{-1}) k_{*u}),   (8)

where we have defined Q = K_{uu} + K_{fu}^\top Λ_y^{-1} K_{fu} and Λ_y = Λ_f + σ^2 I_n. The distribution (2) is approximated by (8) with the information available in the pseudo data set. After O(m^2 n) time precomputations, predictive means and variances can be computed in O(m) and O(m^2) time per test case, respectively. This model is, in general, non-stationary, even when it is approximating a stationary input-domain covariance, and can be interpreted as a degenerate GP plus heteroscedastic white noise.

The log-marginal likelihood (or log-evidence) of the model, explicitly including the conditioning on kernel hyperparameters θ, can be expressed as

\log p(y|X, Z, θ) = -\frac{1}{2}\left[ y^\top Λ_y^{-1} y - y^\top Λ_y^{-1} K_{fu} Q^{-1} K_{fu}^\top Λ_y^{-1} y + \log\left( |Q| |Λ_y| / |K_{uu}| \right) + n \log(2π) \right],

which is also computable in O(m^2 n) time.

Model selection will be performed by jointly optimising the evidence with respect to the hyperparameters and the inducing features. If analytical derivatives of the covariance function are available, conjugate gradient optimisation can be used with O(m^2 n) cost per step.

3.3 On the choice of g(x, z)

The feature extraction function g(x, z) defines the transformed domain in which the pseudo data set lies. According to (3), the inducing variables can be seen as projections of the target function f(x) onto the feature extraction function over the whole input space. Therefore, each of them summarises information about the behaviour of f(x) everywhere. The inducing features Z define the concrete set of functions over which the target function will be projected. It is desirable that this set captures the most significant characteristics of the function. This can be achieved either by using prior knowledge about the data to select {g(x, z_i)}_{i=1}^m or by using a very general family of functions and letting model selection automatically choose the appropriate set.

Another way to choose g(x, z) relies on the form of the posterior. The posterior mean of a GP is often thought of as a linear combination of basis functions. For full GPs and other approximations such as [1, 2, 3, 4, 5, 6], basis functions must have the form of the input-domain covariance function. When using IDGPs, basis functions have the form of the inter-domain instance of the covariance function, and can therefore be adjusted by choosing g(x, z), independently of the input-domain covariance function. If two feature extraction functions g(·, ·) and h(·, ·) can be related by g(x, z) = h(x, z) r(z) for any function r(·), then both yield the same sparse GP model. This property can be used to simplify the expressions of the instances of the covariance function.

In this work we use the same functional form for every feature, i.e. our function set is {g(x, z_i)}_{i=1}^m, but it is also possible to use sets with different functional forms for each inducing feature, i.e. {g_i(x, z_i)}_{i=1}^m, where each z_i may even have a different size (dimension).
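Before turning to particular choices of g(x, z), the following sketch illustrates how the sparse predictive distribution (8) is evaluated once the covariance instances have been computed. It is a minimal NumPy translation of the equations above, written by us for illustration (the function name, the argument layout and the jitter term are our own choices, not the authors' implementation).

```python
import numpy as np

def idgp_sparse_predict(Kff_diag, Kfu, Kuu, Ksu, kss, y, noise_var, jitter=1e-8):
    """
    Sparse predictive mean/variance of eq. (8), given the covariance instances
    between training inputs (f), inducing features (u) and test inputs (*).
    Kff_diag: (n,) diagonal of K_ff;  Kfu: (n, m);  Kuu: (m, m);
    Ksu: (n_*, m) cross-covariances k(x_*, z);  kss: (n_*,) values k(x_*, x_*).
    Cost is O(m^2 n) once the covariances are available.
    """
    n, m = Kfu.shape
    Kuu = Kuu + jitter * np.eye(m)                           # numerical stabilisation (ours)
    Kuu_inv_Kuf = np.linalg.solve(Kuu, Kfu.T)                # K_uu^{-1} K_uf, shape (m, n)
    lam_f = Kff_diag - np.sum(Kfu.T * Kuu_inv_Kuf, axis=0)   # diag(K_ff - K_fu K_uu^{-1} K_uf)
    lam_y = lam_f + noise_var                                # Λ_y = Λ_f + σ² I, stored as a vector
    Q = Kuu + Kfu.T @ (Kfu / lam_y[:, None])                 # Q = K_uu + K_fu^T Λ_y^{-1} K_fu
    beta = np.linalg.solve(Q, Kfu.T @ (y / lam_y))           # Q^{-1} K_fu^T Λ_y^{-1} y
    mean = Ksu @ beta
    var = (noise_var + kss
           + np.sum(Ksu * np.linalg.solve(Q, Ksu.T).T, axis=1)     # + k_*u^T Q^{-1} k_*u
           - np.sum(Ksu * np.linalg.solve(Kuu, Ksu.T).T, axis=1))  # - k_*u^T K_uu^{-1} k_*u
    return mean, var
```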
In the sections below we will discuss different possible choices for g(x, z).

3.3.1 Relation with Sparse GPs using pseudo-inputs

The sparse GP using pseudo-inputs (SPGP) was introduced in [1] and was later renamed to the Fully Independent Training Conditional (FITC) model to fit into the systematic framework of [10]. Since
the sparse model introduced in Section 3.2 also uses a fully independent training conditional, we will stick to the first name to avoid possible confusion. The innovation of IDGPs with respect to SPGP consists in letting the pseudo data set lie in a different domain. If we set g_{SPGP}(x, z) ≡ δ(x - z), where δ(·) is a Dirac delta, we force the pseudo data set to lie in the input domain. Thus there is no longer a transformed space and the original SPGP model is retrieved. In this setting, the inducing features of IDGPs play the role of SPGP's pseudo-inputs.

3.3.2 Relation with Sparse Multiscale GPs

Sparse Multiscale GPs (SMGPs) are presented in [11]. Seeking to generalise the SPGP model with the ARD SE covariance function, they propose to use a different set of length-scales for each basis function. The resulting model presents a defective variance that is healed by adding heteroscedastic white noise. SMGPs, including the variance improvement, can be derived in a principled way as IDGPs:

g_{SMGP}(x, z') \propto \prod_{d=1}^{D} \frac{1}{\sqrt{2π(c'^2_d - l_d^2)}} \exp\left( -\frac{(x_d - μ'_d)^2}{2(c'^2_d - l_d^2)} \right)   with   z' = \begin{bmatrix} μ' \\ c' \end{bmatrix}   (9)

k_{SMGP}(x, z') = \exp\left( -\sum_{d=1}^{D} \frac{(x_d - μ'_d)^2}{2 c'^2_d} \right) \prod_{d=1}^{D} \sqrt{\frac{l_d^2}{c'^2_d}}   (10)

k_{SMGP}(z, z') = \exp\left( -\sum_{d=1}^{D} \frac{(μ_d - μ'_d)^2}{2(c_d^2 + c'^2_d - l_d^2)} \right) \prod_{d=1}^{D} \sqrt{\frac{l_d^2}{c_d^2 + c'^2_d - l_d^2}}.   (11)

With this approximation, each basis function has its own centre μ' = [μ'_1, μ'_2, ..., μ'_D]^\top and its own length-scales c' = [c'_1, c'_2, ..., c'_D]^\top, whereas the global length-scales {l_d}_{d=1}^D are shared by all inducing features. Equations (10) and (11) are derived from (4) and (5) using (1) and (9). The integrals defining k_{SMGP}(·, ·) converge if and only if c'^2_d ≥ l_d^2, ∀d, which suggests that other values, even if permitted in [11], should be avoided for the model to remain well defined.

3.3.3 Frequency Inducing Features GP

If the target function can be described more compactly in the frequency domain than in the input domain, it can be advantageous to let the pseudo data set lie in the former domain. We will pursue that possibility for the case where the input-domain covariance is the ARD SE. We will call the resulting sparse model Frequency Inducing Features GP (FIFGP).

Directly applying the Fourier transform is not possible because the target function is not square integrable (it has constant power σ_0^2 everywhere, so (5) does not converge). We will work around this by windowing the target function in the region of interest. It is possible to use a square window, but this results in the covariance being defined in terms of the complex error function, which is very slow to evaluate. Instead, we will use a Gaussian window³. Since multiplying by a Gaussian in the input domain is equivalent to convolving with a Gaussian in the frequency domain, we will be working with a blurred version of the frequency space. This model is defined by:

g_{FIF}(x, z') \propto \left[ \prod_{d=1}^{D} \frac{1}{\sqrt{2π c_d^2}} \exp\left( -\frac{x_d^2}{2 c_d^2} \right) \right] \cos\left( ω'_0 + \sum_{d=1}^{D} ω'_d x_d \right)   with   z' = ω'   (12)

k_{FIF}(x, z') = \exp\left( -\sum_{d=1}^{D} \frac{x_d^2 + c_d^2 l_d^2 ω'^2_d}{2(c_d^2 + l_d^2)} \right) \cos\left( ω'_0 + \sum_{d=1}^{D} \frac{c_d^2 ω'_d x_d}{c_d^2 + l_d^2} \right) \prod_{d=1}^{D} \sqrt{\frac{l_d^2}{c_d^2 + l_d^2}}   (13)

k_{FIF}(z, z') = \exp\left( -\sum_{d=1}^{D} \frac{c_d^2 l_d^2 (ω_d^2 + ω'^2_d)}{2(2c_d^2 + l_d^2)} \right) \left[ \exp\left( -\sum_{d=1}^{D} \frac{c_d^4 (ω_d - ω'_d)^2}{2(2c_d^2 + l_d^2)} \right) \cos(ω_0 - ω'_0) + \exp\left( -\sum_{d=1}^{D} \frac{c_d^4 (ω_d + ω'_d)^2}{2(2c_d^2 + l_d^2)} \right) \cos(ω_0 + ω'_0) \right] \prod_{d=1}^{D} \sqrt{\frac{l_d^2}{2c_d^2 + l_d^2}}.   (14)

³ A mixture of m Gaussians could also be used as window without increasing the complexity order.
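As a small numerical illustration (our own, assuming a 1-D input and arbitrary parameter values), the sketch below evaluates the inter-domain instance k(x, z) of (4) for the windowed-cosine feature (12) by simple quadrature; in practice the closed forms (13)-(14) would of course be used instead of numerical integration.

```python
import numpy as np

def g_fif(xprime, omega0, omega, c):
    """Windowed cosine feature of eq. (12) for a 1-D input (scalar or array x')."""
    window = np.exp(-xprime ** 2 / (2.0 * c ** 2)) / np.sqrt(2.0 * np.pi * c ** 2)
    return window * np.cos(omega0 + omega * xprime)

def k_fu_numeric(x, omega0, omega, c, sigma0_sq, l, half_width=30.0, n_grid=20001):
    """Inter-domain instance k(x, z) of eq. (4), evaluated by trapezoidal quadrature of
    the integral of k(x, x') g_FIF(x', z) over x', with the 1-D ARD SE covariance (1)."""
    xprime = np.linspace(-half_width, half_width, n_grid)
    k_xxprime = sigma0_sq * np.exp(-(x - xprime) ** 2 / (2.0 * l ** 2))
    return np.trapz(k_xxprime * g_fif(xprime, omega0, omega, c), xprime)

# Example: cross-covariance between the input point x = 0.5 and one frequency feature
# with hypothetical phase, frequency and window width (illustrative values only).
val = k_fu_numeric(x=0.5, omega0=0.0, omega=2.0, c=3.0, sigma0_sq=1.0, l=1.0)
```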

The inducing features are ω = [ω_0, ω_1, ..., ω_D]^\top, where ω_0 is the phase and the remaining components are frequencies along each dimension. In this model, both the global length-scales {l_d}_{d=1}^D and the window length-scales {c_d}_{d=1}^D are shared, thus c'_d = c_d. Instances (13) and (14) are induced by (12) using (4) and (5).

3.3.4 Time-Frequency Inducing Features GP

Instead of using a single window to select the region of interest, it is possible to use a different window for each feature. We will use windows of the same size but different centres. The resulting model combines SPGP and FIFGP, so we will call it Time-Frequency Inducing Features GP (TFIFGP). It is defined by g_{TFIF}(x, z') ≡ g_{FIF}(x - μ', ω'), with z' = [μ'^\top, ω'^\top]^\top. The implied inter-domain and transformed-domain instances of the covariance function are:

k_{TFIF}(x, z') = k_{FIF}(x - μ', ω'),   k_{TFIF}(z, z') = k_{FIF}(z, z') \exp\left( -\sum_{d=1}^{D} \frac{(μ_d - μ'_d)^2}{2(2c_d^2 + l_d^2)} \right).

FIFGP is trivially obtained by setting every centre to zero, {μ_i = 0}_{i=1}^m, whereas SPGP is obtained by setting the window length-scales c, frequencies and phases {ω_i}_{i=1}^m to zero. If the window length-scales were individually adjusted, SMGP would be obtained.

While TFIFGP has the modelling power of both SPGP and FIFGP, it might perform worse in practice due to having roughly twice as many hyperparameters, thus making the optimisation problem harder. The same problem also exists in SMGP. A possible workaround is to initialise the hyperparameters using a simpler model, as done in [11] for SMGP, though we will not do this here.

4 Experiments

In this section we will compare the proposed approximations FIFGP and TFIFGP with the current state of the art, SPGP, on some large data sets, for the same number of inducing features/inputs and therefore roughly equal computational cost. Additionally, we provide results using a full GP, which is expected to provide top performance (though requiring an impractically big amount of computation). In all cases, the (input-domain) covariance function is the ARD SE (1).

We use four large data sets: Kin-40k, Pumadyn-32nm⁴ (describing the dynamics of a robot arm, used with SPGP in [1]), Elevators and Pole Telecomm⁵ (related to the control of the elevators of an F16 aircraft and a telecommunications problem, and used in [12, 13, 14]). Input dimensions that remained constant throughout the training set were removed. Input data was additionally centred for use with FIFGP (the remaining methods are translation invariant). Pole Telecomm outputs actually take discrete values in the 0-100 range, in multiples of 10. This was taken into account by using the corresponding quantization noise variance (10^2/12) as a lower bound for the noise hyperparameter⁶.

Hyperparameters are initialised as follows: σ_0^2 = (1/n) \sum_{j=1}^n y_j^2, σ^2 = σ_0^2/4, and {l_d}_{d=1}^D to one half of the range spanned by the training data along each dimension. For SPGP, pseudo-inputs are initialised to a random subset of the training data; for FIFGP, the window size c is initialised to the standard deviation of the input data, frequencies are randomly chosen from a zero-mean Gaussian distribution with variance l_d^{-2}, and phases are obtained from a uniform distribution in [0, 2π). TFIFGP uses the same initialisation as FIFGP, with window centres set to zero. Final values are selected by evidence maximisation.

Denoting the output average over the training set as ȳ and the predictive mean and variance for test sample y_l as μ_l and σ_l^2 respectively, we define the following quality measures: Normalised Mean Square Error (NMSE), ⟨(y_l - μ_l)^2⟩ / ⟨(y_l - ȳ)^2⟩, and Mean Negative Log-Probability (MNLP), (1/2)⟨(y_l - μ_l)^2/σ_l^2 + \log σ_l^2 + \log 2π⟩, where ⟨·⟩ averages over the test set.
⁴ Kin-40k: 8 input dimensions, 10000/30000 samples for train/test. Pumadyn-32nm: 32 input dimensions, 7168/1024 samples for train/test, using exactly the same preprocessing and train/test splits as [1, 3]. Note that their error measure is actually one half of the Normalised Mean Square Error defined here.
⁵ Pole Telecomm: 26 non-constant input dimensions, 10000/5000 samples for train/test. Elevators: 17 non-constant input dimensions, 8752/7847 samples for train/test. Both have been downloaded from http://www.liaad.up.pt/~ltorgo/regression/datasets.html
⁶ If unconstrained, similar plots are obtained; in particular, no overfitting is observed.
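For completeness, the two quality measures can be computed as follows. This is a short helper of our own (names and argument conventions are assumptions), taking vectors of test targets, predictive means and predictive variances, plus the training-set output mean.

```python
import numpy as np

def nmse(y_test, mu, y_train_mean):
    """Normalised Mean Square Error: <(y_l - mu_l)^2> / <(y_l - ybar)^2>,
    where ybar is the training-set output mean and <.> averages over the test set."""
    return np.mean((y_test - mu) ** 2) / np.mean((y_test - y_train_mean) ** 2)

def mnlp(y_test, mu, var):
    """Mean Negative Log-Probability: 0.5 <(y_l - mu_l)^2 / var_l + log var_l + log 2*pi>."""
    return 0.5 * np.mean((y_test - mu) ** 2 / var + np.log(var) + np.log(2.0 * np.pi))
```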

For Kin-40k (Fig. 1, top), all three sparse methods perform similarly, though for high sparseness (the most useful case) FIFGP and TFIFGP are slightly superior. In Pumadyn-32nm (Fig. 1, bottom), only 4 out of the 32 input dimensions are relevant to the regression task, so it can be used as a test of ARD capabilities. We follow [1] and use a full GP on a small subset of the training data (1024 data points) to obtain the initial length-scales. This allows better minima to be found during optimisation. Though all methods are able to find a good solution, FIFGP and TFIFGP are better in the sparser regime. Roughly the same considerations can be made about Pole Telecomm and Elevators (Fig. 2), but in these data sets the superiority of FIFGP and TFIFGP is more dramatic. Though not shown here, we have additionally tested these models on smaller, overfitting-prone data sets, and have found no noticeable overfitting even using m > n, despite the relatively high number of parameters being adjusted. This is in line with the results and discussion of [1].

[Figure 1 here: four panels, (a) Kin-40k NMSE (log-log plot), (b) Kin-40k MNLP (semilog plot), (c) Pumadyn-32nm NMSE (log-log plot), (d) Pumadyn-32nm MNLP (semilog plot), each including a full-GP baseline.]
Figure 1: Performance of the compared methods on Kin-40k and Pumadyn-32nm.

5 Conclusions and extensions

In this work we have introduced IDGPs, which are able to combine representations of a GP in different domains, and have used them to extend SPGP to handle inducing features lying in a different domain. This provides a general framework for sparse models, which are defined by a feature extraction function. Using this framework, SMGPs can be reinterpreted as fully principled models using a transformed space of local features, without any need for post-hoc variance improvements. Furthermore, it is possible to develop new sparse models of practical use, such as the proposed FIFGP and TFIFGP, which are able to outperform the state-of-the-art SPGP on some large data sets, especially in high sparsity regimes.

[Figure 2 here: four panels, (a) Elevators NMSE (log-log plot), (b) Elevators MNLP (semilog plot), (c) Pole Telecomm NMSE (log-log plot), (d) Pole Telecomm MNLP (semilog plot), each including a full-GP baseline.]
Figure 2: Performance of the compared methods on Elevators and Pole Telecomm.

Choosing a transformed space for the inducing features enables the use of domains where the target function can be expressed more compactly, or where the evidence (which is a function of the features) is easier to optimise. This added flexibility translates into a detaching of the functional form of the input-domain covariance from the set of basis functions used to express the posterior mean.

IDGPs approximate full GPs optimally in the KL sense noted in Section 3.2, for a given set of inducing features. Using ML-II to select the inducing features means that models providing a good fit to the data are given preference over models that might approximate the full GP more closely. This, though rarely, might lead to harmful overfitting. To more faithfully approximate the full GP and avoid overfitting altogether, our proposal can be combined with the variational approach from [15], in which the inducing features would be regarded as variational parameters. This would result in more constrained models, which would be closer to the full GP but might show reduced performance.

We have explored the case of regression with Gaussian noise, which is analytically tractable, but it is straightforward to apply the same model to other tasks such as robust regression or classification, using approximate inference (see [16]). Also, IDGPs as a general tool can be used for other purposes, such as modelling noise in the frequency domain, aggregating data from different domains or even imposing constraints on the target function.

Acknowledgments

We would like to thank the anonymous referees for helpful comments and suggestions. This work has been partly supported by the Spanish government under grant TEC2008-02473/TEC, and by the Madrid Community under grant S-505/TIC/0223.

References

[1] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, pages 1259-1266. MIT Press, 2006.
[2] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 13, pages 619-625. MIT Press, 2001.
[3] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Proceedings of the 9th International Workshop on AI Stats, 2003.
[4] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719-2741, 2000.
[5] L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14(3):641-669, 2002.
[6] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682-688. MIT Press, 2001.
[7] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.
[8] M. Alvarez and N. D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems 21, pages 57-64, 2009.
[9] E. Snelson. Flexible and efficient Gaussian process models for machine learning. PhD thesis, University of Cambridge, 2007.
[10] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939-1959, 2005.
[11] C. Walder, K. I. Kim, and B. Schölkopf. Sparse multiscale Gaussian process regression. In 25th International Conference on Machine Learning. ACM Press, New York, 2008.
[12] G. Potgieter and A. P. Engelbrecht. Evolving model trees for mining data sets with continuous-valued classes. Expert Systems with Applications, 35:1513-1532, 2007.
[13] L. Torgo and J. Pinto da Costa. Clustered partial linear regression. In Proceedings of the 11th European Conference on Machine Learning, pages 426-436. Springer, 2000.
[14] G. Potgieter and A. P. Engelbrecht. Pairwise classification as an ensemble technique. In Proceedings of the 13th European Conference on Machine Learning, pages 97-110. Springer-Verlag, 2002.
[15] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Workshop on AI Stats, 2009.
[16] A. Naish-Guzman and S. Holden. The generalized FITC approximation. In Advances in Neural Information Processing Systems 20, pages 1057-1064. MIT Press, 2008.