Gaussian Process Regression Networks

Andrew Gordon Wilson
agw38@cam.ac.uk
mlg.eng.cam.ac.uk/andrew
University of Cambridge

Joint work with David A. Knowles and Zoubin Ghahramani

June 27, 2012, ICML, Edinburgh
Multiple responses with input dependent covariances

Two response variables:
- y_1(x): concentration of cadmium at a spatial location x
- y_2(x): concentration of zinc at a spatial location x

The values of these responses, at a given spatial location x, are correlated.
- We can account for these correlations (rather than assuming y_1(x) and y_2(x) are independent) to enhance predictions.
- We can further enhance predictions by accounting for how these correlations vary with geographical location x.

Accounting for input dependent correlations is a distinctive feature of the Gaussian process regression network.

[Figure: map of the spatially varying correlation between the two responses, plotted over longitude and latitude with a colour scale.]
Motivation for modelling dependent covariances

Promise
- Many problems in fact have input dependent uncertainties and correlations.
- Accounting for dependent covariances (uncertainties and correlations) can greatly improve statistical inferences.

Uncharted Territory
- For convenience, response variables are typically treated as independent, or as having fixed covariances (e.g. the multi-task literature).
- The few existing models of dependent covariances are typically not expressive (e.g. a Brownian motion covariance structure) or not scalable (e.g. limited to fewer than 5 response variables).

Goal
- We want to develop expressive and scalable models (more than 1000 response variables) for dependent uncertainties and correlations.
Outline
- Gaussian process review
- Gaussian process regression networks
- Applications
Gaussian processes

Definition. A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Nonparametric Regression Model

Prior: $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, meaning $(f(x_1), \ldots, f(x_N)) \sim \mathcal{N}(\mu, K)$, with $\mu_i = m(x_i)$ and $K_{ij} = \mathrm{cov}(f(x_i), f(x_j)) = k(x_i, x_j)$.

Posterior: $p(f(x) \mid \mathcal{D}) \propto p(\mathcal{D} \mid f(x))\, p(f(x))$, i.e. GP posterior proportional to likelihood times GP prior.

[Figure: (a) sample prior functions and (b) sample posterior functions drawn from a Gaussian process, plotted as output f(t) against input t.]
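To make the figure concrete, here is a minimal sketch (not from the talk) of drawing sample prior functions from a GP; the squared-exponential kernel and its hyperparameters are illustrative assumptions.

```python
# Draw sample prior functions from a zero-mean GP with an RBF kernel.
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x') = v * exp(-(x - x')^2 / (2 l^2))."""
    sq_dists = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

t = np.linspace(-10, 10, 200)
K = rbf_kernel(t, t) + 1e-8 * np.eye(len(t))   # jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(t)), K, size=3)  # 3 prior draws
```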
Gaussian processes

"How can Gaussian processes possibly replace neural networks? Did we throw the baby out with the bathwater?" (David MacKay, 1998)
Semiparametric Latent Factor Model

The semiparametric latent factor model (SLFM) (Teh, Seeger, and Jordan, 2005) is a popular multi-output (multi-task) GP model with fixed signal correlations between outputs (response variables):

$$\underbrace{y(x)}_{p \times 1} = \underbrace{W}_{p \times q}\, \underbrace{f(x)}_{q \times 1} + \sigma_y \underbrace{z(x)}_{\mathcal{N}(0,\, I_p)}$$

- x: input variable (e.g. geographical location)
- y(x): p × 1 vector of output variables (responses) evaluated at x
- W: p × q matrix of mixing weights
- f(x): q × 1 vector of Gaussian process functions
- σ_y: hyperparameter controlling noise variance
- z(x): iid Gaussian white noise with p × p identity covariance I_p

[Diagram: network with input x feeding node functions f_1(x), ..., f_q(x), combined through fixed weights W_11, ..., W_pq into outputs y_1(x), ..., y_p(x).]
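For concreteness, a hedged sketch (not from the talk) of drawing outputs from the SLFM prior; the kernel, hyperparameters, and dimensions are illustrative assumptions. Because W is fixed, the output correlations it induces do not vary with x.

```python
# Sample from the SLFM prior: y(x) = W f(x) + sigma_y z(x).
import numpy as np

p, q, N = 3, 2, 100                      # outputs, latent GPs, input locations
x = np.linspace(0, 1, N)
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.1**2) + 1e-8 * np.eye(N)

rng = np.random.default_rng(1)
f = rng.multivariate_normal(np.zeros(N), K, size=q)   # q x N latent GP draws
W = rng.standard_normal((p, q))                       # fixed mixing weights
sigma_y = 0.1
y = W @ f + sigma_y * rng.standard_normal((p, N))     # p x N correlated outputs
```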
Deriving the GPRN

[Diagram: network structure at x = x_1, where both outputs connect to the shared basis function f_1, and at x = x_2, where the connections to f_1 are absent.]

At x = x_1 the two outputs (responses) y_1 and y_2 are correlated, since they share the basis function f_1. At x = x_2 the outputs are independent. Capturing this change requires connections (weights) that vary with the input x.
From the SLFM to the GPRN

[Diagram: the SLFM network, with input x, node functions f_1(x), ..., f_q(x), fixed weights W_11, ..., W_pq, and outputs y_1(x), ..., y_p(x); the GPRN side is left as an open question.]

SLFM: $\underbrace{y(x)}_{p \times 1} = \underbrace{W}_{p \times q}\, \underbrace{f(x)}_{q \times 1} + \sigma_y \underbrace{z(x)}_{\mathcal{N}(0,\, I_p)}$
From the SLFM to the GPRN

[Diagram: side by side, the SLFM network with fixed weights W_ij and node functions f_i(x), and the GPRN network with input dependent weight functions W_ij(x) and noisy node functions f̂_i(x).]

SLFM: $\underbrace{y(x)}_{p \times 1} = \underbrace{W}_{p \times q}\, \underbrace{f(x)}_{q \times 1} + \sigma_y \underbrace{z(x)}_{\mathcal{N}(0,\, I_p)}$

GPRN: $\underbrace{y(x)}_{p \times 1} = \underbrace{W(x)}_{p \times q}\, \big[\, \underbrace{f(x) + \sigma_f\, \epsilon(x)}_{\hat f(x)} \,\big] + \sigma_y\, z(x)$, with $\epsilon(x) \sim \mathcal{N}(0, I_q)$ and $z(x) \sim \mathcal{N}(0, I_p)$.
Gaussian process regression networks

$$y(x) = W(x)\, \big[\, f(x) + \sigma_f\, \epsilon(x) \,\big] + \sigma_y\, z(x)$$

or, equivalently,

$$y(x) = \underbrace{W(x) f(x)}_{\text{signal}} + \underbrace{\sigma_f\, W(x)\, \epsilon(x) + \sigma_y\, z(x)}_{\text{noise}}$$

- y(x): p × 1 vector of output variables (responses) evaluated at x
- W(x): p × q matrix of weight functions, $W_{ij}(x) \sim \mathcal{GP}(0, k_w)$
- f(x): q × 1 vector of Gaussian process node functions, $f_i(x) \sim \mathcal{GP}(0, k_f)$
- σ_f, σ_y: hyperparameters controlling noise variance
- ε(x), z(x): Gaussian white noise

[Diagram: the GPRN, with input x, noisy node functions f̂_1(x), ..., f̂_q(x), input dependent weight functions W_11(x), ..., W_pq(x), and outputs y_1(x), ..., y_p(x).]
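A minimal sketch, again with assumed kernels and hyperparameters, of a single draw from the GPRN prior; here the weights W_ij(x) are themselves GP draws, so the covariance they induce between outputs changes with x.

```python
# One draw from a GPRN prior: GP node functions mixed by GP weight functions.
import numpy as np

def rbf(x, lengthscale):
    d2 = (x[:, None] - x[None, :])**2
    return np.exp(-0.5 * d2 / lengthscale**2) + 1e-8 * np.eye(len(x))

p, q, N = 3, 2, 100
x = np.linspace(0, 1, N)
rng = np.random.default_rng(2)

Kf, Kw = rbf(x, 0.1), rbf(x, 0.3)        # kernels k_f and k_w
f = rng.multivariate_normal(np.zeros(N), Kf, size=q)        # q x N node functions
W = rng.multivariate_normal(np.zeros(N), Kw, size=(p, q))   # p x q x N weights

sigma_f, sigma_y = 0.05, 0.1
f_hat = f + sigma_f * rng.standard_normal(f.shape)          # noisy nodes f̂(x)
y = np.einsum('pqn,qn->pn', W, f_hat) \
    + sigma_y * rng.standard_normal((p, N))                 # p x N outputs
```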
GPRN Inference

We sample from the posterior over Gaussian processes in the weight and node functions using elliptical slice sampling (ESS) (Murray, Adams, and MacKay, 2010). ESS is especially good for sampling from posteriors with correlated Gaussian priors.

We also approximate this posterior using a message passing implementation of variational Bayes (VB).

The computational complexity is cubic in the number of data points and linear in the number of response variables, per iteration of ESS or VB. Details are in the paper.
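For reference, a compact sketch of one elliptical slice sampling transition, following Murray, Adams, and MacKay (2010); the function names (ess_step, log_lik) are placeholders, not from the talk or paper.

```python
# One elliptical slice sampling update for a state u with prior N(0, C).
import numpy as np

def ess_step(u, chol_C, log_lik, rng):
    """One ESS transition; chol_C is the lower Cholesky factor of the prior cov."""
    nu = chol_C @ rng.standard_normal(len(u))     # auxiliary draw from the prior
    log_y = log_lik(u) + np.log(rng.uniform())    # slice level under current state
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta           # initial bracket
    while True:
        proposal = u * np.cos(theta) + nu * np.sin(theta)  # stays on the ellipse
        if log_lik(proposal) > log_y:
            return proposal
        # Shrink the bracket toward the current state and retry.
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)
```

The update has no rejections and no step-size parameter: the bracket shrinks toward theta = 0 (the current state), so it always terminates with an accepted point.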
GPRN Results, Jura Heavy Metal Dataset

[Figure: learned spatially varying correlation plotted over the Jura region (latitude vs longitude, with a colour scale), alongside the inferred GPRN with node functions f_1(x), f_2(x), weight functions W_11(x), ..., W_32(x), and outputs y_1(x), y_2(x), y_3(x).]

$$y(x) = \underbrace{W(x) f(x)}_{\text{signal}} + \underbrace{\sigma_f\, W(x)\, \epsilon(x) + \sigma_y\, z(x)}_{\text{noise}}$$
GPRN Results, Gene Expression 50D

[Figure: bar chart of standardised mean squared error (SMSE) on the GENE dataset (p = 50) for GPRN (ESS), GPRN (VB), CMOGP, LMC, and SLFM.]
GPRN Results, Gene Expression 1000D

[Figure: bar chart of SMSE on the GENE dataset (p = 1000) for GPRN (ESS), GPRN (VB), CMOFITC, CMOPITC, and CMODTC.]
Training Times on GENE

Training time        GENE (50D)    GENE (1000D)
GPRN (VB)            12 s          330 s
GPRN (ESS)           40 s          9000 s
LMC, CMOGP, SLFM     minutes       days
Multivariate Volatility Results

[Figure: bar chart of MSE on the EQUITY dataset for GPRN (VB), GPRN (ESS), GWP, WP, and MGARCH.]
Summary
- A Gaussian process regression network is used for multi-task regression and multivariate volatility, and can account for input dependent signal and noise covariances.
- Can scale to thousands of dimensions.
- Outperforms multi-task Gaussian process models and multivariate volatility models.
Generalised Wishart Processes

Recall that the GPRN model can be written as

$$y(x) = \underbrace{W(x) f(x)}_{\text{signal}} + \underbrace{\sigma_f\, W(x)\, \epsilon(x) + \sigma_y\, z(x)}_{\text{noise}}$$

The induced noise process,

$$\Sigma_{\text{noise}}(x) = \sigma_f^2\, W(x) W(x)^\top + \sigma_y^2\, I,$$

is an example of a Generalised Wishart Process (Wilson and Ghahramani, 2010). At every x, Σ(x) is marginally Wishart, and the dynamics of Σ(x) are governed by the GP covariance kernel used for the weight functions in W(x).
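As a small worked illustration, the induced noise covariance at a single input can be computed directly from a draw of the weight matrix; the function name below is a hypothetical helper, not from the paper.

```python
# Induced noise covariance at one input x:
# Sigma(x) = sigma_f^2 * W(x) W(x)^T + sigma_y^2 * I.
import numpy as np

def induced_noise_cov(W_x, sigma_f, sigma_y):
    """W_x: p x q weight matrix W(x) evaluated at a single input x."""
    p = W_x.shape[0]
    return sigma_f**2 * (W_x @ W_x.T) + sigma_y**2 * np.eye(p)
```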
GPRN Inference

$$y(x) = \underbrace{W(x) f(x)}_{\text{signal}} + \underbrace{\sigma_f\, W(x)\, \epsilon(x) + \sigma_y\, z(x)}_{\text{noise}}$$

The prior over the stacked node and weight function values u is induced through the GP priors on nodes and weights:

$$p(u \mid \sigma_f, \gamma) = \mathcal{N}(0, C_B)$$

Likelihood:

$$p(\mathcal{D} \mid u, \sigma_f, \sigma_y) = \prod_{i=1}^{N} \mathcal{N}\big(y(x_i);\, W(x_i)\, \hat f(x_i),\, \sigma_y^2 I_p\big)$$

Posterior:

$$p(u \mid \mathcal{D}, \sigma_f, \sigma_y, \gamma) \propto p(\mathcal{D} \mid u, \sigma_f, \sigma_y)\, p(u \mid \sigma_f, \gamma)$$

We sample from the posterior using elliptical slice sampling (Murray, Adams, and MacKay, 2010) or approximate it using a message passing implementation of variational Bayes.
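A hedged sketch of evaluating the unnormalised log posterior over the stacked GP function values u, matching the prior and likelihood above; the unpack helper, which reshapes u into W(x_i) and f̂(x_i) at each input, is hypothetical.

```python
# Unnormalised log posterior: log N(u; 0, C_B) + sum_i log N(y_i; W_i f̂_i, sigma_y^2 I).
import numpy as np

def log_posterior(u, chol_CB, unpack, Y, sigma_y):
    """Y: p x N observed outputs; chol_CB: lower Cholesky factor of the prior cov C_B."""
    alpha = np.linalg.solve(chol_CB, u)
    log_prior = -0.5 * alpha @ alpha               # N(0, C_B), up to a constant
    log_lik = 0.0
    for i in range(Y.shape[1]):
        W_i, fhat_i = unpack(u, i)                 # hypothetical reshaping helper
        resid = Y[:, i] - W_i @ fhat_i
        log_lik += -0.5 * resid @ resid / sigma_y**2
    return log_prior + log_lik
```

This is exactly the quantity an ESS transition needs through its log_lik argument, with the Gaussian prior handled by the ellipse itself.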