Gaussian Process Regression Networks

Gaussian Process Regression Networks
Andrew Gordon Wilson (agw38@cam.ac.uk, mlg.eng.cam.ac.uk/andrew)
University of Cambridge
Joint work with David A. Knowles and Zoubin Ghahramani
June 27, 2012, ICML, Edinburgh
1 / 20

Multiple responses with input dependent covariances

Two response variables:
- y_1(x): concentration of cadmium at a spatial location x
- y_2(x): concentration of zinc at a spatial location x

The values of these responses, at a given spatial location x, are correlated. We can account for these correlations (rather than assuming y_1(x) and y_2(x) are independent) to enhance predictions. We can further enhance predictions by accounting for how these correlations vary with geographical location x. Accounting for input dependent correlations is a distinctive feature of the Gaussian process regression network.

[Figure: map of the spatially varying correlation between the two responses, plotted over latitude and longitude.]

2 / 20

Motivation for modelling dependent covariances

Promise: Many problems in fact have input dependent uncertainties and correlations. Accounting for dependent covariances (uncertainties and correlations) can greatly improve statistical inferences.

Uncharted territory: For convenience, response variables are typically seen as independent, or as having fixed covariances (e.g. the multi-task literature). The few existing models of dependent covariances are typically not expressive (e.g. a Brownian motion covariance structure) or scalable (e.g. < 5 response variables).

Goal: We want to develop expressive and scalable models (> 1000 response variables) for dependent uncertainties and correlations.

3 / 20

Outline
- Gaussian process review
- Gaussian process regression networks
- Applications

4 / 20

Gaussian processes

Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Nonparametric regression model:
Prior: f(x) ~ GP(m(x), k(x, x')), meaning (f(x_1), ..., f(x_N)) ~ N(µ, K), with µ_i = m(x_i) and K_ij = cov(f(x_i), f(x_j)) = k(x_i, x_j).

GP posterior: p(f(x) | D) ∝ p(D | f(x)) p(f(x)), i.e. likelihood times GP prior.

[Figure: (a) sample functions drawn from a GP prior; (b) sample functions drawn from the corresponding GP posterior, plotted as f(t) against input t.]

5 / 20
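As a small aside, the GP regression model above can be written in a few lines of code. The sketch below is illustrative only: the squared exponential kernel, toy data, and noise level are assumptions, not anything from the talk.

```python
# Minimal GP regression sketch (illustrative; squared exponential kernel assumed).
import numpy as np

def k_se(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared exponential covariance k(x, x')."""
    d = X1[:, None] - X2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x_star = np.linspace(-10, 10, 200)              # test inputs
x = np.array([-6.0, -2.0, 1.0, 4.0])            # training inputs
y = np.sin(x) + 0.1 * rng.standard_normal(4)    # noisy observations
sigma_y = 0.1

# Prior samples: f(x*) ~ N(0, K(x*, x*))
K_ss = k_se(x_star, x_star)
prior_samples = rng.multivariate_normal(
    np.zeros(len(x_star)), K_ss + 1e-8 * np.eye(len(x_star)), size=3)

# Posterior: condition the joint Gaussian on the data D = (x, y)
K = k_se(x, x) + sigma_y**2 * np.eye(len(x))
K_s = k_se(x, x_star)
post_mean = K_s.T @ np.linalg.solve(K, y)
post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
post_samples = rng.multivariate_normal(
    post_mean, post_cov + 1e-8 * np.eye(len(x_star)), size=3)
```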

Gaussian processes

"How can Gaussian processes possibly replace neural networks? Did we throw the baby out with the bathwater?" (David MacKay, 1998)

6 / 20

Semiparametric Latent Factor Model

The semiparametric latent factor model (SLFM) (Teh, Seeger, and Jordan, 2005) is a popular multi-output (multi-task) GP model with fixed signal correlations between outputs (response variables):

y(x) = W f(x) + σ_y z(x),   z(x) ~ N(0, I_p)

- x: input variable (e.g. geographical location)
- y(x): p × 1 vector of output variables (responses) evaluated at x
- W: p × q matrix of mixing weights
- f(x): q × 1 vector of Gaussian process functions
- σ_y: hyperparameter controlling the noise variance
- z(x): iid Gaussian white noise with p × p identity covariance I_p

[Diagram: network with input x feeding node functions f_1(x), ..., f_q(x), mixed through constant weights W_11, ..., W_pq into outputs y_1(x), ..., y_p(x).]

7 / 20
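The SLFM equation can be made concrete with a short generative sketch. This is my own illustration, not material from the talk; the squared exponential kernel and the choices of p, q, and noise level are assumptions.

```python
# SLFM prior sample: y(x) = W f(x) + sigma_y z(x), with fixed mixing weights W.
import numpy as np

rng = np.random.default_rng(1)
p, q, N = 3, 2, 100                     # outputs, latent GPs, inputs (illustrative)
x = np.linspace(0, 10, N)
sigma_y = 0.05

def k_se(x1, x2, ell=1.0):
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell) ** 2)

K = k_se(x, x) + 1e-8 * np.eye(N)
f = rng.multivariate_normal(np.zeros(N), K, size=q)   # q x N latent GP draws
W = rng.standard_normal((p, q))                       # fixed mixing weights

y = W @ f + sigma_y * rng.standard_normal((p, N))     # p x N outputs
# Signal covariance between outputs is W W^T -- the same at every input x.
```

The last comment is the key restriction: the induced covariance W Wᵀ does not depend on x, which is exactly what the GPRN relaxes by letting the weights vary with the input.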

Deriving the GPRN

[Diagram: network structure at x = x_1 and at x = x_2, showing node functions f_1, f_2 and outputs y_1, y_2 with different connectivity at the two inputs.]

At x = x_1 the two outputs (responses) y_1 and y_2 are correlated since they share the basis function f_1. At x = x_2 the outputs are independent.

8 / 20

From the SLFM to the GPRN

[Diagram: the SLFM network, with input x, node functions f_1(x), ..., f_q(x), fixed weights W_11, ..., W_pq, and outputs y_1(x), ..., y_p(x), next to a placeholder for the yet-to-be-defined GPRN.]

SLFM:  y(x) = W f(x) + σ_y z(x),   z(x) ~ N(0, I_p)

9 / 20

From the SLFM to the GPRN

[Diagram: the SLFM (fixed weights W_ij and node functions f_i(x)) next to the GPRN (weight functions W_ij(x) and noisy node functions f̂_i(x)), both mapping the input x to outputs y_1(x), ..., y_p(x).]

SLFM:  y(x) = W f(x) + σ_y z(x),   z(x) ~ N(0, I_p)

GPRN:  y(x) = W(x) [ f(x) + σ_f ε(x) ] + σ_y z(x),   ε(x) ~ N(0, I_q),  z(x) ~ N(0, I_p)

where f̂(x) = f(x) + σ_f ε(x) denotes the noisy node functions.

10 / 20

Gaussian process regression networks

y(x) = W(x) [ f(x) + σ_f ε(x) ] + σ_y z(x),   ε(x) ~ N(0, I_q),  z(x) ~ N(0, I_p)

or, equivalently,

y(x) = W(x) f(x) + σ_f W(x) ε(x) + σ_y z(x),

where W(x) f(x) is the signal and σ_f W(x) ε(x) + σ_y z(x) is the noise.

- y(x): p × 1 vector of output variables (responses) evaluated at x
- W(x): p × q matrix of weight functions, W(x)_ij ~ GP(0, k_w)
- f(x): q × 1 vector of Gaussian process node functions, f(x)_i ~ GP(0, k_f)
- σ_f, σ_y: hyperparameters controlling noise variance
- ε(x), z(x): Gaussian white noise

[Diagram: GPRN network with input x, noisy node functions f̂_1(x), ..., f̂_q(x), weight functions W_11(x), ..., W_pq(x), and outputs y_1(x), ..., y_p(x).]

11 / 20
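To make the generative model concrete, here is a minimal sketch, not code from the paper, that draws node and weight functions from GP priors and assembles y(x); the kernels, dimensions, and noise levels are arbitrary assumptions.

```python
# GPRN prior sample: both the node functions f(x) and the weights W(x) are GPs,
# so the mixing of the latent functions, and hence the output correlations,
# change with the input x.
import numpy as np

rng = np.random.default_rng(2)
p, q, N = 3, 2, 100                      # outputs, node functions, inputs
x = np.linspace(0, 10, N)
sigma_f, sigma_y = 0.1, 0.05

def k_se(x1, x2, ell):
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell) ** 2)

K_f = k_se(x, x, ell=1.0) + 1e-8 * np.eye(N)   # node kernel k_f
K_w = k_se(x, x, ell=3.0) + 1e-8 * np.eye(N)   # weight kernel k_w (smoother)

f = rng.multivariate_normal(np.zeros(N), K_f, size=q)        # shape (q, N)
W = rng.multivariate_normal(np.zeros(N), K_w, size=(p, q))   # shape (p, q, N)
eps = rng.standard_normal((q, N))
z = rng.standard_normal((p, N))

# y(x_n) = W(x_n) f(x_n) + sigma_f W(x_n) eps(x_n) + sigma_y z(x_n)
y = np.einsum('pqn,qn->pn', W, f + sigma_f * eps) + sigma_y * z
```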

GPRN Inference

- We sample from the posterior over the Gaussian process weight and node functions using elliptical slice sampling (ESS) (Murray, Adams, and MacKay, 2010). ESS is especially good for sampling from posteriors with correlated Gaussian priors.
- We also approximate this posterior using a message passing implementation of variational Bayes (VB).
- The computational complexity is cubic in the number of data points and linear in the number of response variables, per iteration of ESS or VB.
- Details are in the paper.

12 / 20

GPRN Results, Jura Heavy Metal Dataset

y(x) = W(x) f(x) + σ_f W(x) ε(x) + σ_y z(x)   (signal + noise)

[Figure: map of the predicted spatially varying correlations over the Jura region (latitude vs. longitude), alongside the learned GPRN structure with node functions f_1(x), f_2(x), weight functions W_11(x), ..., W_32(x), and outputs y_1(x), y_2(x), y_3(x).]

13 / 20

GPRN Results, Gene Expression 50D

[Figure: bar chart of SMSE on the GENE dataset (p = 50) for GPRN (ESS), GPRN (VB), CMOGP, LMC, and SLFM.]

14 / 20

GPRN Results, Gene Expression 1000D

[Figure: bar chart of SMSE on the GENE dataset (p = 1000) for GPRN (ESS), GPRN (VB), CMOFITC, CMOPITC, and CMODTC.]

15 / 20

Training Times on GENE

Training time      | GENE (50D) | GENE (1000D)
GPRN (VB)          | 12 s       | 330 s
GPRN (ESS)         | 40 s       | 9000 s
LMC, CMOGP, SLFM   | minutes    | days

16 / 20

Multivariate Volatility Results

[Figure: bar chart of MSE on the EQUITY dataset for GPRN (VB), GPRN (ESS), GWP, WP, and MGARCH.]

17 / 20

Summary

- A Gaussian process regression network is used for multi-task regression and multivariate volatility, and can account for input dependent signal and noise covariances.
- Can scale to thousands of dimensions.
- Outperforms multi-task Gaussian process models and multivariate volatility models.

18 / 20

Generalised Wishart Processes

Recall that the GPRN model can be written as

y(x) = W(x) f(x) + σ_f W(x) ε(x) + σ_y z(x)   (signal + noise)

The induced noise process,

Σ_noise(x) = σ_f^2 W(x) W(x)^T + σ_y^2 I,

is an example of a generalised Wishart process (Wilson and Ghahramani, 2010). At every x, Σ(x) is marginally Wishart, and the dynamics of Σ(x) are governed by the GP covariance kernel used for the weight functions in W(x).

19 / 20
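A small sketch of the induced noise covariance: given weight function draws like those in the earlier GPRN sketch, Σ(x) can be assembled directly at every input. The helper name and the stand-in weight draws below are assumptions for illustration, not part of the talk.

```python
# Induced noise covariance of the GPRN at each input:
#   Sigma(x) = sigma_f^2 W(x) W(x)^T + sigma_y^2 I
# At any single x this is a sum of outer products of Gaussian vectors plus a
# diagonal ridge, i.e. a Wishart-type construction; its dynamics over x come
# from the GP kernel on the weight functions.
import numpy as np

def induced_noise_cov(W, sigma_f, sigma_y):
    """W has shape (p, q, N): weight functions evaluated at N inputs."""
    p = W.shape[0]
    Sigma = sigma_f**2 * np.einsum('pqn,rqn->prn', W, W)
    Sigma += (sigma_y**2 * np.eye(p))[:, :, None]
    return Sigma                      # shape (p, p, N), one covariance per input

# Example with arbitrary draws standing in for GP-sampled weights:
rng = np.random.default_rng(3)
W = rng.standard_normal((3, 2, 50))
Sigma = induced_noise_cov(W, sigma_f=0.1, sigma_y=0.05)
assert np.all(np.linalg.eigvalsh(Sigma.transpose(2, 0, 1)) > 0)  # valid covariance at every x
```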

GPRN Inference

y(x) = W(x) f(x) + σ_f W(x) ε(x) + σ_y z(x)   (signal + noise)

The prior is induced through the GP priors on the node and weight functions. Collecting the node and weight function values into a vector u:

Prior:       p(u | σ_f, γ) = N(0, C_B)

Likelihood:  p(D | u, σ_f, σ_y) = ∏_{i=1}^N N(y(x_i); W(x_i) f̂(x_i), σ_y^2 I_p)

Posterior:   p(u | D, σ_f, σ_y, γ) ∝ p(D | u, σ_f, σ_y) p(u | σ_f, γ)

We sample from the posterior using elliptical slice sampling (Murray, Adams, and MacKay, 2010) or approximate it using a message passing implementation of variational Bayes.

20 / 20
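Elliptical slice sampling targets exactly this structure: a Gaussian prior N(0, C_B) times a likelihood. Below is a generic single ESS update following Murray, Adams, and MacKay (2010), not the authors' implementation; the prior sampler and log-likelihood are placeholders that the GPRN model would supply.

```python
# One elliptical slice sampling update for a state `u` with prior N(0, C) and an
# arbitrary log-likelihood. For the GPRN, `u` would hold the node and weight
# function values and log_lik would evaluate the Gaussian likelihood of the
# observed outputs; both are placeholders here.
import numpy as np

def ess_step(u, prior_sample, log_lik, rng):
    """u: current state; prior_sample(): draws nu ~ N(0, C); log_lik(u): log-likelihood."""
    nu = prior_sample()                              # auxiliary draw from the prior
    log_y = log_lik(u) + np.log(rng.uniform())       # slice threshold
    theta = rng.uniform(0.0, 2.0 * np.pi)            # initial proposal angle
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        u_prop = u * np.cos(theta) + nu * np.sin(theta)   # point on the ellipse
        if log_lik(u_prop) > log_y:
            return u_prop                            # accept
        # shrink the angle bracket towards 0 and try again
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)
```

Because each proposal lies on an ellipse through the current state and a fresh prior draw, the update has no step-size parameters and stays efficient under strongly correlated Gaussian priors, which is the property highlighted on the inference slide.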