Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005

A Very Brief Review of Gaussian Processes

A Gaussian process is a distribution over functions, X(t). Here, t could be anything, but is typically a vector in R^p. We can use such a distribution as the prior for a Bayesian regression model. The defining feature of a Gaussian process is that the joint distribution of the function values at any finite number of points, t_1, ..., t_n, is multivariate Gaussian. Consequently, a Gaussian process can be completely specified by its mean function, µ(t) = E[X(t)], and its covariance function, C(t, t′) = E[(X(t) − µ(t))(X(t′) − µ(t′))]. From these, we can compute the mean vector and covariance matrix for the joint distribution of X(t_1), ..., X(t_n). Of course, the covariance function must be such that the resulting covariance matrix is always positive semi-definite.
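To make this concrete, here is a minimal Python sketch (my illustration, not from the talk) that builds the mean vector and covariance matrix for a finite set of points and draws function values from the resulting multivariate Gaussian; the squared-exponential covariance used here is just an assumed example.

```python
import numpy as np

def cov_fn(t, t2):
    # Illustrative squared-exponential covariance; any positive
    # semi-definite covariance function could be used here.
    return np.exp(-(t - t2) ** 2)

def draw_gp(ts, mean_fn, cov_fn, n_draws=3, jitter=1e-9):
    """Draw functions from a GP at the points ts, using the fact that
    (X(t_1), ..., X(t_n)) is multivariate Gaussian."""
    mu = np.array([mean_fn(t) for t in ts])
    C = np.array([[cov_fn(t, t2) for t2 in ts] for t in ts])
    C += jitter * np.eye(len(ts))          # keep the matrix numerically positive semi-definite
    return np.random.multivariate_normal(mu, C, size=n_draws)

ts = np.linspace(-3, 3, 101)
samples = draw_gp(ts, mean_fn=lambda t: 0.0, cov_fn=cov_fn)
```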

Examples of Functions Drawn from Gaussian Processes

Three functions from a GP with µ(t) = 0, C(t, t′) = 1.5² + exp(−(t − t′)²):

[plot of three functions drawn from this GP]

Three with µ(t) = 0, C(t, t′) = 1.5² + exp(−(t − t′)²) + 0.2² exp(−10(t − t′)²):

[plot of three functions drawn from this GP]

Non-Gaussian Distributions Over Functions

The choice of covariance function allows considerable flexibility in GP models, but some distributions cannot be obtained. Consider the following class of functions:

    X(t) = b    if t < a
    X(t) = −b   if t ≥ a

where b is randomly chosen to be +1 or −1 with equal probabilities, and a is randomly chosen uniformly over some broad range. We can easily see that µ(t) = E[X(t)] = 0. The covariance function is

    C(t, t′) = E[X(t)X(t′)] = P(a ∉ (t, t′)) − P(a ∈ (t, t′))

Over the broad range for a, we see that for nearby t and t′, C(t, t′) ≈ 1 − c|t − t′|, for some c, which is locally similar to exp(−c|t − t′|). But what do functions drawn from the Gaussian process with this mean and covariance look like?
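As a quick sanity check on that covariance calculation, here is a small Python sketch (mine, not from the talk) that simulates many draws of the step function and compares the empirical covariance of X(t) and X(t′) with 1 − c|t − t′|; the range [−3, 3] for a is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000
lo, hi = -3.0, 3.0                 # assumed "broad range" for a
a = rng.uniform(lo, hi, n_draws)
b = rng.choice([-1.0, 1.0], n_draws)

def X(t):
    # Step function: equals b before the jump at a, and -b after it
    return np.where(t < a, b, -b)

t1, t2 = 0.3, 0.5
emp_cov = np.mean(X(t1) * X(t2))   # E[X(t)X(t')], since the mean is 0
c = 2.0 / (hi - lo)                # P(a in (t,t')) = |t-t'|/(hi-lo), so cov = 1 - 2|t-t'|/(hi-lo)
print(emp_cov, 1 - c * abs(t1 - t2))
```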

A Function Drawn From This Gaussian Process

Here's a function drawn from the GP with µ(t) = 0, C(t, t′) = exp(−|t − t′|):

[plot of a function drawn from this GP]

This is nothing like what we were looking for, which is something like this:

[plot of a step function of the kind described above]

Warped Gaussian Processes

One way to get non-Gaussian processes from Gaussian processes is to warp the function by applying some non-linear transformation. This is often done manually; e.g., we may decide to model the logs of data points rather than the original values. Snelson, Rasmussen, and Ghahramani (NIPS 2003) find a monotonic warping function automatically, e.g., from the class

    f(t) = t + Σ_{i=1}^{I} a_i tanh(b_i (t + c_i))

The hyperparameters a_i, b_i, and c_i can be set to their maximum a posteriori probability values (as SRG do), or their posterior can be sampled using MCMC. Importantly, fitting warped Gaussian processes to data is not much harder than fitting ordinary Gaussian processes. In both cases, the crucial computations are done with matrix operations. Doing the warping requires only that we be able to compute f, f′, and f⁻¹. However, warping gets us only a small extension beyond Gaussian processes. For instance, we can't get processes whose smoothness characteristics vary over the input space in a priori unknown ways.
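For illustration only, here is a minimal Python sketch (not SRG's code) of a warping function from the tanh class above, together with its derivative; positivity of a_i and b_i is assumed so that f is monotonic, and the inverse is found numerically by bisection.

```python
import numpy as np

def warp(t, a, b, c):
    """f(t) = t + sum_i a_i * tanh(b_i * (t + c_i)); monotonic if a_i, b_i > 0."""
    t = np.asarray(t, dtype=float)
    return t + sum(ai * np.tanh(bi * (t + ci)) for ai, bi, ci in zip(a, b, c))

def warp_deriv(t, a, b, c):
    """f'(t) = 1 + sum_i a_i * b_i / cosh^2(b_i * (t + c_i))."""
    t = np.asarray(t, dtype=float)
    return 1.0 + sum(ai * bi / np.cosh(bi * (t + ci)) ** 2 for ai, bi, ci in zip(a, b, c))

def warp_inverse(y, a, b, c, lo=-100.0, hi=100.0, iters=80):
    """Invert f numerically by bisection, relying on monotonicity."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if warp(mid, a, b, c) < y else (lo, mid)
    return 0.5 * (lo + hi)

a, b, c = [1.0, 0.5], [2.0, 1.0], [0.0, -1.0]   # hypothetical hyperparameter values
y = warp(0.7, a, b, c)
print(warp_inverse(y, a, b, c))   # should recover approximately 0.7
```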

Modeling Functions as Sums of Other Functions

Here is data on daily mortality in Toronto from 1992 to 1997:

[plot: Number of Deaths Each Day, 1992–1998]

We can model these counts as being Poisson distributed, with the mean being a function of time. This function can incorporate yearly and weekly cycles, an overall trend, as well as short-term dependencies, such as could result from flu epidemics. People die of many causes, however. Rather than model the mean directly, perhaps we should use a model that represents the mean number of deaths as a sum of several terms, corresponding to various causes of death.

How About Using a Sum of Gaussian Processes?

At first, we might think this is easy to do with a Gaussian process, since if X(t) = A(t) + B(t) + C(t), and A(t), B(t), and C(t) are independent Gaussian processes, then X(t) will also be a Gaussian process. Its mean and covariance functions will just be the sums of those for A(t), B(t), and C(t). However: the mean number of deaths is non-negative, whereas Gaussian processes range over the whole real line. This motivates modeling the log of the mean with a GP, not the mean itself. In any case: if the mean number of deaths is a Gaussian process, it will have the limitations of a Gaussian process. Consequently, we won't be able to model events like sudden flu epidemics very well.

The Log-Sum-Exp Approach

Let X(t) be the log of the mean number of deaths on day t. If we want to model the mean as the sum of contributions from various causes of death, we are led to models such as the following:

    X(t) = log(exp(A(t)) + exp(B(t)) + exp(C(t)))

where A(t), B(t), and C(t) represent the logs of the mean numbers of deaths from three causes. We give Gaussian process priors to A(t), B(t), and C(t). I assume here that we observe only the total number of deaths. A(t), B(t), and C(t) are latent processes, which will not be observed directly (even noisily). The three causes of death that they correspond to would be discovered by the model; they aren't fixed beforehand. We don't observe X(t) either, but we do observe the number of deaths on day t, N(t) ~ Poisson(exp(X(t))).
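As a hedged illustration of the generative model just described (a sketch under assumed covariance functions, not the model used later for the mortality data), the following Python snippet draws two latent processes from GP priors, combines them by log-sum-exp, and generates Poisson counts:

```python
import numpy as np

def sq_exp_cov(ts, scale2, rho, jitter=1e-8):
    """Squared-exponential covariance matrix: scale2 * exp(-rho * (t - t')^2)."""
    d2 = (ts[:, None] - ts[None, :]) ** 2
    return scale2 * np.exp(-rho * d2) + jitter * np.eye(len(ts))

rng = np.random.default_rng(1)
ts = np.arange(200, dtype=float)

# Assumed covariance functions for the latent processes (illustrative values only).
A = rng.multivariate_normal(np.zeros(len(ts)), sq_exp_cov(ts, 1.0, 0.001))
B = rng.multivariate_normal(np.zeros(len(ts)), sq_exp_cov(ts, 1.0, 0.1))

# The log of the mean is the log-sum-exp of the latent processes.
X = np.logaddexp(A, B)           # log(exp(A) + exp(B)), computed stably

# Observed counts are Poisson with mean exp(X(t)).
N = rng.poisson(np.exp(X))
```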

Functions Randomly Drawn from a Log-Sum-Exp GP Model

Here are two functions randomly drawn from a log-sum-exp GP model with X(t) = log(exp(A(t)) + exp(B(t))), with A(t) and B(t) having zero means, and the following covariance functions:

    C_A(t, t′) = 10² exp(−0.04(t − t′)²)
    C_B(t, t′) = 10² exp(−0.04(t − t′)²) + exp(−25(t − t′)²)

[plots of two functions drawn from this model]

Computations for Log-Sum-Exp GP Models

For regression with Gaussian noise, computations for a plain GP model involve:

- matrix operations to find the likelihood
- MCMC updates of the hyperparameters, based on the likelihood
- matrix operations to make predictions.

For a log-sum-exp model, we need to explicitly represent A(t), B(t), etc. for all training points. Computations involve:

- MCMC updates for the latent A(t), B(t), etc.
- MCMC updates for hyperparameters, based on the current A(t), B(t), etc.
- matrix operations for predictions, with the results combined by log-sum-exp.

MCMC updates for both the latent values and the hyperparameters require matrix operations. Updates of the latent values look at the observations; updates for the hyperparameters look only at A(t), B(t), etc. Keeping around latent values slows MCMC convergence. For models with non-Gaussian observations (e.g., Poisson counts), a latent process needs to be represented explicitly anyway, but a log-sum-exp model requires slower, more complex MCMC updates (more latent values, distributions not log concave).
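To make the latent-value updates concrete, here is a minimal, hypothetical sketch (my illustration, not Neal's implementation) of a single Metropolis update of the latent vector A, using its GP prior and the Poisson likelihood of the observed counts; the proposal is a simple random-walk step, whereas a practical sampler would use something more sophisticated.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_posterior_A(A, B, N, cov_A):
    """Log prior of A under its GP prior plus the Poisson log likelihood of counts N."""
    X = np.logaddexp(A, B)                        # X(t) = log(exp(A(t)) + exp(B(t)))
    log_lik = np.sum(N * X - np.exp(X))           # Poisson log likelihood, up to a constant
    log_prior = multivariate_normal.logpdf(A, mean=np.zeros(len(A)),
                                           cov=cov_A, allow_singular=True)
    return log_prior + log_lik

def metropolis_update_A(A, B, N, cov_A, step=0.05, rng=np.random.default_rng()):
    """One random-walk Metropolis update of the latent vector A."""
    A_prop = A + step * rng.standard_normal(len(A))
    log_acc = log_posterior_A(A_prop, B, N, cov_A) - log_posterior_A(A, B, N, cov_A)
    return A_prop if np.log(rng.uniform()) < log_acc else A
```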

A Test Problem

I tested a log-sum-exp model on data from a sum of Gaussian density functions:

    f(t) = 3N(t, +2, 1) + 2N(t, −1, 1.5) + N(t, −2.2, 0.05) + N(t, −1.8, 0.04) + N(t, −2.0, 0.07) + N(t, +1.7, 0.08) + N(t, +1.9, 0.06) + N(t, +2.2, 0.07)

where N(x, m, s) is the Gaussian density function with mean m and standard deviation s. I then took the log of this function and added Gaussian noise with standard deviation 0.1. Here are 201 training points (on a grid from −3 to 3), with the noise-free function:

[plot of the training points and the noise-free function]
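A hedged sketch of how such test data could be generated (my reconstruction of the recipe above, with the negative means inferred from context), in Python:

```python
import numpy as np
from scipy.stats import norm

def f(t):
    # Sum of Gaussian density functions, as on the slide: (weight, mean, standard deviation).
    terms = [(3, 2.0, 1.0), (2, -1.0, 1.5), (1, -2.2, 0.05), (1, -1.8, 0.04),
             (1, -2.0, 0.07), (1, 1.7, 0.08), (1, 1.9, 0.06), (1, 2.2, 0.07)]
    return sum(w * norm.pdf(t, loc=m, scale=s) for w, m, s in terms)

rng = np.random.default_rng(2)
t_train = np.linspace(-3, 3, 201)                                 # 201 points on a grid from -3 to 3
y_train = np.log(f(t_train)) + rng.normal(0, 0.1, len(t_train))   # log of f plus Gaussian noise
```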

Results on the Test Problem

Fit of a plain GP model with covariance function η exp(−ρ(t − t′)²):

[plot of the plain GP fit]

Fit of a log-sum-exp model, two latent processes, covariances η₁ exp(−ρ₁(t − t′)²) and η₂ exp(−ρ₂(t − t′)²) + η₃ exp(−ρ₃(t − t′)²), prior for ρ₃ favouring large values:

[plot of the log-sum-exp model fit]

A Model for the Toronto Mortality Data

To model the Toronto mortality data, I used as covariates:

- time in days, 10s of days, 100s of days, and 1000s of days
- sine and cosine of year angle (for seasonal effects)
- sine and cosine of week angle (for day-of-week effects)
- a binary holiday indicator

Time in 1000s of days and the others were used in both linear and non-linear covariance terms. The four time scales were used in additional terms intended for modeling autocorrelations at various scales. For the log-sum-exp model, two such processes were used, with the same form of covariance and the same priors, but with separate hyperparameters, so they can do different things in the end.
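For illustration, a brief sketch (assumptions: day index counting from 0, years of 365.25 days, weeks of 7 days; the holiday set is a placeholder) of how such covariates might be constructed in Python:

```python
import numpy as np

def mortality_covariates(day, holidays=frozenset()):
    """Build covariates for day indices (assumed to count from 0)."""
    day = np.asarray(day, dtype=float)
    year_angle = 2 * np.pi * day / 365.25      # position within the yearly cycle
    week_angle = 2 * np.pi * day / 7.0         # position within the weekly cycle
    return np.column_stack([
        day, day / 10.0, day / 100.0, day / 1000.0,      # the four time scales
        np.sin(year_angle), np.cos(year_angle),          # seasonal effects
        np.sin(week_angle), np.cos(week_angle),          # day-of-week effects
        np.isin(day, list(holidays)).astype(float),      # binary holiday indicator
    ])

X = mortality_covariates(np.arange(2192), holidays={0, 359})   # hypothetical holiday days
```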

Results on the Toronto Mortality Data

Here are samples from the posterior for the plain GP (left) and a log-sum-exp model (right). The log-sum-exp model does better on a held-out validation set.

[plots of posterior samples for the plain GP and the log-sum-exp model, 1992–1998]

Here are the components of this sample from the log-sum-exp model:

[plot of the component processes, 1992–1998]

Future Research

- Improve MCMC sampling for GP models, especially of the log-sum-exp sort:
  - Metropolis updates while dragging latent variables (Andriy is trying this)
  - Hamiltonian Monte Carlo updates of latent variables (while dragging?)
- Try log-sum-exp models for the Toronto mortality data with weather and pollution covariates. Hypothesis: these factors affect only a vulnerable subset of the population.
- Try log-sum-exp models for general regression problems. Need to introduce an additional scale factor: X(t) = η log(exp(A(t)) + exp(B(t)) + exp(C(t))).
- Investigate more general ways of combining several Gaussian processes. It's much like having a second hidden layer in a multilayer perceptron network.