Nonparametric Bayes & Gaussian Processes. Baback Moghaddam, Machine Learning Group


Nonparametric Bayes & Gaussian Processes Baback Moghaddam baback@jpl.nasa.gov Machine Learning Group

Outline: Bayesian Inference; Hierarchical Models; Model Selection; Parametric vs. Nonparametric; Gaussian Processes (regression, classification); Summary

Bayesian Inference: p(θ | D, M) = p(D | θ, M) p(θ | M) / p(D | M), i.e. Posterior = Likelihood x Prior / Evidence, where the evidence is p(D | M) = ∫ p(D | θ, M) p(θ | M) dθ. The evidence for our model M is also called the Marginal Likelihood.

Bayesian Nutshell: Posterior ∝ Likelihood x Prior, i.e. p(θ | D, M) ∝ p(D | θ, M) p(θ | M). [Figure: posterior density over θ, with the MAP estimate θ̂_MAP and the maximum-likelihood estimate θ̂_ML marked.]

Hierarchical Models [Graphical model: hyperparameter α, prior p(θ | α, M), parameter θ, likelihood p(D | θ, M), data d_i in a plate of size N.]

Hierarchical Models [Graphical model as above, with an added hyperprior p(α | M) on the hyperparameter α: hyperprior p(α | M), prior p(θ | α, M), parameter θ, likelihood p(D | θ, M), data d_i in a plate of size N.]

Bayesian Hierarchy. Level 1 (infer parameters): p(θ | D, α, M) = p(D | θ, M) p(θ | α, M) / p(D | α, M). Level 2 (infer hyperparameters): p(α | D, M) = p(D | α, M) p(α | M) / p(D | M). Level 3 (infer models): p(M | D) ∝ p(D | M) p(M).

Model Selection by Zoubin Ghahramani

Bayesian Model Selection by Zoubin Ghahramani

Bayesian Occam's Razor. [Figure: p(D = d | M) plotted over all possible data sets d, for a simple model M_1 and a more complex model M_2.] For any model M, the probabilities over all possible data sets sum to one: Σ_d p(D = d | M) = 1. The law of conservation of belief states that models that explain many possible data sets must necessarily assign each of them a low probability.

Bayesian Occam's Razor. [Figure: p(D = d | M) for models M_1, M_2, M_3, evaluated at the observed data D.] M_1: the too-simple model is unlikely to generate this data. M_3: the too-complex model is a little better, but still unlikely. M_2: the "just right" model has the highest marginal likelihood.

All the Bayesics by Zoubin Ghahramani

Approximation Methods (for the evidence and posterior integrals): Laplace's method, Bayesian Information Criterion (BIC = MDL), Akaike Information Criterion (AIC), Variational Bayes (VB), Expectation Propagation (EP), Markov Chain Monte Carlo (MCMC), exact ("perfect") sampling, etc.

Parametric vs. Nonparametric. Parametric: the total # of parameters is fixed (a property of the distribution), so it does not depend on (grow with) the # of data points collected; for prediction, knowing θ means you can throw away your data: p(d_new | D) = ∫ p(d_new | θ) p(θ | D) dθ. Nonparametric: the # of parameters can grow with the number of data points, so the model can adapt to the data's complexity; but this typically means you cannot throw away your data, since future predictions require access to the previous training set: p(d_new | D, α).

[Graphical models: Parametric: α, θ, and data d_1, d_2, ..., d_n, with prediction p(d_new | D) = ∫ p(d_new | θ) p(θ | D) dθ. Nonparametric: α connected directly to d_1, d_2, ..., d_n, with prediction p(d_new | D, α).]

Nonparametric Bayesian Models: Gaussian Process, Dirichlet Process, Chinese restaurant process, Polya urn model, stick-breaking models, Pitman-Yor process, Indian buffet process, Polya trees, Dirichlet diffusion trees, infinite Hidden Markov Models.

Gaussian Process Regression: for modeling, prediction, curve fitting; y = f(x) + ε. Some (mis)conceptions: it is not curve fitting! we want the full predictive distribution p(y | x), not just regression with Gaussian noise. The input x can be a scalar, a vector in R^n, an image, a string (e.g. ATGC sequences), etc.; the output y (hence f) is a scalar.

Early GP History: Thiele (1880); Kolmogorov (1941), Wiener (1949); Thompson (1956): meteorology; Matheron (1963): Kriging in geostatistics; Whittle (1963): geospatial prediction; O'Hagan (1978): general Bayesian regression; Ripley (1981): Bayesian spatial models.

Thorvald Nicolai Thiele: Danish astronomer, actuary, and statistician; born 1838, died 1910; General Theory of Observations (1889); sought the best predictor for a time series x(t); formulated a Kalman filter for a GP, rediscovered by Kalman & Bucy (1960).

Recent GP History: Bar-Shalom & Fortmann (1988): Kalman filters; Poggio & Girosi (1989): generalized RBFs; Wahba (1990): ARMA models & splines; Cressie (1993): spatial statistics (2D/3D); Neal (1996): MLP = GP; Williams & Rasmussen (1997): general ML; Saunders (1998): KRR (Kriging rediscovered); etc. (just see NIPS, UAI, AISTATS, ICML, ...)

Two Equivalent Views of GP models: the Weight Space view (linear, parametric, finite-dimensional) and the Random Process view (nonlinear, nonparametric, infinite-dimensional).

Weight Space View: parametric and finite-dimensional; regression with basis functions φ, e.g. cubic polynomials: φ(x) = [1 x x^2 x^3]^T, w = [w_0 w_1 w_2 w_3]^T, f(x) = φ(x)^T w. With a Gaussian prior on the weights, let K = < f f^T > = Cov(f); the vector f is jointly Gaussian.
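As a concrete illustration of this weight-space view, here is a minimal Matlab sketch (not from the slides); the cubic basis, the input grid, and the prior weight covariance Sigma_w are assumptions.

% Weight-space view: f = Phi*w with a Gaussian prior on w (all values assumed)
x = linspace(-2, 2, 50)';            % n = 50 input locations
Phi = [ones(size(x)) x x.^2 x.^3];   % n x 4 design matrix, rows are phi(x_i)'
Sigma_w = eye(4);                    % assumed prior on weights: w ~ N(0, Sigma_w)
K = Phi * Sigma_w * Phi';            % induced covariance K = <f f'> = Cov(f)
w = chol(Sigma_w)' * randn(4,1);     % draw weights from the prior
f = Phi * w;                         % one sample of the jointly Gaussian vector f
plot(x, f)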

Random Process View: non-parametric and infinite-dimensional. Imagine increasing the length of that f vector to infinity, with K_ij = k(x_i, x_j). Informally speaking, the infinite-dimensional vector f becomes a function f(x), and its covariance matrix K becomes a kernel function k(x_i, x_j); in the limit, f(x) becomes a stochastic Gaussian Process with mean function m(x) and kernel function k(x, x').

Our Beloved Gaussian [Figures: 1D and 2D Gaussian densities] by Carl Rasmussen

Gaussians beget Gaussians [Figures: 1D and 2D examples]. Also: the product of Gaussians is Gaussian, e.g. Gaussian prior x Gaussian likelihood ∝ Gaussian posterior. by Carl Rasmussen

Marginalization [Figure: joint Gaussian over (x, y), with a = mean(x), b = mean(y), A = cov(x, x), B = cov(x, y), C = cov(y, y).] by Carl Rasmussen

Conditional Gaussians: for jointly Gaussian (sub)vectors x and y, the conditional density of x given y is Gaussian, p(x | y) = N( a + B C^-1 (y - b), A - B C^-1 B^T ); the conditional covariance A - B C^-1 B^T is the Schur complement of C.
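A minimal Matlab sketch of this conditioning rule, using the block names a, b, A, B, C from the marginalization slide; the numerical values and the observed y are made-up assumptions.

% Condition a joint Gaussian [x; y] on y = y_obs (all numbers are assumed)
a = [0; 0];  b = 0;                    % means of x (2-dim) and y (1-dim)
A = [1.0 0.5; 0.5 1.0];                % cov(x, x)
B = [0.8; 0.3];                        % cov(x, y)
C = 1.2;                               % cov(y, y)
y_obs = 0.7;                           % observed value of y
condMean = a + B * (C \ (y_obs - b));  % a + B*inv(C)*(y - b)
condCov  = A - B * (C \ B');           % Schur complement: A - B*inv(C)*B'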

Definition of Gaussian Process: f(x) ~ GP( m(x), k(x, x') ) by Carl Rasmussen

1D GP sample (RBF)

2D GP sample (RBF)

Covariances (Kernels): the covariance is a function of the inputs, not a function of f. by Carl Rasmussen

Squared Exponential (RBF) Kernel by Carl Rasmussen
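For reference, a minimal Matlab sketch of the squared-exponential (RBF) kernel on a 1D grid; the length scale ell and signal standard deviation sf are assumed hyperparameters.

% Squared-exponential kernel k(x, x') = sf^2 * exp(-(x - x')^2 / (2*ell^2))
ell = 1.0;  sf = 1.0;                 % assumed hyperparameters
x = linspace(-3, 3, 100)';
D2 = (x - x').^2;                     % pairwise squared distances (implicit expansion)
K = sf^2 * exp(-D2 / (2*ell^2));      % covariance matrix with K_ij = k(x_i, x_j)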

Nonstationary Covariances: k(x, x') = tanh( a x·x' + b ) is not a valid (positive semi-definite) kernel! by Carl Rasmussen

Matern Class of Covariances by Carl Rasmussen

Matern Class of Covariances by Carl Rasmussen

Rational Quadratic Kernels by Carl Rasmussen

Rational Quadratic Kernels by Carl Rasmussen

Building New Covariances by Carl Rasmussen

Matlab Demo 1

Sampling from a GP
Matlab:
% given GP(fmean, K)
L = chol(K)';                    % lower-triangular factor, K = L*L'
while 1
    f = L*randn(n,1) + fmean;    % one draw f ~ N(fmean, K)
    plot(f)
    pause
end
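One possible setup for the demo above (the zero mean, the RBF covariance with unit length scale, and the small jitter term are assumptions); with these variables in the workspace, the loop on the slide draws and plots GP prior samples.

n = 100;
x = linspace(-5, 5, n)';
K = exp(-(x - x').^2 / 2) + 1e-8*eye(n);   % RBF covariance plus jitter for stability
fmean = zeros(n, 1);                        % zero mean function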

Demo 1: Sampling from the Prior [Figures: prior samples and prior covariance]

Conditional Gaussians: for jointly Gaussian (sub)vectors x and y, the conditional density of x given y has mean a + B C^-1 (y - b) and covariance A - B C^-1 B^T.

GPR with Gaussian Noise by Carl Rasmussen

GPR Prediction by Carl Rasmussen

GPR Prediction parameter by Carl Rasmussen
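For reference, the predictive equations behind these slides are the standard GPR results (Rasmussen & Williams); the notation here is an assumption: K is the training covariance, k_* the vector of covariances between a test input x_* and the training inputs, and σ_n^2 the noise variance.

\bar{f}_* = k_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}

\mathbb{V}[f_*] = k(x_*, x_*) - k_*^\top (K + \sigma_n^2 I)^{-1} k_*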

posterior mean +/- 2σ

posterior mean +/- 2σ

Matlab Demo 2

Posterior Sampling for GPR
% train set: (x, y);  K = k(x,x) plus noise variance on the diagonal
L = chol(K)';                        % K = L*L'
a = L'\(L\y);                        % a = inv(K)*y
ML = -y'*a/2 - sum(log(diag(L))) - n/2*log(2*pi);   % log marginal likelihood
% test set: xt (no noise);  Kt = k(x,xt),  Ktt = k(xt,xt)
ftmean = Kt'*a;                      % posterior mean
V = L\Kt;
Ktt = Ktt - V'*V;                    % posterior covariance: Ktt - Kt'*inv(K)*Kt
Lt = chol(Ktt)';
while 1
    ft = Lt*randn(m,1) + ftmean;
    plot(ft)
end

Demo 2: Sampling from the Posterior [Figures: prior covariance and posterior covariance]

All the Bayesics by Zoubin Ghahramani

GP Marginal Likelihood by Carl Rasmussen
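For reference, the GP log marginal likelihood maximized over hyperparameters θ on these slides is the standard expression; writing K_y = K + σ_n^2 I for the noisy training covariance (notation assumed):

\log p(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2}\, \mathbf{y}^\top K_y^{-1} \mathbf{y} - \tfrac{1}{2} \log |K_y| - \tfrac{n}{2} \log 2\pi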

ML for RBF: k(x, x') = exp( -(x - x')^2 / (2 ℓ^2) ) + σ^2 δ_xx' by Carl Rasmussen

Sparse GPs by Carl Rasmussen

Graphical Model of GPs [Figure legend: observed vs. unknown nodes] by Carl Rasmussen

Sparse GP [Figure legend: observed vs. unknown nodes]

Full GPR (all the data): y = K α

Sparse GPR (subset of data): y = K_ss α_s
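A minimal "subset of data" sketch of the sparse GPR idea above; the RBF kernel, the subset size m, the random selection rule, and the variables x, y, xt, sn2 are all assumptions (practical sparse GP methods use more careful inducing-point approximations).

% Subset-of-data sparse GPR (assumes training data x, y; test inputs xt; noise var sn2)
m   = 50;                                     % assumed size of the active subset
idx = randperm(length(x), m);                 % pick a random subset of training points
xs  = x(idx);  ys = y(idx);
Kss = exp(-(xs - xs').^2 / 2) + sn2*eye(m);   % m x m kernel on the subset, plus noise
alpha_s = Kss \ ys;                           % weights: ys = Kss * alpha_s
Kts = exp(-(xt - xs').^2 / 2);                % test-vs-subset cross-covariance
ftmean = Kts * alpha_s;                       % sparse predictive mean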

Advantages of GPs: uses probability theory (it's not a hack!); yields full predictive distributions p(y | x); can be a building block; posterior sampling; automatic learning of kernel parameters; principled (efficient) model selection; ideal for limited training data; good generalization performance.

The GP Bible (for ML folk): Rasmussen & Williams, Gaussian Processes for Machine Learning; all the chapters are available online!

Example: GPR Pseudocode
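The pseudocode itself did not survive the transcription; what follows is a minimal Matlab sketch in the spirit of the standard GPR recipe (Rasmussen & Williams, Algorithm 2.1), with the RBF kernel and the hyperparameters ell, sf, sn as assumptions.

% Inputs: training inputs x (n x 1), targets y (n x 1), test inputs xt (m x 1)
ell = 1.0;  sf = 1.0;  sn = 0.1;                    % assumed hyperparameters
k = @(a, b) sf^2 * exp(-(a - b').^2 / (2*ell^2));   % RBF kernel (column-vector inputs)
n = length(x);
K  = k(x, x) + sn^2*eye(n);     % noisy training covariance
L  = chol(K)';                  % K = L*L'
alpha = L'\(L\y);               % alpha = inv(K)*y
Ks = k(x, xt);                  % n x m cross-covariance
fmean = Ks'*alpha;              % predictive mean
v  = L\Ks;
fvar = diag(k(xt, xt)) - sum(v.^2, 1)';                   % predictive variance per test point
logML = -y'*alpha/2 - sum(log(diag(L))) - n/2*log(2*pi);  % log marginal likelihood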