
Student-t Process as Alternative to Gaussian Processes: Discussion

Paper by A. Shah, A. G. Wilson and Z. Ghahramani
Discussion by R. Henao, Duke University, June 20, 2014

Contributions

The paper addresses the following aspects of Student-t processes:

- Defines and motivates inverse Wishart processes (IWP).
- Proposes a Student-t process (TP) derived from a hierarchical GP model.
- Shows that the predictive covariance of a TP depends on the training observations.
- Shows that the TP is the most general elliptically symmetric process with analytic marginal and predictive distributions.
- Derives a new sampling strategy for the IWP.
- Shows that a TP noise model can separate signal from noise analytically.
- Empirically demonstrates non-trivial differences between GPs and TPs.

Inverse Wishart Distribution

Definition (Dawid, 1981). $\Sigma \in \Pi(n)$ (the set of $n \times n$ positive definite matrices) has an inverse Wishart distribution with parameters $\nu \in \mathbb{R}_+$ and $K \in \Pi(n)$, written $\Sigma \sim \mathrm{IW}_n(\nu, K)$, if its density is

$$p(\Sigma) = c_n(\nu, K)\, |\Sigma|^{-(\nu+2n)/2} \exp\left\{-\tfrac{1}{2}\operatorname{trace}(K\Sigma^{-1})\right\},
\quad\text{with}\quad
c_n(\nu, K) = \frac{|K|^{(\nu+n-1)/2}}{2^{(\nu+n-1)n/2}\,\Gamma_n((\nu+n-1)/2)}.$$

Some properties:

- When $\nu > 2$, $\mathbb{E}[\Sigma] = (\nu-2)^{-1}K$.
- The Wishart and inverse Wishart distributions place prior mass on every $\Sigma \in \Pi(n)$.
- $\Sigma \sim \mathrm{W}_n(\nu, K)$ iff $\Sigma^{-1} \sim \mathrm{IW}_n(\nu - n + 1, K^{-1})$.
- Dawid (1981) showed that the inverse Wishart distribution is closed under marginalization: if $\Sigma \sim \mathrm{IW}_n(\nu, K)$, then $\Sigma_{11} \sim \mathrm{IW}_{n_1}(\nu, K_{11})$.
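
As a sanity check on this parameterization, here is a minimal sketch using SciPy's invwishart (my addition, not part of the slides). SciPy uses the degrees-of-freedom convention, so Dawid's $\mathrm{IW}_n(\nu, K)$ corresponds to df $= \nu + n - 1$:

```python
import numpy as np
from scipy.stats import invwishart

n, nu = 3, 5.0                            # dimension and Dawid-style parameter nu > 2
K = np.eye(n) + 0.3 * np.ones((n, n))     # an arbitrary positive definite scale matrix

# Dawid's IW_n(nu, K) corresponds to SciPy's invwishart with df = nu + n - 1.
samples = invwishart(df=nu + n - 1, scale=K).rvs(size=200_000, random_state=0)

print(np.round(samples.mean(axis=0), 3))  # empirical mean
print(np.round(K / (nu - 2), 3))          # E[Sigma] = K / (nu - 2)
```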

Inverse Wishart Process

Definition. $\sigma$ is an inverse Wishart process on $\mathcal{X}$ with parameters $\nu \in \mathbb{R}_+$ and base kernel $k_\theta : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ if for any finite collection $x_1, \ldots, x_n \in \mathcal{X}$, $\sigma(x_1, \ldots, x_n) \sim \mathrm{IW}_n(\nu, K)$, where $K \in \Pi(n)$ with $K_{ij} = k_\theta(x_i, x_j)$. We write $\sigma \sim \mathrm{IWP}(\nu, k_\theta)$.

Generative model:

$$\sigma \sim \mathrm{IWP}(\nu, k_\theta), \qquad y \mid \sigma \sim \mathcal{GP}(\phi, (\nu-2)\sigma),$$

where $\phi : \mathcal{X} \to \mathbb{R}$. For data $y = [y_1 \ldots y_n]^\top$ with $\phi = [\phi(x_1) \ldots \phi(x_n)]^\top$ and $\Sigma = \sigma(x_1, \ldots, x_n)$,

$$p(y \mid \phi, \nu, K) = \int p(y \mid \phi, \Sigma)\, p(\Sigma \mid \nu, K)\, d\Sigma \propto \left(1 + \tfrac{1}{\nu-2}(y-\phi)^\top K^{-1}(y-\phi)\right)^{-(\nu+n)/2}.$$
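
To make the generative model concrete, a minimal sketch of one finite-dimensional draw (the squared exponential kernel, the inputs, and the zero mean are placeholder assumptions; SciPy's df convention as above):

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
nu = 5.0

# Base kernel matrix K_ij = k_theta(x_i, x_j), with jitter for numerical stability.
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.1 ** 2) + 1e-8 * np.eye(len(x))
phi = np.zeros(len(x))                    # zero mean function

# sigma ~ IWP(nu, k_theta): the finite-dimensional marginal is IW_n(nu, K).
Sigma = invwishart(df=nu + len(x) - 1, scale=K).rvs(random_state=rng)

# y | sigma ~ GP(phi, (nu - 2) sigma): finite-dimensional draw.
y = rng.multivariate_normal(phi, (nu - 2) * Sigma)
```

Marginalizing $\Sigma$ as in the integral above yields exactly the MVT density defined on the next slide.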

Student-t Process

Definition. $y \in \mathbb{R}^n$ has a multivariate Student-t distribution with parameters $\nu \in \mathbb{R}_+ \setminus [0,2]$, $\phi \in \mathbb{R}^n$ and $K \in \Pi(n)$ if it has density

$$p(y) = \frac{\Gamma((\nu+n)/2)}{(\nu-2)^{n/2}\,\pi^{n/2}\,\Gamma(\nu/2)}\, |K|^{-1/2} \left(1 + \tfrac{1}{\nu-2}(y-\phi)^\top K^{-1}(y-\phi)\right)^{-(\nu+n)/2}.$$

We write $y \sim \mathrm{MVT}_n(\nu, \phi, K)$.

Some properties:

- $\mathbb{E}[y] = \mathbb{E}[\mathbb{E}[y \mid \Sigma]] = \phi$.
- $\mathrm{cov}[y] = \mathbb{E}[\mathrm{cov}(y \mid \Sigma)] = \mathbb{E}[(\nu-2)\Sigma] = K$.

Lemma. The MVT is closed under marginalization.

Definition. $f$ is a Student-t process on $\mathcal{X}$ with parameters $\nu > 2$, mean function $\phi : \mathcal{X} \to \mathbb{R}$ and kernel function $k_\theta : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ if any finite collection of function values $[f(x_1) \ldots f(x_n)]^\top \sim \mathrm{MVT}_n(\nu, \phi, K)$, where $K \in \Pi(n)$ with $K_{ij} = k_\theta(x_i, x_j)$ and $\phi \in \mathbb{R}^n$ with $\phi_i = \phi(x_i)$. We write $f \sim \mathcal{TP}(\nu, \phi, k_\theta)$.
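
The density transcribes directly into code; below is a minimal sketch (the Cholesky-based evaluation is my choice, not from the slides):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.special import gammaln

def mvt_logpdf(y, nu, phi, K):
    """Log density of MVT_n(nu, phi, K) in the (nu - 2)-scaled parameterization."""
    n = len(y)
    L = cho_factor(K, lower=True)
    beta = (y - phi) @ cho_solve(L, y - phi)      # (y - phi)^T K^{-1} (y - phi)
    logdet_K = 2.0 * np.sum(np.log(np.diag(L[0])))
    return (gammaln((nu + n) / 2) - gammaln(nu / 2)
            - 0.5 * n * np.log((nu - 2) * np.pi)
            - 0.5 * logdet_K
            - 0.5 * (nu + n) * np.log1p(beta / (nu - 2)))
```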

Relation to Gaussian Process

The GP is a limiting special case of the TP.

Lemma. Suppose $f \sim \mathcal{TP}(\nu, \phi, k_\theta)$ and $g \sim \mathcal{GP}(\phi, k_\theta)$. Then $f$ tends to $g$ in distribution as $\nu \to \infty$.

Thus $\nu$ controls how heavy the tails of the distribution are: small $\nu$ gives heavy tails, and $\nu \to \infty$ recovers the GP.
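
The limit is easy to see numerically by reusing the mvt_logpdf sketch above and comparing against SciPy's Gaussian log density:

```python
import numpy as np
from scipy.stats import multivariate_normal

y = np.array([0.3, -1.2, 0.5])
phi = np.zeros(3)
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])

for nu in [3.0, 10.0, 100.0, 10_000.0]:
    print(nu, mvt_logpdf(y, nu, phi, K))   # approaches the Gaussian value as nu grows
print("GP:", multivariate_normal(phi, K).logpdf(y))
```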

Conditional Distribution

Lemma. Suppose $y \sim \mathrm{MVT}_n(\nu, \phi, K)$ and partition $y = [y_1\; y_2]^\top$ with $y_1 \in \mathbb{R}^{n_1}$ and $y_2 \in \mathbb{R}^{n_2}$. Then

$$y_2 \mid y_1 \sim \mathrm{MVT}_{n_2}\!\left(\nu + n_1,\; \tilde{\phi}_2,\; \frac{\nu + \beta_1 - 2}{\nu + n_1 - 2}\,\tilde{K}_{22}\right),$$

where

- $\tilde{\phi}_2 = K_{21} K_{11}^{-1} (y_1 - \phi_1) + \phi_2$,
- $\beta_1 = (y_1 - \phi_1)^\top K_{11}^{-1} (y_1 - \phi_1)$,
- $\tilde{K}_{22} = K_{22} - K_{21} K_{11}^{-1} K_{12}$.

Hence $\mathbb{E}[y_2 \mid y_1] = \tilde{\phi}_2$ and $\mathrm{cov}[y_2 \mid y_1] = \frac{\nu + \beta_1 - 2}{\nu + n_1 - 2}\,\tilde{K}_{22}$. The predictive covariance of $y_2$ depends on $y_1$ via $\beta_1$.
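
In code, the TP predictive equations are the familiar GP ones plus the $\beta_1$-dependent rescaling; a minimal sketch following the lemma (variable names are mine):

```python
import numpy as np

def tp_predict(y1, nu, phi1, phi2, K11, K12, K22):
    """Return the parameters (df, mean, covariance) of y2 | y1."""
    n1 = len(y1)
    alpha = np.linalg.solve(K11, y1 - phi1)        # K11^{-1} (y1 - phi1)
    mean = K12.T @ alpha + phi2                    # phi_tilde_2 (same as a GP)
    beta1 = (y1 - phi1) @ alpha                    # Mahalanobis term
    K22_tilde = K22 - K12.T @ np.linalg.solve(K11, K12)
    scale = (nu + beta1 - 2) / (nu + n1 - 2)       # data-dependent factor
    return nu + n1, mean, scale * K22_tilde
```

Since $\mathbb{E}[\beta_1] = n_1$ under the model, the factor shrinks the predictive covariance when the training data look unsurprising ($\beta_1 < n_1$) and inflates it when they look outlying, which a GP cannot do.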

Another Covariance Prior

Scale mixture of Gaussians construction (Yu et al., 2007):

$$r^{-1} \sim \mathrm{Gamma}(\nu/2, \rho/2), \qquad y \mid r \sim \mathcal{N}(\phi,\; r(\nu-2)K/\rho),$$

where $K \in \Pi(n)$, $\phi \in \mathbb{R}^n$, $\nu > 2$, $\rho > 0$, and marginally $y \sim \mathrm{MVT}_n(\nu, \phi, K)$. Besides,

$$r^{-1} \mid y \sim \mathrm{Gamma}\!\left(\frac{\nu+n}{2},\; \frac{\rho}{2}\left(1 + \frac{\beta}{\nu-2}\right)\right), \qquad \beta = (y-\phi)^\top K^{-1}(y-\phi),$$

hence $\mathbb{E}[r(\nu-2)/\rho \mid y] = \frac{\nu+\beta-2}{\nu+n-2}$, matching the scale factor in the TP predictive covariance.
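
A quick Monte Carlo check of the construction (a sketch; note NumPy's gamma takes shape and scale, so rate $\rho/2$ becomes scale $2/\rho$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, nu, rho, m = 3, 6.0, 2.0, 200_000
K = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
phi = np.zeros(n)

r = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / rho, size=m)  # r^{-1} ~ Gamma(nu/2, rho/2)
z = rng.standard_normal((m, n)) @ np.linalg.cholesky(K).T   # z ~ N(0, K)
y = phi + np.sqrt(r * (nu - 2) / rho)[:, None] * z          # y | r ~ N(phi, r (nu-2) K / rho)

print(np.round(np.cov(y.T), 2))  # approaches K, consistent with y ~ MVT_n(nu, phi, K)
```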

Elliptical Processes

Definition. $y \in \mathbb{R}^n$ is elliptically symmetric iff there exist $\mu \in \mathbb{R}^n$, a non-negative random variable $R$, an $n \times d$ matrix $\Omega$ with maximal rank $d$, and $u$ uniformly distributed on the unit sphere in $\mathbb{R}^d$, independent of $R$, such that $y \stackrel{D}{=} \mu + R\,\Omega u$.

Lemma (Fang et al., 1989). Suppose $R_1 \sim \chi^2(n)$ and $R_2 \sim \mathrm{Gamma}^{-1}(\nu/2, 1/2)$ independently.

- If $R = \sqrt{R_1}$, then $y$ is Gaussian distributed.
- If $R = \sqrt{(\nu-2)R_1 R_2}$, then $y$ is MVT distributed.

Definition. Let $Y = \{y_i\}$ be a countable family of random variables. It is an elliptical process if any finite subset of them is jointly elliptically symmetric.

Theorem (Kelker, 1970). Suppose $Y = \{y_i\}$ is an elliptical process. Any finite collection $z = \{z_1, \ldots, z_n\} \subset Y$ has a density iff there exists a non-negative random variable $r$ such that $z \mid r \sim \mathcal{N}(\mu, r\,\Omega\Omega^\top)$.

Corollary. Suppose $Y = \{y_i\}$ is an elliptical process. Any finite collection $z = \{z_1, \ldots, z_n\} \subset Y$ has an analytically representable density iff $Y$ is a Gaussian process or a Student-t process.
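
The MVT branch of the Fang et al. construction, written out as a sketch (taking $\Omega$ to be a Cholesky factor of the target covariance $K$ is my assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, nu, m = 3, 6.0, 200_000
K = np.array([[1.0, 0.4, 0.2],
              [0.4, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
mu = np.zeros(n)
Omega = np.linalg.cholesky(K)

g = rng.standard_normal((m, n))
u = g / np.linalg.norm(g, axis=1, keepdims=True)        # uniform on the unit sphere
R1 = rng.chisquare(df=n, size=m)                        # R1 ~ chi^2(n)
R2 = 1.0 / rng.gamma(shape=nu / 2, scale=2.0, size=m)   # R2 ~ Gamma^{-1}(nu/2, 1/2)
R = np.sqrt((nu - 2) * R1 * R2)

y = mu + R[:, None] * (u @ Omega.T)                     # y = mu + R * Omega u
print(np.round(np.cov(y.T), 2))                         # approaches K
```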

A New Way to Sample the IWP

Theorem. Let $\Sigma \in \Pi(n)$ with eigenvalues $\{\lambda_1, \ldots, \lambda_n\}$. There exists $Q \in \Xi(n)$ such that $\Sigma = Q \Lambda Q^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$.

Using the facts that $Q^\top Q = I$ and $|AB| = |BA|$,

$$p(\Sigma)\, d\Sigma = p(Q \Lambda Q^\top)\, |J(\Sigma; Q, \Lambda)|\, d\Lambda\, dQ \propto \prod_{i=1}^n \lambda_i^{-(\nu+2n)/2} e^{-\frac{1}{2\lambda_i}} \prod_{i<j} |\lambda_i - \lambda_j| \prod_{i=1}^n d\lambda_i\, dQ,$$

thus $Q$ is uniformly distributed over $\Xi(n)$ and the $\lambda_i$ are exchangeable. We draw $\Sigma = Q \Lambda Q^\top$ (Dawid, 1977) via $Q \sim \Upsilon_{n,n}$ and $\Lambda \sim \Theta_n(\nu)$.
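
The $\Lambda$ step requires sampling the exchangeable eigenvalues from the coupled density above, which the slides defer to Dawid (1977); the sketch below only verifies the decomposition and shows a standard way to draw a Haar-uniform $Q$ (via scipy.stats.ortho_group), not the full sampler:

```python
import numpy as np
from scipy.stats import invwishart, ortho_group

rng = np.random.default_rng(3)
n, nu = 4, 5.0

# Any Sigma in Pi(n) factors as Q Lambda Q^T with Q orthogonal.
Sigma = invwishart(df=nu + n - 1, scale=np.eye(n)).rvs(random_state=rng)
lam, Q = np.linalg.eigh(Sigma)
assert np.allclose(Q @ np.diag(lam) @ Q.T, Sigma)

# A Haar-uniform (rotation-invariant) draw of Q over the orthogonal group.
Q_haar = ortho_group.rvs(dim=n, random_state=rng)
```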

Modeling Noisy Functions

A common practice for GPs is

$$y = f + \epsilon, \qquad f \sim \mathcal{GP}(\phi, k_\theta), \qquad \epsilon \sim \mathcal{N}(0, \psi I).$$

This model is tractable because Gaussian distributions are closed under addition. The MVT is not closed under addition, but we can instead absorb the noise into the kernel, $\tilde{k} = k_\theta + \psi\delta$. Empirically,

$$\mathrm{MVT}_n(\nu, 0, K) + \mathrm{MVT}_n(\nu, 0, \psi I) \approx \mathrm{MVT}_n(\nu, 0, K + \psi I).$$
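
A rough empirical check of that approximation on one marginal (a sketch; sampling via the chi-squared scale mixture and comparing with a two-sample KS statistic are my choices):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(4)
n, nu, psi, m = 3, 5.0, 0.5, 100_000
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])

def mvt_rvs(nu, K, size, rng):
    """Draws from MVT_n(nu, 0, K) via the chi-squared scale mixture."""
    z = rng.standard_normal((size, K.shape[0])) @ np.linalg.cholesky(K).T
    w = rng.chisquare(df=nu, size=size)
    return z * np.sqrt((nu - 2) / w)[:, None]

lhs = mvt_rvs(nu, K, m, rng) + mvt_rvs(nu, psi * np.eye(n), m, rng)
rhs = mvt_rvs(nu, K + psi * np.eye(n), m, rng)
print(ks_2samp(lhs[:, 0], rhs[:, 0]))  # a small KS statistic: the marginals are close
```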

Experiments

Regression:
- Squared exponential kernel function plus delta (noise) term.
- Hamiltonian Monte Carlo sampling for $\theta$.
- 2 artificial and 3 benchmark datasets.
- Performance measures: MSE and log-likelihood.

Bayesian optimization:
- ARD Matérn kernel function plus delta (noise) term.
- Slice sampling for $\theta$.
- 3 benchmark functions.
- Performance measure: minimum function value vs. number of function evaluations.