The Laplace Approximation


Sargur N. Srihari, University at Buffalo, State University of New York, USA

Topics in Linear Models for Classification
Overview:
1. Discriminant Functions
2. Probabilistic Generative Models
3. Probabilistic Discriminative Models
4. The Laplace Approximation
5. Bayesian Logistic Regression

Topics in the Laplace Approximation
- Motivation
- Finding a Gaussian approximation: 1-D case
- Approximation in M-dimensional space
- Weakness of the Laplace approximation
- Model comparison using BIC

What is the Laplace Approximation?
The Laplace approximation framework aims to find a Gaussian approximation to a continuous probability distribution.

Why study the Laplace Approximation?
We shall discuss the Bayesian treatment of logistic regression, which is more complex than the Bayesian treatment of linear regression: in particular, we cannot integrate exactly. In machine learning we need to predict a distribution over the output, and this prediction may involve integration.

Bayesian Linear Regression
Recapitulate: in Bayesian linear regression we can integrate exactly.
Gaussian noise: p(t | x, w, β) = N(t | y(x, w), β^{-1})
Gaussian prior: p(w) = N(w | m_0, S_0), with S_0 = α^{-1} I
Gaussian posterior: p(w | t) = N(w | m_N, S_N), where m_N = β S_N Φ^T t and S_N^{-1} = S_0^{-1} + β Φ^T Φ = α I + β Φ^T Φ
Predictive distribution: p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw
Result of the integration: p(t | x, t, α, β) = N(t | m_N^T φ(x), σ_N^2(x)), where σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)
Equivalent kernel: y(x, m_N) = Σ_{n=1}^N k(x, x_n) t_n, with k(x, x') = β φ(x)^T S_N φ(x')
Here Φ is the N × M design matrix whose n-th row is (φ_0(x_n), ..., φ_{M-1}(x_n)).
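As a concrete illustration of the exact update recapped above, here is a minimal numerical sketch (not from the slides; the polynomial basis, toy data, hyperparameter values α and β, and the query point are made up for illustration) that computes m_N, S_N and the predictive mean and variance:

```python
import numpy as np

def posterior_and_predictive(Phi, t, alpha, beta, phi_star):
    """Exact Bayesian linear regression: posterior over w and predictive at phi_star."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # posterior precision S_N^{-1}
    S_N = np.linalg.inv(S_N_inv)                        # posterior covariance
    m_N = beta * S_N @ Phi.T @ t                        # posterior mean
    mean = phi_star @ m_N                               # predictive mean m_N^T phi(x)
    var = 1.0 / beta + phi_star @ S_N @ phi_star        # predictive variance sigma_N^2(x)
    return m_N, S_N, mean, var

# toy usage with a cubic polynomial basis (hypothetical data)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
Phi = np.vander(x, 4, increasing=True)                  # phi(x) = (1, x, x^2, x^3)
phi_star = np.array([1.0, 0.5, 0.25, 0.125])            # basis evaluated at x* = 0.5
m_N, S_N, mean, var = posterior_and_predictive(Phi, t, alpha=2.0, beta=25.0, phi_star=phi_star)
```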

Bayesian Logistic Regression
In the Bayesian treatment of logistic regression we cannot directly integrate over the parameter vector w, since the posterior is not Gaussian. It is therefore necessary to introduce some form of approximation. Later we will consider both analytical approximations and numerical sampling.

Approximation due to intractability
In Bayesian logistic regression the prediction p(C_k | x) involves integrating over w:
p(C_1 | φ, t) ≃ ∫ σ(w^T φ) p(w) dw
The convolution of a sigmoid with a Gaussian is intractable; it is not Gaussian-Gaussian as in linear regression, so we need to introduce methods of approximation.
Approaches:
- Analytical approximations (e.g. the Laplace approximation)
- Numerical sampling
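To see the intractability concretely, the following minimal sketch (with a hypothetical one-dimensional Gaussian posterior over w and a made-up feature value) evaluates the sigmoid-Gaussian integral by numerical quadrature, which is exactly the kind of integral that has no closed form and motivates the Laplace approximation:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

m, s, phi = 1.2, 0.8, 2.0          # assumed posterior mean/std over w and feature value
integrand = lambda w: sigmoid(w * phi) * norm.pdf(w, loc=m, scale=s)
p_c1, _ = integrate.quad(integrand, -np.inf, np.inf)
print(p_c1)                        # must be computed numerically; no Gaussian result
```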

Laplace approximation framework
A simple but widely used framework that aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. The method is aimed specifically at problems in which the distribution is unimodal. Consider first the case of a single continuous variable.

Laplace Approximation: 1-D case
Consider a single continuous variable z with distribution p(z) = (1/Z) f(z), where Z = ∫ f(z) dz is a normalization coefficient. The value of Z is unknown; f(z) is a scaled version of p(z). The goal is to find a Gaussian approximation q(z) that is centered on the mode of the distribution p(z). The first step is to find the mode of p(z), i.e. a point z_0 such that p'(z_0) = 0.
[Figure: p(z) and its Gaussian approximation q(z)]

Quadratic approximation using Taylor series
A Taylor series expansion of f(x) centered on the mode x_0 has the power series
f(x) = f(x_0) + (f'(x_0)/1!)(x − x_0) + (f''(x_0)/2!)(x − x_0)^2 + (f^{(3)}(x_0)/3!)(x − x_0)^3 + ...
where f(x) is assumed infinitely differentiable at x_0. Note that when x_0 is the maximum, f'(x_0) = 0 and the first-order term disappears. A Gaussian f(x) has the property that ln f(x) is a quadratic function of x, so we will use a Taylor series approximation of ln f(z) and keep only terms up to the quadratic.
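As a quick sanity check of the claim above (not on the slide): for a Gaussian the log density is exactly quadratic, so truncating the Taylor expansion at second order introduces no error, and the curvature recovers the precision:

```latex
\ln \mathcal{N}(z \mid \mu, \sigma^2)
  = -\tfrac{1}{2}\ln(2\pi\sigma^2) - \frac{(z-\mu)^2}{2\sigma^2}
  = \ln f(\mu) - \tfrac{1}{2}A\,(z-\mu)^2,
\qquad
A = -\left.\frac{d^2}{dz^2}\ln f(z)\right|_{z=\mu} = \frac{1}{\sigma^2}.
```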

Finding the mode of a distribution
Find a point z_0 such that p'(z_0) = 0, or equivalently df(z)/dz |_{z = z_0} = 0, and approximate f(z) using the second derivative at z_0. Since the logarithm of a Gaussian is quadratic, use a Taylor expansion of ln f(z) centered at the mode z_0:
ln f(z) ≃ ln f(z_0) − (1/2) A (z − z_0)^2, where A = −(d^2/dz^2) ln f(z) |_{z = z_0}
A is the negative second derivative of the logarithm of the scaled p(z). The first-order term does not appear, since z_0 is a local maximum and f'(z_0) = 0.

Final form of the Laplace approximation (one dimension)
Assuming ln f(z) to be quadratic, ln f(z) ≃ ln f(z_0) − (1/2) A (z − z_0)^2 with A = −(d^2/dz^2) ln f(z) |_{z = z_0}, and taking the exponential,
f(z) ≃ f(z_0) exp( −(A/2)(z − z_0)^2 )
To normalize f(z) we need Z:
Z = ∫ f(z) dz ≃ f(z_0) ∫ exp( −(A/2)(z − z_0)^2 ) dz = f(z_0) (2π)^{1/2} / A^{1/2}
The normalized version of f(z) is
q(z) = (1/Z) f(z) = (A/2π)^{1/2} exp( −(A/2)(z − z_0)^2 ) = N(z | z_0, A^{-1})
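The steps above translate directly into a small numerical routine. This is a minimal sketch (the function name, search bracket, and finite-difference step are my own choices, not from the slides): find the mode with an off-the-shelf optimizer, estimate A from the curvature of −ln f, and return the Gaussian q(z) = N(z | z_0, A^{-1}) together with the Laplace estimate of Z:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_approx_1d(f, bracket=(-10.0, 10.0), h=1e-4):
    """Laplace approximation of an unnormalized, unimodal 1-D density f(z)."""
    neg_log_f = lambda z: -np.log(f(z))
    # 1. find the mode z0 of f (equivalently the minimum of -ln f)
    z0 = minimize_scalar(neg_log_f, bounds=bracket, method="bounded").x
    # 2. precision A = -d^2/dz^2 ln f(z) at z0, by central finite differences
    A = (neg_log_f(z0 - h) - 2.0 * neg_log_f(z0) + neg_log_f(z0 + h)) / h**2
    # 3. Laplace estimate of the normalizer Z = ∫ f(z) dz
    Z = f(z0) * np.sqrt(2.0 * np.pi / A)
    return z0, 1.0 / A, Z   # mode, variance A^{-1}, and approximate Z
```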

Laplace Approximation Example
Applied to the distribution p(z) ∝ exp(−z^2/2) σ(20z + 4), where σ is the logistic sigmoid; that is, f(z) = exp(−z^2/2) σ(20z + 4) and Z = ∫ f(z) dz.
[Figure: left, p(z) with its Laplace (Gaussian) approximation centered on the mode of p(z); right, the corresponding negative logarithms]
The Gaussian approximation will only be well defined if its precision A > 0, i.e. the second derivative of f(z) at the point z_0 is negative.
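Using the 1-D helper sketched on the previous slide, this example can be reproduced numerically; the quadrature value of Z is computed only to check the Laplace estimate (the bracket and the printout are illustrative choices):

```python
import numpy as np
from scipy import integrate

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
f = lambda z: np.exp(-z**2 / 2.0) * sigmoid(20.0 * z + 4.0)

z0, var, Z_laplace = laplace_approx_1d(f, bracket=(-5.0, 5.0))  # helper from the previous sketch
Z_true, _ = integrate.quad(f, -np.inf, np.inf)
print(f"mode={z0:.3f}  variance={var:.3f}  Z_laplace={Z_laplace:.3f}  Z_quadrature={Z_true:.3f}")
```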

Laplace's Method
Approximate integrals of the form ∫_a^b e^{M f(x)} dx. Assume f(x) has a global maximum at x_0; then f(x_0) >> other values of f(x), and e^{M f(x)} grows exponentially with M, so it is enough to focus on f(x) near x_0. As M increases, the integral is well approximated by a Gaussian, with the second derivative appearing in the denominator. The proof involves a Taylor series expansion of f(x) at x_0.
[Figure: the integrand for M = 0.5 and M = 3]

Laplace Approximation: M dimensions
Task: approximate p(z) = f(z)/Z defined over an M-dimensional space z. At a stationary point z_0 the gradient ∇f(z) vanishes. Expanding around this point,
ln f(z) ≃ ln f(z_0) − (1/2)(z − z_0)^T A (z − z_0)
where A is the M × M Hessian matrix A = −∇∇ ln f(z) |_{z = z_0}. Taking exponentials,
f(z) ≃ f(z_0) exp( −(1/2)(z − z_0)^T A (z − z_0) )

Normalized multivariate Laplace approximation
Z = ∫ f(z) dz ≃ f(z_0) ∫ exp( −(1/2)(z − z_0)^T A (z − z_0) ) dz = f(z_0) (2π)^{M/2} / |A|^{1/2}
The distribution q(z) is proportional to f(z):
q(z) = (1/Z) f(z) = (|A|^{1/2} / (2π)^{M/2}) exp( −(1/2)(z − z_0)^T A (z − z_0) ) = N(z | z_0, A^{-1})
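The same recipe works in M dimensions. The sketch below (function name and finite-difference scheme are my own; it assumes the caller supplies ln f and a starting point) finds z_0 with a generic optimizer, builds the Hessian A = −∇∇ ln f(z_0) numerically, and returns the mean, the covariance A^{-1}, and ln Z:

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approx(log_f, z_init, h=1e-4):
    """Multivariate Laplace approximation of an unnormalized density exp(log_f(z))."""
    neg = lambda z: -log_f(np.asarray(z, dtype=float))
    z0 = minimize(neg, z_init).x                      # mode of f
    M = z0.size
    A = np.zeros((M, M))                              # Hessian of -ln f at the mode
    for i in range(M):
        for j in range(M):
            ei = np.zeros(M); ei[i] = h
            ej = np.zeros(M); ej[j] = h
            A[i, j] = (neg(z0 + ei + ej) - neg(z0 + ei - ej)
                       - neg(z0 - ei + ej) + neg(z0 - ei - ej)) / (4.0 * h * h)
    cov = np.linalg.inv(A)                            # covariance of q(z) = N(z | z0, A^{-1})
    logdet = np.linalg.slogdet(A)[1]
    log_Z = log_f(z0) + 0.5 * M * np.log(2.0 * np.pi) - 0.5 * logdet
    return z0, cov, log_Z
```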

Steps in Applying the Laplace Approximation
1. Find the mode z_0 (run a numerical optimization algorithm).
2. Evaluate the Hessian matrix A at that mode.
Many distributions encountered in practice are multimodal, so there will be different approximations depending on which mode is considered. The normalizer Z of the true distribution need not be known to apply the Laplace method.

Weaknesses of the Laplace Approximation
- Directly applicable only to real variables, since it is based on the Gaussian distribution. It may be applicable to a transformed variable: if 0 ≤ τ < ∞, consider the Laplace approximation of ln τ.
- Based purely on the behavior at a specific value of the variable (the mode); variational methods have a more global perspective.

Approximating Z
As well as approximating the distribution p(z), we can also obtain an approximation to the normalizing constant Z:
Z = ∫ f(z) dz ≃ f(z_0) ∫ exp( −(1/2)(z − z_0)^T A (z − z_0) ) dz = f(z_0) (2π)^{M/2} / |A|^{1/2}
We can use this result to obtain an approximation to the model evidence, which plays a central role in Bayesian model comparison.

Model Comparison and BIC
Consider a data set D and models {M_i} having parameters {θ_i}. For each model, define the likelihood p(D | θ_i, M_i) and introduce a prior p(θ_i | M_i) over the parameters. We need the model evidence p(D | M_i) for the various models.

Model Comparison and BIC
For a given model, from the sum rule, p(D) = ∫ p(D | θ) p(θ) dθ. Identifying f(θ) = p(D | θ) p(θ) and Z = p(D), and using Z ≃ f(z_0) (2π)^{M/2} / |A|^{1/2}, we obtain
ln p(D) ≃ ln [ f(θ_MAP) (2π)^{M/2} / |A|^{1/2} ] = ln p(D | θ_MAP) + ln p(θ_MAP) + (M/2) ln 2π − (1/2) ln |A|
The last three terms form the Occam factor, which penalizes model complexity. Here θ_MAP is the value of θ at the mode of the posterior, and A is the Hessian of second derivatives of the negative log posterior.
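Expressed as code, the evidence approximation above is a short formula once the MAP quantities are available. In this minimal sketch the caller is assumed to supply ln p(D | θ_MAP), ln p(θ_MAP), and the Hessian A of the negative log posterior (e.g. from the multivariate routine sketched earlier):

```python
import numpy as np

def log_evidence_laplace(log_lik_map, log_prior_map, A):
    """ln p(D) ≈ ln p(D|θ_MAP) + ln p(θ_MAP) + (M/2) ln 2π - (1/2) ln|A|."""
    M = A.shape[0]
    sign, logdet = np.linalg.slogdet(A)
    assert sign > 0, "A must be positive definite at the mode"
    occam_factor = log_prior_map + 0.5 * M * np.log(2.0 * np.pi) - 0.5 * logdet
    return log_lik_map + occam_factor
```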

Bayesian Information Criterion (BIC)
Assuming a broad Gaussian prior over the parameters and a Hessian of full rank, the approximate model evidence is
ln p(D) ≃ ln p(D | θ_MAP) − (1/2) M ln N
where N is the number of data points and M is the number of parameters in θ. Compared to the AIC, given by ln p(D | w_ML) − M, the BIC penalizes model complexity more heavily.
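For comparison, the two criteria can be written side by side; in this minimal sketch the log likelihoods at θ_MAP and w_ML are assumed to have been computed elsewhere:

```python
import numpy as np

def bic(log_lik_map, M, N):
    # BIC approximation to the log evidence: penalty grows with ln N
    return log_lik_map - 0.5 * M * np.log(N)

def aic(log_lik_ml, M):
    # AIC: constant penalty of 1 per parameter
    return log_lik_ml - M

# e.g. with N = 1000 data points the BIC penalty per parameter is 0.5*ln(1000) ≈ 3.45 > 1
```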

Weaknesses of AIC and BIC
AIC and BIC are easy to evaluate, but they can give misleading results, since the Hessian matrix may not have full rank when many parameters are not well determined. A more accurate estimate can be obtained from
ln p(D) ≃ ln p(D | θ_MAP) + ln p(θ_MAP) + (M/2) ln 2π − (1/2) ln |A|
This has been used in the context of neural networks.

Summary
- The Bayesian approach to logistic regression is more complex than for linear regression, since the posterior over w is not Gaussian.
- The predictive distribution requires an integral over the parameters, which is simplified when the posterior is Gaussian.
- The Laplace approximation fits the best Gaussian centered on the mode; it is defined for both the univariate and multivariate cases.
- The normalization term yields an approximation to the model evidence that is useful as the BIC criterion.
- AIC and BIC are simple but not accurate.