Bayesian Gaussian / Linear Models

Read Sections 2.3.3 and 3.3 in the text by Bishop.

Multivariate Gaussian Model with Multivariate Gaussian Prior

Suppose we model the observed vector $b$ as having a multivariate Gaussian distribution with known covariance matrix $B$ and unknown mean $x$. We give $x$ a multivariate Gaussian prior with known covariance matrix $A$ and known mean $a$. The posterior distribution of $x$ will be Gaussian, since the product of the prior density and the likelihood is proportional to the exponential of a quadratic function of $x$:

$$\text{Prior} \times \text{Likelihood} \;\propto\; \exp\!\big(-(x-a)^T A^{-1}(x-a)/2\big)\, \exp\!\big(-(b-x)^T B^{-1}(b-x)/2\big)$$

The log posterior density is this quadratic function (where "$+\cdots$" denotes parts not involving $x$):

$$-\tfrac{1}{2}\big[(x-a)^T A^{-1}(x-a) + (b-x)^T B^{-1}(b-x)\big] + \cdots
= -\tfrac{1}{2}\big[x^T (A^{-1}+B^{-1}) x - 2 x^T (A^{-1} a + B^{-1} b)\big] + \cdots
= -\tfrac{1}{2}(x-c)^T (A^{-1}+B^{-1}) (x-c) + \cdots$$

where $c = (A^{-1}+B^{-1})^{-1}(A^{-1} a + B^{-1} b)$. This is the log density for a Gaussian distribution with mean $c$ and covariance matrix $(A^{-1}+B^{-1})^{-1}$.
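As a quick illustration, here is a minimal NumPy sketch of this conjugate Gaussian update; the particular vectors and covariance matrices are arbitrary placeholders of my own, not values from the slides.

```python
import numpy as np

# Hypothetical 2-D example: prior N(a, A) on the mean x, one observation b ~ N(x, B).
a = np.array([0.0, 0.0])          # prior mean (known)
A = np.diag([4.0, 4.0])           # prior covariance (known)
b = np.array([1.0, 2.0])          # observed vector
B = np.diag([1.0, 1.0])           # observation covariance (known)

A_inv = np.linalg.inv(A)
B_inv = np.linalg.inv(B)

post_cov = np.linalg.inv(A_inv + B_inv)            # (A^{-1} + B^{-1})^{-1}
post_mean = post_cov @ (A_inv @ a + B_inv @ b)     # c = (A^{-1}+B^{-1})^{-1}(A^{-1}a + B^{-1}b)

print("posterior mean:", post_mean)
print("posterior covariance:\n", post_cov)
```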

Bayesian Linear Basis Function Model

Recall the linear basis function model, which we can write as

$$t \sim N(\Phi w,\ \sigma^2 I)$$

where here

- $t$ is the vector of observed targets
- $w$ is the vector of regression coefficients
- $\sigma^2$ is the noise variance
- $\Phi$ is the matrix of basis function values in the training cases.

Suppose that our prior for $w$ is $N(m_0, S_0)$. This is a conjugate prior, with the posterior for $w$ also being normal.
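For concreteness, here is a small sketch of how the design matrix $\Phi$ might be built. The slides do not fix a particular basis; Gaussian basis functions with width $s$ (the parameter mentioned later) are used here purely for illustration, with centres chosen arbitrarily.

```python
import numpy as np

def design_matrix(x, centres, s=1.0):
    """Phi[i, j] = phi_j(x_i): a column of ones (intercept) plus Gaussian bumps.

    Gaussian basis functions are only one possible choice; the centres and
    width s here are illustrative, not prescribed by the slides.
    """
    phi = [np.ones_like(x)]                        # phi_0(x) = 1 for the intercept w_0
    for c in centres:
        phi.append(np.exp(-0.5 * ((x - c) / s) ** 2))
    return np.column_stack(phi)

x_train = np.linspace(0.0, 10.0, 25)               # hypothetical 1-D inputs
Phi = design_matrix(x_train, centres=np.linspace(0.0, 10.0, 5), s=2.0)
print(Phi.shape)                                    # (25, 6): N training cases by M basis functions
```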

Posterior for Linear Basis Function Model

Both the log prior and the log likelihood are quadratic functions of $w$. The log likelihood for $w$ is

$$-\frac{1}{2}(t - \Phi w)^T (\sigma^2 I)^{-1} (t - \Phi w) + \cdots
= -\frac{1}{2\sigma^2}\big[\, w^T \Phi^T \Phi w - 2 w^T \Phi^T t \,\big] + \cdots$$

which is the same quadratic function of $w$ as for a Gaussian log density with covariance $\sigma^2 (\Phi^T \Phi)^{-1}$ and mean $(\Phi^T \Phi)^{-1} \Phi^T t$.

This combines with the prior for $w$ in the same way as seen earlier, with the result that the posterior distribution for $w$ is Gaussian with covariance

$$S_N = \big[\, S_0^{-1} + \big(\sigma^2 (\Phi^T \Phi)^{-1}\big)^{-1} \big]^{-1} = \big[\, S_0^{-1} + (1/\sigma^2)\, \Phi^T \Phi \,\big]^{-1}$$

and mean

$$m_N = S_N \big[\, S_0^{-1} m_0 + (1/\sigma^2)\, \Phi^T \Phi (\Phi^T \Phi)^{-1} \Phi^T t \,\big] = S_N \big[\, S_0^{-1} m_0 + (1/\sigma^2)\, \Phi^T t \,\big]$$
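A minimal NumPy sketch of this posterior update, using a simple straight-line basis and synthetic data of my own choosing:

```python
import numpy as np

def posterior_w(Phi, t, m0, S0, sigma2):
    """Posterior N(m_N, S_N) for the weights under prior N(m0, S0) and noise variance sigma2."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + (Phi.T @ Phi) / sigma2)   # S_N = [S_0^{-1} + (1/sigma^2) Phi^T Phi]^{-1}
    mN = SN @ (S0_inv @ m0 + (Phi.T @ t) / sigma2)        # m_N = S_N [S_0^{-1} m_0 + (1/sigma^2) Phi^T t]
    return mN, SN

# Hypothetical example: basis phi(x) = (1, x), targets from a noisy straight line.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(len(x))
mN, SN = posterior_w(Phi, t, m0=np.zeros(2), S0=10.0 * np.eye(2), sigma2=0.01)
print(mN)        # posterior mean of (w_0, w_1), close to (1, 2)
```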

Predictive Distribution for a Test Case

We can write the target, $t$, for some new case with inputs $x$ as

$$t = \phi(x)^T w + n$$

where the noise $n$ has the $N(0, \sigma^2)$ distribution, independently of $w$. Since the posterior distribution for $w$ is $N(m_N, S_N)$, the posterior distribution for $\phi(x)^T w$ will be $N\big(\phi(x)^T m_N,\ \phi(x)^T S_N\, \phi(x)\big)$. Hence the predictive distribution for $t$ will be

$$N\big(\phi(x)^T m_N,\ \phi(x)^T S_N\, \phi(x) + \sigma^2\big)$$
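In code, the predictive mean and variance follow directly from $m_N$ and $S_N$. A sketch, where phi_x is the test case's vector of basis function values and the posterior values are placeholders:

```python
import numpy as np

def predict(phi_x, mN, SN, sigma2):
    """Predictive mean and variance for the target at a test case with basis vector phi_x."""
    mean = phi_x @ mN                                # phi(x)^T m_N
    var = phi_x @ SN @ phi_x + sigma2                # phi(x)^T S_N phi(x) + sigma^2
    return mean, var

# Hypothetical use with a straight-line basis phi(x) = (1, x) and some posterior (mN, SN):
mN = np.array([1.0, 2.0])
SN = 0.01 * np.eye(2)
mean, var = predict(np.array([1.0, 0.5]), mN, SN, sigma2=0.01)
print(mean, np.sqrt(var))                            # predictive mean and standard deviation
```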

Comparison with Regularized Estimates

The Bayesian predictive mean for a test case is what we would get using the posterior mean value for the regression coefficients (since the model is linear in the parameters). We can compare the Bayesian mean prediction with the prediction using the regularized (maximum penalized likelihood) estimate for $w$, which is

$$\hat{w} = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$$

where $I$ is like the identity matrix except that $I_{1,1} = 0$.

Compare with the posterior mean, if we set the prior mean, $m_0$, to zero:

$$m_N = S_N\, (1/\sigma^2)\, \Phi^T t
= \big(S_0^{-1} + (1/\sigma^2)\,\Phi^T\Phi\big)^{-1} (1/\sigma^2)\, \Phi^T t
= \big(\sigma^2 S_0^{-1} + \Phi^T\Phi\big)^{-1} \Phi^T t$$

If $S_0^{-1} = (1/\omega^2)\, I$, then these are the same, with $\lambda = \sigma^2/\omega^2$. This corresponds to a prior for $w$ in which the $w_j$ are independent, all with variance $\omega^2$, except that $w_0$ (the intercept) has infinite variance.
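A small numerical check of this equivalence, as a sketch on arbitrary synthetic data. The first column of Phi is the intercept, so the modified identity matrix has a zero in its (1,1) entry (index [0,0] in code).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
Phi = np.column_stack([np.ones_like(x), x, x**2])      # intercept plus two basis functions
t = 0.5 - 1.0 * x + 2.0 * x**2 + 0.1 * rng.standard_normal(len(x))

sigma2, omega2 = 0.01, 4.0
lam = sigma2 / omega2

I_mod = np.eye(Phi.shape[1])
I_mod[0, 0] = 0.0                                      # do not penalize the intercept w_0

# Regularized (maximum penalized likelihood) estimate.
w_hat = np.linalg.solve(lam * I_mod + Phi.T @ Phi, Phi.T @ t)

# Posterior mean with m_0 = 0 and S_0^{-1} = (1/omega^2) I_mod (infinite prior variance for w_0).
S0_inv = I_mod / omega2
m_N = np.linalg.solve(sigma2 * S0_inv + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(w_hat, m_N))                          # True: the two estimates coincide
```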

A Semi-Bayesian Way to Estimate $\sigma^2$ and $\omega^2$

We see that $\sigma^2$ (the noise variance) and $\omega^2$ (the variance of the regression coefficients, other than $w_0$) together (as $\sigma^2/\omega^2$) play a role similar to the penalty magnitude, $\lambda$, in the maximum penalized likelihood approach.

We can find values for $\sigma^2$ and $\omega^2$ in a semi-Bayesian way by maximizing the marginal likelihood: the probability of the data ($t$) given values for $\sigma^2$ and $\omega^2$. We need to set the prior variance of $w_0$ to some finite $\omega_0^2$ (which could be very large), or else the probability of the observed data will be zero.

We can also select basis function parameters (e.g., $s$) by maximizing the marginal likelihood. Such maximization is somewhat easier than the full Bayesian approach, in which we define some prior distribution for $\sigma^2$ and $\omega^2$ (and any basis function parameters we haven't fixed), and then average predictions over their posterior distribution.

Note: Notation in Bishop's book is $\beta = 1/\sigma^2$ and $\alpha = 1/\omega^2$. The marginal likelihood is sometimes called the "evidence".

Finding the Marginal Likelihood for $\sigma^2$ and $\omega^2$

The marginal likelihood for $\sigma^2$ and $\omega^2$, given a vector of observed target values $t$, is found by integrating over $w$ with respect to its prior:

$$P(t \mid \sigma^2, \omega^2) = \int P(t \mid w, \sigma^2)\, P(w \mid \omega^2)\, dw$$

This is the denominator in Bayes' Rule that normalizes the posterior. Here, the basis function values for the training cases, based on the inputs for those cases, are considered fixed.

Both factors in this integrand are exponentials of quadratic functions of $w$, so this turns into the same sort of integral as that for the normalizing constant of a Gaussian density function, for which we know the answer.

Details of Computing the Marginal Likelihood

We go back to the computation of the posterior for $w$, but we now need to pay attention to some factors we ignored before. I'll fix the prior mean for $w$ to $m_0 = 0$.

The log of the probability density of the data is

$$-\frac{N}{2}\log(2\pi) - \frac{N}{2}\log(\sigma^2) - \frac{1}{2}(t - \Phi w)^T (t - \Phi w)/\sigma^2$$

The log prior density for $w$ is

$$-\frac{M}{2}\log(2\pi) - \frac{1}{2}\log(|S_0|) - \frac{1}{2} w^T S_0^{-1} w$$

Expanding and then adding these together, we see the following terms that don't involve $w$:

$$-\frac{N+M}{2}\log(2\pi) - \frac{N}{2}\log(\sigma^2) - \frac{1}{2}\log(|S_0|) - \frac{1}{2}\, t^T t/\sigma^2$$

and these terms that do involve $w$:

$$-\frac{1}{2}\, w^T \Phi^T \Phi\, w/\sigma^2 + w^T \Phi^T t/\sigma^2 - \frac{1}{2}\, w^T S_0^{-1} w$$

More Details...

We can combine the quadratic terms that involve $w$, giving

$$-\frac{1}{2}\big[\, w^T (S_0^{-1} + \Phi^T \Phi/\sigma^2)\, w - 2 w^T \Phi^T t/\sigma^2 \,\big]$$

We had previously used this to identify the posterior covariance and mean for $w$. Setting the prior mean to zero, these are

$$S_N = \big[\, S_0^{-1} + (1/\sigma^2)\, \Phi^T \Phi \,\big]^{-1}, \qquad m_N = S_N\, \Phi^T t/\sigma^2$$

We can write the terms involving $w$ using these, then "complete the square":

$$-\frac{1}{2}\big[\, w^T S_N^{-1} w - 2 w^T S_N^{-1} m_N \,\big]
= -\frac{1}{2}\big[\, w^T S_N^{-1} w - 2 w^T S_N^{-1} m_N + m_N^T S_N^{-1} m_N \,\big] + \frac{1}{2} m_N^T S_N^{-1} m_N
= -\frac{1}{2}(w - m_N)^T S_N^{-1} (w - m_N) + \frac{1}{2} m_N^T S_N^{-1} m_N$$

The second term above doesn't involve $w$, so we can put it with the other such terms.

And Yet More Details...

We now see that the log of the prior times the probability of the data has these terms not involving $w$:

$$-\frac{N+M}{2}\log(2\pi) - \frac{N}{2}\log(\sigma^2) - \frac{1}{2}\log(|S_0|) - \frac{1}{2}\, t^T t/\sigma^2 + \frac{1}{2} m_N^T S_N^{-1} m_N$$

and this term that does involve $w$:

$$-\frac{1}{2}(w - m_N)^T S_N^{-1} (w - m_N)$$

When we exponentiate this and then integrate over $w$, we see that

$$\int \exp\!\left( -\frac{1}{2}(w - m_N)^T S_N^{-1} (w - m_N) \right) dw = (2\pi)^{M/2}\, |S_N|^{1/2}$$

since this is just the integral defining the Gaussian normalizing constant. The final result is that the log of the marginal likelihood is

$$-\frac{N}{2}\log(2\pi) - \frac{N}{2}\log(\sigma^2) - \frac{1}{2}\log\!\left(\frac{|S_0|}{|S_N|}\right) - \frac{1}{2}\, t^T t/\sigma^2 + \frac{1}{2} m_N^T S_N^{-1} m_N$$

Compare equation (3.86) in Bishop's book... Have I made a mistake?
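To make the formula concrete, here is a minimal NumPy sketch that evaluates this log marginal likelihood. The prior mean is $m_0 = 0$, and the prior covariance, noise variance, and data are placeholder values of my own; the prior variance of $w_0$ is large but finite, as the earlier slide requires.

```python
import numpy as np

def log_marginal_likelihood(Phi, t, S0, sigma2):
    """Log marginal likelihood of t under the linear basis function model, prior w ~ N(0, S0)."""
    N, M = Phi.shape
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + (Phi.T @ Phi) / sigma2)
    mN = SN @ (Phi.T @ t) / sigma2
    _, logdet_S0 = np.linalg.slogdet(S0)
    _, logdet_SN = np.linalg.slogdet(SN)
    return (-0.5 * N * np.log(2 * np.pi)
            - 0.5 * N * np.log(sigma2)
            - 0.5 * (logdet_S0 - logdet_SN)           # -(1/2) log(|S_0| / |S_N|)
            - 0.5 * (t @ t) / sigma2
            + 0.5 * mN @ np.linalg.solve(SN, mN))     # +(1/2) m_N^T S_N^{-1} m_N

# Hypothetical example: straight-line basis, with sigma^2 and omega^2 set by hand.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 40)
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(len(x))
omega0_2, omega2, sigma2 = 100.0, 4.0, 0.01           # finite (large) prior variance for w_0
S0 = np.diag([omega0_2, omega2])
print(log_marginal_likelihood(Phi, t, S0, sigma2))
```

Maximizing this function over $\sigma^2$ and $\omega^2$ (for example on a grid, or with a generic optimizer) gives the semi-Bayesian estimates described earlier.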

Correspondence with Bishop's Marginal Likelihood Formula

I didn't make a mistake. My formula from the previous slide:

$$-\frac{N}{2}\log(2\pi) - \frac{N}{2}\log(\sigma^2) - \frac{1}{2}\log\!\left(\frac{|S_0|}{|S_N|}\right) - \frac{1}{2}\, t^T t/\sigma^2 + \frac{1}{2} m_N^T S_N^{-1} m_N$$

Bishop's formula, after translating notation and minor rearrangement:

$$-\frac{N}{2}\log(2\pi) - \frac{N}{2}\log(\sigma^2) - \frac{1}{2}\log\!\left(\frac{|S_0|}{|S_N|}\right) - \frac{1}{2}\, \|t - \Phi m_N\|^2/\sigma^2 - \frac{1}{2} m_N^T S_0^{-1} m_N$$

The difference is in the last two terms. Expanding these in Bishop's formula gives

$$-\frac{1}{2}\, \|t - \Phi m_N\|^2/\sigma^2 - \frac{1}{2} m_N^T S_0^{-1} m_N
= -\frac{1}{2}\, t^T t/\sigma^2 + m_N^T \Phi^T t/\sigma^2 - \frac{1}{2} m_N^T \Phi^T \Phi\, m_N/\sigma^2 - \frac{1}{2} m_N^T S_0^{-1} m_N
= -\frac{1}{2}\, t^T t/\sigma^2 + m_N^T S_N^{-1} m_N - \frac{1}{2} m_N^T S_N^{-1} m_N
= -\frac{1}{2}\, t^T t/\sigma^2 + \frac{1}{2} m_N^T S_N^{-1} m_N$$

Bishop's formula is probably better numerically, since the roundoff error when computing $t^T t$ could sometimes be large.
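A quick numerical sanity check of this equivalence, as a sketch: it evaluates both forms of the last two terms on arbitrary synthetic data of my own and compares them.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 40)
Phi = np.column_stack([np.ones_like(x), x])
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(len(x))

sigma2 = 0.01
S0 = np.diag([100.0, 4.0])                            # finite prior variance for w_0, omega^2 for the rest
S0_inv = np.linalg.inv(S0)
SN_inv = S0_inv + (Phi.T @ Phi) / sigma2
mN = np.linalg.solve(SN_inv, Phi.T @ t) / sigma2      # m_N = S_N Phi^T t / sigma^2

# Last two terms, my form:  -(1/2) t^T t / sigma^2 + (1/2) m_N^T S_N^{-1} m_N
mine = -0.5 * (t @ t) / sigma2 + 0.5 * mN @ SN_inv @ mN

# Last two terms, Bishop's form:  -(1/2) ||t - Phi m_N||^2 / sigma^2 - (1/2) m_N^T S_0^{-1} m_N
resid = t - Phi @ mN
bishops = -0.5 * (resid @ resid) / sigma2 - 0.5 * mN @ S0_inv @ mN

print(np.allclose(mine, bishops))                     # True: the two expressions agree
```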