A Bayesian Treatment of Linear Gaussian Regression

Frank Wood
December 3, 2009

Bayesian Approach to Classical Linear Regression

In classical linear regression we have the following model:

y | β, σ², X ∼ N(Xβ, σ²I)

Unfortunately, we often do not know the observation noise variance σ², nor do we know the vector of linear weights β that relates the input(s) to the output. In Bayesian regression we are interested in several inference objectives. One is the posterior distribution of the model parameters, in particular the posterior distribution of the observation noise variance given the inputs and the outputs,

P(σ² | X, y)
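To make the later formulas concrete, the code snippets added throughout this writeup use one small simulated dataset; the sizes, coefficients, and noise level below are made-up illustration values, not anything from the original slides.

import numpy as np

# A made-up dataset for illustration: y | beta, sigma^2, X ~ N(X beta, sigma^2 I)
rng = np.random.default_rng(0)
n, p = 50, 3                                                      # observations, coefficients
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])    # design matrix with intercept
beta_true = np.array([1.0, -2.0, 0.5])                            # "unknown" linear weights
sigma_true = 0.7                                                  # "unknown" noise std deviation
y = X @ beta_true + sigma_true * rng.normal(size=n)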

Posterior Distribution of the Error Variance

Of course, in order to derive P(σ² | X, y) we have to treat β as a nuisance parameter and integrate it out:

P(σ² | X, y) = ∫ P(σ², β | X, y) dβ = ∫ P(σ² | β, X, y) P(β | X, y) dβ

Predicting a New Output for a (set of) new Input(s)

Of particular interest is the ability to predict the distribution of output values for a new input,

P(y_new | X, y, X_new)

Here we have to treat both σ² and β as nuisance parameters and integrate them out:

P(y_new | X, y, X_new) = ∫∫ P(y_new | X_new, β, σ²) P(σ² | β, X, y) P(β | X, y) dβ dσ²

Noninformative Prior for Classical Regression

For both objectives we need to place a prior on the model parameters σ² and β. We will choose a noninformative prior to demonstrate the connection between the Bayesian approach to multiple regression and the classical approach:

P(σ², β) ∝ σ^{-2}

Is this a proper prior? What form will the posterior take in this case? Will it be proper? Clearly other, more informative priors can be imposed as well.

Posterior distribution of β given σ²

Sometimes σ² is known. In such cases the posterior distribution over the model parameters collapses to the posterior over β alone. Even when σ² is also unknown, the factorization of the posterior distribution

P(σ², β | X, y) = P(β | σ², X, y) P(σ² | X, y)

suggests that determining the posterior distribution P(β | σ², X, y) will be useful as a step in the posterior analysis.

Posterior distribution of β given σ²

Given our choice of (improper) prior we have

P(β | σ², X, y) P(σ² | X, y) ∝ N(y | Xβ, σ²I) σ^{-2}

Plugging in the normal likelihood and ignoring terms that are not a function of β, we have

P(β | σ², X, y) ∝ exp( −(1/2) (y − Xβ)^T (1/σ²) I (y − Xβ) )

When we expand out the exponent (again dropping terms that do not involve β) we get an expression that looks like

exp( −(1/2) ( −2 y^T (1/σ²) I X β + β^T X^T (1/σ²) I X β ) )

Multivariate Quadratic Square Completion

We recognize the familiar form of the exponent of a multivariate Gaussian in this expression and can derive the mean and covariance of the distribution of β | σ², ... by noting that

(β − µ_β)^T Σ_β^{-1} (β − µ_β) = β^T Σ_β^{-1} β − 2 µ_β^T Σ_β^{-1} β + const

From this and the result from the previous slide,

exp( −(1/2) ( −2 y^T (1/σ²) I X β + β^T X^T (1/σ²) I X β ) )

we can immediately identify Σ_β^{-1} = X^T (1/σ²) I X and thus Σ_β = σ²(X^T X)^{-1}. Similarly we can solve for µ_β, and we find µ_β = (X^T X)^{-1} X^T y.
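Written out, the coefficient matching in the step above goes as follows (a short LaTeX restatement using the same symbols; this is my own spelled-out version of the "immediately identify" step, not a slide from the deck):

\begin{align*}
  \beta^{\top}\tfrac{1}{\sigma^{2}}X^{\top}X\,\beta \;-\; 2\,y^{\top}\tfrac{1}{\sigma^{2}}X\,\beta
    &\equiv \beta^{\top}\Sigma_{\beta}^{-1}\beta \;-\; 2\,\mu_{\beta}^{\top}\Sigma_{\beta}^{-1}\beta + \text{const}\\
  \text{(quadratic terms)}\qquad \Sigma_{\beta}^{-1} &= \tfrac{1}{\sigma^{2}}X^{\top}X
    \;\Rightarrow\; \Sigma_{\beta} = \sigma^{2}(X^{\top}X)^{-1}\\
  \text{(linear terms)}\qquad \Sigma_{\beta}^{-1}\mu_{\beta} &= \tfrac{1}{\sigma^{2}}X^{\top}y
    \;\Rightarrow\; \mu_{\beta} = (X^{\top}X)^{-1}X^{\top}y
\end{align*}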

Distribution of β given σ²

Mirroring the classical approach to matrix regression, the distribution of the regression coefficients given the observation noise variance is

β | y, X, σ² ∼ N(µ_β, Σ_β)

where Σ_β = σ²(X^T X)^{-1} and µ_β = (X^T X)^{-1} X^T y.

Note that µ_β is the same as the maximum likelihood or least squares estimate β̂ = (X^T X)^{-1} X^T y of the regression coefficients. Of course, we don't usually know the observation noise variance σ² and have to simultaneously estimate it from the data. To determine the distribution of this quantity we need a few facts.
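As a sanity check, this conditional posterior is two lines of linear algebra. A minimal sketch, assuming the simulated X and y from the earlier snippet and a known noise variance (the function name is my own):

import numpy as np

def beta_posterior_given_sigma2(X, y, sigma2):
    """Posterior of beta | y, X, sigma^2 under the improper prior p(beta, sigma^2) proportional to 1/sigma^2."""
    XtX_inv = np.linalg.inv(X.T @ X)
    mu_beta = XtX_inv @ X.T @ y        # equals the least squares estimate beta_hat
    Sigma_beta = sigma2 * XtX_inv      # posterior covariance sigma^2 (X^T X)^{-1}
    return mu_beta, Sigma_beta

# e.g. mu_beta, Sigma_beta = beta_posterior_given_sigma2(X, y, sigma2=0.7**2)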

Scaled inverse-chi-square distribution

If θ ∼ Inv-χ²(ν, s²) then the pdf of θ is given by

P(θ) = ( (ν/2)^{ν/2} / Γ(ν/2) ) s^ν θ^{-(ν/2+1)} e^{-νs²/(2θ)} ∝ θ^{-(ν/2+1)} e^{-νs²/(2θ)}

You can think of the scaled inverse chi-squared distribution as a chi-squared distribution in which the sum of squares is explicit in the parameterization. Here ν > 0 is the number of degrees of freedom and s > 0 is the scale parameter.
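SciPy does not ship a scaled inverse-χ² distribution, but Inv-χ²(ν, s²) is the same distribution as an inverse gamma with shape ν/2 and scale νs²/2, so a hedged way to work with it numerically is:

from scipy import stats

def scaled_inv_chi2(nu, s2):
    """Scaled-Inv-chi^2(nu, s^2) as a frozen scipy distribution.

    Uses the identity Inv-chi^2(nu, s^2) = InvGamma(shape=nu/2, scale=nu*s^2/2).
    """
    return stats.invgamma(a=nu / 2.0, scale=nu * s2 / 2.0)

# Example: density at one point and a few random draws for nu = 5, s^2 = 2.0
dist = scaled_inv_chi2(nu=5, s2=2.0)
print(dist.pdf(1.5), dist.rvs(size=3, random_state=0))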

Distribution of σ² given observations y and X

The posterior distribution of the observation noise variance can be derived by noting that

P(σ² | y, X) = P(β, σ² | y, X) / P(β | σ², y, X) ∝ P(y | β, σ², X) P(β, σ² | X) / P(β | σ², y, X)

But we have all of these terms: P(y | β, σ², X) is the standard regression likelihood, we have just solved for the posterior distribution of β given σ² and the rest, P(β | σ², y, X), and we specified our prior P(σ², β) ∝ σ^{-2}.

Distribution of σ² given observations y and X

When we plug all of these known distributions into

P(σ² | y, X) ∝ P(y | β, σ², X) P(β, σ² | X) / P(β | σ², y, X)

it simplifies to

P(σ² | y, X) ∝ [ σ^{-n} exp( −(1/2) (y − Xβ)^T (1/σ²) I (y − Xβ) ) σ^{-2} ] / [ σ^{-p} exp( −(1/2) (β − µ_β)^T Σ_β^{-1} (β − µ_β) ) ]

∝ σ^{-n+p-2} exp( −(1/2) [ (y − Xβ)^T (1/σ²) I (y − Xβ) − (β − µ_β)^T Σ_β^{-1} (β − µ_β) ] )

Distribution of σ² given observations y and X

With significant algebraic effort one can arrive at

P(σ² | y, X) ∝ σ^{-n+p-2} exp( −(1/(2σ²)) (y − Xµ_β)^T (y − Xµ_β) )

Remembering that µ_β = β̂, we can rewrite this in a more familiar form, namely

P(σ² | y, X) ∝ σ^{-n+p-2} exp( −(1/(2σ²)) (y − Xβ̂)^T (y − Xβ̂) )

where the quantity in the exponent is the sum of squared errors (SSE).

Distribution of σ² given observations y and X

By inspection,

P(σ² | y, X) ∝ σ^{-n+p-2} exp( −(1/(2σ²)) (y − Xβ̂)^T (y − Xβ̂) )

follows a scaled inverse-χ² distribution, P(θ) ∝ θ^{-(ν/2+1)} e^{-νs²/(2θ)}, with θ = σ², ν = n − p (i.e. the number of degrees of freedom is the number of observations n minus the number of free parameters in the model p), and s² = (1/(n − p)) (y − Xβ̂)^T (y − Xβ̂), the standard MSE estimate of the noise variance.

Distribution of σ² given observations y and X

Note that this result,

σ² | y, X ∼ Inv-χ²( n − p, (1/(n − p)) (y − Xβ̂)^T (y − Xβ̂) )    (1)

is exactly analogous to the following result from the classical estimation approach to linear regression. From Cochran's theorem we have

SSE/σ² = (y − Xβ̂)^T (y − Xβ̂) / σ² ∼ χ²(n − p)    (2)

To get from (1) to (2) one can use the change-of-variables formula for densities with the change of variable θ = (y − Xβ̂)^T (y − Xβ̂) / σ².
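Together with the conditional posterior of β, this gives exact joint posterior draws by composition: sample σ² from its scaled inverse-χ² marginal, then β from N(β̂, σ²(X^T X)^{-1}). A sketch under the same simulated-data assumptions as before (the function name is my own):

import numpy as np
from scipy import stats

def joint_posterior_draws(X, y, n_draws=1000, seed=1):
    """Exact draws from p(beta, sigma^2 | X, y) under the prior p(beta, sigma^2) proportional to 1/sigma^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    nu, s2 = n - p, resid @ resid / (n - p)          # degrees of freedom and MSE
    # sigma^2 | y, X ~ Inv-chi^2(nu, s2), i.e. InvGamma(nu/2, nu*s2/2)
    sigma2 = stats.invgamma(a=nu / 2.0, scale=nu * s2 / 2.0).rvs(size=n_draws, random_state=rng)
    # beta | sigma^2, y, X ~ N(beta_hat, sigma^2 (X^T X)^{-1})
    betas = np.stack([rng.multivariate_normal(beta_hat, s * XtX_inv) for s in sigma2])
    return betas, sigma2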

Distribution of output(s) given new input(s)

Last but not least, we will typically be interested in prediction:

P(y_new | X, y, X_new) = ∫∫ P(y_new | X_new, β, σ²) P(σ² | β, X, y) P(β | X, y) dβ dσ²

We will first assume, as usual, that σ² is known and instead proceed with evaluating

P(y_new | X, y, X_new, σ²) = ∫ P(y_new | X_new, β, σ²) P(β | X, y, σ²) dβ

Distribution of output(s) given new input(s)

We know the form of each of these expressions: the likelihood is normal, as is the distribution of β given the rest. In other words,

P(y_new | X, y, X_new, σ²) = ∫ P(y_new | X_new, β, σ²) P(β | X, y, σ²) dβ = ∫ N(y_new | X_new β, σ²I) N(β | β̂, Σ_β) dβ

Bayes Rule for Gaussians

To solve this integral we will use Bayes' rule for Gaussians (taken from Bishop). If

P(x) = N(x | µ, Λ^{-1})
P(y | x) = N(y | Ax + b, L^{-1})

where x, y, and µ are all vectors and Λ and L are (invertible) precision matrices of the appropriate size, then

P(y) = N(y | Aµ + b, L^{-1} + AΛ^{-1}A^T)
P(x | y) = N(x | Σ(A^T L(y − b) + Λµ), Σ)

where Σ = (Λ + A^T L A)^{-1}.
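These two formulas translate directly into code. A sketch (the function names and argument conventions are my own, following Bishop's notation above):

import numpy as np

def gaussian_marginal(mu, Lam, A, b, L):
    """Given p(x) = N(x | mu, Lam^{-1}) and p(y | x) = N(y | A x + b, L^{-1}),
    return mean and covariance of the marginal p(y) = N(A mu + b, L^{-1} + A Lam^{-1} A^T)."""
    mean_y = A @ mu + b
    cov_y = np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T
    return mean_y, cov_y

def gaussian_conditional(mu, Lam, A, b, L, y):
    """Return mean and covariance of the conditional p(x | y) = N(Sigma (A^T L (y - b) + Lam mu), Sigma),
    where Sigma = (Lam + A^T L A)^{-1}."""
    Sigma = np.linalg.inv(Lam + A.T @ L @ A)
    mean_x = Sigma @ (A.T @ L @ (y - b) + Lam @ mu)
    return mean_x, Sigma

With x = β, y = y_new, A = X_new, b = 0, Λ^{-1} = Σ_β, and L^{-1} = σ²I, the marginal formula gives exactly the predictive distribution written down on the next slide.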

Distribution of output(s) given new input(s)

Since this integral is just an application of Bayes' rule for Gaussians, we can directly write down the solution:

P(y_new | X, y, X_new, σ²) = ∫ N(y_new | X_new β, σ²I) N(β | β̂, Σ_β) dβ = N(y_new | X_new β̂, σ²(I + X_new V_β X_new^T))

where V_β = Σ_β / σ² = (X^T X)^{-1}.

Distribution of output(s) given new input(s)

This solution,

P(y_new | X, y, X_new, σ²) = N(y_new | X_new β̂, σ²(I + X_new V_β X_new^T)) where V_β = Σ_β / σ² = (X^T X)^{-1},

relies upon σ² being known. Our final inference objective is to come up with

P(y_new | X, y, X_new) = ∫∫ P(y_new | X_new, β, σ²) P(σ² | β, X, y) P(β | X, y) dβ dσ² = ∫ P(y_new | X, y, X_new, σ²) P(σ² | X, y, X_new) dσ²

where we have just derived the first term and we know the second is scaled inverse chi-squared.

Distribution of output(s) given new input(s)

The distributional form of

P(y_new | X, y, X_new) = ∫ P(y_new | X, y, X_new, σ²) P(σ² | X, y, X_new) dσ²

is a multivariate Student-t with center X_new β̂, squared scale matrix s²(I + X_new V_β X_new^T), and n − p degrees of freedom (left as homework). Again, this is the same result as in classical regression analysis: the predictive distribution of a new (set of) point(s) is Student-t when σ² is unknown and marginalized out.
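For a single new input row x_new this predictive distribution is a univariate Student-t and can be evaluated directly; a sketch assuming the same improper prior and the simulated X, y from earlier (the helper name is my own):

import numpy as np
from scipy import stats

def predictive_t(X, y, x_new):
    """Posterior predictive p(y_new | X, y, x_new) for one new input row,
    under the improper prior p(beta, sigma^2) proportional to 1/sigma^2.

    Returns a frozen Student-t with n - p degrees of freedom, centered at
    x_new^T beta_hat, with scale sqrt(s^2 (1 + x_new^T (X^T X)^{-1} x_new))."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)
    center = x_new @ beta_hat
    scale = np.sqrt(s2 * (1.0 + x_new @ XtX_inv @ x_new))
    return stats.t(df=n - p, loc=center, scale=scale)

# Example: a 95% predictive interval at a hypothetical new input
# print(predictive_t(X, y, x_new=np.array([1.0, 0.2, -0.3])).interval(0.95))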

Take home

The Bayesian treatment brings a new analytic perspective to the classical regression setting. In classical regression we develop estimators and then determine their distribution under repeated sampling or measurement of the underlying population. In Bayesian regression we stick with the single given dataset and calculate the uncertainty in our parameter estimates arising from the fact that we have a finite dataset. For a single choice of prior, namely the particular improper prior used here, the posterior uncertainty about the model parameters corresponds exactly to the classical sampling distributions of the regression estimators. Other priors can be utilized.