A Bayesian Treatment of Linear Gaussian Regression
Frank Wood
December 3, 2009
Bayesian Approach to Classical Linear Regression

In classical linear regression we have the following model:

$$y \mid \beta, \sigma^2, X \sim \mathcal{N}(X\beta, \sigma^2 I)$$

Unfortunately we often don't know the observation error variance $\sigma^2$ and, as well, we don't know the vector of linear weights $\beta$ that relates the input(s) to the output. In Bayesian regression we are interested in several inference objectives. One is the posterior distribution of the model parameters, in particular the posterior distribution of the observation error variance given the inputs and the outputs:

$$P(\sigma^2 \mid X, y)$$
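To make the model concrete, here is a minimal sketch that simulates data from exactly this likelihood; the particular values of $n$, $p$, $\beta$, and $\sigma^2$ are illustrative choices, not anything fixed by the slides.

```python
import numpy as np

# Simulate from the classical linear regression model
# y | beta, sigma^2, X ~ N(X beta, sigma^2 I).
# n, p, beta_true, and sigma2 are illustrative assumptions.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2 = 0.25
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)
print(y.shape)  # (50,)
```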
Posterior Distribution of the Error Variance

Of course, in order to derive $P(\sigma^2 \mid X, y)$ we have to treat $\beta$ as a nuisance parameter and integrate it out:

$$P(\sigma^2 \mid X, y) = \int P(\sigma^2, \beta \mid X, y)\, d\beta = \int P(\sigma^2 \mid \beta, X, y)\, p(\beta \mid X, y)\, d\beta$$
Predicting a New Output for a (Set of) New Input(s)

Of particular interest is the ability to predict the distribution of output values for a new input:

$$P(y_{\text{new}} \mid X, y, X_{\text{new}})$$

Here we have to treat both $\sigma^2$ and $\beta$ as nuisance parameters and integrate them out:

$$P(y_{\text{new}} \mid X, y, X_{\text{new}}) = \int\!\!\int P(y_{\text{new}} \mid \beta, \sigma^2)\, P(\sigma^2 \mid \beta, X, y)\, p(\beta \mid X, y)\, d\beta\, d\sigma^2$$
Noninformative Prior for Classical Regression

For both objectives we need to place a prior on the model parameters $\sigma^2$ and $\beta$. We will choose a noninformative prior to demonstrate the connection between the Bayesian approach to multiple regression and the classical approach:

$$P(\sigma^2, \beta) \propto \sigma^{-2}$$

Is this a proper prior? What form will the posterior take in this case? Will it be proper? Clearly other, more informative priors can be imposed.
Posterior Distribution of β given σ²

Sometimes it is the case that $\sigma^2$ is known. In such cases the posterior distribution over the model parameters collapses to the posterior over $\beta$ alone. Even when $\sigma^2$ is also unknown, the factorization of the posterior distribution

$$P(\sigma^2, \beta \mid X, y) = P(\beta \mid \sigma^2, X, y)\, P(\sigma^2 \mid X, y)$$

suggests that determining the posterior distribution $P(\beta \mid \sigma^2, X, y)$ will be of use as a step in posterior analyses.
Posterior Distribution of β given σ²

Given our choice of (improper) prior we have

$$P(\beta \mid \sigma^2, X, y)\, P(\sigma^2 \mid X, y) \propto \mathcal{N}(y \mid X\beta, \sigma^2 I)\, \sigma^{-2}$$

Plugging in the normal likelihood and ignoring terms that are not a function of $\beta$, we have

$$P(\beta \mid \sigma^2, X, y) \propto \exp\!\left(-\tfrac{1}{2}(y - X\beta)^T \tfrac{1}{\sigma^2} I (y - X\beta)\right)$$

When we expand the exponent we get an expression that looks like (again dropping terms that do not involve $\beta$)

$$\exp\!\left(-\tfrac{1}{2}\left(-2 y^T \tfrac{1}{\sigma^2} I X \beta + \beta^T X^T \tfrac{1}{\sigma^2} I X \beta\right)\right)$$
Multivariate Quadratic Square Completion

We recognize the familiar form of the exponent of a multivariate Gaussian in this expression and can derive the mean and the variance of the distribution of $\beta \mid \sigma^2, \ldots$ by noting that

$$(\beta - \mu_\beta)^T \Sigma_\beta^{-1} (\beta - \mu_\beta) = \beta^T \Sigma_\beta^{-1} \beta - 2 \mu_\beta^T \Sigma_\beta^{-1} \beta + \text{const}$$

From this and the result from the previous slide,

$$\exp\!\left(-\tfrac{1}{2}\left(-2 y^T \tfrac{1}{\sigma^2} I X \beta + \beta^T X^T \tfrac{1}{\sigma^2} I X \beta\right)\right)$$

we can immediately identify $\Sigma_\beta^{-1} = X^T \tfrac{1}{\sigma^2} I X$ and thus that $\Sigma_\beta = \sigma^2 (X^T X)^{-1}$. Similarly we can solve for $\mu_\beta$ and find $\mu_\beta = (X^T X)^{-1} X^T y$.
Distribution of β given σ²

Mirroring the classical approach to matrix regression, the distribution of the regression coefficients given the observation noise variance is

$$\beta \mid y, X, \sigma^2 \sim \mathcal{N}(\mu_\beta, \Sigma_\beta)$$

where $\Sigma_\beta = \sigma^2 (X^T X)^{-1}$ and $\mu_\beta = (X^T X)^{-1} X^T y$. Note that $\mu_\beta$ is the same as the maximum likelihood or least squares estimate $\hat{\beta} = (X^T X)^{-1} X^T y$ of the regression coefficients. Of course we don't usually know the observation noise variance $\sigma^2$ and have to simultaneously estimate it from the data. To determine the distribution of this quantity we need a few facts.
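The posterior mean and covariance above are simple matrix expressions, so we can check numerically that $\mu_\beta$ coincides with the least-squares solution; this is a sketch on synthetic data, with all sizes and parameter values chosen arbitrarily for illustration.

```python
import numpy as np

# Posterior of beta given sigma^2 under the improper prior:
# mu_beta = (X^T X)^{-1} X^T y,  Sigma_beta = sigma^2 (X^T X)^{-1}.
rng = np.random.default_rng(1)
n, p = 100, 2
X = rng.normal(size=(n, p))
sigma2 = 0.5  # assumed known on this slide
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=np.sqrt(sigma2), size=n)

XtX_inv = np.linalg.inv(X.T @ X)
mu_beta = XtX_inv @ X.T @ y      # posterior mean
Sigma_beta = sigma2 * XtX_inv    # posterior covariance

# mu_beta equals the least-squares / maximum likelihood estimate
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(mu_beta, beta_hat))  # True
```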
Scaled Inverse Chi-Squared Distribution

If $\theta \sim \text{Inv-}\chi^2(\nu, s^2)$ then the pdf for $\theta$ is given by

$$P(\theta) = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\, s^\nu\, \theta^{-(\nu/2+1)}\, e^{-\nu s^2/(2\theta)} \propto \theta^{-(\nu/2+1)}\, e^{-\nu s^2/(2\theta)}$$

You can think of the scaled inverse chi-squared distribution as the chi-squared distribution where the sum of squares is explicit in the parameterization. Here $\nu > 0$ is the number of degrees of freedom and $s > 0$ is the scale parameter.
Distribution of σ² given Observations y and X

The posterior distribution of the observation noise can be derived by noting that

$$P(\sigma^2 \mid y, X) = \frac{P(\beta, \sigma^2 \mid y, X)}{P(\beta \mid \sigma^2, y, X)} \propto \frac{P(y \mid \beta, \sigma^2, X)\, P(\beta, \sigma^2 \mid X)}{P(\beta \mid \sigma^2, y, X)}$$

But we have all of these terms: $P(y \mid \beta, \sigma^2, X)$ is the standard regression likelihood, we have just solved for the posterior distribution $P(\beta \mid \sigma^2, y, X)$ of $\beta$ given $\sigma^2$ and the rest, and we specified our prior $P(\sigma^2, \beta) \propto \sigma^{-2}$.
Distribution of σ² given Observations y and X

When we plug all of these known distributions into

$$P(\sigma^2 \mid y, X) \propto \frac{P(y \mid \beta, \sigma^2, X)\, P(\beta, \sigma^2 \mid X)}{P(\beta \mid \sigma^2, y, X)}$$

we get

$$\frac{\sigma^{-n} \exp\!\left(-\tfrac{1}{2}(y - X\beta)^T \tfrac{1}{\sigma^2} I (y - X\beta)\right) \sigma^{-2}}{\sigma^{-p} \exp\!\left(-\tfrac{1}{2}(\beta - \mu_\beta)^T \Sigma_\beta^{-1} (\beta - \mu_\beta)\right)}$$

which simplifies to

$$\sigma^{-n+p-2} \exp\!\left(-\tfrac{1}{2}\left((y - X\beta)^T \tfrac{1}{\sigma^2} I (y - X\beta) - (\beta - \mu_\beta)^T \Sigma_\beta^{-1} (\beta - \mu_\beta)\right)\right)$$
Distribution of σ² given Observations y and X

With significant algebraic effort one can arrive at

$$P(\sigma^2 \mid y, X) \propto \sigma^{-n+p-2} \exp\!\left(-\tfrac{1}{2\sigma^2}(y - X\mu_\beta)^T (y - X\mu_\beta)\right)$$

Remembering that $\mu_\beta = \hat{\beta}$, we can rewrite this in a more familiar form, namely

$$P(\sigma^2 \mid y, X) \propto \sigma^{-n+p-2} \exp\!\left(-\tfrac{1}{2\sigma^2}(y - X\hat{\beta})^T (y - X\hat{\beta})\right)$$

where the exponent contains the sum of squared errors, SSE.
Distribution of σ² given Observations y and X

By inspection,

$$P(\sigma^2 \mid y, X) \propto \sigma^{-n+p-2} \exp\!\left(-\tfrac{1}{2\sigma^2}(y - X\hat{\beta})^T (y - X\hat{\beta})\right)$$

follows a scaled inverse $\chi^2$ distribution

$$P(\theta) \propto \theta^{-(\nu/2+1)}\, e^{-\nu s^2/(2\theta)}$$

with $\theta = \sigma^2$ and $\nu = n - p$ (i.e. the number of degrees of freedom is the number of observations $n$ minus the number of free parameters in the model $p$), where

$$s^2 = \frac{1}{n-p}(y - X\hat{\beta})^T (y - X\hat{\beta})$$

is the standard MSE estimate of the error variance.
Distribution of σ² given Observations y and X

Note that this result,

$$\sigma^2 \sim \text{Inv-}\chi^2\!\left(n - p,\; \tfrac{1}{n-p}(y - X\hat{\beta})^T (y - X\hat{\beta})\right) \quad (1)$$

is exactly analogous to the following result from the classical estimation approach to linear regression. From Cochran's theorem we have

$$\frac{\text{SSE}}{\sigma^2} = \frac{(y - X\hat{\beta})^T (y - X\hat{\beta})}{\sigma^2} \sim \chi^2(n - p) \quad (2)$$

To get from (1) to (2) one can use the change of variables formula with the change of variable $\theta = (y - X\hat{\beta})^T (y - X\hat{\beta}) / \sigma^2$.
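Result (1) is easy to simulate from: if $z \sim \chi^2(\nu)$ then $\nu s^2 / z \sim \text{Inv-}\chi^2(\nu, s^2)$. The sketch below (synthetic data, illustrative sizes) draws from the posterior of $\sigma^2$ this way and compares the Monte Carlo mean against the known posterior mean $\nu s^2 / (\nu - 2)$.

```python
import numpy as np

# Draw from sigma^2 | y, X ~ Inv-chi^2(n-p, s^2) via sigma^2 = nu*s^2 / z,
# z ~ chi^2(nu). Data sizes and true parameters are illustrative.
rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(scale=1.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
nu = n - p
s2 = resid @ resid / nu   # the MSE estimate from the previous slide

draws = nu * s2 / rng.chisquare(nu, size=100_000)
print(draws.mean())  # close to the posterior mean nu*s2/(nu - 2)
```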
Distribution of Output(s) given New Input(s)

Last but not least, we will typically be interested in prediction:

$$P(y_{\text{new}} \mid X, y, X_{\text{new}}) = \int\!\!\int P(y_{\text{new}} \mid \beta, \sigma^2)\, P(\sigma^2 \mid \beta, X, y)\, p(\beta \mid X, y)\, d\beta\, d\sigma^2$$

We will first assume, as usual, that $\sigma^2$ is known and proceed by evaluating instead

$$P(y_{\text{new}} \mid X, y, X_{\text{new}}, \sigma^2) = \int P(y_{\text{new}} \mid \beta, \sigma^2)\, P(\beta \mid X, y, \sigma^2)\, d\beta$$
Distribution of Output(s) given New Input(s)

We know the form of each of these expressions: the likelihood is normal, as is the distribution of $\beta$ given the rest. In other words,

$$P(y_{\text{new}} \mid X, y, X_{\text{new}}, \sigma^2) = \int P(y_{\text{new}} \mid \beta, \sigma^2)\, P(\beta \mid X, y, \sigma^2)\, d\beta = \int \mathcal{N}(y_{\text{new}} \mid X_{\text{new}}\beta, \sigma^2 I)\, \mathcal{N}(\beta \mid \hat{\beta}, \Sigma_\beta)\, d\beta$$
Bayes Rule for Gaussians

To solve this integral we will use Bayes rule for Gaussians (taken from Bishop). If

$$P(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1})$$
$$P(y \mid x) = \mathcal{N}(y \mid Ax + b, L^{-1})$$

where $x$, $y$, and $\mu$ are all vectors and $\Lambda$ and $L$ are (invertible) matrices of the appropriate size, then

$$P(y) = \mathcal{N}(y \mid A\mu + b,\; L^{-1} + A \Lambda^{-1} A^T)$$
$$P(x \mid y) = \mathcal{N}(x \mid \Sigma(A^T L (y - b) + \Lambda \mu),\; \Sigma)$$

where $\Sigma = (\Lambda + A^T L A)^{-1}$.
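The marginal formula $P(y) = \mathcal{N}(y \mid A\mu + b, L^{-1} + A\Lambda^{-1}A^T)$ can be sanity-checked by Monte Carlo: sample $x$, push it through the linear-Gaussian likelihood, and compare the empirical mean and covariance of $y$. All matrices below are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of Bishop's Gaussian marginalization formula.
rng = np.random.default_rng(3)
mu = np.array([1.0, -1.0])
Lam_inv = np.array([[2.0, 0.3], [0.3, 1.0]])   # Lambda^{-1}, prior covariance
A = np.array([[1.0, 2.0], [0.5, -1.0]])
b = np.array([0.5, 0.0])
L_inv = np.array([[0.5, 0.0], [0.0, 0.8]])     # L^{-1}, likelihood covariance

m = 500_000
x = rng.multivariate_normal(mu, Lam_inv, size=m)
y = x @ A.T + b + rng.multivariate_normal(np.zeros(2), L_inv, size=m)

print(np.allclose(y.mean(axis=0), A @ mu + b, atol=0.05))              # True
print(np.allclose(np.cov(y.T), L_inv + A @ Lam_inv @ A.T, atol=0.2))   # True
```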
Distribution of Output(s) given New Input(s)

Since this integral is just an application of Bayes rule for Gaussians we can directly write down the solution:

$$P(y_{\text{new}} \mid X, y, X_{\text{new}}, \sigma^2) = \int \mathcal{N}(y_{\text{new}} \mid X_{\text{new}}\beta, \sigma^2 I)\, \mathcal{N}(\beta \mid \hat{\beta}, \Sigma_\beta)\, d\beta = \mathcal{N}(y_{\text{new}} \mid X_{\text{new}}\hat{\beta},\; \sigma^2 (I + X_{\text{new}} V_\beta X_{\text{new}}^T))$$

where $V_\beta = \Sigma_\beta / \sigma^2 = (X^T X)^{-1}$.
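Computing this predictive mean and covariance is a couple of lines of linear algebra; a sketch on synthetic data (illustrative sizes, $\sigma^2$ assumed known as on this slide):

```python
import numpy as np

# Predictive distribution for new inputs with sigma^2 known:
# y_new ~ N(X_new mu_beta, sigma^2 (I + X_new V X_new^T)), V = (X^T X)^{-1}.
rng = np.random.default_rng(4)
n, p = 80, 2
X = rng.normal(size=(n, p))
sigma2 = 0.3
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(sigma2), size=n)

V = np.linalg.inv(X.T @ X)
mu_beta = V @ X.T @ y

X_new = rng.normal(size=(5, p))
pred_mean = X_new @ mu_beta
pred_cov = sigma2 * (np.eye(5) + X_new @ V @ X_new.T)
print(pred_cov.shape)  # (5, 5)
```

Note that every predictive variance on the diagonal exceeds $\sigma^2$: parameter uncertainty inflates the noise variance through the $X_{\text{new}} V_\beta X_{\text{new}}^T$ term.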
Distribution of Output(s) given New Input(s)

This solution,

$$P(y_{\text{new}} \mid X, y, X_{\text{new}}, \sigma^2) = \mathcal{N}(y_{\text{new}} \mid X_{\text{new}}\hat{\beta},\; \sigma^2 (I + X_{\text{new}} V_\beta X_{\text{new}}^T))$$

where $V_\beta = \Sigma_\beta / \sigma^2 = (X^T X)^{-1}$, relies upon $\sigma^2$ being known. Our final inference objective is to come up with

$$P(y_{\text{new}} \mid X, y, X_{\text{new}}) = \int\!\!\int P(y_{\text{new}} \mid \beta, \sigma^2)\, P(\sigma^2 \mid \beta, X, y)\, p(\beta \mid X, y)\, d\beta\, d\sigma^2 = \int P(y_{\text{new}} \mid X, y, X_{\text{new}}, \sigma^2)\, P(\sigma^2 \mid X, y, X_{\text{new}})\, d\sigma^2$$

where we have just derived the first term and the second we know is scaled inverse chi-squared.
Distribution of Output(s) given New Input(s)

The distributional form of

$$P(y_{\text{new}} \mid X, y, X_{\text{new}}) = \int P(y_{\text{new}} \mid X, y, X_{\text{new}}, \sigma^2)\, P(\sigma^2 \mid X, y, X_{\text{new}})\, d\sigma^2$$

is a multivariate Student-$t$ distribution with center $X_{\text{new}}\hat{\beta}$, squared scale matrix $s^2 (I + X_{\text{new}} V_\beta X_{\text{new}}^T)$, and $n - p$ degrees of freedom (left as homework). Again this is the same result as in classical regression analysis: the predictive distribution of a new (set of) points is Student-$t$ when $\sigma^2$ is unknown and marginalized out.
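Even without deriving the Student-$t$ form, one can sample from this predictive by composition, following the factorization on the previous slide: draw $\sigma^2$ from its scaled inverse-$\chi^2$ posterior, then $\beta \mid \sigma^2$, then $y_{\text{new}} \mid \beta, \sigma^2$. A sketch on synthetic data (all sizes and parameter values illustrative):

```python
import numpy as np

# Sample the full posterior predictive by composition:
#   sigma^2 ~ Inv-chi^2(n-p, s^2)
#   beta | sigma^2 ~ N(mu_beta, sigma^2 V)
#   y_new | beta, sigma^2 ~ N(x_new beta, sigma^2)
# The resulting draws follow the Student-t predictive with n-p dof.
rng = np.random.default_rng(5)
n, p = 60, 2
X = rng.normal(size=(n, p))
y = X @ np.array([0.5, -0.5]) + rng.normal(scale=1.0, size=n)

V = np.linalg.inv(X.T @ X)
mu_beta = V @ X.T @ y
nu = n - p
s2 = (y - X @ mu_beta) @ (y - X @ mu_beta) / nu
x_new = np.array([1.0, 1.0])  # a hypothetical new input

m = 50_000
sig2 = nu * s2 / rng.chisquare(nu, size=m)                  # sigma^2 draws
betas = mu_beta + (rng.multivariate_normal(np.zeros(p), V, size=m)
                   * np.sqrt(sig2)[:, None])                # beta draws
y_new = betas @ x_new + rng.normal(size=m) * np.sqrt(sig2)  # predictive draws
print(y_new.mean())  # close to the predictive center x_new @ mu_beta
```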
Take Home

The Bayesian perspective brings a new analytic perspective to the classical regression setting. In classical regression we develop estimators and then determine their distribution under repeated sampling or measurement of the underlying population. In Bayesian regression we stick with the single given dataset and calculate the uncertainty in our parameter estimates arising from the fact that we have a finite dataset. Given a single choice of prior, namely a particular improper prior, we see that the posterior uncertainty regarding the model parameters corresponds exactly to the classical sampling distributions for regression estimators. Other priors can be utilized.