An Introduction to Bayesian Linear Regression
APPM 5720: Bayesian Computation, Fall 2018
A SIMPLE LINEAR MODEL

Suppose that we observe explanatory variables $x_1, x_2, \ldots, x_n$ and dependent variables $y_1, y_2, \ldots, y_n$. Assume they are related through the very simple linear model

$$y_i = \beta x_i + \varepsilon_i, \qquad i = 1, 2, \ldots, n,$$

with $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ being realizations of iid $N(0, \sigma^2)$ random variables.
A SIMPLE LINEAR MODEL

$$y_i = \beta x_i + \varepsilon_i, \qquad i = 1, 2, \ldots, n$$

The $x_i$ can either be constants or realizations of random variables. In the latter case, assume that they have joint pdf $f(\vec{x} \mid \theta)$, where $\theta$ is a parameter (or vector of parameters) that is unrelated to $\beta$ and $\sigma^2$. The likelihood for this model is

$$f(\vec{y}, \vec{x} \mid \beta, \sigma^2, \theta) = f(\vec{y} \mid \vec{x}, \beta, \sigma^2, \theta)\, f(\vec{x} \mid \beta, \sigma^2, \theta) = f(\vec{y} \mid \vec{x}, \beta, \sigma^2)\, f(\vec{x} \mid \theta)$$
A SIMPLE LINEAR MODEL

Assume that the $x_i$ are fixed. The likelihood for the model is then $f(\vec{y} \mid \vec{x}, \beta, \sigma^2)$. The goal is to estimate and make inferences about the parameters $\beta$ and $\sigma^2$.

Frequentist Approach: Ordinary Least Squares (OLS)

$y_i$ is supposed to be $\beta$ times $x_i$ plus some residual noise. The noise, modeled by a normal distribution, is observed as $y_i - \beta x_i$. Take $\hat\beta$ to be the minimizer of the sum of squared errors

$$\sum_{i=1}^n (y_i - \beta x_i)^2$$
A SIMPLE LINEAR MODEL

$$\hat\beta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$$

Now for the randomness. Consider

$$Y_i = \beta x_i + Z_i, \qquad i = 1, 2, \ldots, n,$$

for $Z_i$ iid $N(0, \sigma^2)$. Then $Y_i \sim N(\beta x_i, \sigma^2)$ and

$$\hat\beta = \sum_{i=1}^n \left( \frac{x_i}{\sum_j x_j^2} \right) Y_i \;\sim\; N\!\left( \beta,\; \sigma^2 \Big/ \sum_j x_j^2 \right)$$
A SIMPLE LINEAR MODEL

If we predict each $y_i$ to be $\hat y_i := \hat\beta x_i$, we can define the sum of squared errors to be

$$\mathrm{SSE} = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat\beta x_i)^2$$

We can then estimate the noise variance $\sigma^2$ by the average sum of squared errors $\mathrm{SSE}/n$ or, better yet, we can adjust the denominator slightly to get the unbiased estimator

$$\hat\sigma^2 = \frac{\mathrm{SSE}}{n-1}.$$

This quantity is known as the mean squared error (MSE) and will also be denoted by $s^2$.
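As a concrete illustration, here is a minimal NumPy sketch computing $\hat\beta$ and $s^2$ for a through-the-origin fit. The simulated data and variable names are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(5720)

# Simulate data from y_i = beta * x_i + eps_i (illustrative values)
n, beta_true, sigma = 50, 2.0, 1.5
x = rng.uniform(0.0, 10.0, size=n)
y = beta_true * x + rng.normal(0.0, sigma, size=n)

# OLS estimate for regression through the origin
beta_hat = np.sum(x * y) / np.sum(x ** 2)

# Unbiased noise-variance estimate s^2 = SSE / (n - 1)
sse = np.sum((y - beta_hat * x) ** 2)
s2 = sse / (n - 1)

print(f"beta_hat = {beta_hat:.3f}, s^2 = {s2:.3f}")
```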
THE BAYESIAN APPROACH:

$$Y_i = \beta x_i + Z_i, \qquad Z_i \text{ iid } N(0, \sigma^2)$$

$$f(y_i \mid \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2\sigma^2} (y_i - \beta x_i)^2 \right]$$

$$f(\vec{y} \mid \beta, \sigma^2) = \left(2\pi\sigma^2\right)^{-n/2} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta x_i)^2 \right]$$

It will be convenient to write this in terms of the OLS estimators

$$\hat\beta = \frac{\sum x_i y_i}{\sum x_i^2}, \qquad s^2 = \frac{\sum (y_i - \hat\beta x_i)^2}{n-1}$$
THE BAYESIAN APPROACH:

Then

$$\sum_{i=1}^n (y_i - \beta x_i)^2 = \nu s^2 + (\beta - \hat\beta)^2 \sum_{i=1}^n x_i^2,$$

where $\nu := n - 1$. It will also be convenient to work with the precision parameter $\tau := 1/\sigma^2$. Then

$$f(\vec{y} \mid \beta, \tau) = (2\pi)^{-n/2} \left\{ \tau^{1/2} \exp\left[ -\frac{\tau}{2} (\beta - \hat\beta)^2 \sum_{i=1}^n x_i^2 \right] \right\} \left\{ \tau^{\nu/2} \exp\left[ -\frac{\tau \nu s^2}{2} \right] \right\}$$
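The decomposition is an algebraic identity, so it can be sanity-checked numerically. Continuing the sketch above (the trial value of $\beta$ is arbitrary):

```python
# Verify: sum (y_i - beta*x_i)^2 = nu*s^2 + (beta - beta_hat)^2 * sum x_i^2
beta_trial = 1.3          # any fixed beta works; the identity is algebraic
nu = n - 1
lhs = np.sum((y - beta_trial * x) ** 2)
rhs = nu * s2 + (beta_trial - beta_hat) ** 2 * np.sum(x ** 2)
assert np.isclose(lhs, rhs)
```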
THE BAYESIAN APPROACH:

$$\tau^{1/2} \exp\left[ -\frac{\tau}{2} (\beta - \hat\beta)^2 \sum_{i=1}^n x_i^2 \right] \quad \text{looks normal as a function of } \beta$$

$$\tau^{\nu/2} \exp\left[ -\frac{\tau \nu s^2}{2} \right] \quad \text{looks gamma as a function of } \tau \text{ (inverse gamma as a function of } \sigma^2\text{)}$$

The natural conjugate prior for $(\beta, \sigma^2)$ will be a normal-inverse-gamma.
THE BAYESIAN APPROACH:

So many symbols... we will use underbars and overbars for prior and posterior hyperparameters and also add a little more structure.

Priors:

$$\beta \mid \tau \sim N(\underline\beta,\, \underline c/\tau), \qquad \tau \sim \Gamma(\underline\nu/2,\, \underline\nu\, \underline s^2/2)$$

We will write $(\beta, \tau) \sim NG(\underline\beta, \underline c, \underline\nu/2, \underline\nu\, \underline s^2/2)$.
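One way to internalize the hierarchical structure of this prior is to sample from it: draw $\tau$ from its gamma marginal, then $\beta$ given $\tau$. A minimal sketch continuing the one above (hyperparameter values are illustrative; note NumPy's gamma sampler takes a scale, the reciprocal of the rate $\underline\nu\,\underline s^2/2$):

```python
# Illustrative prior hyperparameters (the underbar quantities)
beta0, c0, nu0, s20 = 0.0, 10.0, 2.0, 1.0

# tau ~ Gamma(nu0/2, rate = nu0*s20/2); NumPy uses scale = 1/rate
tau = rng.gamma(shape=nu0 / 2, scale=2.0 / (nu0 * s20))

# beta | tau ~ N(beta0, c0/tau)
beta = rng.normal(beta0, np.sqrt(c0 / tau))
```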
THE BAYESIAN APPROACH:

It is routine to show that the posterior is

$$(\beta, \tau) \mid \vec{y} \sim NG(\bar\beta, \bar c, \bar\nu/2, \bar\nu \bar s^2/2),$$

where

$$\bar c = \left[ 1/\underline c + \sum x_i^2 \right]^{-1}, \qquad \bar\beta = \bar c \left( \underline c^{-1} \underline\beta + \hat\beta \sum x_i^2 \right),$$

$$\bar\nu = \underline\nu + n, \qquad \bar\nu \bar s^2 = \underline\nu\, \underline s^2 + \nu s^2 + \frac{(\hat\beta - \underline\beta)^2}{\underline c + \left( \sum x_i^2 \right)^{-1}}$$
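These updates are easy to code directly. Here is a minimal sketch (function and variable names are my own) that maps prior hyperparameters and data to posterior hyperparameters, continuing the running example:

```python
def ng_posterior(x, y, beta0, c0, nu0, s20):
    """Posterior hyperparameters for the normal-gamma model
    beta | tau ~ N(beta0, c0/tau), tau ~ Gamma(nu0/2, nu0*s20/2)."""
    n = len(y)
    sxx = np.sum(x ** 2)
    beta_hat = np.sum(x * y) / sxx                  # OLS estimate
    nu = n - 1
    s2 = np.sum((y - beta_hat * x) ** 2) / nu       # OLS variance estimate

    c_bar = 1.0 / (1.0 / c0 + sxx)
    beta_bar = c_bar * (beta0 / c0 + beta_hat * sxx)
    nu_bar = nu0 + n
    nu_s2_bar = nu0 * s20 + nu * s2 + (beta_hat - beta0) ** 2 / (c0 + 1.0 / sxx)
    return beta_bar, c_bar, nu_bar, nu_s2_bar / nu_bar   # last entry is s2_bar

beta_bar, c_bar, nu_bar, s2_bar = ng_posterior(x, y, beta0, c0, nu0, s20)
```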
ESTIMATING β AND σ²:

The posterior Bayes estimator for $\beta$ is $E[\beta \mid \vec{y}]$. A measure of uncertainty of the estimator is given by the posterior variance $\mathrm{Var}[\beta \mid \vec{y}]$. We need to write down the $NG(\bar\beta, \bar c, \bar\nu/2, \bar\nu \bar s^2/2)$ pdf for $(\beta, \tau) \mid \vec{y}$ and integrate out $\tau$. The result is that $\beta \mid \vec{y}$ has a generalized t-distribution. (This is not exactly the same as a non-central t.)
THE MULTIVARIATE t-distribution:

We say that a k-dimensional random vector $\vec{X}$ has a multivariate t-distribution with mean $\vec\mu$, variance-covariance matrix parameter $V$, and $\nu$ degrees of freedom if $\vec{X}$ has pdf

$$f(\vec{x} \mid \vec\mu, V, \nu) = \frac{\nu^{\nu/2}\, \Gamma\!\left( \frac{\nu+k}{2} \right)}{\pi^{k/2}\, \Gamma\!\left( \frac{\nu}{2} \right) |V|^{1/2}} \left[ (\vec{x} - \vec\mu)^t V^{-1} (\vec{x} - \vec\mu) + \nu \right]^{-\frac{\nu+k}{2}}.$$

We will write $\vec{X} \sim t(\vec\mu, V, \nu)$.
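For completeness, here is a sketch of this density in code, assuming SciPy is available. It uses the $(\vec\mu, V, \nu)$ parameterization above (not scipy.stats' own multivariate t convention), and works on the log scale for numerical stability:

```python
from scipy.special import gammaln

def mvt_logpdf(x_vec, mu, V, nu):
    """Log pdf of the multivariate t in the (mu, V, nu) form given above."""
    k = len(mu)
    d = np.asarray(x_vec) - np.asarray(mu)
    quad = d @ np.linalg.solve(V, d)                 # (x - mu)' V^{-1} (x - mu)
    _, logdet = np.linalg.slogdet(V)                 # log |V|
    return (0.5 * nu * np.log(nu) + gammaln(0.5 * (nu + k))
            - 0.5 * k * np.log(np.pi) - gammaln(0.5 * nu) - 0.5 * logdet
            - 0.5 * (nu + k) * np.log(quad + nu))
```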
THE MULTIVARIATE t-distribution:

With $k = 1$, $\mu = 0$, and $V = 1$, we get the usual t-distribution.

Marginals: if

$$\vec{X} = \begin{pmatrix} \vec{X}_1 \\ \vec{X}_2 \end{pmatrix}, \quad \text{then} \quad \vec{X}_i \sim t(\vec\mu_i, V_i, \nu),$$

where $\vec\mu_i$ and $V_i$ are the mean and variance-covariance matrix parameter of $\vec{X}_i$. Conditionals such as $\vec{X}_1 \mid \vec{X}_2$ are also multivariate t.

$$E[\vec{X}] = \vec\mu \ \text{ if } \nu > 1, \qquad \mathrm{Var}[\vec{X}] = \frac{\nu}{\nu - 2}\, V \ \text{ if } \nu > 2$$
BACK TO THE REGRESSION PROBLEM:

Can show that

$$\beta \mid \vec{y} \sim t(\bar\beta, \bar c \bar s^2, \bar\nu)$$

So, the PBE is $E[\beta \mid \vec{y}] = \bar\beta$ and the posterior variance is

$$\mathrm{Var}[\beta \mid \vec{y}] = \frac{\bar\nu}{\bar\nu - 2}\, \bar c\, \bar s^2.$$

Also can show that $\tau \mid \vec{y} \sim \Gamma(\bar\nu/2, \bar\nu \bar s^2/2)$. So,

$$E[\tau \mid \vec{y}] = 1/\bar s^2, \qquad \mathrm{Var}[\tau \mid \vec{y}] = 2/\left(\bar\nu (\bar s^2)^2\right).$$
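Continuing the running sketch, these posterior summaries follow directly from the hyperparameters computed by ng_posterior above:

```python
# Posterior Bayes estimator and posterior variance of beta (needs nu_bar > 2)
pbe_beta = beta_bar
var_beta = nu_bar / (nu_bar - 2) * c_bar * s2_bar

# Posterior summaries for the precision tau
mean_tau = 1.0 / s2_bar
var_tau = 2.0 / (nu_bar * s2_bar ** 2)
```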
RELATIONSHIP TO FREQUENTIST APPROACH:

The PBE of $\beta$ is

$$E[\beta \mid \vec{y}] = \bar\beta = \bar c \left( \underline c^{-1} \underline\beta + \hat\beta \sum x_i^2 \right).$$

It is a weighted average of the prior mean and the OLS estimator of $\beta$ from frequentist statistics.

- $\underline c^{-1}$ reflects your confidence in the prior and should be chosen accordingly
- $\sum x_i^2$ reflects the degree of confidence that the data has in the OLS estimator $\hat\beta$
RELATIONSHIP TO FREQUENTIST APPROACH:

Recall also that

$$\bar\nu \bar s^2 = \underline\nu\, \underline s^2 + \nu s^2 + \frac{(\hat\beta - \underline\beta)^2}{\underline c + \left( \sum x_i^2 \right)^{-1}}$$

and

$$s^2 = \frac{\sum (y_i - \hat\beta x_i)^2}{n - 1} = \frac{\mathrm{SSE}}{n-1} = \frac{\mathrm{SSE}}{\nu}.$$

So,

$$\underbrace{\bar\nu \bar s^2}_{\text{posterior SSE}} = \underbrace{\underline\nu\, \underline s^2}_{\text{prior SSE}} + \underbrace{\nu s^2}_{\text{SSE}} + \frac{(\hat\beta - \underline\beta)^2}{\underline c + \left( \sum x_i^2 \right)^{-1}}$$

The final term reflects conflict between the prior and the data.
CHOOSING PRIOR HYPERPARAMETERS:

When choosing hyperparameters $\underline\beta$, $\underline c$, $\underline\nu$, and $\underline s^2$, it may be helpful to know that $\underline\beta$ is equivalent to the OLS estimate from an imaginary data set with

- $\underline\nu + 1$ observations
- imaginary $\sum x_i^2$ equal to $\underline c^{-1}$
- imaginary $s^2$ given by $\underline s^2$

The imaginary data set might even be previous data!
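When previous data are available, this interpretation converts directly into code. A hypothetical sketch, with a simulated previous data set standing in for the imaginary one:

```python
# Hypothetical previous data set standing in for the "imaginary" one
x_prev = rng.uniform(0.0, 10.0, size=20)
y_prev = 2.0 * x_prev + rng.normal(0.0, 1.5, size=20)

# Prior hyperparameters implied by treating it as imaginary data
beta0 = np.sum(x_prev * y_prev) / np.sum(x_prev ** 2)    # OLS estimate
c0 = 1.0 / np.sum(x_prev ** 2)                           # imaginary sum x_i^2 = 1/c0
nu0 = len(x_prev) - 1                                    # nu0 + 1 observations
s20 = np.sum((y_prev - beta0 * x_prev) ** 2) / nu0       # imaginary s^2
```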
MODEL COMPARISON

Suppose you want to fit this overly simplistic linear model to describe the $y_i$ but are not sure whether you want to use the $x_i$ or a different set of explanatory variables. Consider the two models:

$$M_1: y_i = \beta_1 x_{1i} + \varepsilon_{1i} \qquad\qquad M_2: y_i = \beta_2 x_{2i} + \varepsilon_{2i}$$

Here, we assume that $\varepsilon_{1i}$ iid $N(0, \tau_1^{-1})$ and $\varepsilon_{2i}$ iid $N(0, \tau_2^{-1})$ are independent.
MODEL COMPARISON

Priors for model j:

$$(\beta_j, \tau_j) \sim NG(\underline\beta_j, \underline c_j, \underline\nu_j/2, \underline\nu_j\, \underline s_j^2/2)$$

Posteriors for model j:

$$(\beta_j, \tau_j) \mid \vec{y} \sim NG(\bar\beta_j, \bar c_j, \bar\nu_j/2, \bar\nu_j \bar s_j^2/2)$$

The posterior odds ratio is

$$PO_{12} := \frac{P(M_1 \mid \vec{y})}{P(M_2 \mid \vec{y})} = \frac{f(\vec{y} \mid M_1)}{f(\vec{y} \mid M_2)} \cdot \frac{P(M_1)}{P(M_2)}$$
MODEL COMPARISON

Can show that

$$f(\vec{y} \mid M_j) = a_j \left( \frac{\bar c_j}{\underline c_j} \right)^{1/2} \left( \bar\nu_j \bar s_j^2 \right)^{-\bar\nu_j/2},$$

where

$$a_j = \frac{\Gamma(\bar\nu_j/2) \left( \underline\nu_j\, \underline s_j^2 \right)^{\underline\nu_j/2}}{\Gamma(\underline\nu_j/2)\, \pi^{n/2}}$$
MODEL COMPARISON

We can get the posterior model probabilities:

$$P(M_1 \mid \vec{y}) = \frac{PO_{12}}{1 + PO_{12}}, \qquad P(M_2 \mid \vec{y}) = \frac{1}{1 + PO_{12}},$$

where

$$PO_{12} = \frac{a_1 \left( \bar c_1/\underline c_1 \right)^{1/2} \left( \bar\nu_1 \bar s_1^2 \right)^{-\bar\nu_1/2}}{a_2 \left( \bar c_2/\underline c_2 \right)^{1/2} \left( \bar\nu_2 \bar s_2^2 \right)^{-\bar\nu_2/2}} \cdot \frac{P(M_1)}{P(M_2)}$$
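In practice one computes the marginal likelihoods on the log scale to avoid overflow. A sketch reusing ng_posterior and gammaln from the blocks above; the prior hyperparameters and the competing regressor x2 are illustrative:

```python
def log_marginal(x, y, beta0, c0, nu0, s20):
    """log f(y | M) for the normal-gamma regression model."""
    n = len(y)
    _, c_bar, nu_bar, s2_bar = ng_posterior(x, y, beta0, c0, nu0, s20)
    log_a = (gammaln(nu_bar / 2) + (nu0 / 2) * np.log(nu0 * s20)
             - gammaln(nu0 / 2) - (n / 2) * np.log(np.pi))
    return (log_a + 0.5 * (np.log(c_bar) - np.log(c0))
            - (nu_bar / 2) * np.log(nu_bar * s2_bar))

x2 = rng.uniform(0.0, 10.0, size=n)          # an alternative regressor for M2
log_po12 = (log_marginal(x, y, 0.0, 10.0, 2.0, 1.0)
            - log_marginal(x2, y, 0.0, 10.0, 2.0, 1.0))   # equal prior odds
po12 = np.exp(log_po12)
p_m1 = po12 / (1.0 + po12)
```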
MODEL COMPARISON

$$PO_{12} = \frac{a_1 \left( \bar c_1/\underline c_1 \right)^{1/2} \left( \bar\nu_1 \bar s_1^2 \right)^{-\bar\nu_1/2}}{a_2 \left( \bar c_2/\underline c_2 \right)^{1/2} \left( \bar\nu_2 \bar s_2^2 \right)^{-\bar\nu_2/2}} \cdot \frac{P(M_1)}{P(M_2)}$$

$\bar\nu_j \bar s_j^2$ contains the OLS SSE. A lower value indicates a better fit. So, the posterior odds ratio rewards models which fit the data better.
MODEL COMPARISON

$$PO_{12} = \frac{a_1 \left( \bar c_1/\underline c_1 \right)^{1/2} \left( \bar\nu_1 \bar s_1^2 \right)^{-\bar\nu_1/2}}{a_2 \left( \bar c_2/\underline c_2 \right)^{1/2} \left( \bar\nu_2 \bar s_2^2 \right)^{-\bar\nu_2/2}} \cdot \frac{P(M_1)}{P(M_2)}$$

$\bar\nu_j \bar s_j^2$ also contains a term like $(\hat\beta_j - \underline\beta_j)^2$. So, the posterior odds ratio rewards greater coherence between prior information and data information!