An Introduction to Bayesian Linear Regression

Size: px

Start display at page:

Download "An Introduction to Bayesian Linear Regression"

Oliver Marsh
5 years ago
Views:

1 An Introduction to Bayesian Linear Regression APPM 5720: Bayesian Computation Fall 2018

2 A SIMPLE LINEAR MODEL Suppose that we observe explanatory variables x 1, x 2,..., x n and dependent variables y 1, y 2,..., y n Assume they are related through the very simple linear model y i = βx i + ε i for i = 1, 2,..., n, with ε 1, ε 2,..., ε n being realizations of iid N(0, σ 2 ) random variables.

3 A SIMPLE LINEAR MODEL y i = βx i + ε i, i = 1, 2,..., n The x i can either be constants or realizations of random variables. In the latter case, assume that they have joint pdf f ( x θ) where θ is a parameter (or vector of parameters) that is unrelated to β and σ 2. The likelihood for this model is f ( y, x β, σ 2, θ) = f ( y x, β, σ 2, θ) f ( x β, σ 2, θ) = f ( y x, β, σ 2 ) f ( x θ)

4 A SIMPLE LINEAR MODEL Assume that the x i are fixed. The likelihood for the model is then f ( y x, β, σ 2 ). The goal is to estimate and make inferences about the parameters β and σ 2. Frequentist Approach: Ordinary Least Squares (OLS) y i is supposed to be β times x i plus some residual noise. The noise, modeled by a normal distribution, is observed as y i βx i. Take β to be the minimizer of the sum of squared errors n (y i βx i ) 2 i=1

5 A SIMPLE LINEAR MODEL β = n i=1 x iy i n i=1 x2 i Now for the randomness. Consider Y i = βx i + Z i, i = 1, 2,..., n for Z i iid N(0, σ 2 ). Then Y i N(βx i, σ 2 ) β = ( n i=1 x i x 2 j ) Y i N ( β, σ 2 / ) x 2 j

6 A SIMPLE LINEAR MODEL If we predict each y i to be ŷ i := βx i, we can define the sum of squared errors to be SSE = n (y i ŷ i ) 2 = i=1 n (y i βx i ) 2 i=1 We can then estimate the noise variance σ 2 by the average sum of squared errors SSE/n or, better yet, we can adjust the denominator slightly to get the unbiased estimator σ 2 = SSE n 1. This quantity is known as the mean squared error or MSE and will also be denoted by s 2.

7 THE BAYESIAN APPROACH: Y i = βx i + Z i, Z i iid N(0, σ 2 ) f (y i β, σ 2 ) = f ( y β, σ 2 ) = ( 2πσ 2) n/2 exp [ [ 1 exp 1 ] 2πσ 2 2σ 2 (y i βx i ) 2 1 2σ 2 ] n (y i βx i ) 2 It will be convenient to write this in terms of the OLS estimators xi y i β =, s 2 (yi = βx i ) 2 x 2 i n 1 i=1

8 THE BAYESIAN APPROACH: Then n (y i βx i ) 2 = νs 2 + (β β) n 2 i=1 where ν := n 1. It will also be convenient to work with the precision parameter τ := 1/σ 2. Then f ( y β, τ) = (2π) n/2 i=1 { [ τ 1/2 exp τ 2 (β β) 2 ]} n i=1 x2 i { [ τ ν/2 exp τνs2 2 ]} x 2 i

9 THE BAYESIAN APPROACH: [ τ 1/2 exp τ 2 (β β) 2 ] n i=1 x2 i [ τ ν/2 exp τνs2 2 ] looks normal as a function of β looks gamma as a function of τ (inverse gamma as a function of σ 2 ) The natural conjugate prior for (β, σ 2 ) will be a normal inverse gamma.

10 THE BAYESIAN APPROACH: So many symbols... will use underbars and overbars for prior and posterior hyperparameters and also add a little more structure. Priors β τ N(β, c/τ), τ Γ(ν/2, ν s 2 /2) Will write (β, τ) NG(β, c, ν/2, ν s 2 /2).

11 THE BAYESIAN APPROACH: It is routine to show that the posterior is (β, τ) y NG(β, c, ν/2, ν s 2 /2) where c = [ 1/c + x 2 i ] 1, β = c(c 1 β + β x 2 i ) ν = ν + n, νs 2 = νs 2 + νs 2 + ( β β) 2 c + x 2 i

12 ESTIMATING β AND σ 2 : The posterior Bayes estimator for β is E[β y]. A measure of uncertainty of the estimator is given by the posterior variance Var[β y]. We need to write down the NG(β, c, ν/2, ν s 2 /2) pdf for (β, τ) y and integrate out τ. The result is that β y has a generalized t-distribution. (This is not exactly the same as a non-central t.)

13 THE MULTIVARIATE t-distribution: We say that a k-dimensional random vector X has a multivariate t-distribution with mean µ variance-covariance matrix parameter V ν degrees of freedon if X has pdf ( ) ν ν/2 Γ ν+k 2 [ f ( x µ, V, ν) = π k/2 Γ ( ) ν V 1/2 ( x µ) t V 1 ( x µ) + ν 2 We will write X t( µ, V, ν). ] ν+k 2.

14 THE MULTIVARIATE t-distribution: With k = 1, µ = 0, and V = 1, we get the usual t-distribution. Marginals: X = ( X1 X 2 ) Xi t( µ i, V i, ν) where µ i and V i are the mean and variance-covariance matrix of X i. Conditionals such as X 1 X 2 are also multivariate t. E[ X] = µ, if ν > 1 Var[ X] = ν ν 2 V if ν > 2

15 BACK TO THE REGRESSION PROBLEM: Can show that β y t(β, cs 2, ν) So, the PBE is E[β y] = β and the posterior variance is Var[β y] = ν ν 1 cs2. Also can show that τ y Γ(ν/2, ν s 2 /2). So, E[τ y] = 1/s 2, Var[τ y] = 2/(ν (s 2 ) 2 ).

16 RELATIONSHIP TO FREQUENTIST APPROACH: The PBE of β E[β y] = β = c(c 1 β + β x 2 i ). It is a weighted average of the prior mean and the OLS estimator of β from frequentist statistics. c 1 reflects your confidence in the prior and should be chosen accordingly x 2 i reflects the degree of confidence that the data has in the OLS estimator β

17 RELATIONSHIP TO FREQUENTIST APPROACH: Recall also that and So, νs 2 = νs 2 + νs 2 + ( β β) 2 s 2 (yi = βx i ) 2 n 1 νs 2 }{{} posterior SSE = νs 2 }{{} prior SSE c + x 2 i = SSE n 1 = SSE ν. + νs 2 }{{} SSE + ( β β) 2 c + x 2 i The final term reflects conflict between the prior and the data.

18 CHOOSING PRIOR HYPERPARAMETERS: When choosing hyperparameters β, c, ν, and s 2, it may be helpful to know that β is equivalent to the OLS estimate from an imaginary data set with ν + 1 observations imaginary x 2 i equal to c 1 imaginary s 2 given by s 2 The imaginary data set might even be previous data!

19 MODEL COMPARISON Suppose you want to fit this overly simplistic linear model to describe the y i but are not sure whether you want to use the x i or a different set of explananatory variables. Consider the two models: M 1 : y i = β 1 x 1i + ε 1i Here, we assume are independent. M 2 : y i = β 2 x 2i + ε 2i ε 1i iid N(0, τ 1 1 ) and ε 2i iid N(0, τ 1 2 )

20 MODEL COMPARISON Priors for model j: posteriors for model j are The posterior odds ratio is (β j, τ j ) NG(β j, c j, ν j /2, ν j s j 2 ) (β j, τ j ) y NG(β j, c j, ν j /2, ν j s j 2 ) PO 12 := P(M 1 y) P(M 2 y) = f ( y M 1) f ( y M 2 ) P(M 1) P(M 2 )

21 MODEL COMPARISON Can show that ( ) 1/2 cj ( ) f ( y M j ) = a j ν j s 2 νj /2 c j j where a j = ( ) Γ(ν j /2) ν j s 2 νj /2 j Γ(ν j /2) π n/2

22 MODEL COMPARISON We can get the posterior model probabilities: P(M 1 y) = PO PO 12, P(M 2 y) = PO 12. where PO 12 = ) ν1 /2 ( ) 1/2 a c1 1 c 1 (ν 1 s 2 1 ( ) 1/2 ) a c2 2 c 2 (ν 2 s 2 ν2 /2 P(M 1) P(M 2 ) 2

23 MODEL COMPARISON PO 12 = ) ν1 /2 ( ) 1/2 a c1 1 c 1 (ν 1 s 2 1 ( ) 1/2 ) a c2 2 c 2 (ν 2 s 2 ν2 /2 P(M 1) P(M 2 ) 2 ν j s 2 j contains the OLS SSE. A lower value indicates a better fit. So, the posterior odds ratio rewards models which fit the data better.

24 MODEL COMPARISON PO 12 = ) ν1 /2 ( ) 1/2 a c1 1 c 1 (ν 1 s 2 1 ( ) 1/2 ) a c2 2 c 2 (ν 2 s 2 ν2 /2 P(M 1) P(M 2 ) 2 ν j s 2 j contains a term like ( β j β j ) 2 So, the posterior odds ratio supports greater coherency between prior info and data info!

The Normal Linear Regression Model with Natural Conjugate Prior. March 7, 2016

The Normal Linear Regression Model with Natural Conjugate Prior March 7, 2016 The Normal Linear Regression Model with Natural Conjugate Prior The plan Estimate simple regression model using Bayesian methods