The Normal Linear Regression Model with Natural Conjugate Prior

March 7, 2016
The plan

- Estimate a simple regression model using Bayesian methods
- Formulate a prior
- Combine prior and likelihood to compute the posterior
- Model comparison

Main reading: Ch. 2 in Gary Koop's Bayesian Econometrics
Recap
Bayes Rule

Deriving Bayes Rule: the joint density can be factored symmetrically,
$$p(A, B) = p(A \mid B)\,p(B) \qquad \text{and} \qquad p(A, B) = p(B \mid A)\,p(A)$$
implying
$$p(A \mid B)\,p(B) = p(B \mid A)\,p(A)$$
or
$$p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}$$
which is known as Bayes Rule.
Data and parameters

The purpose of Bayesian analysis is to use the data y to learn about the parameters θ:
- parameters of a statistical model
- or anything not directly observed

Replace A and B in Bayes Rule with θ and y to get
$$p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{p(y)}$$
The probability density p(θ | y) then describes what we know about θ, given the data.
The posterior density

The posterior density is often the object of fundamental interest in Bayesian estimation: the posterior is the result. We are interested in the entire conditional distribution of the parameters θ,
$$p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{p(y)}$$
Today we will start filling in the blanks and specify a prior and a data generating process (i.e. a likelihood function) in order to find the posterior.
The linear regression model
The linear regression model

We will start with something very basic: the linear regression model
$$y_i = \beta x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad i = 1, 2, \dots, N$$
The variable $x_i$ is fixed (i.e. not random), or independent of $\varepsilon_i$ and with a probability density function $p(x_i \mid \lambda)$ where the parameter(s) λ do not include β or σ².
Walk before we run, etc.
[Figure 2.1: marginal prior, posterior, and likelihood for β (probability density against β)]
The likelihood function
Defining the likelihood function

The assumptions made about the linear regression model imply that $p(y_i \mid \beta, \sigma^2)$ is Normal:
$$p(y_i \mid \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y_i - \beta x_i)^2}{2\sigma^2}\right]$$
or that
$$p(y \mid \beta, \sigma^2) = \frac{1}{(2\pi)^{N/2}\sigma^N} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - \beta x_i)^2\right]$$
where $y \equiv \begin{bmatrix} y_1 & y_2 & \cdots & y_N \end{bmatrix}'$.
Reformulating the likelihood function

It will be convenient to use that
$$\sum_{i=1}^{N}(y_i - \beta x_i)^2 = \nu s^2 + \left(\beta - \hat{\beta}\right)^2 \sum x_i^2$$
where
$$\nu = N - 1, \qquad \hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}, \qquad s^2 = \frac{\sum (y_i - \hat{\beta} x_i)^2}{\nu}$$
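A minimal numerical sketch of this decomposition (not from Koop's text; NumPy and the simulated data below are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x = rng.standard_normal(N)
y = 2.0 * x + rng.standard_normal(N)          # y_i = 2*x_i + eps_i, eps_i ~ N(0,1)

nu = N - 1                                    # degrees of freedom
beta_hat = np.sum(x * y) / np.sum(x ** 2)     # OLS estimate of beta
s2 = np.sum((y - beta_hat * x) ** 2) / nu     # OLS variance estimate

beta = 1.7                                    # any candidate value of beta
lhs = np.sum((y - beta * x) ** 2)
rhs = nu * s2 + (beta - beta_hat) ** 2 * np.sum(x ** 2)
print(np.isclose(lhs, rhs))                   # True: the decomposition holds
```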
Reformulating the likelihood function

The likelihood function can then be written as
$$p(y \mid \beta, \sigma^2) = \frac{1}{(2\pi)^{N/2}} \left\{ h^{1/2} \exp\left[-\frac{h}{2}\left(\beta - \hat{\beta}\right)^2 \sum_{i=1}^{N} x_i^2\right]\right\} \left\{ h^{\nu/2} \exp\left[-\frac{h\nu}{2 s^{-2}}\right]\right\}$$
where $h \equiv \sigma^{-2}$.
Formulating a prior
The Prior density

The prior density should reflect information about θ that we have prior to observing y. Priors should preferably
- be easy to interpret
- be of a form that facilitates computing the posterior
Natural conjugate priors have both of these advantages.
Conjugate priors

Conjugate priors, when combined with the likelihood function, result in posteriors that are of the same family of distributions. Natural conjugate priors have the same functional form as the likelihood function. The posterior p(θ | y) can then be thought of as being the result of two sub-samples,
$$p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$$
- The sufficient statistic for θ of the first sub-sample is described by the hyper-parameters of the prior
- The information about θ in the second sub-sample is described by the sufficient statistic for the actual observed sample (i.e. y)
This feature makes natural conjugate priors easy to interpret.
Specifying a prior density: it's your choice

For the Normal regression model we need to elicit, i.e. formulate, a prior p(θ) for β and h, which we can denote p(β, h). It will be convenient to use that
$$p(\beta, h) = p(\beta \mid h)\,p(h)$$
and let $\beta \mid h \sim N(\underline{\beta}, h^{-1}\underline{V})$ and $h \sim G(\underline{s}^{-2}, \underline{\nu})$. The natural conjugate prior for β and h is then denoted
$$\beta, h \sim NG\left(\underline{\beta}, \underline{V}, \underline{s}^{-2}, \underline{\nu}\right)$$
Notation: Koop uses bars below parameters to denote hyper-parameters of a prior, and bars above to denote hyper-parameters of a posterior.
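One way to make the Normal-Gamma prior concrete is to simulate from it: first draw h from its Gamma marginal, then β given h. The sketch below is my own, with illustrative hyper-parameter values; note that Koop's G(μ, ν) parameterisation (mean μ, degrees of freedom ν) corresponds to a Gamma with shape ν/2 and scale 2μ/ν.

```python
import numpy as np

rng = np.random.default_rng(0)

# Prior hyper-parameters (illustrative values)
beta_0, V_0 = 1.5, 0.25          # prior mean and (relative) variance of beta
s2_0, nu_0 = 1.0, 10             # prior "variance guess" and degrees of freedom

# h ~ G(1/s2_0, nu_0): shape nu_0/2, mean 1/s2_0  =>  scale = 2/(nu_0*s2_0)
h = rng.gamma(shape=nu_0 / 2, scale=2.0 / (nu_0 * s2_0), size=10_000)

# beta | h ~ N(beta_0, V_0 / h)
beta = rng.normal(beta_0, np.sqrt(V_0 / h))

print(beta.mean(), beta.std())   # draws from the marginal (Student-t) prior for beta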
What makes a prior conjugate?

Short answer: convenient functional forms.

Long answer: remember the likelihood function
$$p(y \mid \beta, \sigma^2) = \frac{1}{(2\pi)^{N/2}} \left\{ h^{1/2} \exp\left[-\frac{h}{2}\left(\beta - \hat{\beta}\right)^2 \sum_{i=1}^{N} x_i^2\right]\right\} \left\{ h^{\nu/2} \exp\left[-\frac{h\nu}{2 s^{-2}}\right]\right\}$$
and consider the Normal-Gamma prior distribution $p(\beta \mid h)\,p(h)$ where $\beta \mid h \sim N(\underline{\beta}, h^{-1}\underline{V})$ and $h \sim G(\underline{s}^{-2}, \underline{\nu})$. We then have
$$p(\beta \mid h)\,p(h) = \left\{\frac{1}{\sqrt{2\pi h^{-1}\underline{V}}} \exp\left[-\frac{\left(\beta - \underline{\beta}\right)^2}{2 h^{-1}\underline{V}}\right]\right\} \left\{ c_G^{-1}\, h^{\frac{\underline{\nu}-2}{2}} \exp\left(-\frac{h\underline{\nu}}{2\underline{s}^{-2}}\right)\right\}$$
Note the similar functional forms.
The posterior
The Normal-Gamma posterior

It is possible to use Bayes rule,
$$p(\beta, h \mid y) = \frac{p(y \mid \beta, h)\,p(\beta \mid h)\,p(h)}{p(y)}$$
to find an analytical expression for the posterior, which is Normal-Gamma:
$$p(\beta, h \mid y) = \left\{\frac{1}{\sqrt{2\pi h^{-1}\overline{V}}} \exp\left[-\frac{\left(\beta - \overline{\beta}\right)^2}{2 h^{-1}\overline{V}}\right]\right\} \left\{ c_G^{-1}\, h^{\frac{\overline{\nu}-2}{2}} \exp\left(-\frac{h\overline{\nu}}{2\overline{s}^{-2}}\right)\right\}$$
i.e. the same form as the prior:
$$\beta, h \mid y \sim NG\left(\overline{\beta}, \overline{V}, \overline{s}^{-2}, \overline{\nu}\right) \qquad \text{vs.} \qquad \beta, h \sim NG\left(\underline{\beta}, \underline{V}, \underline{s}^{-2}, \underline{\nu}\right)$$
The posterior

The (hyper) parameters of the posterior
$$\beta, h \mid y \sim NG\left(\overline{\beta}, \overline{V}, \overline{s}^{-2}, \overline{\nu}\right)$$
are a combination of the (hyper) parameters of the prior and sample information:
$$\overline{V} = \frac{1}{\underline{V}^{-1} + \sum x_i^2}$$
$$\overline{\beta} = \overline{V}\left(\underline{V}^{-1}\underline{\beta} + \hat{\beta}\sum x_i^2\right)$$
$$\overline{\nu} = \underline{\nu} + N$$
$$\overline{\nu}\,\overline{s}^2 = \underline{\nu}\,\underline{s}^2 + \nu s^2 + \frac{\left(\hat{\beta} - \underline{\beta}\right)^2}{\underline{V} + \left(\sum x_i^2\right)^{-1}}$$
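These updating formulas map directly into code. A minimal sketch (the helper name is my own, NumPy assumed):

```python
import numpy as np

def ng_posterior(y, x, beta_0, V_0, s2_0, nu_0):
    """Posterior hyper-parameters of the Normal-Gamma model y_i = beta*x_i + eps_i."""
    N = len(y)
    nu = N - 1
    sxx = np.sum(x ** 2)
    beta_hat = np.sum(x * y) / sxx                      # OLS estimate
    s2 = np.sum((y - beta_hat * x) ** 2) / nu           # OLS variance estimate

    V_1 = 1.0 / (1.0 / V_0 + sxx)                       # V-bar
    beta_1 = V_1 * (beta_0 / V_0 + beta_hat * sxx)      # beta-bar
    nu_1 = nu_0 + N                                     # nu-bar
    nu_s2_1 = nu_0 * s2_0 + nu * s2 + (beta_hat - beta_0) ** 2 / (V_0 + 1.0 / sxx)
    return beta_1, V_1, nu_s2_1 / nu_1, nu_1            # beta-bar, V-bar, s2-bar, nu-bar
```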
Posterior marginal distribution of β

In regression analysis we are often interested in the mean and the variance of the coefficient β. The mean can be found by integrating the posterior with respect to h and β:
$$E(\beta \mid y) = \int\!\!\int \beta\, p(\beta, h \mid y)\, dh\, d\beta = \int \beta\, p(\beta \mid y)\, d\beta$$
The density p(β | y) is called the marginal distribution of β. It can be shown that
$$\beta \mid y \sim t\left(\overline{\beta}, \overline{s}^2\overline{V}, \overline{\nu}\right)$$
so that
$$E(\beta \mid y) = \overline{\beta}, \qquad \operatorname{var}(\beta \mid y) = \frac{\overline{\nu}\,\overline{s}^2}{\overline{\nu} - 2}\,\overline{V}$$
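Given the posterior hyper-parameters, the marginal posterior moments of β follow immediately; a short sketch continuing the hypothetical helper above:

```python
def beta_posterior_moments(beta_1, V_1, s2_1, nu_1):
    """Mean and variance of the marginal (Student-t) posterior of beta."""
    mean = beta_1
    var = nu_1 * s2_1 * V_1 / (nu_1 - 2)   # requires nu_1 > 2
    return mean, var
```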
Combining prior and sample information

The posterior mean of β is given by
$$\overline{\beta} = \overline{V}\left(\underline{V}^{-1}\underline{\beta} + \hat{\beta}\sum x_i^2\right), \qquad \overline{V} = \frac{1}{\underline{V}^{-1} + \sum x_i^2}$$
Do these expressions make sense?
- $\underline{V}^{-1}$ is the precision of the prior
- $\sum x_i^2$ is the precision of the data
What happens when $\underline{V}^{-1} = 0$? When $\underline{V} \to 0$?
Example
Simple empirical example

Consider a sample of 50 observations generated from the model
$$y_i = \beta x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$
with β = 2, σ² = 1 and $x_i \sim N(0, 1)$. Set the prior hyper-parameters in $\beta, h \sim NG\left(\underline{\beta}, \underline{V}, \underline{s}^{-2}, \underline{\nu}\right)$ to
$$\underline{\beta} = 1.5, \quad \underline{V} = 0.25, \quad \underline{s}^2 = 1, \quad \underline{\nu} = 10$$
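Putting the pieces together, a sketch that simulates data of this kind and computes the posterior (my own code, using NumPy and SciPy; the exact draws will of course differ from those behind the figures):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate the data
N, beta_true, sigma2 = 50, 2.0, 1.0
x = rng.standard_normal(N)
y = beta_true * x + rng.normal(0.0, np.sqrt(sigma2), N)

# Prior hyper-parameters from the slide
beta_0, V_0, s2_0, nu_0 = 1.5, 0.25, 1.0, 10

# Posterior hyper-parameters (same algebra as the ng_posterior sketch above)
sxx = np.sum(x ** 2)
beta_hat = np.sum(x * y) / sxx
s2 = np.sum((y - beta_hat * x) ** 2) / (N - 1)

V_1 = 1.0 / (1.0 / V_0 + sxx)
beta_1 = V_1 * (beta_0 / V_0 + beta_hat * sxx)
nu_1 = nu_0 + N
s2_1 = (nu_0 * s2_0 + (N - 1) * s2 + (beta_hat - beta_0) ** 2 / (V_0 + 1.0 / sxx)) / nu_1

# Marginal posterior of beta: Student-t with nu_1 dof, location beta_1, scale sqrt(s2_1*V_1)
beta_post = stats.t(df=nu_1, loc=beta_1, scale=np.sqrt(s2_1 * V_1))
print(beta_post.mean(), beta_post.std())
```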
The data

[Figure: scatter plot of the simulated data, y against x]
[Figure 2.1: marginal prior, posterior, and likelihood for β (probability density against β)]
Change the prior mean

$$\underline{\beta} = 2.5, \quad \underline{V} = 0.25, \quad \underline{s}^2 = 1, \quad \underline{\nu} = 10$$
[Figure 2.1: marginal prior, posterior, and likelihood for β (probability density against β)]
Change the prior precision

$$\underline{\beta} = 1.5, \quad \underline{V} = 0.5, \quad \underline{s}^2 = 1, \quad \underline{\nu} = 10$$
[Figure 2.1: marginal prior, posterior, and likelihood for β (probability density against β)]
Change the prior mean of the variance

$$\underline{\beta} = 1.5, \quad \underline{V} = 0.25, \quad \underline{s}^2 = 10, \quad \underline{\nu} = 10$$
[Figure 2.1: marginal prior, posterior, and likelihood for β (probability density against β)]
Change the prior degrees of freedom

$$\underline{\beta} = 1.5, \quad \underline{V} = 0.25, \quad \underline{s}^2 = 1, \quad \underline{\nu} = 100$$
[Figure 2.1: marginal prior, posterior, and likelihood for β (probability density against β)]
More data: N=100
[Figure: scatter plot of the simulated data, y against x]
[Figure 2.1: marginal prior, posterior, and likelihood for β (probability density against β)]
Even more data: N = 500
[Figure 2.1: marginal prior, posterior, and likelihood for β (probability density against β)]
Prior precise but with $\underline{\beta}$ far from $\hat{\beta}$
[Figure 2.1: marginal prior, posterior, and likelihood for β (probability density against β)]
Noninformative prior

A non-informative prior sets $\underline{V}^{-1} = 0$ and $\underline{\nu} = 0$, which gives
$$\overline{V} = \frac{1}{\sum x_i^2}, \qquad \overline{\beta} = \hat{\beta}, \qquad \overline{\nu} = N, \qquad \overline{\nu}\,\overline{s}^2 = \nu s^2$$
which are the same as the OLS estimates. But this prior is so-called improper, i.e. the density p(β, h) does not integrate to 1:
$$\int\!\!\int p(\beta, h)\, d\beta\, dh \neq 1$$
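A quick numerical check of this equivalence, a sketch of my own that takes the noninformative limit in the updating formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.standard_normal(N)
y = 2.0 * x + rng.standard_normal(N)

sxx = np.sum(x ** 2)
beta_hat = np.sum(x * y) / sxx          # OLS estimate

# Noninformative limit: prior precision 1/V_0 -> 0 and nu_0 -> 0
V_1 = 1.0 / sxx
beta_1 = V_1 * (beta_hat * sxx)         # equals beta_hat
nu_1 = N
print(np.isclose(beta_1, beta_hat))     # True
```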
Model comparison

We may have two models that could both potentially explain y:
$$y_i = \beta_j x_{ji} + \varepsilon_{ji}, \qquad \varepsilon_{ji} \sim N(0, h_j^{-1}), \qquad j = 1, 2$$
For model comparison it is important that the models explain the same data. Improper priors can be used only for parameters shared by both models.
Model comparison

We can formulate priors and compute posteriors just as before:

Prior: $\beta_j, h_j \sim NG\left(\underline{\beta}_j, \underline{V}_j, \underline{s}_j^{-2}, \underline{\nu}_j\right)$
Posterior: $\beta_j, h_j \mid y \sim NG\left(\overline{\beta}_j, \overline{V}_j, \overline{s}_j^{-2}, \overline{\nu}_j\right)$

Again, the (hyper) parameters of the posterior are a combination of the (hyper) parameters of the prior and sample information:
$$\overline{V}_j = \frac{1}{\underline{V}_j^{-1} + \sum x_{ji}^2}$$
$$\overline{\beta}_j = \overline{V}_j\left(\underline{V}_j^{-1}\underline{\beta}_j + \hat{\beta}_j\sum x_{ji}^2\right)$$
$$\overline{\nu}_j = \underline{\nu}_j + N$$
$$\overline{\nu}_j\,\overline{s}_j^2 = \underline{\nu}_j\,\underline{s}_j^2 + \nu_j s_j^2 + \frac{\left(\hat{\beta}_j - \underline{\beta}_j\right)^2}{\underline{V}_j + \left(\sum x_{ji}^2\right)^{-1}}$$
Posterior odds ratio

The posterior odds ratio is the relative probability of two models conditional on the data:
$$PO_{12} \equiv \frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{p(y \mid M_1)\,p(M_1)}{p(y \mid M_2)\,p(M_2)}$$
The prior model probabilities $p(M_j)$ are formulated before seeing the data. The marginal likelihood $p(y \mid M_j)$ is calculated as
$$p(y \mid M_j) = \int\!\!\int p(y \mid \beta_j, h_j)\,p(\beta_j, h_j)\, dh_j\, d\beta_j$$
The marginal likelihood for the Normal-Gamma model

The marginal likelihood for the Normal-Gamma model is given by
$$p(y \mid M_j) = c_j \left(\frac{\overline{V}_j}{\underline{V}_j}\right)^{\frac{1}{2}} \left(\overline{\nu}_j\,\overline{s}_j^2\right)^{-\frac{\overline{\nu}_j}{2}}$$
where
$$c_j = \frac{\Gamma\left(\frac{\overline{\nu}_j}{2}\right)\left(\underline{\nu}_j\,\underline{s}_j^2\right)^{\frac{\underline{\nu}_j}{2}}}{\Gamma\left(\frac{\underline{\nu}_j}{2}\right)\pi^{\frac{N}{2}}}$$
and Γ( ) is the Gamma function.
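The marginal likelihood can be evaluated directly from the prior and posterior hyper-parameters; a sketch (my own function name, working in logs and using SciPy's log-gamma for numerical stability):

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(N, V_0, s2_0, nu_0, V_1, s2_1, nu_1):
    """Log marginal likelihood p(y | M_j) of the Normal-Gamma regression model."""
    log_c = (gammaln(nu_1 / 2) + (nu_0 / 2) * np.log(nu_0 * s2_0)
             - gammaln(nu_0 / 2) - (N / 2) * np.log(np.pi))
    return log_c + 0.5 * (np.log(V_1) - np.log(V_0)) - (nu_1 / 2) * np.log(nu_1 * s2_1)
```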
Posterior odds ratio

The posterior odds ratio is then given by
$$PO_{12} = \frac{c_1 \left(\frac{\overline{V}_1}{\underline{V}_1}\right)^{\frac{1}{2}} \left(\overline{\nu}_1\,\overline{s}_1^2\right)^{-\frac{\overline{\nu}_1}{2}} p(M_1)}{c_2 \left(\frac{\overline{V}_2}{\underline{V}_2}\right)^{\frac{1}{2}} \left(\overline{\nu}_2\,\overline{s}_2^2\right)^{-\frac{\overline{\nu}_2}{2}} p(M_2)}$$
The posterior odds ratio rewards
- fitting the data, via the term $\overline{\nu}_j\,\overline{s}_j^2$, which contains the sum of squared errors $\nu_j s_j^2$
- coherence between prior and OLS estimate, via the term $\overline{\nu}_j\,\overline{s}_j^2$, which contains $\left(\hat{\beta}_j - \underline{\beta}_j\right)^2$
When models have different numbers of parameters, the posterior odds ratio also rewards parsimony, i.e. the model with fewer parameters.
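With the marginal likelihoods in hand, the posterior odds ratio is just their ratio weighted by the prior model probabilities; for example, assuming the log_marginal_likelihood sketch above:

```python
import numpy as np

def posterior_odds(log_ml_1, log_ml_2, p_m1=0.5, p_m2=0.5):
    """PO_12 from the two log marginal likelihoods and prior model probabilities."""
    return np.exp(log_ml_1 - log_ml_2) * p_m1 / p_m2
```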
That's it for today.