The Normal Linear Regression Model with Natural Conjugate Prior. March 7, 2016


The plan
- Estimate a simple regression model using Bayesian methods
- Formulate a prior
- Combine the prior and likelihood to compute the posterior
- Model comparison
Main reading: Ch. 2 in Gary Koop's Bayesian Econometrics

Recap

Bayes Rule
Deriving Bayes Rule: symmetrically,
$$p(A, B) = p(A|B)\,p(B) \quad \text{and} \quad p(A, B) = p(B|A)\,p(A),$$
implying
$$p(A|B)\,p(B) = p(B|A)\,p(A)$$
or
$$p(A|B) = \frac{p(B|A)\,p(A)}{p(B)},$$
which is known as Bayes Rule.

Data and parameters
The purpose of Bayesian analysis is to use the data $y$ to learn about the parameters $\theta$:
- parameters of a statistical model
- or anything not directly observed
Replace $A$ and $B$ in Bayes rule with $\theta$ and $y$ to get
$$p(\theta|y) = \frac{p(y|\theta)\,p(\theta)}{p(y)}$$
The probability density $p(\theta|y)$ then describes what we know about $\theta$, given the data.

The posterior density
The posterior density is often the object of fundamental interest in a Bayesian estimation: the posterior is the result. We are interested in the entire conditional distribution of the parameters $\theta$,
$$p(\theta|y) = \frac{p(y|\theta)\,p(\theta)}{p(y)}$$
Today we will start filling in the blanks and specify a prior and a data generating process (i.e. a likelihood function) in order to find the posterior.

The linear regression model

The linear regression model
We will start with something very basic, the linear regression model
$$y_i = \beta x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad i = 1, 2, \ldots, N$$
The variable $x_i$ is either fixed (i.e. not random) or independent of $\varepsilon_i$ and with a probability density function $p(x_i|\lambda)$, where the parameter(s) $\lambda$ do not include $\beta$ or $\sigma^2$.
Walk before we run, etc.
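
As a minimal sketch of this data generating process (the parameter values below are illustrative, matching the simulation used in the example later in the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility

# Illustrative values: beta = 2, sigma^2 = 1, N = 50 (as in the later example)
beta_true, sigma2, N = 2.0, 1.0, 50

x = rng.standard_normal(N)                        # x_i ~ N(0, 1), independent of the errors
eps = rng.normal(0.0, np.sqrt(sigma2), size=N)    # eps_i ~ N(0, sigma^2)
y = beta_true * x + eps                           # y_i = beta * x_i + eps_i
```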


The likelihood function

Defining the likelihood function
The assumptions made about the linear regression model imply that $p(y_i|\beta, \sigma^2)$ is Normal,
$$p(y_i|\beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y_i - \beta x_i)^2}{2\sigma^2}\right]$$
or that
$$p(y|\beta, \sigma^2) = \frac{1}{(2\pi)^{N/2}\,\sigma^N} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - \beta x_i)^2\right]$$
where $y \equiv [\,y_1\; y_2\; \cdots\; y_N\,]'$.
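
A minimal sketch of evaluating this joint density on the log scale (the function name is my own):

```python
import numpy as np

def log_likelihood(beta, sigma2, x, y):
    """Joint log density log p(y | beta, sigma^2) of the Normal regression model above."""
    N = len(y)
    resid = y - beta * x
    return -0.5 * N * np.log(2 * np.pi * sigma2) - np.sum(resid ** 2) / (2 * sigma2)
```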

Reformulating the likelihood function
It will be convenient to use that
$$\sum_{i=1}^N (y_i - \beta x_i)^2 = \nu s^2 + \left(\beta - \hat{\beta}\right)^2 \sum_{i=1}^N x_i^2$$
where
$$\nu = N - 1, \qquad \hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}, \qquad s^2 = \frac{\sum (y_i - \hat{\beta} x_i)^2}{\nu}$$
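
A small helper (a sketch; the function name is my own) that computes the OLS summaries $\nu$, $\hat{\beta}$ and $s^2$ defined above:

```python
import numpy as np

def ols_summaries(x, y):
    """OLS quantities appearing in the decomposition of the sum of squared errors:
    nu = N - 1, beta_hat = sum(x*y) / sum(x^2), s2 = sum((y - beta_hat*x)^2) / nu."""
    N = len(y)
    nu = N - 1
    beta_hat = np.sum(x * y) / np.sum(x ** 2)
    s2 = np.sum((y - beta_hat * x) ** 2) / nu
    return nu, beta_hat, s2
```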

Reformulating the likelihood function
The likelihood function can then be written as
$$p(y|\beta, \sigma^2) = \frac{1}{(2\pi)^{N/2}} \left\{ h^{1/2} \exp\left[-\frac{h}{2}\left(\beta - \hat{\beta}\right)^2 \sum_{i=1}^N x_i^2\right] \right\} \left\{ h^{\nu/2} \exp\left[-\frac{h\nu}{2 s^{-2}}\right] \right\}$$
where $h \equiv \sigma^{-2}$.

Formulating a prior

The prior density
The prior density should reflect information about $\theta$ that we have prior to observing $y$.
Priors should preferably
- be easy to interpret
- be of a form that facilitates computing the posterior
Natural conjugate priors have both of these advantages.

Conjugate priors
Conjugate priors, when combined with the likelihood function, result in posteriors that are of the same family of distributions. Natural conjugate priors have the same functional form as the likelihood function. The posterior $p(\theta|y)$ can then be thought of as being the result of two sub-samples,
$$p(\theta|y) \propto p(y|\theta)\,p(\theta)$$
- The sufficient statistic about $\theta$ of the first sub-sample is described by the hyper-parameters of the prior.
- The information about $\theta$ in the second sub-sample is described by the sufficient statistic for the actual observed sample (i.e. $y$).
This feature makes natural conjugate priors easy to interpret.

Specifying a prior density: it's your choice
For the Normal regression model we need to elicit, i.e. formulate, a prior $p(\theta)$ for $\beta$ and $h$, which we can denote $p(\beta, h)$. It will be convenient to use that
$$p(\beta, h) = p(\beta|h)\,p(h)$$
and let $\beta|h \sim N(\underline{\beta}, h^{-1}\underline{V})$ and $h \sim G(\underline{s}^{-2}, \underline{\nu})$. The natural conjugate prior for $\beta$ and $h$ is then denoted
$$\beta, h \sim NG\left(\underline{\beta}, \underline{V}, \underline{s}^{-2}, \underline{\nu}\right)$$
Notation: Koop uses bars below parameters to denote hyper-parameters of a prior, and bars above to denote hyper-parameters of a posterior.
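
A sketch of drawing from this Normal-Gamma prior, assuming Koop's (mean, degrees-of-freedom) parametrization of the Gamma distribution, which maps to a standard Gamma with shape $\underline{\nu}/2$ and scale $2/(\underline{\nu}\,\underline{s}^2)$; the hyper-parameter values are those used in the later example:

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed

# Prior hyper-parameters (values from the later empirical example)
beta_prior, V_prior, s2_prior, nu_prior = 1.5, 0.25, 1.0, 10

# h ~ G(1/s2_prior, nu_prior) in Koop's parametrization,
# i.e. Gamma(shape = nu/2, scale = 2/(nu*s2)); then beta | h ~ N(beta_prior, V_prior/h)
n_draws = 10_000
h = rng.gamma(shape=nu_prior / 2.0, scale=2.0 / (nu_prior * s2_prior), size=n_draws)
beta = rng.normal(beta_prior, np.sqrt(V_prior / h))

print(beta.mean(), h.mean())     # roughly beta_prior = 1.5 and 1/s2_prior = 1.0
```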

What makes a prior conjugate?
Short answer: convenient functional forms.
Long answer: remember the likelihood function
$$p(y|\beta, \sigma^2) = \frac{1}{(2\pi)^{N/2}} \left\{ h^{1/2} \exp\left[-\frac{h}{2}\left(\beta - \hat{\beta}\right)^2 \sum_{i=1}^N x_i^2\right] \right\} \left\{ h^{\nu/2} \exp\left[-\frac{h\nu}{2 s^{-2}}\right] \right\}$$
and consider the normal-gamma prior distribution $p(\beta|h)\,p(h)$, where $\beta|h \sim N(\underline{\beta}, h^{-1}\underline{V})$ and $h \sim G(\underline{s}^{-2}, \underline{\nu})$. We then have
$$p(\beta|h)\,p(h) = \left\{ \frac{1}{\sqrt{2\pi h^{-1}\underline{V}}} \exp\left[-\frac{\left(\beta - \underline{\beta}\right)^2}{2 h^{-1}\underline{V}}\right] \right\} \left\{ c_G^{-1}\, h^{\frac{\underline{\nu}-2}{2}} \exp\left[-\frac{h\underline{\nu}}{2\underline{s}^{-2}}\right] \right\}$$
where $c_G$ is the integrating constant of the Gamma density. Note the similar functional forms.

The posterior

The Normal-Gamma posterior
It is possible to use Bayes rule,
$$p(\beta, h|y) = \frac{p(y|\beta, h)\,p(\beta|h)\,p(h)}{p(y)}$$
to find an analytical expression for the posterior, which is Normal-Gamma,
$$p(\beta, h|y) = \left\{ \frac{1}{\sqrt{2\pi h^{-1}\overline{V}}} \exp\left[-\frac{\left(\beta - \overline{\beta}\right)^2}{2 h^{-1}\overline{V}}\right] \right\} \left\{ c_G^{-1}\, h^{\frac{\overline{\nu}-2}{2}} \exp\left[-\frac{h\overline{\nu}}{2\overline{s}^{-2}}\right] \right\}$$
i.e. the same form as the prior:
$$\beta, h|y \sim NG\left(\overline{\beta}, \overline{V}, \overline{s}^{-2}, \overline{\nu}\right), \qquad \beta, h \sim NG\left(\underline{\beta}, \underline{V}, \underline{s}^{-2}, \underline{\nu}\right)$$

The posterior
The (hyper) parameters of the posterior $\beta, h|y \sim NG\left(\overline{\beta}, \overline{V}, \overline{s}^{-2}, \overline{\nu}\right)$ are a combination of the (hyper) parameters of the prior and sample information:
$$\overline{V} = \frac{1}{\underline{V}^{-1} + \sum x_i^2}$$
$$\overline{\beta} = \overline{V}\left(\underline{V}^{-1}\underline{\beta} + \hat{\beta}\sum x_i^2\right)$$
$$\overline{\nu} = \underline{\nu} + N$$
$$\overline{\nu}\,\overline{s}^2 = \underline{\nu}\,\underline{s}^2 + \nu s^2 + \frac{\left(\hat{\beta} - \underline{\beta}\right)^2}{\underline{V} + \left(\sum x_i^2\right)^{-1}}$$
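
As a sketch of these update formulas in code (function and variable names are my own; the notation follows the formulas above):

```python
import numpy as np

def ng_posterior(x, y, beta_prior, V_prior, s2_prior, nu_prior):
    """Posterior hyper-parameters of the Normal-Gamma posterior for
    y_i = beta * x_i + eps_i, given the prior NG(beta_prior, V_prior,
    1/s2_prior, nu_prior), following the update formulas above."""
    N = len(y)
    sum_x2 = np.sum(x ** 2)
    beta_hat = np.sum(x * y) / sum_x2              # OLS estimate
    nu = N - 1
    s2 = np.sum((y - beta_hat * x) ** 2) / nu      # OLS error variance

    V_post = 1.0 / (1.0 / V_prior + sum_x2)
    beta_post = V_post * (beta_prior / V_prior + beta_hat * sum_x2)
    nu_post = nu_prior + N
    nu_s2_post = (nu_prior * s2_prior + nu * s2
                  + (beta_hat - beta_prior) ** 2 / (V_prior + 1.0 / sum_x2))
    return beta_post, V_post, nu_s2_post / nu_post, nu_post
```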

Posterior marginal distribution of β
In regression analysis, we are often interested in the mean and the variance of the coefficient $\beta$. The mean can be found by integrating the posterior w.r.t. $h$ and $\beta$,
$$E(\beta|y) = \int\!\!\int \beta\, p(\beta, h|y)\, dh\, d\beta = \int \beta\, p(\beta|y)\, d\beta$$
The density $p(\beta|y)$ is called the marginal distribution of $\beta$. It can be shown that
$$\beta|y \sim t\left(\overline{\beta}, \overline{s}^2\overline{V}, \overline{\nu}\right)$$
so that
$$E(\beta|y) = \overline{\beta}, \qquad \operatorname{var}(\beta|y) = \frac{\overline{\nu}\,\overline{s}^2}{\overline{\nu} - 2}\,\overline{V}$$
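
A sketch of how these marginal moments (and, say, a 95% credible interval) could be computed; the posterior hyper-parameter values below are hypothetical, e.g. output of the update sketch above:

```python
import numpy as np
from scipy import stats

# Hypothetical posterior hyper-parameters (e.g. returned by ng_posterior above)
beta_post, V_post, s2_post, nu_post = 1.9, 0.02, 1.1, 60

post_mean = beta_post                                    # E(beta | y)
post_var = nu_post * s2_post / (nu_post - 2) * V_post    # var(beta | y), needs nu_post > 2

# beta | y is Student-t with nu_post degrees of freedom, location beta_post
# and scale sqrt(s2_post * V_post); use it for a 95% credible interval
scale = np.sqrt(s2_post * V_post)
lo, hi = stats.t.interval(0.95, df=nu_post, loc=beta_post, scale=scale)
print(f"E(beta|y) = {post_mean:.3f}, var(beta|y) = {post_var:.4f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```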

Combining prior and sample information
The posterior mean of $\beta$ is given by
$$\overline{\beta} = \overline{V}\left(\underline{V}^{-1}\underline{\beta} + \hat{\beta}\sum x_i^2\right), \qquad \overline{V} = \frac{1}{\underline{V}^{-1} + \sum x_i^2}$$
Do these expressions make sense?
- $\underline{V}^{-1}$ is the precision of the prior
- $\sum x_i^2$ is the precision of the data
What happens when $\underline{V}^{-1} = 0$? When $\underline{V} \to 0$?

Example

Simple empirical example
Consider a sample of 50 observations generated from the model
$$y_i = \beta x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$
with $\beta = 2$, $\sigma^2 = 1$ and $x_i \sim N(0, 1)$.
Set the prior hyper-parameters in $\beta, h \sim NG\left(\underline{\beta}, \underline{V}, \underline{s}^{-2}, \underline{\nu}\right)$ to
$$\underline{\beta} = 1.5, \qquad \underline{V} = 0.25, \qquad \underline{s}^2 = 1, \qquad \underline{\nu} = 10$$
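
Putting the pieces together, a minimal sketch of this example; the data are simulated with an arbitrary seed, so the numbers will differ from those behind the figures:

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed

# Simulate the data as on the slide: N = 50, beta = 2, sigma^2 = 1, x_i ~ N(0, 1)
N, beta_true = 50, 2.0
x = rng.standard_normal(N)
y = beta_true * x + rng.standard_normal(N)

# Prior hyper-parameters from the slide
beta_prior, V_prior, s2_prior, nu_prior = 1.5, 0.25, 1.0, 10

# OLS summaries and Normal-Gamma posterior update (same formulas as above)
sum_x2 = np.sum(x ** 2)
beta_hat = np.sum(x * y) / sum_x2
nu = N - 1
s2 = np.sum((y - beta_hat * x) ** 2) / nu

V_post = 1.0 / (1.0 / V_prior + sum_x2)
beta_post = V_post * (beta_prior / V_prior + beta_hat * sum_x2)
nu_post = nu_prior + N
s2_post = (nu_prior * s2_prior + nu * s2
           + (beta_hat - beta_prior) ** 2 / (V_prior + 1.0 / sum_x2)) / nu_post

print(f"OLS beta_hat = {beta_hat:.3f}")
print(f"Posterior: beta_bar = {beta_post:.3f}, V_bar = {V_post:.4f}, "
      f"s2_bar = {s2_post:.3f}, nu_bar = {nu_post}")
```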

The data: [scatter plot of the simulated observations, y against x]

[Figure 2.1: marginal prior, likelihood and posterior for β (probability density against β)]

Change the prior mean: $\underline{\beta} = 2.5$, $\underline{V} = 0.25$, $\underline{s}^2 = 1$, $\underline{\nu} = 10$

[Figure: marginal prior, likelihood and posterior for β]

Change the prior precision: $\underline{\beta} = 1.5$, $\underline{V} = 0.5$, $\underline{s}^2 = 1$, $\underline{\nu} = 10$

[Figure: marginal prior, likelihood and posterior for β]

Change the prior mean of the variance: $\underline{\beta} = 1.5$, $\underline{V} = 0.25$, $\underline{s}^2 = 10$, $\underline{\nu} = 10$

[Figure: marginal prior, likelihood and posterior for β]

Change the prior degrees of freedom: $\underline{\beta} = 1.5$, $\underline{V} = 0.25$, $\underline{s}^2 = 1$, $\underline{\nu} = 100$

[Figure: marginal prior, likelihood and posterior for β]

More data: N=100

[Scatter plot of the simulated data, y against x, with N = 100]

[Figure: marginal prior, likelihood and posterior for β]

Even more data: N = 500

[Figure: marginal prior, likelihood and posterior for β]

A precise prior, but with $\underline{\beta}$ far from $\hat{\beta}$

[Figure: marginal prior, likelihood and posterior for β]

Noninformative prior
A non-informative prior sets $\underline{V}^{-1} = 0$ and $\underline{\nu} = 0$, which gives
$$\overline{V} = \frac{1}{\sum x_i^2}, \qquad \overline{\beta} = \hat{\beta}, \qquad \overline{\nu} = N, \qquad \overline{\nu}\,\overline{s}^2 = \nu s^2$$
which are the same as the OLS estimates. But this prior is so-called improper, i.e. the density $p(\beta, h)$ does not integrate to 1:
$$\int\!\!\int p(\beta, h)\, d\beta\, dh \neq 1$$
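
A quick numerical check of this limiting case (a sketch; the data are simulated with arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(1)        # arbitrary seed
N = 50
x = rng.standard_normal(N)
y = 2.0 * x + rng.standard_normal(N)  # illustrative DGP with beta = 2, sigma^2 = 1

sum_x2 = np.sum(x ** 2)
beta_hat = np.sum(x * y) / sum_x2
nu = N - 1
s2 = np.sum((y - beta_hat * x) ** 2) / nu

# Non-informative limit: prior precision 1/V -> 0 and prior degrees of freedom -> 0
V_post = 1.0 / sum_x2        # posterior variance factor = (sum x_i^2)^{-1}
beta_post = beta_hat         # posterior mean collapses to the OLS estimate
nu_post = N
s2_post = nu * s2 / nu_post  # since nu_bar * s2_bar = nu * s^2

print(beta_post, beta_hat)   # identical by construction
```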

Model comparison
We may have two models that could both potentially explain $y$,
$$y_i = \beta_j x_{ji} + \varepsilon_{ji}, \qquad \varepsilon_{ji} \sim N(0, h_j^{-1}), \qquad j = 1, 2$$
- For model comparison, it is important that the models explain the same data.
- Improper priors can be used only for parameters shared by both models.

Model comparison
We can formulate priors and compute posteriors just as before:
Prior: $\beta_j, h_j \sim NG\left(\underline{\beta}_j, \underline{V}_j, \underline{s}_j^{-2}, \underline{\nu}_j\right)$
Posterior: $\beta_j, h_j | y \sim NG\left(\overline{\beta}_j, \overline{V}_j, \overline{s}_j^{-2}, \overline{\nu}_j\right)$
Again, the (hyper) parameters of the posterior are a combination of the (hyper) parameters of the prior and sample information:
$$\overline{V}_j = \frac{1}{\underline{V}_j^{-1} + \sum x_{ji}^2}$$
$$\overline{\beta}_j = \overline{V}_j\left(\underline{V}_j^{-1}\underline{\beta}_j + \hat{\beta}_j\sum x_{ji}^2\right)$$
$$\overline{\nu}_j = \underline{\nu}_j + N$$
$$\overline{\nu}_j\,\overline{s}_j^2 = \underline{\nu}_j\,\underline{s}_j^2 + \nu_j s_j^2 + \frac{\left(\hat{\beta}_j - \underline{\beta}_j\right)^2}{\underline{V}_j + \left(\sum x_{ji}^2\right)^{-1}}$$

Posterior odds ratio
The posterior odds ratio is the relative probability of two models conditional on the data,
$$PO_{12} \equiv \frac{p(M_1|y)}{p(M_2|y)} = \frac{p(y|M_1)\,p(M_1)}{p(y|M_2)\,p(M_2)}$$
The prior model probabilities $p(M_j)$ are formulated before seeing the data. The marginal likelihood $p(y|M_j)$ is calculated as
$$p(y|M_j) = \int\!\!\int p(y|\beta_j, h_j)\,p(\beta_j, h_j)\, dh_j\, d\beta_j$$

The marginal likelihood for the Normal-Gamma model
The marginal likelihood for the Normal-Gamma model is given by
$$p(y|M_j) = c_j \left(\frac{\overline{V}_j}{\underline{V}_j}\right)^{1/2} \left(\overline{\nu}_j\,\overline{s}_j^2\right)^{-\overline{\nu}_j/2}$$
where
$$c_j = \frac{\Gamma\!\left(\frac{\overline{\nu}_j}{2}\right)\left(\underline{\nu}_j\,\underline{s}_j^2\right)^{\underline{\nu}_j/2}}{\Gamma\!\left(\frac{\underline{\nu}_j}{2}\right)\pi^{N/2}}$$
and $\Gamma(\cdot)$ is the Gamma function.
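
A sketch of this expression in code, computed on the log scale for numerical stability (the function name is my own; the hyper-parameters would come from the prior and from the posterior update sketched earlier):

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(N, V_prior, V_post, s2_prior, nu_prior, s2_post, nu_post):
    """log p(y | M_j) for the Normal-Gamma regression model, following the expression
    above: log c_j + 0.5*log(V_post/V_prior) - (nu_post/2)*log(nu_post*s2_post)."""
    log_c = (gammaln(nu_post / 2.0)
             + (nu_prior / 2.0) * np.log(nu_prior * s2_prior)
             - gammaln(nu_prior / 2.0)
             - (N / 2.0) * np.log(np.pi))
    return (log_c
            + 0.5 * (np.log(V_post) - np.log(V_prior))
            - (nu_post / 2.0) * np.log(nu_post * s2_post))
```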

Posterior odds ratio
The posterior odds ratio is then given by
$$PO_{12} = \frac{c_1 \left(\frac{\overline{V}_1}{\underline{V}_1}\right)^{1/2} \left(\overline{\nu}_1\,\overline{s}_1^2\right)^{-\overline{\nu}_1/2} p(M_1)}{c_2 \left(\frac{\overline{V}_2}{\underline{V}_2}\right)^{1/2} \left(\overline{\nu}_2\,\overline{s}_2^2\right)^{-\overline{\nu}_2/2} p(M_2)}$$
The posterior odds ratio rewards
- fitting the data, via the term $\overline{\nu}_j\,\overline{s}_j^2$, which contains the sum of squared errors $\nu_j s_j^2$
- coherence between the prior and the OLS estimate, also via $\overline{\nu}_j\,\overline{s}_j^2$, which contains $\left(\hat{\beta}_j - \underline{\beta}_j\right)^2$
When the models have different numbers of parameters, the posterior odds ratio also rewards parsimony, i.e. the model with fewer parameters.
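
A short sketch of evaluating the posterior odds ratio from two log marginal likelihoods; the numbers below are placeholders (e.g. output of the log_marginal_likelihood sketch above), with equal prior model probabilities assumed:

```python
import numpy as np

# Hypothetical log marginal likelihoods for models M_1 and M_2
log_ml_1, log_ml_2 = -75.2, -78.9
p_m1, p_m2 = 0.5, 0.5                     # equal prior model probabilities

log_po_12 = (log_ml_1 + np.log(p_m1)) - (log_ml_2 + np.log(p_m2))
po_12 = np.exp(log_po_12)                 # posterior odds of M_1 versus M_2
post_prob_m1 = po_12 / (1.0 + po_12)      # posterior probability of M_1
print(f"PO_12 = {po_12:.2f}, p(M_1|y) = {post_prob_m1:.3f}")
```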

That's it for today.