Bayesian Inference. Chapter 9. Linear models and regression


Bayesian Inference
Chapter 9. Linear models and regression
M. Concepcion Ausin
Universidad Carlos III de Madrid
Master in Business Administration and Quantitative Methods
Master in Mathematical Engineering

Chapter 9. Linear models and regression

Objective
Illustrate the Bayesian approach to fitting normal and generalized linear models.

Recommended reading
Lindley, D.V. and Smith, A.F.M. (1972). Bayes estimates for the linear model (with discussion), Journal of the Royal Statistical Society B, 34, 1-41.
Broemeling, L.D. (1985). Bayesian Analysis of Linear Models, Marcel Dekker.
Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. (2003). Bayesian Data Analysis, Chapter 8.

Chapter 9. Linear models and regression

AFM Smith developed some of the central ideas in the theory and practice of modern Bayesian statistics.

Objective
To illustrate the Bayesian approach to fitting normal and generalized linear models.

Contents

0. Introduction
1. The multivariate normal distribution
   1.1. Conjugate Bayesian inference when the variance-covariance matrix is known up to a constant
   1.2. Conjugate Bayesian inference when the variance-covariance matrix is unknown
2. Normal linear models
   2.1. Conjugate Bayesian inference for normal linear models
   2.2. Example 1: ANOVA model
   2.3. Example 2: Simple linear regression model
3. Generalized linear models

The multivariate normal distribution

Firstly, we review the definition and properties of the multivariate normal distribution.

Definition. A random variable $X = (X_1, \dots, X_k)^T$ is said to have a multivariate normal distribution with mean $\mu$ and variance-covariance matrix $\Sigma$ if:
$$ f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right), \quad \text{for } x \in \mathbb{R}^k. $$
In this case, we write $X \mid \mu, \Sigma \sim \mathcal{N}(\mu, \Sigma)$.

The multivariate normal distribution

The following properties of the multivariate normal distribution are well known:
- Any subset of $X$ has a (multivariate) normal distribution.
- Any linear combination $\sum_{i=1}^k \alpha_i X_i$ is normally distributed.
- If $Y = a + BX$ is a linear transformation of $X$, then $Y \mid \mu, \Sigma \sim \mathcal{N}(a + B\mu,\; B\Sigma B^T)$.
- If
$$ X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \Bigm| \mu, \Sigma \sim \mathcal{N}\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right), $$
then the conditional density of $X_1$ given $X_2 = x_2$ is:
$$ X_1 \mid x_2, \mu, \Sigma \sim \mathcal{N}\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right). $$
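As a quick numerical illustration of the conditioning formula above, the following Python sketch (using numpy; the function and variable names are illustrative, not from the source) computes the conditional mean and covariance of a sub-vector of a multivariate normal.

```python
import numpy as np

def mvn_conditional(mu, Sigma, idx1, idx2, x2):
    """Parameters of X_1 | X_2 = x2 for a partitioned multivariate normal.

    mu, Sigma : mean vector and covariance matrix of the full vector X.
    idx1, idx2: index lists defining the partition (X_1, X_2).
    x2        : observed value of X_2.
    Returns the conditional mean and covariance of X_1 given X_2 = x2.
    """
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S21 = Sigma[np.ix_(idx2, idx1)]
    S22_inv = np.linalg.inv(Sigma[np.ix_(idx2, idx2)])
    cond_mean = mu1 + S12 @ S22_inv @ (x2 - mu2)
    cond_cov = S11 - S12 @ S22_inv @ S21
    return cond_mean, cond_cov

# Example: trivariate normal, conditioning on the third coordinate
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m, C = mvn_conditional(mu, Sigma, [0, 1], [2], np.array([2.5]))
print(m, C)
```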

The multivariate normal distribution

The likelihood function given a sample $\mathbf{x} = (x_1, \dots, x_n)$ of data from $\mathcal{N}(\mu, \Sigma)$ is:
$$ l(\mu, \Sigma \mid \mathbf{x}) = \frac{1}{(2\pi)^{nk/2} |\Sigma|^{n/2}} \exp\left(-\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1}(x_i - \mu)\right) $$
$$ \propto \frac{1}{|\Sigma|^{n/2}} \exp\left(-\frac{1}{2}\left[\sum_{i=1}^n (x_i - \bar{x})^T \Sigma^{-1}(x_i - \bar{x}) + n(\mu - \bar{x})^T \Sigma^{-1}(\mu - \bar{x})\right]\right) $$
$$ \propto \frac{1}{|\Sigma|^{n/2}} \exp\left(-\frac{1}{2}\left[\operatorname{tr}\left(S\Sigma^{-1}\right) + n(\mu - \bar{x})^T \Sigma^{-1}(\mu - \bar{x})\right]\right), $$
where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$, $S = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$, and $\operatorname{tr}(M)$ represents the trace of the matrix $M$.

It is possible to carry out Bayesian inference with conjugate priors for $\mu$ and $\Sigma$. We shall consider two cases which reflect different levels of knowledge about the variance-covariance matrix, $\Sigma$.

Conjugate Bayesian inference when $\Sigma = \frac{1}{\phi}C$

Firstly, consider the case where the variance-covariance matrix is known up to a constant, i.e. $\Sigma = \frac{1}{\phi}C$, where $C$ is a known matrix. Then we have $X \mid \mu, \phi \sim \mathcal{N}\left(\mu, \frac{1}{\phi}C\right)$ and the likelihood function is
$$ l(\mu, \phi \mid \mathbf{x}) \propto \phi^{nk/2} \exp\left(-\frac{\phi}{2}\left[\operatorname{tr}\left(SC^{-1}\right) + n(\mu - \bar{x})^T C^{-1}(\mu - \bar{x})\right]\right). $$
Analogous to the univariate case, it can be seen that a multivariate normal-gamma prior distribution is conjugate.

Conjugate Bayesian inference when $\Sigma = \frac{1}{\phi}C$

Definition. We say that $(\mu, \phi)$ have a multivariate normal-gamma prior with parameters $\left(m, V^{-1}, \frac{a}{2}, \frac{b}{2}\right)$ if
$$ \mu \mid \phi \sim \mathcal{N}\left(m, \frac{1}{\phi}V\right), \qquad \phi \sim \mathcal{G}\left(\frac{a}{2}, \frac{b}{2}\right). $$
In this case, we write $(\mu, \phi) \sim \mathcal{NG}\left(m, V^{-1}, \frac{a}{2}, \frac{b}{2}\right)$.

Analogous to the univariate case, the marginal distribution of $\mu$ is a multivariate, non-central t distribution.

Conjugate Bayesian inference when $\Sigma = \frac{1}{\phi}C$

Definition. A ($k$-dimensional) random variable $T = (T_1, \dots, T_k)$ has a multivariate t distribution with parameters $(d, \mu_T, \Sigma_T)$ if:
$$ f(t) = \frac{\Gamma\left(\frac{d+k}{2}\right)}{(\pi d)^{k/2}\, |\Sigma_T|^{1/2}\, \Gamma\left(\frac{d}{2}\right)} \left(1 + \frac{1}{d}(t - \mu_T)^T \Sigma_T^{-1}(t - \mu_T)\right)^{-\frac{d+k}{2}}. $$
In this case, we write $T \sim \mathcal{T}(\mu_T, \Sigma_T, d)$.

Theorem. Let $(\mu, \phi) \sim \mathcal{NG}\left(m, V^{-1}, \frac{a}{2}, \frac{b}{2}\right)$. Then the marginal density of $\mu$ is:
$$ \mu \sim \mathcal{T}\left(m, \frac{b}{a}V, a\right). $$

Conjugate Bayesian inference when $\Sigma = \frac{1}{\phi}C$

Theorem. Let $X \mid \mu, \phi \sim \mathcal{N}\left(\mu, \frac{1}{\phi}C\right)$ and assume a priori $(\mu, \phi) \sim \mathcal{NG}\left(m, V^{-1}, \frac{a}{2}, \frac{b}{2}\right)$. Then, given sample data $\mathbf{x}$, we have that
$$ \mu \mid \mathbf{x}, \phi \sim \mathcal{N}\left(m^*, \frac{1}{\phi}V^*\right), \qquad \phi \mid \mathbf{x} \sim \mathcal{G}\left(\frac{a^*}{2}, \frac{b^*}{2}\right), $$
where:
$$ V^* = \left(V^{-1} + nC^{-1}\right)^{-1}, \qquad m^* = V^*\left(V^{-1}m + nC^{-1}\bar{x}\right), $$
$$ a^* = a + nk, \qquad b^* = b + \operatorname{tr}\left(SC^{-1}\right) + m^T V^{-1}m + n\,\bar{x}^T C^{-1}\bar{x} - m^{*T} V^{*-1} m^*. $$
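The update is simple to code. Below is a minimal Python sketch of these formulas, assuming the $\mathcal{NG}(m, V^{-1}, a/2, b/2)$ parameterization above; the function name and arguments are illustrative, not from the source.

```python
import numpy as np

def ng_posterior(x, C, m, Vinv, a, b):
    """Posterior parameters of the multivariate normal-gamma prior
    NG(m, V^{-1}, a/2, b/2) for data x_1,...,x_n ~ N(mu, (1/phi) C).

    x : (n, k) data array; C : (k, k) known matrix; m, Vinv, a, b : prior parameters.
    Returns (m_star, Vinv_star, a_star, b_star), following the theorem above.
    """
    n, k = x.shape
    xbar = x.mean(axis=0)
    R = x - xbar
    S = R.T @ R                      # sum-of-squares matrix
    Cinv = np.linalg.inv(C)
    Vinv_star = Vinv + n * Cinv
    V_star = np.linalg.inv(Vinv_star)
    m_star = V_star @ (Vinv @ m + n * Cinv @ xbar)
    a_star = a + n * k
    b_star = (b + np.trace(S @ Cinv) + m @ Vinv @ m
              + n * xbar @ Cinv @ xbar - m_star @ Vinv_star @ m_star)
    return m_star, Vinv_star, a_star, b_star

# Example with simulated data
rng = np.random.default_rng(0)
x = rng.multivariate_normal([1.0, -1.0], np.eye(2), size=50)
print(ng_posterior(x, np.eye(2), np.zeros(2), np.eye(2), a=1.0, b=1.0))
```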

Conjugate Bayesian inference when $\Sigma = \frac{1}{\phi}C$

Theorem. Given the reference prior $p(\mu, \phi) \propto \frac{1}{\phi}$, the posterior distribution is
$$ p(\mu, \phi \mid \mathbf{x}) \propto \phi^{\frac{nk}{2}-1} \exp\left(-\frac{\phi}{2}\left[\operatorname{tr}\left(SC^{-1}\right) + n(\mu - \bar{x})^T C^{-1}(\mu - \bar{x})\right]\right), $$
that is, $(\mu, \phi) \mid \mathbf{x} \sim \mathcal{NG}\left(\bar{x}, nC^{-1}, \frac{(n-1)k}{2}, \frac{\operatorname{tr}(SC^{-1})}{2}\right)$, which implies that
$$ \mu \mid \mathbf{x}, \phi \sim \mathcal{N}\left(\bar{x}, \frac{1}{n\phi}C\right), \qquad \phi \mid \mathbf{x} \sim \mathcal{G}\left(\frac{(n-1)k}{2}, \frac{\operatorname{tr}\left(SC^{-1}\right)}{2}\right), $$
$$ \mu \mid \mathbf{x} \sim \mathcal{T}\left(\bar{x}, \frac{\operatorname{tr}\left(SC^{-1}\right)}{n(n-1)k}\,C,\; (n-1)k\right). $$

Conjugate Bayesian inference when $\Sigma$ is unknown

In this case, it is useful to reparameterize the normal distribution in terms of the precision matrix $\Phi = \Sigma^{-1}$. Then the normal likelihood function becomes
$$ l(\mu, \Phi) \propto |\Phi|^{n/2} \exp\left(-\frac{1}{2}\left[\operatorname{tr}(S\Phi) + n(\mu - \bar{x})^T \Phi (\mu - \bar{x})\right]\right). $$
It is clear that a conjugate prior for $\mu$ and $\Phi$ must take a similar form to the likelihood. This is a normal-Wishart distribution.

Conjugate Bayesian inference when $\Sigma$ is unknown

Definition. A $k \times k$ dimensional symmetric, positive definite random variable $W$ is said to have a Wishart distribution with parameters $d$ and $V$ if
$$ f(W) = \frac{|W|^{\frac{d-k-1}{2}}}{2^{\frac{dk}{2}}\, |V|^{\frac{d}{2}}\, \pi^{\frac{k(k-1)}{4}} \prod_{i=1}^k \Gamma\left(\frac{d+1-i}{2}\right)} \exp\left(-\frac{1}{2}\operatorname{tr}\left(V^{-1}W\right)\right), $$
where $d > k - 1$. In this case, $E[W] = dV$ and we write $W \sim \mathcal{W}(d, V)$.

If $W \sim \mathcal{W}(d, V)$, then the distribution of $W^{-1}$ is said to be an inverse Wishart distribution, $W^{-1} \sim \mathcal{IW}\left(d, V^{-1}\right)$, with mean $E\left[W^{-1}\right] = \frac{1}{d-k-1}V^{-1}$.

Conjugate Bayesian inference when $\Sigma$ is unknown

Theorem. Suppose that $X \mid \mu, \Phi \sim \mathcal{N}\left(\mu, \Phi^{-1}\right)$ and let $\mu \mid \Phi \sim \mathcal{N}\left(m, \frac{1}{\alpha}\Phi^{-1}\right)$ and $\Phi \sim \mathcal{W}(d, W)$. Then:
$$ \mu \mid \Phi, \mathbf{x} \sim \mathcal{N}\left(\frac{\alpha m + n\bar{x}}{\alpha + n}, \frac{1}{\alpha + n}\Phi^{-1}\right), $$
$$ \Phi \mid \mathbf{x} \sim \mathcal{W}\left(d + n, \left(W^{-1} + S + \frac{\alpha n}{\alpha + n}(m - \bar{x})(m - \bar{x})^T\right)^{-1}\right). $$

Theorem. Given the limiting prior $p(\Phi) \propto |\Phi|^{-\frac{k+1}{2}}$, the posterior distribution is:
$$ \mu \mid \Phi, \mathbf{x} \sim \mathcal{N}\left(\bar{x}, \frac{1}{n}\Phi^{-1}\right), \qquad \Phi \mid \mathbf{x} \sim \mathcal{W}\left(n - 1, S^{-1}\right). $$

Conjugate Bayesian inference when $\Sigma$ is unknown

The conjugacy assumption that the prior precision of $\mu$ is proportional to the model precision $\Phi$ is very strong in many cases. Often, we may simply wish to use a prior distribution of the form $\mu \sim \mathcal{N}(m, V)$, where $m$ and $V$ are known, and a Wishart prior for $\Phi$, say $\Phi \sim \mathcal{W}(d, W)$, as earlier. In this case, the conditional posterior distributions are:
$$ \mu \mid \Phi, \mathbf{x} \sim \mathcal{N}\left(\left(V^{-1} + n\Phi\right)^{-1}\left(V^{-1}m + n\Phi\bar{x}\right),\; \left(V^{-1} + n\Phi\right)^{-1}\right), $$
$$ \Phi \mid \mu, \mathbf{x} \sim \mathcal{W}\left(d + n, \left(W^{-1} + S + n(\mu - \bar{x})(\mu - \bar{x})^T\right)^{-1}\right), $$
and therefore it is straightforward to set up a Gibbs sampling algorithm to sample the joint posterior, as in the univariate case.
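A minimal sketch of this Gibbs sampler in Python is given below, assuming scipy is available for the Wishart draws; the function and parameter names are illustrative, and no burn-in or convergence checking is attempted.

```python
import numpy as np
from scipy.stats import wishart

def gibbs_mvn(x, m, V, d, W, n_iter=5000, seed=0):
    """Gibbs sampler for the semi-conjugate prior mu ~ N(m, V), Phi ~ Wishart(d, W),
    with data x_1,...,x_n ~ N(mu, Phi^{-1}).  Alternates the two conditional
    posteriors stated above.  Returns the sampled (mu, Phi) pairs."""
    rng = np.random.default_rng(seed)
    n, k = x.shape
    xbar = x.mean(axis=0)
    R = x - xbar
    S = R.T @ R
    Vinv = np.linalg.inv(V)
    Winv = np.linalg.inv(W)
    mu = xbar.copy()
    mus, phis = [], []
    for _ in range(n_iter):
        # Phi | mu, x ~ W(d + n, (W^{-1} + S + n (mu - xbar)(mu - xbar)^T)^{-1})
        scale = np.linalg.inv(Winv + S + n * np.outer(mu - xbar, mu - xbar))
        Phi = wishart.rvs(df=d + n, scale=scale, random_state=rng)
        # mu | Phi, x ~ N((V^{-1} + n Phi)^{-1}(V^{-1} m + n Phi xbar), (V^{-1} + n Phi)^{-1})
        cov = np.linalg.inv(Vinv + n * Phi)
        mean = cov @ (Vinv @ m + n * Phi @ xbar)
        mu = rng.multivariate_normal(mean, cov)
        mus.append(mu)
        phis.append(Phi)
    return np.array(mus), np.array(phis)
```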

Normal linear models

Definition. A normal linear model is of the form
$$ \mathbf{y} = X\theta + \epsilon, $$
where $\mathbf{y} = (y_1, \dots, y_n)^T$ are the observed data, $X$ is a given design matrix of size $n \times k$, and $\theta = (\theta_1, \dots, \theta_k)^T$ is the set of model parameters. Finally, it is usually assumed that
$$ \epsilon \sim \mathcal{N}\left(\mathbf{0}_n, \frac{1}{\phi}I_n\right). $$

Normal linear models

A simple example of a normal linear model is the simple linear regression model, where
$$ X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}^T \quad \text{and} \quad \theta = (\alpha, \beta)^T. $$
It is easy to see that there is a conjugate, multivariate normal-gamma prior distribution for any normal linear model.

Conjugate Bayesian inference

Theorem. Consider a normal linear model, $\mathbf{y} = X\theta + \epsilon$, and assume a normal-gamma prior distribution, $(\theta, \phi) \sim \mathcal{NG}\left(m, V^{-1}, \frac{a}{2}, \frac{b}{2}\right)$. Then the posterior distribution given $\mathbf{y}$ is also normal-gamma:
$$ \theta, \phi \mid \mathbf{y} \sim \mathcal{NG}\left(m^*, V^{*-1}, \frac{a^*}{2}, \frac{b^*}{2}\right), $$
where:
$$ m^* = \left(X^TX + V^{-1}\right)^{-1}\left(X^T\mathbf{y} + V^{-1}m\right), \qquad V^* = \left(X^TX + V^{-1}\right)^{-1}, $$
$$ a^* = a + n, \qquad b^* = b + \mathbf{y}^T\mathbf{y} + m^TV^{-1}m - m^{*T}V^{*-1}m^*. $$
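For illustration, a short Python sketch of these updating formulas (all names are illustrative, not from the source) could look like:

```python
import numpy as np

def nlm_posterior(y, X, m, Vinv, a, b):
    """Normal-gamma posterior NG(m*, V*^{-1}, a*/2, b*/2) for the normal linear model
    y = X theta + eps with prior (theta, phi) ~ NG(m, V^{-1}, a/2, b/2).
    A sketch of the updating formulas in the theorem above."""
    n = len(y)
    Vinv_star = X.T @ X + Vinv
    V_star = np.linalg.inv(Vinv_star)
    m_star = V_star @ (X.T @ y + Vinv @ m)
    a_star = a + n
    b_star = b + y @ y + m @ Vinv @ m - m_star @ Vinv_star @ m_star
    return m_star, Vinv_star, a_star, b_star

# Example with simulated data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=30)
print(nlm_posterior(y, X, m=np.zeros(2), Vinv=0.01 * np.eye(2), a=1.0, b=1.0))
```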

Conjugate Bayesian inference

Given a normal linear model, $\mathbf{y} = X\theta + \epsilon$, and assuming a normal-gamma prior distribution, $(\theta, \phi) \sim \mathcal{NG}\left(m, V^{-1}, \frac{a}{2}, \frac{b}{2}\right)$, it is easy to see that the predictive distribution of $\mathbf{y}$ is:
$$ \mathbf{y} \sim \mathcal{T}\left(Xm, \frac{b}{a}\left(XVX^T + I\right), a\right). $$
To see this, note that the distribution of $\mathbf{y} \mid \phi$ is
$$ \mathbf{y} \mid \phi \sim \mathcal{N}\left(Xm, \frac{1}{\phi}\left(XVX^T + I\right)\right), $$
so the joint distribution of $\mathbf{y}$ and $\phi$ is multivariate normal-gamma,
$$ (\mathbf{y}, \phi) \sim \mathcal{NG}\left(Xm, \left(XVX^T + I\right)^{-1}, \frac{a}{2}, \frac{b}{2}\right), $$
and therefore the marginal of $\mathbf{y}$ is the multivariate t-distribution obtained before.
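A sketch of how one might compute the parameters of this predictive t-distribution (assuming numpy; the function name and arguments are illustrative) is:

```python
import numpy as np

def nlm_predictive(X_new, m, V, a, b):
    """Parameters (location, scale matrix, degrees of freedom) of the
    multivariate-t predictive distribution y_new ~ T(X_new m, (b/a)(X_new V X_new^T + I), a)
    for a normal linear model with (theta, phi) ~ NG(m, V^{-1}, a/2, b/2).
    To predict after observing data, plug in the posterior parameters m*, V*, a*, b*."""
    loc = X_new @ m
    scale = (b / a) * (X_new @ V @ X_new.T + np.eye(X_new.shape[0]))
    return loc, scale, a
```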

Interpretation of the posterior mean

We have that:
$$ E[\theta \mid \mathbf{y}] = \left(X^TX + V^{-1}\right)^{-1}\left(X^T\mathbf{y} + V^{-1}m\right) $$
$$ = \left(X^TX + V^{-1}\right)^{-1}\left(X^TX\left(X^TX\right)^{-1}X^T\mathbf{y} + V^{-1}m\right) $$
$$ = \left(X^TX + V^{-1}\right)^{-1}\left(X^TX\,\hat{\theta} + V^{-1}m\right), $$
where $\hat{\theta} = \left(X^TX\right)^{-1}X^T\mathbf{y}$ is the maximum likelihood estimator. Thus, this expression may be interpreted as a weighted average of the prior estimator, $m$, and the MLE, $\hat{\theta}$, with weights proportional to precisions: conditional on $\phi$, the prior variance is $\frac{1}{\phi}V$, and the distribution of the MLE from the classical viewpoint is
$$ \hat{\theta} \mid \phi \sim \mathcal{N}\left(\theta, \frac{1}{\phi}\left(X^TX\right)^{-1}\right). $$

Conjugate Bayesian inference

Theorem. Consider a normal linear model, $\mathbf{y} = X\theta + \epsilon$, and assume the limiting prior distribution $p(\theta, \phi) \propto \frac{1}{\phi}$. Then we have that
$$ \theta \mid \mathbf{y}, \phi \sim \mathcal{N}\left(\hat{\theta}, \frac{1}{\phi}\left(X^TX\right)^{-1}\right), \qquad \phi \mid \mathbf{y} \sim \mathcal{G}\left(\frac{n-k}{2}, \frac{\mathbf{y}^T\mathbf{y} - \hat{\theta}^T\left(X^TX\right)\hat{\theta}}{2}\right), $$
and then
$$ \theta \mid \mathbf{y} \sim \mathcal{T}\left(\hat{\theta}, \hat{\sigma}^2\left(X^TX\right)^{-1}, n - k\right), \quad \text{where} \quad \hat{\sigma}^2 = \frac{\mathbf{y}^T\mathbf{y} - \hat{\theta}^T\left(X^TX\right)\hat{\theta}}{n - k}. $$

Conjugate Bayesian inference

Note that $\hat{\sigma}^2 = \frac{\mathbf{y}^T\mathbf{y} - \hat{\theta}^T(X^TX)\hat{\theta}}{n-k}$ is the usual classical estimator of $\sigma^2 = \frac{1}{\phi}$. In this case, Bayesian credible intervals, estimators, etc. will coincide with their classical counterparts. One should note, however, that the propriety of the posterior distribution in this case relies on two conditions:
1. $n > k$.
2. $X^TX$ is of full rank.
If either of these two conditions is not satisfied, then the posterior distribution will be improper.
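Under these conditions, a sketch of the marginal credible intervals for $\theta$ under the limiting prior, which coincide with the classical OLS confidence intervals, might look like the following (Python with scipy; names are illustrative):

```python
import numpy as np
from scipy import stats

def reference_posterior_intervals(y, X, level=0.95):
    """Marginal credible intervals for theta under p(theta, phi) ∝ 1/phi,
    using theta | y ~ T(theta_hat, sigma2_hat (X^T X)^{-1}, n - k).
    Returns one (lower, upper) row per coefficient."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    theta_hat = XtX_inv @ X.T @ y
    resid = y - X @ theta_hat
    sigma2_hat = resid @ resid / (n - k)
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
    tq = stats.t.ppf(0.5 + level / 2, df=n - k)
    return np.column_stack([theta_hat - tq * se, theta_hat + tq * se])
```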

ANOVA model

The ANOVA model is an example of a normal linear model where
$$ y_{ij} = \theta_i + \epsilon_{ij}, \qquad \epsilon_{ij} \sim \mathcal{N}\left(0, \frac{1}{\phi}\right), $$
for $i = 1, \dots, k$ and $j = 1, \dots, n_i$. Thus, the parameters are $\theta = (\theta_1, \dots, \theta_k)^T$, the observed data are $\mathbf{y} = (y_{11}, \dots, y_{1n_1}, y_{21}, \dots, y_{2n_2}, \dots, y_{k1}, \dots, y_{kn_k})^T$, and the design matrix is:
$$ X = \begin{pmatrix} \mathbf{1}_{n_1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{1}_{n_2} & \cdots & \mathbf{0} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{1}_{n_k} \end{pmatrix}. $$

ANOVA model

If we use conditionally independent normal priors, $\theta_i \sim \mathcal{N}\left(m_i, \frac{1}{\alpha_i\phi}\right)$, for $i = 1, \dots, k$, and a gamma prior $\phi \sim \mathcal{G}\left(\frac{a}{2}, \frac{b}{2}\right)$, then we have
$$ (\theta, \phi) \sim \mathcal{NG}\left(m, V^{-1}, \frac{a}{2}, \frac{b}{2}\right), $$
where $m = (m_1, \dots, m_k)^T$ and $V = \operatorname{diag}\left(\frac{1}{\alpha_1}, \dots, \frac{1}{\alpha_k}\right)$. Noting that
$$ X^TX + V^{-1} = \operatorname{diag}\left(n_1 + \alpha_1, \dots, n_k + \alpha_k\right), $$
we can obtain the following posterior distributions.

ANOVA model

It is obtained that
$$ \theta \mid \mathbf{y}, \phi \sim \mathcal{N}\left(\begin{pmatrix} \frac{n_1\bar{y}_1 + \alpha_1 m_1}{n_1 + \alpha_1} \\ \vdots \\ \frac{n_k\bar{y}_k + \alpha_k m_k}{n_k + \alpha_k} \end{pmatrix}, \frac{1}{\phi}\operatorname{diag}\left(\frac{1}{\alpha_1 + n_1}, \dots, \frac{1}{\alpha_k + n_k}\right)\right) $$
and
$$ \phi \mid \mathbf{y} \sim \mathcal{G}\left(\frac{a + n}{2}, \frac{1}{2}\left[b + \sum_{i=1}^k\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 + \sum_{i=1}^k \frac{n_i\alpha_i}{n_i + \alpha_i}\left(\bar{y}_i - m_i\right)^2\right]\right). $$

ANOVA model

Often we are interested in the differences in group means, e.g. $\theta_1 - \theta_2$. Here we have
$$ \theta_1 - \theta_2 \mid \mathbf{y}, \phi \sim \mathcal{N}\left(\frac{n_1\bar{y}_1 + \alpha_1 m_1}{n_1 + \alpha_1} - \frac{n_2\bar{y}_2 + \alpha_2 m_2}{n_2 + \alpha_2},\; \frac{1}{\phi}\left(\frac{1}{\alpha_1 + n_1} + \frac{1}{\alpha_2 + n_2}\right)\right), $$
and therefore a posterior 95% interval for $\theta_1 - \theta_2$ is given by
$$ \frac{n_1\bar{y}_1 + \alpha_1 m_1}{n_1 + \alpha_1} - \frac{n_2\bar{y}_2 + \alpha_2 m_2}{n_2 + \alpha_2} \pm \sqrt{\left(\frac{1}{\alpha_1 + n_1} + \frac{1}{\alpha_2 + n_2}\right)\frac{b + \sum_{i=1}^k\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2 + \sum_{i=1}^k \frac{n_i\alpha_i}{n_i + \alpha_i}\left(\bar{y}_i - m_i\right)^2}{a + n}}\; t_{a+n}(0.975). $$
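A hedged Python sketch of this interval computation (group data supplied as a list of arrays; all names and the function signature are illustrative) is:

```python
import numpy as np
from scipy import stats

def anova_diff_interval(y_groups, m, alpha, a, b, level=0.95):
    """Posterior interval for theta_1 - theta_2 in the ANOVA model with
    theta_i ~ N(m_i, 1/(alpha_i phi)) priors and phi ~ G(a/2, b/2).

    y_groups : list of k 1-d arrays (one per group); m, alpha : length-k arrays."""
    nbar = np.array([len(g) for g in y_groups])
    ybar = np.array([g.mean() for g in y_groups])
    n = nbar.sum()
    post_mean = (nbar * ybar + alpha * m) / (nbar + alpha)   # posterior group means
    within_ss = sum(((g - g.mean()) ** 2).sum() for g in y_groups)
    shrink = (nbar * alpha / (nbar + alpha) * (ybar - m) ** 2).sum()
    b_star = b + within_ss + shrink
    a_star = a + n
    diff = post_mean[0] - post_mean[1]
    scale = np.sqrt((1 / (alpha[0] + nbar[0]) + 1 / (alpha[1] + nbar[1])) * b_star / a_star)
    tq = stats.t.ppf(0.5 + level / 2, df=a_star)
    return diff - tq * scale, diff + tq * scale
```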

ANOVA model

If, alternatively, we assume the reference prior $p(\theta, \phi) \propto \frac{1}{\phi}$, we have:
$$ \theta \mid \mathbf{y}, \phi \sim \mathcal{N}\left(\begin{pmatrix} \bar{y}_1 \\ \vdots \\ \bar{y}_k \end{pmatrix}, \frac{1}{\phi}\operatorname{diag}\left(\frac{1}{n_1}, \dots, \frac{1}{n_k}\right)\right), \qquad \phi \mid \mathbf{y} \sim \mathcal{G}\left(\frac{n-k}{2}, \frac{(n-k)\hat{\sigma}^2}{2}\right), $$
where $\hat{\sigma}^2 = \frac{1}{n-k}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(y_{ij} - \bar{y}_i\right)^2$ is the classical variance estimate for this problem. A 95% posterior interval for $\theta_1 - \theta_2$ is given by
$$ \bar{y}_1 - \bar{y}_2 \pm \hat{\sigma}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\; t_{n-k}(0.975), $$
which is equal to the usual classical interval.

Simple linear regression model

Consider the simple regression model
$$ y_i = \alpha + \beta x_i + \epsilon_i, $$
for $i = 1, \dots, n$, where $\epsilon_i \sim \mathcal{N}\left(0, \frac{1}{\phi}\right)$, and suppose that we use the limiting prior
$$ p(\alpha, \beta, \phi) \propto \frac{1}{\phi}. $$

Simple linear regression model

Then we have that
$$ \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \Bigm| \mathbf{y}, \phi \sim \mathcal{N}\left(\begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix}, \frac{1}{\phi\, n s_x}\begin{pmatrix} \sum_{i=1}^n x_i^2 & -n\bar{x} \\ -n\bar{x} & n \end{pmatrix}\right), \qquad \phi \mid \mathbf{y} \sim \mathcal{G}\left(\frac{n-2}{2}, \frac{s_y\left(1 - r^2\right)}{2}\right), $$
where
$$ \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}, \quad \hat{\beta} = \frac{s_{xy}}{s_x}, \quad s_x = \sum_{i=1}^n\left(x_i - \bar{x}\right)^2, \quad s_y = \sum_{i=1}^n\left(y_i - \bar{y}\right)^2, $$
$$ s_{xy} = \sum_{i=1}^n\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right), \quad r = \frac{s_{xy}}{\sqrt{s_x s_y}}, \quad \hat{\sigma}^2 = \frac{s_y\left(1 - r^2\right)}{n - 2}. $$

Simple linear regression model

Thus, the marginal distributions of $\alpha$ and $\beta$ are:
$$ \alpha \mid \mathbf{y} \sim \mathcal{T}\left(\hat{\alpha}, \frac{\hat{\sigma}^2\sum_{i=1}^n x_i^2}{n s_x}, n - 2\right), \qquad \beta \mid \mathbf{y} \sim \mathcal{T}\left(\hat{\beta}, \frac{\hat{\sigma}^2}{s_x}, n - 2\right), $$
and therefore, for example, a 95% credible interval for $\beta$ is given by
$$ \hat{\beta} \pm \frac{\hat{\sigma}}{\sqrt{s_x}}\, t_{n-2}(0.975), $$
equal to the usual classical interval.
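For example, the credible interval for $\beta$ could be computed along these lines (a sketch assuming numpy and scipy; names are illustrative):

```python
import numpy as np
from scipy import stats

def beta_credible_interval(x, y, level=0.95):
    """Posterior credible interval for the slope beta under the limiting prior
    p(alpha, beta, phi) ∝ 1/phi; equal to the classical confidence interval."""
    n = len(y)
    xbar, ybar = x.mean(), y.mean()
    s_x = ((x - xbar) ** 2).sum()
    s_y = ((y - ybar) ** 2).sum()
    s_xy = ((x - xbar) * (y - ybar)).sum()
    beta_hat = s_xy / s_x
    r2 = s_xy ** 2 / (s_x * s_y)
    sigma2_hat = s_y * (1 - r2) / (n - 2)
    se = np.sqrt(sigma2_hat / s_x)
    tq = stats.t.ppf(0.5 + level / 2, df=n - 2)
    return beta_hat - tq * se, beta_hat + tq * se
```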

Simple linear regression model

Suppose now that we wish to predict a future observation
$$ y_{new} = \alpha + \beta x_{new} + \epsilon_{new}. $$
Note that
$$ E\left[y_{new} \mid \phi, \mathbf{y}\right] = \hat{\alpha} + \hat{\beta}x_{new}, $$
$$ V\left[y_{new} \mid \phi, \mathbf{y}\right] = \frac{1}{\phi\, n s_x}\left(\sum_{i=1}^n x_i^2 + n x_{new}^2 - 2n\bar{x}x_{new}\right) + \frac{1}{\phi} = \frac{1}{\phi\, n s_x}\left(s_x + n\bar{x}^2 + n x_{new}^2 - 2n\bar{x}x_{new}\right) + \frac{1}{\phi}. $$
Therefore,
$$ y_{new} \mid \phi, \mathbf{y} \sim \mathcal{N}\left(\hat{\alpha} + \hat{\beta}x_{new}, \frac{1}{\phi}\left(\frac{\left(\bar{x} - x_{new}\right)^2}{s_x} + \frac{1}{n} + 1\right)\right). $$

Simple linear regression model

And then
$$ y_{new} \mid \mathbf{y} \sim \mathcal{T}\left(\hat{\alpha} + \hat{\beta}x_{new}, \hat{\sigma}^2\left(\frac{\left(\bar{x} - x_{new}\right)^2}{s_x} + \frac{1}{n} + 1\right), n - 2\right), $$
leading to the following 95% credible interval for $y_{new}$:
$$ \hat{\alpha} + \hat{\beta}x_{new} \pm \hat{\sigma}\sqrt{\frac{\left(\bar{x} - x_{new}\right)^2}{s_x} + \frac{1}{n} + 1}\; t_{n-2}(0.975), $$
which coincides with the usual classical interval.
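A corresponding sketch for the predictive interval at a new covariate value $x_{new}$ (again, names are illustrative):

```python
import numpy as np
from scipy import stats

def predictive_interval(x, y, x_new, level=0.95):
    """Posterior predictive interval for a new observation at x_new in the
    simple regression model with the limiting prior p(alpha, beta, phi) ∝ 1/phi."""
    n = len(y)
    xbar, ybar = x.mean(), y.mean()
    s_x = ((x - xbar) ** 2).sum()
    s_xy = ((x - xbar) * (y - ybar)).sum()
    beta_hat = s_xy / s_x
    alpha_hat = ybar - beta_hat * xbar
    resid = y - alpha_hat - beta_hat * x
    sigma2_hat = (resid ** 2).sum() / (n - 2)
    mean_new = alpha_hat + beta_hat * x_new
    scale = np.sqrt(sigma2_hat * ((xbar - x_new) ** 2 / s_x + 1 / n + 1))
    tq = stats.t.ppf(0.5 + level / 2, df=n - 2)
    return mean_new - tq * scale, mean_new + tq * scale
```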

Generalized linear models

The generalized linear model generalizes the normal linear model by allowing the possibility of non-normal error distributions and by allowing for a non-linear relationship between $y$ and $x$. A generalized linear model is specified by two functions:
1. A conditional, exponential family density function of $y$ given $x$, parameterized by a mean parameter, $\mu = \mu(x) = E[Y \mid x]$, and (possibly) a dispersion parameter, $\phi > 0$, that is independent of $x$.
2. A (one-to-one) link function, $g(\cdot)$, which relates the mean, $\mu = \mu(x)$, to the covariate vector, $x$, as $g(\mu) = x\theta$.

Generalized linear models

The following are generalized linear models with the canonical link function, which is the natural parameterization that leaves the exponential family distribution in canonical form.

Logistic regression. It is often used for predicting the occurrence of an event given covariates:
$$ Y_i \mid p_i \sim \operatorname{Bin}\left(n_i, p_i\right), \qquad \log\frac{p_i}{1 - p_i} = x_i\theta. $$

Poisson regression. It is used for predicting the number of events in a time period given covariates:
$$ Y_i \mid \lambda_i \sim \mathcal{P}\left(\lambda_i\right), \qquad \log\lambda_i = x_i\theta. $$

Generalized linear models

The Bayesian specification of a GLM is completed by defining (typically normal or normal-gamma) prior distributions $p(\theta, \phi)$ over the unknown model parameters. As with standard linear models, when improper priors are used, it is important to check that these lead to valid posterior distributions. Clearly, these models will not have conjugate posterior distributions, but they are usually easily handled by Gibbs sampling. In particular, the posterior distributions from these models are usually log-concave and are thus easily sampled via adaptive rejection sampling.
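The slides mention Gibbs sampling with adaptive rejection steps; as a simpler, self-contained illustration, the sketch below uses random-walk Metropolis instead for a Bayesian logistic regression with a normal prior (the function name, step size, and prior standard deviation are assumptions made for the example, not from the source).

```python
import numpy as np

def logistic_metropolis(y, X, prior_sd=10.0, n_iter=20000, step=0.1, seed=0):
    """Random-walk Metropolis sampler for Bayesian logistic regression:
    y_i ~ Bernoulli(p_i), logit(p_i) = x_i^T theta, theta ~ N(0, prior_sd^2 I)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape

    def log_post(theta):
        eta = X @ theta
        loglik = np.sum(y * eta - np.logaddexp(0.0, eta))   # Bernoulli log-likelihood
        logprior = -0.5 * np.sum(theta ** 2) / prior_sd ** 2
        return loglik + logprior

    theta = np.zeros(k)
    lp = log_post(theta)
    draws = np.empty((n_iter, k))
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal(k)          # random-walk proposal
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:               # accept/reject step
            theta, lp = prop, lp_prop
        draws[t] = theta
    return draws
```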