Bayesian Linear Models

Sudipto Banerjee¹ and Andrew O. Finley²

¹ Department of Forestry & Department of Geography, Michigan State University, Lansing, Michigan, U.S.A.
² Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A.

March 12, 2009

Linear regression models: a Bayesian perspective

Ingredients of a linear model include an n × 1 response vector y = (y₁, ..., yₙ)ᵀ and an n × p design matrix (e.g. including regressors) X = [x₁, ..., xₚ], assumed to have been observed without error.

The linear model: y = Xβ + ε; ε ~ N(0, σ²I).

The linear model is the most fundamental of all serious statistical models, encompassing:
- ANOVA: y is continuous, the xᵢ's are categorical
- REGRESSION: y is continuous, the xᵢ's are continuous
- ANCOVA: y is continuous, some xᵢ's are continuous, some categorical

Unknown parameters include the regression parameters β and the variance σ². We assume X is observed without error and all inference is conditional on X.

Linear regression models: a Bayesian perspective

The classical unbiased estimates of the regression parameters β and σ² are

β̂ = (XᵀX)⁻¹Xᵀy;  σ̂² = (1/(n − p)) (y − Xβ̂)ᵀ(y − Xβ̂).

The above estimate of β is also the least-squares estimate.

The fitted value of y is ŷ = Xβ̂ = P_X y, where P_X = X(XᵀX)⁻¹Xᵀ. P_X is called the projector onto the column space of X: it projects any vector onto the space spanned by the columns of X.

The residual sum of squares is êᵀê = (y − Xβ̂)ᵀ(y − Xβ̂) = yᵀ(I − P_X)y, where ê = y − Xβ̂ is the vector of residuals.
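To make these quantities concrete, here is a minimal numpy sketch computing β̂, s² = σ̂², and the projector P_X on simulated data; the sample size, number of regressors, and "true" parameter values are illustrative assumptions, not part of the slides. Later sketches reuse X, y, beta_hat, XtX_inv, s2, n, and p from this block.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                                   # illustrative sizes (assumption)
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])          # hypothetical "true" coefficients
y = X @ beta_true + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                    # classical / least-squares estimate of beta
P_X = X @ XtX_inv @ X.T                         # projector onto the column space of X
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                    # unbiased estimate of sigma^2
```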

Bayesian regression with flat reference priors

For a Bayesian analysis, we need to specify priors for the unknown regression parameters β and the variance σ². Consider independent flat priors on β and log σ²:

p(β) ∝ 1;  p(log(σ²)) ∝ 1,  or equivalently  p(β, σ²) ∝ 1/σ².

Neither of these distributions is a valid probability distribution (they do not integrate to any finite number). So why are we even discussing them? It turns out that even though the priors are improper (that's what we call them), as long as the resulting posterior distribution is valid we can still conduct legitimate statistical inference.

Marginal and conditional distributions

With a flat prior on β we obtain, after some algebra, the conditional posterior distribution

p(β | σ², y) = N(β | (XᵀX)⁻¹Xᵀy, σ²(XᵀX)⁻¹).

This conditional posterior would have been the desired posterior distribution had σ² been known. Since that is not the case, we obtain the marginal posterior distribution by integrating out σ²:

p(β | y) = ∫ p(β | σ², y) p(σ² | y) dσ².

Can we carry out this integration using composition sampling? YES: if we can generate samples from p(σ² | y)!

Marginal and conditional distributions

So, we need to find the marginal posterior distribution of σ². With the flat prior we obtain

p(σ² | y) ∝ (σ²)^(−(n−p)/2 − 1) exp( −(n−p)s² / (2σ²) ) = IG(σ² | (n−p)/2, (n−p)s²/2),

where s² = σ̂² = (1/(n−p)) yᵀ(I − P_X)y.

This is known as an inverted Gamma distribution (also called a scaled inverse chi-square distribution), IG((n−p)/2, (n−p)s²/2). In other words, [(n−p)s²/σ² | y] ~ χ²ₙ₋ₚ (a chi-square distribution with n − p degrees of freedom).

Note the striking similarity with the classical result: the sampling distribution of σ̂² is also characterized by (n−p)s²/σ² following a chi-square distribution with n − p degrees of freedom.
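As a quick numerical check of this marginal, the sketch below (reusing s2, n, p from the earlier block) draws σ² from IG((n−p)/2, (n−p)s²/2) via scipy; the number of draws is arbitrary.

```python
from scipy import stats

a = (n - p) / 2.0
b = (n - p) * s2 / 2.0
# scipy's invgamma(a, scale=b) is exactly the IG(a, b) distribution used here
sigma2_draws = stats.invgamma(a, scale=b).rvs(size=5000, random_state=1)
# Equivalently, (n - p) s^2 / sigma^2 should behave like chi-square_{n-p} draws
chisq_equiv = (n - p) * s2 / sigma2_draws
```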

Composition sampling for linear regression

Now we are ready to carry out composition sampling from p(β, σ² | y) as follows:

Draw M samples from p(σ² | y):
σ²⁽ʲ⁾ ~ IG( (n−p)/2, (n−p)s²/2 ),  j = 1, ..., M.

For j = 1, ..., M, draw from p(β | σ²⁽ʲ⁾, y):
β⁽ʲ⁾ ~ N( (XᵀX)⁻¹Xᵀy, σ²⁽ʲ⁾(XᵀX)⁻¹ ).

The resulting samples {(β⁽ʲ⁾, σ²⁽ʲ⁾)}, j = 1, ..., M, represent M samples from p(β, σ² | y). In particular, {β⁽ʲ⁾} are samples from the marginal posterior distribution p(β | y), which is a multivariate t density:

p(β | y) = { Γ(n/2) / [ (π(n−p))^(p/2) Γ((n−p)/2) |s²(XᵀX)⁻¹|^(1/2) ] } × [ 1 + (β − β̂)ᵀ(XᵀX)(β − β̂) / ((n−p)s²) ]^(−n/2).
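A minimal sketch of this composition sampler under the flat prior above; it draws σ² by inverting gamma draws (if G ~ Gamma(a, 1) then b/G ~ IG(a, b)) and then draws β given each σ², using only numpy.

```python
import numpy as np

def composition_sample(X, y, M=5000, seed=0):
    """Exact (non-iterative) draws from p(beta, sigma^2 | y) under the flat reference prior."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    s2 = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p)
    # Step 1: sigma^2 ~ IG((n-p)/2, (n-p) s^2 / 2)
    sigma2 = ((n - p) * s2 / 2) / rng.gamma((n - p) / 2, size=M)
    # Step 2: beta | sigma^2 ~ N(beta_hat, sigma^2 (X^T X)^{-1})
    L = np.linalg.cholesky(XtX_inv)              # L L^T = (X^T X)^{-1}
    z = rng.standard_normal((M, p))
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return beta, sigma2

beta_samp, sigma2_samp = composition_sample(X, y)   # X, y from the earlier sketch
```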

Composition sampling for linear regression

The marginal posterior distribution of each individual regression parameter βⱼ is a (shifted and scaled) univariate t distribution with n − p degrees of freedom. In fact,

(βⱼ − β̂ⱼ) / ( s √[(XᵀX)⁻¹ⱼⱼ] )  ~  tₙ₋ₚ.

The 95% credible interval for each βⱼ is constructed from the quantiles of this t distribution. These credible intervals coincide exactly with the 95% classical confidence intervals, but the interpretation is direct: the probability that βⱼ falls in that interval, given the observed data, is 0.95.

Note: an intercept-only linear model reduces to the simple univariate N(ȳ | µ, σ²/n) likelihood, for which the marginal posterior of µ satisfies

(µ − ȳ) / (s/√n)  ~  tₙ₋₁.
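For example, the 95% credible intervals can be read directly off the t quantiles; a short sketch reusing beta_hat, s2, XtX_inv, n, and p from the earlier block:

```python
import numpy as np
from scipy import stats

t975 = stats.t.ppf(0.975, df=n - p)
se = np.sqrt(s2 * np.diag(XtX_inv))          # s * sqrt[(X^T X)^{-1}_{jj}]
lower, upper = beta_hat - t975 * se, beta_hat + t975 * se
# Each (lower[j], upper[j]) is a 95% credible interval for beta_j,
# numerically identical to the classical 95% confidence interval.
```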

Bayesian predictions from the linear model

Suppose we have observed new predictors X̃ and we wish to predict the corresponding outcome ỹ. We specify p(ỹ, y | θ) to be a normal distribution:

(ỹ, y) ~ N( [X̃; X] β, σ²I ),

where [X̃; X] denotes X̃ stacked on top of X. Note that p(ỹ | y, β, σ²) = p(ỹ | β, σ²) = N(ỹ | X̃β, σ²I).

The posterior predictive distribution is

p(ỹ | y) = ∫ p(ỹ | y, β, σ²) p(β, σ² | y) dβ dσ² = ∫ p(ỹ | β, σ²) p(β, σ² | y) dβ dσ².

By now we are comfortable evaluating such integrals:
First obtain (β⁽ʲ⁾, σ²⁽ʲ⁾) ~ p(β, σ² | y), j = 1, ..., M.
Next draw ỹ⁽ʲ⁾ ~ N( X̃β⁽ʲ⁾, σ²⁽ʲ⁾ I ).
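A minimal sketch of this two-step predictive draw, assuming the composition-sampler output above; X_new stands in for X̃ and is a hypothetical matrix of new predictor values (here just a few rows of X for illustration).

```python
import numpy as np

rng = np.random.default_rng(2)
X_new = X[:5]                                     # hypothetical new design points
mu = beta_samp @ X_new.T                          # (M, 5): X_new beta^(j) for each posterior draw j
y_tilde = mu + np.sqrt(sigma2_samp)[:, None] * rng.standard_normal(mu.shape)
# Row j of y_tilde is a draw from N(X_new beta^(j), sigma^2(j) I);
# pooled over j, each column gives samples from p(ytilde | y) at that design point.
```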

The Gibbs sampler

Suppose that θ = (θ₁, θ₂) and we seek the posterior distribution p(θ₁, θ₂ | y). For many interesting hierarchical models, we have access to the full conditional distributions p(θ₁ | θ₂, y) and p(θ₂ | θ₁, y).

The Gibbs sampler proposes the following sampling scheme:
Set starting values θ⁽⁰⁾ = (θ₁⁽⁰⁾, θ₂⁽⁰⁾).
For j = 1, ..., M:
  Draw θ₁⁽ʲ⁾ ~ p(θ₁ | θ₂⁽ʲ⁻¹⁾, y)
  Draw θ₂⁽ʲ⁾ ~ p(θ₂ | θ₁⁽ʲ⁾, y)

This constructs a Markov chain and, after an initial burn-in period while the chains are finding their way, the algorithm guarantees that {(θ₁⁽ʲ⁾, θ₂⁽ʲ⁾)}, j = M₀ + 1, ..., M, are samples from p(θ₁, θ₂ | y), where M₀ is the burn-in period.

The Gibbs sampler

More generally, if θ = (θ₁, ..., θₚ) are the parameters in our model, we provide a set of initial values θ⁽⁰⁾ = (θ₁⁽⁰⁾, ..., θₚ⁽⁰⁾) and then perform the j-th iteration, for j = 1, ..., M, by updating successively from the full conditional distributions:

θ₁⁽ʲ⁾ ~ p(θ₁ | θ₂⁽ʲ⁻¹⁾, ..., θₚ⁽ʲ⁻¹⁾, y)
θ₂⁽ʲ⁾ ~ p(θ₂ | θ₁⁽ʲ⁾, θ₃⁽ʲ⁻¹⁾, ..., θₚ⁽ʲ⁻¹⁾, y)
...
θₖ⁽ʲ⁾ ~ p(θₖ | θ₁⁽ʲ⁾, ..., θₖ₋₁⁽ʲ⁾, θₖ₊₁⁽ʲ⁻¹⁾, ..., θₚ⁽ʲ⁻¹⁾, y)   (the generic k-th element)
...
θₚ⁽ʲ⁾ ~ p(θₚ | θ₁⁽ʲ⁾, ..., θₚ₋₁⁽ʲ⁾, y)

A generic sketch of this sweep is given below.
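The interface used in the sketch (a list of functions, one per component, each drawing θₖ given the current values of all other components) is a hypothetical design, not something prescribed by the slides.

```python
import numpy as np

def gibbs(full_conditionals, theta0, M=10000, burn=1000, seed=0):
    """Generic Gibbs sweep: full_conditionals[k](theta, rng) draws theta_k
    from its full conditional given the current state of all other components."""
    rng = np.random.default_rng(seed)
    theta = list(theta0)
    kept = []
    for j in range(M):
        for k, draw_k in enumerate(full_conditionals):
            theta[k] = draw_k(theta, rng)        # successive updates within iteration j
        if j >= burn:                            # discard the burn-in period
            kept.append(list(theta))
    return kept
```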

The Gibbs sampler

Example: consider the linear model, and suppose we set p(σ²) = IG(σ² | a, b) and p(β) ∝ 1. The full conditional distributions are:

p(β | y, σ²) = N( β | (XᵀX)⁻¹Xᵀy, σ²(XᵀX)⁻¹ )
p(σ² | y, β) = IG( σ² | a + n/2, b + ½(y − Xβ)ᵀ(y − Xβ) ).

Thus, the Gibbs sampler initializes (β⁽⁰⁾, σ²⁽⁰⁾) and draws, for j = 1, ..., M:
Draw β⁽ʲ⁾ ~ N( (XᵀX)⁻¹Xᵀy, σ²⁽ʲ⁻¹⁾(XᵀX)⁻¹ )
Draw σ²⁽ʲ⁾ ~ IG( a + n/2, b + ½(y − Xβ⁽ʲ⁾)ᵀ(y − Xβ⁽ʲ⁾) )
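A minimal numpy sketch of this two-block Gibbs sampler; the hyperparameters a and b, the starting value for σ², and the number of iterations are illustrative assumptions.

```python
import numpy as np

def gibbs_linear_model(X, y, a=2.0, b=1.0, M=5000, seed=0):
    """Gibbs sampler for the linear model with a flat prior on beta and sigma^2 ~ IG(a, b)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    L = np.linalg.cholesky(XtX_inv)
    sigma2 = 1.0                                            # arbitrary starting value
    betas, sigma2s = np.empty((M, p)), np.empty(M)
    for j in range(M):
        # beta | y, sigma^2 ~ N((X^T X)^{-1} X^T y, sigma^2 (X^T X)^{-1})
        beta = beta_hat + np.sqrt(sigma2) * (L @ rng.standard_normal(p))
        # sigma^2 | y, beta ~ IG(a + n/2, b + 0.5 (y - X beta)^T (y - X beta))
        resid = y - X @ beta
        sigma2 = (b + 0.5 * resid @ resid) / rng.gamma(a + n / 2)
        betas[j], sigma2s[j] = beta, sigma2
    return betas, sigma2s
```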

The Gibbs sampler

In principle, the Gibbs sampler works for extremely complex hierarchical models. The only issue is sampling from the full conditionals: they may not be amenable to easy sampling when they are not available in closed form.

A more general, extremely powerful, and often easier-to-code algorithm is the Metropolis-Hastings (MH) algorithm. This algorithm also constructs a Markov chain, but does not require the full conditionals.

The Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm:
Start with an initial value θ⁽⁰⁾.
At the j-th iteration, propose a candidate value θ* from a proposal distribution: θ* ~ q(θ⁽ʲ⁻¹⁾, ν). For example, q(θ⁽ʲ⁻¹⁾, ν) = N(θ⁽ʲ⁻¹⁾, ν) with ν fixed.
Compute

r = [ p(θ* | y) q(θ⁽ʲ⁻¹⁾ | θ*, ν) ] / [ p(θ⁽ʲ⁻¹⁾ | y) q(θ* | θ⁽ʲ⁻¹⁾, ν) ].

If r ≥ 1, set θ⁽ʲ⁾ = θ*. If r < 1, draw U ~ U(0, 1); if U ≤ r, set θ⁽ʲ⁾ = θ*; otherwise, θ⁽ʲ⁾ = θ⁽ʲ⁻¹⁾.
Repeat for j = 1, ..., M.

This yields θ⁽¹⁾, ..., θ⁽ᴹ⁾ which, after a burn-in period, are samples from the true posterior distribution. It is important to monitor the acceptance rate of the sampler through the iterations. Rough recommendations: around 20% for vector updates and around 40% for scalar updates. The rate can be controlled by tuning ν.

Popular approach: embed Metropolis steps within Gibbs to draw from full conditionals that cannot be sampled from directly.
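A minimal sketch of a random-walk Metropolis sampler (symmetric normal proposal, so the q terms cancel in r and only the posterior ratio remains); the log-posterior function, proposal variance ν, and chain length are user-supplied assumptions.

```python
import numpy as np

def metropolis(log_post, theta0, nu=0.1, M=10000, seed=0):
    """Random-walk Metropolis with spherical proposal N(theta^(j-1), nu * I)."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    draws = np.empty((M, theta.size))
    accepted = 0
    for j in range(M):
        prop = theta + np.sqrt(nu) * rng.standard_normal(theta.size)
        lp_prop = log_post(prop)
        # accept with probability min(1, r), comparing on the log scale
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
            accepted += 1
        draws[j] = theta
    return draws, accepted / M        # monitor the acceptance rate; tune nu accordingly
```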

The Metropolis-Hastings algorithm

Example: for the linear model, our parameters are (β, σ²). We write θ = (β, log(σ²)) and, at the j-th iteration, propose θ* ~ N(θ⁽ʲ⁻¹⁾, Σ). The log transformation on σ² ensures that all components of θ have support on the entire real line and can receive meaningful proposed values from the multivariate normal.

But we need to transform our prior to p(β, log(σ²)). Let z = log(σ²) and assume p(β, z) = p(β)p(z). To derive p(z), remember that we need to adjust for the Jacobian: p(z) = p(σ²) |dσ²/dz| = p(exp(z)) exp(z), so the Jacobian here is exp(z) = σ².

Let p(β) ∝ 1 and p(σ²) = IG(σ² | a, b). Then the log-posterior is, up to a constant,

−(a + n/2 + 1)z + z − exp(−z) { b + ½(y − Xβ)ᵀ(y − Xβ) }.

A symmetric proposal distribution, say q(θ* | θ⁽ʲ⁻¹⁾, Σ) = N(θ⁽ʲ⁻¹⁾, Σ), cancels out in r. In practice it is better to compute log(r) = log p(θ* | y) − log p(θ⁽ʲ⁻¹⁾ | y). For the proposal N(θ⁽ʲ⁻¹⁾, Σ), Σ is a d × d variance-covariance matrix with d = dim(θ) = p + 1.

If log r ≥ 0, set θ⁽ʲ⁾ = θ*. If log r < 0, draw U ~ U(0, 1); if U ≤ r (equivalently, log U ≤ log r), set θ⁽ʲ⁾ = θ*; otherwise, θ⁽ʲ⁾ = θ⁽ʲ⁻¹⁾. Repeat the above procedure for j = 1, ..., M to obtain samples θ⁽¹⁾, ..., θ⁽ᴹ⁾.
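The log-posterior above translates directly into code; a sketch, assuming the same flat prior on β and IG(a, b) prior on σ² (the values of a and b are illustrative), which can be passed to the random-walk Metropolis sketch given earlier (a spherical rather than general Σ proposal).

```python
import numpy as np

def log_posterior(theta, X, y, a=2.0, b=1.0):
    """Log posterior (up to a constant) of theta = (beta, z) with z = log(sigma^2),
    flat prior on beta, IG(a, b) prior on sigma^2, including the exp(z) Jacobian."""
    beta, z = theta[:-1], theta[-1]
    resid = y - X @ beta
    return -(a + len(y) / 2 + 1) * z + z - np.exp(-z) * (b + 0.5 * resid @ resid)

# Usage with the earlier random-walk Metropolis sketch (initial values are arbitrary):
# draws, acc_rate = metropolis(lambda th: log_posterior(th, X, y),
#                              theta0=np.zeros(X.shape[1] + 1), nu=0.01)
```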