Bayesian Linear Models


Sudipto Banerjee (Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A.)
Andrew O. Finley (Department of Forestry & Department of Geography, Michigan State University, Lansing, Michigan, U.S.A.)

October 15, 2012

Linear regression models: a Bayesian perspective

Ingredients of a linear model include an $n \times 1$ response vector $y = (y_1, \ldots, y_n)^T$ and an $n \times p$ design matrix (e.g., including regressors) $X = [x_1, \ldots, x_p]$, assumed to have been observed without error. The linear model:

$$y = X\beta + \epsilon; \quad \epsilon \sim N(0, \sigma^2 I)$$

The linear model is the most fundamental of all serious statistical models, encompassing:
- ANOVA: $y$ is continuous, the $x_i$'s are categorical
- REGRESSION: $y$ is continuous, the $x_i$'s are continuous
- ANCOVA: $y$ is continuous, some $x_i$'s are continuous, some categorical

Unknown parameters include the regression parameters $\beta$ and the variance $\sigma^2$. We assume $X$ is observed without error and all inference is conditional on $X$.
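To make the setup concrete, here is a minimal simulation sketch of data from this model; the sample size, regressors, and parameter values are hypothetical choices, not from the slides. Later sketches continue from these variables.

    import numpy as np

    rng = np.random.default_rng(42)
    n, p = 100, 3                                  # hypothetical sample size and number of regressors
    X = np.column_stack([np.ones(n),               # intercept column
                         rng.normal(size=(n, p - 1))])
    beta_true = np.array([2.0, -1.0, 0.5])         # assumed "true" regression parameters
    sigma2_true = 1.5                              # assumed "true" error variance
    y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_true), size=n)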

Linear regression models: a Bayesian perspective

The classical unbiased estimates of the regression parameter $\beta$ and of $\sigma^2$ are

$$\hat{\beta} = (X^TX)^{-1}X^Ty; \quad \hat{\sigma}^2 = \frac{1}{n-p}(y - X\hat{\beta})^T(y - X\hat{\beta}).$$

The above estimate of $\beta$ is also a least-squares estimate. The predicted value of $y$ is given by

$$\hat{y} = X\hat{\beta} = P_X y, \quad \text{where } P_X = X(X^TX)^{-1}X^T.$$

$P_X$ is called the projector of $X$. It projects any vector onto the space spanned by the columns of $X$. The residual sum of squares is

$$(y - X\hat{\beta})^T(y - X\hat{\beta}) = y^T(I - P_X)y.$$
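These quantities are straightforward to compute numerically. A sketch continuing the simulated X and y above (variable names are my own):

    import numpy as np

    # Classical least-squares quantities (continues X, y, n, p from the sketch above)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y                 # classical / least-squares estimate of beta
    P_X = X @ XtX_inv @ X.T                      # projector onto the column space of X
    rss = y @ (np.eye(n) - P_X) @ y              # residual sum of squares
    s2 = rss / (n - p)                           # unbiased estimate of sigma^2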

Bayesian regression with flat reference priors

For a Bayesian analysis, we need to specify priors for the unknown regression parameters $\beta$ and the variance $\sigma^2$. Consider independent flat priors on $\beta$ and $\log \sigma^2$:

$$p(\beta) \propto 1; \quad p(\log(\sigma^2)) \propto 1, \quad \text{or equivalently} \quad p(\beta, \sigma^2) \propto \frac{1}{\sigma^2}.$$

Neither of these two distributions is a valid probability density (they do not integrate to a finite number). So why are we even discussing them? It turns out that even if the priors are improper (that is what we call them), as long as the resulting posterior distributions are valid, we can still conduct legitimate statistical inference with them.

Marginal and conditional distributions

With a flat prior on $\beta$ we obtain, after some algebra, the conditional posterior distribution:

$$p(\beta \mid \sigma^2, y) = N(\beta \mid (X^TX)^{-1}X^Ty, \ \sigma^2(X^TX)^{-1}).$$

The conditional posterior distribution of $\beta$ would have been the desired posterior distribution had $\sigma^2$ been known. Since that is not the case, we need to obtain the marginal posterior distribution by integrating out $\sigma^2$:

$$p(\beta \mid y) = \int p(\beta \mid \sigma^2, y)\, p(\sigma^2 \mid y)\, d\sigma^2.$$

Can we evaluate this integral using composition sampling? YES: if we can generate samples from $p(\sigma^2 \mid y)$!

Marginal and conditional distributions

So we need to find the marginal posterior distribution of $\sigma^2$. With the choice of the flat prior we obtain:

$$p(\sigma^2 \mid y) \propto (\sigma^2)^{-((n-p)/2 + 1)} \exp\left(-\frac{(n-p)s^2}{2\sigma^2}\right) = IG\left(\sigma^2 \,\Big|\, \frac{n-p}{2}, \ \frac{(n-p)s^2}{2}\right),$$

where $s^2 = \hat{\sigma}^2 = \frac{1}{n-p} y^T(I - P_X)y$. This is known as an inverted Gamma distribution (equivalently, a scaled inverse chi-square distribution), $IG(\sigma^2 \mid (n-p)/2, (n-p)s^2/2)$. In other words:

$$\left[\frac{(n-p)s^2}{\sigma^2} \,\Big|\, y\right] \sim \chi^2_{n-p} \quad \text{(with } n-p \text{ degrees of freedom).}$$

A striking similarity with the classical result: the sampling distribution of $\hat{\sigma}^2$ is also characterized by $(n-p)s^2/\sigma^2$ following a $\chi^2_{n-p}$ distribution.

Composition sampling for linear regression

Now we are ready to carry out composition sampling from $p(\beta, \sigma^2 \mid y)$ as follows:

Draw $M$ samples from $p(\sigma^2 \mid y)$:

$$\sigma^{2(j)} \sim IG\left(\frac{n-p}{2}, \ \frac{(n-p)s^2}{2}\right), \quad j = 1, \ldots, M.$$

For $j = 1, \ldots, M$, draw from $p(\beta \mid \sigma^{2(j)}, y)$:

$$\beta^{(j)} \sim N\left((X^TX)^{-1}X^Ty, \ \sigma^{2(j)}(X^TX)^{-1}\right).$$

The resulting samples $\{\beta^{(j)}, \sigma^{2(j)}\}_{j=1}^M$ represent $M$ samples from $p(\beta, \sigma^2 \mid y)$. The $\{\beta^{(j)}\}_{j=1}^M$ are samples from the marginal posterior distribution $p(\beta \mid y)$. This is a multivariate t density:

$$p(\beta \mid y) = \frac{\Gamma(n/2)}{(\pi(n-p))^{p/2}\,\Gamma((n-p)/2)\,\left|s^2(X^TX)^{-1}\right|^{1/2}} \left[1 + \frac{(\beta - \hat{\beta})^T(X^TX)(\beta - \hat{\beta})}{(n-p)s^2}\right]^{-n/2}.$$
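A minimal sketch of this composition sampler, continuing the running example; note that scipy's invgamma with shape a and scale=b matches the $IG(a, b)$ density above:

    import numpy as np
    from scipy import stats

    # Composition sampling (continues X, y, n, p, beta_hat, XtX_inv, s2 from above)
    M = 5000
    sigma2_draws = stats.invgamma.rvs((n - p) / 2.0,
                                      scale=(n - p) * s2 / 2.0,
                                      size=M)                      # draws from p(sigma^2 | y)
    beta_draws = np.array([
        stats.multivariate_normal.rvs(mean=beta_hat, cov=s2j * XtX_inv)
        for s2j in sigma2_draws])                                  # draws from p(beta | sigma^2, y)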

Composition sampling for linear regression

The marginal distribution of each individual regression parameter $\beta_j$ is a non-central univariate $t_{n-p}$ distribution. In fact,

$$\frac{\beta_j - \hat{\beta}_j}{s\sqrt{(X^TX)^{-1}_{jj}}} \sim t_{n-p}.$$

The 95% credible intervals for each $\beta_j$ are constructed from the quantiles of the t-distribution. These credible intervals exactly coincide with the 95% classical confidence intervals, but the interpretation is direct: the probability of $\beta_j$ falling in that interval, given the observed data, is 0.95. Note: an intercept-only linear model reduces to the simple univariate $N(\bar{y} \mid \mu, \sigma^2/n)$ likelihood, for which the marginal posterior of $\mu$ is:

$$\frac{\mu - \bar{y}}{s/\sqrt{n}} \,\Big|\, y \sim t_{n-1}.$$
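For instance, the 95% interval endpoints can be read directly off the $t_{n-p}$ quantiles (a sketch continuing the variables above):

    import numpy as np
    from scipy import stats

    # 95% credible intervals for each beta_j (uses beta_hat, XtX_inv, s2, n, p from above)
    t975 = stats.t.ppf(0.975, df=n - p)          # upper 97.5% quantile of t_{n-p}
    se = np.sqrt(s2 * np.diag(XtX_inv))          # posterior scale for each beta_j
    ci = np.column_stack([beta_hat - t975 * se, beta_hat + t975 * se])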

Bayesian predictions from the linear model

Suppose we have observed the new predictors $\tilde{X}$, and we wish to predict the outcome $\tilde{y}$. We specify $p(\tilde{y}, y \mid \theta)$ to be a normal distribution:

$$\begin{pmatrix} \tilde{y} \\ y \end{pmatrix} \sim N\left(\begin{bmatrix} \tilde{X} \\ X \end{bmatrix}\beta, \ \sigma^2 I\right)$$

Note $p(\tilde{y} \mid y, \beta, \sigma^2) = p(\tilde{y} \mid \beta, \sigma^2) = N(\tilde{y} \mid \tilde{X}\beta, \sigma^2 I)$. The posterior predictive distribution:

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid y, \beta, \sigma^2)\, p(\beta, \sigma^2 \mid y)\, d\beta\, d\sigma^2 = \int p(\tilde{y} \mid \beta, \sigma^2)\, p(\beta, \sigma^2 \mid y)\, d\beta\, d\sigma^2.$$

By now we are comfortable evaluating such integrals:
- First obtain: $(\beta^{(j)}, \sigma^{2(j)}) \sim p(\beta, \sigma^2 \mid y)$, $j = 1, \ldots, M$
- Next draw: $\tilde{y}^{(j)} \sim N(\tilde{X}\beta^{(j)}, \sigma^{2(j)} I)$.
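A sketch of these two steps, reusing the composition-sampling draws from above; the new design matrix X_tilde is a hypothetical example:

    import numpy as np

    # Posterior predictive draws (uses beta_draws, sigma2_draws from the sampler above)
    rng = np.random.default_rng(1)
    X_tilde = np.array([[1.0, 0.5, -0.2],        # hypothetical new covariate rows
                        [1.0, -1.0, 0.8]])
    y_tilde = np.array([
        rng.normal(loc=X_tilde @ b, scale=np.sqrt(s2j))
        for b, s2j in zip(beta_draws, sigma2_draws)])   # M draws from p(y_tilde | y)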

The Gibbs sampler

Suppose that $\theta = (\theta_1, \theta_2)$ and we seek the posterior distribution $p(\theta_1, \theta_2 \mid y)$. For many interesting hierarchical models, we have access to the full conditional distributions $p(\theta_1 \mid \theta_2, y)$ and $p(\theta_2 \mid \theta_1, y)$. The Gibbs sampler proposes the following sampling scheme:
- Set starting values $\theta^{(0)} = (\theta_1^{(0)}, \theta_2^{(0)})$
- For $j = 1, \ldots, M$:
  - Draw $\theta_1^{(j)} \sim p(\theta_1 \mid \theta_2^{(j-1)}, y)$
  - Draw $\theta_2^{(j)} \sim p(\theta_2 \mid \theta_1^{(j)}, y)$

This constructs a Markov chain and, after an initial burn-in period while the chains are finding their way, the above algorithm guarantees that $\{\theta_1^{(j)}, \theta_2^{(j)}\}_{j=M_0+1}^{M}$ will be samples from $p(\theta_1, \theta_2 \mid y)$, where $M_0$ is the burn-in period.

The Gibbs sampler

More generally, if $\theta = (\theta_1, \ldots, \theta_p)$ are the parameters in our model, we provide a set of initial values $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_p^{(0)})$ and then perform the $j$-th iteration, for $j = 1, \ldots, M$, by updating successively from the full conditional distributions:

$$\theta_1^{(j)} \sim p(\theta_1 \mid \theta_2^{(j-1)}, \ldots, \theta_p^{(j-1)}, y)$$
$$\theta_2^{(j)} \sim p(\theta_2 \mid \theta_1^{(j)}, \theta_3^{(j-1)}, \ldots, \theta_p^{(j-1)}, y)$$
$$\vdots$$
$$\theta_k^{(j)} \sim p(\theta_k \mid \theta_1^{(j)}, \ldots, \theta_{k-1}^{(j)}, \theta_{k+1}^{(j-1)}, \ldots, \theta_p^{(j-1)}, y) \quad \text{(the generic } k\text{-th element)}$$
$$\vdots$$
$$\theta_p^{(j)} \sim p(\theta_p \mid \theta_1^{(j)}, \ldots, \theta_{p-1}^{(j)}, y)$$

The Gibbs sampler

Example: Consider the linear model. Suppose we set $p(\sigma^2) = IG(\sigma^2 \mid a, b)$ and $p(\beta) \propto 1$. The full conditional distributions are:

$$p(\beta \mid y, \sigma^2) = N(\beta \mid (X^TX)^{-1}X^Ty, \ \sigma^2(X^TX)^{-1})$$
$$p(\sigma^2 \mid y, \beta) = IG\left(\sigma^2 \,\Big|\, a + n/2, \ b + \tfrac{1}{2}(y - X\beta)^T(y - X\beta)\right).$$

Thus, the Gibbs sampler will initialize $(\beta^{(0)}, \sigma^{2(0)})$ and draw, for $j = 1, \ldots, M$:
- Draw $\beta^{(j)} \sim N\left((X^TX)^{-1}X^Ty, \ \sigma^{2(j-1)}(X^TX)^{-1}\right)$
- Draw $\sigma^{2(j)} \sim IG\left(a + n/2, \ b + \tfrac{1}{2}(y - X\beta^{(j)})^T(y - X\beta^{(j)})\right)$
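A minimal sketch of this two-block Gibbs sampler, continuing the running example; the IG hyperparameters a and b are hypothetical choices:

    import numpy as np
    from scipy import stats

    # Gibbs sampler for (beta, sigma^2) under p(beta) propto 1, p(sigma^2) = IG(a, b)
    # (continues X, y, n, p, beta_hat, XtX_inv from the sketches above)
    a, b = 2.0, 1.0                              # hypothetical hyperparameters
    M = 5000
    beta_g, sigma2_g = np.empty((M, p)), np.empty(M)
    sigma2_cur = 1.0                             # starting value sigma^2(0)
    for j in range(M):
        beta_cur = stats.multivariate_normal.rvs(mean=beta_hat,
                                                 cov=sigma2_cur * XtX_inv)
        resid = y - X @ beta_cur
        sigma2_cur = stats.invgamma.rvs(a + n / 2.0,
                                        scale=b + 0.5 * resid @ resid)
        beta_g[j], sigma2_g[j] = beta_cur, sigma2_cur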

The Gibbs sampler

In principle, the Gibbs sampler will work for extremely complex hierarchical models. The only issue is sampling from the full conditionals: they may not be amenable to easy sampling when they are not available in closed form. A more general and extremely powerful algorithm, often easier to code, is the Metropolis-Hastings (MH) algorithm. This algorithm also constructs a Markov chain, but does not necessarily require the full conditionals.

The Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm:
- Start with an initial value $\theta = \theta^{(0)}$.
- Select a candidate or proposal distribution from which to propose a value $\theta^*$ at the $j$-th iteration: $\theta^* \sim q(\theta^{(j-1)}, \nu)$. For example, $q(\theta^{(j-1)}, \nu) = N(\theta^{(j-1)}, \nu)$ with $\nu$ fixed.
- Compute

$$r = \frac{p(\theta^* \mid y)\, q(\theta^{(j-1)} \mid \theta^*, \nu)}{p(\theta^{(j-1)} \mid y)\, q(\theta^* \mid \theta^{(j-1)}, \nu)}.$$

- If $r \geq 1$ then set $\theta^{(j)} = \theta^*$. If $r < 1$ then draw $U \sim U(0, 1)$: if $U \leq r$ then $\theta^{(j)} = \theta^*$; otherwise, $\theta^{(j)} = \theta^{(j-1)}$.
- Repeat for $j = 1, \ldots, M$.

This yields $\theta^{(1)}, \ldots, \theta^{(M)}$, which, after a burn-in period, will be samples from the true posterior distribution. It is important to monitor the acceptance rate of the sampler through the iterations. Rough recommendations: around 20% for vector updates and around 40% for scalar updates. This can be controlled by tuning $\nu$. Popular approach: embed Metropolis steps within Gibbs to draw from full conditionals that are not accessible to direct sampling.
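A generic random-walk Metropolis sketch of this recipe, with a symmetric proposal so the q terms cancel in r; the function name and interface are my own:

    import numpy as np

    def metropolis(log_post, theta0, nu, M, seed=0):
        """Random-walk Metropolis; log_post(theta) is log p(theta | y) up to a constant."""
        rng = np.random.default_rng(seed)
        theta = np.atleast_1d(np.asarray(theta0, dtype=float))
        draws = np.empty((M, theta.size))
        lp, n_accept = log_post(theta), 0
        for j in range(M):
            proposal = theta + rng.normal(scale=np.sqrt(nu), size=theta.size)
            lp_prop = log_post(proposal)
            if np.log(rng.uniform()) <= lp_prop - lp:   # accept with probability min(1, r)
                theta, lp = proposal, lp_prop
                n_accept += 1
            draws[j] = theta
        return draws, n_accept / M                      # acceptance rate: tune nu with this

For example, metropolis(lambda t: -0.5 * t @ t, np.zeros(2), 0.5, 5000) samples a bivariate standard normal target.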

The Metropolis-Hastings algorithm

Example: For the linear model, our parameters are $(\beta, \sigma^2)$. We write $\theta = (\beta, \log(\sigma^2))$ and, at the $j$-th iteration, propose $\theta^* \sim N(\theta^{(j-1)}, \Sigma)$. The log transformation on $\sigma^2$ ensures that all components of $\theta$ have support on the entire real line and can have meaningful proposed values from the multivariate normal. But we need to transform our prior to $p(\beta, \log(\sigma^2))$. Let $z = \log(\sigma^2)$ and assume $p(\beta, z) = p(\beta)p(z)$. Let us derive $p(z)$. REMEMBER: we need to adjust for the Jacobian. Then $p(z) = p(\sigma^2)\,|d\sigma^2/dz| = p(e^z)e^z$. The Jacobian here is $e^z = \sigma^2$. Let $p(\beta) \propto 1$ and $p(\sigma^2) = IG(\sigma^2 \mid a, b)$. Then the log-posterior is:

$$-(a + n/2 + 1)z + z - e^{-z}\left\{b + \tfrac{1}{2}(y - X\beta)^T(y - X\beta)\right\} + \text{const}.$$

A symmetric proposal distribution, say $q(\theta^* \mid \theta^{(j-1)}, \Sigma) = N(\theta^{(j-1)}, \Sigma)$, cancels out in $r$. In practice it is better to compute $\log(r)$:

$$\log(r) = \log p(\theta^* \mid y) - \log p(\theta^{(j-1)} \mid y).$$

For the proposal $N(\theta^{(j-1)}, \Sigma)$, $\Sigma$ is a $d \times d$ variance-covariance matrix, with $d = \dim(\theta) = p + 1$. If $\log r \geq 0$ then set $\theta^{(j)} = \theta^*$. If $\log r < 0$ then draw $U \sim U(0, 1)$: if $U \leq r$ (or $\log U \leq \log r$) then $\theta^{(j)} = \theta^*$; otherwise, $\theta^{(j)} = \theta^{(j-1)}$. Repeat the above procedure for $j = 1, \ldots, M$ to obtain samples $\theta^{(1)}, \ldots, \theta^{(M)}$.
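Putting this together, a sketch of the random-walk MH sampler on $\theta = (\beta, z)$ for the running example; the tuning covariance scale and the hyperparameters a, b are hypothetical and would need adjustment to hit the recommended acceptance rate:

    import numpy as np

    # Random-walk MH for the linear model on theta = (beta, z), z = log sigma^2
    # (continues X, y, n, p from the sketches above)
    a, b = 2.0, 1.0                              # hypothetical IG hyperparameters

    def log_post(theta):
        beta, z = theta[:-1], theta[-1]
        resid = y - X @ beta
        # log-posterior up to a constant; the +z term is the Jacobian e^z
        return -(a + n / 2.0 + 1.0) * z + z - np.exp(-z) * (b + 0.5 * resid @ resid)

    rng = np.random.default_rng(7)
    Sigma = 0.01 * np.eye(p + 1)                 # hypothetical tuning covariance
    theta = np.zeros(p + 1)                      # start at beta = 0, sigma^2 = 1
    lp = log_post(theta)
    M = 10000
    draws = np.empty((M, p + 1))
    for j in range(M):
        proposal = rng.multivariate_normal(theta, Sigma)
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) <= lp_prop - lp:    # accept if log U <= log r
            theta, lp = proposal, lp_prop
        draws[j] = theta
    sigma2_mh = np.exp(draws[:, -1])                 # back-transform z to sigma^2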