Bayesian Linear Regression


Sudipto Banerjee
Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A.
September 15, 2010

Linear regression models: a Bayesian perspective

Linear regression is perhaps the most widely used statistical modelling tool. It addresses the following question: how does a quantity of primary interest, $y$, vary as (depend upon) another quantity, or set of quantities, $x$?

The quantity $y$ is called the response or outcome variable; some people simply refer to it as the dependent variable. The variable(s) $x$ are called explanatory variables, covariates, or simply independent variables. In general, we are interested in the conditional distribution of $y$ given $x$, parametrized as $p(y \mid \theta, x)$.

Typically we have a set of units or experimental subjects $i = 1, 2, \ldots, n$. For each unit we measure an outcome $y_i$ and a set of explanatory variables $x_i^\top = (1, x_{i1}, x_{i2}, \ldots, x_{ip})$. The first element of $x_i$ is often taken as 1 to signify the presence of an intercept. We collect the outcomes and explanatory variables into an $n \times 1$ vector and an $n \times (p+1)$ matrix:
$$
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}; \qquad
X = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
= \begin{pmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{pmatrix}.
$$

The linear model is the most fundamental of all serious statistical models, encompassing:

ANOVA: $y_i$ is continuous, the $x_{ij}$'s are all categorical.
REGRESSION: $y_i$ is continuous, the $x_{ij}$'s are continuous.
ANCOVA: $y_i$ is continuous, the $x_{ij}$'s are continuous for some $j$ and categorical for others.

The Bayesian or hierarchical linear model is given by:
$$
y_i \mid \mu_i, \sigma^2, X \overset{ind}{\sim} N(\mu_i, \sigma^2), \quad i = 1, 2, \ldots, n;
$$
$$
\mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} = x_i^\top \beta; \qquad \beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top;
$$
$$
\beta, \sigma^2 \mid X \sim p(\beta, \sigma^2 \mid X).
$$
The unknown parameters are the regression coefficients and the variance, i.e. $\theta = \{\beta, \sigma^2\}$, and $p(\beta, \sigma^2 \mid X) \equiv p(\theta \mid X)$ is their joint prior. We assume $X$ is observed without error, so all inference is conditional on $X$; we suppress the dependence on $X$ in subsequent notation.
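To make the generative model concrete, here is a minimal NumPy sketch that simulates data from this likelihood; the covariate values, "true" coefficients, and noise level are arbitrary illustrative choices, and the resulting (y, X) are reused by the sampling sketches further below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2                                      # sample size, number of covariates
X = np.column_stack([np.ones(n),                   # leading 1s: the intercept column
                     rng.standard_normal((n, p))])
beta_true = np.array([1.0, 2.0, -0.5])             # hypothetical (beta_0, beta_1, beta_2)
sigma_true = 0.7                                   # hypothetical error standard deviation
y = X @ beta_true + sigma_true * rng.standard_normal(n)   # y_i ~ N(x_i' beta, sigma^2)
```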

Bayesian regression with flat priors

Specifying $p(\beta, \sigma^2)$ completes the hierarchical model, and all inference proceeds from $p(\beta, \sigma^2 \mid y)$. With no prior information, we specify
$$
p(\beta, \sigma^2) \propto \frac{1}{\sigma^2}, \quad \text{or equivalently} \quad p(\beta) \propto 1, \;\; p(\log \sigma^2) \propto 1.
$$
These are NOT probability densities (they do not integrate to any finite number). So why are we even discussing them? Because even when the priors are improper, as long as the resulting posterior distribution is proper, we can still conduct legitimate statistical inference from it.

Computing the posterior distribution

Strategy: factor the joint posterior distribution of $\beta$ and $\sigma^2$ as
$$
p(\beta, \sigma^2 \mid y) = p(\beta \mid \sigma^2, y)\, p(\sigma^2 \mid y).
$$
The conditional posterior distribution of $\beta$ given $\sigma^2$ is
$$
\beta \mid \sigma^2, y \sim N(\hat{\beta}, \sigma^2 V_\beta),
$$
where, after some algebra, one finds $\hat{\beta} = (X^\top X)^{-1} X^\top y$ and $V_\beta = (X^\top X)^{-1}$.
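Since $\hat{\beta}$ and $V_\beta$ are just the ordinary least-squares estimate and its unscaled covariance, computing them takes two lines of linear algebra; a minimal sketch (the function name is mine, and in production code a solver such as `np.linalg.solve` is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

def posterior_mean_cov(y, X):
    """Return beta_hat = (X'X)^{-1} X'y and V_beta = (X'X)^{-1}."""
    V_beta = np.linalg.inv(X.T @ X)
    beta_hat = V_beta @ (X.T @ y)
    return beta_hat, V_beta
```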

The marginal posterior distribution of $\sigma^2$: let $k = p + 1$ be the number of columns of $X$. Then
$$
\sigma^2 \mid y \sim IG\!\left(\frac{n-k}{2}, \frac{(n-k)s^2}{2}\right),
$$
where $s^2 = \frac{1}{n-k} (y - X\hat{\beta})^\top (y - X\hat{\beta})$ is the classical unbiased estimate of $\sigma^2$ in the linear regression model.

The marginal posterior distribution $p(\beta \mid y)$, averaging over $\sigma^2$, is a multivariate $t$ with $n - k$ degrees of freedom. But we rarely use this fact in practice; instead, we sample from the posterior distribution.

Algorithm for sampling from the posterior distribution

We draw samples from $p(\beta, \sigma^2 \mid y)$ by executing the following steps:

Step 1: Compute $\hat{\beta}$ and $V_\beta$.
Step 2: Compute $s^2$.
Step 3: Draw $M$ samples from $p(\sigma^2 \mid y)$:
$$
\sigma^{2(j)} \sim IG\!\left(\frac{n-k}{2}, \frac{(n-k)s^2}{2}\right), \quad j = 1, \ldots, M.
$$
Step 4: For $j = 1, \ldots, M$, draw $\beta^{(j)}$ from $p(\beta \mid \sigma^{2(j)}, y)$:
$$
\beta^{(j)} \sim N\!\left(\hat{\beta}, \sigma^{2(j)} V_\beta\right).
$$
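These four steps translate almost line for line into NumPy. A minimal sketch, assuming the inverse-gamma draw is implemented through the standard identity $\sigma^2 = (n-k)s^2/\chi^2_{n-k}$ (the function name is mine):

```python
import numpy as np

def sample_flat_posterior(y, X, M=1000, seed=None):
    """Composition sampler for p(beta, sigma^2 | y) under p(beta, sigma^2) ∝ 1/sigma^2."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    V_beta = np.linalg.inv(X.T @ X)                        # Step 1
    beta_hat = V_beta @ (X.T @ y)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)                           # Step 2
    sigma2 = (n - k) * s2 / rng.chisquare(n - k, size=M)   # Step 3: IG((n-k)/2, (n-k)s^2/2)
    L = np.linalg.cholesky(V_beta)                         # Step 4: N(beta_hat, sigma^2 V_beta)
    z = rng.standard_normal((M, k))
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return beta, sigma2
```

With the simulated data above, `beta_draws, sigma2_draws = sample_flat_posterior(y, X, M=5000)` gives posterior means `beta_draws.mean(axis=0)` close to `beta_true`.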

The marginal posterior distribution of each individual regression parameter $\beta_j$ is a shifted and scaled univariate $t$ distribution with $n - k$ degrees of freedom. In fact,
$$
\frac{\beta_j - \hat{\beta}_j}{s \sqrt{V_{\beta;jj}}} \sim t_{n-k}.
$$
The 95% credible interval for each $\beta_j$ is constructed from the quantiles of this $t$ distribution. It coincides exactly with the classical 95% confidence interval, but the interpretation is direct: the probability that $\beta_j$ falls in that interval, given the observed data, is 0.95.

Note: an intercept-only linear model reduces to the simple univariate $N(\bar{y} \mid \mu, \sigma^2/n)$ likelihood, for which the marginal posterior of $\mu$ satisfies
$$
\frac{\mu - \bar{y}}{s/\sqrt{n}} \sim t_{n-1}.
$$
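A short sketch of these equal-tailed intervals using the $t$ quantiles from scipy (the function name is mine):

```python
import numpy as np
from scipy import stats

def credible_intervals(y, X, level=0.95):
    """Equal-tailed credible intervals for each beta_j under the flat prior."""
    n, k = X.shape
    V_beta = np.linalg.inv(X.T @ X)
    beta_hat = V_beta @ (X.T @ y)
    resid = y - X @ beta_hat
    s = np.sqrt(resid @ resid / (n - k))                 # classical estimate of sigma
    tq = stats.t.ppf(0.5 + level / 2, df=n - k)          # e.g. t_{n-k; 0.975}
    half = tq * s * np.sqrt(np.diag(V_beta))
    return np.column_stack([beta_hat - half, beta_hat + half])
```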

Bayesian predictions from the linear model

Suppose we have observed new predictors $\tilde{X}$ and wish to predict the corresponding outcome $\tilde{y}$. If $\beta$ and $\sigma^2$ were known exactly, the random vector $\tilde{y}$ would follow $N(\tilde{X}\beta, \sigma^2 I)$. But we do not know the model parameters, and that uncertainty must propagate into the predictions. Predictions are therefore carried out by sampling from the posterior predictive distribution, $p(\tilde{y} \mid y)$:

1. Draw $\{\beta^{(j)}, \sigma^{2(j)}\} \sim p(\beta, \sigma^2 \mid y)$, for $j = 1, 2, \ldots, M$.
2. Draw $\tilde{y}^{(j)} \sim N(\tilde{X}\beta^{(j)}, \sigma^{2(j)} I)$, for $j = 1, 2, \ldots, M$.
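Step 2 is one vectorized normal draw per posterior sample; a minimal sketch that consumes the output of `sample_flat_posterior` above (the function name is mine):

```python
import numpy as np

def sample_predictive(X_new, beta_draws, sigma2_draws, seed=None):
    """One draw y_tilde^(j) ~ N(X_new beta^(j), sigma^2(j) I) per posterior sample."""
    rng = np.random.default_rng(seed)
    mean = beta_draws @ X_new.T                       # row j is X_new beta^(j)
    noise = np.sqrt(sigma2_draws)[:, None] * rng.standard_normal(mean.shape)
    return mean + noise                               # (M, n_new) array of draws
```

Pointwise predictive summaries then come from the columns, e.g. `np.percentile(draws, [2.5, 97.5], axis=0)` for 95% predictive intervals.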

Predictive mean and variance (conditional upon $\sigma^2$):
$$
E(\tilde{y} \mid \sigma^2, y) = \tilde{X}\hat{\beta}; \qquad
\mathrm{var}(\tilde{y} \mid \sigma^2, y) = \sigma^2\left(I + \tilde{X} V_\beta \tilde{X}^\top\right).
$$
The posterior predictive distribution $p(\tilde{y} \mid y)$ is a multivariate $t$ distribution,
$$
t_{n-k}\!\left(\tilde{X}\hat{\beta},\; s^2\left(I + \tilde{X} V_\beta \tilde{X}^\top\right)\right).
$$
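These closed-form moments are a direct transcription of the formulas above; a small sketch (the function name is mine):

```python
import numpy as np

def predictive_moments(y, X, X_new):
    """Mean X_new beta_hat and scale matrix s^2 (I + X_new V_beta X_new')
    of the multivariate-t posterior predictive distribution."""
    n, k = X.shape
    V_beta = np.linalg.inv(X.T @ X)
    beta_hat = V_beta @ (X.T @ y)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)
    mean = X_new @ beta_hat
    scale = s2 * (np.eye(X_new.shape[0]) + X_new @ V_beta @ X_new.T)
    return mean, scale
```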

Incorporating prior information

$$
y_i \mid \mu_i, \sigma^2 \overset{ind}{\sim} N(\mu_i, \sigma^2), \quad i = 1, 2, \ldots, n;
$$
$$
\mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} = x_i^\top \beta; \qquad \beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top;
$$
$$
\beta \mid \sigma^2 \sim N(\beta_0, \sigma^2 R_\beta); \qquad \sigma^2 \sim IG(a_\sigma, b_\sigma),
$$
where $R_\beta$ is a fixed correlation matrix. Alternatively, with the same likelihood,
$$
\beta \mid \Sigma_\beta \sim N(\beta_0, \Sigma_\beta); \qquad \Sigma_\beta \sim IW(\nu, S); \qquad \sigma^2 \sim IG(a_\sigma, b_\sigma),
$$
where $\Sigma_\beta$ is a random covariance matrix. (A Gibbs sampler for a simplified version of this model is sketched after the next slide.)

The Gibbs sampler

The Gibbs sampler: if $\theta = (\theta_1, \ldots, \theta_p)$ are the parameters in our model, we provide a set of initial values $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_p^{(0)})$ and then perform the $j$-th iteration, for $j = 1, \ldots, M$, by updating successively from the full conditional distributions:
$$
\begin{aligned}
\theta_1^{(j)} &\sim p\left(\theta_1 \mid \theta_2^{(j-1)}, \ldots, \theta_p^{(j-1)}, y\right) \\
\theta_2^{(j)} &\sim p\left(\theta_2 \mid \theta_1^{(j)}, \theta_3^{(j-1)}, \ldots, \theta_p^{(j-1)}, y\right) \\
&\;\;\vdots \\
\theta_k^{(j)} &\sim p\left(\theta_k \mid \theta_1^{(j)}, \ldots, \theta_{k-1}^{(j)}, \theta_{k+1}^{(j-1)}, \ldots, \theta_p^{(j-1)}, y\right) \quad \text{(the generic } k\text{-th element)} \\
&\;\;\vdots \\
\theta_p^{(j)} &\sim p\left(\theta_p \mid \theta_1^{(j)}, \ldots, \theta_{p-1}^{(j)}, y\right).
\end{aligned}
$$
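As a concrete instance, here is a sketch of a two-block Gibbs sampler for the informative-prior model above, simplified by holding the prior covariance $\Sigma_\beta$ fixed (i.e. dropping the inverse-Wishart layer); the full conditionals in the comments are the standard semi-conjugate ones, and the function name is mine:

```python
import numpy as np

def gibbs_linear_model(y, X, beta0, Sigma_beta, a_sigma, b_sigma, M=5000, seed=None):
    """Two-block Gibbs sampler for y ~ N(X beta, sigma^2 I) with priors
    beta ~ N(beta0, Sigma_beta) (Sigma_beta fixed) and sigma^2 ~ IG(a_sigma, b_sigma)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    Sb_inv = np.linalg.inv(Sigma_beta)
    XtX, Xty = X.T @ X, X.T @ y
    beta, sigma2 = np.zeros(k), 1.0                  # initial values theta^(0)
    beta_draws, sigma2_draws = np.empty((M, k)), np.empty(M)
    for j in range(M):
        # Full conditional: beta | sigma^2, y ~ N(m, V), where
        # V = (Sigma_beta^{-1} + X'X/sigma^2)^{-1}, m = V (Sigma_beta^{-1} beta0 + X'y/sigma^2)
        V = np.linalg.inv(Sb_inv + XtX / sigma2)
        m = V @ (Sb_inv @ beta0 + Xty / sigma2)
        beta = rng.multivariate_normal(m, V)
        # Full conditional: sigma^2 | beta, y ~ IG(a + n/2, b + ||y - X beta||^2 / 2)
        resid = y - X @ beta
        sigma2 = 1.0 / rng.gamma(a_sigma + n / 2, 1.0 / (b_sigma + resid @ resid / 2))
        beta_draws[j], sigma2_draws[j] = beta, sigma2
    return beta_draws, sigma2_draws
```

Because both full conditionals here are available in closed form, every update is an exact draw; the next slide discusses what to do when that is not the case.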

In principle, the Gibbs sampler will work for extremely complex hierarchical models. The only real issue is sampling from the full conditionals: when they are not available in closed form, they may not be amenable to easy sampling.

A more general, extremely powerful, and often easier-to-code algorithm is the Metropolis-Hastings (MH) algorithm. It also constructs a Markov chain, but does not necessarily care about full conditionals.

Popular approach: embed Metropolis steps within Gibbs to draw from full conditionals that cannot be sampled from directly, as sketched below.
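A minimal sketch of one such embedded update: a random-walk Metropolis step for a single parameter, usable in place of the exact draw from any full conditional that lacks a closed form (the function name and the Gaussian proposal are illustrative choices, not prescribed here):

```python
import numpy as np

def metropolis_step(theta_k, log_full_conditional, proposal_sd, rng):
    """One random-walk Metropolis update targeting p(theta_k | theta_{-k}, y).

    log_full_conditional(theta_k) must return the log of the (unnormalized)
    full conditional density evaluated at theta_k.
    """
    proposal = theta_k + proposal_sd * rng.standard_normal()
    log_accept = log_full_conditional(proposal) - log_full_conditional(theta_k)
    if np.log(rng.uniform()) < log_accept:   # accept with probability min(1, ratio)
        return proposal
    return theta_k                           # otherwise keep the current value
```

Because the Gaussian random-walk proposal is symmetric, the Hastings ratio reduces to the ratio of (unnormalized) full-conditional densities; within a Gibbs sweep, this call simply replaces the line that would otherwise draw $\theta_k$ exactly.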