Bayesian linear regression


Bayesian linear regression

Linear regression is the basis of most statistical modeling. The model is

Y_i = X_i^T β + ε_i,

where
- Y_i is the continuous response
- X_i = (X_i1, ..., X_ip)^T is the corresponding vector of known predictors
- β = (β_1, ..., β_p)^T are the regression parameters of interest
- ε_i ~iid N(0, σ²) are the errors

In matrix notation,

Y = Xβ + ε,

where Y is n×1, X is n×p, β is p×1, ε is n×1, and ε ~ N(0, σ² I_n).

We will discuss a Bayesian analysis of this model, including
- Priors for β
- Priors for σ²
- Connections with penalized regression
- Posterior predictive distributions
- Missing data
- Robust regression
- Generalized linear models
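As a quick illustration, here is a minimal R sketch (not from the notes; all values are illustrative) that simulates data from this model:

set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # design matrix with an intercept column
beta <- c(1, 2, -1)                                    # true regression parameters
sigma <- 0.5
Y <- X %*% beta + rnorm(n, 0, sigma)                   # Y = X beta + eps, eps ~ N(0, sigma^2 I_n)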

Priors for β, σ fixed

The Jeffreys prior is flat, f(β) ∝ 1. The posterior is then

β | Y ~ N(β̂_OLS, Σ_OLS),

where β̂_OLS = (X^T X)^{-1} X^T Y and Σ_OLS = σ²(X^T X)^{-1}.

The posterior is proper if and only if X^T X is nonsingular.

With a flat prior, estimates and inference are numerically identical to least squares. The interpretation, however, is not the same as least squares.
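A minimal sketch (simulated data; not part of the notes) of the flat-prior posterior, which reuses the least-squares quantities:

set.seed(1)
n <- 100; p <- 3; sigma <- 0.5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
Y <- X %*% c(1, 2, -1) + rnorm(n, 0, sigma)

beta.OLS  <- solve(t(X) %*% X, t(X) %*% Y)        # posterior mean under the flat prior
Sigma.OLS <- sigma^2 * solve(t(X) %*% X)          # posterior covariance (sigma fixed)

# Draw S samples from the posterior N(beta.OLS, Sigma.OLS)
S <- 5000
L <- t(chol(Sigma.OLS))
beta.draws <- t(c(beta.OLS) + L %*% matrix(rnorm(p * S), p, S))   # S x p matrix of draws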

Priors for β, σ fixed

Another common prior is the independent normal prior β_j ~iid N(0, σ²/λ). The parameter λ can either be fixed or treated as unknown with a prior. The prior variance is proportional to σ² to account for the scale of the data.

The posterior is

β | Y ~ N[(X^T X + λ I_p)^{-1} X^T Y, σ²(X^T X + λ I_p)^{-1}].

The posterior mode is

β̂_R = argmin_β (Y − Xβ)^T (Y − Xβ) + λ Σ_{j=1}^p β_j².

In penalized regression, this is known as the ridge regression solution. Ridge regression is used for stability in high-variance problems such as regression with collinearity or many predictors. It reduces variance by shrinking the coefficients towards zero.

In the penalized regression literature, λ is called the ridge parameter, and it is selected using BIC or cross-validation.
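A minimal sketch (simulated data, λ fixed; not from the notes) showing that the posterior mean under this prior is the ridge estimate:

set.seed(2)
n <- 50; p <- 5; sigma2 <- 1; lambda <- 2
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% rnorm(p) + rnorm(n)

beta.ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y)   # posterior mean = ridge solution
cov.ridge  <- sigma2 * solve(t(X) %*% X + lambda * diag(p))      # posterior covariance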

Priors for β, σ fixed

Zellner's g-prior is

β ~ N[0, (σ²/g)(X^T X)^{-1}].

This is a good automatic prior because σ² accounts for the scale of Y and (X^T X)^{-1} adjusts for the scale of X.

The posterior is

β | Y ~ N[(1/(g + 1)) β̂_OLS, (σ²/(g + 1))(X^T X)^{-1}].

Therefore, c = 1/(g + 1) has a clear role as the shrinkage parameter.

Is it OK to have the prior depend on X? Is this empirical Bayes?

Under this prior, if X_i and X_j are positively correlated, then β_i and β_j will be negatively correlated. Is that a good idea?

How to pick g?
1. g ~ gamma gives a mixture of g-priors
2. g = 1/n is the unit information prior
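A short sketch (simulated data; g and other values illustrative) of the posterior under the g-prior, showing the role of c = 1/(g + 1) as the shrinkage factor:

set.seed(3)
n <- 50; p <- 4; sigma2 <- 1
X <- matrix(rnorm(n * p), n, p)
Y <- X %*% rnorm(p) + rnorm(n)

beta.OLS  <- solve(t(X) %*% X, t(X) %*% Y)
g <- 1 / n                                         # unit information prior
beta.post <- beta.OLS / (1 + g)                    # posterior mean: OLS shrunk by c = 1/(g+1)
cov.post  <- sigma2 / (1 + g) * solve(t(X) %*% X)  # posterior covariance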

Priors for β, σ fixed

Double exponential priors have heavier tails than Gaussian priors, and are thus more appropriate if you believe many coefficients are near zero but a few are large.

The double exponential prior is β_j ~iid DE(0, λ), with

p(β_j) ∝ exp(−(λ/2)|β_j|).

The posterior mode is the LASSO solution! This is a useful penalty because it stabilizes the estimates by shrinking towards zero, but it also gives a solution with some coefficients exactly zero, β̂_j = 0, thus performing variable selection.
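To see why the posterior mode can have coefficients exactly equal to zero, here is a base-R sketch (not from the notes) for the special case of orthonormal predictors, where the LASSO solution is soft-thresholding of the OLS estimates; the penalty weight pen below stands in for the combined effect of σ² and the prior's λ:

set.seed(4)
n <- 100; p <- 5
X <- qr.Q(qr(matrix(rnorm(n * p), n, p)))          # orthonormal columns: t(X) %*% X = I
Y <- X %*% c(3, -2, 0, 0, 0) + rnorm(n)

beta.OLS <- t(X) %*% Y                             # OLS estimate when t(X) %*% X = I
pen <- 2
# Minimizing (Y - X beta)'(Y - X beta) + pen * sum(|beta_j|) gives soft-thresholding:
beta.lasso <- sign(beta.OLS) * pmax(abs(beta.OLS) - pen / 2, 0)
cbind(beta.OLS, beta.lasso)                        # small coefficients are set exactly to zero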

Priors for σ²

The most common prior for the error variance is σ² ~ InvGamma(a, b). The hyperparameters are often set to be small, such as a = b = 0.01.

The full conditional given β is

σ² | β, Y ~ InvGamma(a + n/2, b + (Y − Xβ)^T(Y − Xβ)/2).

If β has a flat prior, then

σ² | Y ~ InvGamma(a + (n − p)/2, b + SSE/2), where SSE = (Y − Xβ̂_OLS)^T(Y − Xβ̂_OLS).

The posterior mean is

σ̂²_B = (b + SSE/2) / (a + (n − p)/2 − 1).
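A minimal Gibbs-step sketch (simulated data; the current value of β is assumed given) drawing σ² from its full conditional; base R has no inverse-gamma sampler, so we draw a gamma variate and invert it:

set.seed(5)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta <- c(1, 2, -1)                                # current value of beta in the sampler
Y <- X %*% beta + rnorm(n, 0, 0.5)

a <- b <- 0.01
SSE.beta <- sum((Y - X %*% beta)^2)
sigma2 <- 1 / rgamma(1, shape = a + n / 2, rate = b + SSE.beta / 2)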

Marginal posterior of β

Assume β has a flat prior and σ² ~ InvGamma(a, b). The marginal posterior of β (derived in the handout), integrating over σ², is multivariate t with
- n − p + 2a degrees of freedom,
- location β̂_OLS, and
- scale matrix σ̂²_B (X^T X)^{-1}.

What is the effect of accounting for uncertainty in σ²? What happens if a → 0 and b → 0? What if you assume a strong prior for σ²?

Prediction

Let X_p be a new covariate vector and Y_p the corresponding response we wish to predict. We'll discuss prediction in the regression setting, but the concepts apply more generally.

Conditional (plug-in) prediction: Y_p ~ N(X_p^T β̂, σ̂²). This fails to account for uncertainty in β and σ².

To properly account for uncertainty, we want the posterior predictive distribution Y_p | Y. This accounts for uncertainty in the model parameters θ = (β, σ²) because

f(Y_p | Y) = ∫ f(Y_p | θ) f(θ | Y) dθ.

MCMC easily produces samples from the posterior predictive distribution: at each iteration s, given the current parameter draw (β^(s), σ²^(s)), draw Y_p^(s) ~ N(X_p^T β^(s), σ²^(s)); the collection {Y_p^(s)} is a sample from Y_p | Y.
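A sketch (simulated data; a flat prior on β and an InvGamma(a, b) prior on σ² are assumed) of composition sampling from the posterior predictive distribution: draw (β, σ²) from the posterior, then draw Y_p given that draw:

set.seed(6)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
Y <- X %*% c(1, 2, -1) + rnorm(n)
Xp <- c(1, 0.5, -0.3)                              # new covariate vector (illustrative)

XtX.inv  <- solve(t(X) %*% X)
beta.OLS <- XtX.inv %*% t(X) %*% Y
SSE      <- sum((Y - X %*% beta.OLS)^2)
a <- b <- 0.01
S <- 5000

sigma2.draws <- 1 / rgamma(S, a + (n - p) / 2, rate = b + SSE / 2)    # sigma^2 | Y
L <- t(chol(XtX.inv))
beta.draws <- sapply(sigma2.draws,                                     # beta | sigma^2, Y (p x S)
                     function(s2) c(beta.OLS) + sqrt(s2) * (L %*% rnorm(p)))
Yp.draws <- rnorm(S, colSums(beta.draws * Xp), sqrt(sigma2.draws))     # Y_p | beta, sigma^2
quantile(Yp.draws, c(0.025, 0.975))                                    # 95% prediction interval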

Missing data

Often either Y_i or elements of X_i are missing. MCMC easily handles this as well, assuming you are willing to specify a parametric model for the missing values. The main idea is to treat the missing values as unknown parameters in the Bayesian model.

Unknown parameters need priors, so missing X_i must have priors such as X_i ~iid N(µ_X, Σ_X). Of course, if the prior is way off, the results will be invalid. For example, if in reality the data are not missing at random, the Bayesian model will likely give bad results. If specified correctly, the model will lead to inference on β that properly accounts for uncertainty about the missing data.

Hierarchical linear regression model with missing data:

Y_i | X_i, β, σ² ~ N(X_i^T β, σ²)
X_i | µ, Σ ~ N(µ, Σ)
p(β) ∝ 1
σ² ~ InvGamma(0.01, 0.01)
µ ~ N(0, 100² I_p)
Σ ~ InvWishart(0.01, 0.01 I_p)

Missing data

Overview of the Gibbs sampling algorithm:

The full conditional of a missing Y_i is its model distribution given the current parameters, Y_i | rest ~ N(X_i^T β, σ²).

The full conditional of a missing X_i combines two Gaussian terms, the regression likelihood N(Y_i; X_i^T β, σ²) and the covariate model N(X_i; µ, Σ), so it is again normal.

Stacks handout.
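A minimal sketch (illustrative function name; not from the notes) of the imputation step for missing responses inside the Gibbs sampler:

# Given the current beta and sigma2, draw each missing Y_i from N(X_i^T beta, sigma2)
impute.Y <- function(Y, X, beta, sigma2) {
  miss <- which(is.na(Y))
  Y[miss] <- rnorm(length(miss), X[miss, , drop = FALSE] %*% beta, sqrt(sigma2))
  Y
}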

Robust regression

So far we have only considered Gaussian errors. The resulting posterior for β is sensitive to outliers. To add robustness to outlying Y_i, you can use a residual distribution with heavier tails.

Double exponential errors: ε_i ~iid DE, which can be written as a scale mixture of normals, ε_i | v_i ~ N(0, v_i) with an exponential prior on the latent variances v_i, so conditional on the v_i the model is Gaussian again.

Student t errors: ε_i ~iid t_ν(0, σ²), which can be written as ε_i | λ_i ~ N(0, σ²/λ_i) with λ_i ~ Gamma(ν/2, ν/2); conditional on the λ_i, the β and σ² updates are the usual conjugate (weighted least squares) updates.
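A minimal sketch (not from the notes; names illustrative) of the extra Gibbs step for Student-t errors using the scale-mixture representation above; conditional on the λ_i, the β and σ² updates are the usual Gaussian ones with observation weights λ_i:

# lambda_i | rest ~ Gamma((nu + 1)/2, rate = (nu + r_i^2 / sigma2)/2), with r_i the residual
update.lambda <- function(Y, X, beta, sigma2, nu) {
  r <- Y - X %*% beta
  rgamma(length(Y), shape = (nu + 1) / 2, rate = (nu + r^2 / sigma2) / 2)
}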

Logistic regression

The most common model for binary data Y_i ∈ {0, 1} is logistic regression. Logistic regression is a linear model on the log odds of Prob(Y = 1),

Prob(Y_i = 1 | X_i) = exp(X_i^T β) / [1 + exp(X_i^T β)].

Then the log odds are

log[ Prob(Y_i = 1 | X_i) / Prob(Y_i = 0 | X_i) ] = X_i^T β.

The full conditionals are not conjugate, and so Metropolis sampling is required.
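A minimal random-walk Metropolis sketch for logistic regression (not from the notes; a flat prior on β is assumed, and can.sd is an illustrative tuning parameter):

# Log-posterior up to a constant (flat prior, so just the logistic log-likelihood)
log.post <- function(beta, Y, X) {
  eta <- X %*% beta
  sum(Y * eta - log(1 + exp(eta)))
}

metro.logit <- function(Y, X, iters = 10000, can.sd = 0.1) {
  p    <- ncol(X)
  beta <- rep(0, p)
  cur  <- log.post(beta, Y, X)
  keep <- matrix(0, iters, p)
  for (i in 1:iters) {
    for (j in 1:p) {                       # update one coefficient at a time
      can    <- beta
      can[j] <- rnorm(1, beta[j], can.sd)
      can.lp <- log.post(can, Y, X)
      if (log(runif(1)) < can.lp - cur) {  # accept with prob min(1, exp(can.lp - cur))
        beta <- can
        cur  <- can.lp
      }
    }
    keep[i, ] <- beta
  }
  keep
}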

Probit regression

Compared to logistic regression, probit regression is easier computationally but more difficult to interpret. In probit regression,

Prob(Y_i = 1) = Φ(X_i^T β),

where Φ is the standard normal CDF. There is no nice interpretation of β such as a log odds ratio. However, using auxiliary variables leads to conjugate updates.

Let Z_i be the latent/auxiliary variable with Z_i ~ N(X_i^T β, 1). Z_i relates to the observed data via Y_i = I(Z_i > 0). This is equivalent to probit regression because

Prob(Y_i = 1) = Prob(Z_i > 0) = 1 − Φ(−X_i^T β) = Φ(X_i^T β).

Probit regression

The full conditional distributions are

Z_i | rest ~ N(X_i^T β, 1), truncated to (0, ∞) if Y_i = 1 and to (−∞, 0] if Y_i = 0,

β | rest ~ N[(X^T X + I_p/σ_β²)^{-1} X^T Z, (X^T X + I_p/σ_β²)^{-1}],

where the prior is β ~ N(0, σ_β² I_p), as in the code on the next page (σ_β = sd.beta).

MCMC for probit regression

#Draw samples from a truncated normal:
rtnorm <- function(n, mu, sigma, lower, upper){
  l <- pnorm(lower, mu, sigma)
  u <- pnorm(upper, mu, sigma)
  p <- runif(n, l, u)
  y <- qnorm(p, mu, sigma)
  return(y)
}

probit <- function(y, x, sd.beta = 100,
                   iters = 10000, burn = 1000, update = 10){

  #Bookkeeping
  n    <- length(y)
  p    <- ncol(x)
  low  <- ifelse(y == 1, 0, -Inf)
  high <- ifelse(y == 1, Inf, 0)

  #Initial values
  z    <- y - .5
  beta <- rep(0, p)

  #Store samples here
  keep.beta <- matrix(0, iters, p)

  #Do some matrix manipulations offline
  cov.beta <- solve(t(x) %*% x + diag(p) / sd.beta^2)
  P1 <- cov.beta %*% t(x)
  P2 <- t(chol(cov.beta))

  #Let's go!
  for(i in 1:iters){

    #Update the latent probit variables, z:
    z <- rtnorm(n, x %*% beta, 1, low, high)

    #Update beta:
    beta <- P1 %*% z + P2 %*% rnorm(p)

    #Keep track of the samples
    keep.beta[i,] <- beta
  }

  list(beta = keep.beta)
}

Code is available at http://www4.stat.ncsu.edu/~reich/st740/code/probit.r.
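A hypothetical usage sketch (simulated data, illustrative settings) of the probit() function above:

set.seed(7)
n <- 200
x <- cbind(1, rnorm(n))                      # design matrix with an intercept
y <- ifelse(x %*% c(-0.5, 1) + rnorm(n) > 0, 1, 0)

fit <- probit(y, x, iters = 5000, burn = 1000)
colMeans(fit$beta[1001:5000, ])              # posterior means after discarding burn-in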

Probit regression

Probit regression is great for complicated binary regression problems because we can specify priors for the latent Gaussian variables, and Gaussian models are easy to work with.

For example, consider correlated binary data where Y_i = (Y_i1, Y_i2) and Y_ij ∈ {0, 1}. The probit regression model is Y_ij = I(Z_ij > 0), and then we can use a bivariate normal model for (Z_i1, Z_i2). The correlation between Z_i1 and Z_i2 induces correlation between Y_i1 and Y_i2.