Bayesian linear regression

Linear regression is the basis of most statistical modeling. The model is $Y_i = X_i^T\beta + \varepsilon_i$, where
- $Y_i$ is the continuous response,
- $X_i = (X_{i1}, \ldots, X_{ip})^T$ is the corresponding vector of known predictors,
- $\beta = (\beta_1, \ldots, \beta_p)^T$ are the regression parameters of interest,
- $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$ are the errors.

In matrix notation, $Y = X\beta + \varepsilon$, where Y is n x 1, X is n x p, β is p x 1, and $\varepsilon \sim N(0, \sigma^2 I_n)$.

We will discuss a Bayesian analysis of this model, including
- Priors for β
- Priors for σ²
- Connections with penalized regression
- Posterior predictive distributions
- Missing data
- Robust regression
- Generalized linear models
Priors for β, σ fixed

The Jeffreys prior is flat, $f(\beta) \propto 1$. The posterior is then
$$\beta \mid Y \sim N(\hat\beta_{OLS}, \Sigma_{OLS}),$$
where $\hat\beta_{OLS} = (X^TX)^{-1}X^TY$ and $\Sigma_{OLS} = \sigma^2(X^TX)^{-1}$.

The posterior is proper if and only if $X^TX$ is nonsingular.

With a flat prior, the estimates and inference are numerically identical to least squares, although the interpretation is not the same as least squares.
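As a quick illustrative sketch (not from the notes; the function name and inputs are hypothetical), posterior draws of β under the flat prior with σ² fixed are just multivariate normal draws centered at the OLS estimate:

#Posterior draws of beta under the flat prior, sigma^2 fixed (sketch)
flat.prior.draws <- function(Y,X,sigma2,S=5000){
   XtXinv   <- solve(t(X)%*%X)            #(X'X)^{-1}
   beta.ols <- XtXinv%*%t(X)%*%Y          #posterior mean = OLS estimate
   L        <- t(chol(sigma2*XtXinv))     #Cholesky factor of the posterior covariance
   p        <- ncol(X)
   draws    <- matrix(0,S,p)
   for(s in 1:S){
     draws[s,] <- beta.ols + L%*%rnorm(p) #draw from N(beta.ols, sigma2*(X'X)^{-1})
   }
return(draws)}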
Priors for β, σ fixed

Another common prior is the independent normal prior $\beta_j \overset{iid}{\sim} N(0, \sigma^2/\lambda)$.

The parameter λ can either be fixed or treated as unknown with a prior. The prior variance is proportional to σ² to account for the scale of the data.

The posterior is
$$\beta \mid Y \sim N\left[(X^TX + \lambda I_p)^{-1}X^TY,\ \sigma^2(X^TX + \lambda I_p)^{-1}\right].$$

The posterior mode is
$$\hat\beta_R = \arg\min_\beta\ (Y - X\beta)^T(Y - X\beta) + \lambda\sum_{j=1}^p \beta_j^2.$$

In penalized regression, this is known as the ridge regression solution. Ridge regression is used for stability in high-variance problems such as regression with collinearity or many predictors. It reduces variance by shrinking the coefficients towards zero.

The penalty parameter λ is called the ridge parameter, and is selected using BIC or cross-validation.
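As a sketch of the computation (the function name is illustrative; λ and σ² are taken as fixed), the posterior quantities are just a penalized version of the OLS formulas:

#Posterior mean (ridge estimate) and covariance under beta_j ~ N(0,sigma^2/lambda) (sketch)
ridge.post <- function(Y,X,sigma2,lambda){
   p <- ncol(X)
   A <- solve(t(X)%*%X + lambda*diag(p))  #(X'X + lambda*I)^{-1}
   list(mean = A%*%t(X)%*%Y,              #posterior mean = ridge solution
        cov  = sigma2*A)                  #posterior covariance
}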
Priors for β, σ fixed

Zellner's g-prior is
$$\beta \sim N\left[0,\ \frac{\sigma^2}{g}\left(X^TX\right)^{-1}\right].$$

This is a good automatic prior because σ² accounts for the scale of Y and $(X^TX)^{-1}$ adjusts for the scale of X.

The posterior is
$$\beta \mid Y \sim N\left[c\,\hat\beta_{OLS},\ c\,\sigma^2(X^TX)^{-1}\right], \quad c = \frac{1}{g+1}.$$

Therefore, c = 1/(g + 1) has a clear role as the shrinkage parameter.

Is it OK to have the prior depend on X? Is this empirical Bayes?

Under this prior, if X_i and X_j are positively correlated, then β_i and β_j will be negatively correlated. Is that a good idea?

How to pick g?
1. g ~ gamma gives a mixture of g-priors
2. g = 1/n is the unit information prior
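A small sketch under this parameterization (g and σ² fixed; the function name is illustrative): the posterior simply rescales the OLS estimate and covariance by c = 1/(g + 1):

#Posterior under Zellner's g-prior (sketch)
gprior.post <- function(Y,X,sigma2,g){
   XtXinv   <- solve(t(X)%*%X)
   beta.ols <- XtXinv%*%t(X)%*%Y
   c        <- 1/(g+1)              #shrinkage factor
   list(mean = c*beta.ols,          #OLS estimate shrunk towards zero
        cov  = c*sigma2*XtXinv)     #posterior covariance
}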
Priors for β, σ fixed

Double exponential priors have heavier tails than Gaussian priors, and are thus more appropriate if you believe many coefficients are near zero but a few are large.

The double exponential prior is $\beta_j \overset{iid}{\sim} DE(0, \lambda)$ with
$$p(\beta) \propto \exp\left(-\frac{\lambda}{2}\sum_{j=1}^p |\beta_j|\right).$$

The posterior mode is the LASSO solution!

This is a useful penalty because it stabilizes the estimates by shrinking them towards zero, but it also gives a solution with some coefficients exactly zero, β_j = 0, thus performing variable selection.
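The posterior mode (LASSO solution) can be computed with standard penalized regression software; a minimal sketch using the glmnet package, assuming a design matrix X and response Y are already loaded (note glmnet standardizes internally, so its lambda is not on exactly the same scale as the prior's λ):

library(glmnet)
#LASSO solution = posterior mode under the DE prior (sketch; lambda value is arbitrary here)
fit <- glmnet(X,Y,alpha=1,lambda=0.1)  #alpha=1 gives the L1 (LASSO) penalty
coef(fit)                              #some coefficients are exactly zero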
Priors for σ²

The most common prior for the error variance is σ² ~ InvGamma(a, b). The hyperparameters are often set to be small, such as a = b = 0.01.

The full conditional is
$$\sigma^2 \mid \beta, Y \sim \text{InvGamma}\left(a + \frac{n}{2},\ b + \frac{1}{2}(Y - X\beta)^T(Y - X\beta)\right).$$

If β has a flat prior, then marginally over β,
$$\sigma^2 \mid Y \sim \text{InvGamma}\left(a + \frac{n-p}{2},\ b + \frac{SSE}{2}\right),$$
where $SSE = (Y - X\hat\beta_{OLS})^T(Y - X\hat\beta_{OLS})$.

The posterior mean is
$$\hat\sigma^2_B = \frac{b + SSE/2}{a + (n-p)/2 - 1}.$$
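A minimal sketch of the corresponding Gibbs update for σ² given β (drawing the precision from a gamma and inverting; the function name is illustrative):

#Gibbs update for sigma^2 | beta, Y under an InvGamma(a,b) prior (sketch)
update.sigma2 <- function(Y,X,beta,a=0.01,b=0.01){
   n   <- length(Y)
   SSE <- sum((Y-X%*%beta)^2)          #residual sum of squares given beta
return(1/rgamma(1,a+n/2,b+SSE/2))}     #InvGamma(a+n/2, b+SSE/2) draw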
Marginal posterior of β

Assume β has a flat prior and σ² ~ InvGamma(a, b). The marginal posterior (derived in the handout) of β, integrating over σ², is multivariate t with
- $n - p + 2a$ degrees of freedom,
- location $\hat\beta_{OLS}$, and
- scale matrix $\hat\sigma^2_B (X^TX)^{-1}$.

What is the effect of accounting for uncertainty in σ²? What happens as a → 0 and b → 0? What if you assume a strong prior for σ²?
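One way to see the effect of accounting for uncertainty in σ² is composition sampling; a sketch (assuming beta.ols, XtXinv, SSE, n, p, a and b have already been computed) draws σ² from its marginal posterior and then β given σ²:

#Composition sampling from the marginal posterior of beta (sketch)
S      <- 5000
sigma2 <- 1/rgamma(S,a+(n-p)/2,b+SSE/2)                    #sigma^2 | Y
beta1  <- beta.ols[1] + sqrt(sigma2*XtXinv[1,1])*rnorm(S)  #beta_1 | sigma^2, Y at each draw
quantile(beta1,c(0.025,0.975))                             #wider than the plug-in interval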
Prediction

Let $X_p$ be a new covariate vector and $Y_p$ the corresponding response we wish to predict. We'll discuss prediction in the regression setting, but the concepts apply more generally.

Conditional (plug-in) prediction: $Y_p \sim N(X_p^T\hat\beta, \hat\sigma^2)$. This fails to account for uncertainty in β and σ².

To properly account for uncertainty, we want the posterior predictive distribution of $Y_p \mid Y$. This accounts for uncertainty in the model parameters θ = (β, σ²) because
$$f(Y_p \mid Y) = \int f(Y_p \mid \theta)\, p(\theta \mid Y)\, d\theta.$$

MCMC easily produces samples from the posterior predictive distribution: at each iteration m, draw $Y_p^{(m)} \sim N(X_p^T\beta^{(m)}, \sigma^{2(m)})$ using the current parameter draws; the collection of $Y_p^{(m)}$ is a sample from the posterior predictive distribution.
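A sketch of how this looks with MCMC output (assuming keep.beta is an S x p matrix of posterior draws of β, keep.sigma2 holds the corresponding draws of σ², and Xp is the new covariate vector):

#Posterior predictive draws for a new covariate vector Xp (sketch)
S  <- nrow(keep.beta)
Yp <- rnorm(S, mean=keep.beta%*%Xp, sd=sqrt(keep.sigma2))
hist(Yp)   #approximates the posterior predictive distribution of Y_p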
Missing data

Often either $Y_i$ or elements of $X_i$ are missing. MCMC easily handles this as well, assuming you are willing to specify a parametric model for the missing values.

The main idea is to treat the missing values as unknown parameters in the Bayesian model. Unknown parameters need priors, so missing $X_i$ must have priors such as $X_i \overset{iid}{\sim} N(\mu_X, \Sigma_X)$.

Of course, if the prior is way off, the results will be invalid. For example, if in reality the data are not missing at random, the Bayesian model will likely give bad results.

If specified correctly, the model will lead to inference on β that properly accounts for uncertainty about the missing data.

Hierarchical linear regression model with missing data:
$$Y_i \mid X_i, \beta, \sigma^2 \sim N(X_i^T\beta, \sigma^2)$$
$$X_i \mid \mu, \Sigma \sim N(\mu, \Sigma)$$
$$p(\beta) \propto 1$$
$$\sigma^2 \sim \text{InvGamma}(0.01, 0.01)$$
$$\mu \sim N(0, 100^2 I_p)$$
$$\Sigma \sim \text{InvWishart}(0.01, 0.01\, I_p)$$
Missing data

Overview of the Gibbs sampling algorithm: each missing value is updated from its full conditional distribution, along with the usual updates for β, σ², µ and Σ.

The full conditional of a missing $Y_i$ is simply its model distribution, $Y_i \mid \text{rest} \sim N(X_i^T\beta, \sigma^2)$.

The full conditional of a missing $X_i$ (taking the whole vector as missing) combines the prior $N(\mu, \Sigma)$ with the likelihood contribution of $Y_i$, giving
$$X_i \mid \text{rest} \sim N\left[V\left(\Sigma^{-1}\mu + \beta Y_i/\sigma^2\right),\ V\right], \quad V = \left(\Sigma^{-1} + \beta\beta^T/\sigma^2\right)^{-1}.$$

See the stacks handout.
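A sketch of the two imputation steps inside the Gibbs sampler, taking the whole vector X_i as missing (all variable names are illustrative):

#Impute a missing response Y_i given everything else (sketch)
Yi.mis <- rnorm(1,t(Xi)%*%beta,sqrt(sigma2))

#Impute a fully missing covariate vector X_i given everything else (sketch)
#Combine the prior X_i ~ N(mu,Sigma) with the likelihood Y_i ~ N(X_i'beta,sigma2)
V      <- solve(solve(Sigma) + beta%*%t(beta)/sigma2)  #full conditional covariance
m      <- V%*%(solve(Sigma)%*%mu + beta*Yi/sigma2)     #full conditional mean
Xi.mis <- m + t(chol(V))%*%rnorm(length(mu))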
Robust regression

So far we have only considered Gaussian errors. The resulting posterior for β is sensitive to outliers. To add robustness to outlying $Y_i$, you can use a residual distribution with heavier tails.

Double exponential errors: $\varepsilon_i \overset{iid}{\sim} DE(0, \sigma)$, which can be written as a scale mixture of normals with exponentially distributed latent scales.

Student t errors: $\varepsilon_i \overset{iid}{\sim} t_\nu(0, \sigma)$, which can be written as $\varepsilon_i \mid \lambda_i \sim N(0, \sigma^2/\lambda_i)$ with $\lambda_i \sim \text{Gamma}(\nu/2, \nu/2)$; the latent scales make the Gibbs updates conjugate (see the sketch below).
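A sketch of the extra Gibbs step under Student t errors with ν fixed (Y, X and the current values of β and σ² are assumed to be in hand):

#Gibbs update of the latent scales under Student t errors (sketch)
#Model: eps_i | lam_i ~ N(0,sigma2/lam_i), lam_i ~ Gamma(nu/2,nu/2)
r   <- Y - X%*%beta                              #current residuals
lam <- rgamma(length(Y),(nu+1)/2,(nu+r^2/sigma2)/2)
#beta and sigma2 are then updated by weighted versions of the usual Gaussian steps,
#with observation weights lam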
Logistic regression

The most common model for binary data $Y_i \in \{0, 1\}$ is logistic regression. Logistic regression is a linear model on the log odds of Prob(Y = 1),
$$\text{Prob}(Y_i = 1 \mid X_i) = \frac{\exp(X_i^T\beta)}{1 + \exp(X_i^T\beta)},$$
so the log odds are
$$\log\left[\frac{\text{Prob}(Y_i = 1 \mid X_i)}{1 - \text{Prob}(Y_i = 1 \mid X_i)}\right] = X_i^T\beta.$$

The full conditionals are not conjugate and so Metropolis sampling is required.
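A sketch of one random-walk Metropolis sweep over the coefficients (assuming a N(0, 100²) prior on each β_j and an arbitrary candidate standard deviation of 0.1; Y, X and the current beta are assumed to exist):

#Log posterior for logistic regression (sketch)
log.post <- function(beta,Y,X,sd.beta=100){
   eta <- X%*%beta
   sum(Y*eta - log(1+exp(eta))) + sum(dnorm(beta,0,sd.beta,log=TRUE))
}

#One Metropolis sweep, updating each coefficient in turn
for(j in 1:length(beta)){
   can    <- beta
   can[j] <- rnorm(1,beta[j],0.1)    #random-walk candidate
   if(log(runif(1)) < log.post(can,Y,X)-log.post(beta,Y,X)){
     beta <- can                     #accept the candidate
   }
}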
Probit regression

Compared to logistic regression, probit regression is easier computationally but more difficult to interpret.

In probit regression, $\text{Prob}(Y_i = 1) = \Phi(X_i^T\beta)$, where Φ is the standard normal CDF. There is no nice interpretation of β such as a log odds ratio. However, using auxiliary variables leads to conjugate updates.

Let $Z_i$ be the latent/auxiliary variable with $Z_i \sim N(X_i^T\beta, 1)$. $Z_i$ relates to the observed data via $Y_i = I(Z_i > 0)$. This is equivalent to probit regression because
$$\text{Prob}(Y_i = 1) = \text{Prob}(Z_i > 0) = 1 - \Phi(-X_i^T\beta) = \Phi(X_i^T\beta).$$
Probit regression

The full conditional distributions are
$$Z_i \mid \text{rest} \sim N(X_i^T\beta, 1) \ \text{truncated to } (0, \infty) \text{ if } Y_i = 1 \text{ and to } (-\infty, 0] \text{ if } Y_i = 0,$$
$$\beta \mid \text{rest} \sim N\left[(X^TX + I_p/\sigma_\beta^2)^{-1}X^TZ,\ (X^TX + I_p/\sigma_\beta^2)^{-1}\right],$$
assuming the prior $\beta \sim N(0, \sigma_\beta^2 I_p)$, as in the code on the next page.
MCMC for probit regression

#Draw samples from a truncated normal:
rtnorm <- function(n,mu,sigma,lower,upper){
   l <- pnorm(lower,mu,sigma)
   u <- pnorm(upper,mu,sigma)
   p <- runif(n,l,u)
   y <- qnorm(p,mu,sigma)
return(y)}

probit <- function(y,x,sd.beta=100,
                   iters=10000,burn=1000,update=10){

   #Bookkeeping
   n    <- length(y)
   p    <- ncol(x)
   low  <- ifelse(y==1,0,-Inf)
   high <- ifelse(y==1,Inf,0)

   #Initial values
   z    <- y-.5
   beta <- rep(0,p)

   #Store samples here
   keep.beta <- matrix(0,iters,p)

   #Do some matrix manipulations offline
   cov.beta <- solve(t(x)%*%x+diag(p)/sd.beta^2)
   P1       <- cov.beta%*%t(x)
   P2       <- t(chol(cov.beta))

   #Let's go!
   for(i in 1:iters){

     #Update the latent probit variables, z:
     z <- rtnorm(n,x%*%beta,1,low,high)

     #Update beta:
     beta <- P1%*%z+P2%*%rnorm(p)

     #Keep track of the samples
     keep.beta[i,] <- beta
   }

list(beta=keep.beta)}

Code is available at http://www4.stat.ncsu.edu/~reich/st740/code/probit.r.
Probit regression

Probit regression is great for complicated binary regression problems because we can specify priors for the latent Gaussian variables, and Gaussian models are easy to work with.

For example, consider correlated binary data where $Y_i = (Y_{i1}, Y_{i2})$ and $Y_{ij} \in \{0, 1\}$. The probit regression model is $Y_{ij} = I(Z_{ij} > 0)$, and we can then use a bivariate normal model for $(Z_{i1}, Z_{i2})$. The correlation between $Z_{i1}$ and $Z_{i2}$ induces correlation between $Y_{i1}$ and $Y_{i2}$.