Bayesian Model Averaging

Readings: Hoff Chapter 9; Hoeting et al. (1999); Clyde & George (2004); Liang et al. (2008)

October 24, 2017
Bayesian Model Choice

Models for the variable selection problem are based on subsets of the variables $X_1, \ldots, X_p$. Encode models with a vector $\gamma = (\gamma_1, \ldots, \gamma_p)$, where $\gamma_j \in \{0, 1\}$ is an indicator for whether variable $X_j$ is included in model $\mathcal{M}_\gamma$; $\gamma_j = 0 \Leftrightarrow \beta_j = 0$. Each value of $\gamma$ represents one of the $2^p$ models.

Under model $\mathcal{M}_\gamma$:
$$Y \mid \alpha, \beta, \sigma^2, \gamma \sim \mathsf{N}(\mathbf{1}\alpha + X_\gamma \beta_\gamma,\ \sigma^2 I)$$
where $X_\gamma$ is the design matrix using the columns of $X$ where $\gamma_j = 1$ and $\beta_\gamma$ is the subset of $\beta$ that is non-zero.
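A small sketch (ours, not from the slides) makes the encoding concrete by enumerating all $2^p$ inclusion vectors $\gamma$ for $p = 3$:

p = 3
gamma = as.matrix(expand.grid(rep(list(0:1), p)))  # all 2^p inclusion vectors
colnames(gamma) = paste0("gamma", 1:p)
nrow(gamma)  # 2^3 = 8 models
gamma        # each row encodes one model M_gamma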
Posterior Probabilities of Models

Posterior model probabilities:
$$p(\mathcal{M}_j \mid Y) = \frac{p(Y \mid \mathcal{M}_j)\, p(\mathcal{M}_j)}{\sum_j p(Y \mid \mathcal{M}_j)\, p(\mathcal{M}_j)}$$

The marginal likelihood of a model is obtained by integrating the parameters out of the likelihood:
$$p(Y \mid \mathcal{M}_\gamma) = \iiint p(Y \mid \alpha, \beta_\gamma, \sigma^2)\, p(\beta_\gamma \mid \gamma, \sigma^2)\, p(\alpha, \sigma^2 \mid \gamma)\, d\beta_\gamma\, d\alpha\, d\sigma^2$$

Bayes factor: $BF[i : j] = p(Y \mid \mathcal{M}_i) / p(Y \mid \mathcal{M}_j)$

$$\frac{P(\mathcal{M}_i \mid Y)}{P(\mathcal{M}_j \mid Y)} = \frac{p(Y \mid \mathcal{M}_i)}{p(Y \mid \mathcal{M}_j)} \times \frac{P(\mathcal{M}_i)}{P(\mathcal{M}_j)}$$

Posterior odds = Bayes factor $\times$ prior odds

Probability $\beta_j \neq 0$: $\sum_{\mathcal{M}_\gamma : \beta_j \neq 0} p(\mathcal{M}_\gamma \mid Y)$ (marginal posterior inclusion probability)
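As a hedged illustration (made-up numbers, not USair output), posterior model probabilities can be computed stably from log marginal likelihoods:

logmarg = c(M1 = -10.2, M2 = -8.7, M3 = -9.5)  # hypothetical log p(Y | M_j)
prior = rep(1/3, 3)                            # uniform prior p(M_j)
logpost = logmarg + log(prior)
logpost = logpost - max(logpost)               # stabilize before exponentiating
postprob = exp(logpost) / sum(exp(logpost))    # p(M_j | Y)
round(postprob, 3)
exp(logmarg["M2"] - logmarg["M1"])             # Bayes factor BF[2:1]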
Prior Distributions

Bayesian model choice requires proper prior distributions on the regression coefficients (the exception being parameters that are included in all models); otherwise the marginal likelihoods are ill-determined. Vague but proper priors may lead to paradoxes! Conjugate Normal-Gamma priors lead to closed-form expressions for marginal likelihoods; Zellner's g-prior is one of the most popular.
Zellner's g-prior within Models

Centered model: $Y = \mathbf{1}_n \alpha + X^c_\gamma \beta_\gamma + \epsilon$

Common parameters: $p(\alpha, \phi) \propto \phi^{-1}$

Model-specific parameters: $\beta_\gamma \mid \alpha, \phi, \gamma \sim \mathsf{N}\!\left(0,\ g \phi^{-1} (X^{cT}_\gamma X^c_\gamma)^{-1}\right)$

The marginal likelihood of $\mathcal{M}_\gamma$ is
$$p(Y \mid \mathcal{M}_\gamma) = C\, (1 + g)^{(n - p_\gamma - 1)/2}\, (1 + g(1 - R^2_\gamma))^{-(n - 1)/2}$$
where $R^2_\gamma$ is the usual $R^2$ for model $\mathcal{M}_\gamma$ and $C$ is the same constant for all models, proportional to $p(Y \mid \mathcal{M}_0)$ (the model with the intercept alone).

Uniform distribution over the space of models: $p(\mathcal{M}_\gamma) = 1/2^p$
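A minimal sketch (the function name bf.gprior is ours, not part of BAS) turning this marginal likelihood into the implied Bayes factor against $\mathcal{M}_0$:

# BF[M_gamma : M_0] under Zellner's g-prior with p predictors in M_gamma
bf.gprior = function(R2, n, p, g)
  (1 + g)^((n - p - 1) / 2) * (1 + g * (1 - R2))^(-(n - 1) / 2)

bf.gprior(R2 = 0.5, n = 41, p = 3, g = 41)  # USair-sized example with g = n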
USair Data: Enumeration of All Models

# library(devtools)
# suppressMessages(install_github("merliseclyde/BAS"))
library(BAS)
poll.bma = bas.lm(log(so2) ~ temp + log(firms) + log(popn) +
                    wind + precip + rain,
                  data = usair,
                  prior = "g-prior",
                  alpha = 41,               # g = n
                  n.models = 2^7,           # enumerate (can omit)
                  modelprior = uniform(),
                  method = "deterministic") # fast enumeration
Problem with the g-prior with arbitrary g

Consider the Bayes factor for comparing $\mathcal{M}_\gamma$ to the null model, with $g$ a fixed constant and $n$ fixed, and let
$$F = \frac{R^2_\gamma / p_\gamma}{(1 - R^2_\gamma)/(n - 1 - p_\gamma)}$$
be the usual $F$ statistic for comparing model $\mathcal{M}_\gamma$ to $\mathcal{M}_0$.

- As $R^2_\gamma \to 1$, $F \to \infty$, and the LR test would reject $H_0$
- The Bayes factor goes to $(1 + g)^{(n - p_\gamma - 1)/2}$ as $F \to \infty$, i.e. it remains bounded for fixed $g$, $n$, and $p_\gamma$
- Bayes and frequentist conclusions do not agree in this limit: the information paradox
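The bound is easy to see numerically; a short sketch reusing the bf.gprior function defined above:

n = 41; p = 3; g = 41
sapply(c(0.9, 0.99, 0.999, 0.9999), bf.gprior, n = n, p = p, g = g)
(1 + g)^((n - p - 1) / 2)  # the finite limit the BF approaches as R^2 -> 1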
Resolution of the Paradox

Liang et al. (2008) show that the paradox can be resolved with mixtures of g-priors:
$$p(\beta_\gamma \mid \phi) = \int_0^\infty \mathsf{N}\!\left(\beta_\gamma;\ 0,\ g (X_\gamma^T X_\gamma)^{-1}/\phi\right) p(g)\, dg$$

- $BF \to \infty$ as $R^2 \to 1$ provided $E_g[(1 + g)^{(n - 1 - p_\gamma)/2}]$ diverges
- Zellner-Siow Cauchy prior: $1/g \sim \mathsf{Gamma}(1/2, n/2)$
- Hyper-g: $p(g) \propto (1 + g)^{-a/2}$ for $2 < a \le 3$, equivalently $\frac{g}{1+g} \sim \mathsf{Beta}(1, \frac{a}{2} - 1)$
- Hyper-g/n: $\frac{g/n}{1 + g/n} \sim \mathsf{Beta}(1, \frac{a}{2} - 1)$
- Robust prior (Bayarri et al., Annals of Statistics, 2012)
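A hedged numerical check (our own sketch; the Zellner-Siow density $p(g) = \sqrt{n/(2\pi)}\, g^{-3/2} e^{-n/(2g)}$ is written out by hand): under this mixture the Bayes factor keeps growing as $R^2 \to 1$, unlike the fixed-g case.

# BF[M_gamma : M_0] under the Zellner-Siow prior, 1/g ~ Gamma(1/2, n/2)
bf.zs = function(R2, n, p) {
  integrand = function(g)
    exp((n - p - 1) / 2 * log1p(g) -
        (n - 1) / 2 * log1p(g * (1 - R2)) +                   # fixed-g BF, on log scale
        0.5 * log(n / (2 * pi)) - 1.5 * log(g) - n / (2 * g)) # log ZS density
  integrate(integrand, lower = 0, upper = Inf)$value
}
n = 41; p = 3
sapply(c(0.9, 0.99, 0.999), bf.zs, n = n, p = p)  # grows without bound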
Example

library(BAS)
poll.zs = bas.lm(log(so2) ~ temp + log(firms) + log(popn) +
                   wind + precip + rain,
                 data = usair,
                 prior = "JZS",            # Jeffreys-Zellner-Siow
                 n.models = 2^7,           # enumerate (can omit)
                 modelprior = uniform(),
                 method = "deterministic") # fast enumeration

Use prior = "hyper-g" with alpha = 3 for the hyper-g prior, or prior = "hyper-g-n" with alpha = 3 for hyper-g/n.
plot(poll.zs, which = 4)

[Figure: Inclusion Probabilities — posterior marginal inclusion probabilities (0 to 1) for Intercept, temp, log(firms), log(popn), wind, precip, and rain from bas.lm(log(so2) ~ temp + log(firms) + log(popn) + wind + precip + rain)]
Bayesian Model Averaging

The posterior for $\mu = \mathbf{1}\alpha + X\beta$ is a mixture distribution:
$$p(\mu \mid Y) = \sum_\gamma p(\mu \mid Y, \mathcal{M}_\gamma)\, p(\mathcal{M}_\gamma \mid Y)$$
with expectation expressed as a weighted average:
$$E[\mu \mid Y] = \mathbf{1}\hat{\alpha} + \sum_\gamma X\, E[\beta \mid Y, \mathcal{M}_\gamma]\, p(\mathcal{M}_\gamma \mid Y)$$

Predictive distribution for $Y^*$:
$$p(Y^* \mid Y) = \sum_\gamma p(Y^* \mid Y, \mathcal{M}_\gamma)\, p(\mathcal{M}_\gamma \mid Y)$$

Posterior distribution of $\beta_j$:
$$p(\beta_j \mid Y) = p(\gamma_j = 0 \mid Y)\, \delta_0(\beta_j) + \sum_\gamma p(\beta_j \mid Y, \mathcal{M}_\gamma)\, \gamma_j\, p(\mathcal{M}_\gamma \mid Y)$$
Estimator

Find $\hat{\mu}$ that minimizes posterior expected loss:
$$E[(\mu - \hat{\mu})^T (\mu - \hat{\mu}) \mid Y]$$
The solution is the posterior mean under BMA:
$$E[\mu \mid Y] = \mathbf{1}\hat{\alpha} + \sum_\gamma X\, E[\beta \mid Y, \mathcal{M}_\gamma]\, p(\mathcal{M}_\gamma \mid Y)$$

- If one model has posterior probability 1, then BMA is equivalent to using the highest posterior probability model
- Otherwise BMA incorporates estimates from other models when there is substantial model uncertainty
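A brief sketch (assuming the poll.zs fit from the earlier example is available): BAS returns the BMA fit directly, and the mixture weights in the average are the stored posterior model probabilities.

mu.bma = predict(poll.zs, estimator = "BMA")$fit   # posterior mean of mu under BMA
head(sort(poll.zs$postprobs, decreasing = TRUE))   # largest weights p(M_gamma | Y)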
Coefficients under BMA

beta.zs = coef(poll.zs)
beta.zs

##
##  Marginal Posterior Summaries of Coefficients:
##
##  Using BMA
##
##  Based on the top 64 models
##              post mean  post SD   post p(B != 0)
## Intercept     3.153004  0.082496  1.000000
## temp         -0.054355  0.019464  0.974978
## log(firms)    0.188818  0.167961  0.742681
## log(popn)    -0.031011  0.163396  0.330113
## wind         -0.118926  0.082304  0.794158
## precip        0.010030  0.010754  0.620004
## rain          0.001675  0.003843  0.349521
Posterior of Coefficients under BMA

[Figure: marginal posterior densities of the coefficients under BMA for temp, log(firms), log(popn), wind, precip, and rain]
Credible Intervals for Coefficients under BMA

plot(confint(beta.zs, parm = 2:7))

[Figure: credible intervals for the six coefficients $\beta$ under BMA]
Selection and Model Uncertainty

Select a model and $\hat{\mu}$ that minimize posterior expected loss:
$$E[(\mu - \hat{\mu})^T (\mu - \hat{\mu}) \mid Y]$$

- BMA is the best estimator when no selection is required
- If a single model must be selected, the best model and estimator is the posterior mean under the model that is closest to BMA under squared error loss:
$$(\hat{\mu}_{BMA} - \hat{\mu}_{\mathcal{M}_\gamma})^T (\hat{\mu}_{BMA} - \hat{\mu}_{\mathcal{M}_\gamma})$$
- This model often contains more predictors than the HPM (highest probability model) or the Median Probability Model
Best Predictive Model

# BPM
BPM = predict(poll.zs, estimator = "BPM")
BPM$bestmodel

## [1] 0 1 2 4 5 6

(poll.zs$namesx[attr(BPM$fit, 'model') + 1])[-1]

## [1] "temp"       "log(firms)" "wind"       "precip"

# HPM
HPM = predict(poll.zs, estimator = "HPM")
HPM$bestmodel

## [1] 0 1 2 4 5
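To check numerically that the BPM fit is closest to BMA (a sketch; estimator = "MPM" requests the Median Probability Model):

fits = sapply(c("HPM", "MPM", "BPM"),
              function(est) predict(poll.zs, estimator = est)$fit)
mu.bma = predict(poll.zs, estimator = "BMA")$fit
colSums((fits - mu.bma)^2)  # squared distance of each fit from the BMA fit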
[Figure: pairwise scatterplots and correlations of fitted values under HPM, MPM, BPM, and BMA; correlations are 1 among HPM, MPM, and BPM, and approximately 0.99 between each of these and BMA]
Summary

- BMA has been shown in practice to have better out-of-sample predictions than selection (in many cases)
- Avoids selecting a single model and accounts for model uncertainty
- If one model dominates, BMA is very close to selection (asymptotically it will put probability one on the model that is closest to the true model)
- MCMC allows one to implement BMA without enumerating all models
- BMA depends on the priors on the coefficients, the variance, and the models (sensitivity to the choice?)
- Mixtures of g-priors are preferred to the usual g-prior, but one can use g = n