Model Choice. Readings: Hoff Chapter 9; Clyde & George, "Model Uncertainty" (Statistical Science); Hoeting et al., "Bayesian Model Averaging" (Statistical Science). October 27, 2015


Topics
- Variable Selection / Model Choice
- Stepwise Methods
- Model Selection Criteria
- Model Averaging

Variable Selection
Reasons for reducing the number of variables in the model:
- Philosophical
  - Avoid the use of redundant variables (problems with interpretation)
  - KISS
  - Occam's Razor
- Practical
  - Inclusion of unnecessary terms yields less precise estimates, particularly if explanatory variables are highly correlated with each other
  - It is too expensive to use all variables

Variable Selection Procedures
- Stepwise Regression (Forward, Stepwise, Backward): add/delete variables until all t-statistics are significant (easy, but not recommended)
- Select variables with non-zero coefficients from the Lasso
- Select variables where the shrinkage coefficient is less than 0.5
- Use a Model Selection Criterion to pick the best model (see the R sketch below):
  - R² (picks the largest model)
  - Adjusted R²
  - Mallow's Cp: C_p = SSE_m / σ̂²_Full + 2 p_m − n
  - AIC (Akaike Information Criterion), proportional to Cp for linear models
  - BIC(m) (Bayes Information Criterion): n log(σ̂²_m) + log(n) p_m
- These criteria trade off model complexity (number of coefficients p_m) against goodness of fit (σ̂²_m)
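
As a concrete illustration of these criteria, a minimal R sketch comparing a submodel to the full model; the data frame df and predictors x1, x2, x3 are hypothetical stand-ins, and σ̂²_Full is taken from the full model as in the Cp formula above.

full <- lm(y ~ x1 + x2 + x3, data = df)      # hypothetical full model
sub  <- lm(y ~ x1 + x2, data = df)           # candidate submodel m
n       <- nrow(df)
p.m     <- length(coef(sub))                 # number of coefficients p_m
sse.m   <- sum(residuals(sub)^2)             # SSE_m
s2.full <- summary(full)$sigma^2             # sigma-hat^2 from the full model
Cp <- sse.m / s2.full + 2 * p.m - n          # Mallow's Cp
c(Cp = Cp, adjR2 = summary(sub)$adj.r.squared, AIC = AIC(sub), BIC = BIC(sub))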

Model Selection
Selection of a single model has the following problems:
- When the criteria suggest that several models are equally good, what should we report? Do we still pick only one model?
- What do we report for our uncertainty after selecting a model?
- The typical analysis ignores model uncertainty!
- Winner's Curse

Bayesian Model Choice
- Models for the variable selection problem are based on subsets of the variables X_1, ..., X_p
- Encode models with a vector γ = (γ_1, ..., γ_p), where γ_j ∈ {0, 1} indicates whether variable X_j is included in model M_γ; γ_j = 0 ⇔ β_j = 0
- Each value of γ represents one of the 2^p models
- Under model M_γ: Y | β, σ², γ ~ N(X_γ β_γ, σ² I), where X_γ is the design matrix using the columns of X with γ_j = 1 and β_γ is the subset of β that is non-zero
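
For small p, the 2^p models can be enumerated directly from the indicator vectors γ; a minimal sketch, assuming a hypothetical numeric response y and numeric predictor matrix X with named columns:

p <- ncol(X)
gamma.grid <- expand.grid(rep(list(0:1), p))         # each row is one γ = (γ_1, ..., γ_p)
all.fits <- apply(gamma.grid, 1, function(gamma) {
  if (sum(gamma) == 0) return(lm(y ~ 1))             # null model: all γ_j = 0, intercept only
  lm(y ~ ., data = data.frame(y = y, X[, gamma == 1, drop = FALSE]))
})
length(all.fits)                                      # 2^p fitted models M_γ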

Bayesian Model Averaging
- Rather than using a single model, BMA uses all (or potentially very many) models, but weights model predictions by their posterior probabilities (a measure of how much each model is supported by the data)
- Posterior model probabilities: p(M_j | Y) = p(Y | M_j) p(M_j) / Σ_j p(Y | M_j) p(M_j)
- The marginal likelihood of a model is proportional to p(Y | M_γ) = ∫∫ p(Y | β_γ, σ²) p(β_γ | γ, σ²) p(σ² | γ) dβ_γ dσ²
- Probability that β_j ≠ 0: Σ_{M_j : β_j ≠ 0} p(M_j | Y) (the marginal inclusion probability)
- Predictions: Ŷ = Σ_j p(M_j | Y) Ŷ_{M_j} (see the sketch below)
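
A minimal sketch of these weights and predictions, using the common approximation p(Y | M_j) ≈ exp(−BIC_j/2) in place of the exact marginal likelihood, with equal prior model probabilities; the formulas m1-m3 and data frame df are hypothetical:

models <- list(m1 = y ~ x1, m2 = y ~ x1 + x2, m3 = y ~ x1 + x2 + x3)
fits <- lapply(models, lm, data = df)
bic  <- sapply(fits, BIC)
post.prob <- exp(-0.5 * (bic - min(bic)))     # subtract min(bic) for numerical stability
post.prob <- post.prob / sum(post.prob)       # p(M_j | Y) under equal prior model probabilities
yhat <- sapply(fits, fitted)                  # n x (number of models) matrix of Ŷ_{M_j}
yhat.bma <- yhat %*% post.prob                # BMA prediction: Σ_j p(M_j | Y) Ŷ_{M_j}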

Prior Distributions
- Bayesian model choice requires proper prior distributions on parameters that are not common across models
- Vague but proper priors may lead to paradoxes!
- Conjugate Normal-Gamma priors lead to closed-form expressions for marginal likelihoods; Zellner's g-prior is the most popular.

Zellner's g-prior
Centered model: Y = 1_n α + X^c β + ε, where X^c = (I − P_{1_n}) X is the centered design matrix in which every variable has had its mean subtracted.
- p(α) ∝ 1
- p(σ²) ∝ 1/σ²
- β_γ | α, σ², γ ~ N(0, g σ² (X^c_γ' X^c_γ)^{-1})
This leads to a marginal likelihood of M_γ that is proportional to
p(Y | M_γ) = C (1 + g)^((n − p_γ − 1)/2) / (1 + g(1 − R²_γ))^((n − 1)/2)
where R²_γ is the usual R² for model M_γ (a small computational sketch follows).
- The marginal likelihood trades off model complexity against goodness of fit.
- Lastly, assign a prior distribution to the space of models (Uniform, or Beta-Binomial on the model size).
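
The marginal likelihood above depends on the data only through R²_γ and p_γ, so it can be computed directly; a sketch of the log marginal (up to the common constant C), with g = n as one conventional default:

log.marg.g <- function(R2, p.gamma, n, g = n) {
  0.5 * (n - p.gamma - 1) * log(1 + g) - 0.5 * (n - 1) * log(1 + g * (1 - R2))
}
# Example (hypothetical numbers): a 3-predictor model with R² = 0.60 versus
# a 5-predictor model with R² = 0.62 when n = 41.
log.marg.g(0.60, 3, n = 41)
log.marg.g(0.62, 5, n = 41)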

USair Data
library(BAS)
poll.bma = bas.lm(log(SO2) ~ temp + log(mfgfirms) + log(popn) + wind + precip + raindays,
                  data = pollution,
                  prior = "g-prior", alpha = 41,   # g = n
                  modelprior = uniform(),          # alternative: beta.binomial(1, ...)
                  n.models = 2^6, update = 50, initprobs = "Uniform")
par(mfrow = c(2, 2))
plot(poll.bma, ask = FALSE)
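
A quick way to inspect the fit beyond the default plots (a brief usage sketch; output not shown):

summary(poll.bma)   # the highest posterior probability models and their weights
coef(poll.bma)      # model-averaged (BMA) coefficient summaries, used again below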

Plots
[Figure: default BAS diagnostic plots for poll.bma -- Residuals vs Fitted (predictions under BMA), cumulative Model Probabilities by search order, Model Complexity (log marginal vs model dimension), and Marginal Inclusion Probabilities for Intercept, temp, log(mfgfirms), log(popn), wind, precip, raindays.]

Posterior Distribution with Uniform Prior on Model Space
image(poll.bma)
[Figure: image of the model space -- the top 20 models ranked by posterior probability versus the variables (Intercept, temp, log(mfgfirms), log(popn), wind, precip, raindays), shaded by log posterior odds.]

Posterior Distribution with BB(1,p) Prior on Model Space
image(poll.bb.bma)
[Figure: model space image under the Beta-Binomial(1, p) prior on model size, same layout as above.]

Jeffreys Scale of Evidence, B = BF[H_0 : H_a]

Bayes Factor               Interpretation
B ≥ 1                      H_0 supported
1 > B ≥ 10^(-1/2)          minimal evidence against H_0
10^(-1/2) > B ≥ 10^(-1)    substantial evidence against H_0
10^(-1) > B ≥ 10^(-2)      strong evidence against H_0
10^(-2) > B                decisive evidence against H_0

(in the context of testing one hypothesis with equal prior odds)
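
A small helper sketch that maps a Bayes factor B = BF[H_0 : H_a] onto these categories:

jeffreys <- function(B) {
  cut(B,
      breaks = c(-Inf, 10^-2, 10^-1, 10^(-1/2), 1, Inf), right = FALSE,
      labels = c("decisive evidence against H0", "strong evidence against H0",
                 "substantial evidence against H0", "minimal evidence against H0",
                 "H0 supported"))
}
jeffreys(c(2, 0.4, 0.15, 0.05, 0.005))   # one value from each row of the table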

Coefficients
beta = coef(poll.bma)
par(mfrow = c(2, 3)); plot(beta, subset = 2:7, ask = FALSE)
[Figure: marginal posterior distributions of the coefficients for temp, log(mfgfirms), log(popn), wind, precip, and raindays.]

Problem with g Prior
The Bayes factor for comparing M_γ to the null model is
BF(M_γ : M_0) = (1 + g)^((n − 1 − p_γ)/2) / (1 + g(1 − R²_γ))^((n − 1)/2)
Let g be a fixed constant and take n fixed, and let
F = (R²_γ / p_γ) / ((1 − R²_γ) / (n − 1 − p_γ))
be the usual F statistic for comparing model M_γ to M_0. As R²_γ → 1, F → ∞ and the F (LR) test would reject H_0 at any fixed level, yet BF(M_γ : M_0) converges to the finite constant (1 + g)^((n − 1 − p_γ)/2): with fixed g the Bayes factor cannot give unbounded support to M_γ however strong the evidence.
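
A short numerical illustration of this behaviour, with arbitrary illustrative values of n, p_γ, and g:

# Fixed g: the Bayes factor levels off as R² -> 1, while the F statistic diverges.
n <- 30; p <- 3; g <- 30
R2 <- c(0.9, 0.99, 0.999, 0.9999)
BF <- (1 + g)^((n - 1 - p)/2) * (1 + g * (1 - R2))^(-(n - 1)/2)
Fstat <- (R2 / p) / ((1 - R2) / (n - 1 - p))
cbind(R2, BF, Fstat)   # BF approaches (1 + g)^((n - 1 - p)/2); Fstat grows without bound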

Mortality & Pollution
- Data from Statistical Sleuth 12.17
- 60 cities
- Response: Mortality
- Measures of HC, NOX, SO2
- Is pollution associated with mortality after adjusting for other socio-economic and meteorological factors?
- 15 predictor variables implies 2^15 = 32,768 possible models
- Use the Zellner-Siow Cauchy prior: 1/g ~ G(1/2, n/2)

mort.bma = bas.lm(mortality ~ ., data = mortality,
                  prior = "ZS-null", alpha = 60,
                  n.models = 2^15, update = 100, initprobs = "eplogp")

Posterior Distributions
[Figure: BAS diagnostic plots for mort.bma -- Residuals vs Fitted (predictions under BMA), cumulative Model Probabilities by search order, Model Complexity (log marginal vs model dimension), and Marginal Inclusion Probabilities for Intercept, Precip, Humidity, JanTemp, JulyTemp, Over65, House, Educ, Sound, Density, NonWhite, WhiteCol, Poor, logHC, logNOX, logSO2.]

Posterior Probabilities
- What is the probability that there is no pollution effect?
- Sum the posterior model probabilities over all models that include at least one pollution variable:

> which.mat = list2matrix.which(mort.bma, 1:(2^15))
> poll.in = (which.mat[, 14:16] %*% rep(1, 3)) > 0
> sum(poll.in * mort.bma$postprob)
[1] 0.9889641

- Posterior probability of no effect is 0.011
- Odds that there is an effect: (1 − .011)/(.011) = 89.9
- Prior odds: (1 − .5³)/.5³ = 7
- Bayes Factor for a pollution effect: 89.9/7 = 12.8
- Bayes Factor for a NOX effect, based on its marginal inclusion probability: 0.917/(1 − 0.917) = 11.0
- Marginal inclusion probability for loghc = 0.4271; BF = 0.75
- Marginal inclusion probability for logso2 = 0.2189; BF = 0.28
- Note: Bayes Factors are not additive!
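
The prior odds of 7 come from independent prior inclusion probabilities of 1/2 for each of the three pollution variables under the uniform prior over models; a sketch of the arithmetic:

post.p  <- 0.9889641                   # P(at least one pollution variable in the model | Y)
prior.p <- 1 - 0.5^3                   # P(at least one of the 3 in) under the uniform model prior
post.odds  <- post.p / (1 - post.p)    # ≈ 89.6 (89.9 with the rounded 0.011 above)
prior.odds <- prior.p / (1 - prior.p)  # = 7
post.odds / prior.odds                 # Bayes factor for a pollution effect, ≈ 12.8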

Model Space
[Figure: image of the model space for mort.bma -- the top 20 models ranked by posterior probability versus the variables Intercept, Precip, Humidity, JanTemp, JulyTemp, Over65, House, Educ, Sound, Density, NonWhite, WhiteCol, Poor, logHC, logNOX, logSO2, shaded by log posterior odds.]

Coefficients
[Figure: marginal posterior distributions of the coefficients for Intercept, Precip, Humidity, JanTemp, JulyTemp, Over65, House, and Educ.]

Coefficients
[Figure: marginal posterior distributions of the coefficients for Sound, Density, NonWhite, WhiteCol, Poor, logHC, logNOX, and logSO2.]

Effect Estimation
- Coefficients in each model are adjusted for the other variables in that model
- OLS: leave out a predictor with a non-zero coefficient and the estimates are biased!
- Model selection in the presence of high correlation may leave out redundant variables; improved MSE for prediction (bias-variance tradeoff)
- Bayes is biased anyway, so should we care?
- What is the meaning of Σ_γ β_{jγ} γ_j P(M_γ | Y)?
- Problem with confounding! Need to change the prior?
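
The quantity in question is the BMA posterior mean of β_j, which mixes the within-model estimates with zeros for models that exclude X_j; a toy sketch reusing fits and post.prob from the BIC-approximation sketch above (hypothetical predictor x2):

beta.x2 <- sapply(fits, function(f) if ("x2" %in% names(coef(f))) coef(f)[["x2"]] else 0)
sum(beta.x2 * post.prob)   # Σ_γ β̂_{jγ} γ_j P(M_γ | Y): shrunk toward 0 by models that drop x2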

Other Problems
- Computational
  - If p > 35, enumeration is difficult
  - Gibbs sampler or random-walk algorithm on γ
  - Poor convergence/mixing with high correlations
  - Metropolis-Hastings algorithms give more flexibility
  - Stochastic search (no guarantee the samples represent the posterior)
- In BMA all variables are included, but coefficients are shrunk to 0; an alternative is to use shrinkage methods
- Prior choice: choice of prior distributions on β and on γ
- Model averaging versus model selection: what are the objectives?
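
When enumeration is infeasible, one option is a random-walk Metropolis-Hastings sampler directly on γ, accepting a single-indicator flip with probability given by the ratio of g-prior marginal likelihoods (flat prior over models). A minimal self-contained sketch, assuming a hypothetical response y and centered numeric predictor matrix X:

mh.gamma <- function(y, X, n.iter = 5000, g = length(y)) {
  n <- length(y); p <- ncol(X)
  log.marg <- function(gamma) {                       # log p(Y | M_gamma), up to a constant
    pg <- sum(gamma)
    R2 <- if (pg == 0) 0 else summary(lm(y ~ X[, gamma == 1, drop = FALSE]))$r.squared
    0.5 * (n - pg - 1) * log(1 + g) - 0.5 * (n - 1) * log(1 + g * (1 - R2))
  }
  gamma <- rep(0, p); cur.val <- log.marg(gamma)      # start from the null model
  draws <- matrix(NA, n.iter, p)
  for (t in 1:n.iter) {
    j <- sample(p, 1)                                 # propose flipping one indicator gamma_j
    prop <- gamma; prop[j] <- 1 - prop[j]
    prop.val <- log.marg(prop)
    if (log(runif(1)) < prop.val - cur.val) {         # symmetric proposal: accept w.p. min(1, ratio)
      gamma <- prop; cur.val <- prop.val
    }
    draws[t, ] <- gamma
  }
  colMeans(draws)                                     # Monte Carlo marginal inclusion probabilities
}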