Mixtures of Prior Distributions

Hoff Chapter 9; Liang et al. (2008, JASA); Hoeting et al. (1999); Clyde & George (2004). November 10, 2016

Bartlett's Paradox

The Bayes factor for comparing $M_\gamma$ to the null model:
\[
BF(M_\gamma : M_0) = \frac{(1 + g)^{(n - 1 - p_\gamma)/2}}{\left(1 + g(1 - R^2_\gamma)\right)^{(n - 1)/2}}
\]
As $g \to \infty$ with $n$ and $R^2_\gamma$ fixed, $BF(M_\gamma : M_0) \to 0$: increasing vagueness in the prior leads to a Bayes factor that favors the null model!
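
To see the paradox numerically, here is a small R sketch (illustration only, not from the slides) that evaluates the log Bayes factor above for increasing $g$; the values $n = 60$, $p_\gamma = 3$, and $R^2_\gamma = 0.6$ are made up for illustration:

# log BF(M_gamma : M_0) = ((n - 1 - p)/2) log(1 + g) - ((n - 1)/2) log(1 + g (1 - R2))
log.bf <- function(g, n, p, R2) {
  ((n - 1 - p) / 2) * log(1 + g) - ((n - 1) / 2) * log(1 + g * (1 - R2))
}
n <- 60; p <- 3; R2 <- 0.6       # hypothetical fit
g <- 10^(2:10)                   # increasingly vague priors
round(log.bf(g, n, p, R2), 2)    # log BF eventually decreases toward -Inf, i.e. BF -> 0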

Information Paradox

The Bayes factor for comparing $M_\gamma$ to the null model:
\[
BF(M_\gamma : M_0) = \frac{(1 + g)^{(n - 1 - p_\gamma)/2}}{\left(1 + g(1 - R^2_\gamma)\right)^{(n - 1)/2}}
\]
Let $g$ be a fixed constant and take $n$ fixed, and let
\[
F = \frac{R^2_\gamma / p_\gamma}{(1 - R^2_\gamma)/(n - 1 - p_\gamma)}
\]
be the usual $F$ statistic for comparing model $M_\gamma$ to $M_0$.

- As $R^2_\gamma \to 1$, $F \to \infty$, so the likelihood ratio ($F$) test would reject $M_0$.
- The Bayes factor, however, converges to the fixed constant $(1 + g)^{(n - 1 - p_\gamma)/2}$; it does not go to infinity. This is the "information inconsistency"; see Liang et al. (JASA, 2008).
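
A quick R sketch (illustrative only, not from the slides) contrasting the two behaviors: as $R^2_\gamma$ approaches 1 with $g$, $n$, and $p_\gamma$ held fixed, the $F$ statistic blows up while the Bayes factor levels off at $(1 + g)^{(n - 1 - p_\gamma)/2}$. The numbers below are hypothetical.

n <- 60; p <- 3; g <- 100                       # hypothetical values for illustration
R2 <- c(0.9, 0.99, 0.999, 0.9999)
F.stat <- (R2 / p) / ((1 - R2) / (n - 1 - p))   # F statistic: diverges as R2 -> 1
BF <- (1 + g)^((n - 1 - p) / 2) / (1 + g * (1 - R2))^((n - 1) / 2)
cbind(R2, F.stat, BF, limit = (1 + g)^((n - 1 - p) / 2))   # BF stays below its finite limit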

Mixtures of g Priors & Information Consistency

We need $BF \to \infty$ as $R^2_\gamma \to 1$; this holds whenever $E_g\big[(1 + g)^{(n - 1 - p_\gamma)/2}\big]$ diverges (proof in Liang et al. 2008). Priors on $g$ that achieve this include:

- Zellner-Siow Cauchy prior
- hyper-g prior (Liang et al., JASA 2008): $p(g) = \frac{a - 2}{2}(1 + g)^{-a/2}$, equivalently $g/(1 + g) \sim \mathrm{Beta}(1, (a - 2)/2)$; need $2 < a \le 3$
- hyper-g/n prior: $(g/n)/(1 + g/n) \sim \mathrm{Beta}(1, (a - 2)/2)$
- Jeffreys prior on $g$, which corresponds to $a = 2$ (improper)
- robust prior (Bayarri et al., Annals of Statistics, 2012)
- intrinsic prior (Womack et al., JASA 2015)

All have prior tails for $\beta$ that behave like a Cauchy distribution, and (for the latter four) marginal likelihoods that can be computed using special hypergeometric functions ($_2F_1$, Appell $F_1$).
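
As a small illustration (not from the slides), the Beta representation above gives an easy way to simulate from the hyper-g prior: draw $u = g/(1 + g) \sim \mathrm{Beta}(1, (a - 2)/2)$ and transform. The choice $a = 3$ below is just one admissible value.

a <- 3                             # one choice with 2 < a <= 3
u <- rbeta(1e4, 1, (a - 2) / 2)    # u = g/(1+g) ~ Beta(1, (a-2)/2)
g <- u / (1 - u)                   # transform back to g
summary(u)                         # shrinkage factor g/(1+g)
quantile(g, c(0.5, 0.9, 0.99))     # heavy right tail; E[g] is infinite for this a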

Desiderata (Bayarri et al., 2012, AoS)

- Proper priors on non-common coefficients.
- If the likelihood ratio test overwhelmingly rejects a model, the Bayesian analysis should also reject it.
- Selection consistency: in large samples, the posterior probability of the true model goes to one.
- Intrinsic prior consistency: the prior converges to a fixed proper prior as $n \to \infty$.
- Invariance: invariance under scale/location changes of the data/model leads to $p(\beta_0, \phi) \propto 1/\phi$; also other group invariances, rotation invariance.
- Predictive matching: predictive distributions match under minimal sample sizes so that $BF = 1$.

Mixtures of g priors such as the Zellner-Siow, hyper-g/n, robust, and intrinsic priors satisfy these criteria.

Mortality & Pollution

- Data from Statistical Sleuth 12.17.
- 60 cities.
- Response: Mortality.
- Measures of the pollutants HC, NOX, and SO2.
- Question: is pollution associated with mortality after adjusting for other socio-economic and meteorological factors?
- 15 predictor variables implies $2^{15} = 32{,}768$ possible models.
- Use the Zellner-Siow Cauchy prior, $1/g \sim G(1/2, n/2)$:

library(BAS)
mort.bma = bas.lm(Mortality ~ ., data = mortality,
                  prior = "ZS-null", alpha = 60,
                  n.models = 2^15, update = 100,
                  initprobs = "eplogp")
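
The fitted bas object can be summarized directly; a minimal sketch (method names from the BAS package; the plots it produces are summarized on the following slides):

summary(mort.bma)   # top models, marginal inclusion probabilities, log marginal likelihoods
plot(mort.bma)      # the diagnostic panels summarized on the next slide
image(mort.bma)     # the model space image summarized later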

Posterior Distributions

[Diagnostic plots for the BMA fit: residuals vs. fitted values under BMA, cumulative model probabilities over the model search order, log(marginal) vs. model dimension (model complexity), and marginal inclusion probabilities for Intercept, Precip, Humidity, JanTemp, JulyTemp, Over65, House, Educ, Sound, Density, NonWhite, WhiteCol, Poor, loghc, lognox, and logso2.]

Posterior Probabilities

What is the probability that there is no pollution effect? Sum the posterior model probabilities over all models that include no pollution variables, or equivalently take one minus the probability that at least one pollution variable (columns 14-16: loghc, lognox, logso2) is included:

> which.mat = list2matrix.which(mort.bma, 1:(2^15))
> poll.in = (which.mat[, 14:16] %*% rep(1, 3)) > 0
> sum(poll.in * mort.bma$postprob)
[1] 0.9889641

- Posterior probability of no pollution effect: 1 - 0.989 = 0.011.
- Posterior odds that there is an effect: (1 - 0.011)/0.011 ≈ 89.9.
- Prior odds: (1 - 0.5^3)/0.5^3 = 7.
- Bayes factor for a pollution effect: 89.9/7 ≈ 12.8.
- Bayes factor for NOX, based on its marginal inclusion probability: 0.917/(1 - 0.917) ≈ 11.0.
- Marginal inclusion probability for loghc = 0.427 (BF ≈ 0.745); for logso2 = 0.219 (BF ≈ 0.280).
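
The odds and Bayes factor arithmetic above can be reproduced directly from the inclusion probability computed in the code; a small sketch:

post.incl  <- 0.9889641                       # P(at least one pollution variable | data)
post.odds  <- post.incl / (1 - post.incl)     # posterior odds of a pollution effect
prior.incl <- 1 - 0.5^3                       # each pollution variable included with prior probability 1/2
prior.odds <- prior.incl / (1 - prior.incl)   # = 7
post.odds / prior.odds                        # Bayes factor for a pollution effect, ~ 12.8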

Model Space

[Image of the model space: the top 20 models ranked by log posterior odds (from 0 up to 1.942), with rows for Intercept, Precip, Humidity, JanTemp, JulyTemp, Over65, House, Educ, Sound, Density, NonWhite, WhiteCol, Poor, loghc, lognox, and logso2 indicating which predictors appear in each model.]

Coefficients

[Marginal posterior distributions of the regression coefficients under BMA, one panel per predictor: Intercept, Precip, Humidity, JanTemp, JulyTemp, Over65, House, Educ, Sound, Density, NonWhite, WhiteCol, Poor, loghc, lognox, logso2.]
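
These summaries can also be pulled from the fit; a minimal sketch using the coef method for bas objects (the subset and ask arguments follow the BAS vignette and may differ across package versions):

beta <- coef(mort.bma)                    # posterior means, SDs, and inclusion probabilities under BMA
beta
plot(beta, subset = 14:16, ask = FALSE)   # posterior densities for the pollution coefficients only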

Effect Estimation

- Coefficients in each model are adjusted for the other variables in that model.
- OLS: leave out a predictor with a non-zero coefficient and the remaining estimates are biased!
- Model selection in the presence of high correlation may leave out redundant variables, giving improved MSE for prediction (bias-variance tradeoff).
- Bayes estimates are biased anyway, so should we care?
- With confounding, one should not use plain BMA; the prior needs to be modified to include potential confounders (advanced topic).

Computational Issues

- If $p > 35$, enumeration of all models is difficult.
- Gibbs sampler or random-walk Metropolis algorithm on $\gamma$.
- Poor convergence/mixing with highly correlated predictors.
- Metropolis-Hastings algorithms offer more flexibility (method = "MCMC").
- Stochastic search (no guarantee that the samples represent the posterior).
- Variational methods, EM, etc. to find the modal model.
- In BMA all variables are included, but coefficients are shrunk toward 0; an alternative is to use shrinkage methods.
- Models with non-estimable parameters? (Use a generalized inverse.)
- Prior choice: choice of the prior distributions on $\beta$ and on $\gamma$.
- Model averaging versus model selection: what are the objectives?
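
When $p$ is too large to enumerate, the same model can be re-fit by MCMC sampling over $\gamma$; a hedged sketch (argument names as in the BAS documentation; the iteration count is an arbitrary illustration):

mort.mcmc = bas.lm(Mortality ~ ., data = mortality,
                   prior = "ZS-null", alpha = 60,
                   method = "MCMC", MCMC.iterations = 10^5,
                   initprobs = "eplogp")
diagnostics(mort.mcmc)   # compare MCMC and renormalized estimates of the inclusion probabilities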

BAS Algorithm (Clyde, Ghosh & Littman, JCGS)

- Sampling without replacement: method = "BAS"
- MCMC sampling: method = "MCMC"

See the package vignette for details.