ST 740: Model Selection

ST 740: Model Selection
Alyson Wilson
Department of Statistics, North Carolina State University
November 25, 2013

Formal Bayesian Model Selection

Bayesians are, well, sometimes Bayesian about model choice. A prior probability distribution is placed on the set of candidate models (usually uniform), and Bayes' Theorem is then used to derive the posterior probability distribution over models. Bayesian model choice is aimed at computing a posterior probability distribution over a set of hypotheses or models, expressing their relative support from the data.

Note: Because this distribution is marginalized over the parameters, improper priors on the parameters cannot, in general, be used.

Bayes Factor

Suppose that we wish to compare models M_1 and M_0 for the same data. The two models may have different likelihoods, different numbers of unknown parameters, etc. (For example, we might want to compare two non-nested regression models.) The Bayes factor in the general case is the ratio of the marginal likelihoods under the two candidate models. If θ_1 and θ_0 denote the parameters under models M_1 and M_0 respectively, then

p(y | M_1) = ∫ π(θ_1) f(y | θ_1) dθ_1,   p(y | M_0) = ∫ π(θ_0) f(y | θ_0) dθ_0,

BF_10 = p(y | M_1) / p(y | M_0).

Bayes Factor

[π(m=i | y) / π(m=j | y)] / [π(m=i) / π(m=j)] = Bayes Factor(m=i, m=j)

π(m=i | y) / π(m=j | y)
  = [f(y | m=i) π(m=i) / (f(y | m=i) π(m=i) + f(y | m=j) π(m=j))] / [f(y | m=j) π(m=j) / (f(y | m=i) π(m=i) + f(y | m=j) π(m=j))]
  = [π(m=i) / π(m=j)] × [f(y | m=i) / f(y | m=j)]
  = [π(m=i) / π(m=j)] × [∫ f(y | θ_i, m=i) π(θ_i | m=i) dθ_i] / [∫ f(y | θ_j, m=j) π(θ_j | m=j) dθ_j]

The Bayes Factor is the ratio of the posterior odds to the prior odds.

Guidelines for Bayes Factors

Bayes Factor(m=i, m=j)   Interpretation
Under 1                  Supports model j
1-3                      Weak support for model i
3-20                     Support for model i
20-150                   Strong support for model i
Over 150                 Very strong support for model i

Calculating Bayes Factors

It may be possible to do the integration and get an analytic expression for the Bayes Factor. Assuming that we can't get an analytic expression, there are a variety of ways to calculate

f(y | m=i) = ∫ f(y | θ_i, m=i) π(θ_i | m=i) dθ_i.

General references:
Han and Carlin (2001), "Markov Chain Monte Carlo Methods for Computing Bayes Factors: A Comparative Review," JASA.
Kass and Raftery (1995), "Bayes Factors," JASA.
Clyde and George (2004), "Model Uncertainty," Statistical Science.
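As a concrete (hypothetical) illustration of both routes, the following sketch computes the marginal likelihoods of two candidate Beta-Binomial models, once analytically and once by averaging the likelihood over draws from the prior. The data (9 successes in 12 trials) and the two priors are invented for the example and are not part of the slides.

```python
# Sketch: marginal likelihood f(y | m) for two Beta-Binomial models,
# computed analytically and by simple Monte Carlo over prior draws.
import numpy as np
from scipy import stats
from scipy.special import betaln, comb

y, n = 9, 12                      # observed successes and trials (invented)
a1, b1 = 1, 1                     # model 1 prior: p ~ Beta(1, 1)
a0, b0 = 20, 20                   # model 0 prior: p ~ Beta(20, 20), concentrated near 0.5

def marg_lik_analytic(a, b):
    # f(y | m) = C(n, y) * B(y + a, n - y + b) / B(a, b)
    return comb(n, y) * np.exp(betaln(y + a, n - y + b) - betaln(a, b))

def marg_lik_mc(a, b, draws=200_000, seed=1):
    # f(y | m) ~ (1/S) * sum_s f(y | p_s), with p_s drawn from the prior
    p = stats.beta(a, b).rvs(draws, random_state=seed)
    return stats.binom.pmf(y, n, p).mean()

m1, m0 = marg_lik_analytic(a1, b1), marg_lik_analytic(a0, b0)
print("BF_10 (analytic):   ", m1 / m0)
print("BF_10 (Monte Carlo):", marg_lik_mc(a1, b1) / marg_lik_mc(a0, b0))
```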

Hypothesis Testing: Classical Approach

Most simple problems have the following general form. There is an unknown parameter θ ∈ Θ, and you want to know whether θ ∈ Θ_0 or θ ∈ Θ_1, where Θ_0 ∪ Θ_1 = Θ and Θ_0 ∩ Θ_1 = ∅. You make a set of observations y_1, y_2, ..., y_n whose sampling density f(y | θ) depends on θ. We refer to H_0: θ ∈ Θ_0 as the null hypothesis, and H_A: θ ∈ Θ_1 as the alternative hypothesis.

Hypothesis Testing: Classical Approach

A test is decided by a rejection region R, where R = {y : observing y would lead to the rejection of H_0}. Decisions between tests are based on the probability of Type I errors, P(R | θ) for θ ∈ Θ_0, and on the probability of Type II errors, 1 − P(R | θ) for θ ∈ Θ_1. Often R is chosen so that the probability of a Type II error is as small as possible, subject to the requirement that the probability of a Type I error is always less than or equal to some fixed value α.
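A small hypothetical illustration of these error probabilities, using a one-sided binomial test of H_0: p = 0.5 with a rejection region of the form {y ≥ c}; the cutoff c = 10 and the alternative value p = 0.75 are chosen purely for illustration.

```python
# Sketch: Type I and Type II error probabilities for a binomial rejection region.
from scipy import stats

n, c = 12, 10
type_I  = stats.binom.sf(c - 1, n, 0.5)    # P(y >= c | p = 0.5)
type_II = stats.binom.cdf(c - 1, n, 0.75)  # P(y <  c | p = 0.75), one alternative value
print(f"Type I error  = {type_I:.4f}")
print(f"Type II error = {type_II:.4f}")
```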

Example: The Likelihood Principle

The Likelihood Principle: In making inferences or decisions about θ after y is observed, all relevant experimental information is contained in the likelihood function for the observed y. Furthermore, two likelihood functions contain the same information about θ if they are proportional to each other (as functions of θ).

The Likelihood Principle

From Berger (1985). Suppose that we make 12 independent tosses of a coin, and the results of the experiment are 9 heads and 3 tails.

If the number of tosses was fixed at 12, then y (the number of heads) is Binomial(12, p), and

f(y | p, n) = C(n, y) p^y (1−p)^(n−y) ∝ p^9 (1−p)^3.

If instead we tossed the coin until 3 tails appeared, then y is NegativeBinomial(3, p), and

f(y | p, r) = C(r+y−1, r−1) p^y (1−p)^r ∝ p^9 (1−p)^3.

A classical test of the hypothesis H_0: p = 1/2 versus H_1: p > 1/2 has p-value 0.075 for binomial sampling and 0.0325 for negative binomial sampling. But the data are the same: 9 heads and 3 tails!
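The two one-sided tail probabilities can be recomputed directly; a sketch using scipy's binom and nbinom distributions (scipy's negative binomial counts "failures" before a fixed number of "successes", so here heads play the role of failures before the 3rd tail).

```python
# Sketch: the two p-values for the same data (9 heads, 3 tails) under
# different stopping rules.
from scipy import stats

# Binomial sampling: n = 12 fixed; p-value = P(Y >= 9 | p = 0.5)
p_binom = stats.binom.sf(8, 12, 0.5)

# Negative binomial sampling: toss until r = 3 tails; y = number of heads;
# p-value = P(Y >= 9 | p = 0.5)
p_negbinom = stats.nbinom.sf(8, 3, 0.5)

print(f"binomial p-value          = {p_binom:.4f}")
print(f"negative binomial p-value = {p_negbinom:.4f}")
```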

Hypothesis Testing: Bayesian Approach

The Bayesian approach calculates the posterior probabilities

p_0 = P(θ ∈ Θ_0 | y),   p_1 = P(θ ∈ Θ_1 | y),

and decides between H_0 and H_A accordingly. Note that p_0 + p_1 = 1, since Θ_0 ∪ Θ_1 = Θ and Θ_0 ∩ Θ_1 = ∅. Of course, the posterior probability of our hypothesis depends on the prior probability of our hypothesis.

Example

Suppose that we are modeling the expected lifetime of a set of light bulbs. We model the lifetimes using an Exponential(λ) sampling distribution, so that

f(t | λ) = λ^n exp(−λ Σ_{i=1}^n t_i).

If we use a Gamma(α, β) prior for λ, our posterior distribution is Gamma(α + n, β + Σ_{i=1}^n t_i), and our posterior distribution for 1/λ, the expected lifetime, is InverseGamma(α + n, β + Σ_{i=1}^n t_i).

Example

Suppose that we choose a Gamma(21, 200) prior distribution for λ, and consider H_0: 1/λ ≤ 10 versus H_A: 1/λ > 10. We observe Σ_{i=1}^{30} t_i = 442.95.

[Figure: densities of λ (left panel) and of the expected lifetime 1/λ (right panel).]

Example

Our prior probability that 1/λ ≤ 10 is 0.559. In terms of odds, our prior odds on H_0 are 0.559 / (1 − 0.559) = 1.27. Our posterior probability that 1/λ ≤ 10 is 0.038; in terms of odds, our posterior odds on H_0 are 0.038 / (1 − 0.038) = 0.040.

We define the Bayes factor as the ratio of the posterior odds to the prior odds. Here

BF = 0.040 / 1.27 = 0.033.

The Bayes factor is a measure of the evidence provided by the data in support of the hypothesis H_0. In this case, the data provide strong evidence against the null hypothesis.
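These probabilities and the resulting Bayes factor can be reproduced from the Gamma cdf. A minimal sketch: H_0: 1/λ ≤ 10 is the event λ ≥ 0.1, and scipy parameterizes the Gamma by shape and scale, so the rate 200 enters as scale = 1/200.

```python
# Sketch: prior/posterior probabilities of H_0 and the Bayes factor for the
# light-bulb example.  Prior: lambda ~ Gamma(shape=21, rate=200);
# data: n = 30, sum(t_i) = 442.95.
from scipy import stats

a0, b0 = 21.0, 200.0                 # prior shape and rate
n, sum_t = 30, 442.95
a1, b1 = a0 + n, b0 + sum_t          # posterior shape and rate

prior_p0 = stats.gamma.sf(0.1, a0, scale=1.0 / b0)   # P(lambda >= 0.1) a priori
post_p0  = stats.gamma.sf(0.1, a1, scale=1.0 / b1)   # P(lambda >= 0.1) a posteriori

prior_odds = prior_p0 / (1.0 - prior_p0)
post_odds  = post_p0 / (1.0 - post_p0)
print(f"prior P(H0) = {prior_p0:.3f}, posterior P(H0) = {post_p0:.3f}")
print(f"Bayes factor = {post_odds / prior_odds:.3f}")
```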

Model Selection: Oxygen Uptake Example

Twelve healthy men who did not exercise regularly were recruited to take part in a study of the effects of two different exercise regimens on oxygen uptake. Six of the twelve men were randomly assigned to a 12-week flat-terrain running program, and the remaining six were assigned to a 12-week step aerobics program. The maximal oxygen uptake of each subject was measured (in liters per minute) while running on an incline treadmill, both before and after the 12-week program.

Of interest is how a subject's change in maximal oxygen uptake depends on which program they were assigned to. However, we also want to account for the age of the subject, and we think there may be an interaction between age and the exercise program:

Y_i = β_0 + β_1 x_1 + β_2 age + β_3 (age × x_1) + ε_i,

where x_1 = 0 if subject i is on the running program and 1 if on the aerobics program.

Model Selection: Oxygen Uptake Example

[Figure: O2 uptake versus age for the twelve subjects.]

Bayesian Model Selection

Using classical methods, we might use forward, backward, or stepwise selection to choose the variables to include in a regression model. The Bayesian solution to model selection is conceptually straightforward: if we believe that some of the regression coefficients are potentially equal to zero, we specify a prior distribution that captures this. A convenient way to represent this is to write the regression coefficient for variable j as β_j = z_j b_j, where z_j ∈ {0, 1} and b_j is a real number. We then write our regression model as

Y_i = z_0 b_0 + z_1 b_1 x_1 + z_2 b_2 age + z_3 b_3 (age × x_1) + ε_i.

Each value of z = (z_0, ..., z_p) corresponds to a different model; in particular, to a different collection of variables with non-zero regression coefficients.

Bayesian Model Selection

Since we haven't observed z, we treat it just like the other parameters in the model. To do model selection, we want to look at the posterior probability of each possible value of z:

π(z | y, X) = π(z) f(y | X, z) / Σ_z̃ π(z̃) f(y | X, z̃).

So if we can calculate f(y | X, z), then we can calculate the posterior distribution for z. Remember from our discussion of Bayes factors that

π(z_a | y, X) / π(z_b | y, X) = [π(z_a) / π(z_b)] × [f(y | z_a, X) / f(y | z_b, X)].

Bayesian Model Selection

f(y | X, z) = ∫∫ f(y, β, σ² | X, z) dβ dσ²
            = ∫∫ f(y | β, σ², X, z) π(β | X, z, σ²) π(σ² | X, z) dβ dσ²

The g-prior is convenient because you can do the integration in closed form:

f(y | z, X) = π^(−n/2) × [Γ((ν_0 + n)/2) / Γ(ν_0/2)] × (1 + g)^(−p_z/2) × (ν_0 σ_0²)^(ν_0/2) / (ν_0 σ_0² + SSR_g^z)^((ν_0 + n)/2).
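A sketch of how this closed form gets used: the function below evaluates log f(y | X, z) and normalizes it over all models that contain an intercept. The design matrix and response are synthetic stand-ins (the oxygen-uptake data themselves are not reproduced here), and g = n, ν_0 = 1, σ_0² = 1 are assumed hyperparameter choices for illustration.

```python
# Sketch: g-prior log marginal likelihood and posterior model probabilities
# over all 2^3 submodels that include an intercept, on synthetic data.
import itertools
import numpy as np
from scipy.special import gammaln

def log_marginal(y, Xz, g, nu0, s20):
    # log f(y | X, z) under the g-prior, per the closed form above
    n, pz = Xz.shape
    Hg = (g / (g + 1.0)) * Xz @ np.linalg.solve(Xz.T @ Xz, Xz.T)
    ssr_g = y @ (np.eye(n) - Hg) @ y                     # SSR_g^z
    return (-0.5 * n * np.log(np.pi)
            + gammaln((nu0 + n) / 2.0) - gammaln(nu0 / 2.0)
            - 0.5 * pz * np.log(1.0 + g)
            + 0.5 * nu0 * np.log(nu0 * s20)
            - 0.5 * (nu0 + n) * np.log(nu0 * s20 + ssr_g))

# Synthetic stand-in for the design: intercept, program, age, age*program
rng = np.random.default_rng(0)
n = 12
age = rng.uniform(20, 31, n)
x1 = np.repeat([0.0, 1.0], n // 2)
X = np.column_stack([np.ones(n), x1, age, age * x1])
y = 2.0 + 0.5 * age - 6.0 * x1 + rng.normal(0, 3, n)

g, nu0, s20 = float(n), 1.0, 1.0                         # assumed hyperparameters
models = [z for z in itertools.product([0, 1], repeat=4) if z[0] == 1]
logm = np.array([log_marginal(y, X[:, np.array(z, dtype=bool)], g, nu0, s20)
                 for z in models])
post = np.exp(logm - logm.max())
post /= post.sum()                                       # equal prior over the models
for z, p in zip(models, post):
    print(z, round(float(p), 3))
```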

Other Prior Distributions: g-prior

The g-prior starts from the idea that parameter estimation should be invariant to the scale of the regressors. Suppose that X is a given set of regressors and X̃ = XH for some p × p matrix H. If we obtain the posterior distribution of β from y and X, and of β̃ from y and X̃, then, according to this principle of invariance, the posterior distributions of β and Hβ̃ should be the same.

π(β | σ²) ~ MVN(0, k (X^T X)^(−1))
π(σ²) ~ InverseGamma(ν_0/2, ν_0 σ_0²/2)

A popular choice for k is gσ². Choosing g large reflects less informative prior beliefs.

Other Prior Distributions: g-prior

In this case, we can show that

π(β | σ², y, X) ~ MVN( [g/(g+1)] (X^T X)^(−1) X^T y,  [g/(g+1)] (X^T X)^(−1) σ² )
π(σ² | y, X) ~ InverseGamma( (n + ν_0)/2,  (ν_0 σ_0² + SSR_g)/2 )

where SSR_g = y^T (I − [g/(g+1)] X (X^T X)^(−1) X^T) y. Another popular noninformative choice is ν_0 = 0.
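Because both conditional posteriors are available in closed form, joint posterior draws can be generated directly: first σ² from its InverseGamma marginal, then β | σ² from the MVN. A sketch, with a small synthetic data set and assumed hyperparameters (g = n, ν_0 = 1, σ_0² = 1) used only for illustration.

```python
# Sketch: direct Monte Carlo draws from the g-prior posterior.
import numpy as np

def gprior_draws(y, X, g, nu0, s20, ndraws=5000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    shrink = g / (g + 1.0)
    ssr_g = y @ (np.eye(n) - shrink * X @ XtX_inv @ X.T) @ y
    mean_beta = shrink * XtX_inv @ X.T @ y
    # sigma^2 ~ InverseGamma((nu0 + n)/2, (nu0*s20 + SSR_g)/2):
    # draw W ~ Gamma(shape, scale = 1/rate) and take sigma^2 = 1/W
    sigma2 = 1.0 / rng.gamma((nu0 + n) / 2.0, 2.0 / (nu0 * s20 + ssr_g), size=ndraws)
    # beta | sigma^2 ~ MVN(mean_beta, shrink * sigma^2 * (X'X)^{-1})
    beta = np.array([rng.multivariate_normal(mean_beta, shrink * s2 * XtX_inv)
                     for s2 in sigma2])
    return beta, sigma2

# Usage on a small synthetic data set (illustration only)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(12), rng.uniform(20, 31, 12)])
y = 1.0 + 0.4 * X[:, 1] + rng.normal(0, 2, 12)
beta, sigma2 = gprior_draws(y, X, g=12.0, nu0=1.0, s20=1.0)
print("posterior mean of beta:", beta.mean(axis=0))
```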

Assumptions

β_j = 0 for any j where z_j = 0.
p_z is the number of non-zero elements of z.
X_z is the n × p_z matrix containing the columns j for which z_j = 1.
SSR_g^z = y^T (I − [g/(g+1)] X_z (X_z^T X_z)^(−1) X_z^T) y.

Bayesian Model Selection

From Hoff, Table 9.1, assuming g = n and equal a priori probabilities for each value of z:

z            Model                                          π(z | y, X)
(1,0,0,0)    β_0                                             0.00
(1,1,0,0)    β_0 + β_1 x_1                                   0.00
(1,0,1,0)    β_0 + β_2 age                                   0.18
(1,1,1,0)    β_0 + β_1 x_1 + β_2 age                         0.63
(1,1,1,1)    β_0 + β_1 x_1 + β_2 age + β_3 age×x_1           0.19

Which model has the highest posterior probability? What evidence is there for an age effect? (What is the sum of the posterior probabilities of the models that include age?)

Bayesian Methods for Point Null Hypotheses

We can't use a continuous prior density to conduct a test of H_0: θ = θ_0, because that gives prior probability 0 to θ = θ_0 and hence posterior probability 0 to the null hypothesis. What we do instead is assign a probability p_0 to θ = θ_0 and a probability density p_1 π*(θ) to values θ ≠ θ_0, where p_1 = 1 − p_0. In mathematical terms, we assign the prior distribution

π(θ) = p_0 δ(θ − θ_0) + p_1 I(θ ≠ θ_0) π*(θ).

Dirac Delta

δ(x) = 0 for x ≠ 0; δ(x) is a singular measure giving point mass to 0. Its cdf is a step function that jumps from 0 to 1 at x = 0, and

∫ f(x) δ(x) dx = f(0).

Bayesian Methods for Point Null Hypotheses

We know that

π(θ | y) = f(y | θ) π(θ) / ∫ f(y | θ) π(θ) dθ.

Let f*(y) = ∫ f(y | θ) π*(θ) dθ be the predictive distribution of y under the alternative hypothesis. Then

∫ f(y | θ) π(θ) dθ = p_0 ∫ δ(θ − θ_0) f(y | θ) dθ + p_1 ∫ I(θ ≠ θ_0) π*(θ) f(y | θ) dθ
                   = p_0 f(y | θ_0) + p_1 f*(y).

Bayesian Methods for Point Null Hypotheses

Then

π(θ | y) = f(y | θ) π(θ) / ∫ f(y | θ) π(θ) dθ
         = [p_0 δ(θ − θ_0) f(y | θ) + p_1 I(θ ≠ θ_0) π*(θ) f(y | θ)] / [p_0 f(y | θ_0) + p_1 f*(y)]
         = {p_0 f(y | θ_0) / [p_0 f(y | θ_0) + p_1 f*(y)]} δ(θ − θ_0) + {p_1 f*(y) / [p_0 f(y | θ_0) + p_1 f*(y)]} π*(θ | y) I(θ ≠ θ_0).

This means that the posterior probability that θ = θ_0 is

p_0 f(y | θ_0) / [p_0 f(y | θ_0) + p_1 f*(y)].

Bayesian Methods for Point Null Hypotheses: Example

From Albert (2007). Suppose that we are interested in assessing the fairness of a coin, and that we're interested in testing H_0: p = 0.5 against H_A: p ≠ 0.5. We're indifferent between the two hypotheses (fair coin and not fair coin), so we assign p_0 = 0.5 and p_1 = 0.5. If the coin is fair, our entire prior probability distribution is concentrated at 0.5. If the coin is unfair, we need to specify our beliefs about what the unfair-coin probability will look like. Suppose that we choose π*(p) to be a Beta(10, 10) density. (This places most of its mass between about 0.2 and 0.8.)

Bayesian Methods for Point Null Hypotheses: Example

Suppose that we observe y = 5 heads in n = 20 trials. To determine our posterior probability that the coin is fair, we need to calculate

f*(y) = ∫ f(y | p) π*(p) dp = ∫ p^5 (1−p)^15 × [p^(10−1) (1−p)^(10−1) / B(10, 10)] dp = B(15, 25) / B(10, 10).

This means that the posterior probability that p = 0.5 is

(0.5)(0.5)^5 (1−0.5)^15 / [(0.5)(0.5)^5 (1−0.5)^15 + (0.5) B(15, 25) / B(10, 10)] = 0.2802215.

(The binomial coefficient C(20, 5) appears in both f(y | p = 0.5) and f*(y), so it cancels and has been dropped throughout.)
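A quick numerical check of this value, sketched with scipy's log-Beta function for stability; the binomial coefficient is again omitted since it cancels.

```python
# Sketch: posterior probability that the coin is fair (y = 5 heads in n = 20).
import numpy as np
from scipy.special import betaln

y, n = 5, 20
p0 = p1 = 0.5

lik_null = 0.5 ** y * 0.5 ** (n - y)                             # f(y | p = 0.5), coefficient dropped
f_star = np.exp(betaln(y + 10, n - y + 10) - betaln(10, 10))     # B(15, 25) / B(10, 10)

post_null = p0 * lik_null / (p0 * lik_null + p1 * f_star)
print(post_null)   # roughly 0.28
```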