Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33


Outline
1. Basic Testing Framework
2. Testing with HPD intervals
3. Example
4. Savage-Dickey Density Ratio
5. Bartlett's Paradox

Basic Testing Framework

An advantage of the Bayesian approach is its unified treatment of hypothesis testing. Consider two competing models, denoted M_1 and M_2. (These can be nested or non-nested.)

For example, M_1 can denote an (unrestricted) regression model with k explanatory variables, and M_2 can denote M_1 with one or more of the explanatory variables removed. Alternatively, M_1 could denote a regression model with Gaussian (normal) errors while M_2 denotes the same model with Student-t errors. Or M_1 could denote the probit model while M_2 denotes the logit.

Basic Testing Framework

Note that

p(M_j | y) = p(y | M_j) p(M_j) / p(y),

where:
1. p(M_j | y) is the posterior probability of model j;
2. p(y | M_j) is the marginal likelihood under model j;
3. p(M_j) is the prior probability of model j;
4. p(y) = Σ_{j=1}^{J} p(y | M_j) p(M_j).

Basic Testing Framework

Consider two competing models M_1 and M_2, and note from the previous derivation that

p(M_1 | y) / p(M_2 | y) = [ p(y | M_1) / p(y | M_2) ] × [ p(M_1) / p(M_2) ],

where:
1. p(M_1 | y) / p(M_2 | y) is the posterior odds in favor of Model 1 over Model 2;
2. p(y | M_1) / p(y | M_2) is termed the Bayes factor;
3. p(M_1) / p(M_2) is the prior odds in favor of Model 1.
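The decomposition above is easy to operationalize. The sketch below uses hypothetical marginal-likelihood values (not numbers from these slides) to show how a Bayes factor and prior odds combine into posterior odds and a posterior model probability:

```python
# Hypothetical marginal likelihoods and prior model probabilities,
# chosen only to illustrate the arithmetic on this slide.

def posterior_odds(ml_1, ml_2, prior_1=0.5, prior_2=0.5):
    """Posterior odds of M1 over M2 = Bayes factor * prior odds."""
    bayes_factor = ml_1 / ml_2
    prior_odds = prior_1 / prior_2
    return bayes_factor * prior_odds

odds = posterior_odds(ml_1=0.8, ml_2=0.2)   # equal prior odds
prob_m1 = odds / (1.0 + odds)               # posterior probability of M1
```

With equal prior odds, the posterior odds coincide with the Bayes factor, and the two model probabilities sum to one by construction.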

Basic Testing Framework

Define the posterior odds ratio

K_12 = p(M_1 | y) / p(M_2 | y).

If the posterior odds ratio equals unity, we are indifferent between M_1 and M_2. Jeffreys (1961) recommends interpreting a Bayes factor exceeding (approximately) 10 as strong evidence in favor of M_1, and a Bayes factor smaller than 0.1 as strong evidence in favor of M_2.

Of course, the magnitude of the odds itself is interpretable, and one does not really need to select a single model on the basis of this information. A more sensible approach is often to average over the models, using the posterior model probabilities as weights. We will return to such Bayesian model averaging later in the course.

Basic Testing Framework

Although this approach to testing applies quite generally, its implementation often proves difficult in models of even moderate complexity, since the marginal likelihoods p(y | M_j) are integrals over the full parameter space that rarely have closed forms.

Later in the course, we will present a variety of numerical, simulation-based approaches for approximating marginal likelihoods (and thus Bayes factors). In this lecture we will also describe an approach for doing this when the models involve a zero restriction on a subvector of parameters.

Intuitive Testing

Consider the regression model

y = Xβ + ε

(i.e., a regression equation for a production function). The primary parameter of interest is the returns-to-scale parameter θ, a linear combination of the elements of β (for a Cobb-Douglas production function, the sum of the input coefficients).

In previous slides on the linear regression model, we showed that under the prior p(β, σ²) ∝ σ⁻², the marginal posterior for β is a multivariate Student-t of the form

β | y ~ t( β̂, s²(X′X)⁻¹, n − k ),

with β̂ = (X′X)⁻¹X′y, s² the usual variance estimate, and X defined in the obvious way.

Intuitive Testing

The posterior distribution of θ can be obtained by first defining a selector matrix (here, a row vector) R such that θ = Rβ. Using standard properties of the multivariate Student-t distribution, it follows that

θ | y ~ t( Rβ̂, s² R(X′X)⁻¹R′, n − k ).

Thus, via a simple change of variable, we have derived the posterior distribution of the returns-to-scale parameter θ.

Intuitive Testing

Given this posterior, one can immediately calculate probabilities of interest, such as the posterior probability that the production function exhibits increasing returns to scale:

Pr(θ > 1 | y) = 1 − T_ν( (1 − Rβ̂) / √[ s² R(X′X)⁻¹R′ ] ),

with T_ν denoting the standardized Student-t cdf with ν = n − k degrees of freedom.

Intuitive Testing

Alternatively, we can calculate a 1 − α HPD interval for θ. Since the marginal posterior for θ is symmetric around its mean / median / mode Rβ̂, it follows that

Rβ̂ ± t_{α/2, ν} √[ s² R(X′X)⁻¹R′ ]

is a 1 − α HPD interval for θ. If this interval does not include 1 (for a reasonable choice of α), then there is little evidence supporting constant returns to scale.

The above offers a reasonable, informal way of investigating the credibility of various hypotheses.
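Both calculations can be sketched with scipy's Student-t routines. The posterior mean, scale, and degrees of freedom below are hypothetical placeholders, not values from the lecture:

```python
import numpy as np
from scipy import stats

# Hypothetical posterior for theta: Student-t with mean 1.05,
# scale 0.04, and nu = 30 degrees of freedom (made-up numbers)
post_mean, post_scale, nu = 1.05, 0.04, 30

# Posterior probability of increasing returns to scale, Pr(theta > 1 | y)
pr_irs = 1 - stats.t.cdf((1 - post_mean) / post_scale, df=nu)

# 95% HPD interval: by symmetry of the t posterior, the equal-tailed
# interval centered at the posterior mean is also the HPD interval
alpha = 0.05
half_width = stats.t.ppf(1 - alpha / 2, df=nu) * post_scale
hpd = (post_mean - half_width, post_mean + half_width)
```

With these particular placeholder values, the 95% interval straddles 1, so the "increasing returns" conclusion would rest on the tail probability rather than the interval.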

Marginal Likelihoods in a Regression Example

Consider, again, our regression example:

y_i = β_0 + β_1 Ed_i + ε_i,   ε_i | Ed_i ~ iid N(0, σ²).

We seek to illustrate more formally how model selection and comparison can be carried out. For this regression model, we employ priors of the form

β | σ² ~ N(μ, σ² V_β),   σ² ~ IG( ν/2, 2(νλ)⁻¹ ),

with ν = 6, λ = (2/3)(0.2), μ = 0 and V_β = 10 I_2. Under these values, the prior for σ² has mean and standard deviation both equal to 0.2.

Marginal Likelihoods in a Regression Example

In our previous lecture notes on the linear regression model, we showed that, with this prior, the marginal likelihood has the form

y ~ t( Xμ, λ(I + X V_β X′), ν ).

Given values for the hyperparameters μ, V_β, λ and ν, we can therefore calculate the marginal likelihood p(y), evaluated at the realized y outcomes.

Marginal Likelihoods in a Regression Example

In our regression analysis, a question of interest is: do returns to education exist? To investigate this question, we let M_1 denote the unrestricted regression model, and M_2 denote the restricted model that drops education from the right-hand side. (Thus, X is simply a column vector of ones under M_2.)

We then calculate p(y), with the hyperparameter values given on the last slide, under these two different definitions of X.

Marginal Likelihoods in a Regression Example

When doing these calculations (on the log scale!), we obtain:

1. log p(y | M_1) = −936.7
2. log p(y | M_2) = −1,021.4.

Since log K_12 = log p(y | M_1) − log p(y | M_2) ≈ 84.7, we have

K_12 = exp(84.7) ≈ 6.09 × 10³⁶ (!)

Thus, our data show overwhelming evidence in favor of the unrestricted model, i.e., a model including education in the regression equation. This was perhaps obvious at the outset, since the posterior mean of the return to education was 0.091 and the posterior standard deviation was only 0.0066.
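The slide's arithmetic, reproduced in code. Note that marginal likelihoods this small lie far below the smallest representable double, which is exactly why the comparison must happen on the log scale:

```python
import numpy as np

log_ml_m1 = -936.7    # log p(y | M1), from the slide
log_ml_m2 = -1021.4   # log p(y | M2), from the slide

log_k12 = log_ml_m1 - log_ml_m2   # log Bayes factor = 84.7
k12 = np.exp(log_k12)             # roughly 6.09e36

# Exponentiating either marginal likelihood directly underflows to zero,
# which is why the slide stresses working "on the log scale!"
underflow = float(np.exp(log_ml_m2))
```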

Savage-Dickey Density Ratio

Consider the likelihood L(θ), θ = [θ_1′ θ_2′]′, and the hypotheses

M_1: θ_2 = 0   versus   M_2: θ_2 unrestricted.

Let p(θ_1 | M_1) be the prior for the parameters under M_1 and, similarly, let p(θ_1, θ_2 | M_2) be the prior under M_2. Assume, additionally, that the priors have the following structure, generated from a common joint density g(θ_1, θ_2) = g(θ_1 | θ_2) g(θ_2):

1. p(θ_1, θ_2 | M_2) = g(θ_1, θ_2);
2. p(θ_1 | M_1) = g(θ_1 | θ_2 = 0).

In other words, the prior we use for the restricted model M_1 is the same prior we would use under M_2, given the restriction. (If prior independence is assumed in g, then we simply require the same priors for parameters common to both models.)

Savage-Dickey Density Ratio

Under these assumptions, one can show [Verdinelli and Wasserman (1995)] that the Bayes factor for M_1 versus M_2 can be written as the ratio (known as the Savage-Dickey density ratio)

K_12 = p(y | M_1) / p(y | M_2) = p(θ_2 = 0 | y, M_2) / p(θ_2 = 0 | M_2),

where p(θ_2 | y, M_2) and p(θ_2 | M_2) denote the marginal posterior and marginal prior of θ_2 under M_2. In other words, the test can be carried out by calculating the marginal posterior and marginal prior ordinates at θ_2 = 0 under the unrestricted model M_2.

Savage-Dickey Density Ratio

If the posterior ordinate at zero is much higher than the prior ordinate at zero (leading to K_12 > 1), this implies that the data have moved us in the direction of the restriction. As such, it is sensible that we would then support the restricted model.

Note that everything here is calculated under the unrestricted model M_2, not unlike what we do when implementing a Wald test. This potentially offers a much easier way to carry out (nested) hypothesis tests: we do not need to calculate the marginal likelihood itself, which is typically not possible analytically.

Savage-Dickey Density Ratio

To sketch the proof, integrate the joint posterior under M_2 with respect to θ_1, which yields the marginal posterior of θ_2:

p(θ_2 | y, M_2) = ∫ p(θ_1, θ_2 | y, M_2) dθ_1 = [ ∫ L(θ_1, θ_2) g(θ_1 | θ_2) dθ_1 ] g(θ_2) / p(y | M_2).

Savage-Dickey Density Ratio

Dividing this expression by p(θ_2 | M_2) = g(θ_2) and evaluating both densities at θ_2 = 0 implies

p(θ_2 = 0 | y, M_2) / p(θ_2 = 0 | M_2) = [ ∫ L(θ_1, 0) g(θ_1 | θ_2 = 0) dθ_1 ] / p(y | M_2) = p(y | M_1) / p(y | M_2) = K_12,

since g(θ_1 | θ_2 = 0) = p(θ_1 | M_1) is exactly the prior under the restricted model.
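The identity can be checked numerically in the simplest nested setting, with no nuisance parameter θ_1: y_t ~ N(θ, 1), with M_1: θ = 0 versus M_2: θ ~ N(0, v). This toy model is our own illustration, not one from the slides; both sides of the Savage-Dickey identity are available in closed form here, and they agree:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
T, v = 25, 4.0
y = rng.normal(0.3, 1.0, size=T)   # artificial data
ybar = y.mean()

# Posterior of theta under M2: N( T*ybar/(T + 1/v), 1/(T + 1/v) )
prec = T + 1.0 / v
post_mean, post_sd = T * ybar / prec, np.sqrt(1.0 / prec)

# Savage-Dickey: posterior ordinate over prior ordinate, both at theta = 0
sd_ratio = (stats.norm.pdf(0.0, post_mean, post_sd)
            / stats.norm.pdf(0.0, 0.0, np.sqrt(v)))

# Bayes factor computed directly from the two marginal likelihoods
k12_direct = np.sqrt(v * T + 1.0) * np.exp(-0.5 * (T * ybar) ** 2 / prec)
```

The two quantities match to machine precision, with no marginal-likelihood integral required for the Savage-Dickey side.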

Bayes Factors under (Nearly) Flat Priors: Bartlett's Paradox

In the case of a diffuse prior p(θ) ∝ c, for some constant c, we have p(θ | y) ∝ p(y | θ). Thus, the posterior mode and the maximum likelihood estimate will agree.

Bayes Factors under (Nearly) Flat Priors: Bartlett's Paradox

For purposes of obtaining the posterior distribution, specification of an improper prior of the form p(θ) ∝ c can still lead to a proper posterior, since

p(θ | y) = p(y | θ) p(θ) / p(y) = p(y | θ) p(θ) / ∫_Θ p(y | θ) p(θ) dθ = p(y | θ) / ∫_Θ p(y | θ) dθ,

i.e., the arbitrary constant c cancels in the ratio.
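A small grid approximation (our own toy example, y_t ~ N(θ, 1)) makes the cancellation concrete: whatever constant height c we give the flat prior, the normalized posterior is identical.

```python
import numpy as np
from scipy import stats

y = np.array([0.5, 1.2, -0.3, 0.8])       # toy data, y_t ~ N(theta, 1)
theta = np.linspace(-10.0, 10.0, 20001)   # grid over the parameter space
dtheta = theta[1] - theta[0]

# Likelihood p(y | theta) evaluated on the grid
lik = np.exp(stats.norm.logpdf(y[:, None], theta[None, :], 1.0).sum(axis=0))

posteriors = {}
for c in (1.0, 7.3, 1e6):                 # arbitrary flat-prior heights
    unnorm = lik * c
    posteriors[c] = unnorm / (unnorm.sum() * dtheta)   # c cancels here
```

Every entry of `posteriors` is the same density: the normalized likelihood, which for this model is a normal posterior centered at ȳ = 0.55.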

Bayes Factors under (Nearly) Flat Priors: Bartlett's Paradox

However, for purposes of testing, improper priors should generally be avoided. (They can, however, be employed for nuisance parameters that are common to both models under consideration.)

To see why improper priors should not be employed when testing, consider M_1 and M_2 with priors p(θ_1) ∝ c_1 and p(θ_2) ∝ c_2. Then

p(y | M_1) / p(y | M_2) = [ ∫_{Θ_1} p(y | θ_1) p(θ_1) dθ_1 ] / [ ∫_{Θ_2} p(y | θ_2) p(θ_2) dθ_2 ] = [ c_1 ∫_{Θ_1} p(y | θ_1) dθ_1 ] / [ c_2 ∫_{Θ_2} p(y | θ_2) dθ_2 ]

involves a ratio of arbitrary constants c_1 / c_2.

Bayes Factors under (Nearly) Flat Priors: Bartlett's Paradox

OK. So, improper priors should be avoided when testing. To get around this issue, can we instead employ a proper prior (i.e., one that integrates to unity) and let the prior variances be huge? Wouldn't the associated hypothesis test then be (nearly) free of prior information, and give us a result similar to a classical test (since the posterior is nearly proportional to the likelihood)?

The following exercise illustrates, perhaps surprisingly, that this is not the case, and that testing results are quite sensitive to the prior.

Bayes Factors under (Nearly) Flat Priors: Bartlett's Paradox

Suppose

Y_t | θ ~ iid N(0, 1) under hypothesis H_1, and
Y_t | θ ~ iid N(θ, 1) under hypothesis H_2.

In this example, we restrict the variance parameter to unity to fix ideas, and focus attention on a scalar test concerning θ. Assume the prior θ | H_2 ~ N(0, v), v > 0.

Find the Bayes factor K_21 (i.e., the odds in favor of allowing a non-zero mean relative to imposing a zero mean) and discuss its behavior for large v.

Bayes Factors under (Nearly) Flat Priors: Bartlett's Paradox

First, note that under H_1 there are no unknown parameters, and so the likelihood and marginal likelihood functions are the same. (This is like a dogmatic prior that imposes a zero mean and unit variance with probability one.) Thus, when integrating the likelihood over the prior, we simply evaluate the likelihood at these values:

p(y | H_1) = (2π)^{−T/2} exp( −(1/2) Σ_t y_t² ).

Bayes Factors under (Nearly) Flat Priors: Bartlett's Paradox

To evaluate the marginal likelihood under H_2, we must calculate

p(y | H_2) = ∫ (2π)^{−T/2} exp( −(1/2) Σ_t (y_t − θ)² ) (2πv)^{−1/2} exp( −θ²/(2v) ) dθ.

As before, we must complete the square in θ and then integrate it out.

Bayes Factors under (Nearly) Flat Priors: Bartlett's Paradox

Note that

Σ_t (y_t − θ)² + v⁻¹θ² = Σ_t y_t² − 2θT ȳ + θ²(T + v⁻¹)
  = [T + v⁻¹] ( θ² − 2θ T ȳ / (T + v⁻¹) ) + Σ_t y_t²
  = [T + v⁻¹] ( θ − T ȳ / (T + v⁻¹) )² − (T ȳ)² / (T + v⁻¹) + Σ_t y_t².
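Since the completing-the-square step is where errors usually creep in, here is a quick numerical check of the identity above at a few arbitrary values of θ and v, on a small artificial sample:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=5)          # small artificial sample
T, ybar = len(y), y.mean()

checks = []
for theta in (-1.3, 0.0, 2.7):
    for v in (0.5, 10.0):
        # left-hand side: sum of squares plus the prior's quadratic term
        lhs = np.sum((y - theta) ** 2) + theta ** 2 / v
        prec = T + 1.0 / v      # the bracketed term [T + v^{-1}]
        # right-hand side: the completed square plus the leftover constants
        rhs = (prec * (theta - T * ybar / prec) ** 2
               - (T * ybar) ** 2 / prec + np.sum(y ** 2))
        checks.append(np.isclose(lhs, rhs))
```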

Bayes Factors under (Nearly) Flat Priors: Bartlett's Paradox

Plugging this back into our expression for the marginal likelihood, we obtain

p(y | H_2) = (2π)^{−T/2} (2πv)^{−1/2} exp( −(1/2)[ Σ_t y_t² − (T ȳ)² / (T + v⁻¹) ] ) × ∫ exp( −(1/2)[T + v⁻¹] ( θ − T ȳ / (T + v⁻¹) )² ) dθ.

The remaining integral is almost that of a normal density in θ, except we are missing the integrating constant (2π)^{−1/2} [T + v⁻¹]^{1/2}.

Bayes Factors under (Nearly) Flat Priors: Bartlett's Paradox

So, multiplying and dividing by this constant, the integral evaluates to (2π)^{1/2} [T + v⁻¹]^{−1/2}, and the marginal likelihood under H_2 becomes

p(y | H_2) = (2π)^{−T/2} v^{−1/2} [T + v⁻¹]^{−1/2} exp( −(1/2)[ Σ_t y_t² − (T ȳ)² / (T + v⁻¹) ] ).

Writing v^{−1/2} [T + v⁻¹]^{−1/2} = [vT + 1]^{−1/2} = T^{−1/2} (v + T⁻¹)^{−1/2}, the term outside the exponential kernel of our marginal likelihood can be written as

(2π)^{−T/2} (v + T⁻¹)^{−1/2} T^{−1/2}.

Bayes Factors under (Nearly) Flat Priors: Bartlett's Paradox

Recall that

p(y | H_1) = (2π)^{−T/2} exp( −(1/2) Σ_t y_t² ),

and we have shown

p(y | H_2) = (2π)^{−T/2} (v + T⁻¹)^{−1/2} T^{−1/2} exp( −(1/2)[ Σ_t y_t² − T² ȳ² / (T + v⁻¹) ] ).

Thus, the Bayes factor K_21 = p(y | H_2) / p(y | H_1) follows directly.

Bayes Factors under (Nearly) Flat Priors: Bartlett's Paradox

K_21 = (v + T⁻¹)^{−1/2} T^{−1/2} exp( (1/2) [ vT / (1 + vT) ] T ȳ² ).

Note that, as v → ∞ (keeping T fixed but moderately large): the exponential kernel approaches the fixed constant exp( (1/2) T ȳ² ), while the leading term (v + T⁻¹)^{−1/2} T^{−1/2} shrinks to zero. Hence K_21 → 0.
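The limit is easy to see numerically. Below, T and ȳ are held fixed at arbitrary illustrative values while the prior variance v grows: the exponential factor stabilizes, and the (v + T⁻¹)^{−1/2} term drives K_21 toward zero.

```python
import numpy as np

def k21(v, T, ybar):
    """Bayes factor K21 from the slide, as a function of the prior variance v."""
    kernel = np.exp(0.5 * (v * T / (1.0 + v * T)) * T * ybar ** 2)
    return (v + 1.0 / T) ** -0.5 * T ** -0.5 * kernel

T, ybar = 50, 0.3                       # fixed, moderately large sample
values = [k21(v, T, ybar) for v in (1.0, 1e2, 1e4, 1e8)]
# values decreases toward 0: the restricted model H1 is favored as v grows
```

For these inputs K_21 falls from roughly 1 at v = 1 to about 10⁻⁴ at v = 10⁸, even though the data (through ȳ) have not changed at all.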

Bayes Factors under (Nearly) Flat Priors: Bartlett's Paradox

The results of this exercise clearly show that the outcome of the test depends heavily on the value of v chosen. In particular, as v grows, we tend to prefer the restricted model H_1. This rather counter-intuitive result is known as Bartlett's paradox.

The lesson here is that priors matter considerably for purposes of testing, while for estimation the choice of prior is decidedly less important.