MODEL COMPARISON CHRISTOPHER A. SIMS PRINCETON UNIVERSITY


ECO 513 Fall 2008
MODEL COMPARISON
CHRISTOPHER A. SIMS, PRINCETON UNIVERSITY, SIMS@PRINCETON.EDU

1. MODEL COMPARISON AS ESTIMATING A DISCRETE PARAMETER

Data Y, models 1 and 2, parameter vectors θ_1, θ_2. The models are p_1(Y | θ_1) and p_2(Y | θ_2); m_1 and m_2 are the dimensions of θ_1 and θ_2. We can treat the pair as a single model with pdf q(Y | θ_1, θ_2, γ), where the parameter space for (θ_1, θ_2) is some subset of (or all of) R^(m_1 + m_2) and that for γ, the model number, is {1, 2}. The posterior probability over the two models is then just the marginal posterior distribution of the γ parameter, i.e. the posterior p.d.f. with θ_1 and θ_2 integrated out.

2. MARGINAL DATA DENSITIES, BAYES FACTORS

The kernel of the posterior probability on model i is therefore

    ∫ p_i(Y | θ_i) π(θ_1, θ_2, i) dθ_1 dθ_2 .

The ratios of these kernel weights are sometimes called Bayes factors; the integral values themselves are sometimes called marginal data densities (mdd's). The posterior probabilities are the ratios of the mdd's to their sum, i.e. the kernel weights normalized to sum to one.

3. NOTES

Often the prior over θ_1, θ_2 makes them independent, in which case only the prior pdf of θ_i enters the formula for model i.

The prior matters here, even asymptotically. With independent priors, and assuming the likelihood for each model i concentrates asymptotically in the neighborhood of a point θ̄_i, a change in π(θ̄_i)/π(θ̄_j) has a directly proportional effect on the posterior odds, even as sample size goes to infinity.

It is still true in a sense that the prior doesn't matter: the posterior odds favoring the true model go to infinity regardless of the prior. But in any sample in which the odds ratio turns out to be of modest size, it can shift to favor one model or the other under reasonable variations in the prior.

This discussion generalizes directly to cases with any finite number of models.

Date: November 17, 2008. © 2008 by Christopher A. Sims.
This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained for personal use or distributed free.
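As a small numerical companion to Sections 1 and 2 (my own sketch, not part of the notes; the log-mdd values below are made up), posterior model probabilities are just the kernel weights normalized to sum to one, best computed in logs to avoid underflow:

```python
import numpy as np

def model_probs(log_mdds, prior_model_probs):
    """Posterior probabilities on the model number gamma.

    log_mdds[i] is the log of the marginal data density
    integral p_i(Y | theta_i) pi(theta_i) d theta_i for model i;
    prior_model_probs[i] is the prior probability that gamma = i.
    """
    log_kernel = np.asarray(log_mdds, dtype=float) + np.log(prior_model_probs)
    log_kernel -= log_kernel.max()      # guard against underflow
    w = np.exp(log_kernel)
    return w / w.sum()                  # kernel weights normalized to sum to one

# Hypothetical values: model 1's log mdd exceeds model 2's by 3.
print(model_probs([-100.0, -103.0], [0.5, 0.5]))
```

With equal prior model probabilities, the odds reduce to the Bayes factor e^3, so model 1 gets roughly 95% posterior probability.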

4. CALCULATING MDD'S

This is just a question of how to evaluate an integral. In some cases, including reduced-form VAR's with conjugate priors, it can be done analytically. Suppose, though, that we cannot evaluate the integral analytically but can generate a sample from the posterior pdf.

Case 1: We can generate a sample from the whole posterior, over both the continuous and discrete components. Then the posterior model probabilities are estimated simply by counting the relative frequencies of γ = 1 vs. γ = 2 in the sample.

Case 2: We can generate a sample from either model's posterior conditional on θ_i, but not from the joint (θ_1, θ_2, γ) posterior.

5. CASE 2

We have a sample {θ_i^j, j = 1, ..., N} of draws from the posterior on θ_i conditional on model i being correct. We also assume here that the priors on θ_1 and θ_2 are independent. Form

    k(Y, θ_i^j) = p_i(Y | θ_i^j) π(θ_i^j).

This is the kernel of the posterior conditional on model i being the truth. Pick a weighting function g(θ_i) that integrates to one over θ_i-space, and form

    (1/N) Σ_{j=1}^N g(θ_i^j) / k(Y, θ_i^j).

This converges to ( ∫ k(Y, θ_i) dθ_i )^(−1), because

    E[ g(θ_i) / k(Y, θ_i) ] = ∫ ( g(θ_i) / k(Y, θ_i) ) q(θ_i | Y, i) dθ_i
                            = ∫ ( g(θ_i) / k(Y, θ_i) ) · k(Y, θ_i) / ( ∫ k(Y, θ̃_i) dθ̃_i ) dθ_i
                            = 1 / ∫ k(Y, θ_i) dθ_i .

6. CHOOSING g

The method just described is called the modified harmonic mean method. The original idea, the harmonic mean method, was to use the prior π(·) as the weighting function. Since k(Y, θ_i) = p_i(Y | θ_i) π(θ_i), this amounts to simply taking the sample mean of 1/p_i(Y | θ_i), i.e. of one over the likelihood. So long as g integrates to one, the expectation of g/(p_i π_i) under the posterior density on θ_i exists and is finite; we just proved this above.
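Here is a sketch of the modified harmonic mean estimator on a toy conjugate model (my construction, not from the notes): y_t ~ N(θ, 1) with prior θ ~ N(0, 1), so the exact mdd is available as the density of Y ~ N(0, I + 11′). Because g below is taken to be the exact (normal) posterior density, g/k is constant and the estimate is essentially exact; with a mismatched g the average would only converge at the usual Monte Carlo rate.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
T = 20
y = rng.normal(0.5, 1.0, size=T)        # toy data: y_t ~ N(theta, 1)

# Conjugacy: the posterior on theta is N(mu_post, v_post).
v_post = 1.0 / (T + 1.0)
mu_post = y.sum() * v_post

# Draws {theta^j} from the posterior conditional on this model.
N = 10_000
draws = rng.normal(mu_post, np.sqrt(v_post), size=N)

# Log of the kernel k(Y, theta) = p(Y | theta) * pi(theta).
def log_k(theta):
    loglik = norm.logpdf(y[:, None], loc=theta[None, :], scale=1.0).sum(axis=0)
    return loglik + norm.logpdf(theta, 0.0, 1.0)

# Weighting function g: here, the exact posterior density itself.
log_g = norm.logpdf(draws, mu_post, np.sqrt(v_post))

# (1/N) sum of g/k converges to 1/mdd; work in logs for stability.
log_ratio = log_g - log_k(draws)
m = log_ratio.max()
log_inv_mdd = m + np.log(np.mean(np.exp(log_ratio - m)))
log_mdd_mhm = -log_inv_mdd

# Analytic check: marginally Y ~ N(0, I + 11').
log_mdd_exact = multivariate_normal(np.zeros(T), np.eye(T) + 1.0).logpdf(y)
```

Since k(Y, θ) = mdd · q(θ | Y), choosing g equal to the posterior density makes g/k identically 1/mdd, which is the sense in which "g/k close to constant" (Section 8 below) is the ideal.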

7. INFINITE VARIANCE PROBLEMS

But there is no guarantee that the second moment of g/k is finite. For example, if g declines slowly as θ_i → ∞, i.e. if g has fat tails while k does not, then g/k will be unbounded. It still has finite expectation, but only because the very large values of g/k that eventually occur do so rarely: we are sampling from the thin-tailed kernel k. The rapidly declining k offsets the single k in the denominator of g/k; it is not enough to offset the k² in the denominator of g²/k². So if lim_{θ_i → ∞} g²/k > 0, then g/k has infinite second moment. This means the usual measures of uncertainty, based on variances, do not apply, and convergence of the sample mean to its expected value will be slow.

8. GEWEKE'S SUGGESTED g

Convergence will be more rapid the closer g/k is to a constant. Geweke suggested taking g to be the normal approximation to the posterior density at its peak θ̂_i, i.e. a

    N( θ̂_i , ( −∂² log k(Y, θ_i) / ∂θ_i ∂θ_i′ |_{θ_i = θ̂_i} )^(−1) )

density, but truncated at (θ_i − θ̂_i)′ V^(−1) (θ_i − θ̂_i) = κ for some κ, where V is the asymptotic covariance matrix. The integral of the normal density over this truncated region can be looked up in a chi-squared distribution table, so the density can easily be rescaled to integrate to exactly one over the region. The idea is that g should be close to k near the peak, while the truncation avoids any possibility that g/k becomes unbounded in the tails.

9. REMAINING PROBLEMS WITH g CHOICE

Geweke's idea often works well, but not always. The tails are not the only place where g/k can become unbounded. If there is a point θ* at which k(θ*) = 0, and k is twice differentiable at that point while g is continuous and positive there, then g/k has infinite variance under the posterior density. In high-dimensional models it may be difficult to know whether there are points with k = 0.

With, say, 30 parameters, the typical draw θ_i^j will have (θ_i^j − θ̂_i)′ V^(−1) (θ_i^j − θ̂_i) ≈ 30, because this quantity has an approximate chi-squared(30) distribution. But this means the density g at θ_i = θ̂_i is about e^(30/2) ≈ e^15 times larger than at a typical draw. Of course, if the ratio of k at the peak to k at a typical draw is also of this size, this causes no problem, but obviously there is tremendous room for g/k to vary, and hence for convergence of the estimate of the Bayes factor to be slow.
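A sketch of Geweke's weighting function (my own implementation; the function name and interface are illustrative). The mass of the untruncated normal inside the ellipsoid is a chi-squared(n) probability, so dividing by it rescales g to integrate to one over the truncation region:

```python
import numpy as np
from scipy.stats import chi2, multivariate_normal

def geweke_log_g(theta, theta_hat, V, kappa):
    """log g for Geweke's truncated normal weighting function.

    theta: (N, n) array of posterior draws; theta_hat: (n,) posterior peak;
    V: (n, n) asymptotic covariance; kappa: truncation level for the
    quadratic form (theta - theta_hat)' V^{-1} (theta - theta_hat).
    Returns -inf for draws outside the truncation ellipsoid.
    """
    n = len(theta_hat)
    diff = theta - theta_hat
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(V), diff)
    # Normal mass inside the ellipsoid is the chi-squared(n) cdf at kappa,
    # so subtracting its log rescales g to integrate to one over the region.
    log_mass = np.log(chi2.cdf(kappa, df=n))
    log_g = multivariate_normal(theta_hat, V).logpdf(theta) - log_mass
    return np.where(d2 <= kappa, log_g, -np.inf)
```

Draws with log g = −inf simply contribute zero to the modified-harmonic-mean average of g/k, which is exactly how the truncation keeps g/k bounded in the tails.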

There are ways to do better when sampling g/k is not working well: bridge sampling, for example.

An important dumb idea: what about just taking the sample average of p(Y | θ_i^j) over draws from the simulated posterior distribution? (Important because every year one or more students does this on an exam or course paper.) Note that this average estimates ∫ p_i(Y | θ_i)² π(θ_i) dθ_i divided by the mdd, not the mdd itself.

10. THE SCHWARZ CRITERION

In large samples, under fairly general regularity conditions, the posterior density is well approximated as proportional to a N(θ̂, V) density, where V is minus the inverse of the second-derivative matrix of the log posterior density, evaluated at the MLE θ̂. We know the integral of a normal density, so if the approximation is working well, we will find that the integral of the posterior kernel is approximately

    (2π)^(n/2) |V|^(1/2) p(Y | θ̂) π(θ̂),

where n is the length of θ. In a stationary model, T·V → V̄ as T → ∞, where V̄ is a constant matrix.

11. V̄ CONVERGENCE

This follows because

(1)    log( p(Y | θ) π(θ) ) = Σ_{t=1}^T log p(y_t | {Y_s, s < t}, θ) + log π(θ)

(2)    (1/T) ∂² log k(Y, θ) / ∂θ ∂θ′  →  E[ ∂² log p(y_t | {Y_s, s < t}, θ) / ∂θ ∂θ′ ]   a.s.,

and V̄ is just minus the inverse of this limiting second-derivative matrix. Of course we have here invoked ergodicity of the Y process and also assumed that y_t depends on y_{t−s} for only finitely many lags s, so that after the first few observations all the terms in the sum that makes up the log likelihood have the same form.

12. THE SCHWARZ CRITERION, II

Now we observe that since log|V| = log|V̄/T| = −n log T + log|V̄| (where n is the length of θ), the log of the approximate value of the integrated density is

    (n/2) log(2π) − (n/2) log T + (1/2) log|V̄| + log k(Y, θ̂).

The dominating terms when taking the difference across two models are

    log k_1(Y, θ̂_1) − log k_2(Y, θ̂_2) − ((n_1 − n_2)/2) log T.
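To see Sections 10 through 12 numerically (my sketch, not from the notes), take the same kind of toy conjugate model as before: y_t ~ N(θ, 1) with prior θ ~ N(0, 1). Here log k is exactly quadratic in θ, so the normal approximation, and hence the Laplace-style integral underlying the Schwarz criterion, is exact:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)
T = 50
y = rng.normal(0.3, 1.0, size=T)

# Log posterior kernel log k(Y, theta) for y_t ~ N(theta, 1), theta ~ N(0, 1).
def log_k(theta):
    return norm.logpdf(y, theta, 1.0).sum() + norm.logpdf(theta, 0.0, 1.0)

theta_hat = y.sum() / (T + 1.0)   # peak of the posterior kernel
V = 1.0 / (T + 1.0)               # minus inverse second derivative of log k
n = 1                             # length of theta

# log of (2 pi)^{n/2} |V|^{1/2} p(Y | theta_hat) pi(theta_hat):
log_mdd_laplace = 0.5 * n * np.log(2.0 * np.pi) + 0.5 * np.log(V) + log_k(theta_hat)

# Exact mdd for comparison: marginally Y ~ N(0, I + 11').
log_mdd_exact = multivariate_normal(np.zeros(T), np.eye(T) + 1.0).logpdf(y)
```

Here 0.5·log V = −(n/2) log(T + 1) + (1/2) log V̄ with V̄ = 1, which is exactly the −(n/2) log T penalty term of Section 12.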

13. COMMENTS ON THE SC

The formula applies whether the models are nested or not.

Frequentist model comparison, for the special case where one model is just a restriction of the other, treats twice the difference of log likelihoods as chi-squared(n_1 − n_2) and rejects the restricted model if this statistic exceeds the 1 − α quantile of the chi-squared distribution.

The Schwarz criterion compares this statistic to a level that increases not just with n_1 − n_2, but also with log T. In other words, Bayesian model comparison penalizes more richly parameterized models in large samples, and does so more stringently (relative to a frequentist likelihood-ratio test) the larger the sample.

14. THE LINDLEY PARADOX

Take a small collection of reasonable models and calculate Bayesian posterior odds. A typical result: one model has probability 1 − ε, the others probability less than ε, with ε very small (say .00001). Frequentist comparisons (for the nested case, where they work) seem not to give such drastic answers.

One interpretation: Bayesian methods are more powerful. Another: Bayesian methods produce unreasonably sharp conclusions.

15. INTERPRETING AND ACCOUNTING FOR THE LINDLEY PARADOX

The reason it's a problem is that these sharp results are often unstable: if, say, we introduce one more model, we may find that the new one is the one with probability 1 − ε, which makes our previous near-certainty look foolish. In other words, the problem is that the result is conditional on our being certain that the set of models compared contains the truth, that there are no other possibilities waiting in the wings.

Gelman, Carlin, Stern and Rubin suggest (almost) never doing model comparison: their view is that it should (almost) always be reasonable, when comparing models, to make each model a special case of a larger model with a continuously varying parameter.
In other words, when you think you want to use a {0, 1}-valued model-number parameter, figure out how to replace it with a continuously varying parameter, say γ, such that the two models arise as the special cases γ = 1 and γ = 0.
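The contrast in Section 13 can be illustrated with a toy nested pair (my construction, purely illustrative): model 1 fixes y_t ~ N(0, 1), while model 2 frees the mean. The frequentist cutoff for the likelihood-ratio statistic is a fixed chi-squared quantile, while the Schwarz criterion's implicit cutoff grows like (n_2 − n_1) log T:

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(2)
T = 200
y = rng.normal(0.0, 1.0, size=T)   # data generated from the restricted model

# Log likelihoods: restricted (mu = 0) vs. unrestricted (mu = sample mean).
ll_restricted = norm.logpdf(y, 0.0, 1.0).sum()
ll_full = norm.logpdf(y, y.mean(), 1.0).sum()
lr_stat = 2.0 * (ll_full - ll_restricted)   # equals T * ybar**2 here

# Frequentist test: reject the restriction if lr_stat exceeds the
# 1 - alpha quantile of chi-squared(n2 - n1) = chi-squared(1).
freq_cutoff = chi2.ppf(0.95, df=1)

# Schwarz criterion: the full model wins only if lr_stat > (n2 - n1) log T.
sc_cutoff = 1 * np.log(T)
```

At T = 200 the Schwarz cutoff log(200) ≈ 5.3 already exceeds the 5% chi-squared cutoff ≈ 3.84, and the gap widens as T grows, which is the sense in which Bayesian comparison penalizes the richer model more stringently in large samples.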