Model comparison. Christopher A. Sims Princeton University October 18, 2016


ECO 513 Fall 2008
Model comparison
Christopher A. Sims, Princeton University, sims@princeton.edu
October 18, 2016

© 2016 by Christopher A. Sims. This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained for personal use or distributed free.

Model comparison as estimating a discrete parameter

Data $Y$; models 1 and 2 with parameter vectors $\theta_1, \theta_2$ of dimensions $m_1, m_2$. The models are $p_1(Y \mid \theta_1)$ and $p_2(Y \mid \theta_2)$.

We can treat them as a single model with pdf $q(Y \mid \theta_1, \theta_2, \gamma)$, where the parameter space for $(\theta_1, \theta_2)$ is some subset (or the whole) of $\mathbb{R}^{m_1+m_2}$ and that for $\gamma$, the model number, is $\{1, 2\}$. The posterior probability over the two models is then just the marginal distribution of the $\gamma$ parameter, i.e. the posterior p.d.f. with $\theta_1, \theta_2$ integrated out.

Marginal data densities, Bayes factors

The kernel of the probability distribution over model $i$ is therefore
$$\int p_i(Y \mid \theta_i)\,\pi(\theta_1, \theta_2, i)\,d\theta_1\,d\theta_2.$$
The ratios of these kernel weights are sometimes called Bayes factors. The integral values themselves are sometimes called marginal data densities. The posterior probabilities are the ratios of the mdd's to their sum, i.e. the kernel weights normalized to sum to one.
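As a concrete check on these definitions, here is a minimal sketch (an illustration of ours, not from the slides) in an assumed binomial setting where the integrals are available in closed form: model 1 puts a uniform prior on the success probability, model 2 fixes it at 1/2.

```python
from math import comb, lgamma, log, exp

# Assumed example: y successes in n Bernoulli trials.
# Model 1: theta ~ Uniform(0,1).  Model 2: theta fixed at 1/2 (no free parameter).
n, y = 20, 14

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# mdd_i = integral of p_i(Y | theta_i) * prior(theta_i) d theta_i
log_mdd1 = log(comb(n, y)) + log_beta(y + 1, n - y + 1)  # equals log(1/(n+1))
log_mdd2 = log(comb(n, y)) + n * log(0.5)

bayes_factor = exp(log_mdd1 - log_mdd2)   # ratio of the kernel weights
# With equal prior model probabilities, posterior probabilities are the
# mdd's normalized to sum to one:
p1 = bayes_factor / (1 + bayes_factor)
print(bayes_factor, p1)
```

The uniform-prior mdd collapses to $1/(n+1)$ here, a standard conjugate result, so the Bayes factor can be verified by hand.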

Notes

Often the prior over $\theta_1, \theta_2$ makes them independent, in which case only the prior pdf of $\theta_i$ enters the formula for model $i$.

The prior matters here, even asymptotically. With independent priors, assuming the likelihood concentrates asymptotically in the neighborhood of a point $\bar\theta_i$ for each model $i$, a change in $\pi(\bar\theta_i)/\pi(\bar\theta_j)$ always has a directly proportional effect on the posterior odds, even as sample size goes to infinity.

It is still true in a sense that the prior doesn't matter: the posterior odds favoring the true model will go to infinity, regardless of the prior. But in any sample in which the odds ratio turns out to be of modest size, it can shift to favor one model or the other based on reasonable variations in the prior.

This discussion generalizes directly to cases with any finite number of models.

Calculating mdd's

This is just a question of how to evaluate an integral. For some cases, including reduced-form VAR's with conjugate priors, this can be done analytically. Suppose, though, that we can't evaluate the integral analytically, but can generate a sample from the posterior pdf.

Case 1: We can generate a sample from the whole posterior, over both the continuous and discrete components. Then the posterior model probabilities are estimated just by counting the relative frequencies of $\gamma = 1$ vs. $\gamma = 2$ in the sample.

Case 2: We can generate a sample from either model's conditional posterior on $\theta_i$, but not a sample from the joint $(\theta_1, \theta_2, \gamma)$ posterior.

Case 2

We have a sample $\{\theta_i^j,\ j = 1, \ldots, N\}$ of draws from the posterior on $\theta_i$ conditional on model $i$ being correct. We also assume here that the priors on $\theta_1$ and $\theta_2$ are independent.

Form $\{k(Y, \theta_i^j)\} = \{p_i(Y \mid \theta_i^j)\,\pi(\theta_i^j)\}$. This is the kernel of the posterior conditional on model $i$ being the truth. Pick a weighting function $g(\theta_i)$ that integrates to one over $\theta_i$-space. Form
$$\frac{1}{N} \sum_{j=1}^N \frac{g(\theta_i^j)}{k(Y, \theta_i^j)}.$$

This converges to ( k(y, θ i )dθ i ) 1, because E [ ] g(θi ) k(y, θ i ) = g(θ i ) k(y, θ i ) q(θ i Y, i)dθ i g(θ i ) k(y, θ i ) = dθ i k(y, θ i ) k(y, θi )dθ i 7

Choosing g

The method we have described is called the modified harmonic mean method. The original idea, called the harmonic mean method, was to use $\pi(\cdot)$ as the weighting function. Since $k(Y, \theta_i) = p_i(Y \mid \theta_i)\,\pi(\theta_i)$, this amounted to simply taking the sample mean of $1/p_i(Y \mid \theta_i)$, i.e. of one over the likelihood.

So long as $g$ integrates to one, the expectation of $g/(p_i \pi_i)$ under the posterior density on $\theta_i$ exists and is finite. We already proved this.

But there is no guarantee that its second moment is finite. For example, if $g$ declines slowly as $\theta_i$ goes to infinity, i.e. if $g$ has fat tails while $k$ does not, then $g/k$ will be unbounded. It will have finite expectation, but only because the very large values of $g/k$ that will eventually occur do so rarely, since we are sampling from the thin-tailed $k$ kernel. The rapidly declining $k$ kernel offsets the $k$ in the denominator of $g/k$; it is not enough to offset the $k^2$ in the denominator of $g^2/k^2$. So if $\lim_{\theta_i \to \infty} g^2/k > 0$, $g/k$ has infinite second moment. This means the usual measures of uncertainty, based on variances, do not apply, and convergence of the sample mean to its expected value will be slow.

Geweke's suggested g

Convergence will be more rapid the closer $g/k$ is to a constant. Geweke suggested taking $g$ to be the normal approximation to the posterior density at its peak $\hat\theta_i$, i.e. a
$$N\!\left(\hat\theta_i,\ \left(-\frac{\partial^2 \log k(Y, \theta_i)}{\partial \theta_i\,\partial \theta_i'}\right)^{-1}\right)$$
density, but to truncate it at $(\theta_i - \hat\theta_i)'\,V^{-1}\,(\theta_i - \hat\theta_i) = \kappa$ for some $\kappa$, where $V$ is the asymptotic covariance matrix. The integral of the normal density over this truncated region can be looked up in a chi-squared distribution table, so the density can easily be rescaled to integrate to exactly one over this region.

The idea is that $g$ should be close to $k$ (up to scale) near the peak, while the truncation avoids any possibility that $g/k$ becomes unbounded in the tails.
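A small sketch of this construction (function and variable names are ours, not from the slides), using the chi-squared CDF in place of a table to rescale the truncated normal:

```python
import numpy as np
from scipy.stats import chi2, multivariate_normal

# Sketch of Geweke's truncated-normal weighting function.
# theta_hat: posterior peak; V: asymptotic covariance; kappa: truncation level.
def make_truncated_normal_g(theta_hat, V, kappa):
    d = len(theta_hat)
    Vinv = np.linalg.inv(V)
    mvn = multivariate_normal(mean=theta_hat, cov=V)
    mass = chi2.cdf(kappa, df=d)   # normal probability inside the ellipsoid

    def g(theta):
        quad = (theta - theta_hat) @ Vinv @ (theta - theta_hat)
        if quad > kappa:
            return 0.0                 # truncated outside the ellipsoid
        return mvn.pdf(theta) / mass   # rescaled to integrate to one inside
    return g

# Example: 2 parameters, truncating at the 95% chi-squared(2) level.
theta_hat = np.array([0.0, 0.0])
V = np.array([[1.0, 0.3], [0.3, 2.0]])
kappa = chi2.ppf(0.95, df=2)
g = make_truncated_normal_g(theta_hat, V, kappa)
print(g(theta_hat), g(theta_hat + 10.0))
```

Dividing by the chi-squared mass inflates the density slightly inside the ellipsoid, exactly compensating for the probability cut off in the tails.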

Remaining problems with g choice

Geweke's idea often works well, but not always. The tails are not the only place where $g/k$ might become unbounded. If there is a point $\theta^*$ at which $k(\theta^*) = 0$, and if $k$ is twice-differentiable at that point while $g$ is continuous and positive there, then $g/k$ has infinite variance under the posterior density. In high-dimensional models it may be difficult to know whether there are points with $k = 0$.

With, say, 30 parameters, the typical draw of $\theta_i^j$ will have $(\theta_i^j - \hat\theta_i)'\,V^{-1}\,(\theta_i^j - \hat\theta_i) \approx 30$ (because this quantity has a chi-squared(30) distribution). But this means that the density at $\theta_i = \hat\theta_i$ is about $e^{15}$ times larger than the density at a typical draw.

Of course if the ratio of $k$ at the peak to $k$ at a typical draw is also of this size, this causes no problem, but obviously there is tremendous room for $g/k$ to vary, and hence for convergence of the estimate of the Bayes factor to be slow.

There are ways to do better if sampling $g/k$ is not working well: bridge sampling, for example.

An important dumb idea: What about just taking sample averages of $p(Y \mid \theta_i^j)$ over the simulated posterior distribution? (Important because every year one or more students does this on an exam or course paper.)

An identity that provides methods

If $q_1(\theta)$ and $q_2(\theta)$ are positive, integrable functions on the same domain (i.e. they can be thought of as unnormalized probability densities), with $z_i = \int q_i(\theta)\,d\theta$, and if $\alpha(\theta)$ is any function such that $0 < \int \alpha(\theta)\,q_1(\theta)\,q_2(\theta)\,d\theta < \infty$, then
$$\frac{\displaystyle \int \frac{q_1(\theta)\,q_2(\theta)\,\alpha(\theta)}{z_2}\,d\theta}{\displaystyle \int \frac{q_2(\theta)\,q_1(\theta)\,\alpha(\theta)}{z_1}\,d\theta} = \frac{E_2[q_1 \alpha]}{E_1[q_2 \alpha]} = \frac{z_1}{z_2}.$$

Specific methods

importance sampling: $z_2 = 1$ (the importance density $q_2$ is normalized), $\alpha = 1/q_2$, so $z_1 = E_2[q_1/q_2]$. Does not use MCMC draws. Blows up if $q_1/q_2$ is huge for some $\theta$'s.

modified harmonic mean: $z_2 = 1$, $\alpha = 1/q_1$, so $z_1 = 1/E_1[q_2/q_1]$. Uses only MCMC draws. Blows up if $q_2/q_1$ is huge for some $\theta$'s.

bridge sampling: Pick $\alpha$ so both $q_1\alpha$ and $q_2\alpha$ are bounded, e.g. $\alpha = 1/(q_1 + q_2)$. Uses draws from both $q_1$ and $q_2$.

Optimal α

With the same number of draws from $q_1$ and $q_2$, the optimal choice is
$$\alpha = \frac{1}{z_1 q_2 + z_2 q_1}.$$
Since we don't know $z_1/z_2$, this is not directly a help. But if our initial guess is off, we can update it and repeat with the new $z_1/z_2$, re-using the old draws from $q_1$ and $q_2$.
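A sketch of this iterative scheme (our illustration), using assumed toy kernels with known normalizing constants so the answer can be checked. Since $\alpha$ only matters up to scale, $1/(z_1 q_2 + z_2 q_1)$ can be replaced by $1/(r\,q_2 + q_1)$ with $r = z_1/z_2$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy unnormalized densities with known normalizers (assumed example):
# q1 = N(0,1) kernel, z1 = sqrt(2*pi); q2 = N(1,2) kernel, z2 = sqrt(4*pi).
q1 = lambda t: np.exp(-0.5 * t ** 2)
q2 = lambda t: np.exp(-0.25 * (t - 1.0) ** 2)
z1_true, z2_true = np.sqrt(2 * np.pi), np.sqrt(4 * np.pi)

x1 = rng.normal(0.0, 1.0, 100_000)             # draws from normalized q1
x2 = rng.normal(1.0, np.sqrt(2.0), 100_000)    # draws from normalized q2

# Iterate: alpha is optimal only up to scale, so use alpha = 1/(r*q2 + q1)
# with r the current guess of z1/z2, then update r from the identity.
r = 1.0
for _ in range(50):
    alpha1 = 1.0 / (r * q2(x1) + q1(x1))
    alpha2 = 1.0 / (r * q2(x2) + q1(x2))
    r = np.mean(q1(x2) * alpha2) / np.mean(q2(x1) * alpha1)

print(r, z1_true / z2_true)   # the two should agree closely
```

The fixed point of the iteration is the self-consistent estimate of $z_1/z_2$; note that each update reuses the same two sets of draws, as the slide describes.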

The Schwarz Criterion

In large samples, under fairly general regularity conditions, the posterior density is well approximated as proportional to a $N(\hat\theta, V)$ density, where $V$ is minus the inverse of the second-derivative matrix of the log posterior density, evaluated at the MLE $\hat\theta$. We know the integral of a normal density, so if the approximation is working well, we will find that the integral of the posterior density kernel is approximately
$$(2\pi)^{n/2}\,|V|^{1/2}\,p(Y \mid \hat\theta)\,\pi(\hat\theta),$$
where $n$ is the length of $\theta$.

V convergence

In a stationary model, $TV \xrightarrow{a.s.} \bar V$, where $\bar V$ is a constant matrix. This follows because
$$\frac{1}{T} \log\big(p(Y \mid \theta)\,\pi(\theta)\big) \approx \frac{1}{T} \sum_{t=1}^T \log\big(p(y_t \mid \{Y_s, s < t\}, \theta)\big) \quad (1)$$
$$\frac{1}{T} \frac{\partial^2 \log\big(p(Y \mid \theta)\,\pi(\theta)\big)}{\partial\theta\,\partial\theta'} \xrightarrow{a.s.} E\left[\frac{\partial^2 \log\big(p(y_t \mid \{Y_s, s < t\}, \theta)\big)}{\partial\theta\,\partial\theta'}\right] \quad (2)$$
and $\bar V$ is just minus the inverse of this limiting second-derivative matrix.

Of course, above we have invoked ergodicity of the $Y$ process and also assumed that $y_t$ depends on $y_{t-s}$ for only finitely many lags $s$, so that after the first few observations all the terms in the sum that makes up the log likelihood are of the same form.

The Schwarz Criterion, II

Now we observe that, since $\log|V| = \log|\bar V / T| = -n \log T + \log|\bar V|$ (where $n$ is the length of $\theta$), the log of the approximate value for the integrated density is
$$\frac{n}{2} \log(2\pi) - \frac{n}{2} \log T + \frac{1}{2} \log|\bar V| + \log\big(k(Y, \hat\theta)\big).$$
Dominating terms when taking the difference across two models:
$$\log\big(k_1(Y, \hat\theta_1)\big) - \log\big(k_2(Y, \hat\theta_2)\big) - \frac{n_1 - n_2}{2} \log T.$$
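A numeric sketch of this comparison (our example, not from the slides) for an assumed nested Gaussian pair, with the mean fixed at zero under model 1 and free under model 2, showing the LR statistic alongside the Schwarz log-odds approximation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed nested pair: Model 1 fixes mu = 0 (n1 = 0 free parameters);
# Model 2 estimates mu (n2 = 1).  Data generated from Model 1, variance known.
T = 1000
yobs = rng.normal(0.0, 1.0, size=T)

def gauss_loglik(data, mu):
    return -0.5 * data.size * np.log(2 * np.pi) - 0.5 * ((data - mu) ** 2).sum()

ll1 = gauss_loglik(yobs, 0.0)           # restricted model (no free parameters)
ll2 = gauss_loglik(yobs, yobs.mean())   # unrestricted MLE

lr_stat = 2 * (ll2 - ll1)   # frequentist LR statistic, chi2(1) under the null
# Schwarz dominating terms for the log odds of model 1 over model 2
# (ignoring the O(1) terms): log k1 - log k2 - ((n1 - n2)/2) * log T.
sc_log_odds = (ll1 - ll2) + 0.5 * np.log(T)
print(lr_stat, sc_log_odds)
```

With $T = 1000$ the Schwarz penalty is $\tfrac12 \log T \approx 3.45$, so the restricted model is favored unless the LR statistic exceeds about $6.9$, well above the 5% chi-squared(1) critical value of $3.84$, which is the penalization the next slide discusses.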

Comments on the SC

The formula applies whether the models are nested or not.

Frequentist model comparison, for the special case where one model is just a restriction of the other, treats the difference of log likelihoods, times 2, as chi-squared($n_1 - n_2$) and rejects the restricted model if this statistic exceeds the $1 - \alpha$ quantile of the chi-squared distribution. The Schwarz criterion compares this statistic to a threshold that increases not just with $n_1 - n_2$, but also with $\log T$.

In other words, Bayesian model comparison penalizes more richly parameterized models in large samples, and does so more stringently (relative to a frequentist likelihood ratio test) the larger the sample.

The Lindley Paradox

Take a small collection of reasonable models and calculate Bayesian posterior odds. A typical result: one model has probability $1 - \varepsilon$, the others probability less than $\varepsilon$, with $\varepsilon$ very small (say .00001). Frequentist comparisons (for the nested case, where they work) seem not to give such drastic answers.

One interpretation: Bayesian methods are more powerful. Another: Bayesian methods produce unreasonably sharp conclusions.

Interpreting and accounting for the Lindley Paradox

The reason it's a problem is that these sharp results are often unstable: if, say, we introduce one more model, we may find that the new one is the one with $1 - \varepsilon$ probability, which makes our previous near-certainty look foolish. In other words, the problem is that the result is conditional on our being certain that the set of models compared contains the truth, that there are no other possibilities waiting in the wings.

Gelman, Carlin, Stern and Rubin suggest (almost) never doing model comparison: their view is that it should (almost) always be reasonable, when comparing models, to make each model a special case of a larger model with a continuously varying parameter.

In other words, when you think you want to use a $\{0, 1\}$-valued model-number parameter, figure out how to replace it with a continuously varying parameter, say $\gamma$, such that the two models arise as $\gamma = 1$ and $\gamma = 0$.