Testing Restrictions and Comparing Models


Econ. 513, Time Series Econometrics, Fall 2000
Chris Sims

1. THE PROBLEM

We consider here the problem of comparing two parametric models for the data $X$, defined by pdf's $p(X \mid \theta)$ and $q(X \mid \phi)$, with $\theta$ ranging over a parameter space $\Theta$ and $\phi$ ranging over $\Phi$, the two parameter spaces being of possibly different dimension. A special case that we will discuss further is that in which the second ($q$) model is defined by restricting $\theta$ to a submanifold of $\Theta$ defined by the vector of restrictions $R(\theta) = 0$. We aim at forming posterior probabilities of the two models or, in the case of the model defined by restrictions, for the truth of the restriction $R(\theta) = 0$, conditional on the data.

It is easy to prescribe how to handle this problem in general terms. The full parameter space is really $\Theta \cup \Phi$. We will have some prior over it, which we can think of as built from conditional pdf's over $\Theta$ and $\Phi$ and prior probabilities on the two parameter spaces. The posterior probability can then be calculated by the usual formulas. However, in contrast to the usual situation where the prior is continuous (i.e. has a pdf), the effects of the prior do not disappear in large samples in this situation. It is therefore useful to examine whether there remains anything useful to say about large-sample approximations for this problem.

2. GAUSSIAN APPROXIMATION

In large samples, under rather general regularity conditions, the likelihood comes to dominate the prior if the parameter space is a subset of $\mathbb{R}^m$ with non-empty interior and the true value of the parameter is in the interior. In this case the use of a "flat prior" gives accurate conclusions. Furthermore, the likelihood takes on an approximately Gaussian shape in large samples, in the sense that a second-order Taylor expansion of the log likelihood in the neighborhood of its maximum gives accurate results if used to approximate the posterior pdf.[1] Even if the true model is not any of the pdf's $p(X \mid \theta)$ for $\theta \in \Theta$, under reasonable regularity conditions the likelihood concentrates in the neighborhood of a "pseudo-true value" in $\Theta$, and the quadratic approximation to the shape of the log likelihood becomes accurate in large samples.

Copyright 2000 by Christopher A. Sims. This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained for personal use or distributed free.

[1] An informal sketch of a proof is in Gelman, Carlin, Stern, and Rubin (1995, Appendix B). A careful proof is in Schervish (1995, Chapter 7.4). Both of these references include further references to the literature.
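As a small numerical check of the Gaussian-shape claim above (not part of the original notes; the Bernoulli model, seed, and sample sizes are purely illustrative), the sketch below compares a flat-prior posterior with the Gaussian implied by a second-order expansion of the log likelihood at the MLE. The total variation distance between the two shrinks as the sample grows:

    import numpy as np
    from scipy import stats
    from scipy.integrate import trapezoid

    rng = np.random.default_rng(0)
    for T in (20, 200, 2000):
        x = rng.binomial(1, 0.3, size=T)          # i.i.d. Bernoulli(0.3) data
        theta_hat = x.mean()                      # MLE of the success probability
        # Observed information: minus the second derivative of the log likelihood
        info = T / (theta_hat * (1 - theta_hat))
        grid = np.linspace(0.001, 0.999, 1999)
        loglik = x.sum() * np.log(grid) + (T - x.sum()) * np.log(1 - grid)
        post = np.exp(loglik - loglik.max())
        post /= trapezoid(post, grid)             # flat-prior posterior, normalized
        gauss = stats.norm.pdf(grid, theta_hat, info ** -0.5)
        tv = 0.5 * trapezoid(np.abs(post - gauss), grid)
        print(f"T={T:4d}: total variation distance = {tv:.4f}")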

For functions whose logs are quadratic (and which therefore are scaled versions of Gaussian pdf's) we can characterize their integrals as functions of their maxima and their second derivative matrices. This leads to some easily computed approximations to posterior probabilities, based on the height of the likelihood function at its maxima in the two spaces and on the second derivative matrices of the log likelihood at these maxima, which are minus the inverses of the usual classical asymptotic approximations to the covariance matrices of the MLE's. If the log likelihood is exactly quadratic when restricted to $\Theta$, then within $\Theta$ it has the form
$$\log(p(X \mid \theta)) = K - \tfrac{1}{2}(\theta - \hat\theta)' \Omega^{-1} (\theta - \hat\theta), \tag{1}$$
where $K = \log(p(X \mid \hat\theta))$ is the value of the log likelihood at its peak and $\hat\theta$ is the value of $\theta$ that maximizes $p(X \mid \theta)$. We can then rewrite it as a constant plus the log of a Gaussian pdf,
$$\log(p(X \mid \theta)) = K + \frac{m}{2}\log(2\pi) + \frac{1}{2}\log|\Omega| - \left[\frac{m}{2}\log(2\pi) + \frac{1}{2}\log|\Omega| + \frac{1}{2}(\theta - \hat\theta)' \Omega^{-1} (\theta - \hat\theta)\right], \tag{2}$$
where $m$ is the dimension of the space $\Theta$. Obviously, then,
$$\int_\Theta p(X \mid \theta)\, d\theta = e^K (2\pi)^{m/2} |\Omega|^{1/2}. \tag{3}$$
If the likelihood is concentrated on small enough subsets of $\Phi$ and $\Theta$ that the prior pdf on each space is approximately constant where the likelihood is non-trivially large, then we can apply (3) to approximate the integrals of the joint pdf of data and parameters over $\Theta$ and $\Phi$, respectively, with the data $X$ held fixed at the observed value, as
$$g(\hat\theta)\, p(X \mid \hat\theta)\, (2\pi)^{m/2} |\Sigma_\theta|^{1/2} \tag{4}$$
$$h(\hat\phi)\, q(X \mid \hat\phi)\, (2\pi)^{n/2} |\Sigma_\phi|^{1/2}, \tag{5}$$
where $n$ is the dimension of $\Phi$, $\Sigma_\theta$ and $\Sigma_\phi$ are the usual asymptotically justified estimates of the covariance matrix of the MLE (minus the inverse of the second derivative of the log likelihood) within each model's parameter space, and $g$ and $h$ are the conditional prior pdf's on $\Theta$ and $\Phi$, respectively.[2]

[2] This type of approximation is known as the method of Laplace and is discussed in more general form in Schervish (1995, section 7.4.3).
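As a numerical sanity check on (3) (our illustration, not the notes': a Bernoulli likelihood with $m = 1$), the Laplace formula can be compared with direct quadrature of the likelihood:

    import numpy as np
    from scipy.integrate import quad

    rng = np.random.default_rng(1)
    T = 200
    x = rng.binomial(1, 0.4, size=T)
    s = int(x.sum())

    def loglik(theta):
        return s * np.log(theta) + (T - s) * np.log(1 - theta)

    theta_hat = s / T
    K = loglik(theta_hat)                    # peak height of the log likelihood
    omega = theta_hat * (1 - theta_hat) / T  # minus inverse Hessian at the peak

    # Equation (3): integral of the likelihood ~ e^K (2 pi)^(m/2) |Omega|^(1/2)
    log_laplace = K + 0.5 * np.log(2 * np.pi * omega)
    val, _ = quad(lambda t: np.exp(loglik(t) - K), 1e-12, 1 - 1e-12,
                  points=[theta_hat])
    print("log Laplace approximation:", log_laplace)
    print("log exact integral:       ", K + np.log(val))

Working on the log scale (subtracting $K$ inside the integrand) avoids underflow, since the likelihood itself is astronomically small in level.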

Posterior odds on the two models $\Theta$ and $\Phi$ are then approximated as the ratio of prior probabilities of the two models multiplied by the ratio of the expressions (4) and (5).[3]

It is clear, then, that use of the asymptotic normal approximation will not make the choice of prior asymptotically irrelevant in computing posterior probabilities. The odds ratio between the two models will contain the ratio of prior probabilities times the ratio of conditional prior densities at the maxima, $g(\hat\theta)/h(\hat\phi)$, no matter how large the sample or how good the Gaussian approximation. However, if we do not aim at getting an asymptotically accurate odds ratio, but instead only an asymptotically accurate decision (that is, to determine accurately which model the odds ratio favors), then we can avoid dependence on the prior under some additional regularity conditions.

In models of i.i.d. data, or of stationary time series data, it is usually true that
$$T\Sigma_\theta \xrightarrow[T \to \infty]{P} \Omega_\theta, \tag{6}$$
where $\Omega_\theta$ is a fixed matrix, with a similar result for $\Sigma_\phi$. Letting $\mu(\Theta)$ be the prior probability of $\Theta$, we can then write the approximation to the log of the posterior odds in favor of $\Theta$ as
$$\log\left(\frac{p(X \mid \hat\theta)\,\mu(\Theta)\,g(\hat\theta)}{q(X \mid \hat\phi)\,(1 - \mu(\Theta))\,h(\hat\phi)}\right) + \frac{m - n}{2}\log(2\pi) + \frac{1}{2}\log\left(T^{-m}|\Omega_\theta|\right) - \frac{1}{2}\log\left(T^{-n}|\Omega_\phi|\right). \tag{7}$$
Also, for i.i.d. or time series models, $\log p(X \mid \theta)$ is a sum of similarly distributed random variables, so that under reasonable regularity conditions
$$\frac{1}{T}\log p(X \mid \hat\theta) \xrightarrow[T \to \infty]{P} \bar p, \tag{8}$$
again with a similar result holding for $q$. Every term in (7) is constant as $T$ increases, except for $\log\big(p(X \mid \hat\theta)/q(X \mid \hat\phi)\big)$ and the two terms at the end involving the $\Omega$'s. The odds ratio will, under usual regularity conditions, converge to infinity or zero as $T \to \infty$, so that eventually the terms that depend on $T$ dominate its behavior. Thus we can form an "approximate odds ratio" considering only the terms that vary with $T$, i.e.
$$\log(p(X \mid \hat\theta)) - \log(q(X \mid \hat\phi)) - \frac{m - n}{2}\log T. \tag{9}$$

[3] We could improve the order of approximation by taking account of the fact that $\partial g(\hat\theta)/\partial\theta$ and $\partial h(\hat\phi)/\partial\phi$ are not zero. This could be done by replacing $\hat\theta$ and $\hat\phi$ with the values of $\theta$ and $\phi$ that maximize the posterior pdf's, instead of those that maximize the likelihood, or else by computing $\partial g(\hat\theta)/\partial\theta$ and $\partial h(\hat\phi)/\partial\phi$ and using them to make a first-order correction to $\hat\theta$ and $\hat\phi$ as estimates of the posterior modes. This would in turn imply a second-order (in $\theta - \hat\theta$ and $\phi - \hat\phi$) correction to the heights of the posterior pdf's at their maxima.
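To fix ideas, here is a minimal sketch of (9) for a nested pair of Gaussian regression models (simulated data; the design, seed, and names are illustrative and not from the original notes). Positive values favor the unrestricted model $\Theta$:

    import numpy as np

    def max_loglik(y, X):
        """Maximized Gaussian log likelihood of y = X b + e, e ~ N(0, s2 I)."""
        T = len(y)
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        s2 = np.mean((y - X @ b) ** 2)        # ML (not df-corrected) variance
        return -0.5 * T * (np.log(2 * np.pi * s2) + 1)

    rng = np.random.default_rng(2)
    T = 500
    z = rng.normal(size=(T, 2))
    y = 1.0 + 0.5 * z[:, 0] + rng.normal(size=T)  # second regressor irrelevant

    X_big = np.column_stack([np.ones(T), z])          # Theta: m = 4 (3 coefs + s2)
    X_small = np.column_stack([np.ones(T), z[:, 0]])  # Phi:   n = 3
    m, n = 4, 3

    sc = max_loglik(y, X_big) - max_loglik(y, X_small) - 0.5 * (m - n) * np.log(T)
    print("Schwarz criterion (9):", sc)  # typically negative here: favors Phi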

This criterion will converge in probability to $+\infty$ if only $\Theta$ contains the true model and to $-\infty$ if only $\Phi$ does. It is often called the Schwarz criterion, as Schwarz (1978) introduced it.[4]

The Schwarz criterion is fairly widely applied. Its popularity stems from the fact that it can be computed without any consideration of what a reasonable prior might be. But it is apparent from (7) that unless the Schwarz criterion is extremely large or small, it is likely to differ substantially from the odds ratio, even when the sample size is large enough to make the Gaussian approximation to the log likelihood work well. Serious applied work ought to include in the reporting of results a consideration of what might be a reasonable specification of the prior, as well as a consideration of whether the normal approximation to the likelihood is accurate in the sample at hand.

Note that we could avoid having to think about the prior without dropping so many terms from (7). Dropping only those terms that depend on the prior, instead of all those terms that fail to grow with $T$, leads to
$$\log\left(\frac{p(X \mid \hat\theta)}{q(X \mid \hat\phi)}\right) + \frac{m - n}{2}\log(2\pi) + \frac{1}{2}\log\frac{|\Sigma_\theta|}{|\Sigma_\phi|}. \tag{10}$$
This expression makes it clear that it is the precision of the estimates that produces the dependence on $T$ in the Schwarz criterion. The expression is more robust: it converges to the same limit as the Schwarz criterion when the Schwarz criterion is justified, but produces asymptotically correct decisions in certain situations where the Schwarz criterion does not. For example, in unit root time series models the asymptotic Gaussian approximation to the shape of the likelihood is accurate,[5] but $T\Sigma_\theta$ fails to converge in probability to a fixed matrix, as (6) requires. In this situation the Schwarz criterion may not lead to asymptotically correct decisions, whereas direct use of (10) does.

[4] However, it was implicit in earlier work on Laplace approximations, and Schwarz considered only its application to cases where $\Phi$ is a lower-dimensional subset of $\Theta$.

[5] This was shown by Kim (1994).
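Continuing the same illustrative nested-regression example, (10) can be computed from the estimated covariance matrices $\Sigma_\theta$ and $\Sigma_\phi$ rather than from $\log T$. This is a sketch under our assumptions (in particular, the standard block-diagonal asymptotic covariance of $(\hat b, \hat\sigma^2)$ in a Gaussian regression); in this stationary, well-behaved setting it comes out close to the Schwarz criterion:

    import numpy as np

    def fit(y, X):
        """Maximized log likelihood and log|Sigma| for y = X b + e, e ~ N(0, s2 I)."""
        T = len(y)
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        s2 = np.mean((y - X @ b) ** 2)
        loglik = -0.5 * T * (np.log(2 * np.pi * s2) + 1)
        # Asymptotic covariance of the MLE: s2 (X'X)^{-1} for b, 2 s2^2 / T for s2
        logdet = (np.linalg.slogdet(s2 * np.linalg.inv(X.T @ X))[1]
                  + np.log(2 * s2 ** 2 / T))
        return loglik, logdet

    rng = np.random.default_rng(2)
    T = 500
    z = rng.normal(size=(T, 2))
    y = 1.0 + 0.5 * z[:, 0] + rng.normal(size=T)

    ll_t, ld_t = fit(y, np.column_stack([np.ones(T), z]))        # Theta: m = 4
    ll_p, ld_p = fit(y, np.column_stack([np.ones(T), z[:, 0]]))  # Phi:   n = 3
    m, n = 4, 3

    crit10 = ll_t - ll_p + 0.5 * (m - n) * np.log(2 * np.pi) + 0.5 * (ld_t - ld_p)
    crit9 = ll_t - ll_p - 0.5 * (m - n) * np.log(T)
    print("expression (10):      ", crit10)
    print("Schwarz criterion (9):", crit9)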

3. INFINITELY MANY PARAMETERS

There are a number of classes of cases in which it is natural to think of a nested sequence of models, or a model with a countable infinity of parameters, as containing the truth: for example, AR or MA models with different numbers of lags, or models that fit a polynomial time trend or that use polynomials to approximate a nonlinear regression function. These situations can be treated as involving a Bayesian prior over the infinite sequence of models or over the infinite sequence of parameters. Obviously, in the case of a sequence of models, we have to have probabilities on models that eventually get small as we go out the sequence; otherwise we could not have probability one on the whole sequence of models. Correspondingly, when we have infinite sequences of parameters we usually have to assume that the parameters far out in the sequence are very likely to be unimportant.

Consider the case of a simple autoregression
$$y_t = \sum_{s=1}^{k} \beta_s y_{t-s} + \varepsilon_t. \tag{11}$$
Choice of $k$ here does involve choice among models, but it is clear that our priors on the sequence of $\beta$'s probably vary systematically with $k$. Not only are we likely to think that large values of $k$ are less likely than smaller values, at least for large enough $k$, but we are also likely to think that $\beta_s$ for large $s$, within a model with a given $k$, are probably smaller in absolute value than values of $\beta_s$ for smaller $s$. These beliefs might be justified by experience with fitting models to many economic time series. But they can also be justified by supposing that we put some substantial probability (maybe even 1) on the region of the parameter space that implies stationarity or unit-root behavior, and little or no probability on explosive models. Most $\beta$ sequences in these regions will have the properties we have suggested are natural.

In a distributed lag model
$$y_t = \sum_{s=0}^{k} \gamma_s x_{t-s} + \eta_t, \tag{12}$$
our prior is likely to be that the coefficients $\gamma_s$ for large $s$ are smaller than those for small $s$, because otherwise our priors would imply that the variance of the explained part of $y_t$ grows without bound with increasing $k$.

We don't have time to cover these cases in any generality, but consider the case, for example, where we have a distributed lag model and, within each model $k$, we assume exponential decay of the $\gamma_s$, i.e.
$$\begin{pmatrix} \gamma_0 \\ \vdots \\ \gamma_k \end{pmatrix} \sim N\left(0,\; \begin{pmatrix} \theta & 0 & \cdots & 0 \\ 0 & \theta\lambda & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \theta\lambda^k \end{pmatrix}\right). \tag{13}$$
It is not hard to see that this kind of prior pdf will produce systematic behavior in the $\log|\Omega|$ terms in (7), and in particular that if we look across $k$'s they will contribute a component that varies as $O(k \log \lambda)$. Since it is the inverse of $\Omega$ that enters the posterior pdf and $\lambda < 1$, these terms increase with $k$.
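A quick sketch of the prior covariance matrix in (13) makes this behavior concrete (the values of $\theta$ and $\lambda$ below are illustrative, and reading the $O(k \log \lambda)$ component as the increment across successive $k$'s is our interpretation): moving from $k-1$ to $k$ adds $\log\theta + k\log\lambda$ to the log determinant, which for $\lambda < 1$ is ever more negative, so the log determinant of the inverse grows with $k$.

    import numpy as np

    theta, lam = 1.0, 0.7                    # illustrative scale and decay rate

    def prior_cov(k):
        """Prior covariance of (gamma_0, ..., gamma_k) in (13)."""
        return np.diag(theta * lam ** np.arange(k + 1))

    for k in range(1, 6):
        inc = (np.linalg.slogdet(prior_cov(k))[1]
               - np.linalg.slogdet(prior_cov(k - 1))[1])
        print(f"k={k}: log|V_k| - log|V_(k-1)| = {inc:+.3f}   "
              f"(log theta + k log lambda = {np.log(theta) + k * np.log(lam):+.3f})")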

Since these terms do not vary with $T$, they will be dominated by the SC terms as $T \to \infty$. But in a given sample, they will imply less of a penalty on large models than would be apparent from approximations to posterior odds (like the SC, or (10)) that ignore the contribution of priors. This may account for the reputation of the SC that it tends to favor small models more strongly than is reasonable. On the other hand, we must also recognize that the prior weights on models must decrease with $k$ if they are to sum to a finite number, and this will work in the opposite direction, disfavoring larger models. If the weights on models decline only as $O(e^{-k})$ or $O(k^{-p})$, however, the effect above from exponentially declining sizes of coefficients will dominate.

4. COMPARISON TO CLASSICAL METHODS

Classical asymptotic methods produce results only for the case where $\Phi$ is defined as $\{\theta \in \Theta : R(\theta) = 0\}$ for a function $R$ that takes values in $\mathbb{R}^{m-n}$. It is then conventional procedure to use the fact that, under the null hypothesis that $R(\theta) = 0$, the likelihood ratio (LR) statistic $2\log\big(p(X \mid \hat\theta)/q(X \mid \hat\phi)\big)$ is in large samples distributed approximately as $\chi^2(m - n)$. Usually some standard significance level, say $\alpha = .01$ or $\alpha = .05$, is chosen, and the null hypothesis is rejected if the LR statistic exceeds $\chi^2_\alpha(m - n)$.

Because both the Schwarz criterion and the standard likelihood ratio test compare the LR statistic to a critical value, the Schwarz criterion is always equivalent, in any given sample, to a classical likelihood ratio test for some $\alpha$. However, the value to which the Schwarz criterion compares the LR statistic is increasing in $T$, while the classical test, if $\alpha$ is kept at a constant level, compares it to a fixed value. In fact it is obvious by construction that the choice of $\Theta$ vs. $\Phi$ does not converge in probability to the truth as $T \to \infty$ if the choice is made with a conventional LR test with fixed $\alpha$: by definition, such a test in large samples has probability $\alpha$ of rejecting a true null hypothesis, i.e. of giving the wrong decision, even for very large samples. The Schwarz criterion (and some other variants on it that have been proposed) instead gives a probability of a wrong decision that goes to zero as $T \to \infty$.

The Bayesian approach then suggests that significance levels for tests should generally be set more stringently (lower $\alpha$'s) in large samples. Though there is no classical statistical argument for doing so, applied workers do in fact tend to use lower $\alpha$'s in very large samples, or indeed, as (10) would suggest, in any context where estimated standard errors are very small.
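The contrast can be made concrete with a short sketch (the $\alpha$ and the number of restrictions are illustrative, not from the notes): the Schwarz criterion implicitly rejects the restriction when the LR statistic exceeds $(m - n)\log T$, a threshold that grows with the sample, so the significance level it implies shrinks toward zero as $T$ grows:

    import numpy as np
    from scipy import stats

    df = 1                                   # m - n, number of restrictions
    alpha = 0.05
    print(f"fixed-alpha critical value: {stats.chi2.ppf(1 - alpha, df):.2f}")
    for T in (50, 500, 5000, 50000):
        sc_threshold = df * np.log(T)        # LR threshold implied by (9)
        implied_alpha = stats.chi2.sf(sc_threshold, df)
        print(f"T={T:6d}: Schwarz threshold = {sc_threshold:6.2f}, "
              f"implied alpha = {implied_alpha:.1e}")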

REFERENCES

GELMAN, A., J. B. CARLIN, H. S. STERN, AND D. B. RUBIN (1995): Bayesian Data Analysis. Chapman and Hall, London.

KIM, J. Y. (1994): "Bayesian Asymptotic Theory in a Time Series Model with a Possible Nonstationary Process," Econometric Theory, 10(3), 764-773.

SCHERVISH, M. J. (1995): Theory of Statistics, Springer Series in Statistics. Springer, New York.

SCHWARZ, G. (1978): "Estimating the Dimension of a Model," Annals of Statistics, 6, 461-464.