
A Bayesian perspective on GMM and IV

Christopher A. Sims, Princeton University
sims@princeton.edu

November 26, 2013

What is a Bayesian perspective?

A Bayesian perspective on scientific reporting views all data analysis as reporting of the shape of the likelihood in ways likely to be useful to the readers of the report. We examine moment-based inference from that perspective.

What is special about GMM and IV?

1. They are not derived from a likelihood.
2. They are used mostly with only asymptotic theory to guide statements about uncertainty of inference.
3. They are appealing because the assumptions they force us to make explicit (the moment conditions) are often intuitively appealing, or even emerge directly from respectable theory, while they allow us to sweep auxiliary assumptions under the asymptotic rug.
4. They lead to easy computations.

Which of these are actually pluses?

1 and 2 are defects, and some Bayesians have taken the view that IV and GMM methods are just mistakes, because their lack of likelihood foundations and small-sample distribution theory makes a Bayesian interpretation of them seem difficult.

But 4, computational convenience, is certainly an advantage and a reason that Bayesians should not dismiss IV and GMM.

And 3, while problematic, reflects a legitimate desire for inference that is not sensitive to assumptions that we sense are somewhat arbitrary.

Do asymptotics free us from assumptions?

The short answer: no.

The usual Gaussian asymptotics can be given either a Bayesian or a non-Bayesian (pre-sample) interpretation. The Bayesian interpretation is that asymptotics gives us approximations to likelihood shape (Kwan, 1998).

If we wish to characterize the implications of a particular sample, we may decide that asymptotic theory is likely to be a good guide, or we may not. This is a decision-theoretic judgment call. We know, usually, that there are conditions on the true model that would imply that asymptotics are a good guide for this sample. Actually using the asymptotic theory amounts to judging, without explicit discussion, that it is OK to assume these conditions are met.

Examples of the "freedom from assumptions" fallacy: kernel methods

Frequency-domain methods in time series, kernel methods for regression. Frequency-domain methods are non-parametric in comparison to ARMA models in the same mathematical sense that kernel methods are non-parametric in comparison to parametric polynomial regressions.

In time series, after an initial burst of enthusiasm, the fact that there is no true increase in generality from frequency-domain methods sank in (possibly in part because, for any application involving forecasting, the frequency-domain methods are inconvenient). In cross-section non-parametrics, one still finds econometricians who think that kernel methods require fewer assumptions.

Using kernel methods in a particular sample with a particular model obviously requires that the true spectral density or regression function not be badly distorted by convolution with the kernel. This greatly limits the class of admissible spectral densities or regression functions, in comparison with the class allowed by the asymptotic theory.

There is a well-defined topological sense in which it can be shown that a countable, dense class of finitely parameterized models is as large as the classes of functions allowed by the smoothness restrictions required for kernel-method asymptotics.
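As a concrete illustration of the convolution point (my own sketch, not from the talk), the following toy Nadaraya-Watson smoother flattens a regression function whose wiggles are on a finer scale than the kernel bandwidth; all names and numbers here are illustrative assumptions.

import numpy as np

def nw_smooth(x0, x, y, h):
    """Nadaraya-Watson estimate at points x0, Gaussian kernel with bandwidth h."""
    w = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 300))
f = np.sin(40 * x)                     # true regression function: fine-scale wiggles
y = f + 0.1 * rng.standard_normal(300)
grid = np.linspace(0, 1, 200)
fit = nw_smooth(grid, x, y, h=0.1)     # bandwidth comparable to the wiggle period
# The smoothed curve is nearly flat: convolution with the kernel has wiped out
# most of the variation in the true regression function.
print(np.ptp(fit), np.ptp(np.sin(40 * grid)))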

IV and GMM as assumption-free

They make assertions about the distributions of only certain functions of the data, not enough functions of the data to fully characterize even the first and second moments.

They limit their assertions to a few moments, and hence do not require a complete description of the distribution of even those functions of the data about which they do make assertions.

Embedding IV and GMM in a framework for inference

Required: an explicit set of assumptions that will let us understand what we are implicitly assuming in applying IV or GMM in a particular sample, and thereby also let us construct likelihoods or posterior distributions.

Approach 1: Find assumptions that make the asymptotic theory exact. This may involve using the sample moments on which IV or GMM are based in somewhat different ways.

Approach 2: Choose, from among the models consistent with the asymptotic theory, a model that is conservative, either in the sense that it is robust (a good approximation to a large class of models) or in the sense that it draws the weakest possible inferences from the data, given the explicit assumptions.

Examples of these approaches: Approach 1

LIML: Normality assumptions on the disturbances imply both a likelihood function and asymptotic theory that matches that of IV.

An analogue to LIML for GMM? If GMM is based on E[g(y_t, β) | z_t] = 0, then in the LIML case we are providing a (linear) model, not dependent on β, for the distribution of ∂g/∂β given z_t. For GMM it is not obvious that a linear model for this object is appropriate, and there are apparently many possible choices. There may be work on this issue of which I am not aware, but it seems ripe for exploration.
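For readers who want the moment condition in computational form, here is a minimal two-step GMM sketch for the linear IV case, with g(y_t, β) = y_t - x_t β interacted with the instruments z_t (my own illustration; the data-generating numbers and function names are assumptions, not from the talk).

import numpy as np

def gmm_iv(y, x, Z):
    """Two-step GMM for the scalar-beta moment condition E[z_t (y_t - x_t beta)] = 0."""
    T = len(y)
    W = np.linalg.inv(Z.T @ Z / T)                  # first-step weighting matrix
    Zx, Zy = Z.T @ x / T, Z.T @ y / T
    beta1 = (Zx @ W @ Zy) / (Zx @ W @ Zx)           # first-step estimate
    u = y - x * beta1
    S = (Z * u[:, None]).T @ (Z * u[:, None]) / T   # estimated variance of the moments
    beta2 = (Zx @ np.linalg.solve(S, Zy)) / (Zx @ np.linalg.solve(S, Zx))
    return beta2                                    # efficient second-step estimate

rng = np.random.default_rng(0)
T, gamma, beta_true = 200, np.array([1.0, 0.5]), 1.0
Z = rng.standard_normal((T, 2))
nu = rng.standard_normal(T)
eps = 0.6 * nu + 0.8 * rng.standard_normal(T)       # endogeneity: eps correlated with nu
x = Z @ gamma + nu
y = x * beta_true + eps
print(gmm_iv(y, x, Z))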

Examples continued: Approach 2

Empirical likelihood: Treat the joint distribution of data and parameters as concentrating all probability on the observed data points, and as generated from a uniform distribution over all probabilities on those points that satisfy the moment restrictions. The usual procedure is then to find the mode of this distribution jointly in the probabilities of the data points and the parameters. But one can also take a Bayesian perspective and integrate over the probabilities to get a marginal on the parameters. Of course the assumption of probabilities concentrated on the observed points would be a terrible approximation for certain purposes.
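As a small worked illustration of the empirical-likelihood construction (my own sketch, not from the talk), the following profiles the empirical log likelihood over β for the simplest moment condition, E[x - β] = 0, using the standard Lagrange-multiplier dual; the data and grid are illustrative assumptions.

import numpy as np
from scipy.optimize import brentq

def el_loglik(x, beta):
    """Profile empirical log likelihood at beta for the moment g_i = x_i - beta."""
    g = x - beta
    if g.min() >= 0 or g.max() <= 0:
        return -np.inf                    # beta outside the convex hull of the data
    # Dual problem: p_i = 1 / (n (1 + lam g_i)), lam solving sum g_i / (1 + lam g_i) = 0.
    lo, hi = -1 / g.max() + 1e-8, -1 / g.min() - 1e-8
    lam = brentq(lambda l: np.sum(g / (1 + l * g)), lo, hi)
    p = 1 / (len(x) * (1 + lam * g))
    return np.sum(np.log(p))

rng = np.random.default_rng(0)
x = rng.standard_normal(100) + 2.0
grid = np.linspace(1.5, 2.5, 41)
ll = [el_loglik(x, b) for b in grid]
print(grid[int(np.argmax(ll))])           # EL point estimate, close to the sample mean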

Zellner's BMOM: Assume it is exactly true that E[Z'u | y] = 0, where Z'u is the sample cross-moment matrix for the observed instrument matrix Z and the unobserved error vector u in the current sample, with the expectation taken over the distribution of the parameter given the data. Then find the maximal-entropy distribution for {β | data} under this assumption. This exactly justifies the usual IV asymptotics and thus is an example of both Approach 2 and Approach 1. Of course the usual models out of which IV arises do not justify the basic assumption made to generate BMOM, except as an asymptotic approximation.

Kitamura and Stutzer's maximal-entropy GMM: Treat the joint distribution of data and parameters as concentrated on the observed data points. Derive the maximal-entropy distribution of the parameters given the data under this assumption plus the moment restrictions. Closely related to empirical likelihood. More clearly conservative, but harder to give a complete Bayesian interpretation: it generates a posterior, but it is not clear what joint distribution of data and parameters would lead to this posterior via Bayes' rule.

Can we do better?

Both entropy maximization and searching for exact theory that supports the use of asymptotics are useful ideas. Even exact theory (like BMOM) that requires strange assumptions is helpful in making clear what we are implicitly assuming when we use the usual asymptotics.

But the BMOM assumption, and the assumption of probability concentrated on the observed points, are obviously unattractive.

Gaussianity does emerge from maximizing entropy subject to first and second moment restrictions, which suggests that where normality assumptions lead to tractable likelihoods consistent with the asymptotics, normality may be a conservative assumption.

However, entropy is a slippery concept. It is not invariant to transformations of the random variable. More reliable, in my view, is the notion of mutual information between random variables. This is the expected reduction in entropy of the distribution of one random variable after we observe another random variable. It is invariant to transformations, which is another way of saying that it depends only on the copula of the two jointly distributed variables.

We can imagine deriving conservative models that minimize the mutual information between data and parameters subject to moment restrictions. I have explored this idea a bit. It is clear that it will only work with proper priors, and that it will not deliver exactly standard GMM except as a limiting case.

Applying these ideas

IV and GMM have two widely recognized breakdown regions: weak instruments and overabundant instruments. We mistrust the implications of conventional asymptotic inference in these situations, but non-Bayesian procedures give us little guidance in characterizing our uncertainty. Likelihood description here provides better guidance about the nature of uncertainty.

A simple model

    y = xβ + ε = Zγβ + νβ + ε        (1)
    x = Zγ + ν                        (2)
    Var([νβ + ε, ν]') = Σ             (3)

(y and x are T × 1, Z is T × k), or equivalently

    y = Zθ + ξ                        (4)
    x = Zγ + ν                        (5)
    θ = γβ.                           (6)
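A quick numerical check of the two parametrizations (my own sketch, not from the talk): simulate (1)-(2) and verify that OLS in (4)-(5) gives reduced-form coefficients satisfying θ ≈ γβ; the specific numbers below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
T, k = 500, 2
beta, gamma = 1.0, np.array([1.0, 0.5])
Z = rng.standard_normal((T, k))
nu = rng.standard_normal(T)
eps = 0.5 * nu + rng.standard_normal(T)    # eps correlated with nu, so x is endogenous
x = Z @ gamma + nu                         # equation (2)
y = x * beta + eps                         # equation (1)

theta_hat = np.linalg.lstsq(Z, y, rcond=None)[0]   # OLS in (4)
gamma_hat = np.linalg.lstsq(Z, x, rcond=None)[0]   # OLS in (5)
print(theta_hat, gamma_hat * beta)                 # theta approximately gamma * beta, cf. (6)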

When we use the parametrization (1)-(2), the likelihood does not go to zero as β → ∞, no matter what the sample size. This can create problems for naive Bayesian approaches: MCMC not converging, integrating out nuisance parameters creating paradoxes. A flat prior on (θ, γ) with no rank restrictions does produce an integrable posterior if the sample size is not extremely small. It therefore seems promising to use Euclidean distance to define a metric on the reduced-rank submanifold of (θ, γ), then transform the flat prior (Lebesgue measure) on this submanifold to (β, γ) coordinates. This derivation does not in itself guarantee that posteriors under this improper prior will be proper, but it is a promising approach.
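For concreteness, here is one natural way to carry out the volume-element computation just described (my own sketch; the exact derivation behind the prior reported on the next slide may differ in detail). The submanifold is the image of the map Φ(β, γ) = (γβ, γ), and Lebesgue measure on it pulls back through the Jacobian of Φ:

D\Phi(\beta,\gamma) =
\begin{pmatrix} \gamma & \beta I_k \\ 0 & I_k \end{pmatrix},
\qquad
D\Phi'\,D\Phi =
\begin{pmatrix} \gamma'\gamma & \beta\gamma' \\ \beta\gamma & (1+\beta^2) I_k \end{pmatrix},

\det\!\bigl(D\Phi'\,D\Phi\bigr) = \|\gamma\|^2\,(1+\beta^2)^{k-1},
\qquad
\Pi(\beta,\gamma) \propto \sqrt{\det\!\bigl(D\Phi'\,D\Phi\bigr)} = \|\gamma\|\,(1+\beta^2)^{(k-1)/2}.

With a single instrument this reduces to |γ|, and with two instruments to ‖γ‖(1 + β²)^{1/2}, the form reported in (7) on the next slide.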

The improper prior on (β, γ) that emerges from this approach is

    Π(β, γ) ∝ ‖γ‖ (1 + β²)^{1/2}.    (7)

It does lead to proper posteriors. Even with this prior, however, the posteriors decline only at a polynomial rate in the tails, and the degree of the polynomial does not increase with sample size. This is in contrast with the posteriors under a flat prior in a linear regression model, where the tails decline at a polynomial rate that increases linearly with sample size.

Boomerang IV

Consider now what this means when we combine a prior that has Gaussian tails, or tails that decrease as a high-order polynomial, with the likelihood weighted by (7). Start with the prior mean and the likelihood peak lined up with each other, then let the likelihood peak move away from the prior mean while keeping the shapes of the likelihood and the prior pdf fixed. The posterior mean, median and mode move away from the prior mean (as expected) at first, but then reverse direction, coming back to coincide with the prior mean when the likelihood peak is very far from the prior mean. This is illustrated graphically in the figure.
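The effect is easy to reproduce numerically. Below is a minimal sketch (my own, not the code behind the figure) using the same ingredients as the figure: a t(9) prior with scale 1/3 and a Cauchy likelihood with scale 0.1. Here the prior is centered at zero and the likelihood center µ is moved out, which is equivalent up to relocation; the posterior mean first follows µ outward, then swings back toward the prior mean.

import numpy as np
from scipy import stats

beta = np.linspace(-20, 60, 200001)                   # fine grid over beta
prior = stats.t.pdf(beta, df=9, scale=1/3)            # t(9) prior centered at 0, scale 1/3

for mu in [0.5, 1, 2, 4, 8, 16]:
    like = stats.cauchy.pdf(beta, loc=mu, scale=0.1)  # Cauchy likelihood peaked at mu
    post = prior * like
    post /= post.sum()                                # normalize on the (uniform) grid
    print(mu, float((beta * post).sum()))             # posterior mean: moves out, then back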

[Figure: "Posteriors as prior mean shifts out, Cauchy-tail likelihood." Posterior as the distance of the MLE from the prior mean increases (likelihood curve and posteriors for µ = 1, 2, 4). Prior is t with 9 d.f., σ = 1/3. Likelihood is Cauchy, σ = 0.1, center 1.]

[Figure: Joint posterior pdf of structural β and reduced-form γ, flat prior. Z'Z = 4, Σ̂ = [0.67 0.33; 0.33 0.67], β̂ = γ̂ = 1.]

[Figure: Joint posterior pdf of structural β and reduced-form γ, under the γ-dependent prior (7).]

Weak instrument case

In a sample where the posterior is highly non-Gaussian and has substantial, slowly declining tails, even apparently very weak prior information can substantially affect inference. The graphs displayed here show the effects of imposing a N(0, 100I) prior on (β, γ) jointly, in a model and sample that imply an estimated β around 1 in magnitude, with substantial uncertainty. Even this weak prior has a dramatic effect on the posterior, largely by eliminating extremely spread-out tails. Certainly posterior means would be greatly affected by including such weak prior information.

[Figure: Histogram of the β posterior under the "RF flat" prior; the horizontal axis runs to roughly 2.5 × 10^4.]

Data-generating process: x = Z(1, 0.1)' + ε, y = Z(1, 0.1)' + u, with Z a 20 × 2 matrix and Z, ε, u all i.i.d. N(0, 1). Actual IV estimate β̂ = 1.5132 with asymptotic standard error 0.8412.
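A minimal reconstruction of this data-generating process and the conventional IV estimate (my own sketch, not the author's code; with a different random draw the point estimate will of course differ from the 1.5132 reported above):

import numpy as np

rng = np.random.default_rng(0)
T = 20
gamma = np.array([1.0, 0.1])                      # reduced-form coefficients
Z = rng.standard_normal((T, 2))
x = Z @ gamma + rng.standard_normal(T)            # x = Z gamma + eps
y = Z @ gamma + rng.standard_normal(T)            # y = Z gamma + u

# Conventional 2SLS / IV estimate of beta in y = x beta + error, instruments Z
xhat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # first-stage fitted values
beta_iv = (xhat @ y) / (xhat @ x)
u = y - x * beta_iv
se = np.sqrt((u @ u) / (T - 1) / (xhat @ xhat))   # conventional asymptotic standard error
print(beta_iv, se)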

[Figure: Histogram of the β posterior under the "RF flat" prior combined with the N(0, 100I) prior; the horizontal axis now runs from about -15 to 30.]

[Figure: Posterior β density, with (red) and without the prior. Note: densities estimated from a 300-nearest-neighbor calculation.]

[Figure: Scatter of the joint posterior on (β, γ) under the pure "RF flat" prior; the β axis is in units of 10^4.]

[Figure: Scatter of the joint posterior on (β, γ) with the N(0, 100I) prior imposed.]

Methods

To prepare these graphs I generated MCMC artificial samples of size 10,000, sampling successively from the conditional likelihoods of {β | γ, Σ}, {γ | β, Σ}, and {Σ | γ, β}, which are all of standard forms, then applying a Metropolis-Hastings step to reflect the influence of the prior. When the prior is not used, very large values of β occur in the sampling, and when β gets very large, the dependence of β on γ is very strong. As is well known, heavy dependence produces slow convergence of Gibbs samplers. The figures illustrate how the use of the prior improves the convergence properties of the Gibbs sampler, by eliminating the extremely large draws of β.
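The same sampling pattern is easy to see in a toy problem. The sketch below (my own illustration, not the author's code or model) draws each block from its conditional under a flat or reference prior, which has a standard form, and then uses a Metropolis-Hastings accept/reject step whose acceptance ratio reduces to a ratio of prior densities; that is how the prior's influence is reflected.

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(1.0, 2.0, size=30)                # toy data: y_i ~ N(mu, sigma^2)
n, ybar = len(y), y.mean()

def log_prior_mu(mu):                            # illustrative prior: mu ~ N(0, 10^2)
    return -0.5 * (mu / 10.0) ** 2

mu, sig2 = 0.0, 1.0
draws = []
for it in range(10_000):
    # Propose mu from its flat-prior conditional N(ybar, sigma^2/n). Because the proposal
    # is proportional to the conditional likelihood, the MH acceptance probability is
    # just the ratio of prior densities at the proposed and current values.
    mu_prop = rng.normal(ybar, np.sqrt(sig2 / n))
    if np.log(rng.uniform()) < log_prior_mu(mu_prop) - log_prior_mu(mu):
        mu = mu_prop
    # Draw sigma^2 from its conditional under the usual 1/sigma^2 reference prior:
    # sum of squared residuals divided by a chi-squared(n) draw.
    sig2 = np.sum((y - mu) ** 2) / rng.chisquare(n)
    draws.append((mu, sig2))

print(np.mean(draws, axis=0))                    # posterior means of (mu, sigma^2)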

References

Kwan, Y. K. (1998): "Asymptotic Bayesian analysis based on a limited information estimator," Journal of Econometrics, 88, 99-121.