Inference with few assumptions: Wasserman's example

Christopher A. Sims
Princeton University
sims@princeton.edu

October 27, 2007

Types of assumption-free inference

- A simple procedure or set of statistics is proposed. It is shown to have good properties under very weak assumptions about the model. Examples: OLS, IV, GMM, differencing out individual effects.
- An infinite-dimensional parameter space, or a nuisance-parameter vector that grows with sample size, is explicit. A procedure that explicitly accounts for the infinite-dimensionality is proposed. Examples: kernel regression, frequency-domain time series models.

Are these approaches inherently non-Bayesian?

- Certainly the idea that we would prefer that inference not depend on uncertain or arbitrary assumptions is reasonable from any perspective.
- Some Bayesians have argued that insistence on using all aspects of the observed data is an important part of the Bayesian perspective on inference, but I don't agree.
- Simple procedures that work will be used. Bayesians might want to use them for the same reason non-Bayesians might.

© 2007 by Christopher A. Sims. This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained for personal use or distributed free.

- Spuriously appealing simple procedures should not be used.
- Bayesian analysis of simple procedures may help us identify procedures, or combinations of procedures and types of samples, in which the simple procedures break down.

What are micro-founded (i.e. Bayesian) versions of these approaches?

- Limited-information Bayesian inference based on asymptotics (Yum-Keung Kwan). Convert frequentist asymptotics to posteriors conditional on the statistics used to form the estimator and its asymptotic variance.
- Bayesian sieves. Postulate a specific prior distribution over an infinite-dimensional parameter space and proceed. Usually this has the form of a countable union of finite-dimensional parameter spaces, with probabilities over the dimension.
- Limited-information Bayesian inference based on entropy. Conservative inference based on the idea that assumptions that imply the data reduce entropy slowly are weaker than assumptions that imply that, in this sense, information accumulates faster. (Jaynes, Zellner, Jae-young Kim)

Explicit infinite-dimensionality

- Uncertain lag length; polynomial regression with uncertain degree; autoregressive estimation of spectral density with uncertain lag length; random-effects approaches to panel data.
- These appear to be restrictive. Insisting that a regression function is a finite-order polynomial with probability one seems more restrictive than just insisting that, e.g., there is some (unknown) uniform bound on some of its derivatives.
- That's not true, though. Every Borel measure on a complete, separable metric space puts probability one on a countable union of compact sets, and in an infinite-dimensional linear space compact sets are all topologically small, in the same sense that finite-dimensional subspaces are.
- Frequentist procedures that provide confidence sets face the same need to restrict attention to small subspaces.

Example: $\ell^2$, the space of square-summable sequences $\{b_i\}$

- Many time series setups (most frequency-domain procedures, e.g.) require that there is some $A > 0$ (which may not be known) such that $|b_i|\, i^q \le A$ for some particular $q \ge 1$, i.e. that $b_i i^q$ is bounded. The set of all such $b$'s is a countable union of compact sets.
- Compact subsets of an infinite-dimensional linear space are topologically equivalent to subsets of $\mathbb{R}^k$. So the union over $T$ of all $b$'s such that $b_i = 0$ for $i > T$ is the same size as the set of $b$'s satisfying this rate-of-convergence criterion.
- Time series econometricians long ago got over the idea that frequency-domain estimation, which makes only smoothness assumptions on spectral densities, is any more general than time-domain estimation with a finite parametrization. In fact, the preferred way to estimate a spectral density is usually to fit an AR or ARMA model.
- It's a bit of a mystery why non-time-series econometricians still habitually suggest that kernel estimates make fewer assumptions than (adaptive) parametric models.

No Lebesgue measure

- $\mathbb{R}^n$ is a complete, separable metric space. But Lebesgue measure provides a topologically grounded definition of a "big" set as an alternative to category. Lebesgue measure is Borel and translation-invariant.
- In an $\infty$-dimensional space $S$, for any Borel measure $\mu$ on $S$, there is an $x \in S$ such that for some measurable $A \subset S$, $\mu(A) > 0$ and $\mu(A + x) = 0$. I.e., not only is there no translation-invariant measure: translation can always take a measure into one that is mutually discontinuous with it.

Is this a problem?

- Not a special problem for Bayesians. Any attempt to produce confidence regions runs into it.
- We can never relax about the possibility that a reasonable-looking prior is actually dogmatic about something important.
- In $\mathbb{R}^n$, priors equivalent to Lebesgue measure all agree on the zero-probability sets and in this sense are not dogmatic. No such safe form of distribution is available in $\infty$ dimensions.

Using entropy

- While entropy-reduction has axiomatic foundations, using it as if it came from a loss function is hazardous. See Bernardo and Smith.
- We may care about some parameters and not others, e.g., so that assumptions that let us learn rapidly about parameters we don't care about, while limiting the rate at which we learn about others, may in fact be conservative.
- One reasonable approach: make assumptions about model and prior that imply a joint distribution for the observations and the parameters that makes the mutual information between data and parameters as small as possible, subject to some constraints.

Minimizing mutual information between y and β s.t. a fixed marginal on y

- This is often easy, and appealing.
- When maximizing the entropy of $y \mid \beta$ leads to a well-defined pdf, this solves the problem. E.g. the SNLM, which emerges as maximum entropy given the conditional mean and variance of $y$.
- The fixed marginal on $y$ is perhaps fairly often reasonable: if we know nothing about parameters, we often know almost nothing about the distribution of $y$.
- IV: emerges from the IV moment conditions. Leads to using the LIML likelihood to characterize uncertainty, which makes a lot more sense than treating the conventional asymptotic distribution for IV itself as if it were correct. This approach handles weak instruments automatically.
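To make the SNLM bullet above concrete, here is a sketch of the standard maximum-entropy argument (the conditional mean $X\beta$ and variance $\sigma^2 I$ are the usual normal-linear-model assumptions, not something stated on the slide): the problem

\[
\max_{p(\cdot \mid \beta,\sigma^2)} \; -\int p(y \mid \beta, \sigma^2)\,\log p(y \mid \beta, \sigma^2)\,dy
\quad \text{s.t.} \quad
E[y \mid \beta] = X\beta, \qquad \operatorname{Var}(y \mid \beta) = \sigma^2 I
\]

is solved by the Gaussian density

\[
p(y \mid \beta, \sigma^2) = (2\pi\sigma^2)^{-N/2}
\exp\!\left(-\frac{1}{2\sigma^2}\,\lVert y - X\beta \rVert^2\right),
\]

which is exactly the SNLM likelihood, so minimizing mutual information subject to these moment constraints reproduces standard normal-linear-model inference.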

The Wasserman example

- Why study this very stylized example? It originally was meant to show that Bayesian or other likelihood-based methods run astray in high-dimensional parameter spaces, while a simple, frequentist, GMM-family estimate worked fine: something that some statisticians and econometricians think is well known, which is probably why Wasserman, a very smart statistician at CMU, himself ran astray in the example.
- The model and estimator are a boiled-down version of propensity-score methods that are widely used in empirical labor economics.
- The discussion will illustrate all the general points made above.
- For each $\omega \in \{1,\ldots,B\}$ there is a pair $(\theta(\omega), \xi(\omega))$, each of which is in $[0,1]$. $B$ is a very large integer.
- We know $\xi(\cdot): \{1,\ldots,B\} \to (0,1)$. We do not know $\theta(\cdot): \{1,\ldots,B\} \to (0,1)$, though it exists. Our inference will concern this unknown object.

The sample

- For each observation number $j = 1,\ldots,N$, an $\omega_j$ was drawn randomly from a uniform distribution over $\{1,\ldots,B\}$.
- $R_j \in \{0,1\}$, with $P[R_j = 1 \mid \xi(\omega_j), \theta(\omega_j)] = \xi(\omega_j)$. (From this point on we will use the shorthand notation $\theta_j$ and $\xi_j$ for $\theta(\omega_j)$ and $\xi(\omega_j)$.)
- $Y_j \in \{0,1\}$, with $P[Y_j = 1 \mid \theta_j, \xi_j] = \theta_j$.
- We observe the triple $(\xi_j, R_j, R_jY_j)$.
- We are interested in inference about $\bar\theta = (1/B)\sum_\omega \theta(\omega)$.

The setup in words

- We have a sample of a zero-one variable ($Y_j$) with missing observations. We would like to know the average probability that $Y$ is one, but we can't simply take sample averages of the $Y_j$'s, because the probability that a sample value $Y_j$ is missing might vary with $\theta_j$.
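The setup is easy to simulate. The sketch below generates the observed triples $(\xi_j, R_j, R_jY_j)$; the values of $B$ and $N$ and the particular $\xi(\cdot)$ and $\theta(\cdot)$ functions are arbitrary illustrative choices, not part of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices of B, N and of the xi() and theta() functions (not from the slides).
B, N = 10_000, 500
xi = rng.uniform(0.1, 0.9, size=B)       # known function xi(omega)
theta = rng.uniform(0.0, 1.0, size=B)    # unknown function theta(omega)
theta_bar = theta.mean()                 # the object of interest, (1/B) * sum_omega theta(omega)

omega = rng.integers(0, B, size=N)       # omega_j drawn uniformly from {1, ..., B}
xi_j, theta_j = xi[omega], theta[omega]
R = rng.binomial(1, xi_j)                # R_j = 1 means Y_j is observed
Y = rng.binomial(1, theta_j)             # latent outcome; only R_j * Y_j is seen

observed = np.column_stack([xi_j, R, R * Y])   # the observed triples (xi_j, R_j, R_j * Y_j)
print(f"true theta_bar = {theta_bar:.3f}, sample size = {N}")
```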

Wasserman's claims

- The pdf of our observations, as a function of the unknown parameters $\{\theta(\omega),\ \omega = 1,\ldots,B\}$ (and of the known parameters $\xi(\omega)$), is
  \[
  \prod_{j=1}^N \xi_j^{R_j}(1-\xi_j)^{1-R_j}\,\theta_j^{R_jY_j}(1-\theta_j)^{R_j - R_jY_j}.
  \]
- The likelihood function, which varies only with the unknown parameters $\theta(1),\ldots,\theta(B)$, depends on the known $\{\xi_j\}$ sequence only through a scale factor, which does not affect the likelihood shape. Therefore Bayesian inference, or any likelihood-based inference, must ignore the $\xi_j$'s.
- If $N \ll B$, most of the $\theta(\omega)$ values do not appear in the likelihood function. For all those values of $\omega$ that have not appeared in the sample, the posterior mean has to be the prior mean. The posterior mean for $\bar\theta$ must therefore be almost the same as the prior mean for that quantity.

The robust frequentist approach

- Use the Horwitz-Thompson estimator. This just weights all observations (including those for which $Y$ is unobserved, treating them as zeros) by the inverse of their probability of inclusion in the sample. That is, set
  \[
  \hat\theta = \frac{1}{N}\sum_{j=1}^N \frac{R_jY_j}{\xi_j}.
  \]
- It's easily checked that if $Z_j = R_jY_j/\xi_j$, then $E[Z_j] = E[\theta_j]$. So $\hat\theta$ is unbiased and strongly consistent.

Conservative frequentist confidence intervals

- If we have a lower bound on $\xi$, we have an upper bound on the variance of $\hat\theta$, which will allow us to create Chebyshev-style conservative confidence intervals for $\bar\theta$, valid even in small samples.
- Obviously we can also create the usual sort of asymptotically valid confidence interval (which may not have approximately correct coverage at any sample size) by forming the usual sort of t-statistic from the $Z_j$'s.
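A minimal sketch of the Horwitz-Thompson estimate and a Chebyshev-style conservative interval. The variance bound $\operatorname{Var}(Z_j) \le 1/(4\xi_{\min}^2)$ used here is one crude choice (it uses only $0 \le Z_j \le 1/\xi_{\min}$), and the simulated data are again illustrative.

```python
import numpy as np

def horwitz_thompson(xi, RY):
    """Weight each observation's R_j * Y_j by the inverse inclusion probability xi_j."""
    Z = RY / xi                       # Z_j = R_j Y_j / xi_j
    return Z.mean(), Z

def chebyshev_interval(Z, xi_lower, coverage=0.95):
    """Conservative finite-sample interval for theta_bar from a lower bound on xi.

    Since 0 <= Z_j <= 1/xi_lower, Var(Z_j) <= 1/(4 * xi_lower**2) (a crude bound),
    so Chebyshev's inequality gives a valid, if wide, interval at any sample size.
    """
    N = Z.size
    var_bound = 1.0 / (4.0 * xi_lower**2)
    half_width = np.sqrt(var_bound / (N * (1.0 - coverage)))
    return Z.mean() - half_width, Z.mean() + half_width

# Usage with an illustrative simulated sample.
rng = np.random.default_rng(2)
N = 1000
xi_j = rng.uniform(0.2, 0.9, size=N)                    # known inclusion probabilities
theta_j = rng.uniform(0.0, 1.0, size=N)                 # unknown success probabilities
RY = rng.binomial(1, xi_j) * rng.binomial(1, theta_j)   # observed R_j * Y_j

theta_hat, Z = horwitz_thompson(xi_j, RY)
print(theta_hat, chebyshev_interval(Z, xi_lower=0.2))
```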

What about the likelihood principle? The wrong likelihood

- Doesn't Wasserman's argument show that inference about $\bar\theta$ that uses the $\xi_j$ values violates the likelihood principle?
- Wasserman has actually incorrectly specified the likelihood function. What he (and we, above) displayed is the correct conditional joint pdf for $(Y_j, R_jY_j)$ given $\{\xi_j, \theta_j,\ j = 1,\ldots,N\}$.
- It is also true that $\{\theta(\omega), \xi(\omega),\ \omega = 1,\ldots,B\}$ can be regarded as unknown and known parameters, respectively. But the $\xi_j$'s are not the $\omega_j$'s. The $\xi_j$'s are realized values of a random variable that is observed.
- The likelihood must describe the joint distribution of all observable random variables, conditional on the unknown parameter values, which here are just $\{\theta(\omega)\}$.

Correcting the likelihood

- So we need to specify the joint distribution of $(\theta_j, \xi_j)$, not just condition on them, and then, since the $\theta_j$'s are unobservable random variables, we have to integrate them out to obtain the conditional distribution of observables given the parameters $\{\theta(\omega)\}$.
- The probability space is $S = \{1,\ldots,B\}$. The joint distribution is fully characterized by $\xi(\omega)$ and $\theta(\omega)$. But since the $\theta(\omega)$ function is unknown, we can't write down the pdf of the data, and therefore the likelihood, without filling in this gap.

Ways to fill the gap

- Postulate independence. Then the likelihood, and hence inference about $\theta(\cdot)$, will not depend on the observed values of $\xi_j$; indeed inference can be based on the sample where $Y_j$ is observed, ignoring the censoring.
- However, even in this case Wasserman is wrong to say that the form of the likelihood implies that the posterior mean for $\theta(\omega)$ must be the prior mean for all those values of $\theta(\omega)$ that do not appear in the sample.
- The prior on $\theta(\cdot)$ in the independence (of $\xi$) case is a stochastic process on $\{1,\ldots,B\}$. It can perfectly well be an exchangeable process with uncertain mean. For example,
  \[
  \{\theta(\omega) \mid \nu\} \sim \operatorname{Beta}(k\nu,\, k(1-\nu)), \text{ all } \omega, \qquad \nu \sim U(0,1).
  \]
- With this prior, inference is straightforward and the posterior mean of $\bar\theta$ will be very close to the average value of $Y_j$ over those observations for which it is observed.
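A sketch of the posterior computation under this exchangeable prior, using a grid over $\nu$ and only the observations with $R_j = 1$. It assumes, as a simplification, that the sampled $\omega_j$ are distinct (nearly certain when $B \gg N$), in which case $k$ drops out of the marginal likelihood.

```python
import numpy as np

def posterior_mean_theta_bar(Y_obs, grid_size=2001):
    """Posterior mean of theta_bar under theta(omega) | nu ~ Beta(k*nu, k*(1-nu)), nu ~ U(0,1).

    Assumes (my simplification) that the sampled omega_j are distinct, which is nearly
    certain when B >> N; then each observed Y_j is marginally Bernoulli(nu), k drops out,
    and the posterior mean of theta_bar is approximately the posterior mean of nu.
    """
    nu = np.linspace(0.0, 1.0, grid_size)[1:-1]             # interior grid for nu
    s, n = Y_obs.sum(), Y_obs.size
    log_post = s * np.log(nu) + (n - s) * np.log(1.0 - nu)  # flat prior on nu
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    return float(np.sum(w * nu))

# Usage: Y_obs collects the Y_j for which R_j = 1.
Y_obs = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
print(posterior_mean_theta_bar(Y_obs), Y_obs.mean())   # close to the observed average of Y
```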

But notice:

- We at first, inadvertently, by extending a reasonable low-dimensional prior to a high-dimensional space, were dogmatic about exactly what interested us.
- We fixed that. But we're still dogmatic about a lot of other stuff, e.g. the difference in the average of the $\theta$'s over the first half and the last half of the list of $\theta$'s.
- We should probably be modeling our prior on $\theta(\omega)$ as a stochastic process. But it is inevitable that we are in some dimensions dogmatic.

Limited information

- Horwitz-Thompson achieves simplicity, both for calculating the estimator and for proving a couple of its properties (unbiasedness and consistency), by throwing away information.
- Throwing away information to achieve simplicity is sometimes a good idea, and is as usable for Bayesian as for non-Bayesian inference.
- What would Bayesian inference based on $Z_j = R_jY_j/\xi_j$, $j = 1,\ldots,N$ alone look like?
- $Z_j$ has probability mass concentrated on a known set of points: $\{0,\ 1/\xi(\omega),\ \omega = 1,\ldots,B\}$. We don't know the probabilities of those points for a given draw $j$ conditional on $\{\theta(\cdot)\}$.

Small number of $\xi(\omega)$ values

- If the range of the $\xi(\omega)$ function is only a few discrete points, then within the set of $\omega$ values for which $\xi_j$ takes on a particular value there is trivially no dependence of $\theta_j$ on $\xi_j$.
- It therefore makes sense to apply our solution above for the independence case separately to the $k$ subsamples corresponding to the $k$ values of $\xi_j$. The estimated mean $\theta$'s within each subsample can then be weighted together using the assumed-known probabilities of the $k$ $\xi(\omega)$ values.
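A sketch of this stratified calculation. The flat Beta(1,1) prior within each stratum and the particular stratum probabilities are illustrative choices, not part of the slides.

```python
import numpy as np

def stratified_posterior_mean(xi_j, R, RY, xi_values, stratum_probs):
    """Combine within-stratum posterior means using known P[xi(omega) = xi_k].

    Within each stratum a flat Beta(1,1) prior is used for the stratum mean of theta
    (an illustrative choice), applied only to the observations with R_j = 1.
    """
    means = []
    for xik in xi_values:
        obs = (xi_j == xik) & (R == 1)          # observed Y's in this stratum
        s, n = RY[obs].sum(), int(obs.sum())
        means.append((s + 1.0) / (n + 2.0))     # Beta(1,1) posterior mean
    return float(np.dot(stratum_probs, means))

# Usage with two xi values and illustrative data.
rng = np.random.default_rng(3)
xi_values = np.array([0.3, 0.8])
stratum_probs = np.array([0.5, 0.5])            # assumed-known probabilities of the xi values
xi_j = rng.choice(xi_values, size=500)
theta_j = np.where(xi_j == 0.3, 0.2, 0.7)       # theta varies across, not within, strata
R = rng.binomial(1, xi_j)
RY = R * rng.binomial(1, theta_j)
print(stratified_posterior_mean(xi_j, R, RY, xi_values, stratum_probs))  # near 0.45
```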

Extension to many $\xi(\omega)$ values

- We could break up the range of $\xi(\omega)$ into $k$ segments and assume the distribution of $\theta(\omega)$ within each such segment is independent of $\xi(\omega)$. This obviously is feasible, and will be pretty accurate if the dependence of the distribution of $\theta \mid \xi$ on $\xi$ is smooth.
- To be more rigorous, we could put probabilities $\pi_k$ over the number of segments, with each $k$ corresponding to a given list of $k$ segments of $(0,1)$. This is a Bayesian sieve, and leads to accurate inference under fairly general assumptions.
- This approach would work even if there is an uncountable probability space and we are just given a marginal density $g(\xi)$ for $\xi$.

Using entropy

- It makes things easier here if we (realistically) suppose that the underlying probability space is uncountable, rather than $\{1,\ldots,B\}$.
- Each $Z_j$ has an unknown distribution on $\{0\} \cup [1,\infty)$, but with mean $\bar\theta$.
- We can often arrive at conservative Bayesian inference by maximizing entropy of the data conditional on known moments. Here, the max-entropy distribution for $Z$ will put probability $\pi_0(\bar\theta)$ on $Z = 0$ and density $\pi_0(\bar\theta)\exp(-\lambda(\bar\theta)z)$ on $Z \ge 1$.
- The max-entropy distribution makes $\sum Z_j$ a sufficient statistic. Its posterior mean will be consistent and respects the a priori known bound $\bar\theta \in [0,1]$.
- If we relax the constraint connecting $\pi_0$ and the height of the continuous pdf at $Z = 1$, we get a sample likelihood
  \[
  \alpha^n (1-\alpha)^m \mu^n e^{-\mu \sum_{Z_i > 0}(Z_i - 1)},
  \]
  where $n$ is the number of non-zero observations and $m$ the number of zero observations on $Z$. The bounds on $\bar\theta$ place bounds on $\mu$ and $\alpha$.
- Ignoring the bounds and using a conjugate prior parametrized by $\gamma$ gives a posterior mean for $\bar\theta$ of
  \[
  \hat\theta = \frac{n+1}{n+m+2}\left(\frac{\sum Z_i}{n} + \frac{\gamma}{n}\right),
  \]
  very close to the Horwitz-Thompson estimator for large $n$ and $m$.
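A quick numerical comparison of this approximate posterior mean with the Horwitz-Thompson estimate on simulated $Z$'s. The $(n+1)/(n+m+2)$ weight comes from the formula above, while the value of $\gamma$ and the data-generating choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated Z_j = R_j Y_j / xi_j under illustrative xi() and theta() choices.
N = 2000
xi_j = rng.uniform(0.2, 0.8, size=N)
theta_j = rng.uniform(0.1, 0.9, size=N)
Z = rng.binomial(1, xi_j) * rng.binomial(1, theta_j) / xi_j

n = int(np.sum(Z > 0))      # non-zero observations
m = Z.size - n              # zero observations
gamma = 0.5                 # illustrative conjugate-prior parameter

theta_hat_ht = Z.mean()                                              # Horwitz-Thompson
theta_hat_post = (n + 1) / (n + m + 2) * (Z.sum() / n + gamma / n)   # approximate posterior mean

print(f"Horwitz-Thompson: {theta_hat_ht:.4f}   posterior mean: {theta_hat_post:.4f}   "
      f"average theta_j: {theta_j.mean():.4f}")
```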

Admissibility

- Ignoring the bounds makes this estimator and the Horwitz-Thompson estimator inadmissible. If $\bar\theta$ is near 1, these estimators can have a high probability of exceeding 1. We can truncate them, but this undoes the unbiasedness of Horwitz-Thompson.
- One admissible estimator is obtained as the posterior mean of the $(\alpha, \mu)$ model, accounting for the bounds.
- The draft note shows how that model leads to a posterior that is more diffuse when there are fewer non-zero $Z$'s. The entropy-maximizing model would not have this property. Its posterior would depend on $\sum Z_j$ and sample size, but not on the number of non-zero draws. Probably, therefore, its posterior is more diffuse for samples with few non-zero draws. I haven't decided yet whether this is going to often be reasonable.

Lessons

I. Don't try to prove that the likelihood principle leads to bad estimators.

II. In high-dimensional parameter spaces (as with nuisance parameters), be careful to assess what kind of prior knowledge is implied by an apparently standard-looking prior. Use exchangeability to make sure important functions of the distribution are reasonably uncertain a priori.

III. Good estimators usually are exactly or almost equivalent to Bayesian estimators. Interpreting them from a Bayesian perspective can often suggest ways to improve them.

GMM

- If one minimizes mutual information between $y$ and $\beta$ subject to $E[g(y;\beta)] = 0$, $E[g(y;\beta)g(y;\beta)'] = \Sigma$, and to a fixed marginal on $y$, one arrives at a pdf for $\{y \mid \beta, \Sigma\}$ of the form
  \[
  \exp\!\big(-\lambda(y) - \mu(\beta,\Sigma)'g(y;\beta) - g(y;\beta)'\,\Omega(\beta,\Sigma)\,g(y;\beta)\big),
  \]
  where $\lambda$ and $\mu$ have to be chosen to meet the requirement that the pdf integrates to one for all $\beta$, $\Sigma$ and that the moment condition is satisfied.
- Note this does not mean $g$ is itself Gaussian, and $\Omega \ne \Sigma$ (except in linear models).
- Looking for $\lambda$, $\mu$ and $\Omega$ for some interesting real-world examples of nonlinear $g$'s is an interesting project.
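As a toy version of that project, the sketch below solves numerically for $\mu$ in a scalar example. It simplifies heavily (treating $\lambda(y)$ as a constant normalizer, fixing $\Omega$, and imposing only the first moment condition at a single $\beta$), so it is an illustration of the functional form rather than the full construction.

```python
import numpy as np

# Toy numerical version of the problem for a scalar, nonlinear g.
# Simplifying assumptions (mine, not the slides'): lambda(y) is treated as a constant
# normalizer, Omega is fixed at 1/2, y lives on a finite grid, and we solve only for mu
# so that the moment condition E[g(y; beta)] = 0 holds at a given beta.

y_grid = np.linspace(-10.0, 10.0, 20001)

def g(y, beta):
    return np.tanh(y) - beta                    # an arbitrary nonlinear moment function

def tilted_moment(beta, mu, omega=0.5):
    gy = g(y_grid, beta)
    w = np.exp(-mu * gy - omega * gy**2)        # unnormalized quadratic-exponential density
    return np.sum(gy * w) / np.sum(w)           # E[g(y; beta)] under that density

def solve_mu(beta, lo=-50.0, hi=50.0, tol=1e-10):
    # Bisection: the moment is strictly decreasing in mu (its derivative is -Var(g)),
    # positive at lo and negative at hi for this choice of g.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tilted_moment(beta, mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

beta0 = 0.3
print(f"mu satisfying the moment condition at beta = {beta0}: {solve_mu(beta0):.4f}")
```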