Bayesian Econometrics


Bayesian Econometrics Christopher A. Sims Princeton University sims@princeton.edu September 20, 2016

Outline I. The difference between Bayesian and non-Bayesian inference. II. Confidence sets and confidence intervals. Why? III. Bayesian interpretation of frequentist data analysis. IV. A case where Bayesian and frequentist approaches seem aligned: SNLM

Bayesian Inference is a Way of Thinking, Not a Basket of Methods Frequentist inference makes only pre-sample probability assertions. A 95% confidence interval contains the true parameter value with probability .95 only before one has seen the data. After the data has been seen, the probability is zero or one. Yet confidence intervals are universally interpreted in practice as guides to post-sample uncertainty. They often are reasonable guides, but only because they often are close to posterior probability intervals that would emerge from a Bayesian analysis.

People want guides to uncertainty as an aid to decision-making. They want to characterize uncertainty about parameter values, given the sample that has actually been observed. That it aims to help with this is the distinguishing characteristic of Bayesian inference.

Abstract decision theory Unknown state S taking values in Ω. We choose δ from a set. Payoff is U(δ, S). Admissible δ: there is no other δ′ such that U(δ′, S) > U(δ, S) for all S. Bayes decision rule δ*: maximizes E_P[U(δ, S)] over δ for some probability distribution P over Ω.

The complete class theorem See Ferguson (1967). Under fairly general conditions, every admissible decision procedure is Bayes. This is, as mathematics, almost the same as the result that every efficient allocation in general equilibrium corresponds to a competitive equilibrium. Both are separating hyperplane results, with the hyperplane defined by prices in general equilibrium and by probabilities in the complete class theorem.

A two-state graphical example

prior and model Suppose we observe an animal dashing into a hole underneath a garden shed, and then note that there is no particularly bad odor around the hole. We are sure it is a woodchuck or a skunk. Before we made these observations, our probability distribution on a randomly occurring animal in our back yard was .5 on woodchuck and .5 on skunk. This was our prior distribution. Conditional on the animal being a woodchuck, we believe the probability that there would be odor around the burrow (O = 1) is .3, and the probability that the animal would run away fast (R = 1) is .7. Conditional on the animal being a skunk, we believe P[O = 1 | skunk] = .7 and P[R = 1 | skunk] = .3.

We think these observations are independent.

Tables of probabilities Conditional probabilities of the observations, given woodchuck and given skunk:

            Woodchuck        Skunk
          O = 0   O = 1    O = 0   O = 1
  R = 0    .21     .09      .21     .49
  R = 1    .49     .21      .09     .21

Joint distribution of parameters and data, posterior distribution Therefore, the probability of what we saw is P[O = 0 and R = 1 | woodchuck] = .49 and P[O = 0 and R = 1 | skunk] = .09. To get the unconditional probabilities of these two events we multiply them by the prior probabilities of the parameters, which in this case is .5 for each of them. Therefore the conditional probability of the animal being a woodchuck given our observations is .49/(.49 + .09) = .845.
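The posterior calculation on this slide is easy to verify numerically. Below is a minimal Python sketch (the function name is illustrative, not from the slides); it reproduces the .845 figure here and the .982 figure from the different-prior slide below.

```python
def posterior_woodchuck(prior_w, p_obs_given_w, p_obs_given_s):
    """Posterior probability of 'woodchuck' by Bayes' rule."""
    num = prior_w * p_obs_given_w
    return num / (num + (1 - prior_w) * p_obs_given_s)

# Observation R = 1, O = 0: P[. | woodchuck] = .7 * .7 = .49,
# P[. | skunk] = .3 * .3 = .09; prior .5 on each species.
print(round(posterior_woodchuck(0.5, 0.49, 0.09), 3))      # 0.845

# With a 10/11 prior on woodchuck (10 times more common):
print(round(posterior_woodchuck(10 / 11, 0.49, 0.09), 3))  # 0.982
```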

Confidence sets? Is this also an 84.5% confidence set for the animal's species? No, for two reasons. A confidence set is a random set that varies with the observed data. We can't determine whether this set {woodchuck} is the value of a confidence set without specifying how the set would vary if we had made other observations. But besides that, there is no necessary connection between Bayesian posterior probabilities for intervals and confidence levels attached to them.

Constructing a confidence set For example, we might specify that when we see R = 0, O = 1 our confidence set is just {skunk}, that when we see R = 1, O = 0 it is {woodchuck}, and that in the other two cases (where our posterior probabilities would be .5 on each) it is {woodchuck, skunk}. This would be a 91% confidence set (.49 + .21 + .21 = .91), though it would often leave us asserting with 91 percent confidence that the animal is either a woodchuck or a skunk, even though we are quite sure that it is either one or the other.
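Enumerating the four possible (R, O) observations under each species gives the rule's exact coverage; a sketch, with the joint probabilities taken from the table above:

```python
# Joint probabilities of (R, O) given each species, from the table above.
p_w = {(0, 0): .21, (0, 1): .09, (1, 0): .49, (1, 1): .21}
p_s = {(0, 0): .21, (0, 1): .49, (1, 0): .09, (1, 1): .21}

def conf_set(r, o):
    """The confidence-set rule described on this slide."""
    if (r, o) == (1, 0):
        return {"woodchuck"}
    if (r, o) == (0, 1):
        return {"skunk"}
    return {"woodchuck", "skunk"}

# Coverage: probability the set contains the true species, per species.
cover_w = sum(p for ro, p in p_w.items() if "woodchuck" in conf_set(*ro))
cover_s = sum(p for ro, p in p_s.items() if "skunk" in conf_set(*ro))
print(round(cover_w, 2), round(cover_s, 2))  # 0.91 0.91
```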

Effect of a different prior Our Bayesian calculations were based on the assumption that woodchucks and skunks are equally likely. If we knew before our observations that woodchucks are much more common than skunks, say 10 times more common, our marginal probability would put probability 10/11 on woodchuck. Then the probability weight on our R = 1, O = 0 observation is .49 × 10/11 conditional on woodchuck, .09 × 1/11 on skunk. Our observations would therefore move us from a 10/11 = .909 probability on woodchuck to 4.9/(4.9 + .09) = .982. If instead we had seen R = 0, O = 1, the evidence would have shifted us from the .909 probability on woodchuck down to .9/(.9 + .49) = .647.

Model, parameters, prior, likelihood This framework is not essential to Bayesian decision theory, but it is nearly universal in empirical work. There are some unknown objects we label parameters, in a vector β. There is a conditional distribution of observable data y given β, p(y | β). The function p(·) is the model. Frequentists and Bayesians agree on this setup. They also agree on calling p(y | β), as a function of β with y fixed at the observed value, the likelihood function. The formal difference: Bayesians treat y as non-random once it has been observed, but treat β as random (since it is unknown) both before and after y has been observed. Frequentists persist in making probability statements about y and functions of y (like estimators) even after y has been observed.

Bayes Rule With a pdf π(·) describing uncertainty about β before y has been observed (this is the prior pdf), the joint distribution of (y, β) has pdf π(β)p(y | β). Bayes' rule then uses this to calculate the conditional density of β | y via

q(β | y) = π(β)p(y | β) / ∫ π(β′)p(y | β′) dβ′.

Bayes' rule is math; it is not controversial. What bothers frequentists is the need for π and for treating the fixed object β as random.

Is the difference that Bayesian methods are subjective? No. The objective aspect of Bayesian inference is the set of rules for transforming an initial distribution into an updated distribution conditional on observations. Bayesian thinking makes it clear that for decision-making, pre-sample beliefs are in general important. But most of what econometricians do is not decision-making; it is reporting of data analysis for an audience that is likely to have diverse initial beliefs. In such a situation, as was pointed out long ago by Hildreth (1963) and Savage (1977, pp. 14-15), the task is to present useful information about the shape of the likelihood.

How to characterize the likelihood Present its maximum. Present a local approximation to it based on a second-order Taylor expansion of its log. (Standard MLE asymptotics.) Plot it, if the dimension is low. If the dimension is high, present slices of it, marginalizations of it, and implied expected values of functions of the parameter. The functions you choose might be chosen in the light of possible decision applications.

The marginalization is often more useful if a simple, transparent prior is used to downweight regions of the parameter space that are widely agreed to be uninteresting.

HPD sets, marginal sets, frequentist problems with multivariate cases

Confidence set simple example number 1 X ~ U(β − 1, β + 1); β known to lie in (0, 2). The 95% confidence interval for β is (X − .95, X + .95). Can truncate to the (0, 2) interval or not; it's still a 95% interval.

Example 1, cont. A Bayesian with a U(0, 2) prior has a 95% posterior probability ("credibility") interval that is generally a subset of the intersection of the X ± 1 interval with (0, 2), with the subset 95% of the width of the intersection. Frequentist intervals are simply the intersection of X ± .95 with (0, 2). They are wider than Bayesian intervals for X in (0, 2), but narrower for X values outside that range, and in fact simply vanish for X values outside (−.95, 2.95).
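A numeric sketch of this comparison (the function names and the central-interval choice are mine, not the slides'): the Bayesian interval takes the central 95% of the flat posterior on the intersection, while the frequentist interval is the truncated X ± .95.

```python
def frequentist_interval(x):
    """(X - .95, X + .95) intersected with (0, 2); may be empty."""
    lo, hi = max(x - .95, 0.0), min(x + .95, 2.0)
    return (lo, hi) if lo < hi else None

def bayes_interval(x):
    """Central 95% of the flat posterior, uniform on (X-1, X+1) ∩ (0, 2)."""
    lo, hi = max(x - 1.0, 0.0), min(x + 1.0, 2.0)
    trim = 0.025 * (hi - lo)
    return (lo + trim, hi - trim)

def r4(t):
    """Round an interval for display."""
    return None if t is None else (round(t[0], 4), round(t[1], 4))

print(r4(frequentist_interval(0.5)))  # (0.0, 1.45): width 1.45
print(r4(bayes_interval(0.5)))        # (0.0375, 1.4625): narrower
print(r4(frequentist_interval(2.5)))  # (1.55, 2.0): width 0.45
print(r4(bayes_interval(2.5)))        # (1.5125, 1.9875): wider
print(r4(frequentist_interval(3.2)))  # None: the interval vanishes
```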

Are narrow confidence intervals good or bad? A 95% confidence interval is always a collection of all points that fail to be rejected in a 5% significance level test. A completely empty confidence interval, as in example 1 above, is therefore sometimes interpreted as implying rejection of the model. As in example 1, an empty interval is approached as a continuous limit by very narrow intervals, yet very narrow intervals are usually interpreted as implying very precise inference.

Confidence set simple example number 2 X = β + ε, where ε has pdf e^(−ε) on (0, ∞). Likelihood: e^(β − X) for β ∈ (−∞, X). Bayesian credible sets show the reversed asymmetry; contrast with the naive (but commonly applied) bootstrap, which would produce an interval entirely concentrated on values of β above β̂_MLE = X, values that are impossible. (Naive parametric bootstrap: find an estimator β̂, simulate draws of X | β̂, and take the 2.5% tails of this distribution as the 95% confidence interval.)
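The bootstrap pathology is easy to see in simulation; a sketch under the setup above (the true β, seed, and draw count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = 1.0
x = beta_true + rng.exponential(1.0)        # one observation; true beta <= X

beta_hat = x                                # MLE
draws = beta_hat + rng.exponential(1.0, 100_000)  # simulate X* | beta_hat
lo, hi = np.percentile(draws, [2.5, 97.5])  # naive 95% bootstrap interval

# The entire interval lies above beta_hat = X, but beta <= X always,
# so the interval contains only impossible values of beta.
print(lo > beta_hat)  # True
```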

Confidence set simple example number 3 Y ~ N(β, 1); we are interested in g(β), where g is nonlinear and monotone, but unknown. We observe not Y, but X = g(Y). We have a maximum likelihood estimator ĝ for g(β). If we could observe Y, a natural confidence and credible interval for β would be Y ± 1.96. If we also knew g, we could then use (g(Y − 1.96), g(Y + 1.96)) as an excellent confidence and credible interval.

Using the naive bootstrap here, if we base it on an estimator ĝ asymptotically equivalent to the MLE, gives exactly the natural interval. So, without an analysis of likelihood, there is no answer as to whether the naive bootstrap gives reasonable results.

Mueller-Norets bettable confidence intervals If not bettable from either side, they're Bayesian for some prior. If only not bettable against, they exist, and are expanded versions of Bayesian posterior probability intervals.

Bayesian interpretation of frequentist asymptotics A common case: √T(β̂_T − β) | β →_D N(0, Σ) as T → ∞. Under mild regularity conditions, this implies √T(β − β̂_T) | β̂_T →_D N(0, Σ). Kwan (1998) presents this standard case (though his main theorem does not make strong enough assumptions to deliver his result). Kim (2002) extended the results beyond the √T normalization case and thereby covered time series regression with unit roots.

SNLM The SNLM, often denoted by the equation Y = Xβ + ε, asserts the following conditional pdf for the T × 1 vector of Y data, conditional on the T × k matrix of X data and on the parameters β, σ²:

p(Y | X, β, σ²) = φ(Y − Xβ; σ²I) = (2π)^(−T/2) σ^(−T) exp(−(Y − Xβ)′(Y − Xβ) / (2σ²)).

The most common framework for Bayesian analysis of this model asserts a prior that is flat in β and in log σ or log σ², i.e. dσ/σ or dσ²/σ². We will assume the prior has the form dσ/σ^p, then discuss how the results depend on p.

Marginal for σ² We let u(β) = Y − Xβ and denote the least squares estimate as β̂ = (X′X)^(−1)X′Y. Also û = u(β̂). Then the posterior can be written, by multiplying the likelihood above by σ^(−p) and rearranging, as proportional to

σ^(−T−p) exp(−[û′û + (β − β̂)′X′X(β − β̂)] / (2σ²)) dβ dσ
∝ σ^(−T−p+k) |X′X|^(−1/2) exp(−û′û / (2σ²)) φ(β − β̂; σ²(X′X)^(−1)) dβ dσ
∝ v^((T+p−k)/2) exp(−û′û v / 2) |X′X|^(−1/2) φ(β − β̂; σ²(X′X)^(−1)) dβ dv / v^(3/2),

where v = 1/σ².

Integrating this expression w.r.t. β and setting α = û′û/2 gives us an expression proportional to

v^((T+p−k−3)/2) exp(−û′û v / 2) dv ∝ α^((T+p−k−1)/2) v^((T+p−k−3)/2) e^(−αv) dv,

which is a standard Γ((T + p − k − 1)/2, α) pdf. Because it is v = 1/σ² that has the Γ distribution, we say that σ² itself has an inverse-gamma distribution. Since a Γ(n/2, 1) variable, multiplied by 2, is a χ²(n) random variable, some prefer to say (when p = 1) that û′û/σ² has a χ²(T − k) distribution, and thus that σ² has an inverse-chi-squared distribution.
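A simulation sketch of this result with p = 1 (the dσ/σ prior), on made-up regression data: draws of v = 1/σ² from the Γ((T − k)/2, û′û/2) posterior should make û′û · v behave like a χ²(T − k) variable, whose mean is T − k.

```python
import numpy as np

rng = np.random.default_rng(1)
T, k = 100, 3
X = rng.normal(size=(T, k))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # least squares estimate
u_hat = Y - X @ beta_hat
ssr = u_hat @ u_hat                            # u-hat' u-hat

# Posterior draws of v = 1/sigma^2 ~ Gamma(shape=(T-k)/2, rate=ssr/2);
# numpy parameterizes by scale = 1/rate.
v = rng.gamma(shape=(T - k) / 2, scale=2.0 / ssr, size=200_000)

print(round(float(np.mean(ssr * v)), 1))  # close to T - k = 97
```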

Marginal on β Start with the same rearrangement of the likelihood in terms of v and β, and rewrite it as

v^((T+p−3)/2) exp(−u(β)′u(β) v / 2) dv dβ.

As a function of v, this is proportional to a standard Γ((T + p − 1)/2, u(β)′u(β)/2) pdf, but here there is a missing normalization factor that depends on β. When we integrate with respect to v, therefore, we arrive at

(u(β)′u(β) / 2)^(−(T+p−1)/2) dβ ∝ (1 + (β − β̂)′X′X(β − β̂) / û′û)^(−(T+p−1)/2) dβ.

This is proportional to what is known as a multivariate t_n(β̂, s²(X′X)^(−1)) pdf, where n = T + p − k − 1 is the degrees of freedom and s² = û′û/n. It makes each element β_i of β an ordinary univariate t_n(β̂_i, s²_βi), where s²_βi = s²(X′X)^(−1)_ii. Thus the statistics computed from the data can be analyzed with the same tables of distributions from either a Bayesian or non-Bayesian perspective.
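A sketch checking this equivalence on simulated data, with p = 1 so n = T − k: the same t_{T−k} quantiles and standard error give both the classical 95% confidence interval and the 95% equal-tailed posterior interval for a coefficient (the data and seed are made up).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
T, k = 50, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
Y = X @ np.array([0.5, 1.0]) + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
u_hat = Y - X @ beta_hat
n = T - k                          # degrees of freedom with p = 1
s2 = (u_hat @ u_hat) / n
se = np.sqrt(s2 * XtX_inv[1, 1])   # scale for beta_1

t_crit = stats.t.ppf(0.975, df=n)
# Identical numbers serve as the 95% confidence interval and the 95%
# posterior interval for beta_1:
print(beta_hat[1] - t_crit * se, beta_hat[1] + t_crit * se)
```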

Breaks Suppose we have a model specifying that Y_t is i.i.d. with pdf f(y_t; µ_t), and with µ_t taking on just two values: µ_t = µ1 for t < t*, µ_t = µ2 for t ≥ t*. The pdf of the full sample {Y_1, ..., Y_T} therefore depends on the three parameters µ1, µ2, t*. If f has a form that is easily integrated over µ_t and we choose a prior π(µ1, µ2) that is conjugate to f (meaning it has the same form as the likelihood), then the posterior marginal pdf for t* under a flat prior on t* is easy to calculate: for each possible value of t*, integrate the posterior over µ1, µ2.

The plot of this integrated likelihood as a function of t* gives an easily interpreted characterization of uncertainty about the break date. Frequentist inference about the break date has to be based on asymptotic theory, and has no interpretation for any observed complications (like multiple local peaks, or narrow peaks) in the global shape of the likelihood.
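A sketch of this calculation, assuming f is N(µ_t, 1) with an improper flat prior on (µ1, µ2), under which each segment's mean integrates out in closed form; the simulated data, true break point, and seed are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
T, true_break = 120, 80
y = np.concatenate([rng.normal(0.0, 1.0, true_break),
                    rng.normal(1.0, 1.0, T - true_break)])

def log_marginal(seg):
    """log of the integral over mu of prod_t N(seg_t; mu, 1), flat prior
    on mu: -(m-1)/2 log(2 pi) - (1/2) log m - (1/2) sum (seg - mean)^2."""
    m = len(seg)
    return (-(m - 1) / 2 * np.log(2 * np.pi) - 0.5 * np.log(m)
            - 0.5 * np.sum((seg - seg.mean()) ** 2))

# Integrated likelihood of each candidate break date t* (each segment
# must contain at least two observations here).
cands = list(range(2, T - 1))
logml = [log_marginal(y[:t]) + log_marginal(y[t:]) for t in cands]

t_hat = cands[int(np.argmax(logml))]  # posterior mode of the break date
print(t_hat)
```

Plotting `logml` against `cands` gives exactly the easily interpreted characterization of break-date uncertainty the slide describes, including any multiple or narrow peaks.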

Model comparison Can be thought of as just estimating a discrete parameter: y ~ f(y | θ, m), with θ ∈ R^k, m ∈ Z. Then apply Bayes' rule and marginalization to get the posterior on m: p(m | y) ∝ ∫ f(y | θ, m) π(θ | m) dθ. The integral on the rhs is the marginal likelihood of model m; the ratio of marginal likelihoods for two models is the Bayes factor. This is the right way to compare models in principle, but it can be misleading if the discrete parameter arises as an approximation to an underlying continuous range of uncertainty, as we'll discuss later.

Simple examples of model comparison, vs. testing models Test H0: X ~ N(0, 1); reject if |X| is too big. Versus what? HA: X ~ N(0, 2)? N(0, .5)? The classic frequentist testing literature emphasized that tests must always be constructed with the alternative hypothesis in mind. There is no such thing as a meaningful test that requires no specification of an alternative. Posterior odds vs. a 5% test for N(0, 1) vs. N(0, 2). N(0, 1) vs. N(2, 1), N(µ, 1)? Need for a prior.
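A sketch of the posterior-odds comparison for N(0, 1) vs. N(0, 2), reading the second argument as a variance and putting equal prior probability on the two models (both are my choices; the slide leaves them implicit):

```python
import math

def npdf(x, var):
    """N(0, var) density at x."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

for x in (1.5, 2.0, 3.0):
    odds = npdf(x, 1.0) / npdf(x, 2.0)  # posterior odds H0:HA, equal priors
    reject = abs(x) > 1.96              # two-sided 5% test of H0: N(0, 1)
    print(x, round(odds, 2), reject)
# 1.5 0.81 False
# 2.0 0.52 True
# 3.0 0.15 True
```

At X = 2.0 the 5% test rejects H0 even though the posterior odds are still roughly even money, which is the contrast the slide is after.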

The stopping rule paradox Boy-preference birth-ratio example; sampling-to-significance example; posterior distribution vs. test outcome.

References

FERGUSON, T. S. (1967): Mathematical Statistics: A Decision Theoretic Approach. Academic Press, New York and London.

HILDRETH, C. (1963): "Bayesian Statisticians and Remote Clients," Econometrica, 31(3), 422–438.

KIM, J.-Y. (2002): "Limited information likelihood and Bayesian analysis," Journal of Econometrics, 107, 175–193.

KWAN, Y. K. (1998): "Asymptotic Bayesian analysis based on a limited information estimator," Journal of Econometrics, 88, 99–121.

SAVAGE, L. J. (1977): "The Shifting Foundations of Statistics," in Logic, Laws and Life, ed. by R. Colodny, pp. 3–18. University of Pittsburgh Press.