MAXIMUM LIKELIHOOD, SET ESTIMATION, MODEL CRITICISM


Eco517 Fall 2004 C. Sims

Date: November 1, 2004.
©2004 by Christopher A. Sims. This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained for personal use or distributed free.

1. SOMETHING WE SHOULD ALREADY HAVE MENTIONED

A t_ν(µ, Σ) distribution converges, as ν → ∞, to a N(µ, Σ). Consider the univariate case, where the t_ν(µ, 1) pdf is

    Γ((ν + 1)/2) / (√(νπ) Γ(ν/2)) · (1 + (x − µ)²/ν)^(−(ν+1)/2).

Using the calculus fact that (1 + a/n)ⁿ → eᵃ as n → ∞, it is easy to show that the part of the pdf that depends on x converges to exp(−(x − µ)²/2).

Note also that the t_ν distribution has moments only up to order ν − 1, so it does not have a moment generating function.

2. STEIN'S RESULT

In the standard normal linear model, with a loss function of the form (β − β̂)′W(β − β̂) with W p.s.d., the OLS estimator of β is admissible for k ≤ 2, but not for k > 2. Stein proved this by constructing an estimator that dominates OLS. However, his estimator is also not admissible.

Bayesian posterior means with proper priors are of course admissible. However, only a narrow class of them dominates OLS, and the class will vary with W. So long as an estimator is admissible, that it also dominate OLS is not necessarily desirable.

These results reflect standard good practice in applied work. When there are many regressors, everyone understands that it is possible to make the predictions from OLS regression estimates turn out badly by including regressors whose estimates have high standard errors. So researchers exclude variables based on prior beliefs. But it might sometimes be better to formulate priors explicitly and probabilistically, instead of excluding variables informally.
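Stein's phenomenon can be seen in a small simulation. The sketch below uses the positive-part James–Stein estimator for the k-means problem Y ∼ N(θ, I_k), which dominates the MLE (here, Y itself) for k > 2. Stein's own dominating estimator was different, and the true θ and simulation settings here are invented for illustration:

```python
import random

random.seed(1)
k, reps = 10, 2000
theta = [0.5] * k            # true mean vector (our arbitrary choice)

def sq_err(est):
    """Squared-error loss against the true theta."""
    return sum((e - t) ** 2 for e, t in zip(est, theta))

mle_risk = js_risk = 0.0
for _ in range(reps):
    yv = [random.gauss(t, 1.0) for t in theta]   # one draw Y ~ N(theta, I_k)
    s = sum(v * v for v in yv)                   # ||Y||^2
    shrink = max(0.0, 1 - (k - 2) / s)           # positive-part James-Stein factor
    mle_risk += sq_err(yv)                       # MLE: estimate theta by Y
    js_risk += sq_err([shrink * v for v in yv])  # JS: shrink Y toward zero

print(f"avg MLE risk: {mle_risk / reps:.2f}, avg JS risk: {js_risk / reps:.2f}")
```

The average MLE risk is close to k, while the shrinkage estimator's average risk is markedly smaller when θ is near the shrinkage point.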

3. MAXIMUM LIKELIHOOD ESTIMATION

The MLE of θ is the value of θ that maximizes p(Y | θ). It may not exist.

It is rarely justifiable as a Bayesian estimator. While it is often thought of as a non-Bayesian estimator, it is not generally unbiased and does not generally have any other good properties, except being a function of sufficient statistics. Under general conditions that we will study later, it has approximately good properties when the sample size is large enough.

It is usually the starting point for the task of describing the shape of the likelihood. But there are some cases where it is by itself not much use: many local maxima, all of similar height; or cases where the peak of the likelihood is narrow and far from the main mass of probability.

4. SET ESTIMATION

This is a procedure that is hard to rationalize from a Bayesian perspective, so we'll come back to that at the end.

A 100(1 − α)% confidence set for the parameter θ in the parameter space Θ is a mapping from observations Y into subsets S(Y) ⊂ Θ with the property that for every θ ∈ Θ,

    P[θ ∈ S(Y) | θ] = 1 − α.

1 − α is the coverage probability of the set. Such a random set can have different coverage probabilities for different values of θ, but then it is not an exact confidence set. An exact confidence set may not exist, so the definition is commonly relaxed to say that S(Y) is a 100(1 − α)% set if

    min_{θ∈Θ} P[θ ∈ S(Y) | θ] = 1 − α,

that is, if its coverage probability is at least 1 − α for every θ.

5. WHAT IS CONFIDENCE?

In practice it is nearly always treated as if it represented posterior probability. In both the popular press and the applied economics literature you will see a result that a 95% interval S(Y) for θ has realized value (a, b) described as a result that "it is 95% sure that θ is between a and b" or "the probability that θ is between a and b is 95%".

So is it connected to posterior probability? Yes, to some extent.

In the SNLM, confidence sets generated in the usual way (which we will see shortly) have, under a flat prior on β and log σ, posterior probability equal to their coverage probabilities.
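The coverage property is a pre-sample, frequentist statement, and it can be checked by simulation. A minimal sketch for the known-σ normal-mean interval ȳ ± 1.96 σ/√n (the particular µ, σ, n are invented for illustration):

```python
import random

random.seed(0)
mu, sigma, n, reps = 3.0, 2.0, 25, 10_000
covered = 0
for _ in range(reps):
    # draw a sample and form the 95% interval for the mean, sigma known
    ybar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    half = 1.96 * sigma / n ** 0.5
    if ybar - half <= mu <= ybar + half:
        covered += 1

print("empirical coverage:", covered / reps)
```

Over repeated samples, the fraction of intervals that contain the true µ is close to 0.95 — a statement about the procedure before the data are seen, not about any one realized interval.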

In general, a 100(1 − α)% confidence set must have posterior probability of at least 1 − γ with unconditional probability at least 1 − α/γ. So, e.g., an interval with coverage probability .99 must have posterior probability at least .9 for a set of Y's with pre-sample probability (accounting for uncertainty about θ via the prior) at least .9. For 95% confidence sets this result is pretty weak: 95% intervals must have posterior probability at least .9 with unconditional probability at least .5.

Underlying these results is the identity

    E[P[θ ∈ S(Y) | Y]] = P[θ ∈ S(Y)] = E[P[θ ∈ S(Y) | θ]] = 1 − α.

That is, the prior probability, before the data are seen, that an exact 100(1 − α)% set will contain the true value of the parameter is 1 − α, regardless of the prior distribution on the parameter, and this is the prior expected value of the posterior probability of the set. So the posterior probability of the set, if it is not always equal to the confidence level, must be higher in some samples and lower in others.

6. EXAMPLE: BOUNDED INTERVAL PARAMETER SPACE

We are estimating µ, which we know must lie in [0, 1]. We have available an estimator µ̂ with the property that µ̂ | µ ∼ N(µ, .1²). A 95% confidence interval for µ is therefore µ̂ ± .196.

Notice that the fact that we know µ ∈ [0, 1] did not enter the calculation of the confidence interval. In fact, to keep it a subset of the parameter space, we must make the interval {µ̂ ± .196} ∩ [0, 1]. With non-zero probability the confidence set is empty (e.g., when µ = 0 it is empty whenever µ̂ < −.196, which happens with probability Φ(−1.96) ≈ .025). With non-zero probability the confidence set is a very short interval, with very small posterior probability, which conventional mistaken interpretations would treat as indicating great precision of the inference.

7. EXAMPLE: RED-GREEN COLOR BLIND AT THE TRAFFIC LIGHT

A witness to a traffic accident is red-green color blind, but can perfectly distinguish yellow. The traffic light is unusual, arranged horizontally.
The witness, we have determined, does not like to admit colorblindness, and when asked the color of red and green objects simply announces one or the other color at random, with equal probabilities. His deposition in this accident states that he observed the traffic light, and it was yellow.

Do we say "with 100% confidence the light was yellow", or "with 50% confidence the light was yellow"? Both statements could be valid, but we would have

had to commit, before seeing the witness's answer, to how we would behave if he reported red or green. If we would say "with 50% confidence the light was red" when the report was red, then we have to quote the same confidence level when the light is yellow. But if, when the report is red, we would say instead "with 100% confidence the light was either red or green", then we are using a 100% confidence set and we should say the light is yellow with 100% confidence.

Of course this is ridiculous. The posterior probability of yellow given the report of yellow is 1.0, regardless of the prior, so every sensible person would simply say the light was surely yellow; the fact that the witness is red-green color blind is irrelevant, given that the light was not in fact red or green.

This is a special case of a general problem with using pre-sample probability statements as if they were posterior probability statements: pre-sample probabilities depend on samples that did not in fact occur.

8. EXAMPLE: DATA WITH DIFFERENT VARIANCES

Suppose Θ is the two-point space {0, 1}, and when θ = 0 our data Y are N(0, .5²), while when θ = 1, Y ∼ N(1, 1). An apparently natural way to form a confidence set would be to have it include both 0 and 1 when Y ∈ (−.644, .82), 0 only when Y < −.644, and 1 only when Y > .82. This would indeed be an exact 95% confidence set, as you should be able to check.

But when Y is sufficiently far below −1, the likelihood ratio in favor of θ = 1 is large. Indeed the strongest likelihood ratios in favor of θ = 1 occur for large negative (or positive) Y's. So making such observations deliver a confidence set containing only θ = 0 is unreasonable.

There are ways to construct more reasonable confidence sets in this example, and of course posterior probability intervals would not show this kind of anomaly.
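The anomaly is easy to verify numerically. A sketch (the densities are written out by hand and the function names are ours) comparing the confidence-set rule above with the likelihood ratio:

```python
import math

def density(y, theta):
    """Sampling density of Y: N(0, .5^2) under theta = 0, N(1, 1) under theta = 1."""
    mu, sd = (0.0, 0.5) if theta == 0 else (1.0, 1.0)
    return math.exp(-(y - mu) ** 2 / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def conf_set(y):
    """The exact 95% confidence set from the example, using cutoffs -.644 and .82."""
    s = set()
    if y < 0.82:
        s.add(0)       # theta = 0 kept unless y is large
    if y > -0.644:
        s.add(1)       # theta = 1 kept unless y is very negative
    return s

for y in (-3.0, -1.0, 0.0, 2.0):
    lr = density(y, 1) / density(y, 0)
    print(f"y = {y:5.1f}: set = {sorted(conf_set(y))}, LR for theta=1 = {lr:10.2f}")
```

At y = −3 the set contains only θ = 0 even though the likelihood ratio in favor of θ = 1 is enormous, which is exactly the anomaly described above.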
9. SIGNIFICANT AND INSIGNIFICANT RESULTS

What we say here applies about equally to confidence sets and to minimum-size posterior probability sets.

There is a big difference between a result that posterior probability is concentrated in a small (from the point of view of the substance of the problem) region around β = 0 and the result that the sample data are so uninformative that the posterior probability is spread widely, with a 95% HPD region therefore including β = 0. The former says we are quite sure that β is substantively small. The latter says β could be very big; indeed from looking at the data alone it seems more likely to be big in absolute value than small in absolute value.

Yet it is not uncommon to see one regression study, which found an insignificant effect of a variable X, cited as contradicting another study which found a significant effect, without any attention to what the probability intervals were and the degree to which they overlap.

10. OLD BUSINESS

Lancaster's definition of a natural conjugate prior implies that any prior pdf of the form π₀(θ)ℓ(Y, θ), where ℓ(Y, θ) is what we have called a conjugate prior, is a Lancaster-conjugate prior, regardless of what π₀ is. Gelman, Carlin, Stern and Rubin label Lancaster's natural conjugate priors just plain "conjugate" priors, and use "natural conjugate" to refer to priors that are in the same family as the likelihood function. Probably this is the best usage. So our definition of "conjugate" in lecture should instead be labeled "natural conjugate", and Lancaster in correspondence has agreed that the text should make this distinction.

11. A GENERAL METHOD FOR CONSTRUCTING CONFIDENCE REGIONS

For each θ ∈ Θ, choose a test statistic T(Y, θ), where Y is the observable data. For each θ choose a rejection region R(θ) in the range of T such that

    P[T(Y, θ) ∈ R(θ) | θ] = α.

Define S(Y) = {θ : T(Y, θ) ∉ R(θ)}. Then S(Y) is a 100(1 − α)% confidence region for θ.

12. REMARKS ON THE GENERAL METHOD

Regions constructed this way are exact confidence regions. When Y is continuously distributed, it is always possible to find T(Y, θ)'s and R(θ)'s to implement this idea. It is generally not possible to use this idea to produce confidence regions for individual elements of the θ vector or linear combinations of them. There are obviously many ways to choose the T and R functions, and thus many confidence regions, for a given α value.

13. PIVOTAL QUANTITIES

Sometimes we can find a collection of functions T(Y, θ) with the property that the distribution of T(Y, θ) | θ does not depend on θ. In this case we call T a pivotal quantity, or pivot, for θ.
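The construction in Section 11 can be carried out numerically whenever the sampling distribution of T(Y, θ) is computable, even without a pivot. A minimal sketch for a binomial success probability p, inverting equal-tailed exact tests over a grid (the grid, the choice α = .05, and the function names are ours; in a discrete model the resulting region is conservative rather than exact):

```python
from math import comb

def binom_pmf(y, n, p):
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

def in_acceptance(y, n, p, alpha=0.05):
    # accept H0: theta = p unless y falls in one of the alpha/2 rejection tails
    lower = sum(binom_pmf(j, n, p) for j in range(0, y + 1))   # P[Y <= y | p]
    upper = sum(binom_pmf(j, n, p) for j in range(y, n + 1))   # P[Y >= y | p]
    return lower > alpha / 2 and upper > alpha / 2

def confidence_set(y, n, alpha=0.05, grid=1000):
    # S(Y) = {p : the test of H0: theta = p does not reject}, on a grid of p's
    return [i / grid for i in range(grid + 1) if in_acceptance(y, n, i / grid, alpha)]

s = confidence_set(7, 20)
print(f"95% set for p after 7 successes in 20 trials: [{min(s):.3f}, {max(s):.3f}]")
```

Each grid point p is kept or dropped according to its own test, exactly as in the general method; the resulting set is the familiar exact (Clopper–Pearson-style) interval.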
Constructing confidence regions based on pivots is particularly easy: one defines a single region R such that P[T(Y, θ) ∈ R] = α, and this rejection

region defines the test used to construct the confidence region for all values of θ. Most confidence regions actually used in practice are based on pivots.

Certain kinds of pivots also make confidence regions based on them correspond to posterior probability regions with probabilities matching the confidence levels under certain flat priors. A leading special case is the SNLM, in which

    û′û/σ²  and  (β − β̂)′X′X(β − β̂)/(û′û)

are a pair of jointly pivotal quantities, and confidence regions based on the distribution of these statistics conditional on β, σ turn out to coincide with posterior probability regions with probabilities corresponding to 1 − α if the flat prior (1/σ) dσ dβ is used.

Other cases: scale parameters in general (like α in the Gamma distribution or σ in the t distribution) allow this kind of construction, as do location parameters in general (like µ in the t distribution). Most confidence regions in actual use are not only based on pivots, they are based on location-scale pivots that allow this flat-prior Bayesian interpretation.

14. TESTS

Each T(Y, θ), R(θ) pair in our construction of a confidence region makes up what is known as a statistical test of the null hypothesis that θ is the true value of the parameter. The parameter α is known as the significance level, or size, of the test. So an exact confidence region can always be interpreted as a collection of statistical tests with significance level α.

More generally, we can consider tests of hypotheses that do not consist of a single point in Θ. For such compound hypotheses, DeGroot and Schervish distinguish level, or significance level, from size. The null hypothesis is some set Ω₀ ⊂ Θ, and the test still takes the form of a statistic T(Y) and rejection region R. Only now it is possible that P[T(Y) ∈ R | θ] differs across θ's in Ω₀.
The standard definitions now say that the test has significance level α if P[T(Y) ∈ R | θ] ≤ α for all θ ∈ Ω₀, and that it has size α if it has level γ for every γ ≥ α but for no γ < α, i.e. if α = sup_{θ∈Ω₀} P[T(Y) ∈ R | θ].

15. POWER

The power function of a test is P[T(Y) ∈ R | θ] considered as a function of θ over Θ ∖ Ω₀. (It could be extended to range over Ω₀ also.)
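As a concrete illustration (our example, not from the lecture): for the one-sided test of H₀: µ ≤ 0 based on Ȳ ∼ N(µ, 1/n), rejecting when √n·Ȳ > 1.645, the power function can be computed directly from the normal CDF:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power(mu, n=25, crit=1.645):
    # P[sqrt(n) * ybar > crit | mu], where sqrt(n) * ybar ~ N(sqrt(n) * mu, 1)
    return 1 - normal_cdf(crit - math.sqrt(n) * mu)

# At the boundary of the null (mu = 0) the rejection probability is alpha = .05;
# it rises toward 1 as mu moves away from the null.
for mu in (0.0, 0.1, 0.3, 0.5, 1.0):
    print(f"mu = {mu:3.1f}: power = {power(mu):.3f}")
```

The value at µ = 0 is the size of the test; the increasing values for µ > 0 trace out the power function over Θ ∖ Ω₀.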

We would like a test to have a small size and large values of the power function for θ outside Ω₀. We would like a test to be unbiased, meaning that the infimum of P[T(Y) ∈ R | θ] over Θ ∖ Ω₀ is no smaller than the supremum of the same quantity over Ω₀. (Here ∖ is the set difference operator: A ∖ B = A ∩ Bᶜ.)

16. LINEAR COMBINATIONS OF PARAMETERS

In the SNLM with dσ/σ prior, consider an m × k matrix R used to form m linear combinations of β. Then

    Rβ | Y, X ∼ t_{T−k}(Rβ̂, RΣ_β R′),  where Σ_β = (û′û/(T − k))(X′X)⁻¹.

This implies (it is not hard to show, but we are skipping the algebra) that

    (β − β̂)′R′(RΣ_β R′)⁻¹ R(β − β̂)/m | Y, X ∼ F(m, T − k).   (∗)

There is a completely analogous non-Bayesian result for the same pivot: the same as (∗), but with the conditioning on β, σ instead of on Y, X.

17. CONFIDENCE SETS FOR LINEAR COMBINATIONS

Obviously we can use this pivot to develop elliptical confidence sets for arbitrary linear combinations of coefficients, or for individual coefficients (where they are equivalent to intervals based on the t_{T−k} distribution).

Many programs, including R, automatically report what is called the F-statistic for the equation or regression. It is the statistic we are discussing with R a selection matrix of zeros and ones, such that Rβ is the (k − 1) × 1 vector obtained by deleting the constant term from the regression.

What is the constant term? Most of the time, we include as the first or last column of X a vector of ones. If we write the equation for one observation as

    Y_t = β₀ + Σ_{j=1}^{k−1} X_{jt} β_j + ε_t,

this is equivalent to Y = Xβ + ε, with X = [1_{T×1}  X̃], where X̃ collects the non-constant regressors and β consists of the k β_j's.

If this F-statistic is not large, it may imply that Rβ = 0 is inside a standard confidence set or credible set (some authors' term for a set with a given posterior probability). For the particular R we are considering, this means that a parameter vector with every element zero except for the constant term is

inside the confidence/credible set, and thus that in some sense the whole regression is insignificant.
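When Rβ consists of a single slope coefficient, the equation F-statistic is just the square of that slope's t-statistic. A sketch on a small made-up dataset, checking the identity F = t² directly (the numbers are invented for illustration):

```python
import math

# made-up data: y regressed on a constant and one regressor x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 2.3, 2.8, 4.5, 4.9, 6.2]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx                        # OLS slope
b0 = ybar - b1 * xbar                 # OLS intercept
rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
tss = sum((yi - ybar) ** 2 for yi in y)
s2 = rss / (n - 2)                    # error-variance estimate, T - k = n - 2
se_b1 = math.sqrt(s2 / sxx)           # standard error of the slope

t_stat = b1 / se_b1
f_stat = (tss - rss) / s2             # equation F-statistic: F(1, n - 2) with one slope
print(f"slope = {b1:.3f}, t = {t_stat:.2f}, F = {f_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
```

The printed F and t² agree, since with m = 1 the quadratic form in (∗) collapses to the squared t-ratio; with several slopes the same statistic tests whether all of them are jointly zero.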