Large Sample Theory

Large Sample Theory is the name given to the search for approximations to the behaviour of statistical procedures, derived by computing limits as the sample size, n, tends to infinity. Suppose we have a data set with a fairly large sample size, say n = 100. We imagine our data set as one in a sequence of possible data sets, one for each possible value of n. If we have a sequence of statistics which converges to something as n tends to infinity, then we approximate the value of a probability, for instance, at n = 100 by the corresponding limiting value as n → ∞. In this section we will investigate approximations of this kind for maximum likelihood estimation.

Most large sample theory uses three main technical tools: the Law of Large Numbers (LLN), the Central Limit Theorem (CLT) and Taylor expansion. I assume you have heard of all of these but will state versions of them as we go. These tools are generally easier to apply to statistics for which we have explicit formulas than to statistics, like maximum likelihood estimates, for which we usually do not. In that situation we instead study the equations we solve to find the MLEs. We will therefore study the approximate behaviour of the MLE, θ̂, by studying the score function U(θ) = l'(θ), the derivative of the log likelihood. Notice first that U is a sum of independent random variables; this will allow us to apply both the LLN and the CLT to U.

Theorem 1 (LLN): If Y_1, Y_2, ... are iid with mean µ then

    Σ Y_i / n → µ.

This is called the law of large numbers, and it comes in two forms: strong and weak.

Theorem 2 (SLLN): If Y_1, Y_2, ... are iid with mean µ then

    P( lim_{n→∞} Σ Y_i / n = µ ) = 1.

The strong law is harder to prove than the weak law of large numbers:

Theorem 3 (WLLN): If Y_1, Y_2, ... are iid with mean µ then for each positive ε

    lim_{n→∞} P( |Σ Y_i / n − µ| > ε ) = 0.

For iid Y_i the stronger conclusion (the SLLN) holds, but for our heuristics we will ignore the differences between these notions.

Now suppose that θ_0 is the true value of θ. Then U(θ)/n → µ(θ), where

    µ(θ) = E_{θ_0}[ ∂ log f(X_i, θ)/∂θ ] = ∫ [ ∂ log f(x, θ)/∂θ ] f(x, θ_0) dx.

Remark: This convergence is pointwise and might not be uniform. That is, there might be some θ values where the convergence takes much longer than at others; it could be that for every n there is a θ where U(θ)/n and µ(θ) are not close together.

Consider as an example the case of N(µ, 1) data, where

    U(µ)/n = Σ (X_i − µ)/n = X̄ − µ.

If the true mean is µ_0 then X̄ → µ_0 and U(µ)/n → µ_0 − µ. If we think of a µ < µ_0 we see that the derivative of l(µ) is likely to be positive, so that l increases as we increase µ. For µ more than µ_0 the derivative is probably negative, and so l tends to be decreasing for µ > µ_0. It follows that l is likely to be maximized close to µ_0.

Now we repeat these ideas for a more general case. We study the random variable log[ f(X_i, θ)/f(X_i, θ_0) ]. You know the inequality

    E(X)² ≤ E(X²)

(because the difference is Var(X) ≥ 0). This inequality has the following generalization, called Jensen's inequality: if g is a convex function (nonnegative second derivative, roughly) then

    g(E(X)) ≤ E(g(X)).

The inequality above has g(x) = x². Here we use g(x) = −log(x), which is convex because g''(x) = x^{−2} > 0. We get

    −log( E_{θ_0}[ f(X_i, θ)/f(X_i, θ_0) ] ) ≤ E_{θ_0}[ −log{ f(X_i, θ)/f(X_i, θ_0) } ].

But

    E_{θ_0}[ f(X_i, θ)/f(X_i, θ_0) ] = ∫ [ f(x, θ)/f(x, θ_0) ] f(x, θ_0) dx = ∫ f(x, θ) dx = 1.

We can reassemble the inequality and this calculation to get

    E_{θ_0}[ log{ f(X_i, θ)/f(X_i, θ_0) } ] ≤ 0.

It is possible to prove that the inequality is strict unless the θ and θ_0 densities are actually the same. Let K(θ) < 0 denote this expected value (a different function from the µ(θ) above). Then for each θ we find

    n^{−1}[ l(θ) − l(θ_0) ] = n^{−1} Σ log[ f(X_i, θ)/f(X_i, θ_0) ] → K(θ).

This proves that the likelihood is probably higher at θ_0 than at any other single θ. This idea can often be stretched to prove that the MLE is consistent.

Definition: A sequence θ̂_n of estimators of θ is consistent if θ̂_n converges weakly (or strongly) to θ.

Proto-theorem: In regular problems the MLE θ̂ is consistent.

Now let us study the shape of the log likelihood near the true value θ_0, under the assumption that θ̂ is a root of the likelihood equations close to θ_0. We use Taylor expansion to write, for a one-dimensional parameter θ,

    U(θ̂) = 0 = U(θ_0) + U'(θ_0)(θ̂ − θ_0) + U''(θ̃)(θ̂ − θ_0)²/2

for some θ̃ between θ_0 and θ̂. (This form of the remainder in Taylor's theorem is not valid for multivariate θ.) The derivatives of U are each sums of n terms and so should each be proportional to n in size. The second-derivative term is multiplied by the square of the small number θ̂ − θ_0, so it should be negligible compared to the first-derivative term. If we ignore the second-derivative term we get

    −U'(θ_0)(θ̂ − θ_0) ≈ U(θ_0).

Now let's look at the terms U and U'. In the normal case U(θ_0) = Σ (X_i − µ_0) has a normal distribution with mean 0 and variance n (standard deviation √n). The derivative is simply U'(µ) = −n, and the next derivative U'' is 0. We will analyze the general case by noticing that both U and U' are sums of iid random variables. Let

    U_i = ∂ log f(X_i, θ_0)/∂θ

and

    V_i = −∂² log f(X_i, θ)/∂θ².

In general, U(θ_0) = Σ U_i has mean 0 and approximately a normal distribution. Here is how we check that the mean is 0:

    E_{θ_0}( U(θ_0) ) = n E_{θ_0}( U_1 )
                      = n ∫ [ ∂ log f(x, θ_0)/∂θ ] f(x, θ_0) dx
                      = n ∫ [ (∂f(x, θ_0)/∂θ) / f(x, θ_0) ] f(x, θ_0) dx
                      = n ∫ ∂f(x, θ_0)/∂θ dx
                      = n [ ∂/∂θ ∫ f(x, θ) dx ]|_{θ=θ_0}
                      = n ∂(1)/∂θ = 0.
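This mean-zero property of the score can be checked numerically. The sketch below is my own illustration (the Poisson model, sample size and seed are not from the notes): for Poisson(θ) data, ∂ log f(x, θ)/∂θ = x/θ − 1, and averaging this over many draws from Poisson(θ_0) should give a value near 0.

```python
import math, random

random.seed(4)

# For X ~ Poisson(theta), the score of one observation at theta0 is
# U_1 = X/theta0 - 1, which has expectation 0 under theta0.
theta0 = 3.0
n = 100_000

def rpois(lam):
    # simple Poisson sampler (Knuth's method; fine for small lam)
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

score_bar = sum(rpois(theta0) / theta0 - 1.0 for _ in range(n)) / n
print("average score at theta0:", score_bar)  # close to 0
```

The Monte Carlo standard error here is about √(1/(θ_0 n)) ≈ 0.002, so the printed average should be within a few thousandths of zero.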

Notice that I have interchanged the order of differentiation and integration at one point. This step is usually justified by applying the dominated convergence theorem to the definition of the derivative. The same tactic can be applied to the identity which we just proved,

    ∫ [ ∂ log f(x, θ)/∂θ ] f(x, θ) dx = 0.

Taking the derivative of both sides with respect to θ and pulling the derivative under the integral sign again gives

    ∫ ∂/∂θ { [ ∂ log f(x, θ)/∂θ ] f(x, θ) } dx = 0.

Carrying out the derivative and rearranging gives

    −∫ [ ∂² log f(x, θ)/∂θ² ] f(x, θ) dx = ∫ [ ∂ log f(x, θ)/∂θ ] [ ∂f(x, θ)/∂θ ] dx
                                          = ∫ [ ∂ log f(x, θ)/∂θ ]² f(x, θ) dx.

Definition: The Fisher information is

    I(θ) = −E_θ( U'(θ) ) = n E_θ( V_1 ).

We refer to i(θ) = E_θ(V_1) as the information in one observation. The idea is that I is a measure of how curved the log likelihood tends to be at the true value of θ: big curvature means precise estimates. The identity above says

    I(θ) = Var_θ( U(θ) ) = n i(θ).

Now we return to our Taylor expansion approximation

    −U'(θ_0)(θ̂ − θ_0) ≈ U(θ_0)

and study the two appearances of U. We have shown that U = Σ U_i is a sum of iid mean-0 random variables. The central limit theorem thus proves that

    n^{−1/2} U(θ) → N(0, σ²)

where σ 2 = V ar(u i ) = E(V i ) = I(θ). Next observe that U (θ) = V i where again V i = U i The law of large numbers can be applied to show U (θ 0 )/n E θ0 [V 1 ] = I(θ 0 ) Now manipulate our Taylor expansion as follows n 1/2 (ˆθ θ 0 ) [ Vi n ] 1 Ui n Apply Slutsky s Theorem to conclude that the right hand side of this converges in distribution to N(0, σ 2 /I(θ) 2 ) which simplifies, because of the identities, to N(0, 1/I(θ). Summary In regular families: Under strong regularity conditions Jensen s inequality can be used to demonstrate that ˆθ which maximizes l globally is consistent and that this ˆθ is a root of the likelihood equations. It is generally easier to study l only close to θ 0. For instance define A to be the event that l is concave on the set of θ such that θ θ 0 < δ and the likelihood equations have a unique root in that set. Under weaker conditions than the previous case we can prove that there is a δ > 0 such that P (A) 1 In that case we can prove that the root ˆθ of the likelihood equations mentioned in the definition of A is consistent. Sometimes we can only get an even weaker conclusion. Define B to be the event that l(θ) is concave for n 1/2 θ θ 0 < L and there is a unique root of l over this range. Again this root is consistent but there might be other consistent roots of the likelihood equations. 6

Under any of these scenarios there is a consistent root of the likelihood equations which is the closest to the true value θ_0. This root θ̂ has the property

    n^{1/2}(θ̂ − θ_0) → N(0, 1/i(θ_0)),

where i(θ_0) is the Fisher information in one observation. We usually simply say that the MLE is consistent and asymptotically normal, with an asymptotic variance which is the inverse of the Fisher information. This assertion is actually valid for vector-valued θ, where now I is a matrix with ij-th entry

    I_{ij} = −E[ ∂²l / ∂θ_i ∂θ_j ].

Estimating Equations

The same ideas arise in almost any model where estimates are derived by solving some equation. As an example I sketch large sample theory for generalized linear models. Suppose that for i = 1, ..., n we have observations of the numbers of cancer cases Y_i in some group of people characterized by values x_i of some covariates. You are supposed to think of x_i as containing variables like age, or a dummy for sex, or average income, and so on. A parametric regression model for the Y_i might postulate that Y_i has a Poisson distribution with mean µ_i, where the mean µ_i depends somehow on the covariate values. Typically we might assume that

    g(µ_i) = β_0 + x_i β,

where g is a so-called link function, often in this case g(µ) = log(µ), and x_i β is a matrix product with x_i written as a row vector and β as a column vector. This is supposed to function as a linear regression model with Poisson errors. I will treat as a special case log(µ_i) = β x_i, where x_i is a scalar. The log likelihood is simply

    l(β) = Σ ( Y_i log(µ_i) − µ_i ),

ignoring irrelevant factorials. Since log(µ_i) = β x_i, the score function is

    U(β) = Σ ( Y_i x_i − x_i µ_i ) = Σ x_i ( Y_i − µ_i ).

(Notice again that the score has mean 0 when you plug in the true parameter value.) The key observation, however, is that it is not necessary to believe

that Y_i has a Poisson distribution to make solving the equation U = 0 sensible. Suppose only that log(E(Y_i)) = x_i β. Then we have assumed that

    E_β( U(β) ) = 0.

This was the key condition in proving that there was a consistent root of the likelihood equations, and here it is what is needed, roughly, to prove that the equation U(β) = 0 has a consistent root β̂. Ignoring higher-order terms in a Taylor expansion gives

    V(β)( β̂ − β ) ≈ U(β),

where V = −U'. In the MLE case we had identities relating the expectation of V to the variance of U. In general here we have

    Var(U) = Σ x_i² Var(Y_i).

If Y_i is Poisson with mean µ_i (and so Var(Y_i) = µ_i) this is

    Var(U) = Σ x_i² µ_i.

Moreover we have

    V_i = x_i² µ_i

and so

    V(β) = Σ x_i² µ_i.

The central limit theorem (the Lyapunov kind) will show that U(β) has an approximate normal distribution with variance

    σ_U² = Σ x_i² Var(Y_i),

and so

    β̂ − β ≈ N( 0, σ_U² / (Σ x_i² µ_i)² ).

If Var(Y_i) = µ_i, as it is in the Poisson case, the asymptotic variance simplifies to 1/Σ x_i² µ_i.

Notice that other estimating equations are possible. People suggest alternatives very often. If the w_i are any set of deterministic weights (even possibly depending on µ_i) then we could define

    U(β) = Σ w_i ( Y_i − µ_i )

and still conclude that U = 0 probably has a consistent root which has an asymptotic normal distribution. This idea is used all over the place these days: see, for example, Zeger and Liang's generalized estimating equations (GEE), which the econometricians call the Generalized Method of Moments.
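As a sketch of how such an estimating equation is actually solved, here is Newton-Raphson applied to U(β) = Σ x_i(Y_i − µ_i) = 0 with µ_i = exp(β x_i), using V(β) = Σ x_i² µ_i as the derivative. The simulated data, sample size, true β and seed are my own illustrative choices, not from the notes.

```python
import math, random

random.seed(7)

# Simulate Poisson counts with log(mu_i) = beta * x_i, then solve the
# estimating equation U(beta) = 0 by Newton-Raphson:
#   beta <- beta + U(beta) / V(beta),   V(beta) = sum x_i^2 mu_i = -U'(beta).
beta_true = 0.5
xs = [random.uniform(0.0, 2.0) for _ in range(5000)]

def rpois(lam):
    # simple Poisson sampler (Knuth's method; fine for small lam)
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

ys = [rpois(math.exp(beta_true * x)) for x in xs]

beta = 0.0  # crude starting value
for _ in range(50):
    mus = [math.exp(beta * x) for x in xs]
    U = sum(x * (y - m) for x, y, m in zip(xs, ys, mus))
    V = sum(x * x * m for x, m in zip(xs, mus))
    step = U / V
    beta += step
    if abs(step) < 1e-10:
        break

print("beta_hat =", beta, "(true value 0.5)")
```

Because the log likelihood is concave in β here, Newton-Raphson converges in a handful of steps, consistent with the remark below that few iterations are needed from a reasonable starting value.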

Problems with maximum likelihood

1. In problems with many parameters the approximations don't work very well, and maximum likelihood estimators can be far from the right answer. See your homework for the Neyman-Scott example, where the MLE is not consistent.

2. When there are multiple roots of the likelihood equation you must choose the right root. To do so you might start with a different consistent estimator and then apply some iterative scheme, like Newton-Raphson, to the likelihood equations to find the MLE. It turns out that not many steps of Newton-Raphson are generally required if the starting point is a reasonable estimate.

Finding (good) preliminary point estimates

Method of Moments

Basic strategy: set sample moments equal to population moments and solve for the parameters.

Definition: The r-th sample moment (about the origin) is

    (1/n) Σ_{i=1}^n X_i^r.

The r-th population moment is

    E(X^r).

Definition: The central moments are

    (1/n) Σ_{i=1}^n (X_i − X̄)^r    and    E[ (X − µ)^r ].

If we have p parameters we can estimate the parameters θ_1, ..., θ_p by solving the system of p equations:

    µ_1 = X̄

and so on up to

    µ_2 = (1/n) Σ X_i²
    ⋮
    µ_p = (1/n) Σ X_i^p.

You need to remember that the population moments µ_k will be formulas involving the parameters.

Gamma Example

The Gamma(α, β) density is

    f(x; α, β) = [ 1 / (β Γ(α)) ] (x/β)^{α−1} exp(−x/β) 1(x > 0)

and has mean

    µ_1 = αβ

and second central moment (variance)

    µ_2 = αβ².

(It is convenient here to match the variance rather than the second raw moment.) This gives the equations

    αβ = X̄
    αβ² = σ̂²,

where σ̂² = (1/n) Σ (X_i − X̄)² is the second sample central moment. Divide the second equation by the first to find the method of moments estimate of β:

    β̃ = σ̂² / X̄.

Then from the first equation get

    α̃ = X̄ / β̃ = (X̄)² / σ̂².

These equations are much easier to solve than the likelihood equations, which involve the digamma function

    ψ(α) = (d/dα) log Γ(α).
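The Gamma method-of-moments recipe, α̃ = X̄²/σ̂² and β̃ = σ̂²/X̄ with σ̂² the second sample central moment, takes two lines to compute. The simulated data, true parameter values and seed below are my own illustrative choices.

```python
import random

random.seed(8)

# Method-of-moments fit of Gamma(alpha, beta) from simulated data:
#   alpha_tilde = xbar^2 / s2,   beta_tilde = s2 / xbar,
# where s2 is the second sample central moment (the sample variance).
alpha_true, beta_true = 3.0, 2.0
n = 100_000
xs = [random.gammavariate(alpha_true, beta_true) for _ in range(n)]

xbar = sum(xs) / n
s2 = sum((x - xbar) ** 2 for x in xs) / n

alpha_t = xbar ** 2 / s2
beta_t = s2 / xbar
print("alpha_tilde =", alpha_t, " beta_tilde =", beta_t)
```

With n this large both estimates land close to the true values (3, 2); no digamma function is needed, which is exactly the appeal of the method of moments as a preliminary estimate.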