Section 8: Asymptotic Properties of the MLE


In this part of the course, we consider the asymptotic properties of the maximum likelihood estimator. In particular, we study consistency, asymptotic normality, and efficiency. Many of the proofs will be rigorous, in order to display techniques that are also useful in later chapters.

We suppose that $X^n = (X_1, \ldots, X_n)$, where the $X_i$'s are i.i.d. with common density $p(x;\theta_0) \in \mathcal{P} = \{p(x;\theta) : \theta \in \Theta\}$. We assume that $\theta_0$ is identified, in the sense that if $\theta \neq \theta_0$ and $\theta \in \Theta$, then $p(x;\theta) \neq p(x;\theta_0)$ with respect to the dominating measure $\mu$.

For fixed $\theta \in \Theta$, the joint density of $X^n$ is the product of the individual densities,
$$p(x^n; \theta) = \prod_{i=1}^n p(x_i; \theta).$$
As usual, when we think of $p(x^n;\theta)$ as a function of $\theta$ with $x^n$ held fixed, we refer to the resulting function as the likelihood function, $L(\theta; x^n)$. The maximum likelihood estimate for observed $x^n$ is the value $\hat\theta(x^n) \in \Theta$ which maximizes $L(\theta; x^n)$. Prior to observation, $x^n$ is unknown, so we consider the maximum likelihood estimator (MLE), $\hat\theta(X^n)$, the value of $\theta \in \Theta$ which maximizes $L(\theta; X^n)$. Equivalently, the MLE can be taken to maximize the standardized log-likelihood,
$$\frac{l(\theta; X^n)}{n} = \frac{\log L(\theta; X^n)}{n} = \frac{1}{n}\sum_{i=1}^n \log p(X_i; \theta) = \frac{1}{n}\sum_{i=1}^n l(\theta; X_i).$$
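
Not from the notes: a minimal numerical sketch of this definition, maximizing the standardized log-likelihood for an assumed exponential model $p(x;\theta) = \theta e^{-\theta x}$ (chosen only for illustration).

```python
# Minimal sketch (assumed exponential model, not the course's example):
# compute the MLE by maximizing (1/n) * sum_i log p(x_i; theta) numerically.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta0 = 2.0
x = rng.exponential(scale=1.0 / theta0, size=500)   # i.i.d. sample from p(x; theta0)

def neg_avg_loglik(theta, x):
    # -(1/n) * sum_i log p(x_i; theta) for the exponential model
    return -(np.log(theta) - theta * x).mean()

res = minimize_scalar(neg_avg_loglik, bounds=(1e-6, 50.0), args=(x,), method="bounded")
print("numerical MLE:", res.x, " closed form 1/xbar:", 1.0 / x.mean())
```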

We will show that the MLE is often
1. consistent: $\hat\theta(X^n) \xrightarrow{P} \theta_0$;
2. asymptotically normal: $\sqrt{n}\,(\hat\theta(X^n) - \theta_0) \xrightarrow{D(\theta_0)}$ Normal random variable;
3. asymptotically efficient: if we want to estimate $\theta_0$ by any other estimator within a reasonable class, the MLE is the most precise.
To show 1-3, we will have to impose regularity conditions on the probability model and (for 3) on the class of estimators considered. A simulation sketch of 1 and 2 follows this list.
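
Not from the notes: a Monte Carlo sketch of properties 1 and 2 for the assumed exponential model, where the MLE has the closed form $\hat\theta = 1/\bar X$ and $I(\theta_0) = 1/\theta_0^2$.

```python
# Minimal sketch (assumed exponential model): consistency and approximate
# normality of the MLE theta_hat = 1/xbar across Monte Carlo replications.
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0

for n in (50, 500, 5000):
    draws = rng.exponential(scale=1.0 / theta0, size=(2000, n))
    mle = 1.0 / draws.mean(axis=1)            # MLE for each replication
    z = np.sqrt(n) * (mle - theta0)           # centered and scaled
    print(f"n={n:5d}  mean(mle)={mle.mean():.3f}  sd of sqrt(n)*(mle-theta0)={z.std():.3f}")
# As n grows, mean(mle) approaches theta0 and the sd approaches
# sqrt(1/I(theta0)) = theta0 = 2 for this model.
```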

Section 8.1: Consistency

We first want to show that if we have a sample of i.i.d. data from a common distribution belonging to a probability model, then, under some regularity conditions on the form of the density, the sequence of estimators $\{\hat\theta(X^n)\}$ converges in probability to $\theta_0$. So far, we have not discussed whether a maximum likelihood estimator exists or, if one does, whether it is unique. We will get to this, but first we start with a heuristic proof of consistency.

Heuristic Proof

The MLE is the value $\theta \in \Theta$ that maximizes $Q(\theta; X^n) := \frac{1}{n}\sum_{i=1}^n l(\theta; X_i)$. By the WLLN, we know that
$$Q(\theta; X^n) = \frac{1}{n}\sum_{i=1}^n l(\theta; X_i) \xrightarrow{P} Q_0(\theta) := E_{\theta_0}[l(\theta; X)] = E_{\theta_0}[\log p(X;\theta)] = \int \{\log p(x;\theta)\}\, p(x;\theta_0)\, d\mu(x).$$
We expect that, on average, the log-likelihood will be close to the expected log-likelihood. Therefore, we expect the maximum likelihood estimator to be close to the maximizer of the expected log-likelihood. We will show that the expected log-likelihood $Q_0(\theta)$ is maximized at $\theta_0$ (i.e., the truth).
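
Not from the notes: a sketch of this heuristic for the assumed exponential model, where $Q_0(\theta) = E_{\theta_0}[\log p(X;\theta)] = \log\theta - \theta/\theta_0$ in closed form.

```python
# Minimal sketch (assumed exponential model): the sample objective
# Q(theta; X^n) = (1/n) sum_i log p(X_i; theta) is close to its limit
# Q_0(theta) = log(theta) - theta/theta0 over a grid of theta values.
import numpy as np

rng = np.random.default_rng(2)
theta0 = 2.0
x = rng.exponential(scale=1.0 / theta0, size=2000)

grid = np.linspace(0.5, 5.0, 10)
Q_n = np.array([np.mean(np.log(t) - t * x) for t in grid])   # sample average
Q_0 = np.log(grid) - grid / theta0                           # population limit
print(np.round(np.abs(Q_n - Q_0), 3))            # small for every theta on the grid
print("argmax of Q_0 on grid:", grid[np.argmax(Q_0)])        # near theta0 = 2
```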

Lemma 8.1: If $\theta_0$ is identified and $E_{\theta_0}[|\log p(X;\theta)|] < \infty$ for all $\theta \in \Theta$, then $Q_0(\theta)$ is uniquely maximized at $\theta = \theta_0$.

Proof: By Jensen's inequality, we know that for any strictly convex function $g(\cdot)$ and non-degenerate random variable $Y$, $E[g(Y)] > g(E[Y])$. Take $g(y) = -\log(y)$ and $Y = p(X;\theta)/p(X;\theta_0)$, which is non-degenerate for $\theta \neq \theta_0$ by identification. So, for $\theta \neq \theta_0$,
$$E_{\theta_0}\left[-\log\frac{p(X;\theta)}{p(X;\theta_0)}\right] > -\log\left(E_{\theta_0}\left[\frac{p(X;\theta)}{p(X;\theta_0)}\right]\right).$$
Note that
$$E_{\theta_0}\left[\frac{p(X;\theta)}{p(X;\theta_0)}\right] = \int \frac{p(x;\theta)}{p(x;\theta_0)}\, p(x;\theta_0)\, d\mu(x) = \int p(x;\theta)\, d\mu(x) = 1.$$
So $E_{\theta_0}[-\log(p(X;\theta)/p(X;\theta_0))] > 0$, or
$$Q_0(\theta_0) = E_{\theta_0}[\log p(X;\theta_0)] > E_{\theta_0}[\log p(X;\theta)] = Q_0(\theta).$$
This inequality holds for all $\theta \neq \theta_0$.
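
Not from the notes: a numerical check of Lemma 8.1 for the assumed exponential model, where $Q_0(\theta_0) - Q_0(\theta)$ is the Kullback-Leibler divergence and should be strictly positive for $\theta \neq \theta_0$.

```python
# Minimal sketch (assumed exponential model): Q_0(theta0) - Q_0(theta) > 0
# for theta != theta0, and equals 0 only at theta = theta0.
import numpy as np

theta0 = 2.0

def Q0(theta):
    # E_{theta0}[log p(X; theta)] = log(theta) - theta / theta0
    return np.log(theta) - theta / theta0

for theta in (0.5, 1.0, 2.0, 3.0, 8.0):
    print(theta, round(Q0(theta0) - Q0(theta), 4))   # zero only at theta = theta0
```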

Under technical conditions ensuring that the limit of the maximizer is the maximizer of the limit, $\hat\theta(X^n)$ should converge in probability to $\theta_0$. Sufficient conditions for the maximizer of the limit to be the limit of the maximizers are that the convergence is uniform and the parameter space is compact.

The discussion so far only allows for a compact parameter space. In theory, compactness requires that one know bounds on the true parameter value, although this constraint is often ignored in practice. It is possible to drop this assumption if the function $Q(\theta; X^n)$ cannot rise too much as $\theta$ becomes unbounded. We will discuss this later.

Definition (Uniform Convergence in Probability): $Q(\theta; X^n)$ converges uniformly in probability to $Q_0(\theta)$ if
$$\sup_{\theta \in \Theta} |Q(\theta; X^n) - Q_0(\theta)| \xrightarrow{P(\theta_0)} 0.$$
More precisely, we have that for all $\epsilon > 0$,
$$P_{\theta_0}\left[\sup_{\theta \in \Theta} |Q(\theta; X^n) - Q_0(\theta)| > \epsilon\right] \to 0.$$
Why isn't pointwise convergence enough? Uniform convergence guarantees that, for almost all realizations, the paths in $\theta$ lie in the $\epsilon$-sleeve around $Q_0$; this ensures that the maximizer is close to $\theta_0$. For pointwise convergence, we know that at each $\theta$ most of the realizations are in the $\epsilon$-sleeve, but there is no guarantee that for another value of $\theta$ the same set of realizations is in the sleeve. Thus, the maximizer need not be near $\theta_0$.
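
Not from the notes: a sketch of the uniform deviation for the assumed exponential model on a compact grid, showing $\sup_\theta |Q(\theta;X^n) - Q_0(\theta)|$ shrinking as $n$ grows.

```python
# Minimal sketch (assumed exponential model): the uniform deviation
# sup_theta |Q(theta; X^n) - Q_0(theta)| over a compact grid shrinks with n.
import numpy as np

rng = np.random.default_rng(3)
theta0 = 2.0
grid = np.linspace(0.2, 10.0, 200)          # compact parameter set [0.2, 10]
Q_0 = np.log(grid) - grid / theta0

for n in (100, 1000, 10000):
    x = rng.exponential(scale=1.0 / theta0, size=n)
    Q_n = np.log(grid) - grid * x.mean()    # (1/n) sum_i [log(theta) - theta*x_i]
    print(n, round(np.max(np.abs(Q_n - Q_0)), 4))
```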

Theorem 8.2: Suppose that $Q(\theta; X^n)$ is continuous in $\theta$ and there exists a function $Q_0(\theta)$ such that
1. $Q_0(\theta)$ is uniquely maximized at $\theta_0$;
2. $\Theta$ is compact;
3. $Q_0(\theta)$ is continuous in $\theta$;
4. $Q(\theta; X^n)$ converges uniformly in probability to $Q_0(\theta)$.
Then $\hat\theta(X^n)$, defined as the value of $\theta \in \Theta$ which, for each $X^n = x^n$, maximizes the objective function $Q(\theta; X^n)$, satisfies $\hat\theta(X^n) \xrightarrow{P} \theta_0$.

Proof: For a positive $\epsilon$, define the $\epsilon$-neighborhood about $\theta_0$ to be
$$\Theta(\epsilon) = \{\theta : \|\theta - \theta_0\| < \epsilon\}.$$
We want to show that $P_{\theta_0}[\hat\theta(X^n) \in \Theta(\epsilon)] \to 1$ as $n \to \infty$. Since $\Theta(\epsilon)$ is an open set, $\Theta \cap \Theta(\epsilon)^C$ is a compact set (Assumption 2). Since $Q_0(\theta)$ is a continuous function (Assumption 3), $\sup_{\theta \in \Theta \cap \Theta(\epsilon)^C} Q_0(\theta)$ is achieved at some $\theta^*$ in this compact set. Since $\theta_0$ is the unique maximizer, let $\delta := Q_0(\theta_0) - Q_0(\theta^*) > 0$.

Now, for any $\theta$, we distinguish between two cases.

Case 1: $\theta \in \Theta \cap \Theta(\epsilon)^C$. Let $A_n$ be the event that $\sup_{\theta \in \Theta \cap \Theta(\epsilon)^C} |Q(\theta; X^n) - Q_0(\theta)| < \delta/2$. Then, on $A_n$,
$$Q(\theta; X^n) < Q_0(\theta) + \delta/2 \le Q_0(\theta^*) + \delta/2 = Q_0(\theta_0) - \delta + \delta/2 = Q_0(\theta_0) - \delta/2.$$

Case 2: $\theta \in \Theta(\epsilon)$. Let $B_n$ be the event that $\sup_{\theta \in \Theta(\epsilon)} |Q(\theta; X^n) - Q_0(\theta)| < \delta/2$. Then, on $B_n$,
$$Q(\theta; X^n) > Q_0(\theta) - \delta/2 \quad \text{for all } \theta \in \Theta(\epsilon), \quad \text{in particular} \quad Q(\theta_0; X^n) > Q_0(\theta_0) - \delta/2.$$
Comparing the final expressions for Cases 1 and 2, we conclude that if both $A_n$ and $B_n$ hold then $\hat\theta \in \Theta(\epsilon)$: the objective at every $\theta$ outside $\Theta(\epsilon)$ is strictly below the objective at $\theta_0$. But by uniform convergence, $P(A_n \cap B_n) \to 1$, so $P(\hat\theta \in \Theta(\epsilon)) \to 1$.

A key element of the above proof is that $Q(\theta; X^n)$ converges uniformly in probability to $Q_0(\theta)$. This is often difficult to prove. A useful condition is given by the following lemma.

Lemma 8.3: If $X_1, \ldots, X_n$ are i.i.d. $p(x;\theta_0) \in \{p(x;\theta) : \theta \in \Theta\}$, $\Theta$ is compact, $\log p(x;\theta)$ is continuous in $\theta$ for all $\theta \in \Theta$ and all $x \in \mathcal{X}$, and if there exists a function $d(x)$ such that $|\log p(x;\theta)| \le d(x)$ for all $\theta \in \Theta$ and $x \in \mathcal{X}$, with $E_{\theta_0}[d(X)] < \infty$, then
i. $Q_0(\theta) = E_{\theta_0}[\log p(X;\theta)]$ is continuous in $\theta$;
ii. $\sup_{\theta \in \Theta} |Q(\theta; X^n) - Q_0(\theta)| \xrightarrow{P} 0$.

Example: suicide seasonality and the von Mises distribution (in class); a numerical sketch appears below.
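
Not the in-class analysis, only a hedged sketch: for the von Mises density, $\log p(x;\mu,\kappa) = \kappa\cos(x-\mu) - \log(2\pi I_0(\kappa))$ is continuous in $(\mu,\kappa)$ and, on a compact set with $\kappa \le K$, is bounded by the constant $d(x) = K + \log(2\pi I_0(K))$, so Lemma 8.3 applies. The data below are simulated angles standing in for seasonality data.

```python
# Minimal sketch (simulated data, assumed von Mises model): fit (mu, kappa)
# by maximizing the average log-likelihood on a compact parameter set.
import numpy as np
from scipy.special import i0
from scipy.optimize import minimize

rng = np.random.default_rng(4)
mu0, kappa0 = 1.0, 2.0
x = rng.vonmises(mu=mu0, kappa=kappa0, size=1000)   # i.i.d. angles

def neg_avg_loglik(par, x):
    mu, kappa = par
    return -np.mean(kappa * np.cos(x - mu) - np.log(2 * np.pi * i0(kappa)))

res = minimize(neg_avg_loglik, x0=[0.0, 1.0], args=(x,),
               bounds=[(-np.pi, np.pi), (1e-6, 50.0)])
print("MLE (mu, kappa):", res.x)       # close to (1.0, 2.0) for large n
```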

Proof: We first prove the continuity of $Q_0(\theta)$. For any $\theta \in \Theta$, choose a sequence $\theta_k \in \Theta$ which converges to $\theta$. By the continuity of $\log p(x;\theta)$, we know that $\log p(x;\theta_k) \to \log p(x;\theta)$. Since $|\log p(x;\theta_k)| \le d(x)$, the dominated convergence theorem tells us that $Q_0(\theta_k) = E_{\theta_0}[\log p(X;\theta_k)] \to E_{\theta_0}[\log p(X;\theta)] = Q_0(\theta)$. This implies that $Q_0(\theta)$ is continuous.

Next, we work to establish the uniform convergence in probability. We need to show that for any $\epsilon, \eta > 0$ there exists $N(\epsilon,\eta)$ such that for all $n > N(\epsilon,\eta)$,
$$P_{\theta_0}\left[\sup_{\theta \in \Theta} |Q(\theta; X^n) - Q_0(\theta)| > \epsilon\right] < \eta.$$

Since $\log p(x;\theta)$ is continuous in $\theta \in \Theta$ and $\Theta$ is compact, we know that $\log p(x;\theta)$ is uniformly continuous in $\theta$ (see Rudin, page 90). Uniform continuity says that for all $\epsilon > 0$ there exists a $\delta(\epsilon) > 0$ such that $|\log p(x;\theta_1) - \log p(x;\theta_2)| < \epsilon$ for all $\theta_1, \theta_2 \in \Theta$ for which $\|\theta_1 - \theta_2\| < \delta(\epsilon)$. This is a property of a function defined on a set of points. In contrast, continuity is defined relative to a particular point: continuity of a function at a point $\theta^*$ says that for all $\epsilon > 0$, there exists a $\delta(\epsilon, \theta^*)$ such that $|\log p(x;\theta) - \log p(x;\theta^*)| < \epsilon$ for all $\theta \in \Theta$ for which $\|\theta - \theta^*\| < \delta(\epsilon, \theta^*)$. For continuity, $\delta$ depends on $\epsilon$ and $\theta^*$; for uniform continuity, $\delta$ depends only on $\epsilon$. In general, uniform continuity is stronger than continuity; however, the two notions are equivalent on compact sets.

Aside: Uniform Continuity vs. Continuity

Consider $f(x) = 1/x$ for $x \in (0,1)$. This function is continuous at each $x \in (0,1)$; however, it is not uniformly continuous. Suppose this function were uniformly continuous. Then, for any $\epsilon > 0$, we could find a $\delta(\epsilon) > 0$ such that $|1/x_1 - 1/x_2| < \epsilon$ for all $x_1, x_2$ with $|x_1 - x_2| < \delta(\epsilon)$. Given an $\epsilon > 0$, consider the points $x_1$ and $x_2 = x_1 + \delta(\epsilon)/2$. Then
$$\left|\frac{1}{x_1} - \frac{1}{x_2}\right| = \left|\frac{1}{x_1} - \frac{1}{x_1 + \delta(\epsilon)/2}\right| = \frac{\delta(\epsilon)/2}{x_1(x_1 + \delta(\epsilon)/2)}.$$
Take $x_1$ sufficiently small so that $x_1 < \min\left(\frac{\delta(\epsilon)}{2}, \frac{1}{2\epsilon}\right)$. This implies that
$$\frac{\delta(\epsilon)/2}{x_1(x_1 + \delta(\epsilon)/2)} > \frac{\delta(\epsilon)/2}{x_1\,\delta(\epsilon)} = \frac{1}{2x_1} > \epsilon.$$
This is a contradiction. A short numerical check follows.
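
Not from the notes: a tiny numerical illustration of the same point, showing that with a fixed gap $\delta$ the difference $|1/x_1 - 1/x_2|$ is not uniformly bounded as $x_1 \to 0$.

```python
# Minimal sketch: for f(x) = 1/x on (0,1), a fixed delta does not give a
# uniform bound on |f(x1) - f(x2)|; the difference blows up as x1 -> 0.
delta = 0.01
for x1 in (0.5, 0.1, 0.02, 0.005):
    x2 = x1 + delta / 2
    print(x1, round(abs(1 / x1 - 1 / x2), 3))   # grows without bound as x1 -> 0
```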

Uniform continuity also implies that
$$\Delta(x,\delta) = \sup_{\{(\theta_1,\theta_2): \|\theta_1 - \theta_2\| < \delta\}} |\log p(x;\theta_1) - \log p(x;\theta_2)| \to 0 \qquad (1)$$
as $\delta \to 0$. By the assumption of Lemma 8.3, we know that $\Delta(x,\delta) \le 2d(x)$ for all $\delta$. By the dominated convergence theorem, we know that $E_{\theta_0}[\Delta(X,\delta)] \to 0$ as $\delta \to 0$.

Now, consider open balls of radius $\delta$ about each $\theta \in \Theta$, i.e., $B(\theta,\delta) = \{\tilde\theta : \|\tilde\theta - \theta\| < \delta\}$. The union of these open balls contains $\Theta$, i.e., it is an open cover of $\Theta$. Since $\Theta$ is a compact set, there exists a finite subcover, which we denote by $\{B(\theta_j,\delta), j = 1,\ldots,J\}$.

Taking a $\theta \in \Theta$, by the triangle inequality we know that
$$|Q(\theta; X^n) - Q_0(\theta)| \le |Q(\theta; X^n) - Q(\theta_j; X^n)| \qquad (2)$$
$$\qquad\qquad + |Q(\theta_j; X^n) - Q_0(\theta_j)| \qquad (3)$$
$$\qquad\qquad + |Q_0(\theta_j) - Q_0(\theta)| \qquad (4)$$
Choose $\theta_j$ so that $\theta \in B(\theta_j,\delta)$. Since $\|\theta - \theta_j\| < \delta$, we know that (2) is equal to
$$\frac{1}{n}\left|\sum_{i=1}^n \{\log p(X_i;\theta) - \log p(X_i;\theta_j)\}\right|,$$
and this is less than or equal to
$$\frac{1}{n}\sum_{i=1}^n |\log p(X_i;\theta) - \log p(X_i;\theta_j)| \le \frac{1}{n}\sum_{i=1}^n \Delta(X_i,\delta).$$

We also know that (4) is less than or equal to
$$\sup_{\{(\theta_1,\theta_2): \|\theta_1 - \theta_2\| < \delta\}} |Q_0(\theta_1) - Q_0(\theta_2)|.$$
Since $Q_0(\theta)$ is uniformly continuous, this bound can be made arbitrarily small by choosing $\delta$ small; in particular, it can be made less than $\epsilon/3$ for all $\delta < \delta_3$. We also know that (3) is less than or equal to
$$\max_{j=1,\ldots,J} |Q(\theta_j; X^n) - Q_0(\theta_j)|.$$
Putting these results together, we have that for any $\delta < \delta_3$,

$$\sup_{\theta \in \Theta} |Q(\theta; X^n) - Q_0(\theta)| \le \frac{1}{n}\sum_{i=1}^n \Delta(X_i,\delta) + \max_{j=1,\ldots,J} |Q(\theta_j; X^n) - Q_0(\theta_j)| + \epsilon/3.$$

So, for any $\delta < \delta_3$, we know that
$$P_{\theta_0}\left[\sup_{\theta \in \Theta} |Q(\theta; X^n) - Q_0(\theta)| > \epsilon\right] \le P_{\theta_0}\left[\frac{1}{n}\sum_{i=1}^n \Delta(X_i,\delta) + \max_{j=1,\ldots,J} |Q(\theta_j; X^n) - Q_0(\theta_j)| > 2\epsilon/3\right]$$
$$\le P_{\theta_0}\left[\frac{1}{n}\sum_{i=1}^n \Delta(X_i,\delta) > \epsilon/3\right] \qquad (5)$$
$$\quad + P_{\theta_0}\left[\max_{j=1,\ldots,J} |Q(\theta_j; X^n) - Q_0(\theta_j)| > \epsilon/3\right] \qquad (6)$$
Now we show that, by taking $n$ sufficiently large, (5) and (6) can each be made small.

Note that (5) is equal to
$$P_{\theta_0}\left[\frac{1}{n}\sum_{i=1}^n \{\Delta(X_i,\delta) - E_{\theta_0}[\Delta(X,\delta)]\} + E_{\theta_0}[\Delta(X,\delta)] > \epsilon/3\right].$$
We have already demonstrated that $E_{\theta_0}[\Delta(X,\delta)] \to 0$ as $\delta \to 0$. Choose $\delta_1$ small enough that $E_{\theta_0}[\Delta(X,\delta)] < \epsilon/6$ for $\delta < \delta_1$, and take $\delta < \min(\delta_1,\delta_3)$. Then (5) is less than
$$P_{\theta_0}\left[\frac{1}{n}\sum_{i=1}^n \{\Delta(X_i,\delta) - E_{\theta_0}[\Delta(X,\delta)]\} > \epsilon/6\right].$$
By the WLLN, we know that there exists $N_1(\epsilon,\eta)$ such that for all $n > N_1(\epsilon,\eta)$ the above term is less than $\eta/2$. So, (5) is less than $\eta/2$.

For the $\delta$ considered so far, find the finite subcover $\{B(\theta_j,\delta), j = 1,\ldots,J\}$. Now, (6) is equal to
$$P_{\theta_0}\left[\bigcup_{j=1}^J \left\{|Q(\theta_j; X^n) - Q_0(\theta_j)| > \epsilon/3\right\}\right] \le \sum_{j=1}^J P_{\theta_0}\left[|Q(\theta_j; X^n) - Q_0(\theta_j)| > \epsilon/3\right].$$
By the WLLN, we know that for each $\theta_j$ and for any $\eta > 0$ there exists $N_{2j}(\epsilon,\eta)$ such that for all $n > N_{2j}(\epsilon,\eta)$,
$$P_{\theta_0}\left[|Q(\theta_j; X^n) - Q_0(\theta_j)| > \epsilon/3\right] < \eta/(2J).$$
Let $N_2(\epsilon,\eta) = \max_{j=1,\ldots,J} N_{2j}$. Then, for all $n > N_2(\epsilon,\eta)$, we know that
$$\sum_{j=1}^J P_{\theta_0}\left[|Q(\theta_j; X^n) - Q_0(\theta_j)| > \epsilon/3\right] < \eta/2.$$
This implies that (6) is less than $\eta/2$.

Combining the results for (5) and (6), we have demonstrated that there exists an $N(\epsilon,\eta) = \max(N_1(\epsilon,\eta), N_2(\epsilon,\eta))$ such that for all $n > N(\epsilon,\eta)$,
$$P_{\theta_0}\left[\sup_{\theta \in \Theta} |Q(\theta; X^n) - Q_0(\theta)| > \epsilon\right] < \eta.$$
Q.E.D.

Other conditions that may be useful for establishing uniform convergence in probability are given in the lemmas below. Lemma 8.4 may be useful when the data are not independent.

Lemma 8.4: If $\Theta$ is compact, $Q_0(\theta)$ is continuous in $\theta \in \Theta$, $Q(\theta; X^n) \xrightarrow{P} Q_0(\theta)$ for all $\theta \in \Theta$, and there exist an $\alpha > 0$ and a $C(X^n)$ bounded in probability such that for all $\tilde\theta, \theta \in \Theta$,
$$|Q(\tilde\theta; X^n) - Q(\theta; X^n)| \le C(X^n)\,\|\tilde\theta - \theta\|^{\alpha},$$
then
$$\sup_{\theta \in \Theta} |Q(\theta; X^n) - Q_0(\theta)| \xrightarrow{P} 0.$$

Aside: $C(X^n)$ is bounded in probability if for every $\epsilon > 0$ there exist $N(\epsilon)$ and $\eta(\epsilon) > 0$ such that $P_{\theta_0}[|C(X^n)| > \eta(\epsilon)] < \epsilon$ for all $n > N(\epsilon)$.
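
Not from the notes: a sketch checking the Lipschitz condition of Lemma 8.4 for the assumed exponential model on the compact set $\Theta = [0.2, 10]$, where $|Q(\tilde\theta;X^n) - Q(\theta;X^n)| \le (1/0.2 + \bar X)\,|\tilde\theta - \theta|$, so $\alpha = 1$ and $C(X^n) = 5 + \bar X$ is bounded in probability.

```python
# Minimal sketch (assumed exponential model): verify the Hoelder/Lipschitz
# bound |Q(t1; X^n) - Q(t2; X^n)| <= C(X^n) * |t1 - t2| on Theta = [0.2, 10].
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(scale=0.5, size=1000)        # theta0 = 2
xbar = x.mean()
C = 5 + xbar                                     # C(X^n) = 1/0.2 + xbar

t = rng.uniform(0.2, 10.0, size=(1000, 2))       # random parameter pairs
lhs = np.abs(np.log(t[:, 0]) - np.log(t[:, 1]) - (t[:, 0] - t[:, 1]) * xbar)
rhs = C * np.abs(t[:, 0] - t[:, 1])
print("condition holds for all sampled pairs:", bool(np.all(lhs <= rhs)))
```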

Θ is Not Compact

Suppose that we are not willing to assume that $\Theta$ is compact. One way around this is to bound the objective function from above, uniformly in parameters that are far away from the truth. For example, suppose that there is a compact set $D$ such that
$$E_{\theta_0}\left[\sup_{\theta \in \Theta \cap D^C} \log p(X;\theta)\right] < Q_0(\theta_0) = E_{\theta_0}[\log p(X;\theta_0)].$$
By the law of large numbers, we know that, with probability approaching one,
$$\sup_{\theta \in \Theta \cap D^C} Q(\theta; X^n) \le \frac{1}{n}\sum_{i=1}^n \sup_{\theta \in \Theta \cap D^C} \log p(X_i;\theta) < \frac{1}{n}\sum_{i=1}^n \log p(X_i;\theta_0) = Q(\theta_0; X^n).$$

Therefore, with probability approaching one, $\hat\theta(X^n)$ is in the compact set $D$, and we can then apply the previous theory to show consistency.

The following lemma can be used in cases where the objective function is concave.

Lemma 8.5: If there is a function $Q_0(\theta)$ such that
i. $Q_0(\theta)$ is uniquely maximized at $\theta_0$;
ii. $\theta_0$ is an element of the interior of a convex set $\Theta$ (which does not have to be bounded);
iii. $Q(\theta; x^n)$ is concave in $\theta$ for each $x^n$; and
iv. $Q(\theta; X^n) \xrightarrow{P} Q_0(\theta)$ for all $\theta \in \Theta$,
then $\hat\theta(X^n)$ exists with probability approaching one and $\hat\theta(X^n) \xrightarrow{P} \theta_0$. A small illustration follows.
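
Not from the notes: a sketch of Lemma 8.5 with an unbounded parameter space, using an assumed Poisson model in its canonical parameter $\theta = \log(\text{mean})$, for which $\log p(x;\theta) = x\theta - e^{\theta} - \log(x!)$ is concave in $\theta$ on $\Theta = \mathbb{R}$.

```python
# Minimal sketch (assumed Poisson model, canonical parameter): the objective
# Q(theta; X^n) is concave on the unbounded set Theta = R, the MLE is
# theta_hat = log(xbar), and it converges to theta0 as n grows.
import numpy as np

rng = np.random.default_rng(5)
theta0 = 0.7                            # true canonical parameter, mean = exp(0.7)
for n in (20, 200, 2000, 20000):
    x = rng.poisson(lam=np.exp(theta0), size=n)
    print(n, round(np.log(x.mean()), 4))   # approaches theta0 = 0.7
```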