LECTURE 7 NOTES

1. Convergence of random variables. Before delving into the large sample properties of the MLE, we review some concepts from large sample theory.

1. Convergence in probability: $x_n \to_p x$ if, for any $\delta > 0$, $P(\|x_n - x\|_2 > \delta) \to 0$.
2. Convergence in distribution: $x_n \to_d x$ if $E[g(x_n)] \to E[g(x)]$ for all bounded, continuous functions $g : \mathcal{X} \to \mathbf{R}$. If $x_n$ and $x$ are real-valued, $x_n \to_d x$ is equivalent to $F_n(t) \to F(t)$ at the continuity points of $F$, where $F_n$ is the CDF of $x_n$ and $F$ is the CDF of $x$.

Convergence in probability is the stronger notion of convergence. In particular, $x_n \to_p x$ implies $x_n \to_d x$. There is a partial converse: if $x_n \to_d c$, where $c$ is a constant, then $x_n \to_p c$.

For convenience, we write $o_P$ and $O_P$ for convergence and boundedness in probability. Recall that for a non-negative sequence of scalars $\{a_n\}$,

1. $a_n \in o(1)$ means $a_n \to 0$ as $n$ grows.
2. $a_n \in o(b_n)$ for a non-negative sequence $\{b_n\}$ means $\frac{a_n}{b_n} \in o(1)$.

Further,

1. $a_n \in O(1)$ means $\{a_n\}$ is bounded; i.e. there is some (large) $M > 0$ such that $a_n \le M$ for all $n$.
2. $a_n \in O(b_n)$ (for a non-negative sequence $\{b_n\}$) means $\frac{a_n}{b_n} \in O(1)$.

There are probabilistic analogues of the preceding statements. Let $x_n$ be a sequence of non-negative real-valued random variables.

1. $x_n \in o_P(1)$ means $x_n \to_p 0$ as $n$ grows.
2. $x_n \in o_P(b_n)$ for a non-negative sequence $\{b_n\}$ means $\frac{x_n}{b_n} \in o_P(1)$.
3. $x_n \in o_P(y_n)$ for a sequence of non-negative random variables $\{y_n\}$ means $\frac{x_n}{y_n} \in o_P(1)$.
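The definition of convergence in probability lends itself to a quick numerical illustration. Below is a minimal Monte Carlo sketch; the Bernoulli(1/2) example, the tolerance `delta`, and the sample sizes are all assumptions chosen for illustration, not taken from the notes:

```python
import numpy as np

# Sketch of convergence in probability: for x_i i.i.d. Bernoulli(1/2),
# the sample mean xbar_n satisfies P(|xbar_n - 1/2| > delta) -> 0 as n grows.
rng = np.random.default_rng(0)
delta = 0.05

def exceed_prob(n, trials=2000):
    # Estimate P(|xbar_n - 1/2| > delta) by simulating many samples of size n.
    xbar = rng.binomial(n, 0.5, size=trials) / n
    return float(np.mean(np.abs(xbar - 0.5) > delta))

probs = [exceed_prob(n) for n in (10, 100, 1000)]
print(probs)  # decreasing toward zero
```

The exceedance probability shrinks as $n$ grows, which is exactly the statement $\bar{x}_n \to_p 1/2$; no claim is made about the rate, which is what the $o_P$/$O_P$ notation below quantifies.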
STAT 201B

Further,

1. $x_n \in O_P(1)$ means $\{x_n\}$ is bounded in probability; i.e. for any $\epsilon > 0$, there is $M > 0$ such that $\sup_n P(x_n > M) \le \epsilon$.
2. $x_n \in O_P(b_n)$ means $\frac{x_n}{b_n} \in O_P(1)$.
3. $x_n \in O_P(y_n)$ means $\frac{x_n}{y_n} \in O_P(1)$.

The aforementioned convergence properties are preserved under transformations.

Theorem 1.1 (Continuous mapping theorem). For any continuous function $g$,
1. $x_n \to_p x$ implies $g(x_n) \to_p g(x)$.
2. $x_n \to_d x$ implies $g(x_n) \to_d g(x)$.

Theorem 1.2 (Slutsky's theorem). Let $x_n \to_d x$ and $y_n \to_d c$, where $c$ is a constant. We have
1. $x_n + y_n \to_d x + c$,
2. $x_n y_n \to_d c x$.
More generally, $g(x_n, y_n) \to_d g(x, c)$ for any continuous function $g$.

Proof sketch. The key step in the proof of Slutsky's theorem shows that $x_n \to_d x$ and $y_n \to_d c$ imply $(x_n, y_n) \to_d (x, c)$. We then appeal to the continuous mapping theorem to obtain the stated conclusion.

We remark that $x_n \to_d x$ and $y_n \to_d y$ (convergence of the marginal distributions) generally does not imply $x_n + y_n \to_d x + y$; convergence of the joint distribution is crucial.

Lemma 1.3 (delta method). Let $a_n(x_n - c) \to_d x$, where $a_n \to \infty$ as $n$ grows. For any differentiable $g$,
$$a_n(g(x_n) - g(c)) \to_d \nabla g(c)^T x.$$

Proof. By Taylor's theorem,
$$g(x_n) - g(c) = \nabla g(c)^T (x_n - c) + o(\|x_n - c\|_2)$$
for any sequence $\{x_n\} \to c$. Since $a_n(x_n - c) \to_d x$, Slutsky's theorem gives $x_n - c \to_d 0$, which implies $x_n - c \to_p 0$. Thus
$$g(x_n) - g(c) = \nabla g(c)^T (x_n - c) + o_P(\|x_n - c\|_2),$$
which implies
$$a_n(g(x_n) - g(c)) = a_n \nabla g(c)^T (x_n - c) + o_P(a_n \|x_n - c\|_2) = a_n \nabla g(c)^T (x_n - c) + o_P(1).$$
Rearranging,
$$a_n(g(x_n) - g(c)) - a_n \nabla g(c)^T (x_n - c) \to_p 0.$$
By the continuous mapping theorem, $a_n \nabla g(c)^T (x_n - c) \to_d \nabla g(c)^T x$, which allows us to conclude
$$a_n(g(x_n) - g(c)) \to_d \nabla g(c)^T x.$$

Corollary 1.4. Let $\sqrt{n}(x_n - \theta) \to_d \mathcal{N}(0, \Sigma)$. We have
$$\sqrt{n}(g(x_n) - g(\theta)) \to_d \mathcal{N}\left(0, \nabla g(\theta)^T \Sigma\, \nabla g(\theta)\right).$$

Corollary 1.4 is by far the most common application of the delta method, to the extent that some sources go as far as to call it the delta method.

2. Consistency. Most asymptotic statements concern a sequence of estimators $\{\hat\theta_n\}$ indexed by sample size $n$. In the usual asymptotic framework, we observe i.i.d. random variables $\{x_i\}$ and consider the sequence of estimators $\hat\theta_n = \hat\theta(\{x_i\}_{i \in [n]})$ that consists of the same estimation procedure performed on growing samples. For example, in the coin tossing investigation, the sequence of MLEs of $p$ consists of the means of growing samples:
$$\hat{p}_n = \frac{1}{n} \sum_{i \in [n]} x_i.$$
In statistical parlance, statements about asymptotic properties of an estimator (instead of a sequence of estimators) are common. Such statements are actually about the underlying estimation procedure, and should be interpreted as such.

Definition 2.1. A sequence of point estimators $\{\hat\theta_n\}$ of $\theta^* \in \Theta$ is consistent if $\hat\theta_n$ converges in probability to $\theta^*$.
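Corollary 1.4 is easy to check by simulation. In the sketch below, the Bernoulli model, the choice $g(p) = p^2$, and all constants are assumptions made for illustration: since $\sqrt{n}(\bar{x}_n - p) \to_d \mathcal{N}(0, p(1-p))$, the delta method predicts $\sqrt{n}(\bar{x}_n^2 - p^2) \to_d \mathcal{N}(0, (2p)^2\, p(1-p))$.

```python
import numpy as np

# Monte Carlo sketch of Corollary 1.4 with g(p) = p**2 (all constants
# are assumed for illustration). The empirical variance of the
# delta-method statistic should approach grad g(p)^2 * p(1-p).
rng = np.random.default_rng(1)
p, n, trials = 0.3, 5000, 20000

xbar = rng.binomial(n, p, size=trials) / n
z = np.sqrt(n) * (xbar**2 - p**2)            # delta-method statistic
asymptotic_var = (2 * p) ** 2 * p * (1 - p)  # grad g(p)^2 * Sigma

print(z.var(), asymptotic_var)  # the two variances should be close
```

The finite-sample mean of the statistic is of order $p(1-p)/\sqrt{n}$, so the normal approximation sharpens as $n$ grows.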
We remark that since convergence in mean square implies convergence in probability, a sufficient condition for consistency is
$$\mathrm{bias}_{\theta^*}\big[\hat\theta_n\big]^2 + \mathrm{var}_{\theta^*}\big[\hat\theta_n\big] = E_{\theta^*}\big[\|\hat\theta_n - \theta^*\|_2^2\big] \to 0.$$
Consistency is the basic notion of correctness for a point estimator. It ensures an estimator converges to its target as the sample size grows. For simple estimators, such as those that are smooth functions of the observations $\{x_i\}_{i\in[n]}$, consistency is often a direct consequence of the law of large numbers (LLN). In the coin tossing investigation, the (weak) LLN implies $\hat{p}_n \to_p p^*$. Further, by the continuous mapping theorem, the estimator $g(\hat\theta_n)$ is a consistent estimator of $g(\theta^*)$ as long as $\hat\theta_n$ is a consistent estimator of $\theta^*$.

Example 2.2. Let $x_i \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(p^*)$. Recall the MLE of $p^*$ is $\bar{x}_n$. By the law of large numbers, $\bar{x}_n \to_p p^*$, which shows that the MLE is consistent.

Example 2.3. Let $x_i \overset{\text{i.i.d.}}{\sim} \mathrm{unif}(0, \theta^*)$. Recall the MLE of $\theta^*$ is $x_{(n)}$. It is possible to show the MLE is consistent by direct calculation. Indeed,
$$P_{\theta^*}\big(\theta^* - x_{(n)} > \epsilon\big) = P_{\theta^*}\Big(\max_{i\in[n]} x_i < \theta^* - \epsilon\Big) = \prod_{i\in[n]} P_{\theta^*}\big(x_i < \theta^* - \epsilon\big) = \Big(\frac{\theta^* - \epsilon}{\theta^*}\Big)^n,$$
which converges to zero as $n$ grows.

We turn our attention to establishing the consistency of the MLE. The MLE is an example of an extremum estimator: it is the maximizer of the log-likelihood $\ell_x(\theta)$. Since the log-likelihood depends on the observations $x$, it is a random function (of $\theta$). The basic idea is to show that
1. the log-likelihood converges uniformly to a population objective,
2. the true parameter $\theta^*$ is the maximizer of the population objective.
The first condition is known as a uniform law of large numbers.

Definition 2.4. A set of (real-valued) functions $\mathcal{F}$ satisfies a uniform law of large numbers (ULLN) if
$$\sup_{f \in \mathcal{F}} \Big|\frac{1}{n}\sum_{i\in[n]} f(x_i) - E[f(x_1)]\Big| \to_p 0.$$
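The direct calculation in Example 2.3 can be verified by simulation; in this sketch the values of $\theta^*$, $\epsilon$, and $n$ are assumptions chosen for illustration:

```python
import numpy as np

# Simulation check of Example 2.3: for x_i i.i.d. unif(0, theta_star),
# the MLE x_(n) = max_i x_i satisfies
#   P(theta_star - x_(n) > eps) = ((theta_star - eps) / theta_star)**n,
# which vanishes as n grows.
rng = np.random.default_rng(2)
theta_star, eps, n, trials = 2.0, 0.1, 50, 20000

mle = rng.uniform(0, theta_star, size=(trials, n)).max(axis=1)
empirical = float(np.mean(theta_star - mle > eps))
exact = ((theta_star - eps) / theta_star) ** n

print(empirical, exact)  # empirical frequency vs. exact probability
```

The empirical frequency matches the closed-form probability up to Monte Carlo error, and both decay geometrically in $n$, which is much faster than the $1/\sqrt{n}$ rate typical of smooth models.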
We remark that as long as $E[|f|]$ is finite, the (weak) LLN ensures
$$\Big|\frac{1}{n}\sum_{i\in[n]} f(x_i) - E[f(x_1)]\Big| \to_p 0$$
for each $f \in \mathcal{F}$. A ULLN is the stronger statement that the convergence is uniform over $\mathcal{F}$.

Why is the stronger statement necessary? Consider the MLE $\hat\theta_n$ of $\theta^*$ in a parametric model. The LLN ensures
$$\frac{1}{n}\sum_{i\in[n]} \ell_{x_i}(\theta) \to_p E_{\theta^*}[\ell_{x_1}(\theta)]$$
for any $\theta \in \Theta$. Unfortunately, pointwise convergence of the log-likelihood is not even enough to ensure convergence of the maximum, much less the maximizer, which is required to conclude the consistency of the MLE.

Theorem 2.5. Let $\{x_i\}$ be i.i.d. random variables whose density is $f_{\theta^*}(x)$ for some $\theta^* \in \Theta$. Assume
1. $f_\theta(x) \overset{\text{a.e.}}{=} f_{\theta^*}(x)$ if and only if $\theta = \theta^*$,
2. $\sup_{\theta\in\Theta}\big|\frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\theta) - E_{\theta^*}[\ell_{x_1}(\theta)]\big| \to_p 0$.
Then $\theta^*$ is the unique maximizer of $E_{\theta^*}[\ell_{x_1}(\theta)]$ on $\Theta$ and
$$E_{\theta^*}[\ell_{x_1}(\theta^*)] - E_{\theta^*}[\ell_{x_1}(\hat\theta_n)] \to_p 0.$$

Proof. The proof consists of two parts:
1. show that $\theta^*$ is the unique maximizer of $E_{\theta^*}[\ell_{x_1}(\theta)]$ on $\Theta$;
2. show that the maximum of $\frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\theta)$ converges to the maximum of $E_{\theta^*}[\ell_{x_1}(\theta)]$.

We begin by showing $E_{\theta^*}[\ell_{x_1}(\theta^*)] - E_{\theta^*}[\ell_{x_1}(\theta)] > 0$ for any $\theta \neq \theta^*$. By the linearity of expectation,
$$E_{\theta^*}[\ell_{x_1}(\theta^*)] - E_{\theta^*}[\ell_{x_1}(\theta)] = E_{\theta^*}[\log f_{\theta^*}(x_1)] - E_{\theta^*}[\log f_\theta(x_1)] = E_{\theta^*}\Big[\log\frac{f_{\theta^*}(x_1)}{f_\theta(x_1)}\Big],$$
which is the Kullback–Leibler divergence between $F_{\theta^*}$ and $F_\theta$. The KL divergence is always non-negative. Indeed, by Jensen's inequality,
$$E_{\theta^*}\Big[\log\frac{f_{\theta^*}(x_1)}{f_\theta(x_1)}\Big] = -E_{\theta^*}\Big[\log\frac{f_\theta(x_1)}{f_{\theta^*}(x_1)}\Big] \ge -\log E_{\theta^*}\Big[\frac{f_\theta(x_1)}{f_{\theta^*}(x_1)}\Big] = -\log\Big(\int_{\mathcal{X}} f_\theta(x_1)\,dx_1\Big) = -\log 1 = 0.$$
Further, since $\log x$ is strictly concave, the inequality is strict unless
$$\frac{f_\theta(x_1)}{f_{\theta^*}(x_1)} \overset{\text{a.s.}}{=} E_{\theta^*}\Big[\frac{f_\theta(x_1)}{f_{\theta^*}(x_1)}\Big],$$
which, by the first assumption, is not possible at any $\theta \neq \theta^*$. Thus $\theta^*$ is the unique maximizer of $E_{\theta^*}[\ell_{x_1}(\theta)]$.

To complete the proof, we show that
$$(2.1)\qquad E_{\theta^*}[\ell_{x_1}(\theta^*)] - E_{\theta^*}[\ell_{x_1}(\hat\theta_n)] \to_p 0.$$
We expand $E_{\theta^*}[\ell_{x_1}(\theta^*)] - E_{\theta^*}[\ell_{x_1}(\hat\theta_n)]$:
$$E_{\theta^*}[\ell_{x_1}(\theta^*)] - E_{\theta^*}[\ell_{x_1}(\hat\theta_n)] = \Big(E_{\theta^*}[\ell_{x_1}(\theta^*)] - \frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\theta^*)\Big) + \underbrace{\Big(\frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\theta^*) - \frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\hat\theta_n)\Big)}_{\le\, 0 \text{ since } \hat\theta_n \text{ maximizes } \sum_{i\in[n]}\ell_{x_i}(\theta)} + \Big(\frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\hat\theta_n) - E_{\theta^*}[\ell_{x_1}(\hat\theta_n)]\Big)$$
$$\le \Big(E_{\theta^*}[\ell_{x_1}(\theta^*)] - \frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\theta^*)\Big) + \Big(\frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\hat\theta_n) - E_{\theta^*}[\ell_{x_1}(\hat\theta_n)]\Big).$$
The first term is $o_P(1)$ by the LLN, and the second term is also $o_P(1)$ by the uniform convergence assumption, which establishes (2.1).

Theorem 2.5 captures the essence of most results on the consistency of the MLE. The key ingredients are identifiability and uniform convergence. The first assumption of Theorem 2.5 is an identifiability assumption: it ensures the maximizer of $E_{\theta^*}[\ell_{x_1}(\theta)]$ is unique by disallowing two parameters $\theta_1, \theta_2$ to correspond to the same density. The second assumption is a ULLN for the set of log-likelihood functions $\{\log f_\theta(x) : \theta \in \Theta\}$. Here is an example of such a ULLN.

Lemma 2.6. Assume
1. $\ell_x(\theta)$ is a continuous function (of $\theta$) for any $x \in \mathcal{X}$,
2. there is an envelope function $b$ such that $E[b(x_1)] < \infty$ and $|\ell_x(\theta)| \le b(x)$ for any $\theta \in \Theta$.
If $\Theta$ is compact, then
$$\sup_{\theta\in\Theta}\Big|\frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\theta) - E[\ell_{x_1}(\theta)]\Big| \to_p 0.$$

We remark that Theorem 2.5 establishes that the excess risk of the MLE,
$$E_{\theta^*}[\ell_{x_1}(\theta^*)] - E_{\theta^*}[\ell_{x_1}(\hat\theta_n)],$$
vanishes. However, sans other assumptions, we cannot conclude $\hat\theta_n \to_p \theta^*$. Most results on the consistency of the MLE state a laundry list of assumptions, most of which are required to establish uniform convergence and to conclude $\hat\theta_n \to_p \theta^*$ from (2.1). For completeness, we state such a result.

Corollary 2.7. Let $\{x_i\}$ be i.i.d. random variables whose density is $f_{\theta^*}(x)$ for some $\theta^*$ in a compact parameter space $\Theta$. If
1. $f_\theta(x) = f_{\theta^*}(x)$ if and only if $\theta = \theta^*$,
2. $\ell_x(\theta)$ is continuous for any $x \in \mathcal{X}$,
3. there is a function $b : \mathcal{X} \to \mathbf{R}$ such that $|\ell_x(\theta)| \le b(x)$ for any $\theta \in \Theta$ and $E_{\theta^*}[b(x_1)] < \infty$,
then the MLE is consistent; i.e. $\arg\max_{\theta\in\Theta}\sum_{i\in[n]}\ell_{x_i}(\theta) \to_p \theta^*$.

Proof. First, we show that $E_{\theta^*}[\ell_{x_1}(\theta)]$ is continuous under the stated assumptions. For any sequence $\{\theta_n\}$ converging to $\theta$, $\ell_x(\theta_n) \to \ell_x(\theta)$ by the continuity of $\ell_x$. Since $\ell_x$ is dominated by $b$ and $E_{\theta^*}[b(x_1)]$ is finite, $E_{\theta^*}[\ell_{x_1}(\theta_n)] \to E_{\theta^*}[\ell_{x_1}(\theta)]$ by the dominated convergence theorem.

To complete the proof, we show that $\hat\theta_n \to_p \theta^*$. For any $\delta > 0$, consider $\sup_{\theta\in\Theta:\|\theta-\theta^*\|_2 \ge \delta} E_{\theta^*}[\ell_{x_1}(\theta)]$. Since $E_{\theta^*}[\ell_{x_1}(\theta)]$ is continuous and $\{\theta\in\Theta : \|\theta-\theta^*\|_2 \ge \delta\}$ is compact, the sup is attained at some $\tilde\theta \in \{\theta\in\Theta : \|\theta-\theta^*\|_2 \ge \delta\}$. By Theorem 2.5, $\theta^*$ is the unique maximizer of $E_{\theta^*}[\ell_{x_1}(\theta)]$ on $\Theta$. Thus
$$\epsilon := E_{\theta^*}[\ell_{x_1}(\theta^*)] - E_{\theta^*}[\ell_{x_1}(\tilde\theta)] > 0.$$
When $\sup_{\theta\in\Theta}\big|\frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\theta) - E_{\theta^*}[\ell_{x_1}(\theta)]\big| < \frac{\epsilon}{2}$,
$$\frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\theta^*) > E_{\theta^*}[\ell_{x_1}(\theta^*)] - \frac{\epsilon}{2} = E_{\theta^*}[\ell_{x_1}(\tilde\theta)] + \frac{\epsilon}{2} \ge \sup_{\theta\in\Theta:\|\theta-\theta^*\|_2 \ge \delta}\frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\theta),$$
which implies $\|\hat\theta_n - \theta^*\|_2 < \delta$. By Lemma 2.6,
$$P_{\theta^*}\Big(\sup_{\theta\in\Theta}\Big|\frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\theta) - E_{\theta^*}[\ell_{x_1}(\theta)]\Big| < \frac{\epsilon}{2}\Big) \to 1,$$
so we conclude $P_{\theta^*}\big(\|\hat\theta_n - \theta^*\|_2 < \delta\big) \to 1$.

The takeaways from the lengthy preceding section on the consistency of the MLE are:
1. as long as $\frac{1}{n}\sum_{i\in[n]}\ell_{x_i}(\theta)$ converges to $E[\ell_{x_1}(\theta)]$ uniformly on $\Theta$, the risk of the MLE converges to the risk of $\theta^*$. Establishing uniform convergence is non-trivial, and we do not dwell on the topic.
2. showing consistency $\hat\theta_n \to_p \theta^*$ requires additional technical conditions on the parametric model.
There are many approaches to establishing consistency; we followed Wald's original approach.

To wrap up, we consider the consequences of an incorrect model; i.e. the parametric model $\mathcal{F} := \{f_\theta : \mathcal{X} \to \mathbf{R} : \theta \in \Theta\}$ does not include the generative distribution of the observations. We know that the MLE converges to the maximizer of $E_F[\log f_\theta(x)]$, or equivalently, the minimizer of
$$E_F[\log f(x)] - E_F[\log f_\theta(x)],$$
where $F$ is the generative distribution and $f(x)$ is its density. We recognize the difference as the KL divergence between $F$ and $F_\theta$:
$$E_F[\log f(x)] - E_F[\log f_\theta(x)] = E_F\Big[\log\frac{f(x)}{f_\theta(x)}\Big].$$
Thus the MLE generally converges to the $\theta \in \Theta$ such that $F_\theta$ is the best approximation of $F$ in the parametric model; that is, $F_\theta$ has the smallest KL divergence to $F$ among the distributions in the parametric model.

Yuekai Sun
Berkeley, California
December 4, 2015