Stats 300b: Theory of Statistics                                Winter 2018
Lecture 3: January 16
Lecturer: Yu Bai/John Duchi        Scribe: Shuangning Li, Theodor Misiakiewicz

Warning: these notes may contain factual errors

Reading: VDV Chapter 5.1-5.6; ELST Chapter 7.1-7.3

Outline of Lecture:
1. Basic consistency and identifiability
2. Asymptotic Normality
   (a) Taylor expansions
   (b) Classical log-likelihood & asymptotic normality
   (c) Fisher Information

Recap of Delta Method

Last lecture, we discussed the Delta Method (aka Taylor expansions). The basic idea was as follows:

Claim 1. If $r_n(T_n - \theta) \overset{d}{\to} T$, and $\varphi : \mathbb{R}^d \to \mathbb{R}^k$ is smooth, then
$$r_n\big(\varphi(T_n) - \varphi(\theta)\big) \overset{d}{\to} \varphi'(\theta)\, T, \qquad \text{if } \varphi'(\theta) \neq 0.$$

Idea of proof:
$$r_n\big(\varphi(T_n) - \varphi(\theta)\big) = r_n\big(\varphi'(\theta)(T_n - \theta) + o(T_n - \theta)\big) = r_n \varphi'(\theta)(T_n - \theta) + o\big(r_n(T_n - \theta)\big) = r_n \varphi'(\theta)(T_n - \theta) + o_P(1) \overset{d}{\to} \varphi'(\theta)\, T.$$

Notation: (from now on) Given a distribution $P$ on $\mathcal{X}$ and a function $f : \mathcal{X} \to \mathbb{R}^d$,
$$P f := \int f \, dP = \int_{\mathcal{X}} f(x) \, dP(x) = \mathbb{E}_P[f(X)].$$

Example 1 (Empirical distributions): Consider observations $x_1, x_2, \ldots, x_n \in \mathcal{X}$. Let the empirical distribution be $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$, where $\delta_{x_i}$ denotes the point mass at $x_i$. For any set $A \subset \mathcal{X}$,
$$P_n(A) = \frac{1}{n} \#\{i \in [n] : x_i \in A\} = P_n\{x \in A\}.$$
Hence for any function $f$, we have $P_n f = \frac{1}{n} \sum_{i=1}^n f(x_i)$.
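As a quick numerical companion to Claim 1 and the $P_n f$ notation, here is a minimal Monte Carlo sketch, assuming Python with numpy; the specific setup (Bernoulli data, the map $\varphi(t) = t(1-t)$, and all parameter values) is our own illustration, not from the lecture:

```python
import numpy as np

# Monte Carlo check of the delta method (Claim 1), with an illustrative
# setup of our own: X_i ~ Bernoulli(p), T_n = sample mean, phi(t) = t(1 - t).
# Then sqrt(n) * (phi(T_n) - phi(p)) should be approximately
# N(0, phi'(p)^2 * p * (1 - p)) with phi'(p) = 1 - 2p.
rng = np.random.default_rng(0)
p, n, reps = 0.3, 1000, 10000

X = rng.binomial(1, p, size=(reps, n))
T_n = X.mean(axis=1)                       # one T_n per replication
Z = np.sqrt(n) * (T_n * (1 - T_n) - p * (1 - p))

sigma2_theory = (1 - 2 * p) ** 2 * p * (1 - p)
print(f"empirical var: {Z.var():.4f}   delta-method var: {sigma2_theory:.4f}")
```

Here $P_n f$ for $f(x) = x$ is exactly `X.mean(axis=1)`, and the two printed variances should agree up to Monte Carlo error.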
Taylor expansions

1. Real-valued functions

For $f : \mathbb{R}^d \to \mathbb{R}$ differentiable at $x \in \mathbb{R}^d$,
$$f(y) = f(x) + \nabla f(x)^T (y - x) + o(\|y - x\|), \quad \text{(Remainder version)}$$
$$f(y) = f(x) + \nabla f(\bar{x})^T (y - x) \text{ for some } \bar{x} \text{ on the segment between } x \text{ and } y. \quad \text{(Mean value version)}$$
If $f$ is twice differentiable,
$$f(y) = f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2}(y - x)^T \nabla^2 f(x)(y - x) + o(\|y - x\|^2), \quad \text{(Remainder version)}$$
$$f(y) = f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2}(y - x)^T \nabla^2 f(\bar{x})(y - x). \quad \text{(Mean value version)}$$

2. Vector-valued functions

Let $f : \mathbb{R}^d \to \mathbb{R}^k$, $f(x) = (f_1(x), f_2(x), \ldots, f_k(x))^T$. Define
$$Df(x) = \begin{bmatrix} \nabla f_1(x)^T \\ \nabla f_2(x)^T \\ \vdots \\ \nabla f_k(x)^T \end{bmatrix} \in \mathbb{R}^{k \times d}$$
to be the Jacobian of $f$. Then,
$$f(y) = f(x) + Df(x)(y - x) + o(\|y - x\|). \quad \text{(Remainder version)}$$
But for the mean value version, we don't necessarily have an $\bar{x}$ such that $f(y) = f(x) + Df(\bar{x})(y - x)$.

Example 2 (Failure of mean value version): Let $f : \mathbb{R} \to \mathbb{R}^k$,
$$f(x) = \begin{bmatrix} x \\ x^2 \\ \vdots \\ x^k \end{bmatrix}, \quad \text{so that} \quad Df(x) = \begin{bmatrix} 1 \\ 2x \\ \vdots \\ k x^{k-1} \end{bmatrix}.$$
Take $x = 0$, $y = 1$; then $f(y) - f(x) = (1, 1, \ldots, 1)^T$. Yet $Df(\bar{x})(y - x) = (1, 2\bar{x}, \ldots, k\bar{x}^{k-1})^T$, and matching the second coordinate forces $\bar{x} = 1/2$ while matching the third forces $\bar{x} = 1/\sqrt{3}$, so for $k \geq 3$ no single $\bar{x}$ can work.
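A short numerical check of Example 2 for $k = 3$ (our own sketch, assuming numpy): scanning candidate mean-value points $\bar{x} \in [0, 1]$, the worst coordinate mismatch between $f(1) - f(0)$ and $Df(\bar{x}) \cdot 1$ stays bounded away from zero.

```python
import numpy as np

# Numerical companion to Example 2 with k = 3: f(x) = (x, x^2, x^3),
# Df(x) = (1, 2x, 3x^2)^T. For x = 0, y = 1, f(y) - f(x) = (1, 1, 1)^T.
# A mean-value point xbar would need 2*xbar = 1 and 3*xbar^2 = 1
# simultaneously, which is impossible, as the grid scan confirms.
xbar = np.linspace(0.0, 1.0, 100001)
gap = np.max(np.abs(np.stack([np.ones_like(xbar), 2 * xbar, 3 * xbar**2])
                    - 1.0), axis=0)       # worst coordinate mismatch at each xbar
print(f"min over xbar of worst-coordinate gap: {gap.min():.4f}")  # about 0.097 > 0
```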
Example 3 (Quantitative continuity guarantees): Recall that the operator norm of $A$ is $\|A\|_{\mathrm{op}} = \sup_{\|u\|_2 = 1} \|Au\|_2$; this implies that $\|Ax\|_2 \leq \|A\|_{\mathrm{op}} \|x\|_2$. For $f : \mathbb{R}^d \to \mathbb{R}^k$ differentiable, assume that $Df$ is $L$-Lipschitz, i.e.
$$\|Df(x) - Df(y)\|_{\mathrm{op}} \leq L \|x - y\|_2.$$
(Roughly, this means that $\|D^2 f(x)\| \leq L$.)

Claim 2. We have $f(y) = f(x) + Df(x)(y - x) + R(y - x)$, where $R$ is a remainder matrix (depending on $x, y$) that satisfies $\|R\|_{\mathrm{op}} \leq \frac{L}{2} \|y - x\|$ and $\|R(y - x)\| \leq \frac{L}{2} \|y - x\|^2$.

Proof   Define $\varphi_i(t) = f_i((1 - t)x + ty)$, $\varphi_i : [0, 1] \to \mathbb{R}$. Note that $\varphi_i(0) = f_i(x)$, $\varphi_i(1) = f_i(y)$, and $\varphi_i'(t) = \nabla f_i((1 - t)x + ty)^T (y - x)$. Then
$$Df\big((1 - t)x + ty\big)(y - x) = \begin{bmatrix} \nabla f_1((1 - t)x + ty)^T \\ \nabla f_2((1 - t)x + ty)^T \\ \vdots \\ \nabla f_k((1 - t)x + ty)^T \end{bmatrix} (y - x) = \begin{bmatrix} \varphi_1'(t) \\ \varphi_2'(t) \\ \vdots \\ \varphi_k'(t) \end{bmatrix}.$$
Since $\varphi_i(1) - \varphi_i(0) = \int_0^1 \varphi_i'(t) \, dt$,
$$f(y) - f(x) = \int_0^1 Df\big((1 - t)x + ty\big)(y - x) \, dt = \int_0^1 \Big( Df\big((1 - t)x + ty\big) - Df(x) \Big)(y - x) \, dt + Df(x)(y - x).$$
To bound the remainder term,
$$\Big\| \int_0^1 \Big( Df\big((1 - t)x + ty\big) - Df(x) \Big)(y - x) \, dt \Big\| \leq \int_0^1 \Big\| \Big( Df\big((1 - t)x + ty\big) - Df(x) \Big)(y - x) \Big\| \, dt \leq \int_0^1 \big\| Df\big((1 - t)x + ty\big) - Df(x) \big\|_{\mathrm{op}} \|y - x\| \, dt \leq \int_0^1 L \|t(y - x)\| \, \|y - x\| \, dt = \int_0^1 L t \|y - x\|^2 \, dt = \frac{L}{2} \|y - x\|^2.$$

Consistency and asymptotic distribution:

Setting:
1. We have some model family $\{P_\theta\}_{\theta \in \Theta}$ of distributions on $\mathcal{X}$, where $\Theta \subset \mathbb{R}^d$. Also, assume all $P_\theta$ have a density with respect to a base measure $\mu$ on $\mathcal{X}$, i.e. $p_\theta = \frac{dP_\theta}{d\mu}$.
2. We consider the log-likelihood of the distribution $\ell_\theta(x) = \log p_\theta(x)$, with
$$\nabla \ell_\theta(x) := \Big[ \frac{\partial}{\partial \theta_j} \log p_\theta(x) \Big]_{j=1}^d \in \mathbb{R}^d, \qquad \nabla^2 \ell_\theta(x) := \Big[ \frac{\partial^2}{\partial \theta_i \partial \theta_j} \log p_\theta(x) \Big]_{i,j=1}^d \in \mathbb{R}^{d \times d}.$$
For simplicity, we will denote $\dot{\ell}_\theta \equiv \nabla \ell_\theta(x)$ and $\ddot{\ell}_\theta \equiv \nabla^2 \ell_\theta(x)$. The gradient of the log-likelihood is often called the score function. We will use this term to refer to $\nabla \ell_\theta(x)$ throughout future lectures.
3. Observe $X_i \overset{iid}{\sim} P_{\theta_0}$, where $\theta_0$ is unknown. Our goal is to estimate $\theta_0$.
4. A standard estimator is to choose $\hat{\theta}_n$ to maximize the likelihood, i.e. the probability of the data:
$$\hat{\theta}_n \in \operatorname*{argmax}_{\theta \in \Theta} P_n \ell_\theta(x).$$

Main questions:
1. Consistency: does $\hat{\theta}_n \to \theta_0$ as $n \to \infty$?
2. Asymptotic distribution: does $r_n(\hat{\theta}_n - \theta_0)$ converge in distribution?
3. Optimality? (in the next lecture)

Consistency:

Definition 1.1 (Identifiability). A model $\{P_\theta\}_{\theta \in \Theta}$ is identifiable if $P_{\theta_1} \neq P_{\theta_2}$ for all $\theta_1, \theta_2 \in \Theta$ with $\theta_1 \neq \theta_2$. Equivalently, $D_{\mathrm{kl}}(P_{\theta_1} \| P_{\theta_2}) > 0$ when $\theta_1 \neq \theta_2$. Recall that
$$D_{\mathrm{kl}}(P_{\theta_1} \| P_{\theta_2}) = \int \log \frac{dP_{\theta_1}}{dP_{\theta_2}} \, dP_{\theta_1}.$$
Note that $P_{\theta_1} \neq P_{\theta_2}$ means that there exists a set $A \subset \mathcal{X}$ such that $P_{\theta_1}(A) \neq P_{\theta_2}(A)$.

Now that we have established what both identifiability and consistency mean, we can prove a basic result regarding the finite-$\Theta$ consistency of the maximum likelihood estimator (MLE).

Proposition 3 (Finite $\Theta$ consistency of MLE). Suppose $\{P_\theta\}_{\theta \in \Theta}$ is identifiable and $\operatorname{card} \Theta < \infty$. Then, if $\hat{\theta}_n := \operatorname*{argmax}_{\theta \in \Theta} P_n \ell_\theta(x)$, we have $\hat{\theta}_n \overset{a.s.}{\to} \theta_0$ when $X_i \overset{iid}{\sim} P_{\theta_0}$.

Proof of Proposition 3, when $x_i \overset{iid}{\sim} P_{\theta_0}$.   By the strong law of large numbers, we know that $P_n \ell_\theta(x) \overset{a.s.}{\to} P_{\theta_0} \ell_\theta(x)$, and
$$P_{\theta_0} \ell_{\theta_0}(x) - P_{\theta_0} \ell_\theta(x) = \mathbb{E}_{\theta_0}\Big[ \log \frac{p_{\theta_0}(x)}{p_\theta(x)} \Big] = D_{\mathrm{kl}}(P_{\theta_0} \| P_\theta).$$
We know that $D_{\mathrm{kl}}(P_{\theta_0} \| P_\theta) > 0$ unless $\theta = \theta_0$. Combining this remark with $P_n \ell_{\theta_0}(x) - P_n \ell_\theta(x) \overset{a.s.}{\to} D_{\mathrm{kl}}(P_{\theta_0} \| P_\theta)$, we deduce that there exists $N(\theta)$ such that for all $n > N(\theta)$, we have $P_n \ell_{\theta_0}(x) - P_n \ell_\theta(x) > 0$ with probability 1. It follows that for $n > \max_{\theta \in \Theta, \theta \neq \theta_0} N(\theta)$, we have $P_n \ell_{\theta_0}(x) > P_n \ell_\theta(x)$ for all $\theta \neq \theta_0$. Therefore $\hat{\theta}_n = \theta_0$, and we conclude that, for sufficiently large $n$ and finite $\Theta$, we have $\hat{\theta}_n = \theta_0$ eventually.

Remark   The above result can fail for $\Theta$ infinite, even if $\Theta$ is countable.

Uniform law: One sufficient condition often used for consistency results is a uniform law, i.e. for iid $x_i \sim P$, we have $\sup_{\theta \in \Theta} |P_n \ell_\theta - P \ell_\theta| \overset{p}{\to} 0$. In this case, if $\sup_\Theta |P_n \ell_\theta - P_{\theta_0} \ell_\theta| \leq \epsilon$, we will have
$$\hat{\theta}_n \in \{\theta : P_{\theta_0} \ell_\theta \geq P_{\theta_0} \ell_{\theta_0} - 2\epsilon\},$$
since $P_{\theta_0} \ell_{\hat{\theta}_n} \geq P_n \ell_{\hat{\theta}_n} - \epsilon \geq P_n \ell_{\theta_0} - \epsilon \geq P_{\theta_0} \ell_{\theta_0} - 2\epsilon$. Thus any $\theta$ with $P_{\theta_0} \ell_\theta < P_{\theta_0} \ell_{\theta_0} - 2\epsilon$ cannot be the MLE, and consistency $\hat{\theta}_n \to \theta_0$ follows.
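To make Proposition 3 concrete, here is a small simulation sketch under assumptions of our own choosing (a three-point normal-location family $\Theta = \{-1, 0, 1\}$; none of these specifics come from the lecture), assuming numpy and scipy: the per-sample log-likelihood gaps $P_n \ell_{\theta_0} - P_n \ell_\theta$ converge to the corresponding KL divergences, and the finite-$\Theta$ MLE locks onto $\theta_0$.

```python
import numpy as np
from scipy.stats import norm

# Simulation in the spirit of Proposition 3 (our own toy instance):
# Theta = {-1, 0, 1} as means of a N(theta, 1) model, true theta0 = 0.
# The gap P_n l_{theta0} - P_n l_theta approaches
# D_kl(P_theta0 || P_theta) = (theta0 - theta)^2 / 2 for unit-variance
# Gaussians, so the MLE over the finite grid equals theta0 eventually.
rng = np.random.default_rng(1)
theta0, Theta = 0.0, np.array([-1.0, 0.0, 1.0])

for n in [10, 100, 10000]:
    x = rng.normal(theta0, 1.0, size=n)
    avg_loglik = np.array([norm.logpdf(x, loc=t).mean() for t in Theta])
    mle = Theta[np.argmax(avg_loglik)]
    gaps = avg_loglik[Theta == theta0] - avg_loglik   # 0 at theta0 itself
    print(f"n={n:6d}  MLE={mle:+.0f}  gaps={np.round(gaps, 3)}  "
          f"KL={np.round((theta0 - Theta)**2 / 2, 3)}")
```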
Now that we have established some basic definitions and results regarding the consistency of estimators, we turn our attention to understanding their asymptotic behavior.

Asymptotic Normality via Taylor Expansions:

Definition 1.2 (Operator norm). $\|A\|_{\mathrm{op}} := \sup_{\|u\|_2 = 1} \|Au\|_2$. Note: $A \in \mathbb{R}^{k \times d}$, $u \in \mathbb{R}^d$, and $\|Ax\|_2 \leq \|A\|_{\mathrm{op}} \|x\|_2$.

Before we do anything, we have to make several assumptions.
1. We have a nice, smooth model, i.e. the Hessian is Lipschitz-continuous. To be rigorous, the following must hold:
$$\|\nabla^2 \ell_{\theta_1}(x) - \nabla^2 \ell_{\theta_2}(x)\|_{\mathrm{op}} \leq M(x) \|\theta_1 - \theta_2\|_2, \qquad \mathbb{E}_{\theta_0}[M^2(x)] < \infty.$$
2. The MLE, $\hat{\theta}_n \in \operatorname*{argmax}_{\theta \in \Theta} P_n \ell_\theta(x)$, is consistent, i.e. $\hat{\theta}_n \overset{p}{\to} \theta_0$ under $P_{\theta_0}$.
3. $\Theta$ is a convex set.

Theorem 4. Let $x_i \overset{iid}{\sim} P_{\theta_0}$, let $\hat{\theta}_n$ be the MLE (i.e. $P_n \dot{\ell}_{\hat{\theta}_n} = 0$), and assume the conditions stated above. Then
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \overset{d}{\to} N\Big(0, \; (P_{\theta_0} \nabla^2 \ell_{\theta_0})^{-1} \, P_{\theta_0}\big[\nabla \ell_{\theta_0} \nabla \ell_{\theta_0}^T\big] \, (P_{\theta_0} \nabla^2 \ell_{\theta_0})^{-1}\Big).$$

Remark   Let us rewrite the asymptotic variance. Given that
$$\nabla^2 \ell_\theta = \frac{\nabla^2 p_\theta}{p_\theta} - \frac{\nabla p_\theta \nabla p_\theta^T}{p_\theta^2},$$
we have
$$\mathbb{E}_\theta\Big[ \frac{\nabla^2 p_\theta}{p_\theta} \Big] = \int \frac{\nabla^2 p_\theta}{p_\theta} \, p_\theta \, d\mu = \int \nabla^2 p_\theta \, d\mu = \nabla^2 \int p_\theta \, d\mu = 0.$$
As a result:
$$-\mathbb{E}_\theta[\nabla^2 \ell_\theta] = \mathbb{E}_\theta\bigg[ \Big( \frac{\nabla p_\theta}{p_\theta} \Big) \Big( \frac{\nabla p_\theta}{p_\theta} \Big)^T \bigg] = \operatorname{Cov}\big(\nabla \ell_\theta(x)\big).$$
We define the Fisher information as
$$I_\theta := \mathbb{E}_\theta\big[\nabla \ell_\theta(x) \nabla \ell_\theta(x)^T\big] = \operatorname{Cov}_\theta(\nabla \ell_\theta),$$
where the final equality holds because $\mathbb{E}_\theta[\nabla \ell_\theta(x)] = 0$ ($\theta$ maximizes $\theta' \mapsto \mathbb{E}_\theta[\ell_{\theta'}(x)]$). To show this, assume that we can swap $\nabla$ and $\mathbb{E}$. Then $\nabla \ell_\theta(x) = \nabla \log p_\theta(x) = \frac{\nabla p_\theta(x)}{p_\theta(x)}$. Using that result, we see that
$$\mathbb{E}_\theta[\nabla \ell_\theta] = \mathbb{E}_\theta\Big[ \frac{\nabla p_\theta}{p_\theta} \Big] = \int \frac{\nabla p_\theta}{p_\theta} \, p_\theta \, d\mu = \int \nabla p_\theta \, d\mu = \nabla \int p_\theta \, d\mu = \nabla(1) = 0.$$
We now have a more compact representation of the asymptotic distribution described in the theorem above:
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \overset{d}{\to} N\big(0, \, I_{\theta_0}^{-1} I_{\theta_0} I_{\theta_0}^{-1}\big) = N\big(0, \, I_{\theta_0}^{-1}\big).$$
Consider $I_\theta = -\nabla^2 \mathbb{E}[\ell_\theta(x)]$. If the magnitude of the second derivative is large, the log-likelihood is sharply curved around the global maximum (making it easy to find). Alternatively, if the magnitude of $\nabla^2 \mathbb{E}[\ell_\theta(x)]$ is small, we do not have sufficient curvature to find the optimal $\theta$.
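Before turning to the proof, here is a hedged numerical check of the identities $\mathbb{E}_\theta[\nabla \ell_\theta] = 0$ and $I_\theta = \mathbb{E}_\theta[\dot{\ell}_\theta^2] = -\mathbb{E}_\theta[\ddot{\ell}_\theta]$ for a Bernoulli($\theta$) model; the model choice and all numbers are our own illustration, assuming numpy. The closed form here is $I_\theta = 1/(\theta(1 - \theta))$.

```python
import numpy as np

# Monte Carlo check of the Fisher information identities for an
# illustrative Bernoulli(theta) model (our own example):
# l_theta(x) = x log(theta) + (1 - x) log(1 - theta), so
#   score:   l'(x)  =  x / theta - (1 - x) / (1 - theta)
#   hessian: l''(x) = -x / theta^2 - (1 - x) / (1 - theta)^2
# and I_theta = E[score^2] = -E[hessian] = 1 / (theta * (1 - theta)).
rng = np.random.default_rng(2)
theta, n = 0.3, 1_000_000

x = rng.binomial(1, theta, size=n)
score = x / theta - (1 - x) / (1 - theta)
hess = -x / theta**2 - (1 - x) / (1 - theta)**2

print(f"E[score]    ~ {score.mean():+.4f}   (theory: 0)")
print(f"E[score^2]  ~ {(score**2).mean():.4f}")
print(f"-E[hessian] ~ {(-hess).mean():.4f}")
print(f"closed form:  1/(theta(1-theta)) = {1 / (theta * (1 - theta)):.4f}")
```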
Proof   Let $r(x) \in \mathbb{R}^{d \times d}$ be the remainder matrix in the Taylor expansion of the gradients of the individual log-likelihood terms around $\theta_0$ guaranteed by Taylor's theorem (which certainly depends on $\hat{\theta}_n - \theta_0$), that is,
$$\nabla \ell_{\hat{\theta}_n}(x) = \nabla \ell_{\theta_0}(x) + \nabla^2 \ell_{\theta_0}(x)(\hat{\theta}_n - \theta_0) + r(x)(\hat{\theta}_n - \theta_0),$$
where by Taylor's theorem $\|r(x)\|_{\mathrm{op}} \leq M(x) \|\hat{\theta}_n - \theta_0\|$. Writing this out using the empirical distribution and that $\hat{\theta}_n = \operatorname*{argmax}_\theta P_n \ell_\theta(X)$, we have
$$P_n \nabla \ell_{\hat{\theta}_n} = 0 = P_n \nabla \ell_{\theta_0} + P_n \nabla^2 \ell_{\theta_0} (\hat{\theta}_n - \theta_0) + P_n r(x)(\hat{\theta}_n - \theta_0). \tag{1}$$
But of course, expanding the term $P_n r(x) \in \mathbb{R}^{d \times d}$, we find that $P_n r(x) = \frac{1}{n} \sum_{i=1}^n r(x_i)$ and
$$\|P_n r\|_{\mathrm{op}} \leq \underbrace{\frac{1}{n} \sum_{i=1}^n M(X_i)}_{\overset{a.s.}{\to} \, \mathbb{E}_{\theta_0}[M(X)]} \; \underbrace{\|\hat{\theta}_n - \theta_0\|}_{= \, o_P(1)} = o_P(1).$$
In particular, revisiting expression (1), we have
$$0 = P_n \nabla \ell_{\theta_0} + \big( P_n \nabla^2 \ell_{\theta_0} + o_P(1) \big)(\hat{\theta}_n - \theta_0) = P_n \nabla \ell_{\theta_0} + \big( P_{\theta_0} \nabla^2 \ell_{\theta_0} + (P_n - P_{\theta_0}) \nabla^2 \ell_{\theta_0} + o_P(1) \big)(\hat{\theta}_n - \theta_0).$$
The strong law of large numbers guarantees that $(P_n - P_{\theta_0}) \nabla^2 \ell_{\theta_0} = o_P(1)$, and multiplying each side by $\sqrt{n}$ yields
$$\sqrt{n} \big( P_{\theta_0} \nabla^2 \ell_{\theta_0} + o_P(1) \big)(\hat{\theta}_n - \theta_0) = -\sqrt{n} \, P_n \nabla \ell_{\theta_0}.$$
Applying Slutsky's theorem gives the result: indeed, $T_n = \sqrt{n} \, P_n \nabla \ell_{\theta_0}$ satisfies $T_n \overset{d}{\to} N(0, I_{\theta_0})$ by the central limit theorem, and noting that $P_{\theta_0} \nabla^2 \ell_{\theta_0} + o_P(1)$ is eventually invertible gives
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \overset{d}{\to} N\Big(0, \; (P_{\theta_0} \nabla^2 \ell_{\theta_0})^{-1} \, I_{\theta_0} \, (P_{\theta_0} \nabla^2 \ell_{\theta_0})^{-1}\Big)$$
as desired.
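Finally, an end-to-end simulation sketch of Theorem 4, again with a model of our own choosing (assuming numpy): for the exponential model $p_\lambda(x) = \lambda e^{-\lambda x}$, the MLE is $\hat{\lambda}_n = 1/\bar{X}_n$ and $I_\lambda = 1/\lambda^2$, so $\sqrt{n}(\hat{\lambda}_n - \lambda_0)$ should be approximately $N(0, \lambda_0^2)$.

```python
import numpy as np

# End-to-end check of Theorem 4 for an illustrative exponential model
# (our own choice): p_lambda(x) = lambda * exp(-lambda * x), MLE
# lambda_hat = 1 / xbar, Fisher information I_lambda = 1 / lambda^2, so
# sqrt(n) * (lambda_hat - lambda0) should be approximately N(0, lambda0^2).
rng = np.random.default_rng(3)
lam0, n, reps = 2.0, 1000, 10000

x = rng.exponential(scale=1 / lam0, size=(reps, n))
lam_hat = 1 / x.mean(axis=1)               # MLE in each replication
Z = np.sqrt(n) * (lam_hat - lam0)

print(f"mean {Z.mean():+.4f} (theory: 0),  var {Z.var():.4f} "
      f"(theory: I^-1 = {lam0**2:.1f})")
# Standardized draws should look N(0, 1): compare a few quantiles.
q = np.quantile(Z / lam0, [0.025, 0.5, 0.975])
print(f"quantiles of Z/lambda0: {np.round(q, 3)}  vs N(0,1): [-1.96, 0, 1.96]")
```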