Lecture 3 January 16


Stats 300B: Theory of Statistics, Winter 2018
Lecture 3, January 16
Lecturer: Yu Bai / John Duchi        Scribes: Shuangning Li, Theodor Misiakiewicz
Warning: these notes may contain factual errors.
Reading: VDV, Chapter 5.1-5.6; ELST, Chapter 7.1-7.3

Outline of the lecture:
1. Basic consistency and identifiability
2. Asymptotic normality
   (a) Taylor expansions
   (b) Classical log-likelihood and asymptotic normality
   (c) Fisher information

Recap of the Delta Method

Last lecture, we discussed the Delta Method (a.k.a. Taylor expansions). The basic idea was as follows.

Claim 1. If $r_n(T_n - \theta) \xrightarrow{d} T$ and $\varphi : \mathbb{R}^d \to \mathbb{R}^k$ is smooth, then
$$r_n\bigl(\varphi(T_n) - \varphi(\theta)\bigr) \xrightarrow{d} \varphi'(\theta)\, T,$$
provided $\varphi'(\theta) \neq 0$.

Idea of proof:
$$r_n\bigl(\varphi(T_n) - \varphi(\theta)\bigr) = r_n\bigl(\varphi'(\theta)(T_n - \theta) + o_P(\|T_n - \theta\|)\bigr) = r_n \varphi'(\theta)(T_n - \theta) + o_P\bigl(r_n \|T_n - \theta\|\bigr) = r_n \varphi'(\theta)(T_n - \theta) + o_P(1) \xrightarrow{d} \varphi'(\theta)\, T.$$

Notation (from now on): given a distribution $P$ on $\mathcal{X}$ and a function $f : \mathcal{X} \to \mathbb{R}^d$,
$$P f := \int f \, dP = \int_{\mathcal{X}} f(x) \, dP(x) = \mathbb{E}_P[f(X)].$$

Example 1 (Empirical distributions): Consider observations $x_1, x_2, \ldots, x_n \in \mathcal{X}$. Let the empirical distribution be $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$. For any set $A \subset \mathcal{X}$, $P_n(A) = \frac{1}{n} \bigl|\{i \in [n] : x_i \in A\}\bigr| = P_n \mathbf{1}\{x \in A\}$. Hence for any function $f$, $P_n f = \frac{1}{n} \sum_{i=1}^n f(x_i)$.
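A quick simulation of Claim 1 (a minimal sketch, not part of the original notes; the Poisson base sample, the map $\varphi(t) = t^2$, the sample sizes, and the seed are illustrative assumptions): with $T_n = \bar{X}_n$ and $r_n = \sqrt{n}$, the claim says $\sqrt{n}(\bar{X}_n^2 - \mu^2) \xrightarrow{d} N\bigl(0, (2\mu)^2 \operatorname{Var}(X_1)\bigr)$.

```python
import numpy as np

mu, reps, n = 3.0, 10000, 400
rng = np.random.default_rng(0)

# Poisson(mu) has mean mu and variance mu, so sqrt(n)(Xbar - mu) -> N(0, mu).
xbar = rng.poisson(mu, size=(reps, n)).mean(axis=1)

# Delta method with phi(t) = t^2, phi'(mu) = 2*mu.
lhs = np.sqrt(n) * (xbar**2 - mu**2)
print("empirical variance:", lhs.var())
print("(2*mu)^2 * mu     =", (2 * mu) ** 2 * mu)
```

The two printed numbers should agree up to Monte Carlo and finite-$n$ error, illustrating how the limit variance is just $\varphi'(\mu)^2$ times the variance of the underlying limit $T$.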

Taylor expansions

1. Real-valued functions. For $f : \mathbb{R}^d \to \mathbb{R}$ differentiable at $x \in \mathbb{R}^d$,
$$f(y) = f(x) + \nabla f(x)^T (y - x) + o(\|y - x\|), \quad \text{(remainder version)}$$
$$f(y) = f(x) + \nabla f(\bar{x})^T (y - x) \ \text{ for some } \bar{x} \text{ on the segment between } x \text{ and } y. \quad \text{(mean value version)}$$
If $f$ is twice differentiable,
$$f(y) = f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2}(y - x)^T \nabla^2 f(x)(y - x) + o(\|y - x\|^2), \quad \text{(remainder version)}$$
$$f(y) = f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2}(y - x)^T \nabla^2 f(\bar{x})(y - x). \quad \text{(mean value version)}$$

2. Vector-valued functions. Let $f : \mathbb{R}^d \to \mathbb{R}^k$, $f(x) = (f_1(x), \ldots, f_k(x))^T$. Define
$$Df(x) = \begin{bmatrix} \nabla f_1(x)^T \\ \nabla f_2(x)^T \\ \vdots \\ \nabla f_k(x)^T \end{bmatrix} \in \mathbb{R}^{k \times d}$$
to be the Jacobian of $f$. Then
$$f(y) = f(x) + Df(x)(y - x) + o(\|y - x\|). \quad \text{(remainder version)}$$
But for the mean value version, we do not necessarily have an $\bar{x}$ such that $f(y) = f(x) + Df(\bar{x})(y - x)$.

Example 2 (Failure of the mean value version): Let $f : \mathbb{R} \to \mathbb{R}^k$, $f(x) = (x, x^2, \ldots, x^k)^T$, so that $Df(x) = (1, 2x, \ldots, k x^{k-1})^T$. Take $x = -1$, $y = 1$; then
$$f(y) - f(x) = (2, 0, 2, 0, \ldots)^T, \quad \text{yet} \quad Df(\bar{x})(y - x) = 2\,(1, 2\bar{x}, 3\bar{x}^2, \ldots, k\bar{x}^{k-1})^T.$$
No single $\bar{x}$ can make these equal: the second coordinate forces $\bar{x} = 0$, while the third forces $\bar{x}^2 = 1/3$.
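A numerical companion to Example 2 (a minimal sketch, not part of the original notes; the choice $k = 3$, the grid search, and its resolution are illustrative assumptions): no intermediate point $\bar{x} \in [-1, 1]$ makes $Df(\bar{x})(y - x)$ equal $f(y) - f(x)$.

```python
import numpy as np

# f(x) = (x, x^2, x^3) and its Jacobian (a column vector, since f: R -> R^3).
def f(x):
    return np.array([x, x**2, x**3])

def Df(x):
    return np.array([1.0, 2.0 * x, 3.0 * x**2])

x, y = -1.0, 1.0
lhs = f(y) - f(x)                      # equals (2, 0, 2)

# Search for an intermediate point xbar with Df(xbar) * (y - x) = f(y) - f(x).
grid = np.linspace(x, y, 100001)
gaps = [np.linalg.norm(Df(xb) * (y - x) - lhs) for xb in grid]
print("smallest mismatch over the grid:", min(gaps))   # stays well above 0
```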

Example 3 (Quantitative continuity guarantees): Recall the operator norm of a matrix $A$, $\|A\|_{\mathrm{op}} = \sup_{\|u\|_2 = 1} \|Au\|_2$, which implies $\|Ax\|_2 \le \|A\|_{\mathrm{op}} \|x\|_2$. For $f : \mathbb{R}^d \to \mathbb{R}^k$ differentiable, assume that $Df$ is $L$-Lipschitz, i.e. $\|Df(x) - Df(y)\|_{\mathrm{op}} \le L \|x - y\|_2$. (Roughly, this means that $\|D^2 f(x)\| \le L$.)

Claim 2. We have $f(y) = f(x) + Df(x)(y - x) + R(y - x)$, where $R$ is a remainder matrix (depending on $x, y$) that satisfies $\|R\|_{\mathrm{op}} \le \frac{L}{2} \|y - x\|$ and $\|R(y - x)\| \le \frac{L}{2} \|y - x\|^2$.

Proof. Define $\varphi_i(t) = f_i((1 - t)x + ty)$, $\varphi_i : [0, 1] \to \mathbb{R}$. Note that $\varphi_i(0) = f_i(x)$, $\varphi_i(1) = f_i(y)$, and $\varphi_i'(t) = \bigl(\nabla f_i((1 - t)x + ty)\bigr)^T (y - x)$. Then
$$Df((1 - t)x + ty)(y - x) = \begin{bmatrix} \nabla f_1((1 - t)x + ty)^T \\ \nabla f_2((1 - t)x + ty)^T \\ \vdots \\ \nabla f_k((1 - t)x + ty)^T \end{bmatrix} (y - x) = \begin{bmatrix} \varphi_1'(t) \\ \varphi_2'(t) \\ \vdots \\ \varphi_k'(t) \end{bmatrix}.$$
Since $\varphi_i(1) - \varphi_i(0) = \int_0^1 \varphi_i'(t)\, dt$,
$$f(y) - f(x) = \int_0^1 Df((1 - t)x + ty)(y - x)\, dt = \int_0^1 \bigl(Df((1 - t)x + ty) - Df(x)\bigr)(y - x)\, dt + Df(x)(y - x).$$
To bound the remainder term,
$$\Bigl\| \int_0^1 \bigl(Df((1 - t)x + ty) - Df(x)\bigr)(y - x)\, dt \Bigr\| \le \int_0^1 \bigl\| \bigl(Df((1 - t)x + ty) - Df(x)\bigr)(y - x) \bigr\|\, dt \le \int_0^1 \|Df((1 - t)x + ty) - Df(x)\|_{\mathrm{op}}\, \|y - x\|\, dt \le \int_0^1 L \|t(y - x)\|\, \|y - x\|\, dt = \int_0^1 L t \|y - x\|^2\, dt = \frac{L}{2}\|y - x\|^2.$$
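A quick check of the bound in Claim 2 (a minimal sketch, not part of the original notes; the map $f$, the Lipschitz constant $L = 3$, and the random seed are illustrative assumptions): the first-order Taylor error should never exceed $\frac{L}{2}\|y - x\|^2$.

```python
import numpy as np

def f(u):
    return np.array([u[0]**2 + np.sin(u[1]), u[0] * u[1]])

def Df(u):
    return np.array([[2 * u[0], np.cos(u[1])],
                     [u[1],     u[0]      ]])

L = 3.0   # a crude Lipschitz constant for Df in operator norm (assumed valid here)
rng = np.random.default_rng(0)
worst = 0.0
for _ in range(10000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    err = np.linalg.norm(f(y) - f(x) - Df(x) @ (y - x))    # ||R(y - x)||
    bound = 0.5 * L * np.linalg.norm(y - x)**2
    worst = max(worst, err / bound)
print("max observed error / bound:", worst)   # should stay <= 1
```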

Consistency and asymptotic distribution

Setting:
1. We have a model family $\{P_\theta\}_{\theta \in \Theta}$ of distributions on $\mathcal{X}$, where $\Theta \subset \mathbb{R}^d$. We assume all $P_\theta$ have density $p_\theta$ with respect to a base measure $\mu$ on $\mathcal{X}$, i.e. $p_\theta = \frac{dP_\theta}{d\mu}$.
2. We consider the log-likelihood $\ell_\theta(x) = \log p_\theta(x)$, with
$$\nabla \ell_\theta(x) := \Bigl[ \frac{\partial}{\partial \theta_j} \log p_\theta(x) \Bigr]_{j=1}^d \in \mathbb{R}^d, \qquad \nabla^2 \ell_\theta(x) := \Bigl[ \frac{\partial^2}{\partial \theta_i \partial \theta_j} \log p_\theta(x) \Bigr]_{i,j=1}^d \in \mathbb{R}^{d \times d}.$$
For simplicity, we will write $\nabla \ell_\theta \equiv \nabla \ell_\theta(x)$ and $\nabla^2 \ell_\theta \equiv \nabla^2 \ell_\theta(x)$. The gradient of the log-likelihood is often called the score function; we will use this term for $\nabla \ell_\theta(x)$ throughout future lectures.
3. We observe $X_i \stackrel{\mathrm{iid}}{\sim} P_{\theta_0}$, where $\theta_0$ is unknown. Our goal is to estimate $\theta_0$.
4. A standard estimator is to choose $\hat{\theta}_n$ to maximize the likelihood, i.e. the probability of the data:
$$\hat{\theta}_n \in \operatorname*{argmax}_{\theta \in \Theta}\; P_n \ell_\theta(x).$$

Main questions:
1. Consistency: does $\hat{\theta}_n \xrightarrow{p} \theta_0$ as $n \to +\infty$?
2. Asymptotic distribution: does $r_n(\hat{\theta}_n - \theta_0)$ converge in distribution?
3. Optimality? (in the next lecture)

Consistency

Definition 1.1 (Identifiability). A model $\{P_\theta\}_{\theta \in \Theta}$ is identifiable if $P_{\theta_1} \neq P_{\theta_2}$ for all $\theta_1, \theta_2 \in \Theta$ with $\theta_1 \neq \theta_2$. Equivalently, $D_{\mathrm{kl}}(P_{\theta_1} \| P_{\theta_2}) > 0$ when $\theta_1 \neq \theta_2$. Recall that
$$D_{\mathrm{kl}}(P_{\theta_1} \| P_{\theta_2}) = \int \log \frac{dP_{\theta_1}}{dP_{\theta_2}}\, dP_{\theta_1}.$$
Note that $P_{\theta_1} \neq P_{\theta_2}$ means that there exists a set $A \subset \mathcal{X}$ such that $P_{\theta_1}(A) \neq P_{\theta_2}(A)$.

Now that we have established what both identifiability and consistency mean, we can prove a basic result on the consistency of the maximum likelihood estimator (MLE) when $\Theta$ is finite.

Proposition 3 (Finite-$\Theta$ consistency of the MLE). Suppose $\{P_\theta\}_{\theta \in \Theta}$ is identifiable and $\operatorname{card} \Theta < \infty$. Then, if $\hat{\theta}_n := \operatorname*{argmax}_{\theta \in \Theta} P_n \ell_\theta(x)$, we have $\hat{\theta}_n \to \theta_0$ when $X_i \stackrel{\mathrm{iid}}{\sim} P_{\theta_0}$.

Proof (of the Proposition, when $x_i \stackrel{\mathrm{iid}}{\sim} P_{\theta_0}$). By the strong law of large numbers, for each fixed $\theta$,
$$P_n \ell_\theta(x) \xrightarrow{a.s.} P_{\theta_0} \ell_\theta(x), \qquad \text{and} \qquad P_{\theta_0} \ell_{\theta_0}(x) - P_{\theta_0} \ell_\theta(x) = \mathbb{E}_{\theta_0}\Bigl[ \log \frac{p_{\theta_0}(x)}{p_\theta(x)} \Bigr] = D_{\mathrm{kl}}(P_{\theta_0} \| P_\theta).$$
We know that $D_{\mathrm{kl}}(P_{\theta_0} \| P_\theta) > 0$ unless $\theta = \theta_0$. Combining this remark with $P_n \ell_{\theta_0}(x) - P_n \ell_\theta(x) \xrightarrow{a.s.} D_{\mathrm{kl}}(P_{\theta_0} \| P_\theta)$, we deduce that there exists $N(\theta)$ such that for all $n > N(\theta)$, we have $P_n \ell_{\theta_0}(x) - P_n \ell_\theta(x) > 0$ with probability 1. It follows that for $n > \max_{\theta \in \Theta,\, \theta \neq \theta_0} N(\theta)$, we have $P_n \ell_{\theta_0}(x) > P_n \ell_\theta(x)$ for all $\theta \neq \theta_0$. Therefore $\hat{\theta}_n = \theta_0$, and we conclude that, for sufficiently large $n$ and finite $\Theta$, we have $\hat{\theta}_n = \theta_0$ eventually.

Remark. The above result can fail for infinite $\Theta$, even if $\Theta$ is countable.

Uniform law: One sufficient condition often used for consistency results is a uniform law, i.e. for iid $x_i \sim P_{\theta_0}$,
$$\sup_{\theta \in \Theta} \bigl| P_n \ell_\theta - P_{\theta_0} \ell_\theta \bigr| \xrightarrow{p} 0.$$
In this case, if $P_{\theta_0} \ell_\theta < P_{\theta_0} \ell_{\theta_0} - 2\epsilon$ and $\sup_{\Theta} |P_n \ell_\theta - P_{\theta_0} \ell_\theta| \le \epsilon$, then $\hat{\theta}_n \neq \theta$. We will have
$$\hat{\theta}_n \in \bigl\{\theta : P_{\theta_0} \ell_\theta \ge P_{\theta_0} \ell_{\theta_0} - 2\epsilon\bigr\}.$$
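A small simulation illustrating Proposition 3 (a minimal sketch, not part of the original notes; the three-point Gaussian-mean family, the sample sizes, and the seed are illustrative assumptions): with a finite parameter set, the MLE equals $\theta_0$ with probability approaching one.

```python
import numpy as np

# Finite family: N(theta, 1) with theta in a finite set; true parameter theta0 = 0.5.
Theta = np.array([-1.0, 0.5, 2.0])
theta0 = 0.5
rng = np.random.default_rng(1)

def mle(x):
    # P_n ell_theta = -(1/2) * mean (x_i - theta)^2 + const; maximize over the finite grid.
    avg_loglik = [-0.5 * np.mean((x - th) ** 2) for th in Theta]
    return Theta[int(np.argmax(avg_loglik))]

for n in [5, 50, 500]:
    hits = np.mean([mle(rng.normal(theta0, 1.0, size=n)) == theta0 for _ in range(2000)])
    print(f"n = {n:4d}: P(theta_hat = theta0) ~ {hits:.3f}")
```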

Now that we have established some basic definitions and results regarding the consistency of estimators, we turn our attention to understanding their asymptotic behavior.

Asymptotic normality via Taylor expansions

Definition 1.2 (Operator norm). $\|A\|_{\mathrm{op}} := \sup_{\|u\|_2 = 1} \|Au\|_2$. Note: for $A \in \mathbb{R}^{k \times d}$ and $u \in \mathbb{R}^d$, we have $\|Ax\|_2 \le \|A\|_{\mathrm{op}} \|x\|_2$.

Before we do anything, we have to make several assumptions.
1. We have a nice, smooth model, i.e. the Hessian of the log-likelihood is Lipschitz continuous. To be rigorous, the following must hold:
$$\|\nabla^2 \ell_{\theta_1}(x) - \nabla^2 \ell_{\theta_2}(x)\|_{\mathrm{op}} \le M(x) \|\theta_1 - \theta_2\|_2, \qquad \mathbb{E}_{\theta_0}[M^2(x)] < \infty.$$
2. The MLE, $\hat{\theta}_n \in \operatorname*{argmax}_{\theta \in \Theta} P_n \ell_\theta(x)$, is consistent, i.e. $\hat{\theta}_n \xrightarrow{p} \theta_0$ under $P_{\theta_0}$.
3. $\Theta$ is a convex set.

Theorem 4. Let $x_i \stackrel{\mathrm{iid}}{\sim} P_{\theta_0}$, let $\hat{\theta}_n$ be the MLE (i.e. $P_n \nabla \ell_{\hat{\theta}_n} = 0$), and assume the conditions stated above. Then
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N\Bigl(0,\; (P_{\theta_0} \nabla^2 \ell_{\theta_0})^{-1}\, P_{\theta_0} \nabla \ell_{\theta_0} \nabla \ell_{\theta_0}^T\, (P_{\theta_0} \nabla^2 \ell_{\theta_0})^{-1}\Bigr).$$

Remark. Let us rewrite the asymptotic variance. Given that
$$\nabla^2 \ell_\theta = \frac{\nabla^2 p_\theta}{p_\theta} - \frac{\nabla p_\theta \nabla p_\theta^T}{p_\theta^2}, \qquad \mathbb{E}_\theta\Bigl[\frac{\nabla^2 p_\theta}{p_\theta}\Bigr] = \int \frac{\nabla^2 p_\theta}{p_\theta}\, p_\theta\, d\mu = \int \nabla^2 p_\theta\, d\mu = \nabla^2 \int p_\theta\, d\mu = 0,$$
as a result,
$$\mathbb{E}_\theta[-\nabla^2 \ell_\theta] = \mathbb{E}_\theta\Bigl[\Bigl(\frac{\nabla p_\theta}{p_\theta}\Bigr)\Bigl(\frac{\nabla p_\theta}{p_\theta}\Bigr)^T\Bigr] = \mathrm{Cov}\bigl(\nabla \ell_\theta(x)\bigr).$$
We define the Fisher information as
$$I_\theta := \mathbb{E}_\theta\bigl[\nabla \ell_\theta(x) \nabla \ell_\theta(x)^T\bigr] = \mathrm{Cov}_\theta\bigl(\nabla \ell_\theta\bigr),$$
where the final equality holds because $\mathbb{E}_\theta[\nabla \ell_\theta(x)] = 0$ ($\theta$ maximizes $\theta' \mapsto \mathbb{E}_\theta[\ell_{\theta'}(x)]$). To show this, assume that we can swap $\nabla$ and $\mathbb{E}$. Then $\nabla \ell_\theta(x) = \nabla \log p_\theta(x) = \frac{\nabla p_\theta(x)}{p_\theta(x)}$. Using that result, we see that
$$\mathbb{E}_\theta[\nabla \ell_\theta] = \mathbb{E}_\theta\Bigl[\frac{\nabla p_\theta}{p_\theta}\Bigr] = \int \frac{\nabla p_\theta}{p_\theta}\, p_\theta\, d\mu = \int \nabla p_\theta\, d\mu = \nabla \int p_\theta\, d\mu = \nabla(1) = 0.$$
We now have a more compact representation of the asymptotic distribution described in the theorem above:
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N\bigl(0,\; I_{\theta_0}^{-1} I_{\theta_0} I_{\theta_0}^{-1}\bigr) = N\bigl(0,\; I_{\theta_0}^{-1}\bigr).$$
Consider $I_\theta = -\nabla^2\, \mathbb{E}[\ell_\theta(x)]$. If the magnitude of the second derivative is large, that implies that the log-likelihood is steep around the global maximum (making it easy to find). Alternatively, if the magnitude of $-\nabla^2 \mathbb{E}[\ell_\theta(x)]$ is small, we do not have sufficient curvature to find the optimal $\theta$.
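As a concrete instance of these identities (a minimal sketch, not part of the original notes; the Bernoulli model, sample size, and seed are illustrative choices): for Bernoulli($\theta$), $\ell_\theta(x) = x \log \theta + (1 - x)\log(1 - \theta)$, so $\nabla \ell_\theta(x) = x/\theta - (1 - x)/(1 - \theta)$ and $I_\theta = \frac{1}{\theta(1 - \theta)}$. The sketch below checks numerically that $\mathbb{E}_\theta[\nabla \ell_\theta] = 0$ and that $\mathrm{Var}_\theta(\nabla \ell_\theta) = -\mathbb{E}_\theta[\nabla^2 \ell_\theta] = I_\theta$.

```python
import numpy as np

theta = 0.3
rng = np.random.default_rng(2)
x = rng.binomial(1, theta, size=2_000_000)

score = x / theta - (1 - x) / (1 - theta)            # d/dtheta log p_theta(x)
hess = -x / theta**2 - (1 - x) / (1 - theta)**2      # d^2/dtheta^2 log p_theta(x)

print("E[score]           ~", score.mean())           # ~ 0
print("Var(score)         ~", score.var())            # ~ 1/(theta(1-theta))
print("-E[hessian]        ~", -hess.mean())           # ~ 1/(theta(1-theta))
print("1/(theta(1-theta)) =", 1 / (theta * (1 - theta)))
```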

Proof (of Theorem 4). Let $r(x) \in \mathbb{R}^{d \times d}$ be the remainder matrix in the Taylor expansion of the gradients of the individual log-likelihood terms around $\theta_0$ guaranteed by Taylor's theorem (which certainly depends on $\hat{\theta}_n - \theta_0$), that is,
$$\nabla \ell_{\hat{\theta}_n}(x) = \nabla \ell_{\theta_0}(x) + \nabla^2 \ell_{\theta_0}(x)(\hat{\theta}_n - \theta_0) + r(x)(\hat{\theta}_n - \theta_0),$$
where by Taylor's theorem $\|r(x)\|_{\mathrm{op}} \le M(x) \|\hat{\theta}_n - \theta_0\|$. Writing this out using the empirical distribution and the fact that $\hat{\theta}_n = \operatorname*{argmax}_\theta P_n \ell_\theta(X)$, we have
$$0 = P_n \nabla \ell_{\hat{\theta}_n} = P_n \nabla \ell_{\theta_0} + P_n \nabla^2 \ell_{\theta_0}(\hat{\theta}_n - \theta_0) + P_n r(x)(\hat{\theta}_n - \theta_0). \tag{1}$$
But of course, expanding the term $P_n r(x) \in \mathbb{R}^{d \times d}$, we find that $P_n r(x) = \frac{1}{n}\sum_{i=1}^n r(x_i)$ and
$$\|P_n r\|_{\mathrm{op}} \le \underbrace{\frac{1}{n}\sum_{i=1}^n M(X_i)}_{\xrightarrow{a.s.}\; \mathbb{E}_{\theta_0}[M(X)]} \;\cdot\; \underbrace{\|\hat{\theta}_n - \theta_0\|}_{= o_P(1)} \;=\; o_P(1).$$
In particular, revisiting expression (1), we have
$$0 = P_n \nabla \ell_{\theta_0} + P_n \nabla^2 \ell_{\theta_0}(\hat{\theta}_n - \theta_0) + o_P(1)(\hat{\theta}_n - \theta_0) = P_n \nabla \ell_{\theta_0} + \bigl(P_{\theta_0} \nabla^2 \ell_{\theta_0} + (P_n - P_{\theta_0}) \nabla^2 \ell_{\theta_0} + o_P(1)\bigr)(\hat{\theta}_n - \theta_0).$$
The strong law of large numbers guarantees that $(P_n - P_{\theta_0}) \nabla^2 \ell_{\theta_0} = o_P(1)$, and multiplying each side by $\sqrt{n}$ yields
$$\sqrt{n}\,\bigl(P_{\theta_0} \nabla^2 \ell_{\theta_0} + o_P(1)\bigr)(\hat{\theta}_n - \theta_0) = -\sqrt{n}\, P_n \nabla \ell_{\theta_0}.$$
Applying Slutsky's theorem gives the result: indeed, $T_n = \sqrt{n}\, P_n \nabla \ell_{\theta_0}$ satisfies $T_n \xrightarrow{d} N(0, I_{\theta_0})$ by the central limit theorem, and noting that $P_{\theta_0} \nabla^2 \ell_{\theta_0} + o_P(1)$ is eventually invertible gives
$$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N\Bigl(0,\; (P_{\theta_0} \nabla^2 \ell_{\theta_0})^{-1}\, I_{\theta_0}\, (P_{\theta_0} \nabla^2 \ell_{\theta_0})^{-1}\Bigr)$$
as desired.
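To close, a simulation sketch of Theorem 4 (minimal and not part of the original notes; the exponential model, sample size, replication count, and seed are illustrative assumptions): for Exponential($\theta$) with density $p_\theta(x) = \theta e^{-\theta x}$, the MLE is $\hat{\theta}_n = 1/\bar{X}_n$ and $I_\theta = 1/\theta^2$, so $\sqrt{n}(\hat{\theta}_n - \theta_0)$ should be approximately $N(0, \theta_0^2)$.

```python
import numpy as np

theta0 = 2.0
n, reps = 1000, 5000
rng = np.random.default_rng(3)

# Exponential(theta) samples have mean 1/theta; the MLE is the reciprocal sample mean.
samples = rng.exponential(scale=1 / theta0, size=(reps, n))
theta_hat = 1.0 / samples.mean(axis=1)

z = np.sqrt(n) * (theta_hat - theta0)
print("empirical variance of sqrt(n)(theta_hat - theta0):", z.var())
print("I_theta^{-1} = theta0^2                           :", theta0**2)
```

The empirical variance should be close to $\theta_0^2 = 4$, matching the compact form $N(0, I_{\theta_0}^{-1})$ derived above.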