LECTURE 7 NOTES

1. Convergence of random variables. Before delving into the large sample properties of the MLE, we review some concepts from large sample theory.

1. Convergence in probability: $x_n \overset{p}{\to} x$ if, for any $\delta > 0$, $P(\|x_n - x\|_2 > \delta) \to 0$.

2. Convergence in distribution: $x_n \overset{d}{\to} x$ if $E[g(x_n)] \to E[g(x)]$ for all bounded, continuous functions $g : \mathcal{X} \to \mathbf{R}$. If $x_n$ and $x$ are real-valued, $x_n \overset{d}{\to} x$ is equivalent to $F_n(t) \to F(t)$ at the continuity points of $F$, where $F_n$ is the CDF of $x_n$ and $F$ is the CDF of $x$.

Convergence in probability is the stronger notion of convergence. In particular, $x_n \overset{p}{\to} x$ implies $x_n \overset{d}{\to} x$. There is a partial converse: if $x_n \overset{d}{\to} c$, where $c$ is a constant, then $x_n \overset{p}{\to} c$.

For convenience, we write $o_P$ and $O_P$ for convergence and boundedness in probability. Recall that for a non-negative sequence of scalars $\{a_n\}$,

1. $a_n \in o(1)$ means $a_n \to 0$ as $n$ grows.
2. $a_n \in o(b_n)$ for a non-negative sequence $\{b_n\}$ means $\frac{a_n}{b_n} \in o(1)$.

Further,

1. $a_n \in O(1)$ means $\{a_n\}$ is bounded; i.e. there is some (large) $M > 0$ such that $a_n \le M$ for all $n$.
2. $a_n \in O(b_n)$ (for a non-negative sequence $\{b_n\}$) means $\frac{a_n}{b_n} \in O(1)$.

There are probabilistic analogues of the preceding statements. Let $\{x_n\}$ be a sequence of non-negative real-valued random variables.

1. $x_n \in o_P(1)$ means $x_n \overset{p}{\to} 0$ as $n$ grows.
2. $x_n \in o_P(b_n)$ for a non-negative sequence $\{b_n\}$ means $\frac{x_n}{b_n} \in o_P(1)$.
3. $x_n \in o_P(y_n)$ for a sequence of non-negative random variables $\{y_n\}$ means $\frac{x_n}{y_n} \in o_P(1)$.
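As a quick sanity check on convergence in probability, here is a minimal simulation sketch: it Monte Carlo-estimates $P(|\bar{x}_n - p| > \delta)$ for the mean of $n$ Bernoulli draws and watches it vanish as $n$ grows. The model $\mathrm{Ber}(0.3)$, the tolerance $\delta = 0.05$, and all numerical choices are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Monte Carlo estimate of P(|xbar_n - p| > delta) for xbar_n the mean of
# n i.i.d. Bernoulli(p) draws; convergence in probability says this
# probability tends to 0 for every fixed delta > 0.
rng = np.random.default_rng(0)
p, delta, reps = 0.3, 0.05, 10_000

for n in [10, 100, 1_000, 10_000]:
    xbar = rng.binomial(n, p, size=reps) / n    # reps realizations of xbar_n
    prob = np.mean(np.abs(xbar - p) > delta)    # estimate of P(|xbar_n - p| > delta)
    print(f"n = {n:6d}   P(|xbar_n - p| > {delta}) ~ {prob:.4f}")
```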

Further, for boundedness in probability:

1. $x_n \in O_P(1)$ means $\{x_n\}$ is bounded in probability; i.e. for any $\epsilon > 0$, there is $M > 0$ such that $\sup_n P(x_n > M) \le \epsilon$.
2. $x_n \in O_P(b_n)$ means $\frac{x_n}{b_n} \in O_P(1)$.
3. $x_n \in O_P(y_n)$ means $\frac{x_n}{y_n} \in O_P(1)$.

The aforementioned convergence properties are preserved under transformations.

Theorem 1.1 (Continuous mapping theorem). For any continuous function $g$,

1. $x_n \overset{p}{\to} x$ implies $g(x_n) \overset{p}{\to} g(x)$.
2. $x_n \overset{d}{\to} x$ implies $g(x_n) \overset{d}{\to} g(x)$.

Theorem 1.2 (Slutsky's theorem). Let $x_n \overset{d}{\to} x$ and $y_n \overset{d}{\to} c$, where $c$ is a constant. We have

1. $x_n + y_n \overset{d}{\to} x + c$,
2. $x_n y_n \overset{d}{\to} cx$.

More generally, $g(x_n, y_n) \overset{d}{\to} g(x, c)$ for any continuous function $g$.

Proof sketch. The key step in the proof of Slutsky's theorem shows that $x_n \overset{d}{\to} x$ and $y_n \overset{d}{\to} c$ imply $(x_n, y_n) \overset{d}{\to} (x, c)$. We then appeal to the continuous mapping theorem to obtain the stated conclusion.

We remark that $x_n \overset{d}{\to} x$ and $y_n \overset{d}{\to} y$ (convergence of the marginal distributions) generally does not imply $x_n + y_n \overset{d}{\to} x + y$; convergence of the joint distribution is crucial.
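The following sketch illustrates Slutsky's theorem numerically under an assumed setup (not from the notes): for Exponential(1) data, $x_n = \sqrt{n}(\bar{x}_n - 1) \overset{d}{\to} \mathcal{N}(0, 1)$ by the CLT, while $y_n = \bar{x}_n \overset{p}{\to} c = 1$ by the LLN, so $x_n + y_n$ and $x_n y_n$ should look like $\mathcal{N}(1, 1)$ and $\mathcal{N}(0, 1)$, respectively.

```python
import numpy as np

# Slutsky's theorem, numerically: x_n ->d N(0,1), y_n ->p 1, so
# x_n + y_n ->d N(0,1) + 1 and x_n * y_n ->d N(0,1).
rng = np.random.default_rng(1)
n, reps = 1_000, 5_000

xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
x_n = np.sqrt(n) * (xbar - 1.0)   # ->d N(0, 1) by the CLT
y_n = xbar                        # ->p 1 by the LLN

print("x_n + y_n: mean, sd =", np.mean(x_n + y_n), np.std(x_n + y_n))  # ~ 1, 1
print("x_n * y_n: mean, sd =", np.mean(x_n * y_n), np.std(x_n * y_n))  # ~ 0, 1
```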

Lemma 1.3 (delta method). Let $a_n(x_n - c) \overset{d}{\to} x$, where $a_n \to \infty$ as $n$ grows. For any differentiable $g$,
\[ a_n\big(g(x_n) - g(c)\big) \overset{d}{\to} \nabla g(c)^T x. \]

Proof. By Taylor's theorem,
\[ g(x_n) - g(c) = \nabla g(c)^T (x_n - c) + o(\|x_n - c\|_2) \]
for any sequence $\{x_n\} \to c$. Since $a_n(x_n - c) \overset{d}{\to} x$ and $\frac{1}{a_n} \to 0$, Slutsky's theorem gives $x_n - c \overset{d}{\to} 0$, which implies $x_n - c \overset{p}{\to} 0$. Thus
\[ g(x_n) - g(c) = \nabla g(c)^T (x_n - c) + o_P(\|x_n - c\|_2), \]
which implies
\[ a_n\big(g(x_n) - g(c)\big) = a_n \nabla g(c)^T (x_n - c) + o_P\big(a_n \|x_n - c\|_2\big) = a_n \nabla g(c)^T (x_n - c) + o_P(1), \]
where the last step uses $a_n \|x_n - c\|_2 \in O_P(1)$ (a consequence of $a_n(x_n - c) \overset{d}{\to} x$). Rearranging,
\[ a_n\big(g(x_n) - g(c)\big) - a_n \nabla g(c)^T (x_n - c) \overset{p}{\to} 0. \]
By the continuous mapping theorem, $a_n \nabla g(c)^T (x_n - c) \overset{d}{\to} \nabla g(c)^T x$, which (together with Slutsky's theorem) allows us to conclude $a_n(g(x_n) - g(c)) \overset{d}{\to} \nabla g(c)^T x$.

Corollary 1.4. Let $\sqrt{n}(x_n - \theta) \overset{d}{\to} \mathcal{N}(0, \Sigma)$. We have
\[ \sqrt{n}\big(g(x_n) - g(\theta)\big) \overset{d}{\to} \mathcal{N}\big(0, \nabla g(\theta)^T \Sigma\, \nabla g(\theta)\big). \]

Corollary 1.4 is by far the most common application of the delta method, to the extent that some sources go as far as to call it the delta method.
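As a numerical check of Corollary 1.4, the sketch below applies the (assumed, for illustration) map $g(p) = \log\frac{p}{1-p}$ to a Bernoulli sample mean. Since $\sqrt{n}(\bar{x}_n - p) \overset{d}{\to} \mathcal{N}(0, p(1-p))$ and $g'(p) = \frac{1}{p(1-p)}$, the corollary predicts $\sqrt{n}(g(\bar{x}_n) - g(p)) \overset{d}{\to} \mathcal{N}\big(0, \frac{1}{p(1-p)}\big)$.

```python
import numpy as np

# Delta method check: compare the empirical sd of sqrt(n)(g(xbar) - g(p))
# with the sd predicted by Corollary 1.4, namely 1/sqrt(p(1-p)).
rng = np.random.default_rng(2)
p, n, reps = 0.3, 2_000, 10_000

xbar = rng.binomial(n, p, size=reps) / n
g = lambda q: np.log(q / (1 - q))             # log-odds
stat = np.sqrt(n) * (g(xbar) - g(p))

print("empirical sd:", np.std(stat))
print("predicted sd:", 1 / np.sqrt(p * (1 - p)))   # ~ 2.18
```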

2. Consistency. Most asymptotic statements concern a sequence of estimators $\{\hat{\theta}_n\}$ indexed by sample size $n$. In the usual asymptotic framework, we observe i.i.d. random variables $\{x_i\}$ and consider the sequence of estimators $\hat{\theta}_n = \hat{\theta}\big(\{x_i\}_{i \in [n]}\big)$ that consists of the same estimation procedure performed on growing samples. For example, in the coin tossing investigation, the sequence of MLEs of $p$ consists of the means of growing samples:
\[ \hat{p}_n = \frac{1}{n} \sum_{i \in [n]} x_i. \]
In statistical parlance, statements about the asymptotic properties of an estimator (rather than a sequence of estimators) are common. Such statements are actually about the underlying estimation procedure, and should be interpreted as such.

Definition 2.1. A sequence of point estimators $\{\hat{\theta}_n\}$ of $\theta^* \in \Theta$ is consistent if $\hat{\theta}_n$ converges in probability to $\theta^*$.

We remark that since convergence in mean square implies convergence in probability, a sufficient condition for consistency is
\[ \mathrm{bias}_{\theta^*}\big[\hat{\theta}_n\big]^2 + \mathrm{var}_{\theta^*}\big[\hat{\theta}_n\big] = E_{\theta^*}\big[\|\hat{\theta}_n - \theta^*\|_2^2\big] \to 0. \]

Consistency is the basic notion of correctness for a point estimator. It ensures the estimator converges to its target as the sample size grows. For simple estimators, such as those that are smooth functions of the observations $\{x_i\}_{i \in [n]}$, consistency is often a direct consequence of the law of large numbers (LLN). In the coin tossing investigation, the (weak) LLN implies $\hat{p}_n \overset{p}{\to} p$. Further, by the continuous mapping theorem, the estimator $g(\hat{\theta}_n)$ is a consistent estimator of $g(\theta^*)$ as long as $\hat{\theta}_n$ is a consistent estimator of $\theta^*$.

Example 2.2. Let $x_i \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(p^*)$. Recall the MLE of $p^*$ is $\bar{x}_n$. By the law of large numbers, $\bar{x}_n \overset{p}{\to} p^*$, which shows that the MLE is consistent.

Example 2.3. Let $x_i \overset{\text{i.i.d.}}{\sim} \mathrm{unif}(0, \theta^*)$. Recall the MLE of $\theta^*$ is $x_{(n)}$. It is possible to show the MLE is consistent by direct calculation. Indeed, for any $0 < \epsilon < \theta^*$,
\[ P_{\theta^*}\big(\theta^* - x_{(n)} > \epsilon\big) = P_{\theta^*}\big(\max_{i \in [n]} x_i < \theta^* - \epsilon\big) = \prod_{i \in [n]} P_{\theta^*}\big(x_i < \theta^* - \epsilon\big) = \Big(\frac{\theta^* - \epsilon}{\theta^*}\Big)^n, \]
which converges to zero as $n$ grows.
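The direct calculation in Example 2.3 is easy to verify by simulation; the sketch below compares the Monte Carlo estimate of $P_{\theta^*}(\theta^* - x_{(n)} > \epsilon)$ with the exact value $((\theta^* - \epsilon)/\theta^*)^n$. The choices $\theta^* = 2$ and $\epsilon = 0.1$ are illustrative.

```python
import numpy as np

# Consistency of the MLE x_(n) for Unif(0, theta*): the probability that
# x_(n) misses theta* by more than eps decays geometrically in n.
rng = np.random.default_rng(3)
theta_star, eps, reps = 2.0, 0.1, 20_000

for n in [10, 50, 200]:
    x_max = rng.uniform(0, theta_star, size=(reps, n)).max(axis=1)  # reps copies of x_(n)
    mc = np.mean(theta_star - x_max > eps)
    exact = ((theta_star - eps) / theta_star) ** n
    print(f"n = {n:4d}   Monte Carlo: {mc:.4f}   exact: {exact:.4f}")
```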

We turn our attention to establishing the consistency of the MLE. The MLE is an example of an extremum estimator: it is the maximizer of the log-likelihood $\ell_x(\theta)$. Since the log-likelihood depends on the observations $x$, it is a random function (of $\theta$). The basic idea is to show that

1. the log-likelihood converges uniformly to a population objective,
2. the true parameter $\theta^*$ is the maximizer of the population objective.

The first condition is known as a uniform law of large numbers.

Definition 2.4. A set of (real-valued) functions $\mathcal{F}$ satisfies a uniform law of large numbers (ULLN) if
\[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i \in [n]} f(x_i) - E\big[f(x_1)\big] \Big| \overset{p}{\to} 0. \]

We remark that as long as $E[|f(x_1)|]$ is finite, the (weak) LLN ensures
\[ \Big| \frac{1}{n} \sum_{i \in [n]} f(x_i) - E\big[f(x_1)\big] \Big| \overset{p}{\to} 0 \]
for each $f \in \mathcal{F}$. A ULLN is the stronger statement that the convergence is uniform over $\mathcal{F}$.

Why is the stronger statement necessary? Consider the MLE $\hat{\theta}_n$ of $\theta^*$ in a parametric model. The LLN ensures
\[ \frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\theta) \overset{p}{\to} E_{\theta^*}\big[\ell_{x_1}(\theta)\big] \]
for any $\theta \in \Theta$. Unfortunately, pointwise convergence of the log-likelihood is not even enough to ensure the convergence of the maximum, much less the maximizer, which is what is required to conclude the consistency of the MLE.
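The sketch below looks at this uniform convergence for the (assumed, illustrative) Bernoulli log-likelihood on the compact set $\Theta = [0.05, 0.95]$: it tracks $\sup_{\theta \in \Theta} \big| \frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\theta) - E_{p^*}[\ell_{x_1}(\theta)] \big|$, with a finite grid standing in for the supremum.

```python
import numpy as np

# ULLN, numerically: for Bernoulli(p*) data, (1/n) sum_i l_{x_i}(theta)
# equals xbar*log(theta) + (1 - xbar)*log(1 - theta), so the sup deviation
# from the population log-likelihood is easy to evaluate on a grid.
rng = np.random.default_rng(4)
p_star = 0.3
theta = np.linspace(0.05, 0.95, 91)
pop = p_star * np.log(theta) + (1 - p_star) * np.log(1 - theta)

for n in [100, 1_000, 10_000]:
    xbar = rng.binomial(1, p_star, size=n).mean()
    emp = xbar * np.log(theta) + (1 - xbar) * np.log(1 - theta)
    print(f"n = {n:6d}   sup_theta |empirical - population| = "
          f"{np.abs(emp - pop).max():.4f}")
```

In this one-parameter family the sup deviation is controlled by $|\bar{x}_n - p^*|$, so uniformity is inherited from the ordinary LLN; the point of the general theory below is to deliver the same conclusion without such special structure.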

Theorem 2.5. Let $\{x_i\}$ be i.i.d. random variables with density $f_{\theta^*}(x)$ for some $\theta^* \in \Theta$. Assume

1. $f_\theta(x) \overset{\text{a.e.}}{=} f_{\theta^*}(x)$ if and only if $\theta = \theta^*$,
2. $\sup_{\theta \in \Theta} \big| \frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\theta) - E_{\theta^*}[\ell_{x_1}(\theta)] \big| \overset{p}{\to} 0$.

Then $\theta^*$ is the unique maximizer of $E_{\theta^*}[\ell_{x_1}(\theta)]$ on $\Theta$ and
\[ E_{\theta^*}\big[\ell_{x_1}(\theta^*)\big] - E_{\theta^*}\big[\ell_{x_1}(\hat{\theta}_n)\big] \overset{p}{\to} 0. \]

Proof. The proof consists of two parts:

1. show that $\theta^*$ is the unique maximizer of $E_{\theta^*}[\ell_{x_1}(\theta)]$ on $\Theta$;
2. show that the maximum of $\frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\theta)$ converges to the maximum of $E_{\theta^*}[\ell_{x_1}(\theta)]$.

We begin by showing $E_{\theta^*}[\ell_{x_1}(\theta^*)] - E_{\theta^*}[\ell_{x_1}(\theta)] > 0$ for any $\theta \ne \theta^*$. By the linearity of expectation,
\[ E_{\theta^*}\big[\ell_{x_1}(\theta^*)\big] - E_{\theta^*}\big[\ell_{x_1}(\theta)\big] = E_{\theta^*}\big[\log f_{\theta^*}(x_1)\big] - E_{\theta^*}\big[\log f_\theta(x_1)\big] = E_{\theta^*}\Big[\log \frac{f_{\theta^*}(x_1)}{f_\theta(x_1)}\Big], \]
which is the Kullback-Leibler divergence between $F_{\theta^*}$ and $F_\theta$. The KL divergence is always non-negative. Indeed, by Jensen's inequality,
\[ E_{\theta^*}\Big[\log \frac{f_{\theta^*}(x_1)}{f_\theta(x_1)}\Big] = -E_{\theta^*}\Big[\log \frac{f_\theta(x_1)}{f_{\theta^*}(x_1)}\Big] \ge -\log \Big( \int_{\mathcal{X}} f_\theta(x_1)\,dx_1 \Big) = -\log 1 = 0. \]
Further, since $\log x$ is strictly concave, the inequality is strict unless
\[ \frac{f_\theta(x_1)}{f_{\theta^*}(x_1)} \overset{\text{a.s.}}{=} E_{\theta^*}\Big[\frac{f_\theta(x_1)}{f_{\theta^*}(x_1)}\Big], \]
which, by the first assumption, is not possible for any $\theta \ne \theta^*$. Thus $\theta^*$ is the unique maximizer of $E_{\theta^*}[\ell_{x_1}(\theta)]$.

To complete the proof, we show that
\[ E_{\theta^*}\big[\ell_{x_1}(\theta^*)\big] - E_{\theta^*}\big[\ell_{x_1}(\hat{\theta}_n)\big] \overset{p}{\to} 0. \tag{2.1} \]
We expand $E_{\theta^*}[\ell_{x_1}(\theta^*)] - E_{\theta^*}[\ell_{x_1}(\hat{\theta}_n)]$:
\[
\begin{aligned}
E_{\theta^*}\big[\ell_{x_1}(\theta^*)\big] - E_{\theta^*}\big[\ell_{x_1}(\hat{\theta}_n)\big]
&= E_{\theta^*}\big[\ell_{x_1}(\theta^*)\big] - \frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\theta^*) \\
&\quad + \underbrace{\frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\theta^*) - \frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\hat{\theta}_n)}_{\le\, 0 \text{ since } \hat{\theta}_n \text{ maximizes } \sum_{i \in [n]} \ell_{x_i}(\theta)} \\
&\quad + \frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\hat{\theta}_n) - E_{\theta^*}\big[\ell_{x_1}(\hat{\theta}_n)\big] \\
&\le \Big| E_{\theta^*}\big[\ell_{x_1}(\theta^*)\big] - \frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\theta^*) \Big| + \Big| \frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\hat{\theta}_n) - E_{\theta^*}\big[\ell_{x_1}(\hat{\theta}_n)\big] \Big|.
\end{aligned}
\]
The first term is $o_P(1)$ by the LLN, and the second term is also $o_P(1)$ by the uniform convergence assumption (uniformity is needed here because $\hat{\theta}_n$ is random). Since the left-hand side is non-negative ($\theta^*$ maximizes the population objective), this establishes (2.1).

Theorem 2.5 captures the essence of most results on the consistency of the MLE. The key ingredients are identifiability and uniform convergence. The first assumption of Theorem 2.5 is an identifiability assumption. It ensures the maximizer of $E_{\theta^*}[\ell_{x_1}(\theta)]$ is unique by disallowing two parameters $\theta_1, \theta_2$ to correspond to the same density. The second assumption is a ULLN for the set of log-likelihood functions $\{\log f_\theta(x) : \theta \in \Theta\}$.
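The first part of the proof is easy to visualize in the (assumed, illustrative) Bernoulli model: the population log-likelihood $E_{p^*}[\ell_{x_1}(p)] = p^* \log p + (1 - p^*) \log(1 - p)$ is maximized exactly at $p^*$, and its gap from the maximum is the KL divergence $\mathrm{KL}(\mathrm{Ber}(p^*), \mathrm{Ber}(p)) \ge 0$.

```python
import numpy as np

# Population log-likelihood of a Bernoulli(p) model under data from
# Bernoulli(p*): maximized at p = p*, with KL(p*, p) >= 0 measuring the gap.
p_star = 0.3
grid = np.linspace(0.01, 0.99, 99)       # grid includes p* = 0.30

pop_loglik = p_star * np.log(grid) + (1 - p_star) * np.log(1 - grid)
kl = pop_loglik.max() - pop_loglik       # KL(Ber(p*), Ber(p)) on the grid

print("argmax of population log-likelihood:", grid[np.argmax(pop_loglik)])  # 0.30
print("KL nonnegative:", np.all(kl >= 0), "  KL at p*:", kl[np.argmax(pop_loglik)])
```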

Here is an example of such a ULLN.

Lemma 2.6. Assume

1. $\ell_x(\theta)$ is a continuous function (of $\theta$) for any $x \in \mathcal{X}$,
2. there is an envelope function $b$ such that $E[b(x)] < \infty$ and $|\ell_x(\theta)| \le b(x)$ for any $\theta \in \Theta$.

If $\Theta$ is compact, then
\[ \sup_{\theta \in \Theta} \Big| \frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\theta) - E\big[\ell_{x_1}(\theta)\big] \Big| \overset{p}{\to} 0. \]

We remark that Theorem 2.5 establishes that the excess risk of the MLE,
\[ E_{\theta^*}\big[\ell_{x_1}(\theta^*)\big] - E_{\theta^*}\big[\ell_{x_1}(\hat{\theta}_n)\big], \]
vanishes. However, sans other assumptions, we cannot conclude $\hat{\theta}_n \overset{p}{\to} \theta^*$. Most results on the consistency of the MLE state a laundry list of assumptions, most of which are required to establish uniform convergence and to conclude $\hat{\theta}_n \overset{p}{\to} \theta^*$ from (2.1). For completeness, we state such a result.

Corollary 2.7. Let $\{x_i\}$ be i.i.d. random variables with density $f_{\theta^*}(x)$ for some $\theta^*$ in a compact parameter space $\Theta$. If

1. $f_\theta(x) = f_{\theta^*}(x)$ if and only if $\theta = \theta^*$,
2. $\ell_x(\theta)$ is continuous for any $x \in \mathcal{X}$,
3. there is a function $b : \mathcal{X} \to \mathbf{R}$ such that $|\ell_x(\theta)| \le b(x)$ for any $\theta \in \Theta$ and $E_{\theta^*}[b(x_1)] < \infty$,

then the MLE is consistent; i.e. $\arg\max_{\theta \in \Theta} \sum_{i \in [n]} \ell_{x_i}(\theta) \overset{p}{\to} \theta^*$.

Proof. First, we show that $E_{\theta^*}[\ell_{x_1}(\theta)]$ is continuous under the stated assumptions. For any sequence $\{\theta_n\}$ converging to $\theta$, $\ell_x(\theta_n) \to \ell_x(\theta)$ by the continuity of $\ell_x$. Since $\ell_x$ is dominated by $b$ and $E_{\theta^*}[b(x_1)]$ is finite, $E_{\theta^*}[\ell_{x_1}(\theta_n)] \to E_{\theta^*}[\ell_{x_1}(\theta)]$ by the dominated convergence theorem.

To complete the proof, we show that $\hat{\theta}_n \overset{p}{\to} \theta^*$. For any $\delta > 0$, consider $\sup_{\theta \in \Theta : \|\theta - \theta^*\|_2 \ge \delta} E_{\theta^*}[\ell_{x_1}(\theta)]$. Since $E_{\theta^*}[\ell_{x_1}(\theta)]$ is continuous and $\{\theta \in \Theta : \|\theta - \theta^*\|_2 \ge \delta\}$ is compact, the sup is attained at some $\tilde{\theta} \in \{\theta \in \Theta : \|\theta - \theta^*\|_2 \ge \delta\}$. By Theorem 2.5, $\theta^*$ is the unique maximizer of $E_{\theta^*}[\ell_{x_1}(\theta)]$ on $\Theta$. Thus
\[ \epsilon := E_{\theta^*}\big[\ell_{x_1}(\theta^*)\big] - E_{\theta^*}\big[\ell_{x_1}(\tilde{\theta})\big] > 0. \]
When $\sup_{\theta \in \Theta} \big| \frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\theta) - E_{\theta^*}[\ell_{x_1}(\theta)] \big| < \frac{\epsilon}{2}$,
\[ \frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\theta^*) > E_{\theta^*}\big[\ell_{x_1}(\theta^*)\big] - \frac{\epsilon}{2} = E_{\theta^*}\big[\ell_{x_1}(\tilde{\theta})\big] + \frac{\epsilon}{2} \ge \sup_{\theta \in \Theta : \|\theta - \theta^*\|_2 \ge \delta} \frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\theta), \]
which implies $\|\hat{\theta}_n - \theta^*\|_2 < \delta$. By Lemma 2.6,
\[ P_{\theta^*}\Big( \sup_{\theta \in \Theta} \Big| \frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\theta) - E_{\theta^*}\big[\ell_{x_1}(\theta)\big] \Big| < \frac{\epsilon}{2} \Big) \to 1.
\]
We conclude $P_{\theta^*}\big( \|\hat{\theta}_n - \theta^*\|_2 < \delta \big) \to 1$.
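The sketch below exercises the conclusion of Corollary 2.7 in the (assumed) Bernoulli model with compact parameter space $\Theta = [0.05, 0.95]$: the maximizer of the sample log-likelihood over a fine grid approaches $\theta^* = 0.3$ as $n$ grows.

```python
import numpy as np

# Consistency of the MLE as an argmax over a compact set: maximize the
# Bernoulli sample log-likelihood over a grid on [0.05, 0.95] and watch
# the maximizer home in on theta* = 0.3.
rng = np.random.default_rng(5)
theta_star = 0.3
theta = np.linspace(0.05, 0.95, 901)

for n in [50, 500, 5_000, 50_000]:
    s = rng.binomial(n, theta_star)               # number of successes
    loglik = s * np.log(theta) + (n - s) * np.log(1 - theta)
    mle = theta[np.argmax(loglik)]
    print(f"n = {n:6d}   theta_hat = {mle:.3f}   error = {abs(mle - theta_star):.3f}")
```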

The takeaways from the lengthy preceding section on the consistency of the MLE are:

1. as long as $\frac{1}{n} \sum_{i \in [n]} \ell_{x_i}(\theta)$ converges to $E[\ell_{x_1}(\theta)]$ uniformly on $\Theta$, the risk of the MLE converges to the risk of $\theta^*$. Establishing uniform convergence is non-trivial, and we do not dwell on the topic.
2. to show consistency $\hat{\theta}_n \overset{p}{\to} \theta^*$ requires additional technical conditions on the parametric model.

There are many approaches to establishing consistency; we followed Wald's original approach.

To wrap up, we consider the consequences of an incorrect model; i.e. the parametric model $\mathcal{F} := \{f_\theta : \mathcal{X} \to \mathbf{R} : \theta \in \Theta\}$ does not include the generative distribution of the observations. We know that the MLE converges to the maximizer of $E_F[\log f_\theta(x)]$, or equivalently, the minimizer of
\[ E_F\big[\log f(x)\big] - E_F\big[\log f_\theta(x)\big] = E_F\Big[\log \frac{f(x)}{f_\theta(x)}\Big], \]
where $F$ is the generative distribution and $f(x)$ is its density. We recognize the difference as the KL divergence between $F$ and $F_\theta$. Thus the MLE generally converges to a $\theta \in \Theta$ that gives the best approximation of $F$ within the parametric model; that is, $F_\theta$ has the smallest KL divergence to $F$ among the distributions in the parametric model.
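The sketch below illustrates this in an assumed misspecified setting (not from the notes): the data are Geometric on $\{0, 1, 2, \ldots\}$ but we fit a Poisson model. The Poisson MLE is the sample mean, and the KL projection of the geometric onto the Poisson family is $\lambda^\star = E_F[x]$, so the MLE converges to the best-approximating Poisson rather than to any "true" parameter.

```python
import numpy as np

# Misspecified MLE: Geometric data, Poisson model. The Poisson MLE (the
# sample mean) converges to lambda* = E_F[x], the KL projection of F
# onto the Poisson family.
rng = np.random.default_rng(6)
q = 0.4                      # geometric success probability
lam_star = (1 - q) / q       # E_F[x] for a Geometric on {0, 1, 2, ...}

for n in [100, 10_000, 1_000_000]:
    x = rng.geometric(q, size=n) - 1   # shift: numpy's geometric starts at 1
    print(f"n = {n:8d}   Poisson MLE = {x.mean():.4f}   lambda* = {lam_star:.4f}")
```

Yuekai Sun
Berkeley, California
December 4, 2015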