
Point Estimation: properties of estimators

- finite-sample properties (CB 7.3)
- large-sample properties (CB 10.1)

1 FINITE-SAMPLE PROPERTIES

How an estimator performs for a finite number of observations. Estimator: $W$. Parameter: $\theta$.

Criteria for evaluating estimators:

- Bias: does $EW = \theta$?
- Variance of $W$ (you would like an estimator with a smaller variance)

Example: $X_1, \dots, X_n$ i.i.d. $(\mu, \sigma^2)$. Unknown parameters are $\mu$ and $\sigma^2$. Consider:

- $\hat\mu \equiv \frac{1}{n} \sum_i X_i$, estimator of $\mu$
- $\hat\sigma^2 \equiv \frac{1}{n} \sum_i (X_i - \bar X)^2$, estimator of $\sigma^2$.

Bias: $E\hat\mu = \frac{1}{n} \sum_i \mu = \mu$. So unbiased. $\operatorname{Var}\hat\mu = \frac{1}{n^2} \sum_i \sigma^2 = \frac{\sigma^2}{n}$.

$$E\hat\sigma^2 = E\Big(\frac{1}{n}\sum_i (X_i - \bar X)^2\Big) = \frac{1}{n}\sum_i \big[E X_i^2 - 2 E X_i \bar X + E \bar X^2\big] = \frac{1}{n}\cdot n\Big[(\mu^2 + \sigma^2) - 2\Big(\mu^2 + \frac{\sigma^2}{n}\Big) + \Big(\frac{\sigma^2}{n} + \mu^2\Big)\Big] = \frac{n-1}{n}\sigma^2.$$

Hence it is biased.
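As a quick numerical check of this bias calculation, here is a minimal simulation sketch in Python/numpy (the sample size, parameter values, and replication count are illustrative choices, not from the notes):

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps, mu, sigma2 = 20, 100_000, 1.0, 4.0

    # reps samples of size n; any distribution with mean mu, variance sigma2 works
    X = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    sigma2_hat = X.var(axis=1, ddof=0)        # (1/n) sum_i (X_i - Xbar)^2

    # Theory: E sigma2_hat = (n-1)/n * sigma2 = 3.8 here, not 4.0
    print(sigma2_hat.mean(), (n - 1) / n * sigma2)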

To fix this bias, consider the estimator $s^2 \equiv \frac{1}{n-1}\sum_i (X_i - \bar X)^2$; then $E s^2 = \sigma^2$ (unbiased).

The mean-squared error (MSE) of $W$ is $E(W - \theta)^2$. It is a common criterion for comparing estimators. Decompose:
$$\text{MSE}(W) = VW + (EW - \theta)^2 = \text{Variance} + (\text{Bias})^2.$$
Hence, for an unbiased estimator: $\text{MSE}(W) = VW$.

Example: $X_1, \dots, X_n \sim U[0, \theta]$, so $f(x) = 1/\theta$ for $x \in [0, \theta]$.

Consider the estimator $\hat\theta \equiv 2\bar X$. Then $E\hat\theta = 2 \cdot \frac{1}{n} E\sum_i X_i = 2 \cdot \frac{1}{n} \cdot n \cdot \frac{\theta}{2} = \theta$. So unbiased. And
$$\text{MSE}(\hat\theta) = V\hat\theta = \frac{4}{n^2}\sum_i V X_i = \frac{4}{n^2}\cdot n\cdot\frac{\theta^2}{12} = \frac{\theta^2}{3n}.$$

Consider the estimator $\tilde\theta \equiv \max(X_1, \dots, X_n)$. In order to derive moments, start by deriving the CDF:
$$P(\tilde\theta \le z) = P(X_1 \le z, X_2 \le z, \dots, X_n \le z) = \prod_{i=1}^n P(X_i \le z) = \begin{cases} (z/\theta)^n & \text{if } z \le \theta \\ 1 & \text{otherwise.} \end{cases}$$
Therefore $f_{\tilde\theta}(z) = \frac{n z^{n-1}}{\theta^n}$, for $0 \le z \le \theta$.
$$E\tilde\theta = \int_0^\theta z\,\frac{n z^{n-1}}{\theta^n}\,dz = \frac{n}{n+1}\theta, \qquad \text{Bias}(\tilde\theta) = -\frac{\theta}{n+1},$$
$$E\tilde\theta^2 = \frac{n}{\theta^n}\int_0^\theta z^{n+1}\,dz = \frac{n}{n+2}\theta^2.$$
Hence $V\tilde\theta = \frac{n}{n+2}\theta^2 - \big(\frac{n}{n+1}\big)^2\theta^2 = \frac{n\theta^2}{(n+2)(n+1)^2}$. Accordingly,
$$\text{MSE}(\tilde\theta) = \frac{n\theta^2}{(n+2)(n+1)^2} + \Big(\frac{\theta}{n+1}\Big)^2 = \frac{2\theta^2}{(n+2)(n+1)}.$$
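The two MSE formulas can be checked by simulation; a sketch (again Python/numpy, with illustrative values of theta and n):

    import numpy as np

    rng = np.random.default_rng(1)
    theta, n, reps = 2.0, 10, 200_000

    X = rng.uniform(0, theta, size=(reps, n))
    mse_2xbar = ((2 * X.mean(axis=1) - theta) ** 2).mean()
    mse_max = ((X.max(axis=1) - theta) ** 2).mean()

    print(mse_2xbar, theta**2 / (3 * n))                # ~ 0.133
    print(mse_max, 2 * theta**2 / ((n + 2) * (n + 1)))  # ~ 0.061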

Continue the previous example. Redefine $\tilde\theta \equiv \frac{n+1}{n}\max(X_1, \dots, X_n)$. Now both estimators $\hat\theta$ and $\tilde\theta$ are unbiased. Which is better?
$$V\hat\theta = \frac{\theta^2}{3n} = O(1/n), \qquad V\tilde\theta = \Big(\frac{n+1}{n}\Big)^2 V\big(\max(X_1, \dots, X_n)\big) = \frac{\theta^2}{n(n+2)} = O(1/n^2).$$
Hence, for $n$ large enough, $\tilde\theta$ has a smaller variance, and in this sense it is better.
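The different rates $O(1/n)$ and $O(1/n^2)$ are visible numerically; a sketch with illustrative values:

    import numpy as np

    rng = np.random.default_rng(2)
    theta, reps = 2.0, 100_000

    for n in (10, 100):
        X = rng.uniform(0, theta, size=(reps, n))
        v_hat = (2 * X.mean(axis=1)).var()               # theory: theta^2 / (3n)
        v_tilde = ((n + 1) / n * X.max(axis=1)).var()    # theory: theta^2 / (n(n+2))
        print(n, v_hat, theta**2 / (3 * n), v_tilde, theta**2 / (n * (n + 2)))

Multiplying $n$ by 10 shrinks the first variance by roughly 10 and the second by roughly 100.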

Best unbiased estimator: if you choose the best (in terms of MSE) estimator, and restrict yourself to unbiased estimators, then the best estimator is the one with the lowest variance. A best unbiased estimator is also called the uniform minimum variance unbiased estimator (UMVUE). Formally: an estimator $W^*$ is a UMVUE of $\theta$ if it satisfies:

(i) $E_\theta W^* = \theta$, for all $\theta$ (unbiasedness);
(ii) $V_\theta W^* \le V_\theta W$, for all $\theta$, and all other unbiased estimators $W$.

The "uniform" condition is crucial, because it is always possible to find estimators which have zero variance for a specific value of $\theta$.

It is difficult in general to verify that an estimator $W^*$ is UMVUE, since you have to verify condition (ii) of the definition: that $VW^*$ is smaller than the variance of every other unbiased estimator. Luckily, we have an important result for the lowest attainable variance of an estimator.

Theorem 7.3.9 (Cramér-Rao Inequality): Let $X_1, \dots, X_n$ be a sample with joint pdf $f(\mathbf{X}|\theta)$, and let $W(\mathbf{X})$ be any estimator satisfying

(i) $\frac{d}{d\theta} E_\theta W(\mathbf{X}) = \int \frac{\partial}{\partial\theta}\big[W(\mathbf{x}) f(\mathbf{x}|\theta)\big]\,d\mathbf{x}$;  (ii) $V_\theta W(\mathbf{X}) < \infty$.

Then
$$V_\theta W(\mathbf{X}) \ge \frac{\big(\frac{d}{d\theta} E_\theta W(\mathbf{X})\big)^2}{E_\theta\big(\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\big)^2}.$$

The RHS above is called the Cramér-Rao Lower Bound (CRLB).

Proof: CB, pg. 336. In short, we apply the Cauchy-Schwarz inequality $V(S) \ge \operatorname{cov}(S,T)^2 / V(T)$ with $S = W(\mathbf{X})$ and $T = \frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)$.

The choice of $T$ here may seem a bit arbitrary. To get some intuition, consider Cramér's derivation.[1] Start with the following manipulation of the equality $E_\theta W(\mathbf{X}) = \int W(\mathbf{x}) f(\mathbf{x}|\theta)\,d\mathbf{x}$:
$$\frac{d}{d\theta} E_\theta W(\mathbf{X}) = \frac{d}{d\theta}\int W(\mathbf{x}) f(\mathbf{x}|\theta)\,d\mathbf{x} = \int W(\mathbf{x})\frac{\partial}{\partial\theta} f(\mathbf{x}|\theta)\,d\mathbf{x}$$
$$= \int \big(W(\mathbf{x}) - E_\theta W(\mathbf{X})\big)\frac{\partial}{\partial\theta} f(\mathbf{x}|\theta)\,d\mathbf{x} \qquad \Big(\text{note } \int E_\theta W(\mathbf{X})\,\frac{\partial}{\partial\theta} f(\mathbf{x}|\theta)\,d\mathbf{x} = 0\Big)$$
$$= \int \big(W(\mathbf{x}) - E_\theta W(\mathbf{X})\big)\Big(\frac{\partial}{\partial\theta}\log f(\mathbf{x}|\theta)\Big) f(\mathbf{x}|\theta)\,d\mathbf{x}.$$
Applying the Cauchy-Schwarz inequality, we have
$$\Big[\frac{d}{d\theta} E_\theta W(\mathbf{X})\Big]^2 \le \operatorname{Var}_\theta W(\mathbf{X}) \cdot E_\theta\Big[\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\Big]^2$$
or
$$\operatorname{Var}_\theta W(\mathbf{X}) \ge \frac{\big[\frac{d}{d\theta} E_\theta W(\mathbf{X})\big]^2}{E_\theta\big[\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta)\big]^2}.$$

The LHS of condition (i) above is $\frac{d}{d\theta}\int W(\mathbf{x}) f(\mathbf{x}|\theta)\,d\mathbf{x}$, so by Leibniz's rule, this condition rules out cases where the support of $X$ depends on $\theta$.

The crucial step in the derivation of the CR bound is the interchange of differentiation and integration, which implies
$$E_\theta \frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta) = \int \frac{1}{f(\mathbf{x}|\theta)}\frac{\partial f(\mathbf{x}|\theta)}{\partial\theta} f(\mathbf{x}|\theta)\,d\mathbf{x} = \int \frac{\partial f(\mathbf{x}|\theta)}{\partial\theta}\,d\mathbf{x} = \frac{\partial}{\partial\theta}\int f(\mathbf{x}|\theta)\,d\mathbf{x} = \frac{\partial}{\partial\theta} 1 = 0. \tag{1}$$

[1] Cramér, Mathematical Methods of Statistics, p. 475ff.

(skip) The above derivation is noteworthy because $\frac{\partial}{\partial\theta}\log f(\mathbf{X}|\theta) = 0$ is the FOC of the maximum likelihood estimation problem. (Alternatively, as in CB, apply condition (i) of the CR result, using $W(\mathbf{X}) = 1$.) In the i.i.d. case, this becomes the sample average
$$\frac{1}{n}\sum_i \frac{\partial}{\partial\theta}\log f(x_i|\theta) = 0.$$
And by the LLN:
$$\frac{1}{n}\sum_i \frac{\partial}{\partial\theta}\log f(x_i|\theta) \xrightarrow{p} E_{\theta_0}\frac{\partial}{\partial\theta}\log f(x_i|\theta),$$
where $\theta_0$ is the true value of $\theta$. This shows that maximum likelihood estimation of $\theta$ is equivalent to estimation based on the moment condition
$$E_{\theta_0}\frac{\partial}{\partial\theta}\log f(x_i|\theta) = 0,$$
which holds only at the true value $\theta = \theta_0$. (Thus MLE is consistent for the true value $\theta_0$, as we'll see later.) However, note that Eq. (1) holds at all values of $\theta$, not just $\theta_0$: in Eq. (1) the expectation is taken at the same $\theta$ at which the score is evaluated, whereas in the moment condition the expectation is taken at $\theta_0$.

[Think about:] What if the model is misspecified, in the sense that the true density of $X$ is $g(x)$, and for all $\theta \in \Theta$, $f(x|\theta) \ne g(x)$ (that is, there is no value of the parameter $\theta$ such that the postulated model $f$ coincides with the true model $g$)? Does Eq. (1) still hold? What is MLE looking for?

In the i.i.d. case, the CR lower bound can be simplified (Corollary 7.3.10): if $X_1, \dots, X_n$ i.i.d. $f(x|\theta)$, then
$$V_\theta W(\mathbf{X}) \ge \frac{\big(\frac{d}{d\theta} E_\theta W(\mathbf{X})\big)^2}{n E_\theta\big(\frac{\partial}{\partial\theta}\log f(x|\theta)\big)^2}.$$

Up to this point, the Cramér-Rao results are not that operational for us in finding a best estimator, because the estimator $W(\mathbf{X})$ appears on both sides of the inequality. However, for an unbiased estimator, how can you simplify the expression further?

Example: $X_1, \dots, X_n$ i.i.d. $N(\mu, \sigma^2)$. What is the CRLB for an unbiased estimator of $\mu$? Unbiasedness implies the numerator $= 1$.
$$\log f(x|\theta) = -\log\sqrt{2\pi} - \log\sigma - \frac{1}{2}\Big(\frac{x-\mu}{\sigma}\Big)^2$$
$$\frac{\partial}{\partial\mu}\log f(x|\theta) = \Big(\frac{x-\mu}{\sigma}\Big)\frac{1}{\sigma} = \frac{x-\mu}{\sigma^2}$$
$$E_\theta\Big(\frac{\partial}{\partial\mu}\log f(x|\theta)\Big)^2 = E\big((X-\mu)^2/\sigma^4\big) = \frac{1}{\sigma^4}VX = \frac{1}{\sigma^2}.$$
Hence the CRLB $= \frac{1}{n\cdot(1/\sigma^2)} = \frac{\sigma^2}{n}$. This is the variance of the sample mean, so the sample mean is a UMVUE for $\mu$.
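A simulation sketch of both sides of this statement (parameter values illustrative): the per-observation score $(X-\mu)/\sigma^2$ should have second moment $\approx 1/\sigma^2$, and $\operatorname{Var}(\bar X)$ should match the CRLB $\sigma^2/n$.

    import numpy as np

    rng = np.random.default_rng(3)
    mu, sigma2, n, reps = 0.0, 2.0, 50, 100_000

    X = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    score = (X - mu) / sigma2                  # per-observation score for mu
    print((score**2).mean(), 1 / sigma2)       # Fisher information per observation
    print(X.mean(axis=1).var(), sigma2 / n)    # sample mean attains the CRLB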

Sometimes we can simplify the denominator of the CRLB further:

Lemma 7.3.11 (Information inequality): if $f(x|\theta)$ satisfies
$$(*) \qquad \frac{d}{d\theta} E_\theta\Big(\frac{\partial}{\partial\theta}\log f(x|\theta)\Big) = \int \frac{\partial}{\partial\theta}\Big[\Big(\frac{\partial}{\partial\theta}\log f(x|\theta)\Big) f(x|\theta)\Big]\,dx,$$
then
$$E_\theta\Big(\frac{\partial}{\partial\theta}\log f(x|\theta)\Big)^2 = -E_\theta\Big(\frac{\partial^2}{\partial\theta^2}\log f(x|\theta)\Big).$$

Rough proof: For the LHS of (*): using Eq. (1) above, we get that the LHS of (*) $= 0$. For the RHS of (*):
$$\int \frac{\partial}{\partial\theta}\Big[\Big(\frac{\partial\log f}{\partial\theta}\Big) f\Big]\,dx = \int \frac{\partial^2\log f}{\partial\theta^2} f\,dx + \int \Big(\frac{\partial\log f}{\partial\theta}\Big)\frac{\partial f}{\partial\theta}\,dx = E\frac{\partial^2\log f}{\partial\theta^2} + E\Big(\frac{\partial\log f}{\partial\theta}\Big)^2,$$
using $\frac{\partial f}{\partial\theta} = \big(\frac{\partial\log f}{\partial\theta}\big) f$ in the last term. Putting the LHS and RHS together yields the desired result.

The LHS of the above condition (*) is just $\frac{d}{d\theta}\int\big(\frac{\partial}{\partial\theta}\log f(x|\theta)\big) f(x|\theta)\,dx$. As before, the crucial step is the interchange of differentiation and integration.

(skip) Also, the information inequality depends crucially on the equality $E_\theta\frac{\partial}{\partial\theta}\log f(x|\theta) = 0$, which depends on the correct specification of the model. Thus the information inequality can be used as the basis of a specification test. (How?)

Example: for the previous example, consider the CRLB for an unbiased estimator of $\sigma^2$. We can use the information inequality, because condition (*) is satisfied for the normal distribution. Hence:
$$\frac{\partial}{\partial\sigma^2}\log f(x|\theta) = -\frac{1}{2\sigma^2} + \frac{(x-\mu)^2}{2\sigma^4}$$
$$\frac{\partial^2}{\partial(\sigma^2)^2}\log f(x|\theta) = \frac{1}{2\sigma^4} - \frac{(x-\mu)^2}{\sigma^6}$$
$$-E\Big(\frac{\partial^2}{\partial(\sigma^2)^2}\log f(x|\theta)\Big) = -\frac{1}{2\sigma^4} + \frac{E(X-\mu)^2}{\sigma^6} = -\frac{1}{2\sigma^4} + \frac{1}{\sigma^4} = \frac{1}{2\sigma^4}.$$
Hence the CRLB is $\frac{2\sigma^4}{n}$.
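Both sides of the information equality, and hence this bound, can be checked by Monte Carlo (a sketch; the values of mu and sigma2 are illustrative):

    import numpy as np

    rng = np.random.default_rng(4)
    mu, sigma2, reps = 0.0, 2.0, 1_000_000

    x = rng.normal(mu, np.sqrt(sigma2), size=reps)
    score = -1 / (2 * sigma2) + (x - mu) ** 2 / (2 * sigma2**2)
    hess = 1 / (2 * sigma2**2) - (x - mu) ** 2 / sigma2**3

    # Information equality: E[score^2] = -E[hess] = 1/(2 sigma^4)
    print((score**2).mean(), -hess.mean(), 1 / (2 * sigma2**2))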

Example: $X_1, \dots, X_n \sim U[0, \theta]$. Check the conditions of the CRLB for an unbiased estimator $W(\mathbf{X})$ of $\theta$. On the one hand, $\frac{d}{d\theta} E W(\mathbf{X}) = 1$ (because it is unbiased). On the other hand (taking $n = 1$ for concreteness), unbiasedness means $\int_0^\theta W(x)\frac{1}{\theta}\,dx = \theta$, i.e. $\int_0^\theta W(x)\,dx = \theta^2$, so
$$\int W(x)\frac{\partial}{\partial\theta} f(x|\theta)\,dx = \int_0^\theta W(x)\Big(-\frac{1}{\theta^2}\Big)\,dx = -\frac{1}{\theta^2}\cdot\theta^2 = -1 \ne 1 = \frac{d}{d\theta} E W(\mathbf{X}).$$
Hence, condition (i) of the theorem is not satisfied (the support depends on $\theta$).

(skip) But when can the CRLB (if it exists) be attained?

Corollary 7.3.15: $X_1, \dots, X_n$ i.i.d. $f(x|\theta)$, satisfying the conditions of the CR theorem. Let $L(\theta|\mathbf{X}) = \prod_{i=1}^n f(x_i|\theta)$ denote the likelihood function. If the estimator $W(\mathbf{X})$ is unbiased for $\theta$, then $W(\mathbf{X})$ attains the CRLB iff you can write
$$\frac{\partial}{\partial\theta}\log L(\theta|\mathbf{X}) = a(\theta)\big[W(\mathbf{X}) - \theta\big]$$
for some function $a(\theta)$.

Example: $X_1, \dots, X_n$ i.i.d. $N(\mu, \sigma^2)$. Consider estimating $\mu$:
$$\frac{\partial}{\partial\mu}\log L(\theta|\mathbf{X}) = \frac{\partial}{\partial\mu}\sum_{i=1}^n \log f(x_i|\theta) = \frac{\partial}{\partial\mu}\sum_{i=1}^n\Big(-\log\sqrt{2\pi} - \log\sigma - \frac{(x_i-\mu)^2}{2\sigma^2}\Big) = \sum_{i=1}^n \frac{x_i-\mu}{\sigma^2} = \frac{n}{\sigma^2}(\bar X - \mu).$$
Hence, the CRLB can be attained (in fact, we showed earlier that the CRLB is attained by $\bar X$).

Loss function optimality

Let $X \sim f(X|\theta)$. Consider a loss function $L(\theta, W(X))$, taking values in $[0, +\infty)$, which penalizes you when your estimator $W(X)$ is far from the true parameter $\theta$. Note that $L(\theta, W(X))$ is a random variable, since $X$ (and hence $W(X)$) is random.

Consider estimators which minimize expected loss: that is,
$$\min_{W(\cdot)} E_\theta L(\theta, W(X)) \equiv \min_{W(\cdot)} R(\theta, W(\cdot))$$
where $R(\theta, W(\cdot))$ is the risk function. (Note: the risk function is not a random variable, because $X$ has been integrated out.)

Loss function optimality is a more general criterion than minimum MSE. In fact, the MSE is actually the risk function associated with the quadratic loss function $L(\theta, W(X)) = (W(X) - \theta)^2$, because $\text{MSE}(W(X)) = E_\theta(W(X) - \theta)^2$.

Other examples of loss functions:

- Absolute error loss: $|W(X) - \theta|$

- Relative quadratic error loss: $\dfrac{(W(X) - \theta)^2}{|\theta| + 1}$

The exercise of minimizing risk takes a given value of $\theta$ as given, so the minimized risk of an estimator depends on whichever value of $\theta$ you are considering. You might instead be interested in an estimator which does well regardless of the value of $\theta$. (This is analogous to the focus on uniform minimum variance.) For this different problem, you want a notion of risk which does not depend on $\theta$. Two criteria which have been considered are:

- Average risk: $\min_{W(\cdot)} \int R(\theta, W(\cdot))\,h(\theta)\,d\theta$, where $h(\theta)$ is some weighting function across $\theta$. (In a Bayesian interpretation, $h(\theta)$ is a prior density over $\theta$.)

- Minimax criterion: $\min_{W(\cdot)} \max_\theta R(\theta, W(\cdot))$. Here you choose the estimator $W(\cdot)$ to minimize the maximum risk $\max_\theta R(\theta, W(\cdot))$, where $\theta$ is set to the worst value. So the minimax optimizer is the best that can be achieved in a worst-case scenario.

Example: $X_1, \dots, X_n$ i.i.d. $N(\mu, \sigma^2)$. The sample mean $\bar X$ is:

- unbiased
- minimum MSE among unbiased estimators
- UMVUE
- attains the CRLB
- minimizes expected quadratic loss among unbiased estimators
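To see why a $\theta$-free notion of risk is needed, here is a sketch computing the quadratic-loss risk of $\bar X$ and of a (hypothetical) shrinkage estimator $0.5\,\bar X$ over a grid of $\theta$ values (all numbers illustrative); neither estimator dominates the other, so a criterion like average or minimax risk is needed to rank them:

    import numpy as np

    rng = np.random.default_rng(5)
    n, reps = 20, 50_000
    thetas = np.linspace(-2, 2, 9)

    for w, name in ((1.0, "Xbar"), (0.5, "0.5 * Xbar")):
        # risk R(theta, w*Xbar) under quadratic loss; theory: w^2/n + (1-w)^2 theta^2
        risk = [(((w * rng.normal(th, 1.0, size=(reps, n)).mean(axis=1)) - th) ** 2).mean()
                for th in thetas]
        print(name, np.round(risk, 3))

The shrinkage estimator has lower risk near $\theta = 0$ but much higher risk for large $|\theta|$: the minimax criterion picks $\bar X$ here, while average risk with a weighting $h$ concentrated near 0 would favor shrinking.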

2 LARGE-SAMPLE PROPERTIES OF ESTIMATORS

It can be difficult to compute the MSE, risk functions, etc., for some estimators, especially when the estimator does not resemble a sample average. Large-sample properties exploit the LLN and CLT.

Consider data $\{X_1, X_2, \dots\}$ from which we construct a sequence of estimators $\{W_1(X_1), W_2(X_1, X_2), \dots\}$; $W_n$ is a random sequence.

Define: we say that $W_n$ is consistent for a parameter $\theta$ iff the random sequence $W_n$ converges (in some stochastic sense) to $\theta$. Strong consistency obtains when $W_n \xrightarrow{a.s.} \theta$. Weak consistency obtains when $W_n \xrightarrow{p} \theta$.

For estimators like sample means, consistency (either weak or strong) follows easily using a LLN. Consistency can be thought of as the large-sample version of unbiasedness.

Define: an M-estimator of $\theta$ is a maximizer of an objective function $Q_n(\theta)$. Examples:

- MLE: $Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n \log f(x_i|\theta)$
- Least squares: $Q_n(\theta) = -\sum_{i=1}^n [y_i - g(x_i; \theta)]^2$ (the sign makes it a maximization problem). OLS is the special case where $g(x_i; \theta) = \alpha + x_i'\beta$.
- GMM: $Q_n(\theta) = -G_n(\theta)' W_n(\theta) G_n(\theta)$, where
$$G_n(\theta) = \Big[\frac{1}{n}\sum_{i=1}^n m_1(x_i; \theta), \frac{1}{n}\sum_{i=1}^n m_2(x_i; \theta), \dots, \frac{1}{n}\sum_{i=1}^n m_M(x_i; \theta)\Big]',$$
an $M \times 1$ vector of sample moment conditions, and $W_n$ is an $M \times M$ weighting matrix.

Notation: for each $\theta \in \Theta$, let $f_\theta \equiv f(x_1, \dots, x_n, \dots; \theta)$ denote the joint density of the data for the given value of $\theta$. For $\theta_0 \in \Theta$, we denote the limit objective function $Q_0(\theta) = \operatorname{plim}_{n\to\infty, f_{\theta_0}} Q_n(\theta)$ (at each $\theta$).

Consistency of M-estimators

Make the following assumptions:

1. For each $\theta_0 \in \Theta$, the limiting objective function $Q_0(\theta)$ is uniquely maximized at $\theta_0$ ("identification").
2. The parameter space $\Theta$ is a compact subset of $\mathbb{R}^K$.
3. $Q_0(\theta)$ is continuous in $\theta$.
4. $Q_n(\theta)$ converges uniformly in probability to $Q_0(\theta)$; that is:
$$\sup_{\theta\in\Theta} |Q_n(\theta) - Q_0(\theta)| \xrightarrow{p} 0.$$

Theorem (Consistency of M-estimators): Under assumptions 1, 2, 3, 4, $\hat\theta_n \xrightarrow{p} \theta_0$.

Proof: We need to show: for any arbitrarily small neighborhood $N$ containing $\theta_0$, $P(\hat\theta_n \in N) \to 1$. For $n$ large enough, the uniform convergence condition implies that, for all $\epsilon, \delta > 0$,
$$P\Big(\sup_{\theta\in\Theta} |Q_n(\theta) - Q_0(\theta)| < \epsilon/2\Big) > 1 - \delta.$$
The event $\sup_{\theta\in\Theta} |Q_n(\theta) - Q_0(\theta)| < \epsilon/2$ implies
$$|Q_n(\hat\theta_n) - Q_0(\hat\theta_n)| < \epsilon/2 \implies Q_0(\hat\theta_n) > Q_n(\hat\theta_n) - \epsilon/2 \tag{2}$$
and similarly
$$|Q_n(\theta_0) - Q_0(\theta_0)| < \epsilon/2 \implies Q_n(\theta_0) > Q_0(\theta_0) - \epsilon/2. \tag{3}$$
Since $\hat\theta_n = \operatorname{argmax}_\theta Q_n(\theta)$, we have $Q_n(\hat\theta_n) \ge Q_n(\theta_0)$, so Eq. (2) implies
$$Q_0(\hat\theta_n) > Q_n(\theta_0) - \epsilon/2. \tag{4}$$
Hence, combining Eqs. (3) and (4), we have
$$Q_0(\hat\theta_n) > Q_0(\theta_0) - \epsilon. \tag{5}$$
So we have shown that
$$\sup_{\theta\in\Theta} |Q_n(\theta) - Q_0(\theta)| < \epsilon/2 \implies Q_0(\hat\theta_n) > Q_0(\theta_0) - \epsilon$$
$$\implies P\big(Q_0(\hat\theta_n) > Q_0(\theta_0) - \epsilon\big) \ge P\Big(\sup_{\theta\in\Theta} |Q_n(\theta) - Q_0(\theta)| < \epsilon/2\Big) \to 1.$$

Now define $N$ as any open neighborhood in $\mathbb{R}^K$ which contains $\theta_0$, and let $N^c$ be the complement of $N$ in $\mathbb{R}^K$. Then $\Theta \cap N^c$ is compact, so that $\max_{\theta \in \Theta \cap N^c} Q_0(\theta)$ exists. Set $\epsilon = Q_0(\theta_0) - \max_{\theta \in \Theta \cap N^c} Q_0(\theta)$. Then
$$Q_0(\hat\theta_n) > Q_0(\theta_0) - \epsilon \iff Q_0(\hat\theta_n) > \max_{\theta \in \Theta \cap N^c} Q_0(\theta) \implies \hat\theta_n \in N$$
$$\implies P(\hat\theta_n \in N) \ge P\big(Q_0(\hat\theta_n) > Q_0(\theta_0) - \epsilon\big) \to 1.$$
Since the argument above holds for any arbitrarily small neighborhood of $\theta_0$, we are done.

In general, the limit objective function $Q_0(\theta) = \operatorname{plim} Q_n(\theta)$ may not be that straightforward to determine. But in many cases, $Q_n(\theta)$ is a sample average of some sort: $Q_n(\theta) = \frac{1}{n}\sum_i q(x_i|\theta)$ (e.g., least squares, MLE). Then, by a law of large numbers, we conclude that (for all $\theta$)
$$Q_0(\theta) = \operatorname{plim} \frac{1}{n}\sum_i q(x_i|\theta) = E_{x_i} q(x_i|\theta)$$
where $E_{x_i}$ denotes expectation with respect to the true (but unobserved) distribution of $x_i$.

(skip) Most of the time, $\theta_0$ can be interpreted as a true value. But if the model is misspecified, then this interpretation doesn't hold (indeed, under misspecification it is not even clear what the "true value" is). So a more cautious way to interpret the consistency result is that $\hat\theta_n \xrightarrow{p} \operatorname{argmax}_\theta Q_0(\theta)$, which holds (given the conditions) no matter whether the model is correctly specified.

(skip) Consider the (abstract) mapping from parameter values $\theta$ to joint densities of the data $f_\theta = f(x_1, \dots, x_n, \dots; \theta)$. The identification condition posits: for each $\theta_1$, the limit objective function $Q_1(\theta) \equiv \operatorname{plim}_{f_{\theta_1}} Q_n(\theta)$ is uniquely optimized at $\theta_1$. This implies that the mapping $\theta \to f_\theta$ is one-to-one. Assume not: that is, there are two values $\theta_1 \ne \theta_2$ such that $f_{\theta_1} = f_{\theta_2}$ (a.s.).[2] Then $\theta_1 = \operatorname{argmax}_\theta Q_1(\theta) = \operatorname{argmax}_\theta Q_2(\theta) = \theta_2$, a contradiction.

[2] In other words, $\theta_1$ and $\theta_2$ are observationally equivalent.
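A concrete sketch of this consistency result for a simple M-estimator: the MLE of the rate $\theta$ of an Exponential($\theta$) model, where $Q_n(\theta) = \frac{1}{n}\sum_i(\log\theta - \theta x_i)$, $Q_0(\theta) = \log\theta - \theta/\theta_0$, and the unique maximizers are $\hat\theta_n = 1/\bar x$ and $\theta_0$ respectively (the value $\theta_0 = 1.5$ is an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(6)
    theta0 = 1.5    # true rate (illustrative)

    for n in (100, 10_000, 1_000_000):
        x = rng.exponential(scale=1 / theta0, size=n)
        theta_hat = 1 / x.mean()    # argmax of Q_n(theta) = (1/n) sum (log theta - theta x_i)
        print(n, theta_hat)         # approaches theta0 as n grows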

** Let's unpack the uniform convergence condition. Sufficient conditions for it are:

1. Pointwise convergence: for each $\theta \in \Theta$, $|Q_n(\theta) - Q_0(\theta)| = o_p(1)$.
2. $Q_n(\theta)$ is stochastically equicontinuous: for every $\epsilon > 0$, $\eta > 0$ there exist a sequence of random variables $\Delta_n(\epsilon, \eta)$ and $n^*(\epsilon, \eta)$ such that for all $n > n^*$, $P(|\Delta_n| > \epsilon) < \eta$, and for each $\theta$ there is an open set $N$ containing $\theta$ with
$$\sup_{\theta' \in N} |Q_n(\theta') - Q_n(\theta)| \le \Delta_n, \quad n > n^*.$$

Note that both $\Delta_n$ and $n^*$ do not depend on $\theta$: it is a uniform result. This is an "in probability" version of the deterministic notion of uniform equicontinuity: we say a sequence of deterministic functions $R_n(\theta)$ is uniformly equicontinuous if, for every $\epsilon > 0$ there exist $\delta(\epsilon)$ and $n^*(\epsilon)$ such that for all $\theta$,
$$\sup_{\theta': |\theta'-\theta| < \delta} |R_n(\theta') - R_n(\theta)| \le \epsilon, \quad n > n^*.$$

To understand this more intuitively, consider a naive argument for consistency. By continuity of $Q_0$, we know that $Q_0(\theta)$ is close to $Q_0(\theta_0)$ for $\theta \in N(\theta_0)$. By pointwise convergence, we have $Q_n(\theta)$ converging to $Q_0(\theta)$ for all $\theta$; hence, even if $Q_n(\theta)$ is not optimized at $\theta_0$, the optimizer $\hat\theta_n = \operatorname{argmax}_\theta Q_n(\theta)$ should not be far from $\theta_0$. Implicitly, for the last part, we have assumed that $Q_n(\theta)$ is "equally close" to $Q_0(\theta)$ for all $\theta$, because then the optimizers of $Q_n$ and $Q_0$ cannot be too far apart. However, pointwise convergence is not enough to ensure this "equal closeness": at any given $n$, $Q_n(\theta_0)$ being close to $Q_0(\theta_0)$ does not imply the same at other points. Uniform convergence ensures, roughly speaking, that at any given $n$, $Q_n$ and $Q_0$ are equally close at all points $\theta$.

As the above discussion shows, uniform convergence is related to a continuity condition on $Q_n(\cdot)$; if $Q_n(\cdot)$ is not continuous, then it can be maximized far from $\theta_0$. Stochastic equicontinuity is the right notion of continuity here, since $Q_n(\cdot)$ and $\hat\theta_n$ are both random.
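A deterministic sketch of why pointwise convergence alone is not enough: below, $Q_n(\theta) = -\theta^2$ plus a shrinking "bump" whose location drifts toward $1/2$. For every fixed $\theta$, $Q_n(\theta) \to Q_0(\theta) = -\theta^2$ (pointwise), yet $\sup_\theta |Q_n(\theta) - Q_0(\theta)| = 2$ for every $n$, and the maximizer of $Q_n$ stays near $1/2$ instead of converging to $\operatorname{argmax} Q_0 = 0$. (The particular bump construction is illustrative, not from the notes.)

    import numpy as np

    def Q_n(theta, n):
        # bump of height 2, centered at 1/2 + 1/n, half-width 1/(4n):
        # it eventually misses any fixed theta, so Q_n -> -theta^2 pointwise
        bump = np.maximum(0.0, 1 - 4 * n * np.abs(theta - (0.5 + 1 / n)))
        return -theta**2 + 2 * bump

    grid = np.linspace(-1, 1, 100_001)
    for n in (10, 100, 1000):
        print(n, grid[Q_n(grid, n).argmax()])   # maximizer stays near 0.5, not 0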

Asymptotic normality for M-estimators

Define the score vector
$$\nabla_\theta Q_n(\hat\theta) = \Big[\frac{\partial Q_n(\theta)}{\partial\theta_1}, \dots, \frac{\partial Q_n(\theta)}{\partial\theta_K}\Big]'\Big|_{\theta=\hat\theta}$$
and, similarly, the $K \times K$ Hessian matrix
$$\big[\nabla_{\theta\theta'} Q_n(\hat\theta)\big]_{i,j} = \frac{\partial^2 Q_n(\theta)}{\partial\theta_i\partial\theta_j}\Big|_{\theta=\hat\theta}, \quad 1 \le i, j \le K.$$
Note that the Hessian is symmetric.

Make the following assumptions:

1. $\hat\theta_n = \operatorname{argmax}_\theta Q_n(\theta) \xrightarrow{p} \theta_0$
2. $\theta_0 \in \operatorname{interior}(\Theta)$
3. $Q_n(\theta)$ is twice continuously differentiable in a neighborhood $N$ of $\theta_0$
4. $\sqrt{n}\,\nabla_\theta Q_n(\theta_0) \xrightarrow{d} N(0, \Sigma)$
5. Uniform convergence of the Hessian: there exists a matrix $H(\theta)$ which is continuous at $\theta_0$ and $\sup_{\theta\in N} \|\nabla_{\theta\theta'} Q_n(\theta) - H(\theta)\| \xrightarrow{p} 0$
6. $H(\theta_0)$ is nonsingular.

Theorem (Asymptotic normality for M-estimators): Under assumptions 1-6,
$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N(0, H_0^{-1}\Sigma H_0^{-1})$$
where $H_0 \equiv H(\theta_0)$.

Proof: (sketch) By assumptions 1, 2, 3, $\nabla_\theta Q_n(\hat\theta_n) = 0$ (this is the FOC of the maximization problem). Then, using the mean value theorem (with $\bar\theta_n$ denoting the mean value, which lies between $\hat\theta_n$ and $\theta_0$):
$$0 = \nabla_\theta Q_n(\hat\theta_n) = \nabla_\theta Q_n(\theta_0) + \nabla_{\theta\theta'} Q_n(\bar\theta_n)(\hat\theta_n - \theta_0)$$
$$\implies \sqrt{n}(\hat\theta_n - \theta_0) = -\big[\nabla_{\theta\theta'} Q_n(\bar\theta_n)\big]^{-1} \sqrt{n}\,\nabla_\theta Q_n(\theta_0),$$
where $\nabla_{\theta\theta'} Q_n(\bar\theta_n) \xrightarrow{p} H_0$ (using A5 and $\bar\theta_n \xrightarrow{p} \theta_0$) and $\sqrt{n}\,\nabla_\theta Q_n(\theta_0) \xrightarrow{d} N(0, \Sigma)$ (using A4). Hence
$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} -H(\theta_0)^{-1} N(0, \Sigma) = N(0, H_0^{-1}\Sigma H_0^{-1}).$$

Note: A5 is a uniform convergence assumption on the sample Hessian. Given the previous discussion, it ensures that the sample Hessian $\nabla_{\theta\theta'} Q_n(\bar\theta_n)$, evaluated at $\bar\theta_n$ (which is close to $\theta_0$), does not vary far from the limit Hessian $H(\theta_0)$; this is implied by a type of continuity of the sample Hessian close to $\theta_0$.
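Continuing the exponential-MLE sketch from above: for $Q_n(\theta) = \frac{1}{n}\sum_i(\log\theta - \theta x_i)$ one can compute $H_0 = -1/\theta_0^2$ and $\Sigma = \operatorname{Var}(\partial\log f/\partial\theta) = 1/\theta_0^2$, so the sandwich variance is $H_0^{-1}\Sigma H_0^{-1} = \theta_0^2$. A simulation of $\sqrt{n}(\hat\theta_n - \theta_0)$ (parameters illustrative):

    import numpy as np

    rng = np.random.default_rng(7)
    theta0, n, reps = 1.5, 200, 20_000

    x = rng.exponential(scale=1 / theta0, size=(reps, n))
    z = np.sqrt(n) * (1 / x.mean(axis=1) - theta0)

    # sandwich variance: H0 = -1/theta0^2, Sigma = 1/theta0^2
    # => H0^{-1} Sigma H0^{-1} = theta0^2 = 2.25
    print(z.mean(), z.var(), theta0**2)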

2.1 Maximum likelihood estimation

The consistency of MLE follows by application of the theorem above for consistency of M-estimators. Essentially, what the consistency theorem showed was that, for any M-estimator sequence $\hat\theta_n$: $\operatorname{plim}\hat\theta_n = \operatorname{argmax}_\theta Q_0(\theta)$.

For MLE, there is a distinct and earlier argument due to Wald (1949), who shows that, in the i.i.d. case, the limiting likelihood function (corresponding to $Q_0(\theta)$) is indeed globally maximized at $\theta_0$, the true value. Thus, we can directly confirm the identification assumption of the M-estimator consistency theorem. This argument is of interest in itself.

Argument: (summary of Amemiya, pp. 141-142)

Define $\hat\theta_n^{MLE} \equiv \operatorname{argmax}_\theta \frac{1}{n}\sum_i \log f(x_i|\theta)$. Let $\theta_0$ denote the true value.

By the LLN: $\frac{1}{n}\sum_i \log f(x_i|\theta) \xrightarrow{p} E_{\theta_0}\log f(x_i|\theta)$, for all $\theta$ (not necessarily the true $\theta_0$).

By Jensen's inequality (strict, since $\log$ is strictly concave and, for $\theta \ne \theta_0$, the ratio is non-degenerate):
$$E_{\theta_0}\log\Big(\frac{f(x|\theta)}{f(x|\theta_0)}\Big) < \log E_{\theta_0}\Big(\frac{f(x|\theta)}{f(x|\theta_0)}\Big).$$
But
$$E_{\theta_0}\Big(\frac{f(x|\theta)}{f(x|\theta_0)}\Big) = \int \frac{f(x|\theta)}{f(x|\theta_0)} f(x|\theta_0)\,dx = \int f(x|\theta)\,dx = 1,$$
since $f(x|\theta)$ is a density function, for all $\theta$.[3] Hence:
$$E_{\theta_0}\log\Big(\frac{f(x|\theta)}{f(x|\theta_0)}\Big) < 0,\ \forall\theta \ne \theta_0 \implies E_{\theta_0}\log f(x|\theta) < E_{\theta_0}\log f(x|\theta_0),\ \forall\theta \ne \theta_0$$
$$\implies E_{\theta_0}\log f(x|\theta) \text{ is maximized at the true } \theta_0.$$
This is the identification assumption from the M-estimator consistency theorem.

[3] In this step, note the importance of assumption A3 in CB, pg. 516. If $x$ has support depending on $\theta$, then $f(x|\theta)$ will not integrate to 1 for all $\theta$.
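The identification step can be visualized by Monte Carlo: estimate $E_{\theta_0}\log f(x|\theta)$ on a grid of $\theta$ values and check that it peaks at $\theta_0$. A sketch for the Exponential($\theta$) model used above ($\theta_0 = 1.5$ illustrative):

    import numpy as np

    rng = np.random.default_rng(8)
    theta0, ndraws = 1.5, 1_000_000

    x = rng.exponential(scale=1 / theta0, size=ndraws)
    thetas = np.linspace(0.5, 3.0, 26)
    # Monte Carlo estimate of E_{theta0} log f(x|theta), f(x|theta) = theta exp(-theta x)
    ell = np.array([np.mean(np.log(t) - t * x) for t in thetas])
    print(thetas[ell.argmax()])   # approx theta0 = 1.5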

(skip) Analogously, we also know that, for $\delta > 0$,
$$\mu_1 \equiv E_{\theta_0}\log\Big(\frac{f(x; \theta_0 - \delta)}{f(x; \theta_0)}\Big) < 0; \qquad \mu_2 \equiv E_{\theta_0}\log\Big(\frac{f(x; \theta_0 + \delta)}{f(x; \theta_0)}\Big) < 0.$$
By the SLLN, we know that
$$\frac{1}{n}\sum_i \log\Big(\frac{f(x_i; \theta_0 - \delta)}{f(x_i; \theta_0)}\Big) = \frac{1}{n}\big[\log L_n(\mathbf{x}; \theta_0 - \delta) - \log L_n(\mathbf{x}; \theta_0)\big] \xrightarrow{a.s.} \mu_1$$
so that, with probability 1, $\log L_n(\mathbf{x}; \theta_0 - \delta) < \log L_n(\mathbf{x}; \theta_0)$ for $n$ large enough. Similarly, for $n$ large enough, $\log L_n(\mathbf{x}; \theta_0 + \delta) < \log L_n(\mathbf{x}; \theta_0)$ with probability 1. Hence, for large $n$, $\hat\theta_n \equiv \operatorname{argmax}_\theta \log L_n(\mathbf{x}; \theta) \in (\theta_0 - \delta, \theta_0 + \delta)$. That is, the MLE $\hat\theta_n$ is strongly consistent for $\theta_0$. Note that this argument requires weaker assumptions than the M-estimator consistency theorem above.

Now we consider another idea, efficiency, which is a large-sample analogue of the minimum variance concept.

For the sequence of estimators $W_n$, suppose that $k(n)(W_n - \theta) \xrightarrow{d} N(0, \sigma^2)$, where $k(n)$ is a polynomial in $n$. Then $\sigma^2$ is called the asymptotic variance of $W_n$. In usual cases, $k(n) = \sqrt{n}$. For example, by the CLT, we know that $\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} N(0, \sigma^2)$. Hence, $\sigma^2$ is the asymptotic variance of the sample mean $\bar X_n$.

Definition 10.1.11: A sequence of estimators $W_n$ is asymptotically efficient for $\theta$ if $\sqrt{n}(W_n - \theta) \xrightarrow{d} N(0, v(\theta))$, where the asymptotic variance
$$v(\theta) = \frac{1}{E_{\theta_0}\big(\frac{\partial}{\partial\theta}\log f(x|\theta)\big)^2} \quad \text{(the CRLB)}.$$

By the asymptotic normality result for M-estimators, we know what the asymptotic distribution of the MLE should be. However, it turns out that, given the information inequality, the MLE's asymptotic distribution can be further simplified.

Theorem 10.1.12: Asymptotic efficiency of MLE.

Proof: (following Amemiya, pp. 143-144)

The MLE $\hat\theta_n$ satisfies the FOC of the MLE problem:
$$0 = \frac{\partial \log L(\theta|\mathbf{X}_n)}{\partial\theta}\Big|_{\theta=\hat\theta_n^{MLE}}.$$
Using the mean value theorem (with $\bar\theta_n$ denoting the mean value):
$$0 = \frac{\partial \log L(\theta|\mathbf{X}_n)}{\partial\theta}\Big|_{\theta=\theta_0} + \frac{\partial^2 \log L(\theta|\mathbf{X}_n)}{\partial\theta^2}\Big|_{\theta=\bar\theta_n}(\hat\theta_n^{MLE} - \theta_0)$$
$$\implies \sqrt{n}(\hat\theta_n - \theta_0) = -\Big[\frac{1}{n}\sum_i \frac{\partial^2 \log f(x_i|\theta)}{\partial\theta^2}\Big|_{\theta=\bar\theta_n}\Big]^{-1} \cdot \frac{1}{\sqrt{n}}\sum_i \frac{\partial \log f(x_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}. \tag{**}$$
Note that, by the LLN,
$$\frac{1}{n}\sum_i \frac{\partial \log f(x_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0} \xrightarrow{p} E_{\theta_0}\frac{\partial \log f(x|\theta)}{\partial\theta}\Big|_{\theta=\theta_0} = \int \frac{\partial f(x|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\,dx.$$
Using the same argument as in the information inequality result above, the last term is
$$\int \frac{\partial f}{\partial\theta}\,dx = \frac{\partial}{\partial\theta}\int f\,dx = 0.$$
Hence, the CLT can be applied to the numerator of (**):
$$\frac{1}{\sqrt{n}}\sum_i \frac{\partial \log f(x_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0} \xrightarrow{d} N\Big(0,\ E_{\theta_0}\Big(\frac{\partial \log f(x|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big)^2\Big).$$
By the LLN, and uniform convergence of the Hessian term:
$$\frac{1}{n}\sum_i \frac{\partial^2 \log f(x_i|\theta)}{\partial\theta^2}\Big|_{\theta=\bar\theta_n} \xrightarrow{p} E_{\theta_0}\frac{\partial^2 \log f(x|\theta)}{\partial\theta^2}\Big|_{\theta=\theta_0}.$$

Hence, by Slutsky's theorem:
$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N\Bigg(0,\ \frac{E_{\theta_0}\big(\frac{\partial}{\partial\theta}\log f(x|\theta)\big|_{\theta=\theta_0}\big)^2}{\big[E_{\theta_0}\frac{\partial^2}{\partial\theta^2}\log f(x|\theta)\big|_{\theta=\theta_0}\big]^2}\Bigg).$$
By the information inequality:
$$E_{\theta_0}\Big(\frac{\partial \log f(x|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\Big)^2 = -E_{\theta_0}\frac{\partial^2 \log f(x|\theta)}{\partial\theta^2}\Big|_{\theta=\theta_0}$$
so that
$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N\Bigg(0,\ \frac{1}{E_{\theta_0}\big(\frac{\partial}{\partial\theta}\log f(x|\theta)\big|_{\theta=\theta_0}\big)^2}\Bigg),$$
so that the asymptotic variance is the CRLB.

Hence, the asymptotic approximation for the finite-sample distribution of the MLE is
$$\hat\theta_n^{MLE} \stackrel{a}{\sim} N\Bigg(\theta_0,\ \frac{1}{n}\cdot\frac{1}{E_{\theta_0}\big(\frac{\partial}{\partial\theta}\log f(x|\theta)\big|_{\theta=\theta_0}\big)^2}\Bigg).$$
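A final sketch checking this approximation for the exponential-MLE example used throughout: the CRLB is $1/(n\,E(\frac{\partial}{\partial\theta}\log f)^2) = \theta_0^2/n$, and the finite-sample variance of $\hat\theta_n$ should be close to it (parameters illustrative):

    import numpy as np

    rng = np.random.default_rng(9)
    theta0, n, reps = 1.5, 500, 20_000

    x = rng.exponential(scale=1 / theta0, size=(reps, n))
    theta_hat = 1 / x.mean(axis=1)    # MLE of the rate

    # CRLB: theta0^2 / n = 0.0045; the MLE variance should be close to it
    print(theta_hat.var(), theta0**2 / n)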