DA Freedman Notes on the MLE Fall 2003

The object here is to provide a sketch of the theory of the MLE. Rigorous presentations can be found in the references cited below.

Calculus. Let $f$ be a smooth, scalar function of the $p \times 1$ vector $x$. We view $f'$ as a $1 \times p$ vector of partial derivatives $\{\partial f/\partial x_i : i = 1, \ldots, p\}$. Likewise, $f''$ is a $p \times p$ matrix and $f'''$ is a 3-D array of partials. As a matter of notation, $f'$ is the derivative of $f$ and $g^T$ is the transpose of $g$. Moreover, $\|x\|$ is the Euclidean norm, $\|x\|^2 = \sum_{i=1}^p x_i^2$. Abbreviate

$$f_{ijk} = \frac{\partial^3 f}{\partial x_i\, \partial x_j\, \partial x_k}.$$

(1) Lemma. Let $M = \max_{ijk} \max_{\|x\| \le \delta} |f_{ijk}(x)|$; the indices $i$, $j$, and $k$ need not be distinct. If $\|x\| \le \delta$, then

$$f(x) = f(0) + f'(0)x + \tfrac{1}{2} x^T f''(0) x + g(x), \quad \text{where } |g(x)| \le \tfrac{1}{6} p^{3/2} M \|x\|^3.$$

Sketch of proof. We may assume that $f(0) = f'(0) = f''(0) = 0$. Fix $x$ with $\|x\| \le \delta$; let $0 \le u \le 1$; view $\phi(u) = f(ux)$ as a scalar function of the scalar $u$; then $\phi(0) = \phi'(0) = \phi''(0) = 0$, so $\phi(u) = u^3 \phi'''(v)/3!$ by Taylor's theorem, with $0 \le v \le u$. Now $0 \le u^3 \le 1$, and $\phi'''(v) = \sum_{ijk} f_{ijk}(vx)\, x_i x_j x_k$, so

$$|\phi(u)| \le \tfrac{1}{6} M \sum_{ijk} |x_i x_j x_k| = \tfrac{1}{6} M \Big( \sum_i |x_i| \Big)^3 \le \tfrac{1}{6} M p^{3/2} \|x\|^3$$

by the inequality of Cauchy-Schwarz.

Let $g$ be a smooth $1 \times p$ function of $x$. We view $g'$ as $p \times p$; and

$$g(x + \delta) = g(x) + \delta^T g'(x) + O(\|\delta\|^2) \quad \text{as } \delta \to 0.$$
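
To make the bound in (1) concrete, here is a small numerical sketch (not part of the original notes) checking the remainder bound for $f(x) = e^{a^T x}$, whose third partials are $f_{ijk}(x) = a_i a_j a_k e^{a^T x}$. The function choice, the vector $a$, and all variable names are illustrative assumptions.

```python
import numpy as np

# Check Lemma (1) for f(x) = exp(a.x) in p = 3 dimensions; the choice of
# f and of the vector a is an illustrative assumption, not from the notes.
p = 3
a = np.array([0.5, -0.3, 0.2])
delta = 0.1

f = lambda x: np.exp(a @ x)

def taylor2(x):
    # Second-order expansion at 0: f(0) = 1, f'(0) = a, f''(0) = a a^T.
    return 1.0 + a @ x + 0.5 * (a @ x) ** 2

# f_ijk(x) = a_i a_j a_k exp(a.x), so on the ball ||x|| <= delta,
# M <= max_ijk |a_i a_j a_k| * exp(||a|| delta).
M = np.max(np.abs(a)) ** 3 * np.exp(np.linalg.norm(a) * delta)

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=p)
    x *= delta / np.linalg.norm(x)          # put x on the sphere ||x|| = delta
    err = abs(f(x) - taylor2(x))            # |g(x)| in the lemma
    bound = p ** 1.5 * M * np.linalg.norm(x) ** 3 / 6
    assert err <= bound
    print(f"|g(x)| = {err:.2e} <= bound = {bound:.2e}")
```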

(2) Lemma. Let $h = fg$, where $f$ is scalar and $g$ is $1 \times p$; both functions are smooth. Then $h'$ is $p \times p$ and $h' = f g' + f'^T g$.

Examples. Suppose $a$ is a $1 \times p$ vector of reals and $f(x) = ax$. Then $f'(x) = a$ and $f''(x) = 0$. Suppose $A$ is a $p \times p$ matrix of reals, perhaps asymmetric, and $g(x) = x^T A$. Then $g'(x) = A$ and $g''(x) = 0$. Let $h(x) = x^T A x$. Then $h'(x) = x^T(A + A^T)$, $h''(x) = A + A^T$, and $h'''(x) = 0$.

Fisher information. Let $f_\theta(x)$ be a density: bounded, positive, vanishing rapidly as $|x| \to \infty$. There are problems at boundary points; we take $\theta$ and $x$ to be Euclidean; $(\theta, x) \mapsto f_\theta(x)$ is assumed smooth. Now $\int f_\theta(x)\,dx = 1$, so

(3) $$\int \frac{\partial}{\partial\theta} f_\theta(x)\,dx = \int \frac{\partial^2}{\partial\theta^2} f_\theta(x)\,dx = 0.$$

The Fisher information matrix is

$$I(\theta) = -\int \Big( \frac{\partial^2}{\partial\theta^2} \log f_\theta(x) \Big) f_\theta(x)\,dx.$$

(4) Lemma.

$$I(\theta) = \int \Big( \frac{\partial}{\partial\theta} \log f_\theta(x) \Big)^T \Big( \frac{\partial}{\partial\theta} \log f_\theta(x) \Big) f_\theta(x)\,dx = \int \frac{1}{f_\theta(x)} \Big( \frac{\partial}{\partial\theta} f_\theta(x) \Big)^T \Big( \frac{\partial}{\partial\theta} f_\theta(x) \Big)\,dx.$$

Proof. To begin with,

(5) $$\frac{\partial}{\partial\theta} \log f_\theta(x) = \frac{1}{f_\theta(x)} \frac{\partial}{\partial\theta} f_\theta(x).$$

Then by (2),

$$\frac{\partial^2}{\partial\theta^2} \log f_\theta(x) = \frac{1}{f_\theta(x)} \frac{\partial^2}{\partial\theta^2} f_\theta(x) - \frac{1}{f_\theta(x)^2} \Big( \frac{\partial}{\partial\theta} f_\theta(x) \Big)^T \Big( \frac{\partial}{\partial\theta} f_\theta(x) \Big);$$

and (3) completes the proof.

The statistical model. Let $X_i$ be measurable functions on $(\Omega, \mathcal{F})$ for $i = 1, \ldots, n$, with values in $R^q$. For $\theta \in R^p$, let $P_\theta$ be a probability on $(\Omega, \mathcal{F})$. With respect to the probability $P_\theta$, the $X_i$ are independent random variables, having common probability density $f_\theta$ on $R^q$. In particular,

$$I(\theta) = -E_\theta \Big\{ \frac{\partial^2}{\partial\theta^2} \log f_\theta(X_i) \Big\}.$$
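
As a quick sanity check on (4), not part of the original notes, here is a short Monte Carlo sketch verifying that the variance of the score equals minus the expected second derivative for a Poisson($\lambda$) model, where both equal $I(\lambda) = 1/\lambda$. The model choice is an assumption made for illustration.

```python
import numpy as np

# Monte Carlo check of Lemma (4) for a Poisson(lam) model (our choice, for
# illustration): log f_lam(x) = x log(lam) - lam - log(x!), so the score is
# x/lam - 1 and the second derivative of the log density is -x/lam^2.
rng = np.random.default_rng(1)
lam = 2.5
x = rng.poisson(lam, size=200_000)

score = x / lam - 1.0          # d/d(lam) log f_lam(x)
second = -x / lam**2           # d^2/d(lam)^2 log f_lam(x)

print("E[score^2]            :", np.mean(score**2))  # both should be
print("-E[second derivative] :", -np.mean(second))   # close to 1/lam = 0.4
print("exact I(lam) = 1/lam  :", 1 / lam)
```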

We assume $I(\theta)$ is invertible. By (3) and (4),

(6) $$E_\theta \Big\{ \frac{\partial}{\partial\theta} \log f_\theta(X_i) \Big\} = 0, \qquad \mathrm{var}_\theta \Big\{ \frac{\partial}{\partial\theta} \log f_\theta(X_i) \Big\} = I(\theta).$$

The log likelihood function is $L(\theta) = \sum_{i=1}^n \log f_\theta(X_i)$. The $X_i$ are often viewed as fixed; the variable is $\theta$. Write $\theta_0$ for the (unknown) true value of $\theta$. The first derivative of the log likelihood function is $L'(\theta) = \sum_{i=1}^n \frac{\partial}{\partial\theta} \log f_\theta(X_i)$. Of course, $L'(\theta)$ is random, because it depends on the $X_i$; this is suppressed in the notation. By (6),

(7) $$E_\theta\{L'(\theta)\} = 0, \qquad \mathrm{var}_\theta\{L'(\theta)\} = nI(\theta).$$

The MLE $\hat\theta$, by definition, maximizes the likelihood function. (Technically, there may be multiple maxima, but see below; with weaker conditions, there may be no maximum, but the theory can still be pushed through.) The main result to be discussed here says that asymptotically, the MLE is normal, with mean $\theta_0$ and variance $I(\theta_0)^{-1}/n$. Asymptotic optimality is another idea; see the references below.

(8) Theorem. As $n \to \infty$, the $P_{\theta_0}$-distribution of $\sqrt{n}(\hat\theta - \theta_0)$ converges to normal, with mean 0 and variance $I(\theta_0)^{-1}$.
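
Before the proof is sketched, a quick simulation (an illustration, not from the original notes) may help fix ideas: for an IID Poisson($\theta_0$) sample the MLE is the sample mean and $I(\theta_0) = 1/\theta_0$, so (8) predicts $\sqrt{n}(\hat\theta - \theta_0) \approx N(0, \theta_0)$. The model and all names below are illustrative choices.

```python
import numpy as np

# Simulate Theorem (8) for IID Poisson(theta0) data, where the MLE is the
# sample mean and I(theta0) = 1/theta0; the model choice is illustrative.
rng = np.random.default_rng(2)
theta0, n, reps = 2.0, 400, 10_000

mle = rng.poisson(theta0, size=(reps, n)).mean(axis=1)  # one MLE per replicate
z = np.sqrt(n) * (mle - theta0)

print("mean of sqrt(n)(mle - theta0):", z.mean())  # should be near 0
print("var  of sqrt(n)(mle - theta0):", z.var())   # should be near 1/I = 2.0
```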

Sketch of proof. By entropy considerations, for large $n$, the MLE will almost surely be within a small neighborhood of the true parameter value $\theta_0$. Indeed, if $f$ and $g$ are densities, then $\int f \log g < \int f \log f$ unless $g = f$. So $L(\theta)$ is much smaller than $L(\theta_0)$ unless $\|\theta - \theta_0\| \le \delta$. Then the log likelihood function can be expanded in a Taylor series around $\theta_0$:

$$L(\theta) = L(\theta_0) + L'(\theta_0)(\theta - \theta_0) + \tfrac{1}{2}(\theta - \theta_0)^T L''(\theta_0)(\theta - \theta_0) + R.$$

The lead term $L(\theta_0)$ is random; but since this term does not depend on $\theta$, its behavior is immaterial. The first derivative $L'(\theta_0)$ is asymptotically normal with mean 0 and variance $nI(\theta_0)$, by the central limit theorem and (7). By the strong law, $L''(\theta_0) \approx -nI(\theta_0)$. The remainder term $R$ has $|R| = O(n\|\theta - \theta_0\|^3)$ by (1), and is negligible relative to the quadratic term. Thus, the MLE $\hat\theta$ essentially maximizes

$$\theta \mapsto L'(\theta_0)(\theta - \theta_0) + \tfrac{1}{2}(\theta - \theta_0)^T L''(\theta_0)(\theta - \theta_0).$$

So $\hat\theta - \theta_0 \approx -L''(\theta_0)^{-1} L'(\theta_0)^T$, which is asymptotically $N(0, I(\theta_0)^{-1}/n)$, as required.

The maximum can be found by setting the derivative to 0. The "likelihood equation" $L'(\hat\theta) = 0$ (almost) boils down to $L'(\theta_0) + (\hat\theta - \theta_0)^T L''(\theta_0) = 0$. The "observed information" $-L''(\hat\theta)/n$ can be used to approximate the Fisher information. There is similar theory for integer-valued random variables, and for random variables with fairly general range spaces; the parameter may range over a half-space, an open subset of $R^p$, etc.

Testing. Let $\Theta_0$ be a $p_0$-dimensional subset of $R^p$. We wish to test the null hypothesis that $\theta_0 \in \Theta_0$. Let $\hat\theta_0$ be the MLE, where the maximization is restricted to $\Theta_0$. For a simple hypothesis, Wald's t-test compares the MLE to its SE; there is a version like Hotelling's $T^2$ for composite hypotheses. The Neyman-Pearson (or Wilks) statistic is $2[L(\hat\theta) - L(\hat\theta_0)]$, which under the null hypothesis has an asymptotic $\chi^2_{p-p_0}$ distribution. Rao's score test uses the statistic $L'(\hat\theta_0) I(\hat\theta_0)^{-1} L'(\hat\theta_0)^T / n$; again, the asymptotic distribution is $\chi^2_{p-p_0}$. At interior points, these test statistics are asymptotically equivalent; at boundary points, Wald's test and the Neyman-Pearson statistic get into trouble, while the score test often does fine.

The leading special case for the null distribution of these tests has $n = 1$ and $X \sim N(\theta_0, I)$, so $I(\theta_0)$ is the identity matrix, and $\Theta_0 = \{\theta : \theta_{p_0+1} = \cdots = \theta_p = 0\}$. The general case follows by change of variables and rotation.

Examples. Suppose the $U_i$ are IID $N(\alpha, 1)$, the $V_i$ are IID $N(\beta, 1)$, and the $U$'s and $V$'s are independent. Let $X_i = (U_i, V_i)$ and $\theta = (\alpha, \beta)$. Now

$$2L(\theta) = 2n \log \frac{1}{2\pi} - \sum (U_i - \bar U)^2 - \sum (V_i - \bar V)^2 - n(\bar U - \alpha)^2 - n(\bar V - \beta)^2.$$

The MLE is the sample mean $(\bar U, \bar V)$. For testing the null hypothesis that $\beta = 0$, the Neyman-Pearson statistic and the Rao score statistic are both $n\bar V^2$. If you restrict $\beta$ to be non-negative, $\hat\beta$ is 0 when $\bar V < 0$; the Neyman-Pearson statistic $n\hat\beta^2$ is then not $\chi^2$-like, but the score statistic is still $n\bar V^2$, whose null distribution is $\chi^2_1$.
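
The claim that $n\bar V^2$ is $\chi^2_1$ under the null is easy to check by simulation. The following sketch (an illustration, not part of the notes) compares simulated quantiles of $n\bar V^2$ with $\chi^2_1$ quantiles, computed as squared standard normal quantiles since $Z^2 \sim \chi^2_1$ for $Z \sim N(0,1)$.

```python
import numpy as np
from statistics import NormalDist

# Simulate the normal example: V_i IID N(0, 1) under the null beta = 0;
# both the Neyman-Pearson and score statistics reduce to n * Vbar^2.
rng = np.random.default_rng(3)
n, reps = 50, 20_000
V = rng.normal(size=(reps, n))
T = n * V.mean(axis=1) ** 2

# chi2_1 quantile = (Phi^{-1}((1+q)/2))^2, since Z^2 ~ chi2_1 for Z ~ N(0,1).
for q in (0.5, 0.9, 0.95, 0.99):
    chi2_q = NormalDist().inv_cdf((1 + q) / 2) ** 2
    print(f"q = {q}: simulated {np.quantile(T, q):.3f}, chi2_1 {chi2_q:.3f}")
```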

Exercises. Suppose the $X_i$ are IID Poisson, with mean $\lambda$. Write down $L$, $L'$, $L''$, $I$. Find the MLE for $\lambda$. If $\lambda_1 > 0$, write down the Neyman-Pearson statistic and the score statistic for testing the null hypothesis that $\lambda = \lambda_1$. Verify the asymptotic distributions under the null. Which test is more powerful for $\lambda > \lambda_1$? For $\lambda < \lambda_1$?

Suppose the $X_i$ are independent $N(\theta_i, 1)$ for $i = 1, \ldots, p$; the $\theta_i$ are unrestricted real numbers. Find the MLE for $\theta$. Find the Neyman-Pearson and Rao tests for the null hypothesis that $\theta_i = 0$ for $i = p_0 + 1, \ldots, p$.

Let $M$ be a non-random $n \times p$ matrix of full rank; suppose $Y = M\theta + \epsilon$, where the $\epsilon_i$ are IID $N(0, \sigma^2)$. Write down $L$, $L'$, $L''$, $I$. Find the MLE for $\theta$ and $\sigma^2$. Write down the Neyman-Pearson statistic and the score statistic for testing the null hypothesis that $\theta_1 = 0$. Derive the normal equations by differentiating $\theta \mapsto \|Y - M\theta\|^2$ with respect to $\theta$.

Let $\Phi$ be the standard normal distribution function, with $\Phi' = \phi$ being the density. According to the probit model, given $X_1, \ldots, X_n$, the variables $Y_1, \ldots, Y_n$ are independent 0-1 variables, each being 1 with probability $\Phi(X_i \beta)$. For $x > 0$, show that $1 - \Phi(x) < \phi(x)/x$. Conclude that $\Phi$ and $1 - \Phi$ are log concave. Conclude further that the log likelihood function for the probit model is concave. Hint: show first that $1 - \Phi(x) < \int_x^\infty (z/x)\,\phi(z)\,dz$.

References

C. R. Rao (1973). Linear Statistical Inference and Its Applications. 2nd ed. Wiley. Chapter 6 discusses likelihood techniques; pp. 415ff cover Wald's t-test, the Neyman-Pearson likelihood ratio test, and Rao's score test. Notation: $I = I(\theta)$, $l = L$, $V = L'(\theta)^T/\sqrt{n}$, $D = \sqrt{n}(\hat\theta - \theta_0)$. The subscript 0 means: substitute $\theta_0$ for $\theta$. The superscript * means: substitute $\theta^*$ for $\theta$, the former being the MLE over a restricted subset of the parameter space; Rao's parameter $\theta$ is $q$-dimensional, and his restricted parameter space is $s$-dimensional.

E. L. Lehmann (1991). Testing Statistical Hypotheses. 2nd ed. Wadsworth & Brooks/Cole. The $\chi^2$ and likelihood ratio tests are discussed on pp. 477ff.

E. L. Lehmann (1991). Theory of Point Estimation. Wadsworth & Brooks/Cole. The information inequality (aka the Cramér-Rao inequality) is discussed on pp. 115ff, and the theory of the MLE is developed in Chapter 6. For exponential families, the calculus is much more tractable; see pp. 119, 417, 438.