Chapter 4: Asymptotic Properties of the MLE (Part 2)

Daniel O. Scharfstein
09/24/13

Example

Let $\{(R_i, X_i) : i = 1, \ldots, n\}$ be an i.i.d. sample of $n$ random vectors distributed as $(R, X)$. Here $R$ is a response indicator and $X$ is a covariate. We assume that
$$\operatorname{logit} P[R = 1 \mid X] = \alpha + \beta X$$
and that $X$ is normally distributed with mean $\mu$ and variance $\sigma^2$. So our probability model has four parameters, $\theta = (\alpha, \beta, \mu, \sigma^2)$. Let $\theta_0 = (\alpha_0, \beta_0, \mu_0, \sigma_0^2)$ denote the true value of $\theta$.
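To make the setup concrete, here is a minimal simulation sketch of this model (not from the slides; the parameter values and the helper name `simulate` are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, alpha, beta, mu, sigma2):
    """Draw (R_i, X_i) pairs: X ~ N(mu, sigma2), logit P[R=1|X] = alpha + beta*X."""
    x = rng.normal(mu, np.sqrt(sigma2), size=n)
    p = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))   # inverse logit of alpha + beta*x
    r = rng.binomial(1, p)
    return r, x

# hypothetical true values theta_0 = (alpha_0, beta_0, mu_0, sigma_0^2)
r, x = simulate(n=1000, alpha=-0.5, beta=1.0, mu=2.0, sigma2=4.0)
```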

Likelihood

$$L(\theta; \mathbf{x}, \mathbf{r}) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\Big(-\frac{1}{2\sigma^2}(x_i - \mu)^2\Big) \, \frac{\exp(r_i(\alpha + \beta x_i))}{1 + \exp(\alpha + \beta x_i)}$$

The log-likelihood for an individual is, up to an additive constant,
$$l(\theta; x, r) = -\log(\sigma) - \frac{1}{2\sigma^2}(x - \mu)^2 + r(\alpha + \beta x) - \log(1 + \exp(\alpha + \beta x)).$$
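The log-likelihood translates directly into code. A sketch (illustrative, including the additive constant; `np.logaddexp(0, eta)` computes log(1 + e^eta) stably):

```python
import numpy as np

def log_lik(theta, x, r):
    """Sample log-likelihood sum_i l(theta; x_i, r_i), theta = (alpha, beta, mu, sigma2)."""
    alpha, beta, mu, sigma2 = theta
    eta = alpha + beta * x
    normal_part = -0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
    logistic_part = r * eta - np.logaddexp(0.0, eta)   # r*eta - log(1 + exp(eta))
    return np.sum(normal_part + logistic_part)
```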

Score

The score vector for an individual is
$$\psi(x, r; \theta) = \begin{pmatrix} \partial l(\theta; x, r)/\partial\alpha \\ \partial l(\theta; x, r)/\partial\beta \\ \partial l(\theta; x, r)/\partial\mu \\ \partial l(\theta; x, r)/\partial\sigma^2 \end{pmatrix} = \begin{pmatrix} r - \dfrac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)} \\[2mm] x\Big(r - \dfrac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}\Big) \\[2mm] \dfrac{x - \mu}{\sigma^2} \\[2mm] -\dfrac{1}{2\sigma^2} + \dfrac{(x - \mu)^2}{2\sigma^4} \end{pmatrix}$$

The score equations for the full sample are:
$$\sum_{i=1}^{n} \Big( r_i - \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)} \Big) = 0 \qquad (1)$$
$$\sum_{i=1}^{n} x_i \Big( r_i - \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)} \Big) = 0 \qquad (2)$$
$$\sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = 0 \qquad (3)$$
$$\sum_{i=1}^{n} \Big( -\frac{1}{2\sigma^2} + \frac{(x_i - \mu)^2}{2\sigma^4} \Big) = 0 \qquad (4)$$

MLE

Note that Equations (3) and (4) can be solved explicitly, giving
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \text{and} \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\mu)^2.$$
The solutions for $\hat\alpha$ and $\hat\beta$ are obtained by solving Equations (1) and (2), which requires an iterative technique such as Newton-Raphson (more later).
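A minimal Newton-Raphson sketch for $(\hat\alpha, \hat\beta)$ based on the score equations (1)-(2), with the closed-form $\hat\mu$ and $\hat\sigma^2$ included. This is an illustration under the assumptions above, not the author's code; the helper name `mle` is hypothetical:

```python
import numpy as np

def mle(x, r, tol=1e-10, max_iter=100):
    """MLE of theta = (alpha, beta, mu, sigma2) for the logistic/normal model."""
    mu_hat = x.mean()
    sigma2_hat = ((x - mu_hat) ** 2).mean()

    ab = np.zeros(2)                                # starting value for (alpha, beta)
    X = np.column_stack([np.ones_like(x), x])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ ab))           # P[R=1 | X] at current (alpha, beta)
        score = X.T @ (r - p)                       # score equations (1) and (2)
        info = X.T @ (X * (p * (1 - p))[:, None])   # logistic block of the observed information
        step = np.linalg.solve(info, score)
        ab = ab + step                              # Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    return np.array([ab[0], ab[1], mu_hat, sigma2_hat])
```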

Observed Information Matrix

Let $p_i(\alpha, \beta) = \dfrac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}$ and $q_i(\alpha, \beta) = 1 - p_i(\alpha, \beta)$. Now,
$$n J_n(\theta) = \begin{bmatrix} J_{n1}(\alpha, \beta) & 0 \\ 0 & J_{n2}(\mu, \sigma^2) \end{bmatrix}$$
where
$$J_{n1}(\alpha, \beta) = \begin{bmatrix} \sum_{i=1}^{n} p_i(\alpha, \beta) q_i(\alpha, \beta) & \sum_{i=1}^{n} x_i p_i(\alpha, \beta) q_i(\alpha, \beta) \\ \sum_{i=1}^{n} x_i p_i(\alpha, \beta) q_i(\alpha, \beta) & \sum_{i=1}^{n} x_i^2 p_i(\alpha, \beta) q_i(\alpha, \beta) \end{bmatrix}$$
and
$$J_{n2}(\mu, \sigma^2) = \begin{bmatrix} \dfrac{n}{\sigma^2} & \dfrac{1}{\sigma^4}\sum_{i=1}^{n} (x_i - \mu) \\[2mm] \dfrac{1}{\sigma^4}\sum_{i=1}^{n} (x_i - \mu) & -\dfrac{n}{2\sigma^4} + \dfrac{1}{\sigma^6}\sum_{i=1}^{n} (x_i - \mu)^2 \end{bmatrix}$$
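The blocks above map directly into code. A sketch of $n J_n(\theta)$, assuming the block forms given above (note that $r$ does not enter the second derivatives):

```python
import numpy as np

def observed_information(theta, x):
    """n * J_n(theta): block-diagonal observed information for (alpha, beta) and (mu, sigma2)."""
    alpha, beta, mu, sigma2 = theta
    p = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))
    w = p * (1 - p)
    J1 = np.array([[np.sum(w),      np.sum(x * w)],
                   [np.sum(x * w),  np.sum(x ** 2 * w)]])
    n = len(x)
    J2 = np.array([[n / sigma2,                   np.sum(x - mu) / sigma2 ** 2],
                   [np.sum(x - mu) / sigma2 ** 2, -n / (2 * sigma2 ** 2) + np.sum((x - mu) ** 2) / sigma2 ** 3]])
    out = np.zeros((4, 4))
    out[:2, :2] = J1      # (alpha, beta) block
    out[2:, 2:] = J2      # (mu, sigma2) block
    return out
```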

Assumptions

Assume that $|\alpha| \le K_1$, $|\beta| \le K_2$, $|\mu| \le K_3$, and $K_4 \le \sigma^2 \le K_5$, where $K_1, \ldots, K_5$ are finite positive constants. Assume that $\theta_0$ (the truth) belongs to the interior of $\Theta$.

Regularity Conditions

Condition i) is satisfied by assumption. Condition ii) is satisfied since $\log p(x, r; \theta)$ is continuous at each $\theta$ for all $-\infty < x < \infty$ and $r = 0, 1$.

Now we need to show condition iii): $|\log p(x, r; \theta)| \le d(x, r)$ for all $\theta \in \Theta$, with $E_{\theta_0}[d(X, R)] < \infty$. We have
$$\log p(x, r; \theta) = -\tfrac{1}{2}\log(2\pi) - \log(\sigma) - \frac{1}{2\sigma^2}(x - \mu)^2 + r(\alpha + \beta x) - \log(1 + \exp(\alpha + \beta x)).$$
By the triangle inequality,
$$|\log p(x, r; \theta)| \le \tfrac{1}{2}\log(2\pi) + |\log(\sigma)| + \frac{1}{2\sigma^2}(x - \mu)^2 + |r(\alpha + \beta x)| + \log(1 + \exp(\alpha + \beta x)).$$

Inequalities

$$|x| \le 1 + x^2$$
$$0 \le \log(1 + \exp(\alpha + \beta x)) \le \log(1 + \exp(|\alpha| + |\beta||x|)) \le 1 + \log(\exp(|\alpha| + |\beta||x|)) = 1 + |\alpha| + |\beta||x| \le 1 + K_1 + K_2(1 + x^2)$$
$$|\log(\sigma)| \le \tfrac{1}{2}\max(|\log(K_4)|, |\log(K_5)|)$$
$$\frac{1}{2\sigma^2}(x - \mu)^2 \le \frac{1}{2\sigma^2}x^2 + \frac{|\mu||x|}{\sigma^2} + \frac{\mu^2}{2\sigma^2} \le \frac{x^2}{2K_4} + \frac{K_3}{K_4}(1 + x^2) + \frac{K_3^2}{2K_4}$$

Inequalities

$$|r(\alpha + \beta x)| \le |\alpha| + |\beta||x| \le K_1 + K_2(1 + x^2)$$
$$\log(1 + \exp(\alpha + \beta x)) \le 1 + K_1 + K_2(1 + x^2)$$

Regularity Conditions

Let
$$d(x, r) = \tfrac{1}{2}\log(2\pi) + \tfrac{1}{2}\max(|\log(K_4)|, |\log(K_5)|) + \frac{x^2}{2K_4} + \frac{K_3}{K_4}(1 + x^2) + \frac{K_3^2}{2K_4} + K_1 + K_2(1 + x^2) + 1 + K_1 + K_2(1 + x^2).$$
Now, $E_{\theta_0}[X^2] = \mu_0^2 + \sigma_0^2 < \infty$, which implies that $E_{\theta_0}[d(X, R)] < \infty$.
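As a numerical spot-check (not part of the slides), one can verify that $d$ dominates $|\log p|$ at randomly drawn $(x, r, \theta)$ with $\theta$ inside the assumed bounds; the $K$ values below are hypothetical:

```python
import numpy as np

# hypothetical bounds for the compact parameter space
K1, K2, K3, K4, K5 = 2.0, 2.0, 5.0, 0.5, 10.0

def log_p(x, r, alpha, beta, mu, sigma2):
    eta = alpha + beta * x
    return (-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
            + r * eta - np.logaddexp(0.0, eta))

def d(x, r):
    # dominating function from the slide (does not depend on r)
    return (0.5 * np.log(2 * np.pi) + 0.5 * max(abs(np.log(K4)), abs(np.log(K5)))
            + x ** 2 / (2 * K4) + (K3 / K4) * (1 + x ** 2) + K3 ** 2 / (2 * K4)
            + K1 + K2 * (1 + x ** 2) + 1 + K1 + K2 * (1 + x ** 2))

rng = np.random.default_rng(4)
ok = True
for _ in range(10000):
    x, r = rng.normal(0, 10), rng.integers(0, 2)
    alpha, beta, mu = rng.uniform(-K1, K1), rng.uniform(-K2, K2), rng.uniform(-K3, K3)
    sigma2 = rng.uniform(K4, K5)
    ok &= abs(log_p(x, r, alpha, beta, mu, sigma2)) <= d(x, r)
print(ok)   # True if the dominating bound holds at all sampled points
```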

Regularity Conditions

For condition iv), we have to show that $p(x, r; \theta)$ is twice continuously differentiable and that $p(x, r; \theta) > 0$ for $\theta$ in a neighborhood of $\theta_0$. Differentiability follows from the fact that the density is built from polynomials and exponentials, for which all derivatives exist; the density is clearly positive over the entire parameter space.

Regularity Conditions

For condition v), we have to show that $\|\partial p(x, r; \theta)/\partial\theta\| \le e(x, r)$, where $e(x, r)$ is integrable. Remember,
$$p(x, r; \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\Big(-\frac{1}{2\sigma^2}(x - \mu)^2\Big) \frac{\exp(r(\alpha + \beta x))}{1 + \exp(\alpha + \beta x)}$$
and the components of $\partial p(x, r; \theta)/\partial\theta$ are
$$\frac{\partial p}{\partial\alpha} = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\Big(-\frac{(x - \mu)^2}{2\sigma^2}\Big) \, \frac{r\exp(r(\alpha + \beta x)) + (r - 1)\exp((r + 1)(\alpha + \beta x))}{(1 + \exp(\alpha + \beta x))^2}$$
$$\frac{\partial p}{\partial\beta} = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\Big(-\frac{(x - \mu)^2}{2\sigma^2}\Big) \, \frac{xr\exp(r(\alpha + \beta x)) + x(r - 1)\exp((r + 1)(\alpha + \beta x))}{(1 + \exp(\alpha + \beta x))^2}$$
$$\frac{\partial p}{\partial\mu} = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\Big(-\frac{(x - \mu)^2}{2\sigma^2}\Big) \, \frac{\exp(r(\alpha + \beta x))}{1 + \exp(\alpha + \beta x)} \Big\{ \frac{x - \mu}{\sigma^2} \Big\}$$
$$\frac{\partial p}{\partial\sigma^2} = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\Big(-\frac{(x - \mu)^2}{2\sigma^2}\Big) \, \frac{\exp(r(\alpha + \beta x))}{1 + \exp(\alpha + \beta x)} \Big\{ -\frac{1}{2\sigma^2} + \frac{(x - \mu)^2}{2\sigma^4} \Big\}$$

Regularity Conditions

Now,
$$\Big\|\frac{\partial p(x, r; \theta)}{\partial\theta}\Big\| \le \Big|\frac{\partial p(x, r; \theta)}{\partial\alpha}\Big| + \Big|\frac{\partial p(x, r; \theta)}{\partial\beta}\Big| + \Big|\frac{\partial p(x, r; \theta)}{\partial\mu}\Big| + \Big|\frac{\partial p(x, r; \theta)}{\partial\sigma^2}\Big|.$$
We want to bound each term in the sum by a function of $x$ and $r$ which is integrable. To do this, note that
$$\frac{1}{\sigma\sqrt{2\pi}} \exp\!\Big(-\frac{1}{2\sigma^2}(x - \mu)^2\Big) \le e^*(x)$$
where
$$e^*(x) = \begin{cases} \dfrac{1}{\sqrt{2\pi K_4}} \exp\!\Big(-\dfrac{(x - K_3)^2}{2K_5}\Big) & x > K_3 \\[2mm] \dfrac{1}{\sqrt{2\pi K_4}} & -K_3 \le x \le K_3 \\[2mm] \dfrac{1}{\sqrt{2\pi K_4}} \exp\!\Big(-\dfrac{(x + K_3)^2}{2K_5}\Big) & x < -K_3 \end{cases}$$

Regularity Conditions

Now,
$$\Big|\frac{\partial p(x, r; \theta)}{\partial\alpha}\Big| \le e_1(x) = 2e^*(x)$$
$$\Big|\frac{\partial p(x, r; \theta)}{\partial\beta}\Big| \le e_2(x) = 2|x|e^*(x)$$
$$\Big|\frac{\partial p(x, r; \theta)}{\partial\mu}\Big| \le e_3(x) = e^*(x)\,\frac{|x| + K_3}{K_4}$$
$$\Big|\frac{\partial p(x, r; \theta)}{\partial\sigma^2}\Big| \le e_4(x) = e^*(x)\Big( \frac{1}{2K_4} + \frac{x^2(1 + 2K_3) + 2K_3 + K_3^2}{2K_4^2} \Big)$$
We then have $\|\partial p(x, r; \theta)/\partial\theta\| \le e(x) = e_1(x) + e_2(x) + e_3(x) + e_4(x)$, and $e(x)$ is integrable since each $e_i(x)$ $(i = 1, \ldots, 4)$ is integrable.

Regularity Conditions

For condition vi), we need to show that $I(\theta_0) = E_{\theta_0}[\psi(X, R; \theta_0)\psi(X, R; \theta_0)']$ is non-singular. It suffices to show that $I(\theta_0)$ is positive definite, that is, $a'I(\theta_0)a > 0$ for all $a$ with $\|a\| > 0$. Equivalently, we can show that $\mathrm{Var}_{\theta_0}[a'\psi(X, R; \theta_0)] > 0$. Now,
$$a'\psi(X, R; \theta_0) = a_1\Big(R - \frac{\exp(\alpha_0 + \beta_0 X)}{1 + \exp(\alpha_0 + \beta_0 X)}\Big) + a_2 X\Big(R - \frac{\exp(\alpha_0 + \beta_0 X)}{1 + \exp(\alpha_0 + \beta_0 X)}\Big) + a_3\Big(\frac{X - \mu_0}{\sigma_0^2}\Big) + a_4\Big(-\frac{1}{2\sigma_0^2} + \frac{(X - \mu_0)^2}{2\sigma_0^4}\Big)$$
and
$$\mathrm{Var}_{\theta_0}[a'\psi(X, R; \theta_0)] = E_{\theta_0}\big[\mathrm{Var}_{\theta_0}[a'\psi(X, R; \theta_0) \mid X]\big] + \mathrm{Var}_{\theta_0}\big[E_{\theta_0}[a'\psi(X, R; \theta_0) \mid X]\big].$$

Regularity Conditions

$$\mathrm{Var}_{\theta_0}[a'\psi(X, R; \theta_0) \mid X] = (a_1 + a_2 X)^2 \frac{\exp(\alpha_0 + \beta_0 X)}{(1 + \exp(\alpha_0 + \beta_0 X))^2}$$
$$E_{\theta_0}[a'\psi(X, R; \theta_0) \mid X] = a_3\Big(\frac{X - \mu_0}{\sigma_0^2}\Big) + a_4\Big(-\frac{1}{2\sigma_0^2} + \frac{(X - \mu_0)^2}{2\sigma_0^4}\Big)$$
so
$$\mathrm{Var}_{\theta_0}[a'\psi(X, R; \theta_0)] = E_{\theta_0}\Big[(a_1 + a_2 X)^2 \frac{\exp(\alpha_0 + \beta_0 X)}{(1 + \exp(\alpha_0 + \beta_0 X))^2}\Big] + \frac{a_3^2}{\sigma_0^2} + \frac{a_4^2}{2\sigma_0^4}.$$
This variance can only equal zero if $a_1 = a_2 = a_3 = a_4 = 0$. So $I(\theta_0)$ is positive definite.
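A Monte Carlo sanity check of this positive-definiteness argument (illustrative; the $\theta_0$ values are hypothetical): estimate $I(\theta_0) = E_{\theta_0}[\psi\psi']$ by averaging outer products of the score and inspect the smallest eigenvalue.

```python
import numpy as np

def score(theta, x, r):
    """Per-observation score vectors psi(x_i, r_i; theta), returned as an (n, 4) array."""
    alpha, beta, mu, sigma2 = theta
    p = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))
    return np.column_stack([r - p,
                            x * (r - p),
                            (x - mu) / sigma2,
                            -1.0 / (2 * sigma2) + (x - mu) ** 2 / (2 * sigma2 ** 2)])

rng = np.random.default_rng(1)
theta0 = (-0.5, 1.0, 2.0, 4.0)                       # hypothetical theta_0
x = rng.normal(theta0[2], np.sqrt(theta0[3]), 200000)
r = rng.binomial(1, 1.0 / (1.0 + np.exp(-(theta0[0] + theta0[1] * x))))

psi = score(theta0, x, r)
I_hat = psi.T @ psi / len(x)                         # Monte Carlo estimate of E[psi psi']
print(np.linalg.eigvalsh(I_hat).min() > 0)           # positive definite => True
```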

Regularity Conditions

Conditions vii) and viii) involve showing that the norms of the matrices of second derivatives of the log-likelihood and the likelihood are bounded by functions of $x$ and $r$ which are integrable. The norm of a matrix $A = [a_{ij}]$ is $\big(\sum_{i,j} a_{ij}^2\big)^{1/2}$. For each of these matrices, it is sufficient to show that the absolute value of each entry is bounded by an integrable function of $x$ and $r$. The bounding techniques are very similar to those used to show condition v).

Asymptotic Variance of $\hat\theta_n$

$$I(\theta_0)^{-1} = \begin{bmatrix} E_{\theta_0}\Big[\dfrac{\exp(\alpha_0 + \beta_0 X)}{(1 + \exp(\alpha_0 + \beta_0 X))^2}\Big] & E_{\theta_0}\Big[\dfrac{X\exp(\alpha_0 + \beta_0 X)}{(1 + \exp(\alpha_0 + \beta_0 X))^2}\Big] & 0 & 0 \\[2mm] E_{\theta_0}\Big[\dfrac{X\exp(\alpha_0 + \beta_0 X)}{(1 + \exp(\alpha_0 + \beta_0 X))^2}\Big] & E_{\theta_0}\Big[\dfrac{X^2\exp(\alpha_0 + \beta_0 X)}{(1 + \exp(\alpha_0 + \beta_0 X))^2}\Big] & 0 & 0 \\[2mm] 0 & 0 & \dfrac{1}{\sigma_0^2} & 0 \\[2mm] 0 & 0 & 0 & \dfrac{1}{2\sigma_0^4} \end{bmatrix}^{-1}$$

Regular Estimators

Suppose we want to estimate a real-valued function of $\theta$, say $g(\theta)$, where $g(\cdot): \mathbb{R}^k \to \mathbb{R}$ has continuous partial derivatives. Consider estimating $g(\theta_0)$ by $g(\hat\theta(X_n))$, where $\hat\theta(X_n)$ is the MLE. Assuming that the 8 regularity conditions hold, we know that
$$\sqrt{n}(\hat\theta(X_n) - \theta_0) \xrightarrow{D} N(0, I^{-1}(\theta_0)).$$
By the multivariate delta method,
$$\sqrt{n}(g(\hat\theta(X_n)) - g(\theta_0)) \xrightarrow{D} N(0, a(\theta_0)'I^{-1}(\theta_0)a(\theta_0))$$
where $a(\theta_0)$ is the gradient of $g(\theta)$ evaluated at $\theta_0$.
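As an illustration of the delta method in this setting, take the hypothetical functional $g(\theta) = \beta\sigma$ (not from the slides). A sketch that computes the asymptotic standard error of the plug-in estimate from an estimate of $I(\theta_0)^{-1}$, with the gradient $a(\theta)$ obtained by finite differences:

```python
import numpy as np

def g(theta):
    """Hypothetical functional of interest: g(theta) = beta * sigma (illustrative only)."""
    alpha, beta, mu, sigma2 = theta
    return beta * np.sqrt(sigma2)

def delta_method_se(theta_hat, I_inv, n, eps=1e-6):
    """Asymptotic SE of g(theta_hat): sqrt(a' I^{-1} a / n), gradient a by finite differences."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    a = np.zeros_like(theta_hat)
    for j in range(len(theta_hat)):
        step = np.zeros_like(theta_hat)
        step[j] = eps
        a[j] = (g(theta_hat + step) - g(theta_hat - step)) / (2 * eps)
    return np.sqrt(a @ I_inv @ a / n)

# usage (hypothetical values):
# se = delta_method_se(theta_hat=[-0.5, 1.0, 2.0, 4.0], I_inv=np.linalg.inv(I_hat), n=1000)
```

Here `I_inv` would be an estimate of $I(\theta_0)^{-1}$, for example the inverse of the averaged outer product of scores from the earlier sketch.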

Regular Estimators

Consider an estimator $T(X_n)$ of $g(\theta_0)$ which is asymptotically normal and asymptotically unbiased, i.e.,
$$\sqrt{n}(T(X_n) - g(\theta_0)) \xrightarrow{D} N(0, V(\theta_0)) \qquad (5)$$
It turns out that, under some additional regularity conditions on $T(X_n)$, we can show that
$$V(\theta_0) \ge a(\theta_0)'I^{-1}(\theta_0)a(\theta_0) \qquad (6)$$

Regular Estimators

Definition: A regular estimator $T(X_n)$ of $g(\theta_0)$ which satisfies (5) with $V(\theta_0) = a(\theta_0)'I^{-1}(\theta_0)a(\theta_0)$ is said to be asymptotically efficient. If $g(\hat\theta(X_n))$ is regular, then we know that it is asymptotically efficient. For a long time, it was believed that (6) could be guaranteed by regularity conditions on the density $p(x; \theta)$ alone. This belief was exploded by Hodges.

Hodges' Example of Super-Efficiency

Suppose that $X_n = (X_1, \ldots, X_n)$, where the $X_i$'s are i.i.d. Normal$(\mu_0, 1)$. We can show that $I(\mu_0) = 1$ for all $\mu_0$. Consider the following estimator of $\mu_0$:
$$\hat\mu(X_n) = \begin{cases} \bar X_n & |\bar X_n| \ge n^{-1/4} \\ 0 & |\bar X_n| < n^{-1/4} \end{cases}$$
Suppose that $\mu_0 \ne 0$. Then
$$\sqrt{n}(\hat\mu(X_n) - \mu_0) = \underbrace{\sqrt{n}(\hat\mu(X_n) - \bar X_n)}_{(7)} + \underbrace{\sqrt{n}(\bar X_n - \mu_0)}_{(8)}$$
By the CLT, (8) converges in distribution to a $N(0, 1)$ random variable. We can show that (7) converges in probability to zero. By Slutsky's theorem, $\sqrt{n}(\hat\mu(X_n) - \mu_0) \xrightarrow{D} N(0, 1)$. So the estimator attains the efficiency bound.
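A small simulation sketch of the Hodges estimator (illustrative values): the sampling distribution of $\sqrt{n}(\hat\mu(X_n) - \mu_0)$ has variance near 1 when $\mu_0 \ne 0$, but collapses when $\mu_0 = 0$.

```python
import numpy as np

rng = np.random.default_rng(2)

def hodges(xbar, n):
    """Hodges estimator: keep the sample mean unless it is within n^{-1/4} of zero."""
    return np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)

def sampling_dist(mu0, n, reps=100000):
    """Monte Carlo draws of sqrt(n) * (muhat - mu0)."""
    xbar = rng.normal(mu0, 1.0 / np.sqrt(n), size=reps)   # exact distribution of the sample mean
    return np.sqrt(n) * (hodges(xbar, n) - mu0)

for mu0 in (1.0, 0.0):
    z = sampling_dist(mu0, n=10000)
    print(mu0, z.var())   # roughly 1 when mu0 != 0, near 0 when mu0 = 0
```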

Hodges' Example of Super-Efficiency

To show that (7) converges in probability to zero, note that
$$\begin{aligned}
P_{\mu_0}[\sqrt{n}(\hat\mu(X_n) - \bar X_n) = 0] &= P_{\mu_0}[\hat\mu(X_n) = \bar X_n] = P_{\mu_0}[|\bar X_n| \ge n^{-1/4}] \\
&= P_{\mu_0}[\bar X_n \ge n^{-1/4}] + P_{\mu_0}[\bar X_n \le -n^{-1/4}] \\
&= P_{\mu_0}[\sqrt{n}(\bar X_n - \mu_0) \ge \sqrt{n}(n^{-1/4} - \mu_0)] + P_{\mu_0}[\sqrt{n}(\bar X_n - \mu_0) \le \sqrt{n}(-n^{-1/4} - \mu_0)] \\
&= 1 - \Phi(\sqrt{n}(n^{-1/4} - \mu_0)) + \Phi(\sqrt{n}(-n^{-1/4} - \mu_0)) \\
&= 1 - \Phi(n^{1/4} - n^{1/2}\mu_0) + \Phi(-n^{1/4} - n^{1/2}\mu_0) \to 1
\end{aligned}$$

Hodges' Example of Super-Efficiency

If $\mu_0 = 0$, then we can show that $\sqrt{n}(\hat\mu(X_n) - \mu_0) \xrightarrow{P} 0$, i.e., $\sqrt{n}(\hat\mu(X_n) - \mu_0) \xrightarrow{D} N(0, 0)$. This violates (6). To show this result, we note that
$$\begin{aligned}
P_{\mu_0}[\sqrt{n}(\hat\mu(X_n) - \mu_0) = 0] &= P_{\mu_0}[\hat\mu(X_n) = 0] = P_{\mu_0}[|\bar X_n| < n^{-1/4}] \\
&= P_{\mu_0}[-n^{-1/4} < \bar X_n < n^{-1/4}] \\
&= P_{\mu_0}[-n^{1/4} < \sqrt{n}\bar X_n < n^{1/4}] \to 1
\end{aligned}$$

Remarks about Super-Efficiency

The problem here is not due to irregularities of the density function, but to the partiality of the estimator toward $\mu_0 = 0$. This example shows that no regularity conditions on the density can prevent an estimator from violating (6). This possibility can only be avoided by placing restrictions on the sequence of estimators. LeCam (1953) showed that for any sequence of estimators satisfying (5), the set of points in $\Theta$ violating (6) has Lebesgue measure zero.

Regular Estimators

When we first discussed estimators, it was argued that it is important to rule out partial estimators, i.e., estimators which favor some values of the parameter over others. In an asymptotic sense, we may want our sequence of estimators to be impartial so that we rule out estimators like the one presented by Hodges. Toward this end, we may restrict ourselves to regular estimators. A regular sequence of estimators is one whose asymptotic distribution remains the same in shrinking neighborhoods of the true parameter value.

Regular Estimators

More formally, consider a sequence of estimators $\{T(X_n)\}$ and a sequence of parameter values $\{\theta_n\}$ such that $\sqrt{n}(\theta_n - \theta_0)$ is bounded. $T(X_n)$ is a regular sequence of estimators if
$$P_{\theta_n}[\sqrt{n}(T(X_n) - g(\theta_n)) \le a]$$
converges to the same limit as
$$P_{\theta_0}[\sqrt{n}(T(X_n) - g(\theta_0)) \le a]$$
for all $a$. For estimators that are regular and satisfy (5), (6) holds. We will establish this result formally when we discuss influence functions.

Is the Hodges Estimator Regular?

Suppose that $\mu_0 = 0$. We know that the limiting distribution of $\sqrt{n}(\hat\mu(X_n) - \mu_0)$ is degenerate, with point mass at zero. Consider the sequence $\mu_n = \tau/\sqrt{n}$, where $\tau$ is some positive constant. Note that $\sqrt{n}(\mu_n - \mu_0) = \tau$ is bounded. Now we can show that $\sqrt{n}(\hat\mu(X_n) - \mu_n) \xrightarrow{P_{\mu_n}} -\tau$. To see this, note that
$$\begin{aligned}
P_{\mu_n}[\sqrt{n}(\hat\mu(X_n) - \mu_n) = -\tau] &= P_{\mu_n}[\sqrt{n}(\hat\mu(X_n) - \tau/\sqrt{n}) = -\tau] = P_{\mu_n}[\hat\mu(X_n) = 0] \\
&= P_{\mu_n}[|\bar X_n| < n^{-1/4}] = P_{\mu_n}[-n^{-1/4} < \bar X_n < n^{-1/4}] \\
&= P_{\mu_n}[\sqrt{n}(-n^{-1/4} - \mu_n) < \sqrt{n}(\bar X_n - \mu_n) < \sqrt{n}(n^{-1/4} - \mu_n)] \\
&= \Phi(\sqrt{n}(n^{-1/4} - \mu_n)) - \Phi(\sqrt{n}(-n^{-1/4} - \mu_n)) \\
&= \Phi(n^{1/4} - \tau) - \Phi(-n^{1/4} - \tau) \to 1
\end{aligned}$$
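The same behavior can be seen by simulation under the local sequence $\mu_n = \tau/\sqrt{n}$ (illustrative values): essentially all of the mass of $\sqrt{n}(\hat\mu(X_n) - \mu_n)$ sits at $-\tau$.

```python
import numpy as np

rng = np.random.default_rng(3)

def hodges(xbar, n):
    """Hodges estimator applied to the sample mean."""
    return np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)

tau, n, reps = 2.0, 10000, 100000
mu_n = tau / np.sqrt(n)                               # local parameter sequence mu_n = tau / sqrt(n)
xbar = rng.normal(mu_n, 1.0 / np.sqrt(n), size=reps)  # sample mean under mu_n
z = np.sqrt(n) * (hodges(xbar, n) - mu_n)
print(np.mean(np.isclose(z, -tau)))                   # close to 1: point mass at -tau
```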

Is the Hodges Estimator Regular?

So, the limiting distribution of $\sqrt{n}(\hat\mu(X_n) - \mu_n)$ is also degenerate, but with point mass at $-\tau$. This is a different limiting distribution than that of $\sqrt{n}(\hat\mu(X_n) - \mu_0)$. Therefore, $\hat\mu(X_n)$ is not regular.