Z-estimators (generalized method of moments)


Consider the estimation of an unknown parameter θ in a set Θ, based on data x = (x_1, ..., x_n) ∈ R^n. Each function h(x, ·) on Θ defines a Z-estimator θ̂_n = θ̂_n(x_1, ..., x_n) as a zero of a random criterion function

    H_n(\theta) := H_n(\theta, x) := \frac{1}{n} \sum_{i \le n} h(x_i, \theta).

That is, θ̂_n is defined by the equality H_n(θ̂_n) = 0. For different choices of h we get different estimators, different functions of the data. The choice of h can be suggested by a model or by various optimality criteria. For simplicity I will consider only the case where Θ is a subset of the real line. Vector-valued parameters can be handled by taking h as a vector-valued function.

<1> Example. Suppose the data x_1, ..., x_n are modelled as independent observations from a density belonging to a family {f_θ(x) : θ ∈ Θ}. The maximum likelihood estimator (MLE) is defined as the value that maximizes the joint density p(x_1, ..., x_n, θ) = ∏_{i≤n} f_θ(x_i). If f_θ is a smooth function of θ, and if the maximum occurs at a point where ∂p/∂θ is zero, the MLE corresponds to the Z-estimator defined by

    h(x, \theta) = \frac{\partial}{\partial\theta} \log f_\theta(x).

For example, for fitting a N(θ, 1) model the function h(x, θ) = x − θ generates the MLE, which happens to coincide with the method of moments estimator. In general, if m(θ) := ∫ x f_θ(x) dx, the method of moments estimator for a one-dimensional parameter θ is defined as the solution to m(θ̂_n) = Σ_{i≤n} x_i / n, which corresponds to the function h(x, θ) := x − m(θ).

For the purposes of numerical illustration I will work with the function

<2> \bar{h}(x, \theta) = \begin{cases} x - \theta & \text{if } |x - \theta| \le 1 \\ +1 & \text{if } x > \theta + 1 \\ -1 & \text{if } x < \theta - 1 \end{cases}

I choose this particular function for two reasons: there is no simple closed-form expression for the corresponding Z-estimator θ̄_n; and similar h functions have played an important role in the modern theory of robust statistics. The first property shows why it is important to have some general theory for the behaviour of Z-estimators, to cover cases where we cannot analyze a closed-form representation. For this handout, whenever I write a bar over a function or estimator you will know that I am referring to this particular choice for h. Thus H̄_n denotes the corresponding criterion function for a sample of size n, and θ̄_n is defined by H̄_n(θ̄_n) = 0.

Suppose the data are generated as independent observations from some fixed density f. (The method also works for discrete distributions. I leave the substitution of sums for integrals to you.) It is seldom possible to calculate the exact distribution of the Z-estimator θ̄_n. But, as I will soon explain, if the function h is smooth enough in θ and the sample size n is large enough, then θ̄_n will typically have an approximately normal distribution, with variance of order 1/n.

To understand why θ̄_n behaves well for large samples from a fixed f, we first have to understand what the random criterion function H̄_n is doing. Different realizations of the data generate different H̄_n functions. For example, the following pictures were obtained by superimposing the H̄_n functions (corresponding to the h̄ from <2>) for 10 independent samples of size n = 15: from the N(0, 1) distribution on the left-hand side, and from the standard Cauchy distribution on the right-hand side.

[Figure: ten superimposed realizations of H̄_15, for N(0, 1) samples (left) and standard Cauchy samples (right).]

Look at the 10 realizations of the estimator θ̄_n (the points at which the H̄_n curves intersect the horizontal axis) in each picture. Notice the spread around the origin. The estimator θ̄_n has a distribution that depends on the joint distribution of the data.
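Because θ̄_n has no closed form, it must be found numerically. Here is a minimal sketch of that computation (my addition, not part of the handout; it assumes numpy and scipy, and the function names are my own). Since H̄_n is continuous and non-increasing in θ, a bracketing root-finder applies directly.

```python
# Sketch: computing the Z-estimator for the h-bar function from <2>.
# Assumes numpy and scipy; all names here are illustrative.
import numpy as np
from scipy.optimize import brentq

def h_bar(x, theta):
    """The function from <2>: x - theta, clipped to [-1, 1]."""
    return np.clip(x - theta, -1.0, 1.0)

def H_n(theta, x):
    """Random criterion function: the average of h_bar(x_i, theta)."""
    return np.mean(h_bar(x, theta))

def z_estimate(x):
    """Solve H_n(theta) = 0 by bracketing the root.
    At theta = min(x) - 1 every summand equals +1, and at
    theta = max(x) + 1 every summand equals -1, so the continuous,
    non-increasing H_n must cross zero in between."""
    return brentq(H_n, x.min() - 1, x.max() + 1, args=(x,))

rng = np.random.default_rng(0)
x = rng.standard_cauchy(size=20)   # a sample of size n = 20
print(z_estimate(x))               # the Z-estimator theta-bar_n
```

Any bracketing interval works so long as H̄_n changes sign across it; the interval [min(x) − 1, max(x) + 1] guarantees the values +1 and −1 at the endpoints.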

Consistency

Suppose the x_i are independent observations from some density f. For each fixed θ, the random variable H_n(θ) is an average of the n independent random variables h(x_i, θ), for i = 1, 2, ..., n. By the law of large numbers, H_n(θ) should be close to its expected value,

    H(\theta, f) := \mathbb{E}_f h(x, \theta) = \int h(x, \theta) f(x) \, dx.

For example, if f equals φ, the N(0, 1) density, with distribution function Φ(x), then the H̄(θ, φ) corresponding to the h̄ from <2> is given by

    \bar{H}(\theta, \phi) = \int_{\theta+1}^{\infty} \phi(x)\,dx - \int_{-\infty}^{\theta-1} \phi(x)\,dx + \int_{\theta-1}^{\theta+1} (x - \theta)\,\phi(x)\,dx
                          = 1 - \Phi(\theta+1) - \Phi(\theta-1) - \theta\bigl(\Phi(\theta+1) - \Phi(\theta-1)\bigr) - \bigl(\phi(\theta+1) - \phi(\theta-1)\bigr).

[Figure: plot of the limit function H̄(·, φ).]

With a little imagination you might convince yourself that each of the 10 superimposed plots for H̄_15(·) from the N(0, 1) density looks like H̄(·, φ). Perhaps the effect is more obvious in a sequence for n = 5, 10, 20:

[Figure: superimposed realizations of H̄_n for n = 5, 10, 20, from N(0, 1) samples.]

As n gets larger, H̄_n converges to H̄(·, φ). With probability tending to one, the estimator θ̄_n concentrates around the solution θ = 0 of the equation H̄(θ, φ) = 0. That is, θ̄_n converges in probability to 0 for independent samples from the N(0, 1) density.

For general h with data x_1, x_2, ... generated independently from a density f, the estimator θ̂_n converges in probability (as n → ∞) to z = z(f, h), the root (which I assume is unique) of the equation H(z, f) = 0. If the data x_1, x_2, ... are modelled as independent observations from a density belonging to a family {f_θ(x) : θ ∈ Θ}, it is traditional to consider the behaviour of θ̂_n under each f_θ. If the equation H(z, f_θ) = 0 has its solution z(f_θ, h) equal to θ, for each θ, then θ̂_n will converge in probability under the f_θ model to θ; if the data actually are generated from an f_{θ_0}, for an unknown true value θ_0, then the Z-estimator will converge to that θ_0. This consistency requirement places a constraint on h.
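The convergence of H̄_n to H̄(·, φ) can be checked directly. The sketch below (my own, not from the handout; it assumes numpy and scipy) evaluates the closed-form limit above at a few values of θ and watches the maximal discrepancy from H̄_n shrink as n grows, which is exactly the law-of-large-numbers statement behind consistency.

```python
# Sketch: compare the random criterion H-bar_n with its limit H-bar(., phi).
import numpy as np
from scipy.stats import norm

def h_bar(x, theta):
    return np.clip(x - theta, -1.0, 1.0)

def H_limit(theta):
    """Closed form: 1 - Phi(t+1) - Phi(t-1) - t*(Phi(t+1) - Phi(t-1))
       - (phi(t+1) - phi(t-1)), with t = theta."""
    a, b = theta - 1.0, theta + 1.0
    return (1 - norm.cdf(b) - norm.cdf(a)
            - theta * (norm.cdf(b) - norm.cdf(a))
            - (norm.pdf(b) - norm.pdf(a)))

rng = np.random.default_rng(1)
thetas = np.linspace(-1.0, 1.0, 5)
for n in (5, 10, 20, 1000):
    x = rng.standard_normal(n)
    H_n = np.array([np.mean(h_bar(x, t)) for t in thetas])
    print(n, np.max(np.abs(H_n - H_limit(thetas))))  # shrinks as n grows
```

Note that H_limit(0) evaluates to exactly zero, matching the root θ = 0 toward which θ̄_n concentrates.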

Asymptotic normality

How closely will θ̂_n be distributed about its limiting value z = z(f, h), for data generated independently from a density f? A Taylor expansion about z gives the answer:

    0 = H_n(\hat\theta_n) \approx H_n(z) + (\hat\theta_n - z) \, \frac{1}{n} \sum_{i \le n} h'(x_i, z),

where h' denotes the partial derivative of h with respect to θ. Solve:

    \sqrt{n}\,(\hat\theta_n - z) \approx \frac{-\sqrt{n}\, H_n(z)}{\frac{1}{n} \sum_{i \le n} h'(x_i, z)}.

You'll see in a moment why I have multiplied through by √n. By the law of large numbers, the average in the denominator has large probability of being close to

    J(f, h) := \mathbb{E}_f h'(x, z) = \int h'(x, z) f(x) \, dx,

where z is defined by the equality ∫ h(x, z) f(x) dx = 0. A similar argument shows that H_n(z) should lie close to 0 = H(z, f), which merely reconfirms that θ̂_n − z should get close to zero. We can do better. As an average of independent random variables h(x_i, z) with zero expected values, the random variable H_n(z) will have an approximate N(0, σ²(f, h)/n) distribution, where

    \sigma^2(f, h) := \mathrm{var}_f\, h(x, z) = \int h(x, z)^2 f(x) \, dx

because H(z, f) = 0. The extra factor of √n magnifies H_n(z) up to a quantity distributed roughly N(0, σ²(f, h)), leaving √n(θ̂_n − z) with an approximate normal distribution with zero mean and variance equal to σ²(f, h)/J(f, h)².

A magnified version of the picture for ten realizations of the H̄_20 function, generated from samples of size 20 with f equal to the standard Cauchy, shows what is going on.

[Figure: magnified view near the origin of ten H̄_20 realizations; roughly parallel, roughly linear curves crossing the horizontal axis close to zero.]

In the region near zero, where the Z-estimator lies with high probability, the H̄_n function is roughly linear, with slope equal to J(f, h̄), with the intercept shifted around by the random variable H̄_n(z), which has roughly a normal distribution with standard deviation of order 1/√n. For the asymptotics at the 1/√n level, the Z-estimation problem reduces to a simple linear equation, through the workings of the law of large numbers and the central limit theorem.
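To make the variance formula σ²(f, h)/J(f, h)² concrete, here is a small Monte Carlo sketch (my own illustration, not from the handout; it assumes numpy and scipy). For f = φ the root is z = 0 and h̄'(x, 0) = −1 when |x| ≤ 1, zero otherwise, so J = −P{|x| ≤ 1} and σ² = E min(x², 1); the simulated variance of √n θ̄_n should be close to σ²/J².

```python
# Sketch: check sqrt(n) * theta-bar_n against N(0, sigma^2 / J^2) for N(0,1) data.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def h_bar(x, theta):
    return np.clip(x - theta, -1.0, 1.0)

def z_estimate(x):
    f = lambda t: np.mean(h_bar(x, t))
    return brentq(f, x.min() - 1, x.max() + 1)

# J(phi, h-bar) = -P(|x| <= 1); sigma^2(phi, h-bar) = E min(x^2, 1).
p_in = norm.cdf(1) - norm.cdf(-1)
J = -p_in
sigma2 = (p_in - 2 * norm.pdf(1)) + (1 - p_in)   # E[x^2; |x|<=1] + P(|x|>1)
print("predicted asymptotic variance:", sigma2 / J**2)

rng = np.random.default_rng(2)
n, reps = 200, 2000
est = np.array([z_estimate(rng.standard_normal(n)) for _ in range(reps)])
print("empirical variance of sqrt(n)*theta-bar_n:", n * est.var())
```

The predicted value works out to about 1.11, slightly above the lower bound 1/I(θ) = 1 derived in the next section for the N(θ, 1) model.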

Optimal choice of h

Once again consider the situation where the data x_1, ..., x_n are modelled as independent observations from a density belonging to a family {f_θ(x) : θ ∈ Θ}. Let me write E_θ and var_θ, instead of E_{f_θ} and var_{f_θ}, to denote calculations carried out under the f_θ model. Thus E_θ g(x) will be shorthand for ∫ g(x) f_θ(x) dx; the x is treated both as a generic x_i and as a dummy variable of integration.

In order that the Z-estimator θ̂_n should converge to θ under the f_θ model, for every θ, we must have

<3> \mathbb{E}_\theta h(x, \theta) = \int h(x, \theta) f_\theta(x) \, dx = 0 \quad \text{for every } \theta.

The problem is to find the h function that minimizes

<4> \frac{\sigma^2(f_\theta, h)}{J(f_\theta, h)^2} = \frac{\mathbb{E}_\theta h(x, \theta)^2}{\bigl(\mathbb{E}_\theta h'(x, \theta)\bigr)^2}

for each θ, subject to the constraint <3>. I will show that the minimum is achieved when h(x, θ) equals

    \ell_\theta(x) := \frac{\partial}{\partial\theta} \log f_\theta(x) = \frac{f'_\theta(x)}{f_\theta(x)}.

That is, the lower bound for the asymptotic variance is achieved when h defines the MLE. Some authors call ℓ the score function for the model. The corresponding J(f_θ, ℓ_θ) is given by

    J(f_\theta, \ell_\theta) = \mathbb{E}_\theta \frac{\partial^2}{\partial\theta^2} \log f_\theta(x) =: -I(\theta), \qquad \text{where } I(\theta) = \mathrm{var}_\theta(\ell_\theta) = \mathbb{E}_\theta \ell_\theta^2,

the (Fisher) information function for the model.

To establish the optimality property for ℓ, we can argue as in the proof of the information inequality to derive first a lower bound for σ²(f_θ, h)/J(f_θ, h)², and then show that ℓ achieves that lower bound. Differentiate <3> with respect to θ, assuming appropriate smoothness for the densities (and ignoring the question of whether we are allowed to take the derivative inside the integral sign):

    \int h'(x, \theta) f_\theta(x) \, dx + \int h(x, \theta) f'_\theta(x) \, dx = 0.

Recognize the first integral as J(f_θ, h). Rewrite f'_θ(x) as ℓ_θ(x) f_θ(x) to recognize the second integral as E_θ h(x, θ)ℓ_θ(x), and thereby deduce that

    J(f_\theta, h) = -\mathbb{E}_\theta h(x, \theta) \ell_\theta(x)

and

    \frac{\sigma^2(f_\theta, h)}{J(f_\theta, h)^2} = \frac{\mathbb{E}_\theta h(x, \theta)^2}{\bigl(\mathbb{E}_\theta h(x, \theta)\ell_\theta(x)\bigr)^2}.

The Cauchy-Schwarz inequality asserts

    \bigl(\mathbb{E}_\theta h(x, \theta)^2\bigr)\bigl(\mathbb{E}_\theta \ell_\theta(x)^2\bigr) \ge \bigl(\mathbb{E}_\theta h(x, \theta)\ell_\theta(x)\bigr)^2,

with equality when h(x, θ) equals ℓ_θ(x) (or any nonzero multiple of it; the ratio <4> is unaffected by rescaling h). Thus

    \frac{\sigma^2(f_\theta, h)}{J(f_\theta, h)^2} \ge \frac{1}{\mathbb{E}_\theta \ell_\theta(x)^2} = \frac{1}{I(\theta)},

with equality when h(x, θ) = ℓ_θ(x), in which case σ²(f_θ, ℓ) = I(θ) = −J(f_θ, ℓ) and √n(θ̂_n − θ) is approximately N(0, 1/I(θ)) distributed under the f_θ model, for each θ.
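A quick worked check of this bound (my addition, not in the handout), using the N(θ, 1) model from Example <1>, where log f_θ(x) = −(x − θ)²/2 − log √(2π):

    \ell_\theta(x) = x - \theta, \qquad
    I(\theta) = \mathbb{E}_\theta (x - \theta)^2 = 1, \qquad
    J(f_\theta, \ell_\theta) = \mathbb{E}_\theta \ell'_\theta(x) = -1, \qquad
    \sigma^2(f_\theta, \ell_\theta) = \mathrm{var}_\theta(x - \theta) = 1,

so that σ²/J² = 1 = 1/I(θ): the sample mean, which is the MLE here, attains the bound, while the h̄ from <2> gave the strictly larger asymptotic variance of about 1.11 in the earlier sketch.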

In summary: if we require that the Z-estimator converge in probability to θ under independent sampling from f_θ, for every θ, then the asymptotic variance cannot be smaller than 1/I(θ). The asymptotic normal distribution for the MLE has variance equal to that lower bound.

Warnings

I have not been rigorous about the conditions required for the arguments leading to asymptotic optimality of the MLE amongst the class of consistent Z-estimators. For example, the argument surely fails when f_θ denotes the Uniform(0, θ) density, which is not everywhere differentiable. A completely rigorous treatment is quite difficult. The development of the rigorous theory has been a major theme in modern theoretical statistics.
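To see the Uniform(0, θ) failure concretely, here is a short simulation sketch (my addition; it assumes numpy). The MLE is max(x_i), and n(θ₀ − max x_i) converges in distribution to an Exponential with mean θ₀, so the error shrinks at rate 1/n rather than 1/√n and the limit is not normal.

```python
# Sketch: the Uniform(0, theta) MLE converges at rate n, with an
# Exponential (not normal) limit for the rescaled error.
import numpy as np

rng = np.random.default_rng(3)
theta0, n, reps = 2.0, 1000, 5000
samples = rng.uniform(0, theta0, size=(reps, n))
mle = samples.max(axis=1)          # MLE of theta for Uniform(0, theta)
err = n * (theta0 - mle)           # rescaled by n, not sqrt(n)
# P{n*(theta0 - max) > t} = (1 - t/(n*theta0))^n -> exp(-t/theta0),
# an Exponential with mean (and standard deviation) theta0.
print("mean of n*(theta0 - MLE):", err.mean())   # roughly theta0 = 2
print("std of n*(theta0 - MLE): ", err.std())    # roughly theta0 = 2
```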