Statistical Inference Based on Extremum Estimators

T. Rothenberg
Fall 2007

1 Introduction

Suppose $\theta^0$, the true value of a $p$-dimensional parameter, is known to lie in some subset $S \subseteq \mathbb{R}^p$. Often we choose to estimate $\theta^0$ by minimizing (over $\theta \in S$) some objective function $Q_n \equiv Q_n(\theta; y)$, where the random vector $y$ represents the data from a sample of size $n$. Some examples are:

1. Linear least squares regression, where $Q_n = (y - X\theta)'(y - X\theta)$ and $X\theta^0$ is the conditional expectation of the random vector $y$ given the regressor matrix $X$.

2. Nonlinear least squares (NLS), where $Q_n = [y - g(\theta; X)]'[y - g(\theta; X)]$; $g(\theta^0; X)$ is the conditional expectation of $y$ given $X$, and $g$ has a known functional form.

3. Least absolute deviation (LAD) regression, where $Q_n = \sum_i |y_i - x_i'\theta|$ and the conditional median of $y_i$ given $x_i$ is equal to $x_i'\theta^0$.

4. Maximum likelihood (ML), where $Q_n$ is minus the log of the joint probability density (or mass) function of the data in a fully specified parametric model.

5. Generalized method of moments (GMM), where $Q_n = m(\theta; y)' W_n m(\theta; y)$; $m(\theta; y)$ is a $q$-dimensional vector such that $\operatorname{plim} m(\theta^0; y) = 0$, and $W_n$ is a $q \times q$ positive definite matrix.

2 Asymptotic Properties of Extremum Estimators

Clearly, not every function $Q_n$ will lead to good estimates. We shall consider functions that have the following property: when $n$ is large, $Q_n(\theta; y)/n$ (viewed as a function of $\theta$) is with high probability very close to a nonrandom function $\bar{Q}(\theta)$ that has a pronounced minimum at $\theta^0$. Then, in large samples, it seems plausible that $\hat\theta$, the minimizer of $Q_n(\theta; y)$, should with high probability be close to $\theta^0$, the minimizer of $\bar{Q}(\theta)$. This intuition is made precise in the following consistency theorem:

Suppose $Q_n(\theta, y)$ is a continuous function of $\theta$ in the compact parameter set $S \subseteq \mathbb{R}^p$. Then, if $n^{-1}Q_n(\theta, y)$ converges in probability uniformly to a function which has a unique minimum at $\theta^0$, the value $\hat\theta$ that minimizes $Q_n(\theta, y)$ converges in probability to $\theta^0$.

Regularity assumptions on the exogenous variables and weak dependence across trials often imply that these conditions for consistency hold.

EXAMPLE: In the linear model $y = X\theta^0 + u$, suppose the $u$'s are i.i.d. with mean zero and variance $\sigma^2$, and independent of $X$. Assume further that $n^{-1}X'X$ converges in probability to a positive definite matrix $B$. For the least-squares criterion function, we have

$$n^{-1}Q_n = n^{-1}(y - X\theta)'(y - X\theta) = n^{-1}u'u - 2(\theta - \theta^0)'\,n^{-1}X'u + (\theta - \theta^0)'\,n^{-1}X'X\,(\theta - \theta^0).$$

But $n^{-1}u'u \xrightarrow{p} \sigma^2$ and $n^{-1}X'u \xrightarrow{p} 0$. Thus $\operatorname{plim}\, n^{-1}Q_n = \sigma^2 + (\theta - \theta^0)'B(\theta - \theta^0)$, where the convergence is uniform in any compact set of parameter values. Since $B$ is positive definite, this limiting function has a unique minimum at $\theta = \theta^0$.
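None of the code in these notes appears in the original handout; it is added as illustration. The following minimal Python/NumPy sketch simulates the EXAMPLE above with a single regressor (so that $B = E[x_i^2] = 1$) and compares the sample criterion $n^{-1}Q_n(\theta)$ on a grid with the nonrandom limit $\sigma^2 + (\theta - \theta^0)^2 B$. As $n$ grows, the maximum deviation shrinks and the grid minimizer settles near $\theta^0$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, sigma = 2.0, 1.0          # true parameter and error s.d.
grid = np.linspace(0.0, 4.0, 81)  # a compact parameter set S

for n in (50, 500, 5000):
    x = rng.normal(size=n)                 # scalar regressor, B = E[x^2] = 1
    u = rng.normal(scale=sigma, size=n)
    y = x * theta0 + u
    # sample criterion Q_n(theta)/n evaluated on the grid
    Qbar_n = np.array([np.mean((y - x * t) ** 2) for t in grid])
    # nonrandom limit: sigma^2 + (theta - theta0)^2 * B
    limit = sigma**2 + (grid - theta0) ** 2
    print(n, "max deviation:", np.max(np.abs(Qbar_n - limit)).round(3),
          "grid argmin:", grid[np.argmin(Qbar_n)].round(2))
```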

Given consistency and employing linearization methods, we can often show that extremum estimators are approximately normal when the sample size is large. Suppose $Q_n(\theta; y)$ is twice differentiable in $\theta$ with first derivative vector $S_n(\theta)$ and continuous second derivative matrix $H_n(\theta)$. ($S_n$ is often called the score and $H_n$ the hessian for $Q_n$. Of course both will also depend on the sample data $y$, but this dependency will be suppressed to simplify the notation.) If $\hat\theta$ is a regular interior minimum of $Q_n$, then $S_n(\hat\theta) = 0$ and $H_n(\hat\theta)$ is positive definite. Using the mean value theorem, we can write

$$0 = n^{-1/2}S_n(\hat\theta) = n^{-1/2}S_n(\theta^0) + \bar{H}\,\sqrt{n}\,(\hat\theta - \theta^0)$$

where $\bar{H}$ is the normalized hessian $n^{-1}H_n$ with each element evaluated at a value somewhere between $\hat\theta$ and $\theta^0$. Suppose that $n^{-1/2}S_n(\theta^0)$ converges in distribution to a normal random vector having mean zero and $p \times p$ covariance matrix $A$. Then, if $\bar{H}$ converges in probability to a $p \times p$ positive definite matrix $B$, it follows that $\sqrt{n}(\hat\theta - \theta^0)$ has the same limiting distribution as $-B^{-1}n^{-1/2}S_n(\theta^0)$ and hence

$$\sqrt{n}(\hat\theta - \theta^0) \xrightarrow{d} N(0,\; B^{-1}AB^{-1}). \qquad (1)$$

In large samples, one might then act as though $\hat\theta$ were normal with mean $\theta^0$ and variance matrix $n^{-1}\hat{B}^{-1}\hat{A}\hat{B}^{-1}$, where $\hat{A}$ and $\hat{B}$ are consistent estimates of $A$ and $B$.

To prove that (1) is valid, we must show that $n^{-1/2}S_n(\theta^0)$ tends to a normal random variable and that $\bar{H}$ tends to a nonrandom, full-rank matrix. In many problems $S_n(\theta^0)$ is the sum of independent (or weakly dependent) mean-zero random variables; standard central limit theorems can then be employed. Demonstrating the convergence of $\bar{H}$ is usually more difficult since we typically do not have a closed-form expression for $\bar{H}$. However, continuity arguments and the law of large numbers often can be employed to show that $\bar{H}$ converges in probability to $B = \lim n^{-1}E\,H_n(\theta^0)$. Rigorous proofs of the asymptotic normality of extremum estimators will not be attempted here. The LAD case, where $Q_n$ is nondifferentiable, is particularly tricky since the linearization no longer can be obtained by the mean value theorem. In contrast, the OLS case, where $Q_n$ is quadratic, is simple since then $\bar{H}$ does not depend on $\theta$.

3 Computing Extremum Estimates

In practice, the computational problem of finding the minimum of $Q_n(\theta; y)$ is nontrivial. Computer-intensive iterative methods are often successful. If $Q_n(\theta; y)$ is twice differentiable in $\theta$, the quadratic Taylor series approximation around a point $\theta = a^0$ is given by

$$Q_n(\theta; y) \approx Q_n(a^0; y) + (\theta - a^0)'S^0 + \tfrac{1}{2}(\theta - a^0)'H^0(\theta - a^0)$$

where $S^0$ is the gradient of $Q_n$ and $H^0$ is the matrix of second derivatives, both evaluated at $\theta = a^0$. The Newton-Raphson algorithm for minimizing $Q_n$ starts with a trial value $a^0$ and, if $H^0$ is positive definite, minimizes the quadratic approximation, obtaining $a^1 = a^0 - (H^0)^{-1}S^0$. [If $H^0$ is not positive definite, one can replace it with $H^0 + cI$ for some small number $c$.] The next step is to evaluate $S$ and $H$ at $\theta = a^1$ and repeat the calculation. Using the recursion $a^{r+1} = a^r - H_r^{-1}S_r$, one continues until $a^{r+1} \approx a^r$. Then, as long as $H_r$ is positive definite, one uses $a^r$ as the estimate $\hat\theta$ since $S(a^r) \approx 0$. Of course, this yields at best only a local minimum; one must try various starting values $a^0$ to be confident that one has truly minimized $Q_n$.

The Gauss-Newton algorithm is a variant of Newton-Raphson for the special case where $Q_n$ can be written as $e'e/2$ and the elements of the $n$-dimensional vector $e$ are nonlinear functions of $\theta$. Defining $Z$ to be the $n \times p$ matrix $-\partial e/\partial\theta'$, the gradient vector $S$ can be written as $-Z'e$. Moreover, the hessian $H$ is equal to $Z'Z$ plus a term that tends to be smaller. The G-N algorithm drops this smaller term and uses the recursion $a^{r+1} = a^r + (Z_r'Z_r)^{-1}Z_r'e_r$, where $Z$ and $e$ are evaluated at the previous estimate $a^r$. Note that the G-N algorithm can be implemented by a sequence of least squares regressions. Both recursions are sketched in code below.
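First the Newton-Raphson recursion. The sketch is illustrative rather than canonical: `grad` and `hess` are hypothetical user-supplied callables returning $S(\theta)$ and $H(\theta)$, and the bracketed suggestion of replacing an indefinite hessian by $H + cI$ is implemented via a Cholesky check.

```python
import numpy as np

def newton_raphson(grad, hess, a0, tol=1e-8, max_iter=100, c=1e-4):
    """Minimize Q via the recursion a_{r+1} = a_r - H_r^{-1} S_r (a sketch)."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        S, H = grad(a), hess(a)
        try:
            np.linalg.cholesky(H)            # is H positive definite?
        except np.linalg.LinAlgError:
            H = H + c * np.eye(len(a))       # fallback: use H + cI
        a_new = a - np.linalg.solve(H, S)
        if np.max(np.abs(a_new - a)) < tol:  # stop when a_{r+1} is close to a_r
            return a_new
        a = a_new
    return a  # at best a local minimum; try several starting values a0
```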

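And the Gauss-Newton recursion, under the same caveats: each step is one least squares regression of the current residual vector $e_r$ on the derivative matrix $Z_r$, with `resid` and `jac` standing for hypothetical user-supplied functions returning $e(\theta)$ and $Z(\theta) = -\partial e/\partial\theta'$. For NLS with $e = y - g(\theta; X)$, $Z$ is simply $\partial g/\partial\theta'$.

```python
import numpy as np

def gauss_newton(resid, jac, a0, tol=1e-8, max_iter=100):
    """Minimize e'e/2 via a_{r+1} = a_r + (Z'Z)^{-1} Z'e (a sketch)."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        e, Z = resid(a), jac(a)
        # one least squares regression of e_r on Z_r gives the step
        step, *_ = np.linalg.lstsq(Z, e, rcond=None)
        a_new = a + step
        if np.max(np.abs(a_new - a)) < tol:
            return a_new
        a = a_new
    return a
```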
4 Two-step Estimators

We often encounter problems where $Q_n(\theta)$ is difficult to minimize because some elements of $\theta$ enter in a complicated way. In those cases, a two-step estimation procedure is often employed. Suppose the parameter vector is partitioned into two parts, $\theta = (\theta_1, \theta_2)$, and that $\theta_1$ enters into $Q_n$ in a simple way so that, if $\theta_2$ were known, $Q_n$ could easily be minimized over $\theta_1$. That is, defining the gradient of $Q_n$ with respect to $\theta_1$ by $S_1$, we assume that the equation $S_1(\theta_1, \theta_2) = 0$ can easily be solved as $\theta_1 = h(\theta_2; y)$. If one could find a simple estimate of $\theta_2$, say $\tilde\theta_2$, one might estimate $\theta_1$ by $\tilde\theta_1 = h(\tilde\theta_2; y)$. That is, we would estimate $\theta_1$ by minimizing $Q_n(\theta_1, \tilde\theta_2)$.

If both the (computationally difficult) true extremum estimator $\hat\theta$ and the simple estimator $\tilde\theta_2$ are consistent and jointly asymptotically normal, then it can be shown that the estimator $\tilde\theta_1$ is also consistent and asymptotically normal. Its asymptotic variance will typically depend on the asymptotic variance of $\tilde\theta_2$ and can be computed by linearizing the first-order condition $S_1(\tilde\theta_1, \tilde\theta_2) = 0$. If, for example,

$$n^{-1/2}S_1(\tilde\theta_1, \tilde\theta_2) = n^{-1/2}S_1(\theta_1^0, \theta_2^0) + B_{11}\sqrt{n}\,(\tilde\theta_1 - \theta_1^0) + B_{12}\sqrt{n}\,(\tilde\theta_2 - \theta_2^0) + o_p(1),$$

then the asymptotic variance of $\tilde\theta_1$ can be calculated from

$$\sqrt{n}\,(\tilde\theta_1 - \theta_1^0) = -B_{11}^{-1}\left[n^{-1/2}S_1(\theta_1^0, \theta_2^0) + B_{12}\sqrt{n}\,(\tilde\theta_2 - \theta_2^0)\right] + o_p(1)$$

as long as one knows the joint limiting distribution of $n^{-1/2}S_1(\theta_1^0, \theta_2^0)$ and $\sqrt{n}\,(\tilde\theta_2 - \theta_2^0)$.
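To make the two-step idea concrete, here is a small illustrative sketch (not from the notes) for the model $y_i = \theta_1 x_i^{\theta_2} + u_i$. With $\theta_2$ fixed, $S_1 = 0$ is a single normal equation, so $h(\theta_2; y)$ has a closed form; the crude log-log regression in the first step merely stands in for whatever simple estimator $\tilde\theta_2$ a real application provides (it is only approximately consistent here, since it ignores the additive error).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0.5, 2.0, size=n)
theta1_0, theta2_0 = 1.5, 0.8
y = theta1_0 * x**theta2_0 + rng.normal(scale=0.2, size=n)

# Step 1: a simple (and here only approximately consistent) estimate of
# theta_2 from a log-log regression, ignoring the additive error.
mask = y > 0
slope, _ = np.polyfit(np.log(x[mask]), np.log(y[mask]), 1)
theta2_tilde = slope

# Step 2: with theta_2 fixed at theta2_tilde, S_1 = 0 is one OLS normal
# equation, so theta_1 = h(theta_2; y) is available in closed form.
f = x**theta2_tilde
theta1_tilde = (f @ y) / (f @ f)
print(round(theta2_tilde, 3), round(theta1_tilde, 3))
```

Note that standard errors for $\tilde\theta_1$ computed as if $\theta_2$ were known would be wrong; the linearization above shows exactly how the first-step noise $\sqrt{n}\,(\tilde\theta_2 - \theta_2^0)$ enters.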

5 Asymptotic Tests Based on Extremum Estimators

Consider the null hypothesis that the true parameter value $\theta^0$ satisfies the equation $g(\theta^0) = 0$, where $g$ is a vector of $q$ smooth functions with continuous $q \times p$ Jacobian matrix $G(\theta) = \partial g/\partial\theta'$. For notational convenience we define $G \equiv G(\theta^0)$. Suppose we have estimated $\theta$ by minimizing some objective function $Q_n(\theta; y)$ as in section 2 and that the standardized estimator $\sqrt{n}(\hat\theta - \theta^0)$ is asymptotically $N(0, V)$, where $V = B^{-1}AB^{-1}$. ($A$ and $B$ are defined in that section.) Then, using the delta method, we find

$$\sqrt{n}\,[g(\hat\theta) - g(\theta^0)] \approx G\sqrt{n}\,(\hat\theta - \theta^0) \xrightarrow{d} N(0,\; GVG').$$

Under the null hypothesis, $n\,g(\hat\theta)'(GVG')^{-1}g(\hat\theta)$ is asymptotically $\chi^2(q)$. Replacing $G$ with the consistent estimate $\hat{G} \equiv G(\hat\theta)$ does not affect the asymptotic distribution. Hence, for some consistent estimate $\hat{V}$, we might reject the hypothesis that $g(\theta^0) = 0$ if the Wald statistic

$$W_n \equiv n\,g(\hat\theta)'(\hat{G}\hat{V}\hat{G}')^{-1}g(\hat\theta) \qquad (2)$$

is larger than the 95% quantile of a $\chi^2(q)$ distribution.

Sometimes solving for $\hat\theta$ is computationally difficult, and one seeks a way to test the hypothesis that $g(\theta^0) = 0$ that avoids this computation. Suppose there exists an easy-to-calculate estimate $\tilde\theta$ that satisfies $g(\tilde\theta) = 0$. Suppose further that, when the null hypothesis is true, $\tilde\theta$ is consistent, asymptotically normal, and the usual linear approximations are valid:

$$\sqrt{n}\,[g(\hat\theta) - g(\tilde\theta)] = G\sqrt{n}\,(\hat\theta - \tilde\theta) + o_p(1)$$
$$n^{-1/2}[S_n(\hat\theta) - S_n(\tilde\theta)] = B\sqrt{n}\,(\hat\theta - \tilde\theta) + o_p(1).$$

Assuming $B$ is a continuous function of $\theta^0$, define $\tilde{G} = G(\tilde\theta)$ and $\tilde{B} = B(\tilde\theta)$. Since $g(\tilde\theta)$ and $S_n(\hat\theta)$ are both zero vectors, we see that $W_n$ is asymptotically equivalent under the null hypothesis to

$$n\,(\hat\theta - \tilde\theta)'\hat{G}'(\hat{G}\hat{V}\hat{G}')^{-1}\hat{G}(\hat\theta - \tilde\theta) \qquad (3)$$

and to

$$n^{-1}S_n(\tilde\theta)'\tilde{B}^{-1}\tilde{G}'[\tilde{G}\tilde{V}\tilde{G}']^{-1}\tilde{G}\tilde{B}^{-1}S_n(\tilde\theta). \qquad (4)$$

Note that (4) can be computed without solving the minimization problem.

In the linear regression model where $Q_n = \frac{1}{2}(y - X\theta)'(y - X\theta)$, we find $H = X'X$ and $\operatorname{var}[S_n(\theta)] = \sigma^2 X'X$. Suppose the null hypothesis is the linear constraint $G\theta^0 = 0$. Estimating $A$ by $s^2X'X/n$ and $B$ by $X'X/n$, we obtain

$$W_n = \hat\theta'G'[G(X'X)^{-1}G']^{-1}G\hat\theta / s^2,$$

which is $q$ times the usual F-statistic. This feature of LS regression (that the variance matrix for the score is proportional to the expectation of the hessian) holds in many estimation problems. When it occurs, not only do (2), (3) and (4) simplify, but some additional asymptotically equivalent expressions for the Wald statistic are also available.

Consider the problem of minimizing $Q_n(\theta; y)$ subject to the constraint $g(\theta) = 0$. The solution can be used for our $\tilde\theta$. The first-order condition for an interior minimum is $S_n(\tilde\theta) = -\tilde{G}'\lambda$, which implies

$$\lambda = -(\tilde{G}\tilde{B}^{-1}\tilde{G}')^{-1}\tilde{G}\tilde{B}^{-1}S_n(\tilde\theta) \quad\text{and}\quad S_n(\tilde\theta) = \tilde{G}'(\tilde{G}\tilde{B}^{-1}\tilde{G}')^{-1}\tilde{G}\tilde{B}^{-1}S_n(\tilde\theta).$$

If $A = cB$ and $\tilde\theta$ is the constrained minimizer of $Q_n$, the test statistics (3) and (4) simplify to

$$n\,(\hat\theta - \tilde\theta)'\hat{B}(\hat\theta - \tilde\theta)/\hat{c} \qquad (3')$$

and

$$n^{-1}S_n(\tilde\theta)'\tilde{B}^{-1}S_n(\tilde\theta)/\hat{c}. \qquad (4')$$

Furthermore, a Taylor series expansion of the statistic

$$2\,[Q_n(\tilde\theta) - Q_n(\hat\theta)]/\hat{c} \qquad (5)$$

shows that it too is asymptotically equivalent to (3') and hence to $W_n$. Thus, if $A = cB$ for some nonzero scalar $c$, $\operatorname{plim}\hat{c} = c$, and $\tilde\theta$ is the constrained minimizer of $Q_n$, we have four asymptotically equivalent statistics that might be used for testing $g(\theta^0) = 0$:

(a) a quadratic form in $g(\hat\theta)$:
$$n\,g(\hat\theta)'(\hat{G}\hat{B}^{-1}\hat{G}')^{-1}g(\hat\theta)/\hat{c}$$

(b) a quadratic form in the estimator difference $\hat\theta - \tilde\theta$:
$$n\,(\hat\theta - \tilde\theta)'\hat{B}(\hat\theta - \tilde\theta)/\hat{c}$$

(c) a quadratic form in the score (which is the same as a quadratic form in the Lagrange multiplier $\lambda$):
$$n^{-1}S_n(\tilde\theta)'\tilde{B}^{-1}S_n(\tilde\theta)/\hat{c} = n^{-1}\lambda'\tilde{G}\tilde{B}^{-1}\tilde{G}'\lambda/\hat{c}$$

(d) the difference in constrained and unconstrained minimized objective functions (multiplied by $2/\hat{c}$):
$$2\,[Q_n(\tilde\theta; y) - Q_n(\hat\theta; y)]/\hat{c}.$$

(A numerical comparison of these four statistics appears below.)
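The four statistics are easy to compare numerically. The sketch below (illustrative, not from the notes) tests a single linear constraint in a Gaussian linear regression, taking $\hat{c} = s^2$ since $A = \sigma^2 B$ there. In this linear case the four values coincide exactly, anticipating the remark that follows.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([1.0, 0.5, 0.0])     # the null g(theta) = theta_3 = 0 holds
y = X @ theta_true + rng.normal(size=n)

G = np.array([[0.0, 0.0, 1.0]])                               # q = 1 constraint
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]              # unrestricted LS
theta_til = np.r_[np.linalg.lstsq(X[:, :2], y, rcond=None)[0], 0.0]  # restricted LS

B = X.T @ X / n                                   # estimate of B
Binv = np.linalg.inv(B)
c_hat = np.sum((y - X @ theta_hat) ** 2) / (n - p)  # s^2 estimates c = sigma^2
S_til = -X.T @ (y - X @ theta_til)                # score at restricted estimate
Q = lambda t: 0.5 * np.sum((y - X @ t) ** 2)      # objective Q_n

g = G @ theta_hat
d = theta_hat - theta_til
wald  = n * g @ np.linalg.inv(G @ Binv @ G.T) @ g / c_hat   # (a) Wald
diff  = n * d @ B @ d / c_hat                               # (b) difference
score = S_til @ Binv @ S_til / (n * c_hat)                  # (c) score / LM
objdf = 2 * (Q(theta_til) - Q(theta_hat)) / c_hat           # (d) objective diff
print(np.round([wald, diff, score, objdf], 4))  # identical in the linear case
```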

Although our discussion has concerned only the null distribution of these tests, the asymptotic equivalence holds also under nearby alternatives. Exact equivalence occurs in the linear regression case because there $H$ and $G$ are nonrandom and do not depend on the unknown $\theta^0$. In the general case, where $H$ may be random and both may depend on $\theta^0$, we find only asymptotic equivalence. There are many different ways to consistently estimate $B$, including the hessians $n^{-1}H_n(\hat\theta)$ and $n^{-1}H_n(\tilde\theta)$ as well as the analytic expression $\lim n^{-1}E\,H_n(\theta)$ evaluated at $\hat\theta$ or $\tilde\theta$. Thus there are really lots of asymptotically equivalent test statistics available. Computational convenience is often used as a basis for choice among asymptotically equivalent tests. However, if $p$ and $q$ are large, the asymptotic approximations are sometimes poor. It is usually wise to perform some simulations to verify that the chosen test statistic has approximately the correct small-sample rejection probability under the null hypothesis.

When $Q_n$ is minus the log likelihood function, we find that $A = B$, since $A$ and $B$ are the alternative expressions for the limiting information matrix; (d) is then the likelihood ratio statistic and (c) is the score statistic. When the correct weighting matrix is used in the quadratic form defining GMM, we also have $A$ proportional to $B$ and a choice of tests.

6 Score Tests as Diagnostics

The following type of problem often occurs in econometrics. We postulate a fairly simple probability model for the data but entertain the possibility that a more complicated model may be needed. Let $Q_n(\theta_1, \theta_2)$ be the objective function assuming the complicated model is correct; $\theta_1$ is the $p$-dimensional parameter vector in the simple model and $\theta_2$ is the $q$-dimensional vector of additional parameters needed to cope with the complication. The simple model being correct is equivalent to the parametric hypothesis that $\theta_2 = 0$. Thus, before publishing estimates of $\theta_1$ based on the simple model, one might want to check that $\theta_2$ is really close to zero. A Wald or likelihood ratio test would require actually estimating the complicated model. The score test does not and is therefore commonly used as a diagnostic.

Assuming that $A = B$, the score test statistic for $\theta_2 = 0$ is $n^{-1}S(\tilde\theta)'\tilde{B}^{-1}S(\tilde\theta)$, where $\tilde\theta$ minimizes $Q_n$ subject to the constraint. The score vector $S$ can be partitioned into two subvectors, say $S_1$ and $S_2$. But, when evaluated at the constrained estimate $\tilde\theta$, $S_1$ must be zero (since that is the first-order condition for minimizing $Q_n$ subject to $\theta_2 = 0$). Thus the test statistic can be written as $S_2(\tilde\theta)'CS_2(\tilde\theta)$, where $C$ is $n^{-1}$ times the $q \times q$ lower right-hand block of $\tilde{B}^{-1}$. Using $H(\tilde\theta)$ as an estimate of $nB$ and partitioning conformably, a natural estimate of $C$ is $(\tilde{H}_{22} - \tilde{H}_{21}\tilde{H}_{11}^{-1}\tilde{H}_{12})^{-1}$.

In other words, to test the hypothesis that the simple model is valid: first, compute $\tilde\theta_1$, the MLE for the parameters of the simple model; second, evaluate the score $S_2$ at $\tilde\theta = (\tilde\theta_1, 0)$; third, compute the test statistic $S_2(\tilde\theta)'(\tilde{H}_{22} - \tilde{H}_{21}\tilde{H}_{11}^{-1}\tilde{H}_{12})^{-1}S_2(\tilde\theta)$. In many examples, the information matrix is block diagonal, implying $\operatorname{plim} n^{-1}H_{12} = 0$; then the test statistic can be simplified to $S_2(\tilde\theta)'\tilde{H}_{22}^{-1}S_2(\tilde\theta)$. In other words, when the information matrix is block diagonal, one can test $\theta_2 = 0$ by pretending that $\theta_1$ were known and equal to the estimate $\tilde\theta_1$.
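As a concrete instance of the three-step recipe, the following illustrative sketch (not from the notes) tests for omitted regressors in a linear model: the "complication" is an extra regressor block `X2`, $Q_n$ is the least squares criterion so that $H = X'X$, and because $A = \sigma^2 B$ rather than $A = B$ here, the statistic is divided by the restricted estimate $s^2$ of $\sigma^2$ before comparison with a $\chi^2(q)$ quantile.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])   # simple model
X2 = rng.normal(size=(n, 2))                             # q = 2 extra regressors
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=n)       # theta_2 = 0 is true

# Step 1: fit the simple model only.
theta1_til = np.linalg.lstsq(X1, y, rcond=None)[0]
e_til = y - X1 @ theta1_til                  # restricted residuals
s2 = e_til @ e_til / (n - X1.shape[1])       # estimate of sigma^2

# Step 2: score for theta_2 at (theta1_til, 0); here S_2 = -X2'e.
S2 = -X2.T @ e_til

# Step 3: partitioned hessian blocks of H = X'X for X = [X1 X2].
H11, H12, H22 = X1.T @ X1, X1.T @ X2, X2.T @ X2
Cinv = H22 - H12.T @ np.linalg.inv(H11) @ H12
lm = S2 @ np.linalg.solve(Cinv, S2) / s2     # compare with chi^2(2) quantile
print(round(lm, 3))
```

Under the null the printed value should be an unremarkable draw from (approximately) a $\chi^2(2)$ distribution; adding a genuine `X2` effect to `y` makes it large.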