T. Rothenberg, Fall 2007

Statistical Inference Based on Extremum Estimators

1 Introduction

Suppose θ0, the true value of a p-dimensional parameter, is known to lie in some subset S of R^p. Often we choose to estimate θ0 by minimizing (over θ ∈ S) some objective function Q_n ≡ Q_n(θ; y), where the random vector y represents the data from a sample of size n. Some examples are:

1. Linear least squares regression, where Q_n = (y − Xθ)'(y − Xθ) and Xθ0 is the conditional expectation of the random vector y given the regressor matrix X.

2. Nonlinear least squares (NLS), where Q_n = [y − g(θ; X)]'[y − g(θ; X)], g(θ0; X) is the conditional expectation of y given X, and g has a known functional form.

3. Least absolute deviation (LAD) regression, where Q_n = Σ_i |y_i − x_i'θ| and the conditional median of y_i given x_i is equal to x_i'θ0.

4. Maximum likelihood (ML), where Q_n is minus the log of the joint probability density (or mass) function of the data in a fully specified parametric model.

5. Generalized method of moments (GMM), where Q_n = m(θ; y)'W m(θ; y), m(θ; y) is a q-dimensional vector such that plim n⁻¹m(θ0; y) = 0, and W is a q × q positive definite matrix.

2 Asymptotic Properties of Extremum Estimators

Clearly, not every function Q_n will lead to good estimates. We shall consider functions that have the following property: when n is large, Q_n(θ; y)/n (viewed as a function of θ) is with high probability very close to a nonrandom function Q̄(θ) that has a pronounced minimum at θ0. Then, in large samples, it seems plausible that θ̂, the minimizer of Q_n(θ; y), should with high probability be close to θ0, the minimizer of Q̄(θ). This intuition is made precise in the following consistency theorem:

Suppose Q_n(θ, y) is a continuous function of θ in the compact parameter set S ⊂ R^p. Then, if Q_n(θ, y)/n converges in probability uniformly to a function Q̄(θ) which has a unique minimum at θ0, the value θ̂ that minimizes Q_n(θ, y) converges in probability to θ0.

Regularity assumptions on the exogenous variables and weak dependence across trials often imply that these conditions for consistency hold.

EXAMPLE: In the linear model y = Xθ0 + u, suppose the u's are i.i.d. with mean zero and variance σ², and independent of X. Assume further that X'X/n converges in probability to a positive definite matrix B. For the least-squares criterion function, we have

Q_n/n = (y − Xθ)'(y − Xθ)/n = u'u/n − 2(θ − θ0)'X'u/n + (θ − θ0)'(X'X/n)(θ − θ0).

But u'u/n →p σ² and X'u/n →p 0. Thus plim Q_n/n = σ² + (θ − θ0)'B(θ − θ0), where the convergence is uniform in any compact set of parameter values. Since B is positive definite, this limiting function has a unique minimum at θ = θ0.
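As a numerical check on this example, the sketch below (my own simulated design; the Gaussian regressors, `sigma`, and `theta0` are illustrative assumptions, not from the notes) verifies that n⁻¹Q_n(θ) settles down to σ² + (θ − θ0)'B(θ − θ0) and that the minimizer is close to θ0:

```python
import numpy as np

# Simulated design: x_i ~ N(0, I_2) so B = plim X'X/n = I_2,
# and u_i ~ N(0, sigma^2) independent of X.
rng = np.random.default_rng(0)
n, sigma = 200_000, 1.0
theta0 = np.array([1.0, -2.0])
X = rng.standard_normal((n, 2))
u = sigma * rng.standard_normal(n)
y = X @ theta0 + u

def avg_criterion(theta):
    """n^{-1} Q_n(theta) for the least-squares criterion."""
    e = y - X @ theta
    return e @ e / n

# Limit function: sigma^2 + (theta - theta0)' B (theta - theta0), with B = I here.
d = np.array([0.5, -0.5])
limit_at_theta0 = sigma**2
limit_at_shifted = sigma**2 + d @ d
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```

With n this large, both evaluations of the average criterion sit within sampling noise of the limit function, and θ̂ is within a few thousandths of θ0.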
Given consistency and employing linearization methods, we can often show that extremum estimators are approximately normal when the sample size is large. Suppose Q_n(θ; y) is twice differentiable in θ with first derivative vector S_n(θ) and continuous second derivative matrix H_n(θ). (S_n is often called the score and H_n the hessian for Q_n. Of course both will also depend on the sample data y, but this dependency will be suppressed to simplify the notation.) If θ̂ is a regular interior minimum of Q_n, then S_n(θ̂) = 0 and H_n(θ̂) is positive definite. Using the mean value theorem, we can write

0 = n^{-1/2} S_n(θ̂) = n^{-1/2} S_n(θ0) + (H̄_n/n) √n(θ̂ − θ0)

where H̄_n is the hessian H_n with each element evaluated at a value somewhere between θ̂ and θ0. Suppose that n^{-1/2} S_n(θ0) converges in distribution to a normal random vector having mean zero and p × p covariance matrix A. Then, if H̄_n/n converges in probability to a p × p positive definite matrix B, it follows that √n(θ̂ − θ0) has the same limiting distribution as −B⁻¹ n^{-1/2} S_n(θ0) and hence

√n(θ̂ − θ0) →d N(0, B⁻¹AB⁻¹).   (1)

In large samples, one might then act as though θ̂ were normal with mean θ0 and variance matrix n⁻¹B̂⁻¹ÂB̂⁻¹, where Â and B̂ are consistent estimates of A and B. To prove that (1) is valid, we must show that n^{-1/2} S_n(θ0) tends to a normal random variable and that H̄_n/n tends to a nonrandom, full-rank matrix. In many problems S_n(θ0) is the sum of independent (or weakly dependent) mean-zero random variables; standard central limit theorems can then be employed. Demonstrating the convergence of H̄_n/n is usually more difficult since we typically do not have a closed form expression for H̄_n. However, continuity arguments and the law of large numbers often can be employed to show that H̄_n/n converges in probability to B = lim n⁻¹E H_n(θ0). Rigorous proofs of the asymptotic normality of extremum estimators will not be attempted here. The LAD case where Q_n is nondifferentiable is particularly tricky since the linearization no longer can be obtained by the mean value theorem. In contrast, the OLS case where Q_n is quadratic is simple since then H̄_n does not depend on θ.
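In the OLS case the linearization is exact, which makes it easy to check the sandwich formula numerically. The sketch below (my own simulated design, not from the notes) takes Q_n = (y − Xθ)'(y − Xθ), so S_n(θ) = −2X'(y − Xθ) and H_n = 2X'X; the factors of 2 cancel and n⁻¹B̂⁻¹ÂB̂⁻¹ reproduces the familiar s²(X'X)⁻¹:

```python
import numpy as np

# Simulated linear model y = X theta0 + u with i.i.d. N(0,1) errors.
rng = np.random.default_rng(1)
n = 500
theta0 = np.array([0.5, 2.0])
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
u = rng.standard_normal(n)
y = X @ theta0 + u

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
S0 = -2 * X.T @ (y - X @ theta0)         # score at theta0
H = 2 * X.T @ X                          # hessian, constant in theta
lin_step = -np.linalg.solve(H, S0)       # -H^{-1} S_n(theta0): exact here

s2 = np.sum((y - X @ theta_hat) ** 2) / (n - 2)
A_hat = 4 * s2 * X.T @ X / n             # estimate of var(n^{-1/2} S_n(theta0))
B_hat = 2 * X.T @ X / n                  # estimate of plim H_n/n
Binv = np.linalg.inv(B_hat)
sandwich = Binv @ A_hat @ Binv / n       # n^{-1} B^{-1} A B^{-1}
classical = s2 * np.linalg.inv(X.T @ X)  # textbook OLS variance estimate
```

Because H_n does not depend on θ, the mean-value step equals θ̂ − θ0 exactly, and the two variance estimates agree to machine precision.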
3 Computing Extremum Estimates

In practice, the computational problem of finding the minimum of Q_n(θ; y) is nontrivial. Computer intensive iterative methods are often successful. If Q_n(θ; y) is twice differentiable in θ, the quadratic Taylor series approximation around a point θ = a0 is given by

Q_n(θ; y) ≈ Q_n(a0; y) + (θ − a0)'S0 + ½(θ − a0)'H0(θ − a0)

where S0 is the gradient of Q_n and H0 is the matrix of second derivatives, both evaluated at θ = a0. The Newton-Raphson algorithm for minimizing Q_n starts with a trial value a0 and, if H0 is positive definite, minimizes the quadratic approximation, obtaining a1 = a0 − H0⁻¹S0. [If H0 is not positive definite, one can replace it with H0 + cI for some small number c.] The next step is to evaluate S and H at θ = a1 and repeat the calculation. Using the recursion a_{r+1} = a_r − H_r⁻¹S_r, one continues until a_{r+1} ≈ a_r. Then, as long as H_r is positive definite, one uses a_r as the estimate θ̂ since S(a_r) ≈ 0. Of course, this yields at best only a local minimum; one must try various starting values a0 to be confident that one has truly minimized Q_n.

The Gauss-Newton algorithm is a variant of Newton-Raphson for the special case where Q_n can be written as e'e/2 and the elements of the n-dimensional vector e are nonlinear functions of θ. Defining Z to be the n × p matrix ∂e/∂θ', the gradient vector S can be written as Z'e. Moreover, the hessian H is equal to Z'Z plus a term that tends to be smaller. The G-N algorithm drops this smaller term and uses the recursion a_{r+1} = a_r − (Z_r'Z_r)⁻¹Z_r'e_r, where Z and e are evaluated at the previous estimate a_r. Note that the G-N algorithm can be implemented by a sequence of least squares regressions.

4 Two-step Estimators

We often encounter problems where Q_n(θ) is difficult to minimize because some elements of θ enter in a complicated way. In those cases, a two-step estimation procedure is often employed. Suppose the parameter vector is partitioned into two parts θ = (θ1, θ2) and that θ1 enters into Q_n in a simple way so that, if θ2 were known, Q_n could easily be minimized over θ1. That is, defining the gradient of Q_n with respect to θ1 by S1, we assume that the equation S1(θ1, θ2) = 0 can easily be solved as θ1 = h(θ2; y). If one could find a simple estimate of θ2, say θ̃2, one might estimate θ1 by θ̃1 = h(θ̃2; y). That is, we would estimate θ1 by minimizing Q_n(θ1, θ̃2). If both the (computationally difficult) true extremum estimator θ̂ and the simple estimator θ̃2 are consistent and jointly asymptotically normal, then it can be shown that the estimator θ̃1 is also consistent and asymptotically normal. Its asymptotic variance will typically depend on the asymptotic variance of θ̃2 and can be computed by linearizing the first-order condition S1(θ̃1, θ̃2) = 0. If, for example,

0 = n^{-1/2} S1(θ̃1, θ̃2) = n^{-1/2} S1(θ1⁰, θ2⁰) + B1 √n(θ̃1 − θ1⁰) + B2 √n(θ̃2 − θ2⁰) + o_p(1)

then the asymptotic variance of θ̃1 can be calculated from

√n(θ̃1 − θ1⁰) = −B1⁻¹[n^{-1/2} S1(θ1⁰, θ2⁰) + B2 √n(θ̃2 − θ2⁰)] + o_p(1)

as long as one knows the joint limiting distribution of n^{-1/2} S1(θ1⁰, θ2⁰) and √n(θ̃2 − θ2⁰).

5 Asymptotic Tests Based on Extremum Estimators

Consider the null hypothesis that the true parameter value θ0 satisfies the equation g(θ0) = 0, where g is a vector of q smooth functions with continuous q × p Jacobian matrix G(θ) = ∂g/∂θ'. For notational convenience we define G ≡ G(θ0).
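Before turning to tests, the Gauss-Newton recursion of section 3 can be sketched in code. The model y_i = a·exp(b·x_i) + u_i and all the numbers below are hypothetical choices of mine, not from the notes; each iteration is exactly the least-squares regression step a_{r+1} = a_r − (Z_r'Z_r)⁻¹Z_r'e_r with e = y − g(θ) and Z = ∂e/∂θ':

```python
import numpy as np

# Hypothetical NLS model: y_i = a * exp(b * x_i) + u_i.
rng = np.random.default_rng(2)
n = 500
theta_true = np.array([2.0, 0.5])
x = rng.uniform(0.0, 2.0, n)
y = theta_true[0] * np.exp(theta_true[1] * x) + 0.05 * rng.standard_normal(n)

theta = np.array([1.5, 0.3])                 # starting value a_0
for _ in range(100):
    fitted = theta[0] * np.exp(theta[1] * x)
    e = y - fitted                           # residual vector e(theta)
    # Z = de/dtheta' = -[dg/da, dg/db], an n x p matrix
    Z = -np.column_stack([np.exp(theta[1] * x),
                          theta[0] * x * np.exp(theta[1] * x)])
    step = np.linalg.solve(Z.T @ Z, Z.T @ e)  # (Z'Z)^{-1} Z'e
    theta = theta - step
    if np.max(np.abs(step)) < 1e-10:          # stop when a_{r+1} ~ a_r
        break
```

From a reasonable starting value the recursion settles in a handful of iterations; as the notes warn, a poor start can land on a local minimum, so in practice one would try several.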
Suppose we have estimated θ by minimizing some objective function Q_n(θ; y) as in section 2, and that the standardized estimator √n(θ̂ − θ0) is asymptotically N(0, V) where V = B⁻¹AB⁻¹. (A and B are defined in that section.) Then, using the delta method, we find

√n[g(θ̂) − g(θ0)] ≈ G√n(θ̂ − θ0) →d N(0, GVG').

Under the null hypothesis, n g(θ̂)'(GVG')⁻¹g(θ̂) is asymptotically χ²(q). Replacing G with the consistent estimate Ĝ ≡ G(θ̂) does not affect the asymptotic distribution. Hence, for some consistent estimate V̂, we might reject the hypothesis that g(θ0) = 0 if the Wald statistic

W_n ≡ n g(θ̂)'(ĜV̂Ĝ')⁻¹g(θ̂)   (2)

is larger than the 95% quantile of a χ²(q) distribution.

Sometimes solving for θ̂ is computationally difficult and one seeks a way to test the hypothesis that g(θ0) = 0 that avoids this computation. Suppose there exists an easy-to-calculate estimate θ̃ that satisfies g(θ̃) = 0. Suppose further that, when the null hypothesis is true, θ̃ is consistent, asymptotically normal, and the usual linear approximations are valid:

√n[g(θ̂) − g(θ̃)] = G√n(θ̂ − θ̃) + o_p(1)
n^{-1/2}[S_n(θ̂) − S_n(θ̃)] = B√n(θ̂ − θ̃) + o_p(1).

Assuming B is a continuous function of θ0, define G̃ = G(θ̃) and B̃ = B(θ̃). Since g(θ̃) and S_n(θ̂) are both zero vectors, we see that W_n is asymptotically equivalent under the null hypothesis to

n(θ̂ − θ̃)'Ĝ'(ĜV̂Ĝ')⁻¹Ĝ(θ̂ − θ̃)   (3)

and to

n⁻¹ S_n(θ̃)'B̃⁻¹G̃'[G̃ṼG̃']⁻¹G̃B̃⁻¹S_n(θ̃).   (4)

Note that (4) can be computed without solving the minimization problem.

In the linear regression model where Q_n = ½(y − Xθ)'(y − Xθ), we find H = X'X and var[S_n(θ0)] = σ²X'X. Suppose the null hypothesis is the linear constraint Gθ0 = 0. Estimating A by s²X'X/n and B by X'X/n, we obtain

W_n = θ̂'G'[G(X'X)⁻¹G']⁻¹Gθ̂/s²

which is q times the usual F-statistic. This feature of LS regression (that the variance matrix for the score is proportional to the expectation of the hessian) holds in many estimation problems. When it occurs, not only do (2), (3) and (4) simplify but also some additional asymptotically equivalent expressions for the Wald statistic are available.

Consider the problem of minimizing Q_n(θ; y) subject to the constraint g(θ) = 0. The solution can be used for our θ̃. The first order condition for an interior minimum is S_n(θ̃) = G̃'λ, which implies λ = (G̃B̃⁻¹G̃')⁻¹G̃B̃⁻¹S_n(θ̃) and S_n(θ̃) = G̃'(G̃B̃⁻¹G̃')⁻¹G̃B̃⁻¹S_n(θ̃). If A = cB and θ̃ is the constrained minimizer of Q_n, the test statistics (3) and (4) simplify to

n(θ̂ − θ̃)'B̂(θ̂ − θ̃)/ĉ   (3')

and

n⁻¹ S_n(θ̃)'B̃⁻¹S_n(θ̃)/ĉ.   (4')

Furthermore, a Taylor series expansion of the statistic

2[Q_n(θ̃) − Q_n(θ̂)]/ĉ   (5)

shows that it too is asymptotically equivalent to (3') and hence to W_n.
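The claim that W_n is q times the usual F-statistic can be verified exactly in a small simulation (the design, sample size, and exclusion restriction below are my own illustrative choices):

```python
import numpy as np

# Linear model with p = 4 coefficients; null: the last two are zero (q = 2).
rng = np.random.default_rng(3)
n, p, q = 200, 4, 2
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
theta0 = np.array([1.0, 2.0, 0.0, 0.0])   # null is true
y = X @ theta0 + rng.standard_normal(n)
G = np.zeros((q, p)); G[0, 2] = 1.0; G[1, 3] = 1.0

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
ssr_u = np.sum((y - X @ theta_hat) ** 2)
s2 = ssr_u / (n - p)

# Wald statistic: theta_hat'G'[G(X'X)^{-1}G']^{-1}G theta_hat / s^2
XtXinv = np.linalg.inv(X.T @ X)
Gt = G @ theta_hat
W = Gt @ np.linalg.solve(G @ XtXinv @ G.T, Gt) / s2

# The usual F-statistic from restricted and unrestricted residual sums of squares
Xr = X[:, :2]                             # restricted model drops the last two columns
br = np.linalg.solve(Xr.T @ Xr, Xr.T @ y)
ssr_r = np.sum((y - Xr @ br) ** 2)
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - p))
```

For exclusion restrictions the identity W_n = (SSR_r − SSR_u)/s² = qF holds exactly, not just asymptotically, so the two numbers agree to machine precision.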
Thus, if A = cb for some ozero scalar c; plim ^c = c, ad ~ is the costraied miimizer of Q ; we have four aymptotically equivalet statistics that might be used for testig g( 0 ) = 0: (a) a quadratic form i g( b ) : g( b ) 0 ( b G b B b G 0 ) g( b )=^c (b) a quadratic form i the estimator di erece ~ b ( b e ) 0 b B( b e )=^c (c) a quadratic form i the score (which is the same as a quadratic form i the Lagrage multiplier ) S( e ) 0 e B S( e )=^c 0 e G e B e G 0 =^c (d) the di erece i costraied ad ucostraied miimized objective fuctio (multiplied by 2/^c): 2[Q( e ; y) Q( b ; y)]=^c: 4
Although our discussion has concerned only the null distribution of these tests, the asymptotic equivalence holds also under nearby alternatives. Exact equivalence occurs in the linear regression case because there H and G are nonrandom and do not depend on the unknown θ0. In the general case, where H may be random and both may depend on θ0, we find only asymptotic equivalence. There are many different ways to consistently estimate B, including the hessians n⁻¹H_n(θ̂) and n⁻¹H_n(θ̃) as well as the analytic expression lim n⁻¹E H_n(θ) evaluated at θ̂ or θ̃. Thus there are really many more asymptotically equivalent test statistics available.

Computational convenience is often used as a basis for choice among asymptotically equivalent tests. However, if p and q are large, the asymptotic approximations are sometimes poor. It is usually wise to perform some simulations to verify that the chosen test statistic has approximately the correct small-sample rejection probability under the null hypothesis.

When Q_n is minus the log likelihood function we find that A = B, since A and B are the alternative expressions for the limiting information matrix; (d) is then the likelihood ratio statistic and (c) is the score statistic. When the correct weighting matrix is used in the quadratic form defining GMM, we also have A proportional to B and a choice of tests.

6 Score Tests as Diagnostics

The following type of problem often occurs in econometrics. We postulate a fairly simple probability model for the data but entertain the possibility that a more complicated model may be needed. Let Q_n(θ1, θ2) be the objective function assuming the complicated model is correct; θ1 is the p-dimensional parameter vector in the simple model and θ2 is the q-dimensional vector of additional parameters needed to cope with the complication. The simple model being correct is equivalent to the parametric hypothesis that θ2 = 0. Thus, before publishing estimates of θ1 based on the simple model, one might want to check that θ2 is really close to zero. A Wald or likelihood ratio test would require actually estimating the complicated model.
The score test does not, and is therefore commonly used as a diagnostic. Assuming that A = B, the score test statistic for θ2 = 0 is n⁻¹S_n(θ̃)'B̃⁻¹S_n(θ̃), where θ̃ minimizes Q_n subject to the constraint. The score vector S_n can be partitioned into two subvectors, say S1 and S2. But, when evaluated at the constrained estimate θ̃, S1 must be zero (since that is the first-order condition for minimizing Q_n subject to θ2 = 0). Thus the test statistic can be written as n⁻¹S2(θ̃)'C S2(θ̃), where C is the q × q lower right-hand block of B⁻¹. Using n⁻¹H_n(θ̃) as an estimate of B and partitioning conformably, a natural estimate of C is n(H̃22 − H̃21H̃11⁻¹H̃12)⁻¹. In other words, to test the hypothesis that the simple model is valid: first, compute θ̃1, the MLE for the parameters of the simple model; second, evaluate the score S2 at θ̃ = (θ̃1, 0); third, compute the test statistic S2(θ̃)'(H̃22 − H̃21H̃11⁻¹H̃12)⁻¹S2(θ̃). In many examples, the information matrix is block diagonal, implying plim n⁻¹H̃12 = 0; then the test statistic can be simplified to S2(θ̃)'H̃22⁻¹S2(θ̃). In other words, when the information matrix is block diagonal, one can test θ2 = 0 by pretending that θ1 were known and equal to the estimate θ̃1.
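A concrete instance of this recipe is the score (LM) test for omitted regressors in a linear model (a simulated design of my own; since here A = σ²B rather than A = B, the scale is estimated by ĉ = ũ'ũ/n in the spirit of section 5). The partitioned-hessian statistic then equals the familiar n·R² from regressing the restricted residuals on all the regressors:

```python
import numpy as np

# Simple model: y on X.  Complication: q = 2 extra regressors Z with
# coefficient theta_2.  Q_n = ||y - X theta_1 - Z theta_2||^2 / 2, so the
# restricted estimate is OLS of y on X alone and S_2(theta_tilde) = -Z'u_tilde.
rng = np.random.default_rng(6)
n = 400
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
Z = rng.standard_normal((n, 2))
y = X @ np.array([1.0, 0.5]) + rng.standard_normal(n)   # theta_2 = 0 holds

b1 = np.linalg.solve(X.T @ X, X.T @ y)   # step 1: estimate the simple model
u = y - X @ b1                           # restricted residuals (X'u = 0)
c_hat = u @ u / n

S2 = -Z.T @ u                            # step 2: score for theta_2 at (b1, 0)
MZ = Z - X @ np.linalg.solve(X.T @ X, X.T @ Z)   # M_X Z, so Z'MZ = H22 - H21 H11^{-1} H12
lm = S2 @ np.linalg.solve(Z.T @ MZ, S2) / c_hat  # step 3: the score statistic

# Equivalent "n R^2" computation: regress u on W = [X, Z]
W = np.column_stack([X, Z])
fitted = W @ np.linalg.solve(W.T @ W, W.T @ u)
nR2 = n * (fitted @ fitted) / (u @ u)
```

Because X'ũ = 0, the explained sum of squares from that auxiliary regression is exactly ũ'Z(Z'M_X Z)⁻¹Z'ũ, so the two computations agree to machine precision; only the simple model ever has to be estimated.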