The Quasi-Maximum Likelihood Method: Theory

Chapter 9
The Quasi-Maximum Likelihood Method: Theory

As discussed in preceding chapters, estimating linear and nonlinear regressions by the least squares method yields an approximation to the conditional mean function of the dependent variable. While this approach is important and common in practice, its scope is limited and does not fit the purposes of various empirical studies. First, it is not suitable for analyzing dependent variables with special features, such as binary and truncated (censored) variables, because it is unable to take such data characteristics into account. Second, it leaves little room for characterizing other conditional moments of the dependent variable, such as the conditional variance. As far as a complete description of the conditional behavior of a dependent variable is concerned, it is desirable to specify a density function that admits specifications of different conditional moments and other distribution characteristics. This motivates researchers to consider the method of quasi-maximum likelihood (QML). The QML method is essentially the same as the ML method usually seen in statistics and econometrics textbooks. It is conceivable that specifying a density function, while more general and more flexible than specifying a function for the conditional mean, is also more likely to result in specification errors. How to draw statistical inferences under potential model misspecification is thus a major concern of the QML method. By contrast, the traditional ML method assumes that the specified density function is the true density function, so that specification errors are assumed away. Therefore, the results of the ML method are just special cases of the QML method. Our discussion below is primarily based on White (1994); for related discussion we also refer to White (1982), Amemiya (1985), Godfrey (1988), and Gourieroux (1994).

9.1 Kullback-Leibler Information Criterion

We first discuss the concept of information. In a random experiment, suppose that the event $A$ occurs with probability $p$. The message that $A$ will surely occur is more valuable when $p$ is small but less important when $p$ is large. For example, when $p = 0.99$, the message that $A$ will occur is not very informative, but when $p = 0.01$, such a message is very helpful. Hence, the information content of the message that $A$ will occur ought to be a decreasing function of $p$. A common choice of the information function is $\iota(p) = \log(1/p)$, which decreases from positive infinity ($p \to 0$) to zero ($p = 1$). It should be clear that $\iota(1-p)$, the information that $A$ will not occur, is not the same as $\iota(p)$ unless $p = 0.5$. The expected information of these two messages is

$$ I = p\,\iota(p) + (1-p)\,\iota(1-p) = p \log\Big(\frac{1}{p}\Big) + (1-p)\log\Big(\frac{1}{1-p}\Big). $$

It is also common to interpret $\iota(p)$ as the surprise resulting from learning that the event $A$ will occur when $\mathbb{P}(A) = p$. The expected information $I$ is thus also a weighted average of surprises and is known as the entropy of the event $A$.

Similarly, the information that the probability of the event $A$ changes from $p$ to $q$ is useful when $p$ and $q$ are very different, but not of much value when $p$ and $q$ are close. The resulting information content is the difference between these two pieces of information: $\iota(p) - \iota(q) = \log(q/p)$, which is positive (negative) when $q > p$ ($q < p$). Given $n$ mutually exclusive events $A_1, \ldots, A_n$, each with an information value $\log(q_i/p_i)$, the expected information value is then

$$ I = \sum_{i=1}^{n} q_i \log\Big(\frac{q_i}{p_i}\Big). $$

This idea is readily generalized to the information content of density functions, as discussed below. Let $g$ be the density function of the random variable $z$ and $f$ be another density function. Define the Kullback-Leibler Information Criterion (KLIC) of $g$ relative to $f$ as

$$ \mathbb{I}(g:f) = \int_{\mathbb{R}} \log\Big(\frac{g(\zeta)}{f(\zeta)}\Big)\, g(\zeta)\, d\zeta. $$
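The KLIC is easy to evaluate numerically. The following sketch (Python; the two normal densities are illustrative choices, not from the text) computes $\mathbb{I}(g:f)$ by quadrature and confirms that it is positive when $f$ differs from $g$ and zero when $f = g$ almost everywhere.

```python
import numpy as np
from scipy import integrate, stats

# A minimal numerical illustration of the KLIC II(g:f).  The choice of g and f
# (two normal densities) is purely illustrative.
g = stats.norm(loc=0.0, scale=1.0).pdf   # "true" density g
f = stats.norm(loc=0.5, scale=1.5).pdf   # postulated density f

def klic(g, f, lo=-12.0, hi=12.0):
    # integrate log(g/f) * g over a range covering essentially all the mass of g
    integrand = lambda z: np.log(g(z) / f(z)) * g(z)
    value, _ = integrate.quad(integrand, lo, hi)
    return value

print(klic(g, f))   # > 0, since f differs from g
print(klic(g, g))   # ~ 0 (up to quadrature error), since f = g a.e.
```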

When $f$ is used to describe $z$, the value $\mathbb{I}(g:f)$ is the expected surprise resulting from learning that $g$ is in fact the true density of $z$. The following result shows that the KLIC of $g$ relative to $f$ is non-negative.

Theorem 9.1 Let $g$ be the density function of the random variable $z$ and $f$ be another density function. Then $\mathbb{I}(g:f) \ge 0$, with equality holding if, and only if, $g = f$ almost everywhere (i.e., $g = f$ except on a set with Lebesgue measure zero).

Proof: Using the fact that $\log(1+x) \le x$ for all $x > -1$, with equality only at $x = 0$, we have

$$ \log\Big(\frac{g}{f}\Big) = -\log\Big(1 + \frac{f - g}{g}\Big) \ge -\frac{f-g}{g} = 1 - \frac{f}{g}. $$

It follows that

$$ \int_{\mathbb{R}} \log\Big(\frac{g(\zeta)}{f(\zeta)}\Big)\, g(\zeta)\, d\zeta \ge \int_{\mathbb{R}} \Big(1 - \frac{f(\zeta)}{g(\zeta)}\Big)\, g(\zeta)\, d\zeta = 0. $$

Clearly, if $g = f$ almost everywhere, $\mathbb{I}(g:f) = 0$. Conversely, given $\mathbb{I}(g:f) = 0$, suppose without loss of generality that $g = f$ except that $g > f$ on a set $B$ with non-zero Lebesgue measure. Then

$$ \int_{\mathbb{R}} \log\Big(\frac{g(\zeta)}{f(\zeta)}\Big)\, g(\zeta)\, d\zeta = \int_{B} \log\Big(\frac{g(\zeta)}{f(\zeta)}\Big)\, g(\zeta)\, d\zeta > \int_{B} (\log 1)\, g(\zeta)\, d\zeta = 0, $$

contradicting $\mathbb{I}(g:f) = 0$. Thus, $g$ must be the same as $f$ almost everywhere. $\Box$

Note, however, that the KLIC is not a metric: it is not symmetric in general, i.e., $\mathbb{I}(g:f) \ne \mathbb{I}(f:g)$, and it does not obey the triangle inequality; see Exercise 9.1. Hence, the KLIC is only a crude measure of the closeness between $f$ and $g$.

Let $\{z_t\}$ be a sequence of $\mathbb{R}^{\nu}$-valued random variables defined on the probability space $(\Omega, \mathcal{F}, \mathbb{P}_o)$. Without causing any confusion, we use the same notation for random vectors and their realizations. Let $z^T = (z_1, z_2, \ldots, z_T)$ denote the information set generated by all $z_t$. The vector $z_t$ is divided into two parts, $y_t$ and $w_t$, where $y_t$ is a scalar and $w_t$ is $(\nu - 1) \times 1$. Corresponding to $\mathbb{P}_o$, there exists a joint density function $g^T$ for $z^T$:

$$ g^T(z^T) = g(z_1) \prod_{t=2}^{T} \frac{g^t(z^t)}{g^{t-1}(z^{t-1})} = g(z_1) \prod_{t=2}^{T} g_t(z_t \mid z^{t-1}), $$

where $g^t$ denotes the joint density of the $t$ random variables $z_1, \ldots, z_t$, and $g_t$ is the density of $z_t$ conditional on the past information $z_1, \ldots, z_{t-1}$. The joint density function

$g^T$ is the random mechanism governing the behavior of $z^T$ and will be referred to as the data generation process (DGP) of $z^T$. As $g^T$ is unknown, we may postulate a conditional density function $f_t(z_t \mid z^{t-1}; \theta)$, where $\theta \in \Theta \subseteq \mathbb{R}^k$, and approximate $g^T(z^T)$ by

$$ f^T(z^T; \theta) = f(z_1) \prod_{t=2}^{T} f_t(z_t \mid z^{t-1}; \theta). $$

The function $f^T$ is referred to as the quasi-likelihood function, in the sense that $f^T$ need not agree with $g^T$. For notational convenience, we treat the unconditional density $g(z_1)$ (or $f(z_1)$) as a conditional density and write $g^T$ (or $f^T$) as the product of all conditional density functions. Clearly, the postulated density $f^T$ would be useful if it is close to the DGP $g^T$. It is therefore natural to consider minimizing the KLIC of $g^T$ relative to $f^T$:

$$ \mathbb{I}(g^T : f^T; \theta) = \int_{\mathbb{R}^{\nu T}} \log\Big(\frac{g^T(\zeta^T)}{f^T(\zeta^T; \theta)}\Big)\, g^T(\zeta^T)\, d\zeta^T. \tag{9.1} $$

This amounts to minimizing the surprise level resulting from specifying an $f^T$ for the DGP $g^T$. As $g^T$ does not involve $\theta$, minimizing the KLIC (9.1) with respect to $\theta$ is equivalent to maximizing

$$ \int_{\mathbb{R}^{\nu T}} \log\big(f^T(\zeta^T; \theta)\big)\, g^T(\zeta^T)\, d\zeta^T = \mathbb{E}[\log f^T(z^T; \theta)], $$

where $\mathbb{E}$ is the expectation operator with respect to the DGP $g^T(z^T)$. This is, in turn, equivalent to maximizing the average of the $\log f_t$:

$$ \bar{L}_T(\theta) = \frac{1}{T}\, \mathbb{E}[\log f^T(z^T; \theta)] = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[\log f_t(z_t \mid z^{t-1}; \theta)\big]. \tag{9.2} $$

Let $\theta^*$ be the maximizer of $\bar{L}_T(\theta)$. Then $\theta^*$ is also the minimizer of the KLIC (9.1). If there exists a unique $\theta_o \in \Theta$ such that $f^T(z^T; \theta_o) = g^T(z^T)$ for all $T$, we say that $\{f_t\}$ is specified correctly in its entirety for $\{z_t\}$. In this case, the resulting KLIC $\mathbb{I}(g^T : f^T; \theta_o) = 0$ and attains the minimum. This shows that $\theta^* = \theta_o$ when $\{f_t\}$ is specified correctly in its entirety.

Maximizing $\bar{L}_T$ is, however, not a readily solvable problem because the objective function (9.2) involves the expectation operator and hence depends on the unknown DGP $g^T$. It is therefore natural to maximize the sample counterpart of $\bar{L}_T(\theta)$:

$$ L_T(z^T; \theta) := \frac{1}{T} \sum_{t=1}^{T} \log f_t(z_t \mid z^{t-1}; \theta), \tag{9.3} $$

which is known as the quasi-log-likelihood function. The maximizer of $L_T(z^T; \theta)$, denoted $\hat{\theta}_T$, is known as the quasi-maximum likelihood estimator (QMLE) of $\theta$. The prefix "quasi" indicates that this solution may be obtained from a misspecified log-likelihood function. When $\{f_t\}$ is specified correctly in its entirety for $\{z_t\}$, the QMLE is understood as the MLE, as in standard statistics textbooks.

Specifying a complete probability model for $z^T$ may be a formidable task in practice because it involves too many random variables ($T$ random vectors $z_t$, each with $\nu$ random variables). Instead, econometricians are typically interested in modeling a variable of interest, say $y_t$, conditional on a set of pre-determined variables, say $x_t$, where $x_t$ includes some elements of $w_t$ and $z^{t-1}$. This is a relatively simple job because only the conditional behavior of $y_t$ needs to be considered. As $w_t$ are not explicitly modeled, the conditional density $g_t(y_t \mid x_t)$ provides only a partial description of $\{z_t\}$. We then find a quasi-likelihood function $f_t(y_t \mid x_t; \theta)$ to approximate $g_t(y_t \mid x_t)$. Analogous to (9.1), the resulting average KLIC of $g_t$ relative to $f_t$ is

$$ \bar{\mathbb{I}}_T(\{g_t : f_t\}; \theta) := \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}(g_t : f_t; \theta). \tag{9.4} $$

Let $y^T = (y_1, \ldots, y_T)$ and $x^T = (x_1, \ldots, x_T)$. Minimizing $\bar{\mathbb{I}}_T(\{g_t : f_t\}; \theta)$ in (9.4) is thus equivalent to maximizing

$$ \bar{L}_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[\log f_t(y_t \mid x_t; \theta)]; \tag{9.5} $$

cf. (9.2). The parameter $\theta^*$ that maximizes (9.5) must also minimize the average KLIC (9.4). If there exists a $\theta_o \in \Theta$ such that $f_t(y_t \mid x_t; \theta_o) = g_t(y_t \mid x_t)$ for all $t$, we say that $\{f_t\}$ is correctly specified for $\{y_t \mid x_t\}$. In this case, $\bar{\mathbb{I}}_T(\{g_t : f_t\}; \theta_o)$ is zero, and $\theta^* = \theta_o$. As before, $\bar{L}_T(\theta)$ is not directly observable, so we instead maximize its sample counterpart:

$$ L_T(y^T, x^T; \theta) := \frac{1}{T} \sum_{t=1}^{T} \log f_t(y_t \mid x_t; \theta), \tag{9.6} $$

which will also be referred to as the quasi-log-likelihood function; cf. (9.3). The resulting solution is the QMLE $\hat{\theta}_T$. When $\{f_t\}$ is correctly specified for $\{y_t \mid x_t\}$, the QMLE $\hat{\theta}_T$ is also understood as the usual MLE.

As a special case, one may concentrate on a certain conditional attribute of $y_t$ and postulate a specification $\mu_t(x_t; \theta)$ for this attribute. A leading example is the following

specification of conditional normality, with $\mu_t(x_t; \beta)$ as the specification of its mean:

$$ y_t \mid x_t \sim N\big(\mu_t(x_t; \beta),\, \sigma^2\big); $$

note that the conditional variance is not explicitly modeled. Setting $\theta = (\beta' \ \sigma^2)'$, it is easy to see that the maximizer of the quasi-log-likelihood function $\frac{1}{T}\sum_{t=1}^{T} \log f_t(y_t \mid x_t; \theta)$ is also the solution to

$$ \min_{\beta}\ \frac{1}{T} \sum_{t=1}^{T} \big[y_t - \mu_t(x_t; \beta)\big]^2. $$

The resulting QMLE of $\beta$ is thus the NLS estimator. Therefore, the NLS estimator can be viewed as a QMLE under the assumption of conditional normality with conditional homoskedasticity. We say that $\{\mu_t\}$ is correctly specified for the conditional mean $\mathbb{E}(y_t \mid x_t)$ if there exists a $\theta_o$ such that $\mu_t(x_t; \theta_o) = \mathbb{E}(y_t \mid x_t)$. A more flexible specification, such as $y_t \mid x_t \sim N\big(\mu_t(x_t; \beta),\, h(x_t; \alpha)\big)$, would allow us to characterize the conditional variance as well.

9.2 Asymptotic Properties of the QMLE

The quasi-log-likelihood function is, in general, a nonlinear function of $\theta$, so the QMLE must be computed numerically using a nonlinear optimization algorithm. Given that maximizing $L_T$ is equivalent to minimizing $-L_T$, the nonlinear optimization algorithms discussed in Chapter 8 are readily applied. We shall not repeat these methods here but proceed to the asymptotic properties of the QMLE.
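As a concrete illustration of this computation, the following Python sketch maximizes a Gaussian quasi-log-likelihood of the kind introduced above by generic numerical optimization on simulated data. All data and variable names here are illustrative assumptions, not part of the text.

```python
import numpy as np
from scipy.optimize import minimize

# A minimal sketch of computing a QMLE numerically for the specification
# y_t | x_t ~ N(x_t'beta, sigma^2).  Simulated data; sigma^2 is parameterized
# as exp(log_s2) to keep it positive during the search.
rng = np.random.default_rng(0)
T, k = 200, 3
X = rng.normal(size=(T, k))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=T)

def neg_loglik(theta):
    beta, log_s2 = theta[:k], theta[k]
    s2 = np.exp(log_s2)
    resid = y - X @ beta
    # minus the average quasi-log-likelihood L_T(theta)
    return 0.5 * np.log(2 * np.pi) + 0.5 * log_s2 + np.mean(resid ** 2) / (2 * s2)

res = minimize(neg_loglik, x0=np.zeros(k + 1), method="BFGS")
beta_qmle, s2_qmle = res.x[:k], np.exp(res.x[k])
```

Reparameterizing $\sigma^2$ as an exponential is one simple way to keep the variance positive without resorting to constrained optimization.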

For our subsequent analysis, we always assume that the specified quasi-log-likelihood function is twice continuously differentiable on the compact parameter space $\Theta$ with probability one and that integration and differentiation can be interchanged. Moreover, we maintain the following identification condition.

[ID-2] There exists a unique $\theta^*$ that minimizes the KLIC (9.1) or (9.4).

9.2.1 Consistency

We sketch the idea underlying the consistency of the QMLE. In light of the uniform law of large numbers discussed in Section 8.3, we know that if $L_T(z^T; \theta)$ (or $L_T(y^T, x^T; \theta)$) tends to $\bar{L}_T(\theta)$ uniformly in $\theta \in \Theta$, i.e., $L_T(z^T; \theta)$ obeys a WULLN, then $\hat{\theta}_T - \theta^* \to 0$ in probability. This shows that the QMLE is a weakly consistent estimator of the minimizer of the KLIC $\mathbb{I}(g^T : f^T; \theta)$ (or of the average KLIC $\bar{\mathbb{I}}_T(\{g_t : f_t\}; \theta)$), whether or not the specification is correct. When $\{f_t\}$ is specified correctly in its entirety (or for $\{y_t \mid x_t\}$), the KLIC minimizer $\theta^*$ is also the true parameter $\theta_o$; in this case, the QMLE is weakly consistent for $\theta_o$. Therefore, the conditions required to ensure QMLE consistency are those for a WULLN; we omit the details.

9.2.2 Asymptotic Normality

Consider first the quasi-log-likelihood function specified for $z^T$: $L_T(z^T; \theta)$. When $\theta^*$ is in the interior of $\Theta$, the mean-value expansion of $\nabla L_T(z^T; \hat{\theta}_T)$ about $\theta^*$ is

$$ \nabla L_T(z^T; \hat{\theta}_T) = \nabla L_T(z^T; \theta^*) + \nabla^2 L_T(z^T; \theta_T^{\dagger})\,(\hat{\theta}_T - \theta^*), \tag{9.7} $$

where $\theta_T^{\dagger}$ lies between $\hat{\theta}_T$ and $\theta^*$. Note that requiring $\theta^*$ to be in the interior of $\Theta$ ensures that the mean-value expansion and the subsequent asymptotics are valid in the parameter space. The left-hand side of (9.7) is zero because the QMLE $\hat{\theta}_T$ solves the first-order condition $\nabla L_T(z^T; \theta) = 0$. Then, as long as $\nabla^2 L_T(z^T; \theta_T^{\dagger})$ is invertible with probability one, (9.7) can be written as

$$ \sqrt{T}\,(\hat{\theta}_T - \theta^*) = -\big[\nabla^2 L_T(z^T; \theta_T^{\dagger})\big]^{-1} \sqrt{T}\, \nabla L_T(z^T; \theta^*). $$

Note that the invertibility of $\nabla^2 L_T(z^T; \theta_T^{\dagger})$ amounts to requiring the quasi-log-likelihood function $L_T$ to be locally quadratic at $\theta_T^{\dagger}$. Let $\mathbb{E}[\nabla^2 L_T(z^T; \theta)] = H_T(\theta)$ be the expected Hessian matrix. When $\nabla^2 L_T(z^T; \theta)$ obeys a WULLN, we have

$$ \nabla^2 L_T(z^T; \theta) - H_T(\theta) \xrightarrow{\ \mathbb{P}\ } 0, $$

uniformly in $\Theta$. As $\hat{\theta}_T$ is weakly consistent for $\theta^*$, so is $\theta_T^{\dagger}$. The assumed WULLN then implies that $\nabla^2 L_T(z^T; \theta_T^{\dagger}) - H_T(\theta^*) \xrightarrow{\mathbb{P}} 0$. It follows that

$$ \sqrt{T}\,(\hat{\theta}_T - \theta^*) = -H_T(\theta^*)^{-1} \sqrt{T}\, \nabla L_T(z^T; \theta^*) + o_{\mathbb{P}}(1). \tag{9.8} $$

This shows that the asymptotic distribution of $\sqrt{T}\,(\hat{\theta}_T - \theta^*)$ is essentially determined by the asymptotic distribution of the normalized score, $\sqrt{T}\, \nabla L_T(z^T; \theta^*)$.

Let $B_T$ denote the variance-covariance matrix of the normalized score $\sqrt{T}\, \nabla L_T(z^T; \theta)$:

$$ B_T(\theta) = \mathrm{var}\big(\sqrt{T}\, \nabla L_T(z^T; \theta)\big), $$

which will also be referred to as the information matrix. Then, provided that $\{\nabla \log f_t(z_t \mid z^{t-1}; \theta)\}$ obeys a CLT, we have

$$ B_T(\theta^*)^{-1/2}\, \sqrt{T}\, \big( \nabla L_T(z^T; \theta^*) - \mathbb{E}[\nabla L_T(z^T; \theta^*)] \big) \xrightarrow{\ D\ } N(0, I_k). \tag{9.9} $$

When differentiation and integration can be interchanged,

$$ \mathbb{E}[\nabla L_T(z^T; \theta)] = \nabla\, \mathbb{E}[L_T(z^T; \theta)] = \nabla \bar{L}_T(\theta), $$

where the right-hand side is the first-order derivative of (9.2). As $\theta^*$ is the KLIC minimizer, $\nabla \bar{L}_T(\theta^*) = 0$, so that $\mathbb{E}[\nabla L_T(z^T; \theta^*)] = 0$. By (9.8) and (9.9),

$$ \sqrt{T}\,(\hat{\theta}_T - \theta^*) = -H_T(\theta^*)^{-1} B_T(\theta^*)^{1/2} \big[ B_T(\theta^*)^{-1/2} \sqrt{T}\, \nabla L_T(z^T; \theta^*) \big] + o_{\mathbb{P}}(1), $$

which has an asymptotic normal distribution. This immediately leads to the following result.

Theorem 9.2 When (9.7), (9.8) and (9.9) hold,

$$ C_T(\theta^*)^{-1/2}\, \sqrt{T}\, (\hat{\theta}_T - \theta^*) \xrightarrow{\ D\ } N(0, I_k), $$

where $C_T(\theta^*) = H_T(\theta^*)^{-1} B_T(\theta^*) H_T(\theta^*)^{-1}$, with $H_T(\theta^*) = \mathbb{E}[\nabla^2 L_T(z^T; \theta^*)]$ and $B_T(\theta^*) = \mathrm{var}\big(\sqrt{T}\, \nabla L_T(z^T; \theta^*)\big)$.

Remark: For the specification of $\{y_t \mid x_t\}$, the QMLE is obtained from the quasi-log-likelihood function $L_T(y^T, x^T; \theta)$, and its asymptotic normality holds similarly as in Theorem 9.2. That is,

$$ C_T(\theta^*)^{-1/2}\, \sqrt{T}\, (\hat{\theta}_T - \theta^*) \xrightarrow{\ D\ } N(0, I_k), $$

where $C_T(\theta^*) = H_T(\theta^*)^{-1} B_T(\theta^*) H_T(\theta^*)^{-1}$ with $H_T(\theta^*) = \mathbb{E}[\nabla^2 L_T(y^T, x^T; \theta^*)]$ and $B_T(\theta^*) = \mathrm{var}\big(\sqrt{T}\, \nabla L_T(y^T, x^T; \theta^*)\big)$.
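To make Theorem 9.2 operational, one replaces $H_T$ and $B_T$ with sample analogues. The sketch below continues the simulated Gaussian example from Section 9.2 (reusing `X`, `y`, `beta_qmle`, and `s2_qmle` from that sketch, all illustrative names) and estimates the sandwich covariance $C_T = H_T^{-1} B_T H_T^{-1}$ from the analytic per-observation scores and Hessian; the outer-product estimator of $B_T$ is appropriate here because the simulated scores are serially uncorrelated.

```python
import numpy as np

# A sketch of estimating C_T = H_T^{-1} B_T H_T^{-1} (Theorem 9.2) for the
# Gaussian quasi-likelihood; assumes X, y, beta_qmle, s2_qmle from the
# previous sketch.  With no serial correlation in the scores, B_T is
# estimated by the average outer product of the per-observation scores.
T, k = X.shape
resid = y - X @ beta_qmle

# per-observation scores with respect to (beta, sigma^2): one row per t
scores = np.column_stack([
    X * (resid / s2_qmle)[:, None],
    -1.0 / (2 * s2_qmle) + resid ** 2 / (2 * s2_qmle ** 2),
])

# sample analogue of the expected Hessian H_T
H = np.zeros((k + 1, k + 1))
H[:k, :k] = -(X.T @ X) / (T * s2_qmle)
H[:k, k] = H[k, :k] = -(X.T @ resid) / (T * s2_qmle ** 2)
H[k, k] = 1.0 / (2 * s2_qmle ** 2) - np.mean(resid ** 2) / s2_qmle ** 3

B = scores.T @ scores / T       # information matrix estimate
Hinv = np.linalg.inv(H)
C = Hinv @ B @ Hinv             # estimates avar of sqrt(T)(theta_hat - theta*)
se = np.sqrt(np.diag(C) / T)    # standard errors for theta_hat itself
```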

9.3 Information Matrix Equality

A useful result in quasi-maximum likelihood theory is the information matrix equality. This equality shows that, under certain conditions, the information matrix $B_T(\theta)$ is the same as the negative of the expected Hessian matrix $H_T(\theta)$, so that the covariance matrix $C_T(\theta)$ can be simplified. In such a case, the estimation of $C_T(\theta)$ is much simpler.

Define the score functions

$$ s_t(z^t; \theta) = \nabla \log f_t(z_t \mid z^{t-1}; \theta). $$

The average of these scores is

$$ \bar{s}_T(z^T; \theta) = \nabla L_T(z^T; \theta) = \frac{1}{T} \sum_{t=1}^{T} s_t(z^t; \theta). $$

Clearly, $s_t(z^t; \theta)\, f_t(z_t \mid z^{t-1}; \theta) = \nabla f_t(z_t \mid z^{t-1}; \theta)$. It follows that

$$ \bar{s}_T(z^T; \theta)\, f^T(z^T; \theta) = \frac{1}{T} \sum_{t=1}^{T} s_t(z^t; \theta) \prod_{\tau=1}^{T} f_\tau(z_\tau \mid z^{\tau-1}; \theta) = \frac{1}{T} \sum_{t=1}^{T} \nabla f_t(z_t \mid z^{t-1}; \theta) \prod_{\tau \ne t} f_\tau(z_\tau \mid z^{\tau-1}; \theta) = \frac{1}{T}\, \nabla f^T(z^T; \theta). $$

As $\int f^T(z^T; \theta)\, dz^T = 1$, its derivative must be zero. By permitting interchange of differentiation and integration, we have

$$ 0 = \int \nabla f^T(z^T; \theta)\, dz^T = T \int \bar{s}_T(z^T; \theta)\, f^T(z^T; \theta)\, dz^T, $$

and

$$ 0 = \int \nabla^2 f^T(z^T; \theta)\, dz^T = T \int \big[ \nabla \bar{s}_T(z^T; \theta) + T\, \bar{s}_T(z^T; \theta)\, \bar{s}_T(z^T; \theta)' \big] f^T(z^T; \theta)\, dz^T. $$

These equalities are simply expectations with respect to $f^T(z^T; \theta)$; they need not hold when the integrator is not $f^T(z^T; \theta)$. If $\{f_t\}$ is correctly specified in its entirety for $\{z_t\}$, the results above imply

$$ \mathbb{E}[\bar{s}_T(z^T; \theta_o)] = 0, \qquad \mathbb{E}[\nabla \bar{s}_T(z^T; \theta_o)] + T\, \mathbb{E}[\bar{s}_T(z^T; \theta_o)\, \bar{s}_T(z^T; \theta_o)'] = 0, $$

where the expectations are taken with respect to the true density $g^T(z^T) = f^T(z^T; \theta_o)$. This is the information matrix equality stated below.

Theorem 9.3 Suppose that there exists a $\theta_o$ such that $f_t(z_t \mid z^{t-1}; \theta_o) = g_t(z_t \mid z^{t-1})$. Then

$$ H_T(\theta_o) + B_T(\theta_o) = 0, $$

where $H_T(\theta_o) = \mathbb{E}[\nabla^2 L_T(z^T; \theta_o)]$ and $B_T(\theta_o) = \mathrm{var}\big(\sqrt{T}\, \nabla L_T(z^T; \theta_o)\big)$.

When this equality holds, the covariance matrix $C_T$ in Theorem 9.2 simplifies to $C_T(\theta_o) = B_T(\theta_o)^{-1} = -H_T(\theta_o)^{-1}$. That is, the QMLE achieves the Cramér-Rao lower bound asymptotically.

On the other hand, when $f^T$ is not a correct specification for $z^T$, the score function is not related to the true density $g^T$. Hence, there is no guarantee that the mean score is zero, i.e.,

$$ \mathbb{E}[\bar{s}_T(z^T; \theta)] = \int \bar{s}_T(z^T; \theta)\, g^T(z^T)\, dz^T \ne 0, $$

even when this expectation is evaluated at $\theta^*$. For the same reason,

$$ \mathbb{E}[\nabla \bar{s}_T(z^T; \theta)] + T\, \mathbb{E}[\bar{s}_T(z^T; \theta)\, \bar{s}_T(z^T; \theta)'] \ne 0, $$

even when it is evaluated at $\theta^*$.

For the specification of $\{y_t \mid x_t\}$, we have

$$ \nabla f_t(y_t \mid x_t; \theta) = s_t(y_t, x_t; \theta)\, f_t(y_t \mid x_t; \theta), $$

and $\int f_t(y_t \mid x_t; \theta)\, dy_t = 1$. Similarly as above,

$$ \int s_t(y_t, x_t; \theta)\, f_t(y_t \mid x_t; \theta)\, dy_t = 0, $$

and

$$ \int \big[ \nabla s_t(y_t, x_t; \theta) + s_t(y_t, x_t; \theta)\, s_t(y_t, x_t; \theta)' \big] f_t(y_t \mid x_t; \theta)\, dy_t = 0. $$

If $\{f_t\}$ is correctly specified for $\{y_t \mid x_t\}$, we have $\mathbb{E}[s_t(y_t, x_t; \theta_o) \mid x_t] = 0$. Then, by the law of iterated expectations, $\mathbb{E}[s_t(y_t, x_t; \theta_o)] = 0$, so that the mean score is still zero under correct specification. Moreover,

$$ \mathbb{E}[\nabla s_t(y_t, x_t; \theta_o) \mid x_t] + \mathbb{E}[s_t(y_t, x_t; \theta_o)\, s_t(y_t, x_t; \theta_o)' \mid x_t] = 0, $$

which implies

$$ \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[\nabla s_t(y_t, x_t; \theta_o)] + \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[s_t(y_t, x_t; \theta_o)\, s_t(y_t, x_t; \theta_o)'] = H_T(\theta_o) + \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[s_t(y_t, x_t; \theta_o)\, s_t(y_t, x_t; \theta_o)'] = 0. $$

The equality above is not necessarily equivalent to the information matrix equality, however. To see this, consider specifications $\{f_t(y_t \mid x_t; \theta)\}$ that are correct for $\{y_t \mid x_t\}$. These specifications are said to have dynamic misspecification if they are not correctly specified for $\{y_t \mid w_t, z^{t-1}\}$; that is, there does not exist any $\theta_o$ such that $f_t(y_t \mid x_t; \theta_o) = g_t(y_t \mid w_t, z^{t-1})$. Thus, the information contained in $w_t$ and $z^{t-1}$ cannot be fully represented by $x_t$. On the other hand, when dynamic misspecification is absent, it is easily seen that

$$ \mathbb{E}[s_t(y_t, x_t; \theta_o) \mid x_t] = \mathbb{E}[s_t(y_t, x_t; \theta_o) \mid w_t, z^{t-1}]. \tag{9.10} $$

It is then easily verified that

$$ B_T(\theta_o) = \mathbb{E}\Big[ \Big( \frac{1}{\sqrt{T}} \sum_{t=1}^{T} s_t(y_t, x_t; \theta_o) \Big) \Big( \frac{1}{\sqrt{T}} \sum_{t=1}^{T} s_t(y_t, x_t; \theta_o) \Big)' \Big] $$
$$ = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[s_t(y_t, x_t; \theta_o)\, s_t(y_t, x_t; \theta_o)'] + \frac{1}{T} \sum_{\tau=1}^{T-1} \sum_{t=\tau+1}^{T} \Big( \mathbb{E}[s_{t-\tau}(y_{t-\tau}, x_{t-\tau}; \theta_o)\, s_t(y_t, x_t; \theta_o)'] + \mathbb{E}[s_t(y_t, x_t; \theta_o)\, s_{t-\tau}(y_{t-\tau}, x_{t-\tau}; \theta_o)'] \Big) $$
$$ = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}[s_t(y_t, x_t; \theta_o)\, s_t(y_t, x_t; \theta_o)'], $$

where the last equality holds by (9.10) and the law of iterated expectations. When there is dynamic misspecification, the last equality fails because the covariances of the scores do not vanish. This shows that the average of the individual information matrices, $\frac{1}{T}\sum_{t=1}^{T} \mathrm{var}\big(s_t(y_t, x_t; \theta_o)\big)$, need not be the information matrix.

Theorem 9.4 Suppose that there exists a $\theta_o$ such that $f_t(y_t \mid x_t; \theta_o) = g_t(y_t \mid x_t)$ and there is no dynamic misspecification. Then

$$ H_T(\theta_o) + B_T(\theta_o) = 0, $$

where $H_T(\theta_o) = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[\nabla s_t(y_t, x_t; \theta_o)]$ and $B_T(\theta_o) = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[s_t(y_t, x_t; \theta_o)\, s_t(y_t, x_t; \theta_o)']$.

When Theorem 9.4 holds, the covariance matrix needed to normalize $\sqrt{T}\,(\hat{\theta}_T - \theta_o)$ again simplifies to $B_T(\theta_o)^{-1} = -H_T(\theta_o)^{-1}$, and the QMLE achieves the Cramér-Rao lower bound asymptotically.

Example 9.5 Consider the following specification: $y_t \mid x_t \sim N(x_t'\beta, \sigma^2)$ for all $t$. Let $\theta = (\beta' \ \sigma^2)'$; then

$$ \log f(y_t \mid x_t; \theta) = -\frac{1}{2} \log(2\pi) - \frac{1}{2} \log(\sigma^2) - \frac{(y_t - x_t'\beta)^2}{2\sigma^2}, $$

and

$$ L_T(y^T, x^T; \theta) = -\frac{1}{2} \log(2\pi) - \frac{1}{2} \log(\sigma^2) - \frac{1}{T} \sum_{t=1}^{T} \frac{(y_t - x_t'\beta)^2}{2\sigma^2}. $$

Straightforward calculation yields

$$ \nabla L_T(y^T, x^T; \theta) = \begin{pmatrix} \dfrac{1}{T} \displaystyle\sum_{t=1}^{T} \dfrac{x_t (y_t - x_t'\beta)}{\sigma^2} \\[2ex] -\dfrac{1}{2\sigma^2} + \dfrac{1}{T} \displaystyle\sum_{t=1}^{T} \dfrac{(y_t - x_t'\beta)^2}{2(\sigma^2)^2} \end{pmatrix}, $$

$$ \nabla^2 L_T(y^T, x^T; \theta) = \begin{pmatrix} -\dfrac{1}{T} \displaystyle\sum_t \dfrac{x_t x_t'}{\sigma^2} & -\dfrac{1}{T} \displaystyle\sum_t \dfrac{x_t (y_t - x_t'\beta)}{(\sigma^2)^2} \\[2ex] -\dfrac{1}{T} \displaystyle\sum_t \dfrac{(y_t - x_t'\beta)\, x_t'}{(\sigma^2)^2} & \dfrac{1}{2(\sigma^2)^2} - \dfrac{1}{T} \displaystyle\sum_t \dfrac{(y_t - x_t'\beta)^2}{(\sigma^2)^3} \end{pmatrix}. $$

Setting $\nabla L_T(y^T, x^T; \theta) = 0$, we can solve for the QMLE. It is easily verified that the QMLE of $\beta$ is the OLS estimator $\hat{\beta}_T$ and that the QMLE of $\sigma^2$ is the average of the squared OLS residuals:

$$ \hat{\sigma}_T^2 = \frac{1}{T} \sum_{t=1}^{T} (y_t - x_t'\hat{\beta}_T)^2. $$
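Continuing the simulated example from Section 9.2, these first-order conditions can be checked directly in code: the numerical QMLE of $\beta$ matches the OLS estimator, and the QMLE of $\sigma^2$ matches the average squared residual (up to optimizer tolerance).

```python
import numpy as np

# Check of Example 9.5 on the simulated data: the QMLE of beta is OLS, and the
# QMLE of sigma^2 is the average squared OLS residual.  Assumes X, y,
# beta_qmle, s2_qmle from the earlier optimization sketch.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
s2_avg = np.mean((y - X @ beta_ols) ** 2)
print(np.allclose(beta_qmle, beta_ols, atol=1e-4))  # True
print(np.isclose(s2_qmle, s2_avg, atol=1e-4))       # True
```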

If the specification above is correct for $\{y_t \mid x_t\}$, there exists $\theta_o = (\beta_o' \ \sigma_o^2)'$ such that the conditional distribution of $y_t$ given $x_t$ is $N(x_t'\beta_o, \sigma_o^2)$. Taking expectations with respect to the true distribution, we have

$$ \mathbb{E}[x_t (y_t - x_t'\beta)] = \mathbb{E}(x_t x_t')(\beta_o - \beta), $$

which is zero when evaluated at $\beta = \beta_o$. Similarly,

$$ \mathbb{E}[(y_t - x_t'\beta)^2] = \mathbb{E}[(y_t - x_t'\beta_o + x_t'\beta_o - x_t'\beta)^2] = \mathbb{E}[(y_t - x_t'\beta_o)^2] + \mathbb{E}[(x_t'\beta_o - x_t'\beta)^2] = \sigma_o^2 + \mathbb{E}[(x_t'\beta_o - x_t'\beta)^2], $$

where the second term on the right-hand side is zero when evaluated at $\beta = \beta_o$. These results together show that

$$ H_T(\theta) = \mathbb{E}[\nabla^2 L_T(\theta)] = \begin{pmatrix} -\dfrac{\mathbb{E}(x_t x_t')}{\sigma^2} & -\dfrac{\mathbb{E}(x_t x_t')(\beta_o - \beta)}{(\sigma^2)^2} \\[2ex] -\dfrac{(\beta_o - \beta)'\,\mathbb{E}(x_t x_t')}{(\sigma^2)^2} & \dfrac{1}{2(\sigma^2)^2} - \dfrac{\sigma_o^2 + \mathbb{E}[(x_t'\beta_o - x_t'\beta)^2]}{(\sigma^2)^3} \end{pmatrix}. $$

When this matrix is evaluated at $\theta_o = (\beta_o' \ \sigma_o^2)'$,

$$ H_T(\theta_o) = \begin{pmatrix} -\dfrac{\mathbb{E}(x_t x_t')}{\sigma_o^2} & 0 \\ 0' & -\dfrac{1}{2(\sigma_o^2)^2} \end{pmatrix}. $$

If there is no dynamic misspecification,

$$ B_T(\theta) = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E} \begin{pmatrix} \dfrac{(y_t - x_t'\beta)^2\, x_t x_t'}{(\sigma^2)^2} & -\dfrac{x_t (y_t - x_t'\beta)}{2(\sigma^2)^2} + \dfrac{x_t (y_t - x_t'\beta)^3}{2(\sigma^2)^3} \\[2ex] -\dfrac{(y_t - x_t'\beta)\, x_t'}{2(\sigma^2)^2} + \dfrac{(y_t - x_t'\beta)^3\, x_t'}{2(\sigma^2)^3} & \dfrac{1}{4(\sigma^2)^2} - \dfrac{(y_t - x_t'\beta)^2}{2(\sigma^2)^3} + \dfrac{(y_t - x_t'\beta)^4}{4(\sigma^2)^4} \end{pmatrix}. $$

Given that $y_t$ is conditionally normally distributed, its conditional third and fourth central moments are zero and $3(\sigma_o^2)^2$, respectively. It can then be verified that

$$ \mathbb{E}[(y_t - x_t'\beta)^3] = 3\sigma_o^2\, \mathbb{E}[(x_t'\beta_o - x_t'\beta)] + \mathbb{E}[(x_t'\beta_o - x_t'\beta)^3], \tag{9.11} $$

which is zero when evaluated at $\beta = \beta_o$, and that

$$ \mathbb{E}[(y_t - x_t'\beta)^4] = 3(\sigma_o^2)^2 + 6\sigma_o^2\, \mathbb{E}[(x_t'\beta_o - x_t'\beta)^2] + \mathbb{E}[(x_t'\beta_o - x_t'\beta)^4], \tag{9.12} $$

which is $3(\sigma_o^2)^2$ when evaluated at $\beta = \beta_o$; see Exercise 9.2. Consequently,

$$ B_T(\theta_o) = \begin{pmatrix} \dfrac{\mathbb{E}(x_t x_t')}{\sigma_o^2} & 0 \\ 0' & \dfrac{1}{2(\sigma_o^2)^2} \end{pmatrix}. $$

This shows that the information matrix equality holds. A typical consistent estimator of $H_T(\theta_o)$ is

$$ \hat{H}_T = \begin{pmatrix} -\dfrac{1}{T}\displaystyle\sum_t \dfrac{x_t x_t'}{\hat{\sigma}_T^2} & 0 \\ 0' & -\dfrac{1}{2(\hat{\sigma}_T^2)^2} \end{pmatrix}. $$

Due to the information matrix equality, a consistent estimator of $B_T(\theta_o)$ is $\hat{B}_T = -\hat{H}_T$. It can be seen that the upper-left block of $-\hat{H}_T^{-1}$, namely $\hat{\sigma}_T^2\,(\sum_t x_t x_t'/T)^{-1}$, is the standard estimator for the covariance matrix of the OLS estimator $\hat{\beta}_T$. On the other hand, when there is dynamic misspecification, $B_T(\theta)$ is not as given above, so that the information matrix equality fails; see Exercise 9.3.

The information matrix equality may also fail in other circumstances. The example below shows that, even when the specification for $\mathbb{E}(y_t \mid x_t)$ is correct and there is no dynamic misspecification, the information matrix equality may still fail if other conditional moments are misspecified, as with neglected conditional heteroskedasticity.

Example 9.6 Consider the specification of Example 9.5: $y_t \mid x_t \sim N(x_t'\beta, \sigma^2)$ for all $t$. Suppose that the DGP is $y_t \mid x_t \sim N\big(x_t'\beta_o,\, h(x_t'\gamma_o)\big)$, where the conditional variance $h(x_t'\gamma_o)$ varies with $x_t$. This specification thus contains a correct specification of the conditional mean but is itself incorrect for $\{y_t \mid x_t\}$ because it ignores conditional heteroskedasticity. From Example 9.5 we see that the upper-left block of $H_T(\theta)$ is $-\mathbb{E}(x_t x_t')/\sigma^2$, and that the corresponding block of $B_T(\theta)$ is $\mathbb{E}[(y_t - x_t'\beta)^2\, x_t x_t']/(\sigma^2)^2$. As the specification is correct for the conditional mean but not for the conditional variance, the KLIC minimizer is $\theta^* = (\beta_o',\, (\sigma^*)^2)'$. Evaluating the corresponding submatrix of $H_T(\theta)$ at $\theta^*$ yields $-\mathbb{E}(x_t x_t')/(\sigma^*)^2$, which differs from the submatrix of $B_T(\theta)$ evaluated at $\theta^*$:

$$ \frac{\mathbb{E}[h(x_t'\gamma_o)\, x_t x_t']}{[(\sigma^*)^2]^2}. $$

Thus, the information matrix equality breaks down, even though the conditional mean specification is correct. A consistent estimator of $H_T(\theta^*)$ is

$$ \hat{H}_T = \begin{pmatrix} -\dfrac{1}{T}\displaystyle\sum_t \dfrac{x_t x_t'}{\hat{\sigma}_T^2} & 0 \\ 0' & -\dfrac{1}{2(\hat{\sigma}_T^2)^2} \end{pmatrix}, $$

yet a consistent estimator of $B_T(\theta^*)$ is

$$ \hat{B}_T = \begin{pmatrix} \dfrac{1}{T}\displaystyle\sum_t \dfrac{\hat{e}_t^2\, x_t x_t'}{(\hat{\sigma}_T^2)^2} & 0 \\ 0' & -\dfrac{1}{4(\hat{\sigma}_T^2)^2} + \dfrac{1}{T}\displaystyle\sum_t \dfrac{\hat{e}_t^4}{4(\hat{\sigma}_T^2)^4} \end{pmatrix}; $$

see Exercise 9.4. Due to the block diagonality of $\hat{H}_T$ and $\hat{B}_T$, it is easy to verify that the upper-left block of $\hat{H}_T^{-1} \hat{B}_T \hat{H}_T^{-1}$ is

$$ \Big( \frac{1}{T}\sum_t x_t x_t' \Big)^{-1} \Big( \frac{1}{T}\sum_t \hat{e}_t^2\, x_t x_t' \Big) \Big( \frac{1}{T}\sum_t x_t x_t' \Big)^{-1}. $$

This is precisely the Eicker-White estimator (6.10) for the covariance matrix of the OLS estimator $\hat{\beta}_T$, as discussed in Chapter 6.
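In code, the block computation above is exactly the familiar heteroskedasticity-robust "sandwich" calculation. A minimal sketch, reusing `X` and `y` from the earlier simulated example:

```python
import numpy as np

# Eicker-White covariance for the OLS/QML estimator of beta, matching the
# upper-left block derived above.  Assumes X, y from the simulated example.
T = len(y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_ols

bread = np.linalg.inv(X.T @ X / T)
meat = (X * (e ** 2)[:, None]).T @ X / T
cov_sqrtT = bread @ meat @ bread   # covariance of sqrt(T)(beta_hat - beta_o)
cov_beta = cov_sqrtT / T           # finite-sample covariance of beta_hat
```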

9.4 Hypothesis Testing

In this section we discuss three classical large-sample tests (the Wald, LM, and likelihood ratio tests), the Hausman (1978) test, and the information matrix test of White (1982, 1987) for the null hypothesis $R\theta^* = r$, where $R$ is a $q \times k$ matrix with full row rank. Failure to reject the null hypothesis suggests that there is no empirical evidence against the specified model under the null hypothesis.

9.4.1 Wald Test

Similar to the Wald test for linear regression discussed in Section 6.4.1, the Wald test under the QML framework is based on the difference $R\hat{\theta}_T - r$. From (9.8),

$$ \sqrt{T}\, R(\hat{\theta}_T - \theta^*) = -R H_T(\theta^*)^{-1} B_T(\theta^*)^{1/2} \big[ B_T(\theta^*)^{-1/2} \sqrt{T}\, \nabla L_T(z^T; \theta^*) \big] + o_{\mathbb{P}}(1). $$

This shows that $R C_T(\theta^*) R' = R H_T(\theta^*)^{-1} B_T(\theta^*) H_T(\theta^*)^{-1} R'$ can be used to normalize $\sqrt{T}\, R(\hat{\theta}_T - \theta^*)$ to obtain asymptotic normality. Under the null hypothesis, $R\theta^* = r$, so that

$$ [R C_T(\theta^*) R']^{-1/2}\, \sqrt{T}\, (R\hat{\theta}_T - r) \xrightarrow{\ D\ } N(0, I_q). \tag{9.13} $$

This is the key distributional result for the Wald test. For notational simplicity, let $\hat{H}_T = H_T(\hat{\theta}_T)$ denote a consistent estimator of $H_T(\theta^*)$ and $\hat{B}_T = B_T(\hat{\theta}_T)$ a consistent estimator of $B_T(\theta^*)$. It follows that a consistent estimator of $C_T(\theta^*)$ is

$$ \hat{C}_T = \hat{H}_T^{-1} \hat{B}_T \hat{H}_T^{-1}. $$

Substituting $\hat{C}_T$ for $C_T(\theta^*)$ in (9.13), we have

$$ [R \hat{C}_T R']^{-1/2}\, \sqrt{T}\, (R\hat{\theta}_T - r) \xrightarrow{\ D\ } N(0, I_q). \tag{9.14} $$

The Wald test statistic is the inner product of the left-hand side of (9.14):

$$ \mathcal{W}_T = T\, (R\hat{\theta}_T - r)' (R \hat{C}_T R')^{-1} (R\hat{\theta}_T - r). \tag{9.15} $$

The limiting distribution of the Wald test now follows from (9.14) and the continuous mapping theorem.

Theorem 9.7 Suppose that Theorem 9.2 holds for the QMLE $\hat{\theta}_T$. Then, under the null hypothesis,

$$ \mathcal{W}_T \xrightarrow{\ D\ } \chi^2(q), $$

where $\mathcal{W}_T$ is defined in (9.15) and $q$ is the number of rows of $R$.

Example 9.8 Consider the quasi-log-likelihood function specified in Example 9.5. We write $\theta = (\sigma^2 \ \beta')'$ and $\beta = (b_1' \ b_2')'$, where $b_1$ is $(k - s) \times 1$ and $b_2$ is $s \times 1$. We are interested in the null hypothesis that $b_2 = R\theta = 0$, where $R = [0 \ R_1]$ is $s \times (k+1)$ and $R_1 = [0 \ I_s]$ is $s \times k$. The Wald test can be computed according to (9.15):

$$ \mathcal{W}_T = T\, \hat{\beta}_{2,T}'\, (R \hat{C}_T R')^{-1} \hat{\beta}_{2,T}, $$

where $\hat{\beta}_{2,T} = R\hat{\theta}_T$ is the estimator of $b_2$. As shown in Example 9.5, when the information matrix equality holds, $\hat{C}_T = -\hat{H}_T^{-1}$ is block diagonal, so that

$$ R \hat{C}_T R' = \hat{\sigma}_T^2\, R_1 (X'X/T)^{-1} R_1'. $$

The Wald test then becomes

$$ \mathcal{W}_T = T\, \hat{\beta}_{2,T}'\, \big[ R_1 (X'X/T)^{-1} R_1' \big]^{-1} \hat{\beta}_{2,T} \big/ \hat{\sigma}_T^2. $$

In this case, the Wald statistic is just $s$ times the standard $F$ statistic, which is readily available from most econometric packages.
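Once $\hat{\theta}_T$ and $\hat{C}_T$ are available, the Wald statistic (9.15) is a few lines of linear algebra. The sketch below continues the simulated example (reusing `beta_qmle`, `s2_qmle`, and the sandwich estimate `C` from the earlier sketches, with $\theta$ ordered $(\beta', \sigma^2)'$ as there) and tests the illustrative hypothesis that the last slope coefficient is zero.

```python
import numpy as np
from scipy import stats

# Wald test (9.15) of H0: R theta* = r, with C the sandwich estimate from the
# earlier sketch.  theta is ordered (beta, sigma^2) as in that sketch, and R
# picks out the last slope coefficient (an illustrative restriction, s = 1).
theta_hat = np.concatenate([beta_qmle, [s2_qmle]])
R = np.zeros((1, k + 1)); R[0, k - 1] = 1.0
r = np.zeros(1)

diff = R @ theta_hat - r
W = T * diff @ np.linalg.solve(R @ C @ R.T, diff)
p_value = stats.chi2.sf(W, df=R.shape[0])
```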

9.4.2 Lagrange Multiplier Test

Consider now the problem of maximizing $L_T(\theta)$ subject to the constraint $R\theta = r$. The Lagrangian is

$$ L_T(\theta) + \theta' R' \lambda, $$

where $\lambda$ is the vector of Lagrange multipliers. The maximizers of the Lagrangian are denoted $\tilde{\theta}_T$ and $\tilde{\lambda}_T$, where $\tilde{\theta}_T$ is the constrained QMLE of $\theta$. Analogous to Section 6.4.2, the LM test under the QML framework also checks whether $\tilde{\lambda}_T$ is sufficiently close to zero.

First note that $\tilde{\theta}_T$ and $\tilde{\lambda}_T$ satisfy the saddle-point condition

$$ \nabla L_T(\tilde{\theta}_T) + R'\tilde{\lambda}_T = 0. $$

The mean-value expansion of $\nabla L_T(\tilde{\theta}_T)$ about $\theta^*$ yields

$$ \nabla L_T(\theta^*) + \nabla^2 L_T(\theta_T^{\dagger})\,(\tilde{\theta}_T - \theta^*) + R'\tilde{\lambda}_T = 0, $$

where $\theta_T^{\dagger}$ is the mean value between $\tilde{\theta}_T$ and $\theta^*$. It has been shown in (9.8) that

$$ \sqrt{T}\,(\hat{\theta}_T - \theta^*) = -H_T(\theta^*)^{-1} \sqrt{T}\, \nabla L_T(\theta^*) + o_{\mathbb{P}}(1). $$

Hence,

$$ -H_T(\theta^*)\, \sqrt{T}\,(\hat{\theta}_T - \theta^*) + \nabla^2 L_T(\theta_T^{\dagger})\, \sqrt{T}\,(\tilde{\theta}_T - \theta^*) + \sqrt{T}\, R'\tilde{\lambda}_T = o_{\mathbb{P}}(1). $$

Using the WULLN result $\nabla^2 L_T(\theta_T^{\dagger}) - H_T(\theta^*) \xrightarrow{\mathbb{P}} 0$, we obtain

$$ \sqrt{T}\,(\tilde{\theta}_T - \theta^*) = \sqrt{T}\,(\hat{\theta}_T - \theta^*) - H_T(\theta^*)^{-1} R'\, \sqrt{T}\, \tilde{\lambda}_T + o_{\mathbb{P}}(1). \tag{9.16} $$

This establishes a relationship between the constrained and unconstrained QMLEs. Pre-multiplying both sides of (9.16) by $R$ and noting that the constrained estimator $\tilde{\theta}_T$ must satisfy the constraint, so that $R(\tilde{\theta}_T - \theta^*) = 0$ under the null hypothesis, we have

$$ \sqrt{T}\, \tilde{\lambda}_T = [R H_T(\theta^*)^{-1} R']^{-1} R\, \sqrt{T}\,(\hat{\theta}_T - \theta^*) + o_{\mathbb{P}}(1), \tag{9.17} $$

which relates the Lagrange multiplier to the unconstrained QMLE $\hat{\theta}_T$. When Theorem 9.2 holds for the normalized $\hat{\theta}_T$, we obtain the following asymptotic normality result for the normalized Lagrange multiplier:

$$ \Lambda_T(\theta^*)^{-1/2}\, \sqrt{T}\, \tilde{\lambda}_T = \Lambda_T(\theta^*)^{-1/2} [R H_T(\theta^*)^{-1} R']^{-1} R\, \sqrt{T}\,(\hat{\theta}_T - \theta^*) \xrightarrow{\ D\ } N(0, I_q), \tag{9.18} $$

where

$$ \Lambda_T(\theta^*) = [R H_T(\theta^*)^{-1} R']^{-1}\, R C_T(\theta^*) R'\, [R H_T(\theta^*)^{-1} R']^{-1}. $$

Let $\tilde{H}_T = H_T(\tilde{\theta}_T)$ denote a consistent estimator of $H_T(\theta^*)$ and $\tilde{C}_T = C_T(\tilde{\theta}_T)$ a consistent estimator of $C_T(\theta^*)$, both based on the constrained QMLE $\tilde{\theta}_T$. Then,

$$ \tilde{\Lambda}_T = (R \tilde{H}_T^{-1} R')^{-1}\, R \tilde{C}_T R'\, (R \tilde{H}_T^{-1} R')^{-1} $$

is consistent for $\Lambda_T(\theta^*)$. It follows from (9.18) that

$$ \tilde{\Lambda}_T^{-1/2}\, \sqrt{T}\, \tilde{\lambda}_T \xrightarrow{\ D\ } N(0, I_q). \tag{9.19} $$

The LM test statistic is the inner product of the left-hand side of (9.19):

$$ \mathcal{LM}_T = T\, \tilde{\lambda}_T' \tilde{\Lambda}_T^{-1} \tilde{\lambda}_T = T\, \tilde{\lambda}_T' R \tilde{H}_T^{-1} R'\, (R \tilde{C}_T R')^{-1}\, R \tilde{H}_T^{-1} R' \tilde{\lambda}_T. \tag{9.20} $$

The limiting distribution of the LM test now follows easily from (9.19) and the continuous mapping theorem.

Theorem 9.9 Suppose that Theorem 9.2 holds for the QMLE $\hat{\theta}_T$. Then, under the null hypothesis,

$$ \mathcal{LM}_T \xrightarrow{\ D\ } \chi^2(q), $$

where $\mathcal{LM}_T$ is defined in (9.20) and $q$ is the number of rows of $R$.

Remark: When the information matrix equality holds, the LM statistic (9.20) becomes

$$ \mathcal{LM}_T = -T\, \tilde{\lambda}_T' R \tilde{H}_T^{-1} R' \tilde{\lambda}_T = -T\, \nabla L_T(\tilde{\theta}_T)'\, \tilde{H}_T^{-1}\, \nabla L_T(\tilde{\theta}_T), $$

which mainly involves the average of the scores, $\nabla L_T(\tilde{\theta}_T)$. The LM test is thus a test that checks whether the average score is sufficiently close to zero and is hence also known as the score test.

Example 9.10 Consider the quasi-log-likelihood function specified in Example 9.5. We write $\theta = (\sigma^2 \ \beta')'$ and $\beta = (b_1' \ b_2')'$, where $b_1$ is $(k - s) \times 1$ and $b_2$ is $s \times 1$. We are interested in the null hypothesis that $b_2 = R\theta = 0$, where $R = [0 \ R_1]$ is $s \times (k+1)$ and $R_1 = [0 \ I_s]$ is $s \times k$. From the saddle-point condition, $\nabla L_T(\tilde{\theta}_T) = -R'\tilde{\lambda}_T$, which can be partitioned as

$$ \nabla L_T(\tilde{\theta}_T) = \begin{pmatrix} \nabla_{\sigma^2} L_T(\tilde{\theta}_T) \\ \nabla_{b_1} L_T(\tilde{\theta}_T) \\ \nabla_{b_2} L_T(\tilde{\theta}_T) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ -\tilde{\lambda}_T \end{pmatrix} = -R'\tilde{\lambda}_T. $$

Partitioning $x_t$ accordingly as $(x_{1t}' \ x_{2t}')'$, we have

$$ \nabla_{b_2} L_T(\tilde{\theta}_T) = \frac{1}{T \tilde{\sigma}_T^2} \sum_{t=1}^{T} x_{2t}\, \tilde{\epsilon}_t = X_2'\tilde{\epsilon}\, /\, (T \tilde{\sigma}_T^2), $$

where $\tilde{\sigma}_T^2 = \tilde{\epsilon}'\tilde{\epsilon}/T$, $\tilde{\epsilon}$ is the vector of constrained residuals obtained from regressing $y_t$ on $x_{1t}$, and $X_2$ is the $T \times s$ matrix whose $t$-th row is $x_{2t}'$. The LM test can be computed according to (9.20):

$$ \mathcal{LM}_T = T \begin{pmatrix} 0 \\ 0 \\ X_2'\tilde{\epsilon}/(T\tilde{\sigma}_T^2) \end{pmatrix}' \tilde{H}_T^{-1} R'\, (R \tilde{C}_T R')^{-1}\, R \tilde{H}_T^{-1} \begin{pmatrix} 0 \\ 0 \\ X_2'\tilde{\epsilon}/(T\tilde{\sigma}_T^2) \end{pmatrix}, $$

which converges in distribution to $\chi^2(s)$ under the null hypothesis. Note that we do not have to evaluate the complete score vector for computing the LM test; only the subvector of the score that corresponds to the constraint matters. When the information matrix equality holds, the LM statistic has a simpler form:

$$ \mathcal{LM}_T = [0 \ \ \tilde{\epsilon}'X_2/\sqrt{T}]\, (X'X/T)^{-1}\, [0 \ \ \tilde{\epsilon}'X_2/\sqrt{T}]' \big/ \tilde{\sigma}_T^2 = T\, \frac{\tilde{\epsilon}'X(X'X)^{-1}X'\tilde{\epsilon}}{\tilde{\epsilon}'\tilde{\epsilon}} = T R^2, $$

using $X_1'\tilde{\epsilon} = 0$, where $R^2$ is the non-centered coefficient of determination obtained from the auxiliary regression of the constrained residuals $\tilde{\epsilon}_t$ on $x_{1t}$ and $x_{2t}$.

Example 9.11 (Breusch-Pagan) Suppose that the specification is

$$ y_t \mid x_t, \zeta_t \sim N\big(x_t'\beta,\, h(\zeta_t'\alpha)\big), $$

where $h: \mathbb{R} \to (0, \infty)$ is a differentiable function, and $\zeta_t'\alpha = \alpha_0 + \sum_{i=1}^{p} \zeta_{ti}\alpha_i$. The null hypothesis is conditional homoskedasticity, i.e., $\alpha_1 = \cdots = \alpha_p = 0$, so that $h(\alpha_0) = \sigma_0^2$. Breusch and Pagan (1979) derived the LM test for this hypothesis under the assumption that the information matrix equality holds. This test is now usually referred to as the Breusch-Pagan test. Note that the constrained specification is $y_t \mid x_t, \zeta_t \sim N(x_t'\beta, \sigma^2)$, where $\sigma^2 = h(\alpha_0)$. This leads to the standard linear regression model without heteroskedasticity. The constrained QMLEs for $\beta$ and $\sigma^2$ are, respectively, the OLS estimators $\hat{\beta}_T$ and $\hat{\sigma}_T^2 = \sum_{t=1}^{T} \hat{e}_t^2/T$, where $\hat{e}_t$ are the OLS residuals. The score vector corresponding to $\alpha$ is

$$ \nabla_{\alpha} L_T(y_t, x_t, \zeta_t; \theta) = \frac{1}{T} \sum_{t=1}^{T} \frac{h'(\zeta_t'\alpha)\, \zeta_t}{2h(\zeta_t'\alpha)} \Big( \frac{(y_t - x_t'\beta)^2}{h(\zeta_t'\alpha)} - 1 \Big), $$

where $h'(\eta) = dh(\eta)/d\eta$. Under the null hypothesis, $h'(\zeta_t'\alpha^*) = h'(\alpha_0)$ is just a constant, say $c$. The score vector above, evaluated at the constrained QMLEs, is

$$ \nabla_{\alpha} L_T(y_t, x_t, \zeta_t; \tilde{\theta}_T) = \frac{c}{2\hat{\sigma}_T^2} \cdot \frac{1}{T} \sum_{t=1}^{T} \zeta_t \Big( \frac{\hat{e}_t^2}{\hat{\sigma}_T^2} - 1 \Big). $$

The $(p+1) \times (p+1)$ block of the Hessian matrix corresponding to $\alpha$ is

$$ \frac{1}{T} \sum_{t=1}^{T} \Big[ -\frac{(y_t - x_t'\beta)^2}{h^3(\zeta_t'\alpha)} + \frac{1}{2h^2(\zeta_t'\alpha)} \Big] [h'(\zeta_t'\alpha)]^2\, \zeta_t \zeta_t' + \frac{1}{T} \sum_{t=1}^{T} \Big[ \frac{(y_t - x_t'\beta)^2}{2h^2(\zeta_t'\alpha)} - \frac{1}{2h(\zeta_t'\alpha)} \Big] h''(\zeta_t'\alpha)\, \zeta_t \zeta_t'. $$

Evaluating the expectation of this block at $\theta^* = (\beta_o' \ \alpha_0 \ 0')'$ and noting that $(\sigma^*)^2 = h(\alpha_0)$, we have

$$ -\Big( \frac{c^2}{2[(\sigma^*)^2]^2} \Big)\, \mathbb{E}(\zeta_t \zeta_t'), $$

which, apart from the constant $c^2$, can be estimated by

$$ -\Big( \frac{c^2}{2(\hat{\sigma}_T^2)^2} \Big) \frac{1}{T} \sum_{t=1}^{T} \zeta_t \zeta_t'. $$

The LM test is now readily derived from the results above when the information matrix equality holds. Setting $d_t = \hat{e}_t^2/\hat{\sigma}_T^2 - 1$, the LM statistic is

$$ \mathcal{LM}_T = \Big( \sum_{t=1}^{T} d_t \zeta_t' \Big) \Big( \sum_{t=1}^{T} \zeta_t \zeta_t' \Big)^{-1} \Big( \sum_{t=1}^{T} \zeta_t d_t \Big) \Big/ 2 = d'Z(Z'Z)^{-1}Z'd\, /\, 2 \xrightarrow{\ D\ } \chi^2(p), $$

where $d$ is the $T \times 1$ vector with $t$-th element $d_t$, and $Z$ is the $T \times (p+1)$ matrix with $t$-th row $\zeta_t'$. It can be seen that the numerator of the LM statistic is the (centered) regression sum of squares (RSS) of regressing $d_t$ on $\zeta_t$. This shows that the Breusch-Pagan test can also be computed by running an auxiliary regression and using the resulting RSS$/2$ as the statistic. Intuitively, this amounts to checking whether the variables in $\zeta_t$ are capable of explaining the squares of the (standardized) OLS residuals. It is also interesting to see that the value of $c$ and the functional form of $h$ do not matter in deriving the statistic.
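The RSS$/2$ form is straightforward to compute via the auxiliary regression just described. A minimal sketch (reusing `X` and `y` from the simulated example, and taking $\zeta_t$ to be a constant plus the regressors, an illustrative choice):

```python
import numpy as np
from scipy import stats

# Breusch-Pagan statistic RSS/2 from regressing d_t = e_t^2/s2 - 1 on zeta_t.
# Assumes X, y from the earlier simulated example; zeta_t = (1, x_t')' is an
# illustrative choice of variance regressors.
T = len(y)
e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)    # OLS residuals
s2 = np.mean(e ** 2)
d = e ** 2 / s2 - 1.0                            # d_t has sample mean zero

Z = np.column_stack([np.ones(T), X])
d_fit = Z @ np.linalg.solve(Z.T @ Z, Z.T @ d)    # fitted values of the auxiliary regression
bp = (d_fit @ d_fit) / 2.0                       # = d'Z(Z'Z)^{-1}Z'd / 2
p_bp = stats.chi2.sf(bp, df=X.shape[1])          # df = p, the non-constant variance regressors
```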

This invariance to $c$ and to the functional form of $h$ makes the Breusch-Pagan test a general test for conditional heteroskedasticity.

Koenker (1981) noted that, under conditional normality, $\sum_{t=1}^{T} d_t^2/T \xrightarrow{\mathbb{P}} 2$. Thus, a test that is more robust to non-normality and asymptotically equivalent to the Breusch-Pagan test is obtained by replacing the denominator $2$ with $\sum_{t=1}^{T} d_t^2/T$. This robust version can be expressed as

$$ \mathcal{LM}_T = T\, \big[ d'Z(Z'Z)^{-1}Z'd\, /\, d'd \big] = T R^2, $$

where $R^2$ is the (centered) $R^2$ from regressing $d_t$ on $\zeta_t$. This is also equivalent to $T$ times the centered $R^2$ from regressing $\hat{e}_t^2$ on $\zeta_t$.

Remarks:

1. To compute the Breusch-Pagan test, one must specify a vector $\zeta_t$ that determines the conditional variance. Here, $\zeta_t$ may contain some or all of the variables in $x_t$. If $\zeta_t$ is chosen to include all elements of $x_t$, their squares, and their pairwise products, the resulting $T R^2$ statistic is also the White (1980) test for (conditional) heteroskedasticity of unknown form. The White test can also be interpreted as an information matrix test, discussed below.

2. The Breusch-Pagan test is obtained under the condition that the information matrix equality holds. We have seen that the information matrix equality may fail when there is dynamic misspecification. Thus, the Breusch-Pagan test is not valid when, e.g., the errors are serially correlated.

Example 9.12 (Breusch-Godfrey) Given the specification $y_t \mid x_t \sim N(x_t'\beta, \sigma^2)$, suppose that one would like to check whether the errors are serially correlated. Consider first AR(1) errors: $y_t - x_t'\beta = \rho(y_{t-1} - x_{t-1}'\beta) + u_t$, with $|\rho| < 1$ and $\{u_t\}$ a white noise. The null hypothesis is $\rho = 0$, i.e., no serial correlation. It can be seen that a general specification that allows for serial correlation is

$$ y_t \mid y_{t-1}, x_t, x_{t-1} \sim N\big(x_t'\beta + \rho(y_{t-1} - x_{t-1}'\beta),\, \sigma_u^2\big). $$

The constrained specification is just the standard linear regression model $y_t = x_t'\beta + \epsilon_t$. Testing the null hypothesis that $\rho = 0$ is thus equivalent to testing whether the additional variable $y_{t-1} - x_{t-1}'\beta$ should be included in the mean specification. In light of Example 9.10, if the information matrix equality holds, the LM test can be computed as $T R^2$, where $R^2$ is obtained from regressing the OLS residuals $\hat{\epsilon}_t$ on $x_t$ and

$\hat{\epsilon}_{t-1} = y_{t-1} - x_{t-1}'\hat{\beta}_T$. This is precisely the Breusch (1978) and Godfrey (1978) test for AR(1) errors, with a limiting $\chi^2(1)$ distribution.

The test above extends straightforwardly to AR(p) errors. By regressing $\hat{\epsilon}_t$ on $x_t$ and $\hat{\epsilon}_{t-1}, \ldots, \hat{\epsilon}_{t-p}$, the resulting $T R^2$ is the LM statistic when the information matrix equality holds and has a limiting $\chi^2(p)$ distribution. Such tests are known as the Breusch-Godfrey test. Moreover, if the specification is $y_t - x_t'\beta = u_t + \alpha u_{t-1}$, i.e., the errors follow an MA(1) process, we can write

$$ y_t \mid x_t, u_{t-1} \sim N(x_t'\beta + \alpha u_{t-1},\, \sigma_u^2). $$

The null hypothesis is $\alpha = 0$. It is readily seen that the resulting LM test is identical to that for AR(1) errors. Thus, the Breusch-Godfrey test for MA(p) errors is the same as that for AR(p) errors.

Remarks:

1. The Breusch-Godfrey tests discussed above are obtained under the condition that the information matrix equality holds. We have seen that the information matrix equality may fail when, for example, there is neglected conditional heteroskedasticity. Thus, the Breusch-Godfrey tests are not valid when conditional heteroskedasticity is present.

2. It can be shown that the square of Durbin's h test is also an LM test. While Durbin's h test may not be feasible in practice, the Breusch-Godfrey test can always be computed.
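The Breusch-Godfrey $TR^2$ statistic is equally simple to compute from the auxiliary regression of the OLS residuals on the original regressors and their own lags. A minimal sketch (`X`, `y` from the simulated example; pre-sample lagged residuals set to zero, a common convention):

```python
import numpy as np
from scipy import stats

# Breusch-Godfrey TR^2 test for AR(p) errors.  Assumes X, y from the earlier
# simulated example; pre-sample residuals are padded with zeros.
p = 2
T = len(y)
e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)    # OLS residuals

lags = np.column_stack([np.r_[np.zeros(j), e[:-j]] for j in range(1, p + 1)])
W_aux = np.column_stack([X, lags])               # auxiliary regressors
e_fit = W_aux @ np.linalg.solve(W_aux.T @ W_aux, W_aux.T @ e)

r2 = (e_fit @ e_fit) / (e @ e)                   # uncentered R^2 of the auxiliary regression
bg = T * r2
p_bg = stats.chi2.sf(bg, df=p)
```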

9.4.3 Likelihood Ratio Test

As discussed in Section 6.4.3, the LR test compares the performance of the constrained and unconstrained specifications based on their likelihood ratio:

$$ \mathcal{LR}_T = -2T\, \big[ L_T(\tilde{\theta}_T) - L_T(\hat{\theta}_T) \big]. \tag{9.21} $$

Utilizing the relationship between the constrained and unconstrained QMLEs (9.16) and the relationship between the Lagrange multiplier and the unconstrained QMLE (9.17), we can write

$$ \sqrt{T}\,(\hat{\theta}_T - \tilde{\theta}_T) = H_T(\theta^*)^{-1} R'\, \sqrt{T}\, \tilde{\lambda}_T + o_{\mathbb{P}}(1) = H_T(\theta^*)^{-1} R' [R H_T(\theta^*)^{-1} R']^{-1} R\, \sqrt{T}\,(\hat{\theta}_T - \theta^*) + o_{\mathbb{P}}(1). \tag{9.22} $$

By Taylor expansion of $L_T(\tilde{\theta}_T)$ about $\hat{\theta}_T$,

$$ -2T \big[ L_T(\tilde{\theta}_T) - L_T(\hat{\theta}_T) \big] = -2T\, \nabla L_T(\hat{\theta}_T)'(\tilde{\theta}_T - \hat{\theta}_T) - T\, (\tilde{\theta}_T - \hat{\theta}_T)' H_T(\theta^*) (\tilde{\theta}_T - \hat{\theta}_T) + o_{\mathbb{P}}(1) $$
$$ = -T\, (\hat{\theta}_T - \tilde{\theta}_T)' H_T(\theta^*) (\hat{\theta}_T - \tilde{\theta}_T) + o_{\mathbb{P}}(1) $$
$$ = -T\, (\hat{\theta}_T - \theta^*)' R' [R H_T(\theta^*)^{-1} R']^{-1} R\, (\hat{\theta}_T - \theta^*) + o_{\mathbb{P}}(1), $$

where the second equality follows because $\nabla L_T(\hat{\theta}_T) = 0$. It can be seen that the right-hand side is essentially the Wald statistic when $-R H_T(\theta^*)^{-1} R'$ is the normalizing variance-covariance matrix. This leads to the following distributional result for the LR test.

Theorem 9.13 Suppose that Theorem 9.2 for the QMLE $\hat{\theta}_T$ and the information matrix equality both hold. Then, under the null hypothesis,

$$ \mathcal{LR}_T \xrightarrow{\ D\ } \chi^2(q), $$

where $\mathcal{LR}_T$ is defined in (9.21) and $q$ is the number of rows of $R$.

Theorem 9.13 differs from Theorem 9.7 and Theorem 9.9 in that it also requires the validity of the information matrix equality. When the information matrix equality does not hold, $-R H_T(\theta^*)^{-1} R'$ is not a proper normalizing matrix, so that $\mathcal{LR}_T$ does not have a limiting $\chi^2$ distribution. This result clearly indicates that, when $L_T$ is constructed by specifying density functions for $\{y_t \mid x_t\}$, the LR test is not robust to dynamic misspecification and misspecifications of other conditional attributes, such as neglected conditional heteroskedasticity. By contrast, the Wald and LM tests can be made robust by employing a proper normalizing variance-covariance matrix.
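When the information matrix equality is credible, the LR statistic (9.21) requires nothing beyond the two maximized quasi-log-likelihoods. A sketch continuing the simulated example (reusing `neg_loglik`, `T`, and `k` from the optimization sketch; the constraint fixing the last slope coefficient at zero is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# LR statistic (9.21): 2T times the gap between unconstrained and constrained
# maximized average quasi-log-likelihoods.  Assumes neg_loglik, T, k from the
# earlier sketch; the restriction beta_k = 0 is illustrative.
res_u = minimize(neg_loglik, np.zeros(k + 1), method="BFGS")

def neg_loglik_constrained(phi):
    # embed the free parameters, imposing the last slope coefficient = 0
    theta = np.concatenate([phi[:k - 1], [0.0], phi[k - 1:]])
    return neg_loglik(theta)

res_c = minimize(neg_loglik_constrained, np.zeros(k), method="BFGS")
LR = 2 * T * (res_c.fun - res_u.fun)   # neg_loglik is -L_T, so the signs flip
```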

Exercises

9.1 Let $g$ and $f$ be two density functions. Show that the KLIC $\mathbb{I}(g:f)$ does not obey the triangle inequality, i.e., that $\mathbb{I}(g:f) \le \mathbb{I}(g:h) + \mathbb{I}(h:f)$ need not hold for another density function $h$.

9.2 Prove equations (9.11) and (9.12) in Example 9.5.

9.3 In Example 9.5, suppose there is dynamic misspecification. What is $B_T(\theta^*)$?

9.4 In Example 9.6, what is $B_T(\theta^*)$? Show that

$$ \hat{B}_T = \begin{pmatrix} \dfrac{1}{T}\displaystyle\sum_t \dfrac{\hat{e}_t^2\, x_t x_t'}{(\hat{\sigma}_T^2)^2} & 0 \\ 0' & -\dfrac{1}{4(\hat{\sigma}_T^2)^2} + \dfrac{1}{T}\displaystyle\sum_t \dfrac{\hat{e}_t^4}{4(\hat{\sigma}_T^2)^4} \end{pmatrix} $$

is a consistent estimator for $B_T(\theta^*)$.

9.5 Consider the specification $y_t \mid x_t \sim N\big(x_t'\beta,\, h(\zeta_t'\alpha)\big)$. What condition would ensure that $H_T(\theta^*)$ and $B_T(\theta^*)$ are block diagonal?

9.6 Consider the specification $y_t \mid x_t, y_{t-1} \sim N(\gamma y_{t-1} + x_t'\beta,\, \sigma^2)$ and the AR(1) errors $y_t - \gamma y_{t-1} - x_t'\beta = \rho(y_{t-1} - \gamma y_{t-2} - x_{t-1}'\beta) + u_t$, with $|\rho| < 1$ and $\{u_t\}$ a white noise. Derive the LM test for the null hypothesis $\rho = 0$ and show that its square root is Durbin's h test.

References

Amemiya, Takeshi (1985). Advanced Econometrics, Cambridge, MA: Harvard University Press.

Breusch, T. S. (1978). Testing for autocorrelation in dynamic linear models, Australian Economic Papers, 17, 334-355.

Breusch, T. S. and A. R. Pagan (1979). A simple test for heteroscedasticity and random coefficient variation, Econometrica, 47, 1287-1294.

Engle, Robert F. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of U.K. inflation, Econometrica, 50, 987-1008.

Godfrey, L. G. (1978). Testing against general autoregressive and moving average error models when the regressors include lagged dependent variables, Econometrica, 46, 1293-1301.

Godfrey, L. G. (1988). Misspecification Tests in Econometrics: The Lagrange Multiplier Principle and Other Approaches, New York: Cambridge University Press.

Hamilton, James D. (1994). Time Series Analysis, Princeton: Princeton University Press.

Hausman, Jerry A. (1978). Specification tests in econometrics, Econometrica, 46, 1251-1271.

Koenker, Roger (1981). A note on studentizing a test for heteroscedasticity, Journal of Econometrics, 17, 107-112.

White, Halbert (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity, Econometrica, 48, 817-838.

White, Halbert (1982). Maximum likelihood estimation of misspecified models, Econometrica, 50, 1-25.

White, Halbert (1994). Estimation, Inference, and Specification Analysis, New York: Cambridge University Press.


More information

Freeing up the Classical Assumptions. () Introductory Econometrics: Topic 5 1 / 94

Freeing up the Classical Assumptions. () Introductory Econometrics: Topic 5 1 / 94 Freeing up the Classical Assumptions () Introductory Econometrics: Topic 5 1 / 94 The Multiple Regression Model: Freeing Up the Classical Assumptions Some or all of classical assumptions needed for derivations

More information

Empirical Economic Research, Part II

Empirical Economic Research, Part II Based on the text book by Ramanathan: Introductory Econometrics Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna December 7, 2011 Outline Introduction

More information

Interpreting Regression Results

Interpreting Regression Results Interpreting Regression Results Carlo Favero Favero () Interpreting Regression Results 1 / 42 Interpreting Regression Results Interpreting regression results is not a simple exercise. We propose to split

More information

ECON2228 Notes 10. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 48

ECON2228 Notes 10. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 48 ECON2228 Notes 10 Christopher F Baum Boston College Economics 2014 2015 cfb (BC Econ) ECON2228 Notes 10 2014 2015 1 / 48 Serial correlation and heteroskedasticity in time series regressions Chapter 12:

More information

LECTURE 10: MORE ON RANDOM PROCESSES

LECTURE 10: MORE ON RANDOM PROCESSES LECTURE 10: MORE ON RANDOM PROCESSES AND SERIAL CORRELATION 2 Classification of random processes (cont d) stationary vs. non-stationary processes stationary = distribution does not change over time more

More information

Asymptotic Least Squares Theory: Part II

Asymptotic Least Squares Theory: Part II Chapter 7 Asymptotic Least Squares heory: Part II In the preceding chapter the asymptotic properties of the OLS estimator were derived under standard regularity conditions that require data to obey suitable

More information

G. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication

G. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication G. S. Maddala Kajal Lahiri WILEY A John Wiley and Sons, Ltd., Publication TEMT Foreword Preface to the Fourth Edition xvii xix Part I Introduction and the Linear Regression Model 1 CHAPTER 1 What is Econometrics?

More information

Asymptotic Least Squares Theory

Asymptotic Least Squares Theory Asymptotic Least Squares Theory CHUNG-MING KUAN Department of Finance & CRETA December 5, 2011 C.-M. Kuan (National Taiwan Univ.) Asymptotic Least Squares Theory December 5, 2011 1 / 85 Lecture Outline

More information

GARCH Models Estimation and Inference. Eduardo Rossi University of Pavia

GARCH Models Estimation and Inference. Eduardo Rossi University of Pavia GARCH Models Estimation and Inference Eduardo Rossi University of Pavia Likelihood function The procedure most often used in estimating θ 0 in ARCH models involves the maximization of a likelihood function

More information

Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, / 91. Bruce E.

Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, / 91. Bruce E. Forecasting Lecture 3 Structural Breaks Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, 2013 1 / 91 Bruce E. Hansen Organization Detection

More information

Regression with time series

Regression with time series Regression with time series Class Notes Manuel Arellano February 22, 2018 1 Classical regression model with time series Model and assumptions The basic assumption is E y t x 1,, x T = E y t x t = x tβ

More information

Reading Assignment. Serial Correlation and Heteroskedasticity. Chapters 12 and 11. Kennedy: Chapter 8. AREC-ECON 535 Lec F1 1

Reading Assignment. Serial Correlation and Heteroskedasticity. Chapters 12 and 11. Kennedy: Chapter 8. AREC-ECON 535 Lec F1 1 Reading Assignment Serial Correlation and Heteroskedasticity Chapters 1 and 11. Kennedy: Chapter 8. AREC-ECON 535 Lec F1 1 Serial Correlation or Autocorrelation y t = β 0 + β 1 x 1t + β x t +... + β k

More information

Statistics 910, #5 1. Regression Methods

Statistics 910, #5 1. Regression Methods Statistics 910, #5 1 Overview Regression Methods 1. Idea: effects of dependence 2. Examples of estimation (in R) 3. Review of regression 4. Comparisons and relative efficiencies Idea Decomposition Well-known

More information

Generalized Method of Moments (GMM) Estimation

Generalized Method of Moments (GMM) Estimation Econometrics 2 Fall 2004 Generalized Method of Moments (GMM) Estimation Heino Bohn Nielsen of29 Outline of the Lecture () Introduction. (2) Moment conditions and methods of moments (MM) estimation. Ordinary

More information

Formulary Applied Econometrics

Formulary Applied Econometrics Department of Economics Formulary Applied Econometrics c c Seminar of Statistics University of Fribourg Formulary Applied Econometrics 1 Rescaling With y = cy we have: ˆβ = cˆβ With x = Cx we have: ˆβ

More information

Additional Topics on Linear Regression

Additional Topics on Linear Regression Additional Topics on Linear Regression Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Additional Topics 1 / 49 1 Tests for Functional Form Misspecification 2 Nonlinear

More information

Birkbeck Working Papers in Economics & Finance

Birkbeck Working Papers in Economics & Finance ISSN 1745-8587 Birkbeck Working Papers in Economics & Finance Department of Economics, Mathematics and Statistics BWPEF 1809 A Note on Specification Testing in Some Structural Regression Models Walter

More information

Using all observations when forecasting under structural breaks

Using all observations when forecasting under structural breaks Using all observations when forecasting under structural breaks Stanislav Anatolyev New Economic School Victor Kitov Moscow State University December 2007 Abstract We extend the idea of the trade-off window

More information

13.2 Example: W, LM and LR Tests

13.2 Example: W, LM and LR Tests 13.2 Example: W, LM and LR Tests Date file = cons99.txt (same data as before) Each column denotes year, nominal household expenditures ( 10 billion yen), household disposable income ( 10 billion yen) and

More information

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS Page 1 MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level

More information

The Size and Power of Four Tests for Detecting Autoregressive Conditional Heteroskedasticity in the Presence of Serial Correlation

The Size and Power of Four Tests for Detecting Autoregressive Conditional Heteroskedasticity in the Presence of Serial Correlation The Size and Power of Four s for Detecting Conditional Heteroskedasticity in the Presence of Serial Correlation A. Stan Hurn Department of Economics Unversity of Melbourne Australia and A. David McDonald

More information

Follow links for Class Use and other Permissions. For more information send to:

Follow links for Class Use and other Permissions. For more information send  to: COPYRIGH NOICE: Kenneth. Singleton: Empirical Dynamic Asset Pricing is published by Princeton University Press and copyrighted, 00, by Princeton University Press. All rights reserved. No part of this book

More information

Multiple Regression Analysis

Multiple Regression Analysis Multiple Regression Analysis y = 0 + 1 x 1 + x +... k x k + u 6. Heteroskedasticity What is Heteroskedasticity?! Recall the assumption of homoskedasticity implied that conditional on the explanatory variables,

More information

Introduction to Econometrics

Introduction to Econometrics Introduction to Econometrics T H I R D E D I T I O N Global Edition James H. Stock Harvard University Mark W. Watson Princeton University Boston Columbus Indianapolis New York San Francisco Upper Saddle

More information

Elements of Probability Theory

Elements of Probability Theory Elements of Probability Theory CHUNG-MING KUAN Department of Finance National Taiwan University December 5, 2009 C.-M. Kuan (National Taiwan Univ.) Elements of Probability Theory December 5, 2009 1 / 58

More information

Diagnostics of Linear Regression

Diagnostics of Linear Regression Diagnostics of Linear Regression Junhui Qian October 7, 14 The Objectives After estimating a model, we should always perform diagnostics on the model. In particular, we should check whether the assumptions

More information

ECON2228 Notes 10. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 54

ECON2228 Notes 10. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 54 ECON2228 Notes 10 Christopher F Baum Boston College Economics 2014 2015 cfb (BC Econ) ECON2228 Notes 10 2014 2015 1 / 54 erial correlation and heteroskedasticity in time series regressions Chapter 12:

More information

Reliability of inference (1 of 2 lectures)

Reliability of inference (1 of 2 lectures) Reliability of inference (1 of 2 lectures) Ragnar Nymoen University of Oslo 5 March 2013 1 / 19 This lecture (#13 and 14): I The optimality of the OLS estimators and tests depend on the assumptions of

More information

388 Index Differencing test ,232 Distributed lags , 147 arithmetic lag.

388 Index Differencing test ,232 Distributed lags , 147 arithmetic lag. INDEX Aggregation... 104 Almon lag... 135-140,149 AR(1) process... 114-130,240,246,324-325,366,370,374 ARCH... 376-379 ARlMA... 365 Asymptotically unbiased... 13,50 Autocorrelation... 113-130, 142-150,324-325,365-369

More information

Greene, Econometric Analysis (6th ed, 2008)

Greene, Econometric Analysis (6th ed, 2008) EC771: Econometrics, Spring 2010 Greene, Econometric Analysis (6th ed, 2008) Chapter 17: Maximum Likelihood Estimation The preferred estimator in a wide variety of econometric settings is that derived

More information

Section 6: Heteroskedasticity and Serial Correlation

Section 6: Heteroskedasticity and Serial Correlation From the SelectedWorks of Econ 240B Section February, 2007 Section 6: Heteroskedasticity and Serial Correlation Jeffrey Greenbaum, University of California, Berkeley Available at: https://works.bepress.com/econ_240b_econometrics/14/

More information

Econometric Analysis of Cross Section and Panel Data

Econometric Analysis of Cross Section and Panel Data Econometric Analysis of Cross Section and Panel Data Jeffrey M. Wooldridge / The MIT Press Cambridge, Massachusetts London, England Contents Preface Acknowledgments xvii xxiii I INTRODUCTION AND BACKGROUND

More information

Introductory Econometrics

Introductory Econometrics Based on the textbook by Wooldridge: : A Modern Approach Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna November 23, 2013 Outline Introduction

More information

Ch.10 Autocorrelated Disturbances (June 15, 2016)

Ch.10 Autocorrelated Disturbances (June 15, 2016) Ch10 Autocorrelated Disturbances (June 15, 2016) In a time-series linear regression model setting, Y t = x tβ + u t, t = 1, 2,, T, (10-1) a common problem is autocorrelation, or serial correlation of the

More information

1 The Multiple Regression Model: Freeing Up the Classical Assumptions

1 The Multiple Regression Model: Freeing Up the Classical Assumptions 1 The Multiple Regression Model: Freeing Up the Classical Assumptions Some or all of classical assumptions were crucial for many of the derivations of the previous chapters. Derivation of the OLS estimator

More information

Questions and Answers on Unit Roots, Cointegration, VARs and VECMs

Questions and Answers on Unit Roots, Cointegration, VARs and VECMs Questions and Answers on Unit Roots, Cointegration, VARs and VECMs L. Magee Winter, 2012 1. Let ɛ t, t = 1,..., T be a series of independent draws from a N[0,1] distribution. Let w t, t = 1,..., T, be

More information

Heteroskedasticity ECONOMETRICS (ECON 360) BEN VAN KAMMEN, PHD

Heteroskedasticity ECONOMETRICS (ECON 360) BEN VAN KAMMEN, PHD Heteroskedasticity ECONOMETRICS (ECON 360) BEN VAN KAMMEN, PHD Introduction For pedagogical reasons, OLS is presented initially under strong simplifying assumptions. One of these is homoskedastic errors,

More information

Econometrics - 30C00200

Econometrics - 30C00200 Econometrics - 30C00200 Lecture 11: Heteroskedasticity Antti Saastamoinen VATT Institute for Economic Research Fall 2015 30C00200 Lecture 11: Heteroskedasticity 12.10.2015 Aalto University School of Business

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information