Quasi Maximum Likelihood Theory


1 Quasi Maximum Likelihood Theory. CHUNG-MING KUAN, Department of Finance & CRETA, National Taiwan University. June 17, 2010.

2 Lecture Outline: 1 Kullback-Leibler Information Criterion; 2 Asymptotic Properties of the QMLE (Asymptotic Normality; Information Matrix Equality); 3 Large Sample Tests: Nested Models (Wald Test; LM (Score) Test; Likelihood Ratio Test); 4 Large Sample Tests: Non-Nested Models (Wald Encompassing Test; Score Encompassing Test; Pseudo-True Score Encompassing Test).

3 Lecture Outline (cont'd): 5 Example I: Discrete Choice Models (Binary Choice Models; Multinomial Models; Ordered Multinomial Models); 6 Example II: Limited Dependent Variable Models (Truncated Regression Models; Censored Regression Models; Sample Selection Models); 7 Example III: Time Series Models.

4 Quasi-Maximum Likelihood (QML) Theory. Drawbacks of the least-squares method: It leaves no room for modeling other conditional moments, such as the conditional variance, of the dependent variable. It fails to accommodate certain characteristics of the dependent variable, such as binary responses and data truncation. The quasi-maximum likelihood (QML) method: It specifies a likelihood function that admits specifications of different conditional moments and/or distribution characteristics. Model misspecification is allowed, cf. the conventional ML method.

5 Kullback-Leibler Information Criterion (KLIC). Given IP(A) = p, the message that A will surely occur would be more (less) valuable or more (less) surprising when p is small (large). The information content of this message ought to be a decreasing function of p. An information function is ι(p) = log(1/p), which decreases from positive infinity (p → 0) to zero (p = 1). Clearly, ι(1 − p) ≠ ι(p). The expected information is I = p ι(p) + (1 − p) ι(1 − p) = p log(1/p) + (1 − p) log(1/(1 − p)), which is also known as the entropy of the event A.

6 The information that IP(A) changes from p to q would be useful when p and q are very different. The information content of this change is ι(p) − ι(q) = log(q/p), which is positive (negative) when q > p (q < p). Given n mutually exclusive events A_1, ..., A_n, each with an information value log(q_i/p_i), the expected information value is I = Σ_{i=1}^n q_i log(q_i/p_i). The Kullback-Leibler Information Criterion (KLIC) of g relative to f is II(g:f) = ∫_R log[g(ζ)/f(ζ)] g(ζ) dζ, where g is the density function of z and f is another density function.
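
The following sketch is an added illustration (not part of the original slides): it computes II(g:f) by one-dimensional numerical integration for a true density g and a candidate density f, both chosen arbitrarily as normal distributions.

```python
# Numerical illustration of the KLIC: II(g:f) = integral of log(g/f) * g.
# The particular g and f below are illustrative choices only.
import numpy as np
from scipy import stats
from scipy.integrate import quad

g = stats.norm(loc=0.0, scale=1.0)   # "true" density g of z
f = stats.norm(loc=0.5, scale=1.5)   # quasi-likelihood density f

def integrand(z):
    return g.pdf(z) * (g.logpdf(z) - f.logpdf(z))

klic, _ = quad(integrand, -np.inf, np.inf)
print(f"II(g:f) = {klic:.4f}")       # strictly positive here, since f differs from g
```

By Theorem 9.1 on the next slide, this quantity is zero only when f coincides with g almost everywhere.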

7 Theorem 9.1. II(g:f) ≥ 0; the equality holds if, and only if, g = f almost everywhere. Proof: As log(1 + x) < x for all x > −1 with x ≠ 0, we have log(f/g) = log(1 + (f − g)/g) < (f − g)/g, so that log(g/f) > 1 − f/g wherever f ≠ g. It follows that ∫_R log[g(ζ)/f(ζ)] g(ζ) dζ > ∫_R [1 − f(ζ)/g(ζ)] g(ζ) dζ = 0. Remark: The KLIC measures the closeness between f and g, but it is not a metric because it is not symmetric in general, i.e., II(g:f) ≠ II(f:g), and it does not obey the triangle inequality.

8 Let z_t = (y_t w_t′)′ be ν × 1 and z^t = {z_1, z_2, ..., z_t}. Given a sample of T observations, specifying a complete probability model for z^T may be a formidable task in practice. It is practically more convenient to focus on the conditional density g_t(y_t | x_t), where x_t includes some elements of w_t and z^{t−1}. We thus specify a quasi-likelihood function f_t(y_t | x_t; θ) with θ ∈ Θ ⊆ R^k. The KLIC of g_t relative to f_t is II(g_t:f_t; θ) = ∫_R log[g_t(y_t | x_t)/f_t(y_t | x_t; θ)] g_t(y_t | x_t) dy_t.

9 For a sample of T observations, the average of the individual KLICs is ĪI_T({g_t:f_t}; θ) := (1/T) Σ_{t=1}^T II(g_t:f_t; θ) = (1/T) Σ_{t=1}^T (IE[log g_t(y_t | x_t)] − IE[log f_t(y_t | x_t; θ)]). Minimizing ĪI_T({g_t:f_t}; θ) is thus equivalent to maximizing L̄_T(θ) = (1/T) Σ_{t=1}^T IE[log f_t(y_t | x_t; θ)]; the maximizer of L̄_T(θ), θ*, is the minimizer of the average KLIC.

10 If there exists a θ_o ∈ Θ such that f_t(y_t | x_t; θ_o) = g_t(y_t | x_t) for all t, we say that {f_t} is correctly specified for {y_t | x_t}. In this case, II(g_t:f_t; θ_o) = 0, so that ĪI_T({g_t:f_t}; θ) is minimized at θ* = θ_o. Maximizing the sample counterpart of L̄_T(θ), L_T(y^T, x^T; θ) := (1/T) Σ_{t=1}^T log f_t(y_t | x_t; θ), the average of the individual quasi-log-likelihood functions, yields the solution θ̂_T, known as the quasi-maximum likelihood estimator (QMLE) of θ. When {f_t} is correctly specified for {y_t | x_t}, the QMLE is the standard MLE.

11 Concentrating on certain conditional attributes of y_t, we have, for example, y_t | x_t ~ N(μ_t(x_t; β), σ²), or more generally, y_t | x_t ~ N(μ_t(x_t; β), h(x_t; α)). Suppose y_t | x_t ~ N(μ_t(x_t; β), σ²) and let θ = (β′ σ²)′. The maximizer of (1/T) Σ_t log f_t(y_t | x_t; θ) also solves min_β (1/T) Σ_t [y_t − μ_t(x_t; β)]². That is, the NLS estimator is a QMLE under the specification of conditional normality with conditional homoskedasticity. Even when {μ_t} is correctly specified for the conditional mean, there is no guarantee that the specification σ² of the conditional variance is correct.
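
As a numerical check of this equivalence (an added sketch, not from the original slides; the simulated data, sample size, and optimizer are arbitrary), maximizing the Gaussian quasi-log-likelihood with a linear mean and constant variance reproduces the OLS coefficient estimates.

```python
# Gaussian QMLE with mean x_t'beta and constant variance vs. least squares:
# the beta-part of the QMLE coincides with the OLS estimator.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
T, k = 500, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=1.3, size=T)

def neg_quasi_loglik(params):
    beta, log_s2 = params[:k], params[k]      # parameterize log(sigma^2) to keep it positive
    s2 = np.exp(log_s2)
    resid = y - X @ beta
    return 0.5 * np.sum(np.log(2 * np.pi * s2) + resid**2 / s2)

beta_qmle = minimize(neg_quasi_loglik, np.zeros(k + 1), method="BFGS").x[:k]
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.max(np.abs(beta_qmle - beta_ols)))   # close to zero (optimizer tolerance)
```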

12 Asymptotic Properties of the QMLE. The quasi-log-likelihood function is, in general, a nonlinear function of θ, so the QMLE θ̂_T must be computed numerically using a nonlinear optimization algorithm. [ID-3] There exists a unique θ* that minimizes the KLIC. Consistency: Suppose that L_T(y^T, x^T; θ) tends to L̄_T(θ) in probability, uniformly in θ ∈ Θ, i.e., L_T(y^T, x^T; θ) obeys a WULLN. Then, it is natural to expect the QMLE θ̂_T to converge in probability to θ*, the minimizer of the average KLIC ĪI_T({g_t:f_t}; θ).

13 Asymptotic Normality. When θ* is in the interior of Θ, the mean-value expansion of ∇L_T(y^T, x^T; θ̂_T) about θ* is ∇L_T(y^T, x^T; θ̂_T) = ∇L_T(y^T, x^T; θ*) + ∇²L_T(y^T, x^T; θ†_T)(θ̂_T − θ*), where θ†_T lies between θ̂_T and θ*, and the left-hand side is zero because θ̂_T must satisfy the first-order condition. Let H_T(θ) = IE[∇²L_T(y^T, x^T; θ)]. When ∇²L_T(y^T, x^T; θ) obeys a WULLN, ∇²L_T(y^T, x^T; θ†_T) − H_T(θ*) → 0 in probability.

14 When H_T(θ*) is nonsingular, √T(θ̂_T − θ*) = −H_T(θ*)^{-1} √T ∇L_T(y^T, x^T; θ*) + o_IP(1). Let B_T(θ) = var(√T ∇L_T(y^T, x^T; θ)) be the information matrix. When ∇log f_t(y_t | x_t; θ) obeys a CLT, B_T(θ*)^{-1/2} √T (∇L_T(y^T, x^T; θ*) − IE[∇L_T(y^T, x^T; θ*)]) →D N(0, I_k). Knowing that IE[∇L_T(y^T, x^T; θ)] = ∇IE[L_T(y^T, x^T; θ)] is zero at θ = θ*, the KLIC minimizer, we have B_T(θ*)^{-1/2} √T ∇L_T(y^T, x^T; θ*) →D N(0, I_k).

15 This shows that √T(θ̂_T − θ*) is asymptotically equivalent to −H_T(θ*)^{-1} B_T(θ*)^{1/2} [B_T(θ*)^{-1/2} √T ∇L_T(y^T, x^T; θ*)], which has an asymptotic normal distribution. Theorem 9.2. Letting C_T(θ*) = H_T(θ*)^{-1} B_T(θ*) H_T(θ*)^{-1}, we have C_T(θ*)^{-1/2} √T (θ̂_T − θ*) →D N(0, I_k).
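
The covariance matrix in Theorem 9.2 suggests the familiar "sandwich" estimator. The sketch below is an added illustration (not from the slides): it assumes the user supplies the T × k matrix of per-observation scores and the k × k average Hessian of the quasi-log-likelihood, and it further assumes no dynamic misspecification so that B_T can be estimated by the average outer product of the scores.

```python
# Sandwich estimator C_T = H^{-1} B H^{-1} for the asymptotic covariance of
# sqrt(T)(theta_hat - theta*), given per-observation scores and the average Hessian.
import numpy as np

def sandwich_cov(scores, hessian_avg):
    """scores: (T, k) array of s_t evaluated at theta_hat;
    hessian_avg: (k, k) average Hessian of L_T at theta_hat."""
    T = scores.shape[0]
    B_hat = scores.T @ scores / T            # (1/T) sum_t s_t s_t'
    H_inv = np.linalg.inv(hessian_avg)
    return H_inv @ B_hat @ H_inv

# Standard errors of theta_hat are sqrt(diag(sandwich_cov(...)) / T).
```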

16 Information Matrix Equality. When the likelihood is specified correctly in a proper sense, the information matrix equality holds: H_T(θ_o) + B_T(θ_o) = 0. In this case, C_T(θ_o) simplifies to −H_T(θ_o)^{-1} or B_T(θ_o)^{-1}. For the specification of {y_t | x_t}, define the score functions s_t(y_t, x_t; θ) = ∇_θ log f_t(y_t | x_t; θ) = f_t(y_t | x_t; θ)^{-1} ∇_θ f_t(y_t | x_t; θ), so that ∇_θ f_t(y_t | x_t; θ) = s_t(y_t, x_t; θ) f_t(y_t | x_t; θ).

17 By permitting the interchange of differentiation and integration, ∫_R s_t(y_t, x_t; θ) f_t(y_t | x_t; θ) dy_t = ∇_θ ∫_R f_t(y_t | x_t; θ) dy_t = 0. If {f_t} is correctly specified for {y_t | x_t}, IE[s_t(y_t, x_t; θ_o) | x_t] = 0, where the conditional expectation is taken with respect to g_t(y_t | x_t) = f_t(y_t | x_t; θ_o). Thus, the mean score is zero: IE[s_t(y_t, x_t; θ_o)] = 0. As ∇_θ f_t = s_t f_t, we have ∇²_θ f_t = (∇_θ s_t) f_t + (s_t s_t′) f_t, so that ∫_R [∇_θ s_t(y_t, x_t; θ) + s_t(y_t, x_t; θ) s_t(y_t, x_t; θ)′] f_t(y_t | x_t; θ) dy_t = ∇²_θ ∫_R f_t(y_t | x_t; θ) dy_t = 0.

18 We have shown that IE[∇_θ s_t(y_t, x_t; θ_o) | x_t] + IE[s_t(y_t, x_t; θ_o) s_t(y_t, x_t; θ_o)′ | x_t] = 0. It follows that (1/T) Σ_{t=1}^T IE[∇_θ s_t(y_t, x_t; θ_o)] + (1/T) Σ_{t=1}^T IE[s_t(y_t, x_t; θ_o) s_t(y_t, x_t; θ_o)′] = H_T(θ_o) + (1/T) Σ_{t=1}^T IE[s_t(y_t, x_t; θ_o) s_t(y_t, x_t; θ_o)′] = 0; i.e., the expected Hessian matrix is the negative of the average of the individual information matrices, which need not be the information matrix B_T(θ_o).

19 By definition, the information matrix is B_T(θ_o) = (1/T) IE[(Σ_{t=1}^T s_t(y_t, x_t; θ_o))(Σ_{t=1}^T s_t(y_t, x_t; θ_o))′] = (1/T) Σ_{t=1}^T IE[s_t(y_t, x_t; θ_o) s_t(y_t, x_t; θ_o)′] + (1/T) Σ_{τ=1}^{T−1} Σ_{t=τ+1}^T IE[s_{t−τ}(y_{t−τ}, x_{t−τ}; θ_o) s_t(y_t, x_t; θ_o)′] + (1/T) Σ_{τ=1}^{T−1} Σ_{t=τ+1}^T IE[s_t(y_t, x_t; θ_o) s_{t−τ}(y_{t−τ}, x_{t−τ}; θ_o)′]. That is, the information matrix involves the variances and autocovariances of the individual score functions.

20 A specification of {y_t | x_t} is said to have dynamic misspecification if it is not correctly specified for {y_t | w_t, z^{t−1}}; that is, there does not exist any θ_o such that f_t(y_t | x_t; θ_o) = g_t(y_t | w_t, z^{t−1}). When f_t(y_t | x_t; θ_o) = g_t(y_t | w_t, z^{t−1}), IE[s_t(y_t, x_t; θ_o) | x_t] = IE[s_t(y_t, x_t; θ_o) | w_t, z^{t−1}] = 0, and by the law of iterated expectations, IE[s_t(y_t, x_t; θ_o) s_{t+τ}(y_{t+τ}, x_{t+τ}; θ_o)′] = IE[s_t(y_t, x_t; θ_o) IE[s_{t+τ}(y_{t+τ}, x_{t+τ}; θ_o) | w_{t+τ}, z^{t+τ−1}]′] = 0, for τ ≥ 1. In this case, B_T(θ_o) = (1/T) Σ_{t=1}^T IE[s_t(y_t, x_t; θ_o) s_t(y_t, x_t; θ_o)′].

21 Theorem 9.3. Suppose that there exists a θ_o such that f_t(y_t | x_t; θ_o) = g_t(y_t | x_t) and that there is no dynamic misspecification. Then, H_T(θ_o) + B_T(θ_o) = 0, where H_T(θ_o) = (1/T) Σ_{t=1}^T IE[∇_θ s_t(y_t, x_t; θ_o)] and B_T(θ_o) = (1/T) Σ_{t=1}^T IE[s_t(y_t, x_t; θ_o) s_t(y_t, x_t; θ_o)′]. When Theorem 9.3 holds, B_T(θ_o)^{1/2} √T (θ̂_T − θ_o) →D N(0, I_k). That is, the QMLE is asymptotically efficient because it achieves the Cramér-Rao lower bound asymptotically.

22 Example: Assume y_t | x_t ~ N(x_t′β, σ²) for all t, so that L_T(y^T, x^T; θ) = −(1/2) log(2π) − (1/2) log(σ²) − (1/T) Σ_{t=1}^T (y_t − x_t′β)²/(2σ²). Straightforward calculation yields ∇L_T(y^T, x^T; θ) = [ (1/T) Σ_t x_t(y_t − x_t′β)/σ² ; −1/(2σ²) + (1/T) Σ_t (y_t − x_t′β)²/(2(σ²)²) ] and ∇²L_T(y^T, x^T; θ) = [ −(1/T) Σ_t x_tx_t′/σ² , −(1/T) Σ_t x_t(y_t − x_t′β)/(σ²)² ; −(1/T) Σ_t (y_t − x_t′β)x_t′/(σ²)² , 1/(2(σ²)²) − (1/T) Σ_t (y_t − x_t′β)²/(σ²)³ ]. Solving ∇L_T(y^T, x^T; θ) = 0, we obtain the QMLEs of β and σ².

23 If the specification above is correct for {y_t | x_t}, there exists θ_o = (β_o′ σ_o²)′ such that y_t | x_t ~ N(x_t′β_o, σ_o²). Taking expectations with respect to the true distribution, IE[x_t(y_t − x_t′β)] = IE[x_t(IE(y_t | x_t) − x_t′β)] = IE(x_tx_t′)(β_o − β), which is zero when evaluated at β = β_o. Similarly, IE[(y_t − x_t′β)²] = IE[(y_t − x_t′β_o + x_t′β_o − x_t′β)²] = IE[(y_t − x_t′β_o)²] + IE[(x_t′β_o − x_t′β)²] = σ_o² + IE[(x_t′β_o − x_t′β)²], where the second term on the right-hand side is zero if it is evaluated at β = β_o.

24 The results above show that H_T(θ) = IE[∇²L_T(θ)] = [ −(1/T) Σ_t IE(x_tx_t′)/σ² , −(1/T) Σ_t IE(x_tx_t′)(β_o − β)/(σ²)² ; −(1/T) Σ_t (β_o − β)′IE(x_tx_t′)/(σ²)² , 1/(2(σ²)²) − (σ_o² + (1/T) Σ_t IE[(x_t′β_o − x_t′β)²])/(σ²)³ ], and H_T(θ_o) = [ −(1/T) Σ_t IE(x_tx_t′)/σ_o² , 0 ; 0 , −1/(2(σ_o²)²) ].

25 Without dynamic misspecification, the information matrix B_T(θ) is (1/T) Σ_t IE of the matrix whose (1,1) block is (y_t − x_t′β)²x_tx_t′/(σ²)², whose (1,2) block is −x_t(y_t − x_t′β)/(2(σ²)²) + x_t(y_t − x_t′β)³/(2(σ²)³) (the (2,1) block is its transpose), and whose (2,2) block is 1/(4(σ²)²) − (y_t − x_t′β)²/(2(σ²)³) + (y_t − x_t′β)⁴/(4(σ²)⁴). Given conditional normality, the conditional third and fourth central moments of y_t are zero and 3(σ_o²)², respectively. Then, IE[(y_t − x_t′β)³] = 3σ_o² IE[x_t′β_o − x_t′β] + IE[(x_t′β_o − x_t′β)³], which is zero when evaluated at β = β_o. Similarly, IE[(y_t − x_t′β)⁴] = 3(σ_o²)² + 6σ_o² IE[(x_t′β_o − x_t′β)²] + IE[(x_t′β_o − x_t′β)⁴], which is 3(σ_o²)² when evaluated at β = β_o.

26 It is now easily seen that the information matrix equality holds because B_T(θ_o) = [ (1/T) Σ_t IE(x_tx_t′)/σ_o² , 0 ; 0 , 1/(2(σ_o²)²) ] = −H_T(θ_o). A typical consistent estimator of H_T(θ_o) is Ĥ_T = [ −Σ_t x_tx_t′/(T σ̂²_T) , 0 ; 0 , −1/(2(σ̂²_T)²) ]. When the information matrix equality holds, a consistent estimator of C_T(θ_o) is −Ĥ_T^{-1}.

27 Example: The specification is y_t | x_t ~ N(x_t′β, σ²), but the true conditional distribution is y_t | x_t ~ N(x_t′β_o, h(x_t, α_o)). That is, our specification is correct only for the conditional mean. Due to this misspecification, the KLIC minimizer is θ* = (β_o′ (σ*)²)′. Then, H_T(θ*) = [ −(1/T) Σ_t IE(x_tx_t′)/(σ*)² , 0 ; 0 , 1/(2(σ*)⁴) − (1/T) Σ_t IE[h(x_t, α_o)]/(σ*)⁶ ], and B_T(θ*) = [ (1/T) Σ_t IE[h(x_t, α_o)x_tx_t′]/(σ*)⁴ , 0 ; 0 , 1/(4(σ*)⁴) − (1/T) Σ_t IE[h(x_t, α_o)]/(2(σ*)⁶) + 3 (1/T) Σ_t IE[h(x_t, α_o)²]/(4(σ*)⁸) ]. The information matrix equality does not hold here, even though the conditional mean function is correctly specified.

28 The upper-left block of Ĥ_T is −Σ_t x_tx_t′/(T σ̂²_T), which remains a consistent estimator of the corresponding block of H_T(θ*). The information matrix B_T(θ*) can be consistently estimated by a block-diagonal matrix with upper-left block Σ_t ê_t²x_tx_t′/(T (σ̂²_T)²). The upper-left block of Ĉ_T = Ĥ_T^{-1} B̂_T Ĥ_T^{-1} is thus ((1/T) Σ_t x_tx_t′)^{-1} ((1/T) Σ_t ê_t²x_tx_t′) ((1/T) Σ_t x_tx_t′)^{-1}. This is precisely the Eicker-White heteroskedasticity-robust covariance matrix estimator.
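
A minimal sketch of the Eicker-White estimator just described (added here for illustration): given the regressor matrix X and dependent variable y, it returns the OLS estimate together with the heteroskedasticity-robust covariance of β̂_T.

```python
# Eicker-White heteroskedasticity-robust covariance for OLS:
# (X'X/T)^{-1} ((1/T) sum e_t^2 x_t x_t') (X'X/T)^{-1}.
import numpy as np

def eicker_white(X, y):
    T = X.shape[0]
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta_hat
    Sxx_inv = np.linalg.inv(X.T @ X / T)
    meat = (X * (e**2)[:, None]).T @ X / T        # (1/T) sum e_t^2 x_t x_t'
    C_upper = Sxx_inv @ meat @ Sxx_inv            # asy. cov of sqrt(T)(beta_hat - beta)
    return beta_hat, C_upper / T                  # variance estimate of beta_hat itself
```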

29 Wald Test. Consider the null hypothesis Rθ* = r, where R is a q × k matrix with full row rank. The Wald test checks whether Rθ̂_T is sufficiently close to r. By the asymptotic normality result C_T(θ*)^{-1/2} √T(θ̂_T − θ*) →D N(0, I_k), we have under the null hypothesis that [R C_T(θ*) R′]^{-1/2} √T (Rθ̂_T − r) →D N(0, I_q). This result remains valid when C_T(θ*) is replaced by a consistent estimator Ĉ_T = Ĥ_T^{-1} B̂_T Ĥ_T^{-1}. The Wald test is W_T = T (Rθ̂_T − r)′ (R Ĉ_T R′)^{-1} (Rθ̂_T − r) →D χ²(q).

30 Example: Specification: y_t | x_t ~ N(x_t′β, σ²). Write θ = (σ² β′)′ and β = (b_1′ b_2′)′, where b_1 is (k − s) × 1 and b_2 is s × 1. We are interested in the hypothesis b_2 = Rθ = 0, where R = [0 R_1] is s × (k + 1) and R_1 = [0 I_s] is s × k. With β̂_{2,T} = Rθ̂_T, the Wald test is W_T = T β̂_{2,T}′ (R Ĉ_T R′)^{-1} β̂_{2,T}. When the information matrix equality holds, Ĉ_T = −Ĥ_T^{-1}, so that R Ĉ_T R′ = −R Ĥ_T^{-1} R′ = σ̂²_T R_1 (X′X/T)^{-1} R_1′. The Wald test then becomes W_T = T β̂_{2,T}′ [R_1 (X′X/T)^{-1} R_1′]^{-1} β̂_{2,T} / σ̂²_T →D χ²(s).
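
The sketch below (an added illustration, not from the slides) computes the general Wald statistic W_T = T(Rθ̂ − r)′(RĈR′)^{-1}(Rθ̂ − r) for user-supplied θ̂_T, Ĉ_T, R, and r.

```python
# Wald test of R theta = r given the QMLE and a consistent estimator C_hat of
# the asymptotic covariance of sqrt(T)(theta_hat - theta*).
import numpy as np
from scipy import stats

def wald_test(theta_hat, C_hat, R, r, T):
    diff = R @ theta_hat - r
    V = R @ C_hat @ R.T
    W = T * diff @ np.linalg.solve(V, diff)
    q = R.shape[0]
    return W, stats.chi2.sf(W, df=q)              # statistic and asymptotic p-value
```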

31 LM (Score) Test. Maximizing L_T(θ) subject to the constraint Rθ = r yields the Lagrangian L_T(θ) + (Rθ − r)′λ, where λ is the vector of Lagrange multipliers. The constrained QMLEs are θ̃_T and λ̃_T; we want to check whether λ̃_T is sufficiently close to zero. First note the saddle-point condition: ∇L_T(θ̃_T) + R′λ̃_T = 0. The mean-value expansion of ∇L_T(θ̃_T) about θ* yields ∇L_T(θ*) + ∇²L_T(θ†_T)(θ̃_T − θ*) + R′λ̃_T = 0, where θ†_T is the mean value between θ̃_T and θ*.

32 Recalling from the discussion of the Wald test that √T(θ̂_T − θ*) = −H_T(θ*)^{-1} √T ∇L_T(θ*) + o_IP(1), we obtain 0 = −H_T(θ*)^{-1} √T ∇L_T(θ*) − H_T(θ*)^{-1} ∇²L_T(θ†_T) √T(θ̃_T − θ*) − H_T(θ*)^{-1} R′ √T λ̃_T = √T(θ̂_T − θ*) − √T(θ̃_T − θ*) − H_T(θ*)^{-1} R′ √T λ̃_T + o_IP(1). Pre-multiplying both sides by R and noting that R(θ̃_T − θ*) = 0 under the null, we have √T λ̃_T = [R H_T(θ*)^{-1} R′]^{-1} R √T(θ̂_T − θ*) + o_IP(1).

33 For Λ_T(θ*) = [R H_T(θ*)^{-1} R′]^{-1} R C_T(θ*) R′ [R H_T(θ*)^{-1} R′]^{-1}, the normalized Lagrange multiplier is Λ_T(θ*)^{-1/2} √T λ̃_T = Λ_T(θ*)^{-1/2} [R H_T(θ*)^{-1} R′]^{-1} R √T(θ̂_T − θ*) →D N(0, I_q). It follows that Λ̃_T^{-1/2} √T λ̃_T →D N(0, I_q), where Λ̃_T = (R Ḧ_T^{-1} R′)^{-1} R C̈_T R′ (R Ḧ_T^{-1} R′)^{-1}, and Ḧ_T and C̈_T are consistent estimators based on the constrained QMLE θ̃_T. The LM test is LM_T = T λ̃_T′ Λ̃_T^{-1} λ̃_T →D χ²(q).

34 In light of the saddle-point condition ∇L_T(θ̃_T) + R′λ̃_T = 0, LM_T = T λ̃_T′ R Ḧ_T^{-1} R′ (R C̈_T R′)^{-1} R Ḧ_T^{-1} R′ λ̃_T = T [∇L_T(θ̃_T)]′ Ḧ_T^{-1} R′ (R C̈_T R′)^{-1} R Ḧ_T^{-1} [∇L_T(θ̃_T)], which mainly depends on the score function ∇L_T evaluated at θ̃_T. When the information matrix equality holds, LM_T = −T λ̃_T′ R Ḧ_T^{-1} R′ λ̃_T = −T [∇L_T(θ̃_T)]′ Ḧ_T^{-1} [∇L_T(θ̃_T)]. The LM test is also known as the score test in the statistics literature.

35 Example: Specification: y_t | x_t ~ N(x_t′β, σ²). Let θ = (σ² β′)′ and β = (b_1′ b_2′)′, where b_1 is (k − s) × 1 and b_2 is s × 1. The null hypothesis is b_2 = Rθ = 0, where R = [0 R_1] is s × (k + 1) and R_1 = [0 I_s] is s × k. From the saddle-point condition, ∇L_T(θ̃_T) = −R′λ̃_T, and hence ∇L_T(θ̃_T) = [∇_{σ²}L_T(θ̃_T); ∇_{b_1}L_T(θ̃_T); ∇_{b_2}L_T(θ̃_T)] = [0; 0; −λ̃_T]. Thus, the LM test is mainly based on ∇_{b_2}L_T(θ̃_T) = (1/(T σ̃²_T)) Σ_t x_{2t} ε̃_t = X_2′ε̃/(T σ̃²_T), where σ̃²_T = ε̃′ε̃/T, with ε̃ the vector of constrained residuals.

36 The LM test statistic now reads LM_T = T [0, 0′, ε̃′X_2/(T σ̃²_T)] Ḧ_T^{-1} R′ (R C̈_T R′)^{-1} R Ḧ_T^{-1} [0, 0′, ε̃′X_2/(T σ̃²_T)]′, which converges in distribution to χ²(s) under the null. When the information matrix equality holds, LM_T = −T [∇L_T(θ̃_T)]′ Ḧ_T^{-1} [∇L_T(θ̃_T)] = T [0′, ε̃′X_2/T] (X′X/T)^{-1} [0′, ε̃′X_2/T]′ / σ̃²_T = T [ε̃′X(X′X)^{-1}X′ε̃ / ε̃′ε̃] = T R², where R² is from the auxiliary regression of ε̃_t on x_{1t} and x_{2t}.
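
Under the information matrix equality, the T R² form of the LM statistic can be computed as in the following added sketch: estimate the restricted model by OLS on X_1 only, regress the restricted residuals on all regressors, and multiply the resulting R² by T (X_1 is assumed to contain a constant, so the residuals have mean zero).

```python
# LM (score) test of b2 = 0 computed as T * R^2 from the auxiliary regression of
# the restricted residuals on [X1, X2]; valid under the information matrix equality.
import numpy as np
from scipy import stats

def lm_test_omitted(y, X1, X2):
    T = y.shape[0]
    e_tilde = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]   # restricted residuals
    X = np.column_stack([X1, X2])
    fitted = X @ np.linalg.lstsq(X, e_tilde, rcond=None)[0]
    R2 = (fitted @ fitted) / (e_tilde @ e_tilde)
    LM = T * R2
    return LM, stats.chi2.sf(LM, df=X2.shape[1])
```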

37 Example (Breusch-Pagan Test): Let h: R → (0, ∞) be a differentiable function. Consider the specification y_t | x_t, ζ_t ~ N(x_t′β, h(ζ_t′α)), where ζ_t′α = α_0 + Σ_{i=1}^p ζ_{ti}α_i. Under this specification, L_T(y^T, x^T, ζ^T; θ) = −(1/2) log(2π) − (1/(2T)) Σ_{t=1}^T log(h(ζ_t′α)) − (1/T) Σ_{t=1}^T (y_t − x_t′β)²/(2h(ζ_t′α)). The null hypothesis is α_1 = ... = α_p = 0, so that h(α_0) = σ_0², i.e., conditional homoskedasticity.

38 For the LM test, the corresponding score vector is ∇_α L_T(y^T, x^T, ζ^T; θ) = (1/T) Σ_{t=1}^T [h_1(ζ_t′α)ζ_t/(2h(ζ_t′α))] [(y_t − x_t′β)²/h(ζ_t′α) − 1], where h_1(η) = dh(η)/dη. Under the null, h_1(ζ_t′α) = h_1(α_0) =: c. The constrained specification is y_t | x_t, ζ_t ~ N(x_t′β, σ²), and the constrained QMLEs are the OLS estimator β̂_T and σ̃²_T = Σ_t ê_t²/T. The score vector evaluated at the constrained QMLEs is ∇_α L_T(y^T, x^T, ζ^T; θ̃_T) = (c/T) Σ_{t=1}^T [ζ_t/(2σ̃²_T)] (ê_t²/σ̃²_T − 1).

39 It can be shown that the (p + 1) × (p + 1) block of the Hessian matrix corresponding to α is (1/T) Σ_{t=1}^T { [1/(2h(ζ_t′α)²) − (y_t − x_t′β)²/h(ζ_t′α)³] [h_1(ζ_t′α)]² + [(y_t − x_t′β)²/(2h(ζ_t′α)²) − 1/(2h(ζ_t′α))] h_2(ζ_t′α) } ζ_tζ_t′, where h_2(η) = dh_1(η)/dη. Evaluating the expectation of this block at θ_o = (β_o′ α_0 0′)′ and noting that σ_o² = h(α_0), we have −(c²/(2(σ_o²)²)) (1/T) Σ_{t=1}^T IE(ζ_tζ_t′).

40 This block of the expected Hessian matrix can be consistently estimated by −(c²/(2(σ̃²_T)²)) (1/T) Σ_t ζ_tζ_t′. Setting d_t = ê_t²/σ̃²_T − 1, the LM statistic under the information matrix equality is LM_T = (1/2) (Σ_t d_tζ_t′)(Σ_t ζ_tζ_t′)^{-1}(Σ_t ζ_td_t) →D χ²(p), i.e., one half of the (centered) regression sum of squares (RSS) from regressing d_t on ζ_t.

41 Remarks: In the Breusch-Pagan test, the function h does not show up in the statistic. As such, this test is capable of testing general conditional heteroskedasticity without specifying a functional form for h. When ζ_t contains the squares and all cross-product terms of the non-constant elements of x_t, i.e., x_{it}² and x_{it}x_{jt} (let there be n such terms), the resulting Breusch-Pagan test is also the test of heteroskedasticity of unknown form due to White (1980) and has the limiting distribution χ²(n). The Breusch-Pagan test is valid when the information matrix equality holds. Thus, the Breusch-Pagan test is not robust to dynamic misspecification, e.g., when the errors are serially correlated.

42 Koenker (1981): As (1/T) Σ_t ê_t⁴ → 3(σ_o²)² in probability under the null, (1/T) Σ_t d_t² = (1/T) Σ_t ê_t⁴/(σ̃²_T)² − (2/T) Σ_t ê_t²/σ̃²_T + 1 → 2 in probability. Thus, the test below is asymptotically equivalent to the original Breusch-Pagan test: LM_T = T (Σ_t d_tζ_t′)(Σ_t ζ_tζ_t′)^{-1}(Σ_t ζ_td_t) / (Σ_t d_t²), which can be computed as T R², where R² is obtained from regressing d_t on ζ_t. As Σ_{t=1}^T d_t = 0, the centered and non-centered R² are equivalent. Thus, the Breusch-Pagan test can be computed as T R² from the regression of ê_t² on ζ_t. (Why?)
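
A sketch (added here) of the Koenker form of the Breusch-Pagan test: regress the squared OLS residuals on the variance regressors ζ_t (a constant is appended inside the function) and refer T R² to χ²(p).

```python
# Breusch-Pagan test, Koenker's T*R^2 version: regress squared OLS residuals on zeta.
import numpy as np
from scipy import stats

def breusch_pagan(y, X, Z):
    """X: mean regressors (including a constant); Z: the p non-constant variance
    regressors zeta_t; a constant is added to the auxiliary regression below."""
    T = y.shape[0]
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    d = e**2
    Zc = np.column_stack([np.ones(T), Z])
    fitted = Zc @ np.linalg.lstsq(Zc, d, rcond=None)[0]
    R2 = np.sum((fitted - d.mean())**2) / np.sum((d - d.mean())**2)   # centered R^2
    LM = T * R2
    return LM, stats.chi2.sf(LM, df=Z.shape[1])
```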

43 Example (Breusch-Godfrey Test): Given the specification y_t | x_t ~ N(x_t′β, σ²), we are interested in testing whether the errors y_t − x_t′β are serially uncorrelated. For AR(1) errors: y_t − x_t′β = ρ(y_{t−1} − x_{t−1}′β) + u_t, with |ρ| < 1 and {u_t} a white noise; the null hypothesis is ρ = 0. Consider a general specification that admits serial correlation: y_t | y_{t−1}, x_t, x_{t−1} ~ N(x_t′β + ρ(y_{t−1} − x_{t−1}′β), σ_u²). Under the null, y_t | x_t ~ N(x_t′β, σ²), and the constrained QMLE of β is the OLS estimator β̂_T. Testing the null hypothesis amounts to testing whether y_{t−1} − x_{t−1}′β should be included in the mean specification.

44 When the information matrix equality holds, the LM test can be computed as T R², where R² is from the regression of ê_t = y_t − x_t′β̂_T on x_t and y_{t−1} − x_{t−1}′β. Replacing β with its constrained estimator β̂_T, we can also obtain R² from the regression of ê_t on x_t and ê_{t−1}. This test has the limiting χ²(1) distribution under the null. The Breusch-Godfrey test extends straightforwardly to checking whether the errors follow an AR(p) process: regressing ê_t on x_t and ê_{t−1}, ..., ê_{t−p}, the resulting T R² is the LM test when the information matrix equality holds and has a limiting χ²(p) distribution. Remark: If there is neglected conditional heteroskedasticity, the information matrix equality fails, and the Breusch-Godfrey test no longer has a limiting χ² distribution.
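
The AR(p)/MA(p) version of the Breusch-Godfrey test can be computed as in the following added sketch (pre-sample lagged residuals are set to zero, one common convention; X is assumed to contain a constant).

```python
# Breusch-Godfrey test: regress OLS residuals on X and p lagged residuals; T * R^2.
import numpy as np
from scipy import stats

def breusch_godfrey(y, X, p):
    T = y.shape[0]
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    lags = np.column_stack(
        [np.concatenate([np.zeros(i), e[:T - i]]) for i in range(1, p + 1)]
    )
    W = np.column_stack([X, lags])
    fitted = W @ np.linalg.lstsq(W, e, rcond=None)[0]
    R2 = (fitted @ fitted) / (e @ e)     # e has mean zero when X contains a constant
    LM = T * R2
    return LM, stats.chi2.sf(LM, df=p)
```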

45 If the specification is y_t − x_t′β = u_t + αu_{t−1}, i.e., the errors follow an MA(1) process, we can write y_t | x_t, u_{t−1} ~ N(x_t′β + αu_{t−1}, σ_u²). The null hypothesis is α = 0. Again, the constrained specification is the standard linear regression model y_t = x_t′β + u_t, and the constrained QMLE of β is still the OLS estimator β̂_T. The LM test of α = 0 can be computed as T R², with R² obtained from the regression of û_t = y_t − x_t′β̂_T on x_t and û_{t−1}. This is identical to the LM test for AR(1) errors. Similarly, the Breusch-Godfrey test for MA(p) errors is also the same as that for AR(p) errors.

46 Likelihood Ratio Test. The likelihood ratio (LR) test compares the log-likelihoods of the constrained and unconstrained specifications: LR_T = 2T [L_T(θ̂_T) − L_T(θ̃_T)]. Recalling from the discussion of the LM test that √T(θ̃_T − θ*) = √T(θ̂_T − θ*) − H_T(θ*)^{-1} R′ √T λ̃_T + o_IP(1) and √T λ̃_T = [R H_T(θ*)^{-1} R′]^{-1} R √T(θ̂_T − θ*) + o_IP(1), we have √T(θ̂_T − θ̃_T) = H_T(θ*)^{-1} R′ [R H_T(θ*)^{-1} R′]^{-1} R √T(θ̂_T − θ*) + o_IP(1).

47 By a Taylor expansion of L_T(θ̃_T) about θ̂_T, we have LR_T = 2T [L_T(θ̂_T) − L_T(θ̃_T)] = −2T ∇L_T(θ̂_T)′(θ̃_T − θ̂_T) − T (θ̃_T − θ̂_T)′ H_T(θ̂_T)(θ̃_T − θ̂_T) + o_IP(1) = −T (θ̂_T − θ̃_T)′ H_T(θ*)(θ̂_T − θ̃_T) + o_IP(1), because ∇L_T(θ̂_T) = 0. It follows that LR_T = −T (θ̂_T − θ*)′ R′ [R H_T(θ*)^{-1} R′]^{-1} R (θ̂_T − θ*) + o_IP(1). The first term on the right-hand side has an asymptotic χ²(q) distribution provided that the information matrix equality holds. (Why?)

48 Remarks: The three classical large-sample tests check different aspects of the likelihood function. The Wald test checks whether Rθ̂_T is close to r; the LM test checks whether ∇L_T(θ̃_T) is close to zero; the LR test checks whether the constrained and unconstrained log-likelihood values are close to each other. The Wald and LM tests are based on unconstrained and constrained estimation results, respectively, but the LR test requires both. The Wald and LM tests can be made robust to misspecification by employing a suitable consistent estimator of the asymptotic covariance matrix. Yet, the LR test is valid only when the information matrix equality holds.

49 Testing Non-Nested Models. Consider testing the non-nested specifications: H_0: y_t | x_t, ξ_t ~ f(y_t | x_t; θ), θ ∈ Θ ⊆ R^p, versus H_1: y_t | x_t, ξ_t ~ ϕ(y_t | ξ_t; ψ), ψ ∈ Ψ ⊆ R^q, where x_t and ξ_t are two sets of variables. These specifications are non-nested because they cannot be derived from each other by imposing restrictions on the parameters. The encompassing principle of Mizon (1984) and Mizon and Richard (1986) asserts that, if the model under the null is true, it should encompass the model under the alternative, such that a statistic of the alternative model should be close to its pseudo-true value, the probability limit evaluated under the null model.

50 Wald Encompassing Test. An encompassing test for non-nested hypotheses is based on the difference between a chosen statistic and the sample counterpart of its pseudo-true value. When the chosen statistic is the QMLE of the alternative model, the resulting test is the Wald encompassing test (WET). Consider now the non-nested specifications of the conditional mean function: H_0: y_t | x_t, ξ_t ~ N(x_t′β, σ²), β ∈ B ⊆ R^k, versus H_1: y_t | x_t, ξ_t ~ N(ξ_t′δ, σ²), δ ∈ D ⊆ R^r, where x_t and ξ_t do not have elements in common. Let β̂_T and δ̂_T denote the QMLEs of the parameters in the null and alternative models.

51 Under the null, y_t | x_t, ξ_t ~ N(x_t′β_o, σ_o²), IE(ξ_ty_t) = IE(ξ_tx_t′)β_o, and hence δ̂_T = ((1/T) Σ_t ξ_tξ_t′)^{-1} ((1/T) Σ_t ξ_ty_t) → M_ξξ^{-1} M_ξx β_o in probability, where M_ξξ = lim (1/T) Σ_t IE(ξ_tξ_t′) and M_ξx = lim (1/T) Σ_t IE(ξ_tx_t′). Clearly, the pseudo-true parameter δ(β_o) = M_ξξ^{-1} M_ξx β_o would not be the probability limit of δ̂_T if x_t′β is an incorrect specification of the conditional mean. Thus, whether the difference between δ̂_T and the sample counterpart of δ(β_o) is sufficiently close to zero constitutes evidence for or against the null hypothesis.

52 The WET is then based on δ̂_T − δ̂_T(β̂_T) = δ̂_T − ((1/T) Σ_t ξ_tξ_t′)^{-1} ((1/T) Σ_t ξ_tx_t′) β̂_T = ((1/T) Σ_t ξ_tξ_t′)^{-1} (1/T) Σ_t ξ_t(y_t − x_t′β̂_T), which in effect checks whether ε_t = y_t − x_t′β and ξ_t are correlated. Letting ê_t = y_t − x_t′β̂_T, we have (1/√T) Σ_t ξ_tê_t = (1/√T) Σ_t ξ_tε_t − ((1/T) Σ_t ξ_tx_t′) √T(β̂_T − β_o), which is (1/√T) Σ_t ξ_tε_t − ((1/T) Σ_t ξ_tx_t′)((1/T) Σ_t x_tx_t′)^{-1} (1/√T) Σ_t x_tε_t.

53 Setting ξ̂_t = M_ξx M_xx^{-1} x_t, we can write (1/√T) Σ_t ξ_tê_t = (1/√T) Σ_t (ξ_t − ξ̂_t)ε_t + o_IP(1). Under the null hypothesis, (1/√T) Σ_t ξ_tê_t →D N(0, Σ_o), where Σ_o = σ_o² (lim (1/T) Σ_t IE[(ξ_t − ξ̂_t)(ξ_t − ξ̂_t)′]) = σ_o² (M_ξξ − M_ξx M_xx^{-1} M_xξ).

54 Consequently, √T [δ̂_T − δ̂_T(β̂_T)] →D N(0, M_ξξ^{-1} Σ_o M_ξξ^{-1}), and hence T [δ̂_T − δ̂_T(β̂_T)]′ M_ξξ Σ_o^{-1} M_ξξ [δ̂_T − δ̂_T(β̂_T)] →D χ²(r). A consistent estimator of Σ_o is Σ̂_T = σ̂²_T [(1/T) Σ_t ξ_tξ_t′ − ((1/T) Σ_t ξ_tx_t′)((1/T) Σ_t x_tx_t′)^{-1}((1/T) Σ_t x_tξ_t′)].

55 The WET statistic reads WET_T = T [δ̂_T − δ̂_T(β̂_T)]′ ((1/T) Σ_t ξ_tξ_t′) Σ̂_T^{-1} ((1/T) Σ_t ξ_tξ_t′) [δ̂_T − δ̂_T(β̂_T)] →D χ²(r). When x_t and ξ_t have s (s < r) elements in common, Σ_t ξ_tê_t has s elements that are identically zero. Thus, rank(Σ_o) = r − s < r, and the WET should be computed with Σ̂_T^{-1} replaced by Σ̂_T^−, the generalized inverse of Σ̂_T.

56 Score Encompassing Test. Under H_1: y_t | x_t, ξ_t ~ N(ξ_t′δ, σ²), the score function evaluated at the pseudo-true parameter δ(β_o) is (apart from a constant σ^{-2}) (1/T) Σ_t ξ_t[y_t − ξ_t′δ(β_o)] = (1/T) Σ_t ξ_t[y_t − ξ_t′(M_ξξ^{-1} M_ξx β_o)]. When the pseudo-true parameter is replaced by its estimator δ̂_T(β̂_T), the score function becomes (1/T) Σ_t ξ_ty_t − ((1/T) Σ_t ξ_tξ_t′)((1/T) Σ_t ξ_tξ_t′)^{-1}((1/T) Σ_t ξ_tx_t′) β̂_T = (1/T) Σ_t ξ_tê_t.

57 The score encompassing test (SET) is based on (1/√T) Σ_t ξ_tê_t →D N(0, Σ_o), and the SET statistic is (1/T) (Σ_t ê_tξ_t′) Σ̂_T^{-1} (Σ_t ξ_tê_t) →D χ²(r). The WET and SET are based on the same ingredient, and both require evaluating the pseudo-true parameter. The WET and SET are difficult to implement when the pseudo-true parameter is not readily derived; this may happen when, e.g., the QMLE does not have an analytic form. (Examples?)

58 Pseudo-True Score Encompassing Test. Chen and Kuan (2002) propose the pseudo-true score encompassing test (PSET), which is based on the pseudo-true value of the score function under the alternative. Although it may be difficult to evaluate the pseudo-true value of a QMLE when it does not have a closed form, it is easier to evaluate the pseudo-true score because the analytic form of the score function is usually available. The null and alternative hypotheses are: H_0: y_t | x_t, ξ_t ~ f(y_t | x_t; θ), θ ∈ Θ ⊆ R^p, versus H_1: y_t | x_t, ξ_t ~ ϕ(y_t | ξ_t; ψ), ψ ∈ Ψ ⊆ R^q.

59 Let s_{f,t}(θ) = ∇_θ log f(y_t | x_t; θ), s_{ϕ,t}(ψ) = ∇_ψ log ϕ(y_t | ξ_t; ψ), and ∇L_{f,T}(θ) = (1/T) Σ_t s_{f,t}(θ), ∇L_{ϕ,T}(ψ) = (1/T) Σ_t s_{ϕ,t}(ψ). The pseudo-true score function of ∇L_{ϕ,T}(ψ) is J_ϕ(θ, ψ) := lim IE_{f(θ)}[∇L_{ϕ,T}(ψ)], where IE_{f(θ)} denotes the expectation taken under the null model with parameter θ. As the pseudo-true parameter ψ(θ_o) is the KLIC minimizer when the null hypothesis is correctly specified, we have J_ϕ(θ_o, ψ(θ_o)) = 0. Thus, whether Ĵ_ϕ(θ̂_T, ψ̂_T) is close to zero constitutes evidence for or against the null hypothesis.

60 Following Wooldridge (1990), we can incorporate the null model into the score function and write ∇L_{ϕ,T}(θ, ψ) = (1/T) Σ_t d_{1,t}(θ, ψ) + (1/T) Σ_t d_{2,t}(θ, ψ) c_t(θ), where IE_{f(θ)}[c_t(θ) | x_t, ξ_t] = 0. As such, J_ϕ(θ, ψ) = lim (1/T) Σ_t IE_{f(θ)}[d_{1,t}(θ, ψ)], and its sample counterpart is Ĵ_ϕ(θ̂_T, ψ̂_T) = (1/T) Σ_t d_{1,t}(θ̂_T, ψ̂_T) = −(1/T) Σ_t d_{2,t}(θ̂_T, ψ̂_T) c_t(θ̂_T), where the second equality follows because ∇L_{ϕ,T}(ψ̂_T) = 0.

61 For non-nested, linear specifications of the conditional mean, we can incorporate the null model (y_t = x_t′β + ε_t) into the score and obtain ∇_δL_{ϕ,T}(θ, ψ) = (1/T) Σ_t ξ_t(x_t′β − ξ_t′δ) + (1/T) Σ_t ξ_tε_t, with d_{1,t}(θ, ψ) = ξ_t(x_t′β − ξ_t′δ), d_{2,t}(θ, ψ) = ξ_t, and c_t(θ) = ε_t. Then, Ĵ_ϕ(θ̂_T, ψ̂_T) = (1/T) Σ_t ξ_t(x_t′β̂_T − ξ_t′δ̂_T) = −(1/T) Σ_t ξ_tê_t, as in the SET discussed earlier.

62 For nonlinear specifications, it can be seen that Ĵ_ϕ(θ̂_T, ψ̂_T) = (1/T) Σ_t ∇_δμ(ξ_t, δ̂_T)[m(x_t, β̂_T) − μ(ξ_t, δ̂_T)] = −(1/T) Σ_t ∇_δμ(ξ_t, δ̂_T) ê_t, with ê_t = y_t − m(x_t, β̂_T) the nonlinear OLS residuals. Yet, it is not easy to compute the SET here because evaluating the pseudo-true value of δ̂_T is a formidable task.

63 The linear expansion of √T Ĵ_ϕ(θ̂_T, ψ̂_T) about (θ_o, ψ(θ_o)) is √T Ĵ_ϕ(θ̂_T, ψ̂_T) ≈ −(1/√T) Σ_t d_{2,t}(θ_o, ψ(θ_o)) c_t(θ_o) − A_o √T(θ̂_T − θ_o), where A_o = lim (1/T) Σ_t IE_{f(θ_o)}[d_{2,t}(θ_o, ψ(θ_o)) ∇_θ c_t(θ_o)]. Note that the other terms in the expansion that involve c_t vanish in the limit because they have zero mean. Recall also that √T(θ̂_T − θ_o) = −H_T(θ_o)^{-1} (1/√T) Σ_t s_{f,t}(θ_o) + o_IP(1), where IE_{f(θ_o)}[s_{f,t}(θ_o) | x_t, ξ_t] = 0.

64 Collecting terms, we have √T Ĵ_ϕ(θ̂_T, ψ̂_T) = (1/√T) Σ_t b_t(θ_o, ψ(θ_o)) + o_IP(1), where b_t(θ_o, ψ(θ_o)) = −d_{2,t}(θ_o, ψ(θ_o)) c_t(θ_o) + A_o H_T(θ_o)^{-1} s_{f,t}(θ_o) and IE_{f(θ_o)}[b_t(θ_o, ψ(θ_o)) | x_t, ξ_t] = 0. By invoking a suitable CLT, √T Ĵ_ϕ(θ̂_T, ψ̂_T) has a limiting normal distribution with asymptotic covariance matrix Σ_o = lim (1/T) Σ_t IE_{f(θ_o)}[b_t(θ_o, ψ(θ_o)) b_t(θ_o, ψ(θ_o))′], which can be consistently estimated by Σ̂_T = (1/T) Σ_t b_t(θ̂_T, ψ̂_T) b_t(θ̂_T, ψ̂_T)′.

65 It follows that the PSET statistic is PSET_T = T Ĵ_ϕ(θ̂_T, ψ̂_T)′ Σ̂_T^− Ĵ_ϕ(θ̂_T, ψ̂_T) →D χ²(k), where k is the rank of Σ_o and Σ̂_T^− is the generalized inverse of Σ̂_T. The PSET can be understood as an extension of the conditional mean encompassing test of Wooldridge (1990).

66 Example: Consider non-nested specifications of the conditional variance: H_0: y_t | x_t, ξ_t ~ N(0, h(x_t, α)), versus H_1: y_t | x_t, ξ_t ~ N(0, κ(ξ_t, γ)). When h_t and κ_t are evaluated at the respective QMLEs α̂_T and γ̂_T, we write ĥ_t and κ̂_t. It can be verified that s_{h,t}(α) = [∇_αh_t/(2h_t²)](y_t² − h_t) and s_{κ,t}(γ) = [∇_γκ_t/(2κ_t²)](y_t² − κ_t) = [∇_γκ_t/(2κ_t²)](h_t − κ_t) + [∇_γκ_t/(2κ_t²)](y_t² − h_t), where the first term is d_{1,t} and, in the second term, d_{2,t} = ∇_γκ_t/(2κ_t²) and c_t = y_t² − h_t, with IE_{f(θ)}(y_t² − h_t | x_t, ξ_t) = 0.

67 The sample counterpart of the pseudo-true score function is thus −(1/T) Σ_t [∇_γκ̂_t/(2κ̂_t²)](y_t² − ĥ_t). Thus, the PSET amounts to checking whether ∇_γκ_t/(2κ_t²) is correlated with the generalized errors y_t² − h_t.

68 Binary Choice Models. Consider the binary dependent variable y_t, which equals 1 with probability IP(y_t = 1 | x_t) and 0 with probability 1 − IP(y_t = 1 | x_t). The density function of y_t given x_t is g(y_t | x_t) = IP(y_t = 1 | x_t)^{y_t} [1 − IP(y_t = 1 | x_t)]^{1−y_t}. Approximating IP(y_t = 1 | x_t) by F(x_t; θ), the quasi-likelihood function is f(y_t | x_t; θ) = F(x_t; θ)^{y_t} [1 − F(x_t; θ)]^{1−y_t}. The QMLE θ̂_T is obtained by maximizing L_T(θ) = (1/T) Σ_t [y_t log F(x_t; θ) + (1 − y_t) log(1 − F(x_t; θ))].

69 Probit model: F(x_t; θ) = Φ(x_t′θ) = ∫_{−∞}^{x_t′θ} (2π)^{-1/2} e^{−v²/2} dv, where Φ denotes the standard normal distribution function. Logit model: F(x_t; θ) = G(x_t′θ) = 1/[1 + exp(−x_t′θ)] = exp(x_t′θ)/[1 + exp(x_t′θ)], where G is the logistic distribution function with mean zero and variance π²/3. Note that the logistic distribution is more peaked around its mean and has slightly thicker tails than the standard normal distribution.
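
The logit QMLE can be obtained by direct numerical maximization of L_T(θ), as in the following added sketch; the simulated data, sample size, and optimizer choice are illustrative only.

```python
# Logit QMLE: maximize (1/T) sum [ y log G(x'theta) + (1 - y) log(1 - G(x'theta)) ].
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit               # logistic cdf G(u) = 1/(1 + exp(-u))

rng = np.random.default_rng(1)
T = 1000
X = np.column_stack([np.ones(T), rng.normal(size=T)])
theta_true = np.array([0.3, -1.0])
y = (rng.uniform(size=T) < expit(X @ theta_true)).astype(float)

def neg_loglik(theta):
    p = expit(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

theta_hat = minimize(neg_loglik, np.zeros(2), method="BFGS").x
print(theta_hat)                              # close to theta_true in large samples
```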

70 Global Concavity. The log-likelihood function L_T is globally concave provided that ∇²_θL_T(θ) is negative definite for all θ ∈ Θ. By a second-order Taylor expansion, L_T(θ) = L_T(θ̂_T) + ∇_θL_T(θ̂_T)′(θ − θ̂_T) + (1/2)(θ − θ̂_T)′∇²_θL_T(θ†)(θ − θ̂_T) = L_T(θ̂_T) + (1/2)(θ − θ̂_T)′∇²_θL_T(θ†)(θ − θ̂_T), where θ† is between θ and θ̂_T and the gradient term vanishes because ∇_θL_T(θ̂_T) = 0. Global concavity then implies that L_T(θ) must be less than L_T(θ̂_T) for any θ ≠ θ̂_T in Θ. Thus, θ̂_T must be a global maximizer.

71 For the logit model, as G′(u) = G(u)[1 − G(u)], we have ∇_θL_T(θ) = (1/T) Σ_t [y_t G′(x_t′θ)/G(x_t′θ) − (1 − y_t) G′(x_t′θ)/(1 − G(x_t′θ))] x_t = (1/T) Σ_t {y_t[1 − G(x_t′θ)] − (1 − y_t)G(x_t′θ)} x_t = (1/T) Σ_t [y_t − G(x_t′θ)] x_t, from which we can solve for the QMLE θ̂_T. Note that ∇²_θL_T(θ) = −(1/T) Σ_t G(x_t′θ)[1 − G(x_t′θ)] x_tx_t′, which is negative definite, so that L_T is globally concave in θ.

72 For the probit model, ∇_θL_T(θ) = (1/T) Σ_t [y_t φ(x_t′θ)/Φ(x_t′θ) − (1 − y_t) φ(x_t′θ)/(1 − Φ(x_t′θ))] x_t = (1/T) Σ_t {[y_t − Φ(x_t′θ)]/(Φ(x_t′θ)[1 − Φ(x_t′θ)])} φ(x_t′θ) x_t, where φ is the standard normal density function. It can be verified that ∇²_θL_T(θ) = −(1/T) Σ_t { y_t [φ(x_t′θ) + x_t′θ Φ(x_t′θ)]/Φ(x_t′θ)² + (1 − y_t) [φ(x_t′θ) − x_t′θ(1 − Φ(x_t′θ))]/[1 − Φ(x_t′θ)]² } φ(x_t′θ) x_tx_t′, which is also negative definite; see, e.g., Amemiya (1985).
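
Because both log-likelihoods are globally concave, Newton's method with the analytic score and Hessian converges reliably. The sketch below (added; the inputs y and X are assumed given) implements the logit case using the formulas from the previous slide.

```python
# Newton-Raphson for the logit QMLE with analytic score and Hessian:
#   grad = (1/T) sum (y_t - G(x_t'theta)) x_t
#   hess = -(1/T) sum G(x_t'theta)(1 - G(x_t'theta)) x_t x_t'
import numpy as np
from scipy.special import expit

def logit_newton(y, X, tol=1e-10, max_iter=50):
    T, k = X.shape
    theta = np.zeros(k)
    for _ in range(max_iter):
        G = expit(X @ theta)
        grad = X.T @ (y - G) / T
        hess = -(X * (G * (1 - G))[:, None]).T @ X / T
        theta = theta - np.linalg.solve(hess, grad)   # Newton step for a maximum
        if np.max(np.abs(grad)) < tol:
            break
    return theta
```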

73 As IE(y_t | x_t) = IP(y_t = 1 | x_t), we can write y_t = F(x_t; θ) + e_t. Note that when F is correctly specified for the conditional mean, y_t is conditionally heteroskedastic with var(y_t | x_t) = IP(y_t = 1 | x_t)[1 − IP(y_t = 1 | x_t)]. Thus, the probit and logit models are also different nonlinear mean specifications with conditional heteroskedasticity. The NLS estimator of θ is inefficient because the NLS objective function ignores conditional heteroskedasticity. A weighted NLS estimator that takes the conditional variance into account is still inefficient. (Why?)

75 Marginal response: For the probit model, ∂Φ(x_t′θ)/∂x_{tj} = φ(x_t′θ) θ_j; for the logit model, ∂G(x_t′θ)/∂x_{tj} = G(x_t′θ)[1 − G(x_t′θ)] θ_j. These marginal effects all depend on x_t. It is typical to evaluate the marginal response at a particular value of x_t, such as x_t = 0 or x_t = x̄, the sample average of the x_t. When x_t = 0, φ(0) ≈ 0.4 and G(0)[1 − G(0)] = 0.25. This suggests that the QMLE for the logit model is approximately 1.6 times the QMLE for the probit model when the x_t are close to zero.

76 Letting p = IP(y = 1 | x), p/(1 − p) is the odds ratio: the probability of y = 1 relative to the probability of y = 0. For the logit model, the log odds ratio is ln(p_t/(1 − p_t)) = ln(exp(x_t′θ)) = x_t′θ, so that ∂ln(p_t/(1 − p_t))/∂x_{tj} = θ_j. As the effect of a regressor on the log odds ratio, each coefficient in the logit model can also be understood as a semi-elasticity.

77 Measure of Goodness of Fit. Digression: Given the objective function Q_T, let Q_T(c) denote the value of Q_T when the model contains only a constant term, Q_T* the largest possible value of Q_T (if it exists), and Q_T(θ̂_T) the value of Q_T for the fitted model. Then, the R² of relative gain is R²_RG = [Q_T(θ̂_T) − Q_T(c)]/[Q_T* − Q_T(c)] = 1 − [Q_T* − Q_T(θ̂_T)]/[Q_T* − Q_T(c)]. For binary choice models, Q_T = L_T, and the largest possible value is L_T* = 0, attained when the predicted probability of the observed outcome equals one for every t. The measure of McFadden (1974) is R²_RG = 1 − L_T(θ̂_T)/L_T(ȳ) = 1 − Σ_t [y_t ln p̂_t + (1 − y_t) ln(1 − p̂_t)] / (T [ȳ ln ȳ + (1 − ȳ) ln(1 − ȳ)]), where p̂_t = G(x_t′θ̂_T) or p̂_t = Φ(x_t′θ̂_T).
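
A short added sketch of McFadden's measure, computed from the fitted probabilities p̂_t of a probit or logit model:

```python
# McFadden's pseudo-R^2: 1 - loglik(fitted model) / loglik(constant-only model).
import numpy as np

def mcfadden_r2(y, p_hat):
    ll_fit = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    ybar = y.mean()
    ll_const = len(y) * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))
    return 1.0 - ll_fit / ll_const
```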

78 Latent Index Model. Assume that y_t is determined by the latent index variable y_t*: y_t = 1 if y_t* > 0, and y_t = 0 if y_t* ≤ 0, where y_t* = x_t′β + e_t. Thus, IP(y_t = 1 | x_t) = IP(y_t* > 0 | x_t) = IP(e_t > −x_t′β | x_t), which is also IP(e_t < x_t′β | x_t) provided that e_t is symmetric about zero. The probit specification Φ or the logit specification G can then be understood as a specification of the conditional distribution of e_t.

79 Multinomial Models. Consider J + 1 mutually exclusive choices that do not have a natural ordering, e.g., employment status and commuting mode. The dependent variable y_t takes on the values j = 0, 1, ..., J, each with probability IP(y_t = j | x_t). Define the new binary variables d_{t,j}, j = 0, 1, ..., J, as d_{t,j} = 1 if y_t = j and d_{t,j} = 0 otherwise; note that Σ_{j=0}^J d_{t,j} = 1.

80 The density function of d_{t,0}, ..., d_{t,J} given x_t is then g(d_{t,0}, ..., d_{t,J} | x_t) = Π_{j=0}^J IP(y_t = j | x_t)^{d_{t,j}}. Approximating IP(y_t = j | x_t) by F_j(x_t; θ), we obtain the quasi-log-likelihood function L_T(θ) = (1/T) Σ_{t=1}^T Σ_{j=0}^J d_{t,j} ln F_j(x_t; θ). The first-order condition is ∇_θL_T(θ) = (1/T) Σ_{t=1}^T Σ_{j=0}^J d_{t,j} [1/F_j(x_t; θ)] ∇_θF_j(x_t; θ) = 0, from which we can solve for the QMLE θ̂_T.

81 Multinomial Logit Model. Common specifications of the conditional probabilities are F_j(x_t; θ) = G_{t,j} = exp(x_t′θ_j)/Σ_{k=0}^J exp(x_t′θ_k), j = 0, ..., J, where θ = (θ_0′ θ_1′ ... θ_J′)′ and x_t does not depend on the choices. Note, however, that the parameters are not identified because, for example, exp(x_t′θ_0)/[exp(x_t′θ_0) + Σ_{k=1}^J exp(x_t′θ_k)] = exp(x_t′(θ_0 + γ))/[exp(x_t′(θ_0 + γ)) + Σ_{k=1}^J exp(x_t′(θ_k + γ))] for any γ.

82 For normalization, we set θ_0 = 0, so that F_0(x_t; θ) = G_{t,0} = 1/[1 + Σ_{k=1}^J exp(x_t′θ_k)] and F_j(x_t; θ) = G_{t,j} = exp(x_t′θ_j)/[1 + Σ_{k=1}^J exp(x_t′θ_k)], j = 1, ..., J, with θ = (θ_1′ θ_2′ ... θ_J′)′. This leads to the multinomial logit model, and the quasi-log-likelihood function is L_T(θ) = (1/T) Σ_{t=1}^T Σ_{j=1}^J d_{t,j} x_t′θ_j − (1/T) Σ_{t=1}^T log(1 + Σ_{k=1}^J exp(x_t′θ_k)).

83 It is easy to see that ∇_{θ_j}G_{t,k} = −G_{t,k}G_{t,j} x_t for k ≠ j, and ∇_{θ_j}G_{t,j} = G_{t,j}[1 − G_{t,j}] x_t. It follows that ∇_{θ_j}L_T(θ) = (1/T) Σ_{t=1}^T (d_{t,j} − G_{t,j}) x_t, j = 1, ..., J. The Hessian matrix contains the blocks ∇_{θ_i}∇_{θ_j}L_T(θ) = (1/T) Σ_{t=1}^T G_{t,j}G_{t,i} x_tx_t′, i ≠ j, i, j = 1, ..., J, and ∇_{θ_j}∇_{θ_j}L_T(θ) = −(1/T) Σ_{t=1}^T G_{t,j}(1 − G_{t,j}) x_tx_t′, j = 1, ..., J.

84 The marginal responses of G_{t,j} to changes in x_t are ∇_{x_t}G_{t,0} = −G_{t,0} Σ_{i=1}^J G_{t,i}θ_i and ∇_{x_t}G_{t,j} = G_{t,j}(θ_j − Σ_{i=1}^J G_{t,i}θ_i), j = 1, ..., J. Thus, all coefficient vectors θ_i, i = 1, ..., J, enter ∇_{x_t}G_{t,j}. The log odds ratios (relative to the base choice j = 0) are ln(G_{t,j}/G_{t,0}) = x_t′(θ_j − θ_0) = x_t′θ_j, j = 1, ..., J, as θ_0 = 0 by construction. Thus, each coefficient θ_{j,k} can also be understood as the effect of x_{tk} on the log odds ratio ln(G_{t,j}/G_{t,0}).

85 Conditional Logit Model. In the conditional logit model, there are multiple choices with choice-dependent or alternative-varying regressors. For example, in choosing among several commuting modes, the regressors may include the in-vehicle time and waiting time, which vary with the vehicle. Let x_t = (x_{t,0}′ x_{t,1}′ ... x_{t,J}′)′. The specifications for the conditional probabilities are G_{t,j} = exp(x_{t,j}′θ)/Σ_{k=0}^J exp(x_{t,k}′θ), j = 0, 1, ..., J. In contrast with the multinomial model, θ is common to all j, so that there is no identification problem.

86 The quasi-log-likelihood function is L_T(θ) = (1/T) Σ_{t=1}^T Σ_{j=0}^J d_{t,j} x_{t,j}′θ − (1/T) Σ_{t=1}^T log(Σ_{k=0}^J exp(x_{t,k}′θ)). The gradient and Hessian matrix are ∇_θL_T(θ) = (1/T) Σ_{t=1}^T Σ_{j=0}^J (d_{t,j} − G_{t,j}) x_{t,j} and ∇²_θL_T(θ) = −(1/T) Σ_{t=1}^T [Σ_{j=0}^J G_{t,j} x_{t,j}x_{t,j}′ − (Σ_{j=0}^J G_{t,j}x_{t,j})(Σ_{j=0}^J G_{t,j}x_{t,j})′] = −(1/T) Σ_{t=1}^T Σ_{j=0}^J G_{t,j}(x_{t,j} − x̄_t)(x_{t,j} − x̄_t)′, where x̄_t = Σ_{j=0}^J G_{t,j}x_{t,j} is the weighted average of the x_{t,j}.

87 It can be shown that the Hessian matrix is negative definite, so that the quasi-log-likelihood function is globally concave. Each choice probability G_{t,j} is affected not only by x_{t,j} but also by the regressors of the other choices, x_{t,i}, because ∇_{x_{t,i}}G_{t,j} = −G_{t,j}G_{t,i}θ, i ≠ j, i = 0, ..., J, and ∇_{x_{t,j}}G_{t,j} = G_{t,j}(1 − G_{t,j})θ, j = 0, ..., J. Note that for positive θ, an increase in x_{t,j} increases the probability of the j-th choice but decreases the probabilities of the other choices.

88 Random Utility Interpretation. McFadden (1974): Consider the random utility of choice j: U_{t,j} = V_{t,j} + ε_{t,j}, j = 0, 1, ..., J. The alternative i would be chosen if it yields the highest utility: IP(y_t = i | x_t) = IP(U_{t,i} > U_{t,j} for all j ≠ i | x_t) = IP(ε_{t,j} − ε_{t,i} < V_{t,i} − V_{t,j} for all j ≠ i | x_t). Letting ε̃_{t,ji} = ε_{t,j} − ε_{t,i} and Ṽ_{t,ji} = V_{t,j} − V_{t,i}, we have (for i = 0) IP(y_t = 0 | x_t) = IP(ε̃_{t,j0} < −Ṽ_{t,j0}, j = 1, 2, ..., J | x_t) = ∫_{−∞}^{−Ṽ_{t,10}} ... ∫_{−∞}^{−Ṽ_{t,J0}} f(ε̃_{t,10}, ..., ε̃_{t,J0}) dε̃_{t,10} ... dε̃_{t,J0}.

89 This setup is consistent with the theory of decision making. Different models are obtained by imposing different assumptions on the joint distribution of the ε_{t,j}. Suppose that the ε_{t,j} are independent random variables across j with the type I extreme value distribution exp[−exp(−ε_{t,j})] and density f(ε_{t,j}) = exp(−ε_{t,j}) exp[−exp(−ε_{t,j})], j = 0, 1, ..., J. It can be shown that IP(y_t = j | x_t) = exp(V_{t,j})/Σ_{i=0}^J exp(V_{t,i}). We obtain the conditional logit model when V_{t,j} = x_{t,j}′θ and the multinomial logit model when V_{t,j} = x_t′θ_j.

90 Remarks: 1 In the conditional logit model, the choices must be quite different so that they are independent of each other. This is known as independence of irrelevant alternatives (IIA), which is a restrictive condition in practice. 2 The multinomial probit model is obtained by assuming that the ε_{t,j} are jointly normally distributed. This model permits correlations among the choices, yet it requires numerical or simulation methods to evaluate the J-fold integral for the choice probabilities.

91 Nested Logit Models. McFadden (1978): Generalized extreme value (GEV) distribution with F(ε_0, ε_1, ..., ε_J) = exp[−G(e^{−ε_0}, e^{−ε_1}, ..., e^{−ε_J})], where G is non-negative, homogeneous of degree one, and satisfies some other conditions. With this distributional assumption, the choice probabilities of the random utility model are e^{V_j} G_j(e^{V_0}, e^{V_1}, ..., e^{V_J}) / G(e^{V_0}, e^{V_1}, ..., e^{V_J}), j = 0, 1, ..., J, where G_j denotes the derivative of G with respect to its j-th argument. In particular, when G(e^{V_0}, e^{V_1}, ..., e^{V_J}) = Σ_{k=0}^J exp(V_k), we obtain the multinomial logit model.

92 Consider the case where there are J groups of choices, in which the j-th group has choices 1, ..., K_j. The random utility is U_{t,jk} = V_{t,jk} + ε_{t,jk}, j = 1, ..., J, k = 1, ..., K_j. For example, one must first choose between taking public transportation and driving and then select a vehicle within the chosen group. The nested logit model postulates the GEV distribution for the ε_{t,jk}: exp[−G(e^{−ε_{11}}, ..., e^{−ε_{1K_1}}, ..., e^{−ε_{J1}}, ..., e^{−ε_{JK_J}})] = exp[−Σ_{j=1}^J (Σ_{k=1}^{K_j} e^{−ε_{jk}/r_j})^{r_j}], where r_j = [1 − corr(ε_{jk}, ε_{jm})]^{1/2}. When the choices within the j-th group are uncorrelated, r_j = 1, and we obtain the multinomial logit model.

93 Define the binary variable d_{t,jk} as d_{t,jk} = 1 if y_t = jk and d_{t,jk} = 0 otherwise, j = 1, ..., J, k = 1, ..., K_j. A particular choice jk is chosen with probability p_{t,jk} = IP(d_{t,jk} = 1 | x_t) = IP(U_{t,jk} > U_{t,mn} for all (m, n) ≠ (j, k) | x_t). Let x_t be the collection of the z_{t,j} and w_{t,jk}, and let V_{t,jk} = z_{t,j}′α + w_{t,jk}′γ_j, j = 1, ..., J, k = 1, ..., K_j. Thus, we have group-specific regressors and regressors that depend on both j and k.

94 Let I_{t,j} = ln(Σ_{k=1}^{K_j} exp(w_{t,jk}′γ_j/r_j)) denote the inclusive value, where the r_j are known as scale parameters. We can write p_{t,jk} = p_{t,j} p_{t,k|j}, with p_{t,j} = exp(z_{t,j}′α + r_jI_{t,j}) / Σ_{m=1}^J exp(z_{t,m}′α + r_mI_{t,m}) and p_{t,k|j} = exp(w_{t,jk}′γ_j/r_j) / Σ_{n=1}^{K_j} exp(w_{t,jn}′γ_j/r_j); details can be found in Cameron and Trivedi (2005). Note that for a given j, p_{t,k|j} is a conditional logit model with the parameter γ_j/r_j.

95 The approximating density is f = Π_{j=1}^J Π_{k=1}^{K_j} (p_{t,j} p_{t,k|j})^{d_{t,jk}} = Π_{j=1}^J [p_{t,j}^{Σ_k d_{t,jk}} Π_{k=1}^{K_j} p_{t,k|j}^{d_{t,jk}}], and the quasi-log-likelihood function is L_T = (1/T) Σ_{t=1}^T Σ_{j=1}^J Σ_{k=1}^{K_j} d_{t,jk} ln(p_{t,j}) + (1/T) Σ_{t=1}^T Σ_{j=1}^J Σ_{k=1}^{K_j} d_{t,jk} ln(p_{t,k|j}). Maximizing this likelihood function yields the full information maximum likelihood (FIML) estimator. There is also a sequential (limited information maximum likelihood) estimator; we omit the details.

96 Ordered Multinomial Models. Suppose that the categories of the data have a natural ordering, such that y_t = 0 if y_t* ≤ c_1, y_t = 1 if c_1 < y_t* ≤ c_2, ..., and y_t = J if c_J < y_t*, where the y_t* are latent variables. For example, language proficiency status and education level have a natural ordering. Such models may be estimated using multinomial logit models; yet taking this ordering structure into account yields simpler models.


More information

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 14 GEE-GMM Throughout the course we have emphasized methods of estimation and inference based on the principle

More information

CALCULATION METHOD FOR NONLINEAR DYNAMIC LEAST-ABSOLUTE DEVIATIONS ESTIMATOR

CALCULATION METHOD FOR NONLINEAR DYNAMIC LEAST-ABSOLUTE DEVIATIONS ESTIMATOR J. Japan Statist. Soc. Vol. 3 No. 200 39 5 CALCULAION MEHOD FOR NONLINEAR DYNAMIC LEAS-ABSOLUE DEVIAIONS ESIMAOR Kohtaro Hitomi * and Masato Kagihara ** In a nonlinear dynamic model, the consistency and

More information

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. Linear-in-Parameters Models: IV versus Control Functions 2. Correlated

More information

Graduate Econometrics Lecture 4: Heteroskedasticity

Graduate Econometrics Lecture 4: Heteroskedasticity Graduate Econometrics Lecture 4: Heteroskedasticity Department of Economics University of Gothenburg November 30, 2014 1/43 and Autocorrelation Consequences for OLS Estimator Begin from the linear model

More information

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models

Economics 536 Lecture 7. Introduction to Specification Testing in Dynamic Econometric Models University of Illinois Fall 2016 Department of Economics Roger Koenker Economics 536 Lecture 7 Introduction to Specification Testing in Dynamic Econometric Models In this lecture I want to briefly describe

More information

Binary Choice Models Probit & Logit. = 0 with Pr = 0 = 1. decision-making purchase of durable consumer products unemployment

Binary Choice Models Probit & Logit. = 0 with Pr = 0 = 1. decision-making purchase of durable consumer products unemployment BINARY CHOICE MODELS Y ( Y ) ( Y ) 1 with Pr = 1 = P = 0 with Pr = 0 = 1 P Examples: decision-making purchase of durable consumer products unemployment Estimation with OLS? Yi = Xiβ + εi Problems: nonsense

More information

MEI Exam Review. June 7, 2002

MEI Exam Review. June 7, 2002 MEI Exam Review June 7, 2002 1 Final Exam Revision Notes 1.1 Random Rules and Formulas Linear transformations of random variables. f y (Y ) = f x (X) dx. dg Inverse Proof. (AB)(AB) 1 = I. (B 1 A 1 )(AB)(AB)

More information

Appendix A: The time series behavior of employment growth

Appendix A: The time series behavior of employment growth Unpublished appendices from The Relationship between Firm Size and Firm Growth in the U.S. Manufacturing Sector Bronwyn H. Hall Journal of Industrial Economics 35 (June 987): 583-606. Appendix A: The time

More information

Classical Least Squares Theory

Classical Least Squares Theory Classical Least Squares Theory CHUNG-MING KUAN Department of Finance & CRETA National Taiwan University October 18, 2014 C.-M. Kuan (Finance & CRETA, NTU) Classical Least Squares Theory October 18, 2014

More information

Panel Data Seminar. Discrete Response Models. Crest-Insee. 11 April 2008

Panel Data Seminar. Discrete Response Models. Crest-Insee. 11 April 2008 Panel Data Seminar Discrete Response Models Romain Aeberhardt Laurent Davezies Crest-Insee 11 April 2008 Aeberhardt and Davezies (Crest-Insee) Panel Data Seminar 11 April 2008 1 / 29 Contents Overview

More information

Introductory Econometrics

Introductory Econometrics Based on the textbook by Wooldridge: : A Modern Approach Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies Vienna November 23, 2013 Outline Introduction

More information

Non-linear panel data modeling

Non-linear panel data modeling Non-linear panel data modeling Laura Magazzini University of Verona laura.magazzini@univr.it http://dse.univr.it/magazzini May 2010 Laura Magazzini (@univr.it) Non-linear panel data modeling May 2010 1

More information

Econometrics Multiple Regression Analysis: Heteroskedasticity

Econometrics Multiple Regression Analysis: Heteroskedasticity Econometrics Multiple Regression Analysis: João Valle e Azevedo Faculdade de Economia Universidade Nova de Lisboa Spring Semester João Valle e Azevedo (FEUNL) Econometrics Lisbon, April 2011 1 / 19 Properties

More information

Linear Regression with Time Series Data

Linear Regression with Time Series Data Econometrics 2 Linear Regression with Time Series Data Heino Bohn Nielsen 1of21 Outline (1) The linear regression model, identification and estimation. (2) Assumptions and results: (a) Consistency. (b)

More information

An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data

An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data Jae-Kwang Kim 1 Iowa State University June 28, 2012 1 Joint work with Dr. Ming Zhou (when he was a PhD student at ISU)

More information

Economics 582 Random Effects Estimation

Economics 582 Random Effects Estimation Economics 582 Random Effects Estimation Eric Zivot May 29, 2013 Random Effects Model Hence, the model can be re-written as = x 0 β + + [x ] = 0 (no endogeneity) [ x ] = = + x 0 β + + [x ] = 0 [ x ] = 0

More information

Linear Regression with Time Series Data

Linear Regression with Time Series Data u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f e c o n o m i c s Econometrics II Linear Regression with Time Series Data Morten Nyboe Tabor u n i v e r s i t y o f c o p e n h a g

More information

Elements of Probability Theory

Elements of Probability Theory Elements of Probability Theory CHUNG-MING KUAN Department of Finance National Taiwan University December 5, 2009 C.-M. Kuan (National Taiwan Univ.) Elements of Probability Theory December 5, 2009 1 / 58

More information

Lecture 14 More on structural estimation

Lecture 14 More on structural estimation Lecture 14 More on structural estimation Economics 8379 George Washington University Instructor: Prof. Ben Williams traditional MLE and GMM MLE requires a full specification of a model for the distribution

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Additional Topics on Linear Regression

Additional Topics on Linear Regression Additional Topics on Linear Regression Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Additional Topics 1 / 49 1 Tests for Functional Form Misspecification 2 Nonlinear

More information

Econometric Analysis of Cross Section and Panel Data

Econometric Analysis of Cross Section and Panel Data Econometric Analysis of Cross Section and Panel Data Jeffrey M. Wooldridge / The MIT Press Cambridge, Massachusetts London, England Contents Preface Acknowledgments xvii xxiii I INTRODUCTION AND BACKGROUND

More information

Statistics 910, #5 1. Regression Methods

Statistics 910, #5 1. Regression Methods Statistics 910, #5 1 Overview Regression Methods 1. Idea: effects of dependence 2. Examples of estimation (in R) 3. Review of regression 4. Comparisons and relative efficiencies Idea Decomposition Well-known

More information

Follow links for Class Use and other Permissions. For more information send to:

Follow links for Class Use and other Permissions. For more information send  to: COPYRIGH NOICE: Kenneth J. Singleton: Empirical Dynamic Asset Pricing is published by Princeton University Press and copyrighted, 00, by Princeton University Press. All rights reserved. No part of this

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 MA 575 Linear Models: Cedric E Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 1 Within-group Correlation Let us recall the simple two-level hierarchical

More information

Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit

Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit Econometrics Lecture 5: Limited Dependent Variable Models: Logit and Probit R. G. Pierse 1 Introduction In lecture 5 of last semester s course, we looked at the reasons for including dichotomous variables

More information

Introduction Large Sample Testing Composite Hypotheses. Hypothesis Testing. Daniel Schmierer Econ 312. March 30, 2007

Introduction Large Sample Testing Composite Hypotheses. Hypothesis Testing. Daniel Schmierer Econ 312. March 30, 2007 Hypothesis Testing Daniel Schmierer Econ 312 March 30, 2007 Basics Parameter of interest: θ Θ Structure of the test: H 0 : θ Θ 0 H 1 : θ Θ 1 for some sets Θ 0, Θ 1 Θ where Θ 0 Θ 1 = (often Θ 1 = Θ Θ 0

More information

CCP Estimation. Robert A. Miller. March Dynamic Discrete Choice. Miller (Dynamic Discrete Choice) cemmap 6 March / 27

CCP Estimation. Robert A. Miller. March Dynamic Discrete Choice. Miller (Dynamic Discrete Choice) cemmap 6 March / 27 CCP Estimation Robert A. Miller Dynamic Discrete Choice March 2018 Miller Dynamic Discrete Choice) cemmap 6 March 2018 1 / 27 Criteria for Evaluating Estimators General principles to apply when assessing

More information

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation 1 Outline. 1. Motivation 2. SUR model 3. Simultaneous equations 4. Estimation 2 Motivation. In this chapter, we will study simultaneous systems of econometric equations. Systems of simultaneous equations

More information

Spatial Regression. 9. Specification Tests (1) Luc Anselin. Copyright 2017 by Luc Anselin, All Rights Reserved

Spatial Regression. 9. Specification Tests (1) Luc Anselin.   Copyright 2017 by Luc Anselin, All Rights Reserved Spatial Regression 9. Specification Tests (1) Luc Anselin http://spatial.uchicago.edu 1 basic concepts types of tests Moran s I classic ML-based tests LM tests 2 Basic Concepts 3 The Logic of Specification

More information

I. Multinomial Logit Suppose we only have individual specific covariates. Then we can model the response probability as

I. Multinomial Logit Suppose we only have individual specific covariates. Then we can model the response probability as Econ 513, USC, Fall 2005 Lecture 15 Discrete Response Models: Multinomial, Conditional and Nested Logit Models Here we focus again on models for discrete choice with more than two outcomes We assume that

More information

Ch.10 Autocorrelated Disturbances (June 15, 2016)

Ch.10 Autocorrelated Disturbances (June 15, 2016) Ch10 Autocorrelated Disturbances (June 15, 2016) In a time-series linear regression model setting, Y t = x tβ + u t, t = 1, 2,, T, (10-1) a common problem is autocorrelation, or serial correlation of the

More information

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2016 Instructor: Victor Aguirregabiria

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Winter 2016 Instructor: Victor Aguirregabiria ECOOMETRICS II (ECO 24S) University of Toronto. Department of Economics. Winter 26 Instructor: Victor Aguirregabiria FIAL EAM. Thursday, April 4, 26. From 9:am-2:pm (3 hours) ISTRUCTIOS: - This is a closed-book

More information

Discrete Dependent Variable Models

Discrete Dependent Variable Models Discrete Dependent Variable Models James J. Heckman University of Chicago This draft, April 10, 2006 Here s the general approach of this lecture: Economic model Decision rule (e.g. utility maximization)

More information

Control Function and Related Methods: Nonlinear Models

Control Function and Related Methods: Nonlinear Models Control Function and Related Methods: Nonlinear Models Jeff Wooldridge Michigan State University Programme Evaluation for Policy Analysis Institute for Fiscal Studies June 2012 1. General Approach 2. Nonlinear

More information

Linear Regression. Junhui Qian. October 27, 2014

Linear Regression. Junhui Qian. October 27, 2014 Linear Regression Junhui Qian October 27, 2014 Outline The Model Estimation Ordinary Least Square Method of Moments Maximum Likelihood Estimation Properties of OLS Estimator Unbiasedness Consistency Efficiency

More information

Brief Review on Estimation Theory

Brief Review on Estimation Theory Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on

More information

Intermediate Econometrics

Intermediate Econometrics Intermediate Econometrics Heteroskedasticity Text: Wooldridge, 8 July 17, 2011 Heteroskedasticity Assumption of homoskedasticity, Var(u i x i1,..., x ik ) = E(u 2 i x i1,..., x ik ) = σ 2. That is, the

More information

QED. Queen s Economics Department Working Paper No Russell Davidson Queen s University. James G. MacKinnon Queen s University

QED. Queen s Economics Department Working Paper No Russell Davidson Queen s University. James G. MacKinnon Queen s University QED Queen s Economics Department Working Paper No. 514 Convenient Specification Tests for Logit and Probit Models Russell Davidson Queen s University James G. MacKinnon Queen s University Department of

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

On the Power of Tests for Regime Switching

On the Power of Tests for Regime Switching On the Power of Tests for Regime Switching joint work with Drew Carter and Ben Hansen Douglas G. Steigerwald UC Santa Barbara May 2015 D. Steigerwald (UCSB) Regime Switching May 2015 1 / 42 Motivating

More information

Understanding Regressions with Observations Collected at High Frequency over Long Span

Understanding Regressions with Observations Collected at High Frequency over Long Span Understanding Regressions with Observations Collected at High Frequency over Long Span Yoosoon Chang Department of Economics, Indiana University Joon Y. Park Department of Economics, Indiana University

More information

Topic 7: Heteroskedasticity

Topic 7: Heteroskedasticity Topic 7: Heteroskedasticity Advanced Econometrics (I Dong Chen School of Economics, Peking University Introduction If the disturbance variance is not constant across observations, the regression is heteroskedastic

More information

ECON 594: Lecture #6

ECON 594: Lecture #6 ECON 594: Lecture #6 Thomas Lemieux Vancouver School of Economics, UBC May 2018 1 Limited dependent variables: introduction Up to now, we have been implicitly assuming that the dependent variable, y, was

More information

Max. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes

Max. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes Maximum Likelihood Estimation Econometrics II Department of Economics Universidad Carlos III de Madrid Máster Universitario en Desarrollo y Crecimiento Económico Outline 1 3 4 General Approaches to Parameter

More information

AGEC 661 Note Eleven Ximing Wu. Exponential regression model: m (x, θ) = exp (xθ) for y 0

AGEC 661 Note Eleven Ximing Wu. Exponential regression model: m (x, θ) = exp (xθ) for y 0 AGEC 661 ote Eleven Ximing Wu M-estimator So far we ve focused on linear models, where the estimators have a closed form solution. If the population model is nonlinear, the estimators often do not have

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information

Lecture 6: Hypothesis Testing

Lecture 6: Hypothesis Testing Lecture 6: Hypothesis Testing Mauricio Sarrias Universidad Católica del Norte November 6, 2017 1 Moran s I Statistic Mandatory Reading Moran s I based on Cliff and Ord (1972) Kelijan and Prucha (2001)

More information

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics. Jiti Gao Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics Jiti Gao Department of Statistics School of Mathematics and Statistics The University of Western Australia Crawley

More information

Multiple Regression Analysis

Multiple Regression Analysis Multiple Regression Analysis y = 0 + 1 x 1 + x +... k x k + u 6. Heteroskedasticity What is Heteroskedasticity?! Recall the assumption of homoskedasticity implied that conditional on the explanatory variables,

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, / 91. Bruce E.

Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, / 91. Bruce E. Forecasting Lecture 3 Structural Breaks Central Bank of Chile October 29-31, 2013 Bruce Hansen (University of Wisconsin) Structural Breaks October 29-31, 2013 1 / 91 Bruce E. Hansen Organization Detection

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

Inference Based on the Wild Bootstrap

Inference Based on the Wild Bootstrap Inference Based on the Wild Bootstrap James G. MacKinnon Department of Economics Queen s University Kingston, Ontario, Canada K7L 3N6 email: jgm@econ.queensu.ca Ottawa September 14, 2012 The Wild Bootstrap

More information

GARCH Models Estimation and Inference

GARCH Models Estimation and Inference GARCH Models Estimation and Inference Eduardo Rossi University of Pavia December 013 Rossi GARCH Financial Econometrics - 013 1 / 1 Likelihood function The procedure most often used in estimating θ 0 in

More information

Econometric Methods for Panel Data

Econometric Methods for Panel Data Based on the books by Baltagi: Econometric Analysis of Panel Data and by Hsiao: Analysis of Panel Data Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies

More information

h=1 exp (X : J h=1 Even the direction of the e ect is not determined by jk. A simpler interpretation of j is given by the odds-ratio

h=1 exp (X : J h=1 Even the direction of the e ect is not determined by jk. A simpler interpretation of j is given by the odds-ratio Multivariate Response Models The response variable is unordered and takes more than two values. The term unordered refers to the fact that response 3 is not more favored than response 2. One choice from

More information

ECONOMETRICS (I) MEI-YUAN CHEN. Department of Finance National Chung Hsing University. July 17, 2003

ECONOMETRICS (I) MEI-YUAN CHEN. Department of Finance National Chung Hsing University. July 17, 2003 ECONOMERICS (I) MEI-YUAN CHEN Department of Finance National Chung Hsing University July 17, 2003 c Mei-Yuan Chen. he L A EX source file is ec471.tex. Contents 1 Introduction 1 2 Reviews of Statistics

More information

simple if it completely specifies the density of x

simple if it completely specifies the density of x 3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely

More information

CRE METHODS FOR UNBALANCED PANELS Correlated Random Effects Panel Data Models IZA Summer School in Labor Economics May 13-19, 2013 Jeffrey M.

CRE METHODS FOR UNBALANCED PANELS Correlated Random Effects Panel Data Models IZA Summer School in Labor Economics May 13-19, 2013 Jeffrey M. CRE METHODS FOR UNBALANCED PANELS Correlated Random Effects Panel Data Models IZA Summer School in Labor Economics May 13-19, 2013 Jeffrey M. Wooldridge Michigan State University 1. Introduction 2. Linear

More information

A Primer on Asymptotics

A Primer on Asymptotics A Primer on Asymptotics Eric Zivot Department of Economics University of Washington September 30, 2003 Revised: October 7, 2009 Introduction The two main concepts in asymptotic theory covered in these

More information

Multivariate Time Series: VAR(p) Processes and Models

Multivariate Time Series: VAR(p) Processes and Models Multivariate Time Series: VAR(p) Processes and Models A VAR(p) model, for p > 0 is X t = φ 0 + Φ 1 X t 1 + + Φ p X t p + A t, where X t, φ 0, and X t i are k-vectors, Φ 1,..., Φ p are k k matrices, with

More information

Heteroskedasticity. Part VII. Heteroskedasticity

Heteroskedasticity. Part VII. Heteroskedasticity Part VII Heteroskedasticity As of Oct 15, 2015 1 Heteroskedasticity Consequences Heteroskedasticity-robust inference Testing for Heteroskedasticity Weighted Least Squares (WLS) Feasible generalized Least

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Statistics and econometrics

Statistics and econometrics 1 / 36 Slides for the course Statistics and econometrics Part 10: Asymptotic hypothesis testing European University Institute Andrea Ichino September 8, 2014 2 / 36 Outline Why do we need large sample

More information

Estimation and Testing for Common Cycles

Estimation and Testing for Common Cycles Estimation and esting for Common Cycles Anders Warne February 27, 2008 Abstract: his note discusses estimation and testing for the presence of common cycles in cointegrated vector autoregressions A simple

More information

Single Equation Linear GMM with Serially Correlated Moment Conditions

Single Equation Linear GMM with Serially Correlated Moment Conditions Single Equation Linear GMM with Serially Correlated Moment Conditions Eric Zivot October 28, 2009 Univariate Time Series Let {y t } be an ergodic-stationary time series with E[y t ]=μ and var(y t )

More information