Setting the RHS to zero,
$$0 = \frac{\partial L}{\partial\theta}(\theta^*) + \frac{\partial^2 L}{\partial\theta\,\partial\theta'}(\theta^*)\,(\theta - \theta^*),$$
so that
$$\theta - \theta^* = -\left[\frac{\partial^2 L}{\partial\theta\,\partial\theta'}(\theta^*)\right]^{-1}\frac{\partial L}{\partial\theta}(\theta^*) = -H_{\theta\theta}(\theta^*)^{-1} d_\theta(\theta^*),$$
where $H_{\theta\theta}$ is the Hessian matrix, $d_\theta$ is the derivative (score) vector, and $\theta$ is the solution of this set of equations.

[Figure 1: The Newton-Raphson Algorithm]

Rewriting the above,
$$\theta = \theta^* - H_{\theta\theta}(\theta^*)^{-1} d_\theta(\theta^*).$$
This turns out to be an updating formula for the estimates of $\theta$. Notice that if $\theta^*$ solves $\partial L/\partial\theta = 0$, then $d_\theta(\theta^*) = 0$ and thus $\theta = \theta^*$. This suggests that if $d_\theta(\theta^*) \neq 0$, we need to iterate the formula
$$\theta = \theta^* - H_{\theta\theta}(\theta^*)^{-1} d_\theta(\theta^*).$$
The updating sequence does not terminate until $d_\theta(\theta) \approx 0$. More generally,
$$\hat\theta^{(n)} = \hat\theta^{(n-1)} - H_{\theta\theta}(\hat\theta^{(n-1)})^{-1} d_\theta(\hat\theta^{(n-1)}),$$
where $\hat\theta^{(n)}$ is the estimate of $\theta$ at the end of the $n$th iteration.^{5}

^{5} In our notation, $\hat\theta^{(0)} = \theta^*$ and $\hat\theta^{(1)} = \theta$; each new iterate then plays the role of $\theta^*$ in the next step.
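The iteration above can be sketched numerically. The following is a minimal illustration with a hypothetical scalar example (not from the notes): for an exponential sample, the average log likelihood is $L(\lambda) = \log\lambda - \lambda\bar x$, so the score is $d_\lambda(\lambda) = 1/\lambda - \bar x$ and the Hessian is $H_{\lambda\lambda}(\lambda) = -1/\lambda^2$, and the Newton-Raphson updates converge to the exact MLE $1/\bar x$.

```python
# Newton-Raphson sketch for a hypothetical exponential-rate MLE:
# average log likelihood L(lam) = log(lam) - lam*xbar,
# score d(lam) = 1/lam - xbar, Hessian H(lam) = -1/lam**2.
def newton_raphson(theta, score, hessian, tol=1e-10, max_iter=100):
    """Iterate theta <- theta - H(theta)^{-1} d(theta) until the score vanishes."""
    for _ in range(max_iter):
        d = score(theta)
        if abs(d) < tol:
            break
        theta = theta - d / hessian(theta)
    return theta

data = [0.5, 1.5, 2.0, 1.0]          # toy sample (assumed exponential draws)
xbar = sum(data) / len(data)         # 1.25, so the exact MLE is 1/xbar = 0.8
lam_hat = newton_raphson(
    theta=1.0,                       # starting value hat-theta^(0)
    score=lambda lam: 1.0 / lam - xbar,    # d_theta
    hessian=lambda lam: -1.0 / lam**2,     # H_thetatheta
)
# The iterations recover the exact MLE 1/xbar = 0.8.
```

Because the Hessian here is everywhere negative, each step moves uphill on the likelihood, and convergence is quadratic near the solution.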
2. Scoring

This method replaces the Hessian matrix by the Fisher information matrix. Denote the information matrix for sample size $T$ by $F_{\theta\theta,T}$. Because
$$F_{\theta\theta,T} = -E\!\left(\frac{\partial^2 L}{\partial\theta\,\partial\theta'}\right),$$
it suggests replacing $(-H_{\theta\theta})$ by $F_{\theta\theta,T}$. In other words, the updating formula becomes
$$\hat\theta^{(n)} = \hat\theta^{(n-1)} + F_{\theta\theta,T}(\hat\theta^{(n-1)})^{-1} d_\theta(\hat\theta^{(n-1)}).$$
Why should we do this? Sometimes $F_{\theta\theta,T}$ is easier to compute than the Hessian matrix, because there are usually fewer elements to compute (for example, $F_{\beta\sigma^2,T}(\theta) = 0$ in the previous regression model), and also because this uses information about the question we study.

There is a variant of scoring^{6} that deserves mentioning. By the LLN, we would expect the average outer product of the individual scores to satisfy
$$\frac{1}{T}\sum_t \frac{\partial\ell_t}{\partial\theta}(\hat\theta^{(n-1)})\,\frac{\partial\ell_t}{\partial\theta'}(\hat\theta^{(n-1)}) \xrightarrow{\;p\;} F_{\theta\theta}(\theta),$$
so this outer product can replace $F_{\theta\theta,T}(\hat\theta^{(n-1)})$ in the updating formula, where $\frac{\partial\ell_t}{\partial\theta} = \frac{\partial\ln f_t}{\partial\theta}$.^{7} The above computational proposal simply shows that econometrics is not just computer science: taking advantage of the information at hand makes problem solving easier.

Now we look at an important case, the nonlinear regression model
$$y_t = g(x_t,\beta) + u_t, \qquad u_t \sim \mathrm{nid}(0,\sigma^2),$$
where the $x_t$ are assumed to be fixed. So the (average) log likelihood is
$$L(y_1,\dots,y_T \mid \theta) = -\frac{1}{2}\log 2\pi - \frac{1}{2}\log\sigma^2 - \frac{1}{2T\sigma^2}\sum_t (y_t - g(x_t,\beta))^2.$$
A bit of calculation leads to
$$\frac{\partial L}{\partial\beta} = \frac{1}{T\sigma^2}\sum_t (y_t - g(x_t,\beta))\,\frac{\partial g(x_t,\beta)}{\partial\beta}, \qquad \frac{\partial L}{\partial\sigma^2} = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4 T}\sum_t (y_t - g(x_t,\beta))^2,$$
$$\frac{\partial^2 L}{\partial\beta\,\partial\beta'} = -\frac{1}{\sigma^2 T}\sum_t \frac{\partial g(x_t,\beta)}{\partial\beta}\frac{\partial g(x_t,\beta)}{\partial\beta'} + \frac{1}{T\sigma^2}\sum_t (y_t - g(x_t,\beta))\,\frac{\partial^2 g(x_t,\beta)}{\partial\beta\,\partial\beta'},$$

^{6} This is how LIMDEP, an econometric software package, does it.
^{7} The reason for dividing by $T$ instead of multiplying by $T$ is that we have multiplied the log likelihood by $1/T$.
$$\frac{\partial^2 L}{\partial\beta\,\partial\sigma^2} = -\frac{1}{T\sigma^4}\sum_t (y_t - g(x_t,\beta))\,\frac{\partial g(x_t,\beta)}{\partial\beta}, \qquad \frac{\partial^2 L}{\partial\sigma^2\,\partial\sigma^2} = \frac{1}{2\sigma^4} - \frac{1}{\sigma^6 T}\sum_t (y_t - g(x_t,\beta))^2.$$
Take expectations to get $F_{\theta\theta,T}$, where
$$F_{\beta\beta,T} = \frac{1}{T\sigma^2}\sum_t \frac{\partial g(x_t,\beta)}{\partial\beta}\frac{\partial g(x_t,\beta)}{\partial\beta'}, \qquad F_{\beta\sigma^2,T} = 0, \qquad F_{\sigma^2\sigma^2,T} = \frac{1}{2\sigma^4}.$$
The first relation is due to
$$E\!\left((y_t - g(x_t,\beta))\,\frac{\partial^2 g(x_t,\beta)}{\partial\beta\,\partial\beta'}\right) = \frac{\partial^2 g(x_t,\beta)}{\partial\beta\,\partial\beta'}\,E(y_t - g(x_t,\beta)) = 0.$$
The scoring algorithm is
$$\begin{bmatrix}\hat\beta^{(n)}\\ \hat\sigma^{2(n)}\end{bmatrix} = \begin{bmatrix}\hat\beta^{(n-1)}\\ \hat\sigma^{2(n-1)}\end{bmatrix} + \begin{bmatrix}F_{\beta\beta,T}(\hat\theta^{(n-1)}) & F_{\beta\sigma^2,T}(\hat\theta^{(n-1)})\\ F_{\sigma^2\beta,T}(\hat\theta^{(n-1)}) & F_{\sigma^2\sigma^2,T}(\hat\theta^{(n-1)})\end{bmatrix}^{-1}\begin{bmatrix}\frac{\partial L}{\partial\beta}(\hat\theta^{(n-1)})\\ \frac{\partial L}{\partial\sigma^2}(\hat\theta^{(n-1)})\end{bmatrix}.$$
Because $F_{\beta\sigma^2,T}(\hat\theta^{(n-1)}) = 0 = F_{\sigma^2\beta,T}(\hat\theta^{(n-1)})$, the $\beta$ update separates:
$$\begin{aligned}
\hat\beta^{(n)} &= \hat\beta^{(n-1)} + F_{\beta\beta,T}^{-1}(\hat\theta^{(n-1)})\,\frac{\partial L}{\partial\beta}(\hat\theta^{(n-1)})\\
&= \hat\beta^{(n-1)} + \left[\frac{1}{T\hat\sigma^{2(n-1)}}\sum_t \hat g_t^\beta\,\hat g_t^{\beta\prime}\right]^{-1}\left[\frac{1}{T\hat\sigma^{2(n-1)}}\sum_t (y_t - g(x_t,\hat\beta^{(n-1)}))\,\hat g_t^\beta\right]\\
&= \hat\beta^{(n-1)} + \left[\sum_t \hat g_t^\beta\,\hat g_t^{\beta\prime}\right]^{-1}\sum_{t=1}^T (y_t - g(x_t,\hat\beta^{(n-1)}))\,\hat g_t^\beta\\
&= \hat\beta^{(n-1)} + \left[\sum_t z_t z_t'\right]^{-1}\sum_t z_t\,(y_t - g(x_t,\hat\beta^{(n-1)})),
\end{aligned}$$
where $\hat g_t^\beta = \left.\frac{\partial g(x_t,\beta)}{\partial\beta}\right|_{\beta=\hat\beta^{(n-1)}} = z_t$.

A regression interpretation emerges from the derivation. Note that $[\sum z_t z_t']^{-1}\sum z_t (y_t - g(x_t,\hat\beta^{(n-1)}))$ is the vector of estimated coefficients in the regression of $y_t - g(x_t,\hat\beta^{(n-1)})$ on $z_t$.^{8} In general, the updating procedure is an iterative least squares estimation. This is the so-called Gauss-Newton algorithm. In this case, ML estimation amounts to nonlinear regression estimation, if the problem can be formulated in the way we saw before. The Gauss-Newton algorithm is one of the simplest and most effective ways of maximizing the likelihood.

7 Wald, LM, and LR tests: the trinity

One major goal of econometric exercises is to draw inference from the observed data. Based on the MLE results, there are generally three important testing strategies we can undertake.

^{8} To see where this interpretation comes from, carefully compare the simple regression model $y_t = x_t'\beta + e_t$, $\hat\beta_{OLS} = (\sum x_t x_t')^{-1}(\sum x_t y_t)$, with the nonlinear regression, where $y_t - g(x_t,\hat\beta^{(n-1)})$ plays the role of the dependent variable $y_t$ in the simple regression, and $z_t$ that of the independent variables $x_t$.
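The Gauss-Newton iteration above can be sketched in a few lines. The following is a minimal illustration under an assumed scalar model $y_t = e^{\beta x_t} + u_t$ with noiseless toy data (both the model and the numbers are hypothetical, chosen so the fixed point is the true $\beta$): each update regresses the current residual $y_t - g(x_t,\beta)$ on $z_t = \partial g/\partial\beta = x_t e^{\beta x_t}$.

```python
import numpy as np

# Gauss-Newton for the hypothetical nonlinear model y_t = exp(beta * x_t) + u_t.
# Here z_t = dg/dbeta = x_t * exp(beta * x_t), and each update adds the OLS
# coefficient from regressing the residual y_t - g(x_t, beta) on z_t.
x = np.array([0.1, 0.5, 1.0, 1.5, 2.0])
beta_true = 0.7
y = np.exp(beta_true * x)        # noiseless toy sample, so the fixed point is beta_true

beta = 0.2                        # starting value
for _ in range(50):
    g = np.exp(beta * x)          # g(x_t, beta)
    z = x * g                     # z_t = hat g_t^beta
    beta = beta + (z @ (y - g)) / (z @ z)   # [sum z_t z_t']^{-1} sum z_t (y_t - g)
```

The update is exactly an OLS regression at each step, which is why the procedure is described as iterative least squares.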
They differ from each other in whether the restrictions under test have been taken into account in the testing. In what follows, we present them in order. Suppose we would like to test whether the true parameters, $\theta_0$, satisfy the restrictions
$$H_0: R\theta_0 = r.$$
To have a concrete idea of what $R$ and $r$ look like, suppose we want to test whether the production function is Cobb-Douglas, where the parameters should meet the constraint $\alpha_0 + \beta_0 = 1$. In correspondence with this example,
$$R = [1,\ 1], \qquad \theta_0 = \begin{bmatrix}\alpha_0\\ \beta_0\end{bmatrix}, \qquad r = 1.$$
More generally, $R$ can be a matrix and $r$ a vector, when more than one constraint is jointly under test.

7.1 Wald test

The Wald test does not use information about the null hypothesis when forming the statistic. Suppose $\hat\theta$ is the estimate of $\theta_0$, possibly obtained by maximum likelihood. If the data are really drawn from the null, we should expect $\hat\theta$ to be not much different from the true value $\theta_0$. As a result, under the null hypothesis being tested, we expect
$$R\hat\theta - r \approx 0 \quad \text{if } R\theta_0 = r, \text{ since } \hat\theta \approx \theta_0.$$
So this is a quantity we can use to discriminate the null hypothesis from its alternative: if the data are not drawn from the null, $R\hat\theta - r$ will instead be different from 0, and the larger the difference, the stronger the evidence against the null. The question now is how we can employ this notion to construct an appropriate statistic that is powerful against the alternative. To do this with the Wald test, note that
$$R\hat\theta - r = R\hat\theta - R\theta_0 + R\theta_0 - r = R(\hat\theta - \theta_0),$$
where under $H_0$, $R\theta_0 - r = 0$. However, the quantity $R(\hat\theta - \theta_0)$ is a random variable with some distribution. To investigate what the distribution is, observe the asymptotic normality of the MLE:
$$T^{1/2}(\hat\theta - \theta_0) \xrightarrow{\;d\;} N(0, F_{\theta\theta}^{-1}(\theta_0)).$$
Since $R(\hat\theta - \theta_0)$ is just a linear transformation of $\hat\theta - \theta_0$, it is straightforward to show that
$$T^{1/2}R(\hat\theta - \theta_0) \xrightarrow{\;d\;} N(0, R F_{\theta\theta}^{-1}(\theta_0) R'),$$
which gives the distribution we desired.
It is natural to form the statistic
$$W \equiv T\,(R\hat\theta - r)'\,(R F_{\theta\theta}^{-1}(\theta_0) R')^{-1}\,(R\hat\theta - r) \xrightarrow{\;d\;} \chi^2(\dim(r)).$$
The asymptotic result comes from the facts that $T^{1/2}R(\hat\theta - \theta_0)$ is normal and that $R F_{\theta\theta}^{-1}(\theta_0) R'$ is the scaling covariance.^{9} It should be emphasized that in the Wald test statistic, the parameters of concern, $\hat\theta$, are estimated without the information of the restrictions from the null: it is an unrestricted version of the MLE. The Wald test statistic is in fact infeasible, because $F_{\theta\theta}^{-1}(\theta_0)$ involves the unknown parameters $\theta_0$. The test can be made feasible by replacing $\theta_0$ with its consistent counterpart^{10} $\hat\theta$, so the test statistic takes the form
$$W \equiv T\,(R\hat\theta - r)'\,(R F_{\theta\theta,T}^{-1}(\hat\theta) R')^{-1}\,(R\hat\theta - r).$$
The test statistic behaves quite differently under the alternative hypothesis, where the constraint $R\theta_0 - r = 0$ does not hold. Under the alternative, $R\hat\theta - r$ is not close to 0, and the Wald test statistic asymptotically follows a non-central $\chi^2$ distribution.

7.2 LM test

The major difference between the Lagrange multiplier (LM) test and the Wald test lies in whether the parameter estimates used are unrestricted or restricted. In the construction of the LM test, the parameters are estimated with the information of the restrictions from the null. In this sense, the LM test statistic has the lowest computational cost among the three test statistics under study. The restricted MLE is computed as follows. To solve the maximization problem
$$\max_\theta\ L(\theta) \quad \text{subject to } R\theta = r,$$
we form the Lagrangian
$$\mathcal{L} = L(\theta) + \lambda'(R\theta - r),$$
where $\lambda$ is the Lagrange multiplier, and the LM test is a statistic based on this quantity. The first-order conditions from the above are
$$\frac{\partial\mathcal{L}}{\partial\theta} = d_\theta(\theta) + R'\lambda = 0, \qquad \frac{\partial\mathcal{L}}{\partial\lambda} = R\theta - r = 0.$$
Let $(\tilde\theta, \tilde\lambda)$ be the solutions, where $\tilde\lambda$ is an estimate of the Lagrange multiplier. So
$$d_\theta(\tilde\theta) + R'\tilde\lambda = 0.$$

^{9} Recall that if $X \sim N_j(\mu,\Omega)$, then $(X-\mu)'\Omega^{-1}(X-\mu) \sim \chi^2(j)$.
^{10} The unrestricted MLE is consistent under both the null and the alternative.
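The feasible Wald statistic can be sketched concretely. The following toy setup is an assumption for illustration (not from the notes): $\theta$ is the mean of $N(\theta,1)$ data, so $\hat\theta$ is the sample mean and $F_{\theta\theta} = 1$ is known, and we test a single linear restriction $R\theta_0 = r$.

```python
import numpy as np

# Hypothetical feasible Wald test: y_t ~ N(theta, 1), so hat theta = sample mean
# and F_thetatheta = 1. Test H0: R theta0 = r with R = [1], r = 0.3.
data = np.array([0.2, 0.8, 0.4, 0.6])   # toy sample; its mean is 0.5
T = len(data)
theta_hat = np.array([data.mean()])
R = np.array([[1.0]])
r = np.array([0.3])
F_inv = np.array([[1.0]])                # F^{-1}(hat theta), known in this toy case

diff = R @ theta_hat - r                 # R hat-theta - r
W = T * diff @ np.linalg.inv(R @ F_inv @ R.T) @ diff
# W = 4 * (0.5 - 0.3)^2 = 0.16, well below the chi2(1) 5% critical value of
# about 3.84, so this toy sample does not reject H0.
```

With more than one restriction, `R`, `r`, and the inverse-information block simply grow to the matching matrix and vector dimensions; the formula is unchanged.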
Now, under the null where the restrictions $R\theta_0 - r = 0$ are valid, imposing the restrictions should change the likelihood little. This implies that the Lagrange multiplier $\tilde\lambda$ should be very small, and thus
$$d_\theta(\tilde\theta) \approx 0 \quad \text{if } \tilde\lambda \approx 0.$$
Thus, testing whether $\tilde\lambda = 0$ is equivalent to testing whether $d_\theta(\tilde\theta) = 0$. But testing whether $\tilde\lambda = 0$ is really testing whether the restriction imposed on the estimation is correct. We can employ $d_\theta(\tilde\theta) = 0$ to build a test statistic. So the LM test statistic takes the form
$$LM \equiv T\, d_\theta(\tilde\theta)'\, F_{\theta\theta,T}^{-1}(\tilde\theta)\, d_\theta(\tilde\theta) \xrightarrow{\;d\;} \chi^2(\dim(r)).$$
Again, under $H_0$ the test statistic converges in distribution to a $\chi^2$ distribution, as does the Wald test, with degrees of freedom $\dim(r)$.

A few observations on the test statistic. First, it is natural to ask why $F_{\theta\theta,T}$ is used as a scaling factor. To answer this, simply note that
$$V\!\left(\frac{\partial\ell_t}{\partial\theta}\right) = E\!\left(\frac{\partial\ell_t}{\partial\theta}\frac{\partial\ell_t}{\partial\theta'}\right) = F_{\theta\theta,T}.$$
Since $\theta$ is unknown, again in practice replacing $\theta$ by $\tilde\theta$ does the job. Second, why is there a $T$ in front of the LM test? This is because we work with the average score in the maximization problem, i.e. $\frac{\partial L}{\partial\theta} = \frac{1}{T}\sum_t \frac{\partial\ln f_t}{\partial\theta}$. Furthermore, because the observations are independent, it is easy to obtain the result that
$$T^{1/2}\,\frac{\partial L}{\partial\theta} \xrightarrow{\;d\;} N(0, F_{\theta\theta}).$$
Collecting these arguments, the asymptotic $\chi^2$ distribution of the LM test statistic is well expected.

7.3 LR test

The LR test is another intuitive test statistic. It involves information from both restricted and unrestricted ML estimation. Before spelling out the test statistic, first note that under the null the ratio of the restricted likelihood to the unrestricted likelihood should be close to 1. In notation,
$$\lambda = \frac{L^*(\tilde\theta)}{L^*(\hat\theta)},$$
where $\tilde\theta$ and $\hat\theta$ are, respectively, the restricted and unrestricted estimates. The reason this ratio is close to 1 under the null is that the restricted and unrestricted estimates are of similar magnitude under the null; we have discussed this notion previously. Therefore, under $H_0$, $\tilde\theta \approx \hat\theta$, and thus $\log\lambda \approx 0$. But to do inference, we need to know the distribution of the LR test statistic.
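The LM construction is simplest when the null pins the parameter down completely. The following toy Poisson example is an assumption for illustration (not from the notes): under $H_0: \lambda = \lambda_0$ the restricted estimate is $\tilde\lambda = \lambda_0$ itself, the average score is $d(\lambda_0) = \bar x/\lambda_0 - 1$, and the information is $F(\lambda_0) = 1/\lambda_0$, so $LM = T\,d\,F^{-1}\,d = T(\bar x - \lambda_0)^2/\lambda_0$.

```python
# Hypothetical LM test: Poisson data, H0: lambda = lambda0 (a full restriction,
# so tilde lambda = lambda0). Average log likelihood L(lam) = xbar*log(lam) - lam,
# score d(lam) = xbar/lam - 1, Fisher information F(lam) = 1/lam.
data = [2, 4, 3, 5, 1, 3]            # toy counts
T = len(data)
xbar = sum(data) / T                 # 3.0
lam0 = 2.0                           # hypothesized value under H0

d = xbar / lam0 - 1.0                # score evaluated at the restricted estimate
F_inv = lam0                         # F(lam0)^{-1}
LM = T * d * F_inv * d               # = T*(xbar - lam0)^2 / lam0 = 3.0
# LM = 3.0 < 3.84, so H0 is not rejected at the 5% level in this toy sample.
```

Note that only the restricted estimate was needed; no unrestricted maximization was performed, which is the computational appeal of the LM test.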
Fortunately, such a result exists: under $H_0$,
$$-2\log\lambda = 2\left[\log L^*(\hat\theta) - \log L^*(\tilde\theta)\right] = 2T\left[L(\hat\theta) - L(\tilde\theta)\right] \xrightarrow{\;d\;} \chi^2(\dim(r)).$$
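The LR statistic only needs the average log likelihood evaluated at the restricted and unrestricted estimates. Continuing the same hypothetical Poisson toy problem (an assumed example, not from the notes), the constant terms of the log likelihood cancel in the difference $L(\hat\theta) - L(\tilde\theta)$:

```python
import math

# Hypothetical LR test: Poisson data, H0: lambda = lambda0.
# Average log likelihood up to a constant: L(lam) = xbar*log(lam) - lam;
# the constant (involving log(x_t!)) cancels in L(hat) - L(tilde).
data = [2, 4, 3, 5, 1, 3]            # same toy counts as before
T = len(data)
xbar = sum(data) / T                 # unrestricted MLE: hat lambda = xbar = 3.0
lam0 = 2.0                           # restricted estimate: tilde lambda = lambda0

def L(lam):
    return xbar * math.log(lam) - lam

LR = 2 * T * (L(xbar) - L(lam0))     # -2 log(lambda-ratio), ~ chi2(1) under H0
# LR is about 2.60 here, again below the chi2(1) 5% critical value 3.84.
```

Both the restricted and the unrestricted maximizations entered the computation, which is why the LR test typically costs more than the Wald or LM tests.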
The LR test probably involves more computational cost than the other tests, because to compute the test statistic both the unrestricted and the restricted ML estimation need to be performed first. Asymptotically, the three tests, Wald, LM, and LR, are equivalent. But for a given data set, these tests differ from each other in small samples. Specifically, we can obtain the ordering
$$LM < LR < Wald.$$
That is, the LM test is the most conservative, and the Wald test the most liberal. A figure illustrating the difference among the three tests is given below.

[Figure 2: Three Asymptotically Equivalent Tests. The figure plots $L(\theta)$ and $dL(\theta)/d\theta$ against $\theta$: the LR test measures the vertical gap $L(\hat\theta) - L(\tilde\theta)$, the Wald test the gap in $\phi(\theta) = R\theta - r$ between $\hat\theta$ and $\tilde\theta$, and the LM test the slope $dL(\theta)/d\theta$ at $\tilde\theta$.]

8 Non-linear restrictions

We now switch attention to testing nonlinear restrictions. An example of a nonlinear restriction could be
$$\beta_1\beta_2 - \beta_3 = 0,$$
in contrast to the linear restriction $\beta_1 + \beta_2 = 1$ before. While a nonlinear restriction may appear quite difficult to deal with, testing for such restrictions turns out to be similar
to how we proceed in the linear case. In general, we are interested in testing
$$H_0: \phi(\theta_0) = 0,$$
where $\phi(\cdot)$ is a known function. Though nonlinear, $\phi$ can be linearized by Taylor expansion. Suppose $\hat\theta$ is the unrestricted ML estimate. Taking a Taylor expansion of $\phi(\hat\theta)$ about $\theta_0$,
$$\phi(\hat\theta) = \phi(\theta_0) + \frac{\partial\phi}{\partial\theta'}(\theta_0)\,(\hat\theta - \theta_0) + \frac{1}{2}\frac{\partial^2\phi}{\partial\theta^2}(\theta_0)\,(\hat\theta - \theta_0)^2 + \cdots.$$
A bit of calculation leads to
$$T^{1/2}\left(\phi(\hat\theta) - \phi(\theta_0)\right) = \frac{\partial\phi}{\partial\theta'}(\theta_0)\,T^{1/2}(\hat\theta - \theta_0) + \frac{1}{2}\frac{\partial^2\phi}{\partial\theta^2}(\theta_0)\left[T^{1/2}(\hat\theta - \theta_0)^2\right] + \cdots,$$
where $\phi(\theta_0) = 0$ under the null. Note that the second term on the RHS is asymptotically negligible, because $T^{1/2}(\hat\theta - \theta_0) \xrightarrow{d}$ a normal distribution while $(\hat\theta - \theta_0) \xrightarrow{p} 0$; any higher-order terms are asymptotically negligible by the same token. Therefore, asymptotically (in large samples) under $H_0$,
$$T^{1/2}\phi(\hat\theta) \approx \frac{\partial\phi}{\partial\theta'}(\theta_0)\,T^{1/2}(\hat\theta - \theta_0) \equiv R\,T^{1/2}(\hat\theta - \theta_0),$$
where $R = \frac{\partial\phi}{\partial\theta'}(\theta_0)$ is the first-derivative matrix, whose elements are constants. The statistic now has the same expression as in the linear case, except that $R$ is now the matrix of first derivatives with respect to the parameters, evaluated at $\theta_0$. Naturally, the Wald test is computed as
$$W = T^{1/2}\phi(\hat\theta)'\,\mathrm{var}(\phi(\theta_0))^{-1}\,T^{1/2}\phi(\hat\theta) \approx T\,\phi(\hat\theta)'\,(\hat R\, F_{\theta\theta,T}^{-1}(\hat\theta)\,\hat R')^{-1}\,\phi(\hat\theta),$$
where the second expression is obtained by replacing the unknown $\theta_0$ with the consistent estimate $\hat\theta$; correspondingly, $\hat R = \frac{\partial\phi}{\partial\theta'}(\hat\theta)$.

The preceding discussion concentrates on the Wald test. How, then, do we calculate the LM and LR tests under nonlinear restrictions? Because these two tests do not use the difference between the true parameters and their estimated counterparts to construct the statistic, calculating both tests remains the same, except that the restrictions of concern are nonlinear.
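The nonlinear Wald statistic can be sketched with a minimal scalar example. The setup below is assumed for illustration (not from the notes): $\theta$ is the mean of $N(\theta,1)$ data, the nonlinear restriction is $\phi(\theta) = \theta^2 - 1 = 0$, so $\hat R = \partial\phi/\partial\theta\,(\hat\theta) = 2\hat\theta$ and $F_{\theta\theta} = 1$.

```python
# Hypothetical delta-method Wald test: y_t ~ N(theta, 1),
# H0: phi(theta0) = theta0^2 - 1 = 0. Here hat theta = sample mean,
# F_thetatheta = 1, and hat R = dphi/dtheta at hat theta = 2 * hat theta.
data = [1.0, 1.4, 1.1, 1.3, 1.2]     # toy sample; its mean is 1.2
T = len(data)
theta_hat = sum(data) / T

phi = theta_hat**2 - 1.0             # phi(hat theta) = 0.44
R_hat = 2.0 * theta_hat              # linearization of phi at hat theta
F_inv = 1.0                          # F^{-1} for the N(theta, 1) mean
W = T * phi * (1.0 / (R_hat * F_inv * R_hat)) * phi
# W = 5 * 0.44^2 / (4 * 1.44) ≈ 0.168, so H0 is not rejected at the 5% level.
```

The only change from the linear case is that the constant matrix $R$ is replaced by the derivative of $\phi$ evaluated at $\hat\theta$; everything else in the formula is identical.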