EM Algorithms for Ordered Probit Models with Endogenous Regressors

Hiroyuki Kawakatsu, Business School, Dublin City University, Dublin 9, Ireland
Ann G. Largey, Business School, Dublin City University, Dublin 9, Ireland

June 27, 2007

Abstract

We propose an EM algorithm to estimate ordered probit models with endogenous regressors. The proposed algorithm has a number of computational advantages in comparison to direct numerical maximization of the (limited information) log-likelihood function. First, the sequence of conditional M-steps can all be computed analytically, mostly via least squares. Second, the EM algorithm is parameterized so that each updating step naturally satisfies certain restrictions such as positive definiteness of the covariance matrix and monotonicity of cutpoints. Third, to address the potentially slow convergence of the EM algorithm, we propose the parameter expansion (PX-EM) algorithm that can accelerate convergence. Given a numerically stable algorithm to obtain maximum likelihood estimates, a number of classical tests become available for specification testing. We discuss tests for the presence of endogeneity and for instrument exogeneity (overidentification test). We conduct Monte Carlo simulations to examine the finite sample performance of the proposed estimator and provide an empirical application using survey data on social interactions.

JEL classification: C13, C35.
Keywords: ordered probit, endogeneity, EM algorithm.

1 Introduction

This paper considers limited information maximum likelihood (LIML) estimation of ordered probit models with continuous endogenous regressors. Despite its asymptotic efficiency, LIML does not appear to be the estimator of choice for models in this class, perhaps because it suffers from a number of computational disadvantages, especially in large models. As a consequence, the LIML estimator has generally been avoided in favor of less efficient but computationally simpler estimation methods (Rivers and Vuong 1988, p. 351). Rivers and Vuong (1988) proposed a simple two-step estimator that requires only least squares and estimation of a standard (non-endogenous) ordered probit model. Newey (1987) proposed an asymptotically efficient two-step minimum chi-square estimator that applies generally to limited dependent variable models with continuous endogenous regressors.

Our main contribution is to propose a numerically stable LIML estimator based on the EM algorithm for this class of models. In Section 2 we formulate the model and discuss numerical difficulties that can arise with LIML estimation that directly maximizes the log-likelihood function. Section 3 describes the proposed EM algorithm. The algorithm has the following features. First, the algorithm breaks up the M-step into a sequence of conditional M-steps (Meng and Rubin 1993). The resulting ECM algorithm does not require numerical maximization and the M-steps can all be computed analytically, mostly by least squares. Second, we employ a parameterization such that certain parameter restrictions are naturally satisfied when the estimates are updated in the M-steps. These restrictions include positive definiteness of the covariance matrix and monotonicity of the cutpoint parameters. Third, as shown by Meng and Rubin (1993), each step of the ECM algorithm monotonically increases the likelihood function.

In Section 3, we also address two well-known drawbacks of the EM algorithm. First, the algorithm does not produce an estimate of the parameter covariance matrix, which is needed for statistical inference. For inference purposes, we suggest using the observed information matrix to estimate the parameter covariance matrix (Jamshidian and Jennrich 2000). As the analytic second derivatives of the log-likelihood are rather complicated, the observed

information can be estimated by numerically differentiating the analytic scores. Moreover, the so-called robust covariance matrix that allows for clustered dependence can be obtained in the usual manner. Second, another well-known problem of the EM algorithm is its slow convergence. To address this problem, we propose a simple modified parameter expansion (PX-ECM) algorithm that is known to accelerate convergence of the EM algorithm in other applications (Liu, Rubin and Wu 1998).

In addition to its asymptotic efficiency (for a correctly specified model), the main advantage of the LIML estimator over two-step type estimators is that statistical inference can be conducted using any of the standard trinity of classical tests. Section 4 discusses some hypotheses of interest for the endogenous ordered probit model. Rivers and Vuong (1988) proposed tests for the presence of endogeneity using the two-step estimator. For the LIML estimator, the test of the presence of endogeneity is a simple exclusion test on the covariance parameters. Lee (1992) proposed testing over-identification (instrument exogeneity) using Newey's (1987) asymptotically efficient two-step estimator. For the LIML estimator, we show how to implement the LR test of Anderson and Rubin (1949) for testing over-identifying restrictions.

Section 5 reports results from Monte Carlo simulations that examine the finite sample performance of the proposed ECM algorithms. The Monte Carlo design varies the degree of endogeneity, the degree of over-identification, and the quality of instruments to examine how they affect the finite sample properties of the algorithm and test statistics. In particular, we find that the PX-ECM algorithm can substantially accelerate convergence of the ECM algorithm. In Section 6, we fit the model to examine the empirical relation between social interactions, which are measured as ordered outcomes, and population density using survey data. Although the models have more than sixty parameters to estimate, we did not encounter any difficulties in using the PX-ECM algorithm to fit the data. The empirical application examines how the partial effects of population density on social interactions vary nonlinearly as a function of age.

2 Ordered Probit with Endogenous Regressors

The following formulation of the ordered probit model with endogenous regressors follows that of Newey (1987) and Rivers and Vuong (1988). The model is

    y*_i = Y_i'β + X_{1i}'γ + u_i,   i = 1, ..., n                                  (1a)
    Y_i = Π'X_i + V_i = Π_1'X_{1i} + Π_2'X_{2i} + V_i                               (1b)
    (u_i, V_i')' ~ N(0, Σ),   Σ = [ σ_{11}  σ_{21}' ; σ_{21}  Σ_{22} ]
    y_i = j  if  α_{j−1} ≤ y*_i < α_j,  j = 1, ..., m,  α_0 = −∞, α_m = +∞          (1c)

where y_i is the observed scalar dependent variable with m ordered outcomes, y*_i is the continuous latent variable underlying y_i, Y_i is the r × 1 vector of endogenous regressors, X_{1i} is the s × 1 vector of included exogenous regressors, and X_{2i} is the (k − s) × 1 vector of excluded exogenous regressors. The error terms (u_i, V_i')' are assumed to have a joint Gaussian distribution with mean zero and covariance matrix Σ. Denote the (r + s + kr + (r + 1)(r + 2)/2 + m − 1) × 1 parameter vector to be estimated as θ = (β', γ', vec(Π)', vech(Σ)', α')'.

The limited information maximum likelihood estimator of θ maximizes the joint log-likelihood function

    l(θ) = Σ_i l(y_i, Y_i | X_i, θ) = Σ_i [ l(y_i | Y_i, X_i, θ) + l(Y_i | X_i, θ) ]

To obtain the conditional log-likelihood l(y_i | Y_i, X_i, θ), note that y*_i | Y_i, X_i ~ N(μ_i, σ_{1·2}) where

    μ_i = Y_i'β + X_{1i}'γ + V_i'Σ_{22}^{−1}σ_{21}                                   (2a)
    σ_{1·2} = σ_{11} − σ_{21}'Σ_{22}^{−1}σ_{21}                                      (2b)

and

    p(y_i | Y_i, X_i, θ) = Φ( (α_{y_i} − μ_i)/√σ_{1·2} ) − Φ( (α_{y_i−1} − μ_i)/√σ_{1·2} )

where Φ(·) is the cumulative distribution function of the standard normal distribution. The marginal log-likelihood l(Y_i | X_i, θ) is the log multivariate normal density

    l(Y_i | X_i, θ) = −(1/2) ( r log(2π) + log|Σ_{22}| + V_i'Σ_{22}^{−1}V_i )         (3)

The limited information maximum likelihood estimates can, in principle, be obtained by directly maximizing the log-likelihood function l(θ). The maximization, however, must be done numerically and is the source of a number of computational difficulties with LIML estimation. First, we need to ensure positive definiteness of the estimated covariance matrix Σ. A common reparametrization to impose this constraint is to estimate its lower triangular Cholesky factor L with positive diagonals such that LL' = Σ.¹ However, the identifying normalization discussed below may be difficult to impose under this reparameterization. Second, we need to ensure monotonicity of the estimated cutoff points α_1 < ... < α_{m−1}. One way to do this is to use the reparametrization α_1 = δ_1, α_j = α_{j−1} + δ_j where δ_j = exp(d_j) for j = 2, ..., m−1, and estimate the unconstrained δ_1, d_2, ..., d_{m−1} instead of the α_j.

For identification purposes, we need to impose certain restrictions on the parameters of the model. One commonly used normalization is to set σ_{11} = 1.² Rivers and Vuong (1988) suggest an alternative normalization σ_{1·2} = 1, which simplifies the computation of the conditional log-likelihood l(y_i | Y_i, X_i, θ). We note that one difference between the two normalizations is that numerical equivalence of the two-step estimates discussed below and the LIML estimates holds under the latter normalization for just-identified models. To identify all cutoff parameters α_1, ..., α_{m−1}, the constant term must be excluded from the regressor list X_{1i}. Alternatively, we can arbitrarily fix a cutoff parameter and instead estimate the constant term as part of γ.
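The monotone cutpoint reparametrization just described is easy to sketch in code; the following is a minimal illustration (the function name is ours), with α_1 < ... < α_{m−1} holding by construction for any real d_j:

```python
import numpy as np

def cutpoints(delta1, d):
    """Map unconstrained (delta_1, d_2, ..., d_{m-1}) to monotone cutpoints:
    alpha_1 = delta_1 and alpha_j = alpha_{j-1} + exp(d_j) for j >= 2."""
    return delta1 + np.concatenate(([0.0], np.cumsum(np.exp(np.asarray(d)))))

# alpha = (0, 1, 3) recovered from delta_1 = 0 and d = (log 1, log 2)
alpha = cutpoints(0.0, [np.log(1.0), np.log(2.0)])
```

The unconstrained δ_1, d_2, ..., d_{m−1} can then be passed to any numerical optimizer without explicit inequality constraints.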
For the rest of this paper, we use the normalization σ_{1·2} = 1, set α_1 = 0, and include a constant term in X_{1i}. When m = 2, this model reduces to the binary probit model with endogenous regressors as commonly estimated.

¹ For alternative parameterizations of the covariance matrix, see Pinheiro and Bates (1996).
² For this normalization we can set the (1, 1) element of the Cholesky factor L of Σ to one.

3 EM Algorithm

As shown above, the log-likelihood function for limited dependent variable models with endogenous regressors is relatively straightforward to write down. However, perhaps due to the computational difficulties of the LIML estimator, a number of computationally simpler alternative estimators have been proposed and used to date. These two-step estimators typically require only least squares and a standard limited dependent variable estimator, but are generally less efficient than LIML (Rivers and Vuong 1988). The exception is Newey (1987), who develops an asymptotically efficient two-step estimator for a general class of limited dependent variable models with endogenous regressors.

To address the computational difficulties with the LIML estimator, we develop an EM algorithm as a computational device to obtain LIML estimates. The proposed EM algorithm has the following computational features. First, as explained below, the sequence of conditional M-steps can all be computed analytically. In particular, most parameter estimates can be updated via a least squares regression. Second, and related to the analytical tractability just described, we use the parameterization θ = (β', γ', vec(Π)', λ', vech(Σ_{22})', δ')' where λ = Σ_{22}^{−1}σ_{21} is r × 1 and δ is (m − 2) × 1 with typical element δ_j = α_j − α_{j−1} > 0 if m > 2.³ Third, the covariance matrix parameters Σ_{22} are updated in such a way that positive definiteness of Σ_{22} is guaranteed. Therefore, there is no need to rely on parameter transforms to impose the positive definiteness constraint. This is the case regardless of the normalization used.⁴ Similarly, the cutoff parameters δ_j are updated in a way that preserves their ordering, and there is no need to rely on parameter transforms.

³ The λ parametrization was also used in Smith and Blundell (1986) and Rivers and Vuong (1988).
⁴ For this reason, the algorithm in the Appendix is described using σ_{11} and σ_{1·2} without imposing a specific normalization. The unnormalized form is also used for the parameter expansion algorithm discussed below.
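To fix ideas before turning to the details of the algorithm, the following sketch simulates data from model (1) with r = 1 endogenous regressor. All names and parameter values are illustrative; we write u_i = λV_i + e_i with e_i ~ N(0, 1), so that Var(u_i | V_i) = 1 (the σ_{1·2} = 1 normalization) and λ = 0 corresponds to no endogeneity:

```python
import numpy as np

def simulate(n, beta, gamma, pi, lam, alphas, seed=0):
    """Simulate (1) with one endogenous regressor and Var(u | V) = 1."""
    rng = np.random.default_rng(seed)
    # exogenous regressors: constant, X_2, and excluded instruments
    X = np.column_stack([np.ones(n), rng.standard_normal((n, len(pi) - 1))])
    V = rng.standard_normal(n)                # reduced-form error (sigma_22 = 1)
    Y = X @ pi + V                            # endogenous regressor, equation (1b)
    u = lam * V + rng.standard_normal(n)      # structural error, sigma_{1.2} = 1
    ystar = beta * Y + X[:, :2] @ gamma + u   # latent variable, equation (1a)
    y = np.digitize(ystar, alphas) + 1        # observed outcome in {1, ..., m}, (1c)
    return y, Y, X

# constant and X_2 included exogenous; one excluded instrument; m = 3 outcomes
y, Y, X = simulate(1000, beta=1.0, gamma=np.array([0.0, 1.0]),
                   pi=np.array([0.0, 1.0, 1.0]), lam=0.5,
                   alphas=np.array([0.0, 2.0]))
```

Regressing y on Y and the included exogenous regressors while ignoring V would then exhibit the endogeneity bias whenever λ ≠ 0.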

The details of the proposed EM algorithm are given in the Appendix. Below we discuss some issues that are specific to the model under consideration.

3.1 E-Step

The E-step requires computation of the expected complete data log-likelihood q(θ) ≡ E[l(y*) | y] where y* = (y*_1, ..., y*_n) and y = (y_1, ..., y_n). As pointed out by Ruud (1991), the main difficulty is that the cutoff parameters α in the ordered dependent variable model (1) are not identified in the EM algorithm, as the latent variable y* in (1) depends on the parameter α. To remove this parameter dependency of y*, we follow Ruud (1991) and use the reparametrized latent variable model (7) given in the Appendix.

As the resulting complete data log-likelihood belongs to the exponential family, the E-step merely requires updating the complete data sufficient statistics evaluated at the current parameter estimates θ^(t). For the model under consideration, the sufficient statistics are the first two conditional moments from the truncated Gaussian and are given in (8).

3.2 M-Step

The first order conditions ∂q(θ)/∂θ = 0 for the M-step do not appear to have a closed form solution. The proposed algorithm breaks up the M-step into a sequence of conditional M-steps where we maximize over a subset of the parameters conditional on the remaining parameter values (Meng and Rubin 1993). More specifically, we update the current parameter estimates θ^(t) to θ^(t+1) in the following order.

1. Given θ^(t), obtain β^(t+1) via a least squares regression of ỹ_i − X_{1i}'γ^(t) − V_i'λ^(t) on Y_i, where ỹ_i as defined in (9) is evaluated at θ^(t).

2. Given (β^(t+1), γ^(t), vec(Π)^(t), λ^(t), vech(Σ_{22})^(t), δ^(t)), obtain γ^(t+1) via a least squares regression of ỹ_i − Y_i'β^(t+1) − V_i'λ^(t) on X_{1i}, with the same ỹ_i as in step 1.

3. Given (β^(t+1), γ^(t+1), vec(Π)^(t), λ^(t), vech(Σ_{22})^(t), δ^(t)), obtain Π^(t+1) via solving the system of linear equations given in (10).

4. Given (β^(t+1), γ^(t+1), vec(Π)^(t+1), λ^(t), vech(Σ_{22})^(t), δ^(t)), obtain λ^(t+1) via a least squares regression of ỹ_i − Y_i'β^(t+1) − X_{1i}'γ^(t+1) on V_i = Y_i − Π^(t+1)'X_i, with the same ỹ_i as in step 1.

5. Update Σ_{22}^(t+1) = Σ_i V_i V_i'/n with V_i from step 4. Note that this update ensures positive definiteness of Σ_{22}^(t+1).

6. If m > 2, update δ_j in the order j = 2, ..., m−1 via solving the quadratic equation (12), where the coefficients of the quadratic equation are evaluated at (β^(t+1), γ^(t+1), vec(Π)^(t+1), λ^(t+1), vech(Σ_{22})^(t+1), δ_2^(t+1), ..., δ_{j−1}^(t+1), δ_j^(t), ..., δ_{m−1}^(t)).

As shown by Meng and Rubin (1993), each step of the ECM algorithm monotonically increases the likelihood function.

3.3 Parameter Covariance

One of the main drawbacks of the EM algorithm is that it does not produce an estimate of the parameter covariance matrix, which is needed for statistical inference (Jamshidian and Jennrich 2000). The SE(C)M algorithm (Meng and Rubin 1991, van Dyk, Meng and Rubin 1995) is a well-known method to obtain the parameter covariance by running supplementary EM iterations once the parameter estimates have been obtained. The SE(C)M algorithm, however, requires evaluation of the second derivatives of the complete data log-likelihood and may be numerically unstable (Jamshidian and Jennrich 2000). For this reason, we follow the recommendation of Jamshidian and Jennrich (2000) and estimate the parameter covariance matrix by the inverse of the observed information matrix −∂²l(θ)/∂θ∂θ' evaluated at the EM estimates θ̂. As the analytic second derivatives are rather complicated, the Hessian matrix can be evaluated by numerical second differences of l(θ) or by numerical first differences of the scores ∂l(θ)/∂θ. The expressions for the analytic scores for the normalization σ_{1·2} = 1 are given in Appendix A.3.⁵

Alternatively, a robust covariance matrix that allows for within group (clustered) dependence can be used (Wooldridge 2001, Chapter 13). Let c_i be a variable that identifies the

⁵ The expressions for the analytic scores are slightly more complicated for the normalization σ_{11} = 1, as σ_{1·2} = 1 − λ'Σ_{22}λ depends on the parameters λ and Σ_{22}.

C independent groups or clusters. The robust covariance matrix is then obtained as

    ( ∂²l(θ)/∂θ∂θ' )^{−1} ( Σ_{j=1}^C g_j g_j' ) ( ∂²l(θ)/∂θ∂θ' )^{−1}

where g_j = Σ_{c_i=j} ∂l_i/∂θ is the sum of contributions to the scores from observations belonging to the j-th cluster.

3.4 Parameter Expansion

Another well-known drawback of the EM algorithm is its slow convergence. A solution proposed by Liu et al. (1998) is to expand the complete-data model with a larger parameter space without altering the original observed-data model. The key to implementing this parameter expansion (PX-EM) algorithm is to find a suitable expansion of the parameter space that accelerates convergence. The probit example (without endogeneity) considered in Liu et al. (1998) suggests a PX-ECM algorithm for the ordered probit model (1) in which the normalized (conditional) variance parameter σ_{1·2} is activated and included in the parameter vector to be estimated.

To describe the proposed PX-ECM algorithm, denote the expanded parameter vector as θ* = (β', γ', vec(Π)', λ', vech(Σ_{22})', σ_{1·2}, δ')' where the original parameter vector θ is expanded by one additional parameter σ_{1·2}. The E-step of PX-ECM is unchanged from the ECM algorithm, except that the conditional moments (8) are evaluated with σ_{1·2} from the current iteration value of θ*. The M-step of PX-ECM is a simple modification of the ECM algorithm. Steps 1, 2, 4, 5 are unchanged. For step 3, equation (10) is solved using σ_{1·2} = σ_{1·2}^(t). Between steps 5 and 6, we update the additional parameter to σ_{1·2}^(t+1) using equation (11). Note that this update ensures σ_{1·2}^(t+1) > 0. Then in step 6, we solve the quadratic equation (12) where σ_{1·2} is evaluated at σ_{1·2}^(t+1).

Finally, once the PX-ECM algorithm has converged, we rescale the expanded parameter

vector θ* to obtain the original identified parameter vector θ as

    θ = ( β'/√σ_{1·2}, γ'/√σ_{1·2}, vec(Π)', λ'/√σ_{1·2}, vech(Σ_{22})', δ'/√σ_{1·2} )'

Note that, as shown by Liu et al. (1998), monotonic convergence of the ECM algorithm is preserved for the PX-ECM algorithm. We examine the acceleration achieved using this PX-ECM algorithm in our Monte Carlo simulations below.

4 Inference

4.1 Testing Normality

As likelihood inference for nonlinear models depends on distributional assumptions, it is important to have easy-to-apply diagnostic checks. For the ordered probit model (1), the (r + 1) × 1 vector of error terms (u_i, V_i')' is assumed to be independently multivariate normally distributed. While there is a variety of tests available for testing multivariate normality, the difficulty here is that the residuals û_i are not observable due to the latent variable y*_i. Smith (1987) proposed a conditional moment test based on the departure of the third and fourth moments of the residuals from normality, while Butler and Chatterjee (1997) proposed a joint test for exogeneity and normality based on the GMM criterion. Although the test by Smith (1987) can be directly applied to the model under consideration, it is relatively cumbersome to implement as it requires evaluation of conditional expectations of all third and fourth moments of the vector error term.

We propose a much simpler informal test of normality as a diagnostic check. Since the marginal and conditional distributions of a multivariate normal distribution are also normal, the idea is to test normality separately for u_i | V_i and V_i. Because (1b) is essentially a linear regression, the usual tests for residual normality can be applied to V_i. For r = 1, one can use the Bera-Jarque test of normality or the normal quantile plot to detect departures from normality. For r > 1, V_i'Σ_{22}^{−1}V_i should be independent χ² with r degrees of freedom. A

simple visual diagnostic is then to compare the empirical quantiles of V_i'Σ_{22}^{−1}V_i with those from χ²(r) by a quantile-quantile plot.

To check the normality of u_i conditional on V_i, we apply the conditional moment test of Chesher and Irish (1987) to test y*_i | V_i ~ N(μ_i, σ_{1·2}). The test of Chesher and Irish (1987) is essentially a special case of Smith (1987) for a single equation limited dependent variable model (without simultaneity). Our proposal is to apply the much simpler test of Chesher and Irish (1987) assuming V_i is an observed regressor in (2a). The details of the test, which requires computation of the first four moment residuals (Chesher and Irish 1987), are provided in Appendix A.4.

4.2 Testing Endogeneity

Once we have a numerically stable method to obtain maximum likelihood estimates, the classical trinity of tests (Wald, likelihood ratio, Lagrange multiplier) is available for inference purposes. A hypothesis of particular interest for the model under consideration is the presence of endogeneity. Under the null hypothesis of no endogeneity, σ_{21} = 0, (1a) can be efficiently estimated by a standard ordered probit estimation routine. Rivers and Vuong (1988) proposed tests for endogeneity based on two-step estimators. Based on our ECM or PX-ECM algorithm, we can test the hypothesis λ = Σ_{22}^{−1}σ_{21} = 0 using any of the classical trinity of tests, as the null is a restriction on one of the estimated parameters. Alternatively, one could also test the hypothesis σ_{21} = Σ_{22}λ = 0 using the delta method, as this restriction is a nonlinear function of the estimated parameters Σ_{22} and λ.

4.3 Testing Instrument Exogeneity

When estimating models with endogeneity, whether one has a valid instrumental variable is always a practical concern. For the minimum distance based two-step estimator of Newey (1987), Lee (1992) developed an over-identification test for instrument exogeneity when k > r + s.
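Given LIML estimates λ̂ and an estimate of their covariance, the exclusion test of Section 4.2 reduces to a standard Wald statistic, asymptotically χ²(r) under the null; a minimal sketch with purely illustrative numbers:

```python
import numpy as np
from scipy.stats import chi2

def wald_lambda(lam_hat, cov_lam):
    """Wald test of H0: lambda = 0; returns (statistic, p-value)."""
    lam_hat = np.atleast_1d(np.asarray(lam_hat, dtype=float))
    cov_lam = np.atleast_2d(np.asarray(cov_lam, dtype=float))
    w = float(lam_hat @ np.linalg.solve(cov_lam, lam_hat))
    return w, chi2.sf(w, df=lam_hat.size)

# r = 1: lambda_hat = 0.3 with standard error 0.1 gives W = (0.3/0.1)^2 = 9
w, p = wald_lambda(0.3, 0.1**2)
```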
Here, we develop a likelihood ratio test of over-identification based on the LIML estimator.
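Mechanically, such a likelihood ratio test requires only the constrained and unconstrained maximized log-likelihoods and the number of over-identifying restrictions, k − s − r; a minimal sketch (the log-likelihood values below are purely illustrative):

```python
from scipy.stats import chi2

def overid_lr(l0, l1, k, s, r):
    """LR test of the over-identifying restrictions: 2(l1 - l0) is
    asymptotically chi2(k - s - r) under the null, where l0 (l1) is the
    constrained (unconstrained) maximized log-likelihood."""
    lr = 2.0 * (l1 - l0)
    return lr, chi2.sf(lr, df=k - s - r)

# k = 5, s = 2, r = 1 -> 2 over-identifying restrictions
lr, p = overid_lr(l0=-1520.3, l1=-1518.9, k=5, s=2, r=1)
```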

Denote the reduced form of (1) as

    y*_i = c_1'X_{1i} + c_2'X_{2i} + ε_i                                            (4a)
    Y_i = Π_1'X_{1i} + Π_2'X_{2i} + V_i                                             (4b)

where c_1, c_2, Π_1, Π_2 are parameters. Subtracting β times (4b) from (4a), we have

    y*_i = Y_i'β + X_{1i}'(c_1 − Π_1β) + X_{2i}'(c_2 − Π_2β) + ε_i − V_i'β

Comparing this with the structural form (1a), we can formulate the limited information maximum likelihood estimator as the constrained problem (Anderson and Rubin 1949)

    l_0 = max_{c_1, c_2, Π_1, Π_2, β, Σ} Σ_i l(y_i, Y_i | X_{1i}, X_{2i})  subject to  c_2 − Π_2β = 0        (5)

The test of instrument exogeneity is the test of the (over-identifying) constraints c_2 − Π_2β = 0. The likelihood ratio approach to testing these constraints compares l_0 to the unconstrained problem

    l_1 = max_{c_1, c_2, Π_1, Π_2, Σ} Σ_i l(y_i, Y_i | X_{1i}, X_{2i})                                       (6)

The likelihood ratio test statistic 2(l_1 − l_0) is asymptotically χ² with k − s − r degrees of freedom under the null of c_2 − Π_2β = 0. To compute the test statistic, one needs to estimate the unconstrained model (6) to obtain l_1. (l_0 can be obtained from the LIML estimates as discussed in the previous sections.) Fortunately, the unconstrained model (6) is relatively easy to estimate. By writing the joint log-likelihood as

    l(y_i, Y_i | X_{1i}, X_{2i}) = l(y_i | Y_i, X_{1i}, X_{2i}) + l(Y_i | X_{1i}, X_{2i})

we see that this model has the same structure as that discussed in Section 2, except that the endogenous regressors Y_i in the y*_i equation are replaced by X_{2i}. From the analytic scores in Appendix A.3, the maximum likelihood estimates for the Y_i equation parameters can be

obtained by a least squares regression of Y_i on X_{1i}, X_{2i}, and l(Y_i | X_{1i}, X_{2i}) can be evaluated as in (3).⁶ The conditional likelihood l(y_i | Y_i, X_{1i}, X_{2i}) is an ordered probit model with conditional mean

    μ_i = c_1'X_{1i} + c_2'X_{2i} + V_i'λ

instead of (2a), where V_i are the residuals from the Y_i regression. The unconstrained log-likelihood l_1 can therefore be evaluated by running least squares and estimating a standard ordered probit model.

4.4 Partial Effects

In applied work, the estimated parameters are usually not of direct interest. More commonly, we are interested in how the various regressors in (1a) affect the probability of the ordered outcomes y. The partial effect is defined as ∂Pr(y = j)/∂Z_h for a continuous regressor Z_h, and as Pr(y = j | D = 1, Z = Z̄) − Pr(y = j | D = 0, Z = Z̄) for a dummy regressor D. These partial effects generally depend on both the parameter vector θ and the regressors. One common approach is to plug in some representative values, such as the sample means or the medians of the regressors. Alternatively, one can average the partial effects evaluated for each observation in the sample. The latter is an estimate of the expected partial effect over the population. If there are several dummy regressors, the averaging approach may be preferable to plugging in sample means of the dummy variables.

As the partial effects are nonlinear functions of the parameters, confidence intervals are usually computed via a first order asymptotic approximation (the delta method). Appendix A.5 provides expressions for the approximate covariance matrix of the partial effects based on the delta method. Here we suggest an alternative Monte Carlo method which does not rely on linearization and does not necessarily center the confidence intervals about the point estimates. We simply simulate draws of the parameters θ ~ N(θ̂, Cov(θ̂)) and evaluate the partial effects at the drawn θ. By repeating this procedure many times, we obtain an empirical distribution of the partial effects from which we can draw inferences. If a simulated draw θ does not satisfy certain parameter restrictions (such as monotonicity of cutpoints), we redraw

⁶ Intuitively, this follows from the SUR structure of the reduced form (4).

until the restrictions are satisfied.

5 Monte Carlo Simulations

In this section we examine the finite sample properties of the proposed ECM and PX-ECM algorithms via Monte Carlo simulations. For the data generating process, we consider the case with r = 1 endogenous regressor Y, s = 2 included exogenous regressors X_1 = 1, X_2, up to three instruments Z_1, ..., Z_3, and m = 5 ordered outcomes. The parameter values are

    y*_i = Y_i + 1 − X_{2,i} + u_i,   (α_1, α_2, α_3, α_4) = (0, 1, 3, 6)
    Y_i = X_{2,i} + Z_{1,i} − · · · − Z_{k−2,i} + V_i
    (u_i, V_i)' ~ N(0, Σ),   Σ = [ 1/(1−ρ²)   ρ√σ_{22}/√(1−ρ²) ; ρ√σ_{22}/√(1−ρ²)   σ_{22} ]

The 2 × 2 covariance matrix Σ is parameterized so that the normalization σ_{1·2} = 1 holds and Cor(u_i, V_i) = ρ. In our simulations, we vary the correlation parameter ρ ∈ {±0.8, ±0.6, ±0.4, ±0.2, 0} and the variance parameter σ_{22} ∈ {1/2, 1, 2}. By varying ρ, we examine the performance of the algorithm as the degree of endogeneity changes. In particular, there is no endogeneity when ρ = 0. By varying σ_{22}, we examine the performance of the algorithm as the quality of the instruments changes. The instruments become poor as σ_{22} increases. In the simulations we also vary the number of instruments Z so that k = 3, 4, 5, where k = 3 corresponds to the just-identified case.

For each combination (k, ρ, σ_{22}), we draw the exogenous regressors X_2, Z_1, ..., Z_{k−2} from the standard normal distribution, independently from each other. The draw is made once and data for y and Y are generated using the same fixed regressor values across the replications. The simulations are repeated 1000 times for a sample size of n = 1000 for each combination (k, ρ, σ_{22}).

As the simulation design fixes the cutpoint parameters α, the distribution of the ordered outcome y changes as we change (k, ρ, σ_{22}). Figure 1 displays the interquartile range of the frequency of the simulated ordered outcomes for the case σ_{22} = 1.⁷ As is clear from Figure 1,

⁷ The distribution is similar for the cases σ_{22} = 1/2, 2 and is not displayed to conserve space.

the simulated y is not evenly distributed over the m = 5 possible outcomes, y = 1 being the most frequently observed and y = 5 the least. This is a common characteristic of ordered outcomes in real data. Also note that the frequency distribution is not symmetric about ρ = 0.

5.1 EM Algorithm

To investigate the finite sample performance of the proposed estimator, Figure 2 displays the root mean squared errors (RMSE) of the parameter estimates for the case k = 4, σ_{22} = 1. The figure compares the estimates from the EM algorithm with those from the two-step Amemiya GLS (AGLS) estimator proposed by Newey (1987).⁸ The RMSE from the two estimators are quite similar for most parameters, except for the excluded exogenous regressors (instruments) in the reduced form equation. The similar performance of the two estimators for the parameters of the y equation reflects the asymptotic efficiency of the AGLS estimator. The inferior performance of the AGLS estimator for the parameters on the instruments is due to the fact that these estimates are essentially least squares estimates which ignore Cor(u, V) = ρ. As such, the performance worsens with the magnitude of ρ. We note that for the just-identified case (k = 3), the RMSE for the parameter on the single instrument is quite similar between the two estimators. The case k = 5 exhibits the same feature as the k = 4 case in that the RMSE of the parameters for the instruments are noticeably larger for AGLS than for PX-ECM.

As the parameter β on the endogenous regressor is often the parameter of interest in applied work, Figure 3 displays the RMSE of the estimates β̂ of this parameter from the PX-ECM algorithm as we vary the number of instruments k − 2 and the variance parameter σ_{22}. The top figure shows that, conditional on k, the RMSE of β̂ deteriorates only slightly as the quality of the instruments becomes poor (σ_{22} becomes large). The bottom figure shows that, conditional on σ_{22}, the RMSE of β̂ improves noticeably as we move from a just-identified model (k = 3) to an over-identified model (k = 4, 5). Moving from k = 4 to k = 5 slightly

⁸ Numerical optimization for the AGLS estimator failed in a few cases (either in the first or second stage ordered probit estimation); these replications were discarded when computing the performance measures based on AGLS. The ECM and PX-ECM algorithms both converged successfully (see footnote 9 for the convergence criterion used) in all cases to almost identical estimates.

improves the RMSE, but not as much as moving from k = 3 to k = 4. While these results may be specific to the chosen data generating process, they do illustrate the potential gains in finite sample precision from having an over-identified system rather than a just-identified system.

Figure 4 compares the computational cost of the ECM and PX-ECM algorithms. The figure displays the interquartile range of the distribution of iteration counts over the Monte Carlo simulations as we vary the pair (k, σ_{22}).⁹ We observe the following. First, conditional on k, convergence becomes slightly slower as the instruments become poor (σ_{22} becomes large). Second, conditional on σ_{22}, convergence becomes faster as we increase k, particularly for the PX-ECM algorithm. Third, the PX-ECM algorithm generally improves on the ECM algorithm, sometimes substantially so as we increase k. For k = 5, σ_{22} = 2, ρ = 0.8, the median iteration count was 177 for the ECM algorithm, while that for the PX-ECM was 79, a gain of about 55%.¹⁰

5.2 Inference

The finite sample performance of the endogeneity tests described in Section 4.2 is displayed in Figures 5 and 6. We compare the following five tests: the Wald test from the two-step Amemiya GLS, two Wald tests, the Lagrange multiplier (score) test, and the likelihood ratio test. All tests are based on likelihood-based estimates from the EM algorithm, except the first, which is based on the two-step estimates. The hypothesis under test is λ = σ_{21}/σ_{22} = 0, except for one of the Wald tests where we test σ_{21} = λσ_{22} = 0 based on the delta method. As can be seen from Figures 5 and 6, the finite sample performance of all tests is quite similar both in terms of size and power for all (k, σ_{22}) pairs considered. Figure 5 indicates that there is very little size distortion. If anything, the tests tend to be slightly undersized, except for the case (k, σ_{22}) = (5, 2) in which they appear to be slightly oversized. The tests tend to get slightly more oversized as the instruments become poor (with an increase in σ_{22}),

⁹ A completion of both the E-step and the M-step is counted as one iteration. The convergence criterion was satisfaction of either l(θ^(t)) − l(θ^(t−1)) < ε or max_i |θ_i^(t) − θ_i^(t−1)| < 10⁻⁸, where l(·) is the observed log-likelihood and ε is a small tolerance.
¹⁰ Although the PX-ECM has one additional conditional M-step to update σ_{1·2}, total computation time can still be substantially reduced due to the decrease in iteration counts.

particularly for the case k = 5. Figure 6 shows that for the alternatives considered, the empirical power is nearly one for all cases except ρ = ±0.2.11 For the case ρ = 0.2, we find that the power increases with the number of instruments k, holding constant their quality σ_22. For a given number of instruments k, the power increases as the quality of the instruments improves (with smaller σ_22). The main message we take from this is that for testing endogeneity, the quality and number of instruments appear more important than the choice of the test statistic for values of λ close to zero.

Figure 7 examines the empirical size of the instrument exogeneity (over-identification) test discussed in Section 4.3.12 The empirical size is the rejection frequency of the test statistic based on the asymptotic χ² distribution with k − 3 degrees of freedom. As seen in Figure 7, the test is generally well sized. If anything, the test has a slight tendency to be undersized. However, there are no systematic changes in the finite sample performance of the test as we vary ρ and σ_22.

6 Empirical Application

As an empirical application, we examine the relationship between measures of social interactions and population density. A detailed discussion of the economic issues and data is given in Brueckner and Largey (2006). Brueckner and Largey (2006) examined the relationship for measures of social interactions that are either binary (dummy variable) or continuous. In this application, we consider three measures of social interactions that are coded as ordered outcomes: neisoc (how often the respondent socializes with neighbors), confide (number of people the respondent can confide in), and friends (number of close friends). These three ordered outcomes were treated as continuous variables in Brueckner and Largey (2006). Figure 8 provides the ordered coding of these three variables and sample frequencies of each

11 The empirical power is not size adjusted; size adjustment makes little difference as the size distortion is small. The alternatives considered in terms of λ vary from 0.14 to 0.94 for σ_22 = 1/2, from 0.20 to 1.33 for σ_22 = 1, and from 0.29 to 1.89 for σ_22 = 2, where the lower bounds correspond to the case ρ = ±0.2.
12 The results in Figure 7 are based on the LR test using LIML estimates from the EM algorithm. We also examined the performance of the over-identification test using the two-step Amemiya GLS estimates. The results were nearly identical to those from the LR tests and are not reported.

outcome. We note that while the sample outcomes for neisoc and friends are relatively balanced, the outcomes for confide are concentrated in the highest order category. The covariates that correlate with these measures of social interactions are described in Table 1. These covariates are mostly dummy variables, except for age (and age squared), number of children, and the log of census tract population density. Population density is considered an endogenous regressor for reasons explained in Brueckner and Largey (2006). As in Brueckner and Largey (2006), the instruments are log average density for urbanized areas (den_ua) and metropolitan statistical areas (den_msa) containing the tract.

The LIML estimates are reported in Table 2 for neisoc and confide. (We do not report estimates for friends as it fails the normality test, as discussed below.) The parameter λ, which measures the extent of endogeneity, is statistically significant for neisoc but not for confide. The implied correlation is Cor(u_i, V_i) = for neisoc and Cor(u_i, V_i) = for confide. The correlations thus appear to be practically small. The coefficients on the two instruments den_ua and den_msa in the reduced form equation for den_tract are both statistically significant, indicating that the instrument relevance condition is satisfied. The over-identification LR test reported at the bottom of Table 2 is insignificant for both neisoc and confide, indicating that instrument exogeneity is satisfied.

The tests for normality of the residuals discussed in Section 4.1 are reported in Table 3. The tests indicate that the third and fourth moments of the residuals V_i from the reduced form equation (1b) for the endogenous regressor depart from those implied by the normal distribution. The conditional moment test for normality of the residuals u_i from the structural equation (1a) fails to reject the null for neisoc and confide but indicates departure from normality for friends.

The statistic of interest in Brueckner and Largey (2006) is the relation between social interactions and population density. As the coefficient on den_tract in Table 2 is difficult to interpret in our nonlinear model, we display the partial effects of den_tract on the two measures of social interactions in Figure 9. These partial effects are evaluated at the sample median values of the covariates, except that we vary age over the relevant range. We note several interesting features of these partial effects. First, as the estimated parameter on

den_tract is negative for both social interaction measures, the probability of higher interaction decreases with density. However, the effect is not monotone. For confide, for example, the probability of higher interaction first increases and then decreases with density. Second, the parameter on age is statistically significant for both social interaction measures. However, the effect of age on the partial effects of population density is quite different between the two social interaction measures. For confide, the partial effects hardly vary with age, suggesting that the age effect is not of practical importance for the effect of population density. For neisoc, there is a noticeable age effect on the effect of population density. In particular, while the effects of population density on social interactions for middle ranked outcomes increase with age, those for low and high ranked outcomes decrease with age. Such nonlinear effects of covariates are hardly apparent in the parameter estimates, and Figure 9 highlights the importance of displaying the partial effects when interpreting the implications of the estimated model.

7 Concluding Remarks

The proposed ECM algorithm to estimate ordered probit models with endogenous regressors is numerically stable and works well even for models with a large number of parameters. As inference procedures using LIML are well established, we expect more widespread application of LIML estimation using the proposed algorithm in applied work. Although the present paper focused on the ordered probit model with endogenous regressors, the EM algorithm can be easily modified or extended to other classes of discrete or limited dependent variable models as considered by Newey (1987). The binary probit model with endogenous regressors is a special case of the model considered in this paper and requires no modification. One can also modify the EM algorithm to estimate limited dependent variable (e.g. Tobit) models with endogenous regressors.

A EM Algorithm

A.1 E-Step

The reparametrized latent variable model is

    y_i = j if  y*_i ≤ 0 (j = 1),  0 ≤ y*_i < 1 (1 < j < m),  0 < y*_i (j = m)    (7a)

where

    y*_i = μ_i + u_i,  u_i ~ N(−α_1, σ_1²)  for j = 1,
    y*_i ~ N( (μ_i − α_{j−1})/δ_j, σ_1²/δ_j² )  for 1 < j,    (7b)

μ_i and σ_1² are defined in (2), δ_j ≡ α_j − α_{j−1} > 0 for 1 < j < m, and δ_1 = δ_m = 1. The complete data log likelihood function can be written as

    l(y*, Y) = Σ_i [ l(y*_i | Y_i, X_i, θ) + l(Y_i | X_i, θ) ]
             = −(n/2) log(2π) − (n/2) log σ_1² − Σ_{y_i=1} (y*_i − μ_i + α_1)²/(2σ_1²)
               + Σ_{j=2}^m Σ_{y_i=j} [ log δ_j − (δ_j y*_i − μ_i + α_{j−1})²/(2σ_1²) ]
               − (nr/2) log(2π) − (n/2) log|Σ_22| − (1/2) Σ_i V_i′ Σ_22^{−1} V_i

From the first two moments of the truncated normal distribution (Johnson and Kotz 1970,

pp. 81-83), we have the conditional moments

    ŷ*_i ≡ E[y*_i | y_i = j]
        = μ_ij − [φ(z_{0,ij})/Φ(z_{0,ij})] √σ_1²,                                       j = 1
        = μ_ij + [(φ(z_{0,ij}) − φ(z_{1,ij}))/(Φ(z_{1,ij}) − Φ(z_{0,ij}))] √σ_1²/δ_j,   1 < j < m    (8a)
        = μ_ij + [φ(z_{0,ij})/Φ(−z_{0,ij})] √σ_1²,                                      j = m

    σ²_i ≡ E[(y*_i − ŷ*_i)² | y_i = j]
        = [1 − z_{0,ij} φ(z_{0,ij})/Φ(z_{0,ij}) − (φ(z_{0,ij})/Φ(z_{0,ij}))²] σ_1²,     j = 1
        = [1 + (z_{0,ij} φ(z_{0,ij}) − z_{1,ij} φ(z_{1,ij}))/(Φ(z_{1,ij}) − Φ(z_{0,ij}))
             − ((φ(z_{0,ij}) − φ(z_{1,ij}))/(Φ(z_{1,ij}) − Φ(z_{0,ij})))²] σ_1²/δ_j²,   1 < j < m    (8b)
        = [1 + z_{0,ij} φ(z_{0,ij})/Φ(−z_{0,ij}) − (φ(z_{0,ij})/Φ(−z_{0,ij}))²] σ_1²,   j = m

where

    μ_ij ≡ μ_i − α_1 (j = 1),  μ_ij ≡ (μ_i − α_{j−1})/δ_j (1 < j),
    z_{0,ij} ≡ −μ_ij δ_j/√σ_1²,  z_{1,ij} ≡ (1 − μ_ij) δ_j/√σ_1².

The expected complete data log likelihood conditional on the observables can then be written as

    q(θ) ≡ E[l(y*, Y) | y]
         = −(n/2) log(2π) − (n/2) log σ_1² − Σ_{y_i=1} [(ŷ*_i − μ_i1)² + σ²_i]/(2σ_1²)
           + Σ_{j=2}^m Σ_{y_i=j} ( log δ_j − δ_j² [(ŷ*_i − μ_ij)² + σ²_i]/(2σ_1²) )
           − (nr/2) log(2π) − (n/2) log|Σ_22| − (1/2) Σ_i V_i′ Σ_22^{−1} V_i

A.2 M-step

A.2.1 β update

    ∂q/∂β = Σ_{j=1}^m Σ_{y_i=j} δ_j (ŷ*_i − μ_ij) Y_i / σ_1²
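Before continuing with the M-step, the interior-category case of the moments in (8a)-(8b) can be sketched numerically: it is just the mean and variance of a normal variable truncated to (0, 1), so it can be cross-checked against scipy.stats.truncnorm. The function below is our own illustration, not code from the paper:

```python
import numpy as np
from scipy.stats import norm

def estep_moments_interior(mu_ij, sigma1, delta_j):
    """E-step moments for an interior category 1 < j < m: mean and variance
    of a N(mu_ij, (sigma1/delta_j)^2) variable truncated to (0, 1)."""
    s = sigma1 / delta_j                     # conditional standard deviation
    z0 = -mu_ij * delta_j / sigma1           # z_{0,ij}: standardized cutpoint 0
    z1 = (1.0 - mu_ij) * delta_j / sigma1    # z_{1,ij}: standardized cutpoint 1
    p = norm.cdf(z1) - norm.cdf(z0)
    lam = (norm.pdf(z0) - norm.pdf(z1)) / p
    mean = mu_ij + lam * s                                      # equation (8a)
    var = (1.0 + (z0 * norm.pdf(z0) - z1 * norm.pdf(z1)) / p
           - lam ** 2) * s ** 2                                 # equation (8b)
    return mean, var
```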

The first order condition ∂q/∂β = 0 can be written as

    Σ_i Y_i Y_i′ β = Σ_{y_i=1} (ŷ*_i + α_1) Y_i + Σ_{j=2}^m Σ_{y_i=j} (δ_j ŷ*_i + α_{j−1}) Y_i − Σ_i (X_1i′γ + V_i′λ) Y_i

The r × 1 parameter β can be computed as a least squares regression of ỹ_i − X_1i′γ − V_i′λ on Y_i, where

    ỹ_i = ŷ*_i + α_1 for y_i = 1,  ỹ_i = δ_j ŷ*_i + α_{j−1} for 1 < y_i = j.    (9)

A.2.2 γ update

    ∂q/∂γ = Σ_{j=1}^m Σ_{y_i=j} δ_j (ŷ*_i − μ_ij) X_1i / σ_1²

The first order condition ∂q/∂γ = 0 can be written as

    Σ_i X_1i X_1i′ γ = Σ_{y_i=1} (ŷ*_i + α_1) X_1i + Σ_{j=2}^m Σ_{y_i=j} (δ_j ŷ*_i + α_{j−1}) X_1i − Σ_i (Y_i′β + V_i′λ) X_1i

The s × 1 parameter γ can be computed as a least squares regression of ỹ_i − Y_i′β − V_i′λ on X_1i, with ỹ_i as defined in (9).

A.2.3 Π update

    ∂q/∂vec(Π) = Σ_{j=1}^m Σ_{y_i=j} δ_j (ŷ*_i − μ_ij) Dμ_i(Π) / σ_1² − (1/2) Σ_i D[V_i′Σ_22^{−1}V_i](Π)

where

    Dμ_i(Π) = −(I_r ⊗ X_i) λ = −(λ ⊗ X_i)
    D[V_i′Σ_22^{−1}V_i](Π) = −2 (I_r ⊗ X_i) Σ_22^{−1} V_i = −2 (Σ_22^{−1} ⊗ X_i) V_i
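The β update described above is a plain least-squares regression; a minimal sketch, with array shapes and names of our own choosing:

```python
import numpy as np

def update_beta(Y, X1, V, ytilde, gamma, lam):
    """Conditional M-step for beta: regress ytilde_i - X1_i'gamma - V_i'lam
    on Y_i by least squares. Rows index observations."""
    target = ytilde - X1 @ gamma - V @ lam
    beta, *_ = np.linalg.lstsq(Y, target, rcond=None)
    return beta
```

The γ update is the same computation with the roles of Y_i and X_1i swapped.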

The first order condition ∂q/∂vec(Π) = 0 can be written as

    [ (λλ′ + σ_1² Σ_22^{−1}) ⊗ Σ_i X_i X_i′ ] vec(Π)
        = vec( Σ_{i=1}^n X_i Y_i′ (λλ′ + σ_1² Σ_22^{−1}) ) − vec( Σ_i (ỹ_i − Y_i′β − X_1i′γ) X_i λ′ )    (10)

which is just a system of linear equations in the parameters Π.

A.2.4 λ update

    ∂q/∂λ = Σ_{j=1}^m Σ_{y_i=j} δ_j (ŷ*_i − μ_ij) V_i / σ_1²

The first order condition ∂q/∂λ = 0 can be written as

    Σ_i V_i V_i′ λ = Σ_i (ỹ_i − Y_i′β − X_1i′γ) V_i

The r × 1 parameter λ can be computed as a least squares regression of ỹ_i − Y_i′β − X_1i′γ on V_i, with ỹ_i as defined in (9).

A.2.5 Σ_22 update

    ∂q/∂vec(Σ_22) = −(n/2) vec(Σ_22^{−1}) + (1/2) Σ_i vec(Σ_22^{−1} V_i V_i′ Σ_22^{−1})

The first-order condition ∂q/∂vec(Σ_22) = 0 can be solved for Σ_22 as

    Σ_22 = (1/n) Σ_i V_i V_i′

A.2.6 σ_1² update

    ∂q/∂σ_1² = −n/(2σ_1²) + Σ_{j=1}^m Σ_{y_i=j} [ (δ_j ŷ*_i − δ_j μ_ij)² + δ_j² σ²_i ] / (2σ_1⁴)
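Equation (10) is linear in vec(Π) and can be solved directly. The sketch below uses our own shapes and a column-major vec convention, so the sign and stacking conventions are assumptions rather than the paper's exact notation; with λ = 0 and σ_1² = 1 the system collapses to the OLS reduced-form fit:

```python
import numpy as np

def update_Pi(X, Y, lam, sigma1_sq, Sigma22_inv, resid):
    """Conditional M-step for Pi solving the linear system (10).
    X: (n,k) exogenous variables, Y: (n,r) endogenous regressors,
    lam: (r,), resid_i = ytilde_i - Y_i'beta - X1_i'gamma: (n,)."""
    A = np.outer(lam, lam) + sigma1_sq * Sigma22_inv         # r x r
    lhs = np.kron(A, X.T @ X)                                # (A kron X'X) vec(Pi)
    rhs = (X.T @ Y @ A - np.outer(X.T @ resid, lam)).ravel(order="F")
    vecPi = np.linalg.solve(lhs, rhs)
    return vecPi.reshape(X.shape[1], Y.shape[1], order="F")  # Pi: (k, r)
```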

The first-order condition ∂q/∂σ_1² = 0 can be solved for σ_1² as

    σ_1² = (1/n) Σ_i ( (ỹ_i − μ_i)² + δ²_{y_i} σ²_i )    (11)

where ỹ_i is defined in (9) and n is the sample size.

A.2.7 δ_j update

    ∂q/∂δ_j = Σ_{y_i=j} ( 1/δ_j − δ_j (ŷ*_i² + σ²_i)/σ_1² + δ_j μ_ij ŷ*_i/σ_1² ) − Σ_{k=j+1}^m Σ_{y_i=k} δ_k (ŷ*_i − μ_ik)/σ_1²

The first-order condition ∂q/∂δ_j = 0 is a quadratic equation in δ_j:

    ( Σ_{y_i=j} (ŷ*_i² + σ²_i) + n_{j+1} + ⋯ + n_m ) δ_j²
      − ( Σ_{y_i=j} (μ_i − α_1 − δ_2 − ⋯ − δ_{j−1}) ŷ*_i
          − Σ_{k=j+1}^m Σ_{y_i=k} (δ_k ŷ*_i − μ_i + α_1 + δ_2 + ⋯ + δ_{j−1} + δ_{j+1} + ⋯ + δ_{k−1}) ) δ_j
      − n_j σ_1² = 0    (12)

where n_j is the number of observations with y_i = j and δ_m = 1.13 This quadratic equation has real roots, the larger of which is positive as required for δ_j.

A.3 Scores

This section provides expressions for the analytic scores for the normalization σ_1² = 1. Denote the joint log-likelihood function as

    l(θ) = Σ_{j=1}^m Σ_{y_i=j} log( Φ(z_{ji}) − Φ(z_{j−1,i}) ) − (nr/2) log(2π) − (n/2) log|Σ_22| − (1/2) Σ_i V_i′ Σ_22^{−1} V_i

13 For j = 2, the first sum in the coefficient of δ_j is Σ_{y_i=j} (μ_i − α_1) ŷ*_i, while for j = m − 1, the second double sum in the coefficient of δ_j becomes Σ_{y_i=m} (ŷ*_i − μ_i + α_1 + δ_2 + ⋯ + δ_{m−2}).
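Because the constant term −n_j σ_1² in (12) is negative while the leading coefficient is positive, the two roots have opposite signs and the larger one is the valid update. A sketch of that root selection (helper name is ours):

```python
import numpy as np

def delta_update(a, b, c):
    """Larger root of a*x^2 + b*x + c = 0. In the delta_j update a > 0 and
    c = -n_j * sigma1_sq < 0, so the discriminant is positive and the larger
    root is strictly positive, preserving monotone cutpoints."""
    disc = b * b - 4.0 * a * c
    return (-b + np.sqrt(disc)) / (2.0 * a)
```

This is how the parameterization keeps each updated δ_j > 0 without constrained optimization.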

where z_{ji} ≡ (α_j − μ_i)/√σ_1². Then

    ∂l/∂β = Σ_i φ̃_{ji} Y_i
    ∂l/∂γ = Σ_i φ̃_{ji} X_1i
    ∂l/∂vec(Π) = Σ_i ( −φ̃_{ji} (λ ⊗ X_i) + vec(X_i V_i′ Σ_22^{−1}) )
    ∂l/∂λ = Σ_i φ̃_{ji} V_i
    ∂l/∂vech(Σ_22) = −(n/2) D_r′ vec(Σ_22^{−1}) + (1/2) Σ_i D_r′ vec(Σ_22^{−1} V_i V_i′ Σ_22^{−1})
    ∂l/∂δ_j = (1/√σ_1²) Σ_{y_i=j} φ(z_{ji})/(Φ(z_{ji}) − Φ(z_{j−1,i})) − Σ_{k=j+1}^m Σ_{y_i=k} φ̃_{ki},  j = 2, …, m − 1

where D_r is the r² × r(r+1)/2 duplication matrix such that vec(Σ_22) = D_r vech(Σ_22) and

    √σ_1² φ̃_{ji} ≡ (φ(z_{j−1,i}) − φ(z_{ji}))/(Φ(z_{ji}) − Φ(z_{j−1,i}))

A.4 Conditional Moment Test

The first four moment residuals (Chesher and Irish 1987) for the ordered dependent variable equation (1a) are given by

    e_{i,1} = (φ(z_{j−1,i}) − φ(z_{j,i}))/(Φ(z_{j,i}) − Φ(z_{j−1,i}))
    e_{i,2} = (z_{j−1,i} φ(z_{j−1,i}) − z_{j,i} φ(z_{j,i}))/(Φ(z_{j,i}) − Φ(z_{j−1,i}))
    e_{i,3} = 2 e_{i,1} + (z_{j−1,i}² φ(z_{j−1,i}) − z_{j,i}² φ(z_{j,i}))/(Φ(z_{j,i}) − Φ(z_{j−1,i}))
    e_{i,4} = 3 e_{i,2} + (z_{j−1,i}³ φ(z_{j−1,i}) − z_{j,i}³ φ(z_{j,i}))/(Φ(z_{j,i}) − Φ(z_{j−1,i}))

where φ(·), Φ(·) are the density and distribution function of the standard normal, z_{j,i} ≡ (α_j − μ_i)/√σ_1², and μ_i is defined in (2a). z_{j,i} is evaluated at the maximum likelihood estimates of the parameters. Note that e_{i,1} = φ̃_{ji} as defined in Appendix A.3.
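These moment residuals can be checked numerically: under normality, averaging each residual over the categories with weights Pr(y = j) telescopes to zero, because the boundary terms z^p φ(z) cancel between adjacent categories and vanish at ±∞. A sketch with our own helper:

```python
import numpy as np
from scipy.stats import norm

def moment_residuals(z_lo, z_hi):
    """Chesher-Irish moment residuals e_1..e_4 for an observation whose
    category has standardized cutpoints (z_lo, z_hi); end categories use
    -inf / +inf, where z**p * phi(z) is taken as 0."""
    p = norm.cdf(z_hi) - norm.cdf(z_lo)
    def t(power):
        lo = 0.0 if np.isinf(z_lo) else z_lo ** power * norm.pdf(z_lo)
        hi = 0.0 if np.isinf(z_hi) else z_hi ** power * norm.pdf(z_hi)
        return (lo - hi) / p
    e1 = t(0)
    e2 = t(1)
    e3 = 2.0 * e1 + t(2)
    e4 = 3.0 * e2 + t(3)
    return np.array([e1, e2, e3, e4])
```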

The score χ² statistic for the conditional moment test of normality of the residuals u_i in (1a) is given by

    ι′ R (R′R)^{−1} R′ ι

where ι is an n × 1 vector of ones and R is an n × (2r + s + 3) matrix with typical row

    R_i = ( Y_i′ e_{i,1}, X_1i′ e_{i,1}, V_i′ e_{i,1}, e_{i,2}, e_{i,3}, e_{i,4} )

The score statistic can be obtained as the explained sum of squares from the regression of ι on R (where any collinear columns of R are dropped from the regression). Under the null of normality, the score statistic is asymptotically χ² distributed with degrees of freedom equal to the column rank of R.

A.5 Partial Effects

The outcomes of interest are

    Pr(y = j) = Pr(α_{j−1} ≤ y* < α_j) = Φ( (α_j − x′b)/√σ_11 ) − Φ( (α_{j−1} − x′b)/√σ_11 )

for j = 1, …, m, where x′b ≡ Y′β + X_1′γ (so x stacks Y and X_1 and b stacks β and γ). To simplify notation in what follows, define

    z̄_j ≡ (α_j − x′b)/√σ_11,
    z̄_{j,1} ≡ (α_j − x_{−k}′b_{−k} − b_k)/√σ_11,
    z̄_{j,0} ≡ (α_j − x_{−k}′b_{−k})/√σ_11,

where a negative subscript on a vector indicates the vector without the corresponding element.

A.5.1 Continuous regressor case

The marginal effect of a continuous regressor x_k is

    h(θ̂) ≡ ∂Pr(y = j)/∂x_k = ( φ(z̄_{j−1}) − φ(z̄_j) ) b_k/√σ_11

From the delta method, the approximate variance of the marginal effect h(θ̂) can be obtained as

    Var( h(θ̂) ) ≈ (∂h(θ̂)/∂θ)′ Cov(θ̂) (∂h(θ̂)/∂θ)
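The outer-product form ι′R(R′R)^{−1}R′ι above is exactly the explained sum of squares from regressing a vector of ones on R, which is how it is computed in practice; a sketch:

```python
import numpy as np

def cm_score_statistic(R):
    """Score statistic as the explained sum of squares from the no-intercept
    regression of a vector of ones on the moment-residual matrix R (n x q)."""
    ones = np.ones(R.shape[0])
    coef, *_ = np.linalg.lstsq(R, ones, rcond=None)
    fitted = R @ coef
    return float(fitted @ fitted)   # equals ones' R (R'R)^{-1} R' ones
```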

where

    ∂h(θ̂)/∂θ = ( z̄_j φ(z̄_j) ∂z̄_j/∂θ − z̄_{j−1} φ(z̄_{j−1}) ∂z̄_{j−1}/∂θ ) b_k/√σ_11 + ( φ(z̄_{j−1}) − φ(z̄_j) ) ∂(b_k/√σ_11)/∂θ

For the σ_1² = 1 normalization, the relevant subvector of the parameters is θ = (β′, γ′, λ′, vech(Σ_22)′, δ′)′ with

    ∂z̄_j/∂θ = ( −Y′/√σ_11, −X_1′/√σ_11, −z̄_j λ′Σ_22/σ_11, −(z̄_j/(2σ_11)) (λ′ ⊗ λ′) D_r, ι_{j−1}′/√σ_11, 0_{m−j−1}′ )′

    ∂(b_k/√σ_11)/∂θ = ( 0_{k−1}′, 1, 0_{r+s−k}′, −(b_k/σ_11) λ′Σ_22, −(b_k/(2σ_11)) (λ′ ⊗ λ′) D_r, 0_{m−2}′ )′ / √σ_11

where ι_j is a j × 1 vector of ones.

For the σ_11 = 1 normalization, the relevant subvector of the parameters is θ = (β′, γ′, δ′)′ with

    ∂z̄_j/∂θ = ( −Y′, −X_1′, ι_{j−1}′, 0_{m−j−1}′ )′,  ∂(b_k/√σ_11)/∂θ = ( 0_{k−1}′, 1, 0_{r+s−k}′, 0_{m−2}′ )′

A.5.2 Dummy regressor case

The marginal effect of a dummy variable regressor x_k is

    Δ_j(θ̂) ≡ Pr(y = j | x_k = 1) − Pr(y = j | x_k = 0) = Φ(z̄_{j,1}) − Φ(z̄_{j−1,1}) − Φ(z̄_{j,0}) + Φ(z̄_{j−1,0})

From the delta method, the approximate variance of the effect Δ_j(θ̂) can be obtained as

    Var( Δ_j(θ̂) ) ≈ (∂Δ_j(θ̂)/∂θ)′ Cov(θ̂) (∂Δ_j(θ̂)/∂θ)

where

    ∂Δ_j(θ̂)/∂θ = φ(z̄_{j,1}) ∂z̄_{j,1}/∂θ − φ(z̄_{j−1,1}) ∂z̄_{j−1,1}/∂θ − φ(z̄_{j,0}) ∂z̄_{j,0}/∂θ + φ(z̄_{j−1,0}) ∂z̄_{j−1,0}/∂θ

The derivatives ∂z̄_{j,1}/∂θ and ∂z̄_{j,0}/∂θ are trivial modifications of ∂z̄_j/∂θ given for the continuous regressor case above, where the element in X_1 corresponding to the dummy regressor is replaced by one and zero, respectively.
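Under the σ_11 = 1 normalization, the continuous-regressor partial effect and its delta-method variance can be sketched as follows. The gradient here is taken by central differences rather than from the analytic expressions above, and all names and parameter packings are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def partial_effect(b, x, k, a_lo, a_hi):
    """(phi(z_{j-1}) - phi(z_j)) * b_k with z = alpha - x'b (sigma_11 = 1)."""
    return (norm.pdf(a_lo - x @ b) - norm.pdf(a_hi - x @ b)) * b[k]

def delta_method_var(fun, theta, cov, eps=1e-6):
    """Approximate Var(fun(theta_hat)) = g' Cov g using a central-difference
    gradient g of fun evaluated at theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        up, dn = theta.copy(), theta.copy()
        up[i] += eps
        dn[i] -= eps
        g[i] = (fun(up) - fun(dn)) / (2.0 * eps)
    return float(g @ cov @ g)
```

For a dummy regressor one would replace partial_effect by the difference of category probabilities at x_k = 1 and x_k = 0, with the same variance machinery.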

Figure 1: Distribution of simulated ordered outcomes. Each panel displays the interquartile range (and the lines connect the median) of the frequency of the simulated ordered outcomes over the 1000 Monte Carlo simulations for nine correlation parameter values ρ = −0.8, −0.6, …, +0.8. k is the total number of exogenous variables in the simulated model. The variance parameter is σ_22 = 1; the distributions for the cases σ_22 = 1/2 and σ_22 = 2 are similar.

Figure 2: Root mean squared error (RMSE) of each parameter estimate. The model is y* = bY + c_1 + c_2 X_2 + u, Y = π_1 + π_2 X_2 + π_3 Z_1 + π_4 Z_2 + V, Cor(u, V) = ρ, Var(V) = σ_22, λ = σ_21/σ_22 = ρ/√((1 − ρ²)σ_22). The true parameter values are b = c_1 = 1, c_2 = 1, π_1 = 0, π_2 = 1, π_3 = π_4 = 1, σ_22 = 1, λ = ρ/√(1 − ρ²), δ_2 = 1, δ_3 = 2, δ_4 = 3, for ρ = −0.8, −0.6, …, +0.8. pxem is from the PX-EM algorithm and agls is from the two-step Amemiya GLS (Newey 1987) estimator. Based on 1000 Monte Carlo replications.

Figure 3: Root mean squared error (RMSE) of the endogenous regressor parameter estimate β. The two figures display the same information but use different conditioning variables in each panel as we vary the number of exogenous regressors k and the variance parameter σ_22. RMSEs are based on estimates from the PX-EM algorithm over 1000 Monte Carlo replications.

Figure 4: Distribution of iteration counts of the EM and PX-EM algorithms. Each panel displays the interquartile range (and the lines connect the median) of the iterations required to estimate the endogenous ordered probit model for the simulated data set over 1000 Monte Carlo replications. The Monte Carlo design varies the correlation parameter ρ = −0.8, −0.6, …, +0.8, the total number of exogenous variables k, and the variance parameter σ_22.


G. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication G. S. Maddala Kajal Lahiri WILEY A John Wiley and Sons, Ltd., Publication TEMT Foreword Preface to the Fourth Edition xvii xix Part I Introduction and the Linear Regression Model 1 CHAPTER 1 What is Econometrics?

More information

Summary of Chapters 7-9

Summary of Chapters 7-9 Summary of Chapters 7-9 Chapter 7. Interval Estimation 7.2. Confidence Intervals for Difference of Two Means Let X 1,, X n and Y 1, Y 2,, Y m be two independent random samples of sizes n and m from two

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 4 Jakub Mućk Econometrics of Panel Data Meeting # 4 1 / 30 Outline 1 Two-way Error Component Model Fixed effects model Random effects model 2 Non-spherical

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

A Non-Parametric Approach of Heteroskedasticity Robust Estimation of Vector-Autoregressive (VAR) Models

A Non-Parametric Approach of Heteroskedasticity Robust Estimation of Vector-Autoregressive (VAR) Models Journal of Finance and Investment Analysis, vol.1, no.1, 2012, 55-67 ISSN: 2241-0988 (print version), 2241-0996 (online) International Scientific Press, 2012 A Non-Parametric Approach of Heteroskedasticity

More information

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Econometrics Working Paper EWP0402 ISSN 1485-6441 Department of Economics TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Lauren Bin Dong & David E. A. Giles Department

More information

Intermediate Econometrics

Intermediate Econometrics Intermediate Econometrics Heteroskedasticity Text: Wooldridge, 8 July 17, 2011 Heteroskedasticity Assumption of homoskedasticity, Var(u i x i1,..., x ik ) = E(u 2 i x i1,..., x ik ) = σ 2. That is, the

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

AGEC 661 Note Fourteen

AGEC 661 Note Fourteen AGEC 661 Note Fourteen Ximing Wu 1 Selection bias 1.1 Heckman s two-step model Consider the model in Heckman (1979) Y i = X iβ + ε i, D i = I {Z iγ + η i > 0}. For a random sample from the population,

More information

Binary Models with Endogenous Explanatory Variables

Binary Models with Endogenous Explanatory Variables Binary Models with Endogenous Explanatory Variables Class otes Manuel Arellano ovember 7, 2007 Revised: January 21, 2008 1 Introduction In Part I we considered linear and non-linear models with additive

More information

simple if it completely specifies the density of x

simple if it completely specifies the density of x 3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely

More information

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria SOLUTION TO FINAL EXAM Friday, April 12, 2013. From 9:00-12:00 (3 hours) INSTRUCTIONS:

More information

Econometrics Honor s Exam Review Session. Spring 2012 Eunice Han

Econometrics Honor s Exam Review Session. Spring 2012 Eunice Han Econometrics Honor s Exam Review Session Spring 2012 Eunice Han Topics 1. OLS The Assumptions Omitted Variable Bias Conditional Mean Independence Hypothesis Testing and Confidence Intervals Homoskedasticity

More information

A Course in Applied Econometrics Lecture 7: Cluster Sampling. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 7: Cluster Sampling. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 7: Cluster Sampling Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of roups and

More information

8. Hypothesis Testing

8. Hypothesis Testing FE661 - Statistical Methods for Financial Engineering 8. Hypothesis Testing Jitkomut Songsiri introduction Wald test likelihood-based tests significance test for linear regression 8-1 Introduction elements

More information

What s New in Econometrics? Lecture 14 Quantile Methods

What s New in Econometrics? Lecture 14 Quantile Methods What s New in Econometrics? Lecture 14 Quantile Methods Jeff Wooldridge NBER Summer Institute, 2007 1. Reminders About Means, Medians, and Quantiles 2. Some Useful Asymptotic Results 3. Quantile Regression

More information

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Introduction Large Sample Testing Composite Hypotheses. Hypothesis Testing. Daniel Schmierer Econ 312. March 30, 2007

Introduction Large Sample Testing Composite Hypotheses. Hypothesis Testing. Daniel Schmierer Econ 312. March 30, 2007 Hypothesis Testing Daniel Schmierer Econ 312 March 30, 2007 Basics Parameter of interest: θ Θ Structure of the test: H 0 : θ Θ 0 H 1 : θ Θ 1 for some sets Θ 0, Θ 1 Θ where Θ 0 Θ 1 = (often Θ 1 = Θ Θ 0

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

Regression diagnostics

Regression diagnostics Regression diagnostics Kerby Shedden Department of Statistics, University of Michigan November 5, 018 1 / 6 Motivation When working with a linear model with design matrix X, the conventional linear model

More information

Testing an Autoregressive Structure in Binary Time Series Models

Testing an Autoregressive Structure in Binary Time Series Models ömmföäflsäafaäsflassflassflas ffffffffffffffffffffffffffffffffffff Discussion Papers Testing an Autoregressive Structure in Binary Time Series Models Henri Nyberg University of Helsinki and HECER Discussion

More information

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 14 GEE-GMM Throughout the course we have emphasized methods of estimation and inference based on the principle

More information

,..., θ(2),..., θ(n)

,..., θ(2),..., θ(n) Likelihoods for Multivariate Binary Data Log-Linear Model We have 2 n 1 distinct probabilities, but we wish to consider formulations that allow more parsimonious descriptions as a function of covariates.

More information

Econometrics of Panel Data

Econometrics of Panel Data Econometrics of Panel Data Jakub Mućk Meeting # 2 Jakub Mućk Econometrics of Panel Data Meeting # 2 1 / 26 Outline 1 Fixed effects model The Least Squares Dummy Variable Estimator The Fixed Effect (Within

More information

Efficient Estimation of Dynamic Panel Data Models: Alternative Assumptions and Simplified Estimation

Efficient Estimation of Dynamic Panel Data Models: Alternative Assumptions and Simplified Estimation Efficient Estimation of Dynamic Panel Data Models: Alternative Assumptions and Simplified Estimation Seung C. Ahn Arizona State University, Tempe, AZ 85187, USA Peter Schmidt * Michigan State University,

More information

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume

More information

Statistical Estimation

Statistical Estimation Statistical Estimation Use data and a model. The plug-in estimators are based on the simple principle of applying the defining functional to the ECDF. Other methods of estimation: minimize residuals from

More information

Statistical Analysis of List Experiments

Statistical Analysis of List Experiments Statistical Analysis of List Experiments Graeme Blair Kosuke Imai Princeton University December 17, 2010 Blair and Imai (Princeton) List Experiments Political Methodology Seminar 1 / 32 Motivation Surveys

More information

MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES

MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES REVSTAT Statistical Journal Volume 13, Number 3, November 2015, 233 243 MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES Authors: Serpil Aktas Department of

More information

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Arthur Lewbel Boston College Original December 2016, revised July 2017 Abstract Lewbel (2012)

More information

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

Generalized Linear Models. Last time: Background & motivation for moving beyond linear Generalized Linear Models Last time: Background & motivation for moving beyond linear regression - non-normal/non-linear cases, binary, categorical data Today s class: 1. Examples of count and ordered

More information

GMM estimation of spatial panels

GMM estimation of spatial panels MRA Munich ersonal ReEc Archive GMM estimation of spatial panels Francesco Moscone and Elisa Tosetti Brunel University 7. April 009 Online at http://mpra.ub.uni-muenchen.de/637/ MRA aper No. 637, posted

More information

Spatial Regression. 9. Specification Tests (1) Luc Anselin. Copyright 2017 by Luc Anselin, All Rights Reserved

Spatial Regression. 9. Specification Tests (1) Luc Anselin.   Copyright 2017 by Luc Anselin, All Rights Reserved Spatial Regression 9. Specification Tests (1) Luc Anselin http://spatial.uchicago.edu 1 basic concepts types of tests Moran s I classic ML-based tests LM tests 2 Basic Concepts 3 The Logic of Specification

More information

Birkbeck Working Papers in Economics & Finance

Birkbeck Working Papers in Economics & Finance ISSN 1745-8587 Birkbeck Working Papers in Economics & Finance Department of Economics, Mathematics and Statistics BWPEF 1809 A Note on Specification Testing in Some Structural Regression Models Walter

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Answers to Problem Set #4

Answers to Problem Set #4 Answers to Problem Set #4 Problems. Suppose that, from a sample of 63 observations, the least squares estimates and the corresponding estimated variance covariance matrix are given by: bβ bβ 2 bβ 3 = 2

More information

Applied Economics. Regression with a Binary Dependent Variable. Department of Economics Universidad Carlos III de Madrid

Applied Economics. Regression with a Binary Dependent Variable. Department of Economics Universidad Carlos III de Madrid Applied Economics Regression with a Binary Dependent Variable Department of Economics Universidad Carlos III de Madrid See Stock and Watson (chapter 11) 1 / 28 Binary Dependent Variables: What is Different?

More information

Introduction to Estimation Methods for Time Series models Lecture 2

Introduction to Estimation Methods for Time Series models Lecture 2 Introduction to Estimation Methods for Time Series models Lecture 2 Fulvio Corsi SNS Pisa Fulvio Corsi Introduction to Estimation () Methods for Time Series models Lecture 2 SNS Pisa 1 / 21 Estimators:

More information

STA 216, GLM, Lecture 16. October 29, 2007

STA 216, GLM, Lecture 16. October 29, 2007 STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural

More information

Discrete Dependent Variable Models

Discrete Dependent Variable Models Discrete Dependent Variable Models James J. Heckman University of Chicago This draft, April 10, 2006 Here s the general approach of this lecture: Economic model Decision rule (e.g. utility maximization)

More information

Testing for Regime Switching in Singaporean Business Cycles

Testing for Regime Switching in Singaporean Business Cycles Testing for Regime Switching in Singaporean Business Cycles Robert Breunig School of Economics Faculty of Economics and Commerce Australian National University and Alison Stegman Research School of Pacific

More information

Lecture 3: Multiple Regression

Lecture 3: Multiple Regression Lecture 3: Multiple Regression R.G. Pierse 1 The General Linear Model Suppose that we have k explanatory variables Y i = β 1 + β X i + β 3 X 3i + + β k X ki + u i, i = 1,, n (1.1) or Y i = β j X ji + u

More information

GOODNESS OF FIT TESTS IN STOCHASTIC FRONTIER MODELS. Christine Amsler Michigan State University

GOODNESS OF FIT TESTS IN STOCHASTIC FRONTIER MODELS. Christine Amsler Michigan State University GOODNESS OF FIT TESTS IN STOCHASTIC FRONTIER MODELS Wei Siang Wang Nanyang Technological University Christine Amsler Michigan State University Peter Schmidt Michigan State University Yonsei University

More information

DEPARTMENT OF ECONOMICS AND FINANCE COLLEGE OF BUSINESS AND ECONOMICS UNIVERSITY OF CANTERBURY CHRISTCHURCH, NEW ZEALAND

DEPARTMENT OF ECONOMICS AND FINANCE COLLEGE OF BUSINESS AND ECONOMICS UNIVERSITY OF CANTERBURY CHRISTCHURCH, NEW ZEALAND DEPARTMENT OF ECONOMICS AND FINANCE COLLEGE OF BUSINESS AND ECONOMICS UNIVERSITY OF CANTERBURY CHRISTCHURCH, NEW ZEALAND Testing For Unit Roots With Cointegrated Data NOTE: This paper is a revision of

More information

Increasing the Power of Specification Tests. November 18, 2018

Increasing the Power of Specification Tests. November 18, 2018 Increasing the Power of Specification Tests T W J A. H U A MIT November 18, 2018 A. This paper shows how to increase the power of Hausman s (1978) specification test as well as the difference test in a

More information

Likelihood Ratio Based Test for the Exogeneity and the Relevance of Instrumental Variables

Likelihood Ratio Based Test for the Exogeneity and the Relevance of Instrumental Variables Likelihood Ratio Based est for the Exogeneity and the Relevance of Instrumental Variables Dukpa Kim y Yoonseok Lee z September [under revision] Abstract his paper develops a test for the exogeneity and

More information

ECON 5350 Class Notes Functional Form and Structural Change

ECON 5350 Class Notes Functional Form and Structural Change ECON 5350 Class Notes Functional Form and Structural Change 1 Introduction Although OLS is considered a linear estimator, it does not mean that the relationship between Y and X needs to be linear. In this

More information

One-stage dose-response meta-analysis

One-stage dose-response meta-analysis One-stage dose-response meta-analysis Nicola Orsini, Alessio Crippa Biostatistics Team Department of Public Health Sciences Karolinska Institutet http://ki.se/en/phs/biostatistics-team 2017 Nordic and

More information

The Bivariate Probit Model, Maximum Likelihood Estimation, Pseudo True Parameters and Partial Identification

The Bivariate Probit Model, Maximum Likelihood Estimation, Pseudo True Parameters and Partial Identification ISSN 1440-771X Department of Econometrics and Business Statistics http://business.monash.edu/econometrics-and-business-statistics/research/publications The Bivariate Probit Model, Maximum Likelihood Estimation,

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Lecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH

Lecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH Lecture 5: Spatial probit models James P. LeSage University of Toledo Department of Economics Toledo, OH 43606 jlesage@spatial-econometrics.com March 2004 1 A Bayesian spatial probit model with individual

More information

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case

Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Identification and Estimation Using Heteroscedasticity Without Instruments: The Binary Endogenous Regressor Case Arthur Lewbel Boston College December 2016 Abstract Lewbel (2012) provides an estimator

More information

Finite-sample quantiles of the Jarque-Bera test

Finite-sample quantiles of the Jarque-Bera test Finite-sample quantiles of the Jarque-Bera test Steve Lawford Department of Economics and Finance, Brunel University First draft: February 2004. Abstract The nite-sample null distribution of the Jarque-Bera

More information