Simple Estimators for Invertible Index Models

Hyungtaik Ahn (Dongguk University), Hidehiko Ichimura (University of Tokyo), James L. Powell (University of California, Berkeley), Paul A. Ruud (Vassar College)

March 2015

Preliminary and Incomplete Draft

Abstract

This paper considers estimation of the unknown linear index coefficients of a model in which a number of nonparametrically-identified reduced form parameters are assumed to be smooth and invertible functions of one or more linear indices. The results extend the previous literature by allowing the number of reduced form parameters to exceed the number of indices (i.e., the indices are "overdetermined" by the reduced form parameters). The estimation method is an extension of an approach proposed by Ahn, Ichimura, and Powell (2004) for monotone single-index regression models to a multi-index setting, and extended by Blundell and Powell (2004) and Powell and Ruud (2008) to models with endogenous regressors and multinomial response, respectively. The estimator of the unknown index coefficients (up to scale) is the eigenvector of a matrix (defined in terms of a first-step nonparametric estimator of the reduced form parameters) corresponding to its smallest (in magnitude) eigenvalue. Under suitable conditions, the proposed estimator is root-n-consistent and asymptotically normal, and results of a small-scale simulation study illustrate its finite-sample performance.

JEL Classification: C24, C14, C13.

1. Introduction

The object of this paper is to construct a class of computationally-simple estimators for models in which a vector of nonparametric "reduced form" parameters $\gamma(X)$ is assumed to depend upon the matrix $X$ of regressors only through a vector $X\theta_0$ of indices; that is, $\gamma(X) = g(X\theta_0)$, where $g$ is an unknown function and the coefficient vector $\theta_0$ is the unknown parameter of interest. In addition to the familiar smoothness restrictions imposed in the semiparametric literature, we assume that the function $g$ is invertible in the vector of indices, a restriction that is satisfied for reduced form parameters of many econometric models in which the latent error terms are assumed to be independent of the regressors. The proposed estimator of $\theta_0$ is the eigenvector of a random matrix $\hat S$ corresponding to its smallest (in magnitude) eigenvalue, making it relatively simple to compute. As described in more detail below, the index coefficient estimator is a semiparametric analogue of Berkson's (1955) and Amemiya's (1976) minimum chi-square logit estimators, with the reduced form parameters being nonparametrically estimated here, using general nonparametric regression methods instead of cell means for regressors with finite support.

This paper consolidates the results in earlier working papers by Ahn, Ichimura, and Powell (1996), which considered single-index models, and by Powell and Ruud (2008), which considered estimation of multiple indices for multinomial choice models. It also extends those previous results to "overdetermined" models, i.e., invertible models in which the dimension of the reduced-form vector $\gamma(X)$ exceeds the dimension of the index vector $X\theta_0$. Blundell and Powell (2004) considered a different extension of the Ahn-Ichimura-Powell estimation approach to binary response models with endogenous regressors, and that same extension could be adapted to the multi-index models considered here.
While a large literature exists for estimation of semiparametric single-index regression models (besides Ahn, Ichimura, and Powell 1996, Han 1987, Härdle and Stoker 1989, Härdle and Horowitz 1996, Ichimura 1993, Klein and Spady 1993, Manski 1975, 1985, Newey and Ruud 2005, and Powell, Stock and Stoker 1989, among many others), there are fewer results available on estimation of multiple-index regression models. Ichimura and Lee (1992) consider estimation of a semiparametric sample selection model with multiple indices in the selection equation, and Lee (1995) constructs a multinomial analogue to Klein and Spady's (1993) estimator of the semiparametric binary choice model, estimating the index coefficients by minimizing a semiparametric "profile likelihood" constructed using nonparametric estimators of the choice probabilities as functions of the indices. Lee demonstrates semiparametric efficiency of the estimator under the "distributional index" restriction that the conditional distribution of the errors depends on the regressors only through the indices. As Thompson (1993) shows, though, the semiparametric efficiency bound for multinomial choice under the assumption of independence of the errors (which coincides with the bound under the weaker distributional index restriction for binary choice) differs when the number of choices exceeds two, indicating possible efficiency improvements from the stronger independence restriction. Ruud (2000) shows that the stronger independence restrictions yield choice probabilities that are invertible functions of the indices and whose derivative matrix with respect to the indices is symmetric, neither of which need hold under only the distributional index restriction, and Berry, Gandhi, and Haile (2013) give general conditions on multivariate demand models that ensure their invertibility with respect to the underlying latent utilities (here assumed to be linear in the regressors).

The next section defines the class of index models considered and the proposed estimator of the index model coefficients, and discusses how three examples (single-index regression, multinomial choice, and monotonic location-scale quantile regression) fit into the general framework. The large-sample behavior of the proposed estimator (root-$N$ consistency and asymptotic normality) is derived in Section 3, and efficient choice of the "weight matrix" defining the estimator is discussed in Section 4.

2. The Model and Estimator

General Framework

The model for index restrictions considered here assumes that we have a random sample of $N$ observations on a random vector $y_i$ of dependent variables and a matrix $X_i$ of random covariates with $q$ rows and $p+1$ columns; the dimension of $y_i$ plays no direct role in the analysis.
For this setup, a key restriction is that some $r$-dimensional parameter $\gamma_i \equiv \gamma(X_i)$ of the conditional distribution of $y_i$ given $X_i$ (the "reduced form parameter vector") depends only on a $q$-dimensional linear combination $X_i\theta_0$ of the regressors (the "structural index vector"), with the $(p+1)$-dimensional coefficient vector $\theta_0$ being the unknown parameter of interest. That is, for some mapping

$$\gamma \equiv \gamma(X) \equiv T\big(F_{y|X}(y \mid X)\big) \tag{2.1}$$

of the conditional distribution $F_{y|X}$ of $y_i$ given $X_i$, the restriction

$$\gamma_i \equiv \gamma(X_i) = g(X_i\theta_0) \tag{2.2}$$

holds with probability one for some smooth unknown function $g: \mathbb{R}^q \to \mathbb{R}^r$. Another key restriction imposed here is that the function $g$ has a left inverse, so that

$$m(\gamma_i) \equiv g^{-1}(\gamma_i) = X_i\theta_0 \equiv \mu_i. \tag{2.3}$$

With the smoothness restriction on $g$, existence of the left-inverse function $m$ requires $r \ge q$, i.e., at least as many reduced form parameters as structural indices. We call the case with $r = q$ "just-determined" and with $r > q$ "overdetermined," in analogy with the distinction between "just-" and "over-" identification in the traditional simultaneous equations literature.

Examples

A number of semiparametric models studied in econometrics fit into this framework. The single index regression literature has extensively considered the restriction (2.2) when $\gamma(X_i) = E[y_i \mid X_i]$ and $r = q = 1$, and a main objective here is to extend the analysis to multiple reduced form parameters ($r > 1$) and multiple indices ($q > 1$).

A multivariate extension that generates the restrictions in (2.2) and (2.3) is the semiparametric multinomial choice (MC) model under the assumption of independence of the errors and regressors. In this case the $q$-dimensional dependent variable $y_i$ is a vector of indicator variables denoting which of $q+1$ mutually-exclusive and exhaustive alternatives (numbered from $j = 0$ to $j = q$) is chosen. Specifically, for individual $i$, alternative $j$ is assumed to have an unobservable indirect utility $y^*_{ij}$ for that individual, and the alternative with the highest indirect utility is assumed chosen. Thus an individual component $y_{ij}$ of the vector $y_i$ has the form

$$y_{ij} = 1\{y^*_{ij} \ge y^*_{ik} \text{ for } k = 0, \ldots, q\}, \tag{2.4}$$

with the convention that $y_i = 0$ indicates choice of alternative $j = 0$. An assumption of joint continuity of the indirect utilities rules out ties with probability one; in this model, the indirect utilities are further restricted to have the linear form

$$y^*_{ij} = x_{ij}'\theta_0 + \varepsilon_{ij} \tag{2.5}$$

for $j = 1, \ldots, q$, where the vector $\varepsilon_i$ of unobserved error terms is assumed to be jointly continuously distributed and independent of the $q \times (p+1)$-dimensional matrix of regressors $X_i$ (whose $j$th row is $x_{ij}'$). For alternative $j = 0$, the standard normalization $y^*_{i0} = 0$ is imposed. As Lee (1995) notes, the MC model with independent errors restricts the conditional choice probabilities to depend upon the regressors only through the vector $\mu_i \equiv X_i\theta_0$ of linear indices; that is, it takes the form

$$E[y_i \mid X_i] \equiv \gamma_i = g(X_i\theta_0) \tag{2.6}$$

for some unknown function $g$, so that

$$y_i = \gamma_i + u_i = g(X_i\theta_0) + u_i, \tag{2.7}$$

where $E[u_i \mid X_i] = 0$ by construction. In addition, the assumption of independence of the latent disturbances $\varepsilon_i$ and the regressors $X_i$ implies that the function $g(\mu)$ is smooth and invertible in its argument $\mu$ if $\varepsilon_i$ has nonnegative density everywhere (Ruud 2000). A weaker condition yielding (2.6) is an assumption that the conditional distribution of $\varepsilon_i$ given $X_i$ only depends upon the vector of indices $X_i\theta_0$, but under this restriction the function $g$ need not be invertible in its argument, so invertibility would need to be imposed as an additional restriction for the method proposed here to apply.

A different, somewhat artificial example is a model that restricts several conditional quantiles of a scalar dependent variable $y_i$ to depend upon only two indices, one for location and one for scale. Specifically, suppose the dependent variable $y_i$ is determined through the structural equation

$$y_i = \alpha(x_{i1}'\theta_1) + \sigma(x_{i2}'\theta_2)\,\varepsilon_i = \alpha(\mu_{i1}) + \sigma(\mu_{i2})\,\varepsilon_i, \tag{2.8}$$

where the unknown functions $\alpha$ and $\sigma$ are assumed to be strictly increasing. Suppose the unobservable error $\varepsilon_i$ is assumed to be continuously distributed given $x_i$, and that $r$ conditional quantiles are assumed to be independent of $x_i$, i.e.,

$$\Pr\{\varepsilon_i \le \eta_j \mid x_i\} = \tau_j \tag{2.9}$$

for some fractions $0 < \tau_1 < \ldots < \tau_j < \ldots < \tau_r < 1$ and corresponding quantiles $\{\eta_j\}$ (assumed unique).
Then the corresponding conditional quantiles of $y_i$ given $x_i$ satisfy

$$\gamma_j(x_i) \equiv F^{-1}_{y|x}(\tau_j \mid x_i) = \alpha(x_{i1}'\theta_1) + \sigma(x_{i2}'\theta_2)\,\eta_j$$

for $j = 1, \ldots, r$. As long as $r \ge 2$, the indices $\mu_{i1} = x_{i1}'\theta_1$ and $\mu_{i2} = x_{i2}'\theta_2$ can be written as functions of any two distinct conditional quantiles $\gamma_{ij}$ and $\gamma_{ik}$ of $y_i$:

$$\mu_{i1} = \alpha^{-1}\left(\frac{\eta_j\gamma_{ik} - \eta_k\gamma_{ij}}{\eta_j - \eta_k}\right), \qquad \mu_{i2} = \sigma^{-1}\left(\frac{\gamma_{ij} - \gamma_{ik}}{\eta_j - \eta_k}\right),$$

so that the unknown transformation from the indices to the reduced form parameters is indeed invertible. Of course, a natural assumption generating the identifying restriction (2.9) would be statistical independence of $\varepsilon_i$ and $x_i$, in which case the number of reduced-form quantiles $r$ could be taken to be arbitrarily large, but here we take $r \ge 2$ to be fixed and defer efficiency considerations for this condition.

Identification

Returning to the general restrictions (2.1) through (2.3), if the inverse transformation $m(\gamma) = g^{-1}(\gamma) = X\theta_0$ were known, the coefficient vector $\theta_0$ would be identified if any of the rows of the regressor matrix $X$ had a full-rank covariance matrix. Furthermore, given a consistent estimator $\hat\gamma$ of $\gamma$ and smoothness of the inverse transformation, the parameter vector $\theta_0$ could be consistently estimated using generalized least-squares applied to the generated regression equation

$$\hat\mu_i \equiv g^{-1}(\hat\gamma_i) \equiv m(\hat\gamma_i) = X_i\theta_0 + [m(\hat\gamma_i) - m(\gamma_i)] \cong X_i\theta_0 + \left[\frac{\partial m(\gamma_i)}{\partial\gamma'}\right](\hat\gamma_i - \gamma_i) \equiv X_i\theta_0 + \hat u_i. \tag{2.10}$$

This estimation method was proposed by Berkson (1955) for the binary logit model, where the reduced form estimator $\hat\gamma_i$ was a cell mean for the response probability when the regressors are constant within cells, and the method was generalized by Amemiya (1976) to general parametric multinomial choice models.

However, in the present setting where the inverse transformation $m(\gamma) = g^{-1}(\gamma)$ is unknown, identification of $\theta_0$ is more delicate. The coefficient vector $\theta_0$ is clearly only identified up to scale at best, and it will be convenient to normalize the first component of $\theta_0$ to be one, with

$$\theta_0 \equiv \begin{pmatrix} 1 \\ -\beta_0 \end{pmatrix}, \tag{2.11}$$

so that the object of estimation is the $p$-dimensional parameter $\beta_0$. Writing the regressor matrix as $X_i \equiv [X_{i1}, X_{i2}]$, with the column vector $X_{i1}$ corresponding to the normalized component in $\theta_0$, the relation (2.3) can be rewritten as

$$m(\gamma_i) = X_i\theta_0 = X_{i1} - X_{i2}\beta_0 \;\Longrightarrow\; X_{i1} = X_{i2}\beta_0 + m(\gamma_i) = X_{i2}\beta_0 + m(\hat\gamma_i) - \hat u_i. \tag{2.12}$$

In this representation the normalized regressor $X_{i1}$ takes the form of a vector of dependent variables from a semilinear regression model, with $X_{i2}$ the regressors in the linear part and the identified $\gamma_i$ being the regressors in the nonparametric component. Identification of $\beta_0$ for such models, which was considered by Robinson (1988) and Ahn and Powell (1993), requires sufficient variability in the rows of $X_i$ given $\gamma_i$; since $0 = \mathrm{Var}[X_i\theta_0 \mid \gamma_i]$, the parameter vector $\theta_0$ is identified up to scale if $\mathrm{Var}[X_i\delta \mid \gamma_i] \ne 0$ with positive probability unless $\delta$ is proportional to $\theta_0$. As discussed in detail by Lee (1995), this requirement can be ensured if the component vector $X_{i1}$ is jointly continuously distributed on $\mathbb{R}^q$ conditional on the remaining components $X_{i2}$ (whose rows have full-rank covariance matrices). We impose this assumption in our analysis, which implies that the index $\mu_i$ is jointly continuously distributed.

Following Ahn, Ichimura, and Powell (2004), the alternative approach to identification and estimation of $\theta_0$ taken here uses the fact that, for values of $\gamma_i$ that are nearly equal, the corresponding values of $X_i\theta_0$ will also be nearly equal. This implies that the vector $\theta_0$ can be identified by matching observations numbered $i$ and $j$ with the same reduced form parameters $\gamma_i = \gamma_j$ but different matrices of regressors. Conditional on

$$\gamma_i = \gamma_j, \tag{2.13}$$

relation (2.3) implies that

$$(X_i - X_j)\theta_0 = 0. \tag{2.14}$$

It follows that

$$E[L_{ij}(X_i - X_j) \mid \gamma_i = \gamma_j]\,\theta_0 = 0 \tag{2.15}$$

for any random $(p+1) \times q$ matrix $L_{ij}$ for which $E\|L_{ij}(X_i - X_j)\|$ exists. A convenient class of such matrices is

$$L_{ij} = (X_i - X_j)'W_{ij}, \tag{2.16}$$

for some suitable $q \times q$, nonnegative-definite "weight/trimming" matrix $W_{ij} \equiv W(X_i, X_j)$; this implies that the identified $(p+1) \times (p+1)$ matrix

$$\Sigma_0 \equiv E[(X_i - X_j)'W_{ij}(X_i - X_j) \mid \gamma_i = \gamma_j] = E[(X_i - X_j)'W_{ij}(X_i - X_j) \mid \mu_i = \mu_j] \tag{2.17}$$

has

$$\theta_0'\Sigma_0\theta_0 = 0, \tag{2.18}$$

that is, $\Sigma_0$ has a zero eigenvalue with corresponding eigenvector equal to the true parameter $\theta_0$. If the matrix $W_{ij}$ is chosen so that the zero eigenvalue of $\Sigma_0$ is unique (which requires a sufficiently rich support of the conditional distribution of $X_i$ given $X_i\theta_0$), this suffices to identify $\theta_0$ up to scale as the unique nontrivial solution to (2.18). As discussed below, the weight/trimming matrix can be chosen for technical convenience and/or to improve the asymptotic efficiency of the corresponding estimator of $\beta_0$.

Because the index vector $\mu_i$ will be jointly continuously distributed under our assumptions (to help ensure that $\mathrm{Var}[X_i\delta \mid \gamma_i] \ne 0$ for any nontrivial $\delta$ not proportional to $\theta_0$), the conditioning event in the definition of $\Sigma_0$ in (2.17) has zero probability, but we assume the relevant conditional expectations are sufficiently smooth so that $\Sigma_0$ can be expressed as

$$\Sigma_0 = \lim_{\varepsilon \to 0} E[(X_i - X_j)'W_{ij}(X_i - X_j) \mid \|\gamma_i - \gamma_j\| < \varepsilon],$$

as usual for nonparametric regression problems.

Given a random sample of size $N$ from this model, the preceding identification approach can be transformed into an estimation method for $\theta_0$ by first estimating the unobservable reduced form vectors $\gamma_i$ by some nonparametric method and then, as in Ahn, Ichimura, and Powell (1996), estimating a sample analogue to the matrix $\Sigma_0$ using pairs of observations with estimated values $\hat\gamma_i$ of $\gamma_i$ that are approximately equal. For the multinomial choice example, where $\gamma_i = E[y_i \mid x_i]$, the first-step nonparametric estimator of $\gamma_i$ may be the familiar kernel regression estimator, which takes the form of a weighted average of the dependent variable,

$$\hat\gamma_i \equiv \frac{\sum_{j=1}^N k_{ij}\,y_j}{\sum_{j=1}^N k_{ij}}, \tag{2.19}$$

with weights $k_{ij}$ given by

$$k_{ij} \equiv k\left(\frac{x_i - x_j}{h_1}\right), \tag{2.20}$$

for $x_i$ a vector of the distinct components of the matrix $X_i$, $k$ a kernel function which tends to zero as the magnitude of its argument increases, and $h_1 \equiv h_{1N}$ a first-step bandwidth which is chosen to tend to zero as the sample size $N$ increases. More generally, the reduced form vector $\hat\gamma_i$ may include other nonparametric estimates of the parameters of the conditional distribution of $y_i$ given $X_i$ (e.g., conditional quantiles, distribution functions, or higher moments) and can be estimated using alternatives to kernel estimation (e.g., nearest neighbor, local polynomial, splines, series, or sieve estimators). However it is constructed, we will assume the reduced form estimator $\hat\gamma_i$ has an "asymptotically linear" representation of the form

$$\hat\gamma_i = \gamma_i + \frac{1}{N}\sum_{j=1}^N R_j(X_i)\,\xi_j + r_i, \tag{2.21}$$

where the i.i.d. "error term" $\xi_j$ has $E[\xi_j \mid X_j] = 0$, $\mathrm{Var}[\xi_j \mid X_j] \equiv V_j$ and $E[\|\xi_j\|^2] < \infty$, the matrix $R_j(X_i)$ has second moment $E[\|R_j(X_i)\|^2]$ that increases at a slower rate than $N$, i.e.,

$$E[\|R_j(X_i)\|^2] = o(N), \tag{2.22}$$

and the remainder term $r_i$ is negligible when multiplied by $\sqrt{N}$, as described in the next section. For the kernel estimator example given in (2.19), given appropriate side conditions on the kernel and bandwidth terms, the components $\xi_i$ and $R_j(X_i)$ will take the form $\xi_i = y_i - E[y_i \mid X_i] = y_i - \gamma_i$ and $R_j(X_i) = h_1^{-c}\,k_{ij}/f_X(x_i)$, where $c$ is the number of continuously-distributed components of $x_i$ and $f_X$ is the joint density function of $x_i$ (the product of the conditional density function of the continuous components given the discrete components and the probability mass function for the discrete components).
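As an illustration, the first-step kernel regression estimator in (2.19) and (2.20) can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the Gaussian product kernel, the function names, and the inclusion of the $j = i$ term in both sums are choices made here for concreteness and are not prescribed by the text.

```python
import math


def gaussian_kernel(u):
    """Product Gaussian kernel (an assumed choice of k) evaluated at the vector u."""
    return math.exp(-0.5 * sum(v * v for v in u))


def kernel_regression(x, y, h1):
    """First-step estimates gamma_hat_i, following the form of (2.19)-(2.20).

    x  : list of regressor vectors x_i (lists of floats)
    y  : list of response vectors y_i (lists of floats, all of length r)
    h1 : first-step bandwidth (positive float)
    """
    n = len(x)
    r = len(y[0])
    gamma_hat = []
    for i in range(n):
        # Kernel weights k_ij = k((x_i - x_j) / h1) for j = 1, ..., N.
        k = [gaussian_kernel([(a - b) / h1 for a, b in zip(x[i], x[j])])
             for j in range(n)]
        total = sum(k)
        # Weighted average of the responses, component by component.
        gamma_hat.append([sum(k[j] * y[j][c] for j in range(n)) / total
                          for c in range(r)])
    return gamma_hat
```

Because each $\hat\gamma_i$ is a weighted average with nonnegative weights, the fitted values stay within the range of the observed responses, which is convenient when the components of $y_i$ are choice indicators.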
Given this nonparametric estimator $\hat\gamma_i$ of the reduced form parameter vector $\gamma_i$, a second-step estimator of a matrix analogue to $\Sigma_0$ can be constructed as

$$\hat S \equiv \binom{N}{2}^{-1}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} \frac{1}{h^q}\,K\left(\frac{\hat\gamma_i - \hat\gamma_j}{h}\right)(X_i - X_j)'A_{ij}(X_i - X_j), \tag{2.23}$$

where $K$ is a kernel function analogous to $k$ above, $h \equiv h_N$ is a bandwidth sequence for the second-step estimator $\hat S$, and $A_{ij} \equiv A_N(X_i, X_j) = A_{ji}$ is a $q \times q$, symmetric, nonnegative-definite weight/trimming matrix which is constructed to equal zero for observations where $\hat\gamma_i$ or $\hat\gamma_j$ is imprecise (i.e., where $X_i$ or $X_j$ is outside some compact subset of its support). The term $K((\hat\gamma_i - \hat\gamma_j)/h)$ declines to zero as $\|\hat\gamma_i - \hat\gamma_j\|$ increases relative to the bandwidth $h$; thus, the conditioning event "$\gamma_i = \gamma_j$" in the definition of $\Sigma_0$ is ultimately imposed as this bandwidth shrinks with the sample size and the nonparametric estimator of $\gamma_i$ converges to its true value in probability. It is worth noting that, even though $r$, the number of components of $\hat\gamma_i$, may exceed the number of indices $q$, the normalizing constant for the kernel weight in (2.23) is $h^{-q}$ rather than $h^{-r}$, reflecting the assumption that the true reduced form parameters $\gamma_i$ have a continuous distribution only on a $q$-dimensional manifold in $r$-dimensional Euclidean space (since they depend only on the $q$-dimensional index vector $\mu_i$).

Given the estimator $\hat S$ of $\Sigma_0$ (which corresponds to a particular structure for the population weight matrix $W_{ij}$ in the definition of $\Sigma_0$, as discussed below), construction of an estimator of $\beta_0$ follows exactly the same form as in Ahn, Ichimura, and Powell (1996) and Blundell and Powell (2004), exploiting a sample analogue of relation (2.17) based on the eigenvalues and eigenvectors of $\hat S$. Since the estimator $\hat S$ may not be nonnegative-definite in finite samples if the kernel function $K$ is not constrained to be nonnegative, the estimator $\hat\theta$ of $\theta_0$ is defined here as the eigenvector for the eigenvalue of $\hat S$ that is closest to zero in magnitude. That is, defining $\hat\nu_1, \ldots, \hat\nu_{p+1}$ to be the $p+1$ solutions to the determinantal equation

$$|\hat S - \nu I| = 0, \tag{2.24}$$

the estimator $\hat\theta$ is defined as an appropriately-normalized solution to

$$(\hat S - \hat\nu I)\,\hat\theta = 0, \tag{2.25}$$

where

$$\hat\nu \equiv \arg\min_j \{|\hat\nu_j|\}. \tag{2.26}$$

Normalizing the first component of $\hat\theta$ to one, with remaining coefficients defined as $-\hat\beta$, i.e.,

$$\hat\theta = \begin{pmatrix} 1 \\ -\hat\beta \end{pmatrix}, \qquad \theta_0 = \begin{pmatrix} 1 \\ -\beta_0 \end{pmatrix}, \tag{2.27}$$

and partitioning $\hat S$ conformably as

$$\hat S \equiv \begin{bmatrix} \hat S_{11} & \hat S_{12} \\ \hat S_{21} & \hat S_{22} \end{bmatrix}, \tag{2.28}$$

the solution to (2.25) takes the form

$$\hat\beta \equiv [\hat S_{22} - \hat\nu I]^{-1}\hat S_{21} \tag{2.29}$$

for this normalization. An alternative "closed form" estimator $\tilde\beta$ of $\beta_0$ can be defined as

$$\tilde\beta \equiv [\hat S_{22}]^{-1}\hat S_{21}; \tag{2.30a}$$

this estimator exploits the fact (to be verified below) that $\hat\nu$ tends to zero in probability, since the smallest eigenvalue of the probability limit $\Sigma_0$ of $\hat S$ is zero. As discussed by Ahn, Ichimura, and Powell (2004), the relation of $\hat\beta$ to $\tilde\beta$ here is analogous to the relationship between the two classical single-equation estimators for simultaneous equations systems, namely, limited-information maximum likelihood, which has an alternative derivation as a least-variance ratio (LVR) estimator, and two-stage least squares (2SLS), which can be viewed as a modification of LVR which replaces an estimated eigenvalue by its known (zero) probability limit. The analogy to these classical estimators extends to the asymptotic distribution theory for $\hat\beta$ and $\tilde\beta$, which, under the conditions imposed below, will be asymptotically equivalent, like LVR and 2SLS. Their relative advantages and disadvantages are also analogous; e.g., $\tilde\beta$ is slightly easier to compute, while $\hat\beta$ will be equivariant with respect to the choice of which nonzero component of $\theta_0$ to normalize to unity. As noted earlier, both estimators are semiparametric analogues to the minimum chi-square estimators of Berkson (1955) and Amemiya (1976), using a particular semilinear regression estimator applied to (2.12), which treats the $g^{-1}$ function as the unknown nonparametric component and the first component of $X_i$ (with coefficient normalized to unity) as the dependent variable.

The proposed estimator will be easier to calculate than Lee's (1995) estimator for the multinomial response model under index restrictions, which requires solution of an $r$-dimensional minimization problem with a criterion involving simultaneous estimation of $J$ nonparametric regressions with the $J$ index functions as arguments and minimization over the parameter vector $\beta$.
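The second step and the two estimators in (2.29) and (2.30a) can be illustrated with a minimal sketch for the simplest case $q = r = 1$ and $p = 1$, in which $\hat S$ is a symmetric $2 \times 2$ matrix whose eigenvalues are available in closed form. The Gaussian second-step kernel, the choice $A_{ij} = 1$ (no trimming), and the function names are assumptions made here for illustration only; a general implementation would use a numerical eigenvalue routine.

```python
import math


def second_step(x, gamma_hat, h):
    """Entries of the 2x2 matrix S_hat in (2.23) for q = r = 1, p = 1, A_ij = 1.

    x         : list of regressor pairs (x_i1, x_i2)
    gamma_hat : list of scalar first-step estimates
    h         : second-step bandwidth (positive float)
    Returns (s11, s12, s22), the distinct entries of the symmetric matrix.
    """
    n = len(x)
    s11 = s12 = s22 = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            u = (gamma_hat[i] - gamma_hat[j]) / h
            # Kernel weight K(u) / h^q with q = 1 (Gaussian K is an assumption).
            w = math.exp(-0.5 * u * u) / (math.sqrt(2.0 * math.pi) * h)
            d1 = x[i][0] - x[j][0]
            d2 = x[i][1] - x[j][1]
            s11 += w * d1 * d1
            s12 += w * d1 * d2
            s22 += w * d2 * d2
    pairs = n * (n - 1) / 2.0  # number of distinct pairs, "N choose 2"
    return s11 / pairs, s12 / pairs, s22 / pairs


def index_estimators(s11, s12, s22):
    """Return (beta_hat, beta_tilde), following (2.29) and (2.30a)."""
    mid = 0.5 * (s11 + s22)
    rad = math.hypot(0.5 * (s11 - s22), s12)
    # Eigenvalue of S_hat closest to zero in magnitude, as in (2.26).
    nu_hat = min((mid - rad, mid + rad), key=abs)
    beta_hat = s12 / (s22 - nu_hat)   # eigenvector-based estimator
    beta_tilde = s12 / s22            # closed-form estimator
    return beta_hat, beta_tilde
```

With `gamma_hat` produced by any first-step estimator, `index_estimators(*second_step(x, gamma_hat, h))` returns $\hat\beta$ and $\tilde\beta$ under the normalization $\theta = (1, -\beta)'$.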
Unlike the estimator proposed here, however, Lee's estimator does not impose invertibility of the vector of choice probabilities in the vector of indices.

3. Large Sample Properties of the Estimator

Since the definition of the estimator $\hat\theta = (1, -\hat\beta')'$ is based on the same form of a "pairwise difference" matrix estimator $\hat S$ analyzed in Ahn, Ichimura, and Powell (2004) and Blundell and Powell (2004), the regularity conditions imposed here will be quite similar to those imposed in these earlier papers. Rather than restrict the first-stage nonparametric estimator of the choice probabilities $\gamma(X_i)$ to have a particular form (e.g., the kernel estimator in (2.19)), it is assumed to satisfy some higher-level restrictions on its rate of convergence and Bahadur representation, which would need to be verified for the particular nonparametric estimation method utilized. Specifically, given a random sample of size $N$ for $\{y_i, X_i\}$, it is assumed that the reduced-form estimator $\hat\gamma(X_i)$ has a relatively high convergence rate and the same asymptotic linear representation as for a kernel estimator. That is, defining the "trimming" indicator

$$t_{ij} \equiv 1\{A_N(X_i, X_j) \ne 0\}, \tag{3.31}$$

the condition

$$\max_{i,j}\, t_{ij}\,\|\hat\gamma(X_i) - \gamma(X_i)\| = o_p(\ldots) \tag{3.32}$$

is imposed. This is a restriction on both the nonparametric estimation and the construction of the weight-matrix function $A_N(X_i, X_j)$, which will generally require "trimming" of observations outside a bounded set of $X_i$ values to ensure that (3.32) is satisfied. Similarly, the remainder term $r_i$ in the asymptotic linear representation (2.21) for the reduced-form estimator is assumed to converge to zero sufficiently quickly, uniformly in $i$ and $j$,

$$\max_{i,j}\, t_{ij}\,\|r_i\| = o_p(\ldots). \tag{3.33}$$

Another restriction on the model is that the first column $X_{i1}$ of the matrix of regressors $X_i$ is continuously distributed conditionally on the remaining components, and that the corresponding coefficient $\theta_{0,1}$ is nonzero (and normalized to unity); as discussed by Lee (1995), this restriction helps ensure that the parameters are identified by ensuring that $X_i$ is sufficiently variable conditional upon a given value of the reduced form vector $\gamma_i = g(X_i\theta_0)$.
Other conditions are imposed on the error distribution, kernel function $K$, and bandwidth $h$; a discussion of the relevant regularity conditions, which are extensions of conditions imposed in Ahn, Ichimura, and Powell (2004) and other single-index regression papers, is given in the appendix below.

Under the assumptions in the appendix, consistency of the estimator $\hat S$ for a particular matrix $\Sigma_0$ can be established. That is,

$$\hat S \to_p \Sigma_0, \tag{3.34}$$

where $\Sigma_0$ is of the form given in (4.1) with

$$W_{ij} = \lim_{N \to \infty} A_{ij}\,\phi_i\,\phi_j, \tag{3.35}$$

where the term $\phi_i$ (referenced here as the "gamma density") depends on the joint density $f_\mu$ of the indices $\mu_i$, the $r \times q$ Jacobian matrix

$$G_i \equiv \frac{\partial g(X_i\theta_0)}{\partial\mu'} = \left[\frac{\partial\gamma}{\partial\mu'}\right]_{\mu = \mu_i}, \tag{3.36}$$

and the kernel function $K$ as follows:

$$\phi_i \equiv \delta_i\,f_\mu(\mu_i), \tag{3.37}$$

with

$$\delta_i \equiv \int_{\mathbb{R}^q} K(G_i u)\,du \tag{3.38}$$

being an "inverse Jacobian determinant" term. In the overdetermined case, the form of the kernel function $K$ affects the probability limit of $\hat S$, but when $r = q$ the term (3.38) reduces to $\delta_i = |G_i|^{-1} = |\partial m(\gamma_i)/\partial\gamma'|$, and the term $\phi_i$ becomes the joint density function of $\gamma(X)$ evaluated at the realized value $\gamma_i = g(X_i\theta_0)$.

The consistency of $\hat S$ for $\Sigma_0$, along with the identification restriction that the true coefficient vector $\theta_0$ is the unique solution of (2.17), implies consistency of the corresponding estimator $\hat\theta$ (up to scale). In order to derive the asymptotic normal representation for the unnormalized coefficients $\hat\beta$, the regularity conditions also yield an asymptotically-linear representation for $\hat S\theta_0$ of the form

$$\sqrt{N}\,\hat S\theta_0 = \frac{2}{\sqrt{N}}\sum_{i=1}^N \phi_i\,\tilde X_i'\tilde A_i M_i\,\xi_i + o_p(1), \tag{3.39}$$

where $\xi_i$ is the reduced-form error term given in (2.21) and $\phi_i$ is the "gamma density" term defined in (3.37). The remaining terms in (3.39) are defined as follows: the $q \times r$ matrix

$$M_i \equiv -\delta_i^{-1}\int_{\mathbb{R}^q} u\left[\frac{\partial K(G_i u)}{\partial v'}\right]du \tag{3.40}$$

is an "inverse Jacobian matrix" term, with $\delta_i$ defined in (3.38); the $q \times q$ matrix

$$\tilde A_i \equiv \lim_{N \to \infty} E[A_{ij} \mid X_i, \mu_j]\Big|_{\mu_j = X_i\theta_0} \tag{3.41}$$

is the "limiting weight matrix;" and the $q \times (p+1)$ matrix

$$\tilde X_i \equiv X_i - \tilde A_i^{-1}\lim_{N \to \infty} E[A_{ij}X_j \mid X_i, \mu_j]\Big|_{\mu_j = X_i\theta_0} \tag{3.42}$$

is a "residual regressor matrix."

Of the terms appearing in the Bahadur representation (3.39), the "inverse Jacobian matrix" $M_i$ in (3.40) is perhaps the most obscure, because of its dependence on the kernel function $K$ (just like the "inverse Jacobian determinant" $\delta_i$ in (3.38)). In the just-determined case $r = q$, the matrix $M_i$ will indeed be independent of $K$, reducing to $\partial m(\gamma_i)/\partial\gamma' = G_i^{-1}$, the derivative of the inverse transformation between the reduced form parameters $\gamma_i$ and index vector $\mu_i$. However, in the overdetermined case $r > q$ this matrix reflects the effect of the $r$-dimensional deviations $\hat\gamma_i - \gamma_i$ of the reduced form estimators from their true values on the corresponding deviations $\hat\mu_i - \mu_i$ in the index estimates, an effect that is governed by the nature of the kernel weights $K((\hat\gamma_i - \hat\gamma_j)/h)$. For concreteness, suppose the kernel function takes the "elliptically symmetric" form

$$K(v) = c_r\,k(v'Qv)\,|Q|^{1/2}, \tag{3.43}$$

where $k(\cdot)$ is a scalar kernel, $Q$ is a nonnegative-definite matrix and $c_r$ is a normalizing constant depending only on $k$ and the dimension $r$ of the argument $v$; in this case, the "inverse Jacobian determinant" will reduce to

$$\delta_i = \frac{c_r}{c_q}\,|Q|^{1/2}\,|G_i'QG_i|^{-1/2} \tag{3.44}$$

and the "inverse Jacobian matrix" will become

$$M_i = (G_i'QG_i)^{-1}G_i'Q. \tag{3.45}$$

This latter form is analogous to the relevant Jacobian matrix for GMM estimation, with $Q$ representing the usual "limiting weight matrix" for the GMM criterion and $G_i$ serving the role of the expected derivatives of the moment functions with respect to the parameters.

The "regression residual" term $\tilde X_i$ in (3.42) is more familiar from the literature on single-index regression models, though its definition is complicated by the presence of the weight matrix $A_{ij}$ and its limiting form $\tilde A_i$.
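For a single index ($q = 1$), the expression $M_i = (G_i'QG_i)^{-1}G_i'Q$ for the elliptically symmetric kernel is simple enough to evaluate directly. The sketch below (the function name and its inputs are hypothetical, introduced only for illustration) computes this row vector and can be used to check that $M_iG_i = 1$, so that $M_i$ acts as a left inverse of the Jacobian, and that it reduces to $G_i^{-1}$ when $r = q = 1$.

```python
def inverse_jacobian_row(G, Q):
    """Row vector M = (G'QG)^{-1} G'Q for a single index (q = 1).

    G : list of the r derivatives of g with respect to the scalar index
    Q : r x r symmetric positive-definite matrix as a nested list
    """
    r = len(G)
    # Row vector G'Q (length r).
    GtQ = [sum(G[a] * Q[a][b] for a in range(r)) for b in range(r)]
    # Scalar G'QG.
    GtQG = sum(GtQ[b] * G[b] for b in range(r))
    return [v / GtQG for v in GtQ]


# Overdetermined example: r = 2 reduced form parameters, q = 1 index.
G = [2.0, 1.0]
Q = [[1.0, 0.0], [0.0, 1.0]]
M = inverse_jacobian_row(G, Q)
left_inverse_check = sum(m * g for m, g in zip(M, G))  # equals 1 up to rounding
```

Changing $Q$ changes how a deviation $\hat\gamma_i - \gamma_i$ is mapped into the index, but the product $M_iG_i$ stays equal to one, which is the same role the weight matrix plays in the GMM analogy noted above.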
In the special case in which A ij only depends on X i and X j through the reduced form parameters γ i and γ j or, equivalently, through the indices µ i and µ j, the matrix X i reduces to X i = X i E [X i X i θ 0, 3

15 a familiar object in the large-sample theory for estimators of single-index or semiparametric regression models. The representation 3.39 facilitates the derivation of an asymptotic distribution theory for the estimator ˆθ =, ˆβ of the index coeffi cients. Under the assumptions imposed below, the terms in this normalized average have zero mean and finite variance, so the Lindeberg-Levy central limit theorem implies that Ŝ θ 0 is asymptotically normal. However, this asymptotic distribution will be singular: since θ 0 X i = θ 0X. i Ã i = 0, lim E [ A ij θ 0X j X i, X j θ 0 = X i θ it follows that θ 0 Ŝ θ 0 = o p This further implies that the smallest in magnitude eigenvalue ˆν converges in probability to zero faster than the square root of the sample size, because ˆν = min α 0 α Ŝ α / α 2 θ 0Ŝ θ 0 / θ 0 2 = o p To derive the asymptotic distribution of the proposed estimator ˆβ in 2.29, the normalized difference of ˆβ and β 0 can be decomposed as ˆβ β 0 = [Ŝ 22 ˆνI n [Ŝ 2 Ŝ 22 ˆνI θ 0 = [Ŝ 22 ˆνI n ŝ n ˆν [Ŝ 22 ˆνI β 0, 3.50 where ŝ Ŝ 2 Ŝ 22 β 0 [Ŝ θ 0 2, 3.5 i.e., ŝ is the subvector of Ŝθ 0 corresponding to the free coeffi cients β 0. Using conditions 3.49, the same arguments as in Ahn and Powell 993 yield Ŝ 22 ˆν I p Σ 22, 3.52 where Σ 22 is the lower p p diagonal submatrix of Σ 0, and also ˆβ β 0 = [Σ 22 n ŝ + o,

16 from which the consistency and asymptotic normality of ˆθ follow from the asymptotic normality of Ŝ β 0. Specifically, ˆβ β0 d 0, Σ where Ω 22 is the lower p p diagonal submatrix of where ψ i is the influence function term in 3.39, i.e., 22 Ω 22Σ 22, 3.54 Ω E[ψ i ψ i, 3.55 ψ i 2φ i X iãim i ξ i, 3.56 A similar argument yields the asymptotic equivalence of the "closed form" estimator β of 2.30a and the estimator ˆβ, since ˆβ β = [Ŝ 22 ˆν I [Ŝ 22 Ŝ 2 = ν [Ŝ 22 ˆν I ˆθ 3.57 = o p, for ν an intermediate value between ˆν and zero. A final requirement for conducting the usual large-sample normal inference procedures is a consistent estimator of the asymptotic covariance matrix Σ 22 Ω 22Σ 22 of ˆθ. Estimation of Σ 22 is straightforward; by the results given above, either [Ŝ 22 ˆνI or Ŝ 22 will be consistent, with the former being more natural for ˆθ and the latter for θ. Consistent estimation of the matrix Ω is less straightforward; given a suitably-consistent estimator ˆψ i of the influence function term ψ i of 3.56, a corresponding estimator of Ω could be constructed as ˆΩ n n ˆψ i ˆψ i i= Estimation of ψ i requires a corresponding estimator of the influence function term ξ i for the estimator of the reduced-form parameters, which would depend upon the form of these parameters for example, when γ i = E [y i X i, a natural estimator of ξ i would be ˆξ i = y i ˆγ i = y i Ê [y i X i, Given an appropriate estimator of the first-stage error term ˆξ i,a similar Taylor s series argument as in Ahn and Powell 993 yields a candidate ˆψ i of an estimator of the second step influence function ψ i, ˆψ i 2 n ˆγi n h q+ D ˆγ j ˆξi X i X j A ij X i X j, 3.59 h i= 5

with $D(v) \equiv \partial K(v)/\partial v$. Verification of the consistency of $\hat\Omega$ under the conditions imposed below would follow the same strategy as described in Section 4 of Ahn and Powell (1993).

4. The Ideal Weight Matrix

The results of the preceding section raise the question of the optimal choice of weight matrix $A_{ij}$ to be used in the construction of $\hat S$ in (2.23) above. The usual efficiency arguments suggest that the optimal choice would yield equality (more precisely, proportionality) of the matrices $\Sigma_{22}$ and $\Omega_{22}$ characterizing the asymptotic covariance matrix of $\hat\theta$. The matrices $\Sigma_0$ and $\Omega_0$ can be expressed as
\[ \Sigma_0 = \lim E\big[\phi_i\,(X_i - X_j)'A_{ij}(X_i - X_j) \mid \mu_i = \mu_j\big] \tag{4.1} \]
\[ \phantom{\Sigma_0} = 2E\big[\phi_i\,\tilde X_i'\tilde A_i\tilde X_i\big] \tag{4.2} \]
and
\[ \Omega_0 = 4E\big[\phi_i^2\,\tilde X_i'\tilde A_i M_i V_i M_i'\tilde A_i\tilde X_i\big], \tag{4.3} \]
so the Gauss-Markov heuristic would suggest that the weight matrix $A_{ij}$ would be efficient if the corresponding limiting matrix $\tilde A_i$ equates $2\Sigma_0$ and $\Omega_0$, i.e., if
\[ \tilde A_i = \phi_i^{-1}\,[M_i V_i M_i']^{-1}, \tag{4.4} \]
with corresponding matrices
\[ 2\Sigma_0 = \Omega_0 = 4E\big[\tilde X_i'(M_i V_i M_i')^{-1}\tilde X_i\big]. \tag{4.5} \]
Unfortunately, construction of a weight matrix $A_n(X_i, X_j)$ yielding $\tilde A_i$ of (4.4) through the relation in (3.4) is apparently infeasible in the general setting in which $V_i = \mathrm{Var}[\xi_i \mid X_i]$ is a general function of the regressor matrix $X_i$. However, in the special case that $V_i$ is restricted to depend on $X_i$ only through $\gamma(X_i)$ or, equivalently, through the index vector $X_i\theta_0$, there is a matrix $A_{ij}$ yielding the optimal limiting weight matrix $\tilde A_i$ of (4.4); that is, if
\[ V_i \equiv \mathrm{Var}[\xi_i \mid X_i] = \mathrm{Var}[\xi_i \mid X_i\theta_0], \tag{4.6} \]
then a sequence of matrices $A_n(X_i, X_j)$ for which
\[ A_{ij} \equiv \lim A_n(X_i, X_j) = \big[\phi_i M_i V_i M_i' + \phi_j M_j V_j M_j'\big]^{-1} \tag{4.7} \]
would generate the limiting weight matrix in (4.4) and the corresponding equality of $2\Sigma_0$ and $\Omega_0$ in (4.5). The corresponding estimator $\hat\beta$ of the unnormalized slope coefficients $\beta$ would have the asymptotic normal distribution
\[ \sqrt{n}(\hat\beta - \beta_0) \xrightarrow{d} N\big(0,\ 4[\Omega_{22}]^{-1}\big) = N\Big(0,\ \big[E\,\tilde X_i'(M_i V_i M_i')^{-1}\tilde X_i\big]_{22}^{-1}\Big), \tag{4.8} \]
where now $\tilde X_i = X_i - E[X_i \mid X_i\theta_0]$ due to the restricted form of $A_n(X_i, X_j)$. In the just-identified setting with $r = q$, the matrices $M_i$ and $V_i$ are square, with $M_i = G_i^{-1}$, so the asymptotic distribution of $\hat\beta$ would simplify to
\[ \sqrt{n}(\hat\beta - \beta_0) \xrightarrow{d} N\Big(0,\ \big[E\,\tilde X_i'G_i'V_i^{-1}G_i\tilde X_i\big]_{22}^{-1}\Big). \tag{4.9} \]
Even when condition (4.6) is imposed, the Gauss-Markov motivation for (4.4) would only establish efficiency of the estimator $\hat\beta$ within the class of estimators defined by (2.23), (2.29), and (2.30a), not its efficiency among all regular estimators of $\beta_0$ under the restrictions imposed here. Still, for a special case, a stronger semiparametric efficiency result holds. The binary response model $y_i = 1\{X_i\theta_0 + \varepsilon_i > 0\}$ with independence of the error term $\varepsilon_i$ and the row vector of regressors $X_i$ will satisfy the index and invertibility restrictions imposed here, with $r = q = 1$ and $\gamma_i = E[y_i \mid X_i] = g(X_i\theta_0)$, where $g$ is the c.d.f. of $-\varepsilon_i$ (assumed invertible and smooth), $G_i = dg(X_i\theta_0)/d(X_i\theta_0) = G_i'$, and $V_i = \gamma_i(1 - \gamma_i)$. For this model, Klein and Spady (1993) propose an estimator of $\theta_0$ (up to scale) and show that it achieves the relevant semiparametric efficiency bound; with our normalization, that efficient estimator has the same asymptotic distribution as $\hat\beta$ in (4.9), implying that $\hat\beta$ is semiparametrically efficient for binary response under independence of the errors and regressors.
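The algebra behind (4.4), (4.5), and (4.9) can be checked numerically. The sketch below is only an illustration: the dimensions, the random draws, and every variable name are assumptions made for the example, not quantities from the paper. It confirms that, with $\tilde A_i = \phi_i^{-1}[M_iV_iM_i']^{-1}$, the summands of $2\Sigma_0$ in (4.2) and $\Omega_0$ in (4.3) coincide with the summand in (4.5), and that $(M_iV_iM_i')^{-1} = G_i'V_i^{-1}G_i$ when $r = q$ and $M_i = G_i^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
q, r, p1 = 2, 3, 4  # illustrative sizes: indices, reduced-form parameters, columns of X


def random_spd(k):
    """Return a random symmetric positive definite k x k matrix."""
    B = rng.normal(size=(k, k))
    return B @ B.T + k * np.eye(k)


# One hypothetical observation (all values are made up for the check).
X_tilde = rng.normal(size=(q, p1))   # residual regressors
M = rng.normal(size=(q, r))          # inverse-Jacobian term
V = random_spd(r)                    # reduced-form error variance
phi = 0.7                            # positive scalar weight

MVM = M @ V @ M.T
A_tilde = np.linalg.inv(MVM) / phi   # the weight choice in (4.4)

sigma_term = 2.0 * phi * X_tilde.T @ A_tilde @ X_tilde                       # summand of (4.2)
omega_term = 4.0 * phi**2 * X_tilde.T @ A_tilde @ MVM @ A_tilde @ X_tilde    # summand of (4.3)
target = 4.0 * X_tilde.T @ np.linalg.inv(MVM) @ X_tilde                      # summand of (4.5)

# Just-identified case r = q with M = G^{-1}, as used in (4.9).
G = np.array([[2.0, 0.5], [-0.3, 1.5]])   # an arbitrary invertible Jacobian
Vq = random_spd(q)
Mq = np.linalg.inv(G)
lhs = np.linalg.inv(Mq @ Vq @ Mq.T)
rhs = G.T @ np.linalg.inv(Vq) @ G
```

Because the identity holds observation by observation, it also holds for the expectations in (4.5); the check does not depend on how $\phi_i$, $M_i$, or $V_i$ are generated.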
For the just-determined multiple index model (2.6), Lee (1995) shows that $\tfrac14\Omega_{22}$ is the semiparametric analogue of the information matrix for estimation of $\beta_0$ when only the index restrictions (2.6) and smoothness of $g$ are imposed, so $\hat\beta$ would have the same asymptotic distribution as an efficient semiparametric estimator under these
index restrictions. However, the argument for consistency of $\hat\beta$ exploits the additional restriction of invertibility of the conditional expectations $g(X_i\theta_0)$ in the vector of indices $X_i\theta_0$, which is not required for consistency of Lee's (1995) semiparametric maximum likelihood estimator. Indeed, Thompson (1993) shows that, for the multinomial response model, the semiparametric information matrix under the assumption of independence of the errors and regressors generally exceeds $\tfrac14\Omega_{22}$ when $r > 1$, implying that the semiparametric efficiency of $\hat\beta$ only holds for $r = 1$ (binary response). As noted earlier, there is a close connection between the estimation approach proposed here and the "minimum chi-squared logit" estimator proposed for the binary logit by Berkson (1955) and extended to general parametric multinomial choice models by Amemiya (1976). For that estimation approach, it is assumed that the matrix of regressors $X_i$ is constant within each of a fixed number of groups, and a nonparametric estimator of the vector of choice probabilities $\gamma_i$ for each group is constructed from the observed choice frequencies for each group. With this setup, the relevant asymptotic covariance matrix $V_i$ takes the form
\[ V_i = \mathrm{diag}\{\gamma_i\} - \gamma_i\gamma_i', \tag{4.10} \]
and the information matrix is
\[ \mathcal{I} = E\big[X_i'G_i'V_i^{-1}G_iX_i\big]. \tag{4.11} \]
Amemiya (1976) shows that an efficient estimator of $\theta_0$ for this setup is the coefficient vector of the generalized least squares regression of $g^{-1}(\hat\gamma_i)$ on $X_i$, using the matrix $G_i'V_i^{-1}G_i$ (or a feasible version that replaces the unknown probabilities by the group frequencies) as the weighting matrix. The estimator $\hat\theta$ is a semiparametric analogue of this minimum chi-squared estimator, with asymptotic variance similar to (4.11) but with the regressors $X_i$ replaced by the residual regressors $\tilde X_i = X_i - E[X_i \mid X_i\theta_0]$.
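To make the grouped-data benchmark concrete, here is a minimal sketch of a Berkson-style minimum chi-squared estimator for the binary logit. It is an illustration under stated assumptions, not the paper's procedure: the design (8 groups, one scalar regressor plus an intercept), the group size, the coefficient values, and all variable names are hypothetical. For the binary logit, $G_i = \gamma_i(1-\gamma_i)$ and $V_i = \gamma_i(1-\gamma_i)$, so the weight $G_i'V_i^{-1}G_i$ reduces to $\gamma_i(1-\gamma_i)$, scaled here by the group size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grouped design: 8 groups, each with a constant regressor row (1, x_g).
x = np.linspace(-2.0, 2.0, 8)
X = np.column_stack([np.ones_like(x), x])
theta_true = np.array([0.5, -1.0])   # illustrative coefficients
n_g = np.full(x.shape, 5000)         # observations per group

# Observed choice frequencies for each group under a binary logit model.
p_true = 1.0 / (1.0 + np.exp(-(X @ theta_true)))
p_hat = rng.binomial(n_g, p_true) / n_g

# Inverse link g^{-1}(p) = log(p / (1 - p)). Its delta-method variance is
# 1 / (n_g * p * (1 - p)), so the feasible GLS weight is n_g * p_hat * (1 - p_hat).
logit_hat = np.log(p_hat / (1.0 - p_hat))
w = n_g * p_hat * (1.0 - p_hat)

# Weighted least squares regression of the logit frequencies on X.
XtWX = X.T @ (w[:, None] * X)
XtWy = X.T @ (w * logit_hat)
theta_mcs = np.linalg.solve(XtWX, XtWy)
```

The semiparametric estimator in the text replaces the group frequencies with a nonparametric first-step estimate and does not assume a known link $g$, which is why the residual regressors $\tilde X_i$ appear in its asymptotic variance.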
When the index restriction (4.6) on the reduced-form error variance holds and the model is overdetermined ($r > q$), there is another avenue to reduce the asymptotic variance of $\hat\beta$, namely, the choice of kernel function $K$, which affects the asymptotic distribution of $\hat\beta$ through the inverse Jacobian matrix term $M_i$. When this kernel function takes the elliptically symmetric form (3.43) and the weight matrix is the infeasible $A_{ij}$ of (4.4), the asymptotic variance of $\hat\beta$ is the inverse of the lower $p \times p$
diagonal submatrix of
\[ \tfrac14\Omega = E\big[\tilde X_i'(M_iV_iM_i')^{-1}\tilde X_i\big] = E\big[\tilde X_i'\,G_i'QG_i\,(G_i'QV_iQG_i)^{-1}\,G_i'QG_i\,\tilde X_i\big], \tag{4.12} \]
using the expression for $M_i$ in ??. If $V_i$ were constant ($V_i \equiv V_0$), the usual Gauss-Markov heuristic would choose $Q$ to maximize this matrix by equating $G_i'QG_i$ and $G_i'QV_0QG_i$, i.e., by setting $Q = Q^* = V_0^{-1}$, so that the inverse of the asymptotic covariance matrix of $\hat\beta$ would reduce to
\[ \tfrac14\Omega = E\big[\tilde X_i'G_i'V_0^{-1}G_i\tilde X_i\big]. \]
When the covariance matrix $V_i$ varies with $X_i$ but only as a function of $\gamma_i = g(X_i\theta_0)$, this reduction would require a modification of the estimator $\hat S$ in (2.23), replacing the fixed kernel $K$ with a varying elliptically symmetric kernel $K_{ij}$ of the form
\[ K_{ij}(v) = c_r\,k(v'Q_{ij}v)\,|Q_{ij}|^{1/2}, \tag{4.13} \]
with
\[ Q_{ij} \equiv [V_i + V_j]^{-1}. \tag{4.14} \]
A modification of the arguments for the consistency and asymptotic normality of $\hat\beta$ would yield the same results for the improved estimator $\hat\beta$ using the weight matrix $A_{ij}$ in (4.7) and the varying kernel $K_{ij}$ in (4.13), yielding the same form for the asymptotic distribution,
\[ \sqrt{n}(\hat\beta - \beta_0) \xrightarrow{d} N\Big(0,\ \big[E\,\tilde X_i'G_i'V_i^{-1}G_i\tilde X_i\big]_{22}^{-1}\Big), \tag{4.15} \]
whether the model is just-determined or overdetermined. Of course, even when condition (4.6) is imposed, the efficient estimator $\hat\theta$ using the ideal weight matrix in (4.7) and optimal kernel in (4.13) would be infeasible in two respects: the limiting matrix $A_{ij}$ does not specify the trimming mechanism needed to achieve the uniform nonparametric rate of convergence of $\hat\gamma_i$ to $\gamma_i$ specified in ??, and the construction of $\hat\theta$ involves unknown nuisance parameters via the joint density function $f_\mu(\mu_i)$ of the indices, the Jacobian matrix $G_i = \partial g(\mu_i)/\partial\mu'$, and the reduced-form covariance matrix $V_i$. Following Ichimura (2004), the trimming requirement can be imposed by multiplying the limiting $A_{ij}$ by a trimming term $t_{ij}$ of the form
\[ t_{ij} \equiv 1\{f_X(X_i) > b_n\}\cdot 1\{f_X(X_j) > b_n\}, \tag{4.16} \]
where $f_X(X_i)$ is the joint density function of the regressors and $b_n$ is a bandwidth term declining to zero at an appropriate rate.
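A feasible analogue of the trimming term in (4.16) can be sketched as follows. This is a hedged illustration only: it replaces the unknown density $f_X$ with a product-Gaussian kernel density estimate, treats each $X_i$ as a row vector, and uses a hypothetical bandwidth and threshold; the function names `gaussian_kde_at_sample` and `trimming_matrix` are invented for the example.

```python
import numpy as np


def gaussian_kde_at_sample(X, bandwidth):
    """Product-Gaussian kernel density estimate evaluated at each sample point."""
    n, d = X.shape
    diffs = (X[:, None, :] - X[None, :, :]) / bandwidth   # n x n x d scaled differences
    kern = np.exp(-0.5 * np.sum(diffs**2, axis=2))        # unnormalised product kernel
    norm = (2.0 * np.pi) ** (d / 2.0) * bandwidth**d
    return kern.sum(axis=1) / (n * norm)


def trimming_matrix(X, bandwidth, b):
    """t_ij = 1{f_hat(X_i) > b} * 1{f_hat(X_j) > b}, mirroring the form of (4.16)."""
    keep = gaussian_kde_at_sample(X, bandwidth) > b
    return np.outer(keep, keep).astype(float)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = np.vstack([X, [[10.0, 10.0]]])   # one isolated point in a low-density region
t = trimming_matrix(X, bandwidth=0.5, b=0.01)
```

Every pair involving the isolated point receives weight zero, so it drops out of the pairwise sums; how fast the threshold may shrink with the sample size is governed by the rate conditions in the text and is not addressed by this sketch.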
Ichimura (2004) discusses conditions under which this trimming mechanism
will satisfy the requirements in (3.32) and also under which the asymptotic distribution of $\hat S$ will be unaffected by replacement of the density $f_X(X_i)$ in (4.16) by a consistent nonparametric kernel estimator $\hat f_X(X_i)$. As noted above, construction of a consistent estimator $\hat V_i$ of $V_i$ will depend upon the form of the reduced-form estimators $\hat\gamma_i$; the remaining nuisance parameters $f_\mu(\mu_i)$ and $G_i$ could also be consistently estimated using a kernel density estimator applied to the estimated indices $\hat\mu_i = X_i\hat\theta$ and the derivative of a kernel regression of $\hat\gamma_i$ on $\hat\mu_i$, respectively. Demonstration that an estimator using a feasible version $\hat A_{ij}$ of $A_{ij}$ would achieve the same asymptotic distribution (4.15) as the ideal estimator $\hat\beta$ would involve strengthening of the regularity conditions and derivations similar to those in the Appendix.

5. Appendix

The derivations of the consistency of $\hat S$ for $\Sigma_0$ in (3.34) and the asymptotic linearity representation (3.39) for $\hat S\theta_0$ follow the same lines as in Ahn and Powell (1993) and Blundell and Powell (2004). Regarding $\hat S$ as a function of the reduced-form estimators $\hat\gamma_i$ and $\hat\gamma_j$, a second-order Taylor's series expansion around the true reduced-form parameters $\gamma_i$ and $\gamma_j$ yields
\[ \hat S = S_0 + S_1 + S_2, \tag{5.1} \]
where
\[
\begin{aligned}
S_0 &\equiv \binom{n}{2}^{-1}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} h^{-q}\,K\Big(\frac{\gamma_i-\gamma_j}{h}\Big)(X_i-X_j)'A_{ij}(X_i-X_j), \\
S_1 &\equiv \binom{n}{2}^{-1}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} h^{-(q+1)}\,D\Big(\frac{\gamma_i-\gamma_j}{h}\Big)'(\hat\gamma_i-\gamma_i-\hat\gamma_j+\gamma_j)\,(X_i-X_j)'A_{ij}(X_i-X_j),
\end{aligned}
\tag{5.2}
\]
and
\[ S_2 \equiv \frac12\binom{n}{2}^{-1}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} h^{-(q+2)}\,(\hat\gamma_i-\gamma_i-\hat\gamma_j+\gamma_j)'\,H\Big(\frac{\gamma_i^*-\gamma_j^*}{h}\Big)(\hat\gamma_i-\gamma_i-\hat\gamma_j+\gamma_j)\,(X_i-X_j)'A_{ij}(X_i-X_j), \]
where $D(v) \equiv \partial K(v)/\partial v$ and $H(v) \equiv \partial^2 K(v)/\partial v\,\partial v'$, and where $\gamma_i^*$ and $\gamma_j^*$ are intermediate values. The first term $S_0$ is a second-order U-statistic with a summand depending on the sample size through the bandwidth term $h$. We assume that the terms in $(X_i-X_j)'A_{ij}(X_i-X_j)$ have bounded second moments and have conditional expectations that are smooth functions of the underlying indices
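$X_i\theta_0$ and $X_j\theta_0$, and also that the kernel function $K$ satisfies some common regularity conditions: it integrates to one and is bounded and continuous with bounded derivatives. Under such conditions, the second moment of the terms in the summand for $S_0$ will be of order $h^{-q}$, so by Lemma 3.1 of Powell, Stock, and Stoker (1989), the term $S_0$ will satisfy a weak law of large numbers,
\[ S_0 - E[S_0] \xrightarrow{p} 0, \tag{5.3} \]
provided
\[ h = h_n = o(1), \qquad h^{-q} = o(n). \tag{5.4} \]
Calculating the expectation of the summand in $S_0$, first conditioning on $\mu_i = X_i\theta_0$ and $\mu_j$ and then making the change of variables $u = (\mu_i - \mu_j)/h$, yields
\[
\begin{aligned}
E[S_0] &= E\Big[h^{-q}K\Big(\frac{g(\mu_i)-g(\mu_j)}{h}\Big)\,E\big[(X_i-X_j)'A_{ij}(X_i-X_j)\mid\mu_i,\mu_j\big]\Big] \\
&= E\Big[\int K\Big(\frac{g(\mu_i)-g(\mu_i-hu)}{h}\Big)\,E\big[(X_i-X_j)'A_{ij}(X_i-X_j)\mid\mu_i,\mu_j=\mu_i-hu\big]\,f_\mu(\mu_i-hu)\,du\Big] \\
&\to E\Big[\int K(G_iu)\,du\;E\big[(X_i-X_j)'A_{ij}(X_i-X_j)\mid\mu_i,\mu_j=\mu_i\big]\,f_\mu(\mu_i)\Big] \\
&= E\Big[\delta_i f_\mu(\mu_i)\,E\big[(X_i-X_j)'A_{ij}(X_i-X_j)\mid\mu_i,\mu_j=\mu_i\big]\Big] \equiv \Sigma_0.
\end{aligned}
\tag{5.5}
\]
Under the rate condition (3.32) for the uniform convergence of the reduced-form estimators, the remaining terms $S_1$ and $S_2$ in the decomposition (5.1) satisfy
\[ \|S_1\| \le 2C\,h^{-(q+1)}\binom{n}{2}^{-1}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\big\|(X_i-X_j)'A_{ij}(X_i-X_j)\big\|\cdot\Big[\max_{i,j:\,A_n(X_i,X_j)\neq 0}\|\hat\gamma(X_i)-\gamma(X_i)\|\Big] = o_p\big(n^{-3/8}h^{-(q+1)}\big), \tag{5.6} \]
\[ \|S_2\| \le 2C\,h^{-(q+2)}\binom{n}{2}^{-1}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\big\|(X_i-X_j)'A_{ij}(X_i-X_j)\big\|\cdot\Big[\max_{i,j:\,A_n(X_i,X_j)\neq 0}\|\hat\gamma(X_i)-\gamma(X_i)\|\Big]^2 = o_p\big(n^{-3/4}h^{-(q+2)}\big), \tag{5.7} \]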
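where $C$ is an upper bound for the kernel $K$ and its first two derivatives. As long as the bandwidth $h$ converges to zero sufficiently slowly so that
\[ h^{-1} = o\big(n^{1/[4(q+2)]}\big), \tag{5.8} \]
it will follow that
\[ S_1 = o_p(1) \tag{5.9} \]
and
\[ S_2 = o_p(n^{-1/2}), \tag{5.10} \]
implying
\[ \hat S \xrightarrow{p} \Sigma_0. \tag{5.11} \]
With the identification requirement that the lower $p \times p$ submatrix $\Sigma_{22}$ has full rank, consistency of $\hat\beta$ follows from $\Sigma_0\theta_0 = 0$. To obtain the asymptotic linearity condition (3.39), we use the same decomposition (5.1), yielding
\[ \sqrt{n}\,\hat S\theta_0 = \sqrt{n}\,S_0\theta_0 + \sqrt{n}\,S_1\theta_0 + \sqrt{n}\,S_2\theta_0 \tag{5.12} \]
\[ \phantom{\sqrt{n}\,\hat S\theta_0} = \sqrt{n}\,S_0\theta_0 + \sqrt{n}\,S_1\theta_0 + o_p(1) \tag{5.13} \]
by (5.10). The first term is a second-order smoothed U-statistic, so under condition (5.8), which implies (5.4), Lemma 3.1 of Powell, Stock, and Stoker (1989) yields
\[ \sqrt{n}\,S_0\theta_0 = \sqrt{n}\,E[S_0\theta_0] + \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big[\rho_i - E[\rho_i]\big] + o_p(1), \tag{5.14} \]
where
\[
\begin{aligned}
\rho_i &\equiv E\Big[h^{-q}K\Big(\frac{g(\mu_i)-g(\mu_j)}{h}\Big)(X_i-X_j)'A_{ij}(\mu_i-\mu_j)\,\Big|\,X_i\Big] \\
&= \int K\Big(\frac{g(\mu_i)-g(\mu_i-hu)}{h}\Big)\,E\big[(X_i-X_j)'A_{ij}\,(hu)\mid X_i,\mu_j=\mu_i-hu\big]\,f_\mu(\mu_i-hu)\,du,
\end{aligned}
\tag{5.15}
\]
again using the change of variables $u = (\mu_i - \mu_j)/h$ for $\mu_j$. Since the second moment of $\rho_i$ is of order $h^2$, it follows that
\[ \sqrt{n}\,S_0\theta_0 = \sqrt{n}\,E[S_0\theta_0] + o_p(1), \tag{5.16} \]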
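that is, the only contribution of the leading term on the right-hand side of (5.12) is a bias term $\sqrt{n}\,E[S_0\theta_0]$. Under the assumption that the expectation $E[S_0\theta_0] = E[S_0\theta_0](h)$ is a sufficiently smooth function of the bandwidth parameter $h$, it will admit a power-series expansion of the form
\[ E[S_0\theta_0](h) = \sum_{j=1}^{L}\Gamma_j h^j + o(n^{-1/2}) \tag{5.17} \]
as long as $L$ is sufficiently large so that $h^{L+1} = o(n^{-1/2})$ while (5.8) is satisfied. Assuming the expansion (5.17) is available, a kernel function $K$ can be constructed to ensure that the coefficient matrices $\Gamma_j$ equal zero for $j = 1, \ldots, L$ using the jackknife approach described in Honoré and Powell (2005). That approach would replace the matrix $\hat S = \hat S(h)$ by a particular linear combination of matrices $\{\hat S(c_jh)\}$ for distinct positive constants $c_1, \ldots, c_L$, which is algebraically equivalent to use of a "higher-order kernel" constructed using the generalized jackknife approach of Schucany and Sommers (1977). With this choice of kernel, the bias term $E[S_0\theta_0] = o(n^{-1/2})$, which, together with (5.12) and (5.16), yields
\[ \sqrt{n}\,\hat S\theta_0 = \sqrt{n}\,S_1\theta_0 + o_p(1). \tag{5.18} \]
To verify the asymptotic linearity expression (3.39) for the remaining term $\sqrt{n}\,S_1\theta_0$, we insert the corresponding asymptotic linearity expression (2.2) into the expression for $S_1$, obtaining
\[
\begin{aligned}
\sqrt{n}\,S_1\theta_0 &= \sqrt{n}\binom{n}{2}^{-1}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} h^{-(q+1)}D\Big(\frac{\gamma_i-\gamma_j}{h}\Big)'\Big[\frac1n\sum_{l=1}^{n}\big(R_{nl}(X_i)-R_{nl}(X_j)\big)\xi_l\Big](X_i-X_j)'A_{ij}(\mu_i-\mu_j) \\
&\quad + \sqrt{n}\binom{n}{2}^{-1}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} h^{-(q+1)}D\Big(\frac{\gamma_i-\gamma_j}{h}\Big)'(r_{ni}-r_{nj})\,(X_i-X_j)'A_{ij}(\mu_i-\mu_j) \\
&\equiv T_1 + T_2.
\end{aligned}
\tag{5.19}
\]
The second term is asymptotically negligible under condition (3.33), since
\[
\begin{aligned}
\|T_2\| &\le \sqrt{n}\,C\,h^{-(q+1)}\binom{n}{2}^{-1}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\big\|(X_i-X_j)'A_{ij}(\mu_i-\mu_j)\big\|\cdot\max_{i,j}\,t_{ij}\big[\|r_{ni}\|+\|r_{nj}\|\big] \\
&= O_p\big(n^{1/2}h^{-(q+1)}\big)\cdot o_p\big(n^{-3/4}\big) = o_p\big(n^{-1/4}h^{-(q+1)}\big) = o_p(1),
\end{aligned}
\tag{5.20}
\]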
where the last line follows from condition (5.8) above, with $C$ an upper bound for $\|D\|$. Finally, apart from negligible terms with $i = l$ or $j = l$, the term $T_1$ can be rewritten as a third-order smoothed U-statistic, i.e., it is of the form
\[ T_1 = \sqrt{n}\binom{n}{3}^{-1}\sum_{i<j<l}\kappa_{ijl} + o_p(1), \tag{5.21} \]
where
\[ \kappa_{ijl} \equiv h^{-(q+1)}D\Big(\frac{\gamma_i-\gamma_j}{h}\Big)'\big[R_{nl}(X_i)-R_{nl}(X_j)\big]\xi_l\,(X_i-X_j)'A_{ij}(\mu_i-\mu_j). \tag{5.22} \]
Though the kernel $\kappa_{ijl}$ is not symmetric in the indices $i$, $j$, and $l$, it can be symmetrized in the same way as in the proof of Theorem 5.1 of Ahn and Powell (1993), by rewriting $T_1$ as
\[ T_1 = \sqrt{n}\binom{n}{3}^{-1}\sum_{i<j<l}\bar\kappa_{ijl} + o_p(1), \tag{5.23} \]
where
\[ \bar\kappa_{ijl} \equiv \tfrac16\big(\kappa_{ijl}+\kappa_{ilj}+\kappa_{jil}+\kappa_{jli}+\kappa_{lij}+\kappa_{lji}\big). \]
A crude bound for the second moment of $\bar\kappa_{ijl}$ is
\[ E\big[\|\bar\kappa_{ijl}\|^2\big] = O\big(E[\|R_{nj}(X_i)\|^2]\big)\cdot O\big(h^{-2(q+1)}\big) = O\big(n^{1/2}h^{-2(q+1)}\big) \]
under assumption (2.22), so this second moment will grow more slowly than the sample size under (5.8), which implies $h^{-2(q+1)} = o(\sqrt{n})$. Thus the projection lemma for smoothed U-statistics (Lemma A.3 of Ahn and Powell, 1993) implies that the third-order U-statistic component of $T_1$ can be replaced by its projection, i.e.,
\[ T_1 = \frac{6}{\sqrt{n}}\sum_{i=1}^{n}E\big[\bar\kappa_{ijl}\mid\xi_i,X_i\big] + o_p(1). \]
Calculations analogous to those on pages 26 through 28 of Ahn and Powell (1993) establish that
\[ E\Big[\big\|E[\bar\kappa_{ijl}\mid\xi_i,X_i]-\tfrac16\psi_i\big\|\Big] = o(n^{-1/2}), \]
where $\psi_i$ is the influence function term defined in (3.56), establishing the asymptotic linearity relation (3.39).
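The symmetrization step used for the third-order U-statistic kernel can be illustrated in a few lines. The sketch below is generic and not tied to the paper's kernel: the asymmetric function `kappa`, the data vector `z`, and the helper `symmetrize` are hypothetical names chosen for the example. It shows that averaging a three-argument kernel over the six orderings of its indices gives a symmetric kernel whose sum over $i<j<l$ equals one sixth of the raw kernel summed over all ordered distinct triples.

```python
import itertools
import numpy as np


def symmetrize(kappa, i, j, l):
    """Average a three-argument kernel over all six orderings of (i, j, l)."""
    perms = itertools.permutations((i, j, l))
    return sum(kappa(a, b, c) for a, b, c in perms) / 6.0


# Hypothetical asymmetric kernel built from arbitrary data, for illustration only.
rng = np.random.default_rng(0)
z = rng.normal(size=6)


def kappa(a, b, c):
    return z[a] * z[b] ** 2 + np.sin(z[c])


n = len(z)
triples = list(itertools.combinations(range(n), 3))   # all i < j < l

# Sum of the symmetrized kernel over i < j < l ...
sym_sum = sum(symmetrize(kappa, i, j, l) for i, j, l in triples)
# ... equals one sixth of the raw kernel summed over all ordered distinct triples.
raw_sum = sum(kappa(a, b, c) for a, b, c in itertools.permutations(range(n), 3)) / 6.0
```

Working with the symmetric version is what allows the standard projection argument for U-statistics to be applied, since that argument conditions on a single observation regardless of the position of its index.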

6. References

Ahn, H., H. Ichimura, and J.L. Powell, 1996, "Simple Estimators for Monotone Index Models," manuscript, Department of Economics, University of California at Berkeley.
Ahn, H. and J.L. Powell, 1993, "Semiparametric estimation of censored selection models with a nonparametric selection mechanism," Journal of Econometrics, 58.
Amemiya, T., 1976, "The maximum likelihood, the minimum chi-square, and the nonlinear weighted least-squares estimator in the general qualitative response model," Journal of the American Statistical Association, 71.
Berkson, J., 1955, "Maximum likelihood and minimum χ² estimates of the logistic function," Journal of the American Statistical Association, 50.
Blundell, R.W. and J.L. Powell, 2004, "Endogeneity in Semiparametric Binary Response Models," Review of Economic Studies, 71.
Han, A.K., 1987, "Non-parametric analysis of a generalized regression model: the maximum rank correlation estimator," Journal of Econometrics, 35.
Härdle, W. and T.M. Stoker, 1989, "Investigating smooth multiple regression by the method of average derivatives," Journal of the American Statistical Association, 84.
Härdle, W. and J. Horowitz, 1996, "Direct Semiparametric Estimation of Single-Index Models With Discrete Covariates," Journal of the American Statistical Association, 91.
Honoré, B.E. and J.L. Powell, 2005, "Pairwise Difference Estimators for Nonlinear Models," in Andrews, D.W.K. and J.H. Stock, eds., Identification and Inference for Econometric Models, New York: Cambridge University Press.
Ichimura, H., 1993, "Semiparametric least squares (SLS) and weighted SLS estimation of single-index models," Journal of Econometrics, 58.
Ichimura, H., 2004, "Computation of Asymptotic Distribution for Semiparametric GMM Estimators," manuscript, Department of Economics, University College London.
Ichimura, H. and L-F. Lee, 1991, "Semiparametric least squares estimation of multiple index models: single equation estimation," in W.A. Barnett, J.L. Powell, and G. Tauchen, eds., Nonparametric and semiparametric methods in econometrics and statistics, Cambridge: Cambridge University Press.
Klein, R.W. and R.S. Spady, 1993, "An efficient semiparametric estimator of the binary response model," Econometrica, 61.
Lee, L-F., 1995, "Semiparametric maximum likelihood estimation of polychotomous and sequential choice models," Journal of Econometrics, 65.
Manski, C.F., 1975, "Maximum score estimation of the stochastic utility model of choice," Journal of Econometrics, 3.
Manski, C.F., 1985, "Semiparametric analysis of discrete response: asymptotic properties of the maximum score estimator," Journal of Econometrics, 27.
Newey, W.K. and P.A. Ruud, 2005, "Density weighted least squares estimation," in Andrews, D.W.K. and J.H. Stock, eds., Identification and Inference for Econometric Models, New York: Cambridge University Press.
Newey, W.K. and T.M. Stoker, 1993, "Efficiency of weighted average derivative estimators and index models," Econometrica, 61.
Powell, J.L., J.H. Stock and T.M. Stoker, 1989, "Semiparametric estimation of weighted average derivatives," Econometrica, 57.
Ruud, P.A., 1986, "Consistent estimation of limited dependent variable models despite misspecification of distribution," Journal of Econometrics, 32.
Ruud, P.A., 2000, "Semiparametric estimation of discrete choice models," manuscript, Department of Economics, University of California at Berkeley.
Schucany, W.R. and J.P. Sommers, 1977, "Improvement of kernel type density estimators," Journal of the American Statistical Association, 72.
Thompson, T.S., 1993, "Some efficiency bounds for semiparametric discrete choice models," Journal of Econometrics, 58.


Simple Estimators for Invertible Index Models Hyungtaik Ahn Dongguk University Hidehiko Ichimura University of Tokyo Paul A. Ruud Vassar College March 2015 James L. Powell University of California, Berkeley