Rank Estimation of Partially Linear Index Models


Jason Abrevaya (University of Texas at Austin) and Youngki Shin (University of Western Ontario)

October 2008. Preliminary; do not distribute.

Abstract: We consider a generalized regression model with a partially linear index. The index contains an additive nonparametric component in addition to the standard linear component, and the model's dependent variable is transformed by an unknown monotone function. We propose weighted rank estimation procedures for estimating (i) the coefficients of the linear component, (ii) the nonparametric component (and its derivative), and (iii) the average derivative of the nonparametric component. The proposed estimation method is applied to an empirical study of the relationship between household income and children's cognitive development.

JEL Classification: C13, C14. Keywords: rank estimation, transformation model, partially linear model, average derivative effect.

1 Introduction

The semi-linear (or partially linear) specification has received considerable attention in the econometrics and statistics literature due to the compromise that it offers between a purely linear specification and a purely nonparametric specification. A linear specification may not provide enough flexibility to describe the way in which a covariate (or multiple covariates) affects the dependent

variable. While a nonparametric approach avoids this lack of flexibility, the curse of dimensionality (causing a slower convergence rate) hampers practical application in the presence of many covariates. The semi-linear specification minimizes the curse of dimensionality by allowing flexibility with respect to a small number of covariates (often just one) and restricting other covariates to enter through a linear-index structure.

In this paper, we extend the previous literature on semi-linear specifications by providing an estimation methodology for a general semi-linear regression model in which no parametric assumptions are made on the error disturbance. The model considered is a semi-linear version of the generalized regression model introduced by Han (1987); as such, the model includes the binary-choice model, the transformation model (with unspecified link function), and various other non-linear models as special cases.

The literature on semi-linear models naturally began by extending the linear regression model to include a non-parametric component, as follows:

    $Y = X'\beta_0 + g_0(W) + \epsilon$.    (1)

Several approaches, including Powell (1987) and Robinson (1988), were proposed for $\sqrt{n}$-consistent estimation of $\beta_0$ (and, in turn, non-parametric estimation of $g_0(\cdot)$). The estimators of Powell (1987) and Robinson (1988) do not require parametric specification of the error disturbance. The Powell (1987) estimator of $\beta_0$ is based upon pairwise differences of the data (with weights based upon similarity in $W$ values), essentially regressing $Y_i - Y_j$ on $X_i - X_j$ for pairs with $W_i \approx W_j$.[1] The Robinson (1988) estimator instead is constructed through kernel estimators of $E(Y|W)$ and $E(X|W)$, regressing $Y_i - \hat{E}(Y_i|W_i)$ on $X_i - \hat{E}(X_i|W_i)$. While the literature on estimation of (1) has been rather extensive, the literature on limited-dependent-variable versions of this model has been comparatively limited.
For the case in which

[1] Ahn and Powell (1993) used this idea in order to estimate a selection model with a non-parametric selection equation. The outcome equation for this model has a non-parametric selection correction term that causes it to take the form of equation (1).

error disturbances are parametrically specified, Emond and Self (1997) and Honoré and Powell (2005) (see also Aradillas-Lopez, Honoré, and Powell (2007)) have provided general approaches for estimation. For limited-dependent-variable semi-linear models, there are relatively few estimators for the case in which errors are not parametrically specified. Chen and Khan (2001) provide an estimator for the semi-linear censored regression model. The multiple-index estimator of Ichimura and Lee (1991), although not expressly developed for estimation of the semi-linear model, could be used for estimation of this model in fairly general settings. Our proposed estimation approach is similar in spirit to Ichimura and Lee (1991), but our focus extends beyond the linear-component coefficient parameter to estimation of the nonparametric component (and its derivative).[2]

Section 2 of the paper introduces the model and the estimation method and provides asymptotic properties for the proposed estimators. Section 3 considers different variants of the model in Section 2 and proposes estimators for these models. Section 4 presents Monte Carlo simulation evidence to investigate the finite-sample properties of the estimation method. Section 5 presents an empirical study of the relationship between household income and children's academic achievement. Technical assumptions and proofs are left for the Appendix.

2 Model and Estimators

In this section, we introduce a semiparametric partially linear index model, propose an estimation procedure, and present the asymptotic properties of the estimators. Consider a set of random variables $(Y, X, W)$ that satisfy the following regression model:

    $T(Y) = X'\beta_0 + g_0(W) + \epsilon$    (2)

[2] In addition, our estimation approach requires the choice of fewer smoothing parameters.

where $T(\cdot)$ is a non-degenerate monotone function and $g_0(\cdot)$ is a smooth function. Since $T(\cdot)$ need not be strictly monotone, the model in (2) includes the binary choice model, as well as other examples of the generalized regression model in Han (1987). For the linear component of the model, $X$ is a $k \times 1$ vector of regressors (as is its associated coefficient vector $\beta_0$). The (potentially) non-linear regressor $W$, taken to be scalar for simplicity, is linearly independent of $X$. Since the function $T(\cdot)$ is left unspecified in (2), the coefficient vector $\beta_0$ is identified only up-to-scale (and without an intercept) and the function $g_0(\cdot)$ is identified relative to $\beta_0$. We adopt the following normalizations: the first component of $\beta_0$ is fixed equal to one ($\beta_{0,1} = 1$) and the location of $g_0(\cdot)$ is pinned down (specifically, $g_0(w_0) = 0$ for some chosen $w_0$).[3]

Our primary interest lies in estimating the following: the parameter vector $\beta_0$, the function $g_0(\cdot)$, the local-derivative function $g_0^{(1)}(\cdot)$, and the average derivative $E[g_0^{(1)}(\cdot)]$. (The number in the superscript parentheses will be used to denote the order of the derivative.) In the same way that $\beta_0$ gives us the relative effects of the different $X$ components, the local-derivative function $g_0^{(1)}(\cdot)$ and the average derivative $E[g_0^{(1)}(\cdot)]$ give us the relative local effect of $W$ (i.e., relative to the $X$ components) and the relative average effect over the population.

The proposed estimation procedure consists of three steps. First, the coefficient vector $\beta_0$ is estimated using a local (conditional) rank estimator. Second, using $\hat{\beta}$ from the first step, the functions $g(\cdot)$ and $g^{(1)}(\cdot)$ are estimated via another local rank estimator. Third, with an estimate $\hat{g}^{(1)}(W_i)$ for each $W_i$, the average derivative $E[g_0^{(1)}(\cdot)]$ is easily estimated by its sample analogue.

To estimate $\beta_0$, the estimation approach uses information contained in those observation-pairs with $W_i \approx W_j$.
This approach is motivated by the following relationship implied by the original model:

    $X_i'\beta_0 > X_j'\beta_0 \iff P(Y_i > Y_j \mid X_i, X_j, W_i = W_j) > P(Y_i < Y_j \mid X_i, X_j, W_i = W_j)$.    (3)

[3] We take $\beta_{0,1}$ to be positive without loss of generality. If the first component of $X$ has a negative coefficient, the negative of this covariate would have a positive one.

The parameter vector $\beta_0$ maximizes the conditional rank correlation between $Y_i$ and $X_i'\beta$, defined as

    $Q(\beta) = E\left[\,1(Y_i > Y_j)1(X_i'\beta > X_j'\beta) + 1(Y_i < Y_j)1(X_i'\beta < X_j'\beta) \mid W_i = W_j\,\right]$    (4)

where $1(\cdot)$ is an indicator function. Using the analogy principle, the estimator of $\beta_0$ maximizes the sample analogue of (4):

    $\hat{\beta} = \arg\max_{\beta \in B} Q_{1n}(\beta) = \arg\max_{\beta \in B} \frac{1}{n(n-1)} \sum_{i \neq j} 1(Y_i > Y_j)\,1(X_i'\beta > X_j'\beta)\,K_h(W_i - W_j)$    (5)

where $K_h(u) = h^{-1}K(u/h)$ for a kernel function $K(\cdot)$ and bandwidth $h$. The parameter space $B$, following the normalization discussed previously, is a compact subset of $\{\beta \in \mathbb{R}^k : \beta_1 = 1\}$. The estimator defined by (5) is a conditional rank correlation estimator, as it (asymptotically) conditions on the event $W_i = W_j$. The idea of conditional rank correlation estimation has been considered in other contexts by Abrevaya, Khan, and Hausman (2008) and Shin (2007).

Since the assumptions required to derive the asymptotic properties of $\hat{\beta}$ are quite standard, we leave them for the Appendix. In addition, all proofs have been collected in the Appendix. The following theorem states the consistency and asymptotic-normality results:

Theorem 2.1 Under Assumptions A1-A7 listed in the Appendix,
(i) $\hat{\beta} \stackrel{p}{\to} \beta_0$;
(ii) $\sqrt{n}(\hat{\beta} - \beta_0) \stackrel{d}{\to} N(0, V^{-1}\Delta V^{-1})$,
where $2V = E[\nabla^2 \tau_1(\cdot\,; \beta_0)]$ and $\Delta = E[\nabla_1 \tau_1(\cdot\,; \beta_0)\,\nabla_1 \tau_1(\cdot\,; \beta_0)']$, for

    $\tau_1(z; \beta) = E[1(y > Y)\,1(x'\beta > X'\beta)\,K_h(w - W)] + E[1(Y > y)\,1(X'\beta > x'\beta)\,K_h(W - w)]$.

For inference, one can estimate the variance matrix using numerical derivatives (see, for example, Sherman (1993) and Khan and Tamer (2007)). For our empirical example, however, we prefer using

a bootstrap method, since the numerical-derivative method tends to be sensitive to bandwidth choice. Subbotin (2007) has recently shown the validity of the bootstrap for the rank correlation estimator of Han (1987). Although this result has not yet been formally extended to the kernel-weighted version of this estimator, there is no reason to believe that the bootstrap would be invalid in this context.

Next, we turn our attention to estimating $g(\cdot)$ and $g^{(1)}(\cdot)$. Using the first-stage estimator of $\beta_0$, these functions will be estimated by exploiting another conditional rank correlation. In this step, we localize the objective function around two points along the support of $W$: an estimation point of the function (call it $w$) and the location-normalization point $w_0$ (for which $g_0(w_0) = 0$). For an observation-pair with $W_i = w$ and $W_j = w_0$, the following relationship is implied by the model:

    $X_i'\beta_0 + g_0(w) > X_j'\beta_0 \iff \Pr(Y_i > Y_j \mid X_i, X_j, W_i = w, W_j = w_0) > \Pr(Y_i < Y_j \mid X_i, X_j, W_i = w, W_j = w_0)$.    (6)

To estimate $g_0(\cdot)$ and $g_0^{(1)}(\cdot)$ simultaneously, a local-polynomial estimation method is utilized.[4] First, expanding $g(W_i)$ around $w$ yields

    $g(W_i) = g(w) + g^{(1)}(w)(W_i - w) + \tfrac{1}{2} g^{(2)}(w^*)(W_i - w)^2$    (7)

for some $w^*$ between $w$ and $W_i$. Letting $\theta_1(w) = g(w)$, $\theta_2(w) = g^{(1)}(w)$, and $\theta_3(w) = \tfrac{1}{2} g^{(2)}(w^*)$, equation (7) can be re-written

    $g(W_i) = \theta_1(w) + \theta_2(w)(W_i - w) + \theta_3(w)(W_i - w)^2$.    (8)

[4] As an alternative to the local-polynomial method, one could instead use a global series approximation for $g(\cdot)$ (see, for example, Donald and Newey (1994) for the standard semilinear model). This method approximates the function $g(\cdot)$ by a finite series expansion such as $g(w) \approx \sum_{p=0}^{k} \gamma_p w^p$ and conducts the rank estimation procedure for the finite number of parameters. As the parameters can be estimated in one step, this approach would be much easier computationally.
However, the important advantage of the local-polynomial method is that it is better equipped to detect local kinks and movements in the function $g(\cdot)$.
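As a concrete illustration of the first-stage estimator in (5), the following sketch simulates data from a design like the Monte Carlo model of Section 4 (identity transformation, $g_0(w) = w^2$) and maximizes the kernel-weighted rank objective by grid search over the single free coefficient. This is a minimal illustration under assumed choices (Epanechnikov kernel, bandwidth 0.5, sample size 200), not the authors' code:

```python
import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel, zero outside [-1, 1]
    return 0.75 * np.maximum(1.0 - u**2, 0.0)

def q1n(beta2, y, x, w, h):
    # Sample rank objective (5) under the normalization beta = (1, beta2)';
    # pairs are weighted by K_h(W_i - W_j), so only pairs with similar
    # W values contribute (asymptotically, W_i = W_j).
    idx = x[:, 0] + beta2 * x[:, 1]
    concordant = (y[:, None] > y[None, :]) & (idx[:, None] > idx[None, :])
    kw = epanechnikov((w[:, None] - w[None, :]) / h) / h
    np.fill_diagonal(kw, 0.0)          # exclude i == j pairs
    n = len(y)
    return (concordant * kw).sum() / (n * (n - 1))

# Toy data from a design like (18): T(y) = y, g0(w) = w**2, true beta2 = 0.5
rng = np.random.default_rng(0)
n = 200
x = np.column_stack([rng.standard_normal(n), rng.chisquare(1, n)])
w = rng.uniform(-1.0, 1.0, n)
y = x[:, 0] + 0.5 * x[:, 1] + w**2 + rng.normal(0.0, np.sqrt(0.3), n)

# Grid search over 101 equispaced points on [-1, 1], as in Section 4
grid = np.linspace(-1.0, 1.0, 101)
beta_hat = grid[np.argmax([q1n(b, y, x, w, h=0.5) for b in grid])]
```

Because the objective is a step function of $\beta$, derivative-based optimizers are unsuitable; a grid search (or simulated annealing, as in Section 4) is the natural choice.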

For a generic vector $\theta = (\theta_1, \theta_2, \theta_3)'$, we analogously define

    $g(W_i; \theta) = \theta_1 + \theta_2 (W_i - w) + \theta_3 (W_i - w)^2$.    (9)

Then, note that the parameter vector $\theta_0(w) \equiv (\theta_1(w), \theta_2(w), \theta_3(w))'$ maximizes the following conditional rank correlation function:

    $Q(\theta; w, \beta_0) = E\big[\,1(Y_i > Y_j)\,1(X_i'\beta_0 + g(W_i; \theta) > X_j'\beta_0) + 1(Y_i < Y_j)\,1(X_i'\beta_0 + g(W_i; \theta) < X_j'\beta_0) \mid W_i = w, W_j = w_0\,\big]$.    (10)

Using the sample analogue of (10) and substituting $\hat{\beta}$ for $\beta_0$, the estimator of $\theta_0(w)$ is defined as

    $\hat{\theta}(w) = \arg\max_\theta Q_{2n}(\theta; w, \hat{\beta}) = \arg\max_\theta \frac{1}{n(n-1)} \sum_{i \neq j} \big[ 1(Y_i > Y_j)\,1(X_i'\hat{\beta} + g(W_i; \theta) > X_j'\hat{\beta}) + 1(Y_i < Y_j)\,1(X_i'\hat{\beta} + g(W_i; \theta) < X_j'\hat{\beta}) \big] K_h(W_i - w)\,K_h(W_j - w_0)$.    (11)

The first two components of $\hat{\theta}(w) \equiv (\hat{\theta}_1(w), \hat{\theta}_2(w), \hat{\theta}_3(w))'$ are the estimators of $g(\cdot)$ and $g^{(1)}(\cdot)$, respectively. Under standard regularity conditions (provided in the Appendix), these estimators have the following asymptotic properties:

Theorem 2.2 Under Assumptions A1-A11, uniformly over $w \in [w_1, w_2]$,
(i) $\hat{\theta}(w) = \theta_0(w) + o_p(1)$;
(ii) $\sqrt{nh}\,\big(\hat{\theta}(w) - \theta_0(w) + B h^2\big) = \sqrt{nh}\, n^{-1} \sum_{i=1}^{n} J(Z_i; w) + o_p(1)$ and $\sqrt{nh}\,\big(\hat{\theta}(w) - \theta_0(w) + B h^2\big) \Rightarrow G(w)$, where $G(w)$ is a mean-zero Gaussian process with covariance function

    $E[G(w) G(w')'] = E[J(Z; w) J(Z; w')']$    (12)

with

    $J(Z; w) = V(w)^{-1}\, \partial \tau_2(Z; w, \beta_0, \theta) / \partial\theta$.    (13)

Finally, we estimate $E[g^{(1)}(W)]$, the average partial effect of $W$ (relative to the normalization imposed upon $\beta$). Since we have an estimator of $g^{(1)}(W)$ for each $W$, the population average can be easily estimated by its sample analogue:

    $\overline{g}^{(1)} = \frac{1}{n} \sum_i \hat{g}^{(1)}(W_i)$    (14)

where $\hat{g}^{(1)}(W_i)$ is the second component of $\hat{\theta}(W_i)$. The possibly heterogeneous (local) effects of $W$ upon $Y$ are weighted by the density of $W$. To see whether the flexibility of $g(\cdot)$ matters for the average partial effect, one could compare this average partial effect with the estimated coefficient on $W$ from a model with a linear specification in $W$.

An appealing theoretical feature of the estimator in (14) is that it achieves the parametric convergence rate. Although the estimators of $g$ and $g^{(1)}$ have slower (nonparametric) convergence rates, the averaging in (14) speeds the convergence of $\overline{g}^{(1)}$ to the root-n rate:

Theorem 2.3 Under Assumptions A1-A11,

    $\overline{g}^{(1)} - E[g_0^{(1)}(W)] = O_p(1/\sqrt{n})$.    (15)

As with the previous estimators, we will employ a bootstrap method for inference in our empirical application.

We finish this section by discussing estimation of marginal effects on $E[Y|X, W]$. With the partially linear index consistently estimated (by $X'\hat{\beta} + \hat{g}(W)$), the conditional expectation $E[Y|X, W]$ and its first derivative (with respect to the partially linear index value) can be estimated by a local polynomial regression of $Y$ on $X'\hat{\beta} + \hat{g}(W)$. From the chain rule, the marginal effects of the covariates $X$ and $W$ can be estimated as $\widehat{\partial E[Y|X,W]/\partial(X'\hat{\beta} + \hat{g}(W))}\,\hat{\beta}$ and $\widehat{\partial E[Y|X,W]/\partial(X'\hat{\beta} + \hat{g}(W))}\,\hat{g}^{(1)}(W)$, respectively.
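A schematic implementation of the second-stage objective in (11) follows, simplified to a local-linear fit (the curvature coefficient $\theta_3$ is held at zero) and maximized by a coarse grid search rather than the simulated annealing used in Section 4. The data-generating process, bandwidth, and grid are all illustrative assumptions, and the first-stage index is taken as known for brevity:

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * np.maximum(1.0 - u**2, 0.0)

def q2n(theta, w_eval, w0, y, idx, w, h):
    # Sample analogue of (11): theta = (g(w), g'(w), curvature).
    # Observation i is localized near the evaluation point w_eval,
    # observation j near the normalization point w0 (where g(w0) = 0).
    g_i = theta[0] + theta[1] * (w - w_eval) + theta[2] * (w - w_eval) ** 2
    lhs = (idx + g_i)[:, None]        # X_i'b + g(W_i; theta)
    rhs = idx[None, :]                # X_j'b
    yi, yj = y[:, None], y[None, :]
    kw = (epanechnikov((w - w_eval) / h) / h)[:, None] \
         * (epanechnikov((w - w0) / h) / h)[None, :]
    np.fill_diagonal(kw, 0.0)         # exclude i == j pairs
    score = (yi > yj) * (lhs > rhs) + (yi < yj) * (lhs < rhs)
    n = len(y)
    return (score * kw).sum() / (n * (n - 1))

# Toy data: g0(w) = w**2, location normalization at w0 = 0 (g0(0) = 0)
rng = np.random.default_rng(1)
n, h, w0 = 400, 0.4, 0.0
x1 = rng.standard_normal(n)
w = rng.uniform(-1.0, 1.0, n)
y = x1 + w**2 + rng.normal(0.0, np.sqrt(0.3), n)
idx = x1                              # first-stage index, treated as known here

# Coarse grid search over (g(w), g'(w)) with the curvature term fixed at 0
w_eval = 0.8
grid = np.linspace(-2.0, 2.0, 21)
vals = np.array([[q2n((t1, t2, 0.0), w_eval, w0, y, idx, w, h)
                  for t2 in grid] for t1 in grid])
i1, i2 = np.unravel_index(vals.argmax(), vals.shape)
g_hat, dg_hat = grid[i1], grid[i2]
```

Repeating this at a grid of evaluation points yields $\hat{g}^{(1)}(W_i)$ values, and the average-derivative estimator in (14) is then simply their sample mean.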

3 Extensions

In this section, we discuss a few extensions of the proposed estimation method. First, we discuss how the procedure may be modified when there exists covariate-dependent random censoring. Second, we show how pairwise-difference rank estimators can be constructed in the partially linear model. Finally, we consider the case in which there are multiple non-linear regressors (i.e., vector-valued $W$).

3.1 Random Censoring

Consider the following model in which the dependent variable is subject to random censoring, where the censoring variable may depend upon the regressors in an arbitrary way:

    $T(Y_i) = \min(X_i'\beta_0 + g_0(W_i) + \epsilon_i,\; C_i)$
    $d_i = 1(X_i'\beta_0 + g_0(W_i) + \epsilon_i \leq C_i)$.

$C_i$ is the censoring variable, and $d_i$ is a binary variable indicating whether the dependent variable is censored or not. $C_i$ may depend on $X_i$ and/or $W_i$ in an arbitrary way, but we assume that $\epsilon_i$ is independent of $(X_i, W_i, C_i)$. Furthermore, $C_i$ need not be observed. Recently, Khan and Tamer (2007) investigated this model in the linear-index case (i.e., without the $g_0(W_i)$ component) and proposed the partial rank estimator for $\beta_0$. Following Khan and Tamer (2007), we assume that the probability of censoring is not equal to one for all $X$, for $W$ almost surely.[5] Then, we can show that

    $X_i'\beta_0 > X_j'\beta_0 \iff \Pr(Y_{1i} > Y_{0j} \mid X_i, X_j, W_i = W_j) > \Pr(Y_{0i} < Y_{1j} \mid X_i, X_j, W_i = W_j)$

where $Y_{1i} = d_i Y_i + (1 - d_i)(+\infty)$ and $Y_{0i} = Y_i$. Using this identification condition, the following

[5] Formally, let $S_X$ be the support of $X_i$, and let $X_{uc}$ be the set $X_{uc} = \{x \in S_X : \Pr(d_i = 1 \mid X_i = x) > 0\}$. Then $X_{uc}$ has positive measure, $W$-a.s.

objective function could be used to extend the partial rank estimator to the partially linear index case:

    $\hat{\beta}_{PR} = \arg\max_{\beta \in B} \frac{1}{n(n-1)} \sum_{i \neq j} 1(Y_{1i} > Y_{0j})\,1(X_i'\beta > X_j'\beta)\,K_h(W_i - W_j)$.

Given $W_i = w$ and $W_j = w_0$, it also follows that

    $X_i'\beta_0 + g_0(w) > X_j'\beta_0 \iff \Pr(Y_{1i} > Y_{0j} \mid X_i, X_j, W_i = w, W_j = w_0) > \Pr(Y_{0i} < Y_{1j} \mid X_i, X_j, W_i = w, W_j = w_0)$.

Therefore, estimators of $g(\cdot)$ and $g^{(1)}(\cdot)$ can be obtained by maximizing:

    $\hat{\theta}_{PR}(w) = \arg\max_\theta \frac{1}{n(n-1)} \sum_{i \neq j} \big[ 1(Y_{1i} > Y_{0j})\,1(X_i'\hat{\beta} + g(W_i; \theta) > X_j'\hat{\beta}) + 1(Y_{0i} < Y_{1j})\,1(X_i'\hat{\beta} + g(W_i; \theta) < X_j'\hat{\beta}) \big] K_h(W_i - w)\,K_h(W_j - w_0)$.    (16)

3.2 Pairwise-difference Rank Estimators

Abrevaya (2000, 2003) introduced pairwise-difference rank (PDR) estimators for the transformation model, in which the transformation function $T(\cdot)$ from (2) is strictly (rather than weakly) increasing. When $T(\cdot)$ is strictly increasing, the following condition holds:

    $(X_i - X_j)'\beta_0 > (X_j - X_k)'\beta_0 \iff \Pr(Y_i > Y_j \mid X_i, X_j, X_k, W_i = W_j = W_k) > \Pr(Y_j > Y_k \mid X_i, X_j, X_k, W_i = W_j = W_k)$

for any triple of distinct observations indexed by $i$, $j$, and $k$. Therefore, the modified PDR estimator of $\beta_0$ would be:

    $\hat{\beta}_{PDR} = \arg\max_{\beta \in B} \frac{1}{n(n-1)(n-2)} \sum_{i \neq j \neq k} 1\big((X_i - X_j)'\beta > (X_j - X_k)'\beta\big)\,\big[1(Y_i > Y_j) - 1(Y_j > Y_k)\big]\,K_h(W_i - W_j)\,K_h(W_j - W_k)$.
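The triple-sum PDR objective for $\beta_0$ can be evaluated directly with array broadcasting, although the $O(n^3)$ cost restricts it to small samples. In the sketch below, the bracketed term is read as the difference $1(Y_i > Y_j) - 1(Y_j > Y_k)$, and the data, kernel, and bandwidth are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def epanechnikov(u):
    return 0.75 * np.maximum(1.0 - u**2, 0.0)

def pdr_objective(beta2, y, x, w, h):
    # Triple-sum PDR objective for beta = (1, beta2)'.  Arrays are broadcast
    # to shape (n, n, n) over triples (i, j, k), so this is O(n^3) in time
    # and memory and only feasible for small n.
    n = len(y)
    idx = x[:, 0] + beta2 * x[:, 1]
    dij = idx[:, None] - idx[None, :]              # (X_i - X_j)'beta, shape (n, n)
    cmp_x = dij[:, :, None] > dij[None, :, :]      # (Xi-Xj)'b > (Xj-Xk)'b
    sgn_y = (y[:, None] > y[None, :]).astype(float)
    cmp_y = sgn_y[:, :, None] - sgn_y[None, :, :]  # 1(Yi>Yj) - 1(Yj>Yk)
    kw = epanechnikov((w[:, None] - w[None, :]) / h) / h
    kern = kw[:, :, None] * kw[None, :, :]         # K_h(Wi-Wj) * K_h(Wj-Wk)
    ii = np.arange(n)
    distinct = (ii[:, None, None] != ii[None, :, None]) \
             & (ii[None, :, None] != ii[None, None, :]) \
             & (ii[:, None, None] != ii[None, None, :])
    total = (cmp_x * cmp_y * kern * distinct).sum()
    return total / (n * (n - 1) * (n - 2))

# Toy data: T(y) = y, g0(w) = sin(w), true beta2 = 0.5; small n keeps n^3 cheap
rng = np.random.default_rng(2)
n = 60
x = np.column_stack([rng.standard_normal(n), rng.chisquare(1, n)])
w = rng.uniform(-1.0, 1.0, n)
y = x[:, 0] + 0.5 * x[:, 1] + np.sin(w) + rng.normal(0.0, np.sqrt(0.3), n)

val_true = pdr_objective(0.5, y, x, w, h=0.5)
val_bad = pdr_objective(-1.0, y, x, w, h=0.5)
```

As with the first stage, the objective is a step function of $\beta$, so grid search or a stochastic optimizer is needed for the maximization.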

Similarly, we can construct PDR-type estimators of $g(\cdot)$ and $g^{(1)}(\cdot)$. For $W_i = W_k = w$ and $W_j = w_0$, we have

    $X_i'\beta_0 + X_k'\beta_0 + 2g(w) > 2X_j'\beta_0 \iff \Pr(Y_i > Y_j \mid X_i, X_j, X_k, W_i = W_k = w, W_j = w_0) > \Pr(Y_j > Y_k \mid X_i, X_j, X_k, W_i = W_k = w, W_j = w_0)$.

Then, the PDR estimator in the second stage maximizes the following objective function:

    $\hat{\theta}_{PDR}(w) = \arg\max_\theta \frac{1}{n(n-1)(n-2)} \sum_{i \neq j \neq k} \big[1(Y_i > Y_j) - 1(Y_j > Y_k)\big]\, 1\big(X_i'\hat{\beta}_{PDR} + X_k'\hat{\beta}_{PDR} + g(W_i; \theta) + g(W_k; \theta) > 2X_j'\hat{\beta}_{PDR}\big)\, K_h(W_i - w)\,K_h(W_j - w_0)\,K_h(W_k - w)$.    (17)

3.3 Vector-valued W

In Section 2, we assumed for simplicity that the non-linear regressor $W$ was scalar. Although we suspect that the scalar case is most relevant for empirical researchers, we briefly discuss the case of multivariate $W$ in this subsection. If $W$ is a $q$-dimensional vector, the function $g_0(\cdot)$ becomes a smooth function from $\mathbb{R}^q$ to $\mathbb{R}$. If the kernel function $K_h(\cdot)$ is changed to a multivariate kernel function (in accordance with the $q$-dimensional $W$), $\beta_0$ can be estimated by maximizing the same objective function in (5).[6]

For estimation of $g_0(\cdot)$ and $g_0^{(1)}(\cdot)$, some additional notation is required. The derivative $g_0^{(1)}(\cdot)$ is now a $q$-dimensional vector-valued function, $\mathbb{R}^q \mapsto \mathbb{R}^q$. The mean-value expansion in (7) takes the vector-valued form

    $g(W_i) = g(w) + g^{(1)}(w)'(W_i - w) + \tfrac{1}{2}(W_i - w)'\,g^{(2)}(w^*)\,(W_i - w)$

[6] In order to use a single bandwidth $h$, one could normalize each component of $W$ by dividing by its standard deviation.

where the second derivative $g^{(2)}(\cdot)$ is a $q \times q$ matrix function. As a result, there are $q^2 + q + 1$ parameters to be estimated (including the $q^2$ nuisance parameters for $g^{(2)}(w^*)$) in the multivariate analogue of the objective function in (11). The curse of dimensionality (due to multivariate $W$) causes the convergence rate for $\hat{g}(\cdot)$ to slow to $\sqrt{nh^q}$. Finally, the average partial effect $E[g_0^{(1)}(\cdot)]$ (also a $q$-dimensional vector) can be estimated by the sample average of $\hat{g}^{(1)}(W_i)$ over $W_i$.

4 Monte Carlo Simulation

In this section, we investigate the finite-sample properties of the proposed estimators via Monte Carlo simulations. The base design is a transformation model with three explanatory variables:

    $T(Y_i) = X_{1i} + X_{2i}\beta_0 + g_0(W_i) + \epsilon_i$    (18)

where $X_{1i}$ and $X_{2i}$ are random variables following the standard normal distribution and the chi-square distribution with one degree of freedom, respectively, $W_i$ is distributed uniformly on the interval $[-1, 1]$, and $\epsilon_i$ follows the normal distribution $N(0, 0.3)$. The coefficient of $X_{1i}$ is normalized to be one since the model is identified only up to scale, and the true parameter value $\beta_0$ is 0.5. We consider different functional forms for $T(\cdot)$ and $g_0(\cdot)$. For the transformation function $T(\cdot)$, we use the identity function $T(y) = y$ and the logarithmic function $T(y) = \log y$. (Note that these functions are treated as unknown in the estimation procedure.) For each transformation function, we consider two functional forms for $g_0(\cdot)$, namely $g_0(W) = \sin(W)$ and $g_0(W) = W^2$, yielding a total of four simulation designs.

For each design, we conducted 201 replications with sample sizes of 50, 100, 200, and 400. A grid-search method (using 101 equispaced points on $[-1, 1]$) was employed to estimate $\beta_0$ in the first stage. The functions $g_0(W)$ and $g_0^{(1)}(W)$ were estimated at $n$ equispaced points on the support $[-1, 1]$ of $W$ (where $n$ is the sample size).[7]
[7] Using these $n$ equispaced points (rather than $\{W_i\}$) made calculation of the average estimated function far easier and does not affect the asymptotic theory.

The simulated annealing method was used for the

second-stage optimization. For the objective function in this stage, the Epanechnikov kernel was used with the Silverman rule-of-thumb bandwidth $2.34\,\hat{\sigma}_W n^{-1/5}$ (see Silverman (1986)). Finally, the average partial effect $E[g_0^{(1)}(W)]$ was estimated by the sample average of $\hat{g}^{(1)}(W)$ over the estimation points of $W$ from the second stage.

Overall, the estimation method exhibits good finite-sample behavior. Tables 1-4 summarize the simulation results. For each simulation design, we report the mean bias and root-mean-squared error (RMSE) of $\hat{\beta}$ and $\overline{g}^{(1)}$. Across the four designs, there is minimal mean bias, and both $\hat{\beta}$ and $\overline{g}^{(1)}$ appear to satisfy the parametric convergence rate (RMSE values scale down by $\sqrt{2}$ as $n$ doubles) even for these small sample sizes. The integrated mean squared errors (IMSE) of $\hat{g}(\cdot)$ and $\hat{g}^{(1)}(\cdot)$ are also reported in the tables. Figures 1 and 2 show the average estimated functions $\hat{g}(\cdot)$ and $\hat{g}^{(1)}(\cdot)$ for $n = 400$ in the simulation designs summarized in Tables 1 and 2, respectively. In these figures, the dashed lines denote the average values of the estimated functions and the solid lines denote the true functions. The proposed estimator appears to, on average, capture the non-linear features of the underlying functions very well.

5 Application

In this section, we consider an empirical application in which we estimate the association of household income with children's cognitive development. This application is based on the work of Dahl and Lochner (2008), but we allow for a more flexible specification than considered in their paper. Specifically, we compare estimation results for the partially linear index with other results that assume a known transformation function $T(\cdot)$ and/or a known additive function $g(\cdot)$. Using NLSY children data, Dahl and Lochner (2008) construct a panel dataset that includes children's test scores, household income, and other characteristics. For this application, we used a sample extracted from the year 2000.
We dropped observations where household income was more than $50,000/year, since there were too few observations to estimate a nonparametric function in

the region beyond $50,000/year. After dropping observations with missing values, a total of 1,821 observations remained for the analysis. Table A.3 summarizes the descriptive statistics for the sample. The dependent variable is the child's normalized test score. The covariates include mother's AFQT score (AFQT), number of dependents (DEP), mother's education (EDU), and household income (INCOME) in $10,000s. Since we expect that INCOME may have a nonlinear relationship with the child's test score, we allow INCOME to enter through the unspecified non-linear component of the partially linear index.

For comparison, we also considered alternative model specifications: two linear regression models with linear and quadratic forms of INCOME (OLS1 and OLS2, respectively) and a generalized regression model with a linear index (MRC). In the generalized regression model, the parameter on EDU is normalized to be one since the parameters are only identified up to scale. For comparison purposes, we also report the normalized results from the two linear regressions (i.e., by dividing all OLS coefficient estimates by the OLS estimate on EDU). In the partially linear index model, the bandwidth in the first-stage estimation was chosen by the rule-of-thumb approach in Silverman (1986) and was set to 0.50 in the second stage.[8]

Table 6 and Figures 3-4 present the estimation results from the different model specifications. In Table 6, the directions of the coefficients coincide with our expectations. All coefficients are significant at the 95% level besides INCOME and its square (INCOME2) in OLS2. The F-test rejects the hypothesis that those coefficients are jointly zero at the 99% level. However, it is difficult to tell the direction of the fitted curve from this result. (We also tried adding a cubic term to the OLS regression, but this regression yielded insignificant coefficient estimates for each of the three income variables.) The linear-index MRC results are similar to those obtained in OLS1.
(MRC with a quadratic specification using INCOME and INCOME2 gave insignificant results, similar to OLS2, and thus is not reported.) Figures 3-4 show the estimated g(INCOME) and g^(1)(INCOME) functions from our semiparametric partially linear index (PL) model. The solid curves denote the estimated functions, and the dashed curves are 95% (pointwise) confidence bands. To compare the results, we also add straight lines that represent the estimation results from MRC. There is evidence of a nonlinear relationship between income and test scores, as the straight lines (from MRC) fall outside the confidence bands in the income ranges $34,000-$42,000 and $30,000-$36,000, respectively. Since the nonlinear relationship could not be detected by adding a quadratic (or cubic) term in the linear models, this result shows the benefits of the partially linear index model. While the estimate of the averaged partial effect from the PL specification is close to the OLS1 estimate, Figure 5 shows that there are heterogeneous effects at different income levels.

[8] We tried different values for the bandwidth in the second stage, with very little difference found in the overall functional shapes.

6 Conclusion

References

Abrevaya, J. (2000): "Rank estimation of a generalized fixed-effects regression model," Journal of Econometrics, 95(1), 1-23.

(2003): "Pairwise-Difference Rank Estimation of the Transformation Model," Journal of Business & Economic Statistics, 21(3), 437-447.

Ahn, H., and J. L. Powell (1993): "Semiparametric estimation of censored selection models with a nonparametric selection mechanism," Journal of Econometrics, 58(1-2), 3-29.

Aradillas-Lopez, A., B. Honoré, and J. L. Powell (2007): "Pairwise Difference Estimation of Nonlinear Models with Nonparametric Functions," manuscript.

Chen, S. (2002): "Rank Estimation of Transformation Models," Econometrica, 70(4), 1683-1697.

Chen, S., and S. Khan (2001): "Semiparametric Estimation of a Partially Linear Censored Regression Model," Econometric Theory, 17, 567-590.

Dahl, G., and L. Lochner (2008): "The Impact of Family Income on Child Achievement," manuscript.

Emond, M. J., and S. G. Self (1997): "An Efficient Estimator for the Generalized Semilinear Model," Journal of the American Statistical Association, 92(439), 1033-1040.

Han, A. (1987): "Non-Parametric Analysis of a Generalized Regression Model," Journal of Econometrics, 35, 303-316.

Honoré, B., and J. Powell (2005): "Pairwise Difference Estimation of Nonlinear Models," in Identification and Inference for Econometric Models, ed. by D. W. K. Andrews and J. H. Stock, chap. 22, pp. 520-553. Cambridge University Press.

Ichimura, H., and L.-F. Lee (1991): "Semiparametric least squares estimation of multiple index models: Single equation estimation," in Nonparametric and Semiparametric Methods in Econometrics and Statistics, ed. by W. A. Barnett, J. Powell, and G. E. Tauchen, chap. 1, pp. 3-50. Cambridge University Press.

Khan, S., and E. Tamer (2007): "Partial Rank Estimation of Transformation Models with General Forms of Censoring," Journal of Econometrics, 136, 251-280.

Manski, C. (1985): "Semiparametric Analysis of Discrete Response: Asymptotic Properties of Maximum Score Estimation," Journal of Econometrics, 27, 313-334.

Newey, W. K., and D. McFadden (1994): "Large sample estimation and hypothesis testing," in Handbook of Econometrics, ed. by R. F. Engle and D. McFadden, vol. 4, chap. 36, pp. 2111-2245. Elsevier.

Powell, J. L. (1987): "Semiparametric Estimation of Bivariate Latent Variable Models," manuscript.

Powell, J. L., J. H. Stock, and T. M. Stoker (1989): "Semiparametric Estimation of Index Coefficients," Econometrica, 57(6), 1403-1430.

Robinson, P. M. (1988): "Root-N-Consistent Semiparametric Regression," Econometrica, 56(4), 931-954.

Sherman, R. (1993): "The Limiting Distribution of the Maximum Rank Correlation Estimator," Econometrica, 61, 123-137.

(1994): "Maximal Inequalities for Degenerate U-Processes with Applications to Optimization Estimators," Annals of Statistics, 22, 439-459.

Shin, Y. (2007): "Local Rank Estimation of Transformation Models with Functional Coefficients," manuscript.

Silverman, B. W. (1986): Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.

van der Vaart, A. W., and J. A. Wellner (1996): Weak Convergence and Empirical Processes. Springer, New York.

A Appendix

A.1 Regularity Conditions

In this section, we list the regularity conditions for the asymptotic results. As explained below, these conditions are quite standard in the semiparametric and nonparametric literature.

A1. There is a random sample (Y_i, X_i, W_i), and the regressors (X_i, W_i) are independent of the error term ε_i.

A2. The parameter space B is a compact subset of {β ∈ R^k : β_1 = 1}, and the true parameter β_0 is an interior point of B.

A3. The following support conditions hold:
(a) The support of X ∈ R^k is not contained in any proper linear subspace of R^k given W, a.s.
(b) The first component of X has an everywhere positive Lebesgue density conditional on the remaining components X̃ = x̃ and W = w, and the corresponding coefficient is β_{0,1} = 1.
(c) W is continuously distributed on R.

A4. The bandwidth h_n satisfies the following:
(a) lim_{n→∞} h_n = 0;
(b) lim_{n→∞} n h_n^k = ∞ for k ≤ 4;
(c) lim_{n→∞} n h_n^5 = λ for some 0 ≤ λ < ∞.

A5. The kernel function K(·) satisfies the following properties:
(a) K(·) is twice continuously differentiable and has compact support;
(b) K(·) is symmetric about 0 and integrates to 1;
(c) ∫ u² K(u) du < ∞.

A6. Let τ(W_i, W_j, β) = E[m_ij(β) − m_ij(β_0) | W_i, W_j], where m_ij(β) = 1(Y_i > Y_j) 1(X_i′β > X_j′β), and let φ(·) be the density of W_i. Then the second-order derivatives of τ(W_i, W_j, β) φ(W_i)² with respect to W_i are continuous and bounded uniformly over β in some neighborhood of β_0 and over W_i in a neighborhood of W_j.

A7. Let Z = (Y, X, W), and let τ_1(z, β) be the kernel of the empirical process, defined as

  τ_1(z, β) = E[1(y > Y) 1(x′β > X′β) K_h(w − W)] + E[1(Y > y) 1(X′β > x′β) K_h(W − w)].   (19)

Let N be a neighborhood of β_0.
(a) For each z, all mixed second partial derivatives of τ_1(z, ·) exist on N.
(b) There is an integrable function M(z) such that, for all z and all β in N,

  ||∇_2 τ_1(z, β) − ∇_2 τ_1(z, β_0)|| ≤ M(z) |β − β_0|.   (20)

(c) E|∇_1 τ_1(·, β_0)|² < ∞.
(d) E||∇_2 τ_1(·, β_0)|| < ∞.
(e) The matrix E[∇_2 τ_1(·, β_0)] is negative definite.

A9. g_0(W) is twice continuously differentiable in W, and its location is normalized so that g_0(w_0) = 0. For a compact interval Θ_1, g_0([w_2 − ε*, w_1 + ε*]) ⊆ Θ_1 for a small positive number ε* and some w_0, w_1, and w_2 in the support of W. The same compactness condition holds for g_0^(1)(·).

A10. The density of W, φ(·), is twice continuously differentiable, and its derivatives are uniformly bounded.

A11. Let τ_2(Z, w, θ, β̂) be defined as

  τ_2(Z, w, θ, β̂) = E[f_2(·, Z, w, θ, β̂) K_h(· − w) K_h(W − w_0) + f_2(Z, ·, w, θ, β̂) K_h(W − w) K_h(· − w_0)].   (21)

All conditions in Assumption A7 hold for τ_2(Z, w, θ_0, ·).
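To make Assumptions A4-A5 concrete, one kernel satisfying A5 is the triweight kernel K(u) = (35/32)(1 − u²)³ on [−1, 1]: it is symmetric, integrates to one, has compact support and a finite second moment, and is twice continuously differentiable on R. The sketch below checks these properties numerically and illustrates the kind of kernel-weighted pairwise rank objective the first stage maximizes. The raw objective shown here (without the m_ij(β_0) recentering used in the proofs, and with a naive double loop) is a simplified illustration, not the paper's implementation.

```python
import numpy as np

def triweight(u):
    """Triweight kernel (35/32)(1 - u^2)^3 on [-1, 1]: symmetric,
    compactly supported, and twice continuously differentiable on R,
    consistent with Assumption A5."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, (35.0 / 32.0) * (1.0 - u ** 2) ** 3, 0.0)

def rank_objective(beta, y, x, w, h):
    """Kernel-weighted pairwise rank objective,
    (1/(n(n-1))) * sum_{i != j} 1(y_i > y_j) 1(x_i'beta > x_j'beta) K_h(w_i - w_j),
    a simplified version of the first-stage criterion (no recentering)."""
    n = len(y)
    xb = np.asarray(x) @ np.asarray(beta)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                kw = triweight((w[i] - w[j]) / h) / h  # K_h(.) = K(./h) / h
                total += float(y[i] > y[j]) * float(xb[i] > xb[j]) * kw
    return total / (n * (n - 1))
```

One can verify numerically that the triweight kernel integrates to one and has second moment 1/9; in practice the double loop over pairs would be vectorized.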

Assumption A1 requires an i.i.d. sample and that the regressors are independent of the error term. This condition can be relaxed to median independence to allow for conditional heteroskedasticity. Assumption A2 is the usual compactness and interior-point condition on the parameter. Assumption A3 is crucial for identification: it requires a continuous component of X_i with a non-zero parameter value given the other regressor W_i. A similar condition is adopted in the maximum score estimation of Manski (1985) and the maximum rank correlation estimation of Han (1987). Assumptions A4 and A5 are the usual bandwidth and kernel conditions. Assumptions A6 and A7 are smoothness conditions on the objective function and the kernel of the U-process. Assumptions A9 and A10 are smoothness conditions on g_0(·) and the density of W; a compactness condition on the range of g(·) and g^(1)(·) is also required. Assumption A11 imposes a smoothness condition similar to A7 on τ_2, so that we can expand it around the true parameter value.

A.2 Proofs

Proof of Theorem 2.1: (i) We verify the conditions of Theorem 2.1 in Newey and McFadden (1994). Compactness follows from Assumption A2. To show uniform convergence, we define Γ_1n(β) and Γ_1(β) as follows:

  Γ_1n(β) = [1 / (n(n−1))] Σ_{i≠j} (m_ij(β) − m_ij(β_0)) K_h(W_i − W_j),
  Γ_1(β) = E_{W_i}[E(m_ij(β) − m_ij(β_0) | W_i = W_j) φ(W_i)] = E_{W_i}[τ(W_i, W_i, β) φ(W_i)].

By the triangle inequality,

  sup_{β∈B} |Γ_1n(β) − Γ_1(β)| ≤ sup_{β∈B} |Γ_1n(β) − E[Γ_1n(β)]| + sup_{β∈B} |E[Γ_1n(β)] − Γ_1(β)|.

We look at the first term on the right-hand side. The Euclidean property and Corollary 7 in Sherman (1994) imply

  sup_{β∈B} |Γ_1n(β) − E[Γ_1n(β)]| = O_p(1/√n) = o_p(1).

Now we examine the second term. A change of variables and Assumption A5 imply that

  sup_{β∈B} |E[Γ_1n(β)] − Γ_1(β)| = O(h²) = o(1),

which establishes the uniform convergence of Γ_1n(β) to Γ_1(β). Since m_ij(β) is continuous at each β with probability one and m_ij(β) K_h(W_i − W_j) is bounded, Γ_1(β) is continuous. Finally, unique maximization follows from the theorem in Han (1987) and the rank relationship between Y and X′β given W.

(ii) The asymptotic normality follows directly from the proof of Theorem 4 in Sherman (1993).

Proof of Theorem 2.2: (i) Compactness follows from Assumption A9. To show uniform convergence, define the following functions:

  f_2(Z_1, Z_2, w, θ, b) = 1(Y_i > Y_j) 1(X_i′b + g(W_i; θ) > X_j′b) + 1(Y_j > Y_i) 1(X_j′b > X_i′b + g(W_i; θ)),
  Γ_2n(w, θ, b) = Q_2n(w, θ, b) − Q_2n(w, θ_0, b),
  Γ_2(w, θ, b) = E[Γ_2n(w, θ, b)],
  Γ_2(w, θ) = E[f_2(Z_1, Z_2, w, θ, β_0) − f_2(Z_1, Z_2, w, θ_0, β_0)] φ(w) φ(w_0).

We suppress the dependence of θ on w for simplicity. Then the Euclidean property of f_2 and the uniform law of large numbers for U-processes imply that

  sup_θ |Γ_2n(w, θ, β̂) − Γ_2(w, θ, β̂)| = o_p(1).

From the continuity of Γ_2 and the consistency of β̂, we have

  Γ_2(w, θ, β̂) = Γ_2(w, θ) + o_p(1).

Therefore, by the triangle inequality,

  sup_θ |Γ_2n(w, θ, β̂) − Γ_2(w, θ)| = o_p(1).

Simple algebra and the rank correlation in (6) show that Γ_2(w, θ) < 0 whenever θ ≠ θ_0, so that it is uniquely maximized at θ_0. Continuity of Γ_2 follows from the support conditions on X and W and the continuity of g(·). Therefore, θ̂(w) is consistent for θ_0(w). In addition, we can show that the first two components of θ̂(w) lie in classes of smooth functions; then, by the arguments of van der Vaart and Wellner (1996, p. 81), they are Glivenko-Cantelli classes and satisfy uniform consistency.

(ii) For asymptotic normality, we can follow the arguments in Sherman (1993) and Shin (2007) to show that

  Γ_2n(w, θ, β̂) = (1/2)(θ − θ_0)′ V_2(w) (θ − θ_0) + (1/√(nh)) (θ − θ_0)′ Q_2n + o_p(|θ − θ_0|/√n) + o_p(|θ − θ_0|²) + o_p(1/(nh))

as |θ(w) − θ_0(w)| → 0, uniformly over w ∈ [w_2, w_1], where

  Q_2n = √(nh) [ (1/n) Σ_{i=1}^n ∂τ_2(Z_i, w, θ_0, β_0)/∂θ + B h² ].

Then the rest follows from the arguments of Theorem 1 in Chen (2002).

Proof of Theorem 2.3: From the results in Theorem 2.2, we can approximate ĝ^(1)(W_i) as a linear sum of J(Z; W_i) and obtain the following expansion:

  (1/n) Σ_i ĝ^(1)(W_i) − E_W[g^(1)(W)]
    = [1/(n(n−1))] Σ_{i≠j} J(Z_j; W_i)
    + (1/n) Σ_i { g_0^(1)(W_i) − E_W[g^(1)(W)] }
    + (1/n) Σ_i { ĝ^(1)(W_i) − g_0^(1)(W_i) − (1/n) Σ_j J(Z_j; W_i) }.   (22)

Note that the first term is a kernel-weighted average; thus it is O_p(1/√n) by the arguments in Powell, Stock, and Stoker (1989). The second term is O_p(1/√n) by the standard central limit theorem. Finally, the third term is the average of a remainder that converges to zero in probability. Furthermore, we know from Theorem 2.2 that, uniformly over W_i ∈ [W_1, W_2],

  E| ĝ^(1)(W_i) − g_0^(1)(W_i) − (1/n) Σ_j J(Z_j; W_i) |² = O(1/n),   (23)

which implies that the third term is also O_p(1/√n).

A.3 Tables and Figures

Table 1: Simulation Results of the Log-linear Model with a Sine Function

               β̂                    ĝ                 ĝ(·)     ĝ^(1)(·)
          Mean Bias   RMSE    Mean Bias   RMSE        IMSE       IMSE
 50 obs.    0.0266   0.1685     0.2613   0.4383      0.1603     2.3944
100 obs.    0.0150   0.1160     0.2101   0.2969      0.0714     1.4031
200 obs.    0.0018   0.0634     0.1651   0.2448      0.0373     0.8289
400 obs.    0.0065   0.0481     0.1200   0.1691      0.0187     0.4878

Table 2: Simulation Results of the Log-linear Model with a Quadratic Function

               β̂                    ĝ                 ĝ(·)     ĝ^(1)(·)
          Mean Bias   RMSE    Mean Bias   RMSE        IMSE       IMSE
 50 obs.    0.0294   0.1648     0.0281   0.2916      0.0942     1.3545
100 obs.    0.0083   0.1135    -0.0133   0.1910      0.0426     0.7186
200 obs.    0.0038   0.0712     0.0217   0.1430      0.0212     0.4797
400 obs.    0.0083   0.0522     0.0008   0.1086      0.0110     0.3210

Table 3: Simulation Results of the Linear Model with a Sine Function

               β̂                    ĝ                 ĝ(·)     ĝ^(1)(·)
          Mean Bias   RMSE    Mean Bias   RMSE        IMSE       IMSE
 50 obs.    0.0320   0.1941     0.2468   0.4197      0.1799     2.5042
100 obs.    0.0106   0.1162     0.1754   0.2946      0.0744     1.3130
200 obs.   -0.0034   0.0704     0.1683   0.2227      0.0350     0.8216
400 obs.    0.0035   0.0467     0.1300   0.1725      0.0202     0.5262

Table 4: Simulation Results of the Linear Model with a Quadratic Function

               β̂                    ĝ                 ĝ(·)     ĝ^(1)(·)
          Mean Bias   RMSE    Mean Bias   RMSE        IMSE       IMSE
 50 obs.    0.0242   0.1608     0.0135   0.2863      0.0918     1.3011
100 obs.    0.0123   0.0921    -0.0198   0.2009      0.0398     0.7169
200 obs.    0.0085   0.0687     0.0159   0.1499      0.0215     0.4941
400 obs.    0.0017   0.0441    -0.0145   0.1116      0.0111     0.3210
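The log-linear/sine design in Tables 1 and the figures corresponds to a transformation model T(Y) = X′β + g(W) + ε with T(Y) = log(Y) and g(W) = sin(W). A minimal data-generating sketch is below; the specific distributions chosen for X, W, and ε are our own illustrative assumptions, since this excerpt does not state the exact Monte Carlo design.

```python
import numpy as np

def simulate_loglinear_sine(n, beta=1.0, seed=0):
    """Draw (Y, X, W) from log(Y) = X * beta + sin(W) + eps.
    The distributions (standard normal X and eps, uniform W) are
    illustrative assumptions, not the paper's exact design."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)           # regressor entering the linear index
    w = rng.uniform(-np.pi, np.pi, n)    # regressor entering g() nonparametrically
    eps = rng.standard_normal(n)
    y = np.exp(x * beta + np.sin(w) + eps)  # invert T(Y) = log(Y)
    return y, x, w

y, x, w = simulate_loglinear_sine(400)
resid = np.log(y) - x - np.sin(w)  # recovers eps exactly by construction
```

The linear-model designs of Tables 3-4 would replace np.exp(...) with the index itself, and the quadratic designs would replace np.sin(w) with w**2.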

Figure 1: T(Y) = log(Y) and g(W) = sin(W)

Figure 2: T(Y) = log(Y) and g(W) = W^2

Table 5: Descriptive Statistics

Variable                        Mean    Std. Dev.      Min       Max
Child's Test Score             0.028      0.931      -2.714     2.253
Mother's AFQT                 -0.163      0.999      -1.822     2.061
Number of Dependents           2.798      1.309       1         9
Mother's Education            13.350      2.264       0        20
Household Income ($10,000)     2.017      1.266       0.0043    4.9455

Note: Data come from Dahl and Lochner (2008). We used a sample extracted from the year 2000, which has 1,821 observations.

[Figure 3: Estimation Result of the Function g(Income). Plotted against Income ($10,000): the estimated g(income), its 95% confidence band, and the MRC (linear specification) line.]

[Figure 4: Estimation Result of the Function g^(1)(Income). Plotted against Income ($10,000): the estimated g'(income), its 95% confidence band, and the MRC (linear specification) line.]

[Figure 5: Estimation Result of the Functions g(Income) and g^(1)(Income), plotted against Income ($10,000).]

Table 6: Estimation Results

                            Normalized Estimates                              Raw Estimates
            PL              MRC             OLS1            OLS2             OLS1            OLS2
AFQT        1               1               1               1                0.311           0.311
                                                                            [0.263, 0.359]  [0.263, 0.359]
DEP        -0.136          -0.134          -0.212          -0.212           -0.066          -0.066
          [-0.335,-0.069] [-0.272,-0.055] [-0.305,-0.119] [-0.305,-0.116]  [-0.095,-0.037] [-0.095,-0.036]
EDU         0.219           0.155           0.135           0.135            0.042           0.042
          [0.055, 0.289]  [0.050, 0.233]  [0.077, 0.199]  [0.077, 0.199]   [0.022, 0.062]  [0.022, 0.062]
INCOME      0.286           0.214           0.277           0.289            0.086           0.090
          [0.075, 0.520]  [0.145, 0.402]  [0.167, 0.386]  [-0.080, 0.659]  [0.052, 0.120]  [-0.025, 0.205]
INCOME2                                                    -0.003                           -0.001
                                                          [-0.080, 0.074]                  [-0.025, 0.023]
const                                      -1.515          -1.524           -0.471          -0.474
                                          [-2.457,-0.569] [-2.511,-0.537]  [-0.764,-0.177] [-0.781,-0.167]

Note: PL denotes the partially linear model specification, and MRC denotes maximum rank correlation estimation of the linear transformation model. OLS1 is least squares estimation of the linear model with INCOME, and OLS2 additionally includes INCOME squared (INCOME2). The INCOME entry under PL is an estimate of the average derivative effect E[g^(1)(INCOME)]. 95% confidence intervals are reported in brackets below the estimates.
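The normalized OLS columns in Table 6 are consistent with dividing each raw coefficient by the raw AFQT coefficient (0.311), so that AFQT is rescaled to one. The quick check below reproduces the normalized OLS1 entries from the raw ones.

```python
# Reproduce the normalized OLS1 entries in Table 6 by rescaling the raw
# OLS1 coefficients so that the AFQT coefficient equals one.
raw_ols1 = {"AFQT": 0.311, "DEP": -0.066, "EDU": 0.042, "INCOME": 0.086}
normalized = {k: round(v / raw_ols1["AFQT"], 3) for k, v in raw_ols1.items()}
print(normalized)
# -> {'AFQT': 1.0, 'DEP': -0.212, 'EDU': 0.135, 'INCOME': 0.277}
```

These match the normalized OLS1 column of Table 6 (-0.212, 0.135, 0.277) to the reported precision.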