SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS

Size: px

Start display at page:

Download "SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS"

Phebe Thomas
5 years ago
Views:

1 SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS JONG-MYUN MOON Abstract. This paper studies transformation models T (Y ) = X + " with an unknown monotone transformation T. Our focus is on the identi cation and estimation of, leaving the speci cation of T and the distribution of " nonparametric. We identify under a new set of conditions; speci cally, we demonstrate that identi cation may be achieved even when the regressor X has bounded support and contains discrete random variables. Our identi cation is constructive and leads to sieve extremum estimator. The empirical criterion of our estimator has a U-process structure, and therefore does not conform to existing results in the sieve estimation literature. We derive the convergence rate of the estimator and demonstrate its asymptotic normality. For inference, the weighted bootstrap is proved to be consistent. The estimator is simple to implement with standard optimization algorithms. A simulation study provides insight on its nite-sample performance. Date: October 27, 214 A liation and Contact Information: UCL and CeMMAP, jong-myun.moon@ucl.ac.uk. 1

2 2 JONG-MYUN MOON 1. Introduction Data transformation is often used in econometric analysis. For example, dependent variables are routinely log-transformed in linear regressions in order to mitigate nonlinearity and heteroskedasticity. This e ective but arbitrary technique can be justi ed if transforming functions are included as model parameters and then estimated using data. The most prominent example of this approach is the in uential Box-Cox transformation model (Box and Cox, 1964). Those authors suggested a parametric family of power functions, including a log-transformation, as candidate functions for data transformation. There are several variations of this approach, which involve di erent sets of transformation functions. However, if complex patterns are possible, then a nonparametric approach provides a useful alternative. This paper concerns identi cation and estimation of regression models with a nonparametric transformation. Regression models with a transformed dependent variable are called transformation models. Speci cally, transformation models are represented by the equation (1) T (Y ) = X + "; where Y 2 R and X 2 R dx are observed random variables, and " 2 R is an unobserved error term. In the model (1), there are three parameters: (i) the regressor coe cient, (ii) the transformation T and (iii) the error distribution. We consider the case when both T and the error distribution are nonparametric. Horowitz (1996) and Chen (22) review the literature regarding identi cation and estimation of model (1). For related models in econometrics, see Matzkin (27). Following the literature, we assume (i) " is independent of X and (ii) T is strictly monotone. There are several applications of the transformation model (1). An important class of transformation models are duration models. In labor economics, the study of employment and unemployment durations is an important area of research, and the duration model has been the main vehicle of empirical studies (Kiefer, 1988, Farber, 1999). More recently, the unemployment duration is often studied through the labor-market search model (Mortensen and Pissarides, 1999, Rogerson, Shimer, and Wright, 25), which imposes testable implications on the duration models (Eckstein and Van den Berg, 27). See Meyer (1996), van den Berg and Ridder (1998) and van den Berg (21) for related works. Also, hedonic models with additive marginal utility and additive marginal production technology, studied by Ekeland, Heckman, and Nesheim (22, 24), are closely related to the transformation model (1). Chiappori, Komunjer, and Kristensen (213) provides an extensive list of applications in di erent areas.

3 SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 3 We contribute to the literature by producing new conditions for identi cation and proposing a new estimator for. The identi cation exploits two model features: the monotonicity of T and the additive separability of X and ". First, we notice that an ordering is preserved by any monotone transformation. Therefore, if we use only the ordering induced by Y for the identi cation, then the speci c form of the transformation T is entirely irrelevant. This is not to say T is not identi ed; indeed, if is identi ed, then the identi cation of T can be established following Chen (22). However, in order to identify, it is enough to consider the ordering induced by Y, as will be demonstrated. Further, in general, the ordering is completely characterized by a binary relation. Therefore, if we are to use the information on ordering only, it is enough to see the binary comparison of any pair of data. The second observation leading to the identi cation is that the ordering of Y is determined by a linear function X +". Suppose we have two observations (Y 1 ; X1 ) and (Y 2; X2 ). Then we see that Y 1 < Y 2 if and only if X1 + " 1 < X2 + " 2. This observation may be summarized to the equality (2) IfY 2 Y 1 > g = If (X 2 X 1 ) < " 2 " 1 g; for an indicator function Ifg. The relation (2) is similar to the binary choice model. The di erence of errors " 2 " 1 has the role of a random threshold, and the binary outcome of whether the inequality Y 2 Y 1 > holds or not is determined by whether the threshold is crossed by the di erence of two single indices (X 2 X 1 ). These two observations help us formulate a minimization problem that identi es the model parameter as a unique solution. Our identi cation result is similar in spirit to the identi - cation of the maximum rank correlation (MRC) estimator of Han (1987). A distinctive feature of our approach is that the cumulative distribution function (cdf) of " 2 " 1, denoted by F, will be identi ed along with. Our identi cation result is new, and provides new identifying conditions. Speci cally, we allow the regressor vector X to contain discrete random variables. Further, all continuous regressors may have bounded support. Our key identifying condition is intuitive; we require that discrete regressors do not dominate the continuous regressors in terms of the relative contribution to the single index X. However, regardless of the condition s being met, the subvector of for continuous regressors is identi ed. The identi cation is constructive in the sense that it suggests a natural estimator. Our estimator is de ned as a minimizing solution to an empirical criterion, and the empirical criterion is acquired as a sample analogue of the identifying criterion. We propose to use the method of sieves. Sieves refer to a collection of subsets of parameter space which approximates the original parameter space increasingly well. Conceptually, a denser sieve is employed as more data are collected. See Chen (27) for a survey of the literature on sieve estimation.

4 4 JONG-MYUN MOON Our estimation procedure involves minimizing an empirical criterion, which is a function of and F, over a sieve space. As implied by the equation (2), the criterion function will involve pairwise combinations of observations in its formulation. As such, our empirical criterion has a U-process structure; in other words, it appears as a double-summation over every pair of observations. Extremum estimation involving U-processes has been studied by Sherman (1993, 1994) for parametric problems, and the theory is applied to the MRC estimation. The MRC criterion function has a U-process structure, and it is a step function of a Euclidean parameter. On the other hand, our empirical criterion function is a smooth function of parameters, which is one advantage of our approach. However, we need to extend the existing literature to deal with a seminonparametric problem to account for the in nite-dimensional parameter F. To do so, we adopt and modify the existing results on sieve M-estimation by Shen and Wong (1994) and Shen (1997). The main contribution here is to show that the estimator minimizing the U- process can be represented as an approximate M-estimator. We achieve this by approximating the U-process using a more familiar empirical process. The theoretical device used for this task is the U-process maximal inequality; in Appendix B, we present its working form. We show that the estimator of F converges faster than the n 1=4 -rate in terms of L 2 -norm. The estimator of converges at the n 1=2 -rate to the normal distribution. Regarding the inference on, because we provide an explicit form of the asymptotic variance, an inference can be conducted relying on the asymptotic approximation. A downside of this approach is that the asymptotic covariance matrix has quite a complex form, and that it requires estimation of even more nonparametric objects, such as a conditional expectation. Therefore, we prefer simulation-based methods and suggest a weighted-bootstrap scheme to approximate the nite-sample distribution of the estimator. The consistency of weighted bootstrap has been recently shown by Ma and Kosorok (25) and Chen and Pouzo (29) for the sieve M-estimation and the conditional moment model, respectively. We extend these earlier works to the case when the empirical criterion has a U-process structure. There are several literatures related to this paper. First, several papers have proposed to estimate T nonparametrically, when p n-consistent estimator of is available. As such, these papers and our work are complementary. See Horowitz (1996), Ye and Duan (1997), Klein and Sherman (22), and Chen (22). If T is parametrized, then all the model parameters can be estimated jointly, including and T. Relevant works in this approach include Linton, Sperlich, and Van Keilegom (28) and Santos (211) among others. Second, there are rank-based estimators initiated by Han (1987). Other relevant works in this strand include Cavanagh and Sherman (1998), Abrevaya (23), Khan and Tamer (27) and Khan, Shin, and Tamer (211) among others. A common aspect shared by these methods is that

5 SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 5 is identi ed and estimated without knowledge of T and the error distribution. Third, the methods for the single-index model are applicable to the transformation model. Single-index models have been extensively studied in econometrics and statistics since Ichimura (1993); see Horowitz (1998) and Ichimura and Todd (27) for surveys. In addition, although our estimator is designed speci cally for the transformation model, its technical aspect is akin to that of the single-index regression model. This is because the Euclidean parameter enters the in nite-dimensional parameter F as its argument. Rather unexpectedly, however, few works relate the sieve estimation to single-index models; see Ding and Nan (211) and references therein. These results are not applicable to our problem. 1 Therefore, we develop a suitable asymptotic theory that applies to the single-index problem in the context of sieve estimation. The remainder of this paper is organized as follows. Section 2 de nes the model and establishes the identi cation. Section 3 de nes our estimator and shows its consistency. Section 4 derives the rate-of-convergence. Section 5 shows the asymptotic normality of the estimator. It also includes the consistency of the weighted bootstrap procedure. Section 6 contains a simulation study. Section 7 discusses possible extensions. Proofs are gathered in the Appendix. Most notations will be de ned in Section 2 and in Appendix A.1, but inevitably more notations will be added throughout the paper. 2. Identification We de ne the criterion function that identi es and F as its minimizing solution. To this end, we need to introduce scale and location normalizations. As a scale normalization, the rst component of is normalized to 1, and thus is written as ( ;1 ; ) for a scalar ;1 such that j ;1 j = 1 and some (d x 1)-dimensional vector. To see why it is necessary, consider T = ct and " i = c" i for some positive constant c >. Because T is strictly increasing and " i is not observed, an alternative model T (Yi ) = Xi (c ) + " i is observationally equivalent to the original model (1). Therefore, for the point identi cation of, we need to restrict the parameter space for so that no two possible points 1 and 2 can be related as a constant multiple of the other. There are other ways to achieve the scale normalization. For instance, we could set j j = 1, so that the parameter space for is a unit sphere in R dx. The location normalization is achieved by not allowing a constant term in X. Suppose we have a constant term c, and write the model as T (Y i ) = c+x i +" i. This can be equivalently 1 The recent work by Ding and Nan (211) assumes that the empirical criterion is twice Fréchet di erentiable with respect to a certain pseudo-metric. See pp of Ding and Nan (211). Our empirical criterion is not Fréchet di erentiable.

6 6 JONG-MYUN MOON written as T (Y i ) = c + Xi + " i for " i = " i + c c for any constant c. As these two models are observationally equivalent, the constant term c or c is not identi ed. Notice that we do not impose a location normalization to " i ; its mean or median is not restricted. As mentioned in the introduction, our criterion function is motivated by the relation (2). We develop further from (2) to induce the identifying criterion. By taking a conditional expectation in both sides of (2), we have P (Y > jx 1 ; X 2 ) = P (" > X jx 1 ; X 2 ) = 1 F ( X ); where F is the cdf of " 2 " 1 and the notation denotes a di erence of two consecutive observations; that is, () = () 2 () 1. Recall " 1 and " 2 are from i.i.d. samples, and hence the distribution of " 1 " 2 is equal to the distribution " 2 " 1. This implies that 1 F ( z) = F (z) for any z 2 R. Then we have an equation (3) P (Y > jx 1 ; X 2 ) = F (X ): This relation leads us to de ne a new criterion. To state it, let us relabel the parameters and F. Because the rst component of is normalized to 1, we denote it separately by b 2 f 1; 1g. Then is a (d x 1)-by-1 vector such that = (b; ). Combining the parameters of interest and F, we write = (; F ). Then, we de ne a nonlinear least squares criterion implied by the relation (3) as follows; for V i = (Y i ; X i ), (4) h(b; ; V 1 ; V 2 ) = fify > g F (X )g 2 ; Q(b; ) = E[h(b; ; V 1 ; V 2 )]: We call Q(b; ) the population criterion. A corresponding empirical criterion will be de ned in Section 3. Theorem 2.1 below shows that and F are uniquely identi ed as a minimizer of the population criterion function Q, and the in nite-dimensional parameter F is uniquely identi ed on the support of X. The following notations are needed. Because we have di erent conditions for continuous and discrete regressors (see Assumption 2.3), let us divide X i to a continuous random vector X i;c 2 R dc and a discrete random vector X i;c 2 R dx dc so that Xi = (X i;c ; X i;d ). Divide to c and d accordingly. Similarly, we write X = (Xc; Xd ). The support of a random vector X is denoted by supp X. 2 Lastly, we denote N j = supp " \ fx ;c + j 1 ;d : x 2 supp X i;c g \ fx ;c + j ;d : x 2 supp X i;c g; for some constants f j g dx dc j= and j 2 f1; ; d x d c g. The last notation N j is used only for the identi cation purpose (see Assumption 2.4). Assumption 2.1. fy i ; X i ; " i g n i=1 is independent and identically distributed (i.i.d.) and conforms to the equation (1). " i is continuous and independent of X i. 2 For a random variable X, its support is de ned as the smallest closed set B such that P [X 2 B c ] =.

7 SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 7 Assumption 2.2. (i) = ( ;1 ; ) for ;1 2 f 1; 1g and 2 R dx 1. (ii) F is a collection of continuous monotone functions on R. F 2 F. Assumption 2.3. (i) For X i = (X i;c ; X i;d ), X i;c 2 R dc is jointly continuous, and X i;d 2 R dx dc is discrete. There is no constant in X i. (ii) supp X i = supp X i;c supp X i;d. (iii) supp X d is not contained in a proper linear subspace of R dx dc. Assumption 2.4. There exist a set of points f = ; 1 ; ; dx d c g such that j 2 supp X d for j = 1; ; d x d c and f 1 ; ; dx d c g are linearly independent. In addition, the set N j has a non-empty interior for every j = 1; ; d x d c. Assumption 2.1 is standard in the literature although the independence of " i and X i can be weakened to the conditional median independence; see Khan, Shin and Tamer (212). We do not consider this possibility. Assumption 2.2 regards the parameter spaces for and F. Assumption 2.2 (i) restates our scale normalization and restricts to be compact. Assumption 2.2 (ii) de nes F, and restricts the parameter space F for F. Regarding the identi cation, F needs not to be smooth. However, it is essential that any F 2 F is continuous and monotone. The monotonicity requirement is not required if the support for X is R. Heuristically speaking, this assumption regulates the possible value of the parameter when the support of X i is not connected due to the existence of discrete regressors. Assumption 2.3 (i) allows that both continuous and discrete regressors in X i. The requirement that X i;c is jointly continuous implies that supp X c has a non-empty interior, or, equivalently, that supp X c is not included in a proper subspace of R dx;c. If this assumption is violated, then supp X c will exhibit the multi-collinearity. As explained above, a constant term is not allowed. Assumption 2.3 (ii) means that the support of X i;c does not depend on the realization of X i;d. This assumption can be weakened, and we may allow the support of X i;c depends on the value of the discrete regressors X i;d, as long as the support of X i;c conditional on X i;d has non-empty interior in R dc. What is necessary to this generalization is only to modify Assumption 2.4 accordingly. The proof of identi cation will remain essentially same. For simplicity, however, we do not attempt this generalization. Assumption 2.3 (iii) is a requirement for the discrete regressor X i;d, parallel to the requirement that X i;c is jointly continuous. Assumption 2.4 requires that (i) the contribution of discrete variables to the single index X i is not too large relative to that of continuous variables, and (ii) the variation of the error term is not too small. This assumption regards the identi cation of ;d, that is, the regression parameter for the discrete regressors. If there is no discrete regressor, therefore, Assumptions 2.4 is not needed. Even with discrete regressors, if the support of any regressor is R, then we can omit it.

8 8 JONG-MYUN MOON Theorem 2.1. Suppose Assumptions hold and de ne A = F. let Q(b 1 ; 1 ) = min Q(b; ) b2f 1;1g;2A for some b 1 2 f 1; 1g and 1 = ( 1 ; F 1 ) 2 A. Then b 1 = ;1, 1 = and F 1 (z) = F (z) for any z 2 supp X. The proof of Theorem 2.1 is in the Appendix A. Theorem 2.1 establishes the identi cation of and F. We stress that F is identi ed only on the support of X. This fact adds some complication when we study the estimation of and F. 3. Consistency 3.1. Extremum Estimation and Method of Sieves. The identi cation result of Theorem 2.1 is constructive in the sense that it suggests an extremum estimator. This section de nes our estimator and proves its consistency. As there is an in nite-dimensional parameter, the consistency will be stated in terms of a particular norm that we de ne soon. Before proceeding, let us add one simpli cation. Henceforth, we assume ;1 is known and its value is 1; this can be accepted without loss of generality, because our estimator of ;1 exactly equals to the true value, with probability approaching 1. Therefore we let = (1; ), and further, simplify notations to Q() = Q(1; ) and h(; ; ) = h(1; ; ; ). The sample analogue to the population criterion Q de ned in (4) is 1 X (5) Q n () = h(; V i ; V j ): n(n 1) Let us call Q n the empirical criterion. It is immediate from the de nition that E[Q n ()] = Q(). Also, Q n () is a U-statistic for Q(). If viewed as a stochastic process, then Q n () induces a U-process, a generalization of U-statistic; it is a U-process after centering by Q() and scaling by p n. Much of our asymptotic theory will rely on the U-process theory 3. We minimize Q n not over A but over a subset of A, called a sieve. Let us denote a collection of sieves by fa k g. It is required that the sieve A k approximates the entire parameter space A increasingly accurately as the index k grows. For a nite sample size, we get to pick one sieve A k to use. However, conceptually, a di erent sieve A kn is used as the sample size n changes. The sieve index k n depends on n, and grows to the in nity along with the sample size n. Our discussion below will rely on abstract assumptions on the sieve spaces fa k g and the speed of divergence of the sieve index k n. Because is nite-dimensional, we may de ne the sieve A k as a product of and F k ; that is, only the in nite-dimensional F is sieved. 3 U-Process theory is similar to the empirical process theory. For more about U-process theory, see Arcones and Giné (1993), Sherman (1994) and de la Peña and Giné (1999) among others.

9 SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 9 Using the sieve A kn = F kn, we de ne the estimator ^ n as follows; (6) ^ n 2 argmin Q n (): 2A kn We write ^ n = (^ n ; ^F n ) for ^ n 2 and ^F n 2 F kn. If there are multiple minimizers in (6), any point among them can be chosen as the estimator Consistency. In semi-nonparametric problems, there are several candidates for a norm attached to the parameter space. This is due to the in nite-dimensional nature of the problem. One of the main task in studying a semi-nonparametric problem is to nd out a proper norm to the context. In contrast, in parametric problems, the Euclidean norm is a natural choice to measure a distance. We start by de ning a suitable norm to state the consistency of the estimator ^ n. 4 When de ning norms on F, an important fact is that F is only identi ed on the support of X. Therefore, we rst de ne a norm on F as ( ) kf k F;c = max sup jf (z)j; sup jf (z)j z2supp X supp X Then, de ne the consistency norm k k c on A as kk c = jj + kf k F;c. Also, denote the usual sup-norm by k k 1. We are ready to state assumptions for the consistency. We assume X i has at least fourth moment (EjX i j 4 < 1). Also fv i = (Y i ; X i ) g is always a random sample. These two premises are maintained throughout the paper. We list other more substantial assumptions. Assumption 3.1. (i) The parameter is uniquely identi ed in the sense of Theorem 2.1. (ii) is a compact subset of R d with a non-empty interior. is an interior point of. Assumption 3.2. (i) For some integer 3, max i2f;1; ;g sup z2r j di F dz i (z)j < 1. (ii) For some constant! >, collect every monotone function F on R such that max sup d i2f;1; ;g ff (z) dzi F (z)g (1 + z2 )!=2 B; z2r for some positive constant B >. The set F is the closure of this function class in the norm kf k 1;1 = kf k 1 _ kf k 1. Assumption 3.3. There exists a sequence f k F g such that k F 2 F k and max i2f;1g sup z2r d dz i f k F (z) F (z)g! as k! 1: Assumptions 3.1 is standard. The true parameter needs not to be an interior point for consistency, but it is included for later results. Assumption 3.2 (i) states that F is at least 4 Later we add more norms when needed. See the de nition (7) and Appendix A.1. In fact, all those norms are only semi-norms. We do not stress this fact. :

10 1 JONG-MYUN MOON -times di erentiable and its derivatives are uniformly bounded. Assumption 3.2 (ii) de nes the set F. There are several implications. First, by de nition, F is an interior point 5 of F. Second, the weighting function (1 + () 2 )!=2 is included to address the case when X i have an unbounded support. The particular form of the weighting function and its technical usage come from Gallant and Nychka (1987). Third, F 2 F needs not to be a cdf. Recall that F 2 F being continuous monotone is enough for the identi cation (Assumption 2.2 (ii)). However, it is possible to make F include only cdfs. Similarly, knowing that F is symmetric (that is, F (z) = 1 F ( z) for any z 2 R), we may restrict that every F 2 F is symmetric. The asymptotic distribution of ^ n is not a ected by the choice of F. Assumption 3.3 speci es the approximation property of the sieves. For consistency, it is enough that the true parameter F is well approximated. We de ne k = ( ; k F ). Notice that k k k c!. Theorem 3.1. Suppose Assumptions hold. Then k^ n k c p!. The proof of Theorem 3.1 is in the appendix. Notice that the derivative F, as well as F, is consistently estimated, uniformly on the support of X. This result is used to establish the convergence rate of ^ n in a weaker norm. 4. Rate of Convergence This section derives the convergence rate of the estimator ^ n. The rst step is to de ne an appropriate norm on A. To this end, we need to show that the population criterion Q induces a norm on the parameter space local to. We provide heuristic explanations. Given the consistency result, we can focus on a subset of parameter space A near to. Consider a local neighborhood of in the normed space (A; k k c ). By the equality (3), it is easy to show that Q() Q( ) = E[F (X ) F (X )] 2 : Recall that we set ;1 = 1 and as such X = X 1 + X for X j = X 2;j X 1;j and X = [X 2 ; ; X dx ]. Applying the Taylor expansion to F (X ) F (X ), we obtain the following approximate equality: Q() Q( ) ' E[fF (X ) X ( ) + F (X ) F (X )g 2 ]; If k k c is small, then we may replace F (X ) by F (X ) in the last expression. This is the reason why the consistency norm kk c is chosen to involve the rst-order derivative of F. 5 Here, we regard F as a normed space attached with k k1;1; that is, F is a whole set. Note that F is not an interior point of a larger normed space ff : kf k 1;1 < 1g with the same norm.

11 SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 11 This heuristic observation motivates us to de ne the following norm as a measure of rate of convergence for ^ n ; de ne the rate norm k k q as (7) kk q = fe[ff (X ) X + F (X )g 2 ]g 1=2 : The subscript q is chosen to indicate that the norm is derived from the population criterion Q. Lemma A.15 proves that Q() Q( ) is locally similar to kk 2 q on the open neighborhood of in the normed space (A; k k c ). In standard parametric problems, a similar relation holds with the Euclidean norm jj. The rate norm kk q is not necessarily an object of interest; however, it turns out that the rate norm kk q is equivalent 6 to the norm jj+kf (X )k L2 (P ) for the usual L 2 -norm kk L2 (P ) with respect to the probability measure P. Then, for instance, the upper bound for the rate of j^ n j is given by the k k q -norm rate. The following three assumptions, in addition to the assumptions for the consistency, will be used to derive the rate of convergence. Assumption ! >! + : Assumption 4.2. There exists a sequence f k = ( ; F ;k ) : k 2 N; F k 2 F k g such that r n k kn k q = o(1). Assumption 4.3. Denote, for X = [X 2 ; ; X dx ], = E[F (X ) 2 f X E[ XjX ]gf X E[ XjX ]g ]: The matrix is non-singular. Assumption 4.1 limits possible values for the constants and!. Recall that these two constants are used to de ne the parameter space F in Assumptions Note that the convergence rate r n will be determined by these constants. Assumption 4.2 states that the sieve approximation error k kn k q vanishes faster than the convergence rate r n. This requirement is intuitive because the rate of k kn k q is an upper bound for the rate of k^ n k q. Assumption 4.3 is a key condition to the entire rate calculation. It has a similar role to the nonsingularity of Hessian matrix in usual parametric problems. The particular form of the matrix will be suggested in the proof of Lemma A.14, which proves the norm equivalence of kk q and jj + kf (X )k L2 (P ). These three assumptions with the consistency of ^ n in the norm k k c are su cient to have the following result. Recall that the constants and! are de ned in Assumption 3.2. Theorem 4.1 (Rate of Convergence). Suppose Assumptions and hold. Then r n k^ n k q = O p (1); 6 Two norms are equivalent if their ratio remains within a xed range [a; b] for < a < b < 1, for any point. This equivalence result is proved in Lemma A.14.

12 12 JONG-MYUN MOON for the rate-of-convergence factor r n = n!=(2!++!). The convergence rate for sieve M-estimator is proved by Shen and Wong (1994). A similar result can be found in van der Vaart and Wellner (1996), Theorem We use the proof method similar to van der Vaart and Wellner (1996). When doing so, it needs to be considered that the empirical criterion Q n has a U-process structure. Sherman (1993, 1994) study a similar problem in parametric problems. Our result extends Sherman (1993, 1994) to in nite-dimensional problems with sieve spaces. To facilitate asymptotic analysis, we need to decompose the criterion function. De ne for v; v 1 ; v 2 2 R 2+dx, m(; v) = E[h(; V 1 ; V 2 )jv 1 = v] + E[h(; V 1 ; V 2 )jv 2 = v] Q(); g(; v 1 ; v 2 ) = h(; v 1 ; v 2 ) E[h(; V 1 ; V 2 )jv 1 = v 1 ] + E[h(; V 1 ; V 2 )jv 2 = v 2 ] + Q(): Note that E[m(; V 2 )] = Q() and E[g(; V 1 ; V 2 )] =. Moreover, it can be checked that (8) Q n () = 1 nx 1 X m(; V i ) + g(; V i ; V j ): n n(n 1) i=1 The expression (8) is called the Höe ding decomposition; this is a fundamental result to the U-statistic theory. Because E[g(; V 1 ; V 2 )jv 1 ] = E[g(; V 1 ; V 2 )jv 2 ] = for any 2 A, the second term in the right of (8) is called a degenerate U-process. From the last expression, it is clear that the U-process criterion is the sum of a sample-mean process and the degenerate U-process. As such, our proof of Theorem 4.1 can be divided to two parts. First, we show that the degenerate U-process in (8) is asymptotically negligible. This is proved in Lemma A.13 in the appendix. Then, we can treat ^ n as a M-estimator minimizing the sample mean of m(; V i ) with some error; the error comes from the degenerate U-process. Second, we prove the rate-of-convergence using the empirical process theory, similar to van der Vaart and Wellner (1996) Theorem Asymptotic Normality This section focuses on the asymptotic distribution of ^ n ; recall that = (1; ). The in nitedimensional parameter F is treated as a nuisance parameter. The rst step is to express as a function of. We are to express such a functional as an inner product of and a special point v. The inner product is induced by the norm k k q. To de ne it, let V be a product space of R d and ff : kf (X )k L2 (P ) < 1g. For arbitrary two points v; w in V, we de ne (9) hv; wi = E[fF (X ) X v + F v (X )gff (X ) X w + F w (X )g];

13 SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 13 for v = ( v ; F v ) and w = ( w ; F w ). It can be easily veri ed that the bilinear map h; i is indeed an inner product. Then, the special point v is de ned as follows. Let v = ( ; F ) for = 1 and F (z) = F (z)e[ X jx = z] 1 : Assume that v is in V, or, equivalent, assume that kf (X )k L1(P ) is nite. By easy calculation 7, one can show that (1) = h; v i: Therefore, we know the exact expression for the special point v. Even when its expression is unknown, however, the existence of v is guaranteed by the Riesz representation theorem if V is a Hilbert space and the map 7! is bounded linear. For this reason, v is often called the Riesz representer. The representation of as an inner product (1) is instrumental, since it is possible to approximate the inner product by the population criterion. Note that the inner product (9) is equivalently de ned by the polarization identify: (11) 4 hv; wi = kv + wk 2 q kv wk 2 q: Therefore, if the two squared norms in (11) are well approximated, so is the inner product. A relevant fact is that the rate-norm k k q is chosen to approximate the population criterion Q locally to ; see (7) in the previous section. Therefore it is foreseeable that can be expressed using Q. There are technical subtlety in doing so, and more details can be found in the proof of Theorem 5.1. To obtain the asymptotic normality, the following assumptions are used. Assumption 5.1.! > +!. Assumption 5.2. For kn = ( ; F ;kn ) de ned in Assumption 4.2, kf ;kn (X ) F (X )k L2 (P ) = o(n 2=3 ), kf ;k n (X ) F (X )k L4 (P ) = o(n 1=3 ): Assumption 5.3. (i) F k spanfp 1 ; ; p k g for all k; (ii) fkp j k 1 g 1 j=1 is uniformly bounded. Assumption 5.4. Let j (k) = max 1ik k dj p dz j i k 1. Then the followings hold: (i) 1 (k n ) _ 2 (k n ). p k n ^ rn 2, (ii) k n rn 3 = o(n 1 ) and (iii) knr 2 n 1 = o(1). Assumption 5.5. Let p k (z) = (p 1 (z); ; p k (z)). The smallest eigenvalue of E[p k (X )p k (X ) ] is bounded away from zero uniformly in k 2 N. Assumption 5.6. For any 2 R d, there exists a sequence f kn v : k n v = ( ; F ;k n ); 2 ; F ;k n 2 spanfp 1 ; ; p kn gg; 7 A similar calculation appears in the proof of Lemma A.14.

14 14 JONG-MYUN MOON such that (i) p nr n k kn v v k q! as n! 1 and that (ii) sup n2r kf ;k n k 1 is bounded. Assumption 5.1 is stronger than Assumption 4.1. For the asymptotic normality of ^ n, we need that k k 2 q is well approximated by Q() Q( ) for close to ^ n. Therefore, if ^ n converges faster, then the approximation shows less error. By imposing Assumption 5.1, we achieve the faster convergence rate and hence control the approximation error. Assumption 5.2 demands the sieve approximation error vanishes not only for F but also for its derivative F at a certain rate. Assumption 5.3 limits the sieve space that we consider. As mentioned already, we choose F k to be nite-dimensional and linear. The functions fp 1 ; p 2 ; g are called basis functions. Assumption 5.4 regards the smoothness property of the basis functions. Note that j (k) can be regarded as a smoothness measure for a basis functions fp 1 ; ; p k g. The role of Assumption 5.4 is to control the convergence of derivatives of ^Fn. Recall that the convergence rate is stated in terms of the rate norm k k q, and that the convergence of k^ n k q does not imply that the derivatives ^F n and ^F n converge in some norm. However, by imposing Assumption 5.4, we can control the convergence rate of k ^F n F k 1 and k ^F n F k 1 with regard to k ^F n F k 1. Assumption 5.5 is used to establish the norm equivalence between kf (X )k L2 (P ) and kf (X )k L1(P ) for F in F k. This is possible because F k is a nitedimensional sieve; recall that in the Euclidian space, L p -norms are equivalent for 1 p 1. Assumption 5.6 states that the Riesz representer v can be approximated by a sequence in the sieves to a certain precision. Before stating the main result, we add one more notation. De ne the linear directional derivative of h(; ; ) to the direction v = ( v ; F v ) 2 V as (12) h (; ; )[v] = d dt h( + tv; ; ) t= : Now we state the main result of this paper. Theorem 5.1. Suppose Assumptions , 4.3, hold. Then p n(^n )! d N(; ); where the matrix is such that, for any 2 R d, = E h ( ; V 1 ; V 2 )[v h ( ; V 1 ; V 3 )[v ]]: The proof of Theorem 5.1 can be found in Appendix A, and the functional form of h ( ; V 1 ; V 2 )[] is derived by Lemma A.19. Because we have an explicit expression for v, it is possible to estimate v and then the matrix. If is consistently estimated, the inference on can be conducted relying on the asymptotic normality result of the above theorem. A downside to this approach is that it involves several nonparametric estimations. For instance,

15 SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 15 to estimate v, the conditional expectation E[ X jx ] needs to be estimated. Therefore, a simulation-based method is preferred. Below, we prove the consistency of weighted bootstrap Weighted Bootstrap. Consider a randomly generated sequence of weights fb i g n i=1. We assume E[B i ] = 1 and V ar(b i ) = 1. If these conditions are met, it may have any distribution. Possible distributions are the discrete uniform distribution on f; 2g or the normal distribution N(1; 1). De ne the weighted empirical criterion Q n() = 1 n(n 1) X B i B j h(; V i ; V j ): Next, de ne ^ n to be a point such that ^ n 2 A kn and ^ n 2 argmin 2Akn Q n(): Also, write ^ n = (^ n; ^F n). The following theorem proves that the asymptotic distribution of p ^ n(^ n n ) conditional on the sample fv 1 ; ; V n g is same with the unconditional asymptotic distribution of p n(^ n ). Theorem 5.2. Suppose all the conditions of Theorem 5.1 hold. If fb i g n i=1 is an i.i.d. sequence such that E[B i ] = 1 and V ar(b i ) = 1, then for any c 2 R d and any n 2 N, P [ p n(^ n ^n ) cjv 1 ; ; V n ] = P [ p n(^ n ) c] + o p (1): The bootstrap inference is easy to implement. Fix the distribution for B i, and draw the random weights fb i g n i=1. Then, estimate ^ n by minimizing the weighted empirical criterion Q n. The sieve-size index k n remains same with the original problem. By repeating this procedure, we obtain the empirical distribution of p n(^ n ^n ) conditional on fv i g n i=1. Then, the quantiles of the empirical distribution can be used as critical values for the inference on p n(^n ). 6. Simulation Study Many duration models are examples of the transformation model. Proportional hazard models and mixed proportional hazard models are all nested in transformation models. 8 We use those two models to conduct the following simulation study. 8 Proportional hazard models assume the error distribution is xed to be a negative extreme-value distribution, whereas the transformation function (or baseline hazard) remains nonparametric. Mixed proportional hazard models are more general, but still restrictive; for instance, the normal distribution is not allowed for the error distribution (Ridder, 199).

16 16 JONG-MYUN MOON Design 1 Design 2 Design Figure 1. CDF of We consider three designs. The transformation function T (y) = log y is chosen for data generation. However, note that all three estimators are numerically invariant even if Y is transformed by any other monotone function. The data are generated from the following equation log Y = X 1 1 X 2 2 X 3 ", for ( 1 ; 2 ) = (1; 1): This speci cation is shared by all three designs. Further, we x the distribution of (X 1 ; X 2 ; X 3 ); X 1 and X 2 are standard normal random variables and X 3 is a binary random variable with equal probabilities of being or 1. (X 1 ; X 2 ; X 3 ) are mutually independent. Across three designs, we vary only the distribution of ". This is summarized below: Design 1: " EV (; 1); Design 2: " d = log v + u; for v (1; 1) and u EV (; 1); Design 3: " d = log v + u; for v (3; 3) and u EV (; 1); where EV (; 1) means the standard extreme-value distribution with cdf F (z) = exp( exp( z)), and (; ) denotes the gamma distribution with mean and variance 2. Design 1 conforms to the proportional hazard model. Design 2 and 3 belong to the mixed proportional hazard model or frailty model. As the additional random error v follows the gamma distribution, they are also called a gamma frailty model. Finite-sample distributions of several estimators are compared. Let us call the sieve extremum estimator developed by this paper, the sieve estimator. We compare the sieve estimator with two others estimators: Cox estimator for the proportional hazard model and the MRC

17 SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 17 estimator of Han (1987). Note that the Cox estimator is mis-speci ed for Design 2 and Design 3. We still report the result because Cox model is widely used in empirical researches. For each design, we generate samples of size 1 and 3. Then the parameter = ( 1 ; 2 ) is estimated for (i) Cox estimator, (ii) sieve estimator, and (iii) MRC estimator. Regarding the sieve estimator, we vary the dimension of sieve space to k = 3; 5; 7. The estimation procedure is repeated 5 times, and we report the sample bias and the sample mean squared error (MSE) from 5 estimates of ve estimators. To implement the sieve estimator, sieve F k needs to be speci ed. We choose I-spline as basis functions. Ramsay (1988) explains the construction of I-spline. What is useful with I-spline is that each basis function is a cdf of some continuous random variable. Therefore, it is easy to tune F k to our purpose of estimating a symmetric cdf. We construct F k to contain only symmetric cdfs from I-spline bases. The dimension of F k equals to the index k. The simulation results are summarized in Figure 2-7. In each gure, the left panel shows the bias and the right panel shows MSE. Bias1 indicates the bias of estimating 1. Bias2 is for 2. MSE1 and MSE2 also correspond to 1 and 2 respectively. Design 1 provides a good benchmark to our estimator. It is because the Cox estimator is correctly speci ed and has one less in nite-dimensional parameter. Not surprisingly, the Cox estimator shows the least MSE. Our estimator behaves comparably well. The e ciency loss of our estimator relative to the Cox estimator seems bearable when considering Design 2 and Design 3. The Cox estimator shows a large bias in these mis-speci ed designs. On the contrary, the sieve estimator performs well across all three designs. Compared to MRC estimator, the sieve estimator shows less MSE, especially for a smaller sample size of n = 1. We also notice that the sieve estimator is not sensitive to di erent sieve-size indexes k 2 f3; 5; 7g. In summary, we nd that the sieve estimator behaves well, even for a small sample size. 7. Conclusion The intuition that a binary comparison speci es the ordering is used to identify the transformation model. A new estimator is constructed from the identi cation result. Its asymptotic distribution is derived, and the bootstrap inference is justi ed. As technical by-products, we contribute to the literature on the sieve estimation by studying a U-process problem and showing how to handle the single-index structure in the semi-nonparametric problem. Several important extensions are possible. Regarding its application to duration models, we may extend the current method to account for censoring and time-varying regressors. Another direction is to consider competing risks models. We hope to study these extensions in future researches.

18 18 JONG-MYUN MOON Cox Sieve3 Sieve5 Sieve7 MRC Cox Sieve3 Sieve5 Sieve7 MRC Bias1 Bias2 MSE1 MSE2 Figure 2. Simulation result for Design 1 when n = Cox Sieve3 Sieve5 Sieve7 MRC Cox Sieve3 Sieve5 Sieve7 MRC Bias1 Bias2 MSE1 MSE2 Figure 3. Simulation result for Design 1 when n = Cox Sieve3 Sieve5 Sieve7 MRC.1.5 Cox Sieve3 Sieve5 Sieve7 MRC Bias1 Bias2 MSE1 MSE2 Figure 4. Simulation result for Design 2 when n = 1

19 SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS Cox Sieve3 Sieve5 Sieve7 MRC Cox Sieve3 Sieve5 Sieve7 MRC Bias1 Bias2 MSE1 MSE2 Figure 5. Simulation result for Design 2 when n = Cox Sieve3 Sieve5 Sieve7 MRC Cox Sieve3 Sieve5 Sieve7 MRC Bias1 Bias2 MSE1 MSE2 Figure 6. Simulation result for Design 3 when n = Cox Sieve3 Sieve5 Sieve7 MRC Cox Sieve3 Sieve5 Sieve7 MRC Bias1 Bias2 MSE1 MSE2 Figure 7. Simulation result for Design 3 when n = 3

20 2 JONG-MYUN MOON Appendix A. Proofs A.1. Notations. We de ne and use several norms throughout the appendix. k k ;1;! The norm kf k ;1;! = max i sup z2r j di F (z)j(1 + z 2 )!=2 dz i k k ;1 The norm kf k ;1 = max i sup z2r j di F (z)j dz i k k 1 The norm kf k 1 = sup z2r jf (z)j k k L1(P ) The norm kxk L1(P ) is the essential supremum of the random variable X k k Lp(P ) The norm kxk Lp(P ) = fejxj P g 1=p for any integer p 1 k k F;c The norm kf k F;c = kf (Z )k L1(P ) + kf (Z )k L1(P ) for Z = X k k e;;1 The norm kk e;;1 = jj + kf k ;1 k k e;1 The norm kk e;1 = jj + kf k 1 k k c The norm kk c = jj + kf k F;c k k q The norm kk q is de ned in (7) k k e;lp The norm kk e;lp = jj + kf (Z )k Lp(P ) Other notations used in the appendix are gathered in the table below. Z d A scalar random variable such that Z = X a. b a Kb for a universal constant K not depending on a or b a b a. b and a & b N("; F; k k) The covering number 9 of size " for a set F under the norm k k N [] ("; F; k k) The bracketing number of size " for a function class F under the norm k k C 1 ; C 2 ; Generic positive constants which do no depend on the context of the proof The degree of smoothness of F; see Assumption 3.2! The constant for the weighting function (1 + z 2 )!=2 ; see Assumption 3.2 j See Assumption 5.4 n See Remark A.17 A.2. Proof for Section 2. Lemma A.1. Suppose Assumptions hold. Suppose 2 f 1; 1g and F 2 F. If F (x ) = F (x ) for any x 2 supp X, then = and F (z) = F (z) for any z 2 supp X : Proof. Note 2 supp X d. Hence, if x = (x c ; ), by Assumption 2.3 (ii), we have (13) F (x c c ) = F (x c ;c ) for any x c 2 supp X c : As " is a di erence of two i.i.d. continuous RVs, is an interior point of supp ". Regarding supp X c, the same holds by Assumption 2.3 (i). We know c 6= and ;c 6= since 9 See p.83 of van der Vaart and Wellner (1996) for the precise de nition.

21 SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 21 j ;1 j = 1. Observe that 2 R is an interior point to both supp Xc c and supp Xc ;c. Therefore we can nd an open neighborhood of denoted by N c R dc such that (14) N c supp " \ supp X c c \ supp X c ;c : We rst show F is strictly increasing on N c. Suppose not. Then nd two points x 1 ; x 2 2 N c with the following three properties; (i) x 1 and x 2 are di erent only in the rst coordinates, say x 1;1 6= x 2;1 ; (ii) x 1;1 < x 2;1 ; (iii) F (x 1 c) F (x 2 c). Because F is strictly increasing on N c, F (x 1 ;c) < F (x 2 ;c). Then either F (x 1 c) 6= F (x 1 ;c) or F (x 2 c) 6= F (x 2 ;c). This contradicts the condition of the lemma. As such, F is strictly increasing on N c. Next, we prove c = ;c. Suppose not. Find two points x 1 ; x 2 2 N c such that x 1 c > x 2 c and x 1 ;c < x 2 ;c. By strong monotonicity of F and F on N c, F (x 1 c) > F (x 2 c) and F (x 1 ;c) < F (x 2 ;c). We reach a contradiction and conclude c = ;c. Then by (13), we can infer that F (z) = F (z) for any z 2 supp X c ;c. So far, ;c is identi ed and F is identi ed only on supp X c ;c. We move on to the identi cation of ;d. To this end, we nd the values of f j ;d g dx dc j=1 ; for the de nition of j, see Assumption 2.4. Start by j = 1. By Assumption 2.4, there are two points x 1 ; x 2 2 supp X c such that x 1 ;c = x 2 ;c + 1 ;d 2 N 1. Because F is strictly increasing on N 1 and F = F on N 1, then it follows that F (x 1 ;c)? F (z) if x 1 ;c? z. In other words, (15) x 1 ;c = z if F (x 1 ;c ) = F (z): By the condition of the lemma, (16) F (x 1 ;c ) = F (x 1 ;c ) = F (x 2 ;c + 1 ;d ) = F (x 2 ;c + 1 d ): From (15) and (16), we see that 1 d = x 1 ;c x 2 ;c = 1 ;d and F (z) = F (z) on z 2 supp X c ;c [ fx ;c + 1 ;d : x 2 supp X c g: Repeat the same argument for each j to identify other j ;d s. Then we identify f j ;d g dx dc j=1. As the last step, we note that, since f 1 ; ; dx d c g is linearly independent, ;d 2 R dx dc is identi ed. Conclude that = and that F (z) = F (z) for any z 2 supp X. Proof of Theorem 2.1. We know P (Y > jx 1 ; X 2 ) = F (X ). iterated expectation, By this fact and the Q(b; ) = E[E[IfY gf1 2F (X )gjx 1; X 2 ]f1 2F (X )g + F (X ) 2 ] = E[ F (X )(1 2F (X )) + F (X ) 2 ]:

22 22 JONG-MYUN MOON The last expectation can be simpli ed to the sum of E[fF (X ) F (X )g 2 ] and some constant not depending on parameters. From this observation, it is obvious that Q() is minimized only if F (X ) = F (X ) almost surely. Lemma A.1 proves that if F (X ) = F (X ), then it follows that = and F (z) = F (z) for any z 2 supp X. Hence we conclude. A.3. Proof for Section 3. Remark A.2 (The constant B). Note that kf k ;1 for any F 2 F is uniformly bounded. By Hölder inequality and Assumptions 3.1, kf k ;1 kf F k ;1 + kf k ;1 B + kf k ;1 : The second inequality holds because the weighting function is strictly larger than 1. As kf k ;1 is bounded by Assumption 3.1, kf k ;1 is bounded by a universal constant B + kf k ;1. We denote B = B + kf k ;1. Lemma A.3. Under Assumptions3.1(ii) and 3.2(ii), for any 1 ; 2 2 A and v i = (d i ; y i ; x i ) 2 supp V i, (17) jh( 1 ; V 1 ; V 2 ) h( 2 ; V 1 ; V 2 )j. (jx 1 j + jx 2 j + 1)k 1 2 k e;1 : Proof. We use notations x; x below; they are de ned similarly to X; X. Observe jh( 1 ; V 1 ; V 2 ) h( 2 ; V 1 ; V 2 )j = d 1 d 2 j2ify g F 1 (x 1 ) F 2 (X 2 )j jf 1 (x 1 ) F 2 (x 2 )j (18). jf 1 (x 1 ) F 2 (x 2 )j; where the inequality holds by Remark A.2 and the fact that d 1 is a binary variable. Taylor expansion after obvious expansion, jf 1 (x 1 ) F 2 (x 2 )j is equal to By jf 1(z ) X ( 1 2 ) + F 1 (x 2 ) F 2 (x 2 )j: for some z 2 [ X 1 ; X 2 ]. Since kf 1 k 1 < B by Remark A.2, using Hölder inequality, we have jh( 1 ; V 1 ; V 2 ) h( 2 ; V 1 ; V 2 )j. Bjxj j 1 2 j + jf 1 (x 2 ) F 2 (x 2 )j (19). (jx 1 j + jx 2 j + 1) j 1 2 j + jf 1 (x 2 ) F 2 (x 2 )j ; where the second inequality holds by that Bjxj + 1. jx 1 j + jx 2 j + 1. The result (17) follows (19). Lemma A.4. Under Assumptions3.1(ii) and 3.2(ii), jq( 1 ) Q( 2 )j. k 1 2 k e;1 ;

23 SIEVE EXTREMUM ESTIMATION OF TRANSFORMATION MODELS 23 for any 1 ; 2 2 A. Proof. By Jensen s inequality, jq( 1 ) Q( 2 )j E jh( 1 ; V 1 ; V 2 ) h( 2 ; V 1 ; V 2 )j. The claim follows by Lemma A.3. Lemma A.5. Under Assumptions , F is compact in k k 1;1 -norm and A is compact in k k e;1;1 -norm. Proof. We recall Lemma A.4 of Gallant and Nychka (1987); let us call it GN. Let = ; =!; m = 1; m = 1; k = 1 in the cited lemma. Although one of the conditions is that < <, it can be learnt from the proof that can be zero (and indeed can be negative). The set F de ned in Assumption 3.2 is smaller than a corresponding set in the cited lemma; note we de ne F as k k ;1;! -ball of radius B=2 whereas GN sets up F as a ball in the L 2 - type norm similarly de ned to k k ;1;!. All other conditions of GN are included verbatim in Assumptions Therefore, we know F is relatively compact in k k 1;1 -norm. By Assumption 3.2, F is compact in k k 1;1 -norm. The second claim follows immediately. Lemma A.6. Suppose Assumptions hold. Let " > be small enough. Then log N("; A; kk e;1 ). 1 log " + " ; = +!! : Proof. The inequality (2) is immediate from the de nitions of the covering number and the norm kk e;1 : (2) N("; A; kk e;1 ) N("=2; ; jj) N("=2; F; kk 1 ): Because is compact, "=2-covering number of is proportional to " d. As such, ignoring constant terms, log N("=2; ; jj). 1 log ": Denote C ;! B=2 = ff : kf k ;1;! B=2g. By Lemma A.3 of Santos (212), for some " >, if " < ", log N("; C ;! B=2 ; kk 1 ). ". Since, by Assumption 3.2, ff F : F 2 Fg C ;! B=2 ; it follows that N("; F; kk 1 ) N("; C ;! B=2 ; kk 1 ). Hence the claim is shown. Remark A.7. When we use Lemma A.6, we ignore that it holds for small ". This is harmless simpli cation. Lemma A.8. Under Assumptions , sup 2A jq n () Q()j p! as n! 1. Proof. Let H = fh (; ; ) : 2 Ag. By Lemma A.3, Ejh(; V 1 ; V 2 ) h(; V 1 ; V 2 )j. Kk 1 2 k e;1 ;

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Songnian Chen a, Xun Lu a, Xianbo Zhou b and Yahong Zhou c a Department of Economics, Hong Kong University