Estimation of the Error Density in a Semiparametric Transformation Model

Benjamin Colling (Université catholique de Louvain), Cédric Heuchenne (University of Liège and Université catholique de Louvain), Rawane Samb (Université catholique de Louvain), Ingrid Van Keilegom (Université catholique de Louvain)

November 14, 2013

Abstract

Consider the semiparametric transformation model $\Lambda_{\theta_0}(Y) = m(X) + \varepsilon$, where $\theta_0$ is an unknown finite dimensional parameter, the functions $\Lambda_{\theta_0}$ and $m$ are smooth, $\varepsilon$ is independent of $X$, and $E(\varepsilon) = 0$. We propose a kernel-type estimator of the density of the error $\varepsilon$, and prove its asymptotic normality. The estimated errors, which lie at the basis of this estimator, are obtained from a profile likelihood estimator of $\theta_0$ and a nonparametric kernel estimator of $m$. The practical performance of the proposed density estimator is evaluated in a simulation study.

Key Words: Density estimation; Kernel smoothing; Nonparametric regression; Profile likelihood; Transformation model.

Running Head: Error density estimation in transformation models

Research supported by IAP research network P7/06 of the Belgian Government (Belgian Science Policy), and by the contract "Projet d'Actions de Recherche Concertées" (ARC) 11/16-039 of the Communauté française de Belgique, granted by the Académie universitaire Louvain. Research supported by IAP research network P7/06 of the Belgian Government (Belgian Science Policy). Research supported by the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013) / ERC Grant agreement No. 203650, by IAP research network P7/06 of the Belgian Government (Belgian Science Policy), and by the contract "Projet d'Actions de Recherche Concertées" (ARC) 11/16-039 of the Communauté française de Belgique, granted by the Académie universitaire Louvain.

1 Introduction

Let $(X_1,Y_1),\ldots,(X_n,Y_n)$ be independent replicates of the random vector $(X,Y)$, where $Y$ is a univariate dependent variable and $X$ is a one-dimensional covariate. We assume that $Y$ and $X$ are related via the semiparametric transformation model

$$\Lambda_{\theta_0}(Y) = m(X) + \varepsilon, \qquad (1)$$

where $\varepsilon$ is independent of $X$ and has mean zero. We assume that $\{\Lambda_\theta : \theta \in \Theta\}$ (with $\Theta \subset \mathbb{R}^p$ compact) is a parametric family of strictly increasing functions defined on an unbounded subset $\mathcal{D}$ of $\mathbb{R}$, while $m$ is the unknown regression function, belonging to an infinite dimensional parameter set $\mathcal{M}$. We assume that $\mathcal{M}$ is a space of functions endowed with the norm $\|\cdot\|_{\mathcal{M}}$. We denote by $\theta_0 \in \Theta$ and $m \in \mathcal{M}$ the true unknown finite and infinite dimensional parameters. Define the regression function $m_\theta(x) = E[\Lambda_\theta(Y) \mid X = x]$ for each $\theta \in \Theta$, and let $\varepsilon_\theta = \varepsilon(\theta) = \Lambda_\theta(Y) - m_\theta(X)$.

In this paper, we are interested in the estimation of the probability density function (p.d.f.) $f_\varepsilon$ of the error term $\varepsilon = \Lambda_{\theta_0}(Y) - m(X)$. To this end, we first obtain estimators $\hat\theta$ and $\hat m_\theta$ of the parameter $\theta_0$ and the function $m_\theta$, and second, form the semiparametric regression residuals $\hat\varepsilon_i(\hat\theta) = \Lambda_{\hat\theta}(Y_i) - \hat m_{\hat\theta}(X_i)$. To estimate $\theta_0$ we use a profile likelihood (PL) approach, developed in Linton et al. (2008), whereas $m_\theta$ is estimated by means of a Nadaraya-Watson-type estimator (Nadaraya (1964), Watson (1964)). To our knowledge, the estimation of the density of $\varepsilon$ in model (1) has not yet been investigated in the statistical literature.

Estimating the error density in the semiparametric transformation (SPT) model $\Lambda_{\theta_0}(Y) = m(X) + \varepsilon$ may be very useful in various regression problems. First, taking transformations of the data may induce normality and error variance homogeneity in the transformed model, so the estimation of the error density in the transformed model may be used for testing these hypotheses; it may also be used for goodness-of-fit tests of a specified error distribution in a parametric or nonparametric regression setting. Some examples can be found in Akritas and Van Keilegom (2001) and Cheng and Sun (2008), but with $\Lambda_{\theta_0} \equiv \mathrm{id}$, i.e. the response is not transformed. Next, the estimation of the error density in the above model can be useful for testing the symmetry of the residual distribution; see Ahmad and Li (1997), Dette et al. (2002), Neumeyer and Dette (2007) and references therein, in the case $\Lambda_{\theta_0} \equiv \mathrm{id}$. Under this model, Escanciano and Jacho-Chavez (2012) considered the estimation of the (marginal) density of the response $Y$ via the estimation of the error density. Another application of the estimation of the error density in the SPT model is the forecasting of $\Lambda_{\theta_0}(Y)$ by means of the mode approach, since the mode of the p.d.f. of $\Lambda_{\theta_0}(Y)$ given $X = x$ is $m(x) + \arg\max_{e \in \mathbb{R}} f_\varepsilon(e)$, where $f_\varepsilon$ is the p.d.f. of the error term $\varepsilon$.

Taking transformations of the data has been an important part of statistical practice for many years. A major contribution to this methodology was made by Box and Cox (1964), who proposed a parametric power family of transformations that includes the logarithm and the identity. They suggested that the power transformation, when applied to the dependent variable in a linear regression model, might induce normality and homoscedasticity. Much effort has been devoted to the investigation of the Box-Cox transformation since its introduction; see, for example, Amemiya (1985), Horowitz (1998), Chen et al. (2002), Shin (2008), and Fitzenberger et al. (2010). Other dependent variable transformations have been suggested, for example the Zellner and Revankar (1969) transform and the Bickel and Doksum (1981) transform. The transformation methodology has been quite successful and a large literature exists on this topic for parametric models; see Carroll and Ruppert (1988), Sakia (1992) and references therein.

The estimation of (functionals of) the error distribution and density under simplified versions of model (1) has received considerable attention in the statistical literature in recent years. Akritas and Van Keilegom (2001) estimated the cumulative distribution function of the regression error in a heteroscedastic model with univariate covariates. The estimator they proposed is based on nonparametrically estimated regression residuals, and the weak convergence of their estimator was proved. The results obtained by Akritas and Van Keilegom (2001) have been generalized by Neumeyer and Van Keilegom (2010) to the case of multivariate covariates. Müller et al. (2004) investigated linear functionals of the error distribution in nonparametric regression. Cheng (2005) established the asymptotic normality of an estimator of the error density based on estimated residuals; the estimator he proposed is constructed by splitting the sample into two parts: the first part is used for the estimation of the residuals, while the second part is used for the construction of the error density estimator. Efromovich (2005) proposed an adaptive estimator of the error density, based on a density estimator proposed by Pinsker (1980). Finally, Samb (2011) also considered the estimation of the error density, but his approach is more closely related to the one in Akritas and Van Keilegom (2001).

In order to achieve the objective of this paper, namely the estimation of the error density under model (1), we first need to estimate the transformation parameter $\theta_0$. To this end, we make use of the results in Linton et al. (2008). In the latter paper, the authors first discuss the nonparametric identification of model (1), and second, estimate the transformation parameter $\theta_0$ under the considered model. For the estimation of this parameter, they propose two approaches.

The first approach uses a semiparametric profile likelihood (PL) estimator, while the second is based on a mean squared distance from independence estimator (MD), using the estimated distributions of $X$, $\varepsilon$ and $(X,\varepsilon)$. Linton et al. (2008) derived the asymptotic distributions of their estimators under certain regularity conditions, and proved that both estimators of $\theta_0$ are root-$n$ consistent. The authors also showed that, in practice, the performance of the PL method is better than that of the MD approach. For this reason, the PL method will be considered in this paper for the estimation of $\theta_0$.

The rest of the paper is organized as follows. Section 2 presents our estimator of the error density and groups some notations and technical assumptions. Section 3 describes the asymptotic results of the paper. A simulation study is given in Section 4, while Section 5 is devoted to some general conclusions. Finally, the proofs of the asymptotic results are collected in Section 6.

2 Definitions and assumptions

2.1 Construction of the estimators

The approach proposed here for the estimation of $f_\varepsilon$ is based on a two-step procedure. In a first step, we estimate the finite dimensional parameter $\theta_0$. This parameter is estimated by the profile likelihood (PL) method, developed in Linton et al. (2008). The basic idea of this method is to replace all unknown expressions in the likelihood function by their nonparametric kernel estimates. Under model (1), we have

$$P(Y \le y \mid X) = P(\Lambda_{\theta_0}(Y) \le \Lambda_{\theta_0}(y) \mid X) = P(\varepsilon_{\theta_0} \le \Lambda_{\theta_0}(y) - m_{\theta_0}(X) \mid X) = F_\varepsilon(\Lambda_{\theta_0}(y) - m_{\theta_0}(X)).$$

Here, $F_\varepsilon(t) = P(\varepsilon \le t)$, and so

$$f_{Y\mid X}(y\mid x) = f_\varepsilon(\Lambda_{\theta_0}(y) - m_{\theta_0}(x))\,\Lambda'_{\theta_0}(y),$$

where $f_\varepsilon$ and $f_{Y\mid X}$ are the densities of $\varepsilon$ and of $Y$ given $X$, respectively. Then, the log-likelihood function is

$$\sum_{i=1}^n\left\{\log f_{\varepsilon_\theta}(\Lambda_\theta(Y_i) - m_\theta(X_i)) + \log\Lambda'_\theta(Y_i)\right\}, \qquad \theta\in\Theta,$$

where $f_{\varepsilon_\theta}$ is the density function of $\varepsilon_\theta$. Now, let

$$\hat m_\theta(x) = \frac{\sum_{j=1}^n \Lambda_\theta(Y_j)K_1\!\left(\frac{X_j-x}{h}\right)}{\sum_{j=1}^n K_1\!\left(\frac{X_j-x}{h}\right)} \qquad (2)$$

be the Nadaraya-Watson estimator of $m_\theta(x)$, and let

$$\hat f_{\varepsilon_\theta}(t) = \frac{1}{ng}\sum_{i=1}^n K_2\!\left(\frac{\hat\varepsilon_i(\theta) - t}{g}\right), \qquad (3)$$

where $\hat\varepsilon_i(\theta) = \Lambda_\theta(Y_i) - \hat m_\theta(X_i)$. Here, $K_1$ and $K_2$ are kernel functions and $h$ and $g$ are bandwidth sequences.
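To fix ideas, here is a minimal R sketch of the two building blocks (2) and (3), using the Epanechnikov kernel that the simulations of Section 4 employ; the function names are ours and the code is only an illustration, not the implementation used in the paper.

```r
# Epanechnikov kernel, playing the role of K1 and K2
kern <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

# Nadaraya-Watson estimator (2) of m_theta(x0), where z = Lambda_theta(y)
nw <- function(x0, x, z, h) {
  w <- kern((x - x0) / h)
  sum(w * z) / sum(w)
}

# kernel density estimator (3) of f_{eps_theta}(t), computed from the
# residuals eps_i(theta) = Lambda_theta(Y_i) - m_theta_hat(X_i)
dens <- function(t, eps, g) mean(kern((eps - t) / g)) / g
```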

Then, the PL estimator of $\theta_0$ is defined by

$$\hat\theta = \arg\max_{\theta\in\Theta}\sum_{i=1}^n\left[\log\hat f_{\varepsilon_\theta}(\Lambda_\theta(Y_i) - \hat m_\theta(X_i)) + \log\Lambda'_\theta(Y_i)\right]. \qquad (4)$$

Recall that $\hat m_\theta(X_i)$ converges to $m_\theta(X_i)$ at a slower rate for those $X_i$ which are close to the boundary of the support $\mathcal{X}$ of the covariate $X$. That is why we assume implicitly that the proposed estimator (4) of $\theta_0$ trims the observations $X_i$ outside a subset $\mathcal{X}_0$ of $\mathcal{X}$. Note that we keep the root-$n$ consistency of $\hat\theta$ proved in Linton et al. (2008) by trimming the covariates outside $\mathcal{X}_0$, but in this case the resulting asymptotic variance is different from the one obtained in the latter paper.

In a second step, we use the above estimator $\hat\theta$ to build the estimated residuals $\hat\varepsilon_i(\hat\theta) = \Lambda_{\hat\theta}(Y_i) - \hat m_{\hat\theta}(X_i)$. Then, our proposed estimator $\hat f_\varepsilon(t)$ of $f_\varepsilon(t)$ is defined by

$$\hat f_\varepsilon(t) = \frac{1}{nb}\sum_{i=1}^n K_3\!\left(\frac{\hat\varepsilon_i(\hat\theta) - t}{b}\right), \qquad (5)$$

where $K_3$ is a kernel function, not necessarily the same as the kernel $K_2$, and $b$ is a bandwidth sequence, not necessarily the same as the bandwidth $g$ used in (3). Observe that this estimator is a feasible estimator in the sense that it does not depend on any unknown quantity, as is desirable in practice. This contrasts with the unfeasible ideal kernel estimator

$$\tilde f_\varepsilon(t) = \frac{1}{nb}\sum_{i=1}^n K_3\!\left(\frac{\varepsilon_i - t}{b}\right), \qquad (6)$$

which depends in particular on the unknown regression errors $\varepsilon_i = \varepsilon_i(\theta_0) = \Lambda_{\theta_0}(Y_i) - m(X_i)$. It is however intuitively clear that $\hat f_\varepsilon(t)$ and $\tilde f_\varepsilon(t)$ will be very close for $n$ large enough, as will be illustrated by the results given in Section 3.
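Continuing the sketch above, the PL criterion (4) and the final estimator (5) might be assembled as follows, using the Box-Cox family of Section 4 for $\Lambda_\theta$ with one-dimensional $\theta$, and assuming that data vectors x, y and bandwidths h, g are available; the names, the search interval and the guard against log(0) are our own choices, and trimming to an inner set $\mathcal{X}_0$ is omitted for brevity.

```r
# Box-Cox transformation and its derivative in y, so that log Lambda'_theta(Y_i)
# can be computed
bc  <- function(y, th) if (th == 0) log(y) else (y^th - 1) / th
bcp <- function(y, th) if (th == 0) 1 / y else y^(th - 1)

# profile log-likelihood (4), for fixed bandwidths h and g
pl <- function(th, x, y, h, g) {
  z    <- bc(y, th)
  mhat <- sapply(x, function(x0) nw(x0, x, z, h))   # estimator (2)
  eps  <- z - mhat
  fhat <- sapply(eps, function(e) dens(e, eps, g))  # estimator (3)
  sum(log(pmax(fhat, 1e-300))) + sum(log(bcp(y, th)))
}

# PL estimator of theta_0 and the density estimator (5) with bandwidth b
theta_hat <- optimize(pl, interval = c(-20, 20), x = x, y = y,
                      h = h, g = g, maximum = TRUE)$maximum
res       <- bc(y, theta_hat) -
             sapply(x, function(x0) nw(x0, x, bc(y, theta_hat), h))
f_eps_hat <- function(t, b) dens(t, res, b)
```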

2.2 Notations

When there is no ambiguity, we use $\varepsilon$ and $m$ to indicate $\varepsilon_{\theta_0}$ and $m_{\theta_0}$. Moreover, $N(\theta_0)$ represents a neighborhood of $\theta_0$. For the kernel $K_j$ ($j = 1,2,3$), let $\mu(K_j) = \int v^2 K_j(v)\,dv$, and let $K_j^{(p)}$ be the $p$th derivative of $K_j$. For any function $\varphi_\theta(y)$, denote $\dot\varphi_\theta(y) = \partial\varphi_\theta(y)/\partial\theta = (\partial\varphi_\theta(y)/\partial\theta_1,\ldots,\partial\varphi_\theta(y)/\partial\theta_p)^t$ and $\varphi'_\theta(y) = \partial\varphi_\theta(y)/\partial y$. Also, let $\|A\| = (A^t A)^{1/2}$ be the Euclidean norm of any vector $A$. For any functions $\bar m$, $r$, $f$, $\varphi$ and $q$, and any $\theta\in\Theta$, let $s = (\bar m, r, f, \varphi, q)$, $s_\theta = (m_\theta, \dot m_\theta, f_{\varepsilon_\theta}, f'_{\varepsilon_\theta}, \dot f_{\varepsilon_\theta})$, $\varepsilon_i(\theta,\bar m) = \Lambda_\theta(Y_i) - \bar m(X_i)$, and define

$$G_n(\theta,s) = n^{-1}\sum_{i=1}^n\left\{\frac{1}{f\{\varepsilon_i(\theta,\bar m)\}}\left[\varphi\{\varepsilon_i(\theta,\bar m)\}\{\dot\Lambda_\theta(Y_i) - r(X_i)\} + q\{\varepsilon_i(\theta,\bar m)\}\right] + \frac{\dot\Lambda'_\theta(Y_i)}{\Lambda'_\theta(Y_i)}\right\},$$

$G(\theta,s) = E[G_n(\theta,s)]$ and $\dot G(\theta_0,s_{\theta_0}) = \frac{\partial}{\partial\theta}G(\theta,s_\theta)\big|_{\theta=\theta_0}$.

2.3 Technical assumptions

The assumptions we need for the asymptotic results are listed below for convenient reference.

(A1) The function $K_j$ ($j = 1,2,3$) is symmetric, has compact support, $\int v^k K_j(v)\,dv = 0$ for $k = 1,\ldots,q_j-1$ and $\int v^{q_j}K_j(v)\,dv \ne 0$ for some $q_j \ge 4$, $K_j$ is twice continuously differentiable, and $\int K_3^{(1)}(v)\,dv = 0$.

(A2) The bandwidth sequences $h$, $g$ and $b$ satisfy $nh^{2q_1} = o(1)$, $ng^{2q_2} = o(1)$ (where $q_1$ and $q_2$ are defined in (A1)), $(nb^5)^{-1} = O(1)$, $nh^3 b^2(\log h^{-1})^{-2} \to \infty$ and $ng^6(\log g^{-1})^{-2} \to \infty$.

(A3) (i) The support $\mathcal{X}$ of the covariate $X$ is a compact subset of $\mathbb{R}$, and $\mathcal{X}_0$ is a subset with non-empty interior, whose closure lies in the interior of $\mathcal{X}$.

(ii) The density $f_X$ is bounded away from zero and infinity on $\mathcal{X}$, and has continuous second order partial derivatives on $\mathcal{X}$.

(A4) The function $m_\theta(x)$ is twice continuously differentiable with respect to $\theta$ on $\mathcal{X}\times N(\theta_0)$, and the functions $m_\theta(x)$ and $\dot m_\theta(x)$ are $q_1$ times continuously differentiable with respect to $x$ on $\mathcal{X}\times N(\theta_0)$. All these derivatives are bounded, uniformly in $(x,\theta)\in\mathcal{X}\times N(\theta_0)$.

(A5) The error $\varepsilon = \Lambda_{\theta_0}(Y) - m(X)$ has finite fourth moment and is independent of $X$.

(A6) The distribution $F_{\varepsilon_\theta}(t)$ is $q_3+1$ (respectively three) times continuously differentiable with respect to $t$ (respectively $\theta$), and

$$\sup_{\theta,t}\left|\frac{\partial^{k+l}}{\partial t^k\,\partial\theta_1^{l_1}\cdots\partial\theta_p^{l_p}}F_{\varepsilon_\theta}(t)\right| < \infty$$

for all $k$ and $l$ such that $0 \le k+l \le 2$, where $l = l_1+\ldots+l_p$ and $\theta = (\theta_1,\ldots,\theta_p)^t$.

(A7) The transformation $\Lambda_\theta(y)$ is three times continuously differentiable with respect to both $\theta$ and $y$, and there exists an $\alpha > 0$ such that

$$E\left[\sup_{\theta': \|\theta'-\theta\|\le\alpha}\left|\frac{\partial^{k+l}}{\partial y^k\,\partial\theta_1^{l_1}\cdots\partial\theta_p^{l_p}}\Lambda_{\theta'}(Y)\right|\right] < \infty$$

for all $\theta\in\Theta$ and for all $k$ and $l$ such that $0 \le k+l \le 3$, where $l = l_1+\ldots+l_p$ and $\theta = (\theta_1,\ldots,\theta_p)^t$. Moreover, $\sup_{x\in\mathcal{X}} E[\dot\Lambda^4_{\theta_0}(Y)\mid X = x] < \infty$.

(A8) For all $\eta > 0$, there exists $\epsilon(\eta) > 0$ such that

$$\inf_{\|\theta-\theta_0\| > \eta}\|G(\theta,s_\theta)\| \ge \epsilon(\eta) > 0.$$

Moreover, the matrix $\dot G(\theta_0,s_{\theta_0})$ is non-singular.

(A9) (i) $E(\Lambda_{\theta_0}(Y)) = 1$, $\Lambda_{\theta_0}(0) = 0$, and the set $\{x\in\mathcal{X}_0 : m'(x) \ne 0\}$ has nonempty interior.

(ii) The function $\phi(x,t) = \dot\Lambda_{\theta_0}(\Lambda^{-1}_{\theta_0}(m(x)+t))f_\varepsilon(t)$ is continuously differentiable with respect to $t$ for all $x$, and

$$\sup_{s:|t-s|\le\delta} E\left|\frac{\partial\phi}{\partial s}(X,s)\right| < \infty \qquad (7)$$

for all $t\in\mathbb{R}$ and for some $\delta > 0$.

Assumptions (A1), part of (A2), (A3)(ii), (A4), (A6), (A7) and (A8) are used by Linton et al. (2008) to show that the PL estimator $\hat\theta$ of $\theta_0$ is root-$n$ consistent. The differentiability of $K_j$ up to second order imposed in assumption (A1) is used to expand the two-step kernel estimator $\hat f_\varepsilon(t)$ in (5) around the unfeasible one $\tilde f_\varepsilon(t)$. Assumptions (A3)(ii) and (A4) impose that all the functions to be estimated have bounded derivatives. The last assumption in (A2) is useful for obtaining the uniform convergence of the Nadaraya-Watson estimator of $m_{\theta_0}$ in (2) (see for instance Einmahl and Mason (2005)). This assumption is also needed in the study of the difference between the feasible estimator $\hat f_\varepsilon(t)$ and the unfeasible estimator $\tilde f_\varepsilon(t)$. Finally, (A9)(i) is needed for identifying the model (see Vanhems and Van Keilegom (2011)).

3 Asymptotic results

In this section we are interested in the asymptotic behavior of the estimator $\hat f_\varepsilon(t)$. To this end, we first investigate its asymptotic representation, which will be needed to show its asymptotic normality.

Theorem 1. Assume (A1)-(A9). Then,

$$\hat f_\varepsilon(t) - f_\varepsilon(t) = \frac{1}{nb}\sum_{i=1}^n K_3\!\left(\frac{\varepsilon_i - t}{b}\right) - f_\varepsilon(t) + R_n(t),$$

where $R_n(t) = o_P((nb)^{-1/2})$ for all $t\in\mathbb{R}$.

This result is important, since it shows that, provided the bias term is negligible, the estimation of $\theta_0$ and $m(\cdot)$ has asymptotically no effect on the behavior of the estimator $\hat f_\varepsilon(t)$. Therefore, this estimator is asymptotically equivalent to the unfeasible estimator $\tilde f_\varepsilon(t)$, based on the unknown true errors $\varepsilon_1,\ldots,\varepsilon_n$. Our next result gives the asymptotic normality of the estimator $\hat f_\varepsilon(t)$.

Theorem 2. Assume (A1)-(A9). In addition, assume that $nb^{2q_3+1} = O(1)$. Then,

$$\sqrt{nb}\,\big(\hat f_\varepsilon(t) - f_{\varepsilon,b}(t)\big) \xrightarrow{d} N\!\left(0,\ f_\varepsilon(t)\int K_3^2(v)\,dv\right),$$

where $f_{\varepsilon,b}(t) = f_\varepsilon(t) + \frac{b^{q_3}}{q_3!}\,f_\varepsilon^{(q_3)}(t)\int v^{q_3}K_3(v)\,dv$.

The proofs of Theorems 1 and 2 are given in Section 6.
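As an illustration of how Theorem 2 might be used, the sketch below forms an asymptotic pointwise confidence interval for $f_\varepsilon(t)$, treating the bias $f_{\varepsilon,b}(t) - f_\varepsilon(t)$ as negligible (which requires $nb^{2q_3+1}\to 0$ rather than merely $O(1)$); this application is our addition and is not discussed in the paper. It reuses kern and the residual vector res from the earlier sketches.

```r
# asymptotic pointwise CI for f_eps(t) based on Theorem 2:
# sqrt(n b) (fhat(t) - f(t)) -> N(0, f(t) * int K3^2(v) dv), where
# int K3^2(v) dv = 3/5 for the Epanechnikov kernel used here
ci_feps <- function(t, res, b, level = 0.95) {
  fhat <- mean(kern((res - t) / b)) / b         # estimator (5)
  se   <- sqrt(fhat * 0.6 / (length(res) * b))  # plug-in standard error
  z    <- qnorm(1 - (1 - level) / 2)
  c(lower = max(fhat - z * se, 0), est = fhat, upper = fhat + z * se)
}
```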

4 Simulations

In this section, we investigate the performance of our method for different models and different sample sizes. Consider

$$\Lambda_{\theta_0}(Y) = b_0 + b_1 X^2 + b_2\sin(\pi X) + \sigma_e\,\varepsilon, \qquad (8)$$

where $\Lambda_\theta$ is the Box-Cox (1964) transformation

$$\Lambda_\theta(y) = \begin{cases}\dfrac{y^\theta - 1}{\theta}, & \theta\ne 0,\\[4pt] \log(y), & \theta = 0,\end{cases}$$

$X$ is uniformly distributed on the interval $[-1,1]$, and $\varepsilon$ is independent of $X$. We carry out simulations for two cases: in the first case, $\varepsilon$ has a standard normal distribution and, in the second case, the distribution of $\varepsilon$ is the mixture of the normal distributions $N(-1.5, 0.25)$ and $N(1.5, 0.25)$ with equal weights. To make computations easier, the error distributions are truncated at $-3$ and $3$ (i.e., put to 0 outside the interval $[-3,3]$). We study three different model settings. For each of them, $b_2 = b_0 - 3\sigma_e$. The other parameters are chosen as follows:

Model 1: $b_0 = 6.5$, $b_1 = 5$, $\sigma_e = 1.5$;
Model 2: $b_0 = 4.5$, $b_1 = 3.5$, $\sigma_e = 1$;
Model 3: $b_0 = 2.5$, $b_1 = 2.5$, $\sigma_e = 0.5$.

Our simulations are performed for $\theta_0 = 0$, $0.5$ and $1$. We use the Epanechnikov kernel $K(x) = \frac34(1-x^2)\mathbf{1}(|x|\le 1)$ both for the estimator of the regression function and for the density estimator. The results are based on 100 random samples of size $n = 100$ and $n = 200$.

For the estimation of $\theta_0$ and $f_\varepsilon(t)$, we proceed as follows. Let

$$L_\theta(h,g) = \sum_{i=1}^n\left[\log\hat f_{\varepsilon_\theta}\big(\hat\varepsilon_i(\theta,h)\big) + \log\Lambda'_\theta(Y_i)\right],$$

where $\hat\varepsilon_i(\theta,h) = \Lambda_\theta(Y_i) - \hat m_\theta(X_i,h)$ and $\hat m_\theta(x,h)$ denotes $\hat m_\theta(x)$ constructed with bandwidth $h$. This function will be maximized with respect to $\theta$ for given (optimal) values of $(h,g)$. For each value of $\theta$, $\hat h(\theta)$ is obtained by least squares cross-validation:

$$\hat h(\theta) = \arg\min_h\sum_{i=1}^n\big(\Lambda_\theta(Y_i) - \hat m_{-i,\theta}(X_i)\big)^2, \qquad\text{where}\qquad \hat m_{-i,\theta}(X_i) = \frac{\sum_{j=1,j\ne i}^n\Lambda_\theta(Y_j)K\!\left(\frac{X_j-X_i}{h}\right)}{\sum_{j=1,j\ne i}^n K\!\left(\frac{X_j-X_i}{h}\right)}$$

is the leave-one-out version of (2), and $g$ can be chosen with a classical bandwidth selection rule for kernel density estimation. Here, for simplicity, the normal reference rule is used: $\hat g(\theta) = (40\sqrt{\pi})^{1/5}n^{-1/5}\hat\sigma_\varepsilon(\theta,\hat h(\theta))$, where $\hat\sigma_\varepsilon(\theta,\hat h(\theta))$ is the classical empirical estimator of the standard deviation based on $\hat\varepsilon_i(\theta,\hat h(\theta))$, $i = 1,\ldots,n$. The solution

$$\hat\theta = \arg\max_\theta L_\theta\big(\hat h(\theta),\hat g(\theta)\big)$$

is therefore obtained iteratively (the maximization problems are solved with the function optimize in R, with $h\in[0,2]$ and $\theta\in[-20,20]$), and the estimator of $f_\varepsilon(t)$ is finally given by

$$\hat f_\varepsilon(t) = \frac{1}{n\hat g(\hat\theta)}\sum_{i=1}^n K\!\left(\frac{\hat\varepsilon_i\big(\hat\theta,\hat h(\hat\theta)\big) - t}{\hat g(\hat\theta)}\right).$$
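Pulling the pieces together, a condensed R sketch of one simulation run for Model 1 with standard normal errors and $\theta_0 = 0.5$ might look as follows, reusing bc, nw and pl from the earlier sketches; the cross-validation and the normal reference rule follow the formulas above, while the seed and the lower grid bound for $h$ are arbitrary choices of ours.

```r
set.seed(1)
n  <- 100
b0 <- 6.5; b1 <- 5; sig <- 1.5; b2 <- b0 - 3 * sig; th0 <- 0.5

# data from model (8), with eps ~ N(0,1) truncated at -3 and 3
x   <- runif(n, -1, 1)
eps <- qnorm(runif(n, pnorm(-3), pnorm(3)))
z   <- b0 + b1 * x^2 + b2 * sin(pi * x) + sig * eps
y   <- if (th0 == 0) exp(z) else (th0 * z + 1)^(1 / th0)  # invert Box-Cox

# h(theta): least squares cross-validation with the leave-one-out NW estimator
cv <- function(h, th) {
  zz  <- bc(y, th)
  loo <- sapply(seq_len(n), function(i) nw(x[i], x[-i], zz[-i], h))
  sum((zz - loo)^2)
}
h_hat <- function(th) optimize(cv, interval = c(0.05, 2), th = th)$minimum

# g(theta): normal reference rule with the Epanechnikov constant (40 sqrt(pi))^(1/5)
g_hat <- function(th, h) {
  e <- bc(y, th) - sapply(x, function(x0) nw(x0, x, bc(y, th), h))
  (40 * sqrt(pi))^(1/5) * n^(-1/5) * sd(e)
}

# profile out theta with the data-driven bandwidths
obj <- function(th) pl(th, x, y, h_hat(th), g_hat(th, h_hat(th)))
theta_tilde <- optimize(obj, interval = c(-20, 20), maximum = TRUE)$maximum
```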

Tables 1, 3 and 4 show the mean squared error (MSE) of the estimator $\hat f_\varepsilon(t)$ of the density of the standardized (pseudo-estimated) error $\hat\varepsilon = \big(\Lambda_{\hat\theta}(Y) - \hat m_{\hat\theta}(X)\big)/\sigma_e$ (with known $\sigma_e$), for $t = -1$, $0$ and $1$ (respectively $t = -1.5$, $-1$, $0$, $1$ and $1.5$) for the unimodal (respectively bimodal) normal error distribution. Tables 2 and 5 show the integrated mean squared error (IMSE) of the estimator $\hat f_\varepsilon(\cdot)$ for both error distributions, where the integration is done over the interval $[-3,3]$.

As expected, in both cases, estimation is better for the normal density than for the mixture of two normals, and estimation improves when $n$ increases and, in most cases, when $\sigma_e$ decreases. In particular, this can be observed from Tables 2 and 5. The limiting case $\theta_0 = 0$ (the logarithmic transformation) seems to be more easily captured, especially when the error is normally distributed. In general, we observe from Tables 1, 3 and 4 that estimation is poorer near local maxima and minima of the density, which is not uncommon for kernel smoothing methods. This also suggests that the choice of the smoothing parameters is important and should be the object of further investigation.

Model 1 (b0 = 6.5, b1 = 5, σ_e = 1.5):

                     n = 100                         n = 200
  θ0         t = -1    t = 0     t = 1     t = -1    t = 0     t = 1
  0    Bias  -.0421    -.0206    -.012     -.018      .0196     .0004
       Var    .0006     .0206     .0017     .0008     .0116     .0008
       MSE    .0024     .0211     .0018     .0011     .0120     .0008
  0.5  Bias  -.0621     .0469    -.061     -.0521     .009     -.0262
       Var    .0051     .1624     .0061     .000      .1555     .0066
       MSE    .0089     .1646     .0101     .0057     .1565     .007
  1    Bias  -.0874     .0806    -.0885    -.050      .106     -.077
       Var    .007      .2261     .0089     .0049     .1152     .002
       MSE    .0149     .226      .0168     .0077     .1265     .0086

Model 2 (b0 = 4.5, b1 = 3.5, σ_e = 1):

  0    Bias  -.0029    -.095     -.022     -.0419     .0627    -.0118
       Var    .0019     .0142     .002      .0004     .010      .0010
       MSE    .0019     .02       .0028     .0022     .0169     .0012
  0.5  Bias  -.0522     .0476    -.045     -.0228    -.019     -.0146
       Var    .0041     .1184     .0062     .0017     .0        .0020
       MSE    .0068     .1207     .0081     .0022     .07       .0022
  1    Bias  -.070      .1816    -.087     -.0425     .0240    -.041
       Var    .0049     .2497     .0045     .002      .0519     .0028
       MSE    .0098     .2827     .0114     .0041     .0525     .0045

Model 3 (b0 = 2.5, b1 = 2.5, σ_e = 0.5):

  0    Bias  -.02      -.005     -.0008    -.007      .006     -.07
       Var    .0006     .0148     .0011     .0005     .006      .0002
       MSE    .0017     .0148     .0011     .0005     .0072     .0016
  0.5  Bias  -.004      .0156    -.0289    -.0214     .022     -.0164
       Var    .0014     .0266     .0020     .0008     .0129     .0008
       MSE    .0024     .0268     .0028     .0012     .014      .0011
  1    Bias  -.0252     .0411    -.008     -.0442     .086     -.00
       Var    .0020     .0415     .0042     .0007     .0256     .0014
       MSE    .0026     .042      .0052     .0026     .025      .002

Table 1: MSE of $\hat f_\varepsilon(t)$ for different models, values of $t$ and sample sizes, when $f_\varepsilon(\cdot)$ is a standard normal density.

  Model                                       θ0     n = 100   n = 200
  Model 1 (b0 = 6.5, b1 = 5, σ_e = 1.5)       0      .0042     .002
                                              0.5    .0161     .0106
                                              1      .027      .0129
  Model 2 (b0 = 4.5, b1 = 3.5, σ_e = 1)       0      .0060     .0029
                                              0.5    .0125     .005
                                              1      .0191     .0075
  Model 3 (b0 = 2.5, b1 = 2.5, σ_e = 0.5)     0      .0027     .0015
                                              0.5    .0048     .0026
                                              1      .0114     .006

Table 2: IMSE of $\hat f_\varepsilon$ for different models and sample sizes, when $f_\varepsilon(\cdot)$ is a standard normal density.

5 Conclusions

In this paper we have studied the estimation of the density of the error in a semiparametric transformation model. The regression function in this model is unspecified (except for some smoothness assumptions), whereas the transformation (of the dependent variable in the model) is supposed to belong to a parametric family of monotone transformations. The proposed estimator is a kernel-type estimator, and we have shown its asymptotic normality. The finite sample performance of the estimator is illustrated by means of a simulation study.

It would be interesting to explore various possible applications of the results in this paper. For example, one could use the results on the estimation of the error density to test hypotheses concerning e.g. the normality of the errors, the homoscedasticity of the error variance, or the linearity of the regression function, all of which are important features in the context of transformation models.

Model 1 (b0 = 6.5, b1 = 5, σ_e = 1.5):

  θ0         t = -1.5   t = -1    t = 0     t = 1     t = 1.5
  0    Bias  -.1955     -.0292     .1671    -.059     -.2069
       Var    .000       .0010     .001      .0012     .0005
       MSE    .086       .0018     .029      .0024     .04
  0.5  Bias  -.1854     -.0004     .1252    -.0086    -.191
       Var    .0021      .005      .0017     .0059     .0021
       MSE    .065       .005      .0174     .0060     .087
  1    Bias  -.2055     -.0046     .1641    -.0188    -.217
       Var    .00        .0065     .0167     .0061     .0027
       MSE    .0455      .0065     .046      .0065     .0499

Model 2 (b0 = 4.5, b1 = 3.5, σ_e = 1):

  0    Bias  -.1665      .0514     .1921    -.0875    -.254
       Var    .0004      .0014     .0010     .0008     .0005
       MSE    .0282      .0040     .079      .0084     .0589
  0.5  Bias  -.197      -.025      .1584    -.0066    -.1892
       Var    .0007      .0026     .0016     .008      .0012
       MSE    .096       .001      .0267     .008      .070
  1    Bias  -.2025     -.0271     .1659     .0221    -.1902
       Var    .0015      .009      .0044     .009      .0017
       MSE    .0425      .0046     .019      .0044     .079

Model 3 (b0 = 2.5, b1 = 2.5, σ_e = 0.5):

  0    Bias  -.1544      .0698     .1915    -.1296    -.2547
       Var    .000       .0009     .0006     .0004     .0007
       MSE    .0242      .0057     .072      .0172     .0656
  0.5  Bias  -.1924     -.0501     .141      .017     -.1459
       Var    .0004      .0011     .0007     .0021     .0005
       MSE    .074       .006      .0187     .001      .0218
  1    Bias  -.1654      .012      .1289    -.0642    -.1944
       Var    .0005      .0017     .0010     .0022     .001
       MSE    .0279      .0019     .0167     .006      .091

Table 3: MSE of $\hat f_\varepsilon(t)$ for different models, values of $t$ and $n = 100$, when $f_\varepsilon(\cdot)$ is a mixture of two normal densities ($N(-1.5, 0.25)$, $N(1.5, 0.25)$) with equal weights.

Model 1 (b0 = 6.5, b1 = 5, σ_e = 1.5):

  θ0         t = -1.5   t = -1    t = 0     t = 1     t = 1.5
  0    Bias  -.1578     -.012      .110     -.0212    -.1665
       Var    .000       .0009     .0002     .0010     .000
       MSE    .0252      .0011     .012      .0015     .0281
  0.5  Bias  -.1425      .072      .0960    -.019     -.1652
       Var    .0009      .008      .0005     .009      .0019
       MSE    .0212      .0052     .0097     .004      .0285
  1    Bias  -.1697     -.0077     .1019    -.021     -.1769
       Var    .0014      .0047     .0007     .0051     .0018
       MSE    .002       .0048     .0111     .0056     .01

Model 2 (b0 = 4.5, b1 = 3.5, σ_e = 1):

  0    Bias  -.1511     -.0022     .0980    -.048     -.1681
       Var    .0002      .0007     .0001     .0009     .0004
       MSE    .020       .0007     .0098     .0021     .0286
  0.5  Bias  -.1712     -.0287     .1092     .0099    -.158
       Var    .0005      .0019     .0004     .0025     .0005
       MSE    .0298      .0028     .012      .0026     .0242
  1    Bias  -.1278      .02       .060     -.0228    -.152
       Var    .0009      .008      .0002     .008      .0015
       MSE    .017       .0048     .0042     .004      .0250

Model 3 (b0 = 2.5, b1 = 2.5, σ_e = 0.5):

  0    Bias  -.140       .0008     .0915    -.0581    -.1749
       Var    .0001      .0004     .0001     .0005     .0004
       MSE    .0205      .0004     .0085     .009      .010
  0.5  Bias  -.1406      .0245     .1067    -.0485    -.167
       Var    .0001      .0008     .0002     .0012     .0006
       MSE    .0199      .0014     .0116     .005      .0286
  1    Bias  -.1551     -.0291     .089      .001     -.146
       Var    .000       .0010     .0001     .001      .000
       MSE    .0244      .0019     .0072     .001      .0210

Table 4: MSE of $\hat f_\varepsilon(t)$ for different models, values of $t$ and $n = 200$, when $f_\varepsilon(\cdot)$ is a mixture of two normal densities ($N(-1.5, 0.25)$, $N(1.5, 0.25)$) with equal weights.

  Model                                       θ0     n = 100   n = 200
  Model 1 (b0 = 6.5, b1 = 5, σ_e = 1.5)       0      .0148     .0089
                                              0.5    .0158     .0106
                                              1      .0219     .0119
  Model 2 (b0 = 4.5, b1 = 3.5, σ_e = 1)       0      .0184     .0082
                                              0.5    .0157     .0099
                                              1      .0186     .008
  Model 3 (b0 = 2.5, b1 = 2.5, σ_e = 0.5)     0      .0199     .0079
                                              0.5    .0118     .0087
                                              1      .012      .0078

Table 5: IMSE of $\hat f_\varepsilon$ for different models and sample sizes, when $f_\varepsilon(\cdot)$ is a mixture of two normal densities ($N(-1.5, 0.25)$, $N(1.5, 0.25)$) with equal weights.

6 Proofs

Proof of Theorem 1. Write

$$\hat f_\varepsilon(t) - f_\varepsilon(t) = [\hat f_\varepsilon(t) - \hat f^0_\varepsilon(t)] + [\hat f^0_\varepsilon(t) - f_\varepsilon(t)],$$

where

$$\hat f^0_\varepsilon(t) = \frac{1}{nb}\sum_{i=1}^n K_3\!\left(\frac{\hat\varepsilon_i - t}{b}\right)$$

and $\hat\varepsilon_i = \Lambda_{\theta_0}(Y_i) - \hat m_{\theta_0}(X_i)$, $i = 1,\ldots,n$. In a completely similar way as was done for Lemma A.1 in Linton et al. (2008), it can be shown that

$$\hat f^0_\varepsilon(t) - f_\varepsilon(t) = \frac{1}{nb}\sum_{i=1}^n K_3\!\left(\frac{\varepsilon_i - t}{b}\right) - f_\varepsilon(t) + o_P\big((nb)^{-1/2}\big) \qquad (9)$$

for all $t\in\mathbb{R}$. Note that the remainder term in Lemma A.1 in the above paper equals a sum of i.i.d. terms of mean zero, plus an $o_P(n^{-1/2})$ term; hence, the remainder term in that paper is $O_P(n^{-1/2})$, whereas we write $o_P((nb)^{-1/2})$ in (9). Therefore, the result of the theorem follows if we prove that $\hat f_\varepsilon(t) - \hat f^0_\varepsilon(t) = o_P((nb)^{-1/2})$. To this end, write

$$\hat f_\varepsilon(t) - \hat f^0_\varepsilon(t) = \frac{1}{nb^2}\sum_{i=1}^n\big(\hat\varepsilon_i(\hat\theta) - \hat\varepsilon_i(\theta_0)\big)K_3^{(1)}\!\left(\frac{\hat\varepsilon_i(\theta_0) - t}{b}\right) + \frac{1}{2nb^3}\sum_{i=1}^n\big(\hat\varepsilon_i(\hat\theta) - \hat\varepsilon_i(\theta_0)\big)^2 K_3^{(2)}\!\left(\frac{\hat\varepsilon_i(\theta_0) + \beta\big(\hat\varepsilon_i(\hat\theta) - \hat\varepsilon_i(\theta_0)\big) - t}{b}\right)$$

for some $\beta\in(0,1)$, where $\hat\varepsilon_i(\theta_0) = \hat\varepsilon_i$. In what follows, we will show that each of the two terms above is $o_P((nb)^{-1/2})$. First consider the last term of this decomposition. Since $\Lambda_\theta(y)$ and $\hat m_\theta(x)$ are both twice continuously differentiable with respect to $\theta$, a second order Taylor expansion gives, for some $\theta_1$ between $\theta_0$ and $\hat\theta$ (to simplify the notation, we assume here that $p = \dim(\theta) = 1$):

$$\hat\varepsilon_i(\hat\theta) - \hat\varepsilon_i(\theta_0) = \Lambda_{\hat\theta}(Y_i) - \Lambda_{\theta_0}(Y_i) - \big(\hat m_{\hat\theta}(X_i) - \hat m_{\theta_0}(X_i)\big) = (\hat\theta - \theta_0)\big(\dot\Lambda_{\theta_0}(Y_i) - \dot{\hat m}_{\theta_0}(X_i)\big) + \frac12(\hat\theta - \theta_0)^2\big(\ddot\Lambda_{\theta_1}(Y_i) - \ddot{\hat m}_{\theta_1}(X_i)\big).$$

Therefore, since $\hat\theta - \theta_0 = o_P((nb)^{-1/2})$ by Theorem 4.1 in Linton et al. (2008) (as before, we work with a slower rate than the one shown in the latter paper, since this leads to weaker conditions on the bandwidths), assumptions (A2) and (A7) imply that

$$\frac{1}{2nb^3}\sum_{i=1}^n\big(\hat\varepsilon_i(\hat\theta) - \hat\varepsilon_i(\theta_0)\big)^2 K_3^{(2)}\!\left(\frac{\hat\varepsilon_i(\theta_0) + \beta\big(\hat\varepsilon_i(\hat\theta) - \hat\varepsilon_i(\theta_0)\big) - t}{b}\right) = O_P\big((nb^3)^{-1}\big),$$

which is $o_P((nb)^{-1/2})$, since $(nb^5)^{-1} = O(1)$ under (A2). For the first term of the decomposition, the expansion of $\hat\varepsilon_i(\hat\theta) - \hat\varepsilon_i(\theta_0)$ given above yields

$$\frac{1}{nb^2}\sum_{i=1}^n\big(\hat\varepsilon_i(\hat\theta) - \hat\varepsilon_i(\theta_0)\big)K_3^{(1)}\!\left(\frac{\hat\varepsilon_i(\theta_0) - t}{b}\right) = \frac{\hat\theta - \theta_0}{nb^2}\sum_{i=1}^n\big(\dot\Lambda_{\theta_0}(Y_i) - \dot{\hat m}_{\theta_0}(X_i)\big)K_3^{(1)}\!\left(\frac{\hat\varepsilon_i(\theta_0) - t}{b}\right) + o_P\big((nb)^{-1/2}\big)$$
$$= \frac{\hat\theta - \theta_0}{nb^2}\sum_{i=1}^n\big(\dot\Lambda_{\theta_0}(Y_i) - \dot m_{\theta_0}(X_i)\big)K_3^{(1)}\!\left(\frac{\varepsilon_i - t}{b}\right) + o_P\big((nb)^{-1/2}\big), \qquad (10)$$

where the last equality follows from a Taylor expansion applied to $K_3^{(1)}$, the fact that $\sup_{x\in\mathcal{X}_0}|\dot{\hat m}_{\theta_0}(x) - \dot m_{\theta_0}(x)| = O_P\big((nh)^{-1/2}(\log h^{-1})^{1/2}\big)$ by Lemma 1, and the fact that $nh^3 b(\log h^{-1})^{-1} \to \infty$ under (A2). Further, write

$$E\left[\sum_{i=1}^n\big(\dot\Lambda_{\theta_0}(Y_i) - \dot m_{\theta_0}(X_i)\big)K_3^{(1)}\!\left(\frac{\varepsilon_i - t}{b}\right)\right] = nE\left[\dot\Lambda_{\theta_0}(Y_1)K_3^{(1)}\!\left(\frac{\varepsilon_1 - t}{b}\right)\right] - nE[\dot m_{\theta_0}(X_1)]\,E\left[K_3^{(1)}\!\left(\frac{\varepsilon_1 - t}{b}\right)\right] =: A_n - B_n.$$

We will only show that the first term above, $A_n$, is $O(nb^2)$ for any $t\in\mathbb{R}$; the proof for the other term is similar. Let $\varphi(x,t) = \dot\Lambda_{\theta_0}\big(\Lambda^{-1}_{\theta_0}(m(x)+t)\big)$ and set $\phi(x,t) = \varphi(x,t)f_\varepsilon(t)$. Then, applying a Taylor expansion to $\phi(x,\cdot)$,

it follows that (for some $\beta\in(0,1)$)

$$A_n = nE\left[\dot\Lambda_{\theta_0}\big(\Lambda^{-1}_{\theta_0}(m(X_1)+\varepsilon_1)\big)K_3^{(1)}\!\left(\frac{\varepsilon_1 - t}{b}\right)\right] = n\iint\phi(x,e)K_3^{(1)}\!\left(\frac{e-t}{b}\right)f_X(x)\,dx\,de = nb\iint\phi(x,t+bv)K_3^{(1)}(v)f_X(x)\,dx\,dv$$
$$= nb\iint\left[\phi(x,t) + bv\,\frac{\partial\phi}{\partial t}(x,t+\beta bv)\right]K_3^{(1)}(v)f_X(x)\,dx\,dv = nb^2\iint v\,\frac{\partial\phi}{\partial t}(x,t+\beta bv)K_3^{(1)}(v)f_X(x)\,dx\,dv,$$

since $\int K_3^{(1)}(v)\,dv = 0$, and this is bounded by $Knb^2\sup_{s:|t-s|\le\delta}E\big|\frac{\partial\phi}{\partial s}(X,s)\big| = O(nb^2)$ for some constant $K$, by assumption (A9)(ii). Hence, Tchebychev's inequality ensures that

$$\frac{\hat\theta - \theta_0}{nb^2}\sum_{i=1}^n\big(\dot\Lambda_{\theta_0}(Y_i) - \dot m_{\theta_0}(X_i)\big)K_3^{(1)}\!\left(\frac{\varepsilon_i - t}{b}\right) = \frac{\hat\theta - \theta_0}{nb^2}\,O_P\big(nb^2 + (nb)^{1/2}\big) = o_P\big((nb)^{-1/2}\big),$$

since $nb^{3/2}\to\infty$ by (A2). Substituting this in (10) yields

$$\frac{1}{nb^2}\sum_{i=1}^n\big(\hat\varepsilon_i(\hat\theta) - \hat\varepsilon_i(\theta_0)\big)K_3^{(1)}\!\left(\frac{\hat\varepsilon_i(\theta_0) - t}{b}\right) = o_P\big((nb)^{-1/2}\big)$$

for any $t\in\mathbb{R}$. This completes the proof.

Proof of Theorem 2. It follows from Theorem 1 that

$$\hat f_\varepsilon(t) - f_\varepsilon(t) = [\tilde f_\varepsilon(t) - E\tilde f_\varepsilon(t)] + [E\tilde f_\varepsilon(t) - f_\varepsilon(t)] + o_P\big((nb)^{-1/2}\big). \qquad (11)$$

The first term on the right hand side of (11) is treated by Lyapounov's central limit theorem (LCT) for triangular arrays (see e.g. Billingsley (1968), Theorem 7.3). To this end, let

$$f_{in}(t) = \frac{1}{b}K_3\!\left(\frac{\varepsilon_i - t}{b}\right).$$

Then, under (A1), (A2) and (A5) it can be easily shown that

$$\frac{\sum_{i=1}^n E\big|f_{in}(t) - Ef_{in}(t)\big|^3}{\Big(\sum_{i=1}^n\operatorname{Var}f_{in}(t)\Big)^{3/2}} \le \frac{Cnb^{-2}f_\varepsilon(t)\int|K_3(v)|^3\,dv + o(nb^{-2})}{\Big(nb^{-1}f_\varepsilon(t)\int K_3^2(v)\,dv + o(nb^{-1})\Big)^{3/2}} = O\big((nb)^{-1/2}\big) = o(1)$$

for some $C > 0$. Hence, the LCT ensures that

$$\frac{\tilde f_\varepsilon(t) - E\tilde f_\varepsilon(t)}{\sqrt{\operatorname{Var}\tilde f_\varepsilon(t)}} = \frac{\tilde f_\varepsilon(t) - E\tilde f_\varepsilon(t)}{\sqrt{n^{-1}\operatorname{Var}f_{1n}(t)}} \xrightarrow{d} N(0,1).$$

This gives

$$\sqrt{nb}\,\big(\tilde f_\varepsilon(t) - E\tilde f_\varepsilon(t)\big) \xrightarrow{d} N\!\left(0,\ f_\varepsilon(t)\int K_3^2(v)\,dv\right). \qquad (12)$$

For the second term of (11), straightforward calculations show that

$$E\tilde f_\varepsilon(t) - f_\varepsilon(t) = \frac{b^{q_3}}{q_3!}\,f_\varepsilon^{(q_3)}(t)\int v^{q_3}K_3(v)\,dv + o(b^{q_3}).$$

Combining this with (11) and (12), we obtain the desired result.

Lemma 1. Assume (A1)-(A5) and (A7). Then,

$$\sup_{x\in\mathcal{X}_0}|\hat m_{\theta_0}(x) - m_{\theta_0}(x)| = O_P\big((nh)^{-1/2}(\log h^{-1})^{1/2}\big),$$
$$\sup_{x\in\mathcal{X}_0}|\dot{\hat m}_{\theta_0}(x) - \dot m_{\theta_0}(x)| = O_P\big((nh)^{-1/2}(\log h^{-1})^{1/2}\big).$$

Proof. We will only give the proof for $\dot{\hat m}_{\theta_0}(x) - \dot m_{\theta_0}(x)$, the proof for $\hat m_{\theta_0}(x) - m_{\theta_0}(x)$ being very similar. Let $c_n = (nh)^{-1/2}(\log h^{-1})^{1/2}$, and define

$$\hat{\dot r}_{\theta_0}(x) = \frac{1}{nh}\sum_{j=1}^n\dot\Lambda_{\theta_0}(Y_j)K_1\!\left(\frac{X_j - x}{h}\right), \qquad \dot r_{\theta_0}(x) = E[\hat{\dot r}_{\theta_0}(x)], \qquad \bar f_X(x) = E[\hat f_X(x)],$$

where $\hat f_X(x) = (nh)^{-1}\sum_{j=1}^n K_1\big(\frac{X_j - x}{h}\big)$. Then,

$$\sup_{x\in\mathcal{X}_0}|\dot{\hat m}_{\theta_0}(x) - \dot m_{\theta_0}(x)| \le \sup_{x\in\mathcal{X}_0}\left|\dot{\hat m}_{\theta_0}(x) - \frac{\dot r_{\theta_0}(x)}{\bar f_X(x)}\right| + \sup_{x\in\mathcal{X}_0}\left|\frac{\dot r_{\theta_0}(x) - \bar f_X(x)\dot m_{\theta_0}(x)}{\bar f_X(x)}\right|. \qquad (13)$$

Since $E[\dot\Lambda^4_{\theta_0}(Y)\mid X = x] < \infty$ uniformly in $x\in\mathcal{X}$ by assumption (A7), a similar proof as was given for Theorem 2 in Einmahl and Mason (2005) ensures that

$$\sup_{x\in\mathcal{X}_0}\left|\dot{\hat m}_{\theta_0}(x) - \frac{\dot r_{\theta_0}(x)}{\bar f_X(x)}\right| = O_P(c_n).$$

Consider now the second term of (13). Since $E[\dot\varepsilon(\theta_0)\mid X] = 0$, where $\dot\varepsilon(\theta_0) = \frac{d}{d\theta}\big(\Lambda_\theta(Y) - m_\theta(X)\big)\big|_{\theta=\theta_0}$, we have

$$\dot r_{\theta_0}(x) = h^{-1}E\left[\{\dot m_{\theta_0}(X) + \dot\varepsilon(\theta_0)\}K_1\!\left(\frac{X - x}{h}\right)\right] = h^{-1}E\left[\dot m_{\theta_0}(X)K_1\!\left(\frac{X - x}{h}\right)\right] = \int\dot m_{\theta_0}(x+hv)K_1(v)f_X(x+hv)\,dv,$$

from which it follows that

$$\dot r_{\theta_0}(x) - \bar f_X(x)\dot m_{\theta_0}(x) = \int[\dot m_{\theta_0}(x+hv) - \dot m_{\theta_0}(x)]K_1(v)f_X(x+hv)\,dv.$$

Hence, a Taylor expansion applied to $\dot m_{\theta_0}(\cdot)$ yields

$$\sup_{x\in\mathcal{X}_0}|\dot r_{\theta_0}(x) - \bar f_X(x)\dot m_{\theta_0}(x)| = O(h^{q_1}) = O(c_n),$$

since $nh^{2q_1+1}(\log h^{-1})^{-1} = O(1)$ by (A2). This proves that the second term of (13) is $O_P(c_n)$, since it can easily be shown that $\bar f_X(x)$ is bounded away from 0 and infinity, uniformly in $x\in\mathcal{X}_0$, using (A3)(ii).

References

[1] Ahmad, I. and Li, Q. (1997). Testing symmetry of an unknown density function by kernel method. Journal of Nonparametric Statistics, 7, 279-293.

[2] Akritas, M.G. and Van Keilegom, I. (2001). Non-parametric estimation of the residual distribution. Scandinavian Journal of Statistics, 28, 549-567.

[3] Amemiya, T. (1985). Advanced Econometrics. Harvard University Press, Cambridge.

[4] Bickel, P.J. and Doksum, K. (1981). An analysis of transformations revisited. Journal of the American Statistical Association, 76, 296-311.

[5] Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.

[6] Box, G.E.P. and Cox, D.R. (1964). An analysis of transformations. Journal of the Royal Statistical Society - Series B, 26, 211-252.

[7] Carroll, R.J. and Ruppert, D. (1988). Transformation and Weighting in Regression. Chapman and Hall, New York.

[8] Chen, G., Lockhart, R.A. and Stephens, M.A. (2002). Box-Cox transformations in linear models: Large sample theory and tests of normality (with discussion). Canadian Journal of Statistics, 30, 177-234.

[9] Cheng, F. (2005). Asymptotic distributions of error density and distribution function estimators in nonparametric regression. Journal of Statistical Planning and Inference, 128, 327-349.

[10] Cheng, F. and Sun, S. (2008). A goodness-of-fit test of the errors in nonlinear autoregressive time series models. Statistics and Probability Letters, 78, 50-59.

[11] Dette, H., Kusi-Appiah, S. and Neumeyer, N. (2002). Testing symmetry in nonparametric regression models. Journal of Nonparametric Statistics, 14, 477-494.

[12] Efromovich, S. (2005). Estimation of the density of the regression errors. Annals of Statistics, 33, 2194-2227.

[13] Einmahl, U. and Mason, D.M. (2005). Uniform in bandwidth consistency of kernel-type function estimators. Annals of Statistics, 33, 1380-1403.

[14] Escanciano, J.C. and Jacho-Chavez, D. (2012). √n-uniformly consistent density estimation in nonparametric regression. Journal of Econometrics, 167, 305-316.

[15] Fitzenberger, B., Wilke, R.A. and Zhang, X. (2010). Implementing Box-Cox quantile regression. Econometric Reviews, 29, 158-181.

[16] Horowitz, J.L. (1998). Semiparametric Methods in Econometrics. Springer-Verlag, New York.

[17] Linton, O., Sperlich, S. and Van Keilegom, I. (2008). Estimation of a semiparametric transformation model. Annals of Statistics, 36, 686-718.

[18] Müller, U.U., Schick, A. and Wefelmeyer, W. (2004). Estimating linear functionals of the error distribution in nonparametric regression. Journal of Statistical Planning and Inference, 119, 75-93.

[19] Nadaraya, E.A. (1964). On estimating regression. Theory of Probability and its Applications, 9, 141-142.

[20] Neumeyer, N. and Dette, H. (2007). Testing for symmetric error distribution in nonparametric regression models. Statistica Sinica, 17, 775-795.

[21] Neumeyer, N. and Van Keilegom, I. (2010). Estimating the error distribution in nonparametric multiple regression with applications to model testing. Journal of Multivariate Analysis, 101, 1067-1078.

[22] Pinsker, M.S. (1980). Optimal filtering of a square integrable signal in Gaussian white noise. Problems of Information Transmission, 16, 52-68.

[23] Sakia, R.M. (1992). The Box-Cox transformation technique: a review. The Statistician, 41, 169-178.

[24] Samb, R. (2011). Nonparametric estimation of the density of regression errors. Comptes Rendus de l'Académie des Sciences - Paris, Série I, 349, 1281-1285.

[25] Shin, Y. (2008). Semiparametric estimation of the Box-Cox transformation model. Econometrics Journal, 11, 517-537.

[26] Vanhems, A. and Van Keilegom, I. (2011). Semiparametric transformation model with endogeneity: a control function approach. Journal of Econometrics (under revision).

[27] Watson, G.S. (1964). Smooth regression analysis. Sankhya - Series A, 26, 359-372.

[28] Zellner, A. and Revankar, N.S. (1969). Generalized production functions. Review of Economic Studies, 36, 241-250.

Postal addresses:

Benjamin Colling, Université catholique de Louvain, Institute of Statistics, Voie du Roman Pays 20, 1348 Louvain-la-Neuve, Belgium.

Cédric Heuchenne, HEC-Management School of the University of Liège, Statistique appliquée à la gestion et à l'économie, Rue Louvrex 14, Bâtiment N1, 4000 Liège, Belgium.

Rawane Samb, Centre de recherche du CHUQ/CHUL, 2705 Boulevard Laurier, Québec QC G1V 4G2, Canada.

Ingrid Van Keilegom, Université catholique de Louvain, Institute of Statistics, Voie du Roman Pays 20, 1348 Louvain-la-Neuve, Belgium.