Robust Average Derivative Estimation. February 2007 (Preliminary and Incomplete. Do not quote without permission.)


Robust Average Derivative Estimation

Marcia M.A. Schafgans* and Victoria Zinde-Walsh†

February 2007 (Preliminary and Incomplete. Do not quote without permission.)

Abstract. Many important models, such as index models widely used in limited dependent variables, partial linear models and nonparametric demand studies, utilize estimation of average derivatives (sometimes weighted) of the conditional mean function. Asymptotic results in the literature focus on situations where the ADE converges at parametric rates (as a result of averaging); this requires making stringent assumptions on the smoothness of the underlying density, and in practice such assumptions may be violated. We extend the existing theory by relaxing smoothness assumptions and obtain a full range of asymptotic results with both parametric and non-parametric rates. We consider both the possibility of lack of smoothness and lack of precise knowledge of the degree of smoothness, and propose an estimation strategy that produces the best possible rate without a priori knowledge of the degree of density smoothness. The new combined estimator is a linear combination of estimators corresponding to different bandwidth/kernel choices that minimizes the estimated asymptotic mean squared error (AMSE). Estimation of the AMSE and selection of the set of bandwidths and kernels are discussed. Monte Carlo results for the density weighted ADE confirm the good performance of the combined estimator.

*Department of Economics, London School of Economics. Mailing address: Houghton Street, London WC2A 2AE, United Kingdom.
†Department of Economics, McGill University and CIREQ. This work was supported by the Social Sciences and Humanities Research Council of Canada (SSHRC) and by the Fonds québécois de la recherche sur la société et la culture (FQRSC).

1. Introduction

Many important models rely on estimation of average derivatives (ADE) of the conditional mean function (averaged response coefficients); the most widely used such model is the single index model, where the conditional mean function can be represented as a univariate function of a linear combination of conditioning variables. Index representations are ubiquitous in econometric studies of limited dependent variable models, partial linear models and in nonparametric demand analysis. Estimation of coefficients in single index models relies on the fact that averaged derivatives of the conditional mean (or of the conditional mean weighted by some function) are proportional to the coefficients; thus a non-parametric estimator of the derivative of the conditional mean function provides estimates of the coefficients (up to a multiplicative factor). This method does not require assumptions about the functional form of either the density of the data or of the true regression function. Powell, Stock and Stoker (1989) and Robinson (1989) examined density weighted average derivatives, while Härdle and Stoker (1989) investigated the properties of the average derivatives themselves; one important difference is the need to introduce some form of trimming when there is no weighting by the density, since the estimator of the density appears in the denominator of the ADE and may be close to zero. Newey and Stoker (1993) addressed the issue of efficiency related to the choice of weighting function. Horowitz and Härdle (1996) extended the ADE approach to the estimation of coefficients in the single index model in the presence of discrete covariates. Donkers and Schafgans (2005) addressed the lack of identification associated with the estimation of coefficients in single index models for cases where the derivative of the unknown function on average equals zero; they propose an estimator based on the average outer product of derivatives, which resolves this lack of identification while at the same time enabling the estimation of parameters in multiple index models.

In all of the literature on ADE estimation, asymptotic theory was provided for parametric rates of convergence. Even though the estimators are based on a nonparametric kernel estimator of the conditional mean, which depends on the kernel and bandwidth and converges at a nonparametric rate, averaging can produce a parametric convergence rate, thus reducing dependence on the selection of the kernel and bandwidth, which do not appear in the leading term of the AMSE expansion.

However, other terms are sensitive to the bandwidth/kernel choice. Powell and Stoker (1996) address the optimal bandwidth choice for (weighted) average derivative estimation. Further results, including finite sample performance of average derivatives and corrections to improve finite-sample properties, are discussed in Robinson (1995) and Nishiyama and Robinson (2000, 2005). Parametric rates of convergence, and thus all the results in this literature, rely on the assumption of a sufficiently high degree of smoothness of the underlying density.

In this paper we are motivated by a concern about the assumed high degree of density smoothness. There is some empirical evidence that for many variables the density may not be sufficiently smooth and may have shapes that are not consistent with a high level of smoothness: peaks and cusps and even discontinuity of density functions are not uncommon (references). We extend the existing asymptotic results by relaxing assumptions on the density. We show that insufficient smoothness will result in possible asymptotic bias and may easily lead to non-parametric rates. Moreover, the selection of the optimal kernel order and optimal bandwidth in the absence of sufficient smoothness presumes knowledge of the degree of density smoothness. Thus an additional concern for us is the possible uncertainty about the degree of density smoothness. Incorrect assumptions about smoothness may lead to using an estimator that suffers from problems associated with under- or oversmoothing.

To address the problems associated with an incorrect choice of a bandwidth/kernel pair, we construct an estimator that optimally combines estimators for different bandwidths and kernels, to protect against the negative consequences of errors in assumptions about the order of density smoothness. We examine a variety of estimators corresponding to different kernels and bandwidth rates and derive the joint limit process for those estimators. When each estimator is normalized appropriately (with different rates) we obtain a joint Gaussian limit process which possibly exhibits an asymptotic bias and possibly some degeneracy. Any linear combination of such estimators is asymptotically Gaussian, and we are able to select a combination that minimizes the estimated asymptotic MSE; the resulting estimator is what we call the combined estimator.

Kotlyarova and Zinde-Walsh (2006) have shown that the weights in this combination are such that they provide the best rate available among all the rates without a priori knowledge of the degree of smoothness, thus protecting against making a bandwidth/kernel choice that relies on incorrect smoothness assumptions and would yield high asymptotic bias. Performance of the combined estimator relies on good estimators for the asymptotic variances and biases that enter into the combination; a method of estimation that does not depend on our knowledge about the degree of density smoothness is required. Variances can be estimated without much difficulty, e.g. by bootstrap. In Kotlyarova and Zinde-Walsh (2006) a method of estimation of the asymptotic bias of a (possibly) oversmoothed estimator that utilizes asymptotically unbiased undersmoothed estimators is proposed; here we add bootstrapping to improve the properties of this estimator of the asymptotic bias.

Without prior knowledge of smoothness, the bandwidth choices must be such that bandwidths optimal for smooth densities (obtained by rule-of-thumb or by cross-validation) are included to cover the possibility of high smoothness; such choices will correspond to oversmoothing if the density is not sufficiently smooth. Some lower bandwidths, determined e.g. as percentiles of the optimal bandwidth, should also be considered. Our method requires utilization of undersmoothed estimators to determine the asymptotic bias, thus it is important to consider fairly small bandwidths. We select kernels of different orders for the combination; most of the Monte Carlo results, both here and for other combined estimators (for SMS in the binary choice model and for density estimation, Kotlyarova, 2005), are not very sensitive to kernel choices.

The Monte Carlo results here are for the density weighted ADE in a single index model. We demonstrate that even in the case where the smoothness assumptions hold, the combined estimator performs similarly to the optimal ADE estimator and does not exhibit much of an efficiency loss, confirming the results about its being equivalent to the optimal rate estimator. The results in cases where the density is not sufficiently smooth, or while smooth has a shape that gives high values for low-order derivatives (e.g. a trimodal mixture of normals), indicate gains from the combined estimator relative to the optimal ADE estimator.

The paper is organized as follows. In Section 2 we discuss the general set-up and assumptions. In Section 3 we derive the asymptotic properties of the density-weighted ADE under various assumptions about density smoothness, derive the joint asymptotic distribution for several estimators, and define the combined estimator. Section 4 provides the results of a Monte Carlo study analysing, for the Tobit model, the performance of the combined estimator vis-a-vis single bandwidth/kernel based estimators for the density-weighted ADE in cases with different smoothness conditions.

2. General set-up and assumptions

We should have a brief intro to this section maybe? The unknown conditional mean function can be represented as

$$g(x)=E(y|x)=\frac{\int y\,f(y,x)\,dy}{f(x)}=\frac{G(x)}{f(x)},$$

with dependent variable $y\in R$ and explanatory variables $x\in R^k$. The joint density of $(y,x)$ is denoted by $f(y,x)$, the marginal density of $x$ is denoted by $f(x)$, and $G(x)$ denotes the function $\int y\,f(y,x)\,dy$. Since the regression derivative $g'(x)$ can be expressed as

$$g'(x)=\frac{G'(x)}{f(x)}-g(x)\frac{f'(x)}{f(x)},$$

the need to avoid imprecise contributions to the average derivative for observations with low densities emanates from the presence of the density in the denominator. One way of doing this is to employ some weighting function $w(x)$; on the other hand, Fan (1992, 1993) and Fan and Gijbels (1992) avoid weighting by use of regularization, whereby a positive term that shrinks with $n$ is added to the denominator of the estimator. In Härdle and Stoker (1989), trimming on the basis of the density takes the place of the weighting function, that is, they consider $w_N(x)=1(f(x)>b_N)$ where $b_N\to 0$. An alternative is the density weighted average derivative estimator of Powell, Stock and Stoker (1989), PSS, with $w(x)=f(x)$. Here we focus on the PSS estimator.

The nonparametric estimates for the various unknown derivative based functionals make use of kernel smoothing functions. E.g., the nonparametric estimate for the derivative of the density is given by

$$\hat f'_{(K,h)}(x_i)=\frac{1}{N-1}\sum_{j\neq i}\frac{1}{h^{k+1}}K'\!\left(\frac{x_i-x_j}{h}\right),$$

where $K$ is the kernel smoothing function and $h$ is a smoothing parameter that depends on the sample size $N$, with $h\to 0$ as $N\to\infty$.

We now turn to the fundamental assumptions. The first two assumptions are common in this literature, restricting $x$ to be a continuously distributed random variable where no component of $x$ is functionally determined by other components of $x$, imposing a boundary condition allowing for unbounded $x$'s, and requiring differentiability of $f$ and $g$.

Assumption 1. Let $z_i=(y_i,x_i^T)^T$, $i=1,\dots,N$, be a random sample drawn from $f(y,x)$, with $f(y,x)$ the density of $(y,x)$. The underlying measure of $(y,x)$ can be written as $\nu_y\times\nu_x$, where $\nu_x$ is Lebesgue measure. The support $\Omega$ of $f$ is a convex (possibly unbounded) subset of $R^k$ with nonempty interior $\Omega^0$.

Assumption 2. The density function $f(x)$ is continuous in the components of $x$ for all $x\in R^k$, so that $f(x)=0$ for all $x\in\partial\Omega$, where $\partial\Omega$ denotes the boundary of $\Omega$; $f$ is continuously differentiable in the components of $x$ for all $x\in\Omega^0$, and $g$ is continuously differentiable in the components of $x$ for all $x\in\Omega_1$, where $\Omega_1$ differs from $\Omega^0$ by a set of measure 0.

Additional requirements involving the conditional distribution of $y$ given $x$, as well as more smoothness conditions, need to be added. The conditions are slightly amended from how they appear in the literature; in particular, we use the weaker Hölder conditions instead of Lipschitz conditions, in the spirit of weakening smoothness assumptions as much as possible.

Assumption 3. (a) $E(y^2|x)$ is continuous in $x$.
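To make the estimator of the density derivative concrete, here is a minimal numerical sketch of $\hat f'_{(K,h)}$. The product quartic (biweight) kernel and all function names are our own illustrative choices, not part of the paper; any kernel satisfying Assumption 4 below could be substituted.

```python
import numpy as np

def kernel_grad(u):
    # Gradient of a product quartic (biweight) kernel on [-1, 1]^k:
    #   K(u) = prod_d (15/16)(1 - u_d^2)^2 for |u_d| <= 1 (second order, symmetric).
    inside = np.all(np.abs(u) <= 1.0, axis=-1, keepdims=True)
    kd = (15.0 / 16.0) * (1.0 - u ** 2) ** 2             # univariate factors
    kdp = (15.0 / 16.0) * (-4.0) * u * (1.0 - u ** 2)    # their derivatives
    prod_all = np.prod(kd, axis=-1, keepdims=True)
    # d/du_d of the product = K_d'(u_d) * prod_{e != d} K_e(u_e)
    grad = np.where(kd > 0, prod_all / np.where(kd > 0, kd, 1.0) * kdp, 0.0)
    return grad * inside

def density_grad_hat(x, h):
    # Leave-one-out kernel estimate of the density derivative at every x_i:
    #   f'_{(K,h)}(x_i) = (1/((N-1) h^{k+1})) sum_{j != i} K'((x_i - x_j)/h).
    n, k = x.shape
    u = (x[:, None, :] - x[None, :, :]) / h              # (N, N, k) pairwise args
    g = kernel_grad(u)
    g[np.arange(n), np.arange(n), :] = 0.0               # drop the j = i term
    return g.sum(axis=1) / ((n - 1) * h ** (k + 1))
```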

(b) The components of the random vector $g'(x)$ and of the matrix $f'(x)\,[y,\ x^T]$ have finite second moments; $(fg)'$ satisfies a Hölder condition with $0<\alpha\le 1$:

$$\left\|(fg)'(x+\Delta x)-(fg)'(x)\right\|\le\omega_{(fg)'}(x)\,\|\Delta x\|^{\alpha}$$

and $E\!\left(\omega_{(fg)'}(x)\,[1+|y|+\|x\|]\right)<\infty$.

Both the choice of the kernel (its order) and the selection of the bandwidth have played a crucial role in the literature in ensuring that the asymptotic bias for the nonparametric estimates of the derivative based functionals (averages) vanishes sufficiently fast, subject to a high degree of density smoothness. The kernel smoothing function is assumed to satisfy a fairly standard assumption, except for the fact that we allow the kernel to be asymmetric.

Assumption 4. (a) The kernel smoothing function $K(u)$ is a continuously differentiable function with bounded support $[-1,1]^k$.

(b) The kernel function $K(u)$ obeys

$$\int K(u)\,du=1;\qquad \int u_1^{i_1}\cdots u_k^{i_k}\,K(u)\,du=0\ \text{ for } 0<i_1+\dots+i_k<v(K);\qquad \int u_1^{i_1}\cdots u_k^{i_k}\,K(u)\,du\neq 0\ \text{ for } i_1+\dots+i_k=v(K),$$

where $(i_1,\dots,i_k)$ is an index set.

(c) The kernel smoothing function $K(u)$ is differentiable up to the order $v(K)$.

Various further assumptions have been made in the literature concerning the smoothness of the density (higher degrees of differentiability, Lipschitz and boundedness conditions) to ensure parametric rates of convergence. We formalize the degree of density smoothness in terms of the Hölder space of functions. This space, for integer $m\ge 0$ and $0<\alpha\le 1$, is defined as follows. For a set $E\subset R^k$, the space $C^{m+\alpha}(E)$ is a Banach space of bounded and continuous functions which are $m$ times continuously differentiable, with all the $m$th order derivatives satisfying Hölder's condition of order $\alpha$ (see Matematicheskaya Encyclopedia, English ed., M. Hazewinkel):

$$\left\|f^{(m)}(x+\Delta x)-f^{(m)}(x)\right\|\le\omega_{f^{(m)}}(x)\,\|\Delta x\|^{\alpha}\quad\text{for every } x,\ x+\Delta x\in E.$$

Assumption 5. $f\in C^{m+\alpha}(\Omega)$, where $C^{m+\alpha}(\Omega)$ is the Hölder space of functions on $\Omega\subset R^k$ with $m\ge 1$, $0<\alpha\le 1$, and $E\!\left(\omega_{f^{(m)}}(x)\,[1+|y|+\|x\|]\right)<\infty$.

The assumption implies that each component of the derivative of the density satisfies $f'(x)\in C^{m-1+\alpha}(\Omega)$, and thus for every component of the derivative of the density continuous derivatives of order $m-1$ exist (if $m-1=0$ there is just Hölder continuity of the derivative). This permits the following expansion for $c=0,1$, with $c=0$ for the expansion of the density and $c=1$ for the expansion of the derivative of the density function:

$$f^{(c)}(x+\Delta x)=\sum_{p=c}^{m-1}\ \sum_{i_1+\dots+i_k=p-c}\frac{f^{(p)}_{i_1\dots i_k}(x)}{i_1!\cdots i_k!}\,\Delta x^{(i)}+\sum_{i_1+\dots+i_k=m-c}\frac{f^{(m)}_{i_1\dots i_k}(x+\lambda\Delta x)}{i_1!\cdots i_k!}\,\Delta x^{(i)}$$
$$=\sum_{p=c}^{m}\ \sum_{i_1+\dots+i_k=p-c}\frac{f^{(p)}_{i_1\dots i_k}(x)}{i_1!\cdots i_k!}\,\Delta x^{(i)}+\sum_{i_1+\dots+i_k=m-c}\frac{f^{(m)}_{i_1\dots i_k}(x+\lambda\Delta x)-f^{(m)}_{i_1\dots i_k}(x)}{i_1!\cdots i_k!}\,\Delta x^{(i)},\qquad(1)$$

where $\Delta x$ denotes the vector $(\Delta x_1,\dots,\Delta x_k)$, $\Delta x^{(i)}$ the product $\Delta x_1^{i_1}\cdots\Delta x_k^{i_k}$ with the index set $(i_1,\dots,i_k)$, $f^{(p)}_{i_1\dots i_k}(x)$ the corresponding partial derivative of $f$ of order $p$, and $0\le\lambda\le 1$. The first equality is obtained by Taylor expansion (with the remainder term in Lagrange form) and the second equality is obtained by adding and subtracting the terms with $f^{(m)}_{i_1\dots i_k}(x)$. By Assumption 5, $f^{(m)}(x+\lambda\Delta x)-f^{(m)}(x)$ in the last sum satisfies the Hölder inequality, and thus the last sum is $O(\|\Delta x\|^{m-c+\alpha})$.

Lack of smoothness of the density can readily be shown to affect the asymptotic bias of derivative based estimators, since the biases of those estimators can be expressed via the bias of the kernel estimator of the derivative of the density. Let $v$ be the degree of smoothness of the derivative of the density (equal to $m-1+\alpha$ by Assumption 5), and $v(K)$ the order of the kernel.

Define $v^*=\min(v,v(K))$. Provided $v^*=v(K)\le v$, the bias of the estimator of the derivative of the density,

$$E\left(\hat f'_{(K,h)}(x_i)-f'(x_i)\right)=E\left[\int K(u)\left(f'(x_i-hu)-f'(x_i)\right)du\right],$$

is as usual $O(h^{v(K)})$ (by applying the usual $v(K)$th order Taylor expansion of $f'(x_i-hu)$ around $f'(x_i)$). We next show that with $v^*=v<v(K)$ the bias of the derivative vanishes at the lower rate $O(h^{v})$. In the latter case, substituting (1) with $c=1$ and $\Delta x=-hu$ into the bias expression and using the kernel order yields

$$E\int\left[f'(x-hu)-f'(x)\right]K(u)\,du=E\int\sum_{i_1+\dots+i_k=m-1}\frac{(-h)^{m-1}}{i_1!\cdots i_k!}\left[f^{(m)}_{i_1\dots i_k}(x-\lambda hu)-f^{(m)}_{i_1\dots i_k}(x)\right]K(u)\,u^{(i)}\,du=O(h^{m-1+\alpha})=O(h^{v}),\qquad(2)$$

where the latter equality uses the Hölder inequality.¹ If the differentiability conditions typically assumed to ensure that $v>\frac{k+2}{2}$ do not hold, then even for bandwidths such that $Nh^{2v(K)}=o(1)$ the bias does not vanish sufficiently fast. With $v^*=\min(v,v(K))$, all we can state is the rate $O(h^{v^*})$ for the bias:

$$E\int\left[f'(x-hu)-f'(x)\right]K(u)\,du=O(h^{v^*}).$$

3. Average density weighted derivative estimator

The average density weighted derivative, introduced in Powell, Stock and Stoker (1989), is defined as

$$\delta_0=E\left(f(x)g'(x)\right).\qquad(3)$$

Given Assumptions 1–3, (3) can be represented as $\delta_0=-2E\left(f'(x)y\right)$ (see Lemma 2.1 in PSS).

¹ $\left\|\int\left[f'(x-hu)-f'(x)\right]K(u)\,du\right\|\le h^{m-1+\alpha}\,\omega_{f^{(m)}}(x)\int\|K(u)\|\,\|u\|^{m-1+\alpha}\,du\;O(1)$, where Assumption 4(a) implies that $\|K(u)\|$ is bounded (since it is continuous on a closed bounded set), and $\|u\|$ is bounded on the support of $K$; Assumption 5 ensures boundedness of $E\,\omega_{f^{(m)}}(x)$.
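For completeness, the integration-by-parts argument behind the representation $\delta_0=-2E(f'(x)y)$ can be sketched as follows (a one-line sketch under Assumption 2, which gives $f=0$ on $\partial\Omega$; for $k>1$ the same computation applies coordinate-wise):

$$\delta_0=E\left(f(x)g'(x)\right)=\int g'(x)f(x)^2\,dx=\left[g(x)f(x)^2\right]_{\partial\Omega}-2\int g(x)f'(x)f(x)\,dx=-2E\left(g(x)f'(x)\right)=-2E\left(y\,f'(x)\right),$$

where the boundary term vanishes because $f=0$ on $\partial\Omega$, and the last step uses $E(y|x)=g(x)$.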

The estimator of $\delta_0$ proposed by PSS uses the sample analogue, where $f'(x)$ is replaced by a consistent nonparametric estimate, i.e.,

$$\hat\delta_N(K,h)=-\frac{2}{N}\sum_{i=1}^{N}\hat f'_{(K,h)}(x_i)\,y_i,\qquad(4)$$

where

$$\hat f'_{(K,h)}(x_i)=\frac{1}{N-1}\sum_{j\neq i}\frac{1}{h^{k+1}}K'\!\left(\frac{x_i-x_j}{h}\right),$$

$K$ is the kernel smoothing function (which PSS assume to be symmetric) and $h$ is a smoothing parameter that depends on the sample size $N$, with $h\to 0$ as $N\to\infty$. We derive the variance of $\hat\delta_N(K,h)$ without relying on results on $U$-statistics, to accommodate possibly non-symmetric kernels; this is provided in Lemma 1 in the Appendix. We obtain the following expression for this variance:

$$Var(\hat\delta_N(K,h))=\Sigma_1(K)\,N^{-2}h^{-(k+2)}+\Sigma_2\,N^{-1}+O(N^{-2}),\qquad(5)$$

where

$$\Sigma_1(K)=4E\left[y_i^2f(x_i)\,\Sigma(K)+\bar\Sigma(K)\,(gf)(x_i)\,y_i\right],$$
$$\Sigma_2=4\,E\!\left(\left[(g'f)(x_i)-(y_i-g(x_i))f'(x_i)\right]\left[(g'f)(x_i)-(y_i-g(x_i))f'(x_i)\right]^T\right)-4\,\delta_0\delta_0^T$$

(the kernel constants $\Sigma(K)=\int K'(u)K'(u)^T\,du$ and $\bar\Sigma(K)=\int K'(u)K'(-u)^T\,du$ are defined in the Appendix; under symmetry $\bar\Sigma(K)=-\Sigma(K)$). Here $\Sigma_2$, which for sufficiently smooth $f(x)$ coincides with the asymptotic variance of $\sqrt N\,\hat\delta_N(K,h)$ considered in PSS, obtains when $Nh^{k+2}\to\infty$. For a symmetric kernel, $\Sigma_1(K)$ simplifies to $4\,\Sigma(K)\,E\left[\sigma^2(x_i)f(x_i)\right]$, with the conditional variance $\sigma^2(x)=E(y^2|x)-E(y|x)^2$. For this case Powell and Stoker (1996) discuss the rates of the asymptotic variance in (5) with a view to selecting the MSE-optimal bandwidth rate. The asymptotic variance does not depend on the kernel function when the bandwidth satisfies $Nh^{k+2}\to\infty$, but only if we have a certain degree of smoothness of the density: $v>\frac{k+2}{2}$. In the absence of this degree of differentiability (or when oversmoothing), the asymptotic variance (as the asymptotic bias) does depend on the weighting used in the local averaging, possibly yielding a non-parametric rate.

To express the asymptotic bias of the estimator $\hat\delta_N(K,h)$, define

$$A(K,h;x_i)=E_{z_i}\left(\hat f'_{(K,h)}(x_i)\right)-f'(x_i)=\int K(u)\left(f'(x_i-hu)-f'(x_i)\right)du.$$
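A minimal sketch of (4), reusing `density_grad_hat` from the sketch in Section 2 (again, the function names are ours, not the authors'):

```python
import numpy as np

def pss_ade(x, y, h):
    # Density weighted ADE of PSS: delta_hat = -(2/N) sum_i f'_{(K,h)}(x_i) y_i.
    fgrad = density_grad_hat(x, h)           # (N, k) estimated density gradients
    return -2.0 * np.mean(fgrad * y[:, None], axis=0)
```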

Then

$$Bias(\hat\delta_N(K,h))=-2E\left(A(K,h;x_i)\,y_i\right).\qquad(6)$$

As shown in Section 2, $E\,A(K,h;x_i)$ is $O(h^{v^*})$. We assume:

Assumption 6. As $N\to\infty$, $-2h^{-v^*}E\left(A(K,h;x_i)\,y_i\right)\to B(K)$, where $\|B(K)\|<\infty$ holds.

The asymptotic bias of the estimator $\hat\delta_N(K,h)$ can then be written as

$$Bias(\hat\delta_N(K,h))=h^{v^*}B(K)+o(h^{v^*})\qquad(7)$$

and vanishes as $h\to 0$. We note that Assumption 6 could hold as a result of primitive moment assumptions on $y_i$, $f(x_i)$ and $g(x_i)$.

Let $d(N)\asymp O(1)$ denote the case when both $d(N)$ and $1/d(N)$ are $O(1)$ as $N\to\infty$. Assume that $C=\lim_{N\to\infty}Nh^{k+2}$ always exists, with $C\in[0,\infty]$.

Theorem 1. Under Assumptions 1–6:

(a) If the density is sufficiently smooth and the order of the kernel is sufficiently high, $v^*>\frac{k+2}{2}$:

i. choosing $h$: $Nh^{k+2}=o(1)$, $N^2h^{k+2}\to\infty$ provides an unbiased but not efficient estimator:
$$Nh^{(k+2)/2}\left(\hat\delta_N(K,h)-\delta_0\right)\overset{d}{\to}N\left(0,\Sigma_1(K)\right);$$

ii. if $h$: $Nh^{k+2}\to C$, $0<C<\infty$:
$$\sqrt N\left(\hat\delta_N(K,h)-\delta_0\right)\overset{d}{\to}N\left(0,\ C^{-1}\Sigma_1(K)+\Sigma_2\right);$$

iii. when $h$: $Nh^{k+2}\to\infty$, $Nh^{2v^*}=o(1)$, the same result as in PSS, Theorem 3.3, holds:
$$\sqrt N\left(\hat\delta_N(K,h)-\delta_0\right)\overset{d}{\to}N\left(0,\Sigma_2\right);$$

iv. if $h$: $Nh^{k+2}\to\infty$ but $Nh^{2v^*}\asymp O(1)$, a biased asymptotically normal estimator results:
$$\sqrt N\left(\hat\delta_N(K,h)-\delta_0\right)\overset{d}{\to}N\left(B(K),\Sigma_2\right);$$

v. if $h$: $Nh^{2v^*}\to\infty$, the bias dominates:
$$h^{-v^*}\left(\hat\delta_N(K,h)-\delta_0\right)\overset{p}{\to}B(K).$$

(b) For the case $v^*=\frac{k+2}{2}$, (i), (ii) and (v) of part (a) apply.

(c) If either the density is not smooth enough or the order of the kernel is low, $v^*<\frac{k+2}{2}$, the parametric rate cannot be obtained:

i. for $h$: $N^2h^{k+2+2v^*}=o(1)$, $N^2h^{k+2}\to\infty$, in the limit there is normality and no bias:
$$Nh^{(k+2)/2}\left(\hat\delta_N(K,h)-\delta_0\right)\overset{d}{\to}N\left(0,\Sigma_1(K)\right);$$

ii. for $h$: $N^2h^{k+2+2v^*}\asymp O(1)$, $N^2h^{k+2}\to\infty$, in the limit there is normality with asymptotic bias:
$$Nh^{(k+2)/2}\left(\hat\delta_N(K,h)-\delta_0\right)\overset{d}{\to}N\left(B(K),\Sigma_1(K)\right);$$

iii. for $h$: $N^2h^{k+2+2v^*}\to\infty$, the bias dominates:
$$h^{-v^*}\left(\hat\delta_N(K,h)-\delta_0\right)\overset{p}{\to}B(K).$$

Proof. See the Appendix (where the variances and covariances are derived for any kernel, but parts of the normality proof are provided for the case of a symmetric kernel).

Selection of the optimal bandwidth, as minimizing the mean squared error, critically depends on our knowledge of the degree of smoothness of the density. Let $v$ denote the true differentiability (smoothness) of $f'$ and choose the order of our kernel $v(K)\ge[v]$. The $MSE(\hat\delta_N(K,h))$ can then be represented as

$$MSE(\hat\delta_N(K,h))=\Sigma_1(K)\,N^{-2}h^{-(k+2)}+\Sigma_2\,N^{-1}+B(K)B^T(K)\,h^{2v(K)},$$

and the optimal bandwidth yields $h_{opt}=cN^{-2/(2v(K)+k+2)}$, where the problem of efficient estimation is to find an appropriate $c$ (e.g., Powell and Stoker (1996)). If higher order derivatives exist, further improvements in efficiency can be obtained by using a higher order kernel to reduce the bias.
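The orders in this MSE expansion are easy to inspect numerically. A small sketch (scalar stand-ins of our own for $\Sigma_1(K)$, $\Sigma_2$ and $B(K)B^T(K)$) balances the first variance term against the squared bias and reproduces the quoted $N^{-2/(2v(K)+k+2)}$ rate:

```python
import numpy as np

def amse_terms(n, h, k, vK, sigma1=1.0, sigma2=1.0, b=1.0):
    # The three terms of MSE(delta_hat) from the text:
    var_kernel = sigma1 * n ** -2 * h ** -(k + 2)   # Sigma_1(K) N^{-2} h^{-(k+2)}
    var_param  = sigma2 / n                         # Sigma_2 N^{-1}
    bias_sq    = b ** 2 * h ** (2 * vK)             # B(K)B(K)' h^{2 v(K)}
    return var_kernel, var_param, bias_sq

def h_opt(n, k, vK, c=1.0):
    # Setting var_kernel = bias_sq gives h^{2 v(K) + k + 2} ~ N^{-2}, i.e.
    # h_opt = c N^{-2/(2 v(K) + k + 2)}, the rate quoted in the text.
    return c * n ** (-2.0 / (2 * vK + k + 2))
```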

In any case, to ascertain a parametric rate of our limiting distribution for $\hat\delta_N(K,h)$ with the use of the higher order kernel (as long as the density is sufficiently differentiable to have $v(K)\ge[v]$), our bandwidth sequence needs to satisfy $Nh^{2v(K)}\to 0$, and the degree of smoothness of the derivative of the density, $v$, needs to be in excess of $\frac{k+2}{2}$ (with $Nh^{k+2}$ bounded away from zero to guarantee boundedness of the variance of $\sqrt N\,\hat\delta_N(K,h)$). The advantage of being able to assume this high differentiability order is the insensitivity of the limit process to the bandwidth and kernel over a range of choices that satisfy the assumptions (among which $Nh^{k+2}\to\infty$); if the density is not sufficiently smooth, the parametric rate may not be achievable, and bandwidth and kernel choices become crucial in ensuring good performance. Moreover, if the degree of density smoothness is not known, there is no guidance for the choice of kernel and bandwidth: a higher order kernel and larger bandwidth that could be better if there were more smoothness could lead to substantial bias if the density is less smooth.

Without making further assumptions about knowledge of the degree of smoothness of the density, all that is known is that for some rate of $h\to 0$ there is undersmoothing (no asymptotic bias and a limiting Gaussian distribution), and for some slower convergence rate of $h$ there is oversmoothing. An optimal rate may exist, but to determine it, and to identify bandwidths which lead to under- and over-smoothing, precise knowledge of $v$, the density smoothness, is required.

The situation where there is uncertainty about the smoothness of the density was considered in Kotlyarova and Zinde-Walsh (2006), hereafter referred to as KZW. Theorem 1 corresponds to Assumption 1 of that paper and demonstrates that when our Assumptions 1–6 are satisfied the estimator satisfies their Assumption 1. We next establish that Assumption 2 of that paper is satisfied as well. Consider several kernel/bandwidth sequence pairs $(h_{Nj},K_j)$, $j=1,\dots,J$, and the corresponding estimators $\hat\delta_N(K_j,h_{Nj})$. If all satisfy the assumptions of Theorem 1, then there exist corresponding rates $r_{Nj}$ for which the joint limit process of $r_{Nj}\left(\hat\delta_N(K_j,h_{Nj})-\delta_0\right)$ is non-zero Gaussian, possibly degenerate.

Theorem 2. Under the Assumptions of Theorem 1, the joint limit process for the vector with components $r_{Nj}\left(\hat\delta_N(K_j,h_{Nj})-\delta_0\right)$, $j=1,\dots,J$, is Gaussian, with a covariance matrix

such that the covariances between components that correspond to different rates are zero.

Proof. See the Appendix.

Consider a linear combination of the estimators,

$$\bar\delta_N=\sum_j a_j\,\hat\delta_N(K_j,h_{Nj})\quad\text{with}\quad\sum_{j=1}^{J}a_j=1.$$

We can represent the variance of $\bar\delta_N$ as

$$\sum_{t_1,s_1}\sum_{t_2,s_2}a_{t_1,s_1}a_{t_2,s_2}\,Cov\!\left(\hat\delta_N(K_{t_1},h_{s_1}),\hat\delta_N(K_{t_2},h_{s_2})\right)\approx\sum a_{j_1}a_{j_2}\,\Sigma_{j_1j_2},$$

where (see the Appendix)

$$\Sigma_{j_1j_2}=\Sigma_1\!\left(K_{t_1},K_{t_2},h_{s_1}/h_{s_2}\right)N^{-2}\left(h_{s_1}h_{s_2}\right)^{-(k+2)/2}+\Sigma_2\,N^{-1}+O(N^{-2}),$$

with

$$\Sigma_1\!\left(K_{t_1},K_{t_2},h_{s_1}/h_{s_2}\right)=4E\left[y_i^2f(x_i)\,\Sigma\!\left(K_{t_1},K_{t_2},h_{s_1}/h_{s_2}\right)+\bar\Sigma\!\left(K_{t_1},K_{t_2},h_{s_1}/h_{s_2}\right)(gf)(x_i)\,y_i\right]$$

and $\Sigma_2$ as before. The $MSE(\bar\delta_N)=MSE\!\left(\sum_{t,s}a_{t,s}\,\hat\delta_N(K_t,h_s)\right)$ can then be represented as

$$MSE(\bar\delta_N)=\sum a_{j_1}a_{j_2}\left(B_{j_1}B_{j_2}^T+\Sigma_{j_1j_2}\right)$$

with $B_j=B(K_j)$.² To optimally choose the weights $a_j$, we will minimize the trace of the AMSE, as in KZW:³

$$tr\!\left(AMSE(\bar\delta_N)\right)=\sum a_{j_1}a_{j_2}\left(\tilde B_{j_1}^T\tilde B_{j_2}+tr\,\tilde\Sigma_{j_1j_2}\right)=a^TDa,$$

where $\{D\}_{j_1j_2}=\tilde B_{j_1}^T\tilde B_{j_2}+tr\,\tilde\Sigma_{j_1j_2}$,

² Alternatively, though more complicated, we could consider $\tilde\delta_N=-\frac{2}{N}\sum_{i=1}^{N}\sum_{s=1}^{S}w_s(x_i)\,\hat f'_{s,K_s}(x_i)\,y_i$, with $\sum_{s=1}^{S}w_s(x_i)=1$.

³ Note that the MSE only provides a complete ordering when $\bar\delta_N$ is a scalar; using a trace is one way to obtain a complete ordering. Depending on which scalar function of the AMSE is used, the ordering might differ.

$\tilde B_j=B_j/r_N(t_j,s_j)$, and $\tilde\Sigma_{j_1j_2}=\Sigma_{j_1j_2}/\left(r_N(t_{j_1},s_{j_1})\,r_N(t_{j_2},s_{j_2})\right)$. The combined estimator is defined as the linear combination with weights that minimize the estimated $tr(AMSE(\bar\delta_N))$.

KZW discusses the optimal weights that minimize the (consistently) estimated $tr(AMSE(\bar\delta_N))$ subject to $\sum_j a_j=1$; here we summarize the results. After ranking the pairs $(K_{t_j},h_{s_j})$ in declining order of the rates $r_N(t_j,s_j)$, denote by $D^I$ the largest invertible submatrix of $D$ and by $D^{(1)}$ its square submatrix associated with the estimators having the fastest rate of convergence; note that $D^I$ can have entries associated with at most one oversmoothed estimator if it is to be of full rank. Then $a^TD^Ia$ (subject to $\sum_j a_j=1$) is minimized by

$$a_{\lim}^{I}=\left(\frac{\mathbf 1^TD^{(1)-1}}{\mathbf 1^TD^{(1)-1}\mathbf 1},\ 0,\dots,0\right)^T,$$

that is, by weights that assign the minimizing combination to the kernel/bandwidth combinations having the fastest rate of convergence and zero weight to all combinations with a slower rate of convergence. Note that the weights in the limiting linear combination are non-negative for estimators corresponding to $D^I$ (at most one asymptotically biased estimator). If $D^I\neq D$, then $D\setminus D^I=D^{II}$ is of rank one and corresponds to oversmoothed estimators only. $D^{II}$ has dimension more than one (otherwise the only oversmoothed estimator would have been included in $D^I$); note that then there always exist vectors $a_{\lim}^{II}$ such that $a_{\lim}^{II\,T}D^{II}a_{\lim}^{II}=0$ and $\sum_i a_{\lim,i}^{II}=1$; in other words, it is possible to automatically bias-correct by using the combined estimator with weights that are not restricted to be non-negative. Finally, the vector of weights in the combined estimator approaches an optimal linear combination of $a_{\lim}^{I}$ and $a_{\lim}^{II}$.

The combined estimator thus has a trace of the AMSE that converges at a rate no worse than that of the trace of the AMSE of the fastest converging individual estimator. The combined estimator provides a useful mechanism for reducing the uncertainty about the degree of smoothness, and thus about the best rate (bandwidth), and automatically selects the best rate from those available, even though it is not known a priori which of the estimators converges faster.
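A minimal sketch of the weight computation (function names are ours; the estimated biases $\tilde B_j$ and covariances $\tilde\Sigma_{j_1j_2}$ are taken as inputs). It solves $\min_a a^TDa$ subject to $\sum_j a_j=1$, whose closed form is $a=D^{-1}\mathbf 1/(\mathbf 1^TD^{-1}\mathbf 1)$ when $D$ is invertible:

```python
import numpy as np

def combined_estimator(estimates, biases, covs):
    # estimates: list of J ADE vectors; biases: list of J estimated bias vectors;
    # covs[j1][j2]: estimated covariance matrix between estimators j1 and j2.
    J = len(estimates)
    D = np.empty((J, J))
    for j1 in range(J):
        for j2 in range(J):
            # D_{j1 j2} = B_{j1}' B_{j2} + tr(Sigma_{j1 j2})
            D[j1, j2] = biases[j1] @ biases[j2] + np.trace(covs[j1][j2])
    ones = np.ones(J)
    sol = np.linalg.pinv(D) @ ones      # pinv guards against a near-singular D
    a = sol / (ones @ sol)              # weights summing to one
    return a, sum(a[j] * np.asarray(estimates[j]) for j in range(J))
```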

The optimality property of the combined estimator relies on consistent estimation of the biases and covariances.⁴ To provide a consistent estimate for the asymptotic variance that does not rely on the degree of smoothness, we apply the bootstrap:

$$\tilde\Sigma_{j_1j_2}=\widehat{Cov}_B\!\left(\hat\delta_N(K_{t_1},h_{s_1}),\hat\delta_N(K_{t_2},h_{s_2})\right)=\frac{1}{B}\sum_{b=1}^{B}\left(\hat\delta_{b,N}(K_{t_1},h_{s_1})-\hat\delta_N(K_{t_1},h_{s_1})\right)\left(\hat\delta_{b,N}(K_{t_2},h_{s_2})-\hat\delta_N(K_{t_2},h_{s_2})\right)^T.\qquad(8)$$

To obtain a consistent estimator of the biases, we need to assume that for every kernel we consider an undersmoothed bandwidth, yielding an asymptotic bias equal to zero. Let $h_{s_0}$ denote the smallest bandwidth we consider; a consistent estimator for the bias is obtained as

$$\widetilde{Bias}\!\left(\hat\delta_N(K_{t_j},h_{s_j})\right)=\hat\delta_N(K_{t_j},h_{s_j})-\frac{1}{B}\sum_{b=1}^{B}\hat\delta_{b,N}(K_{t_j},h_{s_0}).$$

Alternatively, the bootstrapped averaged estimates at the lowest bandwidth for all the kernels, $i=1,\dots,m$, could be used in the bias estimation:

$$\widetilde{Bias}\!\left(\hat\delta_N(K_{t_j},h_{s_j})\right)=\hat\delta_N(K_{t_j},h_{s_j})-\frac{1}{mB}\sum_{b=1}^{B}\sum_{i=1}^{m}\hat\delta_{b,N}(K_{t_i},h_{s_0}).$$

4. Simulation

In order to illustrate the effectiveness of the combined estimator, we provide a Monte Carlo study where we consider the Tobit model. The Tobit model under consideration is given by

$$y_i=y_i^*\ \text{if}\ y_i^*>0,\qquad y_i=0\ \text{otherwise},\qquad y_i^*=x_i^T\beta+\varepsilon_i,\quad i=1,\dots,n,$$

⁴ Examples in KZW demonstrate that a combined estimator can reduce the AMSE relative to an estimator based on an incorrectly assumed high smoothness level even when the weights are not optimally determined.
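The bootstrap steps can be sketched as follows. Here `ade` stands for any routine computing $\hat\delta_N(K,h)$ for a given kernel/bandwidth pair (e.g. a kernel-indexed version of `pss_ade` above), and the unbiased benchmark uses the averaged undersmoothed estimates, as in the second (alternative) bias formula; all names are our own:

```python
import numpy as np

def bootstrap_bias_cov(x, y, ade, pairs, undersmoothed, B=50, seed=0):
    # pairs: list of (kernel, bandwidth) tuples indexing the J estimators;
    # undersmoothed: indices of the pairs using the smallest bandwidth h_{s0}.
    rng = np.random.default_rng(seed)
    n = len(y)
    full = np.array([ade(x, y, K, h) for K, h in pairs])        # (J, k)
    boot = np.empty((B, len(pairs), full.shape[1]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                        # resample (x_i, y_i)
        for j, (K, h) in enumerate(pairs):
            boot[b, j] = ade(x[idx], y[idx], K, h)
    centered = boot - boot.mean(axis=0)                         # as in (8)
    covs = np.einsum('bjp,bkq->jkpq', centered, centered) / B   # (J, J, k, k)
    benchmark = boot[:, undersmoothed].mean(axis=(0, 1))        # unbiased anchor
    biases = full - benchmark                                   # (J, k) bias estimates
    return biases, covs
```

The outputs plug directly into `combined_estimator` above.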

where the dependent variable $y_i$ is censored to zero for all observations for which the latent variable $y_i^*$ lies below a threshold, which without loss of generality is set equal to zero. We randomly draw $\{(x_i,\varepsilon_i)\}_{i=1}^n$, where we assume that the errors, drawn independently of the regressors, are standard Gaussian. Consequently, the conditional mean of $y$ given $x$ can be written as

$$g(x)=x^T\beta\,\Phi(x^T\beta)+\phi(x^T\beta),$$

where $\Phi(\cdot)$ and $\phi(\cdot)$ denote the standard normal cdf and pdf respectively. Irrespective of the distributional assumption on $\varepsilon_i$, this is a single index model, as the conditional mean of $y$ given $x$ depends on the data only through the index $x^T\beta$. While MLE obviously offers the asymptotically efficient estimator of $\beta$, the (density weighted) ADE offers a semiparametric estimator of $\beta$ which does not rely on the Gaussianity assumption on $\varepsilon_i$. Under the usual smoothness assumptions, the finite sample properties of the ADE for the Tobit model have been considered in the literature (Nishiyama and Robinson (2005)).

We select two explanatory variables and set $\beta=(1,1)^T$. We make various assumptions about the distribution of the explanatory variables. For the first model, we use two independent standard normal explanatory variables, i.e., $f_1(x_1,x_2)=\phi(x_1)\phi(x_2)$. This density is infinitely differentiable and very smooth; thus, the ADE estimator evaluated at the optimal bandwidth should be a good choice. This model, which we label (s,s), is considered to demonstrate that even in the case where the smoothness assumptions hold, the combined estimator performs similarly to the ADE estimator evaluated at the optimal bandwidth. For the second model we use one standard normal explanatory variable and one mixture of normals, and in the third model both explanatory variables are mixtures of normals. We label these models respectively (s,m) and (m,m). As in the first model, we assume independence of the explanatory variables. Mixtures of normals, while still being infinitely differentiable, do allow behaviour resembling that of nonsmooth densities, e.g., the double claw density and the discrete comb density (see Marron and Wand (1992)). We consider here the trimodal normal mixture given by

$$f_m(x)=0.5\,\phi(x+0.767)+3\,\phi\!\left(\frac{x+0.767-0.8}{0.1}\right)+2\,\phi\!\left(\frac{x+0.767-1.2}{0.1}\right).$$
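A sketch of the data generating process follows. The mixture constants in `trimodal` follow the reconstruction of $f_m$ displayed above (mixture weights 0.5, 0.3, 0.2) and should be treated as placeholders rather than the paper's exact values; the function names are ours:

```python
import numpy as np

def trimodal(rng, n):
    # Draw from the trimodal mixture: weights 0.5, 0.3, 0.2 on
    # N(-0.767, 1), N(0.8 - 0.767, 0.1^2) and N(1.2 - 0.767, 0.1^2).
    comp = rng.choice(3, size=n, p=[0.5, 0.3, 0.2])
    means = np.array([-0.767, 0.8 - 0.767, 1.2 - 0.767])
    scales = np.array([1.0, 0.1, 0.1])
    return means[comp] + scales[comp] * rng.standard_normal(n)

def simulate_tobit(n, beta=(1.0, 1.0), model="mm", seed=0):
    # Tobit DGP: y* = x'beta + eps, eps ~ N(0, 1); y = max(y*, 0).
    # model is "ss", "sm" or "mm": one letter per regressor (s = standard
    # normal, m = trimodal mixture), matching the three designs in the text.
    rng = np.random.default_rng(seed)
    draw = {"s": lambda: rng.standard_normal(n), "m": lambda: trimodal(rng, n)}
    x = np.column_stack([draw[c]() for c in model])
    ystar = x @ np.asarray(beta) + rng.standard_normal(n)
    return x, np.maximum(ystar, 0.0)
```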

So $f_2(x_1,x_2)=\phi(x_1)f_m(x_2)$ and $f_3(x_1,x_2)=f_m(x_1)f_m(x_2)$.⁵ The sample size is set at 1000 and 100 replications are drawn in each case.

⁵ We are planning an analysis using the claw density and the discrete comb density (Marron and Wand (1992)) and are also exploring the selection of a density with a precise order of non-smoothness.

The multivariate kernel function $K(\cdot)$ (on $R^2$) is chosen as the product of two univariate kernel functions. We use a second and a fourth order kernel in our Monte Carlo experiment, where, given that we use two explanatory variables, the highest order satisfies the minimal theoretical requirement for ascertaining a parametric rate subject to the necessary smoothness assumptions. Both are bounded, symmetric kernels which satisfy the assumption that the kernel and its derivative vanish at the boundary. Other simulations will consider the use of asymmetric kernels, which may yield further improvements for the combined estimator.

For each kernel we consider three different bandwidths. The largest bandwidth is chosen on the basis of a generalized cross validation method, where a grid-search algorithm and 50 simulations are used. The cross-validation bandwidth is given by the optimal bandwidth sequence $h_{gcv}=cN^{-1/(2p+2)}$ (see Stone (1982)), with $p$ equalling the order of the kernel (so that here $p=v(K)$). For densities of sufficient smoothness, this bandwidth does not represent the undersmoothing required to ensure asymptotic unbiasedness. When densities are not sufficiently smooth, $v^*=v<v(K)$, $h_{gcv}$ will even correspond to oversmoothing, as we will have $Nh^{2v^*}\to\infty$, providing cases (a)v, (b)iii, or (c)iii in Theorem 1. The smallest bandwidth for each kernel is chosen as $0.5h_{gcv}$; it needs to be sufficiently small so as to ensure the required level of undersmoothing. In addition we take the intermediate bandwidth $0.75h_{gcv}$.

The generalized cross validation method applied is not the one typically applied for nonparametric regression, but is specialized to the derivative of the regression function, $g'(x)$. We use the usual generalized cross validation ($\min\sum_{i=1}^{n}(y_i-\hat g_{-i}(x_i))^2$) to obtain numerical derivatives of $g(x)$, evaluated at a uniform grid of the $x$'s, which we denote as $\tilde g'_{pgcv}(x)$. The optimal bandwidth for the derivative of the regression function is then obtained by minimizing $\left\|\tilde g'_{pgcv}(x)-\hat g'(x)\right\|$. The bandwidth obtained this way yielded smaller bandwidths than the usual cross validation method, which accorded well with selecting the bandwidth by minimizing the mean squared error of the nonparametrically estimated moments (Donkers and Schafgans (2005)), a method which can only be applied in a simulation setting through knowledge of the true data generating process.
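The first (regression) stage of this bandwidth selection can be sketched with a standard leave-one-out criterion for the Nadaraya-Watson fit; the derivative-matching second stage is omitted, and all names and the grid argument are our own choices:

```python
import numpy as np

def loo_cv_bandwidth(x, y, h_grid):
    # Choose h minimizing sum_i (y_i - g_hat_{-i}(x_i))^2, where g_hat_{-i}
    # is the Nadaraya-Watson estimator computed without observation i.
    best = (np.inf, None)
    for h in h_grid:
        u = (x[:, None, :] - x[None, :, :]) / h
        # product quartic kernel weights; clip enforces the [-1, 1] support
        w = np.prod((15.0 / 16.0) * np.clip(1.0 - u ** 2, 0.0, None) ** 2, axis=-1)
        np.fill_diagonal(w, 0.0)                  # leave one out
        denom = w.sum(axis=1)
        ok = denom > 0                            # skip isolated points
        resid = y[ok] - (w @ y)[ok] / denom[ok]
        score = np.mean(resid ** 2)
        if score < best[0]:
            best = (score, h)
    return best[1]
```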

Consistent estimators for the biases and covariances of the density weighted ADE are obtained by bootstrap (with 50 bootstraps) as discussed in the previous section.

In Table 1 we report the Root Mean Squared Errors (RMSE) of the various density weighted average derivatives, together with the average bias and standard deviation (theoretical (T), sample (S), and bootstrapped (B)) of the average derivatives. The first three columns present the results using the 2nd order kernel ($K_2$) for the various bandwidths, the next three columns present the results using the 4th order kernel ($K_4$) for the various bandwidths, and the final column that of the combined estimator. The RMSEs using the different pairs of kernels and bandwidths should be compared with the RMSE of the combined estimator, which optimally chooses the weights.

In all three models we see that the biases and standard deviations of the individual estimators on average behave as expected: as the bandwidth increases, the bias becomes more pronounced and the standard deviation declines. No kernel/bandwidth pair is the best in terms of RMSE among the individual ones for all the models, although $(K_4,h_{gcv})$ is best for (s,s) and (s,m) and close to the best for (m,m). The theoretical standard deviation (using the leading two components of $Var(\hat\delta_N)$ given in (5)) compares very well with the standard deviation based on the bootstrap, where we note the importance of taking the kernel/bandwidth dependent component into account to ensure this close correspondence. The sample standard deviation still reveals a disparity (smaller for (s,s) and (s,m) versus larger for (m,m)), which might be the consequence of having set the number of simulations too low.

Table 1 shows that in terms of the RMSE of the ADE the combined estimator performs better than the individual estimators in all cases.⁶

⁶ For each model, the cross-validated bandwidths for the second and fourth order kernels, for both the nonparametric regression and its derivative, were obtained from the grid search described above.

Table 1: Density weighted ADE estimators. For each of the three models ((s,s), (s,m), (m,m)), the rows report the RMSE, Bias, StdDev(T), StdDev(S) and StdDev(B) of the two components of the density weighted ADE. Columns: $(K_2,0.5h_{gcv})$, $(K_2,0.75h_{gcv})$, $(K_2,h_{gcv})$, $(K_4,0.5h_{gcv})$, $(K_4,0.75h_{gcv})$, $(K_4,h_{gcv})$, Combined.

Where there is a clearly superior individual estimator, it gets a higher weight on average, and, in agreement with the results for the combined estimator, oversmoothed individual estimators get weights of different signs, reflecting the tendency of the combined estimator to balance off the biases. Specifically, the average weights of $\left((K_2,0.75h_{gcv}),(K_2,h_{gcv})\right)$ and $\left((K_4,0.75h_{gcv}),(K_4,h_{gcv})\right)$ have opposite signs in all models; for (s,s) $(K_2,0.75h_{gcv})$ gets a relatively large weight, for (s,m) so does $(K_4,h_{gcv})$, while for (m,m) $(K_2,0.5h_{gcv})$ gets a large weight.⁷

In Table 2 the parameter estimates of the Tobit model are presented. Since the ADE allows for the estimation of $\beta=(\beta_1,\beta_2)^T$ up to scale, we report results for the parameter estimates of $\beta_2$, where $\beta_1$ is standardized to 1. For comparison, the Tobit MLE parameter estimates are reported as well. To ensure comparability with the semiparametric estimates, where $\beta_1$ is standardized to 1, we report $\hat\beta_2^{(t)}/\hat\beta_1^{(t)}$ for the Tobit regressions, where $\hat\beta^{(t)}$ are the Tobit parameter estimates (allowing for the estimation of an intercept $\beta_0$). Again the results are provided for each kernel/bandwidth pair as well as for the combined estimator.

When looking at Table 2, we note that superiority in estimating the ADE does not necessarily translate into better parameter estimates. If we judge performance by the RMSE, no individual estimator can be ranked as the best in all models (and none is ranked above the combined estimator in all the models). The kernel/bandwidth combination which is best for (s,s) and (m,m) is $(K_2,h_{gcv})$, compared to $(K_2,0.5h_{gcv})$ for (s,m). Even though the combined estimator is not ranked best in the RMSE sense in any of the models, its RMSE is relatively closer to the best individual estimator than to the worst individual estimator. The same conclusions can be drawn if we judge the performance on the basis of the absolute deviation of the mean of the individual estimator from the true value 1: in this case, $(K_2,0.75h_{gcv})$ is best for (s,s), compared to $(K_4,0.75h_{gcv})$ for (s,m) and $(K_4,h_{gcv})$ for (m,m); only the individual estimator $(K_4,0.5h_{gcv})$ is, with this criterion, consistently worse than the combined estimator. A loss in efficiency arising from not knowing the distribution of the disturbances occurs as expected, but is within reason; the standard deviation of the combined semiparametric estimator is less than double that of the Tobit MLE for (s,s).

⁷ On average the weights are, for (s,s), (0.34, 0.99, 0.4, 0.03, 0.47, 0.5); for (s,m), (0.30, 0.39, 0.40, 0.05, 0.60, 1.7); and for (m,m), (0.66, 1.58, 1.50, 0.07, 1.5, 1.44).

Table 2: Tobit model: single index parameter estimates ($\hat\beta_2$ with $\beta_1$ standardized to 1). For each of the three models ((s,s), (s,m), (m,m)), the rows report Mean, StdDev(T), StdDev(S) and RMSE. Columns: parametric Tobit MLE, and the semiparametric ADE based estimators $(K_2,0.5h_{gcv})$, $(K_2,0.75h_{gcv})$, $(K_2,h_{gcv})$, $(K_4,0.5h_{gcv})$, $(K_4,0.75h_{gcv})$, $(K_4,h_{gcv})$, Combined.

While the loss in efficiency arising from not knowing the distribution of the disturbances is more severe for (s,m) and (m,m), the potential gain from using the combined estimator over an incorrect kernel/bandwidth combination is greater with non-smooth densities for the explanatory variables.

5. Appendix

The proofs of Theorems 1 and 2 rely on the following Lemmas 1 and 2, correspondingly, where the moments are computed under the general assumptions of this paper. We do not use the theory of $U$-statistics in the following lemma, but obtain the moments by direct computation for symmetric as well as non-symmetric kernels.

Lemma 1. Given Assumptions 1–4, the variance of $\hat\delta_N(K,h)$ can be expressed as

$$Var(\hat\delta_N(K,h))=\Sigma_1\,N^{-2}h^{-(k+2)}+\Sigma_2\,N^{-1}+O(N^{-2}),$$

where

$$\Sigma_1=4E\left[y_i^2f(x_i)\,\Sigma(K)+\bar\Sigma(K)\,g(x_i)f(x_i)\,y_i\right]+o(1),$$
$$\Sigma_2=4\,E\left[\left(g'(x_i)f(x_i)-(y_i-g(x_i))f'(x_i)\right)\left(g'(x_i)f(x_i)-(y_i-g(x_i))f'(x_i)\right)^T\right]-4\,\delta_0\delta_0^T+o(1),$$

for

$$\Sigma(K)=\int K'(u)K'(u)^T\,du,\qquad\bar\Sigma(K)=\int K'(u)K'(-u)^T\,du$$

(under symmetry $\bar\Sigma(K)=-\Sigma(K)$).

Proof. First, recall that $Bias(\hat\delta_N(K,h))=-2E\left(A(K,h;x_i)\,y_i\right)=h^{v^*}B(K)+o(h^{v^*})$, with

$$A(K,h;x_i)=\int K(u)\left(f'(x_i-hu)-f'(x_i)\right)du.\qquad(A.1)$$

To derive an expression for the variance of $\hat\delta_N(K,h)$, we note that

$$Var(\hat\delta_N(K,h))=E\left(\hat\delta_N(K,h)\hat\delta_N(K,h)^T\right)-E\hat\delta_N(K,h)\,E\hat\delta_N(K,h)^T.$$

Let $I(a)=1$ if the expression $a$ is true, zero otherwise. We decompose the first term as follows:

$$E\left(\hat\delta_N(K,h)\hat\delta_N(K,h)^T\right)=4E\left\{\left[\frac{1}{N}\sum_{i=1}^{N}\hat f'_{(K,h)}(x_i)y_i\right]\left[\frac{1}{N}\sum_{i=1}^{N}\hat f'_{(K,h)}(x_i)y_i\right]^T\right\}$$
$$=4\left\{\frac{1}{N}E\left[\hat f'_{(K,h)}(x_i)\hat f'_{(K,h)}(x_i)^Ty_i^2\right]+\frac{N-1}{N}E\left[\hat f'_{(K,h)}(x_{i_1})\hat f'_{(K,h)}(x_{i_2})^Ty_{i_1}y_{i_2}I(i_1\neq i_2)\right]\right\}.\qquad(A.2)$$

The first expectation yields

$$E\left[\hat f'_{(K,h)}(x_i)\hat f'_{(K,h)}(x_i)^Ty_i^2\right]=\frac{1}{(N-1)^2}E\left\{E_{z_i}\left(\left[\sum_{j\neq i}\frac{1}{h^{k+1}}K'\!\left(\frac{x_i-x_j}{h}\right)\right]\left[\sum_{j\neq i}\frac{1}{h^{k+1}}K'\!\left(\frac{x_i-x_j}{h}\right)\right]^T\right)y_i^2\right\}$$
$$=\frac{1}{(N-1)h^{k+2}}E\left[y_i^2\,E_{z_i}\!\left(\frac{1}{h^{k}}K'\!\left(\frac{x_i-x_j}{h}\right)K'\!\left(\frac{x_i-x_j}{h}\right)^T\right)\right]+\frac{N-2}{N-1}E\left[y_i^2\,E_{z_i}\!\left(\frac{1}{h^{k+1}}K'\!\left(\frac{x_i-x_{j_1}}{h}\right)\right)E_{z_i}\!\left(\frac{1}{h^{k+1}}K'\!\left(\frac{x_i-x_{j_2}}{h}\right)\right)^T\right]$$
$$=\frac{1}{(N-1)h^{k+2}}E\left[y_i^2\int K'(u)K'(u)^Tf(x_i-hu)\,du\right]+\frac{N-2}{N-1}E\left[y_i^2\left(f'(x_i)+A(K,h;x_i)\right)\left(f'(x_i)+A(K,h;x_i)\right)^T\right]$$
$$=\frac{1}{(N-1)h^{k+2}}\left\{E\left[y_i^2f(x_i)\right]\Sigma(K)+O(h)\right\}+\frac{N-2}{N-1}\left\{E\left[\left(f'(x_i)y_i\right)\left(f'(x_i)y_i\right)^T\right]+O(h^{v^*})\right\},\qquad(A.3)$$

where for the third and the last equality we use a change of variables in the integration and the independence of $x_{j_1}$ and $x_{j_2}$; by Assumptions 4 and 5 the moments of the additional terms are correspondingly bounded. Further,

$$E\left[\hat f'_{(K,h)}(x_i)\hat f'_{(K,h)}(x_i)^Ty_i^2\right]=\left\{\frac{1}{(N-1)h^{k+2}}\left[E\left(y_i^2f(x_i)\right)\Sigma(K)+O(h)\right]+E\left[\left(f'(x_i)y_i\right)\left(f'(x_i)y_i\right)^T\right]+O(h^{v^*})\right\}\left\{1+O(N^{-1})\right\}.$$

The second expectation yields

$$E\left[\hat f'_{(K,h)}(x_{i_1})\hat f'_{(K,h)}(x_{i_2})^Ty_{i_1}y_{i_2}I(i_1\neq i_2)\right]=\frac{1}{(N-1)^2h^{2(k+1)}}E\left[y_{i_1}y_{i_2}\sum_{j_1\neq i_1}\sum_{j_2\neq i_2}K'\!\left(\tfrac{x_{i_1}-x_{j_1}}{h}\right)K'\!\left(\tfrac{x_{i_2}-x_{j_2}}{h}\right)^T\right]$$
$$=\frac{N-2}{(N-1)^2h^{2(k+1)}}E\left[y_{i_1}y_{i_2}K'\!\left(\tfrac{x_{i_1}-x_{j}}{h}\right)K'\!\left(\tfrac{x_{i_2}-x_{j}}{h}\right)^T\,I(j_1=j_2=j;\ j,i_1,i_2\ \text{distinct})\right]$$
$$+\frac{1}{(N-1)^2h^{2(k+1)}}E\left[y_{i_1}y_{i_2}K'\!\left(\tfrac{x_{i_1}-x_{i_2}}{h}\right)K'\!\left(\tfrac{x_{i_2}-x_{i_1}}{h}\right)^T\,I(j_1=i_2,\ j_2=i_1)\right]$$
$$+\frac{N-2}{(N-1)^2h^{2(k+1)}}E\left[y_{i_1}y_{i_2}K'\!\left(\tfrac{x_{i_1}-x_{i_2}}{h}\right)K'\!\left(\tfrac{x_{i_2}-x_{j_2}}{h}\right)^T\,I(j_1=i_2;\ j_2\neq i_1,i_2,j_1)\right]$$
$$+\frac{N-2}{(N-1)^2h^{2(k+1)}}E\left[y_{i_1}y_{i_2}K'\!\left(\tfrac{x_{i_1}-x_{j_1}}{h}\right)K'\!\left(\tfrac{x_{i_2}-x_{i_1}}{h}\right)^T\,I(j_2=i_1;\ j_1\neq i_1,i_2,j_2)\right]$$
$$+\frac{(N-2)(N-3)}{(N-1)^2h^{2(k+1)}}E\left[y_{i_1}y_{i_2}K'\!\left(\tfrac{x_{i_1}-x_{j_1}}{h}\right)K'\!\left(\tfrac{x_{i_2}-x_{j_2}}{h}\right)^T\,I(j_1,j_2,i_1,i_2\ \text{pairwise distinct})\right].$$

Using the law of iterated expectations, we rewrite

$$E\left[\hat f'_{(K,h)}(x_{i_1})\hat f'_{(K,h)}(x_{i_2})^Ty_{i_1}y_{i_2}I(i_1\neq i_2)\right]\qquad(A.4)$$
$$=\frac{N-2}{(N-1)^2}E\left[E_{z_j}\!\left(y_{i_1}\tfrac{1}{h^{k+1}}K'\!\left(\tfrac{x_{i_1}-x_j}{h}\right)\right)E_{z_j}\!\left(y_{i_2}\tfrac{1}{h^{k+1}}K'\!\left(\tfrac{x_{i_2}-x_j}{h}\right)\right)^T\right]$$
$$+\frac{1}{(N-1)^2h^{k+2}}E\left[y_{i_1}E_{z_{i_1}}\!\left(y_{i_2}\tfrac{1}{h^{k}}K'\!\left(\tfrac{x_{i_1}-x_{i_2}}{h}\right)K'\!\left(\tfrac{x_{i_2}-x_{i_1}}{h}\right)^T\right)\right]$$
$$+\frac{N-2}{(N-1)^2}E\left[E_{z_{i_2}}\!\left(y_{i_1}\tfrac{1}{h^{k+1}}K'\!\left(\tfrac{x_{i_1}-x_{i_2}}{h}\right)\right)\,y_{i_2}\,E_{z_{i_2}}\!\left(\tfrac{1}{h^{k+1}}K'\!\left(\tfrac{x_{i_2}-x_{j_2}}{h}\right)\right)^T\right]$$
$$+\frac{N-2}{(N-1)^2}E\left[E_{z_{i_1}}\!\left(\tfrac{1}{h^{k+1}}K'\!\left(\tfrac{x_{i_1}-x_{j_1}}{h}\right)\right)\,y_{i_1}\,E_{z_{i_1}}\!\left(y_{i_2}\tfrac{1}{h^{k+1}}K'\!\left(\tfrac{x_{i_2}-x_{i_1}}{h}\right)\right)^T\right]$$
$$+\frac{(N-2)(N-3)}{(N-1)^2}E\left[E_{z_{i_1}}\!\left(y_{i_1}\tfrac{1}{h^{k+1}}K'\!\left(\tfrac{x_{i_1}-x_{j_1}}{h}\right)\right)\right]E\left[E_{z_{i_2}}\!\left(y_{i_2}\tfrac{1}{h^{k+1}}K'\!\left(\tfrac{x_{i_2}-x_{j_2}}{h}\right)\right)\right]^T,$$

where for brevity we omit the indicator $I(i_1\neq i_2)$ in the terms of the expression. Details of the derivation follow next. Denote

$$A(K,h;x_i)=E_{z_i}\left(\hat f'_{(K,h)}(x_i)\right)-f'(x_i)=\int K(u)\left(f'(x_i-hu)-f'(x_i)\right)du,$$
$$B(K,h;x_i)=\int K'(u)K'(u)^T\left(f(x_i-hu)-f(x_i)\right)du,$$
$$C(K,h;x_i)=-\int K(u)\left[(gf)'(x_i+hu)-(gf)'(x_i)\right]du,$$
$$D(K,h;x_i)=\int K'(u)K'(-u)^T\left[(gf)(x_i+hu)-(gf)(x_i)\right]du,$$
$$c(x_i)=-(gf)'(x_i),\qquad d(K,x_i)=\bar\Sigma(K)\,(gf)(x_i),$$
$$\Sigma(K)=\int K'(u)K'(u)^T\,du,\qquad\bar\Sigma(K)=\int K'(u)K'(-u)^T\,du$$

(under symmetry $\bar\Sigma(K)=-\Sigma(K)$). Then write, for the terms in (A.4), first

$$E_{z_i}\left[\frac{1}{h^{k+1}}K'\!\left(\frac{x_i-x_j}{h}\right)y_i\right]=f'(x_i)y_i+A(K,h;x_i)y_i.$$

The remaining conditional moments are

$$E_{z_j}\left[\frac{1}{h^{k+1}}K'\!\left(\frac{x_i-x_j}{h}\right)y_i\right]=c(x_j)+C(K,h;x_j),\qquad(A.5)$$
$$E_{z_i}\left[\frac{1}{h^{k}}K'\!\left(\frac{x_j-x_i}{h}\right)K'\!\left(\frac{x_i-x_j}{h}\right)^Ty_j\right]=d(K,x_i)+D(K,h;x_i).\qquad(A.6)$$

Indeed, for (A.5),

$$E_{z_j}\left[\frac{1}{h^{k+1}}K'\!\left(\frac{x_i-x_j}{h}\right)y_i\right]=\frac{1}{h^{k+1}}\int K'\!\left(\frac{x-x_j}{h}\right)(gf)(x)\,dx=\frac{1}{h}\int K'(u)\,(gf)(x_j+hu)\,du$$
$$=-(gf)'(x_j)-\int K(u)\left[(gf)'(x_j+hu)-(gf)'(x_j)\right]du=c(x_j)+C(K,h;x_j)\qquad\text{(integration by parts)}.$$
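The kernel constants and the symmetry claim $\bar\Sigma(K)=-\Sigma(K)$ can be checked numerically. A univariate sketch (quartic kernel; our own code, with simple Riemann-sum quadrature):

```python
import numpy as np

def kernel_constants(kprime, m=20001):
    # Quadrature approximations to Sigma(K) = int K'(u) K'(u) du and
    # Sigma_bar(K) = int K'(u) K'(-u) du over the support [-1, 1] (k = 1 case).
    u = np.linspace(-1.0, 1.0, m)
    du = u[1] - u[0]
    g, g_neg = kprime(u), kprime(-u)
    return np.sum(g * g) * du, np.sum(g * g_neg) * du

# Quartic kernel K(u) = (15/16)(1 - u^2)^2 on [-1, 1]; K is symmetric, so
# K'(-u) = -K'(u) and the two constants come out as (c, -c) for some c > 0.
quartic_prime = lambda u: (15.0 / 16.0) * (-4.0) * u * (1.0 - u ** 2)
sigma, sigma_bar = kernel_constants(quartic_prime)   # sigma_bar == -sigma
```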

For (A.6),

$$E_{z_i}\left[\frac{1}{h^{k}}K'\!\left(\frac{x_j-x_i}{h}\right)K'\!\left(\frac{x_i-x_j}{h}\right)^Ty_j\right]=\frac{1}{h^{k}}\int g(x)\,K'\!\left(\frac{x-x_i}{h}\right)K'\!\left(\frac{x_i-x}{h}\right)^Tf(x)\,dx\qquad\text{(c.o.v. } \tfrac{x-x_i}{h}=u\text{)}$$
$$=\int K'(u)K'(-u)^T(gf)(x_i+hu)\,du=\int K'(u)K'(-u)^T(gf)(x_i)\,du+\int K'(u)K'(-u)^T\left[(gf)(x_i+hu)-(gf)(x_i)\right]du$$
$$=d(K,x_i)+D(K,h;x_i).$$

It is useful to note here that

$$E\left[E_{z_i}\!\left(\frac{1}{h^{k+1}}K'\!\left(\frac{x_i-x_j}{h}\right)y_i\right)\right]=E\left[E_{z_j}\!\left(\frac{1}{h^{k+1}}K'\!\left(\frac{x_i-x_j}{h}\right)y_i\right)\right],$$

i.e. $E\left[f'(x_i)y_i+A(K,h;x_i)y_i\right]=E\left[c(x_j)+C(K,h;x_j)\right]$; indeed, it can easily be verified that $E(f'(x_i)y_i)=E(c(x_j))$.

Using (A.1), (A.5) and (A.6) we can express (A.4) as

$$E\left[\hat f'_{(K,h)}(x_{i_1})\hat f'_{(K,h)}(x_{i_2})^Ty_{i_1}y_{i_2}\right]=\frac{N-2}{(N-1)^2}E\left[\left(c(x_i)+C(K,h;x_i)\right)\left(c(x_i)+C(K,h;x_i)\right)^T\right]$$
$$+\frac{1}{(N-1)^2h^{k+2}}E\left[d(K,x_i)y_i+D(K,h;x_i)y_i\right]$$
$$+\frac{N-2}{(N-1)^2}E\left[\left(c(x_i)+C(K,h;x_i)\right)\left(f'(x_i)y_i+A(K,h;x_i)y_i\right)^T\right]$$
$$+\frac{N-2}{(N-1)^2}E\left[\left(f'(x_i)y_i+A(K,h;x_i)y_i\right)\left(c(x_i)+C(K,h;x_i)\right)^T\right]$$
$$+\frac{(N-2)(N-3)}{(N-1)^2}E\left[f'(x_i)y_i+A(K,h;x_i)y_i\right]E\left[f'(x_i)y_i+A(K,h;x_i)y_i\right]^T.\qquad(A.7)$$

Combining (A.2), (A.3) and (A.7) yields

$$E\left[\hat\delta_N(K,h)\hat\delta_N(K,h)^T\right]=\frac{4}{N(N-1)h^{k+2}}E\left[y_i^2f(x_i)\Sigma(K)+B(K,h;x_i)y_i^2+d(K,x_i)y_i+D(K,h;x_i)y_i\right]$$
$$+\frac{4(N-2)}{N(N-1)}E\left[\left(f'(x_i)y_i+A(K,h;x_i)y_i\right)\left(f'(x_i)y_i+A(K,h;x_i)y_i\right)^T\right]$$
$$+\frac{4(N-2)}{N(N-1)}E\left[\left(c(x_i)+C(K,h;x_i)\right)\left(c(x_i)+C(K,h;x_i)\right)^T\right]$$
$$+\frac{4(N-2)}{N(N-1)}E\left[\left(c(x_i)+C(K,h;x_i)\right)\left(f'(x_i)y_i+A(K,h;x_i)y_i\right)^T\right]$$
$$+\frac{4(N-2)}{N(N-1)}E\left[\left(f'(x_i)y_i+A(K,h;x_i)y_i\right)\left(c(x_i)+C(K,h;x_i)\right)^T\right]$$
$$+\frac{(N-2)(N-3)}{N(N-1)}E\hat\delta_N(K,h)\,E\hat\delta_N(K,h)^T.$$

The final expression (using Assumptions 3–5 repeatedly to show convergence to zero of the expectations of the terms involving the quantities denoted by capitals) is

$$E\left[\hat\delta_N(K,h)\hat\delta_N(K,h)^T\right]=\frac{4}{N^2h^{k+2}}E\left[y_i^2f(x_i)\Sigma(K)+y_i(gf)(x_i)\bar\Sigma(K)\right]+o(\cdot)$$
$$+\frac{4}{N}E\left[y_i^2f'(x_i)f'(x_i)^T+(gf)'(x_i)(gf)'(x_i)^T-y_i(gf)'(x_i)f'(x_i)^T-y_if'(x_i)(gf)'(x_i)^T+o(1)\right]$$
$$+\frac{(N-2)(N-3)}{N(N-1)}E\hat\delta_N(K,h)\,E\hat\delta_N(K,h)^T.$$

Alternatively, we can write the variance expression in the form given in the statement of the Lemma.

Remark 1. For $N\,Var(\hat\delta_N(K,h))$ to converge, we require $Nh^{k+2}\asymp O(1)$ or $Nh^{k+2}\to\infty$. Notice that indeed, given $Nh^{k+2}\to\infty$ (regardless of whether we assume the kernel to be symmetric),

$$N\,Var(\hat\delta_N(K,h))\to 4\left\{E\left[c(x_i)c(x_i)^T\right]+E\left[\left(f'(x_i)c(x_i)^T+c(x_i)f'(x_i)^T\right)y_i\right]+E\left[y_i^2f'(x_i)f'(x_i)^T\right]\right\}-4\,\delta_0\delta_0^T$$
$$=4\,E\left[\left(g'(x_i)f(x_i)-(y_i-g(x_i))f'(x_i)\right)\left(g'(x_i)f(x_i)-(y_i-g(x_i))f'(x_i)\right)^T\right]-4\,\delta_0\delta_0^T=\Sigma_2,$$

as in PSS (1989).

Proof of Theorem 1. Three main situations have to be dealt with in the proof. From Lemma 1 it follows that the variance has two leading parts: one that converges at a parametric rate, $O(N^{-1})$, requiring $Nh^{k+2}\to\infty$; when this condition on the rate of the bandwidth does not hold, the variance converges at the rate $O(N^{-2}h^{-(k+2)})$. The bias converges at the rate $O(h^{v^*})$.

The first situation arises when the rate of the bias dominates the rates of both leading terms in the variance: cases (a)v (correspondingly in (b)) and (c)iii. By standard arguments this situation clearly results in convergence in probability to $B(K)$, as stated in the Theorem.

The second situation refers to the parametric rate of the variance dominating (with or without bias). For this case Theorem 3.3 in PSS applies. Since the proof in PSS is based on the theory of $U$-statistics, we make the additional assumption of symmetry of the kernel function (see the comment in Serfling (1980, p. 7), to which PSS refer in footnote 7, regarding symmetrization; it is not actually clear to me how this will help in the proof for a non-symmetric kernel).

The third situation is when the condition $Nh^{k+2}\to\infty$ is violated; note that if the degree of smoothness $v^*<\frac{k+2}{2}$, this condition, regardless of the kernel order, could hold only in the case when the bias dominates. The possibility $Nh^{k+2}\to 0$ was not examined in the literature previously. We thus need to provide the proof of asymptotic normality for cases (a)i (correspondingly (b)) and (c)i. Consider $Nh^{k+2}\to 0$.

Sketch of proof. We shall say that $x_i$, $x_j$ are close if $|x_i-x_j|<h$; here $|w|$ indicates the maximum of the absolute values of the components of the vector $w$. In the sample $\{x_1,\dots,x_N\}$, denote by $A_s$ the set $\{x_i\mid\text{exactly } s-1 \text{ other } x_j \text{ with } j>i \text{ are close to } x_i\}$. Then $A_1$ is the set of "isolated" $x_i$ that do not have any other close sample points, $A_2$ is the set of points with exactly one close point for a given $h$, etc. Clearly, $\cup_{s=1}^{N}A_s$ represents a partition of the sample.

Step 1 of the proof. We show that a small enough $h$ results in the probability measure of $\cup_{s=3}^{N}A_s$ going to zero fast enough; this implies that most of the non-zero contribution to $\hat\delta$ comes from $A_2$ (since $A_1$ does not add non-zero terms).

Step 2. Consider $A_2$. The contribution from the $x$'s in this set to $\hat\delta$ reduces to the sum (recall the symmetry of the kernel)

$$-\frac{2}{N(N-1)}\sum_{x_i\in A_2}\frac{1}{h^{k+1}}K'\!\left(\frac{x_i-x_j}{h}\right)(y_i-y_j).$$

Since, in view of the result in Step 1, the $x_j$ that is close to $x_i$ is, with high probability, in $A_2$, we consider

$$\hat\delta_{A_2}=-\frac{2}{N(N-1)}\sum_{\substack{x_{i},x_{j}\in A_2\\ i=1,\dots,N;\ j=i+1,\dots,N}}\frac{1}{h^{k+1}}K'\!\left(\frac{x_i-x_j}{h}\right)(y_i-y_j).\qquad(A.8)$$

The terms in (A.8) are i.i.d. (note that where a pair $x_i$, $x_j$ is not in $A_2$, the contribution to the sum is zero). The second moments of these terms were derived in Lemma 1. (Note that for cross-products only the terms of the form

$$\frac{(N-2)(N-3)}{(N-1)^2h^{2(k+1)}}E\left[y_{i_1}y_{i_2}K'\!\left(\tfrac{x_{i_1}-x_{j_1}}{h}\right)K'\!\left(\tfrac{x_{i_2}-x_{j_2}}{h}\right)^TI(j_1\neq j_2\neq i_1\neq i_2)\right]$$

are relevant, since in $A_2$ the terms are independent, or something of that sort, so that the variance will reflect the rate.) To be continued...

Lemma 2. Given Assumptions 1–4, the variance of $\bar\delta_N$ can be represented as

$$Var(\bar\delta_N)=\sum_{t_1,s_1}\sum_{t_2,s_2}a_{t_1,s_1}a_{t_2,s_2}\,Cov\!\left(\hat\delta_N(K_{t_1},h_{s_1}),\hat\delta_N(K_{t_2},h_{s_2})\right)\approx\sum a_{j_1}a_{j_2}\,\Sigma_{j_1j_2}.$$


More information

Consider a function f we ll specify which assumptions we need to make about it in a minute. Let us reformulate the integral. 1 f(x) dx.

Consider a function f we ll specify which assumptions we need to make about it in a minute. Let us reformulate the integral. 1 f(x) dx. Capter 2 Integrals as sums and derivatives as differences We now switc to te simplest metods for integrating or differentiating a function from its function samples. A careful study of Taylor expansions

More information

IEOR 165 Lecture 10 Distribution Estimation

IEOR 165 Lecture 10 Distribution Estimation IEOR 165 Lecture 10 Distribution Estimation 1 Motivating Problem Consider a situation were we ave iid data x i from some unknown distribution. One problem of interest is estimating te distribution tat

More information

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Statistica Sinica 24 2014, 395-414 doi:ttp://dx.doi.org/10.5705/ss.2012.064 EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Jun Sao 1,2 and Seng Wang 3 1 East Cina Normal University,

More information

Order of Accuracy. ũ h u Ch p, (1)

Order of Accuracy. ũ h u Ch p, (1) Order of Accuracy 1 Terminology We consider a numerical approximation of an exact value u. Te approximation depends on a small parameter, wic can be for instance te grid size or time step in a numerical

More information

Homework 1 Due: Wednesday, September 28, 2016

Homework 1 Due: Wednesday, September 28, 2016 0-704 Information Processing and Learning Fall 06 Homework Due: Wednesday, September 8, 06 Notes: For positive integers k, [k] := {,..., k} denotes te set of te first k positive integers. Wen p and Y q

More information

2.8 The Derivative as a Function

2.8 The Derivative as a Function .8 Te Derivative as a Function Typically, we can find te derivative of a function f at many points of its domain: Definition. Suppose tat f is a function wic is differentiable at every point of an open

More information

Simple Estimators for Monotone Index Models

Simple Estimators for Monotone Index Models Simple Estimators for Monotone Index Models Hyungtaik Ahn Dongguk University, Hidehiko Ichimura University College London, James L. Powell University of California, Berkeley (powell@econ.berkeley.edu)

More information

Bandwidth Selection in Nonparametric Kernel Testing

Bandwidth Selection in Nonparametric Kernel Testing Te University of Adelaide Scool of Economics Researc Paper No. 2009-0 January 2009 Bandwidt Selection in Nonparametric ernel Testing Jiti Gao and Irene Gijbels Bandwidt Selection in Nonparametric ernel

More information

Logistic Kernel Estimator and Bandwidth Selection. for Density Function

Logistic Kernel Estimator and Bandwidth Selection. for Density Function International Journal of Contemporary Matematical Sciences Vol. 13, 2018, no. 6, 279-286 HIKARI Ltd, www.m-ikari.com ttps://doi.org/10.12988/ijcms.2018.81133 Logistic Kernel Estimator and Bandwidt Selection

More information

MVT and Rolle s Theorem

MVT and Rolle s Theorem AP Calculus CHAPTER 4 WORKSHEET APPLICATIONS OF DIFFERENTIATION MVT and Rolle s Teorem Name Seat # Date UNLESS INDICATED, DO NOT USE YOUR CALCULATOR FOR ANY OF THESE QUESTIONS In problems 1 and, state

More information

Fast Exact Univariate Kernel Density Estimation

Fast Exact Univariate Kernel Density Estimation Fast Exact Univariate Kernel Density Estimation David P. Hofmeyr Department of Statistics and Actuarial Science, Stellenbosc University arxiv:1806.00690v2 [stat.co] 12 Jul 2018 July 13, 2018 Abstract Tis

More information

The derivative function

The derivative function Roberto s Notes on Differential Calculus Capter : Definition of derivative Section Te derivative function Wat you need to know already: f is at a point on its grap and ow to compute it. Wat te derivative

More information

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4.

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4. December 09, 20 Calculus PracticeTest s Name: (4 points) Find te absolute extrema of f(x) = x 3 0 on te interval [0, 4] Te derivative of f(x) is f (x) = 3x 2, wic is zero only at x = 0 Tus we only need

More information

SECTION 3.2: DERIVATIVE FUNCTIONS and DIFFERENTIABILITY

SECTION 3.2: DERIVATIVE FUNCTIONS and DIFFERENTIABILITY (Section 3.2: Derivative Functions and Differentiability) 3.2.1 SECTION 3.2: DERIVATIVE FUNCTIONS and DIFFERENTIABILITY LEARNING OBJECTIVES Know, understand, and apply te Limit Definition of te Derivative

More information

Volume 29, Issue 3. Existence of competitive equilibrium in economies with multi-member households

Volume 29, Issue 3. Existence of competitive equilibrium in economies with multi-member households Volume 29, Issue 3 Existence of competitive equilibrium in economies wit multi-member ouseolds Noriisa Sato Graduate Scool of Economics, Waseda University Abstract Tis paper focuses on te existence of

More information

New Distribution Theory for the Estimation of Structural Break Point in Mean

New Distribution Theory for the Estimation of Structural Break Point in Mean New Distribution Teory for te Estimation of Structural Break Point in Mean Liang Jiang Singapore Management University Xiaou Wang Te Cinese University of Hong Kong Jun Yu Singapore Management University

More information

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics 1

Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics 1 Model Specification Testing in Nonparametric and Semiparametric Time Series Econometrics 1 By Jiti Gao 2 and Maxwell King 3 Abstract We propose a simultaneous model specification procedure for te conditional

More information

1 Calculus. 1.1 Gradients and the Derivative. Q f(x+h) f(x)

1 Calculus. 1.1 Gradients and the Derivative. Q f(x+h) f(x) Calculus. Gradients and te Derivative Q f(x+) δy P T δx R f(x) 0 x x+ Let P (x, f(x)) and Q(x+, f(x+)) denote two points on te curve of te function y = f(x) and let R denote te point of intersection of

More information

A Simple Matching Method for Estimating Sample Selection Models Using Experimental Data

A Simple Matching Method for Estimating Sample Selection Models Using Experimental Data ANNALS OF ECONOMICS AND FINANCE 6, 155 167 (2005) A Simple Matcing Metod for Estimating Sample Selection Models Using Experimental Data Songnian Cen Te Hong Kong University of Science and Tecnology and

More information

3.4 Worksheet: Proof of the Chain Rule NAME

3.4 Worksheet: Proof of the Chain Rule NAME Mat 1170 3.4 Workseet: Proof of te Cain Rule NAME Te Cain Rule So far we are able to differentiate all types of functions. For example: polynomials, rational, root, and trigonometric functions. We are

More information

Combining functions: algebraic methods

Combining functions: algebraic methods Combining functions: algebraic metods Functions can be added, subtracted, multiplied, divided, and raised to a power, just like numbers or algebra expressions. If f(x) = x 2 and g(x) = x + 2, clearly f(x)

More information

2.3 Product and Quotient Rules

2.3 Product and Quotient Rules .3. PRODUCT AND QUOTIENT RULES 75.3 Product and Quotient Rules.3.1 Product rule Suppose tat f and g are two di erentiable functions. Ten ( g (x)) 0 = f 0 (x) g (x) + g 0 (x) See.3.5 on page 77 for a proof.

More information

INFINITE ORDER CROSS-VALIDATED LOCAL POLYNOMIAL REGRESSION. 1. Introduction

INFINITE ORDER CROSS-VALIDATED LOCAL POLYNOMIAL REGRESSION. 1. Introduction INFINITE ORDER CROSS-VALIDATED LOCAL POLYNOMIAL REGRESSION PETER G. HALL AND JEFFREY S. RACINE Abstract. Many practical problems require nonparametric estimates of regression functions, and local polynomial

More information

4. The slope of the line 2x 7y = 8 is (a) 2/7 (b) 7/2 (c) 2 (d) 2/7 (e) None of these.

4. The slope of the line 2x 7y = 8 is (a) 2/7 (b) 7/2 (c) 2 (d) 2/7 (e) None of these. Mat 11. Test Form N Fall 016 Name. Instructions. Te first eleven problems are wort points eac. Te last six problems are wort 5 points eac. For te last six problems, you must use relevant metods of algebra

More information

Boosting Kernel Density Estimates: a Bias Reduction. Technique?

Boosting Kernel Density Estimates: a Bias Reduction. Technique? Boosting Kernel Density Estimates: a Bias Reduction Tecnique? Marco Di Marzio Dipartimento di Metodi Quantitativi e Teoria Economica, Università di Cieti-Pescara, Viale Pindaro 42, 65127 Pescara, Italy

More information

A Jump-Preserving Curve Fitting Procedure Based On Local Piecewise-Linear Kernel Estimation

A Jump-Preserving Curve Fitting Procedure Based On Local Piecewise-Linear Kernel Estimation A Jump-Preserving Curve Fitting Procedure Based On Local Piecewise-Linear Kernel Estimation Peiua Qiu Scool of Statistics University of Minnesota 313 Ford Hall 224 Curc St SE Minneapolis, MN 55455 Abstract

More information

Polynomial Functions. Linear Functions. Precalculus: Linear and Quadratic Functions

Polynomial Functions. Linear Functions. Precalculus: Linear and Quadratic Functions Concepts: definition of polynomial functions, linear functions tree representations), transformation of y = x to get y = mx + b, quadratic functions axis of symmetry, vertex, x-intercepts), transformations

More information

INTRODUCTION TO CALCULUS LIMITS

INTRODUCTION TO CALCULUS LIMITS Calculus can be divided into two ke areas: INTRODUCTION TO CALCULUS Differential Calculus dealing wit its, rates of cange, tangents and normals to curves, curve sketcing, and applications to maima and

More information

SECTION 1.10: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES

SECTION 1.10: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES (Section.0: Difference Quotients).0. SECTION.0: DIFFERENCE QUOTIENTS LEARNING OBJECTIVES Define average rate of cange (and average velocity) algebraically and grapically. Be able to identify, construct,

More information

Simple and Powerful GMM Over-identi cation Tests with Accurate Size

Simple and Powerful GMM Over-identi cation Tests with Accurate Size Simple and Powerful GMM Over-identi cation ests wit Accurate Size Yixiao Sun and Min Seong Kim Department of Economics, University of California, San Diego is version: August, 2 Abstract e paper provides

More information

Poisson Equation in Sobolev Spaces

Poisson Equation in Sobolev Spaces Poisson Equation in Sobolev Spaces OcMountain Dayligt Time. 6, 011 Today we discuss te Poisson equation in Sobolev spaces. It s existence, uniqueness, and regularity. Weak Solution. u = f in, u = g on

More information

LECTURE 14 NUMERICAL INTEGRATION. Find

LECTURE 14 NUMERICAL INTEGRATION. Find LECTURE 14 NUMERCAL NTEGRATON Find b a fxdx or b a vx ux fx ydy dx Often integration is required. However te form of fx may be suc tat analytical integration would be very difficult or impossible. Use

More information

The Complexity of Computing the MCD-Estimator

The Complexity of Computing the MCD-Estimator Te Complexity of Computing te MCD-Estimator Torsten Bernolt Lerstul Informatik 2 Universität Dortmund, Germany torstenbernolt@uni-dortmundde Paul Fiscer IMM, Danisc Tecnical University Kongens Lyngby,

More information

Integral Calculus, dealing with areas and volumes, and approximate areas under and between curves.

Integral Calculus, dealing with areas and volumes, and approximate areas under and between curves. Calculus can be divided into two ke areas: Differential Calculus dealing wit its, rates of cange, tangents and normals to curves, curve sketcing, and applications to maima and minima problems Integral

More information

1. State whether the function is an exponential growth or exponential decay, and describe its end behaviour using limits.

1. State whether the function is an exponential growth or exponential decay, and describe its end behaviour using limits. Questions 1. State weter te function is an exponential growt or exponential decay, and describe its end beaviour using its. (a) f(x) = 3 2x (b) f(x) = 0.5 x (c) f(x) = e (d) f(x) = ( ) x 1 4 2. Matc te

More information

MA455 Manifolds Solutions 1 May 2008

MA455 Manifolds Solutions 1 May 2008 MA455 Manifolds Solutions 1 May 2008 1. (i) Given real numbers a < b, find a diffeomorpism (a, b) R. Solution: For example first map (a, b) to (0, π/2) and ten map (0, π/2) diffeomorpically to R using

More information

Differential Calculus (The basics) Prepared by Mr. C. Hull

Differential Calculus (The basics) Prepared by Mr. C. Hull Differential Calculus Te basics) A : Limits In tis work on limits, we will deal only wit functions i.e. tose relationsips in wic an input variable ) defines a unique output variable y). Wen we work wit

More information

Section 2: The Derivative Definition of the Derivative

Section 2: The Derivative Definition of the Derivative Capter 2 Te Derivative Applied Calculus 80 Section 2: Te Derivative Definition of te Derivative Suppose we drop a tomato from te top of a 00 foot building and time its fall. Time (sec) Heigt (ft) 0.0 00

More information

5.1 We will begin this section with the definition of a rational expression. We

5.1 We will begin this section with the definition of a rational expression. We Basic Properties and Reducing to Lowest Terms 5.1 We will begin tis section wit te definition of a rational epression. We will ten state te two basic properties associated wit rational epressions and go

More information

Symmetry Labeling of Molecular Energies

Symmetry Labeling of Molecular Energies Capter 7. Symmetry Labeling of Molecular Energies Notes: Most of te material presented in tis capter is taken from Bunker and Jensen 1998, Cap. 6, and Bunker and Jensen 2005, Cap. 7. 7.1 Hamiltonian Symmetry

More information

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist Mat 1120 Calculus Test 2. October 18, 2001 Your name Te multiple coice problems count 4 points eac. In te multiple coice section, circle te correct coice (or coices). You must sow your work on te oter

More information

Math 2921, spring, 2004 Notes, Part 3. April 2 version, changes from March 31 version starting on page 27.. Maps and di erential equations

Math 2921, spring, 2004 Notes, Part 3. April 2 version, changes from March 31 version starting on page 27.. Maps and di erential equations Mat 9, spring, 4 Notes, Part 3. April version, canges from Marc 3 version starting on page 7.. Maps and di erential equations Horsesoe maps and di erential equations Tere are two main tecniques for detecting

More information

Precalculus Test 2 Practice Questions Page 1. Note: You can expect other types of questions on the test than the ones presented here!

Precalculus Test 2 Practice Questions Page 1. Note: You can expect other types of questions on the test than the ones presented here! Precalculus Test 2 Practice Questions Page Note: You can expect oter types of questions on te test tan te ones presented ere! Questions Example. Find te vertex of te quadratic f(x) = 4x 2 x. Example 2.

More information

Exam 1 Review Solutions

Exam 1 Review Solutions Exam Review Solutions Please also review te old quizzes, and be sure tat you understand te omework problems. General notes: () Always give an algebraic reason for your answer (graps are not sufficient),

More information

Math 1210 Midterm 1 January 31st, 2014

Math 1210 Midterm 1 January 31st, 2014 Mat 110 Midterm 1 January 1st, 01 Tis exam consists of sections, A and B. Section A is conceptual, wereas section B is more computational. Te value of every question is indicated at te beginning of it.

More information

Continuity and Differentiability Worksheet

Continuity and Differentiability Worksheet Continuity and Differentiability Workseet (Be sure tat you can also do te grapical eercises from te tet- Tese were not included below! Typical problems are like problems -3, p. 6; -3, p. 7; 33-34, p. 7;

More information

On Local Linear Regression Estimation of Finite Population Totals in Model Based Surveys

On Local Linear Regression Estimation of Finite Population Totals in Model Based Surveys American Journal of Teoretical and Applied Statistics 2018; 7(3): 92-101 ttp://www.sciencepublisinggroup.com/j/ajtas doi: 10.11648/j.ajtas.20180703.11 ISSN: 2326-8999 (Print); ISSN: 2326-9006 (Online)

More information

Fast optimal bandwidth selection for kernel density estimation

Fast optimal bandwidth selection for kernel density estimation Fast optimal bandwidt selection for kernel density estimation Vikas Candrakant Raykar and Ramani Duraiswami Dept of computer science and UMIACS, University of Maryland, CollegePark {vikas,ramani}@csumdedu

More information

Bootstrap prediction intervals for Markov processes

Bootstrap prediction intervals for Markov processes arxiv: arxiv:0000.0000 Bootstrap prediction intervals for Markov processes Li Pan and Dimitris N. Politis Li Pan Department of Matematics University of California San Diego La Jolla, CA 92093-0112, USA

More information

MC3: Econometric Theory and Methods. Course Notes 4

MC3: Econometric Theory and Methods. Course Notes 4 University College London Department of Economics M.Sc. in Economics MC3: Econometric Theory and Methods Course Notes 4 Notes on maximum likelihood methods Andrew Chesher 25/0/2005 Course Notes 4, Andrew

More information

MATH1151 Calculus Test S1 v2a

MATH1151 Calculus Test S1 v2a MATH5 Calculus Test 8 S va January 8, 5 Tese solutions were written and typed up by Brendan Trin Please be etical wit tis resource It is for te use of MatSOC members, so do not repost it on oter forums

More information

5.1 introduction problem : Given a function f(x), find a polynomial approximation p n (x).

5.1 introduction problem : Given a function f(x), find a polynomial approximation p n (x). capter 5 : polynomial approximation and interpolation 5 introduction problem : Given a function f(x), find a polynomial approximation p n (x) Z b Z application : f(x)dx b p n(x)dx, a a one solution : Te

More information

DEPARTMENT MATHEMATIK SCHWERPUNKT MATHEMATISCHE STATISTIK UND STOCHASTISCHE PROZESSE

DEPARTMENT MATHEMATIK SCHWERPUNKT MATHEMATISCHE STATISTIK UND STOCHASTISCHE PROZESSE U N I V E R S I T Ä T H A M B U R G A note on residual-based empirical likeliood kernel density estimation Birte Musal and Natalie Neumeyer Preprint No. 2010-05 May 2010 DEPARTMENT MATHEMATIK SCHWERPUNKT

More information

Section 3: The Derivative Definition of the Derivative

Section 3: The Derivative Definition of the Derivative Capter 2 Te Derivative Business Calculus 85 Section 3: Te Derivative Definition of te Derivative Returning to te tangent slope problem from te first section, let's look at te problem of finding te slope

More information

THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION MODULE 5

THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION MODULE 5 THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION NEW MODULAR SCHEME introduced from te examinations in 009 MODULE 5 SOLUTIONS FOR SPECIMEN PAPER B THE QUESTIONS ARE CONTAINED IN A SEPARATE FILE

More information

EFFICIENT REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLING

EFFICIENT REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLING Statistica Sinica 13(2003), 641-653 EFFICIENT REPLICATION VARIANCE ESTIMATION FOR TWO-PHASE SAMPLING J. K. Kim and R. R. Sitter Hankuk University of Foreign Studies and Simon Fraser University Abstract:

More information

A New Diagnostic Test for Cross Section Independence in Nonparametric Panel Data Model

A New Diagnostic Test for Cross Section Independence in Nonparametric Panel Data Model e University of Adelaide Scool of Economics Researc Paper No. 2009-6 October 2009 A New Diagnostic est for Cross Section Independence in Nonparametric Panel Data Model Jia Cen, Jiti Gao and Degui Li e

More information

Lecture 21. Numerical differentiation. f ( x+h) f ( x) h h

Lecture 21. Numerical differentiation. f ( x+h) f ( x) h h Lecture Numerical differentiation Introduction We can analytically calculate te derivative of any elementary function, so tere migt seem to be no motivation for calculating derivatives numerically. However

More information

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems 5 Ordinary Differential Equations: Finite Difference Metods for Boundary Problems Read sections 10.1, 10.2, 10.4 Review questions 10.1 10.4, 10.8 10.9, 10.13 5.1 Introduction In te previous capters we

More information

Pre-Calculus Review Preemptive Strike

Pre-Calculus Review Preemptive Strike Pre-Calculus Review Preemptive Strike Attaced are some notes and one assignment wit tree parts. Tese are due on te day tat we start te pre-calculus review. I strongly suggest reading troug te notes torougly

More information

POLYNOMIAL AND SPLINE ESTIMATORS OF THE DISTRIBUTION FUNCTION WITH PRESCRIBED ACCURACY

POLYNOMIAL AND SPLINE ESTIMATORS OF THE DISTRIBUTION FUNCTION WITH PRESCRIBED ACCURACY APPLICATIONES MATHEMATICAE 36, (29), pp. 2 Zbigniew Ciesielski (Sopot) Ryszard Zieliński (Warszawa) POLYNOMIAL AND SPLINE ESTIMATORS OF THE DISTRIBUTION FUNCTION WITH PRESCRIBED ACCURACY Abstract. Dvoretzky

More information

LIMITS AND DERIVATIVES CONDITIONS FOR THE EXISTENCE OF A LIMIT

LIMITS AND DERIVATIVES CONDITIONS FOR THE EXISTENCE OF A LIMIT LIMITS AND DERIVATIVES Te limit of a function is defined as te value of y tat te curve approaces, as x approaces a particular value. Te limit of f (x) as x approaces a is written as f (x) approaces, as

More information

Average Rate of Change

Average Rate of Change Te Derivative Tis can be tougt of as an attempt to draw a parallel (pysically and metaporically) between a line and a curve, applying te concept of slope to someting tat isn't actually straigt. Te slope

More information

Function Composition and Chain Rules

Function Composition and Chain Rules Function Composition and s James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 8, 2017 Outline 1 Function Composition and Continuity 2 Function

More information

estimate results from a recursive sceme tat generalizes te algoritms of Efron (967), Turnbull (976) and Li et al (997) by kernel smooting te data at e

estimate results from a recursive sceme tat generalizes te algoritms of Efron (967), Turnbull (976) and Li et al (997) by kernel smooting te data at e A kernel density estimate for interval censored data Tierry Ducesne and James E Staord y Abstract In tis paper we propose a kernel density estimate for interval-censored data It retains te simplicity andintuitive

More information

1 Proving the Fundamental Theorem of Statistical Learning

1 Proving the Fundamental Theorem of Statistical Learning THEORETICAL MACHINE LEARNING COS 5 LECTURE #7 APRIL 5, 6 LECTURER: ELAD HAZAN NAME: FERMI MA ANDDANIEL SUO oving te Fundaental Teore of Statistical Learning In tis section, we prove te following: Teore.

More information

Taylor Series and the Mean Value Theorem of Derivatives

Taylor Series and the Mean Value Theorem of Derivatives 1 - Taylor Series and te Mean Value Teorem o Derivatives Te numerical solution o engineering and scientiic problems described by matematical models oten requires solving dierential equations. Dierential

More information