Adaptive Estimation of the Regression Discontinuity Model


Yixiao Sun
Department of Economics
University of California, San Diego
La Jolla, CA

February 2005

Abstract

In order to reduce the finite sample bias and improve the rate of convergence, local polynomial estimators have been introduced into the econometric literature to estimate the regression discontinuity model. In this paper, we show that, when the degree of smoothness is known, the local polynomial estimator achieves the optimal rate of convergence within the Hölder smoothness class. However, when the degree of smoothness is not known, the local polynomial estimator may actually inflate the finite sample bias and reduce the rate of convergence. We propose an adaptive version of the local polynomial estimator that selects both the bandwidth and the polynomial order adaptively, and show that the adaptive estimator achieves the optimal rate of convergence up to a logarithmic factor without knowing the degree of smoothness. Simulation results show that the finite sample performance of the locally cross-validated adaptive estimator is robust across parameter combinations and data generating processes, reflecting the adaptive nature of the estimator. The root mean squared error of the adaptive estimator compares favorably with that of local polynomial estimators in the Monte Carlo experiments.

Keywords: adaptive estimator, local cross validation, local polynomial, minimax rate, optimal bandwidth, optimal smoothness parameter

JEL Classification Numbers: C13, C14

1 Introduction

In this paper, we consider the regression discontinuity model:
$$y = m(x) + \alpha d + \varepsilon, \qquad (1)$$
where $m(x)$ is a continuous function of $x$, $d = 1\{x \ge x_0\}$, and $E(\varepsilon \mid x, d) = 0$. Such a model has been used in the empirical literature to identify the treatment effect when there is a discontinuity in the treatment assignment. A partial list of examples includes Angrist and Lavy (1999), Black (1999), Battistin and Rettore (2002), Van der Klaauw (2002), DiNardo and Lee (2004), and Chay and Greenstone (2005). Given the iid data $\{x_i, y_i\}_{i=1}^n$, our objective is to develop a good estimator of $\alpha$, the treatment effect at a known cut-off point $x_0$. In order to maintain generality of the response pattern, we do not impose a specific functional form on $m(x)$. Instead, we take $m(x)$ to belong to a family that is characterized by regularity conditions near the cut-off point. This is a semiparametric approach to estimating the regression discontinuity model.

Semiparametric estimation of the regression discontinuity model is closely related to the estimation of a conditional expectation at a boundary point. In both settings, the widely used Nadaraya-Watson (NW) estimator has a large finite sample bias and a slow rate of convergence. To reduce the finite sample bias and improve the rate of convergence, Hahn, Todd and Van der Klaauw (2001) and Porter (2003) propose using a linear function or a polynomial to approximate $m(x)$ in a small neighborhood of the cut-off point. Porter (2003) obtains the optimal rate of convergence using Stone's (1980) criterion and shows that the local polynomial estimator achieves the optimal rate when the degree of smoothness of $m(x)$ is known. In this paper, we show that the local polynomial estimator with the asymptotic-MSE-optimal bandwidth may actually inflate the finite sample bias and reduce the rate of convergence when the degree of smoothness of $m(x)$ is not known. In particular, this will happen if the order of the local polynomial is too large relative to the degree of smoothness.
Hence, a drawback of the local polynomial estimator is that the optimal rate of convergence cannot be achieved, because the optimal bandwidth depends on the unknown degree of smoothness. This calls for an estimator that adapts to the unknown smoothness. We require the estimator to be adaptive not just at a fixed model, but also at a sequence of models near it. The adaptive rate refers not just to pointwise convergence, but rather to convergence uniformly over models that are very close to some particular model of interest.

The problem of adaptive estimation of a nonparametric function from noisy data has been studied in a number of papers, including Lepski (1990, 1991, 1992), Donoho and Johnstone (1995), Birgé and Massart (1997) and the references cited therein. Various approaches have been proposed, among which Lepski's method has been widely used in the statistical literature; see, for example, Lepski and Spokoiny (1997), Lepski, Mammen and Spokoiny (1997) and Spokoiny (2000). These papers study adaptive bandwidth choice in local constant or linear regression for estimating the drift function in a Gaussian white noise model or a nonparametric diffusion model. More specifically, Lepski and Spokoiny (1997) work with the Gaussian white noise model and consider pointwise estimation using a kernel method with the Hölder smoothness class, assuming that the order of smoothness is less than 2. Lepski, Mammen and Spokoiny (1997) extend the pointwise estimation to global estimation using a higher-order kernel method with the Besov class. In addition, Lepski's method has been used in several papers on semiparametric estimation of long memory in the time series literature, including Giraitis, Robinson and Samarov (2000), Hurvich, Moulines and Soulier (2002), Iouditsky, Moulines and Soulier (2002), Andrews and Sun (2004) and Guggenberger and Sun (2004).

In this paper, we use Lepski's method to construct a rate-adaptive estimator of the regression discontinuity model. In doing so, we extend Lepski's method in several important ways. First, we consider local polynomial estimators instead of kernel estimators. The estimation of the regression discontinuity model is similar to the estimation of a conditional expectation at the boundary, and it is well known that local polynomial estimators have some optimality properties for the boundary estimation problem. Second, a direct application of Lepski's approach to the present framework involves using a polynomial of a pre-specified order and comparing local polynomial estimators with different bandwidths. More specifically, one has to first choose the order of the polynomial to be larger than the upper bound $\bar{s}$ of the smoothness parameter. Such a strategy is not optimal.
If the underlying smoothness parameter $s$ is less than $\bar{s}$, then it is better to use a polynomial of order $\lfloor s \rfloor$, the largest integer strictly smaller than $s$. Using a polynomial of a higher order will only inflate the asymptotic variance without the benefit of bias reduction. In contrast, our adaptive method chooses both the bandwidth and the order of the polynomial adaptively. The polynomial chosen by the adaptive estimator is indeed of order $\lfloor s \rfloor$. Third, our adaptive rule does not use the lower and upper bounds for $s$, while the adaptive rule in Lepski (1990) uses them explicitly. In consequence, the rate of convergence of our adaptive estimator can be arbitrarily close to the parametric rate in the infinitely smooth case, while that of Lepski's estimator is capped by the upper bound $\bar{s}$. This advantage of our adaptive estimator is partly due to the use of the zero-one loss rather than the squared-error loss. Results for the zero-one loss are sufficient to obtain the optimal rate of convergence, which is the item of greatest interest here. Finally, one drawback of Lepski's approach is that there are constants in the adaptive procedure that are arbitrary. This is true for other adaptive procedures as well, although some procedures may fix their constants at certain ad hoc values and seemingly remove the need to choose any constant. In this paper, we propose using local cross validation to select the constants and provide a practical strategy to implement the adaptive estimator.

We compare the root mean-squared error (RMSE) performance of the adaptive estimator with that of the local constant, local linear, local quadratic and local cubic estimators. We consider three groups of models with different response functions $m(x)$. In the first group, $m(x)$ is the sum of a third-order polynomial and a term containing $(x - x_0)^{s_0}$ for some non-integer $s_0$. Response functions in this group are designed to have finite smoothness $s_0$; by choosing different values of $s_0$, we obtain response functions with different degrees of smoothness. The second group is the same as the first group except that $m(x)$ is perturbed by an additive sine function so that the response function has a finer structure. For the third group, we take $m(x)$ to be a constant, linear, quadratic or cubic function. This group is designed to give each of the local polynomial estimators its best advantage. The Monte Carlo results show that the RMSE performance of the adaptive estimator is very robust to the data generating process, reflecting its adaptive nature. Its RMSE is either the lowest or among the three lowest for all the parameter combinations and data generating processes considered. In contrast, a local polynomial estimator may perform very well in some scenarios but disastrously in others. The best estimator in an overall sense is the adaptive estimator.
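The first-group design can be sketched in a few lines of Python. This is a minimal illustration only: the cutoff, polynomial coefficients, noise level and the value of $s_0$ below are placeholders, not the parameter values used in the paper's experiments.

```python
import numpy as np

def simulate_rd(n, alpha=1.0, s0=2.5, x0=0.0, sigma=0.5, rng=None):
    """Simulate one draw from a first-group-style design: m(x) is a cubic
    polynomial plus a |x - x0|**s0 term, so m has finite smoothness s0 at
    the cutoff.  All coefficients are illustrative placeholders."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(x0 - 1.0, x0 + 1.0, size=n)
    d = (x >= x0).astype(float)                      # treatment indicator 1{x >= x0}
    m = (1.0 + 0.5 * (x - x0) - 0.3 * (x - x0) ** 2
         + 0.1 * (x - x0) ** 3 + 0.8 * np.abs(x - x0) ** s0)
    y = m + alpha * d + sigma * rng.standard_normal(n)
    return x, y, d

x, y, d = simulate_rd(500, rng=0)
```

Varying `s0` over non-integer values then traces out designs with different degrees of smoothness at the cutoff, which is the dimension along which the estimators are compared.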
The rest of the paper is organized as follows. Section 2 reviews the local polynomial estimator and examines its asymptotic properties when the order of the polynomial is larger than the underlying smoothness. Section 3 establishes the optimal rate of convergence within the Hölder smoothness class and shows that the local polynomial estimator achieves the optimal rate when the degree of smoothness is known. Section 4 introduces the adaptive local polynomial estimator. It is shown that the adaptive estimator achieves, up to a logarithmic factor, the optimal rate for known smoothness even when the smoothness is not known. For a given response function $m(x)$, it is also shown that the adaptive procedure provides a consistent estimator of the smoothness index defined in that section. The subsequent section contains the simulation results that compare the finite sample performance of the adaptive estimator with those of the local polynomial estimators. Proofs and additional

technical results are given in the Appendix.

Throughout the paper, $1\{\cdot\}$ is the indicator function and $\|\cdot\|$ signifies the Euclidean norm. $C$ is a generic constant that may differ across different lines.

2 Local Polynomial Estimation

Consider the regression discontinuity model $y = m(x) + \alpha d + \varepsilon$, where $m(x)$ is an unknown function of $x$, $E(\varepsilon \mid x, d) = 0$ and $d = 1\{x \ge x_0\}$. Given the iid data $(x_i, y_i)$, $i = 1, 2, \dots, n$, our objective is to estimate $\alpha$ without assuming the functional form of $m(\cdot)$. However, it is necessary to assume that $m(x)$ belongs to some smoothness class.

Definition 1 Let $s = \ell + \gamma$ where $\ell$ is the largest integer strictly less than $s$ and $\gamma \in (0, 1]$. If a function defined on the interval $[x_0, x_0 + \Delta)$, for some fixed $\Delta > 0$, is $\ell$ times differentiable with
$$\sup_{x \in [x_0, x_0 + \Delta)} |m^{(j)}(x)| \le K \quad \text{for } j = 0, 1, 2, 3, \dots, \ell$$
and
$$|m^{(\ell)}(x_1) - m^{(\ell)}(x_2)| \le K |x_1 - x_2|^{\gamma} \quad \text{for } x_1, x_2 \in [x_0, x_0 + \Delta),$$
where $m^{(j)}(x)$ is the $j$-th order derivative and $m^{(j)}(x_0)$ is the $j$-th order right-hand derivative at $x_0$, then we say $m(x)$ is smooth of order $s$ on $[x_0, x_0 + \Delta)$. Denote this class of functions by $M^+(s, \gamma, K)$. Similarly, we define $M^-(s, \gamma, K)$ as the class of functions that satisfy the above two conditions with $[x_0, x_0 + \Delta)$ replaced by $(x_0 - \Delta, x_0]$ and $m^{(j)}(x_0)$ being the left-hand derivative at $x_0$.

Assumption 1: $m(x) \in M(s, \gamma, K)$ where
$$M(s, \gamma, K) := \{m : m \in M^+(s, \gamma, K) \cap M^-(s, \gamma, K) \cap C(x_0 - \Delta, x_0 + \Delta)\}$$
and $C(x_0 - \Delta, x_0 + \Delta)$ is the set of continuous functions on $(x_0 - \Delta, x_0 + \Delta)$.

Assumption 1 allows us to develop an $\ell$-term Taylor expansion of $m(x)$ on each side of $x_0$. Without loss of generality, we focus on $x \ge x_0$, in which case we have
$$m(x) = m(x_0) + \sum_{j=1}^{\ell} b_j^+ (x - x_0)^j + \tilde{e}^+(x), \qquad (2)$$
where $b_j^+ = \frac{1}{j!} \frac{d^j m(x)}{dx^j}\big|_{x = x_0^+}$ is the (normalized) $j$-th order right-hand derivative of $m(x)$ at $x_0$ and
$$\tilde{e}^+(x) = \frac{1}{\ell!}\left[m^{(\ell)}(\tilde{x}) - m^{(\ell)}(x_0)\right](x - x_0)^{\ell} \qquad (3)$$

for some $\tilde{x}$ between $x_0$ and $x$. Under Assumption 1, $\tilde{e}^+(x)$ satisfies
$$|\tilde{e}^+(x)| \le K (\ell!)^{-1} |x - x_0|^{s} \quad \text{for all } x \in [x_0, x_0 + \Delta). \qquad (4)$$
We break up the Taylor expansion into the part that will be captured by the local polynomial regression and the remainder:
$$m(x) = m(x_0) + \sum_{j=1}^{\min(r, \ell)} b_j^+ (x - x_0)^j + R^+(x), \quad x \ge x_0, \qquad (5)$$
where
$$R^+(x) = \sum_{j=\min(r,\ell)+1}^{\ell} b_j^+ (x - x_0)^j + \tilde{e}^+(x) := 1\{\ell \ge r+1\}\, b_{r+1}^+ (x - x_0)^{r+1} + e^+(x), \qquad (6)$$
$$e^+(x) / (x - x_0)^q = O(1) \text{ uniformly over } x \in [x_0, x_0 + \Delta), \qquad (7)$$
and $q = \min\{s, r+2\}$.

Let $b^+(r)$ denote the column $r$-vector whose $j$-th element is $b_j^+$ for $j = 1, 2, \dots, \min(r, \ell)$ and $0$ for $j = \min(r, \ell) + 1, \dots, r$. Let $z_{ir} = (1, (x_i - x_0), \dots, (x_i - x_0)^r)$ be the row $(r+1)$-vector, $\theta_r^+ = (c^+, (b^+(r))')'$ and $c^+ = \alpha + m(x_0)$. Then for $x_i \ge x_0$, we have
$$y_i = z_{ir} \theta_r^+ + R^+(x_i) + \varepsilon_i. \qquad (8)$$
To estimate $\theta_r^+$, we minimize
$$\sum_{i=1}^{n} k_h(x_i - x_0)\, d_i\, (y_i - z_{ir} \theta_r)^2 \qquad (9)$$
with respect to $\theta_r$, where $d_i = 1\{x_i \ge x_0\}$, $k_h(x_i - x_0) = h^{-1} k((x_i - x_0)/h)$ and $h$ is the bandwidth parameter. Let $Y^+$ and $Z_r^+$ be the data matrices that collect the values of $y_i$ and $z_{ir}$, respectively, for the observations with $x_i \ge x_0$. Then (8) can be written in vector form:
$$Y^+ = Z_r^+ \theta_r^+ + R^+ + \varepsilon^+ \qquad (10)$$
and the objective function in (9) becomes
$$(Y^+ - Z_r^+ \theta_r)' W^+ (Y^+ - Z_r^+ \theta_r) \qquad (11)$$
where $W^+ = \mathrm{diag}(\{h k_h(x_i - x_0)\})_{x_i \ge x_0}$. Minimizing the preceding quantity gives
$$\hat{\theta}_r^+ = (\hat{c}_r^+, (\hat{b}^+(r))')' = (Z_r^{+\prime} W^+ Z_r^+)^{-1} (Z_r^{+\prime} W^+ Y^+). \qquad (12)$$

Defining $Y^-$, $Z_r^-$ and $W^-$ analogously using the observations satisfying $x_i < x_0$, we have
$$Y^- = Z_r^- \theta_r^- + R^- + \varepsilon^- \qquad (13)$$
where $\theta_r^- = (c^-, (b^-(r))')'$, $c^- = m(x_0)$ and $b^-(r)$ is similarly defined but with the right-hand derivatives replaced by the left-hand derivatives. Minimizing $(Y^- - Z_r^- \theta_r)' W^- (Y^- - Z_r^- \theta_r)$ with respect to $\theta_r$ gives an estimate of $\theta_r^-$:
$$\hat{\theta}_r^- = (\hat{c}_r^-, (\hat{b}^-(r))')' = (Z_r^{-\prime} W^- Z_r^-)^{-1} (Z_r^{-\prime} W^- Y^-). \qquad (14)$$
The difference between $\hat{c}_r^+$ and $\hat{c}_r^-$ gives an estimate of $\alpha$:
$$\hat{\alpha}_r = \hat{c}_r^+ - \hat{c}_r^-. \qquad (15)$$

To investigate the asymptotic properties of $\hat{\alpha}_r$, we maintain the following two additional assumptions.

Assumption 2: (a) $E(\varepsilon \mid x, d) = 0$. (b) $\sigma^2(x) = E(\varepsilon^2 \mid x)$ is continuous for $x \ne x_0$, and the right- and left-hand limits exist at $x_0$. (c) For some $\delta > 0$, $E(|\varepsilon|^{2+\delta} \mid x)$ is uniformly bounded on $[x_0 - \Delta, x_0 + \Delta]$. (d) The marginal density $f(x)$ of $x$ is continuous on $[x_0 - \Delta, x_0 + \Delta]$.

Assumption 3: The kernel $k(\cdot)$ is even, bounded and has bounded support.

Theorem 1 Let Assumptions 1-3 hold. If $n \to \infty$ and $h \to 0$ such that $nh \to \infty$, then
$$\sqrt{nh}\,(\hat{\alpha}_r - \alpha) - B \Rightarrow N(0, \omega^2 \sigma_r^2)$$
where
$$B = 1\{s > r+1\}\, e_1' \Gamma_r^{-1} \lambda_r \left[b_{r+1}^+ - (-1)^{r+1} b_{r+1}^-\right] \sqrt{nh}\, h^{r+1} (1 + o(1)) + O(h^q \sqrt{nh}),$$
$$\omega^2 = \frac{\sigma_+^2(x_0) + \sigma_-^2(x_0)}{f(x_0)}, \qquad \sigma_r^2 = e_1' \Gamma_r^{-1} V_r \Gamma_r^{-1} e_1,$$
$$\Gamma_r = (\mu_{i+j-2})_{(r+1)\times(r+1)} = \begin{pmatrix} \mu_0 & \cdots & \mu_r \\ \vdots & \ddots & \vdots \\ \mu_r & \cdots & \mu_{2r} \end{pmatrix}, \qquad V_r = (\nu_{i+j-2})_{(r+1)\times(r+1)} = \begin{pmatrix} \nu_0 & \cdots & \nu_r \\ \vdots & \ddots & \vdots \\ \nu_r & \cdots & \nu_{2r} \end{pmatrix},$$
$e_1 = (1, 0, \dots, 0)'$, $\lambda_r = (\mu_{r+1}, \dots, \mu_{2r+1})'$, $\mu_j = \int k(u) u^j\, du$ and $\nu_j = \int k^2(u) u^j\, du$.
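The two one-sided fits in (12) and (14) and the difference (15) are plain kernel-weighted least squares and can be implemented directly. The sketch below is ours, not the paper's code; it fixes the Bartlett kernel as an illustrative choice, and the function names are hypothetical.

```python
import numpy as np

def bartlett(u):
    """Bartlett (triangular) kernel k(u) = (1 - |u|) 1{|u| <= 1}."""
    return np.clip(1.0 - np.abs(u), 0.0, None)

def local_poly_intercept(x, y, x0, h, r, side):
    """One-sided kernel-weighted polynomial regression of order r around x0.
    Returns the fitted intercept: c-hat-plus (side=+1) or c-hat-minus (side=-1)."""
    mask = (x >= x0) if side > 0 else (x < x0)
    xs, ys = x[mask], y[mask]
    w = bartlett((xs - x0) / h)
    keep = w > 0                                   # drop observations outside the window
    xs, ys, w = xs[keep], ys[keep], w[keep]
    Z = np.vander(xs - x0, N=r + 1, increasing=True)   # rows z_ir = (1, x-x0, ..., (x-x0)^r)
    WZ = Z * w[:, None]
    coef = np.linalg.solve(Z.T @ WZ, WZ.T @ ys)        # (Z'WZ)^{-1} Z'Wy
    return coef[0]

def rd_estimate(x, y, x0, h, r):
    """alpha-hat_r = c-hat-plus minus c-hat-minus, as in (15)."""
    return (local_poly_intercept(x, y, x0, h, r, +1)
            - local_poly_intercept(x, y, x0, h, r, -1))
```

On data with a jump of size 2 at $x_0 = 0$ and a smooth $m$, `rd_estimate(x, y, 0.0, h, 1)` recovers the jump up to the bias and noise orders given in Theorem 1.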

Remarks

1. When $s > r+1$, Theorem 1 is the same as Theorem 3(a) in Porter (2003). The proof is straightforward and uses part of Porter's result.

2. If $s > r+1$, the asymptotic bias of $\hat{\alpha}_r$, defined as $B/\sqrt{nh}$, is of order $h^{r+1}$. In contrast, the asymptotic bias of $\hat{\alpha}_0$ is of order $h$. The asymptotic bias of $\hat{\alpha}_r$ for $r \ge 1$ is therefore smaller than that of $\hat{\alpha}_0$ by an order of magnitude, provided that $m(x)$ is smooth of order $s > r+1$.

3. If $s > r+1$, then the asymptotic MSE of $\hat{\alpha}_r$ is
$$\mathrm{AMSE}(\hat{\alpha}_r) = C_1 h^{2r+2} + \frac{C_2}{nh}. \qquad (16)$$
Assuming $C_1 > 0$ and $C_2 > 0$, minimizing $\mathrm{AMSE}(\hat{\alpha}_r)$ over $h$ gives the AMSE-optimal choice of $h$:
$$h^* = \left(\frac{C_2}{(2r+2) C_1}\right)^{1/(2r+3)} n^{-1/(2r+3)}. \qquad (17)$$
For this AMSE-optimal choice of $h$, $\mathrm{AMSE}(\hat{\alpha}_r)$ is proportional to
$$\left(e_1' \Gamma_r^{-1} \lambda_r \lambda_r' \Gamma_r^{-1} e_1\right)^{1/(2r+3)} \left(e_1' \Gamma_r^{-1} V_r \Gamma_r^{-1} e_1\right)^{2(r+1)/(2r+3)} n^{-2(r+1)/(2r+3)}. \qquad (18)$$
So $\hat{\alpha}_r$ converges to $\alpha$ at the rate of $n^{-(r+1)/(2r+3)}$. In particular, $\hat{\alpha}_0$ converges to $\alpha$ at the rate of $n^{-1/3}$. As a consequence, by appropriate choice of $h$, one obtains asymptotic normality of $\hat{\alpha}_r$ with a faster rate of convergence (as a function of the sample size $n$) than is possible with $\hat{\alpha}_0$.

4. When $s > r+1$ and $h = h^*$, the asymptotic mean squared error depends on the kernel only through the quantity
$$T(k) = \left(e_1' \Gamma_r^{-1} \lambda_r \lambda_r' \Gamma_r^{-1} e_1\right) \left(e_1' \Gamma_r^{-1} V_r \Gamma_r^{-1} e_1\right)^{2(r+1)}. \qquad (19)$$
This quantity is the same as the criterion defined in equation (7) in Cheng, Fan and Marron (1997, p. 695). Using their proof without change, we can show that the kernel that minimizes $T(k)$ over the class of kernels
$$\mathcal{K} = \left\{k(x) : k(x) \ge 0,\ \int k(x)\,dx = 1,\ |k(x) - k(y)| \le C|x - y| \text{ for some } C > 0\right\}$$
is simply the Bartlett kernel $k(x) = (1 - |x|)\,1\{|x| \le 1\}$, for all $r$. This is an unusual result because the optimal kernel does not depend on the order of the local polynomial.
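For the Bartlett kernel, the moments entering Theorem 1 and (17)-(19) have closed forms, so the variance and bias constants can be checked numerically. The sketch below uses one-sided moments over $[0, 1]$, the relevant range for boundary estimation when the kernel has support $[-1, 1]$; that integration range is our assumption, since the theorem statement leaves the limits implicit.

```python
import numpy as np

def bartlett_moments(r):
    """One-sided Bartlett-kernel moments over [0, 1] (an assumption here):
    mu_j = int_0^1 (1-u) u^j du = 1/((j+1)(j+2)),
    nu_j = int_0^1 (1-u)^2 u^j du = 2/((j+1)(j+2)(j+3))."""
    mu = np.array([1.0 / ((j + 1) * (j + 2)) for j in range(2 * r + 2)])
    nu = np.array([2.0 / ((j + 1) * (j + 2) * (j + 3)) for j in range(2 * r + 1)])
    return mu, nu

def _gamma_e1(r):
    mu, nu = bartlett_moments(r)
    G = np.array([[mu[i + j] for j in range(r + 1)] for i in range(r + 1)])
    V = np.array([[nu[i + j] for j in range(r + 1)] for i in range(r + 1)])
    e1 = np.zeros(r + 1); e1[0] = 1.0
    return mu, G, V, e1

def variance_constant(r):
    """sigma^2_r = e1' Gamma_r^{-1} V_r Gamma_r^{-1} e1 from Theorem 1."""
    _, G, V, e1 = _gamma_e1(r)
    Ginv = np.linalg.inv(G)
    return e1 @ Ginv @ V @ Ginv @ e1

def bias_constant(r):
    """e1' Gamma_r^{-1} lambda_r, the kernel factor in the bias term B."""
    mu, G, _, e1 = _gamma_e1(r)
    lam = mu[r + 1: 2 * r + 2]
    return e1 @ np.linalg.inv(G) @ lam
```

With this convention, $\sigma_0^2 = 4/3$ and $\sigma_1^2 = 4.8$: the variance constant grows with the polynomial order, which is the cost of the bias reduction discussed in Remark 2.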

5. Consider the case where $s \le r+1$ and $h$ is proportional to the AMSE-optimal rate $n^{-1/(2r+3)}$. For such a configuration, the asymptotic bias dominates the asymptotic variance. The estimator $\hat{\alpha}_r$ converges to the true $\alpha$ at the rate of $n^{-s/(2r+3)}$: the larger $r$ is, the slower the rate of convergence. For example, when $2r + 3 > 3s$, the rate of convergence is slower than $n^{-1/3}$, the rate obtainable with the Nadaraya-Watson estimator. By fitting a high order polynomial, it is possible that we inflate the boundary effect instead of reducing it.

Theorem 1 shows that local polynomial estimation has the potential to reduce the boundary bias problem and deliver a faster rate of convergence when the response function is smooth enough. In the next section, we establish the optimal rate of convergence when the degree of smoothness is known. It is shown that the local polynomial estimator with an appropriately chosen bandwidth achieves this optimal rate.

3 Optimal Rate of Convergence

To obtain the optimal rate of convergence, we cast the regression discontinuity model into the following general framework. Suppose $\mathcal{P}$ is a family of probability models on some fixed measurable space $(\Omega, \mathcal{A})$. Let $T$ be a functional defined on $\mathcal{P}$, taking values in $\mathbb{R}$. An estimator of $T$ is a measurable map $\hat{\alpha} : \Omega \to \mathbb{R}$. For a given loss function $L(\hat{\alpha}, \alpha)$, the maximum expected loss over $P \in \mathcal{P}$ is defined to be
$$R(\hat{\alpha}, \mathcal{P}) = \sup_{P \in \mathcal{P}} E_P L(\hat{\alpha}, T(P)) \qquad (20)$$
where $E_P$ is the expectation operator under the probability measure $P$. Our goal is to find an achievable lower bound for the minimax risk defined by
$$\inf_{\hat{\alpha}} R(\hat{\alpha}, \mathcal{P}) = \inf_{\hat{\alpha}} \sup_{P \in \mathcal{P}} E_P L(\hat{\alpha}, T(P)). \qquad (21)$$
If we add a subscript $n$ to $\hat{\alpha}$, $\mathcal{P}$ and $P$, where $n$ is the sample size, the achievable lower bound translates into the best rate of convergence of $R(\hat{\alpha}, \mathcal{P})$ to zero. This best rate is called the minimax rate of convergence, as it is derived from the minimax criterion. It is also commonly referred to as the optimal rate of convergence.

Now let us put the regression discontinuity model in the above general framework. Let $f(\cdot)$ be a probability density function of $x$ and $\varphi_x(\cdot)$ be a conditional density of $\varepsilon$ for a given $x$ such that $E(\varepsilon \mid x) = 0$. For both densities the dominating measures are the usual

Lebesgue measures. Define
$$\mathcal{P}(s, \gamma, K) = \Big\{P_{m,\alpha} : \frac{dP_{m,\alpha}}{d\mu} = f(x)\varphi_x(y - m(x))\,1\{x < x_0\} + f(x)\varphi_x(y - m(x) - \alpha)\,1\{x \ge x_0\},\ m(x) \in M(s, \gamma, K),\ |\alpha| \le K\Big\}$$
where $\mu$ is the Lebesgue measure on $\mathbb{R}^2$. For this family of models, the marginal distribution of $x$ and the conditional distribution of $\varepsilon$ are the same across all members. The difference among members lies in the conditional mean of $y$ for a given $x$. In other words, the function $m(\cdot)$ and the constant $\alpha$ characterize the probability model in the family $\mathcal{P}(s, \gamma, K)$. To reflect this, we use the subscripts $m, \alpha$ to differentiate the probability models in $\mathcal{P}(s, \gamma, K)$.

For the regression discontinuity model, the functional of interest is $T(P_{m,\alpha}) = \alpha$. For a given loss function $L(\cdot, \cdot)$, we want to design an estimator $\hat{\alpha}$ to minimize
$$\sup_{P_{m,\alpha} \in \mathcal{P}(s,\gamma,K)} E_{m,\alpha} L(\hat{\alpha}, \alpha) \qquad (22)$$
where $E_{m,\alpha} L(\hat{\alpha}, \alpha) := E_{P_{m,\alpha}} L(\hat{\alpha}, \alpha)$ and $E_{P_{m,\alpha}}$ is the expectation operator under $P_{m,\alpha}$. One common choice of $L(\cdot, \cdot)$ is the quadratic loss function
$$L(\hat{\alpha}, \alpha) := L_2(\hat{\alpha} - \alpha) = (\hat{\alpha} - \alpha)^2, \qquad (23)$$
in which case $R(\hat{\alpha}, \mathcal{P})$ is the maximum expected mean squared error. Another common choice is the 0-1 loss function
$$L(\hat{\alpha}, \alpha) := L_0(\hat{\alpha} - \alpha) = 1\{|\hat{\alpha} - \alpha| > \delta/2\} \qquad (24)$$
for some fixed $\delta > 0$, in which case $R(\hat{\alpha}, \mathcal{P})$ is the maximum probability that $\hat{\alpha}$ is not in the $\delta/2$-neighborhood of $\alpha$. Since the expected mean squared error may not exist for the local polynomial estimator, we use the 0-1 loss for convenience in this paper. The use of the 0-1 loss is innocuous if the optimal rate of convergence is the item of greatest interest.

The derivation of a minimax rate of convergence for an estimator involves a series of minimax calculations for different sample sizes, and there is no initial advantage in making the dependence on the sample size explicit. Consider then the problem of finding a lower bound for the minimax risk $\inf_{\hat{\alpha}} \sup_{P \in \mathcal{P}} E_P L(\hat{\alpha}, \alpha)$. The simplest method for finding such a bound is to identify an estimator with a test between simple hypotheses; the whole argument can be cast in the language of Neyman-Pearson testing.
Let $P, Q$ be probability measures defined on the same measurable space $(\Omega, \mathcal{A})$. The testing affinity (Le Cam (1986) and Donoho and Liu (1991)) of the two probability measures is defined to be
$$\pi(P, Q) = \inf_{\phi}\left(E_P \phi + E_Q(1 - \phi)\right) \qquad (25)$$

where the infimum is taken over measurable functions $\phi$ with $0 \le \phi \le 1$. In other words, $\pi(P, Q)$ is the smallest sum of type I and type II errors of any test between $P$ and $Q$. It is a natural measure of the difficulty of distinguishing $P$ and $Q$. Suppose $\mu$ is a measure dominating both $P$ and $Q$ with corresponding densities $p$ and $q$. It follows from the Neyman-Pearson lemma that the infimum is achieved by setting $\phi = 1\{p \le q\}$ and
$$\pi(P, Q) = \int 1\{p \le q\}\, p\, d\mu + \int 1\{p > q\}\, q\, d\mu = 1 - \frac{1}{2}\int |p - q|\, d\mu := 1 - \frac{1}{2}\|P - Q\|_1 \qquad (26)$$
where $\|P - Q\|_1 = \int |p - q|\, d\mu$ is the $L_1$ distance between the two probability measures.

Now consider a pair of probability models $P, Q \in \mathcal{P}$ such that $|T(P) - T(Q)| \ge \delta$. Then, for any estimator $\hat{\alpha}$,
$$1\{|\hat{\alpha} - T(P)| > \delta/2\} + 1\{|\hat{\alpha} - T(Q)| > \delta/2\} \ge 1. \qquad (27)$$
Let
$$\phi = 1\{|\hat{\alpha} - T(P)| > \delta/2\}; \qquad (28)$$
then
$$\sup_{P \in \mathcal{P}} P(|\hat{\alpha} - T(P)| > \delta/2) \ge \frac{1}{2}\left\{P(|\hat{\alpha} - T(P)| > \delta/2) + Q(|\hat{\alpha} - T(Q)| > \delta/2)\right\} \ge \frac{1}{2} E_P \phi + \frac{1}{2} E_Q(1 - \phi) \ge \frac{1}{2}\pi(P, Q). \qquad (29)$$
Therefore
$$\inf_{\hat{\alpha}} \sup_{P \in \mathcal{P}} P\{|\hat{\alpha} - T(P)| > \delta/2\} \ge \frac{1}{2}\pi(P, Q) \qquad (30)$$
for any $P$ and $Q$ such that $|T(P) - T(Q)| \ge \delta$.

Inequality (30) suggests a simple way to get a good lower bound for the minimax probability of error: search for a pair $(P, Q)$ that minimizes $\|P - Q\|_1$, and hence maximizes the affinity $\pi(P, Q)$, subject to the constraint $|T(P) - T(Q)| \ge \delta$. To obtain a lower bound with a sequence of independent observations, we let $(\Omega, \mathcal{A})$ be the product space and $\mathcal{P}$ be a family of probability models on such a space. Then for any pair of finite-product measures $P = \otimes_{i=1}^n P_i$ and $Q = \otimes_{i=1}^n Q_i$ with $|T(P) - T(Q)| \ge \delta$, the minimax risk satisfies
$$\inf_{\hat{\alpha}} \sup_{P \in \mathcal{P}} P\{|\hat{\alpha} - T(P)| > \delta/2\} \ge \frac{1}{2} - \frac{1}{4}\Big\|\otimes_{i=1}^n P_i - \otimes_{i=1}^n Q_i\Big\|_1. \qquad (31)$$

We now turn to the regression discontinuity model. Our objective is to search for two probability models $P$ and $Q$ that are difficult to distinguish by the independent observations

$(x_i, y_i)$, $i = 1, 2, \dots, n$. Note that it is not restrictive to consider only particular distributions for $\varepsilon_i$ and $x_i$ for the purpose of obtaining a lower bound: the minimax risk for a larger class of probability models can be no smaller than that for a smaller class. Therefore, if the lower bound holds for a particular distributional assumption, then it also holds for a wider class of distributions. To simplify the calculation, we assume that $\varepsilon_i$ is iid $N(0, \sigma^2)$ and $x_i$ is iid uniform on $[x_0 - \Delta, x_0 + \Delta]$ under both $P$ and $Q$. More details on the construction of $P$ and $Q$ are given in the proof of the following theorem.

Theorem 2 Let Assumption 2 hold. (a) For any finite constants $s$, $\gamma$ and $K$, we have
$$\liminf_{n \to \infty} \inf_{\hat{\alpha}} \sup_{P_{m,\alpha} \in \mathcal{P}(s,\gamma,K)} P_{m,\alpha}\left(n^{s/(2s+1)} |\hat{\alpha} - \alpha| > \delta_0\right) \ge C$$
for some positive constant $C$ and a small $\delta_0 > 0$. (b) Suppose Assumption 3 also holds. Let $h = \kappa n^{-1/(2s+1)}$ for some constant $\kappa$; then
$$\lim_{M \to \infty} \limsup_{n \to \infty} \sup_{P_{m,\alpha} \in \mathcal{P}(s,\gamma,K)} P_{m,\alpha}\left(n^{s/(2s+1)} |\hat{\alpha}_{\ell} - \alpha| > M\right) = 0.$$

Remarks

1. Part (a) of the theorem shows that there exists no estimator $\hat{\alpha}$ that converges to $\alpha$ at a rate faster than $n^{-s/(2s+1)}$ uniformly over the class of probability models $\mathcal{P}(s, \gamma, K)$. Part (b) shows that the rate $n^{-s/(2s+1)}$ is achieved by the local polynomial estimator provided that $r = \ell$ and $h$ is chosen appropriately. Because of parts (a) and (b), the rate $n^{-s/(2s+1)}$ is called the minimax optimal rate of convergence.

2. This result extends Porter (2003), who considers a class of functions that are $\ell$ times continuously differentiable. Our result is more general, as we consider the Hölder smoothness class, which is larger than the class Porter (2003) considers. Our method for calculating the lower bound for the minimax risk is also simpler than that of Stone (1980), which is adopted in Porter (2003).

3. An alternative proof of the minimax rate is to use the asymptotic equivalence of nonparametric regression models and Gaussian noise models (see Brown and Low (1996)). The Gaussian noise model is defined by $dy = S(t)\,dt + \epsilon\, dW(t)$, where $W(t)$ is the standard Brownian motion. Ibragimov and Khasminskii (1981) show that the optimal minimax rate for estimating the drift function $S(t)$ is $\epsilon^{2s/(2s+1)}$. Since $\epsilon$ in

the Gaussian noise model corresponds to $1/\sqrt{n}$ in a nonparametric regression with $n$ copies of iid data, we infer that the optimal minimax rate in the nonparametric regression is $n^{-s/(2s+1)}$. Our proof is in the spirit of Donoho and Liu (1991) and involves only elementary calculations.

4 A Rate-Adaptive Estimator

The previous section establishes the optimal rate of convergence when the degree of smoothness is known. In this section, we propose a local polynomial estimator that achieves the optimal rate of convergence up to a logarithmic factor when the degree of smoothness is not known.

Let $[\underline{s}, \bar{s}]$, for some $\underline{s} > 0$ and $\bar{s} \in [\underline{s}, \infty)$, be the range of smoothness. For each $\tau \in [\underline{s}, \bar{s}]$, we define a local polynomial estimator $\hat{\alpha}_\tau = \hat{c}_\tau^+ - \hat{c}_\tau^-$ by setting
$$h_\tau = \kappa n^{-1/(2\tau+1)} \quad \text{and} \quad r_\tau = w \text{ for } \tau \in (w, w+1],\ w = 0, 1, \dots, \qquad (32)$$
where $\kappa$ is a positive constant. Equivalently, $r_\tau$ is the largest integer that is strictly less than $\tau$. Note that the subscript on $\hat{\alpha}$, $\hat{c}^+$ and $\hat{c}^-$ indicated the order of the local polynomial in the previous sections, while it now indicates the underlying smoothness parameter that generates the bandwidth and the polynomial order given in (32).

Let $g := 1/\log n$ and let $S_g$ be the $g$-net of the interval $[\underline{s}, \infty)$:
$$S_g = \{\tau : \tau = \underline{s} + jg,\ j = 0, 1, 2, \dots\}.$$
For a positive constant $\kappa_2$, define
$$\hat{s} = \sup\left\{\tau_2 \in S_g : |\hat{\alpha}_{\tau_1} - \hat{\alpha}_{\tau_2}| \le \kappa_2 (n h_{\tau_1})^{-1/2} \chi(n) \text{ for all } \tau_1 \le \tau_2,\ \tau_1 \in S_g\right\}, \qquad (33)$$
where $\chi(n) = (\log n)(\log\log n)^{1/2}$. Intuitively, $\hat{s}$ is the largest smoothness parameter such that the associated local polynomial estimator does not differ significantly from any local polynomial estimator with a smaller smoothness parameter. Graphically, one can view the bound in the definition of $\hat{s}$ as a function of $\tau_1$; then $\hat{s}$ is the largest value of $\tau_2 \in S_g$ such that $|\hat{\alpha}_{\tau_1} - \hat{\alpha}_{\tau_2}|$ lies below the bound for all $\tau_1 \le \tau_2$, $\tau_1 \in S_g$. The calculation of $\hat{s}$ is carried out by considering successively larger $\tau_2$ values $\underline{s}, \underline{s} + g, \underline{s} + 2g, \dots$ until, for some $\tau_2$, the deviation $|\hat{\alpha}_{\tau_1} - \hat{\alpha}_{\tau_2}|$ exceeds the bound for some $\tau_1 \le \tau_2$, $\tau_1 \in S_g$. Finally, we set the adaptive estimator to be
$$\hat{\alpha}_A = \hat{\alpha}_{\hat{s}}. \qquad (34)$$
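The upward scan just described translates into a short routine. This is a sketch under stated assumptions: `alpha_hat` is a user-supplied function mapping $\tau$ to the estimate $\hat{\alpha}_\tau$ computed with $h_\tau = \kappa n^{-1/(2\tau+1)}$ and order $r_\tau$, the bound uses $(n h_{\tau_1})^{-1/2}$ for the smaller parameter as in (33), and the constants are placeholders rather than the cross-validated values the paper recommends.

```python
import numpy as np

def select_s_hat(alpha_hat, taus, n, kappa=1.0, kappa2=1.0):
    """Rule (33): scan tau_2 upward through the grid and keep the largest
    value whose estimate stays within kappa2 * (n h_{tau_1})^{-1/2} * chi(n)
    of every estimate with a smaller smoothness parameter tau_1."""
    chi = np.log(n) * np.sqrt(np.log(np.log(n)))          # chi(n) = (log n)(log log n)^{1/2}
    h = lambda t: kappa * n ** (-1.0 / (2.0 * t + 1.0))    # h_tau = kappa * n^{-1/(2 tau + 1)}
    ests = {t: alpha_hat(t) for t in taus}
    s_hat = taus[0]
    for t2 in taus:
        ok = all(abs(ests[t1] - ests[t2]) <= kappa2 * chi / np.sqrt(n * h(t1))
                 for t1 in taus if t1 <= t2)
        if not ok:
            break                                          # first violation stops the scan
        s_hat = t2
    return s_hat, ests[s_hat]
```

When the estimates are stable across the grid, the rule returns the largest grid point (the smoothest admissible model); a sudden jump in the estimates, signalling bias from oversmoothing, stops the scan just before the jump.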

The proposed adaptive procedure is based on the comparison of local polynomial estimators with different smoothness parameters from the $g$-net $S_g$. The number of relevant smoothness parameters in $S_g$ is of order $\log n$, and the resolution of the $g$-net $S_g$ is $1/\log n$. As the sample size increases, the grid $S_g$ becomes finer and finer. However, given the structure of $S_g$, it is not possible to distinguish smoothness parameters whose difference is less than $1/\log n$. This is why the proposed estimator cannot achieve the best rate of convergence $n^{-s/(2s+1)}$ for known smoothness.

To further understand the adaptive procedure, consider a function $m(\cdot) \in M(s, \gamma, K)$ with $m(\cdot) \notin M(s', \gamma, K)$ for any $s' > s$; in other words, $m(\cdot)$ is smooth to at most order $s$. For any $\tau_1 \le s$, it follows from Theorem 1 that the asymptotic bias of $\sqrt{n h_{\tau_1}}(\hat{\alpha}_{\tau_1} - \alpha)$ is
$$\mathrm{asymbias}\left(\sqrt{n h_{\tau_1}}(\hat{\alpha}_{\tau_1} - \alpha)\right) = O\left(\sqrt{n h_{\tau_1}}\, h_{\tau_1}^{\min(r_{\tau_1}+1,\, s)}\right) = O\left(n^{[\tau_1 - \min(r_{\tau_1}+1,\, s)]/(2\tau_1+1)}\right) = O(1). \qquad (35)$$
Similarly, for $\tau_1 \le \tau_2 \le s$, the asymptotic bias of $\sqrt{n h_{\tau_1}}(\hat{\alpha}_{\tau_2} - \alpha)$ is
$$\mathrm{asymbias}\left(\sqrt{n h_{\tau_1}}(\hat{\alpha}_{\tau_2} - \alpha)\right) = O\left(n^{\tau_1/(2\tau_1+1)}\, n^{-\min(r_{\tau_2}+1,\, s)/(2\tau_2+1)}\right) = O\left(n^{\tau_1/(2\tau_1+1) - \tau_2/(2\tau_2+1)}\, n^{[\tau_2 - \min(r_{\tau_2}+1,\, s)]/(2\tau_2+1)}\right) = o(1). \qquad (36)$$
Therefore, the asymptotic bias of $\sqrt{n h_{\tau_1}}\,|\hat{\alpha}_{\tau_1} - \hat{\alpha}_{\tau_2}|$ is bounded. On the other hand, $\sqrt{n h_{\tau_1}}\,|\hat{\alpha}_{\tau_1} - \hat{\alpha}_{\tau_2}|$ is no larger than
$$\sqrt{n h_{\tau_1}}\,|\hat{\alpha}_{\tau_1} - \alpha| + \sqrt{n h_{\tau_1}}\,|\hat{\alpha}_{\tau_2} - \alpha|, \qquad (37)$$
whose asymptotic variance is of order $O(1)$. As a consequence, when $\tau_1 \le \tau_2 \le s$, $\sqrt{n h_{\tau_1}}\,|\hat{\alpha}_{\tau_1} - \hat{\alpha}_{\tau_2}|$ is stochastically bounded in large samples, and $\sqrt{n h_{\tau_1}}\,|\hat{\alpha}_{\tau_1} - \hat{\alpha}_{\tau_2}| \le \kappa_2 \chi(n)$ holds with probability approaching one. This heuristic argument suggests that the probability that $\hat{s}$ is less than $s$ is small in large samples.

Next, consider $\tau_1 = s$ and $\tau_2 > s$. The asymptotic bias of $\sqrt{n h_s}(\hat{\alpha}_{\tau_2} - \alpha)$ is of order $O\left(n^{s/(2s+1)}\, n^{-s/(2\tau_2+1)}\right)$, which will be larger than $\kappa_2 \chi(n)$ in general if $\tau_2 - s$ is sufficiently large. This suggests that $\hat{s}$ cannot be too far above $s$. Rigorous arguments are given in the proofs of the next two theorems in the Appendix.

Theorem 3 Let Assumptions 2 and 3 hold. Assume that $\min_{r \in [r_{\underline{s}},\, r_{\bar{s}}]}\{\lambda_{\min}(\Gamma_r)\} > 0$, where $\lambda_{\min}(\Gamma_r)$ is the smallest eigenvalue of $\Gamma_r$. Then
$$\lim_{C \to \infty} \limsup_{n \to \infty} \sup_{s \in [\underline{s}, \bar{s}]} \sup_{P_{m,\alpha} \in \mathcal{P}(s,\gamma,K)} P_{m,\alpha}\left(n^{s/(2s+1)} \chi^{-1}(n)\, |\hat{\alpha}_A - \alpha| \ge C\right) = 0.$$

Remarks

1. Theorem 2 shows that the optimal rate of convergence for the estimation of $\alpha$ is $n^{-s/(2s+1)}$ when $s$ is finite and known. Theorem 3 shows that the adaptive estimator achieves this rate up to the logarithmic factor $\chi(n)$ when $s$ is finite and not known.

2. When $s$ is not known, the optimal rate $n^{-s/(2s+1)}$ for known smoothness cannot be achieved in general. For the Gaussian noise model and quadratic loss, Lepski (1990) shows that an extra $(\log n)^{s/(2s+1)}$ factor is needed. This result has recently been challenged by Cai and Low (2003), who show that under the 0-1 loss the achievable lower bound for unknown smoothness is the same as that for known smoothness. However, their results are obtained under the assumption that there are only finitely many different values of the smoothness parameter; this assumption does not hold for the problem at hand. As a result, the extra logarithmic factor may not be removable in general for the 0-1 loss. This extra logarithmic factor is an unavoidable price for adaptation, and most (if not all) adaptive estimators of linear functionals share this property.

3. If the function $m(x)$ is not smooth to the same order on the two sides of $x_0$, say $m(x) \in M^+(s_1, \gamma, K) \cap M^-(s_2, \gamma, K)$, then we can estimate $c^+$ and $c^-$ adaptively on each side of the cut-off point $x_0$. For a constant $\kappa_2^+ > 0$, let
$$\hat{s}^+ = \sup\left\{\tau_2 \in S_g : |\hat{c}_{\tau_1}^+ - \hat{c}_{\tau_2}^+| \le \kappa_2^+ (n h_{\tau_1})^{-1/2} \chi(n) \text{ for all } \tau_1 \le \tau_2,\ \tau_1 \in S_g\right\}$$
where $\hat{c}_\tau^+$ is the local polynomial estimator of $c^+$ when $h_\tau = \kappa^+ n^{-1/(2\tau+1)}$ and $r = r_\tau$, the largest integer strictly less than $\tau$. The adaptive estimator $\hat{c}_A^+$ of $c^+$ is given by $\hat{c}_{\hat{s}^+}^+$, and the adaptive estimator $\hat{c}_A^-$ of $c^-$ is defined analogously. Finally, the adaptive estimator of $\alpha$ is set to be $\hat{\alpha}_A = \hat{c}_A^+ - \hat{c}_A^-$. In this case, the rate of convergence of $\hat{\alpha}_A$ is easily seen to be $\chi(n)\, n^{-\min(s_1, s_2)/(2\min(s_1, s_2)+1)}$. In other words, the slower rate of convergence of $\hat{c}_A^+$ and $\hat{c}_A^-$ dictates.

4. Through $\hat{s}$, the adaptive estimator depends on several user-chosen constants, namely $\kappa$, $\kappa_2$, $\underline{s}$ and $\bar{s}$. In Section 5 we use local cross validation to choose $\kappa$ and $\kappa_2$. For the bounds $\underline{s}$ and $\bar{s}$, we suggest using $1/\log(n)$ and $\infty$, respectively.

Theorems 2 and 3 suggest that $\hat{s}$ provides a consistent estimator of $s$ if $m(x) \in M(s, \gamma, K)$. However, such an $s$ is not well defined: according to our definition of smoothness,

a function that is smooth of order $s_1$ is also smooth of order $s_2$ whenever $s_1 > s_2$. The rate-optimal polynomial order and bandwidth are increasing functions of the smoothness, and we are therefore interested in defining a class of functions with a unique smoothness index. Before defining the new function class, recall that any function $m(x) \in M(s, \gamma, K)$ admits Taylor expansions of the form
$$m(x) = m(x_0) + \sum_{j=1}^{\ell} b_j^+ (x - x_0)^j + \tilde{e}^+(x) \quad \text{for } x \ge x_0, \qquad (38)$$
$$m(x) = m(x_0) + \sum_{j=1}^{\ell} b_j^- (x - x_0)^j + \tilde{e}^-(x) \quad \text{for } x < x_0, \qquad (39)$$
with the remainder terms satisfying
$$|\tilde{e}^+(x)| \le (x - x_0)^{s} (\ell!)^{-1} K \text{ for } x \ge x_0, \qquad |\tilde{e}^-(x)| \le |x - x_0|^{s} (\ell!)^{-1} K \text{ for } x < x_0. \qquad (40)$$
Let $\tilde{e}^+ = \{\tilde{e}^+(x_i)\}_{x_i \ge x_0}$ and $\tilde{e}^- = \{\tilde{e}^-(x_i)\}_{x_i < x_0}$ be the vectors that contain the remainder terms. The following definition imposes an additional condition on $\tilde{e}^+(x)$ and $\tilde{e}^-(x)$.

Definition 4 Let $s_0 = \ell_0 + \gamma_0$ where $\ell_0$ is the largest integer strictly less than $s_0$ and $\gamma_0 \in (0, 1]$. Let $M^*(s_0, \gamma, K)$ be the class of functions satisfying:

(i) $m(x) \in M(s_0, \gamma, K)$ but $m(x) \notin M(s, \gamma, K)$ for any $s > s_0$.

(ii) Let $D_{n\ell_0} = \sqrt{nh}\,\mathrm{diag}(1, h, h^2, \dots, h^{\ell_0})$. The remainder terms $\tilde{e}^+(x)$ and $\tilde{e}^-(x)$ of the $\ell_0$-th order Taylor expansion of $m(x)$ around $x_0$ satisfy
$$\left\|(nh)^{-1/2} h^{-s_0}\left[D_{n\ell_0}^{-1} Z_{\ell_0}^{+\prime} W^+ \tilde{e}^+ - (-1)^{\ell_0+1} D_{n\ell_0}^{-1} Z_{\ell_0}^{-\prime} W^- \tilde{e}^-\right]\right\| \ge C$$
for a constant $C > 0$, with probability approaching one as $n \to \infty$ and $h \to 0$ such that $nh \to \infty$.

The first requirement in the above definition determines the maximum degree of smoothness of a function. For an infinitely differentiable function, there is no $s_0$ such that the first requirement is met; in this case, we define $s_0$ to be $\infty$. In other words, $M^*(\infty, \gamma, K)$ is the set of infinitely differentiable functions. The second requirement asks for a lower bound on the asymptotic bias of the local polynomial estimator of order $\ell_0$. These two requirements make $M^*(s_0, \gamma, K)$ the subset of $M(s_0, \gamma, K)$ that is the most difficult to estimate.
Heuristically, if m(x) ∈ M*(s*, K), then there exists no estimator θ̂ with a rate of convergence faster than n^{−2s*/(2s*+1)+ε} for any ε > 0. For a function m(x) ∈ M(s*, K) ∩ M(s, K) with s > s*, it is easy to see that the estimator θ̂_s converges to θ at the rate of n^{−2s/(2s+1)},

which is faster than the rate n^{−2s*/(2s*+1)}. To rule out this case, we impose the first requirement. On the other hand, when the first requirement is met but the asymptotic bias of θ̂_{s*} diminishes as n → ∞, possibly due to the cancellation of the asymptotic biases from the two sides, we can choose a large bandwidth without inflating the asymptotic bias and thus obtain a rate of convergence that is faster than n^{−2s*/(2s*+1)}. To rule out this case, we impose the second requirement. Sufficient conditions for the second requirement are (i) K₁|x − x₀|^{s*} ≤ |ẽ⁺(x)| ≤ K₂|x − x₀|^{s*} and K₁|x − x₀|^{s*} ≤ |ẽ⁻(x)| ≤ K₂|x − x₀|^{s*} for some K₁ > 0, K₂ > 0, and (ii) ẽ⁺(x) ≠ ẽ⁻(x) when ℓ* is odd. The following theorem shows that ŝ provides a consistent estimate of the maximal degree of smoothness.

Theorem 5. Let the assumptions of Theorem 3 hold. If m(x) ∈ M*(s*, K) with s* ≥ s̲ > 0, then

   ŝ = min(s*, s̄) + O(log log n / log n)   as n → ∞.

Remarks

1. The theorem shows that ŝ consistently estimates the maximal degree of smoothness s* when it is finite and s̲ and s̄ are appropriately chosen.

2. A direct implication of Theorem 5 is that ŝ converges to s̄ when s* ≥ s̄. As a result, when the sample size is not large in practical applications, we can set an upper bound s̄ that is relatively small. This will prevent us from using high order polynomials for small sample sizes. For example, when s̄ = 3, the adaptive procedure effectively provides a method to choose among the local constant, local linear and local quadratic estimators. In the simulation study, we choose s̄ = 4, which we feel is a reasonable choice for sample size 500.

3. The adaptive estimator θ̂_A is not necessarily asymptotically normal. At the cost of a slower rate of convergence, Theorem 5 enables us to define a new adaptive estimator that is asymptotically normal with zero asymptotic bias.
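The error term in Theorem 5 decays very slowly with the sample size, which is one way to see why a modest upper bound s̄ is advisable in practice. A quick computation (illustrative only) shows its scale:

```python
import math

# Illustration of the error term in Theorem 5:
#   s_hat = min(s*, s_bar) + O(log log n / log n).
# The bound shrinks very slowly as n grows, so for moderate samples the
# estimated smoothness can still be noticeably off its limit.
for n in [500, 10_000, 1_000_000]:
    print(n, round(math.log(math.log(n)) / math.log(n), 3))
```

Even at n = 1,000,000 the ratio is still of order 0.2, far from negligible.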
More specifically, after obtaining ŝ using the above adaptive procedure, we define

   θ̃_ŝ := θ̂(r_ŝ, h*_ŝ), where h*_s = n^{−1/(2r_s+1)}.   (41)

If s* < ∞ and s* is not an integer, Theorem 5 implies that r_ŝ = r_{s*} with probability approaching one. Thus, both r_ŝ and h*_ŝ are essentially non-random for large n. In

consequence, the adaptive estimator θ̃_ŝ is asymptotically normal:

   √(nh*_ŝ) (θ̃_ŝ − θ) →d N(0, ω² e₁′ Γ_{r_ŝ}⁻¹ V_{r_ŝ} Γ_{r_ŝ}⁻¹ e₁).   (42)

Of course, one would expect a given level of accuracy of the approximation by the normal distribution to require a larger sample size when r and h are adaptively selected than otherwise.

4. The only unknown quantity in (42) is ω² = [σ²₊(x₀) + σ²₋(x₀)]/f(x₀). The density of x at the cut-off point, f(x₀), can be estimated consistently by kernel methods. Given a consistent estimate θ̃, we define the estimated residuals by

   ε̃ᵢ = yᵢ − m̃(xᵢ) − θ̃ dᵢ,   (43)

where

   m̃(x) = [ Σ_{i=1}^n k_h(x − xᵢ)(yᵢ − θ̃ dᵢ) ] / [ Σ_{i=1}^n k_h(x − xᵢ) ].   (44)

Porter (2003) shows that, under some regularity conditions,

   σ̂²₊(x₀) = 2 Σ_{i=1}^n k_h(xᵢ − x₀) dᵢ ε̃ᵢ² / Σ_{i=1}^n k_h(xᵢ − x₀)   and
   σ̂²₋(x₀) = 2 Σ_{i=1}^n k_h(xᵢ − x₀)(1 − dᵢ) ε̃ᵢ² / Σ_{i=1}^n k_h(xᵢ − x₀)   (45)

are consistent for σ²₊(x₀) and σ²₋(x₀), respectively. Plugging σ̂²₊(x₀), σ̂²₋(x₀) and f̂(x₀) = n⁻¹ Σ_{i=1}^n k_h(xᵢ − x₀) into the definition of ω² produces a consistent estimator of it. The adaptive estimator θ̂_ŝ or θ̃_ŝ can be used to compute the estimated residuals in (43).

5 Monte Carlo Experiments

In this section, we propose a practical strategy to select the constants λ₁ and λ₂ in the adaptive procedure and provide some simulation evidence on the finite sample performance of the adaptive estimator. The empirical strategy we use is based on squared-error cross validation, which has had considerable influence on nonparametric estimation. Since our objective is to estimate the discontinuity at a certain point, we use a local version of cross validation proposed by Hall and Schucany (1989) for density estimation. For each combination of (λ₁, λ₂), we first use the adaptive rule to determine ŝ, h_ŝ, and r_ŝ. We then use the local polynomial estimator with bandwidth h_ŝ and polynomial order r_ŝ to estimate the conditional mean of yᵢ at x = xᵢ, leaving the observation (xᵢ, yᵢ) out. Denote

the estimate by ŷᵢ(λ₁, λ₂), where we have made it explicit that ŷᵢ depends on (λ₁, λ₂). Let {x⁺_{i₁}, ..., x⁺_{iₘ}} and {x⁻_{i₁}, ..., x⁻_{iₘ}} be the closest m observations that are larger and smaller than x₀, respectively. We choose λ₁ and λ₂ to minimize the local cross validation function:

   CV(λ₁, λ₂) = Σ_{k=1}^m (y⁺_{iₖ} − ŷ⁺_{iₖ}(λ₁, λ₂))² + Σ_{k=1}^m (y⁻_{iₖ} − ŷ⁻_{iₖ}(λ₁, λ₂))².   (46)

Finally, we use the cross validation choice (λ̂₁, λ̂₂) of (λ₁, λ₂) to compute the adaptive estimator, which is denoted by θ̂_A(λ̂₁, λ̂₂). In this paper, we do not provide asymptotic results for θ̂_A(λ̂₁, λ̂₂), but we do give some simple results for an estimator based on a data-dependent method that is close to (λ̂₁, λ̂₂). Let Λ = {λ⁽¹⁾, ..., λ⁽ᵁ⁾} be a finite grid of positive real numbers. Take (λ̃₁, λ̃₂) to be the closest point in Λ × Λ to (λ̂₁, λ̂₂). Let θ̂_A(λ̃₁, λ̃₂) denote the adaptive estimator based on (λ̃₁, λ̃₂). One can take the grid size of Λ to be sufficiently small that the minimum of CV(λ₁, λ₂) over (λ₁, λ₂) ∈ Λ × Λ is quite close to its minimum over R⁺ × R⁺, at least if one has knowledge of suitable lower and upper bounds for λ₁ and λ₂. The asymptotic behavior of θ̂_A(λ̃₁, λ̃₂) is relatively easy to obtain. First, Theorem 3 holds for θ̂_A(λ̃₁, λ̃₂) under Assumptions 2 and 3. The reasons are that the theorem holds for θ̂_A for each combination (λ₁, λ₂) ∈ Λ × Λ and that there are a finite number of such combinations. So θ̂_A(λ̃₁, λ̃₂) is consistent and attains the rate of convergence n^{−s/(2s+1)} up to a logarithmic factor. Second, suppose that the value (λ̂₁, λ̂₂) is not equidistant to any two points in Λ × Λ (which fails only for a set of points with Lebesgue measure zero) and assume that (λ̂₁, λ̂₂) converges to (λ₁*, λ₂*) in large samples. Let (λ₁°, λ₂°) be the point in Λ × Λ closest to (λ₁*, λ₂*). Let θ̂_A(λ₁°, λ₂°) and θ̂_A(λ₁*, λ₂*) denote the adaptive estimators based on (λ₁°, λ₂°) and (λ₁*, λ₂*), respectively. Then the asymptotic distribution of θ̂_A(λ̃₁, λ̃₂) is the same as that of θ̂_A(λ₁°, λ₂°). This holds because (λ̃₁, λ̃₂) = (λ₁°, λ₂°) with probability that goes to 1 as n → ∞ by the discreteness of Λ.
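A leave-one-out grid search of the kind behind (46) can be sketched as follows. The function names and the toy Nadaraya–Watson fit below are hypothetical stand-ins: the paper's own procedure plugs the adaptive local polynomial fit into `fit_predict`, and here the grid is over a bandwidth rather than over (λ₁, λ₂):

```python
import numpy as np

def local_cv(x, y, x0, m, fit_predict, params_grid):
    """Local cross validation in the spirit of (46): for each candidate
    parameter value, sum squared leave-one-out errors over the m observations
    closest to the cutoff on each side."""
    left = np.argsort(np.where(x < x0, x0 - x, np.inf))[:m]    # m closest below x0
    right = np.argsort(np.where(x >= x0, x - x0, np.inf))[:m]  # m closest above x0
    idx = np.concatenate([left, right])

    best, best_cv = None, np.inf
    for params in params_grid:
        cv = 0.0
        for i in idx:
            mask = np.arange(len(x)) != i          # leave observation i out
            yhat = fit_predict(x[mask], y[mask], x[i], params)
            cv += (y[i] - yhat) ** 2
        if cv < best_cv:
            best, best_cv = params, cv
    return best, best_cv

# toy usage: choose a bandwidth for a local-constant (kernel average) fit
def nw(xt, yt, xi, h):
    w = np.exp(-0.5 * ((xt - xi) / h) ** 2)
    return np.sum(w * yt) / np.sum(w)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 400)
y = np.where(x >= 0, 1.0, 0.0) + 0.1 * rng.normal(size=400)
h_best, _ = local_cv(x, y, x0=0.0, m=40, fit_predict=nw, params_grid=[0.05, 0.1, 0.5])
print(h_best)
```

With a unit jump at the cutoff, the widest bandwidth smooths across the discontinuity and is heavily penalized by the local criterion.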
After a simple modification along the lines of (41), we have

   √(nh*_ŝ) (θ̃_A(λ̃₁, λ̃₂) − θ) →d N(0, ω² e₁′ Γ_{r_ŝ}⁻¹ V_{r_ŝ} Γ_{r_ŝ}⁻¹ e₁),   (47)

where θ̃_A(λ̃₁, λ̃₂) is the same as θ̂_A(λ̃₁, λ̃₂), except that the bandwidth h_ŝ = λ̃₁ n^{−1/(2ŝ+1)} is replaced by h*_ŝ = λ̃₁ n^{−1/(2r_ŝ+1)}. The above theoretical results for θ̃_A(λ̃₁, λ̃₂) are not entirely satisfactory because they require the use of the somewhat artificial grid Λ. Nevertheless, in the absence of asymptotic results for θ̂_A(λ̂₁, λ̂₂), they should be useful. Since our cross validation algorithm is based on a grid search, we effectively use the estimator θ̂_A(λ̃₁, λ̃₂) in our simulations. In our Monte Carlo experiment, we let s̄ = 4, m = 0.1n, and Λ = {0.1, 0.5, 1, 5} to compute the adaptive estimator. To evaluate the finite sample performance of the adaptive

estimator θ̂_A(λ̃₁, λ̃₂), we compare it with the local constant, local linear, local quadratic and local cubic estimators, each of them using the locally cross-validated bandwidth. For these local polynomial estimators, we use the AMSE-optimal bandwidth h = cn^{−1/(2r+3)} and choose c over the set C = {0.1, 0.2, ..., 1} ∪ {2, 3, 4, ..., 10} via cross validation. For each estimator, the cross validation is based on the same neighborhood observations {x⁺_{i₁}, ..., x⁺_{iₘ}} and {x⁻_{i₁}, ..., x⁻_{iₘ}} and uses the grid search method. We have considered other choices of m, Λ and C, but the qualitative results are similar. We consider three groups of experiments. In the first group, the data generating process is yᵢ = m(xᵢ) + θ·1{xᵢ > x₀} + εᵢ, where θ = 1 and

   m(xᵢ) = Σ_{j=1}^3 (xᵢ − x₀)^j + λ|xᵢ − x₀|^{s*}   for xᵢ ≥ x₀,
   m(xᵢ) = Σ_{j=1}^3 (xᵢ − x₀)^j − λ|xᵢ − x₀|^{s*}   for xᵢ < x₀.   (48)

Both xᵢ and εᵢ are iid standard normal, and {xᵢ}_{i=1}^n is independent of {εᵢ}_{i=1}^n. We set x₀ = 0 without loss of generality. We consider several values for s*, namely s* = 1/2, 3/2, 5/2, 7/2, and two values for λ, namely λ = 1 and 5. Here s* characterizes the smoothness of m(x), while λ determines the importance of the not-so-smooth component in m(x). For the second group of experiments, the data generating process is the same as the one above except that a sine wave is added to m(x), leading to

   m(xᵢ) = Σ_{j=1}^3 (xᵢ − x₀)^j + 5 sin(xᵢ − x₀) + λ|xᵢ − x₀|^{s*}   for xᵢ ≥ x₀,
   m(xᵢ) = Σ_{j=1}^3 (xᵢ − x₀)^j + 5 sin(xᵢ − x₀) − λ|xᵢ − x₀|^{s*}   for xᵢ < x₀.   (49)

The response function just defined has a finer structure than that given in (48). Such a response function may not be realistic in empirical applications, but it is used to examine the finite sample performance of different estimators in worst-case situations.
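A minimal version of the first-group experiment can be sketched as below. The uniform kernel, the fixed bandwidth, and the unweighted least squares fit are simplifying assumptions made for illustration; the paper's estimators use kernel weights and cross-validated bandwidths:

```python
import numpy as np

rng = np.random.default_rng(42)

def local_poly_rd(x, y, x0, r, h):
    """Sketch of a local polynomial RD estimator: fit a polynomial of order r
    separately on each side of x0 using only observations within h of the
    cutoff (a uniform kernel, which is an assumption here), and take the
    difference of the two intercepts as the estimate of the jump theta."""
    def side_intercept(mask):
        xs, ys = x[mask] - x0, y[mask]
        Z = np.vander(xs, r + 1, increasing=True)   # columns 1, u, ..., u^r
        coef, *_ = np.linalg.lstsq(Z, ys, rcond=None)
        return coef[0]                              # fitted value at x0
    right = (x >= x0) & (x <= x0 + h)
    left = (x < x0) & (x >= x0 - h)
    return side_intercept(right) - side_intercept(left)

# data generating process (48) with s* = 3/2 and lambda = 1
n, x0, theta, s_star, lam = 500, 0.0, 1.0, 1.5, 1.0
x = rng.normal(size=n)
m = sum((x - x0) ** j for j in range(1, 4)) \
    + lam * np.sign(x - x0) * np.abs(x - x0) ** s_star
y = m + theta * (x > x0) + rng.normal(size=n)

print(local_poly_rd(x, y, x0, r=1, h=0.5))  # should be near theta = 1
```

The local linear choice r = 1 already removes the first-order slope on each side; the remaining error reflects the curvature of m and the sampling noise.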
For the last group of experiments, the data generating process is

   m(xᵢ) = Σ_{j=1}^k (xᵢ − x₀)^j,   for k = 0, 1, 2 or 3.   (50)

Since m(xᵢ) is a constant, linear, quadratic or cubic function, we expect the local constant, local linear, local quadratic and local cubic estimators to have the best finite sample performance in the respective cases k = 0, 1, 2 and 3. The motivation for considering this group is to crash test the adaptive estimator against the local polynomial estimators. For each group of Monte Carlo experiments, we compute the bias, standard deviation (SD) and root mean squared error (RMSE) of all the estimators considered. The number of replications is 1,000 and the sample size is 500. More specifically, for an estimator θ̂, the

bias, SD, and RMSE are computed according to

   bias = θ̄ − θ,   SD = [ (1/1000) Σ_{m=1}^{1000} (θ̂_m − θ̄)² ]^{1/2},   RMSE = [ bias² + SD² ]^{1/2},   (51)

where θ̄ = (1/1000) Σ_{m=1}^{1000} θ̂_m and θ̂_m is the estimate from the m-th replication. Table I presents the results for the first group of experiments. It is clear that the local constant estimator has the smallest standard deviation and the largest bias. When s* = 3/2, 5/2, 7/2, the slope of m(x) is relatively flat at x = x₀. As a result, the effect of the standard deviation outweighs that of the bias, and it is not surprising that the local constant estimator has the smallest RMSE in these cases. However, when s* = 1/2, the function m(x) becomes very steep at x = x₀. As expected, the local constant estimator then has a large upward bias and the largest RMSE. Next, for the rest of the local polynomial estimators, the absolute values of the biases are in general comparable, while the standard deviation decreases with the order of the polynomial. The latter result seems counter-intuitive at first sight. However, as the order of the polynomial increases, the cross-validated bandwidth also increases. Note that the bandwidth and the polynomial order have opposite effects on the variance of a local polynomial estimator. In finite samples, it is likely that the variance reduction from using a larger bandwidth dominates the variance inflation from using a higher order polynomial. This is the case for the first group of data generating processes we consider. Finally, the performance of the adaptive estimator is very robust to the parameter configurations. When the underlying process is not so smooth (s* = 1/2, λ = 1), the adaptive estimator has the smallest RMSE. In other cases, the RMSE of the adaptive estimator is only slightly larger than the smallest RMSE. It is important to note that the smallest RMSE is achieved by different estimators for different parameter combinations. Table II reports the results for the second group of experiments.
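The Monte Carlo summary statistics above can be sketched directly; the toy inputs below are illustrative, not the paper's simulation output:

```python
import numpy as np

def summarize(theta_hats, theta):
    """Bias, SD, and RMSE of a Monte Carlo sample of estimates, computed as
    in the definitions above: bias = mean(theta_hat) - theta, SD around the
    Monte Carlo mean, RMSE = sqrt(bias^2 + SD^2)."""
    theta_bar = np.mean(theta_hats)
    bias = theta_bar - theta
    sd = np.sqrt(np.mean((theta_hats - theta_bar) ** 2))
    rmse = np.sqrt(bias ** 2 + sd ** 2)
    return bias, sd, rmse

# toy check: estimates centered at 1.2 when theta = 1 should show bias near 0.2
rng = np.random.default_rng(0)
bias, sd, rmse = summarize(1.2 + 0.05 * rng.normal(size=100_000), theta=1.0)
print(round(bias, 2), round(sd, 2))
```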
We report only the case λ = 1, as it is representative of the case λ = 5. Due to the rapid slope changes in the response function, all estimators have much larger RMSEs than those given in Table I. While the local constant estimator has a satisfactory RMSE performance in Table I, here its RMSE performance is the poorest because of its large bias. The best estimator, according to the RMSE criterion, is the local linear estimator, whose absolute bias is the smallest among the local polynomial estimators and whose standard deviation is only slightly larger than that of the local constant estimator. Compared with the local polynomial estimators, the adaptive estimator has the smallest bias for all parameter combinations, while its variance is comparable to that of the local linear estimator. As a consequence, the RMSE performance of the adaptive estimator is quite satisfactory. Table III gives the results for the last group of experiments. As expected, when the

response function is a polynomial of order r, the local polynomial estimator of the same order has the best finite sample performance in general. An exception is the local linear estimator, whose RMSE is larger than that of the local quadratic and cubic estimators. The performance of the adaptive estimator is very encouraging. Its RMSE is either the smallest or only slightly larger than that of the estimator most suitable for the underlying data generating process. To sum up, the RMSE of the adaptive estimator is either the smallest or among the smallest. The performance of the adaptive estimator is robust to the underlying data generating process. In contrast, a local polynomial estimator may have the best performance in one scenario and a disastrous performance in others. For example, the local constant estimator performs well in the first group of experiments but poorly in the second. The local linear estimator has a satisfactory performance in the second group of experiments, but its performance is the worst in the first group. The adaptive estimator seems to be the best estimator in an overall sense.

Table I: Finite Sample Performance of Different Estimators When m(xᵢ) = Σ_{j=1}^3 (xᵢ − x₀)^j + λ|xᵢ − x₀|^{s*} sign(xᵢ − x₀)

[Bias, SD, and RMSE of the adaptive, local constant, local linear, local quadratic, and local cubic estimators for (s*, λ) = (1/2, 1), (3/2, 1), (5/2, 1), (7/2, 1), (1/2, 5), (3/2, 5), (5/2, 5), (7/2, 5); the numerical entries are not recoverable in this transcription. Superscripts 1, 2, 3 indicate the smallest, second smallest, and third smallest RMSE in each row, respectively.]

Table II: Finite Sample Performance of Different Estimators When m(xᵢ) = Σ_{j=1}^3 (xᵢ − x₀)^j + 5 sin(xᵢ − x₀) + λ|xᵢ − x₀|^{s*} sign(xᵢ − x₀)

[Bias, SD, and RMSE of the adaptive, local constant, local linear, local quadratic, and local cubic estimators for (s*, λ) = (1/2, 1), (3/2, 1), (5/2, 1), (7/2, 1); the numerical entries are not recoverable in this transcription. Superscripts 1, 2, 3 indicate the smallest, second smallest, and third smallest values in each row, respectively.]

Table III: Finite Sample Performance of Different Estimators for Different Response Functions

[Bias, SD, and RMSE of the adaptive, local constant, local linear, local quadratic, and local cubic estimators for m(x) = 0, m(x) = (x − x₀), m(x) = (x − x₀) + (x − x₀)², and m(x) = (x − x₀) + (x − x₀)² + (x − x₀)³; the numerical entries are not recoverable in this transcription.]

6 Appendix of Proofs

Proof of Theorem 1. It is easy to show that

   ĉ⁺_r − c⁺_r = (Z⁺′_r W⁺ Z⁺_r)⁻¹ Z⁺′_r W⁺ ε⁺ + (Z⁺′_r W⁺ Z⁺_r)⁻¹ Z⁺′_r W⁺ R⁺.   (A.1)

Let

   D_{nr} = nh·diag(1, h, h², ..., h^r).   (A.2)

Then

   D_{nr}^{1/2}(ĉ⁺_r − c⁺_r) = (D_{nr}^{−1/2} Z⁺′_r W⁺ Z⁺_r D_{nr}^{−1/2})⁻¹ D_{nr}^{−1/2} Z⁺′_r W⁺ ε⁺ + (D_{nr}^{−1/2} Z⁺′_r W⁺ Z⁺_r D_{nr}^{−1/2})⁻¹ D_{nr}^{−1/2} Z⁺′_r W⁺ R⁺.   (A.3)

It follows from the proof of Lemma A.1(a) below that

   lim_{n→∞} D_{nr}^{−1/2} Z⁺′_r W⁺ Z⁺_r D_{nr}^{−1/2} = f(x₀) Γ_r.   (A.4)

Porter (2003) shows that, under Assumption 2,

   D_{nr}^{−1/2} Z⁺′_r W⁺ ε⁺ ⇒ N(0, σ²₊(x₀) f(x₀) V_r).   (A.5)

Combining (A.3), (A.4) and (A.5) gives

   D_{nr}^{1/2}(ĉ⁺_r − c⁺_r) − (D_{nr}^{−1/2} Z⁺′_r W⁺ Z⁺_r D_{nr}^{−1/2})⁻¹ D_{nr}^{−1/2} Z⁺′_r W⁺ R⁺ ⇒ N(0, σ²₊(x₀) f(x₀)⁻¹ Γ_r⁻¹ V_r Γ_r⁻¹),   (A.6)

which implies

   √(nh)(ĉ⁺_r − c⁺) − B⁺ ⇒ N(0, σ²₊(x₀) f(x₀)⁻¹ e₁′ Γ_r⁻¹ V_r Γ_r⁻¹ e₁),   (A.7)

where B⁺ = e₁′ (D_{nr}^{−1/2} Z⁺′_r W⁺ Z⁺_r D_{nr}^{−1/2})⁻¹ D_{nr}^{−1/2} Z⁺′_r W⁺ R⁺. Similarly, we can show that

   √(nh)(ĉ⁻_r − c⁻) − B⁻ ⇒ N(0, σ²₋(x₀) f(x₀)⁻¹ e₁′ Γ_r⁻¹ V_r Γ_r⁻¹ e₁).   (A.8)

By the independence of √(nh)(ĉ⁺_r − c⁺) and √(nh)(ĉ⁻_r − c⁻), we get

   √(nh)(θ̂_r − θ) − (B⁺ − B⁻) ⇒ N(0, [σ²₊(x₀) + σ²₋(x₀)] f(x₀)⁻¹ e₁′ Γ_r⁻¹ V_r Γ_r⁻¹ e₁).   (A.9)

When ℓ ≥ r + 1,

   D_{nr}^{−1/2} Z⁺′_r W⁺ R⁺ = h^{r+1} √(nh) b⁺_{r+1} γ_r (1 + o_p(1)).   (A.10)

When ℓ ≤ r,

   D_{nr}^{−1/2} Z⁺′_r W⁺ R⁺ = D_{nr}^{−1/2} Z⁺′_r W⁺ ẽ⁺ = O_p(h^s √(nh)).   (A.11)

Therefore

   B⁺ = 1{ℓ ≥ r + 1} e₁′ Γ_r⁻¹ γ_r b⁺_{r+1} f(x₀)⁻¹ h^{r+1} √(nh)(1 + o_p(1)) + O_p(h^s √(nh)).   (A.12)

Similarly,

   B⁻ = 1{ℓ ≥ r + 1} (−1)^{r+1} e₁′ Γ_r⁻¹ γ_r b⁻_{r+1} f(x₀)⁻¹ h^{r+1} √(nh)(1 + o_p(1)) + O_p(h^s √(nh)).   (A.13)

Let B = B⁺ − B⁻; then

   B = 1{ℓ ≥ r + 1} e₁′ Γ_r⁻¹ γ_r [b⁺_{r+1} − (−1)^{r+1} b⁻_{r+1}] f(x₀)⁻¹ h^{r+1} √(nh)(1 + o_p(1)) + O_p(h^s √(nh)).   (A.14)

Combining (A.14) and (A.9) leads to the desired result.

Proof of Theorem 2. Part (a). The proof uses the following result from Pollard (1993): let P = ⊗_{i=1}^n Pᵢ and Q = ⊗_{i=1}^n Qᵢ be finite products of probability measures such that Qᵢ has density 1 + Δᵢ(·) with respect to Pᵢ. If δᵢ² = E_{Pᵢ} Δᵢ² is finite for each i, then

   ‖ ⊗_{i=1}^n Pᵢ − ⊗_{i=1}^n Qᵢ ‖ ≤ exp( Σ_{i=1}^n δᵢ² ) − 1.   (A.15)

Using this result and (3), we have

   inf_{θ̂} sup_{P∈P} P(|θ̂ − θ| ≥ Δ/2) ≥ (1/2)[ 2 − exp( Σ_{i=1}^n δᵢ² ) ],   (A.16)

provided that θ(P) − θ(Q) > Δ. To get a good lower bound for the minimax risk, we consider two probability models P and Q. Under the model P, the data are generated according to

   Y = m_P(X) + θ_P d + ε,   (A.17)

where Y = (y₁, y₂, ..., yₙ)′, m_P(X) = (m_P(x₁), ..., m_P(xₙ))′, ε = (ε₁, ..., εₙ)′, the xᵢ's are iid uniform(x₀ − 1, x₀ + 1), the εᵢ's are iid N(0, 1), and εᵢ is independent of xⱼ for all i and j. The

data generating process under Q is defined analogously, with m_P(X) + θ_P d replaced by m_Q(X) + θ_Q d. It is obvious that both models P and Q satisfy Assumption 2. We now specify m and θ for each model. For the probability model P, we let m_P(x) = 0 and θ_P = 0. For the probability model Q, we let

   m_Q(x) = ρ^s φ((x − x₀)/ρ)   and   θ_Q = −ρ^s,   (A.18)

where ρ = n^{−1/(2s+1)} and φ is an infinitely differentiable function satisfying (i) 0 ≤ φ(x) ≤ 1, (ii) φ(x) = 0 for x ≤ 0, and (iii) φ(x) = 1 for x ≥ 1. Obviously m_P ∈ M(s, K). We next verify that m_Q ∈ M(s, K). First, by construction, m_Q is continuous on [x₀ − 1, x₀ + 1]. Second, the i-th order derivative m_Q^{(i)}(x) equals ρ^{s−i} φ^{(i)}((x − x₀)/ρ), which is obviously bounded by K when n is large enough, for all i ≤ ℓ. Third, we verify the Hölder condition for the ℓ-th order derivative. It suffices to consider the case where x₁ ∈ [x₀, x₀ + 1] and x₂ ∈ [x₀, x₀ + 1], as the Hölder condition holds trivially when x₁ ∈ [x₀ − 1, x₀] and x₂ ∈ [x₀ − 1, x₀]. We consider three cases: (i) when x₁, x₂ ∈ [x₀, x₀ + ρ], the ℓ-th order derivatives satisfy

   |ρ^{s−ℓ} φ^{(ℓ)}((x₁ − x₀)/ρ) − ρ^{s−ℓ} φ^{(ℓ)}((x₂ − x₀)/ρ)|
   ≤ ρ^{s−ℓ−1} |φ^{(ℓ+1)}(x̃)| |x₁ − x₂|
   = ρ^{s−ℓ−1} |φ^{(ℓ+1)}(x̃)| |x₁ − x₂|^{ℓ+1−s} |x₁ − x₂|^{s−ℓ}
   ≤ C ρ^{s−ℓ−1} ρ^{ℓ+1−s} |x₁ − x₂|^{s−ℓ}
   ≤ K |x₁ − x₂|^{s−ℓ}   (A.19)

for some x̃ between (x₁ − x₀)/ρ and (x₂ − x₀)/ρ, if ρ is small enough; (ii) when x₁ ∈ [x₀, x₀ + ρ] and x₂ ≥ x₀ + ρ,

   |ρ^{s−ℓ} φ^{(ℓ)}((x₁ − x₀)/ρ) − ρ^{s−ℓ} φ^{(ℓ)}((x₂ − x₀)/ρ)|
   = |ρ^{s−ℓ} φ^{(ℓ)}((x₁ − x₀)/ρ) − ρ^{s−ℓ} φ^{(ℓ)}((x₀ + ρ − x₀)/ρ)|
   ≤ K |x₁ − x₀ − ρ|^{s−ℓ} ≤ K |x₁ − x₂|^{s−ℓ},   (A.20)

where the first inequality follows from (A.19); (iii) when x₁ ≥ x₀ + ρ and x₂ ≥ x₀ + ρ, we have φ^{(ℓ)}((x₁ − x₀)/ρ) = φ^{(ℓ)}((x₂ − x₀)/ρ) = 0, so again the Hölder condition holds trivially. It remains to compute the L₁ distance between the two measures. Let the density of Qᵢ with respect to Pᵢ be 1 + Δᵢ(xᵢ, yᵢ); then

   Δᵢ(xᵢ, yᵢ) = ϕ(yᵢ − m_Q(xᵢ) − θ_Q dᵢ)/ϕ(yᵢ) − 1   if xᵢ ∈ [x₀, x₀ + ρ), and 0 otherwise,   (A.21)

where ϕ denotes the standard normal density.
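The role of the choice ρ = n^{−1/(2s+1)} in this construction can be seen from a short calculation; the following is a sketch with constants suppressed (the probability bound uses the uniform(x₀ − 1, x₀ + 1) design, and the per-observation bound uses the Gaussian form of the likelihood ratio for a small mean shift):

```latex
% Sketch: why \rho = n^{-1/(2s+1)} keeps P and Q hard to distinguish.
% The mean functions of P and Q differ only for x_i \in [x_0, x_0 + \rho),
% where the difference is bounded by C\rho^{s}. For Gaussian errors with a
% small mean shift, the per-observation chi-square distance satisfies
\delta_i^2 \;=\; E_{P_i}\Delta_i^2
  \;\le\; C\,\rho^{2s}\,\Pr\!\big(x_i \in [x_0, x_0+\rho)\big)
  \;=\; C\,\rho^{2s}\cdot\tfrac{\rho}{2}
  \;=\; \tfrac{C}{2}\,\rho^{2s+1},
% so that, summing over the n independent observations,
\sum_{i=1}^{n}\delta_i^2 \;\le\; \tfrac{C}{2}\,n\rho^{2s+1} \;=\; \tfrac{C}{2}
\quad\text{for } \rho = n^{-1/(2s+1)}.
% The L_1 distance between the product measures therefore stays bounded,
% while the two treatment effects differ by
% |\theta_P - \theta_Q| = \rho^{s} = n^{-s/(2s+1)},
% which is what delivers the minimax lower bound at this rate.
```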


More information

Research Note REGRESSION ANALYSIS IN MARKOV CHAIN * A. Y. ALAMUTI AND M. R. MESHKANI **

Research Note REGRESSION ANALYSIS IN MARKOV CHAIN * A. Y. ALAMUTI AND M. R. MESHKANI ** Iranian Journal of Science & Technology, Transaction A, Vol 3, No A3 Printed in The Islamic Reublic of Iran, 26 Shiraz University Research Note REGRESSION ANALYSIS IN MARKOV HAIN * A Y ALAMUTI AND M R

More information

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch

More information

CHAPTER 3: TANGENT SPACE

CHAPTER 3: TANGENT SPACE CHAPTER 3: TANGENT SPACE DAVID GLICKENSTEIN 1. Tangent sace We shall de ne the tangent sace in several ways. We rst try gluing them together. We know vectors in a Euclidean sace require a baseoint x 2

More information

A New Asymmetric Interaction Ridge (AIR) Regression Method

A New Asymmetric Interaction Ridge (AIR) Regression Method A New Asymmetric Interaction Ridge (AIR) Regression Method by Kristofer Månsson, Ghazi Shukur, and Pär Sölander The Swedish Retail Institute, HUI Research, Stockholm, Sweden. Deartment of Economics and

More information

Generalized Coiflets: A New Family of Orthonormal Wavelets

Generalized Coiflets: A New Family of Orthonormal Wavelets Generalized Coiflets A New Family of Orthonormal Wavelets Dong Wei, Alan C Bovik, and Brian L Evans Laboratory for Image and Video Engineering Deartment of Electrical and Comuter Engineering The University

More information

On Isoperimetric Functions of Probability Measures Having Log-Concave Densities with Respect to the Standard Normal Law

On Isoperimetric Functions of Probability Measures Having Log-Concave Densities with Respect to the Standard Normal Law On Isoerimetric Functions of Probability Measures Having Log-Concave Densities with Resect to the Standard Normal Law Sergey G. Bobkov Abstract Isoerimetric inequalities are discussed for one-dimensional

More information

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek Use of Transformations and the Reeated Statement in PROC GLM in SAS Ed Stanek Introduction We describe how the Reeated Statement in PROC GLM in SAS transforms the data to rovide tests of hyotheses of interest.

More information

Sums of independent random variables

Sums of independent random variables 3 Sums of indeendent random variables This lecture collects a number of estimates for sums of indeendent random variables with values in a Banach sace E. We concentrate on sums of the form N γ nx n, where

More information

Testing Weak Cross-Sectional Dependence in Large Panels

Testing Weak Cross-Sectional Dependence in Large Panels esting Weak Cross-Sectional Deendence in Large Panels M. Hashem Pesaran University of Southern California, and rinity College, Cambridge January, 3 Abstract his aer considers testing the hyothesis that

More information

Notes on Instrumental Variables Methods

Notes on Instrumental Variables Methods Notes on Instrumental Variables Methods Michele Pellizzari IGIER-Bocconi, IZA and frdb 1 The Instrumental Variable Estimator Instrumental variable estimation is the classical solution to the roblem of

More information

Exercises Econometric Models

Exercises Econometric Models Exercises Econometric Models. Let u t be a scalar random variable such that E(u t j I t ) =, t = ; ; ::::, where I t is the (stochastic) information set available at time t. Show that under the hyothesis

More information

Linear diophantine equations for discrete tomography

Linear diophantine equations for discrete tomography Journal of X-Ray Science and Technology 10 001 59 66 59 IOS Press Linear diohantine euations for discrete tomograhy Yangbo Ye a,gewang b and Jiehua Zhu a a Deartment of Mathematics, The University of Iowa,

More information

On Doob s Maximal Inequality for Brownian Motion

On Doob s Maximal Inequality for Brownian Motion Stochastic Process. Al. Vol. 69, No., 997, (-5) Research Reort No. 337, 995, Det. Theoret. Statist. Aarhus On Doob s Maximal Inequality for Brownian Motion S. E. GRAVERSEN and G. PESKIR If B = (B t ) t

More information

Radial Basis Function Networks: Algorithms

Radial Basis Function Networks: Algorithms Radial Basis Function Networks: Algorithms Introduction to Neural Networks : Lecture 13 John A. Bullinaria, 2004 1. The RBF Maing 2. The RBF Network Architecture 3. Comutational Power of RBF Networks 4.

More information

SIGNALING IN CONTESTS. Tomer Ifergane and Aner Sela. Discussion Paper No November 2017

SIGNALING IN CONTESTS. Tomer Ifergane and Aner Sela. Discussion Paper No November 2017 SIGNALING IN CONTESTS Tomer Ifergane and Aner Sela Discussion Paer No. 17-08 November 017 Monaster Center for Economic Research Ben-Gurion University of the Negev P.O. Box 653 Beer Sheva, Israel Fax: 97-8-647941

More information

Empirical Likelihood for Regression Discontinuity Design

Empirical Likelihood for Regression Discontinuity Design Emirical Likeliood for Regression Discontinuity Design Taisuke Otsu Yale University Ke-Li Xu y University of Alberta 8t Marc 2009 Abstract Tis aer rooses emirical likeliood con dence intervals for causal

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Numerous alications in statistics, articularly in the fitting of linear models. Notation and conventions: Elements of a matrix A are denoted by a ij, where i indexes the rows and

More information

Lower Confidence Bound for Process-Yield Index S pk with Autocorrelated Process Data

Lower Confidence Bound for Process-Yield Index S pk with Autocorrelated Process Data Quality Technology & Quantitative Management Vol. 1, No.,. 51-65, 15 QTQM IAQM 15 Lower onfidence Bound for Process-Yield Index with Autocorrelated Process Data Fu-Kwun Wang * and Yeneneh Tamirat Deartment

More information

Hotelling s Two- Sample T 2

Hotelling s Two- Sample T 2 Chater 600 Hotelling s Two- Samle T Introduction This module calculates ower for the Hotelling s two-grou, T-squared (T) test statistic. Hotelling s T is an extension of the univariate two-samle t-test

More information

Analysis of some entrance probabilities for killed birth-death processes

Analysis of some entrance probabilities for killed birth-death processes Analysis of some entrance robabilities for killed birth-death rocesses Master s Thesis O.J.G. van der Velde Suervisor: Dr. F.M. Sieksma July 5, 207 Mathematical Institute, Leiden University Contents Introduction

More information

MATH 2710: NOTES FOR ANALYSIS

MATH 2710: NOTES FOR ANALYSIS MATH 270: NOTES FOR ANALYSIS The main ideas we will learn from analysis center around the idea of a limit. Limits occurs in several settings. We will start with finite limits of sequences, then cover infinite

More information

Estimation of Separable Representations in Psychophysical Experiments

Estimation of Separable Representations in Psychophysical Experiments Estimation of Searable Reresentations in Psychohysical Exeriments Michele Bernasconi (mbernasconi@eco.uninsubria.it) Christine Choirat (cchoirat@eco.uninsubria.it) Raffaello Seri (rseri@eco.uninsubria.it)

More information

CERIAS Tech Report The period of the Bell numbers modulo a prime by Peter Montgomery, Sangil Nahm, Samuel Wagstaff Jr Center for Education

CERIAS Tech Report The period of the Bell numbers modulo a prime by Peter Montgomery, Sangil Nahm, Samuel Wagstaff Jr Center for Education CERIAS Tech Reort 2010-01 The eriod of the Bell numbers modulo a rime by Peter Montgomery, Sangil Nahm, Samuel Wagstaff Jr Center for Education and Research Information Assurance and Security Purdue University,

More information

Let s Fix It: Fixed-b Asymptotics versus Small-b Asymptotics in Heteroskedasticity and Autocorrelation Robust Inference

Let s Fix It: Fixed-b Asymptotics versus Small-b Asymptotics in Heteroskedasticity and Autocorrelation Robust Inference Let s Fix It: Fixed-b Asymtotics versus Small-b Asymtotics in Heteroskedasticity and Autocorrelation Robust Inference Yixiao Sun Deartment of Economics, University of California, San Diego June 5, 3 Abstract

More information

CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP

CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP Submitted to the Annals of Statistics arxiv: arxiv:1706.07237 CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP By Johannes Tewes, Dimitris N. Politis and Daniel J. Nordman Ruhr-Universität

More information

Bootstrap Inference for Impulse Response Functions in Factor-Augmented Vector Autoregressions

Bootstrap Inference for Impulse Response Functions in Factor-Augmented Vector Autoregressions Bootstra Inference for Imulse Resonse Functions in Factor-Augmented Vector Autoregressions Yohei Yamamoto y University of Alberta, School of Business February 2010 Abstract his aer investigates standard

More information

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules. Introduction: The is widely used in industry to monitor the number of fraction nonconforming units. A nonconforming unit is

More information

arxiv: v2 [stat.me] 3 Nov 2014

arxiv: v2 [stat.me] 3 Nov 2014 onarametric Stein-tye Shrinkage Covariance Matrix Estimators in High-Dimensional Settings Anestis Touloumis Cancer Research UK Cambridge Institute University of Cambridge Cambridge CB2 0RE, U.K. Anestis.Touloumis@cruk.cam.ac.uk

More information

Variable Selection and Model Building

Variable Selection and Model Building LINEAR REGRESSION ANALYSIS MODULE XIII Lecture - 38 Variable Selection and Model Building Dr. Shalabh Deartment of Mathematics and Statistics Indian Institute of Technology Kanur Evaluation of subset regression

More information

Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation

Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation Paer C Exact Volume Balance Versus Exact Mass Balance in Comositional Reservoir Simulation Submitted to Comutational Geosciences, December 2005. Exact Volume Balance Versus Exact Mass Balance in Comositional

More information

Stochastic integration II: the Itô integral

Stochastic integration II: the Itô integral 13 Stochastic integration II: the Itô integral We have seen in Lecture 6 how to integrate functions Φ : (, ) L (H, E) with resect to an H-cylindrical Brownian motion W H. In this lecture we address the

More information

SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES. Gord Sinnamon The University of Western Ontario. December 27, 2003

SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES. Gord Sinnamon The University of Western Ontario. December 27, 2003 SCHUR S LEMMA AND BEST CONSTANTS IN WEIGHTED NORM INEQUALITIES Gord Sinnamon The University of Western Ontario December 27, 23 Abstract. Strong forms of Schur s Lemma and its converse are roved for mas

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

Tests for Two Proportions in a Stratified Design (Cochran/Mantel-Haenszel Test)

Tests for Two Proportions in a Stratified Design (Cochran/Mantel-Haenszel Test) Chater 225 Tests for Two Proortions in a Stratified Design (Cochran/Mantel-Haenszel Test) Introduction In a stratified design, the subects are selected from two or more strata which are formed from imortant

More information

How to Estimate Expected Shortfall When Probabilities Are Known with Interval or Fuzzy Uncertainty

How to Estimate Expected Shortfall When Probabilities Are Known with Interval or Fuzzy Uncertainty How to Estimate Exected Shortfall When Probabilities Are Known with Interval or Fuzzy Uncertainty Christian Servin Information Technology Deartment El Paso Community College El Paso, TX 7995, USA cservin@gmail.com

More information

State Estimation with ARMarkov Models

State Estimation with ARMarkov Models Deartment of Mechanical and Aerosace Engineering Technical Reort No. 3046, October 1998. Princeton University, Princeton, NJ. State Estimation with ARMarkov Models Ryoung K. Lim 1 Columbia University,

More information

The Fekete Szegő theorem with splitting conditions: Part I

The Fekete Szegő theorem with splitting conditions: Part I ACTA ARITHMETICA XCIII.2 (2000) The Fekete Szegő theorem with slitting conditions: Part I by Robert Rumely (Athens, GA) A classical theorem of Fekete and Szegő [4] says that if E is a comact set in the

More information

The non-stochastic multi-armed bandit problem

The non-stochastic multi-armed bandit problem Submitted for journal ublication. The non-stochastic multi-armed bandit roblem Peter Auer Institute for Theoretical Comuter Science Graz University of Technology A-8010 Graz (Austria) auer@igi.tu-graz.ac.at

More information

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK Comuter Modelling and ew Technologies, 5, Vol.9, o., 3-39 Transort and Telecommunication Institute, Lomonosov, LV-9, Riga, Latvia MATHEMATICAL MODELLIG OF THE WIRELESS COMMUICATIO ETWORK M. KOPEETSK Deartment

More information

Supplementary Materials for Robust Estimation of the False Discovery Rate

Supplementary Materials for Robust Estimation of the False Discovery Rate Sulementary Materials for Robust Estimation of the False Discovery Rate Stan Pounds and Cheng Cheng This sulemental contains roofs regarding theoretical roerties of the roosed method (Section S1), rovides

More information

Approximating min-max k-clustering

Approximating min-max k-clustering Aroximating min-max k-clustering Asaf Levin July 24, 2007 Abstract We consider the roblems of set artitioning into k clusters with minimum total cost and minimum of the maximum cost of a cluster. The cost

More information

MATH 829: Introduction to Data Mining and Analysis Consistency of Linear Regression

MATH 829: Introduction to Data Mining and Analysis Consistency of Linear Regression 1/9 MATH 829: Introduction to Data Mining and Analysis Consistency of Linear Regression Dominique Guillot Deartments of Mathematical Sciences University of Delaware February 15, 2016 Distribution of regression

More information

Sharp gradient estimate and spectral rigidity for p-laplacian

Sharp gradient estimate and spectral rigidity for p-laplacian Shar gradient estimate and sectral rigidity for -Lalacian Chiung-Jue Anna Sung and Jiaing Wang To aear in ath. Research Letters. Abstract We derive a shar gradient estimate for ositive eigenfunctions of

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

A Comparison between Biased and Unbiased Estimators in Ordinary Least Squares Regression

A Comparison between Biased and Unbiased Estimators in Ordinary Least Squares Regression Journal of Modern Alied Statistical Methods Volume Issue Article 7 --03 A Comarison between Biased and Unbiased Estimators in Ordinary Least Squares Regression Ghadban Khalaf King Khalid University, Saudi

More information

John Weatherwax. Analysis of Parallel Depth First Search Algorithms

John Weatherwax. Analysis of Parallel Depth First Search Algorithms Sulementary Discussions and Solutions to Selected Problems in: Introduction to Parallel Comuting by Viin Kumar, Ananth Grama, Anshul Guta, & George Karyis John Weatherwax Chater 8 Analysis of Parallel

More information

Proof: We follow thearoach develoed in [4]. We adot a useful but non-intuitive notion of time; a bin with z balls at time t receives its next ball at

Proof: We follow thearoach develoed in [4]. We adot a useful but non-intuitive notion of time; a bin with z balls at time t receives its next ball at A Scaling Result for Exlosive Processes M. Mitzenmacher Λ J. Sencer We consider the following balls and bins model, as described in [, 4]. Balls are sequentially thrown into bins so that the robability

More information

ON THE NORM OF AN IDEMPOTENT SCHUR MULTIPLIER ON THE SCHATTEN CLASS

ON THE NORM OF AN IDEMPOTENT SCHUR MULTIPLIER ON THE SCHATTEN CLASS PROCEEDINGS OF THE AMERICAN MATHEMATICAL SOCIETY Volume 00, Number 0, Pages 000 000 S 000-9939XX)0000-0 ON THE NORM OF AN IDEMPOTENT SCHUR MULTIPLIER ON THE SCHATTEN CLASS WILLIAM D. BANKS AND ASMA HARCHARRAS

More information

Uniformly best wavenumber approximations by spatial central difference operators: An initial investigation

Uniformly best wavenumber approximations by spatial central difference operators: An initial investigation Uniformly best wavenumber aroximations by satial central difference oerators: An initial investigation Vitor Linders and Jan Nordström Abstract A characterisation theorem for best uniform wavenumber aroximations

More information

Yixi Shi. Jose Blanchet. IEOR Department Columbia University New York, NY 10027, USA. IEOR Department Columbia University New York, NY 10027, USA

Yixi Shi. Jose Blanchet. IEOR Department Columbia University New York, NY 10027, USA. IEOR Department Columbia University New York, NY 10027, USA Proceedings of the 2011 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelsach, K. P. White, and M. Fu, eds. EFFICIENT RARE EVENT SIMULATION FOR HEAVY-TAILED SYSTEMS VIA CROSS ENTROPY Jose

More information

LORENZO BRANDOLESE AND MARIA E. SCHONBEK

LORENZO BRANDOLESE AND MARIA E. SCHONBEK LARGE TIME DECAY AND GROWTH FOR SOLUTIONS OF A VISCOUS BOUSSINESQ SYSTEM LORENZO BRANDOLESE AND MARIA E. SCHONBEK Abstract. In this aer we analyze the decay and the growth for large time of weak and strong

More information

Nonparametric estimation of Exact consumer surplus with endogeneity in price

Nonparametric estimation of Exact consumer surplus with endogeneity in price Nonarametric estimation of Exact consumer surlus with endogeneity in rice Anne Vanhems February 7, 2009 Abstract This aer deals with nonarametric estimation of variation of exact consumer surlus with endogenous

More information

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split A Bound on the Error of Cross Validation Using the Aroximation and Estimation Rates, with Consequences for the Training-Test Slit Michael Kearns AT&T Bell Laboratories Murray Hill, NJ 7974 mkearns@research.att.com

More information

Estimating function analysis for a class of Tweedie regression models

Estimating function analysis for a class of Tweedie regression models Title Estimating function analysis for a class of Tweedie regression models Author Wagner Hugo Bonat Deartamento de Estatística - DEST, Laboratório de Estatística e Geoinformação - LEG, Universidade Federal

More information

On a class of Rellich inequalities

On a class of Rellich inequalities On a class of Rellich inequalities G. Barbatis A. Tertikas Dedicated to Professor E.B. Davies on the occasion of his 60th birthday Abstract We rove Rellich and imroved Rellich inequalities that involve

More information

A MARKOVIAN LOCAL RESAMPLING SCHEME FOR NONPARAMETRIC ESTIMATORS IN TIME SERIES ANALYSIS

A MARKOVIAN LOCAL RESAMPLING SCHEME FOR NONPARAMETRIC ESTIMATORS IN TIME SERIES ANALYSIS Econometric heory, 17, 2001, 540 566+ Printed in the United States of America+ A MARKOVIAN LOCAL RESAMPLING SCHEME FOR NONPARAMERIC ESIMAORS IN IME SERIES ANALYSIS EFSAHIOS PAPARODIIS University of Cyrus

More information

Positivity, local smoothing and Harnack inequalities for very fast diffusion equations

Positivity, local smoothing and Harnack inequalities for very fast diffusion equations Positivity, local smoothing and Harnack inequalities for very fast diffusion equations Dedicated to Luis Caffarelli for his ucoming 60 th birthday Matteo Bonforte a, b and Juan Luis Vázquez a, c Abstract

More information

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i Comuting with Haar Functions Sami Khuri Deartment of Mathematics and Comuter Science San Jose State University One Washington Square San Jose, CA 9519-0103, USA khuri@juiter.sjsu.edu Fax: (40)94-500 Keywords:

More information

Debt, In ation and Growth

Debt, In ation and Growth Debt, In ation and Growth Robust Estimation of Long-Run E ects in Dynamic Panel Data Models Alexander Chudik a, Kamiar Mohaddes by, M. Hashem Pesaran c, and Mehdi Raissi d a Federal Reserve Bank of Dallas,

More information

Statistical Treatment Choice Based on. Asymmetric Minimax Regret Criteria

Statistical Treatment Choice Based on. Asymmetric Minimax Regret Criteria Statistical Treatment Coice Based on Asymmetric Minimax Regret Criteria Aleksey Tetenov y Deartment of Economics ortwestern University ovember 5, 007 (JOB MARKET PAPER) Abstract Tis aer studies te roblem

More information