optimal inference in a class of nonparametric models

1 optimal inference in a class of nonparametric models Timothy Armstrong (Yale University) Michal Kolesár (Princeton University) September 2015

2 setup

Interested in inference on a linear functional $Lf$ in the regression model
$$y_i = f(x_i) + u_i, \qquad u_i \sim N(0, \sigma^2(x_i)),$$
where $x_i$ is fixed and $\sigma^2(x_i)$ is known. Important special cases:

1. Inference at a point: $Lf = f(0)$
2. Regression discontinuity: $Lf = f(0^+) - f(0^-)$
3. ATE under unconfoundedness: $x_i = (w_i, d_i)$, $Lf = \frac{1}{n}\sum_i \bigl(f(w_i, 1) - f(w_i, 0)\bigr)$
4. Partially linear model

3 key assumption

Convexity Assumption: $f \in \mathcal{F}$, a known convex set.

Rules out e.g. sparsity, but not the usual shape/smoothness restrictions:

- Monotonicity: $\mathcal{F} = \{f : f \text{ non-increasing}\}$
- Lipschitz class $\mathcal{F}_{\mathrm{Lip}}(C) = \{f : |f(x_1) - f(x_2)| \le C|x_1 - x_2|\}$ (or Hölder-class generalizations)
- Taylor class $\mathcal{F}_{T,2}(C) = \{f : |f(x) - f(0) - f'(0)x| \le Cx^2\}$ (useful for RD / inference at a point)
- Sign restrictions in linear regression: $\{f(x) = x'\beta : \beta_j \ge 0,\ j \in J\}$

Will take $C$ as known if necessary, and ask later whether this can be relaxed.

4 notions of finite-sample optimality

Normality $\Rightarrow$ can derive finite-sample procedures that minimize the worst-case loss over $\mathcal{G} \subseteq \mathcal{F}$; without Normality, the procedures remain valid and optimal asymptotically, uniformly over $\mathcal{F}$, under regularity conditions.

1. Setting $\mathcal{G} = \mathcal{F}$ yields minimax procedures. The problem is well studied if the loss is MSE: general solution in Donoho (1994), used to derive optimal kernels and rates of convergence (Stone, 1980; Fan, 1993; Cheng, Fan, and Marron, 1997). Donoho (1994) derives fixed-length confidence intervals (CIs) that are almost optimal.
2. $\mathcal{G} \subsetneq \mathcal{F}$ smoother functions: adaptive inference ("directing power"). For two-sided CIs, Cai and Low (2004) give bounds.

5 new finite-sample results: one-sided cis

Derive one-sided CIs $[\hat c, \infty)$ that minimize maximum quantiles of excess length over $\mathcal G$, with
$$\hat c = \hat L - \overline{\mathrm{bias}}_{\mathcal F}(\hat L) - z_{1-\alpha}\,\mathrm{sd}(\hat L),$$
for an optimal estimator $\hat L$.

- For the case $\mathcal F = \mathcal G$ (minimax CIs), $\hat L$ has the same form as the minimax MSE estimators / fixed-length CIs of Donoho (1994).
- We show that if $\mathcal F$ is symmetric, adaptation is severely limited. Adaptation requires non-convexity or shape restrictions: otherwise, one cannot do better at smaller $C$ while maintaining coverage for larger $C$.
- Conversely, any inference method that claims to do better than minimax CIs when $f$ is smooth must be size distorted for some $f \in \mathcal F(C)$.
- Related to Low (1997), who shows that adapting to derivative smoothness classes is limited for two-sided (random-length) CIs.

6 new finite-sample results: two-sided cis

We derive two-sided CIs that minimize expected length over $\mathcal G = \{g\}$, solving the problem of adaptation to a function posed in Cai, Low, and Xia (2013). This can be used to bound the scope for adaptivity.

7 implications for optimal bandwidth choice

Asymptotically, optimal procedures often correspond to kernel estimators with a fixed (optimal) kernel, and a bandwidth that depends on the optimality criterion. We find that for RD and inference at a point:

- Optimal 95% fixed-length CIs use a larger bandwidth than minimax MSE estimators. Undersmoothing cannot be optimal.
- Recentering CIs by estimating the bias cannot be optimal: it's essentially equivalent to using a higher-order kernel and undersmoothing (Calonico, Cattaneo, and Titiunik, 2014).
- The difference is small: the CI around the minimax MSE estimator is only about 1% longer.
- In practice, one can keep the same bandwidth as for estimation, and construct the CI around it using a worst-case bias correction.

8 applications

We apply the general results to:

1. RD with $\mathcal F = \{f_+ + f_- : f_\pm \in \mathcal F_{T,2}(C)\}$ as in Cheng, Fan, and Marron (1997). Optimal bandwidths balance the number of effective observations on each side of the cutoff. Illustrated with the empirical application from Lee (2008).
2. Linear regression with $\beta$ possibly constrained (sign restrictions, sparsity, elliptical constraints).
3. Sample average treatment effect under unconfoundedness under a Hölder class (separate paper).

9 incomplete list of related literature

- Stats literature on minimax estimation/inference/rates of convergence/adaptivity: Ibragimov and Khas'minskii (1985), Donoho and Liu (1991), Donoho and Low (1992), Donoho (1994), Low (1995), Low (1997), Cai and Low (2004), Cai, Low, and Xia (2013), Cheng, Fan, and Marron (1997), Fan (1993), Fan, Gasser, Gijbels, Brockmann, and Engel (1997), Lepski and Tsybakov (2000)
- Non-standard CIs: Imbens and Manski (2004), Müller and Norets (2012), Calonico, Cattaneo, and Titiunik (2014), Calonico, Cattaneo, and Farrell (2015), Rothe (2015)
- Adaptive estimation/inference in econometrics: Sun (2005), Armstrong (2015), Chernozhukov, Chetverikov, and Kato (2014)

10 Finite-Sample results Asymptotic results Applications Conclusion

11 running example

Consider the problem of inference on $f(0)$ when $f$ is restricted to be in the Lipschitz class
$$\mathcal F = \mathcal F_{\mathrm{Lip}}(C) = \{f : |f(x_1) - f(x_2)| \le C|x_1 - x_2|\}.$$
Assume $\sigma(x) = \sigma$, known.

12 performance criteria

To measure the performance of one-sided $1-\alpha$ CIs $[\hat c, \infty)$, we use maximum quantiles of excess length,
$$EL_\beta(\hat c, \mathcal G) = \sup_{g \in \mathcal G} q_{g,\beta}(Lg - \hat c),$$
where $q_{g,\beta}$ is the $\beta$th quantile under $g$.

For two-sided CIs, we focus on fixed-length CIs $\hat L \pm \chi$, where $\hat L$ is an estimator and $\chi$ is chosen to satisfy coverage:
$$\chi_\alpha(\hat L) = \min\left\{\chi : \inf_{f \in \mathcal F} P_f\bigl(|\hat L - Lf| \le \chi\bigr) \ge 1 - \alpha\right\}$$

For estimation, we use maximum MSE, $R_{MSE}(\hat L) = \sup_{f \in \mathcal F} E_f(\hat L - Lf)^2$.

13 minimax testing problem

In the running example ($Lf = f(0)$, $\mathcal F = \mathcal F_{\mathrm{Lip}}(C)$), consider the minimax test of $H_0\colon Lf \le L_0$ against $H_1\colon Lf \ge L_0 + 2b$.

Inverting minimax tests yields the CI that minimizes $EL_\beta(\hat c, \mathcal F)$, where $\beta$ is the minimax power of the test. First need to find the least favorable null and alternative.

The problem is equivalent to $Y \sim N(\mu, \sigma^2 I)$, $\mu = (f(x_1), \ldots, f(x_n)) \in M$ convex. Both $M_0 = M \cap \{f : Lf \le L_0\}$ and $M_1 = M \cap \{g : Lg \ge L_0 + 2b\}$ are convex $\Rightarrow$ least favorable functions minimize the distance between them (Ingster and Suslina, 2003):
$$(g^*, f^*) = \operatorname*{argmin}_{g \in M_1,\, f \in M_0}\ \sum_{i=1}^n (g(x_i) - f(x_i))^2.$$

14

$$g^*(x) = L_0 + b + (b - C|x|)_+, \qquad f^*(x) = L_0 + b - (b - C|x|)_+$$

[Figure: $g^*$ and $f^*$ against $x$: $g^*$ is a triangular bump peaking at $L_0 + 2b$ at $x = 0$, $f^*$ a triangular dip reaching $L_0$ at $x = 0$, and the two coincide at $L_0 + b$ outside $[-b/C, b/C]$.]
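To make the closed form concrete, here is a minimal numerical sketch (our own illustration, not the authors' code; the grid and constants are made up) that evaluates the least favorable pair:

```python
import numpy as np

def lf_pair(x, L0, b, C):
    """Least favorable pair for testing Lf <= L0 against Lf >= L0 + 2b over F_Lip(C)."""
    bump = np.maximum(b - C * np.abs(x), 0.0)  # (b - C|x|)_+, zero outside [-b/C, b/C]
    return L0 + b + bump, L0 + b - bump        # g*(x), f*(x)

x = np.linspace(-1.0, 1.0, 9)
g_star, f_star = lf_pair(x, L0=0.0, b=0.5, C=1.0)
print(g_star)  # peaks at L0 + 2b = 1.0 at x = 0
print(f_star)  # dips to L0 = 0.0 at x = 0
```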

15

$$g^*(x) = L_0 + b + (b - C|x|)_+, \qquad f^*(x) = L_0 + b - (b - C|x|)_+$$

The minimax test is then the LR test of $\mu_0 = (f^*(x_1), \ldots, f^*(x_n))$ against $\mu_1 = (g^*(x_1), \ldots, g^*(x_n))$: reject for large values of $Y'(\mu_1 - \mu_0)$.

The test can be written as rejecting whenever
$$\hat L(h) - L_0 - b\left(1 - \frac{\sum_{i=1}^n k_T(x_i/h)^2}{\sum_{i=1}^n k_T(x_i/h)}\right) \ge \frac{\bigl(\sum_{i=1}^n k_T(x_i/h)^2\bigr)^{1/2}}{\sum_{i=1}^n k_T(x_i/h)}\,\sigma z_{1-\alpha},$$
where $k_T(u) = (1 - |u|)_+$, $h = b/C$, and
$$\hat L(h) = \frac{\sum_{i=1}^n (g^*(x_i) - f^*(x_i))\,Y_i}{\sum_{i=1}^n (g^*(x_i) - f^*(x_i))} = \frac{\sum_{i=1}^n k_T(x_i/h)\,Y_i}{\sum_{i=1}^n k_T(x_i/h)}.$$

Key feature: a non-random bias correction based on the worst-case bias; it does not disappear asymptotically.
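A minimal simulation sketch of this test (our own construction; the design, $\sigma$, and the function $f$ below are made-up values satisfying the Lipschitz constraint):

```python
import numpy as np
from scipy.stats import norm

def k_T(u):
    """Triangular kernel (1 - |u|)_+."""
    return np.maximum(1.0 - np.abs(u), 0.0)

def minimax_test(x, y, L0, b, C, sigma, alpha=0.05):
    """Reject H0: f(0) <= L0 against f(0) >= L0 + 2b, per the display above."""
    w = k_T(x / (b / C))                                   # bandwidth h = b/C
    Lhat = np.sum(w * y) / np.sum(w)                       # triangular-kernel estimate of f(0)
    null_mean = L0 + b * (1.0 - np.sum(w**2) / np.sum(w))  # E[Lhat] at the LF null f*
    sd = sigma * np.sqrt(np.sum(w**2)) / np.sum(w)
    return (Lhat - null_mean) / sd >= norm.ppf(1 - alpha)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 500)
f = 0.3 - 0.5 * np.abs(x)  # a function in F_Lip(1) with f(0) = 0.3
y = f + 0.1 * rng.standard_normal(x.size)
print(minimax_test(x, y, L0=0.0, b=0.15, C=1.0, sigma=0.1))  # True: rejects here
```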

16 general setup

In general, we observe $Y = Kf + \sigma\epsilon$, where $\epsilon$ is standard Normal and $K$ is a linear operator, with $\langle Kg, Kf\rangle = \sum_i (Kg)(x_i)(Kf)(x_i)$. Heteroscedasticity is handled by setting $Kf = (f(x_1)/\sigma(x_1), \ldots, f(x_n)/\sigma(x_n))$ and $Y = (Y_1/\sigma(x_1), \ldots, Y_n/\sigma(x_n))$.

Define the modulus of continuity (Donoho and Liu, 1991):
$$\omega(\delta; \mathcal F) = \sup\{L(g - f) : \|K(g - f)\| \le \delta,\ g, f \in \mathcal F\}$$

Denote the solutions by $g^*_\delta, f^*_\delta$, and let $f^*_{M,\delta} = (g^*_\delta + f^*_\delta)/2$.

The problem of finding LF functions is equivalent to inverting the modulus, $\omega^{-1}(\cdot; \mathcal F)$; in the running example, $g^* = g^*_{\omega^{-1}(2b)}$ and $f^* = f^*_{\omega^{-1}(2b)}$.
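In the running example the modulus can be traced out numerically by sweeping the bump height $b$ of the least favorable pair from slide 14: $\omega(\delta(b)) = 2b$ where $\delta(b) = \|K(g^*_b - f^*_b)\|$. A minimal sketch of this (our own, with a made-up design; in the homoscedastic running example we keep $\sigma$ outside $K$, so $K$ acts as the identity):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 500)  # made-up design points
C = 1.0

def delta_of_b(b):
    d = 2.0 * np.maximum(b - C * np.abs(x), 0.0)  # (g*_b - f*_b)(x_i)
    return np.sqrt(np.sum(d**2))                  # ||K(g*_b - f*_b)||, K = identity here

bs = np.linspace(1e-3, 0.5, 400)
deltas = np.array([delta_of_b(b) for b in bs])    # increasing in b, so invertible
omega = 2.0 * bs                                  # ω(δ(b)) = L(g*_b - f*_b) = 2b
print(np.interp(5.0, deltas, omega))              # ω(δ) at δ = 5, by interpolation
```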

17 class of optimal estimators

Define
$$\hat L_{\delta,\mathcal F} = Lf^*_{M,\delta} + \frac{\omega'(\delta;\mathcal F)}{\delta}\bigl\langle K(g^*_\delta - f^*_\delta),\; Y - Kf^*_{M,\delta}\bigr\rangle$$

These estimators minimize the maximum bias given a variance bound, and vice versa (Low, 1995). Their maximum and minimum bias over $\mathcal F$ satisfy
$$\overline{\mathrm{bias}}_{\mathcal F}(\hat L_{\delta,\mathcal F}) = -\underline{\mathrm{bias}}_{\mathcal F}(\hat L_{\delta,\mathcal F}) = \tfrac{1}{2}\bigl(\omega(\delta;\mathcal F) - \delta\,\omega'(\delta;\mathcal F)\bigr)$$

In the running example: $\hat L(h) = \hat L_{\omega^{-1}(2hC),\,\mathcal F_{\mathrm{Lip}}(C)}$.

18 centrosymmetry and translation invariance

When $\mathcal F$ has additional structure, $\hat L_{\delta,\mathcal F}$ simplifies:

If $\mathcal F$ is translation invariant (for some $\iota$ with $L\iota = 1$, $f + c\iota \in \mathcal F$ for all $f \in \mathcal F$ and $c \in \mathbb R$), then $\delta/\omega'(\delta;\mathcal F) = \langle K(g^*_\delta - f^*_\delta), K\iota\rangle$, and the estimator has Nadaraya–Watson form:
$$\hat L_{\delta,\mathcal F} = Lf^*_{M,\delta} + \frac{\langle K(g^*_\delta - f^*_\delta),\; Y - Kf^*_{M,\delta}\rangle}{\langle K(g^*_\delta - f^*_\delta),\; K\iota\rangle}.$$

If $\mathcal F$ is centrosymmetric ($f \in \mathcal F \Rightarrow -f \in \mathcal F$), then $f^*_\delta = -g^*_\delta$, and (under both conditions)
$$\hat L_{\delta,\mathcal F} = \frac{2\omega'(\delta;\mathcal F)}{\delta}\langle Kg^*_\delta, Y\rangle = \frac{\langle Kg^*_\delta, Y\rangle}{\langle Kg^*_\delta, K\iota\rangle}.$$

19 Theorem 1 (One-sided minimax CI)

Let
$$\hat c_{\alpha,\delta,\mathcal F} = \hat L_{\delta,\mathcal F} - \overline{\mathrm{bias}}_{\mathcal F}(\hat L_{\delta,\mathcal F}) - z_{1-\alpha}\,\sigma\,\omega'(\delta;\mathcal F).$$
Then $[\hat c_{\alpha,\delta,\mathcal F}, \infty)$ is a $1-\alpha$ CI for $Lf$, with coverage minimized at $f^*_\delta$. For $\beta = \Phi(\delta/\sigma - z_{1-\alpha})$, it minimizes $EL_\beta(\hat c, \mathcal F)$ among all one-sided $1-\alpha$ CIs. All quantiles of excess length are maximized at $g^*_\delta$. The minimax excess length at quantile $\beta$ is $EL_\beta(\hat c_{\alpha,\delta,\mathcal F}; \mathcal F) = \omega(\delta;\mathcal F)$.

- $\beta$ is the minimax power of the underlying tests (under translation invariance)
- The bias correction is based on the worst-case bias under $\mathcal F$ and is non-random
- In the running example, using bandwidth $h$ minimizes the $\beta$ quantile of excess length at $\beta = \Phi\bigl(\omega^{-1}(2hC)/\sigma - z_{1-\alpha}\bigr)$

20

For estimation and two-sided CIs, exact optimality results are hard. Donoho (1994) shows that procedures based on $\hat L_{\delta,\mathcal F}$ are minimax optimal if we restrict attention to affine estimators. The results use the fact that the problem is just as hard if we know that $f$ lies in the one-dimensional subfamily $\{\lambda f^*_\delta + (1-\lambda)g^*_\delta : 0 \le \lambda \le 1\}$.

To state these results, consider $Z \sim N(\theta, 1)$, $\theta \in [-\tau, \tau]$:

- The minimax linear estimator is $c_\rho(\tau)Z$ with $c_\rho(\tau) = \tau^2/(1+\tau^2)$ and minimax risk $\rho(\tau) = \tau^2/(1+\tau^2)$
- The shortest fixed-length CI is $c_\chi(\tau)Z \pm \chi_\alpha(c_\chi(\tau)Z)$; the solution is characterized in Drees (1999), similar in spirit to Imbens and Manski (2004)

21 optimal shrinkage in bounded normal means

[Figure: optimal shrinkage coefficients $c_\chi(\tau)$ and $c_\rho(\tau)$ plotted against $\tau$; legend: 90%, 95%, 95% (Estimation).]

22 Theorem (Donoho (1994))

The minimax MSE affine estimator is $\hat L_{\delta,\mathcal F}$, where $\delta$ solves
$$\max_{\delta > 0}\ \frac{\omega(\delta;\mathcal F)^2}{\delta^2}\,\rho\!\left(\frac{\delta}{2\sigma}\right)\sigma^2,$$
and the optimal $\delta$ satisfies $c_\rho(\delta/(2\sigma)) = \delta\,\omega'(\delta;\mathcal F)/\omega(\delta;\mathcal F)$.

The shortest fixed-length affine CI is $\hat L_{\delta,\mathcal F} \pm \frac{\omega(\delta;\mathcal F)}{\delta}\,\sigma\,\chi_\alpha\!\left(\frac{\delta}{2\sigma}\right)$, where $\delta$ solves
$$\max_{\delta > 0}\ \frac{\omega(\delta;\mathcal F)}{\delta}\,\chi_\alpha\!\left(\frac{\delta}{2\sigma}\right)\sigma,$$
and the optimal $\delta$ satisfies $c_\chi(\delta/(2\sigma)) = \delta\,\omega'(\delta;\mathcal F)/\omega(\delta;\mathcal F)$.
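A minimal numeric sketch (our own) that solves the first-order condition $c_\rho(\delta/(2\sigma)) = \delta\omega'(\delta)/\omega(\delta)$ for the running example, reusing the made-up design from the modulus sketch after slide 16:

```python
import numpy as np
from scipy.optimize import brentq

sigma, C = 0.1, 1.0
x = np.linspace(-1.0, 1.0, 500)  # same made-up design as before

bs = np.linspace(1e-3, 0.5, 400)
deltas = np.array([np.sqrt(np.sum((2 * np.maximum(b - C * np.abs(x), 0.0))**2))
                   for b in bs])
omega = 2.0 * bs
domega = np.gradient(omega, deltas)               # ω'(δ) by finite differences

def foc(d):
    w, wp = np.interp(d, deltas, omega), np.interp(d, deltas, domega)
    tau = d / (2.0 * sigma)
    return tau**2 / (1 + tau**2) - d * wp / w     # c_ρ(τ) - δω'(δ)/ω(δ)

delta_mse = brentq(foc, deltas[1], deltas[-2])          # minimax-MSE δ
h_mse = np.interp(delta_mse, deltas, omega) / (2 * C)   # h = b/C, with 2b = ω(δ)
print(delta_mse, h_mse)                                 # ≈ 0.283 and ≈ 0.049 here
```

For this design (uniform on $[-1,1]$, $n = 500$, so $f_X(0) = 1/2$), the numeric bandwidth matches the asymptotic formula on the next slide.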

23

For example, to find the minimax MSE optimal bandwidth in the running example, solve
$$\frac{\delta^2}{4\sigma^2 + \delta^2} = c_\rho(\delta/(2\sigma)) = \frac{\delta\,\omega'(\delta;\mathcal F)}{\omega(\delta;\mathcal F)},$$
which yields, asymptotically,
$$h_{\mathrm{opt,MSE}} = \left(\frac{3\sigma^2}{C^2\, n\, f_X(0)}\right)^{1/3}(1 + o_p(1)).$$

Can also use these results to derive optimal rates of convergence (e.g. Fan (1993); Cheng, Fan, and Marron (1997)): $n^{-1/3}$ here.
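Plugging in the made-up values used in the sketches above (a sanity check of our own, not from the slides):

```python
# n = 500 points uniform on [-1, 1], so f_X(0) = 0.5; sigma = 0.1, C = 1.
sigma, C, n, fx0 = 0.1, 1.0, 500, 0.5
h_mse = (3 * sigma**2 / (C**2 * n * fx0)) ** (1 / 3)
print(h_mse)  # ≈ 0.049, shrinking at the n^{-1/3} rate; matches the numeric solve above
```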

24 adaptive inference

One-sided CIs focus on good performance under the least favorable $f \in \mathcal F$, which may be too pessimistic. Alternative: optimize excess length over a smaller class $\mathcal G$ of smoother functions,
$$\inf_{\hat c}\ \sup_{g \in \mathcal G} q_{g,\beta}(Lg - \hat c),$$
among $\hat c$ that satisfy $\inf_{f \in \mathcal F} P_f(Lf \ge \hat c) \ge 1 - \alpha$.

Amounts to directing power at smooth alternatives, while maintaining size over all of $\mathcal F$.

25 adaptive inference in running example

Associated testing problem in the running example: $H_0\colon Lf \le L_0,\ f \in \mathcal F$ against $H_1\colon Lf \ge L_0 + 2b,\ f \in \mathcal G$.

Inverting these minimax tests yields the CI that minimizes the $\beta$ quantile of excess length over $\mathcal G$, where $\beta$ is the minimax power of the test. As long as $\mathcal G$ is convex, this is still equivalent to testing a convex null against a convex alternative $\Rightarrow$ LF functions minimize the distance between the sets:
$$(f^*, g^*) = \operatorname*{argmin}_{f \in \mathcal F,\ g \in \mathcal G}\ \sum_{i=1}^n (g(x_i) - f(x_i))^2 \quad \text{s.t.}\quad Lg \ge L_0 + 2b,\ Lf \le L_0.$$

26

To make this concrete, consider $\mathcal G = \{g : g(x) = c,\ c \in \mathbb R\}$ (i.e. $g = c\iota$), and suppose $Lg \ge L_0 + b$ under the alternative.

Solution: $f^*(x) = L_0 + b - (b - C|x|)_+$ (as before), $g^*(x) = L_0 + b$.

[Figure: two panels comparing the least favorable pairs without directing power (triangular bump $g^*$ peaking at $L_0 + 2b$) and with directing power (constant $g^* = L_0 + b$ and the same triangular dip $f^*$).]

27

But $g^* - f^*$ is proportional to the difference from before, so the estimator is as before:
$$\hat L(h) = \frac{\sum_{i=1}^n (g^*(x_i) - f^*(x_i))\,Y_i}{\sum_{i=1}^n (g^*(x_i) - f^*(x_i))} = \frac{\sum_{i=1}^n k_T(x_i/h)\,Y_i}{\sum_{i=1}^n k_T(x_i/h)}$$

The worst-case bias under the null and the variance are the same as before $\Rightarrow$ same CI as before.

Summary: the one-sided CI that minimizes maximum excess length over $\mathcal F$ for $\beta = \Phi(\delta/\sigma - z_{1-\alpha})$, subject to $1-\alpha$ coverage, also minimizes $EL_{\beta'}(\hat c; \mathrm{span}(\iota))$ for $\beta' = \Phi(\delta/(2\sigma) - z_{1-\alpha})$.

28 setup for general adaptivity result

Define the ordered modulus of continuity (Cai and Low, 2004):
$$\omega(\delta; \mathcal F, \mathcal G) = \sup\{Lg - Lf : \|K(g - f)\| \le \delta,\ f \in \mathcal F,\ g \in \mathcal G\},$$
so that $\omega(\delta;\mathcal F) = \omega(\delta;\mathcal F,\mathcal F)$, and define
$$\hat L_{\delta,\mathcal F,\mathcal G} = Lf^*_{M,\delta} + \frac{\omega'(\delta;\mathcal F,\mathcal G)}{\delta}\bigl\langle K(g^*_\delta - f^*_\delta),\; Y - Kf^*_{M,\delta}\bigr\rangle,$$
so that $\hat L_{\delta,\mathcal F,\mathcal F} = \hat L_{\delta,\mathcal F}$.

The bias formulas generalize:
$$\overline{\mathrm{bias}}_{\mathcal F}(\hat L_{\delta,\mathcal F,\mathcal G}) = -\underline{\mathrm{bias}}_{\mathcal G}(\hat L_{\delta,\mathcal F,\mathcal G}) = \tfrac{1}{2}\bigl(\omega(\delta;\mathcal F,\mathcal G) - \delta\,\omega'(\delta;\mathcal F,\mathcal G)\bigr)$$

In the running example, $\hat L(h) = \hat L_{\omega^{-1}(hC;\,\mathcal F,\mathcal G),\,\mathcal F,\,\mathcal G}$.

29 Theorem 2 (One-sided adaptive CIs)

Let $\mathcal F$ and $\mathcal G \subseteq \mathcal F$ be convex, and suppose that $f^*_\delta$ and $g^*_\delta$ achieve the ordered modulus at $\delta$. Let
$$\hat c_{\alpha,\delta,\mathcal F,\mathcal G} = \hat L_{\delta,\mathcal F,\mathcal G} - \overline{\mathrm{bias}}_{\mathcal F}(\hat L_{\delta,\mathcal F,\mathcal G}) - z_{1-\alpha}\,\sigma\,\omega'(\delta;\mathcal F,\mathcal G).$$
Then, for $\beta = \Phi(\delta/\sigma - z_{1-\alpha})$, $\hat c_{\alpha,\delta,\mathcal F,\mathcal G}$ minimizes $EL_\beta(\hat c, \mathcal G)$ among all one-sided $1-\alpha$ CIs, where $\Phi$ denotes the standard normal cdf. Minimum coverage is attained at $f^*_\delta$ and equals $1-\alpha$. All quantiles of excess length are maximized at $g^*_\delta$. The worst-case $\beta$th quantile of excess length is $EL_\beta(\hat c_{\alpha,\delta,\mathcal F,\mathcal G}, \mathcal G) = \omega(\delta;\mathcal F,\mathcal G)$.

30 non-adaptivity under centrosymmetry

Suppose $\mathcal F$ is centrosymmetric and
$$f^*_{\delta,\mathcal F,\mathcal G} - g^*_{\delta,\mathcal F,\mathcal G} \in \mathcal F. \tag{1}$$
This holds for $\mathcal G$ smooth enough, e.g. $\mathcal G = \mathrm{span}(\iota)$ under translation invariance, as in the running example.

Then $0$ and $f^*_{\delta,\mathcal F,\mathcal G} - g^*_{\delta,\mathcal F,\mathcal G}$ also solve the modulus problem, and since
$$\omega(\delta;\mathcal F) = \sup\{2Lf : \|Kf\| \le \delta/2,\ f \in \mathcal F\}$$
under centrosymmetry,
$$\omega(\delta;\mathcal F,\mathcal G) = \omega(\delta;\mathcal F,\{0\}) = \sup_{f \in \mathcal F}\{Lf : \|Kf\| \le \delta\} = \tfrac{1}{2}\,\omega(2\delta;\mathcal F).$$
This implies $\hat c_{\alpha,\delta,\mathcal F,\mathcal G} = \hat c_{\alpha,\delta,\mathcal F,\{0\}} = \hat c_{\alpha,2\delta,\mathcal F}$.

31 Theorem 3 (Non-adaptivity of one-sided CIs under centrosymmetry)

Let $\mathcal F$ be centrosymmetric. Then the one-sided CI that is minimax for the $\beta$th quantile also optimizes $EL_{\beta'}(\hat c;\mathcal G)$ for any $\mathcal G$ such that the solution to the ordered modulus problem exists and satisfies (1), where $\beta' = \Phi((z_\beta - z_{1-\alpha})/2)$. In particular, the minimax CI optimizes $EL_{\beta'}(\hat c;\{0\})$.

The CI that is minimax for median excess length among 95% CIs also optimizes the $\Phi(-1.645/2)$ quantile under the zero function.

32 bound on adaptivity

The CI $[\hat c_{\alpha,\sigma(z_\beta+z_{1-\alpha}),\mathcal F},\infty)$ that is minimax for the $\beta$th quantile of excess length is unbiased at $0$, and its $\beta$th quantile of excess length at the zero function satisfies
$$q_{0,\beta}\bigl(L0 - \hat c_{\alpha,\sigma(z_\beta+z_{1-\alpha}),\mathcal F}\bigr) = \tfrac{1}{2}\bigl(\omega'(\delta;\mathcal F)\delta + \omega(\delta;\mathcal F)\bigr), \qquad \delta = \sigma(z_\beta + z_{1-\alpha}).$$
Hence its efficiency relative to the adaptive bound is
$$\frac{\omega(\delta;\mathcal F,\mathcal G)}{q_{0,\beta}\bigl(L0 - \hat c_{\alpha,\sigma(z_\beta+z_{1-\alpha}),\mathcal F}\bigr)} = \frac{\omega(\delta;\mathcal F,\mathcal G)}{\tfrac{1}{2}(\omega'(\delta)\delta + \omega(\delta))} = \frac{\omega(2\delta)}{\omega'(\delta)\delta + \omega(\delta)}.$$

Typically $\omega(\delta;\mathcal F) = A\delta^r(1 + o(1))$ as $n \to \infty$, where $r$ determines the optimal rate of convergence of the MSE. The ratio above is then $2^r/(1+r)$, so for $1/2 \le r \le 1$ the minimax CI has asymptotic efficiency of at least 94.3% when in fact $f = 0$. Adapting to a $\mathcal G$ that includes $0$ is at least as hard as adapting to the zero function.
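A quick check of the 94.3% figure (our own arithmetic): with $\omega(\delta) = A\delta^r$, the ratio equals $2^r/(1+r)$, which we minimize over $r$.

```python
import numpy as np

r = np.linspace(0.5, 1.0, 501)
eff = 2**r / (1 + r)               # asymptotic efficiency of the minimax CI at f = 0
print(eff.min(), r[eff.argmin()])  # ≈ 0.943, attained at r = 1/2
```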

33 implications of non-adaptivity result

- Need a shape restriction or non-convexity for adaptation. Similar to the impossibility results in Low (1997) and Cai and Low (2004) for two-sided CIs, and in contrast to the positive results for MSE.
- The minimax rate of shrinkage describes the actual rate for all functions in the class.
- It is possible to construct estimators that do better when $f$ is smoother, but impossible to tell how well you did.
- For valid inference in cases where $\mathcal F$ is convex and centrosymmetric, one has to think hard about the appropriate $C$. It is not possible to estimate it from the data and do better than if we assume the worst possible case.

34 adaptivity under monotonicity

Suppose, in the running example, that we know $f$ is non-increasing. Least favorable functions without and with directing power:

[Figure: two panels with the least favorable pairs under the monotonicity restriction, now monotone and supported on $[-2b/C, 2b/C]$.]

35

Without directing power, the optimal estimator is again given by the triangular kernel, but now includes a bias correction (to ensure $\overline{\mathrm{bias}} = -\underline{\mathrm{bias}}$):
$$\hat L(h) = \frac{\sum_i k_i Y_i/\sigma_i^2}{\sum_i k_i/\sigma_i^2} + b\,\frac{\sum_i \mathrm{sign}(x_i)\,k_i(1 - k_i)/\sigma_i^2}{\sum_i k_i/\sigma_i^2},$$
where $k_i = k_T(x_i/h)$ and $\sigma_i^2 = \sigma^2(x_i)$; the optimal bandwidth is bigger than without monotonicity. About a 20% reduction in the quantiles of excess length.

With directing power, the optimal estimator averages all observations with $x_i > 0$, and averages the observations with $x_i < 0$ using the triangular kernel. Excess length shrinks at the parametric rate. When the Lipschitz assumption is dropped and only monotonicity is maintained, the optimal estimator averages all observations with $x_i > 0$, and excess length still shrinks at the parametric rate.

36 two-sided adaptive cis

Fixed-length confidence intervals cannot be adaptive. Cai and Low (2004) construct random-length confidence intervals that are within a constant factor of a lower bound on expected length. Cai, Low, and Xia (2013) construct random-length confidence intervals under shape constraints that have near-minimum expected length for each individual function (again within a constant).

37

Natural best-case scenario for two-sided CIs: optimize expected length at a single function, $\mathcal G = \{g\}$. By Pratt (1961), inverting UMP tests against $g$ achieves exactly this. Again amounts to testing a convex null against a convex alternative; the LF function under the null solves
$$f^*_\theta = \operatorname*{argmin}_{f \in \mathcal F,\, Lf = \theta}\ \sum_{i=1}^n (f(x_i) - g(x_i))^2.$$

Theorem 4 (Adaptation to a function). The CI with minimum expected measure $E_g\lambda(\mathcal C)$ subject to $1-\alpha$ coverage on $\mathcal F$ inverts the family of tests $\phi_\theta$, where $\phi_\theta$ rejects for large values of $\langle K(g - f^*_\theta), Y\rangle$, with critical value given by its $1-\alpha$ quantile under $f^*_\theta$.

38 cis based on suboptimal estimators

What is the efficiency loss of CIs around suboptimal affine estimators? Affine estimators are Normal, with a variance that doesn't depend on $f$ and a bias that does. For each performance criterion, only the worst-case bias and the variance matter: if we can calculate them, then we can also calculate the maximum MSE, and the form of one- and two-sided CIs.

Let $\chi_\alpha(B)$ solve
$$P(|Z + B| \le \chi) = \Phi(\chi - B) - \Phi(-\chi - B) = 1 - \alpha.$$
Then for an estimator $\hat L$ with variance $V$ and maximum absolute bias $B$, the shortest fixed-length CI centered at $\hat L$ is $\hat L \pm V^{1/2}\chi_\alpha(B/V^{1/2})$.
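A minimal sketch (our own; the name `cv` is ours) computing $\chi_\alpha(B)$ by root-finding:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def cv(B, alpha=0.05):
    """χ_α(B): solves Φ(χ - B) - Φ(-χ - B) = 1 - α."""
    f = lambda chi: norm.cdf(chi - B) - norm.cdf(-chi - B) - (1 - alpha)
    return brentq(f, 1e-8, B + 10.0)  # a root always lies in this bracket

print(cv(0.0))  # ≈ 1.96: with no bias, the usual two-sided critical value
print(cv(0.5))  # ≈ 2.18: worst-case bias of half a standard deviation
```

The CI is then formed as $\hat L \pm V^{1/2}\,\mathtt{cv}(B/V^{1/2})$.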

39 Theorem 5 (Suboptimal estimators)

Let $\hat L = a + \langle w, Y\rangle$ be an affine estimator. Then $[\hat L - \overline{\mathrm{bias}}_{\mathcal F}(\hat L) - \|w\| z_{1-\alpha}\sigma,\ \infty)$ is a valid CI, and $\hat L \pm \sigma\|w\|\,\chi_\alpha\bigl(\overline{\mathrm{bias}}_{\mathcal F}(\hat L)/(\sigma\|w\|)\bigr)$ is the shortest fixed-length $1-\alpha$ CI centered at $\hat L$.

Not a deep result, but very useful: it allows us to compute the exact efficiency loss from using suboptimal estimators, or the size distortion of CIs with (pointwise) asymptotic justification. An asymptotic version of this theorem can be used to calculate the asymptotic efficiency loss from using a suboptimal kernel and/or a suboptimal bandwidth.

40 suboptimal estimators in running example

Consider some other kernel $k$ in the running example,
$$\hat L = \frac{\sum_i k(x_i/h)\,Y_i}{\sum_i k(x_i/h)}$$

Variance: $\displaystyle \sigma^2\,\frac{\sum_i k(x_i/h)^2}{\bigl(\sum_i k(x_i/h)\bigr)^2}$

Maximum bias, since $f \in \mathcal F_{\mathrm{Lip}}(C)$:
$$\left|\frac{\sum_i k(x_i/h)\bigl(f(x_i) - f(0)\bigr)}{\sum_i k(x_i/h)}\right| \le C\,\frac{\sum_i k(x_i/h)\,|x_i|}{\sum_i k(x_i/h)}.$$
The bound is attained at $f(x) = -C|x|$ if $k \ge 0$; otherwise it gives an upper bound.
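A minimal sketch (our own, with made-up design values) computing this worst-case bias and the standard deviation for a generic nonnegative kernel:

```python
import numpy as np

def worst_case_bias_and_sd(x, h, C, sigma, k):
    """Worst-case bias over F_Lip(C) and sd of the kernel estimator of f(0)."""
    w = k(x / h)
    bias = C * np.sum(w * np.abs(x)) / np.sum(w)   # attained at f(x) = -C|x| if k >= 0
    sd = sigma * np.sqrt(np.sum(w**2)) / np.sum(w)
    return bias, sd

x = np.linspace(-1.0, 1.0, 500)
epanechnikov = lambda u: np.maximum(1.0 - u**2, 0.0)
print(worst_case_bias_and_sd(x, h=0.3, C=1.0, sigma=0.1, k=epanechnikov))
```

Combined with the critical value $\chi_\alpha$ from slide 38, this is everything needed to form the fixed-length CI around the suboptimal estimator.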

41 Finite-Sample results Asymptotic results Applications Conclusion

42 renormalization

In many cases (depending on $L$ and the smoothness of $\mathcal F$, but including inference at a point and RD), the nonparametric regression problem is asymptotically equivalent to the white noise model
$$Y(dt) = f(t)\,dt + \sigma\,\epsilon(dt);$$
see Brown and Low (1996) and Donoho and Low (1992). In the running example, this holds with $\sigma^2 = \sigma(0)^2/(n f_X(0))$.

Suppose $\mathcal F = \{f : J(f) \le C\}$ for some $J$ (as in the running example), and that for the white noise model the following functionals are homogeneous:
$$J(af(\cdot/h)) = a h^{-s_J} J(f), \qquad \langle K a_1 f(\cdot/h),\, K a_2 g(\cdot/h)\rangle = a_1 a_2 h^{-2s_K}\langle Kf, Kg\rangle, \qquad L(af(\cdot/h)) = a h^{-s_L} Lf$$

In the running example, $s_L = 0$, $s_J = 1$, and $s_K = -1/2$; for the Taylor class $\mathcal F_{T,2}$, $s_J = 2$.

43

The (single-class) modulus problem then renormalizes: if $g^*_{C,\delta}, f^*_{C,\delta}$ solve
$$\max\ L(f_1 - f_0) \quad \text{s.t.}\quad \|K(f_1 - f_0)\| \le \delta,\ J(f_1) \le C,\ J(f_0) \le C,$$
then
$$g^*_{C,\delta} = a\,g^*_{1,1}(\cdot/h), \qquad f^*_{C,\delta} = a\,f^*_{1,1}(\cdot/h), \qquad \omega_C(\delta) = C^{1-r}\delta^r\,\omega_1(1),$$
where
$$a = \delta^{s_J/(s_J - s_K)}\,C^{s_K/(s_K - s_J)}, \qquad h = (C/\delta)^{1/(s_K - s_J)}, \qquad r = \frac{s_L - s_J}{s_K - s_J}.$$

The root of the minimax MSE and the (excess) length of CIs will shrink at the rate $n^{-r/2}$.

44 optimal bandwidths

The class of optimal estimators can be written as
$$\hat L_\delta = \hat L(h) = h^{2(s_K - s_L)}\bigl\langle Kk(\cdot/h),\, Y\bigr\rangle + C h^{s_J - s_L}\bigl(Lf^*_{M,1,1} - \langle Kk,\, Kf^*_{M,1,1}\rangle\bigr),$$
with $h = (C/\delta)^{1/(s_K - s_J)}$ and kernel $k(u) = r\,\omega_1(1)\,(g^*_{1,1} - f^*_{1,1})(u)$.

Recall that the optimal $\delta$ is given by $c_l(\delta/(2\sigma)) = \delta\omega'(\delta)/\omega(\delta)$, where $l$ indexes the performance criterion. Plugging in the definition of $h$ yields the optimal bandwidth
$$h^* = \left(\frac{2\sigma\, c_l^{-1}(r)}{C}\right)^{1/(s_J - s_K)},$$
where, for one-sided CIs, $c_\beta^{-1}(r) = (z_\beta + z_{1-\alpha})/2$.

45 ratios of optimal bandwidths, $s_K = -1/2$, $s_L = 0$

[Figure: ratios of optimal bandwidths for CIs to optimal MSE bandwidths, plotted against $r$; legend: 0.95 (two-sided), 0.99 and 0.95 (one-sided, q = 0.8), 0.99 and 0.95 (one-sided, q = 0.5).]

46 takeaways from picture

Optimal bandwidth ratios depend only on the dilation exponents $s_L$, $s_K$, and $s_J$:
$$\frac{h^*_l}{h^*_{l'}} = \left(\frac{c_l^{-1}(r)}{c_{l'}^{-1}(r)}\right)^{1/(s_J - s_K)}$$

- Bandwidths are of the same order in all cases: no undersmoothing
- For one-sided CIs, the bandwidth gets larger with the quantile that we are minimizing
- For 95+% two-sided CIs, if $s_L = 0$ and $s_K = -1/2$, the optimal fixed-length CI uses a larger bandwidth than the optimal MSE bandwidth
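A small sketch (our own arithmetic from the displayed ratio): the ratio of the one-sided-CI-optimal bandwidth to the minimax-MSE bandwidth, using $c_\beta^{-1}(r) = (z_\beta + z_{1-\alpha})/2$ and $c_\rho^{-1}(r) = \sqrt{r/(1-r)}$ (the latter inverts $c_\rho(\tau) = \tau^2/(1+\tau^2)$ at the first-order-condition value $r$). The particular $r$ and exponent below are the Taylor-class values implied by slide 42.

```python
import numpy as np
from scipy.stats import norm

def bw_ratio_onesided_vs_mse(r, beta, alpha, sJ_minus_sK):
    c_oci = (norm.ppf(beta) + norm.ppf(1 - alpha)) / 2  # c_β^{-1}(r), one-sided CI
    c_rho = np.sqrt(r / (1 - r))                        # c_ρ^{-1}(r), minimax MSE
    return (c_oci / c_rho) ** (1 / sJ_minus_sK)

# Taylor class F_{T,2}: r = 4/5 and s_J - s_K = 5/2
print(bw_ratio_onesided_vs_mse(r=0.8, beta=0.8, alpha=0.05, sJ_minus_sK=2.5))  # ≈ 0.83
```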

47

For any bandwidth $h$, the worst-case bias is
$$\frac{1-r}{2r}\,C\,h^{s_J - s_L}\left(\int k^2\right)^{1/2}.$$
Can use this worst-case bias to construct CIs around $\hat L(h)$.

How much bigger are two-sided CIs around the minimax MSE bandwidth? The ratio of CI lengths is given by
$$\left(\frac{c_{\chi,\alpha}^{-1}(r)}{c_\rho^{-1}(r)}\right)^{r-1}\frac{\chi_\alpha\bigl(c_{\chi,\alpha}^{-1}(r)(1/r - 1)\bigr)}{\chi_\alpha\bigl(c_\rho^{-1}(r)(1/r - 1)\bigr)},$$
where $\chi_\alpha(B)$ solves $P(|N(0,1) + B| \le \chi) = \Phi(\chi - B) - \Phi(-\chi - B) = 1 - \alpha$.

Need to use $\chi_\alpha\bigl(c_\rho^{-1}(r)(1/r - 1)\bigr) = \chi_\alpha\bigl(\sqrt{(1-r)/r}\bigr)$ instead of $z_{1-\alpha/2}$ as the critical value to ensure coverage for the CI around the minimax MSE bandwidth.
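Evaluating the adjusted critical value (a check of our own, reusing the $\chi_\alpha$ solver from slide 38):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def cv(B, alpha=0.05):
    """χ_α(B): solves Φ(χ - B) - Φ(-χ - B) = 1 - α."""
    return brentq(lambda c: norm.cdf(c - B) - norm.cdf(-c - B) - (1 - alpha),
                  1e-8, B + 10.0)

# Worst-case bias/sd at the minimax-MSE bandwidth: c_ρ^{-1}(r)(1/r - 1) = sqrt((1-r)/r)
r = 0.8                          # e.g. the Taylor-class rate exponent
print(cv(np.sqrt((1 - r) / r)))  # ≈ 2.18: the critical value replacing 1.96
```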

48 length of optimal cis relative to cis around mse bandwidth

[Figure: percentage decrease in CI length from using the CI-optimal bandwidth instead of the MSE bandwidth, plotted against $r$.]

49 critical values for ci around mse bandwidth

[Figure: the adjusted critical value for the CI around the MSE bandwidth, plotted against $r$.]

50 undercoverage with usual critical values

[Figure: coverage of the CI around the MSE bandwidth when the usual critical value $z_{1-\alpha/2}$ is used, plotted against $r$.]

51 takeaways from pictures

- To construct two-sided CIs, one can keep the same bandwidth as for estimation; the price is an increase in length of less than 2% for 95% CIs
- Need to use a slightly higher critical value to ensure proper coverage

52 suboptimal kernels

Results so far assumed the optimal kernel. Under renormalization, the maximum bias and the variance renormalize in a similar way for suboptimal kernels.

For any kernel $k$, let $\tilde h_k$ be the bandwidth that equates the maximum bias and the standard deviation, and let
$$w(k) = \mathrm{se}\bigl(\hat L_k(\tilde h_k)\bigr) = \sup_f \mathrm{bias}_f\bigl(\hat L_k(\tilde h_k)\bigr).$$
Suppose the performance criterion scales linearly with the maximum bias and the standard deviation.

53 Theorem 6 (Efficiency loss of suboptimal kernels)

1. The relative efficiency of $k$ and $\tilde k$ (where the optimal bandwidth is used in both cases) does not depend on the performance criterion, and is given by $w(k)/w(\tilde k)$.
2. The results for ratios of optimal bandwidths remain unchanged for suboptimal kernels.
3. The efficiency loss from using the bandwidth optimal for a different criterion, rather than the bandwidth optimal for the criterion of interest, remains unchanged for suboptimal kernels.

54 corollaries

The bounds for the minimax MSE efficiency of different kernels in Cheng, Fan, and Marron (1997):

1. are tight; and
2. hold for other efficiency criteria as well.

Using the minimax MSE bandwidth for two-sided CIs is a good idea no matter what kernel one uses.

55 Finite-Sample results Asymptotic results Applications Conclusion

56 rd

Interested in $Lf = \lim_{x \downarrow 0} f(x) - \lim_{x \uparrow 0} f(x)$. Let $f_+(x) = f(x)\,1(x > 0)$ and $f_-(x) = f(x)\,1(x < 0)$, so that $f = f_+ + f_-$. We consider the class
$$\mathcal F_{RDT,2}(C) = \bigl\{f_+ + f_- : f_+ \in \mathcal F_{T,2}(C;\mathbb R_+),\ f_- \in \mathcal F_{T,2}(C;\mathbb R_-)\bigr\},$$
where $\mathcal F_{T,2}(C;\mathcal X)$ is the class from Sacks and Ylvisaker (1978),
$$\mathcal F_{T,2}(C;\mathcal X) = \bigl\{f : |f(x) - f(0) - f'(0)x| \le Cx^2 \text{ for all } x \in \mathcal X\bigr\}.$$
$\mathcal F_{T,2}$ is also used in Cheng, Fan, and Marron (1997) for estimation at a point, which justifies much of empirical RD practice.

57 least favorable functions

Least favorable functions are symmetric, $g^*_\delta(x) = -f^*_\delta(x)$, and have the form
$$g^*_\delta(x) = \bigl[(b + d_+x - Cx^2)_+ - (b + d_+x + Cx^2)_-\bigr]1(x > 0) - \bigl[(b + d_-x - Cx^2)_+ - (b + d_-x + Cx^2)_-\bigr]1(x < 0),$$
with $b, d_+, d_-$ chosen to solve
$$0 = \sum_{i=1}^n \frac{g^*_{-,b,C}(x_i)\,x_i}{\sigma^2(x_i)}, \qquad 0 = \sum_{i=1}^n \frac{g^*_{+,b,C}(x_i)\,x_i}{\sigma^2(x_i)}, \qquad \text{and} \qquad \sum_{i=1}^n \frac{g^*_{+,b,C}(x_i)}{\sigma^2(x_i)} = \sum_{i=1}^n \frac{g^*_{-,b,C}(x_i)}{\sigma^2(x_i)},$$
where $g^*_{+,b,C}$ and $g^*_{-,b,C}$ denote the two bracketed pieces on either side of the cutoff.

58 optimal kernel

[Figure: the optimal equivalent kernel $k(u)$ plotted against $u$.]

Asymptotically, $g^*_\delta$ corresponds to the difference between two kernel estimators, with bandwidths chosen to equate the number of effective observations on each side of the cutoff. The optimal kernel is the same as for inference at a point, derived in Cheng, Fan, and Marron (1997) using an upper bound on the minimax MSE.

59 application to Lee (2008)

RD design:

- $X_i$ = margin of victory in the previous election for the Democratic party (negative for a Republican victory)
- $Y_i$ = Democratic vote share in a given election
- $D_i = I(X_i \ge 0)$ = indicator for Democratic incumbency
- $n = 6{,}558$ observations on elections between 1946 and 1998

For simplicity, assume homoscedastic errors on each side of the cutoff, with variance estimates $\hat\sigma_-(0)^2$ and $\hat\sigma_+(0)^2$ derived using the Imbens and Kalyanaraman (2012) bandwidth. The LF functions are very close to scaled versions of the optimal kernel. Unless $C$ is very small, the results are in line with Lee (2008) and Imbens and Kalyanaraman (2012).

60 minimax mse estimator as function of c

[Figure: minimax MSE estimate of the electoral advantage (%) as a function of the smoothness constant $C$, with the variance and squared-bias contributions shown; secondary axis: effective number of observations.]

61 optimal fixed-length cis

[Figure: optimal fixed-length CIs (upper and lower endpoints and the estimate) for the electoral advantage (%) as a function of $C$, with the variance and squared-bias contributions shown; secondary axis: effective number of observations.]

62 Finite-Sample results Asymptotic results Applications Conclusion

63 summary

1. We give exact results for (i) minimax optimal and (ii) adaptive one-sided CIs.
   - The CIs use a non-random bias correction based on the worst-case bias
   - Adaptivity without shape restrictions is severely limited, as in the two-sided case: impossible to avoid thinking hard about the appropriate $C$
2. We give an exact solution to the problem of adaptation to a function.
3. We use these finite-sample results to characterize optimal tuning parameters for different performance criteria:
   - building CIs around the minimax MSE bandwidth is nearly optimal
   - undersmoothing cannot be optimal
