Coverage Error Optimal Confidence Intervals
Sebastian Calonico    Matias D. Cattaneo    Max H. Farrell

August 3, 2018

Abstract

We propose a framework for ranking confidence interval estimators in terms of their uniform coverage accuracy. The key ingredient is the (existence and) quantification of the error in coverage of competing confidence intervals, uniformly over some empirically-relevant class of data generating processes. The framework employs the check function to quantify coverage error loss, which allows researchers to incorporate their preferences over under- and over-coverage; confidence intervals attaining the best-possible uniform coverage error are minimax optimal. We demonstrate the usefulness of our framework with three distinct applications. First, we establish novel uniformly valid Edgeworth expansions for nonparametric local polynomial regression, offering some technical results that may be of independent interest, and use them to characterize the coverage error of, and rank, confidence interval estimators for the regression function and its derivatives. As a second application, we consider inference in least squares linear regression under potential misspecification, ranking interval estimators utilizing uniformly valid expansions already established in the literature. Third, we study heteroskedasticity-autocorrelation robust inference to showcase how our framework can unify existing conclusions. Several other potential applications are mentioned.

Keywords: minimax bound, Edgeworth expansion, nonparametric regression, robust bias correction, linear models, bootstrap, heteroskedasticity-autocorrelation robust inference, optimal inference.

The second author gratefully acknowledges financial support from the National Science Foundation (SES and SES ). We thank Federico Bugni, Chris Hansen, Michael Jansson, Andres Santos, Azeem Shaikh, Rocio Titiunik, and participants at various seminars and conferences for comments.
Department of Economics, University of Miami. Department of Economics and Department of Statistics, University of Michigan. Booth School of Business, University of Chicago.
1 Introduction

Researchers typically have a range of options for constructing confidence intervals for a parameter of interest in empirical work. These options arise from many sources: how the sampling distribution is approximated, how standard errors or quantiles are constructed, how tuning and smoothing parameters are selected, or what concept of validity/robustness is used. Often many competing interval estimators are valid in some principled sense, but it is difficult to choose which are best among them. When two options are asymptotically equivalent to first order, for a given data generating process, practical guidance does not follow from theory. Further, it is rare that economic or other subject-specific theory dictates a choice of inference procedure. Instead, the model and accompanying assumptions formalize what the researcher believes to be a plausible class of distributions that could have generated the data, and she would like some assurance that the chosen confidence interval is accurate in level regardless of the specific data generating process. It is natural and desirable to know if any confidence intervals are more accurate than others over this class of distributions. We propose a framework that quantifies inference quality according to the coverage error of competing confidence interval estimators, uniformly over a class of data-generating processes (DGPs), thereby allowing us to rank those estimators. In a nutshell, we look at the worst-case coverage error for a given confidence region over the distributions allowed by the researcher's assumptions and then use this information to characterize an optimal inference procedure: a minimax notion of optimality. We employ the check function loss to quantify coverage error, which allows for asymmetric penalization of under- or over-coverage in the optimality criterion.
For example, it is common practice to prefer conservative confidence intervals, perhaps at the expense of interval length, and our proposed framework can incorporate such a preference directly. We necessarily restrict attention to classes of confidence regions for which the uniform coverage error can be quantified in some way. In the generic framework outlined in Section 2, we give nested levels of knowledge required of such a quantification. At heart, each of these involves some degree of higher-order asymptotic analysis (which is precisely how we can distinguish between first-order equivalent procedures). The weakest assumption is simply bounds on the rate of decay of the worst-case coverage error. When these bounds are nontrivial, we are able to conclusively rank
some procedures. The strongest assumption consists of a full and precise quantification of the leading terms of coverage error, in which case we can rank all procedures, identify minimax optimal rates (of coverage error decay), and single out optimal inference procedures. Optimization over constants is also possible in our framework. We show through examples that the required levels of knowledge are available in many contexts, and even the most stringent requirement is often met. In an application to nonparametric regression, we develop a uniformly valid coverage error expansion and use it to give a novel recipe for optimal inference. In other applications, we rely on existing results, from bounds to expansions, and show how our framework can be used to unify and extend previous rankings among competing inference procedures. In all cases, our framework gives principled guidance to practitioners. In our framework, both the class of confidence intervals and the class of data generating processes play a crucial role: together they determine the lower ("min") and upper ("max") portions of the minimax criterion, and in particular, neither class should be too large nor too small if one is to obtain useful and interesting results. If the class of intervals is too small, it will not reflect the range of choices available to the practitioner, while if it is too large, the optimal procedure may be infeasible or not useful. For example, the interval estimator set to the real line with probability (1 − α) and empty otherwise yields perfect coverage over any possible distribution, but is uninformative. In much the same way, if the assumed class of distributions is too small, it is unlikely to be rich enough to be useful in real-world applications.
Economic or other field-specific theory often does not impose tight restrictions on the distribution of the data (such as Gaussianity), and any inference procedure designed to be optimal or valid under such restrictions may exhibit poor behavior when the restrictions do not hold: the class is too small for the uniformity to be interesting. On the other hand, some restrictions on the distribution are necessary, because if the class is too large then uniformly valid and informative inference procedures cannot be constructed, an idea with a long history that has been studied in a variety of settings (see Bahadur and Savage, 1956; Dufour, 1997; Romano and Wolf, 2000; Romano, 2004; Hirano and Porter, 2012, just to name a few examples). The interplay between the two classes will be crucial for our results. To illustrate, consider nonparametric regression, which we study in Section 3. In this context, the class of intervals may restrict attention to only low-order approximations, for example, all methods removing bias up to a certain order, even if the underlying functions of the DGP are assumed to possess more smoothness.
Alternatively, it could be that the intervals are based on approximations that try to utilize more smoothness than is available. The optimal coverage error rate is affected by these assumptions, as we study in detail below. This interplay impacts the optimal coverage error rate in other contexts as well, and applies to conditions other than smoothness, such as orthogonality of the errors in a linear model, as studied by Kline and Santos (2012). We employ their work to provide a second application of our generic framework in Section 4. Highlighting the importance of this interplay between the class of DGPs and the class of confidence intervals when developing coverage error minimax optimality is an additional methodological contribution that emerges naturally from our general framework (and the specific examples we study). The remainder of this paper proceeds as follows. Section 2 proposes our generic framework for ranking confidence interval estimators. Sections 3, 4, and 5 then consider three distinct applications of our framework to, respectively, nonparametric local polynomial regression, traditional least squares linear regression, and heteroskedasticity-autocorrelation robust (HAR) inference. The breadth of these applications spans parametric and nonparametric estimands and nuisance parameters, standard and nonstandard limiting distributions, and cross-sectional and dependent data. Section 6 briefly mentions several other possible applications and concludes. The online supplement contains omitted formulas and proofs. Beyond its use as an ingredient in our framework, Section 3 contains new technical and methodological results that may be of independent interest.
We establish uniformly valid Edgeworth expansions for local polynomial regression, and use them to derive inference-optimal bandwidth choices that minimize coverage error or, alternatively, balance coverage error against interval length, which may lead to a shorter (more powerful) interval that is still valid. Calonico, Cattaneo, and Farrell (2018b) develop these ideas further for the specific case of regression discontinuity designs.

1.1 Related Literature

Romano (2004), Wasserman (2006), and Romano, Shaikh, and Wolf (2010) give introductions to uniform validity and optimality of confidence interval estimators in particular, and statistical inference more generally, thus providing background review and references for this paper. Hall and Jing (1995) is the paper most closely related to our work: they establish minimax bounds for one-sided confidence interval estimators based on t-test statistics, relying on Edgeworth expansions
for the Studentized sample mean. Our general framework was inspired by their paper but is quite different from their work and, to the best of our knowledge, new in the literature. To be more specific, while we also consider a minimax optimality criterion to rank confidence intervals, our framework applies more generally to a large class of (one- or two-sided) confidence intervals and under a wide range of conceptually distinct sufficient conditions, which can be verified in many other settings beyond the case of a parametric location model under i.i.d. data. The idea of ranking inference procedures using coverage error, or the equivalent notion of error in rejection probability, has appeared before in the econometrics literature. Two important examples are Jansson (2004) and Bugni (2010, 2016), who give Berry-Esseen-type bounds for inference procedures as a means of ranking them. Our framework encompasses this type of bound and conclusion as one possible (weak) way of ranking confidence intervals, although it goes beyond their specific approach: neither of these works, nor others that we are aware of, have laid out the minimax framework of Section 2. Nevertheless, these papers are discussed in more detail below in the context of our framework and specific examples. Ranking inference procedures in general, sometimes uniformly over data-generating processes, has a longer history in econometrics and statistics. We cannot hope to do justice to this literature, and hence only mention a few relevant and recent examples beyond those already cited. Beran (1982) studies the uniform optimality of bootstrap inference procedures. Backus (1989) and Donoho (1994) construct minimax confidence intervals for regression under parametric (e.g., Gaussian) assumptions, with the latter reference also showing some interval length optimality properties. Rothenberg (1984) develops higher-order (size and power) comparisons of classical testing procedures.
Horowitz and Spokoiny (2001) discuss minimax optimal rates for hypothesis testing. Schafer and Stark (2009) construct confidence intervals with optimal expected size and some minimax optimality. Elliott, Müller, and Watson (2015) establish upper bounds on power for tests involving a nuisance parameter. Müller and Norets (2016) propose a notion of betting-based uniformity guarantees as a measure of (minimax) inference quality. More references and related approaches are discussed in these works.
2 Framework

Our framework centers around inference on a parameter denoted θ, and we write θ_F for the value of the true parameter when the data are generated according to distribution F. We study confidence interval estimators for θ_F, denoted I, that have nominal 100(1 − α)% coverage. Our ultimate goal is to provide a ranking of all confidence intervals I in a class I, uniformly over a class F of DGPs, where the ranking is determined by the accuracy of coverage with respect to the nominal level. This ranking requires some knowledge of the coverage error, and naturally the more that is known, the more precise the ranking will be. This is formalized in the three assumptions below, Assumptions CEB, CER, and CEE, which are progressively stronger. We first discuss our measure of coverage accuracy as well as the underlying classes F and I. The class F of plausible DGPs is defined by the modeling assumptions and the empirical regularities of the application of interest, and the researcher would like some assurance that coverage is accurate no matter which F ∈ F generated the data. Thus, it is reasonable to evaluate a confidence interval I ∈ I by studying the worst-case coverage error within the plausible set of DGPs. We thus measure the worst-case coverage error as

    sup_{F ∈ F} L( P_F[θ_F ∈ I] − (1 − α) ),    (2.1)

where L(e) = L_τ(e) = e(τ − 1{e < 0}) is the check function for a given τ ∈ (0, 1). We focus on L_τ(e) for its concreteness and usefulness, but could use other well-behaved loss functions, or augment the loss with a penalty for interval length to rule out interval estimators that are infinite with positive probability. Using the check function loss allows the researcher, through their choice of τ, to evaluate inference procedures according to their preferences over over- and under-coverage. Setting τ = 1/2 recovers the usual, symmetric measure of coverage error. Guarding more against undercoverage requires choosing τ < 1/2.
For example, setting τ = 1/3 encodes the belief that undercoverage is twice as bad as the same amount of overcoverage. Intuitively, a good confidence interval is one for which this maximal coverage error is minimized. More precisely, we seek a minimax optimal confidence interval. Since coverage error cannot be written as expected loss, our analysis does not fit neatly within traditional minimax risk analyses, and the corresponding established
tools and results do not apply. Lastly, controlling (2.1) is qualitatively different from uniform size control: see Remark 1 below. The identity and properties of an optimal confidence interval will depend not only on what is assumed about F, but also on which confidence intervals are considered, that is, the class I of confidence intervals (more generally, inference procedures) under consideration. We wish to be agnostic about both F and I, so that our results are as useful as possible and provide tight guidance for empirical practice. That is, the larger F, the more likely it is that a given data set is generated by some F ∈ F, and the larger I, the more certain the researcher can be that she is using the best procedure for inference. At the same time, both must be restricted in order to obtain effective bounds on (2.1). The sizes of the two classes, F and I, must be considered together, and interesting problems naturally balance their sizes/complexities. To illustrate these ideas, and motivate our results, consider the case of forming a confidence interval for the mean of a scalar random variable. Let the data be a random sample {X_i : i = 1, …, n} from a scalar random variable X ∼ F, and let θ_F = E_F[X]. First, consider the class I. In order to give useful and tight guidance to a researcher, I can be neither too small nor too large. If I contains very few interval types, we would not be comparing all the procedures available to a researcher. On the other hand, to see that I cannot be too large, consider the interval which is set to the real line with probability (1 − α) and empty otherwise. This interval has uniformly perfect coverage, that is, (2.1) is exactly zero, but is entirely uninformative to the researcher. This applies for any F, no matter how large, but such a seemingly powerful conclusion is only possible because I is not usefully defined.
To illustrate the role of the class F in studying (2.1), suppose that I includes the standard t interval: I = [X̄ ± t_{1−α/2} s/√n], where X̄ is the sample mean, s is the sample standard deviation, and t_{1−α/2} is the 1 − α/2 quantile of the t distribution with n − 1 degrees of freedom. If F is sufficiently restricted, then the t interval is optimal: if only Gaussian DGPs are plausible, that is, F = {F : P_F[X ≤ x] = Φ((x − θ)/σ), θ ∈ ℝ, σ² > 0}, then (2.1) is again exactly zero. The optimality of the t interval holds for any class I, no matter how large, but in parallel to the discussion above, this powerful statement is only possible because F is unrealistically small. On the other hand, well-known results, dating back at least to Bahadur and Savage (1956), show that if F is too large it is impossible to construct an effective confidence interval that controls the worst-case coverage.
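The check loss and the worst-case criterion (2.1) can be made concrete with a small Monte Carlo sketch. This is illustrative only: the three-member "class" of DGPs, the sample size, and all tuning choices are our own assumptions, and a maximum over a finite set merely mimics the supremum in (2.1).

```python
import numpy as np

T_QUANTILE_29DF = 2.0452  # precomputed t quantile, 1 - alpha/2 = 0.975, 29 df

def check_loss(e, tau=0.5):
    # Check function L_tau(e) = e * (tau - 1{e < 0}); tau < 1/2 penalizes
    # undercoverage (e < 0) more than the same amount of overcoverage.
    return e * (tau - (e < 0.0))

def t_interval_coverage(draw, n=30, alpha=0.05, reps=20_000, seed=0):
    # Monte Carlo coverage of the t interval [Xbar +/- t * s / sqrt(n)]
    # when the data come from `draw`, a mean-zero sampler (theta_F = 0).
    rng = np.random.default_rng(seed)
    x = draw(rng, (reps, n))
    xbar, s = x.mean(axis=1), x.std(axis=1, ddof=1)
    return float(np.mean(np.abs(xbar) <= T_QUANTILE_29DF * s / np.sqrt(n)))

# A toy, finite stand-in for the class F: three mean-zero distributions.
dgps = {
    "normal": lambda rng, size: rng.normal(size=size),
    "centered_exponential": lambda rng, size: rng.exponential(size=size) - 1.0,
    "student_t3": lambda rng, size: rng.standard_t(3, size=size),
}
tau = 1 / 3  # undercoverage counts double relative to overcoverage
losses = {name: check_loss(t_interval_coverage(d) - 0.95, tau)
          for name, d in dgps.items()}
worst_case = max(losses.values())  # finite-class analogue of (2.1)
```

Under the Gaussian DGP the coverage error is (up to simulation noise) zero, matching the exact-coverage discussion above, while the skewed DGP drives the worst case; enlarging the toy class of DGPs can only increase worst_case.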
The restriction that I be effective rules out setting I = ℝ with a certain probability, and other such examples. (We may augment the coverage error loss function so that such intervals are not optimal, but rarely would ranking these be of interest to researchers.) Again, this indicates that F and I must be defined together so that the problem is interesting and the results useful. In the artificial examples above it is possible to reduce (2.1) to zero, but in general this is not possible. However, it is often true that (2.1) vanishes asymptotically, as the sample size grows. Such a confidence interval is uniformly consistent. But even this is not guaranteed. For example, the t interval is pointwise consistent, that is, P_F[θ_F ∈ I] → 1 − α, provided that F permits a central limit theorem to hold for √n(X̄ − θ_F)/s, but (2.1) will not vanish without further restrictions on F. Our central aim will be to quantify the rate at which (2.1) vanishes asymptotically, and to show how this rate depends on F and I. Our rankings formalize the intuition that intervals for which (2.1) vanishes faster are preferred over those for which the rate is slower, and intervals with the fastest possible rate are minimax optimal. This gives a way to rank inference procedures that may otherwise appear equivalent. Depending on what is known about the worst-case coverage, which we will now successively build up, we can provide more informative rankings. In the remainder of the paper, all quantities may vary with n, including F and I (and their members F and I), and limits are taken as n → ∞ unless explicitly stated otherwise. We focus on scalar θ_F for concreteness, but our framework extends naturally to other types of estimands. We begin with a weak notion of ranking confidence intervals, and correspondingly we make a weak assumption about coverage error: only bounds are known for the worst-case coverage.

Assumption CEB: Coverage Error Bounds.
For each I ∈ I, there exist a non-negative sequence R_I and a positive sequence R̄_I such that

    R_I ≤ sup_{F ∈ F} L( P_F[θ_F ∈ I] − (1 − α) ) ≤ R̄_I.

This assumption requires the existence and characterization of lower and upper bounds on the worst-case coverage error of a confidence interval estimator I ∈ I. Trivial bounds are R_I = 0 and R̄_I = 1. Non-trivial bounds can be established employing Berry-Esseen-type bounds and their reversed versions, Edgeworth expansions and related methods, or other higher-order approximations to coverage error. See, for example, Rothenberg (1984), Hall (1992a), and Chen, Goldstein,
and Shao (2010) for reviews and more references. Concrete illustrations of these methods are given below. Assumption CEB is useful in comparing confidence intervals when R̄_I = o(1) and R_I > 0 for at least some I ∈ I. In this case, an interval I1 ∈ I would never be a preferred choice if there were a competing procedure I2 ∈ I whose upper bound lies below the lower bound of I1: heuristically, R̄_{I2} < R_{I1} should mean that I2 ranks above I1. The following definition formalizes this idea.

Definition 1: Domination. Under Assumption CEB, an interval I1 ∈ I is I/F-dominated if there exists I2 ∈ I such that R̄_{I2} = o(R_{I1}).

This idea parallels the notion of (in)admissibility in classical statistical decision theory, separating those confidence intervals that have the potential of being optimal from those that can never be (i.e., intervals that will always be dominated by some other interval estimator in the class I). This is a weak ranking notion for confidence intervals. Nevertheless, it is often useful. For example, in the context of partially identified parameters, Bugni (2010, 2016) compares inference procedures based on an asymptotic distributional approximation (AA), a bootstrap approximation (B), and subsampling (SS), and shows that subsampling-based inference is dominated under assumptions therein. In the notation of Assumption CEB, it is shown that R_{SS} ≍ n^{−1/3}, whereas R̄_{AA} and R̄_{B} are O(n^{−1/2}). (Bugni establishes these upper bounds pointwise in F, but they can be extended to hold uniformly under regularity conditions.) Therefore, subsampling is dominated in this specific setting. Further, we have only the trivial lower bounds R_{AA} = R_{B} = 0, and thus confidence intervals based on the asymptotic approximation and based on the bootstrap cannot be ranked. See Sections 3, 4, and 5 for more detailed examples.
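The bookkeeping behind Definition 1 can be encoded in a few lines. The following sketch is purely illustrative: it assumes polynomial bounds of the form n^(-a), represents them by their exponents, and hardcodes the Bugni-style rates just quoted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bounds:
    # Assumption CEB with polynomial bounds: lower bound ~ n^(-lower_exp)
    # and upper bound ~ n^(-upper_exp); lower_exp = 0.0 encodes the
    # trivial lower bound R = 0.
    lower_exp: float
    upper_exp: float

def is_dominated(i1: Bounds, i2: Bounds) -> bool:
    # Definition 1: I1 is dominated if some I2 has an upper bound that
    # vanishes strictly faster than I1's non-trivial lower bound.
    return i1.lower_exp > 0.0 and i2.upper_exp > i1.lower_exp

# Subsampling (SS) has a sharp n^(-1/3) worst-case rate; the asymptotic
# approximation (AA) and the bootstrap (B) have n^(-1/2) upper bounds
# but only trivial lower bounds.
SS = Bounds(lower_exp=1 / 3, upper_exp=1 / 3)
AA = Bounds(lower_exp=0.0, upper_exp=1 / 2)
B = Bounds(lower_exp=0.0, upper_exp=1 / 2)
```

Here is_dominated(SS, AA) evaluates to True, while is_dominated(AA, B) and is_dominated(B, AA) are both False: subsampling is dominated, but AA and B cannot be ranked from these bounds alone.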
Although it does not provide optimality directly, domination does hint at the notion of optimality: not being dominated by any other member of the class I is a necessary but not sufficient condition for optimality, and intervals that dominate all others in the class I should be optimal. Our next definition formalizes this idea.

Definition 2: Minimax Rate Optimal Interval. Under Assumption CEB, an interval I ∈ I is I/F-minimax coverage error rate optimal if

    liminf_{n→∞} inf_{I′ ∈ I} R̄_I^{−1} sup_{F ∈ F} L( P_F[θ_F ∈ I′] − (1 − α) ) > 0.
This definition is most interesting when R_I > 0 for all I ∈ I and R̄_I = o(1) for at least some. However, notice that we do not require uniformly, or even pointwise, consistent coverage, because R̄_I need not vanish for all I ∈ I. Such intervals can still be ranked; they are simply suboptimal in our framework. For example, this will occur in nonparametrics (Section 3) when using a mean-square error optimal bandwidth to conduct inference without bias reduction or, more generally, with intervals that are asymptotically conservative or liberal. See also Remark 3.1. Even if Assumption CEB holds with R_I > 0 and R̄_I = o(1) for all I ∈ I, we may not be able to find useful rankings if the bounds are too loose. To see why, consider an artificial example in which I has three members, with R_{I1} ≍ n^{−2}, R̄_{I1} ≍ n^{−1}, R_{I2} ≍ n^{−3/2}, R̄_{I2} ≍ n^{−1}, and R_{I3} ≍ R̄_{I3} ≍ n^{−1/2}. Here, I3 is dominated, but we are unable to further rank I1 and I2. Suppose further that we had a sharp bound for I2: R_{I2} ≍ R̄_{I2} ≍ n^{−1}. In this case, only the bounds for I1 do not agree, yet still we do not have enough information to conclusively rank I1 and I2. (Alternatively, we could restate the definition so that both were optimal.) In interesting applications of our framework, the bounds will not be loose. For many examples, including the three applications below, we can find the exact rate at which the worst-case coverage error vanishes for some I ∈ I. We will thus strengthen Assumption CEB by assuming that the lower and upper bounds exhibit the same rate, and that this rate can be characterized.

Assumption CER: Coverage Error Rate. For each I ∈ I, there exists a positive (bounded) sequence r_I such that

    0 < liminf_{n→∞} r_I^{−1} R_I ≤ limsup_{n→∞} r_I^{−1} R̄_I < ∞.

Heuristically, the idea is that the worst-case coverage error of each I ∈ I is bounded and bounded away from zero after appropriate scaling:

    c_I < r_I^{−1} sup_{F ∈ F} L( P_F[θ_F ∈ I] − (1 − α) ) < C_I,

for constants 0 < c_I ≤ C_I < ∞. This rules out intervals with zero worst-case coverage error.
Confidence intervals with exact coverage typically arise when I is too large (e.g., taking the real line with probability 1 − α) or F is too small (e.g., t-test inference in the Gaussian location model), or pertain to special cases (e.g., rank-based tests of the median under symmetry). Our framework can accommodate
situations with zero worst-case coverage error by putting all such procedures in the same equivalence class, but this is not as useful in the current context. Thus, we leave unranked procedures that do not exhibit coverage error for at least one F ∈ F: the main focus of our paper is scenarios where coverage error is unavoidable, arguably the most common case in practice. We can now formalize the minimax optimal rate in our framework.

Definition 3: Minimax Optimal Rate. Under Assumption CER, a sequence r is the I/F-minimax optimal coverage error rate if

    liminf_{n→∞} inf_{I ∈ I} r_I / r > 0.

This definition requires strictly more information than Definition 2. That is, Assumption CER is sufficient but not necessary for identifying the optimal interval in the sense of Definition 2. However, if Assumption CER holds, then any minimax optimal I will attain this rate, and any interval that attains r is of course a minimax optimal interval. Identifying r is often a crucial step in providing practical guidance. Naturally, not all procedures can attain this rate, and in many examples, even for those that can, certain implementation details must be chosen appropriately to yield an optimal interval estimator. For example, in Section 4, wild bootstrap intervals can be optimal, but even within this family, only certain bootstrap weights yield the optimal rate. Assumption CER is not as restrictive as it may seem. Indeed, often it is verified using higher-order asymptotic expansions, in which case even more is known about the worst-case coverage error. In many applications we can characterize the rate and constant of the leading term of the coverage error, as formalized in our final, and strongest, assumption.

Assumption CEE: Coverage Error Expansion. For each I ∈ I, there exist a (bounded) sequence R_{I,F}, with R_{I,F} ≠ 0 for at least one F ∈ F, and a positive (bounded) sequence r_I with R_{I,F} = O(r_I) uniformly in F ∈ F, such that

    sup_{F ∈ F} L( P_F[θ_F ∈ I] − (1 − α) − R_{I,F} ) = o(r_I).    (2.2)

For many classes of confidence intervals Assumption CEE will follow from an Edgeworth expansion or other higher-order approximation. Sections 3, 4, and 5 discuss concrete and empirically important contexts where such an approximation holds, covering parametric and nonparametric estimands and nuisance parameters, Gaussian and non-Gaussian limiting distributions, and cross-sectional and dependent data. Following the structure above, Assumption CEE is more than what is required for finding the rate-optimal interval (Definition 2) or the optimal rate (Definition 3), but with the stronger assumption more can be learned from the constants. Notice that R_{I,F} subsumes the rate and constant, and is thus not restricted to be positive (cf. Assumptions CEB and CER). Without loss of generality we can set R_{I,F} = r_{I,F} C_{I,F}, where r_{I,F} is a positive (usually vanishing) sequence, the rate, and C_{I,F} is a non-vanishing bounded sequence, forming the constant term of the expansion (when it converges). These constants will often be useful to guide practical implementation, such as tuning parameter selection. Section 3 illustrates this point by constructing data-driven coverage-optimal bandwidth selectors in the context of local polynomial nonparametric regression. Calonico, Cattaneo, and Farrell (2018b) further investigate this in the specialized setting of regression discontinuity designs. Furthermore, if C_{I,F} can be appropriately characterized, uniformly over F, it may be possible to optimize both the rate and the constants, i.e., first find the minimax coverage error rate optimal intervals and then, within this group, find the best constants. This brings us to the final definition in our framework.

Definition 4: Minimax Optimal Interval. Under Assumption CEE, an interval I ∈ I is I/F-minimax coverage error optimal if r_I = r (of Definition 3) and

    liminf_{n→∞} inf_{I′ ∈ I} sup_{F ∈ F} L(C_{I′,F}) / sup_{F ∈ F} L(C_{I,F}) ≥ 1.

This ranking requires the strongest assumption but, accordingly, is the most powerful: it gives essentially a complete and strict notion of optimality within I and F.
This level of information is not always available (see Section 5 for an example). Indeed, we will focus on rate optimality (Definitions 1-3) in the subsequent sections, both in our new results and in unifying the literature, relegating the constants to practical implementation only. In future work, we plan to further investigate the optimality notion given in Definition 4.
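As a schematic for how Definitions 3 and 4 fit together, the following toy snippet (hypothetical intervals and numbers, with worst-case expansions of the assumed form const × n^(-rate_exp)) first fixes the minimax rate and then lets constants break ties:

```python
# Hypothetical worst-case expansions sup_F L(R_{I,F}) ~ const * n^(-rate_exp);
# a larger rate_exp means faster-vanishing coverage error.
intervals = {
    "I1": {"rate_exp": 1.0, "const": 0.8},
    "I2": {"rate_exp": 1.0, "const": 0.3},
    "I3": {"rate_exp": 0.5, "const": 0.1},
}

# Definition 3: the minimax optimal rate is the fastest rate attained in I.
best_exp = max(spec["rate_exp"] for spec in intervals.values())
rate_optimal = sorted(name for name, spec in intervals.items()
                      if spec["rate_exp"] == best_exp)

# Definition 4: among the rate-optimal intervals, minimize the worst-case
# constant of the leading coverage error term.
minimax_optimal = min(rate_optimal, key=lambda name: intervals[name]["const"])
```

Note that "I3" has the smallest constant but the slowest rate, so it is eliminated at the rate stage: under this lexicographic notion, a favorable constant cannot compensate for a suboptimal rate.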
Assumptions CEB, CER, and CEE, and Definitions 1-4, complete the description of our proposed optimality framework. It is fairly general with respect to both I and F, and in many cases one or more of the assumptions is verifiable for interesting and large classes of intervals and DGPs. We now turn to three applications: nonparametric regression (Section 3), linear least squares regression (Section 4), and HAR inference (Section 5). Others are mentioned in Section 6.

Remark 1. An alternative idea when ranking confidence interval estimators is to search for the shortest and/or fastest contracting interval among those with asymptotically and/or uniformly conservative coverage (i.e., uniform size control). This method treats coverage as fixed, and not necessarily correct, and optimizes length (or power). We optimize coverage error under assumptions that will, in general, restrict attention to finite-length intervals. The two rankings need not agree because the uniform quality guarantees are different. Specifically, an interval I is uniformly asymptotically conservative at level α if for any δ > 0 there exists an n₀ = n₀(δ) such that for all n ≥ n₀, inf_{F ∈ F} P_F[θ_F ∈ I] ≥ (1 − α) − δ. In contrast, we are interested in intervals for which (2.1) vanishes, which translates to the guarantee that for all n ≥ n₀, sup_{F ∈ F} L( P_F[θ_F ∈ I] − (1 − α) ) < δ. See Romano (2004) and Romano, Shaikh, and Wolf (2010) for more discussion.

Remark 2. Our framework focuses squarely on inference quality, and not on quality of point estimation. In general, these goals are not the same, and our framework highlights the distinction: it may be possible to find an excellent approximation to the sampling distribution of a poor point estimator. There are many ways to measure the quality of a point estimator, perhaps the two most common being mean square error and, among unbiased estimators (or within a bias tolerance), precision/efficiency.
Coverage error improvements can come at the expense of these measures, and our framework can quantify this tradeoff precisely. For example, in Section 3.4 we show that the MSE optimal point estimator is suboptimal in terms of coverage error, and furthermore, in some cases the coverage error optimal interval implicitly uses a point estimator that is not even consistent in mean square, revealing a striking gap between the two notions of quality. The distinction between precision and coverage error is also evident in Section 5, where fixed-b HAR procedures are not asymptotically efficient (manifesting, in particular, as longer intervals) but offer coverage improvements. Some of the examples mentioned in Section 6 have the same features. In general, our framework reinforces the point that when the researcher seeks better statistical
inference they should choose a method explicitly for that goal.

3 Application to Local Polynomial Nonparametric Regression

The first application of our framework is to local polynomial regression. We will characterize the minimax optimal coverage error rate $r$ for a popular class of confidence interval estimators (restricting $\mathcal{I}$) under precise smoothness restrictions on the regression function (restricting $\mathcal{F}$). An important lesson of this section, recalling the discussion above, is that $r$, and the set of intervals which can attain it, depend crucially on both $\mathcal{F}$ and $\mathcal{I}$, in particular through the smoothness assumed for the population regression function (in $\mathcal{F}$) and the smoothness exploited by the interval estimator (in $\mathcal{I}$). We show that, with appropriate choices of bandwidth, standard errors, and quantiles, the robust bias corrected confidence intervals proposed by Calonico, Cattaneo, and Farrell (2018a) are minimax rate optimal in the sense of Definitions 2 and 3.

For this application, we require new technical results to verify Assumption CEE (and hence CEB and CER). Specifically, we obtain novel uniformly valid Edgeworth expansions for local polynomial estimators of the regression function and its derivatives at both interior and boundary points; these are given in the supplemental appendix and underlie Lemma 3.1 below. These results improve upon the current literature by (i) establishing uniformity over empirically-relevant classes of DGPs, (ii) covering derivative estimation, and (iii) allowing for the uniform kernel.

3.1 The Class of Data Generating Processes

To apply our proposed framework, we must make precise the classes $\mathcal{F}$ and $\mathcal{I}$. We begin with $\mathcal{F}$. For a pair of random variables $(Y, X)$, the object of interest is a derivative of the regression function at a point $x$ in the support of $X$:
$$\theta_F = \mu_F^{(\nu)}(x) := \frac{\partial^\nu}{\partial x^\nu} E_F[Y \mid X = x], \qquad (3.1)$$
with $\nu \in \mathbb{Z}_+$. For the rest of this section, we assume $x = 0$ and omit the point of evaluation (e.g., $\theta_F = \mu_F^{(\nu)}$) whenever possible.
All generic results in this section cover both the interior and boundary cases, and could be naturally extended to vector-valued data.
The class of DGPs is defined by the following set of conditions.

Assumption 3.1 (DGP). $\{(Y_1, X_1), \ldots, (Y_n, X_n)\}$ is a random sample from $(Y, X)$, distributed according to $F$. There exist constants $S \geq \nu$, $s \in (0, 1]$, $0 < c < C < \infty$, and $\delta > 8$, and a neighborhood of $x = 0$, none of which depend on $F$, such that for all $x, x'$ in the neighborhood: (a) the Lebesgue density of $X_i$, $f(\cdot)$, is continuous and $c \leq f(x) \leq C$; $v(x) := V[Y_i \mid X_i = x] \geq c$ and continuous; and $E[|Y_i|^\delta \mid X_i = x] \leq C$; and (b) $\mu(\cdot)$ is $S$-times continuously differentiable and $|\mu^{(S)}(x) - \mu^{(S)}(x')| \leq C |x - x'|^s$.

The conditions here are not materially stronger than usual, other than the requirement that they hold independently of $F$, which is used to prove uniform results. Assumption 3.1(b) highlights the smoothness assumption, which sets a limit on how quickly the worst-case coverage error can decay. The distinction and interplay between the smoothness assumed here and that utilized by the procedure will be important for our results, as made precise below. Procedures that make use of more smoothness will yield faster rates when such smoothness is available, but there is also an important optimality notion among inference procedures that exploit the same level of smoothness. To make these points precise, we must first define the class of confidence interval procedures (and estimators) considered.

3.2 The Class of Confidence Interval Estimators

We restrict $\mathcal{I}$ to contain t-test-based intervals constructed using local polynomial methods (i.e., weighted least squares regression), and discuss optimal procedures within this class. Many other ways of forming confidence intervals exist, of course, as well as other nonparametric regression techniques, and all have strengths and weaknesses. We focus on local polynomial t-test-based intervals because they are tractable and popular in empirical work. We will only briefly review local polynomial estimation (see Fan and Gijbels, 1996, for more).
Define $\hat{\mu}^{(\nu)}$ via the local regression:
$$\hat{\mu}^{(\nu)} = \nu!\, e_\nu' \hat{\beta} = \frac{1}{n h^\nu}\, \nu!\, e_\nu' \Gamma^{-1} \Omega Y, \qquad \hat{\beta} = \arg\min_{b \in \mathbb{R}^{p+1}} \sum_{i=1}^n \big(Y_i - r_p(X_i)' b\big)^2 K\Big(\frac{X_i}{h}\Big), \qquad (3.2)$$
where $p \geq \nu$ is an integer with $p - \nu$ odd, $e_\nu$ is the $(p+1)$-vector with a one in the $(\nu+1)$th position and zeros in the rest, $r_p(u) = (1, u, u^2, \ldots, u^p)'$, $K$ is a kernel or weighting function, $\Gamma = \sum_{i=1}^n (nh)^{-1} K(X_i/h)\, r_p(X_i/h)\, r_p(X_i/h)'$, $\Omega = h^{-1}[K(X_1/h) r_p(X_1/h), \ldots, K(X_n/h) r_p(X_n/h)]$, and $Y = (Y_1, \ldots, Y_n)'$. The two germane quantities here are the bandwidth sequence $h$, assumed to vanish as $n$ diverges, and $p$, the order of the polynomial, set as usual so that $p - \nu$ is odd. These are chosen by the researcher and will impact the coverage error decay rate. The rate depends on the local sample size, $nh$, and the pointwise bias, determined by $h$, $p$, and the assumed smoothness. With these chosen, and a valid standard error choice $\hat{\sigma}_p$ (detailed below), the standard t-statistic is
$$T_p = \frac{\sqrt{n h^{1+2\nu}}\, (\hat{\mu}^{(\nu)} - \theta_F)}{\hat{\sigma}_p}. \qquad (3.3)$$

Valid inference requires a choice of the tuning parameter $h$, which is often regarded as the most difficult in practice and most delicate in theory. Our framework of coverage optimality sheds new light on this problem by motivating inference-optimal (coverage error minimizing) bandwidth choices, an important result from our work for empirical research.

To formalize how the bandwidth choice impacts estimation and inference, let us begin with the most common choice by far, and indeed the default in most software packages: minimizing the mean squared error (MSE) of the point estimator $\hat{\theta}_p := \hat{\mu}^{(\nu)}(x)$. To characterize the MSE-optimal bandwidth, suppose for the moment that $p \leq S - 1$. Then the conditional mean and variance of $\hat{\theta}_p$ are:
$$E\big[\hat{\mu}^{(\nu)} \mid X_1, \ldots, X_n\big] = \mu^{(\nu)} + h^{p+1-\nu}\, \nu!\, e_\nu' \Gamma^{-1} \Lambda\, \frac{\mu^{(p+1)}}{(p+1)!} + o_P(h^{p+1-\nu}), \qquad (3.4)$$
with $\Lambda = \Omega [(X_1/h)^{p+1}, \ldots, (X_n/h)^{p+1}]'/n$, and
$$V\big[\hat{\mu}^{(\nu)} \mid X_1, \ldots, X_n\big] = \frac{1}{n h^{1+2\nu}}\, \nu!^2\, e_\nu' \Gamma^{-1} \big(h \Omega \Sigma \Omega'/n\big) \Gamma^{-1} e_\nu, \qquad (3.5)$$
with $\Sigma$ the $n \times n$ diagonal matrix with elements $v(X_i)$. The MSE-optimal bandwidth will thus obey $h_{\mathtt{mse}} \asymp n^{-1/(2p+3)}$, whenever $\mu^{(p+1)} \neq 0$.
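As a concrete illustration of the estimator in (3.2), the following sketch (hypothetical code, not the authors' implementation) computes $\hat{\mu}^{(\nu)}$ at $x = 0$ with the uniform kernel, for which the weighted least squares problem reduces to ordinary least squares on the window $\{|X_i/h| \leq 1\}$:

```python
import math
import numpy as np

def local_poly(x, y, h, p=1, nu=0):
    """Local polynomial estimate of the nu-th derivative of E[Y|X=x] at x = 0,
    i.e. the weighted least squares problem in (3.2) with the uniform kernel.
    Hypothetical helper for illustration only."""
    keep = np.abs(x / h) <= 1                          # K(X_i/h) = 1{|X_i/h| <= 1}/2
    R = np.vander(x[keep], p + 1, increasing=True)     # rows r_p(X_i)' = (1, X_i, ..., X_i^p)
    beta = np.linalg.lstsq(R, y[keep], rcond=None)[0]  # constant weights: plain LS on window
    return math.factorial(nu) * beta[nu]               # mu_hat^{(nu)} = nu! e_nu' beta_hat

# Simulated example: Y = sin(X) + noise, so mu(0) = 0 and mu'(0) = 1.
rng = np.random.default_rng(0)
n, h = 5000, 0.3
x = rng.uniform(-1, 1, n)
y = np.sin(x) + 0.1 * rng.standard_normal(n)
mu0 = local_poly(x, y, h, p=1, nu=0)  # level estimate (p - nu odd)
mu1 = local_poly(x, y, h, p=2, nu=1)  # derivative estimate (p - nu odd)
```

Both calls respect the convention that $p - \nu$ is odd; the bandwidth $h = 0.3$ here is arbitrary rather than chosen by any optimality criterion.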
(Throughout this paper, asymptotic orders and their in-probability versions hold uniformly in $\mathcal{F}$, as required by our framework; e.g., $A_n = o_P(a_n)$ means $\sup_{F \in \mathcal{F}} P_F[|A_n/a_n| > \epsilon] \to 0$ for every $\epsilon > 0$.) The rate of decay of $h_{\mathtt{mse}}$ does not depend on the
specific derivative being estimated, though the convergence rate of the point estimate $\hat{\mu}^{(\nu)}$ to $\mu_F^{(\nu)}$ will depend on $\nu$. This is a well-known feature of local polynomials, but warrants mention as the coverage error decay rate will also not depend on the derivative, as established below.

The MSE-optimal bandwidth is too large for inference: the bias term of (3.4) remains first-order important after scaling by the standard deviation, rendering standard Gaussian inference invalid. To remove this bias term, we consider two approaches: undersmoothing and robust explicit bias correction. The former simply involves choosing a bandwidth that vanishes more rapidly than $n^{-1/(2p+3)}$, rendering the bias negligible. Explicit bias correction involves subtracting an estimate of the leading term of (3.4), of which only $\mu^{(p+1)}$ is unknown; inference is then made robust by accounting for the variability of this point estimate. The estimate $\hat{\mu}^{(p+1)}$ is defined via (3.2), with $p+1$ in place of both $p$ and $\nu$ throughout, and a bandwidth $b := \rho^{-1} h$ instead of $h$. These implementation choices have a precise theoretical justification (Calonico, Cattaneo, and Farrell, 2018a). The bias corrected point estimate is
$$\hat{\theta}_{\mathtt{rbc}} = \hat{\mu}^{(\nu)} - h^{p+1-\nu}\, \nu!\, e_\nu' \Gamma^{-1} \Lambda\, \frac{\hat{\mu}^{(p+1)}}{(p+1)!} = \frac{1}{n h^\nu}\, \nu!\, e_\nu' \Gamma^{-1} \Omega_{\mathtt{rbc}} Y, \qquad \Omega_{\mathtt{rbc}} = \Omega - \rho^{p+1} \Lambda\, e_{p+1}' \bar{\Gamma}^{-1} \bar{\Omega},$$
where $\bar{\Gamma}$ and $\bar{\Omega}$ are defined as $\Gamma$ and $\Omega$ above, but with $p+1$ and $b$ in place of $p$ and $h$, respectively. Comparing to (3.2), all that has changed is the matrix premultiplying $Y$.

The final implementation choice is that of variance estimator, which is also important for coverage error. This is a crucial aspect that differentiates traditional first-order analyses, where only consistency is required, from higher-order theory, which captures explicitly the uncertainty in variance estimation (among other things). Thus part of finding the optimal procedure in our framework is a careful choice of standard errors.
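To make the bias correction concrete, the sketch below (hypothetical code, uniform kernel, $\nu = 0$) builds $\hat{\theta}_{\mathtt{rbc}}$ from its ingredients and checks a known algebraic equivalence: with $\rho = 1$ (i.e., $b = h$), robust bias correction of the order-$p$ estimate reproduces the order-$(p+1)$ point estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, p = 1000, 0.5, 1
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(n)

w = 0.5 * (np.abs(x / h) <= 1.0)   # uniform kernel weights K(X_i/h)
W = np.diag(w)

def wls(order):
    """Weighted least squares fit as in (3.2), returning the coefficient vector."""
    R = np.vander(x, order + 1, increasing=True)
    return np.linalg.solve(R.T @ W @ R, R.T @ W @ y)

theta_p = wls(p)[0]      # theta_hat_p = mu_hat(0), order-p fit
beta_bar = wls(p + 1)    # order-(p+1) fit with bandwidth b = h, i.e. rho = 1
                         # mu_hat^{(p+1)}/(p+1)! is its top coefficient beta_bar[p+1]

# Bias correction: subtract h^{p+1} (Gamma^{-1} Lambda)_0 * mu_hat^{(p+1)}/(p+1)!.
u = x / h
Ru = np.vander(u, p + 1, increasing=True)                # rows r_p(X_i/h)'
s = u ** (p + 1)
GinvLam = np.linalg.solve(Ru.T @ W @ Ru, Ru.T @ W @ s)   # Gamma^{-1} Lambda (scale factors cancel)
theta_rbc = theta_p - h ** (p + 1) * GinvLam[0] * beta_bar[p + 1]
```

The equivalence `theta_rbc == beta_bar[0]` (up to floating point) follows from the block structure of the order-$(p+1)$ normal equations; for $\rho \neq 1$ the two constructions differ.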
In general, two types of higher-order terms arise due to Studentization. One is the unavoidable estimation error incurred when replacing a population standardization, say $\sigma^2$, with a feasible Studentization, $\hat{\sigma}^2$. However, there is also an error in the difference between the population variability of the point estimate (i.e., of the numerator of the t-statistic) and the population standardization chosen. A fixed-$n$ approach is one where the Studentization $\hat{\sigma}^2$ is chosen directly to estimate $V[\sqrt{n h^{1+2\nu}}\, \hat{\theta} \mid X_1, \ldots, X_n]$, a fixed-$n$ calculation. Fixed-$n$ Studentization completely removes the second type of error. This can be contrasted with the popular practice of Studentizing with a feasible version of the asymptotic
variance, i.e., finding the probability limit of $V[\sqrt{n h^{1+2\nu}}\, \hat{\theta} \mid X_1, \ldots, X_n]$ and estimating any unknown quantities. This is valid to first order, but the difference between $V[\sqrt{n h^{1+2\nu}}\, \hat{\theta} \mid X_1, \ldots, X_n]$ and its limit manifests in the higher-order expansion, exacerbating coverage error. At boundary points these errors are $O(h)$ and thus particularly damaging to coverage. Other possibilities are available and may also be detrimental to coverage.

We can now define the class of confidence intervals we consider, which indexes choices of point estimates, standard errors, bandwidths, and quantiles. Regularity conditions are placed upon the kernel function. All of these represent choices made by the researcher, and each choice impacts the coverage error, as made precise below. Our results give practical guidance for these choices, the most important of which is the choice of bandwidth. In general, we shall write $I$ and $\mathcal{I}$, but when discussing specific choices it will be useful notationally to write the intervals as functions of these choices, such as $I(h)$ for an interval based on a bandwidth $h$ or $I(\hat{\theta}, \hat{\sigma})$ for specific choices of point estimate and standard errors. The other choices will be clear from the context. In particular, let $I_p = I(\hat{\theta}_p, \hat{\sigma}_p)$ and $I_{\mathtt{rbc}} = I(\hat{\theta}_{\mathtt{rbc}}, \hat{\sigma}_{\mathtt{rbc}})$, where $\hat{\sigma}_p$ and $\hat{\sigma}_{\mathtt{rbc}}$ are defined below.

Assumption 3.2 (Confidence Intervals). (a) $I$ is of the form
$$I = \bigg[\hat{\theta} - \frac{z_u \hat{\sigma}}{\sqrt{n h^{1+2\nu}}},\; \hat{\theta} - \frac{z_l \hat{\sigma}}{\sqrt{n h^{1+2\nu}}}\bigg] \qquad (3.6)$$
for a point estimator $\hat{\theta} = \hat{\theta}_p$ or $\hat{\theta}_{\mathtt{rbc}}$, a well-behaved standard error $\hat{\sigma}$ (defined below), fixed quantiles $z_l$ and $z_u$, a nonrandom bandwidth sequence $h = H n^{-\gamma}$, with $H$ bounded and bounded away from zero and $c \leq \gamma \leq 1 - c$ for some $c > 0$, not depending on $I$, and, if required, a fixed, bounded $\rho = h/b$. (b) The kernel $K$ is supported on $[-1, 1]$, positive, bounded, and even. Further, $K(u)$ is either constant (the uniform kernel) or $(1, K(u) r_{3(p+1)}(u)')'$ is linearly independent on $[-1, 1]$. The order $p$ is at least $\nu$ and $p - \nu$ is odd.
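The interval form in Assumption 3.2(a) is mechanical once $\hat{\theta}$ and $\hat{\sigma}$ are in hand; a minimal sketch (hypothetical helper, equal-tailed Normal quantiles so that $z_l = -z_u$):

```python
from statistics import NormalDist

def t_interval(theta_hat, sigma_hat, n, h, nu=0, alpha=0.05):
    """Interval of the form (3.6) for a given point estimate and standard error,
    with the equal-tailed choices z_l = Phi^{-1}(alpha/2) = -z_u. Hypothetical helper."""
    z_u = NormalDist().inv_cdf(1 - alpha / 2)
    scale = (n * h ** (1 + 2 * nu)) ** 0.5  # sqrt(n h^{1+2 nu})
    return (theta_hat - z_u * sigma_hat / scale,
            theta_hat + z_u * sigma_hat / scale)

# Example with arbitrary illustrative inputs:
lo, hi = t_interval(theta_hat=0.1, sigma_hat=1.2, n=1000, h=0.25)
```

With these symmetric quantiles the interval is centered at $\hat{\theta}$; other (asymmetric) choices of $z_l, z_u$ are allowed by the assumption and simply shift the endpoints.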
By well-behaved standard errors here we mean two things. First, we assume that the standard errors are (uniformly) valid, in that the associated t-statistic is asymptotically standard Normal. Inference based on invalid standard errors will be dominated, trivially, and thus while we could in principle account for this, we assume it away for simplicity. Second, and more importantly, is
that the ingredients of the standard errors must obey Cramér's condition. This can be assumed directly, but for kernel-based estimators of $V[\sqrt{n h^{1+2\nu}}\, \hat{\theta} \mid X_1, \ldots, X_n]$ or its limit, we prove in the supplement that Assumption 3.2(b) ensures this.

In particular, for $\hat{\theta}_p$ and $\hat{\theta}_{\mathtt{rbc}}$, we will focus on the following fixed-$n$ standard errors, following the ideas above. Let $\hat{\Sigma}_p$ and $\hat{\Sigma}_{\mathtt{rbc}}$ be the diagonal matrices of estimates of $v(X_i)$, given by $\hat{v}(X_i) = (Y_i - r_p(X_i)' \hat{\beta})^2$ for the former and $\hat{v}(X_i) = (Y_i - r_{p+1}(X_i)' \hat{\beta}_{p+1})^2$ for the latter, where $\hat{\beta}_{p+1}$ is defined as in (3.2), with $p+1$ in place of $p$ and $b$ instead of $h$. Then, we let
$$\hat{\sigma}_p^2 = \nu!^2\, e_\nu' \Gamma^{-1} \big(h \Omega \hat{\Sigma}_p \Omega'/n\big) \Gamma^{-1} e_\nu \quad \text{and} \quad \hat{\sigma}_{\mathtt{rbc}}^2 = \nu!^2\, e_\nu' \Gamma^{-1} \big(h \Omega_{\mathtt{rbc}} \hat{\Sigma}_{\mathtt{rbc}} \Omega_{\mathtt{rbc}}'/n\big) \Gamma^{-1} e_\nu. \qquad (3.7)$$

For the quantiles, the most common choices are $z_l = \Phi^{-1}(\alpha/2) =: z_{\alpha/2}$ and $z_u = \Phi^{-1}(1 - \alpha/2) =: z_{1-\alpha/2}$, where $\Phi$ is the standard Normal distribution function, but our results allow for other options. For coverage error purposes, symmetric choices, i.e., where $\Phi^{(1)}(z_l) = \Phi^{(1)}(z_u)$, yield improvements in coverage error due to cancellations in Edgeworth expansion terms. Asymmetric choices such that $\Phi(z_u) - \Phi(z_l) = 1 - \alpha$ can still yield correct coverage, but at a slower rate.

Under Assumption 3.2(b) we prove (in the supplement) that the appropriate $n$-varying version of Cramér's condition holds for all $I \in \mathcal{I}$. We do not need to make an opaque high-level assumption. Prior work on Edgeworth expansions for nonparametric inference has, explicitly or implicitly, ruled out the uniform kernel (Hall, 1991; Chen and Qin, 2002; Calonico, Cattaneo, and Farrell, 2018a) or treated the regressors as fixed (Hall, 1992b; Neumann, 1997). We are able to include the uniform kernel, which is important to account for popular empirical practice (i.e., local least squares). All popular kernel functions are allowed by Assumption 3.2(b), including uniform, triangular, Epanechnikov, and so forth.
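In the spirit of (3.7), the following sketch (hypothetical code, uniform kernel, $\nu = 0$) computes $\hat{\theta}_p$ together with a fixed-$n$, heteroskedasticity-robust standard error: the exact conditional sandwich evaluated at $\hat{v}(X_i)$ equal to the squared weighted-regression residuals, rather than an estimate of the asymptotic variance.

```python
import numpy as np

def fixed_n_ci(x, y, h, p=1):
    """mu_hat(0) (nu = 0) with a fixed-n, HC-type standard error in the spirit of
    (3.7): v_hat(X_i) are squared weighted-regression residuals and the sandwich
    is the exact conditional variance formula, not an asymptotic approximation.
    Hypothetical illustration with the uniform kernel."""
    w = 0.5 * (np.abs(x / h) <= 1.0)             # K(X_i/h)
    R = np.vander(x, p + 1, increasing=True)
    A = (R * w[:, None]).T @ R                   # R'WR
    beta = np.linalg.solve(A, (R * w[:, None]).T @ y)
    vhat = (y - R @ beta) ** 2                   # v_hat(X_i): squared residuals
    meat = (R * (w ** 2 * vhat)[:, None]).T @ R  # R' W Sigma_hat W R
    V = np.linalg.solve(A, np.linalg.solve(A, meat).T)  # A^{-1} meat A^{-1} (A symmetric)
    return beta[0], np.sqrt(V[0, 0])             # theta_hat and its standard error

rng = np.random.default_rng(2)
n, h = 2000, 0.4
x = rng.uniform(-1, 1, n)
y = np.sin(x) + 0.2 * rng.standard_normal(n)
theta, se = fixed_n_ci(x, y, h)  # estimate of mu(0) = 0 with its standard error
```

Here the returned `se` is the standard error of $\hat{\theta}$ itself, i.e. $\hat{\sigma}/\sqrt{n h^{1+2\nu}}$ in the paper's scaling, so an interval of the form (3.6) is simply `theta ± z * se`.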
3.3 Uniform Coverage Error Expansions

Before we can apply the framework of Section 2 to this class of problems, we must verify one of its assumptions. We now establish uniformly valid coverage error expansions, which verify Assumption CEE and define the $R_{I,F}$. These are used in the next subsection to identify optimal intervals and rates, and further below to select inference-optimal bandwidths.

As mentioned above, the relationship between $S$ and $p$ captures the interplay between $\mathcal{I}$ and
$\mathcal{F}$ in this problem. This is due to the bias of $\hat{\theta}_p$ and $\hat{\theta}_{\mathtt{rbc}}$, which manifests in the expansions. It is convenient to separate the rate and constant portions with specific notation. Let the (fixed-$n$) population bias of $h^\nu \hat{\theta}$ be denoted by $h^\eta \psi_{I,F}$, where both $\eta > 0$ and $\psi_{I,F}$ depend on the specific procedure and $F$. In general, the rate will be known but the constants may be unknown or even (if $p > S$) uncharacterizable without further assumptions (details are in the supplement). For example, in the case of the MSE-optimal bandwidth discussed above, with $p < S$, $\eta = p + 1$ and $\psi_{I_p,F} = \nu!\, e_\nu' E[\Gamma]^{-1} E[\Lambda] \mu^{(p+1)}/(p+1)!$; c.f. equation (3.4). In this notation, explicit bias correction removes an estimate of $h^{\eta-\nu} \psi_{I_p,F}$. For coverage to be uniformly correct in large samples, we will assume that $\sqrt{nh}\, h^\eta$ vanishes asymptotically, making no explicit mention of smoothness. We can then use the generic expansions to study the coverage error of each interval and its dependence on $p$ and $S$. For example, in the case of the standard approach, using $\hat{\theta}_p$ and $\hat{\sigma}_p$, this is the standard undersmoothing requirement for correct coverage.

The coverage error expansions are given next. For an interval $I$, the coverage error is the difference of Edgeworth expansions for the associated t-statistic, evaluated at each quantile. The expansion is given in terms of six functions $\omega_{k,I,F}(z)$, $k = 1, 2, \ldots, 6$. These are cumbersome notationally, and so their exact forms are deferred to the supplement. All that is important for our results is that they are known for all $I \in \mathcal{I}$ and $F \in \mathcal{F}$, bounded, and bounded away from zero for at least some $F \in \mathcal{F}$, and most crucially, that $\omega_1$, $\omega_2$, and $\omega_3$ are even functions of $z$, while $\omega_4$, $\omega_5$, and $\omega_6$ are odd. Also appearing is $\lambda_{I,F}$, a generic placeholder capturing the mismatch between the variance of the numerator of the t-statistic and the population standardization chosen (i.e., the quantity estimated by the $\hat{\sigma}$ of $I$).
We cannot make this error precise for all choices, but we consider two important special cases. First, employing an estimate of the asymptotic variance renders $\lambda_{I,F} = O(h)$ at boundary points. Second, the fixed-$n$ Studentizations (3.7) yield $\lambda_{I,F} \equiv 0$. For other choices, the rates and constants may change, but it is important to point out that the coverage error rate cannot be improved beyond those shown through the choice of Studentization alone (see discussion in the supplement). Let $\lambda_I$ be such that $\sup_{F \in \mathcal{F}} |\lambda_{I,F}| = O(\lambda_I) = o(1)$.

Our main technical result for local polynomials is the following lemma.

Lemma 3.1. Let $\mathcal{F}$ collect all $F$ which obey Assumption 3.1 and $\mathcal{I}$ collect all $I$ that obey Assumption 3.2. Then, uniformly over $\mathcal{I}$, if $\gamma > 1/(1 + 2\eta)$ and $z_l, z_u$ are such that $\Phi(z_u) - \Phi(z_l) = 1 - \alpha$, then
$$\sup_{F \in \mathcal{F}} L\Big(P_F[\theta_F \in I] - (1 - \alpha) - R_{I,F}\Big) = o(r_I),$$
where $r_I = \max\{(nh)^{-1},\; n h^{1+2\eta},\; h^\eta,\; \lambda_I\}$ and
$$\begin{aligned} R_{I,F} = {} & \frac{1}{\sqrt{nh}} \big\{\omega_{1,I,F}(z_u) - \omega_{1,I,F}(z_l)\big\} + \sqrt{nh}\, h^\eta \big\{\psi_{I,F} [\omega_{2,I,F}(z_u) - \omega_{2,I,F}(z_l)]\big\} \\ & + \frac{1}{nh} \big\{\omega_{4,I,F}(z_u) - \omega_{4,I,F}(z_l)\big\} + n h^{1+2\eta} \big\{\psi_{I,F}^2 [\omega_{5,I,F}(z_u) - \omega_{5,I,F}(z_l)]\big\} \\ & + h^\eta \big\{\psi_{I,F} [\omega_{6,I,F}(z_u) - \omega_{6,I,F}(z_l)]\big\} + \lambda_{I,F} \big\{\omega_{3,I,F}(z_u) - \omega_{3,I,F}(z_l)\big\}; \end{aligned}$$
otherwise $\sup_{F \in \mathcal{F}} L\big(P_F[\theta_F \in I] - (1 - \alpha)\big) \gtrsim 1$.

This result is quite general, covering unadjusted, undersmoothed, and robust bias corrected confidence intervals, as well as other methods (Remark 3.1), at interior and boundary points. To fully utilize this result, and identify optimal procedures, we have to specify the relationship of $p$ to $S$. However, even at this level of generality, some important conclusions are available. First, this shows the well-known result that symmetric intervals, with $z_l = -z_u$, have superior coverage: the even functions $\omega_1$, $\omega_2$, and $\omega_3$ cancel, and these are the slowest-decaying. Second, the final conclusion of the lemma simply formalizes the idea that the bandwidth must vanish at the appropriate rate (among other choices) lest worst-case coverage error persist asymptotically. Notice that such intervals can still be ranked in our framework, but are dominated (Definition 1). Third, intervals $I$ such that $\lambda_{I,F} \equiv 0$ yield superior coverage. Taking these three conclusions into account, for the rest of the paper we use the fixed-$n$ standard errors of (3.7) and $z_l = z_{\alpha/2} := \Phi^{-1}(\alpha/2)$ and $z_u = z_{1-\alpha/2} := \Phi^{-1}(1 - \alpha/2)$. The coverage error of such an interval is
$$R_{I,F} = \frac{1}{nh} \big\{2 \omega_{4,I,F}(z_{1-\alpha/2})\big\} + n h^{1+2\eta} \big\{2 \psi_{I,F}^2\, \omega_{5,I,F}(z_{1-\alpha/2})\big\} + h^\eta \big\{2 \psi_{I,F}\, \omega_{6,I,F}(z_{1-\alpha/2})\big\}. \qquad (3.8)$$
Below we use this form to obtain the optimal rates and intervals, as well as to select the bandwidths.
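To illustrate the rate structure of (3.8) numerically (this is a sketch only, not the paper's analytic bandwidth derivation), one can set $h = n^{-\gamma}$, so the three terms are of order $n^{\gamma-1}$, $n^{1-\gamma(1+2\eta)}$, and $n^{-\gamma\eta}$, and search for the $\gamma$ minimizing the slowest-vanishing exponent:

```python
import numpy as np

# Terms of (3.8) are of order (nh)^{-1}, n h^{1+2 eta}, and h^eta. With h = n^{-gamma}
# the exponents of n are gamma - 1, 1 - gamma (1 + 2 eta), and -gamma eta. Grid-search
# the gamma minimizing the largest exponent; numerical illustration only.
eta = 2.0  # e.g. eta = p + 1 = 2 for p = 1
gam = np.linspace(0.01, 0.99, 9801)
worst = np.maximum(np.maximum(gam - 1, 1 - gam * (1 + 2 * eta)), -gam * eta)
g_star = gam[np.argmin(worst)]  # close to 1/(1 + eta) = 1/3
rate = worst.min()              # close to -eta/(1 + eta) = -2/3
```

All three exponents balance at $\gamma = 1/(1+\eta)$, where the worst-case exponent is $-\eta/(1+\eta)$; this balancing ignores the (possibly offsetting) constants $\psi_{I,F}$ and $\omega_{k,I,F}$, which the formal analysis below takes into account.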
Finally, Lemma 3.1 implicitly reveals how fundamentally different point estimation and inference are, recalling Remark 2. First, the two may proceed at different rates. Observe that the rate of $R_{I,F}$ does not depend on the order of the derivative being estimated, $\nu$. As a consequence, neither will the optimal rate $r$ (Definition 3) nor the optimal procedure (Definition 2), encompassing, in
More informationEconomics 583: Econometric Theory I A Primer on Asymptotics
Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:
More informationWhat s New in Econometrics. Lecture 13
What s New in Econometrics Lecture 13 Weak Instruments and Many Instruments Guido Imbens NBER Summer Institute, 2007 Outline 1. Introduction 2. Motivation 3. Weak Instruments 4. Many Weak) Instruments
More informationNear-Potential Games: Geometry and Dynamics
Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo September 6, 2011 Abstract Potential games are a special class of games for which many adaptive user dynamics
More informationTime Series and Forecasting Lecture 4 NonLinear Time Series
Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations
More informationNonparametric Econometrics
Applied Microeconometrics with Stata Nonparametric Econometrics Spring Term 2011 1 / 37 Contents Introduction The histogram estimator The kernel density estimator Nonparametric regression estimators Semi-
More informationRefining the Central Limit Theorem Approximation via Extreme Value Theory
Refining the Central Limit Theorem Approximation via Extreme Value Theory Ulrich K. Müller Economics Department Princeton University February 2018 Abstract We suggest approximating the distribution of
More informationSemi-Nonparametric Inferences for Massive Data
Semi-Nonparametric Inferences for Massive Data Guang Cheng 1 Department of Statistics Purdue University Statistics Seminar at NCSU October, 2015 1 Acknowledge NSF, Simons Foundation and ONR. A Joint Work
More information22 : Hilbert Space Embeddings of Distributions
10-708: Probabilistic Graphical Models 10-708, Spring 2014 22 : Hilbert Space Embeddings of Distributions Lecturer: Eric P. Xing Scribes: Sujay Kumar Jauhar and Zhiguang Huo 1 Introduction and Motivation
More informationSection 7: Local linear regression (loess) and regression discontinuity designs
Section 7: Local linear regression (loess) and regression discontinuity designs Yotam Shem-Tov Fall 2015 Yotam Shem-Tov STAT 239/ PS 236A October 26, 2015 1 / 57 Motivation We will focus on local linear
More informationNonparametric Cointegrating Regression with Endogeneity and Long Memory
Nonparametric Cointegrating Regression with Endogeneity and Long Memory Qiying Wang School of Mathematics and Statistics TheUniversityofSydney Peter C. B. Phillips Yale University, University of Auckland
More informationComputational Tasks and Models
1 Computational Tasks and Models Overview: We assume that the reader is familiar with computing devices but may associate the notion of computation with specific incarnations of it. Our first goal is to
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationLECTURE 10: REVIEW OF POWER SERIES. 1. Motivation
LECTURE 10: REVIEW OF POWER SERIES By definition, a power series centered at x 0 is a series of the form where a 0, a 1,... and x 0 are constants. For convenience, we shall mostly be concerned with the
More information6.867 Machine Learning
6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.
More informationAlgorithm Independent Topics Lecture 6
Algorithm Independent Topics Lecture 6 Jason Corso SUNY at Buffalo Feb. 23 2009 J. Corso (SUNY at Buffalo) Algorithm Independent Topics Lecture 6 Feb. 23 2009 1 / 45 Introduction Now that we ve built an
More informationStatistica Sinica Preprint No: SS
Statistica Sinica Preprint No: SS-017-0013 Title A Bootstrap Method for Constructing Pointwise and Uniform Confidence Bands for Conditional Quantile Functions Manuscript ID SS-017-0013 URL http://wwwstatsinicaedutw/statistica/
More informationRobustness to Parametric Assumptions in Missing Data Models
Robustness to Parametric Assumptions in Missing Data Models Bryan Graham NYU Keisuke Hirano University of Arizona April 2011 Motivation Motivation We consider the classic missing data problem. In practice
More informationLasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices
Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,
More informationBootstrap Tests: How Many Bootstraps?
Bootstrap Tests: How Many Bootstraps? Russell Davidson James G. MacKinnon GREQAM Department of Economics Centre de la Vieille Charité Queen s University 2 rue de la Charité Kingston, Ontario, Canada 13002
More informationRobust Performance Hypothesis Testing with the Variance. Institute for Empirical Research in Economics University of Zurich
Institute for Empirical Research in Economics University of Zurich Working Paper Series ISSN 1424-0459 Working Paper No. 516 Robust Performance Hypothesis Testing with the Variance Olivier Ledoit and Michael
More informationDuration-Based Volatility Estimation
A Dual Approach to RV Torben G. Andersen, Northwestern University Dobrislav Dobrev, Federal Reserve Board of Governors Ernst Schaumburg, Northwestern Univeristy CHICAGO-ARGONNE INSTITUTE ON COMPUTATIONAL
More informationMultiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap
University of Zurich Department of Economics Working Paper Series ISSN 1664-7041 (print) ISSN 1664-705X (online) Working Paper No. 254 Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationLectures on Simple Linear Regression Stat 431, Summer 2012
Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population
More informationMFM Practitioner Module: Risk & Asset Allocation. John Dodson. February 18, 2015
MFM Practitioner Module: Risk & Asset Allocation February 18, 2015 No introduction to portfolio optimization would be complete without acknowledging the significant contribution of the Markowitz mean-variance
More information11. Bootstrap Methods
11. Bootstrap Methods c A. Colin Cameron & Pravin K. Trivedi 2006 These transparencies were prepared in 20043. They can be used as an adjunct to Chapter 11 of our subsequent book Microeconometrics: Methods
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationAsymptotic Relative Efficiency in Estimation
Asymptotic Relative Efficiency in Estimation Robert Serfling University of Texas at Dallas October 2009 Prepared for forthcoming INTERNATIONAL ENCYCLOPEDIA OF STATISTICAL SCIENCES, to be published by Springer
More informationIEOR 165 Lecture 7 1 Bias-Variance Tradeoff
IEOR 165 Lecture 7 Bias-Variance Tradeoff 1 Bias-Variance Tradeoff Consider the case of parametric regression with β R, and suppose we would like to analyze the error of the estimate ˆβ in comparison to
More informationStatistical Properties of Numerical Derivatives
Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective
More informationoptimal inference in a class of nonparametric models
optimal inference in a class of nonparametric models Timothy Armstrong (Yale University) Michal Kolesár (Princeton University) September 2015 setup Interested in inference on linear functional Lf in regression
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationFall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.
1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n
More informationRobustness and Distribution Assumptions
Chapter 1 Robustness and Distribution Assumptions 1.1 Introduction In statistics, one often works with model assumptions, i.e., one assumes that data follow a certain model. Then one makes use of methodology
More informationInterior-Point Methods for Linear Optimization
Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function
More informationInverse problems in statistics
Inverse problems in statistics Laurent Cavalier (Université Aix-Marseille 1, France) Yale, May 2 2011 p. 1/35 Introduction There exist many fields where inverse problems appear Astronomy (Hubble satellite).
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationStatistical Data Analysis
DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the
More informationINFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction
INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION VICTOR CHERNOZHUKOV CHRISTIAN HANSEN MICHAEL JANSSON Abstract. We consider asymptotic and finite-sample confidence bounds in instrumental
More informationWEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract
Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of
More informationInference in Regression Discontinuity Designs with a Discrete Running Variable
Inference in Regression Discontinuity Designs with a Discrete Running Variable Michal Kolesár Christoph Rothe arxiv:1606.04086v4 [stat.ap] 18 Nov 2017 November 21, 2017 Abstract We consider inference in
More informationMultiscale Adaptive Inference on Conditional Moment Inequalities
Multiscale Adaptive Inference on Conditional Moment Inequalities Timothy B. Armstrong 1 Hock Peng Chan 2 1 Yale University 2 National University of Singapore June 2013 Conditional moment inequality models
More informationOPTIMAL INFERENCE IN A CLASS OF REGRESSION MODELS. Timothy B. Armstrong and Michal Kolesár. May 2016 COWLES FOUNDATION DISCUSSION PAPER NO.
OPTIMAL INFERENCE IN A CLASS OF REGRESSION MODELS By Timothy B. Armstrong and Michal Kolesár May 2016 COWLES FOUNDATION DISCUSSION PAPER NO. 2043 COWLES FOUNDATION FOR RESEARCH IN ECONOMICS YALE UNIVERSITY
More informationFINITE-SAMPLE OPTIMAL ESTIMATION AND INFERENCE ON AVERAGE TREATMENT EFFECTS UNDER UNCONFOUNDEDNESS. Timothy B. Armstrong and Michal Kolesár
FINITE-SAMPLE OPTIMAL ESTIMATION AND INFERENCE ON AVERAGE TREATMENT EFFECTS UNDER UNCONFOUNDEDNESS By Timothy B. Armstrong and Michal Kolesár December 2017 Revised December 2018 COWLES FOUNDATION DISCUSSION
More informationECO Class 6 Nonparametric Econometrics
ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................
More informationPreface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation
Preface Nonparametric econometrics has become one of the most important sub-fields in modern econometrics. The primary goal of this lecture note is to introduce various nonparametric and semiparametric
More informationON THE UNIFORM ASYMPTOTIC VALIDITY OF SUBSAMPLING AND THE BOOTSTRAP. Joseph P. Romano Azeem M. Shaikh
ON THE UNIFORM ASYMPTOTIC VALIDITY OF SUBSAMPLING AND THE BOOTSTRAP By Joseph P. Romano Azeem M. Shaikh Technical Report No. 2010-03 April 2010 Department of Statistics STANFORD UNIVERSITY Stanford, California
More information