Bandwidth choice for regression functionals with application to average treatment effects
Karthik Kalyanaraman

November 2008

Abstract: I investigate the problem of optimal selection of the smoothing parameter, or bandwidth, when one is interested in finite-dimensional smooth functionals of a regression function. Examples of such functionals include average treatment effects, nonparametric versions of the Blinder-Oaxaca decomposition used to analyze labor market discrimination, nonparametric policy effects of macroeconomic interventions, and average derivatives. First I emphasize that this problem is very different from the problem of optimal bandwidth selection when one's interest is in the entire regression function, with different rates of convergence of the smoothing parameter to zero. Further, I show that the problem is in general characterized by an instability or, technically, a form of ill-posedness: small changes in the underlying joint distribution of the data can cause large changes in the chosen bandwidth. I propose a simple solution to this problem and prove its optimality properties. As an example I focus on how to choose the bandwidth for average treatment effects in observational studies when a local linear regression estimate is used. I derive original approximations to the estimation error and use these to provide a rule-of-thumb bandwidth selection algorithm for applied work. Finally, I use data from Imbens, Rubin and Sacerdote (2001) on lottery winnings and the effect of unearned income on labor supply; the data is used to demonstrate and assess both the bandwidth algorithm and some of the theoretical issues involved.

JEL Classification: C, C2, J7

Keywords: Plug-in estimator, Nonparametric, ATE, Regression discontinuity

The project has been sustained by Guido Imbens, who has provided crucial support at each stage. I am very grateful for detailed suggestions and support from Gary Chamberlain and Lawrence Katz.
I have also benefited a great deal from comments made by Alberto Abadie, Victor Chernozhukov, Rustam Ibragimov, Alex Kaufman, Jean Lee, Ulrich Müller, Eduardo Morales, James Stock and workshop participants at Harvard.

Harvard University, Cambridge, MA. Electronic correspondence: kalyanar@harvard.edu
1 Introduction

A large proportion of policy-relevant statistics of interest to economists can be described as regression functionals, that is, real-valued functions of an underlying regression function. Examples include, among others, average treatment effects, the Blinder-Oaxaca wage decompositions used to analyze wage discrimination (Blinder (1973), Oaxaca (1973)), the counterfactual effect of a macroeconomic policy intervention (Stock (1989)), weighted average derivatives (Powell, Stock and Stoker (1989)), and regression discontinuity estimators (see Hahn, Todd and van der Klaauw (2001) for theory; several papers, including Lee (2008), for applied work). A standard method for estimating such functionals, which avoids functional form assumptions, is to first estimate the regression function nonparametrically and then to compute the functional using the estimated regression function. Such estimators are called plug-in estimators. A key issue here is the choice of smoothing parameter, or bandwidth, in the first step. While this choice is well explored in settings where the primary object of interest is the entire regression function, it has not been studied in detail where the primary object of interest is a finite-dimensional functional. This paper studies optimal bandwidth choice in this setting. More specifically, I look at the subgroup of functionals that are estimable at an N^{1/2} rate. Examples include averages of the regression function or its derivatives (all the functionals cited above except regression discontinuity belong to this class). Though some work has been done on bandwidth selection for functionals that are not of this type, e.g. regression discontinuity (Imbens and Kalyanaraman, 2008), this paper focuses on N^{1/2}-estimable functionals. For illustrative purposes I consider two specific functionals, though the methods proposed extend transparently to others in the class.
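As a concrete sketch of the two-step plug-in idea (the Gaussian kernel, the data-generating process, and all function names here are my own illustrative choices, not the paper's):

```python
import numpy as np

def nadaraya_watson(x0, X, Y, h):
    # Kernel regression estimate of m(x0) with bandwidth h (Gaussian kernel).
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

def plug_in_average(X, Y, h):
    # Step 1: estimate m nonparametrically; step 2: plug it into the
    # functional -- here theta = E[m(X)], averaged over the sample points.
    return float(np.mean([nadaraya_watson(xi, X, Y, h) for xi in X]))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, 500)
Y = X**2 + rng.normal(0.0, 0.1, 500)      # m(x) = x^2, so theta = E[X^2]
theta_hat = plug_in_average(X, Y, h=0.1)
```

The same two-step template covers the other functionals discussed in the paper: only the final averaging step changes.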
First, I focus on a simple average of the estimated regression function calculated over the sample points. (There is a literature on undersmoothing, e.g. Newey (1994), Goldstein and Messer (1992), Powell, Stock and Stoker (1989) and Stock (1989), to achieve N^{1/2}-consistency, but no specific bandwidth proposals except for density-weighted average derivatives (see Goh (2007)). There has also been no discussion of the ill-posedness and instability inherent to these problems, and the estimator proposed for density-weighted averages potentially suffers from the same instability.) Second, I demonstrate the resulting bandwidth procedure on a commonly used
functional in labor and development studies, the average treatment effect. Though the results for this functional are written up in terms of the sample average treatment effect (SATE), I briefly show first how the same results are valid for the population average treatment effect and for nonparametric versions of the Blinder-Oaxaca decompositions used to study labor-force discrimination as well. I indicate how a minor modification of the results can be used for bandwidth choice in the case of constructing policy effects of counterfactual macroeconomic interventions as in Stock (1989), and for average derivatives. The standard approach to the bandwidth problem is to choose a bandwidth that minimizes some measure of global risk for the entire regression function, usually mean integrated squared error (MISE), i.e. the expected squared error integrated over the entire curve. The optimal bandwidth is then estimated either using plug-in estimators of the minimizer of the asymptotic approximation to the MISE or using an unbiased data-based estimator of the MISE (cross-validation). While this is appropriate if we are interested in the entire regression function, it is not the correct risk measure for the estimation of a particular functional of the regression function. The first contribution of this paper is to propose and investigate a criterion, the mean squared error criterion (MSE), based directly on the functional and on the standard squared-error loss criterion used in estimation problems, and to solve for the optimal bandwidth under this criterion. Once the risk measure is defined, I approximate the risk in terms of functionals of the data and then find the bandwidth that minimizes the leading terms in the approximation. The solution obtained for the MSE is quite different from the one obtained by minimizing an asymptotic approximation to the MISE, or by doing standard cross-validation.
This is essentially because the averaging done in smooth functionals gets rid of some variance, and as a result it is optimal to remove more bias by using a smaller bandwidth. Indeed, the rates of convergence to zero are different. (Alternative nonparametric decompositions using densities rather than regression functions are discussed in DiNardo, Fortin and Lemieux (1996).) The point about the different rates has been made before in the literature on undersmoothing (see Newey (1994) and Goldstein and Messer (1992)); however, it is worth highlighting in view of
inappropriate methods like cross-validation being used for this problem in practice. The second contribution of the paper is to show that the problem of minimizing the MSE is subject to an instability. For a large class of regression functions, small perturbations of the support or of the function itself can lead to large differences in the optimal bandwidth (in general discontinuously changing its convergence rate to zero, and in some cases moving from a finite value to infinity). Formally, the problem exhibits a certain kind of ill-posedness. This leads to estimators performing poorly if the regression function is close to the class of badly behaved regression functions; in the case of the ATE the bad behavior occurs at and near two particularly interesting cases: constant additive treatment effects (including no effect of treatment) and treatment effects linear in the measured covariate(s). I consider some solutions to this problem and propose a method of regularization that ensures two properties of the stable solution: 1) the risk evaluated at this regularized estimate is close to the optimal solution as the data set becomes large (a type of asymptotic no-regret property), and 2) the estimated bandwidth is close to the optimal bandwidth in large datasets. I motivate this solution in two different ways. Firstly, the solution can be considered a type of Tikhonov regularization, where a convex penalty term, which penalizes extremely large bandwidths for bumpy regression functions, is added to the loss function, and the sum is then minimized to find the optimal regularized bandwidth. Secondly, the same solution can be regarded as minimizing the risk averaged over local uncertainties in the weight function. This second way of approaching the issue is very much in the spirit of Bickel and Li (2006).
Finally, I use data on lottery winnings from Imbens, Rubin and Sacerdote (2001) that was collected to try to identify the effect of unearned income on labor supply: I use this dataset both to illustrate the properties of the proposed ATE bandwidth selection algorithm and to illustrate some theoretical issues raised by this paper. (I use the term regularization in a similar sense as Bickel and Li (2006), who discuss the notion of approximating a difficult, possibly singular, problem with simpler regularized problems that approach the former in the limit.) The simulations indicate that there are potential gains to be achieved, even in fairly linear regression
functions, in bias reduction and bandwidth stabilization. In sum, the contributions to the existing literature are: 1) the approximation of the risk function for the ATE when the estimation method is local linear regression, 2) the formulation of an objective data-based bandwidth algorithm for the ATE leading to its N^{1/2}-consistent estimation, and 3) the highlighting and exploration of the ill-posedness of the bandwidth problem in the context of minimizing the MSE, and its solution using regularization.

2 A general introduction to the bandwidth problem

Consider first a case in which one is interested in estimating a regression function without making functional form assumptions. Examples would include estimation of Engel curves (Bierens and Pott-Buter, 1990), estimation of the relationship between calories consumed and food prices (Subramanian and Deaton, 1996), and so on. There are several competing ways of doing this, such as methods based on kernels, series, sieves, etc. (see Wasserman (2006) for a description and evaluation of these methods), but all of these require a choice of what is called the smoothing parameter or bandwidth. This is essentially a choice of how complex a model to fit to the data. With more complex models we reduce bias, but this typically comes at the cost of increasing variance. In the regression function context, this choice is made a little clearer by imagining potential extremes. One of the simplest regression functions we could fit, say in the kernel context, would be the constant function, corresponding to an infinite bandwidth; a zero bandwidth, on the other hand, corresponds to the very complicated regression function obtained from simple interpolation of the data. Thus the trade-off between bias and variance gives us a framework to think about optimality in bandwidth selection.
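The two extremes described above can be seen directly in a short sketch (kernel choice and data are my own illustrative assumptions):

```python
import numpy as np

def nw(x0, X, Y, h):
    # Nadaraya-Watson kernel regression estimate at x0 (Gaussian kernel).
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 50)
Y = np.sin(2 * np.pi * X) + rng.normal(0.0, 0.2, 50)

# Enormous bandwidth: the fit collapses to the sample mean (constant fit).
flat = nw(0.25, X, Y, h=1e6)
# Tiny bandwidth: the fit reproduces the observation at the evaluation point
# (interpolation of the data).
interp = nw(X[10], X, Y, h=1e-4)
```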
This choice is quite familiar in the regression context, with several methods that allow the researcher to choose an objective, data-based and optimal bandwidth for fitting the entire regression function (see Wasserman (2006) and Fan and Gijbels (1996)). (Local linear regression is attractive due to its good bias properties in regions of low density and at the boundaries; see also Heckman, Smith and Todd (1998), who recommend its use for calculating the ATE.) It is far less
so in the context where one is interested in a scalar functional of the regression function. One could argue that economists are typically more interested in scalar functionals of the regression function than in the function itself, just as one is far more interested in the coefficients of a linear regression than in the regression function itself (the average derivative could be one possible analogue, in the nonlinear context, of the slope coefficient of a linear regression). Examples of such functionals include the regression function at a point (e.g. the regression discontinuity estimator), a weighted average of the regression (various average treatment effects, e.g. Dehejia and Wahba (1999), Imbens, Rubin and Sacerdote (2001)), average partial derivatives (e.g. average cross-partials used as an index of substitutability of two goods), and so on. Indeed, most policy effects that summarize regression relationships (e.g. Stock (1989)) are (typically smooth) functionals. Just as in regression function estimation, one might want to estimate the functional of interest without functional form restrictions, as when one is concerned about nonlinearities in the underlying regression relationship (e.g. Heckman et al. (1998)) or, in the case of the ATE, about different covariate distributions in the treatment and control populations (Dehejia and Wahba (1999)). In this context, the choice of a bandwidth becomes a key decision. Once we have a procedure for selecting the bandwidth, we can form a plug-in estimate of the functional, by first estimating the regression nonparametrically with that bandwidth and then using this estimate in the formula for the functional. But there has not been much work done on how to choose the bandwidth when the interest is not in the regression function itself. There are no objective methods available to apply to the problem. Thus applied researchers have either used a fixed, subjectively chosen bandwidth (e.g. Heckman et al. (1998) choose a fixed bandwidth in their discussion of calculating the ATE using local linear regressions) or used cross-validation.
Cross-validation essentially amounts to minimizing an unbiased estimate of the risk of estimating the entire regression function; it is not appropriate to the problem of choosing a bandwidth for functionals. Worse, to estimate smooth functionals like various average treatment effects at an N^{1/2} rate using kernels, we need to use a bandwidth that converges to zero faster than the
bandwidth sequence for fitting the entire function. This point was highlighted by the undersmoothing literature (Newey (1994)). Thus tailoring the bandwidth method to the functional at hand is very important, not only if we want to estimate a bandwidth that is theoretically best for a given dataset, but also if we care about a sensible estimate of the functional in general. This paper seeks to investigate this choice and to provide, in the instance of various policy effects, a concrete, objective and data-based way of selecting the bandwidth.

3 Smooth regression functionals

The key feature of these functionals is that they are estimable at an N^{1/2} rate. A general characterization using the Riesz integral representation of these functionals is available in Goldstein and Messer (1992). For exposition I will further assume that the functionals are linear (if they are not, a linearization approximation using Fréchet-type derivatives of the functional can be considered) and indeed focus on weighted average regressions as the key example, as most smooth linear functionals can be approximated by sums of weighted average regressions and regression derivatives (again see Goldstein and Messer (1992)). Examples of smooth linear functionals that arise frequently in applied work:

1. Average treatment effect (ATE): Under unconfoundedness ((Y^0, Y^1) ⊥ W | X) and full overlap, the population ATE (PATE) can be written as

θ = E(m_1(X) − m_0(X)),

where m_1(x) = E(Y | W = 1, X = x), etc. This is the difference between the two regressions, averaged over the distribution of the covariate X in the population. One might be interested instead in the sample ATE (SATE), where the averaging is done over the empirical sample distribution of the covariate:

θ = (1/N) Σ_{i=1}^N (m_1(X_i) − m_0(X_i)).

In both cases the ATE can be estimated by

θ̂ = (1/N) Σ_i (m̂_1(X_i) − m̂_0(X_i)).
2. Nonparametric versions of Blinder-Oaxaca wage decompositions: Used to analyze labor-market discrimination, this functional has the same statistical structure as the PATE. Consider two groups, say women and men, indexed by T ∈ {0, 1}. To get that part of the difference in some average outcome Y, say wages, that is not attributable to differences in observed characteristics X (say IQ scores), one estimates

θ = E_X(E(Y | X, T = 1) − E(Y | X, T = 0)).

Here the outer expectation is taken with respect to some reference distribution: either the conditional covariate distribution in one of the two populations, or the covariate distribution unconditional on group.

3. Average derivative effect (ADE): Analyzed in Powell, Stock and Stoker (1989) and Härdle and Stoker (1989), these functionals are of the type θ = E(∂m(X)/∂x) (Powell, Stock and Stoker (1989) considered density-weighted average derivatives of the form E(f(X) m′(X))). These arise in semiparametric estimation of index models as estimands of interest and may also be of independent interest in demand estimation problems. Bandwidth selection issues for these estimators are identical in spirit to the issues with the simple weighted average θ_WA = E(ω(X)m(X)) considered later, and a straightforward extension is immediate.

4. Another example, considered in Stock (1989), involves assessing the impact of policies that leave the structural relationship m(x) = E(Y | X = x) between an outcome variable Y and a policy instrument X unchanged but affect the distribution of the policy instrument. Here one is typically interested in calculating the mean effect on Y, E(m(X)), under the two different distributions of X, one being the current empirical distribution and the other the future counterfactual. In other words:

θ = E(ω(X)m(X)) − E(m(X)).
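For the ATE in this list, the plug-in estimator θ̂ = (1/N) Σ_i (m̂_1(X_i) − m̂_0(X_i)) can be sketched as follows. (This sketch uses a Nadaraya-Watson smoother for brevity; the paper itself advocates local linear regression for the ATE. All parameter values are illustrative.)

```python
import numpy as np

def nw(x0, X, Y, h):
    # Nadaraya-Watson kernel regression estimate at x0 (Gaussian kernel).
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

def sate_plugin(X, Y, T, h):
    # Fit each group's regression on that group's data only, evaluate both
    # at every sample point, and average the difference.
    X1, Y1 = X[T == 1], Y[T == 1]
    X0, Y0 = X[T == 0], Y[T == 0]
    m1 = np.array([nw(xi, X1, Y1, h) for xi in X])
    m0 = np.array([nw(xi, X0, Y0, h) for xi in X])
    return float(np.mean(m1 - m0))

rng = np.random.default_rng(2)
n = 1000
X = rng.uniform(0.0, 1.0, n)
T = rng.integers(0, 2, n)
Y = X + 2.0 * T + rng.normal(0.0, 0.1, n)   # constant additive effect of 2
ate_hat = sate_plugin(X, Y, T, h=0.1)
```

Incidentally, this simulated design (a constant additive treatment effect) is exactly one of the ill-posed cases the paper flags, for which the first-order bias of the plug-in estimator vanishes.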
3.1 Setup

3.1.1 Weighted average regression

As mentioned earlier, I focus on two functionals in the paper. The first, the weighted average regression θ_WA, is chosen 1) due to its simplicity and ease of illustration of the key theoretical issues and 2) for the ease of extension of the ideas to functionals like weighted average derivatives. To make the essential aspects of the problem clear I work with a scalar covariate X and a known regression design (i.e. the density function of X, f(x), is assumed known or controlled by the researcher). In other words, the regression at X_i is calculated as

m̂(X_i) = (1/((n−1) f(X_i))) Σ_{j≠i} K_h(X_i − X_j) Y_j.

Assuming a random design would only make the calculations more tedious while still preserving the nature of the problem explored in this paper. I stress here that these assumptions are not necessary: the calculations provided in the section on the ATE dispense with such assumptions, and those calculations can also immediately be modified for this case. These assumptions, however, allow one to illustrate the issues in the familiar setting of a simple kernel estimate, rather than the more complex local linear regression used later. The output variable Y is generated by the following process:

Y_i = m(X_i) + ε_i.

The errors ε_i are independently and identically distributed, with conditional mean zero and conditional variance σ²(x). Thus m(x) = E(Y | X = x). The functional of interest is

θ_WA = E(ω(X)m(X)),

where ω(·) is a known smooth weighting function.
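The known-design estimator just displayed can be written out directly (a sketch; the Gaussian kernel and the constant regression used for the sanity check are my own choices):

```python
import numpy as np

def m_hat_known_design(i, X, Y, h, f):
    # Leave-one-out kernel estimate of m at X_i when the design density f is
    # known, so no density estimation is needed in the denominator.
    n = len(X)
    Xj, Yj = np.delete(X, i), np.delete(Y, i)
    u = (X[i] - Xj) / h
    Kh = np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h)   # K_h(X_i - X_j)
    return float(np.sum(Kh * Yj) / ((n - 1) * f(X[i])))

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 1.0, 5000)          # uniform design, so f(x) = 1
Y = np.full(5000, 3.0)                   # constant regression m(x) = 3
i = int(np.argmin(np.abs(X - 0.5)))      # interior point, away from boundary
est = m_hat_known_design(i, X, Y, h=0.1, f=lambda x: 1.0)
```

Away from the boundary, est recovers m(X_i) up to the sampling noise of the kernel sum.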
Now θ_WA can be estimated in two steps: first estimating the underlying nonparametric regression function at the sample points and then estimating the functional as

θ̂_WA = (1/n) Σ_i ω(X_i) m̂(X_i),

where m̂(X_i) is the estimate of the regression function at X_i. For the approximations that follow, the following standard assumptions will be needed.

Assumption 3.1: The marginal distribution of the covariate X is known and denoted f(x). Moreover it is twice continuously differentiable and bounded away from 0.

Assumption 3.2: m(·) has at least two continuous derivatives.

Assumption 3.3: The kernel K is a smooth, compactly supported, symmetric nonnegative pdf.

Assumption 3.4: The weight function ω(·) is piecewise continuously differentiable.

Assumption 3.5: X and ω(·) are supported on the whole real line.

Assumption 3.5 is made only so that boundary biases that arise with simple kernel estimation can be ignored. A compact support for X would entail some way of correcting for boundary biases in the calculated risk (e.g., negative reflection) but would lead to the same asymptotic approximation given below.

3.1.2 Average treatment effect

A functional of interest in observational studies is the sample average treatment effect (SATE). This is essentially the average treatment effect where the averaging is over the empirical distribution of the covariate:

θ_SATE = (1/n) Σ_{i=1}^n (Y_i^1 − Y_i^0),

where Y_i^1 and Y_i^0 are the potential outcomes for unit i under treatment (T_i = 1) and control (T_i = 0) respectively. The treatment assignment indicator, T, and an additional covariate, X, are observed.
Under the assumptions of unconfoundedness ((Y^1, Y^0) ⊥ T | X) and overlap (the density of X is bounded away from 0 over the entire support for both treated and control samples), we can estimate the SATE using the following simple estimator:

θ̂_SATE = (1/n) Σ_{i=1}^n (m̂_1(X_i) − m̂_0(X_i)),

where m̂_1(X_i) and m̂_0(X_i) are the estimated regression functions on the treated and control samples respectively, each evaluated at X_i. I state the lemmas about the ATE in terms of the SATE; however, as indicated in a subsequent section, the bandwidth optimal for the SATE is also optimal for the PATE. For the purposes of this paper, I work out the bandwidth selection details for a specific (and attractive) nonparametric regression estimator: the local linear regression estimator. This estimator does not suffer from the severe boundary bias that affects standard (Nadaraya-Watson) kernel estimates (see Heckman et al. (1998), who recommend this estimator in the context of calculating treatment effects). More importantly perhaps for this application, the local linear estimator, unlike the latter, does not have high bias in areas with fewer observations (it is design-adaptive, to use the jargon; see Fan (1992)). This is an important consideration for the ATE, where one can expect a low density of treated observations for certain covariate values and very few control observations for other values (i.e., limited overlap). More explicitly, arrange the observations such that the first n_0 observations constitute the control sample and the remaining n_1 = n − n_0 form the treated sample. Define y^0 = (y_1, ..., y_{n_0})′ and m^0 = (m_0(X_1), ..., m_0(X_{n_0}))′, where m_0(·) is the regression function in the control group, i.e., m_0(x) = E(Y | T = 0, X = x). Define R_i^0 = [ι Z_i], where Z_ij = X_j − X_i, j = 1, ..., n_0, and ι is a column of n_0 ones; define the weight matrix W_i^0 = diag_{j=1,...,n_0}(K_h(X_j − X_i)); and let e_1 = (1, 0)′. Similarly define y^1, R_i^1 and W_i^1.
Given the above, we can write the estimated control regression at X_i as

m̂_0(X_i) = e_1′ (R_i^0′ W_i^0 R_i^0)^{−1} R_i^0′ W_i^0 y^0.
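A minimal implementation of this matrix formula (illustrative helper names; Epanechnikov kernel, consistent with the compact-support assumption):

```python
import numpy as np

def local_linear(x0, X, Y, h):
    # m_hat(x0) = e1' (R' W R)^{-1} R' W y: weighted least squares of Y on
    # (1, X - x0) with kernel weights; the intercept is the fit at x0.
    u = (X - x0) / h
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)  # Epanechnikov
    R = np.column_stack([np.ones_like(X), X - x0])
    W = np.diag(k)
    beta = np.linalg.solve(R.T @ W @ R, R.T @ W @ Y)
    return float(beta[0])                                     # e1' beta

# Local linear regression reproduces a linear function exactly, even at the
# boundary -- the design-adaptivity property stressed in the text.
X = np.linspace(0.0, 1.0, 101)
Y = 2.0 + 3.0 * X
at_boundary = local_linear(0.0, X, Y, h=0.3)
at_interior = local_linear(0.5, X, Y, h=0.2)
```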
Therefore:

θ̂_SATE = (1/n) Σ_{i=1}^n ( e_1′ (R_i^1′ W_i^1 R_i^1)^{−1} R_i^1′ W_i^1 y^1 − e_1′ (R_i^0′ W_i^0 R_i^0)^{−1} R_i^0′ W_i^0 y^0 ).

It will be useful to collect here the assumptions necessary for the upcoming lemmas. They are all standard:

Assumption 3.6: The marginal distribution of the covariate X is denoted f(x); it is continuous a.e. and bounded away from 0 conditional on T: f(x | T = 1) ≡ f_1(x) ≥ c_1 > 0 and f(x | T = 0) ≡ f_0(x) ≥ c_0 > 0.

Assumption 3.7: m(·) has at least two continuous derivatives a.e.

Assumption 3.8: The kernel K is a smooth, compactly supported, symmetric nonnegative pdf. For concreteness assume a kernel supported on [−1, 1].

Assumption 3.9: The conditional variance functions σ_1²(x) = Var(Y | T = 1, X = x) and σ_0²(x) are bounded and continuous almost everywhere.

4 Error criteria and optimal bandwidth

4.1 Standard criterion: MISE

The standard criterion of risk used for bandwidth selection in nonparametric regression problems is called the mean integrated squared error (MISE) or, more generally, the weighted MISE (WMISE). It is a global measure of error, and the loss function sums up squared error along the entire regression function. It is defined as

WMISE = ∫ ξ(x) E(m̂(x) − m(x))² f(x) dx,

where ξ(x) is the weighting function used. Usually this is chosen to be flat, and the resulting error criterion is then called simply the MISE. The bandwidth is chosen by finding a data-based unbiased estimator of the error criterion (e.g., cross-validation) or by first asymptotically approximating the risk and then estimating its minimizer.
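For contrast, the cross-validation procedure mentioned here can be sketched as follows; note that it targets the risk of the whole regression function, a MISE-type criterion, which is exactly the paper's objection when a functional is the object of interest. (Gaussian kernel, grid, and data are my own illustrative choices.)

```python
import numpy as np

def loocv_score(h, X, Y):
    # Leave-one-out CV: an approximately unbiased estimate of the squared
    # prediction error of the fitted curve, i.e. a MISE-type criterion.
    n = len(X)
    err = np.empty(n)
    for i in range(n):
        Xi, Yi = np.delete(X, i), np.delete(Y, i)
        w = np.exp(-0.5 * ((Xi - X[i]) / h) ** 2)
        err[i] = Y[i] - np.sum(w * Yi) / np.sum(w)
    return float(np.mean(err**2))

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, 200)
Y = np.sin(4.0 * X) + rng.normal(0.0, 0.2, 200)
grid = [0.01, 0.03, 0.1, 0.3, 1.0]
h_cv = min(grid, key=lambda h: loocv_score(h, X, Y))
```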
Lemma 4.1 (Asymptotic approximation to the MISE): Under Assumptions 3.1–3.5, we obtain the following expression for the WMISE:

WMISE = C_1/(nh) + C_2 h⁴ + o_p((nh)^{−1} + h⁴).

The constants in the approximation are:

C_1 = T_1(K) ∫ ξ(x) σ²(x) dx,
C_2 = T_2(K) ∫ ξ(x) [(mf)″(x)]² / f(x) dx,
T_1(K) = ∫ K(t)² dt, T_2(K) = (½ ∫ t² K(t) dt)².

Calculations are given in the appendix. This is a well-known result; see Priestley and Chao (1972). Note that the first term reflects the risk cost of the variance of the estimator and the second reflects the bias. Increasing the bandwidth reduces variance across repeated samples at the cost of increasing bias. Once again, a key observation to keep in mind is that standard cross-validation essentially minimizes an unbiased estimate of the MISE and can be shown to be consistent for the minimizer of the leading terms of the MISE displayed above.

4.2 Modified criterion: MSE

Using the MISE or AMISE as a measure of risk is not appropriate when one is interested primarily in the functional θ. Thus standard methods of bandwidth selection have to be modified. We instead consider a measure of risk defined directly in terms of the functional of interest, the mean squared error (MSE) criterion:

MSE = E(θ̂(m̂_h) − θ(m))².

One can now try to approximate the MSE in terms of functionals of the joint distribution of the data and the chosen bandwidth.
4.2.1 MSE for θ_WA

Lemma 4.2 (Asymptotic approximation to the MSE): Under Assumptions 3.1–3.5, we obtain the following expression for the MSE:

MSE_WA = C_0/n + C_1/(n²h) + C_2 h⁴ + o_p(n^{−2}h^{−1} + h⁴) ≡ AMSE + o_p(n^{−2}h^{−1} + h⁴).

Here, the constants in the approximation are:

C_0 = Var(ω(X)m(X)) + ∫ ω(x)² σ²(x) f(x) dx,
C_1 = T_1(K) ∫ ω(x)² (σ²(x) + m(x)²) dx,
C_2 = T_2(K) ( ∫ (mf)″(x) ω(x) dx )²,
T_1(K) = ∫ K(t)² dt, T_2(K) = (½ ∫ t² K(t) dt)².

This is the first result of the paper. Here the first term comes from the variance of the averaging involved; note that it is not affected by the bandwidth choice. The next two terms reflect the variance and bias stemming from the choice of smoothing parameter. An increased bandwidth makes the estimator more stable across repeated samples but also more biased. The nature of the constant on the bias term, C_2, will prove crucial to the following discussion.

4.2.2 MSE for the ATE

A contribution of this paper is to work out a similar expansion for the SATE when local linear estimation is used to calculate the regression functions. (Imbens, Newey and Ridder (2005) provide an MSE expansion for the ATE when it is calculated using inverse propensity score weighting and series estimation for the regression function and the propensity score. Their procedure requires the choice of two bandwidths. Other differences from the current work are that 1) they work with a criterion slightly different from mine, in that they consider the sum of the mean squared errors of the treatment and control regressions rather than working directly with the ATE; 2) they do not provide guidance for bandwidth selection; and 3) they are not concerned with the problems of regularization, which arise in their set-up as well.) The appendix provides both the calculations and further higher-order terms than presented below. As is standard with regressions, all expectations in the lemmas below condition on the sample
X and T, as unconditional expectations may not exist (if they do, they are identical to what is given below). This is suppressed notationally. Following the result, I indicate extensions to other similar functionals.

Lemma 4.3 (Asymptotic approximation to the MSE: SATE): Under Assumptions 3.6–3.9, and denoting the support of X by [x̲, x̄] (these can be nonfinite) and by p the probability Pr(T = 0), we obtain the following expression for the MSE for the SATE (MSE_SATE):

MSE_SATE = K_0/n + K_1/(n²h) + K_2 h⁴ + o_p(n^{−2}h^{−1} + h⁴).

Here, the constants in the approximation are:

K_0 = ∫ ( σ_1²(x)/((1−p) f_1(x)) + σ_0²(x)/(p f_0(x)) ) f(x) dx,
K_1 = π ∫ ( σ_1²(x)/((1−p) f_1(x)) + σ_0²(x)/(p f_0(x)) ) dx,
K_2 = (ν²/4) ( ∫ (m_1″ − m_0″)(x) f(x) dx )²,
π = ∫ K(t)² dt, ν = ∫ t² K(t) dt.

This is the second novel result of the paper. Inspecting the MSE, we see first that the term K_0/n is the same as the semiparametric variance bound for the problem derived by Hahn (1998), when one expresses that bound in terms of the propensity score using Bayes' theorem. The only difference from the expansion for the PATE, as indicated later, would be a third term in the definition of K_0, namely ∫ (m_1(x) − m_0(x) − θ_PATE)² f(x) dx. The next two terms reflect the variance-bias tradeoff. An increased bandwidth makes the estimator more stable across repeated samples but also more biased. Note that relatively lower densities in either the treated or control samples increase the variability (showing the importance of significant overlap in variance reduction). Also note that the third term, the bias term, does not depend on f_0 or f_1. This is due to the design adaptivity of local linear regression: sparse data in a region does not increase first-order bias as it would with simple kernel estimation. Also note that it is the difference in curvatures between
the treatment and control regressions that determines the bias. This will be important in the following discussion.

4.2.3 Extensions to other functionals

The extension to the PATE is immediate, recognizing first that, under the random sampling assumption, the estimate of the PATE is the same as for the SATE (i.e. θ̂_PATE = θ̂_SATE), and second that the only additional terms in the MSE expansion of E(θ̂_PATE − θ_PATE)² are E(θ_SATE − θ_PATE)² and E(θ̂_SATE − θ_SATE)(θ_SATE − θ_PATE). The second term is negligible and the first contributes a term of order 1/n (this, as indicated before, is the term E(m_1(X) − m_0(X) − θ_PATE)², also to be found in Hahn's semiparametric variance bound). Thus bandwidth choice remains unaffected. The extension to the nonparametric Blinder-Oaxaca estimates is immediate. Suppose the two groups are indexed by T = 1 and T = 0; say T = 1 indicates female. Then the part of the difference in wages Y not attributable purely to different distributions of the measured covariate X (say IQ) in the two groups is

E_X[E(Y | T = 1, X) − E(Y | T = 0, X)],

where the averaging is done over the marginal distribution of X, unconditional on group membership. Thus we can see that this functional has an identical structure to the PATE, and the application of the above MSE expansion only requires a redefinition of terms. It is fairly easy to extend the results to the policy effect estimator of Stock (1989) as well. Note that this estimate has the identical structure of one of the two elements in the ATE: the average, say, control regression, averaged over a distribution other than the conditional distribution of X in the control sample (f_0). In the ATE case, this other distribution was the unconditional distribution of X, and in the Stock (1989) case it is the counterfactual distribution of X after the policy intervention.
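The role of the curvature difference can be checked numerically. The sketch below evaluates a bias constant of the form (ν²/4)(∫(m_1″ − m_0″)f dx)² for illustrative regression pairs (uniform design, kernel moment ν set to 1, all functional forms my own): under a constant additive treatment effect it is exactly zero, which is the ill-posed case taken up in the next section.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10001)
f = np.ones_like(x)                      # uniform covariate density on [0, 1]
dx = x[1] - x[0]

def bias_constant(m1_dd, m0_dd, nu=1.0):
    # (nu^2 / 4) * (integral of (m1'' - m0'') f dx)^2, via a Riemann sum.
    integral = dx * np.sum((m1_dd - m0_dd) * f)
    return float((nu**2 / 4.0) * integral**2)

m0_dd = -9.0 * np.sin(3.0 * x)           # m0 = sin(3x) => m0'' = -9 sin(3x)
# Constant additive effect: m1 = m0 + 2 => m1'' = m0'', so the constant is 0.
k2_cate = bias_constant(m0_dd, m0_dd)
# Curved effect: m1 = m0 + x^2 => m1'' = m0'' + 2, so the constant is positive.
k2_curved = bias_constant(m0_dd + 2.0, m0_dd)
```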
5 Bandwidth selection and ill-posedness

5.1 Optimal bandwidth

Once we have the expressions for the asymptotic approximations to the MSE, we can optimally trade bias for variance to first order, noting that the former is typically an increasing function of the bandwidth and the latter a decreasing function, and find a formula for the theoretically optimal bandwidths, h_WA^opt and h_SATE^opt.

Lemma 5.1 (Optimal bandwidths): Given Lemmas 4.2 and 4.3, we obtain the following expressions for the optimal bandwidths:

h_WA^opt = (C_1/(4C_2))^{1/5} n^{−2/5} + o_p(n^{−2/5}).

For the SATE,

h_SATE^opt = (K_1/(4K_2))^{1/5} n^{−2/5} + o_p(n^{−2/5}).

Note that while the optimal bandwidth for estimating m would be O(n^{−1/5}), the optimal bandwidth for estimating θ_WA and θ_SATE is O(n^{−2/5}). In other words, we are required to use a smaller bandwidth than we would have were we interested in the entire function (see Goldstein and Messer (1992) and Newey (1994)). This makes intuitive sense: the averaging gets rid of some of the variance for us, and therefore we can allow ourselves less bias by smoothing less at the optimal bias-variance tradeoff point. Moreover, if one used the bandwidth optimal for the MISE, one would not achieve the optimal semiparametric rate of N^{1/2} for the estimation of θ. There is, however, something else that has not been noted in the literature. First consider MSE_WA. The optimal bandwidth is inversely proportional to the bias constant C_2^{1/5}. C_2 can be thought of as proportional to the total curvature of the regression function on its support (technically it is the total curvature of the function mf, but with a uniform design, or more compellingly with local linear estimation, it is the total curvature of the regression function m that matters, not the design f). For a large class of functions, C_2 could be close to or exactly 0. For these functions there is no first-order bias-variance tradeoff as defined by minimizing the AMSE, leading to an optimal bandwidth of infinity!
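The first-order tradeoff and its breakdown can be verified directly (the constants here are illustrative numbers, not estimates from any data):

```python
import numpy as np

def h_opt(C1, C2, n):
    # Minimizer of C1/(n^2 h) + C2 h^4: h* = (C1/(4 C2))^(1/5) * n^(-2/5).
    return (C1 / (4.0 * C2)) ** 0.2 * n ** (-0.4)

C1, C2, n = 2.0, 0.5, 1000
amse = lambda h: C1 / (n**2 * h) + C2 * h**4
h_star = h_opt(C1, C2, n)

# h_star beats every nearby bandwidth on the two AMSE terms...
grid = h_star * np.linspace(0.5, 2.0, 101)
beats_grid = amse(h_star) <= min(amse(h) for h in grid)
# ...but as the bias constant C2 -> 0 the formula blows up: the instability.
blown_up = h_opt(C1, 1e-20, n)
```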
The same problem arises with MSE_SATE in a perhaps even more relevant scenario. Here the first-order bias is proportional to the difference in curvatures. In the practically interesting cases of a constant additive treatment effect (CATE) and a treatment effect linear in covariates, this bias term is zero, again leading to the same issue as above. Taking higher-order terms does not help, as explained later.

The next two sections are abstract in nature. Readers interested primarily in applied work should skim them and then look at sections 6.2 and 6.3 and the empirical application.

5.2 Illposedness

For the purposes of this section I concentrate on θ_WA, though it is clear that exactly the same issues are relevant to θ_SATE. This is primarily for illustrative purposes, as the simple structure of θ̂_WA makes the issues at hand more transparent.

The calculations, based on simple known functional forms, shown in the table and figure below illustrate the issue starkly. The outcome process was y = m(x) + ε with ε ∼ N(0, σ²). The regression function was either m(x) = 8cos(Bx) or 8sin(Bx). B is a parameter that determines frequency: increasing it increases the local features of the regression function (i.e. makes it more "bumpy") and thus should decrease the MISE-optimal bandwidth. The support over which risk was calculated was [−10, 10]. The table shows the optimal bandwidths that minimize, respectively, the actual MSE, its asymptotic approximation the AMSE, and the MISE, for the two different regression functions. The figures show an example of two different regression functions (cosine and sine with the same amplitude and frequency) with the exact MSE at various bandwidths.

There are a couple of things to note in the simulations shown. Firstly, the bandwidth that minimizes the MISE is almost identical for the sine and the cosine curves. This makes sense: the sine is really only a shifted version of the cosine, and in a sense all the local features are identical.
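The design just described can be mimicked in a few lines. This is a crude Monte Carlo sketch rather than the exact-risk computation behind the table, and the values of B, σ, n, h, and the Gaussian kernel are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def loclin(x0, X, Y, h):
    """Local linear fit at x0 with a Gaussian kernel."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    A = np.column_stack([np.ones_like(X), X - x0])
    Aw = A * w[:, None]
    return np.linalg.solve(Aw.T @ A, Aw.T @ Y)[0]

def mc_mse(m, h, n=200, reps=20, sigma=1.0):
    """Monte Carlo MSE of theta-hat = (1/n) sum_i m-hat(X_i) for
    theta = E m(X), with X uniform on [-10, 10]."""
    grid = np.linspace(-10, 10, 2001)
    theta = np.mean(m(grid))
    sq_errs = []
    for _ in range(reps):
        X = rng.uniform(-10, 10, n)
        Y = m(X) + sigma * rng.normal(size=n)
        theta_hat = np.mean([loclin(x, X, Y, h) for x in X])
        sq_errs.append((theta_hat - theta) ** 2)
    return float(np.mean(sq_errs))

B = 1.3
mse_cos = mc_mse(lambda x: 8 * np.cos(B * x), h=2.0)
mse_sin = mc_mse(lambda x: 8 * np.sin(B * x), h=2.0)
```

Sweeping h and plotting `mc_mse` for the two curves reproduces the qualitative pattern in the figure.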
The bandwidth that minimizes the MSE and its asymptotic approximation, however, is radically different in the two cases: from a finite value for the cosine curve it jumps to infinity for the sine curve. In the case of the AMSE the reason for this is

9 The approximation, incidentally, seems fairly accurate.
Table 7.2.3: Optimal bandwidths (by σ and B: the MSE-, AMSE-, and MISE-minimizing bandwidths for the cosine case, C₂ > 0, and the sine case, C₂ = 0)

Figure 4.1: Regression function and MSE
apparent on inspection of the formulae: C₂ = 0 for the sine curve, leading to an exactly zero asymptotic bias. With the MSE, essentially, if we are only interested in the functional and not the entire function, biases in one region of the support can cancel out biases in other regions, leading to a zero total bias. A similar issue is present with the MSE for θ_SATE near, for instance, CATE, where the issue is the cancellation of biases when taking the difference across the two regression functions: to the first order, the bias of the estimate of the treatment regression is of the same magnitude and direction as the bias of the estimate of the control regression. Here what matters is the difference in total curvature between the treatment and control regressions. Thus, while for functions near the CATE case the optimal MSE-minimizing bandwidth is O(n^{−2/5}), at CATE the rate suddenly jumps to O(n^{−2/9}).

Note that there is a single case in which the bias suddenly changes order in the MISE and the AMISE as well: the linear case. This generally seems to be ignored, for theoretical convenience, in the vast literature around the MISE. For instance, about the expression for the bandwidth that minimizes the MISE to first order with local linear estimation, Fan and Gijbels (1996) write: "It is understood that the integrals are finite and the denominator does not vanish." Since the problem happens only when the regression is close to linear, and one can rule out linearity a priori, as is usually done, it is claimed that this is not a serious problem. However, for the data set I use later on to illustrate some of the issues, this does seem to be an issue. A small modification, noted later, of the solution I propose to the instability in the MSE problem takes care of this case as well. It is also to be noted that the sensitivity to the choice of support is special to the MSE case.
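The jump in rates can be checked numerically by minimizing a stylized risk with an added higher-order bias term; the constants C₁ = C₃ = 1 are placeholders, not the paper's. With C₂ > 0 the minimizer shrinks like n^{−2/5}; at C₂ = 0 the h⁸ term takes over and the rate drops to n^{−2/9}.

```python
import numpy as np

def argmin_h(C1, C2, C3, n):
    """Grid-search minimizer of the stylized risk C1/(n^2 h) + C2 h^4 + C3 h^8."""
    h = np.logspace(-4, 1, 4000)
    risk = C1 / (n**2 * h) + C2 * h**4 + C3 * h**8
    return h[np.argmin(risk)]

def empirical_rate(C2, n1=1e4, n2=1e6):
    """Estimate r in h*(n) = const * n^(-r) from two sample sizes."""
    h1 = argmin_h(1.0, C2, 1.0, n1)
    h2 = argmin_h(1.0, C2, 1.0, n2)
    return float(np.log(h1 / h2) / np.log(n2 / n1))

rate_generic = empirical_rate(1.0)  # approximately 2/5
rate_cate = empirical_rate(0.0)     # approximately 2/9
```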
The essential problem, that small perturbations in the underlying regression function lead to large changes in the optimal bandwidth, can be formalized via the connection to the literature on ill-posedness in optimization theory. If one considers the underlying true regression function as the exogenously given parameter of the problem of choosing the bandwidth to minimize the MSE, then, formally, that problem is in a certain sense (defined below) ill-posed. Ill-posed problems have received a lot of attention recently in econometrics in the
context of deconvolution methods for measurement error and nonparametric IV estimation. A more fundamental example is kernel density estimation. All of these involve linear integral equations of a certain type with a compact operator. Naive solutions to the integral equations which involve inversion are problematic, as the inverse of a compact operator is unbounded, so that the solution is not continuous with respect to perturbations of the defining problem. However, ill-posedness itself is a concept stemming from optimization theory, and while the above form of ill-posedness has a few features in common with the kind of ill-posedness explored in this paper (both lend themselves to solution by similar approaches), the analogy cannot perhaps be pursued much further. We are not dealing here with an integral equation but with an optimization problem with function-valued parameters. Bandwidth choice for functionals is ill-posed in an extended or variational sense. As far as I know, this type of ill-posedness has not been studied in econometrics before. The standard reference here is Zolezzi (1995, 1996).

Define MSE = R(h, m) = E(θ̂_h − θ)². Let h*(m) denote the solution (assuming it exists) to the problem min_h R(h, m), given the unknown regression function m ∈ M, where M is the class of candidate regression functions. This problem has the feature that the map m ↦ h*(m) is not continuous, or, extending Zolezzi (1995, 1996) to make it appropriate to the problem at hand, the problem is ill-posed in an extended sense (proofs of the ill-posedness and precise definitions are given in the appendix). The bandwidth problem has not been looked at in a variational sense (i.e. the optimal bandwidth as a functional of the true regression) before, partly because most attention has been focused on the MISE criterion, where the problem occurs only in the neighborhood of linearity. This case is typically ruled out a priori, as I mentioned earlier.
This strategy cannot be followed here because the problem with the MSE occurs, for instance, in the neighborhood of every regression function with zero integrated curvature over the support, an infinite and unknown set. For instance, the appendix shows that the class of functions for which the problem with minimizing MSE_WA occurs contains all odd functions.
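The discontinuity of the map m ↦ h*(m) can be made concrete with a small numerical sketch (B, C₁, n, and the uniform weight on [−10, 10] are illustrative assumptions): along the path m_t = t·8cos(Bx) + (1 − t)·8sin(Bx), m_t converges uniformly to the sine curve as t → 0, yet its integrated curvature vanishes and the formal optimal bandwidth blows up.

```python
import numpy as np

def C2_path(t, B=1.3, k=20_001):
    """Squared integrated curvature of m_t = t*8cos(Bx) + (1-t)*8sin(Bx)
    on the symmetric support [-10, 10]; it vanishes as t -> 0 because the
    sine is an odd function."""
    x = np.linspace(-10.0, 10.0, k)
    m2 = -8.0 * B**2 * (t * np.cos(B * x) + (1 - t) * np.sin(B * x))
    dx = x[1] - x[0]
    integral = dx * (m2.sum() - 0.5 * (m2[0] + m2[-1]))  # trapezoid rule
    return integral ** 2

def h_star(t, C1=1.0, n=10_000):
    """Formal optimal bandwidth (C1 / (4 C2(t)))^(1/5) n^(-2/5)."""
    return (C1 / (4.0 * C2_path(t))) ** 0.2 * n ** (-0.4)

# A perturbation that is tiny in sup norm (t = 0.001 versus t = 1) changes
# the chosen bandwidth by more than an order of magnitude.
```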
Before discussing potential solutions to the problem, I will review again its implications. Firstly, small perturbations of the support of the function may lead to large changes in the optimally chosen bandwidth (this, for example, is not the case with the AMISE-optimal bandwidth). Secondly, the estimator will be very unstable across repeated samples for regression functions with low total curvature (and, in the case of average treatment effects, for treatment effects that are close to linear in the covariates). Since robustness is an important feature of a statistical estimator, this is troubling, combined with the fact that the MSE can in fact vary quite a lot with different bandwidths. Lastly, another way of putting the problem is in terms of uniform behaviour over the unknown parameter space: the optimal bandwidth does not converge to 0 at the same rate throughout the parameter space. In the CATE case it jumps to O(n^{−2/9}). Usage of a bandwidth of this order in cases of small deviations from CATE will lead to the functional not being estimable at the n^{1/2} rate. When we look at θ_WA, for certain regression functions (an infinite set) the optimal bandwidth becomes infinite. However, the bandwidths that will be estimated from an actual dataset will be O(n^{−2/5}) and bounded with probability one (because C₂ will be estimated as nonzero with probability one with continuously distributed regressors).

6 Regularization

6.1 Regularization

With any solution to the stability issue, one ideally wants an asymptotic no-regret type property for the modified bandwidths, i.e., as the dataset grows in size the difference between the risk under the modified bandwidth procedure and the minimal AMSE should approach 0. This is the type of property that Li (1987), for instance, demands of cross-validation. We shall call this the Li criterion and will be more explicit in the result given below.
As mentioned in the introduction, I approach the problem in two ways, both leading to the same solution.
6.1.1 Penalty term approach

One basic solution to ill-posed problems is to regularize them by adding to the criterion being minimized a nonzero convex penalty term that decreases in relative size as the dataset becomes larger. This general idea is sometimes referred to as Tikhonov regularization in optimization theory. For this problem, note that we want the penalty term to 1) penalize very large bandwidths and 2) penalize ignoring the local features of m; i.e., when the regression curve is very bumpy, we want to guard ourselves against very large bandwidths, because in this case perturbations of the support are likely to have a significant effect on the chosen bandwidth. We also in general want a convex penalty term, as this ensures that the minimization does not become ill-defined.

Now note that though the bandwidth minimizing the mean squared error of the functional estimate is badly behaved, the MISE-minimizing bandwidth is fairly well behaved except in the linear case. (This can also be seen in the asymptotic expressions for the risks: while C₂ is likely to be close to 0 for a wide variety of cases, this is not likely for C₂′.) A penalty term that makes a lot of sense here is then a scaled version of the well-behaved bias term in the expansion of the MISE, where the scale factor decreases with increasing sample size at an appropriate rate, allowing the no-regret criterion to be satisfied asymptotically. Note that this can be interpreted as penalizing the AMSE more when the regression function has more local features (C₂′ large). Also, since the MISE bias is increasing and convex in the chosen bandwidth, the other desired properties are satisfied.

6.1.2 The weight function approach: idea

One can motivate this penalty term in a very different and more intuitive way by considering local perturbations of the regression function.
Using the basic theoretical framework of Bickel and Li (2006), I consider a sequence of regularized estimands, which result from perturbing the weight function in θ = ∫ω(x)m(x)dx by adding a scaled integrable function φ(·) centered at a point x₀, scaled such that the new weight function still integrates to 1. Then, letting the perturbation get smaller with larger sample size, such
that in the limit the new weight function is essentially the original weight function plus the Dirac delta generalized function, one finds the bandwidth that minimizes the risk of the perturbed problem in the limit. However, one might have chosen the particular point x₀ in such a way that the perturbed problem is still ill-posed (e.g., if m″(x₀) = 0), so after calculating the MSE of the point-perturbed problem, I average over the possible points x₀ around which we can perturb the weight function. This I call the regularized problem. One can show that when C₂ ≠ 0, we get an MSE expansion for the regularized problem that is to the first order the same as the risk with the convex penalty term we have chosen. Thus the addition of the convex penalty term can be thought of as asking for a bandwidth that minimizes risk averaged over small uncertainties/perturbations in the weighting function. The section below formalizes this notion.

6.1.3 The weight function approach: formalization

I now formalize this approach for θ_WA; the same can be done for θ_SATE. Consider the weight function ω(·) and a perturbation function φ(·) which fulfills the conditions ∫|φ(x)|dx < ∞ and sup_x |φ(x)| < ∞. Let b be a positive constant close to 0 and let α = 1 − n^{−γ}. Now perturb the weight function at x₀ in the following way:

    ω_{x₀}(x) = αω(x) + (1 − α)φ_{x₀}(x),  where  φ_{x₀}(x) = φ((x − x₀)/b) / ∫φ((t − x₀)/b)f(t)dt.

Note that the perturbed weight function still integrates to 1 against the design density f. Now, given θ = E_F(ω(X)m(X)) and θ̂ = E_{F_n}(ω(X)m̂(X)), define

    θ(x₀) = αθ + (1 − α) [∫φ((x − x₀)/b)m(x)f(x)dx] / [∫φ((t − x₀)/b)f(t)dt] = θ + O(n^{−γ}),

    θ̂(x₀) = E_{F_n}(ω_{x₀}(X)m̂(X)).

Now we will consider the MSE of the perturbed problem averaged over all the possible points x₀ of perturbation, MSE_b, and then finally let the perturbation go to 0, i.e.
consider MSE_reg = lim_{b→0} MSE_b:

    MSE_b = E_{x₀} E[θ̂(x₀) − θ]²
          = E_{x₀} E[θ̂(x₀) − θ(x₀)]² + E_{x₀}[θ(x₀) − θ]² + 2E_{x₀}[θ(x₀) − θ][E θ̂(x₀) − θ(x₀)]
          = E_{x₀} E[θ̂(x₀) − θ(x₀)]² + O(n^{−2γ}) + O(h²n^{−γ})
          = E_{x₀}[E θ̂(x₀) − θ(x₀)]² + E_{x₀} V(θ̂(x₀)) + O(h²n^{−γ})
          ≡ Bias²_{b,reg} + Var_{b,reg} + O(h²n^{−γ})
          ≡ MSE_{b,reg} + O(h²n^{−γ}).

Note that the additional term of order O(n^{−2γ}) has been ignored, as we will need n^{−γ} = o(h²) later on to ensure that the leading term in the asymptotic expansion of the bias part of MSE_reg, namely Bias_b ≡ E_{x₀}[E θ̂(x₀) − θ(x₀)], dominates the remainder in MSE_{b,reg}. Using h = O(n^{(δ−2)/5}), we will then have the condition that γ > (4 − 2δ)/5.

Define B_b = ∫φ((x − x₀)/b)f(x)dx. Now consider E θ̂(x₀):

    E θ̂(x₀) = E_{F_n}(ω_{x₀}(X)m̂(X))
             = α E θ̂ + (1 − α) E_{F_n}(φ_{x₀}(X)m̂(X))
             = α E θ̂ + [(1 − α)/B_b] ∬φ((x − x₀)/b) K_h(x − u) m(u) f(u) du dx
             = α(θ + C₂^{1/2}h² + o(h²)) + [(1 − α)/B_b] ∫φ((x − x₀)/b)m(x)f(x)dx
               + [(1 − α)/B_b] h² (T₂(K)/2) ∫φ((x − x₀)/b)(mf)″(x)dx + o(h²).

10 This implies that δ, the regularization parameter as defined in the main body of the paper to follow, will have to fulfill δ > 2/7. Thus, strictly speaking, we can view this alternative approach as being equivalent to the approach with the convex penalty term if δ ∈ (2/7, 1/3).
Therefore:

    E θ̂(x₀) − θ(x₀) = h² ( αC₂^{1/2} + (1 − α)(T₂(K)/2) [∫φ((x − x₀)/b)(mf)″(x)dx] / [∫φ((t − x₀)/b)f(t)dt] ) + o(h²),

    (E θ̂(x₀) − θ(x₀))² = h⁴ ( α²C₂ + α(1 − α)T₂(K)C₂^{1/2} [∫φ((x − x₀)/b)(mf)″(x)dx] / [∫φ((t − x₀)/b)f(t)dt] )
                          + h⁴ (1 − α)² (T₂(K)/2)² ( [∫φ((x − x₀)/b)(mf)″(x)dx] / [∫φ((t − x₀)/b)f(t)dt] )² + o(h⁴).

This implies, since Bias²_{b,reg} = E_{x₀}(E θ̂(x₀) − θ(x₀))², that:

    Bias²_{b,reg} = h⁴ [ α²C₂ + n^{−γ} α T₂(K) C₂^{1/2} E_{x₀}( [∫φ((x − x₀)/b)(mf)″(x)dx] / [∫φ((t − x₀)/b)f(t)dt] ) ]
                    + h⁴ n^{−2γ} (T₂(K)/2)² E_{x₀}( [∫φ((x − x₀)/b)(mf)″(x)dx] / [∫φ((t − x₀)/b)f(t)dt] )².

Now we consider Bias²_reg = lim_{b→0} Bias²_{b,reg}, i.e. we let the perturbations go to 0. We assume that inf f(x) = c > 0 and sup |(mf)″(x)| = M < ∞. We then get, using the dominated convergence theorem,

    Bias²_reg = h⁴ [ α²C₂ + n^{−γ} T₂(K) C₂^{1/2} α ∫(mf)″(x)dx + n^{−2γ} (T₂(K)/2)² ∫ [(mf)″(x)]²/f(x) dx ] + o(h⁴)
              = h⁴ [ α²C₂ + n^{−γ} ( T₂(K) C₂^{1/2} α ∫(mf)″(x)dx ) + n^{−2γ} C₂′ ] + o(h⁴).

For C₂ ≈ 0, therefore:

    Bias²_reg = n^{−2γ}h⁴C₂′ + O(n^{−γ}h⁴C₂^{1/2}) + o(h⁴).

Note that this is exactly, to the leading order, the bias term in the regularized AMSE in the main body of the paper, as the first term in the above expansion is the convex penalty term we chose. It is similarly easily shown that Var_reg = lim_{b→0} Var_{b,reg} has to the leading order the same expression as in the regularized AMSE, thus substantiating the claim that the bandwidth that minimizes MSE_reg is to the leading order the same as the bandwidth that minimizes the regularized AMSE in the paper.

11 Alternatively we could assume that (mf)″ has one continuous derivative and take a Taylor expansion assuming b small. But this is theoretically unappealing, as we have been making the minimal assumption throughout that mf(·) has two continuous derivatives.
6.2 Properties of regularization

The key theoretical issue is to make sure, when the regularization term is added, that it is large enough to effect regularization and small enough that the new regularized risk satisfies the Li criterion and the semiparametric rate still obtains. We also have to be careful about higher-order terms (see appendix). This tradeoff yields bounds on the rate constant. Specifically, we consider a penalty term of the form λ_WA = n^{−δ}C₂′h⁴. Thus the modified risk is:

    MSE_reg,WA = C₀/n + C₁/(n²h) + (T₂(K)/2)² [ ( ∫(mf)″(x)ω(x)dx )² + n^{−δ} ∫ [(mf)″(x)]²/f(x) dx ] h⁴ + o_p(n^{−2}h^{−1} + h⁴)
               = C₀/n + C₁/(n²h) + (C₂ + n^{−δ}C₂′)h⁴ + o_p(n^{−2}h^{−1} + h⁴).

Using this modified risk, we have a formula for the regularized bandwidth:

    h^opt_reg,WA = ( C₁ / (4(C₂ + n^{−δ}C₂′)) )^{1/5} n^{−2/5}.

Following the same logic, we can also give a regularized bandwidth for the SATE:

    h^opt_reg,SATE = ( K₁ / (4(K₂ + n^{−δ} ν ∫ ½([m₁″]² + [m₀″]²) f dx)) )^{1/5} n^{−2/5} ≡ C^opt_reg n^{−2/5}.

Note that if one wants to insure oneself against instability in the near-linear case as well, one can add an additional regularization term. This term will, as before, have to decline at the appropriate range of rates and will have to scale sensibly with rescalings of the data. One potential term that works is to add, to the MISE regularization term above, the variance of the estimated coefficient on the linear term in a global linear regression fitted across the entire support. If the linear term is precisely estimated, the regularization term will be smaller; the term is also sensitive to rescalings of the data. However, this issue clearly needs more careful investigation, and the above should only be taken as a crude and tentative suggestion.

We can now state the main results for both regularized bandwidths above:
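A sketch of the regularized rule (the constants and δ = 0.3, taken inside the admissible range discussed above, are placeholders): the penalty bounds the effective bias constant away from zero, so the bandwidth stays finite and of order roughly n^{−2/5} even when C₂ = 0.

```python
def h_reg(C1, C2, C2_prime, n, delta=0.3):
    """Regularized bandwidth (C1 / (4 (C2 + n^-delta * C2')))^(1/5) n^(-2/5)."""
    return (C1 / (4.0 * (C2 + n ** (-delta) * C2_prime))) ** 0.2 * n ** (-0.4)

n = 10_000
h_generic = h_reg(1.0, 1.0, 1.0, n)     # C2 > 0: close to the unregularized rule
h_degenerate = h_reg(1.0, 0.0, 1.0, n)  # C2 = 0: finite, where (C1/(4*0))^(1/5) is not
```

As the penalty factor n^{−δ} shrinks, the rule converges to the unregularized bandwidth whenever C₂ > 0, which is the no-regret property the Li criterion asks for.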
POLI 8501 Introduction to Maximum Likelihood Estimation Maximum Likelihood Intuition Consider a model that looks like this: Y i N(µ, σ 2 ) So: E(Y ) = µ V ar(y ) = σ 2 Suppose you have some data on Y,
More informationTime Series and Forecasting Lecture 4 NonLinear Time Series
Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations
More informationSTATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN
Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 2: MODES OF CONVERGENCE AND POINT ESTIMATION Lecture 2:
More informationNotes on Random Variables, Expectations, Probability Densities, and Martingales
Eco 315.2 Spring 2006 C.Sims Notes on Random Variables, Expectations, Probability Densities, and Martingales Includes Exercise Due Tuesday, April 4. For many or most of you, parts of these notes will be
More informationIs there an optimal weighting for linear inverse problems?
Is there an optimal weighting for linear inverse problems? Jean-Pierre FLORENS Toulouse School of Economics Senay SOKULLU University of Bristol October 9, 205 Abstract This paper considers linear equations
More informationOn IV estimation of the dynamic binary panel data model with fixed effects
On IV estimation of the dynamic binary panel data model with fixed effects Andrew Adrian Yu Pua March 30, 2015 Abstract A big part of applied research still uses IV to estimate a dynamic linear probability
More informationChapter 2: simple regression model
Chapter 2: simple regression model Goal: understand how to estimate and more importantly interpret the simple regression Reading: chapter 2 of the textbook Advice: this chapter is foundation of econometrics.
More information13 Endogeneity and Nonparametric IV
13 Endogeneity and Nonparametric IV 13.1 Nonparametric Endogeneity A nonparametric IV equation is Y i = g (X i ) + e i (1) E (e i j i ) = 0 In this model, some elements of X i are potentially endogenous,
More informationNonparametric Regression Härdle, Müller, Sperlich, Werwarz, 1995, Nonparametric and Semiparametric Models, An Introduction
Härdle, Müller, Sperlich, Werwarz, 1995, Nonparametric and Semiparametric Models, An Introduction Tine Buch-Kromann Univariate Kernel Regression The relationship between two variables, X and Y where m(
More informationPreface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation
Preface Nonparametric econometrics has become one of the most important sub-fields in modern econometrics. The primary goal of this lecture note is to introduce various nonparametric and semiparametric
More informationON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS
ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS Olivier Scaillet a * This draft: July 2016. Abstract This note shows that adding monotonicity or convexity
More informationNonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity
Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Songnian Chen a, Xun Lu a, Xianbo Zhou b and Yahong Zhou c a Department of Economics, Hong Kong University
More informationSupplemental Appendix to "Alternative Assumptions to Identify LATE in Fuzzy Regression Discontinuity Designs"
Supplemental Appendix to "Alternative Assumptions to Identify LATE in Fuzzy Regression Discontinuity Designs" Yingying Dong University of California Irvine February 2018 Abstract This document provides
More informationNonparametric Modal Regression
Nonparametric Modal Regression Summary In this article, we propose a new nonparametric modal regression model, which aims to estimate the mode of the conditional density of Y given predictors X. The nonparametric
More informationThe risk of machine learning
/ 33 The risk of machine learning Alberto Abadie Maximilian Kasy July 27, 27 2 / 33 Two key features of machine learning procedures Regularization / shrinkage: Improve prediction or estimation performance
More informationECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd
ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation Petra E. Todd Fall, 2014 2 Contents 1 Review of Stochastic Order Symbols 1 2 Nonparametric Density Estimation 3 2.1 Histogram
More informationRobustness to Parametric Assumptions in Missing Data Models
Robustness to Parametric Assumptions in Missing Data Models Bryan Graham NYU Keisuke Hirano University of Arizona April 2011 Motivation Motivation We consider the classic missing data problem. In practice
More informationIDENTIFICATION OF MARGINAL EFFECTS IN NONSEPARABLE MODELS WITHOUT MONOTONICITY
Econometrica, Vol. 75, No. 5 (September, 2007), 1513 1518 IDENTIFICATION OF MARGINAL EFFECTS IN NONSEPARABLE MODELS WITHOUT MONOTONICITY BY STEFAN HODERLEIN AND ENNO MAMMEN 1 Nonseparable models do not
More informationTransparent Structural Estimation. Matthew Gentzkow Fisher-Schultz Lecture (from work w/ Isaiah Andrews & Jesse M. Shapiro)
Transparent Structural Estimation Matthew Gentzkow Fisher-Schultz Lecture (from work w/ Isaiah Andrews & Jesse M. Shapiro) 1 A hallmark of contemporary applied microeconomics is a conceptual framework
More informationSupervised Learning: Non-parametric Estimation
Supervised Learning: Non-parametric Estimation Edmondo Trentin March 18, 2018 Non-parametric Estimates No assumptions are made on the form of the pdfs 1. There are 3 major instances of non-parametric estimates:
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More information4 Nonparametric Regression
4 Nonparametric Regression 4.1 Univariate Kernel Regression An important question in many fields of science is the relation between two variables, say X and Y. Regression analysis is concerned with the
More informationEcon 673: Microeconometrics Chapter 12: Estimating Treatment Effects. The Problem
Econ 673: Microeconometrics Chapter 12: Estimating Treatment Effects The Problem Analysts are frequently interested in measuring the impact of a treatment on individual behavior; e.g., the impact of job
More informationConsistency and Asymptotic Normality for Equilibrium Models with Partially Observed Outcome Variables
Consistency and Asymptotic Normality for Equilibrium Models with Partially Observed Outcome Variables Nathan H. Miller Georgetown University Matthew Osborne University of Toronto November 25, 2013 Abstract
More information3.3 Estimator quality, confidence sets and bootstrapping
Estimator quality, confidence sets and bootstrapping 109 3.3 Estimator quality, confidence sets and bootstrapping A comparison of two estimators is always a matter of comparing their respective distributions.
More informationIndependent and conditionally independent counterfactual distributions
Independent and conditionally independent counterfactual distributions Marcin Wolski European Investment Bank M.Wolski@eib.org Society for Nonlinear Dynamics and Econometrics Tokyo March 19, 2018 Views
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationWooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics
Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).
More informationClassical regularity conditions
Chapter 3 Classical regularity conditions Preliminary draft. Please do not distribute. The results from classical asymptotic theory typically require assumptions of pointwise differentiability of a criterion
More informationLocal Polynomial Regression
VI Local Polynomial Regression (1) Global polynomial regression We observe random pairs (X 1, Y 1 ),, (X n, Y n ) where (X 1, Y 1 ),, (X n, Y n ) iid (X, Y ). We want to estimate m(x) = E(Y X = x) based
More informationA Novel Nonparametric Density Estimator
A Novel Nonparametric Density Estimator Z. I. Botev The University of Queensland Australia Abstract We present a novel nonparametric density estimator and a new data-driven bandwidth selection method with
More informationMinimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data. Jeff Dominitz RAND. and
Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data Jeff Dominitz RAND and Charles F. Manski Department of Economics and Institute for Policy Research, Northwestern
More informationNonparametric Regression. Badr Missaoui
Badr Missaoui Outline Kernel and local polynomial regression. Penalized regression. We are given n pairs of observations (X 1, Y 1 ),...,(X n, Y n ) where Y i = r(x i ) + ε i, i = 1,..., n and r(x) = E(Y
More information1 Lyapunov theory of stability
M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability
More informationDESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA
Statistica Sinica 18(2008), 515-534 DESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA Kani Chen 1, Jianqing Fan 2 and Zhezhen Jin 3 1 Hong Kong University of Science and Technology,
More informationUnconditional Quantile Regression with Endogenous Regressors
Unconditional Quantile Regression with Endogenous Regressors Pallab Kumar Ghosh Department of Economics Syracuse University. Email: paghosh@syr.edu Abstract This paper proposes an extension of the Fortin,
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationComments on: Panel Data Analysis Advantages and Challenges. Manuel Arellano CEMFI, Madrid November 2006
Comments on: Panel Data Analysis Advantages and Challenges Manuel Arellano CEMFI, Madrid November 2006 This paper provides an impressive, yet compact and easily accessible review of the econometric literature
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationNonconvex penalties: Signal-to-noise ratio and algorithms
Nonconvex penalties: Signal-to-noise ratio and algorithms Patrick Breheny March 21 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/22 Introduction In today s lecture, we will return to nonconvex
More informationNONPARAMETRIC ESTIMATION OF AVERAGE TREATMENT EFFECTS UNDER EXOGENEITY: A REVIEW*
OPARAMETRIC ESTIMATIO OF AVERAGE TREATMET EFFECTS UDER EXOGEEITY: A REVIEW* Guido W. Imbens Abstract Recently there has been a surge in econometric work focusing on estimating average treatment effects
More informationOptimal Bandwidth Choice for the Regression Discontinuity Estimator
Optimal Bandwidth Choice for the Regression Discontinuity Estimator Guido Imbens and Karthik Kalyanaraman First Draft: June 8 This Draft: February 9 Abstract We investigate the problem of optimal choice
More information