Bandwidth choice for regression functionals with application to average treatment effects
Karthik Kalyanaraman

November 2008

Abstract: I investigate the problem of optimal selection of the smoothing parameter, or bandwidth, when one is interested in finite-dimensional smooth functionals of a regression function. Examples of such functionals include average treatment effects, nonparametric versions of the Blinder-Oaxaca decomposition used to analyze labor market discrimination, nonparametric policy effects of macroeconomic interventions, and average derivatives. First I emphasize that this problem is very different from the problem of optimal bandwidth selection when one's interest is in the entire regression function, with different rates of convergence of the smoothing parameter to zero. Further, I show that the problem is in general characterized by an instability or, technically, a form of ill-posedness: small changes in the underlying joint distribution of the data can cause large changes in the chosen bandwidth. I propose a simple solution to this problem and prove its optimality properties. As an example I focus on how to choose the bandwidth for average treatment effects in observational studies when a local linear regression estimate is used. I derive original approximations to the estimation error and use these to provide a rule-of-thumb bandwidth selection algorithm for applied work. Finally, I use data from Imbens, Rubin and Sacerdote (2001) on lottery winnings and the effect of unearned income on labor supply; the data is used to demonstrate and assess both the bandwidth algorithm and some of the theoretical issues involved.

JEL Classification: C, C2, J7

Keywords: Plug-in estimator, Nonparametric, ATE, Regression discontinuity

The project has been sustained by Guido Imbens, who has provided crucial support at each stage. I am very grateful for detailed suggestions and support from Gary Chamberlain and Lawrence Katz.
I have also benefited a great deal from comments made by Alberto Abadie, Victor Chernozhukov, Rustam Ibragimov, Alex Kaufman, Jean Lee, Ulrich Müller, Eduardo Morales, James Stock and workshop participants at Harvard.

Harvard University, Cambridge, MA. Electronic correspondence: kalyanar@harvard.edu
1 Introduction

A large proportion of policy-relevant statistics of interest to economists can be described as regression functionals, that is, real-valued functions of an underlying regression function. Examples include, among others, average treatment effects, the Blinder-Oaxaca wage decompositions used to analyze wage discrimination (Blinder (1973), Oaxaca (1973)), the counterfactual effect of a macroeconomic policy intervention (Stock (1989)), weighted average derivatives (Powell, Stock and Stoker (1989)), and regression discontinuity estimators (see Hahn, Todd and van der Klaauw (2001) for theory; several papers, including Lee (2008), for applied work). A standard method for estimating such functionals, which avoids functional form assumptions, is to first estimate the regression function nonparametrically and then to compute the functional using the estimated regression function. Such estimators are called plug-in estimators. A key issue here is the choice of smoothing parameter, or bandwidth, in the first step. While this choice is well explored in settings where the primary object of interest is the entire regression function, it has not been studied in detail where the primary object of interest is a finite-dimensional functional. This paper studies optimal bandwidth choice in this setting. More specifically, I look at the subgroup of functionals that are estimable at an N^{1/2} rate. Examples include averages of the regression function or its derivatives (all the functionals cited above except regression discontinuity belong to this class). Though some work has been done on bandwidth selection for functionals that are not of this type, e.g. regression discontinuity (Imbens and Kalyanaraman, 2008), this paper focuses on N^{1/2}-estimable functionals. For illustrative purposes I consider two specific functionals, though the methods proposed extend transparently to others in the class.
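As a concrete sketch of the two-step plug-in idea (the Gaussian kernel, the data-generating process, and all function names here are my own illustrative choices, not the paper's):

```python
import numpy as np

def nadaraya_watson(x0, X, Y, h):
    # Kernel regression estimate of m(x0) with bandwidth h (Gaussian kernel).
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

def plug_in_average(X, Y, h):
    # Step 1: estimate m nonparametrically; step 2: plug it into the
    # functional -- here theta = E[m(X)], averaged over the sample points.
    return float(np.mean([nadaraya_watson(xi, X, Y, h) for xi in X]))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, 500)
Y = X**2 + rng.normal(0.0, 0.1, 500)      # m(x) = x^2, so theta = E[X^2]
theta_hat = plug_in_average(X, Y, h=0.1)
```

The same two-step template covers the other functionals discussed in the paper: only the final averaging step changes.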
First, I focus on a simple average of the estimated regression function calculated over the sample points. (There is a literature on undersmoothing, e.g. Newey (1994), Goldstein and Messer (1992), Powell, Stock and Stoker (1989) and Stock (1989), to achieve N^{1/2}-consistency, but no specific bandwidth proposals except for density-weighted average derivatives (see Goh (2007)). There has also been no discussion of the ill-posedness and instability inherent to these problems, and the estimator proposed for density-weighted averages potentially suffers from the same instability.) Second, I demonstrate the resulting bandwidth procedure on a commonly used
functional in labor and development studies, the average treatment effect. Though the results for this functional are written up in terms of the sample average treatment effect (SATE), I briefly show first how the same results are valid for the population average treatment effect and for nonparametric versions of the Blinder-Oaxaca decompositions used to study labor-force discrimination as well. I indicate how a minor modification of the results can be used for bandwidth choice in the case of constructing policy effects of counterfactual macroeconomic interventions as in Stock (1989), and for average derivatives. The standard approach to the bandwidth problem is to choose a bandwidth that minimizes some measure of global risk for the entire regression function, usually mean integrated squared error (MISE), i.e. the expected squared error integrated over the entire curve. The optimal bandwidth is then estimated either using plug-in estimators of the minimizer of the asymptotic approximation to the MISE or using an unbiased data-based estimator of the MISE (cross-validation). While this is appropriate if we are interested in the entire regression function, it is not the correct risk measure for the estimation of a particular functional of the regression function. The first contribution of this paper is to propose and investigate a criterion, the mean squared error criterion (MSE), based directly on the functional and on the standard squared-error loss criterion used in estimation problems, and to solve for the optimal bandwidth under this criterion. Once the risk measure is defined, I approximate the risk in terms of functionals of the data and then find the bandwidth that minimizes the leading terms in the approximation. The solution obtained for the MSE is quite different from the one obtained by minimizing an asymptotic approximation to the MISE, or by doing standard cross-validation.
This is essentially because the averaging done in smooth functionals gets rid of some variance, and as a result it is optimal to remove more bias by using a smaller bandwidth. Indeed, the rates of convergence to zero are different. (Alternative nonparametric decompositions using densities rather than regression functions are discussed in DiNardo, Fortin and Lemieux (1996).) The point about the different rates has been made before in the literature on undersmoothing (see Newey (1994) and Goldstein and Messer (1992)); however, it is worth highlighting in view of
inappropriate methods like cross-validation being used for this problem in practice. The second contribution of the paper is to show that the problem of minimizing the MSE is subject to an instability. For a large class of regression functions, small perturbations of the support or of the function itself can lead to large differences in the optimal bandwidth (in general discontinuously changing its convergence rate to zero, and in some cases moving from a finite value to infinity). Formally, the problem exhibits a certain kind of ill-posedness. This leads to estimators performing poorly if the regression function is close to the class of badly behaved regression functions; in the case of the ATE the bad behavior occurs at and near two particularly interesting cases: constant additive treatment effects (including no effect of treatment) and treatment effects linear in the measured covariate(s). I consider some solutions to this problem and propose a method of regularization that ensures two properties of the stable solution: 1) the risk evaluated at this regularized estimate is close to the optimal solution as the data set becomes large (a type of asymptotic no-regret property), and 2) the estimated bandwidth is close to the optimal bandwidth in large datasets. I motivate this solution in two different ways. Firstly, the solution can be considered a type of Tikhonov regularization, where a convex penalty term, which penalizes extremely large bandwidths for bumpy regression functions, is added to the loss function, and the sum is then minimized to find the optimal regularized bandwidth. Secondly, the same solution can be regarded as minimizing the risk averaged over local uncertainties in the weight function. This second way of approaching the issue is very much in the spirit of Bickel and Li (2006).
Finally, I use data on lottery winnings from Imbens, Rubin and Sacerdote (2001) that was collected to try to identify the effect of unearned income on labor supply: I use this dataset both to illustrate the properties of the proposed ATE bandwidth selection algorithm and to illustrate some theoretical issues raised by this paper. (I use the term regularization in a similar sense as Bickel and Li (2006), who discuss the notion of approximating a difficult, possibly singular, problem with simpler regularized problems that approach the former in the limit.) The simulations indicate that there are potential gains to be achieved, even in fairly linear regression
functions, in bias reduction and bandwidth stabilization. In sum, the contributions to the existing literature are: 1) the approximation of the risk function for the ATE when the estimation method is local linear regression, 2) the formulation of an objective data-based bandwidth algorithm for the ATE leading to its N^{1/2}-consistent estimation, and 3) the highlighting and exploration of the ill-posedness of the bandwidth problem in the context of minimizing the MSE, and its solution using regularization.

2 A general introduction to the bandwidth problem

Consider first a case in which one is interested in estimating a regression function without making functional form assumptions. Examples would include estimation of Engel curves (Bierens and Pott-Buter, 1990), estimation of the relationship between calories consumed and food prices (Subramanian and Deaton, 1996), and so on. There are several competing ways of doing this, such as methods based on kernels, series, sieves, etc. (see Wasserman (2006) for a description and evaluation of these methods), but all of these require a choice of what is called the smoothing parameter or bandwidth. This is essentially a choice of how complex a model to fit to the data. With more complex models we reduce bias, but this typically comes at the cost of increasing variance. In the regression function context, this choice is made a little clearer by imagining potential extremes. One of the simplest regression functions we could fit, say in the kernel context, would be the constant function, corresponding to an infinite bandwidth; a zero bandwidth, on the other hand, corresponds to the very complicated regression function obtained from simple interpolation of the data. Thus the trade-off between bias and variance gives us a framework to think about optimality in bandwidth selection.
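The two extremes described above can be seen directly in a short sketch (kernel choice and data are my own illustrative assumptions):

```python
import numpy as np

def nw(x0, X, Y, h):
    # Nadaraya-Watson kernel regression estimate at x0 (Gaussian kernel).
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 50)
Y = np.sin(2 * np.pi * X) + rng.normal(0.0, 0.2, 50)

# Enormous bandwidth: the fit collapses to the sample mean (constant fit).
flat = nw(0.25, X, Y, h=1e6)
# Tiny bandwidth: the fit reproduces the observation at the evaluation point
# (interpolation of the data).
interp = nw(X[10], X, Y, h=1e-4)
```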
This choice is quite familiar in the regression context, with several methods that allow the researcher to choose an objective, data-based and optimal bandwidth for fitting the entire regression function (see Wasserman (2006) and Fan and Gijbels (1996)). (Local linear regression is attractive due to its good bias properties in regions of low density and at the boundaries; see also Heckman, Smith and Todd (1998), who recommend its use for calculating the ATE.) It is far less
so in the context where one is interested in a scalar functional of the regression function. One could argue that economists are typically more interested in scalar functionals of the regression function than in the function itself, just as one is far more interested in the coefficients of a linear regression than in the regression function itself (the average derivative could be one possible analogue, in the nonlinear context, of the slope coefficient of a linear regression). Examples of such functionals include the regression function at a point (e.g. the regression discontinuity estimator), a weighted average of the regression (various average treatment effects, e.g. Dehejia and Wahba (1999), Imbens, Rubin and Sacerdote (2001)), average partial derivatives (e.g. average cross-partials used as an index of substitutability of two goods), and so on. Indeed, most policy effects that summarize regression relationships (e.g. Stock (1989)) are (typically smooth) functionals. Just as in regression function estimation, one might want to estimate the functional of interest without functional form restrictions, as when one is concerned about nonlinearities in the underlying regression relationship (e.g. Heckman et al. (1998)) or, in the case of the ATE, about different covariate distributions in the treatment and control populations (Dehejia and Wahba (1999)). In this context, the choice of a bandwidth becomes a key decision. Once we have a procedure for selecting the bandwidth, we can form a plug-in estimate of the functional, by first estimating the regression nonparametrically with that bandwidth and then using this estimate in the formula for the functional. But there has not been much work done on how to choose the bandwidth when the interest is not in the regression function itself. There are no objective methods available to apply to the problem. Thus applied researchers have either used a fixed, subjectively chosen bandwidth (e.g. Heckman et al. (1998) choose a fixed bandwidth in their discussion of calculating the ATE using local linear regressions) or used cross-validation.
Cross-validation essentially amounts to minimizing an unbiased estimate of the risk of estimating the entire regression function; it is not appropriate to the problem of choosing a bandwidth for functionals. Worse, to estimate smooth functionals like various average treatment effects at an N^{1/2} rate using kernels, we need to use a bandwidth that converges to zero faster than the
bandwidth sequence for fitting the entire function. This point was highlighted by the undersmoothing literature (Newey (1994)). Thus tailoring the bandwidth method to the functional at hand is very important, not only if we want to estimate a bandwidth that is theoretically best for a given dataset, but also if we care about a sensible estimate of the functional in general. This paper seeks to investigate this choice and to provide, in the instance of various policy effects, a concrete, objective and data-based way of selecting the bandwidth.

3 Smooth regression functionals

The key feature of these functionals is that they are estimable at an N^{1/2} rate. A general characterization using the Riesz integral representation of these functionals is available in Goldstein and Messer (1992). For exposition I will further assume that the functionals are linear (if they are not, a linearization approximation using Fréchet-type derivatives of the functional can be considered) and indeed focus on weighted average regressions as the key example, as most smooth linear functionals can be approximated by sums of weighted average regressions and regression derivatives (again see Goldstein and Messer (1992)). Examples of smooth linear functionals that arise frequently in applied work:

1. Average treatment effect (ATE): Under unconfoundedness ((Y^0, Y^1) ⊥ W | X) and full overlap, the population ATE (PATE) can be written as

θ = E(m_1(X) − m_0(X)),

where m_1(x) = E(Y | W = 1, X = x), etc. This is the difference between the two regressions, averaged over the distribution of the covariate X in the population. One might be interested instead in the sample ATE (SATE), where the averaging is done over the empirical sample distribution of the covariate:

θ = (1/N) Σ_{i=1}^N (m_1(X_i) − m_0(X_i)).

In both cases the ATE can be estimated by

θ̂ = (1/N) Σ_i (m̂_1(X_i) − m̂_0(X_i)).
2. Nonparametric versions of Blinder-Oaxaca wage decompositions: Used to analyze labor-market discrimination, this functional has the same statistical structure as the PATE. Consider two groups, say women and men, indexed by T ∈ {0, 1}. To get that part of the difference in some average outcome Y, say wages, that is not attributable to differences in observed characteristics X (say IQ scores), one estimates

θ = E_X(E(Y | X, T = 1) − E(Y | X, T = 0)).

Here the outer expectation is taken with respect to some reference distribution: either the conditional covariate distribution in one of the two populations, or the covariate distribution unconditional on group.

3. Average derivative effect (ADE): Analyzed in Powell, Stock and Stoker (1989) and Härdle and Stoker (1989), these functionals are of the type θ = E(∂m(X)/∂x) (Powell, Stock and Stoker (1989) considered density-weighted average derivatives of the form E(f(X) m′(X))). These arise in semiparametric estimation of index models as estimands of interest and may also be of independent interest in demand estimation problems. Bandwidth selection issues for these estimators are identical in spirit to the issues with the simple weighted average θ_WA = E(ω(X)m(X)) considered later, and a straightforward extension is immediate.

4. Another example, considered in Stock (1989), involves assessing the impact of policies that leave the structural relationship m(x) = E(Y | X = x) between an outcome variable Y and a policy instrument X unchanged but affect the distribution of the policy instrument. Here one is typically interested in calculating the mean effect on Y, E(m(X)), under the two different distributions of X, one being the current empirical distribution and the other the future counterfactual. In other words:

θ = E(ω(X)m(X)) − E(m(X)).
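For the ATE in this list, the plug-in estimator θ̂ = (1/N) Σ_i (m̂_1(X_i) − m̂_0(X_i)) can be sketched as follows. (This sketch uses a Nadaraya-Watson smoother for brevity; the paper itself advocates local linear regression for the ATE. All parameter values are illustrative.)

```python
import numpy as np

def nw(x0, X, Y, h):
    # Nadaraya-Watson kernel regression estimate at x0 (Gaussian kernel).
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

def sate_plugin(X, Y, T, h):
    # Fit each group's regression on that group's data only, evaluate both
    # at every sample point, and average the difference.
    X1, Y1 = X[T == 1], Y[T == 1]
    X0, Y0 = X[T == 0], Y[T == 0]
    m1 = np.array([nw(xi, X1, Y1, h) for xi in X])
    m0 = np.array([nw(xi, X0, Y0, h) for xi in X])
    return float(np.mean(m1 - m0))

rng = np.random.default_rng(2)
n = 1000
X = rng.uniform(0.0, 1.0, n)
T = rng.integers(0, 2, n)
Y = X + 2.0 * T + rng.normal(0.0, 0.1, n)   # constant additive effect of 2
ate_hat = sate_plugin(X, Y, T, h=0.1)
```

Incidentally, this simulated design (a constant additive treatment effect) is exactly one of the ill-posed cases the paper flags, for which the first-order bias of the plug-in estimator vanishes.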
3.1 Setup

3.1.1 Weighted average regression

As mentioned earlier, I focus on two functionals in the paper. The first, the weighted average regression θ_WA, is chosen 1) due to its simplicity and ease of illustration of the key theoretical issues and 2) for the ease of extension of the ideas to functionals like weighted average derivatives. To make the essential aspects of the problem clear I work with a scalar covariate X and a known regression design (i.e. the density function of X, f(x), is assumed known or controlled by the researcher). In other words, the regression at X_i is calculated as

m̂(X_i) = (1/((n−1) f(X_i))) Σ_{j≠i} K_h(X_i − X_j) Y_j.

Assuming a random design would only make the calculations more tedious while still preserving the nature of the problem explored in this paper. I stress here that these assumptions are not necessary: the calculations provided in the section on the ATE dispense with such assumptions, and those calculations can also immediately be modified for this case. These assumptions, however, allow one to illustrate the issues in the familiar setting of a simple kernel estimate, rather than the more complex local linear regression used later. The output variable Y is generated by the following process:

Y_i = m(X_i) + ε_i.

The errors ε_i are independently and identically distributed, with conditional mean zero and conditional variance σ²(x). Thus m(x) = E(Y | X = x). The functional of interest is

θ_WA = E(ω(X)m(X)),

where ω(·) is a known smooth weighting function.
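The known-design estimator just displayed can be written out directly (a sketch; the Gaussian kernel and the constant regression used for the sanity check are my own choices):

```python
import numpy as np

def m_hat_known_design(i, X, Y, h, f):
    # Leave-one-out kernel estimate of m at X_i when the design density f is
    # known, so no density estimation is needed in the denominator.
    n = len(X)
    Xj, Yj = np.delete(X, i), np.delete(Y, i)
    u = (X[i] - Xj) / h
    Kh = np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h)   # K_h(X_i - X_j)
    return float(np.sum(Kh * Yj) / ((n - 1) * f(X[i])))

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 1.0, 5000)          # uniform design, so f(x) = 1
Y = np.full(5000, 3.0)                   # constant regression m(x) = 3
i = int(np.argmin(np.abs(X - 0.5)))      # interior point, away from boundary
est = m_hat_known_design(i, X, Y, h=0.1, f=lambda x: 1.0)
```

Away from the boundary, est recovers m(X_i) up to the sampling noise of the kernel sum.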
Now θ_WA can be estimated in two steps: first estimating the underlying nonparametric regression function at the sample points and then estimating the functional as

θ̂_WA = (1/n) Σ_i ω(X_i) m̂(X_i),

where m̂(X_i) is the estimate of the regression function at X_i. For the approximations that follow, the following standard assumptions will be needed.

Assumption 3.1: The marginal distribution of the covariate X is known and denoted f(x). Moreover it is twice continuously differentiable and bounded away from 0.

Assumption 3.2: m(·) has at least two continuous derivatives.

Assumption 3.3: The kernel K is a smooth, compactly supported, symmetric nonnegative pdf.

Assumption 3.4: The weight function ω(·) is piecewise continuously differentiable.

Assumption 3.5: X and ω(·) are supported on the whole real line.

Assumption 3.5 is made only so that boundary biases that arise with simple kernel estimation can be ignored. A compact support for X would entail some way of correcting for boundary biases in the calculated risk (e.g., negative reflection) but would lead to the same asymptotic approximation given below.

3.1.2 Average treatment effect

A functional of interest in observational studies is the sample average treatment effect (SATE). This is essentially the average treatment effect where the averaging is over the empirical distribution of the covariate:

θ_SATE = (1/n) Σ_{i=1}^n (Y_i^1 − Y_i^0),

where Y_i^1 and Y_i^0 are the potential outcomes for unit i under treatment (T_i = 1) and control (T_i = 0) respectively. The treatment assignment indicator, T, and an additional covariate, X, are observed.
Under the assumptions of unconfoundedness ((Y^1, Y^0) ⊥ T | X) and overlap (the density of X is bounded away from 0 over the entire support for both treated and control samples), we can estimate the SATE using the following simple estimator:

θ̂_SATE = (1/n) Σ_{i=1}^n (m̂_1(X_i) − m̂_0(X_i)),

where m̂_1(X_i) and m̂_0(X_i) are the estimated regression functions on the treated and control samples respectively, each evaluated at X_i. I state the lemmas about the ATE in terms of the SATE; however, as indicated in a subsequent section, the bandwidth optimal for the SATE is also optimal for the PATE. For the purposes of this paper, I work out the bandwidth selection details for a specific (and attractive) nonparametric regression estimator: the local linear regression estimator. This estimator does not suffer from the severe boundary bias that affects standard (Nadaraya-Watson) kernel estimates (see Heckman et al. (1998), who recommend this estimator in the context of calculating treatment effects). More importantly perhaps for this application, the local linear estimator, unlike the latter, does not have high bias in areas with fewer observations (it is design-adaptive, to use the jargon; see Fan (1992)). This is an important consideration for the ATE, where one can expect a low density of treated observations for certain covariate values and very few control observations for other values (i.e., limited overlap). More explicitly, arrange the observations such that the first n_0 observations constitute the control sample and the remaining n_1 = n − n_0 form the treated sample. Define y^0 = (y_1, ..., y_{n_0})′ and m^0 = (m_0(X_1), ..., m_0(X_{n_0}))′, where m_0(·) is the regression function in the control group, i.e., m_0(x) = E(Y | T = 0, X = x). Define R_i^0 = [ι Z_i], where Z_ij = X_j − X_i, j = 1, ..., n_0, and ι is a column of n_0 ones; define the weight matrix W_i^0 = diag_{j=1,...,n_0}(K_h(X_j − X_i)); and let e_1 = (1, 0)′. Similarly define y^1, R_i^1 and W_i^1.
Given the above, we can write the estimated control regression at X_i as

m̂_0(X_i) = e_1′ (R_i^0′ W_i^0 R_i^0)^{−1} R_i^0′ W_i^0 y^0.
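A minimal implementation of this matrix formula (illustrative helper names; Epanechnikov kernel, consistent with the compact-support assumption):

```python
import numpy as np

def local_linear(x0, X, Y, h):
    # m_hat(x0) = e1' (R' W R)^{-1} R' W y: weighted least squares of Y on
    # (1, X - x0) with kernel weights; the intercept is the fit at x0.
    u = (X - x0) / h
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)  # Epanechnikov
    R = np.column_stack([np.ones_like(X), X - x0])
    W = np.diag(k)
    beta = np.linalg.solve(R.T @ W @ R, R.T @ W @ Y)
    return float(beta[0])                                     # e1' beta

# Local linear regression reproduces a linear function exactly, even at the
# boundary -- the design-adaptivity property stressed in the text.
X = np.linspace(0.0, 1.0, 101)
Y = 2.0 + 3.0 * X
at_boundary = local_linear(0.0, X, Y, h=0.3)
at_interior = local_linear(0.5, X, Y, h=0.2)
```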
Therefore:

θ̂_SATE = (1/n) Σ_{i=1}^n ( e_1′ (R_i^1′ W_i^1 R_i^1)^{−1} R_i^1′ W_i^1 y^1 − e_1′ (R_i^0′ W_i^0 R_i^0)^{−1} R_i^0′ W_i^0 y^0 ).

It will be useful to collect here the assumptions necessary for the upcoming lemmas. They are all standard:

Assumption 3.6: The marginal distribution of the covariate X is denoted f(x); it is continuous a.e. and bounded away from 0 conditional on T: f(x | T = 1) ≡ f_1(x) ≥ c_1 > 0 and f(x | T = 0) ≡ f_0(x) ≥ c_0 > 0.

Assumption 3.7: m(·) has at least two continuous derivatives a.e.

Assumption 3.8: The kernel K is a smooth, compactly supported, symmetric nonnegative pdf. For concreteness assume a kernel supported on [−1, 1].

Assumption 3.9: The conditional variance functions σ_1²(x) = Var(Y | T = 1, X = x) and σ_0²(x) are bounded and continuous almost everywhere.

4 Error criteria and optimal bandwidth

4.1 Standard criterion: MISE

The standard criterion of risk used for bandwidth selection in nonparametric regression problems is called the mean integrated squared error (MISE) or, more generally, the weighted MISE (WMISE). It is a global measure of error, and the loss function sums up squared error along the entire regression function. It is defined as

WMISE = ∫ ξ(x) E(m̂(x) − m(x))² f(x) dx,

where ξ(x) is the weighting function used. Usually this is chosen to be flat, and the resulting error criterion is then called simply the MISE. The bandwidth is chosen by finding a data-based unbiased estimator of the error criterion (e.g., cross-validation) or by first asymptotically approximating the risk and then estimating its minimizer.
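For contrast, the cross-validation procedure mentioned here can be sketched as follows; note that it targets the risk of the whole regression function, a MISE-type criterion, which is exactly the paper's objection when a functional is the object of interest. (Gaussian kernel, grid, and data are my own illustrative choices.)

```python
import numpy as np

def loocv_score(h, X, Y):
    # Leave-one-out CV: an approximately unbiased estimate of the squared
    # prediction error of the fitted curve, i.e. a MISE-type criterion.
    n = len(X)
    err = np.empty(n)
    for i in range(n):
        Xi, Yi = np.delete(X, i), np.delete(Y, i)
        w = np.exp(-0.5 * ((Xi - X[i]) / h) ** 2)
        err[i] = Y[i] - np.sum(w * Yi) / np.sum(w)
    return float(np.mean(err**2))

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, 200)
Y = np.sin(4.0 * X) + rng.normal(0.0, 0.2, 200)
grid = [0.01, 0.03, 0.1, 0.3, 1.0]
h_cv = min(grid, key=lambda h: loocv_score(h, X, Y))
```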
Lemma 4.1 (Asymptotic approximation to the MISE): Under Assumptions 3.1–3.5, we obtain the following expression for the WMISE:

WMISE = C_1/(nh) + C_2 h⁴ + o_p((nh)^{−1} + h⁴).

The constants in the approximation are:

C_1 = T_1(K) ∫ ξ(x) σ²(x) dx,
C_2 = T_2(K) ∫ ξ(x) [(mf)″(x)]² / f(x) dx,
T_1(K) = ∫ K(t)² dt, T_2(K) = (½ ∫ t² K(t) dt)².

Calculations are given in the appendix. This is a well-known result; see Priestley and Chao (1972). Note that the first term reflects the risk cost of the variance of the estimator and the second reflects the bias. Increasing the bandwidth reduces variance across repeated samples at the cost of increasing bias. Once again, a key observation to keep in mind is that standard cross-validation essentially minimizes an unbiased estimate of the MISE and can be shown to be consistent for the minimizer of the leading terms of the MISE displayed above.

4.2 Modified criterion: MSE

Using the MISE or AMISE as a measure of risk is not appropriate when one is interested primarily in the functional θ. Thus standard methods of bandwidth selection have to be modified. We instead consider a measure of risk defined directly in terms of the functional of interest, the mean squared error (MSE) criterion:

MSE = E(θ̂(m̂_h) − θ(m))².

One can now try to approximate the MSE in terms of functionals of the joint distribution of the data and the chosen bandwidth.
4.2.1 MSE for θ_WA

Lemma 4.2 (Asymptotic approximation to the MSE): Under Assumptions 3.1–3.5, we obtain the following expression for the MSE:

MSE_WA = C_0/n + C_1/(n²h) + C_2 h⁴ + o_p(n^{−2}h^{−1} + h⁴) ≡ AMSE + o_p(n^{−2}h^{−1} + h⁴).

Here, the constants in the approximation are:

C_0 = Var(ω(X)m(X)) + ∫ ω(x)² σ²(x) f(x) dx,
C_1 = T_1(K) ∫ ω(x)² (σ²(x) + m(x)²) dx,
C_2 = T_2(K) ( ∫ (mf)″(x) ω(x) dx )²,
T_1(K) = ∫ K(t)² dt, T_2(K) = (½ ∫ t² K(t) dt)².

This is the first result of the paper. Here the first term comes from the variance of the averaging involved; note that it is not affected by the bandwidth choice. The next two terms reflect the variance and bias stemming from the choice of smoothing parameter. An increased bandwidth makes the estimator more stable across repeated samples but also more biased. The nature of the constant on the bias term, C_2, will prove crucial to the following discussion.

4.2.2 MSE for the ATE

A contribution of this paper is to work out a similar expansion for the SATE when local linear estimation is used to calculate the regression functions. (Imbens, Newey and Ridder (2005) provide an MSE expansion for the ATE when it is calculated using inverse propensity score weighting and series estimation for the regression function and the propensity score. Their procedure requires the choice of two bandwidths. Other differences from the current work are that 1) they work with a criterion slightly different from mine, in that they consider the sum of the mean squared errors of the treatment and control regressions rather than working directly with the ATE; 2) they do not provide guidance for bandwidth selection; and 3) they are not concerned with the problems of regularization, which arise in their set-up as well.) The appendix provides both the calculations and further higher-order terms than presented below. As is standard with regressions, all expectations in the lemmas below condition on the sample
X and T, as unconditional expectations may not exist (if they do, they are identical to what is given below). This is suppressed notationally. Following the result, I indicate extensions to other similar functionals.

Lemma 4.3 (Asymptotic approximation to the MSE: SATE): Under Assumptions 3.6–3.9, and denoting the support of X by [x̲, x̄] (these can be nonfinite) and by p the probability Pr(T = 0), we obtain the following expression for the MSE for the SATE (MSE_SATE):

MSE_SATE = K_0/n + K_1/(n²h) + K_2 h⁴ + o_p(n^{−2}h^{−1} + h⁴).

Here, the constants in the approximation are:

K_0 = ∫ ( σ_1²(x)/((1−p) f_1(x)) + σ_0²(x)/(p f_0(x)) ) f(x) dx,
K_1 = π ∫ ( σ_1²(x)/((1−p) f_1(x)) + σ_0²(x)/(p f_0(x)) ) dx,
K_2 = (ν²/4) ( ∫ (m_1″ − m_0″)(x) f(x) dx )²,
π = ∫ K(t)² dt, ν = ∫ t² K(t) dt.

This is the second novel result of the paper. Inspecting the MSE, we see first that the term K_0/n is the same as the semiparametric variance bound for the problem derived by Hahn (1998), when one expresses that bound in terms of the propensity score using Bayes' theorem. The only difference from the expansion for the PATE, as indicated later, would be a third term in the definition of K_0, namely ∫ (m_1(x) − m_0(x) − θ_PATE)² f(x) dx. The next two terms reflect the variance-bias tradeoff. An increased bandwidth makes the estimator more stable across repeated samples but also more biased. Note that relatively lower densities in either the treated or control samples increase the variability (showing the importance of significant overlap in variance reduction). Also note that the third term, the bias term, does not depend on f_0 or f_1. This is due to the design adaptivity of local linear regression: sparse data in a region does not increase first-order bias as it would with simple kernel estimation. Also note that it is the difference in curvatures between
the treatment and control regressions that determines the bias. This will be important in the following discussion.

4.2.3 Extensions to other functionals

The extension to the PATE is immediate, recognizing first that, under the random sampling assumption, the estimate of the PATE is the same as for the SATE (i.e. θ̂_PATE = θ̂_SATE), and second that the only additional terms in the MSE expansion of E(θ̂_PATE − θ_PATE)² are E(θ_SATE − θ_PATE)² and E(θ̂_SATE − θ_SATE)(θ_SATE − θ_PATE). The second term is negligible and the first contributes a term of order 1/n (this, as indicated before, is the term E(m_1(X) − m_0(X) − θ_PATE)², also to be found in Hahn's semiparametric variance bound). Thus bandwidth choice remains unaffected. The extension to the nonparametric Blinder-Oaxaca estimates is immediate. Suppose the two groups are indexed by T = 1 and T = 0; say T = 1 indicates female. Then the part of the difference in wages Y not attributable purely to different distributions of the measured covariate X (say IQ) in the two groups is

E_X[E(Y | T = 1, X) − E(Y | T = 0, X)],

where the averaging is done over the marginal distribution of X, unconditional on group membership. Thus we can see that this functional has an identical structure to the PATE, and the application of the above MSE expansion only requires a redefinition of terms. It is fairly easy to extend the results to the policy effect estimator of Stock (1989) as well. Note that this estimate has the identical structure of one of the two elements in the ATE: the average, say, control regression, averaged over a distribution other than the conditional distribution of X in the control sample (f_0). In the ATE case, this other distribution was the unconditional distribution of X, and in the Stock (1989) case it is the counterfactual distribution of X after the policy intervention.
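The role of the curvature difference can be checked numerically. The sketch below evaluates a bias constant of the form (ν²/4)(∫(m_1″ − m_0″)f dx)² for illustrative regression pairs (uniform design, kernel moment ν set to 1, all functional forms my own): under a constant additive treatment effect it is exactly zero, which is the ill-posed case taken up in the next section.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10001)
f = np.ones_like(x)                      # uniform covariate density on [0, 1]
dx = x[1] - x[0]

def bias_constant(m1_dd, m0_dd, nu=1.0):
    # (nu^2 / 4) * (integral of (m1'' - m0'') f dx)^2, via a Riemann sum.
    integral = dx * np.sum((m1_dd - m0_dd) * f)
    return float((nu**2 / 4.0) * integral**2)

m0_dd = -9.0 * np.sin(3.0 * x)           # m0 = sin(3x) => m0'' = -9 sin(3x)
# Constant additive effect: m1 = m0 + 2 => m1'' = m0'', so the constant is 0.
k2_cate = bias_constant(m0_dd, m0_dd)
# Curved effect: m1 = m0 + x^2 => m1'' = m0'' + 2, so the constant is positive.
k2_curved = bias_constant(m0_dd + 2.0, m0_dd)
```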
5 Bandwidth selection and ill-posedness

5.1 Optimal bandwidth

Once we have the expressions for the asymptotic approximations to the MSE, we can optimally trade bias for variance to first order, noting that the former is typically an increasing function of the bandwidth and the latter a decreasing function, and find a formula for the theoretically optimal bandwidths, h_WA^opt and h_SATE^opt.

Lemma 5.1 (Optimal bandwidths): Given Lemmas 4.2 and 4.3, we obtain the following expressions for the optimal bandwidths:

h_WA^opt = (C_1/(4C_2))^{1/5} n^{−2/5} + o_p(n^{−2/5}).

For the SATE,

h_SATE^opt = (K_1/(4K_2))^{1/5} n^{−2/5} + o_p(n^{−2/5}).

Note that while the optimal bandwidth for estimating m would be O(n^{−1/5}), the optimal bandwidth for estimating θ_WA and θ_SATE is O(n^{−2/5}). In other words, we are required to use a smaller bandwidth than we would have were we interested in the entire function (see Goldstein and Messer (1992) and Newey (1994)). This makes intuitive sense: the averaging gets rid of some of the variance for us, and therefore we can allow ourselves less bias by smoothing less at the optimal bias-variance tradeoff point. Moreover, if one used the bandwidth optimal for the MISE, one would not achieve the optimal semiparametric rate of N^{1/2} for the estimation of θ. There is, however, something else that has not been noted in the literature. First consider MSE_WA. The optimal bandwidth is inversely proportional to the bias constant C_2^{1/5}. C_2 can be thought of as proportional to the total curvature of the regression function on its support (technically it is the total curvature of the function mf, but with a uniform design, or more compellingly with local linear estimation, it is the total curvature of the regression function m that matters, not the design f). For a large class of functions, C_2 could be close to or exactly 0. For these functions there is no first-order bias-variance tradeoff as defined by minimizing the AMSE, leading to an optimal bandwidth of infinity!
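The first-order tradeoff and its breakdown can be verified directly (the constants here are illustrative numbers, not estimates from any data):

```python
import numpy as np

def h_opt(C1, C2, n):
    # Minimizer of C1/(n^2 h) + C2 h^4: h* = (C1/(4 C2))^(1/5) * n^(-2/5).
    return (C1 / (4.0 * C2)) ** 0.2 * n ** (-0.4)

C1, C2, n = 2.0, 0.5, 1000
amse = lambda h: C1 / (n**2 * h) + C2 * h**4
h_star = h_opt(C1, C2, n)

# h_star beats every nearby bandwidth on the two AMSE terms...
grid = h_star * np.linspace(0.5, 2.0, 101)
beats_grid = amse(h_star) <= min(amse(h) for h in grid)
# ...but as the bias constant C2 -> 0 the formula blows up: the instability.
blown_up = h_opt(C1, 1e-20, n)
```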
The same problem arises with MSE_SATE in a perhaps even more relevant scenario. Here the first-order bias is proportional to the difference in curvatures. In the practically interesting cases of a constant additive treatment effect (CATE) and a treatment effect linear in covariates, this bias term is zero, again leading to the same issue as above. Taking higher-order terms does not help, as explained later.

The next two sections are abstract in nature. Readers interested primarily in applied work should skim them and then look at sections 6.2 and 6.3 and the empirical application.

5.2 Illposedness

For the purposes of this section I concentrate on θ_WA, though it is clear that exactly the same issues are relevant to θ_SATE. This is primarily for illustrative purposes, as the simple structure of θ̂_WA makes the issues at hand more transparent.

The calculations, based on simple known functional forms, shown in the table and figure below illustrate the issue starkly. The outcome process was y = m(x) + ε with ε ∼ N(0, σ²). The regression function was either m(x) = 8cos(Bx) or 8sin(Bx). B is a parameter that determines frequency: increasing it increases the local features of the regression function (i.e. makes it more "bumpy") and thus should decrease the MISE-optimal bandwidth. The support over which risk was calculated was [−10, 10]. The table shows the optimal bandwidths that minimize, respectively, the actual MSE, its asymptotic approximation the AMSE, and the MISE, for the two different regression functions. The figures show an example of two different regression functions (cosine and sine with the same amplitude and frequency) with the exact MSE at various bandwidths.

There are a couple of things to note in the simulations shown. Firstly, the bandwidth that minimizes the MISE is almost identical for the sine and the cosine curves. This makes sense: the sine is really only a shifted version of the cosine, and in a sense all the local features are identical.
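The design just described can be mimicked in a few lines. This is a crude Monte Carlo sketch rather than the exact-risk computation behind the table, and the values of B, σ, n, h, and the Gaussian kernel are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def loclin(x0, X, Y, h):
    """Local linear fit at x0 with a Gaussian kernel."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    A = np.column_stack([np.ones_like(X), X - x0])
    Aw = A * w[:, None]
    return np.linalg.solve(Aw.T @ A, Aw.T @ Y)[0]

def mc_mse(m, h, n=200, reps=20, sigma=1.0):
    """Monte Carlo MSE of theta-hat = (1/n) sum_i m-hat(X_i) for
    theta = E m(X), with X uniform on [-10, 10]."""
    grid = np.linspace(-10, 10, 2001)
    theta = np.mean(m(grid))
    sq_errs = []
    for _ in range(reps):
        X = rng.uniform(-10, 10, n)
        Y = m(X) + sigma * rng.normal(size=n)
        theta_hat = np.mean([loclin(x, X, Y, h) for x in X])
        sq_errs.append((theta_hat - theta) ** 2)
    return float(np.mean(sq_errs))

B = 1.3
mse_cos = mc_mse(lambda x: 8 * np.cos(B * x), h=2.0)
mse_sin = mc_mse(lambda x: 8 * np.sin(B * x), h=2.0)
```

Sweeping h and plotting `mc_mse` for the two curves reproduces the qualitative pattern in the figure.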
The bandwidth that minimizes the MSE and its asymptotic approximation, however, is radically different in the two cases: from a finite value for the cosine curve it jumps to infinity for the sine curve. In the case of the AMSE the reason for this is

9 The approximation, incidentally, seems fairly accurate.
Table 7.2.3: Optimal bandwidths (by σ and B: the MSE-, AMSE-, and MISE-minimizing bandwidths for the cosine case, C₂ > 0, and the sine case, C₂ = 0)

Figure 4.1: Regression function and MSE
apparent on inspection of the formulae: C₂ = 0 for the sine curve, leading to an exactly zero asymptotic bias. With the MSE, essentially, if we are only interested in the functional and not the entire function, biases in one region of the support can cancel out biases in other regions, leading to a zero total bias. A similar issue is present with the MSE for θ_SATE near, for instance, CATE, where the issue is the cancellation of biases when taking the difference across the two regression functions: to the first order, the bias of the estimate of the treatment regression is of the same magnitude and direction as the bias of the estimate of the control regression. Here what matters is the difference in total curvature between the treatment and control regressions. Thus, while for functions near the CATE case the optimal MSE-minimizing bandwidth is O(n^{−2/5}), at CATE the rate suddenly jumps to O(n^{−2/9}).

Note that there is a single case in which the bias suddenly changes order in the MISE and the AMISE as well: the linear case. This generally seems to be ignored, for theoretical convenience, in the vast literature around the MISE. For instance, about the expression for the bandwidth that minimizes the MISE to first order with local linear estimation, Fan and Gijbels (1996) write: "It is understood that the integrals are finite and the denominator does not vanish." Since the problem happens only when the regression is close to linear, and one can rule out linearity a priori, as is usually done, it is claimed that this is not a serious problem. However, for the data set I use later on to illustrate some of the issues, this does seem to be an issue. A small modification, noted later, of the solution I propose to the instability in the MSE problem takes care of this case as well. It is also to be noted that the sensitivity to the choice of support is special to the MSE case.
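The jump in rates can be checked numerically by minimizing a stylized risk with an added higher-order bias term; the constants C₁ = C₃ = 1 are placeholders, not the paper's. With C₂ > 0 the minimizer shrinks like n^{−2/5}; at C₂ = 0 the h⁸ term takes over and the rate drops to n^{−2/9}.

```python
import numpy as np

def argmin_h(C1, C2, C3, n):
    """Grid-search minimizer of the stylized risk C1/(n^2 h) + C2 h^4 + C3 h^8."""
    h = np.logspace(-4, 1, 4000)
    risk = C1 / (n**2 * h) + C2 * h**4 + C3 * h**8
    return h[np.argmin(risk)]

def empirical_rate(C2, n1=1e4, n2=1e6):
    """Estimate r in h*(n) = const * n^(-r) from two sample sizes."""
    h1 = argmin_h(1.0, C2, 1.0, n1)
    h2 = argmin_h(1.0, C2, 1.0, n2)
    return float(np.log(h1 / h2) / np.log(n2 / n1))

rate_generic = empirical_rate(1.0)  # approximately 2/5
rate_cate = empirical_rate(0.0)     # approximately 2/9
```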
The essential problem, that small perturbations in the underlying regression function lead to large changes in the optimal bandwidth, can be formalized via the connection to the literature on ill-posedness in optimization theory. If one considers the underlying true regression function as the exogenously given parameter of the problem of choosing the bandwidth to minimize the MSE, then, formally, that problem is in a certain sense (defined below) ill-posed. Ill-posed problems have received a lot of attention recently in econometrics in the
context of deconvolution methods for measurement error and nonparametric IV estimation. A more fundamental example is kernel density estimation. All of these involve linear integral equations of a certain type with a compact operator. Naive solutions to the integral equations which involve inversion are problematic, as the inverse of a compact operator is unbounded, so that the solution is not continuous with respect to perturbations of the defining problem. However, ill-posedness itself is a concept stemming from optimization theory, and while the above form of ill-posedness has a few features in common with the kind of ill-posedness explored in this paper (both lend themselves to solution by similar approaches), the analogy cannot perhaps be pursued much further. We are not dealing here with an integral equation but with an optimization problem with function-valued parameters. Bandwidth choice for functionals is ill-posed in an extended or variational sense. As far as I know, this type of ill-posedness has not been studied in econometrics before. The standard reference here is Zolezzi (1995, 1996).

Define MSE = R(h, m) = E(θ̂_h − θ)². Let h*(m) denote the solution (assuming it exists) to the problem min_h R(h, m), given the unknown regression function m ∈ M, where M is the class of candidate regression functions. This problem has the feature that the map m ↦ h*(m) is not continuous, or, extending Zolezzi (1995, 1996) to make it appropriate to the problem at hand, the problem is ill-posed in an extended sense (proofs of the ill-posedness and precise definitions are given in the appendix). The bandwidth problem has not been looked at in a variational sense (i.e. the optimal bandwidth as a functional of the true regression) before, partly because most attention has been focused on the MISE criterion, where the problem occurs only in the neighborhood of linearity. This case is typically ruled out a priori, as I mentioned earlier.
This strategy cannot be followed here because the problem with the MSE occurs, for instance, in the neighborhood of every regression function with zero integrated curvature over the support, an infinite and unknown set. For instance, the appendix shows that the class of functions for which the problem with minimizing MSE_WA occurs contains all odd functions.
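The discontinuity of the map m ↦ h*(m) can be made concrete with a small numerical sketch (B, C₁, n, and the uniform weight on [−10, 10] are illustrative assumptions): along the path m_t = t·8cos(Bx) + (1 − t)·8sin(Bx), m_t converges uniformly to the sine curve as t → 0, yet its integrated curvature vanishes and the formal optimal bandwidth blows up.

```python
import numpy as np

def C2_path(t, B=1.3, k=20_001):
    """Squared integrated curvature of m_t = t*8cos(Bx) + (1-t)*8sin(Bx)
    on the symmetric support [-10, 10]; it vanishes as t -> 0 because the
    sine is an odd function."""
    x = np.linspace(-10.0, 10.0, k)
    m2 = -8.0 * B**2 * (t * np.cos(B * x) + (1 - t) * np.sin(B * x))
    dx = x[1] - x[0]
    integral = dx * (m2.sum() - 0.5 * (m2[0] + m2[-1]))  # trapezoid rule
    return integral ** 2

def h_star(t, C1=1.0, n=10_000):
    """Formal optimal bandwidth (C1 / (4 C2(t)))^(1/5) n^(-2/5)."""
    return (C1 / (4.0 * C2_path(t))) ** 0.2 * n ** (-0.4)

# A perturbation that is tiny in sup norm (t = 0.001 versus t = 1) changes
# the chosen bandwidth by more than an order of magnitude.
```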
Before discussing potential solutions to the problem, I will review again its implications. Firstly, small perturbations of the support of the function may lead to large changes in the optimally chosen bandwidth (this, for example, is not the case with the AMISE-optimal bandwidth). Secondly, the estimator will be very unstable across repeated samples for regression functions with low total curvature (and, in the case of average treatment effects, for treatment effects that are close to linear in the covariates). Since robustness is an important feature of a statistical estimator, this is troubling, combined with the fact that the MSE can in fact vary quite a lot with different bandwidths. Lastly, another way of putting the problem is in terms of uniform behaviour over the unknown parameter space: the optimal bandwidth does not converge to 0 at the same rate throughout the parameter space. In the CATE case it jumps to O(n^{−2/9}). Usage of a bandwidth of this order in cases of small deviations from CATE will lead to the functional not being estimable at the n^{1/2} rate. When we look at θ_WA, for certain regression functions (an infinite set) the optimal bandwidth becomes infinite. However, the bandwidths that will be estimated from an actual dataset will be O(n^{−2/5}) and bounded with probability one (because C₂ will be estimated as nonzero with probability one with continuously distributed regressors).

6 Regularization

6.1 Regularization

With any solution to the stability issue, one ideally wants an asymptotic no-regret type property for the modified bandwidths, i.e., as the dataset grows in size the difference between the risk under the modified bandwidth procedure and the minimal AMSE should approach 0. This is the type of property that Li (1987), for instance, demands of cross-validation. We shall call this the Li criterion and will be more explicit in the result given below.
As mentioned in the introduction, I approach the problem in two ways, both leading to the same solution.
6.1.1 Penalty term approach

One basic solution to ill-posed problems is to regularize them by adding to the criterion being minimized a nonzero convex penalty term that decreases in relative size as the dataset becomes larger. This general idea is sometimes referred to as Tikhonov regularization in optimization theory. For this problem, note that we want the penalty term to 1) penalize very large bandwidths and 2) penalize ignoring the local features of m; i.e., when the regression curve is very bumpy, we want to guard ourselves against very large bandwidths, because in this case perturbations of the support are likely to have a significant effect on the chosen bandwidth. We also in general want a convex penalty term, as this ensures that the minimization does not become ill-defined.

Now note that though the bandwidth minimizing the mean squared error of the functional estimate is badly behaved, the MISE-minimizing bandwidth is fairly well behaved except in the linear case. (This can also be seen in the asymptotic expressions for the risks: while C₂ is likely to be close to 0 for a wide variety of cases, this is not likely for C₂′.) A penalty term that makes a lot of sense here is then a scaled version of the well-behaved bias term in the expansion of the MISE, where the scale factor decreases with increasing sample size at an appropriate rate, allowing the no-regret criterion to be satisfied asymptotically. Note that this can be interpreted as penalizing the AMSE more when the regression function has more local features (C₂′ large). Also, since the MISE bias is increasing and convex in the chosen bandwidth, the other desired properties are satisfied.

6.1.2 The weight function approach: idea

One can motivate this penalty term in a very different and more intuitive way by considering local perturbations of the regression function.
Using the basic theoretical framework of Bickel and Li (2006), I consider a sequence of regularized estimands, which result from perturbing the weight function in θ = ∫ω(x)m(x)dx by adding a scaled integrable function φ(·) centered at a point x₀, scaled such that the new weight function still integrates to 1. Then, letting the perturbation get smaller with larger sample size, such
that in the limit the new weight function is essentially the original weight function plus the Dirac delta generalized function, one finds the bandwidth that minimizes the risk of the perturbed problem in the limit. However, one might have chosen the particular point x₀ in such a way that the perturbed problem is still ill-posed (e.g., if m″(x₀) = 0), so after calculating the MSE of the point-perturbed problem, I average over the possible points x₀ around which we can perturb the weight function. This I call the regularized problem. One can show that when C₂ ≠ 0, we get an MSE expansion for the regularized problem that is to the first order the same as the risk with the convex penalty term we have chosen. Thus the addition of the convex penalty term can be thought of as asking for a bandwidth that minimizes risk averaged over small uncertainties/perturbations in the weighting function. The section below formalizes this notion.

6.1.3 The weight function approach: formalization

I now formalize this approach for θ_WA; the same can be done for θ_SATE. Consider the weight function ω(·) and a perturbation function φ(·) which fulfills the conditions ∫|φ(x)|dx < ∞ and sup_x |φ(x)| < ∞. Let b be a positive constant close to 0 and let α = 1 − n^{−γ}. Now perturb the weight function at x₀ in the following way:

    ω_{x₀}(x) = αω(x) + (1 − α)φ_{x₀}(x),  where  φ_{x₀}(x) = φ((x − x₀)/b) / ∫φ((t − x₀)/b)f(t)dt.

Note that the perturbed weight function still integrates to 1 against the design density f. Now, given θ = E_F(ω(X)m(X)) and θ̂ = E_{F_n}(ω(X)m̂(X)), define

    θ(x₀) = αθ + (1 − α) [∫φ((x − x₀)/b)m(x)f(x)dx] / [∫φ((t − x₀)/b)f(t)dt] = θ + O(n^{−γ}),

    θ̂(x₀) = E_{F_n}(ω_{x₀}(X)m̂(X)).

Now we will consider the MSE of the perturbed problem averaged over all the possible points x₀ of perturbation, MSE_b, and then finally let the perturbation go to 0, i.e.
consider MSE_reg = lim_{b→0} MSE_b:

    MSE_b = E_{x₀} E[θ̂(x₀) − θ]²
          = E_{x₀} E[θ̂(x₀) − θ(x₀)]² + E_{x₀}[θ(x₀) − θ]² + 2E_{x₀}[θ(x₀) − θ][E θ̂(x₀) − θ(x₀)]
          = E_{x₀} E[θ̂(x₀) − θ(x₀)]² + O(n^{−2γ}) + O(h²n^{−γ})
          = E_{x₀}[E θ̂(x₀) − θ(x₀)]² + E_{x₀} V(θ̂(x₀)) + O(h²n^{−γ})
          ≡ Bias²_{b,reg} + Var_{b,reg} + O(h²n^{−γ})
          ≡ MSE_{b,reg} + O(h²n^{−γ}).

Note that the additional term of order O(n^{−2γ}) has been ignored, as we will need n^{−γ} = o(h²) later on to ensure that the leading term in the asymptotic expansion of the bias part of MSE_reg, namely Bias_b ≡ E_{x₀}[E θ̂(x₀) − θ(x₀)], dominates the remainder in MSE_{b,reg}. Using h = O(n^{(δ−2)/5}), we will then have the condition that γ > (4 − 2δ)/5.

Define B_b = ∫φ((x − x₀)/b)f(x)dx. Now consider E θ̂(x₀):

    E θ̂(x₀) = E_{F_n}(ω_{x₀}(X)m̂(X))
             = α E θ̂ + (1 − α) E_{F_n}(φ_{x₀}(X)m̂(X))
             = α E θ̂ + [(1 − α)/B_b] ∬φ((x − x₀)/b) K_h(x − u) m(u) f(u) du dx
             = α(θ + C₂^{1/2}h² + o(h²)) + [(1 − α)/B_b] ∫φ((x − x₀)/b)m(x)f(x)dx
               + [(1 − α)/B_b] h² (T₂(K)/2) ∫φ((x − x₀)/b)(mf)″(x)dx + o(h²).

10 This implies that δ, the regularization parameter as defined in the main body of the paper to follow, will have to fulfill δ > 2/7. Thus, strictly speaking, we can view this alternative approach as being equivalent to the approach with the convex penalty term if δ ∈ (2/7, 1/3).
Therefore:

    E θ̂(x₀) − θ(x₀) = h² ( αC₂^{1/2} + (1 − α)(T₂(K)/2) [∫φ((x − x₀)/b)(mf)″(x)dx] / [∫φ((t − x₀)/b)f(t)dt] ) + o(h²),

    (E θ̂(x₀) − θ(x₀))² = h⁴ ( α²C₂ + α(1 − α)T₂(K)C₂^{1/2} [∫φ((x − x₀)/b)(mf)″(x)dx] / [∫φ((t − x₀)/b)f(t)dt] )
                          + h⁴ (1 − α)² (T₂(K)/2)² ( [∫φ((x − x₀)/b)(mf)″(x)dx] / [∫φ((t − x₀)/b)f(t)dt] )² + o(h⁴).

This implies, since Bias²_{b,reg} = E_{x₀}(E θ̂(x₀) − θ(x₀))², that:

    Bias²_{b,reg} = h⁴ [ α²C₂ + n^{−γ} α T₂(K) C₂^{1/2} E_{x₀}( [∫φ((x − x₀)/b)(mf)″(x)dx] / [∫φ((t − x₀)/b)f(t)dt] ) ]
                    + h⁴ n^{−2γ} (T₂(K)/2)² E_{x₀}( [∫φ((x − x₀)/b)(mf)″(x)dx] / [∫φ((t − x₀)/b)f(t)dt] )².

Now we consider Bias²_reg = lim_{b→0} Bias²_{b,reg}, i.e. we let the perturbations go to 0. We assume that inf f(x) = c > 0 and sup |(mf)″(x)| = M < ∞. We then get, using the dominated convergence theorem,

    Bias²_reg = h⁴ [ α²C₂ + n^{−γ} T₂(K) C₂^{1/2} α ∫(mf)″(x)dx + n^{−2γ} (T₂(K)/2)² ∫ [(mf)″(x)]²/f(x) dx ] + o(h⁴)
              = h⁴ [ α²C₂ + n^{−γ} ( T₂(K) C₂^{1/2} α ∫(mf)″(x)dx ) + n^{−2γ} C₂′ ] + o(h⁴).

For C₂ ≈ 0, therefore:

    Bias²_reg = n^{−2γ}h⁴C₂′ + O(n^{−γ}h⁴C₂^{1/2}) + o(h⁴).

Note that this is exactly, to the leading order, the bias term in the regularized AMSE in the main body of the paper, as the first term in the above expansion is the convex penalty term we chose. It is similarly easily shown that Var_reg = lim_{b→0} Var_{b,reg} has to the leading order the same expression as in the regularized AMSE, thus substantiating the claim that the bandwidth that minimizes MSE_reg is to the leading order the same as the bandwidth that minimizes the regularized AMSE in the paper.

11 Alternatively we could assume that (mf)″ has one continuous derivative and take a Taylor expansion assuming b small. But this is theoretically unappealing, as we have been making the minimal assumption throughout that mf(·) has two continuous derivatives.
6.2 Properties of regularization

The key theoretical issue is to make sure, when the regularization term is added, that it is large enough to effect regularization and small enough that the new regularized risk satisfies the Li criterion and the semiparametric rate still obtains. We also have to be careful about higher-order terms (see appendix). This tradeoff yields bounds on the rate constant. Specifically, we consider a penalty term of the form λ_WA = n^{−δ}C₂′h⁴. Thus the modified risk is:

    MSE_reg,WA = C₀/n + C₁/(n²h) + (T₂(K)/2)² [ ( ∫(mf)″(x)ω(x)dx )² + n^{−δ} ∫ [(mf)″(x)]²/f(x) dx ] h⁴ + o_p(n^{−2}h^{−1} + h⁴)
               = C₀/n + C₁/(n²h) + (C₂ + n^{−δ}C₂′)h⁴ + o_p(n^{−2}h^{−1} + h⁴).

Using this modified risk, we have a formula for the regularized bandwidth:

    h^opt_reg,WA = ( C₁ / (4(C₂ + n^{−δ}C₂′)) )^{1/5} n^{−2/5}.

Following the same logic, we can also give a regularized bandwidth for the SATE:

    h^opt_reg,SATE = ( K₁ / (4(K₂ + n^{−δ} ν ∫ ½([m₁″]² + [m₀″]²) f dx)) )^{1/5} n^{−2/5} ≡ C^opt_reg n^{−2/5}.

Note that if one wants to insure oneself against instability in the near-linear case as well, one can add an additional regularization term. This term will, as before, have to decline at the appropriate range of rates and will have to scale sensibly with rescalings of the data. One potential term that works is to add, to the MISE regularization term above, the variance of the estimated coefficient on the linear term in a global linear regression fitted across the entire support. If the linear term is precisely estimated, the regularization term will be smaller; the term is also sensitive to rescalings of the data. However, this issue clearly needs more careful investigation, and the above should only be taken as a crude and tentative suggestion.

We can now state the main results for both regularized bandwidths above:
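A sketch of the regularized rule (the constants and δ = 0.3, taken inside the admissible range discussed above, are placeholders): the penalty bounds the effective bias constant away from zero, so the bandwidth stays finite and of order roughly n^{−2/5} even when C₂ = 0.

```python
def h_reg(C1, C2, C2_prime, n, delta=0.3):
    """Regularized bandwidth (C1 / (4 (C2 + n^-delta * C2')))^(1/5) n^(-2/5)."""
    return (C1 / (4.0 * (C2 + n ** (-delta) * C2_prime))) ** 0.2 * n ** (-0.4)

n = 10_000
h_generic = h_reg(1.0, 1.0, 1.0, n)     # C2 > 0: close to the unregularized rule
h_degenerate = h_reg(1.0, 0.0, 1.0, n)  # C2 = 0: finite, where (C1/(4*0))^(1/5) is not
```

As the penalty factor n^{−δ} shrinks, the rule converges to the unregularized bandwidth whenever C₂ > 0, which is the no-regret property the Li criterion asks for.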
POLI 8501 Introduction to Maximum Likelihood Estimation Maximum Likelihood Intuition Consider a model that looks like this: Y i N(µ, σ 2 ) So: E(Y ) = µ V ar(y ) = σ 2 Suppose you have some data on Y,
More informationTime Series and Forecasting Lecture 4 NonLinear Time Series
Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations
More informationSTATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN
Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 2: MODES OF CONVERGENCE AND POINT ESTIMATION Lecture 2:
More informationNotes on Random Variables, Expectations, Probability Densities, and Martingales
Eco 315.2 Spring 2006 C.Sims Notes on Random Variables, Expectations, Probability Densities, and Martingales Includes Exercise Due Tuesday, April 4. For many or most of you, parts of these notes will be
More informationIs there an optimal weighting for linear inverse problems?
Is there an optimal weighting for linear inverse problems? Jean-Pierre FLORENS Toulouse School of Economics Senay SOKULLU University of Bristol October 9, 205 Abstract This paper considers linear equations
More informationOn IV estimation of the dynamic binary panel data model with fixed effects
On IV estimation of the dynamic binary panel data model with fixed effects Andrew Adrian Yu Pua March 30, 2015 Abstract A big part of applied research still uses IV to estimate a dynamic linear probability
More informationChapter 2: simple regression model
Chapter 2: simple regression model Goal: understand how to estimate and more importantly interpret the simple regression Reading: chapter 2 of the textbook Advice: this chapter is foundation of econometrics.
More information13 Endogeneity and Nonparametric IV
13 Endogeneity and Nonparametric IV 13.1 Nonparametric Endogeneity A nonparametric IV equation is Y i = g (X i ) + e i (1) E (e i j i ) = 0 In this model, some elements of X i are potentially endogenous,
More informationNonparametric Regression Härdle, Müller, Sperlich, Werwarz, 1995, Nonparametric and Semiparametric Models, An Introduction
Härdle, Müller, Sperlich, Werwarz, 1995, Nonparametric and Semiparametric Models, An Introduction Tine Buch-Kromann Univariate Kernel Regression The relationship between two variables, X and Y where m(
More informationPreface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation
Preface Nonparametric econometrics has become one of the most important sub-fields in modern econometrics. The primary goal of this lecture note is to introduce various nonparametric and semiparametric
More informationON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS
ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS Olivier Scaillet a * This draft: July 2016. Abstract This note shows that adding monotonicity or convexity
More informationNonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity
Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Songnian Chen a, Xun Lu a, Xianbo Zhou b and Yahong Zhou c a Department of Economics, Hong Kong University
More informationSupplemental Appendix to "Alternative Assumptions to Identify LATE in Fuzzy Regression Discontinuity Designs"
Supplemental Appendix to "Alternative Assumptions to Identify LATE in Fuzzy Regression Discontinuity Designs" Yingying Dong University of California Irvine February 2018 Abstract This document provides
More informationNonparametric Modal Regression
Nonparametric Modal Regression Summary In this article, we propose a new nonparametric modal regression model, which aims to estimate the mode of the conditional density of Y given predictors X. The nonparametric
More informationThe risk of machine learning
/ 33 The risk of machine learning Alberto Abadie Maximilian Kasy July 27, 27 2 / 33 Two key features of machine learning procedures Regularization / shrinkage: Improve prediction or estimation performance
More informationECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd
ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation Petra E. Todd Fall, 2014 2 Contents 1 Review of Stochastic Order Symbols 1 2 Nonparametric Density Estimation 3 2.1 Histogram
More informationRobustness to Parametric Assumptions in Missing Data Models
Robustness to Parametric Assumptions in Missing Data Models Bryan Graham NYU Keisuke Hirano University of Arizona April 2011 Motivation Motivation We consider the classic missing data problem. In practice
More informationIDENTIFICATION OF MARGINAL EFFECTS IN NONSEPARABLE MODELS WITHOUT MONOTONICITY
Econometrica, Vol. 75, No. 5 (September, 2007), 1513 1518 IDENTIFICATION OF MARGINAL EFFECTS IN NONSEPARABLE MODELS WITHOUT MONOTONICITY BY STEFAN HODERLEIN AND ENNO MAMMEN 1 Nonseparable models do not
More informationTransparent Structural Estimation. Matthew Gentzkow Fisher-Schultz Lecture (from work w/ Isaiah Andrews & Jesse M. Shapiro)
Transparent Structural Estimation Matthew Gentzkow Fisher-Schultz Lecture (from work w/ Isaiah Andrews & Jesse M. Shapiro) 1 A hallmark of contemporary applied microeconomics is a conceptual framework
More informationSupervised Learning: Non-parametric Estimation
Supervised Learning: Non-parametric Estimation Edmondo Trentin March 18, 2018 Non-parametric Estimates No assumptions are made on the form of the pdfs 1. There are 3 major instances of non-parametric estimates:
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More information4 Nonparametric Regression
4 Nonparametric Regression 4.1 Univariate Kernel Regression An important question in many fields of science is the relation between two variables, say X and Y. Regression analysis is concerned with the
More informationEcon 673: Microeconometrics Chapter 12: Estimating Treatment Effects. The Problem
Econ 673: Microeconometrics Chapter 12: Estimating Treatment Effects The Problem Analysts are frequently interested in measuring the impact of a treatment on individual behavior; e.g., the impact of job
More informationConsistency and Asymptotic Normality for Equilibrium Models with Partially Observed Outcome Variables
Consistency and Asymptotic Normality for Equilibrium Models with Partially Observed Outcome Variables Nathan H. Miller Georgetown University Matthew Osborne University of Toronto November 25, 2013 Abstract
More information3.3 Estimator quality, confidence sets and bootstrapping
Estimator quality, confidence sets and bootstrapping 109 3.3 Estimator quality, confidence sets and bootstrapping A comparison of two estimators is always a matter of comparing their respective distributions.
More informationIndependent and conditionally independent counterfactual distributions
Independent and conditionally independent counterfactual distributions Marcin Wolski European Investment Bank M.Wolski@eib.org Society for Nonlinear Dynamics and Econometrics Tokyo March 19, 2018 Views
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationWooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics
Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).
More informationClassical regularity conditions
Chapter 3 Classical regularity conditions Preliminary draft. Please do not distribute. The results from classical asymptotic theory typically require assumptions of pointwise differentiability of a criterion
More informationLocal Polynomial Regression
VI Local Polynomial Regression (1) Global polynomial regression We observe random pairs (X 1, Y 1 ),, (X n, Y n ) where (X 1, Y 1 ),, (X n, Y n ) iid (X, Y ). We want to estimate m(x) = E(Y X = x) based
More informationA Novel Nonparametric Density Estimator
A Novel Nonparametric Density Estimator Z. I. Botev The University of Queensland Australia Abstract We present a novel nonparametric density estimator and a new data-driven bandwidth selection method with
More informationMinimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data. Jeff Dominitz RAND. and
Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data Jeff Dominitz RAND and Charles F. Manski Department of Economics and Institute for Policy Research, Northwestern
More informationNonparametric Regression. Badr Missaoui
Badr Missaoui Outline Kernel and local polynomial regression. Penalized regression. We are given n pairs of observations (X 1, Y 1 ),...,(X n, Y n ) where Y i = r(x i ) + ε i, i = 1,..., n and r(x) = E(Y
More information1 Lyapunov theory of stability
M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability
More informationDESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA
Statistica Sinica 18(2008), 515-534 DESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA Kani Chen 1, Jianqing Fan 2 and Zhezhen Jin 3 1 Hong Kong University of Science and Technology,
More informationUnconditional Quantile Regression with Endogenous Regressors
Unconditional Quantile Regression with Endogenous Regressors Pallab Kumar Ghosh Department of Economics Syracuse University. Email: paghosh@syr.edu Abstract This paper proposes an extension of the Fortin,
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationComments on: Panel Data Analysis Advantages and Challenges. Manuel Arellano CEMFI, Madrid November 2006
Comments on: Panel Data Analysis Advantages and Challenges Manuel Arellano CEMFI, Madrid November 2006 This paper provides an impressive, yet compact and easily accessible review of the econometric literature
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationNonconvex penalties: Signal-to-noise ratio and algorithms
Nonconvex penalties: Signal-to-noise ratio and algorithms Patrick Breheny March 21 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/22 Introduction In today s lecture, we will return to nonconvex
More informationNONPARAMETRIC ESTIMATION OF AVERAGE TREATMENT EFFECTS UNDER EXOGENEITY: A REVIEW*
OPARAMETRIC ESTIMATIO OF AVERAGE TREATMET EFFECTS UDER EXOGEEITY: A REVIEW* Guido W. Imbens Abstract Recently there has been a surge in econometric work focusing on estimating average treatment effects
More informationOptimal Bandwidth Choice for the Regression Discontinuity Estimator
Optimal Bandwidth Choice for the Regression Discontinuity Estimator Guido Imbens and Karthik Kalyanaraman First Draft: June 8 This Draft: February 9 Abstract We investigate the problem of optimal choice
More information