Bandwidth choice for regression functionals with application to average treatment effects


Karthik Kalyanaraman

November 2008

Abstract

I investigate the problem of optimal selection of the smoothing parameter, or bandwidth, when one is interested in finite-dimensional smooth functionals of a regression function. Examples of such functionals include average treatment effects, nonparametric versions of the Blinder-Oaxaca decomposition used to analyze labor market discrimination, nonparametric policy effects of macroeconomic interventions, and average derivatives. Firstly, I emphasize that this problem is very different from the problem of optimal bandwidth selection when one's interest is in the entire regression function, with different rates of convergence of the smoothing parameter to zero. Further, I show that the problem is in general characterized by an instability or, technically, a form of ill-posedness: small changes in the underlying joint distribution of the data can cause large changes in the chosen bandwidth. I propose a simple solution to this problem and prove its optimality properties. As an example I focus on how to choose the bandwidth for average treatment effects in observational studies when a local linear regression estimate is used. I derive original approximations to the estimation error and use these to provide a rule-of-thumb bandwidth selection algorithm for applied work. Finally, I use data from Imbens, Rubin and Sacerdote (2001) on lottery winnings and the effect of unearned income on labor supply; the data are used to demonstrate and assess both the bandwidth algorithm and some of the theoretical issues involved.

JEL Classification: C1, C21, J71

Keywords: Plug-in estimator, Nonparametric, ATE, Regression discontinuity

The project has been sustained by Guido Imbens, who has provided crucial support at each stage. I am very grateful for detailed suggestions and support from Gary Chamberlain and Lawrence Katz.
I have also benefited a great deal from comments made by Alberto Abadie, Victor Chernozhukov, Rustam Ibragimov, Alex Kaufman, Jean Lee, Ulrich Müller, Eduardo Morales, James Stock, and workshop participants at Harvard. Harvard University, Cambridge, MA. Electronic correspondence: kalyanar@harvard.edu

1 Introduction

A large proportion of policy-relevant statistics of interest to economists can be described as regression functionals, that is, real-valued functions of an underlying regression function. Examples include, among others, average treatment effects, the Blinder-Oaxaca wage decompositions used to analyze wage discrimination (Blinder (1973), Oaxaca (1973)), the counterfactual effect of a macroeconomic policy intervention (Stock (1989)), weighted average derivatives (Powell, Stock and Stoker (1989)), and regression discontinuity estimators (see Hahn, Todd and van der Klaauw (2001) for theory; several papers, including Lee (2008), for applied work). A standard method for estimating such functionals, which avoids functional form assumptions, is to first estimate the regression function nonparametrically and then to compute the functional using the estimated regression function. Such estimators are called plug-in estimators. A key issue here is the choice of smoothing parameter, or bandwidth, in the first step. While this choice is well-explored in settings where the primary object of interest is the entire regression function, it has not been studied in detail where the primary object of interest is a finite-dimensional functional. This paper studies optimal bandwidth choice in this setting. More specifically, I look at the subgroup of functionals that are estimable at an N^{1/2} rate. Examples include averages of the regression function or its derivatives (all the functionals cited above except regression discontinuity belong to this class). Though some work has been done on bandwidth selection with functionals that are not of this type, e.g. regression discontinuity (Imbens and Kalyanaraman, 2008), this paper focuses on N^{1/2}-estimable functionals. For illustrative purposes I consider two specific functionals, though the methods proposed extend transparently to others in the class.
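As a concrete illustration of the plug-in idea, the following sketch estimates θ = E[m(X)] by first fitting a kernel regression and then averaging the fit over the sample points. It is a minimal sketch, not the paper's procedure: the kernel is Gaussian, the bandwidth is fixed by hand, and the design (m(x) = x², X uniform on [−1, 1], so θ = 1/3) is hypothetical.

```python
import numpy as np

def nw_regression(x0, X, Y, h):
    # Nadaraya-Watson kernel regression at x0 (Gaussian kernel)
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

def plug_in_average(X, Y, h):
    # step 1: fit the regression at every sample point; step 2: average
    return float(np.mean([nw_regression(x, X, Y, h) for x in X]))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 2000)
Y = X**2 + rng.normal(0, 0.1, 2000)   # m(x) = x^2, so theta = E[X^2] = 1/3
theta_hat = plug_in_average(X, Y, h=0.1)
```

The quality of theta_hat depends on h; how to choose it when the target is θ rather than m is exactly the question this paper takes up.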
First, I focus on a simple average of the estimated regression function calculated over the sample points. Second, I demonstrate the resulting bandwidth procedure on a commonly used

¹ There is a literature on undersmoothing (e.g. Newey (1994), Goldstein and Messer (1992), Powell, Stock and Stoker (1989), and Stock (1989)) to achieve N^{1/2}-consistency, but there are no specific bandwidth proposals except for density-weighted average derivatives (see Goh (2007)). There has also been no discussion of the ill-posedness and instability inherent in these problems, and the estimator proposed for density-weighted averages potentially suffers from the same instability.

functional in labor and development studies, the average treatment effect. Though the results for this functional are written up in terms of the sample average treatment effect (SATE), I briefly show first how the same results are valid for the population average treatment effect and for nonparametric versions of the Blinder-Oaxaca decompositions used to study labor-force discrimination as well.² I indicate how a minor modification of the results can be used for bandwidth choice in the case of constructing policy effects of counterfactual macroeconomic interventions as in Stock (1989), and for average derivatives.

The standard approach to the bandwidth problem is to choose a bandwidth that minimizes some measure of global risk for the entire regression function, usually mean integrated squared error (MISE), i.e. the expected squared error integrated over the entire curve. The optimal bandwidth is then estimated either using plug-in estimators of the minimizer of the asymptotic approximation to the MISE or using an unbiased data-based estimator of the MISE (cross-validation). While this is appropriate if we are interested in the entire regression function, it is not the correct risk measure for the estimation of a particular functional of the regression function. The first contribution of this paper is to propose and investigate a criterion, the mean squared error (MSE) criterion, based directly on the functional and on the standard squared-error loss criterion used in estimation problems, and to solve for the optimal bandwidth under this criterion. Once the risk measure is defined, I approximate the risk in terms of functionals of the data and then find the bandwidth that minimizes the leading terms in the approximation. The solution obtained for the MSE is quite different from minimizing an asymptotic approximation to the MISE, or from doing standard cross-validation.
This is essentially because the averaging done in smooth functionals gets rid of some variance, and as a result it is optimal to remove more bias by using a smaller bandwidth. Indeed, the rates of convergence to zero are different. The point about the different rates has been made before in the literature on undersmoothing (see Newey (1994) and Goldstein and Messer (1992)); however, it is worth highlighting in view of

² Alternative nonparametric decompositions using densities rather than regression functions are discussed in DiNardo, Fortin and Lemieux (1996).

inappropriate methods like cross-validation used for this problem in practice.

The second contribution of the paper is to show that the problem of minimizing MSE is subject to an instability. For a large class of regression functions, small perturbations of the support or of the function itself can lead to large differences in the optimal bandwidth (in general discontinuously changing its convergence rate to zero, and in some cases moving it from a finite value to infinity). Formally, the problem exhibits a certain kind of ill-posedness. This leads to estimators performing poorly if the regression function is close to the class of badly behaved regression functions; in the case of the ATE the bad behavior occurs at and near two particularly interesting cases: constant additive treatment effects (including no effect of treatment) and treatment effects linear in the measured covariate(s). I consider some solutions to this problem and propose a method of regularization³ that ensures two properties of the stable solution: 1) the risk evaluated at this regularized estimate is close to the optimal solution as the data set becomes large (a type of asymptotic no-regret property), and 2) the estimated bandwidth is close to the optimal bandwidth in large datasets. I motivate this solution in two different ways. Firstly, the solution can be considered as a type of Tikhonov regularization, where a convex penalty term, which penalizes extremely large bandwidths for bumpy regression functions, is added to the loss function, and the sum is then minimized to find the optimal regularized bandwidth. Secondly, the same solution can be regarded as minimizing the risk averaged over local uncertainties in the weight function. This second way of approaching the issue is very much in the spirit of Bickel and Li (2006).
Finally, I use data on lottery winnings from Imbens, Rubin and Sacerdote (2001) that was collected to try to identify the effect of unearned income on labor supply. I use this dataset both to illustrate the properties of the proposed ATE bandwidth selection algorithm and to illustrate some theoretical issues raised by this paper. The simulations indicate that there are potential gains to be achieved, even in fairly linear regression

³ I use the term in a similar sense to Bickel and Li (2006), who discuss the notion of approximating a difficult, possibly singular, problem with simpler regularized problems that approach the former in the limit.

functions, in bias reduction and bandwidth stabilization.

In sum, the contributions to the existing literature are: 1) the approximation of the risk function for the ATE when the estimation method is local linear regression,⁴ 2) the formulation of an objective data-based bandwidth algorithm for the ATE leading to its N^{1/2}-consistent estimation, and 3) the highlighting and exploration of the ill-posedness of the bandwidth problem in the context of minimizing MSE, and its solution using regularization.

2 A general introduction to the bandwidth problem

Consider first a case in which one is interested in estimating a regression function without making functional form assumptions. Examples would include estimation of Engel curves (Bierens and Pott-Buter, 1990), estimation of the relationship between calories consumed and food prices (Subramanian and Deaton, 1996), and so on. There are several competing ways of doing this, such as methods based on kernels, series, sieves etc. (see Wasserman (2006) for a description and evaluation of these methods), but all of these require a choice of what is called the smoothing parameter or bandwidth. This is essentially a choice of how complex a model to fit to the data. With more complex models, we reduce bias, but this typically comes at the cost of increasing variance. In the regression function context, this choice is made a little clearer by imagining potential extremes. One of the simplest regression functions we could fit, say in the kernel context, would be the constant function, corresponding to an infinite bandwidth; a zero bandwidth, on the other hand, corresponds to the very complicated regression function obtained from simple interpolation of the data. Thus the trade-off between bias and variance gives us a framework to think about optimality in bandwidth selection.
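The two extremes just described can be seen directly in a small sketch (a Gaussian-kernel Nadaraya-Watson fit on simulated data; both bandwidth choices are deliberately absurd): a huge bandwidth collapses the fit to the sample mean, while a vanishing bandwidth interpolates the observations.

```python
import numpy as np

def nw_fit(x0, X, Y, h):
    # Nadaraya-Watson estimate at x0 with a Gaussian kernel
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(1)
X = np.linspace(0, 1, 50)
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.2, 50)

# huge bandwidth: every observation gets (essentially) equal weight,
# so the fitted curve collapses to the constant function y-bar
big = np.array([nw_fit(x, X, Y, h=1e6) for x in X])

# near-zero bandwidth: at each data point, virtually all the kernel
# weight falls on that point itself, so the fit interpolates the data
tiny = np.array([nw_fit(x, X, Y, h=1e-4) for x in X])
```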
This choice is quite familiar in the regression context, with several methods that allow the researcher to choose an objective, data-based and optimal bandwidth for fitting the entire regression function (see Wasserman (2006) and Fan and Gijbels (1996)); it is far less

⁴ This method is attractive due to its good bias properties in regions of low density and at the boundaries. See also Heckman, Smith and Todd (1998), who recommend its use for calculating the ATE.

so in the context where one is interested in a scalar functional of the regression function. One could argue that economists are typically more interested in scalar functionals of the regression function than in the function itself, just as one is far more interested in the coefficients of a linear regression than in the regression itself.⁵ Examples of such functionals include the regression function at a point (e.g. the regression discontinuity estimator), a weighted average regression (various average treatment effects, e.g. Dehejia and Wahba (1999), Imbens, Rubin and Sacerdote (2001)), average partial derivatives (e.g. average cross-partials used as an index of substitutability of two goods), and so on. Indeed, most policy effects that summarize regression relationships (e.g. Stock (1989)) are (typically smooth) functionals. Just as in regression function estimation, one might want to estimate the functional one is interested in without functional form restrictions, as when one is concerned about nonlinearities in the underlying regression relationship (e.g. Heckman et al. (1998)), or, in the case of the ATE, about different covariate distributions in the treatment and control populations (Dehejia and Wahba (1999)). In this context, the choice of a bandwidth becomes a key decision. Once we have a procedure for selecting the bandwidth we can form a plug-in estimate of the functional, by first estimating the regression nonparametrically with that bandwidth and then using this estimate in the formula for the functional. But there has not been much work done on how to choose the bandwidth when the interest is not in the regression function itself. There are no objective methods available to apply to the problem. Thus applied researchers have either used a fixed, subjectively chosen bandwidth (e.g. Heckman et al. (1998) choose a fixed bandwidth in their discussion of calculating the ATE using local linear regressions) or used cross-validation.
Cross-validation essentially amounts to minimizing an unbiased estimate of the risk of estimating the entire regression function; it is not appropriate for the problem of choosing a bandwidth for functionals. Worse, to estimate smooth functionals like various average treatment effects at an N^{1/2} rate using kernels, we need to use a bandwidth that converges to zero faster than the

⁵ The average derivative could be one possible analogue, in the nonlinear context, of the slope coefficient in a linear regression function.

bandwidth sequence for fitting the entire function. This point was highlighted by the undersmoothing literature (Newey (1994)). Thus tailoring the bandwidth method to the functional at hand is very important not only if we want to estimate a bandwidth that is theoretically best for a given dataset, but also if we care about a sensible estimate of the functional in general. This paper essentially seeks to investigate this choice and provide, in the instance of various policy effects, a concrete, objective and data-based way of selecting the bandwidth.

3 Smooth regression functionals

The key feature of these functionals is that they are estimable at an N^{1/2} rate. A general characterization using the Riesz integral representation of these functionals is available in Goldstein and Messer (1992). For exposition I will further assume that the functionals are linear (if they are not, a linearization approximation using Fréchet-type derivatives of the functional can be considered) and indeed focus on weighted average regressions as the key example, as most smooth linear functionals can be approximated by sums of weighted average regressions and regression derivatives (again see Goldstein and Messer (1992)).

Examples of smooth linear functionals that arise frequently in applied work:

1. Average treatment effect (ATE): Under unconfoundedness ((Y^0, Y^1) ⊥ W | X) and full overlap, the population ATE (PATE) can be written as θ = E(m_1(X) − m_0(X)), where m_1(x) = E(Y | W = 1, X = x), etc. This is the difference between two regressions, averaged over the distribution of the covariate X in the population. One might be interested instead in the sample ATE (SATE), where the averaging is done over the empirical sample distribution of the covariate:

θ = \frac{1}{N} \sum_{i=1}^{N} (m_1(X_i) − m_0(X_i)).

In both cases the ATE can be estimated by

\hat θ = \frac{1}{N} \sum_i (\hat m_1(X_i) − \hat m_0(X_i)).

2. Nonparametric versions of Blinder-Oaxaca wage decompositions: Used to analyze labor-market discrimination, this functional has the same statistical structure as the PATE. Consider two groups, say women and men, indexed by T ∈ {0, 1}. To get that part of the difference in some average outcome Y, say wages, that is not attributable to differences in observed characteristics X (say IQ scores), one estimates

θ = E_X [E(Y | X, T = 1) − E(Y | X, T = 0)].

Here the outer expectation is taken with respect to some reference distribution: either the conditional covariate distribution in one of the two populations, or the covariate distribution unconditional on group.

3. Average derivative effect (ADE): Analyzed in Powell, Stock and Stoker (1989) and Härdle and Stoker (1989), these functionals are of the type θ = E(∂m(X)/∂x) (Powell, Stock and Stoker (1989) considered density-weighted average derivatives of the form E(f(X) m′(X))). These arise in semiparametric estimation of index models as estimands of interest and may also be of independent interest in demand estimation problems. Bandwidth selection issues for these estimators are identical in spirit to the issues with the simple weighted average θ_{WA} = E(ω(X)m(X)) considered later, and a straightforward extension is immediate.

4. Another example, considered in Stock (1989), involves assessing the impact of policies that leave the structural relationship m(x) = E(Y | X = x) between an outcome variable Y and a policy instrument X unchanged but affect the distribution of the policy instrument. Here one is typically interested in calculating the mean effect on Y, or E(m(X)), under the two different distributions of X, one being the current empirical distribution and the other the future counterfactual. In other words:

θ = E(ω(X)m(X)) − E(m(X)).
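For example 1, a minimal plug-in SATE sketch: fit the two group regressions separately and average their difference over all sample points. This is an illustration only, not the paper's procedure — it uses a simple Nadaraya-Watson fit per group rather than the local linear estimator adopted later, a hand-picked bandwidth, and a hypothetical design with a constant additive treatment effect of 2.

```python
import numpy as np

def nw(x0, X, Y, h):
    # Nadaraya-Watson estimate at x0 with a Gaussian kernel
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

def sate_plugin(X, T, Y, h):
    """Plug-in SATE: fit each group's regression on its own subsample,
    then average the difference of the two fits over *all* sample points."""
    X1, Y1 = X[T == 1], Y[T == 1]
    X0, Y0 = X[T == 0], Y[T == 0]
    m1 = np.array([nw(x, X1, Y1, h) for x in X])
    m0 = np.array([nw(x, X0, Y0, h) for x in X])
    return float(np.mean(m1 - m0))

rng = np.random.default_rng(2)
n = 1000
X = rng.uniform(0, 1, n)
T = (rng.uniform(size=n) < 0.5).astype(int)
Y = X + 2 * T + rng.normal(0, 0.1, n)   # constant additive effect: theta = 2
theta = sate_plugin(X, T, Y, h=0.1)
```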

3.1 Setup

3.1.1 Weighted average regression

As mentioned earlier, I focus on two functionals in the paper. The first, the weighted average regression θ_{WA}, is chosen 1) due to its simplicity and ease of illustration of the key theoretical issues and 2) for the ease of extension of the ideas to functionals like weighted average derivatives. To make the essential aspects of the problem clear I work with a scalar covariate X and a known regression design (i.e. the density function of X, f(x), is assumed known or controlled by the researcher).⁶ I stress here that these assumptions are not necessary. The calculations provided in the section on the ATE dispense with such assumptions; those calculations can also immediately be modified for this case. These assumptions, however, also allow one to illustrate the issues in the familiar setting of a simple kernel estimate, rather than the more complex local linear regression used later.

The output variable Y is generated by the following process:

Y_i = m(X_i) + ε_i.

The errors ε_i are independently and identically distributed with conditional mean zero and conditional variance σ^2(x). Thus m(x) = E(Y | X = x). The functional of interest is

θ_{WA} = E(ω(X)m(X)),

where ω(·) is a known smooth weighting function.

⁶ In other words, the regression at X_i is calculated as

\hat m(X_i) = \frac{1}{(n−1) f(X_i)} \sum_{j≠i} K_h(X_i − X_j) Y_j.

Assuming a random design would only make the calculations more tedious while still preserving the nature of the problem explored in this paper.
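The known-design estimator in footnote 6 is straightforward to sketch. A minimal version, with a Gaussian kernel for brevity (the paper's assumptions use a compactly supported one) and a uniform design so that f(x) = 1:

```python
import numpy as np

def m_hat_known_design(i, X, Y, h, f):
    """Leave-one-out kernel estimate at X_i with known design density f:
    m_hat(X_i) = 1/((n-1) f(X_i)) * sum_{j != i} K_h(X_i - X_j) Y_j."""
    n = len(X)
    d = np.delete(X, i) - X[i]
    Kh = np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2 * np.pi))  # Gaussian K_h
    return Kh @ np.delete(Y, i) / ((n - 1) * f(X[i]))

rng = np.random.default_rng(3)
n = 2000
X = rng.uniform(0, 1, n)               # uniform design, so f(x) = 1
Y = np.sin(X) + rng.normal(0, 0.1, n)  # hypothetical regression m(x) = sin(x)
i = int(np.argmin(np.abs(X - 0.5)))    # evaluate at an interior point
est = m_hat_known_design(i, X, Y, h=0.05, f=lambda x: 1.0)
```

Dividing by the known f(X_i), rather than by a kernel density estimate, is what makes this estimator linear in Y and keeps the calculations in the text simple.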

Now θ_{WA} can be estimated in two steps: first estimating the underlying nonparametric regression function at the sample points and then estimating the functional as

\hat θ_{WA} = \frac{1}{n} \sum_i ω(X_i) \hat m(X_i),

where \hat m(X_i) is the estimate of the regression function at X_i. For the approximations that follow, the following standard assumptions will be needed.

Assumption 3.1: The marginal distribution of the covariate X is known and denoted f(x). Moreover, it is twice continuously differentiable and bounded away from 0.

Assumption 3.2: m(·) has at least two continuous derivatives.

Assumption 3.3: The kernel K is a smooth, compactly supported, symmetric, nonnegative pdf.

Assumption 3.4: The weight function ω(·) is piecewise continuously differentiable.

Assumption 3.5: X and ω(·) are supported on the whole real line.

Assumption 3.5 is made only so that boundary biases that arise with simple kernel estimation can be ignored. A compact support for X will entail some way of correcting for boundary biases in the calculated risk (e.g., negative reflection) but will lead to the same asymptotic approximation given below.

3.1.2 Average treatment effect

A functional of interest in observational studies is the sample average treatment effect (SATE). This is essentially the average treatment effect where the averaging is over the empirical distribution of the covariate:

θ_{SATE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i^1 − Y_i^0),

where Y_i^1 and Y_i^0 are the potential outcomes for unit i under treatment (T_i = 1) and control (T_i = 0) respectively. The treatment assignment indicator, T, and an additional covariate, X, are observed.

Under the assumptions of unconfoundedness ((Y^1, Y^0) ⊥ T | X) and overlap (the density of X is bounded away from 0 over the entire support for both treated and control samples), we can estimate the SATE using the following simple estimator:

\hat θ_{SATE} = \frac{1}{n} \sum_{i=1}^{n} (\hat m_1(X_i) − \hat m_0(X_i)),

where \hat m_1(X_i) and \hat m_0(X_i) are the estimated regression functions on the treated and control samples respectively, each evaluated at X_i. I state the lemmas about the ATE in terms of the SATE; however, as indicated in a subsequent section, the bandwidth optimal for the SATE is also optimal for the PATE.

For the purposes of this paper, I work out the bandwidth selection details for a specific (and attractive) nonparametric regression estimator: the local linear regression estimator. This estimator does not suffer from the severe boundary bias that affects standard (Nadaraya-Watson) kernel estimates (see Heckman et al. (1998), who recommend this estimator in the context of calculating treatment effects). More importantly perhaps for this application, the local linear estimator, unlike the latter, does not have high bias in areas with fewer observations (it is design-adaptive, to use the jargon; see Fan (1992)). This is an important consideration for the ATE, where one can expect a low density of treated observations for certain covariate values and very few control observations for other values (i.e., limited overlap).

More explicitly, arrange the observations such that the first n_0 observations constitute the control sample and the remaining n_1 = n − n_0 form the treated sample. Define y^0 = (y_1, ..., y_{n_0})′ and m^0 = (m_0(X_1), ..., m_0(X_{n_0}))′, where m_0(·) is the regression function in the control group, i.e., m_0(x) = E(Y | T = 0, X = x). Define R_i^0 = [ι Z_i], where Z_{ij} = X_j − X_i, j = 1, ..., n_0, and ι is a column of n_0 ones; define the weight matrix W_i^0 = diag_{j=1,...,n_0}(K_h(X_j − X_i)); and let e = (1, 0)′. Similarly define y^1, R_i^1 and W_i^1.
Given the above, we can write the estimated control regression at X_i as

\hat m_0(X_i) = e′ (R_i^{0′} W_i^0 R_i^0)^{−1} R_i^{0′} W_i^0 y^0.
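This closed form is just the intercept of a kernel-weighted least squares fit, and can be transcribed directly (a sketch with Gaussian weights for brevity; the paper assumes a compactly supported kernel):

```python
import numpy as np

def local_linear(x0, X, Y, h):
    """Local linear fit at x0: the intercept e'(R'WR)^{-1} R'W y of a
    kernel-weighted least squares regression of Y on [1, X - x0]."""
    Z = X - x0
    R = np.column_stack([np.ones_like(Z), Z])
    w = np.exp(-0.5 * (Z / h) ** 2)      # Gaussian kernel weights
    WR = R * w[:, None]                   # rows of R scaled by the weights
    beta = np.linalg.solve(R.T @ WR, WR.T @ Y)
    return beta[0]                        # intercept = fitted m(x0)

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, 1000)
Y = 1 + 2 * X + rng.normal(0, 0.1, 1000)  # hypothetical linear truth
# design-adaptivity in action: for a linear truth the estimator has no
# first-order bias even at the boundary of the support, x0 = 0
at_boundary = local_linear(0.0, X, Y, h=0.1)
```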

Therefore:

\hat θ_{SATE} = \frac{1}{n} \sum_{i=1}^{n} \left( e′ (R_i^{1′} W_i^1 R_i^1)^{−1} R_i^{1′} W_i^1 y^1 − e′ (R_i^{0′} W_i^0 R_i^0)^{−1} R_i^{0′} W_i^0 y^0 \right).

It will be useful to collect here the assumptions necessary for the upcoming lemmas. They are all standard:

Assumption 3.6: The marginal distribution of the covariate X is denoted f(x); it is continuous a.e. and bounded away from 0 conditional on T: f(x | T = 1) ≡ f_1(x) ≥ c_1 > 0 and f(x | T = 0) ≡ f_0(x) ≥ c_0 > 0.

Assumption 3.7: m(·) has at least two continuous derivatives a.e.

Assumption 3.8: The kernel K is a smooth, compactly supported, symmetric, nonnegative pdf. For concreteness, assume a kernel supported on [−1, 1].

Assumption 3.9: The conditional variance functions σ_1^2(x) = Var(Y^1 | X = x) and σ_0^2(x) are bounded and continuous almost everywhere.

4 Error criteria and optimal bandwidth

4.1 Standard criterion: MISE

The standard criterion of risk that is used for bandwidth selection in nonparametric regression problems is called the mean integrated squared error (MISE) or, more generally, the weighted MISE (WMISE). It is a global measure of error, and the loss function sums up squared error along the entire regression function. It is defined as

WMISE = \int ξ(x) E(\hat m(x) − m(x))^2 f(x) dx,

where ξ(x) is the weighting function used. Usually this is chosen to be flat, and the resulting error criterion is then called simply the MISE. The bandwidth is then chosen by finding a data-based unbiased estimator of the error criterion (e.g., cross-validation) or by first asymptotically approximating the risk and then estimating its minimizer.

Lemma 4.1: (Asymptotic approximation to MISE) Under assumptions 3.1–3.5, we obtain the following expression for the WMISE:

WMISE = C_1 \frac{1}{nh} + C_2 h^4 + o_p(n^{−1}h^{−1} + h^4).

The constants in the approximation are:

C_1 = T_1(K) \int ξ(x) σ^2(x) dx,

C_2 = T_2(K) \int ξ(x) \frac{[(mf)″(x)]^2}{f(x)} dx,

T_1(K) = \int K(t)^2 dt,  T_2(K) = \left( \frac{1}{2} \int t^2 K(t) dt \right)^2.

Calculations are given in the appendix. This is a well-known result; see Priestley and Chao (1972). Note that the first term reflects the risk cost of the variance of the estimator and the second reflects the bias. Increasing the bandwidth reduces variance across repeated samples at the cost of increasing bias. Once again, a key observation to keep in mind is that standard cross-validation essentially minimizes an unbiased estimate of the MISE and can be shown to be consistent for the minimizer of the leading terms of the MISE displayed above.

4.2 Modified criterion: MSE

Using the MISE or AMISE as a measure of risk is not appropriate when one is interested primarily in the functional θ. Thus standard methods of bandwidth selection will have to be modified. We instead consider a measure of risk defined directly in terms of the functional of interest, the mean squared error (MSE) criterion:

MSE = E(\hat θ(\hat m_h) − θ(m))^2.

One can now try to approximate the MSE in terms of functionals of the joint distribution of the data and the chosen bandwidth.
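The leading terms of Lemma 4.1, C_1/(nh) + C_2 h^4, are minimized in closed form at h = (C_1/(4 C_2 n))^{1/5}, the familiar O(n^{−1/5}) rate. A quick numerical check of that closed form against a grid search, with arbitrary illustrative constants:

```python
import numpy as np

def amise_bandwidth(C1, C2, n):
    """Minimizer of C1/(n h) + C2 h^4 over h: h* = (C1 / (4 C2 n))^(1/5)."""
    return (C1 / (4 * C2 * n)) ** 0.2

# sanity check against a brute-force grid search
n, C1, C2 = 500, 2.0, 3.0                 # hypothetical constants
grid = np.linspace(1e-3, 1.0, 100_000)
amise = C1 / (n * grid) + C2 * grid ** 4  # the two leading terms
h_grid = grid[np.argmin(amise)]
h_formula = amise_bandwidth(C1, C2, n)
```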

4.2.1 MSE for θ_{WA}

Lemma 4.2: (Asymptotic approximation to MSE) Under assumptions 3.1–3.5, we obtain the following expression for the MSE:

MSE_{WA} = \frac{C_0}{n} + C_1 \frac{1}{n^2 h} + C_2 h^4 + o_p(n^{−2}h^{−1} + h^4) ≡ AMSE + o_p(n^{−2}h^{−1} + h^4).

Here, the constants in the approximation are:

C_0 = Var(ω(X)m(X)) + \int ω(x)^2 σ^2(x) f(x) dx,

C_1 = T_1(K) \int ω(x)^2 (σ^2(x) + m(x)^2) dx,

C_2 = T_2(K) \left( \int (mf)″(x) ω(x) dx \right)^2,

T_1(K) = \int K(t)^2 dt,  T_2(K) = \left( \frac{1}{2} \int t^2 K(t) dt \right)^2.

This is the first result of the paper. Here the first term comes from the variance of the averaging involved. Note that it is not affected by the bandwidth choice. The next two terms reflect the variance and bias stemming from the choice of smoothing parameter. An increased bandwidth makes the estimator more stable across repeated samples but also more biased. The nature of the constant on the bias term, C_2, will prove crucial to the following discussion.

4.2.2 MSE for the ATE

A contribution of this paper is to work out a similar expansion for the SATE when local linear estimation is used to calculate the regression function.⁷ The appendix provides both the calculations and further higher-order terms than are presented below. As is standard with regressions, all expectations in the lemmas below condition on the sample

⁷ Imbens, Newey and Ridder (2005) provide an MSE expansion for the ATE when it is calculated using inverse propensity score weighting and series estimation for the regression function and the propensity score. Their procedure requires the choice of two bandwidths. Other differences from the current work include: 1) they work with a criterion slightly different from mine in that they consider the sum of the mean squared errors of the treatment and control regressions rather than working directly with the ATE; 2) they do not provide guidance for bandwidth selection; and 3) they are not concerned with the problems of regularization, which will arise in their set-up as well.

X and T, as unconditional expectations may not exist (if they do, they are identical to what is given below). This conditioning is suppressed notationally. Following the result, I indicate extensions to other similar functionals.

Lemma 4.3: (Asymptotic approximation to MSE: SATE) Under assumptions 3.6–3.9, and denoting the support of X by [x̲, x̄] (these can be nonfinite) and by p the probability Pr(T = 0), we obtain the following expression for the MSE for the SATE (MSE_{SATE}):

MSE_{SATE} = \frac{K_0}{n} + K_1 \frac{1}{n^2 h} + K_2 h^4 + o_p(n^{−2}h^{−1} + h^4).

Here, the constants in the approximation are:

K_0 = \int \left( \frac{σ_1^2 f}{(1−p) f_1} + \frac{σ_0^2 f}{p f_0} \right)(x) f(x) dx,

K_1 = π \int \left( \frac{σ_1^2 f}{(1−p) f_1} + \frac{σ_0^2 f}{p f_0} \right)(x) dx,

K_2 = \frac{ν^2}{4} \left( \int (m_1″ − m_0″)(x) f(x) dx \right)^2,

π = \int K(t)^2 dt,  ν = \int t^2 K(t) dt.

This is the second novel result of the paper. Inspecting the MSE, we see first that the term K_0/n is the same as the semiparametric variance bound for the problem derived by Hahn (1998) when one expresses it in terms of the propensity score using Bayes' theorem. The only difference from the expansion for the PATE, as indicated later, would be a third term in the definition of K_0, namely \int (m_1(x) − m_0(x) − θ_{PATE})^2 f(x) dx. The next two terms reflect the variance-bias tradeoff. An increased bandwidth makes the estimator more stable across repeated samples but also more biased. Note that relatively lower densities in either the treated or control samples increase the variability (showing the importance of significant overlap in variance reduction). Also note that the third term, the bias term, does not depend on f_0 or f_1. This is due to the design-adaptivity of local linear regression: sparse data in a region does not increase first-order bias as it would with simple kernel estimation. Also note that it is the difference in curvatures between

the treatment and control regressions that determines the bias. This will be important in the following discussion.

4.2.3 Extensions to other functionals

The extension to the PATE is immediate, recognizing firstly that, under the random sampling assumption, the estimate of the PATE is the same as that of the SATE (i.e. \hat θ_{PATE} = \hat θ_{SATE}), and secondly that the only additional terms in the MSE expansion of E(\hat θ_{PATE} − θ_{PATE})^2 are E(θ_{SATE} − θ_{PATE})^2 and E(\hat θ_{SATE} − θ_{SATE})(θ_{SATE} − θ_{PATE}). The second term is negligible and the first contributes a term of order n^{−1}.⁸ Thus the bandwidth choice remains unaffected.

The extension to the nonparametric Blinder-Oaxaca estimates is immediate. Suppose the two groups are indexed by T = 1 and T = 0, say with T = 1 indicating female. Then the part of the difference in wages Y not attributable purely to different distributions of the measured covariate X (say IQ) in the two groups is

E_X [E(Y | T = 1, X) − E(Y | T = 0, X)],

where the averaging is done over the marginal distribution of X, unconditional on group membership. Thus we can see that this functional has an identical structure to the PATE, and the application of the above MSE expansion only requires a redefinition of terms.

It is fairly easy to extend the results to the policy effect estimator of Stock (1989) as well. Note that this estimate has the identical structure of one of the two elements in the ATE: the average, say, control regression, averaged over a distribution other than the conditional distribution of X in the control sample (f_0). In the ATE case, this other distribution was the unconditional distribution of X, and in the Stock (1989) case it is the counterfactual distribution of X after the policy intervention.

⁸ This, as indicated before, is the term E(m_1(X) − m_0(X) − θ_{PATE})^2, also to be found in Hahn's semiparametric variance bound.

5 Bandwidth selection and ill-posedness

5.1 Optimal bandwidth

Once we have the expressions for the asymptotic approximations to the MSE, we can optimally trade bias for variance to the first order, noting that the former is typically an increasing function of the bandwidth and the latter a decreasing function, and find a formula for the theoretically optimal bandwidths h_{WA}^{opt} and h_{SATE}^{opt}.

Lemma 5.1: (Optimal bandwidths) Given lemmas 4.2 and 4.3, we obtain the following expressions for the optimal bandwidths:

h_{WA}^{opt} = \left( \frac{C_1}{4C_2} \right)^{1/5} n^{−2/5} + o_p(n^{−2/5}).

For the SATE,

h_{SATE}^{opt} = \left( \frac{K_1}{4K_2} \right)^{1/5} n^{−2/5} + o_p(n^{−2/5}).

Note that while the optimal bandwidth for estimating m would be O(n^{−1/5}), the optimal bandwidth for estimating θ_{WA} and θ_{SATE} is O(n^{−2/5}). In other words, we are required to use a smaller bandwidth than we would have were we interested in the entire function (see Goldstein and Messer (1992) and Newey (1994)). This makes intuitive sense: the averaging gets rid of some of the variance for us, and therefore we can allow ourselves less bias by smoothing less at the optimal bias-variance tradeoff point. Moreover, if one used the optimal bandwidth for the MISE, one would not achieve the optimal semiparametric rate of N^{1/2} for the estimation of θ.

There is, however, something else that has not been noted in the literature. First consider MSE_{WA}. The optimal bandwidth is inversely proportional to the bias constant C_2^{1/5}. C_2 can be thought of as proportional to the total curvature of the regression function on its support (technically it is the total curvature of the function mf, but with a uniform design or, more compellingly, with local linear estimation, it is the total curvature of the regression function m that matters and not the design f). For a large class of functions, C_2 could be close to or exactly 0. For these functions there is no first-order bias-variance tradeoff as defined by minimizing the AMSE, leading to an optimal bandwidth of ∞!
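Lemma 5.1's message — the functional-optimal bandwidth undersmooths relative to the MISE-optimal one, by a relative factor of order n^{−1/5} — can be made concrete with hypothetical constants:

```python
def h_mise(C1, C2, n):
    # minimizer of C1/(n h) + C2 h^4: the familiar O(n^{-1/5}) rate
    return (C1 / (4 * C2 * n)) ** 0.2

def h_functional(C1, C2, n):
    # minimizer of C1/(n^2 h) + C2 h^4: h = (C1/(4 C2))^{1/5} n^{-2/5}
    return (C1 / (4 * C2)) ** 0.2 * n ** (-0.4)

C1, C2 = 1.0, 1.0                      # illustrative constants only
sizes = [100, 10_000, 1_000_000]
ratios = [h_functional(C1, C2, n) / h_mise(C1, C2, n) for n in sizes]
# the ratio equals n^{-1/5}: the functional-optimal bandwidth is always
# smaller, and relatively more so as the sample grows
```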

The same problem arises with MSE_SATE in a perhaps even more relevant scenario. Here the first-order bias is proportional to the difference in curvatures. In the practically interesting cases of a constant additive treatment effect (CATE) and a treatment effect linear in covariates, this bias term is zero, again leading to the same issue as above. Taking higher-order terms does not help, as explained later.

The next two sections are abstract in nature. Readers interested primarily in applied work should skim them, and then look at sections 6.2, 6.3 and 6.4.

5.2 Ill-posedness

For the purposes of this section I concentrate on θ_WA, though it is clear that the exact same issues are relevant to θ_SATE. This is primarily for illustrative purposes, as the simple structure of θ̂_WA makes the issues at hand more transparent.

The calculations, based on simple known functional forms, shown in table 7.2.3 and figure 4.1 illustrate the issue starkly. The output process was y = m(x) + ε with ε ∼ N(0, σ²). The regression was either m(x) = 8cos(Bx) or 8sin(Bx). B is a parameter that determines frequency: increasing it increases the local features of the regression function (i.e. makes it more "bumpy") and thus should decrease the MISE-optimal bandwidth. The support over which risk was calculated was [−10, 10]. The table shows the optimal bandwidths that minimize, respectively, the actual MSE, its asymptotic approximation the AMSE, and the MISE, for the two different regression functions. The figures show an example of two different regression functions (cosine and sine with the same amplitude and frequency) with the exact MSE at various bandwidths.

There are a couple of things to note in the simulations shown. Firstly, the bandwidth that minimizes MISE is almost identical for the sine and the cosine curves. This makes sense: the sine is really only a shifted version of the cosine and in a sense all the local features are identical.
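The mechanism behind these simulations can be checked directly: with uniform design on [−10, 10], the bias constant is driven by the integrated curvature of the regression, which vanishes for the odd sine curve but not for the cosine. A minimal numerical check on my own grid (not the paper's simulation code):

```python
import numpy as np

def trapz(y, x):
    # simple trapezoidal rule
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x)) / 2.0)

# Integrated curvature of the regression over the support [-10, 10]
# (uniform design), the quantity driving the bias constant C2.
B = 1.0
x = np.linspace(-10.0, 10.0, 200001)

m2_cos = -8.0 * B**2 * np.cos(B * x)   # (8*cos(Bx))''
m2_sin = -8.0 * B**2 * np.sin(B * x)   # (8*sin(Bx))''

curv_cos = trapz(m2_cos, x)   # = -16*B*sin(10*B), nonzero in general
curv_sin = trapz(m2_sin, x)   # = 0 exactly: the integrand is odd

print(curv_cos, curv_sin)
```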
The bandwidth that minimizes the MSE and its asymptotic approximation,⁹ however, is radically different in the two cases: from a finite value for the cosine curve it jumps to infinity for the sine curve. In the case of the AMSE the reason for this is

9 The approximation, incidentally, seems fairly accurate.

Table 7.2.3: Optimal bandwidths (for each σ and B: the MSE-, AMSE- and MISE-minimizing bandwidths for the cosine curve, C_2 > 0, and the sine curve, C_2 = 0)

Figure 4.1: Regression function and MSE

apparent on inspection of the formulae: C_2 = 0 for the sine curve, leading to an exactly 0 asymptotic bias. With the MSE, essentially, if we are only interested in the functional and not the entire function, biases in one region of the support can cancel out biases in other regions, leading to a 0 total bias.

A similar issue is present with the MSE for θ_SATE near, for instance, CATE, where the issue is the cancellation of biases when taking the difference across the two regression functions: the bias of the estimate of the treatment regression is of the same magnitude and direction as the bias of the estimate of the control regression to the first order. Here what matters is the difference in total curvature between the treatment and control regressions. Thus, while for functions near the CATE case the optimal MSE-minimizing bandwidth is O(n^{−2/5}), at CATE the rate suddenly jumps to O(n^{−2/9}).

Note that there is a single case in which the bias suddenly changes orders in the MISE and the AMISE as well: the linear case. This generally seems to be ignored in the vast literature around the MISE, for theoretical convenience. For instance, about the expression for the bandwidth that minimizes MISE to first order with local linear estimation, Fan and Gijbels (1996) write "It is understood that the integrals are finite and the denominator does not vanish." Since the problem happens only when the regression is close to linear, and one can rule out linearity - as is usually done - it is claimed that this is not a serious problem. However, for the data set I use later on to illustrate some of the issues, this does seem to be an issue. A small modification, noted later, of the solution I propose to the instability in the MSE problem takes care of this case as well. It is also to be noted that the sensitivity to the choice of support is special to the MSE case.
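The cancellation at CATE can be seen in a toy computation: if m_1 = m_0 + τ, the two curvatures are identical pointwise, so their difference - the quantity the first-order SATE bias depends on - is exactly zero. A sketch with a hypothetical m_0:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10001)
m0 = np.sin(3 * x) + x**2      # hypothetical control regression
tau = 2.5                      # constant additive treatment effect (CATE)
m1 = m0 + tau                  # treatment regression is a pure shift

def d2(y):
    # numerical second derivative on the grid x
    return np.gradient(np.gradient(y, x), x)

diff_curv = d2(m1) - d2(m0)    # pointwise difference in curvatures

print(float(np.max(np.abs(diff_curv))))   # ~0: first-order SATE bias cancels
print(float(np.max(np.abs(d2(m0)))))      # yet each curve itself is curved
```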
The essential problem - small perturbations in the underlying regression function leading to large changes in the optimal bandwidth - can be formalized by the connection to the literature on ill-posedness in optimization theory. If one considers the underlying true regression function as the exogenously given parameter of the bandwidth problem of minimizing MSE, then, formally, that problem is in a certain sense (defined below) ill-posed. Ill-posed problems have received a lot of attention recently in econometrics in the

context of deconvolution methods for measurement error and nonparametric IV estimation. A more fundamental example is kernel density estimation. All of these involve linear integral equations of a certain type with a compact operator. Naive solutions to the integral equations which involve inversion are problematic, as the inverse of a compact operator is unbounded, leading to the solution not being continuous with respect to perturbations of the definitional problem. However, ill-posedness itself is a concept stemming from optimization theory, and while the above form of ill-posedness has a few features in common with the kind of ill-posedness explored in this paper (both lend themselves to solution by similar approaches), the analogy cannot perhaps be pursued much further. We are not dealing here with an integral equation but with an optimization problem with function-valued parameters. Bandwidth choice for functionals is ill-posed in an extended or variational sense. This type of ill-posedness, as far as I know, has not been studied in econometrics before. The standard reference here is Zolezzi (1995, 1996).

Define MSE = R(h, m) = E(θ̂_h − θ)². Let h*(m) denote the solution (assuming it exists) to the problem min_h R(h, m), given the unknown regression function m ∈ M, where M is the class of candidate regression functions. This problem has the issue that the map m → h*(m) is not continuous; extending Zolezzi (1995, 1996) to make it appropriate to the problem at hand, this implies that the problem is ill-posed in an extended sense. (Proofs of the ill-posedness and precise definitions are given in the appendix.)

The bandwidth problem has not been looked at in a variational sense (i.e. optimal bandwidth as a functional of the true regression) before, partly because most attention has been focused on the MISE criterion, where the problem occurs in the neighborhood of linearity. This case is typically ruled out a priori, as I mentioned earlier.
This strategy cannot be followed here because the problem with the MSE occurs, for instance, in the neighborhood of all regression functions with 0 integrated curvature over the support - an infinite and unknown set. For instance, the appendix shows that the class of functions for which the problem with minimizing MSE_WA occurs contains all odd functions.
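The discontinuity of the map m → h*(m) can be illustrated numerically: perturbing the odd sine regression by ε times the cosine makes the bias constant of order ε², so the AMSE-minimizing bandwidth, proportional to C_2^{−1/5}, diverges as ε → 0. A sketch in which the constants and the uniform-design normalization are my own simplifications:

```python
import numpy as np

def trapz(y, x):
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x)) / 2.0)

x = np.linspace(-10.0, 10.0, 100001)
n, C1 = 10_000, 1.0          # hypothetical sample size and variance constant

def h_star(eps):
    # m_eps(x) = 8*sin(x) + eps*8*cos(x); its second derivative below.
    m2 = -8.0 * np.sin(x) - eps * 8.0 * np.cos(x)
    # Squared integrated curvature under uniform design on [-10, 10]:
    C2 = (trapz(m2, x) / 20.0) ** 2
    return (C1 / (4.0 * C2)) ** 0.2 * n ** (-0.4)

for eps in (1.0, 1e-2, 1e-4):
    print(eps, h_star(eps))   # the bandwidth grows without bound as eps -> 0
```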

Before discussing potential solutions to the problem, I will review again its implications. Firstly, small perturbations of the support of the function may lead to large changes in the optimally chosen bandwidth (this, for example, is not the case with the AMISE-optimal bandwidth). Secondly, the estimator will be very unstable across repeated samples for regression functions with low total curvature (and, in the case of average treatment effects, for treatment effects that are close to linear in the covariates). Since robustness is an important feature of a statistical estimator this is troubling, combined with the fact that the MSE can in fact vary quite a lot with different bandwidths. Lastly, another way of putting the problem is in terms of uniform behaviour over the unknown parameter space. The optimal bandwidth does not converge to 0 at the same rate throughout the parameter space. In the CATE case it jumps to O(n^{−2/9}). Usage of a bandwidth of this order in cases of small deviations from CATE will lead to the functional not being estimable at an n^{1/2}-rate. When we look at θ_WA, for certain (an infinite set of) regression functions the optimal bandwidth becomes infinite. However, the bandwidths that will be estimated from an actual dataset will be O(n^{−2/5}) and bounded with probability one (because C_2 will be estimated as nonzero with probability one with continuously distributed regressors).

6 Regularization

6.1 Regularization

With any solution to the stability issue, one ideally wants an asymptotic no-regret type property for our modified bandwidths, i.e., as the dataset grows in size, the difference between the risk under the modified bandwidth procedure and the minimal AMSE should approach 0. This is the type of property that, for instance, Li (1987) demands of cross-validation. We shall call this the Li criterion and will be more explicit in the result given below.
As mentioned in the introduction, I approach the problem in two ways, both leading to the same solution.

6.1.1 Penalty term approach

One basic solution to ill-posed problems is to regularize them by adding to the criterion one is minimizing a nonzero convex penalty term that decreases in relative size as the dataset becomes larger. This general idea is sometimes referred to as Tikhonov regularization in optimization theory. For this problem, note that we want the penalty term to 1) penalize very large bandwidths, and 2) penalize the ignoring of local features of m; i.e., when the regression curve is very bumpy, we want to guard ourselves against very large bandwidths, because in this case perturbations of the support are likely to have a significant effect on the chosen bandwidth. We also in general want a convex penalty term, as this ensures minimization does not become ill-defined.

Now note that though the bandwidth minimizing the mean squared error of the functional estimate is badly behaved, the MISE-minimizing bandwidth is fairly well-behaved except in the linear case. (This can also be seen in the asymptotic expressions for the risks: while C_2 is likely to be close to 0 for a wide variety of cases, this is not likely for C_2'.) A penalty term that makes a lot of sense here, then, is a scaled version of the well-behaved bias term in the expansion of the MISE, where the scale factor decreases with increasing sample size at an appropriate rate, allowing for the no-regret criterion to be satisfied asymptotically. Note that this can be interpreted as penalizing the AMSE more when the regression function has more local features (C_2' large). Also, since the MISE bias is increasing and convex in the chosen bandwidth, the other desired properties are satisfied.

6.1.2 The weight function approach: idea

One can motivate this penalty term in a very different and more intuitive way by considering local perturbations of the regression function.
Using the basic theoretical framework of Bickel and Li (2006), I consider a sequence of regularized estimands, which result from perturbing the weight function in θ = ∫ ω(x) m(x) dx by adding a scaled integrable function φ(·) centered at a point x_0, and scaled such that the new weight function still integrates to 1. Then, letting the perturbation get smaller with larger sample size, such

that in the limit the new weight function is essentially like the original weight function plus the Dirac delta generalized function, one finds the bandwidth that minimizes the risk of the perturbed problem in the limit. However, one might have chosen the particular point x_0 in such a way that the perturbed problem is still ill-posed (e.g., if m''(x_0) = 0), so after calculating the MSE of the point-perturbed problem, I average over the possible points x_0 around which we can perturb the weight function. This I call the regularized problem. One can show that when C_2 ≠ 0, we get an MSE expansion for the regularized problem that is to the first order the same as the risk with the convex penalty term we have chosen. Thus the addition of the convex penalty term can be thought of as asking for a bandwidth that minimizes risk averaged over small uncertainties/perturbations in the weighting function. See the section below for a formalization of this notion.

6.1.3 The weight function approach: formalization

I now formalize this approach for θ_WA. The same can be done for θ_SATE. Consider the weight function ω(·) and a perturbation function φ(·) which fulfills the conditions ∫ |φ(x)| dx < ∞ and sup_x |φ(x)| < ∞. Define b, a positive constant close to 0, and α = 1 − n^{−γ}. Now perturb the weight function at x_0 in the following way:

    ω_{x_0}(x) = α ω(x) + (1 − α) φ_{x_0}(x),  where  φ_{x_0}(x) = φ((x − x_0)/b) / ∫ φ((t − x_0)/b) f(t) dt.

Note that the perturbed weight function still integrates to 1. Now, given θ = E_F(ω(X) m(X)) and θ̂ = E_{F_n}(ω(X) m̂(X)), define

    θ(x_0) = α θ + (1 − α) [∫ φ((x − x_0)/b) m(x) f(x) dx / ∫ φ((t − x_0)/b) f(t) dt] = θ + O(n^{−γ}),

    θ̂(x_0) = E_{F_n}(ω_{x_0}(X) m̂(X)).

Now we will consider the MSE of the perturbed problem averaged over all the possible points x_0 of perturbation, MSE_b, and then finally let the perturbation go to 0, i.e.

consider MSE_reg = lim_{b→0} MSE_b:

    MSE_b = E_{x_0} E[θ̂(x_0) − θ]²
          = E_{x_0} E[θ̂(x_0) − θ(x_0)]² + E_{x_0}[θ(x_0) − θ]² + 2 E_{x_0}[θ(x_0) − θ][E θ̂(x_0) − θ(x_0)]
          = E_{x_0} E[θ̂(x_0) − θ(x_0)]² + O(n^{−2γ}) + O(h² n^{−γ})
          = E_{x_0}[E θ̂(x_0) − θ(x_0)]² + E_{x_0} V(θ̂(x_0)) + O(h² n^{−γ})
          ≡ Bias²_{b,reg} + Var_{b,reg} + O(h² n^{−γ})
          ≡ MSE_{b,reg} + O(h² n^{−γ}).

Note that the additional term of order O(n^{−2γ}) has been ignored, as we will need n^{−γ} = o(h²) later on to ensure that the leading term in the asymptotic expansion of the bias part of MSE_reg, namely Bias_b ≡ E_{x_0}[E θ̂(x_0) − θ(x_0)], dominates the remainder in MSE_{b,reg}. Using h = O(n^{(δ−2)/5}), we will then have a corresponding lower bound on γ.¹⁰

Define B_b = ∫ φ((x − x_0)/b) f(x) dx. Now consider E θ̂(x_0):

    E θ̂(x_0) = E_{F_n}(ω_{x_0}(X) m̂(X))
             = α E θ̂ + (1 − α) E_{F_n}(φ_{x_0}(X) m̂(X))
             = α E θ̂ + (1 − α) ∫∫ φ_{x_0}(x) K_h(x − u) m(u) f(u) du dx
             = α (θ + C_2^{1/2} h² + o(h²)) + ((1 − α)/B_b) ∫∫ φ((x − x_0)/b) K_h(x − u) m(u) f(u) du dx
             = α (θ + C_2^{1/2} h²) + ((1 − α)/B_b) ∫ φ((x − x_0)/b) m(x) f(x) dx
               + ((1 − α)/B_b) h² (T_2(K)/2) ∫ φ((x − x_0)/b) (mf)''(x) dx + o(h²).

10 This implies that δ, the regularization parameter as defined in the main body of the paper to follow, will have to fulfill δ > 2/7. Thus, strictly speaking, we can view this alternative approach as being equivalent to the approach with the convex penalty term if δ ∈ (2/7, 1/3).

Therefore:

    E θ̂(x_0) − θ(x_0) = h² [ α C_2^{1/2} + (1 − α) (T_2(K)/2) ∫ φ((x − x_0)/b) (mf)''(x) dx / ∫ φ((t − x_0)/b) f(t) dt ] + o(h²),

    (E θ̂(x_0) − θ(x_0))² = h⁴ [ α² C_2 + α (1 − α) T_2(K) C_2^{1/2} ∫ φ((x − x_0)/b) (mf)''(x) dx / ∫ φ((t − x_0)/b) f(t) dt ]
                           + h⁴ (T_2(K)²/4) (1 − α)² [ ∫ φ((x − x_0)/b) (mf)''(x) dx / ∫ φ((t − x_0)/b) f(t) dt ]² + o(h⁴).

This implies, since Bias²_{b,reg} = E_{x_0}(E θ̂(x_0) − θ(x_0))², that:

    Bias²_{b,reg} = h⁴ [ α² C_2 + n^{−γ} α T_2(K) C_2^{1/2} E_{x_0}( ∫ φ((x − x_0)/b) (mf)''(x) dx / ∫ φ((t − x_0)/b) f(t) dt ) ]
                    + h⁴ n^{−2γ} (T_2(K)²/4) E_{x_0}[ ( ∫ φ((x − x_0)/b) (mf)''(x) dx / ∫ φ((t − x_0)/b) f(t) dt )² ].

Now we consider Bias²_reg = lim_{b→0} Bias²_{b,reg}, i.e. we let the perturbations go to 0. We assume that inf f(x) = c > 0 and sup |(mf)''(x)| = M < ∞. We then get, using the dominated convergence theorem,

    Bias²_reg = h⁴ [ α² C_2 + n^{−γ} T_2(K) α C_2^{1/2} ∫ (mf)''(x) dx + n^{−2γ} (T_2(K)²/4) ∫ ((mf)''(x))² / f(x) dx ] + o(h⁴)
              = h⁴ [ α² C_2 + n^{−γ} ( T_2(K) α C_2^{1/2} ∫ (mf)''(x) dx ) + n^{−2γ} C_2' ] + o(h⁴),

where C_2' = (T_2(K)²/4) ∫ ((mf)''(x))² / f(x) dx. For C_2 ≠ 0, therefore:

    Bias²_reg = C_2 h⁴ + n^{−2γ} C_2' h⁴ + O(n^{−γ} h⁴ C_2^{1/2}) + o(h⁴).

Note that this is, to the leading order, exactly the bias term in the regularized AMSE in the main body of the paper, with the n^{−2γ} term corresponding to the convex penalty term we chose. It is similarly easily shown that Var_reg = lim_{b→0} Var_{b,reg} has to the leading order the same expression as in the regularized AMSE, thus substantiating the claim that the bandwidth that minimizes MSE_reg is to the leading order the same as the bandwidth that minimizes the regularized AMSE in the paper.¹¹

11 Alternatively, we could assume that (mf)'' has one continuous derivative and take a Taylor expansion assuming b small. But this is theoretically unappealing, as we have been making the minimal assumption throughout that mf(·) has two continuous derivatives.

6.2 Properties of regularization

The key theoretical issue is to make sure, when the regularization term is added, that it is large enough to effect regularization, and small enough so that the new regularized risk satisfies the Li criterion and the semiparametric rate still obtains. We also have to be careful about higher-order terms (see appendix). This tradeoff yields bounds on the rate constant. Specifically, we consider a penalty term of the form λ_WA = n^{−δ} C_2' h⁴. Thus the modified risk is:

    MSE_reg,WA = C_0/n + C_1/(n² h) + (T_2(K)/2)² [ ( ∫ (mf)''(x) ω(x) dx )² + n^{−δ} ∫ ((mf)''(x))² / f(x) dx ] h⁴ + o_p(1/(n² h) + h⁴)
               = C_0/n + C_1/(n² h) + (C_2 + n^{−δ} C_2') h⁴ + o_p(1/(n² h) + h⁴).

Using this modified risk, we have a formula for the regularized bandwidth:

    h^{opt}_{reg,WA} = ( C_1 / (4 (C_2 + n^{−δ} C_2')) )^{1/5} n^{−2/5}.

Following the same logic, we can also give a regularized bandwidth for the SATE:

    h^{opt}_{reg,SATE} = ( K_1 / (4 (K_2 + n^{−δ} ν ∫ ½([m_1'']² + [m_0'']²) f dx)) )^{1/5} n^{−2/5} ≡ C^{opt}_{reg} n^{−2/5}.

Note that if one wants to insure oneself against instability in the near-linear case as well, one can add an additional regularization term. This term will have to, as before, decline at the appropriate range of rates, and will have to scale sensibly with rescalings of the data. One potential term that works is to add, to the MISE regularization term above, the variance of the estimated coefficient on the linear term in a global linear regression fitted across the entire support. If the linear term is precisely estimated, the regularization penalty will be lower. The term will also be sensitive to rescalings of the data. However, this issue clearly needs more careful investigation, and the above should only be taken as a crude and tentative suggestion.

We can now state the main results for both regularized bandwidths above:
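The regularized formula can be sketched directly; the constants below are hypothetical, and δ = 0.3 is just an illustrative value inside the admissible range discussed in the paper:

```python
def h_opt(C1, C2, n):
    # Unregularized first-order optimal bandwidth (Lemma 5.1).
    return (C1 / (4.0 * C2)) ** 0.2 * n ** (-0.4)

def h_reg(C1, C2, C2p, n, delta=0.3):
    # Regularized bandwidth: C2 is replaced by C2 + n**(-delta) * C2p,
    # with C2p the well-behaved MISE bias constant.
    return (C1 / (4.0 * (C2 + n ** (-delta) * C2p))) ** 0.2 * n ** (-0.4)

C1, C2, C2p = 1.0, 0.5, 0.8   # hypothetical constants

# Bounded even in the degenerate case C2 = 0, where the unregularized
# bandwidth would be infinite:
print(h_reg(C1, 0.0, C2p, 10_000))

# No-regret flavour: when C2 > 0 the penalty washes out as n grows, so the
# ratio to the unregularized optimum tends to 1.
for n in (10**3, 10**5, 10**7):
    print(n, h_reg(C1, C2, C2p, n) / h_opt(C1, C2, n))
```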


More information

Ultra High Dimensional Variable Selection with Endogenous Variables

Ultra High Dimensional Variable Selection with Endogenous Variables 1 / 39 Ultra High Dimensional Variable Selection with Endogenous Variables Yuan Liao Princeton University Joint work with Jianqing Fan Job Market Talk January, 2012 2 / 39 Outline 1 Examples of Ultra High

More information

Regression Discontinuity

Regression Discontinuity Regression Discontinuity Christopher Taber Department of Economics University of Wisconsin-Madison October 16, 2018 I will describe the basic ideas of RD, but ignore many of the details Good references

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

Regression #3: Properties of OLS Estimator

Regression #3: Properties of OLS Estimator Regression #3: Properties of OLS Estimator Econ 671 Purdue University Justin L. Tobias (Purdue) Regression #3 1 / 20 Introduction In this lecture, we establish some desirable properties associated with

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

Why high-order polynomials should not be used in regression discontinuity designs

Why high-order polynomials should not be used in regression discontinuity designs Why high-order polynomials should not be used in regression discontinuity designs Andrew Gelman Guido Imbens 6 Jul 217 Abstract It is common in regression discontinuity analysis to control for third, fourth,

More information

Regression Discontinuity Design

Regression Discontinuity Design Chapter 11 Regression Discontinuity Design 11.1 Introduction The idea in Regression Discontinuity Design (RDD) is to estimate a treatment effect where the treatment is determined by whether as observed

More information

Using Matching, Instrumental Variables and Control Functions to Estimate Economic Choice Models

Using Matching, Instrumental Variables and Control Functions to Estimate Economic Choice Models Using Matching, Instrumental Variables and Control Functions to Estimate Economic Choice Models James J. Heckman and Salvador Navarro The University of Chicago Review of Economics and Statistics 86(1)

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit

More information

The Economics of European Regions: Theory, Empirics, and Policy

The Economics of European Regions: Theory, Empirics, and Policy The Economics of European Regions: Theory, Empirics, and Policy Dipartimento di Economia e Management Davide Fiaschi Angela Parenti 1 1 davide.fiaschi@unipi.it, and aparenti@ec.unipi.it. Fiaschi-Parenti

More information

POLI 8501 Introduction to Maximum Likelihood Estimation

POLI 8501 Introduction to Maximum Likelihood Estimation POLI 8501 Introduction to Maximum Likelihood Estimation Maximum Likelihood Intuition Consider a model that looks like this: Y i N(µ, σ 2 ) So: E(Y ) = µ V ar(y ) = σ 2 Suppose you have some data on Y,

More information

Time Series and Forecasting Lecture 4 NonLinear Time Series

Time Series and Forecasting Lecture 4 NonLinear Time Series Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations

More information

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN

STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN Massimo Guidolin Massimo.Guidolin@unibocconi.it Dept. of Finance STATISTICS/ECONOMETRICS PREP COURSE PROF. MASSIMO GUIDOLIN SECOND PART, LECTURE 2: MODES OF CONVERGENCE AND POINT ESTIMATION Lecture 2:

More information

Notes on Random Variables, Expectations, Probability Densities, and Martingales

Notes on Random Variables, Expectations, Probability Densities, and Martingales Eco 315.2 Spring 2006 C.Sims Notes on Random Variables, Expectations, Probability Densities, and Martingales Includes Exercise Due Tuesday, April 4. For many or most of you, parts of these notes will be

More information

Is there an optimal weighting for linear inverse problems?

Is there an optimal weighting for linear inverse problems? Is there an optimal weighting for linear inverse problems? Jean-Pierre FLORENS Toulouse School of Economics Senay SOKULLU University of Bristol October 9, 205 Abstract This paper considers linear equations

More information

On IV estimation of the dynamic binary panel data model with fixed effects

On IV estimation of the dynamic binary panel data model with fixed effects On IV estimation of the dynamic binary panel data model with fixed effects Andrew Adrian Yu Pua March 30, 2015 Abstract A big part of applied research still uses IV to estimate a dynamic linear probability

More information

Chapter 2: simple regression model

Chapter 2: simple regression model Chapter 2: simple regression model Goal: understand how to estimate and more importantly interpret the simple regression Reading: chapter 2 of the textbook Advice: this chapter is foundation of econometrics.

More information

13 Endogeneity and Nonparametric IV

13 Endogeneity and Nonparametric IV 13 Endogeneity and Nonparametric IV 13.1 Nonparametric Endogeneity A nonparametric IV equation is Y i = g (X i ) + e i (1) E (e i j i ) = 0 In this model, some elements of X i are potentially endogenous,

More information

Nonparametric Regression Härdle, Müller, Sperlich, Werwarz, 1995, Nonparametric and Semiparametric Models, An Introduction

Nonparametric Regression Härdle, Müller, Sperlich, Werwarz, 1995, Nonparametric and Semiparametric Models, An Introduction Härdle, Müller, Sperlich, Werwarz, 1995, Nonparametric and Semiparametric Models, An Introduction Tine Buch-Kromann Univariate Kernel Regression The relationship between two variables, X and Y where m(

More information

Preface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation

Preface. 1 Nonparametric Density Estimation and Testing. 1.1 Introduction. 1.2 Univariate Density Estimation Preface Nonparametric econometrics has become one of the most important sub-fields in modern econometrics. The primary goal of this lecture note is to introduce various nonparametric and semiparametric

More information

ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS

ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS Olivier Scaillet a * This draft: July 2016. Abstract This note shows that adding monotonicity or convexity

More information

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity

Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Nonparametric Identi cation and Estimation of Truncated Regression Models with Heteroskedasticity Songnian Chen a, Xun Lu a, Xianbo Zhou b and Yahong Zhou c a Department of Economics, Hong Kong University

More information

Supplemental Appendix to "Alternative Assumptions to Identify LATE in Fuzzy Regression Discontinuity Designs"

Supplemental Appendix to Alternative Assumptions to Identify LATE in Fuzzy Regression Discontinuity Designs Supplemental Appendix to "Alternative Assumptions to Identify LATE in Fuzzy Regression Discontinuity Designs" Yingying Dong University of California Irvine February 2018 Abstract This document provides

More information

Nonparametric Modal Regression

Nonparametric Modal Regression Nonparametric Modal Regression Summary In this article, we propose a new nonparametric modal regression model, which aims to estimate the mode of the conditional density of Y given predictors X. The nonparametric

More information

The risk of machine learning

The risk of machine learning / 33 The risk of machine learning Alberto Abadie Maximilian Kasy July 27, 27 2 / 33 Two key features of machine learning procedures Regularization / shrinkage: Improve prediction or estimation performance

More information

ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd

ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation Petra E. Todd Fall, 2014 2 Contents 1 Review of Stochastic Order Symbols 1 2 Nonparametric Density Estimation 3 2.1 Histogram

More information

Robustness to Parametric Assumptions in Missing Data Models

Robustness to Parametric Assumptions in Missing Data Models Robustness to Parametric Assumptions in Missing Data Models Bryan Graham NYU Keisuke Hirano University of Arizona April 2011 Motivation Motivation We consider the classic missing data problem. In practice

More information

IDENTIFICATION OF MARGINAL EFFECTS IN NONSEPARABLE MODELS WITHOUT MONOTONICITY

IDENTIFICATION OF MARGINAL EFFECTS IN NONSEPARABLE MODELS WITHOUT MONOTONICITY Econometrica, Vol. 75, No. 5 (September, 2007), 1513 1518 IDENTIFICATION OF MARGINAL EFFECTS IN NONSEPARABLE MODELS WITHOUT MONOTONICITY BY STEFAN HODERLEIN AND ENNO MAMMEN 1 Nonseparable models do not

More information

Transparent Structural Estimation. Matthew Gentzkow Fisher-Schultz Lecture (from work w/ Isaiah Andrews & Jesse M. Shapiro)

Transparent Structural Estimation. Matthew Gentzkow Fisher-Schultz Lecture (from work w/ Isaiah Andrews & Jesse M. Shapiro) Transparent Structural Estimation Matthew Gentzkow Fisher-Schultz Lecture (from work w/ Isaiah Andrews & Jesse M. Shapiro) 1 A hallmark of contemporary applied microeconomics is a conceptual framework

More information

Supervised Learning: Non-parametric Estimation

Supervised Learning: Non-parametric Estimation Supervised Learning: Non-parametric Estimation Edmondo Trentin March 18, 2018 Non-parametric Estimates No assumptions are made on the form of the pdfs 1. There are 3 major instances of non-parametric estimates:

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

4 Nonparametric Regression

4 Nonparametric Regression 4 Nonparametric Regression 4.1 Univariate Kernel Regression An important question in many fields of science is the relation between two variables, say X and Y. Regression analysis is concerned with the

More information

Econ 673: Microeconometrics Chapter 12: Estimating Treatment Effects. The Problem

Econ 673: Microeconometrics Chapter 12: Estimating Treatment Effects. The Problem Econ 673: Microeconometrics Chapter 12: Estimating Treatment Effects The Problem Analysts are frequently interested in measuring the impact of a treatment on individual behavior; e.g., the impact of job

More information

Consistency and Asymptotic Normality for Equilibrium Models with Partially Observed Outcome Variables

Consistency and Asymptotic Normality for Equilibrium Models with Partially Observed Outcome Variables Consistency and Asymptotic Normality for Equilibrium Models with Partially Observed Outcome Variables Nathan H. Miller Georgetown University Matthew Osborne University of Toronto November 25, 2013 Abstract

More information

3.3 Estimator quality, confidence sets and bootstrapping

3.3 Estimator quality, confidence sets and bootstrapping Estimator quality, confidence sets and bootstrapping 109 3.3 Estimator quality, confidence sets and bootstrapping A comparison of two estimators is always a matter of comparing their respective distributions.

More information

Independent and conditionally independent counterfactual distributions

Independent and conditionally independent counterfactual distributions Independent and conditionally independent counterfactual distributions Marcin Wolski European Investment Bank M.Wolski@eib.org Society for Nonlinear Dynamics and Econometrics Tokyo March 19, 2018 Views

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Classical regularity conditions

Classical regularity conditions Chapter 3 Classical regularity conditions Preliminary draft. Please do not distribute. The results from classical asymptotic theory typically require assumptions of pointwise differentiability of a criterion

More information

Local Polynomial Regression

Local Polynomial Regression VI Local Polynomial Regression (1) Global polynomial regression We observe random pairs (X 1, Y 1 ),, (X n, Y n ) where (X 1, Y 1 ),, (X n, Y n ) iid (X, Y ). We want to estimate m(x) = E(Y X = x) based

More information

A Novel Nonparametric Density Estimator

A Novel Nonparametric Density Estimator A Novel Nonparametric Density Estimator Z. I. Botev The University of Queensland Australia Abstract We present a novel nonparametric density estimator and a new data-driven bandwidth selection method with

More information

Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data. Jeff Dominitz RAND. and

Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data. Jeff Dominitz RAND. and Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data Jeff Dominitz RAND and Charles F. Manski Department of Economics and Institute for Policy Research, Northwestern

More information

Nonparametric Regression. Badr Missaoui

Nonparametric Regression. Badr Missaoui Badr Missaoui Outline Kernel and local polynomial regression. Penalized regression. We are given n pairs of observations (X 1, Y 1 ),...,(X n, Y n ) where Y i = r(x i ) + ε i, i = 1,..., n and r(x) = E(Y

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

DESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA

DESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA Statistica Sinica 18(2008), 515-534 DESIGN-ADAPTIVE MINIMAX LOCAL LINEAR REGRESSION FOR LONGITUDINAL/CLUSTERED DATA Kani Chen 1, Jianqing Fan 2 and Zhezhen Jin 3 1 Hong Kong University of Science and Technology,

More information

Unconditional Quantile Regression with Endogenous Regressors

Unconditional Quantile Regression with Endogenous Regressors Unconditional Quantile Regression with Endogenous Regressors Pallab Kumar Ghosh Department of Economics Syracuse University. Email: paghosh@syr.edu Abstract This paper proposes an extension of the Fortin,

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Comments on: Panel Data Analysis Advantages and Challenges. Manuel Arellano CEMFI, Madrid November 2006

Comments on: Panel Data Analysis Advantages and Challenges. Manuel Arellano CEMFI, Madrid November 2006 Comments on: Panel Data Analysis Advantages and Challenges Manuel Arellano CEMFI, Madrid November 2006 This paper provides an impressive, yet compact and easily accessible review of the econometric literature

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Nonconvex penalties: Signal-to-noise ratio and algorithms

Nonconvex penalties: Signal-to-noise ratio and algorithms Nonconvex penalties: Signal-to-noise ratio and algorithms Patrick Breheny March 21 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/22 Introduction In today s lecture, we will return to nonconvex

More information

NONPARAMETRIC ESTIMATION OF AVERAGE TREATMENT EFFECTS UNDER EXOGENEITY: A REVIEW*

NONPARAMETRIC ESTIMATION OF AVERAGE TREATMENT EFFECTS UNDER EXOGENEITY: A REVIEW* OPARAMETRIC ESTIMATIO OF AVERAGE TREATMET EFFECTS UDER EXOGEEITY: A REVIEW* Guido W. Imbens Abstract Recently there has been a surge in econometric work focusing on estimating average treatment effects

More information

Optimal Bandwidth Choice for the Regression Discontinuity Estimator

Optimal Bandwidth Choice for the Regression Discontinuity Estimator Optimal Bandwidth Choice for the Regression Discontinuity Estimator Guido Imbens and Karthik Kalyanaraman First Draft: June 8 This Draft: February 9 Abstract We investigate the problem of optimal choice

More information