Program Evaluation with High-Dimensional Data


Program Evaluation with High-Dimensional Data

Alexandre Belloni (Duke), Victor Chernozhukov (MIT), Iván Fernández-Val (BU), Christian Hansen (Booth)

ESWC 2015, August 17, 2015

Introduction

The goal is to perform inference on summaries of the heterogeneous effect of an endogenous treatment, in the presence of an instrument, with many controls.

- D ∈ {0, 1} is an endogenous binary treatment; Y_u is an outcome, possibly indexed by u ∈ U
- Z ∈ {0, 1} is a binary instrument
- f(X) is a dictionary of controls, where we allow p = dim f(X) ≫ n, so the relevant terms have to be selected in a data-driven fashion
- We condition on covariates for identification or for efficiency reasons

Example: Y_u = wealth, or Y_u = 1(wealth ≤ u); D = 401(k) participation; Z = offer of a 401(k) by an employer, believed to be as good as randomly assigned conditional on covariates X. The dictionary f(X) is generated by taking transformations and interactions of X = age, income, family size, education, etc.

Local Effects ("Local" = compliers, people whose treatment is affected by Z): Local Average Treatment Effect (LATE), Local Quantile Treatment Effect (LQTE), and local effects on the treated (treated compliers). In the absence of endogeneity, Z = D, and we drop "Local" above.

Contribution: we provide honest confidence bands for all of the above parameters, based on post-selection-robust procedures, where Lasso is used to select the terms of the dictionary that explain the regression functions and the propensity scores. We also provide simultaneous bands for curves, i.e., function-valued parameters.

Example of Post-Selection-Robust Inference in Action

[Figure: Impact of 401(k) participation on quantiles of total wealth. Left panel: LQTE; right panel: LQTT. Each panel plots estimates in 1991 dollars (roughly −1,000 to 6,000) against quantiles 0.2–0.8, comparing "No Selection" and "Selection".]

Key Structural Parameters

Assume the standard LATE assumptions of Imbens and Angrist. The average potential outcome in treatment state d for the compliers is the local average structural function (LASF):

θ_{Y_u}(d) = E_P{ E_P[1_d(D)Y_u | Z = 1, X] − E_P[1_d(D)Y_u | Z = 0, X] } / E_P{ E_P[1_d(D) | Z = 1, X] − E_P[1_d(D) | Z = 0, X] },  d ∈ {0, 1},

where 1_d(D) := 1(D = d) is the indicator function. The local average treatment effect (LATE), with Y_u = Y, is θ_Y(1) − θ_Y(0).

Key Structural Parameters

Defining the outcome variable as Y_u = 1(Y ≤ u) gives the local distribution structural function (LDSF): u ↦ θ_{Y_u}(d). The local quantile structural function (LQSF) is obtained by inversion:

τ ↦ θ←_Y(τ, d) := inf{u ∈ ℝ : θ_{Y_u}(d) ≥ τ},

the τ-quantile of the potential outcome in treatment state d for the compliers. Taking differences gives the LQTEs:

τ ↦ θ←_Y(τ, 1) − θ←_Y(τ, 0),  τ ∈ Q ⊂ (0, 1).

"On the treated" versions can be defined analogously.
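On a finite grid, this inversion is a one-liner. Below is a minimal numpy sketch (the names are illustrative, and the monotonization step is an assumption of the sketch: a finite-sample estimate of u ↦ θ_{Y_u}(d) need not be monotone, so it is rearranged before inverting):

```python
import numpy as np

def lqsf(u_grid, theta_u, tau):
    """LQSF by inversion on a grid: inf{u : theta_{Y_u}(d) >= tau}.
    theta_u holds the LDSF u -> theta_{Y_u}(d) evaluated on u_grid."""
    theta_u = np.maximum.accumulate(theta_u)   # monotonize (rearrangement)
    i = np.searchsorted(theta_u, tau, side="left")
    return u_grid[min(i, len(u_grid) - 1)]
```

An LQTE at a given τ is then the difference `lqsf(grid, ldsf_treated, tau) - lqsf(grid, ldsf_control, tau)` of two such inversions.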

Link to Key Reduced-Form Parameters

All these structural θ-parameters are smooth transforms of reduced-form α-parameters; for example,

θ_{Y_u}(d) = [α_{1_d(D)Y_u}(1) − α_{1_d(D)Y_u}(0)] / [α_{1_d(D)}(1) − α_{1_d(D)}(0)],

where, for z ∈ {0, 1},

α_V(z) := E_P[g_V(z, X)],  g_V(z, x) := E_P[V | Z = z, X = x],

for V ∈ V_u := {1_0(D)Y_u, 1_1(D)Y_u, 1_0(D), 1_1(D)}, u ∈ U. Equivalently, by reweighting,

α_V(z) = E_P[ 1_z(Z)V / m_Z(z, X) ],  m_Z(z, x) := E_P[1_z(Z) | X = x].

Rephrasing the High-Dimensional Problem in Terms of Reduced Forms

θ's = smooth functional(α's),  α's = smooth functional(g's or m_Z).

Estimate via plug-in:

θ̂'s = smooth functional(α̂'s),  α̂'s = smooth functional(ĝ's or m̂_Z).

How do we estimate the g's or m_Z with p ≫ n so that inference on θ is honest, i.e., uniformly valid over a large class of models?

Naive Approach

Make an approximate sparsity assumption on the regression function:

g_V(z, X) = Σ_{j=1}^p f_j(X) β_{V,z,j},

where the coefficients β_{V,z,j}, sorted in decreasing order of magnitude, decay at some speed. Then estimate g_V using modern high-dimensional methods, Lasso or post-Lasso, and form the natural plug-in estimator:

α̂_V(z) = Σ_{i=1}^n ĝ_V(z, X_i) / n.

Problem: despite the averaging, α̂_V(z) is not √n-consistent, and √n(α̂_V(z) − α_V(z)) does not converge to N(0, Ω). Regularization (by selection or shrinkage) causes too much omitted-variable bias, which keeps the averaging estimator from being √n-consistent. The resulting lack of uniformity in inference w.r.t. the DGP means the procedure is not honest.

The Naive Approach Has Very Poor Inference Quality

Exogenous example (Z = D), θ = ATE:

Y = D + (Σ_{j=1}^p X_j β_j) c + ζ,  D = 1{(Σ_{j=1}^p X_j β_j) c + V > 0},
β_j = 1/j², n = 100, p = 200, ζ ~ N(0, 1), V ~ N(0, 1), X ~ N(0, Toeplitz).

Test H_0: θ = true value; nominal size: 5%.

[Figure: rejection probability rp(0.05) of the naive test over a grid of (R²_d, R²_y) values in [0, 0.8]²; rejection rates rise far above the nominal 5% level.]

Our Solution: Use Doubly Robust Scores + Lasso or Post-Lasso to Estimate Regressions and Propensity Scores

1. Assume approximate sparsity for g_V and m_Z, and estimate both via Lasso or post-Lasso to deal with p ≫ n. Under our assumptions we can estimate g_V and m_Z at rates o(n^{-1/4}).
2. Estimate the reduced-form α-parameters using doubly robust efficient scores to protect against the crude estimation of g_V and m_Z. The resulting estimators are √n-consistent and semi-parametrically efficient.
3. Estimate the structural θ-parameters via the plug-in rule and derive their asymptotics via the functional delta method.
4. Use the multiplier bootstrap (resampling the scores) to make inference on the reduced-form and structural parameters. Very fast.

Theorem (Validity of the Post-Selection-Robust Procedure). The main result is that this works uniformly over a large class of DGPs, thereby providing efficient estimators and honest confidence intervals for the LATE, LDTE, LQTE, and other effects.
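As a rough illustration of steps 1, 2, and 4 in the simplest exogenous case (Z = D, so the parameter is the ATE), here is a self-contained numpy sketch. It is not the paper's implementation: the feasible Lasso with data-driven penalty loadings is replaced by a plain proximal-gradient Lasso with a fixed penalty, the logistic Lasso for the propensity score by a clipped linear probability model, and all function names and tuning values are illustrative.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Lasso for (1/2n)||y - Xb||^2 + lam*||b||_1 via proximal gradient
    (ISTA) -- a stand-in for the paper's feasible Lasso with loadings."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the smooth part
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - X.T @ (X @ beta - y) / (n * L)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return beta

def dr_ate(y, d, X, lam=0.05, n_folds=2, seed=0):
    """Cross-fitted doubly robust (AIPW) ATE when Z = D. X should contain
    a constant column; the propensity score is a Lasso linear probability
    model clipped away from 0 and 1 (a simplification of the logistic
    Lasso on the slide)."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, n_folds)
    psi = np.zeros(n)
    for k, test in enumerate(folds):
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # outcome regressions g(1, x), g(0, x), fit on the training fold only
        b1 = lasso_ista(X[train][d[train] == 1], y[train][d[train] == 1], lam)
        b0 = lasso_ista(X[train][d[train] == 0], y[train][d[train] == 0], lam)
        # propensity score m(x) = P(D = 1 | X = x)
        bm = lasso_ista(X[train], d[train].astype(float), lam)
        g1, g0 = X[test] @ b1, X[test] @ b0
        m = np.clip(X[test] @ bm, 0.05, 0.95)
        # doubly robust (orthogonal) score evaluated on the held-out fold
        psi[test] = (g1 - g0
                     + d[test] * (y[test] - g1) / m
                     - (1 - d[test]) * (y[test] - g0) / (1 - m))
    return psi.mean(), psi.std(ddof=1) / np.sqrt(n)
```

The standard error comes straight from the sample standard deviation of the estimated scores, which is exactly the influence-function-based inference the slides describe.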

Inference Quality After Model Selection

[Figure: rejection probabilities rp(0.05) of 5%-level tests across (R²_d, R²_y). Left panel, not doubly robust (naive): rejection rates far exceed the nominal level. Right panel, doubly robust (proposed): rejection rates stay close to the nominal 5% level throughout.]

The Bigger Picture: Use Orthogonal Scores

Suppose identification is achieved via the moment condition

E_P ψ(W, α_0, h_0) = 0,

where α_0 is the parameter of interest and h_0 is a nuisance function. The score function ψ has the orthogonality property w.r.t. h if

∂_h E_P ψ(W, α_0, h) |_{h=h_0} = 0,

where ∂_h denotes a functional derivative operator w.r.t. h. Orthogonality reduces the sensitivity to h and allows the use of highly non-regular, crude plug-in estimators of h_0 (converging to h_0 at rates o(n^{-1/4})). Orthogonality is equivalent to double robustness in many cases.

Orthogonal or Doubly Robust Score for the α-Parameters

The orthogonal score function for α_V(z) is

ψ^α_{V,z}(W, α, g, m) := 1_z(Z)(V − g(z, X)) / m(z, X) + g(z, X) − α.

It combines the regression and propensity-score reweighting approaches: Robins and Rotnitzky (95) and Hahn (98). When evaluated at the true parameter values, ψ^α_{V,z}(W, α_V(z), g_V, m_Z) is the semi-parametrically efficient influence function for α_V(z). It provides orthogonality with respect to h = (g, m) at h_0 = (g_V, m_Z). In p ≫ n settings, ψ^α_{V,z} was used for ATE estimation by Belloni, Chernozhukov, and Hansen (13, ReStud) and Farrell (13).
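The double robustness of this score is easy to verify numerically. The sketch below builds a toy DGP (all names and functional forms are made up for the illustration) in which g_V and m_Z are known exactly, then averages ψ^α_{V,1} at the true α_V(1) while deliberately mis-specifying one nuisance function at a time; the mean stays near zero as long as at least one nuisance is correct.

```python
import numpy as np

# Toy DGP with both nuisance functions known in closed form.
rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
m_true = lambda xx: 1.0 / (1.0 + np.exp(-xx))   # m_Z(1, x) = P(Z=1 | X=x)
z = (rng.uniform(size=n) < m_true(x)).astype(float)
g_true = lambda zz, xx: zz + xx                 # g_V(z, x) = E[V | Z=z, X=x]
v = g_true(z, x) + rng.normal(size=n)
alpha = 1.0                                     # alpha_V(1) = E[g_V(1, X)] = 1 + E[X]

def mean_score(g_fun, m_fun):
    """Sample average of psi^alpha_{V,1}(W, alpha, g, m) at the true alpha."""
    g1 = g_fun(1.0, x)
    return np.mean(z * (v - g1) / m_fun(x) + g1 - alpha)

bad_m = lambda xx: np.full_like(xx, 0.5)        # wrong propensity score
bad_g = lambda zz, xx: np.zeros_like(xx)        # wrong regression function

print(mean_score(g_true, m_true))  # ~0: both nuisances correct
print(mean_score(g_true, bad_m))   # ~0: correct regression rescues wrong propensity
print(mean_score(bad_g, m_true))   # ~0: correct propensity rescues wrong regression
```

Mis-specifying both nuisances at once does shift the mean away from zero, which is the sense in which the protection is doubly, not triply, robust.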

More on Orthogonal Score Functions

Orthogonal scores have a long history in statistics and econometrics:

- In low-dimensional parametric settings, orthogonality was used by Neyman (56, 79) to deal with crudely estimated nuisance parameters.
- Newey (90, 94), Andrews (94), Linton (96), and van der Vaart (98) used orthogonality in semiparametric problems.
- In p ≫ n settings, Belloni, Chen, Chernozhukov, and Hansen (2012) first used orthogonality in the context of IV models.
- In the paper "Program Evaluation with High-Dimensional Data" (ArXiv, 2013) we construct honest confidence bands for generic smooth and non-smooth moment problems with orthogonal score functions, not just the program-evaluation settings.

Conclusion

1. Don't use naive inference based on Lasso estimation of the regression alone (or the propensity score alone). It fails to provide honest confidence sets.

2. Do use inference based on doubly robust scores, which combine nicely with Lasso-based estimation of both the regression and the propensity score.

References

- "Inference on treatment effects after selection amongst high-dimensional controls," ArXiv 2011; Review of Economic Studies, 2013. Partially linear regression + exogenous TE models.
- "Program Evaluation with High-Dimensional Data," ArXiv 2013; Econometrica R&R. Endogenous TE models + general orthogonal-score problems.

Appendix

The rest is a technical appendix.

Summary: Inference Quality After Model Selection

[Figure: rejection probabilities rp(0.05) across (R²_d, R²_y), repeating the earlier comparison. Left panel, not doubly robust (naive): severe size distortions. Right panel, doubly robust (proposed): rejection rates close to the nominal 5% level.]

Step 1: Lasso/Post-Lasso Estimators of g_V and m_Z

Result 1. Under approximate sparsity assumptions, namely when g_V and m_Z are well approximated by s ≪ n unknown terms amongst f(X), the Lasso and post-Lasso estimators of g_V and m_Z attain the near-oracle rate √(s log p / n) for each V ∈ V_u, uniformly in u ∈ U; they are also sparse, with dimension of stochastic order s uniformly in u ∈ U.

- Result 1 applies to a continuum of least-squares and logit Lasso/post-Lasso regressions.
- The choice of the penalty level in Lasso needs to account for simultaneous estimation over V ∈ V_u, u ∈ U.
- Results for a continuum were previously available only for the quantile-regression Lasso/post-Lasso (Belloni and Chernozhukov, AoS, 2011).
- Covers the ℓ1-penalized distribution regression process.

Approximate Sparsity of g_V and m_Z

Let f(X) = (f_j(X))_{j=1}^p be a vector of transformations of X, with p = p_n ≫ n. We use series approximations for g_V and m_Z:

g_V(z, x) = Λ_V(f(z, x)'β_V) + r_V,  f(z, x) := [z f(x)', (1 − z) f(x)']',
m_Z(1, x) = Λ_Z(f(x)'β_Z) + r_Z,  m_Z(0, x) = 1 − m_Z(1, x),

where r_V and r_Z are approximation errors and Λ_V and Λ_Z are known link functions. Assume approximate sparsity, namely that:

1. There exists s = s_n ≪ n such that, for all V ∈ {V_u : u ∈ U}, ‖β_V‖_0 + ‖β_Z‖_0 ≤ s, where ‖·‖_0 denotes the number of non-zero elements.
2. The approximation errors are smaller than the conjectured estimation error: {E_P[r_V²]}^{1/2} + {E_P[r_Z²]}^{1/2} ≲ √(s/n).

This extends series regression by letting the identities of the relevant terms be unknown.

Lasso/Post-Lasso Estimation of g_V and m_Z

Let (W_i)_{i=1}^n be a random sample of W and E_n denote the empirical expectation over this sample. The Lasso and post-Lasso estimators are given, respectively, by

1. β̂_V ∈ arg min_{β ∈ R^{2p}} E_n[M(V, f(Z, X)'β)] + (λ/n)‖Ψβ‖_1,
2. β̃_V ∈ arg min_{β ∈ R^{2p}} { E_n[M(V, f(Z, X)'β)] : β_j = 0, j ∉ supp[β̂_V] }.

- M(·) is the objective function of an M-estimator (OLS, logit, or probit).
- Ψ = diag(l̂_1, ..., l̂_p) is a matrix of data-dependent penalty loadings.
- The choice of λ needs to account for the simultaneous estimation of a continuum of Lasso regressions over V ∈ V_u, u ∈ U.
- (Post-Lasso) Refitting the selected model reduces the attenuation bias due to regularization.
- A similar procedure yields β̂_Z and β̃_Z.
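For the least-squares case, M(V, t) = (V − t)²/2, the two steps can be sketched in a few lines of numpy. This sketch simplifies the display above: it uses a scalar penalty in place of the data-dependent loadings Ψ and a basic proximal-gradient solver, so it illustrates the Lasso-then-refit idea rather than the paper's estimator.

```python
import numpy as np

def lasso(X, y, lam, n_iter=1000):
    """Step 1: Lasso for (1/2n)||y - Xb||^2 + lam*||b||_1 via ISTA."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        z = b - X.T @ (X @ b - y) / (n * L)
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return b

def post_lasso(X, y, lam):
    """Step 2: refit unpenalized least squares on supp(b_hat), removing the
    shrinkage (attenuation) bias the penalty induces on selected terms."""
    b_hat = lasso(X, y, lam)
    support = np.flatnonzero(b_hat)
    b_post = np.zeros_like(b_hat)
    if support.size:
        b_post[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return b_hat, b_post
```

On a sparse design the refit typically moves the selected coefficients back toward their unpenalized values, which is the attenuation-bias reduction the slide refers to.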

Choice of the Penalty Level λ

We need to control the selection errors uniformly over u ∈ U: with high probability,

λ/n ≥ c sup_{u ∈ U, V ∈ V_u} ‖Ψ_u⁻¹ E_n[∂_β M(V, f(Z, X)'β_V)]‖_∞.

This is a similar strategy to Bickel et al. (09) for a singleton U and to Belloni and Chernozhukov (11, AoS) for ℓ1-penalized QR processes. As a practical way to implement this choice, we propose

λ = c √n Φ⁻¹(1 − γ/(2 p n^{d_u})),

where γ = o(1), c > 1, and d_u = dim U (e.g., c = 1.1, γ = 0.1/log n). This corresponds to Belloni, Chernozhukov, and Hansen (14) when d_u = 0.
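The practical rule is directly computable; here is a small sketch using only the standard library (NormalDist supplies Φ⁻¹). Papers differ in where √n and the loadings enter the penalty normalization, so treat the scale as illustrative rather than as the one matching any particular solver.

```python
from math import log, sqrt
from statistics import NormalDist

def penalty_level(n, p, d_u, c=1.1, gamma=None):
    """lambda = c * sqrt(n) * Phi^{-1}(1 - gamma / (2 * p * n**d_u)),
    with the slide's suggested defaults c = 1.1 and gamma = 0.1 / log(n)."""
    if gamma is None:
        gamma = 0.1 / log(n)
    return c * sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * p * n ** d_u))
```

With d_u = 0 this reduces to the single-regression choice; a larger d_u inflates λ so that selection errors are controlled over the whole continuum of regressions.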

Step 2: A Continuum of Reduced-Form Estimators

Let ĝ_V and m̂_Z be the Lasso/post-Lasso estimators of g_V and m_Z. For V ∈ V_u and z ∈ Z := {0, 1},

α̂_V(z) = E_n[ 1_z(Z)(V − ĝ_V(z, X)) / m̂_Z(z, X) + ĝ_V(z, X) ],

where E_n denotes the empirical expectation. We apply this procedure to each variable V ∈ V_u and z ∈ Z to obtain the estimator

ρ̂_u := ({α̂_V(0), α̂_V(1)})_{V ∈ V_u}  of  ρ_u := ({α_V(0), α_V(1)})_{V ∈ V_u}.

We then stack these into the reduced-form empirical and estimand processes ρ̂ = (ρ̂_u)_{u ∈ U} and ρ = (ρ_u)_{u ∈ U}.

Uniform Asymptotic Gaussianity of the Reduced-Form Process

We show that √n(ρ̂ − ρ) is asymptotically Gaussian in ℓ∞(U)^{d_ρ}, with influence function

ψ^ρ_u(W) := ({ψ^α_{V,0}(W), ψ^α_{V,1}(W)})_{V ∈ V_u}.

Result 2. Under s² log²(p ∨ n) log² n / n → 0 and other regularity conditions,

√n(ρ̂ − ρ) ⇝ Z_P := (G_P ψ^ρ_u)_{u ∈ U} in ℓ∞(U)^{d_ρ}, uniformly in P ∈ P_n,

where G_P denotes the P-Brownian bridge and P_n is a rich set of data-generating processes P that includes cases where perfect model selection is theoretically impossible.

- Covers pointwise normality as a special case.
- The set of DGPs P_n is weakly increasing in n.
- The derivation needs to deal with non-Donsker function classes to accommodate the high-dimensional estimation of the nuisance functions.

Multiplier Bootstrap (Giné and Zinn, 84)

Let (ξ_i)_{i=1}^n be i.i.d. copies of ξ, distributed independently of the data (W_i)_{i=1}^n and with a distribution that does not depend on P. We also impose E[ξ] = 0, E[ξ²] = 1, E[exp(|ξ|)] < ∞. Examples of ξ include:

(a) ξ = E − 1, where E is a standard exponential random variable;
(b) ξ = N, where N is a standard normal random variable;
(c) ξ = N_1/√2 + (N_2² − 1)/2, with N_1 ⊥ N_2 (Mammen multipliers).

We define a bootstrap draw of ρ̂* = (ρ̂*_u)_{u ∈ U} via

√n(ρ̂*_u − ρ̂_u) = n^{-1/2} Σ_{i=1}^n ξ_i ψ̂^ρ_u(W_i),

where ψ̂^ρ_u is a plug-in estimator of the influence function.
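The computational point is visible in a short sketch: given an n × k array of estimated influence-function values ψ̂ (names illustrative), each bootstrap draw only re-randomizes the multipliers; no nuisance function or score is re-estimated.

```python
import numpy as np

def multiplier_draws(psi_hat, n_boot=1000, seed=0):
    """Bootstrap draws of sqrt(n)*(rho_star - rho_hat) = n^{-1/2} sum_i xi_i psi_i,
    using standard normal multipliers (choice (b) on the slide). psi_hat is an
    (n, k) array of estimated influence-function values, computed once."""
    rng = np.random.default_rng(seed)
    n = psi_hat.shape[0]
    xi = rng.normal(size=(n_boot, n))   # E[xi] = 0, E[xi^2] = 1
    return xi @ psi_hat / np.sqrt(n)    # shape (n_boot, k)
```

Quantiles of sup-norms of these draws are then what delivers the simultaneous confidence bands mentioned earlier.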

Uniform Consistency of the Multiplier Bootstrap

Result 3. We establish that the bootstrap law of √n(ρ̂* − ρ̂) is uniformly asymptotically valid, namely, in the metric space ℓ∞(U)^{d_ρ},

√n(ρ̂* − ρ̂) ⇝_B Z_P, uniformly in P ∈ P_n,

where ⇝_B denotes convergence of the bootstrap law in probability conditional on the data.

- Computationally very efficient, since it does not involve recomputing the influence functions ψ̂^ρ_u (or the nuisance functions).
- Previously used in econometrics in low-dimensional settings by B. Hansen (96) and Kline and Santos (12).

Step 3: Estimators of the Structural Parameters

All the structural parameters we consider are smooth transformations of the reduced-form parameters:

Δ := (Δ_q)_{q ∈ Q}, where Δ_q := φ(ρ)(q), q ∈ Q.

The structural parameters may themselves carry an index q ∈ Q that can differ from u; e.g., LQTEs are indexed by q = τ ∈ (0, 1). We define the estimators and their bootstrap versions via the plug-in principle:

Δ̂ := (Δ̂_q)_{q ∈ Q}, Δ̂_q := φ(ρ̂)(q);  Δ̂* := (Δ̂*_q)_{q ∈ Q}, Δ̂*_q := φ(ρ̂*)(q).

Uniform Asymptotic Gaussianity and Bootstrap Consistency for Estimators of Structural Parameters

Result 4. We establish that these estimators are asymptotically Gaussian:

√n(Δ̂ − Δ) ⇝ φ′_ρ(Z_P) in ℓ∞(Q), uniformly in P ∈ P_n,

where h ↦ φ′_ρ(h) = (φ′_{ρ,q}(h))_{q ∈ Q} is the uniform Hadamard derivative.

Result 5. The bootstrap consistently estimates the large-sample distribution of √n(Δ̂ − Δ), uniformly in P ∈ P_n:

√n(Δ̂* − Δ̂) ⇝_B φ′_ρ(Z_P) in ℓ∞(Q), uniformly in P ∈ P_n.

- Strengthens Hadamard differentiability and the functional delta method to handle uniformity in P.
- Result 5 complements Romano and Shaikh (11).
- Used to construct uniform confidence bands and to test functional hypotheses on Δ.