ECONOMETRICS II (ECO 2401) Victor Aguirregabiria. Spring 2018 TOPIC 4: INTRODUCTION TO THE EVALUATION OF TREATMENT EFFECTS


1. Introduction and Notation
2. Randomized Treatment
3. Conditional Independence
4. Difference-in-Differences (a variant of the CI assumption)
5. Randomized Eligibility: LATE
6. Regression Discontinuity Design
7. Roy's Model

1. INTRODUCTION. We are interested in estimating the causal effect of an explanatory variable D on an outcome variable Y. This setting is very general: the effect of a drug on cholesterol level; the effect of education on labor earnings; the effect of price on demand; the effect of a wage tax on employment; the effect of a competition policy on firms' profits; etc. We consider a stylized but very general model where D is binary, D ∈ {0, 1}, and Y can be continuous or discrete, e.g., D = University degree, Y = Earnings.

Following the language in this literature, we denote D as the treatment variable and Y as the outcome variable. - D = 1 indicates that the subject "receives treatment", "is in the treatment group", or "is in the experimental group"; - D = 0 indicates that the subject "does not receive treatment" or "is in the control group". Example: A retail chain is interested in estimating the effect on demand of a 20% discount in the price of its key product. The firm decides to implement this price discount in some of its stores (experimental group) and keep the regular price in other stores (control group).

Let Y_0 and Y_1 be latent variables that represent the outcome variable for an individual without and with treatment, respectively. We have one observation of the outcome variable Y per individual. Therefore, we observe: Y = Y_0 if D = 0 and Y = Y_1 if D = 1, i.e., Y = (1 − D) Y_0 + D Y_1. The Treatment Effect for an individual is: TE ≡ Y_1 − Y_0. Note: Even if we could observe the same individual with and without treatment, it would be at different moments [More on this below].

Subjects are heterogeneous in multiple dimensions, and Treatment Effects can be very heterogeneous across individuals. - The effect of a drug varies substantially across patients; - The effect of a university degree on earnings can be very different across individuals; - The effect of a price reduction on demand can be substantially different across stores of the same chain. Ideally, we would like to estimate the TE of each individual. However, this is not feasible because we observe an individual either with or without treatment, but not both.

Under some conditions / restrictions, we will be able to estimate some features of the distribution of the TEs in the population of interest. A commonly used parameter that measures the aggregate effect of a treatment is the Average Treatment Effect (ATE): ATE = E(TE) = E(Y_1 − Y_0). And the Conditional Average Treatment Effect: ATE(x) = E(Y_1 − Y_0 | X = x), where X is a vector of predetermined attributes of the subject. ATE(x) is the ATE for the subpopulation of individuals with X = x.

Regression-like representation of the model. Define μ_0 ≡ E(Y_0) and μ_1 ≡ E(Y_1) such that we can write: Y_0 = μ_0 + U_0 and Y_1 = μ_1 + U_1, where, by construction, E(U_0) = E(U_1) = 0. Note that, by definition, ATE = μ_1 − μ_0.

Regression-like representation of the model [2] Using these definitions, we have that: Y = (1 − D)(μ_0 + U_0) + D(μ_1 + U_1) = α + β D + e, where α = μ_0, β = μ_1 − μ_0 = ATE, and e = U_0 + (U_1 − U_0) D. We will show below the regression-like representation of the model that includes the X variables.

Estimation of ATE and Endogeneity Problem. Let {y_i, d_i, x_i : i = 1, 2, ..., N} be a random sample of N individuals, some with treatment (d_i = 1) and others without treatment (d_i = 0). The researcher is interested in using these data to estimate ATE and/or ATE(x). We now present two simple and intuitive estimators of the ATE: - the Difference-in-Means estimator; - the OLS estimator of Y on D. We show that they are equivalent and that, without further restrictions, they are inconsistent estimators of the ATE.

Estimation of ATE and Endogeneity Problem [2] Difference-in-Means estimator: ÂTE_DM = ȳ_{D=1} − ȳ_{D=0}, with ȳ_{D=1} = Σ_{i=1}^N y_i d_i / Σ_{i=1}^N d_i and ȳ_{D=0} = Σ_{i=1}^N y_i (1 − d_i) / Σ_{i=1}^N (1 − d_i). OLS estimator: ÂTE_OLS = β̂_OLS = Σ_{i=1}^N (y_i − ȳ)(d_i − d̄) / Σ_{i=1}^N (d_i − d̄)².

Equivalence of Difference-in-Means and OLS estimators of the ATE. ÂTE_OLS = Σ_{i=1}^N (y_i − ȳ)(d_i − d̄) / Σ_{i=1}^N (d_i − d̄)². Note: Σ_{i=1}^N (d_i − d̄)² = Σ_{i=1}^N d_i² − 2 d̄ Σ_{i=1}^N d_i + N d̄² = N d̄ − 2N d̄² + N d̄² = N d̄ (1 − d̄). And: Σ_{i=1}^N (d_i − d̄)(y_i − ȳ) = Σ_{i=1}^N d_i y_i − N d̄ ȳ.

Equivalence of Difference-in-Means and OLS estimators of the ATE [2] Therefore: ÂTE_OLS = (Σ_{i=1}^N d_i y_i − N d̄ ȳ) / (N d̄ (1 − d̄)) = (1 / (1 − d̄)) (Σ_{i=1}^N d_i y_i / Σ_{i=1}^N d_i − ȳ) = (1 / (1 − d̄)) (ȳ_{D=1} − ȳ). Note that: ȳ = N^{−1} Σ_{i=1}^N [d_i y_i + (1 − d_i) y_i] = d̄ ȳ_{D=1} + (1 − d̄) ȳ_{D=0}.

Equivalence of Difference-in-Means and OLS estimators of the ATE [3] Thus, ÂTE_OLS = (1 / (1 − d̄)) (ȳ_{D=1} − d̄ ȳ_{D=1} − (1 − d̄) ȳ_{D=0}) = ȳ_{D=1} − ȳ_{D=0}.
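The algebraic equivalence above can be checked numerically. The sketch below (simulated data with illustrative parameter values, not from the notes) computes both estimators and verifies they coincide:

```python
import numpy as np

# Hypothetical sample: binary treatment d, outcome y.
rng = np.random.default_rng(0)
n = 500
d = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * d + rng.normal(size=n)

# Difference-in-means estimator.
ate_dm = y[d == 1].mean() - y[d == 0].mean()

# OLS slope of y on d: sample cov(y, d) / sample var(d).
ate_ols = np.cov(y, d, ddof=0)[0, 1] / d.var()
```

The two numbers agree up to floating-point error, as the derivation shows.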

Inconsistency of DM / OLS Estimators. Is this estimator consistent? Not without further assumptions. It is clear that ÂTE_DM →_p E(Y | D = 1) − E(Y | D = 0), and if D is NOT independent of Y_0 and Y_1: E(Y | D = 1) − E(Y | D = 0) = E(Y_1 | D = 1) − E(Y_0 | D = 0) ≠ E(Y_1) − E(Y_0) = ATE. In economics or the social sciences, we expect the "choice of treatment" D to be correlated with the "effect of treatment" Y_1 − Y_0. Examples.

Inconsistency of DM / OLS Estimators [2] In the regression-like representation of the model: Y = α + β D + e, where e = U_0 + (U_1 − U_0) D, such that: E(D e) = E(D U_0 + D(U_1 − U_0)) = E(D U_1) ≠ 0.
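A simulated sketch of this inconsistency (all parameter values are illustrative, not from the notes): when individuals self-select on their own gain Y_1 − Y_0, the difference-in-means estimator misses the true ATE even in very large samples.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
u0 = rng.normal(size=n)
u1 = rng.normal(size=n)
y0 = 0.0 + u0
y1 = 1.0 + u1                     # true ATE = 1
d = (y1 - y0 > 0).astype(int)     # self-selection on the individual gain
y = (1 - d) * y0 + d * y1

true_ate = 1.0
ate_dm = y[d == 1].mean() - y[d == 0].mean()
```

Here ate_dm converges to E(Y_1 | D = 1) − E(Y_0 | D = 0), which is far from the ATE of 1.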

2. RANDOMIZED TREATMENT Suppose that the treatment dummy D is independent of the latent outcome variables Y_0 and Y_1: D ⊥ (Y_0, Y_1), where ⊥ represents statistical independence. Given that Y = (1 − D) Y_0 + D Y_1 and D ⊥ (Y_0, Y_1): E(Y | D = 0) = E(Y_0 | D = 0) = E(Y_0) and E(Y | D = 1) = E(Y_1 | D = 1) = E(Y_1), such that: ATE ≡ E(Y_1 − Y_0) = E(Y | D = 1) − E(Y | D = 0), and the ATE is identified from data on {Y, D}.

RANDOMIZED TREATMENT [2] We can construct root-N consistent estimators of E(Y | D = 1) and E(Y | D = 0) using: ȳ_{D=1} = Σ_{i=1}^N y_i d_i / Σ_{i=1}^N d_i and ȳ_{D=0} = Σ_{i=1}^N y_i (1 − d_i) / Σ_{i=1}^N (1 − d_i). Then, a root-N consistent estimator of the ATE is: ÂTE = ȳ_{D=1} − ȳ_{D=0}.
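As a sketch (simulated data, illustrative values): with randomized assignment the simple difference in means recovers the ATE even when treatment effects are heterogeneous across individuals.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
y0 = rng.normal(0.0, 1.0, size=n)
te = rng.normal(2.0, 1.0, size=n)     # heterogeneous treatment effects, ATE = 2
y1 = y0 + te
d = rng.integers(0, 2, size=n)        # randomized: independent of (y0, y1)
y = (1 - d) * y0 + d * y1

ate_hat = y[d == 1].mean() - y[d == 0].mean()
```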

ENDOGENOUS TREATMENT The main concern in this literature is the endogeneity of treatment: D is not independent of TE = Y_1 − Y_0. The assumption D ⊥ (Y_0, Y_1) is equivalent to assuming that treatment is perfectly randomized. This assumption is not plausible in most applications unless there is a randomized experiment and all individuals comply with their treatment assignment. This may be a realistic condition in some randomized experiments in medicine or the natural sciences, or even in lab experiments in experimental economics. However, it is quite unrealistic in the social sciences, even in randomized field experiments.

ENDOGENOUS TREATMENT [2] In general, treatment D is not independent of the potential outcomes Y_0 and Y_1. Individuals tend to self-select into treatment or no treatment according to their individual-specific benefits of treatment, i.e., according to Y_0 and Y_1. In randomized field experiments in the social sciences, we typically can randomize eligibility for treatment but not treatment itself.

                Treatment        No Treatment
Eligible        Compliers        Non-compliers
Not Eligible    Non-compliers    Compliers

In general: - Some subjects eligible for treatment choose not to take the treatment; - Some subjects not eligible decide to take an alternative but similar treatment.

Regression-like representation of the model [2] The OLS estimator of β in the linear regression Y = α + β D + e is: β̂_OLS = Σ_{i=1}^N (y_i − ȳ)(d_i − d̄) / Σ_{i=1}^N (d_i − d̄)².

Regression-like representation of the model [2] Is the OLS estimator of β (i.e., of the ATE) consistent? Consistency of the OLS requires E(D e) = 0. Let's see that this condition holds under randomized treatment. Randomized treatment implies D ⊥ (U_0, U_1) and therefore E(U_0 | D) = E(U_1 | D) = 0. E(D e) = E(D [U_0 + D(U_1 − U_0)]) = Pr(D = 1) E(U_1 | D = 1) = 0.

Regression-like representation of the model [3] Without a randomized experiment, the unobservable components of the potential outcomes, U_0 and U_1, can be correlated with the treatment dummy D, and this implies correlation between the error term e and the regressor D. The OLS estimator β̂_OLS = ȳ_{D=1} − ȳ_{D=0} will be inconsistent.

3. CONDITIONAL INDEPENDENCE A weaker version of the assumption of independence between treatment and potential outcomes is that this independence holds only conditional on a vector of observable individual characteristics (control variables) X: D ⊥ (Y_0, Y_1) | X. Given that Y = (1 − D) Y_0 + D Y_1 and D ⊥ (Y_0, Y_1) | X: E(Y | D = 0, X = x) = E(Y_0 | D = 0, X = x) = E(Y_0 | X = x) and E(Y | D = 1, X = x) = E(Y_1 | D = 1, X = x) = E(Y_1 | X = x), such that: ATE(x) ≡ E(Y_1 − Y_0 | X = x) = E(Y | D = 1, X = x) − E(Y | D = 0, X = x), and the conditional ATE(x) is identified. Then, we can also identify: ATE = E_X(ATE(X)).

CONDITIONAL INDEPENDENCE [2] Estimation: With conditional independence D ⊥ (Y_0, Y_1) | X but without unconditional independence D ⊥ (Y_0, Y_1), the estimator ÂTE = ȳ_{D=1} − ȳ_{D=0} of the ATE is inconsistent. To estimate the ATE consistently, we need to condition on X and first estimate the conditional ATE(x). If X is a vector of discrete random variables (and our sample is relatively large), we can estimate ATE(x) using frequency estimators of E(Y | D = 1, X = x) and E(Y | D = 0, X = x): ÂTE(x) = ȳ_{D=1}(x) − ȳ_{D=0}(x)

with ȳ_{D=1}(x) = Σ_{i=1}^N y_i d_i 1{x_i = x} / Σ_{i=1}^N d_i 1{x_i = x} and ȳ_{D=0}(x) = Σ_{i=1}^N y_i (1 − d_i) 1{x_i = x} / Σ_{i=1}^N (1 − d_i) 1{x_i = x}.
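The frequency estimator above amounts to a difference of cell means. A simulated sketch (the covariate values and effect sizes are illustrative): treatment probability depends on a discrete x, but conditional independence holds within each cell, so cell-by-cell differences recover ATE(x).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
x = rng.integers(0, 3, size=n)               # discrete covariate with 3 cells
p = np.array([0.2, 0.5, 0.8])[x]             # treatment probability depends on x
d = (rng.random(n) < p).astype(int)
y0 = x + rng.normal(size=n)
y1 = y0 + 1.0 + x                            # ATE(x) = 1 + x
y = (1 - d) * y0 + d * y1

def ate_x(y, d, x, value):
    """Difference of treated and control means within the cell x == value."""
    m = x == value
    return y[m & (d == 1)].mean() - y[m & (d == 0)].mean()

ate_hat = {v: ate_x(y, d, x, v) for v in (0, 1, 2)}
```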

CONDITIONAL INDEPENDENCE [3] If X contains continuous variables (or if our sample is not so large), we can estimate ATE(x) using kernel estimators of E(Y | D = 1, X = x) and E(Y | D = 0, X = x): ÂTE(x) = ȳ_{D=1}(x) − ȳ_{D=0}(x), with ȳ_{D=1}(x) = Σ_{i=1}^N y_i d_i K((x_i − x)/b_N) / Σ_{i=1}^N d_i K((x_i − x)/b_N) and ȳ_{D=0}(x) = Σ_{i=1}^N y_i (1 − d_i) K((x_i − x)/b_N) / Σ_{i=1}^N (1 − d_i) K((x_i − x)/b_N).

Regression-like representation under CI. Define μ_0(x) ≡ E(Y_0 | X = x) and μ_1(x) ≡ E(Y_1 | X = x) such that we can write: Y_0 = μ_0(X) + U_0 and Y_1 = μ_1(X) + U_1, where, by construction, E(U_0 | X = x) = E(U_1 | X = x) = 0. Note that, by definition, ATE(x) = μ_1(x) − μ_0(x).

Regression-like representation under CI [2] Taking into account that Y = (1 − D) Y_0 + D Y_1: Y = (1 − D)(μ_0(X) + U_0) + D(μ_1(X) + U_1) = α(X) + β(X) D + e, where: α(X) = μ_0(X), β(X) = μ_1(X) − μ_0(X) = ATE(X), and e = U_0 + D(U_1 − U_0).

Regression-like representation under CI [3] Under the CI assumption, the OLS estimation of β(x) in this regression model, Y = α(X) + β(X) D + e, provides a consistent estimator of ATE(x). This is because, under the CI assumption, we have that: E(e | X, D = 0) = E(U_0 | X, D = 0) = E(U_0 | X) = 0 and E(e | X, D = 1) = E(U_1 | X, D = 1) = E(U_1 | X) = 0. We can apply (nonparametric) least squares to estimate ATE(X) consistently.

Regression-like representation under CI [4] Suppose that α(x) and β(x) are well approximated by a polynomial of order q in x. When x is a scalar: y_i = [α_0 + α_1 x_i + ... + α_q x_i^q] + [β_0 + β_1 x_i + ... + β_q x_i^q] d_i + e_i. We can estimate the α and β parameters by OLS and then construct the estimate of ATE(x): ÂTE(x) = β̂(x) = β̂_0 + β̂_1 x + ... + β̂_q x^q.

Curse of dimensionality in NP estimation of ATE(x). The kernel and polynomial series estimators of ATE(x) suffer from the well-known curse of dimensionality in nonparametric estimation. The speed of convergence of ÂTE(x) to the true ATE(x) declines with the number of continuous explanatory variables in the vector X. The estimator can be very imprecise unless we have very large samples. When X is discrete, these estimators have good asymptotic properties, but we still need sufficient observations for each discrete value of x. A possible approach is to construct an estimate of the unconditional ATE given the estimates of ATE(x): ÂTE = (1/N) Σ_{i=1}^N ÂTE(x_i).

Curse of dimensionality in NP estimation of ATE(x) [2] This estimator ÂTE is root-N consistent and asymptotically normal (Newey, ET 1994), even though ÂTE(x) has a slower speed of convergence due to continuous regressors. However, in some applications we are interested in the conditional ATE(x). Furthermore, even if we are interested only in the unconditional ATE, the finite sample properties of the previous estimator ÂTE are affected by poor and imprecise estimates ÂTE(x_i). Rosenbaum and Rubin (Biometrika, 1983) provide an interesting and useful approach to deal with this curse of dimensionality in the NP estimation of the ATE.

Rosenbaum and Rubin (1983): Matching estimator using the Propensity Score. Since D is a binary variable, its distribution conditional on X = x is Bernoulli with probability P(x), where: P(x) ≡ Pr(D = 1 | X = x). In the TE literature, P(x) is denoted the Propensity Score. Note that P(x) contains all the information in the distribution of D conditional on X = x. Therefore, if D is independent of (Y_0, Y_1) conditional on X, then it is also true that D is independent of (Y_0, Y_1) conditional on P(X): D ⊥ (Y_0, Y_1) | P(X).

Matching estimator using the Propensity Score [2] Define μ̃_0(p) ≡ E(Y_0 | P(X) = p) and μ̃_1(p) ≡ E(Y_1 | P(X) = p) such that we can write: Y_0 = μ̃_0(p) + U_0 and Y_1 = μ̃_1(p) + U_1, where, by construction, E(U_0 | P(X)) = E(U_1 | P(X)) = 0. Note that, by definition, ATE(p) = μ̃_1(p) − μ̃_0(p).

Matching estimator using the Propensity Score [3] The CI assumption implies that ATE(p) is identified as: ATE(p) = E(Y_1 | P(X) = p) − E(Y_0 | P(X) = p) = E(Y | D = 1, P(X) = p) − E(Y | D = 0, P(X) = p). Based on this insight, Rosenbaum and Rubin proposed the following estimator of the ATE. Let p̂_i ≡ P̂(x_i) be a consistent estimator of the propensity score for individual i. Then: ÂTE = (1/N) Σ_{i=1}^N ÂTE(p̂(x_i)), where ÂTE(p) = ȳ_{D=1}(p) − ȳ_{D=0}(p)

Matching estimator using the Propensity Score [4] with: ȳ_{D=1}(p) = Σ_{i=1}^N y_i d_i K((p̂_i − p)/b_N) / Σ_{i=1}^N d_i K((p̂_i − p)/b_N), ȳ_{D=0}(p) = Σ_{i=1}^N y_i (1 − d_i) K((p̂_i − p)/b_N) / Σ_{i=1}^N (1 − d_i) K((p̂_i − p)/b_N), and p̂_i = p̂(x_i) = Σ_{j=1}^N d_j K((x_j − x_i)/b_N) / Σ_{j=1}^N K((x_j − x_i)/b_N). Now, the dimension of the conditioning variable in the estimation of the conditional expectations ȳ_{D=1}(p) and ȳ_{D=0}(p) is 1 (the propensity score) instead of dim(X). This improves the asymptotic and finite sample properties of the estimators of the ATE.
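A minimal simulated sketch of the two-step idea with a discrete X (illustrative values; cell frequencies play the role of the kernel estimators above): estimate the propensity score first, then average treated-control differences within strata of the estimated score.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60_000
x = rng.integers(0, 4, size=n)               # discrete covariate cells
p_true = np.array([0.2, 0.4, 0.6, 0.8])[x]   # true propensity score per cell
d = (rng.random(n) < p_true).astype(int)
y = x + 1.0 * d + rng.normal(size=n)         # constant TE = 1 given x

# Step 1: estimate the propensity score by cell frequencies.
p_hat = np.array([d[x == v].mean() for v in range(4)])[x]

# Step 2: treated-control difference within each stratum of the estimated
# score, weighted by the stratum's population share.
ate_hat = sum(
    (y[(p_hat == s) & (d == 1)].mean() - y[(p_hat == s) & (d == 0)].mean())
    * (p_hat == s).mean()
    for s in np.unique(p_hat)
)
```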

4. DIFFERENCES-IN-DIFFERENCES (DiD) DiD is a particular case of Conditional Independence when we have: (1) panel data; (2) a particular structure of the treatment variable D; (3) an assumption about the component structure of the unobservables U_0, U_1. Suppose that we have panel data {y_it, d_it, x_it} for t = 1, ..., T, with T ≥ 2. The treatment dummy D_it ∈ {0, 1} has the following structure:

D_it = T_i · 1{t ≥ t*} - T_i ∈ {0, 1} is the dummy that indicates that individual i belongs to the experimental group. - 1{t ≥ t*} is the dummy that indicates that period t is a period of treatment.

DIFFERENCES-IN-DIFFERENCES [2] The model is the same as before but for panel data: Y_{0,it} and Y_{1,it} are the latent variables that represent the outcome variable for an individual without and with treatment, respectively. With μ_0 ≡ E(Y_{0,it}) and μ_1 ≡ E(Y_{1,it}), we have: Y_{0,it} = μ_0 + U_{0,it} and Y_{1,it} = μ_1 + U_{1,it}. The model is completed with an assumption about the component structure of U_{0,it} and U_{1,it}: U_{0,it} = η_i + δ_{0t} + u_{0it} and U_{1,it} = η_i + δ_{1t} + u_{1it}. Note that the individual effect η_i is the same in both equations (η_{0i} = η_{1i}). This is the key restriction.

DIFFERENCES-IN-DIFFERENCES [3] Model: Y_it = (1 − D_it) Y_{0,it} + D_it Y_{1,it}, which we can represent as: Y_it = α + β D_it + [U_{0,it} + D_it (U_{1,it} − U_{0,it})], where α = μ_0, β = μ_1 − μ_0 = ATE, and U_{0,it} + D_it (U_{1,it} − U_{0,it}) = η_i + δ_{0t} + u_{0it} + D_it (δ_{1t} − δ_{0t} + u_{1it} − u_{0it}).

DIFFERENCES-IN-DIFFERENCES [4] The DiD estimator is simply the OLS estimator in the equation in first differences when we include time dummies: ΔY_it = β ΔD_it + δ̃_t + Δe_it. Note that: ΔD_it = 0 for t < t* or t > t*, and ΔD_it = T_i for t = t*. Therefore, the model has information about β only at t = t*: ΔY_{it*} = β T_i + δ̃_{t*} + Δe_{it*}.

DIFFERENCES-IN-DIFFERENCES [5] And according to the model, for a treated individual at t = t*: Δe_{it*} = u_{0it*} − u_{0it*−1} + (u_{1it*} − u_{0it*}) = u_{1it*} − u_{0it*−1}. Consistency of the DiD estimator requires: T_i independent of the transitory shocks u_{1it*} − u_{0it*−1}; and, more importantly, the error-component restriction η_{0i} = η_{1i}.
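A two-period simulated sketch of DiD (all magnitudes are illustrative): the group fixed effect is correlated with treatment-group membership, but first-differencing removes it, and differencing across groups removes the common time trend.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40_000
group = rng.integers(0, 2, size=n)            # T_i: 1 = experimental group
alpha = rng.normal(2.0 * group, 1.0)          # fixed effect, correlated with group
y_pre = alpha + rng.normal(size=n)            # period t*-1: nobody treated yet
y_post = alpha + 0.5 + 1.0 * group + rng.normal(size=n)  # common trend 0.5, ATE = 1

dy = y_post - y_pre                           # first differences: alpha drops out
did = dy[group == 1].mean() - dy[group == 0].mean()
```

Note that a naive post-period comparison of levels would be biased by alpha; the double differencing is what isolates the effect.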

5. RANDOMIZED ELIGIBILITY FOR TREATMENT Let Z ∈ {0, 1} be a random variable that represents whether the individual is eligible for treatment (Z = 1) or not (Z = 0). This Z variable comes from a randomized experiment. In general, in field experiments in the social sciences, Z ≠ D. - We observe subjects with {z_i = 1 and d_i = 0}: eligible but not taking the treatment; - We observe (or suspect) subjects with {z_i = 0 and d_i = 1}: non-eligible but taking a similar / alternative treatment.

Using Z as a proxy for D generates an inconsistent estimator [see below]. However, we show below that, under some additional assumptions, Z can be used as an instrument for D. This IV estimator is not a consistent estimator of the ATE for the whole population. However, this IV estimator is a consistent estimator of the ATE for a particular subpopulation of subjects: the compliers.

IV Estimator. For z = 0, 1, let P(z) be the propensity score P(z) ≡ Pr(D = 1 | Z = z). Consider the following assumptions on the instrument Z. [Independence] Z is independent of the potential outcomes (Y_0, Y_1); [Relevance] Z is correlated with treatment, i.e., P(1) > P(0). Consider the regression-like representation of the model: Y = α + β D + e. The IV estimator of the ATE is: β̂_IV = [Σ_{i=1}^N (z_i − z̄)(d_i − d̄)]^{−1} [Σ_{i=1}^N (z_i − z̄)(y_i − ȳ)].

Wald Estimator. The Wald estimator is defined as: β̂_Wald = (ȳ_{Z=1} − ȳ_{Z=0}) / (d̄_{Z=1} − d̄_{Z=0}), where ȳ_{Z=1} and d̄_{Z=1} are the sample means of Y and D, respectively, for the subsample of observations with Z = 1, and similarly ȳ_{Z=0} and d̄_{Z=0} are the sample means of Y and D for the subsample with Z = 0. We can show that, for this model, the IV and Wald estimators are the same. Let N_1 = Σ_{i=1}^N z_i and N_0 = N − N_1. Then: β̂_IV = Σ_{i=1}^N z_i (y_i − ȳ) / Σ_{i=1}^N z_i (d_i − d̄) = N_1 (ȳ_{Z=1} − ȳ) / [N_1 (d̄_{Z=1} − d̄)] = (ȳ_{Z=1} − ȳ) / (d̄_{Z=1} − d̄). Since ȳ = (N_1/N) ȳ_{Z=1} + (N_0/N) ȳ_{Z=0} and d̄ = (N_1/N) d̄_{Z=1} + (N_0/N) d̄_{Z=0}, we have ȳ_{Z=1} − ȳ = (N_0/N)(ȳ_{Z=1} − ȳ_{Z=0}) and d̄_{Z=1} − d̄ = (N_0/N)(d̄_{Z=1} − d̄_{Z=0}), so that: β̂_IV = (ȳ_{Z=1} − ȳ_{Z=0}) / (d̄_{Z=1} − d̄_{Z=0}) = β̂_Wald.
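A simulated sketch of the Wald estimator with one-sided non-compliance (the compliance rate and effect size are illustrative): eligibility z is randomized, take-up d is not, and the Wald ratio rescales the eligibility contrast by the take-up contrast. For comparison it also computes the IV ratio cov(z, y)/cov(z, d), which is numerically identical.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
z = rng.integers(0, 2, size=n)               # randomized eligibility
complier = rng.random(n) < 0.6               # 60% compliers, 40% never-takers
d = (z * complier).astype(int)               # take up only if eligible AND complier
y0 = rng.normal(size=n)
y1 = y0 + 2.0                                # constant treatment effect of 2
y = (1 - d) * y0 + d * y1

wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
iv = np.cov(z, y, ddof=0)[0, 1] / np.cov(z, d, ddof=0)[0, 1]
```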

Inconsistency of the IV (Wald) Estimator for the ATE. In general, this IV is NOT a consistent estimator of the ATE. Though the instrument Z is independent of U_0 and U_1, it is correlated with the error term e = U_0 + D(U_1 − U_0): E(e | Z = 0) = E(U_0 + D(U_1 − U_0) | Z = 0) = P(0) · E(U_1 − U_0 | D = 1), and E(e | Z = 1) = E(U_0 + D(U_1 − U_0) | Z = 1) = P(1) · E(U_1 − U_0 | D = 1). Such that: E(e | Z = 1) − E(e | Z = 0) = [P(1) − P(0)] · E(U_1 − U_0 | D = 1) ≠ 0.

For the IV estimator to be consistent, we need Cov(Z, e) = 0. Note that Cov(Z, e) = Pr(Z = 1) Pr(Z = 0) [E(e | Z = 1) − E(e | Z = 0)]. Given that E(e | Z = 1) − E(e | Z = 0) = [P(1) − P(0)] E(U_1 − U_0 | D = 1), we have Cov(Z, e) = Pr(Z = 1) Pr(Z = 0) [P(1) − P(0)] E(U_1 − U_0 | D = 1), which in general is different from zero.

Inconsistency of IV in Random Coefficients Models with Endogeneity. More generally, in models with random coefficients and endogenous variables, the error term of the regression model includes interactions between the random coefficients and the endogenous variables. In these models, IV estimation does not provide a consistent estimator of the average coefficient. Consider the model: Y_i = X_i β_i + ε_i with β_i = β + v_i, such that Y_i = X_i β + e_i with e_i = ε_i + X_i v_i, where X_i is correlated with v_i, but there is a vector of instruments Z_i that is independent of ε_i and v_i.

The IV estimator β̂_IV = (Σ_{i=1}^N z_i' x_i)^{−1} (Σ_{i=1}^N z_i' y_i) is an asymptotically biased estimator of β. The reason is simple: although Z_i is independent of ε_i and v_i, it is not independent of e_i = ε_i + X_i v_i.

LOCAL AVERAGE TREATMENT EFFECT (LATE) Though the IV estimator is an inconsistent estimator of the ATE (when we have heterogeneous treatment effects), under some conditions (Monotonicity), the IV is a consistent estimator of the ATE for a subpopulation of individuals: the Compliers. To understand some assumptions of the model and some properties of the estimators, it is useful to define the following latent variables: D_0 = treatment indicator under the hypothetical case that the individual were not eligible, i.e., when Z = 0; D_1 = treatment indicator under the hypothetical case that the individual were eligible, i.e., when Z = 1.

D_0 and D_1 are unobservable. All we observe is the treatment D: D = (1 − Z) D_0 + Z D_1.

LATE [2] According to these latent variables, we can define:

            D_0 = 0          D_0 = 1
D_1 = 0     Never Takers     Defiers
D_1 = 1     Compliers        Always Takers

[Assumption: Monotonicity] For every individual, D_1 ≥ D_0, i.e., there are no defiers.

LATE [3] Using the definitions of "individual types" above, the assumption of Monotonicity establishes that there are no "Defiers" in the population. Under the assumptions of Independence, Relevance, and Monotonicity, the IV estimator converges in probability to the Local Average Treatment Effect parameter, defined as LATE ≡ E(Y_1 − Y_0 | D_1 > D_0). LATE is the ATE for the subpopulation of Compliers.

Proof that IV is a consistent estimator of LATE. As we have shown before, the IV and Wald estimators are the same: β̂_IV = (ȳ_{Z=1} − ȳ_{Z=0}) / (d̄_{Z=1} − d̄_{Z=0}). By the LLN, β̂_IV converges in probability to [E(Y | Z = 1) − E(Y | Z = 0)] / [E(D | Z = 1) − E(D | Z = 0)]. Now, we show that, under the Monotonicity assumption, [E(Y | Z = 1) − E(Y | Z = 0)] / [E(D | Z = 1) − E(D | Z = 0)] = E(Y_1 − Y_0 | D_1 > D_0) = LATE.

Proof that IV is a consistent estimator of LATE [2] Note that Y = Y_0 + D(Y_1 − Y_0), and D = D_0 + Z(D_1 − D_0). Therefore, by independence of Z with (Y_0, Y_1, D_0, D_1): E(Y | Z = 1) = E(Y_0 + D_1 (Y_1 − Y_0) | Z = 1) = E(Y_0 + D_1 (Y_1 − Y_0)), and E(Y | Z = 0) = E(Y_0 + D_0 (Y_1 − Y_0) | Z = 0) = E(Y_0 + D_0 (Y_1 − Y_0)).

Proof that IV is a consistent estimator of LATE [3] Therefore, the numerator of the plim of the IV is: E(Y_0 + D_1 (Y_1 − Y_0)) − E(Y_0 + D_0 (Y_1 − Y_0)) = E((D_1 − D_0)(Y_1 − Y_0)). By the Monotonicity assumption, (D_1 − D_0) can only be 0 or 1. Therefore: Numerator of plim of IV = Pr(D_1 − D_0 > 0) · E(Y_1 − Y_0 | D_1 − D_0 > 0).

Proof that IV is a consistent estimator of LATE [4] Similarly, for the denominator of the plim of the IV we have (by independence of Z with (D_0, D_1)): E(D | Z = 1) = E(D_0 + Z(D_1 − D_0) | Z = 1) = E(D_1), and E(D | Z = 0) = E(D_0 + Z(D_1 − D_0) | Z = 0) = E(D_0).

Proof that IV is a consistent estimator of LATE [5] The denominator of the plim of the IV is: E(D_1 − D_0). Again, by the Monotonicity assumption, (D_1 − D_0) can only be 0 or 1, such that E(D_1 − D_0) = Pr(D_1 − D_0 > 0). Therefore: plim of IV = [Pr(D_1 − D_0 > 0) · E(Y_1 − Y_0 | D_1 − D_0 > 0)] / Pr(D_1 − D_0 > 0) = E(Y_1 − Y_0 | D_1 − D_0 > 0) = LATE.

What if Monotonicity does not hold? What is the plim of the IV?

External Validity of LATE. How different is the LATE from the ATE? Can we apply the LATE (the ATE of compliers) to the rest of the population (Always Takers and Never Takers)? In general, we cannot. However, if the proportion of compliers in the population is large (e.g., > 80%), we can be more confident about the external validity of the LATE. If this proportion is small (e.g., < 20%), we should be very cautious. Under the Monotonicity assumption, we can identify the proportion of compliers in the population.

Identifying the Proportion of Compliers. Let π_C, π_A, π_N, and π_D be the proportions of compliers, always-takers, never-takers, and defiers in the population. Under Monotonicity, we have that π_D = 0, such that π_C + π_A + π_N = 1. We have that: Pr(D = 1 | Z = 0) = π_C Pr(D = 1 | Z = 0, C) + π_A Pr(D = 1 | Z = 0, A) + π_N Pr(D = 1 | Z = 0, N) = π_A. Similarly, Pr(D = 1 | Z = 1) = π_C Pr(D = 1 | Z = 1, C) + π_A Pr(D = 1 | Z = 1, A) + π_N Pr(D = 1 | Z = 1, N) = π_C + π_A.

Therefore: π_C = Pr(D = 1 | Z = 1) − Pr(D = 1 | Z = 0).
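A simulated sketch of this identification result (the type shares are illustrative assumptions): draw compliance types, generate take-up, and check that the contrast in take-up rates recovers the complier share.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
# Types: 0 = never-taker (25%), 1 = complier (50%), 2 = always-taker (25%).
t = rng.choice([0, 1, 2], size=n, p=[0.25, 0.5, 0.25])
z = rng.integers(0, 2, size=n)               # randomized eligibility
d = ((t == 2) | ((t == 1) & (z == 1))).astype(int)

# Complier share = Pr(D=1|Z=1) - Pr(D=1|Z=0).
pi_c_hat = d[z == 1].mean() - d[z == 0].mean()
```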

What if Monotonicity does not hold? What is the interpretation of Pr(D = 1 | Z = 1) − Pr(D = 1 | Z = 0)?

6. REGRESSION DISCONTINUITY (RD) Van der Klaauw (2002) uses an RD approach to estimate the effect of financial aid on students' decisions to accept admission to a given college. He exploits discontinuities in an administrative formula that determines aid based on SAT score, GPA, and other components. Angrist and Lavy (1999) estimate the effect of class size on student test scores, with identification coming from a rule requiring that one classroom be added in a school whenever average class size exceeds a predetermined threshold. Here class size is a discontinuous (and note: non-monotonic) function of enrollment in the student's school. Black (1999) uses an RD approach to estimate parents' willingness to pay for school quality by comparing housing prices near school district boundaries.

Suppose that the probability of treatment (of D = 1) depends on some observable variable X, which is continuous. The variable X need not be independent of Y_0 and Y_1 (of TE). Define: P(x) ≡ Pr(D = 1 | X = x). The key feature of the RD approach is that P(x) is such that there is a point x_0 at which P(·) is discontinuous. Note that, though this is a necessary condition to apply an RD approach, it is not really an assumption, because P(x) is identified at every point in the support of X, so we can check whether this discontinuity exists or not.

The key identification assumption is that the functions $\mu_0(x) \equiv E(Y_0 \mid X=x)$ and $\mu_1(x) \equiv E(Y_1 \mid X=x)$ are continuous functions of $x$.

ASSUMPTION RD: The functions $\mu_0(x) \equiv E(Y_0 \mid X=x)$ and $\mu_1(x) \equiv E(Y_1 \mid X=x)$ are continuous at $X = x_0$.

Under this assumption, any observed discontinuity in $E(Y \mid X=x)$ at $x_0$ should be associated with the treatment effect.

Under Assumption RD it is possible to show that:

$$ATE(x_0) = \frac{\displaystyle \lim_{x \to x_0^+} E(Y \mid X=x) \; - \; \lim_{x \to x_0^-} E(Y \mid X=x)}{\displaystyle \lim_{x \to x_0^+} P(x) \; - \; \lim_{x \to x_0^-} P(x)}$$

Note that $ATE(x)$ is identified only at $x_0$.
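A minimal sketch of this (fuzzy) RD estimator, using local means in a small bandwidth around the discontinuity point. All data-generating choices (the jump in $P(x)$ from 0.2 to 0.8, the true $ATE(x_0) = 2$, the bandwidth) are hypothetical values for illustration, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x0 = 0.0                              # known discontinuity point

x = rng.uniform(-1, 1, n)
# Hypothetical P(x): jumps from 0.2 to 0.8 at x0 (fuzzy design).
p = np.where(x >= x0, 0.8, 0.2)
d = rng.random(n) < p
# Potential outcomes: mu0(x), mu1(x) continuous; true ATE(x0) = 2.
y0 = 1.0 + x + rng.normal(0, 1, n)
y1 = y0 + 2.0
y = np.where(d, y1, y0)

h = 0.05                              # bandwidth for local means
right = (x >= x0) & (x < x0 + h)
left = (x < x0) & (x >= x0 - h)

jump_y = y[right].mean() - y[left].mean()   # discontinuity in E(Y|X=x)
jump_p = d[right].mean() - d[left].mean()   # discontinuity in P(x)
ate_x0 = jump_y / jump_p                    # RD estimate of ATE(x0)
print(round(ate_x0, 1))                     # close to 2
```

In practice one would use local linear regression on each side of $x_0$ rather than raw local means, to reduce boundary bias.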

7. ROY'S MODEL

Rational agents self-select into markets, occupations, education levels, etc., that maximize their payoff. Roy's (1951) "Thoughts on the Distribution of Earnings" is a seminal paper on this topic. He discusses the optimizing choices of workers selecting between fishing and hunting. Workers have skills in each occupation/sector, and they select the sector that gives them the highest expected earnings. Roy's model is a model of comparative advantage. Since that seminal paper, there has been a substantial amount of methodological and empirical work in econometrics on the identification and estimation of Roy's model.

ROY'S MODEL
7.1. The Model
7.2. Identification with (log)normal distributions of skills
7.3. Nonparametric identification
7.4. Generalized Roy's model

7.1. THE MODEL

Two occupations [or industries, or countries, etc.] indexed by $d \in \{0, 1\}$. A worker is endowed with skills for each occupation ($S_0$ and $S_1$). Let $\pi_0$ and $\pi_1$ be the market prices of skills [the same for all workers in the market] in occupations 0 and 1, respectively, such that the earnings of a worker in occupation $d \in \{0, 1\}$ are:

$$W_d = \pi_d \, S_d$$

A worker selects the occupation that maximizes her earnings:

$W_1 \geq W_0 \;\Leftrightarrow\;$ worker selects occupation 1
$W_1 < W_0 \;\Leftrightarrow\;$ worker selects occupation 0

7.1. THE MODEL [2]

Define the variables: $Y_d \equiv \ln W_d = \ln \pi_d + \ln S_d$ (i.e., log-earnings in occupation $d$), and $D \equiv 1\{$worker selects occupation 1$\}$. For $d = 0, 1$, define the parameters $\mu_d \equiv E(Y_d) = \ln \pi_d + E(\ln S_d)$, and the random variables $U_d \equiv Y_d - \mu_d$. The model can be described in terms of the following equations:

$$\begin{cases} \; Y = (1-D)\, Y_0 + D\, Y_1 \\ \; Y_d = \mu_d + U_d \quad \text{for } d = 0, 1 \\ \; D = 1\{Y_1 \geq Y_0\} \end{cases}$$

This is the TE model, but with the assumption that individuals choose "treatment" to maximize earnings.
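The three equations above can be simulated in a few lines. The parameter values ($\mu_0 = 1.0$, $\mu_1 = 1.2$, $\sigma_0^2 = 0.25$, $\sigma_1^2 = 0.64$, $\sigma_{01} = 0.20$) are hypothetical numbers chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical parameter values (assumptions, not from the notes):
mu0, mu1 = 1.0, 1.2                   # mean log-earnings by occupation
# Jointly normal (U0, U1): sigma0^2=0.25, sigma1^2=0.64, sigma01=0.20.
cov = np.array([[0.25, 0.20],
                [0.20, 0.64]])
u = rng.multivariate_normal([0.0, 0.0], cov, size=n)

y0 = mu0 + u[:, 0]                    # Y_d = mu_d + U_d
y1 = mu1 + u[:, 1]
d = (y1 >= y0).astype(int)            # D = 1{Y1 >= Y0}: self-selection
y = (1 - d) * y0 + d * y1             # observed log-earnings

print(round(d.mean(), 2))             # share choosing occupation 1
```

With these numbers, $\sigma_V = 0.7$ and the share choosing occupation 1 is $\Phi(0.2/0.7) \approx 0.61$, which the simulation reproduces.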

7.1. THE MODEL [3]

Roy's main purpose was to understand the implications of self-selection for the distribution of earnings in different occupations. For $d = 0, 1$, define: $\beta_d \equiv E(Y_d \mid D=d) - E(Y_d)$. If $\beta_d > 0$ we say that there is positive selection into occupation $d$; i.e., workers selecting occupation $d$ have, on average, more skills in this occupation than the average worker in the population. If $\beta_d < 0$ we say that there is negative selection into occupation $d$; i.e., workers selecting occupation $d$ have, on average, less skills in this occupation than the average worker in the population. What are the predictions of the model about $\beta_0$ and $\beta_1$?

7.1. THE MODEL [4]

Note that:

$$D = 1\{Y_1 \geq Y_0\} = 1\{U_0 - U_1 \leq \mu_1 - \mu_0\} = 1\left\{ \frac{V}{\sigma_V} \leq \frac{\mu_1 - \mu_0}{\sigma_V} \right\}$$

where $V \equiv U_0 - U_1$ and $\sigma_V$ is the standard deviation of $V$.

7.1. THE MODEL [5]

Under joint normality of $U_0$ and $U_1$, we have $E(U_1 \mid V) = (\sigma_{1V}/\sigma_V^2)\, V$ with $\sigma_{1V} \equiv \mathrm{Cov}(U_1, V) = \sigma_{01} - \sigma_1^2$, and therefore:

$$E(Y_1 \mid D=1) = \mu_1 + E(U_1 \mid V \leq \mu_1 - \mu_0) = \mu_1 + \frac{\sigma_{1V}}{\sigma_V^2}\, \sigma_V \, E\!\left(\frac{V}{\sigma_V} \,\Big|\, \frac{V}{\sigma_V} \leq \frac{\mu_1 - \mu_0}{\sigma_V}\right) = \mu_1 - \frac{\sigma_{1V}}{\sigma_V}\, \frac{\phi\!\left(\frac{\mu_1 - \mu_0}{\sigma_V}\right)}{\Phi\!\left(\frac{\mu_1 - \mu_0}{\sigma_V}\right)}$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal pdf and cdf.

7.1. THE MODEL [6]

Similarly, with $\sigma_{0V} \equiv \mathrm{Cov}(U_0, V) = \sigma_0^2 - \sigma_{01}$:

$$E(Y_0 \mid D=0) = \mu_0 + E(U_0 \mid V > \mu_1 - \mu_0) = \mu_0 + \frac{\sigma_{0V}}{\sigma_V^2}\, \sigma_V \, E\!\left(\frac{V}{\sigma_V} \,\Big|\, \frac{V}{\sigma_V} > \frac{\mu_1 - \mu_0}{\sigma_V}\right) = \mu_0 + \frac{\sigma_{0V}}{\sigma_V}\, \frac{\phi\!\left(\frac{\mu_1 - \mu_0}{\sigma_V}\right)}{1 - \Phi\!\left(\frac{\mu_1 - \mu_0}{\sigma_V}\right)}$$
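These two truncated-normal selection formulas can be checked by simulation. The sketch below uses hypothetical parameter values ($\mu_0 = 1.0$, $\mu_1 = 1.2$, $\sigma_0^2 = 0.25$, $\sigma_1^2 = 0.64$, $\sigma_{01} = 0.20$, chosen for the example) and compares the simulated conditional means with the closed forms.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 2_000_000

# Hypothetical parameters (assumptions for this check):
mu0, mu1, s00, s11, s01 = 1.0, 1.2, 0.25, 0.64, 0.20
u = rng.multivariate_normal([0, 0], [[s00, s01], [s01, s11]], size=n)
y0, y1 = mu0 + u[:, 0], mu1 + u[:, 1]
d = y1 >= y0                          # self-selection into occupation 1

sV = np.sqrt(s00 + s11 - 2 * s01)     # sd of V = U0 - U1
c = (mu1 - mu0) / sV

# Closed forms: E(Y1|D=1) = mu1 + [(s1^2 - s01)/sV] * phi(c)/Phi(c)
#               E(Y0|D=0) = mu0 + [(s0^2 - s01)/sV] * phi(c)/(1-Phi(c))
pred1 = mu1 + (s11 - s01) / sV * norm.pdf(c) / norm.cdf(c)
pred0 = mu0 + (s00 - s01) / sV * norm.pdf(c) / (1 - norm.cdf(c))

print(round(y1[d].mean(), 3), round(pred1, 3))   # simulated vs. formula
print(round(y0[~d].mean(), 3), round(pred0, 3))
```

With 2 million draws the simulated conditional means match the analytical expressions to about three decimal places.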

7.1. THE MODEL [7]

Taking into account that $\sigma_{0V} = \sigma_0^2 - \sigma_{01}$ and $\sigma_{1V} = \sigma_{01} - \sigma_1^2$, and defining

$$\tau_0 \equiv \frac{\sigma_0^2 - \sigma_{01}}{\sigma_V} \qquad \tau_1 \equiv \frac{\sigma_1^2 - \sigma_{01}}{\sigma_V}$$

we can write, with $c \equiv (\mu_1 - \mu_0)/\sigma_V$:

$$E(Y_1 \mid D=1) = \mu_1 + \tau_1 \, \frac{\phi(c)}{\Phi(c)} \qquad E(Y_0 \mid D=0) = \mu_0 + \tau_0 \, \frac{\phi(c)}{1 - \Phi(c)}$$

The signs of $\tau_0$ and $\tau_1$ depend on the signs of $[\sigma_0^2 - \sigma_{01}]$ and $[\sigma_1^2 - \sigma_{01}]$, respectively. Note that $\sigma_V^2 = [\sigma_0^2 - \sigma_{01}] + [\sigma_1^2 - \sigma_{01}] > 0$, so at least one of the two terms is positive, and both can be.

7.1. THE MODEL [8]

                               | $\sigma_1^2 - \sigma_{01} < 0$ | $\sigma_1^2 - \sigma_{01} > 0$
-------------------------------+--------------------------------+-------------------------------
$\sigma_0^2 - \sigma_{01} < 0$ | Impossible                     | Positive selection in 1,
                               |                                | Negative selection in 0
$\sigma_0^2 - \sigma_{01} > 0$ | Negative selection in 1,       | Positive selection in 1,
                               | Positive selection in 0        | Positive selection in 0

7.1. THE MODEL [9]

Which type of occupation has positive selection? The occupation where the distribution of skills is more heterogeneous, more disperse. To see this, note that

$$[\sigma_0^2 - \sigma_{01}] = \sigma_0 \sigma_1 \left[ \frac{\sigma_0}{\sigma_1} - \rho_{01} \right] \qquad [\sigma_1^2 - \sigma_{01}] = \sigma_0 \sigma_1 \left[ \frac{\sigma_1}{\sigma_0} - \rho_{01} \right]$$

such that the sign of $\tau_0$ is determined by the sign of $[\sigma_0/\sigma_1 - \rho_{01}]$, and the sign of $\tau_1$ is determined by the sign of $[\sigma_1/\sigma_0 - \rho_{01}]$.

Since $\rho_{01} \leq 1$: if $\sigma_1/\sigma_0 > 1$, then $\tau_1 > 0$; and if $\sigma_0/\sigma_1 > 1$, then $\tau_0 > 0$.
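This dispersion prediction can be illustrated with a case where skills in occupation 1 are much more disperse and strongly correlated with skills in occupation 0. The numbers ($\sigma_0 = 0.5$, $\sigma_1 = 1.0$, $\rho_{01} = 0.8$) are hypothetical values chosen so that the model predicts positive selection into 1 and negative selection into 0.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# Hypothetical case with very disperse skills in occupation 1:
# sigma0=0.5, sigma1=1.0, rho=0.8 => sigma01=0.40 (assumptions).
# Then sigma0/sigma1 = 0.5 < rho  => tau0 < 0: negative selection in 0.
#      sigma1/sigma0 = 2.0 > rho  => tau1 > 0: positive selection in 1.
mu0 = mu1 = 1.0
u = rng.multivariate_normal([0, 0], [[0.25, 0.40], [0.40, 1.00]], size=n)
y0, y1 = mu0 + u[:, 0], mu1 + u[:, 1]
d = y1 >= y0

beta1 = y1[d].mean() - y1.mean()      # selection effect in occupation 1
beta0 = y0[~d].mean() - y0.mean()     # selection effect in occupation 0
print(beta1 > 0, beta0 < 0)           # → True True
```

Workers choosing the high-dispersion occupation are positively selected, while those remaining in the low-dispersion occupation are drawn from the lower part of its skill distribution.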

7.2. Identification: Normal distributions

Suppose that we have cross-sectional data $\{y_i, d_i : i = 1, 2, ..., N\}$. Can we identify the parameters of the Roy model, $\theta = (\mu_0, \mu_1, \sigma_0, \sigma_1, \sigma_{01})$? Heckman and Honoré (ECMA, 1990) show that with normal distributions the parameters are uniquely identified from the following moments in the data:

$$\Pr(D=1), \; E(Y \mid D=0), \; E(Y \mid D=1), \; V(Y \mid D=0), \; \text{and } V(Y \mid D=1)$$

They also show that, without regressors, the model is not identified if we consider a nonparametric specification of the unobservables. They then present nonparametric identification results when the model includes regressors $X$.

7.3. Nonparametric Identification: Exclusion restrictions

Consider the model with regressors, such that $\mu_d(X) \equiv E(Y_d \mid X)$, and assume that $U_0$ and $U_1$ are independent of $X$. Suppose that $X$ includes three groups of variables, $X = (Z_0, Z_1, X_c)$, such that:

$$\mu_0(X) = \mu_0(Z_0, X_c) \qquad \mu_1(X) = \mu_1(Z_1, X_c)$$

Furthermore, $Z_0$ and $Z_1$ have continuous support, $\mu_d(Z_d, X_c)$ is strictly monotonic in $Z_d$, and $\lim_{Z_d \to -\infty} \mu_d(Z_d, X_c) = -\infty$.

Nonparametric Identification: Exclusion restrictions [2]

We have that:

$$E(Y \mid X, D=0) = \mu_0(Z_0, X_c) + E\left(U_0 \mid V > \mu_1(Z_1, X_c) - \mu_0(Z_0, X_c)\right)$$

Therefore,

$$\lim_{Z_1 \to -\infty} E(Y \mid X, D=0) = \mu_0(Z_0, X_c) + E\left(U_0 \,\Big|\, V > \lim_{Z_1 \to -\infty} \mu_1(Z_1, X_c) - \mu_0(Z_0, X_c)\right) = \mu_0(Z_0, X_c) + E(U_0 \mid V > -\infty) = \mu_0(Z_0, X_c)$$

since $E(U_0 \mid V > -\infty) = E(U_0) = 0$. Hence $\mu_0(Z_0, X_c)$ is identified everywhere.

Nonparametric Identification: Exclusion restrictions [3]

Similarly, we have that:

$$E(Y \mid X, D=1) = \mu_1(Z_1, X_c) + E\left(U_1 \mid V \leq \mu_1(Z_1, X_c) - \mu_0(Z_0, X_c)\right)$$

Therefore,

$$\lim_{Z_0 \to -\infty} E(Y \mid X, D=1) = \mu_1(Z_1, X_c) + E\left(U_1 \,\Big|\, V \leq \mu_1(Z_1, X_c) - \lim_{Z_0 \to -\infty} \mu_0(Z_0, X_c)\right) = \mu_1(Z_1, X_c) + E(U_1 \mid V \leq +\infty) = \mu_1(Z_1, X_c)$$

and $\mu_1(Z_1, X_c)$ is identified everywhere.

Nonparametric Identification: Exclusion restrictions [4]

For estimation, we can use nonparametric methods. Define the choice probability $P_D(X) \equiv \Pr(D=1 \mid X)$. The model implies that $P_D(X) = F_V(\mu_1(Z_1, X_c) - \mu_0(Z_0, X_c))$, and if $F_V(\cdot)$ is strictly increasing:

$$\mu_1(Z_1, X_c) - \mu_0(Z_0, X_c) = F_V^{-1}[P_D(X)]$$

Note that $E(U_1 \mid V \leq \mu_1(Z_1, X_c) - \mu_0(Z_0, X_c))$ is a function of $\mu_1(Z_1, X_c) - \mu_0(Z_0, X_c)$ only, and therefore it can be represented as a function of $P_D(X)$:

$$E\left(U_1 \mid V \leq \mu_1(Z_1, X_c) - \mu_0(Z_0, X_c)\right) = s_1(P_D(Z_0, Z_1, X_c))$$

Nonparametric Identification: Exclusion restrictions [5]

Therefore, we can write:

$$E(Y \mid X, D=1) = \mu_1(Z_1, X_c) + s_1(P_D(Z_0, Z_1, X_c))$$

For the subsample of observations with $d_i = 1$, consider the partially linear regression model:

$$y_i = \mu_1(z_{1i}, x_{ci}) + s_1(p_i) + e_i = h(z_{1i}, x_{ci})' \beta_1 + s_1(p_i) + e_i$$

We can use the methods of Robinson (1988) or Yatchew (2003) to estimate $\beta_1$ in this model.
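A minimal sketch of Robinson's (1988) double-residual idea for this partially linear model, under simplifying assumptions: a scalar regressor $x$ playing the role of $h(z_1, x_c)$, a hypothetical nuisance function $s_1(p) = \sin(2\pi p)$, and binned means as the nonparametric smoother (a kernel or series estimator would be used in practice).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000

# Partially linear model y = x*beta + s(p) + e, with hypothetical
# nuisance function s(p) = sin(2*pi*p) and beta = 1.5 (assumptions).
p = rng.uniform(0, 1, n)              # estimated choice probability
x = p + rng.normal(0, 1, n)           # regressor correlated with p
beta_true = 1.5
y = beta_true * x + np.sin(2 * np.pi * p) + rng.normal(0, 1, n)

# Nonparametric step: estimate E(y|p) and E(x|p) by binned means.
bins = np.clip((p * 50).astype(int), 0, 49)
ey = np.bincount(bins, weights=y) / np.bincount(bins)
ex = np.bincount(bins, weights=x) / np.bincount(bins)

# Double-residual step: OLS of (y - E(y|p)) on (x - E(x|p))
# removes s(p) and identifies beta.
ry, rx = y - ey[bins], x - ex[bins]
beta_hat = (rx @ ry) / (rx @ rx)
print(round(beta_hat, 2))             # close to 1.5
```

Partialling $p$ out of both $y$ and $x$ eliminates the unknown function $s_1(\cdot)$, so the residual-on-residual regression recovers $\beta_1$ without specifying $s_1$.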