Lecture 11/12. Roy Model, MTE, Structural Estimation

Economics 2123, George Washington University. Instructor: Prof. Ben Williams

Roy model
The Roy model is a model of comparative advantage.
Potential earnings in sectors 0 and 1: $Y_0$, $Y_1$.
Individuals choose sector 1 if and only if $Y_1 - Y_0 \geq c$, where $c$ is a nonrandom cost.
Heckman and Honoré (1990) studied the empirical implications and identification of this model.

Roy model
The extended and generalized Roy model:
The extended model allows for an observable cost component, $D = 1(Y_1 - Y_0 \geq c(W))$, where $W$ is a vector of covariates and $c$ is a possibly unknown function.
The generalized model allows for an unobservable cost component, $D = 1(Y_1 - Y_0 \geq c(W, V))$, where $V$ is unobservable.

Roy model
Two parts of this lecture:
estimation of treatment effects
estimation of structural models

Roy model
In the Roy model, $Y_d = \mu_d + U_d$ for $d = 0, 1$.
Suppose we observe a vector of covariates $X$ so that $\mu_d = \beta_d' X$. Then
$$E(Y \mid D = 1, X = x) = \beta_1' x + E(U_1 \mid U_1 - U_0 \geq -z^*, X = x)$$
$$E(Y \mid D = 0, X = x) = \beta_0' x + E(U_0 \mid U_1 - U_0 < -z^*, X = x)$$
where $z^* = (\beta_1 - \beta_0)' x - c$.

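Roy model
Assumption: $(U_1, U_0) \mid X = x \sim N(0, \Sigma)$ where
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{10} \\ \sigma_{10} & \sigma_0^2 \end{pmatrix}$$
Let $V = U_1 - U_0$ and $\sigma_V^2 = \mathrm{Var}(V)$.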

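Roy model
Assumption: $(U_1, U_0) \mid X = x \sim N(0, \Sigma)$
Let $\tilde z = z^*/\sigma_V$, where $z^* = (\beta_1 - \beta_0)' x - c$, and let $\lambda(t) = \phi(t)/\Phi(t)$. Then under this assumption,
$$E(Y \mid D = 1, X = x) = \beta_1' x + \frac{\sigma_1^2 - \sigma_{10}}{\sigma_V}\, \lambda(\tilde z)$$
$$E(Y \mid D = 0, X = x) = \beta_0' x + \frac{\sigma_0^2 - \sigma_{10}}{\sigma_V}\, \lambda(-\tilde z)$$
and $\Pr(D = 1 \mid X = x) = \Phi(\tilde z)$.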
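
The selection term for sector 1 can be checked by simulation. The sketch below is only an illustration under assumed parameter values (no covariates, so $\beta_1' x$ and $\beta_0' x$ are the constants `mu1` and `mu0`); the function name and arguments are hypothetical. It compares the simulated mean of $Y$ among those who choose sector 1 with $\mu_1 + \frac{\sigma_1^2 - \sigma_{10}}{\sigma_V}\,\phi(\tilde z)/\Phi(\tilde z)$.

```python
import math
import random
from statistics import NormalDist


def roy_selected_mean(mu1, mu0, c, s1, s0, rho, n=100_000, seed=0):
    """Return (simulated, formula) values of E(Y | D = 1) in the basic Roy model.

    s1, s0 are the standard deviations of U1, U0 and rho is their correlation
    (all illustrative assumptions, not estimates from any data set).
    """
    rng = random.Random(seed)
    std = NormalDist()
    s10 = rho * s1 * s0                         # Cov(U1, U0)
    s_v = math.sqrt(s1**2 + s0**2 - 2 * s10)    # sd of V = U1 - U0
    z_tilde = (mu1 - mu0 - c) / s_v

    total, count = 0.0, 0
    for _ in range(n):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        u1 = s1 * e1
        u0 = s0 * (rho * e1 + math.sqrt(1 - rho**2) * e2)
        if (mu1 + u1) - (mu0 + u0) >= c:        # individual chooses sector 1
            total += mu1 + u1
            count += 1

    simulated = total / count
    lam = std.pdf(z_tilde) / std.cdf(z_tilde)   # phi(z~) / Phi(z~)
    formula = mu1 + (s1**2 - s10) / s_v * lam
    return simulated, formula


simulated, formula = roy_selected_mean(1.0, 0.5, 0.2, 1.0, 0.8, 0.3)
print(round(simulated, 3), round(formula, 3))
```

The two numbers should agree up to simulation noise; the gap between either of them and `mu1` is the selection bias that the correction term absorbs.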

Roy model
Estimation with only sector one observed.
If we only observe $Y$ for those with $D = 1$ (for example, a model of wages and labor force participation) then we can
(1) estimate the probit: $\Pr(D = 1 \mid X = x) = \Phi(\gamma_0 + \gamma_1' x) = \Phi(\tilde z)$
(2) compute the predicted values from (1), $\hat Z = \hat\gamma_0 + \hat\gamma_1' X$, and plug into $\lambda$ to get $\lambda(\hat Z)$
(3) estimate a regression of $Y$ on $X$, $\lambda(\hat Z)$ for those with $D = 1$
This enables us to estimate $\beta_1$ but not $\beta_0$, $\sigma_1$, $\sigma_{10}$, $\sigma_0$.

Roy model
Estimation with both sectors observed.
If we observe $Y$, $D$, $X$ for everyone:
(1) estimate the probit: $\Pr(D = 1 \mid X = x) = \Phi(\gamma_0 + \gamma_1' x) = \Phi(\tilde z)$
(2) compute the predicted values from (1), $\hat Z = \hat\gamma_0 + \hat\gamma_1' X$, and plug into $\lambda(t) = \phi(t)/\Phi(t)$ to get $\lambda(\hat Z)$ and $\lambda(-\hat Z)$
(3) estimate a regression of $Y$ on $X$, $\lambda(\hat Z)$ for those with $D = 1$
(4) estimate a regression of $Y$ on $X$, $\lambda(-\hat Z)$ for those with $D = 0$
This enables us to estimate $\beta_1$, $\beta_0$, $(\beta_1 - \beta_0)/\sigma_V$ and $(\sigma_0^2 - \sigma_{10})/\sigma_V$, $(\sigma_1^2 - \sigma_{10})/\sigma_V$. Thus we get $\sigma_V$ too.
From the variance of the residuals from the two regressions we can also identify $\sigma_1^2$, $\sigma_0^2$ and $\sigma_{10}$.
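
A short worked step may help show why $\sigma_V$ and then $\Sigma$ are recovered. Write $\rho_1$ and $\rho_0$ for the coefficients on the selection terms in the $D = 1$ and $D = 0$ regressions (this notation is introduced here only for the illustration). The variance expression uses the standard truncated-normal variance formula with $\lambda(t) = \phi(t)/\Phi(t)$.

```latex
\rho_1 = \frac{\sigma_1^2 - \sigma_{10}}{\sigma_V}, \qquad
\rho_0 = \frac{\sigma_0^2 - \sigma_{10}}{\sigma_V}
\quad\Longrightarrow\quad
\rho_1 + \rho_0 = \frac{\sigma_1^2 + \sigma_0^2 - 2\sigma_{10}}{\sigma_V}
               = \frac{\sigma_V^2}{\sigma_V} = \sigma_V .

% Residual variance in the D = 1 regression pins down \sigma_1^2:
\mathrm{Var}(Y \mid D = 1, X = x)
  = \sigma_1^2 - \rho_1^2 \, \lambda(\tilde z)\bigl(\tilde z + \lambda(\tilde z)\bigr).

% Then \sigma_{10} = \sigma_1^2 - \rho_1 \sigma_V
% and  \sigma_0^2  = \sigma_{10} + \rho_0 \sigma_V .
```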

Roy model
Concerns:
If $\lambda(\tilde z)$ is approximately linear then we will have a serious collinearity problem.
If $(U_1, U_0)$ is not normal then the model is misspecified and identification is not transparent.
If there are variable costs the model is misspecified.

Roy model
In the extended Roy model, $Y_d = \mu_d + U_d$ for $d = 0, 1$ but
$$D = 1(Y_1 - Y_0 \geq \gamma_1' X + \gamma_2' Z)$$
Cost of participation varies with $X$ and also with other variables $Z$. Then
$$E(Y \mid D = 1, X = x, Z = z) = \beta_1' x + E(U_1 \mid U_1 - U_0 \geq -z^*, X = x, Z = z)$$
$$E(Y \mid D = 0, X = x, Z = z) = \beta_0' x + E(U_0 \mid U_1 - U_0 < -z^*, X = x, Z = z)$$
where $z^* = (\beta_1 - \beta_0)' x - \gamma_1' x - \gamma_2' z$.

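Roy model
Assumption: $(U_1, U_0) \mid X = x, Z = z \sim N(0, \Sigma)$
Under this assumption, if $\tilde z = z^*/\sigma_V$ and $\lambda(t) = \phi(t)/\Phi(t)$,
$$E(Y \mid D = 1, X = x, Z = z) = \beta_1' x + \frac{\sigma_1^2 - \sigma_{10}}{\sigma_V}\, \lambda(\tilde z)$$
$$E(Y \mid D = 0, X = x, Z = z) = \beta_0' x + \frac{\sigma_0^2 - \sigma_{10}}{\sigma_V}\, \lambda(-\tilde z)$$
and $\Pr(D = 1 \mid X = x, Z = z) = \Phi(\tilde z)$.
$\beta_1$ and $\beta_0$ are still identified.
$\Sigma$ is only identified if there is an exclusion: a component of $X$ that does not affect costs.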

Roy model
What do we need/want to identify?
The ATE is $E(Y_1 - Y_0) = (\beta_1 - \beta_0)' E(X)$.
The distribution of gains: $Y_1 - Y_0 \mid X = x \sim N((\beta_1 - \beta_0)' x, \sigma_V^2)$.
Various other counterfactuals.
We need $\Sigma$ to go beyond mean treatment effects.

Roy model
In the generalized Roy model, $Y_d = \mu_d + U_d$ for $d = 0, 1$ but
$$D = 1(\gamma_1' X + \gamma_2' Z \geq V)$$
$V$ includes an unobservable component of cost. Then
$$E(Y \mid D = 1, X = x, Z = z) = \beta_1' x + E(U_1 \mid V \leq z^*, X = x)$$
$$E(Y \mid D = 0, X = x, Z = z) = \beta_0' x + E(U_0 \mid V > z^*, X = x)$$
where $z^* = \gamma_1' x + \gamma_2' z$.

Roy model
Assumption: $(U_1, U_0, V) \mid X = x, Z = z \sim N(0, \Sigma)$
Under this assumption,
$\beta_1$ and $\beta_0$ are identified
$\sigma_V$ is identified under the exclusion restriction
but $\mathrm{Var}(U_1 - U_0) \neq \sigma_V^2$, and $\mathrm{Var}(U_1 - U_0)$ (the key ingredient needed for the distribution of $Y_1 - Y_0$) is not identified

Roy model without normality
Assumption: $(U_1, U_0, V) \perp (X, Z)$ and
$$\lim_{z \to \infty} \Pr(D = 1 \mid X = x, Z = z) = 1$$
The first assumption is essentially the same one used by Imbens and Angrist (1994).
The second assumption is called identification at infinity.

Roy model without normality
Under these assumptions, let $P(x, z) = \Pr(D = 1 \mid X = x, Z = z)$. Then
$$E(Y \mid D = 1, X = x, Z = z) = \beta_1' x + K_1(P(x, z))$$
and
$$\lim_{z \to \infty} E(Y \mid D = 1, X = x, Z = z) = \beta_1' x$$
Selection on unobservables goes away in the limit.
Using the same argument for $D = 0$, we can identify $ATE(x) = (\beta_1 - \beta_0)' x$.
We've traded normality for identification at infinity.

Roy model
Next:
The marginal treatment effect.
Structural estimation of the Roy model.
More on structural estimation.

Some preliminaries
Recall that $D = 1(\gamma_1' X + \gamma_2' Z \geq V)$.
Let $U_D = F_V(V)$ where $V$ has distribution function $F_V(\cdot)$.
Then $D = 1(P(X, Z) \geq U_D)$.
And note that $U_D \sim \text{Uniform}(0, 1)$.
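
The probability-integral transform can be illustrated numerically. The sketch below assumes $V \sim N(0, 1)$ and a fixed value of the index $\gamma_1' x + \gamma_2' z$ (both are assumptions for illustration; `check_uniform_index` is a hypothetical helper). It checks that the two selection rules agree draw by draw, that $U_D$ has mean close to 1/2, and that the participation share is close to $P$.

```python
import random
from statistics import NormalDist


def check_uniform_index(index, n=50_000, seed=1):
    """Compare D = 1(index >= V) with D = 1(P >= U_D), where U_D = F_V(V)."""
    rng = random.Random(seed)
    f_v = NormalDist()          # assumed distribution of V
    p = f_v.cdf(index)          # propensity score P = F_V(index)

    agree, u_sum, d_sum = 0, 0.0, 0
    for _ in range(n):
        v = rng.gauss(0, 1)
        u_d = f_v.cdf(v)        # U_D = F_V(V)
        d_latent = index >= v   # selection rule written in terms of V
        d_uniform = p >= u_d    # selection rule written in terms of U_D
        agree += d_latent == d_uniform
        u_sum += u_d
        d_sum += d_latent
    return agree / n, u_sum / n, d_sum / n, p


share_agree, mean_u, share_d, p = check_uniform_index(0.5)
print(share_agree, round(mean_u, 3), round(share_d, 3), round(p, 3))
```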

Definition of the MTE
Then
$$MTE(x, u) = E(Y_1 - Y_0 \mid X = x, U_D = u)$$
This demonstrates (observable and unobservable) heterogeneity in $Y_1 - Y_0$.
$MTE(x, u)$ can be interpreted as the effect of participation for those individuals who would be indifferent if we assigned them a new value of $P = P(x, z)$ equal to $u$.
Someone with a large value of $u$ (close to 1) will participate only if $P$ is quite large; this person will be indifferent if $P$ is equal to $u$. These are the high unobservable cost individuals.
Someone with a small value of $u$ (close to 0) will participate even if $P$ is quite small; this person will be indifferent if $P$ is equal to $u$. These are the low unobservable cost individuals.

Identification of the MTE
The identifying equation:
$$\frac{\partial E(Y \mid X = x, P(X, Z) = p)}{\partial p} = MTE(x, p)$$
This only works if $Z$ is continuous.
The effect for the high unobserved cost individuals is identified by the effect of a marginal increase in participation probability on $Y$ at a high participation rate.
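
A sketch of where the identifying equation comes from, assuming (as in the setup above) that $D = 1(P(X, Z) \geq U_D)$ with $U_D \sim \text{Uniform}(0, 1)$ and that the unobservables are independent of $Z$ given $X$:

```latex
\begin{aligned}
E(Y \mid X = x, P = p)
  &= E(Y_0 \mid X = x) + E\bigl(D\,(Y_1 - Y_0) \mid X = x, P = p\bigr) \\
  &= E(Y_0 \mid X = x)
     + \int_0^p E(Y_1 - Y_0 \mid X = x, U_D = u)\, du \\
  &= E(Y_0 \mid X = x) + \int_0^p MTE(x, u)\, du ,
\end{aligned}
\qquad\text{so}\qquad
\frac{\partial}{\partial p} E(Y \mid X = x, P = p) = MTE(x, p).
```

The second line uses the fact that $D = 1$ exactly when $U_D \leq p$ and that $U_D$ has density 1 on $[0, 1]$.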

Other treatment parameters and methods
$$ATE(x) = \int_0^1 MTE(x, u)\, du$$
$$TT(x) = \int_0^1 MTE(x, u)\, \omega_{TT}(x, u)\, du$$
where $\omega_{TT}(x, u)$ disproportionately weights smaller values of $u$.
OLS and IV can also be written as weighted averages of $MTE(x, u)$.

Other treatment parameters and methods
Consider the IV estimand
$$IV(x) = \frac{\mathrm{Cov}(J(Z), Y \mid X = x)}{\mathrm{Cov}(J(Z), D \mid X = x)}$$
The weight here is
$$\omega_{IV}(x, u) = \frac{E(J - E(J) \mid X = x, P \geq u)\, \Pr(P \geq u \mid X = x)}{\mathrm{Cov}(J, P \mid X = x)}$$
This and many interesting implications are discussed in Heckman, Vytlacil, and Urzua (2006).

An example
Carneiro, Heckman, Vytlacil (2011) use data from the NLSY.
$Y$ is log wage in 1991 (ages 28-34), $D$ represents college attendance, $X$ is a vector of controls.
Vector $Z$: (i) distance to college, (ii) local wage, (iii) local unemployment, (iv) average local public tuition.

An example
[Figure: estimated MTE (vertical axis) plotted against $U_S$ (horizontal axis, 0 to 1). Source: Carneiro et al., "Estimating Marginal Returns to Education," vol. 101, no. 6, p. 2771.]

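Estimating the MTE - normal model
Option 1. Estimate using MLE.
Option 2. Two stage estimation:
(1) Probit to estimate $P(Z_i)$ (to simplify, define $Z_i$ to include $X_i$ and instrument(s)).
(2) Regress $Y$ on $X_i$ and $\hat\lambda_{1i} = -\dfrac{\phi(\Phi^{-1}(\hat P(Z_i)))}{\hat P(Z_i)}$ for $D_i = 1$.
(3) Regress $Y$ on $X_i$ and $\hat\lambda_{0i} = \dfrac{\phi(\Phi^{-1}(\hat P(Z_i)))}{1 - \hat P(Z_i)}$ for $D_i = 0$.
Then, with $\hat\rho_1$ and $\hat\rho_0$ the estimated coefficients on $\hat\lambda_{1i}$ and $\hat\lambda_{0i}$,
$$MTE(x, u) = x'(\hat\beta_1 - \hat\beta_0) + (\hat\rho_1 - \hat\rho_0)\, \Phi^{-1}(u)$$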
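
Once the second-stage coefficients are in hand, the normal-model MTE curve is a one-line formula, and integrating it over $u$ gives back the ATE. The sketch below takes the estimates as given; the numbers and function names are made up for illustration, and `x_gain` stands for the scalar $x'(\hat\beta_1 - \hat\beta_0)$.

```python
from statistics import NormalDist


def mte_normal(x_gain, rho1, rho0, u):
    """MTE(x, u) = x'(b1 - b0) + (rho1 - rho0) * Phi^{-1}(u), for 0 < u < 1."""
    return x_gain + (rho1 - rho0) * NormalDist().inv_cdf(u)


def integrate_mte(x_gain, rho1, rho0, grid=10_000):
    """Midpoint-rule approximation to ATE(x) = integral of MTE(x, u) over (0, 1)."""
    step = 1.0 / grid
    total = sum(mte_normal(x_gain, rho1, rho0, (k + 0.5) * step)
                for k in range(grid))
    return total * step


# Hypothetical estimates: a 10 log-point average gain, with rho1 < rho0 so the
# MTE declines in u (low-cost individuals have the largest returns).
for u in (0.1, 0.5, 0.9):
    print(u, round(mte_normal(0.10, -0.20, 0.10, u), 3))
print("ATE:", round(integrate_mte(0.10, -0.20, 0.10), 3))
```

Because $\Phi^{-1}(u)$ integrates to zero over $(0, 1)$, the integral returns `x_gain`; the slope term only redistributes the effect across values of $u$.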

Estimating the MTE - semiparametric model
The outcome equation can be written as
$$E(Y \mid X = x, P(Z) = p) = x'\delta_0 + p\, x'(\delta_1 - \delta_0) + K(p)$$
There are several ways to estimate this; perhaps the simplest is a series/spline/sieve estimator.
Estimate $P(Z_i)$ (logit).
Choose a set of basis functions (polynomials) and an order, $K$.
Run the regression:
$$Y_i = X_i'\delta_0 + \hat P(Z_i)\, X_i'(\delta_1 - \delta_0) + \gamma_1 \hat P(Z_i) + \dots + \gamma_K \hat P(Z_i)^K + \eta_i$$
An important sacrifice here is that $MTE(x, u)$ is only identified for $u$ in the support of $P$. (Recall identification at infinity.)
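
With a polynomial basis, the MTE is the derivative of the fitted outcome equation with respect to $p$: $MTE(x, p) = x'(\delta_1 - \delta_0) + \sum_{k=1}^{K} k\, \gamma_k\, p^{k-1}$. The sketch below evaluates that derivative from coefficients that are assumed to have been estimated already; the function name and the numbers are hypothetical.

```python
def mte_series(x, delta0, delta1, gammas, p):
    """Derivative in p of x'd0 + p * x'(d1 - d0) + sum_k gamma_k * p**k.

    x, delta0, delta1 are equal-length sequences; gammas = [gamma_1, ..., gamma_K].
    Only meaningful for p inside the support of the estimated propensity score.
    """
    linear = sum(xi * (d1 - d0) for xi, d0, d1 in zip(x, delta0, delta1))
    k_prime = sum(k * g * p ** (k - 1) for k, g in enumerate(gammas, start=1))
    return linear + k_prime


# Made-up coefficients with a quadratic K(p).
x = [1.0, 2.0]
print(mte_series(x, [0.5, 0.1], [0.7, 0.3], [0.2, -0.4], p=0.5))
```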

Consider policies that affect $P(Z)$ but not $Y_1$, $Y_0$, $V$.
Let $\tilde P$ denote the propensity score under the new policy.
It can be shown that the effect of shifting to this new policy is given by
$$\int_0^1 MTE(x, u) \left[ \frac{F_{P \mid X = x}(u) - F_{\tilde P \mid X = x}(u)}{E(\tilde P \mid X = x) - E(P \mid X = x)} \right] du$$
This will still require large support for $P(Z)$.
Define a continuum of policies.
Consider a marginal change from baseline.

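MPRTE
Consider increasing tuition (a component of $Z$) by an amount $\alpha$: $\text{tuition}_\alpha = \text{tuition} + \alpha$. The corresponding propensity score is $P_\alpha$.
Define the MPRTE as
$$\lim_{\alpha \to 0} \int_0^1 MTE(x, u) \left[ \frac{F_{P_0 \mid X = x}(u) - F_{P_\alpha \mid X = x}(u)}{E(P_\alpha \mid X = x) - E(P_0 \mid X = x)} \right] du$$
This is also equal to $\lim_{e \to 0} E(Y_1 - Y_0 \mid |\mu_D(X, Z) - V| < e)$.
And it can be written as $\int_0^1 MTE(x, u)\, \omega(x, u)\, du$ where
$$\omega(x, u) = \frac{f_{P \mid X}(u)\, f_{V \mid X}(F_{V \mid X}^{-1}(u))}{E(f_{V \mid X}(\mu_D(X, Z)) \mid X)}$$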

Maximum likelihood
You've seen theoretical conditions for maximum likelihood estimation before. See Cameron and Trivedi for a review.
Here I want to practice constructing the likelihood for different models.
Suppose we observe a vector of outcomes $Y_i$ and covariates $X_i$. With iid data, the likelihood function is
$$L(\beta) = \prod_{i=1}^n f_{Y \mid X}(Y_i \mid X_i; \beta)$$
Let $\mathcal{L}(\beta) = \log(L(\beta)) = \sum_{i=1}^n \log(f_{Y \mid X}(Y_i \mid X_i; \beta))$. Then
$$\hat\beta_{MLE} = \arg\max_\beta \mathcal{L}(\beta)$$

Censoring and Truncation
Censoring and truncation occur when we only observe the value of a variable below or above some threshold.
Censoring occurs when we observe $y$ if $y > L$ but we still observe the individuals with $y < L$, just not their value of $y$.
Truncation occurs when we observe $y$ if $y > L$ and we don't observe the individuals with $y < L$ at all.

Censoring
Suppose the conditional density is given by $f(y_i^* \mid X_i)$. Then the log-likelihood is given by
$$\mathcal{L} = \sum_{i=1}^n \left[ 1(y_i > L) \ln(f(y_i \mid X_i)) + 1(y_i < L) \ln\left( \int_{-\infty}^{L} f(y \mid X_i)\, dy \right) \right]$$

Truncation
Suppose again that the conditional density is given by $f(y_i^* \mid X_i)$. Then the density conditional on being observed is given by $f(y_i \mid X_i) / \int_L^\infty f(y \mid X_i)\, dy$. Plug this into the likelihood to get
$$\mathcal{L} = \sum_{i=1}^n \left[ \ln(f(y_i \mid X_i)) - \ln\left( \int_L^\infty f(y \mid X_i)\, dy \right) \right]$$

Tobit
In the Tobit model, $y_i^* = \beta' X_i + \varepsilon_i$ where $\varepsilon_i$ is normally distributed, and we observe $y_i = \max(0, y_i^*)$.
This can be implemented quite easily in Stata using the tobit command.
As in the probit and logit models, consistency requires correct specification of the distribution of $\varepsilon_i$, a strong assumption (it excludes heteroskedasticity, for example)!
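
As practice in constructing a likelihood, here is a minimal sketch of the Tobit log-likelihood with left-censoring at a known limit: uncensored observations contribute the log normal density and censored observations contribute the log probability of falling at or below the limit. The function name and the toy data are assumptions for illustration; in applied work one would use a packaged routine such as Stata's tobit.

```python
import math
from statistics import NormalDist


def tobit_loglik(beta, sigma, xs, ys, limit=0.0):
    """Log-likelihood for y = max(limit, x'beta + eps), eps ~ N(0, sigma^2).

    xs is a list of covariate vectors, ys the observed (censored) outcomes.
    """
    std = NormalDist()
    total = 0.0
    for x, y in zip(xs, ys):
        mean = sum(b * xi for b, xi in zip(beta, x))
        if y > limit:
            # Uncensored: log of the N(mean, sigma^2) density at y.
            total += math.log(std.pdf((y - mean) / sigma) / sigma)
        else:
            # Censored: log Pr(y* <= limit | x).
            total += math.log(std.cdf((limit - mean) / sigma))
    return total


# Toy data: intercept plus one regressor, two censored observations.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [0.0, 0.0, 1.1, 2.3]
print(tobit_loglik([-1.0, 1.0], 1.0, xs, ys))
```

Maximizing this function over `beta` and `sigma` (for example with a numerical optimizer) gives the Tobit MLE.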

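Tobit
Note that in this model,
$$E(y_i \mid X_i, y_i > 0) = \beta' X_i + E(\varepsilon_i \mid -\beta' X_i < \varepsilon_i)$$
This second term is equal to
$$\sigma_\varepsilon\, \frac{\phi(-\beta' X_i/\sigma_\varepsilon)}{1 - \Phi(-\beta' X_i/\sigma_\varepsilon)}$$
if $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$. This is called the inverse Mills ratio.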

Tobit
An alternative is what Cameron and Trivedi call a two-part estimator.
This models censoring as a different function of $X_i$: $d_i = 1$ if the observation is censored, and $\Pr(d_i = 0 \mid X_i)$ is not given by $\Pr(\beta' X_i + \varepsilon_i > 0 \mid X_i)$.
This can be interpreted as a participation decision and a separate choice/determination of $y_i$ conditional on participation (COP).
See Angrist and Pischke for the reduced form approach to this.

Sample selection models
The sample selection model generalizes this by allowing different components of $X_i$ to enter the participation/selection equation and the outcome equation.
$$y_{1i}^* = \beta_1' X_{1i} + \varepsilon_{1i}$$
$$y_{2i}^* = \beta_2' X_{2i} + \varepsilon_{2i}$$
Then $y_{2i}^*$ is observed if and only if $y_{1i}^* > 0$.
Sometimes called a type 2 Tobit model.

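Sample selection models
This is easy to estimate via MLE if the distribution of $(\varepsilon_{1i}, \varepsilon_{2i})$ is known. With $y_{1i} = 1(y_{1i}^* > 0)$, the log-likelihood is
$$\mathcal{L} = \sum_{i=1}^n \left[ (1 - y_{1i}) \log\left( \Pr(y_{1i}^* \leq 0) \right) + y_{1i} \log\left( f(y_{2i} \mid y_{1i}^* > 0) \Pr(y_{1i}^* > 0) \right) \right]$$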
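
For a concrete case, suppose the two errors are bivariate normal with $\mathrm{Var}(\varepsilon_{1i}) = 1$, $\mathrm{Var}(\varepsilon_{2i}) = \sigma_2^2$ and correlation $\rho$ (an assumption made here for illustration). Then $f(y_{2i} \mid y_{1i}^* > 0)\Pr(y_{1i}^* > 0)$ factors into the marginal density of $y_{2i}$ times the conditional probability of selection given $y_{2i}$. The sketch below computes one observation's contribution; the function and argument names are hypothetical, with `m1` and `m2` standing for $\beta_1' X_{1i}$ and $\beta_2' X_{2i}$.

```python
import math
from statistics import NormalDist


def selection_loglik_obs(selected, m1, m2=0.0, y2=0.0, sigma2=1.0, rho=0.0):
    """Log-likelihood contribution of one observation in the selection model.

    selected: True if y1* > 0 (so y2 is observed), False otherwise.
    """
    std = NormalDist()
    if not selected:
        # Pr(y1* <= 0) = Phi(-m1); y2 is not observed.
        return math.log(std.cdf(-m1))
    r = (y2 - m2) / sigma2                                # standardized outcome residual
    cond_index = (m1 + rho * r) / math.sqrt(1.0 - rho**2)
    # Density of y2 times Pr(y1* > 0 | y2).
    return math.log(std.pdf(r) / sigma2) + math.log(std.cdf(cond_index))


print(selection_loglik_obs(False, m1=0.3))
print(selection_loglik_obs(True, m1=0.3, m2=1.0, y2=1.4, sigma2=0.8, rho=0.5))
```

Summing these contributions over observations and maximizing over the parameters gives the MLE; with `rho=0` the selected contribution splits into a probit term and an ordinary normal regression term, which is the case where selection is ignorable.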

Sample selection models
The sample selection model can also be estimated using the two-step estimator of Heckman (1979) (heckit in Stata).
The Roy model is a version of the sample selection model: if $y_{1i}^* < 0$ then we observe a different outcome, $y_{3i}^*$, and $y_{1i}^* = y_{2i}^* - y_{3i}^*$ or $y_{1i}^* = y_{2i}^* - y_{3i}^* + \gamma' Z_i$.
We'll see an example of heckit briefly next class.

Example
Another example (due to Petra Todd)
Consider the following model of consumption ($c_i$), wages ($w_i$), and labor force participation ($d_i$).
$$U_i = c_i + \alpha_i (1 - d_i) \quad \text{with} \quad \alpha_i = \kappa_i' \beta + \varepsilon_i$$
$$c_i = y_i + w_i d_i - \pi n_i d_i$$
$$w_i = z_i' \gamma + \eta_i$$
$$d_i = 1(z_i' \gamma - \pi n_i - \kappa_i' \beta + \eta_i - \varepsilon_i \geq 0)$$
where $\varepsilon_i$, $\eta_i$ are jointly normal.
Is it possible to identify the effect of a child care subsidy?

Example
Another example (due to Petra Todd)
If we have sufficient independent variation so that we can identify $\Pr(d_i = 1 \mid z_i, n_i, \kappa_i)$:
the estimated coefficients correspond to $\gamma/\sigma_\xi$, $\beta/\sigma_\xi$, and $(\beta_n + \pi)/\sigma_\xi$ where $\sigma_\xi^2 = \mathrm{Var}(\eta_i - \varepsilon_i)$
enough to identify the effect on LFP of the birth of a child, for example
not enough to identify the effect of a child care subsidy of a given amount $\tau$ because we don't observe variation in $\tau$
under the new policy, the participation probability becomes
$$\Pr(d_i = 1 \mid z_i, n_i, \kappa_i, \tau) = \Phi\left( \frac{z_i' \gamma - \kappa_i' \beta - (\beta_n + \pi - \tau) n_i}{\sigma_\xi} \right)$$

Example
Another example (due to Petra Todd)
We can estimate a sample selection model where $y_{2i} = w_i$ and $y_{1i} = d_i$.
We can generally identify $(\beta_n + \pi)/\sigma_\xi$.
If there is a variable in the wage equation that is not in $\kappa_i$ then we can identify $\sigma_\xi$.
This is what we need to identify the counterfactual!
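
A small numerical illustration of why the scale $\sigma_\xi$ matters. The sketch below uses the participation probability $\Phi\bigl((z_i'\gamma - \kappa_i'\beta - (\beta_n + \pi - \tau) n_i)/\sigma_\xi\bigr)$ with two made-up parameter sets that share the same ratios to $\sigma_\xi$. They are indistinguishable in baseline data ($\tau = 0$) but imply different responses to a subsidy. All names and numbers are hypothetical; `child_cost` stands for $\beta_n + \pi$.

```python
from statistics import NormalDist


def participation_prob(z_gamma, kappa_beta, child_cost, n_children, sigma_xi, tau=0.0):
    """Pr(d = 1) = Phi((z'gamma - kappa'beta - (child_cost - tau) * n) / sigma_xi)."""
    index = z_gamma - kappa_beta - (child_cost - tau) * n_children
    return NormalDist().cdf(index / sigma_xi)


# Parameter set B is parameter set A with every coefficient and sigma_xi doubled,
# so the probit coefficients (ratios to sigma_xi) are identical.
set_a = dict(z_gamma=1.2, kappa_beta=0.4, child_cost=0.3, n_children=2, sigma_xi=1.0)
set_b = dict(z_gamma=2.4, kappa_beta=0.8, child_cost=0.6, n_children=2, sigma_xi=2.0)

print(participation_prob(**set_a), participation_prob(**set_b))                    # same baseline
print(participation_prob(**set_a, tau=0.2), participation_prob(**set_b, tau=0.2))  # different
```

Because $\tau$ enters the index in dollar units rather than in units of $\sigma_\xi$, the subsidy effect depends on $\sigma_\xi$ itself, which is why the wage equation (which pins down $\gamma$ and hence $\sigma_\xi$) is needed for the counterfactual.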