Supplementary Appendix: Difference-in-Differences with Multiple Time Periods and an Application on the Minimum Wage and Employment

Brantly Callaway (Department of Economics, Temple University. Email: brantly.callaway@temple.edu)

Pedro H. C. Sant'Anna (Department of Economics, Vanderbilt University. Email: pedro.h.santanna@vanderbilt.edu)

August 31, 2018

This supplementary appendix contains (a) the proofs of the results stated in the main text; (b) results for the case where a researcher has access to repeated cross sections data rather than panel data; (c) extensions of our main results when using "not yet treated" observations as a control group; and (d) additional details on group-time average treatment effects under an unconditional parallel trends assumption, paying particular attention to the possibility of using regressions to estimate group-time average treatment effects.

Appendix A: Proofs of Main Results

We provide the proofs of our results in this appendix. Before proceeding, we first state and prove several auxiliary lemmas that help us prove our main theorems. Let $ATT_X(g,t) = E[Y_t(1) - Y_t(0) \mid X, G_g = 1]$.

Lemma A.1. Under Assumptions 1-4, and for $2 \le g \le t \le \mathcal{T}$,
\[
ATT_X(g,t) = E[Y_t - Y_{g-1} \mid X, G_g = 1] - E[Y_t - Y_{g-1} \mid X, C = 1] \quad \text{a.s.}
\]

Proof of Lemma A.1: In what follows, take all equalities to hold almost surely (a.s.). Notice that for identifying $ATT_X(g,t)$, the key term is $E[Y_t(0) \mid X, G_g = 1]$. And notice that, for $h > s$, $E[Y_s(0) \mid X, G_h = 1] = E[Y_s \mid X, G_h = 1]$, which holds because in time periods before an individual
is first treated, their untreated potential outcomes are observed outcomes. Also, note that, for $2 \le g \le t \le \mathcal{T}$,
\begin{align*}
E[Y_t(0) \mid X, G_g = 1] &= E[Y_t(0) - Y_{t-1}(0) \mid X, G_g = 1] + E[Y_{t-1}(0) \mid X, G_g = 1] \\
&= E[Y_t - Y_{t-1} \mid X, C = 1] + E[Y_{t-1}(0) \mid X, G_g = 1], \tag{A.1}
\end{align*}
where the first equality holds by adding and subtracting $E[Y_{t-1}(0) \mid X, G_g = 1]$, and the second equality holds by Assumption 2. If $g = t$, then the last term in the final equation is identified; otherwise, one can continue recursively in a similar way to (A.1) but starting with $E[Y_{t-1}(0) \mid X, G_g = 1]$. As a result,
\begin{align*}
E[Y_t(0) \mid X, G_g = 1] &= \sum_{j=0}^{t-g} E[Y_{t-j} - Y_{t-j-1} \mid X, C = 1] + E[Y_{g-1} \mid X, G_g = 1] \\
&= E[Y_t - Y_{g-1} \mid X, C = 1] + E[Y_{g-1} \mid X, G_g = 1]. \tag{A.2}
\end{align*}
Combining (A.2) with the fact that, for all $g \le t$, $E[Y_t(1) \mid X, G_g = 1] = E[Y_t \mid X, G_g = 1]$ (which holds because observed outcomes for group $g$ in period $t$ with $g \le t$ are treated potential outcomes) implies the result. $\blacksquare$

Next, recall that
\[
\hat{\pi}_g = \arg\max_{\pi} \sum_{i:\,G_{ig}+C_i=1} \left[ G_{ig} \ln\big(p_g(X_i'\pi)\big) + (1-G_{ig}) \ln\big(1-p_g(X_i'\pi)\big) \right],
\]
$\dot{p}_g(u) = \partial p_g(u)/\partial u$, $\dot{p}_g(X) = \dot{p}_g(X'\pi_g^0)$, and $\pi_g^0$ is the true, unknown vector of parameters indexing the generalized propensity score $p_g(X) = E[G_g \mid X, G_g + C = 1]$.

Lemma A.2. Under Assumption 5,
\[
\sqrt{n}\,(\hat{\pi}_g - \pi_g^0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_g^{\pi}(W_i) + o_p(1),
\quad \text{where} \quad
\xi_g^{\pi}(W) = E\!\left[\frac{(G_g+C)\,\dot{p}_g(X)^2}{p_g(X)\big(1-p_g(X)\big)}\, XX'\right]^{-1} X\, \frac{(G_g+C)\big(G_g - p_g(X)\big)\,\dot{p}_g(X)}{p_g(X)\big(1-p_g(X)\big)}.
\]

Proof of Lemma A.2: Let $n_{gc} = \sum_{i=1}^n (C_i + G_{ig})$. Under Assumption 5, from Theorem 5.39 and Example 5.40 in van der Vaart (1998), we have $\sqrt{n_{gc}}\,(\hat{\pi}_g - \pi_g^0)$
\begin{align*}
&= E\!\left[\left.\frac{\dot{p}_g(X)^2}{p_g(X)\big(1-p_g(X)\big)}\, XX' \,\right|\, G_g + C = 1\right]^{-1} \frac{1}{\sqrt{n_{gc}}} \sum_{i:\,G_{ig}+C_i=1} X_i\, \frac{\big(G_{ig} - p_g(X_i)\big)\,\dot{p}_g(X_i)}{p_g(X_i)\big(1-p_g(X_i)\big)} + o_p(1) \\
&= E[G_g+C]\,\sqrt{\frac{n}{n_{gc}}}\; \frac{1}{\sqrt{n}} \sum_{i=1}^n E\!\left[\frac{(G_g+C)\,\dot{p}_g(X)^2}{p_g(X)\big(1-p_g(X)\big)}\, XX'\right]^{-1} X_i\, \frac{(G_{ig}+C_i)\big(G_{ig} - p_g(X_i)\big)\,\dot{p}_g(X_i)}{p_g(X_i)\big(1-p_g(X_i)\big)} + o_p(1) \\
&= E[G_g+C]\,\sqrt{\frac{n}{n_{gc}}}\; \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_g^{\pi}(W_i) + o_p(1).
\end{align*}
Thus, since $n_{gc}/n \to_p E[G_g+C]$,
\[
\sqrt{n}\,(\hat{\pi}_g - \pi_g^0) = \sqrt{\frac{n}{n_{gc}}}\, \sqrt{n_{gc}}\,(\hat{\pi}_g - \pi_g^0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \xi_g^{\pi}(W_i) + o_p(1),
\]
and the proof is complete. $\blacksquare$

For an arbitrary $\pi$, let $p_g(x;\pi) = p_g(x'\pi)$ and $\dot{p}_g(x;\pi) = \dot{p}_g(x'\pi)$, for all $g = 2, \dots, \mathcal{T}$. Define the classes of functions
\begin{align*}
\mathcal{H}_{1,g} &= \left\{ (x,c) \mapsto \frac{c\,p_g(x;\pi)}{1-p_g(x;\pi)} : \pi \in \Pi_g \right\}, \\
\mathcal{H}_{2,g} &= \left\{ (x,c,y_t,y_{g-1}) \mapsto \frac{c\,p_g(x;\pi)}{1-p_g(x;\pi)}\,(y_t - y_{g-1}) : \pi \in \Pi_g \right\}, \\
\mathcal{H}_{3,g} &= \left\{ (x,c,y_t,y_{g-1}) \mapsto x\, \frac{c\,\dot{p}_g(x;\pi)}{\big(1-p_g(x;\pi)\big)^2}\,(y_t - y_{g-1}) : \pi \in \Pi_g \right\}, \\
\mathcal{H}_{4,g} &= \left\{ (x,c) \mapsto x\, \frac{c\,\dot{p}_g(x;\pi)}{\big(1-p_g(x;\pi)\big)^2} : \pi \in \Pi_g \right\}, \\
\mathcal{H}_{5,g} &= \left\{ (x,c,g_g) \mapsto x\, \frac{(g_g+c)\big(g_g - p_g(x;\pi)\big)\,\dot{p}_g(x;\pi)}{p_g(x;\pi)\big(1-p_g(x;\pi)\big)} : \pi \in \Pi_g \right\}.
\end{align*}

Lemma A.3. Under Assumptions 1 and 5, for all $g = 2, \dots, \mathcal{T}$ and $t = 2, \dots, \mathcal{T}$, the classes of functions $\mathcal{H}_{j,g}$, $j \in \{1, 2, \dots, 5\}$, are Donsker.

Proof of Lemma A.3: This follows from Example 19.7 in van der Vaart (1998). $\blacksquare$

Lemma A.4. Under Assumptions 1 and 5, the null hypothesis
\[
H_0: E[Y_t - Y_{t-1} \mid X, G_g = 1] - E[Y_t - Y_{t-1} \mid X, C = 1] = 0 \ \text{a.s. for all } 2 \le t < g \le \mathcal{T}
\]
can be equivalently written as
\[
H_0: E\!\left[\left( \frac{G_g}{E[G_g]} - \frac{\dfrac{p_g(X)\,C}{1-p_g(X)}}{E\!\left[\dfrac{p_g(X)\,C}{1-p_g(X)}\right]} \right) (Y_t - Y_{t-1}) \,\middle|\, X \right] = 0 \ \text{a.s. for all } 2 \le t < g \le \mathcal{T}.
\]

Proof of Lemma A.4: First note that
\[
E[Y_t - Y_{t-1} \mid X, G_g = 1] = \frac{E[G_g\,(Y_t - Y_{t-1}) \mid X]}{E[G_g \mid X]}.
\]
Analogously,
\[
E[Y_t - Y_{t-1} \mid X, C = 1] = \frac{E[C\,(Y_t - Y_{t-1}) \mid X]}{E[C \mid X]},
\]
implying that
\[
E[Y_t - Y_{t-1} \mid X, G_g = 1] - E[Y_t - Y_{t-1} \mid X, C = 1] = 0 \ \text{a.s. for all } 2 \le t < g \le \mathcal{T}
\]
if and only if
\[
E\!\left[\left( \frac{G_g}{E[G_g \mid X]} - \frac{C}{E[C \mid X]} \right)(Y_t - Y_{t-1}) \,\middle|\, X \right] = 0 \ \text{a.s. for all } 2 \le t < g \le \mathcal{T}.
\]
Given that, under Assumptions 4 and 5, $E[G_g + C \mid X] > 0$ a.s., the display above holds if and only if
\[
E\!\left[ \frac{1}{E[G_g + C \mid X]} \left( \frac{G_g}{E[G_g \mid X]} - \frac{C}{E[C \mid X]} \right)(Y_t - Y_{t-1}) \,\middle|\, X \right] = 0 \ \text{a.s. for all } 2 \le t < g \le \mathcal{T}. \tag{A.3}
\]
By noticing that
\[
p_g(X) = \frac{E[G_g \mid X]}{E[G_g + C \mid X]}, \qquad 1 - p_g(X) = \frac{E[C \mid X]}{E[G_g + C \mid X]},
\]
and that both of these are bounded away from zero under Assumption 5, we can rewrite (A.3) as
\[
E\!\left[\left( \frac{G_g}{E[G_g]} - \frac{\dfrac{p_g(X)\,C}{1-p_g(X)}}{E\!\left[\dfrac{p_g(X)\,C}{1-p_g(X)}\right]} \right)(Y_t - Y_{t-1}) \,\middle|\, X \right] = 0 \ \text{a.s. for all } 2 \le t < g \le \mathcal{T},
\]
since
\begin{align*}
E\!\left[\frac{p_g(X)\,C}{1-p_g(X)}\right]
&= E\!\left[\frac{E[G_g \mid X, C+G_g=1]}{E[C \mid X, C+G_g=1]}\, C\right]
= E\!\left[\frac{E[G_g \mid X]}{E[C \mid X]}\, C\right] \\
&= E\!\left[\frac{E[G_g \mid X]}{E[C \mid X]}\, E[C \mid X]\right]
= E\big[E[G_g \mid X]\big] = E[G_g]. \tag{A.4}
\end{align*}
This completes the proof. $\blacksquare$

Now, we are ready to proceed with the proofs of our main theorems.

Proof of Theorem 1: Given the result in Lemma A.1,
\begin{align*}
ATT(g,t) &= E\big[ATT_X(g,t) \mid G_g = 1\big] \\
&= E\big[\, E[Y_t - Y_{g-1} \mid X, G_g = 1] - E[Y_t - Y_{g-1} \mid X, C = 1] \mid G_g = 1 \big] \\
&:= E[A_X \mid G_g = 1] - E[B_X \mid G_g = 1],
\end{align*}
and we consider each term separately. For the first term,
\[
E[A_X \mid G_g = 1] = E[Y_t - Y_{g-1} \mid G_g = 1] = E\!\left[\frac{G_g}{E[G_g]}\,(Y_t - Y_{g-1})\right]. \tag{A.5}
\]
For the second term, by repetition of the law of iterated expectations, we have
\begin{align*}
E[B_X \mid G_g = 1]
&= E\big[\, E[Y_t - Y_{g-1} \mid X, C = 1] \mid G_g = 1 \big] \\
&= \frac{1}{E[G_g]}\, E\!\left[ E[G_g \mid X]\; \frac{E\big[C\,(Y_t - Y_{g-1}) \mid X, G_g + C = 1\big]}{1 - p_g(X)} \right] \\
&= \frac{1}{E[G_g]}\, E\!\left[ \frac{p_g(X)}{1 - p_g(X)}\; E[G_g + C \mid X]\; E\big[C\,(Y_t - Y_{g-1}) \mid X, G_g + C = 1\big] \right]
\end{align*}
\begin{align*}
&= \frac{1}{E[G_g]}\, E\!\left[ \frac{p_g(X)}{1 - p_g(X)}\; E\big[(G_g + C)\,C\,(Y_t - Y_{g-1}) \mid X\big] \right]
= \frac{1}{E[G_g]}\, E\!\left[ \frac{p_g(X)\,C}{1 - p_g(X)}\,(Y_t - Y_{g-1}) \right] \\
&= \frac{E\!\left[\dfrac{p_g(X)\,C}{1-p_g(X)}\,(Y_t - Y_{g-1})\right]}{E\!\left[\dfrac{p_g(X)\,C}{1-p_g(X)}\right]}, \tag{A.6}
\end{align*}
where (A.6) follows from (A.4). The proof is completed by combining (A.5) and (A.6). $\blacksquare$

Proof of Theorem 2: Remember that
\[
\widehat{ATT}(g,t) = \mathbb{E}_n\!\left[\frac{G_g}{\mathbb{E}_n[G_g]}\,(Y_t - Y_{g-1})\right] - \mathbb{E}_n\!\left[ \frac{\dfrac{\hat{p}_g(X)\,C}{1-\hat{p}_g(X)}}{\mathbb{E}_n\!\left[\dfrac{\hat{p}_g(X)\,C}{1-\hat{p}_g(X)}\right]}\,(Y_t - Y_{g-1}) \right] := \widehat{ATT}_G(g,t) - \widehat{ATT}_C(g,t),
\]
where $\mathbb{E}_n$ denotes the sample average, and
\[
ATT(g,t) = E\!\left[\frac{G_g}{E[G_g]}\,(Y_t - Y_{g-1})\right] - E\!\left[ \frac{\dfrac{p_g(X)\,C}{1-p_g(X)}}{E\!\left[\dfrac{p_g(X)\,C}{1-p_g(X)}\right]}\,(Y_t - Y_{g-1}) \right] := ATT_G(g,t) - ATT_C(g,t).
\]
In what follows, we will separately show that, for $2 \le g \le t \le \mathcal{T}$,
\[
\sqrt{n}\,\big(\widehat{ATT}_G(g,t) - ATT_G(g,t)\big) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi_{gt}^G(W_i) + o_p(1) \tag{A.7}
\]
and
\[
\sqrt{n}\,\big(\widehat{ATT}_C(g,t) - ATT_C(g,t)\big) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi_{gt}^C(W_i) + o_p(1). \tag{A.8}
\]
Then,
\[
\sqrt{n}\,\big(\widehat{ATT}(g,t) - ATT(g,t)\big) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi_{gt}(W_i) + o_p(1)
\]
holds from (A.7) and (A.8), and the asymptotic normality result follows from the application of the multivariate central limit theorem.
7 Let β g = G g and β g = n G g, and note that βg β g ) = G ig G g ). Then, for all 2 g t T, by the continuous mapping theorem, ÂT T g g, t) AT T g g, t)) = βg n G g Y t Y g ) G g Y t Y g )) concluding the proof of A.7). G g Y t Y g ) ) n βg β g = n G ig Y it Y ig ) G g Y t Y g )) β g G g Y t Y g ) ) n βg β g + o p ) = = := Next we focus on A.8). For an arbitrary function g, let and note that β 2 g Gig Y it Y ig ) G ) ig G g Y t Y g ) + o β g βg 2 p ) G ig Y it Y ig ) AT T g g, t)) + o p ) β g ψgtw G i ) + o p ), w g) = g X) C g X), ÂT T C g, t) AT T C g, t)) = n n w ˆp g ) Y t Y g ) w p g ) Y t Y g )) n w ˆp g ) w p g) Y t Y g ) n w ˆp g ) w p g )) n w ˆp g ) w p g ) := n w ˆp g ) na n ˆp g ) AT T Cg, t) n w ˆp g ) nb n ˆp g ). From Assumption 5, Lemmas A.2 and A.3, and the continuous mapping theorem, n w ˆp g ) = w p g ) + o p ), AT T C g, t) n w ˆp g ) = AT T Cg, t) + o p ). w p g ) 7
8 Thus, ÂT T C g, t) AT T C g, t)) = w p g ) na n ˆp g ) Applying a classical mean value theorem argument, AT T Cg, t) w p g ) nb n ˆp g ) + o p ) A.9) A n ˆp g ) = n w p g ) Y t Y g ) w p g ) Y t Y g ) ) 2 C + n X ṗ g X; π g ) Y it Y ig ) ˆπg π 0 p g X; π g ) g), where π is an intermediate point that satisfies πg πg 0 ˆπg πg 0 a.s. Thus, by Assumption 5, Lemmas A.2 and A.3, and the Classical Glivenko-Cantelli s theorem, Analogously, A n ˆp g ) = n w p g ) Y t Y g ) w p g ) Y t Y g ) A.0) ) 2 C + X ṗ g X) Y it Y ig ) ˆπg π 0 p g X) g) ) + op n /2. A.) B n ˆp g ) = n w p g ) w p g ) ) 2 C + X ṗ g X) ˆπg π 0 p g X) g) ) + op n /2. A.2) Then, A.9), A.0), A.2) and Lemma A.2 yield A.8), concluding the proof. Proof of Theorem 3: Note that, by the conditional multiplier central limit theorem, see Lemma in van der Vaart and Wellner 996), as n, V i Ψ g t W i ) d N0, Σ), A.3) where Σ = Ψ g t W)Ψ g t W). Thus, to conclude the proof that ÂT T g t ÂT T g t ) d N0, Σ), it suffices to show that, for all 2 g t T, V i ψgt W i ) ψ gt W i ) = o p ). 8
9 Towards this, note that V i ψgt W i ) ψ gt W i ) = n V i V i ψg gtw i ) ψgtw G i ) ψc gtw i ) ψgtw C i ), A.4) where and with ψ G gtw) = G g Y t Y g ) ÂT T g g, t), n G g ψ gtw) C = w ˆp g) Y it Y ig ) ÂT T C g, t) + n w ˆp g ) M gt ξ g π W), w ˆp g ) = ˆp g X) C ˆp g X), ) 2 C n X ṗ ˆp g X) g X) Y it Y ig ) ÂT T g g, t) M gt =, n w ˆp g ) Gg + C) ṗ ξ g π g X) 2 W) = n ˆp g X) ˆp g X)) XX X G g + C) G g ˆp g X)) ṗ g X). ˆp g X) ˆp g X)) We will show that each term in A.4) is o p ). For the first term in A.4), we have = V i ψg gtw i ) ψgtw G i ) n G g G g n V i G ig Y it Y ig ) ÂT T g g, t) AT T g g, t) n V i G ig, = o p ), A.5) where the last equality follows from the results in Theorem, together with the law of large numbers, continuous mapping theorem, and Lemma in van der Vaart and Wellner 996). For the second term in A.4), we have = V i n w ˆp g ) n ψc gtw i ) ψgtw C i ) V i w i ˆp g ) w i p g )) Y it Y ig ) 9
10 + n w ˆp g ) + Mgt M gt ) n + M gt n V i ) n V i w i p g ) Y it Y ig ) w p g ) V i ξg π W i ). := A n + A 2n + A 3n + A 4n. ξπ g W i ) ξ π g W i ) From Lemma A.3, we have that H,g, H 2,g, H 3,g and H 5,g are Donsker, and by Assumption 5, w p g ) it is bounded away from zero. Thus, by a stochastic equicontinuity argument, Glivenko- Cantelli s theorem, continuous mapping theorem, and Theorem in van der Vaart and Wellner 996), implying that A n = o p ), A 2n = o p ), A 3n = o p ), and A 4n = o p ), V i From A.3)-A.6), it follows that ) ψc gtw i ) ψgtw C i ) = o p ). A.6) ÂT T g t ÂT T g t ) d N0, Σ). Finally, by the continuous mapping theorem, see e.g. Theorem 0.8 in Kosorok 2008), for any continuous functional Γ ) n )) d Γ ÂT T g t ÂT T g t Γ N0, V )), concluding our proof. Proof of Theorem 4: In order to prove the first part of Theorem 4, we first show that, under H 0, for all 2 t < g T, Ĵu, g, t, ˆp g ) = n ψ test ugt W i ) + o p n /2 ), Towards this end, we write Gg Ĵu, g, t, ˆp g ) = n n G g X u) Y t Y t ) ˆp g X) C n ˆp g X) X u) Y t Y t ) ˆpg X) C n ˆp g X) 0
11 and analyze each term separately. := ĴGu, g, t, ˆp g ) ĴCu, g, t, ˆp g ), As in the proof of Theorem, let β g = G g and β g = n G g. Applying a classical mean value theorem argument, uniformly in u X, Gg Ĵ G u, g, t, ˆp g ) = n X u) Y t Y t ) β g n G g X u) Y t Y t ) β 2 g where β g is an intermediate point that satisfies βg β g functions n G g G g. H 6,g = {x, g g, y t, y t ) g g y t y t ) {x u} : u X }. β g β g a.s.. Define the class of By xample 9. in van der Vaart 998), H 6,g is Donsker under Assumption 5. Furthermore, n G g G g = O p n /2 ). Thus, by the Glivenko-Cantelli s theorem and the continuous mapping theorem, uniformly in u X, Gg Ĵ G u, g, t, ˆp g ) = n G g X u) Y t Y t ) where J Gu, g, t, p g ) G g n G g G g + o p n /2 ) = n w G g Yt Y t ) X u) w G g X u) Y t Y t ) ) + J G u, g, t, p g ) + o p n /2 ), Gg J G u, g, t, p g ) = G g X u) Y t Y t ). A.7) We analyze ĴCu, g, t, ˆp g ) next. Applying a classical mean value theorem argument, uniformly in u X, Ĵ C u, g, t, ˆp g ) = ĴCu, g, t, p g ) n X C ṗ g X; π g ) p g X; π g )) 2 X u) Y t Y t ) + ˆπg π pg X; π g ) C g) 0 n p g X; π g )
12 n X C ṗ g X; π g ) pg X; π g ) C p g X; π g )) 2 n p g X; π g ) X u) Y t Y t ) ˆπg π pg X; π g ) C pg X; π g ) C g) 0 n n p g X; π g ) p g X; π g ) where π is an intermediate point that satisfies πg πg 0 ˆπg πg 0 a.s, and pg X) C n p g X) X u) Y t Y t ) Ĵ C u, g, t, p g ) =. pg X) C n p g X) Define the classes of functions { H 7,g = x, c, y t, y t ) p } g x; π) p g x; π) c y t y t ) {x u} : π Π g, u X, { } x; π) c y t y t ) {x u} H 8,g = x, c, y t, y t ) xṗg p g x; π)) 2 : π Π g, u X, { H 9,g = x, c) cp } g x; π) p g x; π) : π Π g,, { } ṗ g x; π) c H 0,g = x, c) x p g x; π)) 2 : π Π g. By xamples 9.7, 9., and 9.20 in van der Vaart 998), all these classes of functions are Donsker under Assumption 5. theorem, and Lemma A.2, uniformly in u X, for every g, t. Denote Thus, by the Glivenko-Cantelli s theorem, continuous mapping Ĵ C u, g, t, ˆp g ) = ĴCu, ) g, t, p g ) + Mugt test ˆπg π ) g 0 + op n /2, A.8) ˆβ C g pg X) C = n, β C pg X) C g =. p g X) p g X) Applying a classical mean value theorem argument, we have pg X) C n p g X) X u) Y t Y t ) Ĵ C u, g, t, p g ) = pg X) C p g X) pg X) C n p g X) X u) Y t Y t ) pg X) C ) 2 n βc g p g X) pg X) C p g X) 2
13 where β g C is an intermediate point that satisfies βc g βg C βc g βg C a.s.. Since H 7,g is a Donsker Class of functions and pg X) C n p g X) pg X) C = O ) p n /2, p g X) we have that, by the Glivenko-Cantelli s theorem and the continuous mapping theorem, uniformly in u X, Ĵ C u, g, t, p g ) = n w C g Y t Y t ) X u) wg C Y t Y t ) X u) pg X) C n pg X) C p g X) pg X) C + o ) p n /2 p g X) p g X) = n w C g Yt Y t ) X u) w C g X u) Y t Y t ) ) + J C u, g, t, p g ) + o p n /2 ). A.9) Hence, from A.7), A.8), A.9), and the asymptotic linear representation of ˆπ g π 0 g) in Lemma A.2, for every g, t, Ĵu, g, t, ˆp g ) = n ψ test ugt W) + J G u, g, t, p g ) J C u, g, t, p g )) + o p n /2 ) A.20) By noticing that under H 0, J G u, g, t, p g ) = J C u, g, t, p g ) for all u X, g, t) such that 2 t < g T, we have that, under H 0, uniformly in u X, for all 2 t < g T Ĵu, g, t, ˆp g ) = n ψ test ugt W i ) + o p n /2 ). In order to show that Ĵg>t u) Gu) in l X ), it suffices to show that the class of functions is Donsker. H 0 = { x, g g, c, y t, y t ) ψ test ugt : u X, 2 t < g T } This follows straightforwardly from the previously discussed Donsker results and xample 9.20 in van der Vaart 998). Finally, CvM d n Gu) 2 M F X du) follows from the continuous mapping theorem, and the Helly-Bray Theorem. X sup F n,x u) F X u) = o a.s. ), u X Next, we study the behavior of CvM n under H. First, note that under H, for some u X, 3
and some $(g,t)$ with $2 \le t < g \le \mathcal{T}$, $J(u,g,t,p_g) \ne 0$. Thus, from (A.20), under $H_1$, uniformly in $u \in \mathcal{X}$, $\hat{J}_{g>t}(u) = J_{g>t}(u) + O_p(n^{-1/2})$, implying that $CvM_n$ diverges to infinity under $H_1$. Because $\hat{c}_{1-\alpha}^{CvM} = O(1)$ a.s., as $n \to \infty$,
\[
P\big(CvM_n > \hat{c}_{1-\alpha}^{CvM}\big) \to 1,
\]
concluding the proof of Theorem 4. $\blacksquare$

Proof of Theorem 5: In the proof of Theorem 4, we have shown that
\[
\mathcal{H} = \big\{ (x, g_g, c, y_t, y_{t-1}) \mapsto \psi_{ugt}^{test} : u \in \mathcal{X},\ 2 \le t < g \le \mathcal{T} \big\}
\]
is a Donsker class of functions. Then, by the conditional multiplier functional central limit theorem (see Theorem 2.9.6 in van der Vaart and Wellner (1996)), as $n \to \infty$,
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^n V_i\, \Psi_{g>t}^{test}(W_i) \rightsquigarrow_B G(u) \ \text{in } \ell^\infty(\mathcal{X}),
\]
where $G(u)$ is the same Gaussian process as in Theorem 4 and $\rightsquigarrow_B$ indicates weak convergence in probability under the bootstrap law. Thus, to conclude the proof, it suffices to show that, for all $2 \le t < g \le \mathcal{T}$, uniformly in $u \in \mathcal{X}$,
\[
\frac{1}{\sqrt{n}} \sum_{i=1}^n V_i \big( \hat{\psi}_{ugt}^{test}(W_i) - \psi_{ugt}^{test}(W_i) \big) = o_p(1). \tag{A.21}
\]
The proof of (A.21) follows exactly the same steps as the proof of Theorem 3, and is therefore omitted. $\blacksquare$

Appendix B: Additional Results for Repeated Cross Sections

In this section we extend our results to the case with repeated cross sections data instead of panel data. Here we assume that, for each individual in the pooled sample, we observe $(Y, G_1, \dots, G_{\mathcal{T}}, C, T, X)$, where $T \in \{1, \dots, \mathcal{T}\}$ denotes the time period in which that individual is observed. Let $T_t = 1$ if an observation is observed at time $t$, and zero otherwise. We assume that random samples are available for each time period.
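The sampling scheme just described can be made concrete with a small simulation. The sample size, number of periods, and sampling shares below are purely illustrative, and the variable names are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 1000, 4
lam = np.array([0.25, 0.25, 0.25, 0.25])   # lambda_t = P(T = t), illustrative shares

# Each individual is observed in exactly one period T_i, drawn with probabilities lambda_t
T_i = rng.choice(np.arange(1, T + 1), size=n, p=lam)

# Time-period indicators: T_ind[i, t-1] = 1 if individual i is observed at time t
T_ind = np.stack([(T_i == t).astype(int) for t in range(1, T + 1)], axis=1)

# Every unit is observed exactly once, so each row has exactly one nonzero entry
assert (T_ind.sum(axis=1) == 1).all()
```

Conditioning on a column of `T_ind` then recovers a random sample from the corresponding period, which is all that the random-sampling condition above requires.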
Assumption B.1. Conditional on $T = t$, the data are independent and identically distributed draws from the distribution of $(Y_t, G_1, \dots, G_{\mathcal{T}}, C, X)$, for all $t = 1, \dots, \mathcal{T}$.

Assumption B.1 implies that our sample consists of random draws from the mixture distribution
\[
F_M(y, g_1, \dots, g_{\mathcal{T}}, c, t, x) = \sum_{t=1}^{\mathcal{T}} \lambda_t\, F_{Y_t, G_1, \dots, G_{\mathcal{T}}, C, X \mid T}(y, g_1, \dots, g_{\mathcal{T}}, c, x \mid t),
\]
where $\lambda_t = P(T_t = 1)$. Notice that, once one conditions on the time period, expectations under the mixture distribution correspond to population expectations. Also, because $X$, $G_g$, and $C$ are observed for all individuals, one can use draws from the mixture distribution to estimate the generalized propensity score. With some abuse of notation, we then use $p_g(X)$ as a short notation for $E_M[G_g \mid X, G_g + C = 1]$, where $E_M$ denotes expectations with respect to $F_M(\cdot)$. Define the stabilized weights
\[
w^{treat}(a,b) = \frac{T_b\, G_a}{E_M[T_b\, G_a]}, \qquad
w^{cont}(a,b) = \frac{\dfrac{T_b\, p_a(X)\, C}{1 - p_a(X)}}{E_M\!\left[\dfrac{T_b\, p_a(X)\, C}{1 - p_a(X)}\right]},
\]
where $a, b = 1, 2, \dots, \mathcal{T}$.

Theorem B.1. Under Assumption B.1 and Assumptions 2-4 in the main text, for $2 \le g \le t \le \mathcal{T}$, the group-time average treatment effect for group $g$ in period $t$ is nonparametrically identified, and given by
\[
ATT(g,t) = E_M\big[\big(w^{treat}(g,t) - w^{treat}(g,g-1)\big)\, Y\big] - E_M\big[\big(w^{cont}(g,t) - w^{cont}(g,g-1)\big)\, Y\big].
\]

Proof of Theorem B.1: By the law of iterated expectations, Assumption B.1 and Assumption 3 in the main text, for all $2 \le g \le t \le \mathcal{T}$,
\[
E_M\big[w^{treat}(g,t)\, Y\big] = \frac{E_M[T_t\, G_g\, Y]}{E_M[T_t\, G_g]} = \frac{E[G_g\, Y \mid T_t = 1]}{E[G_g \mid T_t = 1]} = E[Y \mid T_t = 1, G_g = 1] = E[Y_t(1) \mid G_g = 1].
\]
To complete the proof of Theorem B.1, we must show that
\[
E_M\big[\big(w^{treat}(g,g-1) + w^{cont}(g,t) - w^{cont}(g,g-1)\big)\, Y\big] = E[Y_t(0) \mid G_g = 1]. \tag{B.1}
\]
16 Towards this, from Assumption B. and proceeding as in Lemma A., we get Y t 0) X, G g = =Y 0) X, G g =, T t = From the above result, it follows that = Y X, G g =, T g = B.2) + Y X, C =, T t = Y X, C =, T g =. Y t 0) X, G g = = Y X, G g =, T g = G g =, T g = + Y X, C =, T t = G g =, T t = B.3) Y X, C =, T g = G g =, T g =. We consider each term separately. For the first term of B.3), Y X, G g =, T g = G g =, T g = = Y G g =, T g = = M w treat t, g) Y. B.4) Let Y X, C =, T t = = A C=,Tt= X), and note that, by repeated application of the law of iterated expectations as in the proof of Theorem, we have that for the second term of B.3), A C=,Tt= X) G g =, T t = = G g T t = pg X) C p g X)) Y T t = = M G g T t Tt p g X) C M p g X)) Y = M w cont g, t) Y, B.5) where the last equality follows from p g X) := M G g X, G g + C =, and Tt p g X) C M = M T t M G g X, C + G g = C p g X)) M C X, C + G g = = M T t M G g X C M C X M G g X M C X = M M C X = M T t G g X = M T t G g. Following analogous steps, we get that, for the third term of B.3), A C=,Tg = X) Gg =, T g = = M w cont g, g ) Y. B.6) 6
Then, (B.1) follows by combining (B.4), (B.5) and (B.6). The proof of Theorem B.1 is therefore completed. $\blacksquare$

The identification results in Theorem B.1 suggest a simple two-step estimation procedure for the $ATT(g,t)$ with repeated cross-section data. Similar to the panel data case discussed in the main text, we propose to estimate $ATT(g,t)$ by
\[
\widehat{ATT}(g,t) = \mathbb{E}_n\big[\big(\hat{w}^{treat}(g,t) - \hat{w}^{treat}(g,g-1)\big)\, Y\big] - \mathbb{E}_n\big[\big(\hat{w}^{cont}(g,t;\hat{p}) - \hat{w}^{cont}(g,g-1;\hat{p})\big)\, Y\big],
\]
where $\hat{p}_g(\cdot)$ is an estimate of $p_g(\cdot)$ and, for $a, b = 1, 2, \dots, \mathcal{T}$,
\[
\hat{w}^{treat}(a,b) = \frac{T_b\, G_a}{\mathbb{E}_n[T_b\, G_a]}, \qquad
\hat{w}^{cont}(a,b;\hat{p}) = \frac{\dfrac{T_b\, \hat{p}_a(X)\, C}{1 - \hat{p}_a(X)}}{\mathbb{E}_n\!\left[\dfrac{T_b\, \hat{p}_a(X)\, C}{1 - \hat{p}_a(X)}\right]}.
\]
Next, we show that $\widehat{ATT}(g,t)$ is $\sqrt{n}$-consistent, admits an asymptotically linear representation, and is asymptotically normal. These results are analogous to Theorem 2 in the main text. Let $ATT_{g \le t}$ and $\widehat{ATT}_{g \le t}$ denote the vectors of $ATT(g,t)$ and $\widehat{ATT}(g,t)$, respectively, for all $g = 2, \dots, \mathcal{T}$ and $t = 2, \dots, \mathcal{T}$ with $g \le t$. Define
\[
\psi_{g,t}^{rc}(W_i) = \big(\psi_{g,t}^{rc,G}(W_i) - \psi_{g,g-1}^{rc,G}(W_i)\big) - \big(\psi_{g,t}^{rc,C}(W_i) - \psi_{g,g-1}^{rc,C}(W_i)\big),
\]
where, for $g, t = 1, 2, \dots, \mathcal{T}$,
\[
\psi_{g,t}^{rc,G}(W) = w^{treat}(g,t)\, Y - E_M\big[w^{treat}(g,t)\, Y\big]
\]
and
\[
\psi_{g,t}^{rc,C}(W) = w^{cont}(g,t)\, Y - E_M\big[w^{cont}(g,t)\, Y\big] + M_{g,t}^{rc\,\prime}\, \xi_g^{\pi}(W),
\]
with
\[
M_{g,t}^{rc} = \frac{E_M\!\left[ X\, \dfrac{T_t\, C\, \dot{p}_g(X)}{\big(1 - p_g(X)\big)^2}\, \big( Y - E_M[w^{cont}(g,t)\, Y] \big) \right]}{E_M\!\left[\dfrac{T_t\, p_g(X)\, C}{1 - p_g(X)}\right]},
\]
which is a $k \times 1$ vector, with $k$ the dimension of $X$, and $\xi_g^{\pi}(W)$ is as defined in (3.1) in the main text. Finally, let $\Psi_{g \le t}^{rc}$ denote the collection of $\psi_{g,t}^{rc}$ across all periods $t$ and groups $g$ such that $g \le t$.

Theorem B.2. Under Assumption B.1 and Assumptions 2-5 in the main text, for $2 \le g \le t \le \mathcal{T}$,
\[
\sqrt{n}\,\big(\widehat{ATT}(g,t) - ATT(g,t)\big) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi_{g,t}^{rc}(W_i) + o_p(1).
\]
18 Furthermore, ÂT T g t AT T g t ) d N0, Σ rc ) where Σ rc = M Ψ rc g tw)ψ rc g tw). Proof of Theorem B.2: The proof of Theorem B.2 follows the same steps as the Proof of Theorem 2. From Theorem B., for each 2 g t T we can write ÂT T g, t) AT T g, t)) = n ŵ treat g, t) Y M w treat g, t) Y ) n ŵ treat g, g ) Y M w treat g, g ) Y ) n ŵ cont g, t; ˆp) Y M w cont g, t) Y ) + n ŵ cont g, g ; ˆp) Y M w cont g, g ) Y ). B.7) We analyze each term separately. First, note that, for each 2 g t T, n T t G g M T t G g ) = Then, by the continuous mapping theorem, Analogously, T it G ig T t G g ). n ŵ treat g, t) Y M w treat g, t) Y ) = ψ rc,g g,t W i ) + o p ). B.8) n ŵ treat g, g ) Y M w treat g, g ) Y ) = ψ rc,g g,g W i ) + o p ). B.9) Next we focus on n ŵ cont g, t; ˆp) Y M w cont g, t) Y ). To simplify notation, write w a,b p) = T b p a X) C p a X), and note that ŵ cont g, t; ˆp) = w g,t ˆp) / n w g,t ˆp) and w cont g, t; p) = w g,t p) / M w g,t p). Then, n ŵ cont g, t; ˆp) Y M w cont g, t) Y ) = n n w g,t ˆp) Y M w g,t p) Y ) n w g,t ˆp) M w g,t p) Y n n w g,t ˆp) M w g,t p)) n w g,t ˆp) M w g,t p) 8
19 := na rc n, g,t ˆp g ) M w cont g, t) Y nb rc n, g,t ˆp g ). n w g,t ˆp) n w g,t ˆp) From Assumption 5, Lemmas A.2 and A.3, and the continuous mapping theorem, Thus, n w g,t ˆp) = M w g,t p) + o p ), M w cont g, t) Y n w g,t ˆp) = M w cont g, t) Y M w g,t p) + o p ). n ŵ cont g, t; ˆp) Y M w cont g, t) Y ) na rc = n, g,t ˆp g ) M w g,t p) M w cont g, t) Y nb rc n, g,t ˆp g ) + o p ) B.0) M w g,t p) Applying a classical mean value theorem argument, A rc n, g,t ˆp g ) = n w g,t p) Y M w g,t p) Y + n X T t C p g X; π g ) ) 2 ṗ g X; π g ) Y ˆπg π 0 g), where π is an intermediate point that satisfies πg πg 0 ˆπg πg 0 a.s. Thus, by Assumption 5, Lemmas A.2 and A.3, and the Glivenko-Cantelli s theorem, A rc n, g,t ˆp g ) = n w g,t p) Y M w g,t p) Y ) 2 Tt C + M X ṗ g X) Y p g X) Analogously, ˆπg π 0 g) + op n /2 ). B.) B n ˆp g ) = n w g,t p) M w g,t p) + M X Combining B.0), B.), B.2) with Lemma A.2 yield n ŵ cont g, t; ˆp) Y M w cont g, t) Y ) = ) 2 Tt C ṗ g X) ˆπg π 0 p g X) g) ) + op n /2. B.2) ψ rc,c g,t W i ) + o p ). B.3) 9
Using the same arguments, we conclude that
\[
\sqrt{n}\,\big(\mathbb{E}_n\big[\hat{w}^{cont}(g,g-1;\hat{p})\, Y\big] - E_M\big[w^{cont}(g,g-1)\, Y\big]\big) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi_{g,g-1}^{rc,C}(W_i) + o_p(1). \tag{B.14}
\]
Hence, from (B.7), (B.8), (B.9), (B.13) and (B.14), we conclude that, for each $2 \le g \le t \le \mathcal{T}$,
\[
\sqrt{n}\,\big(\widehat{ATT}(g,t) - ATT(g,t)\big) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi_{g,t}^{rc}(W_i) + o_p(1).
\]
The proof is then completed by applying the multivariate central limit theorem. $\blacksquare$

Based on the above results, one can conclude that estimation and inference procedures for $ATT(g,t)$ in the case of repeated cross sections are similar to those in the case with panel data; in fact, one simply needs to adjust the weights slightly. In order to conduct asymptotically valid simultaneous inference, one can leverage the asymptotic linear representation in Theorem B.2 and use a multiplier bootstrap procedure analogous to the one in Theorem 3. The proof of the bootstrap validity in the repeated cross sections case follows exactly the same steps as in Theorem 3, and is therefore omitted.

Appendix C: Analysis with Not Yet Treated as a Control Group

In this appendix, we discuss the case where one considers the "not yet treated" instead of the "never treated" as a control group. This case is particularly relevant in applications where eventually (almost) all units are treated, though the timing of the treatment differs across groups. To carry out this analysis, we make the following assumptions.

Assumption C.1. $\{Y_{i1}, Y_{i2}, \dots, Y_{i\mathcal{T}}, X_i, D_{i1}, D_{i2}, \dots, D_{i\mathcal{T}}\}_{i=1}^n$ is independent and identically distributed (iid).

Assumption C.2. For all $t = 2, \dots, \mathcal{T}$ and $g = 2, \dots, \mathcal{T}$ such that $g \le t$,
\[
E[Y_t(0) - Y_{t-1}(0) \mid X, G_g = 1] = E[Y_t(0) - Y_{t-1}(0) \mid X, D_t = 0] \quad \text{a.s.}
\]

Assumption C.3. For $t = 2, \dots, \mathcal{T}$, $D_{t-1} = 1$ implies that $D_t = 1$.

Assumption C.4. For all $t = 2, \dots, \mathcal{T}$ and $g = 2, \dots, \mathcal{T}$, $P(G_g = 1) > 0$ and $P(D_t = 1 \mid X) < 1$ a.s.
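Assumption C.3 says that treatment adoption is staggered: once a unit becomes treated, it stays treated. This condition is mechanical and can be checked directly in data. A minimal sketch, assuming the treatment indicators are stored as an n-by-T array of zeros and ones (the function name is ours, not from the text):

```python
import numpy as np

def staggered_adoption_holds(D):
    """Check Assumption C.3: D[i, t-1] = 1 implies D[i, t] = 1, i.e.
    every unit's treatment path is non-decreasing over time.

    D : (n_units, n_periods) array of 0/1 treatment indicators.
    """
    # First differences along the time axis are >= 0 iff no unit ever reverts
    return bool(np.all(np.diff(D, axis=1) >= 0))
```

For example, a path `[0, 0, 1, 1]` satisfies the assumption, while `[0, 1, 0, 1]` (a treatment reversal) does not.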
Assumptions C.1 and C.3 are the same as Assumptions 1 and 3 in the main text. Assumptions C.2 and C.4 are the analogues of Assumptions 2 and 4, but using the not yet treated ($D_t = 0$) as a control group instead of the never treated ($C = 1$, or equivalently $D_{\mathcal{T}} = 0$). Note that Assumption C.4 rules out the case in which eventually everyone is treated; in those time periods there is no control group available, and therefore the data themselves are not informative about the average treatment effect when $D_t = 1$ a.s. In such cases, one should concentrate attention only on the time periods such that $P(D_t = 1 \mid X) < 1$ a.s.

Remember that $ATT_X(g,t) = E[Y_t(1) - Y_t(0) \mid X, G_g = 1]$. The next lemma states that, under Assumptions C.1-C.4, we can identify $ATT_X(g,t)$ for $2 \le g \le t \le \mathcal{T}$. This is the analogue of Lemma A.1.

Lemma C.1. Under Assumptions C.1-C.4, and for $2 \le g \le t \le \mathcal{T}$,
\[
ATT_X(g,t) = E[Y_t - Y_{g-1} \mid X, G_g = 1] - E[Y_t - Y_{g-1} \mid X, D_t = 0] \quad \text{a.s.}
\]

Proof of Lemma C.1: In what follows, take all equalities to hold almost surely (a.s.). Notice that for identifying $ATT_X(g,t)$, the key term is $E[Y_t(0) \mid X, G_g = 1]$. And notice that, for $h > s$, $E[Y_s(0) \mid X, G_h = 1] = E[Y_s \mid X, G_h = 1]$, which holds because in time periods before an individual is first treated, their untreated potential outcomes are observed outcomes. Also, note that, for $2 \le g \le t \le \mathcal{T}$,
\begin{align*}
E[Y_t(0) \mid X, G_g = 1] &= E[Y_t(0) - Y_{t-1}(0) \mid X, G_g = 1] + E[Y_{t-1}(0) \mid X, G_g = 1] \\
&= E[Y_t - Y_{t-1} \mid X, D_t = 0] + E[Y_{t-1}(0) \mid X, G_g = 1], \tag{C.1}
\end{align*}
where the first equality holds by adding and subtracting $E[Y_{t-1}(0) \mid X, G_g = 1]$ and the second equality holds by Assumption C.2. If $g = t$, then the last term in the final equation is identified; otherwise, one can continue recursively in a similar way to (C.1) but starting with $E[Y_{t-1}(0) \mid X, G_g = 1]$. As a result,
\begin{align*}
E[Y_t(0) \mid X, G_g = 1] &= \sum_{j=0}^{t-g} E[Y_{t-j} - Y_{t-j-1} \mid X, D_t = 0] + E[Y_{g-1} \mid X, G_g = 1] \\
&= E[Y_t - Y_{g-1} \mid X, D_t = 0] + E[Y_{g-1} \mid X, G_g = 1]. \tag{C.2}
\end{align*}
Combining (C.2) with the fact that, for all $g \le t$, $E[Y_t(1) \mid X, G_g = 1] = E[Y_t \mid X, G_g = 1]$ (which holds because observed outcomes for group $g$ in period $t$ with $g \le t$ are treated potential outcomes) implies the result. $\blacksquare$

With the result of Lemma C.1
in hand, we proceed to show that $ATT(g,t)$ is nonparametrically identified under Assumptions C.1-C.4 and for $2 \le g \le t \le \mathcal{T}$. The following Theorem C.1
is the analogue of Theorem 1.

Theorem C.1. Under Assumptions C.1-C.4 and for $2 \le g \le t \le \mathcal{T}$, the group-time average treatment effect for group $g$ in period $t$ is nonparametrically identified, and given by
\[
ATT(g,t) = E\!\left[\left( \frac{G_g}{E[G_g]} - \frac{\dfrac{P(G_g = 1 \mid X)\,(1 - D_t)}{1 - P(D_t = 1 \mid X)}}{E\!\left[\dfrac{P(G_g = 1 \mid X)\,(1 - D_t)}{1 - P(D_t = 1 \mid X)}\right]} \right)(Y_t - Y_{g-1})\right].
\]

Proof of Theorem C.1: Given the result in Lemma C.1,
\begin{align*}
ATT(g,t) &= E\big[ATT_X(g,t) \mid G_g = 1\big] \\
&= E\big[\, E[Y_t - Y_{g-1} \mid X, G_g = 1] - E[Y_t - Y_{g-1} \mid X, D_t = 0] \mid G_g = 1 \big] \\
&:= E[A_X \mid G_g = 1] - E[B_X^{n.yet} \mid G_g = 1], \tag{C.3}
\end{align*}
and we consider each term separately. For the first term,
\[
E[A_X \mid G_g = 1] = E[Y_t - Y_{g-1} \mid G_g = 1] = E\!\left[\frac{G_g}{E[G_g]}\,(Y_t - Y_{g-1})\right]. \tag{C.4}
\]
For the second term, by repetition of the law of iterated expectations, we have
\begin{align*}
E[B_X^{n.yet} \mid G_g = 1]
&= E\big[\, E[Y_t - Y_{g-1} \mid X, D_t = 0] \mid G_g = 1 \big] \\
&= \frac{1}{E[G_g]}\, E\!\left[ E[G_g \mid X]\; \frac{E\big[(1 - D_t)(Y_t - Y_{g-1}) \mid X\big]}{1 - P(D_t = 1 \mid X)} \right] \\
&= \frac{1}{E[G_g]}\, E\!\left[ \frac{P(G_g = 1 \mid X)\,(1 - D_t)}{1 - P(D_t = 1 \mid X)}\,(Y_t - Y_{g-1}) \right] \\
&= \frac{E\!\left[\dfrac{P(G_g = 1 \mid X)\,(1 - D_t)}{1 - P(D_t = 1 \mid X)}\,(Y_t - Y_{g-1})\right]}{E\!\left[\dfrac{P(G_g = 1 \mid X)\,(1 - D_t)}{1 - P(D_t = 1 \mid X)}\right]}, \tag{C.5}
\end{align*}
where (C.5) follows from
\begin{align*}
E\!\left[\frac{P(G_g = 1 \mid X)\,(1 - D_t)}{1 - P(D_t = 1 \mid X)}\right]
&= E\!\left[ \frac{P(G_g = 1 \mid X)}{1 - P(D_t = 1 \mid X)}\, E[(1 - D_t) \mid X]\right] \\
&= E\!\left[ \frac{P(G_g = 1 \mid X)}{1 - P(D_t = 1 \mid X)}\, \big(1 - P(D_t = 1 \mid X)\big)\right] \\
&= E\big[P(G_g = 1 \mid X)\big] = E\big[E[G_g \mid X]\big] = E[G_g].
\end{align*}
The proof is completed by combining (C.4) and (C.5). $\blacksquare$

Once we have established nonparametric identification of $ATT(g,t)$, we can follow a two-step estimation strategy similar to the one described in Section 3. More precisely, under Assumptions C.1-C.4 and for $2 \le g \le t \le \mathcal{T}$, one can estimate $ATT(g,t)$ by
\[
\widehat{ATT}^{n.yet}(g,t) = \mathbb{E}_n\!\left[\left( \frac{G_g}{\mathbb{E}_n[G_g]} - \frac{\dfrac{\hat{p}_{G_g}(X)\,(1 - D_t)}{1 - \hat{p}_{D_t}(X)}}{\mathbb{E}_n\!\left[\dfrac{\hat{p}_{G_g}(X)\,(1 - D_t)}{1 - \hat{p}_{D_t}(X)}\right]} \right)(Y_t - Y_{g-1})\right],
\]
where $\hat{p}_{G_g}(X)$ is an estimate of $P(G_g = 1 \mid X)$ and $\hat{p}_{D_t}(X)$ is an estimate of $P(D_t = 1 \mid X)$. In contrast to the case analyzed in the main text, here we need to estimate two propensity scores. These can be estimated separately, using binary choice models (e.g. logit), or jointly, using multinomial choice models (e.g. multinomial logit).

Following similar steps as in Theorems 2 and 3, one can show that, under suitable regularity conditions akin to those in Assumption 5, $\widehat{ATT}^{n.yet}(g,t)$ is consistent and asymptotically normal, and that one can use a multiplier bootstrap similar to the one described in Algorithm 1 to conduct asymptotically valid inference. Nonetheless, it is worth mentioning that the asymptotic linear representation of $\widehat{ATT}^{n.yet}(g,t)$ will differ from that of $\widehat{ATT}(g,t)$, because the former is based on two different propensity scores whereas the latter is based on only one. A detailed and formal derivation of the aforementioned results is beyond the scope of this article.
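To make the two-propensity-score structure of the estimator above concrete, here is a minimal sketch in Python. It is an illustration rather than the authors' implementation: both propensity scores are estimated with a hand-rolled Newton-Raphson logit (any parametric first-step estimator could be substituted), and all function and variable names are ours.

```python
import numpy as np

def logit_fit(X, d, iters=25):
    """Newton-Raphson logit; returns fitted P(d = 1 | X).
    A stand-in for any parametric propensity-score estimator."""
    Z = np.column_stack([np.ones(len(d)), X])       # add an intercept
    b = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ b))
        H = Z.T @ (Z * (p * (1 - p))[:, None])      # information matrix
        b += np.linalg.solve(H + 1e-8 * np.eye(Z.shape[1]), Z.T @ (d - p))
    return 1.0 / (1.0 + np.exp(-Z @ b))

def att_not_yet_treated(y_t, y_gm1, G_g, D_t, X):
    """Sketch of ATT(g, t) with the 'not yet treated' (D_t = 0) as controls.

    y_t, y_gm1 : outcomes in periods t and g-1 (panel data)
    G_g        : 1 if unit is first treated in period g
    D_t        : 1 if unit is already treated by period t
    X          : covariates, (n, k) array
    """
    p_G = logit_fit(X, G_g)                 # estimate of P(G_g = 1 | X)
    p_D = logit_fit(X, D_t)                 # estimate of P(D_t = 1 | X)
    dy = y_t - y_gm1
    w_treat = G_g / G_g.mean()
    w_cont = p_G * (1 - D_t) / (1 - p_D)
    w_cont = w_cont / w_cont.mean()         # stabilize the control weights
    return float(np.mean((w_treat - w_cont) * dy))
```

Units that are already treated by period $t$ but do not belong to group $g$ receive zero weight in both terms, which is exactly how the identification result excludes them from the comparison.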
Appendix D: Additional Results for the Case without Covariates

Panel Data

The case where the DID assumption holds without conditioning on covariates is of particular interest. In this appendix, we briefly consider whether or not it is possible to obtain $ATT(g,t)$ using a regression approach, as in the two-period, two-group case. A natural starting point is the model
\[
Y_{igt} = \alpha_t + c_g + \gamma_{gt}\, G_{igt} + u_{igt},
\]
where $\alpha_t$ is a time period fixed effect (we normalize $\alpha_1$ to be equal to zero and $\gamma_{g1}$ to be equal to 1), $c_g$ is time-invariant unobserved heterogeneity that can be distributed differently across groups, and $G_{igt}$ is a dummy variable indicating whether or not individual $i$ is a member of group $g$ and the time period is $t$. Differencing the model across time periods results in
\[
\Delta Y_{igt} = \alpha_t^* + \gamma_{gt}\, G_{igt} + \Delta u_{igt},
\]
where $\alpha_t^* = \alpha_t - \alpha_{t-1}$. Notice that this is a fully saturated model in group and time effects. It is straightforward to show that
\[
\gamma_{gt} = E[\Delta Y_t \mid G_g = 1] - E[\Delta Y_t \mid C = 1].
\]
When $g = t$, this is exactly the DID estimator. Under the augmented unconditional version of the parallel trends assumption, $\gamma_{gt}$ should be equal to 0 for all $g > t$, and it is straightforward to test this using output from standard regression software (e.g. a Wald test). For $t > g$, the long difference estimate of $ATT(g,t)$ can be constructed by
\begin{align*}
ATT(g,t) &= E[Y_t - Y_{g-1} \mid G_g = 1] - E[Y_t - Y_{g-1} \mid C = 1] \\
&= \sum_{s=g}^{t} \big( E[\Delta Y_s \mid G_g = 1] - E[\Delta Y_s \mid C = 1] \big) = \sum_{s=g}^{t} \gamma_{gs}.
\end{align*}
This implies that, under the (augmented) unconditional parallel trends assumption, $ATT(g,t)$ can be recovered using a regression approach. However, combining the estimates of the parameters in this way does not seem to offer much convenience relative to simply computing the estimates directly using the main approach suggested in the paper. Thus, unlike the two-period case, there does not appear to be as exact a mapping from a regression coefficient to a group-time average
25 treatment effect. Common Approaches to Pre-Testing in the Unconditional Case Finally in this section, we consider the most common approach to pre-testing the augmented unconditional version of the parallel trends assumption, that is, to run the following regression see Autor et al. 2007) and Angrist and Pischke 2008)). q Y it = α t + θ g + β 0 D it + β j D it,t+j + u it j= D.) where D it is a dummy variable for whether or not individual i is treated in period t notice that this is not whether they are first treated in period t but whether or not they are treated at all; it is a post-treatment dummy variable), D it,t+j is a j period lead for individual i who is first treated in period t + j. For example, when t = 2, D i2,4 = for j = 2) for individuals who are first treated in period 4, which indicates that the group of individuals first treated in period 4 will be treated 2 periods from period t. Then, one can pre-test the unconditional parallel trends assumption by testing if β j = 0 for j =,..., q. Under the Unconditional DID Assumption, each β j will be 0. One advantage of this approach is that it allows simple graphs of pre-treatment trends. However, it is possible for this approach to miss departures from the unconditional parallel trends assumption that our test would not miss. Consider the case with four periods and three groups the control group, a group first treated in period 4, and a group first treated in period 3. Also, consider the case with q =. It is easy to show that β = Y 3 G 4 = Y 3 C = and β = Y 2 G 3 = Y C = so that the estimate of β will be a weighted average of these two pre-trends. Thus, the unconditional augmented parallel trends assumption could be violated in ways that offset each other leading to β being equal to 0. ven more importantly,the weights associate with the regression coefficient β may not be convex; see Propositions 3 and 7 in Abraham and Sun 208) for detailed arguments. As a consequence, tests for pre-trends based on D.) 
may not be reliable under treatment effect heterogeneity. Our approach described in Remark 7 in the main text, on the other hand, does not suffer from this potential drawback.

References

Abraham, S., and Sun, L. (2018), "Estimating Dynamic Treatment Effects in Event Studies With Heterogeneous Treatment Effects," Working Paper.
Angrist, J. D., and Pischke, J.-S. (2008), Mostly Harmless Econometrics: An Empiricist's Companion, Princeton, NJ: Princeton University Press.

Autor, D. H., Kerr, W. R., and Kugler, A. D. (2007), "Does Employment Protection Reduce Productivity? Evidence From US States," The Economic Journal, 117(521), F189–F217.

Kosorok, M. R. (2008), Introduction to Empirical Processes and Semiparametric Inference, New York, NY: Springer.

van der Vaart, A. W. (1998), Asymptotic Statistics, Cambridge: Cambridge University Press.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes, New York: Springer.