Simple Tests of Random Missing for Unbalanced Panel Data Models. September 10, 2014

Size: px

Start display at page:

Download "Simple Tests of Random Missing for Unbalanced Panel Data Models. September 10, 2014"

Octavia Bailey
5 years ago
Views:

1 Simple Tests of Random Missing for Unbalanced Panel Data Models Do Won Kwak and Suyong Song September 10, 2014 Abstract This paper proposes simple tests of missing completely at random (MCAR) and missing at random (MAR) assumptions for unbalanced panel data by extending Hausman [1978] s specification test. The proposed Hausman-Type (HT) tests can be applied flexibly to estimations with complete-case, multiple imputations, sample selection correction, and inverse probability weighted methods. Monte Carlo simulations show substantial power and no size bias of the tests. We illustrate usefulness of the tests of the MCAR and MAR assumptions in the study of the effects of trade liberalization policies on trade flows. The HT tests results show that conventional estimators substantially underestimate the effects on trade flows. Keywords: Unbalanced panel data; Hausman-type specification test; Missing data; Missing completely at random; Missing at random JEL Classfication: C12, C33, F15 We would like to thank Jeffrey M. Wooldridge for his perspective and helpful comments. We are grateful to Xuepeng Liu for kindly providing his dataset and Timothy Vogelsang, Todd Elder, Chris O Donnell, Adrian Pagan, Alicia Rambaldi, Prasada Rao, Donggyu Sul for suggestions and comments. We also thank seminar participants at Korea University, Michigan State University, Monash University, University of Sydney, and University of Queensland, for useful comments. All remaining errors are our own. School of Economics, The University of Queensland, St Lucia, QLD, Australia, d.kwak@uq.edu.au Department of Economics, University of Wisconsin-Milwaukee, PO Box 413, Milwaukee, WI 53211, US, songs@uwm.edu 1

2 1 Introduction As panel data are becoming more readily available, analysis using the methods for balanced data are becoming more attractive to empirical researchers. However, balanced panel data is rarely available because of missing data. For instance, Abrevaya and Donald [2010] surveyed that missing data occurs in 40% of all published articles in four empirical economics journals (AER, JHR, JOLE, and QJE) over If the missing completely at random (MCAR) assumption is satisfied (Rubin [1976] and Little and Rubin [2002]), we can still justify the results from the balanced data method but, unfortunately, the MCAR assumption is not generally testable. In 70% of these publications with missing data, units with incomplete observations are all discarded without providing any justification for the validity of the MCAR assumption. 1 Other remaining studies typically rely on the assumption that the missing and nonmissing data have the same distribution conditional on observed covariates. The literature generally assumes the MAR assumption and focuses on efficient estimation (Robins et al. [1994], Hahn [1998], Cattaneo [2010], Abrevaya and Donald [2010], Muris [2011], Graham et al. [2012], Chaudhuri [2014], Chaudhuri and Guilkey [2014], among others). Again, justification of the results for the estimates from using the MAR assumption relies on faith rather than formal statistical analysis because these MAR assumptions have been regarded as untestable in most empirical applications. For instance, Hirsch and Schumacher [2004], Bollinger and Hirsch [2006] and Bollinger and Hirsch [2013] examine the quality of imputations for the studies that use Current Population Survey data, but the MAR assumption used for imputation with the data can sometimes exacerbate bias which arises from missing data. Although the MCAR and MAR assumptions are generally considered untestable (Horowitz and Manski [2006]), this paper shows that, for unbalanced linear panel data models which have at least two consistent estimators under the null hypothesis, the validity of the assumptions can be tested. The proposed test statistics extend the Hausman [1978] s specification test to examine the validity of the assumptions in unbalanced panel data. Construction of test statistics is based on the idea that, under the null hypothesis of the MCAR assumption (or the MAR assumption), the difference in the estimates of two consistent estimators should be entirely due to sampling errors. In testing the MCAR assumption, a test statistic is constructed using the difference between two scaled consistent estimators and it is shown to asymptotically follow χ 2 -distribution. The alternative 1 Missing completely at random (MCAR) and missing at random (MAR) assumptions used in this paper are introduced by Rubin [1976] and we use in the following sense. The MCAR assumption implies that the probability of missingness is independent of observed and unobserved variables. The MAR assumption implies that, conditional on observed variables, the probability of missingness does not depend on the unobserved variables. 2

3 hypothesis is chosen according to the types of nonrandom missing data. There are several attractive characteristics of the proposed tests of the MCAR assumption. First, no restriction is imposed on missing pattern for the justification of these test statistics. For instance, missing data can be of the dependent and explanatory variablesand of any types of missingness (e.g. attrition type and/or intermediate type). Second, the tests allow the researcher to focus upon the inconsistency of a single covariate of interest (e.g., critical core covariate in Lu and White [2014] or White and Lu [2011]) caused by nonrandom missing data. This could be practically important because this flexibility can enhance the power of the test. Third, the proposed tests of the MCAR assumption can be easily implemented with standard statistical software such as Stata. The tests can be implemented using any combinations of two estimations from pooled least square (LS), random effects (RE), fixed effects (FE) and first-differencing (FD) estimators where two estimators are chosen according to the alternative hypothesis in consideration. Furthermore, we show that the proposed method can be applied to test the MAR assumption, provided that two consistent estimators from imputation (or inverse probability weighting) are available under the null hypothesis. In practice, researchers often have information on missingness process and use this information with the MAR assumption to eliminate or mitigate biases from nonrandom missing data. This information on the missingness process with the MAR assumption was typically implemented with popular methods such as multiple imputation (MI) or inverse probability weighting (IPW). 2 The primary idea of the tests is based on the difference between two scaled consistent MI estimators (or IPW estimators). The null hypothesis implies the validity of the MAR assumption which indicates consistency of MI (or IPW) estimators. In our empirical illustrations, we choose either two imputed estimators among LS with imputation, FD with imputation and FE with imputation estimators where a MAR assumption is imposed to calculate missing observations; or two IPW estimators among weighted pooled LS, weighted FD and weighted FE estimators where a MAR assumption is used to calculate probability of observability. We select two estimators in the construction of a test statistic depending on which alternative hypothesis to use. We assume these estimators are consistent under the validity of the MAR assumption. The alternative hypotheses are the violation of exogeneity caused by correlation between unobserved factors and covariate of interest conditional on observability. Like a test of the MCAR assumption, a test statistic for testing the MAR assumption uses the idea that the difference between two 2 The MI method with the MAR assumption is routinely applied and the details of the MI method is well documented in Shafer [1997], Buuren et al. [1999], Schafer and Graham [2002], Little and Rubin [2002], Hahn [1998], Chen et al. [2008], and Imbens et al. [2007]. The IPW method with MAR was initially introduced by Horvitz and Thompson [1952] but had not been used much until recently because of inefficiency. This, however, has changed drastically since Robins et al. [1994] introduced efficient version of the IPW (it is called Augmented IPW or AIPW) method. The IPW method with the MAR assumption recently has been studied in Tsiatis [2006], Hirano et al. [2003], Wooldridge [2007], and Graham et al. [2012]. 3

4 scaled consistent estimators under the null hypothesis (i.e. missingness is random after conditioning on observed variables) should be entirely because of sampling error and it is shown to converge to χ 2 - distribution. All three attractive characteristics of the tests of the MCAR assumption are retained for the test of the MAR assumption. In particular, focusing only upon a single covariate is important in practice because devising the MAR assumption so that all included covariates are conditionally random is difficult and probably unrealistic, while the MAR assumption that one variable of interestis conditionally random is relatively easier to work with. However, in contrast to a test of the MCAR assumption, the MAR assumption requires that the missingness process should be explained by observed variables which have to be excluded in the outcome model. In most applications, all explanatory variables in the outcome model and other excluded observed variables (i.e. exclusion restriction) are used as predictors for doing imputation and calculating the probability of observability. Similar to the proposed tests of the MCAR, the test of the MAR assumption is also easily implementable with popular statistical software such as Stata. The tests can be implemented after combining two estimations from either imputed or weighted pooled LS, RE, FE and FD estimators. Stata ado codes to implement the HT test of the MCAR assumption and the MAR assumption with IPW and MI estimators can be downloaded at Stata do file with illustrative examples and Stata help file for the codes are also provided at the website. Finite-sample properties of the proposed tests are examined with Monte Carlo (MC) experiments and the MC results show substantial power and no size distortion of the tests. For sensitivity analyses, various types of alternative hypotheses that induce bias for pooled LS, RE, FE and FD estimators are examined in the MC experiments. It is evident that the proposed tests of MCAR and MAR assumptions perform well in a variety of models. Practical usefulness of the proposed tests is illustrated by the study of the effects of trade liberalization policies World Trade Organization (WTO) membership, Regional Trade Agreement (RTA) membership, Currency Union (CU) membership on trade flows. A gravity equation in unbalanced panel data is often estimated with error-component models as in Wansbeek and Kapteyn [1989], Baltagi and Chang [1994], Werner [2001], and Davis [2002]. Justification of the results from these methods requires the validity of the MCAR assumption but formal testing of the assumption is rare in previous studies. In our sample, the dependent variable (trade flows) suffers severely from missing data, with a missing proportion of about 50%. Assuming the other sources of biases in the gravity model are properly controlled, 4

5 we test the validity of the MCAR assumption by examining whether there is a statistically significant difference between the FE and FD estimates beyond sampling error. Furthermore, following previous studies that use the MAR assumption with imputation in the gravity model as in Linders and de Groot [2006] and Felbermayr and Kohler [2006], we show usefulness of the HT test of the MAR assumption by imputing missing observations for trade flows. We also conduct the HT tests of the MAR assumption with the IPW and two-step estimators where inverse probability weights and the inverse mills ratio (IMR) are obtained by modeling a missing/observability process with observed variables including exclusion restrictions. Our illustration of the HT tests provides several important practical implications. First, the HT test rejects the null hypothesis that the MCAR assumption is valid. 3 Second, the difference between conventional FE and FD estimates and the difference between corresponding FE and FD estimates with imputation suggest that severe bias exists because of missing data and violation of the MAR assumption respectively. Rejection of the null hypothesis from the HT tests of the MAR assumption with imputation confirms a violation of the MAR assumption used for imputation in the literature. Furthermore, test results imply that the effects of trade liberalization policies such as WTO and RTA were severely underestimated by ignoring missing data. The trade creation effect of WTO membership is estimated to be 0% without imputation but 55% with imputation while RTA membership increases trade flows by 13% without imputation and 105% with imputation. Therefore, the WTO and RTA membership effects are underestimated by 55 and 92 percentage points respectively. Moreover, the HT tests of the MAR assumption with imputation also implies that the standard and most popular methods of controlling for omitted variable bias are not justifiable unless the MAR assumption used for the imputation model is valid. Assuming that our imputation model of the MAR assumption from economic argument is reasonable, we can attribute the difference between the FE and FD estimates with imputation to omitted variable bias that is not completely removed by conventional methods of controlling omitted variables suggested in Magee [2003], Baldwin and Taglioni [2006], Baier and Bergstrand [2007, 2009]. Finally, twostep and IPW estimates are similar to conventional estimates that ignore missing data. Furthermore, the qualitative conclusions we draw from the HT tests for the coefficients on WTO, RTA and CU membership variables are the same for conventional, IPW and two-step estimators. This implies that the MAR assumption used to obtain IPW and IMR correction term do little practically to mitigate/eliminate bias 3 We implement one of the most popular methods used in the gravity model that accounts for three-way error component models as in Anderson and van Wincoop [2003], Feenstra [2004], Baldwin and Taglioni [2006], Baier and Bergstrand [2007], Magee [2008], Eicher et al. [2012], Matyas and Balazsi [2012] among others. 5

6 caused from nonrandom missingness. This is not surprising in our application because we are limited in the availability of exclusion restrictions that predict whether trade flows are missing or not. Specifically, we use religion and lagged dependent variables as exclusion restrictions, following Helpman et al. [2008] in the first stage estimation of the probability of observability. These variables may not have enough variation to correctly predict the probability of observability and to have a major effect on the estimates for variables of trade liberalization policies. The rest of paper is organized as follows. In section 2, we introduce the HT test statistics to test the MCAR assumption in an unbalanced panel data. Section 3 extends the HT test statistics to test the MAR assumption. Section 4 shows the size and the power of the test statistics through Monte Carlo experiments. Section 5 provides application of HT tests to the effect of trade liberalization on trade flows. Section 6 contains concluding remarks. Appendix contains technical details, additional Monte Carlo experiments, and a recipe for our tests routine in Stata. 2 Linear panel data models with missing data A formal test of the MCAR assumption is based on a test statistic that uses a pairwise difference between two estimates. Two estimators used in the test are chosen among four popular linear panel data estimators - pooled LS, RE, FE and FD estimators. 2.1 Model We consider the following linear panel data model with missing data where observability is denoted by observability indicator s i t : y i t = d i t β + x 1i t δ + f t + c i + u i t, for i = 1,2,..., N and t = 1,2,...,T (1) 1 if all of y i t, x i t are observed s i t = 0 otherwise (2) where y i t is an outcome, d i t is a vector of regressors of primary interest, x 1i t is a vector of remaining observed covariates, f t is time effect, c i is unobserved heterogeneity and u i t is idiosyncratic error. Using x i t = (d i t, x 1i t, f t ), θ = (β,δ,1) and v i t = c i + u i t, we can rewrite (1) with selection indicator more 6

7 compactly as follows s i t y i t = s i t x i t θ + s i t v i t, for i = 1,2,...,n and t = 1,2,...,T. (3) Pooled LS, RE, FE and FD estimators in unbalanced panel data model are: N θ pl s = ( s i t x i t x i t ) 1 N s i t x i t y i t, (4) N θ RE = ( s i t x i t x i t ) 1 N s i t x i t ỹi t, (5) where.ẋ i t = x i t x i, x i = 1 T i N θ F E = (.. s i t x.. i t x i t ) 1 N s i t x i t, x i t = x i t λx i, λ = 1 N θ F D = ( i=1 t=2 s F D i t x i t x i t ) 1 N.. s i t x.. i t i=1 t=2 σ 2 u σ 2 u+t i σ 2 c y i t, (6) and T i = T s i t, and s F D i t x i t y i t (7) where x i t = x i t x i t 1 for t = 2,3,...,T and s F D i t = min{s i t, s i t 1 }. 2.2 Conditions for consistent estimators in unbalanced linear panel data models We consider that violation of the MCAR assumption is the only source of inconsistency in the estimation of θ. Therefore, we assume E(v i t x i ) = 0 t so, as long as E(v i t x i ) = E(v i t x i, s i ), the pooled LS, RE, FE and FD estimators are all consistent. We can relax the assumption even further as E(u i t x i t,c i ) = 0 and E(u i t x i t,c i ) = E(u i t x i t,c i, s i t ) so that FE and FD estimators are consistent under the MCAR assumption conditional on fixed effects c i. For the cases where we are only interested in single parameter in θ, we can relax the assumption in a way that E(u i t d 1i t x i t,c i ) = 0 t where x i = (d 1i t, x i t ) so, if E(u i t d 1i t x i t,c i ) = E(u i t d 1i t x i t,c i, s i t ), then FE and FD estimators for θ are consistent. The violation of the MCAR assumption occurs because of the following two instances of endogeneity: (1) E(c i x i, s i ) E(c i x i ) = 0 or (2) E(u i t x i t,c i, s i t ) E(u i t x i t,c i ) = 0, for at least one t. For the first case, time-constant unobserved factors are correlated with the covariate of interest conditional on s i t = 1 while, for the second case, time-varying unobserved factors are correlated with the covariate of interest 7

8 conditional on observability (s i t = 1). For each alternative, the consistent estimators we can use for constructing HT tests are different but, for both cases, we can construct an HT test statistic using the difference between FE and FD estimates. Besides standard regularity conditions, rank and exogeneity conditions are essential for consistency. They are assumed in equations (8) and (9) for the pooled LS, in (10) and (11) for the FE, and in (12) and (13) for the FD estimators: r anke(s i t x i t x i t ) = di m(θ), t = 1,2,...,T (8) r anke(s i t.. E(s i t x i t v i t ) = 0, t = 1,2,...,T (9) x.. i t x i t ) = di m(θ), t = 1,2,...,T (10).. E(s i t x.. i t u i t ) = 0, t = 1,2,...,T (11) r anke(s F D i t x i t x i t ) = di m(θ), t = 2,...,T (12) E(s F D i t x i t u i t ) = 0, t = 2,...,T. (13) For the RE estimator, r anke(s i t x i t x i t ) = di m(θ) and conditional strict exogeneity, E(u i t x i, s i ) = 0, are sufficient for consistency of ˆθ RE where s i = (s i 1, s i 2,..., s i T ) and x i = (x i 1, x i 2,..., x i T ). 2.3 Tests of the MCAR assumption For simplicity of illustration, let s consider a random experiment on d 1i t so the pooled LS, RE, FE and FD estimators of β are all consistent under the MCAR assumption (i.e. E(v i t d 1i t ) = 0 and under the MCAR assumption E(v i t d 1i t ) = E(v i t d 1i t, s i t = 1), t). Using this fact, we construct a HT test statistic comparing two consistent estimators. A pairwise difference between two estimators should be entirely because of sampling errors if the MCAR assumption is satisfied. For instance, we can select two consistent estimators under the null hypothesis, say the FD and FE estimators, and construct a test statistic by using difference between the FD and FE estimators. Depending upon the source of violation of the MCAR assumption, we can consider the following two alternative hypotheses, (1) E(c i d 1i t, x i t, s i t = 1) E(c i d 1i t, x i t ) = 0 for some t and (2) E(u i t d 1i t, x i t,c i, s i t = 1) E(u i t d i t x i t,c i ) = 0, for some t and we choose two estimators accordingly. Against these alternative hypotheses, the null hypothesis is E(v i t d i t, x i t ) = E(v i t d i t, x i t, s i t = 1) = 0 t or E(u i t d i t,c i, x i t ) = E(v i t d i t, x i t,c i, s i t = 1) = 0 t, respectively. 8

9 For the first alternative, if we assume E(s i t d 1i t u i t ) = 0, t and E(s i t d 1i t c i ) 0, t, then FE and FD are consistent but pooled OLS and RE are inconsistent. Because we assume E(d 1i t c i ) = 0, E(s i t d i t c i ) 0 implies that non-zero correlation between d 1i t and a time-invariant unobserved factor is caused by the violation of the MCAR assumption. Under the alternative, the FE and FD estimators are consistent while the pooled LS and RE estimators are inconsistent. Thus, against this alternative, we can construct four HT test statistics using following four pairs: LS-FE, LS-FD, RE-FE, and RE-FD. The second alternative is violation of strict conditional exogeneity E(s i t d i t u i ) E(d i t u i ) = 0 for some t. Against this alternative, all four estimators are inconsistent so we rely on the different magnitude of the inconsistency of each estimator to detect the violation of the MCAR assumption. If we do not have any prior information on the sources of nonrandom missingness, it is safe to assume that the MCAR assumption is violated because of the correlation between d i t and time-varying and time-invariant unobserved factors conditional on observability. Therefore, in practice, we should consider both alternatives (1) and (2) and construct the HT test statistics accordingly. 2.4 Hausman-type tests of the MCAR assumption Test of multiple parameters We consider a HT test statistic that uses a difference between the pooled LS and FE estimators. Under the null hypothesis, the pooled LS and FE estimators are consistent. Depending on which of two alternatives we consider, either only the pooled LS is inconsistent or both the pooled LS and FE estimators are inconsistent under the alternative hypothesis. HT test statistic is constructed by stacking two k- dimensional consistent estimators θ pl s and θ F E as θ 1 = ( θ pl s, θ F E ) and multiplying a restriction matrix, R k 2k = [ I p k p I p k p ] where k is the dimension of θ and p is the dimension of covariates of interest, d i t, to obtain linear transform of θ 1. The transformation provides a restriction of θ pl s θ F E converging to zero under the null hypothesis. We note that N θ1 = 1 N N s i t x i t x i t N N.. s i t x.. i t x i t 1 2k 2k 1 N 1 N N s i t x i t y i t N.. s i t x.. i t y i t 2k 1. (14) We can write the null hypothesis as below (15) where pooled LS and FE estimators are consistent. H 0 : Rθ 1 = 0, 9

10 Rθ 1 = θ 1,pl s θ 1,F E θ 2,pl s θ 2,F E. = θ p,pl s θ p,f E 0. 0 k 1 k 2k θ 1,pl s. θ k,pl s θ 1,F E. θ k,f E 2k 1 = 0. (15) The null hypothesis states H 0 : Rθ 1 = 0, which equals H 0 : θ j,pls θ j,f E = 0 j = 1,2,..., p. We construct a test statistic W 1 which converges asymptotically to a chi-squared distribution with degree of freedom equal to the number of restrictions, p, in (15) under the following assumptions. Assumption Conditional strict exogeneity: E(v i t x i, s i ) = 0 t. 2. Rank conditions: E( T s i t.. x.. i t x i t ) and E( T s i t x i t x i t ) have full rank. 3. For fixed T, regularity conditions 4 to invoke CLT to the following objects, 1 N N.. s i t x.. i t u i t so both objects are N -asymptotically normal. 1 N N s i t x i t v i t and The next theorem summarizes the result. Theorem 1 Under the Assumption 2.1, a Wald statistic W 1 converges to χ 2 p distribution under the null hypothesis of H 0 : Rθ 1 = 0, where W 1 = [R N ( θ 1 )] [R var ( N θ 1 )R ] 1 R N ( θ 1 ) χ 2 p 4 Both sufficient and necessary conditions are provided at section 5.3 in Newey and McFadden [1994]. We omit them for brevity. 10

11 with R θ 1 = k 2k θ 1,pl s. θ k,pl s θ 1,F E. θ k,f E 2k 1 = θ 1,pl s θ 1,F E θ 2,pl s θ 2,F E. θ p,pl s θ p,f E k 1 We call W 1 as robust HT test statistic. 5 The rejection of the null hypothesis implies that either (i) the pooled LS estimator is inconsistent and the FE estimator is consistent or (ii) the pooled LS and FE estimators are inconsistent. The pooled LS estimator is inconsistent and the FE estimator is consistent if the sole cause of nonrandom missingness is the correlation between time-invariant unobserved factors and covariates conditional on observability (i.e. E(c i x i t s i t ) 0 for some t). The pooled LS and FE estimators are inconsistent if E(u i t x i, s i ) 0 for some t. A rejection of null hypothesis under any one of these alternatives necessarily implies the violation of the MCAR assumption. Similarly, we can construct another HT test statistic that uses the difference between other pairs of consistent estimators such as θ F D and θ F E under the null hypothesis of the MCAR assumption. We need to adjust strict exogeneity assumption, rank condition, and regularity conditions for N -asymptotically normal to invoke CLT. Assumption 2.1A 1. Strict exogeneity condition: E(u i t x i, s i ) = 0 t. 2. Proper rank conditions: E( T s i t.. x.. i t x i t ) and E( T t=2 s F D i t x i t x i t ) have full rank 3. For fixed T, regularity conditions to invoke CLT to the following objects: and 1 N N.. s i t x.. i t u i t. 1 N N i=1 t=2 s F D x i t i t u i t We use the following stacked estimator θ 2 = ( θ F D, θ F E ) that uses difference between the FD and FE estimators: 5 We call our test statistic as robust HT test in the sense that we do not make any imposition on second moment of error term to obtain efficiency unlike in the original Hausman test. We instead estimate sandwich form of error covariance matrix. 11

12 N θ2 = 1 N N i=1 t=2 s F D i t x i t x i t N N.. s i t x.. i t x i t 1 2k 2k 1 N N i=1 t=2 1 N N s F D x i t i t y i t.... s i t x i t y i t 2k 1. (16) We write the null hypothesis as below and the FD and FE estimators are consistent under the null hypothesis of the MCAR assumption. Rθ 2 = H 0 : Rθ 2 = 0, (17) θ 1,F D θ 1,F E θ 2,F D θ 2,F E. = θ p,f D θ p,f E 0. 0 k 1 = 0. k 2k θ 1,F D. θ k,f D θ 1,F E. (18) θ k,f E 2k 1 Proposition 2 Under Assumption 2.1A, a Wald statistic W 2 converges to χ 2 p distribution under the null hypothesis of H 0 : Rθ 2 = 0, where W 2 = [R N ( θ 2 )] [R var ( N θ 2 )R ] 1 R N ( θ 2 ) χ 2 p. (19) The rejection of the null hypothesis with the HT test statistic, W 2, implies that the FE and FD estimators are inconsistent so conditional strict exogeneity is violated. We can similarly construct a statistic W 3 that uses difference between the pooled LS and FD estimators with adjusted exogeneity and rank conditions using stacked estimators θ 3 = ( θ pl s, θ F D ) and the null hypothesis of H 0 : Rθ 3 = 0. We have that 12

13 W 3 = [R N ( θ 3 )] [R var ( N θ 3 )R ] 1 R N ( θ 3 ) χ 2 p. (20) The possible number of HT test statistics constructed from pairwise differences between two consistent estimators under the null hypothesis are six. We can additionally construct two HT test statistics by replacing the pooled LS estimator with the RE estimator for W 1 and W 3. This increases the power of the test with additional assumption on the second moment because these assumptions guarantee that the RE estimator is ideal (e.g. Hausman [1978]). Test statistics can also be constructed using the differences that replace the pooled LS with the RE estimators. Using the RE estimator enhances the power of the tests assuming that additional assumptions for the second moment are satisfied Test of one parameter Sometimes we are interested in the causal relationship of small set of core parameters or often only one parameter. The practical importance of focusing only on small number of core parameters is also examined in Lu and White [2014]. This is the case for our application in this paper. The adjustment of a multiple parameter HT test statistic to a one parameter HT test statistic is very important in practice because, in general, it provides a higher power and enables us to use the conditional MCAR assumption (which is much easier to satisfy) rather than MCAR assumption. We denote the one parameter version of test statistics that correspond to the multiple parameter version test statistics W q for q = 1,2,...,5, by T q for q = 1,2,...,5, respectively. A HT statistic T q using the null hypothesis H 0 : R j θ q = 0 where R j is j th row vector of k 2k matrix R is given as follows: R j n( θq ) T q = R j var ( n θ t n 1 z. (21) q )R as y j 2.5 Variable addition tests Simple tests of violation of the MCAR assumption for unbalanced panel data, called variable addition tests, have been introduced in the literature (Verbeerk and Nijman [1996] and more recently in Wooldridge [2010a]). The idea of the tests is that under the MCAR assumption any function of s i t x i t should not appear in the expression for E(s i t y i t ). So a test on the coefficients for the lag and lead values of {s i t,s i t x i t } has been proposed accordingly. Rejection implies the violation of the MCAR assumption. 13

14 For instance, a simple variable addition test for the MCAR assumption is a robust t-test for the coefficient on s i t+1 in pooled linear regression model as follows: y i t = γ 1 s i t+1 + x i t θ + u i t, for s i t = 1. (22) We can similarly implement variable addition tests for the FE and FD estimators as follows: ÿ i t = γ 2 s i t+1 + ẍ i t θ + ü i t, for s i t = 1 (23) where ÿ i t = s i t (y i t Σ T k=1 s i k y i k ) and ẍ i t and ü i t are similarly defined. We also consider the relative performance of HT tests and variable addition tests by comparing the power and size bias of the tests under various alternative hypotheses using Monte Carlo simulation. We consider four data generating processes in section 5. Although the variable addition test is quite useful, it is not easy to extend to the test of the MAR assumption so in the test of the MAR assumption, we focus on the HT test. 3 HT test of the MAR assumption using MI, IPW, and Two-Step estimators In this section, we extend our method to test the MAR assumption used to generate probability of observability, to impute missing observations, or to generate a correction term for sample selection. The proposed HT tests are extended to test the MAR assumption by constructing HT statistics from the MI, Two-step and IPW estimators. 3.1 MI estimators First, we show an extension of the HT test based on the MI estimators. We use the following notations for observability s i t : for any random draw w i t = (y i t, d i t, x 1i t ) and s i t from the population, s i t is equal to 1 if observations for w i t are available and 0 otherwise. We consider the pooled LS, FE and FD estimators from MI method. For simplicity of exposition, we illustrate the case where missing data only occurs in the dependent variable of the following model: y i t = d i t β + x 1i t δ + f t + c i + u i t for,i = 1,2,...,n and t = 1,2,...,T. (24) 14

15 We assume nonrandom missingness to be the only source of inconsistency and 0 = E(u i t d i t, x 1i t,c i ) E(u i t d i t, x 1i t,c i, s i t = 1), (25) so the MI-FE and MI-FD estimators are consistent if the model for imputation is correctly specified. Besides the assumptions 2.1 (or 2.1A), we additionally assume the MAR assumption a correct model specification for the distribution of missing observations for the dependent variable conditional on the availability of predictors. Let z 1i t be the exclusion restriction. For the sake of simplicity, we suppose that z i t = (d i t, x 1i t, z 1i t ) are available for s i t = 1 and s i t = 0 so only the dependent variable and the core variable d i t have some missing observations. This is an important case because missing observations on either the dependent variable or only on some core covariates are common as in our application. Assumption 3.1 Conditional distribution of y m,i t given z i t is correctly specified as D(y m,i t z i t,c i ), (26) where y m,i t denotes a missing dependent variable for individual i at time t. This assumption is also known as selection on observables. Under the validity of the MAR assumption (i.e. the model for missing observations, y m,i t, conditional on predictors, z i t is correct), we obtain a consistent estimator for β. Our proposed HT test of the MAR assumption is constructed by using the difference between two consistent estimators under the null, say MI-LS and MI-FD estimators, and the test is based on the idea that the difference should be entirely because of sampling error as long as the model for imputation in the equation (26) is correctly specified. The MI estimators are obtained from three steps. First, the missing values, y m,i t, are filled with predicted values of y m,i t from (26) in M times so we obtain M complete data. Second, standard balanced panel data methods are applied to the M complete data sets. Therefore, we can obtain the MI-FE and MI-FD estimators using the original and imputed observations. For instance, for the MI-FE estimator, we get coefficients estimates on d i t, ˆβj,f e, and its covariance matrix ˆ V j where j = 1,2,..., M. Finally, the results from the M analyses are combined into a single analysis by applying the Rubin s formula for MI-FE estimator as follows: ˆβ M I f e = 1 M M j =1 ˆβ j,f e ; ˆV M I f e = 1 M M j =1 ˆV j,f e + M + 1 M B f e (27) 15

16 where B f e = 1 M 1 M j =1 ( ˆβ j,f e ˆβ f e )( ˆβ j,f e ˆβ f e ). ˆβ M I f d and ˆV M I f d can be similarly obtained. We can extend the formula in (27) to a HT test statistic, W M I 1. We stack two k-dimensional estimators ˆβ j,m I f e, ˆβ j,m I f d and define this as ˆθ j = ( ˆβ j,m I f e, ˆβ j,m I f d ) and its variance estimate as var ( θ j ). Using ˆθ j and var ( θ j ) and applying (27) we obtain ˆθ M I 1 = 1 M M ˆθ j ; var ( θ) M I 1 = 1 M j =1 M j =1 var ( θ j ) + M + 1 B, (28) M where B = 1 M 1 M j =1 ( ˆθ j ˆθ M I )( ˆθ j ˆθ M I ). Then W M I 1 is obtained as follows: W M I 1 = [R N ( θ M I 1 )] [RN var ( θ) M I 1 R ] 1 R N ( θ M I 1 ). Under conditional strict exogeneity, with the correct model for observability, and rank condition and regularity conditions, it could be shown that W M I 1 converges to χ 2 p distribution under the null hypothesis of Rθ M I 1 = 0. The result is summarized in the following proposition. Proposition 3 Under the Assumptions 2.1 and 3.1, a test statistic W M I 1 converges to χ 2 p distribution under the null hypothesis of H 0 : Rθ M I 1 = 0, where W M I 1 = [R N ( θ M I 1 )] [RN var ( θ) M I 1 R ] 1 R N ( θ M I 1 ) χ 2 p. A test statistic for the test of one parameter can be similarly obtained using pairwise difference between two MI estimators. A statistict M I 1 is given below: R j N ( θm I q ) T M I q = R j var ( N θ t n 1 z (29) M I q )R as y j where q = 1,2,...,5 with the null hypothesis H 0 : R j θ q = 0 and where R j is j th row of R matrix, which is a 1 2k row vector. The complete observability of z i t, the predictors in imputation model, is a strong assumption in practice. A weaker assumption has been studied in Wooldridge [2007] where the assumption of complete observability of the predictors is relaxed to sequential observability of the predictors as Assumption 3.1A. Assumption 3.1A A random vector z i t is observed whenever s i t 1 = 1. Accordingly, we adjust the model for imputation from (26) to the following: 16

17 E(y m,i t z i t,c i, y i t 1, s i t 1 = 1) (30) and apply the MI method using the above three-steps. This relaxation is especially important for the attrition type of missing data because of its presence in many economic examples including our application. 3.2 IPW estimators We extend the HT test to the IPW estimators where the MAR assumption is used to obtain the probability of observability (ŝ i t ). Besides the assumption 2.1 (or 2.1A), we assume the following for the probability of observability model as in Rubin [1976], Little and Rubin [2002] and Wooldridge [2007]. Assumption A random vector z i t is available for a model of observability: E(s i t d i t, x 1i t, z i t ) = P(s i t = 1 d i t, x 1i t, z i t ) = P(s i t = 1 z i t ) = f (z i t ) and f (z i t ) > 0 z i t Z. 2. A random vector z i t is always observed. 3. A parametric model G(z i t,γ) for f (z i t ) is available so E(s i t w i t, z i t ) can be consistently estimated. We consider linear panel data model where G(z i t,γ) is probability of observability: s i t y i t G(z i t,γ) = s i t x i t θ G(z i t,γ) + s i t v i t for i = 1,2,..., N and t = 1,2,...,T. (31) G(z i t,γ) The pooled LS estimator for the model with IPW (IPW-LS estimator) is provided as follows: N s i t x i t θ I PW pl s = ( x i t N s i t x G(z i t, γ) ) 1 i t y i t G(z i t, γ) (32) where G(z i t, γ) is estimated consistently at the first stage by using parametric models such as Probit (or Logit) model with s i t = 1(z i t γ + e i t > 0) (33) where e i t follows either a normal or logistic distribution. 17

18 Similarly, we can apply the IPW-FE and IPW-FD estimators to the equation (31) by applying the relevant transformation. Then, the IPW-FE estimator is: where x w i t = s i t x i t G(z i t, γ) 1 T i T N θ I PW F E = ( x w i t x w i t ) 1 N x w i t ỹ w i t (34) s i t x i t G(z i t, γ) and ŷ w i t = s i t y i t G(z i t, γ) 1 T i T s i t y i t G(z i t, γ), and the IPW-FD estimator is: N θ I PW F D = ( w x i t w x i t ) 1 N w x i t w y i t (35) i=1 t=2 where w x i t = s i t x i t G(z i t, γ) s i t 1x i t 1 G(z i t 1, γ) for s i t = s i t 1 = 1. i=1 t=2 The IPW-LS estimator is consistent as long as E(s i t w i t, z i t ) is consistently estimated, as summarized in the following theorem. Lemma 4 Under Assumptions 2.1 and 3.2, pl i m θ I PW pl s = θ. Consistency can be shown with the law of iterated expectation. See the appendix A for a proof. The consistency for the IPW-FE and IPW-FD estimators under the MAR assumption can be similarly obtained under assumptions 2.1 and 3.2. Following Wooldridge [2007], we can relax the assumption of complete observability of the predictors to partial observability of the predictors for the consistency of the IPW-LS and the IPW-FD estimators as follows: Assumption 3.2A A random vector z i t is observed whenever s i t = 1. The results we use for the proof of consistency of the IPW-LS and IPW-FD estimators remain the same except the first-stage estimation of G(z i t, γ). We instead use the following equation to estimate f (z i t ) because we do not have observation z i t for unit i with s i t = 0: p(s i t = 1 z i t ) = p(s i t = 1 s i t 1 = 1, z i t 1 ) p(s i t 1 = 1 s i t 2 = 1, z i t 2 ) p(s i 1 = 1) G i t (36) where we assume p(s i 1 = 1) = 1(i.e. random sample is assumed at the beginning of period). We estimate π i t by using parametric models such as a Probit model as in (37): 18

19 π i t = Φ(c p + z i t 1 γ > 0), for s i t 1 = 1 (37) where z i t 1 is observed and Φ( ) is the standard normal CDF. A HT test statistic could be constructed by using the difference between two inverse probability weighted estimators from the IPW-LS, IPW-FE and IPW-FD estimators under complete observability of predictors z i t or from the IPW-LS and IPW-FD estimators under partial observability of predictors z i t 1. As an illustration, the IPW-LS and IPW-FE estimators are considered. We stack two k-dimensional consistent estimators θ I PW l s and θ I PW F E and use the stacked estimator θ I PW 1 = ( θ I PW LS, θ I PW F E ) to construct a HT test statistic as in the previous section: N θi PW 1 = 1 N N 0 s i t x i t x i t G(z i t, γ) 0 1 N N x w i t x w i t 1 2k 2k 1 N 1 N N N s i t x i t y i t G(z i t, γ) x w i t ỹ w i t 2k 1. (38) The null hypothesis of the test is provided with a linear transformation of θ I PW 1 by multiplying a restriction matrix R. This transformation provides a restriction of θ j,i PW l s = θ j,i PW F E where j = 1,..., p. The null hypothesis H 0 : Rθ I PW 1 = 0 equals H 0 : θ j,i PW pl s = θ j,i PW F E j = 1,2,..., p. Under conditional strict exogeneity, with the correct model specification for observability, and rank condition and regularity conditions, it could be shown that W I PW 1 converges to χ 2 p distribution under the null hypothesis of Rθ I PW 1 = 0. Proposition 5 Under the Assumptions 3.1 and 3.2, a test statistic W I PW 1 converges to χ 2 p distribution under the null hypothesis of H 0 : Rθ I PW 1 = 0, where W I PW 1 = [R N ( θ I PW 1 )] [R var ( N θ I PW 1 )R ] 1 R N ( θ I PW 1 ) χ 2 p. The other HT test statistics that use the IPW estimators can be obtained as follows: W I PW 2 = [R N ( θ I PW 2 )] [R var ( N θ I PW 2 )R ] 1 R N ( θ I PW 2 ) χ 2 p, W I PW 3 = [R N ( θ I PW 3 )] [R var ( N θ I PW 3 )R ] 1 R N ( θ I PW 3 ) χ 2 p 19

20 where θ I PW 2 = ( θ I PW F D, θ I PW F E ) and θ I PW 3 = ( θ I PW l s, θ I PW F D ). With partially available predictors, a valid test statistic cannot be constructed using θ I PW 1 or θ I PW 2 because of the sequential nature of obtaining probability weight estimates. 6 Thus, W I PW 3 can be only used for partially observed predictors. A test statistic for the test of one parameter can be similarly obtained using pairwise difference between two IPW estimators. A statistict I PW q is given below: R j N ( θi PW q ) T I PW q = R j var ( N θ t n 1 z (39) I PW q )R as y j where q = 1,2,3 with the null hypothesis H 0 : R j θ q = 0 and where R j is j th row of R matrix, which is a 1 2k row vector. 3.3 Two-step estimator We can also extend the HT test applicable with sample-selection correction using the MAR assumption (i.e. selection on observables, z i t ) to model selection/observability with a joint normality assumption on outcome and observability errors. Consider the following outcome and observability model: y i t = x i t β + c i + u i t, and 0 = E(u i t x i t ) E(u i t x i t, s i t = 1) and impose the following condition. Assumption 3.3 A random vector z i t is observed and selection model is provided as follows: s i t = 1(z i t γ + v i t > 0) and E(u i t v i t ) = ρv i t. Then, using the assumption that E(u i t v i t ) = ρv i t, we obtain: E(y i t z i t, s i t = 1) = x i t β + c i + ρe(v i t z i t, v i t > z i t γ) = x i t β + c i + ρ φ( z i t γ σ v ) Φ( z i t γ σ v ). (40) Missingness is not random because correlation between u i t and v i t (i.e. ρ 0) and correlation between x i t and v i t cause inconsistency for an estimator of β. However, the pooled LS, FE and FD es- 6 For the IPW-FE estimator, probability of observability at time t is a function of all time periods because demeaning transformation involves all time periods. 20

21 timators from the regressions of (40) are all consistent because using the selection correction term φ( z i t γ σv ) as an additional regressor removes the inconsistency because of nonrandom missingness (i.e. ) Φ( z i t γ σv E(u i t z i t, s i t = 1) = 0). We can construct the HT test statistics by comparing these two-step estimators of β. Given that nonrandom missingness is the sole source of inconsistency of β, the assumptions of correct specification of selection model s i t = 1(z i t γ + v i t > 0) and E(u i t v i t ) = ρv i t provide consistent estimators for β from two-step estimations. As an illustration, the two-step pooled LS and two-step pooled FE estimators are considered. We stack two k-dimensional consistent estimators θ T S l s and θ T S F E and use the stacked estimator θ T S 1 = ( θ T S l s, θ T S F E ) to construct a HT test statistic as in the previous section: N θt S 1 = 1 N N s i t x i t x i t N N.. s i t x.. i t x i t 1 2k 2k 1 N 1 N N s i t x i t y i t N.. s i t x.. i t y i t 2k 1 (41) where x i t = (x i t, φ( z i t γ σv ) Φ( z i t γ σv )). The null hypothesis of the test is provided with a linear transformation of θ T S 1 by multiplying R, which is the same as the test of MCAR in the previous section. The next proposition provides asymptotic distribution of a test statistic. Proposition 6 Under the Assumptions 3.1 and 3.3, a test statistic W T S 1 converges to χ 2 p distribution under the null hypothesis of H 0 : Rθ T S 1 = 0, where W T S 1 = [R N ( θ T S 1 )] [R var ( N θ T S 1 )R ] 1 R N ( θ T S 1 ) χ 2 p. 4 Monte Carlo Experiment Finite-sample performance of the HT tests for the MCAR assumption and the MAR assumption with the IPW and two-step estimators are provided. The results show that no size distortion and high power of the HT tests in various models. 4.1 Tests of the MCAR assumption We consider the following linear panel data model where the response variable is continuous and explanatory variables can be either binary or continuous: 21

22 y i t = x i t β + c 1i + u i t, for i = 1,2,...,n and t = 1,2,...,T (42) where x i t includes binary (treatment dummy) variables and continuous variables. We are interested in performing statistical inference on β. Selection processes reduce the sample size from 40 to 70 percent of full samples. We consider observations of cross-section dimension from n=100 to n=1,000 and of time-dimension from T =4 to T =8. Note that s i 1 = 1; s i t = 1(c p + z i t γ + c 2i + v i t > 0) (43) where c p is chosen to make the missing fraction be between 40 and 70 percent. The violation of the MCAR assumption is induced by the correlation between u i t (or c i ) and x i t conditional on observability (s i t = 1) Binary explanatory variable d i t and nonrandom missingness We consider following two data generating processes (DGPs) as benchmark where we are interested in estimating β in equation (42). The following DGPs are chosen to mimic our empirical example in a simple way. 7 DGP I 1. c 1i = i i d N (0, 1 16 ) + w 1i ; c 2i = q i + w 2i ; q i i i d N (0,1). 2. (w 1i, w 2i ) B N 0, 1 ρ where B N is bivariate normal. 0 ρ 1 3. d i t = q i + i i d N (0, 1 16 ); x 1i t = 1 if d i t > 0; β 1 = x 2i t i i d N (0,1); β 2 = z i t i i d N (0,1); γ = u i t i i d N (0,1); v i t i i d N (0,1). 7 To save space, we only report the HT tests of the MCAR assumption on the coefficient on a binary covariate, d i t while we report the HT tests for the MAR assumption in estimating continuous covariate. We provide DGPs for the HT tests of the MCAR assumption on a continuous covariate in the Appendix B. The qualitative implications from simulations remain exactly the same. 22

23 First, in the DGP I, the violation of the MCAR assumption is induced by the correlation between the time-invariant unobserved factor (w 1 ) and x 1i t conditional on observability. The correlation is indirect as it appears between error in selection model and x 1i t, and between error in selection model and timeinvariant unobserved factor in c 1i. Therefore, although x 1i t and c 1i are not directly correlated, they are correlated conditional on s i t = 1. The pooled LS estimator is inconsistent but the FE and FD estimators are consistent conditional on observability because the unobserved factor in c 1i causes inconsistency on the coefficient of x 1i t. In DGP I, the null hypothesis of the MCAR assumption is satisfied only if ρ = 0. In addition, ideal conditions for the RE model are satisfied under the null hypothesis. Thus, we can replace the LS estimator with the RE estimator in the construction of HT test. However, in the DGP I under alternative of ρ 0, the LS and RE estimators are inconsistent while the FE and FD estimators are consistent. Thus, we perform tests of the MCAR assumption using the HT statistics of W 1 (based on difference between the LS and FE estimators), W 2 (based on difference between the FD and FE estimators), W 3 (based on difference between the LS and FD estimators). DGP II 1. c 1i = i i d N (0,1); c 2i = i i d N (0,1). 2. z i t i i d N (0, 1 4 ), γ = d i t = z i t α + i i d N (0, 1 16 ); x 1i t = 1 if d i t > 0; β 1 = x i t i i d N (0,1); β 2 = (u i t, v i t ) B N 0, 1 ρ 0 ρ 1. Second, in the DGP II, the violation of the MCAR assumption is induced by the correlation between timevarying unobserved factor (u i t ) and x 1i t conditional on observability. In this case, all three estimators the pooled LS, FE, and FD estimators are inconsistent. The null hypothesis is satisfied if ρ = 0. Under the null, ideal conditions for these estimators are satisfied. All three estimators are inconsistent, under the alternative of ρ 0, but the magnitude of inconsistency differs among estimators Simulation results MC simulations are performed with 2,000 replications for T =4 and T =8 and n ranging from 100 to 1,000 and cluster robust standard errors are used in all simulations. The magnitude of correlation because of 23

24 nonrandom missingness and its effect on the inconsistency of β 1 in (42) are determined by ρ. When ρ = 0, no differential missingness across x 1i t occurs so no inconsistency is caused by nonrandom missingness. Finite-sample performance of the HT tests when covariate of interest is binary is examined in tables 1 and 2 for DGP I and tables 3 and 4 for DGP II. The true value of β 1 is 1 and the rejection rate of the null hypothesis should be nominal size, 0.05, under the true null hypothesis. When ρ is zero, no inconsistency on β 1 is caused by nonrandom missingness. Five HT tests and three variable addition tests using four panel data estimators such as the LS, RE, FE, and FD estimators are examined. The mean of the MC point estimates for β 1 is close to true value 1 for all four estimators under ρ = 0 regardless of missing proportion. In the DGP I with ρ = 0, the null hypothesis is true. 8 Table 1 shows that, for DGP I with missing proportion of 40%, coverages of rejection probability contain 0.05 for all five tests. As missing proportion increases from 40% to 60%, slight over-rejections of null hypothesis appear for some of HT tests when N = 100 and N = 500, but coverages of rejection probability of all five HT tests contain 0.05 when N = 1,000. Proposed five HT tests show no size bias. Finite-sample behavior of the HT tests improves as N increases. The last three columns report rejection probability for variable addition tests using the pooled LS, FE and FD estimators. Over-rejection appears for the pooled LS estimator and size bias is larger when missing proportion is higher. Coverages of rejection probability contain 0.05 in cases for the FE and FD estimators regardless of missing proportion. Variable addition tests with the FE and FD estimators show no size bias while the pooled LS estimator shows slight over-rejection. Under DGP I where inconsistency on an estimator of β 1 is caused by conditional correlation between covariate x 1i t and unobserved time-invariant factors, the pooled LS is inconsistent when ρ is not equal to zero. In simulation, we choose ρ = 0.9 to induce inconsistency of 50% for a missing proportion of 40% and inconsistency of 100% for a missing proportion of 60%. 9 No inconsistency arises for the FE and FD estimators because FE and FD transformations eliminate unobserved time-invariant factors that cause inconsistency because of nonrandom missing data. In Table 1, the last six rows report results when ρ is 0.9. The sample means of the MC point estimates for β are about 0.47 and 0.07 for the pooled LS estimator with missing proportions of 40% and 60%, respectively. Finite-sample mean estimates for β 1 do not depend on cross-sectional dimension, n, for the pooled LS estimator. On the other hand, the mean values of the MC point estimates of β 1 for the FE and FD estimators are 1 for all n regardless of the missing proportion. Therefore, the Pooled LS estimator is inconsistent while the FE and FD estimators 8 In this paper, all simulations of tests are based on 95% confidence interval. 9 Inconsistency increases for the Pooled LS as ρ increases. 24

A Course in Applied Econometrics Lecture 18: Missing Data. Jeff Wooldridge IRP Lectures, UW Madison, August Linear model with IVs: y i x i u i,

A Course in Applied Econometrics Lecture 18: Missing Data Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. When Can Missing Data be Ignored? 2. Inverse Probability Weighting 3. Imputation 4. Heckman-Type