Missing dependent variables in panel data models


Jason Abrevaya
Department of Economics, The University of Texas at Austin; abrevaya@eco.utexas.edu.

Abstract

This paper considers estimation of a fixed-effects model in which the dependent variable may be missing. For cross-sectional units with some (but not all) dependent variables missing, covariate information from all time periods can be utilized to improve efficiency of estimators. Estimation of fixed-effects models with exchangeability (where the fixed effect is independent of the ordering of observations for a cross-sectional unit) and with lagged dependent variables is also considered.

Keywords: missing data; fixed effects; linear projections.

Financial support from the National Science Foundation (SES ) is gratefully acknowledged. This paper has benefited from comments by Stephen Donald and seminar participants at Northwestern, UBC, and the Midwest Econometrics Group meeting.

1 Introduction

Data missingness is a common problem in empirical research. In this paper, we focus on the case in which data on the dependent variable may be missing. In a cross-sectional setting, unless one is willing to adopt an imputation approach, the standard approach is to drop observations for which the dependent variable is missing. Such an approach does not affect consistency of the resulting estimators if the model's disturbances are unrelated to the missingness mechanism (footnote 1). The approach of dropping observations would also be appropriate in a panel-data setting (under a similar assumption on the missingness mechanism), but we show that this approach can result in unnecessary efficiency losses. Specifically, in models with fixed effects (heterogeneity that is correlated with the covariates), covariate information from all time periods, including those with missing dependent variables, can be incorporated into estimation to improve efficiency compared to standard fixed-effects estimators.

The paper is organized as follows. Section 2 introduces the linear fixed-effects model and the dependent-variable missingness mechanism. The Chamberlain (1982) projection approach is reviewed, and the impact of dependent-variable missingness on several fixed-effects estimators is discussed. Section 3 provides a framework for GMM estimation in the presence of missing dependent variables. The GMM estimators have closed-form IV formulas, and the optimal GMM estimator is obtained as a two-step estimator. Section 4 discusses estimation under the additional assumption of exchangeability, where ordering of the observations within a cross-sectional unit is not related to the distribution of the fixed effect. Such an assumption has been considered by Altonji and Matzkin (2005) and may be applicable, for instance, in cases where the cross-sectional unit is a group (like siblings) and the time periods are individual group members.
GMM estimation under exchangeability can provide efficiency gains and also can identify the primary coefficient parameters in the presence of a single observed dependent variable for each cross-sectional unit. Section 5 relaxes the strict exogeneity assumption and considers estimation when the fixed-effects model has lagged dependent variables as explanatory variables. Missingness causes further complications here since missingness affects both the dependent and independent variables of the model.

Footnote 1. In the linear regression model, the formal condition for consistency of OLS is that the error disturbance, conditional on observability of the dependent variable, is not correlated with the covariates.

2 The model

Consider the standard linear fixed-effects model

\[ y_{it} = x_{it}'\beta_0 + c_i + u_{it} \qquad (i = 1,\ldots,n;\ t = 1,\ldots,T), \tag{2.1} \]

where $i$ indexes cross-sectional units (or groups) and $T$ is the total number of time periods (or group members). The case of a balanced panel ($T$ not varying over $i$) is considered for simplicity, although the ideas developed in this paper could also be applied to unbalanced panels. We consider the case of large-$n$ ($n \to \infty$), fixed-$T$ asymptotics. Let $k$ denote the number of covariates in $x_{it}$ (so that $\beta_0$ is a $k$-vector).

To allow for missingness of dependent variables ($y_{it}$), an observability indicator is defined as follows:

\[ s_{it} \equiv \begin{cases} 1 & \text{if } y_{it} \text{ is observed} \\ 0 & \text{otherwise.} \end{cases} \]

For missing dependent variables ($s_{it} = 0$), we use the convention that $y_{it} = 0$. The observed data is $\{(s_{it},\ s_{it}y_{it},\ x_{it})\}_{t=1}^{T}$ for $i = 1,\ldots,n$. (Note that $x_{it}$ is always observed.) Without loss of generality, assume $\sum_{t=1}^{T} s_{it} > 0$ for each $i$ (that is, $y_{it}$ is observed for at least one time period for each $i$) (footnote 2).

To simplify exposition, the following stacked quantities are defined:

\[ y_i \equiv (y_{i1},\ldots,y_{iT})', \quad x_i \equiv (x_{i1}',\ldots,x_{iT}')', \quad u_i \equiv (u_{i1},\ldots,u_{iT})', \quad s_i \equiv (s_{i1},\ldots,s_{iT})'. \]

Each of these stacked quantities is a column vector, with $x_i$ having $kT$ elements and $y_i$, $u_i$, and $s_i$ having $T$ elements.

Footnote 2. This assumption is merely for notational convenience. With Assumption 1 below, this assumption says that we have already dropped cross-sectional units with all $y_{it}$ missing from the dataset. These observations would be dropped from any of the proposed estimators below.
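As a small illustration of the data conventions in Section 2, the sketch below builds the observability indicator $s_{it}$ and the zero-filled $y_{it}$ from a toy array in which missing values are coded as NaN. The array `y_raw` and its values are hypothetical and serve only to show the bookkeeping.

```python
import numpy as np

# Hypothetical toy panel: 4 units, T = 3 periods, NaN marks a missing y_it.
y_raw = np.array([
    [1.2, np.nan, 0.7],
    [np.nan, np.nan, np.nan],   # no y_it observed at all: dropped below
    [0.3, 0.9, 1.1],
    [np.nan, 2.0, np.nan],
])

s = (~np.isnan(y_raw)).astype(int)    # observability indicator s_it
keep = s.sum(axis=1) > 0              # require sum_t s_it > 0 for each unit
s, y_raw = s[keep], y_raw[keep]
y = np.where(s == 1, y_raw, 0.0)      # convention: y_it = 0 when s_it = 0
```

After this step every remaining unit has at least one observed $y_{it}$, and the pair (`s`, `y`) matches the observed-data description $(s_{it}, s_{it}y_{it})$.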

When missingness of $y_{it}$ is not an issue, the standard strict exogeneity assumption for the fixed-effects model (2.1) is $E(u_i \mid x_i, c_i) = 0$. The observability indicator $s_i$ is incorporated into the conditioning set to yield the strict exogeneity assumption for the current setup:

Assumption 1 (Strict exogeneity) $E(u_i \mid x_i, c_i, s_i) = 0$.

Note that Assumption 1 allows observability ($s_i$) of the dependent variable to be related in arbitrary ways with $x_i$ and $c_i$ but restricts the error disturbances ($u_i$) to be conditionally mean independent of $(x_i, c_i, s_i)$ (footnote 3).

The Chamberlain (1982) projection approach considers the linear projection of the fixed effect $c_i$ upon $(x_{i1},\ldots,x_{iT})$,

\[ c_i = \psi_0 + x_{i1}'\lambda_{01} + x_{i2}'\lambda_{02} + \cdots + x_{iT}'\lambda_{0T} + a_i, \tag{2.2} \]

where $\psi_0$ is a scalar, each $\lambda_{0t}$ is a $k \times 1$ vector, and $E[x_{it}a_i] = 0$ (by construction) for each $t$. Plugging (2.2) into (2.1) yields a model from which $(\beta_0, \psi_0, \lambda_{01},\ldots,\lambda_{0T})$ can be estimated:

\[ y_{it} = x_{it}'\beta_0 + \psi_0 + x_{i1}'\lambda_{01} + x_{i2}'\lambda_{02} + \cdots + x_{iT}'\lambda_{0T} + a_i + u_{it}. \tag{2.3} \]

An additional assumption on missingness is required for the Chamberlain approach to be applicable to the case of missing dependent variables:

Assumption 2 (Ignorability of missingness for the projection error) $E[x_{it}a_i \mid s_i] = 0$ for $t = 1,\ldots,T$.

A sufficient condition for Assumption 2 to hold is that the random variable $c_i \mid x_i$ is independent of $s_i$ (i.e., the conditional distribution $D(c_i \mid x_i)$ is the same as the conditional distribution $D(c_i \mid x_i, s_i)$).

Before discussing estimation in the presence of missingness, we briefly review some simple estimators when dependent variables are fully observed. A particularly simple estimation method is the pooled ordinary least squares (OLS) regression (i.e., OLS of $y_{it}$ on $(x_{it}, 1, x_{i1},\ldots,x_{iT})$). It is known that this estimator is numerically equivalent to the within estimator (and also a Mundlak (1978) regression estimator), as stated in the following result:

Equivalence result for fixed-effects models with no missing dependent variables: The following estimators of $\beta_0$ are numerically equivalent:
(i) the within estimator (OLS of $\ddot{y}_{it}$ on $\ddot{x}_{it}$), where $\ddot{y}_{it} \equiv y_{it} - T^{-1}\sum_{s=1}^{T} y_{is}$ and $\ddot{x}_{it} \equiv x_{it} - T^{-1}\sum_{s=1}^{T} x_{is}$,
(ii) the Chamberlain regression estimator (OLS of $y_{it}$ on $(x_{it}, 1, x_{i1},\ldots,x_{iT})$), and
(iii) the Mundlak regression estimator (OLS of $y_{it}$ on $(x_{it}, 1, \bar{x}_i)$), where $\bar{x}_i \equiv T^{-1}\sum_{s=1}^{T} x_{is}$.

While these three linear-regression estimators are equivalent in the case of perfect dependent-variable observability, this equivalence breaks down when dependent variables are possibly missing. The following subsection illustrates this point, focusing on the simple case of two time periods ($T = 2$).

Footnote 3. This assumption is the same made by Wooldridge (2002) in the context of unbalanced panel-data models, where missingness ($s_{it} = 0$) corresponds to the full observation $(y_{it}, x_{it})$ being missing rather than just $y_{it}$.

2.1 Linear-regression estimators with missing dependent variables

We consider the two time-period case ($T = 2$) for simplicity, although the basic idea of this subsection extends easily to $T > 2$. Fully observed cross-sectional units have $s_{i1} = s_{i2} = 1$, whereas partially observed cross-sectional units have $s_{i1} = 0$ or $s_{i2} = 0$. For each of the three linear-regression estimators described above (within, Chamberlain, Mundlak), consider the logical extensions to the missing-$y$ setting and the resulting properties of the estimators (under Assumption 1).

Within estimator: Any partially observed cross-sectional unit $i$ would be dropped since $\ddot{y}_{it}$ is unknown (footnote 4). The resulting within estimator, regressing $\ddot{y}_{it}$ on $\ddot{x}_{it}$ for the subsample of fully observed cross-sectional units, is consistent under Assumption 1. (Assumption 2 is not needed since the projection error is eliminated by the within transformation.)

Footnote 4. For $T > 2$, within transformations can be used on the subset of $s_{it} = 1$ observations when $\sum_t s_{it} \geq 2$.

Chamberlain estimator: When $s_{it} = 0$ observations are dropped, the resulting pooled OLS estimator ($y_{it}$ on $(x_{it}, 1, x_{i1}, x_{i2})$ for $s_{it} = 1$ observations) is consistent under Assumptions 1 and 2. This modified Chamberlain estimator is not equivalent to the modified within estimator. Whereas the modified within estimator drops the partially observed cross-sectional units, the modified Chamberlain estimator still uses both time periods of covariate data for such units.

Mundlak estimator: When $s_{it} = 0$ observations are dropped, the resulting pooled OLS estimator ($y_{it}$ on $(x_{it}, 1, \bar{x}_i)$ for $s_{it} = 1$ observations) is no longer guaranteed to be consistent (footnote 5).

Interestingly, the modified Chamberlain estimator is able to incorporate cross-sectional units for which only a single $y_{it}$ value is available. This particular feature seems to be new to the literature on fixed-effects models, although Altonji and Matzkin (2005) have previously considered estimation with a single $y_{it}$ (and multiple covariate observations) in a nonparametric panel-data model with heterogeneity (but not of a pure fixed-effects form) (footnote 6).

The modified Chamberlain estimator provides efficiency gains since it effectively uses more observations than the within estimator. The modified Chamberlain estimator includes one of the time periods for the partially observed cross-sectional units within the pooled OLS estimation, whereas the within estimator is equivalent to a pooled OLS on only the fully observed cross-sectional units.

To illustrate the efficiency gains from the modified Chamberlain estimator (as compared to the modified within estimator), we report the results of some simple simulations. The design uses $n = 1000$ and $T = 2$, with $x_1, x_2, u_1, u_2 \sim N(0, 1)$, $a = 1 + 2x_1$, $y_t = x_t + a + u_t$, and $y_{it}$ missing at random. Three simulations are run (0% missingness, 20% missingness, 40% missingness), and we report the results from a representative simulation in Table 1.
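The simulation design just described can be sketched in a few lines of NumPy. This is a minimal sketch rather than the author's code: the function name `simulate_once`, the seed, and the choice to make only $y_{i2}$ missing (as the column headings of Table 1 suggest) are assumptions made for illustration.

```python
import numpy as np

def simulate_once(n=1000, p_missing=0.2, seed=0):
    """One draw of the T = 2 design; returns within and Chamberlain estimates of beta."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, 2))            # x_i1, x_i2
    u = rng.standard_normal((n, 2))            # u_i1, u_i2
    c = 1.0 + 2.0 * x[:, 0]                    # heterogeneity a = 1 + 2 x_1
    y = x + c[:, None] + u                     # y_t = x_t + a + u_t, so beta_0 = 1
    s = np.ones((n, 2), dtype=bool)
    s[:, 1] = rng.random(n) >= p_missing       # assumption: only y_i2 can be missing

    # Modified within estimator: fully observed units only.
    # For T = 2 the within estimator equals OLS of (y2 - y1) on (x2 - x1), no intercept.
    full = s.all(axis=1)
    dx = x[full, 1] - x[full, 0]
    dy = y[full, 1] - y[full, 0]
    beta_within = (dx @ dy) / (dx @ dx)

    # Modified Chamberlain estimator: pooled OLS of y_it on (x_it, 1, x_i1, x_i2)
    # over all observations with s_it = 1.
    blocks, ys = [], []
    for t in range(2):
        obs = s[:, t]
        blocks.append(np.column_stack(
            [x[obs, t], np.ones(obs.sum()), x[obs, 0], x[obs, 1]]))
        ys.append(y[obs, t])
    coef = np.linalg.lstsq(np.vstack(blocks), np.concatenate(ys), rcond=None)[0]
    return {"within": beta_within, "chamberlain": coef[0]}

no_missing = simulate_once(p_missing=0.0)
some_missing = simulate_once(p_missing=0.4, seed=1)
```

With `p_missing=0.0` the two estimates coincide up to rounding, which is the equivalence result of Section 2. With missingness they differ because the Chamberlain regression keeps the first-period observation of partially observed units.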
For this design, the Chamberlain approach eliminates roughly half of the inefficiency of the within estimator (i.e., comparing the within on partial data to the baseline on the full data).

Table 1: Simulation results with missing $y_{it}$ values

                                  None missing   20% y_2 missing   40% y_2 missing
Baseline (within on full data)    [lost]         [lost]            [lost]
Within on partial data            [lost]         [lost]            [lost]
Chamberlain                       [lost]         [lost]            [lost]
Mundlak                           [lost]         [lost]            [lost]

Coefficient estimates (with standard errors in parentheses) are reported. Sample size n = 1000. The simulation design is described in the text. Entries marked [lost] did not survive text extraction.

Footnote 5. Of course, if the partially observed cross-sectional units were dropped entirely, the resulting estimator would remain equivalent to the within estimator.

Footnote 6. Specifically, Altonji and Matzkin (2005) require the heterogeneity to depend upon some function of the covariates (like $\bar{x}_i$).

3 GMM estimation with missing dependent variables

In this section, we consider GMM estimation of the fixed-effects model based upon the projection form in equation (2.3). Under Assumption 1, the composite error disturbance ($a_i + u_{it}$) is uncorrelated with the covariates from all time periods (i.e., $E(x_{is}(a_i + u_{it})) = 0$ for all $s, t \in \{1,\ldots,T\}$). Moreover, under Assumption 2, these orthogonality properties remain true even when conditioning on observability ($s_{it} = 1$). The following moment functions correspond to these orthogonality conditions:

\[ s_{it}\begin{pmatrix} 1 \\ x_i \end{pmatrix}\left(y_{it} - x_{it}'\beta - \psi - x_{i1}'\lambda_1 - x_{i2}'\lambda_2 - \cdots - x_{iT}'\lambda_T\right) \quad \text{for } t = 1,\ldots,T. \tag{3.4} \]

Recall that $x_i$ is the $kT \times 1$ stacked vector of covariates from all time periods. Let $g(z_i, \theta)$ denote the stacked vector of all the moment functions in (3.4), where $z_i \equiv (y_i, x_i, s_i)$ and $\theta$ collects the parameters $(\beta, \psi, \lambda_1,\ldots,\lambda_T)$. There are a total of $T(kT + 1)$ moment conditions and $(T + 1)k + 1$ parameters, meaning there are $(T^2 - T - 1)k + (T - 1)$ overidentifying restrictions.

The orthogonality conditions imply that the moment functions have expectation zero at the true parameter values. For identification (i.e., non-zero expectation at other parameter values), we need the following additional full-rank condition:

Assumption 3 (Full rank) $E(V_i'V_i)$ has full column rank, where

\[ V_i \equiv \begin{pmatrix} s_{i1}x_{i1}' & s_{i1} & s_{i1}x_i' \\ s_{i2}x_{i2}' & s_{i2} & s_{i2}x_i' \\ \vdots & \vdots & \vdots \\ s_{iT}x_{iT}' & s_{iT} & s_{iT}x_i' \end{pmatrix}. \]

Note that a given row of the $V_i$ matrix is the row of RHS variables $(x_{it}', 1, x_i')$ for the Chamberlain regression when $s_{it} = 1$ ($y_{it}$ observed) and is a row of zeros when $s_{it} = 0$ ($y_{it}$ missing). The full-rank condition, aside from ruling out the usual linear dependence of covariates, also rules out a situation in which a given time period's dependent variable (for example, $y_{i1}$) is missing for all $i$. The following lemma provides the GMM identification result:

Lemma 1 Under Assumptions 1 and 2, $E[g(z_i, \theta_0)] = 0$. If Assumption 3 also holds, then $E[g(z_i, \theta)] \neq 0$ for $\theta \neq \theta_0$.

Let $\hat{\theta}$ denote the unweighted GMM estimator obtained by minimizing

\[ \left(n^{-1}\sum_{i=1}^{n} g(z_i, \theta)\right)'\left(n^{-1}\sum_{i=1}^{n} g(z_i, \theta)\right). \tag{3.5} \]

The GMM estimator can be implemented without numerical optimization by using instrumental-variables methods. Specifically, define the instrumental-variable matrix $Z_i$ as

\[ Z_i \equiv \begin{pmatrix} (s_{i1} \;\; s_{i1}x_i') & 0 & \cdots & 0 \\ 0 & (s_{i2} \;\; s_{i2}x_i') & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & (s_{iT} \;\; s_{iT}x_i') \end{pmatrix}, \tag{3.6} \]

which corresponds to using $(1, x_i')$ as instruments in each time period for which $y_{it}$ is observed. The $Z_i$ instrument matrix is $T \times T(kT + 1)$. Then, the unweighted GMM estimator $\hat{\theta}$ (see equation (3.5)) can be obtained directly as the system IV estimator

\[ \hat{\theta} = (V'ZZ'V)^{-1}(V'ZZ'Y), \tag{3.7} \]

where $V$, $Z$, and $Y$ are the stacked versions of $V_i$, $Z_i$, and $y_i$ (with dimensions $nT \times ((T + 1)k + 1)$, $nT \times T(kT + 1)$, and $nT \times 1$, respectively).
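To make the matrices concrete, the block below writes out $V_i$, $Z_i$, and the moment vector for the smallest case, $T = 2$ with a single covariate ($k = 1$). It is a worked instance of the definitions above, not an additional result. The ordering $\theta = (\beta, \psi, \lambda_1, \lambda_2)'$ follows the column order of $V_i$ and is an assumption about notation.

```latex
% Worked instance of (3.4) and (3.6) for T = 2, k = 1, with theta = (beta, psi, lambda_1, lambda_2)'.
\[
V_i = \begin{pmatrix}
  s_{i1}x_{i1} & s_{i1} & s_{i1}x_{i1} & s_{i1}x_{i2} \\
  s_{i2}x_{i2} & s_{i2} & s_{i2}x_{i1} & s_{i2}x_{i2}
\end{pmatrix},
\qquad
Z_i = \begin{pmatrix}
  s_{i1} & s_{i1}x_{i1} & s_{i1}x_{i2} & 0 & 0 & 0 \\
  0 & 0 & 0 & s_{i2} & s_{i2}x_{i1} & s_{i2}x_{i2}
\end{pmatrix}
\]
% The stacked moment vector is g(z_i, theta) = Z_i'(y_i - V_i theta), a 6 x 1 vector:
\[
g(z_i,\theta) = Z_i'\,(y_i - V_i\theta),
\qquad
\underbrace{T(kT+1)}_{=\,6 \text{ moments}} \;-\; \underbrace{\big((T+1)k+1\big)}_{=\,4 \text{ parameters}}
 \;=\; (T^2 - T - 1)k + (T - 1) \;=\; 2 .
\]
```

Because $y_{it} = 0$ whenever $s_{it} = 0$, a missing period contributes a row of zeros to both $V_i$ and $Z_i$, so that period drops out of $g(z_i, \theta)$ automatically.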

The unweighted GMM estimator is not efficient. The GMM objective function that allows for weighting is

\[ \left(n^{-1}\sum_{i=1}^{n} g(z_i, \theta)\right)'\hat{W}\left(n^{-1}\sum_{i=1}^{n} g(z_i, \theta)\right). \tag{3.8} \]

The system 2SLS estimator has $\hat{W}_{2SLS} = (Z'Z/n)^{-1}$, so that

\[ \hat{\theta}_{2SLS} = (V'Z\hat{W}_{2SLS}Z'V)^{-1}(V'Z\hat{W}_{2SLS}Z'Y). \]

Note that $\hat{\theta}_{2SLS}$ is numerically equivalent to the modified Chamberlain linear-regression estimator described in Section 2.1, specifically the pooled OLS estimator of $y_{it}$ on $(1, x_{it}, x_i)$ for the subsample of $s_{it} = 1$ observations (footnote 7).

To obtain the optimal GMM estimator, the standard two-step approach can be used. For instance, after obtaining the system 2SLS estimator $\hat{\theta}_{2SLS}$, the optimal GMM estimator $\tilde{\theta}$ minimizes the objective function (3.8), where the optimal weighting matrix $\hat{W}$ is given by

\[ \hat{W} = \left(n^{-1}\sum_{i=1}^{n} g(z_i, \hat{\theta}_{2SLS})\,g(z_i, \hat{\theta}_{2SLS})'\right)^{-1}. \]

The optimal GMM estimator can be obtained directly as

\[ \tilde{\theta} = (V'Z\hat{W}Z'V)^{-1}(V'Z\hat{W}Z'Y). \]

4 Exchangeable panels

In this section, we consider a fixed-effects model in which the time periods are exchangeable. The notion of exchangeability in this context means that the fixed effect is independent of the ordering of the time periods in the data. Examples could include panel data for which $t$ does not actually index time, such as sibling data, twins data, etc. (Even in such cases, however, the exchangeability assumption will rule out an order effect, such as a birth-order effect for siblings data.) The exchangeability assumption is defined as follows:

Assumption 4 (Exchangeability) $D(c_i \mid x_{i1} = x_1, x_{i2} = x_2, \ldots, x_{iT} = x_T) = D(c_i \mid x_{ij_1} = x_1, x_{ij_2} = x_2, \ldots, x_{ij_T} = x_T)$ for any $(x_1, x_2, \ldots, x_T)$ and any permutation of the set $\{1, 2, \ldots, T\}$ (denoted $(j_1, j_2, \ldots, j_T)$).

Footnote 7. To see this equivalence, note that the first-stage regression is vacuous in that the fitted values are identical to the pooled-OLS covariate matrix (specifically, $Z(Z'Z)^{-1}Z'V = V$).
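Returning to the closed-form estimators of Section 3, the following sketch implements the system 2SLS estimator and the two-step optimal GMM estimator with NumPy, and applies them to the $T = 2$ simulation design of Section 2.1. It is a minimal illustration under stated assumptions: the helper names `build_system`, `iv_gmm`, and `two_step`, the seed, and the 40% missingness rate on $y_{i2}$ are choices made here, not taken from the paper.

```python
import numpy as np

def build_system(y, x, s):
    """Per-unit V_i (T x p), Z_i (T x T(kT+1)) and zero-filled y_i, as in (3.6)."""
    n, T, k = x.shape
    xi = x.reshape(n, T * k)                       # stacked covariates x_i
    inst = np.column_stack([np.ones(n), xi])       # instruments (1, x_i')
    q = inst.shape[1]                              # q = kT + 1
    V = np.zeros((n, T, k + 1 + T * k))
    Z = np.zeros((n, T, T * q))
    for t in range(T):
        st = s[:, [t]]
        V[:, t, :] = st * np.column_stack([x[:, t, :], np.ones(n), xi])
        Z[:, t, t * q:(t + 1) * q] = st * inst     # block-diagonal placement
    Y = np.where(s > 0, y, 0.0)                    # convention y_it = 0 if missing
    return V, Z, Y

def iv_gmm(V, Z, Y, W):
    """Closed form (V'Z W Z'V)^{-1} V'Z W Z'Y on the stacked matrices."""
    n, T, p = V.shape
    Vs, Zs, Ys = V.reshape(n * T, p), Z.reshape(n * T, -1), Y.reshape(n * T)
    A = Vs.T @ Zs
    return np.linalg.solve(A @ W @ A.T, A @ W @ (Zs.T @ Ys))

def two_step(y, x, s):
    """System 2SLS first, then GMM with the estimated optimal weighting matrix."""
    V, Z, Y = build_system(y, x, s)
    n, T, _ = V.shape
    Zs = Z.reshape(n * T, -1)
    theta_2sls = iv_gmm(V, Z, Y, np.linalg.inv(Zs.T @ Zs / n))
    resid = Y - V @ theta_2sls                     # (n, T) residuals, zero if missing
    g = np.einsum("ntm,nt->nm", Z, resid)          # g(z_i, theta) = Z_i'(y_i - V_i theta)
    theta_opt = iv_gmm(V, Z, Y, np.linalg.inv(g.T @ g / n))
    return theta_2sls, theta_opt

# Illustration on the Section 2.1 design (k = 1, T = 2); true theta = (1, 1, 2, 0).
rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal((n, 2, 1))
u = rng.standard_normal((n, 2))
c = 1.0 + 2.0 * x[:, 0, 0]
y = x[:, :, 0] + c[:, None] + u
s = np.ones((n, 2))
s[:, 1] = rng.random(n) >= 0.4                     # assumed: 40% of y_i2 missing
theta_2sls, theta_opt = two_step(y, x, s)
```

The 2SLS output can be compared with a pooled OLS fit on the $s_{it} = 1$ rows to check the equivalence noted in footnote 7. The parameter order in both outputs is $(\beta, \psi, \lambda_1, \lambda_2)$.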

Recalling the projection of the fixed effect $c_i$ onto the covariates,

\[ c_i = \psi_0 + x_{i1}'\lambda_{01} + x_{i2}'\lambda_{02} + \cdots + x_{iT}'\lambda_{0T} + a_i, \]

Assumption 4 implies that the $\lambda$ coefficient vectors must be identical for each time period (that is, $\lambda_{01} = \lambda_{02} = \cdots = \lambda_{0T} = \lambda_0$ for some $k \times 1$ vector $\lambda_0$). Plugging these equalities back into the projection equation yields

\[ c_i = \psi_0 + \left(\sum_{t=1}^{T} x_{it}\right)'\lambda_0 + a_i, \tag{4.9} \]

or equivalently

\[ c_i = \psi_0 + \bar{x}_i'\gamma_0 + a_i, \quad \text{where } \gamma_0 = T\lambda_0. \tag{4.10} \]

Importantly, the projection error $a_i$ in (4.9) and (4.10) is still uncorrelated with the covariates $(x_{i1}, x_{i2}, \ldots, x_{iT})$. In general, this orthogonality does not hold without exchangeability when one projects $c_i$ upon $\bar{x}_i$.

The GMM estimator can incorporate the exchangeability restriction by writing the orthogonality conditions using the projection equation (4.10) and conditioning on observability,

\[ E\left(x_i(a_i + u_{it}) \mid s_i\right) = E\left(x_i(y_{it} - x_{it}'\beta_0 - \psi_0 - \bar{x}_i'\gamma_0) \mid s_i\right) = 0. \tag{4.11} \]

The vector of moment functions, denoted $h(z_i, \theta)$, corresponding to these orthogonality conditions is obtained by stacking the following functions

\[ s_{it}\begin{pmatrix} 1 \\ x_i \end{pmatrix}\left(y_{it} - x_{it}'\beta - \psi - \bar{x}_i'\gamma\right) \quad \text{for } t = 1,\ldots,T. \tag{4.12} \]

There are $T(kT + 1)$ moment functions, as in the non-exchangeable case of Section 3, but now only $2k + 1$ parameters. The associated GMM objective function is

\[ \left(n^{-1}\sum_{i=1}^{n} h(z_i, \theta)\right)'\hat{W}\left(n^{-1}\sum_{i=1}^{n} h(z_i, \theta)\right) \tag{4.13} \]

for a weighting matrix $\hat{W}$. As in the non-exchangeable case, the GMM estimator can be obtained directly using IV methods. The instrument matrix $Z_i$ (and its stacked version $Z$) remains the same since the instruments in (4.12) are unchanged. The covariate matrix (denoted $V_i^e$) is different in the exchangeable case, as $\bar{x}_i$ is used in place of the full $x_i$ vector,

\[ V_i^e \equiv \begin{pmatrix} s_{i1}x_{i1}' & s_{i1} & s_{i1}\bar{x}_i' \\ s_{i2}x_{i2}' & s_{i2} & s_{i2}\bar{x}_i' \\ \vdots & \vdots & \vdots \\ s_{iT}x_{iT}' & s_{iT} & s_{iT}\bar{x}_i' \end{pmatrix}. \tag{4.14} \]

The covariate matrix $V_i^e$ has dimension $T \times (2k + 1)$, and its stacked version $V^e$ has dimension $nT \times (2k + 1)$. The form of the IV estimator, depending upon the weighting matrix $\hat{W}$, is

\[ (V^{e\prime}Z\hat{W}Z'V^e)^{-1}(V^{e\prime}Z\hat{W}Z'Y). \tag{4.15} \]

The system 2SLS estimator (denoted $\hat{\theta}^e_{2SLS}$), with $\hat{W} = (Z'Z/n)^{-1}$, is again a pooled OLS estimator on the $s_{it} = 1$ observations. It is the exchangeability-restricted analogue of the modified Chamberlain linear-regression estimator of Section 2.1, namely the regression of $y_{it}$ on $(x_{it}, 1, \bar{x}_i)$, because the columns of $V^e$ lie in the column space of $Z$. The optimal GMM estimator can be obtained by using the optimal weighting matrix

\[ \hat{W} = \left(n^{-1}\sum_{i=1}^{n} h(z_i, \hat{\theta}^e_{2SLS})\,h(z_i, \hat{\theta}^e_{2SLS})'\right)^{-1} \]

in the closed-form expression (4.15).

An interesting feature of the model with exchangeability is that the parameters can be identified even when only a single $y_{it}$ is available for each $i$. As an example, if one had data on twins with $y_{i1}$ observed, $y_{i2}$ missing, and both $x_{i1}$ and $x_{i2}$ observed, the least-squares regression of $y_{i1}$ on $(x_{i1}, 1, \bar{x}_i)$ would consistently estimate $(\beta_0, \psi_0, \gamma_0)$ (footnote 8). In terms of the previous full-rank condition (Assumption 3) discussed in Section 3, the model with exchangeability requires that $E(V_i^{e\prime}V_i^e)$ have full column rank. If only $y_{i1}$ is observed ($s_{i1} = 1$, $s_{i2} = \cdots = s_{iT} = 0$), this full-rank condition is equivalent to having full rank of $E(V_{i1}^{e\prime}V_{i1}^e)$, where $V_{i1}^e = [x_{i1}' \;\; 1 \;\; \bar{x}_i']$ is the first row of $V_i^e$. This condition will hold with time-varying $x_{it}$, whereas the analogous condition for the non-exchangeable case would not; in the non-exchangeable case, recall that the first row of $V_i$ was $[x_{i1}' \;\; 1 \;\; x_{i1}' \;\cdots\; x_{iT}']$, for which the repeated $x_{i1}$ causes collinearity problems.

Footnote 8. This idea was previously discussed by Altonji and Matzkin (2005, p. 1066), but their discussion starts from a slightly different model (their equation (2.9)) under additional restrictions. In contrast, the discussion here points out that the exchangeability assumption yields this identification result for the linear fixed-effects model without any further restrictions.
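The single-observation identification result can be illustrated with a short simulation. The sketch below assumes a hypothetical twins design in which only $y_{i1}$ is ever observed. The parameter values ($\beta_0 = 1$, $\psi_0 = 1$, $\gamma_0 = 2$), the normal draws, and the variable names are illustrative assumptions rather than anything reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
x = rng.standard_normal((n, 2))                  # (x_i1, x_i2): i.i.d., hence exchangeable
xbar = x.mean(axis=1)                            # within-pair covariate mean
c = 1.0 + 2.0 * xbar + rng.standard_normal(n)    # c_i = psi_0 + gamma_0 * xbar_i + a_i
y1 = 1.0 * x[:, 0] + c + rng.standard_normal(n)  # y_i1 observed; y_i2 is never observed

# Exchangeable case: regress y_i1 on (x_i1, 1, xbar_i).
X_e = np.column_stack([x[:, 0], np.ones(n), xbar])
beta_hat, psi_hat, gamma_hat = np.linalg.lstsq(X_e, y1, rcond=None)[0]
rank_e = np.linalg.matrix_rank(X_e)              # full column rank (3 columns)

# Non-exchangeable Chamberlain row (x_i1, 1, x_i1, x_i2): x_i1 appears twice.
X_c = np.column_stack([x[:, 0], np.ones(n), x[:, 0], x[:, 1]])
rank_c = np.linalg.matrix_rank(X_c)              # rank deficient (4 columns, rank 3)
```

The rank comparison mirrors the collinearity argument in the text: without the exchangeability restriction, a single observed period cannot separate $\beta_0$ from $\lambda_{01}$.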

If Assumption 4 is true, the optimal GMM estimator above will be at least as efficient as the optimal GMM estimator of Section 3 since it incorporates the restrictions on the $\lambda$ parameters. Note that one could directly test an implication of exchangeability by using any consistent estimator from Section 2 and testing the null hypothesis $H_0: \lambda_{01} = \lambda_{02} = \cdots = \lambda_{0T}$.

5 Lagged dependent variables

In this section, we consider estimation in a fixed-effects model where a lagged dependent variable enters as an explanatory variable (often called an autoregressive fixed-effects model). Missingness of $y_{it}$ causes further complications here since the missingness now affects both the dependent and independent variables of the model. To focus ideas, we consider a model with no additional covariates $x_{it}$, although inclusion of such covariates would be straightforward (footnote 9):

\[ y_{it} = \rho_0 y_{i,t-1} + c_i + u_{it} \qquad (i = 1,\ldots,n;\ t = 1,\ldots,T), \tag{5.16} \]

where

\[ E(u_{it} \mid y_{i,t-1},\ldots,y_{i0}, c_i, s_i) = 0 \quad \text{for } t = 1,\ldots,T. \tag{5.17} \]

The condition on the disturbances in (5.17), which implies that the AR(1) structure completely captures the dynamics, is standard for this model (see, e.g., Alonso-Borrego and Arellano (1999)). As in the preceding sections, $s_{it} = 1$ if $y_{it}$ is observed and $s_{it} = 0$ if $y_{it}$ is missing. The initial condition is given by $y_{i0}$. Without loss of generality, we will assume that $y_{i0}$ is never missing (footnote 10). Consider the projection of the fixed effect $c_i$ onto the initial condition, given by (footnote 11)

\[ c_i = \psi_0 + \gamma_0 y_{i0} + a_i, \qquad E(a_i y_{i0}) = 0. \tag{5.18} \]

Footnote 9. The projection described below would include $(x_{i1},\ldots,x_{iT})$.

Footnote 10. If $y_{i0}$ is missing, the first non-missing $y_{it}$ can be treated as the initial condition (under the missingness assumption below).

Footnote 11. The idea of projecting the fixed effect onto the initial condition was first considered by Chamberlain (1980), although his original treatment made the additional assumption of joint normality of the $u_{it}$ disturbances (conditional on $y_{i0}$).

For the projection-based GMM estimator to be consistent, we require a condition analogous to Assumption 2, specifically $E(a_i y_{i0} \mid s_i) = 0$ here. In order to construct the GMM estimator that allows for missingness, re-write $y_{it}$ as a function of the initial condition $y_{i0}$ by recursively applying equation (5.16) and plugging in the projection (5.18). For instance, $y_{i1}$ can be expressed in terms of $y_{i0}$ as follows:

\[ y_{i1} = \rho_0 y_{i0} + c_i + u_{i1} = \rho_0 y_{i0} + \psi_0 + \gamma_0 y_{i0} + a_i + u_{i1} = \psi_0 + (\rho_0 + \gamma_0)y_{i0} + a_i + u_{i1}. \]

The next observation, $y_{i2}$, can then be written

\[ \begin{aligned} y_{i2} &= \rho_0 y_{i1} + c_i + u_{i2} \\ &= \rho_0(\rho_0 y_{i0} + \psi_0 + \gamma_0 y_{i0} + a_i + u_{i1}) + \psi_0 + \gamma_0 y_{i0} + a_i + u_{i2} \\ &= \psi_0(1 + \rho_0) + (\rho_0^2 + \gamma_0(1 + \rho_0))y_{i0} + (1 + \rho_0)a_i + \rho_0 u_{i1} + u_{i2}. \end{aligned} \]

Repeating the recursive substitution yields the general formula

\[ y_{it} = \psi_0\sum_{s=0}^{t-1}\rho_0^s + \left(\rho_0^t + \gamma_0\sum_{s=0}^{t-1}\rho_0^s\right)y_{i0} + \left(\sum_{s=0}^{t-1}\rho_0^s\right)a_i + \sum_{s=0}^{t-1}\rho_0^s u_{i,t-s}. \tag{5.19} \]

The moment conditions are based upon the composite error in (5.19), $\left(\sum_{s=0}^{t-1}\rho_0^s\right)a_i + \sum_{s=0}^{t-1}\rho_0^s u_{i,t-s}$, being uncorrelated with $y_{i0}$ (conditional on $s_i$). The factor multiplying $a_i$ is a constant, so it does not affect these orthogonality conditions. Specifically, the vector of moment functions would be obtained by stacking the following functions

\[ s_{it}\begin{pmatrix} 1 \\ y_{i0} \end{pmatrix}\left(y_{it} - \psi\sum_{s=0}^{t-1}\rho^s - \left(\rho^t + \gamma\sum_{s=0}^{t-1}\rho^s\right)y_{i0}\right) \quad \text{for } t = 1,\ldots,T. \tag{5.20} \]

GMM estimation proceeds in the usual way. In this situation (without covariates), there are a total of $2T$ moments and 3 parameters $(\rho_0, \psi_0, \gamma_0)$ being estimated. For $T = 2$ (a maximum of three observed $y_{it}$ values), identification requires that the probabilities of observability for $y_{i1}$ and $y_{i2}$ are non-zero. Interestingly, the model parameters can be identified when all cross-sectional units have only two observed $y_{it}$ values (for example, if $(y_{i0}, y_{i1})$ are observed for some observations and $(y_{i0}, y_{i2})$ are observed for the remaining observations). That is, we require three time periods of data (as in the fully observed case) but obtain identification with only two dependent variables for each cross-sectional unit.
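The identification claim for $T = 2$ can be checked numerically. The sketch below is not the GMM estimator built from (5.20). It is a simpler plug-in calculation, adopted here for illustration, that runs OLS of $y_{i1}$ on $(1, y_{i0})$ in one subsample and of $y_{i2}$ on $(1, y_{i0})$ in the other, then solves the reduced-form coefficients for $(\rho_0, \psi_0, \gamma_0)$. The parameter values, the 50/50 split, and the reliance on $\psi_0 \neq 0$ are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
rho0, psi0, gamma0 = 0.5, 1.0, 0.3                 # assumed true values
y0 = rng.standard_normal(n)                        # initial condition, always observed
c = psi0 + gamma0 * y0 + rng.standard_normal(n)    # projection (5.18) with error a_i
y1 = rho0 * y0 + c + rng.standard_normal(n)
y2 = rho0 * y1 + c + rng.standard_normal(n)
see_y1 = rng.random(n) < 0.5                       # each unit reveals exactly one of y1, y2

def ols(dep, reg):
    """Intercept and slope from OLS of dep on (1, reg)."""
    X = np.column_stack([np.ones(len(reg)), reg])
    return np.linalg.lstsq(X, dep, rcond=None)[0]

# Reduced forms implied by the recursion:
#   y1 on (1, y0): intercept psi,           slope rho + gamma
#   y2 on (1, y0): intercept psi (1 + rho), slope rho^2 + gamma (1 + rho)
p10, p11 = ols(y1[see_y1], y0[see_y1])
p20, p21 = ols(y2[~see_y1], y0[~see_y1])

psi_hat = p10
rho_hat = p20 / p10 - 1.0                          # requires psi_0 != 0
gamma_hat = p11 - rho_hat
# Fourth reduced-form coefficient is an overidentifying check (4 moments, 3 parameters).
overid_gap = p21 - (rho_hat**2 + gamma_hat * (1.0 + rho_hat))
```

No unit contributes more than two observed values of $y$, yet all three parameters are recovered, and the unused slope `p21` gives one overidentifying restriction, in line with the $2T = 4$ moments and 3 parameters counted in the text.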

References

Alonso-Borrego, Cesar and Manuel Arellano, 1999, Symmetrically normalized instrumental-variable estimation using panel data, Journal of Business and Economic Statistics 17.

Altonji, Joseph and Rosa L. Matzkin, 2005, Cross section and panel data nonseparable models with endogenous regressors, Econometrica 73.

Chamberlain, Gary, 1980, Analysis of covariance with qualitative data, Review of Economic Studies 47.

Chamberlain, Gary, 1982, Multivariate regression models for panel data, Journal of Econometrics 18.

Mundlak, Yair, 1978, On the pooling of time series and cross sectional data, Econometrica 46.

Wooldridge, Jeffrey M., 2002, Econometric analysis of cross section and panel data, MIT Press.
