Informational Content in Static and Dynamic Discrete Response Panel Data Models

Informational Content in Static and Dynamic Discrete Response Panel Data Models S. Chen HKUST S. Khan Duke University X. Tang Rice University May 13, 2016 Preliminary and Incomplete Version Abstract We study the informational content of widely used identifying assumptions in discrete panel data models. To do so, we begin by exploring the information in the static binary choice panel data introduced in Manski (1987), who showed that point identification was achievable under a large support and stationarity assumption. Our first result is that while these conditions suffice for point identification, they do not achieve regular identification, thereby ruling out the ability to conduct standard inference. This motivates exploring the informational content of stronger identifying conditions, such as exchangeability, e.g. Honoré (1992), and exclusion, e.g. Honoré and Lewbel (2002). Here we find that while exchangeability adds no informational content, exclusion does in the sense that the identification is now regular. Based on this identification result we propose an estimation procedure that has conventional asymptotic properties, such as root-n consistency and asymptotic normality. We then extend our analysis to dynamic discrete choice panel data models, e.g. Honore and Kyriazidou (2000), Arellano and Carrasco (2003), and censored panel transformation/duration models, e.g. Khan and Tamer (2007). Here we find that while neither the exclusion restriction, nor serial independence add informational content by themselves, jointly they do in the sense that under both restrictions, regular identification can be achieved. In this setting we also propose a new estimation procedure that converges at the parametric rate with a limiting normal distribution. Keywords: Panel Data, Dynamic Discrete Choice, Duration Models. 1

1 Introduction In this paper we consider the identification and information for regression coefficients in discrete panel data models. Models with discrete data are widespread in econometrics, as many variables of interest in empirical economic models are binary, such as consumer choice or employment status. Paralleling the increase in popularity of estimating binary regression models is the estimation of panel data models. The increased availability of longitudinal panel data sets has presented new opportunities for econometricians to control for individual unobserved heterogeneity across agents. In linear panel data models, unobserved additive individual specific heterogeneity, if assumed constant over time (i.e. fixed effects ), can be controlled for when estimating the slope parameters by first differencing the observations. Discrete panel data models have received a great deal of interest in both the econometrics and statistics literature, beginning with the seminal paper of Andersen (1970). For a review of the early work on this model see Chamberlain (1984), and for a survey of more recent contributions see Arellano and Honoré (2001). More generally speaking there is a vibrant and growing literature on both partial and point identification in nonlinear panel data models. There are a set of recent papers that deal with various nonlinearities in models with (short T ) panels. See for example the work of Arellano and Bonhomme (2009), Bester and Hansen (2009), Bonhomme (2012), Chernozhukov, Fernandez-Val, Hahn, and Newey (2010), Evdokimov (2010), Graham and Powell (2012) and Hoderlein and White (2012). See also the survey in Arellano and Honoré (2001). Our approach will be to begin with a set of weak assumptions which will result in what we will refer to as the base model. This base model will effectively be the same on considered in Manski (1987). This base model can be expressed as y i = I[α i + x iβ 0 + ɛ it > 0] i = 1, 2,...n t = 1, 2 (1.1) where y it is the observed binary dependent variable, x it are the observed K dimensional regressors, and β 0 is unknown K dimensional parameter of interest. The scalar variables α i and ɛ it are unobserved, but will be assumed to satisfy varying conditions as we consider variations of the base model. The panel structure of the data set implies we have a few observations for each individual in a large cross section, so we are effectively assuming the number of individuals, n, is large whereas the number of time periods, t is small, and set it to 2, w.l.o.g. This model has been further explored recently in Chamberlain (2010). The main objective of this paper will be to evaluate the identification power for the parameter of interest, β 0 of the varying restrictions on the unobservables of the model. Consequently, we structure the paper as follows. In the next section, we first explore the informational content for β 0 in the base model, formally defined by stating the conditions on both the observables and unobservables. We then explore the additional informational content of strengthening the assumptions on the unobservables considering those assumed in the recent panel data literature. In those cases that there is increased informational 2

content of the stronger assumptions we propose new estimation procedures and derive their properties. Section Three then considers a dynamic variation of the base model, where a lagged dependent variable is an explanatory variable, as considered in Arellano and Carrasco (2003), Honore and Kyriazidou (2000), Honoré and Lewbel (2002), among others. In these models, of particular interest is the coefficient on the lagged dependent variable. As with the static model, we consider varying identifying assumptions so we can explore the informational content of the different conditions. In cases where we do find additional informational content we propose new estimation procedure. Section Four then considers panel data duration models 1, modeled as censored monotone transformation models- see Ridder and Tunali (1999),Khan and Tamer (2007), Chen (2010). By following the pattern of the previous sections we demonstrate which conditions add informational content, and provide new estimators for the set of assumptions that do. Section 5 concludes and the appendix collects the proofs of the main theorems. 2 Identification and Information in Binary Panel Data In this section we consider the identification of our base models, which is a binary choice model with fixed effects. Andersen (1970) considered the problem of inference on fixed effects linear models from binary response panel data. He showed that inference is possible if the disturbances for each panel member are known to be white noise with the logistic distribution and if the observed explanatory variables vary over time. Nothing need be known about the distribution of the fixed effects and he proved that a conditional maximum likelihood estimator consistently estimates the model parameters up to scale. Interestingly, recent work demonstrates how crucial the logistic distribution assumption is for point identification and consistent estimation. Chamberlain (2010) shows that when the observable covariates have bounded support, the logistic assumption on an unobserved component is necessary for point identification. Manski (1987) showed that identification of the regression coefficients remains possible if the disturbances for each panel member are known only to be time-stationary with unbounded support and if the observed explanatory variables vary enough over time and have large support. Specifically, he considered the model: y it = I[α i + x itβ 0 + ɛ it > 0] (2.1) 1 See VanDeBerg (2001) for a comprehensive survey on duration analysis which includes work on multiple durations. 3

where i = 1, 2,...n, t = 1, 2. The binary variable y it and the K-dimensional regressor vector x it are each observed and the parameter of interest is the K dimensional vector β 0. The unobservables are α i, and ɛ it, the former not varying with t and often referred to as the fixed effect or the individual specific effect. Manski (1987) imposes no restrictions on the conditional distribution of α i conditional on x i x i1, x i2. Manski (1987) imposed the following conditions: 1. F ɛi1 x i,α i ( ) = F ɛi2 x i,α i ( ) for all α i, x i. 2. The support of ɛ it x i, α i is R. 3. Let w i = x i2 x i1. Then the support of w i does not lie in proper linear subspace of R K. 4. There exists at least one k {1, 2,...K} such that β k 0 and w k is distributed continuously with support on the real line conditional on the other components of w i. 5. The sample (y i (y i1, y i2 ), x i ) i = 1, 2,...N is i.i.d. Under the 5 conditions the parameter of interest β 0 is identified up to scale, as shown in Manski (1987). Theorem 2.1 (Manski (1987)) Under the 5 conditions above, β 0 is identified up to scale. While in one sense this is a positive results, it turns out it is quite limited from an inference point of view. That is because the identification is on an arbitrarily small set of the observables, making the identification irregular. One way to see tis is to consider a sequence of imposter values of β 0, denoted by β n such that β n β 0. This was done in Chen, Khan, and Tang (2013) to show that the models in Manski (1985),Manski (1988), where identified on a set of measure 0, thereby ruling out the possibility of conducting standard inference. We show that the identification for the panel data model considered here is of the same type: Theorem 2.2 Under assumptions 1-5, β 0 is identified, but on a set of measure 0. The proof is actually related from our next theorem and an impossibility result in Chamberlain (2010), who shows that a logistic distribution assumption on ɛ it is necessary for the information to be positive when the regressors have unbounded support. One of the implications of this result is that the conditional maximum score estimator proposed in Manski (1987), while converging at the sub parametric rate as the original maximum score, will also be rate optimal under the stated conditions. 4

Theorem 2.3 Under Assumptions 1-5, the semiparameteric information bound 2 for β 0 is 0. Given these conclusions from either theorem, inference for β 0 will be nonstandard and difficult to conduct for this model. The next step to be able to conduct inference might be to add stronger assumptions which will hopefully add informational content for β 0. We note that Assumption 1 above imposes no restrictions on the form of the serial correlation between ɛ i1, ɛ i2. This was desirable in the sense it made results less sensitive to possible misspecification. But given our previous results one may want to impose more structure on the correlation in errors if it has informational content for β 0. Unfortunately, our next result proves it does not. Specifically, we consider imposing a much stronger assumption that Assumption 1 above. Specifically we now impose: 1 : Conditional on α i, x i, ɛ i1, ɛ i2 are independently and identically distributed. Assumption 1 is quite stronger than Assumption 1, now imposing independence as well as identical conditional distributions. Unfortunately, the stronger assumption has no informational content for β 0 : Theorem 2.4 The model under Assumptions 1,2,3,4,5 is informationally equivalent to the model under Assumptions 1,2,3,4,5. The proof of the above theorem can be found in the appendix in Section A. The notion of informational equivalence was introduced in Manski (1988), and recently extended in Chen, Khan, and Tang (2013). What this means in the context of the conclusions of our two previous theorems is that the researcher may want to impose alternative assumptions than Assumption 1 or 1. This result is a refinement of that attained in Chamberlain (2010), who showed that in the Manski model, the logistic distribution with iid errors is necessary for positive Fisher information. The Manski (1988) notion of informational equivalence is a more refined notion of informational comparisons than Fisher information- see, e.g. Chen, Khan, and Tang (2013). Consequently, we will now look at an alternative assumption to see if the informational content for β 0 can be increased. The alternative assumption we adopt will be referred to as an exclusion restriction, analogous to those used in cross sectional models in Lewbel (1998) and Chen, Khan, and Tang (2013), and for panel data models in 3 Honoré and Lewbel (2002), Ai and Gan (2010), Hoderlein and White (2012). For the model at hand, by exclusion 2 see Newey (1990) for the formal definition of semiparametric information bounds. 3 These panel data papers with exclusion restrictions impose other conditions as well. These conditions will be compared to the ones we impose later in this section. 5

restriction, the best way to denote what we mean, is, following the notation of Honoré and Lewbel (2002), express x it β 0 = x itθ 0 + v it where x it denotes the first K 1 components of x it and v it denotes the K th component, whose coefficient we normalized to 1. θ 0 denotes the first K 1 components of β 0. With knew notation, our new conditions are: S1 Let e it = α i + ɛ it. The conditional distribution of e it is independent of v it, conditioning on all values of x i S2 F ɛi1 x i,α i ( ) = F ɛi2 x i,α i ( ) for all α i, x i. S3 The support of ɛ it x i, α i is R. S4 Let w i = ((x i2 x i1 ), v i2 v i1 ). Then the support of w i does not lie in proper linear subspace of R K. S5 Let w K v i2 v i1. Then w K is distributed continuously with support on the real line conditional on the other components of w i. S6 The sample (y i, x i, v i ) i = 1, 2,...N is i.i.d., where v i (v i1, v i2 ), y i (y i1, y i2 ) We see that our knew conditions are unequivocally stronger than in our first model, as we have maintained stationarity but added conditional independence (Assumption 1). But our new conditions are neither stronger nor weaker than in the second model, as we do not require independence in ɛ it but we do impose the exclusion restriction. Our next theorem establishes that the new conditions provide the most informational content for β 0 of the three models considered. Theorem 2.5 Under Assumptions S1-S6, θ 0 is identified. Furthermore, unlike in the first two sets of conditions, the identification here is regular, implying the semi parametric information bound is positive. The conclusion of the theorem is that in the binary panel data model, the exclusion restriction has identifying power. This result is in one sense surprising when compared to the cross sectional setting. In that case, when analyzing the informational content in the model in Manski (1988), it was found in Chen, Khan, and Tang (2013) that adding the exclusion assumption added no informational content whatsoever. The two models, with and without the exclusion restriction were informationally equivalent. A positive feature of the above theorem is we did not have to impose any sort of serial independence in the unobservable component ɛ it. Allowing for serial correlation is one of the 6

desirable features in Manski (1987), and that can be maintained in our setting. We note that this is in contrast to the models considered in Chamberlain (2010) who assumed that ɛ it was conditionally i.i.d. over time. As alluded to Identifying power of exclusion restrictions has been explored in other work in the literature. In Honoré and Lewbel (2002), under their stated conditions, regular identification is achieved. However, there proposed estimation procedure is based on inverse density weighting which is many settings will not have standard asymptotic properties- see, e.g. Khan and Tamer (2010). A further disadvantage, as we will discuss in the next section, is they effectively require serial independence in the observed variable v it, which is difficult to satisfy in most empirical settings. However, an advantage of their approach is they allow for dynamics in the model, something we will consider in the next section and compare the stated conditions in each of the two approaches. Ai and Gan (2010) also impose an exclusion restriction but assume stronger conditions on the unobserved component ɛ it than imposed here or in Honoré and Lewbel (2002). Furthermore, they do not consider the dynamic setting as done here and in Honoré and Lewbel (2002). Hoderlein and White (2012) impose a local exclusion restriction, but require all covariates to be continuously distributed an only attain identification on the thin set where v i1 = v i2, implying sub parametric rates of convergence if estimating the parameters. This same thin set was used for identification in Honore and Kyriazidou (2000), also resulting in subparametric rates of convergence. The conclusion of our theorem immediately motivates the construction of a new estimator procedure which exploits this added informational content, and consequently converges at the parametric rate. We consider an estimation procedure which is based on the following probabilities: Let p 1 (x i, v i, α i ) = P (y i1 = 1 x i, v i, α i ) = P (ɛ i1 x i1β 0 + v i1 α i x i, α i ) Let p 1(x i, v i ) = p 1 (x i, v i, α i )df (α i x i ) We note that p 1 (x i, v i, α i ) is not identified from the data since we do not observe α i. However, p 1(x i, v i ) is identified from the data and our assumptions. Now consider the pair of observations in the cross section, whose regressor values we denote by (x i, v i ), (x i, v j ); by our definitions above and our assumptions, we have p 1(x i, v j ) = p 2(x i, v i ) x i2β 0 + v i2 = x i1β 0 + v j1 (2.2) 7

This means if we could find such matches we could identify β 0 by regressing v j1 v i2 on x i2 x i1. Of course, given our assumptions such matches will never occur, and we do not know p t (x i, v i ), nor will there be such exact matches. But given suitable regularity conditions, listed below, p t (x i, v i ) can be estimated non parametrically, and although exact matches cannot be found, kernel weights can be assigned to each pair, with the weight larger for pairs that are closer to a match. Letting k hn ( ) 1 h n k ( ) h n denote this kernel function, we could then define: ω ij = k hn (ˆp 2(x i, v j ) ˆp 1(x i, v i )) (2.3) where ˆp 1(x i, v i ) = l i y i1k 1hn (v l v i )K 2hn (x l x j ) l i y i1k 1hn (v l v i )K 2hn (x l x j ) (2.4) where 1h n, 2h n denote bandwidth sequences and K denotes the kernel function used here. With these weights we could then define our estimator of β 0 as ( ) 1 ˆβ = ω ij (x i2 x i1 )(x i2 x i1 ) ω ij (x i2 x i1 )(v j1 v i2 ) (2.5) i j i j Under suitable kernel regularity conditions, which are standard in the literature and detailed in the Appendix, Section B, this estimator will converge at the parametric rate with a limiting normal distribution. We conclude this section by summarizing our results found for the static binary choice panel data model introduced in Manski (1987). Under condition in Manski (1987), the regression coefficients are identified, but on a set of measure 0, implying the difficulty to conduct standard inference on them. This result was found under a stationarity condition, but we found no additional informational content from strengthening the stationarity assumption to serial independence and stationarity. In contrast, the exclusion restriction considered in Honoré and Lewbel (2002) did have informational content in the sense that the regression coefficients were regularly identified and so could be estimated at the parametric rate. In the next section we explore how these results change when we consider dynamic panel data binary models. 8

3 Identification and Information in Dynamic Binary Panel Data In this section we extend the base model of the previous section by introducing dynamics into the model. The dynamics we consider will be of the form of including lagged binary dependent variables as one of the explanatory variables.this model is well motivate in empirical settings. In many situations, such as in the study of labor force and union participation, it is observed that an individual who has experienced an event in the past is more likely to experience the event in the future than an individual who has not experienced the event. Heckman (1981), Heckman (1991) discusses two explanations for this phenomenon. The first explanation is the presence of true state dependence, in the sense that the lagged choice/decision enters the model as an explanatory variable. The second is the presence of serial correlation in the unobserved transitory errors that are in the model. Here we will be interested in the case where this serial correlation is due to the presence of unobservable permanent individual specific heterogeneity. This section expands results from the previous one by presenting identification and estimation methods for discrete choice models with structural state dependence that allow for the presence of unobservable individual heterogeneity in panels with a large number of individuals observed over a small number of time periods. Specifically we consider a discrete choice panel data model of the form: y it = I[α i + x itβ 0 + γ 0 y it 1 + α i + ɛ it > 0] (3.1) where x it is a vector of exogenous variables, ɛ it is an unobservable error term, α i is an unobservable individual-specific effect, and β 0 and γ 0 are the parameters of interest to be estimated. This model or variations thereof have been considered in Chamberlain (1984),Honore and Kyriazidou (2000), Arellano and Carrasco (2003), Honoré and Lewbel (2002). Identification of these parameters has been shown in the literature to be complicated, if not impossible. For example for the binary logit model with the dependent variable lagged only once, Chamberlain (1993) has shown that, if individuals are observed in three time periods, then the parameters of the model are not identified, even with the strong assumption that ɛ it has i.i.d. logistic distribution. Honore and Kyriazidou (2000) show that β 0, γ 0 are identified if the econometrician has access to four or more observations per individual. However, the identification was shown to be nonstandard,making inference more complicated, in a fashion somewhat analogous to the cases for the other models just considered. These results motivate the exploration of the informational content of the parameters β 0, γ 0 when considering alternative assumptions to Chamberlain (1993) and Honore and Kyriazidou (2000). Following how we approached the problem before, we consider the following three sets of models: 9

Model 1: Stationarity,Serial Independence, Initial Condition : Stationarity and serial independence are familiar to us from the previous section. Only initial condition needs defining. By stationarity and serial indepence we mean that conditional on α i, x i, v i, ɛ it is i.i.d. By initial condition we mean P (y i0 = 1 x i, α i ) = p 0 (x i, α i ) Model 2: Exclusion, Stationarity, Initial Condition Model 3: Exclusion, Stationarity, Serial Independence, Initial Condition Note Model 3 imposes the most restrictions, whereas Models 1 and 2 are neither more restrictive nor less restrictive than the other. Our Assumptions imply that P (y i0 = 1 x i, α i ) = p 0 (x i, α i ) (3.2) P (y it = 1 x i, α i, y i0,...y it 1 ) = F (x itβ 0 + γ 0 y it 1 + α i ) (t = 1, 2) (3.3) Crucial to our results below is our initial conditions assumptions, which we keep along the lines of, for example, Chamberlain (1984), Honore and Kyriazidou (2000). For important work based on alternative initial conditions assumptions, see Honoré and Tamer (2006), Wooldridge (2002). Note the above equality follows from the crucial assumption the the time varying errors are independently and identically distributed over time, as well as cross sectional units. With this assumption and the assumption of having data on 4 time periods, including the initial period 0, Honore and Kyriazidou (2000) proposed estimating the parameters by maximizing the objective function: 1 n n ( ) xi2 x i3 K (y i2 y i1 )sgn((x i2 x i1 ) β + γ(y i3 y i0 )) (3.4) i=1 h n with respect to β, γ. Analogous to some of the estimation procedures discussed before K( ) denotes a kernel function assigning greater weight to observations where x i2, x i3 are close. Related this this point, this estimator irregularly identifies the parameter β 0, γ 0 and so the estimator will not converge at the parametric rate, though it will be consistent. The sub parametric rate of convergence of the estimator reflects the difficulty in attaining identification in this dynamic model. Hahn (2001) shows that there does not exist any 10

estimator for β 0, γ 0 converging at the parametric rate under the condition in Honore and Kyriazidou (2000). This is even with the strong assumptions of iid errors across time and the cross section. To see if it is possible to add informational content to this model, we again consider the exclusion restriction from the previous section. For the dynamic model the exclusion restriction we assume is of the form: (α i, ɛ i ) v i x i (3.5) where as before variables without time subscripts indicate all time periods- e.g. x i = (x i1, x i2 ). We have shown that in the static model, the exclusion restriction did add informational content, and we were ably to get regular identification even with just a stationarity assumption, not having to impose serial independence. Unfortunately this is no longer the case in the dynamic model. Specifically, we have the following result: Theorem 3.1 Models 1 and 2 above are informationally equivalent, and the semi parametric information bound for each of them is 0. This implies inference is just as complicated in Model 2, so we turn attention to Model 3. In this model, like in Honore and Kyriazidou (2000), we assume the error terms are iid across time, meaning that conditional on α i, x i, ɛ i1, ɛ i2 are iid. As we can show, the conditions in Model 3 suffice for regular identification. Under these assumptions we take the following approach to identify β 0, γ 0 : Let x it denote (x it, y it 1 ), and let x i denote (x i1, x i2). p 2 (x i, v i, α i ) P (y i2 = 1 y i1, x i, v i, α i ) = P (ɛ i2 x i2β 0 + γ 0 y i1 + v i2 α i y i1, x i, α i ) = P (ɛ i2 x i2β 0 + γ 0 y i1 + v i2 α i x i, α i ) and 11

p 1 (x i, v i, α i ) P (y i1 = 1 y i0, x i, v i, α i ) = P (ɛ i1 x i1β 0 + γ 0 y i0 + v i1 α i y i0, x i, α i ) = p(ɛ i1 x i1β 0 + γ 0 y i0 + v i1 α i x i, α i ) And as in the static case, we will define: p t (x i, v i ) p t (x i, v i, α i )df (α i x i ) and analogous to the static case: p 1(x i, v j ) = p 2(x i, v i ) x i2β 0 + γ 0 y i1 + v i2 = x i1β 0 + γ 0 y i0 + v j1 (3.6) and p 1(x i, v i ) = p 2(x i, v j ) x i1β 0 + γ 0 y i0 + v i2 = x i2β 0 + γ 0 y i1 + v j2 (3.7) Analogous to our estimator in the previous section, this would suggest the kernel weighted least squares estimator: ( ) 1 ( ˆβ, ˆγ) = ω ij (x i2 x i1)(x i2 x i1) ω ij (x i2 x i1)(v j1 v i2 ) (3.8) i j i j Where, analogous to the previous estimator, ω ij denote kernel weights. Under standard conditions, completely analogous to the regularity conditions described in the previous section, this estimator will converge at the parametric rate with an asymptotic normal distribution. This fully demonstrates the identification power and informational content of the exclusion restriction. 12

As demonstrated in both Honore and Kyriazidou (2000) and Hahn (2001), even with the stationarity and serial independence assumptions regular identification cannot be achieved without the exclusion restriction. Honoré and Lewbel (2002) impose the exclusion restriction but they relax the stationarity and serial independence assumption that we impose here. They demonstrate identification and propose an estimation procedure based on inverse weighting of the density function of the excluded variable v i. However, also crucial to their identification result is their assumption A.2, which we restate here: Assumption A.2: For each t, let e it = α i + ɛ it. Assume e it is conditionally independent of v it, conditioning on x it and z i. As we show here, this assumption is difficult to satisfy unless there is serial independence in v i. That is, v 1, v 2 are independent. To see why, note that their assumption would imply that, in the case where there is one regressor that is the lagged dependent variable, that (α i + ɛ i2 ) v i2 y i1 = 1 or equivalently, (α i + ɛ i2 ) v i2 α i + ɛ i1 v i1 But it is easy to show that (ɛ i1, ɛ i2 ) (v i1, v i2 ) does not necessarily imply the above result. It will be true if v i1 v i2, but not necessarily the case other wise. Serial independence in the observables is a strong condition rarely imposed in the panel literature. 4 Identification and Information in Censored Duration Panel Data In this section we consider estimation of a right censored duration model with fixed/ group effects. Of particular interest in the panel data setting is to permit the distribution of the censoring variable to be spell-specific and individual/group specific. The vast existing literature does not address this type of problem. Honoré, Khan, and Powell (2002) allow for random censoring, but requires a linear transformation, and the censoring variables to be distributed independently of the covariates with the same distribution across spells. Extensions of the linear specification can be found in Abrevaya (2000),Abrevaya (1999), which allow for a generalized transformation function, but rule out fixed and/or general random censoring. Chen (2010) allows for fixed and random censoring,, but unlike our results here, has to assume the censoring variables are distributed independently of observed covariates. Furthermore, Chen (2010) uses inverse weighting of the censoring variable distribution function 13

which leads to instability and irregularity this distribution is in the tails of support. Other work in the panel duration literature parametrically specifies the distribution of the error terms. Examples include Chamberlain (1984), Honore (1993), Ridder and Tunali (1999), Lancaster (1979), Horowitz and Lee (2004) and Lee (2008). Some of these also rule out censoring distributions that vary across spells and/or are independent of covariates. In the context of multiple spell data, we wish to allow for distribution of the censoring variable to vary across spells, for one of two reasons: for one, the censoring distribution may depend on time-varying covariates. Also, even if the censoring distribution does not depend on the covariates, and is purely a result of the observation plan, the observation plans may vary across spells. To be precise, we will focus on the following model: T (ν it ) = min(α i + x itβ 0 + ɛ it, c it ) d it = I[α i + x itβ 0 + ɛ it c it ] i = 1, 2,...n, t = 1, 2 In the duration framework the subscript t denotes the spell, as opposed to the time period. T ( ) is unknown strictly monotonic function, ν it is an observed scalar dependent variable, x it if a vector of observed covariates as before, and c it is a censoring variables which is permitted to be random, and only observed for the censored observations. The observed binary variable d it is an indicator if the observation is censored. As before the disturbance term ɛ it is unobserved, as is the individual specific effect α i. The vector β 0 is again the parameter of interest we wish to identify. As alluded to above identification becomes very complicated when we allow for the general censoring structure. Khan and Tamer (2007) establish point identification of β 0 and proposed the following maximum score type estimator: ˆβ = arg max β 1 n n d i1 I[v i1 < v i2 I[x i1β < x i2β] + d i2 I[v i2 < v i1 ]I[x i2β < x i1β] (4.1) i=1 As with the other maximum score type estimators referred to in the previous sections, the estimator is consistent as point identification was proven. But the identification is irregular and the estimator converges at a sub parametric rate with non normal limiting distribution. So like in the previous sections we explore what additional restrictions may be used to add informational content. Given that we already are attaining irregular identification despite the assumption of stationarity and serial independence of ɛ it, like before we turn to an exclusion restriction. 14

To introduce an excluded variable into the above model we adopt the following notation: T (ν it ) = min(α i + x itβ 0 + v it + ɛ it, c it ) d it = I[α i + x itβ 0 + v it + ɛ it c it ] i = 1, 2,...n, t = 1, 2 and we impose the following assumptions: 1. 2. v i (α i, ɛ i, c i ) x i 3. (ɛ i, c i ) are stationary and serially independent conditional on x i. To identify β 0 in this case, following steps analogous to before, let p t (x i, v i ) = P (d it = 0 x i, v i ) = P (α i + ɛ it + c it x itβ 0 + v it x i ) So given our assumptions, we have p 1 (x i, v i ) = p 2 (x i, v j ) x i1β 0 + v i1 = x i2β 0 + v j2 (4.2) Hence, as before we can search for pairs where the probabilities match, and regress (v j2 v i1 ) on x i1 x i2. As before, while such exact matches will never occur, kernel weights can be used. Under our stated conditions the estimator will converge at the parametric rate, fully demonstrating the identification power of the exclusion restriction. To our knowledge, this will be the only estimation procedure for the censored transformation panel data model converging at the parametric rate with this generality in the censoring structure. 5 Simulation Study In the this section we explore the relative finite sample performances of the new proposed estimation procedures in both static and dynamic designs. Beginning with simulating data from a static model, we randomly generate data from the following design: y it = I[α i + x it β 0 + v it + ɛ it > 0] t = 1, 2 (5.1) 15

where x it was distributed bivariate normal, mean 0, variances 1, and correlation 0.5. The error terms ɛ it were distributed independently of x it, α i and serially independent standard normal. The special regressor was distributed standard normal, independently of x it, α i, ɛ it. For the fixed effect α i we considered two designs, on where α i was binary with probability 0.5 taking the value 1, and probability 0.5 taking the value 0. In the other design we set α i = (x i1 + x i2 )/2 + u i, where u i was distributed standard uniform, independently of the all the other variables. Table 1 reports the mean bias, median bias and mean squared error from 401 replications, for two estimators of β 0, the panel maximum score estimator in Manski (1987) and the estimator proposed in this paper. The estimator proposed here required the selection of tuning parameters for both the conditional probability estimation as well as the matching of both propensity scores and the nonspecial regressors (x i ). To select these tuning parameters we followed the procedure outlined in Chen and Khan (2008), which emphasized that the bandwidth in the propensity score estimation has to be under smoothed when compared to the bandwidth for matching. As the results indicate, the finite sample performances coincide somewhat with the asymptotic theory. Both estimators clearly demonstrate consistency, with biases and MSE s declining with the sample size. Also both estimators indicate to a certain extent finite sample symmetry with both small mean and median biases that are generally the same sign. The new estimator demonstrates smaller MSE than the maximum score estimator, which is expected by the asymptotic theory, as the new estimator is designed to exploit the stronger exclusion restriction that the maximum score estimator is not.. Table two reports results from simulating data for a dynamic design where we generated data from the following model: y it = I[α i + x it β 0 + γ 0 y i,t 1 + v it + ɛ it > 0] t = 0, 1, 2, 3 (5.2) Here we used 4 time periods, as is required for consistency of the estimator in Honore and Kyriazidou (2000). As in the static design we assumed ɛ it, v it here are each serially independent,, and distributed standard normal, independent of each other and independent of α i, x i. We again consider two designs for α i, distributed analogously to how they were in the static design, and x i was now distributed multivariate normal with pairwise correlations equal to 0.5. The initial variable y 0i was distributed standard normal, independent of all the other variables. Results are reported for the new estimator (CKT), for and the kernel weighted maximum core estimator proposed in Honore and Kyriazidou (2000) (HK), and the estimator proposed in Honoré and Lewbel (2002) (HL). To implement HK, we used a normal kernel function and a bandwidth selected as if estimating the density function of the random variable x i3 x i2 evaluated at 0. To implement HL, we used the true density function of the special regressor as the weighting function, and used the values of x it, y i,t 1 as instruments. As HL and the new estimator can be implemented with three time periods, we estimated β 0, γ 0 twice, once for periods 0,1,2 and again for periods 1,2,3, and averaged the results. For the new estimator and the estimators in Honore and Kyriazidou (2000), Honoré and Lewbel 16

(2002), the same statistics as in table 1 are reported for 50,100,200 and 400 observations, with 401 replications, for each of the parameters β 0, γ 0. The results for the dynamic design are somewhat analogous to those of the static design, the new estimator has noticeable smaller MSE than the estimator in Honore and Kyriazidou (2000), which is to be expected as the latter converges at a slower rate than the former since the latter is not based on the special regressor assumption. It also has noticeably smaller MSE values than HL. We attribute this to the fact that the serial correlation in v it would result in their Assumption 2 not being satisfied. This illustrates how crucial the assumption of no serial correlation in the special regressor is for their estimator to be used in the dynamic setting. One difference between the results in table 2 and table 1 is in the dynamic design the new estimator exhibits a larger bias than in the static design. This may well be due to the fact that first stage estimator now involves more regressors, but a more formal and rigorous analysis is warranted for an explanation. 17

TABLE I Simulation Results for Static Panel Binary Regression Estimators Design 1 Design 2 Max. Score Spec. Reg. Max. Score Spec. Reg. 50 obs. Mean Bias -0.6491-0.4005-0.7106-0.3592 Median Bias -0.7841-0.3905-0.9906-0.3292 MSE 0.5955 0.2435 0.6800 0.2151 100 obs. Mean Bias -0.4742-0.3005-0.5906-0.2492 Median Bias -0.5241-0.3205-0.7006-0.2392 MSE 0.4039 0.1321 0.5269 0.1055 200 obs. Mean Bias -0.3041-0.2205-0.4006-0.1492 Median Bias -0.3641-0.2305-0.4406-0.1432 MSE 0.2557 0.0821 0.3524 0.0419 400 obs. Mean Bias -0.1871-0.1676-0.2306-0.0592 Median Bias -0.2241-0.1705-0.2406-0.0692 MSE 0.1549 0.0468 0.2024 0.0192 18

TABLE 2 Simulation Results for Dynamic Panel Binary Regression Estimators Design 1 Design 2 HK CKT HL HK CKT HL β 0 γ 0 β 0 γ 0 β 0 γ 0 β 0 γ 0 β 0 γ 0 β 0 γ 0 50 obs. Mean Bias -0.4049-0.0605-0.0453 0.0587-0.0034-0.1480-0.4454-0.0655-0.0796 0.0593-0.1161-0.2460 MSE 0.5163 0.0053 0.2424 0.0049 0.4310 1.0446 0.5763 0.0055 0.2213 0.0045 0.5393 3.6324 100 obs. Mean Bias -0.2021-0.0515-0.0006 0.0456 0.0287-0.0522-0.2904-0.0507-0.0039 0.0469 0.0190-0.2366 MSE 0.2715 0.0311 0.1608 0.0029 0.1549 0.3462 0.3403 0.0029 0.2199 0.0026 1.3756 3.0985 200 obs. Mean Bias -0.0955-0.0418 0.0477 0.0378 0.0287-0.0522-0.1085-0.0414 0.0943 0.0373 0.0324-0.0103 MSE 0.1816 0.0019 0.1384 0.0016 0.1549 0.3462 0.1869 0.0018 0.1735 0.0015 0.6613 1.3347 400 obs. Mean Bias -0.0560-0.0350-0.0244 0.0311 0.0074-0.0205-0.0845-0.0333-0.0156 0.0308-0.0118-0.0548 MSE 0.1300 0.0013 0.0924 0.0019 0.0729 0.1684 0.1384 0.0012 0.1023 0.0010 0.1558 1.5248 6 Conclusions In this paper we explored the identifying power of varying restrictions in discrete response panel data models. Our approach is to treat the conditions in Manski (1987) as the base model and to continue adding restrictions to explore their informational content. We first strengthen the level of serial dependence by assuming the unobserved time varying heterogeneity is independently and identically distributed, and find that this adds no informational content. The notion of information equivalence we use for this finding is based on Manski (1988), which is a more refined notion than the Fisher information. We then consider an exclusion restriction, in the form of a special regressor, and find that it indeed adds informational content, which is in contrast to the cross sectional discrete response model, as found in Chen, Khan, and Tang (2013). Specifically, we find that the Fisher information for the regression coefficients was positive with the exclusion restriction, whereas Chamberlain (1984) found there to be zero information without an exclusion restriction. We then extend outré analysis to consider the informational content in dynamic models, such as those considered in Chamberlain (1984), Honore and Kyriazidou (2000), Arellano and Carrasco (2003),Honoré and Lewbel (2002). Here again we find informational content with a special regressor, this time in the form of attaining regular identification for the coefficient of the lagged dependent variable. Furthermore, we propose a new semi parametric estimator for the coefficient of the lagged dependent variable that converges at the paramedic rate with a limiting normal distribution. Analogous new results are also found for the regression coefficients in a panel data (i.e. multiple spell) duration model, where the exclusion restriction is shown to add strong informational content. The results in this paper do suggest areas for future research. For example, in the models where we do attain positive Fisher information, it would be useful to derive the semi 19

parametric efficiency bound as we ll as propose estimators that are asymptotically efficient. Furthermore, most of the results in this paper are for 2 or 3 time periods, and it would be useful to formally analyze the additional informational content in models with wider panels. 20

References Abrevaya, J. (1999): Leapfrog Estimation of a Fixed-effects Model with Unknown Transformation of the Dependent Variable, Journal of Econometrics, 93(1), 203 228. (2000): Rank Estimation of a Generalized Fixed-Effects Regression Model, Journal of Econometrics, 95(2), 1 23. Ai, C., and L. Gan (2010): An Alternative Root-n Consistent Estimator for Panel Data Binary Choice Model, Journal of Econometrics, 157, 93 100. Andersen, E. (1970): Asymptotic Properties of Conditional Maximum Likelihood Estimators, Journal of the Royal Statistical Society, 32(3), 283 301. Arellano, M., and S. Bonhomme (2009): Robust priors in nonlinear panel data models, Econometrica, 77(2), 489 536. Arellano, M., and R. Carrasco (2003): Binary choice panel data models with predetermined variables, Journal of Econometrics, 115, 125 157. Arellano, M., and B. Honoré (2001): Panel Data Models: Some Recent Developments, Handbook of econometrics.volume 5, pp. 3229 96. Bester, A., and C. Hansen (2009): Identification of Marginal Effects in a Nonparametric Correlated Random Effects Model, Journal of Business and Economic Statistics, 27(2), 235 250. Bonhomme, S. (2012): Functional Differencing, Econometrica, 80, 1337 1385. Chamberlain, G. (1984): Panel Data, in Handbook of Econometrics, Vol. 2, ed. by Z. Griliches, and M. Intriligator. North Holland. (1993): Feedback in Panel Data Models, mimeo., Harvard University. Chamberlain, G. (2010): Binary Response Models for Panel Data: Identification and Information, Econometrica, 78, 159 168. Chen, S. (2010): Root-N-consistent Estimation of Fixed Effect Panel Data Transformation Models with Censoring, Journal of Econometrics, 159, 222 234. 21

Chen, S., and S. Khan (2008): Rates of Convergence for Estimating Regression Coefficients in Heteroskedastic Discrete Response Models, Journal of Econometrics, 117, 245 278. Chen, S., S. Khan, and X. Tang (2013): On the Informational Content of Special Regressors in Heteroskedastic Binary Response Models, manuscript. Chernozhukov, V., I. Fernandez-Val, J. Hahn, and W. Newey (2010): Identification and Estimation of Marginal Effects in Nonlinear Panel Models, manuscript. Evdokimov, K. (2010): Identification and Estimation of a Nonparametric Panel Data Model with Unobserved Heterogeneity, Working Paper. Graham, B., and J. Powell (2012): Identification and Estimation of Irregular Correlated Random Coefficient Models, Econometrica, 80, 2105 2152. Hahn, J. (2001): The Information Bound of a Dynamic Panel Logit Model with Fixed Effects, Econometric Theory, 17, 913 932. Heckman, J. (1981): Statistical Models for Discrete Panel Data, in Structural Analysis of Discrete Data, ed. by C. Manski, and D. McFadden. MIT Press. (1991): Identifying the Hand of the Past: Distinguishing State Dependence from Heterogeneity, American Economic Review, pp. 75 79. Hoderlein, S., and H. White (2012): Nonparametric Identification in Nonseparable Panel Data Models with Generalized Fixed Effects, Journal of Econometrics, 168, 300 314. Honoré, B. (1992): Trimmed LAD and Least Squares Estimation of Truncated and Censored Regression Models with Fixed Effects, Econometrica, 60(3), 533 65. Honoré, B., S. Khan, and J. Powell (2002): Quantile Regression Under Random Censoring, Journal of Econometrics, 64, 241 278. Honore, B., and E. Kyriazidou (2000): Panel Data Discrete Choice Models with Lagged Dependent Variables, Econometrica, 68, 839 874. Honoré, B., and A. Lewbel (2002): Semiparametric Binary Choice Panel Data Models without Strictly Exogeneous Regressors, Econometrica, 70(5), 2053 2063. 22

Honoré, B., and E. Tamer (2006): Bounds on Parameters in Panel Dynamic Discrete Choice Models, Econometrica, 74, 611 629. Honore, B. E. (1993): Identification Results for Duration Models with Multiple Spells, Review of Economic Studies, 60, 241 46. Horowitz, J., and S. Lee (2004): Semiparametric Estimation of a Panel Data Proportional Hazards Model with Fixed Effects, Journal of Econometrics, pp. 155 198. Khan, S., and E. Tamer (2007): Partial Rank Estimation of Transformation Models with General forms of Censoring, Journal of Econometrics, 136, 251 280. (2010): Irregular Identification, Support Conditions and Inverse Weight Estimation, Econometrica, forthcoming. Lancaster, T. (1979): Econometric Methods for the Duration of Unemployment, Econometrica, 47(4), 939 956. Lee, S. (2008): Estimating Panel Data Duration Models With Censored Data, Econometric Theory, 24, 1254 1276. Lewbel, A. (1998): Semiparametric Latent Variable Model Estimation with Endogenous or Mismeasured Regressors, Econometrica, 66(1), 105 122. Manski, C. F. (1985): Semiparametric Analysis of Discrete Response: Asymptotic Properties of the Maximum Score Estimator, Journal of Econometrics, 27(3), 313 33. Manski, C. F. (1987): Semiparametric Analysis of Random Effects Linear Models from Binary Panel Data, Econometrica, 55(2), 357 362. (1988): Identification of Binary Response Models, Journal of the American Statistical Association, 83. Newey, W. (1990): Semiparametric Efficiency Bounds, Journal of Applied Econometrics, 5(2), 99 135. Ridder, G., and I. Tunali (1999): Stratified Partial Likelihood, Journal of Econometrics, 92, 193 232. 23

VanDeBerg, G. (2001): Duration Models: Specification, Identification and Multiple Durations, in Handbook of Econometrics, Vol. 5, ed. by J. Heckman, and E. Leamer. Amesterdam: North Holland. Wooldridge, J. (2002): Simple Solutions to the Initial Conditions Problem in Dynamic, Nonlinear Panel Data Models with Unobserved Heterogeneity, Journal of Applied Econometrics, 20, 39 54. A Informational Equivalence Result In this section we prove the informational equivalence between the stationarity assumption and the i.i.d. assumption on the disturbance terms. Recall from the paper we began with the stationarity assumption and then we imposed: 1 : Conditional on α i, x i, ɛ i1, ɛ i2 are independently and identically distributed. Assumption 1 is quite stronger than Assumption 1 (stationarity), now imposing independence as well as identical conditional distributions. Recall the statement of the theorem Theorem A.1 The model under Assumptions 1,2,3,4 is informationally equivalent to the model under Assumptions 1,2,3,4. Proof: The proof involves a series of steps, and we first recall the definitions of the two models we wish to show are informationally equivalent. Model A: y it = 1{x it β + α i ɛ it } for t = 1, 2 where F ɛi,1 x i,α i = F ɛi,2 x i, α i for all x i, α i (conditional stationarity) We drop subscripts i to simplify notation below. Let x (x 1, x 2 ) and ɛ (ɛ 1, ɛ 2 ). Use lower cases to denote realizations. Let X be the support of x; and let F be the parameter space for (or, equivalently, the restrictions on) F {F ɛ,α X : X X }. That is, F is the set of all G {G ɛ,α X : X X } that satisfy stationarity. Let (β, F ) be the true parameters in the DGP. Let p(x; β, F ) (p 1 (x; β, F ), p 2 (x; β, F ), p 12 (x; β, F )) where p t (x; β, F ) E[Y t X = x; β, F ] and p 12 (x; β, F ) E[Y 1 Y 2 X = x; β, F ]. Definition: We say (b, G) is observationally equivalent to (β, F ) (denoted by (b, G) o.e. = (β, F )) when p(x; β, F ) p(x; b, G) for almost all x. We say β is observationally equivalent to b under F (denoted by β o.e. = F b) if there exists G F such that (b, G) o.e. = (β, F ). We say β is identified relative to b under F (denoted by β i.d. Fb) if β is not observational equivalent 24

to b under F (or equivalently, if for all G F there exists a subset Q b X with positive measure such that p(x; β, F ) p(x; b, G) for all x Q b. We now apply the criterion in Manski (JASA 1988) to analyze the identification of this model. For any b β, define Q(b) {x X : sgn [(x 1 x 2 )b] sgn[(x 1 x 2 )β]} As is shown in the corollary to Lemma 1 in Manski (1987), we can write Q(b) equivalently as Q(b) = {x X : sgn[(x 1 x 2 )b] Med[Y 1 Y 2 Y 1 Y 2, X = x]}. We now characterize the set of states that help identify β relative to a generic imposter b β. Lemma A.1 If Pr{X Q(b)} > 0, then β i.d. Fb. Proof: It suffices to show that for any x Q(b), there exists no G F such that p(x; b, G) = p(x; β, F ). Without loss of generality, consider a generic x Q(b) such that x 1 β x 2 β and x 1 b < x 2 b. By construction, p t ( x; b, G) = G ɛt X= x,α( x t β + α)dg α X= x for t = 1, 2 where G ɛt X,α, G α X denote the conditional C.D.F. according to G. By construction, G ɛ1 X= x,α( x 1 b + α) < G ɛ2 X= x,α( x 2 b + α) for all G F (because, G ɛ1 X,α = G ɛ2 X,α) and all α. Integrating out α with respect to the same conditional distribution G α X= x on both sides of the inequality implies p 1 ( x; b, G) < p 2 ( x; b, G). On the other hand, by similar derivation, we can show p 1 ( x; β, F ) p 2 ( x; β, F ) because F F. Thus there exists no G F such that p( x; b, G) = p( x; β, F ). A symmetric argument proves the same result for x Q(b) where x 1 β < x 2 β and x 1 b x 2 b. Q.E.D. Lemma A.2 If Pr{X Q(b)} = 0, then β o.e. = F b. 25