Discrete Choices with Panel Data

Manuel Arellano
CEMFI, Madrid

This version: July 2003

Abstract

This paper reviews the existing approaches to deal with panel data binary choice models with individual effects. Their relative strengths and weaknesses are discussed. Much theoretical and empirical research is needed in this area, and the paper points to several aspects that deserve further investigation. In particular, I illustrate the usefulness of asymptotic arguments in providing both approximately unbiased moment conditions and approximations to sampling distributions for panels of different sample sizes.

JEL classification: C23.

Keywords: Binary choice, panel data, fixed effects, modified likelihood, asymptotic corrections.

Address for correspondence: CEMFI, Casado del Alisal 5, 28014 Madrid, Spain. E-mail: arellano@cemfi.es

This paper was written for presentation as the Investigaciones Económicas Lecture at the XXV Simposio de Análisis Económico, Universitat Autònoma de Barcelona, Bellaterra, 9-2 December. I wish to thank Pedro Albarrán, Samuel Bentolila, Olympia Bover, Raquel Carrasco, Jesús Carro, Enrique Sentana, and two anonymous referees for helpful comments and discussions. I am also grateful to Pedro Albarrán and Jesús Carro for very able research assistance.

1 Introduction

The use of fixed effects is a simple and well understood way of dealing with endogenous explanatory variables in linear panel data models. In such a context, least squares or instrumental variable methods for errors in differences provide consistent estimates that control for unobserved heterogeneity in short panels of large cross-sections (small T, large N). However, the situation is fundamentally different in models with nonlinear errors; for example, when one intends to use fixed effects to deal with an endogenous explanatory variable in a probit model. In those cases, estimates of the parameters of interest, jointly estimated with the effects, are typically inconsistent if T is fixed (the incidental parameter problem). Moreover, fixed effects estimates in a spirit similar to differencing in the linear case are not available for many models of practical importance.

There are also random effect methods that achieve fixed-T consistency subject to a particular specification of the form of the dependence between the explanatory variables and the effects, but they rely on strong and untestable auxiliary assumptions, and even these methods are often out of reach. Without auxiliary assumptions, the common parameters of certain nonlinear fixed effects models are simply unidentifiable in a fixed-T setting, so that fixed-T consistent estimation is not possible at all. In other cases, although the parameters are identifiable, fixed-T consistent estimation at the standard root-N rate is impossible.

An alternative reaction to the fact that micropanels are short is to ask for estimators with small biases as opposed to no bias at all; specifically, estimators with biases of order $1/T^2$ instead of the standard magnitude of $1/T$. This alternative approach has the potential of overcoming some of the fixed-T identification difficulties and the advantage of generality.

The purpose of this lecture is twofold. First I review the incidental parameter problem (Sections 2 and 3), fixed-T solutions (Section 4), and identification problems (Section 5), all in the context of the static binary choice

model with explanatory variables that are correlated with an individual effect. Second, I discuss the modified concentrated likelihood of Cox and Reid (1987), its role in achieving consistency up to a certain order of magnitude in T (Section 6), and a double asymptotic formulation which provides an effective discrimination between estimators with and without bias reduction (Section 7).

I focus on the static binary case for simplicity and because many results are only available for this case. Thus, dynamic models, multinomial choice, and models with random effects that are uncorrelated with the explanatory variables are all left out (see Arellano and Honoré (2001) for a fuller survey of the fixed-T panel data discrete choice literature). My intention is to exhibit the strengths and weaknesses of fixed-T approaches, and to illustrate the usefulness of double asymptotic arguments in providing both approximately unbiased moment conditions and approximations to sampling distributions even for fairly short panels, which is the main theme of the paper.

2 Models and Parameters of Interest

I begin by considering the following static binary choice model

$$y_{it} = 1\{x_{it}'\beta_0 + \eta_i + v_{it} \geq 0\} \quad (t = 1, \ldots, T;\ i = 1, \ldots, N) \tag{1}$$

where the errors $-v_{it}$ are independently distributed with cdf $F$ conditional on $\eta_i$ and $x_i = (x_{i1}', \ldots, x_{iT}')'$, so that

$$\Pr(y_{it} = 1 \mid x_i, \eta_i) = F(x_{it}'\beta_0 + \eta_i). \tag{2}$$

The Linear Model as a Benchmark

In a linear model of the form

$$E(y_{it} \mid x_i, \eta_i) = x_{it}'\beta_0 + \eta_i, \tag{3}$$

$\beta_0$ is identifiable from the regression in first differences or deviations from means in a cross-sectional population for fixed T, regardless of the form of

the distribution of $\eta_i \mid x_i$. That is, we have

$$\operatorname*{plim}_{N \to \infty} \frac{1}{TN} \sum_{i=1}^{N} \sum_{t=1}^{T} (x_{it} - \bar{x}_i)\left[(y_{it} - \bar{y}_i) - (x_{it} - \bar{x}_i)'\beta_0\right] = 0, \tag{4}$$

which is uniquely satisfied by the true value $\beta_0$ provided

$$\operatorname*{plim}_{N \to \infty} \frac{1}{TN} \sum_{i=1}^{N} \sum_{t=1}^{T} (x_{it} - \bar{x}_i)(x_{it} - \bar{x}_i)' \tag{5}$$

is non-singular. So, the value $\widehat{\beta}$ that solves

$$\frac{1}{TN} \sum_{i=1}^{N} \sum_{t=1}^{T} (x_{it} - \bar{x}_i)\left[(y_{it} - \bar{y}_i) - (x_{it} - \bar{x}_i)'\widehat{\beta}\right] = 0 \tag{6}$$

(the within-group estimator) is a consistent estimator of $\beta_0$ for large N, no matter how small T is, as long as $T \geq 2$ (see, for example, Arellano, 2003, or Hsiao, 2003). This is of economic interest if one hopes that, by conditioning on $\eta_i$, $\beta_0$ measures a more relevant (causal or structural) effect of x on y. The consistency result matters because one wants to make sure that one gets the right answer when calculating $\widehat{\beta}$ from a large cross-sectional panel with a small time series dimension, which is a typical situation in microeconometrics. The motivation and aim in a binary choice fixed effects model is to get similar results as in the linear case when the form of the model is given by (1).

In our context, the term fixed effects has nothing to do with the nature of sampling. It just refers to a model for the effect of x on y given x and $\eta$, in which we observe y and x but not $\eta$, and the distribution of $\eta \mid x$ is left unrestricted. Following the usage in the econometric literature, the term random effects will be reserved for models in which some knowledge about the form of the distribution of $\eta \mid x$ is assumed.

Parameters of Interest

The micropanel data literature has emphasized the large-N-short-T identification of $\beta_0$ with an unspecified distribution

of $\eta_i \mid x_i$. However, a natural parameter of interest is the mean effect on the probability of $y_{it} = 1$ of changing $x_{1it}$ from $z_a$ to $z_b$, say. A consistent estimator of this is:

$$\frac{1}{N} \sum_{i=1}^{N} \int \left[F(z_a \beta_{01} + x_{2it}'\beta_{02} + \eta) - F(z_b \beta_{01} + x_{2it}'\beta_{02} + \eta)\right] dG(\eta \mid x_{2it}) \tag{7}$$

where $G(\cdot \mid x_{2it})$ is the cdf of $\eta_i$ conditional on $x_{2it}$, and $x_{1it}$ denotes the first component of $x_{it}$. Thus, measuring this effect would require us to specify G, which is not in the nature of the fixed effects approach.[1] The direct information we can get from the $\beta$ coefficients only concerns the relative impacts of explanatory variables on the probabilities. If $x_{1it}$ and $x_{2it}$ are continuous variables we have:

$$\frac{\beta_{02}}{\beta_{01}} = \frac{\partial \Pr(y_{it} = 1 \mid x_i, \eta_i)/\partial x_{2it}}{\partial \Pr(y_{it} = 1 \mid x_i, \eta_i)/\partial x_{1it}}. \tag{8}$$

[1] An alternative is to obtain the difference in probabilities for specific values of $\eta$ and $x_{2t}$ (e.g. their means), but this may only be relevant for a small part of the population (see Chamberlain, 1984).

3 The Problem

The log-likelihood function from (1), assuming that the $y_{it}$ are independent conditional on $x_i$ and $\eta_i$, is given by

$$\sum_{i=1}^{N} \ell_i(\beta, \eta_i) \tag{9}$$

where

$$\ell_i(\beta, \eta_i) = \sum_{t=1}^{T} \left\{y_{it} \log F_{it} + (1 - y_{it}) \log(1 - F_{it})\right\} \tag{10}$$

and $F_{it} = F(x_{it}'\beta + \eta_i)$. Moreover, the scores are

$$d_{\eta i}(\beta, \eta_i) \equiv \frac{\partial \ell_i(\beta, \eta_i)}{\partial \eta_i} = \sum_{t=1}^{T} \frac{f_{it}}{F_{it}(1 - F_{it})} (y_{it} - F_{it}) \tag{11}$$

$$d_{\beta i}(\beta, \eta_i) \equiv \frac{\partial \ell_i(\beta, \eta_i)}{\partial \beta} = \sum_{t=1}^{T} \frac{f_{it}}{F_{it}(1 - F_{it})} x_{it} (y_{it} - F_{it}) \tag{12}$$

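where $f_{it}$ denotes the pdf corresponding to $F_{it}$. For the logit model F is the logistic cdf $\Lambda(r) = e^r/(1 + e^r)$ and we have $f_{it}/[F_{it}(1 - F_{it})] = 1$, so that in this case the scores are simply $d_{\eta i}(\beta, \eta_i) = \sum_{t=1}^{T} (y_{it} - F_{it})$ and $d_{\beta i}(\beta, \eta_i) = \sum_{t=1}^{T} x_{it}(y_{it} - F_{it})$.

Let the MLE of $\eta_i$ for given $\beta$ be

$$\widehat{\eta}_i(\beta) = \arg\max_{\eta} \ell_i(\beta, \eta) \tag{13}$$

so that $\widehat{\eta}_i(\beta)$ solves

$$d_{\eta i}(\beta, \widehat{\eta}_i(\beta)) = 0. \tag{14}$$

Therefore, the MLE of $\beta$ is given by the maximizer of the concentrated (or profile) log-likelihood

$$\widehat{\beta} = \arg\max_{\beta} \sum_{i=1}^{N} \ell_i(\beta, \widehat{\eta}_i(\beta)) \tag{15}$$

which solves the first order conditions

$$b_{TN}(\beta) = \frac{1}{TN} \sum_{i=1}^{N} \left\{d_{\beta i}(\beta, \widehat{\eta}_i(\beta)) + d_{\eta i}(\beta, \widehat{\eta}_i(\beta)) \frac{\partial \widehat{\eta}_i(\beta)}{\partial \beta}\right\} = \frac{1}{TN} \sum_{i=1}^{N} d_{\beta i}(\beta, \widehat{\eta}_i(\beta)) = 0. \tag{16}$$

The problem is that $b_{TN}(\beta)$ evaluated at $\beta = \beta_0$ does not converge to zero in probability when $N \to \infty$ for T fixed (although it does converge to zero when $T \to \infty$). This situation has been known as the incidental parameters problem since Neyman and Scott (1948). A discussion of this problem for discrete choice models is in Heckman (1981).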
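The concentration step in (13)-(15) can be sketched in a few lines of Python for the logit case. This is only an illustration under stated assumptions, not code from the paper: the regressor is taken to be scalar, the function names (`eta_hat`, `concentrated_loglik_i`) and the sample unit are hypothetical, and the solver assumes $0 < \sum_t y_{it} < T$, since otherwise the maximizing effect diverges.

```python
import math

def logistic(r):
    """Logistic cdf, written to avoid overflow for large |r|."""
    if r >= 0:
        return 1.0 / (1.0 + math.exp(-r))
    e = math.exp(r)
    return e / (1.0 + e)

def score_eta(beta, eta, x, y):
    """Logit score (11) with respect to eta_i: sum_t (y_it - F_it)."""
    return sum(yt - logistic(xt * beta + eta) for xt, yt in zip(x, y))

def eta_hat(beta, x, y, lo=-50.0, hi=50.0, tol=1e-12):
    """Solve (14) by bisection; the score is strictly decreasing in eta."""
    if not 0 < sum(y) < len(y):
        raise ValueError("eta_hat diverges when y_it never changes")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score_eta(beta, mid, x, y) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def concentrated_loglik_i(beta, x, y):
    """Unit i's term in (15): l_i(beta, eta_hat_i(beta))."""
    eta = eta_hat(beta, x, y)
    total = 0.0
    for xt, yt in zip(x, y):
        F = logistic(xt * beta + eta)
        total += yt * math.log(F) + (1 - yt) * math.log(1.0 - F)
    return total

# Hypothetical unit with T = 4 and a scalar regressor.
x_i = [0.0, 1.0, 0.5, -1.0]
y_i = [0, 1, 1, 0]
eta = eta_hat(0.8, x_i, y_i)
```

Summing `concentrated_loglik_i` over units and maximizing over `beta` would give the ML estimate in (15); bisection is used here only because the logit score in $\eta_i$ is monotone, which keeps the sketch short.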

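An Example

As a classic illustration let us consider a model in which $T = 2$, f is symmetric, $\beta$ is scalar, and $x_{it}$ is a time dummy such that $x_{i1} = 0$ and $x_{i2} = 1$ (Andersen, 1973; Heckman, 1981). For observations with $(y_{i1}, y_{i2}) = (0, 0)$ we have $\widehat{\eta}_i(\beta) \to -\infty$ and

$$\ell_i(\beta, \widehat{\eta}_i(\beta)) = \log F(-\widehat{\eta}_i(\beta)) + \log F(-\widehat{\eta}_i(\beta) - \beta) \to 0.$$

For observations with $(y_{i1}, y_{i2}) = (1, 1)$ we have $\widehat{\eta}_i(\beta) \to \infty$ and

$$\ell_i(\beta, \widehat{\eta}_i(\beta)) = \log F(\widehat{\eta}_i(\beta)) + \log F(\widehat{\eta}_i(\beta) + \beta) \to 0.$$

Finally, for $(0, 1)$ or $(1, 0)$ observations we have, respectively, $\ell_i(\beta, \eta) = \log F(-\eta) + \log F(\eta + \beta)$ or $\ell_i(\beta, \eta) = \log F(\eta) + \log F(-\eta - \beta)$, which in both cases are maximized at

$$\widehat{\eta}_i(\beta) = -\frac{\beta}{2}. \tag{17}$$

The implication is that the contributions of observations $(0, 0)$ and $(1, 1)$ to the concentrated log-likelihood are equal to zero, a $(0, 1)$ observation contributes a term of the form $2 \log F(\beta/2)$, and a $(1, 0)$ observation contributes $2 \log[1 - F(\beta/2)]$. So the concentrated log-likelihood is given by

$$\sum_{i=1}^{N} 2\left\{d_{10i} \log[1 - F(\beta/2)] + d_{01i} \log F(\beta/2)\right\} \tag{18}$$

where $d_{10i} = 1(y_{i1} = 1, y_{i2} = 0)$ and $d_{01i} = 1(y_{i1} = 0, y_{i2} = 1)$. Moreover, the MLE of $p = F(\beta/2)$ is

$$\widehat{p} = \frac{\sum_{i=1}^{N} d_{01i}}{\sum_{i=1}^{N} 1(y_{i1} + y_{i2} = 1)}, \tag{19}$$

so that

$$\widehat{\beta} = 2F^{-1}(\widehat{p}). \tag{20}$$

Note that $\widehat{p}$ is the sample counterpart of $p_0 = \Pr(y_{i1} = 0, y_{i2} = 1 \mid y_{i1} + y_{i2} = 1)$. Thus the MLE $\widehat{\beta}$ satisfies

$$\operatorname*{plim}_{N \to \infty} \widehat{\beta} = 2F^{-1}(p_0). \tag{21}$$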
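Equations (19) and (20) reduce the two-period ML estimate to a function of two counts. As a small illustration (the counts below are hypothetical and the helper names are not from the paper), the following Python sketch evaluates (20) for the logistic and the standard normal cdf, using the standard library's `statistics.NormalDist` for the normal quantile function.

```python
import math
from statistics import NormalDist

def beta_hat_two_period(n01, n10, inv_cdf):
    """ML estimate (20): 2 * F^{-1}(p_hat), with p_hat taken from (19)."""
    if n01 <= 0 or n10 <= 0:
        raise ValueError("need both (0,1) and (1,0) observations")
    p_hat = n01 / (n01 + n10)
    return 2.0 * inv_cdf(p_hat)

def logit_inv_cdf(p):
    """Inverse of the logistic cdf (the log odds)."""
    return math.log(p / (1.0 - p))

probit_inv_cdf = NormalDist().inv_cdf

# Hypothetical counts: 30 units move 0 -> 1 and 10 units move 1 -> 0.
b_logit = beta_hat_two_period(30, 10, logit_inv_cdf)    # 2 * log(3)
b_probit = beta_hat_two_period(30, 10, probit_inv_cdf)
```

Units with $(0, 0)$ or $(1, 1)$ never enter the calculation, which mirrors their zero contribution to the concentrated log-likelihood (18).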

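Moreover, we have

$$p_0 = \int \Pr(y_{i1} = 0, y_{i2} = 1 \mid y_{i1} + y_{i2} = 1, \eta)\, dG(\eta \mid y_{i1} + y_{i2} = 1).$$

For the logit model $\Pr(y_{i1} = 0, y_{i2} = 1 \mid y_{i1} + y_{i2} = 1, \eta)$ does not depend on $\eta$ and it turns out that $p_0 = \Lambda(\beta_0)$, where $\beta_0$ is the true value. Therefore, in such a case $\operatorname{plim}_{N \to \infty} \widehat{\beta} = 2\Lambda^{-1}[\Lambda(\beta_0)] = 2\beta_0$, so that ML would be estimating a relative log odds ratio that is twice as large as its true value.

More generally, using Bayes formula

$$p_0 = \int \frac{\Pr(y_{i1} = 0, y_{i2} = 1 \mid \eta)}{\Pr(y_{i1} + y_{i2} = 1 \mid \eta)} \frac{\Pr(y_{i1} + y_{i2} = 1 \mid \eta)}{\Pr(y_{i1} + y_{i2} = 1)}\, dG(\eta)$$

or

$$p_0 = \frac{E_\eta[\Pr(y_{i1} = 0, y_{i2} = 1 \mid \eta)]}{E_\eta[\Pr(y_{i1} + y_{i2} = 1 \mid \eta)]}, \tag{22}$$

so that

$$\operatorname*{plim}_{N \to \infty} \widehat{\beta} = 2F^{-1}\left\{\frac{E_\eta[F(-\eta)F(\beta + \eta)]}{E_\eta[F(-\eta)F(\beta + \eta)] + E_\eta[F(\eta)F(-\beta - \eta)]}\right\}. \tag{23}$$

Thus, in general the form of the asymptotic bias of $\widehat{\beta}$ depends on the distribution of the individual effects. For probit, under $\eta \sim N(0, \sigma_\eta^2)$ an explicit expression is available. Letting $\beta^* = \beta/(1 + \sigma_\eta^2)^{1/2}$ and $\rho = \sigma_\eta^2/(1 + \sigma_\eta^2)$ we have $E_\eta[\Phi(\eta)] = \Phi(0) = 0.5$, $E_\eta[\Phi(\beta + \eta)] = \Phi(\beta^*)$, and $E_\eta[\Phi(\eta)\Phi(\beta + \eta)] = \Phi_2(0, \beta^*; \rho)$, so that

$$\operatorname*{plim}_{N \to \infty} \widehat{\beta} = 2\Phi^{-1}\left(\frac{\Phi(\beta^*) - \Phi_2(0, \beta^*; \rho)}{\Phi(\beta^*) + 0.5 - 2\Phi_2(0, \beta^*; \rho)}\right) \tag{24}$$

where $\Phi_2(\cdot, \cdot; \rho)$ is the cdf of the standardized bivariate normal density with correlation coefficient $\rho$. As noted by Heckman (1981), we get $p_0 \to F(\beta)$ as $\sigma_\eta \to 0$, in which case we get a similar bias result as for the logistic in a limiting situation. Numerical calculations of (24) are reported below in Section 6.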
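The dependence of the probability limit (23) on the distribution of the effects can be illustrated numerically. The Python sketch below is an assumption-laden illustration rather than the paper's calculation: it replaces G by a discrete distribution with hypothetical support points, and it evaluates (23) directly instead of using the bivariate normal formula (24).

```python
import math
from statistics import NormalDist

def logistic(r):
    return 1.0 / (1.0 + math.exp(-r))

def logit_inv(p):
    return math.log(p / (1.0 - p))

def plim_ml(beta0, cdf, inv_cdf, etas, probs):
    """Probability limit (23) when eta takes the values `etas` with
    probabilities `probs` (a discrete stand-in for G) and f is symmetric."""
    up = sum(p * cdf(-e) * cdf(beta0 + e) for e, p in zip(etas, probs))
    down = sum(p * cdf(e) * cdf(-beta0 - e) for e, p in zip(etas, probs))
    return 2.0 * inv_cdf(up / (up + down))

norm = NormalDist()
beta0 = 1.0
point_mass = ([0.0], [1.0])                      # no heterogeneity
wide = ([-3.0, 0.0, 3.0], [0.25, 0.5, 0.25])     # hypothetical G

logit_point = plim_ml(beta0, logistic, logit_inv, *point_mass)
logit_wide = plim_ml(beta0, logistic, logit_inv, *wide)
probit_point = plim_ml(beta0, norm.cdf, norm.inv_cdf, *point_mass)
probit_wide = plim_ml(beta0, norm.cdf, norm.inv_cdf, *wide)
```

Under these assumptions the two logit limits coincide at $2\beta_0$ whatever the support of $\eta$, while the probit limit equals $2\beta_0$ only in the degenerate case and moves away from it once the effects are dispersed.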

4 Fixed T Solutions

4.1 Conditional MLE

A sufficient statistic for $\eta_i$, $S_i$ say, is a function of the data such that the distribution of the data given $S_i$ does not depend on $\eta_i$. The idea is to use the likelihood conditioned on $S_i$ to make inference about $\beta_0$ (Andersen, 1970). This works as long as $\beta_0$ is identified from the conditional likelihood of the data, which obviously requires that the conditional likelihood depends on $\beta_0$. Unfortunately, this is not the case except for the logit model.

In the logit model $\sum_{t=1}^{T} y_{it}$ is a sufficient statistic for $\eta_i$. Indeed, we have

$$\Pr\left(y_{i1}, \ldots, y_{iT} \,\Big|\, \sum_{t=1}^{T} y_{it}, x_i\right) = \frac{\exp\left(\sum_{t=1}^{T} y_{it} x_{it}'\beta_0\right)}{\sum_{(d_1, \ldots, d_T) \in B_i} \exp\left(\sum_{t=1}^{T} d_t x_{it}'\beta_0\right)} \tag{25}$$

where $B_i$ is the set of all 0-1 sequences such that $\sum_{t=1}^{T} d_t = \sum_{t=1}^{T} y_{it}$. This result was first obtained by Rasch (1960, 1961) (for surveys see Chamberlain, 1984, or Arellano and Honoré, 2001). For example, with $T = 2$ we have

$$\Pr(y_{i1}, y_{i2} \mid y_{i1} + y_{i2}, x_i) = \begin{cases} 1 & \text{if } (y_{i1}, y_{i2}) = (0, 0) \text{ or } (1, 1) \\ 1 - \Lambda(\Delta x_{i2}'\beta_0) & \text{if } (y_{i1}, y_{i2}) = (1, 0) \\ \Lambda(\Delta x_{i2}'\beta_0) & \text{if } (y_{i1}, y_{i2}) = (0, 1). \end{cases} \tag{26}$$

Therefore, the log-likelihood conditioned on $y_{i1} + y_{i2}$ is given by[2]

$$L_c(\beta) = \sum_{i=1}^{N} \left\{d_{10i} \log[1 - \Lambda(\Delta x_{i2}'\beta)] + d_{01i} \log \Lambda(\Delta x_{i2}'\beta)\right\} \tag{27}$$

and the score takes the form

$$\frac{\partial L_c(\beta)}{\partial \beta} = \sum_{i=1}^{N} \Delta x_{i2} \left\{d_{01i} - \Lambda(\Delta x_{i2}'\beta)\, 1(y_{i1} + y_{i2} = 1)\right\}. \tag{28}$$

[2] The contribution of $(0, 0)$ or $(1, 1)$ observations is zero.
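A minimal Python sketch of the two-period conditional logit estimator follows. It is an illustration under assumptions that are not in the paper: the regressor is scalar, the data are simulated with an effect that is correlated with $x_{it}$ by construction, and the score (28) is solved by plain Newton iterations started at zero.

```python
import math
import random

def logistic(r):
    return 1.0 / (1.0 + math.exp(-r))

def simulate(n, beta0, rng):
    """T = 2 logit panel with scalar x_it and eta_i correlated with x_i."""
    data = []
    for _ in range(n):
        eta = rng.gauss(0.0, 1.0)
        x = [eta + rng.gauss(0.0, 1.0) for _ in range(2)]
        y = [int(rng.random() < logistic(xt * beta0 + eta)) for xt in x]
        data.append((x, y))
    return data

def conditional_logit(data, iters=25):
    """Newton iterations on the score (28), using only switchers."""
    # For a switcher, d01 = 1 exactly when y_i2 = 1.
    sw = [(x[1] - x[0], y[1]) for x, y in data if y[0] + y[1] == 1]
    beta = 0.0
    for _ in range(iters):
        score = sum(dx * (d01 - logistic(dx * beta)) for dx, d01 in sw)
        info = sum(dx * dx * logistic(dx * beta) * (1.0 - logistic(dx * beta))
                   for dx, d01 in sw)
        beta += score / info
    return beta

rng = random.Random(0)
panel = simulate(20000, 1.0, rng)      # hypothetical design, beta0 = 1
beta_c = conditional_logit(panel)
```

Only the first difference of the regressor and the direction of the switch are used, so the estimate does not depend on how $\eta_i$ was generated.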

4.2 Maximum Score Estimation

The previous technique crucially relied on the logit assumption. Manski (1987) considered a more general model of the form (1) in which the cdf of $-v_{it} \mid x_i, \eta_i$ was non-parametric and could depend on $x_i$ and $\eta_i$ in a time-invariant way. Namely, for all t and s

$$\Pr(-v_{it} \leq r \mid x_i, \eta_i) = \Pr(-v_{is} \leq r \mid x_i, \eta_i) = F(r \mid x_i, \eta_i), \tag{29}$$

so that $F(r \mid x_i, \eta_i)$ does not change with t but is otherwise unrestricted. This assumption imposes stationarity and strict exogeneity, but allows for serial dependence in the errors $v_{it}$. It also allows for a certain kind of conditional heteroskedasticity, though not a very plausible one, since $\mathrm{Var}(v_{it} \mid x_i, \eta_i)$ may depend on $x_i$ but $v_{it}$ is not allowed to be more sensitive to $x_{it}$ than to other x's. Similarly, if the expectations $E(v_{it} \mid x_i, \eta_i)$ exist, they may depend on $x_i$ but their first-differences may not: $E(\Delta v_{it} \mid x_i, \eta_i) = 0$.

The time-invariance of F implies that for $T = 2$:[3]

$$\mathrm{med}(y_{i2} - y_{i1} \mid x_i, y_{i1} + y_{i2} = 1) = \mathrm{sgn}(\Delta x_{i2}'\beta_0). \tag{30}$$

To see this note that, given $y_{i1} + y_{i2} = 1$, the difference $y_{i2} - y_{i1}$ can only equal $1$ or $-1$. So the median will be one or the other depending on whether $\Pr(y_{i2} = 1, y_{i1} = 0 \mid x_i) \gtrless \Pr(y_{i2} = 0, y_{i1} = 1 \mid x_i)$. Thus[4]

$$\begin{aligned} \mathrm{med}(y_{i2} - y_{i1} \mid x_i, y_{i1} + y_{i2} = 1) &= \mathrm{sgn}[\Pr(y_{i2} = 1, y_{i1} = 0 \mid x_i) - \Pr(y_{i2} = 0, y_{i1} = 1 \mid x_i)] \\ &= \mathrm{sgn}[\Pr(y_{i2} = 1 \mid x_i) - \Pr(y_{i1} = 1 \mid x_i)]. \end{aligned}$$

[3] The sign function is defined as $\mathrm{sgn}(u) = 1(u > 0) - 1(u < 0)$, i.e. $\mathrm{sgn}(u) = -1$ if $u < 0$, $\mathrm{sgn}(u) = 0$ if $u = 0$ and $\mathrm{sgn}(u) = 1$ if $u > 0$.

[4] The second equality follows from

$$\begin{aligned} \Pr(y_{i2} = 1 \mid x_i) &= \Pr(y_{i2} = 1, y_{i1} = 0 \mid x_i) + \Pr(y_{i2} = 1, y_{i1} = 1 \mid x_i) \\ \Pr(y_{i1} = 1 \mid x_i) &= \Pr(y_{i2} = 0, y_{i1} = 1 \mid x_i) + \Pr(y_{i2} = 1, y_{i1} = 1 \mid x_i). \end{aligned}$$
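The sign restriction in (30) can be checked numerically for a particular time-invariant F. The Python sketch below is purely illustrative: the scale function inside `cdf`, the discrete distribution standing in for $G(\eta \mid x)$, the value of $\beta_0$ and the grid of regressor values are all hypothetical choices, and the regressor is scalar.

```python
import math

def sgn(u):
    return (u > 0) - (u < 0)

def cdf(r, x):
    """A time-invariant F(r | x_i): logistic with a scale that depends on
    the whole history x_i (hypothetical choice, for illustration only)."""
    scale = 1.0 + abs(x[0] + x[1])
    return 1.0 / (1.0 + math.exp(-r / scale))

def prob_y1(t, x, beta0, etas, probs):
    """Pr(y_it = 1 | x_i), integrating eta_i out of F(x_it*beta0 + eta_i | x_i)."""
    return sum(p * cdf(x[t] * beta0 + e, x) for e, p in zip(etas, probs))

beta0 = -0.7
etas, probs = [-2.0, 0.0, 1.5], [0.2, 0.5, 0.3]   # hypothetical G(eta | x)
grid = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]
agree = all(
    sgn(prob_y1(1, (x1, x2), beta0, etas, probs)
        - prob_y1(0, (x1, x2), beta0, etas, probs)) == sgn((x2 - x1) * beta0)
    for x1 in grid for x2 in grid
)
```

In this design `agree` is true for every pair on the grid, even though the error scale varies with $x_i$; what matters is that the same F applies in both periods.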

Moreover, from the model's specification, i.e.

$$\begin{aligned} \Pr(y_{i1} = 1 \mid x_i, \eta_i) &= F(x_{i1}'\beta_0 + \eta_i \mid x_i, \eta_i) \\ \Pr(y_{i2} = 1 \mid x_i, \eta_i) &= F(x_{i2}'\beta_0 + \eta_i \mid x_i, \eta_i), \end{aligned}$$

and the monotonicity of F, we have that for any $\eta_i$ (the constancy of F over time becomes crucial at this point):

$$\Pr(y_{i2} = 1 \mid x_i, \eta_i) \gtrless \Pr(y_{i1} = 1 \mid x_i, \eta_i) \iff x_{i2}'\beta_0 \gtrless x_{i1}'\beta_0.$$

Therefore, the implication also holds unconditionally relative to $\eta_i$:

$$\Pr(y_{i2} = 1 \mid x_i) \gtrless \Pr(y_{i1} = 1 \mid x_i) \iff x_{i2}'\beta_0 \gtrless x_{i1}'\beta_0$$

or

$$\mathrm{sgn}[\Pr(y_{i2} = 1 \mid x_i) - \Pr(y_{i1} = 1 \mid x_i)] = \mathrm{sgn}(\Delta x_{i2}'\beta_0).$$

Manski showed that the true value of $\beta_0$ uniquely maximizes (up to scale) the expected agreement between the sign of $\Delta x_{i2}'\beta$ and that of $\Delta y_{i2}$ conditioned on $y_{i1} + y_{i2} = 1$. This identification result required an unbounded support for at least one of the explanatory variables with a non-zero coefficient. That is, letting $x_{it}' = (z_{it}, w_{it}')$ and $\beta_0' = (\gamma_0, \alpha_0')$, the minimal requirement for identification is that $z_{it}$ has unbounded support and $\gamma_0 \neq 0$. Identification fails at $\gamma_0 = 0$, so that $\gamma_0 = 0$ is not a testable hypothesis. Manski's identification result implies that we can learn about the relative effects of the variables $w_{it}$ under the maintained assumption that $\gamma_0 \neq 0$.

Manski then proposed to estimate $\beta_0$ by selecting the value that matches the sign of $\Delta x_{i2}'\beta$ with that of $\Delta y_{i2}$ for as many observations as possible in the subsample with $y_{i1} + y_{i2} = 1$. The suggested estimator is

$$\widehat{\beta} = \arg\max_{\beta} \sum_{i=1}^{N} \mathrm{sgn}(\Delta x_{i2}'\beta)(y_{i2} - y_{i1}) \tag{31}$$

subject to the normalization $\|\beta\| = 1$.[5] This is the maximum score estimator applied to the observations with $y_{i1} + y_{i2} = 1$ (notice that the estimation criterion is unaffected by removing observations having $y_{i1} = y_{i2}$). It is consistent under the assumption that there is at least one unbounded continuous regressor, but it is not root-N consistent, and not asymptotically normal.

An alternative form of the score objective function is

$$S_N(\beta) = \sum_{i=1}^{N} \left\{d_{10i}\, 1(\Delta x_{i2}'\beta < 0) + d_{01i}\, 1(\Delta x_{i2}'\beta \geq 0)\right\}. \tag{32}$$

The score $S_N(\beta)$ gives the number of correct predictions we would make if we predicted $(y_{i1}, y_{i2})$ to be $(0, 1)$ whenever $\Delta x_{i2}'\beta \geq 0$. In contrast, $\sum_{i=1}^{N} \mathrm{sgn}(\Delta x_{i2}'\beta)\Delta y_{i2}$ gives the number of successes minus the number of failures. Yet another form of the estimator, suggested by the median regression interpretation, is as the minimizer of the number of failures, which is given by

$$\frac{1}{2} \sum_{i=1}^{N} 1(y_{i1} \neq y_{i2}) \left|\Delta y_{i2} - \mathrm{sgn}(\Delta x_{i2}'\beta)\right|. \tag{33}$$

[5] In the logit case the scale normalization is imposed through the variance of the logistic distribution. More generally, if F is a known distribution a priori, the scale normalization is determined by the form of F. Comparisons can be made by considering ratios of coefficients.

Smoothed Maximum Score

It is possible to consider a smoothed version of the maximum score estimator along the lines of Horowitz (1992), which does have an asymptotic normal distribution, although the rate of convergence remains slower than root-N (Charlier, Melenberg and van Soest, 1995, and Kyriazidou, 1997).[6]

[6] Chamberlain (1986) showed that there is no root-N consistent estimator of $\beta$ under the assumptions of Manski for his maximum score method.

The idea is to replace $S_N(\beta)$ with a smooth function $S_N^*(\beta)$ whose limit a.s. as $N \to \infty$ is the same as that of $S_N(\beta)$. This is of the form

$$S_N^*(\beta) = \sum_{i=1}^{N} \left\{d_{10i}\left[1 - K(\Delta x_{i2}'\beta/\gamma_N)\right] + d_{01i} K(\Delta x_{i2}'\beta/\gamma_N)\right\} \tag{34}$$

where $K(\cdot)$ is analogous to a cdf and $\gamma_N$ is a sequence of positive numbers such that $\lim_{N \to \infty} \gamma_N = 0$.

4.3 Random Effects

In general

$$\Pr(y_{i1}, \ldots, y_{iT} \mid x_i) = \int \Pr(y_{i1}, \ldots, y_{iT} \mid x_i, \eta_i)\, dG(\eta_i \mid x_i) \tag{35}$$

where $G(\eta_i \mid x_i)$ is the cdf of $\eta_i \mid x_i$. The substantive model specifies $\Pr(y_{i1}, \ldots, y_{iT} \mid x_i, \eta_i)$, but only $\Pr(y_{i1}, \ldots, y_{iT} \mid x_i)$ has an empirical counterpart. For example, we may have specified

$$\Pr(y_{i1}, \ldots, y_{iT} \mid x_i, \eta_i) = \prod_{t=1}^{T} \Pr(y_{it} \mid x_i, \eta_i) = \prod_{t=1}^{T} F_{it}^{y_{it}} (1 - F_{it})^{(1 - y_{it})}.$$

In a fixed effects model we seek to make inferences about parameters in $\Pr(y_{i1}, \ldots, y_{iT} \mid x_i, \eta_i)$ without restricting the form of G. In a random effects model G is typically parametric or semiparametric, and the parameters of interest may or may not be identified with G unrestricted. Thus a fixed effects model can be regarded as a random effects model that leaves the distribution of the effects unrestricted.

The choice between fixed and random effects models often involves a trade-off between robustness in the specification of $\Pr(y_{i1}, \ldots, y_{iT} \mid x_i, \eta_i)$ and robustness in G, in the sense that achieving fixed-T identification with unrestricted G usually requires a more restrictive specification of $\Pr(y_{i1}, \ldots, y_{iT} \mid x_i, \eta_i)$.

Chamberlain (1980, 1984) considered a random effects model in which the effects are of the form

$$\eta_i = \mu(x_i) + \varepsilon_i \tag{36}$$

and $\varepsilon_i$ is independent of $x_i$. He also made the normality assumptions

$$v_{it} \mid x_i, \eta_i \sim N(0, \omega_{tt}) \tag{37}$$

$$\varepsilon_i \mid x_i \sim N(0, \sigma_\eta^2), \tag{38}$$

which imply that

$$\Pr(y_{it} = 1 \mid x_i) = \Phi\left[\sigma_t^{-1}(x_{it}'\beta_0 + \mu(x_i))\right] \tag{39}$$

where $\sigma_t^2 = \sigma_\eta^2 + \omega_{tt}$ and $\Phi(\cdot)$ is the standard normal cdf. In this model the $v_{it}$ may be serially dependent and heteroskedastic over time. Chamberlain assumed a linear specification $\mu(x_i) = \lambda_0 + x_i'\lambda$, and Newey (1994) generalized the model to a non-parametric $\mu(x_i)$.

In the linear case, $\beta_0$, $\lambda_0$, $\lambda$, and the $\sigma_t^2$ can be estimated subject to the normalization $\sigma_1^2 = 1$ by combining the period-by-period probit likelihood functions (see Bover and Arellano, 1997, for a discussion of alternative estimators). In the semiparametric case, Newey used the fact that

$$\sigma_t \Phi^{-1}[\Pr(y_{it} = 1 \mid x_i)] - \sigma_{t-1} \Phi^{-1}\left[\Pr(y_{i(t-1)} = 1 \mid x_i)\right] = \Delta x_{it}'\beta_0 \tag{40}$$

together with non-parametric estimates of the probabilities $\Pr(y_{it} = 1 \mid x_i)$ to obtain an estimator of $\beta_0$ and the relative scales.

A further generalization of the model is to drop the normality assumptions and allow the distribution of the errors $\varepsilon_i + v_{it} \mid x_i$ to be unknown. This case has been considered by Chen (1998). Another semi-parametric approach has been followed by Lee (1999). Under certain assumptions on the joint distribution of $x_i$ and $\eta_i$, Lee proposed a maximum rank correlation-type estimator which is $\sqrt{N}$-consistent and asymptotically normal.

5 Identification Problems with Fixed T

It would be useful to know which models for $\Pr(y_{i1}, \ldots, y_{iT} \mid x_i, \eta_i)$ are identified without placing restrictions on the form of $G(\eta_i \mid x_i)$ (i.e. fixed-effects identification with fixed T) and which are not.

A model is given by a $2^T \times 1$ vector $p(x_i, \eta_i, \beta_0)$ with elements that specify the probabilities

$$\Pr((y_{i1}, \ldots, y_{iT}) = d_j \mid x_i, \eta_i) \quad (j = 1, \ldots, 2^T) \tag{41}$$

where $d_j$ is a 0-1 sequence of order T. Let the true cdf of $\eta_i \mid x_i$ be $G_0(\eta \mid x)$. Identification will fail at $\beta_0$ if for all x in the support of $x_i$ there is a cdf $G^*(\eta \mid x)$ and $\beta^* \neq \beta_0$ in the parameter space, such that

$$\int p(x, \eta, \beta_0)\, dG_0(\eta \mid x) = \int p(x, \eta, \beta^*)\, dG^*(\eta \mid x). \tag{42}$$

If this is so, $(\beta_0, G_0)$ and $(\beta^*, G^*)$ give the same conditional distribution for $(y_{i1}, \ldots, y_{iT})$ given $x_i$. Therefore, they are observationally equivalent relative to such distribution.

Chamberlain (1992) studied the identification of a fixed effects binary choice model with $T = 2$. He considered the model

$$y_{it} = 1(x_{it}'\beta_0 + \eta_i + v_{it} \geq 0) \quad (t = 1, 2)$$

together with the assumption that the $v_{it}$ are independent of $x_i$, $\eta_i$ and are i.i.d. over time with a known cdf F. The distribution F is strictly increasing on the whole line, with a bounded, continuous derivative. Moreover, we have the partitions $x_{it}' = (d_t, z_{it}')$ and $\beta_0' = (\alpha_0, \gamma_0')$, where $d_t$ is a time dummy such that $d_1 = 0$ and $d_2 = 1$, and $z_{it}$ is a continuous random vector with bounded support. With these assumptions Chamberlain showed that if F is not logistic, then there is a value of $\alpha^*$ such that identification fails for all $\beta_0$ in a neighborhood of $(\alpha^*, 0)$.

This seems puzzling since Manski (1987) proved identification under less restrictive assumptions. He required, however, the presence of an explanatory variable with unbounded support. Indeed, the difference between the identification result of Manski and the underidentification result of Chamberlain is due to the bounded support for the explanatory variables.

The line between identification and underidentification in this context is very subtle. Under Manski's assumptions identification will fail at $\beta_0' = (\alpha_0, 0)$ even if $z_{it}$ has unbounded support, but there will be identification as long as a component of $\gamma_0$ is different from zero. Chamberlain shows that if $z_{it}$ is bounded $\beta_0$ is underidentified not only when $\beta_0' = (\alpha_0, 0)$, but also for

all $\beta_0$ in a neighborhood of $(\alpha_0, 0)$ for a certain value of $\alpha_0$. So it seems to be a case of local underidentification at zero versus local underidentification in a neighborhood around zero. The lesson from these findings is the fragility of fixed-T identification results and the special role of the logistic assumption.

Chamberlain (1992) also showed that when the support of $z_{it}$ is unbounded (so that identification holds to the exclusion of $\gamma_0 = 0$ from the parameter space) the information bound for $\beta_0$ is zero unless F is logistic. Thus, root-N consistent estimation is possible only for the logit model.

Chamberlain's proof can be sketched as follows. In his case $p(x, \eta, \beta_0)$ is

$$p(x, \eta, \beta_0) = \begin{pmatrix} (1 - F_1)(1 - F_2) \\ (1 - F_1) F_2 \\ F_1 (1 - F_2) \\ F_1 F_2 \end{pmatrix}$$

where $F_1 = F(z_1'\gamma + \eta)$ and $F_2 = F(\alpha_0 + z_2'\gamma + \eta)$. Let $\beta^* = (\alpha^*, 0)$ and define the $4 \times 4$ matrix

$$H(x, \eta_1, \ldots, \eta_4, \beta^*) = [p(x, \eta_1, \beta^*), \ldots, p(x, \eta_4, \beta^*)],$$

which does not vary with x when evaluated at $\beta^*$. The proof proceeds by showing that unless $H(x, \eta_1, \ldots, \eta_4, \beta^*)$ is singular for every $\alpha^*$ and $\eta_1, \ldots, \eta_4$, there will be lack of identification for all $\beta_0$ in a neighborhood of some $\beta^*$. Next it is shown that $H(x, \eta_1, \ldots, \eta_4, \beta^*)$ can only be singular if F is logistic.

Let us choose a pmf $\pi^* = (\pi_1^*, \ldots, \pi_4^*)'$, $\pi_j^* > 0$, $\sum_{j=1}^{4} \pi_j^* = 1$. If for some other pmf $\pi_0(x)$ we have

$$H(x, \eta_1, \ldots, \eta_4, \beta_0)\, \pi_0(x) = H(x, \eta_1, \ldots, \eta_4, \beta^*)\, \pi^*,$$

then the models characterized by $(\beta_0, \pi_0(x))$ and $(\beta^*, \pi^*)$ give the same unconditional choice probabilities, hence creating an identification problem. To rule this out we have to rule out that H is invertible. To see this, suppose

that $H(x, \eta_1, \ldots, \eta_4, \beta^*)$ is nonsingular for some $\alpha^*$ and $\eta_1, \ldots, \eta_4$. Since x is bounded, for $\beta_0 \neq \beta^*$ in a neighborhood of $\beta^*$, $H(x, \eta_1, \ldots, \eta_4, \beta_0)$ will also be nonsingular for all admissible values of x. We can now define

$$\pi_0(x) = H(x, \eta_1, \ldots, \eta_4, \beta_0)^{-1} H(x, \eta_1, \ldots, \eta_4, \beta^*)\, \pi^*,$$

such that $\pi_{0j}(x) > 0$ for all admissible x. Moreover, since $\iota' H = \iota'$ where $\iota$ is a $4 \times 1$ vector of ones, we also have $\iota' H^{-1} = \iota'$ and $\iota' \pi_0(x) = 1$. Therefore,

$$\sum_{j=1}^{4} p(x, \eta_j, \beta_0)\, \pi_{0j}(x) = \sum_{j=1}^{4} p(x, \eta_j, \beta^*)\, \pi_j^*$$

which implies that $\beta_0$ cannot be distinguished from $\beta^*$.

The singularity of $H(x, \eta_1, \ldots, \eta_4, \beta^*)$ requires that

$$\psi_1 [1 - F(\eta)][1 - F(\alpha + \eta)] + \psi_2 [1 - F(\eta)] F(\alpha + \eta) + \psi_3 F(\eta)[1 - F(\alpha + \eta)] + \psi_4 F(\eta) F(\alpha + \eta) = 0$$

for all $\eta$ and some scalars $\psi_1, \ldots, \psi_4$ that are not all zero. Taking limits as $\eta$ tends to $\pm\infty$ gives $\psi_1 = \psi_4 = 0$. Thus we are left with

$$\psi_2 Q(\alpha + \eta) + \psi_3 Q(\eta) = 0$$

where $Q = F/(1 - F)$. For $\eta = 0$ we obtain $\psi_3/\psi_2 = -Q(\alpha)/Q(0)$. Therefore the singularity of H requires that for all $\alpha$ and $\eta$ we have

$$q(\alpha + \eta) = q(\alpha) + q(\eta) - q(0).$$

This can only happen if the log odds ratios $q = \log Q$ are linear or, equivalently, if F is logistic.

6 Adjusting the Concentrated Likelihood

Cox and Reid (1987) considered the general problem of doing inference for a parameter of interest in the absence of knowledge about nuisance parameters. They proposed a first-order adjustment to the concentrated likelihood

to take account of the estimation of the nuisance parameters (the modified profile likelihood). Their formulation required information orthogonality between the two types of parameters. That is, that the expected information matrix be block diagonal between the parameters of interest and the nuisance parameters; something that may be achieved by transformation of the latter (Cox and Reid explained how to construct orthogonal parameters). A discussion of orthogonality in the context of panel data models and a Bayesian perspective have been given by Lancaster (2000, 2002). The nature of the adjustment in a fixed effects model and some examples are also discussed in Cox and Reid (1992).

6.1 Orthogonalization

Let $\ell_i(\beta, \eta_i)$ be the log-likelihood for unit i (conditional on $x_i$ and $\eta_i$). A strong form of orthogonality arises when for some parameterization of $\eta_i$ we have

$$\ell_i(\beta, \eta_i) = \ell_{1i}(\beta) + \ell_{2i}(\eta_i), \tag{43}$$

for in this case the MLE of $\eta_i$ for given $\beta$ does not depend on $\beta$, $\widehat{\eta}_i(\beta) = \widehat{\eta}_i$. The implication is that the MLE of $\beta$ is unaffected by lack of knowledge of $\eta_i$. In this case $\partial^2 \ell_i(\beta, \eta_i)/\partial \beta \partial \eta_i = 0$ for all i. Unfortunately, such factorization does not hold for binary choice models.

In contrast, information orthogonality just requires the cross derivatives to be zero on average. Suppose that a reparameterization is made from $(\beta, \eta_i)$ to $(\beta, \lambda_i)$ chosen so that $\beta$ and $\lambda_i$ are information orthogonal. Thus $\eta_i = \eta(\beta, \lambda_i)$ is chosen such that the reparameterized log likelihood

$$\ell_i^*(\beta, \lambda_i) = \ell_i(\beta, \eta(\beta, \lambda_i)) \tag{44}$$

satisfies (at true values):

$$E\left(\frac{\partial^2 \ell_i^*(\beta_0, \lambda_i)}{\partial \beta\, \partial \lambda_i} \,\Big|\, x_i, \eta_i\right) = 0. \tag{45}$$

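Since we have

$$\frac{\partial \ell_i^*}{\partial \beta} = \frac{\partial \ell_i}{\partial \beta} + \frac{\partial \eta_i}{\partial \beta} \frac{\partial \ell_i}{\partial \eta_i} \tag{46}$$

and[7]

$$E\left(\frac{\partial^2 \ell_i^*}{\partial \beta\, \partial \lambda_i} \,\Big|\, x_i, \eta_i\right) = \frac{\partial \eta_i}{\partial \lambda_i} E\left(\frac{\partial^2 \ell_i}{\partial \beta\, \partial \eta_i} \,\Big|\, x_i, \eta_i\right) + \frac{\partial \eta_i}{\partial \lambda_i} \frac{\partial \eta_i}{\partial \beta} E\left(\frac{\partial^2 \ell_i}{\partial \eta_i^2} \,\Big|\, x_i, \eta_i\right), \tag{47}$$

following Cox and Reid (1987) and Lancaster (2002), the function $\eta(\beta, \lambda_i)$ must satisfy the partial differential equations

$$\frac{\partial \eta_i}{\partial \beta} = -E\left(\frac{\partial^2 \ell_i}{\partial \beta\, \partial \eta_i} \,\Big|\, x_i, \eta_i\right) \Big/ E\left(\frac{\partial^2 \ell_i}{\partial \eta_i^2} \,\Big|\, x_i, \eta_i\right). \tag{48}$$

[7] Note that there is a term that vanishes: $(\partial^2 \eta_i/\partial \beta\, \partial \lambda_i)\, E(\partial \ell_i/\partial \eta_i \mid x_i, \eta_i) = 0$.

Orthogonal Effects in Binary Choice

Let us now consider the form of information orthogonal fixed effects for model (1)-(2). These have been obtained by Lancaster (1998, 2000). For binary choice we have

$$E\left(\frac{\partial^2 \ell_i(\beta_0, \eta_i)}{\partial \beta\, \partial \eta_i} \,\Big|\, x_i, \eta_i\right) = -\sum_{t=1}^{T} h(x_{it}'\beta_0 + \eta_i)\, x_{it} \tag{49}$$

$$E\left(\frac{\partial^2 \ell_i(\beta_0, \eta_i)}{\partial \eta_i^2} \,\Big|\, x_i, \eta_i\right) = -\sum_{t=1}^{T} h(x_{it}'\beta_0 + \eta_i) \tag{50}$$

where

$$h(r) = \frac{f(r)^2}{F(r)[1 - F(r)]}. \tag{51}$$

Since in general (49) is different from zero, $\beta$ and $\eta_i$ are not information orthogonal. In view of (48), an orthogonal transformation of the effects will satisfy

$$\frac{\partial \eta_i}{\partial \beta} = -\frac{1}{\sum_{t=1}^{T} h_{it}} \sum_{t=1}^{T} h_{it} x_{it} \tag{52}$$

where $h_{it} = h(x_{it}'\beta + \eta_i)$.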
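The weight function (51) and the slope (52) are straightforward to evaluate. The following Python sketch, which assumes a scalar regressor and uses illustrative function names and inputs that are not in the paper, computes $h(r)$ for the logistic and standard normal cases and the right-hand side of (52).

```python
import math
from statistics import NormalDist

_norm = NormalDist()

def h_logit(r):
    """h(r) in (51) for the logistic cdf; it reduces to the logistic density."""
    F = 1.0 / (1.0 + math.exp(-r))
    f = F * (1.0 - F)
    return f * f / (F * (1.0 - F))

def h_probit(r):
    """h(r) in (51) for the standard normal cdf."""
    F = _norm.cdf(r)
    f = _norm.pdf(r)
    return f * f / (F * (1.0 - F))

def d_eta_d_beta(beta, eta, x, h):
    """Right-hand side of (52) for a scalar regressor:
    minus the h-weighted average of x_it."""
    w = [h(xt * beta + eta) for xt in x]
    return -sum(wt * xt for wt, xt in zip(w, x)) / sum(w)

x_i = [0.0, 1.0]                                  # a two-period time dummy
slope_logit = d_eta_d_beta(0.8, -0.4, x_i, h_logit)
slope_probit = d_eta_d_beta(0.8, -0.4, x_i, h_probit)
```

With these inputs the two indices are $-0.4$ and $0.4$, so by symmetry of h both periods receive the same weight and the slope is $-1/2$ under either distribution.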

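Moreover, letting $\phi(r) = h'(r)$ and $\phi_{it} = \phi(x_{it}'\beta + \eta_i)$, since

$$\frac{\partial^2 \eta_i}{\partial \beta\, \partial \lambda_i} = -\frac{\partial \eta_i}{\partial \lambda_i} \left[\frac{1}{\sum_{t=1}^{T} h_{it}} \sum_{t=1}^{T} \phi_{it} \left(x_{it} + \frac{\partial \eta_i}{\partial \beta}\right)\right]$$

it turns out that

$$\frac{\partial^2 \eta_i}{\partial \beta\, \partial \lambda_i} \Big/ \frac{\partial \eta_i}{\partial \lambda_i} = \frac{\partial}{\partial \beta} \log \frac{\partial \eta_i}{\partial \lambda_i} = -\frac{\partial}{\partial \beta} \log \sum_{t=1}^{T} h_{it}, \tag{53}$$

and

$$\frac{\partial \eta_i}{\partial \lambda_i} = \frac{1}{\sum_{t=1}^{T} h_{it}}. \tag{54}$$

Hence, Lancaster's orthogonal reparameterization is

$$\lambda_i = \sum_{t=1}^{T} \int_{-\infty}^{x_{it}'\beta + \eta_i} h(r)\, dr. \tag{55}$$

When $F(r)$ is the logistic distribution $h(r)$ coincides with the logistic density, so that an orthogonal effect for the logit model is

$$\lambda_i = \sum_{t=1}^{T} \Lambda(x_{it}'\beta + \eta_i). \tag{56}$$

6.2 Modified Profile Likelihood

The modified profile log likelihood function of Cox and Reid (1987) can be written as $L_M(\beta) = \sum_i \ell_{Mi}(\beta)$ and

$$\ell_{Mi}(\beta) = \ell_i^*(\beta, \widehat{\lambda}_i(\beta)) - \frac{1}{2} \log\left[-d_{\lambda\lambda i}^*(\beta, \widehat{\lambda}_i(\beta))\right], \tag{57}$$

where $\widehat{\lambda}_i(\beta)$ is the MLE of $\lambda_i$ for given $\beta$, and $d_{\lambda\lambda i}^*(\beta, \lambda_i) = \partial^2 \ell_i^*/\partial \lambda_i^2$. Intuitively, the role of the second term is to penalize values of $\beta$ for which the information about the effects is relatively large.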
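The reparameterization (55) can be evaluated by numerical integration, which gives a simple check of the logit closed form (56). The Python sketch below is illustrative only: it assumes a scalar regressor, truncates the lower limit of integration at a large negative number, and uses Simpson's rule; the helper names and input values are hypothetical.

```python
import math

def logistic(r):
    return 1.0 / (1.0 + math.exp(-r))

def h_logit(r):
    """For logit, h(r) in (51) equals the logistic density."""
    F = logistic(r)
    return F * (1.0 - F)

def integral_h(upper, h, lower=-40.0, steps=20000):
    """Composite Simpson approximation to the integral of h from a large
    negative number (a stand-in for minus infinity) up to `upper`."""
    width = (upper - lower) / steps
    total = h(lower) + h(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * h(lower + k * width)
    return total * width / 3.0

def lambda_i(beta, eta, x, h):
    """Orthogonal effect (55) for a scalar regressor."""
    return sum(integral_h(xt * beta + eta, h) for xt in x)

x_i = [0.0, 1.0, 0.5]                      # hypothetical regressor values
lam_numeric = lambda_i(0.8, -0.3, x_i, h_logit)
lam_closed = sum(logistic(xt * 0.8 - 0.3) for xt in x_i)   # equation (56)
```

Passing a different weight function, for example one built from the normal pdf and cdf as in (51), would give the probit version of $\lambda_i$, for which no closed form like (56) is stated in the text.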

An individual's modified score is of the form

$$d_{Mi}(\beta) = d_{Ci}(\beta) - \frac{d_{\lambda\lambda\beta i}^*(\beta, \widehat{\lambda}_i(\beta)) + d_{\lambda\lambda\lambda i}^*(\beta, \widehat{\lambda}_i(\beta)) \left[\partial \widehat{\lambda}_i(\beta)/\partial \beta\right]}{2\, d_{\lambda\lambda i}^*(\beta, \widehat{\lambda}_i(\beta))} \tag{58}$$

where $d_{Ci}(\beta)$ is the standard score from the concentrated likelihood function, $d_{\lambda\lambda\beta i}^*(\beta, \lambda_i) = \partial^3 \ell_i^*/\partial \lambda_i^2 \partial \beta$ and $d_{\lambda\lambda\lambda i}^*(\beta, \lambda_i) = \partial^3 \ell_i^*/\partial \lambda_i^3$.

The function (57) was derived by Cox and Reid as an approximation to the conditional likelihood given $\widehat{\lambda}_i(\beta)$. Their approach was motivated by the fact that in an exponential family model, it is optimal to condition on sufficient statistics for the nuisance parameters, and these can be regarded as the MLE of nuisance parameters chosen in a form to be orthogonal to the parameters of interest. For more general problems the idea was to derive a concentrated likelihood for $\beta$ conditioned on the MLE $\widehat{\lambda}_i(\beta)$, having ensured via orthogonality that $\widehat{\lambda}_i(\beta)$ changes slowly with $\beta$.

Another motivation for using (57) is that the corresponding expected score has a bias of a smaller order of magnitude than the standard ML score (cf. Liang, 1987, McCullagh and Tibshirani, 1990, and Ferguson, Reid, and Cox, 1991). Seen in this way, the objective of the adjustment is to center the concentrated score function to achieve consistency up to a certain order of magnitude in T. Specifically, while the difference between the score with known $\lambda_i$ and the concentrated score is in general of order $O_p(1)$, the corresponding difference with the modified concentrated score is of order $O_p(T^{-1/2})$ (see Appendix). This leads to a bias of order $O(T^{-1})$ in the expected modified score, as opposed to $O(1)$ in the concentrated score without modification.

The Adjustment in Terms of the Original Parameterization

Cox and Reid's motivation for modifying the concentrated likelihood relied on the orthogonality between common and nuisance parameters. Nevertheless, the mpl function (57) can be expressed in terms of the original parameterization.

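Firstly, note that because of the invariance of MLE $\widehat{\eta}_i(\beta) = \eta(\beta, \widehat{\lambda}_i(\beta))$ and

$$\ell_i^*(\beta, \widehat{\lambda}_i(\beta)) = \ell_i(\beta, \widehat{\eta}_i(\beta)). \tag{59}$$

Next, the term $d_{\lambda\lambda i}^*(\beta, \widehat{\lambda}_i(\beta))$ can be calculated as the product of the Fisher information in the $(\beta, \eta_i)$ parameterization and the square of the Jacobian of the transformation from $(\beta, \eta_i)$ to $(\beta, \lambda_i)$ (Cox and Reid, 1987, p. 10). That is, since the second derivatives of $\ell_i^*$ and $\ell_i$ are related by the expression

$$\frac{\partial^2 \ell_i^*}{\partial \lambda_i^2} = \frac{\partial^2 \ell_i}{\partial \eta_i^2} \left(\frac{\partial \eta_i}{\partial \lambda_i}\right)^2 + \frac{\partial \ell_i}{\partial \eta_i} \frac{\partial^2 \eta_i}{\partial \lambda_i^2},$$

and $\partial \ell_i/\partial \eta_i$ vanishes at $\widehat{\eta}_i(\beta)$, letting $d_{\eta\eta i}(\beta, \eta_i) = \partial^2 \ell_i/\partial \eta_i^2$ we have

$$d_{\lambda\lambda i}^*(\beta, \widehat{\lambda}_i(\beta)) = d_{\eta\eta i}(\beta, \widehat{\eta}_i(\beta)) \left(\frac{\partial \eta_i}{\partial \lambda_i}\Big|_{\lambda_i = \widehat{\lambda}_i(\beta)}\right)^2. \tag{60}$$

Thus, the mpl can be written as

$$\ell_{Mi}(\beta) = \ell_i(\beta, \widehat{\eta}_i(\beta)) - \frac{1}{2} \log\left[-d_{\eta\eta i}(\beta, \widehat{\eta}_i(\beta))\right] + \log\left(\frac{\partial \lambda_i}{\partial \eta_i}\Big|_{\eta_i = \widehat{\eta}_i(\beta)}\right). \tag{61}$$

Finally, in view of (48) and (53), the derivative with respect to $\beta$ of the Jacobian term (the required term for the modified score) can be expressed as

$$\frac{\partial}{\partial \beta} \log \frac{\partial \lambda_i}{\partial \eta_i} = \frac{\partial q_i(\beta, \eta_i)}{\partial \eta_i}, \tag{62}$$

where $q_i(\beta, \eta_i) = \kappa_{\beta\eta i}(\beta, \eta_i)/\kappa_{\eta\eta i}(\beta, \eta_i)$ and

$$\kappa_{\beta\eta i}(\beta_0, \eta_i) = E\left[T^{-1} d_{\beta\eta i}(\beta_0, \eta_i) \mid x_i, \eta_i\right] \tag{63}$$

$$\kappa_{\eta\eta i}(\beta_0, \eta_i) = E\left[T^{-1} d_{\eta\eta i}(\beta_0, \eta_i) \mid x_i, \eta_i\right]. \tag{64}$$

Modified Profile Likelihood for Binary Choice

Replacing (54) in (61) we have

$$\ell_{Mi}(\beta) = \ell_i(\beta, \widehat{\eta}_i(\beta)) - \frac{1}{2} \log\left[-d_{\eta\eta i}(\beta, \widehat{\eta}_i(\beta))\right] + \log\left(\sum_{t=1}^{T} \widehat{h}_{it}(\beta)\right) \tag{65}$$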

where $\hat h_{it}(\beta) = h(x_{it}'\beta + \hat\eta_i(\beta))$, and

$$\ell_i(\beta, \eta_i) = \sum_{t=1}^{T} \left\{ y_{it}\log F_{it} + (1 - y_{it})\log(1 - F_{it}) \right\}$$

$$d_{\eta\eta i}(\beta, \eta_i) = -\sum_{t=1}^{T} \left[ h_{it} - \rho_{it}\,(y_{it} - F_{it}) \right] \tag{66}$$

where $\rho_{it} = \rho(x_{it}'\beta + \eta_i)$ and

$$\rho(r) = \frac{f'(r) - h(r)\,[1 - 2F(r)]}{F(r)\,[1 - F(r)]}. \tag{67}$$

For logit, the MLE $\hat\lambda_i(\beta)$ for given $\beta$ solves

$$\hat\lambda_i(\beta) = \sum_{t=1}^{T} \Lambda\big(x_{it}'\beta + \hat\eta_i(\beta)\big) = \sum_{t=1}^{T} y_{it} \tag{68}$$

so that it does not vary with $\beta$. Therefore, the likelihood conditioned on $\hat\lambda_i(\beta)$ coincides with the conditional logit likelihood given a sufficient statistic for the fixed effect discussed in Section 4.1. For the logistic distribution $\rho(r) = 0$. The modified profile likelihood (mpl) for logit is therefore

$$\ell_{Mi}(\beta) = \ell_i\big(\beta, \hat\eta_i(\beta)\big) + \frac{1}{2}\log\left(\sum_{t=1}^{T} f_\Lambda\big(x_{it}'\beta + \hat\eta_i(\beta)\big)\right) \tag{69}$$

where $f_\Lambda(r) = \Lambda(r)[1 - \Lambda(r)]$ is the logistic density and $\ell_{Mi}(\beta)$ is defined for observations such that $\sum_{t=1}^{T} y_{it}$ is not zero or $T$.⁸

Footnote 8: If $\hat\eta_i(\beta) \to \pm\infty$, then $\log \sum_{t=1}^{T} f_\Lambda(x_{it}'\beta + \hat\eta_i(\beta))$ tends to $-\infty$ for any $\beta$. So observations for individuals that never change state are uninformative about $\beta$.

Numerical Comparisons for Logit and Probit

Comparisons for the Two-Period Logit Model

The mpl for logit (69) differs from Andersen's conditional likelihood, and the estimator $\hat\beta_{MML}$ that maximizes the mpl is inconsistent for fixed $T$. Pursuing the example in Section 3, we compare the large-$N$ biases of ML and MML for $T = 2$ and $\Delta x_{i2} = 1$. Thus we are assessing the value of the large-$T$ adjustment in (69) when $T = 2$.

When $T = 2$, for individuals who change state $\hat\eta_i(\beta) = -\beta/2$ so that the second term in (69) becomes

$$\frac{1}{2}\log\left[ f_\Lambda(-\beta/2) + f_\Lambda(\beta/2) \right]. \tag{70}$$

Collecting terms and ignoring constants, the modified profile log-likelihood takes the form

$$\begin{aligned}
\frac{1}{N}\sum_{i=1}^{N} \ell_{Mi}(\beta) &= \frac{1}{N}\sum_{i=1}^{N} \Big\{ 2 d_{10i}\log[1 - \Lambda(\beta/2)] + 2 d_{01i}\log\Lambda(\beta/2) \\
&\qquad\quad + \frac{1}{2}(d_{10i} + d_{01i})\big(\log\Lambda(\beta/2) + \log[1 - \Lambda(\beta/2)]\big) \Big\} \\
&\propto \frac{1}{N}\sum_{i=1}^{N} \left\{ (5 d_{10i} + d_{01i})\log[1 - \Lambda(\beta/2)] + (5 d_{01i} + d_{10i})\log\Lambda(\beta/2) \right\} \\
&\propto (5 - 4\hat p)\log[1 - \Lambda(\beta/2)] + (4\hat p + 1)\log\Lambda(\beta/2)
\end{aligned} \tag{71}$$

where $d_{10i} = 1(y_{i1} = 1, y_{i2} = 0)$, $d_{01i} = 1(y_{i1} = 0, y_{i2} = 1)$ and $\hat p$ is as defined in (19). This is maximized at

$$\hat\beta_{MML} = 2\Lambda^{-1}\left(\frac{4\hat p + 1}{6}\right) = 2\log\left(\frac{4\hat p + 1}{5 - 4\hat p}\right). \tag{72}$$

Therefore,

$$\operatorname*{plim}_{N\to\infty} \hat\beta_{MML} = 2\log\left(\frac{4 p_0 + 1}{5 - 4 p_0}\right) = 2\log\left(\frac{4\Lambda(\beta_0) + 1}{5 - 4\Lambda(\beta_0)}\right). \tag{73}$$

Figure 1 shows the probability limits of MML for positive values of $\beta_0$, together with those of ML (the $2\beta_0$ line) and conditional ML (the 45° line) for comparisons.⁹ In this example the adjustment produces a surprisingly good improvement given that we are relying on a large $T$ argument with $T = 2$. For example, for $p_0 = 0.65$, we have $\beta_0 = 0.62$, $\beta_{ML} = 1.24$ and $\beta_{MML} = 0.81$. Since the MML biases are of order $O(1/T^2)$, the result suggests that, although the biases are not negligible for $T = 2$, they may be so for values of $T$ as small as 5 or 6.

Footnote 9: See McCullagh and Tibshirani (1990) for a similar exercise using different adjusted likelihood functions.

[Figure 1: Probability limits for a logit model with T = 2. Series: MLE, Modified MLE, Conditional MLE. Axes: Probability Limit against True Value.]

Comparisons for the Two-Period Probit Model

If $f(r)$ is the standard normal pdf we have $h(-r) = h(r)$ and $\rho(-r) = -\rho(r)$. Thus, in the two-period case, $\hat h_{i1} = h[\hat\eta_i(\beta)] = h(-\beta/2) = h(\beta/2)$ and $\hat h_{i2} = h[\beta + \hat\eta_i(\beta)] = h(\beta/2)$. Also $\hat F_{i1} = F(-\beta/2)$ and $\hat F_{i2} = F(\beta/2)$. Finally, $\hat\rho_{i1} = \rho(-\beta/2) = -\rho(\beta/2)$ and $\hat\rho_{i2} = \rho(\beta/2)$. Therefore, for observations with $y_{i1} + y_{i2} = 1$ we have:

$$\ell_{Mi}(\beta) = \ell_i\big(\beta, \hat\eta_i(\beta)\big) - \frac{1}{2}\log\left[ \hat h_{i1} - \hat\rho_{i1}\big(y_{i1} - \hat F_{i1}\big) + \hat h_{i2} - \hat\rho_{i2}\big(y_{i2} - \hat F_{i2}\big) \right] + \log\left( \hat h_{i1}(\beta) + \hat h_{i2}(\beta) \right) \tag{74}$$

and

$$\begin{aligned}
\ell_{Mi}(\beta) \simeq{} & 2\left[ d_{10i}\log F(-\beta/2) + d_{01i}\log F(\beta/2) \right] \\
& - \frac{1}{2}\, d_{10i}\log\left[ 2h(\beta/2) + 2\rho(\beta/2)F(\beta/2) \right] \\
& - \frac{1}{2}\, d_{01i}\log\left[ 2h(\beta/2) + 2\rho(-\beta/2)F(-\beta/2) \right] + \log h(\beta/2).
\end{aligned} \tag{75}$$

Next, collecting terms, ignoring constants, averaging over observations with $y_{i1} + y_{i2} = 1$, and using the notation

$$q(r) = 1 + \frac{\rho(r)F(r)}{h(r)} = \frac{\Phi(r)}{1 - \Phi(r)} - \frac{r\,\Phi(r)}{\phi(r)},$$

we have

$$\begin{aligned}
\frac{1}{N}\sum_{i=1}^{N} \ell_{Mi}(\beta) ={} & (1 - \hat p)\left[ 2\log F(-\beta/2) + \frac{1}{2}\log h(\beta/2) - \frac{1}{2}\log q(\beta/2) \right] \\
& + \hat p\left[ 2\log F(\beta/2) + \frac{1}{2}\log h(\beta/2) - \frac{1}{2}\log q(-\beta/2) \right].
\end{aligned} \tag{76}$$

Thus, the probability limit of $\hat\beta_{MML}$ for probit maximizes the limiting modified log likelihood as follows

$$\begin{aligned}
\operatorname*{plim}_{N\to\infty} \hat\beta_{MML} = \arg\max_{\beta} \Big\{ & p_0\left[ 2\log F(\beta/2) + \frac{1}{2}\log h(\beta/2) - \frac{1}{2}\log q(-\beta/2) \right] \\
& + (1 - p_0)\left[ 2\log F(-\beta/2) + \frac{1}{2}\log h(\beta/2) - \frac{1}{2}\log q(\beta/2) \right] \Big\}.
\end{aligned} \tag{77}$$

Figure 2 shows the probability limits of probit ML and MML for normally distributed individual effects with variances 0.1, 1, and 10, as well as for Cauchy distributed effects. The range of values of $\beta$ has been chosen for comparability with Figure 1, in the sense that both figures cover similar intervals of $p_0$ values. The impact of changing the distribution of the effects is noticeably small for both ML and MML. The adjustment for probit also produces a good improvement given that $T$ is only two, although less so than in the logit case. For example, for $\beta_0 = 0.60$, the relevant ranges of values are $[1.21, 1.30]$ for $\beta_{ML}$, $[0.90, 0.96]$ for $\beta_{MML}$, and $[0.73, 0.74]$ for $p_0$.

[Figure 2: Probability limits for a probit model with T = 2. Series: MLE and MMLE for N(0, 0.1), N(0, 1), N(0, 10) and Cauchy distributed effects. Axes: Probability Limit against True Value.]

7 N and T Asymptotics

The panel data literature has probably overemphasized the quest for fixed-$T$ large-$N$ consistent estimation of non-linear models with fixed effects. We have already seen the difficulties that arise in trying to obtain a root-$N$ consistent estimator for a simple static fixed effects probit model. Not surprisingly, the difficulties become even more serious for dynamic binary choice models. In a sense, insisting on fixed $T$ consistency has similarities with (and may be as restrictive as) requiring exactly unbiased estimation in non-linear models. Panels with $T = 2$ are more common in theoretical discussions than in econometric practice. For a micro panel with 7 or 8 time series observations, whether estimation biases are of order $O(1/T)$ or $O(1/T^2)$ may make all the difference.

So it seems useful to consider a wider class of estimation methods than those providing fixed-$T$ consistency, and assess their merits with regard to alternative $N$ and $T$ asymptotic plans. There are multiple possible asymptotic formulations, and it is a matter of judgement to decide which one provides the best approximation for the sample sizes involved in a given application. Here we consider the asymptotic properties of the estimators that maximize the concentrated likelihood (ML) and the modified concentrated likelihood (MML) when $T/N$ tends to a constant (related results for autoregressive models are in Alvarez and Arellano, 2003, and Hahn and Kuersteiner, 2002).¹⁰

Footnote 10: Since this lecture was first written I have become aware of recent work on double asymptotic formulations for nonlinear fixed effect models by Woutersen (2001) and Li, Lindsay, and Waterman (2002). Moreover, a modified ML estimator for dynamic binary choice models has been developed in Carro (2003) and its properties investigated in simulations and empirical calculations.

Consistency

The ML estimator of $\beta$ can be shown to be consistent as $T \to \infty$ regardless of $N$ using the arguments and the consistency theorem in Amemiya (1985). The consistency of MML follows from noting that the concentrated likelihood and the mpl converge to the same objective function uniformly in probability as $T \to \infty$. Letting $\hat\rho_{it}(\beta) = \rho(x_{it}'\beta + \hat\eta_i(\beta))$ and $\hat F_{it}(\beta) = F(x_{it}'\beta + \hat\eta_i(\beta))$, from (65) we have

$$\begin{aligned}
\operatorname*{plim}_{T\to\infty} \frac{1}{T}\,\ell_{Mi}(\beta) ={} & \operatorname*{plim}_{T\to\infty} \frac{1}{T}\,\ell_i\big(\beta, \hat\eta_i(\beta)\big) + \operatorname*{plim}_{T\to\infty} \frac{1}{T}\log\left( \frac{1}{T}\sum_{t=1}^{T} \hat h_{it}(\beta) \right) \\
& - \operatorname*{plim}_{T\to\infty} \frac{1}{T}\log\left( \frac{1}{T}\sum_{t=1}^{T} \left[ \hat h_{it}(\beta) - \hat\rho_{it}(\beta)\big(y_{it} - \hat F_{it}(\beta)\big) \right] \right)^{1/2},
\end{aligned} \tag{78}$$

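where the convergence is uniform in $\beta$ in a neighborhood of $\beta_0$, and the last two terms vanish.

Asymptotic Normality

When $T/N \to c$, $0 < c < \infty$, both ML and MML are asymptotically normal but, unlike MML, the ML estimator has a bias in the asymptotic distribution. An informal calculation of the terms arising in the asymptotic distributions is given in the Appendix. The results are as follows:

$$\left(H' V^{-1} H\right)^{1/2}\left( \hat\beta_{ML} - \beta_0 + \frac{1}{T}\, H^{-1} b_N \right) \xrightarrow{d} N(0, I) \tag{79}$$

$$\left(H_M' V^{-1} H_M\right)^{1/2}\left( \hat\beta_{MML} - \beta_0 \right) \xrightarrow{d} N(0, I) \tag{80}$$

where $\kappa_{\beta\lambda\lambda i} = E\left[T^{-1} d_{\beta\lambda\lambda i}(\beta_0, \lambda_{i0}) \mid x_i, \lambda_i\right]$, $\kappa_{\lambda\lambda i} = E\left[T^{-1} d_{\lambda\lambda i}(\beta_0, \lambda_{i0}) \mid x_i, \lambda_i\right]$, and

$$b_N = \frac{1}{N}\sum_{i=1}^{N} \left( \frac{\kappa_{\beta\lambda\lambda i}}{2\kappa_{\lambda\lambda i}} \right), \tag{81}$$

$$V = \sum_{i=1}^{N} d_{\beta i}(\beta_0, \lambda_{i0})\, d_{\beta i}(\beta_0, \lambda_{i0})', \tag{82}$$

$$H = \sum_{i=1}^{N} \frac{\partial}{\partial\beta'}\, d_{\beta i}\big(\beta_0, \hat\lambda_i(\beta_0)\big), \tag{83}$$

$$H_M = \sum_{i=1}^{N} \frac{\partial}{\partial\beta'}\, d_{Mi}(\beta_0). \tag{84}$$

Thus, the asymptotic distribution of the ML estimator will contain a bias term unless $\kappa_{\beta\lambda\lambda i} = 0$.

8 Concluding Remarks

In this paper we have considered ML and modified ML estimators, but the estimation problem can be put more generally in terms of moment conditions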

in a GMM framework. Fixed-$T$ consistent estimators rely on exactly unbiased moment conditions. When $T/N$ tends to a constant, a GMM estimator from moment conditions with an $O(1/T)$ bias will typically exhibit a bias in the asymptotic distribution, but not if the estimator is based on moment conditions with an $O(1/T^2)$ bias. Thus, in the context of binary choice and other non-linear microeconometric models, a search for optimal orthogonality conditions that are unbiased to order $O(1/T^2)$ or greater seems a useful research agenda.

But do these biases really matter? Heckman (1981) reported a Monte Carlo experiment for ML estimation of a probit model with strictly exogenous variables and fixed effects, $T = 8$ and $N = 100$. Using a random effects estimator as a benchmark, he concluded that the MLE of the common parameters (jointly estimated with the effects) performed well. According to this, it would seem that even for fairly small panels there is not much to be gained from the use of fixed-$T$ unbiased or approximately unbiased orthogonality conditions.

For models with only strictly exogenous explanatory variables this may well be the case. But these are models that are found to be too restrictive in many applications. When modelling panel data, state dependence, predetermined regressors, and serial correlation often matter. Heckman (1981) found that when a lagged dependent variable was included the ML probit estimator performed badly. This is not surprising since similar problems occur with linear autoregressive models. The difference is that while standard tools are available in the literature that ensure fixed $T$ consistency for linear dynamic models, very little is known for dynamic binary choice.¹¹ This is therefore a promising area of application of asymptotic arguments to both the construction of estimating equations and useful approximations to sampling distributions.

Footnote 11: See Keane (1994), Hyslop (1999), Honoré and Kyriazidou (2000), Magnac (2000), Honoré and Lewbel (2002), Arellano and Carrasco (2003), and Arellano and Honoré (2001) for a survey and more references.
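The two-period comparisons in the section on numerical comparisons for logit and probit can be checked numerically. The following is a minimal sketch, not the author's code: the function names (`logit_plims`, `probit_plims`, `probit_mml_objective`) are hypothetical, the logit part simply evaluates (73), and the probit part maximizes the objective in (77) by a plain grid search, which is an illustrative choice rather than anything prescribed in the text. The probit ML limit is taken as $2\Phi^{-1}(p_0)$, which is an assumption obtained here by maximizing only the unadjusted terms of (76). The grid covers positive values of $\beta$ only, so it assumes $p_0 > 1/2$.

```python
import math
from statistics import NormalDist  # standard library, Python 3.8+

_STD_NORMAL = NormalDist()


def logit_plims(p0):
    """Two-period logit: return (beta0, plim ML, plim MML) given p0 = Lambda(beta0)."""
    beta0 = math.log(p0 / (1.0 - p0))
    beta_ml = 2.0 * beta0  # the 2*beta0 line mentioned in the text
    beta_mml = 2.0 * math.log((4.0 * p0 + 1.0) / (5.0 - 4.0 * p0))  # eq. (73)
    return beta0, beta_ml, beta_mml


def h(r):
    """h(r) = f(r)^2 / (F(r) [1 - F(r)]) for the standard normal."""
    f = _STD_NORMAL.pdf(r)
    F = _STD_NORMAL.cdf(r)
    return f * f / (F * (1.0 - F))


def q(r):
    """q(r) = 1 + rho(r) F(r) / h(r), written in its closed form for probit."""
    F = _STD_NORMAL.cdf(r)
    return F / (1.0 - F) - r * F / _STD_NORMAL.pdf(r)


def probit_mml_objective(beta, p0):
    """Limiting modified log likelihood inside the arg max of eq. (77)."""
    b = beta / 2.0
    half_log_h = 0.5 * math.log(h(b))
    up = 2.0 * math.log(_STD_NORMAL.cdf(b)) + half_log_h - 0.5 * math.log(q(-b))
    down = 2.0 * math.log(_STD_NORMAL.cdf(-b)) + half_log_h - 0.5 * math.log(q(b))
    return p0 * up + (1.0 - p0) * down


def probit_plims(p0, step=0.001, beta_max=3.0):
    """Return (plim ML, plim MML) for the two-period probit example.

    The ML limit 2 * Phi^{-1}(p0) is an assumption of this sketch; the MML
    limit is found by grid search over (0, beta_max], assuming p0 > 1/2.
    """
    beta_ml = 2.0 * _STD_NORMAL.inv_cdf(p0)
    grid = [k * step for k in range(1, int(round(beta_max / step)) + 1)]
    beta_mml = max(grid, key=lambda beta: probit_mml_objective(beta, p0))
    return beta_ml, beta_mml


if __name__ == "__main__":
    print("logit  (beta0, ML, MML):", logit_plims(0.65))
    print("probit (ML, MML):", probit_plims(0.735))
```

With $p_0 = 0.65$ the logit function returns values that round to 0.62, 1.24 and 0.81, the figures quoted in the text. For probit, a value such as $p_0 = 0.735$ (inside the interval quoted for $\beta_0 = 0.60$) gives limits that fall within the quoted ranges for $\beta_{ML}$ and $\beta_{MML}$.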

References

[1] Alvarez, J. and M. Arellano (2003): "The Time Series and Cross-Section Asymptotics of Dynamic Panel Data Estimators", Econometrica, 71, July.
[2] Amemiya, T. (1985): Advanced Econometrics, Basil Blackwell.
[3] Andersen, E. B. (1970): "Asymptotic Properties of Conditional Maximum Likelihood Estimators", Journal of the Royal Statistical Society, Series B, 32.
[4] Andersen, E. B. (1973): Conditional Inference and Models for Measuring, Mentalhygiejnisk Forsknings Institut, Copenhagen.
[5] Arellano, M. (2003): Panel Data Econometrics, Oxford University Press.
[6] Arellano, M. and R. Carrasco (2003): "Binary Choice Panel Data Models with Predetermined Variables", Journal of Econometrics, 115.
[7] Arellano, M. and B. Honoré (2001): "Panel Data Models: Some Recent Developments", in J. Heckman and E. Leamer (eds.), Handbook of Econometrics, vol. 5, North Holland, Amsterdam.
[8] Bover, O. and M. Arellano (1997): "Estimating Dynamic Limited Dependent Variable Models from Panel Data", Investigaciones Económicas, 21.
[9] Carro, J. M. (2003): "Estimating Dynamic Panel Data Discrete Choice Models with Fixed Effects", CEMFI Working Paper.
[10] Chamberlain, G. (1980): "Analysis of Covariance with Qualitative Data", Review of Economic Studies, 47.
[11] Chamberlain, G. (1984): "Panel Data", in Z. Griliches and M. D. Intriligator (eds.), Handbook of Econometrics, vol. 2, Elsevier Science, Amsterdam.
[12] Chamberlain, G. (1986): "Asymptotic Efficiency in Semiparametric Models with Censoring", Journal of Econometrics, 32.
[13] Chamberlain, G. (1992): "Binary Response Models for Panel Data: Identification and Information", unpublished manuscript, Department of Economics, Harvard University.
[14] Charlier, E., B. Melenberg, and A. van Soest (1995): "A Smoothed Maximum Score Estimator for the Binary Choice Panel Data Model and an Application to Labour Force Participation", Statistica Neerlandica, 49.
[15] Chen, S. (1998): "Root-N Consistent Estimation of a Panel Data Sample Selection Model", unpublished manuscript, The Hong Kong University of Science and Technology.
[16] Cox, D. R. and N. Reid (1987): "Parameter Orthogonality and Approximate Conditional Inference" (with discussion), Journal of the Royal Statistical Society, Series B, 49, 1-39.
[17] Cox, D. R. and N. Reid (1992): "A Note on the Difference between Profile and Modified Profile Likelihood", Biometrika, 79.
[18] Ferguson, H., N. Reid, and D. R. Cox (1991): "Estimating Equations from Modified Profile Likelihood", in Godambe, V. P. (ed.), Estimating Functions, Oxford University Press.
[19] Hahn, J. and G. Kuersteiner (2002): "Asymptotically Unbiased Inference for a Dynamic Panel Model with Fixed Effects When Both n and T are Large", Econometrica, 70.
[20] Heckman, J. J. (1981): "The Incidental Parameters Problem and the Problem of Initial Conditions in Estimating a Discrete Time Discrete Data Stochastic Process", in C. F. Manski and D. McFadden (eds.), Structural Analysis of Discrete Data with Econometric Applications, MIT Press.
[21] Honoré, B. and E. Kyriazidou (2000): "Panel Data Discrete Choice Models with Lagged Dependent Variables", Econometrica, 68.
[22] Honoré, B. and A. Lewbel (2002): "Semiparametric Binary Choice Panel Data Models without Strictly Exogeneous Regressors", Econometrica, 70.
[23] Horowitz, J. L. (1992): "A Smoothed Maximum Score Estimator for the Binary Response Model", Econometrica, 60.
[24] Hsiao, C. (2003): Analysis of Panel Data, Second Edition, Cambridge University Press.
[25] Hyslop, D. R. (1999): "State Dependence, Serial Correlation and Heterogeneity in Intertemporal Labor Force Participation of Married Women", Econometrica, 67.
[26] Keane, M. (1994): "A Computationally Practical Simulation Estimator for Panel Data", Econometrica, 62.
[27] Kyriazidou, E. (1997): "Estimation of a Panel Data Sample Selection Model", Econometrica, 65.
[28] Lancaster, T. (1998): "Panel Binary Choice with Fixed Effects", unpublished manuscript, Department of Economics, Brown University.
[29] Lancaster, T. (2000): "The Incidental Parameter Problem Since 1948", Journal of Econometrics, 95.
[30] Lancaster, T. (2002): "Orthogonal Parameters and Panel Data", Review of Economic Studies, 69.
[31] Lee, M. J. (1999): "A Root N Consistent Semiparametric Estimator for Related-Effect Binary Response Panel Data", Econometrica, 67.
[32] Li, H., B. G. Lindsay, and R. P. Waterman (2002): "Efficiency of Projected Score Methods in Rectangular Array Asymptotics", unpublished manuscript, Department of Statistics, Pennsylvania State University.
[33] Liang, K. Y. (1987): "Estimating Functions and Approximate Conditional Likelihood", Biometrika, 74.
[34] Magnac, T. (2000): "State Dependence and Unobserved Heterogeneity in Youth Employment Histories", Economic Journal, 110.
[35] McCullagh, P. and R. Tibshirani (1990): "A Simple Method for the Adjustment of Profile Likelihoods", Journal of the Royal Statistical Society, Series B, 52.
[36] Manski, C. (1987): "Semiparametric Analysis of Random Effects Linear Models from Binary Panel Data", Econometrica, 55.
[37] Newey, W. (1994): "The Asymptotic Variance of Semiparametric Estimators", Econometrica, 62.
[38] Neyman, J. and E. L. Scott (1948): "Consistent Estimates Based on Partially Consistent Observations", Econometrica, 16, 1-32.
[39] Rasch, G. (1960): Probabilistic Models for Some Intelligence and Attainment Tests, Denmarks Pædagogiske Institut, Copenhagen.
[40] Rasch, G. (1961): "On the General Laws and the Meaning of Measurement in Psychology", Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 4, University of California Press, Berkeley and Los Angeles.
[41] Woutersen, T. (2001): "Robustness Against Incidental Parameters and Mixing Distributions", unpublished manuscript, Department of Economics, University of Western Ontario.

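Appendix

Expansion for the Score of the Concentrated Likelihood

Let us consider a second order expansion of the score of the concentrated likelihood around the true value of the orthogonal effect. The log likelihood is $\ell_i^*(\beta, \lambda_i)$; its vector of partial derivatives with respect to $\beta$ is $d_{\beta i}(\beta, \lambda_i) = \partial\ell_i^*(\beta, \lambda_i)/\partial\beta$; the concentrated likelihood is $\ell_i^*(\beta, \hat\lambda_i(\beta))$ and its score is given by $d_{\beta i}(\beta, \hat\lambda_i(\beta))$. An approximation at $\beta_0$ around the true value $\lambda_{i0}$ is

$$\begin{aligned}
d_{\beta i}\big(\beta_0, \hat\lambda_i(\beta_0)\big) ={} & d_{\beta i}(\beta_0, \lambda_{i0}) + d_{\beta\lambda i}(\beta_0, \lambda_{i0})\big(\hat\lambda_i(\beta_0) - \lambda_{i0}\big) \\
& + \frac{1}{2}\, d_{\beta\lambda\lambda i}(\beta_0, \lambda_{i0})\big(\hat\lambda_i(\beta_0) - \lambda_{i0}\big)^2 + O_p\big(T^{-1/2}\big)
\end{aligned} \tag{A1}$$

where $d_{\beta\lambda i}(\beta, \lambda_i) = \partial^2\ell_i^*(\beta, \lambda_i)/\partial\beta\,\partial\lambda_i$ and $d_{\beta\lambda\lambda i}(\beta, \lambda_i) = \partial^3\ell_i^*(\beta, \lambda_i)/\partial\beta\,\partial\lambda_i^2$. In general, the first three terms are $O_p(T^{1/2})$, $O_p(T^{1/2})$, and $O_p(1)$, but because of orthogonality $d_{\beta\lambda i}(\beta_0, \lambda_{i0})$ is $O_p(\sqrt{T})$ as opposed to $O_p(T)$.¹²

Footnote 12: Since $\sqrt{T}\left[\frac{1}{T}\, d_{\beta\lambda i}(\beta_0, \lambda_{i0}) - 0\right] = O_p(1)$, we have $d_{\beta\lambda i}(\beta_0, \lambda_{i0}) = O_p(\sqrt{T})$.

Expansion for $\hat\lambda_i(\beta_0) - \lambda_{i0}$

Letting $d_{\lambda i}(\beta, \lambda_i) = \partial\ell_i^*(\beta, \lambda_i)/\partial\lambda_i$, the estimator $\hat\lambda_i(\beta_0)$ solves $d_{\lambda i}(\beta_0, \hat\lambda_i(\beta_0)) = 0$. Let us also introduce notation for the terms:

$$\kappa_{\lambda\lambda i} \equiv \kappa_{\lambda\lambda}(\beta_0, \lambda_{i0}) = E\left[\frac{1}{T}\, d_{\lambda\lambda i}(\beta_0, \lambda_{i0}) \,\Big|\, x_i, \lambda_i\right]$$

$$\kappa_{\beta\lambda\lambda i} \equiv \kappa_{\beta\lambda\lambda}(\beta_0, \lambda_{i0}) = E\left[\frac{1}{T}\, d_{\beta\lambda\lambda i}(\beta_0, \lambda_{i0}) \,\Big|\, x_i, \lambda_i\right]$$

Note that $\kappa_{\lambda\lambda i}$ and $\kappa_{\beta\lambda\lambda i}$ are individual specific because they depend on $\lambda_{i0}$, but they do not depend on the $y$'s.¹³ Moreover, from the information matrix identity

$$E\left[\frac{1}{T}\, d_{\lambda i}(\beta_0, \lambda_{i0})\, d_{\lambda i}(\beta_0, \lambda_{i0}) \,\Big|\, x_i, \lambda_i\right] = -\kappa_{\lambda\lambda i}.$$

Footnote 13: Also $\frac{1}{T}\, d_{\lambda\lambda i}(\beta_0, \lambda_{i0}) = \kappa_{\lambda\lambda}(\beta_0, \lambda_{i0}) + O_p\big(1/\sqrt{T}\big)$, which holds as $\sqrt{T}\left[\frac{1}{T}\, d_{\lambda\lambda i}(\beta_0, \lambda_{i0}) - \kappa_{\lambda\lambda}(\beta_0, \lambda_{i0})\right] = O_p(1)$.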
