Linear Regression for Panel with Unknown Number of Factors as Interactive Fixed Effects


Hyungsik Roger Moon    Martin Weidner

February 27, 2012

Abstract

In this paper we study the Gaussian quasi maximum likelihood estimator (QMLE) in a linear panel regression model with interactive fixed effects for asymptotics where both the number of time periods and the number of cross-sectional units go to infinity. Under appropriate assumptions we show that the limiting distribution of the QMLE for the regression coefficients is independent of the number of interactive fixed effects used in the estimation, as long as this number does not fall below the true number of interactive fixed effects present in the data. The important practical implication of this result is that for inference on the regression coefficients one does not need to estimate the number of interactive effects consistently, but can simply rely on any known upper bound of this number to calculate the QMLE.

Keywords: Panel data, interactive fixed effects, factor models, likelihood expansion, quasi-MLE, perturbation theory of linear operators, random matrix theory. JEL-Classification: C23, C33.

(We thank the participants of the Cowles Summer Conference "Handling Dependence: Temporal, Cross-sectional, and Spatial" at Yale University for interesting comments. Moon acknowledges the NSF for financial support via SES. Moon: Department of Economics, University of Southern California, KAP 300, Los Angeles, CA; moonr@usc.edu. Weidner: Department of Economics, University College London, Gower Street, London WC1E 6BT, U.K., and CeMMAP; m.weidner@ucl.ac.uk.)

1 Introduction

Panel data models typically incorporate individual and time effects to control for heterogeneity in cross-section and across time-periods. While often these individual and time effects enter the model additively, they can also be interacted multiplicatively, thus giving rise to so-called interactive effects, which we also refer to as a factor structure. The multiplicative form captures the heterogeneity in the data more flexibly, since it allows for common time-varying shocks (factors) to affect the cross-sectional units with individual specific sensitivities (factor loadings). It is this flexibility that motivated the discussion of interactive effects in the econometrics literature, e.g. Holtz-Eakin, Newey and Rosen (1988),

Ahn, Lee, Schmidt (2001; 2007), Pesaran (2006), Bai (2009b; 2009a), Zaffaroni (2009), Moon and Weidner (2010). Analogous to the analysis of individual specific effects, one can either choose to model the interactive effects as random (random effects/correlated effects) or as fixed (fixed effects), with each option having its specific merits and drawbacks that have to be weighed in each empirical application separately. In this paper, we consider the interactive fixed effects specification, i.e. we treat the interactive effects as nuisance parameters, which are estimated jointly with the parameters of interest.¹ The advantages of the fixed effects approach are, for instance, that it is semi-parametric, since no assumption on the distribution of the interactive effects needs to be made, and that the regressors can be arbitrarily correlated with the interactive effect parameters.

Let $R^0$ be the true number of interactive effects (number of factors) in the data, and let $R$ be the number of interactive effects used by the econometrician in the data analysis. A key restriction in the existing literature on interactive fixed effects is that $R^0$ is assumed to be known,² i.e. $R = R^0$. This is true both for the quasi-differencing analysis in Holtz-Eakin, Newey and Rosen (1988)³ and for the least squares analysis of Bai (2009b). Assuming $R^0$ to be known could be quite restrictive, since in many empirical applications there is no consensus about the exact number of factors in the data or in the relevant economic model, so that an estimator which is not robust towards some degree of misspecification of $R$ should not be used. The goal of the present paper is to overcome this problem.

For a linear panel regression model with interactive fixed effects we consider the Gaussian quasi maximum likelihood estimator (QMLE),⁴ which jointly minimizes the sum of squared residuals over the regression parameters and the interactive fixed effects parameters (see Kiefer (1980), Bai (2009b), and Moon and Weidner (2010)). We employ an asymptotic where both the cross-sectional and the time-serial dimension become large, while the number of interactive effects ($R^0$ and also $R$) is constant. The main finding of the paper is that under appropriate assumptions the QMLE of the regression parameters has the same limiting distribution for all $R \geq R^0$. Thus, the QMLE is robust towards inclusion of extra interactive effects in the model, and within the QMLE framework there is no asymptotic efficiency loss from choosing $R$ larger than $R^0$. This result is surprising because the conjecture in the literature is that the QMLE with $R > R^0$ might be consistent but could be less efficient than the QMLE with $R^0$ (e.g., see Bai (2009b)).⁵

The important empirical implication of our result is that as long as a valid upper bound on the number of factors is known, one can use this upper bound to construct the QMLE, and need not worry about consistent estimation of the number of factors.

1. Note that Ahn, Lee, Schmidt (2001; 2007) take a hybrid approach in that they treat the factors as nonrandom, but the factor loadings as random. The common correlated effects estimator of Pesaran (2006) was introduced in a context where both the factor loadings and the factors follow certain probability laws, but it also exhibits some properties of a fixed effects estimator. When we refer to interactive fixed effects we mean that both factors and factor loadings are treated as non-random parameters.

2. In the literature, consistent estimation procedures for $R^0$ are established only for pure factor models, not for the model with regressors.
3. Holtz-Eakin, Newey and Rosen (1988) assume just one interactive effect, but their approach could easily be generalized to multiple interactive effects, as long as their number is known.

4. The QMLE is sometimes called the concentrated least squares estimator in the literature.

5. For $R < R^0$ the QMLE could be inconsistent, since then there are interactive fixed effects in the residuals of the model which can be correlated with the regressors but are not controlled for in the estimation.

Since the limiting distribution of the QMLE with $R > R^0$ is identical to the one with $R = R^0$, the results of Bai (2009b) and Moon and Weidner (2010) regarding inference on the regression parameters become applicable.

In order to derive the asymptotic theory of the QMLE with $R \geq R^0$ we study the properties of the profile likelihood function, which is the quasi likelihood function after integrating out the interactive fixed effect parameters. Concretely, we derive an approximate quadratic expansion of this profile likelihood in the regression parameters. This expansion is difficult to perform, since integrating out the interactive fixed effects results in an eigenvalue problem in the formulation of the profile likelihood. For $R = R^0$ we show how to overcome this difficulty by performing a joint expansion of the profile likelihood in the regression parameters and in the idiosyncratic error terms. Using the perturbation theory of linear operators we prove that the profile quasi likelihood function is analytic in a neighborhood of the true parameter, and we obtain explicit formulas for the expansion coefficients, in particular analytic expressions for the approximated score and the approximated Hessian for $R = R^0$.⁶ To generalize the result to $R > R^0$ we then show that the difference between the profile likelihood for $R = R^0$ and for $R > R^0$ is just a constant term plus a term whose dependence on the regression parameters is sufficiently small to be irrelevant for the asymptotic distribution of the QMLE. Due to the eigenvalue problem in the likelihood function, the derivation of this last result requires some very specific knowledge about the eigenvectors and eigenvalues of the random covariance matrix of the idiosyncratic error matrix. We provide high-level assumptions under which the results hold, and we show that these high-level assumptions are satisfied when the idiosyncratic errors of the model are independent and identically normally distributed. As we explain in Section 4, the justification of our high-level assumptions for more general distributions of the idiosyncratic errors requires some further progress in the random matrix theory of real random covariance matrices, both regarding the properties of their eigenvalues and of their eigenvectors (see Bai (1999) for a review of this literature).

The paper is organized as follows. In Section 2 we introduce the interactive fixed effects model, its Gaussian quasi likelihood function, and the corresponding QMLE, and we also discuss consistency of the QMLE. The asymptotic profile likelihood expansion is derived in Section 3. Section 4 provides a justification for the high-level assumptions that we impose, and discusses the relation of these assumptions to the random matrix theory literature. Monte Carlo results which illustrate the validity of our conclusion in finite samples are presented in Section 5, and the conclusions of the paper are drawn in Section 6.

A few words on notation. The transpose of a matrix $A$ is denoted by $A'$. For a column vector $v$ its Euclidean norm is defined by $\|v\| = \sqrt{v'v}$. For the $n$-th largest eigenvalue (counting multiple eigenvalues multiple times) of a symmetric matrix $B$ we write $\mu_n(B)$. For an $m \times n$ matrix $A$ the Frobenius (or Hilbert-Schmidt) norm is $\|A\|_{HS} = \sqrt{\mathrm{Tr}(AA')}$, and the operator (or spectral) norm is $\|A\| = \max_{0 \neq v \in \mathbb{R}^n} \|Av\|/\|v\|$, or equivalently $\|A\| = \sqrt{\mu_1(A'A)}$. Furthermore, we use $P_A = A(A'A)^{-1}A'$ and $M_A = \mathbb{1} - A(A'A)^{-1}A'$, where $\mathbb{1}$ is the $m \times m$ identity matrix, and $(A'A)^{-1}$ denotes some generalized inverse if $A$ is not of full column rank.
For square matrices $B$, $C$, we use $B > C$ (or $B \geq C$) to indicate that $B - C$ is positive (semi-)definite. We use "wpa1" for "with probability approaching one", and $A =_d B$ to indicate that the random variables $A$ and $B$ have the same probability distribution.

6. The likelihood expansion for $R = R^0$ was first presented in Moon and Weidner (2009). We separate and extend the expansion result from the 2009 working paper and present it in this paper. The remaining application results of Moon and Weidner (2009) are now in Moon and Weidner (2010).
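As a small aside (ours, not part of the paper), this notation translates directly into NumPy; the pseudo-inverse covers the rank-deficient case just mentioned:

```python
import numpy as np

def spectral_norm(A):
    # ||A|| = sqrt(mu_1(A'A)): the largest singular value of A
    return np.linalg.norm(A, ord=2)

def frobenius_norm(A):
    # ||A||_HS = sqrt(Tr(A A'))
    return np.sqrt(np.trace(A @ A.T))

def P(A):
    # P_A = A (A'A)^- A': orthogonal projector onto the column space of A
    return A @ np.linalg.pinv(A.T @ A) @ A.T

def M(A):
    # M_A = 1 - P_A: projector onto the orthogonal complement
    return np.eye(A.shape[0]) - P(A)
```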

2 Model, QMLE and Consistency

A linear panel regression model with cross-sectional dimension $N$, time-serial dimension $T$, and interactive fixed effects of dimension $R^0$ is given by

$$Y = \sum_{k=1}^{K} \beta_k^0 X_k + \varepsilon, \qquad \varepsilon = \lambda^0 f^{0\prime} + e, \qquad (2.1)$$

where $Y$, $X_k$, $\varepsilon$ and $e$ are $N \times T$ matrices, $\lambda^0$ is an $N \times R^0$ matrix, $f^0$ is a $T \times R^0$ matrix, and the regression parameters $\beta_k^0$ are scalars (the superscript zero indicates the true value of the parameters). We write $\beta$ for the $K$-vector of regression parameters, and introduce the notation $\beta \cdot X = \sum_{k=1}^{K} \beta_k X_k$. All matrices, vectors and scalars in this paper are real valued.

A choice for the number of interactive effects $R$ used in the estimation needs to be made, and we may have $R \neq R^0$, since the true number of factors $R^0$ may not be known accurately. Given the choice $R$, the quasi maximum likelihood estimator (QMLE) for the parameters $\beta$, $\lambda$ and $f$ is given by⁷

$$\left( \hat\beta_R, \hat\Lambda_R, \hat F_R \right) = \operatorname*{argmin}_{\{\beta \in \mathbb{R}^K,\ \Lambda \in \mathbb{R}^{N \times R},\ F \in \mathbb{R}^{T \times R}\}} \left\| Y - \beta \cdot X - \Lambda F' \right\|_{HS}^2. \qquad (2.2)$$

The square of the Hilbert-Schmidt norm is simply the sum of the squared elements of the argument matrix, i.e. the QMLE is defined by minimizing the sum of squared residuals, which is equivalent to minimizing the likelihood function for iid normal idiosyncratic errors. The estimator is a quasi MLE since the idiosyncratic errors need not be iid normal and since $R$ might not equal $R^0$. The QMLE for $\beta$ can equivalently be defined by minimizing the profile quasi likelihood function, namely

$$\hat\beta_R = \operatorname*{argmin}_{\beta \in \mathbb{R}^K} \mathcal{L}_R(\beta), \qquad (2.3)$$

where

$$\mathcal{L}_R(\beta) = \min_{\{\Lambda \in \mathbb{R}^{N \times R},\ F \in \mathbb{R}^{T \times R}\}} \frac{1}{NT} \left\| Y - \beta \cdot X - \Lambda F' \right\|_{HS}^2 = \min_{F \in \mathbb{R}^{T \times R}} \frac{1}{NT} \mathrm{Tr}\left[ (Y - \beta \cdot X)\, M_F\, (Y - \beta \cdot X)' \right] = \frac{1}{NT} \sum_{t=R+1}^{T} \mu_t\left[ (Y - \beta \cdot X)' (Y - \beta \cdot X) \right]. \qquad (2.4)$$

Here, we first concentrated out $\Lambda$ by use of its own first order condition. The resulting optimization problem for $F$ is a principal components problem, so that the optimal $F$ is given by the $R$ largest principal components of the $T \times T$ matrix $(Y - \beta \cdot X)'(Y - \beta \cdot X)$.

7. The optimal $\hat\Lambda_R$ and $\hat F_R$ in (2.2) are not unique, since the objective function is invariant under right-multiplication of $\Lambda$ with a non-degenerate $R \times R$ matrix $S$, and simultaneous right-multiplication of $F$ with $(S^{-1})'$. However, the column spaces of $\hat\Lambda_R$ and $\hat F_R$ are uniquely determined.

At the optimum the projector $M_F$ therefore exactly projects out the $R$ largest eigenvalues of this matrix, which gives rise to the final formulation of the profile likelihood function as the sum over its $T - R$ smallest eigenvalues.⁸ This last formulation of $\mathcal{L}_R(\beta)$ is very convenient since it does not involve any explicit optimization over nuisance parameters. Numerical calculation of eigenvalues is very fast, so that the numerical evaluation of $\mathcal{L}_R(\beta)$ is unproblematic for moderately large values of $T$ (a numerical sketch follows the assumptions below). The function $\mathcal{L}_R(\beta)$ is not convex in $\beta$ and might have multiple local minima, which have to be accounted for in the numerical calculation of $\hat\beta_R$. We write $\mathcal{L}^0(\beta)$ for $\mathcal{L}_{R^0}(\beta)$, which is the profile likelihood obtained from the true number of factors.

In order to show consistency of $\hat\beta_R$ we impose the following assumptions.

Assumption 1. (i) $\|X_k\| = O_p(\sqrt{NT})$, $k = 1, \ldots, K$; (ii) $\|e\| = O_p(\sqrt{\max(N,T)})$.

One can justify Assumption 1(i) by use of the norm inequality $\|X_k\| \leq \|X_k\|_{HS}$ and the fact that $\|X_k\|_{HS}^2 = \sum_{i,t} X_{k,it}^2 = O_p(NT)$, where $i = 1, \ldots, N$ and $t = 1, \ldots, T$; the last step follows e.g. if $X_{k,it}$ has a uniformly bounded second moment. Assumption 1(ii) is a condition on the largest eigenvalue of the random covariance matrix $ee'$, which is often studied in the literature on random matrix theory, e.g. Geman (1980), Bai, Silverstein, Yin (1988), Yin, Bai, and Krishnaiah (1988), Silverstein (1989). The results in Latala (2005) show that $\|e\| = O_p(\sqrt{\max(N,T)})$ if $e$ has independent entries with mean zero and uniformly bounded fourth moment. Some weak dependence of the entries $e_{it}$ across $i$ and $t$ is also permissible (see, e.g., Moon and Weidner (2010)).

Assumption 2. (i) $\mathrm{Tr}(X_k e') = O_p(\sqrt{NT})$, $k = 1, \ldots, K$. (ii) Consider linear combinations $X_\alpha = \sum_{k=1}^{K} \alpha_k X_k$ of the regressors $X_k$ with $K$-vector $\alpha$ such that $\|\alpha\| = 1$. We assume that there exists a constant $b > 0$ such that

$$\min_{\{\alpha \in \mathbb{R}^K,\ \|\alpha\| = 1\}} \frac{1}{NT} \sum_{t = R + R^0 + 1}^{T} \mu_t\left( X_\alpha' X_\alpha \right) \geq b, \quad \text{wpa1}.$$

Assumption 2(i) requires weak exogeneity of the regressors $X_k$. Assumption 2(ii) is a generalization of the usual non-collinearity condition on the regressors. It requires $X_\alpha' X_\alpha$ to be non-degenerate even after elimination of the largest $R + R^0$ eigenvalues (the sum in the assumption only runs over the smallest $T - R - R^0$ eigenvalues of this matrix, while running over all eigenvalues would give the trace operator, and thus the usual non-collinearity condition). In particular, this assumption is violated if there exists a linear combination of the regressors with $\|\alpha\| = 1$ and $\mathrm{rank}(X_\alpha) \leq R + R^0$, i.e. the assumption rules out low-rank regressors like time-invariant regressors or cross-sectionally invariant regressors.

8. Since the model is symmetric under $N \leftrightarrow T$, $\Lambda \leftrightarrow F$, $Y \leftrightarrow Y'$, $X_k \leftrightarrow X_k'$, there also exists a dual formulation of $\mathcal{L}_R(\beta)$ that involves solving an eigenvalue problem for an $N \times N$ matrix.
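To illustrate the eigenvalue formulation (2.4), here is a minimal sketch (ours, not the authors' code), under the assumption that the regressors are stacked in a $K \times N \times T$ array; the multi-start loop is one simple way to guard against the local minima just mentioned:

```python
import numpy as np
from scipy.optimize import minimize

def L_R(beta, Y, X, R):
    """Profile likelihood (2.4): sum of the T - R smallest eigenvalues of
    (Y - beta.X)'(Y - beta.X), divided by NT."""
    N, T = Y.shape
    resid = Y - np.tensordot(beta, X, axes=(0, 0))  # beta.X = sum_k beta_k X_k
    eigs = np.linalg.eigvalsh(resid.T @ resid)      # eigenvalues, ascending
    return eigs[: T - R].sum() / (N * T)            # drop the R largest

def qmle(Y, X, R, starts):
    # minimize L_R over beta from several starting values (L_R is not convex)
    fits = [minimize(L_R, x0, args=(Y, X, R), method="Nelder-Mead") for x0 in starts]
    return min(fits, key=lambda f: f.fun).x
```

The eigenvalue route avoids optimizing over the $N \times R$ and $T \times R$ nuisance parameters entirely, which is exactly the computational convenience emphasized above.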

These low-rank regressors require a special treatment in the interactive fixed effects model (see Bai (2009b) and Moon and Weidner (2010)), and we ignore them in the present paper. If one is not interested explicitly in their regression coefficients, one can always eliminate the low-rank regressors by an appropriate projection of the data; e.g. subtraction of the time or cross-sectional means from the data eliminates all time-invariant or cross-sectionally invariant regressors.

Theorem 2.1. Let Assumptions 1 and 2 be satisfied and let $R \geq R^0$. For $N, T \to \infty$ we then have $\sqrt{\min(N,T)}\, (\hat\beta_R - \beta^0) = O_p(1)$.

Remarks. (i) The theorem guarantees consistency of $\hat\beta_R$, $R \geq R^0$, in an arbitrary limit $N, T \to \infty$. In the rest of this paper we consider asymptotics where $N$ and $T$ grow at the same rate, i.e. $N/T \to \kappa^2$, for some positive constant $\kappa$. For these restricted asymptotics the theorem already guarantees $\sqrt{N}$ (or equivalently $\sqrt{T}$) consistency of $\hat\beta_R$, which is a useful intermediate result.

(ii) The $\sqrt{\min(N,T)}$ convergence rate in Theorem 2.1 can be generalized further. If we generalize Assumption 1(ii) and Assumption 2(i) to Assumption 1(ii)*: $\|e\| = O_p(\sqrt{NT}/\xi_{NT})$, and Assumption 2(i)*: $\mathrm{Tr}(X_k e') = O_p(NT/\xi_{NT}^2)$, $k = 1, \ldots, K$, where $\xi_{NT} \to \infty$, then it is possible to establish that $\xi_{NT}\, (\hat\beta_R - \beta^0) = O_p(1)$.

The proof of Theorem 2.1 is presented in the appendix. The theorem imposes no restriction at all on $f^0$ and $\lambda^0$, apart from the condition $R \geq R^0$. To derive the results in the rest of the paper we do however make the following strong factor assumption.⁹

Assumption 3. (i) $0 < \mathrm{plim}_{N,T \to \infty} \frac{1}{N} \lambda^{0\prime} \lambda^0 < \infty$; (ii) $0 < \mathrm{plim}_{N,T \to \infty} \frac{1}{T} f^{0\prime} f^0 < \infty$.

The main result of this paper is that the inclusion of unnecessary factors in the estimation does not change the asymptotic distribution of the QMLE for $\beta$. Before deriving this result rigorously, we want to provide an intuitive explanation for it. As already mentioned above, the estimator $\hat F_R$ is given by the first $R$ principal components of the matrix $(Y - \hat\beta_R \cdot X)'(Y - \hat\beta_R \cdot X)$. We have

$$Y - \hat\beta_R \cdot X = \lambda^0 f^{0\prime} + e - (\hat\beta_R - \beta^0) \cdot X. \qquad (2.5)$$

For asymptotics where $N$ and $T$ grow at the same rate, we find that Assumption 1 and the result of Theorem 2.1 guarantee that $\|e - (\hat\beta_R - \beta^0) \cdot X\| = O_p(\sqrt{N})$. The strong factor assumption implies that the norms of the columns of $\lambda^0$ and $f^0$ each grow at a rate of $\sqrt{N}$ (or equivalently $\sqrt{T}$), so that the spectral norm of $\lambda^0 f^{0\prime}$ grows at the rate $\sqrt{NT}$. The strong factor assumption therefore guarantees that $\lambda^0 f^{0\prime}$ is the dominant component of $Y - \hat\beta_R \cdot X$.

9. The strong factor assumption is regularly imposed in the literature on large $N$ and $T$ factor models, including Bai and Ng (2002), Stock and Watson (2002) and Bai (2009b). Onatski (2006) discussed an alternative weak factor assumption for the purpose of estimating the number of factors in a pure factor model, and a more general discussion of strong and weak factors is given in Chudik, Pesaran and Tosetti (2009).

This implies that the first $R^0$ principal components of $(Y - \hat\beta_R \cdot X)'(Y - \hat\beta_R \cdot X)$ are close to $f^0$, i.e. the true factors are correctly picked up by the principal components estimator. The additional $R - R^0$ principal components that are estimated for $R > R^0$ cannot pick up any more true factors and are thus mostly determined by the remaining term $e - (\hat\beta_R - \beta^0) \cdot X$. Our results below show that $\hat\beta_R$ is not only $\sqrt{N}$-consistent but actually $\sqrt{NT}$-consistent, so that $\|(\hat\beta_R - \beta^0) \cdot X\| = O_p(1)$, which makes the idiosyncratic error matrix $e$ the dominant part of $e - (\hat\beta_R - \beta^0) \cdot X$; i.e. the $R - R^0$ additional principal components in $\hat F_R$ are mostly determined by $e$, and more precisely are close to the first $R - R^0$ principal components of $e'e$. This means that they are essentially random and close to uncorrelated with the regressors $X_k$. Including unnecessary factors in the QMLE calculation is therefore analogous to including irrelevant regressors in a linear regression which are uncorrelated with the relevant regressors $X_k$. From the second line in equation (2.4) we see that these additional random components of $\hat F_R$ project out the corresponding $R - R^0$ dimensional subspace of the $T$-dimensional space spanned by the observations over time, thus effectively reducing the number of time dimensions by $R - R^0$. This usually results in a somewhat increased finite sample variance of the QMLE, but has no influence asymptotically as $T$ goes to infinity, so that the asymptotic distributions of $\hat\beta_R$ and $\hat\beta_{R^0}$ are identical for $R \geq R^0$.
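This intuition is easy to check numerically. The following sketch is our own illustration (all sizes are arbitrary, and there are no regressors, so $\beta$ plays no role): generate data with $R^0 = 2$ strong factors, extract $R = 4$ principal components, and verify that the two extra components essentially span the leading eigenvectors of $e'e$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, R0, R = 100, 100, 2, 4
lam = rng.normal(size=(N, R0))       # strong factors: entries O(1), so
f = rng.normal(size=(T, R0))         # ||lam f'|| grows like sqrt(NT)
e = rng.normal(size=(N, T))
Y = lam @ f.T + e

# first R principal components of Y'Y (columns = estimated factors F_hat);
# eigh sorts eigenvalues ascending, so the top R eigenvectors come last
_, vecs = np.linalg.eigh(Y.T @ Y)
F_hat = vecs[:, -R:]
extra = F_hat[:, : R - R0]           # the R - R0 "extra" components have the
                                     # smaller eigenvalues within the top R

# leading R - R0 eigenvectors of e'e
_, vecs_e = np.linalg.eigh(e.T @ e)
W = vecs_e[:, -(R - R0):]

# canonical correlations between the two subspaces: typically close to one,
# i.e. the extra components are essentially principal components of e'e
print(np.linalg.svd(W.T @ extra, compute_uv=False))
```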

3 Asymptotic Profile Likelihood Expansion

To derive the asymptotics of $\hat\beta_R$, we study the asymptotic properties of the profile likelihood function $\mathcal{L}_R(\beta)$ around $\beta^0$. First we notice that the expression cannot easily be discussed by analytic means, since there is no explicit formula for the eigenvalues of a matrix. In particular, a standard Taylor expansion of $\mathcal{L}^0(\beta)$ around $\beta^0$ cannot easily be derived. In Section 3.1 we show how to overcome this problem when the true number of factors is known, i.e. $R = R^0$, and in Section 3.2 we generalize the results to $R > R^0$.

When the true $R^0$ is known, the approach we choose is to perform a joint expansion in the regression parameters and in the idiosyncratic error terms. To perform this joint expansion we apply the perturbation theory of linear operators (e.g., Kato (1980)). We thereby obtain an approximate quadratic expansion of $\mathcal{L}^0(\beta)$ in $\beta$, which can be used to derive the first order asymptotic theory of the QMLE $\hat\beta_{R^0}$. To carry the results for $R = R^0$ over to $R > R^0$, we first note that equation (2.4) implies that

$$\mathcal{L}_R(\beta) = \mathcal{L}^0(\beta) - \frac{1}{NT} \sum_{t = R^0 + 1}^{R} \mu_t\left[ (Y - \beta \cdot X)'(Y - \beta \cdot X) \right]. \qquad (3.1)$$

The extra term $\sum_{t=R^0+1}^{R} \mu_t\left[ (Y - \beta \cdot X)'(Y - \beta \cdot X) \right]$ is due to overfitting on the extra factors. We show that the $\beta$-dependence of this term is sufficiently small, so that apart from a constant the approximate quadratic expansions of $\mathcal{L}_R(\beta)$ and $\mathcal{L}^0(\beta)$ around $\beta^0$ are identical. To obtain this result we first strengthen Theorem 2.1 and show that $\hat\beta_R$ converges to $\beta^0$ at a rate of at least $N^{3/4}$, so that we only have to discuss the $\beta$-dependence of the extra term in $\mathcal{L}_R(\beta)$ within an $N^{-3/4}$ shrinking neighborhood of $\beta^0$.

From the analysis of $\mathcal{L}_R(\beta)$, we can then deduce the main result of the paper, namely

$$\sqrt{NT}\, (\hat\beta_R - \beta^0) = \sqrt{NT}\, (\hat\beta_{R^0} - \beta^0) + o_p(1). \qquad (3.2)$$

This implies that the limiting distributions of $\hat\beta_R$ and $\hat\beta_{R^0}$ are identical, and that overestimating the number of factors results in no efficiency loss in terms of the asymptotic variance of the QMLE.

3.1 When R = R⁰

We want to expand the profile likelihood $\mathcal{L}^0(\beta)$ simultaneously in $\beta$ and in the spectral norm of $e$. Let the $K+1$ expansion parameters be defined by $\epsilon_0 = \|e\|/\sqrt{NT}$ and $\epsilon_k = \beta_k^0 - \beta_k$, $k = 1, \ldots, K$, and define the $N \times T$ matrix $X_0 = \sqrt{NT}\, e/\|e\|$. With these definitions we obtain

$$\frac{1}{\sqrt{NT}} (Y - \beta \cdot X) = \frac{1}{\sqrt{NT}} \left[ \lambda^0 f^{0\prime} + (\beta^0 - \beta) \cdot X + e \right] = \frac{1}{\sqrt{NT}}\, \lambda^0 f^{0\prime} + \frac{1}{\sqrt{NT}} \sum_{k=0}^{K} \epsilon_k X_k. \qquad (3.3)$$

According to equation (2.4), the profile likelihood $\mathcal{L}^0(\beta)$ can be written as the sum over the $T - R^0$ smallest eigenvalues of the matrix in (3.3) multiplied by its transpose. We consider $\frac{1}{\sqrt{NT}} \sum_{k=0}^{K} \epsilon_k X_k$ as a small perturbation of the unperturbed matrix $\lambda^0 f^{0\prime}/\sqrt{NT}$, and thus expand $\mathcal{L}^0(\beta)$ in the perturbation parameters $\epsilon = (\epsilon_0, \epsilon_1, \ldots, \epsilon_K)$ around $\epsilon = 0$, namely

$$\mathcal{L}^0(\beta) = \sum_{g=1}^{\infty} \sum_{k_1, \ldots, k_g = 0}^{K} \epsilon_{k_1} \epsilon_{k_2} \cdots \epsilon_{k_g}\, L^{(g)}(\lambda^0, f^0, X_{k_1}, X_{k_2}, \ldots, X_{k_g}), \qquad (3.4)$$

where $L^{(g)} = L^{(g)}(\lambda^0, f^0, X_{k_1}, X_{k_2}, \ldots, X_{k_g})$ are the expansion coefficients. The unperturbed matrix $\lambda^0 f^{0\prime}/\sqrt{NT}$ has rank $R^0$, so that the $T - R^0$ smallest eigenvalues of the unperturbed $T \times T$ matrix $f^0 \lambda^{0\prime} \lambda^0 f^{0\prime}/NT$ are all zero, i.e. $\mathcal{L}^0(\beta) = 0$ for $\epsilon = 0$ and thus $L^{(0)}(\lambda^0, f^0) = 0$. Due to Assumption 3, the $R^0$ non-zero eigenvalues of the unperturbed $T \times T$ matrix $f^0 \lambda^{0\prime} \lambda^0 f^{0\prime}/NT$ converge to positive constants as $N, T \to \infty$. This means that the separating distance of the $T - R^0$ zero-eigenvalues of the unperturbed $T \times T$ matrix $f^0 \lambda^{0\prime} \lambda^0 f^{0\prime}/NT$ converges to a positive constant, i.e. the next largest eigenvalue is well separated. This is exactly the technical condition under which the perturbation theory of linear operators guarantees that the above expansion of $\mathcal{L}^0$ in $\epsilon$ exists and is convergent as long as the spectral norm of the perturbation $\frac{1}{\sqrt{NT}} \sum_{k=0}^{K} \epsilon_k X_k$ is smaller than a particular convergence radius $r^0(\lambda^0, f^0)$, which is closely related to the separating distance of the zero-eigenvalues. For details see Kato (1980) and Appendix A.2, where we define $r^0(\lambda^0, f^0)$ and show that it converges to a positive constant as $N, T \to \infty$. Note that for the expansion (3.4) it is crucial that we have $R = R^0$, since the perturbation theory of linear operators describes the perturbation of the sum of all zero-eigenvalues of the unperturbed matrix $f^0 \lambda^{0\prime} \lambda^0 f^{0\prime}/NT$. For $R > R^0$ the sum in $\mathcal{L}_R(\beta)$ leaves out the $R - R^0$ largest of these perturbed zero-eigenvalues, which results in a much more complicated mathematical problem, since the structure of and ranking among these perturbed zero-eigenvalues needs to be discussed.

The above expansion of $\mathcal{L}^0(\beta)$ is applicable whenever the operator norm of the perturbation matrix $\frac{1}{\sqrt{NT}} \sum_{k=0}^{K} \epsilon_k X_k$ is smaller than $r^0(\lambda^0, f^0)$. Since our assumptions guarantee that $\|X_k\|/\sqrt{NT} = O_p(1)$, for $k = 0, \ldots, K$, and $\epsilon_0 = O_p(\min(N,T)^{-1/2}) = o_p(1)$, we have $\left\| \frac{1}{\sqrt{NT}} \sum_{k=0}^{K} \epsilon_k X_k \right\| = O_p(\|\beta - \beta^0\|) + o_p(1)$, i.e. the above expansion is always applicable asymptotically within a shrinking neighborhood of $\beta^0$ (which is sufficient, since we already know that $\hat\beta_R$ is consistent for $R \geq R^0$). In addition to guaranteeing convergence of the series expansion, the perturbation theory of linear operators also provides explicit formulas for the expansion coefficients $L^{(g)}$; namely, for $g = 1, 2, 3$ we have

$$L^{(1)}(\lambda^0, f^0, X_{k_1}) = 0,$$
$$L^{(2)}(\lambda^0, f^0, X_{k_1}, X_{k_2}) = \frac{1}{NT} \mathrm{Tr}\left( M_{\lambda^0} X_{k_1} M_{f^0} X_{k_2}' \right),$$
$$L^{(3)}(\lambda^0, f^0, X_{k_1}, X_{k_2}, X_{k_3}) = -\frac{1}{3NT} \left[ \mathrm{Tr}\left( M_{\lambda^0} X_{k_1} M_{f^0} X_{k_2}' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} X_{k_3}' \right) + \ldots \right],$$

where the dots refer to 5 additional terms obtained from the first one by permutation of $k_1$, $k_2$ and $k_3$, so that the expression becomes totally symmetric in these indices. A general expression for the coefficients for all orders in $g$ is given in Lemma A.1 in the appendix. One can show that for $g \geq 3$ the coefficients $L^{(g)}$ are bounded as follows:

$$\left| L^{(g)}(\lambda^0, f^0, X_{k_1}, X_{k_2}, \ldots, X_{k_g}) \right| \leq a\, b^g\, \frac{\|X_{k_1}\|}{\sqrt{NT}} \frac{\|X_{k_2}\|}{\sqrt{NT}} \cdots \frac{\|X_{k_g}\|}{\sqrt{NT}}, \qquad (3.5)$$

where $a$ and $b$ are functions of $\lambda^0$ and $f^0$ that converge to finite positive constants in probability. This bound on the coefficients $L^{(g)}$ allows us to derive a bound on the remainder term when the profile likelihood expansion is truncated at a particular order.

The likelihood expansion can be applied under more general asymptotics, but here we only consider the limit $N, T \to \infty$ with $N/T \to \kappa^2$, $0 < \kappa < \infty$, i.e. $N$ and $T$ grow at the same rate. Then, the relevant coefficients of the expansion, which are not treated as part of the remainder term, are

$$\mathcal{L}^0(\beta^0) = \sum_{g=2}^{\infty} \epsilon_0^g\, L^{(g)}(\lambda^0, f^0, X_0, X_0, \ldots, X_0) = \sum_{g=2}^{\infty} L^{(g)}(\lambda^0, f^0, e, e, \ldots, e),$$
$$W_{k_1 k_2} = L^{(2)}(\lambda^0, f^0, X_{k_1}, X_{k_2}) = \frac{1}{NT} \mathrm{Tr}\left( M_{\lambda^0} X_{k_1} M_{f^0} X_{k_2}' \right),$$
$$C^{(1)}_k = \sqrt{NT}\, \epsilon_0\, L^{(2)}(\lambda^0, f^0, X_k, X_0) = \frac{1}{\sqrt{NT}} \mathrm{Tr}\left( M_{\lambda^0} X_k M_{f^0} e' \right),$$
$$C^{(2)}_k = \frac{3}{2} \sqrt{NT}\, \epsilon_0^2\, L^{(3)}(\lambda^0, f^0, X_k, X_0, X_0) = -\frac{1}{\sqrt{NT}} \left[ \mathrm{Tr}\left( e M_{f^0} e' M_{\lambda^0} X_k f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} \right) + \mathrm{Tr}\left( e' M_{\lambda^0} e M_{f^0} X_k' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} \right) + \mathrm{Tr}\left( e' M_{\lambda^0} X_k M_{f^0} e' \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime} \right) \right]. \qquad (3.6)$$

In the first line above we used the fact that $L^{(g)}(\lambda^0, f^0, X_{k_1}, X_{k_2}, \ldots, X_{k_g})$ is linear in the arguments $X_{k_1}$ to $X_{k_g}$ and that $\epsilon_0 X_0 = e$. The $K \times K$ matrix $W$ with elements $W_{k_1 k_2}$ is the approximated Hessian of the profile likelihood function $\mathcal{L}^0(\beta)$. The $K$-vectors $C^{(1)}$ and $C^{(2)}$ with elements $C^{(1)}_k$ and $C^{(2)}_k$ constitute the approximated score of $\mathcal{L}^0(\beta)$.
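Since $W$, $C^{(1)}$ and $C^{(2)}$ in (3.6) are explicit matrix expressions, they can be evaluated directly whenever $\lambda^0$, $f^0$ and $e$ are available, i.e. in simulations. A minimal sketch (ours, not the paper's code; $X$ stacked as a $K \times N \times T$ array):

```python
import numpy as np

def approx_hessian_and_score(lam, f, X, e):
    """W, C1, C2 as in (3.6); lam, f, e are the true loadings, factors and
    errors, which are observable in a simulation."""
    N, T = e.shape
    rt = np.sqrt(N * T)
    M_lam = np.eye(N) - lam @ np.linalg.inv(lam.T @ lam) @ lam.T
    M_f = np.eye(T) - f @ np.linalg.inv(f.T @ f) @ f.T
    # T x N "inverse gram" factor appearing in C2:
    # A = f (f'f)^{-1} (lam'lam)^{-1} lam',  so  A' = lam (lam'lam)^{-1} (f'f)^{-1} f'
    A = f @ np.linalg.inv(f.T @ f) @ np.linalg.inv(lam.T @ lam) @ lam.T
    K = X.shape[0]
    W, C1, C2 = np.empty((K, K)), np.empty(K), np.empty(K)
    for k1 in range(K):
        for k2 in range(K):
            W[k1, k2] = np.trace(M_lam @ X[k1] @ M_f @ X[k2].T) / (N * T)
        C1[k1] = np.trace(M_lam @ X[k1] @ M_f @ e.T) / rt
        C2[k1] = -(np.trace(e @ M_f @ e.T @ M_lam @ X[k1] @ A)
                   + np.trace(e.T @ M_lam @ e @ M_f @ X[k1].T @ A.T)
                   + np.trace(e.T @ M_lam @ X[k1] @ M_f @ e.T @ A.T)) / rt
    return W, C1, C2
```

In a simulation, comparing $\sqrt{NT}(\hat\beta_R - \beta^0)$ with $W^{-1}(C^{(1)} + C^{(2)})$ (Corollaries 3.2 and 3.5 below) gives a direct numerical check of the expansion.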

From the expansion (3.4) and the bound (3.5) we obtain the following theorem, whose proof is provided in the appendix.

Theorem 3.1. Let Assumptions 1 and 3 be satisfied. Suppose that $N, T \to \infty$ with $N/T \to \kappa^2$, $0 < \kappa < \infty$. Then we have

$$\mathcal{L}^0(\beta) = \mathcal{L}^0(\beta^0) - \frac{2}{\sqrt{NT}}\, (\beta - \beta^0)' \left( C^{(1)} + C^{(2)} \right) + (\beta - \beta^0)'\, W\, (\beta - \beta^0) + \mathcal{L}^0_{rem}(\beta),$$

where the remainder term $\mathcal{L}^0_{rem}(\beta)$ satisfies for any sequence $c_{NT} \to 0$

$$\sup_{\{\beta:\ \|\beta - \beta^0\| \leq c_{NT}\}} \frac{NT\, \left| \mathcal{L}^0_{rem}(\beta) \right|}{\left( 1 + \sqrt{NT}\, \|\beta - \beta^0\| \right)^2} = o_p(1).$$

Corollary 3.2. Let Assumptions 1, 2 and 3 be satisfied. Furthermore, assume that $C^{(1)} = O_p(1)$. In the limit $N, T \to \infty$ with $N/T \to \kappa^2$, $0 < \kappa < \infty$, we then have

$$\sqrt{NT}\, (\hat\beta_{R^0} - \beta^0) = W^{-1} \left( C^{(1)} + C^{(2)} \right) + o_p(1) = O_p(1).$$

Since the estimator $\hat\beta_{R^0}$ minimizes $\mathcal{L}^0(\beta)$, it must in particular satisfy $\mathcal{L}^0(\hat\beta_{R^0}) \leq \mathcal{L}^0\left( \beta^0 + W^{-1}(C^{(1)} + C^{(2)})/\sqrt{NT} \right)$. The corollary follows from applying Theorem 3.1 to this inequality and using the consistency of $\hat\beta_{R^0}$. Details are given in the appendix. Using Theorem 3.1, the corollary is also directly obtained from the results in Andrews (1999). Our assumptions already guarantee $C^{(2)} = O_p(1)$ and $W = O_p(1)$, so that only $C^{(1)} = O_p(1)$ needs to be assumed explicitly in the corollary. Corollary 3.2 allows us to replicate the result in Bai (2009b). Furthermore, the assumptions in the corollary do not restrict the regressors to be strictly exogenous, and the techniques developed here are applied in Moon and Weidner (2010) to discuss pre-determined regressors in the linear factor regression model with $R = R^0$, in which case the score term $C^{(2)}$ contributes an additional incidental parameter bias to the asymptotic distribution of $\hat\beta_{R^0}$.

Remark. If we weaken Assumption 1(ii) to $\|e\| = o_p(N^{2/3})$, then Theorem 3.1 still continues to hold. If we in addition assume that $C^{(2)} = O_p(1)$, then Corollary 3.2 also holds under this weaker condition on $\|e\|$.

3.2 When R > R⁰

We now extend the likelihood expansion to the case $R > R^0$. Let $\hat\lambda(\beta)$ and $\hat f(\beta)$ be the minimizing parameters in the first line of equation (2.4) for $R = R^0$. These are given by the first $R^0$ principal components of $(Y - \beta \cdot X)(Y - \beta \cdot X)'$ and $(Y - \beta \cdot X)'(Y - \beta \cdot X)$, respectively. For the corresponding orthogonal projectors we use the notation $M_{\hat\lambda}(\beta) \equiv M_{\hat\lambda(\beta)}$ and $M_{\hat f}(\beta) \equiv M_{\hat f(\beta)}$. For the residuals after taking out these first $R^0$ principal components we write $\hat e(\beta) \equiv Y - \beta \cdot X - \hat\lambda(\beta) \hat f'(\beta)$.

Analogous to the expansion of $\mathcal{L}^0(\beta)$, the perturbation theory of linear operators also provides an expansion for $M_{\hat\lambda}(\beta)$, $M_{\hat f}(\beta)$ and $\hat e(\beta)$ in $\beta^0 - \beta$ and $e$; i.e. in addition to describing the sum of the perturbed eigenvalues $\mathcal{L}^0(\beta)$, it also describes the structure of the corresponding perturbed eigenvectors. For example, we have $\hat e(\beta) = M_{\lambda^0}\, e\, M_{f^0} - M_{\lambda^0} \left[ (\beta - \beta^0) \cdot X \right] M_{f^0} + \text{(higher order terms)}$. The details of these expansions are presented in Lemmas A.1 and A.2 in the appendix. These expansions are crucial when generalizing the likelihood expansion to $R > R^0$. Equation (3.1) can equivalently be written as

$$\mathcal{L}_R(\beta) = \mathcal{L}^0(\beta) - \frac{1}{NT} \sum_{t=1}^{R - R^0} \mu_t\left[ \hat e'(\beta)\, \hat e(\beta) \right]. \qquad (3.7)$$

Here we used that $\hat e'(\beta)\hat e(\beta)$ is the residual of $(Y - \beta \cdot X)'(Y - \beta \cdot X)$ after subtracting the first $R^0$ principal components, which implies that the eigenvalues of these two matrices are the same, except for the $R^0$ largest ones, which are missing in $\hat e'(\beta)\hat e(\beta)$. By applying the expansion of $\hat e(\beta)$ to this expression for $\mathcal{L}_R(\beta)$ one obtains the following.

Theorem 3.3. Under Assumptions 1 and 3 and for $R > R^0$ we have

(i) $\mathcal{L}_R(\beta) = \mathcal{L}^0(\beta) - \frac{1}{NT} \sum_{t=1}^{R-R^0} \mu_t\left[ A(\beta) \right] + \mathcal{L}_{R,rem,1}(\beta)$,

where $A(\beta) = M_{f^0} \left[ e - (\beta - \beta^0) \cdot X \right]' M_{\lambda^0} \left[ e - (\beta - \beta^0) \cdot X \right] M_{f^0}$, and for any constant $c > 0$

$$\sup_{\{\beta:\ \sqrt{N}\|\beta - \beta^0\| \leq c\}} \frac{NT\, \left| \mathcal{L}_{R,rem,1}(\beta) \right|}{\sqrt{N} \left( 1 + \sqrt{NT}\, \|\beta - \beta^0\| \right)} = O_p(1).$$

(ii) $\mathcal{L}_R(\beta) = \mathcal{L}^0(\beta) - \frac{1}{NT} \sum_{t=1}^{R-R^0} \mu_t\left[ B(\beta) + B'(\beta) \right] + \mathcal{L}_{R,rem,2}(\beta)$,

where

$$B(\beta) = \frac{1}{2} A(\beta) - M_{f^0}\, e' M_{\lambda^0}\, e\, M_{f^0}\, e'\, \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime}$$
$$\quad + M_{f^0} \left[ (\beta - \beta^0) \cdot X \right]' M_{\lambda^0}\, e\, f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime}\, e\, M_{f^0}$$
$$\quad + M_{f^0}\, e' M_{\lambda^0} \left[ (\beta - \beta^0) \cdot X \right] f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime}\, e\, M_{f^0}$$
$$\quad + M_{f^0}\, e' M_{\lambda^0}\, e\, f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} \left[ (\beta - \beta^0) \cdot X \right] M_{f^0}$$
$$\quad + B_{eeee} + M_{f^0}\, B_{rem,1}(\beta)\, P_{f^0} + P_{f^0}\, B_{rem,2}\, P_{f^0},$$

and

$$B_{eeee} = M_{f^0}\, e' M_{\lambda^0}\, e\, M_{f^0}\, e'\, \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime}\, e\, M_{f^0}$$
$$\quad + M_{f^0}\, e' M_{\lambda^0}\, e\, f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime}\, e\, f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime}\, e\, M_{f^0}$$
$$\quad - 2\, M_{f^0}\, e' M_{\lambda^0}\, e\, f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime}\, e' M_{\lambda^0}\, e\, M_{f^0}$$
$$\quad + 2\, M_{f^0}\, e'\, \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} f^{0\prime}\, e' M_{\lambda^0}\, e\, f^0 (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime}\, e\, M_{f^0}.$$

Here, $B_{rem,1}(\beta)$ and $B_{rem,2}$ are $T \times T$ matrices; $B_{rem,2}$ is independent of $\beta$ and satisfies $\|B_{rem,2}\| = O_p(1)$, and for any constant $c > 0$

$$\sup_{\{\beta:\ \sqrt{N}\|\beta - \beta^0\| \leq c\}} \frac{\|B_{rem,1}(\beta)\|}{1 + \sqrt{NT}\, \|\beta - \beta^0\|} = O_p(1), \qquad \sup_{\{\beta:\ \sqrt{N}\|\beta - \beta^0\| \leq c\}} \frac{NT\, \left| \mathcal{L}_{R,rem,2}(\beta) \right|}{\left( 1 + \sqrt{NT}\, \|\beta - \beta^0\| \right)^2} = o_p(1).$$

Here, the remainder terms $\mathcal{L}_{R,rem,1}(\beta)$ and $\mathcal{L}_{R,rem,2}(\beta)$ stem from terms in $\hat e'(\beta)\hat e(\beta)$ whose spectral norm, within a shrinking neighborhood of $\beta^0$ and after dividing by $\sqrt{N}\,(1 + \sqrt{NT}\|\beta - \beta^0\|)$ and $(1 + \sqrt{NT}\|\beta - \beta^0\|)^2$, respectively, is bounded and vanishing, respectively. Using Weyl's inequality, those terms can be separated from the eigenvalues $\mu_t[\hat e'(\beta)\hat e(\beta)]$.

The expression for $B(\beta)$ looks complicated, in particular the terms in $B_{eeee}$. Note, however, that $B_{eeee}$ is $\beta$-independent and satisfies $\|B_{eeee}\| = O_p(1)$ under our assumptions, so that it is relatively easy to deal with these terms. Note furthermore that the structure of $B(\beta)$ is closely related to the expansion of $\mathcal{L}^0(\beta)$, since by definition we have $\mathcal{L}^0(\beta) = \frac{1}{NT} \mathrm{Tr}\left( \hat e'(\beta)\hat e(\beta) \right)$, which can be approximated by $\frac{1}{NT} \mathrm{Tr}\left( B(\beta) + B'(\beta) \right)$. Plugging the definition of $B(\beta)$ into $\mathrm{Tr}(B(\beta) + B'(\beta))$, one indeed recovers the terms of the approximated Hessian and score provided by Theorem 3.1, which is a convenient consistency check. We do not give explicit formulas for $B_{rem,1}(\beta)$ and $B_{rem,2}$, because those terms enter $B(\beta)$ projected by $P_{f^0}$, which makes them orthogonal to the leading term $A(\beta)$, so that they can only have limited influence on the eigenvalues of $B(\beta) + B'(\beta)$. The bounds on the norms of $B_{rem,1}(\beta)$ and $B_{rem,2}$ provided in the theorem are sufficient for all conclusions on the properties of $\mu_t[B(\beta) + B'(\beta)]$ below. The proof of the theorem can be found in the appendix.

The first part of Theorem 3.3 is useful to show that $\hat\beta_R$ converges to $\beta^0$ at a rate of at least $N^{3/4}$. The purpose of the second part is to show that $\hat\beta_R$ has the same limiting distribution as $\hat\beta_{R^0}$. To actually obtain these two results, one requires further conditions on the $\beta$-dependence of the largest few eigenvalues of $A(\beta)$ and $B(\beta) + B'(\beta)$.

Assumption 4. For all constants $c > 0$

$$\sup_{\{\beta:\ \sqrt{N}\|\beta - \beta^0\| \leq c\}} \frac{\left| \sum_{t=1}^{R-R^0} \left\{ \mu_t[A(\beta)] - \mu_t[A(\beta^0)] - \mu_t[\tilde A(\beta)] \right\} \right|}{\sqrt{N} + N^{5/4}\, \|\beta - \beta^0\| + N^2\, \|\beta - \beta^0\|^2 / \log N} = O_p(1),$$

where $\tilde A(\beta) = M_{f^0} \left[ (\beta - \beta^0) \cdot X \right]' M_{\lambda^0} \left[ (\beta - \beta^0) \cdot X \right] M_{f^0}$.

Corollary 3.4. Let $R > R^0$, let Assumptions 1, 2, 3 and 4 be satisfied, and furthermore assume that $C^{(1)} = O_p(1)$. In the limit $N, T \to \infty$ with $N/T \to \kappa^2$, $0 < \kappa < \infty$, we then have $N^{3/4}\, (\hat\beta_R - \beta^0) = O_p(1)$.

The corollary follows from the inequality $\mathcal{L}_R(\hat\beta_R) \leq \mathcal{L}_R(\beta^0)$ by applying the first part of Theorem 3.3, Assumption 4, and our expansion of $\mathcal{L}^0(\beta)$. The justification of Assumption 4 is discussed in the next section.

Knowing that $\hat\beta_R$ converges to $\beta^0$ at a rate of at least $N^{3/4}$ is a convenient intermediate result. It implies that we only have to study the properties of $\mathcal{L}_R(\beta)$ within an $N^{-3/4}$ shrinking neighborhood of $\beta^0$, which is reflected in the formulation of the following assumption.

Assumption 5. For all constants $c > 0$

$$\sup_{\{\beta:\ N^{3/4}\|\beta - \beta^0\| \leq c\}} \frac{\left| \sum_{t=1}^{R-R^0} \left\{ \mu_t\left[ B(\beta) + B'(\beta) \right] - \mu_t\left[ B(\beta^0) + B'(\beta^0) \right] \right\} \right|}{\left( 1 + \sqrt{NT}\, \|\beta - \beta^0\| \right)^2} = o_p(1).$$

Combining the second part of Theorem 3.3, Assumption 5, and Theorem 3.1, we find that the profile likelihood for $R > R^0$ can be written as

$$\mathcal{L}_R(\beta) = \mathcal{L}_R(\beta^0) - \frac{2}{\sqrt{NT}}\, (\beta - \beta^0)' \left( C^{(1)} + C^{(2)} \right) + (\beta - \beta^0)'\, W\, (\beta - \beta^0) + \mathcal{L}_{R,rem}(\beta),$$

with a remainder term that satisfies for all constants $c > 0$

$$\sup_{\{\beta:\ N^{3/4}\|\beta - \beta^0\| \leq c\}} \frac{NT\, \left| \mathcal{L}_{R,rem}(\beta) \right|}{\left( 1 + \sqrt{NT}\, \|\beta - \beta^0\| \right)^2} = o_p(1).$$

This result, together with $N^{3/4}$-consistency of $\hat\beta_R$, gives rise to the following corollary.

Corollary 3.5. Let $R > R^0$, let Assumptions 1, 2, 3, 4 and 5 be satisfied, and furthermore assume that $C^{(1)} = O_p(1)$. In the limit $N, T \to \infty$ with $N/T \to \kappa^2$, $0 < \kappa < \infty$, we then have

$$\sqrt{NT}\, (\hat\beta_R - \beta^0) = W^{-1} \left( C^{(1)} + C^{(2)} \right) + o_p(1) = O_p(1).$$

The proof of Corollary 3.5 is analogous to that of Corollary 3.2. The combination of both corollaries shows that our main result in equation (3.2) holds, i.e. the limiting distributions of $\hat\beta_R$ and $\hat\beta_{R^0}$ are indeed identical. What is left to do is to justify the high-level Assumptions 4 and 5.

4 Justification of Assumptions 4 and 5

We start with the justification of Assumption 4. We have $A(\beta) = A(\beta^0) + \tilde A(\beta) - A_{mixed}(\beta)$, where $A_{mixed}(\beta) = M_{f^0}\, e' M_{\lambda^0} \left[ (\beta - \beta^0) \cdot X \right] M_{f^0} + \text{(the same term transposed)}$.

By applying Weyl's inequality¹⁰ we thus find

$$\sum_{t=1}^{R-R^0} \left| \mu_t[A(\beta)] - \mu_t[A(\beta^0)] - \mu_t[\tilde A(\beta)] \right| \leq 2 (R - R^0)\, \|A_{mixed}(\beta)\| \leq 2 (R - R^0)\, K\, \|e\|\, \|\beta - \beta^0\| \max_k \left\| M_{\lambda^0} X_k M_{f^0} \right\|. \qquad (4.1)$$

For asymptotics with $N$ and $T$ growing at the same rate, Assumption 1(ii) guarantees $\|e\| = O_p(\sqrt{N})$. Using this and inequality (4.1), we find that $\max_k \|M_{\lambda^0} X_k M_{f^0}\| = O_p(N^{3/4})$ is a sufficient condition for Assumption 4. This condition can be justified by assuming that $X_k = \Gamma_k f^{0\prime} + \widetilde X_k$, where $\Gamma_k$ is an $N \times R^0$ matrix and $\widetilde X_k$ is an $N \times T$ matrix with $\|\widetilde X_k\| = O_p(N^{3/4})$, i.e. $X_k$ has an approximate factor structure with the same factors $f^0$ that enter into the equation for $Y$, plus an idiosyncratic component $\widetilde X_k$. Analogous to our discussion of Assumption 1(ii), we can obtain the bound on the norm of $\widetilde X_k$ by assuming that its entries $\widetilde X_{k,it}$ are mean zero, have bounded fourth moment, and are only weakly correlated across $i$ and $t$. We have thus provided a way to justify Assumption 4 without imposing any additional condition on the error matrix $e$, but by restricting the data generating process for the regressors $X_k$. Alternatively, one can derive the statement in the assumption by imposing weaker restrictions on $X_k$, but making further assumptions on the error matrix $e$. An example of this is provided by Theorem 4.2 below, where we only assume that $X_k = \bar X_k + \widetilde X_k$, with $\mathrm{rank}(\bar X_k)$ being bounded, but without assuming that $\bar X_k$ is generated by the factors $f^0$.

The discussion of Assumption 5 is more complicated. By Weyl's inequality we know that the absolute value of $\mu_t[B(\beta) + B'(\beta)] - \mu_t[B(\beta^0) + B'(\beta^0)]$ is bounded by the spectral norm of $B(\beta) + B'(\beta) - B(\beta^0) - B'(\beta^0)$, which is of order $O_p(N^{3/2}\, \|\beta - \beta^0\|) + O_p(N^2\, \|\beta - \beta^0\|^2)$. This bound is obviously too crude to justify the assumption. What we need here is a bound that takes into account not only the spectral norm of the difference between $B(\beta) + B'(\beta)$ and $B(\beta^0) + B'(\beta^0)$, but also the structure of the eigenvectors of the various matrices involved.

The assumption only restricts the properties of $B(\beta)$ in an $N^{-3/4}$ shrinking neighborhood of $\beta^0$. In this shrinking neighborhood the dominant term in $B(\beta) + B'(\beta)$ is $M_{f^0}\, e' M_{\lambda^0}\, e\, M_{f^0}$, since its spectral norm is of order $N$, while the spectral norm of the remaining terms, e.g. $A_{mixed}(\beta)$ above, is at most of order $N^{3/4}$. Our goal is to show that the largest few eigenvalues of $B(\beta) + B'(\beta)$ only differ by $o_p(1)$ from those of the leading term $M_{f^0}\, e' M_{\lambda^0}\, e\, M_{f^0}$, within the shrinking neighborhood of $\beta^0$. To do so, we first need to introduce some notation. Let $w_t \in \mathbb{R}^T$, $t = 1, \ldots, T - R^0$, be the normalized eigenvectors of $M_{f^0}\, e' M_{\lambda^0}\, e\, M_{f^0}$ with the constraint $f^{0\prime} w_t = 0$, and let $\rho_t$, $t = 1, \ldots, T - R^0$, be the corresponding eigenvalues. Let $v_i \in \mathbb{R}^N$, $i = 1, \ldots, N - R^0$, be the normalized eigenvectors of $M_{\lambda^0}\, e\, M_{f^0}\, e' M_{\lambda^0}$ with the constraint $\lambda^{0\prime} v_i = 0$, and let $\tilde\rho_i$, $i = 1, \ldots, N - R^0$, be the corresponding eigenvalues.¹¹

10. Weyl's inequality says $\mu_m(G + H) \leq \mu_m(G) + \mu_1(H)$ for arbitrary symmetric $n \times n$ matrices $G$ and $H$ and $m \leq n$. Here, we refer to a generalization of this, which reads $\sum_{t=1}^{m} \mu_t(G + H) \leq \sum_{t=1}^{m} \mu_t(G) + \sum_{t=1}^{m} \mu_t(H)$. These inequalities are standard results in linear algebra and are readily derived from the Courant-Fischer-Weyl min-max principle.

11. For $T < N$ the vectors $v_i$, $i = T - R^0 + 1, \ldots, N - R^0$, correspond to null-eigenvalues, and if there are multiple null-eigenvalues those $v_i$ are not uniquely defined. In that case we assume that those $v_i$ are drawn randomly from the Haar measure on the unit sphere of the corresponding null-eigenspace. For $T > N$ we assume the same for $w_t$, $t = N - R^0 + 1, \ldots, T - R^0$. This specification avoids correlation between $X_k$ and those $v_i$ and $w_t$ being caused by a particular choice of the eigenvectors that correspond to degenerate null-eigenvalues.

We assume that eigenvalues are sorted in decreasing order, i.e. $\rho_1 \geq \rho_2 \geq \ldots$. Note that the eigenvalues $\rho_t$ and $\tilde\rho_i$ are identical for $t = i$. Let

$$d^{(1)} = \max_{i,t,k} \left| v_i' X_k w_t \right|, \qquad d^{(2)} = \max_i \left\| v_i'\, e\, P_{f^0} \right\|, \qquad d^{(3)} = \max_t \left\| w_t'\, e'\, P_{\lambda^0} \right\|,$$
$$d^{(4)} = N^{-3/4} \max_{i,k} \left\| v_i' X_k P_{f^0} \right\|, \qquad d^{(5)} = N^{-3/4} \max_{t,k} \left\| w_t' X_k' P_{\lambda^0} \right\|,$$

where $i = 1, \ldots, N - R^0$, $t = 1, \ldots, T - R^0$ and $k = 1, \ldots, K$. Furthermore, define $d = \max\left( 1, d^{(1)}, d^{(2)}, d^{(3)}, d^{(4)}, d^{(5)} \right)$.

Theorem 4.1. Let Assumptions 1 and 3 hold, let $R > R^0$, and consider a limit $N, T \to \infty$ with $N/T \to \kappa^2$, $0 < \kappa < \infty$. Assume that $\rho_{R - R^0} > a\, N$, wpa1, for some constant $a > 0$. Furthermore, let there exist a sequence of integers $q = q_{NT} > R - R^0$ such that

$$\sqrt{q}\, d = o_p\left( N^{1/4} \right), \qquad \sum_{t=q}^{T - R^0} \left( \rho_{R - R^0} - \rho_t \right)^{-1} = O_p(1).$$

Then, for all constants $c > 0$ and $t = 1, \ldots, R - R^0$, we have

$$\sup_{\{\beta:\ N^{3/4}\|\beta - \beta^0\| \leq c\}} \left| \mu_t\left[ B(\beta) + B'(\beta) \right] - \rho_t \right| = o_p(1),$$

which implies that Assumption 5 is satisfied.

We can now justify Assumption 5 by showing that the conditions of Theorem 4.1 are satisfied. The following discussion is largely heuristic. Since $v_i$ and $w_t$ are the normalized eigenvectors of $M_{\lambda^0}\, e\, M_{f^0}\, e' M_{\lambda^0}$ and $M_{f^0}\, e' M_{\lambda^0}\, e\, M_{f^0}$, we expect them to be essentially uncorrelated with $X_k$ and $e$, and therefore we expect $|v_i' X_k w_t| = O_p(1)$, $\|v_i'\, e\, P_{f^0}\| = O_p(1)$, $\|w_t'\, e'\, P_{\lambda^0}\| = O_p(1)$. We also expect $\|v_i' X_k P_{f^0}\| = O_p(\sqrt{T})$ and $\|w_t' X_k' P_{\lambda^0}\| = O_p(\sqrt{N})$, which is different from the preceding terms with $e$, since $X_k$ can be correlated with $f^0$ and $\lambda^0$. In the definition of $d$ the maxima over these terms are taken over $i$, $t$ and $k$, so that we anticipate some weak dependence of $d$ on $N$ (or equivalently $T$). Note that we need $d = o_p(N^{1/4})$, since otherwise $q$ does not exist. The key to making this discussion rigorous, and to showing that indeed $d = o_p(N^{1/4})$ or smaller, is a good knowledge of the properties of the eigenvectors $v_i$ and $w_t$. If the entries $e_{it}$ are iid normal, then the matrix of $v_i$'s and $w_t$'s is Haar-distributed on the $N - R^0$ and $T - R^0$ dimensional subspaces spanned by $M_{\lambda^0}$ and $M_{f^0}$. In that case the formalization of the above discussion becomes relatively easy, and the result is summarized in Theorem 4.2 below. The conjecture in the random matrix theory literature is that the limiting distribution of the eigenvectors of a random covariance matrix is distribution free, i.e. is independent of the particular distribution of $e_{it}$ (see, e.g., Silverstein (1990), Bai (1999)). However, we are not aware of a formulation and corresponding proof of this conjecture that is sufficient for our purposes.

The second condition in Theorem 4.1 is on the eigenvalues $\rho_t$ of the random covariance matrix $M_{f^0}\, e' M_{\lambda^0}\, e\, M_{f^0}$. Eigenvalues are studied more intensely than eigenvectors in the random matrix theory literature, and it is well known that the properly normalized empirical distribution of the eigenvalues (the so-called empirical spectral distribution) of an iid sample covariance matrix converges to the Marčenko-Pastur law (Marčenko and Pastur (1967)) for asymptotics where $N$ and $T$ grow at the same rate. This means that the sum over the eigenvalues $\rho_t$ in Theorem 4.1 asymptotically becomes an integral over the Marčenko-Pastur limiting spectral distribution.¹² To derive a bound on this sum, one furthermore needs to know the asymptotic properties of $\rho_{R-R^0}$. For random covariance matrices from iid normal errors, it is known from Johnstone (2001) and Soshnikov (2002) that the properly normalized few largest eigenvalues converge to the Tracy-Widom law.¹³

An additional subtlety in the discussion of the eigenvalues and eigenvectors of the random covariance matrix $M_{f^0}\, e' M_{\lambda^0}\, e\, M_{f^0}$ are the projections with $M_{f^0}$ and $M_{\lambda^0}$, which stem from integrating out the true factors and factor loadings of the model. Those projectors are not normally present in the literature on large dimensional random covariance matrices. If the idiosyncratic error distribution is iid normal, these projections are unproblematic, since the distribution of $e$ is rotationally invariant in that case, i.e. the projections are mathematically equivalent to a reduction of the sample size by $R^0$ in both directions. Thus, if the $e_{it}$ are iid normal, then we can show that the conditions of Theorem 4.1 are satisfied, and we can therefore verify that the high-level assumptions of the last section hold. This result is summarized in the following theorem.

Theorem 4.2. Let $R > R^0$, let Assumption 3 hold, and consider a limit $N, T \to \infty$ with $N/T \to \kappa^2$, $0 < \kappa < \infty$. Furthermore, assume that

(i) For all $k = 1, \ldots, K$ we can decompose $X_k = \bar X_k + \widetilde X_k$, such that $\|\widetilde X_k\| = O_p(N^{3/4})$, $\|\widetilde X_k\|_{HS} = O_p(\sqrt{NT})$, $\|\bar X_k\|_{HS} = O_p(\sqrt{NT})$, and $\mathrm{rank}(\bar X_k) \leq Q$, where $Q$ is independent of $N$ and $T$. For the $K \times K$ matrix $\widetilde W$ defined by $\widetilde W_{k_1 k_2} = \frac{1}{NT} \mathrm{Tr}\left( \widetilde X_{k_1} \widetilde X_{k_2}' \right)$ we assume that $\mathrm{plim}_{N,T \to \infty} \widetilde W > 0$. In addition, we assume that $\mathbb{E}\left| (M_{\lambda^0} \widetilde X_k M_{f^0})_{it} \right|^{24+\epsilon}$, $\mathbb{E}\left| (M_{\lambda^0} \widetilde X_k)_{it} \right|^{6+\epsilon}$ and $\mathbb{E}\left| (\widetilde X_k M_{f^0})_{it} \right|^{6+\epsilon}$ are bounded uniformly across $i$, $t$, $N$ and $T$, for some $\epsilon > 0$.

(ii) The error matrix $e$ is independent of $\lambda^0$, $f^0$ and $X_k$, $k = 1, \ldots, K$, and its elements $e_{it}$ are distributed as iid $\mathcal{N}(0, \sigma^2)$.

Then Assumptions 1, 2, 4 and 5 are satisfied and we have $C^{(1)} = O_p(1)$. By Corollaries 3.2 and 3.5 we can therefore conclude

$$\sqrt{NT}\, (\hat\beta_R - \beta^0) = \sqrt{NT}\, (\hat\beta_{R^0} - \beta^0) + o_p(1).$$

The proofs of Theorem 4.1 and Theorem 4.2 are provided in the supplementary material to this paper. It seems to be quite challenging to extend Theorem 4.2 to non-iid-normal $e_{it}$, given the present status of the literature on eigenvalues and eigenvectors of large dimensional random covariance matrices, and we would like to leave this as a future research topic.

12. To make this argument mathematically rigorous one needs to know the convergence rate of the empirical spectral distribution to its limit law, which is an ongoing research subject in the literature, e.g. Bai (1993), Bai, Miao and Yao (2003), Götze and Tikhomirov (2010).

13. To our knowledge this result is not established for error distributions that are not normal. Soshnikov (2002) has a result under non-normality, but only for asymptotics with $N/T \to 1$.
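These random matrix facts are easy to illustrate numerically. A self-contained sketch (ours; the sizes and the cutoff q are arbitrary choices, and iid standard normal errors are assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, sigma = 300, 300, 1.0
e = sigma * rng.normal(size=(N, T))

rho = np.linalg.eigvalsh(e.T @ e)[::-1]          # eigenvalues, descending

# the largest eigenvalue concentrates near the Marcenko-Pastur upper edge
edge = sigma**2 * (np.sqrt(N) + np.sqrt(T))**2
print(rho[0] / edge)                              # -> close to 1

# rho_{R - R0} > a N wpa1: the top eigenvalues are all of order N
print(rho[:5] / N)

# the eigenvalue sum in Theorem 4.1 stays bounded once the top q
# eigenvalues are excluded (here R - R0 = 2 and q = 10)
R_minus_R0, q = 2, 10
print(np.sum(1.0 / (rho[R_minus_R0 - 1] - rho[q:])))
```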

5 Monte Carlo Simulations

Here we consider a panel model with one regressor ($K = 1$), two factors ($R^0 = 2$), and the following data generating process (DGP):

$$Y_{it} = \beta^0 X_{it} + \sum_{r=1}^{2} \lambda_{ir} f_{tr} + e_{it}, \qquad X_{it} = 1 + \mathring X_{it} + \sum_{r=1}^{2} \left( \lambda_{ir} + \chi_{ir} \right) f_{tr}, \qquad (5.1)$$

where $i = 1, \ldots, N$ and $t = 1, \ldots, T$. The random variables $\mathring X_{it}$, $\lambda_{ir}$, $f_{tr}$, $\chi_{ir}$ and $e_{it}$ are mutually independent; $\mathring X_{it}$ is distributed as iid $\mathcal{N}(1,1)$, and $\lambda_{ir}$, $f_{tr}$ and $\chi_{ir}$ are all distributed as iid $\mathcal{N}(0,1)$. For $e_{it}$ we also assume that it is iid across $i$ and $t$, but we consider two different specifications for the marginal distribution, namely either $\mathcal{N}(0,1)$ or a Student's t-distribution with 5 degrees of freedom. We choose $\beta^0 = 1$ and use 1,000 repetitions in our simulation. For each draw of $Y$ and $X$ we compute the QMLE $\hat\beta_R$ according to equation (2.3) for different values of $R$.

[Table 1 near here; its numeric entries are not recoverable from this transcription.]

Table 1: Simulation results for the bias and standard error (std) of the QMLE $\hat\beta_R$ for different values of $R$, two different sample sizes ($N = T = 50$ and $N = T = 100$), and the two different specifications for $e_{it}$ ($\mathcal{N}(0,1)$ and Student's t with 5 degrees of freedom). The data generating process is described in the main text; in particular, the true number of factors here is $R^0 = 2$. We used 1,000 repetitions in the simulation.

Table 1 reports the bias and standard error of $\hat\beta_R$ for sample sizes $N = T = 50$ and $N = T = 100$. For $R = 0$ (the OLS estimator) and $R = 1$ we have $R < R^0$, i.e. fewer factors are used in the estimation than are present in the DGP. As a result, the QMLE is heavily biased for these values of $R$, since the factor structure in the DGP is correlated with the regressors but is not controlled for in the estimation. In contrast, for all values $R \geq R^0$ the bias of the QMLE is negligible compared to its standard error. Furthermore, the standard error remains almost constant as $R$ increases beyond $R = R^0$; concretely, from $R = 2$ to $R = 5$ it increases only by about 7% for $N = T = 50$ and only by 5% for $N = T = 100$.

Table 2 reports quantiles of the appropriately normalized QMLE for $R \geq R^0$ and $N = T = 100$. One finds that the quantiles remain almost constant as $R$ increases.
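Equation (5.1) pins down the DGP completely, so the experiment can be re-run in outline. The sketch below is ours, not the authors' code: it uses the scalar profile likelihood (2.4), Brent minimization, a smaller sample size and fewer repetitions than the paper for speed, and the distributional choices as reconstructed above ($\mathring X_{it} \sim \mathcal{N}(1,1)$ is our reading of the garbled transcription). For $R \geq 2$ the bias should come out negligible relative to the standard error, as in Table 1:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def simulate(N, T, beta0=1.0, dist="normal", rng=None):
    # DGP (5.1): two factors, regressor correlated with lambda and f
    rng = rng or np.random.default_rng()
    lam = rng.normal(size=(N, 2))
    f = rng.normal(size=(T, 2))
    chi = rng.normal(size=(N, 2))
    e = rng.normal(size=(N, T)) if dist == "normal" else rng.standard_t(5, size=(N, T))
    X = 1.0 + rng.normal(loc=1.0, size=(N, T)) + (lam + chi) @ f.T
    Y = beta0 * X + lam @ f.T + e
    return Y, X

def qmle_1d(Y, X, R):
    # scalar-beta version of the profile likelihood (2.4)
    N, T = Y.shape
    def L(b):
        resid = Y - b * X
        return np.linalg.eigvalsh(resid.T @ resid)[: T - R].sum() / (N * T)
    return minimize_scalar(L, bracket=(0.0, 2.0)).x

rng = np.random.default_rng(0)
draws = [simulate(50, 50, rng=rng) for _ in range(200)]   # 200 reps for speed
for R in range(6):
    b = np.array([qmle_1d(Y, X, R) for Y, X in draws])
    print(R, f"bias={b.mean() - 1.0:+.4f}", f"std={b.std():.4f}")
```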

In particular, the differences in the quantiles across different values of $R$ are small relative to the differences between the quantiles themselves, so that the size of a test statistic based on $\hat\beta_R$ is essentially independent of the choice of $R \geq R^0$.

[Table 2 near here; its numeric entries are not recoverable from this transcription.]

Table 2: Simulation results for the quantiles (2.5%, 5%, 10%, 25%, 50%, 75%, 90%, 95%, 97.5%) of $\sqrt{NT}(\hat\beta_R - \beta^0)$ for $N = T = 100$, the two different specifications of $e_{it}$, different values of $R$ ($R = 2, 3, 4, 5$), and the data generating process as described in the main text (with $R^0 = 2$). We used 1,000 repetitions in the simulation.

The findings of the Monte Carlo simulations described in the last two paragraphs hold just as well for the specification with normally distributed $e_{it}$ as for the specification where $e_{it}$ has a Student's t-distribution. From this finding one may conjecture that Theorem 4.2 also holds for more general error distributions.

6 Conclusions

In this paper we showed that under certain regularity conditions the limiting distribution of the QMLE of a linear panel regression with interactive fixed effects does not change when we include redundant factors in the estimation. The important empirical implication of this result is that one can use an upper bound of the number of factors in the estimation without asymptotic efficiency loss. For inference on the regression coefficients one thus need not worry about consistent estimation of the number of factors in the model. As regularity conditions we mostly impose high-level assumptions, and we verify that these hold under iid normal errors. Our simulation results suggest that normality of the error distribution is not necessary. Along the lines of the arguments presented in Section 4, we expect that progress in the literature on large dimensional random covariance matrices will allow verification of our high-level assumptions under more general error distributions. This is a vital and interesting topic for future research.

A Appendix

A.1 Proof of Consistency

Proof of Theorem 2.1. We first establish a lower bound on $\mathcal{L}_R(\beta)$. Consider the last expression for $\mathcal{L}_R(\beta)$ in equation (2.4), plug in $Y = \beta^0 \cdot X + \lambda^0 f^{0\prime} + e$, and then replace

$\lambda^0 f^{0\prime}$ by a general $\lambda f'$, minimizing over the $N \times R$ matrix $\lambda$ and the $T \times R$ matrix $f$. This gives

$$\mathcal{L}_R(\beta) \geq \min_{F \in \mathbb{R}^{T \times (R + R^0)}} \frac{1}{NT} \mathrm{Tr}\left[ \left( (\beta^0 - \beta) \cdot X + e \right) M_F \left( (\beta^0 - \beta) \cdot X + e \right)' \right]$$
$$\geq b\, \|\beta^0 - \beta\|^2 - O_p\!\left( (\min(N,T))^{-1/2} \right) \|\beta^0 - \beta\| + \frac{1}{NT} \mathrm{Tr}(ee') - O_p\!\left( (\min(N,T))^{-1} \right), \qquad (A.1)$$

where in the first line we minimize over all $T \times (R + R^0)$ matrices $F$, and to arrive at the second line we decomposed the expression into the components quadratic in $\beta^0 - \beta$, linear in $\beta^0 - \beta$, and independent of $\beta^0 - \beta$, and applied Assumptions 1 and 2. Next, we establish an upper bound on $\mathcal{L}_R(\beta^0)$. We have

$$\mathcal{L}_R(\beta^0) = \frac{1}{NT} \sum_{t=R+1}^{T} \mu_t\left[ \left( \lambda^0 f^{0\prime} + e \right)' \left( \lambda^0 f^{0\prime} + e \right) \right] \leq \frac{1}{NT} \mathrm{Tr}\left( e' M_{\lambda^0} e \right) \leq \frac{1}{NT} \mathrm{Tr}(ee'). \qquad (A.2)$$

Further details regarding the derivation of the bounds (A.1) and (A.2) are presented in the supplementary material. Since we could choose $\beta = \beta^0$ in the minimization (2.3), the optimal $\hat\beta_R$ needs to satisfy $\mathcal{L}_R(\hat\beta_R) \leq \mathcal{L}_R(\beta^0)$. With the above results we thus find

$$b\, \|\hat\beta_R - \beta^0\|^2 \leq O_p\!\left( (\min(N,T))^{-1/2} \right) \|\hat\beta_R - \beta^0\| + O_p\!\left( (\min(N,T))^{-1} \right). \qquad (A.3)$$

From this it follows that $\|\hat\beta_R - \beta^0\| = O_p\!\left( (\min(N,T))^{-1/2} \right)$, which is what we wanted to show.

A.2 Proof of Likelihood Expansion

Definition. For the $N \times R^0$ matrix $\lambda$ and the $T \times R^0$ matrix $f$ we define

$$d_{max}(\lambda, f) = \frac{1}{\sqrt{NT}}\, \|\lambda f'\| = \sqrt{\mu_1\!\left( \frac{\lambda f' f \lambda'}{NT} \right)}, \qquad d_{min}(\lambda, f) = \sqrt{\mu_{R^0}\!\left( \frac{\lambda f' f \lambda'}{NT} \right)}, \qquad (A.4)$$

i.e. $d_{max}(\lambda, f)$ and $d_{min}(\lambda, f)$ are the square roots of the maximal and the minimal eigenvalue of $\lambda f' f \lambda'/NT$. Furthermore, the convergence radius $r^0(\lambda, f)$ is defined by

$$r^0(\lambda, f) = \left( \frac{4\, d_{max}(\lambda, f)}{d_{min}^2(\lambda, f)} + \frac{2}{d_{max}(\lambda, f)} \right)^{-1}. \qquad (A.5)$$
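The quantities in (A.4)-(A.5) are directly computable. A short numerical check (ours, using the radius as reconstructed in (A.5)) that $r^0$ stays bounded away from zero under the strong factor assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, R0 = 200, 200, 2
lam, f = rng.normal(size=(N, R0)), rng.normal(size=(T, R0))

mu = np.linalg.eigvalsh(lam @ f.T @ f @ lam.T / (N * T))  # ascending eigenvalues
d_max, d_min = np.sqrt(mu[-1]), np.sqrt(mu[-R0])          # square roots of extremes
r0 = 1.0 / (4 * d_max / d_min**2 + 2 / d_max)             # convergence radius (A.5)
print(d_max, d_min, r0)
```

With iid standard normal loadings and factors, $d_{max}$ and $d_{min}$ are both close to one, so $r^0$ is close to $1/6$, consistent with the claim that $r^0(\lambda^0, f^0)$ converges to a positive constant.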

Lemma A.1. If the following condition is satisfied:

$$\sum_{k=1}^{K} \left| \beta_k^0 - \beta_k \right| \frac{\|X_k\|}{\sqrt{NT}} + \frac{\|e\|}{\sqrt{NT}} < r^0(\lambda^0, f^0), \qquad (A.6)$$

then

(i) the profile quasi likelihood function can be written as a power series in the $K + 1$ parameters $\epsilon_0 = \|e\|/\sqrt{NT}$ and $\epsilon_k = \beta_k^0 - \beta_k$, namely

$$\mathcal{L}^0(\beta) = \sum_{g=2}^{\infty} \sum_{k_1=0}^{K} \sum_{k_2=0}^{K} \cdots \sum_{k_g=0}^{K} \epsilon_{k_1} \epsilon_{k_2} \cdots \epsilon_{k_g}\, L^{(g)}(\lambda^0, f^0, X_{k_1}, X_{k_2}, \ldots, X_{k_g}),$$

where the expansion coefficients are given by¹⁴

$$L^{(g)}(\lambda^0, f^0, X_{k_1}, X_{k_2}, \ldots, X_{k_g}) = \tilde L^{(g)}(\lambda^0, f^0, X_{(k_1}, X_{k_2}, \ldots, X_{k_g)}) = \frac{1}{g!} \left[ \tilde L^{(g)}(\lambda^0, f^0, X_{k_1}, X_{k_2}, \ldots, X_{k_g}) + \text{all permutations of } k_1, \ldots, k_g \right],$$

i.e. $L^{(g)}$ is obtained by total symmetrization of the last $g$ arguments of¹⁵

$$\tilde L^{(g)}(\lambda^0, f^0, X_{k_1}, \ldots, X_{k_g}) = \sum_{p=1}^{g} \frac{(-1)^{p+1}}{p} \sum_{\substack{\nu_1 + \ldots + \nu_p = g \\ m_1 + \ldots + m_{p+1} = p - 1 \\ 1 \leq \nu_j \leq 2,\ 0 \leq m_j}} \mathrm{Tr}\left( S^{(m_1)}\, T^{(\nu_1)}\, S^{(m_2)}\, T^{(\nu_2)} \cdots S^{(m_p)}\, T^{(\nu_p)}\, S^{(m_{p+1})} \right),$$

where the indices $k_1, \ldots, k_g$ are distributed in this order over the factors $T^{(\nu_1)}, \ldots, T^{(\nu_p)}$ (each $T^{(1)}$ carries one index and each $T^{(2)}$ carries two), and

$$S^{(0)} = M_{\lambda^0}, \qquad S^{(m)} = \left[ NT\, \lambda^0 (\lambda^{0\prime}\lambda^0)^{-1} (f^{0\prime}f^0)^{-1} (\lambda^{0\prime}\lambda^0)^{-1} \lambda^{0\prime} \right]^m \text{ for } m \geq 1,$$
$$T^{(1)}_{k} = \frac{1}{NT} \left( \lambda^0 f^{0\prime} X_k' + X_k f^0 \lambda^{0\prime} \right), \qquad T^{(2)}_{k_1 k_2} = \frac{1}{NT}\, X_{k_1} X_{k_2}',$$

for $k, k_1, k_2 = 0, 1, \ldots, K$, where $X_0 = \sqrt{NT}\, e / \|e\|$ and $X_k$, for $k = 1, \ldots, K$, are the regressors.

(ii) the projector $M_{\hat\lambda}(\beta)$ can be written as a power series in the same parameters $\epsilon_k$, $k = 0, \ldots, K$, namely

$$M_{\hat\lambda}(\beta) = \sum_{g=0}^{\infty} \sum_{k_1=0}^{K} \sum_{k_2=0}^{K} \cdots \sum_{k_g=0}^{K} \epsilon_{k_1} \epsilon_{k_2} \cdots \epsilon_{k_g}\, M^{(g)}(\lambda^0, f^0, X_{k_1}, X_{k_2}, \ldots, X_{k_g}),$$

where the expansion coefficients are given by $M^{(0)}(\lambda^0, f^0) = M_{\lambda^0}$, and for $g \geq 1$

$$M^{(g)}(\lambda^0, f^0, X_{k_1}, X_{k_2}, \ldots, X_{k_g}) = \tilde M^{(g)}(\lambda^0, f^0, X_{(k_1}, X_{k_2}, \ldots, X_{k_g)}) = \frac{1}{g!} \left[ \tilde M^{(g)}(\lambda^0, f^0, X_{k_1}, \ldots, X_{k_g}) + \text{all permutations of } k_1, \ldots, k_g \right],$$

14. Here we use the round bracket notation $(k_1, k_2, \ldots, k_g)$ for total symmetrization of these indices.

15. One finds $L^{(1)}(\lambda^0, f^0, X_{k_1}) = 0$, which is why the sum in the power series of $\mathcal{L}^0$ starts from $g = 2$ instead of $g = 1$.

i.e. $M^{(g)}$ is obtained by total symmetrization of the last $g$ arguments of

$$\tilde M^{(g)}(\lambda^0, f^0, X_{k_1}, \ldots, X_{k_g}) = \sum_{p=1}^{g} (-1)^{p+1} \sum_{\substack{\nu_1 + \ldots + \nu_p = g \\ m_1 + \ldots + m_{p+1} = p \\ 1 \leq \nu_j \leq 2,\ 0 \leq m_j}} S^{(m_1)}\, T^{(\nu_1)}\, S^{(m_2)}\, T^{(\nu_2)} \cdots S^{(m_p)}\, T^{(\nu_p)}\, S^{(m_{p+1})},$$

where $S^{(m)}$, $T^{(1)}_k$, $T^{(2)}_{k_1 k_2}$, and $X_k$ are given above.

(iii) For $g \geq 3$ the coefficients $L^{(g)}$ in the series expansion of $\mathcal{L}^0(\beta)$ are bounded as follows:

$$\left| L^{(g)}(\lambda, f, X_{k_1}, X_{k_2}, \ldots, X_{k_g}) \right| \leq \frac{R^0\, g\, d_{min}^2(\lambda, f)}{2} \left( \frac{6\, d_{max}(\lambda, f)}{d_{min}^2(\lambda, f)} \right)^g \frac{\|X_{k_1}\|}{\sqrt{NT}} \frac{\|X_{k_2}\|}{\sqrt{NT}} \cdots \frac{\|X_{k_g}\|}{\sqrt{NT}}.$$

Under the stronger condition

$$\sum_{k=1}^{K} \left| \beta_k^0 - \beta_k \right| \frac{\|X_k\|}{\sqrt{NT}} + \frac{\|e\|}{\sqrt{NT}} < \frac{d_{min}^2(\lambda^0, f^0)}{6\, d_{max}(\lambda^0, f^0)}, \qquad (A.7)$$

we therefore have the following bound on the remainder when the series expansion for $\mathcal{L}^0(\beta)$ is truncated at order $G \geq 2$:

$$\left| \mathcal{L}^0(\beta) - \sum_{g=2}^{G} \sum_{k_1, \ldots, k_g = 0}^{K} \epsilon_{k_1} \cdots \epsilon_{k_g}\, L^{(g)}(\lambda^0, f^0, X_{k_1}, \ldots, X_{k_g}) \right| \leq \frac{R^0\, d_{min}^2(\lambda^0, f^0)}{2}\, \frac{(G+1)\, \alpha^{G+1}}{(1 - \alpha)^2},$$

where

$$\alpha = \frac{6\, d_{max}(\lambda^0, f^0)}{d_{min}^2(\lambda^0, f^0)} \left( \sum_{k=1}^{K} \left| \beta_k^0 - \beta_k \right| \frac{\|X_k\|}{\sqrt{NT}} + \frac{\|e\|}{\sqrt{NT}} \right) < 1.$$

(iv) The operator norm of the coefficient $M^{(g)}$ in the series expansion of $M_{\hat\lambda}(\beta)$ is bounded as follows, for $g \geq 1$:

$$\left\| M^{(g)}(\lambda, f, X_{k_1}, X_{k_2}, \ldots, X_{k_g}) \right\| \leq \frac{g}{2} \left( \frac{6\, d_{max}(\lambda, f)}{d_{min}^2(\lambda, f)} \right)^g \frac{\|X_{k_1}\|}{\sqrt{NT}} \frac{\|X_{k_2}\|}{\sqrt{NT}} \cdots \frac{\|X_{k_g}\|}{\sqrt{NT}}.$$

Under the condition (A.7) we therefore have the following bound on the operator norm


Cross-fitting and fast remainder rates for semiparametric estimation Cross-fitting and fast remainder rates for semiparametric estimation Whitney K. Newey James M. Robins The Institute for Fiscal Studies Department of Economics, UCL cemmap working paper CWP41/17 Cross-Fitting

More information

Specification Test for Instrumental Variables Regression with Many Instruments

Specification Test for Instrumental Variables Regression with Many Instruments Specification Test for Instrumental Variables Regression with Many Instruments Yoonseok Lee and Ryo Okui April 009 Preliminary; comments are welcome Abstract This paper considers specification testing

More information

Non-linear panel data modeling

Non-linear panel data modeling Non-linear panel data modeling Laura Magazzini University of Verona laura.magazzini@univr.it http://dse.univr.it/magazzini May 2010 Laura Magazzini (@univr.it) Non-linear panel data modeling May 2010 1

More information

LECTURE ON HAC COVARIANCE MATRIX ESTIMATION AND THE KVB APPROACH

LECTURE ON HAC COVARIANCE MATRIX ESTIMATION AND THE KVB APPROACH LECURE ON HAC COVARIANCE MARIX ESIMAION AND HE KVB APPROACH CHUNG-MING KUAN Institute of Economics Academia Sinica October 20, 2006 ckuan@econ.sinica.edu.tw www.sinica.edu.tw/ ckuan Outline C.-M. Kuan,

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

Discussion of Bootstrap prediction intervals for linear, nonlinear, and nonparametric autoregressions, by Li Pan and Dimitris Politis

Discussion of Bootstrap prediction intervals for linear, nonlinear, and nonparametric autoregressions, by Li Pan and Dimitris Politis Discussion of Bootstrap prediction intervals for linear, nonlinear, and nonparametric autoregressions, by Li Pan and Dimitris Politis Sílvia Gonçalves and Benoit Perron Département de sciences économiques,

More information

Least Squares Estimation of a Panel Data Model with Multifactor Error Structure and Endogenous Covariates

Least Squares Estimation of a Panel Data Model with Multifactor Error Structure and Endogenous Covariates Least Squares Estimation of a Panel Data Model with Multifactor Error Structure and Endogenous Covariates Matthew Harding and Carlos Lamarche January 12, 2011 Abstract We propose a method for estimating

More information

Applied Microeconometrics (L5): Panel Data-Basics

Applied Microeconometrics (L5): Panel Data-Basics Applied Microeconometrics (L5): Panel Data-Basics Nicholas Giannakopoulos University of Patras Department of Economics ngias@upatras.gr November 10, 2015 Nicholas Giannakopoulos (UPatras) MSc Applied Economics

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

The exact bias of S 2 in linear panel regressions with spatial autocorrelation SFB 823. Discussion Paper. Christoph Hanck, Walter Krämer

The exact bias of S 2 in linear panel regressions with spatial autocorrelation SFB 823. Discussion Paper. Christoph Hanck, Walter Krämer SFB 83 The exact bias of S in linear panel regressions with spatial autocorrelation Discussion Paper Christoph Hanck, Walter Krämer Nr. 8/00 The exact bias of S in linear panel regressions with spatial

More information

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Vector spaces DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Vector space Consists of: A set V A scalar

More information

Econometric Analysis of Cross Section and Panel Data

Econometric Analysis of Cross Section and Panel Data Econometric Analysis of Cross Section and Panel Data Jeffrey M. Wooldridge / The MIT Press Cambridge, Massachusetts London, England Contents Preface Acknowledgments xvii xxiii I INTRODUCTION AND BACKGROUND

More information

GMM estimation of spatial panels

GMM estimation of spatial panels MRA Munich ersonal ReEc Archive GMM estimation of spatial panels Francesco Moscone and Elisa Tosetti Brunel University 7. April 009 Online at http://mpra.ub.uni-muenchen.de/637/ MRA aper No. 637, posted

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma

Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma Suppose again we have n sample points x,..., x n R p. The data-point x i R p can be thought of as the i-th row X i of an n p-dimensional

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

15 Singular Value Decomposition

15 Singular Value Decomposition 15 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Bootstrap prediction intervals for factor models

Bootstrap prediction intervals for factor models Bootstrap prediction intervals for factor models Sílvia Gonçalves and Benoit Perron Département de sciences économiques, CIREQ and CIRAO, Université de Montréal April, 3 Abstract We propose bootstrap prediction

More information

Robust Backtesting Tests for Value-at-Risk Models

Robust Backtesting Tests for Value-at-Risk Models Robust Backtesting Tests for Value-at-Risk Models Jose Olmo City University London (joint work with Juan Carlos Escanciano, Indiana University) Far East and South Asia Meeting of the Econometric Society

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

Random matrices: Distribution of the least singular value (via Property Testing)

Random matrices: Distribution of the least singular value (via Property Testing) Random matrices: Distribution of the least singular value (via Property Testing) Van H. Vu Department of Mathematics Rutgers vanvu@math.rutgers.edu (joint work with T. Tao, UCLA) 1 Let ξ be a real or complex-valued

More information

GMM-based inference in the AR(1) panel data model for parameter values where local identi cation fails

GMM-based inference in the AR(1) panel data model for parameter values where local identi cation fails GMM-based inference in the AR() panel data model for parameter values where local identi cation fails Edith Madsen entre for Applied Microeconometrics (AM) Department of Economics, University of openhagen,

More information

On singular values distribution of a matrix large auto-covariance in the ultra-dimensional regime. Title

On singular values distribution of a matrix large auto-covariance in the ultra-dimensional regime. Title itle On singular values distribution of a matrix large auto-covariance in the ultra-dimensional regime Authors Wang, Q; Yao, JJ Citation Random Matrices: heory and Applications, 205, v. 4, p. article no.

More information

Least Squares Estimation-Finite-Sample Properties

Least Squares Estimation-Finite-Sample Properties Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions

More information

Multiple integrals: Sufficient conditions for a local minimum, Jacobi and Weierstrass-type conditions

Multiple integrals: Sufficient conditions for a local minimum, Jacobi and Weierstrass-type conditions Multiple integrals: Sufficient conditions for a local minimum, Jacobi and Weierstrass-type conditions March 6, 2013 Contents 1 Wea second variation 2 1.1 Formulas for variation........................

More information

UTILIZING PRIOR KNOWLEDGE IN ROBUST OPTIMAL EXPERIMENT DESIGN. EE & CS, The University of Newcastle, Australia EE, Technion, Israel.

UTILIZING PRIOR KNOWLEDGE IN ROBUST OPTIMAL EXPERIMENT DESIGN. EE & CS, The University of Newcastle, Australia EE, Technion, Israel. UTILIZING PRIOR KNOWLEDGE IN ROBUST OPTIMAL EXPERIMENT DESIGN Graham C. Goodwin James S. Welsh Arie Feuer Milan Depich EE & CS, The University of Newcastle, Australia 38. EE, Technion, Israel. Abstract:

More information

8. Instrumental variables regression

8. Instrumental variables regression 8. Instrumental variables regression Recall: In Section 5 we analyzed five sources of estimation bias arising because the regressor is correlated with the error term Violation of the first OLS assumption

More information

Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator

Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator Bootstrapping Heteroskedasticity Consistent Covariance Matrix Estimator by Emmanuel Flachaire Eurequa, University Paris I Panthéon-Sorbonne December 2001 Abstract Recent results of Cribari-Neto and Zarkos

More information

Chapter 2. Dynamic panel data models

Chapter 2. Dynamic panel data models Chapter 2. Dynamic panel data models School of Economics and Management - University of Geneva Christophe Hurlin, Université of Orléans University of Orléans April 2018 C. Hurlin (University of Orléans)

More information

A Factor Analytical Method to Interactive Effects Dynamic Panel Models with or without Unit Root

A Factor Analytical Method to Interactive Effects Dynamic Panel Models with or without Unit Root A Factor Analytical Method to Interactive Effects Dynamic Panel Models with or without Unit Root Joakim Westerlund Deakin University Australia March 19, 2014 Westerlund (Deakin) Factor Analytical Method

More information

Inference about Clustering and Parametric. Assumptions in Covariance Matrix Estimation

Inference about Clustering and Parametric. Assumptions in Covariance Matrix Estimation Inference about Clustering and Parametric Assumptions in Covariance Matrix Estimation Mikko Packalen y Tony Wirjanto z 26 November 2010 Abstract Selecting an estimator for the variance covariance matrix

More information

Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations of High-Dimension, Low-Sample-Size Data

Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations of High-Dimension, Low-Sample-Size Data Sri Lankan Journal of Applied Statistics (Special Issue) Modern Statistical Methodologies in the Cutting Edge of Science Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations

More information

Estimation of random coefficients logit demand models with interactive fixed effects

Estimation of random coefficients logit demand models with interactive fixed effects Estimation of random coefficients logit demand models with interactive fixed effects Hyungsik Roger Moon Matthew Shum Martin Weidner The Institute for Fiscal Studies Department of Economics, UCL cemmap

More information

A Practitioner s Guide to Cluster-Robust Inference

A Practitioner s Guide to Cluster-Robust Inference A Practitioner s Guide to Cluster-Robust Inference A. C. Cameron and D. L. Miller presented by Federico Curci March 4, 2015 Cameron Miller Cluster Clinic II March 4, 2015 1 / 20 In the previous episode

More information

Flexible Estimation of Treatment Effect Parameters

Flexible Estimation of Treatment Effect Parameters Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both

More information

12.4 Known Channel (Water-Filling Solution)

12.4 Known Channel (Water-Filling Solution) ECEn 665: Antennas and Propagation for Wireless Communications 54 2.4 Known Channel (Water-Filling Solution) The channel scenarios we have looed at above represent special cases for which the capacity

More information

Economics 583: Econometric Theory I A Primer on Asymptotics

Economics 583: Econometric Theory I A Primer on Asymptotics Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:

More information

The Linear Regression Model

The Linear Regression Model The Linear Regression Model Carlo Favero Favero () The Linear Regression Model 1 / 67 OLS To illustrate how estimation can be performed to derive conditional expectations, consider the following general

More information

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9.

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9. Section 7 Model Assessment This section is based on Stock and Watson s Chapter 9. Internal vs. external validity Internal validity refers to whether the analysis is valid for the population and sample

More information

Porcupine Neural Networks: (Almost) All Local Optima are Global

Porcupine Neural Networks: (Almost) All Local Optima are Global Porcupine Neural Networs: (Almost) All Local Optima are Global Soheil Feizi, Hamid Javadi, Jesse Zhang and David Tse arxiv:1710.0196v1 [stat.ml] 5 Oct 017 Stanford University Abstract Neural networs have

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

Stat 159/259: Linear Algebra Notes

Stat 159/259: Linear Algebra Notes Stat 159/259: Linear Algebra Notes Jarrod Millman November 16, 2015 Abstract These notes assume you ve taken a semester of undergraduate linear algebra. In particular, I assume you are familiar with the

More information

Linear Models in Econometrics

Linear Models in Econometrics Linear Models in Econometrics Nicky Grant At the most fundamental level econometrics is the development of statistical techniques suited primarily to answering economic questions and testing economic theories.

More information

Econ 582 Fixed Effects Estimation of Panel Data

Econ 582 Fixed Effects Estimation of Panel Data Econ 582 Fixed Effects Estimation of Panel Data Eric Zivot May 28, 2012 Panel Data Framework = x 0 β + = 1 (individuals); =1 (time periods) y 1 = X β ( ) ( 1) + ε Main question: Is x uncorrelated with?

More information

Tests of the Co-integration Rank in VAR Models in the Presence of a Possible Break in Trend at an Unknown Point

Tests of the Co-integration Rank in VAR Models in the Presence of a Possible Break in Trend at an Unknown Point Tests of the Co-integration Rank in VAR Models in the Presence of a Possible Break in Trend at an Unknown Point David Harris, Steve Leybourne, Robert Taylor Monash U., U. of Nottingam, U. of Essex Economics

More information

The regression model with one stochastic regressor (part II)

The regression model with one stochastic regressor (part II) The regression model with one stochastic regressor (part II) 3150/4150 Lecture 7 Ragnar Nymoen 6 Feb 2012 We will finish Lecture topic 4: The regression model with stochastic regressor We will first look

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

A Robust Approach to Estimating Production Functions: Replication of the ACF procedure

A Robust Approach to Estimating Production Functions: Replication of the ACF procedure A Robust Approach to Estimating Production Functions: Replication of the ACF procedure Kyoo il Kim Michigan State University Yao Luo University of Toronto Yingjun Su IESR, Jinan University August 2018

More information

Large Sample Properties of Estimators in the Classical Linear Regression Model

Large Sample Properties of Estimators in the Classical Linear Regression Model Large Sample Properties of Estimators in the Classical Linear Regression Model 7 October 004 A. Statement of the classical linear regression model The classical linear regression model can be written in

More information

Convergence of Eigenspaces in Kernel Principal Component Analysis

Convergence of Eigenspaces in Kernel Principal Component Analysis Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation

More information

Estimating Gaussian Mixture Densities with EM A Tutorial

Estimating Gaussian Mixture Densities with EM A Tutorial Estimating Gaussian Mixture Densities with EM A Tutorial Carlo Tomasi Due University Expectation Maximization (EM) [4, 3, 6] is a numerical algorithm for the maximization of functions of several variables

More information

Estimation, Inference, and Hypothesis Testing

Estimation, Inference, and Hypothesis Testing Chapter 2 Estimation, Inference, and Hypothesis Testing Note: The primary reference for these notes is Ch. 7 and 8 of Casella & Berger 2. This text may be challenging if new to this topic and Ch. 7 of

More information

Missing dependent variables in panel data models

Missing dependent variables in panel data models Missing dependent variables in panel data models Jason Abrevaya Abstract This paper considers estimation of a fixed-effects model in which the dependent variable may be missing. For cross-sectional units

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Econometrics. Week 4. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague

Econometrics. Week 4. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Econometrics Week 4 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 23 Recommended Reading For the today Serial correlation and heteroskedasticity in

More information

Lecture 7: Dynamic panel models 2

Lecture 7: Dynamic panel models 2 Lecture 7: Dynamic panel models 2 Ragnar Nymoen Department of Economics, UiO 25 February 2010 Main issues and references The Arellano and Bond method for GMM estimation of dynamic panel data models A stepwise

More information

Estimation of the Conditional Variance in Paired Experiments

Estimation of the Conditional Variance in Paired Experiments Estimation of the Conditional Variance in Paired Experiments Alberto Abadie & Guido W. Imbens Harvard University and BER June 008 Abstract In paired randomized experiments units are grouped in pairs, often

More information

Optimal spectral shrinkage and PCA with heteroscedastic noise

Optimal spectral shrinkage and PCA with heteroscedastic noise Optimal spectral shrinage and PCA with heteroscedastic noise William Leeb and Elad Romanov Abstract This paper studies the related problems of denoising, covariance estimation, and principal component

More information

A Course in Applied Econometrics Lecture 7: Cluster Sampling. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 7: Cluster Sampling. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 7: Cluster Sampling Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of roups and

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Eigenvalues, random walks and Ramanujan graphs

Eigenvalues, random walks and Ramanujan graphs Eigenvalues, random walks and Ramanujan graphs David Ellis 1 The Expander Mixing lemma We have seen that a bounded-degree graph is a good edge-expander if and only if if has large spectral gap If G = (V,

More information

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018 Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate

More information

Obtaining Critical Values for Test of Markov Regime Switching

Obtaining Critical Values for Test of Markov Regime Switching University of California, Santa Barbara From the SelectedWorks of Douglas G. Steigerwald November 1, 01 Obtaining Critical Values for Test of Markov Regime Switching Douglas G Steigerwald, University of

More information

A NOTE ON SPURIOUS BREAK

A NOTE ON SPURIOUS BREAK Econometric heory, 14, 1998, 663 669+ Printed in the United States of America+ A NOE ON SPURIOUS BREAK JUSHAN BAI Massachusetts Institute of echnology When the disturbances of a regression model follow

More information

The propensity score with continuous treatments

The propensity score with continuous treatments 7 The propensity score with continuous treatments Keisuke Hirano and Guido W. Imbens 1 7.1 Introduction Much of the work on propensity score analysis has focused on the case in which the treatment is binary.

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

Small Ball Probability, Arithmetic Structure and Random Matrices

Small Ball Probability, Arithmetic Structure and Random Matrices Small Ball Probability, Arithmetic Structure and Random Matrices Roman Vershynin University of California, Davis April 23, 2008 Distance Problems How far is a random vector X from a given subspace H in

More information

ECON 4160, Autumn term Lecture 1

ECON 4160, Autumn term Lecture 1 ECON 4160, Autumn term 2017. Lecture 1 a) Maximum Likelihood based inference. b) The bivariate normal model Ragnar Nymoen University of Oslo 24 August 2017 1 / 54 Principles of inference I Ordinary least

More information

Inference For High Dimensional M-estimates: Fixed Design Results

Inference For High Dimensional M-estimates: Fixed Design Results Inference For High Dimensional M-estimates: Fixed Design Results Lihua Lei, Peter Bickel and Noureddine El Karoui Department of Statistics, UC Berkeley Berkeley-Stanford Econometrics Jamboree, 2017 1/49

More information

Singular Value Decomposition

Singular Value Decomposition Chapter 6 Singular Value Decomposition In Chapter 5, we derived a number of algorithms for computing the eigenvalues and eigenvectors of matrices A R n n. Having developed this machinery, we complete our

More information

Interpreting Regression Results

Interpreting Regression Results Interpreting Regression Results Carlo Favero Favero () Interpreting Regression Results 1 / 42 Interpreting Regression Results Interpreting regression results is not a simple exercise. We propose to split

More information

ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS

ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS Olivier Scaillet a * This draft: July 2016. Abstract This note shows that adding monotonicity or convexity

More information

Applied Econometrics (QEM)

Applied Econometrics (QEM) Applied Econometrics (QEM) based on Prinicples of Econometrics Jakub Mućk Department of Quantitative Economics Jakub Mućk Applied Econometrics (QEM) Meeting #3 1 / 42 Outline 1 2 3 t-test P-value Linear

More information