Nuclear Norm Regularized Estimation of Panel Regression Models

Nuclear Norm Regularized Estimation of Panel Regression Models

Hyungsik Roger Moon    Martin Weidner

February 2018

Abstract

In this paper we investigate panel regression models with interactive fixed effects. We propose two new estimation methods that are based on minimizing convex objective functions. The first estimation method minimizes the sum of squared residuals with a nuclear (trace) norm regularization. The second estimation method minimizes the nuclear norm of the residuals. We first establish the consistency of the two estimators, and then show how to use them as preliminary estimators to construct an estimator that is asymptotically equivalent to the QMLE in Bai (2009) and Moon and Weidner (2017). For this, we propose an iteration procedure and derive its asymptotic properties.

Keywords: Interactive Fixed Effects, Factors, Nuclear Norm Regularization, Nuclear Norm Minimization, Convex Optimization, Iterative Estimation

1 Introduction

Interactive fixed effects panel regression models have been widely studied in the recent panel literature. The interactive fixed effects model parsimoniously represents heterogeneity in both dimensions of the panel and includes the conventional additive error component model as a special case.

[Author affiliations: Moon — Department of Economics, University of Southern California, Los Angeles, CA (moonr@usc.edu), and School of Economics, Yonsei University, Seoul, Korea. Weidner — Department of Economics, University College London, Gower Street, London WC1E 6BT, U.K., and CeMMAP (m.weidner@ucl.ac.uk).]

One of the widely used estimation methods for the interactive fixed effects panel regression is the fixed effect approach, or principal component approach in a linear model, which treats the interactive fixed effects in factor form as parameters to estimate. For example, Bai (2009), Moon and Weidner (2015), and Moon and Weidner (2017) investigated asymptotic properties of the fixed effect least squares estimator for linear panel regression models with interactive fixed effects, and Chen, Fernandez-Val, and Weidner (2014) studied the fixed effect maximum likelihood estimator for nonlinear panel regressions with interactive fixed effects. The main advantage of the fixed effect or least squares (LS) approach is that it does not restrict the relationship between the unobserved heterogeneity and the observed explanatory variables. On the other hand, it also has drawbacks. First, the computation of the fixed effects estimator requires solving a non-convex optimization problem with respect to high dimensional fixed effects parameters. Even in a linear regression, where a closed form profile objective function is available, the non-convexity can become a serious practical problem if the dimension of the regression coefficients is large. The second problem is a theoretical issue. When the dimension of the factors, or equivalently the rank of the interactive fixed effects matrix, is unknown, the fixed effect estimator of the coefficients of low-rank regressors can be inconsistent.

As the main contribution of the paper, we propose two new estimation methods that overcome these two problems of the existing fixed effect estimation of the interactive fixed effects panel model. The first estimation method minimizes the sum of squared residuals with a nuclear (trace) norm regularization. The second estimation method minimizes the nuclear norm of the residuals. These two estimators are based on convex objective functions and overcome the computational problem of the fixed effect estimation. Moreover, the nuclear norm regularized estimation overcomes the potential inconsistency of the low-rank regressor coefficients through regularization. Here the role of the regularization is similar to that in the penalized sieve method (see, e.g., Chen's survey of sieve estimation). Nuclear norm penalized estimation has been widely studied in the machine learning and statistical learning literature, and very recently in the econometrics literature (Bai and Ng 2017; Athey, Bayati, Doudchenko, Imbens, and Khosravi 2017).

[Footnote 1: Other estimation methods for panel regressions with interactive fixed effects include the quasi-difference approach (e.g., Holtz-Eakin, Newey, and Rosen 1988), the common correlated random effects method (e.g., Pesaran 2006), the decision theoretic approach (e.g., Chamberlain and Moreira 2009), and Lasso-type shrinkage methods on fixed effects (e.g., Cheng, Liao, and Schorfheide 2016; Lu and Su 2016; Su, Shi, and Phillips 2016). Earlier papers on the fixed effect approach include Kiefer (1980) and Ahn, Lee, and Schmidt (2001).]

The main difference is that in our setup the parameters of interest are the regression coefficients, while the high dimensional

low-rank matrix is incidental. In most machine learning and statistical learning papers, the parameter of interest is the high dimensional low-rank matrix (the interactive fixed effects), and the nuclear norm regularization is used to complete the matrix (e.g., Recht, Fazel, and Parrilo 2010, and Hastie, Tibshirani, and Wainwright 2015 for recent surveys) or to deal with missing observations in panels (Bai and Ng 2017; Athey, Bayati, Doudchenko, Imbens, and Khosravi 2017). The nuclear norm minimization estimation, as far as we know, is new to the literature and has not been studied before.

In this paper we establish the asymptotics of the two estimators. As for the nuclear norm regularized estimation, we show that with a proper choice of the regularization parameter the nuclear norm regularized estimator of the interactive fixed effects matrix is consistent in the Frobenius norm, and the nuclear norm regularized estimator of the regression coefficients is close to $\sqrt{\min(N,T)}$-consistent. As for the nuclear norm minimization estimation, we show that under certain regularity conditions the estimator minimizing the nuclear norm of the residuals is $\sqrt{\min(N,T)}$-consistent. We also show how to use these two estimators as preliminary estimators to construct an estimator that is asymptotically equivalent to the QMLE in Bai (2009) and Moon and Weidner (2017). For this, we propose an iteration procedure that achieves the asymptotic equivalence.

The paper is organized as follows. Section 2 introduces the model. In Section 3 we discuss the two aforementioned problems of the fixed effect estimation method as the main motivation of the paper. In Section 4 we define the nuclear norm regularized estimator, introduce regularity conditions, and derive the consistency of the estimator. In Section 5 we define the nuclear norm minimization estimation method and discuss how it is related to the regularization method in Section 4. We also derive the consistency of that estimator. In Section 6 we show how to use these two estimators as preliminary estimators and construct an estimator through iterations that achieves asymptotic equivalence to the fixed effect estimator. In Section 7 we show how to extend the analysis to nonlinear panel regressions with interactive fixed effects. Section 8 investigates finite sample properties of the estimators of Sections 4, 5, and 6 through small scale Monte Carlo simulations. Section 9 concludes the paper. All technical derivations and proofs are presented in the appendix.

Notation: For a matrix $A$, we denote $P_A = A(A'A)^{-1}A'$ and $M_A = I - P_A$. For a matrix $A$, $\mathrm{vec}(A)$ is the vector that stacks the columns of $A$. Denote $\mathrm{mat}(\cdot)$ as the inverse operator of $\mathrm{vec}(\cdot)$, so that for $a = \mathrm{vec}(A)$ we have $\mathrm{mat}(a) = A$. For matrices $A_1,\ldots,A_K$ and a $K$-vector $\alpha = (\alpha_1,\ldots,\alpha_K)'$, we denote $\alpha\cdot A := \sum_{k=1}^{K}\alpha_k A_k$. For an $N\times T$ matrix $A$, we denote by $s_r(A)$ the $r$-th largest singular value of $A$, for $r = 1,\ldots,\min(N,T)$.

2 Model

Consider a linear panel regression with interactive fixed effects,
$$y_{it} = \beta_0'x_{it} + \gamma_{0,it} + e_{it}, \qquad \text{where } \gamma_{0,it} = \lambda_{0,i}'f_{0,t},$$
or in matrix and vector notation
$$Y = \sum_{k=1}^{K} X_k\,\beta_{0,k} + \Gamma_0 + E =: \beta_0\cdot X + \Gamma_0 + E, \qquad (2.1)$$
where $Y, X_1,\ldots,X_K \in \mathbb{R}^{N\times T}$ are observed $N\times T$ matrices of panel data, $E \in \mathbb{R}^{N\times T}$ represents the panel of unobserved errors, and $\beta_0 = (\beta_{0,1},\ldots,\beta_{0,K})'$ is the main parameter of interest. The parameter $\Gamma_0 = \lambda_0 f_0'$ represents the interactive fixed effects, where $\lambda_0 \in \mathbb{R}^{N\times R_0}$ and $f_0 \in \mathbb{R}^{T\times R_0}$ with $R_0 \le \min(N,T)$. We use small letters to denote vectorized variables and parameters. Let $y = \mathrm{vec}(Y)$, $x_k = \mathrm{vec}(X_k)$, $\gamma_0 = \mathrm{vec}(\Gamma_0)$, and $e = \mathrm{vec}(E)$. Define $x = (x_1,\ldots,x_K)$. We can express the model (2.1) as
$$y = x\beta_0 + \gamma_0 + e = x\beta_0 + (f_0\otimes\lambda_0)\,\mathrm{vec}(I_{R_0}) + e. \qquad (2.2)$$
The main goal of the paper is to estimate $\beta_0$ in the presence of the interactive fixed effects in the panel.

3 Motivations

3.1 Computation

One of the popular estimation methods is the least squares (LS), or quasi-maximum likelihood, method. Let $L(\beta,\Gamma)$ be the quasi-Gaussian likelihood objective function of the parameters $(\beta,\Gamma)$,
$$L(\beta,\Gamma) = \tfrac{1}{2}\,\big\|Y - \beta\cdot X - \Gamma\big\|_F^2.$$

The LS estimator of $\beta_0$ minimizes the objective function imposing the interactive fixed effects restriction $\Gamma = \lambda f'$ on $\Gamma$:
$$\hat\beta_{LS} := \mathrm{argmin}_\beta\;\min_{\lambda\in\mathbb{R}^{N\times R_0},\,f\in\mathbb{R}^{T\times R_0}} L(\beta,\lambda f').$$
It is known that under regularity conditions, as $N,T\to\infty$ with $N/T\to\kappa^2$, the LS estimator $\hat\beta_{LS}$ is $\sqrt{NT}$-consistent and asymptotically normal with a bias in the limit (e.g., Bai 2009, Moon and Weidner 2015, and Moon and Weidner 2017). The bias in the limiting distribution is caused by heteroskedasticity of the error term $e_{it}$ across $i$ or over $t$ (e.g., Bai 2009) and by sequential exogeneity of the regressor $x_{it}$ (e.g., Moon and Weidner 2017).

Even with all these nice theoretical properties of the LS estimation, its computation can be a practical issue when estimating the interactive fixed effects panel regression model by the least squares approach. The reason is that it requires optimizing a non-convex objective function with respect to $(\beta,\lambda,f)$. To see this, notice that the interactive fixed effects least squares objective function is
$$L(\beta,\lambda f') = L(\beta,\Gamma) \qquad \text{subject to } \mathrm{rank}(\Gamma) = R_0.$$
The non-convexity problem arises because, even though the unconstrained objective function $L(\beta,\Gamma)$ is convex in $(\beta,\Gamma)$, imposing the rank restriction $\mathrm{rank}(\Gamma) = R_0$ makes the optimization problem non-convex. In the case of a linear regression like (2.1), we have a closed form profile objective function as a function of $\beta$. The LS estimator $\hat\beta_{LS}$ minimizes $L(\beta,\Gamma)$ imposing the rank restriction $\Gamma = \lambda f'$:
$$\hat\beta_{LS} = \mathrm{argmin}_\beta\;\min_{\lambda\in\mathbb{R}^{N\times R_0},\,f\in\mathbb{R}^{T\times R_0}} \tfrac{1}{2}\big\|Y - \beta\cdot X - \lambda f'\big\|_F^2 = \mathrm{argmin}_\beta\;\tfrac{1}{2}\sum_{r=R_0+1}^{\min\{N,T\}} \big[s_r(Y - \beta\cdot X)\big]^2. \qquad (3.1)$$
In this case the profile objective function has a closed form, and the non-convexity matters only for optimizing the profile objective function with respect to $\beta$. However, even in this case, if the dimension of $\beta$ is high, then it is not easy to find the global minimum quickly, and it may require a long and careful numerical search. In the appendix we provide an example of a non-convex profile LS objective function.
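To make the computational point concrete, the following minimal numpy sketch (not taken from the paper; the simulated design is an illustrative assumption) evaluates the closed-form profile objective in (3.1) on a grid of β values. Plotting the resulting curve for designs in which the regressor is correlated with the factor structure illustrates the possible non-convexity in β.

```python
import numpy as np

def profile_ls_objective(beta, Y, X_list, R0):
    """Profile LS objective of (3.1): (1/2) * sum of squared singular values
    of Y - beta.X beyond the first R0 (Eckart-Young)."""
    resid = Y - sum(b * Xk for b, Xk in zip(np.atleast_1d(beta), X_list))
    s = np.linalg.svd(resid, compute_uv=False)
    return 0.5 * np.sum(s[R0:] ** 2)

# Illustrative simulated panel with one regressor correlated with the factors.
rng = np.random.default_rng(0)
N, T, R0, beta0 = 100, 100, 2, 1.0
lam, f = rng.standard_normal((N, R0)), rng.standard_normal((T, R0))
X = rng.standard_normal((N, T)) + lam @ f.T
Y = beta0 * X + lam @ f.T + rng.standard_normal((N, T))

grid = np.linspace(-1.0, 3.0, 81)
values = [profile_ls_objective(b, Y, [X], R0) for b in grid]
print("argmin on grid:", grid[int(np.argmin(values))])
```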

When the panel regression model is nonlinear, such as a panel binary choice regression (e.g., Chen, Fernandez-Val, and Weidner 2014), optimizing a non-convex objective function with respect to the high-dimensional parameters $\lambda$ and $f$ can be challenging.

3.2 Low-rank Regressors and Unknown Number of Factors

When estimating a panel regression with interactive fixed effects, both the regression coefficients $\beta_0$ and the number of factors $R_0$ are usually unknown. If both $N$ and $T$ are sufficiently large and all factors are sufficiently strong, then it is known how to estimate $\beta_0$ and $R_0$, as long as all regressors are high-rank regressors, by which we mean that the rank of the $N\times T$ matrix $X_k$ is large for all $k$. Namely, in that case the consistency proofs of Bai (2009) and Moon and Weidner (2015) are applicable to the LS estimator of $\beta$ even if the number of factors used in the estimation is too large, that is, if only an upper bound on $R_0$ is known. The preliminary estimator $\hat\beta_{pre}$ obtained in that way can be shown to be $\sqrt{\min(N,T)}$-consistent, and one can therefore consistently estimate the number of factors from $Y - \hat\beta_{pre}\cdot X$ using methods for pure factor models without regressors, e.g. the procedure in Bai and Ng (2002).

However, if there are also low-rank regressors, that is, if $\mathrm{rank}(X_k)$ is very small for some $k$, then the LS estimator with interactive fixed effects cannot be used as a preliminary estimator if the number of factors is unknown. In Moon and Weidner (2017) we give a consistency proof that allows for low-rank regressors, but only if the number of factors is known, and this LS estimator is indeed not robust to using the wrong number of factors.

To illustrate this, consider the following DGP for the regression model $y_{it} = \beta_0\,x_{it} + \lambda_{0,i}f_{0,t} + e_{it}$ with $K = 1$ regressor and $R_0 = 1$ factor,
$$x_{it} = w_i v_t, \qquad \lambda_{0,i} = w_i + \tilde\lambda_i, \qquad f_{0,t} = v_t + \tilde f_t, \qquad e_{it} = \sum_{j=1}^{N}\sum_{s=1}^{T}\frac{1 + w_i w_j}{\sqrt{N}}\,u_{js}\,\frac{1 + v_s v_t}{\sqrt{T}},$$
where $\tilde\lambda_i$, $\tilde f_t$ and $u_{it}$ are all mutually independent and standard normally distributed, and $w = (w_1,\ldots,w_N)'$ and $v = (v_1,\ldots,v_T)'$ are non-random with $\|w\|^2 = N$ and $\|v\|^2 = T$. One can just set $w$ and $v$ equal to the vector of ones, in which case $x_{it} = 1$ becomes a constant regressor. It is easy to verify that this DGP satisfies all regularity conditions for the consistency proof in Moon and Weidner (2017); that is, we know from that paper that the LS estimator with the correct number of factors $R = R_0 = 1$ is consistent. However, it can be shown that the LS estimator with interactive fixed effects is inconsistent in this example whenever any other number of factors is chosen, both $R = 0$ and $R > 1$. Specifically, if $R = 0$ is chosen, then it is easy to see that the resulting OLS estimator converges to $\beta_0 + 1$, because $x_{it}$ and $\lambda_{0,i}f_{0,t}$ have a sample correlation that converges to one. For $R > 1$ the large $N,T$ limit of the estimator depends on the ratio of $N$ and $T$, but in any case the limiting profile LS objective

function turns out to have a local maximum at $\beta_0$, instead of a local minimum, and the LS estimator is never consistent for $R > 1$ (see the appendix).

This example illustrates a more general problem: if low-rank regressors are present, the LS estimator for $\beta$ is in general only consistent if we use the correct number of factors $R_0$ in the estimation, but in order to estimate $R_0$ consistently, we first need a consistent estimator for $\beta$. This argument is obviously circular. To overcome this problem we require an alternative initial estimator for $\beta$ which is not the LS estimator with interactive fixed effects. One candidate is the common correlated effects estimator of Pesaran (2006), but the requirements for this estimator are quite different from those of the LS estimator, and we therefore do not explore this possibility further here. Preliminary consistent estimators for $\beta_0$ that are much more closely related to the LS estimator with interactive fixed effects are the nuclear norm regularized and the nuclear norm minimization estimators discussed in this paper. Those estimators allow estimation of $\beta$ without having to specify a number of factors first.

4 Nuclear Norm Regularized Estimation

Before we introduce the first of our estimators, we introduce a matrix norm called the nuclear norm. For an $N\times T$ matrix $\Gamma$, the nuclear norm of $\Gamma$, denoted $\|\Gamma\|_*$, is defined by
$$\|\Gamma\|_* := \sum_{r=1}^{\min(N,T)} s_r(\Gamma) = \mathrm{Tr}\big[(\Gamma\Gamma')^{1/2}\big] = \sup_{\{B:\,\|B\|\le 1\}} \mathrm{Tr}(B'\Gamma),$$
where $\|B\|$ is the spectral norm of the matrix $B$. In the literature, the nuclear norm is also called the trace norm, the Schatten 1-norm, or the Ky Fan $n$-norm. An important property of the nuclear norm is that $\|\Gamma\|_*$ is convex in $\Gamma$. Also, denoting by $\|A\|_F$ the Frobenius norm of a matrix $A$, we have $\|A\| \le \|A\|_F \le \|A\|_*$ and $|\mathrm{Tr}(A'B)| \le \|A\|_*\,\|B\|$.

Our first estimator minimizes the least squares objective function $L(\beta,\Gamma)$ with a regularization of the nuclear norm of the nuisance parameter $\Gamma$. Define
$$Q_\psi(\beta,\Gamma) = L(\beta,\Gamma) + \psi\,\|\Gamma\|_*,$$
where $\psi = \psi_{NT}$ is a regularization parameter such that $\mathrm{plim}_{N,T\to\infty}\,\psi_{NT}/\sqrt{NT} = 0$. The nuclear norm regularized estimators of $(\beta_0,\Gamma_0)$ are
$$\big(\hat\beta_\psi,\,\hat\Gamma_\psi\big) = \mathrm{argmin}_{\beta,\Gamma}\; Q_\psi(\beta,\Gamma). \qquad (4.1)$$
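As a quick numerical illustration of these norm facts (a reader's sketch, not part of the paper's formal development), one can verify the chain ‖A‖ ≤ ‖A‖_F ≤ ‖A‖_* and the trace inequality |Tr(A'B)| ≤ ‖A‖_* ‖B‖ on randomly generated matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 20))
B = rng.standard_normal((30, 20))

spec = np.linalg.norm(A, 2)                        # spectral norm: largest singular value
frob = np.linalg.norm(A, 'fro')                    # Frobenius norm
nuc = np.linalg.svd(A, compute_uv=False).sum()     # nuclear norm: sum of singular values

assert spec <= frob + 1e-10 and frob <= nuc + 1e-10
# Hoelder-type inequality |Tr(A'B)| <= ||A||_* * ||B|| (spectral norm of B)
assert abs(np.trace(A.T @ B)) <= nuc * np.linalg.norm(B, 2) + 1e-10
print(spec, frob, nuc)
```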

An important feature of the objective function $Q_\psi(\beta,\Gamma)$ is its convexity in $(\beta,\Gamma)$. This is true because the least squares objective function $L(\beta,\Gamma)$ is convex in $(\beta,\Gamma)$ and the penalty term $\|\Gamma\|_*$ is also convex in $\Gamma$. Therefore, the profile penalized objective function $\min_{\Gamma\in\mathbb{R}^{N\times T}} Q_\psi(\beta,\Gamma)$ is also convex in $\beta$. However, since the nuclear norm $\|\Gamma\|_*$ is not globally differentiable, the penalized objective function $Q_\psi(\beta,\Gamma)$ and its profile function $\min_{\Gamma\in\mathbb{R}^{N\times T}} Q_\psi(\beta,\Gamma)$ are not globally differentiable.

When the outcome equation is linear in $\beta$, as in (2.1), we can compute $(\hat\beta_\psi,\hat\Gamma_\psi)$ in (4.1) easily. First, we show that in the linear regression (2.1) it is possible to derive the profiled penalized objective function and the penalized estimator $\hat\Gamma_\psi$ of $\Gamma_0$ in closed form.

Lemma 1. (a) The profile penalized objective function $\min_{\Gamma\in\mathbb{R}^{N\times T}} Q_\psi(\beta,\Gamma)$ has the following closed form:
$$\min_{\Gamma\in\mathbb{R}^{N\times T}} Q_\psi(\beta,\Gamma) = \sum_{r=1}^{\min(N,T)}\Big[\tfrac{1}{2}\big(\min\big\{\psi,\,s_r(Y-\beta\cdot X)\big\}\big)^2 + \psi\,\max\big\{0,\;s_r(Y-\beta\cdot X)-\psi\big\}\Big].$$
(b) For given $\beta$, let $U_{Y-\beta\cdot X}$ and $V_{Y-\beta\cdot X}$ be the left and right singular vector matrices of $Y-\beta\cdot X$, respectively. Then
$$\hat\Gamma_\psi = U_{Y-\hat\beta_\psi\cdot X}\;\mathrm{diag}(\hat\gamma)\;V_{Y-\hat\beta_\psi\cdot X}'\,, \qquad \text{where } \hat\gamma_r = \max\big[0,\;s_r\big(Y-\hat\beta_\psi\cdot X\big)-\psi\big].$$

Having $\min_{\Gamma\in\mathbb{R}^{N\times T}} Q_\psi(\beta,\Gamma)$ in closed form, and noticing that it is convex in $\beta$, one can compute $\hat\beta_\psi$ using various optimization algorithms for convex functions (e.g., see chapter 5 of Hastie, Tibshirani, and Wainwright 2015). When the dimension of $\beta$ is small, one may even use a grid search.
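A minimal sketch of Lemma 1 in code (assuming the objective as reconstructed above, with least squares term ½‖Y − β·X − Γ‖²_F): for a given β, the penalized estimator of Γ soft-thresholds the singular values of Y − β·X at the level ψ, and the profile objective is the sum in Lemma 1(a).

```python
import numpy as np

def gamma_hat(beta, Y, X_list, psi):
    """Closed-form penalized estimator of Gamma for given beta (Lemma 1(b)):
    soft-threshold the singular values of Y - beta.X at the level psi."""
    resid = Y - sum(b * Xk for b, Xk in zip(np.atleast_1d(beta), X_list))
    U, s, Vt = np.linalg.svd(resid, full_matrices=False)
    return U @ np.diag(np.maximum(s - psi, 0.0)) @ Vt

def profile_objective(beta, Y, X_list, psi):
    """Profile penalized objective min_Gamma Q_psi(beta, Gamma), as in Lemma 1(a)."""
    resid = Y - sum(b * Xk for b, Xk in zip(np.atleast_1d(beta), X_list))
    s = np.linalg.svd(resid, compute_uv=False)
    return np.sum(0.5 * np.minimum(psi, s) ** 2 + psi * np.maximum(s - psi, 0.0))
```

Because the profile objective is convex in β, it can be minimized by a generic convex optimizer or, for a small-dimensional β, by a grid search.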

4.1 Consistency of $\hat\Gamma_\psi$ and $\hat\beta_\psi$

In this subsection we establish the consistency of $(\hat\beta_\psi,\hat\Gamma_\psi)$. Suppose that $\Gamma_0 \ne 0$; we discuss the case $\Gamma_0 = 0$ in the remark that follows Theorem 1. A critical condition required for the consistency of $\hat\Gamma_\psi$, and consequently of $\hat\beta_\psi$, is the following condition, called restricted strong convexity.

Assumption 1 (Restricted Strong Convexity). Let
$$\mathcal{C} = \Big\{\Theta\in\mathbb{R}^{N\times T}\;:\;\big\|M_{\lambda_0}\Theta M_{f_0}\big\|_* \le 3\,\big\|\Theta - M_{\lambda_0}\Theta M_{f_0}\big\|_*\Big\}.$$
Let there exist $\mu > 0$, independent of $N$ and $T$, such that for any $\theta\in\mathbb{R}^{NT}$ with $\mathrm{mat}(\theta)\in\mathcal{C}$ we have
$$\theta'M_x\,\theta \ge \mu\,\theta'\theta, \qquad \text{for all } N,T.$$

This condition assumes that the quadratic term $\frac{1}{2}(\gamma-\gamma_0)'M_x(\gamma-\gamma_0)$ of the profile objective function $\min_\beta L(\beta,\Gamma)$ is bounded below by the strictly convex function $\frac{\mu}{2}(\gamma-\gamma_0)'(\gamma-\gamma_0)$ whenever $\Gamma-\Gamma_0$ belongs to the cone $\mathcal{C}$. Notice that without any restriction on the parameter we cannot find a strictly positive constant $\mu > 0$ such that $(\gamma-\gamma_0)'M_x(\gamma-\gamma_0) \ge \mu\,(\gamma-\gamma_0)'(\gamma-\gamma_0)$ holds for all $\Gamma$. Assumption 1 states that if we restrict the parameter set to $\mathcal{C}$, then we can find a strictly convex lower bound for the quadratic term of the profile objective. The restricted strong convexity condition in Assumption 1 corresponds to the restricted strong convexity condition in Negahban, Ravikumar, Wainwright, and Yu (2012), and it plays the same role as the restricted eigenvalue condition in the recent LASSO literature (e.g., see Candès and Tao 2007 and Bickel, Ritov, and Tsybakov 2009). In the following subsection we discuss Assumption 1 in more detail and provide further sufficient conditions under which the assumption is satisfied.

Lemma 2 (Convergence Rate of $\hat\Gamma_\psi$). Let Assumption 1 hold and assume that
$$\psi \ge 2\,\big\|\mathrm{mat}(M_x\,e)\big\|. \qquad (4.2)$$
Then we have
$$\big\|\hat\Gamma_\psi - \Gamma_0\big\|_F \le \frac{3\sqrt{2R_0}\,\psi}{\mu}.$$

Next we introduce further regularity conditions for the consistency of $\hat\Gamma_\psi$ and $\hat\beta_\psi$.

Assumption 2 (Regularity Conditions). (i) $\|E\| = O_p\big(\sqrt{\max(N,T)}\big)$; (ii) $\frac{1}{\sqrt{NT}}\,x'e = O_p(1)$; (iii) $\frac{1}{NT}\,x'x \to_p \Sigma_x > 0$; (iv) $\psi = \psi_{NT} = c\,\sqrt{\max(N,T)}$, where $c > 0$ and $\psi_{NT}/\sqrt{NT}\to 0$.

Remarks on Assumption 2. The conditions in Assumption 2 are weak and quite general.
(a) Various examples of $E$ that satisfy Assumption 2(i) can be found in Moon and Weidner (2017). These include weakly dependent errors and non-identical but independent sequences of errors.

(b) Assumption 2(ii) is satisfied if the regressors are exogenous with respect to the error, $E[x_{it}e_{it}] = 0$, and $x_{it}e_{it}$ are weakly correlated over $t$ and across $i$, so that $\frac{1}{NT}\sum_{i,j=1}^{N}\sum_{t,s=1}^{T} E\big[x_{k,it}\,x_{l,js}\,e_{it}\,e_{js}\big]$ is bounded asymptotically.

(c) Assumption 2(iii) is the standard full rank condition on the regressors.

(d) Assumption 2(iv) restricts the choice of the regularization parameter $\psi$. The requirement on $c$ in Assumption 2(iv) is needed to satisfy the regularity condition (4.2) of Lemma 2. To see this in more detail, since
$$\mathrm{mat}(M_x\,e) = E - \sum_{k=1}^{K}\hat E_k, \qquad \text{where } \hat E_k = X_k\,(x_k'x_k)^{-1}x_k'e,$$
we have
$$\big\|\mathrm{mat}(M_x\,e)\big\| \le \|E\| + \sum_{k=1}^{K}\big\|\hat E_k\big\| \le \|E\| + \sum_{k=1}^{K}\|X_k\|\,(x_k'x_k)^{-1}\big|x_k'e\big| = \|E\|\,\Big(1 + O_p\big(\|E\|^{-1}\big)\Big).$$
Then choosing $c \ge 2\big(1 + O_p(\|E\|^{-1})\big)\,\|E\|\big/\sqrt{\max(N,T)}$ makes $\psi$ satisfy (4.2) with probability approaching one.

The following theorem shows the consistency of $\hat\Gamma_\psi$ and $\hat\beta_\psi$.

Theorem 1. Let Assumption 1 hold with $\mu$ independent of the sample size, and let Assumption 2 hold. Then we have
(a) $\dfrac{1}{\sqrt{NT}}\,\big\|\hat\Gamma_\psi - \Gamma_0\big\|_F \le O_p\Big(\dfrac{\sqrt{R_0}\,\psi}{\sqrt{NT}}\Big) = O_p\Big(\dfrac{\sqrt{R_0}\,c}{\sqrt{\min(N,T)}}\Big)$;
(b) $\big\|\hat\beta_\psi - \beta_0\big\| \le O_p\Big(\dfrac{\sqrt{R_0}\,\psi}{\sqrt{NT}}\Big) = O_p\Big(\dfrac{\sqrt{R_0}\,c}{\sqrt{\min(N,T)}}\Big)$.

Remark. A special case is $\Gamma_0 = 0$. In this case, with the regularization parameter $\psi$ satisfying (4.2), it follows that
$$\hat\Gamma_\psi = \Gamma_0 = 0. \qquad (4.3)$$
A proof of this is available in the appendix. In this case, the regularized estimator of $\beta$ becomes the pooled OLS estimator, $\hat\beta_\psi = (x'x)^{-1}x'y$.
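As an illustration of the computation discussed after Lemma 1, the following hedged sketch (hypothetical helper names; K = 1 for simplicity, and the same reconstructed objective as above) minimizes the convex profile objective over a scalar β, and also shows the pooled OLS estimator to which β̂_ψ reduces in the Γ_0 = 0 case of the remark:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def beta_psi_hat(Y, X, psi, bounds=(-10.0, 10.0)):
    """Nuclear norm regularized estimator of beta for K = 1, obtained by
    minimizing the convex profile objective of Lemma 1(a) over beta."""
    def profile(beta):
        s = np.linalg.svd(Y - beta * X, compute_uv=False)
        return np.sum(0.5 * np.minimum(psi, s) ** 2 + psi * np.maximum(s - psi, 0.0))
    return minimize_scalar(profile, bounds=bounds, method='bounded').x

def pooled_ols(Y, X):
    """Pooled OLS, the limiting case of the regularized estimator when Gamma_0 = 0
    and the threshold psi exceeds all singular values of the residual matrix."""
    return np.sum(X * Y) / np.sum(X * X)
```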

4.2 Sufficient Conditions for Restricted Strong Convexity

In this section we discuss Assumption 1 in more detail. The requirement in Assumption 1 is to bound $\theta'M_x\theta$ from below by a strictly convex function. To gain some intuition, suppose that the regressor is scalar and assume that $\|X\|_F = (x'x)^{1/2} = 1$, without loss of generality, because the projection operator $M_x$ is invariant to a change of scale. Also assume that $\theta \ne 0$. Then
$$\theta'M_x\theta = \theta'\theta - (\theta'x)^2 = \theta'\theta\left(1 - \frac{(\theta'x)^2}{\theta'\theta}\right) = \theta'\theta\;x'\!\left(I - \frac{\theta\theta'}{\theta'\theta}\right)\!x \;\ge\; \theta'\theta\,\min_{\theta:\,\mathrm{mat}(\theta)\in\mathcal{C}}\, x'M_\theta\,x.$$
In this case, if the limit of the distance between the regressor and the restricted parameter set is positive, Assumption 1 is satisfied: it suffices that $\mu := \liminf_{N,T}\,\min_{\theta:\,\mathrm{mat}(\theta)\in\mathcal{C}} x'M_\theta x$, the squared distance between the normalized regressor $x$ and the convex cone $\mathcal{C}$, is positive. An obvious necessary condition for this is that the normalized regressor does not belong to the cone $\mathcal{C}$, that is, $\|M_{\lambda_0}XM_{f_0}\|_* > 3\,\|X - M_{\lambda_0}XM_{f_0}\|_*$. The following Lemma 3 generalizes this.

Let $H(A,\mathcal{C})$ be the distance between a matrix $A\in\mathbb{R}^{N\times T}$ and the cone $\mathcal{C}$,
$$H(A,\mathcal{C}) := \min_{B\in\mathcal{C}}\,\mathrm{Tr}\big[(A-B)'(A-B)\big].$$

Assumption 3. We assume that there exists a positive constant $\mu > 0$ such that for any $\alpha\in\mathbb{R}^{K}$ with $\alpha'x'x\,\alpha = 1$, the regressors $X_1,\ldots,X_K$ satisfy
$$H\big(\alpha\cdot X,\,\mathcal{C}\big) \ge \mu > 0 \qquad \text{wpa1}.$$

Lemma 3. Suppose that Assumption 3 holds. Then there exists a positive constant $\mu > 0$ such that
$$\theta'M_x\theta \ge \mu\,\theta'\theta$$
for any $\Theta\in\mathcal{C}$.

Assumption 3 is a high level condition for the restricted strong convexity. In what follows we introduce a set of lower level conditions under which Assumption 3 is satisfied. The lower

level conditions restrict the singular values of $M_{\lambda_0}(\alpha\cdot X)M_{f_0}$.

Assumption 4. Suppose that $\alpha\in\mathbb{R}^K$ with $\alpha'x'x\,\alpha = 1$. Let $s_1 \ge s_2 \ge s_3 \ge \ldots \ge s_{\min(N,T)} \ge 0$ be the singular values of the $N\times T$ matrix $M_{\lambda_0}(\alpha\cdot X)M_{f_0}$. We assume that the following conditions hold uniformly in $\alpha$ with $\alpha'x'x\,\alpha = 1$:
(i) $\|\alpha\cdot X\| = O_p(1)$;
(ii) $\sum_{r=q}^{\min(N,T)} s_r^2 \ge c > 0$ wpa1;
(iii) $\sum_{r=1}^{q} s_r\,(s_r - s_q) \to_P 0$.

Lemma 4. Suppose that an $N\times T$ random matrix $\alpha\cdot X$, with $\alpha\in\mathbb{R}^K$ and $\alpha'x'x\,\alpha = 1$, satisfies Assumption 4. Then it satisfies Assumption 3, and thus also Assumption 1, with $\mu = c$.

Remarks.
(a) When $X$ is a high-rank regressor and the $s_q$'s are of order $O_p\big(\min(N,T)^{-1/2}\big)$, we can choose, for example, $q = \min(N,T)/2$, for $N,T$ converging to infinity at the same rate. Then it is easy to verify the sufficient conditions (i), (ii) and (iii) for, e.g., $X_{it}$ i.i.d. $N(0,\sigma^2)$ from well-known random matrix theory results. More generally, we can explicitly verify (i), (ii) and (iii) if $X$ has an approximate factor structure $X = \lambda^x f^{x\prime} + E^x$.
(b) For a low-rank regressor with $\mathrm{rank}(X) = 1$ we have singular values $s_1 = \|M_{\lambda_0}XM_{f_0}\|$ and $s_r = 0$ for all $r \ge 2$. In that case, using the notation $a(q)$ and $b(q)$ from the proof of Lemma 4 in the appendix, the minimization over $q$ reduces to $\min\big\{a(1),\,\big(\max\{0,\,b(2)\}\big)^2\big\}$. Thus, Assumption 3 is satisfied if wpa1 we have
$$\big\|M_{\lambda_0}XM_{f_0}\big\| - 3\sqrt{2R_0}\,\big\|X - M_{\lambda_0}XM_{f_0}\big\|_* \ge c > 0$$
for some constant $c$. This last condition simply demands that the part of $X$ that cannot be explained by $\lambda_0$ and $f_0$ needs to be sufficiently larger than the part of $X$ that can be explained by either $\lambda_0$ or $f_0$. This is a sufficient condition for Assumption 1.

An analysis that is specialized towards low-rank regressors would likely give a weaker condition for Assumption 1 in this case.

5 Nuclear Norm Minimization Estimation

In this section we consider the estimator
$$\tilde\beta = \mathrm{argmin}_\beta\;\big\|Y - \beta\cdot X\big\|_*. \qquad (5.1)$$
To motivate the nuclear norm minimization estimation in (5.1), let $\hat\Gamma_\psi(\beta) := \mathrm{argmin}_\Gamma\,Q_\psi(\beta,\Gamma)$. For given $N,T$, suppose that we choose a very small regularization penalty $\psi > 0$ such that $\psi \le \min_r s_r(Y - \beta\cdot X)$. Then, since $\hat\Gamma_\psi(\beta) = Y - \beta\cdot X - \psi\,U_\beta V_\beta'$ by Lemma 1, where $U_\beta$ and $V_\beta$ denote the singular vector matrices of $Y - \beta\cdot X$, we have
$$Q_\psi\big(\beta,\hat\Gamma_\psi(\beta)\big) = \tfrac{1}{2}\,\big\|\psi\,U_\beta V_\beta'\big\|_F^2 + \psi\,\big\|Y - \beta\cdot X - \psi\,U_\beta V_\beta'\big\|_*.$$
Then, for fixed $N,T$, letting $\psi\to 0$, we have, uniformly in $\beta$,³
$$\lim_{\psi\to 0}\,\frac{1}{\psi}\,Q_\psi\big(\beta,\hat\Gamma_\psi(\beta)\big) = \big\|Y - \beta\cdot X\big\|_*. \qquad (5.2)$$

[Footnote 3: In this case, $\tilde\beta = \lim_{\psi\to 0}\hat\beta_\psi$ (5.3) for fixed $N,T$. However, this relation between $\hat\beta_\psi$ and $\tilde\beta$ is not actually important for the following discussion of $\tilde\beta$; one can simply consider (5.1) as the definition of $\tilde\beta$.]

An important practical advantage of $\tilde\beta$ over both $\hat\beta_{LS}$ and $\hat\beta_\psi$ is that neither a number of factors $R_0$ nor a regularization parameter $\psi$ needs to be chosen. However, a drawback of the estimator $\tilde\beta$ is that for its consistency we do not allow low-rank regressors.

In what follows we consider the special case $K = 1$ for illustrative purposes. The general case of multiple regressors is presented in the appendix, and we provide a proof of the theorem under the corresponding general assumption.

Assumption 5. Suppose that $K = 1$. As $N,T\to\infty$ with $N > T$, the following conditions are satisfied:

(i) We assume $\|E\| = O_p(\sqrt{N})$.
(ii) We assume that there exists a finite positive constant $c_{up}$ such that wpa1 we have $\frac{1}{T\sqrt{N}}\,\|E\|_* \le c_{up}$.
(iii) We assume $\frac{1}{\sqrt{NT}}\,\|X\|_F = O_p(1)$.
(iv) Let $U_E S_E V_E'$ be the singular value decomposition of $M_{\lambda_0}EM_{f_0}$.⁴ We assume $\mathrm{Tr}\big(X'U_E V_E'\big) = O_p(1)$.
(v) We assume that there exists a constant $c_{low} > 0$ such that wpa1 $\frac{1}{T\sqrt{N}}\,\big\|M_{\lambda_0}XM_{f_0}\big\|_* \ge c_{low}$.
(vi) Let $U_x S_x V_x' = M_{\lambda_0}XM_{f_0}$ be the singular value decomposition of the matrix $M_{\lambda_0}XM_{f_0}$. We assume that there exists $c_x\in(0,1)$ such that wpa1
$$\mathrm{Tr}\big(U_E'\,U_x\,S_x\,U_x'\,U_E\big) \le (1 - c_x)\,\mathrm{Tr}(S_x).$$

[Footnote 4: That is, $U_E S_E V_E' = M_{\lambda_0}EM_{f_0}$, where $U_E$ is an $N\times\mathrm{rank}(M_{\lambda_0}EM_{f_0})$ matrix of singular vectors, $S_E$ is a $\mathrm{rank}(M_{\lambda_0}EM_{f_0})\times\mathrm{rank}(M_{\lambda_0}EM_{f_0})$ diagonal matrix, and $V_E$ is a $T\times\mathrm{rank}(M_{\lambda_0}EM_{f_0})$ matrix of singular vectors.]

Theorem 2. Under Assumption 5, $\sqrt{T}\,\big(\tilde\beta - \beta_0\big) = O_p(1)$.

Remarks on Assumption 5.
(a) We consider a limit with $N > T$ here. Alternatively, we could consider a limit with $T > N$, but then we also need to replace $N$ by $T$, and $X$ by $X'$, in the remaining assumptions.
(b) Conditions (i) and (ii) are satisfied under weak restrictions on the error term $e_{it}$. Various examples of DGPs for condition (i) are found in Moon and Weidner (2017).
(c) Condition (iii) is standard. It is satisfied if $\frac{1}{NT}\sum_{i,t} x_{it}^2 = O_p(1)$, which holds if $\sup_{it} E[x_{it}^2]$ is finite.

(d) Condition (iv) is a high level condition and will be satisfied if
$$\sup_r\,E\big[(V_{E,r}'\,X'\,U_{E,r})^2\big] \le M \qquad (5.4)$$
for some finite constant $M$, where $U_{E,r}$ and $V_{E,r}$ are the $r$-th columns of $U_E$ and $V_E$, respectively. An example of DGPs for $X$ and $E$ that satisfies condition (5.4) is Assumption LL(i) and (ii) in Moon and Weidner (2015). A proof of this is similar to that of Lemma A.4 and we omit it. This example allows the regressor to be a lagged dependent variable.
(e) Condition (v) rules out low-rank regressors, for which we typically have $\|M_{\lambda_0}XM_{f_0}\|_* = O_p(\sqrt{NT})$, but it is satisfied generically for high-rank regressors, for which $M_{\lambda_0}XM_{f_0}$ has $T$ singular values of order $\sqrt{N}$, so that $\|M_{\lambda_0}XM_{f_0}\|_*$ is of order $T\sqrt{N}$. It is not surprising that we need to rule out low-rank regressors here, because the estimator $\tilde\beta$ does not use any information about $R_0$, so that $\Gamma_0$ cannot be distinguished from a low-rank regressor.

6 Post Nuclear Norm Regularization/Minimization Estimation

In Sections 4 and 5 we showed that $\hat\beta_\psi$ and $\tilde\beta$ are consistent, but with a much slower convergence rate than that of the LS estimator $\hat\beta_{LS}$. In this section we investigate how to construct an estimator that is asymptotically equivalent to the LS estimator and yet easy to compute. Our suggestion is to use $\hat\beta_\psi$ or $\tilde\beta$ as a preliminary estimator and to iterate the estimation of $\Gamma_0 = \lambda_0 f_0'$ and $\beta_0$. First we consider the case where $R_0$ is known. When $R_0$ is unknown, we use a consistent estimate of $R_0$; after we introduce the main result of the section we discuss how to estimate $R_0$ consistently.

Our iteration method is defined as follows. We start with $s = 1$.

Step 1: In the first step, $s = 1$, we set $\hat\beta^{(1)} := \hat\beta_\psi$ or $\tilde\beta$.

Step 2: We estimate the factor loadings and the factors from the step-$s$ residuals $Y - \hat\beta^{(s)}\cdot X$ by the principal components method:
$$\big(\hat\lambda^{(s+1)},\,\hat f^{(s+1)}\big) := \mathrm{argmin}_{\lambda\in\mathbb{R}^{N\times R_0},\;f\in\mathbb{R}^{T\times R_0}}\;\big\|Y - \hat\beta^{(s)}\cdot X - \lambda f'\big\|_F^2.$$

Step 3: We update the step-$s$ estimator $\hat\beta^{(s)}$ by
$$\hat\beta^{(s+1)} := \mathrm{argmin}_\beta\,\min_{G,H}\,\big\|Y - \beta\cdot X - \big(\hat\lambda^{(s+1)}G' + H\,\hat f^{(s+1)\prime}\big)\big\|_F^2 = \Big(x'\big(M_{\hat f^{(s+1)}}\otimes M_{\hat\lambda^{(s+1)}}\big)x\Big)^{-1} x'\big(M_{\hat f^{(s+1)}}\otimes M_{\hat\lambda^{(s+1)}}\big)y. \qquad (6.1)$$

Step 4: Increase $s$ to $s+1$ and repeat Steps 2, 3, and 4.

In the following theorem we show that after two iterations, $\hat\beta^{(s)}$ with $s = 3, 4, \ldots$ achieves asymptotic equivalence to the LS estimator $\hat\beta_{LS}$.

Theorem 3. Let Assumption 1 hold with $\mu$ independent of the sample size. Let Assumption 2 hold with $c\to\infty$ slowly, such that $c^4/\min(N,T)\to 0$. Also assume the assumptions of Lemma 10 in the appendix. Then, as $N,T\to\infty$ with $N/T\to\kappa^2$ for some $0 < \kappa < \infty$, we have
(a) $\hat\beta^{(2)} - \beta_0 = O_p\big(c^2/\min(N,T)\big)$;
(b) $\sqrt{NT}\,\big(\hat\beta^{(s)} - \hat\beta_{LS}\big) = o_p(1)$, for all $s = 3, 4, \ldots$

When $R_0$ is unknown, we use a consistent estimate of it, say $\hat R$, in Step 2. One can estimate $R_0$ in various ways. Within the nuclear norm regularization framework, we may consider the following. Suppose that $\psi_{NT}$ is the regularization parameter used in $\hat\beta_\psi$. Choose $\psi'_{NT} > 0$ such that
$$\frac{\psi_{NT}}{\psi'_{NT}} = o(1), \qquad \frac{\psi'_{NT}}{\sqrt{\max(N,T)\,\min(N,T)}} = o(1). \qquad (6.2)$$
Then we estimate $R_0$ by
$$\hat R_{\psi'} := \sum_{r=1}^{\min(N,T)} \mathbb{1}\big\{s_r\big(Y - \hat\beta_\psi\cdot X\big) \ge \psi'_{NT}\big\}. \qquad (6.3)$$

Lemma 5. Assume the assumptions of Theorem 1. Also assume the strong factor assumption. Then $P\{\hat R_{\psi'} \ne R_0\} \to 0$.
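A compact sketch of one possible implementation of Steps 2-4 (illustrative only; the normalization of the principal components estimates and the helper names are our own choices, and the update uses the regression form of (6.1) written in terms of the projected regressors):

```python
import numpy as np

def pc_factors(resid, R0):
    """Step 2: principal components estimate of (lambda, f) from the residual matrix."""
    U, s, Vt = np.linalg.svd(resid, full_matrices=False)
    lam = U[:, :R0] * s[:R0]          # N x R0 loadings (one normalization choice)
    f = Vt[:R0, :].T                  # T x R0 factors
    return lam, f

def proj_out(A):
    """Annihilator matrix M_A = I - A (A'A)^{-1} A' (assumes A has full column rank)."""
    Q, _ = np.linalg.qr(A)
    return np.eye(A.shape[0]) - Q @ Q.T

def update_beta(Y, X_list, lam, f):
    """Step 3: the closed-form update (6.1), implemented as a regression of
    M_lam Y M_f on the projected regressors M_lam X_k M_f."""
    Ml, Mf = proj_out(lam), proj_out(f)
    Xp = [Ml @ Xk @ Mf for Xk in X_list]
    A = np.array([[np.sum(Xk * Xl) for Xl in Xp] for Xk in Xp])
    b = np.array([np.sum(Xk * (Ml @ Y @ Mf)) for Xk in Xp])
    return np.linalg.solve(A, b)

def iterate(Y, X_list, beta_init, R0, n_iter=3):
    """Steps 2-4, starting from a preliminary estimator (e.g. beta_psi or beta_tilde)."""
    beta = np.atleast_1d(np.asarray(beta_init, dtype=float))
    for _ in range(n_iter):
        resid = Y - sum(b * Xk for b, Xk in zip(beta, X_list))
        lam, f = pc_factors(resid, R0)
        beta = update_beta(Y, X_list, lam, f)
    return beta
```

Starting from β̂_ψ or β̃ and running two or more passes corresponds to the post estimators β̂^{(3)}, β̂^{(4)} compared in the Monte Carlo section below.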

Table 1: Monte Carlo Results — bias and standard deviation of POLS, $\hat\beta_{LS}$, $\hat\beta_\psi$, $\tilde\beta$, and the post estimators $\hat\beta_\psi^{(3)}$ and $\hat\beta_\psi^{(4)}$, for $(N,T) = (50,50)$ and $(100,100)$.

7 A Small Scale Monte Carlo Simulation

We consider a simple linear model with one regressor and two factors:
$$y_{it} = \beta_0\,x_{it} + \sum_{r=1}^{2}\lambda_{0,ir}f_{0,tr} + e_{it}, \qquad x_{it} = 1 + e_{x,it} + \sum_{r=1}^{2}\big(\lambda_{0,ir} + \lambda_{x,ir}\big)\big(f_{0,tr} + f_{0,t-1,r}\big),$$
where $f_{0,tr} \sim$ iid $N(0,1)$, $\lambda_{0,ir}, \lambda_{x,ir} \sim$ iid $N(1,1)$, $e_{x,it}, e_{it} \sim$ iid $N(0,1)$, all mutually independent. We set $(N,T) = (50,50), (100,100)$ and $\psi = \sqrt{\log(N)\,\max(N,T)}$. With this design, we compare the finite sample properties of $\hat\beta_{LS}$, $\hat\beta_\psi$, $\tilde\beta$, and the post estimators $\hat\beta_\psi^{(3)}$ and $\hat\beta_\psi^{(4)}$ based on $\hat\beta_\psi$ as a preliminary estimator. The results for the post estimates based on $\tilde\beta$ are very similar and we omit them.
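For reference, a sketch of how simulated data from this design can be generated (the lag structure of the factors entering x_it, the zero initialization of the lagged factor, and the exact N(1,1) specification of the loadings are assumptions of this illustration):

```python
import numpy as np

def simulate_panel(N, T, beta0=1.0, R=2, seed=0):
    """Simulated design of the Monte Carlo section (as reconstructed): one regressor,
    two factors, with the regressor correlated with loadings and (lagged) factors."""
    rng = np.random.default_rng(seed)
    f0 = rng.standard_normal((T, R))                 # f_{0,tr} ~ iid N(0,1)
    lam0 = 1.0 + rng.standard_normal((N, R))         # lambda_{0,ir} ~ iid N(1,1)
    lam_x = 1.0 + rng.standard_normal((N, R))        # lambda_{x,ir} ~ iid N(1,1)
    e_x = rng.standard_normal((N, T))
    e = rng.standard_normal((N, T))
    f0_lag = np.vstack([np.zeros((1, R)), f0[:-1]])  # f_{0,t-1,r}, zero-initialized
    X = 1.0 + e_x + (lam0 + lam_x) @ (f0 + f0_lag).T
    Y = beta0 * X + lam0 @ f0.T + e
    return Y, X

Y, X = simulate_panel(50, 50)
```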

8 Conclusions

In this paper we analyze two new estimation methods for interactive fixed effects panel regressions: (i) nuclear norm penalized estimation and (ii) nuclear norm minimizing estimation. These new estimation methods optimize convex functions of the parameters and can be applied to models where the LS estimator cannot estimate the parameters consistently. We establish consistency of the new estimators of the regression coefficients, and we show how to use them as preliminary estimators to achieve asymptotic equivalence to the LS estimator. There are several ongoing extensions, including developing a unified method to deal with heterogeneous coefficients, sieve estimation of nonparametric components, high-dimensional regressors, and data dependent choice of the penalty term and inference.

Appendix: Proofs and Technical Derivations

A Motivating Examples

A.1 Example of a Non-convex LS Profile Objective Function

As an example of the non-convex LS profile objective function in (3.1), we consider the following linear model with one regressor and two factors:
$$y_{it} = \beta_0\,x_{it} + \sum_{r=1}^{2}\lambda_{0,ir}f_{0,tr} + e_{it}, \qquad x_{it} = 0.04\,e_{x,it} + \lambda_{0,i}'f_{0,t} + \lambda_{x,i}f_{x,t},$$
where $\lambda_{0,i} = (\lambda_{0,i1},\lambda_{0,i2})'$ with components iid $N(1,1)$, $f_{0,t} = (f_{0,t1},f_{0,t2})'$ with components iid $N(1,1)$, $\lambda_{x,i} \sim$ iid $\chi^2$, $f_{x,t} \sim$ iid $\chi^2$, $e_{x,it}, e_{it} \sim$ iid $N(0,1)$, and $\{\lambda_{0,i}\}$, $\{f_{0,t}\}$, $\{\lambda_{x,i}\}$, $\{f_{x,t}\}$, $\{e_{x,it}\}$, $\{e_{it}\}$ are independent of each other. With $(N,T) = (100,100)$, we generate the panel data $(y_{it}, x_{it})$ and plot the LS objective function in the following figure.

A.2 Example of Inconsistent LS Estimation

We plot some realizations of the LS objective function for the low-rank regressor example of Section 3.2. We also plot the nuclear norm regularized objective functions for various values of the regularization penalty parameter.

20 B Useful Facts on Matrix Algebra We introduce some useful facts used in the proofs. Lemma 6. Suppose that A and B are two matrices with ranks of A and B are ranka and rankb, respectively. i A A A ranka A ranka A. ii AB A B. iii AB A B A B. iv AB = 0, A B = 0 implies A + B = max{ A, B }. v A B = 0 implies A + B A + B. Lemma 7. Suppose that N T matrix A has a singular value decomposition, A = U A S A V A, where U A R N minn,t, V A R T minn,t, and S A R minn,t minn,t is the diagonal matrix whose elements are s A,..., s minn,t A. Suppose that S R minn,t minn,t is the diagonal matrix consisting of s s minn,t 0. Then, Proof. Since min min U R N minn,t : U U=I V R T minn,t : V V =I A USV = S A S. we have A USV = TrA A TrA USV + TrS = TrS A TrA USV + TrS, min min U R N minn,t : U U=I V R T minn,t : V V =I = TrS A + TrS max A USV max U R N minn,t : U U=I V R T minn,t : V V =I TrA USV. B. According to Lemma 8, for any U U = I and V V = I, TrA USV is bounded uniformly by TrS A S because TrA USV = minn,t k= minn,t k= s k Aσ k USV s k As k = TrS A S. 0

21 Therefore, max max U R N minn,t : U U=I V R T minn,t : V V =I TrA USV TrS A S. For an lower bound of TrA USV, we choose U = U A and V = V A, which leads B. max max U R N minn,t : U U=I V R T minn,t : V V =I TrA USV TrA U A SV A = TrV A S A U AU A SV A = TrS A S. B.3 From B. and B.3, we have max max U R N minn,t : U U=I V R T minn,t : V V =I TrA USV = TrS A S, and so B. = TrS A + TrS TrS A S = S A S, as required. Lemma 8. Let A, B be N T matrices whose singular values are s A s minn,t A and s B s minn,t B, respectively. Then, TrA B minn,t s k As k B. k= Proof. See Theorem of Horn and Johnson 99. C Proofs of Section 4 Recall that the rank of Γ 0 = λ 0 f 0 is R 0 minn, T. Throughout the rest of the appendix, we use the following singular value decomposition of Γ 0, Γ 0 = USV, C. where U R N R 0 with U U = I R0, V R T R 0 with V V = I R0, S is the R 0 R 0 diagonal matrix of singular values of Γ 0. Suppose that f 0 is normalized as T f 0 f 0 = I R0. Then, we have f 0 = T V λ 0 = US T. First, we provide a proof of Lemma. Proof of Lemma.

22 Part a. Notice that min Q ψβ, Γ Γ R N T = min Γ R N T = min { γ R minn,t γ 0} minn,t r= { } [s r Y β X Γ] + ψ s r Γ min min { U R N minn,t U U=I N} { V R T minn,t V V =I T } = min { γ R minn,t γ 0} = min { γ R minn,t γ 0} = = minn,t r= minn,t r= min { α R α 0} minn,t r= minn,t r= minn,t r= { [sr Y β X UdiagγV ] + ψ γ r } { [sr Y β X UY β X diagγv Y β X] + ψ γ r } {[s r Y β X γ r ] + ψ γ r } { } [s r Y β X α] + ψ α [ { min ψ, s r Y β X } ] + ψ max {0, s r Y β X ψ }. Here, in the first step we rewrote the sum of squared residuals as the sum of squared singular values. In the second step we introduced the singular value decomposition Γ = UdiagγV, thus replacing the minimization over Γ by a minimization over the singular vector matrices U and V and the singular value vector γ. In the third step we use that for any given γ the minimizing U and V are equal to the corresponding singular vector matrices of Y β X, denoted by U Y β X and V Y β X as follows by B.3 in the proof of Lemma 7 below and Lemma 7. Here notice also that the minimizing singular vectors might not be unique, in particular they are not unique for γ r = 0.. The fourth step follows by Lemma 7. The fifth step interchanges the sum and the minimization. The final step solves the minimization problem over α the optimal α is α = max{0, s r Y β X ψ }. Then, we have the required result for Part a. Part b. The required result of Part b follows as a corollary of Step 3 and Step 6 of the proof of Part a. Next, we introduce some notation we will use in the following proofs. Let QΓ := inf β Q ψβ, Γ LΓ := inf β Lβ, Γ

23 be the profile objective functions of Q ψ β, Γ and Lβ, Γ, respectively, that concentrate out parameter β. We use notation Θ := Γ Γ 0 and θ := vecθ. Proof of Lemma. # Step : Use assumption i to show Θ ψ C 0 By definition, we have 0 QΓ 0 + Θ ψ QΓ 0 = LΓ 0 + Θ ψ LΓ 0 + ψ Γ 0 + Θ ψ Γ 0. Let Θ ψ := Γ ψ Γ 0, θ ψ := vec Θ ψ, Θ ψ, := M U0 Θψ M V0 and Θ ψ, := Θ ψ M U0 Θψ M V0 Then LΓ 0 + Θ ψ LΓ 0 = θ ψ M x θ ψ e M x θψ e M x θψ = Tr Θ ψ matm xe Θ ψ matm x e ψ Θ ψ ψ Θ ψ, ψ Θ ψ,. Here the first inequality holds since θ ψ M x θ ψ 0, the second inequality holds by the Hölder inequality, the third inequality holds by assumption i, and the last inequality holds by the triangle inequality. We furthermore have ψ Γ 0 + Θ ψ Γ 0 = ψ Γ 0 + Θ ψ, + Θ ψ, Γ 0 ψ Γ 0 + Θ ψ, Γ 0 ψ Θ ψ, = ψ Θ ψ, ψ Θ ψ,. Therefore, 0 LΓ 0 + Θ ψ LΓ 0 + ψ Γ 0 + Θ ψ Γ 0 ψ Θ ψ, ψ Θ ψ, + ψ Θ ψ, ψ Θ ψ, = ψ Θ ψ, 3 Θ ψ,. 3

24 Thus, we have Θ ψ C := { B R N T M U BM V 3 B M U BM V }. # Step : Also use assumption ii to show the final result: Using assumption ii and the same derivation as above we find QΓ 0 + Θ ψ QΓ 0 θ ψ M x θ ψ e M x θψ Because 0 QΓ 0 + Θ ψ QΓ 0 we thus have µ Θ ψ + ψ µ Θ ψ 3 ψ Θ ψ,. µ Θ ψ 3 ψ Θ ψ, 0. Θ ψ, 3 Θ ψ, Since the rank of Θ ψ, is at most R 0 e.g., see Recht, Fazel, and Parrilo 00, we have Θ ψ, R 0 Θ ψ, and we also have Θ ψ, Θ ψ. Therefore, Θ ψ 3 ψ R0 Θ ψ 0, µ and therefore Θ ψ 3 ψ R0. µ Proof of Theorem. Part a follows by Lemma 3 and the definition of ψ in Assumption. Part b. Let ˆβΓ = x x x y γ. Then, by definion we have ˆβ ψ β 0 := ˆβˆΓ ψ β 0 = x x x e x ˆγ ψ γ 0. 4

25 Under Assumption, x x = Op and e x = O p. Also, by Part a we have x ˆγ ψ γ 0 X ˆΓ ψ Γ 0 = O p maxn, T / ψ = O p. Combining these, we can deduce the required result for Part b. Proof of 4.3 in Remarks. Since M x is positive semi-definite and e M x γ ψ Γ ψ matm x e by Hölder inequality, we have 0 Q Γ ψ QΓ 0 = γ ψ γ 0 M x γ ψ γ 0 e M x γ ψ γ 0 + ψ Γ ψ Γ 0 e M x γ ψ γ 0 + ψ Γ ψ Γ 0 Γ ψ Γ 0 matm x e + ψ Γ ψ Γ 0 = ψ matm x e Γ ψ Γ 0. The required result follows since ψ matm x e > 0. Proof of Lemma 3. Recall the definition x = [x,..., x K ], K, where x k = vecx k. Case i. If θ = 0, the required result holds for any constant µ > 0. Now assume that θ 0. Case ii. If θ x = 0, then the required result holds for µ = because Case iii. We assume that θ 0 and θ x 0. Also, x 0. θ θ θ xx x x θ = θ θ. 5

26 Define x θ = Pxθ P xθ, and X θ := mat x θ. Then, for any Θ C and Θ 0, we have θ θ θ xx x x θ = θ θ θ x θ x θ θ by the definition of x θ = Θ θ x θ x θ θ θ since θ 0 θ = Θ x θθ θ θ θ x θ = Θ xθ P θ x θ Θ min x θ veca A C = Θ = Θ x θ x θ x θθ θ θ θ x θ H X θ, C, C. where the inequality holds because matp θ x θ C since Θ C and C is a cone. Notice that x θ = P xθ P x θ = x α, where α = x x x θ θ x x x x θ / and α x x α =. This implies X θ = α X with α α =. Therefore, we have C. Θ min α x x α= H α Then, the required result of the lemma follows by Assumption 3. X, C. Proof of Lemma 4. For notational simplicity, we write X α for α X. For given N T matrix X, and N R 0 matrix λ 0, and T R 0 matrix f 0, we want to find a lower bound on ν := min X α Θ Θ R N T s.t. M λ0 ΘM f0 3 Θ M λ0 ΘM f0. 6

27 By definition, we have X α Θ = M λ0 X α M f0 M λ0 ΘM f0 + X α M λ0 X α M f0 Θ M λ0 ΘM f0 Also, rankθ M λ0 ΘM f0 R 0 e.g., see Lemma 3.4 of Recht, Fazel, and Parrilo 00, and therefore Θ M λ0 ΘM f0 R 0 Θ M λ0 ΘM f0. Using this we find. ν min Θ R N T { } M λ0 X α M f0 M λ0 ΘM f0 + X α M λ0 X α M f0 Θ M λ0 ΘM f0 s.t. M λ0 ΘM f0 3 R 0 Θ M λ0 ΘM f0. Here, we have weakened the constraint allowing more values for Θ, and the minimizing value therefore weakly decreases. It is easy to see that for ω 0 we have X α M λ0 X α M f0 ω = min X α M λ0 X α M f0 Θ M λ0 ΘM f0 Θ R N T s.t. Θ M λ0 ΘM f0 = ω, because the optimal Θ M λ0 ΘM f0 here equals equals X α M λ0 X α M f0 rescaled by a non-negative number. We therefore have ν min ω 0 Let min Θ R N T M λ0 X α M f0 M λ0 ΘM f0 + X α M λ0 X α M f0 ω s.t. M λ0 ΘM f0 3 R 0 ω. M λ0 X α M f0 = be the singular value decomposition of M λ0 X α M f0 minn,t R 0 r= s r v r w r, singular vectors v r R N and w r R T. The optimal M λ0 ΘM f0 has the form minn,t R 0 r= with singular values s r 0 and normalized max0, s r β v r w r, in the last optimization problem 7

28 for some β 0. Here, β = 0 occurs if the constraint is not binding, that is, if M λ0 X α M f0 3 R 0 ω. We therefore have ν minn,t R 0 min ω 0, β 0 r= r= s r max0, s r β + X α M λ0 X α M f0 ω minn,t R 0 s.t. max0, s r β 3 R 0 ω. { } minn,t Here, the optimal ω equals max X α M λ0 X α M f0, 3 R0 R 0 r= max0, s r β, and we thus have ν min β 0 minn,t R 0 r= [ mins r, β + max 0, 3 R 0 minn,t R 0 r= max0, s r β X α M λ0 X α M f0 Let = s 0 > s... s minn,t R0 s minn,t R0 + = 0. For any β 0 there exists q be such that β [s q+, s q. We can therefore write ν min q {0,,,...,minN,T R 0 } min β [s q+,s q + [ q β + minn,t R 0 r=q+ s r { q } ] max 0, 3 s r β I{q } X α M λ0 X α M f0. R 0 Here, for a given q the objective function is strictly increasing, and the optimal β is always equal to the lower bound of the interval [s q+, s q. Using this and also shifting q q we obtain ν r= min aq + [max {0, bq}], q {,,...,minn,t R 0 } ]. where aq = q s q + [ bq = 3 R 0 q minn,t r=q s r, ] s r s q I{q } X α M λ0 X α M f0. r= Notice that aq is nonnegative and weakly decreasing and bq is weakly increasing. Then, for 8

29 any integer valued sequence q between and minn, T R 0 such that bq > 0, min aq + [max {0, bq}] q {,,...,minn,t R 0 } { = min min aq + [max {0, bq}], min q {,,...,q } { } min min aq, min [max {0, bq}] q {,,...,q } q {q +,...,minn,t R 0 } min { aq, bq + }. q {q +,...,minn,t R 0 } aq + [max {0, bq}] } Then, Assumption 4 thus guarantees that ν c. The definition of ν together with Lemma 3 thus guarantees that Assumptions 3 and Assumption are satisfied with µ = c. D Proofs of Section 5 In this section, we introduce conditions for Theorem with K -regressors and provide a proof of the general case. Assumption 6 K-regressor Case. Let there exist symmetric idempotent T T matrices Q k = Q k, such that Q k V = 0, for all k {,..., K}, and Q k Q l = 0, for all k, l {,..., K}. Suppose that N > T. As N, T, we assume the following conditions hold. i,ii Assume Assumption 5i and ii. iii For all k {,..., K}, we assume X k = O p. vi Let U E S E V E be the singular value decomposition of M λ 0 EM f0. We assume Tr X k U EV E = O p for all k {,..., K}. v We assume that there exists a constant c low > 0 such that wpa T N / M U X k M V Q k c low, for all k {,..., K}. vi For k =,..., K let U k S k V k = M UX k M V Q k be the singular value decomposition of the matrix M U X k M V Q k. We assume that there exists c x 0, such that wpa U k U E c x for all k =,..., K Remark For t {,,..., T }, let e t be the t th unit vector of dimension T. For k {,..., K}, let A k = e k T/K +, e k T/K +,... e kt/k be a T T/K matrix, and let P Ak be the 9

30 projector onto the column space of A k. Also define f 0,k = P Ak f 0 and B k = M f0,k A k. Then, for K > one possible choice for Q k in Assumption 6 vi is given by Q k = P Bk = M f0,k P Ak. The discussion of Assumption 6 vi is then analogous to the K = case, except that for the k th regressor only the time periods k T/K + to kt/k are used in the assumption, that is, we need enough variation in the k th regressor within those time periods. Other choices of Q k are also conceivable. Now we prove Theorem under Assumption 6, which is more general than Assumption 5. Proof of Theorem. Before we start the proof, remind the following sigular value decompositions, Γ 0 = USV, M λ0 EM f0 = U E S E V E, and M λ 0 X k M f0 = M U X k M V = U k X k V k used in what follows. The proof consists of two steps. that will be In the first step, we show that the local minimizer that minimizes the objective function Q β in a convex neighborhood of β 0 defined by is T - consistent. B := { β : c x c low c up K β In the second step, we show that the local minimizer is the global minimizer, for which we use convexity of the objective function Q β. Step. By definition of the nuclear norm, we have Q β = Γ 0 + E β X = } sup Tr [ Γ 0 + E β X A ]. {A : A } To obtain a lower bound on Q β we choose the following matrix A in the above minimization, A β = UV + a β U EV E a β K K where M UE = I N U E U E and a β [0, ] is given by k= a β = c x c low c up K β. sgn β k M UE U k V k, 30

31 We have A β, because A β = max UV, a β U EV E a β K sgn β k M UE U k V k K k= max UV, a β U E V E + a β K M K UE U k V k k= { max UV, a β } U E V E + a K β M UE U k V k K =. Here, for the first line, we used that UV is orthogonal to a β U EV E a β K K k= sgn β k M UE U k V k in both matrix dimension, which implies that the spectral norm of the sum is equal to the maximum of the spectral norms see Lemma 6iv. For the second line, we used that the columns of U E V E are orthogonal to the columns of K k= M U E U k V k, which implies that the squared spectral norm of the sum is bounded by the sum of the squared spectral norms see Lemma 6v. For the third, we used that the rows of M UE U k V k and M U E U l V l are orthogonal for k l see Lemma 6v. In the final line we used that UV = U E V E = and that M U E U k V k for all k. With this choice of A = A β we obtain the following lower bound for the objective function; for all β B, Q β Tr [ Γ 0 + E β X A β ] k= = Γ 0 + Tr E UV + Tr [ β X UV ] + a β M UEM V + a β Tr [ β X U E V E ] + a β K where we used the following: K k= β k Tr [ X k M U E U k V k], Tr Γ 0UV = Tr V SU UV = TrS = Γ 0 Tr E U E V E = Tr E MU EV U + M U EV U U E V E = TrMU EV U U E V E = TrS E = M U EM V Tr Γ 0U E V E = Tr V SU U E V E = 0 Tr [ Γ 0M UE U k V k ] [ = Tr V SUMUE U k V k ] [ = Tr V SUUk V k ] [ + Tr V SUPUE U k V k] = 0 Tr [ E M UE U k V k] = 0. We furthermore have Qβ 0 = Γ 0 + E. Thus, applying the assumptions of Lemma??, and also 3

32 using a β a β a4 β, we obtain for β B, Q β Q β 0 Tr [ Γ 0 + E β X A β ] Γ0 + E a β K K k= β k Tr [ X k M U E U k V k ] a β M UEM V Γ 0 + E Γ 0 M U EM V + Tr E UV a4 β M UEM V + a β Tr [ β X U E V E ] [ + Tr β X UV ] =: B B B 3 + B 4 B 5 + B 6. D. Here we bound B from below by B = a β K = a β K a β c x K K k= K k= K k= a β c x c low K a β c x c low K β k Tr [ X k M U E U k V k ] β k [ Tr S k Tr U EU k S k U k U ] E β k Tr S k = a β c x K T N K β k k= T N β, K k= β k M U X k M V Q k where the first inequality holds by Assumption 6vi and the second inequality holds by Assumption 6v. We bound B from above by B = a β M UEM V a β E + P U E + EP V + P U EP V a β E + 3R 0 E a β T cup N + T O p wpa a β T N c up wpa, where the first inequality holds by the triangle inequality, the second inequality holds by Lemma 6i and the third and the fourth inequalities follow by Assumption 5i and ii. 3

33 We bound term B 3 from above by B 3 = Γ 0 + E Γ 0 M U EM V E M U EM V = P U E + EP V P U EP V P U E + EP V + P U EP V 3R 0 E O p N where the second inequality holds by the triangle inequality and the third inequality holds by Lemma 6i. For B 4, by Hölder s inequality we have B 4 = Tr E UV E UV = O p N. For B 5, denoting O P + as a stochastically strictly positive and bounded term and using similar arguments for the bound of term B, we obtain B 5 = a4 β M UEM V = O P + a 4 β T N = O P + β 4 T N. For B 6, we have B 6 = a β Tr [ β X U E V E ] [ + Tr β X UV ] = O p β, where the last equality holds since TrX k U E V E = O p by Assumption 6vi, and TrX k UV X k UV = O p under Assumption 6iii. Notice that our choice for a β above is such that a β c x c low K β a β c up is maximized, which guarantees that B B is positive, namely B B T N c xc low K c up β. Combining the above, for any β B, we have T N {Q β Q β 0 } c xc low β + O p β + O p T + O P + β 4, K c up T which holds uniformly over β B i.e. none of the constants hidden in the O p. notation depends on β. Let β := argmin β B Q β be the local minimizer in a convex neighbood B of β 0. Notice that since β 0 B, Q β Q β 0 33

34 by definition. Therefore, we have This implies 0 T N Q β Q β 0 c xc low K c up β β 0 + O p O P + T T β β 0 + O p T + O P + β β 0 4. c x c low + O P + β β 0 β β 0 + O p β β 0 K c up T c xc low K c up β β 0 + O p T β β 0. From this we deduce β β 0 = O p. D. T Step. Let β B, that is, α β =. Write β := β β 0. From D. with a β =, we can bound Q β Q β 0 from below by T N Q β Q β 0 c x c low β K c up + O p β T + O p T + O P + β 4 = 4 c up + O p T cup K c up K + O p + O P + T c x c low c x c low > 0 wpa, where the equality holds since β = cup K c x c low. Since Q β is convex and has unique minimum, the local minimum at β is is also the global minimum asymptotically. Therefore, asymptotically β = β wpa. Then, combining this with the T consistency result of the local minimizer in D., we have the required result of the theorem. E Proofs of Section 6 Before we start the proof of Theorem 3, we introduce several helpful lemmas. 34

35 For parameter β, define λβ, fβ := argmin λ R N R 0,f R T R 0 Y β X and the corresponding residuals Êβ := Y β X λβ fβ. Define M λβ := I N λβ λβ λβ λβ, M f β := I N fβ fβ fβ fβ. E. Lemma 9. Assume the assumptions of Theorem S.9. of the supplementary appendix of Moon and Weidner 07. Then, we have the following expansions M λβ = M λ0 + M K λ,e + M λ,e β k β 0,k M k= M f β = M f0 + M K f,e + M f,e β k β 0,k M k= λ,k + Mrem λ f,k + Mrem f where the spectral norms of the remainders satisfy for any series η 0: sup {β: β β 0 η } sup {β: β β 0 η } β, β, M rem β λ β β 0 + / E β β 0 + 3/ E 3 = O p, M rem β f β β 0 + / E β β 0 + 3/ E 3 = O p, and the expansion coefficients are given by M λ,e = M λ 0 E f 0 f 0f 0 λ 0λ 0 λ 0 λ 0 λ 0λ 0 f 0f 0 f 0 E M λ0, M λ,k = M λ 0 X k f 0 f 0f 0 λ 0λ 0 λ 0 λ 0 λ 0λ 0 f 0f 0 f 0 X k M λ 0, M λ,e = M λ 0 E f 0 f 0f 0 λ 0λ 0 λ 0 E f 0 f 0f 0 λ 0λ 0 λ 0 + λ 0 λ 0λ 0 f 0f 0 f 0 E λ 0 λ 0λ 0 f 0f 0 f 0 E M λ0 M λ0 E M f0 E λ 0 λ 0λ 0 f 0f 0 λ 0λ 0 λ 0 λ 0 λ 0λ 0 f 0f 0 λ 0λ 0 λ 0 E M f0 E M λ0 M λ0 E f 0 f 0f 0 λ 0λ 0 f 0f 0 f 0 E M λ0 + λ 0 λ 0λ 0 f 0f 0 f 0 E M λ0 e f 0 f 0f 0 λ 0λ 0 λ 0, 35

36 analogously M f,e = M f 0 E λ 0 λ 0λ 0 f 0f 0 f 0 f 0 f 0f 0 λ 0λ 0 λ 0 E M f0, M f,k = M f 0 X k λ 0 λ 0λ 0 f 0f 0 f 0 f 0 f 0f 0 λ 0λ 0 λ 0 X k M f0, M f,e = M f 0 E λ 0 λ 0λ 0 f 0f 0 f 0 e λ 0 λ 0λ 0 f 0f 0 f 0 + f 0 f 0f 0 λ 0λ 0 λ 0 E f 0 f 0f 0 λ 0λ 0 λ 0 E M f0 M f0 E M λ0 E f 0 f 0f 0 λ 0λ 0 f 0f 0 f 0 f 0 f 0f 0 λ 0λ 0 f 0f 0 f 0 E M λ0 E M f0 M f0 E λ 0 λ 0λ 0 f 0f 0 λ 0λ 0 λ 0 E M f0 + f 0 f 0f 0 λ 0λ 0 λ 0 E M f0 E λ 0 λ 0λ 0 f 0f 0 f 0. Lemma 0. Suppose that the assumptions of Theorem S.9. of the supplementary appendix of Moon and Weidner 07 and of Theorem 4.4 hold. Then, the following hold. a M λ,e, M f,e = O pn /. b M λ,e, M f,e = O pn. c M λ,k, M f,k = O p for all k =,..., K. Proof. See the proof of Corollary S.0. in the supplementary appendix of Moon and Weidner 07, which is available in Moon and Weidner 05. Suppose that β s is a preliminary estimator of β 0 such that β s β 0 = O p δ s with δ s 0. Since y = xβ 0 + f 0 λ 0 veci R0 + e, we can write β s+ β 0 [ = x M ˆf s+ Mˆλs+ [ x M ˆf s+ Mˆλs+ ] x f 0 λ 0 veci R0 + x M ˆf s+ Mˆλs+ ] e. Lemma. Under the Assumptions in Theorem 3, the following hold. a Suppose that δ s 0. Then, M x ˆf s+ Mˆλs+ x = x M f0 M λ0 x + O p. For δ = c maxn,t /, we have / b c x M ˆf Mˆλ f 0 λ 0 veci R0 = O p, 36


10-725/36-725: Convex Optimization Prerequisite Topics 10-725/36-725: Convex Optimization Prerequisite Topics February 3, 2015 This is meant to be a brief, informal refresher of some topics that will form building blocks in this course. The content of the

More information

Repeated observations on the same cross-section of individual units. Important advantages relative to pure cross-section data

Repeated observations on the same cross-section of individual units. Important advantages relative to pure cross-section data Panel data Repeated observations on the same cross-section of individual units. Important advantages relative to pure cross-section data - possible to control for some unobserved heterogeneity - possible

More information

A Course in Applied Econometrics Lecture 4: Linear Panel Data Models, II. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 4: Linear Panel Data Models, II. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 4: Linear Panel Data Models, II Jeff Wooldridge IRP Lectures, UW Madison, August 2008 5. Estimating Production Functions Using Proxy Variables 6. Pseudo Panels

More information

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88

Math Camp Lecture 4: Linear Algebra. Xiao Yu Wang. Aug 2010 MIT. Xiao Yu Wang (MIT) Math Camp /10 1 / 88 Math Camp 2010 Lecture 4: Linear Algebra Xiao Yu Wang MIT Aug 2010 Xiao Yu Wang (MIT) Math Camp 2010 08/10 1 / 88 Linear Algebra Game Plan Vector Spaces Linear Transformations and Matrices Determinant

More information

Least Squares Estimation-Finite-Sample Properties

Least Squares Estimation-Finite-Sample Properties Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

Lecture 8: Linear Algebra Background

Lecture 8: Linear Algebra Background CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 8: Linear Algebra Background Lecturer: Shayan Oveis Gharan 2/1/2017 Scribe: Swati Padmanabhan Disclaimer: These notes have not been subjected

More information

Inference For High Dimensional M-estimates: Fixed Design Results

Inference For High Dimensional M-estimates: Fixed Design Results Inference For High Dimensional M-estimates: Fixed Design Results Lihua Lei, Peter Bickel and Noureddine El Karoui Department of Statistics, UC Berkeley Berkeley-Stanford Econometrics Jamboree, 2017 1/49

More information

The properties of L p -GMM estimators

The properties of L p -GMM estimators The properties of L p -GMM estimators Robert de Jong and Chirok Han Michigan State University February 2000 Abstract This paper considers Generalized Method of Moment-type estimators for which a criterion

More information

Multi-stage convex relaxation approach for low-rank structured PSD matrix recovery

Multi-stage convex relaxation approach for low-rank structured PSD matrix recovery Multi-stage convex relaxation approach for low-rank structured PSD matrix recovery Department of Mathematics & Risk Management Institute National University of Singapore (Based on a joint work with Shujun

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Panel Data Models. James L. Powell Department of Economics University of California, Berkeley

Panel Data Models. James L. Powell Department of Economics University of California, Berkeley Panel Data Models James L. Powell Department of Economics University of California, Berkeley Overview Like Zellner s seemingly unrelated regression models, the dependent and explanatory variables for panel

More information

Multiple Regression Model: I

Multiple Regression Model: I Multiple Regression Model: I Suppose the data are generated according to y i 1 x i1 2 x i2 K x ik u i i 1...n Define y 1 x 11 x 1K 1 u 1 y y n X x n1 x nk K u u n So y n, X nxk, K, u n Rks: In many applications,

More information

GMM estimation of spatial panels

GMM estimation of spatial panels MRA Munich ersonal ReEc Archive GMM estimation of spatial panels Francesco Moscone and Elisa Tosetti Brunel University 7. April 009 Online at http://mpra.ub.uni-muenchen.de/637/ MRA aper No. 637, posted

More information

Matrix Support Functional and its Applications

Matrix Support Functional and its Applications Matrix Support Functional and its Applications James V Burke Mathematics, University of Washington Joint work with Yuan Gao (UW) and Tim Hoheisel (McGill), CORS, Banff 2016 June 1, 2016 Connections What

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

Singular Value Decomposition

Singular Value Decomposition Chapter 6 Singular Value Decomposition In Chapter 5, we derived a number of algorithms for computing the eigenvalues and eigenvectors of matrices A R n n. Having developed this machinery, we complete our

More information

Shrinkage Estimation of Dynamic Panel Data Models with Interactive Fixed Effects

Shrinkage Estimation of Dynamic Panel Data Models with Interactive Fixed Effects Shrinkage Estimation of Dynamic Panel Data Models with Interactive Fixed Effects un Lu andliangjunsu Department of Economics, HKUST School of Economics, Singapore Management University February, 5 Abstract

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Matrix Algebra, part 2

Matrix Algebra, part 2 Matrix Algebra, part 2 Ming-Ching Luoh 2005.9.12 1 / 38 Diagonalization and Spectral Decomposition of a Matrix Optimization 2 / 38 Diagonalization and Spectral Decomposition of a Matrix Also called Eigenvalues

More information

Jianhua Z. Huang, Haipeng Shen, Andreas Buja

Jianhua Z. Huang, Haipeng Shen, Andreas Buja Several Flawed Approaches to Penalized SVDs A supplementary note to The analysis of two-way functional data using two-way regularized singular value decompositions Jianhua Z. Huang, Haipeng Shen, Andreas

More information

Lecture 2 Part 1 Optimization

Lecture 2 Part 1 Optimization Lecture 2 Part 1 Optimization (January 16, 2015) Mu Zhu University of Waterloo Need for Optimization E(y x), P(y x) want to go after them first, model some examples last week then, estimate didn t discuss

More information

1 Appendix A: Matrix Algebra

1 Appendix A: Matrix Algebra Appendix A: Matrix Algebra. Definitions Matrix A =[ ]=[A] Symmetric matrix: = for all and Diagonal matrix: 6=0if = but =0if 6= Scalar matrix: the diagonal matrix of = Identity matrix: the scalar matrix

More information

High-dimensional Statistical Models

High-dimensional Statistical Models High-dimensional Statistical Models Pradeep Ravikumar UT Austin MLSS 2014 1 Curse of Dimensionality Statistical Learning: Given n observations from p(x; θ ), where θ R p, recover signal/parameter θ. For

More information

Quaderni di Dipartimento. Estimation Methods in Panel Data Models with Observed and Unobserved Components: a Monte Carlo Study

Quaderni di Dipartimento. Estimation Methods in Panel Data Models with Observed and Unobserved Components: a Monte Carlo Study Quaderni di Dipartimento Estimation Methods in Panel Data Models with Observed and Unobserved Components: a Monte Carlo Study Carolina Castagnetti (Università di Pavia) Eduardo Rossi (Università di Pavia)

More information

Estimation of random coefficients logit demand models with interactive fixed effects

Estimation of random coefficients logit demand models with interactive fixed effects Estimation of random coefficients logit demand models with interactive fixed effects Hyungsik Roger Moon Matthew Shum Martin Weidner The Institute for Fiscal Studies Department of Economics, UCL cemmap

More information

arxiv: v1 [math.na] 1 Sep 2018

arxiv: v1 [math.na] 1 Sep 2018 On the perturbation of an L -orthogonal projection Xuefeng Xu arxiv:18090000v1 [mathna] 1 Sep 018 September 5 018 Abstract The L -orthogonal projection is an important mathematical tool in scientific computing

More information

What s New in Econometrics? Lecture 14 Quantile Methods

What s New in Econometrics? Lecture 14 Quantile Methods What s New in Econometrics? Lecture 14 Quantile Methods Jeff Wooldridge NBER Summer Institute, 2007 1. Reminders About Means, Medians, and Quantiles 2. Some Useful Asymptotic Results 3. Quantile Regression

More information

Rank minimization via the γ 2 norm

Rank minimization via the γ 2 norm Rank minimization via the γ 2 norm Troy Lee Columbia University Adi Shraibman Weizmann Institute Rank Minimization Problem Consider the following problem min X rank(x) A i, X b i for i = 1,..., k Arises

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

VISCOSITY SOLUTIONS. We follow Han and Lin, Elliptic Partial Differential Equations, 5.

VISCOSITY SOLUTIONS. We follow Han and Lin, Elliptic Partial Differential Equations, 5. VISCOSITY SOLUTIONS PETER HINTZ We follow Han and Lin, Elliptic Partial Differential Equations, 5. 1. Motivation Throughout, we will assume that Ω R n is a bounded and connected domain and that a ij C(Ω)

More information

Delta Theorem in the Age of High Dimensions

Delta Theorem in the Age of High Dimensions Delta Theorem in the Age of High Dimensions Mehmet Caner Department of Economics Ohio State University December 15, 2016 Abstract We provide a new version of delta theorem, that takes into account of high

More information

Estimation of Random Coefficients Logit Demand Models with Interactive Fixed Effects

Estimation of Random Coefficients Logit Demand Models with Interactive Fixed Effects Estimation of Random Coefficients Logit Demand Models with Interactive Fixed Effects Hyungsik Roger Moon Matthew Shum Martin Weidner First draft: October 2009; This draft: October 4, 205 Abstract We extend

More information

Spectral inequalities and equalities involving products of matrices

Spectral inequalities and equalities involving products of matrices Spectral inequalities and equalities involving products of matrices Chi-Kwong Li 1 Department of Mathematics, College of William & Mary, Williamsburg, Virginia 23187 (ckli@math.wm.edu) Yiu-Tung Poon Department

More information

Nonparametric panel data models with interactive fixed effects

Nonparametric panel data models with interactive fixed effects Nonparametric panel data models with interactive fixed effects Joachim Freyberger Department of Economics, University of Wisconsin - Madison November 3, 2012 Abstract This paper studies nonparametric panel

More information

Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation

Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation Quantile Regression for Panel Data Models with Fixed Effects and Small T : Identification and Estimation Maria Ponomareva University of Western Ontario May 8, 2011 Abstract This paper proposes a moments-based

More information

Section 3.9. Matrix Norm

Section 3.9. Matrix Norm 3.9. Matrix Norm 1 Section 3.9. Matrix Norm Note. We define several matrix norms, some similar to vector norms and some reflecting how multiplication by a matrix affects the norm of a vector. We use matrix

More information

Specification Test for Instrumental Variables Regression with Many Instruments

Specification Test for Instrumental Variables Regression with Many Instruments Specification Test for Instrumental Variables Regression with Many Instruments Yoonseok Lee and Ryo Okui April 009 Preliminary; comments are welcome Abstract This paper considers specification testing

More information

A Factor Analytical Method to Interactive Effects Dynamic Panel Models with or without Unit Root

A Factor Analytical Method to Interactive Effects Dynamic Panel Models with or without Unit Root A Factor Analytical Method to Interactive Effects Dynamic Panel Models with or without Unit Root Joakim Westerlund Deakin University Australia March 19, 2014 Westerlund (Deakin) Factor Analytical Method

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

Lecture 6: Dynamic panel models 1

Lecture 6: Dynamic panel models 1 Lecture 6: Dynamic panel models 1 Ragnar Nymoen Department of Economics, UiO 16 February 2010 Main issues and references Pre-determinedness and endogeneity of lagged regressors in FE model, and RE model

More information

Computational and Statistical Aspects of Statistical Machine Learning. John Lafferty Department of Statistics Retreat Gleacher Center

Computational and Statistical Aspects of Statistical Machine Learning. John Lafferty Department of Statistics Retreat Gleacher Center Computational and Statistical Aspects of Statistical Machine Learning John Lafferty Department of Statistics Retreat Gleacher Center Outline Modern nonparametric inference for high dimensional data Nonparametric

More information

High-dimensional Statistics

High-dimensional Statistics High-dimensional Statistics Pradeep Ravikumar UT Austin Outline 1. High Dimensional Data : Large p, small n 2. Sparsity 3. Group Sparsity 4. Low Rank 1 Curse of Dimensionality Statistical Learning: Given

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon

More information

Matrix estimation by Universal Singular Value Thresholding

Matrix estimation by Universal Singular Value Thresholding Matrix estimation by Universal Singular Value Thresholding Courant Institute, NYU Let us begin with an example: Suppose that we have an undirected random graph G on n vertices. Model: There is a real symmetric

More information

Estimation of Random Coefficients Logit Demand Models with Interactive Fixed Effects

Estimation of Random Coefficients Logit Demand Models with Interactive Fixed Effects Estimation of Random Coefficients Logit Demand Models with Interactive Fixed Effects Hyungsik Roger Moon Matthew Shum Martin Weidner First draft: October 2009; This draft: September 5, 204 Abstract We

More information

Lecture 10: Panel Data

Lecture 10: Panel Data Lecture 10: Instructor: Department of Economics Stanford University 2011 Random Effect Estimator: β R y it = x itβ + u it u it = α i + ɛ it i = 1,..., N, t = 1,..., T E (α i x i ) = E (ɛ it x i ) = 0.

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2013 PROBLEM SET 2

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2013 PROBLEM SET 2 STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2013 PROBLEM SET 2 1. You are not allowed to use the svd for this problem, i.e. no arguments should depend on the svd of A or A. Let W be a subspace of C n. The

More information

Program Evaluation with High-Dimensional Data

Program Evaluation with High-Dimensional Data Program Evaluation with High-Dimensional Data Alexandre Belloni Duke Victor Chernozhukov MIT Iván Fernández-Val BU Christian Hansen Booth ESWC 215 August 17, 215 Introduction Goal is to perform inference

More information

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001)

Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Paper Review: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties by Jianqing Fan and Runze Li (2001) Presented by Yang Zhao March 5, 2010 1 / 36 Outlines 2 / 36 Motivation

More information

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 14 GEE-GMM Throughout the course we have emphasized methods of estimation and inference based on the principle

More information

Learning discrete graphical models via generalized inverse covariance matrices

Learning discrete graphical models via generalized inverse covariance matrices Learning discrete graphical models via generalized inverse covariance matrices Duzhe Wang, Yiming Lv, Yongjoon Kim, Young Lee Department of Statistics University of Wisconsin-Madison {dwang282, lv23, ykim676,

More information

Panel Threshold Regression Models with Endogenous Threshold Variables

Panel Threshold Regression Models with Endogenous Threshold Variables Panel Threshold Regression Models with Endogenous Threshold Variables Chien-Ho Wang National Taipei University Eric S. Lin National Tsing Hua University This Version: June 29, 2010 Abstract This paper

More information

Short T Panels - Review

Short T Panels - Review Short T Panels - Review We have looked at methods for estimating parameters on time-varying explanatory variables consistently in panels with many cross-section observation units but a small number of

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

SPARSE MODELS AND METHODS FOR OPTIMAL INSTRUMENTS WITH AN APPLICATION TO EMINENT DOMAIN

SPARSE MODELS AND METHODS FOR OPTIMAL INSTRUMENTS WITH AN APPLICATION TO EMINENT DOMAIN SPARSE MODELS AND METHODS FOR OPTIMAL INSTRUMENTS WITH AN APPLICATION TO EMINENT DOMAIN A. BELLONI, D. CHEN, V. CHERNOZHUKOV, AND C. HANSEN Abstract. We develop results for the use of LASSO and Post-LASSO

More information

Introduction to Quantitative Techniques for MSc Programmes SCHOOL OF ECONOMICS, MATHEMATICS AND STATISTICS MALET STREET LONDON WC1E 7HX

Introduction to Quantitative Techniques for MSc Programmes SCHOOL OF ECONOMICS, MATHEMATICS AND STATISTICS MALET STREET LONDON WC1E 7HX Introduction to Quantitative Techniques for MSc Programmes SCHOOL OF ECONOMICS, MATHEMATICS AND STATISTICS MALET STREET LONDON WC1E 7HX September 2007 MSc Sep Intro QT 1 Who are these course for? The September

More information

Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014

Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014 Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014 Linear Algebra A Brief Reminder Purpose. The purpose of this document

More information

Estimation of random coefficients logit demand models with interactive fixed effects

Estimation of random coefficients logit demand models with interactive fixed effects Estimation of random coefficients logit demand models with interactive fixed effects Hyungsik Roger Moon Matthew Shum Martin Weidner The Institute for Fiscal Studies Department of Economics, UCL cemmap

More information

A Robust Approach to Estimating Production Functions: Replication of the ACF procedure

A Robust Approach to Estimating Production Functions: Replication of the ACF procedure A Robust Approach to Estimating Production Functions: Replication of the ACF procedure Kyoo il Kim Michigan State University Yao Luo University of Toronto Yingjun Su IESR, Jinan University August 2018

More information

Applied Health Economics (for B.Sc.)

Applied Health Economics (for B.Sc.) Applied Health Economics (for B.Sc.) Helmut Farbmacher Department of Economics University of Mannheim Autumn Semester 2017 Outlook 1 Linear models (OLS, Omitted variables, 2SLS) 2 Limited and qualitative

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 08): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October

More information

ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS

ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS ON ILL-POSEDNESS OF NONPARAMETRIC INSTRUMENTAL VARIABLE REGRESSION WITH CONVEXITY CONSTRAINTS Olivier Scaillet a * This draft: July 2016. Abstract This note shows that adding monotonicity or convexity

More information

Linear Algebra for Machine Learning. Sargur N. Srihari

Linear Algebra for Machine Learning. Sargur N. Srihari Linear Algebra for Machine Learning Sargur N. srihari@cedar.buffalo.edu 1 Overview Linear Algebra is based on continuous math rather than discrete math Computer scientists have little experience with it

More information

Total Least Squares Approach in Regression Methods

Total Least Squares Approach in Regression Methods WDS'08 Proceedings of Contributed Papers, Part I, 88 93, 2008. ISBN 978-80-7378-065-4 MATFYZPRESS Total Least Squares Approach in Regression Methods M. Pešta Charles University, Faculty of Mathematics

More information

The exact bias of S 2 in linear panel regressions with spatial autocorrelation SFB 823. Discussion Paper. Christoph Hanck, Walter Krämer

The exact bias of S 2 in linear panel regressions with spatial autocorrelation SFB 823. Discussion Paper. Christoph Hanck, Walter Krämer SFB 83 The exact bias of S in linear panel regressions with spatial autocorrelation Discussion Paper Christoph Hanck, Walter Krämer Nr. 8/00 The exact bias of S in linear panel regressions with spatial

More information

Econometric Methods for Panel Data

Econometric Methods for Panel Data Based on the books by Baltagi: Econometric Analysis of Panel Data and by Hsiao: Analysis of Panel Data Robert M. Kunst robert.kunst@univie.ac.at University of Vienna and Institute for Advanced Studies

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Gaussian Graphical Models and Graphical Lasso

Gaussian Graphical Models and Graphical Lasso ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf

More information

Introduction to Numerical Linear Algebra II

Introduction to Numerical Linear Algebra II Introduction to Numerical Linear Algebra II Petros Drineas These slides were prepared by Ilse Ipsen for the 2015 Gene Golub SIAM Summer School on RandNLA 1 / 49 Overview We will cover this material in

More information

Matrix Completion Methods. for Causal Panel Data Models

Matrix Completion Methods. for Causal Panel Data Models Matrix Completion Methods for Causal Panel Data Models Susan Athey, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, & Khashayar Khosravi (Stanford University) NBER Summer Institute Cambridge, July 27th,

More information