Large Covariance Estimation by Thresholding Principal Orthogonal Complements


Jianqing Fan, Yuan Liao and Martina Mincheva

Department of Operations Research and Financial Engineering, Princeton University

Abstract

This paper deals with the estimation of a high-dimensional covariance with a conditional sparsity structure, which is the composition of a low-rank matrix plus a sparse matrix. By assuming a sparse error covariance matrix in a multi-factor model, we allow for the presence of cross-sectional correlation even after taking out common but unobservable factors. We introduce the Principal Orthogonal complEment Thresholding (POET) method to explore such an approximate factor structure. The POET estimator includes the sample covariance matrix, the factor-based covariance matrix (Fan, Fan, and Lv, 2008), the thresholding estimator (Bickel and Levina, 2008) and the adaptive thresholding estimator (Cai and Liu, 2011) as specific examples. We provide mathematical insights into when the factor analysis is approximately the same as the principal component analysis for high-dimensional data. The rates of convergence of the sparse residual covariance matrix and the conditional sparse covariance matrix are studied under various norms, including the spectral norm. It is shown that the impact of estimating the unknown factors vanishes as the dimensionality increases. The uniform rates of convergence for the unobserved factors and their factor loadings are derived. The asymptotic results are also verified by extensive simulation studies.

Keywords: High dimensionality, approximate factor model, unknown factors, principal components, sparse matrix, low-rank matrix, thresholding, cross-sectional correlation.

Address: Department of ORFE, Sherrerd Hall, Princeton University, Princeton, NJ 08544, USA; e-mail: jqfan@princeton.edu, yuanliao@princeton.edu, mincheva@princeton.edu. The research was partially supported by NIH grants (including R01-GM072611) and an NSF DMS grant.

1 Introduction

Information and technology make large data sets widely available for scientific discovery. Much statistical analysis of such high-dimensional data involves the estimation of a covariance matrix or its inverse (the precision matrix). Examples include portfolio management and risk assessment (Fan, Fan and Lv, 2008), high-dimensional classification such as the Fisher discriminant (Hastie, Tibshirani and Friedman, 2009), graphical models (Meinshausen and Bühlmann, 2006), statistical inference such as controlling false discoveries in multiple testing (Leek and Storey, 2008; Efron, 2010), finding quantitative trait loci based on longitudinal data (Yap, Fan, and Wu, 2009; Xiong et al., 2011), and testing the capital asset pricing model (Sentana, 2009), among others. See Section 4 for some of these applications. Yet the dimensionality is often comparable to the sample size, or even larger, in which case the sample covariance matrix is known to perform poorly because of its diverging spectrum, and some regularization is needed.

Realizing the importance of estimating large covariance matrices and the challenges brought by high dimensionality, researchers have in recent years proposed various regularization techniques to estimate $\Sigma$ consistently. One of the key assumptions is that the covariance matrix is sparse, namely, many entries are zero (Bickel and Levina, 2008; Rothman et al., 2009; Lam and Fan, 2009; Cai and Zhou, 2010; Cai and Liu, 2011). In many applications, however, the sparsity assumption directly on $\Sigma$ is not appropriate. For example, financial returns depend on equity market risks, housing prices depend on economic health, and gene expressions can be stimulated by cytokines. Due to the presence of common factors, it is unrealistic to assume that many outcomes are uncorrelated. An alternative method is to assume a factor model structure, as in Fan, Fan and Lv (2008). However, they restrict themselves to strict factor models with known factors. A natural extension is conditional sparsity: given the common factors, the outcomes are only weakly correlated. This approximate factor model is frequently used in economic and financial studies (Chamberlain and Rothschild, 1983; Fama and French, 1993; Bai and Ng, 2002). A factor model typically takes the following form:

$$y_{it} = b_i' f_t + u_{it}, \quad (1.1)$$

where $y_{it}$ is the observed datum for the $i$th ($i = 1, \dots, p$) variable at time $t = 1, \dots, T$; $b_i$ is a $K \times 1$ vector of factor loadings; $f_t$ is a $K \times 1$ vector of common factors; and $u_{it}$ is the idiosyncratic error component, uncorrelated with $f_t$. In a data-rich environment, $p$ can diverge at a rate faster than $T$.
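To fix ideas, the following minimal Python sketch simulates data from model (1.1); the sizes, the i.i.d. factor draws, and the banded choice of a sparse error covariance are illustrative assumptions rather than the calibrated design of Section 5.

```python
import numpy as np

rng = np.random.default_rng(0)
p, T, K = 100, 200, 3            # illustrative dimensions (assumptions)

B = rng.normal(size=(p, K))      # loadings b_i stacked as rows of B
F = rng.normal(size=(T, K))      # i.i.d. factors, for simplicity

# Sparse idiosyncratic covariance: identity plus one off-diagonal band.
Sigma_u = np.eye(p)
for i in range(p - 1):
    Sigma_u[i, i + 1] = Sigma_u[i + 1, i] = 0.2

U = rng.multivariate_normal(np.zeros(p), Sigma_u, size=T)
Y = F @ B.T + U                  # row t of Y is y_t' = f_t' B' + u_t'
```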

The factor model (1.1) can be put in the matrix form

$$y_t = B f_t + u_t, \quad (1.2)$$

where $y_t = (y_{1t}, \dots, y_{pt})'$, $B = (b_1, \dots, b_p)'$ and $u_t = (u_{1t}, \dots, u_{pt})'$. Under model (1.1), the covariance matrix $\Sigma$ is given by

$$\Sigma = B\, \mathrm{cov}(f_t) B' + \Sigma_u, \quad (1.3)$$

where $\Sigma_u$ is the covariance matrix of $u_t$. We assume that $\Sigma_u$ is sparse (instead of diagonal) and refer to the model as the approximate factor model.

The conditional sparsity of form (1.2) was explored by Fan, Liao and Mincheva (2011) in estimating the covariance matrix when the factors $\{f_t\}$ are observable, which allows them to use regression analysis to estimate $\{u_t\}$. This paper deals with the situation in which the factors are unobservable and have to be inferred. Our approach is very simple: run the singular value decomposition on the sample covariance matrix, keep the covariance matrix formed by the first $K$ principal components, and apply the thresholding procedure to the remaining covariance matrix. This results in the Principal Orthogonal complEment Thresholding (POET) estimator. See Section 2 for additional details. We will investigate various properties of POET under the assumption that the data are serially dependent, which includes independent observations as a specific example.

The high-dimensional approximate factor model (1.2) is innately related to principal component analysis, as clearly elucidated in Section 2. This is a distinguished feature not shared by finite-dimensional models. Bai (2003) established the large sample properties of the principal components as estimators of the factor loadings as well as of the common factors when they are unobservable. Forni, Hallin, Lippi and Reichlin (2000) established consistency for the estimated common components $b_i' f_t$. In addition, Doz, Giannone and Reichlin (2006) proposed quasi-maximum likelihood estimation and investigated the effect of misspecification on the estimation of common factors, and Wang (2010) studied the large sample theory of high-dimensional factor models with a multi-level factor structure. Stock and Watson (2002) considered time-varying factor loadings. See Bai and Shi (2011) for a recent review.

There has been an extensive literature in recent years on sparse principal components, which have been widely used to enhance the convergence of the principal components in high-dimensional space. d'Aspremont, Bach and El Ghaoui (2008) proposed a semidefinite relaxation to this problem and derived a greedy algorithm that computes a full set of good solutions for all target numbers of non-vanishing coefficients.

Shen and Huang (2008) proposed the sPCA-rSVD algorithm, and Witten, Tibshirani, and Hastie (2009) used the SPC algorithm for computing regularized principal components. The idea was further extended by Ma (2011), who iteratively applied thresholding and the QR decomposition to find sparse principal components and derived their rates of convergence. Johnstone and Lu (2009) screened the variables first by their marginal variances and then applied PCA to the screened variables to obtain a sparse principal component; they proved the consistency of such a method. Amini and Wainwright (2009) analyzed semidefinite relaxations for sparse principal components. Zhang and El Ghaoui (2011) proposed a fast block coordinate ascent algorithm for computing sparse PCA.

The rest of the paper is organized as follows. Section 2 gives our estimation procedures and builds the relationship between principal components analysis and factor analysis in high-dimensional space. Section 3 provides the asymptotic theory for various estimated quantities. Specific applications of the regularized covariance matrix are given in Section 4. Numerical results are reported in Section 5. Finally, Section 6 presents a real data example. All proofs are given in the appendix.

Throughout the paper, we use $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ to denote the minimum and maximum eigenvalues of a matrix $A$. We denote by $\|A\|_F$, $\|A\|$, $\|A\|_1$ and $\|A\|_{\max}$ the Frobenius norm, spectral norm (also called the operator norm), $L_1$-norm, and elementwise norm of a matrix $A$, defined respectively as $\|A\|_F = \mathrm{tr}^{1/2}(A'A)$, $\|A\| = \lambda_{\max}^{1/2}(A'A)$, $\|A\|_1 = \max_j \sum_i |a_{ij}|$ and $\|A\|_{\max} = \max_{i,j} |a_{ij}|$. Note that when $A$ is a vector, $\|A\|$ equals the Euclidean norm.

2 Regularized Covariance Matrix via PCA

2.1 POET

The sparsity assumption directly on $\Sigma$ is inappropriate in many applications due to the presence of unobserved factors. Instead, we propose a nonparametric estimator of $\Sigma$ based on principal components analysis. Let $\hat\lambda_1 \ge \hat\lambda_2 \ge \dots \ge \hat\lambda_p$ be the ordered eigenvalues of the sample covariance matrix $\hat\Sigma_{\mathrm{sam}}$ and $\{\hat\xi_i\}_{i=1}^p$ be their corresponding eigenvectors. Then the sample covariance has the following spectral decomposition:

$$\hat\Sigma_{\mathrm{sam}} = \sum_{i=1}^K \hat\lambda_i \hat\xi_i \hat\xi_i' + \hat R, \quad (2.1)$$

where $\hat R = \sum_{i=K+1}^p \hat\lambda_i \hat\xi_i \hat\xi_i' = (\hat r_{ij})_{p \times p}$ is the principal orthogonal complement, and $K$ is the number of principal components.
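The decomposition (2.1) is straightforward to compute. The sketch below, under the assumption that the de-meaned data are supplied as a $T \times p$ matrix, returns the leading eigenpairs and the principal orthogonal complement $\hat R$; the function name is ours.

```python
import numpy as np

def principal_orthogonal_complement(Y, K):
    """Spectral decomposition (2.1): sample covariance = low-rank part + R-hat."""
    S = np.cov(Y, rowvar=False)                      # p x p sample covariance
    eigval, eigvec = np.linalg.eigh(S)               # ascending eigenvalues
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # reorder to descending
    low_rank = (eigvec[:, :K] * eigval[:K]) @ eigvec[:, :K].T
    R = S - low_rank                                 # principal orthogonal complement
    return eigval, eigvec, low_rank, R
```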

Now we apply thresholding on $\hat R$ (Bickel and Levina, 2008). Define

$$\hat R^\tau = (\hat r_{ij}^\tau)_{p \times p}, \qquad \hat r_{ij}^\tau = \hat r_{ij}\, I(|\hat r_{ij}| \ge \tau_{ij}), \quad (2.2)$$

where $\tau_{ij} > 0$ is an entry-dependent adaptive threshold (Cai and Liu, 2011). In particular, the constant threshold $\tau_{ij} = \delta$ is allowed. An example of the adaptive threshold is

$$\tau_{ij} = \tau (\hat r_{ii} \hat r_{jj})^{1/2}, \quad \text{for a given } \tau > 0, \quad (2.3)$$

where $\hat r_{ii}$ is the $i$th diagonal element of $\hat R$. This corresponds to applying the threshold with parameter $\tau$ to the correlation matrix of $\hat R$. Moreover, instead of hard thresholding, one can also use the more general thresholding functions of Antoniadis and Fan (2001), as in Rothman et al. (2009) and Cai and Liu (2011). The estimator of $\Sigma$ is then defined as

$$\hat\Sigma = \sum_{i=1}^K \hat\lambda_i \hat\xi_i \hat\xi_i' + \hat R^\tau. \quad (2.4)$$

We call this the Principal Orthogonal complEment Thresholding (POET) estimator. It is obtained by thresholding the remaining components of the sample covariance matrix after taking out the first $K$ principal components. In practice, the number of principal components can be estimated based on the sample covariance matrix. Determining $K$ in a data-driven way is an important topic and is well understood in the literature on latent factor models; see, for example, Hallin and Liška (2007), Kapetanios (2010) and Onatski (2010).

With the choice of $\tau_{ij}$ in (2.3), our estimator encompasses many popular estimators as specific cases. When $\tau = 0$, the estimator is the sample covariance matrix, and when $\tau = 1$, the estimator becomes one based on the strict factor model. When $K = 0$, our estimator is the same as the thresholding estimator of Bickel and Levina (2008), or the adaptive thresholding estimator of Cai and Liu (2011) with a slight modification of the thresholding parameter that takes into account the standard error of the sample covariance.
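Putting (2.2)-(2.4) together gives the whole POET estimator in a few lines. In this sketch the constant $\tau$ of the correlation threshold (2.3) is passed in by the user; the data-driven choice of the threshold is the subject of Section 3, and the code is an illustration rather than the authors' implementation.

```python
import numpy as np

def poet(Y, K, tau):
    """POET estimator (2.4): top-K principal components plus thresholded remainder."""
    S = np.cov(Y, rowvar=False)
    eigval, eigvec = np.linalg.eigh(S)
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
    low_rank = (eigvec[:, :K] * eigval[:K]) @ eigvec[:, :K].T
    R = S - low_rank

    scale = np.sqrt(np.outer(np.diag(R), np.diag(R)))    # (r_ii r_jj)^{1/2} in (2.3)
    R_thr = np.where(np.abs(R) >= tau * scale, R, 0.0)   # hard thresholding (2.2)
    np.fill_diagonal(R_thr, np.diag(R))                  # diagonal entries are kept
    return low_rank + R_thr
```

With `tau = 0` this returns the sample covariance, and with `tau = 1` it returns the strict-factor-model estimator, matching the special cases noted above.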

2.2 High-dimensional PCA and factor model

We now elucidate why PCA can be used for factor analysis. The main reason is that when the dimensionality $p$ is large, the covariance matrix $\Sigma$ has very spiked eigenvalues. For $\Sigma_u = (\sigma_{ij})_{p \times p}$ as in (1.3), define

$$m_p = \max_{i \le p} \sum_{j \le p} I(\sigma_{ij} \ne 0), \quad (2.5)$$

which is assumed to grow slowly in $p$; hence $\Sigma_u$ is a sparse covariance matrix. The rationale behind this assumption is that, after the common factors are taken out, many pairs of the cross-sectional units become uncorrelated.

As we now demonstrate, we need the following lemma.

Lemma 2.1. Let $\{\lambda_i\}_{i=1}^p$ be the eigenvalues of $\Sigma$ in descending order and $\{\xi_i\}_{i=1}^p$ be their associated eigenvectors. Correspondingly, let $\{\hat\lambda_i\}_{i=1}^p$ be the eigenvalues of $\hat\Sigma$ in descending order and $\{\hat\xi_i\}_{i=1}^p$ be their associated eigenvectors.

1. (Weyl's theorem) $|\hat\lambda_i - \lambda_i| \le \|\hat\Sigma - \Sigma\|$.

2. ($\sin\theta$ theorem, Davis and Kahan, 1970) $\|\hat\xi_i - \xi_i\| \le \dfrac{\sqrt{2}\, \|\hat\Sigma - \Sigma\|}{\min(|\hat\lambda_{i-1} - \lambda_i|,\ |\lambda_i - \hat\lambda_{i+1}|)}$.

When the factor loadings $\{b_i\}_{i=1}^p$ are a random sample from a certain population, we then have

$$\frac{1}{p} \sum_{i=1}^p b_i b_i' = \frac{1}{p} B'B \to E\, b_i b_i', \quad \text{as } p \to \infty, \quad (2.6)$$

under some mild conditions. Note that the linear space spanned by the first $K$ principal components of $B\, \mathrm{cov}(f_t) B'$ is the same as that spanned by the columns of $B$ when $\mathrm{cov}(f_t)$ is non-degenerate. Thus, we can assume without loss of generality that the columns of $B$ are orthogonal and $\mathrm{cov}(f_t) = I_K$, the identity matrix. These assumptions correspond to the identifiability condition in decomposition (1.3). Let $\tilde b_1, \dots, \tilde b_K$ be the columns of $B$, ordered such that $\{\|\tilde b_j\|\}_{j=1}^K$ is non-increasing. Then $\{\tilde b_j / \|\tilde b_j\|\}_{j=1}^K$ are the principal components of the matrix $BB'$ with eigenvalues $\{\|\tilde b_j\|^2\}_{j=1}^K$, and the rest are zero. Let $\{\bar\lambda_j\}_{j=1}^K$ be the eigenvalues (in non-increasing order) of $p^{-1} B'B$, which are bounded from below and above by (2.6), assuming $K \ll p$. Since the non-vanishing eigenvalues of the matrix $BB'$ are the same as those of $B'B$, the non-vanishing eigenvalues of $BB'$ are $\{p \bar\lambda_j\}_{j=1}^K$, and $p \bar\lambda_j = \|\tilde b_j\|^2$. Correspondingly, let $\{\lambda_i\}_{i=1}^p$ be the eigenvalues of $\Sigma$ in descending order and $\{\xi_j\}_{j=1}^p$ be their corresponding eigenvectors. Then an application of Weyl's theorem yields the following proposition.

Proposition 2.1. For the factor model (1.3) with the normalization condition

$$\mathrm{cov}(f_t) = I_K \quad \text{and} \quad B'B \text{ is diagonal}, \quad (2.7)$$

we have $|\lambda_j - \|\tilde b_j\|^2| \le \|\Sigma_u\|$ for $j \le K$, and $|\lambda_j| \le \|\Sigma_u\|$ for $j > K$. If in addition (2.6) holds with $\lambda_{\min}(E\, b_i b_i')$ bounded away from zero, then $\|\tilde b_j\|^2 / p$ is bounded away from zero for all large $p$.

The above proposition reveals that the first $K$ eigenvalues of $\Sigma$ are of order $p$, whereas the rest are at most of order $\|\Sigma_u\|$. Under the sparsity assumption (2.5), with the notation $\Sigma_u = (\sigma_{u,ij})$, we have

$$\|\Sigma_u\| \le \|\Sigma_u\|_1 \le m_p \max_i \sigma_{u,ii}.$$

Thus, when $m_p = o(p)$ and $\max_i \sigma_{u,ii}$ is bounded, the eigenvalues of the principal components are well separated from those of the remaining components. More generally,

$$\|\Sigma_u\| \le \|\Sigma_u\|_1 \le \max_{i \le p} \sum_{j=1}^p |\sigma_{u,ij}|^q (\sigma_{u,ii} \sigma_{u,jj})^{(1-q)/2}, \quad \text{for } q \le 1. \quad (2.8)$$

If we assume that the right-hand side of (2.8) is bounded by $c_p$ with $c_p = o(p)$, then the conclusion continues to hold. This generalizes the notion of sparsity defined by (2.5), which corresponds to $q = 0$ and was used in Bickel and Levina (2008) and Cai and Liu (2011).

Using Proposition 2.1 and the $\sin\theta$ theorem of Davis and Kahan (1970), we have the following proposition.

Proposition 2.2. Under the normalization (2.7), if $\{\|\tilde b_j\|\}_{j=1}^K$ are distinct, then

$$\|\xi_j - \tilde b_j / \|\tilde b_j\|\| = O(p^{-1} \|\Sigma_u\|), \quad \text{for } j \le K.$$

Proposition 2.2 reveals that when $p$ is large, the principal components $\{\xi_j\}_{j=1}^K$ are close to the normalized vectors $\{\tilde b_j / \|\tilde b_j\|\}_{j=1}^K$, and the space spanned by the first $K$ principal components is close to the space spanned by the columns of the factor loading matrix $B$. This provides the mathematics for using the principal components as a proxy for the space spanned by the columns of the loading matrix. Our conditions for the consistency of the principal components are much weaker than those in Jung and Marron (2009).
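A quick numerical check of Proposition 2.1, under the simplifying assumptions $\Sigma_u = I_p$ and i.i.d. standard normal loadings (so that $p^{-1} B'B \to I_K$), shows the first $K$ eigenvalues growing linearly in $p$ while the $(K+1)$st stays bounded:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3
for p in (100, 400, 1600):
    B = rng.normal(size=(p, K))            # loadings; p^{-1} B'B is close to I_K
    Sigma = B @ B.T + np.eye(p)            # cov(f_t) = I_K and Sigma_u = I_p
    lam = np.linalg.eigvalsh(Sigma)[::-1]  # eigenvalues in descending order
    print(p, lam[:K] / p, lam[K])          # lam[:K]/p stabilizes; lam[K] stays near 1
```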

A penalized least-squares approach to the low-rank plus sparse decomposition is to minimize, with respect to $B$ and $S$, the objective function

$$\|\hat\Sigma_{\mathrm{sam}} - BB' - S\|_F^2 + \lambda\, \mathrm{rank}(B) + \sum_{i \ne j} p_\lambda(|s_{ij}|). \quad (2.9)$$

For example, Luo (2011) takes $p_\lambda(|\theta|) = \lambda |\theta|$ (but includes the penalty on the diagonal terms) and relaxes the penalty on $\mathrm{rank}(B)$ by its nuclear norm (the $L_1$-norm of the singular values). The advantage of our approach is that no optimization is needed. Even when the rank of $B$ is known and $p_\lambda(|\theta|) = \lambda^2 - (\lambda - |\theta|)_+^2$ is the hard thresholding penalty (Antoniadis and Fan, 2001), the minimization of (2.9) still consists of two-step iterations: given $B$, $S$ is the thresholded matrix of $\hat\Sigma_{\mathrm{sam}} - BB'$, and given $S$, $B$ consists of the un-normalized principal components of $\hat\Sigma_{\mathrm{sam}} - S$. With POET, these iterations are avoided, and Propositions 2.1 and 2.2 provide some mathematical justification.

2.3 Least squares point of view

The POET estimator (2.4) has an equivalent representation using a constrained least squares method, due to the low-rank decomposition (1.3). In the latent factor model (1.1), the least squares method seeks $\hat\Lambda = (\hat b_1, \dots, \hat b_p)'$ and $\hat f_t$ such that

$$(\hat\Lambda, \hat f_t) = \arg\min_{b_i, f_t} \sum_{i=1}^p \sum_{t=1}^T (y_{it} - b_i' f_t)^2, \quad (2.10)$$

subject to the normalization

$$\frac{1}{T} \sum_{t=1}^T f_t f_t' = I_K, \quad \text{and} \quad \frac{1}{p} \sum_{i=1}^p b_i b_i' \text{ is diagonal}. \quad (2.11)$$

The constraints (2.11) correspond to the normalization (2.7). Here, through the inclusion of intercept terms $a_i$ in (2.10) if necessary, we assume that the mean of each variable $\{y_{it}\}$ has been removed, and so have the factors $\{f_t\}$. Put in matrix form, the optimization problem can be written as

$$\arg\min_{B, F} \|Y - BF'\|_F^2, \qquad \frac{1}{T} F'F = I_K, \quad B'B \text{ diagonal}, \quad (2.12)$$

where $Y = (y_1, \dots, y_T)$ and $F' = (f_1, \dots, f_T)$. For each given $F$, the least-squares estimator of the loadings is $\hat\Lambda = T^{-1} Y F$, using the constraint (2.11) on the factors. Substituting this into (2.12), the objective function becomes $\|Y - T^{-1} Y F F'\|_F^2 = \mathrm{tr}[(I_T - T^{-1} F F') Y'Y]$. The minimizer is now clear: the columns of $\hat F / \sqrt{T}$ are the eigenvectors corresponding to the $K$ largest eigenvalues of the matrix $Y'Y$, and $\hat\Lambda = T^{-1} Y \hat F$.
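The solution of (2.12) just described can be computed directly; a sketch (with $Y$ stored as $p \times T$, as in the text) follows.

```python
import numpy as np

def ls_factors(Y, K):
    """Constrained least squares (2.12): F-hat from the top-K eigenvectors of Y'Y."""
    p, T = Y.shape
    w, v = np.linalg.eigh(Y.T @ Y)             # T x T eigenproblem, ascending order
    F_hat = np.sqrt(T) * v[:, ::-1][:, :K]     # scaling gives F'F / T = I_K, as in (2.11)
    Lambda_hat = Y @ F_hat / T                 # loadings: Lambda-hat = T^{-1} Y F-hat
    return Lambda_hat, F_hat
```

Note that the $T \times T$ eigenproblem is cheaper than the $p \times p$ one when $p > T$, which is one practical appeal of the least squares point of view.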

We will show that under some mild regularity conditions, as $p, T \to \infty$, $\hat b_i' \hat f_t$ consistently estimates $b_i' f_t$ for each $i$ and $t$. Since $\Sigma_u$ is assumed to be sparse, we can construct an estimator of $\Sigma_u$ using the adaptive thresholding method of Cai and Liu (2011) as follows. For some pre-determined decreasing sequence $\omega_T > 0$, let

$$\hat u_{it} = y_{it} - \hat b_i' \hat f_t, \qquad \hat\sigma_{ij} = \frac{1}{T} \sum_{t=1}^T \hat u_{it} \hat u_{jt}, \qquad \hat\theta_{ij} = \frac{1}{T} \sum_{t=1}^T (\hat u_{it} \hat u_{jt} - \hat\sigma_{ij})^2,$$

and define

$$\tilde\Sigma_u = (\tilde\sigma_{ij})_{p \times p}, \qquad \tilde\sigma_{ij} = \hat\sigma_{ij}\, I\big(|\hat\sigma_{ij}| \ge \sqrt{\hat\theta_{ij}}\, \omega_T\big). \quad (2.13)$$

Analogous to the decomposition (1.3), we obtain the following substitution estimators:

$$\tilde\Sigma = \hat\Lambda \hat\Lambda' + \tilde\Sigma_u, \quad (2.14)$$

and

$$\tilde\Sigma^{-1} = \tilde\Sigma_u^{-1} - \tilde\Sigma_u^{-1} \hat\Lambda \big[I_K + \hat\Lambda' \tilde\Sigma_u^{-1} \hat\Lambda\big]^{-1} \hat\Lambda' \tilde\Sigma_u^{-1}, \quad (2.15)$$

by the Sherman-Morrison-Woodbury formula, noting that $T^{-1} \sum_{t=1}^T \hat f_t \hat f_t' = I_K$.
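Formula (2.15) reduces the inversion of a $p \times p$ matrix to a $K \times K$ system once $\tilde\Sigma_u^{-1}$ is available; a sketch:

```python
import numpy as np

def poet_precision(Lambda_hat, Sigma_u_inv):
    """Inverse (2.15) via the Sherman-Morrison-Woodbury formula."""
    K = Lambda_hat.shape[1]
    A = Sigma_u_inv @ Lambda_hat                  # p x K
    core = np.eye(K) + Lambda_hat.T @ A           # I_K + Lambda' Sigma_u^{-1} Lambda
    return Sigma_u_inv - A @ np.linalg.solve(core, A.T)
```

When $\tilde\Sigma_u$ is sparse, `Sigma_u_inv` would in practice be applied through sparse solves rather than formed densely; the dense version above is only for clarity.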

The following theorem shows that the two estimators, based on either regularized PCA or least squares substitution, are equivalent.

Theorem 2.1. If the entry-dependent threshold in (2.2) is the same as the thresholding parameter used in (2.13), then $\tilde\Sigma_u = \hat R^\tau$ and the POET estimator (2.4) is equivalent to the substitution estimator (2.14); that is, $\hat\Sigma = \tilde\Sigma$.

3 Asymptotic Properties

3.1 Assumptions

This section gives the assumptions for the convergence of $\hat\Sigma - \Sigma$ as well as $\hat\Sigma^{-1} - \Sigma^{-1}$ under model (1.2), in which only $\{y_t\}$ is observable. Recall that we impose the identifiability condition (2.7). We make the following assumptions.

Assumption 3.1. (i) $\{u_t\}_{t \ge 1}$ is stationary and ergodic with mean vector zero and covariance matrix $\Sigma_u$. (ii) There exist constants $c_1, c_2 > 0$ such that $c_1 < \lambda_{\min}(\Sigma_u) \le \lambda_{\max}(\Sigma_u) < c_2$. (iii) There exist $r_1 > 0$ and $b_1 > 0$ such that for any $s > 0$ and $i \le p$, $P(|u_{it}| > s) \le \exp(-(s/b_1)^{r_1})$.

Condition (ii) requires that $\Sigma_u$ be well-conditioned, as in Chamberlain and Rothschild (1983). As noted in Bickel and Levina (2004), this condition is satisfied if, for each $t$, $\{u_{it}\}_{i=1}^\infty$ is a stationary and ergodic process with spectral density bounded away from both zero and infinity. Condition (iii) requires the distributions of the idiosyncratic terms to have exponential-type tails, which allows us to apply large deviation theory to $T^{-1} \sum_{t=1}^T u_{it} u_{jt} - \sigma_{u,ij}$.

Assumption 3.2. (i) $\{f_t\}_{t \ge 1}$ is stationary and ergodic. (ii) $\{u_t\}_{t \ge 1}$ and $\{f_t\}_{t \ge 1}$ are independent.

We introduce strong mixing conditions in order to conduct the asymptotic analysis of the least squares estimates. Let $\mathcal{F}_{-\infty}^0$ and $\mathcal{F}_T^\infty$ denote the $\sigma$-algebras generated by $\{(f_t, u_t) : t \le 0\}$ and $\{(f_t, u_t) : t \ge T\}$ respectively. Define the mixing coefficient

$$\alpha(T) = \sup_{A \in \mathcal{F}_{-\infty}^0,\ B \in \mathcal{F}_T^\infty} |P(A)P(B) - P(AB)|. \quad (3.1)$$

Assumption 3.3. (i) Exponential tail: there exist $b_2 > 0$ and $r_2 > 0$ such that for all $i \le K$, $t \le T$ and $s > 0$, $P(|f_{it}| > s) \le \exp(-(s/b_2)^{r_2})$. (ii) Strong mixing: there exists $r_3 > 0$ satisfying $3r_1^{-1} + 1.5 r_2^{-1} + r_3^{-1} > 1$, and $C > 0$ such that, for all $T \in \mathbb{Z}^+$,

$$\alpha(T) \le \exp(-C T^{r_3}).$$

In addition, we assume $\max_{t \le T} \sum_{s=1}^T |E(u_s' u_t)| / p = O(1)$.

We also impose the following regularity conditions.

Assumption 3.4. There exists $M > 0$ such that for all $i \le p$ and $t \le T$: (i) $\|b_i\|_{\max} < M$ and $E\|f_t\|^4 < K^2 M$; (ii) $E[p^{-1/2}(u_s' u_t - E u_s' u_t)]^4 < M$; (iii) $E\|(pK)^{-1/2} \sum_{i=1}^p b_i u_{it}\|^4 < M$.

This assumption provides regularity conditions for consistently estimating the transformed common factors as well as the factor loadings. Conditions (ii) and (iii) are satisfied, for example, if $\{u_{it}\}_{i=1}^\infty$ and $\{b_{ik} u_{it}\}_{i=1}^\infty$ are stationary and ergodic processes with finite fourth moments for each $t$ and $k$. For simplicity, we only consider nonrandom factor loadings and a known number of factors. Note that the conditions here are weaker than those in Bai (2003), as there is no need to establish asymptotic normality in order to estimate the covariance matrix.

The following assumption requires that the factors be pervasive, that is, that they impact every individual time series (Harding, 2009). See also (2.6).

Assumption 3.5. $\|p^{-1} B'B - \Omega\| = o(1)$ for some $K \times K$ symmetric positive definite matrix $\Omega$ such that $\lambda_{\min}(\Omega)$ is bounded away from both zero and infinity.

3.2 Convergence of the idiosyncratic covariance

Estimating the covariance matrix $\Sigma_u$ of the idiosyncratic components $\{u_t\}$ is important for many statistical inference purposes. For example, it is needed for large sample inference on the unknown factors and loadings in Bai (2003) and Wang (2010), for testing the capital asset pricing model in Sentana (2009), and for large-scale testing in Fan, Han and Gu (2012). See Section 4 for the last two applications.

We estimate $\Sigma_u$ by applying the adaptive thresholding given in (2.13); by Theorem 2.1, this is the same as $\hat R^\tau$ given by (2.2). We apply the POET estimator with the adaptive threshold

$$\tau_{ij} = C \sqrt{\hat\theta_{ij}}\, \omega_T, \quad (3.2)$$

where $C > 0$ is a sufficiently large constant, and throughout the paper we denote

$$\omega_T = K \sqrt{\frac{\log p}{T}} + \frac{K^2}{\sqrt{T}} + \frac{K^3}{\sqrt{p}} \quad \text{and} \quad \delta_T = m_p \omega_T. \quad (3.3)$$

The following theorem shows that $\tilde\Sigma_u$ is asymptotically nonsingular as $T \to \infty$ and $p \to \infty$. The rate of convergence under the spectral norm is also derived. Let $\gamma^{-1} = 3r_1^{-1} + 1.5 r_2^{-1} + r_3^{-1}$.

Theorem 3.1. Suppose $\max\{(\log p)^{6/\gamma}, K^4 (\log(pT))^2\} = o(T)$ and $T^{1/4} K^3 = o(\sqrt{p})$. Under Assumptions 3.1-3.5, the POET estimator defined in (2.13) satisfies

$$\|\tilde\Sigma_u - \Sigma_u\| = O_p(\delta_T).$$

If further $\delta_T = o(1)$, then $\tilde\Sigma_u$ is invertible with probability approaching one, and

$$\|\tilde\Sigma_u^{-1} - \Sigma_u^{-1}\| = O_p(\delta_T).$$

When estimating $\Sigma_u$, $p$ is allowed to grow exponentially fast in $T$: as long as $\log p = o(T^{\gamma/6})$, $\tilde\Sigma_u$ can be made consistent under the spectral norm. In addition, $\tilde\Sigma_u$ is invertible even when $p > T$, while the classical sample covariance matrix based on the residuals is not.

Remark 3.1. Fan, Liao and Mincheva (2011) showed that when $\{f_t\}$ are observable, the rate of convergence of the adaptive thresholding estimator is given by

$$\|\tilde\Sigma_u - \Sigma_u\| = O_p\Big(m_p K \sqrt{\frac{\log p}{T}}\Big) = \|\tilde\Sigma_u^{-1} - \Sigma_u^{-1}\|.$$

Hence when the common factors are unobservable, the rate of convergence has an additional term $m_p K^3 / \sqrt{p}$, coming from the impact of estimating the unknown factors. This impact vanishes when $p \log p \gg K^4 T$. As $p$ increases, more information about the common factors is collected, which results in more accurate estimation of the common factors $\{f_t\}$; the estimation error becomes negligible when $p$ is significantly larger than $T$. If in addition $K$ is bounded, the rate of convergence achieves the minimax rate of Cai and Zhou (2010).

3.3 Convergence of the POET estimator

As was found by Fan et al. (2008), deriving the rate of convergence of $\hat\Sigma - \Sigma$ under the spectral norm is inappropriate. We illustrate this with a simple example.

Example 3.1. Consider an ideal case where we know $b_i = (1, 0, \dots, 0)'$ for each $i = 1, \dots, p$, $\Sigma_u = I_p$, and $\{f_t\}$ are observable. Then when estimating $\Sigma$, we only need to estimate $\mathrm{cov}(f_t)$ using the sample covariance $\widehat{\mathrm{cov}}(f_t)$, which yields the estimator

$$\hat\Sigma = B\, \widehat{\mathrm{cov}}(f_t) B' + I_p.$$

Simple calculations yield

$$\hat\Sigma - \Sigma = \Big[\frac{1}{T} \sum_{t=1}^T (f_t - \bar f)^2 - \mathrm{var}(f_t)\Big] \mathbf{1}_p \mathbf{1}_p',$$

where $\mathbf{1}_p$ denotes the $p$-dimensional column vector of ones, with $\|\mathbf{1}_p \mathbf{1}_p'\| = p$. Therefore $\|\hat\Sigma - \Sigma\| = O_p(p / \sqrt{T})$, which is $o_p(1)$ only if $p = o(\sqrt{T})$.

Alternatively, Fan et al. (2008) suggested using the weighted quadratic loss

$$\|A\|_\Sigma = p^{-1/2} \|\Sigma^{-1/2} A \Sigma^{-1/2}\|_F$$

to study the rate of convergence in high-dimensional factor models; it is closely related to the entropy loss introduced by James and Stein (1961). Here $p^{-1/2}$ is a normalization factor such that $\|\Sigma\|_\Sigma = 1$. Technically, the impact of high dimensionality on the convergence rate of $\hat\Sigma - \Sigma$ is via the number of factor loadings in $B$. It was shown by Fan et al. (2008) that $B$ appears in $\|\hat\Sigma - \Sigma\|_\Sigma$ through $B' \Sigma^{-1} B$, and that $\lambda_{\max}(B' \Sigma^{-1} B)$ is bounded, which successfully cancels out the curse of high dimensionality introduced by $B$.

The following theorem gives the rate of convergence under various norms.

Theorem 3.2. Under the assumptions of Theorem 3.1, the POET estimator defined in (2.4) satisfies

$$\|\hat\Sigma - \Sigma\|_\Sigma = O_p\Big(\frac{K \sqrt{p}\, \log p}{T} + K^2 \sqrt{\frac{\log p}{T}} + \delta_T\Big), \qquad \|\hat\Sigma - \Sigma\|_{\max} = O_p\Big(K^3 \sqrt{\frac{\log K}{T}} + \omega_T\Big).$$

In addition, if $\delta_T = o(1)$, then $\hat\Sigma$ is nonsingular with probability approaching one, with

$$\|\hat\Sigma^{-1} - \Sigma^{-1}\| = O_p(\delta_T).$$

In practical applications, $K$ is typically small compared to $p$ and $T$; it can even be thought of as a constant. For example, in the Fama-French model, $K = 3$. Our result also covers the case $K = 0$, when the target covariance $\Sigma = \Sigma_u$ is a sparse matrix, and the POET estimator is then the same as either the thresholding estimator of Bickel and Levina (2008) or the adaptive thresholding estimator of Cai and Liu (2011). Theorem 3.2 then implies

$$\|\hat\Sigma - \Sigma\| = O_p\Big(m_p \sqrt{\frac{\log p}{T}}\Big) = \|\hat\Sigma^{-1} - \Sigma^{-1}\|.$$
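The weighted quadratic norm introduced above is simple to evaluate; the helper below (our naming) assumes $\Sigma$ is positive definite.

```python
import numpy as np

def sigma_norm(A, Sigma):
    """||A||_Sigma = p^{-1/2} || Sigma^{-1/2} A Sigma^{-1/2} ||_F."""
    p = Sigma.shape[0]
    w, v = np.linalg.eigh(Sigma)              # Sigma assumed positive definite
    S_inv_half = (v / np.sqrt(w)) @ v.T       # symmetric inverse square root
    return np.linalg.norm(S_inv_half @ A @ S_inv_half, "fro") / np.sqrt(p)
```

In particular, `sigma_norm(Sigma, Sigma)` returns 1, matching the normalization noted above.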

Remark 3.2. When estimating $\Sigma^{-1}$, $p$ is allowed to grow exponentially fast in $T$, and the estimator has the same rate of convergence as the estimator $\tilde\Sigma_u^{-1}$ in Theorem 3.1. When $p$ becomes much larger than $T$, the precision matrix can be estimated as if the factors were observable. This fact was also shown in the asymptotic expansions obtained by Bai (2003) when $K$ is finite and $\Sigma_u$ is diagonal. Therefore, in the approximate factor model, we can do a much better job in estimating $\Sigma^{-1}$ than in estimating $\Sigma$. The intuition follows from the fact that $\Sigma^{-1}$ has bounded eigenvalues whereas $\Sigma$ has diverging eigenvalues.

3.4 Convergence of the unknown factors and factor loadings

Many applications of factor models involve latent factors, and as a result the common factors need to be estimated by the method of principal components. In this case, the factor loadings in $B$ and the common factors $f_t$ are not separably identifiable: for any nonsingular $K \times K$ matrix $H$, $B f_t = B H^{-1} \cdot H f_t$, so $(B, f_t)$ cannot be identified from $(B H^{-1}, H f_t)$. This ambiguity is eliminated by the identifiability condition (2.7), subject to a permutation. Note that the linear space spanned by the columns of $B$ is the same as that spanned by those of $B H^{-1}$; in practice, it often does not matter which one is used.

Let $V$ denote the $K \times K$ diagonal matrix of the first $K$ largest eigenvalues of the sample covariance matrix in decreasing order. Recall that $F' = (f_1, \dots, f_T)$ and let $H = T^{-1} V^{-1} \hat F' F B' B$. Then for $t = 1, \dots, T$,

$$H f_t = T^{-1} V^{-1} \hat F' (B f_1, \dots, B f_T)' B f_t.$$

Note that $H f_t$ depends only on the data $V^{-1} \hat F'$ and the identifiable part of the parameters $\{B f_t\}$. Therefore, there is no identifiability issue in $H f_t$, regardless of whether the identifiability condition (2.7) is imposed.

Bai (2003) showed that when $K$ is fixed and $\Sigma_u$ is diagonal, $\hat b_i$ and $\hat f_t$ are consistent estimators of $(H')^{-1} b_i$ and $H f_t$ respectively. The following theorem allows $K$ to grow slowly and $\Sigma_u$ to be a sparse non-diagonal matrix.

Theorem 3.3. Under the assumptions of Theorem 3.1,

$$\max_{i \le p} \|\hat b_i - (H')^{-1} b_i\| = O_p\Big(\frac{\omega_T}{\sqrt{K}}\Big), \quad \text{and} \quad \max_{t \le T} \|\hat f_t - H f_t\| = O_p(\delta_T'),$$

where $\delta_T' = K / \sqrt{T} + K^{3/2} T^{1/4} / \sqrt{p}$.

As a consequence of Theorem 3.3, we obtain the following corollary.

Corollary 3.1. Under the assumptions of Theorem 3.1,

$$\max_{i \le p,\, t \le T} |\hat b_i' \hat f_t - b_i' f_t| = O_p\big(K \delta_T' + \omega_T (\log T)^{1/r_2}\big).$$

4 Applications

We give four examples which are immediate applications of the results in Theorems 3.1-3.3.

Example 4.1 (Large-scale hypothesis testing). Controlling the false discovery rate in large-scale hypothesis testing based on correlated test statistics is an important and challenging problem in statistics (Leek and Storey, 2008; Efron, 2010). Suppose that the test statistic for each of the hypotheses $H_{i0}: \mu_i = 0$ vs. $H_{i1}: \mu_i \ne 0$ is $Z_i \sim N(\mu_i, 1)$. These test statistics $Z$ are correlated and follow $N(\mu, \Sigma)$ with an unknown covariance matrix $\Sigma$. For a given critical value $x$, the false discovery proportion is defined as $\mathrm{FDP}(x) = V(x)/R(x)$, where

$$V(x) = \sum_{i:\, \mu_i = 0} I(|Z_i| > x) \quad \text{and} \quad R(x) = \sum_{i=1}^p I(|Z_i| > x)$$

are the number of false discoveries and the total number of discoveries, respectively. Our interest is to estimate $\mathrm{FDP}(x)$ for each given $x$. Note that $R(x)$ is an observable quantity; only $V(x)$ needs to be estimated. If the covariance $\Sigma$ admits the approximate factor structure (1.3), then the test statistics can be stochastically decomposed as

$$Z = \mu + B f + u, \quad \text{where } \Sigma_u \text{ is sparse}. \quad (4.1)$$

By the principal factor approximation (Theorem 1, Fan, Han and Gu, 2012),

$$V(x) = \sum_{i=1}^p \{\Phi(a_i(z_{x/2} + \eta_i)) + \Phi(a_i(z_{x/2} - \eta_i))\} + o_P(p), \quad (4.2)$$

when $m_p = o(p)$ and the number of true significant hypotheses $\{i : \mu_i \ne 0\}$ is $o(p)$, where $z_x$ is the upper $x$-quantile of the standard normal distribution, $\eta_i = (Bf)_i$, and $a_i = \mathrm{var}(u_i)^{-1/2}$. Now suppose that we have $n$ repeated measurements from the model (4.1). Then, by Corollary 3.1, $\{\eta_i\}$ can be uniformly estimated, hence so can $p^{-1} V(x)$ and, in turn, $\mathrm{FDP}(x)$. Efron (2010) obtained these repeated test statistics based on bootstrap samples from the original raw data. Our theory gives a formal justification to the seminal work of Efron (2007, 2010).

Example 4.2 (Risk management). The maximum elementwise estimation error $\|\hat\Sigma - \Sigma\|_{\max}$ appears in risk assessment, as in Fan, Zhang and Yu (2008). For a fixed portfolio allocation vector $w$, the true and estimated portfolio variances are $w' \Sigma w$ and $w' \hat\Sigma w$ respectively, and the estimation error is bounded by

$$|w' \hat\Sigma w - w' \Sigma w| \le \|\hat\Sigma - \Sigma\|_{\max} \|w\|_1^2,$$

where $\|w\|_1$, the $\ell_1$-norm of $w$, is the gross exposure of the portfolio. Usually a constraint is placed on the total percentage of short positions, in which case we have a restriction $\|w\|_1 \le c$ for some $c > 0$; in particular, $c = 1$ corresponds to a portfolio with no short positions (all weights are nonnegative). Theorem 3.2 quantifies the maximum approximation error.
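The gross-exposure bound of Example 4.2 is a one-line computation; the sketch below returns both sides of the inequality so that the bound can be verified numerically.

```python
import numpy as np

def risk_error_and_bound(Sigma_hat, Sigma, w):
    """|w' Sigma_hat w - w' Sigma w| and its bound ||Sigma_hat - Sigma||_max * ||w||_1^2."""
    error = abs(w @ Sigma_hat @ w - w @ Sigma @ w)
    bound = np.max(np.abs(Sigma_hat - Sigma)) * np.sum(np.abs(w)) ** 2
    return error, bound          # error <= bound, as in Example 4.2
```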

Example 4.3 (Panel regression with a factor structure in the errors). Consider the panel regression model

$$Y_{it} = x_{it}' \beta + \varepsilon_{it}, \qquad \varepsilon_{it} = b_i' f_t + u_{it}, \qquad i \le p,\ t \le T,$$

where $x_{it}$ is a vector of observable regressors of fixed dimension. The regression error $\varepsilon_{it}$ has a factor structure, but $b_i$, $f_t$ and $u_{it}$ are all unobservable. Assuming $x_{it}$ to be independent of $\varepsilon_{it}$, we are interested in the common regression coefficients $\beta$. This panel regression model has been considered by many researchers, such as Ahn, Lee and Schmidt (2001) and Pesaran (2006), and has broad applications in the social sciences. For example, in income studies, $Y_{it}$ represents the income of individual $i$ at age $t$, and $x_{it}$ is a vector of observable characteristics associated with income. Here $b_i$ represents a vector of unmeasured skills, such as innate ability, motivation, and diligence, while $f_t$ is a vector of unobservable prices for those unmeasured skills, which is assumed to be time-varying.

Although OLS (ordinary least squares) produces a consistent estimator of $\beta$, a more efficient estimator is obtained by GLS (generalized least squares). The GLS method depends on an estimator of $\Sigma_\varepsilon^{-1}$, the inverse of the covariance matrix of $\varepsilon_t = (\varepsilon_{1t}, \dots, \varepsilon_{pt})'$, which should be consistent under the spectral norm. In a large panel, $p$ can be larger than $T$; as a result, the traditional GLS, which estimates $\Sigma_\varepsilon$ based on the sample residual covariance, is infeasible. By assuming the covariance matrix of $(u_{1t}, \dots, u_{pt})$ to be sparse, we can successfully solve this problem by applying Theorem 3.2. Although $\varepsilon_{it}$ is unobservable, it can be replaced by the regression residual $\hat\varepsilon_{it}$, obtained by first regressing $Y_{it}$ on $x_{it}$. We then apply the POET estimator to $T^{-1} \sum_{t=1}^T \hat\varepsilon_t \hat\varepsilon_t'$, as described in Section 2.1. By Theorem 3.2, the inverse of the resulting estimator is a consistent estimator of $\Sigma_\varepsilon^{-1}$ under the spectral norm. A slight difference lies in the fact that when we apply POET, $\varepsilon_t \varepsilon_t'$ is replaced by $\hat\varepsilon_t \hat\varepsilon_t'$, which introduces an additional term $O_p(\sqrt{\log p / T})$ in the estimation error.

Example 4.4 (Testing asset pricing theory). A celebrated financial economic theory is the capital asset pricing model (CAPM; Sharpe, 1964), which won William Sharpe the Nobel Prize in Economics in 1990 and whose extension is the multi-factor model (Ross, 1976; Chamberlain and Rothschild, 1983). It states that in a frictionless market, the excess return of any financial asset equals the excess returns of the risk factors plus an idiosyncratic noise that is only related to the asset itself. In the multi-period model, the excess return $y_{it}$ of firm $i$ at time $t$ follows model (1.1), in which $f_t$ is the vector of excess returns of the risk factors at time $t$ and $u_{it}$ is the idiosyncratic noise. To test the null hypothesis of the pricing theory, one embeds the model into the multivariate linear model

$$y_t = \mu + B f_t + u_t, \qquad t = 1, \dots, T, \quad (4.3)$$

and wishes to test $H_0: \mu = 0$. The F-test statistic involves the estimation of the covariance matrix $\Sigma_u$, whose estimates are degenerate without regularization when $p \ge T$, even if the $f_t$ are observable risk factors. Therefore the literature (Sentana, 2009, and references therein) focuses on the case $p \ll T$; the typical choices are $T = 60$ monthly observations and $p = 5$, $10$ or $25$ assets. However, the CAPM should hold for all tradeable assets, not just a small fraction of them. With our regularization technique, a non-degenerate estimate $\tilde\Sigma_u$ can be obtained, and the F-test or likelihood-ratio test statistics can be employed even when $p \ge T$.

5 Monte Carlo Experiments

In this section we examine the finite-sample performance of the POET method and demonstrate the effect of this estimator on asset allocation and risk assessment.

Similarly to Fan et al. (2008, 2011), we simulate from a standard Fama-French three-factor model, assuming a sparse error covariance matrix and three unobserved factors. Throughout this section, the time span is fixed at $T = 300$, and the dimensionality $p$ increases from 1 to 600. We assume that the excess return of each of the $p$ stocks over the risk-free interest rate follows the model

$$y_{it} = b_{i1} f_{1t} + b_{i2} f_{2t} + b_{i3} f_{3t} + u_{it}.$$

The factor loadings are drawn from a trivariate normal distribution $b \sim N_3(\mu_B, \Sigma_B)$, the idiosyncratic errors from $u_t \sim N_p(0, \Sigma_u)$, and the factor returns $f_t$ follow a VAR(1) model. To make the simulation more realistic, the model parameters are calibrated from financial returns, as detailed in the following section.

5.1 Calibration

To calibrate the model, we use the data on annualized returns of 100 industrial portfolios from the website of Kenneth French, and the data on 3-month Treasury bill rates from the CRSP database. These industrial portfolios are formed as the intersection of 10 portfolios based on size (market equity) and 10 portfolios based on the ratio of book equity to market equity. Their excess returns $(\tilde y_t)$ are computed for the period from January 1st, 2009 to December 31st, 2010. Here we present a short outline of the calibration procedure.

1. Given $\{\tilde y_t\}_{t=1}^{500}$ as the input data, we calculate a $100 \times 3$ matrix $\hat B$ and a $500 \times 3$ matrix $\hat F$, using the principal components method described in Section 2.3.

2. We calculate the sample mean vector $\mu_B$ and sample covariance matrix $\Sigma_B$ of the rows of $\hat B$, which are reported in Table 1. The factor loadings $b_i = (b_{i1}, b_{i2}, b_{i3})'$ for $i = 1, \dots, p$ are drawn from $N_3(\mu_B, \Sigma_B)$.

Table 1: Mean and covariance matrix used to generate $b$ (calibrated values of $\mu_B$ and $\Sigma_B$).

3. We assume that the factors $f_t$ follow the stationary vector autoregressive model $f_t = \mu + \Phi f_{t-1} + \varepsilon_t$, a VAR(1) model. We obtain the multivariate least squares estimators of $\mu$ and $\Phi$, and estimate $\Sigma_\varepsilon$, by autoregressing the columns of $\hat F$. All eigenvalues of $\hat\Phi$ fall within the unit circle, so our model is stationary. The covariance matrix $\mathrm{cov}(f_t)$ can then be obtained by solving the linear equation

$$\mathrm{cov}(f_t) = \Phi\, \mathrm{cov}(f_t)\, \Phi' + \Sigma_\varepsilon.$$

The estimated parameters are depicted in Table 2, and give the data generating process for $f_t$.

Table 2: Parameters of the $f_t$ generating process (calibrated values of $\mu$, $\mathrm{cov}(f_t)$ and $\Phi$).

4. For each value of $p$, we generate a sparse covariance matrix $\Sigma_u$ of the form $\Sigma_u = D \Sigma_0 D$. Here $\Sigma_0$ is the error correlation matrix, and $D$ is the diagonal matrix of the standard deviations of the errors. We set $D = \mathrm{diag}(\sigma_1, \dots, \sigma_p)$, where each $\sigma_i$ is generated independently from a Gamma distribution $G(\alpha, \beta)$, with $\alpha$ and $\beta$ chosen to match the sample mean and sample standard deviation of the standard deviations of the errors; an approach similar to Fan et al. (2011) is used in this calibration step. The off-diagonal entries of $\Sigma_0$ are generated independently from a normal distribution, with mean and standard deviation equal to the sample mean and sample standard deviation of the sample correlations among the estimated residuals, conditional on their absolute values being no larger than a given cutoff. We then employ hard thresholding to make $\Sigma_0$ sparse, where the threshold is found as the smallest constant that preserves the positive definiteness of $\Sigma_0$. More precisely, we start with the threshold value 1, which gives $\Sigma_0 = I_p$, and then decrease the threshold value on a grid until positive definiteness is violated.
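The equation $\mathrm{cov}(f_t) = \Phi\, \mathrm{cov}(f_t)\, \Phi' + \Sigma_\varepsilon$ in step 3 is a discrete Lyapunov equation, which standard scientific libraries solve directly. The parameter values below are placeholders for the calibrated values of Table 2.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

Phi = np.array([[0.5, 0.1, 0.0],     # placeholder VAR(1) matrix (assumption)
                [0.0, 0.4, 0.1],
                [0.1, 0.0, 0.3]])
Sigma_eps = 0.1 * np.eye(3)          # placeholder innovation covariance

# Solves cov_f = Phi @ cov_f @ Phi.T + Sigma_eps.
cov_f = solve_discrete_lyapunov(Phi, Sigma_eps)
```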

5.2 Simulation

For the simulation, we fix $T = 300$ and let $p$ increase from 1 to 600. For each fixed $p$, we repeat the following steps $N = 200$ times, and record the means and standard deviations of each respective norm. A condensed sketch of steps 1-7 is given at the end of this subsection.

1. Generate independently $\{b_i\}_{i=1}^p \sim N_3(\mu_B, \Sigma_B)$, and set $B = (b_1, \dots, b_p)'$.

2. Generate independently $\{u_t\}_{t=1}^T \sim N_p(0, \Sigma_u)$.

3. Generate $\{f_t\}_{t=1}^T$ as a vector autoregressive sequence of the form $f_t = \mu + \Phi f_{t-1} + \varepsilon_t$.

4. Calculate $\{y_t\}_{t=1}^T$ from $y_t = B f_t + u_t$.

5. Calculate $\hat\Lambda$ and $\hat F$ based on $\{y_t\}$, using the PCA method described in Section 2.3.

6. Set $\omega_T = 2\big(K \sqrt{\log p / T} + K^3 / \sqrt{p}\big)$ with $K = 3$ as the threshold for creating $\tilde\Sigma_u$, and calculate $\hat\Sigma$ and $\hat\Sigma^{-1}$ using the POET method.

7. Calculate the sample covariance matrix $\hat\Sigma_{\mathrm{sam}}$.

In the graphs below, we plot the averages and standard deviations of the distances from $\hat\Sigma$ and $\hat\Sigma_{\mathrm{sam}}$ to the true covariance matrix $\Sigma$, under the norms $\|\cdot\|_\Sigma$, $\|\cdot\|$ and $\|\cdot\|_{\max}$. We also plot the means and standard deviations of the distances from $\hat\Sigma^{-1}$ and $\hat\Sigma_{\mathrm{sam}}^{-1}$ to $\Sigma^{-1}$ under the spectral norm. We plot values of $p$ from 20 to 600 in increments of 20; due to invertibility, the spectral norm for $\hat\Sigma_{\mathrm{sam}}^{-1}$ is plotted only up to $p = 280$. We also zoom into these graphs by plotting the values of $p$ from 1 to 100, this time in increments of 1. Notice that we also plot the distance from $\hat\Sigma_{\mathrm{obs}}$ to $\Sigma$ for comparison, where $\hat\Sigma_{\mathrm{obs}}$ is the estimated covariance matrix proposed by Fan et al. (2011), assuming the factors are observable.
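The condensed sketch of steps 1-7 follows. It assumes the hypothetical helper `poet` from the Section 2.1 sketch is in scope, and passes $\omega_T$ as the thresholding constant, a simplification of the adaptive rule (3.2).

```python
import numpy as np

rng = np.random.default_rng(2)
T, K = 300, 3

def one_replication(p, mu_B, Sigma_B, Sigma_u, mu, Phi, Sigma_eps):
    B = rng.multivariate_normal(mu_B, Sigma_B, size=p)          # step 1
    U = rng.multivariate_normal(np.zeros(p), Sigma_u, size=T)   # step 2
    F = np.zeros((T, K))                                        # step 3: VAR(1)
    f = np.zeros(K)
    for t in range(T):
        f = mu + Phi @ f + rng.multivariate_normal(np.zeros(K), Sigma_eps)
        F[t] = f
    Y = F @ B.T + U                                             # step 4: y_t = B f_t + u_t
    omega = 2 * (K * np.sqrt(np.log(p) / T) + K**3 / np.sqrt(p))  # step 6 threshold
    Sigma_poet = poet(Y, K, omega)                              # steps 5-6
    Sigma_sam = np.cov(Y, rowvar=False)                         # step 7
    return Sigma_poet, Sigma_sam
```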

5.3 Results

From the simulation results, reported in Figures 1-4, we observe that our estimator under the unobservable factor model performs just as well as the estimator in Fan et al. (2011) under the observable factor model when $p$ is large enough. The cost of not knowing the factors is approximately of order $O_p(1/\sqrt{p})$; as can be seen in Figures 1 and 2, this cost vanishes for $p > 300$. To give better insight into the impact of estimating the unknown factors for small $p$, a separate set of simulations is conducted for $p \le 100$. As can be seen from Figure 1 (bottom panel) and Figure 2 (middle and bottom panels), the impact decreases quickly. In addition, when estimating $\Sigma^{-1}$, it is hard to distinguish the estimators with known and unknown factors, whose performances are quite stable compared to the sample covariance matrix.

Figure 1: Averages (left panel) and standard deviations (right panel) of $\|\hat\Sigma - \Sigma\|_\Sigma$ with known factors (solid red curve) and unknown factors (solid blue curve), and of $\|\hat\Sigma_{\mathrm{sam}} - \Sigma\|_\Sigma$ (dashed curve), over 200 simulations, as a function of the dimensionality $p$. Top panel: $p$ ranges from 20 to 600 with increment 20; bottom panel: $p$ ranges from 1 to 100 with increment 1.

Figure 2: Averages (left panel) and standard deviations (right panel) of $\|\hat\Sigma^{-1} - \Sigma^{-1}\|$ with known (solid red curve) and unknown (solid blue curve) factors, and of $\|\hat\Sigma_{\mathrm{sam}}^{-1} - \Sigma^{-1}\|$ (dashed curve), over 200 simulations, as a function of the dimensionality $p$. Top panel: $p$ ranges from 20 to 600 with increment 20; middle panel: $p$ ranges from 1 to 100 with increment 1; bottom panel: the same as the top panel with the dashed curve excluded.

Figure 3: Averages (left panel) and standard deviations (right panel) of $\|\hat\Sigma - \Sigma\|_{\max}$ with known (solid red curve) and unknown (solid blue curve) factors, and of $\|\hat\Sigma_{\mathrm{sam}} - \Sigma\|_{\max}$ (dashed curve), over 200 simulations, as a function of the dimensionality $p$. The curves are nearly indistinguishable.

Also, the maximum absolute elementwise error (Figure 3) of our estimator performs very similarly to that of the sample covariance matrix, which coincides with our asymptotic result. Figure 4 shows that the performances of the three methods are indistinguishable in the spectral norm, as expected.

Figure 4: Averages of $\|\hat\Sigma - \Sigma\|$ with known (solid red curve) and unknown (solid blue curve) factors, and of $\|\hat\Sigma_{\mathrm{sam}} - \Sigma\|$ (dashed curve), over 200 simulations, as a function of the dimensionality $p$. The three curves are plotted but are hardly distinguishable.

5.4 Portfolio allocation

We demonstrate the improvement of our method over the sample covariance and over the estimator based on the strict factor model in a problem of portfolio allocation for risk minimization. Let $\hat\Sigma$ be a generic estimator of the covariance matrix of $y_t$, and let $w$ be the allocation vector of a portfolio consisting of the corresponding $p$ financial securities. Then the theoretical and empirical risks of the given portfolio are $R(w) = w' \Sigma w$ and $\hat R(w) = w' \hat\Sigma w$, respectively.

Now define

$$\hat w = \arg\min_{w' \mathbf{1} = 1} w' \hat\Sigma w,$$

the estimated (minimum variance) portfolio. The actual risk of the estimated portfolio is $R(\hat w) = \hat w' \Sigma \hat w$, and the estimated risk (also called the empirical risk) equals $\hat R(\hat w) = \hat w' \hat\Sigma \hat w$. In practice, the actual risk is unknown, and only the empirical risk can be calculated.

For each fixed $p$, the population $\Sigma$ was generated in the same way as described in Section 5.1, with a sparse but non-diagonal error covariance. We use three different methods to estimate $\Sigma$ and obtain $\hat w$: the strict factor model $\hat\Sigma_{\mathrm{diag}}$ (which estimates $\Sigma_u$ by a diagonal matrix), our POET estimator $\hat\Sigma$, both with unknown factors, and the sample covariance $\hat\Sigma_{\mathrm{sam}}$. We then calculate the corresponding actual and empirical risks.

It is interesting to examine the accuracy and the performance of the actual risk of our portfolio $\hat w$ in comparison with the oracle risk $R^* = \min_{w' \mathbf{1} = 1} w' \Sigma w$, the theoretical risk of the portfolio we would have created if we knew the true covariance matrix $\Sigma$. We thus compare the regret $R(\hat w) - R^*$, which is always nonnegative, for the three estimators of $\Sigma$; they are summarized using box plots over the 200 simulations, reported in Figure 5. In practice, we are also concerned with the difference between the actual and empirical risks of the chosen portfolio $\hat w$; hence, in Figure 6, we also compare the average difference $|R(\hat w) - \hat R(\hat w)|$ over the 200 simulations. When $\hat w$ is obtained from the strict factor model, both the difference between the actual and oracle risks and the difference between the actual and empirical risks are persistently greater than the corresponding differences for the approximate factor estimator.

Figure 5: Box plots of the regrets $R(\hat w) - R^*$ for $p = 80$ and $p = 140$. In each panel, the box plots from left to right correspond to $\hat w$ obtained using $\hat\Sigma$ based on the approximate factor model, the strict factor model, and the sample covariance, respectively.
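For a weight vector constrained only to sum to one, the minimum variance portfolio has the familiar closed form $\hat\Sigma^{-1} \mathbf{1} / (\mathbf{1}' \hat\Sigma^{-1} \mathbf{1})$, so $\hat w$ and both risks can be computed as follows (a sketch; the paper does not prescribe a particular solver).

```python
import numpy as np

def min_variance_portfolio(Sigma_hat, Sigma):
    """w-hat = argmin_{w'1=1} w' Sigma_hat w, with its actual and empirical risks."""
    ones = np.ones(Sigma_hat.shape[0])
    x = np.linalg.solve(Sigma_hat, ones)
    w_hat = x / (ones @ x)                      # closed-form minimum variance weights
    actual_risk = w_hat @ Sigma @ w_hat         # R(w-hat) = w' Sigma w
    empirical_risk = w_hat @ Sigma_hat @ w_hat  # R-hat(w-hat) = w' Sigma-hat w
    return w_hat, actual_risk, empirical_risk
```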

Figure 6: Error of risk estimation $|R(\hat w) - \hat R(\hat w)|$ as $p$ increases, where $\hat w$ and $\hat R$ are obtained based on three estimators of $\Sigma$. Blue line: approximate factor model $\hat\Sigma$; red line: strict factor model $\hat\Sigma_{\mathrm{diag}}$; dotted line: sample covariance $\hat\Sigma_{\mathrm{sam}}$.

6 Real Data Example

We demonstrate an application of the approximate factor model on real data. The data, obtained from the CRSP database, consist of $p = 50$ stocks and their annualized daily returns for the period January 1st, 2010 to December 31st, 2010 ($T = 252$). The stocks are chosen from 5 different industry sectors (Consumer Goods - Textile & Apparel Clothing; Financial - Credit Services; Healthcare - Hospitals; Services - Restaurants; Utilities - Water Utilities), with 10 stocks from each sector. The few largest eigenvalues of the sample covariance (the largest being 0.002) clearly dominate the rest, and hence we consider $K = 1$, $2$ and $3$ factors respectively. The threshold has been chosen using a leave-one-out cross-validation procedure.

Figure 7 shows the heatmap of the thresholded error correlation matrix. We compare the level of sparsity (the percentage of non-zero off-diagonal elements) for the 5 diagonal blocks of size $10 \times 10$ with the sparsity of the rest of the matrix. For $K = 2$, our method results in 25.8% non-zero off-diagonal elements in the 5 diagonal blocks, as opposed to 7.3% non-zero elements in the rest of the covariance matrix. Moreover, of the non-zero elements in the central 5 blocks, 100% are positive, as opposed to a distribution of 60.3% positive and 39.7% negative among the non-zero elements in the off-diagonal blocks. There is a strong positive correlation between the returns of companies in the same industry after the common factors are taken out, and the thresholding has preserved it. The results for $K = 1$ and $K = 3$ show the same characteristics. These findings provide stark evidence that the strict factor model is not appropriate.

Figure 7: Heatmap of the thresholded error correlation matrix for number of factors $K = 1$, $K = 2$ and $K = 3$.

7 Conclusion

We study the problem of estimating a high-dimensional covariance matrix with conditional sparsity. Realizing that an unconditional sparsity assumption is inappropriate in many applications, we introduce a factor model that has a conditional sparsity feature, and propose the POET estimator to take advantage of this structure.

This expands considerably the scope of the model based on the strict factor model, which assumes independent idiosyncratic noise and is too restrictive in practice. By assuming a sparse error covariance matrix, we allow for the presence of cross-sectional correlation even after taking out the common factors. The sparse covariance is estimated by the adaptive thresholding technique.

It is found that the rates of convergence of the estimators have an extra term of approximate order $O_p(p^{-1/2})$ in addition to the results based on observable factors by Fan et al. (2008) and Fan et al. (2011), which arises from the effect of estimating the unobservable factors. This effect vanishes as the dimensionality increases, as more information about the common factors becomes available. When $p$ gets large enough, the effect of estimating the unknown factors is negligible, and we can estimate the covariance matrices as if we knew the factors. This fact was also shown in the asymptotic expansions obtained by Bai (2003).

The sparse covariance is estimated using adaptive hard thresholding. Recently, Cai and Liu (2011) studied the more general thresholding functions of Antoniadis and Fan (2001), which admit the form $\tilde\sigma_{ij} = s(\hat\sigma_{ij})$ and also allow for soft thresholding. It is easy to apply the more general thresholding here as well, and the rates of convergence of the resulting covariance matrix estimators should be straightforward to derive.

A Proofs for Section 2

Proof of Theorem 2.1

Proof. The sample covariance matrix of the residuals from the least squares method is given by

$$\hat\Sigma_u = \frac{1}{T}(Y - \hat\Lambda \hat F')(Y - \hat\Lambda \hat F')'$$

$$= \frac{1}{T} Y Y' - \hat\Lambda \hat\Lambda',$$

where we used $\hat\Lambda = T^{-1} Y \hat F$ and the normalization condition $T^{-1} \hat F' \hat F = I_K$. If we show that $\hat\Lambda \hat\Lambda' = \sum_{i=1}^K \hat\lambda_i \hat\xi_i \hat\xi_i'$, then from the decomposition of the sample covariance

$$\frac{1}{T} Y Y' = \hat\Lambda \hat\Lambda' + \hat\Sigma_u = \sum_{i=1}^K \hat\lambda_i \hat\xi_i \hat\xi_i' + \hat R,$$

we have $\hat R = \hat\Sigma_u$. Consequently, applying thresholding to $\hat\Sigma_u$ is equivalent to applying thresholding to $\hat R$, which gives the desired result.

We now show that $\hat\Lambda \hat\Lambda' = \sum_{i=1}^K \hat\lambda_i \hat\xi_i \hat\xi_i'$ indeed holds. Consider again the least squares problem (2.10), but with the following alternative normalization constraints:

$$\frac{1}{p} \sum_{i=1}^p b_i b_i' = I_K, \qquad \frac{1}{T} \sum_{t=1}^T f_t f_t' \text{ is diagonal}.$$

Let $(\tilde\Lambda, \tilde F)$ be the solution to the new optimization problem. Switching the roles of $B$ and $F$, the solution of (2.12) is $\tilde\Lambda = \sqrt{p}\,(\hat\xi_1, \dots, \hat\xi_K)$ and $\tilde F = p^{-1} Y' \tilde\Lambda$. In addition, $T^{-1} \tilde F' \tilde F = p^{-1} \mathrm{diag}(\hat\lambda_1, \dots, \hat\lambda_K)$. From $\hat\Lambda \hat F' = \tilde\Lambda \tilde F'$, it follows that

$$\hat\Lambda \hat\Lambda' = \frac{1}{T} \hat\Lambda \hat F' \hat F \hat\Lambda' = \frac{1}{T} \tilde\Lambda \tilde F' \tilde F \tilde\Lambda' = \sum_{i=1}^K \hat\lambda_i \hat\xi_i \hat\xi_i'. \qquad \text{Q.E.D.}$$

B Proofs for Section 3

We will proceed by showing Theorems 3.3, 3.1 and 3.2, in that order.

B.1 Preliminary lemmas

The following results are used subsequently. The proofs of Lemmas B.1 and B.2 can be found in Fan, Liao and Mincheva (2011).

Lemma B.1. Suppose $A$ and $B$ are symmetric semi-positive definite matrices, and $\lambda_{\min}(B) > c_T$ for a sequence $c_T > 0$. If $\|A - B\| = o_p(c_T)$, then $\lambda_{\min}(A) > c_T / 2$ with probability approaching one, and

$$\|A^{-1} - B^{-1}\| = O_p(c_T^{-2}) \|A - B\|.$$

Lemma B.2. Suppose that the random variables $Z_1, Z_2$ both satisfy the exponential-type tail condition: there exist $r_1, r_2 \in (0, 1)$ and $b_1, b_2 > 0$ such that, for all $s > 0$,

$$P(|Z_i| > s) \le \exp(-(s/b_i)^{r_i}), \quad i = 1, 2.$$

Then for some $r_3$ and $b_3 > 0$, and any $s > 0$,

$$P(|Z_1 Z_2| > s) \le \exp(1 - (s/b_3)^{r_3}). \quad (B.1)$$

Lemma B.3. Under the assumptions of Theorem 3.1,

(i) $\max_{i,j \le K} |T^{-1} \sum_{t=1}^T f_{it} f_{jt} - E f_{it} f_{jt}| = O_p(\sqrt{(\log K)/T})$;

(ii) $\max_{i,j \le p} |T^{-1} \sum_{t=1}^T u_{it} u_{jt} - E u_{it} u_{jt}| = O_p(\sqrt{(\log p)/T})$;

(iii) $\max_{i \le K,\, j \le p} |T^{-1} \sum_{t=1}^T f_{it} u_{jt}| = O_p(\sqrt{(\log p)/T})$.

Proof. The proof follows from the Bernstein inequality for weakly dependent data (Merlevède et al., 2009, Theorem 1). See Lemmas A.3 and B.1 of Fan, Liao and Mincheva (2011).

Lemma B.4. Let $\hat\lambda_K$ denote the $K$th largest eigenvalue of $\hat\Sigma_{\mathrm{sam}} = T^{-1} \sum_{t=1}^T y_t y_t'$. Then $\hat\lambda_K > C p$ with probability approaching one, for some $C > 0$.

Proof. First of all, by Proposition 2.1, under Assumption 3.5 the $K$th largest eigenvalue $\lambda_K$ of $\Sigma$ satisfies

$$\lambda_K \ge \|\tilde b_K\|^2 - |\lambda_K - \|\tilde b_K\|^2| \ge p \lambda_{\min}(\Omega) - \|\Sigma_u\| \ge p \lambda_{\min}(\Omega)/2$$

for sufficiently large $p$. Using Weyl's theorem, we need only prove that $\|\hat\Sigma_{\mathrm{sam}} - \Sigma\| = o_p(p)$. Without loss of generality, we prove the result under the identifiability condition (2.7). Using model (1.2),

$$\hat\Sigma_{\mathrm{sam}} = \frac{1}{T} \sum_{t=1}^T (B f_t + u_t)(B f_t + u_t)'.$$


More information

Computationally efficient banding of large covariance matrices for ordered data and connections to banding the inverse Cholesky factor

Computationally efficient banding of large covariance matrices for ordered data and connections to banding the inverse Cholesky factor Computationally efficient banding of large covariance matrices for ordered data and connections to banding the inverse Cholesky factor Y. Wang M. J. Daniels wang.yanpin@scrippshealth.org mjdaniels@austin.utexas.edu

More information

arxiv: v2 [math.st] 2 Jul 2017

arxiv: v2 [math.st] 2 Jul 2017 A Relaxed Approach to Estimating Large Portfolios Mehmet Caner Esra Ulasan Laurent Callot A.Özlem Önder July 4, 2017 arxiv:1611.07347v2 [math.st] 2 Jul 2017 Abstract This paper considers three aspects

More information

Factor Models for Asset Returns. Prof. Daniel P. Palomar

Factor Models for Asset Returns. Prof. Daniel P. Palomar Factor Models for Asset Returns Prof. Daniel P. Palomar The Hong Kong University of Science and Technology (HKUST) MAFS6010R- Portfolio Optimization with R MSc in Financial Mathematics Fall 2018-19, HKUST,

More information

FE670 Algorithmic Trading Strategies. Stevens Institute of Technology

FE670 Algorithmic Trading Strategies. Stevens Institute of Technology FE670 Algorithmic Trading Strategies Lecture 3. Factor Models and Their Estimation Steve Yang Stevens Institute of Technology 09/12/2012 Outline 1 The Notion of Factors 2 Factor Analysis via Maximum Likelihood

More information

Regression: Ordinary Least Squares

Regression: Ordinary Least Squares Regression: Ordinary Least Squares Mark Hendricks Autumn 2017 FINM Intro: Regression Outline Regression OLS Mathematics Linear Projection Hendricks, Autumn 2017 FINM Intro: Regression: Lecture 2/32 Regression

More information

Log Covariance Matrix Estimation

Log Covariance Matrix Estimation Log Covariance Matrix Estimation Xinwei Deng Department of Statistics University of Wisconsin-Madison Joint work with Kam-Wah Tsui (Univ. of Wisconsin-Madsion) 1 Outline Background and Motivation The Proposed

More information

The Slow Convergence of OLS Estimators of α, β and Portfolio. β and Portfolio Weights under Long Memory Stochastic Volatility

The Slow Convergence of OLS Estimators of α, β and Portfolio. β and Portfolio Weights under Long Memory Stochastic Volatility The Slow Convergence of OLS Estimators of α, β and Portfolio Weights under Long Memory Stochastic Volatility New York University Stern School of Business June 21, 2018 Introduction Bivariate long memory

More information

Journal of Multivariate Analysis. Consistency of sparse PCA in High Dimension, Low Sample Size contexts

Journal of Multivariate Analysis. Consistency of sparse PCA in High Dimension, Low Sample Size contexts Journal of Multivariate Analysis 5 (03) 37 333 Contents lists available at SciVerse ScienceDirect Journal of Multivariate Analysis journal homepage: www.elsevier.com/locate/jmva Consistency of sparse PCA

More information

MFE Financial Econometrics 2018 Final Exam Model Solutions

MFE Financial Econometrics 2018 Final Exam Model Solutions MFE Financial Econometrics 2018 Final Exam Model Solutions Tuesday 12 th March, 2019 1. If (X, ε) N (0, I 2 ) what is the distribution of Y = µ + β X + ε? Y N ( µ, β 2 + 1 ) 2. What is the Cramer-Rao lower

More information

Estimating Latent Asset-Pricing Factors

Estimating Latent Asset-Pricing Factors Estimating Latent Asset-Pricing Factors Martin Lettau and Markus Pelger April 9, 8 Abstract We develop an estimator for latent factors in a large-dimensional panel of financial data that can explain expected

More information

Estimating Structured High-Dimensional Covariance and Precision Matrices: Optimal Rates and Adaptive Estimation

Estimating Structured High-Dimensional Covariance and Precision Matrices: Optimal Rates and Adaptive Estimation Estimating Structured High-Dimensional Covariance and Precision Matrices: Optimal Rates and Adaptive Estimation T. Tony Cai 1, Zhao Ren 2 and Harrison H. Zhou 3 University of Pennsylvania, University of

More information

Sufficient Forecasting Using Factor Models

Sufficient Forecasting Using Factor Models Sufficient Forecasting Using Factor Models arxiv:1505.07414v2 [math.st] 24 Dec 2015 Jianqing Fan, Lingzhou Xue and Jiawei Yao Princeton University and Pennsylvania State University Abstract We consider

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

Principal Component Analysis (PCA) Our starting point consists of T observations from N variables, which will be arranged in an T N matrix R,

Principal Component Analysis (PCA) Our starting point consists of T observations from N variables, which will be arranged in an T N matrix R, Principal Component Analysis (PCA) PCA is a widely used statistical tool for dimension reduction. The objective of PCA is to find common factors, the so called principal components, in form of linear combinations

More information

Factor Modelling for Time Series: a dimension reduction approach

Factor Modelling for Time Series: a dimension reduction approach Factor Modelling for Time Series: a dimension reduction approach Clifford Lam and Qiwei Yao Department of Statistics London School of Economics q.yao@lse.ac.uk p.1 Econometric factor models: a brief survey

More information

GARCH Models. Eduardo Rossi University of Pavia. December Rossi GARCH Financial Econometrics / 50

GARCH Models. Eduardo Rossi University of Pavia. December Rossi GARCH Financial Econometrics / 50 GARCH Models Eduardo Rossi University of Pavia December 013 Rossi GARCH Financial Econometrics - 013 1 / 50 Outline 1 Stylized Facts ARCH model: definition 3 GARCH model 4 EGARCH 5 Asymmetric Models 6

More information

High Dimensional Inverse Covariate Matrix Estimation via Linear Programming

High Dimensional Inverse Covariate Matrix Estimation via Linear Programming High Dimensional Inverse Covariate Matrix Estimation via Linear Programming Ming Yuan October 24, 2011 Gaussian Graphical Model X = (X 1,..., X p ) indep. N(µ, Σ) Inverse covariance matrix Σ 1 = Ω = (ω

More information

Portfolio Allocation using High Frequency Data. Jianqing Fan

Portfolio Allocation using High Frequency Data. Jianqing Fan Portfolio Allocation using High Frequency Data Princeton University With Yingying Li and Ke Yu http://www.princeton.edu/ jqfan September 10, 2010 About this talk How to select sparsely optimal portfolio?

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon

More information

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018 Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate

More information

Estimation of the Global Minimum Variance Portfolio in High Dimensions

Estimation of the Global Minimum Variance Portfolio in High Dimensions Estimation of the Global Minimum Variance Portfolio in High Dimensions Taras Bodnar, Nestor Parolya and Wolfgang Schmid 07.FEBRUARY 2014 1 / 25 Outline Introduction Random Matrix Theory: Preliminary Results

More information

Statistical Machine Learning for Structured and High Dimensional Data

Statistical Machine Learning for Structured and High Dimensional Data Statistical Machine Learning for Structured and High Dimensional Data (FA9550-09- 1-0373) PI: Larry Wasserman (CMU) Co- PI: John Lafferty (UChicago and CMU) AFOSR Program Review (Jan 28-31, 2013, Washington,

More information

Multivariate GARCH models.

Multivariate GARCH models. Multivariate GARCH models. Financial market volatility moves together over time across assets and markets. Recognizing this commonality through a multivariate modeling framework leads to obvious gains

More information

Homogeneity Pursuit. Jianqing Fan

Homogeneity Pursuit. Jianqing Fan Jianqing Fan Princeton University with Tracy Ke and Yichao Wu http://www.princeton.edu/ jqfan June 5, 2014 Get my own profile - Help Amazing Follow this author Grace Wahba 9 Followers Follow new articles

More information

Cross-Sectional Regression after Factor Analysis: Two Applications

Cross-Sectional Regression after Factor Analysis: Two Applications al Regression after Factor Analysis: Two Applications Joint work with Jingshu, Trevor, Art; Yang Song (GSB) May 7, 2016 Overview 1 2 3 4 1 / 27 Outline 1 2 3 4 2 / 27 Data matrix Y R n p Panel data. Transposable

More information

Online Appendix. j=1. φ T (ω j ) vec (EI T (ω j ) f θ0 (ω j )). vec (EI T (ω) f θ0 (ω)) = O T β+1/2) = o(1), M 1. M T (s) exp ( isω)

Online Appendix. j=1. φ T (ω j ) vec (EI T (ω j ) f θ0 (ω j )). vec (EI T (ω) f θ0 (ω)) = O T β+1/2) = o(1), M 1. M T (s) exp ( isω) Online Appendix Proof of Lemma A.. he proof uses similar arguments as in Dunsmuir 979), but allowing for weak identification and selecting a subset of frequencies using W ω). It consists of two steps.

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 26, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 55 High dimensional

More information

arxiv: v1 [math.st] 13 Feb 2012

arxiv: v1 [math.st] 13 Feb 2012 Sparse Matrix Inversion with Scaled Lasso Tingni Sun and Cun-Hui Zhang Rutgers University arxiv:1202.2723v1 [math.st] 13 Feb 2012 Address: Department of Statistics and Biostatistics, Hill Center, Busch

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional

More information

Robust Factor Models with Explanatory Proxies

Robust Factor Models with Explanatory Proxies Robust Factor Models with Explanatory Proxies Jianqing Fan Yuan Ke Yuan Liao March 21, 2016 Abstract We provide an econometric analysis for the factor models when the latent factors can be explained partially

More information

Thomas J. Fisher. Research Statement. Preliminary Results

Thomas J. Fisher. Research Statement. Preliminary Results Thomas J. Fisher Research Statement Preliminary Results Many applications of modern statistics involve a large number of measurements and can be considered in a linear algebra framework. In many of these

More information

NUCLEAR NORM PENALIZED ESTIMATION OF INTERACTIVE FIXED EFFECT MODELS. Incomplete and Work in Progress. 1. Introduction

NUCLEAR NORM PENALIZED ESTIMATION OF INTERACTIVE FIXED EFFECT MODELS. Incomplete and Work in Progress. 1. Introduction NUCLEAR NORM PENALIZED ESTIMATION OF IERACTIVE FIXED EFFECT MODELS HYUNGSIK ROGER MOON AND MARTIN WEIDNER Incomplete and Work in Progress. Introduction Interactive fixed effects panel regression models

More information

Composite Loss Functions and Multivariate Regression; Sparse PCA

Composite Loss Functions and Multivariate Regression; Sparse PCA Composite Loss Functions and Multivariate Regression; Sparse PCA G. Obozinski, B. Taskar, and M. I. Jordan (2009). Joint covariate selection and joint subspace selection for multiple classification problems.

More information

Network Connectivity and Systematic Risk

Network Connectivity and Systematic Risk Network Connectivity and Systematic Risk Monica Billio 1 Massimiliano Caporin 2 Roberto Panzica 3 Loriana Pelizzon 1,3 1 University Ca Foscari Venezia (Italy) 2 University of Padova (Italy) 3 Goethe University

More information

Confounder Adjustment in Multiple Hypothesis Testing

Confounder Adjustment in Multiple Hypothesis Testing in Multiple Hypothesis Testing Department of Statistics, Stanford University January 28, 2016 Slides are available at http://web.stanford.edu/~qyzhao/. Collaborators Jingshu Wang Trevor Hastie Art Owen

More information

Applications of Random Matrix Theory to Economics, Finance and Political Science

Applications of Random Matrix Theory to Economics, Finance and Political Science Outline Applications of Random Matrix Theory to Economics, Finance and Political Science Matthew C. 1 1 Department of Economics, MIT Institute for Quantitative Social Science, Harvard University SEA 06

More information

INFERENTIAL THEORY FOR FACTOR MODELS OF LARGE DIMENSIONS. Jushan Bai. May 21, 2002

INFERENTIAL THEORY FOR FACTOR MODELS OF LARGE DIMENSIONS. Jushan Bai. May 21, 2002 IFEREIAL HEORY FOR FACOR MODELS OF LARGE DIMESIOS Jushan Bai May 2, 22 Abstract his paper develops an inferential theory for factor models of large dimensions. he principal components estimator is considered

More information

arxiv: v2 [stat.me] 10 Sep 2018

arxiv: v2 [stat.me] 10 Sep 2018 Factor-Adjusted Regularized Model Selection arxiv:1612.08490v2 [stat.me] 10 Sep 2018 Jianqing Fan Department of ORFE, Princeton University Yuan Ke Department of Statistics, University of Georgia and Kaizheng

More information

Nowcasting Norwegian GDP

Nowcasting Norwegian GDP Nowcasting Norwegian GDP Knut Are Aastveit and Tørres Trovik May 13, 2007 Introduction Motivation The last decades of advances in information technology has made it possible to access a huge amount of

More information

Estimating Covariance Using Factorial Hidden Markov Models

Estimating Covariance Using Factorial Hidden Markov Models Estimating Covariance Using Factorial Hidden Markov Models João Sedoc 1,2 with: Jordan Rodu 3, Lyle Ungar 1, Dean Foster 1 and Jean Gallier 1 1 University of Pennsylvania Philadelphia, PA joao@cis.upenn.edu

More information

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao

More information

Applied Microeconometrics (L5): Panel Data-Basics

Applied Microeconometrics (L5): Panel Data-Basics Applied Microeconometrics (L5): Panel Data-Basics Nicholas Giannakopoulos University of Patras Department of Economics ngias@upatras.gr November 10, 2015 Nicholas Giannakopoulos (UPatras) MSc Applied Economics

More information

Understanding Regressions with Observations Collected at High Frequency over Long Span

Understanding Regressions with Observations Collected at High Frequency over Long Span Understanding Regressions with Observations Collected at High Frequency over Long Span Yoosoon Chang Department of Economics, Indiana University Joon Y. Park Department of Economics, Indiana University

More information

Discussion of the paper Inference for Semiparametric Models: Some Questions and an Answer by Bickel and Kwon

Discussion of the paper Inference for Semiparametric Models: Some Questions and an Answer by Bickel and Kwon Discussion of the paper Inference for Semiparametric Models: Some Questions and an Answer by Bickel and Kwon Jianqing Fan Department of Statistics Chinese University of Hong Kong AND Department of Statistics

More information

Empirical Market Microstructure Analysis (EMMA)

Empirical Market Microstructure Analysis (EMMA) Empirical Market Microstructure Analysis (EMMA) Lecture 3: Statistical Building Blocks and Econometric Basics Prof. Dr. Michael Stein michael.stein@vwl.uni-freiburg.de Albert-Ludwigs-University of Freiburg

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

This paper develops a test of the asymptotic arbitrage pricing theory (APT) via the maximum squared Sharpe

This paper develops a test of the asymptotic arbitrage pricing theory (APT) via the maximum squared Sharpe MANAGEMENT SCIENCE Vol. 55, No. 7, July 2009, pp. 1255 1266 issn 0025-1909 eissn 1526-5501 09 5507 1255 informs doi 10.1287/mnsc.1090.1004 2009 INFORMS Testing the APT with the Maximum Sharpe Ratio of

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

Delta Theorem in the Age of High Dimensions

Delta Theorem in the Age of High Dimensions Delta Theorem in the Age of High Dimensions Mehmet Caner Department of Economics Ohio State University December 15, 2016 Abstract We provide a new version of delta theorem, that takes into account of high

More information

Matrix Factorizations

Matrix Factorizations 1 Stat 540, Matrix Factorizations Matrix Factorizations LU Factorization Definition... Given a square k k matrix S, the LU factorization (or decomposition) represents S as the product of two triangular

More information

Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility

Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility American Economic Review: Papers & Proceedings 2016, 106(5): 400 404 http://dx.doi.org/10.1257/aer.p20161082 Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility By Gary Chamberlain*

More information

Discussion of Bootstrap prediction intervals for linear, nonlinear, and nonparametric autoregressions, by Li Pan and Dimitris Politis

Discussion of Bootstrap prediction intervals for linear, nonlinear, and nonparametric autoregressions, by Li Pan and Dimitris Politis Discussion of Bootstrap prediction intervals for linear, nonlinear, and nonparametric autoregressions, by Li Pan and Dimitris Politis Sílvia Gonçalves and Benoit Perron Département de sciences économiques,

More information

Generalized Power Method for Sparse Principal Component Analysis

Generalized Power Method for Sparse Principal Component Analysis Generalized Power Method for Sparse Principal Component Analysis Peter Richtárik CORE/INMA Catholic University of Louvain Belgium VOCAL 2008, Veszprém, Hungary CORE Discussion Paper #2008/70 joint work

More information

DYNAMIC AND COMPROMISE FACTOR ANALYSIS

DYNAMIC AND COMPROMISE FACTOR ANALYSIS DYNAMIC AND COMPROMISE FACTOR ANALYSIS Marianna Bolla Budapest University of Technology and Economics marib@math.bme.hu Many parts are joint work with Gy. Michaletzky, Loránd Eötvös University and G. Tusnády,

More information

A multiple testing approach to the regularisation of large sample correlation matrices

A multiple testing approach to the regularisation of large sample correlation matrices A multiple testing approach to the regularisation of large sample correlation matrices Natalia Bailey Department of Econometrics and Business Statistics, Monash University M. Hashem Pesaran University

More information

arxiv: v2 [stat.me] 21 Jun 2017

arxiv: v2 [stat.me] 21 Jun 2017 Factor Models for Matrix-Valued High-Dimensional Time Series Dong Wang arxiv:1610.01889v2 [stat.me] 21 Jun 2017 Department of Operations Research and Financial Engineering, Princeton University, Princeton,

More information

Forecast comparison of principal component regression and principal covariate regression

Forecast comparison of principal component regression and principal covariate regression Forecast comparison of principal component regression and principal covariate regression Christiaan Heij, Patrick J.F. Groenen, Dick J. van Dijk Econometric Institute, Erasmus University Rotterdam Econometric

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Hands-on Matrix Algebra Using R

Hands-on Matrix Algebra Using R Preface vii 1. R Preliminaries 1 1.1 Matrix Defined, Deeper Understanding Using Software.. 1 1.2 Introduction, Why R?.................... 2 1.3 Obtaining R.......................... 4 1.4 Reference Manuals

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

High-dimensional Covariance Estimation Based On Gaussian Graphical Models

High-dimensional Covariance Estimation Based On Gaussian Graphical Models High-dimensional Covariance Estimation Based On Gaussian Graphical Models Shuheng Zhou, Philipp Rutimann, Min Xu and Peter Buhlmann February 3, 2012 Problem definition Want to estimate the covariance matrix

More information

ROBUST FACTOR MODELS WITH COVARIATES

ROBUST FACTOR MODELS WITH COVARIATES Submitted to the Annals of Statistics ROBUS FACOR MODELS WIH COVARIAES By Jianqing Fan, Yuan Ke and Yuan Liao Princeton University and Rutgers University We study factor models when the latent factors

More information

LECTURE 2 LINEAR REGRESSION MODEL AND OLS

LECTURE 2 LINEAR REGRESSION MODEL AND OLS SEPTEMBER 29, 2014 LECTURE 2 LINEAR REGRESSION MODEL AND OLS Definitions A common question in econometrics is to study the effect of one group of variables X i, usually called the regressors, on another

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Variable Screening in High-dimensional Feature Space

Variable Screening in High-dimensional Feature Space ICCM 2007 Vol. II 735 747 Variable Screening in High-dimensional Feature Space Jianqing Fan Abstract Variable selection in high-dimensional space characterizes many contemporary problems in scientific

More information

ECON2228 Notes 2. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 47

ECON2228 Notes 2. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 47 ECON2228 Notes 2 Christopher F Baum Boston College Economics 2014 2015 cfb (BC Econ) ECON2228 Notes 2 2014 2015 1 / 47 Chapter 2: The simple regression model Most of this course will be concerned with

More information

On singular values distribution of a matrix large auto-covariance in the ultra-dimensional regime. Title

On singular values distribution of a matrix large auto-covariance in the ultra-dimensional regime. Title itle On singular values distribution of a matrix large auto-covariance in the ultra-dimensional regime Authors Wang, Q; Yao, JJ Citation Random Matrices: heory and Applications, 205, v. 4, p. article no.

More information

Forecasting 1 to h steps ahead using partial least squares

Forecasting 1 to h steps ahead using partial least squares Forecasting 1 to h steps ahead using partial least squares Philip Hans Franses Econometric Institute, Erasmus University Rotterdam November 10, 2006 Econometric Institute Report 2006-47 I thank Dick van

More information

Estimation of Graphical Models with Shape Restriction

Estimation of Graphical Models with Shape Restriction Estimation of Graphical Models with Shape Restriction BY KHAI X. CHIONG USC Dornsife INE, Department of Economics, University of Southern California, Los Angeles, California 989, U.S.A. kchiong@usc.edu

More information