ECA: High Dimensional Elliptical Component Analysis in Non-Gaussian Distributions


ECA: High Dimensional Elliptical Component Analysis in Non-Gaussian Distributions

Fang Han and Han Liu

arXiv: v4 [stat.ML] 3 Oct 2016

Abstract

We present a robust alternative to principal component analysis (PCA), called elliptical component analysis (ECA), for analyzing high dimensional, elliptically distributed data. ECA estimates the eigenspace of the covariance matrix of the elliptical data. To cope with heavy-tailed elliptical distributions, a multivariate rank statistic is exploited. At the model level, we consider two settings: either the leading eigenvectors of the covariance matrix are non-sparse or they are sparse. Methodologically, we propose ECA procedures for both the non-sparse and sparse settings. Theoretically, we provide both non-asymptotic and asymptotic analyses quantifying the theoretical performance of ECA. In the non-sparse setting, we show that ECA's performance is highly related to the effective rank of the covariance matrix. In the sparse setting, the results are twofold: (i) we show that the sparse ECA estimator based on a combinatoric program attains the optimal rate of convergence; (ii) based on some recent developments in estimating sparse leading eigenvectors, we show that a computationally efficient sparse ECA estimator attains the optimal rate of convergence under a suboptimal scaling.

Keywords: multivariate Kendall's tau, elliptical component analysis, sparse principal component analysis, optimality property, robust estimators, elliptical distribution.

1 Introduction

Principal component analysis (PCA) plays important roles in many different areas. For example, it is one of the most useful techniques for data visualization in studying brain imaging data (Lindquist, 2008). This paper considers a problem closely related to PCA, namely estimating the leading eigenvectors of the covariance matrix. Let X_1, ..., X_n be n data points of a random vector X in R^d. Denote Σ to be the covariance matrix of X, and u_1, ..., u_m to be its top m leading eigenvectors. We want to find estimators û_1, ..., û_m that estimate u_1, ..., u_m accurately.

Fang Han: Department of Statistics, University of Washington, Seattle, WA 98115, USA; fanghan@uw.edu. Han Liu: Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; hanliu@princeton.edu. Fang Han's research was supported by NIH-R01-EB01547 and a UW faculty start-up grant. Han Liu's research was supported by the NSF CAREER Award DMS, NSF IIS, NSF IIS, NSF IIS, NIH R01-MH10339, NIH R01-GM083084, and NIH R01-HG

This paper is focused on the high dimensional setting where the dimension d could be comparable to, or even larger than, the sample size n, and the data could be heavy-tailed (especially, non-Gaussian). A motivating example for considering such high dimensional, non-Gaussian, heavy-tailed data is our study on functional magnetic resonance imaging (fMRI). In particular, in Section 6 we examine an fMRI dataset, with 116 regions of interest (ROIs), from the Autism Brain Imaging Data Exchange (ABIDE) project containing 544 normal subjects. There, the dimension d = 116 is comparable to the sample size n = 544. In addition, Table 3 shows that the data we consider cannot pass any normality test, and Figure 5 further indicates that the data are heavy-tailed.

In high dimensions, the performance of PCA, using the leading eigenvectors of the Pearson's sample covariance matrix, has been studied for subgaussian data. In particular, for any matrix M in R^{d x d}, letting Tr(M) and σ_i(M) be the trace and the i-th largest singular value of M, Lounici (2014) showed that PCA could be consistent when r*(Σ) := Tr(Σ)/σ_1(Σ) satisfies r*(Σ) log d/n → 0. Here r*(Σ) is referred to as the effective rank of Σ in the literature (Vershynin, 2010; Lounici, 2014). When r*(Σ) log d/n does not converge to 0, PCA might not be consistent. The inconsistency phenomenon of PCA in high dimensions has been pointed out by Johnstone and Lu (2009). In particular, they showed that the angle between the PCA estimator and u_1 may not converge to 0 if d/n → c for some constant c > 0.

To avoid this curse of dimensionality, certain types of sparsity assumptions are needed. For example, in estimating the leading eigenvector u_1 := (u_11, ..., u_1d)^T, we may assume that u_1 is sparse, i.e., s := card{j : u_1j ≠ 0} is much smaller than n. We call the setting where u_1 is sparse the sparse setting and the setting where u_1 is not necessarily sparse the non-sparse setting. In the sparse setting, different variants of sparse PCA methods have been proposed. For example, d'Aspremont et al. (2007) proposed formulating a convex semidefinite program for calculating the sparse leading eigenvectors. Jolliffe et al. (2003) and Zou et al. (2006) connected PCA to regression and proposed using lasso-type estimators for parameter estimation. Shen and Huang (2008) and Witten et al. (2009) connected PCA to singular value decomposition (SVD) and proposed iterative algorithms for estimating the left and right singular vectors. Journée et al. (2010) and Zhang and El Ghaoui (2011) proposed greedily searching the principal submatrices of the covariance matrix. Ma (2013) and Yuan and Zhang (2013) proposed using modified versions of the power method to estimate eigenvectors and principal subspaces.

Theoretical properties of these methods have been analyzed under both Gaussian and subgaussian assumptions. On one hand, in terms of computationally efficient methods, under the spike covariance Gaussian model, Amini and Wainwright (2009) showed consistency in parameter estimation and model selection for sparse PCA computed via the semidefinite program proposed in d'Aspremont et al. (2007). Ma (2013) justified the use of a modified iterative thresholding method in estimating principal subspaces. By exploiting a convex program using the Fantope projection (Overton and Womersley, 1992; Dattorro, 2005), Vu et al. (2013) showed that there exist computationally efficient estimators that attain, under various settings, the O_P(s√(log d/n)) rate of convergence for possibly non-spike models. In Section 5.1, we will discuss the Fantope projection in more detail.
On the other hand, there exists another line of research focusing on sparse PCA conducted via combinatoric programs. For example, Vu and Lei (2012), Lounici (2013), and Vu and Lei (2013) studied leading eigenvector and principal subspace estimation problems via exhaustively searching over all submatrices. They showed that the optimal O_P(√(s log(ed/s)/n)) (up to some

other parameters of Σ) rate of convergence can be attained using this computationally expensive approach. Such a global search was also studied in Cai et al. (2015), where they established the upper and lower bounds in both covariance matrix and principal subspace estimation. Barriers between the aforementioned statistically efficient method and computationally efficient methods in sparse PCA were pointed out by Berthet and Rigollet (2013) using the principal component detection problem. Such barriers were also studied in Ma and Wu (2015).

One limitation of the PCA and sparse PCA theories is that they rely heavily on the Gaussian or subgaussian assumption. If the Gaussian assumption is correct, accurate estimation can be expected; otherwise, the obtained result may be misleading. To relax the Gaussian assumption, Han and Liu (2014) generalized the Gaussian family to the semiparametric transelliptical family (called the meta-elliptical in their paper) for modeling the data. The transelliptical family assumes that, after unspecified increasing marginal transformations, the data are elliptically distributed. By resorting to the marginal Kendall's tau statistic, Han and Liu (2014) proposed a semiparametric alternative to scale-invariant PCA, named transelliptical component analysis (TCA), for estimating the leading eigenvector of the latent generalized correlation matrix Σ_0. In follow-up works, Wegkamp and Zhao (2016) and Han and Liu (2016) showed that, under various settings: (i) in the non-sparse case, TCA attains the O_P(√(r*(Σ_0) log d/n)) rate of convergence in parameter estimation, which is the same rate of convergence as for PCA under the subgaussian assumption (Lounici, 2014; Bunea and Xiao, 2015); (ii) in the sparse case, sparse TCA, formulated as a combinatoric program, can attain the optimal O_P(√(s log(ed/s)/n)) rate of convergence under the sign subgaussian condition. More recently, Vu et al. (2013) showed that sparse TCA, via the Fantope projection, can attain the O_P(s√(log d/n)) rate of convergence.

Despite all these efforts, there are two remaining problems for the aforementioned works exploiting the marginal Kendall's tau statistic. First, using marginal ranks, they can only estimate the leading eigenvectors of the correlation matrix instead of the covariance matrix. Secondly, the sign subgaussian condition is not easy to verify. In this paper we show that, under the elliptical model and various settings (see Corollaries 3.1 and 4.1 and Theorems 5.4 and 3.5 for details), the O_P(√(s log(ed/s)/n)) rate of convergence for estimating the leading eigenvector of Σ can be attained without the need of the sign subgaussian condition. In particular, we present an alternative procedure, called elliptical component analysis (ECA), to directly estimate the eigenvectors of Σ, treating the corresponding eigenvalues as nuisance parameters. ECA exploits the multivariate Kendall's tau for estimating the eigenspace of Σ. When the target parameter is sparse, the corresponding procedure is called sparse ECA. We show that ECA and sparse ECA, under various settings, have the following properties. (i) In the non-sparse setting, ECA attains the efficient O_P(√(r*(Σ) log d/n)) rate of convergence. (ii) In the sparse setting, sparse ECA, via a combinatoric program, attains the minimax optimal O_P(√(s log(ed/s)/n)) rate of convergence. (iii) In the sparse setting, sparse ECA, via a computationally efficient program which combines the Fantope projection (Vu et al., 2013) and the truncated power algorithm (Yuan and Zhang, 2013), attains the optimal O_P(√(s log(ed/s)/n)) rate of convergence under a suboptimal scaling (s^2 log d/n → 0).
Of note, for presentation clarity, the rates presented here omit a variety of parameters regarding Σ and K. The reader should refer to Corollaries 3.1 and 4.1 and Theorem 5.4 for accurate descriptions.

Table 1: An illustration of the results for sparse PCA, sparse TCA, and sparse ECA in leading eigenvector estimation. Similar results also hold for principal subspace estimation. Here Σ is the covariance matrix, Σ_0 is the latent generalized correlation matrix, r*(M) := Tr(M)/σ_1(M) represents the effective rank of M, "r.c." stands for rate of convergence, "n-s setting" stands for the non-sparse setting, "sparse setting 1" stands for the sparse setting where the estimation procedure is conducted via a combinatoric program, and "sparse setting 2" stands for the sparse setting where the estimation procedure is conducted by combining the Fantope projection (Vu et al., 2013) and the truncated power method (Yuan and Zhang, 2013). For presentation clarity, the rates presented here omit a variety of parameters regarding Σ and K. The reader should refer to Corollaries 3.1 and 4.1 and Theorem 5.4 for accurate descriptions.

                        | sparse PCA                              | sparse TCA                                                                          | sparse ECA
working model           | subgaussian family                      | transelliptical family                                                              | elliptical family
parameter of interest   | eigenvectors of Σ                       | eigenvectors of Σ_0                                                                 | eigenvectors of Σ
input statistic         | Pearson's sample covariance matrix      | Kendall's tau                                                                       | multivariate Kendall's tau
n-s setting r.c.        | √(r*(Σ) log d/n)                        | √(r*(Σ_0) log d/n)                                                                  | √(r*(Σ) log d/n)
sparse setting 1 r.c.   | √(s log(ed/s)/n)                        | s√(log d/n) (general); √(s log(ed/s)/n) (sign subgaussian)                          | √(s log(ed/s)/n)
sparse setting 2 r.c.   | √(s log(ed/s)/n), given s^2 log d/n → 0 | s√(log d/n) (general); √(s log(ed/s)/n) (sign subgaussian), given s^2 log d/n → 0   | √(s log(ed/s)/n), given s^2 log d/n → 0

We compare sparse PCA, sparse TCA, and sparse ECA in Table 1.

1.1 Related Works

The multivariate Kendall's tau statistic was first introduced in Choi and Marden (1998) for testing independence and has further been used in estimating low-dimensional covariance matrices (Visuri et al., 2000; Oja, 2010) and principal components (Marden, 1999; Croux et al., 2002; Jackson and Chen, 2004). In particular, Marden (1999) showed that the population multivariate Kendall's tau, K, shares the same eigenspace as the covariance matrix Σ. Croux et al. (2002) illustrated the asymptotic efficiency of ECA compared to PCA for Gaussian data when d = 2 and 3. Taskinen et al. (2012) characterized the robustness and efficiency properties of ECA in low dimensions. Some related methods using multivariate rank-based statistics are discussed in Tyler (1982), Tyler (1987), Taskinen et al. (2003), Oja and Randles (2004), Oja and Paindaveine (2005), Oja et al. (2006), and Sirkiä et al. (2007). Theoretical analysis in low dimensions is provided in Hallin and Paindaveine (2002a,b, 2004, 2005, 2006) and Hallin et al. (2006, 2010, 2014), and some new extensions to high dimensional settings are provided in Croux et al. (2013) and Feng (2015).

Our paper makes significantly new contributions to the high dimensional robust statistics literature. Theoretically, we study the use of the multivariate Kendall's tau in high dimensions, provide new properties of the multivariate rank statistic, and characterize the performance of ECA in both non-sparse and sparse settings. Computationally, we provide an efficient algorithm for conducting sparse ECA and highlight the "optimal rate, suboptimal scaling" phenomenon in understanding the behavior of the proposed algorithm.

1.2 Notation

Let M = [M_jk] in R^{d x d} be a symmetric matrix and v = (v_1, ..., v_d)^T in R^d be a vector. We denote by v_I the subvector of v whose entries are indexed by a set I, and by M_{I,J} the submatrix of M whose rows are indexed by I and whose columns are indexed by J. We denote supp(v) := {j : v_j ≠ 0}. For 0 < q < ∞, we define the ℓ_q and ℓ_∞ vector norms as ‖v‖_q := (Σ_{i=1}^d |v_i|^q)^{1/q} and ‖v‖_∞ := max_{1≤i≤d} |v_i|. We denote ‖v‖_0 := card(supp(v)). We define the matrix entrywise maximum norm and Frobenius norm as ‖M‖_max := max{|M_jk|} and ‖M‖_F := (Σ_{j,k} M_jk^2)^{1/2}. Let λ_j(M) be the j-th largest eigenvalue of M. If there are ties, λ_j(M) is any one of the eigenvalues such that any eigenvalue larger than it has rank smaller than j, and any eigenvalue smaller than it has rank larger than j. Let u_j(M) be any unit vector v such that v^T M v = λ_j(M). Without loss of generality, we assume that the first nonzero entry of u_j(M) is positive. We denote by ‖M‖_2 the spectral norm of M and by S^{d-1} := {v in R^d : ‖v‖_2 = 1} the d-dimensional unit sphere. We define the restricted spectral norm ‖M‖_{2,s} := sup_{v in S^{d-1}, ‖v‖_0 ≤ s} |v^T M v|, so for s = d we have ‖M‖_{2,s} = ‖M‖_2. We denote by f(M) the matrix with entries [f(M)]_jk = f(M_jk). We denote by diag(M) the diagonal matrix with the same diagonal entries as M. Let I_d represent the d-by-d identity matrix. For any two numbers a, b in R, we denote a ∧ b := min{a, b} and a ∨ b := max{a, b}. For any two sequences of positive numbers {a_n} and {b_n}, we write a_n ≍ b_n if a_n = O(b_n) and b_n = O(a_n). We write b_n = Ω(a_n) if a_n = O(b_n), and b_n = Ω_o(a_n) if b_n = Ω(a_n) and b_n is not of the same order as a_n.

1.3 Paper Organization

The rest of this paper is organized as follows. In the next section, we briefly introduce the elliptical distribution and review the marginal and multivariate Kendall's tau statistics. In Section 3, in the non-sparse setting, we propose the ECA method and study its theoretical performance. In Section 4, in the sparse setting, we propose a sparse ECA method via a combinatoric program and study its theoretical performance. A computationally efficient algorithm for conducting sparse ECA is provided in Section 5. Experiments on both synthetic and brain imaging data are provided in Section 6. More simulation results and all technical proofs are relegated to the supplementary materials.

2 Background

This section briefly reviews the elliptical distribution and the marginal and multivariate Kendall's tau statistics. In the sequel, we write X =_d Y if random vectors X and Y have the same distribution.

2.1 Elliptical Distribution

The elliptical distribution is defined as follows. Let μ in R^d and Σ in R^{d x d} with rank(Σ) = q ≤ d. A d-dimensional random vector X has an elliptical distribution, denoted by X ~ EC_d(μ, Σ, ξ), if

it has a stochastic representation

X =_d μ + ξ A U,    (2.1)

where U is a uniform random vector on the unit sphere in R^q, ξ ≥ 0 is a scalar random variable independent of U, and A in R^{d x q} is a deterministic matrix satisfying A A^T = Σ. Here Σ is called the scatter matrix. In this paper, we only consider continuous elliptical distributions with P(ξ = 0) = 0.

An equivalent definition of the elliptical distribution is through the characteristic function exp(i t^T μ) ψ(t^T Σ t), where ψ is a properly defined characteristic function and i := √(-1). Here ξ and ψ are mutually determined. In this setting, we denote X ~ EC_d(μ, Σ, ψ).

The elliptical family is closed under independent sums, and the marginal and conditional distributions of an elliptical distribution are also elliptically distributed. Compared to the Gaussian family, the elliptical family provides more flexibility in modeling complex data. First, the elliptical family can model heavy-tailed distributions (in contrast, the Gaussian is light-tailed with exponential tail bounds). Secondly, the elliptical family can be used to model nontrivial tail dependence between variables (Hult and Lindskog, 2002), i.e., different variables tend to go to extremes together (in contrast, the Gaussian family cannot capture any tail dependence). The capability to handle heavy-tailed distributions and tail dependence is important for modeling many datasets, including: (1) financial data (almost all financial data are heavy-tailed with nontrivial tail dependence (Rachev, 2003; Čižek et al., 2005)); (2) genomics data (Liu et al., 2003; Posekany et al., 2011); (3) bioimaging data (Ruttimann et al., 1998).

In the sequel, we assume that E(ξ^2) < ∞ so that the covariance matrix Cov(X) is well defined. For model identifiability, we further assume that E(ξ^2) = q so that Cov(X) = Σ. Of note, the assumption that the covariance matrix of X exists is added only for presentation clarity. In particular, the follow-up theorems still hold without any requirement on the moments of ξ. Actually, ECA still works even when E(ξ^2) = ∞ due to its construction (see, for example, Equation (2.6) and the follow-up discussions).

2.2 Marginal Rank-Based Estimators

In this section we briefly review the marginal rank-based estimator using the Kendall's tau statistic. This statistic plays a vital role in estimating the leading eigenvectors of the generalized correlation matrix Σ_0 in Han and Liu (2014). Letting X := (X_1, ..., X_d)^T in R^d, with X̃ := (X̃_1, ..., X̃_d)^T an independent copy of X, the population Kendall's tau statistic is defined as

τ(X_j, X_k) := Cov(sign(X_j - X̃_j), sign(X_k - X̃_k)).

Let X_1, ..., X_n in R^d, with X_i := (X_{i1}, ..., X_{id})^T, be n independent observations of X. The sample Kendall's tau statistic is defined as

τ̂_jk(X_1, ..., X_n) := 2/(n(n-1)) Σ_{1≤i<i'≤n} sign(X_{ij} - X_{i'j}) sign(X_{ik} - X_{i'k}).

It is easy to verify that E[τ̂_jk(X_1, ..., X_n)] = τ(X_j, X_k). Let R̂ = [R̂_jk] in R^{d x d}, with R̂_jk = sin(π/2 · τ̂_jk(X_1, ..., X_n)), be the Kendall's tau correlation matrix. The marginal rank-based estimator θ̂_1 used by TCA is obtained by plugging R̂ into the optimization formulation in Vu and Lei (2012).
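To make the marginal rank-based construction above concrete, the following is a minimal Python sketch (our own illustration, not code from the paper) of the Kendall's tau correlation matrix R̂ with entries R̂_jk = sin(π/2 · τ̂_jk). The function name and the use of scipy.stats.kendalltau are our own choices, and the data are assumed continuous so that ties do not occur.

```python
# A minimal sketch of the Kendall's tau correlation matrix R_hat used by TCA,
# with entries R_hat[j, k] = sin(pi/2 * tau_hat_jk); see Section 2.2.
# Function name is illustrative, not from the paper.
import numpy as np
from scipy.stats import kendalltau

def kendall_correlation_matrix(X):
    """X: (n, d) data matrix with continuous entries; returns the d x d matrix R_hat."""
    _, d = X.shape
    R = np.eye(d)
    for j in range(d):
        for k in range(j + 1, d):
            tau, _ = kendalltau(X[:, j], X[:, k])  # sample Kendall's tau over all pairs
            R[j, k] = R[k, j] = np.sin(0.5 * np.pi * tau)
    return R
```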

When X ~ EC_d(μ, Σ, ξ) and under mild conditions, Han and Liu (2014) showed that

E sin∠(θ̂_1, u_1(Σ_0)) = O(√(s log d / n)),

where s := ‖u_1(Σ_0)‖_0 and Σ_0 is the generalized correlation matrix of X. However, TCA is a variant of scale-invariant PCA and can only estimate the leading eigenvectors of the correlation matrix. How, then, to estimate the leading eigenvector of the covariance matrix in high dimensional elliptical models? A straightforward approach is to exploit a covariance matrix estimator Ŝ := [Ŝ_jk], defined as

Ŝ_jk = R̂_jk σ̂_j σ̂_k,    (2.2)

where {σ̂_j}_{j=1}^d are the sample standard deviations. However, since the elliptical distribution can be heavy-tailed, estimating the standard deviations is challenging and requires strong moment conditions. In this paper, we solve this problem by resorting to the multivariate rank-based method.

2.3 Multivariate Kendall's tau

Let X ~ EC_d(μ, Σ, ξ) and let X̃ be an independent copy of X. The population multivariate Kendall's tau matrix, denoted by K in R^{d x d}, is defined as

K := E[ (X - X̃)(X - X̃)^T / ‖X - X̃‖_2^2 ].    (2.3)

Let X_1, ..., X_n in R^d be n independent data points of a random vector X ~ EC_d(μ, Σ, ξ). The definition of the multivariate Kendall's tau in (2.3) motivates the following sample version multivariate Kendall's tau estimator, which is a second-order U-statistic:

K̂ := 2/(n(n-1)) Σ_{i<i'} (X_i - X_{i'})(X_i - X_{i'})^T / ‖X_i - X_{i'}‖_2^2.    (2.4)

It is obvious that E K̂ = K, and both K and K̂ are positive semidefinite (PSD) matrices of trace 1. Moreover, the kernel of the U-statistic, k_MK : R^d x R^d → R^{d x d},

k_MK(X_i, X_{i'}) := (X_i - X_{i'})(X_i - X_{i'})^T / ‖X_i - X_{i'}‖_2^2,    (2.5)

is bounded under the spectral norm, i.e., ‖k_MK‖_2 ≤ 1. Intuitively, such a boundedness property makes the U-statistic K̂ more amenable to theoretical analysis. Moreover, it is worth noting that k_MK(X_i, X_{i'}) is a distribution-free kernel, i.e., for any continuous X ~ EC_d(μ, Σ, ξ) with generating variable ξ,

k_MK(X_i, X_{i'}) =_d k_MK(Z_i, Z_{i'}),    (2.6)

where Z_i and Z_{i'} are independent and follow Z ~ N_d(μ, Σ). This can be proved using the closedness of the elliptical family under independent sums and the property that Z is a stochastic scaling of X (check Lemma B.7 for details). Accordingly, as will be shown later, the convergence of K̂ to K does not depend on the generating variable ξ, and hence K̂ enjoys the same distribution-free property as Tyler's M estimator (Tyler, 1987). However, the multivariate Kendall's tau can be directly extended to analyze high dimensional data, while Tyler's M estimator cannot.¹

The multivariate Kendall's tau can be viewed as the covariance matrix of the self-normalized data {(X_i - X_{i'})/‖X_i - X_{i'}‖_2}_{i > i'}. It is immediate to see that K is not identical or proportional to the covariance matrix Σ of X. However, the following proposition, essentially coming from Marden (1999) and Croux et al. (2002) (also explicitly stated as Theorem 4.4 in Oja (2010)), states that the eigenspace of the multivariate Kendall's tau statistic K is identical to the eigenspace of the covariance matrix Σ. Its proof is given in the supplementary materials for completeness.

Proposition 2.1. Let X ~ EC_d(μ, Σ, ξ) be a continuous distribution and K be the population multivariate Kendall's tau statistic. Then if rank(Σ) = q, we have

λ_j(K) = E[ λ_j(Σ) Y_j^2 / (λ_1(Σ) Y_1^2 + ... + λ_q(Σ) Y_q^2) ],    (2.7)

where Y := (Y_1, ..., Y_q)^T ~ N_q(0, I_q) is a standard multivariate Gaussian vector. In addition, K and Σ share the same eigenspace with the same descending order of the eigenvalues.

Proposition 2.1 shows that, to recover the eigenspace of the covariance matrix Σ, we can resort to recovering the eigenspace of K, which, as discussed above, can be more efficiently estimated using K̂.

Remark 2.2. Proposition 2.1 shows that the eigenspaces of K and Σ are identical and that the eigenvalues of K only depend on the eigenvalues of Σ. Therefore, if we can theoretically calculate the relationships between {λ_j(K)}_{j=1}^d and {λ_j(Σ)}_{j=1}^d, we can recover Σ using K̂. When, for example, λ_1(Σ) = ... = λ_q(Σ), this relationship is calculable. In particular, it can be shown (check, for example, Section 3 in Bilodeau and Brenner (1999)) that

Y_j^2 / (Y_1^2 + ... + Y_q^2) ~ Beta(1/2, (q-1)/2), for j = 1, ..., q,

where Beta(α, β) is the beta distribution with parameters α and β. Accordingly, λ_j(K) = E[Y_j^2/(Y_1^2 + ... + Y_q^2)] = 1/q. The general relationship between {λ_j(K)}_{j=1}^d and {λ_j(Σ)}_{j=1}^d is non-linear. For example, when d = 2, Croux et al. (2002) showed that

λ_j(K) = √λ_j(Σ) / (√λ_1(Σ) + √λ_2(Σ)), for j = 1, 2.

¹ Tyler's M estimator cannot be directly applied to study high dimensional data for both theoretical and empirical reasons. Theoretically, to the authors' knowledge, the sharpest sufficient condition guaranteeing its consistency still requires d = o(n^{1/2}) (Duembgen, 1998). Empirically, our simulations show that Tyler's M estimator always fails to converge when d > n.
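As a quick numerical illustration of the estimator in (2.4) and of Proposition 2.1, the following Python sketch (our own illustration, not code from the paper) computes the sample multivariate Kendall's tau K̂ over all pairs and compares its leading eigenvector with the top eigenvector of Σ on multivariate-t data, which is elliptical and heavy-tailed. The helper names, the toy covariance, and the sample sizes are all hypothetical choices.

```python
# Illustrative sketch: sample multivariate Kendall's tau (Equation 2.4) and a
# Monte Carlo check of Proposition 2.1 (K and Sigma share leading eigenvectors).
import numpy as np

def multivariate_kendall_tau(X):
    """X: (n, d) data matrix; returns the d x d U-statistic K_hat of Equation (2.4)."""
    n, d = X.shape
    K = np.zeros((d, d))
    for i in range(n):
        diffs = X[i + 1:] - X[i]                      # X_i' - X_i for all i' > i
        norms2 = np.sum(diffs ** 2, axis=1)           # ||X_i - X_i'||_2^2
        K += (diffs / norms2[:, None]).T @ diffs      # sum of the rank-one kernels (2.5)
    return 2.0 * K / (n * (n - 1))

rng = np.random.default_rng(0)
d, n, nu = 20, 500, 3.0                               # heavy-tailed: t with 3 degrees of freedom
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))  # a toy scatter matrix
A = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((n, d)) @ A.T
w = rng.chisquare(nu, size=n) / nu
X = Z / np.sqrt(w)[:, None]                           # multivariate-t_nu(0, Sigma): elliptical

K_hat = multivariate_kendall_tau(X)
u1_K = np.linalg.eigh(K_hat)[1][:, -1]                # leading eigenvector of K_hat (the ECA estimate)
u1_S = np.linalg.eigh(Sigma)[1][:, -1]                # leading eigenvector of Sigma
print("|<u1(K_hat), u1(Sigma)>| =", abs(u1_K @ u1_S)) # close to 1, as Proposition 2.1 suggests
```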

3 ECA: Non-Sparse Setting

In this section we propose and study the ECA method in the non-sparse setting when λ_1(Σ) is distinct. In particular, we do not assume sparsity of u_1(Σ). Without the sparsity assumption, we propose to use the leading eigenvector u_1(K̂) to estimate u_1(K) = u_1(Σ):

The ECA estimator: u_1(K̂), the leading eigenvector of K̂, where K̂ is defined in (2.4).

For notational simplicity, in the sequel we assume that the sample size n is even. When n is odd, we can always use n - 1 data points without affecting the obtained rate of convergence.

The approximation error of u_1(K̂) to u_1(K) is related to the convergence of K̂ to K under the spectral norm via the Davis-Kahan inequality (Davis and Kahan, 1970; Wedin, 1972). In detail, for any two vectors v_1, v_2 in R^d, let sin∠(v_1, v_2) be the sine of the angle between v_1 and v_2, with sin∠(v_1, v_2) := √(1 - (v_1^T v_2)^2). The Davis-Kahan inequality states that the approximation error of u_1(K̂) to u_1(K) is controlled by ‖K̂ - K‖_2 divided by the eigengap between λ_1(K) and λ_2(K):

sin∠(u_1(K̂), u_1(K)) ≤ 2 ‖K̂ - K‖_2 / (λ_1(K) - λ_2(K)).    (3.1)

Accordingly, to analyze the convergence rate of u_1(K̂) to u_1(K), we can focus on the convergence rate of K̂ to K under the spectral norm. The next theorem shows that, for the elliptical distribution family, the convergence rate of K̂ to K under the spectral norm is of order ‖K‖_2 √(r*(K) log d/n), where r*(K) = Tr(K)/λ_1(K) is the effective rank of K, which must be less than or equal to d.

Theorem 3.1. Let X_1, ..., X_n be n independent observations of X ~ EC_d(μ, Σ, ξ). Let K̂ be the sample version of the multivariate Kendall's tau statistic defined in Equation (2.4). Provided that n is sufficiently large such that

n ≥ (16/3) (r*(K) + 1)(log d + log(1/α)),    (3.2)

we have, with probability larger than 1 - α,

‖K̂ - K‖_2 ≤ 16 ‖K‖_2 √( (r*(K) + 1)(log d + log(1/α)) / (3n) ).

Remark 3.2. The scaling requirement on n in (3.2) is posed largely for presentation clarity. As a matter of fact, we could withdraw this scaling condition by posing a different upper bound on ‖K̂ - K‖_2. This is via employing a similar argument as in Wegkamp and Zhao (2016). However, we note that such a scaling requirement is necessary for proving consistency in our analysis, and it was also enforced in the related literature (see, for example, Theorem 3.1 in Han and Liu (2016)).
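The quantities driving Theorem 3.1 and inequality (3.1) are easy to evaluate numerically. Below is a small Python sketch (our own, with hypothetical helper names) that computes the effective rank r*(M) = Tr(M)/λ_1(M) and the Davis-Kahan bound 2‖K̂ - K‖_2/(λ_1(K) - λ_2(K)), which can be compared against the realized angle between u_1(K̂) and u_1(K).

```python
# Sketch: effective rank and the Davis-Kahan bound of Equation (3.1).
import numpy as np

def effective_rank(M):
    """r*(M) = Tr(M) / lambda_1(M) for a symmetric PSD matrix M."""
    return np.trace(M) / np.linalg.eigvalsh(M)[-1]

def davis_kahan_bound(K_hat, K):
    """Right-hand side of (3.1): 2 ||K_hat - K||_2 / (lambda_1(K) - lambda_2(K))."""
    eigs = np.linalg.eigvalsh(K)
    gap = eigs[-1] - eigs[-2]                    # eigengap lambda_1(K) - lambda_2(K)
    err = np.linalg.norm(K_hat - K, 2)           # spectral norm of the difference
    return 2.0 * err / gap

def sin_angle(u, v):
    """Sine of the angle between unit vectors u and v."""
    return np.sqrt(max(0.0, 1.0 - float(u @ v) ** 2))

# Whenever the eigengap is positive, sin_angle(u1(K_hat), u1(K)) is at most
# davis_kahan_bound(K_hat, K), which is what Corollary 3.1 builds on.
```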

There is a vast literature on bounding the spectral norm of a random matrix (see, for example, Vershynin (2010) and the references therein), and our proof relies on the matrix Bernstein inequality proposed in Tropp (2012), with a generalization to U-statistics following similar arguments as in Wegkamp and Zhao (2016) and Han and Liu (2016). We defer the proof to Section B.2.

Combining (3.1) and Theorem 3.1, we immediately have the following corollary, which characterizes the explicit rate of convergence for sin∠(u_1(K̂), u_1(K)).

Corollary 3.1. Under the conditions of Theorem 3.1, provided that n is sufficiently large such that n ≥ (16/3)(r*(K) + 1)(log d + log(1/α)), we have, with probability larger than 1 - α,

sin∠(u_1(K̂), u_1(K)) ≤ (2 λ_1(K) / (λ_1(K) - λ_2(K))) · 16 √( (r*(K) + 1)(log d + log(1/α)) / (3n) ).

Remark 3.3. Corollary 3.1 indicates that it is not necessary to require d/n → 0 for u_1(K̂) to be a consistent estimator of u_1(K). For example, when λ_2(K)/λ_1(K) is upper bounded by an absolute constant strictly smaller than 1, r*(K) log d/n → 0 is sufficient to make u_1(K̂) a consistent estimator of u_1(K). Such an observation is consistent with the observations in the PCA theory (Lounici, 2014; Bunea and Xiao, 2015). On the other hand, Theorem 4.1 in the next section provides a rate of convergence O_P(λ_1(K)√(d/n)) for ‖K̂ - K‖_2. Therefore, the final rate of convergence for ECA, under various settings, can be expressed as O_P(√(r*(K) log d/n) ∧ √(d/n)).

Remark 3.4. We note that Theorem 3.1 can also help to quantify the subspace estimation error via a variation of the Davis-Kahan inequality. In particular, let P_m(K) and P_m(K̂) be the projection matrices onto the span of the m leading eigenvectors of K and K̂. Using Lemma 4.2 in Vu and Lei (2013), we have

‖P_m(K̂) - P_m(K)‖_F ≤ (2√m / (λ_m(K) - λ_{m+1}(K))) ‖K̂ - K‖_2,    (3.3)

so that ‖P_m(K̂) - P_m(K)‖_F can be controlled via a similar argument as in Corollary 3.1.

The above bounds are all related to the eigenvalues of K. The next theorem connects the eigenvalues of K to the eigenvalues of Σ, so that we can directly bound ‖K̂ - K‖_2 and sin∠(u_1(K̂), u_1(K)) using Σ. In the sequel, let us denote r**(Σ) := ‖Σ‖_F/λ_1(Σ) ≤ √d to be the second-order effective rank of the matrix Σ.

Theorem 3.5 (The upper and lower bounds of λ_j(K)). Letting X ~ EC_d(μ, Σ, ξ), we have

λ_j(K) ≥ 2 λ_j(Σ) / ( 3 { Tr(Σ) + 4 ‖Σ‖_F √(log d) + 8 ‖Σ‖_2 log d } ) - 1/d^2,

and, when Tr(Σ) > 4 ‖Σ‖_F √(log d),

λ_j(K) ≤ λ_j(Σ) / ( Tr(Σ) - 4 ‖Σ‖_F √(log d) ) + 1/d^4.

11 Using Theorem 3.5 and recalling that the trace of K is always 1, we can replace r K by r Σ+4r Σ log d+8 log d 1 3d 1 in Theorem 3.1. We also note that Theorem 3.5 can help understand the scaling of λ j K with regard to λ j Σ. Actually, when Σ F log d = TrΣ o1, we have λ j K λ j Σ/TrΣ, and accordingly, we can continue to write λ 1 K λ 1 K λ K λ 1 Σ λ 1 Σ λ Σ. In practice, Σ F log d = TrΣ o1 is a mild condition. For example, when the condition number of Σ is upper bounded by an absolute constant, we have TrΣ Σ F d. For later purpose see, for example, Theorem 5.3, sometimes we also need to connect the elementwise maximum norm K max to that of Σ max. The next corollary gives such a connection. Corollary 3. The upper and lower bounds of K max. Letting X EC d µ, Σ, ξ, we have K max and when TrΣ > 4 Σ F log d, Σ max 1 TrΣ + 4 Σ F log d + 8 Σ log d 3 d, K max Σ max TrΣ 4 Σ F log d + 1 d 4. 4 Sparse via a Combinatoric Program We analyze the theoretical properties of in the sparse setting, where we assume λ 1 Σ is distinct and u 1 Σ 0 s < d n. In this section we study the method using a combinatoric program. For any matrix M R d d, we define the best s-sparse vector approximating u 1 M as u 1,s M := We propose to estimate u 1 Σ = u 1 K via a combinatoric program: arg max v T Mv. 4.1 v 0 s, v 1 Sparse estimator via a combinatoric program : u 1,s K, where K is defined in.4. Under the sparse setting, by definition we have u 1,s K = u 1 K = u 1 Σ. On the other hand, u 1,s K can be calculated via a combinatoric program by exhaustively searching over all s by s submatrices of K. This global search is not computationally efficient. However, the result in quantifying the approximation error of u 1,s K to u 1 K is of strong theoretical interest. Similar algorithms were also studied in Vu and Lei 01, Lounici 013, Vu and Lei 013, and Cai et al Moreover, as will be seen in the next section, this will help clarify that a computationally efficient sparse algorithm can attain the same convergence rate, though under a suboptimal scaling of n, d, s. In the following we study the performance of u 1,s K in conducting sparse. The approximation error of u 1,s K to u 1 K is connected to the approximation error of K to K under the 11

12 restricted spectral norm. This is due to the following Davis-Kahan type inequality provided in Vu and Lei 01: sin u 1,s K, u 1,s K λ 1 K λ K K K,s. 4. Accordingly, for studying sin u 1,s K, u 1,s K, we focus on studying the approximation error K K,s. Before presenting the main results, we provide some extra notation. For any random variable X R, we define the subgaussian ψ and sub-exponential norms ψ1 of X as follows: X ψ := sup k 1 k 1/ E X k 1/k and X ψ1 := sup k 1 E X k 1/k. 4.3 k 1 Any d-dimensional random vector X R d is said to be subgaussian distributed with the subgaussian constant σ if v T X ψ σ, for any v S d 1. Moreover, we define the self-normalized operator S for any random vector to be SX := X X/ X X where X is an independent copy of X. 4.4 It is immediate that K = ESXSX T. The next theorem provides a general result in quantifying the approximation error of K to K with regard to the restricted spectral norm. Theorem 4.1. Let X 1,..., X n be n observations of X EC d µ, Σ, ξ. Let K be the sample version multivariate Kendall s tau statistic defined in Equation.4. We have, when s loged/s + log1/α/n 0, for n sufficiently large, with probability larger than 1 α, K K,s sup v S d 1 v T SX ψ + K for some absolute constant C 0 > 0. Here sup v S d 1 v T X ψ sup v T SX ψ = v S d 1 sup v S d 1 where v := v 1,..., v d T and Y 1,..., Y d T N d 0, I d. s3 + logd/s + log1/α C 0, n can be further written as d i=1 v iλ 1/ i ΣY i d i=1 λ 1, 4.5 iσyi ψ It is obvious that SX is subgaussian with variance proxy 1. However, typically, a sharper upper bound can be obtained. The next theorem shows, under various settings, the upper bound can be of the same order as 1/q, which is much smaller than 1. Combined with Theorem 4.1, these results give an upper bound of K K,s. 1
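Before turning to that theorem, it may help to see what the combinatoric estimator in (4.1) does computationally. The following Python sketch (our own illustration, only feasible for small d and s) exhaustively searches over all size-s supports; for each support, the maximum of v^T M v over unit vectors supported there equals the top eigenvalue of the corresponding s x s submatrix, so the global search recovers the best s-sparse approximating vector.

```python
# Sketch of the combinatoric sparse ECA program in (4.1): for each candidate
# support J of size s, max over unit vectors supported on J of v^T M v equals
# the top eigenvalue of M[J, J]. Exhaustive search is exponential in d, so
# this is for illustration on small problems only.
import numpy as np
from itertools import combinations

def best_s_sparse_eigenvector(M, s):
    """Return u_{1,s}(M): an s-sparse unit vector maximizing v^T M v."""
    d = M.shape[0]
    best_val, best_vec = -np.inf, None
    for J in combinations(range(d), s):
        idx = np.array(J)
        vals, vecs = np.linalg.eigh(M[np.ix_(idx, idx)])
        if vals[-1] > best_val:
            best_val = vals[-1]
            v = np.zeros(d)
            v[idx] = vecs[:, -1]
            best_vec = v
    return best_vec

# Usage: u1s_hat = best_s_sparse_eigenvector(K_hat, s), with K_hat from Equation (2.4).
```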

13 Theorem 4.. Let X 1,..., X n be n observations of X EC d µ, Σ, ξ with rankσ = q and u 1 Σ 0 s. Let K be the sample version multivariate Kendall s tau statistic defined in Equation.4. We have, sup v T SX ψ v S d 1 λ 1 Σ λ q Σ q 1, and accordingly, when s loged/s + log1/α/n 0, with probability at least 1 α, { } K 4λ1 Σ s3 + logd/s + log1/α K,s C 0 qλ q Σ 1 + λ 1 K. n Similar to Theorem 3.1, we wish to show that K K,s = O P λ 1 K s loged/s/n. In the following, we provide several examples such that sup v v T SX ψ is of the same order as λ 1 K, so that, via Theorem 4., the desired rate is attained. Condition number controlled: Bickel and Levina 008 considered the covariance matrix model where the condition number of the covariance matrix Σ, λ 1 Σ/λ d Σ, is upper bounded by an absolute constant. Under this condition, we have and applying Theorem 3.5 we also have sup v T SX ψ d 1, v λ j K λ jσ TrΣ d 1. Accordingly, we conclude that sup v v T SX ψ and λ 1 K are of the same order. Spike covariance model: Johnstone and Lu 009 considered the following simple spike covariance model: Σ = βvv T + a I d, where β, a > 0 are two positive real numbers and v S d 1. In this case, we have, when β = oda / log d or β = Ωda, sup v T SX ψ β + a v da 1 and λ 1 K β + a β + da. A simple calculation shows that sup v v T SX ψ and λ 1 K are of the same order. Multi-Factor Model: Fan et al. 008 considered a multi-factor model, which is also related to the general spike covariance model Ma, 013: Σ = m β j v j vj T + Σ u, j=1 13

14 where we have β 1 β β m > 0, v 1,..., v m S d 1 are orthogonal to each other, and Σ u is a diagonal matrix. For simplicity, we assume that Σ u = a I d. When β j = od a 4 / log d, we have sup v T SX ψ β 1 + a v da 1 and λ 1 K β 1 + a m j=1 β j + da, and sup v v T SX ψ and λ 1 K are of the same order if, for example, m j=1 β j = Oda. Equation 4. and Theorem 4. together give the following corollary, which quantifies the convergence rate of the sparse estimator calculated via the combinatoric program in 4.1. Corollary 4.1. Under the condition of Theorem 4., if we have s loged/s + log1/α/n 0, for n sufficiently large, with probability larger than 1 α, sin u 1,s K, u 1,s K C 04λ 1 Σ/qλ q Σ 1 + λ 1 K s3 + logd/s + log1/α. λ 1 K λ K n Remark 4.3. The restricted spectral norm convergence result obtained in Theorem 4. is also applicable to analyzing principal subspace estimation accuracy. Following the notation in Vu and Lei 013, we define the principal subspace estimator to the space spanned by the top m eigenvectors of any given matrix M R d d as U m,s M := arg max V R d m M, VV T, subject to d 1IV j 0 s, 4.6 j=1 where V j is the j-th row of M and the indicator function returns 0 if and only if V j = 0. We then have U m,s KU m,s K T U m,s KU m,s K T F m λ m K λ m+1 K K K,ms. An explicit statement of the above inequality can be found in Wang et al Sparse via a Computationally Efficient Program There is a vast literature studying computationally efficient algorithms for estimating sparse u 1 Σ. In this section we focus on one such algorithm for conducting sparse by combining the Fantope projection Vu et al., 013 with the truncated power method Yuan and Zhang, Fantope Projection In this section we first review the algorithm and theory developed in Vu et al. 013 for sparse subspace estimation, and then provide some new analysis in obtaining the sparse leading eigenvector estimators. 14

15 Let Π m := V m Vm T where V m is the combination of the m leading eigenvectors of K. It is well known that Π m is the optimal rank-m projection onto K. Similarly as in 4.6, we define s Π to be the number of nonzero columns in Π m. We then introduce the sparse principal subspace estimator X m corresponding to the space spanned by the first m leading eigenvectors of the multivariate Kendall s tau matrix K. To induce sparsity, X m is defined to be the solution to the following convex program: X m := arg max K, M λ M jk, subject to 0 M I d and TrM = m, 5.1 M R d d j,k where for any two matrices A, B R d d, A B represents that B A is positive semidefinite. Here {M : 0 M I d, TrM = m} is a convex set called the Fantope. We then have the following deterministic theorem to quantify the approximation error of X m to Π m. Theorem 5.1 Vu et al If the tuning parameter λ in 5.1 satisfies that λ K K max, we have 4s Π λ X m Π m F λ m K λ m+1 K, where we remind that s Π is the number of nonzero columns in Π m. It is easy to see that X m is symmetric and the rank of X m must be greater than or equal to m, but is not necessarily exactly m. However, in various cases, dimension reduction for example, it is desired to estimate the top m leading eigenvectors of Σ, or equivalently, to estimate an exactly rank m projection matrix. Noticing that X m is a real symmetric matrix, we propose to use the following estimate X m R d d : X m := j m u j X m [u j X m ] T. 5. We then have the next theorem, which quantifies the distance between X m and Π m. Theorem 5.. If λ K K max, we have X m Π m F 4 X m Π m F 5. A Computationally Efficient Algorithm 16s Π λ λ m K λ m+1 K. In this section we propose a computationally efficient algorithm to conduct sparse via combining the Fantope projection with the truncated power algorithm proposed in Yuan and Zhang 013. We focus on estimating the leading eigenvector of K since the rest can be iteratively estimated using the deflation method Mackey, 008. The main idea here is to exploit the Fantope projection for constructing a good initial parameter for the truncated power algorithm and then perform iterative thresholding as in Yuan and Zhang 013. We call this the Fantope-truncated power algorithm, or FM, for abbreviation. Before 15

16 proceeding to the main algorithm, we first introduce some extra notation. For any vector v R d and an index set J {1,..., d}, we define the truncation function TRC, to be TRCv, J := v 1 1I1 J,..., v d 1Id J T, 5.3 where 1I is the indicator function. The initial parameter v 0, then, is the normalized vector consisting of the largest entries in u 1 X 1, where X 1 is calculated in 5.1: v 0 = w 0 / w 0, where w 0 = TRCu 1 X 1, J δ and J δ = {j : u 1 X 1 j δ}. 5.4 We have v 0 0 = supp{j : u 1 X 1 j > 0}. Algorithm 1 then provides the detailed FM algorithm and the final FM estimator is denoted as û FT 1,k. Algorithm 1 The FM algorithm. Within each iteration, a new sparse vector v t with v t 0 k is updated. The algorithm terminates when v t v t 1 is less than a given threshold ɛ. Algorithm: û FT 1,k K FM K, k, ɛ Initialize: X 1 calculated by 5.1 with m = 1, v 0 is calculated using 5.4, and t 0 Repeat: t t + 1 X y Kv t 1 If X t 0 k, then v t = X t / X t Else, let A t be the indices of the elements in X t with the k largest absolute values v t = TRCX t, A t / TRCX t, A t Until convergence: v t v t 1 ɛ û FT 1,k K v t Output: û FT 1,k K In the rest of this section we study the approximation accuracy of û FT 1,k to u 1K. Via observing Theorems 5.1 and 5., it is immediate that the approximation accuracy of u 1 X 1 is related to K K max. The next theorem gives a nonasymptotic upper bound of K K max, and accordingly, combined with Theorems 5.1 and 5., gives an upper bound on sin u 1 X 1, u 1 K. Theorem 5.3. Let X 1,..., X n be n observations of X EC d µ, Σ, ξ with rankσ = q and u 1 Σ 0 s. Let K be the sample version multivariate Kendall s tau statistic defined in Equation.4. If log d + log1/α/n 0, we have there exists some positive absolute constant C 1 such that for sufficiently large n, with probability at least 1 α, Accordingly, if K 8λ1 Σ K max C 1 qλ q Σ + K log d + log1/α max. n 8λ1 Σ λ C 1 qλ q Σ + K log d + log1/α max, 5.5 n 16

17 we have, with probability at least 1 α, sin u 1 X 1, u 1 K 8 sλ λ 1 K λ K. Theorem 5.3 builds sufficient conditions under which u 1 X 1 is a consistent estimator of u 1 K. In multiple settings the condition number controlled, spike covariance model, and multifactor model settings considered in Section 4 for example when λ λ 1 K log d/n, we have sin u 1 X 1, u 1 K = O P s log d/n. This is summarized in the next corollary. Corollary 5.1. Under the conditions of Theorem 5.3, if we further have λ 1 Σ/qλ q Σ = Oλ 1 K, Σ F log d = TrΣ o1, λ Σ/λ 1 Σ is upper bounded by an absolute constant less than 1, and λ λ 1 K log d/n, then sin u 1 X 1, u 1 K = O P s log d Corollary 5.1 is a direct consequence of Theorem 5.3 and Theorem 3.5, and its proof is omitted. We then turn to study the estimation error of û FT 1,k K. By examining Theorem 4 in Yuan and Zhang 013, for theoretical guarantee of fast rate of convergence, it is enough to show that v 0 T u 1 K is lower bounded by an absolute constant larger than zero. In the next theorem, we show that, under mild conditions, this is true with high probability, and accordingly we can exploit the result in Yuan and Zhang 013 to show that û FT 1,k K attains the same optimal convergence rate as that of u 1,s K. Theorem 5.4. Under the conditions of Corollary 5.1, let J 0 := {j : u 1 K j = Ω 0 s log d/ n}. Set δ in 5.4 to be δ = C slog d/ n for some positive absolute constant C. If s log d/n 0, and u 1 K J0 C 3 > 0 is lower bounded by an absolute positive constant, then, with probability tending to 1, v 0 0 s and v 0 T u 1 K is lower bounded by C 3 /. Accordingly under the condition of Theorem 4 in Yuan and Zhang 013, for k s, we have n. sin û FT 1,k K, k + s log d u 1 K = O P. n Remark 5.5. Although a similar second step truncation is performed, the assumption that the largest entries in u 1 K satisfy u 1 K J0 C 3 is much weaker than the assumption in Theorem 3. of Vu et al. 013 because we allow a lot of entries in the leading eigenvector to be small and not detectable. This is permissible since our aim is parameter estimation instead of guaranteeing the model selection consistency. Remark 5.6. In practice, we can adaptively select the tuning parameter k in Algorithm 1. One possible way is to use the criterion of Yuan and Zhang 013, selecting k that maximizes û FT 1,k K T K val û FT 1,k K, where K val is an independent empirical multivariate Kendall s tau statistic based on a separated sample set of the data. Yuan and Zhang 013 showed that such a heuristic performed quite well in applications. 17
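To make the iterative step of Algorithm 1 concrete, here is a minimal Python sketch (our own, not the authors' implementation) of the truncate-and-normalize power iteration applied to K̂. The initial vector v0 is assumed to be supplied externally, for example from the Fantope-based initialization in (5.4), and the function names are hypothetical.

```python
# Sketch of the truncated power iteration at the core of the FM algorithm
# (Algorithm 1): multiply by K_hat, keep the k largest-magnitude coordinates,
# renormalize, and stop when successive iterates are within a tolerance eps.
import numpy as np

def truncate(x, k):
    """Zero out all but the k largest-magnitude entries of x."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]
    out[keep] = x[keep]
    return out

def truncated_power_method(K_hat, v0, k, eps=1e-6, max_iter=1000):
    """v0: initial unit vector (e.g., from the Fantope step); returns a k-sparse unit vector."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(max_iter):
        x = K_hat @ v
        x = x if np.count_nonzero(x) <= k else truncate(x, k)   # the truncation step of Algorithm 1
        v_new = x / np.linalg.norm(x)
        if np.linalg.norm(v_new - v) <= eps:
            return v_new
        v = v_new
    return v
```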

Remark 5.7. In Corollary 5.1 and Theorem 5.4, we assume that λ is of the same scale as λ_1(K)√(log d/n). In practice, λ is a tuning parameter. Here we can select λ using similar data-driven estimation procedures as proposed in Lounici (2013) and Wegkamp and Zhao (2016). The main idea is to replace the population quantities in (5.5) with their corresponding empirical versions. We conjecture that similar theoretical behaviors can be anticipated for the data-driven choice.

6 Numerical Experiments

In this section we use both synthetic and real data to investigate the empirical usefulness of ECA. We use the FM algorithm described in Algorithm 1 for parameter estimation. To estimate more than one leading eigenvector, we exploit the deflation method proposed in Mackey (2008). Here the cardinalities of the support sets of the leading eigenvectors are treated as tuning parameters. The following three methods are considered: PCA, the sparse PCA method on the Pearson's sample covariance matrix; TCA, transelliptical component analysis based on the transformed Kendall's tau covariance matrix shown in Equation (2.2); and ECA, elliptical component analysis based on the multivariate Kendall's tau matrix. For fairness of comparison, TCA and PCA also exploit the FM algorithm, while using the Kendall's tau covariance matrix and the Pearson's sample covariance matrix as the input matrix. The tuning parameter λ in (5.1) is selected using the method discussed in Remark 5.7, and the truncation value δ in (5.4) is selected such that ‖v^(0)‖_0 = s for the pre-specified sparsity level s.

6.1 Simulation Study

In this section, we conduct a simulation study to back up the theoretical results and further investigate the empirical performance of ECA.

6.1.1 Dependence on Sample Size and Dimension

We first illustrate the dependence of the estimation accuracy of the sparse ECA estimator on the triplet (n, d, s). We adopt the data generating schemes of Yuan and Zhang (2013) and Han and Liu (2014). More specifically, we first create a covariance matrix Σ whose first two eigenvectors v_j := (v_j1, ..., v_jd)^T are specified to be sparse:

v_1j = 1/√10 for 1 ≤ j ≤ 10 and 0 otherwise, and v_2j = 1/√10 for 11 ≤ j ≤ 20 and 0 otherwise.

Then we let Σ be

Σ = 5 v_1 v_1^T + 2 v_2 v_2^T + I_d,

where I_d in R^{d x d} is the identity matrix. We have λ_1(Σ) = 6, λ_2(Σ) = 3, λ_3(Σ) = ... = λ_d(Σ) = 1. Using Σ as the covariance matrix, we generate n data points from a Gaussian distribution or a multivariate-t distribution with 3 degrees of freedom. Here the dimension d varies from 64 to 256 and the sample size n varies from 10 to 500.

Figure 1 plots the averaged angle distances sin∠(ṽ_1, v_1) between the sparse ECA estimate ṽ_1 and the true parameter v_1, for dimensions d = 64, 100, 256, over 1,000 replications. In each setting, s := ‖v_1‖_0 is fixed to be a constant (s = 10).

[Figure 1 about here: panels A and B, with curves for d = 64, 100, 256 plotted against log d/n.]

Figure 1: Simulation for two different distributions (normal and multivariate-t) with varying dimension d and sample size n. Plots of averaged angle distances between the estimators and the true parameters are conducted over 1,000 replications. (A) Normal distribution; (B) Multivariate-t distribution.

By examining the curves in Figure 1 (A) and (B), the angle distance between v_1 and ṽ_1 starts at almost zero when the sample size n is large enough, and then transits to almost one as the sample size decreases (in other words, as 1/n increases). Figure 1 shows that all curves almost overlap with each other when the angle distances are plotted against log d/n. This phenomenon confirms the results in Theorem 5.4. Consequently, the ratio n/log d acts as an effective sample size in controlling the prediction accuracy of the eigenvectors. In the supplementary materials, we further provide results when s is set to be 5 and 20. There one will see that the conclusions drawn here still hold.

6.1.2 Estimating the Leading Eigenvector of the Covariance Matrix

We now focus on estimating the leading eigenvector of the covariance matrix Σ. The first three rows in Table 2 list the simulation schemes of n, d, and Σ. In detail, let ω_1 > ω_2 > ω_3 = ... = ω_d be the eigenvalues and v_1, ..., v_d be the eigenvectors of Σ with v_j := (v_j1, ..., v_jd)^T. The top m leading eigenvectors v_1, ..., v_m of Σ are specified to be sparse such that s_j := ‖v_j‖_0 is small and

v_jk = 1/√(s_j) if 1 + Σ_{i=1}^{j-1} s_i ≤ k ≤ Σ_{i=1}^{j} s_i, and 0 otherwise.

Table 2: Simulation schemes with different n, d, and Σ. Here the eigenvalues of Σ are set to be ω_1 > ... > ω_m > ω_{m+1} = ... = ω_d, and the top m leading eigenvectors v_1, ..., v_m of Σ are specified to be sparse with s_j := ‖v_j‖_0 and v_jk = 1/√(s_j) for k in [1 + Σ_{i=1}^{j-1} s_i, Σ_{i=1}^{j} s_i] and zero for all the others. Σ is generated as Σ = Σ_{j=1}^m (ω_j - ω_d) v_j v_j^T + ω_d I_d. The column "Cardinalities" shows the cardinalities of the support sets of {v_j} in the form s_1, s_2, ..., s_m, ..., and the column "Eigenvalues" shows the eigenvalues of Σ in the form ω_1, ω_2, ..., ω_m, ω_d, ω_d, .... In the first three schemes, m is set to be 2; in the second three schemes, m is set to be 4.

Scheme   | n   | d   | Cardinalities      | Eigenvalues
Scheme 1 | ... | ... | ..., 10, ...       | 6, 3, 1, 1, 1, 1, ...
Scheme 2 | ... | ... | ..., 10, ...       | 6, 3, 1, 1, 1, 1, ...
Scheme 3 | ... | ... | ..., 10, ...       | 6, 3, 1, 1, 1, 1, ...
Scheme 4 | ... | ... | ..., 8, 6, 5, ...  | 8, 4, 2, 1, 0.01, 0.01, ...
Scheme 5 | ... | ... | ..., 8, 6, 5, ...  | 8, 4, 2, 1, 0.01, 0.01, ...
Scheme 6 | ... | ... | ..., 8, 6, 5, ...  | 8, 4, 2, 1, 0.01, 0.01, ...

Accordingly, Σ is generated as

Σ = Σ_{j=1}^m (ω_j - ω_d) v_j v_j^T + ω_d I_d.

Table 2 shows the cardinalities s_1, ..., s_m and the eigenvalues ω_1, ..., ω_m and ω_d. In this section we set m = 2 for the first three schemes and m = 4 for the later three schemes. We consider the following four different elliptical distributions:

Normal: X ~ EC_d(0, Σ, ξ_1 √(d/Eξ_1^2)) with ξ_1 =_d χ_d. Here χ_d is the chi distribution with d degrees of freedom: for Y_1, ..., Y_d i.i.d. N(0, 1), we have √(Y_1^2 + ... + Y_d^2) =_d χ_d. In this setting, X follows a Gaussian distribution (Fang et al., 1990).

Multivariate-t: X ~ EC_d(0, Σ, ξ_2 √(d/Eξ_2^2)) with ξ_2 =_d √κ · ξ_1'/ξ_2'. Here ξ_1' =_d χ_d and ξ_2' =_d χ_κ with κ a positive integer. In this setting, X follows a multivariate-t distribution with κ degrees of freedom (Fang et al., 1990). Here we consider κ = 3.

EC1: X ~ EC_d(0, Σ, ξ_3) with ξ_3 =_d F_{d,1}, i.e., ξ_3 follows an F-distribution with degrees of freedom d and 1. Here ξ_3 has no finite mean, but ECA can still estimate the eigenvectors of the scatter matrix and is thus robust.

EC2: X ~ EC_d(0, Σ, ξ_4 √(d/Eξ_4^2)) with ξ_4 =_d Exp(1), i.e., ξ_4 follows an exponential distribution with rate parameter 1.

We generate n data points according to Schemes 1 to 3 and the four distributions discussed above, 1,000 times each. To show the estimation accuracy, Figure 2 plots the angle distances between the estimate v̂_1 and the true parameter v_1, defined as sin∠(v̂_1, v_1), against the number of estimated nonzero entries, defined as ‖v̂_1‖_0, for three different methods: PCA, TCA, and ECA.
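The following Python sketch (our own illustration) shows how a covariance matrix of the form Σ = Σ_j (ω_j - ω_d) v_j v_j^T + ω_d I_d with sparse leading eigenvectors can be built, and how elliptical samples can be drawn through the stochastic representation (2.1). The particular choices of d, n, the cardinalities, and the eigenvalues below are illustrative only and are not the paper's scheme settings.

```python
# Sketch of the simulation design in Section 6.1.2: a covariance matrix with
# sparse leading eigenvectors and elliptical samples X = xi * A * U drawn via
# the stochastic representation (2.1). All numeric settings are illustrative.
import numpy as np

def make_sigma(d, cardinalities, eigenvalues):
    """cardinalities: support sizes s_1..s_m; eigenvalues: (w_1, ..., w_m, w_d)."""
    *w_top, w_d = eigenvalues
    Sigma = w_d * np.eye(d)
    start = 0
    for s_j, w_j in zip(cardinalities, w_top):
        v = np.zeros(d)
        v[start:start + s_j] = 1.0 / np.sqrt(s_j)     # v_jk = 1/sqrt(s_j) on its support
        Sigma += (w_j - w_d) * np.outer(v, v)
        start += s_j
    return Sigma

def sample_elliptical(n, Sigma, xi_sampler, rng):
    """Draw n points of X = xi * A * U with A A^T = Sigma and U uniform on the sphere."""
    d = Sigma.shape[0]
    A = np.linalg.cholesky(Sigma)
    G = rng.standard_normal((n, d))
    U = G / np.linalg.norm(G, axis=1, keepdims=True)  # uniform on the unit sphere
    xi = xi_sampler(n, rng)
    return xi[:, None] * (U @ A.T)

rng = np.random.default_rng(1)
Sigma = make_sigma(d=100, cardinalities=[10, 10], eigenvalues=[6.0, 3.0, 1.0])
# Multivariate-t-type generating variable: xi proportional to sqrt(kappa) * chi_d / chi_kappa.
xi_t = lambda n, rng, d=100, kappa=3: np.sqrt(kappa * rng.chisquare(d, n) / rng.chisquare(kappa, n))
X = sample_elliptical(500, Sigma, xi_t, rng)
```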

[Figure 2 about here: a grid of panels, "Scheme 1 / Scheme 2 / Scheme 3" (columns) for the normal, multivariate-t, EC1, and EC2 distributions (rows).]

Figure 2: Curves of angle distances between the estimates and true parameters for different schemes and distributions (normal, multivariate-t, EC1, and EC2, from top to bottom) using the FM algorithm. Here we are interested in estimating the leading eigenvector. The horizontal axis represents the cardinalities of the estimates' support sets and the vertical axis represents the angle distances.


Learning discrete graphical models via generalized inverse covariance matrices Learning discrete graphical models via generalized inverse covariance matrices Duzhe Wang, Yiming Lv, Yongjoon Kim, Young Lee Department of Statistics University of Wisconsin-Madison {dwang282, lv23, ykim676,

More information

Causal Inference: Discussion

Causal Inference: Discussion Causal Inference: Discussion Mladen Kolar The University of Chicago Booth School of Business Sept 23, 2016 Types of machine learning problems Based on the information available: Supervised learning Reinforcement

More information

Estimating Structured High-Dimensional Covariance and Precision Matrices: Optimal Rates and Adaptive Estimation

Estimating Structured High-Dimensional Covariance and Precision Matrices: Optimal Rates and Adaptive Estimation Estimating Structured High-Dimensional Covariance and Precision Matrices: Optimal Rates and Adaptive Estimation T. Tony Cai 1, Zhao Ren 2 and Harrison H. Zhou 3 University of Pennsylvania, University of

More information

Random projections. 1 Introduction. 2 Dimensionality reduction. Lecture notes 5 February 29, 2016

Random projections. 1 Introduction. 2 Dimensionality reduction. Lecture notes 5 February 29, 2016 Lecture notes 5 February 9, 016 1 Introduction Random projections Random projections are a useful tool in the analysis and processing of high-dimensional data. We will analyze two applications that use

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

Robust Principal Component Analysis

Robust Principal Component Analysis ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Statistical Machine Learning for Structured and High Dimensional Data

Statistical Machine Learning for Structured and High Dimensional Data Statistical Machine Learning for Structured and High Dimensional Data (FA9550-09- 1-0373) PI: Larry Wasserman (CMU) Co- PI: John Lafferty (UChicago and CMU) AFOSR Program Review (Jan 28-31, 2013, Washington,

More information

Computational and Statistical Boundaries for Submatrix Localization in a Large Noisy Matrix

Computational and Statistical Boundaries for Submatrix Localization in a Large Noisy Matrix University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2017 Computational and Statistical Boundaries for Submatrix Localization in a Large Noisy Matrix Tony Cai University

More information

Properties of Matrices and Operations on Matrices

Properties of Matrices and Operations on Matrices Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,

More information

Sparse PCA with applications in finance

Sparse PCA with applications in finance Sparse PCA with applications in finance A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon 1 Introduction

More information

Singular Value Decomposition

Singular Value Decomposition Chapter 6 Singular Value Decomposition In Chapter 5, we derived a number of algorithms for computing the eigenvalues and eigenvectors of matrices A R n n. Having developed this machinery, we complete our

More information

Lecture 8: Linear Algebra Background

Lecture 8: Linear Algebra Background CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 8: Linear Algebra Background Lecturer: Shayan Oveis Gharan 2/1/2017 Scribe: Swati Padmanabhan Disclaimer: These notes have not been subjected

More information

Estimation of large dimensional sparse covariance matrices

Estimation of large dimensional sparse covariance matrices Estimation of large dimensional sparse covariance matrices Department of Statistics UC, Berkeley May 5, 2009 Sample covariance matrix and its eigenvalues Data: n p matrix X n (independent identically distributed)

More information

Inverse Power Method for Non-linear Eigenproblems

Inverse Power Method for Non-linear Eigenproblems Inverse Power Method for Non-linear Eigenproblems Matthias Hein and Thomas Bühler Anubhav Dwivedi Department of Aerospace Engineering & Mechanics 7th March, 2017 1 / 30 OUTLINE Motivation Non-Linear Eigenproblems

More information

sparse and low-rank tensor recovery Cubic-Sketching

sparse and low-rank tensor recovery Cubic-Sketching Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

Minimax Rates of Estimation for High-Dimensional Linear Regression Over -Balls

Minimax Rates of Estimation for High-Dimensional Linear Regression Over -Balls 6976 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Minimax Rates of Estimation for High-Dimensional Linear Regression Over -Balls Garvesh Raskutti, Martin J. Wainwright, Senior

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

arxiv: v3 [stat.ml] 31 Aug 2018

arxiv: v3 [stat.ml] 31 Aug 2018 Sparse Generalized Eigenvalue Problem: Optimal Statistical Rates via Truncated Rayleigh Flow Kean Ming Tan, Zhaoran Wang, Han Liu, and Tong Zhang arxiv:1604.08697v3 [stat.ml] 31 Aug 018 September 5, 018

More information

Convergence of Eigenspaces in Kernel Principal Component Analysis

Convergence of Eigenspaces in Kernel Principal Component Analysis Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation

More information

ELE 538B: Mathematics of High-Dimensional Data. Spectral methods. Yuxin Chen Princeton University, Fall 2018

ELE 538B: Mathematics of High-Dimensional Data. Spectral methods. Yuxin Chen Princeton University, Fall 2018 ELE 538B: Mathematics of High-Dimensional Data Spectral methods Yuxin Chen Princeton University, Fall 2018 Outline A motivating application: graph clustering Distance and angles between two subspaces Eigen-space

More information

Upper Bound for Intermediate Singular Values of Random Sub-Gaussian Matrices 1

Upper Bound for Intermediate Singular Values of Random Sub-Gaussian Matrices 1 Upper Bound for Intermediate Singular Values of Random Sub-Gaussian Matrices 1 Feng Wei 2 University of Michigan July 29, 2016 1 This presentation is based a project under the supervision of M. Rudelson.

More information

SVD, PCA & Preprocessing

SVD, PCA & Preprocessing Chapter 1 SVD, PCA & Preprocessing Part 2: Pre-processing and selecting the rank Pre-processing Skillicorn chapter 3.1 2 Why pre-process? Consider matrix of weather data Monthly temperatures in degrees

More information

High Dimensional Inverse Covariate Matrix Estimation via Linear Programming

High Dimensional Inverse Covariate Matrix Estimation via Linear Programming High Dimensional Inverse Covariate Matrix Estimation via Linear Programming Ming Yuan October 24, 2011 Gaussian Graphical Model X = (X 1,..., X p ) indep. N(µ, Σ) Inverse covariance matrix Σ 1 = Ω = (ω

More information

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin 1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,

More information

Recall the convention that, for us, all vectors are column vectors.

Recall the convention that, for us, all vectors are column vectors. Some linear algebra Recall the convention that, for us, all vectors are column vectors. 1. Symmetric matrices Let A be a real matrix. Recall that a complex number λ is an eigenvalue of A if there exists

More information

Robustness of Principal Components

Robustness of Principal Components PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.

More information

A Generalized Least Squares Matrix Decomposition

A Generalized Least Squares Matrix Decomposition A Generalized Least Squares Matrix Decomposition Genevera I. Allen Department of Pediatrics-Neurology, Baylor College of Medicine, & Department of Statistics, Rice University, Houston, TX. Logan Grosenick

More information

Lecture 7: Positive Semidefinite Matrices

Lecture 7: Positive Semidefinite Matrices Lecture 7: Positive Semidefinite Matrices Rajat Mittal IIT Kanpur The main aim of this lecture note is to prepare your background for semidefinite programming. We have already seen some linear algebra.

More information

Basic Concepts in Matrix Algebra

Basic Concepts in Matrix Algebra Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional

More information

Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora

Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora Scribe: Today we continue the

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

Does l p -minimization outperform l 1 -minimization?

Does l p -minimization outperform l 1 -minimization? Does l p -minimization outperform l -minimization? Le Zheng, Arian Maleki, Haolei Weng, Xiaodong Wang 3, Teng Long Abstract arxiv:50.03704v [cs.it] 0 Jun 06 In many application areas ranging from bioinformatics

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Conditions for Robust Principal Component Analysis

Conditions for Robust Principal Component Analysis Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Lecture 5 : Projections

Lecture 5 : Projections Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization

More information

arxiv: v3 [stat.me] 8 Jun 2018

arxiv: v3 [stat.me] 8 Jun 2018 Between hard and soft thresholding: optimal iterative thresholding algorithms Haoyang Liu and Rina Foygel Barber arxiv:804.0884v3 [stat.me] 8 Jun 08 June, 08 Abstract Iterative thresholding algorithms

More information

CSC 576: Variants of Sparse Learning

CSC 576: Variants of Sparse Learning CSC 576: Variants of Sparse Learning Ji Liu Department of Computer Science, University of Rochester October 27, 205 Introduction Our previous note basically suggests using l norm to enforce sparsity in

More information

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 26, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 55 High dimensional

More information

arxiv: v1 [math.pr] 22 May 2008

arxiv: v1 [math.pr] 22 May 2008 THE LEAST SINGULAR VALUE OF A RANDOM SQUARE MATRIX IS O(n 1/2 ) arxiv:0805.3407v1 [math.pr] 22 May 2008 MARK RUDELSON AND ROMAN VERSHYNIN Abstract. Let A be a matrix whose entries are real i.i.d. centered

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao

More information

The Nonparanormal skeptic

The Nonparanormal skeptic The Nonpara skeptic Han Liu Johns Hopkins University, 615 N. Wolfe Street, Baltimore, MD 21205 USA Fang Han Johns Hopkins University, 615 N. Wolfe Street, Baltimore, MD 21205 USA Ming Yuan Georgia Institute

More information

arxiv: v2 [math.st] 12 Feb 2008

arxiv: v2 [math.st] 12 Feb 2008 arxiv:080.460v2 [math.st] 2 Feb 2008 Electronic Journal of Statistics Vol. 2 2008 90 02 ISSN: 935-7524 DOI: 0.24/08-EJS77 Sup-norm convergence rate and sign concentration property of Lasso and Dantzig

More information

SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA

SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA By Lingsong Zhang, Haipeng Shen and Jianhua Z. Huang Purdue University, University of North Carolina,

More information

Parallel Singular Value Decomposition. Jiaxing Tan

Parallel Singular Value Decomposition. Jiaxing Tan Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector

More information

Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material

Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material Dimensionality Reduction Using the Sparse Linear Model: Supplementary Material Ioannis Gkioulekas arvard SEAS Cambridge, MA 038 igkiou@seas.harvard.edu Todd Zickler arvard SEAS Cambridge, MA 038 zickler@seas.harvard.edu

More information

Sparse representation classification and positive L1 minimization

Sparse representation classification and positive L1 minimization Sparse representation classification and positive L1 minimization Cencheng Shen Joint Work with Li Chen, Carey E. Priebe Applied Mathematics and Statistics Johns Hopkins University, August 5, 2014 Cencheng

More information

Math 408 Advanced Linear Algebra

Math 408 Advanced Linear Algebra Math 408 Advanced Linear Algebra Chi-Kwong Li Chapter 4 Hermitian and symmetric matrices Basic properties Theorem Let A M n. The following are equivalent. Remark (a) A is Hermitian, i.e., A = A. (b) x

More information

Short Course Robust Optimization and Machine Learning. Lecture 4: Optimization in Unsupervised Learning

Short Course Robust Optimization and Machine Learning. Lecture 4: Optimization in Unsupervised Learning Short Course Robust Optimization and Machine Machine Lecture 4: Optimization in Unsupervised Laurent El Ghaoui EECS and IEOR Departments UC Berkeley Spring seminar TRANSP-OR, Zinal, Jan. 16-19, 2012 s

More information

Lecture 5 Singular value decomposition

Lecture 5 Singular value decomposition Lecture 5 Singular value decomposition Weinan E 1,2 and Tiejun Li 2 1 Department of Mathematics, Princeton University, weinan@princeton.edu 2 School of Mathematical Sciences, Peking University, tieli@pku.edu.cn

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Solution-recovery in l 1 -norm for non-square linear systems: deterministic conditions and open questions

Solution-recovery in l 1 -norm for non-square linear systems: deterministic conditions and open questions Solution-recovery in l 1 -norm for non-square linear systems: deterministic conditions and open questions Yin Zhang Technical Report TR05-06 Department of Computational and Applied Mathematics Rice University,

More information

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear

More information

Sparse principal component analysis via random projections

Sparse principal component analysis via random projections Sparse principal component analysis via random projections Milana Gataric, Tengyao Wang and Richard J. Samworth Statistical Laboratory, University of Cambridge {m.gataric,t.wang,r.samworth}@statslab.cam.ac.uk

More information

15 Singular Value Decomposition

15 Singular Value Decomposition 15 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information