arxiv: v2 [math.st] 7 Aug 2014


Sparse and Low-Rank Covariance Matrices Estimation

Shenglong Zhou, Naihua Xiu, Ziyan Luo+, Lingchen Kong

Department of Applied Mathematics
+ State Key Laboratory of Rail Traffic Control and Safety
Beijing Jiaotong University, Beijing 100044, P. R. China

Emails: longnan_zsl@63.com, nhxiu@bjtu.edu.cn, zyluo@bjtu.edu.cn, konglchen@26.com

Abstract

This paper aims at obtaining a simultaneously sparse and low-rank estimator for positive semidefinite population covariance matrices. We first formulate a convex optimization problem that uses an $\ell_1$-norm penalty to encourage sparsity and a nuclear norm penalty to favor the low-rank property. For the proposed estimator, we then prove that, with large probability, the estimation error in the Frobenius norm is of order $O\big(\sqrt{(s \log r)/n}\big)$ under a mild condition, where $s$ and $r$ denote the number of nonzero entries and the rank of the population covariance matrix, respectively, and $n$ denotes the sample size. Finally, an efficient alternating direction method of multipliers with global convergence is proposed to solve this problem, and the merits of the approach are illustrated by numerical simulations.

Keywords: Covariance matrices, Sparse, Low-rank, Rate of estimation, Alternating direction method of multipliers

1 Introduction

Estimation of population covariance matrices from samples of multivariate data has drawn much attention in the last decade owing to its fundamental importance in multivariate analysis. With dramatic advances in technology in recent years, various research fields, such as genetics, brain imaging, spectroscopic imaging and climate studies, have to deal with massive high-dimensional data sets whose sample sizes can be very small relative to the dimension. In such settings, the standard sample covariance matrix often performs poorly [1, 2, 11]. Fortunately, regularization has recently emerged as a class of methods for estimating covariance matrices that overcomes these shortcomings of the traditional sample covariance matrix. These methods take several specific forms, for instance banding [1, 6, 7], tapering [4, 10] and thresholding [2, 5, 8, 16].

Moreover, there are many cases where the model is known to be structured in several ways at the same time. In recent years, one line of research has been to estimate a covariance matrix possessing both sparsity and positive definiteness. For instance, Rothman [15] proposed the following model:

$$\hat\Sigma^{+} = \arg\min_{\Sigma \succ 0} \ \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 - \tau \log\det(\Sigma) + \lambda \|\Sigma^{-}\|_1, \tag{1}$$

where $\Sigma_n$ is the sample covariance matrix, $\|\cdot\|_F$ is the Frobenius norm, $\|\cdot\|_1$ is the element-wise $\ell_1$-norm and $\Sigma^{-} := \Sigma - \mathrm{Diag}(\Sigma)$. From the optimization viewpoint, (1) is similar to the graphical lasso criterion [9], which also has a log-determinant part and an element-wise $\ell_1$-penalty. Rothman [15] derived an iterative procedure to solve (1). Xue, Ma and Zou [18] omitted the log-determinant part and considered the positive definite constraint $\{\Sigma \succeq \varepsilon I\}$ for some arbitrarily small $\varepsilon > 0$:

$$\hat\Sigma_{\varepsilon} = \arg\min_{\Sigma \succeq \varepsilon I} \ \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda \|\Sigma^{-}\|_1. \tag{2}$$

They used an efficient alternating direction method (ADM) to solve the challenging problem (2) and established its convergence properties.

Most of the literature, e.g., [12, 15, 18], requires the population covariance matrix to be positive definite, so there is no point in pursuing a low-rank estimator. By contrast, a newer research topic is to consider sparsity and low rank simultaneously in a structured model, which means that the population covariance matrix is no longer restricted to the positive definite cone and may lie in the positive semidefinite cone. In addition, models that are simultaneously sparse and low-rank are widely applied in practice, for example in sparse signal recovery from quadratic measurements and sparse phase retrieval; see [13]. Moreover, Richard et al. [14] showed that a covariance matrix is both sparse and low-rank when the random variables are highly correlated in groups, which means that the covariance matrix has a block diagonal structure.
Motivated by these ideas, we construct the following convex model, which combines the $\ell_1$-norm and the nuclear norm, for estimating the covariance matrix:

$$\hat\Sigma = \arg\min_{\Sigma \succeq 0} \ \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda \|\Sigma\|_1 + \tau \|\Sigma\|_*, \tag{3}$$

where $\lambda \ge 0$, $\tau \ge 0$ are tuning parameters. The $\ell_1$-norm penalty $\|\Sigma\|_1 = \sum_{i,j} |\sigma_{ij}|$ is also called a lasso-type penalty and is used to encourage sparse solutions. The nuclear norm $\|\Sigma\|_* = \sum_i \lambda_i(\Sigma)$, with $\lambda_i(\Sigma)$ being the eigenvalues of $\Sigma$, coincides with the trace when $\Sigma \succeq 0$ and promotes low-rank solutions of (3). Here we introduce the approximate rank to interpret the low-rank property. It is defined as $\mathrm{ar}(A) := \mathrm{ar} \in \{1, 2, \ldots, p\}$, the smallest number such that

$$\frac{\sigma_{\mathrm{ar}+1}(A)}{\sigma_1(A)} \le \gamma, \tag{4}$$

where $\sigma_i(A)$ $(i = 1, 2, \ldots, p)$ are the singular values of $A$ with $\sigma_1(A) \ge \sigma_2(A) \ge \cdots \ge \sigma_p(A)$, and $\gamma > 0$ can be chosen as needed; throughout the paper we fix $\gamma = 0.001$ for simplicity.

The contributions of this paper center on two aspects. First, unlike [13, 14], we establish the statistical theory under different assumptions rather than giving a generalized error bound for the estimation. In particular, we obtain the estimation rate $O\big(\sqrt{(s \log r)/n}\big)$ under the Frobenius norm, which improves the optimal rate $O\big(\sqrt{(s \log p)/n}\big)$ obtained when the low-rank property of the estimator is not considered [7, 15, 18]; here $p$ is the dimension of the samples with $p > \max\{n, r\}$. Second, we use the alternating direction method of multipliers (ADMM), see also [18, 19], to solve problem (3).

The paper is organized as follows. In Section 2 we present some theoretical properties of the estimator derived from the proposed model (3). The ADMM for solving the problem is introduced in Section 3, and numerical experiments showing the performance of the method are reported in Section 4. We conclude in the last section.

2 A Sparse and Low-Rank Covariance Estimator

Before the main part, we introduce some notation. $E(X)$ and $P(A)$ denote the expectation of $X$ and the probability of the event $A$, respectively. $\mathrm{Card}(S)$ is the number of elements of the set $S$. The normal distribution with mean $\mu$ and covariance $\Sigma$ is written as $N(\mu, \Sigma)$. We say $Y_n = O_P(1)$ if for every $\varepsilon > 0$ there is a $C > 0$ such that $P\{|Y_n| > C\} < \varepsilon$ for all $n \ge n_0(\varepsilon)$, and $Y_n = O_P(a_n)$ if $Y_n / a_n = O_P(1)$. If there are two constants $C_1 \le C_2$ such that $C_1 \le X_n / Y_n \le C_2$, we write $X_n \asymp Y_n$.

Given observed independent and identically distributed (i.i.d. for short) $p$-variate random variables $X_1, \ldots, X_n$ with covariance matrix $\Sigma_0$ and $p > n$, the goal is to estimate the unknown matrix $\Sigma_0$ based on the sample $\{X_l : l = 1, \ldots, n\}$. This problem is called covariance matrix estimation and is of fundamental importance in multivariate analysis.

Given a random sample $\{X_1, \ldots, X_n\}$ from a population with $E(X) = 0$ (without loss of generality) and population covariance matrix $\Sigma_0 = (\sigma_{0ij})_{1 \le i,j \le p} = E(XX^\top)$, the sample covariance matrix is

$$\Sigma_n = (\sigma_{nij})_{1 \le i,j \le p} = \frac{1}{n} \sum_{l=1}^{n} (X_l - \bar X)(X_l - \bar X)^\top,$$

where $\bar X = \frac{1}{n} \sum_{l=1}^{n} X_l$. Denote by $S$ the support set of the population covariance matrix $\Sigma_0$, that is,

$$S = \{(i, j) : \sigma_{0ij} \ne 0\}, \qquad s = \mathrm{Card}(S), \qquad r = \mathrm{rank}(\Sigma_0).$$

Assumption 2.1 For all $p$, $0 \le \lambda_{\min}(\Sigma_0) \le \lambda_{\max}(\Sigma_0) \le \bar\lambda < \infty$, where $\bar\lambda$ is a constant.
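The quantities defined so far (the sample covariance $\Sigma_n$, the support size $s$, the approximate rank and the objective of model (3)) can be illustrated with a minimal NumPy sketch. The function names are hypothetical, the default `gamma=1e-3` is an assumed reading of the threshold $\gamma$, and returning 0 for the zero matrix is a convention chosen here, not something stated in the paper.

```python
import numpy as np

def sample_covariance(X):
    """X has shape (n, p); returns (1/n) * sum_l (X_l - Xbar)(X_l - Xbar)^T."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc.T @ Xc / X.shape[0]

def support_size(Sigma0):
    """s = Card{(i, j) : sigma_0ij != 0}."""
    return int(np.count_nonzero(Sigma0))

def approximate_rank(A, gamma=1e-3):
    """Smallest k with sigma_{k+1}(A) / sigma_1(A) <= gamma (p if none)."""
    sv = np.linalg.svd(A, compute_uv=False)  # singular values, descending
    if sv[0] == 0.0:
        return 0  # convention for the zero matrix (assumption)
    ratios = sv / sv[0]
    for k in range(1, len(sv)):
        # ratios[k] equals sigma_{k+1} / sigma_1 in 1-based notation
        if ratios[k] <= gamma:
            return k
    return len(sv)

def objective(Sigma, Sigma_n, lam, tau):
    """0.5 * ||Sigma - Sigma_n||_F^2 + lam * ||Sigma||_1 + tau * ||Sigma||_*."""
    fit = 0.5 * np.linalg.norm(Sigma - Sigma_n, "fro") ** 2
    l1 = np.abs(Sigma).sum()
    nuclear = np.linalg.norm(Sigma, "nuc")
    return fit + lam * l1 + tau * nuclear
```

For example, `approximate_rank(np.diag([3.0, 2.0, 0.0, 0.0]))` counts the two singular values that are not negligible relative to the largest one.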

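Assumption 2.1 is a commonly used condition in covariance matrix estimation. A useful lemma based on it is recalled here for the subsequent analysis; see also [1].

Lemma 2.2 Let $X_l \in \mathbb{R}^p$, $l = 1, \ldots, n$, be i.i.d. $N(0, \Sigma_0)$ and let Assumption 2.1 hold (i.e., for all $p$, $\lambda_{\max}(\Sigma_0) \le \bar\lambda < \infty$). Then, if $\Sigma_0 = (\sigma_{0ij})_{1 \le i,j \le p}$,

$$P\Big\{ \Big| \sum_{l=1}^{n} \big( X_{li} X_{lj} - \sigma_{0ij} \big) \Big| \ge n\nu \Big\} \le C_1 \exp\{-C_2 n \nu^2\}, \quad \text{for } |\nu| \le \delta, \tag{5}$$

where the constants $C_1$, $C_2$ and $\delta$ depend on $\bar\lambda$ only.

Another lemma, which plays an important role in our main results, is stated below.

Lemma 2.3 Suppose that Assumption 2.1 holds, $\lambda \le \frac{\varepsilon}{16\sqrt{s}}$ and $\tau \le \frac{\varepsilon}{8\sqrt{r}}$. Then for $\varepsilon > 0$ sufficiently small,

$$\max_{1 \le i,j \le p} |\sigma_{0ij} - \sigma_{nij}| \le \lambda \quad \text{implies} \quad \|\hat\Sigma - \Sigma_0\|_F \le \varepsilon, \tag{6}$$

where $\hat\Sigma$ is defined as in (3).

Proof First take the eigenvalue decomposition of $\Sigma_0$ (with $r = \mathrm{rank}(\Sigma_0)$) as

$$\Sigma_0 = U \Lambda_{\Sigma_0} U^\top = U \begin{pmatrix} \mathrm{Diag}(\lambda(\Sigma_0)) & 0 \\ 0 & 0 \end{pmatrix} U^\top,$$

where $U \in \mathbb{R}^{p \times p}$ with $UU^\top = U^\top U = I_p$ is the matrix composed of eigenvectors, and $\mathrm{Diag}(\lambda(\Sigma_0)) \in \mathbb{R}^{r \times r}$ is the diagonal matrix generated by the eigenvalues, with $\lambda(\Sigma_0) = (\lambda_1(\Sigma_0), \ldots, \lambda_r(\Sigma_0))^\top$ and $\lambda_j(\Sigma_0) > 0$, $j = 1, \ldots, r$. Denoting $\Sigma := U \Delta U^\top + \Sigma_0$ with

$$\Delta = \begin{pmatrix} \Delta_r & 0 \\ 0 & 0 \end{pmatrix} \quad \text{and} \quad \Delta_r \in \mathbb{R}^{r \times r},$$

which implies $\Delta = U^\top \Sigma U - \Lambda_{\Sigma_0}$, we consider the model

$$\hat\Delta := \arg\min_{\Delta + \Lambda_{\Sigma_0} \succeq 0} F(\Delta) = \arg\min_{\Delta_r + \mathrm{Diag}(\lambda(\Sigma_0)) \succeq 0} F(\Delta), \tag{7}$$

where

$$F(\Delta) := \frac{1}{2}\|U \Delta U^\top + \Sigma_0 - \Sigma_n\|_F^2 + \lambda \|U \Delta U^\top + \Sigma_0\|_1 + \tau \|U \Delta U^\top + \Sigma_0\|_*.$$

Clearly, from (3) we have $\hat\Sigma = U \hat\Delta U^\top + \Sigma_0$, which implies $\hat\Delta = U^\top \hat\Sigma U - \Lambda_{\Sigma_0}$. For a given $\varepsilon > 0$ sufficiently small and any $\Delta$ with $\|\Delta\|_F = \varepsilon$ (i.e., $\|\Delta_r\|_F = \varepsilon$), we compute

$$\begin{aligned}
F(\Delta) - F(0) &= \frac{1}{2}\|U \Delta U^\top + \Sigma_0 - \Sigma_n\|_F^2 + \lambda \|U \Delta U^\top + \Sigma_0\|_1 + \tau \|U \Delta U^\top + \Sigma_0\|_* \\
&\quad - \frac{1}{2}\|\Sigma_0 - \Sigma_n\|_F^2 - \lambda \|\Sigma_0\|_1 - \tau \|\Sigma_0\|_* \\
&= \frac{1}{2}\|\Delta\|_F^2 + \langle U \Delta U^\top, \Sigma_0 - \Sigma_n \rangle + \lambda \big( \|U \Delta U^\top + \Sigma_0\|_1 - \|\Sigma_0\|_1 \big) \\
&\quad + \tau \big( \|U \Delta U^\top + \Sigma_0\|_* - \|\Sigma_0\|_* \big) \\
&=: \frac{1}{2}\|\Delta\|_F^2 + \mathrm{I} + \mathrm{II} + \mathrm{III}.
\end{aligned}$$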

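For convenience we denote $\tilde\Delta = U \Delta U^\top$, so that $\|\tilde\Delta\|_F = \|\Delta\|_F$. Then for I, it holds that

$$|\mathrm{I}| = |\langle \tilde\Delta, \Sigma_0 - \Sigma_n \rangle| = \Big| \sum_{i,j} \tilde\Delta_{ij} (\sigma_{0ij} - \sigma_{nij}) \Big| \le \max_{1 \le i,j \le p} |\sigma_{0ij} - \sigma_{nij}| \, \|\tilde\Delta\|_1.$$

For II, noting that $S = \{(i, j) : \sigma_{0ij} \ne 0\}$ and $s = \mathrm{Card}(S)$, we obtain

$$\begin{aligned}
\mathrm{II} &= \lambda \big( \|\tilde\Delta + \Sigma_0\|_1 - \|\Sigma_0\|_1 \big) = \lambda \big( \|\Sigma_{0S} + \tilde\Delta_S\|_1 + \|\tilde\Delta_{S^C}\|_1 - \|\Sigma_{0S}\|_1 \big) \\
&\ge \lambda \big( \|\Sigma_{0S}\|_1 - \|\tilde\Delta_S\|_1 + \|\tilde\Delta_{S^C}\|_1 - \|\Sigma_{0S}\|_1 \big) = \lambda \big( \|\tilde\Delta_{S^C}\|_1 - \|\tilde\Delta_S\|_1 \big).
\end{aligned}$$

From the Hölder inequality, one can prove that

$$\|\tilde\Delta\|_* = \|U \Delta U^\top\|_* = \|\Delta\|_* = \sum_{i=1}^{r} |\lambda_i(\Delta)| \le \sqrt{r}\, \|\Delta\|_F = \sqrt{r}\, \|\tilde\Delta\|_F, \tag{8}$$

$$\|\tilde\Delta_S\|_1 \le \sqrt{s}\, \|\tilde\Delta_S\|_F. \tag{9}$$

For III, combining with (8) we get

$$\mathrm{III} = \tau \big( \|\tilde\Delta + \Sigma_0\|_* - \|\Sigma_0\|_* \big) \ge -\tau \|\tilde\Delta\|_* = -\tau \|\Delta\|_* \ge -\tau \sqrt{r}\, \|\Delta\|_F.$$

Since $\max_{i,j} |\sigma_{0ij} - \sigma_{nij}| \le \lambda$, and using (9),

$$\begin{aligned}
G(\Delta) := F(\Delta) - F(0) &= \frac{1}{2}\|\Delta\|_F^2 + \mathrm{I} + \mathrm{II} + \mathrm{III} \\
&\ge \frac{1}{2}\|\Delta\|_F^2 - \lambda \|\tilde\Delta\|_1 + \lambda \big( \|\tilde\Delta_{S^C}\|_1 - \|\tilde\Delta_S\|_1 \big) - \tau \sqrt{r}\, \|\Delta\|_F \\
&= \frac{1}{2}\|\Delta\|_F^2 - \lambda \big( \|\tilde\Delta_S\|_1 + \|\tilde\Delta_{S^C}\|_1 \big) + \lambda \big( \|\tilde\Delta_{S^C}\|_1 - \|\tilde\Delta_S\|_1 \big) - \tau \sqrt{r}\, \|\Delta\|_F \\
&= \frac{1}{2}\|\Delta\|_F^2 - 2\lambda \|\tilde\Delta_S\|_1 - \tau \sqrt{r}\, \|\Delta\|_F \\
&\ge \frac{1}{2}\|\Delta\|_F^2 - 2\lambda \sqrt{s}\, \|\tilde\Delta_S\|_F - \tau \sqrt{r}\, \|\Delta\|_F \\
&\ge \frac{1}{2}\|\Delta\|_F^2 - 2\lambda \sqrt{s}\, \|\Delta\|_F - \tau \sqrt{r}\, \|\Delta\|_F.
\end{aligned}$$

Therefore, by $\lambda \le \frac{\varepsilon}{16\sqrt{s}}$, $\tau \le \frac{\varepsilon}{8\sqrt{r}}$ and $\|\Delta\|_F = \varepsilon$,

$$G(\Delta) \ge \frac{1}{2}\|\Delta\|_F^2 - 2\lambda \sqrt{s}\, \|\Delta\|_F - \tau \sqrt{r}\, \|\Delta\|_F \ge \frac{\varepsilon^2}{2} - \frac{\varepsilon^2}{8} - \frac{\varepsilon^2}{8} = \frac{\varepsilon^2}{4} > 0.$$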

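Hence we have proved that if $\lambda \le \frac{\varepsilon}{16\sqrt{s}}$, $\tau \le \frac{\varepsilon}{8\sqrt{r}}$ and $\max_{i,j} |\sigma_{0ij} - \sigma_{nij}| \le \lambda$, then for any $\Delta$ with $\|\Delta\|_F = \varepsilon$ it holds that $G(\Delta) = F(\Delta) - F(0) > 0$. In addition, from (7), we have

$$\hat\Delta = \arg\min_{\Delta_r + \mathrm{Diag}(\lambda(\Sigma_0)) \succeq 0} \big[ F(\Delta) - F(0) \big] = \arg\min_{\Delta_r + \mathrm{Diag}(\lambda(\Sigma_0)) \succeq 0} G(\Delta), \tag{10}$$

which implies that $\|\hat\Delta\|_F \le \varepsilon$. Otherwise, suppose $\|\hat\Delta\|_F > \varepsilon$; then $\big\| \varepsilon \hat\Delta / \|\hat\Delta\|_F \big\|_F = \varepsilon$. Since $G(\Delta) > 0 = G(0)$ for any $\|\Delta\|_F = \varepsilon$, this contradicts the facts that $G$ is a convex function and $G(\hat\Delta) \le G(0) = 0$, because

$$0 < G\Big( \frac{\varepsilon}{\|\hat\Delta\|_F} \hat\Delta \Big) = G\Big( \frac{\varepsilon}{\|\hat\Delta\|_F} \hat\Delta + \Big( 1 - \frac{\varepsilon}{\|\hat\Delta\|_F} \Big) 0 \Big) \le \frac{\varepsilon}{\|\hat\Delta\|_F} G(\hat\Delta) + \Big( 1 - \frac{\varepsilon}{\|\hat\Delta\|_F} \Big) G(0) \le 0.$$

Finally, $\|\hat\Delta\|_F \le \varepsilon$ gives $\|\hat\Sigma - \Sigma_0\|_F \le \varepsilon$, since $\|\hat\Delta\|_F = \|U \hat\Delta U^\top\|_F = \|\hat\Sigma - \Sigma_0\|_F$. Hence the desired result is obtained.

To obtain the rate of estimation, the following two commonly used assumptions are needed; see also [1, 15].

Assumption 2.4 $E\{\exp(t X_{lj}^2)\} \le C_1 < \infty$ holds for all $j = 1, \ldots, p$ and $0 < t < t_0$, where $t_0$ and $C_1$ are two constants.

Assumption 2.4 holds, for example, if $X_{l1}, \ldots, X_{lp}$ are Gaussian.

Assumption 2.5 $E\{|X_{lj}|^{2\alpha}\} \le C_2 < \infty$ holds for all $j = 1, \ldots, p$, for some $\alpha \ge 2$, where $C_2$ is a constant.

Based on these two assumptions, we give our main results on the rates of the estimator (3).

Theorem 2.6 For some $\delta \in [1, 2)$, let $K$ be a sufficiently large constant and suppose Assumptions 2.1 and 2.4 hold. If

$$\lambda = K \sqrt{\frac{\log r}{n} + \frac{\log p}{n K^{\delta}}}, \qquad \tau = O_P\Big( \sqrt{\frac{s \log r}{n r} + \frac{s \log p}{n r K^{\delta}}} \Big)$$

and $s \log r + (s \log p)/K^{\delta} = o(n)$, then

$$\|\hat\Sigma - \Sigma_0\|_F = O_P\Big( \sqrt{\frac{s \log r}{n} + \frac{s \log p}{n K^{\delta}}} \Big). \tag{11}$$

Proof Since Assumptions 2.1 and 2.4 hold, a fact employed by Rothman et al. [16] is that for $\nu > 0$ sufficiently small,

$$P\Big\{ \max_{1 \le i,j \le p} |\sigma_{0ij} - \sigma_{nij}| \ge \nu \Big\} \le C_3 p^2 \exp\{-C_4 n \nu^2\},$$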

where $C_3$ and $C_4$ are some constants. We then apply the bound $s \log r + (s \log p)/K^{\delta} = o(n)$ and Lemma 2.3 with

$$\varepsilon = 16 K \sqrt{\frac{s \log r}{n} + \frac{s \log p}{n K^{\delta}}}, \qquad \lambda = K \sqrt{\frac{\log r}{n} + \frac{\log p}{n K^{\delta}}}$$

to obtain

$$P\big\{ \|\hat\Sigma - \Sigma_0\|_F \le \varepsilon \big\} \ge P\Big\{ \max_{1 \le i,j \le p} |\sigma_{0ij} - \sigma_{nij}| \le \lambda \Big\} \ge 1 - C_3\, p^{\,2 - C_4 K^{2-\delta}}\, r^{-C_4 K^2}.$$

Evidently, $1 - C_3\, p^{\,2 - C_4 K^{2-\delta}}\, r^{-C_4 K^2}$ can be made arbitrarily close to one by choosing $K$ sufficiently large.

Corollary 2.7 For some $\delta \in [1, 2)$, let $K$ be a sufficiently large constant such that $K^{\delta} \log r \ge \log p$, and suppose Assumptions 2.1 and 2.4 hold. If $\lambda = K \sqrt{\frac{\log r}{n}}$, $\tau = O_P\Big( \sqrt{\frac{s \log r}{n r}} \Big)$ and $s \log r = o(n)$, then

$$\|\hat\Sigma - \Sigma_0\|_F = O_P\Big( \sqrt{\frac{s \log r}{n}} \Big). \tag{12}$$

Remark 2.8 Clearly, if $\lambda_{\min}(\Sigma_0) > 0$ in Assumption 2.1, so that $r = p$, the better rate $O_P\big( \sqrt{(s \log r)/n} \big)$ reduces to $O_P\big( \sqrt{(s \log p)/n} \big)$. It is worth mentioning that under Assumption 2.1, the minimax optimal rate of convergence under the Frobenius norm in Theorem 4 of [7] is $O_P\big( \sqrt{(s \log p)/n} \big)$, which has also been obtained by [18]. However, to attain the same rate in the presence of the log-determinant barrier term in (1), Rothman [15] instead requires that $\lambda_{\min}$, the minimal eigenvalue of the true covariance matrix, be bounded away from zero by some positive constant, and also that the barrier parameter be bounded by some positive quantity. As pointed out in [18], a theory requiring a lower bound on $\lambda_{\min}$ is not very appealing.

Theorem 2.9 Let $K_2$ be a sufficiently large constant and suppose that Assumptions 2.1 and 2.5 hold. If $\lambda = K_2 \sqrt{\frac{p^{4/\alpha}}{n}}$, $\tau = O_P\Big( \sqrt{\frac{s\, p^{4/\alpha}}{n r}} \Big)$ and $s\, p^{4/\alpha} = o(n)$, then

$$\|\hat\Sigma - \Sigma_0\|_F = O_P\Big( \sqrt{\frac{s\, p^{4/\alpha}}{n}} \Big). \tag{13}$$

Proof Since Assumptions 2.1 and 2.5 hold, one can modify a result of Bickel & Levina (2008a) and show that for $\nu > 0$ sufficiently small,

$$P\Big\{ \max_{1 \le i,j \le p} |\sigma_{0ij} - \sigma_{nij}| \ge \nu \Big\} \le \frac{p^2 C_5}{n^{\alpha/2} \nu^{\alpha}},$$

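where $C_5$ is a constant. We then apply the bound $s\, p^{4/\alpha} = o(n)$ and Lemma 2.3 with $\varepsilon = 16 K_2 \sqrt{\frac{s\, p^{4/\alpha}}{n}}$ and $\lambda = K_2 \sqrt{\frac{p^{4/\alpha}}{n}}$ to get

$$P\big\{ \|\hat\Sigma - \Sigma_0\|_F \le \varepsilon \big\} \ge P\Big\{ \max_{1 \le i,j \le p} |\sigma_{0ij} - \sigma_{nij}| \le \lambda \Big\} \ge 1 - C_6 K_2^{-\alpha}.$$

Apparently, the bound $1 - C_6 K_2^{-\alpha}$ can be made arbitrarily close to one by taking $K_2$ sufficiently large.

3 Alternating Direction Method of Multipliers

In this section, we construct the alternating direction method of multipliers (ADMM) to solve problem (3). By introducing an auxiliary variable $\Gamma$, problem (3) can be rewritten as

$$\min_{\Sigma, \Gamma} \ \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda \|\Sigma\|_1 + \tau \|\Gamma\|_*, \quad \text{s.t.} \ \Gamma \succeq 0, \ \Sigma - \Gamma = 0. \tag{14}$$

The constraint $\Gamma \succeq 0$ can be put into the objective function by using an indicator function:

$$I(\Gamma \succeq 0) = \begin{cases} 0, & \Gamma \succeq 0, \\ +\infty, & \text{otherwise.} \end{cases}$$

This leads to the following equivalent reformulation of (14):

$$\min_{\Sigma, \Gamma} \ \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda \|\Sigma\|_1 + \tau \|\Gamma\|_* + I(\Gamma \succeq 0), \quad \text{s.t.} \ \Sigma - \Gamma = 0. \tag{15}$$

Recently, the alternating direction method of multipliers (ADMM) has been studied extensively for solving problems such as (15). A typical iteration of ADMM for solving (15) can be described as

$$\Gamma^{k+1} := \arg\min_{\Gamma} L_{\mu}(\Sigma^k, \Gamma, \Lambda^k), \tag{16}$$

$$\Sigma^{k+1} := \arg\min_{\Sigma} L_{\mu}(\Sigma, \Gamma^{k+1}, \Lambda^k), \tag{17}$$

$$\Lambda^{k+1} := \Lambda^k - \frac{1}{\mu}\big( \Gamma^{k+1} - \Sigma^{k+1} \big), \tag{18}$$

where the augmented Lagrangian function $L_{\mu}(\Sigma, \Gamma, \Lambda)$ is defined as

$$L_{\mu}(\Sigma, \Gamma, \Lambda) := \frac{1}{2}\|\Sigma - \Sigma_n\|_F^2 + \lambda \|\Sigma\|_1 + \tau \|\Gamma\|_* + I(\Gamma \succeq 0) - \langle \Lambda, \Gamma - \Sigma \rangle + \frac{1}{2\mu}\|\Sigma - \Gamma\|_F^2, \tag{19}$$

in which $\Lambda$ is the Lagrange multiplier and $\mu > 0$ is a penalty parameter. Note that ADMM (16)-(18) can be written explicitly as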

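$$\Gamma^{k+1} := \arg\min_{\Gamma \succeq 0} \ \tau \|\Gamma\|_* + \frac{1}{2\mu}\|\Gamma - (\Sigma^k + \mu \Lambda^k)\|_F^2, \tag{20}$$

$$\Sigma^{k+1} := \arg\min_{\Sigma} \ \lambda \|\Sigma\|_1 + \frac{\mu + 1}{2\mu} \Big\| \Sigma - \frac{\mu}{\mu + 1}\Big( \Sigma_n + \frac{1}{\mu}\Gamma^{k+1} - \Lambda^k \Big) \Big\|_F^2, \tag{21}$$

$$\Lambda^{k+1} := \Lambda^k - (\Gamma^{k+1} - \Sigma^{k+1})/\mu. \tag{22}$$

We now show that the two subproblems (20) and (21) can be easily solved. For the subproblem (20), since $\|\Gamma\|_* = \langle \Gamma, I \rangle$ for $\Gamma \succeq 0$,

$$\begin{aligned}
\Gamma^{k+1} &= \arg\min_{\Gamma \succeq 0} \ \tau \|\Gamma\|_* + \frac{1}{2\mu}\|\Gamma - (\Sigma^k + \mu \Lambda^k)\|_F^2 \\
&= \arg\min_{\Gamma \succeq 0} \ \tau \langle \Gamma, I \rangle + \frac{1}{2\mu}\|\Gamma - (\Sigma^k + \mu \Lambda^k)\|_F^2 \\
&= \arg\min_{\Gamma \succeq 0} \ \|\Gamma - (\Sigma^k + \mu \Lambda^k - \mu \tau I)\|_F^2 \\
&= (\Sigma^k + \mu \Lambda^k - \mu \tau I)_+,
\end{aligned} \tag{23}$$

where $(X)_+$ denotes the projection of a matrix $X$ onto the convex positive semidefinite cone $\mathcal{S}^p_+$. Namely, $(X)_+ = U\, \mathrm{Diag}(\max\{\lambda(X), 0\})\, U^\top$, where $X = U\, \mathrm{Diag}(\lambda(X))\, U^\top$ and $\lambda(X) = (\lambda_1(X), \lambda_2(X), \ldots, \lambda_p(X))^\top$. The solution of the second subproblem (21) is given by the $\ell_1$-shrinkage operation

$$\Sigma^{k+1} = \frac{\mu}{\mu + 1}\, \mathrm{Shrink}\Big( \Sigma_n + \frac{1}{\mu}\Gamma^{k+1} - \Lambda^k, \lambda \Big) = \frac{\mu}{\mu + 1} \Big( \max\{|P_{ij}| - \lambda, 0\}\, \mathrm{sign}(P_{ij}) \Big)_{1 \le i,j \le p}, \tag{24}$$

where $P = (P_{ij})_{1 \le i,j \le p} := \Sigma_n + \frac{1}{\mu}\Gamma^{k+1} - \Lambda^k$ and $\mathrm{sign}(\cdot)$ is the sign function. Therefore, combining (22)-(24), the whole algorithm is written as follows.

Table 1: The framework of the ADMM.

ADMM: Alternating Direction Method of Multipliers
  Initialize $\mu$, $\lambda$, $\tau$, $\Sigma^0$, $\Lambda^0$;
  Repeat
    Compute $\Gamma^{k+1} = (\Sigma^k + \mu \Lambda^k - \mu \tau I)_+$;
    Compute $\Sigma^{k+1} = \frac{\mu}{\mu + 1}\, \mathrm{Shrink}(\Sigma_n + \frac{1}{\mu}\Gamma^{k+1} - \Lambda^k, \lambda)$;
    Compute $\Lambda^{k+1} := \Lambda^k - \frac{1}{\mu}(\Gamma^{k+1} - \Sigma^{k+1})$.
  Until convergence
  Return

To end this section, we prove that the sequence $(\Gamma^k, \Sigma^k, \Lambda^k)$ produced by the alternating direction method of multipliers (Table 1) converges to $(\hat\Gamma, \hat\Sigma, \hat\Lambda)$, where $(\hat\Gamma, \hat\Sigma)$ is an optimal solution of (15) and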

$\hat\Lambda$ is the optimal dual variable. We first fix some notation for ease of presentation. Let $H$ be the $2p \times 2p$ matrix defined as

$$H = \begin{pmatrix} \mu I_{p \times p} & 0 \\ 0 & \frac{1}{\mu} I_{p \times p} \end{pmatrix};$$

the weighted norm $\|\cdot\|_H$ is defined by $\|V\|_H^2 := \langle V, HV \rangle$ and the corresponding inner product $\langle \cdot, \cdot \rangle_H$ is $\langle U, V \rangle_H := \langle U, HV \rangle$. Before presenting the main theorem on the global convergence of ADMM, we introduce the following lemma.

Lemma 3.1 Assume that $(\hat\Gamma, \hat\Sigma)$ is an optimal solution of (15) and $\hat\Lambda$ is the corresponding optimal dual variable associated with the equality constraint $\Sigma - \Gamma = 0$. Then the sequence $\{(\Gamma^k, \Sigma^k, \Lambda^k)\}$ produced by ADMM satisfies

$$\|\hat V - V^k\|_H^2 - \|\hat V - V^{k+1}\|_H^2 \ge \|V^{k+1} - V^k\|_H^2, \tag{25}$$

where $V^k = (\Lambda^k, \Sigma^k)^\top$ and $\hat V = (\hat\Lambda, \hat\Sigma)^\top$.

Based on the lemma above, the convergence theorem can be derived immediately.

Theorem 3.2 The sequence $\{(\Gamma^k, \Sigma^k, \Lambda^k)\}$ generated by Algorithm 1 from any starting point converges to an optimal solution of (15).

4 Numerical Simulations

In this section we apply the proposed ADMM to two examples, one with a block structured population covariance matrix and the other with a banded population covariance matrix. Because of the constraint $\Sigma \succeq 0$, for which $\|\Sigma\|_* = \mathrm{tr}(\Sigma)$, our proposed model (3) is equivalent to

$$\hat\Sigma = \arg\min_{\Sigma \succeq 0} \ \frac{1}{2}\|\Sigma - (\Sigma_n - \tau I)\|_F^2 + \lambda \|\Sigma\|_1. \tag{26}$$

So, similar to the method in [18], one can compute the soft-thresholding estimator

$$\Sigma_{st} := \arg\min_{\Sigma} \ \frac{1}{2}\|\Sigma - (\Sigma_n - \tau I)\|_F^2 + \lambda \|\Sigma\|_1$$

to initialize $\Sigma^0$. If the resulting $\Sigma_{st} \succeq 0$, then the recovered sparse and low-rank semidefinite estimator is $\hat\Sigma = \Sigma_{st}$. In our simulations, we uniformly initialize $\Lambda^0$ as the matrix with all entries equal to 1, and $\Sigma^0$ as the zero matrix and as $\Sigma_{st}$, respectively. Unlike $\lambda$ and $\tau$, $\mu$ does not change the final covariance estimator, so we fix $\mu = 1$ for simplicity, and the stopping criterion is based on

$$\max\big\{ \|\Gamma^{k+1} - \Gamma^k\|_F, \ \|\Sigma^{k+1} - \Sigma^k\|_F \big\}.$$

For the sample dimensions, we always take $n = 50$ and $p = 100, 200, 500, 1000$.
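The iteration of Section 3 (positive semidefinite projection for $\Gamma$, scaled soft-thresholding for $\Sigma$, and the multiplier update) can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' code: the function names, the tolerance `tol`, the iteration cap `max_iter`, and the symmetrization inside the projection are assumptions introduced here; the default initialization (zero $\Sigma^0$, all-ones $\Lambda^0$, $\mu = 1$) follows the simulation setup described above.

```python
import numpy as np

def project_psd(X):
    """Projection onto the PSD cone: clip negative eigenvalues at zero."""
    X = (X + X.T) / 2.0                  # guard against round-off asymmetry
    w, U = np.linalg.eigh(X)
    return (U * np.maximum(w, 0.0)) @ U.T

def shrink(P, lam):
    """Entrywise soft-thresholding: max(|P_ij| - lam, 0) * sign(P_ij)."""
    return np.sign(P) * np.maximum(np.abs(P) - lam, 0.0)

def soft_threshold_init(Sigma_n, lam, tau):
    """Soft-thresholding estimator Sigma_st used as a warm start."""
    return shrink(Sigma_n - tau * np.eye(Sigma_n.shape[0]), lam)

def admm(Sigma_n, lam, tau, mu=1.0, Sigma0=None, Lam0=None,
         tol=1e-8, max_iter=5000):
    """ADMM sketch for min 0.5||S - Sigma_n||_F^2 + lam||S||_1 + tau||S||_*, S PSD."""
    p = Sigma_n.shape[0]
    Sigma = np.zeros((p, p)) if Sigma0 is None else Sigma0.copy()
    Lam = np.ones((p, p)) if Lam0 is None else Lam0.copy()
    Gamma = Sigma.copy()
    for _ in range(max_iter):
        # Gamma-step: (Sigma^k + mu * Lambda^k - mu * tau * I)_+
        Gamma_new = project_psd(Sigma + mu * Lam - mu * tau * np.eye(p))
        # Sigma-step: mu / (mu + 1) * Shrink(Sigma_n + Gamma^{k+1} / mu - Lambda^k, lam)
        Sigma_new = mu / (mu + 1.0) * shrink(Sigma_n + Gamma_new / mu - Lam, lam)
        # Multiplier step: Lambda^{k+1} = Lambda^k - (Gamma^{k+1} - Sigma^{k+1}) / mu
        Lam = Lam - (Gamma_new - Sigma_new) / mu
        change = max(np.linalg.norm(Gamma_new - Gamma),
                     np.linalg.norm(Sigma_new - Sigma))
        Gamma, Sigma = Gamma_new, Sigma_new
        if change <= tol:                # hypothetical tolerance
            break
    return Sigma, Gamma, Lam
```

A warm start can be requested with `admm(Sigma_n, lam, tau, Sigma0=soft_threshold_init(Sigma_n, lam, tau))`. The returned `Sigma` is sparse by construction of the shrinkage step, while `Gamma` is exactly positive semidefinite; at convergence the two coincide up to the tolerance.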

4.1 Example I: Block Structure

Following the model of [14], slightly modified here, who synthesized $n$ samples $X_l \sim N(0, \Sigma_0)$ for a block diagonal population covariance matrix $\Sigma_0 \in \mathbb{R}^{p \times p}$, we use $K$ $(= 5, 10, 20)$ blocks of random sizes, where each block is generated as $vv^\top$ and the entries of $v$ are drawn i.i.d. from the uniform distribution on $[-1, 1]$. Evidently, the rank of $\Sigma_0$ produced in this way is $K$. The corresponding MATLAB code for generating $X_l$ is

$$X_l = \Sigma_0^{1/2}\, \mathrm{randn}(p, 1), \quad l = 1, 2, \ldots, n,$$

from which the sample covariance matrix is

$$\Sigma_n = \frac{1}{n} \sum_{l=1}^{n} (X_l - \bar X)(X_l - \bar X)^\top.$$

It is worth mentioning that if $\Sigma_0$ is a positive definite matrix, the solution obtained by our method would be a positive definite matrix with full rank. Fortunately, the remaining singular values are relatively small compared with the largest ones and can be ignored. We therefore consider the approximate rank (4). In addition, we define the sparsity of a matrix $A = (a_{ij}) \in \mathbb{R}^{n \times p}$ by

$$\mathrm{sp}(A) = \frac{\mathrm{Card}\{(i, j) : a_{ij} \ne 0\}}{np}.$$

Figure 1: Covariance estimation with $K = 5$, $\lambda = 0.5$, $\tau = 0.2$ and $p = 200$. In the row above, FPR = 0.092, TPR = 1, whilst FPR = 0.08, TPR = 1 in the row below.
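The data generation of this example can be sketched in NumPy as below. How the random block sizes are drawn is not specified in the text, so choosing $K - 1$ distinct cut points uniformly at random is an assumption, as are the function names; the symmetric square root is computed from an eigendecomposition.

```python
import numpy as np

def block_covariance(p, K, rng):
    """Block diagonal Sigma_0 with K rank-one blocks v v^T, v ~ U[-1, 1]."""
    # K - 1 distinct cut points give K blocks of random sizes (assumed scheme).
    cuts = np.sort(rng.choice(np.arange(1, p), size=K - 1, replace=False))
    edges = np.concatenate(([0], cuts, [p]))
    Sigma0 = np.zeros((p, p))
    for a, b in zip(edges[:-1], edges[1:]):
        v = rng.uniform(-1.0, 1.0, size=b - a)
        Sigma0[a:b, a:b] = np.outer(v, v)
    return Sigma0

def draw_samples(Sigma0, n, rng):
    """Rows are X_l = Sigma_0^{1/2} z_l with z_l ~ N(0, I_p)."""
    w, U = np.linalg.eigh(Sigma0)
    root = (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T  # symmetric square root
    return rng.standard_normal((n, Sigma0.shape[0])) @ root

def sparsity(A):
    """sp(A): fraction of nonzero entries of A."""
    return np.count_nonzero(A) / A.size
```

For instance, `block_covariance(200, 5, np.random.default_rng(0))` produces a $200 \times 200$ matrix whose rank equals the number of blocks, and `draw_samples` returns an array of shape `(n, p)` whose rows play the role of $X_1, \ldots, X_n$.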


Apart from the approximate rank ar and the sparsity sp of the sparse and low-rank semidefinite estimator $\hat\Sigma = (\hat\sigma_{ij})_{1 \le i,j \le p}$, we also use two other types of errors to show the selection performance of our proposed ADMM:

$$\mathrm{FPR} := \frac{\mathrm{Card}\{(i, j) : \sigma_{0ij} \ne 0 \ \&\ \hat\sigma_{ij} = 0\}}{\mathrm{Card}\{(i, j) : \hat\sigma_{ij} = 0\}}, \qquad \mathrm{TPR} := \frac{\mathrm{Card}\{(i, j) : \sigma_{0ij} \ne 0 \ \&\ \hat\sigma_{ij} \ne 0\}}{\mathrm{Card}\{(i, j) : \hat\sigma_{ij} \ne 0\}},$$

where FPR stands for the false positive rate, i.e., the ratio of significant variables that are unselected over all zero entries, and TPR denotes the true positive rate, i.e., the ratio of significant variables that are selected over all nonzero entries.

For visualization, we plot the Population Covariance, the Sample Covariance and the Recovered Covariance. In Figures 1 and 2, the left panel (Population Covariance) is $\Sigma_0$, the middle panel (Sample Covariance) is $\Sigma_n$ and the right panel (Recovered Covariance) is $\hat\Sigma$. The yellow region is the sparse area, in which the values are close to zero (most of them equal to zero), while the green zone is where the nonzero entries are located. Moreover, the deeper the green color, the larger the value. Evidently the recovered covariance matrices depend strongly on the sample covariance matrix.

Figure 2: Covariance estimation with $K = 5$, $\lambda = 0.5$, $\tau = 0.2$ and $p = 500$. In the row above, FPR = 0.098, TPR = 0.99, whilst FPR = 0.07, TPR = 1 in the row below.
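The two selection measures can be computed as in the short sketch below, which follows the definitions as worded in the text (unselected significant entries over the zero entries of the estimator, and selected significant entries over its nonzero entries). The function name, the optional threshold `tol` for treating small estimated entries as zero, and returning `nan` for an empty denominator are assumptions made for illustration.

```python
import numpy as np

def selection_rates(Sigma0, Sigma_hat, tol=0.0):
    """Return (FPR, TPR) as defined in the text for the estimator Sigma_hat."""
    true_nonzero = Sigma0 != 0
    est_nonzero = np.abs(Sigma_hat) > tol   # tol = 0 means exact zeros only
    est_zero = ~est_nonzero
    n_zero = est_zero.sum()
    n_nonzero = est_nonzero.sum()
    # Significant entries that were not selected, over the estimator's zeros.
    fpr = (true_nonzero & est_zero).sum() / n_zero if n_zero else float("nan")
    # Significant entries that were selected, over the estimator's nonzeros.
    tpr = (true_nonzero & est_nonzero).sum() / n_nonzero if n_nonzero else float("nan")
    return fpr, tpr
```

With an estimator whose support coincides with that of $\Sigma_0$, this returns FPR = 0 and TPR = 1, provided the estimator has at least one zero entry.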

We then report average results over 100 runs. Timing (in seconds) was carried out on a 2.6GHz CPU desktop. The sparsity parameter $\lambda$ and the low-rank parameter $\tau$ are given in the corresponding simulations. $\Sigma^0$ is taken as the zero matrix. As mentioned before, the rank of the generated $\Sigma_0$ is known, i.e., $r(\Sigma_0) = \mathrm{ar}(\Sigma_0) = K$.

Table 2 shows the performance of our approach under different dimensions $p = 100, 200, 500, 1000$ and various numbers of blocks $K$. All the values of $\mathrm{ar}(\hat\Sigma)$ are close to the true rank $K$. The values of FPR and TPR are quite desirable, which shows that the selection performance of ADMM is very good. Moreover, the CPU time shows that the method runs relatively fast. In addition, as $K$ rises, for example with $p = 500$, the FPR decreases while the TPR increases, and the time spent by the method also increases. For small dimensions such as $p = 200$ and a large number of blocks such as $K = 50$, ADMM performs unstably because several simulations cannot be recovered.

Table 2: The performance of ADMM under different dimensions $p$ and blocks $K$. $n = 50$, $\mathrm{ar}(\Sigma_0) = K$, $\lambda = 0.25$, $\tau = 0.5$, $\Sigma^0 = 0$.
Columns: $p$, $\mathrm{ar}(\hat\Sigma)$, $\mathrm{sp}(\Sigma_0)$, $\mathrm{sp}(\hat\Sigma)$, FPR, TPR, Time; rows grouped by $K$.

To observe the performance of our proposed method under different parameters $\lambda$, $\tau$ and initial points $\Sigma^0$, we fix $n = 50$, $p = 200$ and $\mathrm{ar}(\Sigma_0) = K = 5$. In Table 3, the left columns of $\mathrm{ar}(\hat\Sigma)$, FPR, TPR and Time are generated from the initial point $\Sigma^0 = 0$, and the right columns are produced with $\Sigma^0 = \Sigma_{st}$. One can easily see that when $\lambda = 0.05$ and $\Sigma^0 = \Sigma_{st}$, the performance of the method is relatively poor regardless of the value of $\tau$, because $\mathrm{ar}(\hat\Sigma)$ and TPR are undesirable. By contrast, when $\lambda = 0.25$ and $0.50$, it behaves much better, particularly when $\tau = \ldots$ Moreover, as $\lambda$ increases, the rate FPR of significant variables that are unselected rises, even though the percentage TPR of significant variables that are selected increases as well.

Comparing the two initial points $\Sigma^0 = 0$ and $\Sigma_{st}$, as shown in the table, $\mathrm{ar}(\hat\Sigma)$, FPR, TPR and Time generated from $\Sigma^0 = 0$ are basically the same across parameters, which means that the method with the zero starting point is more stable. But when $\lambda = 0.25$ and $0.50$, ADMM with $\Sigma^0 = \Sigma_{st}$ generates a larger TPR than that from $\Sigma^0 = 0$; moreover, it needs less computational time in all simulations regardless of the parameters.

Table 3: The performance of ADMM under distinct parameters $\lambda$, $\tau$ and initial points $\Sigma^0$. $n = 50$, $p = 200$, $\mathrm{ar}(\Sigma_0) = K = 5$.
Columns: $\lambda$, $\tau$, $\mathrm{ar}(\hat\Sigma)$, FPR, TPR, Time.

4.2 Example II: Banded Structure

In this part we consider a population covariance matrix with banded structure, which has appeared in [1, 3, 18]. To be more exact, the population covariance matrix $\Sigma_0 = (\sigma_{0ij})_{1 \le i,j \le p} \in \mathbb{R}^{p \times p}$ is given by

$$\sigma_{0ij} = \max\Big\{ 1 - \frac{|i - j|}{10}, \ 0 \Big\} = \Big( 1 - \frac{|i - j|}{10} \Big)_+.$$

We first report average results over 100 replications and take the sparsity parameter $\lambda = 0.5$ and the low-rank parameter $\tau = 0.75$. Table 4 shows the performance of our approach under different dimensions $p = 100, 200, 500, 1000$ and two distinct starting points $\Sigma^0 = 0$ and $\Sigma_{st}$.

As can be seen in Table 4, compared with $\Sigma_0$, the values of $\mathrm{ar}(\hat\Sigma)$ and $\mathrm{sp}(\hat\Sigma)$ are relatively small, and the former increases while the latter decreases as $p$ rises. In addition, in Example 4.1 the block structured $\Sigma_0$ with $\mathrm{rank}(\Sigma_0) = \mathrm{ar}(\Sigma_0) = K$ leads to an estimator $\hat\Sigma$ whose rank is also close to $K$. By contrast, in this example $\mathrm{ar}(\Sigma_0)$ increases with the dimension $p$ and $\Sigma_0$ is not low-rank, but the recovered solution $\hat\Sigma$ has a relatively low rank. The values of FPR and TPR are both quite desirable, which shows that the selection performance of ADMM is very good in this example. Moreover, the CPU time shows that the method runs extremely fast as well. In addition, under the parameters $\lambda = 0.5$, $\tau = 0.75$, ADMM behaves nearly identically even though the starting points $\Sigma^0$ are different.
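The banded population covariance of this example can be generated with the short NumPy sketch below. It assumes the entries are $(1 - |i - j|/10)_+$, which is an assumed reading of the formula in the text; the function name and the `width` parameter are introduced here only for illustration.

```python
import numpy as np

def banded_covariance(p, width=10):
    """Sigma_0 with entries max(1 - |i - j| / width, 0)."""
    idx = np.arange(p)
    dist = np.abs(idx[:, None] - idx[None, :])   # |i - j| for all pairs
    return np.maximum(1.0 - dist / width, 0.0)

# Example: only entries with |i - j| < 10 are nonzero.
Sigma0 = banded_covariance(100)
band_sparsity = np.count_nonzero(Sigma0) / Sigma0.size
```

Each row has at most 19 nonzero entries, so the sparsity level `band_sparsity` shrinks as $p$ grows, while the matrix itself is not low-rank.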

Table 4: The performance of ADMM over 100 simulations under different dimensions $p$ and initial points $\Sigma^0$. $n = 50$, $\lambda = 0.5$, $\tau = 0.75$.
Columns: $p$, $\mathrm{ar}(\Sigma_0)$, $\mathrm{ar}(\hat\Sigma)$, $\mathrm{sp}(\Sigma_0)$, $\mathrm{sp}(\hat\Sigma)$, FPR, TPR, Time; rows grouped by $\Sigma^0 = 0$ and $\Sigma^0 = \Sigma_{st}$.

Figure 3: Covariance estimation with $\lambda = 0.5$, $\tau = 0.75$ and $p = 100$. In the row above, FPR = 0.062, TPR = 1, whilst FPR = 0.054, TPR = 1 in the row below.

From Figure 3, one can check that the recovered covariance matrices $\hat\Sigma$ depend strongly on the sample covariance matrix $\Sigma_n$, and that the selection performance is relatively good because FPR is quite small while TPR is close to 1.

To observe the behavior of ADMM under different parameters $\lambda$, $\tau$ and initial points $\Sigma^0$, we fix $n = 50$, $p = 200$ and $\mathrm{ar}(\Sigma_0) = 76$. In Table 5, the left columns of $\mathrm{ar}(\hat\Sigma)$, FPR, TPR and Time are generated from the initial point $\Sigma^0 = 0$, and the right columns are produced with $\Sigma^0 = \Sigma_{st}$. It is clear that the CPU time of the method with starting point $\Sigma_{st}$ is almost always less than that from zero initialization. As $\lambda$ rises, the TPR generated by the method with $\Sigma^0 = \Sigma_{st}$ increases to 1, while that with $\Sigma^0 = 0$ basically stabilizes at $\ldots$ In terms of $\mathrm{ar}(\hat\Sigma)$, the proposed method with starting point $\Sigma_{st}$ does not produce a lower rank solution than with $\Sigma^0 = 0$. The reason for this phenomenon is probably that $\Sigma_{st}$ is sparse but not low-rank; from (26), if $\Sigma_{st}$ is approximately positive semidefinite, the algorithm stops after a few iterations, so the solution is not the desired low-rank one.

Table 5: The performance of ADMM under distinct parameters $\lambda$, $\tau$ and initial points $\Sigma^0$. $n = 50$, $p = 200$, $\mathrm{ar}(\Sigma_0) = 76$.
Columns: $\lambda$, $\tau$, $\mathrm{ar}(\hat\Sigma)$, FPR, TPR, Time.

5 Conclusion

We have obtained a positive semidefinite estimator of the covariance matrix that is simultaneously sparse and low-rank by using $\ell_1$-norm and nuclear norm penalties. The theoretical properties show that the constructed estimator performs very well in high-dimensional settings. Meanwhile, the efficient ADMM with global convergence has several merits illustrated by the numerical simulations, such as low computational time and good recovery.

Acknowledgement

The work was supported in part by the National Basic Research Program of China (200CB73250, the National Natural Science Foundation of China (708, 72702,

References

[1] Bickel, P. and Levina, E.: Regularized estimation of large covariance matrices. Ann. Statist. 36 (2008).
[2] Bickel, P. and Levina, E.: Covariance regularization by thresholding. Ann. Statist. 36 (2008).
[3] Cai, T. and Liu, W.: Adaptive thresholding for sparse covariance matrix estimation. J. Amer. Statist. Assoc. 106 (2011).
[4] Cai, T., Zhang, C. and Zhou, H.: Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 38 (2010).
[5] Cai, T. T. and Yuan, M.: Adaptive covariance matrix estimation through block thresholding. Ann. Statist. 40 (2012).
[6] Cai, T. and Zhou, H.: Minimax estimation of large covariance matrices under $\ell_1$-norm. Statist. Sinica 22 (2012).
[7] Cai, T. and Zhou, H.: Optimal rates of convergence for sparse covariance matrix estimation. Ann. Statist. 40 (2012).
[8] El Karoui, N.: Operator norm consistent estimation of large dimensional sparse covariance matrices. Ann. Statist. 36 (2008).
[9] Friedman, J., Hastie, T. and Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostat. 9, 432 (2008).
[10] Furrer, R. and Bengtsson, T.: Estimation of high-dimensional prior and posterior covariance matrices in Kalman filter variants. J. Multivariate Anal. 98 (2007).
[11] Johnstone, I.: On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 (2001).
[12] Liu, H., Wang, L. and Zhao, T.: Sparse covariance matrix estimation with eigenvalue constraints. Journal of Computational and Graphical Statistics (2013), accepted.

[13] Oymak, S., Jalali, A., Fazel, M., Eldar, Y. C. and Hassibi, B.: Simultaneously structured models with application to sparse and low-rank matrices. arXiv preprint arXiv: (2012), to appear.
[14] Richard, E., Savalle, P. A. and Vayatis, N.: Estimation of simultaneously sparse and low rank matrices. arXiv preprint arXiv: (2012), to appear.
[15] Rothman, A.: Positive definite estimators of large covariance matrices. Biometrika 99 (2012).
[16] Rothman, A., Levina, E. and Zhu, J.: Generalized thresholding of large covariance matrices. J. Amer. Statist. Assoc. 104 (2009).
[17] Wu, W. and Pourahmadi, M.: Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika 90 (2003).
[18] Xue, L., Ma, S. and Zou, H.: Positive definite $\ell_1$ penalized estimation of large covariance matrices. J. Amer. Statist. Assoc. 107 (2012).
[19] Yuan, X. M.: Alternating direction method of multipliers for covariance selection models. Journal of Scientific Computing 51 (2012).

Appendix

Proof of Lemma 3.1 For convenience, we denote $g(\Gamma) := \tau \|\Gamma\|_* + I(\Gamma \succeq 0)$. Clearly, $g : \mathbb{R}^{p \times p} \to \mathbb{R} \cup \{+\infty\}$ is a convex function. Since $(\hat\Gamma, \hat\Sigma)$ is an optimal solution of (15), it satisfies the following KKT conditions:

$$\hat\Lambda \in \partial\big( \tau \|\hat\Gamma\|_* + I(\hat\Gamma \succeq 0) \big) = \partial g(\hat\Gamma), \tag{27}$$

$$\frac{1}{\lambda}\big( -\hat\Lambda - \hat\Sigma + \Sigma_n \big) \in \partial \|\hat\Sigma\|_1, \tag{28}$$

$$\hat\Sigma = \hat\Gamma. \tag{29}$$

Note that the optimality conditions for the first subproblem in ADMM, i.e., the subproblem with respect to $\Gamma$ in (16), are given by

$$\Lambda^k - \frac{1}{\mu}(\Gamma^{k+1} - \Sigma^k) \in \partial g(\Gamma^{k+1}). \tag{30}$$

This, together with (18), i.e., $\Lambda^k = \Lambda^{k+1} + \frac{1}{\mu}(\Gamma^{k+1} - \Sigma^{k+1})$, gives

$$\Lambda^{k+1} - \frac{1}{\mu}(\Sigma^{k+1} - \Sigma^k) \in \partial g(\Gamma^{k+1}). \tag{31}$$

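Combining (27) and (31) and using the fact that $\partial g(\cdot)$ is a monotone operator, we get

$$\Big\langle \hat\Gamma - \Gamma^{k+1}, \ \hat\Lambda - \Lambda^{k+1} + \frac{1}{\mu}(\Sigma^{k+1} - \Sigma^k) \Big\rangle \ge 0. \tag{32}$$

The optimality conditions for the second subproblem (i.e., the subproblem with respect to $\Sigma$ in (17)) are given by

$$\frac{1}{\lambda}\Big( -\Lambda^k - \Sigma^{k+1} + \Sigma_n - \frac{1}{\mu}(\Sigma^{k+1} - \Gamma^{k+1}) \Big) \in \partial \|\Sigma^{k+1}\|_1. \tag{33}$$

This, together with (18), i.e., $\Lambda^k = \Lambda^{k+1} + \frac{1}{\mu}(\Gamma^{k+1} - \Sigma^{k+1})$, gives

$$\frac{1}{\lambda}\big( -\Lambda^{k+1} - \Sigma^{k+1} + \Sigma_n \big) \in \partial \|\Sigma^{k+1}\|_1. \tag{34}$$

Similarly, combining (28) and (34) and using the fact that $\partial \|\cdot\|_1$ is a monotone operator, we get

$$\big\langle \hat\Sigma - \Sigma^{k+1}, \ -\hat\Lambda + \Lambda^{k+1} - (\hat\Sigma - \Sigma^{k+1}) \big\rangle \ge 0. \tag{35}$$

The summation of (32) and (35) gives

$$\begin{aligned}
\|\hat\Sigma - \Sigma^{k+1}\|_F^2 &\le \langle \hat\Sigma - \Sigma^{k+1}, -\hat\Lambda + \Lambda^{k+1} \rangle + \langle \hat\Gamma - \Gamma^{k+1}, \hat\Lambda - \Lambda^{k+1} \rangle + \frac{1}{\mu} \langle \hat\Gamma - \Gamma^{k+1}, \Sigma^{k+1} - \Sigma^k \rangle \\
&= \langle \hat\Sigma - \Sigma^{k+1}, -\hat\Lambda + \Lambda^{k+1} \rangle + \langle \hat\Sigma - \Sigma^{k+1} + \mu(\Lambda^{k+1} - \Lambda^k), \hat\Lambda - \Lambda^{k+1} \rangle \\
&\quad + \frac{1}{\mu} \langle \hat\Sigma - \Sigma^{k+1} + \mu(\Lambda^{k+1} - \Lambda^k), \Sigma^{k+1} - \Sigma^k \rangle,
\end{aligned} \tag{36}$$

where the equality holds because $\Gamma^{k+1} = \Sigma^{k+1} - \mu(\Lambda^{k+1} - \Lambda^k)$ and $\hat\Sigma = \hat\Gamma$. Simple algebraic derivation from (36) yields the following inequality:

$$\|\hat\Sigma - \Sigma^{k+1}\|_F^2 - \langle \Lambda^{k+1} - \Lambda^k, \Sigma^{k+1} - \Sigma^k \rangle \le \mu \langle \hat\Lambda - \Lambda^{k+1}, \Lambda^{k+1} - \Lambda^k \rangle + \frac{1}{\mu} \langle \hat\Sigma - \Sigma^{k+1}, \Sigma^{k+1} - \Sigma^k \rangle. \tag{37}$$

Rearranging the right-hand side of (37) using $\hat\Lambda - \Lambda^{k+1} = \hat\Lambda - \Lambda^k + \Lambda^k - \Lambda^{k+1}$ and $\hat\Sigma - \Sigma^{k+1} = \hat\Sigma - \Sigma^k + \Sigma^k - \Sigma^{k+1}$, (37) can be reduced to

$$\begin{aligned}
\mu \langle \hat\Lambda - \Lambda^k, \Lambda^{k+1} - \Lambda^k \rangle + \frac{1}{\mu} \langle \hat\Sigma - \Sigma^k, \Sigma^{k+1} - \Sigma^k \rangle &\ge \mu \|\Lambda^{k+1} - \Lambda^k\|_F^2 + \frac{1}{\mu} \|\Sigma^{k+1} - \Sigma^k\|_F^2 \\
&\quad + \|\hat\Sigma - \Sigma^{k+1}\|_F^2 - \langle \Lambda^{k+1} - \Lambda^k, \Sigma^{k+1} - \Sigma^k \rangle.
\end{aligned}$$

Using the notation of $V^k$ and $\hat V$, the inequality above can be rewritten as

$$\langle \hat V - V^k, V^{k+1} - V^k \rangle_H \ge \|V^{k+1} - V^k\|_H^2 + \|\hat\Sigma - \Sigma^{k+1}\|_F^2 - \langle \Lambda^{k+1} - \Lambda^k, \Sigma^{k+1} - \Sigma^k \rangle. \tag{38}$$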

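Combining (38) with the identity

$$\|\hat V - V^{k+1}\|_H^2 = \|V^{k+1} - V^k\|_H^2 - 2 \langle \hat V - V^k, V^{k+1} - V^k \rangle_H + \|\hat V - V^k\|_H^2,$$

we get

$$\begin{aligned}
\|\hat V - V^k\|_H^2 - \|\hat V - V^{k+1}\|_H^2 &= 2 \langle \hat V - V^k, V^{k+1} - V^k \rangle_H - \|V^{k+1} - V^k\|_H^2 \\
&\ge 2 \|V^{k+1} - V^k\|_H^2 + 2 \|\hat\Sigma - \Sigma^{k+1}\|_F^2 - 2 \langle \Sigma^{k+1} - \Sigma^k, \Lambda^{k+1} - \Lambda^k \rangle - \|V^{k+1} - V^k\|_H^2 \\
&= \|V^{k+1} - V^k\|_H^2 + 2 \|\hat\Sigma - \Sigma^{k+1}\|_F^2 - 2 \langle \Sigma^{k+1} - \Sigma^k, \Lambda^{k+1} - \Lambda^k \rangle.
\end{aligned} \tag{39}$$

Now, using (34) for $k$ instead of $k + 1$, we get

$$\frac{1}{\lambda}\big( -\Lambda^k - \Sigma^k + \Sigma_n \big) \in \partial \|\Sigma^k\|_1. \tag{40}$$

Combining (34) and (40) and using the fact that $\partial \|\cdot\|_1$ is a monotone operator, we obtain

$$\big\langle \Sigma^{k+1} - \Sigma^k, \ -\Lambda^{k+1} + \Lambda^k - (\Sigma^{k+1} - \Sigma^k) \big\rangle \ge 0,$$

which means

$$-\langle \Sigma^{k+1} - \Sigma^k, \Lambda^{k+1} - \Lambda^k \rangle \ge \|\Sigma^{k+1} - \Sigma^k\|_F^2 \ge 0. \tag{41}$$

By substituting (41) into (39), we get the desired result (25).

Proof of Theorem 3.2 From Lemma 3.1, we get that

(a) $\|V^{k+1} - V^k\|_H \to 0$;
(b) $\{V^k\}$ lies in a compact region;
(c) $\|V^{k+1} - \hat V\|_H^2$ is monotonically non-increasing and thus converges.

From the definitions of $H$ and $V$ together with (a), it holds that $\Lambda^{k+1} - \Lambda^k \to 0$ and $\Sigma^{k+1} - \Sigma^k \to 0$, which together with (18) imply that $\Gamma^{k+1} - \Gamma^k \to 0$ and $\Sigma^k - \Gamma^k \to 0$. From (b), $\{V^k\}$ has a subsequence $\{V^{k_j}\}$ that converges to $\check V = (\check\Lambda, \check\Sigma)$, i.e., $\Lambda^{k_j} \to \check\Lambda$ and $\Sigma^{k_j} \to \check\Sigma$. We also have $\Gamma^{k_j} \to \check\Gamma \,(:= \check\Sigma)$ from $\Sigma^k - \Gamma^k \to 0$. Therefore, $(\check\Gamma, \check\Sigma, \check\Lambda)$ is a limit point of $\{(\Gamma^k, \Sigma^k, \Lambda^k)\}$. Note that (30) and (33) respectively imply that

$$\check\Lambda - \frac{1}{\mu}(\check\Gamma - \check\Sigma) \in \partial g(\check\Gamma), \qquad \frac{1}{\lambda}\Big( -\check\Lambda - \check\Sigma + \Sigma_n - \frac{1}{\mu}(\check\Sigma - \check\Gamma) \Big) \in \partial \|\check\Sigma\|_1.$$

Together with $\check\Gamma = \check\Sigma$, these reduce to

$$\check\Lambda \in \partial g(\check\Gamma), \tag{42}$$

$$\frac{1}{\lambda}\big( -\check\Lambda - \check\Sigma + \Sigma_n \big) \in \partial \|\check\Sigma\|_1. \tag{43}$$

Conditions (42), (43) and $\check\Gamma = \check\Sigma$ mean that $(\check\Gamma, \check\Sigma, \check\Lambda)$ is an optimal solution to (15). Therefore, any limit point of $\{(\Gamma^k, \Sigma^k, \Lambda^k)\}$ is an optimal solution to (15).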
