FUSED MULTIPLE GRAPHICAL LASSO

Size: px

Start display at page:

Download "FUSED MULTIPLE GRAPHICAL LASSO"

Harvey Berry
5 years ago
Views:

1 FUSED MULTIPLE GRAPHICAL LASSO SEN YANG, ZHAOSONG LU, XIAOTONG SHEN, PETER WONKA, JIEPING YE Abstract. In this paper, we consider the probem of estimating mutipe graphica modes simutaneousy using the fused asso penaty, which encourages adjacent graphs to share simiar structures. A motivating exampe is the anaysis of brain networks of Azheimer s disease using neuroimaging data. Specificay, we may wish to estimate a brain network for the norma contros (NC), a brain network for the patients with mid cognitive impairment (MCI), and a brain network for Azheimer s patients (AD). We expect the two brain networks for NC and MCI to share common structures but not to be identica to each other; simiary for the two brain networks for MCI and AD. The proposed formuation can be soved using a second-order method. Our key technica contribution is to estabish the necessary and sufficient condition for the graphs to be decomposabe. Based on this key property, a simpe screening rue is presented, which decomposes the arge graphs into sma subgraphs and aows an efficient estimation of mutipe independent (sma) subgraphs, dramaticay reducing the computationa cost. We perform experiments on both synthetic and rea data; our resuts demonstrate the effectiveness and efficiency of the proposed approach. Key words. Fused mutipe graphica asso, screening, second-order method AMS subject cassifications. 90C22, 90C25, 90C47, 65K05, 62J10 1. Introduction. Undirected graphica modes expore the reationships among a set of random variabes through their joint distribution. The estimation of undirected graphica modes has appications in many domains, such as computer vision, bioogy, and medicine [11, 17, 44]. One instance is the anaysis of gene expression data. As shown in many bioogica studies, genes tend to work in groups based on their bioogica functions, and there exist some reguatory reationships between genes [5]. Such bioogica knowedge can be represented as a graph, where nodes are the genes, and edges describe the reguatory reationships. Graphica modes provide a usefu too for modeing these reationships, and can be used to expore gene activities. One of the most widey used graphica modes is the Gaussian graphica mode (GGM), which assumes the variabes to be Gaussian distributed [2, 47]. In the framework of GGM, the probem of earning a graph is equivaent to estimating the inverse of the covariance matrix (precision matrix), since the nonzero off-diagona eements of the precision matrix represent edges in the graph [2, 47]. In recent years many research efforts have focused on estimating the precision matrix and the corresponding graphica mode (see, for exampe [2, 10, 16, 17, 22, 23, 25, 26, 29, 30, 33, 47]. Meinshausen and Bühmann [30] estimated edges for each node in the graph by fitting a asso probem [36] using the remaining variabes as predictors. Yuan and Lin [47] and Banerjee et a. [2] proposed a penaized maximum ikeihood mode using 1 reguarization to estimate the sparse precision matrix. Numerous methods have been deveoped for soving this mode. For exampe, d Aspremont et a. [8] and Lu [25, 26] studied Nesterov s smooth gradient methods [32] for soving this probem or its dua. Banerjee et a. [2] and Friedman et a. [10] proposed bock coordinate ascent methods for soving the dua probem. The atter method [10] is Schoo of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, Arizona 85287, USA (senyang@asu.edu, peter.wonka@asu.edu, jieping.ye@asu.edu). Department of Mathematics, Simon Frasor University, Burnaby, BC, V5A 156, Canada (zhaosong@sfu.ca). Schoo of Statistics, University of Minnesota, Minneapois, Minnesota 55455, USA (xshen@umn.edu). 1

2 2 FUSED MULTIPLE GRAPHICAL LASSO widey referred to as Graphica asso (GLasso). Mazumder and Hastie [29] proposed a new agorithm caed DP-GLasso, each step of which is a box-constrained QP probem. Scheinberg and Rish [35] proposed a coordinate descent method for soving this mode in a greedy approach. Yuan [48] and Scheinberg et a. [34] appied aternating direction method of mutipiers (ADMM) [4] to this probem. Li and Toh [22] and Yuan and Lin [47] proposed to sove this probem using interior point methods. Wang et a. [40], Hsieh et a. [16], Osen et a. [33], and Dinh et a. [9] studied Newton method for soving this mode. The main chaenge of estimating a sparse precision matrix for the probems with a arge number of nodes (variabes) is its intensive computation. Witten et a. [42] and Mazumder and Hastie [28] independenty derived a necessary and sufficient condition for the soution of a singe graphica asso to be bock diagona (subject to some rearrangement of variabes). This can be used as a simpe screening test to identify the associated bocks, and the origina probem can thus be decomposed into a group of smaer sized but independent probems corresponding to these bocks. When the number of bocks is arge, it can achieve massive computationa gain. However, these formuations assume that observations are independenty drawn from a singe Gaussian distribution. In many appications the observations may be drawn from mutipe Gaussian distributions; in this case, mutipe graphica modes need to be estimated. There are some recent works on the estimation of mutipe precision matrices [7, 11, 12, 13, 19, 20, 31, 49]. Guo et a. [11] proposed a method to jointy estimate mutipe graphica modes using a hierarchica penaty. However, their mode is not convex. Honorio and Samaras [13] proposed a convex formuation to estimate mutipe graphica modes using the 1, reguarizer. Hara and Washio [12] introduced a method to earn common substructures among mutipe graphica modes. Danaher et a. [7] estimated mutipe precision matrices simutaneousy using a pairwise fused penaty and grouping penaty. ADMM was used to sove the probem, but it requires computing mutipe eigen decompositions at each iteration. Mohan et a. [31] proposed to estimate mutipe precision matrices based on the assumption that the network differences are generated from node perturbations. Compared with singe graphica mode earning, earning mutipe precision matrices jointy is even more chaenging to sove. Recenty, a necessary and sufficient condition for mutipe graphs to be decomposabe was proposed in [7]. However, such necessary and sufficient condition was restricted to two graphs ony when the fused penaty is used. It is not cear whether this screening rue can be extended to the more genera case with more than two graphs, which is the case in brain network modeing. There are two types of fused penaties that can be used for estimating mutipe (more than two) graphs: (a) pairwise fused or (b) sequentia fused [37]. In this paper we set out to address the sequentia fused case first, because we work on practica appications that can be more appropriatey formuated using the sequentia formuation. Specificay, we consider the probem of estimating mutipe graphica modes by maximizing a penaized og ikeihood with 1 and sequentia fused reguarization. The 1 reguarization yieds a sparse soution, and the fused reguarization encourages adjacent graphs to be simiar. The graphs considered in this paper have a natura order, which is common in many appications. A motivating exampe is the modeing of brain networks for Azheimer s disease using neuroimaging data such as Positron emission tomography (PET). In this case, we want to estimate graphica modes for three groups: norma contros (NC), patients of mid cognitive impairment (MCI), and Azheimer s patients (AD). These networks are expected to share some common con-

3 S. YANG, Z. LU, X. SHEN, P. WONKA, AND J. YE 3 nections, but they are not identica. Furthermore, the networks are expected to evove over time, in the order of disease progression from NC to MCI to AD. Estimating the graphica modes separatey fais to expoit the common structures among them. It is thus desirabe to jointy estimate the three networks (graphs). Our key technica contribution is to estabish the necessary and sufficient condition for the soution of the fused mutipe graphica asso (FMGL) to be bock diagona. The duaity theory and severa other toos in inear programming are used to drive the necessary and sufficient condition. Based on this crucia property of FMGL, we deveop a screening rue which enabes the efficient estimation of arge mutipe precision matrices for FMGL. The proposed screening rue can be combined with any agorithms to reduce computationa cost. We empoy a second-order method [16, 21, 38] to sove the fused mutipe graphica asso, where each step is soved by the spectra projected gradient method [27, 43]. In addition, we propose a shrinking scheme to identify the variabes to be updated in each step of the second-order method, which reduces the computation cost of each step. We conduct experiments on both synthetic and rea data; our resuts demonstrate the effectiveness and efficiency of the proposed approach Notation. In this paper, R stands for the set of a rea numbers, R n denotes the n-dimensiona Eucidean space, and the set of a m n matrices with rea entries is denoted by R m n. A matrices are presented in bod format. The space of symmetric matrices is denoted by S n. If X S n is positive semidefinite (resp. definite), we write X 0 (resp. X 0). Aso, we write X Y to mean X Y 0. The cone of positive semidefinite matrices in S n is denoted by S+. n Given matrices X and Y in R m n, the standard inner product is defined by X, Y := tr(xy T ), where tr( ) denotes the trace of a matrix. X Y and X Y means the Hadamard and Kronecker product of X and Y, respectivey. We denote the identity matrix by I, whose dimension shoud be cear from the context. The determinant and the minima eigenvaue of a rea symmetric matrix X are denoted by det(x) and λ min (X), respectivey. Given a matrix X R n n, diag(x) denotes the vector formed by the diagona of X, that is, diag(x) i = X ii for i = 1,..., n. Diag(X) is the diagona matrix which shares the same diagona as X. vec(x) is the vectorization of X. In addition, X > 0 means that a entries of X are positive. The rest of the paper is organized as foows. We introduce the fused mutipe graphica asso formuation in Section 2. The screening rue is presented in Section 3. The proposed second-order method is presented in Section 4. The experimenta resuts are shown in Section 5. We concude the paper in Section Fused mutipe graphica asso. Assume we are given K data sets, x (k) R nk p, k = 1,..., K with K 2, where n k is the number of sampes, and p is the number of features. The p features are common for a K data sets, and a K n k sampes are independent. Furthermore, the sampes within each data set x (k) are identicay distributed with a p-variate Gaussian distribution with zero mean and positive definite covariance matrix Σ (k), and there are many conditionay independent pairs of features, i.e., the precision matrix Θ (k) = (Σ (k) ) 1 shoud be sparse. For notationa simpicity, we assume that n 1 =... = n K = n. Denote the sampe covariance matrix for each data set x (k) as S (k) with S (k) = 1 n (x(k) ) T x (k), and Θ = (Θ (1),..., Θ (K) ). Then the negative og ikeihood for the data takes the form of (2.1) K ( ) og det(θ (k) ) + tr(s (k) Θ (k) ).

4 4 FUSED MULTIPLE GRAPHICAL LASSO Ceary, minimizing (2.1) eads to the maximum ikeihood estimate (MLE) ˆΘ (k) = (S (k) ) 1. However, the MLE fais when S (k) is singuar. Furthermore, the MLE is usuay dense. The 1 reguarization has been empoyed to induce sparsity, resuting in the sparse inverse covariance estimation [2, 10, 46]. In this paper, we empoy both the 1 reguarization and the fused reguarization for simutaneousy estimating mutipe graphs. The 1 reguarization eads to a sparse soution, and the fused penaty encourages Θ (k) to be simiar to its neighbors. Mathematicay, we sove the foowing formuation: (2.2) where min K Θ (k) 0,...K P (Θ) = λ 1 K i j ( ) og det(θ (k) ) + tr(s (k) Θ (k) ) + P (Θ), Θ (k) + λ 2 K 1 i j Θ (k) Θ (k+1), λ 1 > 0 and λ 2 > 0 are positive reguarization parameters. This mode is referred to as the fused mutipe graphica asso (FMGL). To ensure the existence of a soution for probem (2.2), we assume throughout this paper that diag(s (k) ) > 0, k = 1,..., K. Reca that S (k) is a sampe covariance matrix, and hence diag(s (k) ) 0. The diagona entries may be not, however, stricty positive. But we can aways add a sma perturbation (say 10 8 ) to ensure the above assumption hods. The foowing theorem shows that under this assumption the FMGL (2.2) has a unique soution. Theorem 2.1. Under the assumption that diag(s (k) ) > 0, k = 1,..., K, probem (2.2) has a unique optima soution. To prove Theorem 2.1, we first estabish a technica emma which regards the existence of a soution for a standard graphica asso probem. Lemma 2.2. Let S S p + and Λ S p be such that Diag(S) + Λ > 0 and diag(λ) 0. Consider the probem (2.3) min og det(x) + tr(sx) + Λ X. X 0 }{{} f(x) Then the foowing statements hod: (a) Probem (2.3) has a unique optima soution; (b) The sub-eve set L = {X 0 : f(x) α} is compact for any α f, where f is the optima vaue of (2.3). Proof. (a) Let U = {U S p : U [ 1, 1], i, j}. Consider the probem (2.4) max {og det(s + Λ U) : S + Λ U 0}. U U We first caim that the feasibe region of probem (2.4) is nonempty, or equivaenty, there exists Ū U such that λ min(s + Λ Ū) > 0. Indeed, one can observe that max λ min(s + Λ U) = max {t : Λ U + S ti 0}, U U t,u U

5 S. YANG, Z. LU, X. SHEN, P. WONKA, AND J. YE 5 (2.5) = min max {t + tr(x(λ U + S ti))}, X 0 t,u U = min X 0 tr(sx) + Λ X : tr(x) = 1, where the second equaity foows from the Lagrangian duaity since its associated Sater condition is satisfied. Let Ω := {X S p : tr(x) = 1, X 0}. By the assumption Diag(S) + Λ > 0, we see that Λ > 0 for a i j and S ii + Λ ii > 0 for every i. Since Ω S p +, we have tr(sx) 0 for a X Ω. If there exists some k such that X k > 0, then i j Λ X > 0 and hence, (2.6) tr(sx) + Λ X > 0, X Ω. Otherwise, one has X = 0 for a i j, which, together with the facts that S ii +Λ ii > 0 for a i and tr(x) = 1, impies that for a X Ω, tr(sx) + Λ X = i (S ii + Λ ii )X ii tr(x) min i (S ii + Λ ii ) > 0. Hence, (2.6) again hods. Combining (2.5) with (2.6), one can see that max λ min(s + U U Λ U) > 0. Therefore, probem (2.4) has at east a feasibe soution. We next show that probem (2.4) has an optima soution. Let Ū be a feasibe point of (2.4), and Ω := {U U : og det(s + Λ U) og det(s + Λ Ū), S + Λ U 0}. One can observe that {S + Λ U : U U} is compact. Using this fact, it is not hard to see that og det(s + Λ U) as U U and λ min (S + Λ U) 0. Thus there exists some δ > 0 such that which impies that Ω {U U : S + Λ U δi}, Ω = {U U : og det(s + Λ U) og det(s + Λ Ū), S + Λ U δi}. Hence, Ω is a compact set. In addition, one can observe that probem (2.4) is equivaent to max og det(s + Λ U). U Ω The atter probem ceary has an optima soution and so is probem (2.4). Finay we show that X = (S+Λ U ) 1 is the unique optima soution of (2.3), where U is an optima soution of (2.4). Since S + Λ U 0, we have X 0. By the definitions of U and X, and the first-order optimaity conditions of (2.4) at U, one can have U = 1 if X > 0; β [ 1, 1] if X = 0; 1 otherwise.

6 6 FUSED MULTIPLE GRAPHICAL LASSO It foows that Λ U ( Λ X ) at X = X, where ( ) stands for the subdifferentia of the associated convex function. For convenience, et f(x) denote the objective function of (2.3). Then we have (X ) 1 + S + Λ U f(x ), which, together with X = (S + Λ U ) 1, impies that 0 f(x ). Hence, X is an optima soution of (2.3) and moreover it is unique due to the strict convexity of og det( ). (b) By statement (a), probem (2.3) has a finite optima vaue f. Hence, the above sub-eve set L is nonempty. We can observe that for any X L, 1 Λ X = f(x) [ og det(x) + tr(sx) + 1 Λ X ], 2 2 }{{} f(x) (2.7) α f, where f := inf{f(x) : X 0}. By the assumption Diag(S) + Λ > 0, one has Diag(S) + Λ/2 > 0. This together with statement (a) yieds f R. Notice that Λ > 0 for a i j. This reation and (2.7) impy that X is bounded for a X L and i j. In addition, it is we-known that det(x) X 11 X 22 X pp for a X 0. Using this reation, the definition of f( ), and the boundedness of X for a X L and i j, we have that for every X L, og(x ii ) + (S ii + Λ ii )X ii f(x) (S X + Λ X ), i j (2.8) i α i j (S X + Λ X ) δ for some δ > 0. In addition, notice from the assumption that S ii + Λ ii > 0 for a i, and hence og(x ii ) + (S ii + Λ ii )X ii 1 + min og(s kk + Λ kk ) =: σ k for a i. This reation together with (2.8) impies that for every X L and a i, og(x ii ) + (S ii + Λ ii )X ii δ (p 1)σ, and hence X ii is bounded for a i and X L. We thus concude that L is bounded. In view of this resut and the definition of f, it is not hard to see that there exists some ν > 0 such that λ min (X) ν for a X L. Hence, one has L = {X νi : f(x) α}. By the continuity of f on {X : X νi}, it foows that L is cosed. Hence, L is compact. We are now ready to prove Theorem 2.1. Proof. Since λ 1 > 0 and diag(s (k) ) > 0, k = 1,..., K, it foows from Lemma 2.2 that there exists some δ such that for each k = 1,..., K, og det(θ (k) ) + tr(s (k) Θ (k) ) + λ 1 Θ (k) δ, Θ(k) 0. i j

7 S. YANG, Z. LU, X. SHEN, P. WONKA, AND J. YE 7 For convenience, et h(θ) denote the objective function of (2.2) and Θ = ( Θ (1),..., Θ (K) ) an arbitrary feasibe point of (2.2). Let Ω = { Θ = (Θ (1),..., Θ (K) ) : h(θ) h( Θ), Θ (k) 0, k = 1,..., K }, Ω k = {Θ (k) 0 : og det(θ (k) ) + tr(s (k) Θ (k) ) + λ 1 i j Θ(k) δ } for k = 1,..., K, where δ = h( Θ) (K 1)δ. Then it is not hard to observe that Ω Ω := Ω 1 Ω K. Moreover, probem (2.2) is equivaent to (2.9) min h(θ). Θ Ω In view of Lemma 2.2, we know that Ω k is compact for a k, which impies that Ω is aso compact. Notice that h is continuous and stricty convex on Ω. Hence, probem (2.9) has a unique optima soution and so is probem (2.2). 3. The screening rue for fused mutipe graphica asso. Due to the presence of the og determinant, it is chaenging to sove the formuations invoving the penaized og-ikeihood efficienty. The existing methods for singe graphica asso are not scaabe to the probems with a arge amount of features because of the high computationa compexity. Recent studies have shown that the graphica mode may contain many connected components, which are disjoint with each other, due to the sparsity of the graphica mode, i.e., the corresponding precision matrix has a bock diagona structure (subject to some rearrangement of features). To reduce the computationa compexity, it is advantageous to first identify the bock structure and then compute the diagona bocks of the precision matrix instead of the whoe matrix. Danaher et a. [7] deveoped a simiar necessary and sufficient condition for fused graphica asso with two graphs, thus the bock structure can be identified. However, it remains a chaenge to derive the necessary and sufficient condition for the soution of fused mutipe graphica asso to be bock diagona for K > 2 graphs. In this section, we first present a theorem demonstrating that FMGL can be decomposabe once its soution has a bock diagona structure. Then we derive a necessary and sufficient condition for the soution of FMGL to be bock diagona for arbitrary number of graphs. Let C 1,..., C L be a partition of the p features into L non-overapping sets, with C C =, and L =1 C = {1,..., p}. We say that the soution Θ of FMGL (2.2) is bock diagona with L known bocks consisting of features in the sets C, = 1,..., L if there exists a permutation matrix U R p p such that each estimation precision matrix takes the form of (3.1) Θ (k) = U Θ (k) 1... Θ (k) L UT, k = 1,..., K. For simpicity of presentation, we assume throughout this paper that U = I. The foowing decomposition resut for probem (2.2) is straightforward. Its proof is thus omitted. Theorem 3.1. Suppose that the soution Θ of FMGL (2.2) is bock diagona with L known C, = 1,..., L, i.e., each estimated precision matrix has the form (3.1)

8 8 FUSED MULTIPLE GRAPHICAL LASSO with U = I. Let Θ = ( (3.2) Θ = arg min Θ 0 (1) Θ,..., K ( (K) Θ ) for = 1,..., L. Then there hods: og det(θ (k) ) + tr(s (k) Θ (k) ) ) + P (Θ ), = 1,..., L, where Θ (k) and S (k) are the C C symmetric submatrices of Θ (k) and S (k) corresponding to the -th diagona bock, respectivey, for k = 1,..., K, and Θ = (Θ (1),..., Θ (K) ) for = 1,..., L. The above theorem demonstrates that if a arge-scae FMGL probem has a bock diagona soution, it can then be decomposed into a group of smaer sized FMGL probems. The computationa cost for the atter probems can be much cheaper. Now one natura question is how to efficienty identify the bock diagona structure of the FMGL soution before soving the probem. We address this question in the remaining part of this section. The foowing theorem provides a necessary and sufficient condition for the soution of FMGL to be bock diagona with L bocks C, = 1,..., L, which is a key for deveoping efficient decomposition scheme for soving FMGL. Since its proof requires some substantia deveopment of other technica resuts, we sha postpone the proof unti the end of this section. Theorem 3.2. The FMGL (2.2) has a bock diagona soution Θ (k), k = 1,..., K with L known bocks C, = 1,..., L if and ony if S (k), k = 1,..., K satisfy the foowing inequaities: t S(k) tλ 1 + λ 2, (3.3) t 1 k=0 S(r+k) tλ 1 + 2λ 2, 2 r K t, t S(K t+k) tλ 1 + λ 2, K S(k) Kλ 1 for t = 1,..., K 1, i C, j C,. One immediate consequence of Theorem 3.2 is that the conditions (3.3) can be used as a screening rue to identify the bock diagona structure of the FMGL soution. The steps about this rue are described as foows. 1. Construct an adjacency matrix E = I p p. Set E = E ji = 0 if S (k), k = 1,..., K satisfy the conditions (3.3). Otherwise, set E = E ji = Identify the connected components of the adjacency matrix E (for exampe, it can be done by caing the Matab function graphconncomp ). In view of Theorem 3.2, it is not hard to observe that the resuting connected components are the partition of the p features into nonoverapping sets. It then foows from Theorem 3.1 that a arge-scae FMGL probem can be decomposed into a group of smaer sized FMGL probems restricted to the features in each connected component. The computationa cost for the atter probems can be much cheaper. Therefore, this approach may enabe us to sove arge-scae FMGL probems very efficienty. In the remainder of this section we provide a proof for Theorem 3.2. proceeding, we estabish severa technica emmas as foows. Before

9 S. YANG, Z. LU, X. SHEN, P. WONKA, AND J. YE 9 Lemma 3.3. Given any two arbitrary index sets I {1,, n} and J {1,, n 1}, et Ī and J be the compement of I and J with respect to {1,, n} and {1,, n 1}, respectivey. Define (3.4) P I,J = { y R n : y I 0, y Ī 0, y J y J+1 0, y J y J+1 0 }, where J + 1 = {j + 1 : j J} and J + 1 = {j + 1 : j J}. Then, the foowing statements hod: (i) Either P I,J = {0} or P I,J is unbounded; (ii) 0 is the unique extreme point of P I,J ; (iii) Suppose that P I,J is unbounded. Then, = ext(p I,J ) Q, where ext(p I,J ) denotes the set of a extreme rays of P I,J, and (3.5) Q := {α(0,, 0, 1,, 1, 0,, 0) T R n : α 0, m 0, 1 n}. }{{}}{{} m Proof. (i) We observe that 0 P I,J. If P I,J {0}, then there exists 0 y P I,J. Hence, {αy : α 0} P I,J, which impies that P I,J is unbounded. (ii) It is easy to see that 0 P I,J and moreover there exist n ineary independent active inequaities at 0. Hence, 0 is an extreme point of P I,J. On the other hand, suppose y is an arbitrary extreme point of P I,J. Then there exist n ineary independent active inequaities at y, which together with the definition of P I,J immediatey impies y = 0. Therefore, 0 is the unique extreme point of P I,J. (iii) Suppose that P I,J is unbounded. By statement (ii), we know that P I,J has a unique extreme point. Using Minkowski s resoution theorem (e.g., see [3]), we concude that ext(p I,J ). Let d ext(p I,J ) be arbitrariy chosen. Then d 0. It foows from (3.4) that d satisfies the inequaities (3.6) d I 0, d Ī 0, d J d J+1 0, d J d J+1 0, and moreover, the number of independent active inequaities at d is n 1. If a entries of d are nonzero, then d must satisfy d J d J+1 = 0 and d J d J+1 = 0 (with a tota number n 1), which impies d 1 = d 2 = = d n and thus d Q. We now assume that d has at east one zero entry. Then, there exist positive integers k, {m i } k i=1 and {n i } k i=1 satisfying m i n i < m i+1 n i+1 for i = 1,..., k 1 such that (3.7) {i : d i = 0} = {m 1,, n 1 } {m 2,, n 2 } {m k,, n k }. One can immediatey observe that (3.8) d mi = = d ni = 0, d j d j+1 = 0, m i j n i 1, 1 i k. We next divide the rest of proof into four cases. Case (a): m 1 = 1 and n k = n. In view of (3.7), one can observe that d mi 1 d mi 0 and d ni 1 d ni for i = 2,..., k. We then see from (3.6) that except the active inequaities given in (3.8), a other possibe active inequaities at d are (3.9) d j d j+1 = 0, n i 1 < j < m i 1, 2 i k (with a tota number k i=2 (m i n i 1 2)). Notice that the tota number of independent active inequaities given in (3.8) is k i=1 (n i m i + 1). Hence, the number

10 10 FUSED MULTIPLE GRAPHICAL LASSO of independent active inequaities at d is at most k (n i m i + 1) + i=1 k (m i n i 1 2) = n k m 1 k + 2 = n k + 1. i=2 Reca that the number of independent active inequaities at d is n 1. Hence, we have n k + 1 n 1, which impies k 2. Due to d 0, we observe that k 1 hods for this case. Aso, we know that k > 0. Hence, k = 2. We then see that a possibe active inequaities described in (3.9) must be active at d, which together with k = 2 immediatey impies that d Q. Case (b): m 1 = 1 and n k < n. Using (3.7), we observe that d mi 1 d mi 0 for i = 2,..., k and d ni d ni +1 0 for i = 1,..., k. In view of these reations and a simiar argument as in case (a), one can see that the number of independent active inequaities at d is at most k (n i m i + 1) + i=1 k (m i n i 1 2) + n n k 1 = n m 1 k + 1 = n k. i=2 Simiary as in case (a), we can concude from the above reation that k = 1 and d Q. Case (c): m 1 > 1 and n k = n. By (3.7), one can observe that d mi 1 d mi 0 for i = 1,..., k and d ni d ni+1 0 for i = 1,..., k 1. Using these reations and a simiar argument as in case (a), we see that the number of independent active inequaities at d is at most m k (n i m i + 1) + i=1 k (m i n i 1 2) = n k k = n k. i=2 Simiary as in case (a), we can concude from the above reation that k = 1 and d Q. Case (d): m 1 > 1 and n k < n. From (3.7), one can observe that d mi 1 d mi 0 for i = 1,..., k and d ni d ni +1 0 for i = 1,..., k. By virtue of these reations and a simiar argument as in case (a), one can see that the number of independent active inequaities at d is at most m k (n i m i + 1) + i=1 k (m i n i 1 2) + n n k 1 = n k 1. i=2 Reca that k 1 and the number of independent active inequaities at d is n 1. Hence, this case cannot occur. Combining the above four cases, we concude that ext(p I,J ) Q. Lemma 3.4. Let P IJ and Q be defined in (3.4) and (3.5), respectivey. Then, {ext(p I,J ) : I {1,, n}, J {1,, n 1}} = Q. Proof. It foows from Lemma 3.3 (iii) that {ext(p I,J ) : I {1,, n}, J {1,, n 1}} Q.

11 We next show that S. YANG, Z. LU, X. SHEN, P. WONKA, AND J. YE 11 {ext(p I,J ) : I {1,, n}, J {1,, n 1}} Q. Indeed, et d Q be arbitrariy chosen. Then, there exist α 0 and positive integers m 1 and n 1 satisfying 1 m 1 n 1 such that d i = α for m 1 i n 1 and the rest of d i s are 0. If α > 0, it is not hard to see that d ext(p I,J ) with I = {1,, n} and J = {m 1,, n 1}. Simiary, if α < 0, d ext(p I,J ) with I = and J being the compement of J = {m1,, n 1}. Hence, d {ext(p I,J ) : I {1,, n}, J {1,, n 1}}. Lemma 3.5. Let x R n, λ 1, λ 2 0 be given, and et f(y) := x T y λ 1 n i=1 n 1 y i λ 2 y i y i+1. Then, f(y) 0 for a y R n if and ony if x satisfies the foowing inequaities: i=1 k j=1 x j kλ 1 + λ 2, k 1 j=0 x i+j kλ 1 + 2λ 2, 2 i n k, k j=1 x n k+j kλ 1 + λ 2, n j=1 x j nλ 1 for k = 1,..., n 1. Proof. Let P I,J be defined in (3.4) for any I {1,..., n} and J {1,..., n 1}. We observe that (a) R n = {P I,J : I {1,..., n}, J {1,..., n 1}}; (b) f(y) 0 for a y R n if and ony if f(y) 0 for a y P I,J, and every I {1,..., n} and J {1,..., n 1}; (c) f(y) is a inear function of y when restricted to the set P I,J for every I {1,..., n} and J {1,..., n 1}. If P I,J is bounded, we have P I,J = {0} and f(y) = 0 for y P I,J. Suppose that P I,J is unbounded. By Lemma 3.3 and Minkowski s resoution theorem, P I,J equas the finitey generated cone by ext(p I,J ). It then foows that f(y) 0 for a y P I,J if and ony if f(d) 0 for a d ext(p I,J ). Using these facts and Lemma 3.4, we see that f(y) 0 for a y R n if and ony if f(d) 0 for a d Q, where Q is defined in (3.5). By the definitions of Q and f, we further observe that f(y) 0 for a y R n if and ony if f(d) 0 for a d ±(0, } {{, 0, 1,, 1, 0,, 0) T R n : m 0, 1 n }}{{}, m which together with the definition of f immediatey impies that the concusion of this emma hods.

12 12 FUSED MULTIPLE GRAPHICAL LASSO (3.10) Lemma 3.6. Let x R n, λ 1, λ 2 0 be given. The inear system x 1 + λ 1 γ 1 + λ 2 v 1 = 0, x i + λ 1 γ i + λ 2 (v i v i 1 ) = 0, 2 i n 1, x n + λ 1 γ n λ 2 v n 1 = 0, 1 γ i 1, i = 1,..., n, 1 v i 1, i = 1,..., n 1 has a soution (γ, v) if and ony if (x, λ 1, λ 2 ) satisfies the foowing inequaities: k j=1 x j kλ 1 + λ 2, k 1 j=0 x i+j kλ 1 + 2λ 2, 2 i n k, k j=1 x n k+j kλ 1 + λ 2, n j=1 x j nλ 1 for k = 1,..., n 1. Proof. The inear system (3.10) has a soution if and ony if the inear programming (3.11) min γ,v {0T γ + 0 T v : (γ, v) satisfies (3.10)} has an optima soution. The Lagrangian dua of (3.11) is max min y γ,v { which is equivaent to x T y + λ 1 n i=1 } n 1 y i γ i + λ 2 (y i y i+1 )v i : 1 γ, v 1, i=1 (3.12) max f(y) := x T y λ 1 y n i=1 n 1 y i λ 2 y i y i+1. i=1 By the Lagrangian duaity theory, probem (3.11) has an optima soution if and ony if its dua probem (3.12) has optima vaue 0, which is equivaent to f(y) 0 for a y R n. The concusion of this emma then immediatey foows from Lemma 3.5. We are now ready to prove Theorem 3.2. Proof. For the sake of convenience, we denote the inverse of Θ (k) as Ŵ(k) for k = 1,..., K. By the first-order optimaity conditions, we observe that Θ (k) 0, k = 1,..., K is the optima soution of probem (2.2) if and ony if it satisfies (3.13) (3.14) (3.15) (3.16) Ŵ(k) ii + S (k) ii = 0, 1 k K, Ŵ(1) + S (1) + λ 1 γ (1) + λ 2 υ (1,2) = 0, Ŵ(k) Ŵ(K) + S (k) + S (K) + λ 1 γ (k) + λ 1 γ (K) + λ 2 ( υ (k 1,k) λ 2 υ (K 1,K) = 0 + υ (k,k+1) ) = 0, 2 k K 1,

13 for a i, j = 1,..., p, i j, where γ (k) υ (k,k+1) ( Θ (k) S. YANG, Z. LU, X. SHEN, P. WONKA, AND J. YE 13 is a subgradient of Θ (k), Θ (k+1) and υ (k,k+1) ), that is, υ (k,k+1) [ 1, 1] if = Θ (k) Θ (k+1) is a subgradient of Θ (k) at Θ(k) = Θ (k) = 1 if (k+1) Θ. with respect to Θ (k) (k+1) > Θ, υ (k,k+1) Θ (k) ; and at (Θ (k), Θ(k+1) ) = (k) (k+1) = 1 if Θ < Θ, Necessity: Suppose that Θ (k), k = 1,..., K is a bock diagona optima soution of probem (2.2) with L known bocks C, = 1,..., L. Note that Ŵ(k) has the same bock diagona structure as Θ (k). Hence, Ŵ(k) (k) = Θ = 0 for i C, j C,. This together with (3.14)-(3.16) impies that for each i C, j C,, there exist, v(k,k+1) ), k = 1,..., K 1 and γ (K) such that (γ (k) (3.17) S (1) + λ 1 γ (1) + λ 2 υ (1,2) = 0, S (k) S (K) + λ 1 γ (k) + λ 1 γ (K) + λ 2 ( υ (k 1,k) λ 2 υ (K 1,K) = 0, 1 γ (k) 1, 1 k K, 1 v (k,k+1) 1, 1 k K 1. + υ (k,k+1) ) = 0, 2 k K 1, Using (3.17) and Lemma 3.6, we see that (3.3) hods for t = 1,..., K 1, i C, j C,. Sufficiency: Suppose that (3.3) hods for t = 1,..., K 1, i C, j C,. It then foows from Lemma 3.6 that for each i C, j C,, there exist (γ (k), v(k,k+1) ), k = 1,..., K 1 and γ (K) such that (3.17) hods. Now et Θ (k), k = 1,..., K be a bock diagona matrix as defined in (3.1) with U = I, where Θ = (1) (K) ( Θ,..., Θ ) is given by (3.2) for = 1,..., L. Aso, et Ŵ(k) be the inverse of Θ (k) for k = 1,..., K. Since Θ is the optima soution of probem (3.2), the first-order optimaity conditions impy that (3.13)-(3.16) hod for a i, j C, i j, = 1,..., L. (k) Notice that Θ = Ŵ(k) = 0 for every i C, j C,. Using this fact and (3.17), we observe that (3.13)-(3.16) aso hod for a i C, j C,. It then foows that Θ (k), k = 1,..., K is an optima soution of probem (2.2). In addition, Θ (k), k = 1,..., K is bock diagona with L known bocks C, = 1,..., L. The concusion thus hods. 4. Second-order method. The screening rue proposed in Section 3 is capabe of partitioning a features into a group of smaer sized bocks. Accordingy, a argescae FMGL (2.2) can be decomposed into a number of smaer sized FMGL probems. For each bock, we need to compute its individua estimated precision matrix Θ (k) by soving the FMGL (2.2) with S (k) repaced by S (k). In this section, we discuss how to sove those singe bock FMGL probems efficienty. For simpicity of presentation, we assume throughout this section that the FMGL (2.2) has ony one bock, that is, L = 1. We now propose a second-order method to sove the FMGL (2.2). For simpicity of notation, we et Θ := (Θ (1),..., Θ (K) ) and use t to denote the Newton iteration index. Let Θ t = (Θ (1) t,..., Θ (K) t ) be the approximate soution obtained at the t-th Newton iteration.

14 14 FUSED MULTIPLE GRAPHICAL LASSO The optimization probem (2.2) can be rewritten as (4.1) where min F (Θ) := K f k (Θ (k) ) + P (Θ), Θ 0 f k (Θ (k) ) = og det(θ (k) ) + tr(s (k) Θ (k) ). In the second-order method, we approximate the objective function F (Θ) at the current iterate Θ t by a quadratic mode Q t (Θ): (4.2) min Θ Q t(θ) := K q k (Θ (k) ) + P (Θ), where q k is the quadratic approximation of f k at Θ (k) t, that is, q k (Θ (k) ) = 1 2 tr(w(k) t D (k) W (k) D (k) ) + tr((s (k) W (k) )D (k) ) + f k (Θ (k) t with W (k) t = (Θ (k) t ) 1 and D (k) = Θ (k) Θ (k) t. Suppose that Θ t+1 is the optima soution of (4.2). Then we obtain the Newton search direction t t ) (4.3) D = Θ t+1 Θ t. We sha mention that the subprobem (4.2) can be suitaby soved by the nonmonotone spectra projected gradient (NSPG) method (see, for exampe, [43, 27]). It was shown by Lu and Zhang [27] that the NSPG method is ocay ineary convergent. Numerous computationa studies have demonstrated that the NSPG method is very efficient though its goba convergence rate is so far unknown. When appied to (4.2), the NSPG method requires soving the proxima subprobems in the form of (4.4) min Θ 1 2 K Θ (k) G (k) 2 F + αp (Θ) for some G = (G (1),..., G (K) ) and α > 0. By the definition of P (Θ), it is not hard to see that probem (4.4) can be decomposed into a set of independent and smaer sized probems (4.5) 1 min Θ (k),,...,k 2 K (Θ (k) G (k) )2 + α 1 K K 1 Θ (k) + α 2 Θ (k) Θ (k+1) for a i j, j = 1,..., p, where (α 1, α 2 ) = α(λ 1, λ 2 ). The probem (4.5) is known as the fused asso signa approximator, which can be soved very efficienty and exacty [6, 24]. In addition, they are independent from each other and thus can be soved in parae. Given the current search direction D = (D (1),..., D (K) ) that is computed above, we need to find the suitabe step ength β (0, 1] to ensure a sufficient reduction in the objective function of (2.2) and positive definiteness of the next iterate Θ (k) Θ (k) t t+1 = + βd (k), k = 1,..., K. In the context of the standard (singe) graphica asso,

15 S. YANG, Z. LU, X. SHEN, P. WONKA, AND J. YE 15 Hsieh et a. [16] have shown that a step ength satisfying the above requirements aways exists. We can simiary prove that the desired step ength aso exists for the FMGL (2.2). Lemma 4.1. Let Θ t = (Θ (1) t,..., Θ (K) t ) be such that Θ (k) t 0 for k = 1,..., K, and et D = (D (1),..., D (K) ) be the associated Newton search direction computed according to (4.2). Suppose D 0. 1 Then there exists a β > 0 such that Θ (k) t + βd (k) 0 and the sufficient reduction condition (4.6) F (Θ t + βd) F (Θ t ) + σβδ hods for a 0 < β < β, where σ (0, 1) is a given constant and δ = K tr((s (k) W (k) t )D (k) ) + P (Θ t + D) P (Θ t ). Proof. Let β = 1/ max{ (Θ (k) t ) 1 D (k) 2 : k = 1,..., K}, where 2 denotes the spectra norm of a matrix. Since D 0 and Θ (k) 0, k = 1,..., K, we see that β > 0. Moreover, we have for a 0 < β < β and k = 1,..., K, ( ) Θ (k) t + βd (k) (Θ (k) t ) 1 2 = I + β(θ (k) t (Θ (k) t ) 1 2 By the definition of D and (4.2), one can easiy show that δ K t ) 1 2 D (k) (Θ (k) t ) 1 2 (1 β (Θ (k) t ) 1 D (k) 2 )I 0. tr(w (k) t D (k) W (k) t D (k) ), which together with the fact that W (k) t 0, k = 1,..., K and D 0 impies that δ < 0. Using differentiabiity of f k, convexity of P, and the definition of δ, we obtain that for a sufficienty sma β > 0, F (Θ t + βd) F (Θ t ) = K (f k(θ (k) t + βd (k) ) f k (Θ (k) t )) + P (Θ t + βd) P (Θ t ), = K tr((s(k) W (k) t )D (k) )β + o(β) + P (β(θ t + D) + (1 β)θ t ) P (Θ t ), K tr((s(k) W (k) t )D (k) )β + o(β) + βp (Θ t + D) + (1 β)p (Θ t ) P (Θ t ), βδ + o(β). This inequaity together with δ < 0 and σ (0, 1) impies that there exists ˆβ > 0 such that for a β (0, ˆβ), F (Θ t + βd) F (Θ t ) σβδ. It then foows that the concusion of this emma hods for β = min{ β, ˆβ}. By virtue of Lemma 4.1, we can adopt the we-known Armo s backtracking ine search rue [38] to seect a step ength β (0, 1] so that Θ (k) t + βd (k) 0 and (4.6) hods. In particuar, we choose β to be the argest number of the sequence {1, 1/2,..., 1/2 i,...} that satisfies these requirements. We can use the Choesky factorization to check the positive definiteness of Θ (k) t + βd (k), k = 1,..., K [16]. In addition, the associated terms og det(θ (k) t + βd (k) ) and (Θ (k) t + βd (k) ) 1 can be efficienty computed as a byproduct of the Choesky decomposition of Θ (k) t + βd (k). 1 It is we known that if D = 0, Θ t is the optima soution of probem (2.2).

16 16 FUSED MULTIPLE GRAPHICAL LASSO 4.1. Shrinking scheme. Given the arge number of unknown variabes in (4.2), it is advantageous to minimize (4.2) in a reduced space. The issue now is how to identify the reduced space. In the case of a singe graph (K = 1), probem (4.2) degenerates to a asso probem of size p 2. Hsieh et a. [16] proposed a strategy to determine a subset of variabes that are aowed to be updated in each Newton iteration for singe graphica asso. Specificay, the p 2 variabes in singe graphica asso are partitioned into two sets, J free and J fixed, based on the gradient at the start of each Newton iteration, and then the minimization is ony performed on the variabes in J free. We ca this technique shrinking in this paper. Due to the sparsity of the precision matrix, the size of J free is usuay much smaer than p 2. Moreover, it has been shown in the singe graph case that the size of J free wi decrease quicky [16]. The shrinking technique can thus improve the computationa efficiency. This technique was aso successfuy used in [18, 33, 45]. We show that shrinking can be extended to the fused mutipe graphica asso based on the resuts estabished in Section 3. G (k) t Denote the gradient of f k at t-th iteration by eement by. Then we have the foowing resut. G (k) t, = S (k) W (k) t Lemma 4.2. For Θ t in the t-th iteration, define the fixed set J fixed as, and its (i, j)-th J fixed = {(i, j) Θ (1) t, =... = Θ(K) (1) (K) t, = 0 and G t,,..., G t, satisfy the inequaities beow}. (4.7) u (k) G t, < uλ 1 + λ 2, u 1 (r+k) k=0 G t, < uλ 1 + 2λ 2, 2 r K u, u (K u+k) G t, < uλ 1 + λ 2, K G (k) t, < Kλ 1 for u = 1,..., K 1. Then, the soution of the foowing optimization probem is D (1) =... = D (K) = 0 : (4.8) min Q t(θ t + D) such that D (1) =... = D (K) = 0, (i, j) / J fixed. D (4.9) Proof. Consider probem (4.8), which can be reformuated to where H (k) t ( 1 K min D 2 vec(d(k) ) T H (k) t vec(d (k) ) + vec( +P (Θ t + D), s.t. D (1) = W (k) t =... = D (K) = 0, (i, j) / J fixed, W (k) t. Because of the constraint D (1) ) (k) G t ) T vec(d (k) ) =... = D (K) = 0, (i, j) / J fixed, we ony consider the variabes in the set J fixed. According to Lemma 3.6, it is easy to see that D Jfixed = 0 satisfies the optimaity condition of the foowing probem min D Jfixed K vec( G (k) t,j fixed ) T vec(d (k) J fixed ) + P (D Jfixed ).

17 S. YANG, Z. LU, X. SHEN, P. WONKA, AND J. YE 17 Since K vec(d(k) ) T H (k) t vec(d (k) ) 0, the optima soution of (4.8) is given by D (1) =... = D (K) = 0. Lemma 4.2 provides a shrinking scheme to partition the variabes into the free set J free and the fixed set J fixed. With shrinking, each Newton step of the proposed second-order method fas into a bock coordinate gradient descent framework [38]. Lemma 4.2 shows that when the variabes in the free set J free are fixed, no update is needed for the variabes in the fixed set J fixed. Minimization of (4.2) restricted to the free set can therefore guarantee the convergence to the unique optima soution [16, 38]. In addition, it has been shown that oca quadratic convergence rate can be achieved when the exact Hessian is used (see, for exampe, [16, 21]). The resuting second-order method for soving the fused mutipe graphica asso is summarized in Agorithm 1. Agorithm 1: Proposed second-order method for Fused Mutipe Graphica Lasso (FMGL) Input: S (k), k = 1,..., K, λ 1, λ 2 Output: Θ (k), k = 1,..., K Initiaization: Θ (k) 0 = (diag(s (k) )) 1 ; whie Not Converged do Determine the sets of free and fixed indices J free and J fixed using Lemma 4.2. Compute the Newton direction D (k), k = 1,..., K by soving (4.2) and (4.3) over the free variabes J free. Choose Θ (k) t+1 by performing the Armo backtracking ine search aong Θ (k) t + βd (k) for k = 1,..., K. end return Θ (k), k = 1,..., K; 5. Experimenta resuts. In this section, we evauate the proposed agorithm and screening rue on synthetic datasets and two rea datasets: ADHD and FDG- PET images 3. The experiments are performed on a PC with quad-core Inte 2.67GHz CPU and 9GB memory Simuation Efficiency. We conduct experiments to demonstrate the effectiveness of the proposed screening rue and the efficiency of our method FMGL. The foowing agorithms are incuded in our comparisons: FMGL: the proposed second-order method in Agorithm 1. ADMM: ADMM method. FMGL-S: FMGL with screening. ADMM-S: ADMM with screening. Both FMGL and ADMM are written in Matab, and they are avaiabe onine 4. Since both methods invove soving (4.4) which invoves a doube oop, we impement the sub-routine for soving (4.4) in C for a fair comparison

18 18 FUSED MULTIPLE GRAPHICAL LASSO The synthetic covariance matrices are generated as foows. We first generate K bock diagona ground truth precision matrices Θ (k) with L bocks, and each bock Θ (k) is of size (p/l) (p/l). Each Θ (k), = 1,..., L, k = 1,..., K has random sparsity structures. We contro the number of nonzeros in each Θ (k) to be about 10p/L so that the tota number of nonzeros in the K precision matrices is 10Kp. Given the precision matrices, we draw 5p sampes from each Gaussian distribution to compute the sampe covariance matrices. The fused penaty parameter λ 2 is fixed to 0.1, and the 1 reguarization parameter λ 1 is seected so that the tota number of nonzeros in the soution is about 10Kp. We terminate the NSPG in the FMGL when the reative error max{ Θ(k) r Θ(k) r 1 } 1e-6. The FMGL is terminated when max{ Θ (k) r 1 } the reative error of the objective vaue is smaer than 1e-5, and ADMM stops unti it achieves an objective vaue equa to or smaer than that of FMGL. The resuts presented in Tabe 5.1 show that FMGL is consistenty faster than ADMM. FMGL converges much more quicky than ADMM. Moreover, the screening rue can achieve great computationa gain. The speedup with the screening rue is about 10 and 20 times for L = 5 and 10 respectivey. Tabe 5.1 Comparison of the proposed FMGL and ADMM with and without screening in terms of average computationa time (seconds). FMGL-S and ADMM-S are FMGL and ADMM with screening respectivey. p stands for the dimension, K is the number of graphs, L is the number of bocks, and λ 1 is the 1 reguarization parameter. The fused penaty parameter λ 2 is fixed to 0.1. Θ 0 represents the tota number of nonzero entries in ground truth precision matrices Θ (k), k = 1,..., K, and Θ 0 is the number of nonzeros in the soution. Data and parameter setting Computationa time (iteration numbers) p K L Θ 0 λ 1 Θ 0 FMGL-S FMGL ADMM-S ADMM (6) (152) (6) (140) (6) (146) (6) (150) (6) (152) (6) (155) (7) (155) (6) (155) (7) (153) (6) (172) (6) (157) (6) (168) Stabiity. We conduct experiments to demonstrate the effectiveness of FMGL. The synthetic sparse precision matrices are generated in the foowing way: we set the first precision matrix Θ (1) as 0.25I p p, where p = 100. When adding an edge (i, j) in the graph, we add σ to θ (1) ii and θ (1) jj, and subtract σ from θ(1) and θ (1) ji to keep the positive definiteness of Θ (1), where σ is uniformy drawn from [0.1, 0.3]. When deeting an edge (i, j) from the graph, we reverse the above steps with σ = θ (1). We randomy assign 200 edges for Θ(1). Θ (2) is obtained by adding 25 edges and deeting 25 different edges from Θ (1). Θ (3) is obtained from Θ (2) in the same way. For each precision matrix, we randomy draw n sampes from the Gaussian distribution with the corresponding precision matrix, where n varies from 40 to 200 with a step of 20. We perform 500 repications for each n. For each n,

19 S. YANG, Z. LU, X. SHEN, P. WONKA, AND J. YE 19 λ 2 is fixed to 0.08, and λ 1 is adjusted to make sure that the edge number is about 200. The accuracy n d /n g is used to measure the performance of FMGL and GLasso, where n d is the number of true edges detected by FGML and GLasso, and n g is the number of true edges. The resuts are shown in Figure 5.1. We can see from the figure that FMGL achieves higher accuracies, demonstrating the effectiveness of FMGL for earning mutipe graphica modes simutaneousy GLasso FMGL Θ (1) GLasso FMGL Θ (2) GLasso FMGL Θ (3) Accuracy Accuracy Accuracy Sampe Size Sampe Size Sampe Size Fig Comparison of FMGL and GLasso in detecting true edges. Sampe size varies from 40 to 200 with a step of Rea data ADHD-200. Attention Deficit Hyperactivity Disorder (ADHD) affects at east 5-10% of schoo-age chidren with annua costs exceeding 36 biion/year in the United States. The ADHD-200 project has reeased resting-state functiona magnetic resonance images (fmri) of 491 typicay deveoping chidren and 285 ADHD chidren, aiming to encourage the research on ADHD. The data used in this experiment is preprocessed using the NIAK pipeine, and downoaded from neurobureau 5. More detais about the preprocessing strategy can be found in the same website. The dataset we choose incudes 116 typicay deveoping chidren (TDC), 29 ADHD-Combined (ADHD-C), and 49 ADHD-Inattentive (ADHD-I). There are 231 time series and 2834 brain regions for each subject. We want to estimate the graphs of the three groups simutaneousy. The sampe covariance matrix is computed using a data from the same group. Since the number of brain regions p is 2834, obtaining the precision matrices is computationay intensive. We use this data to test the effectiveness of the proposed screening rue. λ 1 and λ 2 are set to 0.6 and The convergence criterion is 1e-5. The comparison of FMGL and ADMM in terms of the objective vaue curve is shown in Figure 5.2. The resut shows that FMGL converges much faster than ADMM. The computationa times of FMGL and ADMM are and seconds respectivey. However, utiizing the screening, the computationa times of FMGL-S and ADMM-S are and seconds respectivey, demonstrating the superiority of the screening rue. The obtained soution has 1443 bocks. The argest one incuding 634 nodes is shown in Figure 5.3. The bock structures of the FMGL soution are the same as those identified by the screening rue. The screening rue can be used to anayze the rough structures of the graphs. The cost of identifying bocks using the screening rue is negigibe compared to that of estimating the graphs. For high-dimensiona data such as ADHD-200, it 5

20 20 FUSED MULTIPLE GRAPHICAL LASSO 10 0 FMGL ADMM reative error (og scae) time (seconds) Fig Comparison of FMGL and ADMM in terms of objective vaue curve on the ADHD- 200 dataset. The dimension p is 2834, and the number of graphs K is 3. Fig A subgraph of ADHD-200 identified by FMGL with the proposed screening rue. The grey edges are common edges among the three graphs; the red, green, and bue edges are the specific edges for TDC, ADHD-I, and ADHD-C respectivey. is practica to use the screening rue to identify the bock structure before estimating the arge graphs. We use the screening rue to identify bock structures on ADHD-200 data with varying λ 1 and λ 2. The size distribution is shown in Figure 5.4. We can observe that the number of bocks increases, and the size of bocks deceases when the

21 S. YANG, Z. LU, X. SHEN, P. WONKA, AND J. YE 21 og 10 (Size of Bock) >500 og 10 (Size of Bock) > λ 1 (a) λ 2 (b) Fig The size distribution of bocks (in the ogarithmic scae) identified by the proposed screening rue. The coor represents the number of bocks of a specified size. (a): λ 1 varies from 0.5 to 0.95 with λ 2 fixed to (b): λ 2 varies from 0 to 0.2 with λ 1 fixed to reguarization parameter vaue increases FDG-PET. In this experiment, we use FDG-PET images from 74 Azheimer s disease (AD), 172 mid cognitive impairment (MCI), and 81 norma contro (NC) subjects downoaded from the Azheimer s disease neuroimaging initiative (ADNI) database. The different regions of the whoe brain voume can be represented by 116 anatomica voumes of interest (AVOI), defined by Automated Anatomica Labeing (AAL) [39]. Then we extracted data from each of the 116 AVOIs, and derived the average of each AVOI for each subject. The 116 AVOIs can be categorized into 10 groups: prefronta obe, other parts of the fronta obe, parieta obe, occipita obe, thaamus, insua, tempora obe, corpus striatum, cerebeum, and vermis. More detais about the categories can be found in [39, 41]. We remove two sma groups (thaamus and insua) containing ony 4 AVOIs in our experiments. To examine whether FMGL can effectivey utiize the information of common structures, we randomy seect g percent sampes from each group, where g varies from 20 to 100 with a step size of 10. For each g, λ 2 is fixed to 0.1, and λ 1 is adjusted to make sure the number of edges in each group is about the same. We perform 500 repications for each g. The edges with probabiity arger than 0.85 are considered as stabe edges. The resuts showing the numbers of stabe edges are summarized in Figure 5.5. We can observe that FMGL is more stabe than GLasso. When the sampe size is too sma (say 20%), there are ony 20 stabe edges in the graph of NC obtained by GLasso. But the graph of NC obtained by FMGL sti has about 140 stabe edges, iustrating the superiority of FMGL in stabiity. The brain connectivity modes obtained by FMGL are shown in Figure 5.6. We can see that the number of connections within the prefronta obe significanty increases, and the number of connections within the tempora obe significanty decreases from NC to AD, which are supported by previous iteratures [1, 14]. The connections between the prefronta and occipita obes increase from NC to AD, and connections within cerebeum decrease. We can aso find that the adjacent graphs are simiar, indicating that FMGL can identify the common structures, but aso keep

Moreau-Yosida Regularization for Grouped Tree Structure Learning

Moreau-Yosida Regularization for Grouped Tree Structure Learning Moreau-Yosida Reguarization for Grouped Tree Structure Learning Jun Liu Computer Science and Engineering Arizona State University J.Liu@asu.edu Jieping Ye Computer Science and Engineering Arizona State