Derivations for the Fglasso Algorithm

Size: px
Start display at page:

Download "Derivations for the Fglasso Algorithm"

Transcription

1 Supplementary Material to Functional Graphical Models Xinghao Qiao, Shaojun Guo, and Gareth M. James This supplementary material contains the details of the algorithms with derivations in Appendix B, technical proofs of Propositions 1 2, Theorems 1 4, Lemmas 1-15 in Appendix C, and further discussion in Appendix D. B Derivations for the Fglasso Algorithm In Appendix B, we provide some further details about the fglasso algorithm and the joint fglasso algorithm. B.1 Step 2b of Algorithm 1 Note 14 is equivalent to finding w j1,, w jp 1 to minimize trace p 1 p 1 p 1 p 1 S jj wjlθ T 1 j lk w + 2 s T w + 2γ n w F. l=1 k=1 k=1 k=1 B.1 Setting the derivative of B.1 with respect to w to be zero and applying Lemma 4 yields B.1 = Θ 1 w j kk w S jj + Θ 1 j T kkw S T jj + Θ 1 j T lkw jl S T jj + Θ 1 j kl w jl S jj + 2s + 2γ n ν l k = 2 Θ 1 j kk w S jj + Θ 1 j T lkw jl S jj + s + γ n ν = 0, l k where ν = w w F We define the block residual by if w 0, and ν R M M with ν F 1 otherwise, k = 1,..., p 1. r = l k Θ 1 j T lkw jl S jj + s. B.2 1

2 If w = 0, then r F = γ n ν F γ n. Otherwise we need to solve for w in the following equation Θ 1 w j kk w S jj + r + γ n = 0. B.3 w F We replace B.3 by B.4, and standard packages in R/MatLab can be used to solve the following M 2 by M 2 nonlinear equation Θ 1 j kk S jj vecw + vecr + γ n vecw w F = 0. B.4 Hence, the block coordinate descent algorithm for solving w j in 14 is summarized in Algorithm 3. Algorithm 3 Block Coordinate Descent Algorithm for Solving w j 1. Initialize ŵ j. 2. Repeat until convergence for k = 1,..., p 1. a Compute r via B.2. b Set ŵ = 0 if r F γ n ; otherwise solve for ŵ via B.4. B.2 Steps 2a and 2c of Algorithm 1 At the jth step, we need to compute Θ 1 j in 14 given current Σ = Θ 1. Then step 2a follows by the blockwise inversion formula. Next we solve for w j via Algorithm 3, and then update Θ 1 given current w j, Θ jj, and Θ 1 j, by applying the blockwise inversion formula again. Rearranging the row and column blocks such that the j, j-th block is the last one, we obtain the permuted version of Θ 1 by Θ 1 j + U j V j U T j U j V j, where U j = Θ 1 jw j V j U T j V j and V j = Θ jj wj T U j 1 = S jj. Step 2c follows as a consequence. B.3 Joint Fglasso Algorithm We put superscript q on the terms used in Section 3.1 to denote the corresponding ones for the q-th class, 1 q Q. Then, for a fixed value of Θ q j, some calculations show that 2

3 11 with the addition of the penalty 12 is minimized by setting Θ q jj = S q jj 1 + ŵ q j T Θ q j 1 ŵ q j, B.5 where ŵ 1 j,..., ŵ Q j are obtained by minimizing Q q=1 trace S q p 1 +2γ 1n l=1 jj wq j q=1 T Θ q j 1 w q j l=1 + 2s q j q=1 T w q p 1 Q w q jl F + 2γ 2n Q w q jl 2 F, j B.6 and w q jl represents the lth M M block of w q j. Analogously to the fglasso algorithm, we summarize the joint fglasso algorithm, which is developed to solve the optimization problem 11 in Algorithm 4. Algorithm 4 Joint Functional Graphical Lasso Algorithm 1. Initialize Θ q = I and Σ q = I, q = 1,..., Q. 2. Repeat until convergence for j = 1,..., p, q = 1,..., Q. q a Compute Θ j 1 j σ q q Σ jj 1 σ q j T. Σ q j b Solve for ŵ q j in B.6 using Algorithm 5. c Reconstruct U q j S q 3. Set Êq = jj Uq j Σ q using Σ q jj T, where U q j = S q jj, σq j q = Θ j 1 ŵ q j. = U q j S q jj } q j, l : Θ jl F 0, j, l V 2, j l, q = 1,..., Q. and Σ q j q = Θ j 1 + 3

4 Setting the derivative of B.6 with respect to w q to be zero and applying Lemma 4 yield B.6 w q = Θ q j 1 kk w q Sq jj + Θ 1 j q T kkw q Sq jj T + l k Θ q j 1 T lkw q jl Sq jj T + Θ q j 1 kl w q jl Sq jj = 2 +2s q + 2λνq Θ q j 1 kk w q Sq jj + l k Θ q j 1 T lkw q jl Sq jj + sq + γ 1nν q + γ 2nµ q = 0, where ν q F 1, Q ν q F 1, µ q ν q = wq w q F q=1 µq 2 F 1, = w q Q, µ q q=1 wq 2 F = w q Q q=1 wq 2 F if Q q=1 wq 2 F = 0., if Q q=1 wq 2 F, if Q q=1 wq 2 F 0 and wq = 0. 0 and wq 0. We define the qth block residual by r q = l k Θ q j 1 T lkw q jl Sq jj + sq. B.7 If w q = 0 for all Q classes, then Q q=1 rq F Q q=1 γ 1n ν q F + γ 2n µ q F γ 1n Q + γ 2n. Otherwise if w q following equation = 0, then rq F γ 1n ; if w q 0 we need to solve for wq in the Θ q j 1 kk w q Sq jj + rq + γ 1n w q w q F w q + γ 2n Q q=1 wq 2 F = 0. B.8 Hence, the block coordinate descent algorithm for solving w q j Algorithm 5. in B.6 is summarized in 4

5 Algorithm 5 Block Coordinate Descent Algorithm for Solving w q j 1. Initialize ŵ 1 j,..., ŵ Q j. 2. Repeat until convergence for k = 1,..., p 1, q = 1,..., Q. a Compute r q b Set ŵ q via B.7. = 0 for all Q classes if Q q=1 rq F γ 1n Q + γ 2n ; otherwise go to c c For q = 1,..., Q, set ŵ q = 0 if rq F γ 1n ; otherwise solve for ŵ q via B.8. C Proofs of Technical Details C.1 Proof of Proposition 1 Substituting Θ = diagθ 1,..., Θ K into 9 yields K K max log det Θ k traces k Θ k γ n Θ 1,,Θ K k=1 k=1 which is equivalent to K separate fglasso problems in 15. j l } K Θ k,jl F, C.1 k=1 C.2 Proof of Proposition 2 If Θ is block diagonal, and i and i belong to separate index sets G k and G k, then Θ ii = 0 and hence Θ 1 ii = 0. By C.12, we have S ii F γ n Z ii F γ n. This completes the proof for the sufficient condition. Next we prove the condition is necessary. We construct Θ k by solving the fglasso problem 9 applied to the symmetric submatrix of S given by index set G k for k = 1,..., K, and let Θ = diag Θ 1,..., Θ K. Since S ii F γ n for all i G k, i G k, k k, and Θ ii = 0 by construction, we have Θ 1 ii = 0 and hence the i, i -th equation of C.12 is satisfied. Moreover, the k, k-th equation of C.12 is satisfied by construction. Therefore, Θ satisfies the KKT condition C.12 and is the solution to the fglasso problem 9. 5

6 C.3 Proof of Theorem 1 We begin with some notation. For any Hs, t, s, t T 2 with the corresponding Karhunen- Loève decomposition Hs, t = j=1 λ jφ j sφ j t, define H S = 1/2. λj 2 For two square-integrable functions ft, gt, define f, g = t T ftgtdt and f 2 = f, f. Denote also a i = λ 1/2 ξ i, where ξ i N0, 1 and λ 0 = sup jp k=1 λ. j 1 We now prove Theorem 1. We first consider σ jk for j = 1,..., p and k = 1,..., M. Note that n σ jk = n â2 i nā2 and σ jk = Ea2 1 with ā = n 1 n âi, and, for each i, j, k, â i = g ij, φ = a i + g ij, φ φ. Then n σ jk σ jk is rewritten as n σ jk σ jk = λ n + ξ 2 i 1 + 2λ 1/2 n ξ i g ij, φ φ n g ij, φ φ 2 nā 2 = I 1 + I 2 + I 3 + I 4. Note that for δ > 0, P σjk σ jk 4δ 4 m=1 P Im nδ. To derive the concentration inequality of n σ jk σ jk, it suffices to derive the tail behaviors of all I m s m = 1,..., 4. a Since ξ i s are independent N0, 1, we have that all 0 < δ 1, n P ξ 2 i 1 } nδ 2 exp nδ2 2 exp δ nδ2. 36 Hence, it follows that there exists a constants C 1 such that for 0 < δ C 1, P I 1 nδ = P n ξi 2 nδ 1 2 exp C 1 nδ 2. λ 0 b First, the term I 2 can be bounded by I 2 2λ 1/2 n ξ ig ij φ φ. Let Y n1 = n 2 } 1/2, i ξ2 Yn2 = λ n jm ξ } 2 1/2. iξ ijm Then, m k n 2 n ξ i g ij = λ ξ 2 i 2 n 2 + λ jm ξ i ξ ijm = λ Yn1 2 + Yn2, 2 m k 6

7 which implies that n ξ ig ij λ 1/2 Y n1 + Y n2. By the condition λ k β and d λ = Ok, we have that d λ d 0 k and d d 0 k 1+β for some positive constant d 0. By Lemma 8, φ φ d K jj K jj S, where, w.l.o.s., φ sgn φ, φ = 1. As a result, I 2 can be further bounded by can be chosen to satisfy I 2 2d 0 ky n1 K jj K jj S + 2d 0 λ 1/2 ky n2 K jj K jj S. C.2 We first bound Y n1 and Y n2. On one hand, n } P Y n1 2n = P ξi 2 1 n exp n. C.3 36 On the other hand, since ξ ij1, ξ ij2 N0, 1 for each j, k, n E ξ ij1ξ ij2 k neξ 2k 1j1 k!n2 k. As a result, it follows that for all δ > 0 n P ξ ij1 ξ ij2 δ 2 exp δ 2 2 exp δ2 + 2 exp δ. 16n + 4δ 32n 8 Consequently, using integration by parts, there exist two positive constants L 1 and L 2 not depending on n such that E n ξ ij1 ξ ij2 2k k!nl1 k + 2k!L 2k 2, k = 1, 2, 3,.... This further implies E Y n2 EY n2 2k k!2λ0 L 1 n k + 2k!2λ 1/2 0 L 2 2k, k 1. Hence we obtain from Theorem 2.3 of Boucheron et al that for all δ > 0 and n 2L 2 2L 1 1, P Y n2 EY n2 δ exp δ 2 32λ 0 L 1 n + 8λ 1/2 0 L 2 δ. Note that EY n2 λ 0 n 1/2. Hence, for δ 2λ 0 n 1/2 and n 2L 2 2L 1 1, P Y n2 δ P Y n2 EY n2 δ/2 exp δ 2 168λ 0 L 1 n + λ 1/2 0 L 2 δ I2 Now consider P nδ. By C.2, we can bound this term by } C.4 2P K jj K jj S δ + P Y n1 2n + P 8kd 0 Y n2 2nλ 1/2. 7

8 Together with C.3, C.4 and Lemma 6, it follows that there exist two positive constants C k k = 2, 3 free of n and p such that for 0 < δk 1 C 2, I2 P nδ C 3 exp C 2 nk 2 δ 2 + exp C 2 nk β. c In a similar way to I 2, we can show that there exist three positive constants C k k = 4, 5, 6 not depending on n and p such that for 0 < δ C 5, I3 P nδ 2 exp C 4 n + C 6 exp C 5 nk 2+2β δ. d Consider the last term I 4. First, we have ā λ 1/2 ξ + ḡ j φ φ λ 1/2 ξ + d 0 k ḡ j K jj K jj S, where ḡ j 2 = m=1 λ ξ jm 2. Note that the following inequalities hold for all δ > 0: P ξ δ 2 exp C 7 nδ 2 and P ḡ j δ 2 exp C 7 nδ 2. for some positive constant C 7. Hence, together with Lemma 6, we obtain that P ā 2 δ can be bounded by P ξ δ 1/2 λ 1/2 /2 + P ḡ j K jj K jj S d 1 P ξ δ 1/2 λ 1/2 /2 + P ḡ j d 2δ 1/4 /2 } } +P K jj K jj S λ 2 δ 1/4 /2 2 exp C 7 nk β/2 δ + C 9 exp C 8 nk β δ 1/2 for all 0 < δ C 8 with some positive constants C 8 and C 9. δ1/2 /2 Combining a, b, c and d and choosing suitable constants, the inequality 16 follows consequently. For general cases of j, l, k, m with j l or m k, σ jlkm = 1 n n âiâ ilm ā ā lm and σ jlkm = Ea ia ilm. Hence n σ jlkm σ jlkm can be expressed as the sum of the following five terms: n +λ 1/2 lm ai a ilm σ lm + λ 1/2 n ξ i g il, φ lm φ lm n ξ ilm g ij, φ φ + = I 1 + I I 5. n g ij, φ φ g il, φ lm φ lm nā ā lm 8

9 Observe that I 2 O1 k β/2 m 1+β n ξ ig il Kll K ll S, I 3 O1 m β/2 k 1+β n ξ S ilmg ij Kjj K jj, and I 4 O1 km 1+β n g ij g il K S S ll K ll Kjj K jj. Hence the proof techniques for n σ jk σjk can be applied here and as a result, 17 follows. The proof is completed. C.4 Proof of Theorem 2 First we obtain the general error bound for Θ in Section C.4.1. Second in Section C.4.2 we present the general model selection consistency of fglasso in Theorem 4. Finally in Section C.4.3 we prove Theorem 2 based on the results of Lemma 3 and Theorem 4. For convenient presentation, we adopt the definition of tail condition for the random variable given in Ravikumar et al Definition 1 Tail condition The random vector a R Mp satisfies the tail condition if there exists a constant v 0, ] and a function f : N 0, 0,, such that for any i, j 1,..., Mp} 2, let S ij, Σ ij be the i, j-th entry of S, Σ respectively, then P S ij Σ ij δ 1/fn, δ for all δ 0, 1/v ]. C.5 The tail function f is required to be monotonically increasing in δ and n. functions of n and δ are respectively defined as The inverse δ f w; n = argmax δ fn, δ w} and n f δ; w = argmax n fn, δ w}, where w [1,. Then we assume that the Hessian of the negative log determinant satisfies the following general irrepresentable-type assumption. Condition 6 There exists some constant η 0, 1] such that Γ S c S Γ S S 1 M 2 1 η. C.6 9

10 C.4.1 General Error Bound In this section, we present Theorem 3 on the general error bound. We first begin with some notation. Denote by κ Γ = Γ S S 1 M 2,κ B = Θ 1 B M κ 1 Σ, where B,jl = Θ jl for j, l S c and B,jl = 0 for j, l S, and d = max l V : Θ jl F >. j V Theorem 3 Let Θ be the unique solution to the fglasso problem 9 with regularization parameter γ n = 16η 1 M δ f n, Mp τ. Suppose that Conditions 2-4 and 6 hold, the bias term satisfies B M max γ n ηκ 2 Σ /16 and the sample size n satisfies the lower bound κσ κ Γ κ n > n f 1/ 3 Σ max v, 6c η Md max, κ2 Γ c }} η 1 3κ B κ Σ 1 3κ B κ 3 Σ κ Γ c, Mp τ η with c η = η 1, then with probability at least 1 Mp 2 τ, we have i The estimate Θ satisfies the error bound C.7 Θ Θ M max 2c η κ Γ M δ f n, Mp τ ; C.8 ii The estimated edge set Ê is a subset of E. C.4.2 General Model Selection Consistency Theorem 4 Let Θ min = min Θ jl F. Under the same conditions as in Theorem 3, if the j,l E sample size n satisfies the lower bound }} n > n f 1/ max 2κ Γ c η Θ 1 min M, v, 6c η Md max then Ê = E } holds with probability at least 1 Mp 2 τ. κ Σ κ Γ κ 3 Σ, κ2 Γ c η 1 3κ B κ Σ 1 3κ B κ 3 Σ κ Γ c η, Mp τ, C.4.3 Proof of Theorem 2 By 18 in Theorem 1, the sample covariance matrix satisfies the tail condition C.5 with some constants v = C 1 1 and fn, δ = C 1 2 expc 1 n 1 2α1+β δ 2 }. Therefore, the corresponding inverse functions take the following forms δ f n, Mp τ logc 2 Mp = τ } τlogmp + logc2 =, C.9 C 1 n 1 2α1+β C 1 n 1 2α1+β 10

11 τlogmp + n f δ, Mp τ logc2 = C 1 δ 2 } 1 2α1+β} 1. C.10 It follows from Lemma 3 with = C E 2 n α1 2ν β that E = E. Thus we have S = S, d = d, B = B, κ Γ = κ Γ and κ B = κ B. By substituting these terms into Theorem 4, some calculations using C.9 and C.10 lead to the lower bound for the sample size, i.e. n > C 3 M 2 d 2 τlogmp + logc 2 /c 2 1 and n > C 4 M 2 Θ 2 min τlogmp + τlogc 2/c 2 1 and the desired regularization parameter γ n. Under Conditions 2 4, it follows from Lemma 3 that E = E. By satisfying Condition 6 and the lower bound condition, Theorem 4 indicates that E = Ê} holds with probability at least 1 1/c 1 n α p τ 2. Combining these two results completes the proof. C.5 Proof of Theorem 3 We let the sub-differential of j l jl F matrices Z R Mp Mp with M by M blocks defined by evaluated at some Θ involves all symmetric 0 if j = l Z jl = Θ jl Θ jl F if j l and Θ jl 0 Zjl R M M : Z jl F 1 } if j l and Θ jl = 0. C.11 By the Karush-Kuhn-Tucker KKT condition, a necessary and sufficient condition for Θ to maximize 9 is Θ 1 S γ n Ẑ = 0, C.12 where Ẑ belongs to the family of sub-differential of j l Θ jl F defined in C.11. The main idea of the proof is based on constructing the primal-dual witness solution Θ and Z in the following four steps. First, Θ is obtained by the following restricted fglasso problem } min tracesθ log detθ + γ n Θ jl F, C.13 Θ S c =0 j l 11

12 where Θ R Mp Mp is symmetric positive definite. Second, for each j, l S, we choose Z jl from the family of sub-differential of j l Θ jl F Third, for each j, l S c, where Θ jl F, Z jl is replaced by 1 γ n evaluated at Θ jl defined in C.11. S jl + Θ 1, C.14 jl} which satisfies the KKT condition C.12. Finally, we need to verify strict dual feasibility condition, that is, Z jl F < 1 uniformly in j, l S c. The following terms are needed in the proof of Theorem 3. Let W be the noise matrix, and the difference between the primal witness matrix Θ and the truth Θ, W = S Θ 1, = Θ Θ = Θ Θ + Θ Θ = + B, C.15 where Θ,jl = 0 for j, l S c and Θ,jl = Θ jl for j, l S. Hence for each j, l S, c jl F. Note B corresponds to the bias matrix caused by M-dimensional approximation in 5 to a larger dimensional function. The second order remainder for Θ 1 near Θ is given by R = Θ 1 Θ 1 + Θ 1 Θ 1. C.16 To prove Theorem 3, we need use Lemmas 9-15 as stated in Supplementary Material. We organize our proof in the following six steps. Step 1. It follows from the tail condition C.5 and Lemma 14 that with probability at least 1 Mp 2 τ the event W M max M δ } f n, Mp τ holds. We need to verify that the conditions in Lemma 10 hold. Choosing the regularization parameter γ n = 16η 1 M δ f n, Mp τ and applying the inequalities in Lemma 15 together with the bound condition for the bias term, we have W M max W M max+ Θ 1 B Θ 1 M max W M max+ κ 2 Σ B M max ηγ n /16 + ηγ n /16 = ηγ n /8. It remains to prove R M max is also bounded by ηγ n /8 = 2M δ f n, Mp τ. Step 2. Let r = 2κ Γ W M max + γ n 2κ Γ c η M δ f n, Mp τ. By δ f n, Mp τ 1/v and monotonicity of the inverse tail function, for any n satisfying the lower bound condition, 12

13 we have 2κ Γ c η M δ 1 f n, Mp τ 3κB min κ Σ, 3κ Σ d 1 1 min, 3κ Σ d 3κ 3 Σ κ Γ d 1 3κB κ3 Σ κ Γ c } η 3κ 3 Σ κ Γ d c η } κ B d. Then the conditions in Lemma 12 are satisfied, and hence the error bound satisfies M max = Θ Θ M max r. Step 3. The condition M max 1 3κ Σ d 11 and results in step 2, we have κ B d is satisfied by step 2. Thus by Lemma R M max 3 2 κ3 Σ M max d M max + κ B 3κ 3 Σ κ Γ c η d 2κ Γ c ηm δ } f n, Mp τ ηγ n + κ B 8 ηγ n 8, where the last inequality comes from the monotonicity of the tail function, the bound condition for the sample size n, and the fact that 2d κ Γ c η M δ f n, Mp τ 1 3κ B κ3 Σ κ Γ c η 3κ 3 Σ κ Γ c η = 1 3κ 3 Σ κ Γ c η κ B. Step 4. Steps 1 and 3 imply the strict dual feasibility in Lemma 10, and hence Θ = Θ by Lemma 9. Step 5. It follows from the results in steps 2 and 4 that the error bound C.8 holds with probability at least 1 Mp 2 τ. Step 6. For j, l S c, Θ jl F. Step 4 implies Θ S c = Θ S c. In the restricted fglasso problem C.13, we have Θ S c = Θ S c = 0. Therefore, E c Êc and part ii follows by taking the complement. C.6 Proof of Theorem 4 It follows from the proof and results of Theorem 3i that Θ Θ M max r 2c η κ Γ M δ f n, Mp τ and Θ = Θ hold with probability at least 1 Mp 2 τ. The lower bound for the sample size n in C.9 implies Θ min > 2c η κ Γ M δ f n, Mp τ r. By Lemma 13 we have Θ jl 0 for all j, l S, which entails that E Ê. Combining this result with Theorem 3ii yields E = Ê. 13

14 C.7 Proof of Lemma 1 Since both a = a T 1,..., a T p T and φ = φ T 1,..., φ T p T depend on M, we omit the corresponding superscripts to simplify the notation for readability. Let U = V \j, l} and a U, φ U denote p 2M-dimensional vectors excluding the jth and lth subvectors from a and φ, respectively. By definition 6, we have that, for any pair j, l V 2, j l, C M jl s, t = Cov a T j φ j s, a T l φ l t a T k φ k u, k j, l, u T = Cov a T j φ j s, a T l φ l t a k, k j, l = φ j s T Cova j, a l a U φ l t. C.17 The second equality comes from the following argument. For any k U and u T, g M k u = M m=1 a kmφ km u = a T k φ ku. By the orthogonality of φ km, it follows that there exists a one to one correspondence between a k } and g M k in k. u, u T }, which holds uniformly Since C.17 holds for all s, t T 2, we have that, for fixed pair j, l V 2, j l, C M jl s, t = 0 for all s, t T 2 if and only if Cova j, a l a U = 0. Let C jl = Cova j, a l a U for each pair j, l. Then it follows from multivariate normal theory that, for each j, l V 2, j l, C jl = Θ 1 jj Θ jl Θ 1 ll. Since both Θ jj and Θ ll are positive definite, we have C jl = 0 if and only if Θ jl = 0 for each pair j, l V 2, j l. This completes the proof. C.8 Lemma 2 and its Proof Lemma 2 Suppose that Conditions 2 3 hold. Then, for each j, l V 2, Θ jl Ω M O E 2 n α1 2ν β}, F jl,1 C.18 where Ω M jl,1 is the upper left M M submatrix of Ω jl. Proof. First we give some notations. For any p p matrix A = A ij 1i,jp, let tra = i A ii and A F = tra T A } 1/2. For any M1 p M 2 p block matrix A = A ij with A ij R M 1 M 2, 1 i, j p, we define A M 1,M 2 max 14 = max A ij F, and A M 1,M 2 = 1i,jp

15 max p 1ip j=1 A ij F. In a special case when M 1 = M 2 = M, denote A M 1,M 1 max and A M 1,M 1 by A max M and A M, respectively. For any block matrix A = A ij with A ij R M M, 1 i, j p, we define A M tr = max 1i,jp traii tra jj } 1/2. We now prove Lemma 2. For convenience, for j = 1,..., p, denote a ij = b T ij, c T ij T where b ij = a ij1,..., a ijm T and c ij = a ijm+1,..., a ijmn T. Define Σ to be the covariance matrix of b T 11,..., b T 1p, c T 11,..., c T 1p T. Then we can find that there exists a permutation matrix P π such that P π ΣP T π = Σ. Since P 1 π = P T π, Ω = P π Ω 1 P T π, which means that Ω is only a permutation of Θ. Let Σ = Σ 11 Σ 21 Σ12 Σ22 and Ω = Ω 11 Ω 21 Ω12 Ω22, where Ω 11 and Ω 11 are pm pm matrices and Ω 11 and Ω 22 are pm 2 pm 2 matrices with M 2 = M n M. Now we apply Lemma 5 to prove this lemma. By Condition 3, we see that Ω 12 M 1,M 2 O E n αν. Furthermore, since the diagonal entries of Σ 22 are eigenvalues λ s, we have Σ 22 M 2 tr O n α1 β}. Hence, it follows from Lemma 5 that Ω 11 Θ M max O E 2 n α1 2ν β}. As a result, for each pair j, l V 2, Θ jl Ω M F O E 2 n α1 2ν β}. This completes the proof for Lemma 2. jl,1 C.9 Lemma 3 and its Proof In general, for any 0, we define the corresponding truncated edge set E = j, l V 2 : j l, Θ jl F > }. Let S = E 1, 1,, p, p}. Denote S c to be the complement of S in V 2 with Θ jl F for j, l S. c Lemma 3 below ensures the equivalence between the true and truncated edge sets. Lemma 3 Under Conditions 2 4, let = C E 2 n α1 2ν β for some large constant C > 0, we have E = E. Proof. First, Lemma 2 implies that for each j, l V 2, Θ jl Ω M jl,1 F O E 2 n α1 2ν β. Hence, for each pair j, l E, Θ jl F Ω M jl,1 F Θ jl Ω M jl,1 F E 2 n α1 2ν β, and for j, l S c, Θ jl = Θ jl Ω M F O E 2 n α1 2ν β, since min Ω M F jl,1 15 j,l E jl,1 F

16 E 2 n α1 2ν β by Condition 4 and Ω M jl,1 = 0 if j, l S c. This means that for = F C E 2 n α1 2ν β with a large constant C, we obtain Θ jl F if j, l E but Θ jl F if j, l S c. Therefore, E = E as claimed. C.10 Lemma 4 and its Proof Lemma 4 For any A R p q, B R r r, and X R q r, we have traceax T BX X = BXA + B T XA T. C.19 Proof. Since dtraceax T BX = tracedax T BX + traceax T dbx, we have dtraceax T BX = tracedx T BXA + traceax T BdX = tracea T X T B T + AX T BdX. Hence traceaxt BX X = A T X T B T + AX T B T, which completes the proof. C.11 Lemma 5 and its Proof Lemma 5 Suppose that for a positive definite matrix H = is H11 H 12 H 21 H 22 H 11 H 12 H 21 H 22, its inverse H 1, where H 11 and H 11 are pm 1 pm 1 matrices and H 22 and H 22 are pm 2 pm 2 matrices. If H 22 M 2 tr λ and H 12 M 1,M 2 δ, then Proof. For a positive definite matrix H = H11 H 12 H 21 H 22 = H 11 H 1 11 M 1 max δ 2 λ. C.20 H 11 H 12 H T 12 H 22, its inverse H 1 is expressed as H H 1 11 H 12 D 1 H T 12H 1 11 H 1 11 H 12 D 1 D 1 H T 12H 1 11 D 1 with D = H 22 H T 12H 1 11 H 12. Since D is positive definite, D M 2 max D M 2 tr H 22 M 2 tr λ. Since H 12 = H 1 11 H 12 D 1, we have H 1 11 H 12 M 1,M 2 max = H 12 D M 1,M 2 max H 12 M 1,M 2 D M 2 max δλ. 16

17 Hence, The lemma is proved. H 11 H 1 11 M 1 max H 12 M 1,M 2 H 1 11 H 12 M 1,M 2 max δ 2 λ. C.12 Lemma 6 and its Proof Lemma 6 Suppose that Condition 1 holds. Then there exist two positive constants C k k = 1, 2 not depending on n and p such that, for 0 < δ C 1 and each j = 1,..., p, S P Kjj K jj δ C 2 exp C 1 nδ 2, where K jj and K jj are defined in Section 2.3. Proof. Without ambiguity, we drop the index j in the following. For a function Ks, t, define a functional l K φt = 1 0 Ks, tφsds and its norm l K S = k 1 l Kφ k 2 1/2. Then K K S = l K l K S. For j = 1,..., p, let X ij s, t = g ij sg ij t and D j s, t = ḡ j sḡ j t with ḡ j t = n 1 n g ijt. We know that nl K l K = n l X i l K + nl D and hence n K n K S l Xi l K + n l D S. S To prove this lemma, we are going to derive the following tail inequalities: a There exist two constants L 1 and L 2 such that for any δ > 0, n } P l Xi l K nδ S 2 exp b There exist two positive constants L 3 and L 4 such that for δ > 2λ 0 n 1, nδ 2 ; C.21 2L 1 + 2L 2 δ n 2 δ 2 P l D S δ exp. C.22 8L 3 + 8L 4 nδ 17

18 After getting the above two inequalities C.21 and C.22, we have that for all δ > λ 0 /2n, P n K K S nδ can be bounded by } n P l Xi l K nδ + P n l D S nδ 2 2 S nδ 2 n 2 δ 2 2 exp + exp. 8L 1 + 8L 2 δ 32L L 4 nδ Take C 1 = min1, L 1 L 1 2, 16L 1 1, 64L 4 1 } and C 2 = 3 expc 2 3 with C 3 = max2λ 0, L 3 L 1 4 }. As a result, we obtain for any δ with 0 < δ C 1, This lemma follows. P K } K S δ C 2 exp C 1 nδ 2. Now we turn to prove C.21. Note that E l Xi l K = 0 for each i. By Lemma 7, it suffices to show that there exist two positive constants L 1 and L 2 such that Note that l Xi l K 2 S n E l Xi l K k S 1 2 k!nl 1L k 2 2, k = 2,.... C.23 = m,m =1 Im = m. By Jensen s inequality, E l Xi l K k S = E a im a im λ mm 2 where λmm = λ m δ mm and δ mm = m,m =1 m,m =1 2 m=1 λ m λ m λ m λ m } k/2 2 ξ im ξ im δ mm k/2 1 m,m =1 k λ m Eξi1 2k + 1, k λ m λ m E ξ im ξ im δ mm where the inequality Eξi1 2 1 k 2 k Eξi1 2k k 2 is used. Since ξ i1 N0, 1, 2k + 1 E ξ i1 2k = π 1/2 2 k Γ 2 k k!. 2 Let L 2 = 4 m=1 λ m = 4λ 0 < and L 1 = 4L 2 2. Then, for k = 2, 3,..., n E l Xi l K k S L 2/2 k 2 2 k k! 1 2 k!nl 1L k

19 Next we consider to prove the inequality C.22. Suppose that we have shown E l D k S 1 2 n k k!l 3 L k 2 4, k = 2, 3,.... C.24 Then, the following inequality follows from Lemma 7: n 2 δ 2 P l D S E l D S δ exp 2L 3 + 2L 4 nδ for all δ > 0. Note that l D 2 S = n 2 m,m =1 λ mλ m ξ m ξm 2, where ξ m = n 1/2 n ξ im. Hence E l D S n 1 λ 0. As a result, for δ > 2n 1 λ 0, we have that n 2 δ 2 P l D S δ P l D S E l D S δ/2 exp. 8L 3 + 8L 4 nδ Hence, C.22 follows. Now we derive the upper bound of E l D k S for k 2 as in C.24. By Jensen s inequality, E l D k S 1 n k E 1 n k m,m =1 m,m =1 1 n k m=1 λ m λ m ξ im ξ im 2 } k/2 λ m λ m λ m k Eξ 2k i1 k/2 1 m,m =1 2 n λ m λ m E ξ im ξ im k k λ m k!. Let L 4 = 2 m=1 λ m and L 3 = 2L 2 4. Then E l D k S 2 1 n k k!l 3 L k 2 4. Lemma 6 is proved. m=1 C.13 Lemma 7 and its Proof Lemma 7 Let X 1,..., X n } be independent random variables in a separable Hilbert space with norm. If EX i = 0 i = 1,..., n and n E X i k k! 2 nl 1L k 2 2, k = 2, 3,..., for two positive constants L 1 and L 2, then for all δ > 0, n P X i nδ 2 exp nδ 2 2L 1 + 2L 2 δ Proof. This lemma can be derived directly from Theorem of Bosq 2000 and hence its proof is omitted. 19.

20 C.14 Lemma 8 and its Proof Lemma 8 Suppose that Condition 1 holds. Denote φ = sgn φ, φ φ. Then φ φ d Kjj K jj S, where d = 2 2 maxλ jk 1 λ 1, λ λ jk+1 1 } if k 2, and d j1 = 2 2λ j1 λ j2 1. Proof. omitted. This lemma can be found in Lemma 4.3 of Bosq 2000 and hence the proof is C.15 Lemma 9 and its Proof Lemma 9 For any γ n 0, the fglasso problem 9 has a unique solution that satisfies the optimal condition C.12 with Ẑ defined in C.11. Proof. The fglasso problem can be written in the constrained form min tracesθ log detθ}, C.25 j l Θ jl F Cγ n where Θ R Mp Mp is symmetric positive definite. The objective function is strictly convex in view of its Hessian and the constraint on the parameter space, if the minimum is attained then the solution is uniquely determined. We need to show that the minimum is achieved. Note the off block diagonal elements are bounded by satisfying j l Θ jl F Cλ <. By the fact that max A ij maxa ii for a positive definite matrix A, we only need to consider i,j i the possibly unbounded diagonal elements. By Hadamard s inequality for positive definite matrices, we have Mp tracesθ log detθ S ii Θ ii log detθ ii. The diagonal elements of S are positive. The objective function goes to infinity as any sequence Θ k 11,..., Θ k Mp,Mp, k 1, goes to infinity. Thus the minimum is uniquely achieved. 20

21 C.16 Lemma 10 and its Proof } Lemma 10 Suppose that max W M max, R M max Then Z S c constructed in C.14 satisfies Z S c M max < 1. ηγn 8, where W = W+Θ 1 B Θ 1. Proof. The optimal condition C.12 can be replaced by Θ 1 Θ 1 + W R + γ n Z = 0, and can be rewritten as Θ 1 Θ 1 + W R + γ n Z = 0. C.26 Note vecθ 1 Θ 1 = Θ 1 Θ 1 vec. Taking vectorization for C.26, we have Γ S S Γ S cs Γ S S c Γ S c Sc vec,s vec,s c + vecw,s vecw,s c vecr S vecr S c + γ n vec Z S vec Z S c = 0. C.27 We solve for vec,s from the first line and substitute it into the second line. vec Z S c can be represented as Then vec Z S c = 1 γ n Γ S c SΓ S S 1 vecw,s vecr S +Γ S c SΓ S S 1 vec Z S 1 γ n vecw,s c vecr S c. For any vector v = v j with v j R M 2, 1 j p, define v M 2 max = max v j 2 as the j M 2 -group version of l norm. Taking the M 2 -group l norm on both sides, it follows from C.33 and C.34 in Lemma 15 that vec Z S c M 2 max 1 Γ S γ csγ S S 1 M 2 vecw,s M 2 max + vecr S M 2 max n + Γ S csγ S S 1 M 2 vec Z S M 2 max + 1 vecw,s c γ M 2 max + vecr S c M 2 max. n 21

22 Note that vec Z S M 2 max 1 by construction. Applying C.30 in Lemma 15, the bound condition for W M max, R M max and Condition 6 yield Z S c M max 2 η γ n 2 η γ n W M max + R M ηγn 4 max + 1 η + 1 η η η < 1. C.17 Lemma 11 and its Proof Lemma 11 Suppose that M max 1 3κ Σ d κ B d, then J T 3 and 2 R M max 3 2 κ3 Σ M max d M max + κ B, where J = k=0 1k Θ 1 k and R = Θ 1 Θ 1 JΘ 1. Proof. By the fact that has at most d M M blocks whose Frobenius norm is at least for each column block, then M Lemma 15 and the bound condition for M max that Θ 1 M Θ 1 M M d M max. It follows from C.31, C.32 in + Θ 1 B M κ Σ d M max + κ B 1/3. Hence it follows from we have the convergent matrix expansion via Neumann series Θ + 1 = Θ 1 Θ 1 Θ 1 + Θ 1 Θ 1 JΘ 1. By the definitions of R and, we obtain R = Θ 1 Θ 1 JΘ 1. Let e j R Mp M with identity matrix in the jth block and zero matrix elsewhere, and x R Mp M with jth block x j R M M. Define x M max = max x j F and x M 1 = p j=1 x j F. Recall that given an M-block matrix A, we have defined M-block version of matrix -norm as A M = max p i j=1 A ij F. Define the corresponding M-block version of matrix 1-norm j 22

23 by A M 1 = max p A ij F. It follows from the inequalities in Lemma 15 that j R M max = max et i Θ 1 Θ 1 JΘ 1 e j F i,j max e T i Θ 1 M i maxmax Θ 1 JΘ 1 e j M 1 j max e T i Θ 1 M 1 M i maxmax Θ 1 JΘ 1 e j M 1 j = Θ 1 M M max Θ 1 JΘ 1 e j M 1 κ Σ M max Θ 1 J T Θ 1 M κ 2 Σ M max J T M Θ 1 M Note that J = k=0 1k Θ 1 k. It follows from C.32 in Lemma 15 that J T M k=0 Θ 1 M k 1 = 1 Θ 1 M 3 2. Hence it follows from C.28 that we can bound the second order remainder R by R M max 3 2 κ3 Σ M max d M max + κ B. C.18 Lemma 12 and its Proof Lemma 12 Suppose that r = 2κ Γ W M max + γ n min M max = Θ Θ M max r. } 1 1 3κ Σ d, 3κ 3 Σ κ Γ d κ B d. then Proof. Let GΘ S = Θ 1 S + S S + γ n ZS. We define a continuous map F : R M 2 S R M 2 S by F vec S = Γ S S 1 vecgθ S + S + vec S. C.28 Note that F vec S = vec S holds if and only if GΘ S + S = G Θ S = 0 by construction. We need to show that the function F maps the following ball Br onto itself Br = Θ S : Θ S M max r}, C.29 where r = 2κ Γ W M max +γ n. Note F is continuous and Br is convex and compact, then by Brouwer s fixed point theorem, there exists some fixed point S Br, which implies 23

24 that Θ S Θ S M max r. It remains to prove the claim F Br Br. Note that GΘ S + S = [Θ + 1 ] S + S S + γ n ZS = [Θ + 1 Θ 1 ] S + [S Θ 1 ] S + γ n ZS = [R Θ 1 Θ 1 ] S + W,S + γ n ZS. Then C.28 can be substituted by F vec S = Γ S S 1 vecr S }} T 1 Γ S S 1 vecw,s + γ n vec Z S. }} T 2 By the definition of κ Γ and C.33 in Lemma 15, T 2 can be bounded by T 2 M 2 max κ Γ W,S M max + γ n = r/2. With the assumed bound for r, we have M max r 1 3κ Σ d κ B d. Then an application of the bound for R in Lemma 11 yields T 1 M 2 max 3 2 κ Γ κ3 Σ M max d M M max max + κ B 2 r 2, where we have used the assumption M 1 max r 3κ 3 Σ κ Γ d κ B d. Therefore, we obtain which proves the claim. F vec S M 2 max T 1 M 2 max + T 2 M 2 max r, C.19 Lemma 13 and its Proof Lemma 13 Suppose that all conditions in Lemma 12 hold and Θ min = Θ min > 2κ Γ W M max + γ n, then Θ jl 0 for all j, l S. min j,l E Θ jl F satisfies Proof. From Lemma 12, we have Θ jl Θ jl F r for any j, l S. Thus Θ jl 0 for all j, l S follows immediately from the lower bound condition on Θ min. 24

25 C.20 Lemma 14 and its Proof Lemma 14 For any τ > 2 and sample size n such that δ f n, Mp τ 1/v, we have P W M max M δ f n, Mp τ Mp 2 τ. Proof. By the definition of the tail function in C.5, we have P W kl > δ 1 fn,δ, where W R Mp Mp and k, l 1,..., Mp} 2. It then follows from union bound of probability and δ = δ f n, Mp τ that P W M max M δ f n, Mp τ = P max i,j W ij F > Mδ M 2 p 2 fn, δ = Mp2 τ. C.21 Lemma 15 and its proof Lemma 15 Let A = A ij, B = B ij with A ij, B ij R M M, 1 i, j p, u = u j, v = v j with u j, v j R M, 1 j p, and x, y R Mp M with jth block x j, y j R M M, respectively. Then the following norm properties hold: A M max = veca M 2 max, C.30 A + B M AB M A M + B M, C.31 A M B M, C.32 Au M max A M u M max, u + v M max u M max + v M max, C.33 C.34 x T y M F x M max y M 1, C.35 Ax M max A M max x M 1, C.36 A M = A T M 1. C.37 Proof. Here we will only prove one inequality C.32. Other properties can be proved 25

26 using similar techniques, so we skip the details. From definition, we write which completes the proof. AB M = max i max i = max i max i p j=1 p p A ik B kj F k=1 j=1 k=1 p A ik F B kj F p A ik F k=1 p k=1 p B kj F j=1 A ik F max k = A M B M, p B kj F j=1 D Further Discussion D.1 Approximation for Multivaraite Functional Data One referee was concerned that, for multivariate functional data, the truncation approach through performing FPCA separately for each individual curve does not provide the best M- dimensional approximation. We refer to Chiou et al and Happ and Greven 2017 for some recent developments on the Karhunen-Loeve expansion for multivariate functional data with fixed p. However, this multivariate FPCA approach cannot handle high dimensional functional data when p is very large, posing additional challenges to derive the relevant concentration bounds. In contrast, our approach is easy to implement and we are able to derive the relevant concentration bounds. Under certain regularity conditions, we can prove that our truncation approach indeed can control the bias which approaches zero as M. Roughly speaking, suppose that for each j = 1,..., p, g j t = gj M t + ξ j t, t T, with ξ j 0 as M and E gj M t = E ξ j t = 0. It follows from the expansion, where Cov g j s, g k t Cov gj M s, gk Mt = Cov gj M s, ξ j t + Cov ξ j s, gk Mt + Cov ξ j s, ξ k t, and Cauchy-Schwarz inequality that E Cov g s,t T 2 j s, g k t Cov gj M s, gk Mt} 2 dsdt 9 supj g j 2 sup j ξ j 2. In 26

27 other words, if sup j g j 2 C with some positive constant C, the truncated bias can be controlled at the same order as sup j ξ j 2. D.2 Connection between the Fglasso Approach and 24 We discuss the connection between our proposed fglasso approach and the alternative method using the inverse correlation matrix discussed in Section 6. Let S M = D M R M D M, where D M is the diagonal matrix of S M with its j-th block given by D M j R M M, j = 1,..., p. We modify the penalty term in 9 and consider maximizing log detθ M traced M R M D M Θ M γ n D M j Θ M jl D M l F, C.38 over symmetric positive definite matrices Θ M R pm pm. Let Q M = D M Θ M D M, it is clear that the solution to the optimization problem C.38 is equivalent to 24 in Section 6. j l D.3 The Algorithm to Solve 24 Since the fglasso criterion in 9 and 24 discussed in Section 6 take a similar form, we develop Algorithm 6 to solve the optimization problem in 24 following an analogous procedure described in Section 3.1. Let Q j, P j and R j respectively be Mp 1 Mp 1 sub matrices excluding the jth row and column block of Q, P = Q 1 and R, and let q j, p j and r j be Mp 1 M matrices representing the jth column block after excluding the jth row block. Finally, let Q jj, P jj and R jj be the j, jth M M blocks in Q, P and R respectively. Then, for a fixed value of Q j, 24 can be solved by setting Q jj = R 1 jj + qt j Q 1 j q j, C.39 where q j = arg min q j tracer jj q T j Q 1 j q j + 2tracer T j q j + 2γ n } p 1 q jl F, C.40 where q jl represents the lth M M block of q j. The algorithm to solve 24 is summarized in Algorithm 6 below. l=1 27

28 Algorithm 6 The Algorithm to Solve Initialize Q = I Mp and P = I Mp. 2. Repeat until convergence for j = 1,..., p. a Compute Q 1 j P j p j P 1 jj pt j. b Solve for q j in C.40 using Algorithm 3 in Section B.1. c Reconstruct P using P jj = R jj, p j = V j R jj and P j = Q 1 j +V jr jj V T j, where 1 V j = Q j q j. 3. Set j, Ê = l : Q } jl F 0, j, l V 2, j l. References Bosq, D Linear Processes in Function Spaces, Springer, New York. Boucheron, S., Lugosi, G. and Massart, P Concentration Inequalities: A Nonasymptotic Theory of Independence, Oxford University Press. Chiou, J.-M., Chen, Y.-T. and Yang, Y.-F Multivariate functional principal component analysis: a normalization approach. Statistica Sinica., 24, Happ, C. and Greven, S Multivariate functional principal component analysis for data observed on different dimensional domains. Journal of the American Statistical Association, in press. Ravikumar, P., Wainwright, M., Raskutti, G. and Yu, B High-dimensional covariance estimation by minimizing l 1 -penalized log-determinant deivergence. Electronic Journal of Statistics., 5,

Functional Graphical Models

Functional Graphical Models Functional Graphical Models Xinghao Qiao 1, Shaojun Guo 2, and Gareth M. James 3 1 Department of Statistics, London School of Economics, U.K. 2 Institute of Statistics and Big Data, Renmin University of

More information

Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data

Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Raymond K. W. Wong Department of Statistics, Texas A&M University Xiaoke Zhang Department

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 17

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 17 EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 17 Andre Tkacenko Signal Processing Research Group Jet Propulsion Laboratory May 29, 2012 Andre Tkacenko

More information

1 Regression with High Dimensional Data

1 Regression with High Dimensional Data 6.883 Learning with Combinatorial Structure ote for Lecture 11 Instructor: Prof. Stefanie Jegelka Scribe: Xuhong Zhang 1 Regression with High Dimensional Data Consider the following regression problem:

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

On Expected Gaussian Random Determinants

On Expected Gaussian Random Determinants On Expected Gaussian Random Determinants Moo K. Chung 1 Department of Statistics University of Wisconsin-Madison 1210 West Dayton St. Madison, WI 53706 Abstract The expectation of random determinants whose

More information

A Constraint-Reduced MPC Algorithm for Convex Quadratic Programming, with a Modified Active-Set Identification Scheme

A Constraint-Reduced MPC Algorithm for Convex Quadratic Programming, with a Modified Active-Set Identification Scheme A Constraint-Reduced MPC Algorithm for Convex Quadratic Programming, with a Modified Active-Set Identification Scheme M. Paul Laiu 1 and (presenter) André L. Tits 2 1 Oak Ridge National Laboratory laiump@ornl.gov

More information

Extended Bayesian Information Criteria for Gaussian Graphical Models

Extended Bayesian Information Criteria for Gaussian Graphical Models Extended Bayesian Information Criteria for Gaussian Graphical Models Rina Foygel University of Chicago rina@uchicago.edu Mathias Drton University of Chicago drton@uchicago.edu Abstract Gaussian graphical

More information

Supplementary Materials to Convex Banding of the Covariance Matrix

Supplementary Materials to Convex Banding of the Covariance Matrix Supplementary Materials to Convex Banding of the Covariance Matrix Jacob Bien, Florentina Bunea, Luo Xiao May 25, 2015 A.1 Dual problem Define LΣ, A 1,..., A = 1 2 S Σ 2 F + λ W l A l, Σ. Observe that

More information

Submitted to the Brazilian Journal of Probability and Statistics

Submitted to the Brazilian Journal of Probability and Statistics Submitted to the Brazilian Journal of Probability and Statistics Multivariate normal approximation of the maximum likelihood estimator via the delta method Andreas Anastasiou a and Robert E. Gaunt b a

More information

Optimization Theory. A Concise Introduction. Jiongmin Yong

Optimization Theory. A Concise Introduction. Jiongmin Yong October 11, 017 16:5 ws-book9x6 Book Title Optimization Theory 017-08-Lecture Notes page 1 1 Optimization Theory A Concise Introduction Jiongmin Yong Optimization Theory 017-08-Lecture Notes page Optimization

More information

Basic Concepts in Matrix Algebra

Basic Concepts in Matrix Algebra Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1

More information

Lecture: Algorithms for LP, SOCP and SDP

Lecture: Algorithms for LP, SOCP and SDP 1/53 Lecture: Algorithms for LP, SOCP and SDP Zaiwen Wen Beijing International Center For Mathematical Research Peking University http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html wenzw@pku.edu.cn Acknowledgement:

More information

ELEMENTARY LINEAR ALGEBRA

ELEMENTARY LINEAR ALGEBRA ELEMENTARY LINEAR ALGEBRA K R MATTHEWS DEPARTMENT OF MATHEMATICS UNIVERSITY OF QUEENSLAND First Printing, 99 Chapter LINEAR EQUATIONS Introduction to linear equations A linear equation in n unknowns x,

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

2. Matrix Algebra and Random Vectors

2. Matrix Algebra and Random Vectors 2. Matrix Algebra and Random Vectors 2.1 Introduction Multivariate data can be conveniently display as array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns

More information

Assignment 1: From the Definition of Convexity to Helley Theorem

Assignment 1: From the Definition of Convexity to Helley Theorem Assignment 1: From the Definition of Convexity to Helley Theorem Exercise 1 Mark in the following list the sets which are convex: 1. {x R 2 : x 1 + i 2 x 2 1, i = 1,..., 10} 2. {x R 2 : x 2 1 + 2ix 1x

More information

High Dimensional Covariance and Precision Matrix Estimation

High Dimensional Covariance and Precision Matrix Estimation High Dimensional Covariance and Precision Matrix Estimation Wei Wang Washington University in St. Louis Thursday 23 rd February, 2017 Wei Wang (Washington University in St. Louis) High Dimensional Covariance

More information

Noisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get

Noisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get Supplementary Material A. Auxillary Lemmas Lemma A. Lemma. Shalev-Shwartz & Ben-David,. Any update of the form P t+ = Π C P t ηg t, 3 for an arbitrary sequence of matrices g, g,..., g, projection Π C onto

More information

We denote the derivative at x by DF (x) = L. With respect to the standard bases of R n and R m, DF (x) is simply the matrix of partial derivatives,

We denote the derivative at x by DF (x) = L. With respect to the standard bases of R n and R m, DF (x) is simply the matrix of partial derivatives, The derivative Let O be an open subset of R n, and F : O R m a continuous function We say F is differentiable at a point x O, with derivative L, if L : R n R m is a linear transformation such that, for

More information

Linear Algebra in Computer Vision. Lecture2: Basic Linear Algebra & Probability. Vector. Vector Operations

Linear Algebra in Computer Vision. Lecture2: Basic Linear Algebra & Probability. Vector. Vector Operations Linear Algebra in Computer Vision CSED441:Introduction to Computer Vision (2017F Lecture2: Basic Linear Algebra & Probability Bohyung Han CSE, POSTECH bhhan@postech.ac.kr Mathematics in vector space Linear

More information

L. Levaggi A. Tabacco WAVELETS ON THE INTERVAL AND RELATED TOPICS

L. Levaggi A. Tabacco WAVELETS ON THE INTERVAL AND RELATED TOPICS Rend. Sem. Mat. Univ. Pol. Torino Vol. 57, 1999) L. Levaggi A. Tabacco WAVELETS ON THE INTERVAL AND RELATED TOPICS Abstract. We use an abstract framework to obtain a multilevel decomposition of a variety

More information

Gaussian Graphical Models and Graphical Lasso

Gaussian Graphical Models and Graphical Lasso ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf

More information

Geometry of log-concave Ensembles of random matrices

Geometry of log-concave Ensembles of random matrices Geometry of log-concave Ensembles of random matrices Nicole Tomczak-Jaegermann Joint work with Radosław Adamczak, Rafał Latała, Alexander Litvak, Alain Pajor Cortona, June 2011 Nicole Tomczak-Jaegermann

More information

Additive functionals of infinite-variance moving averages. Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535

Additive functionals of infinite-variance moving averages. Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535 Additive functionals of infinite-variance moving averages Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535 Departments of Statistics The University of Chicago Chicago, Illinois 60637 June

More information

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS G. RAMESH Contents Introduction 1 1. Bounded Operators 1 1.3. Examples 3 2. Compact Operators 5 2.1. Properties 6 3. The Spectral Theorem 9 3.3. Self-adjoint

More information

Kernel Methods. Machine Learning A W VO

Kernel Methods. Machine Learning A W VO Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance

More information

Random matrices: Distribution of the least singular value (via Property Testing)

Random matrices: Distribution of the least singular value (via Property Testing) Random matrices: Distribution of the least singular value (via Property Testing) Van H. Vu Department of Mathematics Rutgers vanvu@math.rutgers.edu (joint work with T. Tao, UCLA) 1 Let ξ be a real or complex-valued

More information

Convex Geometry. Carsten Schütt

Convex Geometry. Carsten Schütt Convex Geometry Carsten Schütt November 25, 2006 2 Contents 0.1 Convex sets... 4 0.2 Separation.... 9 0.3 Extreme points..... 15 0.4 Blaschke selection principle... 18 0.5 Polytopes and polyhedra.... 23

More information

A matrix over a field F is a rectangular array of elements from F. The symbol

A matrix over a field F is a rectangular array of elements from F. The symbol Chapter MATRICES Matrix arithmetic A matrix over a field F is a rectangular array of elements from F The symbol M m n (F ) denotes the collection of all m n matrices over F Matrices will usually be denoted

More information

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J 7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

Signal Recovery from Permuted Observations

Signal Recovery from Permuted Observations EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,

More information

Nonlinear function on function additive model with multiple predictor curves. Xin Qi and Ruiyan Luo. Georgia State University

Nonlinear function on function additive model with multiple predictor curves. Xin Qi and Ruiyan Luo. Georgia State University Statistica Sinica: Supplement Nonlinear function on function additive model with multiple predictor curves Xin Qi and Ruiyan Luo Georgia State University Supplementary Material This supplementary material

More information

Schwarz Preconditioner for the Stochastic Finite Element Method

Schwarz Preconditioner for the Stochastic Finite Element Method Schwarz Preconditioner for the Stochastic Finite Element Method Waad Subber 1 and Sébastien Loisel 2 Preprint submitted to DD22 conference 1 Introduction The intrusive polynomial chaos approach for uncertainty

More information

Stein s Method and Characteristic Functions

Stein s Method and Characteristic Functions Stein s Method and Characteristic Functions Alexander Tikhomirov Komi Science Center of Ural Division of RAS, Syktyvkar, Russia; Singapore, NUS, 18-29 May 2015 Workshop New Directions in Stein s method

More information

IV. Matrix Approximation using Least-Squares

IV. Matrix Approximation using Least-Squares IV. Matrix Approximation using Least-Squares The SVD and Matrix Approximation We begin with the following fundamental question. Let A be an M N matrix with rank R. What is the closest matrix to A that

More information

(Part 1) High-dimensional statistics May / 41

(Part 1) High-dimensional statistics May / 41 Theory for the Lasso Recall the linear model Y i = p j=1 β j X (j) i + ɛ i, i = 1,..., n, or, in matrix notation, Y = Xβ + ɛ, To simplify, we assume that the design X is fixed, and that ɛ is N (0, σ 2

More information

Optimal series representations of continuous Gaussian random fields

Optimal series representations of continuous Gaussian random fields Optimal series representations of continuous Gaussian random fields Antoine AYACHE Université Lille 1 - Laboratoire Paul Painlevé A. Ayache (Lille 1) Optimality of continuous Gaussian series 04/25/2012

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

Adaptive estimation of the copula correlation matrix for semiparametric elliptical copulas

Adaptive estimation of the copula correlation matrix for semiparametric elliptical copulas Adaptive estimation of the copula correlation matrix for semiparametric elliptical copulas Department of Mathematics Department of Statistical Science Cornell University London, January 7, 2016 Joint work

More information

Vector fields Lecture 2

Vector fields Lecture 2 Vector fields Lecture 2 Let U be an open subset of R n and v a vector field on U. We ll say that v is complete if, for every p U, there exists an integral curve, γ : R U with γ(0) = p, i.e., for every

More information

a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b 2.

a 11 x 1 + a 12 x a 1n x n = b 1 a 21 x 1 + a 22 x a 2n x n = b 2. Chapter 1 LINEAR EQUATIONS 11 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,, a n, b are given real

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 3: Positive-Definite Systems; Cholesky Factorization Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 11 Symmetric

More information

VISCOSITY SOLUTIONS. We follow Han and Lin, Elliptic Partial Differential Equations, 5.

VISCOSITY SOLUTIONS. We follow Han and Lin, Elliptic Partial Differential Equations, 5. VISCOSITY SOLUTIONS PETER HINTZ We follow Han and Lin, Elliptic Partial Differential Equations, 5. 1. Motivation Throughout, we will assume that Ω R n is a bounded and connected domain and that a ij C(Ω)

More information

CONSTRUCTION OF SLICED ORTHOGONAL LATIN HYPERCUBE DESIGNS

CONSTRUCTION OF SLICED ORTHOGONAL LATIN HYPERCUBE DESIGNS Statistica Sinica 23 (2013), 1117-1130 doi:http://dx.doi.org/10.5705/ss.2012.037 CONSTRUCTION OF SLICED ORTHOGONAL LATIN HYPERCUBE DESIGNS Jian-Feng Yang, C. Devon Lin, Peter Z. G. Qian and Dennis K. J.

More information
