EFFECTIVE AND ROBUST PRECONDITIONING OF GENERAL SPD MATRICES VIA STRUCTURED INCOMPLETE FACTORIZATION


JIANLIN XIA AND ZIXING XIN

Abstract. For general symmetric positive definite (SPD) matrices, we present a framework for designing effective and robust black-box preconditioners via structured incomplete factorization. In a scaling-and-compression strategy, off-diagonal blocks are first scaled on both sides by the inverses of the factors of the corresponding diagonal blocks and then compressed into low-rank approximations. ULV-type factorizations are then computed. A resulting prototype preconditioner is always positive definite. Generalizations to practical hierarchical multilevel preconditioners are given. Systematic analysis of the approximation error, robustness, and effectiveness is shown for both the prototype preconditioner and the multilevel generalization. In particular, we show how local scaling and compression control the approximation accuracy and robustness, and how aggressive compression leads to efficient preconditioners that can significantly reduce the condition number and improve the eigenvalue clustering. A result is also given to show when the multilevel preconditioner preserves positive definiteness. The costs to apply the multilevel preconditioners are about O(N), where N is the matrix size. Numerical tests on several ill-conditioned problems show the effectiveness and robustness even if the compression uses very small numerical ranks. In addition, significant robustness and effectiveness benefits can be observed as compared with a standard rank structured preconditioner based on direct off-diagonal compression.

Key words. effective preconditioning, approximation error, preservation of positive definiteness, structured incomplete factorization, scaling-and-compression strategy, multilevel scheme

AMS subject classifications. 15A23, 65F10, 65F30

1. Introduction. Preconditioners play a key role in iterative solutions of linear systems. For symmetric positive definite (SPD) matrices, incomplete factorizations have often been used to construct robust preconditioners; see [1, 2, 3, 13, 19, 22] for some examples. In recent years, a strategy based on low-rank approximations has been successfully used to design preconditioners, where certain off-diagonal blocks are approximated by low-rank forms. Such rank structured preconditioners often target some specific problems, usually PDEs and integral equations (see, e.g., [6, 10, 11, 14]). For more general sparse matrices, some algebraic preconditioners that incorporate low-rank approximations appear in [15, 16, 17, 5].

For general dense SPD matrices A, limited work is available on the feasibility and effectiveness of preconditioning techniques based on low-rank off-diagonal approximations. In general, when the off-diagonal blocks of A are directly approximated by low-rank forms with very low accuracy, it is not clear how effective the resulting preconditioner is. The preconditioner may also fail to preserve the positive definiteness that is crucial in many applications. Preliminary studies are conducted in [1, 9, 3], where sequential Schur complement computations are used to ensure positive definiteness via an idea of Schur monotonicity or Schur compensation. In [9], the effectiveness for preconditioning is analyzed in terms of a special case.

In this paper, we consider the preconditioning of a general SPD matrix A based on incomplete factorizations that involve low-rank off-diagonal approximations. For convenience, all our discussions are in terms of real dense SPD matrices. We present a framework to construct effective and robust black-box structured preconditioners that are convenient to analyze.
Department of Mathematics, Purdue University, West Lafayette, IN (xiaj@math.purdue.edu, zxin@purdue.edu). The research of Jianlin Xia was supported in part by an NSF CAREER Award (DMS).

In the framework, we compute a structured incomplete factorization (SIF)

    A ≈ \widetilde{L} \widetilde{L}^T,

where appropriate off-diagonal blocks are first scaled on both sides and then compressed into low-rank approximations. The resulting preconditioners, called SIF preconditioners, can effectively reduce the condition number of A and bring the eigenvalues to cluster around 1. They are also robust in the sense that they can preserve the positive definiteness, either unconditionally (in a prototype case) or under a condition (in a general case). More specifically, our main contributions include: (1) a fundamental SIF preconditioning framework, (2) the multilevel generalization, and (3) systematic analysis. The details are as follows.

We give a fundamental SIF framework for designing effective and robust structured preconditioners for A. The framework includes a scaling-and-compression strategy for matrix approximation and a ULV-type factorization. That is, instead of directly compressing an off-diagonal block, we scale the off-diagonal block on both sides with the inverses of the factors of the corresponding diagonal blocks and then perform the compression. A sequence of orthogonal and triangular factors (called ULV factors) is then computed. \widetilde{L} is defined by these ULV factors, and \widetilde{L}\widetilde{L}^T serves as the preconditioner.

We show that this preconditioner has superior effectiveness and robustness. The basic idea is illustrated in terms of a prototype preconditioner, which is shown to be always positive definite regardless of the accuracy used in the compression. We can also show that the local compression tolerance τ precisely controls the overall relative approximation accuracy of the matrix and the factor (see Theorem 2.4). Furthermore, we can justify the effectiveness of the preconditioner based on how τ impacts the eigenvalue distribution and the condition number of the preconditioned matrix \widetilde{L}^{-1} A \widetilde{L}^{-T}. In fact, the eigenvalues of \widetilde{L}^{-1} A \widetilde{L}^{-T} cluster inside [1 − τ, 1 + τ]. Moreover, as long as the singular values of the scaled off-diagonal block decay even slightly, we can aggressively truncate them so as to get a compact preconditioner \widetilde{L}\widetilde{L}^T that significantly reduces the condition number of \widetilde{L}^{-1} A \widetilde{L}^{-T}. This is rigorously characterized in terms of a representation (1 + τ)/(1 − τ) in Theorem 2.5.

We further generalize the prototype preconditioner to multiple levels, where multilevel scaling-and-compression operations are combined with a hierarchical ULV-type factorization. The robustness, accuracy, and effectiveness of the multilevel preconditioner \widetilde{L}\widetilde{L}^T are also shown. A condition on τ is given to ensure that the preconditioner is positive definite (Theorem 3.1). We also prove a bound on the condition number of \widetilde{L}^{-1} A \widetilde{L}^{-T} in a form similar to that in the prototype case, i.e., (1 + ϵ)/(1 − ϵ) in Theorem 3.2, where ϵ is related to τ. The eigenvalues of \widetilde{L}^{-1} A \widetilde{L}^{-T} cluster inside [1/(1 + ϵ), 1/(1 − ϵ)]. The accuracy and effectiveness analysis confirms that, with low-accuracy approximation, the preconditioners can significantly reduce the condition number and improve the eigenvalue clustering.

The SIF framework and preconditioning techniques have some attractive features.
1. The scaling strategy can enhance the compressibility of the off-diagonal blocks in some problems, where those blocks may have singular values very close to each other. An example is the five-point discrete Laplacian matrix from a 2D regular grid, which has the identity matrix as the nonzero subblock in its off-diagonal blocks.
2. The scaling-and-compression strategy also propagates diagonal block information to off-diagonal blocks, so that it is convenient to justify the robustness and effectiveness. In the earlier work in [9], only the leading diagonal block information is propagated to an off-diagonal block.

3. Unlike the method in [9] that involves Schur complement computations in a sequential fashion, our SIF framework computes hierarchical ULV-type factorizations and avoids the use of sequential Schur complement computations. The construction of the preconditioner follows a binary tree, where the operations at the same level can be performed simultaneously. This potentially leads to much better scalability.
4. It is convenient to obtain a general multilevel preconditioner by repeatedly applying the scaling-and-compression strategy. The cost to apply the resulting preconditioner is O(N log N), where N is the order of A. A modified version is also provided and costs only O(N) to apply. The O(N log N) version is generally more effective, while the O(N) version is slightly more robust in practice.
5. Systematic analysis of the approximation accuracy, robustness, and effectiveness is given, not only for the prototype preconditioner, but also for the multilevel one. Multilevel error accumulation is analyzed. The eigenvalue clustering and the resulting condition number are also shown. The studies in [9] are restricted to a special 1-level case.

The performance of the preconditioners is illustrated in some numerical tests. In particular, we use aggressive compression with very small ranks in the approximation of the scaled off-diagonal blocks and still get satisfactory convergence in iterative solutions. For some highly ill-conditioned problems, such as some interpolation matrices in radial basis function methods, our preconditioners can still preserve positive definiteness and also greatly accelerate the convergence. On the other hand, a structured preconditioner based on direct off-diagonal compression (as in traditional rank structured methods) yields slower convergence and also fails to be positive definite.

The outline of the paper is as follows. We first present the fundamental SIF preconditioning framework in Section 2; the prototype preconditioner and its analysis are included. The generalization of the framework and analysis to multiple levels is given in Section 3, followed by numerical tests in Section 4. The following notation will be used throughout the discussions. λ(A) denotes an eigenvalue of A, and we usually use λ(A) to mean any eigenvalue of A in general; λ_1(A) and λ_N(A) specifically denote the largest and smallest eigenvalues of A, respectively. ρ(A) denotes the spectral radius of A. κ(A) is the 2-norm condition number of A. diag(·) represents a diagonal or block diagonal matrix with the given diagonal entries/blocks.

2. Fundamental SIF preconditioning framework and analysis. One essential idea of our preconditioners is a scaling-and-compression strategy: the off-diagonal blocks are first scaled on both sides by the inverses of the factors of the corresponding diagonal blocks and then compressed. The work in [9] only uses the scaling on the left by the inverse of the leading factor, and the scaling on both sides is mentioned without details. The double-sided scaling strategy was first formally mentioned in our presentation [6]. Here, we show that, with the double-sided scaling and compression, it becomes very convenient to justify the effectiveness and also to generalize to multiple levels in the next section. Also unlike in [9], our schemes do not need sequential Schur complement computations.

In this section, we present our scaling-and-compression strategy in a prototype preconditioner. Then we rigorously show its properties. The preconditioner and analysis will be generalized to multiple levels in Section 3.

2.1. Scaling-and-compression strategy. Consider an N × N SPD matrix A partitioned into a 2 × 2 block form

(2.1)    A = [ A_{11}  A_{12} ; A_{12}^T  A_{22} ],

where the two diagonal blocks A_{11} and A_{22} are square matrices. Suppose the Cholesky factorizations of the diagonal blocks look like

(2.2)    A_{11} = L_1 L_1^T,    A_{22} = L_2 L_2^T.

Then

(2.3)    A = diag(L_1, L_2) [ I  C ; C^T  I ] diag(L_1, L_2)^T,

where the identity matrices may have different sizes, and

(2.4)    C = L_1^{-1} A_{12} L_2^{-T}.

Compute an SVD

(2.5)    C = ( U_1  \hat U_1 ) diag(Σ, \hat Σ) ( U_2  \hat U_2 )^T = U_1 Σ U_2^T + \hat U_1 \hat Σ \hat U_2^T,

where Σ = diag(σ_1, ..., σ_r) is for the leading singular values σ_1, ..., σ_r of C, and \hat Σ = diag(σ_{r+1}, σ_{r+2}, ...) is for the remaining singular values σ_{r+1}, σ_{r+2}, ... of C. Σ is a square matrix, and \hat Σ may be rectangular. ( U_1  \hat U_1 ) and ( U_2  \hat U_2 ) are square matrices. Suppose all the singular values are ordered from the largest to the smallest, and

(2.6)    σ_{r+1} ≤ τ ≤ σ_r.

Later, τ will be used as a tolerance for the truncation of the singular values in \hat Σ. Before we proceed, we give the following simple lemma.

Lemma 2.1. Suppose C is an m × n matrix with singular values σ_i, i = 1, 2, ..., min{m, n}. Let H = [ I  C ; C^T  I ]. Then H has an eigenvalue 1 with multiplicity m + n − 2 min{m, n}, and its remaining eigenvalues are given by 1 ± σ_i, i = 1, 2, ..., min{m, n}. Furthermore, H is SPD if and only if σ_i < 1, i = 1, 2, ..., min{m, n}.

Proof. Suppose \bar Σ is an m × n diagonal matrix with the main diagonal entries given by σ_i, i = 1, 2, ..., min{m, n}. Then H is orthogonally similar to [ I  \bar Σ ; \bar Σ^T  I ], whose eigenvalues are either 1 or 1 ± σ_i. This result then immediately implies that H is positive definite if and only if σ_i < 1, i = 1, 2, ..., min{m, n}.
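To make the scaling-and-compression step concrete, the following Python sketch (ours, not from the paper; the function and variable names are illustrative) carries out (2.2)-(2.6) for a dense SPD matrix: it factorizes the diagonal blocks, scales the off-diagonal block on both sides, and truncates its SVD at the absolute tolerance τ.

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular, svd

    def scale_and_compress(A, m, tau):
        # Split A into 2 x 2 blocks with a leading m x m diagonal block.
        A11, A12, A22 = A[:m, :m], A[:m, m:], A[m:, m:]
        L1 = cholesky(A11, lower=True)             # A11 = L1 L1^T  (2.2)
        L2 = cholesky(A22, lower=True)             # A22 = L2 L2^T
        # Double-sided scaling: C = L1^{-1} A12 L2^{-T}  (2.4)
        C = solve_triangular(L1, A12, lower=True)
        C = solve_triangular(L2, C.T, lower=True).T
        # Truncated SVD: keep the singular values above the tolerance tau  (2.5)-(2.6)
        U, s, Vt = svd(C, full_matrices=False)
        r = int(np.count_nonzero(s > tau))
        return L1, L2, U[:, :r], s[:r], Vt[:r].T   # L1, L2, U1, diag(Sigma), U2

By Lemma 2.2 below, every singular value of C is below 1, so this truncation always keeps σ_i < 1.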

Since A is SPD, so is [ I  C ; C^T  I ] in (2.3). Then Lemma 2.1 leads to the following result.

Lemma 2.2. All the singular values σ_i of C in (2.4) satisfy σ_i < 1, i = 1, 2, ....

To construct a structured incomplete preconditioner, we truncate the singular values of C in (2.4) by dropping \hat Σ in (2.5). Then

(2.7)    A ≈ \widetilde A = diag(L_1, L_2) [ I  U_1 Σ U_2^T ; U_2 Σ U_1^T  I ] diag(L_1, L_2)^T.

We can continue to factorize the matrix in the middle of the right-hand side so as to obtain a preconditioner. For the analysis purpose, we tentatively use the Cholesky factorization. (Later in Section 2.4, this will be replaced by a more practical ULV factorization.) That is,

(2.8)    [ I  U_1 Σ U_2^T ; U_2 Σ U_1^T  I ] = [ I  0 ; U_2 Σ U_1^T  I ] [ I  0 ; 0  \widetilde S ] [ I  U_1 Σ U_2^T ; 0  I ],

where \widetilde S is the Schur complement

(2.9)    \widetilde S = I − U_2 Σ^2 U_2^T.

From Proposition 2.3 below, we can see that \widetilde S is positive definite, so assume \widetilde S = \widetilde D \widetilde D^T is its Cholesky factorization. Thus,

    [ I  U_1 Σ U_2^T ; U_2 Σ U_1^T  I ] = [ I  0 ; U_2 Σ U_1^T  \widetilde D ] [ I  U_1 Σ U_2^T ; 0  \widetilde D^T ].

Then a prototype preconditioner looks like

(2.10)    \widetilde A = \widetilde L \widetilde L^T,  with  \widetilde L = diag(L_1, L_2) [ I  0 ; U_2 Σ U_1^T  \widetilde D ] = [ L_1  0 ; L_2 U_2 Σ U_1^T  L_2 \widetilde D ].

Remark 2.1. Our scheme here is more general than the one in [9], and there are some fundamental differences.
1. Double-sided scaling L_1^{-1} A_{12} L_2^{-T} is used here in (2.4), while single-sided scaling L_1^{-1} A_{12} is used in [9]. The double-sided scaling and compression make it convenient to control the approximation accuracy of \widetilde A, and can be used to produce more general and comprehensive analysis (see the next subsection). The analysis in [9] is restricted to a special case.
2. Explicit Schur complement computation is needed in [9] to get a structured Cholesky factorization, which is essentially a sequential process. Our ULV factorization in Section 2.4 will avoid the Schur complements. The mention of the Schur complement above is only for the analysis purpose, and is not needed in the implementation.
3. The generalization of the prototype preconditioner (based on 1-level block partitioning) to multiple levels is also more natural via repeated application. We will further provide accuracy, effectiveness, and robustness analysis for our multilevel version (see Section 3). No such type of analysis is available in [9].
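A dense illustration of the prototype factor (2.10) can be assembled from the output of the sketch above. This only mirrors the algebra (the practical scheme in Section 2.4 replaces the explicit Schur complement and its Cholesky factorization with a ULV factorization), and the helper name prototype_factor is ours.

    import numpy as np
    from scipy.linalg import cholesky

    def prototype_factor(L1, L2, U1, sigma, U2):
        # Assemble Ltil with A ~ Ltil Ltil^T as in (2.10).
        n2 = L2.shape[0]
        S = np.eye(n2) - (U2 * sigma**2) @ U2.T    # Schur complement (2.9), always SPD
        D = cholesky(S, lower=True)                # S = D D^T
        top = np.hstack([L1, np.zeros((L1.shape[0], n2))])
        bot = np.hstack([L2 @ (U2 * sigma) @ U1.T, L2 @ D])
        return np.vstack([top, bot])               # the prototype factor Ltil

The product Ltil @ Ltil.T is then positive definite no matter how aggressively the singular values were truncated, which is exactly the content of Proposition 2.3 below.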

2.2. Robustness and accuracy. We now discuss the robustness and accuracy of the prototype preconditioner \widetilde A in (2.7) and (2.10). We show that \widetilde A is guaranteed to be positive definite, and that it approximates A with a relative error bound precisely controlled by the absolute compression tolerance τ. The tolerance τ also controls the approximation accuracy of \widetilde L.

Proposition 2.3. \widetilde A in (2.7), resulting from the truncation of \hat Σ in (2.5), is always positive definite.

Proof. There are multiple ways to show the result. We prove it from an aspect that clearly reflects the robustness of the preconditioner. Since [ I  C ; C^T  I ] is SPD, the following Schur complement is also SPD:

(2.11)    S = I − C^T C = I − U_2 Σ^2 U_2^T − \hat U_2 \hat Σ^T \hat Σ \hat U_2^T.

Note that \hat Σ is allowed to be rectangular. On the other hand, (2.9) means

(2.12)    \widetilde S = S + \hat U_2 \hat Σ^T \hat Σ \hat U_2^T.

Thus, \widetilde S in (2.9) is SPD. Accordingly, [ I  U_1 Σ U_2^T ; U_2 Σ U_1^T  I ] in (2.8) is SPD, and (2.7) then indicates that \widetilde A is SPD.

The relationship in (2.12) is consistent with the idea of Schur monotonicity or Schur compensation in [1, 9], and leads to the enhanced robustness of \widetilde A. An alternative proof can be based on Lemma 2.1.

We then show the approximation accuracy of \widetilde A and \widetilde L in terms of the truncation tolerance τ.

Theorem 2.4. Suppose L is the exact lower triangular Cholesky factor of A. Then

    ‖A − \widetilde A‖_2 ≤ τ ‖A‖_2,    ‖\widetilde L − L‖_2 ≤ τ ( 1 + d_n ( sqrt(1 − σ_n^2) / (1 − σ_1^2) ) τ ) ‖L‖_2,

where d_n = 1/2 + log_2 n and n is the column size of C.

Proof. Note that (2.6) and Lemma 2.2 imply τ < 1. According to (2.3), (2.5), and (2.7), the approximation error matrix is given by

    A − \widetilde A = diag(L_1, L_2) [ 0  \hat U_1 \hat Σ \hat U_2^T ; \hat U_2 \hat Σ^T \hat U_1^T  0 ] diag(L_1, L_2)^T
                     = [ 0  L_1 \hat U_1 \hat Σ \hat U_2^T L_2^T ; L_2 \hat U_2 \hat Σ^T \hat U_1^T L_1^T  0 ].

Thus,

    ‖A − \widetilde A‖_2 = ‖L_1 \hat U_1 \hat Σ \hat U_2^T L_2^T‖_2 ≤ ‖L_1‖_2 ‖\hat Σ‖_2 ‖L_2‖_2 = σ_{r+1} sqrt(‖A_{11}‖_2 ‖A_{22}‖_2) ≤ τ ‖A‖_2,

where we have used the relationship between the 2-norm of an SPD matrix and the 2-norm of its Cholesky factor.

Next, consider the approximation accuracy of \widetilde L. Suppose S = D D^T is the Cholesky factorization of S in (2.11). According to (2.3) and (2.5),

(2.13)    L = diag(L_1, L_2) [ I  0 ; C^T  D ] = [ L_1  0 ; L_2 ( U_2 Σ U_1^T + \hat U_2 \hat Σ^T \hat U_1^T )  L_2 D ].

This together with (2.10) yields

    \widetilde L − L = [ 0  0 ; −L_2 \hat U_2 \hat Σ^T \hat U_1^T  L_2 ( \widetilde D − D ) ].

Thus,

    ‖\widetilde L − L‖_2 ≤ ‖L_2‖_2 ( ‖\hat U_2 \hat Σ^T \hat U_1^T‖_2 + ‖\widetilde D − D‖_2 )
                        ≤ ( τ + ‖\widetilde D − D‖_2 ) ‖L_2‖_2
                        ≤ ( τ + ‖\widetilde D − D‖_2 ) sqrt(‖A‖_2)
                        = ( τ + ‖\widetilde D − D‖_2 ) ‖L‖_2.

According to [9], the Cholesky factor \widetilde D of \widetilde S satisfies the approximation error bound

    ‖\widetilde D − D‖_2 ≤ d_n ‖S^{-1}‖_2 ‖D‖_2 ‖\widetilde S − S‖_2 = d_n ‖S^{-1}‖_2 ‖D‖_2 σ_{r+1}^2,

where d_n = 1/2 + log_2 n and n is the column size of C (and also the size of S). From (2.5) and (2.11), and noticing Lemma 2.1, we can see that

    ‖S^{-1}‖_2 = 1 / (1 − σ_1^2),    ‖D‖_2 = sqrt(1 − σ_n^2).

Thus,

    ‖\widetilde D − D‖_2 ≤ d_n ( sqrt(1 − σ_n^2) / (1 − σ_1^2) ) σ_{r+1}^2 ≤ d_n ( sqrt(1 − σ_n^2) / (1 − σ_1^2) ) τ^2,

    ‖\widetilde L − L‖_2 ≤ ( τ + d_n ( sqrt(1 − σ_n^2) / (1 − σ_1^2) ) τ^2 ) ‖L‖_2 = τ ( 1 + d_n ( sqrt(1 − σ_n^2) / (1 − σ_1^2) ) τ ) ‖L‖_2.

Theorem 2.4 and Lemma 2.2 indicate that the absolute tolerance τ in (2.6) for the compression of C essentially plays the role of a relative error bound in the approximation of A by \widetilde A. τ also controls the relative approximation accuracy of the Cholesky factor L. In fact, if τ is not too large and satisfies d_n ( sqrt(1 − σ_n^2) / (1 − σ_1^2) ) τ = O(1), then

    ‖\widetilde L − L‖_2 = O(τ) ‖L‖_2.
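The two bounds above are easy to check numerically. The snippet below (an illustration with our own choice of test matrix and tolerance, reusing the two sketches given earlier) verifies the relative error bound of Theorem 2.4 and the unconditional positive definiteness of Proposition 2.3.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    B = rng.standard_normal((n, n))
    A = B @ B.T + n * np.eye(n)                    # a generic SPD test matrix (illustrative)
    tau = 0.5
    L1, L2, U1, sigma, U2 = scale_and_compress(A, n // 2, tau)
    Ltil = prototype_factor(L1, L2, U1, sigma, U2)
    Atil = Ltil @ Ltil.T
    rel_err = np.linalg.norm(A - Atil, 2) / np.linalg.norm(A, 2)
    print(rel_err <= tau)                          # ||A - Atil||_2 <= tau ||A||_2 (Theorem 2.4)
    print(np.all(np.linalg.eigvalsh(Atil) > 0))    # Atil stays positive definite (Proposition 2.3)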

2.3. Effectiveness of the preconditioner. Theorem 2.4 indicates that we can use τ to control how accurately \widetilde A approximates A and \widetilde L approximates L. Moreover, since we are interested in preconditioning, we would like to show that τ does not have to be very small to yield an effective preconditioner. A larger tolerance τ means a more compact SIF preconditioner that is more efficient to apply. Note that our scheme here is different from and more general than the one in [9] (see Remark 2.1). However, we can still prove effectiveness results similar to those in [9] and even stronger, though the double-sided scaling makes the analysis much less trivial here.

Theorem 2.5. For \widetilde L in (2.10), we have

    \widetilde L^{-1} A \widetilde L^{-T} = [ I  \hat C ; \hat C^T  I ],  where  \hat C = \hat U_1 \hat Σ \hat U_2^T \widetilde D^{-T},  and

(2.14)    ‖\hat C‖_2 = σ_{r+1} ≤ τ.

Moreover,

(2.15)    ‖\widetilde L^{-1} A \widetilde L^{-T} − I‖_2 = σ_{r+1},
(2.16)    |λ( \widetilde L^{-1} A \widetilde L^{-T} ) − 1| ≤ σ_{r+1},
(2.17)    κ( \widetilde L^{-1} A \widetilde L^{-T} ) = (1 + σ_{r+1}) / (1 − σ_{r+1}) ≤ (1 + τ) / (1 − τ).

Proof. According to (2.10) and (2.13),

    \widetilde L^{-1} A \widetilde L^{-T} = [ I  0 ; U_2 Σ U_1^T  \widetilde D ]^{-1} [ I  C ; C^T  I ] [ I  U_1 Σ U_2^T ; 0  \widetilde D^T ]^{-1}.

Using C = U_1 Σ U_2^T + \hat U_1 \hat Σ \hat U_2^T together with U_1^T \hat U_1 = 0 and U_2^T \hat U_2 = 0, a direct block computation gives

    \widetilde L^{-1} A \widetilde L^{-T} = [ I  \hat C ; \hat C^T  \widetilde D^{-1} ( \hat U_2 \hat Σ^T \hat Σ \hat U_2^T + S ) \widetilde D^{-T} ] = [ I  \hat C ; \hat C^T  I ],

where the last equality follows from (2.12) and \widetilde S = \widetilde D \widetilde D^T.

Next, we show (2.14). Notice that the 2-norm of a symmetric matrix is equal to its spectral radius. We have

    ‖\hat C‖_2^2 = ρ( \hat C^T \hat C ) = ρ( \widetilde D^{-1} \hat U_2 \hat Σ^T \hat Σ \hat U_2^T \widetilde D^{-T} ) = ρ( \widetilde S^{-1} \hat U_2 \hat Σ^T \hat Σ \hat U_2^T ) = ρ( ( I − U_2 Σ^2 U_2^T )^{-1} \hat U_2 \hat Σ^T \hat Σ \hat U_2^T ).

According to the Sherman-Morrison-Woodbury formula,

    ( I − U_2 Σ^2 U_2^T )^{-1} = I + U_2 Σ ( I − Σ U_2^T U_2 Σ )^{-1} Σ U_2^T = I + U_2 Σ^2 ( I − Σ^2 )^{-1} U_2^T.

Thus,

    ‖\hat C‖_2^2 = ρ( ( I + U_2 Σ^2 ( I − Σ^2 )^{-1} U_2^T ) \hat U_2 \hat Σ^T \hat Σ \hat U_2^T ) = ρ( \hat U_2 \hat Σ^T \hat Σ \hat U_2^T ) = ‖\hat U_2 \hat Σ^T \hat Σ \hat U_2^T‖_2 = σ_{r+1}^2,

where U_2^T \hat U_2 = 0 is used. The remaining results can then be conveniently shown. (2.15) is obvious. Lemma 2.1 then means that [ I  \hat C ; \hat C^T  I ] has the largest eigenvalue 1 + σ_{r+1} and the smallest eigenvalue 1 − σ_{r+1}. This leads to (2.16) and (2.17).

Theorem 2.5 indicates that τ also controls how close the preconditioned matrix \widetilde L^{-1} A \widetilde L^{-T} is to I and how closely λ( \widetilde L^{-1} A \widetilde L^{-T} ) clusters around 1. Moreover, the effectiveness of the preconditioner when τ is not small can be interpreted from the theorem similarly to [9]. That is, when the singular values σ_i of C decay, the condition number (1 + σ_{r+1}) / (1 − σ_{r+1}) after truncation decays much faster. In [9], this is illustrated in terms of a special case. A similar study of the decay property is also done in [17] for some Schur complement based preconditioners. Here, we can further rigorously characterize this as follows. Suppose s(t) is a differentiable and non-increasing function with t > 0 and 0 ≤ s(t) < 1. Its decay rate at t can be reflected by its slope s'(t). Then for f(t) = (1 + s(t)) / (1 − s(t)),

    f'(t) = θ(t) s'(t),  with  θ(t) = 2 / (1 − s(t))^2.

This indicates that the slope of f(t) at t is that of s(t) enlarged by a magnification factor θ(t). The closer t is to 0 (or s is to 1), the larger θ(t) is and the faster f(t) decays. Accordingly, this means that a relatively small t can already bring f(t) to a reasonable magnitude. That is, by keeping a relatively small number of the largest singular values σ_i of C (σ_i ∈ [0, 1)), we can get a reasonable condition number after preconditioning. Notice that the singular value truncation is controlled by (2.6). We can thus use a relatively large tolerance τ for preconditioning. For example, for selected values of τ, the corresponding bounds for κ( \widetilde L^{-1} A \widetilde L^{-T} ) and the decay magnification factors are given in Table 2.1. For τ as large as 0.95, we already have a reasonable condition number bound (1 + τ)/(1 − τ) = 39. In practice, the faster σ_i decays, the better the preconditioner works.

Table 2.1: Understanding the decay behavior of the condition number of the preconditioned matrix in terms of the decay of the singular values σ_i when the singular values smaller than or equal to τ are truncated. [The table lists selected values of τ together with the bounds (1 + τ)/(1 − τ) and the magnification factors 2/(1 − τ)^2.]

In particular, for a 5-point discrete 2D Laplacian matrix A evenly partitioned into a 2 × 2 block form, we can compute C and investigate its singular values σ_i, and then show the resulting condition number if all the singular values smaller than or equal to σ_i are truncated. See Figure 2.1.
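A minimal computation of the quantities behind Table 2.1, for a few sample values of τ of our own choosing, would look as follows.

    import numpy as np

    # Condition number bound (1+tau)/(1-tau) after truncation at tolerance tau,
    # and the decay magnification factor theta(tau) = 2/(1-tau)^2 from Section 2.3.
    for tau in [0.5, 0.8, 0.9, 0.95]:
        f = (1 + tau) / (1 - tau)
        theta = 2 / (1 - tau) ** 2
        print(f"tau={tau:4.2f}  kappa bound={f:6.1f}  magnification={theta:7.1f}")

For τ = 0.95 this reproduces the bound 39 quoted above.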

Clearly, for smaller indices i, the magnification factors θ_i are larger and the condition number decays faster. By keeping only a small number of singular values, or using a small numerical rank r, we can already get a reasonable condition number.

Fig. 2.1: For a 5-point discrete 2D Laplacian matrix A evenly partitioned into a 2 × 2 block form, (a) shows how the nonzero singular values σ_i of C decay, (b) shows how the condition number (1 + σ_i)/(1 − σ_i) of the preconditioned matrix decays when the singular values smaller than or equal to σ_i are truncated, and (c) shows the magnification factor θ_i = 2/(1 − σ_i)^2 that magnifies the decay. (Horizontal axes: index i.)

Remark 2.2. For some model problems such as the 5-point discrete Laplacian matrix, the analytical derivation of the singular values of C will appear in [31]. It can further be shown that κ( \widetilde L^{-1} A \widetilde L^{-T} ) = O(N) when r is set to be O(1). This is beyond the focus here.

2.4. Basic ULV factorization scheme. As mentioned in Remark 2.1, the preconditioning scheme in [9] has several limitations, one of which is the sequential computation of Schur complements. This can be avoided by using a ULV-type factorization that is similar to, but simpler than, the ones in [5, 8] for hierarchically semiseparable (HSS) matrices. The idea is to introduce zeros into the scaled (1,2) and (2,1) off-diagonal blocks simultaneously and then obtain reduced matrices. More specifically, for U_i, i = 1, 2 in (2.7), suppose Q_i is an orthogonal matrix such that

    Q_i U_i = [ 0 ; I ].

Then

    [ I  U_1 Σ U_2^T ; U_2 Σ U_1^T  I ] = diag(Q_1, Q_2)^T [ I  E ; E^T  I ] diag(Q_1, Q_2),  with  E = ( Q_1 U_1 ) Σ ( Q_2 U_2 )^T = [ 0  0 ; 0  Σ ].

Let Π_3 be an appropriate permutation matrix such that

    [ I  E ; E^T  I ] = Π_3 [ I  0 ; 0  D_3 ] Π_3^T,  with  D_3 = [ I  Σ ; Σ  I ],

where D_3 is a small 2r × 2r matrix called the reduced matrix; Π_3 is used to assemble D_3. Then directly compute a Cholesky factorization D_3 = L_3 L_3^T to complete the factorization. This in turn gives a factorization of \widetilde A in (2.7) that looks like

(2.18)    \widetilde A = \widetilde L \widetilde L^T,  with  \widetilde L = diag(L_1, L_2) diag(Q_1^T, Q_2^T) Π_3 diag(I, L_3).

\widetilde L is said to be a ULV factor, since it is given by a sequence of orthogonal and triangular matrices. The term ULV factor here follows the name in HSS ULV factorizations [5], and differs from the traditional sense of ULV factors. Notice that the operations associated with i = 1, 2 can be performed independently, and this avoids the sequential computation of a Schur complement. Thus, the overall SIF framework includes the scaling-and-compression strategy followed by the ULV factorization. With the 1-level block partitioning, we get the prototype preconditioner (2.18).
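A sketch of this 1-level ULV step is given below (ours and purely illustrative): Q_i is built from a full QR completion of U_i so that Q_i U_i = [0; I], and the reduced matrix D_3 = [ I  Σ ; Σ  I ] is formed and factorized. The permutation Π_3 simply gathers the last r indices of each block and is left implicit here.

    import numpy as np
    from scipy.linalg import cholesky

    def zero_out(U):
        # Orthogonal Q with Q @ U = [[0], [I]], built from a full QR completion of U.
        n, r = U.shape
        W, _ = np.linalg.qr(U, mode='complete')   # first r columns of W span range(U)
        return np.vstack([W[:, r:].T, U.T])       # rows: orthogonal complement of U, then U^T

    def ulv_reduced_matrix(U1, sigma, U2):
        # 1-level ULV step (Section 2.4): introduce zeros into the scaled off-diagonal
        # blocks and form the small reduced matrix D3 = [[I, S], [S, I]].
        r = len(sigma)
        Q1, Q2 = zero_out(U1), zero_out(U2)
        S = np.diag(sigma)
        D3 = np.block([[np.eye(r), S], [S, np.eye(r)]])
        L3 = cholesky(D3, lower=True)             # D3 = L3 L3^T, SPD since every sigma_i < 1
        return Q1, Q2, D3, L3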

3. Practical multilevel SIF preconditioners and analysis. The fundamental ideas for the prototype preconditioner in the previous section can be conveniently generalized to practical preconditioners with multiple levels of partitioning. In the following, we describe our multilevel SIF framework, including a multilevel scaling-and-compression strategy combined with a multilevel ULV factorization. We have two versions of the multilevel generalization, with some differences in performance. To describe the preconditioners and the analysis, we list commonly used notation as follows. I = {1 : N} is the entire index set for A; for an index subset s_i, I \ s_i is its complement. T represents a postordered binary tree with the nodes labeled as i = 1, 2, ..., root, where root denotes the root node; suppose root is at level 0 and the leaves are at the bottom level L. sib(i) and par(i) denote the sibling and the parent of node i in T, respectively. Each node i of T is associated with a set s_i of consecutive indices; these sets satisfy s_root = I, and s_i = s_{c1} ∪ s_{c2} with s_{c1} ∩ s_{c2} = ∅ if i is a nonleaf node with children c_1 and c_2. A|_{s_i × s_j} represents the submatrix of A with row index set s_i and column index set s_j, and b|_{s_i} represents the subvector selected from a vector b with the index set s_i.

3.1. Multilevel SIF preconditioner. To generalize the ideas in the previous section to multiple levels, we can apply the scheme repeatedly to the diagonal blocks A_{11} and A_{22} in (2.1). This can be done following a binary tree T, and the operations corresponding to the nodes at the same level of T can be performed simultaneously. We can further avoid the top-down recursion by organizing all the local operations along a bottom-up traversal of T. At each level of T, suppose A is partitioned following the index sets s_i associated with the nodes i at that level. The number of partition levels L will be decided at the end of this subsection.

In the traversal of T, for a leaf node i, directly compute the Cholesky factorization of the corresponding diagonal block

(3.1)    A|_{s_i × s_i} = L_i L_i^T.

Also, for later notational consistency, set \hat L_i ≡ L_i.

For a nonleaf node i, suppose c_1 and c_2 are its left and right children, respectively. Then

    A|_{s_i × s_i} = [ A|_{s_{c1} × s_{c1}}  A|_{s_{c1} × s_{c2}} ; A|_{s_{c2} × s_{c1}}  A|_{s_{c2} × s_{c2}} ].

By induction as in (3.6) below, the diagonal blocks have been approximately factorized as

(3.2)    A|_{s_{c1} × s_{c1}} ≈ L_{c1} L_{c1}^T,    A|_{s_{c2} × s_{c2}} ≈ L_{c2} L_{c2}^T.

Thus,

(3.3)    A|_{s_i × s_i} ≈ diag(L_{c1}, L_{c2}) [ I  L_{c1}^{-1} A|_{s_{c1} × s_{c2}} L_{c2}^{-T} ; L_{c2}^{-1} A|_{s_{c2} × s_{c1}} L_{c1}^{-T}  I ] diag(L_{c1}, L_{c2})^T.

Then, just like in the 1-level scheme in Section 2, we compute the compression (approximate SVD)

(3.4)    L_{c1}^{-1} A|_{s_{c1} × s_{c2}} L_{c2}^{-T} ≈ U_{c1} Σ_{c1} U_{c2}^T,

and introduce zeros into this block by using orthogonal matrices Q_{c1} and Q_{c2} such that

(3.5)    Q_{c1} U_{c1} = [ 0 ; I ],    Q_{c2} U_{c2} = [ 0 ; I ].

Since L_{c1} and L_{c2} are structured factors, the formation of L_{c1}^{-1} A|_{s_{c1} × s_{c2}} L_{c2}^{-T} in (3.4) involves multiple structured solutions. This yields a factorization just like in (2.18):

(3.6)    A|_{s_i × s_i} ≈ L_i L_i^T,  with  L_i = diag(L_{c1}, L_{c2}) diag(Q_{c1}^T, Q_{c2}^T) Π_i diag(I, \hat L_i),

where Π_i is an appropriate permutation matrix to assemble the reduced matrix

    D_i = [ I  Σ_{c1} ; Σ_{c1}  I ],

and \hat L_i is the lower triangular Cholesky factor of D_i. (Later in Section 3.2, we will discuss a condition that guarantees the positive definiteness.) The formula gives a recursive representation for L_i. Thus, it is clear that the algorithm generates a sequence of orthogonal matrices Q_i and triangular matrices \hat L_i. These matrices Q_i and \hat L_i, together with the permutations Π_i, define a ULV factor \widetilde L in an approximate ULV factorization

(3.7)    A ≈ \widetilde A = \widetilde L \widetilde L^T,  where  \widetilde L ≡ L_root.

Our multilevel SIF preconditioner is then \widetilde L \widetilde L^T.
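The following recursion is a dense illustration of (3.1)-(3.6) (our own sketch, not the actual structured algorithm): it applies the scaling-and-compression strategy level by level but assembles explicit factors instead of keeping the ULV form, so it is suitable only for small verification experiments. The function name sif_factor and the leaf_size parameter are ours.

    import numpy as np
    from scipy.linalg import cholesky, svd

    def sif_factor(A, leaf_size, tau):
        # Returns an explicit factor Ltil with A ~ Ltil @ Ltil.T.
        n = A.shape[0]
        if n <= leaf_size:
            return cholesky(A, lower=True)                  # (3.1) at the leaves
        m = n // 2
        Lc1 = sif_factor(A[:m, :m], leaf_size, tau)         # (3.2) children factors
        Lc2 = sif_factor(A[m:, m:], leaf_size, tau)
        # (3.4): scale the off-diagonal block on both sides and compress.
        # (The real algorithm uses structured solves; here the factors are dense.)
        C = np.linalg.solve(Lc1, A[:m, m:])
        C = np.linalg.solve(Lc2, C.T).T
        U, s, Vt = svd(C, full_matrices=False)
        k = max(1, int(np.count_nonzero(s > tau)))
        U1, sig, U2 = U[:, :k], s[:k], Vt[:k].T
        # Assemble the local factor exactly as in the prototype case (2.10).
        S = np.eye(n - m) - (U2 * sig**2) @ U2.T
        D = cholesky(S, lower=True)
        top = np.hstack([Lc1, np.zeros((m, n - m))])
        bot = np.hstack([Lc2 @ (U2 * sig) @ U1.T, Lc2 @ D])
        return np.vstack([top, bot])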

The application of the preconditioner is also done following the traversal of T. For example, consider the solution of the following linear system via forward substitution:

    \widetilde L y = b.

Partition b into pieces b|_{s_i} following the leaf level index sets s_i. For each leaf i, compute

    y_i = Q_i \hat L_i^{-1} b|_{s_i}.

For each nonleaf node i with children c_1 and c_2, compute

    y_i = Q_i diag(I, \hat L_i^{-1}) Π_i^T [ y_{c1} ; y_{c2} ],

with Q_i omitted when i = root. When root is reached, we obtain the solution y ≡ y_root. The backward substitution is done similarly.

The costs of the construction and application of the preconditioner can be conveniently counted. Suppose r is the maximum rank used for the compression (truncated SVD) in (3.4). The repeated diagonal partitioning is done until the finest level diagonal block sizes are around r; thus, L ≈ log_2(N/r). Then the approximate factorization costs O(rN^2 log(N/r)) flops. Notice that A is a general SPD matrix; this cost can be reduced when A is sparse or has some special properties. To get a preconditioner, we set r = O(1). The cost to apply the resulting preconditioner is O(N log N), and the storage is also O(N log N). The log N term in the costs is due to the size of Q_i, which is roughly N/2^l if i is at level l. Q_i results from the extension of U_i into an orthogonal matrix, which can be done via Householder transformations for convenience (see (3.5)); thus, Q_i can be represented in terms of r Householder vectors. We can remove the log N term in a modified scheme in Section 3.3.
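Assuming a node data structure that stores the pieces named in (3.6) (the leaf Cholesky factors, the orthogonal blocks Q_i, the permutations Π_i, and the small factors \hat L_i), the forward substitution described above could be organized as the recursion below; the attribute names (is_leaf, L, Q, left, right, size, P, Lhat) are hypothetical and only illustrate the traversal.

    import numpy as np

    def ulv_forward_solve(node, b):
        # Solve Ltil y = b by the bottom-up traversal of Section 3.1 (sketch).
        if node.is_leaf:
            y = np.linalg.solve(node.L, b)           # y_i = Q_i Lhat_i^{-1} b|_{s_i}
            return node.Q @ y
        y1 = ulv_forward_solve(node.left, b[:node.left.size])
        y2 = ulv_forward_solve(node.right, b[node.left.size:])
        z = node.P.T @ np.concatenate([y1, y2])      # Pi_i^T gathers the reduced part
        k = node.Lhat.shape[0]
        z[-k:] = np.linalg.solve(node.Lhat, z[-k:])  # apply diag(I, Lhat_i^{-1})
        return node.Q @ z                            # Q_i is the identity at the root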

3.2. Robustness, accuracy, and effectiveness of the multilevel preconditioner. We now provide analysis for the multilevel preconditioner. For convenience, introduce the notation A^{(l)} corresponding to the levels l = 0, 1, ..., L of T so as to track the error accumulation at each level. Initially, A^{(L)} ≡ A. For each leaf node i at level L, we compute the Cholesky factorization (3.1); at this level, no approximation is involved. Then for each node i at level l < L, the scaling-and-compression strategy in (3.2)-(3.4) is applied, producing an approximate factorization as in (3.6). This yields the approximation of A^{(l+1)} by A^{(l)}, where the diagonal block A^{(l+1)}|_{s_i × s_i} is approximated by A^{(l)}|_{s_i × s_i} ≡ L_i L_i^T. This process continues and produces a sequence of approximations to A:

(3.8)    A ≡ A^{(L)} ≈ A^{(L−1)} ≈ ... ≈ A^{(0)} ≡ \widetilde A,

where \widetilde A is the final preconditioner in a factorized form. Since A^{(L−1)} approximates A with modified diagonal blocks, A^{(L−1)} may not be positive definite. Similarly, A^{(l)} for l < L may not be positive definite. We would like to study the levelwise error accumulation and give a condition for A^{(l)} to remain positive definite.

Theorem 3.1. Let τ be the tolerance of the truncated SVD in the scaling-and-compression strategy in (3.2)-(3.4) used to generate the matrices A^{(l)}, l = L−1, L−2, ..., 0 in (3.8), with L ≥ 1. If

(3.9)    (1 + τ)^L − 1 < 1 / κ(A),

then each A^{(l)} is positive definite, and

    ‖A − \widetilde A‖_2 ≤ [ (1 + τ)^L − 1 ] ‖A‖_2.

Proof. (3.9) means

(3.10)    [ (1 + τ)^L − 1 ] ‖A‖_2 < λ_N(A),

where λ_N(A) is the smallest eigenvalue of A. For l = L−1, L−2, ..., 0, we show

    ‖A − A^{(l)}‖_2 ≤ [ (1 + τ)^{L−l} − 1 ] ‖A‖_2 ≤ [ (1 + τ)^L − 1 ] ‖A‖_2 < λ_N(A).

Then Weyl's theorem means that each A^{(l)} is SPD.

For l = L−1, the matrix A^{(L−1)} is obtained from A through the 1-level scaling and compression, where A|_{s_i × s_i} is approximated as in (3.6) for each node i at level L−1. Thus, Theorem 2.4 indicates

    ‖A|_{s_i × s_i} − A^{(L−1)}|_{s_i × s_i}‖_2 ≤ τ ‖A|_{s_i × s_i}‖_2.

Accordingly, since A − A^{(L−1)} is block diagonal,

    ‖A − A^{(L−1)}‖_2 ≤ τ max_{i at level L−1} ‖A|_{s_i × s_i}‖_2 ≤ τ ‖A‖_2.

Since τ ≤ (1 + τ)^L − 1, (3.10) means

    ‖A − A^{(L−1)}‖_2 ≤ [ (1 + τ)^L − 1 ] ‖A‖_2 < λ_N(A).

Thus, A^{(L−1)} is SPD. The factorization can then proceed to level l = L−2 (when L ≥ 2). Similarly to the argument above, we have ‖A^{(L−2)} − A^{(L−1)}‖_2 ≤ τ ‖A^{(L−1)}‖_2. By the triangle inequality,

(3.11)    ‖A − A^{(L−2)}‖_2 ≤ ‖A − A^{(L−1)}‖_2 + ‖A^{(L−1)} − A^{(L−2)}‖_2
                           ≤ ‖A − A^{(L−1)}‖_2 + τ ( ‖A − A^{(L−1)}‖_2 + ‖A‖_2 )
                           = (1 + τ) ‖A − A^{(L−1)}‖_2 + τ ‖A‖_2
                           ≤ [ (1 + τ) τ + τ ] ‖A‖_2 = [ (1 + τ)^2 − 1 ] ‖A‖_2.

Then by (3.10), ‖A − A^{(L−2)}‖_2 ≤ [ (1 + τ)^L − 1 ] ‖A‖_2 < λ_N(A). This leads to the positive definiteness of A^{(L−2)}. The process then proceeds by induction. In fact, just like in (3.11), for any level l < L−1 we get the error accumulation pattern

    ‖A − A^{(l)}‖_2 ≤ (1 + τ) ‖A − A^{(l+1)}‖_2 + τ ‖A‖_2 ≤ [ (1 + τ)^{L−l} − 1 ] ‖A‖_2.

Thus,

    ‖A − \widetilde A‖_2 = ‖A − A^{(0)}‖_2 ≤ [ (1 + τ)^L − 1 ] ‖A‖_2 < λ_N(A).

This completes the proof.

Thus, if τ is smaller than roughly 1/(L κ(A)), then the preconditioner \widetilde A is always positive definite, and it approximates A to an accuracy of about Lτ. Of course, this is still a very conservative estimate. In practice, the positive definiteness is often well preserved by \widetilde A for various ill-conditioned matrices A, even with relatively large τ (see the examples in Section 4). Our modified scheme in Section 3.3 has even better robustness in practice. In our future work, we will investigate the feasibility of improving the methods and analysis to avoid the term κ(A) in the condition.

The following theorem shows the effectiveness of the multilevel preconditioner in terms of the eigenvalues and the condition number of the preconditioned matrix \widetilde L^{-1} A \widetilde L^{-T}. We also show how close \widetilde L^{-1} A \widetilde L^{-T} is to I.

Theorem 3.2. With the same conditions as in Theorem 3.1 and with \widetilde L in (3.7), the eigenvalues of \widetilde L^{-1} A \widetilde L^{-T} satisfy

(3.12)    1 / (1 + ϵ) ≤ λ( \widetilde L^{-1} A \widetilde L^{-T} ) ≤ 1 / (1 − ϵ),  where  ϵ = [ (1 + τ)^L − 1 ] κ(A) < 1.

Accordingly,

(3.13)    ‖\widetilde L^{-1} A \widetilde L^{-T} − I‖_2 ≤ ϵ / (1 − ϵ),    κ( \widetilde L^{-1} A \widetilde L^{-T} ) ≤ (1 + ϵ) / (1 − ϵ).

Proof. (3.9) means ϵ < 1. Note that the eigenvalues of \widetilde L^{-1} A \widetilde L^{-T} are the inverses of the eigenvalues of L^{-1} \widetilde A L^{-T}, where A = L L^T. By Theorem 3.1,

    ‖L^{-1} \widetilde A L^{-T} − I‖_2 = ‖L^{-1} ( \widetilde A − A ) L^{-T}‖_2 ≤ ‖A − \widetilde A‖_2 ‖L^{-1}‖_2^2 ≤ [ (1 + τ)^L − 1 ] ‖A‖_2 ‖A^{-1}‖_2 = ϵ.

Note that

    ‖L^{-1} \widetilde A L^{-T} − I‖_2 = max{ |λ_N( L^{-1} \widetilde A L^{-T} ) − 1|, |λ_1( L^{-1} \widetilde A L^{-T} ) − 1| }.

Thus,

(3.14)    |λ_N( L^{-1} \widetilde A L^{-T} ) − 1| ≤ ϵ.

This yields λ_N( L^{-1} \widetilde A L^{-T} ) ≥ 1 − ϵ: in fact, either λ_N( L^{-1} \widetilde A L^{-T} ) ≥ 1, or otherwise (3.14) means 1 − λ_N( L^{-1} \widetilde A L^{-T} ) ≤ ϵ. Accordingly,

(3.15)    λ_1( \widetilde L^{-1} A \widetilde L^{-T} ) = 1 / λ_N( L^{-1} \widetilde A L^{-T} ) ≤ 1 / (1 − ϵ).

On the other hand,

    λ_1( L^{-1} \widetilde A L^{-T} ) ≤ ‖L^{-1} \widetilde A L^{-T}‖_2 ≤ 1 + ‖L^{-1} \widetilde A L^{-T} − I‖_2 ≤ 1 + ϵ.

Accordingly,

(3.16)    λ_N( \widetilde L^{-1} A \widetilde L^{-T} ) = 1 / λ_1( L^{-1} \widetilde A L^{-T} ) ≥ 1 / (1 + ϵ).

Combining (3.15) and (3.16) yields (3.12) and the second bound in (3.13). As a consequence,

    ‖\widetilde L^{-1} A \widetilde L^{-T} − I‖_2 = max{ |λ_1( \widetilde L^{-1} A \widetilde L^{-T} ) − 1|, |λ_N( \widetilde L^{-1} A \widetilde L^{-T} ) − 1| } ≤ max{ 1/(1 − ϵ) − 1, 1 − 1/(1 + ϵ) } = ϵ / (1 − ϵ).

Therefore, in the multilevel case, ϵ plays a role similar to τ in the 1-level case in Theorem 2.5. We can then similarly understand the effectiveness of the preconditioner when aggressive compression with relatively large τ is used, as at the end of Section 2.3 (Table 2.1 and Figure 2.1).

3.3. Modified multilevel SIF preconditioner. In the multilevel scheme in Section 3.1, the resulting Q_i factors have sizes equal to the sizes of the corresponding index sets s_i. Such sizes increase when we advance to upper level nodes. There are different ways to save some computational costs. One way is to instead scale and compress the off-diagonal block rows A|_{s_i × (I \ s_i)} in a hierarchical scheme. A|_{s_i × (I \ s_i)} is essentially what is called an HSS block row (a block row without the diagonal block) [8]. Some nested basis matrices will be produced. This can avoid the scaling and compression of large dense off-diagonal blocks, since A|_{s_i × (I \ s_i)} may include subblocks that have already been compressed in previous steps. However, it makes the scaling on the right more delicate. Thus, we use a modified version. That is, we first apply the left scaling to A|_{s_i × (I \ s_i)}, perform the compression, and compute the ULV factorization. Since the entire block row A|_{s_i × (I \ s_i)}, instead of just A|_{s_i × s_j} for j = sib(i), is scaled on the left, the symmetry indicates that the right-side scaling is implicitly built in. This will become clear later in (3.21) and (3.22).

We would first like to mention a notation issue. In the algorithm, when a block row A|_{s_i × (I \ s_i)} of A is compressed by a rank-revealing QR factorization as A|_{s_i × (I \ s_i)} ≈ U_i T_i, the matrix T_i can reuse the storage for A|_{s_i × (I \ s_i)}. For convenience, we write this compression step as

    A|_{s_i × (I \ s_i)} ≈ U_i \hat A|_{\hat s_i × (I \ s_i)},

where \hat A|_{\hat s_i × (I \ s_i)} indicates the storage of T_i in A with the row index set \hat s_i and column index set I \ s_i. Such a notation scheme is used in [7] and makes it convenient to identify the subblocks that have been compressed in earlier steps. The symbols \hat A are only for the purpose of organizing the compression steps and do not imply any extra storage.

During the factorization, we try to take advantage of previous compression information as much as possible by keeping track of compressed subblocks. This is done with the aid of a visited set as in [7].

Definition 3.3. For a node i of a postordered binary tree, the visited set V_i is defined to be

    V_i = { j : j < i and sib(j) is an ancestor of i }.

The visited set essentially corresponds to the roots of all the subtrees with indices smaller than i. Those nodes have been traversed, and correspond to blocks that are highly compressed.
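Definition 3.3 translates directly into code. In the sketch below (ours), parent and sibling are assumed lookup tables (dictionaries) for a postordered binary tree with integer node labels, with the parent of the root set to None.

    def visited_set(i, parent, sibling):
        # V_i = { j : j < i and sib(j) is an ancestor of i }  (Definition 3.3)
        ancestors = set()
        p = parent.get(i)
        while p is not None:
            ancestors.add(p)
            p = parent.get(p)
        return {j for j in range(1, i) if sibling.get(j) in ancestors}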

A detailed explanation of this and a pictorial illustration of V_i can be found in [7]. In fact, V_i can be made more general to include the roots of all the subtrees previously visited, not necessarily those smaller than i. We follow the original definition in [7] just for the convenience of presentation.

The modified multilevel SIF scheme works as follows. For a leaf node i of T, compute the Cholesky factorization as in (3.1) and set \hat L_i ≡ L_i. We then scale and compress the off-diagonal block row A|_{s_i × (I \ s_i)}. Note that, due to the symmetry, some subblocks of A|_{s_i × (I \ s_i)} may have been compressed in the previous steps. Assume

(3.17)    V_i = { k_1, k_2, ..., k_α : k_1 < k_2 < ... < k_α },

where α is the number of elements in V_i and depends on i. Form

    Ω_i = ( ( \hat A|_{\hat s_{k_1} × s_i} )^T   ...   ( \hat A|_{\hat s_{k_α} × s_i} )^T   A|_{s_i × s_i^+} ),

where s_i^+ contains all the indices larger than those in s_i, and \hat A|_{\hat s_{k_1} × s_i}, ..., \hat A|_{\hat s_{k_α} × s_i} are results from earlier compression steps. More specifically, each block A|_{s_k × s_i} was previously compressed as U_k \hat A|_{\hat s_k × s_i} during the compression step associated with node k ∈ V_i. Due to the symmetry, A|_{s_i × s_k} should be compressed at step i, and the row basis matrix U_k can be ignored; thus, the factor \hat A|_{\hat s_k × s_i} is collected into Ω_i. This can be understood as in Figure 3.3 in [7]; we skip the technical details.

Scale Ω_i as Θ_i ≡ \hat L_i^{-1} Ω_i, and then compute a rank-revealing QR (RRQR) factorization

    Θ_i ≈ U_i ( \hat A|_{\hat s_i × \hat s_{k_1}}   ...   \hat A|_{\hat s_i × \hat s_{k_α}}   \hat A|_{\hat s_i × s_i^+} ),

where \hat s_i is the row index set used to identify the second factor. Also, find an orthogonal matrix Q_i such that

    Q_i U_i = [ 0 ; I ],

and set

    L_i = \hat L_i Q_i^T.

The process above indicates that L_i [ 0 ; I ] = \hat L_i U_i, and it is an approximate column basis matrix for Ω_i. Similarly, for j = sib(i), we also obtain an approximate column basis matrix L_j [ 0 ; I ] for Ω_j. These yield

(3.18)    A|_{s_i × s_j} ≈ ( L_i [ 0 ; I ] ) B_i ( L_j [ 0 ; I ] )^T = L_i diag(0, B_i) L_j^T,

where

(3.19)    B_i = \hat A|_{\hat s_i × \hat s_j},  formed at the compression step associated with node ι = max{i, j}.

For a nonleaf node i with children c_1 and c_2, the induction process below shows that L_{c1} [ 0 ; I ] and L_{c2} [ 0 ; I ] are approximate column basis matrices for A|_{s_{c1} × (I \ s_{c1})} and A|_{s_{c2} × (I \ s_{c2})}, respectively, and

(3.20)    A|_{s_{c1} × s_{c1}} ≈ L_{c1} L_{c1}^T,    A|_{s_{c2} × s_{c2}} ≈ L_{c2} L_{c2}^T,
          A|_{s_{c1} × s_{c2}} ≈ ( L_{c1} [ 0 ; I ] ) B_{c1} ( L_{c2} [ 0 ; I ] )^T = L_{c1} diag(0, B_{c1}) L_{c2}^T,  with  B_{c1} = \hat A|_{\hat s_{c1} × \hat s_{c2}}.

It is thus clear that A|_{s_i × s_i} is approximated as

(3.21)    A|_{s_i × s_i} ≈ [ L_{c1} L_{c1}^T   L_{c1} diag(0, B_{c1}) L_{c2}^T ; L_{c2} diag(0, B_{c1})^T L_{c1}^T   L_{c2} L_{c2}^T ]

(3.22)              = diag(L_{c1}, L_{c2}) [ I  diag(0, B_{c1}) ; diag(0, B_{c1})^T  I ] diag(L_{c1}, L_{c2})^T.

(3.21) and (3.22) precisely explain how the right-side scaling is implicitly done in the process. Again, with a permutation matrix Π_i, the matrix in the middle of the right-hand side of (3.22) can be assembled as

    Π_i [ I  0 ; 0  D_i ] Π_i^T,  where  D_i = [ I  B_{c1} ; B_{c1}^T  I ].

Compute a Cholesky factorization D_i = \check L_i \check L_i^T; then

(3.23)    A|_{s_i × s_i} ≈ \hat L_i \hat L_i^T,  with  \hat L_i = diag(L_{c1}, L_{c2}) Π_i diag(I, \check L_i).

At this point, we can use \hat L_i^{-1} to scale the off-diagonal block row A|_{s_i × (I \ s_i)} associated with i and then compress. Again by induction,

(3.24)    A|_{s_i × (I \ s_i)} = [ A|_{s_{c1} × (I \ s_i)} ; A|_{s_{c2} × (I \ s_i)} ] ≈ diag( L_{c1} [ 0 ; I ], L_{c2} [ 0 ; I ] ) [ Ω_{i,1} ; Ω_{i,2} ],

where

    Ω_i = [ Ω_{i,1} ; Ω_{i,2} ] = [ \hat A|_{\hat s_{c1} × \hat s_{k_1}}  ...  \hat A|_{\hat s_{c1} × \hat s_{k_α}}  \hat A|_{\hat s_{c1} × s_i^+} ; \hat A|_{\hat s_{c2} × \hat s_{k_1}}  ...  \hat A|_{\hat s_{c2} × \hat s_{k_α}}  \hat A|_{\hat s_{c2} × s_i^+} ],

and the notation in (3.17) is assumed. Then

    \hat L_i^{-1} A|_{s_i × (I \ s_i)} ≈ diag(I, \check L_i^{-1}) Π_i^T diag(L_{c1}^{-1}, L_{c2}^{-1}) diag( L_{c1} [ 0 ; I ], L_{c2} [ 0 ; I ] ) [ Ω_{i,1} ; Ω_{i,2} ] = [ 0 ; \check L_i^{-1} Ω_i ].

Thus, we just need to compress

    Θ_i ≡ \check L_i^{-1} Ω_i.

We can then perform an RRQR factorization

    Θ_i ≈ U_i ( \hat A|_{\hat s_i × \hat s_{k_1}}  ...  \hat A|_{\hat s_i × \hat s_{k_α}}  \hat A|_{\hat s_i × s_i^+} ).

Again, find an orthogonal matrix Q_i such that Q_i U_i = [ 0 ; I ]. Let

(3.25)    L_i = \hat L_i diag(I, Q_i^T) = diag(L_{c1}, L_{c2}) Π_i diag(I, \check L_i Q_i^T).

Then we can verify that L_i L_i^T = \hat L_i \hat L_i^T and

    A|_{s_i × (I \ s_i)} ≈ ( L_i [ 0 ; I ] ) ( \hat A|_{\hat s_i × \hat s_{k_1}}  ...  \hat A|_{\hat s_i × \hat s_{k_α}}  \hat A|_{\hat s_i × s_i^+} ).

This confirms the induction about the column basis matrices of the forms L_i [ 0 ; I ] (see, e.g., (3.21) and (3.24)). Moreover, (3.23) can be rewritten as

    A|_{s_i × s_i} ≈ L_i L_i^T,  with L_i in (3.25).

This confirms the induction process in (3.20). In addition, we can extract B_i just like in (3.19) so as to obtain an approximation like in (3.18).

The factorization process then continues until root is reached, and we obtain an approximate factorization like in (3.7). A major difference is that \widetilde L is now given by the recursion (3.25), where Q_i is a small matrix of size O(r) when r is the numerical rank in the compression. For a general dense matrix A, this scheme costs O(rN^2) to compute the approximate factorization. The solution procedure can be designed similarly to that at the end of Section 3.1. With r = O(1) in preconditioning, the cost to apply the resulting preconditioner is O(N), and the storage is O(N). This scheme shares some similarities with the one in Section 3.1, but is much more sophisticated; we thus skip the other analysis. We expect slightly reduced effectiveness but enhanced robustness, as confirmed by the numerical tests in the next section.

Remark 3.1. In this scheme, and also in the one in Section 3.1, we may replace the truncated SVDs or RRQR factorizations by other appropriate methods to improve the efficiency. In particular, if A is a sparse matrix or a dense one that admits fast matrix-vector multiplications, randomized construction schemes similar to those in [18, 20, 30] can be used so that the cost to construct the preconditioner can be reduced to roughly O(N). Relevant details will appear in [31], as this work focuses on general SPD matrices. Here, for the compression of dense blocks, either a truncated SVD or a randomized method is used, depending on the block sizes.

4. Numerical experiments. In this section, we demonstrate the performance of the new preconditioners, which are used in the preconditioned conjugate gradient method (PCG) to solve some ill-conditioned linear systems. We compare the effectiveness and robustness of the following preconditioners:
bdiag: the block diagonal preconditioner;
HSS: HSS approximation in ULV factors [8] based on direct off-diagonal compression;
NEW0: the multilevel SIF preconditioner in Section 3.1;
NEW1: the modified multilevel SIF preconditioner in Section 3.3.
For convenience, the following notation is used in the discussions:
r: the numerical rank bound used in the off-diagonal compression;
A_prec: the preconditioned matrix (e.g., \widetilde L^{-1} A \widetilde L^{-T} when a preconditioner \widetilde L \widetilde L^T is used); for some cases where HSS loses positive definiteness, factors from nonsymmetric HSS factorizations are used;
γ = ‖Ax − b‖_2 / ‖b‖_2: the relative residual, where b is generated with the vector of all ones as the exact solution;

N_it: the number of iterations to reach a given accuracy.

Several ill-conditioned matrices are tested. Since our studies are concerned with general SPD matrices, all the test matrices are treated as dense ones, and we do not take advantage of possible special structures within some examples. In all the tests, we keep the numerical rank bound r very small, so that the costs of applying our preconditioners by substitution are nearly linear in N. Such costs are almost negligible as compared with the dense matrix-vector multiplication costs. Thus, it is important to reduce the number of iterations in PCG. For consistency, the diagonal block sizes of bdiag, and also the finest level diagonal block sizes of HSS, NEW0, and NEW1, are all set to be r.

Example 1. Consider a matrix A with entries

    A_{ij} = (ij)^{1/4} π / ( 16 + (i − j)^2 ).

The matrix can be shown to be SPD and ill conditioned based on a result in [4]. We test different matrix sizes N, but fix r = 5 when generating the preconditioners. PCG with the four preconditioners is used to reach the relative accuracy γ ≤ 10^{−12}. The tests are in Matlab and run on an Intel Xeon E5 processor with 64GB of memory.

Table 4.1 shows the results. We report the condition numbers before and after preconditioning, the numbers of iterations N_it, and the relative residuals γ. For all the tests, NEW0 and NEW1 remain SPD, although r is very small. However, HSS fails to preserve the positive definiteness; thus, a nonsymmetric HSS factorization is used in the solution. NEW0 and NEW1 significantly reduce the condition numbers and lead to much faster convergence than bdiag and HSS. HSS gives a reasonable reduction in the condition numbers, but the numbers of iterations are higher. The performance can also be observed from Figure 4.1(a), which shows the actual convergence behaviors for N = 3200. The fast convergence with the new preconditioners can also be confirmed by the eigenvalue distribution of A_prec; see Figure 4.1(b), where the eigenvalues of A_prec with NEW0 or NEW1 closely cluster around 1. This is not the case for bdiag or HSS.

Fig. 4.1: Example 1. (a) Convergence of PCG with the preconditioners for the matrix with N = 3200; (b) the eigenvalues λ(A) and λ(A_prec) before and after preconditioning.

In addition, NEW0 fully respects the scaling-and-compression strategy and is the most effective. NEW1 needs less memory than NEW0 (O(N) versus O(N log N)); see Figure 4.2(a).
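For completeness, a PCG run with such a factorized preconditioner can be set up as below. The matrix formula follows our reading of Example 1 above and should be treated as an assumption, and the sketch reuses the sif_factor illustration from Section 3.1 rather than the actual structured implementation.

    import numpy as np
    from scipy.sparse.linalg import cg, LinearOperator

    N = 1000
    i = np.arange(1, N + 1)
    # Assumed test matrix: A_ij = (ij)^{1/4} * pi / (16 + (i - j)^2)
    A = (np.outer(i, i) ** 0.25) * np.pi / (16.0 + (i[:, None] - i[None, :]) ** 2)

    Ltil = sif_factor(A, leaf_size=5, tau=0.5)     # dense illustration of the SIF factor

    def apply_prec(x):                              # x -> (Ltil Ltil^T)^{-1} x
        return np.linalg.solve(Ltil.T, np.linalg.solve(Ltil, x))

    M = LinearOperator((N, N), matvec=apply_prec)
    b = A @ np.ones(N)                              # exact solution of all ones
    # (The tolerance keyword is tol or rtol depending on the SciPy version.)
    x, info = cg(A, b, M=M, maxiter=500)
    print(info, np.linalg.norm(A @ x - b) / np.linalg.norm(b))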

Table 4.1: Example 1. Performance of the preconditioners in PCG (the condition numbers κ(A) and κ(A_prec), the iteration counts N_it, the relative residuals γ, and the preservation of positive definiteness for each tested N). The condition numbers are not evaluated for N = 25,600 and 51,200 since it is too slow to form A_prec. [Numerical entries omitted.]

The storage of HSS is nearly the same as that of NEW1 and is not shown. We also report the runtime to construct the two new preconditioners and to apply them to one vector in Figure 4.2(b). The construction of HSS is about 3 times faster than that of NEW1, and the application of HSS has runtime close to that of NEW1. However, the iteration time of PCG with HSS is much longer due to the larger number of iterations. The iteration time of PCG with bdiag is also quite a bit larger than with the new preconditioners. The construction and application of bdiag are more efficient in Matlab due to the simple block diagonal structure, while the new preconditioners involve more sophisticated structured matrix operations, such as the application of many local Householder matrices. Thus, the advantage of the new preconditioners over bdiag in iteration time is not as large as the advantage in the numbers of iterations. In practice, the reduction of the number of iterations is essential, since each dense matrix-vector multiplication costs O(N^2), while each application of these preconditioners costs only about O(N).

Example 2. We next consider some interpolation matrices A for the following radial basis functions (RBFs) of t:

    f_1(t, µ) = e^{−µ^2 t^2},    f_2(t, µ) = 1 / (µ^2 t^2 + 1),    f_3(t, µ) = sech(µt),

where µ is the shape parameter. For convenience, choose t to be between 0 and N − 1. It is well known that the RBF matrices are very ill conditioned for small shape parameters. For some cases, the condition numbers are nearly exponential in 1/µ or 1/µ^2 (see, e.g., [4]).
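The three RBF interpolation matrices can be generated as follows; the exact placement of the shape parameter and the exponents follows our reconstruction of the formulas above, and the chosen µ is only an example.

    import numpy as np

    def rbf_matrix(kind, N, mu):
        t = np.arange(N, dtype=float)
        r = t[:, None] - t[None, :]
        if kind == "gaussian":
            return np.exp(-(mu * r) ** 2)          # f1(t, mu) = exp(-mu^2 t^2)  (assumed form)
        if kind == "inverse_quadratic":
            return 1.0 / ((mu * r) ** 2 + 1.0)     # f2(t, mu) = 1/(mu^2 t^2 + 1)
        if kind == "sech":
            return 1.0 / np.cosh(mu * r)           # f3(t, mu) = sech(mu t)
        raise ValueError(kind)

    A = rbf_matrix("sech", 1000, 0.02)             # a small mu gives severe ill-conditioning
    print(np.linalg.cond(A))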

Fig. 4.2: Example 1. (a) Storage (number of nonzeros) for NEW0 and NEW1, with O(N log N) and O(N) reference lines; (b) the runtime to construct the two preconditioners and to apply them to one vector, with O(N log N) and O(N) reference lines.

We test several matrices with different parameters µ. The matrix sizes are set to be N = 1000; similar results are observed for other sizes. We fix r = 7 when generating the preconditioners. See Table 4.2 for the results. PCG with NEW0 or NEW1 converges quickly, and NEW0 is the most effective. On the other hand, NEW1 remains positive definite for all the test cases, while NEW0 loses positive definiteness for one case (indicated by "Almost" in Table 4.2). In that case, two small intermediate reduced matrices in the construction of the preconditioner become indefinite; a very small diagonal shift is added to make these reduced matrices positive definite. In addition, HSS again becomes indefinite for all the tests.

Table 4.2: Example 2. Performance of the preconditioners in PCG for the RBF matrices generated from e^{−µ^2 t^2}, sech(µt), and 1/(µ^2 t^2 + 1) with two shape parameters µ each: κ(A) (ranging from roughly 10^5 to 10^10), κ(A_prec), N_it, γ, and the preservation of positive definiteness (bdiag and NEW1: yes; NEW0: yes except one "Almost" case; HSS: no). [Remaining numerical entries omitted.]

Figure 4.3 also shows the actual convergence and eigenvalue distributions for one matrix. The new preconditioners yield much faster convergence and better eigenvalue clustering.

Fig. 4.3: Example 2. Convergence of PCG with the preconditioners for one of the sech(µt) matrices in Table 4.2, and the eigenvalues λ(A) and λ(A_prec) before and after preconditioning.

Example 3. We further test some highly ill-conditioned matrices from different applications, for which NEW1 still demonstrates superior robustness and also leads to much faster convergence than HSS.

LinProg (N = 301, κ(A) = 2.09e11): a matrix of the form CΛC^T as used in some linear programming problems, where C is a rectangular matrix (nemspmm2 from the linear programming test matrix set Meszaros in [27]) and Λ is a diagonal matrix with diagonal entries evenly spaced between 10^{−5} and 1. We set r = 9.

MaxwSchur (N = 3947, κ(A) = 4.78e9): a dense normal matrix in [9] constructed based on a Schur complement in the direct factorization of a discretized 3D Maxwell equation. We set r = 20.

BC25Schur (N = 186, κ(A) = 3.8e11): a dense intermediate Schur complement in the multifrontal factorization [8] of the BCS structural engineering matrix BCSSTK25 from the Harwell-Boeing Collection [1]. We set r = 7.

BCSSTK13 (N = 2003, κ(A) = 1.10e10): the matrix BCSSTK13, directly treated as dense, from the Harwell-Boeing Collection [1]. We set r = 10.

For each matrix, NEW1 remains positive definite, while HSS loses the positive definiteness. The convergence of PCG with NEW1 is also faster, as shown in Figure 4.4. In Table 4.3, we also show the convergence results for one matrix with different compression rank bounds r. A larger rank bound r leads to better convergence, but the preconditioner construction and application cost more.

5. Conclusions. We have presented the SIF framework and its analysis for preconditioning general SPD matrices. The SIF techniques include the scaling-and-compression strategy and the ULV factorization. A prototype preconditioner is used to illustrate the basic ideas and the fundamental analysis. Generalizations to practical multilevel schemes are then made and also analyzed. Systematic analysis for the robustness, accuracy control, and effectiveness of the preconditioners is given.


More information

Solving linear equations with Gaussian Elimination (I)

Solving linear equations with Gaussian Elimination (I) Term Projects Solving linear equations with Gaussian Elimination The QR Algorithm for Symmetric Eigenvalue Problem The QR Algorithm for The SVD Quasi-Newton Methods Solving linear equations with Gaussian

More information

Matrix decompositions

Matrix decompositions Matrix decompositions Zdeněk Dvořák May 19, 2015 Lemma 1 (Schur decomposition). If A is a symmetric real matrix, then there exists an orthogonal matrix Q and a diagonal matrix D such that A = QDQ T. The

More information

Numerical Methods in Matrix Computations

Numerical Methods in Matrix Computations Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices

More information

Preconditioning Techniques Analysis for CG Method

Preconditioning Techniques Analysis for CG Method Preconditioning Techniques Analysis for CG Method Huaguang Song Department of Computer Science University of California, Davis hso@ucdavis.edu Abstract Matrix computation issue for solve linear system

More information

1. Introduction. In this paper, we consider the eigenvalue solution for a non- Hermitian matrix A:

1. Introduction. In this paper, we consider the eigenvalue solution for a non- Hermitian matrix A: A FAST CONTOUR-INTEGRAL EIGENSOLVER FOR NON-HERMITIAN MATRICES XIN YE, JIANLIN XIA, RAYMOND H. CHAN, STEPHEN CAULEY, AND VENKATARAMANAN BALAKRISHNAN Abstract. We present a fast contour-integral eigensolver

More information

Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners

Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners Eugene Vecharynski 1 Andrew Knyazev 2 1 Department of Computer Science and Engineering University of Minnesota 2 Department

More information

Lecture Note 7: Iterative methods for solving linear systems. Xiaoqun Zhang Shanghai Jiao Tong University

Lecture Note 7: Iterative methods for solving linear systems. Xiaoqun Zhang Shanghai Jiao Tong University Lecture Note 7: Iterative methods for solving linear systems Xiaoqun Zhang Shanghai Jiao Tong University Last updated: December 24, 2014 1.1 Review on linear algebra Norms of vectors and matrices vector

More information

Lecture 3: QR-Factorization

Lecture 3: QR-Factorization Lecture 3: QR-Factorization This lecture introduces the Gram Schmidt orthonormalization process and the associated QR-factorization of matrices It also outlines some applications of this factorization

More information

SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA

SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 1 SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 2 OUTLINE Sparse matrix storage format Basic factorization

More information

Sparse factorization using low rank submatrices. Cleve Ashcraft LSTC 2010 MUMPS User Group Meeting April 15-16, 2010 Toulouse, FRANCE

Sparse factorization using low rank submatrices. Cleve Ashcraft LSTC 2010 MUMPS User Group Meeting April 15-16, 2010 Toulouse, FRANCE Sparse factorization using low rank submatrices Cleve Ashcraft LSTC cleve@lstc.com 21 MUMPS User Group Meeting April 15-16, 21 Toulouse, FRANCE ftp.lstc.com:outgoing/cleve/mumps1 Ashcraft.pdf 1 LSTC Livermore

More information

Review Questions REVIEW QUESTIONS 71

Review Questions REVIEW QUESTIONS 71 REVIEW QUESTIONS 71 MATLAB, is [42]. For a comprehensive treatment of error analysis and perturbation theory for linear systems and many other problems in linear algebra, see [126, 241]. An overview of

More information

EIGENVALUE PROBLEMS. EIGENVALUE PROBLEMS p. 1/4

EIGENVALUE PROBLEMS. EIGENVALUE PROBLEMS p. 1/4 EIGENVALUE PROBLEMS EIGENVALUE PROBLEMS p. 1/4 EIGENVALUE PROBLEMS p. 2/4 Eigenvalues and eigenvectors Let A C n n. Suppose Ax = λx, x 0, then x is a (right) eigenvector of A, corresponding to the eigenvalue

More information

Scientific Computing: Solving Linear Systems

Scientific Computing: Solving Linear Systems Scientific Computing: Solving Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course MATH-GA.2043 or CSCI-GA.2112, Spring 2012 September 17th and 24th, 2015 A. Donev (Courant

More information

Fast Structured Spectral Methods

Fast Structured Spectral Methods Spectral methods HSS structures Fast algorithms Conclusion Fast Structured Spectral Methods Yingwei Wang Department of Mathematics, Purdue University Joint work with Prof Jie Shen and Prof Jianlin Xia

More information

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.

More information

Multigrid absolute value preconditioning

Multigrid absolute value preconditioning Multigrid absolute value preconditioning Eugene Vecharynski 1 Andrew Knyazev 2 (speaker) 1 Department of Computer Science and Engineering University of Minnesota 2 Department of Mathematical and Statistical

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Comparative Analysis of Mesh Generators and MIC(0) Preconditioning of FEM Elasticity Systems

Comparative Analysis of Mesh Generators and MIC(0) Preconditioning of FEM Elasticity Systems Comparative Analysis of Mesh Generators and MIC(0) Preconditioning of FEM Elasticity Systems Nikola Kosturski and Svetozar Margenov Institute for Parallel Processing, Bulgarian Academy of Sciences Abstract.

More information

Course Notes: Week 1

Course Notes: Week 1 Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues

More information

Bindel, Spring 2016 Numerical Analysis (CS 4220) Notes for

Bindel, Spring 2016 Numerical Analysis (CS 4220) Notes for Cholesky Notes for 2016-02-17 2016-02-19 So far, we have focused on the LU factorization for general nonsymmetric matrices. There is an alternate factorization for the case where A is symmetric positive

More information

Jae Heon Yun and Yu Du Han

Jae Heon Yun and Yu Du Han Bull. Korean Math. Soc. 39 (2002), No. 3, pp. 495 509 MODIFIED INCOMPLETE CHOLESKY FACTORIZATION PRECONDITIONERS FOR A SYMMETRIC POSITIVE DEFINITE MATRIX Jae Heon Yun and Yu Du Han Abstract. We propose

More information

9.1 Preconditioned Krylov Subspace Methods

9.1 Preconditioned Krylov Subspace Methods Chapter 9 PRECONDITIONING 9.1 Preconditioned Krylov Subspace Methods 9.2 Preconditioned Conjugate Gradient 9.3 Preconditioned Generalized Minimal Residual 9.4 Relaxation Method Preconditioners 9.5 Incomplete

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

ANONSINGULAR tridiagonal linear system of the form

ANONSINGULAR tridiagonal linear system of the form Generalized Diagonal Pivoting Methods for Tridiagonal Systems without Interchanges Jennifer B. Erway, Roummel F. Marcia, and Joseph A. Tyson Abstract It has been shown that a nonsingular symmetric tridiagonal

More information

Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices

Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices Jaehyun Park June 1 2016 Abstract We consider the problem of writing an arbitrary symmetric matrix as

More information

Math 671: Tensor Train decomposition methods

Math 671: Tensor Train decomposition methods Math 671: Eduardo Corona 1 1 University of Michigan at Ann Arbor December 8, 2016 Table of Contents 1 Preliminaries and goal 2 Unfolding matrices for tensorized arrays The Tensor Train decomposition 3

More information

Scientific Computing: Dense Linear Systems

Scientific Computing: Dense Linear Systems Scientific Computing: Dense Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course MATH-GA.2043 or CSCI-GA.2112, Spring 2012 February 9th, 2012 A. Donev (Courant Institute)

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Decompositions, numerical aspects Gerard Sleijpen and Martin van Gijzen September 27, 2017 1 Delft University of Technology Program Lecture 2 LU-decomposition Basic algorithm Cost

More information

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects Numerical Linear Algebra Decompositions, numerical aspects Program Lecture 2 LU-decomposition Basic algorithm Cost Stability Pivoting Cholesky decomposition Sparse matrices and reorderings Gerard Sleijpen

More information

1. Fast Iterative Solvers of SLE

1. Fast Iterative Solvers of SLE 1. Fast Iterative Solvers of crucial drawback of solvers discussed so far: they become slower if we discretize more accurate! now: look for possible remedies relaxation: explicit application of the multigrid

More information

A Cholesky LR algorithm for the positive definite symmetric diagonal-plus-semiseparable eigenproblem

A Cholesky LR algorithm for the positive definite symmetric diagonal-plus-semiseparable eigenproblem A Cholesky LR algorithm for the positive definite symmetric diagonal-plus-semiseparable eigenproblem Bor Plestenjak Department of Mathematics University of Ljubljana Slovenia Ellen Van Camp and Marc Van

More information

AMS Mathematics Subject Classification : 65F10,65F50. Key words and phrases: ILUS factorization, preconditioning, Schur complement, 1.

AMS Mathematics Subject Classification : 65F10,65F50. Key words and phrases: ILUS factorization, preconditioning, Schur complement, 1. J. Appl. Math. & Computing Vol. 15(2004), No. 1, pp. 299-312 BILUS: A BLOCK VERSION OF ILUS FACTORIZATION DAVOD KHOJASTEH SALKUYEH AND FAEZEH TOUTOUNIAN Abstract. ILUS factorization has many desirable

More information

CLASSICAL ITERATIVE METHODS

CLASSICAL ITERATIVE METHODS CLASSICAL ITERATIVE METHODS LONG CHEN In this notes we discuss classic iterative methods on solving the linear operator equation (1) Au = f, posed on a finite dimensional Hilbert space V = R N equipped

More information

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for 1 Logistics Notes for 2016-08-26 1. Our enrollment is at 50, and there are still a few students who want to get in. We only have 50 seats in the room, and I cannot increase the cap further. So if you are

More information

Domain decomposition on different levels of the Jacobi-Davidson method

Domain decomposition on different levels of the Jacobi-Davidson method hapter 5 Domain decomposition on different levels of the Jacobi-Davidson method Abstract Most computational work of Jacobi-Davidson [46], an iterative method suitable for computing solutions of large dimensional

More information

The Singular Value Decomposition

The Singular Value Decomposition The Singular Value Decomposition An Important topic in NLA Radu Tiberiu Trîmbiţaş Babeş-Bolyai University February 23, 2009 Radu Tiberiu Trîmbiţaş ( Babeş-Bolyai University)The Singular Value Decomposition

More information

Last Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection

Last Time. Social Network Graphs Betweenness. Graph Laplacian. Girvan-Newman Algorithm. Spectral Bisection Eigenvalue Problems Last Time Social Network Graphs Betweenness Girvan-Newman Algorithm Graph Laplacian Spectral Bisection λ 2, w 2 Today Small deviation into eigenvalue problems Formulation Standard eigenvalue

More information

6.4 Krylov Subspaces and Conjugate Gradients

6.4 Krylov Subspaces and Conjugate Gradients 6.4 Krylov Subspaces and Conjugate Gradients Our original equation is Ax = b. The preconditioned equation is P Ax = P b. When we write P, we never intend that an inverse will be explicitly computed. P

More information

Lecture # 20 The Preconditioned Conjugate Gradient Method

Lecture # 20 The Preconditioned Conjugate Gradient Method Lecture # 20 The Preconditioned Conjugate Gradient Method We wish to solve Ax = b (1) A R n n is symmetric and positive definite (SPD). We then of n are being VERY LARGE, say, n = 10 6 or n = 10 7. Usually,

More information

An H-LU Based Direct Finite Element Solver Accelerated by Nested Dissection for Large-scale Modeling of ICs and Packages

An H-LU Based Direct Finite Element Solver Accelerated by Nested Dissection for Large-scale Modeling of ICs and Packages PIERS ONLINE, VOL. 6, NO. 7, 2010 679 An H-LU Based Direct Finite Element Solver Accelerated by Nested Dissection for Large-scale Modeling of ICs and Packages Haixin Liu and Dan Jiao School of Electrical

More information

Enhancing Scalability of Sparse Direct Methods

Enhancing Scalability of Sparse Direct Methods Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.

More information

Scientific Computing

Scientific Computing Scientific Computing Direct solution methods Martin van Gijzen Delft University of Technology October 3, 2018 1 Program October 3 Matrix norms LU decomposition Basic algorithm Cost Stability Pivoting Pivoting

More information

be a Householder matrix. Then prove the followings H = I 2 uut Hu = (I 2 uu u T u )u = u 2 uut u

be a Householder matrix. Then prove the followings H = I 2 uut Hu = (I 2 uu u T u )u = u 2 uut u MATH 434/534 Theoretical Assignment 7 Solution Chapter 7 (71) Let H = I 2uuT Hu = u (ii) Hv = v if = 0 be a Householder matrix Then prove the followings H = I 2 uut Hu = (I 2 uu )u = u 2 uut u = u 2u =

More information

lecture 2 and 3: algorithms for linear algebra

lecture 2 and 3: algorithms for linear algebra lecture 2 and 3: algorithms for linear algebra STAT 545: Introduction to computational statistics Vinayak Rao Department of Statistics, Purdue University August 27, 2018 Solving a system of linear equations

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

Multilevel Preconditioning of Graph-Laplacians: Polynomial Approximation of the Pivot Blocks Inverses

Multilevel Preconditioning of Graph-Laplacians: Polynomial Approximation of the Pivot Blocks Inverses Multilevel Preconditioning of Graph-Laplacians: Polynomial Approximation of the Pivot Blocks Inverses P. Boyanova 1, I. Georgiev 34, S. Margenov, L. Zikatanov 5 1 Uppsala University, Box 337, 751 05 Uppsala,

More information

Math 471 (Numerical methods) Chapter 3 (second half). System of equations

Math 471 (Numerical methods) Chapter 3 (second half). System of equations Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular

More information

Contents. Preface... xi. Introduction...

Contents. Preface... xi. Introduction... Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Direct Methods Philippe B. Laval KSU Fall 2017 Philippe B. Laval (KSU) Linear Systems: Direct Solution Methods Fall 2017 1 / 14 Introduction The solution of linear systems is one

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 3: Positive-Definite Systems; Cholesky Factorization Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 11 Symmetric

More information

Conjugate Gradient Method

Conjugate Gradient Method Conjugate Gradient Method direct and indirect methods positive definite linear systems Krylov sequence spectral analysis of Krylov sequence preconditioning Prof. S. Boyd, EE364b, Stanford University Three

More information

Section 4.5 Eigenvalues of Symmetric Tridiagonal Matrices

Section 4.5 Eigenvalues of Symmetric Tridiagonal Matrices Section 4.5 Eigenvalues of Symmetric Tridiagonal Matrices Key Terms Symmetric matrix Tridiagonal matrix Orthogonal matrix QR-factorization Rotation matrices (plane rotations) Eigenvalues We will now complete

More information

Scientific Computing WS 2018/2019. Lecture 9. Jürgen Fuhrmann Lecture 9 Slide 1

Scientific Computing WS 2018/2019. Lecture 9. Jürgen Fuhrmann Lecture 9 Slide 1 Scientific Computing WS 2018/2019 Lecture 9 Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de Lecture 9 Slide 1 Lecture 9 Slide 2 Simple iteration with preconditioning Idea: Aû = b iterative scheme û = û

More information

Incomplete LU Preconditioning and Error Compensation Strategies for Sparse Matrices

Incomplete LU Preconditioning and Error Compensation Strategies for Sparse Matrices Incomplete LU Preconditioning and Error Compensation Strategies for Sparse Matrices Eun-Joo Lee Department of Computer Science, East Stroudsburg University of Pennsylvania, 327 Science and Technology Center,

More information

LU Factorization. LU factorization is the most common way of solving linear systems! Ax = b LUx = b

LU Factorization. LU factorization is the most common way of solving linear systems! Ax = b LUx = b AM 205: lecture 7 Last time: LU factorization Today s lecture: Cholesky factorization, timing, QR factorization Reminder: assignment 1 due at 5 PM on Friday September 22 LU Factorization LU factorization

More information

Numerical Analysis Lecture Notes

Numerical Analysis Lecture Notes Numerical Analysis Lecture Notes Peter J Olver 8 Numerical Computation of Eigenvalues In this part, we discuss some practical methods for computing eigenvalues and eigenvectors of matrices Needless to

More information

On solving linear systems arising from Shishkin mesh discretizations

On solving linear systems arising from Shishkin mesh discretizations On solving linear systems arising from Shishkin mesh discretizations Petr Tichý Faculty of Mathematics and Physics, Charles University joint work with Carlos Echeverría, Jörg Liesen, and Daniel Szyld October

More information

Notes on Eigenvalues, Singular Values and QR

Notes on Eigenvalues, Singular Values and QR Notes on Eigenvalues, Singular Values and QR Michael Overton, Numerical Computing, Spring 2017 March 30, 2017 1 Eigenvalues Everyone who has studied linear algebra knows the definition: given a square

More information

Multilevel Low-Rank Preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota. Modelling 2014 June 2,2014

Multilevel Low-Rank Preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota. Modelling 2014 June 2,2014 Multilevel Low-Rank Preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota Modelling 24 June 2,24 Dedicated to Owe Axelsson at the occasion of his 8th birthday

More information

Computational Methods. Eigenvalues and Singular Values

Computational Methods. Eigenvalues and Singular Values Computational Methods Eigenvalues and Singular Values Manfred Huber 2010 1 Eigenvalues and Singular Values Eigenvalues and singular values describe important aspects of transformations and of data relations

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 24: Preconditioning and Multigrid Solver Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 5 Preconditioning Motivation:

More information

In this section again we shall assume that the matrix A is m m, real and symmetric.

In this section again we shall assume that the matrix A is m m, real and symmetric. 84 3. The QR algorithm without shifts See Chapter 28 of the textbook In this section again we shall assume that the matrix A is m m, real and symmetric. 3.1. Simultaneous Iterations algorithm Suppose we

More information

14.2 QR Factorization with Column Pivoting

14.2 QR Factorization with Column Pivoting page 531 Chapter 14 Special Topics Background Material Needed Vector and Matrix Norms (Section 25) Rounding Errors in Basic Floating Point Operations (Section 33 37) Forward Elimination and Back Substitution

More information

BlockMatrixComputations and the Singular Value Decomposition. ATaleofTwoIdeas

BlockMatrixComputations and the Singular Value Decomposition. ATaleofTwoIdeas BlockMatrixComputations and the Singular Value Decomposition ATaleofTwoIdeas Charles F. Van Loan Department of Computer Science Cornell University Supported in part by the NSF contract CCR-9901988. Block

More information

Linear Solvers. Andrew Hazel

Linear Solvers. Andrew Hazel Linear Solvers Andrew Hazel Introduction Thus far we have talked about the formulation and discretisation of physical problems...... and stopped when we got to a discrete linear system of equations. Introduction

More information

Math 411 Preliminaries

Math 411 Preliminaries Math 411 Preliminaries Provide a list of preliminary vocabulary and concepts Preliminary Basic Netwon s method, Taylor series expansion (for single and multiple variables), Eigenvalue, Eigenvector, Vector

More information

A direct solver for elliptic PDEs in three dimensions based on hierarchical merging of Poincaré-Steklov operators

A direct solver for elliptic PDEs in three dimensions based on hierarchical merging of Poincaré-Steklov operators (1) A direct solver for elliptic PDEs in three dimensions based on hierarchical merging of Poincaré-Steklov operators S. Hao 1, P.G. Martinsson 2 Abstract: A numerical method for variable coefficient elliptic

More information

SCALABLE HYBRID SPARSE LINEAR SOLVERS

SCALABLE HYBRID SPARSE LINEAR SOLVERS The Pennsylvania State University The Graduate School Department of Computer Science and Engineering SCALABLE HYBRID SPARSE LINEAR SOLVERS A Thesis in Computer Science and Engineering by Keita Teranishi

More information

A fast randomized algorithm for computing a Hierarchically Semi-Separable representation of a matrix

A fast randomized algorithm for computing a Hierarchically Semi-Separable representation of a matrix A fast randomized algorithm for computing a Hierarchically Semi-Separable representation of a matrix P.G. Martinsson, Department of Applied Mathematics, University of Colorado at Boulder Abstract: Randomized

More information

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark DM559 Linear and Integer Programming LU Factorization Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark [Based on slides by Lieven Vandenberghe, UCLA] Outline

More information

Notes on Householder QR Factorization

Notes on Householder QR Factorization Notes on Householder QR Factorization Robert A van de Geijn Department of Computer Science he University of exas at Austin Austin, X 7872 rvdg@csutexasedu September 2, 24 Motivation A fundamental problem

More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information