A robust inner-outer HSS preconditioner


NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS
Numer. Linear Algebra Appl. 2011; 00:1-0 [Version: 2002/09/18 v1.02]

A robust inner-outer HSS preconditioner

Jianlin Xia
Department of Mathematics, Purdue University, West Lafayette, IN 47907, U.S.A.

SUMMARY

This paper presents an inner-outer preconditioner for symmetric positive definite matrices based on the hierarchically semiseparable (HSS) matrix representation. A sequence of new HSS algorithms is developed, including some ULV-type HSS methods and triangular HSS methods. During the construction of this preconditioner, off-diagonal blocks are compressed in parallel, and an approximate HSS form is used as the outer HSS representation. In the meantime, the internal dense diagonal blocks are approximately factorized into HSS Cholesky factors. Such inner factorizations are guaranteed to exist for any approximation accuracy. The inner-outer preconditioner combines the advantages of both direct HSS and triangular HSS methods, and is both scalable and robust. A systematic complexity analysis and a discussion of the errors are presented. Various tests on practical numerical problems are used to demonstrate the efficiency and robustness. In particular, the effectiveness of the preconditioner in iterative solutions is shown for some ill-conditioned problems. This work also gives a practical way of developing inner-outer preconditioners using direct rank-structured factorizations rather than iterative methods. Copyright (c) 2011 John Wiley & Sons, Ltd.

KEY WORDS: robust preconditioner; hierarchically semiseparable (HSS) matrix; inner-outer HSS algorithm; ULV factorization

1. INTRODUCTION

In this paper, we present a robust and scalable inner-outer preconditioner based on hierarchically semiseparable (HSS) matrix structures [7, 8, 35]. It is known that dense intermediate matrices in many practical applications have certain low-rank structure or low-rank off-diagonal blocks, based on the early studies by Rokhlin [28], Gohberg, Kailath, and Koltracht [12], and on the ideas of the fast multipole method (FMM) by Greengard and Rokhlin [17]. The HSS representation by Chandrasekaran, Dewilde, Gu, et al. is a stable special form of FMM representations and of the H- and H^2-matrices by Hackbusch, et al. [5, 19, 20]. HSS matrices have been widely used in the development of fast numerical methods for sparse matrices, PDEs, integral equations, Toeplitz problems, and more [7, 22, 23, 26, 31, 34, etc.]. HSS methods can often be used to solve related problems with nearly linear complexity.

Correspondence to: Jianlin Xia, Department of Mathematics, Purdue University, West Lafayette, IN 47907, U.S.A. xiaj@math.purdue.edu
The research of Jianlin Xia was supported in part by NSF grants DMS and CHE.
Copyright (c) 2011 John Wiley & Sons, Ltd.

An HSS form has a nice binary tree structure, and the operations are conducted locally at multiple hierarchical levels. The applications of HSS methods and similar structures as preconditioners have been considered by Dewilde, Gu, et al. [18, 30, 36]. These methods can be considered as structured incomplete factorizations. Robustness enhancement has also been studied. In previous studies for general symmetric positive definite (SPD) matrices, it is known that incomplete Cholesky factorizations exist for M-matrices and certain H-matrices; see, e.g., [27] by Meijerink and van der Vorst and [25] by Manteuffel. Some robust or stabilized methods have been proposed for general SPD matrices; see, e.g., [3, 4] by Benzi, Cullum, and Tuma, and [21] by Kaporin.

In fact, rank-structured matrix techniques have also been shown to be very useful in robust preconditioning. Robustness techniques for H-matrices are discussed by Bebendorf and Hackbusch [2]. For an SPD matrix, the structured methods in [18] by Gu, Li, and Vassilevski and in [36] by Xia and Gu guarantee that the structured preconditioners are always positive definite. In particular, it is discussed in [36] that HSS methods have the potential to work as effective preconditioners for problems where the low-rank structure is insignificant, or when the problems have only a weak low-rank property. However, both methods in [18, 36] involve Schur complement computations in approximate triangular factorizations, and are not easily parallelizable.

Recently, the scalability of HSS methods has been investigated by Wang, Li, et al. [32]. It is shown that both the direct construction of an HSS form and the solution of an HSS system with a ULV factorization procedure are highly parallelizable. In ULV, U and V represent orthogonal matrices and L represents triangular ones. On the other hand, the direct HSS construction may not preserve positive definiteness, so the ULV factorization may break down. Direct HSS methods and triangular HSS factorizations also have different efficiency. For example, the direct construction of an HSS approximation to a general SPD matrix costs 3rn^2 flops under certain assumptions, where r is a related off-diagonal rank bound [33]. Then the ULV factorization and solution in [35] cost at least (56/3) r^2 n flops. In contrast, an HSS Cholesky factorization costs about (11/2) r n^2 flops, but the triangular HSS solution is more efficient and needs only 10 r n flops.

1.1. Main results

The task of this work is to combine the benefits of both direct HSS and triangular HSS methods. That is, we propose an efficient and effective preconditioner for SPD matrices which is both scalable and robust. Motivated by the ideas of inner-outer iterative preconditioning by Saad [29], Golub and Ye [14], Bai, Benzi, and Chen [1], et al., we design an inner-outer HSS factorization scheme. We give a sequence of new HSS algorithms, including both direct and triangular HSS ones. The inner-outer preconditioner then uses a direct HSS construction and ULV factorization as the outer scheme, where an inner triangular HSS factorization is applied to the diagonal blocks. Both the inner and the outer schemes approximate certain off-diagonal blocks with compression or rank-revealing factorizations. Just like some inner-outer iteration-based preconditioners [1, 14], we can also use lower accuracies in the inner HSS factorizations so as to save costs. Moreover, we allow the flexibility of adjusting the levels of hierarchical operations in both the inner and the outer HSS methods.
The level of outer operations can be designed to fit the available parallel computing resources, and the inner one can then be adjusted to enhance the robustness. Here, robustness means that our inner scheme guarantees the existence of a local HSS Cholesky factorization regardless of the approximation accuracy, unlike ULV HSS methods, which may break down at certain stages of the hierarchical operations (see, e.g., Example 2 below).

At the outer HSS levels, even if breakdown occurs, we can use simple diagonal shifting. Since the number of outer HSS levels is often small, the effect of diagonal shifting on the approximation accuracy is limited, as illustrated in Section 5.

The individual new HSS algorithms here are compared with similar existing ones, and generally have better efficiency and scalability. For example, for ULV HSS factorizations, we use local triangular factorizations for diagonal blocks, which not only are more efficient than the QR factorizations used in [8], but also enable us to use the new triangular HSS algorithms as inner operations. Our triangular HSS factorization only needs to compress off-diagonal blocks about half the size of those used in [36], while keeping about the same accuracy. Unlike the triangular HSS solution procedure in [24], which is sequential, a scalable triangular ULV factorization scheme with similar cost is proposed. The detailed complexity of these new HSS algorithms is analyzed. The analysis can help us compare different methods and provide improvements and optimization; see, e.g., (39) and (40). Discussions on the errors are given.

Several numerical examples are used to demonstrate the efficiency and robustness of the algorithms. We show the effectiveness of the inner-outer HSS preconditioning for some ill-conditioned problems, such as a Toeplitz problem generated by a Gaussian radial basis function and a linear elasticity PDE. We observe that, with efficient inner-outer HSS preconditioning, the preconditioned conjugate gradient method converges quickly for all the examples, and the convergence is often much faster than with block diagonal preconditioning. The overall inner-outer preconditioning cost is often insignificant when we manually set the off-diagonal rank bounds to be small. Our methods can also be built into sparse factorization or preconditioning frameworks as in [15, 16, 34] by Grasedyck, Le Borne, Kriemann, Xia, et al.

1.2. Outline and notation

The remaining sections are organized as follows. Section 2 provides some new HSS factorization algorithms for SPD matrices. Additional HSS solution algorithms and our inner-outer HSS preconditioner are given in Section 3. The complexity and error analyses are presented in Section 4. Sections 5 and 6 are devoted to some numerical experiments and concluding remarks, respectively.

For convenience, the following notation is used in the presentation:

For an index set t_i contained in I = {1, 2, ..., n}, let t_i^c, t_i, t_i^r partition I (t_i^c U t_i U t_i^r = I), where all indices in t_i^c are smaller than those in t_i, and all indices in t_i^r are larger than those in t_i.

A_{t_i x t_j} is the submatrix of a matrix A with row index set t_i and column index set t_j. If a rank-revealing QR (RRQR) factorization is computed for A_{t_i x t_j}, we sometimes write A_{t_i x t_j} ~ U_i A_{^t_i x t_j}, which can be understood as meaning that the factor A_{^t_i x t_j} is still stored in A with row index set ^t_i.

T is a postordered full binary tree with nodes ordered as 1, 2, .... That is, each non-leaf node i and its children c_{i,1}, c_{i,2} are ordered following c_{i,1} < c_{i,2} < i. Sometimes we just write the children as c_1, c_2 when no confusion is caused. For a node i, denote its parent and sibling by par(i) and sib(i), respectively.

diag(A_1, ..., A_k) is a block diagonal matrix with diagonal blocks A_1, ..., A_k.

The definition and notation for HSS structures are briefly reviewed as follows, following the postordering HSS form in [35, 36].

Definition 1.1. Assume A is an n x n dense matrix.
Let T be a postordered full binary tree with 2k - 1 nodes, and let t_i be an index set associated with each node i of T. We say T is a postordered HSS tree and A is in an HSS form if:

1. For each non-leaf node i with children c_1 and c_2, t_{c_1} U t_{c_2} = t_i, t_{c_1} and t_{c_2} are disjoint, and t_{2k-1} = I = {1, 2, ..., n}.

2. There exist matrices D_i, U_i, V_i, R_i, W_i, B_i (called HSS generators) associated with each node i satisfying

   D_i = [ D_{c_1},  U_{c_1} B_{c_1} V_{c_2}^T ;  U_{c_2} B_{c_2} V_{c_1}^T,  D_{c_2} ],   U_i = [ U_{c_1} R_{c_1} ; U_{c_2} R_{c_2} ],   V_i = [ V_{c_1} W_{c_1} ; V_{c_2} W_{c_2} ],    (1)

so that D_{2k-1} = A_{t_{2k-1} x t_{2k-1}} = A. Here, U_{2k-1}, V_{2k-1}, R_{2k-1}, W_{2k-1}, B_{2k-1} are empty matrices.

A is then in a data-sparse form given by the generators. For a non-leaf node i, the generators D_i, U_i, V_i are recursively defined and are not explicitly stored. Define the HSS block row and the HSS block column of node i as A_{t_i x (I \ t_i)} and A_{(I \ t_i) x t_i}, respectively. Clearly, U_i gives a column basis for the HSS block row, and V_i^T gives a row basis for the HSS block column. U_i and V_i are also called cluster bases in [5]. The maximum numerical rank of all HSS blocks is called the HSS rank of A.

The following is a block 4 x 4 HSS matrix example:

   A = [ D_1,                       U_1 B_1 V_2^T,               U_1 R_1 B_3 W_4^T V_4^T,   U_1 R_1 B_3 W_5^T V_5^T ;
         U_2 B_2 V_1^T,             D_2,                         U_2 R_2 B_3 W_4^T V_4^T,   U_2 R_2 B_3 W_5^T V_5^T ;
         U_4 R_4 B_6 W_1^T V_1^T,   U_4 R_4 B_6 W_2^T V_2^T,     D_4,                       U_4 B_4 V_5^T ;
         U_5 R_5 B_6 W_1^T V_1^T,   U_5 R_5 B_6 W_2^T V_2^T,     U_5 B_5 V_4^T,             D_5 ].    (2)

See Figure 1. Obviously, if A is symmetric, we can set [35]

   D_i = D_i^T,   V_i = U_i,   B_j = B_i^T with j = sib(i).    (3)

Figure 1. HSS tree for the block 4 x 4 HSS matrix (2): leaves 1, 2, 4, 5 (associated with D_1, D_2, D_4, D_5), non-leaf nodes 3 and 6, and root 7 (D_7 = A), over two levels.

A given HSS form generally enables us to conduct related matrix operations, such as matrix inversions and multiplications, in linear complexity if the HSS rank is small. Otherwise, we can use compact HSS forms as efficient preconditioners.
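To make Definition 1.1 and the example (2)-(3) concrete, the following minimal NumPy sketch assembles the dense matrix represented by the block 4 x 4 symmetric HSS form from randomly chosen generators. The generator sizes m and r, and all variable names, are illustrative assumptions for the example rather than anything prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 8, 2  # leaf block size and HSS rank (illustrative values)

# Generators for the leaves 1, 2, 4, 5 of the tree in Figure 1, using the
# symmetric conventions (3): V_i = U_i, W_i = R_i, and B_j = B_i^T for j = sib(i).
D, U, R = {}, {}, {}
for i in (1, 2, 4, 5):
    X = rng.standard_normal((m, m))
    D[i] = X @ X.T + m * np.eye(m)                       # SPD diagonal generators
    U[i] = np.linalg.qr(rng.standard_normal((m, r)))[0]  # orthonormal column bases
    R[i] = rng.standard_normal((r, r))
B = {1: rng.standard_normal((r, r)),   # B_2 = B_1^T
     3: rng.standard_normal((r, r)),   # B_6 = B_3^T
     4: rng.standard_normal((r, r))}   # B_5 = B_4^T

# Dense matrix represented by the block 4 x 4 HSS form (2).
A = np.block([
  [D[1],                                   U[1] @ B[1] @ U[2].T,                    U[1] @ R[1] @ B[3] @ R[4].T @ U[4].T, U[1] @ R[1] @ B[3] @ R[5].T @ U[5].T],
  [U[2] @ B[1].T @ U[1].T,                 D[2],                                    U[2] @ R[2] @ B[3] @ R[4].T @ U[4].T, U[2] @ R[2] @ B[3] @ R[5].T @ U[5].T],
  [U[4] @ R[4] @ B[3].T @ R[1].T @ U[1].T, U[4] @ R[4] @ B[3].T @ R[2].T @ U[2].T,  D[4],                                 U[4] @ B[4] @ U[5].T],
  [U[5] @ R[5] @ B[3].T @ R[1].T @ U[1].T, U[5] @ R[5] @ B[3].T @ R[2].T @ U[2].T,  U[5] @ B[4].T @ U[4].T,               D[5]],
])

assert np.allclose(A, A.T)
# Every HSS block row has (numerical) rank at most r, e.g. that of leaf 1:
print("rank of HSS block row of node 1:", np.linalg.matrix_rank(A[:m, m:]))
```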

2. NEW SPD HSS FACTORIZATION ALGORITHMS

We show how to convert a dense SPD matrix into data-sparse forms, by either direct HSS construction with ULV factorization or HSS Cholesky factorization. In this section, for convenience, each subsection uses the same notation for the generators of the related HSS form. To help the reader understand the algorithms, we briefly summarize the major framework of each algorithm in Table I.

Table I. Framework of the major algorithms in Section 2.
  HSS construction (Section 2.1):
    Leaf level: off-diagonal block row compression for U generators;
    Non-leaf levels: off-diagonal block row compression for R generators;
    Similar off-diagonal block column compression.
  ULV HSS factorization (Section 2.2, and similarly for Section 2.4):
    Diagonal block factorization;
    Introducing zeros into off-diagonal blocks while preserving identity diagonal blocks;
    Partial diagonal elimination;
    Merging and recursion.
  Robust HSS Cholesky factorization (Section 2.3):
    Leaf level: diagonal block factorization; computation of the off-diagonal block row of the factor and its compression for U generators; partial Schur complement formation;
    Non-leaf levels: off-diagonal block row compression for R generators;
    Similar off-diagonal block column compression.

2.1. Scalable symmetric HSS construction

The HSS construction algorithm in [35] uses nested compression of the HSS blocks of a given matrix A. However, it is a sequential process, since early compression results participate in later compression. The algorithm can be modified into a new scalable one. In particular, for a symmetric matrix A, we only need to compress the block rows. Moreover, we can further reduce the compression cost. The method has two stages and is briefly explained as follows. For simplicity, assume the row size and numerical rank of all HSS block rows are m and r, respectively.

During the first stage, all HSS block rows are compressed.

If i is a leaf, compute an RRQR factorization

   A_{t_i x (I \ t_i)} ~ U_i A_{^t_i x (I \ t_i)}.

If i is a non-leaf node with children c_1 and c_2, compute an RRQR factorization of the matrix

   [ A_{^t_{c_1} x (I \ t_i)} ; A_{^t_{c_2} x (I \ t_i)} ] ~ [ R_{c_1} ; R_{c_2} ] A_{^t_i x (I \ t_i)}.

Then by recursion,

   A_{t_i x (I \ t_i)} ~ diag(U_{c_1}, U_{c_2}) [ A_{^t_{c_1} x (I \ t_i)} ; A_{^t_{c_2} x (I \ t_i)} ] ~ diag(U_{c_1}, U_{c_2}) [ R_{c_1} ; R_{c_2} ] A_{^t_i x (I \ t_i)} = U_i A_{^t_i x (I \ t_i)}.

That is, we obtain the orthonormal column basis U_i for each HSS block row. These compression steps are done independently of the other nodes at the same level of the HSS tree. After this stage, all U and R generators are available.
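As a small illustration of the leaf-level stage, the sketch below compresses each HSS block row of a synthetic symmetric matrix with a truncated pivoted QR (a simple RRQR substitute), and then reads off a B generator for a pair of sibling leaves in the spirit of (4) in the next stage. The test matrix, the tolerance, and the helper names are assumptions made for this example only.

```python
import numpy as np
from scipy.linalg import qr

def rrqr_rowbasis(block, tol):
    """Compress a block row: block ~ U @ block_hat via truncated pivoted QR."""
    Q, Rfac, piv = qr(block, pivoting=True, mode='economic')
    d = np.abs(np.diag(Rfac))
    k = max(1, int(np.sum(d > tol * d[0])))    # numerical rank for tolerance tol
    U = Q[:, :k]                               # orthonormal column basis
    return U, U.T @ block                      # basis and compressed rows

rng = np.random.default_rng(1)
n, m, tol = 64, 16, 1e-8
# A synthetic symmetric matrix whose off-diagonal blocks are numerically low rank.
x = np.linspace(0, 1, n)
A = 1.0 / (1.0 + 25.0 * (x[:, None] - x[None, :])**2) + n * np.eye(n)

# Leaf-level stage: compress each HSS block row A(t_i, I \ t_i) independently.
U = {}
for leaf, start in enumerate(range(0, n, m)):
    rows = np.arange(start, start + m)
    cols = np.setdiff1d(np.arange(n), rows)
    U[leaf], _ = rrqr_rowbasis(A[np.ix_(rows, cols)], tol)

# B generator for the sibling leaves 0 and 1: A(t_0, t_1) ~ U_0 B_0 U_1^T,
# so B_0 can be read off by projecting onto the already computed bases.
t0, t1 = np.arange(0, m), np.arange(m, 2 * m)
B0 = U[0].T @ A[np.ix_(t0, t1)] @ U[1]
err = np.linalg.norm(A[np.ix_(t0, t1)] - U[0] @ B0 @ U[1].T)
print("off-diagonal approximation error:", err)
```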

In the next stage, we compute the B generators. Let j = sib(i). According to the first stage, we have A_{t_i x t_j} ~ U_i A_{^t_i x t_j}. On the other hand, in the HSS approximation, A_{t_i x t_j} has the form U_i B_i U_j^T. Thus, we can let

   B_i = A_{^t_i x t_j} U_j   if i is a leaf,   and   B_i = A_{^t_i x t_j} diag( U_{c_{j,1}}, U_{c_{j,2}} ) [ R_{c_{j,1}} ; R_{c_{j,2}} ]   otherwise.    (4)

Note that the method in [35] needs additional compression steps to get B_i, which are generally more expensive than (4).

2.2. Improved ULV-type HSS factorization

The ULV HSS Cholesky factorization in [35] computes an explicit ULV-type factorization of an HSS matrix A. The basic idea is similar to the implicit ULV solution in [8]. However, for SPD matrices, as pointed out in [35], the efficiency of these ULV algorithms can be improved. Moreover, we can take advantage of certain special forms of the generators. The main steps are illustrated in Figure 2 and are explained as follows. For simplicity, assume all leaf-level HSS block rows have row size m_i = m, and the ranks of all HSS blocks are r.

Since U_i forms a basis for the column space of the HSS block row of node i, we sometimes write this block row as U_i T_i. If node i is a leaf, compute a Cholesky factorization D_i = L_i L_i^T, and then a full QL factorization of L_i^{-1} U_i as

   L_i^{-1} U_i = Q_i [ 0 ; U~_i ],    (5)

where the zero block has m - r rows and U~_i is r x r. See Figure 2(i)-(ii). Then apply Q_i^T to the HSS block row on the left:

   Q_i^T L_i^{-1} ( U_i T_i ) = [ 0 ; U~_i T_i ].

This also means Q_i is applied to the HSS block column on the right. Notice that the diagonal block remains an identity matrix. See Figure 2(iii). After this, the leading m - r rows of block row i can be eliminated. This is done for all the leaves.

If node i is a non-leaf node with its children c_1 and c_2 being leaves, assume the previous operations have been applied to both c_1 and c_2. That is,

   diag( Q_{c_1}^T L_{c_1}^{-1}, Q_{c_2}^T L_{c_2}^{-1} ) [ D_{c_1}, U_{c_1} B_{c_1} U_{c_2}^T ; U_{c_2} B_{c_1}^T U_{c_1}^T, D_{c_2} ] diag( L_{c_1}^{-T} Q_{c_1}, L_{c_2}^{-T} Q_{c_2} )
     = [ I, diag( 0, U~_{c_1} B_{c_1} U~_{c_2}^T ) ; diag( 0, U~_{c_2} B_{c_1}^T U~_{c_1}^T ), I ],    (6)

where diag(.) denotes a block diagonal matrix with the given blocks on the diagonal. Then we merge the appropriate blocks on the right-hand side of (6) and remove c_1 and c_2 from the HSS tree, so that i becomes a leaf with D and U generators

   D_i = [ I, U~_{c_1} B_{c_1} U~_{c_2}^T ; U~_{c_2} B_{c_1}^T U~_{c_1}^T, I ],   U_i = [ U~_{c_1} R_{c_1} ; U~_{c_2} R_{c_2} ].    (7)

Note that we are using D_i and U_i to mean the generators of a smaller matrix after the elimination, instead of those of A. This smaller matrix is called a reduced HSS matrix. See Figure 2(iv). The same operations then apply recursively. At the root level, a direct Cholesky factorization is computed. Figure 3 shows how an HSS tree is reduced along the factorization.

Figure 2. An improved ULV HSS factorization scheme for an SPD matrix: (i) Cholesky factorization of the diagonal blocks; (ii) introducing zeros into the off-diagonal blocks; (iii) after introducing zeros; (iv) merging the remaining blocks and repeating.

Figure 3. How the HSS tree is reduced along the ULV HSS factorization: (i) symmetric HSS tree; (ii) after eliminating one level; (iii) until the root.

Note that, due to the special form of D_i in (7), the intermediate Cholesky factorization D_i = L_i L_i^T at non-leaf levels can be done quickly. That is, we have

   L_i = [ I, 0 ; U~_{c_2} B_{c_1}^T U~_{c_1}^T, L~_i ],   with   L~_i L~_i^T = I - U~_{c_2} B_{c_1}^T U~_{c_1}^T U~_{c_1} B_{c_1} U~_{c_2}^T.    (8)

Similarly, the update L_i^{-1} U_i for (7) can also be done quickly as

   L_i^{-1} U_i = [ U~_{c_1} R_{c_1} ; L~_i^{-1} U~_{c_2} ( R_{c_2} - B_{c_1}^T U~_{c_1}^T U~_{c_1} R_{c_1} ) ].    (9)
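The following NumPy sketch checks the structured formulas (8)-(9) on synthetic r x r generators: only an r x r Cholesky factorization and a few small products are needed for the reduced block. The scaling applied to keep the block positive definite, and all variable names, are assumptions of the example, not part of the algorithm.

```python
import numpy as np

rng = np.random.default_rng(5)
r = 4
Ut1 = rng.standard_normal((r, r))       # U~_{c1}
Ut2 = rng.standard_normal((r, r))       # U~_{c2}
B = rng.standard_normal((r, r))         # B_{c1}
G = Ut1 @ B @ Ut2.T
scale = 0.5 / np.linalg.norm(G, 2)      # keep ||G|| < 1 so that D_i in (7) is SPD
B, G = B * scale, G * scale

# Reduced diagonal block (7) and its structured Cholesky factor (8).
D = np.block([[np.eye(r), G], [G.T, np.eye(r)]])
Lt = np.linalg.cholesky(np.eye(r) - G.T @ G)           # only an r x r Cholesky
L = np.block([[np.eye(r), np.zeros((r, r))], [G.T, Lt]])
print("Cholesky residual (8):", np.linalg.norm(L @ L.T - D))

# Structured update (9) of L_i^{-1} U_i with U_i = [U~_{c1} R_{c1}; U~_{c2} R_{c2}].
R1, R2 = rng.standard_normal((r, r)), rng.standard_normal((r, r))
U = np.vstack([Ut1 @ R1, Ut2 @ R2])
top = Ut1 @ R1
bot = np.linalg.solve(Lt, Ut2 @ R2 - G.T @ (Ut1 @ R1))
print("update residual (9):", np.linalg.norm(np.vstack([top, bot]) - np.linalg.solve(L, U)))
```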

Formulas (8)-(9) can be carefully organized so that they only need six matrix multiplications and one triangular solution. This improved algorithm is thus more efficient than the one in [35], which requires two dense square matrix products of the form Q_i^T D_i Q_i and a full QR factorization of a size (m - r) x r dense block for each node i. In fact, according to Table VI in Section 4, this new ULV factorization with m = 2r costs (35/3) r^2 n flops, while the one in [35] costs (56/3) r^2 n flops.

Our algorithm computes a ULV-type HSS factorization

   A = L^ L^T,    (10)

where L^ is given by a sequence of orthogonal and lower triangular matrices. We call L^ a ULV factor, which is formed recursively. Let A^(l) be the reduced matrix at level l obtained from A^(l+1) after the elimination of all nodes at level l + 1, where A^(l_max) = A for the leaf level l_max. Assume the nodes at level l are j_1, j_2, ..., j_a. Also let P_l be a permutation matrix which performs all the merge steps during the elimination of level l + 1 and also forms the new reduced HSS matrix. That is, P_l collects the (zero-padded) transformed bases of the sibling pairs,

   P_l X_l diag( [ U_{c_{j_1,1}} R_{c_{j_1,1}} ; U_{c_{j_1,2}} R_{c_{j_1,2}} ], ..., [ U_{c_{j_a,1}} R_{c_{j_a,1}} ; U_{c_{j_a,2}} R_{c_{j_a,2}} ] ) = [ 0 ; diag( U_{j_1}, ..., U_{j_a} ) ],    (11)

   P_l X_l A^(l+1) X_l^T P_l^T = diag( I, A^(l) ),    (12)

where the U_{j} on the right-hand side of (11) are the generators (7) of the reduced HSS matrix, and

   X_l = diag( diag( Q_{c_{j_1,1}}^T L_{c_{j_1,1}}^{-1}, Q_{c_{j_1,2}}^T L_{c_{j_1,2}}^{-1} ), ..., diag( Q_{c_{j_a,1}}^T L_{c_{j_a,1}}^{-1}, Q_{c_{j_a,2}}^T L_{c_{j_a,2}}^{-1} ) ).

Then we let L^ = L^(l_max), and

   A^(l) = L^(l) L^(l)T,   l = l_max, l_max - 1, ..., 1, 0,

where L^(0) is the Cholesky factor associated with the root, and

   L^(l+1) = X_l^{-1} P_l^T diag( I, L^(l) ) P_l,   l = 0, 1, ..., l_max - 1.    (13)

2.3. Robust HSS Cholesky factorization

2.3.1. Motivation of robust and effective HSS preconditioning. The factorization method in [36] computes an approximate Cholesky factorization A ~ R^T R for a dense SPD matrix A, where R is an upper triangular HSS matrix. The approximation R^T R is guaranteed to exist and to be positive definite, regardless of the approximation accuracy. The basic idea is called Schur compensation and can be illustrated as follows. Consider a block 2 x 2 SPD matrix and the Cholesky factorization A_{11} = D_1 D_1^T of its (1,1) block:

   A = [ A_{11}, A_{21}^T ; A_{21}, A_{22} ] = [ D_1, 0 ; A_{21} D_1^{-T}, I ] [ I, 0 ; 0, S ] [ D_1^T, D_1^{-1} A_{21}^T ; 0, I ],    (14)

where S = A_{22} - A_{21} D_1^{-T} D_1^{-1} A_{21}^T is the Schur complement. Compute an SVD

   D_1^{-1} A_{21}^T = U_1 Sigma U_2^T + U^_1 Sigma^ U^_2^T,

where the singular values in Sigma are larger than a tolerance tau, and those in Sigma^ are smaller than tau. (A relative tolerance tau may be used.) If all the singular values in Sigma^ are dropped, then S is approximated by

   S~ = A_{22} - U_2 Sigma^2 U_2^T = S + U^_2 Sigma^2^ U^_2^T = S + O(tau^2).    (15)

That is, a positive semidefinite term is automatically added to the Schur complement. Compute a Cholesky factorization S~ = D_2 D_2^T. We then get an approximate Cholesky factorization A ~ R^T R.
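The Schur compensation idea (14)-(15) can be verified numerically in a few lines. The NumPy sketch below truncates the off-diagonal block very aggressively and checks that the compensated Schur complement remains positive definite and dominates the exact one; the block sizes, the tolerance, and the names are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, tau = 20, 20, 1e-1                      # deliberately low accuracy
M = rng.standard_normal((n1 + n2, n1 + n2))
A = M @ M.T + 0.1 * np.eye(n1 + n2)             # SPD test matrix
A11, A21, A22 = A[:n1, :n1], A[n1:, :n1], A[n1:, n1:]

D1 = np.linalg.cholesky(A11)                    # A11 = D1 D1^T
T = np.linalg.solve(D1, A21.T)                  # D1^{-1} A21^T
W, s, Vt = np.linalg.svd(T, full_matrices=False)
keep = s > tau * s[0]                           # drop singular values below tau*sigma_1

S_exact = A22 - A21 @ np.linalg.solve(A11, A21.T)    # exact Schur complement
U2 = Vt[keep].T
S_tilde = A22 - U2 @ np.diag(s[keep]**2) @ U2.T      # approximation (15)

# The dropped term is positive semidefinite, so S_tilde stays SPD even though
# the off-diagonal block was compressed very aggressively.
print("kept singular values:", int(keep.sum()), "of", len(s))
print("min eig of exact S:  ", np.linalg.eigvalsh(S_exact).min())
print("min eig of S_tilde:  ", np.linalg.eigvalsh(S_tilde).min())
print("S_tilde - S is PSD:  ", np.linalg.eigvalsh(S_tilde - S_exact).min() >= -1e-12)
```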

It is also shown in [36] that, with certain additional techniques, a modification of R can work as an effective preconditioner when the low-rank property is not significant. That is, if the numerical rank for a given tolerance is large, we can still manually set a small rank to get a low-accuracy R. The idea is then generalized so that the Cholesky factor of A is approximated by an HSS matrix. RRQR factorizations are often used to replace SVDs in the compression, since they are generally more efficient.

2.3.2. Improved HSS Cholesky factorization. The HSS Cholesky factorization algorithm in [36] uses a form for R where only D, U, R, B generators are needed. R is computed block rowwise. The computed off-diagonal block column R_{t_i^c x t_i} and off-diagonal block row R_{t_i x t_i^r} associated with node i are put together as [ R_{t_i^c x t_i}^T, R_{t_i x t_i^r} ], which is compressed to provide the column basis U_i. This is equivalent to constructing a form for a symmetric full matrix whose block upper triangular part is R. Here, we propose a new procedure which has several advantages:

1. Instead of compressing [ R_{t_i^c x t_i}^T, R_{t_i x t_i^r} ], we compress R_{t_i x t_i^r} and R_{t_i^c x t_i} independently, so as to get their column basis U_i and row basis V_i^T, respectively. This implementation is much simpler.

2. The HSS form of the Cholesky factor we obtain is more compact, since the individual numerical ranks of R_{t_i^c x t_i} and R_{t_i x t_i^r} are usually smaller than that of [ R_{t_i^c x t_i}^T, R_{t_i x t_i^r} ], if the same compression accuracy is used.

3. The algorithm in [36] maintains a compressed form (see (22) below) only for Schur complement computations. Here, this compressed form is also used to speed up the computation of the V_i generators.

The new HSS Cholesky factorization procedure is described with the aid of the following definition.

Definition 2.1. [36] For a node i of a postordered binary tree T, the set of predecessors of i is defined to be

   pred(i) = {i}   if i is the root of T,   and   pred(i) = {i} U pred(par(i))   otherwise.

A visited set V_i (the set of visited nodes before i whose siblings have not been visited) is defined to be

   V_i = { j : j is a left node and sib(j) is in pred(i) }.    (16)

We can still use the HSS tree as in Definition 1.1 for R, but some generators do not exist. The following result can be easily verified.

Observation 1. For an HSS matrix to be block triangular, some of its generators are empty matrices, as specified in Table II.

Table II. Empty generators of triangular HSS matrices, where j is the last (largest-labeled) leaf of the HSS tree of the matrix.
                           Lower triangular     Upper triangular
  par(i) in pred(1):       U_i, R_i             V_i, W_i
  par(i) in pred(j):       V_i, W_i             U_i, R_i
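To make Definition 2.1 concrete, here is a tiny Python sketch that computes pred(i) and the visited set V_i of (16) for the 7-node postordered tree of Figure 1. The dictionaries encoding the tree are assumptions of the example.

```python
# Postordered perfect binary tree with 7 nodes: leaves 1, 2, 4, 5; parents 3, 6; root 7.
parent = {1: 3, 2: 3, 4: 6, 5: 6, 3: 7, 6: 7, 7: None}
sibling = {1: 2, 2: 1, 4: 5, 5: 4, 3: 6, 6: 3}
left_nodes = {1, 4, 3}            # first (smaller-labeled) child of each parent

def pred(i):
    """Predecessors of i: the path from i up to the root (Definition 2.1)."""
    return {i} if parent[i] is None else {i} | pred(parent[i])

def visited_set(i):
    """V_i = { j : j is a left node and sib(j) in pred(i) }, see (16)."""
    p = pred(i)
    return {j for j in left_nodes if sibling[j] in p}

for i in (1, 2, 4, 5, 3, 6, 7):
    print(i, "pred =", sorted(pred(i)), " V_i =", sorted(visited_set(i)))
```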

The factorization is conducted along the postordering traversal of the HSS tree, for nodes i = 1, 2, ....

If i is a leaf, let A^(i-1)_{t_{i-1}^r x t_{i-1}^r} denote the Schur complement due to the previous i - 1 factorization steps. A^(i-1)_{t_{i-1}^r x t_{i-1}^r} is only partially formed; that is, its first block row A^(i-1)_{t_i x t_{i-1}^r} is available (see (23) below). Note that t_{i-1}^r = t_i U t_i^r. Compute a block row [ D_i^T, R_{t_i x t_i^r} ] of R as in the usual Cholesky factorization. That is, compute the following Cholesky factorization and update the off-diagonal block as

   A^(i-1)_{t_i x t_i} = D_i D_i^T,   R_{t_i x t_i^r} = D_i^{-1} A^(i-1)_{t_i x t_i^r}.    (17)

Then compress R_{t_i x t_i^r} with an RRQR factorization

   R_{t_i x t_i^r} ~ U_i R_{^t_i x t_i^r}.    (18)

The off-diagonal block column R_{t_i^c x t_i} has already been partially compressed in previous steps, so that R_{t_j x t_i} ~ U_j R_{^t_j x t_i} for all j in V_i = {j_1, j_2, ..., j_s}, where t_i^c = t_{j_1} U t_{j_2} U ... U t_{j_s}. Ignoring the previously computed U bases, we only need to compress

   [ R_{^t_{j_1} x t_i} ; R_{^t_{j_2} x t_i} ; ... ; R_{^t_{j_s} x t_i} ] ~ [ R_{^t_{j_1} x ^t_i} ; R_{^t_{j_2} x ^t_i} ; ... ; R_{^t_{j_s} x ^t_i} ] V_i^T.    (19)

Note that (19) can be replaced by a more efficient step, as illustrated in (26) and (27) below.

If i is a non-leaf node, let c_1 and c_2 be its children. Clearly, t_{c_2}^r = t_i^r. The blocks R_{t_{c_1} x t_i^r} and R_{t_{c_2} x t_i^r} have been compressed in previous steps. With the basis matrices ignored, we compress the following matrix:

   [ R_{^t_{c_1} x t_i^r} ; R_{^t_{c_2} x t_i^r} ] ~ [ R_{c_1} ; R_{c_2} ] R_{^t_i x t_i^r}.    (20)

Similarly, the further compression of R_{t_i^c x t_i} = [ R_{t_i^c x t_{c_1}}, R_{t_i^c x t_{c_2}} ] is done by ignoring certain existing V basis matrices. That is, compress

   [ R_{^t_{j_1} x ^t_{c_1}}, R_{^t_{j_1} x ^t_{c_2}} ; ... ; R_{^t_{j_s} x ^t_{c_1}}, R_{^t_{j_s} x ^t_{c_2}} ] ~ [ R_{^t_{j_1} x ^t_i} ; ... ; R_{^t_{j_s} x ^t_i} ] [ W_{c_1}^T, W_{c_2}^T ].    (21)

After these compression steps, U_i and V_i are implicitly available due to the recursions in (1). According to Observation 1, if i is in pred(1), then (18) and (20) are not needed. Also, if i is in pred(j), where j is the last (largest-labeled) leaf, then (19) and (21) are not needed. Furthermore, if i is a right node with sibling j, we set B_j = R_{^t_j x ^t_i}.

Before the traversal of each leaf i, we partially form A^(i-1)_{t_i x t_{i-1}^r}, which is the first block row of the new Schur complement. Similar to [36], we also maintain a compact approximation

   R_{t_i^c x t_{i-1}^r} ~ Q_{i-1} Y_{i-1},    (22)

where Q_{i-1} has orthonormal columns and is not stored. Once (22) holds, then

   A^(i-1)_{t_i x t_{i-1}^r} = A_{t_i x t_{i-1}^r} - R_{t_i^c x t_i}^T R_{t_i^c x t_{i-1}^r} ~ A_{t_i x t_{i-1}^r} - Y_{i-1;1}^T Y_{i-1},    (23)

where a partitioned form of (22) is used:

   [ R_{t_i^c x t_i}, R_{t_i^c x t_i^r} ] ~ Q_{i-1} [ Y_{i-1;1}, Y_{i-1;2} ].    (24)

Here, Y_{i-1} is obtained approximately by recursion. For any i - 1 in pred(1), set Y_{i-1} = R_{^t_{i-1} x t_{i-1}^r}. Otherwise, let j be the largest leaf before i, so that Y_{j-1} is available from a previous step. Then Y_{i-1} can be approximately computed with an RRQR factorization

   [ Y_{j-1;2} ; R_{^t_j x t_j^r} ] ~ Q_{i-1} Y_{i-1}.    (25)

See [36] for the detailed derivation.

Furthermore, unlike [36], here we can use Y_{i-1} to improve the efficiency of computing V_i. Due to (24), the HSS block column R_{t_i^c x t_i} is already in the compressed form R_{t_i^c x t_i} ~ Q_{i-1} Y_{i-1;1}. Thus, to get the V_i^T which gives a row basis for R_{t_i^c x t_i}, we just compress Y_{i-1;1} with an RRQR factorization

   Y_{i-1;1} ~ T_i V_i^T,    (26)

where T_i is not stored. Then set

   [ R_{^t_{j_1} x ^t_i} ; R_{^t_{j_2} x ^t_i} ; ... ; R_{^t_{j_s} x ^t_i} ] = [ R_{^t_{j_1} x t_i} ; R_{^t_{j_2} x t_i} ; ... ; R_{^t_{j_s} x t_i} ] V_i.    (27)

Equations (26) and (27) can be used to replace (19). This generally reduces the compression cost, since the difference between the costs of (19) and (26)-(27) is

   [ 4m(sr)r - 2r^2(sr + m) + (4/3) r^3 ] - [ 2srmr + 4mr*r - 2r^2(r + m) + (4/3) r^3 ] = 2r^2 ( (m - r)s + r - 2m ).

Here, m > r in general, and s can be as large as O(log n).

2.4. Triangular ULV HSS factorization

Next, we study the ULV factorization of triangular HSS matrices. Notice that the ULV algorithm in Section 2.2 cannot be applied in a straightforward way, due to the nonsymmetry. We consider a block lower triangular HSS matrix L. An upper triangular HSS factorization procedure can be derived similarly.

A triangular HSS solution algorithm is given in [23, 24]. That algorithm traverses the HSS tree with depth-first sweeps, so that newly computed solution entries can be used to update the right-hand side, or so that information is propagated to all related nodes of the HSS tree. It implicitly computes a triangular HSS matrix-vector multiplication by recursion, and is thus not suitable for parallelization. Here, we provide a new triangular ULV HSS solver which is scalable. Again, we convert the diagonal blocks into identity matrices so as to avoid multiplications of square dense blocks. The main steps are illustrated in Figure 4 and explained briefly as follows.

If node i is a leaf, multiply D_i^{-1} on the left of block row i, so that the diagonal block becomes an identity matrix, and update U_i as D_i^{-1} U_i. Compute a full QL factorization of D_i^{-1} U_i,

   D_i^{-1} U_i = Q_i [ 0 ; U~_i ],    (28)

where the zero block has m - r rows and U~_i is r x r. Then let Q_i^T V_i = [ V-_i ; V~_i ], where the partition follows that of (28). Clearly, the leading (m - r) x (m - r) subblock of the diagonal block becomes an identity matrix, which is the only nonzero block in those rows and can be eliminated.

Figure 4. A triangular ULV HSS factorization scheme: (i) factorization of the diagonal blocks; (ii) introduction of zeros into the off-diagonal blocks; (iii) restoring the identity diagonal blocks; (iv) merging the remaining blocks and repeating.

If node i is a non-leaf node with its children c_1 and c_2 being leaves, assume the previous operations have been applied to both c_1 and c_2. That is,

   diag( Q_{c_1}^T D_{c_1}^{-1}, Q_{c_2}^T D_{c_2}^{-1} ) [ D_{c_1}, 0 ; U_{c_2} B_{c_2} V_{c_1}^T, D_{c_2} ] diag( Q_{c_1}, Q_{c_2} ) = [ I, 0 ; diag( 0, U~_{c_2} B_{c_2} V~_{c_1}^T ), I ].    (29)

Then we merge the appropriate blocks on the right-hand side of (29) and remove c_1 and c_2 from the HSS tree, so that i becomes a leaf with D and U generators

   D_i = [ I, 0 ; U~_{c_2} B_{c_2} V~_{c_1}^T, I ],   U_i = [ U~_{c_1} R_{c_1} ; U~_{c_2} R_{c_2} ].    (30)

L is thus reduced to a smaller matrix, and the same operations then apply recursively.

Note that, due to the special form of D_i in (30), the intermediate update step D_i^{-1} U_i can be done quickly as

   D_i^{-1} U_i = [ U~_{c_1} R_{c_1} ; U~_{c_2} ( R_{c_2} - B_{c_2} V~_{c_1}^T U~_{c_1} R_{c_1} ) ].    (31)

The products in (29)-(31) can be carefully organized so that only five multiplications of r x r matrices are needed.

3. INNER-OUTER HSS PRECONDITIONER

3.1. Improved ULV HSS solution

Due to the improved ULV-type HSS factorization in Section 2.2, the solution procedure with the ULV factor is much simpler than the one in [35]. For example, for L^ in (10), we consider the solution of the following system by forward substitution:

   L^ y = b.

In the process, each node i of the HSS tree is associated with b_i, a piece of b, and y_i, a piece of y. These pieces are defined recursively. First, b is partitioned into pieces b_i according to the leaf-level sizes of the L_i. For each leaf i, let

   b~_i = Q_i^T L_i^{-1} b_i.    (32)

Partition b~_i as b~_i = [ b~_{i,1} ; b~_{i,2} ], with m - r and r rows, respectively, and set y~_i = b~_{i,1}. Assume similar operations have also been conducted for j = sib(i). Then for p = par(i), set b_p = [ b~_{i,2} ; b~_{j,2} ] if i < j, or b_p = [ b~_{j,2} ; b~_{i,2} ] otherwise. Similar operations can then be applied to the upper-level nodes. These steps proceed until the root node 2k - 1 is reached, where

   y~_{2k-1} = L_{2k-1}^{-1} b_{2k-1}.    (33)

After all these y~_i pieces are computed, we traverse the HSS tree top-down. Let y_{2k-1} = y~_{2k-1}. For a non-leaf node i with children c_1, c_2, partition y~_i as y~_i = [ y~_{i;1} ; y~_{i;2} ], each piece with r rows, and compute

   y_{c_1} = L_{c_1}^{-T} Q_{c_1} [ y~_{c_1} ; y~_{i;1} ],   y_{c_2} = L_{c_2}^{-T} Q_{c_2} [ y~_{c_2} ; y~_{i;2} ].    (34)

When all the nodes are visited, the pieces y_i associated with all the leaves can be assembled into y. See Figure 6 below.
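The following self-contained NumPy sketch carries out the one-level case (two leaves and a root) end to end: the leaf transformations of Section 2.2, the reduced root block (7), and a solve of A x = b through the transformed coordinates, checked against a dense solve. It is written directly from the underlying factorization identity rather than as the paper's recursive multi-level implementation, and the sizes and helper names are assumptions of the example.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, qr

def full_ql(T):
    """Full QL factorization T = Q @ [[0],[E]] of a tall matrix (E lower triangular),
    obtained from a full QR factorization of the row/column-reversed matrix."""
    Qr, Rr = qr(T[::-1, ::-1], mode='full')
    return Qr[::-1, ::-1], Rr[::-1, ::-1]

rng = np.random.default_rng(4)
m, r = 8, 2
# One-level symmetric HSS matrix (two leaves and a root), as in (1)-(3).
X1, X2 = rng.standard_normal((m, m)), rng.standard_normal((m, m))
D1, D2 = X1 @ X1.T + m * np.eye(m), X2 @ X2.T + m * np.eye(m)
U1, U2 = rng.standard_normal((m, r)), rng.standard_normal((m, r))
B1 = rng.standard_normal((r, r))
A = np.block([[D1, U1 @ B1 @ U2.T], [U2 @ B1.T @ U1.T, D2]])
b = rng.standard_normal(2 * m)

# Leaf steps of Section 2.2: D_i = L_i L_i^T and L_i^{-1} U_i = Q_i [0; U~_i] as in (5).
L1, L2 = cholesky(D1, lower=True), cholesky(D2, lower=True)
Q1, E1 = full_ql(solve_triangular(L1, U1, lower=True))
Q2, E2 = full_ql(solve_triangular(L2, U2, lower=True))
Ut1, Ut2 = E1[m - r:], E2[m - r:]

# Reduced root block (7) and its Cholesky factor.
Droot = np.block([[np.eye(r), Ut1 @ B1 @ Ut2.T], [Ut2 @ B1.T @ Ut1.T, np.eye(r)]])
Lroot = cholesky(Droot, lower=True)

# Solve A x = b: bottom-up transformation as in (32), root solve, top-down pass as in (34).
bt1 = Q1.T @ solve_triangular(L1, b[:m], lower=True)
bt2 = Q2.T @ solve_triangular(L2, b[m:], lower=True)
z = solve_triangular(Lroot.T,
                     solve_triangular(Lroot, np.concatenate([bt1[m - r:], bt2[m - r:]]),
                                      lower=True), lower=False)
x1 = solve_triangular(L1.T, Q1 @ np.concatenate([bt1[:m - r], z[:r]]), lower=False)
x2 = solve_triangular(L2.T, Q2 @ np.concatenate([bt2[:m - r], z[r:]]), lower=False)
x = np.concatenate([x1, x2])
print("error vs. dense solve:", np.linalg.norm(x - np.linalg.solve(A, b)))
```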

3.2. Triangular ULV HSS solution

We then consider the ULV solution of a lower triangular HSS system L x = b. Initially, partition b conformably following the sizes of the leaf-level diagonal blocks D_i in L. For each leaf i of the HSS tree, compute

   b~_i = Q_i^T D_i^{-1} b_i.

Similarly, the structure of D_i in (30) can be used to accelerate the computation of D_i^{-1} b_i. Partition b~_i as b~_i = [ b~_{i,1} ; b~_{i,2} ], with m - r and r rows, respectively, and set x~_i = b~_{i,1}. Then, similar to the previous subsection, we form b_p by stacking b~_{i,2} and b~_{j,2} for j = sib(i). The process then proceeds until the root node 2k - 1 is reached, where x~_{2k-1} = D_{2k-1}^{-1} b_{2k-1}.

After all the solution pieces x~_i are computed, we traverse the HSS tree top-down. Let x_{2k-1} = x~_{2k-1}. For a non-leaf node i with children c_1, c_2, partition x~_i as x~_i = [ x~_{i;1} ; x~_{i;2} ], each piece with r rows, and compute

   x_{c_1} = Q_{c_1} [ x~_{c_1} ; x~_{i;1} ],   x_{c_2} = Q_{c_2} [ x~_{c_2} ; x~_{i;2} ].    (35)

When all the nodes are visited, the pieces x_i associated with all the leaves can be assembled into x.

3.3. Inner-outer HSS solution

With all the new HSS algorithms above, we are ready to present our inner-outer HSS algorithms. In the inner-outer factorization, A is approximated by an HSS form with the HSS tree T, where each diagonal block D_i is approximately factorized as D_i ~ L_i L_i^T, and L_i is an inner lower triangular HSS form with the HSS tree T_0. This is illustrated in Figure 5. Then ULV factorizations are computed for both the inner and the outer HSS forms, using Sections 2.3 and 2.2, respectively. In the solution stage, we follow an outer solution procedure given by the ULV solution in Section 3.1. In the meantime, for any solution step involving an L_i associated with a leaf i, such as (32), we use the triangular HSS solution in Section 3.2. The solution algorithm is summarized in Table III.

Figure 5. A pictorial illustration of the inner-outer HSS process, where each leaf D_i block of the outer HSS tree in Figure 1 is approximately factorized into an HSS Cholesky factor as the inner HSS matrix.

In practical implementations of the algorithm as a preconditioner, additional techniques can be utilized. One is that we can traverse the HSS trees levelwise, so that the algorithms can be implemented in parallel. For example, when all the y_i pieces are obtained as in (34), they can first be assembled together at each level. Assume the solution pieces provided at each level l form a vector y-_l. Then the solution y is formed by collecting all the y-_l with appropriate permutations (Figure 6). That is, we define

   y^(l) = P_l [ y-_l ; y^(l-1) ],   l = 1, 2, ..., l_max,    (36)

with y^(0) = y-_0 = y_{2k-1}. Then y = y^(l_max).

Another technique is that we can use diagonal shifts in the outer factorization step to enhance robustness. This is not needed for the inner factorizations.

Algorithm 1. Inner-outer HSS solution.

  subroutine io_hsssol(A, b)
    Input:  A in factorized form, where the outer ULV factor is given by the Q_i and L_i,
            and each leaf L_i is given by its inner ULV factor; b: right-hand side
    Output: solution y
    Partition b into pieces b_i for the leaves i at the bottom level l_max
    for each leaf i, compute y~_i = Q_i^T [ tri_ulvsol(L_i, b_i) ]
    for l = l_max - 1, l_max - 2, ..., 1 do
      for all nodes i at level l do
        for each child c of i, compute b~_c = Q_c^T L_c^{-1} b_c
        partition each b~_c as [ y~_{c,1} ; y~_{c,2} ] with m - r and r rows;
        let b_i = [ y~_{c_1,2} ; y~_{c_2,2} ] and keep y~_{c_1} = y~_{c_1,1}, y~_{c_2} = y~_{c_2,1}
      end for
    end for
    Compute y~_{2k-1} = L_{2k-1}^{-1} b_{2k-1}
    for l = 0, 1, ..., l_max - 1 do
      for all nodes i at level l do
        partition y~_i as y~_i = [ y~_{i;1} ; y~_{i;2} ], each piece with r rows
        for j = 1, 2:
          if c_j is a non-leaf node, compute y_{c_j} = L_{c_j}^{-T} Q_{c_j} [ y~_{c_j} ; y~_{i;j} ]
          if c_j is a leaf,          compute y_{c_j} = tri_ulvsol( L_{c_j}^T, Q_{c_j} [ y~_{c_j} ; y~_{i;j} ] )
      end for
    end for
  end subroutine

Table III. An inner-outer HSS solution algorithm (Algorithm 1), where tri_ulvsol represents the solution scheme in Section 3.2.

Since D_i has the form in (7) for a non-leaf node i in T, we can add the following shift to D_i:

   s_i = epsilon I.    (37)

Notice that this diagonal shift is only needed for the non-leaf nodes in T. In parallel computing, the number of non-leaf nodes may be set to be twice the number of processors, and the effect on the accuracy is limited.
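A minimal sketch of the shift (37) in action: if an (approximate) diagonal block fails to be positive definite, the Cholesky factorization is simply retried with D + epsilon*I. The 2 x 2 block and the value of epsilon are artificial choices for the example; in practice the shift size is chosen relative to the scaling of D_i.

```python
import numpy as np

def shifted_cholesky(D, eps):
    """Cholesky with the diagonal shift (37): retry with D + eps*I when the
    (approximate) block fails to be positive definite."""
    try:
        return np.linalg.cholesky(D), 0          # no shift needed
    except np.linalg.LinAlgError:
        return np.linalg.cholesky(D + eps * np.eye(D.shape[0])), 1

# A slightly indefinite block, as may arise after aggressive off-diagonal truncation.
D = np.array([[1.0, 1.001], [1.001, 1.0]])
L, n_shift = shifted_cholesky(D, eps=1e-2)
print("shifts used:", n_shift, " residual:", np.linalg.norm(L @ L.T - D))
```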

Figure 6. How the solution pieces generated in Algorithm 1 are assembled into the solution y: the pieces y_i associated with the nodes at each level l of the HSS tree are assembled into y-_l, and then into y^(l) as in (36).

4. ANALYSIS

We collect all the new HSS algorithms in this work as follows:

Direct-HSS: the new HSS algorithms based on direct HSS construction and ULV operations:
  Direct-HSSconstr: the HSS construction procedure in Section 2.1.
  Direct-HSSfact: the ULV HSS factorization procedure in Section 2.2.
  Direct-HSSsol: the ULV HSS solution procedure in Section 3.1.

Tri-HSS: the new HSS algorithms based on triangular HSS operations:
  Tri-HSSconstr: the HSS Cholesky factorization procedure in Section 2.3.
  Tri-HSSsol: the triangular ULV HSS factorization procedure in Section 2.4 together with the solution in Section 3.2.

I/O-HSS: the new inner-outer HSS algorithms:
  I/O-HSSconstr: the procedure with outer Direct-HSSconstr and inner Tri-HSSconstr.
  I/O-HSSfact: the procedure with outer Direct-HSSfact.
  I/O-HSSsol: the procedure with outer Direct-HSSsol and inner Tri-HSSsol.

For convenience, we use the following notation in the analysis and the numerical experiments:

For the outer HSS algorithms, n, r, m, and T are the matrix size, HSS rank, leaf-level diagonal block size, and outer HSS tree, respectively. The number of leaves in T is k.
For the inner HSS algorithms, we similarly use the notation n_0, r_0, m_0, T_0, and k_0. Here, n_0 = m.
xi^1, xi^2, and xi^3 are the flop counts of the above three types of algorithms. For example, xi^2_constr is the cost of Tri-HSSconstr.

4.1. Complexity count: the new HSS Cholesky factorization as an example

As an example, we count the flops of Tri-HSSconstr; the other algorithms can then be studied easily. For simplicity, assume the inner HSS tree T_0 is a perfect binary tree and m_0 = O(r_0). Since there are k_0 leaves, k_0 m_0 = n_0. We also assume the root of T_0 is at level 0, so that there are in total about log_2 k_0 levels. Our complexity analysis uses the flop counts of some basic matrix operations in Table IV. Also, two useful formulas are needed:

   sum_{i at level l} n_i = sum_{j=1}^{2^l} ( k_0 - j k_0 / 2^l ) m_0 = (1/2) k_0 m_0 ( 2^l - 1 ),   sum_{i at level l} s_i = l 2^{l-1},    (38)

Table IV. Flop counts (leading terms only) of some basic matrix operations, which can be found in, say, [10, 13], or can be derived from those. The RRQR factorization in our work is based on the modified Gram-Schmidt process with column pivoting [13].
  Cholesky factorization of an m_0 x m_0 matrix:                      (1/3) m_0^3
  Product of an m_0 x q matrix and a q x r_0 matrix:                  2 m_0 q r_0
  RRQR factorization of an m_0 x q matrix with rank r_0:              4 m_0 q r_0 - 2 r_0^2 ( m_0 + q ) + (4/3) r_0^3
  Full QR factorization of an m_0 x r_0 tall matrix (m_0 > r_0):      2 r_0^2 ( m_0 - r_0 / 3 )
  Product of the Q factor and an m_0 x q matrix:                      2 r_0 q ( 2 m_0 - r_0 )
  Solution of an order-m_0 triangular system L x = b:                 m_0^2

In (38), n_i is the column size of R_{t_i x t_i^r}, and s_i is the cardinality of V_i in (16). The first formula holds because there are 2^l nodes at each level l, with the n_i values ( k_0 - j k_0 / 2^l ) m_0, j = 1, 2, ..., 2^l. The second formula is derived in [33].

(a) Elimination cost. For each leaf i, the elimination step (17) costs (1/3) m_0^3 + m_0^2 n_i flops. Thus, the total elimination cost for all the leaves is

   sum_{i: leaf} ( (1/3) m_0^3 + m_0^2 n_i ) = (1/3) k_0 m_0^3 + (1/2) k_0 ( k_0 - 1 ) m_0^3 ~ (1/2) k_0^2 m_0^3.

(b) Compression cost. For each leaf i, the RRQR step (18) is applied to an m_0 x n_i block to get U_i. For each non-leaf node i at level l > 0, the RRQR step (20) is applied to a 2 r_0 x n_i block to get R_{c_1}, R_{c_2}. Using Table IV and (38), the total cost of all the HSS block row compression is therefore

   sum_{i: leaf} [ 4 m_0 n_i r_0 - 2 r_0^2 ( m_0 + n_i ) + (4/3) r_0^3 ] + sum_{l=1}^{log_2 k_0 - 1} sum_{i at level l} [ 4 (2 r_0) n_i r_0 - 2 r_0^2 ( 2 r_0 + n_i ) + (4/3) r_0^3 ] ~ 2 r_0 k_0^2 m_0^2 + 2 r_0^2 k_0^2 m_0.

For each leaf i, to compute V_i with (26)-(27), we need the compression of a size r_0 x m_0 block and the multiplication of an m_0 x r_0 matrix and an r_0 x s_i r_0 matrix. For each non-leaf node i at level l > 0, the RRQR step (21) is applied to a 2 r_0 x s_i r_0 block to get the W generators. The total cost of all the HSS block column compression is then

   sum_{i: leaf} [ 4 m_0 r_0^2 - 2 r_0^2 ( m_0 + r_0 ) + (4/3) r_0^3 + 2 s_i r_0^2 m_0 ] + sum_{l=1}^{log_2 k_0 - 1} sum_{i at level l} [ 8 s_i r_0^3 - 2 r_0^2 ( 2 r_0 + s_i r_0 ) + (4/3) r_0^3 ] ~ 4 k_0 r_0^2 m_0 + O( k_0 r_0^2 m_0 log_2 k_0 ),

which is of lower order than the block row compression cost.

(c) Schur complement computation cost. Similarly, we can count the cost of (23) and (25) for each leaf i; the total is about ( m_0 r_0 + 3 r_0^2 ) k_0^2 m_0.

Therefore, the total factorization cost is roughly

   xi^2_constr ~ ( (1/2) m_0 + 3 r_0 + 5 r_0^2 / m_0 ) n_0^2.    (39)

This formula can be used to minimize xi^2_constr.
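As a quick check of the minimization that leads to (40) below, the following short SymPy computation treats the coefficient of n_0^2 in (39) as a function of the leaf size m_0. It assumes the cost model exactly as written in (39); it is only a verification of that algebra.

```python
import sympy as sp

m0, r0 = sp.symbols('m0 r0', positive=True)
# Coefficient of n_0^2 in the leading-order cost (39).
cost = sp.Rational(1, 2) * m0 + 3 * r0 + 5 * r0**2 / m0

m0_opt = sp.solve(sp.diff(cost, m0), m0)[0]
print("optimal leaf size m0      =", m0_opt)                                  # sqrt(10)*r0
print("optimal cost / (r0 n0^2)  =", sp.simplify(cost.subs(m0, m0_opt) / r0)) # 3 + sqrt(10)
print("cost at m0 = 2 r0         =", sp.simplify(cost.subs(m0, 2 * r0) / r0)) # 13/2
```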

It can easily be shown that the optimal cost is

   xi^2_constr ~ ( 3 + sqrt(10) ) r_0 n_0^2 ~ 6.2 r_0 n_0^2,    (40)

which is achieved when m_0 = sqrt(10) r_0. If we choose m_0 = 2 r_0, as is often done, xi^2_constr ~ (13/2) r_0 n_0^2, which is slightly larger than the optimal cost. On the other hand, the original version in [36] with m_0 = 2 r_0 costs at least (11/2) r~_0 n_0^2 flops [33], where the corresponding rank r~_0 may be as large as 2 r_0. Similarly, the costs of the new triangular ULV HSS factorization and solution are

   xi^2_fact ~ r_0 ( m_0 + 2 r_0 ) n_0 + O( r_0^3 n_0 / m_0 )   and   xi^2_sol ~ ( m_0 + 4 r_0 + 8 r_0^2 / m_0 ) n_0,

respectively.

4.2. Complexity of the inner-outer HSS algorithm and other algorithms

The complexity of the other algorithms involved can be counted similarly; see Table V.

Table V. Flop counts (leading terms only) of the HSS algorithms proposed in this work, in terms of the outer parameters n, r, m and the inner parameters n_0, r_0, m_0.
  Direct-HSS:  construction  O( r n^2 )  (about 3 r n^2 under the assumptions of [33]);
               ULV factorization  ( (1/3) m^2 + m r + 2 r^2 ) n + O( r^3 n / m );
               ULV solution  ( m + 4 r + 8 r^2 / m ) n.
  Tri-HSS:     factorization  ( (1/2) m_0 + 3 r_0 + 5 r_0^2 / m_0 ) n_0^2;
               ULV factorization  r_0 ( m_0 + 2 r_0 ) n_0 + O( r_0^3 n_0 / m_0 );
               ULV solution  ( m_0 + 4 r_0 + 8 r_0^2 / m_0 ) n_0.
  I/O-HSS:     construction  the outer Direct-HSS construction plus ( (1/2) m_0 + 3 r_0 + 5 r_0^2 / m_0 ) m n for the inner factorizations of the k diagonal blocks;
               factorization  the outer Direct-HSS ULV factorization with the m x m diagonal-block factorizations replaced by inner Tri-HSS ULV factorizations;
               solution  the outer Direct-HSS ULV solution with the leaf-level triangular solves replaced by inner Tri-HSS ULV solutions.

In particular, with certain choices of m and m_0, we obtain the sample counts in Table VI. It can then be verified that the new HSS algorithms proposed in this work are generally much faster than similar existing ones; see, e.g., Section 2.2.

Table VI. Flop counts (leading terms only) as in Table V, with r = r_0, m_0 = 2 r_0, and m = n/4. (These assumptions may not be satisfied in the numerical experiments in the next section.)
  Direct-HSS:  construction 3 r n^2;   ULV factorization (35/3) r^2 n;   ULV solution 10 r n.
  Tri-HSS:     factorization (13/2) r_0 n_0^2;   ULV factorization O( r_0^2 n_0 );   ULV solution 10 r_0 n_0.
  I/O-HSS:     construction (23/8) r n^2;   factorization (73/3) r^2 n;   solution 14 r n.

4.3. Approximation errors

The off-diagonal compression introduces errors into the approximations, which can be roughly measured by the following result.

Proposition 4.1. For a given SPD matrix A, let tau be the relative tolerance in the off-diagonal compression by truncated SVD. Then after the outer HSS construction for A, the HSS approximation A~ satisfies

   || A - A~ ||_2 / || A ||_2 <= O( tau log n ).

Sketch of the proof. The proof uses the fact that the U, V generators have orthonormal columns. The HSS tree has O(log n) levels, and the approximation error accumulates additively as the compression moves bottom-up along the tree.

This proposition can be applied to HSS constructions with a given relative tolerance tau for the off-diagonal singular values. For an off-diagonal block B, the singular values sigma_i that satisfy sigma_i < sigma_1 tau are dropped, where sigma_1 is the largest singular value of B. The number of remaining singular values is the numerical rank of B. The proposition gives a bound for the error introduced in the approximation of A.

The error analysis for the HSS Cholesky factorization is less straightforward. As in [18], the compression of R_{t_i x t_i^r} = D_i^{-1} A^(i-1)_{t_i x t_i^r} in (17) with an absolute error tau_0 introduces an error of e = O( || D_i ||_2 tau_0 ) = O( || A ||_2^{1/2} tau_0 ) into A^(i-1)_{t_i x t_i^r}. If a relative tolerance is used, this becomes more complicated. With hierarchical compression, the error introduced into the original off-diagonal block A_{t_i x t_i^r} can be as large as O( e log n ). Moreover, the compression of R_{t_i x t_i^r} adds an error of e~ = O( tau_0^2 ) to the Schur complement S_i, which continues to be factorized approximately, and e~ may be magnified by O( n ). Such an analysis may be too pessimistic, and a detailed study is not the main focus of this work. In practice, the error is often well controlled. Since our objective is preconditioning, a lower accuracy is allowed in the inner HSS Cholesky factorization. The feasibility is verified by the numerical experiments.
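The relative-tolerance truncation referred to above can be illustrated with a few lines of NumPy: singular values below tau * sigma_1 of an off-diagonal block are dropped, and the resulting numerical rank and error vary with tau. The kernel block chosen here is an arbitrary smooth example, not one from the paper.

```python
import numpy as np

def truncate_svd(B, tau):
    """Drop singular values sigma_i < tau * sigma_1 of an off-diagonal block B."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    k = int(np.sum(s >= tau * s[0]))          # numerical rank for relative tolerance tau
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k], k

x, y = np.linspace(0, 1, 60), np.linspace(2, 3, 40)
B = 1.0 / (x[:, None] - y[None, :])           # a smooth (numerically low-rank) block
for tau in (1e-2, 1e-6, 1e-10):
    Bk, k = truncate_svd(B, tau)
    print(f"tau = {tau:.0e}:  rank = {k:2d},  relative error = "
          f"{np.linalg.norm(B - Bk) / np.linalg.norm(B):.1e}")
```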

5. NUMERICAL EXPERIMENTS

The inner-outer HSS algorithms and the others are implemented in Matlab and are tested on various examples. For convenience, we list additional notation as follows, beyond that summarized at the beginning of Section 4:

  kappa_2(A):  2-norm condition number of A
  e:           relative residual || A_1 x - b ||_2 / || b ||_2
  epsilon:     size of the shift in (37)
  N_iter:      number of preconditioned conjugate gradient (PCG) iterations
  N_shift:     number of times shifts (37) are used to avoid breakdown in I/O-HSS
  xi_iter:     total PCG cost (flops)

Example 1. Let

   A_0 = ( x_i x_j )_{n x n},    (41)

where the points x_i = cos( (2i + 1) pi / (2n) ) are the zeros of the nth Chebyshev polynomial, as used in [7, 9].

We demonstrate the performance of the algorithms for a linear system with an SPD coefficient matrix:

   A_1 x = b,   with   A_1 = A_0 A_0^T + 2 I.    (42)

The 2I diagonal term ensures that A_1 remains positive definite in our numerical tests in the presence of numerical errors on the computer. This has nearly no effect on the off-diagonal numerical ranks in our factorizations, and does not change the nature of the tests. A_1 has size n ranging from 500 to 160,000, and b is generated from a random exact x. Several aspects are tested.

First, both the Tri-HSS and Direct-HSS algorithms proposed here are faster than the original versions in [7, 36]. For example, for n = 4000, m_0 = 50, and compression accuracy tau_0 = 1E-6, our Tri-HSSconstr and Tri-HSSsol directly applied to A_1 cost about 8.00E8 and 6.09E5 flops, respectively, while the original versions in [36] cost 1.24E9 and 6.11E5 flops, respectively.

In the second test, we use the I/O-HSS algorithms to directly solve the system (42). Different inner and outer off-diagonal compression accuracies are used, namely tau_0 and tau, respectively. No diagonal shift is involved here, i.e., epsilon = 0 in (37). The performance is demonstrated in Table VII. It can be seen that, for a reasonable accuracy, I/O-HSSconstr has about O(n^2) complexity, and I/O-HSSfact and I/O-HSSsol have nearly O(n) complexity.

Table VII. Example 1: Flop counts xi^3 and relative residuals e of the inner-outer HSS algorithms for solving (42), where m = 1000, m_0 = 50, epsilon = 0, tau = 1E-8, tau_0 = 1E-6 (the matrix size n roughly doubles from column to column).
  xi^3_constr:  4.84E6   2.07E7   8.56E7   3.47E8   1.42E9   5.49E9
  xi^3_fact:    3.30E5   6.88E5   1.40E6   2.79E6   5.77E6   1.06E7
  xi^3_sol:     7.91E4   1.70E5   3.51E5   7.07E5   1.43E6   2.78E6
  e:            1.68E-8 and of a similar order in the remaining columns

The I/O-HSS algorithms are also compared with the Tri-HSS and Direct-HSS algorithms when the latter are directly applied to A_1 to reach about the same accuracy. The same set of parameters m_0, tau_0 is used by Tri-HSS and Direct-HSS. Note that the ULV factorization in Tri-HSS is intended for parallel implementation and may be slower than the sequential triangular solver in [24]; thus, here we use the one in [24], which does not have a factorization stage. From Figure 7, we observe that the xi_sol costs of all three types of algorithms are comparable to each other, while I/O-HSS can take into consideration both the scalability and the robustness.

In the third test, we examine the effectiveness of I/O-HSS as a preconditioner in PCG for A_1 and also for

   A_2 = A_0 A_0^T A_0 A_0^T + 2 I.

The matrices A_1 and A_2 have 2-norm condition numbers kappa_2 = 5.4E6 and kappa_2 = 6.2E13, respectively. In I/O-HSSconstr, we only keep very few columns in the off-diagonal compression; that is, we manually set the numerical rank r to be a small number and set r_0 = r. Thus, it is very efficient to obtain and apply the preconditioner. The results are shown in Table VIII, and the details of the iteration processes are shown in Figure 8. We can see that PCG with I/O-HSS converges much faster and costs far less than with block diagonal preconditioning.

Lastly, we test the performance of the inner-outer preconditioning when lower accuracies are used in the inner Tri-HSS factorization. For A_1 in Table VIII, we use the outer HSS rank bound r = 7, and the inner HSS rank bound r_0 varies from 3 to 7. See Figure 9 for the performance. We see that fast convergence is achieved for all these r_0 values. In fact, the convergence behaviors for r_0 = 4, 5, 6, 7 are almost the same.
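To show the shape of this experiment, here is a small self-contained Python sketch that builds A_1 as in (42) from the Chebyshev points and runs PCG with the block diagonal preconditioner used as the baseline above. The kernel defining A_0 in (41) does not come through cleanly in this transcription, so the sketch simply takes a smooth symmetric kernel as a stand-in; the sizes, tolerance, and all names are assumptions of the example, and no HSS preconditioner is constructed here.

```python
import numpy as np

def chebyshev_points(n):
    i = np.arange(n)
    return np.cos((2 * i + 1) * np.pi / (2 * n))     # zeros of the n-th Chebyshev polynomial

def pcg(A, b, apply_prec, tol=1e-10, maxiter=500):
    """Textbook preconditioned conjugate gradient iteration."""
    x, r = np.zeros_like(b), b.copy()
    z = apply_prec(r)
    p, rz = z.copy(), r @ z
    for it in range(1, maxiter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, it
        z = apply_prec(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxiter

n, m = 1000, 250                                     # matrix size and diagonal block size
x_pts = chebyshev_points(n)
kernel = lambda s, t: np.sqrt(np.abs(s - t))         # stand-in for the kernel in (41)
A0 = kernel(x_pts[:, None], x_pts[None, :])
A1 = A0 @ A0.T + 2.0 * np.eye(n)                     # (42)
rng = np.random.default_rng(7)
b = A1 @ rng.standard_normal(n)

# Block diagonal (block Jacobi) preconditioner: the baseline compared against in Table VIII.
chol_blocks = [np.linalg.cholesky(A1[k:k + m, k:k + m]) for k in range(0, n, m)]
def block_diag_prec(v):
    out = np.empty_like(v)
    for L, k in zip(chol_blocks, range(0, n, m)):
        out[k:k + m] = np.linalg.solve(L.T, np.linalg.solve(L, v[k:k + m]))
    return out

x_sol, iters = pcg(A1, b, block_diag_prec)
print("PCG iterations:", iters, " relative residual:",
      np.linalg.norm(b - A1 @ x_sol) / np.linalg.norm(b))
```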

Figure 7. Example 1: Flop counts of I/O-HSS, Tri-HSS, and Direct-HSS for solving (42), where m = 1000, m_0 = 50, tau = 1E-8, tau_0 = 1E-6, and epsilon = 0 in (37). (Panels: (i) xi_constr, (ii) xi_fact, (iii) xi_sol; flops versus the matrix size n.)

Table VIII. Example 1: Performance of PCG with block diagonal preconditioning and with I/O-HSS preconditioning for A_1 and A_2 of size n = 4000, where m = 1000, m_0 = 50, r = 7, and epsilon = 0 in (37).
  Matrix A_1 (kappa_2 = 5.4E6):   block diagonal: e = ...E-15;   I/O-HSS: xi_iter = ...E8, e = 6.63E-15.
  Matrix A_2 (kappa_2 = 6.2E13):  block diagonal: e = ...E-9;    I/O-HSS: xi_iter = ...E8, e = 2.48E-15.

Figure 8. Example 1: Convergence of PCG with block diagonal preconditioning and with I/O-HSS preconditioning for A_1 and A_2 of size n = 4000, where m = 1000, m_0 = 50, r = 7, and epsilon = 0. (Panels: (i) A_1, (ii) A_2; relative residual e versus the number of iterations N_iter.)

Example 2. We then consider a Toeplitz matrix A generated by a Gaussian radial basis function,

   v( t_i ) = e^{ -delta^2 t_i^2 },

where t_i = i - 1.
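A quick NumPy/SciPy sketch of this family: the first column of the Toeplitz matrix is the decaying Gaussian kernel sampled at t_i = i - 1, and the measured condition number can be compared with the estimate e^{pi^2/(4 delta^2)} quoted below. The sign convention in the exponent and the sample values of delta and n are assumptions of the example.

```python
import numpy as np
from scipy.linalg import toeplitz

n = 400
t = np.arange(n, dtype=float)
for delta in (0.9, 0.7, 0.5):
    A = toeplitz(np.exp(-delta**2 * t**2))    # A_{ij} = exp(-delta^2 (i - j)^2)
    print(f"delta = {delta}:  cond(A) = {np.linalg.cond(A):.2e},  "
          f"exp(pi^2/(4 delta^2)) = {np.exp(np.pi**2 / (4 * delta**2)):.2e}")
```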

Figure 9. Example 1: Convergence of PCG with I/O-HSS preconditioning for A_1 in Table VIII with different inner rank bounds r_0 (relative residual e versus the number of iterations N_iter, for r_0 = 3, 4, 5, 6, 7).

The Toeplitz structure is not our focus and is ignored. The matrix is highly ill conditioned for small delta; in fact, its condition number is about e^{pi^2 / (4 delta^2)} [6]. We consider the matrix with size n = ....

If the regular Direct-HSS methods are directly applied with a low accuracy tau, they actually break down, since some intermediate diagonal blocks corresponding to non-leaf nodes of T fail to be positive definite. We may then use shifts as in (37). The number of nodes with such breakdown or shifting is denoted N_shift. However, I/O-HSS can significantly enhance the robustness: the larger m is, the more robust the method is. See Table IX. If no inner factorization is used, then too many shifting steps are needed, and the preconditioner fails to be useful in practice (Table X). PCG with I/O-HSS preconditioning also costs much less than with block diagonal preconditioning. The detailed convergence is shown in Figure 10.

Table IX. Example 2: Costs xi^3 of the I/O-HSS algorithms and the number N_shift of shifts, compared with Direct-HSS, where m_0 = 50, r = 10, and epsilon = 1E-4 in (37) (the I/O-HSS columns correspond to increasing outer leaf sizes m).
                 Direct-HSS   I/O-HSS (m increasing)
  xi^3_constr:   6.11E9       5.42E9  5.49E9  5.64E9  5.79E9  5.95E9  6.11E9  6.33E9  6.48E9
  xi^3_fact:     2.98E7       1.88E7  1.95E7  1.98E7  1.99E7  2.01E7  2.01E7  2.02E7  2.02E7
  xi^3_sol:      2.62E6       1.38E6  1.33E6  1.31E6  1.30E6  1.30E6  1.29E6  1.29E6  1.29E6
  N_shift:       ...

Table X. Example 2: Convergence of PCG with block diagonal preconditioning and with I/O-HSS preconditioning, where m_0 = 50, r = 10, and epsilon = 1E-4 in (37).
                        N_iter   xi_iter    e
  Block diagonal        ...      ...        ...E-14
  Direct-HSS            Failure
  I/O-HSS (r = 10)      ...      ...        ...E-16


More information

6.4 Krylov Subspaces and Conjugate Gradients

6.4 Krylov Subspaces and Conjugate Gradients 6.4 Krylov Subspaces and Conjugate Gradients Our original equation is Ax = b. The preconditioned equation is P Ax = P b. When we write P, we never intend that an inverse will be explicitly computed. P

More information

Numerical Methods in Matrix Computations

Numerical Methods in Matrix Computations Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices

More information

Arnoldi Methods in SLEPc

Arnoldi Methods in SLEPc Scalable Library for Eigenvalue Problem Computations SLEPc Technical Report STR-4 Available at http://slepc.upv.es Arnoldi Methods in SLEPc V. Hernández J. E. Román A. Tomás V. Vidal Last update: October,

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 6 Matrix Models Section 6.2 Low Rank Approximation Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Edgar

More information

Sparse factorization using low rank submatrices. Cleve Ashcraft LSTC 2010 MUMPS User Group Meeting April 15-16, 2010 Toulouse, FRANCE

Sparse factorization using low rank submatrices. Cleve Ashcraft LSTC 2010 MUMPS User Group Meeting April 15-16, 2010 Toulouse, FRANCE Sparse factorization using low rank submatrices Cleve Ashcraft LSTC cleve@lstc.com 21 MUMPS User Group Meeting April 15-16, 21 Toulouse, FRANCE ftp.lstc.com:outgoing/cleve/mumps1 Ashcraft.pdf 1 LSTC Livermore

More information

Conjugate Gradient Method

Conjugate Gradient Method Conjugate Gradient Method direct and indirect methods positive definite linear systems Krylov sequence spectral analysis of Krylov sequence preconditioning Prof. S. Boyd, EE364b, Stanford University Three

More information

Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota

Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM CSE Boston - March 1, 2013 First: Joint work with Ruipeng Li Work

More information

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark DM559 Linear and Integer Programming LU Factorization Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark [Based on slides by Lieven Vandenberghe, UCLA] Outline

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725 Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple

More information

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.

More information

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 1: Direct Methods Dianne P. O Leary c 2008

More information

Contents. Preface... xi. Introduction...

Contents. Preface... xi. Introduction... Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism

More information

Review Questions REVIEW QUESTIONS 71

Review Questions REVIEW QUESTIONS 71 REVIEW QUESTIONS 71 MATLAB, is [42]. For a comprehensive treatment of error analysis and perturbation theory for linear systems and many other problems in linear algebra, see [126, 241]. An overview of

More information

This can be accomplished by left matrix multiplication as follows: I

This can be accomplished by left matrix multiplication as follows: I 1 Numerical Linear Algebra 11 The LU Factorization Recall from linear algebra that Gaussian elimination is a method for solving linear systems of the form Ax = b, where A R m n and bran(a) In this method

More information

Numerical Methods I: Numerical linear algebra

Numerical Methods I: Numerical linear algebra 1/3 Numerical Methods I: Numerical linear algebra Georg Stadler Courant Institute, NYU stadler@cimsnyuedu September 1, 017 /3 We study the solution of linear systems of the form Ax = b with A R n n, x,

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 24: Preconditioning and Multigrid Solver Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 5 Preconditioning Motivation:

More information

A Block Compression Algorithm for Computing Preconditioners

A Block Compression Algorithm for Computing Preconditioners A Block Compression Algorithm for Computing Preconditioners J Cerdán, J Marín, and J Mas Abstract To implement efficiently algorithms for the solution of large systems of linear equations in modern computer

More information

Matrices. Chapter What is a Matrix? We review the basic matrix operations. An array of numbers a a 1n A = a m1...

Matrices. Chapter What is a Matrix? We review the basic matrix operations. An array of numbers a a 1n A = a m1... Chapter Matrices We review the basic matrix operations What is a Matrix? An array of numbers a a n A = a m a mn with m rows and n columns is a m n matrix Element a ij in located in position (i, j The elements

More information

9. Numerical linear algebra background

9. Numerical linear algebra background Convex Optimization Boyd & Vandenberghe 9. Numerical linear algebra background matrix structure and algorithm complexity solving linear equations with factored matrices LU, Cholesky, LDL T factorization

More information

Lecture 3: QR-Factorization

Lecture 3: QR-Factorization Lecture 3: QR-Factorization This lecture introduces the Gram Schmidt orthonormalization process and the associated QR-factorization of matrices It also outlines some applications of this factorization

More information

An H-LU Based Direct Finite Element Solver Accelerated by Nested Dissection for Large-scale Modeling of ICs and Packages

An H-LU Based Direct Finite Element Solver Accelerated by Nested Dissection for Large-scale Modeling of ICs and Packages PIERS ONLINE, VOL. 6, NO. 7, 2010 679 An H-LU Based Direct Finite Element Solver Accelerated by Nested Dissection for Large-scale Modeling of ICs and Packages Haixin Liu and Dan Jiao School of Electrical

More information

Scientific Computing

Scientific Computing Scientific Computing Direct solution methods Martin van Gijzen Delft University of Technology October 3, 2018 1 Program October 3 Matrix norms LU decomposition Basic algorithm Cost Stability Pivoting Pivoting

More information

1. Introduction. In this paper, we consider the eigenvalue solution for a non- Hermitian matrix A:

1. Introduction. In this paper, we consider the eigenvalue solution for a non- Hermitian matrix A: A FAST CONTOUR-INTEGRAL EIGENSOLVER FOR NON-HERMITIAN MATRICES XIN YE, JIANLIN XIA, RAYMOND H. CHAN, STEPHEN CAULEY, AND VENKATARAMANAN BALAKRISHNAN Abstract. We present a fast contour-integral eigensolver

More information

Lecture 9: Numerical Linear Algebra Primer (February 11st)

Lecture 9: Numerical Linear Algebra Primer (February 11st) 10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template

More information

Preface to the Second Edition. Preface to the First Edition

Preface to the Second Edition. Preface to the First Edition n page v Preface to the Second Edition Preface to the First Edition xiii xvii 1 Background in Linear Algebra 1 1.1 Matrices................................. 1 1.2 Square Matrices and Eigenvalues....................

More information

Bindel, Spring 2016 Numerical Analysis (CS 4220) Notes for

Bindel, Spring 2016 Numerical Analysis (CS 4220) Notes for Cholesky Notes for 2016-02-17 2016-02-19 So far, we have focused on the LU factorization for general nonsymmetric matrices. There is an alternate factorization for the case where A is symmetric positive

More information

Homework 2 Foundations of Computational Math 2 Spring 2019

Homework 2 Foundations of Computational Math 2 Spring 2019 Homework 2 Foundations of Computational Math 2 Spring 2019 Problem 2.1 (2.1.a) Suppose (v 1,λ 1 )and(v 2,λ 2 ) are eigenpairs for a matrix A C n n. Show that if λ 1 λ 2 then v 1 and v 2 are linearly independent.

More information

Linear Least squares

Linear Least squares Linear Least squares Method of least squares Measurement errors are inevitable in observational and experimental sciences Errors can be smoothed out by averaging over more measurements than necessary to

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725 Consider Last time: proximal Newton method min x g(x) + h(x) where g, h convex, g twice differentiable, and h simple. Proximal

More information

Multilevel Preconditioning of Graph-Laplacians: Polynomial Approximation of the Pivot Blocks Inverses

Multilevel Preconditioning of Graph-Laplacians: Polynomial Approximation of the Pivot Blocks Inverses Multilevel Preconditioning of Graph-Laplacians: Polynomial Approximation of the Pivot Blocks Inverses P. Boyanova 1, I. Georgiev 34, S. Margenov, L. Zikatanov 5 1 Uppsala University, Box 337, 751 05 Uppsala,

More information

ITERATIVE METHODS FOR SPARSE LINEAR SYSTEMS

ITERATIVE METHODS FOR SPARSE LINEAR SYSTEMS ITERATIVE METHODS FOR SPARSE LINEAR SYSTEMS YOUSEF SAAD University of Minnesota PWS PUBLISHING COMPANY I(T)P An International Thomson Publishing Company BOSTON ALBANY BONN CINCINNATI DETROIT LONDON MADRID

More information

Direct and Incomplete Cholesky Factorizations with Static Supernodes

Direct and Incomplete Cholesky Factorizations with Static Supernodes Direct and Incomplete Cholesky Factorizations with Static Supernodes AMSC 661 Term Project Report Yuancheng Luo 2010-05-14 Introduction Incomplete factorizations of sparse symmetric positive definite (SSPD)

More information

Vector and Matrix Norms. Vector and Matrix Norms

Vector and Matrix Norms. Vector and Matrix Norms Vector and Matrix Norms Vector Space Algebra Matrix Algebra: We let x x and A A, where, if x is an element of an abstract vector space n, and A = A: n m, then x is a complex column vector of length n whose

More information

The antitriangular factorisation of saddle point matrices

The antitriangular factorisation of saddle point matrices The antitriangular factorisation of saddle point matrices J. Pestana and A. J. Wathen August 29, 2013 Abstract Mastronardi and Van Dooren [this journal, 34 (2013) pp. 173 196] recently introduced the block

More information

Linear Solvers. Andrew Hazel

Linear Solvers. Andrew Hazel Linear Solvers Andrew Hazel Introduction Thus far we have talked about the formulation and discretisation of physical problems...... and stopped when we got to a discrete linear system of equations. Introduction

More information

A Backward Stable Hyperbolic QR Factorization Method for Solving Indefinite Least Squares Problem

A Backward Stable Hyperbolic QR Factorization Method for Solving Indefinite Least Squares Problem A Backward Stable Hyperbolic QR Factorization Method for Solving Indefinite Least Suares Problem Hongguo Xu Dedicated to Professor Erxiong Jiang on the occasion of his 7th birthday. Abstract We present

More information

Linear Algebra: A Constructive Approach

Linear Algebra: A Constructive Approach Chapter 2 Linear Algebra: A Constructive Approach In Section 14 we sketched a geometric interpretation of the simplex method In this chapter, we describe the basis of an algebraic interpretation that allows

More information

INTERCONNECTED HIERARCHICAL RANK-STRUCTURED METHODS FOR DIRECTLY SOLVING AND PRECONDITIONING THE HELMHOLTZ EQUATION

INTERCONNECTED HIERARCHICAL RANK-STRUCTURED METHODS FOR DIRECTLY SOLVING AND PRECONDITIONING THE HELMHOLTZ EQUATION MATHEMATCS OF COMPUTATON Volume 00, Number 0, Pages 000 000 S 0025-5718XX0000-0 NTERCONNECTED HERARCHCAL RANK-STRUCTURED METHODS FOR DRECTLY SOLVNG AND PRECONDTONNG THE HELMHOLTZ EQUATON XAO LU, JANLN

More information

ACM106a - Homework 2 Solutions

ACM106a - Homework 2 Solutions ACM06a - Homework 2 Solutions prepared by Svitlana Vyetrenko October 7, 2006. Chapter 2, problem 2.2 (solution adapted from Golub, Van Loan, pp.52-54): For the proof we will use the fact that if A C m

More information

Conjugate gradient method. Descent method. Conjugate search direction. Conjugate Gradient Algorithm (294)

Conjugate gradient method. Descent method. Conjugate search direction. Conjugate Gradient Algorithm (294) Conjugate gradient method Descent method Hestenes, Stiefel 1952 For A N N SPD In exact arithmetic, solves in N steps In real arithmetic No guaranteed stopping Often converges in many fewer than N steps

More information

Topics. The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems

Topics. The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems Topics The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems What about non-spd systems? Methods requiring small history Methods requiring large history Summary of solvers 1 / 52 Conjugate

More information

Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices

Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices Jaehyun Park June 1 2016 Abstract We consider the problem of writing an arbitrary symmetric matrix as

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 3: Positive-Definite Systems; Cholesky Factorization Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 11 Symmetric

More information

BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix-Vector-Product Analysis of Matrix-Matrix Product

BLAS: Basic Linear Algebra Subroutines Analysis of the Matrix-Vector-Product Analysis of Matrix-Matrix Product Level-1 BLAS: SAXPY BLAS-Notation: S single precision (D for double, C for complex) A α scalar X vector P plus operation Y vector SAXPY: y = αx + y Vectorization of SAXPY (αx + y) by pipelining: page 8

More information

Incomplete Factorization Preconditioners for Least Squares and Linear and Quadratic Programming

Incomplete Factorization Preconditioners for Least Squares and Linear and Quadratic Programming Incomplete Factorization Preconditioners for Least Squares and Linear and Quadratic Programming by Jelena Sirovljevic B.Sc., The University of British Columbia, 2005 A THESIS SUBMITTED IN PARTIAL FULFILMENT

More information

LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version

LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version LU Factorization with Panel Rank Revealing Pivoting and Its Communication Avoiding Version Amal Khabou James Demmel Laura Grigori Ming Gu Electrical Engineering and Computer Sciences University of California

More information

Preconditioning Techniques for Large Linear Systems Part III: General-Purpose Algebraic Preconditioners

Preconditioning Techniques for Large Linear Systems Part III: General-Purpose Algebraic Preconditioners Preconditioning Techniques for Large Linear Systems Part III: General-Purpose Algebraic Preconditioners Michele Benzi Department of Mathematics and Computer Science Emory University Atlanta, Georgia, USA

More information

12. Cholesky factorization

12. Cholesky factorization L. Vandenberghe ECE133A (Winter 2018) 12. Cholesky factorization positive definite matrices examples Cholesky factorization complex positive definite matrices kernel methods 12-1 Definitions a symmetric

More information

AMS Mathematics Subject Classification : 65F10,65F50. Key words and phrases: ILUS factorization, preconditioning, Schur complement, 1.

AMS Mathematics Subject Classification : 65F10,65F50. Key words and phrases: ILUS factorization, preconditioning, Schur complement, 1. J. Appl. Math. & Computing Vol. 15(2004), No. 1, pp. 299-312 BILUS: A BLOCK VERSION OF ILUS FACTORIZATION DAVOD KHOJASTEH SALKUYEH AND FAEZEH TOUTOUNIAN Abstract. ILUS factorization has many desirable

More information

Scientific Computing WS 2018/2019. Lecture 9. Jürgen Fuhrmann Lecture 9 Slide 1

Scientific Computing WS 2018/2019. Lecture 9. Jürgen Fuhrmann Lecture 9 Slide 1 Scientific Computing WS 2018/2019 Lecture 9 Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de Lecture 9 Slide 1 Lecture 9 Slide 2 Simple iteration with preconditioning Idea: Aû = b iterative scheme û = û

More information

Orthogonal Transformations

Orthogonal Transformations Orthogonal Transformations Tom Lyche University of Oslo Norway Orthogonal Transformations p. 1/3 Applications of Qx with Q T Q = I 1. solving least squares problems (today) 2. solving linear equations

More information

Block-tridiagonal matrices

Block-tridiagonal matrices Block-tridiagonal matrices. p.1/31 Block-tridiagonal matrices - where do these arise? - as a result of a particular mesh-point ordering - as a part of a factorization procedure, for example when we compute

More information

Fast Hessenberg QR Iteration for Companion Matrices

Fast Hessenberg QR Iteration for Companion Matrices Fast Hessenberg QR Iteration for Companion Matrices David Bindel Ming Gu David Garmire James Demmel Shivkumar Chandrasekaran Fast Hessenberg QR Iteration for Companion Matrices p.1/20 Motivation: Polynomial

More information

Numerical Methods. Elena loli Piccolomini. Civil Engeneering. piccolom. Metodi Numerici M p. 1/??

Numerical Methods. Elena loli Piccolomini. Civil Engeneering.  piccolom. Metodi Numerici M p. 1/?? Metodi Numerici M p. 1/?? Numerical Methods Elena loli Piccolomini Civil Engeneering http://www.dm.unibo.it/ piccolom elena.loli@unibo.it Metodi Numerici M p. 2/?? Least Squares Data Fitting Measurement

More information

Dense LU factorization and its error analysis

Dense LU factorization and its error analysis Dense LU factorization and its error analysis Laura Grigori INRIA and LJLL, UPMC February 2016 Plan Basis of floating point arithmetic and stability analysis Notation, results, proofs taken from [N.J.Higham,

More information

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for 1 Logistics Notes for 2016-08-26 1. Our enrollment is at 50, and there are still a few students who want to get in. We only have 50 seats in the room, and I cannot increase the cap further. So if you are

More information

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition AM 205: lecture 8 Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition QR Factorization A matrix A R m n, m n, can be factorized

More information

Boundary Value Problems - Solving 3-D Finite-Difference problems Jacob White

Boundary Value Problems - Solving 3-D Finite-Difference problems Jacob White Introduction to Simulation - Lecture 2 Boundary Value Problems - Solving 3-D Finite-Difference problems Jacob White Thanks to Deepak Ramaswamy, Michal Rewienski, and Karen Veroy Outline Reminder about

More information

LU Factorization. LU factorization is the most common way of solving linear systems! Ax = b LUx = b

LU Factorization. LU factorization is the most common way of solving linear systems! Ax = b LUx = b AM 205: lecture 7 Last time: LU factorization Today s lecture: Cholesky factorization, timing, QR factorization Reminder: assignment 1 due at 5 PM on Friday September 22 LU Factorization LU factorization

More information

A Chebyshev-based two-stage iterative method as an alternative to the direct solution of linear systems

A Chebyshev-based two-stage iterative method as an alternative to the direct solution of linear systems A Chebyshev-based two-stage iterative method as an alternative to the direct solution of linear systems Mario Arioli m.arioli@rl.ac.uk CCLRC-Rutherford Appleton Laboratory with Daniel Ruiz (E.N.S.E.E.I.H.T)

More information

ETNA Kent State University

ETNA Kent State University C 8 Electronic Transactions on Numerical Analysis. Volume 17, pp. 76-2, 2004. Copyright 2004,. ISSN 1068-613. etnamcs.kent.edu STRONG RANK REVEALING CHOLESKY FACTORIZATION M. GU AND L. MIRANIAN Abstract.

More information

Index. book 2009/5/27 page 121. (Page numbers set in bold type indicate the definition of an entry.)

Index. book 2009/5/27 page 121. (Page numbers set in bold type indicate the definition of an entry.) page 121 Index (Page numbers set in bold type indicate the definition of an entry.) A absolute error...26 componentwise...31 in subtraction...27 normwise...31 angle in least squares problem...98,99 approximation

More information

Linear algebra issues in Interior Point methods for bound-constrained least-squares problems

Linear algebra issues in Interior Point methods for bound-constrained least-squares problems Linear algebra issues in Interior Point methods for bound-constrained least-squares problems Stefania Bellavia Dipartimento di Energetica S. Stecco Università degli Studi di Firenze Joint work with Jacek

More information

Notes on Eigenvalues, Singular Values and QR

Notes on Eigenvalues, Singular Values and QR Notes on Eigenvalues, Singular Values and QR Michael Overton, Numerical Computing, Spring 2017 March 30, 2017 1 Eigenvalues Everyone who has studied linear algebra knows the definition: given a square

More information

Parallel Singular Value Decomposition. Jiaxing Tan

Parallel Singular Value Decomposition. Jiaxing Tan Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector

More information

Enhancing Scalability of Sparse Direct Methods

Enhancing Scalability of Sparse Direct Methods Journal of Physics: Conference Series 78 (007) 0 doi:0.088/7-6596/78//0 Enhancing Scalability of Sparse Direct Methods X.S. Li, J. Demmel, L. Grigori, M. Gu, J. Xia 5, S. Jardin 6, C. Sovinec 7, L.-Q.

More information

On the Preconditioning of the Block Tridiagonal Linear System of Equations

On the Preconditioning of the Block Tridiagonal Linear System of Equations On the Preconditioning of the Block Tridiagonal Linear System of Equations Davod Khojasteh Salkuyeh Department of Mathematics, University of Mohaghegh Ardabili, PO Box 179, Ardabil, Iran E-mail: khojaste@umaacir

More information

A fast randomized algorithm for approximating an SVD of a matrix

A fast randomized algorithm for approximating an SVD of a matrix A fast randomized algorithm for approximating an SVD of a matrix Joint work with Franco Woolfe, Edo Liberty, and Vladimir Rokhlin Mark Tygert Program in Applied Mathematics Yale University Place July 17,

More information

Linear Algebra, part 3 QR and SVD

Linear Algebra, part 3 QR and SVD Linear Algebra, part 3 QR and SVD Anna-Karin Tornberg Mathematical Models, Analysis and Simulation Fall semester, 2012 Going back to least squares (Section 1.4 from Strang, now also see section 5.2). We

More information

Notes on PCG for Sparse Linear Systems

Notes on PCG for Sparse Linear Systems Notes on PCG for Sparse Linear Systems Luca Bergamaschi Department of Civil Environmental and Architectural Engineering University of Padova e-mail luca.bergamaschi@unipd.it webpage www.dmsa.unipd.it/

More information

LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version

LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version 1 LU factorization with Panel Rank Revealing Pivoting and its Communication Avoiding version Amal Khabou Advisor: Laura Grigori Université Paris Sud 11, INRIA Saclay France SIAMPP12 February 17, 2012 2

More information

Applied Mathematics 205. Unit II: Numerical Linear Algebra. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit II: Numerical Linear Algebra. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit II: Numerical Linear Algebra Lecturer: Dr. David Knezevic Unit II: Numerical Linear Algebra Chapter II.3: QR Factorization, SVD 2 / 66 QR Factorization 3 / 66 QR Factorization

More information