Effective matrix-free preconditioning for the augmented immersed interface method

Jianlin Xia^a, Zhilin Li^b, Xin Ye^a

^a Department of Mathematics, Purdue University, West Lafayette, IN 47907, USA. E-mail: xiaj@math.purdue.edu, ye83@purdue.edu.
^b Center for Research in Scientific Computation & Department of Mathematics, North Carolina State University, Raleigh, NC 27695, USA. E-mail: zhilin@math.ncsu.edu.

Abstract

We present effective and efficient matrix-free preconditioning techniques for the augmented immersed interface method (AIIM). AIIM has been developed recently and has been shown to be very effective for interface problems and problems on irregular domains. GMRES is often used to solve for the augmented variable(s) associated with a Schur complement A in AIIM that is defined along the interface or the irregular boundary. The efficiency of AIIM relies on how quickly the system for A can be solved. For some applications, there are substantial difficulties involved, such as the slow convergence of GMRES (particularly for free boundary and moving interface problems) and the inconvenience of finding a preconditioner (since only the products of A with vectors are available). Here, we propose matrix-free structured preconditioning techniques for AIIM via adaptive randomized sampling, using only the products of A with vectors to construct a hierarchically semiseparable matrix approximation to A. Several improvements over existing schemes are shown so as to enhance the efficiency and also avoid potential instability. The significance of the preconditioners includes: (1) they do not require the entries of A or the multiplication of $A^T$ with vectors; (2) constructing the preconditioners needs only $O(\log N)$ matrix-vector products and $O(N)$ storage, where N is the size of A; (3) applying the preconditioners needs only $O(N)$ flops; (4) they are very flexible and do not require any a priori knowledge of the structure of A. The preconditioners are observed to significantly accelerate the convergence of GMRES, with heuristic justifications of the effectiveness. Comprehensive tests on several important applications are provided, such as Navier-Stokes equations on irregular domains with traction boundary conditions, interface problems in incompressible flows, mixed boundary problems, and free boundary problems. The preconditioning techniques are also useful for several other problems and methods.

Key words: augmented immersed interface method, Schur complement system, GMRES, matrix-free preconditioning, hierarchically semiseparable structure, adaptive randomized compression

The research of Jianlin Xia was supported in part by an NSF CAREER Award DMS-1255416 and an NSF grant DMS-1115572. The research of Zhilin Li was partially supported by an NIH grant 5R01GM96195-2, an NSF grant DMS-1522768, and an NCSU Research Innovation Seed Fund.

Preprint submitted to Journal of Computational Physics, September 5, 2015

1. Introduction

In recent years, the augmented immersed interface method (AIIM) has been shown to be very effective for the solution of many interface problems and problems on irregular domains. The method was first developed for elliptic interface problems with discontinuous and piecewise constant coefficients [19]. Later, the idea was extended to moving interface problems on irregular domains [26] and incompressible Stokes equations with discontinuous viscosities [22]. We refer the readers to [21] for more information about AIIM.

Augmented strategies can be naturally used to design efficient and accurate algorithms based on existing fast solvers. As an example, for incompressible Stokes equations with a discontinuous viscosity across the interface, the discontinuous pressure and velocity can be decoupled by augmented strategies. The immersed interface method can then be applied conveniently with a fast Poisson solver in the iterative solution of the Schur complement system.

In augmented strategies, a large linear system is formed for the approximate solution u of the original problem together with an augmented variable g (which may be a vector) of co-dimension one. Eliminating the block corresponding to u from the coefficient matrix yields a Schur complement system for g, which is often much smaller than the system for u. Once g is found, it becomes convenient to solve for u. Thus, it is critical to quickly solve the Schur complement system, which can be done via direct or iterative solvers. Direct solvers can be used for some applications where the Schur complement matrix A is a constant matrix, for example, for a fixed interface or boundary. If A is not a constant matrix, typically for free boundary and moving interface problems, efficient iterative solvers may be preferred. Iterative solvers such as GMRES [37] do not require the explicit formation of the Schur complement matrix A, and are thus often used. In addition, the solution often needs only modest accuracy, and iterative methods make it convenient to control the accuracy and the cost. Iterative solvers only need the product of A with vectors. This is usually done as follows. First, assume g is available and solve the original problem for an approximate solution u. Next, use the approximate solution to compute the residual and thus the matrix-vector product.

AIIM can be considered as a generalized boundary integral method that does not explicitly use Green's functions. The augmented variable can be considered as a source term. If the system behaves like an integral equation of the first kind, then GMRES converges slowly in general, as observed for several applications in this paper. For these applications, effective preconditioners are needed.

Our work here is initially motivated by the application of AIIM to Navier-Stokes equations with a traction boundary condition. Explicit numerical tests indicate that, for some applications, the condition number of the Schur complement matrix A is of size O(1), and $\|A^TA - AA^T\|_2$ is also very small. Nevertheless, GMRES still either takes too many steps to converge or simply stalls. Finding a preconditioner that can greatly improve the convergence of GMRES and is easy to apply thus becomes crucial for augmented methods. However, there are some substantial difficulties in constructing suitable preconditioners. In fact, it may be difficult to even obtain a simple preconditioner such as the diagonal of A.
The reasons are: (1) A is generally dense and its entries are not known explicitly in augmented methods, as mentioned above; (2) we can quickly multiply A with vectors, but not $A^T$ with vectors. Preliminary attempts have been made, such as extracting a few columns of A or multiplying A with special vectors, and then converting the results into a block diagonal preconditioner.

The resulting preconditioner may work for a particular problem or right-hand side, but often fails. Thus, it is necessary to design reliable matrix-free preconditioning techniques based only on matrix-vector products. Previously, some matrix-free preconditioners have been designed for certain sparse problems and specific applications [2, 4, 7, 10, 31]. The ideas are to extract partial approximations or to compute incomplete factorizations via matrix-vector multiplications.

The objective of this work is to construct effective and efficient structured matrix-free preconditioners based solely on the products of A with vectors. This is achieved by taking advantage of some new preconditioning techniques introduced in recent years, such as rank structured preconditioning [5, 8, 11, 14, 18, 47] and randomized preconditioning [33, 34]. The idea of rank structured preconditioning is to obtain a preconditioner via the truncation of the singular values of appropriate off-diagonal blocks. If the off-diagonal singular values decay quickly (the problem is then said to have a low-rank property), it is known that this truncation strategy can be used to develop fast approximate direct solvers. On the other hand, if the off-diagonal singular values are aggressively truncated regardless of their decay rate, structured preconditioners can be obtained. The effectiveness is studied for some cases in [18, 47]. Among the most frequently used rank structures is the hierarchically semiseparable (HSS) form [6, 46]. Unlike standard dense matrix operations, structured methods in terms of compact HSS forms can achieve significantly better efficiency. In fact, the factorization and solution of an HSS matrix need only about O(N) flops and O(N) storage, where N is the matrix size.

In some recent developments, randomized sampling is combined with rank structures to obtain enhanced flexibility [28, 32, 42, 48]. That is, the construction of rank structures (e.g., HSS) may potentially use only matrix-vector products instead of the original matrix itself. The methods in [32, 48] use both matrix-vector products and selected entries of the matrix. The one in [28] is matrix-free, although it requires slightly more matrix-vector products. In [42], a fully matrix-free and adaptive HSS construction scheme is developed. It uses an adaptive rank detection strategy in [15] to dynamically decide the off-diagonal ranks based on a pre-specified accuracy. However, all the randomized methods in [28, 32, 42, 48] require the products of both A and $A^T$ with vectors if A is nonsymmetric, and are thus not applicable to AIIM.

Here, we seek to precondition A with an improved adaptive matrix-free scheme, using only the products of A and vectors. Due to the special features of AIIM mentioned above, we construct a nearly symmetric HSS approximation H to A via randomized sampling. In the construction, multiplication by A replaces the multiplication of $A^T$ with vectors. Thus, the column basis of an off-diagonal block of A obtained by randomized sampling is used to approximate the row basis of the off-diagonal block at the symmetric position of A. This enables us to find low-rank approximations to all the off-diagonal blocks. We explicitly specify a small (O(1)) rank or a low accuracy in the approximation. The off-diagonal approximations are then combined with the matrix-vector products to yield approximations of the diagonal blocks. This is done in a hierarchical fashion so that the overall HSS construction process needs only O(log N) matrix-vector products.
Several other improvements to the schemes in [28, 42] are made. We replace half of the randomized sampling by deterministic QR factorizations. We also design a strategy to reduce the number of matrix multiplications and avoid the use of pseudoinverses, which are both expensive and potentially unstable. H is then quickly factorized, and the factors are used as a preconditioner. Existing HSS algorithms are simplified to take advantage of the near symmetry, and the factorization and the application of the preconditioner have only O(N) complexity and O(N) storage.

Since A corresponds to the interface, its size N is much smaller than the size of the Poisson problem that must be solved to compute each matrix-vector product. Thus, the preconditioning cost is negligible as compared with the matrix-vector multiplication cost in the iterations. We also provide a simplified preconditioner given by the diagonal blocks of H that is easier to use and sometimes has comparable performance.

The effectiveness of the preconditioners is discussed in terms of several aspects of structured and randomized preconditioning, such as the benefits of low-accuracy rank structured preconditioners in reducing condition numbers. The preconditioners also share some ideas with the additive preconditioning techniques in [33, 34], where random low-rank matrices are added to the original matrix to provide effective preconditioners. In addition, since a low-accuracy HSS approximation tends to preserve well-separated eigenvalues [39], it can bring the eigenvalues together when used as a preconditioner. Although the convergence of GMRES does not necessarily rely on the eigenvalue distribution [12, 38], the eigenvalue redistribution provides an empirical explanation for the preconditioner, as used in practice.

The effectiveness and the efficiency are further demonstrated with a survey of several important applications, such as Navier-Stokes equations on irregular domains with traction boundary conditions, an inextensible interface in incompressible flow, a contact problem of drop spreading, and a mixed boundary problem. For the Schur complement matrix A in AIIM, GMRES (with restart) generally stalls. Even non-restarted GMRES barely converges unless the number of iterations reaches nearly N. However, after our matrix-free preconditioning, GMRES converges quickly. The convergence is also observed to be relatively insensitive to N and some physical parameters. We also give a comprehensive test for a free boundary problem that involves multiple stages of GMRES solutions within the iterative solution of a nonlinear equation. The problem involves mixed boundary conditions, and the system behaves like integral equations of both first and second kinds. With our preconditioner, the overall performance of the GMRES solutions for all the stages is significantly improved.

The presentation is organized as follows. In Section 2, the features of the Schur complements and the motivation for our work are discussed, together with a brief review of the idea of AIIM. In particular, the linear system with the Schur complement matrix A is explained in detail, including the fast multiplication of A with vectors and the formation of the right-hand side. Section 3 presents the matrix-free structured preconditioner and the detailed algorithms, and the effectiveness is discussed. Section 4 lists the applications, summarizes the numerical tests, and discusses the generalizations. Some conclusions are drawn in Section 5.

2. Features of the Schur complement systems in AIIM and motivation for the work

AIIM usually has two discretizations. One is for the governing PDE with the assumption that the augmented variable is known. The second is for the augmented equation such as the boundary condition or the interface condition. In this section, as a preparation for our preconditioning techniques, we use the example of solving a Poisson equation on an irregular domain in [19, 21] to demonstrate the idea of AIIM, and show the form of the Schur complement A and its multiplication with vectors.
We then discuss some useful features of the Schur complement system. These features are based on both mathematical and numerical observations, and provide a motivation for the development of our preconditioners.

2.1. Finite difference method for elliptic problems with singular source terms

In this subsection, we review the discretization of the governing PDE as in [19, 21], assuming the augmented variable is known. Let $R = \{(x, y):\ a < x < b,\ c < y < d\}$ be a rectangular domain. Consider an elliptic interface problem with a specified boundary condition on $\partial R$:
$$\Delta u = f(x, y), \quad (x, y)\in R\setminus\partial\Omega, \qquad [u] = w, \quad [u_n] = v, \qquad (1)$$
where $\partial\Omega = (\hat x(s), \hat y(s))$ is a curve or interface within R with a parameter s (such as the arc length), $[u]$ is the jump of the solution across the boundary $\partial\Omega$, n is the unit direction pointing outward from $\partial\Omega$, and $[u_n] = [\frac{\partial u}{\partial n}]$ is the jump in the normal derivative of u across $\partial\Omega$. If $w\equiv 0$, (1) can be rewritten as Peskin's model [35]
$$\Delta u = f(x, y) + \int_{\partial\Omega} v(s)\,\delta(x - \hat x(s))\,\delta(y - \hat y(s))\,ds.$$
We have a single equation for the entire domain. The second term on the right-hand side involves the two-dimensional Dirac delta function and is called a singular source term, or a source distribution along the curve $\partial\Omega$. If $w\not\equiv 0$, then it is called a double layer, similar to the derivative of the Dirac delta function, which is again called a singular source.

To solve the equation above numerically, either the immersed boundary method or the immersed interface method (IIM) can be used. For example, following the standard five-point finite difference discretization on a uniform mesh, we can write both methods as
$$\frac{u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} - 4u_{i,j}}{h^2} = f_{ij} + c_{ij},$$
where h is the uniform mesh size, $u_{ij}\approx u(x_i, y_j)$ is the discrete solution, $f_{ij} = f(x_i, y_j)$, and $c_{ij}$ is the discrete delta function in the immersed boundary method or is chosen to achieve second order accuracy in IIM. A matrix-vector form can be conveniently written for the scheme:
$$\mathcal{A}u = f + Bw + Cv,$$
where w and v are the discrete forms of w and v, respectively, and $\mathcal{A}$ denotes the discretized PDE operator. The matrices B and C are sparse matrices that are related to the coordinates of grid points in the finite difference stencil and the boundary information, including the first and second order partial derivatives of $\partial\Omega$. Usually, each row of B or C has 3-9 nonzero entries, depending on the application. Here, we assume that we know w and v on a rectangular domain. For AIIM, one of them is unknown and should be chosen to satisfy a certain kind of interface/boundary condition for different applications.
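To make the five-point scheme above concrete, here is a minimal Python sketch (our illustration, not the authors' code): it assembles the standard five-point Laplacian on a uniform interior grid and solves the corrected system. The vector c, which would collect the interface corrections $c_{ij}$ computed from the jumps $[u]$ and $[u_n]$, is left as a zero placeholder since that IIM-specific step is not reproduced here.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def five_point_laplacian(n, h):
    """Five-point Laplacian on an n x n interior grid, assuming homogeneous
    Dirichlet boundary values on the rectangle R (a simplification here)."""
    T = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n))
    I = sp.identity(n)
    return (sp.kron(I, T) + sp.kron(T, I)) / h**2

n, h = 63, 1.0 / 64
L = five_point_laplacian(n, h)
f = np.ones(n * n)        # smooth part of the right-hand side, f_ij
c = np.zeros(n * n)       # placeholder for the singular-source corrections c_ij
u = spsolve(L.tocsr(), f + c)
print(u.shape)
```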

2.2. AIIM for Helmholtz/Poisson equations on irregular domains

Now we explain AIIM for the solution of Helmholtz/Poisson equations on an irregular interior or exterior domain Ω. See [16, 17, 26] for more details. Consider a Helmholtz/Poisson equation with a linear boundary condition $q(u, u_n)$ along the boundary $\partial\Omega$:
$$\Delta u - \lambda u = f(x), \quad x\in\Omega, \qquad q(u, u_n) = 0, \quad x\in\partial\Omega.$$

In AIIM, Ω is embedded into a rectangular or cubic domain R, and $\partial\Omega$ becomes an interface. The PDE and the source term are then extended to the entire domain R as follows:
$$\Delta u - \lambda u = \begin{cases} f, & x\in\Omega,\\ 0, & x\in R\setminus\Omega,\end{cases} \qquad q(u, u_n) = 0, \quad x\in\partial\Omega,$$
$$\begin{cases} [u] = g,\\ [u_n] = 0,\end{cases} \quad\text{or}\quad \begin{cases} [u] = 0,\\ [u_n] = g,\end{cases} \qquad x\in\partial\Omega.$$
If g is known, then we can find the solution u with a fast Poisson solver. In the discrete sense, this can be represented by a matrix-vector equation
$$\mathcal{A}u = f - Bg, \qquad (2)$$
where $\mathcal{A}$ again denotes the discretized PDE operator (distinct from the Schur complement matrix A below).

To solve the original problem, the augmented variable g is determined so that $q(u(g), u_n(g)) = 0$ holds for $u(g)$ along the boundary/interface $\partial\Omega$. This is the second discretization in AIIM. One strategy for this discretization is to apply least squares interpolations [19, 21] in terms of u and g at a set of discrete points along $\partial\Omega$, which leads to a matrix-vector equation
$$Cu + Dg - q = 0. \qquad (3)$$
The residual vector is
$$R(g) = Cu(g) + Dg - q.$$
$R(g)$ here has dual meanings. It is not only the regular residual of the linear system (6) below, but also a measurement of how well the boundary condition is satisfied. u is the solution when $R(g) = 0$.

Combine (2) and (3) to get
$$\begin{pmatrix}\mathcal{A} & B\\ C & D\end{pmatrix}\begin{pmatrix}u\\ g\end{pmatrix} = \begin{pmatrix}f\\ q\end{pmatrix}.$$
Compute a block LU factorization
$$\begin{pmatrix}\mathcal{A} & B\\ C & D\end{pmatrix} = \begin{pmatrix}\mathcal{A} & \\ C & I\end{pmatrix}\begin{pmatrix}I & \mathcal{A}^{-1}B\\ & A\end{pmatrix}, \qquad (4)$$
where A is the Schur complement
$$A = D - C\mathcal{A}^{-1}B. \qquad (5)$$
We can then solve a much smaller system for g:
$$Ag = b, \quad\text{with}\quad b = q - C\mathcal{A}^{-1}f. \qquad (6)$$
Once we find g, we solve (2) for the solution u. The right-hand side vector b in (6) can be found by evaluating $R(g)$ at $g = 0$, since
$$-R(0) = -(Cu(0) + D\cdot 0 - q) = -(C\mathcal{A}^{-1}f - q) = b.$$

In iterative solutions of (6), we need to evaluate the products of A with vectors $\tilde g$. For this purpose, we first solve (2) with g replaced by $\tilde g$:
$$\mathcal{A}\tilde u(\tilde g) = f - B\tilde g.$$
Then
$$A\tilde g = D\tilde g - C\mathcal{A}^{-1}B\tilde g = D\tilde g - C\mathcal{A}^{-1}\big(f - \mathcal{A}\tilde u(\tilde g)\big) \qquad (7)$$
$$= D\tilde g - Cu(0) + C\tilde u(\tilde g) = \big(C\tilde u(\tilde g) + D\tilde g\big) - Cu(0) = \big(C\tilde u(\tilde g) + D\tilde g - q\big) - \big(D\cdot 0 + Cu(0) - q\big) = R(\tilde g) - R(0).$$
That is, to compute $A\tilde g$, we can use an interpolation to get the residual for the boundary condition.

In general, the Schur complement matrix A is nonsymmetric. Even $\mathcal{A}$ may be nonsymmetric, such as in the other applications in this paper. A is generally a dense matrix since $\mathcal{A}^{-1}$ is. In the boundary integral method, A is close to the discretization of a second-kind integral equation.

In the examples presented in this paper, the matrices $\mathcal{A}$, B, C, and D are not explicitly formed. The matrices C and D are sparse and depend on the interpolation scheme used to approximate the boundary condition. There are about 3-16 entries in each row, depending on the geometry of the boundary and its neighboring grid points for different applications. Since C and D are determined from the interpolation scheme for approximating the boundary condition, we generally do not have $C^T$ and $D^T$ available if the matrices are not formed. Thus, in practice, A is not explicitly available, and the multiplication of $A^T$ with vectors is not convenient.
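As a concrete illustration of how (7) is used in practice, the following Python sketch (our own, under hedged assumptions: `residual` is a hypothetical user-supplied routine that solves (2) with a fast Poisson solver and evaluates $R(g) = Cu(g) + Dg - q$ by interpolation) wraps the Schur complement as a SciPy LinearOperator and also recovers the right-hand side $b = -R(0)$ at the cost of one extra solve.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator

def schur_operator(N, residual):
    """Matrix-free wrapper of the AIIM Schur complement A of size N,
    using only residual evaluations:  A g = R(g) - R(0),  b = -R(0)."""
    R0 = residual(np.zeros(N))      # one solve with g = 0
    b = -R0                         # right-hand side of (6)
    def matvec(g):
        return residual(np.asarray(g).ravel()) - R0
    return LinearOperator((N, N), matvec=matvec), b

# usage sketch: A_op, b = schur_operator(N, residual); pass A_op and b to GMRES.
```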

2.3. Features of the Schur complement systems and challenges in GMRES solutions

As mentioned above, the Schur complement matrices A in AIIM, such as (5), are usually dense and not explicitly formed. On the other hand, A can be multiplied with vectors quickly with the aid of fast solvers (e.g., Poisson solvers). Thus, iterative methods such as GMRES are usually used to solve the Schur complement system
$$Ag = b. \qquad (8)$$
For the problems we consider, the following features or challenges are often observed.

- For many Schur complement systems resulting from AIIM, GMRES without preconditioning has difficulty converging. By this we mean that the restarted GMRES method stalls, or the non-restarted GMRES method converges only when the number of iterations reaches nearly the size N of A. Sometimes this is due to the ill conditioning of A. However, for various cases here, this happens even if A is well conditioned. Often, the eigenvalues of A are scattered around the origin, which is empirically observed to affect the convergence of GMRES. The physical background for the slow convergence is as follows. AIIM can be considered as a generalized boundary integral method without explicitly using Green's functions. The augmented variable can be considered as a source term. Thus, if the discrete system corresponds to an integral equation of the second kind, then GMRES can converge quickly. (We refer the readers to [49] for the relation between an augmented approach and the boundary integral method.) One such example is solving a Poisson equation on an irregular domain with different boundary conditions. If the system behaves like an integral equation of the first kind, then GMRES converges slowly in general. One such example is solving a Poisson equation on an irregular domain with both Dirichlet and Neumann boundary conditions defined along parts of the boundary. Thus, an effective preconditioner is crucial for the convergence of GMRES.

- A is usually dense and it is costly to form it. (For example, $\mathcal{A}^{-1}$ is involved in (5) for the large sparse matrix $\mathcal{A}$.) In fact, it is not convenient to even extract its diagonal. Even if we find the diagonal, it may have zero entries and cannot be used as a preconditioner in a straightforward way.

- $A\tilde g$ can be conveniently computed for a vector $\tilde g$. Thus, we may extract a few columns of A or multiply A with certain special vectors (e.g., the vector of all ones), so as to construct diagonal or block diagonal preconditioners. However, such preconditioners may be close to singular or may perform poorly.

- $A^T\tilde g$ cannot be conveniently computed for a vector $\tilde g$ (see the end of the previous subsection). Thus, even the recent matrix-free direct solution techniques in [28, 42] cannot be used to get a preconditioner, since they require the multiplication of both A and $A^T$ with vectors to get an approximation to A.

- In some cases, A is close to normal or even symmetric.

- The singular values of the off-diagonal blocks of A have reasonable decay. In some cases, the decay is very fast, so that such blocks have small numerical ranks.

As an example, we consider a 680 × 680 Schur complement A arising from AIIM for solving the Navier-Stokes equations with a traction boundary condition in Section 4.1.1. The mesh size is 240 × 240. Note that the 2-norm condition number of the matrix is $\kappa_2(A) = 10.09$, and
$$\|A - A^T\|_2 = 0.02, \qquad \|A^TA - AA^T\|_2 = 1.98\times 10^{-5}.$$
However, due to the traction boundary condition, restarted GMRES (with 50 inner iterations) applied to (8) fails to converge, as shown in Figure 1.

Figure 1: Convergence of restarted GMRES without preconditioning for (8) from AIIM for solving the Navier-Stokes equations with a traction boundary condition in Section 4.1.1, where N = 680, the mesh size is 240 × 240, and $\gamma_2 = \|Ag - b\|_2 / \|b\|_2$ is the relative residual.

Thus, an effective preconditioner is needed. The diagonal of A contains zero entries and cannot be conveniently used as a preconditioner, even if we could extract it. We will seek a structured preconditioner. In fact, the off-diagonal blocks of A have relatively small numerical ranks. In Figure 2, we show the singular values of an $N/2\times N/2$ off-diagonal block of A. Clearly, this block has only a few large singular values. Thus, we can use a low-rank form to approximate it to a reasonable accuracy. Overall, we can approximate A by a rank structured (e.g., HSS) matrix that can serve as an effective preconditioner. However, this would require either A explicitly or the multiplication of both A and $A^T$ with vectors. In the next section, we address these issues.

Figure 2: The first 30 singular values of $A(1\!:\!N/2,\ N/2\!+\!1\!:\!N)$ for the Schur complement A from AIIM for solving the Navier-Stokes equations with a traction boundary condition in Section 4.1.1, where N = 680 and the mesh size is 240 × 240.
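The low-rank behavior seen in Figure 2 can be checked empirically even when A is available only through matrix-vector products. The sketch below is our own diagnostic (not part of the preconditioner and not necessarily how the figure was produced): it forms the block $A(1\!:\!N/2,\ N/2\!+\!1\!:\!N)$ column by column from products of A with coordinate vectors and then computes its singular values.

```python
import numpy as np

def offdiag_singular_values(matvec, N, k=30):
    """Leading k singular values of A(1:N/2, N/2+1:N), where matvec(x)
    returns A @ x. Uses N - N//2 matrix-vector products (diagnostic only)."""
    half = N // 2
    block = np.empty((half, N - half))
    for j in range(N - half):
        e = np.zeros(N)
        e[half + j] = 1.0
        block[:, j] = matvec(e)[:half]
    s = np.linalg.svd(block, compute_uv=False)
    return s[:k]
```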

3. Matrix-free preconditioning techniques for AIIM

The challenges and the tests in the previous section provide the motivation for this work. We show a matrix-free scheme that improves those in [28, 48] and produces a nearly symmetric approximation H to A in a structured form. H will be quickly factorized and used as a structured preconditioner.

3.1. Rank structures

Our preconditioning techniques employ rank structures, or more specifically, HSS representations [6, 46]. An HSS representation provides a convenient format to organize the off-diagonal blocks of a matrix H by low-rank forms. H is partitioned hierarchically, so that the off-diagonal blocks at all the hierarchical levels have low-rank forms and also share certain common bases. A comprehensive summary of HSS structures and algorithms can be found in [43]. Here, we briefly review the definition, with the aid of the following notation:

- $H|_{s\times t}$ is the submatrix of H formed by the entries corresponding to the row index set s and the column index set t;
- $H|_{s}$ is the submatrix of H formed by the rows corresponding to the index set s;
- $H|_{:\times t}$ is the submatrix of H formed by the columns corresponding to the index set t.

A general HSS form looks like $D_k \equiv H|_{\mathcal{I}\times\mathcal{I}}$, where $\mathcal{I}$ is the index set $\{1, 2, \ldots, N\}$ and $D_k$ is defined recursively following a full binary tree T with k nodes, labeled $i = 1, 2, \ldots, k$ in a postorder. Each node i of T is associated with an index set $t_i$ such that
$$t_k = \mathcal{I}, \qquad t_i = t_{c_1}\cup t_{c_2}, \qquad t_{c_1}\cap t_{c_2} = \emptyset,$$

where $c_1$ and $c_2$ are the children of a non-leaf node i and $c_1 < c_2$ ($c_1$ ordered before $c_2$). Then for each such i, recursively define
$$D_i = \begin{pmatrix} D_{c_1} & U_{c_1}B_{c_1}V_{c_2}^T\\ U_{c_2}B_{c_2}V_{c_1}^T & D_{c_2}\end{pmatrix}, \qquad U_i = \begin{pmatrix} U_{c_1}R_{c_1}\\ U_{c_2}R_{c_2}\end{pmatrix}, \qquad V_i = \begin{pmatrix} V_{c_1}W_{c_1}\\ V_{c_2}W_{c_2}\end{pmatrix},$$
where the matrices $D_i$, $U_i$, etc. are called the generators that define the HSS form of H. Also, if we associate each node i of T with the corresponding generators, T is called an HSS tree. T can be used to conveniently organize the storage of H and to perform HSS operations.

The off-diagonal blocks we are interested in are $H|_{t_i\times(\mathcal{I}\setminus t_i)}$ and $H|_{(\mathcal{I}\setminus t_i)\times t_i}$, called HSS block rows and columns, respectively. $U_i$ gives a column basis for $H|_{t_i\times(\mathcal{I}\setminus t_i)}$, and $V_i$ gives a column basis for $(H|_{(\mathcal{I}\setminus t_i)\times t_i})^T$. For convenience, assume $U_i$ and $V_i$ have orthonormal columns. In standard HSS computations, the generators are usually dense. However, particular HSS construction algorithms may yield generators that have additional internal structures (see, e.g., [44, 48]). Later, we use par(i) and sib(i) to denote the parent and the sibling of i, respectively. Also, we say $c_1$ is a left node and $c_2$ is a right one.

3.2. Matrix-free structured preconditioning via randomized sampling

HSS preconditioning for symmetric positive definite problems has been discussed in [47]. Here, our matrices are generally nonsymmetric and indefinite. The primary idea of our preconditioning techniques is to construct a nearly-symmetric HSS approximation H to A with an improved matrix-free HSS construction scheme via only the products of A and vectors. The major components are as follows.

1. For an off-diagonal block of A, use an adaptive randomized sampling method [15, 42] to find an approximate column basis, and a deterministic method to find an approximate row basis.
2. In a top-down traversal of the HSS tree, use the products of A and vectors to obtain hierarchically low-rank approximations to certain off-diagonal blocks. Treat A as a symmetric matrix to approximate the other off-diagonal blocks. We avoid using the pseudoinverses and some of the matrix multiplications in [28, 42].
3. This further enables us to approximate the diagonal blocks, and we then obtain a nearly symmetric HSS approximation H to A.
4. Compute a structured ULV-type factorization [6, 46] of H. Use the factors as a preconditioner in a ULV solution scheme.

The detailed improvements over existing similar schemes are elaborated in the following subsections.

3.2.1. Improved randomized compression

Firstly, we briefly explain the adaptive randomized sampling ideas and show our improvement to the randomized compression. For an M × N matrix Φ of rank or numerical rank r, we seek to find a low-rank approximation of the form
$$\Phi \approx UBV^T.$$

We multiply Φ with random vectors and use adaptive randomized sampling to find U as in [15, 40]. Then, unlike the methods in [15, 28, 40, 42], we do not multiply $\Phi^T$ with random vectors when finding V. Furthermore, we do not use pseudoinverses to find B. The details are as follows.

Initially, let X be an $N\times\tilde r$ Gaussian random matrix, where $\tilde r$ is a conservative estimate of r. Compute the product $Y = \Phi X$ and a QR factorization
$$Y = U\tilde Y. \qquad (9)$$
U is expected to provide a column basis matrix for Φ when $\tilde r$ is close to r. To quickly estimate how well $U(U^T\Phi)$ approximates Φ, the following bound can be used, which holds with high probability [15, 40]:
$$\|\Phi - U(U^T\Phi)\|_2 \le \eta\sqrt{\frac{2}{\pi}}\,\max_{j=\tilde r-d+1,\ldots,\tilde r}\big\|(I - UU^T)Y|_{:\times j}\big\|_2, \qquad (10)$$
where d is a small integer and η is a real number, and d and η determine the probability. If the desired accuracy is not reached, more random vectors are used. When this stops, we obtain the approximate basis matrix U and the rank estimate.

Existing randomized sampling methods then usually multiply $\Phi^T$ with random vectors to find V in a similar way. That is, randomized sampling is used twice to extract both the row and the column basis. Here, instead, we compute
$$\tilde\Phi = U^T\Phi, \qquad (11)$$
and then compute an RQ factorization
$$\tilde\Phi = BV^T. \qquad (12)$$
That is, we use randomized sampling only once, and use the deterministic orthogonal steps (11)-(12) to find V. This is potentially beneficial for the approximation. Furthermore, (12) provides both B and V without the need for the pseudoinverses in [28, 42]. The RQ factorization is generally much faster and much more stable. This idea is similar to a deterministic compression strategy for HSS construction in [46], and a parallel implementation is later discussed in [30].
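The compression steps (9)-(12) are simple to state in code. The sketch below is our own illustration with a fixed sample size (the paper uses the adaptive rank detection of [15, 42]) and with Φ given as an explicit dense array; in the actual matrix-free setting, the product $U^T\Phi$ is obtained through products of A with structured vectors, as described in the next subsection.

```python
import numpy as np
from scipy.linalg import qr, rq

def one_sided_compress(Phi, r_est, rng):
    """Low-rank approximation Phi ~= U @ B @ V.T following (9)-(12):
    randomized sampling for the column basis U, then a deterministic
    RQ factorization for B and V (no sampling against Phi.T, no pseudoinverse)."""
    M, N = Phi.shape
    X = rng.standard_normal((N, r_est))      # Gaussian test matrix
    Y = Phi @ X                              # sample the column space
    U, _ = qr(Y, mode='economic')            # (9): Y = U * Ytilde
    Phi_t = U.T @ Phi                        # (11)
    B, Vt = rq(Phi_t, mode='economic')       # (12): Phi_t = B * V^T
    return U, B, Vt.T

# quick check on a synthetic nearly low-rank block
rng = np.random.default_rng(0)
Phi = rng.standard_normal((300, 5)) @ rng.standard_normal((5, 200)) \
      + 1e-8 * rng.standard_normal((300, 200))
U, B, V = one_sided_compress(Phi, r_est=10, rng=rng)
print(np.linalg.norm(Phi - U @ B @ V.T) / np.linalg.norm(Phi))
```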

3.2.2. Nearly-symmetric matrix-free HSS construction with improvements

We then apply the improved randomized compression to the HSS blocks of A:
$$\Phi = A|_{t_i\times(\mathcal{I}\setminus t_i)} \quad\text{or}\quad A|_{(\mathcal{I}\setminus t_i)\times t_i}. \qquad (13)$$
The compression is done hierarchically for all the HSS blocks. Previous attempts to find accurate HSS approximations to problems with low-rank off-diagonal blocks are made in [28, 32, 42, 48], where the products of both A and $A^T$ with random vectors are needed. As compared with the original matrix-free HSS constructions in [28, 42], the following improvements and modifications are made:

- A nearly-symmetric HSS approximation is constructed. That is, for a node i of the HSS tree and j = sib(i), $V_i\equiv U_i$, $W_i\equiv R_i$, and $B_j = B_i^T$, but the D generators are nonsymmetric.
- Randomized sampling is only used to find $U_i$ for the left nodes i. To find $U_i$ for the right nodes i, the deterministic way (11)-(12) is used instead. This reduces the number of matrix multiplications, and avoids expensive and potentially unstable pseudoinverses.
- Specifically for the purpose of preconditioning, a low approximation accuracy and small $\tilde r$ and d are used in (10).

The details are as follows. Assume that T is the desired HSS tree. As in [28, 42], the nodes i of T are visited in a top-down way, so as to construct the column bases hierarchically for the HSS blocks $A|_{t_i\times(\mathcal{I}\setminus t_i)}$. For convenience, we use $\mathbf{T}_l$ and $\hat{\mathbf{T}}_l$ to denote the sets of left and right nodes at a level l of T, respectively. Also, let the root of T be at level 0 and the leaves be at the largest level $l_{\max}$. For $1\le l\le l_{\max}$, define
$$t^l = \bigcup_{i\in\mathbf{T}_l} t_i, \qquad \hat t^l = \bigcup_{j\in\hat{\mathbf{T}}_l} t_j.$$
Let X be a Gaussian random matrix whose column size is decided via adaptive randomized sampling applied to $A|_{t_i\times t_j}$ for each $i\in\mathbf{T}_l$. Define $\tilde X$ so that
$$\tilde X|_{t^l} = X|_{t^l}, \qquad \tilde X|_{\hat t^l} = 0. \qquad (14)$$
Compute
$$\tilde Y = A\tilde X - Z,$$
where Z is the product of the partially computed HSS form (due to the recursion at upper levels) and $\tilde X$, denoted $Z = \mathrm{hssmv}(A, \tilde X, l-1)$. (Z is 0 if l = 1.) Then compute a QR factorization
$$\tilde Y|_{t_i} = \bar U_i S_i,$$
where $\bar U_i$ is a column basis matrix for $A|_{t_i\times t_j}$ with j = sib(i).

To find a row basis matrix for $A|_{t_i\times t_j}$, unlike the methods in [28, 42] that still use randomized sampling, we use a deterministic way. Define a matrix $\hat X$ that is zero except that, for each $i\in\mathbf{T}_l$, $\hat X|_{t_i} = \bar U_i$. Compute
$$\hat Y = A\hat X - \mathrm{hssmv}(A, \hat X, l-1).$$
Then for each right node j at level l, compute a QR factorization
$$\hat Y|_{t_j} = \bar U_j B_j,$$
where $\bar U_j$ is a column basis matrix for $(A|_{t_i\times t_j})^T$.

We are then ready to find the generators. If l = 1, simply set
$$U_i = \bar U_i, \qquad B_i = B_j^T.$$
Otherwise, after j = sib(i) is also visited, let p = par(i) and partition $U_p = \begin{pmatrix} U_{p;1}\\ U_{p;2}\end{pmatrix}$ so that $U_{p;1}$ has the same row size as $\bar U_i$ (assuming i is a left node). Then compute QR factorizations
$$\begin{pmatrix}\bar U_i & U_{p;1}\end{pmatrix} = U_i\begin{pmatrix} S_i & R_i\end{pmatrix}, \qquad \begin{pmatrix}\bar U_j & U_{p;2}\end{pmatrix} = U_j\begin{pmatrix} S_j & R_j\end{pmatrix}.$$

Also, set
$$B_i = S_jB_jS_i^T.$$
In comparison, multiple pseudoinverses and many more matrix multiplications are needed in [28]. An improved version is given in [42], but it still needs two more matrix multiplications and a pseudoinverse on top of two randomized sampling stages. Another strategy to avoid the pseudoinverse is later given in [30], and can potentially better reveal the hierarchical structures. However, it needs additional randomized sampling and matrix-vector multiplications. Here, since our purpose is preconditioning, the strategy above is sufficient and preferable.

This process is repeated for the nodes of T in a top-down traversal. After the leaf level is traversed, approximations to all the HSS blocks are obtained. We can then approximate all the diagonal blocks $A|_{t_i\times t_i}$ for the leaves i by subtracting the products of appropriate off-diagonal blocks of A and $\mathbb{I} = (I, I, \ldots, I)^T$ from $A\mathbb{I}$ [28]. (Some additional zero columns may be needed for certain block rows of $\mathbb{I}$.) More specifically,
$$A|_{t_i\times t_i} \approx H|_{t_i\times t_i} = \big(A\mathbb{I} - \mathrm{hssmv}(A, \mathbb{I}, l_{\max})\big)\big|_{t_i}.$$
By now, we have obtained an HSS approximation H to A, where the off-diagonal blocks have a symmetric pattern, i.e., $H|_{t_i\times t_j} = (H|_{t_j\times t_i})^T$. The diagonal blocks $D_i\approx H|_{t_i\times t_i}$ are nonsymmetric. The accuracy of this HSS approximation may be low, but it suffices for our preconditioning purpose.

3.2.3. Nearly-symmetric HSS factorization

At this point, we can quickly factorize the HSS matrix H, and the factors are used for preconditioning. Since H is nearly symmetric, we use the fast ULV factorization in [46], with some simplifications to take full advantage of the symmetric off-diagonal pattern and with some modifications to accommodate the nonsymmetric diagonal blocks. The basic idea includes the following steps.

- Introduce zeros into $H|_{t_i\times(\mathcal{I}\setminus t_i)}$ by expanding $U_i$ into an orthogonal matrix $\tilde U_i$ and multiplying $\tilde U_i$ to $H|_{t_i}$. This only requires modifying $D_i$ as
$$D_i \leftarrow \tilde U_i D_i \tilde U_i^T.$$
- Partition $D_i$ into a block 2 × 2 form $\begin{pmatrix} D_{i;1,1} & D_{i;1,2}\\ D_{i;2,1} & D_{i;2,2}\end{pmatrix}$, so that $D_{i;1,1}$ is a square matrix with size equal to the column size of $U_i$. Partially LU factorize $D_i$:
$$D_i = \begin{pmatrix} D_{i;1,1} & D_{i;1,2}\\ D_{i;2,1} & D_{i;2,2}\end{pmatrix} = \begin{pmatrix} L_{i;1,1} & \\ L_{i;2,1} & I\end{pmatrix}\begin{pmatrix} G_{i;1,1} & G_{i;1,2}\\ & G_{i;2,2}\end{pmatrix},$$
where $D_{i;1,1} = L_{i;1,1}G_{i;1,1}$ is the LU factorization of $D_{i;1,1}$, and $G_{i;2,2} = D_{i;2,2} - L_{i;2,1}G_{i;1,2}$.
- After these steps for two sibling nodes i and j of T are finished, merge the remaining blocks. Here, this is simply to set
$$D_p \leftarrow \begin{pmatrix} G_{i;2,2} & B_i\\ B_i^T & G_{j;2,2}\end{pmatrix}.$$
Note that no actual operations are performed. Then we remove i and j from T to obtain a smaller HSS form, called a reduced HSS matrix [45]. Repeat the above steps on the reduced HSS matrix.

After the factorization, the ULV factors are a sequence of orthogonal and lower and upper triangular matrices and are used as a structured preconditioner. The ULV solution procedure for the preconditioning is similar to that in [46] and is skipped.

3.2.4. Algorithm and variation

The details are shown in Algorithm 1, which includes the HSS approximation and the factorization for constructing the preconditioner. To make the preconditioner easier to use, we may also use a simplified variation by forming a block diagonal matrix D from the $D_i$ generators:
$$D = \mathrm{diag}(D_i:\ i\ \text{a leaf of}\ T).$$
Then we factorize D and use the triangular factors as the preconditioner. To summarize, we have two types of preconditioners:

- Preconditioner I: the structured preconditioner given by the ULV factors of the HSS approximation H to A;
- Preconditioner II: the block-diagonal preconditioner given by the LU factors of the block diagonal matrix D formed by the $D_i$ generators of H. This preconditioner is easier to apply, although it may lead to slightly slower convergence.

For convenience, we may simply say that H or D is the preconditioner.

3.3. Efficiency and effectiveness

The HSS construction costs $O(r^2N)$ flops plus the cost of multiplying A with $O(r\log N)$ vectors, where r is the maximum numerical rank of the HSS blocks. The factorization and the preconditioning costs are $O(r^2N)$ and $O(rN)$ flops, respectively. We may also treat the diagonal blocks as symmetric ones so as to further save costs. The storage for the preconditioner is $O(rN)$. In the preconditioning, we use a low accuracy so that r is a very small integer. The preconditioning cost at each GMRES iteration is then O(N). This cost is negligible compared with the cost of multiplying A with a vector. The reason is that A corresponds to the interface and has a much smaller size than the entire problem. See, e.g., (4)-(5), where the multiplication of A with a vector needs the solution of a Poisson problem represented by $\mathcal{A}$. The matrix $\mathcal{A}$ has a size much larger than N.

Just like many other preconditioning techniques, a full analytical justification of the effectiveness of the preconditioner is not yet available. In particular, the complex nature of the rank structures makes the analysis a nontrivial issue. However, several aspects of rank structured approximation and randomized preconditioning are closely related and give useful heuristic explanations.

1. HSS representations recursively capture the algebraic structure of the discretization, with the off-diagonal blocks corresponding to the interactions of subdomains. A low-accuracy off-diagonal approximation represents the essential basic information within the subdomain interaction.

Algorithm 1: Constructing the matrix-free preconditioner

procedure MFPREC
    Input: HSS partition, HSS tree T, compression tolerance τ (and/or rank bound r), and mat-vec (Schur complement matrix-vector multiplication routine as in (7))
    Output: HSS factors used as a preconditioner
    X̃ ← 0, X̂ ← 0
    for level l = 1, 2, ..., l_max do                          ▷ improved nearly-symmetric matrix-free HSS construction
        construct X̃ as in (14) via adaptive randomized sampling
        Ỹ ← mat-vec(A, X̃) − hssmv(A, X̃, l−1)                   ▷ hssmv returns 0 if l = 1
        for each left node i at level l do
            Ỹ|_{t_i} = Ū_i S_i                                  ▷ QR factorization
            X̂|_{t_i} ← Ū_i
        end for
        Ŷ ← mat-vec(A, X̂) − hssmv(A, X̂, l−1)
        for each right node j at level l do
            Ŷ|_{t_j} = Ū_j B_j                                  ▷ QR factorization
            i ← sib(j)
            if l = 1 then                                       ▷ i is a child of the root
                U_i ← Ū_i,  U_j ← Ū_j,  B_i ← B_j^T             ▷ U, B generators
            else
                partition U_par(i) = [U_par(i);1 ; U_par(i);2] so that U_par(i);1 and Ū_i have the same row size
                [Ū_i  U_par(i);1] = U_i [S_i  R_i],  [Ū_j  U_par(i);2] = U_j [S_j  R_j]   ▷ QR factorizations for the U, R generators
                B_i ← S_j B_j S_i^T                             ▷ B generator
            end if
        end for
    end for
    Ỹ ← mat-vec(A, 𝕀) − hssmv(A, 𝕀, l_max)                      ▷ 𝕀 = (I, I, ..., I)^T as in Section 3.2.2
    for each leaf i do
        D_i ← Ỹ|_{t_i}                                          ▷ D generator
    end for
    for node i = 1, 2, ..., k do                                ▷ ULV factorization to generate the preconditioner
        Ũ_i ← extension of U_i to an orthogonal matrix          ▷ implicit zero introduction into A|_{t_i × (I∖t_i)}
        D_i ← Ũ_i D_i Ũ_i^T                                     ▷ diagonal block update
        D_i = [L_i;1,1  0 ; L_i;2,1  I] [G_i;1,1  G_i;1,2 ; 0  G_i;2,2]   ▷ partial LU factorization
        if i is a non-leaf node then
            D_i ← [G_c1;2,2  B_c1 ; B_c1^T  G_c2;2,2]           ▷ c_1, c_2: children of i
        end if
    end for
end procedure

2. Keeping a few of the largest off-diagonal singular values in a rank structured approximation tends to have the effect of roughly preserving certain eigenvalues of the original matrix [39, 42]. In particular, if some eigenvalues are well separated from the others, then a low-accuracy HSS approximation can give a much more accurate approximation of these eigenvalues. This structured perturbation analysis is shown in [39]. Therefore, when the HSS approximation is used as a preconditioner, it can potentially help to bring the eigenvalues closer together. Although this does not necessarily guarantee the fast convergence of GMRES [12, 38], it gives an empirical way to look into the behavior of the preconditioner.

3. Even if the problem does not have the low-rank property, or the off-diagonal singular values decay only slowly, a structured approximation used as a preconditioner can significantly accelerate the decay of the condition number κ [47], where the preconditioner is obtained with different numbers of off-diagonal singular values kept in the approximation.

4. In our preconditioner, multiplications of A with random matrices are used to approximate the off-diagonal blocks by low-rank forms, which are further used to approximate the diagonal blocks via matrix addition/subtraction. This is similar to the idea of additive preconditioning [33, 34], where random low-rank matrices are added to the original matrix, which is shown to yield a well-conditioned preconditioner and also improves the condition number of the original matrix.

4. Preconditioning AIIM for different applications and generalizations

In this section, we show several important applications where our preconditioning techniques for AIIM can be applied, particularly flow problems. Numerical results are given to demonstrate the effectiveness and efficiency. We focus on the structured Preconditioner I given at the end of Section 3.2, and also briefly show the performance of Preconditioner II.

4.1. Preconditioning individual Schur complements in various applications

In this subsection, we show the preconditioning of some individual Schur complements in several applications. In our tests, different types of right-hand sides b are tried, such as random vectors, products of A and given vectors, and those actually arising in the applications. Similar performance is observed. Thus, for convenience in comparing different methods, we report the results with b taken to be the product of A and the vector of all ones. We point out that our preconditioners are not limited to any specific type of right-hand side. For convenience, the following notation is used:

- $\kappa_2(A)$: 2-norm condition number of A;
- $n_{\mathrm{pre\text{-}mv}}$: number of matrix-vector multiplications for computing the preconditioner;
- $n_{\mathrm{it}}$: number of GMRES iterations;
- $\gamma_2 = \|Ag - b\|_2 / \|b\|_2$: relative residual.

Before presenting the details of the individual problems, we give the basic properties of the matrices in Table 1 and summarize the convergence results in Table 2. The matrices include both well-conditioned and ill-conditioned ones. We can observe that GMRES has difficulty converging for all the cases.
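To show how such preconditioned solves can be set up, here is a small usage sketch (our illustration only; `A_op` and `b` stand for the matrix-free Schur complement wrapper and right-hand side from Section 2.2, and `D_blocks` for the diagonal-block generators $D_i$ of H, given as small dense arrays). It builds a Preconditioner II-style block-diagonal preconditioner from LU factors and passes it to SciPy's GMRES; the full Preconditioner I would instead apply the ULV solution.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

def block_diag_preconditioner(D_blocks):
    """Preconditioner II sketch: LU-factor each diagonal block D_i and apply
    the block-diagonal solve as a LinearOperator usable as M in GMRES."""
    factors = [lu_factor(Di) for Di in D_blocks]
    sizes = [Di.shape[0] for Di in D_blocks]
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    def solve(x):
        y = np.empty_like(x, dtype=float)
        for fac, lo, hi in zip(factors, offsets[:-1], offsets[1:]):
            y[lo:hi] = lu_solve(fac, x[lo:hi])
        return y
    n = int(offsets[-1])
    return LinearOperator((n, n), matvec=solve)

# usage sketch:
#   M = block_diag_preconditioner(D_blocks)
#   g, info = gmres(A_op, b, M=M)
```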

In fact, restarted GMRES fails to converge for all the cases except one, where it takes 8794 steps to converge for N = 194. Even non-restarted GMRES needs large numbers of iterations to converge, often close to N. On the other hand, preconditioned GMRES with our structured preconditioner gives quick convergence for all the cases. Both $n_{\mathrm{pre\text{-}mv}}$ and $n_{\mathrm{it}}$ are relatively insensitive to the increase of N.

Table 1: Properties of A, including the corresponding mesh size for the entire domain.

Problem       | Mesh        | N    | κ₂(A)  | ‖A−Aᵀ‖₂ | ‖A−Aᵀ‖₂/‖A‖₂ | ‖AᵀA−AAᵀ‖₂ | ‖AᵀA−AAᵀ‖₂/‖AᵀA‖₂
Traction      | 60 × 60     | 176  | 9.62   | 0.02    | 1.43         | 2.36e−5    | 0.10
              | 120 × 120   | 344  | 9.48   | 0.02    | 1.40         | 2.23e−5    | 0.09
              | 240 × 240   | 680  | 10.09  | 0.02    | 1.41         | 1.98e−5    | 0.09
              | 480 × 480   | 1360 | 13.71  | 0.02    | 1.42         | 1.66e−5    | 0.08
Inextensible  | 128 × 128   | 128  | 1.55e5 | 4.08    | 0.69         | 9.33       | 0.27
              | 256 × 256   | 256  | 1.34e7 | 8.05    | 0.70         | 39.05      | 0.29
              | 512 × 512   | 512  | 2.31e6 | 18.65   | 0.51         | 176.85     | 0.13
Contact       | 128 × 128   | 182  | 6.79e2 | 1.01    | 1.02         | 0.86       | 0.88
              | 256 × 256   | 362  | 9.56e2 | 1.02    | 1.03         | 0.86       | 0.87
              | 512 × 512   | 726  | 4.76e3 | 1.06    | 0.99         | 0.89       | 0.78
Mix Irregular | 64 × 64     | 194  | 3.16e3 | 1.03    | 0.98         | 0.91       | 0.82
              | 128 × 128   | 274  | 3.46e3 | 1.26    | 0.97         | 1.52       | 0.89
              | 256 × 256   | 514  | 1.71e4 | 2.21    | 0.99         | 4.83       | 0.97
              | 512 × 512   | 834  | 2.54e5 | 3.74    | 1.00         | 13.91      | 0.99
              | 1024 × 1024 | 1314 | 8.59e6 | 6.91    | 1.00         | 47.58      | 1.00

Remark 1. Since A corresponds to the interface for the augmented variable, the size N of A is usually not very large. However, the number of variables in the original problem is much larger; see the mesh sizes in Table 1. Even though A may have a small size, the multiplication of A with a vector needs the solution of a much larger equation for $\mathcal{A}$, e.g., with a Poisson solver.

Remark 2. For time dependent problems and multiple right-hand sides, we may use $n_{\mathrm{pre\text{-}mv}}$ matrix-vector multiplications in a precomputation to find a preconditioner, and then apply the preconditioner to multiple time steps and right-hand sides. Thus, it sometimes makes sense to allow $n_{\mathrm{pre\text{-}mv}}$ to be slightly larger so as to reduce $n_{\mathrm{it}}$. This would be beneficial in reducing the total number of matrix-vector products.

Remark 3. The measurements in Table 1 give only a partial way to look into the symmetry/normality of A, and do not necessarily tell the actual symmetry/normality. In practice, we may not know whether A is actually close to symmetric or not. For some of the examples, the eigenvalues of A are scattered around (say, in a disk) and are not close to the real axis. In general, the structure of the Schur complement matrix depends on the application and the representation of the interface. For our test examples, it is hard to tell whether the matrix is eventually close to symmetric or not, but the preconditioner works well.

Remark 4. The reasons why we are not reporting results for restarted GMRES are as follows. First, the numbers of iterations in our tests are usually small.

Table 2: Summary of the convergence of GMRES without preconditioning and with our structured Preconditioner I, where the compression tolerance τ is around 0.1-0.8 in constructing the preconditioner.

Problem       | Restarted GMRES: n_it (outer; inner) | γ₂      | Non-restarted GMRES: n_it | γ₂      | Precond. GMRES (non-restarted): n_pre-mv | n_it | γ₂
Traction      | 91750 (1835; 50)  | Fail    | 175  | 1.43e−4 | 63  | 12 | 3.73e−7
              | 93650 (1873; 50)  | Fail    | 343  | 1.34e−5 | 88  | 11 | 5.83e−7
              | 9500 (190; 50)    | Fail    | 672  | 3.11e−6 | 85  | 12 | 5.14e−7
              | 3300 (66; 50)     | Fail    | 1331 | 1.71e−6 | 106 | 13 | 9.01e−7
Inextensible  | 55750 (1115; 50)  | Fail    | 127  | 1.63e−4 | 73  | 26 | 7.90e−7
              | 30200 (604; 50)   | Fail    | 250  | 8.91e−7 | 92  | 42 | 7.96e−8
              | 100000 (2000; 50) | Fail    | 492  | 4.63e−7 | 110 | 68 | 2.94e−7
Contact       | 49250 (985; 50)   | Fail    | 181  | 6.98e−6 | 71  | 21 | 3.47e−7
              | 81500 (1630; 50)  | Fail    | 361  | 5.20e−7 | 83  | 30 | 5.47e−7
              | 3150 (63; 50)     | Fail    | 725  | 1.71e−6 | 99  | 48 | 8.01e−7
Mix Irregular | 8794 (176; 50)    | 1.00e−6 | 118  | 7.03e−7 | 68  | 22 | 7.82e−7
              | 14150 (283; 50)   | Fail    | 156  | 8.14e−7 | 79  | 39 | 2.47e−7
              | 95200 (1904; 50)  | Fail    | 328  | 9.48e−7 | 81  | 64 | 7.61e−7
              | 7300 (146; 50)    | Fail    | 633  | 9.18e−7 | 102 | 84 | 9.60e−7
              | 33950 (679; 50)   | Fail    | 1045 | 9.36e−7 | 105 | 98 | 9.09e−7

Next, if restart is used, the convergence is slower than without restart, depending on the number of inner iterations. However, in AIIM with preconditioning, it seems preferable to store a few more intermediate iterates (as in non-restarted GMRES) so as to accelerate the convergence. This is because the sizes of the original discretized PDEs are much larger than the sizes of the Schur complement matrices A. It is usually cheap to store some small intermediate iterates in GMRES, but it is costly to compute matrix-vector products. Therefore, reducing the total number of matrix-vector products is more crucial than saving a small amount of storage for the GMRES iterates. Of course, for large-scale practical computations, we may choose to balance the convergence and the storage via restarts with appropriate numbers of inner iterations.

We now explain the individual applications in Tables 1-2 and show additional specific tests in the following subsections.

4.1.1. Navier-Stokes equations on irregular domains with open and traction boundary conditions

This application is one of our motivations to develop the preconditioner. A detailed description of the problem and its solution with AIIM can be found in [20]. Let R be a rectangular domain with an inclusion Ω. Consider the Navier-Stokes equations
$$\rho\left(\frac{\partial u}{\partial t} + (u\cdot\nabla)u\right) + \nabla p = \mu\Delta u + g, \qquad \nabla\cdot u = 0, \qquad x\in R\setminus\Omega, \qquad (15)$$
where ρ, μ, u, p, and g(x, y, t) are the fluid density, the viscosity, the fluid velocity, the pressure, and an external forcing term, respectively. There is a traction boundary condition [29] along the interior boundary $\partial\Omega$:

$$n^T\,\mu(\nabla u + \nabla u^T)\,n = p - \hat p - \gamma\xi, \qquad \eta^T\,\mu(\nabla u + \nabla u^T)\,n = 0, \qquad x\in\partial\Omega, \qquad (16)$$
where ξ is the curvature, γ is the coefficient of the surface tension, η is the unit tangential direction of the interface, and $\hat p$ is the pressure of the air.

During the numerical solution, the procedure to advance from a time step $t_k$ to the next one $t_{k+1}$ includes two steps [20]. The first is to solve (15) with (16) for the velocity using AIIM. The second is to solve the momentum equation for the pressure p. The augmented variable is $q^{k+1} \equiv \partial u^{k+1}/\partial n$, which is determined so that $u^{k+1}$ satisfies (16) with an approximation to $p^{k+1}$.

In a specific example, set the boundary $\partial\Omega$ to be a particular curve such as a circle defined by a function φ(x, y). Let $\varphi_{i,j}$ be the discrete values of φ(x, y). A grid point $x_{ij}$ is irregular if
$$\max\{\varphi_{i-1,j}, \varphi_{i+1,j}, \varphi_{i,j}, \varphi_{i,j-1}, \varphi_{i,j+1}\}\cdot \min\{\varphi_{i-1,j}, \varphi_{i+1,j}, \varphi_{i,j}, \varphi_{i,j-1}, \varphi_{i,j+1}\} \le 0.$$
The Schur complement system (8) for the augmented variable is defined at the orthogonal projections of such irregular $x_{ij}$ onto $\partial\Omega$ from the outside. If the time step size $\Delta t = t_{k+1} - t_k$ is fixed, then so is A. (8) is solved by GMRES with our preconditioning techniques.

Preconditioned GMRES converges quickly. See the results in the Traction rows of Table 2, where we use time t = 0 and μ = 0.02. Note that the number of matrix-vector products for constructing the preconditioner and the number of iterations increase very slowly with N. More specifically, Figure 3 shows the costs (excluding the matrix-vector multiplication, which is problem dependent) and the storage. For varying N, we use a low accuracy τ = 0.4 and a small off-diagonal rank. Then the construction of the preconditioner (including the HSS construction and the ULV factorization) and the application of the preconditioner (the ULV solution) both cost about O(N) flops and need about O(N) storage. Such costs are negligible as compared with even one matrix-vector product with A, which costs about $O(N^2\log N)$.

Figure 3: Costs for precomputing or constructing the preconditioner (including HSS construction and ULV factorization) and applying the preconditioner (HSS solution), and the storage for the preconditioner. (i) Costs of the algorithms; (ii) storage for the preconditioner.

In particular, for the example with N = 680 in Figure 1 in Section 2.3, the convergence of preconditioned GMRES is significantly faster than that of the standard non-restarted GMRES method.
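As a tiny illustration of the irregular-point test above (our sketch; `phi` is assumed to be a 2-D array of the discrete level-set values $\varphi_{i,j}$, with (i, j) an interior index), the sign-change criterion reads:

```python
import numpy as np

def is_irregular(phi, i, j):
    """Irregular-point test of Section 4.1.1: the level set phi changes sign
    within the five-point stencil centered at (i, j)."""
    stencil = np.array([phi[i - 1, j], phi[i + 1, j], phi[i, j],
                        phi[i, j - 1], phi[i, j + 1]])
    return stencil.max() * stencil.min() <= 0.0
```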