A taxonomy and comparison of parallel block multi-level preconditioners for the incompressible Navier-Stokes equations


Available online at www.sciencedirect.com. Journal of Computational Physics 227 (2008) 1790-1808. www.elsevier.

Howard Elman a,*,1, V.E. Howle b, John Shadid c, Robert Shuttleworth d, Ray Tuminaro e

a Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, United States
b Sandia National Laboratories, P.O. Box 969, MS 9159, Livermore, CA 94551, United States
c Sandia National Laboratories, P.O. Box 5800, MS 1111, Albuquerque, NM 87185, United States
d Applied Mathematics and Scientific Computing Program and Center for Scientific Computation and Mathematical Modeling, University of Maryland, College Park, MD 20742, United States
e Sandia National Laboratories, P.O. Box 969, MS 9159, Livermore, CA 94551, United States

Received 11 April 2007; received in revised form 14 September 2007; accepted 27 September 2007
Available online 1 October 2007

Abstract

In recent years, considerable effort has been placed on developing efficient and robust solution algorithms for the incompressible Navier-Stokes equations based on preconditioned Krylov methods. These include physics-based methods, such as SIMPLE, and purely algebraic preconditioners based on the approximation of the Schur complement. All these techniques can be represented as approximate block factorization (ABF) type preconditioners. The goal is to decompose the application of the preconditioner into simplified sub-systems in which scalable multi-level type solvers can be applied. In this paper we develop a taxonomy of these ideas based on an adaptation of a generalized approximate factorization of the Navier-Stokes system first presented in [A. Quarteroni, F. Saleri, A. Veneziani, Factorization methods for the numerical approximation of Navier-Stokes equations, Computational Methods in Applied Mechanical Engineering 188 (2000) ].
This taxonomy illuminates the similarities and differences among these preconditioners and the central role played by efficient approximation of certain Schur complement operators. We then present a parallel computational study that examines the performance of these methods and compares them to an additive Schwarz domain decomposition (DD) algorithm. Results are presented for two and three-dimensional steady state problems for enclosed domains and inflow/outflow systems on both structured and unstructured meshes. The numerical experiments are performed using MPSalsa, a stabilized finite element code.
© 2007 Elsevier Inc. All rights reserved.

This work was partially supported by the DOE Office of Science MICS Program and by the ASC Program at Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under Contract DE-AC04-94AL
* Corresponding author. E-mail addresses: elman@cs.umd.edu (H. Elman), vehowle@sandia.gov (V.E. Howle), jnshadi@cs.sandia.gov (J. Shadid), rshuttle@math.umd.edu (R. Shuttleworth), rstumin@sandia.gov (R. Tuminaro).
1 The work of this author was supported by the Department of Energy under Grant DEFG004ER
doi: /j.jc

Keywords: Incompressible flow; Navier-Stokes; Iterative methods

1. Introduction

Current leading-edge engineering and scientific flow simulations often entail complex two and three-dimensional geometries with high resolution unstructured meshes to capture all the relevant length scales of interest. After suitable discretization and linearization these simulations can produce linear systems of equations with a very large number of unknowns. As a result, efficient and scalable parallel iterative solution methods are required. We consider solution methods for the incompressible Navier-Stokes equations where the equations below represent conservation of momentum and mass, and the constitutive equation for the Newtonian stress tensor,
\[
\begin{aligned}
\text{Momentum:}\quad & \rho\,(u \cdot \nabla) u = \nabla \cdot T + \rho g, \\
\text{Mass:}\quad & \nabla \cdot u = 0, \\
\text{Stress tensor:}\quad & T = -P I + \mu\,(\nabla u + \nabla u^T)
\end{aligned}
\tag{1}
\]
in $\Omega \subset \mathbb{R}^d$ ($d = 2$ or 3). Here the velocity, $u$, satisfies suitable boundary conditions on $\partial\Omega$, $P$ represents the hydrodynamic pressure, $\rho$ the density, $\mu$ the dynamic viscosity, and $g$ the body forces.

We focus on solution algorithms for the algebraic system of equations that result from linearization and discretization of these equations. The coefficient matrices have the general form
\[
A = \begin{pmatrix} F & B^T \\ \hat B & -C \end{pmatrix}. \tag{2}
\]
The strategies we employ for solving (2) are derived from the LDU block factorization of this coefficient matrix,
\[
A = \begin{pmatrix} I & 0 \\ \hat B F^{-1} & I \end{pmatrix}
\begin{pmatrix} F & 0 \\ 0 & -S \end{pmatrix}
\begin{pmatrix} I & F^{-1} B^T \\ 0 & I \end{pmatrix}, \tag{3}
\]
where
\[
S = C + \hat B F^{-1} B^T \tag{4}
\]
is the Schur complement (of $F$ in $A$). They require methods for approximating the action of the inverse of the factors of (3), which, in particular, requires approximation to the actions of $F^{-1}$ and $S^{-1}$.

For large-scale computations, use of the exact Schur complement is not feasible. Therefore effective approximate block factorization (ABF) preconditioners are often based on a careful consideration of the spectral properties of the component block operators and the approximate Schur complement operators. There has been a great deal of recent work on ABF methods (e.g.
[3,4,6,8,17]). These techniques take a purely linear algebraic view of preconditioning. Through these decompositions a simplified system of block component equations is developed that encodes a specific physics-based decomposition. Alternatively, one could start with physics-based iterative solution methods for the Navier-Stokes equations (e.g. [1,]) and develop preconditioners based on these techniques as described in [18]. In both these cases, the system has been transformed by the factorization into component systems that are essentially convection-diffusion and Poisson type operators. The result is a system to which multi-level methods, and in our particular case, algebraic multi-level methods (AMG) can be applied successfully for parallel unstructured mesh simulations.

In this paper, we include preconditioners developed from iterative solution strategies based on pressure correction methods, like SIMPLE, proposed by Patankar and Spalding [] and previously studied as a preconditioner by Pernice and Tocci in [3]. We interpret these methods in the context of our taxonomy and compare them with some new approximate commutator methods. These techniques are based on the approximation

of the Schur complement operator by a technique proposed by Kay et al. [17], Silvester et al. [9], and Elman et al. [10].

The paper is organized as follows. Section 2 gives a brief description of the Newton iteration and provides an overview of the discretization and resulting coefficient matrix used for our numerical experiments. Section 3 presents a taxonomy for classifying the approximate block preconditioners. Section 4 provides a brief overview of the parallel implementation of the nonlinear and linear solvers. Details of the numerical experiments and the results of these experiments are described in Section 5. Concluding remarks are provided in Section 6.

2. Background

Our focus is on solution algorithms for the systems of equations that arise after discretization and linearization of the system (1). A nonlinear iteration based on an inexact Newton-Krylov method is used to solve this problem. If the nonlinear problem to be solved is written as $G(x) = 0$, where $G : \mathbb{R}^n \to \mathbb{R}^n$, then at the $k$th step of Newton's method, the solution of the linear Newton equation,
\[
J(x_k)\, s_k = -g(x_k) \tag{5}
\]
is required, where $x_k$ is the current solution and $J(x_k)$ denotes the Jacobian matrix of $G$ at $x_k$. Once the Newton update, $s_k$, is determined, the current approximation is updated via
\[
x_{k+1} = x_k + s_k.
\]
Newton-Krylov methods [7] relax the requirement to compute an exact solution to (5). Instead, a Krylov subspace method, such as GMRES, is applied until an iterate $s_k$ is found that satisfies the inexact Newton condition,
\[
\| g(x_k) + J(x_k)\, s_k \| \le \eta_k \| g(x_k) \|, \tag{6}
\]
where $\eta_k \in [0, 1]$ is a tolerance. If $\eta_k = 0$, this is an exact Newton method. For a discussion of the merits of different choices of $\eta_k$, see [7]. For this computational study, $\eta_k$ is chosen to be a constant and our attention is focused on preconditioning methods for use with GMRES solving for the Newton update.
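The inexact Newton condition (6) can be sketched in a few lines. The following is a minimal illustration, not the solver used in the paper: the inner iteration is a plain steepest-descent method on the normal equations, standing in for GMRES, and the function names, the test problem, and the default parameters are all assumptions made for this example.

```python
import numpy as np

def inexact_newton(G, J, x0, eta=0.1, tol=1e-10, max_newton=50, max_inner=10_000):
    """Inexact Newton iteration with a constant forcing term eta (hypothetical helper).

    The inner loop is a stand-in for a Krylov method: it only has to produce a
    step s with ||g + J s|| <= eta * ||g||, which is condition (6).
    """
    x = np.asarray(x0, dtype=float)
    for k in range(max_newton):
        g = G(x)
        g_norm = np.linalg.norm(g)
        if g_norm <= tol:
            return x, k
        Jk = J(x)
        s = np.zeros_like(x)
        r = -g - Jk @ s                      # residual of the Newton equation J s = -g
        for _ in range(max_inner):
            if np.linalg.norm(r) <= eta * g_norm:
                break                        # inexact Newton condition satisfied
            d = Jk.T @ r                     # steepest-descent direction for ||r||^2
            Jd = Jk @ d
            alpha = (d @ d) / (Jd @ Jd)
            s += alpha * d
            r -= alpha * Jd
        x = x + s
    return x, max_newton

# Small illustrative problem: intersect the circle x0^2 + x1^2 = 4 with the line x0 = x1.
G = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 4.0, x[0] - x[1]])
J = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])
x_star, n_steps = inexact_newton(G, J, x0=[2.0, 1.0])
```

Setting `eta=0.0` would force the inner loop to solve each Newton equation to rounding error, which corresponds to the exact Newton method mentioned above.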
For the discrete Navier-Stokes equations, the Jacobian system at the $k$th step that arises from Newton's method is
\[
\begin{pmatrix} F & B^T \\ \hat B & -C \end{pmatrix}
\begin{pmatrix} \Delta u_k \\ \Delta p_k \end{pmatrix}
=
\begin{pmatrix} g_u^k \\ g_p^k \end{pmatrix}, \tag{7}
\]
where $F$ is a convection-diffusion-like operator, $B^T$ is the gradient operator, $\hat B$ is the divergence operator that for some higher-order stabilized formulations can include a contribution from non-zero higher-order derivative operators in the stabilized formulation [5], and $C$ is the operator that stabilizes the finite element discretization. The right-hand side vector, $(g_u, g_p)^T$, contains, respectively, the nonlinear residual for the momentum and continuity equations. This Newton procedure starts with some initial iterate $u_0$ for the velocities, $p_0$ for the pressure; then updates for the velocities and pressures are computed by solving the Newton equations (7). Further details on the particular discretization used in our experiments can be found in Section 4.

All of the methods we describe in Section 3 generate some approximation, $\tilde A$, to the Jacobian system found in (5). Some of the methods considered in this paper have been traditionally derived as stationary iterative solvers, and we use this mode of description in parts of the paper. That is, for
\[
A u = f \tag{8}
\]
stationary iterations have the form
\[
u_{n+1} = u_n + \tilde A^{-1} (f - A u_n),
\]
where $\tilde A$ is a Jacobian approximation, $A = \tilde A - E$, and $u_n$ is the approximate solution at the $n$th iteration. All of our experiments use the splitting operators as preconditioners in a Krylov subspace method.
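The stationary iteration for (8) can be written as a short loop. This is a minimal sketch: the approximation used here is simply the diagonal of the matrix (a Jacobi splitting), chosen only because it is easy to invert, and the helper name and the small test matrix are assumptions for illustration rather than anything taken from the paper.

```python
import numpy as np

def stationary_iteration(A, f, apply_approx_inverse, u0, n_iter):
    """Run u <- u + Atilde^{-1} (f - A u) for n_iter sweeps (hypothetical helper).

    apply_approx_inverse(r) must return Atilde^{-1} r for the chosen splitting.
    """
    u = np.array(u0, dtype=float)
    for _ in range(n_iter):
        u = u + apply_approx_inverse(f - A @ u)
    return u

# Illustrative data: a diagonally dominant matrix, so the Jacobi splitting converges.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
f = np.array([1.0, 2.0, 3.0])
diag_A = np.diag(A)

# Atilde = diag(A); applying its inverse is an entrywise division.
u = stationary_iteration(A, f, lambda r: r / diag_A, np.zeros(3), n_iter=60)
```

The same `apply_approx_inverse` callable is what a Krylov method would call as its preconditioner, which is how the splitting operators are used in the experiments described here.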

3. Taxonomy of approximate block factorization preconditioners

We adopt a nomenclature developed by Quarteroni et al. [5] for algebraic splittings of $A$ for projection type methods. Let $H_1$ represent an approximation to $F^{-1}$ in the Schur complement (4) and let $H_2$ be an approximation to $F^{-1}$ in the upper triangular block of the factorization (3). This results in the following decomposition:
\[
A_s = \begin{pmatrix} I & 0 \\ \hat B F^{-1} & I \end{pmatrix}
\begin{pmatrix} F & 0 \\ 0 & -(C + \hat B H_1 B^T) \end{pmatrix}
\begin{pmatrix} I & H_2 B^T \\ 0 & I \end{pmatrix}
= \begin{pmatrix} F & F H_2 B^T \\ \hat B & -(C + \hat B (H_1 - H_2) B^T) \end{pmatrix}. \tag{9}
\]
The error matrix, $E_s = A - A_s$, is
\[
E_s = \begin{pmatrix} 0 & (I - F H_2) B^T \\ 0 & \hat B (H_1 - H_2) B^T \end{pmatrix}.
\]
This decomposition is used in [5] to illuminate the structure of several projection techniques for solving the time-dependent Navier-Stokes equations. By examining the error, we can determine which equation (momentum or continuity) in the original problem is perturbed by the approximations $H_1$ or $H_2$. For example, if $H_1 = F^{-1}$ and $H_1 \ne H_2$, then the operators applied to the pressure in both the momentum equation and continuity equation are perturbed, whereas operators applied to the velocity are not perturbed. On the other hand, if $H_2 = F^{-1}$ and $H_1 \ne H_2$, then the (1,2) block of the error matrix is zero. So, the momentum equation is unperturbed, thus giving a momentum preserving strategy, whereas a perturbation of the incompressibility constraint occurs [5]. If $H_1 = H_2 \ne F^{-1}$, then the scheme is mass preserving because the (2,2) block of the error matrix is zero, so the continuity equation is not modified. Finally, if $H_1 \ne H_2 \ne F^{-1}$, then both the momentum and continuity equations are modified.

The above factorization can be generalized to incorporate classical methods used for these problems such as SIMPLE, SIMPLEC, SIMPLER [,3], as well as newer approximate commutator methods devised to generate good approximations to the Schur complement [17,9]. Let us modify (9) using some approximation $H_1$ in place of $F^{-1}$ in the lower triangular block. In addition, let $\hat S$ represent an approximation of the Schur complement.
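This gives
\[
\tilde A = \begin{pmatrix} I & 0 \\ \hat B H_1 & I \end{pmatrix}
\begin{pmatrix} F & 0 \\ 0 & -\hat S \end{pmatrix}
\begin{pmatrix} I & H_2 B^T \\ 0 & I \end{pmatrix}
= \begin{pmatrix} F & F H_2 B^T \\ \hat B H_1 F & \hat B H_1 F H_2 B^T - \hat S \end{pmatrix}. \tag{10}
\]
The error, denoted $\tilde E = A - \tilde A$, is
\[
\tilde E = \begin{pmatrix} 0 & B^T - F H_2 B^T \\ \hat B - \hat B H_1 F & \hat S - (C + \hat B H_1 F H_2 B^T) \end{pmatrix}.
\]
Techniques explored in this study can be classified into two categories: those whose factorization groups the lower triangular and the diagonal components as [(LD)U], and those that group the diagonal and upper triangular components as [L(DU)]. Methods with the (LD)U grouping have the factorization
\[
\tilde A_{(LD)U} = \begin{pmatrix} F & 0 \\ \hat B H_1 F & -\hat S \end{pmatrix}
\begin{pmatrix} I & H_2 B^T \\ 0 & I \end{pmatrix}. \tag{11}
\]
Methods with the L(DU) grouping have the factorization
\[
\tilde A_{L(DU)} = \begin{pmatrix} I & 0 \\ \hat B H_1 & I \end{pmatrix}
\begin{pmatrix} F & F H_2 B^T \\ 0 & -\hat S \end{pmatrix}. \tag{12}
\]
Some of the techniques considered do not use the complete factorization (11) or (12), but rather use only triangular components of the factorization. SIMPLE uses the block (LD)U grouping. The approximate commutator methods are derived from the block L(DU) grouping and use just the diagonal and upper triangular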

(DU) components in the method. Finally, these classifications are further refined by specifying strategies for approximating the Schur complement.

3.1. Pressure correction

The pressure correction family of Navier-Stokes preconditioners is derived from the divergence free constraint with decoupling of the incompressible Navier-Stokes equations. In the following sections, three pressure correction methods are derived, SIMPLE (Semi-Implicit Method for Pressure Linked Equations), SIMPLEC, and SIMPLER (Semi-Implicit Method for Pressure Linked Equations Revised) [ 4].

3.1.1. The SIMPLE preconditioner

The SIMPLE-like algorithm described here begins by solving a variant of the momentum equation for an intermediate velocity using a previously generated pressure; then the continuity equation is solved using the intermediate velocity to calculate the pressure update. This value is used to update the velocity component. The SIMPLE algorithm expressed as a stationary iteration is as follows:

1. Solve: $F u_{n+1/2} = f - B^T p_n$ for the velocity, $u_{n+1/2}$.
2. Solve: $-(C + \hat B\,\mathrm{diag}(F)^{-1} B^T)\,\delta p = -\hat B u_{n+1/2} + C p_n$ for $\delta p$.
3. Calculate the velocity correction: $\delta u = u_{n+1} - u_{n+1/2} = -\mathrm{diag}(F)^{-1} B^T \delta p$.
4. Update the pressure: $p_{n+1} = p_n + \alpha\,\delta p$.
5. Update the velocity: $u_{n+1} = u_{n+1/2} + \delta u$.

The quantity $\alpha$ is a parameter in (0, 1] that damps the pressure update. An alternative derivation is obtained using the LDU framework described above. The block lower triangular factor (L) and the block diagonal (D) are grouped together. In terms of the taxonomy described above, this corresponds to the choices $H_1 = F^{-1}$, $H_2 = (\mathrm{diag}(F))^{-1}$, and $\hat S = C + \hat B (\mathrm{diag}(F))^{-1} B^T$ in (11). The decomposition is
\[
\begin{pmatrix} F & B^T \\ \hat B & -C \end{pmatrix}
\approx
\begin{pmatrix} I & 0 \\ \hat B F^{-1} & I \end{pmatrix}
\begin{pmatrix} F & 0 \\ 0 & -\hat S \end{pmatrix}
\begin{pmatrix} I & \frac{1}{\alpha} (\mathrm{diag}(F))^{-1} B^T \\ 0 & \frac{1}{\alpha} I \end{pmatrix}
=
\begin{pmatrix} F & 0 \\ \hat B & -\hat S \end{pmatrix}
\begin{pmatrix} I & \frac{1}{\alpha} (\mathrm{diag}(F))^{-1} B^T \\ 0 & \frac{1}{\alpha} I \end{pmatrix}
= \tilde A_{\mathrm{SIMPLE}}.
\]
Thus, one iteration of SIMPLE corresponds to
\[
\begin{pmatrix} u_{n+1} \\ p_{n+1} \end{pmatrix}
=
\begin{pmatrix} u_n \\ p_n \end{pmatrix}
+ \tilde A_{\mathrm{SIMPLE}}^{-1}
\left( \begin{pmatrix} f \\ 0 \end{pmatrix} - A \begin{pmatrix} u_n \\ p_n \end{pmatrix} \right),
\]
where $A$ is defined in (2).
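The five SIMPLE steps and the block (LD)U product can be checked against each other numerically. The sketch below is illustrative only: dense solves stand in for the multigrid solves used in practice, the random matrices are hypothetical test data, and the placement of $1/\alpha$ in both entries of the upper triangular factor is the choice that reproduces steps 3 and 4 as listed above, which is an assumption about the intended matrix form.

```python
import numpy as np

def simple_step(F, B, Bhat, C, f, u, p, alpha=1.0):
    """One SIMPLE sweep; u is unused because step 1 depends only on f and p."""
    d_inv = 1.0 / np.diag(F)                           # diag(F)^{-1}, stored as a vector
    S_hat = C + Bhat @ (d_inv[:, None] * B.T)          # C + Bhat diag(F)^{-1} B^T
    u_half = np.linalg.solve(F, f - B.T @ p)                   # step 1
    dp = np.linalg.solve(-S_hat, -Bhat @ u_half + C @ p)       # step 2
    du = -d_inv * (B.T @ dp)                                   # step 3
    return u_half + du, p + alpha * dp                         # steps 5 and 4

def simple_matrix(F, B, Bhat, C, alpha=1.0):
    """Assemble the (LD)U product whose inverse reproduces simple_step."""
    n_u, n_p = F.shape[0], C.shape[0]
    d_inv = 1.0 / np.diag(F)
    S_hat = C + Bhat @ (d_inv[:, None] * B.T)
    LD = np.block([[F, np.zeros((n_u, n_p))], [Bhat, -S_hat]])
    U = np.block([[np.eye(n_u), (d_inv[:, None] * B.T) / alpha],
                  [np.zeros((n_p, n_u)), np.eye(n_p) / alpha]])
    return LD @ U

# Hypothetical small test problem (not a discretized flow problem).
rng = np.random.default_rng(0)
n_u, n_p = 6, 3
F = 4.0 * np.eye(n_u) + 0.5 * rng.random((n_u, n_u))   # diagonally dominant
B = rng.random((n_p, n_u))
Bhat = B.copy()
C = 0.1 * np.eye(n_p)
A = np.block([[F, B.T], [Bhat, -C]])
f = rng.random(n_u)
u0, p0 = rng.random(n_u), rng.random(n_p)

u1, p1 = simple_step(F, B, Bhat, C, f, u0, p0, alpha=0.7)
x0 = np.concatenate([u0, p0])
rhs = np.concatenate([f, np.zeros(n_p)])
x1 = x0 + np.linalg.solve(simple_matrix(F, B, Bhat, C, alpha=0.7), rhs - A @ x0)
E = A - simple_matrix(F, B, Bhat, C, alpha=1.0)        # error matrix for alpha = 1
```

With these definitions the stationary update `x1` agrees with the step-by-step result `(u1, p1)`, and for `alpha=1.0` only the (1,2) block of `E` is nonzero.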
The error for this method (when $\alpha = 1$) is
\[
E_{\mathrm{SIMPLE}} = A - \tilde A_{\mathrm{SIMPLE}}
= \begin{pmatrix} 0 & B^T - F (\mathrm{diag}(F))^{-1} B^T \\ 0 & 0 \end{pmatrix}.
\]
SIMPLE does not affect the terms that operate on the velocity, but it perturbs the pressure operator in the momentum equation. This results in a method that is mass preserving. When $\mathrm{diag}(F)^{-1}$ is a good approximation to $F^{-1}$, then $E_{\mathrm{SIMPLE}}$ is close to a zero matrix, so this method generates a very close approximation to the original Jacobian system. From our computational experiments in Section 5, we have found that the diagonal approximation can yield poor results because it does not capture enough information about the convection operator.

3.1.2. The SIMPLEC preconditioner

The SIMPLEC algorithm is a variant of SIMPLE [3]. It replaces the diagonal approximation of the inverse of $F$ with the inverse of the diagonal matrix whose entries are the absolute values of the row sums of $F$. The matrix structure is the same (LD)U as that of SIMPLE. The symbol $\sum(|F|)$ denotes a matrix whose entries are equal to the absolute value of the row sum of $F$. With the choices $H_1 = F^{-1}$, $H_2 = (\sum |F|)^{-1}$, and

$\hat S = C + \hat B (\sum |F|)^{-1} B^T$, the SIMPLEC method can be expressed in terms of the block factorization (11). The decomposition is
\[
\begin{pmatrix} F & B^T \\ \hat B & -C \end{pmatrix}
\approx
\begin{pmatrix} F & 0 \\ \hat B & -\hat S \end{pmatrix}
\begin{pmatrix} I & \frac{1}{\alpha} (\sum |F|)^{-1} B^T \\ 0 & \frac{1}{\alpha} I \end{pmatrix}
= \tilde A_{\mathrm{SIMPLEC}},
\]
where $\hat S = C + \hat B (\sum |F|)^{-1} B^T$. The error for this method (when $\alpha = 1$) is
\[
E_{\mathrm{SIMPLEC}} = A - \tilde A_{\mathrm{SIMPLEC}}
= \begin{pmatrix} 0 & B^T - F (\sum |F|)^{-1} B^T \\ 0 & -\hat B (\sum |F|)^{-1} B^T + \hat B F^{-1} B^T \end{pmatrix}.
\]
This method perturbs the pressure operator in both the momentum and continuity equations. The choice of the absolute value of the row sum tends to provide a better approximation to the matrix $F$, therefore reducing the error associated with this method [3]. We have found that this choice works reasonably well and is easy to construct. Further variations of this class of methods can be determined by choosing different approximations to $F^{-1}$, such as sparse approximate inverses. For our computational results in Section 5, we use the absolute value of the row sum.

3.1.3. The SIMPLER preconditioner

The SIMPLER algorithm is very similar to SIMPLE, except that it first determines $\hat p_{n+1}$ using $u_n$, then it calculates an intermediate velocity value, $u_{n+1/2}$. This intermediate velocity is projected to enforce the continuity equation, which determines $u_{n+1}$. The steps required are as follows:

1. Solve: $(C + \hat B\,\mathrm{diag}(F)^{-1} B^T)\,\hat p_{n+1} = \hat B\,\mathrm{diag}(F)^{-1} (f + F u_n - B^T p_n)$ for the pressure, $\hat p_{n+1}$.
2. Solve: $F u_{n+1/2} = f - B^T (\hat p_{n+1} - p_n)$ for the velocity, $u_{n+1/2}$.
3. Project $u_{n+1/2}$ to obtain $u_{n+1} = [\,I + \mathrm{diag}(F)^{-1} \hat B\,(C + \hat B\,\mathrm{diag}(F)^{-1} B^T)^{-1} B^T\,]\,u_{n+1/2}$.
4. Update the pressure: $p_{n+1} = \alpha\,\hat p_{n+1}$.

Once again, $\alpha$ is a parameter in (0, 1] that damps the pressure update. SIMPLER can also be expressed using the LDU framework. The block diagonal (D) and the block upper triangular (U) factors are grouped together and an additional matrix, $P$, a projection matrix for the velocity projection in step 3, is added to the factorization. In terms of the taxonomy, this corresponds to the choices of $H_1 = \mathrm{diag}(F)^{-1}$, $H_2 = F^{-1}$, and $\hat S = C + \hat B\,\mathrm{diag}(F)^{-1} B^T$ in (12).
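Then
\[
\begin{pmatrix} F & B^T \\ \hat B & -C \end{pmatrix}
=
\begin{pmatrix} I & 0 \\ \hat B F^{-1} & I \end{pmatrix}
\begin{pmatrix} F & B^T \\ 0 & -S \end{pmatrix}
\approx
\begin{pmatrix} I & 0 \\ \hat B (\mathrm{diag}(F))^{-1} & I \end{pmatrix}
\begin{pmatrix} F & B^T \\ 0 & -\hat S \end{pmatrix},
\]
where $\hat S = C + \hat B (\mathrm{diag}(F))^{-1} B^T$. Now, the projection matrix is added to give the SIMPLER algorithm in matrix form. This results in
\[
\tilde A_{\mathrm{SIMPLER}} =
\begin{pmatrix} I + (\mathrm{diag}(F))^{-1} \hat B \hat S^{-1} B^T & 0 \\ 0 & \alpha I \end{pmatrix}
\begin{pmatrix} I & 0 \\ \hat B (\mathrm{diag}(F))^{-1} & I \end{pmatrix}
\begin{pmatrix} F & B^T \\ 0 & -\hat S \end{pmatrix} \tag{13}
\]
[3]. Thus, one iteration of SIMPLER corresponds to
\[
\begin{pmatrix} u_{n+1} \\ p_{n+1} \end{pmatrix}
=
\begin{pmatrix} u_n \\ p_n \end{pmatrix}
+ \tilde A_{\mathrm{SIMPLER}}^{-1}
\left( \begin{pmatrix} f \\ 0 \end{pmatrix} - A \begin{pmatrix} u_n \\ p_n \end{pmatrix} \right),
\]
where $A$ is defined in (2) and $\tilde A_{\mathrm{SIMPLER}}$ is defined in (13). The use of the projection matrix, which has subsidiary solves that must be performed to very high accuracy, greatly degrades the performance of this method when compared to SIMPLE. However, the projection matrix is needed to enforce the continuity equation, and therefore produce a solution that is divergence free [3]. This method perturbs the pressure operator in both the momentum and continuity equations.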

3.1.4. Remarks on pressure correction methods

In this section, the pressure correction methods (SIMPLE/SIMPLEC) that begin with the underlying factorization (LD)U and use approximations to the components of the factors to define the preconditioner have been given. SIMPLER is based on the decomposition L(DU) with approximations to $P^{-1} (DU)^{-1} L^{-1}$ as the preconditioner. These methods are useful for steady-state flow problems. However, these methods tend to converge slowly and require the user to input a relaxation parameter to improve convergence.

3.2. Approximate commutator methods

The pressure convection-diffusion preconditioners group together the diagonal and upper triangular factors and omit the lower triangular factor. Let $H_1 = H_2 = F^{-1}$. Then the block factorization of the coefficient matrix is
\[
\begin{pmatrix} F & B^T \\ \hat B & -C \end{pmatrix}
=
\begin{pmatrix} I & 0 \\ \hat B H_1 & I \end{pmatrix}
\begin{pmatrix} F & F H_2 B^T \\ 0 & -S \end{pmatrix}
=
\begin{pmatrix} I & 0 \\ \hat B F^{-1} & I \end{pmatrix}
\begin{pmatrix} F & B^T \\ 0 & -S \end{pmatrix}, \tag{14}
\]
where the diagonal (D) and upper triangular (U) factors are grouped together. For our computations, we only use the upper triangular factor, and replace the Schur complement $S$ by some approximation $\hat S$ (to be specified later). The efficacy of this strategy can be seen by analyzing the following generalized eigenvalue problem:
\[
\begin{pmatrix} F & B^T \\ \hat B & -C \end{pmatrix}
\begin{pmatrix} u \\ p \end{pmatrix}
= \lambda
\begin{pmatrix} F & B^T \\ 0 & -\hat S \end{pmatrix}
\begin{pmatrix} u \\ p \end{pmatrix}.
\]
If $\hat S$ is the Schur complement, then all the eigenvalues of the preconditioned matrix are identically one. This operator contains Jordan blocks of dimension at most 2, and consequently at most two iterations of a preconditioned GMRES iteration would be needed to solve the system [0].

We motivate the approximate commutator methods by examining the computational issues associated with applying this preconditioner, denoted $Q$, in a Krylov subspace iteration. At each step, the application of $Q^{-1}$ to a vector is needed. By expressing this operation in factored form,
\[
\begin{pmatrix} F & B^T \\ 0 & -S \end{pmatrix}^{-1}
=
\begin{pmatrix} F^{-1} & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} I & -B^T \\ 0 & I \end{pmatrix}
\begin{pmatrix} I & 0 \\ 0 & -S^{-1} \end{pmatrix},
\]
two potentially difficult operations can be seen: $S^{-1}$ must be applied to a vector in the discrete pressure space, and $F^{-1}$ must be applied to a vector in the discrete velocity space. The application of $F^{-1}$ can be performed relatively cheaply using an iterative technique, such as multigrid. However, applying $S^{-1}$ to a vector is too expensive. An effective preconditioner can be built by replacing this operation with an inexpensive approximation. We discuss three preconditioning strategies, the pressure convection-diffusion (PCD), the least squares commutator (LSC), and the approximate SIMPLE commutator (ASC).

3.2.1. The pressure convection-diffusion (PCD) preconditioner

Pressure convection-diffusion preconditioners take a fundamentally different approach to approximate the inverse Schur complement than SIMPLE. The basic idea hinges on the notion of an approximate commutator. Consider a discrete version of the convection-diffusion operator,
\[
(-\nu \nabla^2 + (w \cdot \mathrm{grad})) \tag{15}
\]
where $w$ is a constant vector. When $w$ is an approximation to the velocity obtained from the previous nonlinear step, (15) is an Oseen linearization of the nonlinear term in (1). Suppose there is an analogous operator defined on the pressure space,
\[
(-\nu \nabla^2 + (w \cdot \mathrm{grad}))_p .
\]

Consider the commutator of these operators with the gradient:
\[
\epsilon = (-\nu \nabla^2 + (w \cdot \mathrm{grad})) \nabla - \nabla (-\nu \nabla^2 + (w \cdot \mathrm{grad}))_p . \tag{16}
\]
Supposing that $\epsilon$ is small, multiplication on both sides of (16) by the divergence operator gives
\[
\nabla^2 (-\nu \nabla^2 + (w \cdot \mathrm{grad}))_p^{-1} \approx \nabla \cdot (-\nu \nabla^2 + (w \cdot \mathrm{grad}))^{-1} \nabla . \tag{17}
\]
In discrete form, using finite elements, this usually takes the form
\[
(Q_p^{-1} A_p)(Q_p^{-1} F_p)^{-1} \approx (Q_p^{-1} B)(Q_v^{-1} F)^{-1}(Q_v^{-1} B^T),
\qquad\text{that is,}\qquad
A_p F_p^{-1} Q_p \approx B F^{-1} B^T,
\]
where here $F$ represents a discrete convection-diffusion operator on the velocity space, $F_p$ is the discrete convection-diffusion operator on the pressure space, $A_p$ is a discrete Laplacian operator, $Q_v$ the velocity mass matrix, and $Q_p$ is the lumped pressure mass matrix. This suggests the approximation for the Schur complement
\[
S \approx \hat S = A_p F_p^{-1} Q_p \tag{18}
\]
for a stable finite element discretization when $C = 0$. In the case of our pressure stabilized finite element discretizations, the same type of approximation is required [8]:
\[
S = C + \hat B F^{-1} B^T \approx A_p F_p^{-1} Q_p . \tag{19}
\]
Applying the action of the inverse of $A_p F_p^{-1} Q_p$ to a vector requires solving a system of equations with a discrete Laplacian operator, then multiplication by the matrix $F_p$, and solving a system of equations with the pressure mass matrix. In practice, $Q_p$ can be replaced by its lumped approximation with little deterioration of effectiveness. Both the convection-diffusion-like system, $F$, and the Laplace system, $A_p$, can also be handled using multigrid with little deterioration of effectiveness.

In our taxonomy, the pressure convection-diffusion method is generated by grouping together the upper triangular and diagonal factors as in (12), choosing $H_2 = F^{-1}$ and $\hat S$ as in (19). In matrix form this is
\[
\tilde A_{\mathrm{PCD}} = \begin{pmatrix} F & F H_2 B^T \\ 0 & -\hat S \end{pmatrix}
= \begin{pmatrix} F & B^T \\ 0 & -A_p F_p^{-1} Q_p \end{pmatrix}.
\]
The error matrix is
\[
E_{\mathrm{PCD}} = A - \tilde A_{\mathrm{PCD}} = \begin{pmatrix} 0 & 0 \\ \hat B & A_p F_p^{-1} Q_p - C \end{pmatrix},
\]
which shows that the momentum equation is unperturbed and only the pressure operator in the continuity equation is perturbed by this method, thus giving a momentum preserving strategy.
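Applying the inverse of the PCD upper triangular factor takes one Laplacian solve, one multiplication by $F_p$, one diagonal scaling, and one solve with $F$. The sketch below illustrates that sequence under stated assumptions: dense solves replace the multigrid-preconditioned Krylov solves, the lumped mass matrix is stored as a vector, and all matrices are small hypothetical stand-ins rather than finite element operators.

```python
import numpy as np

def apply_pcd(F, B, A_p, F_p, q_p_lumped, r_u, r_p):
    """Solve [[F, B^T], [0, -S_hat]] [u; p] = [r_u; r_p] with S_hat = A_p F_p^{-1} Q_p."""
    z = np.linalg.solve(A_p, r_p)          # pressure Laplacian solve
    z = F_p @ z                            # multiply by the pressure convection-diffusion operator
    z = z / q_p_lumped                     # invert the lumped (diagonal) pressure mass matrix
    p = -z                                 # the (2,2) block is -S_hat
    u = np.linalg.solve(F, r_u - B.T @ p)  # back-substitute into the momentum block
    return u, p

# Hypothetical stand-in operators for a tiny problem.
rng = np.random.default_rng(1)
n_u, n_p = 6, 3
F = 4.0 * np.eye(n_u) + 0.5 * rng.random((n_u, n_u))
B = rng.random((n_p, n_u))
A_p = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
F_p = A_p + 0.3 * np.eye(n_p)
q_p_lumped = np.array([0.5, 1.0, 0.5])
r_u, r_p = rng.random(n_u), rng.random(n_p)

u, p = apply_pcd(F, B, A_p, F_p, q_p_lumped, r_u, r_p)

# Assemble the same operator explicitly, only to check the result.
S_hat = A_p @ np.linalg.inv(F_p) @ np.diag(q_p_lumped)
A_pcd = np.block([[F, B.T], [np.zeros((n_p, n_u)), -S_hat]])
```

The explicit `S_hat` is formed here purely for verification; the point of the factored application is that the Schur complement approximation never has to be assembled.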
Considerable empirical evidence for two and three-dimensional problems indicates that this preconditioning strategy is effective, leading to convergence rates that are independent of mesh size and mildly dependent on Reynolds numbers for steady flow problems [9,1,17,9]. A proof that convergence rates are independent of the mesh is given in [19]. One drawback is the requirement that the matrix $F_p$ be constructed. There might be situations where a developer of a solver does not have access to the code that would be needed to construct $F_p$. This issue is addressed in the following section.

3.2.2. The least squares commutator (LSC) preconditioner

The least squares commutator method automatically generates an $F_p$ matrix by solving the normal equations associated with a certain least squares problem derived from the commutator [10]. This approach leads to the following definition of $F_p$:
\[
F_p = Q_p (\hat B Q_v^{-1} B^T)^{-1} (\hat B Q_v^{-1} F Q_v^{-1} B^T). \tag{20}
\]

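Substitution of the operator into (19) generates an approximation to the Schur complement for div-stable finite element discretizations (i.e. $C = 0$):
\[
\hat B F^{-1} B^T \approx (\hat B Q_v^{-1} B^T)(\hat B Q_v^{-1} F Q_v^{-1} B^T)^{-1} (\hat B Q_v^{-1} B^T). \tag{21}
\]
For stabilized finite element discretizations, this can be modified to
\[
(C + \hat B F^{-1} B^T)^{-1} \approx (\hat B Q_v^{-1} B^T + \gamma C)^{-1} (\hat B Q_v^{-1} F Q_v^{-1} B^T)(\hat B Q_v^{-1} B^T + \gamma C)^{-1} + \alpha D^{-1}, \tag{22}
\]
where $\alpha$ and $\gamma$ are scaling factors, and $D$ is the diagonal of $(\hat B\,\mathrm{diag}(F)^{-1} B^T + C)$. For a further discussion of the merits of this method including heuristics for generating $\alpha$ and $\gamma$, see [11].

In the taxonomy, the LSC operator is generated by grouping together the upper triangular and diagonal factors as in (12), choosing $H_2 = F^{-1}$ and $\hat S$ as in (22). In matrix form this is
\[
\tilde A_{\mathrm{LSC}} = \begin{pmatrix} F & F H_2 B^T \\ 0 & -\hat S \end{pmatrix}
= \begin{pmatrix} F & B^T \\ 0 & -\left[ (\hat B Q_v^{-1} B^T + \gamma C)(\hat B Q_v^{-1} F Q_v^{-1} B^T)^{-1} (\hat B Q_v^{-1} B^T + \gamma C) + \alpha D \right] \end{pmatrix}.
\]
The error matrix is
\[
E_{\mathrm{LSC}} = A - \tilde A_{\mathrm{LSC}}
= \begin{pmatrix} 0 & 0 \\ \hat B & (\hat B Q_v^{-1} B^T + \gamma C)(\hat B Q_v^{-1} F Q_v^{-1} B^T)^{-1} (\hat B Q_v^{-1} B^T + \gamma C) + \alpha D - C \end{pmatrix},
\]
so that the momentum equation is again unperturbed. Empirical evidence indicates that this strategy is effective, leading to convergence rates that are mildly dependent on Reynolds numbers for steady flow problems [1,9].

3.2.3. The approximate SIMPLE commutator preconditioner

In this section, we define an alternative strategy that uses the same factors as SIMPLE, together with the commutator used to derive the PCD and LSC factorizations. This results in a mass preserving strategy. In terms of the taxonomy, this method is generated by grouping together the lower triangular and diagonal factors, choosing $H_1 = F^{-1}$ and $\hat S = C + \hat B\,\mathrm{diag}(F)^{-1} B^T F_p^{-1}$. Insertion of the choices into (11) leads to
\[
\begin{pmatrix} F & 0 \\ \hat B H_1 F & -\hat S \end{pmatrix}
\begin{pmatrix} I & H_2 B^T \\ 0 & I \end{pmatrix}
=
\begin{pmatrix} F & 0 \\ \hat B & -(C + \hat B\,\mathrm{diag}(F)^{-1} B^T F_p^{-1}) \end{pmatrix}
\begin{pmatrix} I & H_2 B^T \\ 0 & I \end{pmatrix}.
\]
We can approximate $H_2 B^T$ in the upper triangular factor by $\mathrm{diag}(F)^{-1} B^T F_p^{-1}$, so that
\[
A = \begin{pmatrix} F & B^T \\ \hat B & -C \end{pmatrix}
\approx \tilde A_{\mathrm{ASC}} =
\begin{pmatrix} F & 0 \\ \hat B & -(C + \hat B\,\mathrm{diag}(F)^{-1} B^T F_p^{-1}) \end{pmatrix}
\begin{pmatrix} I & \mathrm{diag}(F)^{-1} B^T F_p^{-1} \\ 0 & I \end{pmatrix}.
\]
The error matrix is
\[
E_{\mathrm{ASC}} = A - \tilde A_{\mathrm{ASC}}
= \begin{pmatrix} 0 & B^T - F\,\mathrm{diag}(F)^{-1} B^T F_p^{-1} \\ 0 & 0 \end{pmatrix}.
\]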
In matrix form, the upper triangular factor of this method becomes
\[
\begin{pmatrix} I & \mathrm{diag}(F)^{-1} B^T F_p^{-1} \\ 0 & I \end{pmatrix}.
\]
Here, the continuity equation is unperturbed. This method performs well when the error in the (1,2) block is small. More details on the method with a further discussion of how this method compares to SIMPLE can be found in [10].

4. Implementation and testing environment

We have tested the methods discussed above using MPSalsa [6], a code developed at Sandia National Laboratories that models chemically reactive, incompressible fluids. The discretization of the Navier-Stokes equations provided by MPSalsa is a pressure stabilized, streamline upwinded Petrov-Galerkin finite element scheme

[30] with $Q_1$-$Q_1$ elements. One advantage of equal order interpolants is that the velocity and pressure degrees of freedom are defined at the same grid points, so the same interpolants for both velocity and pressure are used.

4.1. Problem and preconditioner structure

The nonlinear system is solved by Newton's method where the structure of a two-dimensional steady version of $F$ is a block matrix consisting of a discrete version of the operator
\[
\begin{pmatrix}
-\nu \Delta + u^{(n-1)} \cdot \nabla + (u_1^{(n-1)})_x & (u_1^{(n-1)})_y \\
(u_2^{(n-1)})_x & -\nu \Delta + u^{(n-1)} \cdot \nabla + (u_2^{(n-1)})_y
\end{pmatrix}. \tag{23}
\]
For the pressure convection-diffusion preconditioning strategy, we need to specify the operators $F_p$, $A_p$, and $Q_p$. These operators are generated using the application code, MPSalsa. For the $A_p$ operator required by this strategy, we choose it by taking $1/\nu$ times the symmetric part of $F_p$. This generates a Laplacian type operator suitable for use in this preconditioning strategy. For $Q_p$, we use a lumped version of the pressure mass matrix. For problems with inflow boundary conditions, we specify Dirichlet boundary conditions on the inflow boundary for all of the preconditioning operators [8]. For singular operators found in problems with enclosed flow, the hydrostatic pressure makes $B^T$ and the Jacobian system rank-deficient by one. Since we are given a Jacobian matrix from MPSalsa that is pinned, i.e. a row and column that is causing the rank deficiency is removed, we pin all of the operators in the preconditioner ($F_p$, $A_p$, $Q_p$) as the Jacobian matrix is pinned. The other methods (i.e. SIMPLE, LSC) in this study were built as described in Section 3.

One aspect of the block preconditioners discussed here is that they require two subsidiary scalar computations, solutions for the Schur complement approximation and convection-diffusion-like subproblem. Both of these computations are amenable to multigrid methods.
We employ smoothed aggregation algebraic multigrid (AMG) for these computations because AMG does not require mesh or geometric information, and thus is attractive for problems posed on complex domains or unstructured meshes. More details on AMG can be found in [31,33].

4.2. Software

Our implementation of the preconditioned Krylov subspace solution algorithm uses Trilinos [16], a software environment developed at Sandia National Laboratories for implementing parallel solution algorithms using a collection of object-oriented software packages for large-scale, parallel multiphysics simulations. One advantage of using Trilinos is its capability to seamlessly use component packages for core operations. We use the following components of Trilinos:

1. Meros: This package provides scalable block preconditioning for problems with coupled simultaneous solution variables. Both the pressure convection-diffusion and SIMPLE preconditioner studied here are implemented in this package. Meros uses the Epetra package for basic linear algebra functions.
2. Epetra: This package provides the fundamental routines and operations needed for serial and parallel linear algebra libraries. Epetra also facilitates matrix construction on parallel distributed machines. Each processor constructs the subset of matrix rows assigned to it via the static domain decomposition partitioning generated by a stand-alone library, CHACO [15], and a local matrix-vector product is defined. Epetra handles all the distributed parallel matrix details (e.g. local indices versus global indices, communication for matrix-vector products, etc.). Once the matrices $F$, $B$, $\hat B$, and $C$ are defined, a global matrix-vector product for (7) is defined using the matrix-vector products for the individual systems. Construction of the preconditioner follows in a similar fashion.
3. AztecOO: This package is a massively parallel iterative solver library for sparse linear systems.
It supplies all of the Krylov methods used in solving (7), the $F$, and Schur complement approximation subsystems.
4. ML: This is a multi-level algebraic multigrid preconditioning package. We use this package with AztecOO to solve the $F$ and Schur complement approximation subsystems.

5. NOX: This is a package for solving nonlinear systems of equations. We use NOX for the inexact nonlinear Newton solver.

4.3. Operations required

Once all of the matrices and matrix-vector products are defined, we can use Trilinos to solve the incompressible Navier-Stokes equations using our block preconditioner with specific choices of linear solvers for the Jacobian system and the convection-diffusion and Schur complement approximation subproblems. For solving the system with coefficient matrix $F$ we use the generalized minimal residual method (GMRES) preconditioned with four levels of algebraic multigrid, and for the pressure Poisson problem, we use the conjugate gradient method (CG) preconditioned with four levels of algebraic multigrid. For the convection-diffusion problem, a block Gauss-Seidel (GS) smoother is used and for the pressure Poisson problem, a multi-level smoother polynomial is used for the smoothing operations [1]. The block GS smoother is a domain-based Gauss-Seidel smoother where the diagonal blocks of the matrix (the velocity components) correspond to subdomains, and a traditional point GS sweep occurs in the smoothing step. The local Gauss-Seidel procedure includes a communication step (which updates ghost values around each subdomain's internal boundary) followed by a traditional Gauss-Seidel sweep within the subdomain. For the coarsest level in the multigrid scheme, a direct LU solve was employed. We used smoothed aggregation multigrid solvers available in Trilinos.

To solve the linear problem associated with each Newton iteration, we use GMRESR, a variation on GMRES proposed by van der Vorst and Vuik [3] allowing the preconditioner to vary at each iteration. GMRESR is required because we use a preconditioned Krylov subspace method to generate approximate solutions in the subsidiary computations (pressure Poisson and convection-diffusion-like) of the preconditioner, so the preconditioner is not a fixed linear operator.
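A tiny example shows why a truncated Krylov solve inside the preconditioner is not a fixed linear operator. The sketch below is only an illustration: it uses a single conjugate gradient step from a zero initial guess as the inner "solver", and the helper name and the 2x2 matrix are assumptions chosen so the arithmetic can be followed by hand.

```python
import numpy as np

def one_cg_step(A, b):
    """Approximate A^{-1} b by the first conjugate gradient iterate from a zero guess."""
    alpha = (b @ b) / (b @ (A @ b))   # optimal step length along the residual b
    return alpha * b

A = np.diag([1.0, 2.0])
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

separate = one_cg_step(A, e1) + one_cg_step(A, e2)   # [1.0, 0.5]
combined = one_cg_step(A, e1 + e2)                   # [2/3, 2/3]
```

Because the step length depends on the right-hand side, `combined` differs from `separate`, so the map from right-hand side to approximate solution is not linear. An outer method that assumes a fixed preconditioner matrix would not be justified here, which is the motivation for a flexible variant such as GMRESR.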
In our experiments, we compare methods from pressure correction (SIMPLEC) and approximate commutator (PCD) with a one-level Schwarz domain decomposition preconditioner [7]. This preconditioner does not vary from iteration to iteration (as the block preconditioners do), so GMRES can be used as the outer solver. Domain decomposition methods are based upon computing approximate solutions on subdomains. Robustness can be improved by increasing the coupling between processors, thus expanding the original subdomains to include unknowns outside of the processor's assigned nodes. Again, the original Jacobian system matrix is partitioned into subdomains using CHACO, whereas AztecOO is used to implement the one-level Schwarz method and automatically construct the overlapping submatrices. Instead of solving the submatrix systems exactly we use an incomplete factorization technique on each subdomain (processor). For our experiments, we used an ILU with a fill-in of 1.0 and a drop tolerance of 0.0. Therefore, the ILU factors have the same number of nonzeros as the original matrix with no entries dropped. A 2-level or 3-level Schwarz scheme might perform better. However, there are some issues with directly applying a coarsening scheme to the entire Jacobian system due to the indefinite nature of the system [7].

5. Numerical results

5.1. Benchmark problems

For our computational study, we have focused our efforts on steady solutions of two benchmark problems, the lid driven cavity problem and flow over an obstruction, each posed in both two and three spatial dimensions.

Driven cavity problem

For the two-dimensional driven cavity, we consider a square region with unit length sides. Velocities are zero on all edges except the top (the lid), which has a driving horizontal velocity of one. For the three-dimensional driven cavity, the domain is a cube with unit length sides. Velocities are zero on all faces of the cube, except the top (lid), which has a driving velocity of one. Each of these problems is then discretized on a uniform mesh of width h.
In two dimensions, we have approximately $3/h^2$ unknowns, i.e. $1/h^2$ pressure and $2/h^2$ velocity unknowns. In three dimensions, we have approximately $4/h^3$ unknowns.
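As a quick check of these estimates, the following minimal sketch computes the approximate number of unknowns for a given mesh width. The function name is hypothetical, and the split of the three-dimensional count into three velocity components plus one pressure per mesh point is an assumption that matches the stated total of $4/h^3$.

```python
def approx_unknowns(h, dim):
    """Approximate unknown count on a uniform mesh of width h (dim = 2 or 3)."""
    if dim not in (2, 3):
        raise ValueError("dim must be 2 or 3")
    cells_per_side = round(1.0 / h)
    mesh_points = cells_per_side ** dim      # about 1/h^dim mesh points
    # dim velocity components plus one pressure unknown per mesh point
    return (dim + 1) * mesh_points


print(approx_unknowns(1 / 64, 2))   # 3 * 64^2 = 12288
print(approx_unknowns(1 / 32, 3))   # 4 * 32^3 = 131072
```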

The lid driven cavity is a well-known benchmark for fluids problems because it contains many features of harder flows, such as recirculations. The lid driven cavity poses challenges to both linear and nonlinear solvers and exhibits unsteady solutions and multiple solutions at high Reynolds numbers. In two dimensions, unsteady solutions appear around Reynolds number 8000 [14]. In three dimensions, unsteady solutions appear around Reynolds number 100 [8]. Fig. 1 shows the velocity field and pressure field for an example solution to a two-dimensional lid driven cavity problem with h = 1/18.

Fig. 1. Sample velocity field and pressure field from 2D lid driven cavity. h = 1/18, Re = 100. (Panels: selected streamlines; pressure field.)

Flow over an obstruction

For the two-dimensional flow over a diamond obstruction, we consider a rectangular region with width of unit length and a channel length of seven units, where the fluid flows in one side of a channel, then around the obstruction and out the other end of the channel. Velocities are zero along the top and bottom of the channel and along the obstruction. The flow is set with a parabolic inflow condition, i.e. $u_x = 1 - y^2$, $u_y = 0$, and a natural outflow condition, i.e. $\partial u_x / \partial x = p$ and $\partial u_y / \partial x = 0$. For the three-dimensional flow over a cube, we consider a rectangular region with a width of one and a half units, a height of three units, and a channel length of five units. The fluid flows in one side of the channel, then around the cube, and out the other end of the channel. Velocities are zero along the top and bottom of the channel, and along the obstruction. The flow is set with a parabolic inflow condition similar to the two-dimensional case and with a natural outflow condition.

The flow over an obstruction also poses many difficulties for both linear and nonlinear solvers. This problem uses an unstructured mesh with inflow and outflow conditions, which makes it a more realistic, yet more difficult, problem than the driven cavity. In two dimensions, unsteady solutions appear around Reynolds number 50 [13]. Figs. 2 and 3 show the velocity field and unstructured mesh for an example solution to a two-dimensional flow over a diamond obstruction for Re 5. Fig. 4 shows the velocity field for an example solution to a three-dimensional flow over a cube obstruction for Re 50.

Numerical results

We terminate the nonlinear iteration when the relative error in the residual is $10^{-4}$, i.e.

$$\left\| \begin{pmatrix} f - (F(u)u + B^T p) \\ g - (\hat{B}u - Cp) \end{pmatrix} \right\| \le 10^{-4} \left\| \begin{pmatrix} f \\ g \end{pmatrix} \right\|. \qquad (4)$$

The tolerance $\eta_k$ for (6), the solve with the Jacobian system, is fixed at $10^{-5}$ with zero initial guess. For all of the problems with the pressure convection-diffusion preconditioner, we employ inexact solves on the subsidiary pressure Poisson type and convection-diffusion subproblems. For solving the system with coefficient matrix A, we use six iterations of algebraic multigrid preconditioned CG, and for the convection-diffusion-like subproblem, with coefficient matrix F, we fix a tolerance of $10^{-2}$, i.e. this iteration is terminated when


$$\| y - F u \| \le 10^{-2} \| y \|. \qquad (5)$$

We compare this method to a one-level overlapping Schwarz domain decomposition preconditioner that uses GMRES to solve the Jacobian system at each step [7]. In order to minimize the CPU time and thus reduce the number of outer iterations, we have found that for the SIMPLEC preconditioner, we could not perform the Schur complement approximation solve and the solve with F as loosely as we did with the pressure convection-diffusion preconditioner. For SIMPLEC, we fix a tolerance of $10^{-5}$ for the solve with coefficient matrix F in (5) and the solve with the Schur complement approximation. For the pressure convection-diffusion and SIMPLEC preconditioners, we use a Krylov subspace size of 300 and a maximum number of iterations of 900. For the 2D domain decomposition preconditioner, we use a Krylov subspace of 600 and a maximum number of iterations of [...]. For the 3D domain decomposition preconditioner, we use a Krylov subspace of 400 and a maximum number of iterations of 100. All of these values are chosen to limit the number of restarts needed for the solver, while balancing the memory on the compute node. The results were obtained in parallel on Sandia's Institutional Computing Cluster (ICC). Each of this cluster's compute nodes has dual Intel 3.6 GHz Xeon processors with [...] GB of RAM.

Fig. 2. Sample velocity field from 2D flow over a diamond obstruction. 6K unknowns, Re = 5.

Fig. 3. Sample velocity field and unstructured mesh from 2D flow over a diamond obstruction.

Fig. 4. Sample velocity field from 3D flow over a cube obstruction, Re = 50.

Lid driven cavity problem

We first compare the performance of the pressure convection-diffusion preconditioner to that of the domain decomposition preconditioner and SIMPLEC on the lid driven cavity problem generated by MPSalsa. In the first column of Table 1, we list the Reynolds number followed by four mesh sizes in column two. In columns three, four, and five, we list the total CPU time and the average number of outer linear iterations per Newton step for the pressure convection-diffusion, domain decomposition, and SIMPLEC preconditioners, respectively.

The trends are as follows. The pressure convection-diffusion method displays iteration counts that are largely independent of the mesh size. The domain decomposition preconditioner does not display mesh independent convergence behavior as the mesh is refined. However, there is much less computational effort involved in one iteration of preconditioning with domain decomposition than in one iteration of preconditioning with pressure convection-diffusion. For the fine meshes, the CPU time for the pressure convection-diffusion preconditioner is four times smaller than for domain decomposition (when the latter method was convergent). The SIMPLEC method also does not display mesh independent convergence behavior, but it provides solutions in fewer iterations and in less CPU time for finer meshes than the domain decomposition preconditioner. For large Re, SIMPLEC is sensitive to the damping parameter on the pressure update.
For the results shown, the damping factor was 0.01; for larger values of α the method stagnated. We found SIMPLE to be less effective than SIMPLEC and do not report results for SIMPLE.

For the 3D driven cavity problems (Table 2), we find that the pressure convection-diffusion method again is faster on larger meshes than the one-level domain decomposition method. The pressure convection-diffusion method again displays essentially mesh-independent iteration counts and a slight dependence on the Reynolds number. The SIMPLEC method produces iteration counts that are less dependent on the Reynolds number than domain decomposition, but it is competitive and in many cases faster than domain decomposition in terms of CPU time.

Table 1. Comparison of the iteration counts and CPU time for the pressure convection-diffusion, SIMPLEC, and domain decomposition preconditioners for the 2D lid driven cavity problem. Columns: Re number; Mesh size; Pressure C-D (Iters, Time); SIMPLEC (Iters, Time); DD one-level (Iters, Time); Procs. [Table body not recoverable.] NC stands for no convergence. For the DD solver results, we could not converge to a solution for a Krylov subspace size of 900 and 4500 max iterations.

Table 2. Comparison of the iteration counts and CPU time for the pressure convection-diffusion, SIMPLEC, and domain decomposition preconditioners for the 3D lid driven cavity problem. Columns: Re number; Mesh size; Pressure C-D (Iters, Time); SIMPLEC (Iters, Time); One-level DD (Iters, Time); Procs. [Table body not recoverable.]

Flow over an obstruction

The pressure convection-diffusion preconditioner, SIMPLEC, and the domain decomposition preconditioner are compared for the two-dimensional diamond obstruction problem in Table 3. The trends are similar to the results from the driven cavity problem; in particular, iteration counts for PCD are largely independent of mesh size for a given Reynolds number. The domain decomposition preconditioner and SIMPLEC do not display mesh independent convergence behavior as the mesh is refined. For Re 10 and Re 5, the pressure convection-diffusion preconditioner was faster in all cases. For Re 40, it was faster for all meshes except for the small problems with 6,000 unknowns on one processor. Note that the GMRES solver preconditioned with domain decomposition stagnated before a solution was found for the problems with 4 million unknowns. The pressure convection-diffusion preconditioner converged without difficulty on this problem. On modest sized problems (those with more than 56 K unknowns) where both methods converged, the pressure convection-diffusion preconditioner ranged from 4 to 14 times faster than domain decomposition. Results for the three-dimensional flow over a cube are shown in Table 4. Once again the trends are similar; we omit a detailed discussion.

Table 3. Comparison of the iteration counts and CPU time for the pressure convection-diffusion, SIMPLEC and domain decomposition preconditioners for the 2D flow over a diamond obstruction. Columns: Re number; Unknowns; Pressure C-D (Iters, Time); SIMPLEC (Iters, Time); DD one-level (Iters, Time); Procs. Rows are grouped by Reynolds number (Re = 10, Re = 5, Re = 40), each with four problem sizes. [Table body not recoverable.] NC stands for no convergence.

Table 4. Comparison of the iteration counts and CPU time for the pressure convection-diffusion and domain decomposition preconditioners for the flow over a 3D cube. Columns: Re number; Unknowns; Pressure C-D (Iters, Time); SIMPLEC (Iters, Time); DD one-level (Iters, Time); Procs. Rows are grouped by Reynolds number (Re = 10, Re = 50), each with three problem sizes. [Table body not recoverable.] NC stands for no convergence.

Table 5. Comparison of the iteration counts and CPU time for the inexact pressure convection-diffusion, exact pressure convection-diffusion and domain decomposition preconditioners for the 2D flow over a diamond obstruction. Columns: Re number; Unknowns; Inexact Pressure C-D (Iters, Time); Exact P-C-D (Iters, Time); DD one-level (Iters, Time); Procs. Rows are grouped by Reynolds number (Re = 10, Re = 5, Re = 40), each with four problem sizes. [Table body not recoverable.] NC stands for no convergence.

Finally, in Table 5, we compare the impact of inexact solves of the subsidiary systems required for the pressure convection-diffusion preconditioner. In particular, we look at the exact pressure convection-diffusion preconditioner, where we solved the subsidiary systems to a tolerance of [...]. The exact PCD preconditioner shows iteration counts that are mesh independent and reduce as the mesh is refined, but with increasing CPU cost. However, the exact method is still considerably faster than domain decomposition for this problem. For a user of these methods, we recommend the inexact variant because its iteration counts are nearly mesh independent and it requires less CPU time.

Additional discussion

We comment on some additional points concerning costs and scalability of the PCD preconditioner. In most of the examples with this preconditioner, as the mesh is refined we do notice an increase in the computational time for a given Reynolds number. A representative example is from Table 3, Re 5, where the CPU times are [...] and [...] in the cases of 1M and 4M unknowns, respectively. There are two causes for this. One is iteration counts: both the outer iterations needed to satisfy the stopping criterion (6) and the inner iterations needed for (5) in the approximate convection-diffusion solve show some increase as the mesh is refined. The convection-diffusion solve is the dominant computation of the outer iteration, and this leads to an increase in CPU time even though the number of unknowns per processor is constant. In the example cited from Table 3, the average inner iteration counts increased from 10 (for 1M unknowns) to 13, and the average outer iteration counts increased from 43.6 to 49.1. If we use the factor $(43.6/49.1)(10/13) = 0.68$ to adjust the CPU time in the case of 4M unknowns, we obtain an adjusted CPU time of 503.1, which is 13% higher than the time for 1M unknowns.
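The adjustment factor quoted above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of that calculation; the function name is hypothetical, and the iteration counts (43.6 and 49.1 outer, 10 and 13 inner) are the ones that appear in the factor in the text.

```python
def iteration_adjustment(outer_coarse, outer_fine, inner_coarse, inner_fine):
    """Factor that rescales the fine-mesh CPU time to the coarse-mesh
    iteration counts (product of outer and inner iteration ratios)."""
    return (outer_coarse / outer_fine) * (inner_coarse / inner_fine)


factor = iteration_adjustment(43.6, 49.1, 10, 13)
print(f"adjustment factor = {factor:.2f}")   # prints 0.68
```

Multiplying the measured CPU time for 4M unknowns by this factor removes the part of the growth attributable to the extra outer and inner iterations, leaving the residual increase discussed in the text.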
The second main cause of performance sensitivity to mesh size is the increasing cost of the coarsest level solve in the multi-level method, which in these tests was done with a serial sparse direct solver. One can control this cost by adding additional levels to the multi-level method or by using either a parallel solver or an iterative method for the coarse direct solve. However, we have found that this computation is not responsible for significant overhead (about 13% in the example cited above) and we have not explored this further.

6. Conclusions

We have described a taxonomy for preconditioning techniques for the incompressible Navier-Stokes equations. We have included traditional methods of pressure projection and pressure correction type along with newer approximate commutator methods derived from an approximation of the Schur complement. This taxonomy is based upon a block factorization of the Jacobian matrix in the Newton nonlinear iteration, where methods are determined by making choices on the grouping of the block upper, lower, and diagonal factors along with approximations to the action of the inverse of certain operators and the Schur complement. All the methods require solutions of discrete scalar systems of convection-diffusion and pressure Poisson type that are significantly easier to solve than the entire coupled system. In experiments with these methods using benchmark problems from MPSalsa we have demonstrated that the pressure convection-diffusion method gives superior iteration counts and CPU times for 2D and 3D problems compared with the one-level additive Schwarz domain decomposition method. For the approximate commutator methods we have demonstrated asymptotic convergence behavior that is essentially mesh independent in 2D and 3D for problems generated by an application code, MPSalsa, over a range of Reynolds numbers and problems discretized on structured and unstructured meshes with inflow and outflow conditions.
For the steady-state problems explored, the iteration counts show only a slight degradation for increasing Reynolds number. In the future, we intend to further expand this technique to time dependent problems and problems posed on more complex domains.

Footnote (iteration counts, Additional discussion): We expect the former count to tend to a constant as the mesh is refined, but smoothed aggregation multigrid can display some mild mesh dependence [31,33]. The convection-diffusion solve also has an impact on costs as the Reynolds number is increased. Solving nonsymmetric problems with algebraic multigrid is an active research topic; if a more effective scalable solver did exist for this subproblem, then the CPU time would be considerably lower and more scalable [].
