A new class of preconditioners for discontinuous Galerkin methods for unsteady 3D Navier-Stokes equations: ROBO-SGS

Size: px

Start display at page:

Download "A new class of preconditioners for discontinuous Galerkin methods for unsteady 3D Navier-Stokes equations: ROBO-SGS"

Anna Chapman
5 years ago
Views:

1 A new class of preconditioners for discontinuous Galerkin methods for unsteady 3D Navier-Stokes equations: ROBO-SGS Philipp Birken a Gregor Gassner b Mark Haas b Claus-Dieter Munz b a University of Kassel, Department of Mathematics, Heinrich-Plett-Str. 40, Kassel, Germany b University of Stuttgart, Institute of Aerodynamics and Gas Dynamics, Pfaffenwaldring 21, Stuttgart, Germany Abstract In this work we propose a new class of preconditioners for the speed-up of implicit time integration of discontinuous Galerkin discretizations of the three dimensional time dependent Navier-Stokes equations. This new class of preconditioners exploits the hierarchy of modal basis functions and introduces a flexible order of the offdiagonal Jacobian blocks. While the standard preconditioners block Jacobi (no off-blocks) and full symmetric Gauss-Seidel (full off-blocks) are included as special cases, the reduction of the off-block order results in a new type of preconditioner. We show that these new preconditioners reduce the overall CPU time, while retaining a good strong parallel scaling on up to 512 processors. Key words: Discontinuous Galerkin, Unsteady flows, Navier-Stokes, Implicit methods, Preconditioning, Three dimensional problems 1 Introduction The solution of unsteady compressible viscous flows typically leads to stiff problems, in particular for wall bounded flows and flows at low Mach numbers. addresses: birken@mathematik.uni-kassel.de (Philipp Birken), gassner@iag.uni-stuttgart.de (Gregor Gassner), haas@iag.uni-stuttgart.de (Mark Haas), munz@iag.uni-stuttgart.de (Claus-Dieter Munz). Preprint submitted to Elsevier 1 June 2012

2 This means that the time step size in explicit methods is driven by stability and in implicit methods by accuracy alone. In the context of discontinuous Galerkin (DG) methods, the stability restriction on explicit methods becomes worse with higher polynomial degrees. Therefore, implicit methods, which can be constructed to have unbounded stability regions, are attractive for a lot of problems and are a standard part of solution procedures for finite volume methods. However, finding an efficient solver for the appearing linear and nonlinear equation systems has turned out to be a difficult problem in the DG case. As the structure of the nonlinear and the linear systems to be solved does not depend on the specific time integration method, in this work we use the 4th order accurate diagonally implicit Runge-Kutta method ESDIRK4 [21]. We choose a 4th order accurate method in time, as we are interested in the simulation of unsteady problems. In addition to the unsteadiness, the focus of this work lies on the efficient simulation of three dimensional viscous flows governed by the compressible Navier-Stokes equations. Regarding the solvers, there is a number of requirements that have to be satisfied. Firstly, three dimensional computations have strong memory demands that will actually increase in the future with newer computer generations having less and less memory per core available. This problem is particularly pronounced for DG methods. These use larger number of degrees of freedom per cell, leading to larger Jacobian blocks with more intercell connectivity. Already on today s supercomputers one may find oneself running out of memory faster than one would expect from experience with finite volume methods, and future computers will probably have less memory per core available. We will discuss these aspects in more detail in the section on the nonlinear solver. Secondly, the solver has to scale in parallel to be feasible for supercomputing. Thirdly, it has to be reasonably fast, which is still a challenge in the DG context. Finally, the implementation cost should be as low as possible, where here, we are concerned with the additional coding needed to make an explicit DG method implicit. If all of these requirements are met (low storage requirements, parallel scaling, fast convergence and ease of implementation), the use of high order methods in the industrial context would become feasible. At this point it has to be noted that there is not yet a standard DG method and that in our experience, the question of an efficient solver depends on the specific discretization in a nontrivial way. However, it seems that an efficient DG method makes use of what is called a nodal basis in some way [17]. Here, we will consider the mixed modal-nodal variant suggested by Gassner et. al. [15], which is based on a modal basis but uses a nodal basis for integration. For the diffusive terms, the dgrp flux is used [12]. Furthermore, we did some preliminary tests of the methodology on a DG Spectral Element method (DGSEM), e.g. [24], which showed that the preconditioning procedure used here indeed has to be modified in that context. 2

3 Basic candidates for solvers are FAS-Multigrid and preconditioned Jacobian- Free Newton-Krylov methods (JFNK). The FAS multigrid is the method of choice for steady Euler flows when using finite volume schemes. In the context of unsteady flows, it is often called dual time stepping and this is a slow method, as reported by several authors, e.g. [19]. When looking at DG methods, the design of a fast multigrid solver is an open problem both for steady and unsteady flows [30,22,3,2,33,27]. We will not attack this problem and consider JFNK schemes in this work. There, the linear systems are solved using Krylov subspace methods, which do not need the system matrix explicitely, but only matrix vector products. Since the system matrix is a Jacobian, these can be approximated using finite differences, circumventing in theory the construction and storage of the Jacobian. In praxis, a preconditioner is needed, making the schemes not completely matrix free. Regarding the specific Krylov subspace method, it turns out that GMRES is the best choice in this context [23]. The Newton method is an inexact Newton method, where a good strategy to control the termination criteria for the linear solver is necessary. Namely, the strategy by Eisenstat and Walker is used [10]. As mentioned, a preconditioner is necessary to speed up GMRES. Here, multigrid can be used as a preconditioner, which was considered by Nastase and Mavriplis [30], as well as May et. al. [27] for the steady Euler equations in two dimensions. Generally, multigrid preconditioners would satisfy the requirements mentioned, but they face the same problem as the FAS multigrid: the lack of theory for DG discretizations leads to nonoptimal methods. Therefore, we will not consider these schemes here. Regarding other preconditioners, a number of authors has considered Newton-Krylov methods for two dimensional flows, in particular Rasetarinera and Hussaini (steady NS) [35], Dolejsi and Feistauer (steady Euler) [9], Darmofal et al. (steady NS) [11,8], Kanevsky et. al. (unsteady Euler, NS) [20], as well as Persson and Peraire (unsteady NS) [32]. Whereas all the work mentioned above considers two dimensional flow problems, we will instead look at the 3D unsteady case. As opposed to the case of finite volume methods, where going from 2D to 3D increases the number of unknowns per cell from four to five and thus by just one, we have to multiply the number of unknowns per cell with a factor dependend on the polynomial degree, resulting in hundreds of unknowns per cell with an according block structure. Thus, the two dimensional case is in our opinion not representative of the three dimensional one and very different from the finite volume case. Therefore, a successful implicit DG scheme must take these huge blocks in the Jacobian into account. Here, we do this via the preconditioner and propose a new class of multilevel symmetric Gauss-Seidel (SGS) preconditioner, which we call ROBO-SGS (Reduced Offdiagonal Block Order). This new class of 3

4 preconditioner exploits the hierarchical basis of the mixed modal-nodal DG method to reduce the order of the off block Jacobian blocks and includes the block Jacobi and the full SGS preconditioner as special cases. The outline of the paper is as follows: First we will describe the governing equations and the DG methodology used. Then we will briefly discuss the ESDIRK4 method, after which we will describe the JFNK method and the different preconditioners. Finally, numerical results are presented where we compare the new preconditioner to several others in a parallel environment. 2 Governing equations The Navier-Stokes equations are a second order system of conservation laws (mass, momentum, energy) modeling viscous compressible flow. Written in conservative variables density ρ, momentum m and energy per unit volume ρe: t ρ + m = 0, d t m i + xj (m i v j + pδ ij ) = 1 j=1 Re d xj S ij + q i, j=1 d t (ρe) + (Hm) = 1 ( d xj Re j=1 i=1 i = 1... d S ij v i 1 ) P r W j + q e. Here, d stands for the number of dimensions, H for the enthalpy per unit mass, S represents the viscous shear stress tensor and W the heat flux. As the equation are dimensionless, the Reynolds number Re and the Prandtl number P r appear. The equations are closed by the equation of state for the pressure p = (γ 1)ρe, where we assume an ideal gas. Finally, q e denotes a possible source term in the energy equation, whereas q = (q 1,..., q d ) T is a source term in the momentum equation, for example due to external forces. 3 Spatial Discretization We employ the mixed modal-nodal Discontinuous Galerkin scheme which has been suggested by Gassner et al. [15]. One of the main advantages of this method is that it allows the use of elements of arbitrary shape (i.e. tetrahedrons, prisms, pyramids, hexahedrons,...) with high order of accuracy. In our experience discretizations using hexahedrons very often require less elements 4

5 and thus less total degrees of freedom than ones that only use tetrahedrons for approximately the same discretizaton error. This property becomes extremely important in the context of implicit methods since the total number of degrees of freedom has a strong influence on the performance of the solver and an even greater impact on the memory consumption which is crucial for DG methods, especially when doing real world 3D simulations. We will demonstrate these aspects in the following sections. 3.1 The Discontinuous Galerkin Method We consider conservation laws of the form u t + f (u) = q(t, u), (1) with suitable initial and boundary conditions in a domain Ω [0, T ] R d R + 0. Here, u = u ( x, t) R d+2 is the state vector, f (u) = f C (u) f D (u, u) is the physical flux, where f C (u) is the convective (i.e. hyperbolic) and f D (u, u) the diffusive (i.e. parabolic) flux component. The possibly time and space dependent source term is given by q(t, x, u). We derive the DG method by first subdividing the domain Ω into non-overlapping grid cells Q. In each grid cell we approximate the state vector using a local polynomial approximation of the form u ( x, t) N Q u Q ( x, t) = û j (t) ϕ Q j ( x), (2) j=1 where in our case, {ϕ Q j ( x)} j=1,...,n are modal hierarchical orthonormal basis functions and û are the corresponding coefficients in that cell. The basis functions are constructed from a monomial basis with a simple Gram-Schmidt orthogonalization algorithm for arbitrary (reference) grid cell types. The dimension of the local approximation space depends on the spatial dimension d and the polynomial degree p N = N(p, d) = (p + d)!. (3) p!d! The next step of our approximation is to define how the unknown degrees of freedom û j (t) are determined. The basis of the considered discontinuous Galerkin method is a weak formulation. Neglecting the source term for now, we insert the approximate solution (2) into the conservation law (1), multiply with a smooth test function φ = φ ( x) and integrate over Q to obtain u Q t + f ( u Q), φ Q = 0, (4) 5

6 where.,. Q denotes the L 2 (Q) scalar product over Q. We proceed with an integration by parts to obtain u Q t, φ Q + (f (u) n, φ) Q f ( u Q), φ Q = 0, (5) where (.,.) Q denotes the surface integral over the boundary of the element Q. As the approximate solution is in general discontinuous across grid cell interfaces, the trace of the flux normal component f (u) n is not uniquely defined. To get a stable and accurate discretization, several choices for the numerical approximation are known. Here, we use the HLLC flux [38]. For a purely convective problem inserting the trace approximation f ( u Q) n g C (u, u +, n) into equation (5) would yield u Q t, φ Q + ( g C ( u, u +, n ), φ ) Q f C ( u Q), φ Q = 0. (6) We denote by (.) values at the inner side of a cell interface, i.e. values that depend on u Q and by (.) + values that depend on the neighbor cells sharing the interface with the cell Q. The handling of the diffusive part of the flux is a little more delicate for DG methods, because the jump in the gradients needs special handling. Several authors have suggested solutions for this problem [31,6,1,4,5] and all of these have been used in conjunction with implicit temporal discretizations. In this work we apply the so-called dgrp flux by Gassner, Lörcher and Munz [12,14,13,26]. This is a symmetric interior penalty based method that has been derived in a way that optimizes stability, i. e. minimizes the eigenvalues of the DG operator and yields optimal order of convergence for mixed hyperbolic/parabolic PDEs such as the compressible Navier-Stokes equations [12,14]. From a technical point of view this flux introduces an approximation of the trace of the flux normal component f D (u, u) n g dgrp (u, u, u +, u +, η, n), where η is a parameter that depends on the geometry of the cell Q and its neighbor and the local order of the polynomial approximation (2). To ensure adjoint consistency an additional surface flux term h (u, u +, n) is introduced via two integrations by parts [14] yielding the final DG formulation u Q t, φ Q + ( g C g dgrp, φ ) Q (h, φ) Q f C f D, φ Q = 0. (7) 3.2 Nodal Integration The computation of the volume and surface quadrature operators can be a very expensive task if standard methods such as Gaussian quadrature are used, which is caused by the high number of polynomial evaluations required for computing the fluxes. Based on the nodal DG scheme developed by Hesthaven 6

7 and Warburton [17], Gassner et al. developed a way of constructing efficient quadrature operators that work on arbitrarily shaped elements, see [15] for further details. This has the advantage that the number of degrees of freedom does not depend on the element shape as it would be for a purely nodal scheme when using elements other than tetrahedra. As points to define the nodal basis, Legendre-Gauss-Lobatto (LGL) points are used on edges and then a method called LGL-type nesting is used to determine the interior points, which leads to a small Lebesgue constant. The coexistence of modal and nodal elements is quite natural for a DG scheme since the transformation from modal (û) to nodal (ũ) degrees of freedom is nothing else but a polynomial evaluation of the modal polynomials at the nodal interpolation points which can be expressed in the form of a matrixvector-multiplication: ũ = Vû. (8) Here V is a Vandermonde matrix containing the evaluations of the modal polynomials at the interpolation points. The back transformation can be implemented using the inverse of the Vandermonde matrix: û = V 1 ũ. (9) Should the number of nodal interpolation points not be equal to the number of modal degrees of freedom, as it is the case for elements other than tetrahedra, the inverse V 1 is defined using a least squares procedure based on singular value decomposition [15]. The nodal DG method can be conveniently formulated in terms of matrices representing the discrete integrals in (7): Mũ t + nfaces i=1 M S d i g i N i h i S k fk = 0. (10) }{{}}{{} k=1 surface integral volume integral Should the equations be nonlinear such as e.g. the compressible Navier-Stokes equations the nonlinearity is present in the evaluation of the fluxes. In Eq. (10), f k, k = 1,..., d are the vectors of flux evaluations at all nodal points, while g i and h i stand for the evaluations of the surface flux approximations at the nodal points of the element face i. The operators in Eq. (10) are designed to act on nodal input vectors and to produce a nodal output. Using Eq. (9) all the operators in Eq. (10) can be modified in order to produce a modal output yielding the mixed modal-nodal DG method: ( nfaces û t = V 1 M 1 i=1 ) ( ) d M S i g i N i h i S k f k. (11) k=1 7

8 4 Time integration scheme Equation (11) represents a system of ordinary differential equations (ODEs) in the cell Q. If we combine the modal coefficient vectors in one vector u R m, we obtain a large system of ODEs u t (t) = f(t, u(t)), (12) where f is a vector valued function corresponding to the right hand side in (11) for the whole grid. Generally, from now on a vector with an underbar will denote a vector from R m. We denote the time step size by t and u n is the numerical approximation to u(t n ). Note that the explicit dependence of the right hand side in (12) on t is relevant only for time dependent boundary conditions or time dependent source terms. 4.1 ESDIRK4 We will restrict ourselves here to the explicit step singly diagonally implicit Runge-Kutta method of fourth order (ESDIRK4), designed in [21]. Given coefficients a ij and b i, such a method with s stages can be written as i U i = u n + t a ij f(t n + c j t n, U j ), i = 1,..., s (13) j=1 s u n+1 = u n + t b j f(t n + c j t n, U j ). (14) j=1 Thus, all entries of the Butcher array in the strictly upper triangular part are γ γ γ / γ / γ / γ γ b i γ ˆbi b i ˆb i Table 1 Butcher diagram for ESDIRK4 with γ = zero. The coefficients can be obtained from table 1. This scheme is A-stable, also L-stable and stiffly accurate. The point about DIRK schemes is, that the computation of the stage vectors corresponds to the sequential application of several implicit Euler steps. With the starting vectors 8

9 we can solve for the stage values s i = u n i 1 + t a ij f(t n + c j t n, U j ), (15) j=1 U i = s i + ta ii f(t n + c i t n, U i ). (16) The equation (16) corresponds to a step of the implicit Euler method with starting vector s i and time step a ii t. Note that because the method is stiffly accurate, u n+1 = U s and thus we do not need to evaluate (14). The first explicit step of the Runge-Kutta schemes allows to have a stage order of two, but also means that the methods cannot be algebraically stable. Furthermore, the explicit stage involving f(t n, u n ) only allows to reuse the last stage derivative from the last time step in the first stage, since f(t n+1, u n+1 ) from the last time step is just the same thing. 4.2 Adaptive time step size selection For unsteady flows, we need to make sure that the time integration error can be controlled. To do this, we will estimate the time integration error and select the time step size accordingly. This is done using embedded schemes of a lower order ˆp. Comparing the local truncation error of both schemes, we obtain the following estimate for the local error of the lower order scheme: s l t n (b j ˆb j )f(t n + c j t n, U j ). (17) j=1 To determine the new step size, we decide beforehand on a target error tolerance and use a common fixed resolution test [37]. This means that we define the error tolerance per component via d i = RT OL u n i + AT OL. (18) Then we compare this to the local error estimate via requiring l./d 2 1, where. denotes a pointwise division operator. The next question is, how the time step has to be chosen, such that the error can be controlled. The classical method is the following, also called EPS (error per step) control [16]: 9

10 t new = t n l./d 1/(ˆp+1) 2. (19) This is combined with two safety factors to avoid volatile increases or decreases in time step size: if l./d 1, else t n+1 = t n max(f min, f safety l./d 1/ˆp+1 2 ) t n+1 = t n min(f max, f safety l./d 1/ˆp+1 2 ). Here, we chose f min = 0.3, f max = 2.0 and f safety = Solver for the nonlinear equation systems The application of implicit time discretizations used herein leads to a globally coupled nonlinear system of equations of the form u ν+1 = a ν+1 t f ( u ν+1) + R fix ( u 1,..., u ν), (20) where we assume that u 1,..., u ν are given, R fix depends on space- and time discretization and a ν+1 is a constant depending on the time discretization method. 5.1 Inexact Newton method To solve the appearing nonlinear systems, we use inexact Newton s method, which is locally convergent and can exhibit quadratic convergence. This method solves the root equation u a ν+1 t f (u) R fix ( u 1,..., u ν) =: F(u) = 0. As a termination criterion we use a residual based one similar to (18) with a relative tolerance, resulting in F(u k ) ɛ = ɛ r F(u 0 ). If the iteration does not converge after a maximal number of iterations, the time step will be repeated with half the time step size. In particular, we will use the inexact Newton s method from [7], where the linear system in the k-th Newton step is solved only up to a relative tolerance, given by a forcing term η k. This can be written as: 10

11 F(u) u + F(u (k) ) u η k F(u k ) (21) u (k) u (k+1) = u (k) + u, k = 0, 1,... In [10], the choice of the sequence of forcing terms is discussed and it is proved that the inexact Newton iteration (21) converges linearly. Moreover, if η k 0, the convergence is superlinear and if η k K η F(u k ) p for some K η > 0, p [0, 1], the convergence is superlinear with order 1 + p. In particular, this means that for a properly chosen sequence of forcing terms, the convergence can be quadratic. A way of achieving this (as proved in [10]) is the following: η A k = γ F(u k) 2 F(u k 1 2 with a parameter γ (0, 1]. The theorem says that if this sequence is bounded away from one uniformly, convergence is quadratic. Therefore, we set η 0 = η max for some η max < 1 and for k > 0: η k = min(η max, η A k ). Eisenstat and Walker furthermore suggest safeguards to avoid volatile decreases in η k. To this end, γη 2 k 1 > 0.1 is used as a condition to determine if η k 1 is rather large and thus the definition of η k is refined to η B k = min(η max, max(η A k, γη 2 k 1)). Finally, to avoid oversolving in the final stages, they use η k = min(η max, max(η B k, 0.5ɛ/ F(u k ) )), where ɛ is the tolerance at which the nonlinear iteration would terminate. 5.2 Linear Solver In each iteration of the inexact Newton method, a linear equation system of the form Ax = b, A R m m has to be solved with A = I a ν+1 t f u ū. In general, A consist of n E block rows, where n E is the number of elements in our computational domain. Each row i consists of a dense main diagonal 11

12 block of the size (N i n V ar ) (N i n V ar ), where N i is the number of degrees of freedom of the cell i given by Eq. (3) and n V ar is the number of unknowns in the equations that are being solved. For the compressible Navier-Stokes Equations n V ar = 4 for the two-dimensional case and n V ar = 5 for the threedimensional case. In addition to the main block each Neumann neighbor j of i contributes a block of the size (N i n V ar ) (N j n V ar ) which leads to a quickly rising number of matrix entries when using higher order DG methods. Figure 1a shows the memory requirements per block row for different element Memory / Element [MB] Tri Quad Tet Hex DOF / GB Order of Approximation Order of Approximation (a) Memory required for each element (b) Number of degrees of freedom that can be stored in the Jacobian per GB of memory Fig. 1. Memory considerations for the Jacobian types and orders of approximation in 2D and 3D where we assumed a uniform order in the computational domain, i.e. N j = N i. It becomes clear that while memory usually is not an issue for Finite Volume methods (equivalent to the first order case) it is a critical aspect for DG schemes, especially for the 3D case. In order to illustrate how restrictive this may become, Fig. 1b shows the maximum number of cells a computational domain may contain if only one gigabyte of memory is available for the storage of the Jacobian. It should also become clear now that while a 2D scheme is not a problem for today s computers this is absolutely not the case for a 3D scheme as the memory requirements are an order of magnitude more restrictive than for the 2D case. Now for real world problems, the number of unknowns is in the magnitude of tens of millions, so direct solvers are infeasible which leads us to iterative methods. In particular, Krylov subspace methods such as GMRES or BiCGSTAB have been shown to perform well in this context. In their basic version, these schemes need the Jacobian and in addition a preconditioning matrix to improve convergence speed, which is storage wise a huge problem. Furthermore, the use of approximate Jacobians to save storage and CPU time leads to a decrease of the convergence speed of the Newton method. Therefore, we will use here Jacobian-free Newton-Krylov methods [23], which do 12

13 not need the Jacobian (but still a preconditioner). The idea is that in Krylov subspace methods, the Jacobian appears only in the form of matrix vector products Av i which can be approximated by a difference quotient Av i F (ū + ɛv i) F(ū) ɛ = v i a ν+1 t f (ū + ɛv i) f(ū). (22) ɛ The parameter ɛ is a scalar, where smaller values lead to a better approximation but may lead to truncation errors. A simple choice for the parameter, that avoids cancellation but still is moderately small is given by Quin, Ludlow and Shaw [34] as eps ɛ =, u 2 where eps is the machine accuracy. Second order convergence is obtained up to ɛ-accuracy if proper forcing terms are employed, since it is possible to view the errors coming from the finite difference approximation as arising from inexact solves. Of the Krylov subspace methods suitable for the solution of unsymmetric linear systems, the GMRES method of Saad and Schultz [36] was explained by McHugh and Knoll [28] to perform better than others in the matrix free context. The reason for this is that the vectors in matrix vector multiplications in GMRES are normalised, as opposed to those in other methods. 6 Preconditioning It is well known that the speed of convergence of Krylov subspace methods depends strongly on the matrix. Therefore, right preconditioning is used to transform the linear equation system to speed up convergence: APx P = b, x = Px P. Here, P is an invertible matrix, called a right preconditioner that approximates the system matrix in a cheap way. Every time a matrix vector product Av j appears in a Krylov subspace method, the right preconditioned method is obtained by applying the preconditioner to the vector in advance and then computing the matrix vector product with A. Right preconditioning does not change the initial residual, because r 0 = b 0 Ax 0 = b 0 APx P 0, therefore the computation of the initial residual can be done without the right preconditioner. This also means that, in contrast to left preconditioning, right 13

14 preconditioning does not interfere with the Eisenstat-Walker strategy, which is the main reason we use right preconditioning only. Once the tolerance criterion is fulfilled, the right preconditioner has to be applied one last time to change back from the preconditioned approximation to the unpreconditioned. Often, the preconditioner is not given directly, but implicitly via its inverse. Then the application of the preconditioner corresponds to the solution of a linear equation system. If chosen well, the speedup of the Krylov subspace method is enormous and therefore, the choice of the preconditioner is more important than the specific Krylov subspace method used. For nonnormal matrices as we have here, the existing theory is not sufficient to determine optimal preconditioners in any sense. Therefore, the preconditioner has to be chosen by numerical experiments and heuristics. An overview of preconditioners with special emphasis on application in flow problems can be found in [29]. For the DG case, Persson and Peraire have conducted a survey of several preconditioning methods in [32]. Several methods are interesting in this context. 6.1 ROBO-SGS An important class of preconditioners are splitting methods that are based on decomposing A into a (block) diagonal part D, an upper diagonal part U and a lower diagonal part L in such a way that A = L + D + U. These blocks are then used to obtain simple approximations to A 1. The most simple method here is block-jacobi, where the off diagonal blocks are neglected, which leads to P = D 1. (23) A much more sophisticated method that is very good preconditioner for compressible flow problems is the symmetric block Gauss-Seidel-method (SGS), which corresponds to solving the equation system (D + L)D 1 (D + U)x = x P. (24) As mentioned before, the major issue here is that the Jacobian consists of blocks with hundreds of unknowns for DG methods in 3D. In FV schemes, where the blocks have size 5 in 3D, the off diagonal blocks are sometimes computed on the fly, whereas only the diagonal is stored. In the DG case, the high construction cost for the element Jacobians makes this infeasible. 14

15 Therefore, using the full Jacobian A leads to significantly higher memory requirements compared to storing the diagonal D only, e.g. a factor of five for tetrahedra and a factor of seven for hexahedra. Thus, SGS needs a huge amount of storage, resulting in rather high application costs. On the other hand, the storage requirements of Jacobi are reasonable, as well as the application cost, but the resulting decrease in iteration numbers is much smaller than for SGS. Based on this observation we propose a new class of SGS-like preconditioners that is between SGS and Jacobi where a varying amount of entries of the off diagonal blocks L and U is neglected. In this way, we get a trade off between memory and application cost on the one hand and efficiency of the preconditioner on the other hand. In particular, we make use of the fact that the modal DG scheme we employ has a hierarchical basis which also gives the element Jacobians a hierarchical structure. As shown in Fig. 2, this gives us the (a) k = 0 variant (b) k = 1 variant Fig. 2. Reduced versions of the off-diagonal blocks of the Jacobian, k = 0 and k = 1 variants opportunity to reduce the inter-element coupling by neglecting all derivatives of higher-order degrees of freedom of the neighboring cells in the Jacobian. For example, in the we k = 0 we take only the DOFs of the neighbors into account, which correspond to the integral mean values of the conserved quantities. However, we keep not only the derivatives of the remaining degrees of freedom with respect to themselves, but to all degrees of freedom, resulting in a rectangular structure in the off diagonal blocks of the preconditioner. The number of entries of the off-diagonal blocks in the preconditioner is then (N i n V ar ) ( ˆNj n V ar ), where ˆNj depends on the user-defined parameter k. While this results in a decreased accuracy of the preconditioner, the memory requirements and the computational cost of the application become better, the less degrees of freedom of the neighboring cells we take into account. We call this preconditioner ROBO-SGS-k for Reduced Offdiagonal Block Order, where k is the degree of the polynomial basis functions taken into account. 15

16 For k = p, we keep everything, thus recovering the original block SGS preconditioner. If we formally set k = 1, we neglect all off diagonal block entries, thus recovering block Jacobi. 6.2 ILU preconditioning Another important class of preconditioners are block incomplete LU (ILU) decompositions, where the blocks correspond to the units the Jacobian consists of. The computation of a complete LU decomposition is quite expensive and in general needs also for sparse matrices full storage. By prescribing a sparsity pattern, incomplete LU decompositions can be defined. The application of such a decomposition as a preconditioner then corresponds to solving by forward-backward substition the appropriate linear equation system. The sparsity pattern can for example be influenced by the level of fill. This is in short a measure for how much beyond the original sparsity pattern is allowed for the purpose of ILU. Those decompositions with higher levels of fill are very good black box preconditioners for flow problems [29]. However they are not in line with the philosophy of matrix-free methods. Thus remains ILU(0), which has no additional level of fill beyond the sparsity pattern of the original matrix A. We use the ILU(0) preconditioner in the form proposed by Persson and Peraire [32] with the in-place factorization suggested by Diosady and Darmofal [8]. While this preconditioner usually performs better than the ones based on splitting it has the drawback that it has to act on the full Jacobian matrix which makes it less attractive in computational environments with limited memory. 6.3 Multilevel preconditioners Another possibility is to use multilevel schemes as preconditioners. A number of approaches have been tried in the context of DG methods, e.g. multigrid methods and multi-p methods with different smoothers, as well as a variant by Persson and Peraire [32], where a two-level multi-p method is used with ILU(0) as a presmoother and Jacobi as a postsmoother. We will employ a variant where no postsmoothing is employed as we have not experienced any convergence acceleration by the postsmoother and name it ILU-CSC. Since the computation of the residual requires a matrix-vector multiplication, the cost for applying a multilevel variant of one of the above preconditioners is approximately double the cost of a standard single level variant. 16

17 6.4 Parallelization Regarding parallelization, we use the MPI paradigm. The physical domain is decomposed into several domains, each of which is assigned to a processor. The matrix-free approach allows us to use the parallelization scheme described in the PhD thesis of Lörcher [25] for the function evaluation in the matrix-vector multiplication, which is shown to scale very well. The use of GMRES, however, is a drawback here, since this requires the use of k scalar products on the k-th iteration, which do not scale perfectly in parallel. However, the alternatives have other drawbacks, as discussed earlier. As for the preconditioner, with the exception of Jacobi, most schemes would actually require excessive communication at domain boundaries due to the fact that the Neumann neighbors off-block entries of the system matrix would be located on different CPUs. In order to circumvent adding this overhead to our scheme we neglect these parts of the system matrix. This way, as the number of cells per CPU decreases all preconditioners used herein ultimately converge to the Jacobi scheme reducing the overall parallel efficiency not due to communication overhead as it is the case for the scalar products but due to a degradation of the numerical scheme itself resulting in a higher iteration number for the solution process. However, as we will demonstrate in results section, see e.g. table 4, the schemes still scale very well. 7 Numerical results 7.1 Software All computations were carried out using the Fortran code HALO, developed at the Institute of Aerodynamics and Gasdynamics (IAG). As a grid partitioner, ParMETIS is used. Furthermore, all test runs were carried out in parallel using openmpi. 7.2 Test cases The first test corresponds to a sine wave in both energy and density in 3D, generated by a source term. The exact solution is given by ρ(x, t) = sin(π(x 1 + x 2 + x 3 ) at), ρe(x, t) = ρ(x, t) 2. 17

18 All velocities are set to one, implying that the momentum components have the same value as the density. To make this an exact solution, source terms have been determined by inserting the solution into the Navier-Stokes equations and computing the remainder, where we use Re = 100. We use polynomials of degree four on a grid, thus having 175, 000(= ncells N nv ar) unknowns. As a second test case, vortex shedding behind a cylinder in 3D is employed, with a Mach number of 0.3 and a Reynolds number of 1,000. The grid consists of 10,400 hexahedral cells, as illustrated in figure 3. To obtain initial data, we start with free stream data and third order polynomials and compute 101 s of simulation time using an explicit scheme. Then, we switch to fifth order polynomials and compute another 5 s with the explicit method. As the actual test interval, 1 s of simulation time is used, where we have an initial time step of and a tolerance of The total number of unknowns for this test case is 2,912,000. Distribution of the velocity magnitude at initial time and at the end time are shown in figure 4. Fig. 3. Grid for cylinder test case. The grid is extended in three dimensions with 8 regular grid cells. For the final and third test case, vortex shedding behind a 3D sphere at Mach 0.3 and a Reynolds number of 1,000 is considered. The unstructured grid consists of 21,128 hexahedral cells. To obtain initial conditions and start the initial vortex shedding, we begin with free stream data and linear polynomials. Furthermore, to cause an initial disturbance, the boundary of the sphere is chosen to be noncurved at first. In this way, we compute 120 s of time using an explicit time integration scheme. Then we switch to polynomials of degree three and compute another 10 s of 18

19 Fig. 4. Initial (top) and final solution of cylinder problem (bottom). Distribution of the velocity magnitude. Fig. 5. Isosurfaces of λ 2 = 10 4 for initial (left) and final solution of sphere problem (right). time using the explicit method. From this, we start the actual computations with a polynomial degree of four and a curved representation of the boundary. 19

20 In total, we have 739,480 unknowns for this test case. For the first time step, we chose t = , from then on the steps are determined by the time adaptive algorithm. This time step size was chosen such that the resulting embedded error estimate is between 0.9 and 1, leading to an accepted time step. Then, we perform the computations on a time interval of 30 seconds. The initial solution and the result at the end when using ES- DIRK4 time integration with a tolerance of 10 3 are shown in figure 5, where we see isosurfaces of λ 2 = 10 4, a common vortex identifier [18]. Note that there is no visual difference between the results for ESDIRK4 and LSERK Comparison of preconditioners We will now compare Jacobi, ILU, ILU-CSC and the different ROBO-SGS variants when using GMRES(20) in a parallel environment. Fig. 6. Comparison of different preconditioners for DG discretization, only every fifth iteration is shown. First, we look at the impact on convergence for a discretization of the sphere problem when using fourth degree polynomials. We use eight cores. The system solved is the first system arising in the Newton scheme for the first nonlinear system during the first time step with a time step size of t = The relative residuals for every fifth iteration is depicted in figure 6. As can be seen, the most powerful preconditioner when regarding the reduction in GMRES iterations needed is ILU-CSC, followed by ILU. Then, we have ROBO-SGS-4, which is almost as powerful, and then the other ROBO-SGS preconditioner in decreasing order. As expected, Jacobi is the least powerful preconditioner, but is still able to improve the speed of convergence significantly when compared to no preconditioning. In a way, Jacobi can be seen as the most extreme variant of ROBO-SGS where no information from neighboring cells is taken 20

21 into account. In the following tables, the preconditioners are ordered by how powerful we expect them to be. Regarding efficiency, the most powerful preconditioner is not necessarily the best one. Therefore, we will now use the CPU time needed for the different test cases as a measure of efficiency. The first test case is the energy density wave on the time interval t [0, 4], where we used 8 cores on an AMD Opteron Quad TwelveCore system. For the implicit calculations, the initial time step is chosen as t = and T OL = which corresponds to the spatial discretization error. For the second test case, 512 cores of the Cray XE6 (Hermit) of the HLRS were used. The results can be seen in table 2. EnDensWave Cylinder Preconditioner Iter. CPU in s Iter. CPU in s No preconditioner 7, , Jacobi 2, , ROBO-SGS-0 1, , ROBO-SGS-1 1, , ROBO-SGS-2 1, , ROBO-SGS-3 1, , ROBO-SGS-4 1, , ROBO-SGS , ILU(0) 1, , ILU(0)-CSC 1, , Table 2 CPU times and total number of GMRES iterations for various preconditioners The results for the third and final test case are shown in table 3. The computations were performed in parallel using eight cores of an AMD Opteron 8378 machine (with a total of 4 quad cores). In line with the results from figure 6, the total number of GMRES iterations decreases monotonously for all test cases from no preconditioning to Jacobi to ROBO-SGS-0 up to ROBO-SGS-4, then ILU and then ILU-CSC. By contrast to the finite volume case, Jacobi is a powerful preconditioner. Regarding ROBO-SGS, the decrease is strongest for the first two orders, namely when the constant and the linear part of the solution are taken into account. Consequently, the method of this class with the minimal CPU time is ROBO- SGS-1. When comparing ROBO-SGS to ILU, we see that while both ILU and ILU-CSC are more powerful preconditioners, this does not pay in terms of 21

22 Preconditioner Iter. CPU in s Add. MPU in KB None 218, ,968 - Jacobi 23,369 90, ROBO-SGS-0 18,170 77, ROBO-SGS-1 14,639 66, ROBO-SGS-2 13,446 76, ROBO-SGS-3 13,077 87, ROBO-SGS-4 13, , ILU(0) 12, , ILU(0)-CSC 11, , Table 3 Total GMRES iterations, total CPU times and additional memory per unknown (MPU) required for the preconditioner, sphere test case. CPU time and thus the overall best preconditioner is either ROBO-SGS-1 or ROBO-SGS-2. Regarding storage, the increase in storage is nonlinear with increasing orders, where ROBO-SGS-4 takes into account the full off diagonal block and thus needs the same amount of storage as ILU, which is more than double as much as ROBO-SGS-1. This demonstrates again the huge size of the blocks of the Jacobian in a DG scheme and why it is so important to find a way of treating these in an efficient way. Furthermore, it explains why Jacobi is actually a good preconditioner. 7.4 Parallel scalability CPUs Unkn./CPU Iter. CPU in s Scaling 64 45, ,750 2, % ,375 2, % 512 5,688 3, % Table 4 Parallel scaling of ROBO-SGS-2 for cylinder test case on Hermit. Now, we demonstrate the strong parallel scaling of the implicit method using ROBO-SGS-2 as a preconditioner, meaning the scaling of solution time with the number of processors for a fixed problem size. As a test case, we use the cylinder problem and the results are shown in table 4. As can be seen, the 22

23 number of iterations required for the solution of the linear system increases as expected due to the parallel implementation of the preconditioner that ignores communication. However, the overall parallel efficiency of the ROBO- SGS scheme is very good, even with only 5,688 unknowns on a processor. 8 Conclusions In this work, a new class of preconditioners for use in implicit time integration schemes in the context of DG methods for the three dimensional unsteady compressible Navier-Stokes equations is introduced. A particular problem for implicit DG schemes is the size of the blocks the Jacobian consists of. The new class of preconditioners, ROBO-SGS, uses a reduced off diagonal block order to significantly decrease memory usage, setup and application cost of the preconditioner. The choice of the order of the off block diagonals allows us to recover Jacobi (no off block) and full SGS as special cases of this new preconditioner class. In this way, the new method uses less CPU time than ILU or Jacobi preconditioning, while still retaining an excellent strong parallel scaling on up to 512 processors. Acknowledgements The research presented in this paper was supported in parts by the Deutsche Forschungsgemeinschaft (DFG) in the context of the Schwerpunktprogramm MetStroem, the cluster of excellence Simulation Technology (SimTech) and the SFB/TR TRR 30. References [1] D. Arnold. An Interior Penalty Finite Element Method with Discontinuous Elements. PhD thesis, The University of Chicago, [2] F. Bassi, A. Ghidoni, and S. Rebay. Optimal Runge-Kutta smoothers for the p- multigrid discontinuous Galerkin solution of the 1D Euler equations. J. Comp. Phys., 11: , [3] F. Bassi, A. Ghidoni, S. Rebay, and P. Tesini. High-order accurate p-multigrid discontinuous Galerkin solution of the Euler equations. Int. J. Num. Meth. Fluids, 60: ,

24 [4] F. Bassi and S. Rebay. A high-order discontinuous Galerkin finite element method solution of the 2D Euler equations. J. Comput. Phys., 138: , [5] F. Bassi, S. Rebay, G. Mariotti, S. Pedinotti, and M. Savini. A highorder accurate discontinuous finite element method fir inviscid an viscous turbomachinery flows. In R. Decuypere and G. Dibelius, editors, Proceedings of 2nd European Conference on Turbomachinery, Fluid and Thermodynamics, pages , Technologisch Instituut, Antwerpen, Belgium, [6] B. Cockburn and C.-W. Shu. The local discontinuous Galerkin method for timedependent convection diffusion systems. SIAM J. Numer. Anal., 35: , [7] R. Dembo, R. Eisenstat, and T. Steihaug. Inexact Newton methods. SIAM J. Numer. Anal., 19: , [8] L. T. Diosady and D. L. Darmofal. Preconditioning methods for discontinuous Galerkin solutions of the Navier-Stokes equations. J. Comp. Phys., 228: , [9] V. Dolejší and M. Feistauer. A semi-implicit discontinuous Galerkin finite element method for the numerical solution of inviscid compressible flow. J. Comp. Phys., 198: , [10] S. C. Eisenstat and H. F. Walker. Choosing the forcing terms in an inexact newton method. SIAM J. Sci. Comp., 17:16 32, [11] K. J. Fidkowski, T. A. Oliver, Lu. J., and D. L. Darmofal. p-multigrid solution of high-order discontinuous Galerkin discretizations of the compressible Navier- Stokes equations. J. Comp. Phys., 207:92 113, [12] G. Gassner, F. Lörcher, and C.-D. Munz. A contribution to the construction of diffusion fluxes for finite volume and discontinuous Galerkin schemes. J. Comput. Phys., 224: , [13] G. Gassner, F. Lörcher, and C.-D. Munz. A contribution to the construction of diffusion fluxes for finite volume and discontinuous Galerkin schemes. J. Comput. Phys., 224(2): , [14] G. Gassner, F. Lörcher, and C.-D. Munz. A Discontinuous Galerkin Scheme based on a Space-Time Expansion II. Viscous Flow Equations in Multi Dimensions. J. Sci. Comput., 34: , [15] G. J. Gassner, F. Lörcher, C.-D. Munz, and J. S. Hesthaven. Polymorphic nodal elements and their application in discontinuous galerkin methods. Journal of Computational Physics, 228(5): , [16] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II. Springer, Berlin, Series in Computational Mathematics 14, 3. edition, [17] J. S. Hesthaven and T. Warburton. Nodal Discontinuous Galerkin Methods: Algorithms, Analysis, and Applications. Springer,

25 [18] J. Jeong and F. Hussain. On the identification of a vortex. J. Fluid Mech., 285:69 94, [19] G. Jothiprasad, D. J. Mavriplis, and D. A. Caughey. Higher-order time integration schemes for the unsteady Navier-Stokes equations on unstructured meshes. J. Comp. Phys., 191: , [20] A. Kanevsky, M. H. Carpenter, D. Gottlieb, and J. S. Hesthaven. Application of implicit-explicit high order Runge-Kutta methods to discontinuous-galerkin schemes. J. Comp. Phys., 225: , [21] C. A. Kennedy and M. H. Carpenter. Additive Runge-Kutta schemes for convection-diffusion-reaction equations. Appl. Num. Math., 44: , [22] C. M. Klaij, M. H. van Raalte, J. J. W. van der Vegt, and H. van der Ven. h-multigrid for space-time discontinuous Galerkin discretizations of the compressible Navier-Stokes equations. J. Comp. Phys., 227: , [23] D. A. Knoll and D. E. Keyes. Jacobian-free Newton-Krylov methods: a survey of approaches and applications. J. Comp. Phys., 193: , [24] D. A. Kopriva, S. L. Woodruff, and M. Y. Hussaini. Computation of electromagnetic scattering with a non-conforming discontinuous spectral element method. Int. J. Num. Meth. in Eng., 53: , [25] F. Lörcher. Predictor-Corrector DG schemes for the numerical solution of the compressible Navier-Stokes equations in complex domains. Dissertation, Universität Stuttgart, [26] F. Lörcher, G. Gassner, and C.-D. Munz. An explicit discontinuous Galerkin scheme with local time-stepping for general unsteady diffusion equations. J. Comput. Phys., 227(11): , [27] G. May, F. Iacono, and A. Jameson. A hybrid multilevel method for high-order discretization of the Euler equations on unstructured meshes. J. Comp. Phys., 229: , [28] P. R. McHugh and D. A. Knoll. Comparison of standard and matrix-free implementations of several Newton-Krylov solvers. AIAA J., 32(12): , [29] A. Meister and C. Vömel. Efficient Preconditioning of Linear Systems arising from the Discretization of Hyperbolic Conservation Laws. Advances in Computational Mathematics, Vol.14(1):49 73, [30] C. R. Nastase and D. J. Mavriplis. High-order discontinuous Galerkin methods using an hp-multigrid approach. J. Comp. Phys., 213: , [31] J. Peraire and P.-O. Persson. The Compact Discontinuous Galerkin (CDG) Method for Elliptic Problems. SIAM J. Sci. Comput., 30(4): , [32] P.-O. Persson and J. Peraire. Newton-GMRES Preconditioning for Discontinuous Galerkin discretizations of the Navier-Stokes equations. SIAM J. Sci. Comp., 30: ,

Preconditioning for modal discontinuous Galerkin methods for unsteady 3D Navier-Stokes equations

Preconditioning for modal discontinuous Galerkin methods for unsteady 3D Navier-Stokes equations Philipp Birken a Gregor Gassner b Mark Haas b Claus-Dieter Munz b a University of Kassel, Department of