Preconditioning for modal discontinuous Galerkin methods for unsteady 3D Navier-Stokes equations

Philipp Birken (a), Gregor Gassner (b), Mark Haas (b), Claus-Dieter Munz (b)

(a) University of Kassel, Department of Mathematics, Heinrich-Plett-Str. 40, 34132 Kassel, Germany
(b) University of Stuttgart, Institute of Aerodynamics and Gas Dynamics, Pfaffenwaldring 21, 70569 Stuttgart, Germany

Abstract

We compare different block preconditioners in the context of parallel, time-adaptive, higher order implicit time integration using Jacobian-free Newton-Krylov (JFNK) solvers for discontinuous Galerkin (DG) discretizations of the three-dimensional time-dependent Navier-Stokes equations. A special emphasis of this work is the performance for a relatively high number of processors, i.e. with a low number of elements per processor. For high order DG discretizations, a particular problem that needs to be addressed is the size of the blocks in the Jacobian. Thus, we propose a new class of preconditioners that exploits the hierarchy of modal basis functions and introduces a flexible order of the off-diagonal Jacobian blocks. While the standard preconditioners block Jacobi (no off-blocks) and full symmetric Gauss-Seidel (full off-blocks) are included as special cases, the reduction of the off-block order results in the new scheme ROBO-SGS. This allows us to investigate the impact of the preconditioner's sparsity pattern on the computational performance. Since the number of iterations is not well suited to judge the efficiency of a preconditioner, we additionally consider CPU time for the comparisons. We found that both block Jacobi and ROBO-SGS have good overall performance and good strong parallel scaling behavior.

Key words: Discontinuous Galerkin, Unsteady flows, Navier-Stokes, Implicit methods, Preconditioning, Three dimensional problems

Email addresses: birken@mathematik.uni-kassel.de (Philipp Birken), gassner@iag.uni-stuttgart.de (Gregor Gassner), haas@iag.uni-stuttgart.de (Mark Haas), munz@iag.uni-stuttgart.de (Claus-Dieter Munz).

Preprint submitted to Elsevier, 10 December 2012

1 Introduction

The solution of unsteady compressible viscous flows may lead to stiff problems, in particular for wall bounded flows and flows at low Mach numbers. This means that the time step size in explicit methods is driven by stability, whereas in implicit methods it is driven by accuracy alone. Therefore, implicit methods, which can be constructed to have unbounded stability regions, are attractive for a number of problems and are a standard part of solution procedures for finite volume methods. However, finding an efficient solver for the resulting linear and nonlinear equation systems has turned out to be a difficult problem in the DG case [42], in particular for three-dimensional problems.

The structure of the nonlinear systems to be solved is of the form

$$u - \psi + \alpha \Delta t\, f(u) = 0,$$

where $\psi$ is a known vector, $\alpha$ is a method dependent scalar parameter, $u$ the vector of unknowns and $f$ a function representing the overall spatial discretization. Hence, it does not depend on the specific time integration method. A corresponding statement holds for the linear equation systems which result from an iterative solution process of the nonlinear system. Thus, we will use in this work the 4th order accurate diagonally implicit Runge-Kutta method ESDIRK4 [22]. We choose a 4th order accurate method in time, as we are interested in the simulation of unsteady problems.

Regarding the solvers for the algebraic systems, there is a number of requirements that have to be satisfied. Firstly, three-dimensional computations have strong memory demands that will actually increase in the future, with newer computer generations having less and less memory per core available. This problem is particularly pronounced for DG methods. These use a large number of degrees of freedom per cell, leading to large Jacobian blocks with more intercell connectivity. Already on today's supercomputers one may find oneself running out of memory faster than one would expect from experience with finite volume methods. Secondly, the solver has to scale in parallel to be feasible for supercomputing. Thirdly, it has to be reasonably fast, which is still a challenge in the DG context. Finally, the implementation cost should be as low as possible, where here, we are concerned with the additional coding needed to make an explicit DG method implicit. If all of these requirements are met (low storage requirements, parallel scaling, fast convergence and ease of implementation), the use of high order methods in the industrial context would become more feasible.

At this point it has to be noted that there is not yet a standard DG method and that in our experience, the question of an efficient solver depends on the specific discretization in a nontrivial way.

However, it seems that an efficient DG method makes use of what is called a nodal basis in some way [17]. Here, we will consider the mixed modal-nodal variant suggested by Gassner et al. [15], which is based on a modal basis but uses a nodal basis for integration. For the diffusive terms, the dGRP flux is used [13]. Furthermore, we did some preliminary tests of the methodology on a DG Spectral Element method (DGSEM), e.g. [25], which showed that the preconditioning procedure used here indeed has to be modified in that context.

Basic candidates for solvers are FAS multigrid and preconditioned Jacobian-free Newton-Krylov (JFNK) methods, whereby multigrid can also be used as a preconditioner. FAS multigrid is the method of choice for steady Euler flows when using finite volume schemes. In the context of unsteady flows, it is often embedded in a dual time stepping procedure, which seems to be a slow method, as reported by several authors, e.g. [20,6]. When looking at DG methods, the design of a fast multigrid solver is an open problem both for steady and unsteady flows [31,23,3,2,34,28]. We will not attack this problem and consider JFNK schemes in this work. There, the linear systems are solved using Krylov subspace methods, which do not need the system matrix explicitly, but only matrix vector products. Since the system matrix is a Jacobian, these products can be approximated using finite differences, circumventing in theory the construction and storage of the Jacobian. In practice, a preconditioner is needed, making the schemes not completely matrix-free. Regarding the specific Krylov subspace method, it turns out that GMRES is the best choice in this context [24]. The Newton method is an inexact Newton method, where a good strategy to control the termination criteria for the linear solver is necessary; namely, the strategy by Eisenstat and Walker is used [11].

As mentioned, a preconditioner is necessary to speed up GMRES. Here, multigrid can be used as a preconditioner, which was considered by several authors for the steady Euler equations in two dimensions [31,9,28]. Generally, multigrid preconditioners would satisfy the requirements mentioned, but they face the same problem as FAS multigrid: the lack of theory for DG discretizations leads to nonoptimal methods. Therefore, we will not consider these schemes here. Regarding other preconditioners, a number of authors have considered Newton-Krylov methods for two-dimensional flows, in particular Rasetarinera and Hussaini (steady NS) [36], Dolejsi and Feistauer (steady Euler) [10], Darmofal et al. (steady NS) [12,9], Kanevsky et al. (unsteady Euler, NS) [21], as well as Persson and Peraire (unsteady NS) [33].

Whereas all the work mentioned above considers two-dimensional flow problems, we will focus on the 3D unsteady case. As opposed to the case of finite volume methods, where going from 2D to 3D increases the number of unknowns per cell from four to five and thus by just one, we have to multiply the number of unknowns per cell with a factor dependent on the polynomial degree, resulting in hundreds of unknowns per cell with an according block structure.

Thus, the two-dimensional case is in our opinion not representative of the three-dimensional one, and furthermore, the discontinuous Galerkin case is very different from the finite volume case. Therefore, a successful implicit DG scheme must take these huge blocks in the Jacobian into account.

In this work, we examine the performance of different preconditioners: block Jacobi, block symmetric Gauß-Seidel (SGS), block ILU and a multilevel block ILU suggested by Persson and Peraire [33]. Furthermore, we propose a new class of SGS-type preconditioners, which we call ROBO-SGS (Reduced Off-diagonal Block Order). This new class exploits the hierarchical basis of the mixed modal-nodal DG method to reduce the order of the off-diagonal Jacobian blocks and includes the block Jacobi and the full SGS preconditioner as special cases. An approach that is similar in spirit has been suggested by Renac et al. [37]. The variable sparsity pattern of the ROBO-SGS preconditioner class gives us the possibility to investigate the impact of the preconditioner's matrix structure on the overall performance.

The point about the comparison is that, first of all, it is done on three-dimensional test cases; second, it is done in a realistic setting of a time adaptive scheme with a smart choice of tolerances in Newton's method and a parallel solver; and third, we compare not only iteration numbers, but also CPU time. This is important, because iteration counts show the accuracy of a preconditioner but not its efficiency, since the cost of applying the preconditioner is neglected. As we are interested in high fidelity simulations (direct numerical simulation or large eddy simulation) of compressible turbulent flow problems, we are solely interested in unsteady computations on large parallel architectures. Thus, an important aspect of the investigations is the parallel scaling of the methods and the impact of the preconditioner on the parallel performance.

The outline of the paper is as follows: First we will describe the governing equations and the DG methodology used. Then we will briefly discuss the ESDIRK4 method, after which we will describe the JFNK method and the different preconditioners. Finally, numerical results are presented where we compare the different preconditioners.

2 Governing equations

The Navier-Stokes equations are a second order system of conservation laws (mass, momentum, energy) modeling viscous compressible flow. Written in conservative variables density $\rho$, momentum $m$ and energy per unit volume $\rho e$:

$$\partial_t \rho + \nabla \cdot m = 0,$$

$$\partial_t m_i + \sum_{j=1}^{d} \partial_{x_j} \left( m_i v_j + p \delta_{ij} \right) = \frac{1}{Re} \sum_{j=1}^{d} \partial_{x_j} S_{ij} + q_i, \quad i = 1, \dots, d,$$

$$\partial_t (\rho e) + \nabla \cdot (H m) = \frac{1}{Re} \sum_{j=1}^{d} \partial_{x_j} \left( \sum_{i=1}^{d} S_{ij} v_i - \frac{1}{Pr} W_j \right) + q_e.$$

Here, $d$ stands for the number of dimensions, $H$ for the enthalpy per unit mass, $S$ represents the viscous shear stress tensor and $W$ the heat flux. As the equations are dimensionless, the Reynolds number $Re$ and the Prandtl number $Pr$ appear. The equations are closed by the equation of state for the pressure $p = (\gamma - 1)\rho e$, where we assume a perfect gas. Finally, $q_e$ denotes a possible source term in the energy equation, whereas $q = (q_1, \dots, q_d)^T$ is a source term in the momentum equation, for example due to external forces.

3 Spatial Discretization

We employ the mixed modal-nodal discontinuous Galerkin scheme which has been suggested by Gassner et al. [15]. One of the main advantages of this method is that it allows the use of elements of arbitrary shape (i.e. tetrahedra, prisms, pyramids, hexahedra, ...) with high order of accuracy. In our experience, discretizations using hexahedra very often require fewer elements and thus fewer total degrees of freedom than ones that only use tetrahedra for approximately the same discretization error. This property becomes extremely important in the context of implicit methods, since the total number of degrees of freedom has a strong influence on the performance of the solver and an even greater impact on the memory consumption. This is crucial for DG methods, especially when doing real world 3D simulations. We will demonstrate these aspects in the following sections.

3.1 The Discontinuous Galerkin Method

We write the Navier-Stokes equations in the form

$$u_t + \nabla \cdot f(u) = q(t, u), \tag{1}$$

with suitable initial and boundary conditions in a domain $\Omega \times [0, T] \subset \mathbb{R}^d \times \mathbb{R}_0^+$. Here, $u = u(\vec{x}, t) \in \mathbb{R}^{d+2}$ is the state vector and $f(u) = f^C(u) - f^D(u, \nabla u)$ is the physical flux, where $f^C(u)$ is the convective (i.e. hyperbolic) and $f^D(u, \nabla u)$ the diffusive (i.e. parabolic) flux component.

The possibly time and space dependent source term is given by $q(t, \vec{x}, u)$.

We derive the DG method by first subdividing the domain $\Omega$ into non-overlapping grid cells $Q$. In each grid cell we approximate the state vector using a local polynomial approximation of the form

$$u(\vec{x}, t) \approx u^Q(\vec{x}, t) = \sum_{j=1}^{N} \hat{u}_j^Q(t)\, \varphi_j^Q(\vec{x}), \tag{2}$$

where in our case, $\{\varphi_j^Q(\vec{x})\}_{j=1,\dots,N}$ are modal hierarchical orthonormal basis functions and $\hat{u}_j^Q$ are the corresponding coefficients in the cell $Q$. The basis functions are constructed from a monomial basis with a simple Gram-Schmidt orthogonalization algorithm for arbitrary (reference) grid cell types. The dimension of the local approximation space depends on the spatial dimension $d$ and the polynomial degree $p$:

$$N = N(p, d) = \frac{(p + d)!}{p!\, d!}. \tag{3}$$

The next step of our approximation is to define how the unknown degrees of freedom $\hat{u}_j(t)$ are determined. The basis of the considered discontinuous Galerkin method is a weak formulation. Neglecting the source term for now, we insert the approximate solution (2) into the conservation law (1), multiply with a smooth test function $\phi = \phi(\vec{x})$ and integrate over $Q$ to obtain

$$\left\langle u_t^Q + \nabla \cdot f\left(u^Q\right), \phi \right\rangle_Q = 0, \tag{4}$$

where $\langle \cdot, \cdot \rangle_Q$ denotes the $L^2(Q)$ scalar product over $Q$. We proceed with an integration by parts to obtain

$$\left\langle u_t^Q, \phi \right\rangle_Q + \left( f(u) \cdot \vec{n}, \phi \right)_{\partial Q} - \left\langle f\left(u^Q\right), \nabla \phi \right\rangle_Q = 0, \tag{5}$$

where $(\cdot, \cdot)_{\partial Q}$ denotes the surface integral over the boundary of the element $Q$. As the approximate solution is in general discontinuous across grid cell interfaces, the trace of the flux normal component $f(u) \cdot \vec{n}$ is not uniquely defined. To get a stable and accurate discretization, several choices for the numerical approximation are known. Here, we use the HLLC flux [40]. For a purely convective problem, inserting the trace approximation $f\left(u^Q\right) \cdot \vec{n} \approx g^C(u^-, u^+, \vec{n})$ into equation (5) would yield

$$\left\langle u_t^Q, \phi \right\rangle_Q + \left( g^C\left(u^-, u^+, \vec{n}\right), \phi \right)_{\partial Q} - \left\langle f^C\left(u^Q\right), \nabla \phi \right\rangle_Q = 0. \tag{6}$$

We denote by $(\cdot)^-$ values at the inner side of a cell interface, i.e. values that depend on $u^Q$, and by $(\cdot)^+$ values that depend on the neighbor cells sharing the interface with the cell $Q$.
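The local space dimension $N(p, d)$ of Eq. (3) is what later determines the block sizes in the Jacobian (Sections 5.2 and 6). As a quick reference, here is a minimal Python check of Eq. (3); it is purely illustrative and not part of the solver described in this paper:

```python
from math import comb

def local_dofs(p: int, d: int) -> int:
    """Dimension N(p, d) = (p + d)! / (p! d!) of the local polynomial space, Eq. (3)."""
    return comb(p + d, d)

# Block size per cell is N(p, d) * n_var; in 3D (d = 3, n_var = 5) it grows rapidly:
for p in range(1, 7):
    print(p, local_dofs(p, 3), local_dofs(p, 3) * 5)
# p = 5 gives N = 56, i.e. 280 unknowns per cell, consistent with the block sizes
# discussed in Sections 5.2 and 6.
```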

The handling of the diffusive part of the flux is a little more delicate for DG methods, because the jump in the gradients needs special handling. Several authors have suggested solutions for this problem [32,7,1,4,5] and all of these have been used in conjunction with implicit temporal discretizations. In this work we apply the dGRP flux of Gassner, Lörcher and Munz [13,14,27]. The dGRP flux is an extension of the symmetric interior penalty (SIP) method for the compressible Navier-Stokes equations that guarantees optimal order of convergence. We choose this variant of the diffusion flux as it has been derived in a way that optimizes stability, i.e. minimizes the eigenvalues of the DG operator [13,14]. From a technical point of view this flux introduces an approximation of the trace of the flux normal component $f^D(u, \nabla u) \cdot \vec{n} \approx g^{dGRP}(u^-, \nabla u^-, u^+, \nabla u^+, \eta, \vec{n})$, where $\eta$ is a parameter that depends on the geometry of the cell $Q$ and its neighbor and the local order of the polynomial approximation (2). To ensure adjoint consistency, an additional surface flux term $h(u^-, u^+, \vec{n})$ is introduced via two integrations by parts [14], yielding the final DG formulation

$$\left\langle u_t^Q, \phi \right\rangle_Q + \left( g^C - g^{dGRP}, \phi \right)_{\partial Q} - \left( h, \nabla \phi \right)_{\partial Q} - \left\langle f^C - f^D, \nabla \phi \right\rangle_Q = 0. \tag{7}$$

The coupling between elements, and thus the resulting fill-in of the Jacobian matrix, is comparable to the standard SIP method and the commonly used second method of Bassi and Rebay (BR2). We thus expect that the results shown below are directly applicable to those diffusive flux variants as well, whereas flux functions with different element coupling, such as e.g. the local DG (LDG) and its modification, the compact DG (CDG), may perform differently, although we expect the impact of the choice of the diffusive flux function not to be significant.

3.2 Nodal Integration

The computation of the volume and surface quadrature operators can be a very expensive task if standard methods such as Gaussian quadrature are used, which is caused by the high number of polynomial evaluations required for computing the fluxes. Based on the nodal DG scheme developed by Hesthaven and Warburton [17], Gassner et al. developed a way of constructing efficient quadrature operators that work on arbitrarily shaped elements, see [15] for further details. This has the advantage that the number of degrees of freedom does not depend on the element shape, as it would for a purely nodal scheme when using elements other than tetrahedra. As points to define the nodal basis, Legendre-Gauss-Lobatto (LGL) points are used on edges and then a method called LGL-type nesting is used to determine the interior points, which leads to a small Lebesgue constant.

The coexistence of modal and nodal elements is quite natural for a DG scheme, since the transformation from modal ($\hat{u}$) to nodal ($\tilde{u}$) degrees of freedom is nothing else but a polynomial evaluation of the modal polynomials at the nodal interpolation points, which can be expressed in the form of a matrix-vector multiplication:

$$\tilde{u} = V \hat{u}. \tag{8}$$

Here $V$ is a Vandermonde matrix containing the evaluations of the modal polynomials at the interpolation points. The back transformation can be implemented using the inverse of the Vandermonde matrix:

$$\hat{u} = V^{-1} \tilde{u}. \tag{9}$$

If the number of nodal interpolation points is different from the number of modal degrees of freedom, as is the case for elements other than tetrahedra, the approximate inverse $V^{-1}$ is defined using a least squares procedure based on singular value decomposition [15].

The nodal DG method can be conveniently formulated in terms of matrices representing the discrete integrals in (7):

$$M \tilde{u}_t + \underbrace{\sum_{i=1}^{nfaces} \left( M_i^S g_i - N_i h_i \right)}_{\text{surface integral}} - \underbrace{\sum_{k=1}^{d} S_k f_k}_{\text{volume integral}} = 0. \tag{10}$$

In the case of nonlinear equations, such as e.g. the compressible Navier-Stokes equations, the nonlinearity is present in the evaluation of the fluxes. In Eq. (10), $f_k$, $k = 1, \dots, d$, are the vectors of flux evaluations at all nodal points, while $g_i$ and $h_i$ stand for the evaluations of the surface flux approximations at the nodal points of the element face $i$. The operators in Eq. (10) are designed to act on nodal input vectors and to produce a nodal output. Using Eq. (9), all the operators in Eq. (10) can be modified in order to produce a modal output, yielding the mixed modal-nodal DG method:

$$\hat{u}_t = V^{-1} M^{-1} \left( -\sum_{i=1}^{nfaces} \left( M_i^S g_i - N_i h_i \right) + \sum_{k=1}^{d} S_k f_k \right). \tag{11}$$
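To make the modal-nodal transformation of Eqs. (8), (9) and its least-squares back transformation concrete, here is a small self-contained Python sketch using a one-dimensional Legendre basis with more interpolation points than modes. It only illustrates the Vandermonde-based change of basis and the SVD-based pseudo-inverse; it is not the actual multidimensional operators of the mixed modal-nodal scheme.

```python
import numpy as np
from numpy.polynomial import legendre

p = 5                                                  # polynomial degree of the modal basis
nodes = np.cos(np.pi * np.arange(p + 3) / (p + 2))     # p+3 interpolation points in [-1, 1]

# Vandermonde matrix: column j holds the j-th Legendre polynomial evaluated at the nodes.
V = legendre.legvander(nodes, p)                       # shape (p+3, p+1): more nodes than modes

u_hat = np.random.rand(p + 1)                          # modal coefficients
u_tilde = V @ u_hat                                    # Eq. (8): nodal values

# Eq. (9) with a rectangular V: least-squares back transformation via SVD (np.linalg.pinv).
V_pinv = np.linalg.pinv(V)
u_hat_back = V_pinv @ u_tilde
print(np.allclose(u_hat, u_hat_back))                  # True: modal coefficients are recovered
```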

4 Time integration scheme

Equation (11) represents a system of ordinary differential equations (ODEs) in the cell $Q$. If we combine the modal coefficient vectors in one vector $u \in \mathbb{R}^m$, we obtain a large system of ODEs

$$u_t(t) = f(t, u(t)), \tag{12}$$

where $f$ is a vector valued function corresponding to the right hand side in (11) for the whole grid. Generally, from now on a vector with an underbar will denote a vector from $\mathbb{R}^m$. We denote the time step size by $\Delta t$ and $u^n$ is the numerical approximation to $u(t_n)$. Note that the explicit dependence of the right hand side in (12) on $t$ is relevant only for time dependent boundary conditions or time dependent source terms.

4.1 ESDIRK4

We will restrict ourselves here to the explicit step singly diagonally implicit Runge-Kutta method of fourth order (ESDIRK4), designed in [22]. Given coefficients $a_{ij}$ and $b_i$, such a method with $s$ stages can be written as

$$U_i = u^n + \Delta t \sum_{j=1}^{i} a_{ij} f(t_n + c_j \Delta t_n, U_j), \quad i = 1, \dots, s, \tag{13}$$

$$u^{n+1} = u^n + \Delta t \sum_{j=1}^{s} b_j f(t_n + c_j \Delta t_n, U_j). \tag{14}$$

Thus, all entries of the Butcher array in the strictly upper triangular part are zero. The coefficients can be obtained from Table 1. This scheme is A-stable, also L-stable and stiffly accurate.

c_i       | a_i1               a_i2                a_i3                a_i4                a_i5                a_i6
0         | 0                  0                   0                   0                   0                   0
2γ        | γ                  γ                   0                   0                   0                   0
83/250    | 0.137776           -0.055776           γ                   0                   0                   0
31/50     | 0.144636866026982  -0.223931907613345  0.449295041586363   γ                   0                   0
17/20     | 0.098258783283565  -0.591544242819670  0.810121053828300   0.283164405707806   γ                   0
1         | 0.157916295161671  0                   0.186758940524001   0.680565295309335   -0.275240530995007  γ
b_i       | 0.157916295161671  0                   0.186758940524001   0.680565295309335   -0.275240530995007  γ
b̂_i       | 0.154711800763212  0                   0.189205191660680   0.702045371228922   -0.319187399063579  0.273225035410765
b_i - b̂_i | 0.003204494398459  0                   -0.002446251136679  -0.021480075919587  0.043946868068572   -0.023225035410765

Table 1: Butcher diagram for ESDIRK4 with γ = 0.25.

The point about DIRK schemes is that the computation of the stage vectors corresponds to the sequential application of several implicit Euler steps. With the starting vectors

$$s_i = u^n + \Delta t \sum_{j=1}^{i-1} a_{ij} f(t_n + c_j \Delta t_n, U_j), \tag{15}$$

we can solve for the stage values

$$U_i = s_i + \Delta t\, a_{ii} f(t_n + c_i \Delta t_n, U_i). \tag{16}$$

Equation (16) corresponds to a step of the implicit Euler method with starting vector $s_i$ and time step $a_{ii} \Delta t$. Note that because the method is stiffly accurate, $u^{n+1} = U_s$ and thus we do not need to evaluate (14).
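To illustrate how Eqs. (15)-(16) turn one step of a stiffly accurate DIRK scheme into a sequence of implicit-Euler-type solves, the following Python sketch advances a generic ODE $u_t = f(t, u)$ by one step; the ESDIRK4 coefficients of Table 1 would be supplied as the arrays `A` and `c`. The inner solver `solve_stage` is a plain dense Newton placeholder of our own; in the paper's setting this role is played by the preconditioned JFNK solver of Section 5.

```python
import numpy as np

def dirk_step(f, t_n, u_n, dt, A, c):
    """One step of a stiffly accurate DIRK scheme, Eqs. (13)-(16).

    f(t, u) -> array; A is the (s x s) lower triangular Butcher matrix, c the nodes.
    Because the scheme is stiffly accurate, the last stage is the new solution.
    """
    s = len(c)
    stage_derivs = []                    # f(t_n + c_j dt, U_j) for previous stages
    for i in range(s):
        # starting vector, Eq. (15): known contributions of previous stages
        s_i = u_n + dt * sum(A[i, j] * stage_derivs[j] for j in range(i))
        if A[i, i] == 0.0:               # explicit first stage of ESDIRK schemes
            U_i = s_i
        else:                            # implicit Euler type equation, Eq. (16)
            U_i = solve_stage(lambda U: U - s_i - dt * A[i, i] * f(t_n + c[i] * dt, U), s_i)
        stage_derivs.append(f(t_n + c[i] * dt, U_i))
    return U_i                           # u^{n+1} = U_s for stiffly accurate methods

def solve_stage(F, u0, tol=1e-10, maxit=50):
    """Small dense Newton solver with a finite difference Jacobian (placeholder only)."""
    u = u0.astype(float).copy()
    n = u.size
    for _ in range(maxit):
        r = F(u)
        if np.linalg.norm(r) < tol:
            break
        J = np.empty((n, n))
        eps = 1e-7
        for k in range(n):               # column-wise finite difference Jacobian
            e = np.zeros(n); e[k] = eps
            J[:, k] = (F(u + e) - r) / eps
        u = u - np.linalg.solve(J, r)
    return u
```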

The first explicit stage of the Runge-Kutta schemes allows them to have a stage order of two, but also means that the methods cannot be algebraically stable. Furthermore, the explicit stage involving $f(t_n, u^n)$ allows to reuse the last stage derivative from the last time step, since $f(t_{n+1}, u^{n+1})$ from the last time step is the same quantity, thus avoiding an evaluation of the right hand side.

4.2 Adaptive time step size selection

For unsteady flows, we need to make sure that the time integration error can be controlled. To do this, we estimate the time integration error and select the time step size accordingly. This is done using an embedded scheme of lower order $\hat{p}$; for ESDIRK4, $\hat{p} = 3$. Comparing the local truncation errors of both schemes, we obtain the following estimate for the local error of the lower order scheme:

$$l = \Delta t_n \sum_{j=1}^{s} \left( b_j - \hat{b}_j \right) f(t_n + c_j \Delta t_n, U_j). \tag{17}$$

To determine the new step size, we decide beforehand on a target error tolerance and use a common fixed resolution test [39]. This means that we define the error tolerance per component via

$$d_i = RTOL \cdot |u_i^n| + ATOL, \tag{18}$$

where $RTOL$ and $ATOL$ are the relative and absolute tolerances. We always choose $ATOL = RTOL =: TOL$. Then we compare this to the local error estimate by requiring $\|l./d\| \le 1$, where $./$ denotes a pointwise division operator and we use the 2-norm throughout the text.

The next question is how the time step has to be chosen such that the error can be controlled. The classical method is the following, also called EPS (error per step) control [16]:

$$\Delta t_{new} = \Delta t_n \, \|l./d\|^{-1/(\hat{p}+1)}. \tag{19}$$

This is combined with two safety factors to avoid volatile increases or decreases in time step size:

if $\|l./d\| > 1$:
    $\Delta t_{n+1} = \Delta t_n \max\left(f_{min},\, f_{safety} \|l./d\|^{-1/(\hat{p}+1)}\right)$,
else:
    $\Delta t_{n+1} = \Delta t_n \min\left(f_{max},\, f_{safety} \|l./d\|^{-1/(\hat{p}+1)}\right)$.

Here, we chose $f_{min} = 0.3$, $f_{max} = 2.0$ and $f_{safety} = 0.9$.
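A compact Python sketch of this EPS controller with the safety factors just given (illustrative only; the error estimate `l` from Eq. (17) is assumed to be computed elsewhere):

```python
import numpy as np

F_MIN, F_MAX, F_SAFETY = 0.3, 2.0, 0.9   # safety factors used in the paper

def new_step_size(l, u_n, dt_n, tol, p_hat=3):
    """EPS (error per step) controller for an embedded scheme of order p_hat."""
    d = tol * np.abs(u_n) + tol          # Eq. (18) with ATOL = RTOL = TOL
    err = np.linalg.norm(l / d)          # 2-norm of the scaled error estimate
    err = max(err, 1e-16)                # guard against a vanishing estimate
    factor = err ** (-1.0 / (p_hat + 1)) # Eq. (19)
    if err > 1.0:                        # error too large: limit the decrease
        return dt_n * max(F_MIN, F_SAFETY * factor)
    else:                                # step accepted: limit the increase
        return dt_n * min(F_MAX, F_SAFETY * factor)
```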

5 Solver for the nonlinear equation systems

The application of the implicit time discretizations used herein leads to a globally coupled nonlinear system of equations of the form

$$u^{\nu+1} = a_{\nu+1} \Delta t\, f\left(u^{\nu+1}\right) + R_{fix}\left(u^1, \dots, u^\nu\right), \tag{20}$$

where we assume that $u^1, \dots, u^\nu$ are given, $R_{fix}$ depends on the space and time discretization and $a_{\nu+1}$ is a constant depending on the time discretization method.

5.1 Inexact Newton method

To solve the appearing nonlinear systems, we use an inexact Newton's method, which is locally convergent and can exhibit quadratic convergence. This method solves the root equation

$$u - a_{\nu+1} \Delta t\, f(u) - R_{fix}\left(u^1, \dots, u^\nu\right) =: F(u) = 0.$$

As a termination criterion we use a residual based one similar to (18) with a relative tolerance, resulting in $\|F(u^k)\| \le \epsilon = \epsilon_r \|F(u^0)\|$. If the iteration does not converge after a maximal number of iterations, the time step is repeated with half the time step size.

In particular, we will use the inexact Newton's method from [8], where the linear system in the $k$-th Newton step is solved only up to a relative tolerance, given by a forcing term $\eta_k$. This can be written as:

$$\left\| \left. \frac{\partial F(u)}{\partial u} \right|_{u^{(k)}} \Delta u + F\left(u^{(k)}\right) \right\| \le \eta_k \left\| F\left(u^{(k)}\right) \right\|, \qquad u^{(k+1)} = u^{(k)} + \Delta u, \quad k = 0, 1, \dots \tag{21}$$

In [11], the choice of the sequence of forcing terms is discussed and it is proved that the inexact Newton iteration (21) converges linearly. Moreover,

- if $\eta_k \to 0$, the convergence is superlinear, and
- if $\eta_k \le K_\eta \|F(u^k)\|^p$ for some $K_\eta > 0$, $p \in [0, 1]$, the convergence is superlinear with order $1 + p$.

In particular, this means that for a properly chosen sequence of forcing terms, the convergence can be quadratic. A way of achieving this (as proved in [11]) is the following:

$$\eta_k^A = \gamma \frac{\|F(u^k)\|^2}{\|F(u^{k-1})\|^2}$$

with a parameter $\gamma \in (0, 1]$. The theorem says that convergence is quadratic if this sequence is bounded away from one uniformly. Therefore, we set $\eta_0 = \eta_{max}$ for some $\eta_{max} < 1$ and for $k > 0$:

$$\eta_k = \min(\eta_{max}, \eta_k^A).$$

Eisenstat and Walker furthermore suggest safeguards to avoid volatile decreases in $\eta_k$. To this end, $\gamma \eta_{k-1}^2 > 0.1$ is used as a condition to determine if $\eta_{k-1}$ is rather large, and thus the definition of $\eta_k$ is refined to

$$\eta_k^B = \min\left(\eta_{max}, \max\left(\eta_k^A, \gamma \eta_{k-1}^2\right)\right).$$

Finally, to avoid oversolving in the final stages, they use

$$\eta_k = \min\left(\eta_{max}, \max\left(\eta_k^B, 0.5\,\epsilon / \|F(u^k)\|\right)\right),$$

where $\epsilon$ is the tolerance at which the nonlinear iteration would terminate.
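Read as code, the forcing term strategy could look as follows (a sketch; the defaults $\gamma = 0.9$ and $\eta_{max} = 0.9$ are our own illustrative choices, not values from the paper):

```python
def forcing_term(res_k, res_km1, eta_km1, eps, k,
                 gamma=0.9, eta_max=0.9):
    """Eisenstat-Walker forcing term with the safeguards described in Section 5.1.

    res_k, res_km1: ||F(u^k)|| and ||F(u^{k-1})||; eta_km1: previous forcing term;
    eps: termination tolerance of the Newton iteration.
    gamma and eta_max are illustrative defaults, not values taken from the paper.
    """
    if k == 0:
        return eta_max
    eta_A = gamma * (res_k / res_km1) ** 2
    # safeguard against volatile decreases if the previous eta was rather large
    if gamma * eta_km1 ** 2 > 0.1:
        eta_B = min(eta_max, max(eta_A, gamma * eta_km1 ** 2))
    else:
        eta_B = min(eta_max, eta_A)
    # avoid oversolving close to the nonlinear termination tolerance
    return min(eta_max, max(eta_B, 0.5 * eps / res_k))
```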

5.2 Linear Solver

In each iteration of the inexact Newton method, a linear equation system of the form $Ax = b$, $A \in \mathbb{R}^{m \times m}$, has to be solved with

$$A = I - a_{\nu+1} \Delta t \left. \frac{\partial f}{\partial u} \right|_{\bar{u}}.$$

In general, $A$ consists of $n_E$ block rows, where $n_E$ is the number of elements in our computational domain. Each row $i$ contains a dense main diagonal block of the size $(N_i \cdot n_{Var}) \times (N_i \cdot n_{Var})$, where $N_i$ is the number of degrees of freedom of the cell $i$ given by Eq. (3) and $n_{Var}$ is the number of unknowns in the equations that are being solved. For the compressible Navier-Stokes equations, $n_{Var} = 4$ in the two-dimensional case and $n_{Var} = 5$ in the three-dimensional case. In addition to the main block, each Neumann neighbor $j$ of $i$, i.e. a neighbor $j$ sharing a common face with $i$, contributes a block of the size $(N_i \cdot n_{Var}) \times (N_j \cdot n_{Var})$, which leads to a quickly rising number of entries when using higher order DG methods.

Fig. 1. Memory considerations for the Jacobian matrix: (a) memory required for each element [MB] and (b) number of degrees of freedom that can be stored in the Jacobian per GB of memory, both over the order of approximation for triangular, quadrilateral, tetrahedral and hexahedral elements.

Figure 1a shows the memory requirements per block row for different element types and orders of approximation in 2D and 3D, where we assumed a uniform order in the computational domain, i.e. $N_j = N_i$. It becomes clear that while memory usually is not an issue for finite volume methods (equivalent to the first order case), it is a critical aspect for DG schemes, especially in the 3D case. In order to illustrate how restrictive this may become, Fig. 1b shows the maximum number of cells a computational domain may contain if only one gigabyte of memory is available for the storage of the Jacobian. It should also become clear now that while a 2D scheme is not a problem for today's computers, this is absolutely not the case for a 3D scheme, as the memory requirements are an order of magnitude more restrictive than in the 2D case.

For real world problems, the number of unknowns is on the order of tens of millions, so direct solvers are infeasible, which leads us to iterative methods. In particular, Krylov subspace methods such as GMRES or BiCGSTAB have been shown to perform well in this context. In their basic version, these schemes need the Jacobian and in addition a preconditioning matrix to improve convergence speed, which is storage-wise a huge problem. Furthermore, the use of approximate Jacobians to save storage and CPU time leads to a decrease of the convergence speed of the Newton method. Therefore, we will use here Jacobian-free Newton-Krylov methods [24], which do not need the Jacobian (but still a preconditioner). The idea is that in Krylov subspace methods, the Jacobian appears only in the form of matrix vector products $Av_i$, which can be approximated by a difference quotient

$$A v_i \approx \frac{F(\bar{u} + \epsilon v_i) - F(\bar{u})}{\epsilon} = v_i - a_{\nu+1} \Delta t\, \frac{f(\bar{u} + \epsilon v_i) - f(\bar{u})}{\epsilon}. \tag{22}$$

The parameter $\epsilon$ is a scalar, where smaller values lead to a better approximation but may lead to truncation errors. A simple choice for the parameter that avoids cancellation but still is moderately small is given by Qin, Ludlow and Shaw [35] as

$$\epsilon = \frac{\sqrt{eps}}{\|u\|_2},$$

where $eps$ is the machine accuracy. Second order convergence is obtained up to $\epsilon$-accuracy if proper forcing terms are employed, since it is possible to view the errors coming from the finite difference approximation as arising from inexact solves. Of the Krylov subspace methods suitable for the solution of unsymmetric linear systems, the GMRES method of Saad and Schultz [38] was explained by McHugh and Knoll [29] to perform better than others in the matrix-free context. The reason for this is that the vectors in the matrix vector multiplications in GMRES are normalised, as opposed to those in other methods.
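A minimal sketch of the Jacobian-free operator of Eq. (22), including the $\epsilon$ choice above; the function names and the use of scipy's `LinearOperator` are our own choices for illustration, not the paper's Fortran implementation:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator

def jfnk_operator(f, u_bar, a_dt):
    """Matrix-free operator for A = I - a_{nu+1} dt * (df/du)|_{u_bar}, cf. Eq. (22).

    f    : the spatial DG operator (right hand side of Eq. (12)),
    a_dt : the product a_{nu+1} * dt from the time integration scheme.
    """
    m = u_bar.size
    f_bar = f(u_bar)
    # difference increment following Qin, Ludlow and Shaw: eps = sqrt(machine eps) / ||u||_2
    eps = np.sqrt(np.finfo(float).eps) / max(np.linalg.norm(u_bar), 1e-14)

    def matvec(v):
        return v - a_dt * (f(u_bar + eps * v) - f_bar) / eps

    return LinearOperator((m, m), matvec=matvec)

# Such an operator can be handed to a Krylov solver, e.g. scipy.sparse.linalg.gmres,
# together with the right preconditioning discussed in Section 6.
```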

The parameter ɛ is a scalar, where smaller values lead to a better approximation but may lead to truncation errors. A simple choice for the parameter, that avoids cancellation but still is moderately small is given by Quin, Ludlow and Shaw [35] as eps ɛ =, u 2 where eps is the machine accuracy. Second order convergence is obtained up to ɛ-accuracy if proper forcing terms are employed, since it is possible to view the errors coming from the finite difference approximation as arising from inexact solves. Of the Krylov subspace methods suitable for the solution of unsymmetric linear systems, the GMRES method of Saad and Schultz [38] was explained by McHugh and Knoll [29] to perform better than others in the matrix free context. The reason for this is that the vectors in matrix vector multiplications in GMRES are normalised, as opposed to those in other methods. 6 Preconditioning It is well known that the speed of convergence of Krylov subspace methods depends strongly on the matrix. Therefore, right preconditioning is used to transform the linear equation system appropriately: AP 1 x P = b, x = P 1 x P. Here, P is an invertible matrix, called a right preconditioner that approximates the system matrix in a cheap way. Every time a matrix vector product Av j appears in a Krylov subspace method, the right preconditioned method is obtained by applying the preconditioner to the vector in advance and then computing the matrix vector product with A. Right preconditioning does not change the initial residual, because r 0 = b 0 Ax 0 = b 0 AP 1 x P 0. This also means that, in contrast to left preconditioning, right preconditioning does not interfere with the Eisenstat-Walker strategy, which is the main reason we use right preconditioning only. Once the termination criterion is fulfilled, the right preconditioner has to be applied one last time to change back from the preconditioned approximation to the unpreconditioned. Often, the preconditioner is not given directly, but implicitly via its inverse. Then its application corresponds to the solution of a linear equation system. If chosen well, the speedup of the Krylov subspace method is significantly and 14

For non-normal matrices as we have here, the existing theory is not sufficient to determine optimal preconditioners in any sense. Therefore, we have to resort to numerical experiments and heuristics. An overview of preconditioners with special emphasis on applications in flow problems can be found in [30]. For the DG case, Persson and Peraire have conducted a survey of several preconditioning methods in [33]. Several methods are interesting in this context.

6.1 Jacobi, SGS and ROBO-SGS

An important class of preconditioners are splitting methods that are based on decomposing $A$ into a (block) diagonal part $D$, an upper diagonal part $U$ and a lower diagonal part $L$ in such a way that $A = L + D + U$. These parts are then used to obtain simple approximations to $A^{-1}$. The simplest method here is block Jacobi, where the off-diagonal blocks are neglected, which leads to

$$P = D. \tag{23}$$

A much more sophisticated method that is a very good preconditioner for compressible flow problems is the symmetric block Gauss-Seidel method (SGS), which corresponds to solving the equation system

$$(D + L) D^{-1} (D + U) x = x_P. \tag{24}$$

As mentioned before, the major issue here is that the Jacobian consists of blocks with hundreds of unknowns for DG methods in 3D. In FV schemes, where the blocks have size 5 in 3D, the off-diagonal blocks are sometimes computed on the fly, whereas only the diagonal is stored. In the DG case, the high construction cost for the element Jacobians makes this infeasible. Therefore, using the full Jacobian $A$ leads to significantly higher memory requirements compared to storing the diagonal $D$ only, e.g. a factor of five for tetrahedra and a factor of seven for hexahedra. Thus, SGS needs a huge amount of storage and leads to rather high application costs. On the other hand, the storage requirements of Jacobi are reasonable, as well as the application cost, but the resulting decrease in iteration numbers is much smaller than for SGS.
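Applying these preconditioners means solving with $P$: for block Jacobi a block-diagonal solve, and for the SGS preconditioner of Eq. (24) a block forward sweep, a multiplication by $D$, and a block backward sweep. A dense-block Python sketch for illustration (assuming equally sized square diagonal blocks stored as numpy arrays, which is our simplification, not the paper's data structure):

```python
import numpy as np

def apply_block_jacobi(D_blocks, r):
    """x = D^{-1} r for a block diagonal D, Eq. (23)."""
    x, offset = np.empty_like(r, dtype=float), 0
    for D in D_blocks:
        n = D.shape[0]
        x[offset:offset + n] = np.linalg.solve(D, r[offset:offset + n])
        offset += n
    return x

def apply_block_sgs(blocks, r):
    """x = P^{-1} r with P = (D + L) D^{-1} (D + U), Eq. (24).

    blocks[i][j] holds the (i, j) block of A (None where the block is empty);
    all diagonal blocks are assumed square and of equal size here for brevity.
    """
    n_e = len(blocks)
    nb = blocks[0][0].shape[0]
    y = np.zeros_like(r, dtype=float)
    # forward sweep: solve (D + L) y = r
    for i in range(n_e):
        rhs = r[i * nb:(i + 1) * nb].astype(float)
        for j in range(i):
            if blocks[i][j] is not None:
                rhs -= blocks[i][j] @ y[j * nb:(j + 1) * nb]
        y[i * nb:(i + 1) * nb] = np.linalg.solve(blocks[i][i], rhs)
    # scale: z = D y
    z = np.concatenate([blocks[i][i] @ y[i * nb:(i + 1) * nb] for i in range(n_e)])
    # backward sweep: solve (D + U) x = z
    x = np.zeros_like(r, dtype=float)
    for i in reversed(range(n_e)):
        rhs = z[i * nb:(i + 1) * nb].astype(float)
        for j in range(i + 1, n_e):
            if blocks[i][j] is not None:
                rhs -= blocks[i][j] @ x[j * nb:(j + 1) * nb]
        x[i * nb:(i + 1) * nb] = np.linalg.solve(blocks[i][i], rhs)
    return x
```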

Based on this observation we propose a new class of SGS-like preconditioners that lies between SGS and Jacobi, where a varying amount of entries of the off-diagonal blocks $L$ and $U$ is neglected. In this way, we get a trade-off between memory and application cost on the one hand and efficiency of the preconditioner on the other hand. In particular, we make use of the fact that the modal DG scheme we employ has a hierarchical basis, meaning that the basis functions can be grouped by their degree:

$$u^Q = \sum_{j=0}^{p} \sum_{|\alpha| = j} \hat{u}_\alpha \varphi_\alpha^Q.$$

Here, $\alpha$ is a multi-index, $\varphi_\alpha^Q$ is the unique hierarchical modal basis function corresponding to that multi-index and $\hat{u}_\alpha$ the vector of coefficients of the solution in this decomposition. A block of the Jacobian consists of subblocks, where each subblock consists of the derivatives of the values corresponding to one multi-index with respect to the coefficient vector $\hat{u}_\alpha$ corresponding to one basis function and thus a possibly different multi-index.

The idea of ROBO-SGS is now to reduce the inter-element coupling in the Jacobian by neglecting all derivatives with respect to higher-order degrees of freedom of the neighboring cells. Thus, in the off-diagonal Jacobian blocks we neglect all derivatives with respect to degrees $p > k$, with $k$ user defined. We call this preconditioner ROBO-SGS-$k$ for Reduced Off-diagonal Block Order, where $k$ is the degree of the polynomial basis functions taken into account. Note that this idea requires the hierarchical (modal) basis and does not work with a purely nodal implementation. A similar idea is often used for finite volume discretizations, where an approximate Jacobian is computed based on the first order discretization, neglecting the impact of the reconstruction [41]. However, this does not change the amount of storage needed, but only the computational complexity of the Jacobian construction.

For example, in the case $k = 0$ we take only the DOFs of the neighbors into account which correspond to the integral mean values of the conserved quantities. However, we keep not only the derivatives of the remaining degrees of freedom with respect to themselves, but with respect to all degrees of freedom, resulting in a rectangular structure of the off-diagonal blocks of the preconditioner. This is illustrated in Fig. 2. For $k = p$, we keep everything, thus recovering the original block SGS preconditioner. If we formally set $k = -1$, we neglect all off-diagonal block entries, thus recovering block Jacobi. The number of entries of the off-diagonal blocks in the preconditioner is then $(N_i \cdot n_{Var}) \times (\hat{N}_j \cdot n_{Var})$, where $\hat{N}_j$ depends on the user-defined parameter $k$. While this results in a decreased accuracy of the preconditioner, the memory requirements and the computational cost of the application become smaller, the fewer degrees of freedom of the neighboring cells we take into account.
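Because the modal basis is hierarchical and ordered by polynomial degree, extracting the reduced off-diagonal block amounts to keeping only the leading block columns of the full neighbor block. A small numpy sketch (assuming the $n_{Var}$ unknowns of each modal DOF are stored contiguously, which is an assumption about the layout, not taken from the paper):

```python
from math import comb
import numpy as np

def reduced_off_block(full_block, k, d=3, n_var=5):
    """Keep only derivatives w.r.t. neighbor modal DOFs of degree <= k (ROBO-SGS-k).

    full_block has shape (N_i * n_var, N_j * n_var) with the neighbor's modal DOFs
    ordered by increasing polynomial degree. k = -1 returns an empty block (Jacobi),
    k = p returns the full block (SGS).
    """
    n_hat = 0 if k < 0 else comb(k + d, d)        # retained DOFs, N(k, d) of Eq. (3)
    return full_block[:, : n_hat * n_var]

# example for p = 5 hexahedra: the full 280 x 280 off-block shrinks to 280 x 5 for k = 0
p, n_var = 5, 5
full = np.zeros((comb(p + 3, 3) * n_var, comb(p + 3, 3) * n_var))
print(reduced_off_block(full, k=0).shape)          # (280, 5)
print(reduced_off_block(full, k=2).shape)          # (280, 50)
```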

Fig. 2. Reduced versions of the off-diagonal blocks of the Jacobian: (a) $k = 0$ variant, (b) $k = 1$ variant.

Note that even in the case $k = 0$, the mean values of the neighbors are taken into account, thus leading to reduced off-diagonal blocks that still have a physical meaning. Furthermore, the effect of this strategy becomes more pronounced the larger the order is, since the number of basis functions corresponding to a certain degree increases with $k$.

To estimate the memory savings, we neglect boundary elements. For the Jacobi preconditioner, we get the overall amount of memory

$$Memory(Jacobi) = nElems \cdot Memory(FullBlock), \tag{25}$$

where the memory of the full block is given by

$$Memory(FullBlock) = (N_i \cdot n_{Var}) \times (N_i \cdot n_{Var}), \tag{26}$$

which scales like $p^6/36$ with respect to the polynomial degree $p$. This means that even the simple block Jacobi preconditioner can become prohibitive with respect to memory requirements for a three-dimensional computation with large polynomial degrees $p$. The full SGS preconditioner drastically amplifies this even further, as we need to store the blocks for each neighbor. Thus, depending on the considered element type, we get the overall memory storage for the SGS preconditioner (assuming a large total number of elements compared to the number of boundary elements) as

$$Memory(SGS) = nElems \cdot Memory(FullBlock) \cdot (1 + nSides), \tag{27}$$

where $nSides$ is the number of sides of the element type (e.g. $nSides = 6$ for hexahedra). The memory reduction of ROBO-SGS can be expressed when we introduce the amount of memory needed by the reduced blocks,

$$Memory(ReducedBlock) = (N_i \cdot n_{Var}) \times (\hat{N}_j \cdot n_{Var}), \tag{28}$$

which scales like $p^3 k^3/36$ for a given off-block order $k$. The total amount of memory needed by the ROBO-SGS preconditioner is given by

$$Memory(ROBO\text{-}SGS) = nElems \cdot Memory(FullBlock) + nElems \cdot Memory(ReducedBlock) \cdot nSides. \tag{29}$$

Thus, the ratio of the ROBO-SGS memory consumption in comparison to the memory needed by the Jacobi preconditioner is given by

$$\frac{Memory(ROBO\text{-}SGS)}{Memory(Jacobi)} = 1 + \frac{nElems \cdot Memory(ReducedBlock) \cdot nSides}{nElems \cdot Memory(FullBlock)} = 1 + nSides \frac{\hat{N}_j}{N_i}. \tag{30}$$

Considering the example of hexahedral elements ($nSides = 6$) and the polynomial degree $p = 5$, as used in the results section, we get for the memory ratio the values 1, 1.1, 1.4, 2.1, 3.1, 4.8 and 7 for the off-block orders $k = \{-1, 0, 1, 2, 3, 4, 5\}$, respectively, where $k = -1$ yields the Jacobi preconditioner and $k = 5$ the full SGS preconditioner.
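The ratio (30) is easy to check; the following snippet (illustrative) reproduces the values quoted above for hexahedra with $p = 5$:

```python
from math import comb

def memory_ratio(p, k, d=3, n_sides=6):
    """Eq. (30): memory of ROBO-SGS-k relative to block Jacobi."""
    N_i = comb(p + d, d)                    # full number of modal DOFs, Eq. (3)
    N_hat = 0 if k < 0 else comb(k + d, d)  # retained neighbor DOFs
    return 1 + n_sides * N_hat / N_i

print([round(memory_ratio(5, k), 1) for k in range(-1, 6)])
# [1.0, 1.1, 1.4, 2.1, 3.1, 4.8, 7.0]  -> matches the values given in the text
```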

6.2 ILU preconditioning

Another important class of preconditioners are block incomplete LU (ILU) decompositions, where the blocks correspond to the units the Jacobian consists of. The computation of a complete LU decomposition is quite expensive and in general requires full storage even for sparse matrices. By prescribing a sparsity pattern, incomplete LU decompositions can be defined. The application of such a decomposition as a preconditioner then corresponds to solving the appropriate linear equation system by forward-backward substitution. The sparsity pattern can for example be influenced by the level of fill, which is, in short, a measure for how much fill-in beyond the original sparsity pattern is allowed. Decompositions with higher levels of fill are very good black box preconditioners for flow problems [30]. However, they are not in line with the philosophy of matrix-free methods. This leaves ILU(0), which has no additional level of fill beyond the sparsity pattern of the original matrix $A$. We use the ILU(0) preconditioner in the form proposed by Persson and Peraire [33] with the in-place factorization suggested by Diosady and Darmofal [9]. While this preconditioner usually performs better than the ones based on splitting, it has the drawback that it has to act on the full Jacobian matrix, which makes it less attractive in computational environments with limited memory.
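For reference, the classical scalar (non-block) ILU(0) factorization restricted to the sparsity pattern of $A$, together with its application by forward-backward substitution, can be sketched as follows. This is only the textbook algorithm for illustration; the paper uses the block variant with the in-place factorization of Persson and Peraire [33] and Diosady and Darmofal [9]:

```python
import numpy as np

def ilu0(A):
    """ILU(0): incomplete LU keeping only the nonzero pattern of A (dense storage for clarity)."""
    LU = A.astype(float)
    n = LU.shape[0]
    pattern = A != 0
    for i in range(1, n):
        for k in range(i):
            if pattern[i, k]:
                LU[i, k] /= LU[k, k]                    # multiplier, stored in place of L
                for j in range(k + 1, n):
                    if pattern[i, j]:
                        LU[i, j] -= LU[i, k] * LU[k, j] # update only entries in the pattern
    return LU

def ilu0_solve(LU, b):
    """Apply the preconditioner: solve L y = b (unit lower triangular), then U x = y."""
    n = LU.shape[0]
    y = b.astype(float)
    for i in range(n):
        y[i] -= LU[i, :i] @ y[:i]
    x = y.copy()
    for i in reversed(range(n)):
        x[i] = (x[i] - LU[i, i + 1:] @ x[i + 1:]) / LU[i, i]
    return x
```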

6.3 Multilevel preconditioners

Another possibility is to use multilevel schemes as preconditioners. A number of approaches have been tried in the context of DG methods, e.g. multigrid methods and multi-p methods with different smoothers, as well as a variant by Persson and Peraire [33], where a two-level multi-p method is used with ILU(0) as a presmoother and Jacobi as a postsmoother. We will employ a variant where no postsmoothing is employed, as we have not experienced any convergence acceleration by the postsmoother, and name it ILU-CSC for ILU with coarse scale correction. Since the computation of the residual requires a matrix-vector multiplication, the cost of applying a multilevel variant of one of the above preconditioners is approximately double the cost of a standard single level variant.

6.4 Parallelization

Regarding parallelization, we use the MPI paradigm. The physical domain is decomposed into several domains, each of which is assigned to a processor. The matrix-free approach allows us to use the parallelization scheme described in the PhD thesis of Lörcher [26] for the function evaluation in the matrix-vector multiplication, which is shown to scale very well. The use of GMRES, however, is a drawback here, since it requires the computation of k scalar products in the k-th iteration, which do not scale perfectly in parallel. However, the alternatives have other drawbacks, as discussed earlier.

As for the preconditioner, with the exception of Jacobi, most schemes would actually require excessive communication at domain boundaries, due to the fact that the off-block entries of the system matrix belonging to Neumann neighbors may be located on different CPUs. In order to circumvent adding this overhead to our scheme, we neglect these parts of the system matrix. This way, as the number of cells per CPU decreases, all preconditioners used herein ultimately converge to the Jacobi scheme. This reduces the overall parallel efficiency not through communication overhead, as is the case for the scalar products, but through a degradation of the numerical scheme itself, resulting in a higher number of iterations for the solution process. However, as we will demonstrate in the results section, the schemes scale very well.

7 Numerical results

In this section, we examine the performance of the different preconditioners. To focus on the effect of the preconditioner, we freeze for all tests the numerical discretization in space as well as in time: we choose the ESDIRK4 time integrator with adaptive time stepping using the tolerance $TOL = 10^{-2}$. We then compare the total number of GMRES iterations needed for a complete run of the solver, as a measure of preconditioner accuracy, and the total CPU time needed. While the latter is implementation dependent, it is a necessary additional piece of information, since a more powerful preconditioner can be much more costly and less efficient overall than a less powerful preconditioner. All computations were carried out using the Fortran code HALO, developed at the Institute of Aerodynamics and Gas Dynamics, and all test runs were carried out in parallel using MPI with double precision arithmetic. The partitioning of the grid is achieved with a space-filling curve approach, which guarantees that the number of grid cells on a processor core is balanced and thus the memory requirement for each processor core is roughly the same.

As we are interested in three-dimensional unsteady simulations, the focus of the numerical model is on the high performance computing aspect. This means that we are especially interested in the efficiency of the preconditioner for a large number of cores, i.e. for parallel simulations with low computational load on each processor core. Explicit in time DG discretizations are known for their excellent strong parallel scaling, e.g. [18], and it is a priori not clear if implicit time integrators can sustain this property.

As a first test case, we choose the flow past a circular cylinder with free stream Mach number $M = 0.3$ and Reynolds number $Re = 1000$. The computational grid consists of 10,400 hexahedral cells, as illustrated in figure 3, with curved grid cells at the cylinder boundary. For the discontinuous Galerkin discretization we choose polynomials of degree five, resulting in 582,400 degrees of freedom per conservative variable or 2,912,000 unknowns in total. To obtain an initial condition, we decided to use an explicit time integration scheme, where we suppose that the time integration errors are negligible due to the small stability driven time step. The test interval is 1 s and we choose the initial time step as $\Delta t_0 = 0.0118$. This time step size was chosen such that the resulting error estimate is between 0.9 and 1, leading to an accepted time step. The distribution of the velocity magnitude at the initial time and at the end time of the test run is shown in figure 4.

The following computations are all performed on the CRAY XE6 cluster (Hermit) of the computing center HLRS. All preconditioners are tested for computations with 64, 128, 256 and 512 processor cores (threads), resulting in an average of 9100, 4550, 2275 and 1137 DOF per conservative variable on a processor core, which would be low even for an explicit in time discontinuous Galerkin discretization.

Table 2 shows the result for the simulations with 64 cores. We list the number of GMRES iterations, the overall wallclock time and a comparison of the CPU times with respect to the standard block Jacobi preconditioner.

Fig. 3. Grid for the cylinder test case. The grid is extended to three dimensions with 8 regular grid cells.

Fig. 4. Initial (top) and final (bottom) solution of the cylinder problem. Distribution of the velocity magnitude.

We clearly see that the number of iterations decreases the better the preconditioner, with ROBO-SGS-5 and ILU(0)-CSC being the most powerful. However, the wallclock time gives a very different picture. Here, the most powerful preconditioners are among the slowest in the end, because the preconditioner matrix has a larger sparsity pattern and is thus more expensive to apply. The most efficient preconditioner is ROBO-SGS-1 in this case, with a CPU time about 11% lower than that of the Jacobi preconditioner.

Preconditioner      Iter.   CPU [s]   Comparison to Jacobi [%]
No preconditioner   8,797   2,194      36.0
Jacobi              3,712   1,613       0.0
ROBO-SGS-0          3,338   1,538      -4.6
ROBO-SGS-1          2,824   1,429     -11.4
ROBO-SGS-2          2,656   1,485      -7.9
ROBO-SGS-3          2,641   1,679       4.1
ROBO-SGS-4          2,645   1,989      23.3
ROBO-SGS-5          2,640   2,427      50.5
ILU(0)              2,641   2,467      52.9
ILU(0)-CSC          2,640   2,994      85.6

Table 2: Number of iterations and wallclock time of the test computations on 64 cores of the CRAY XE6 cluster Hermit for all preconditioners, with comparison of CPU time to the Jacobi preconditioner computation.

The following tables 3-5 show the results for 128, 256 and 512 processor cores (threads). The third column of each table shows the parallel scaling of the method with respect to the 64 processor core computation. It is important to note that the case without preconditioner gives us essentially the scaling of an explicit time discretization, as we only use the spatial DG operator to approximate the matrix-vector product. The only difference to an explicit method is that the GMRES algorithm needs an all-to-all communication because of the vector norm in each iteration. The strong scaling of over 85% in this case demonstrates again how well suited discontinuous Galerkin discretizations are for parallel computations.

The results show that the scaling of the Jacobian-free implicit method is as good as that of the explicit method for this example. We even get better scaling results with preconditioner than in the no-preconditioner case. The reason for this is that the computational load on a processor increases due to the additional work of applying the preconditioner. Thus, the ratio of computation to communication increases, yielding a better parallel scaling for the preconditioned schemes. An additional sign of this is that the most expensive preconditioners (with the largest sparsity pattern) scale best.

The parallel implementation is such that we only consider the preconditioner for the local MPI domain, without communication of the preconditioner. By increasing the number of processors we decrease the load (number of grid cells) on each processor and thus the preconditioners all become more similar to block Jacobi. In the extreme case of only one element on a processor core, all preconditioners would be block Jacobi due to the parallelisation we use.

Preconditioner      Iter.   CPU [s]   Scaling [%]   Comparison to Jacobi [%]
No preconditioner   8,797   1,196      91.7          43.6
Jacobi              3,712     833      96.8           0.0
ROBO-SGS-0          3,462     818      94.0          -1.8
ROBO-SGS-1          2,926     750      95.3         -10.0
ROBO-SGS-2          2,676     757      98.1          -9.1
ROBO-SGS-3          2,651     843      99.6           1.2
ROBO-SGS-4          2,677     986     100.9          18.4
ROBO-SGS-5          2,641   1,175     103.3          41.1
ILU(0)              2,641   1,196     103.1          43.6
ILU(0)-CSC          2,640   1,477     101.4          77.3

Table 3: Number of iterations and wallclock time of the test computations on 128 cores of the CRAY XE6 cluster Hermit for all preconditioners. Strong scaling results compared to the 64 core computations and comparison of CPU time to the Jacobi preconditioner computation.

Preconditioner      Iter.   CPU [s]   Scaling [%]   Comparison to Jacobi [%]
No preconditioner   8,797     600      91.4          38.9
Jacobi              3,712     432      93.3           0.0
ROBO-SGS-0          3,516     428      89.8          -0.9
ROBO-SGS-1          3,031     395      90.4          -8.6
ROBO-SGS-2          2,785     393      94.5          -9.0
ROBO-SGS-3          2,685     422      99.5          -2.3
ROBO-SGS-4          2,722     486     102.3          12.5
ROBO-SGS-5          2,649     568     106.8          31.5
ILU(0)              2,647     576     107.1          33.3
ILU(0)-CSC          2,640     728     102.8          68.5

Table 4: Number of iterations and wallclock time of the test computations on 256 cores of the CRAY XE6 cluster Hermit for all preconditioners. Strong scaling results compared to the 64 core computations and comparison of CPU time to the Jacobi preconditioner computation.

Preconditioner      Iter.   CPU [s]   Scaling [%]   Comparison to Jacobi [%]
No preconditioner   8,797     322      85.2          39.4
Jacobi              3,712     231      87.3           0.0
ROBO-SGS-0          3,508     235      81.8           1.7
ROBO-SGS-1          3,479     230      77.7          -0.4
ROBO-SGS-2          3,014     215      86.3          -6.9
ROBO-SGS-3          2,819     229      91.6          -0.9
ROBO-SGS-4          2,773     246     101.1           6.5
ROBO-SGS-5          2,720     284     106.8          22.9
ILU(0)              2,713     287     107.4          24.2
ILU(0)-CSC          2,679     383      97.7          65.8

Table 5: Number of iterations and wallclock time of the test computations on 512 cores of the CRAY XE6 cluster Hermit for all preconditioners. Strong scaling results compared to the 64 core computations and comparison of CPU time to the Jacobi preconditioner computation.

Thus, we can observe that the number of iterations slightly increases with the number of processors for the more powerful preconditioning techniques. This furthermore has the effect that the difference to block Jacobi with respect to CPU time decreases, since Jacobi is not affected by the parallelisation due to its element-local nature. We see that the most efficient preconditioners are the ROBO-SGS variants with low off-block order. But using 512 processor cores (threads), the difference is only about 7% in favor of the best preconditioner, ROBO-SGS-2. It is clear that the more processors we use for the computation, the more efficient block Jacobi gets in comparison to the more sophisticated preconditioners.

This suggests that more powerful preconditioners are more effective compared to Jacobi when we have a large number of elements per process, as in this case their iteration-reducing effect is more pronounced. To demonstrate this, we consider a second test case, which we compute with only eight cores on a machine with four quad-core AMD Opteron 8378 processors. We consider for this the flow past a sphere at a Mach number of $M = 0.3$ and a Reynolds number of $Re = 1000$. The unstructured grid is larger than the cylinder grid and consists of 21,128 hexahedral grid cells, and the polynomial degree is chosen equal to four, resulting in 739,480 DOF per conservative variable and a total of 3,697,400 unknowns. As before, we use an explicit time integrator to generate a time-error-free initial flow field for our tests. We perform the computations with all preconditioner variants for a time interval of 30 seconds.

The initial solution and the result at the end, when using ESDIRK4 time integration with a tolerance of $TOL = 10^{-3}$ and an initial time step of $\Delta t = 0.0065$, are shown in figure 5, where we see isosurfaces of $\lambda_2 = -10^{-4}$, a common vortex identifier [19]. Note that there is no visual difference between the results for ESDIRK4 and those obtained using an explicit Runge-Kutta scheme.

Fig. 5. Isosurfaces of $\lambda_2 = -10^{-4}$ for the initial (left) and final (right) solution of the sphere problem.

Table 6 shows the results of this simulation. Since only 8 processes are used to compute this test case, we have an average of 2641 grid cells on one core. The results show that the most efficient preconditioner, ROBO-SGS-1, is about 30% faster than the Jacobi preconditioner. Using this result only, obtained with a low number of processors, one could argue that the new intermediate class of preconditioners, namely ROBO-SGS-1, is the most effective among all preconditioning techniques. Again, the more powerful preconditioners like ILU(0) and ILU(0)-CSC are not more computationally efficient. If we compare this to the results with high processor numbers, where the maximum difference to the fastest preconditioner was only about 7%, we get the outcome that the standard Jacobi preconditioner is a viable candidate with good efficiency and scaling. This shows that for a meaningful comparison of preconditioning techniques for the simulation of three-dimensional unsteady compressible flows, test runs with a high number of processors, i.e. a low number of grid cells per processor core (thread), are necessary to get the right picture for practical applications.