Implementation and Comparisons of Parallel Implicit Solvers for Hypersonic Flow Computations on Unstructured Meshes

20th AIAA Computational Fluid Dynamics Conference, 27-30 June 2011, Honolulu, Hawaii. AIAA 2011-3547

Ioannis Nompelis, Jason D. Bender and Graham V. Candler
Aerospace Engineering and Mechanics, University of Minnesota, Minneapolis MN, USA

A study of parallel implicit solvers for accelerating convergence to steady state solutions of the compressible Navier-Stokes equations with finite-rate chemistry is presented. The solvers in question are pertinent to applications of hypersonic flows that can be modeled as laminar, or to turbulent flows that can be simulated using the Reynolds averaged (RANS) equations. The current state-of-the-art method, the Data-Parallel Line Relaxation (DPLR), is examined. Its convergence properties are evaluated for a class of challenging external aerodynamics problems. A more sophisticated method based on the GMRES linear system solver is built around the DPLR method, where the DPLR is used as a preconditioner. The convergence characteristics of the augmented method are studied for model problems of practical interest. Results show that the more sophisticated method has better convergence properties, but exhibits higher cost and should be used selectively.
Author notes (in author order): Research Associate, <nompelis@aem.umn.edu>, 110 Union St. S.E. Minneapolis MN 55455 USA, AIAA Senior Member. Research Assistant, AIAA Student Member. Professor, AIAA Fellow.

Copyright 2011 by Ioannis Nompelis. Published by the American Institute of Aeronautics and Astronautics, Inc., with permission.

Nomenclature

$\tilde{A}$   Large block-sparse matrix representing the discretized implicit form of the PDEs
$A^{\pm}$   Jacobians of the Steger-Warming flux
$A$   A non-singular matrix with real elements representing a system of equations
$B$   Portions of the implicit operator that are relaxed
$b$   The right-hand side vector of a linear system of equations
$R$   Time-stepping residual vector for the discretized parabolic PDEs
$\hat{n}$   Unit normal vector
$D$   Diffusive flux vector of conserved quantities
$F$   Inviscid flux vector of conserved quantities
$S$   Face area
$r$   Residual vector of a linear system of equations
$U$   Vector (list) of conserved quantities
$V$   Volume of a cell in the discretization
$W$   Chemical source terms
$x$   The vector of unknowns of a linear system of equations
$\delta x$   GMRES update to the approximate solution vector
$\delta u$   Change in the state vector from timestep to timestep

Superscript
$n$   indicates a point in the temporal discretization (timestep $n$)

Subscript
$i$   indicates a cell in the finite volume discretization
$j$   indicates a face of the finite volume grid

I. Introduction

The simulation of hypersonic flows using computer codes remains a subject of research today. In order to accurately capture hypersonic flow phenomena for practical applications, large computational grids are used. This is because a large range of length-scales exists (and must be resolved) in flows that are of practical importance. Some examples are full-scale vehicle design evaluation and optimization, aerothermal database generation, and shape optimization, all of which are commonplace in the aerospace industry today.

The use of large meshes, combined with the large range of length-scales that exist at hypersonic flow conditions, has significant implications in terms of difficulty and expense in performing accurate calculations. First, the use of large meshes in itself makes the simulation expensive. Second, the vast range of length-scales, particularly in turbulent flows, requires that the grid be refined in certain regions where large gradients exist. This results in small grid spacing, which imposes stringent limitations on the timestep for time-marched simulations and generally hinders convergence for non-linear Newton solvers. In order to overcome these difficulties and limitations, several years of research were devoted to accelerating the method of solution. Spectacular advancements have been made over the years on this front, aided by the introduction of parallel computers. To this end, this paper addresses one aspect of convergence acceleration and focuses on numerical methods that time-march the solution and employ a finite volume method of discretization.

Simulations of hypersonic flows using the finite volume method of discretization have been around for decades.
The finite volume method is suitable for such simulations because it naturally admits discontinuous solutions to the weak form of the compressible Navier-Stokes equations, including more detailed formulations for nonequilibrium effects, flowfield chemistry and combustion. Discontinuous solutions are necessary because of the presence of shock waves in high Mach number flows. Other methods for obtaining weak and discontinuous solutions to the partial differential equations that govern hypersonic flows in reacting environments are emerging. Most notably, solvers based on the discontinuous Galerkin method with finite elements and other variants are of growing popularity. Today, however, the state-of-the-art CFD codes for hypersonic simulations are largely based on the finite volume method. This is because extensive research has been done on all aspects of numerical simulation using this method, and computer codes have been validated for certain classes of problems. In this work, we focus on the solution of the compressible Navier-Stokes equations with chemistry, discretized with the finite volume method, and we build on top of what has been the de facto approach in the industry. We are only concerned with methods that are suited for distributed parallel computers.

I.A. Overview of Related Work

When seeking steady state numerical solutions to the governing equations, the non-linear system of the original PDEs results in a non-linear system for the degrees of freedom of the discretized weak form of the equations. The non-linear system of equations can be solved with an iterative Newton method, as described in Orkwis [1-3]. The Newton method requires a Jacobian of the residual with respect to the unknowns, which must be inverted at each iteration. A direct inversion would typically be very costly. Equally importantly, because the system of equations in question is non-linear, a good initial guess must also be provided.
Direct inversion of the operator is not at all practical, and thus an iterative method of solution of the linear system of equations (within the context of the Newton method) is typically employed. Many such methods have been utilized with success in the past for high-speed flow applications [3]. The performance, convergence properties and cost associated with these methods vary based on the level of sophistication and the level of approximation that is involved; quantitative comparisons of exact and approximate methods are shown in

Orkwis [3] for simple high-speed flows. Some of the approximate methods march the solution in pseudo-time by introducing a term on the diagonal of the operator which is of the form $1/\Delta\tau$, such that as $\Delta\tau \to \infty$ the original operator representing the steady-state Navier-Stokes equations is recovered [1]. This motivates the idea of integrating the unsteady form of the equations to steady state, assuming that a steady state solution of the problem exists. However, because the explicit solution of the discretized transient equations has a stability condition, the use of an implicit method of solution is necessary in most cases in order to overcome the stringent limitation imposed by the CFL condition. In the implicit formulation, all spatial terms of the equations must be evaluated at the future time-level $n+1$ when the state vector is known at time-level $n$ (see footnote a). It should be noted that the CFL condition that is customarily presented for the inviscid model problem is not nearly as stringent as the stability condition imposed by the diffusive terms for practical problems. For the remainder of this discussion, each time we refer to the CFL number or CFL condition we imply the stability condition for the inviscid problem.

To the knowledge of the authors, the most comprehensive study of implicit methods for finite volume solvers developed for hypersonic reacting-flow problems was done by Wright [4]. In the cited work, several data-parallel methods were presented and benchmarked in terms of performance and convergence characteristics using model problems. The prevailing method for external aerothermodynamics according to this study is the Data-Parallel Line Relaxation [5]. There are several reasons for this. First, the method is data-parallel and scales extremely well on distributed parallel systems. Second, it is a relatively low cost method in the sense that no additional memory other than the storage of the block-sparse implicit operator is necessary for timestepping.
The notion of a data-parallel method is that operations are performed in a fully distributed sense simultaneously, such that no processor waits for any other processor to finish before continuing with the calculation. In most cases communication is done in the background and all processors synchronize to advance in time in a lock-step manner. The power of the DPLR method relies on the ability to solve the linear system exactly in directions where coupling is strong. In the study of Wright, which largely involved external flow, it was the direction normal to wall boundaries. More details on this method will be given in a later section. However, the method is essentially an open-loop iterative linear system solver that has been tuned for external aerothermodynamics problems. By open-loop we mean that the solver does not rely on a convergence criterion, and it always performs the same number of relaxations. For more complex geometries and flows with large anisotropy the method does not perform as well. Because more complex geometries are commonplace in hypersonic applications today, more advanced methods of solution need to be readily available in contemporary computer codes if those are to remain relevant.

I.B. Recent Work by the Authors

In recent work [6], we extended the DPLR method to fully unstructured meshes in an effort to bring accelerated convergence to more complex shapes. Even more recently, we augmented the implicit solver of the unstructured grid code to use a fully-coupled linear system solver based on the PETSc [7] library, as described in Nompelis, et al. [8]. The implicit solver based on the PETSc package used the GMRES method [9] with a special treatment for data dependencies when running in parallel. The package also used a very expensive but very good preconditioner, the ILU(0) incomplete LU factorization algorithm.
The implicit solver was tuned such that we balance cost and performance for the problems of interest; there are several tunable parameters for this solver. More details on this method will be discussed in a later section. Results showed that the more sophisticated implicit solver performed better than the DPLR solver for a case where the grid had not been generated with line-solutions in mind. At the same time, the PETSc-based implicit solver performed equally well or better in cases where DPLR benefited from the grid construction. Fig. 1 plots the convergence rates for two cases, a blunt-body external flow and the flow inside a duct with shock generators, as computed using the different methods. The fully-coupled method performs very well, and in terms of convergence rate it is better than the DPLR method.

[Footnote a: This is a reasonable approach because the model problem, the linear one-way wave equation, is unconditionally stable.]

However, in terms of elapsed CPU

time both methods were almost equivalent. This is because the fully-coupled method is inherently more expensive per timestep, but also because it was based on a general purpose package and was not streamlined for use in this code. The most important conclusion of that study was that the use of an external library came at a very large memory cost. This is because the implicit system's block-sparse matrix had to be replicated for use within the GMRES routine, and the ILU(0) preconditioner required storage equal to exactly that of the large matrix. The memory cost slightly exceeded 200% when the fully-coupled method was invoked. Therefore, for simulations that employ very large meshes with the memory nearly exhausted, this is not at all a useful option.

[Figure 1: two panels of density residual versus number of timesteps. Left: FMDP, DPLR and fully coupled implicit methods for Re = $10^4$ to $10^7$. Right: hybrid DPLR-FMDP and fully coupled implicit methods.]

Figure 1. Convergence of the DPLR method, the point implicit method (FMDP) and the fully-coupled implicit method based on PETSc for a blunt-body problem as a function of the free-stream Reynolds number (left), and convergence of the flow inside a duct with shock generators using the hybrid DPLR/FMDP and the fully-coupled methods (right). (Taken from Nompelis, et al. 2007)

I.C. Scope of Current Work

After having extended the line relaxation method to unstructured grids and having explored the possibility of employing more accurate but expensive linear system solvers, we came to the following conclusions. If a less approximate linear system solver is to be used for the classes of problems that do not allow the use of grid topologies that are suitable for line solutions, the implementation must be made part of the CFD code.
By this we mean that there is no benefit in using an external library in a production CFD code. Therefore, making the more advanced linear system solver part of the CFD code is what we have done in this work. Also, the use of a preconditioner such as the ILU(0) and other variants introduces additional expense, both in terms of memory and CPU cycles, for possibly only marginal benefits. Instead, it is more appropriate to use a preconditioner that (a) is built with the specific application in mind, as the DPLR method is, and (b) re-uses as much of the existing data-structures and data as possible during timestepping. Our implementation and characterization of the performance of the augmented solver is the scope of this work.

We present the current implementation of a GMRES-based implicit solver as it has been applied to a fully unstructured code [6]. In this context we briefly discuss the hybrid DPLR method as it is used for time integration; this is important for understanding how preconditioning based on this method is done very efficiently within the GMRES algorithm. First, we show results for the DPLR and FMDP methods on a standard blunt-body problem. We compare the convergence rates of these methods to those obtained using the GMRES

implicit solver with DPLR / FMDP as preconditioner for the same problem. We present similar comparisons for a more difficult class of problems that involve a double-cone flowfield [10,11]. In this case, the presence of both separation and a complex shock-wave structure exhibits a feedback mechanism, resulting in a flowfield that evolves slowly and requires a relatively large amount of physical time to be simulated to reach steady-state. The complexity of the flowfield and the large run-times required make this case ideal for convergence acceleration studies. Lastly, we show results for the convergence of a turbulent inward-turning inlet design [12] for which CFD has been used to perform shape optimization.

In Section II we present the numerical method that is used in the CFD code. In Section III we discuss implicit solvers and how the DPLR, FMDP and the preconditioned GMRES methods are implemented in the solver. In Section IV we present results for the cases mentioned above and discuss the convergence characteristics of the methods.

II. Numerical Method

We solve the weak form of the compressible Navier-Stokes equations with chemistry [13]. This is a set of equations for the species densities, three momentum equations, an equation for total energy, and typically an additional equation for the vibrational energy of all species. The state equation assumes Dalton's law of partial pressures. The diffusive terms assume linear diffusion of momentum, energy and mass. An additional equation is solved when the solver is running in RANS mode, where the Spalart-Allmaras model with a compressibility correction is employed [14,15].

The finite volume method of discretization is used to obtain weak solutions to the equations. In this formulation, the integrals of all spatial terms are converted to surface integrals and the fluxes at the faces of the unstructured grid are evaluated at each timestep.
We use a modified Steger-Warming flux [16,17] for the inviscid terms; this has been shown to be adequate for problems that are of interest to us [11]. The discretized form of the equations is:

$$ \frac{\partial U_i}{\partial t} = -\frac{1}{V_i} \sum_{j \in \{i \text{ cell faces}\}} \left( F_j - D_j \right) \cdot \hat{n}_j \, S_j + W_i \qquad (1) $$

and represents the rate of change of the state vector at a cell $i$ of an unstructured grid.

III. Implicit Solvers

To overcome the timestep limitation of the CFL condition, we use an implicit formulation of the equations. We discretize the transient term with a first-order Euler finite difference and we evaluate the spatial terms at the future time level $n+1$. Therefore, the weak form of the equations on the finite volume discretization takes the form:

$$ \frac{U_i^{n+1} - U_i^{n}}{\Delta t} = -\frac{1}{V_i} \sum_{j \in \{i \text{ cell faces}\}} \left( F_j - D_j \right)^{n+1} \cdot \hat{n}_j \, S_j + W_i^{n+1} \qquad (2) $$

where the right-hand side is evaluated at $n+1$. We typically work with the component of the flux normal to the face, which we short-hand as $F' = F \cdot \hat{n}$, and similarly for the diffusive flux, $D' = D \cdot \hat{n}$. Because of how the Steger-Warming split fluxes are expressed, it is easy to linearize the inviscid fluxes in time using a Jacobian of the flux and get an approximation of the flux $F'$ at the future time-level as:

$$ F'^{\,n+1} = F'^{\,n} + A^{+} \, \delta u_L + A^{-} \, \delta u_R $$

where the subscripts $L$, $R$ indicate cells to the left and to the right of any given face and $A^{+}$, $A^{-}$ are the inviscid flux Jacobians. A similar linearization using an appropriate approximation is done to the viscous flux $D'$, such that the linearized form at $n+1$ involves a matrix operating on $\delta u$ of the left and right cells; the details can be found in Nompelis [13]. The source terms $W$ are also linearized in time with a Jacobian as:

$$ W^{n+1} = W^{n} + \left( \frac{\partial W}{\partial U} \right)^{n} \delta u $$

The resulting system of linear equations at every timestep can be written in the compact form of Eq. (3) that represents the system after it has been discretized with a given grid and

linearized in time.

$$ \tilde{A}^{n} \, \delta u = R^{n} \qquad (3) $$

The vector on the right-hand side is the rate of change of the conserved quantities. The vector of unknowns is the change in the state vector from timestep $n$ to timestep $n+1$. The matrix $\tilde{A}$ consists of the Jacobians of the fluxes and source terms, and a term on the diagonal that contains the volume of each cell in the discretization, $V$, divided by the timestep $\Delta t$. The term on the diagonal enhances diagonal dominance of the system as long as $\Delta t$ remains small. The large block-sparse system of Eq. (3) needs to be solved at every timestep $n$.

III.A. Data-Parallel Implicit Relaxation Methods

For practical problems the dimension of the state vector is very large. For example, when a mesh of 1 million cells is employed for a laminar three-dimensional calculation with 11-species air chemistry, each cell carries 16 unknowns (11 species densities, three momentum components, total energy and vibrational energy) and the number of degrees of freedom is 16 million. Therefore, a direct inversion of the operator $\tilde{A}$ is not possible. For this reason, the solution to the linear system is obtained approximately by means of a relaxation method. Data-parallel methods take into consideration the partitioning of the domain onto processors. In general, solutions of smaller portions of the large block-sparse system are performed, and the coupling that is not included in these solutions is obtained approximately via relaxation.

In the case of the DPLR method, lines are formed inside the domain ahead of time when the grid is being distributed across processors. As lines of cells are constructed in the wall-normal direction, data-structures are created for the ordering of the cells. In this way, portions of the large block-sparse matrix are implicitly ordered. A direct LU factorization is performed on the portions of the system that are ordered as block tridiagonal sub-systems representing the line-solutions. The off-diagonal blocks that do not get factored in the LU factorization are the terms relaxed during the relaxation sweeps.
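The line solution described above amounts to a block-tridiagonal (block Thomas) factorization followed by forward and backward substitution. The following Python sketch illustrates the idea on small dense blocks. The function names, the use of NumPy, and the choice to store explicit inverses of the pivot blocks are assumptions made for illustration; they are not taken from the solver described in this paper.

```python
import numpy as np

def factor_block_tridiagonal(lower, diag, upper):
    """Factor a block-tridiagonal system once.

    lower[i] couples cell i to cell i-1 (lower[0] is unused),
    diag[i] is the diagonal block of cell i,
    upper[i] couples cell i to cell i+1 (upper[-1] is unused).
    Returns the elimination multipliers and the inverted pivot blocks.
    """
    n = len(diag)
    mult = [None] * n
    pivot_inv = [None] * n
    pivot_inv[0] = np.linalg.inv(diag[0])
    for i in range(1, n):
        mult[i] = lower[i] @ pivot_inv[i - 1]
        pivot_inv[i] = np.linalg.inv(diag[i] - mult[i] @ upper[i - 1])
    return mult, pivot_inv

def solve_block_tridiagonal(mult, pivot_inv, upper, rhs):
    """Forward and backward substitution using stored factors."""
    n = len(pivot_inv)
    y = [None] * n
    y[0] = rhs[0]
    for i in range(1, n):                  # forward elimination of the right-hand side
        y[i] = rhs[i] - mult[i] @ y[i - 1]
    x = [None] * n
    x[n - 1] = pivot_inv[n - 1] @ y[n - 1]
    for i in range(n - 2, -1, -1):         # back-substitution along the line
        x[i] = pivot_inv[i] @ (y[i] - upper[i] @ x[i + 1])
    return np.array(x)
```

Separating the factorization from the substitution means that the factors of a line can be stored and applied to several right-hand sides without being recomputed.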
In principle, lines can be constructed in any direction within the domain. But in practice, for external and some internal aerothermodynamics problems, the grid is constructed such that lines can be built in a region that is close to the wall and can extend to the outer boundary of the domain. Cells that are not part of line-solutions simply get an LU factorization of their diagonal block (from the block-row of $\tilde{A}$), and off-diagonal blocks are relaxed during relaxation sweeps. That is exactly what the Full-Matrix Data-Parallel (FMDP) method does in the entire domain. The Hybrid DPLR-FMDP method is essentially a DPLR method that uses FMDP-type solutions for cells that lie outside of line-solutions. All data-parallel implicit relaxation methods discussed here can be expressed in the compact form:

$$ \left( \tilde{A}^{n} - B^{n} \right) \delta u^{k} = R^{n} - B^{n} \, \delta u^{k-1} \qquad (4) $$

where $k$ is the index of the relaxation sweep. Four relaxation sweeps have been shown to be adequate for problems of interest [4]. We have adopted this value for our implementation of the hybrid DPLR-FMDP in the unstructured flow solver.

These data-parallel methods are open-loop iterative linear system solvers that take advantage of the physics of the underlying parabolic PDEs to accelerate convergence. Namely, the FMDP is a point-relaxation method that is suitable for problems with stiff chemistry, and flows where the chemical time-scales are faster than the flow time-scales. The DPLR method is marginally more expensive than the FMDP but has strong coupling in the wall-normal direction, where flow gradients are strong. The hybrid method takes advantage of the convergence properties of the DPLR method in regions where gradients are strong, but allows for more general unstructured grids to be used.

III.B. Preconditioned GMRES Implicit Solver

The use of a fully-coupled linear system solver for implicit timestepping of parabolic PDEs is prohibitively expensive.
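The relaxation of Eq. (4) can be sketched on a small dense matrix as follows. This is a minimal illustration, assuming that the initial sweep solves the implicit part with no relaxed terms; the function name `relaxation_solve`, the boolean mask that selects the implicitly treated entries, and the use of a dense solve in place of stored line factorizations are hypothetical and chosen only to keep the example short.

```python
import numpy as np

def relaxation_solve(a_full, implicit_mask, rhs, kmax=4):
    """Open-loop relaxation in the form of Eq. (4).

    a_full        : dense stand-in for the implicit operator
    implicit_mask : boolean array, True where an entry is solved implicitly
                    (a tridiagonal mask mimics line solves, a diagonal
                    mask mimics point relaxation)
    rhs           : right-hand side vector
    kmax          : fixed number of relaxation sweeps (no convergence test)
    """
    implicit_part = np.where(implicit_mask, a_full, 0.0)   # plays the role of (A - B)
    relaxed_part = a_full - implicit_part                   # plays the role of B
    du = np.linalg.solve(implicit_part, rhs)                # sweep 0: relaxed terms omitted
    for _ in range(kmax):
        du = np.linalg.solve(implicit_part, rhs - relaxed_part @ du)
    return du
```

The number of sweeps is fixed in advance, which is what makes the method open-loop: the routine never inspects the linear residual.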
In order to take advantage of additional coupling within the $\tilde{A}$ matrix to accelerate convergence one must use an iterative solver that is significantly more sophisticated than the open-loop methods discussed

in the previous subsection. There are several solvers that can be used but not all of them are competitive in terms of cost. For example, the Conjugate Gradient (CG) method guarantees, in exact arithmetic, the exact solution of a symmetric positive definite linear system of equations for $N$ unknowns in at most $N$ steps. This is not at all practical. Furthermore, a preconditioned CG method that can potentially reach convergence to a specified tolerance in far fewer steps will likely not be competitive either. This is because the CG method can only be used with positive definite and symmetric systems, and therefore the system of Eq. (3) has to be symmetrized, which significantly adds to the cost in memory and CPU cycles, and hinders convergence as the condition number of the matrix worsens during symmetrization.

We use the preconditioned GMRES method [9] as the linear system solver. The method lends itself to easy implementation and the memory requirements are not severe if the solver is part of the CFD code. The GMRES method seeks to find the solution of a linear system of equations in a sequence of steps where, given an initial guess, the approximate solution vector is altered at each step such that the linear system residual is minimized. It is important to note that the method guarantees that the norm of the residual will not increase after invocation. Also note that this is in contrast with open-loop methods such as DPLR and the hybrid DPLR-FMDP.

The GMRES method solves a system of the form $Ax = b$ given an initial guess $x_0$ by altering the initial guess by a vector $\delta x$ in a sequence of successive invocations (outer loops), $x_m = x_{m-1} + \delta x$, until a tolerance for the magnitude of the linear system residual is reached. The vector $\delta x$ is formed as a linear combination of basis vectors $q_k$ as $\delta x = \sum_{k=1}^{n_k} \lambda_k q_k$. The $\lambda_k$ are a solution to a least-squares minimization problem. The power of the method is in the construction of the basis $\{q_k\}$.
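The update described above can be sketched in a few lines of Python. This is a minimal, unpreconditioned illustration that assumes a small dense matrix; the function name `gmres_restarted`, its arguments, and the use of `numpy.linalg.lstsq` for the least-squares problem are choices made for this example and do not describe the implementation in the CFD code. The basis is built with a modified Gram-Schmidt process.

```python
import numpy as np

def gmres_restarted(a, b, x0=None, nk=10, max_restarts=10, tol=1e-10):
    """Restarted GMRES: x_m = x_{m-1} + sum_k lambda_k q_k."""
    n = b.size
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    for _ in range(max_restarts):                # outer loops (restarts)
        r = b - a @ x
        beta = np.linalg.norm(r)
        if beta <= tol:
            break
        q = np.zeros((n, nk + 1))                # Krylov basis vectors as columns
        h = np.zeros((nk + 1, nk))               # Hessenberg matrix of projections
        q[:, 0] = r / beta
        m = nk
        for k in range(nk):
            w = a @ q[:, k]
            for i in range(k + 1):               # orthogonalize against previous vectors
                h[i, k] = q[:, i] @ w
                w = w - h[i, k] * q[:, i]
            h[k + 1, k] = np.linalg.norm(w)
            if h[k + 1, k] < 1e-14:              # Krylov space exhausted early
                m = k + 1
                break
            q[:, k + 1] = w / h[k + 1, k]
        e1 = np.zeros(m + 1)
        e1[0] = beta
        # lambda minimizes || beta * e1 - H lambda ||, i.e. the residual norm
        lam, *_ = np.linalg.lstsq(h[:m + 1, :m], e1, rcond=None)
        x = x + q[:, :m] @ lam
    return x
```

Because a zero update is always an admissible solution of the least-squares problem, each outer loop leaves the residual norm no larger than it was on entry.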
As with other Krylov-space methods, the GMRES method forms a basis for the Krylov space using an Arnoldi orthogonalization (modified Gram-Schmidt process). Starting from a normalized residual, with $r_0 = b - A x_0$, each additional vector $q_{k+1}$ is constructed by applying the matrix $A$ from the left to the previous basis vector $q_k$; the result is then made orthogonal to all previous vectors in the basis $\{q_1, \ldots, q_k\}$ and normalized. It should be noted that the number of Krylov dimensions, or the number of basis vectors, is a tunable parameter of the method. Each of the basis vectors has the dimension of the degrees of freedom. (In the example of the 11-species three-dimensional flow, the dimension of each of those basis vectors is 16 million.) It is possible to start the method with a zero (homogeneous) initial guess, and in this case the first basis vector of the Krylov space will be $q_1 = r_0 / \| r_0 \| = b / \| b \|$.

The GMRES method by itself never increases the residual, but with restarts it may converge slowly, in the sense that for a given problem it may take many more than $N$ steps to converge the solution of a linear system for $N$ unknowns. For this reason, the GMRES method should be used in conjunction with a preconditioner. We think of the preconditioner as a matrix mostly for purposes of notation. But in reality, the preconditioner is a function that maps vectors to vectors. In an abstract sense, we think of a preconditioner as a function that approximates the solution to the problem $Ax = b$ given an argument $b$ and taking the matrix $A$ as a parameter (see footnote b). Formally, using the preconditioned GMRES method to solve $Ax = b$ amounts to solving for the vector $y$ in:

$$ A M^{-1} y = b \qquad (5) $$

and then using the equation $y = Mx$ to solve for $x = M^{-1} y$. Note that if one were to use a matrix for preconditioning within a CFD solver, depending on the level of approximation that is involved in preconditioning, the matrix may require more storage than the $\tilde{A}$ matrix itself, resulting in very large memory overhead.
This is because during construction of the basis for the Krylov space, the $A$ matrix must remain unaltered in memory. Simple preconditioning, on the other hand, can use zero additional memory but may largely be inadequate for accelerating convergence.

In earlier work [8], we used the preconditioners that came with the PETSc library [7]. We used the ILU(0) preconditioner, which is very powerful but also very expensive. In this work, we strike a balance between computational and memory cost by using the hybrid DPLR-FMDP method as a preconditioner to the GMRES. The main advantage of this approach is that the integrated solver within the CFD code is very fast, largely due to how the data-dependencies from other processors are treated.

[Footnote b: "The best preconditioner is the inverse of A." Prof. B. Cockburn, Univ. of Minnesota - Mathematics Dept., 2010]

Also, we can be selective

about what is stored and what is (re)computed within the invocation and restarting of the GMRES method, based on either speed or memory savings. Specifically, if we wish to minimize the memory requirements, we only allocate enough memory for the portion of the preconditioner that is used in a single line-solution within the DPLR method; we choose the largest line to size the working arrays. In this way, we save on memory usage but we perform more calculations for the factorizations involved in preconditioning. The opposite is true when we can afford to use more memory but want to make the solver as fast as possible. Then, the arrays that are used for preconditioning are precomputed and stored unaltered until completion of the timestep. In our case, with the DPLR as a preconditioner, we store the ordered LU-factored block tri-diagonal pieces of the operator as what we think of as $M^{-1}$. In this way, the most expensive part of the linear algebra of the timestep is done only once, regardless of the number of Krylov dimensions and outer loops that are employed. During construction of the Krylov-space basis only back-substitutions are necessary. Finally, a single back-substitution is needed for the recovery of the solution by $x = M^{-1} y$. The back-substitutions and the parallel inner-products that are involved in the basis construction are very fast.

It is important to understand how coupling of the degrees of freedom in the linear system happens when this preconditioned GMRES implicit solver is invoked. There are two types of coupling. The first is the coupling that comes from the recursive Krylov-space basis vector construction, when the operator $A$ is applied to the previous vector to create the next one. The other form of coupling comes from within the preconditioner. In our case it is either the block-tridiagonal structure that experiences a direct inversion in the DPLR method, or the relaxation sweeps.
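The points at which the stored factors are applied can be made explicit by writing out the right-preconditioned recurrences in the notation of Eq. (5). The following is a sketch of the standard right-preconditioned Arnoldi step rather than a transcription of the implementation; the symbols $w$ and $h_{ik}$ are introduced here for illustration only.

```latex
\begin{aligned}
q_1 &= \frac{b - A x_{m-1}}{\left\| b - A x_{m-1} \right\|}, \\
w &= A \, M^{-1} q_k, \\
h_{ik} &= q_i^{T} w, \qquad i = 1, \dots, k, \\
q_{k+1} &= \frac{w - \sum_{i=1}^{k} h_{ik} \, q_i}
                {\left\| w - \sum_{i=1}^{k} h_{ik} \, q_i \right\|}, \\
x_m &= x_{m-1} + M^{-1} \sum_{k=1}^{n_k} \lambda_k q_k .
\end{aligned}
```

Each application of $M^{-1}$ corresponds to a back-substitution with factors that were computed once: one per basis vector during construction, and one more when the solution update is recovered.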
We will see that for this set of equations the coupling has direct implications on the robustness of the solver.

IV. Results

IV.A. Flow Over a Blunt Body

The first model problem that is of interest to us is the hypersonic flow around a blunt body. We chose a 30 degree cone that is spherically blunted. This is a generic flowfield for hypersonic simulations that exhibits strong shocks and strong gradients in the boundary layer near the wall. The length of the cone is 1.0 m and the radius of the nose is 0.1 m.

We computed the flow over the blunt body using the CFD solver and by employing three methods: (a) the FMDP, (b) the DPLR, and (c) the GMRES method with FMDP and DPLR as preconditioners. We obtained solutions for three different conditions where the free-stream Reynolds number was varied, assuming that the flow remains laminar. The free-stream velocity was set to 3224 m/s and the free-stream temperature was set to 250 K. These conditions are typical of moderate enthalpy flows. The Reynolds number was varied by changing the free-stream density such that we computed the flow at nominal free-stream Reynolds numbers of $10^5$, $10^6$ and $10^7$. We had performed similar studies in earlier work [8].

In this study, we computed the flow on two sets of grids. The first was a grid that was tailored to align a grid-line with the shock specific to the three conditions, but the wall-spacing was set such that the flow corresponding to the highest Re conditions would result in a maximum $y^+$ of 1.0 at the wall. This grid adequately resolves the flow for the highest Re conditions. The second set of grids was tailored to align with the shock, but the wall spacing was tailored to result in a maximum $y^+$ of 1.0 for each of the conditions. Fig. 2 plots the convergence rates versus number of timesteps for three free-stream Reynolds numbers on different grids that have been tailored specifically for the free-stream conditions.
We see that the DPLR method converges almost independently of the Reynolds number. This is because it can maintain stability at much larger timesteps than the FMDP method. The results are tabulated in Table 1. Interestingly, when the same grid is used (the grid that resolves the highest Re flow), the DPLR method performs marginally better for all cases. The FMDP method exhibits even slower convergence rates for the lower Reynolds number cases on this grid. We simulated the same cases using the preconditioned GMRES implicit solver to make direct comparisons.

[Figure 2: density field residual versus number of timesteps for the DPLR and FMDP methods at Re = $10^5$, $10^6$ and $10^7$.]

Figure 2. Convergence of the DPLR and FMDP methods on grids tailored for the particular conditions of three free-stream Reynolds numbers.

Table 1. Timesteps to steady-state for the blunt body flows using the DPLR method with 4 relaxation sweeps. Surface quantities were used as the criterion for convergence.

            FMDP                          DPLR
Grid        Re=1e+5   Re=1e+6   Re=1e+7   Re=1e+5   Re=1e+6   Re=1e+7
Tailored    6640      3100      9500      1172      1840      1640
High Re     >10000    >10000    9500      780       980       1640

We did not expect the more sophisticated method to perform much better for these well-behaved cases, and results are in agreement with this a priori assessment. Recall that there is a total of four tunable parameters in this implicit solver. For these calculations we fixed the number of restarts (outer loops) of the GMRES method to 10, and we specified a tolerance of $10^{-10}$ for the linear system residual magnitude. This fixed two of the three tunable parameters of the GMRES wrapper. However, we varied the number of Krylov vectors. The other tunable parameter is part of the preconditioner, and that is the number of relaxation sweeps of the DPLR and FMDP methods.

Consider the results tabulated in Table 2 for the highest Re conditions. It is important to note that when two Krylov dimensions are used, the GMRES method has the explicit timestepping update for $q_1$ and what would be considered the equivalent of a solution provided by the preconditioner for $q_2$. Therefore, the final change to the initial guess (which is zero in our case) will be a linear combination of those two vectors. When more Krylov dimensions (vectors) are introduced, the method has more vectors from which the update can be constructed. The number of sweeps affects the quality of those vectors and directly influences convergence.
When the DPLR method is used as a preconditioner, the results do not improve much over those obtained when the DPLR method was used in an open-loop fashion. Increasing the number of Krylov dimensions does not increase performance; meanwhile there is an inherent increase in cost. More importantly, increasing the number of relaxation sweeps within the preconditioner does not seem to affect the convergence. We believe that because the preconditioner (DPLR) performs well for this problem, the GMRES wrapper is unable to offer any additional benefits. In great contrast, when the FMDP method is used as a preconditioner, the GMRES wrapper performs even worse than the original preconditioner by itself. Convergence is greatly hindered in this case.

Table 2. Timesteps to steady-state for the blunt body flow at Re=1e7 using the preconditioned GMRES method with up to 10 restarts and DPLR or FMDP as preconditioners with different numbers of relaxation sweeps and number of Krylov vectors.

        FMDP                   DPLR
nk      kmax=2     kmax=4      kmax=0     kmax=2     kmax=10
2       >10000     >10000      1640       710        1640
10      >10000     >10000      1670       1670       1670

IV.B. Double-Cone Flow

The second model problem has a more complex flowfield. We focused on Run 35 as presented in Nompelis et al.$^{10}$ The flow over the double cone has a large region of separation that evolves slowly in time and alters the shock structure. As a result, the flowfield continues to evolve over a large amount of physical time and long run times are required. For this class of problems, a robust and efficient implicit solver is necessary. We simulated the flow over the model by initializing the domain to free-stream conditions and integrating (timestepping) to steady-state. When the simulation starts and the shock structure is being established, we use timesteps that are not very large, but that are still many times greater than the limit set by the CFL condition for this grid and these conditions. Typically, the basic flow structure is established over approximately 500 steps at a moderate CFL number. When the basic structure is in place, we increase the timestep to many times that corresponding to the stable CFL condition. Fig. 3 plots the convergence rates using the DPLR method for the double-cone flowfield when a different maximum CFL number is used. We notice that when the CFL is kept at a low number (for example 500 or 2,000), the calculation reaches steady state, as indicated by a large reduction in the magnitude of the rate vector followed by a plateau at a small value.
However, when the CFL number is very large, the simulation appears to go into a limit cycle. It should be noted that the flowfields are qualitatively the same at the end of those runs. To get a better understanding of convergence to steady-state of this flowfield, we examined the convergence of the surface quantities. Fig. 3 also plots a residual of the surface quantities defined as $\sum_{j \in \mathrm{wall}} \left[ (p_j^n - p_j^{n-1})^2 + (T_j^n - T_j^{n-1})^2 \right]$. We expect that this norm will reach a plateau at about the same low value for all CFL sequences in spite of ringing of the solution in the interior of the domain. We varied the number of relaxation sweeps of the DPLR method while using the same CFL sequences. The results are tabulated in Table 3. With no relaxation sweeps, all calculations, including those at low CFL numbers, went into a limit cycle. This indicates that for this class of flows with flow separation, even when the linear system has coupling in the wall-normal direction, the coupling that comes through relaxation plays an important role. Consistent with the results of Wright,$^{4}$ we found that when the number of relaxation sweeps is set to 4 the method behaves very well. However, when the number of relaxation sweeps is set to a larger value, the linear system solver becomes unstable at large timesteps.

We simulated the double-cone flow with the GMRES method and DPLR as the preconditioner. We varied the number of Krylov dimensions, the number of outer loops and the number of relaxation sweeps of the preconditioner. We focused on the cases with high maximum CFL numbers (20,000 and 50,000), which exhibited lack of convergence. We used the same CFL sequences as with the DPLR method. Fig. 4 plots the convergence rates in terms of three residuals. The first is the standard timestepping residual. This is plotted on the left axis along with the surface residual defined earlier. The wide band corresponds to the convergence of the linear system within each timestep.
This quantity is plotted on the right axis. At any given timestep on the horizontal axis, the upper part of the band corresponds to the residual of the linear system for the initial invocation of the GMRES solver. With successive restarts (up to 10) the linear system residual drops until it reaches the lower part of the band. The band has a lower bound at the pre-specified tolerance of $10^{-10}$. What is important to notice is that for the high maximum CFL number, plotted on the right of Fig. 4, we reach convergence at the same level for all residual norms. This happens in fewer timesteps because each timestep advances the simulated time by 10 times as much. This shows that when the GMRES method is wrapped around the DPLR preconditioner, the linear system solver is more robust and behaves as expected. However, the cost of the method is naturally higher when the GMRES is used with multiple restarts.

Table 3. Timesteps to steady-state for the double-cone flow using the DPLR method with different numbers of relaxation sweeps. Surface quantities were used as the convergence criterion. Cases marked with X were unstable and cases marked with LC went into a limit cycle.

CFL max     kmax=0     kmax=4     kmax=10
500         LC         43000      >80000
2,000       LC         14800      20000
20,000      LC         7000       X
50,000      LC         7400       X

Figure 3. Convergence to steady state of the double-cone flowfield (left) and convergence of the surface quantities (right) plotted versus timestep number for four different maximum CFL numbers using the DPLR method. [Left panel: density residual; right panel: surface residual norm; curves for DPLR at CFL = 500, 2000, 20000, and 50000.]

Table 4 summarizes the results of the simulations using the preconditioned GMRES method with baseline parameters (10 restarts and tolerance set to $10^{-10}$). We see that with an increasing number of Krylov vectors we do not get appreciable gains in convergence when 10 restarts are performed.
But more strikingly, increasing the number of relaxation sweeps adversely affects the robustness of the solver. The effect is more pronounced when the linear system has less diagonal dominance (compare the results for maximum CFLs of 20,000 and 50,000). When the GMRES method is limited to no restarts the convergence characteristics degrade. Limiting the method to no restarts is equivalent to not letting the linear system converge toward the specified tolerance at each timestep. As a result, it takes many more steps to reach the same level of convergence, as is shown in Table 5. These results are consistent with what we observed when examining the DPLR method, where lack of convergence of the linear system resulted in slow convergence to steady-state. In this case of no restarts, there are cost savings because we do not need to perform the construction of the Krylov space basis 10 times, which has inherent back-substitutions within the preconditioner.

Table 4. Timesteps to convergence of the DPLR-preconditioned GMRES method with up to 10 restarts tabulated by different relaxation sweeps (kmax) and different number of Krylov vectors (nk) for maximum CFL of 20,000 (left) and 50,000 (right). Cases marked with X were unstable.

        CFL = 20,000                     CFL = 50,000
nk      kmax=0    kmax=2    kmax=4       kmax=0    kmax=2    kmax=4
2       4600      4200      X            X         3500      X
3       4376      3950      X            3970      3260      X
4       4124      5730      X            3780      X         X
5       4240      X         X            3570      X         X

Table 5. Timesteps to convergence of the DPLR-preconditioned GMRES method with no restarts tabulated by different relaxation sweeps (kmax) and different number of Krylov vectors (nk) for maximum CFL of 20,000 (left) and 50,000 (right). Cases marked with X were unstable.

        CFL = 20,000                     CFL = 50,000
nk      kmax=0    kmax=2    kmax=4       kmax=0    kmax=2    kmax=4
2       22500     9000      X            22000     8739      X
3       15500     X         X            15500     X         X
4       12400     X         X            12300     X         X
5       X         X         X            10000     X         X
8       X                                X

Figure 4. Convergence to steady state for the DPLR-preconditioned GMRES method for a maximum CFL of 2,000 (left) and 20,000 (right). The time-stepping residual and surface residual norm are plotted on the left axis and the linear system residual at each timestep on the right axis.

Comparisons for different numbers of Krylov vectors and different numbers of restarts are shown in Fig. 5. Increasing the number of Krylov vectors when doing 10 restarts does not significantly accelerate convergence. Increasing the number of restarts to 10 when using two Krylov vectors has a significant effect on convergence. Lack of robustness of the implicit solver was also observed when the number of Krylov vectors was large, as shown in Table 5. We do not have an explanation for this effect.

The important aspect of convergence acceleration is reduction of elapsed CPU time to reach steady state. We expect the more sophisticated and more complex implicit solver to generally be more expensive than the simpler methods. Table 6 compares the cost of using the GMRES-based implicit solver versus the DPLR method in terms of CPU time. For these comparisons we only consider the cases where the relaxation sweep parameter for the GMRES is zero.
These results are compared to the baseline DPLR method, which uses 4 relaxation sweeps. We see that the most basic invocation of the GMRES solver, which uses only two vectors for the basis of the Krylov space and multiple restarts (up to 10), reaches the converged result in almost twice as much elapsed CPU time as the baseline DPLR method. Generally, increasing the number of vectors for the Krylov space basis increases the cost per timestep. However, we saw earlier that convergence is marginally better when more vectors are used. In this case we do not see a significant change in elapsed CPU time. In contrast, the number of Krylov vectors has a more substantial effect on run-times when no restarts of the method are performed. From these results we see that run-times are reduced for the GMRES-based implicit solver when multiple restarts are performed.

Figure 5. Convergence to steady state for the DPLR-preconditioned GMRES method for a maximum CFL of 20,000 plotted for different numbers of Krylov vectors when doing 10 restarts (left) and for different numbers of restarts when using 2 Krylov vectors (right). [Both panels: density field and surface residual; left panel curves for nk = 2, 3, and 4; right panel curves with and without restarts.]

Table 6. Elapsed CPU time to steady-state for the double-cone flow at a maximum CFL of 20,000 using the baseline DPLR method and the preconditioned GMRES method for a different number of Krylov vectors and different number of restarts.

Method                        nk      Elapsed CPU time
DPLR, kmax=4                          3355
GMRES, kmax=0, 10 loops       2       6463
GMRES, kmax=0, 10 loops       3       6420
GMRES, kmax=0, 10 loops       4       6998
GMRES, kmax=0, 0 loops        2       10125
GMRES, kmax=0, 0 loops        3       7440
GMRES, kmax=0, 0 loops        4       6138

IV.C. Inward-turning Inlet Flow

This inward-turning inlet concept$^{12}$ has been used to demonstrate shape optimization capabilities using CFD. The grid for this calculation consists of 4 million points and was built in a manner such that the DPLR method is used in regions near the wall where the flow exhibits strong gradients. Under the conditions of interest the flow is turbulent such that it remains attached in the compression region. Convergence of the flowfield is very important in this case because the CFD solver is placed in a feedback optimization loop and is used as a black box. Fig. 6 plots the convergence rates to steady-state for the inward-turning inlet flowfield. When the DPLR method is used, the time-stepping residual fluctuates and does not converge to a very low value. This indicates that the solution is ringing or experiences low-level unsteadiness. This may be due to lack of convergence of the linear system at each timestep, as was the case with the double-cone flows discussed earlier. On the other hand, the GMRES-based implicit solver converges by over 12 orders of magnitude. The large transient to establish the flowfield takes about the same number of steps to overcome for both methods. This is because the CFL numbers during the transient are kept to small levels. Once the basic flowfield is established the CFL number is greatly increased.
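The two-stage CFL strategy described here and in Section IV.B (a moderate CFL number while the flow structure is established, then a large increase up to a maximum) can be sketched as follows. The paper does not state how the increase is performed, so the geometric ramp, the starting CFL of 10, and the growth factor of 1.05 are assumptions made for illustration. Only the roughly 500-step startup and the maximum CFL values come from the text. The function names are hypothetical, and `local_timestep` uses the usual inviscid estimate of the stable timestep, which may differ from what the solver actually uses.

```python
import math

def cfl_schedule(step, cfl_start=10.0, cfl_max=20000.0, n_startup=500, growth=1.05):
    """CFL number to use at a given 0-based timestep index.

    Holds cfl_start for n_startup steps, then grows geometrically until
    cfl_max is reached. Requires cfl_max > cfl_start and growth > 1.
    """
    if step < n_startup:
        return cfl_start
    k = step - n_startup
    # Number of ramp steps after which the cap is reached; checking it first
    # avoids overflow in growth ** k for very long runs.
    n_ramp = math.ceil(math.log(cfl_max / cfl_start) / math.log(growth))
    if k >= n_ramp:
        return cfl_max
    return min(cfl_max, cfl_start * growth ** k)

def local_timestep(cfl, dx, speed, sound_speed):
    """Timestep for one cell: CFL times the acoustic crossing time (assumed form)."""
    return cfl * dx / (speed + sound_speed)
```

With these assumed defaults the ramp from 10 to 20,000 takes 156 steps after the startup phase. A slower growth factor trades a longer transient for a gentler change in the diagonal dominance of the linear system.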
In the DPLR calculation, the baseline number of relaxation sweeps of 4 was used. In the GMRES based method, no restarts were used and the number of Krylov vectors was 5. Based on what we observed for the earlier cases, we used no relaxation sweeps for the preconditioner.
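The two convergence measures used throughout this section can be sketched in a few lines. The first function evaluates the surface residual of Section IV.B, the sum over wall faces of the squared changes in pressure and temperature between consecutive timesteps. The second reports how many orders of magnitude a residual history has dropped from its peak, which is the sense in which a residual falls by "over 12 orders of magnitude". The function names and the list-based data layout are assumptions made for illustration, not the code used in the paper.

```python
import math

def surface_residual(p_new, p_old, T_new, T_old):
    """Sum over wall faces of (p^n - p^(n-1))^2 + (T^n - T^(n-1))^2."""
    return sum((pn - po) ** 2 + (tn - to) ** 2
               for pn, po, tn, to in zip(p_new, p_old, T_new, T_old))

def orders_of_reduction(history):
    """Orders of magnitude between the peak residual and the latest value."""
    peak, last = max(history), history[-1]
    if last <= 0.0:
        return math.inf  # residual vanished exactly
    return math.log10(peak / last)

# Example: two wall faces, one unchanged and one changing by 1 in both p and T.
print(surface_residual([1.0, 2.0], [1.0, 1.0], [300.0, 301.0], [300.0, 300.0]))
```

A history that oscillates without decreasing returns a small value from `orders_of_reduction` no matter how many steps are taken, which is one simple way to flag the limit-cycle behavior reported for the DPLR method.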

Figure 6. Convergence of the flowfield inside the inward-turning inlet design using the DPLR method (left) and the GMRES method with DPLR as the preconditioner (right). [Axes: density field and surface residual; linear system residual.]

V. Conclusion

We have studied the performance of a GMRES-based implicit solver for hypersonic flow applications. We used three model problems that are relevant to hypersonic applications to examine the convergence characteristics of the solver. We found that for simple external flows such as the flow over a blunt body the method does not offer any benefits in terms of performance or robustness. Moreover, with certain combinations of the tunable parameters the method is less robust than the simpler time-stepping methods currently in use.

We used the solver to simulate a more complex flowfield, the flow over a double-cone model in hypersonic conditions. We were able to draw most of our conclusions about the solver from numerical experiments with this flowfield. In contrast to what we observed when using the DPLR method, the GMRES-based solver is robust and converges better to the steady-state solution. The DPLR method goes into a limit cycle at large timesteps. However, when the GMRES method is wrapped around the DPLR method, which is then used as a preconditioner, the implicit solver is more robust. The iterative linear system solver is significantly more expensive than the underlying preconditioner in this case. Based on our results, the iterative solver is almost twice as expensive at best. Numerical experiments for different values of the tunable parameters show that the cost in terms of elapsed CPU time varies greatly. Results also show that the choice of these parameters affects the robustness of the solver.
In particular, when the DPLR (and possibly any other relaxation method) is used as a preconditioner to GMRES, increasing the relaxation sweeps reduces robustness. In theory, increasing the number of vectors in the Krylov space basis provides more stability. In our study we found that increasing the number of vectors for this set of equations may not only hinder convergence but also adversely affect the robustness of the solver. Restarting the method multiple times per timestep is relatively inexpensive because of how the solver has been incorporated into the CFD code. In particular, the structures used for preconditioning do not need to be computed again when the method is restarted. When using multiple restarts the solver is able to get a better approximation to the solution of the linear system, which enhances convergence to steady state for this class of problems, as we have seen. In our numerical experiments with the double-cone we observed that using up to 10 restarts ultimately costs less than using no restarts in spite of the added operations that are