Low-Communication, Parallel Multigrid Algorithms for Elliptic Partial Differential Equations


University of Colorado Boulder, CU Scholar
Applied Mathematics Graduate Theses & Dissertations, Summer 2017

Recommended Citation: Mitchell, Wayne, "Low-Communication, Parallel Multigrid Algorithms for Elliptic Partial Differential Equations" (2017). Applied Mathematics Graduate Theses & Dissertations.

Low-Communication, Parallel Multigrid Algorithms for Elliptic Partial Differential Equations

by Wayne Bradford Mitchell
B.S., Loyola University, 2013
M.S., University of Colorado, 2016

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Applied Mathematics, 2017.

This thesis entitled:
Low-Communication, Parallel Multigrid Algorithms for Elliptic Partial Differential Equations
written by Wayne Bradford Mitchell
has been approved for the Department of Applied Mathematics

Thomas A. Manteuffel

Stephen F. McCormick

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above-mentioned discipline.

Mitchell, Wayne Bradford (Ph.D., Applied Mathematics)
Low-Communication, Parallel Multigrid Algorithms for Elliptic Partial Differential Equations
Thesis directed by Thomas A. Manteuffel

When solving elliptic partial differential equations (PDEs), multigrid algorithms often provide optimal solvers and preconditioners capable of producing solutions with O(N) computational cost, where N is the number of unknowns. As the parallelism of modern supercomputers continues to grow towards exascale, however, the cost of communication has overshadowed the cost of computation as the next major bottleneck for multigrid algorithms [21]. Typically, multigrid algorithms require O((log P)^2) communication steps in order to solve a PDE problem to the level of discretization accuracy, where P is the number of processors. This has inspired the development of new algorithms that employ novel paradigms for parallelizing PDE problems, and this thesis studies and further develops two such algorithms.

The nested iteration with range decomposition (NIRD) algorithm, originally presented by Appelhans et al. in [3], is known to achieve accuracy similar to that of traditional methods in only a single iteration with log P communication steps for simple elliptic problems. This thesis makes several improvements to the NIRD algorithm and extends its application to a much wider variety of problems, while also examining and updating previously proposed convergence theory and performance models.

The second method studied is the algebraic multigrid with domain decomposition (AMG-DD) algorithm, originally presented by Bank et al. in [6]. Though previous work showed only marginal benefits when comparing convergence factors for AMG-DD against standard AMG V-cycles, this thesis studies the potential of AMG-DD as a discretization-accuracy solver. In addition to detailing the first parallel implementation of this algorithm, this thesis also shows new results that study the effect of different AMG coarsening and interpolation strategies on AMG-DD convergence and show some evidence to suggest that AMG-DD may achieve discretization accuracy in a fixed number of cycles with O(log P) communication cost even as problem size increases.

Dedication

For my parents, who taught me the value of scientific thought, inspired me to pursue the highest levels of education, and gave me the love and support I needed to confidently make my own way through life.

Acknowledgements

I give many thanks to my academic mentors, Tom Manteuffel, Steve McCormick, Rob Falgout, and John Ruge, who have been generous with their time, patient in their teaching, and who constantly show and inspire great enthusiasm for research in mathematics and computation. I also thank my peers in the Grandview Gang, Ben Southworth, Aly Fox, Jeff Allen, Ben O'Neill, Delyan Kalchev, Chris Leibs, and Steffen Muenzenmaier, for being encouraging fellow students, engaged and interested fellow researchers, and excellent drinking buddies. Finally, I thank my wife, Rebecca, for being patient, supportive, and loving throughout the (often stressful) process of creating this thesis and, more importantly, for being my favorite person to be around for all the times when I'm not thinking about math.

Contents

1 Introduction
  1.1 Preliminaries
    1.1.1 First-order system least-squares (FOSLS)
    1.1.2 Accuracy-per-computational-cost efficiency-based refinement (ACE)
    1.1.3 Nested iteration
    1.1.4 Algebraic multigrid (AMG)
2 Nested Iteration with Range Decomposition (NIRD)
  2.1 NIRD algorithm description
  2.2 Algorithm details
    2.2.1 The preprocessing step
    2.2.2 Partitions of unity
    2.2.3 Subproblem functional measurement
  2.3 NIRD convergence theory
    2.3.1 Examining a previous NIRD convergence proof
    2.3.2 An updated NIRD convergence proof
  2.4 Modeling communication cost
    2.4.1 Traditional NI and NIRD models
    2.4.2 Model comparisons with varying problem difficulty
  2.5 NIRD numerical results
    2.5.1 Poisson
    2.5.2 Advection-diffusion
    2.5.3 Poisson with anisotropy or jump coefficients
    2.5.4 Higher-order finite elements
  2.6 Discussion
3 Algebraic multigrid with domain decomposition (AMG-DD)
  3.1 AMG-DD algorithm description
  3.2 Algebraic multigrid interpolation
    3.2.1 Full algebraic multigrid (FAMG)
    3.2.2 Improving AMG interpolation
    3.2.3 Numerical results for AMG-DD
  3.3 Parallel implementation
    3.3.1 The residual communication algorithm
    3.3.2 Implementation details
    3.3.3 Composite solves
4 Conclusions

Bibliography

Tables

2.1 Measured constants for Poisson with a smooth RHS
2.2 Table of relevant values for NIRD using different PoUs for Poisson with a smooth RHS
2.3 Table of relevant values for NIRD using different PoUs for Poisson with an oscillatory RHS
2.4 Table of relevant values for NIRD using different PoUs for advection-diffusion with a smooth RHS and no boundary layer in the solution
2.5 Table of relevant values for NIRD using different PoUs for advection-diffusion with a smooth RHS and a boundary layer in the solution
2.6 Table of relevant values for NIRD using different PoUs for advection-diffusion with an oscillatory RHS and a boundary layer in the solution
2.7 Table of relevant values for NIRD using different PoUs for anisotropic Poisson with a smooth RHS
2.8 Table of relevant values for NIRD using different PoUs for anisotropic Poisson with an oscillatory RHS

Figures

2.1 Comparison of NIRD convergence with and without an adaptive preprocessing step using 1,024 processors
2.2 Visualization of example characteristic functions (top) and the resulting subproblem refinement patterns (bottom) for the discontinuous (left), C^0 (middle), and C^∞ (right) PoUs when applying NIRD to Poisson with a smooth right-hand side using 16 processors
2.3 Convergence of the LSF for the subproblems on each of 16 processors (note that many of the curves lie on top of each other due to the symmetry of this problem) for Poisson's equation with naive functional measurement (left) and measurement with the Ker(L*) components removed from the functional (right)
2.4 ACE refinement patterns for the subproblem on processor 0 for Poisson's equation with naive functional measurement (left) and measurement with the Ker(L*) components removed from the functional (right). Note that naive functional measurement uses more elements outside the home domain due to the measurement of superfluous error
2.5 Plot of model predictions of number of communications vs. number of processors for NI vs. NIRD
2.6 NIRD convergence for Poisson with a smooth right-hand side using a discontinuous PoU
2.7 Comparison of NIRD convergence using different PoUs for Poisson with a smooth right-hand side and 1,024 processors
2.8 NIRD convergence for Poisson with an oscillatory right-hand side using a discontinuous PoU
2.9 Solutions to the advection-diffusion problem with positive b (left) and negative b (right)
2.10 NIRD convergence for advection-diffusion with a smooth right-hand side and no boundary layer in the solution using a discontinuous PoU
2.11 NIRD convergence for advection-diffusion with a smooth right-hand side and a boundary layer in the solution using a discontinuous PoU
2.12 Comparison of NIRD convergence using different PoUs for advection-diffusion with a smooth right-hand side, a boundary layer in the solution, and 1,024 processors
2.13 NIRD convergence for jump-coefficient Poisson with a smooth right-hand side using a discontinuous PoU
2.14 Comparison of NIRD convergence using different PoUs for anisotropic Poisson with a smooth right-hand side and 1,024 processors
2.15 Comparison of the traditional ACE refinement pattern (top left) vs. the union mesh produced by NIRD using the discontinuous PoU (top right), the C^0 PoU (bottom left), and the C^∞ PoU (bottom right)
2.16 NIRD convergence for anisotropic Poisson with a smooth right-hand side using a C^∞ PoU
2.17 Solution to the wave front problem used for testing higher-order elements
2.18 NIRD convergence using a discontinuous PoU, 256 processors, and varying the polynomial order of the finite element discretization for the wave front problem (note the difference in scale for the vertical axis as polynomial order increases)
2.19 NIRD subproblem convergence using a discontinuous PoU, 16 processors, and varying the polynomial order of the finite element discretization for the wave front problem (note the difference in scale for the vertical axis as polynomial order increases)
3.1 Visualization of composite grid creation steps (note that here η = 2 on each level)
3.2 Relative total error convergence for standard FAMG vs. FMG on n x n finite element grids of increasing size
3.3 Interpolation error across levels of the multigrid hierarchies generated by standard AMG vs. GMG (with 0 representing the finest level)
3.4 Plot of the finest-level interpolation error over the grid for GMG (top) vs. standard AMG (bottom). The cross section is taken through the middle of the domain
3.5 Interpolation error across multigrid levels and FAMG convergence when choosing interpolation to fit the sine hump
3.6 Interpolation error across multigrid levels and FAMG convergence when choosing interpolation to fit x as defined in (3.4). Convergence is shown when using ν = 2 smoothings and when using ν = 8 smoothings. Interpolation error is shown for ν = 2, 4, 8 smoothings
3.7 Interpolation error across multigrid levels and FAMG convergence when choosing interpolation to fit x as defined in (3.5) using ν = 2 boundary smoothings
3.8 Interpolation error across multigrid levels and FAMG convergence when choosing interpolation to fit x as defined in (3.5) using ν = 2 boundary smoothings for the jump coefficient problem
3.9 Interpolation error across multigrid levels and FAMG convergence when choosing interpolation to fit x as defined in (3.5) using ν = 2 boundary smoothings for the linear elasticity problem
3.10 AMG-DD convergence for 256 simulated processors comparing different underlying AMG coarsening and interpolation strategies
3.11 AMG-DD convergence for 256 simulated processors comparing different paddings
3.12 AMG-DD convergence when scaling the padding with problem size (left) vs. when keeping a fixed padding (right)
3.13 Visualization of the residual communication algorithm from the perspective of the receiving processor (home domain in the red box). Examples of the sets Ψ and Ψ^c are shown in blue boxes. The nodes are numbered by the communication from which they are received, with boxed numbers denoting the neighboring nodes which were the root of the communicated Ψ^c grid
3.14 Comparison of AMG-DD composite grid setup time vs. the setup time of the initial AMG hierarchy for n = 100, 1000 (where n^2 is the number of nodes per processor)
3.15 Breakdown of where time is spent in the composite grid setup phase
3.16 Measured residual communication time vs. model predictions for η = 1 and n = 100, 1000
3.17 Diagram of an FAC cycle in 1D. Real nodes in the composite grid are blue and ghost nodes are white. The orange box represents a processor's home domain

Chapter 1: Introduction

When solving elliptic partial differential equations (PDEs), multigrid algorithms often provide optimal solvers and preconditioners capable of producing solutions with O(N) computational cost, where N is the number of unknowns. As the parallelism of modern supercomputers continues to grow towards exascale, however, the cost of communication has overshadowed the cost of computation as the next major bottleneck for multigrid algorithms [21]. Typically, multigrid algorithms require O((log P)^2) communication steps in order to solve a PDE problem to the level of discretization accuracy, where P is the number of processors. Reducing the cost of communication for multigrid algorithms is a major area of research, and various approaches to this problem are currently being explored. A good survey of much of this work is given in [17].

One approach to reducing communication in multigrid algorithms is to optimize their parallel implementation. Such approaches do not change the underlying algorithms, but rather seek to achieve speedups by optimizing operations such as sparse matrix-vector multiplications [9] or the determination of processor neighbors [5]. As such, the original convergence properties of the underlying methods are retained, but the reductions in communication cost resulting from these optimizations are relatively modest.

A different approach is to modify the multigrid algorithm itself. Much work has been done in this area, especially involving efforts to reduce the complexity of operators and the density of communication patterns on the coarser grids of multigrid hierarchies [18, 20]. Although such modifications of the underlying multigrid hierarchies can be detrimental to convergence, the savings in communication

cost is often more than enough to ensure a faster method overall.

A third approach to the problem of high communication costs is to design new algorithms altogether, starting from the premise of minimizing communication. Some notable examples of this kind of work include the algorithms developed by Mitchell [23, 24, 26] and by Bank and Holst [7, 8]. This thesis studies two recently developed, low-communication algorithms, Nested Iteration with Range Decomposition [3] and Algebraic Multigrid with Domain Decomposition [6], which aim to solve elliptic PDEs with O(log P) communication cost by trading additional computation on each processor for a reduction in communication cost.

Traditional parallel PDE solvers split the work of solving a problem among processors by partitioning the computational domain or degrees of freedom [17]. Each processor then owns information about the problem and is responsible for the solution of the problem over its assigned partition of the domain, which is referred to here as that processor's home domain. Problems that are parallelized in this way require inter-processor communication for many important global operations, such as the communication of boundary information with nearest processor neighbors to perform matrix-vector multiplications. Thus, these communications must occur many times over the course of a single multigrid cycle. To avoid this frequent need for communication, the algorithms described in this thesis decompose the problem itself rather than the computational domain. The resulting subproblems are defined over the global computational domain and can be stored and solved on individual processors without the need for communication. Solutions to these subproblems are then recombined (involving communication) in order to improve a global solution to the original problem. Thus, the algorithms described here seek to perform much more computation on each processor before needing to communicate.

Chapter 2 presents the Nested Iteration with Range Decomposition (NIRD) algorithm. This method, originally proposed by Appelhans et al. in [3], decomposes the original PDE by applying a partition of unity to the right-hand side. Previous work has shown that NIRD converges to a high level of accuracy (comparable to that of a traditional solution method) within a small, fixed number of iterations (usually one or two) when applied to simple elliptic problems such as

two-dimensional Poisson over a square domain [3]. The work presented in this thesis adds some important capabilities and features to the NIRD algorithm, allowing it to be successfully applied to a much wider set of test problems and discretizations. Previously developed convergence theory is also examined and extended in order to better understand and predict when NIRD performs very well, and a new performance model is developed that shows increased performance benefits for NIRD when problems are more expensive to solve using traditional methods. Ultimately, Chapter 2 provides further analysis and evidence to suggest that NIRD achieves excellent convergence on a wide class of elliptic PDEs and, as such, should be a very competitive method for solving PDEs on large parallel computers.

Chapter 3 studies the Algebraic Multigrid with Domain Decomposition (AMG-DD) algorithm. This method decomposes the algebraic problem resulting from the discretization of a PDE rather than the PDE itself and utilizes the coarse grids of an underlying algebraic multigrid hierarchy in order to generate locally refined versions of the global problem on each processor. Previous work by Bank et al. on this algorithm focused on comparing asymptotic convergence factors of AMG-DD cycles with traditional AMG V-cycles and found only marginal benefit [6]. This thesis, on the other hand, focuses on examining AMG-DD as a discretization-accuracy solver, i.e., its potential to achieve a solution with accuracy as good as the discretization allows within a small, fixed number of cycles. As part of this study, Chapter 3 examines some properties of the underlying algebraic multigrid hierarchy as well as parameters of AMG-DD and how these factors affect the algorithm's overall convergence. In addition, some detail is given on the first parallel implementation of this algorithm.

1.1 Preliminaries

This section briefly describes some preliminary information and underlying principles and algorithms that are used throughout this thesis. Preliminaries for the NIRD algorithm are discussed first and include a description of the PDE discretization and adaptive mesh refinement strategy used here as well as a description of nested iteration. Standard algebraic multigrid is then briefly

described, as a good understanding of this algorithm is required for the work presented in Chapter 3.

1.1.1 First-order system least-squares (FOSLS)

Though the NIRD algorithm may be applied to any discretization that yields a locally sharp a posteriori error estimate, the discussion of NIRD in this thesis specifically uses a first-order system least-squares (FOSLS) method to discretize the PDE to be solved. The FOSLS method first casts the PDE as a first-order system, LU = f, and then a solution over some Hilbert space, W, is obtained by minimizing

\[
U = \arg\min_{V \in W} \| LV - f \|, \tag{1.1}
\]

where ‖·‖ is the L² norm. This formulation has several very attractive properties, especially when applied to convection-diffusion-reaction equations, a class of problems that are a main focus of this thesis.

Consider the following elliptic PDE on a bounded, open, connected domain, Ω ⊂ R², with Lipschitz boundary, ∂Ω:

\[
\begin{aligned}
-\nabla\cdot(A\nabla p) + Xp &= f, && \text{in } \Omega, \\
p &= 0, && \text{on } \Gamma_D, \\
\mathbf{n}\cdot\nabla p &= 0, && \text{on } \Gamma_N,
\end{aligned} \tag{1.2}
\]

where ∇· and ∇ are the divergence and gradient operators, respectively, A is a 2x2 symmetric matrix of functions in L^∞(Ω), X is, at most, a first-order linear differential operator, Γ_D ∪ Γ_N = Γ is a partitioning of the boundary of Ω, and n is the unit outward normal to the boundary. The first-order system for this PDE can then be constructed by introducing

\[
\mathbf{u} = \nabla p. \tag{1.3}
\]

Using the above definition and adding a curl constraint on u yields the first-order system,

\[
\begin{aligned}
\mathbf{u} - \nabla p &= 0, && \text{in } \Omega, \\
-\nabla\cdot(A\mathbf{u}) + Xp &= f, && \text{in } \Omega, \\
\nabla\times\mathbf{u} &= 0, && \text{in } \Omega, \\
p &= 0, && \text{on } \Gamma_D, \\
\mathbf{n}\cdot\mathbf{u} &= 0, && \text{on } \Gamma_N, \\
\boldsymbol{\tau}\cdot\mathbf{u} &= 0, && \text{on } \Gamma_D,
\end{aligned} \tag{1.4}
\]

where τ is the unit vector tangent to the boundary. Collecting the variables, U = (u, p)^T, and right-hand side, f = (0, f, 0), and letting L represent the action of the left-hand-side operators, the full system may be expressed in shorthand as LU = f. The associated least-squares functional (LSF) is

\[
\| LU - f \|^2 = \| \mathbf{u} - \nabla p \|^2 + \| -\nabla\cdot(A\mathbf{u}) + Xp - f \|^2 + \| \nabla\times\mathbf{u} \|^2. \tag{1.5}
\]

This functional is continuous and coercive in the H¹ Sobolev norm, as shown by Cai et al. in [15, 16], which implies optimal error estimates for finite element approximation by subspaces of H¹ and the existence of optimal multigrid solvers for the resulting discrete problems. This allows the local least-squares functional (LLSF), ‖LU - f‖_τ, evaluated over a finite element, τ, to be used as an a posteriori error estimate for adaptive mesh refinement, which is an important component of the NIRD algorithm described in Chapter 2.
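As a concrete instance (written out here only for illustration; it is not spelled out above), take A = I and X = 0 in (1.2), i.e., the Poisson equation -Δp = f with u = ∇p. The system (1.4) and functional (1.5) then reduce to

\[
\begin{aligned}
\mathbf{u} - \nabla p &= 0, \\
-\nabla\cdot\mathbf{u} &= f, \\
\nabla\times\mathbf{u} &= 0,
\end{aligned}
\qquad
\| LU - f \|^2 = \| \mathbf{u} - \nabla p \|^2 + \| \nabla\cdot\mathbf{u} + f \|^2 + \| \nabla\times\mathbf{u} \|^2,
\]

with the same boundary conditions as in (1.4). Minimizing this functional over an H¹-conforming finite element space for (u, p) is the form the discretization takes for the Poisson test problems discussed later in this thesis.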

1.1.2 Accuracy-per-computational-cost efficiency-based refinement (ACE)

The adaptive mesh refinement (AMR) strategy used throughout this thesis is designed to maximize accuracy-per-computational-cost efficiency (ACE). The ACE refinement method, originally proposed by DeSterck et al. in [19], aims to refine a fraction, r ∈ (0, 1], of the total number of elements at a given refinement level by minimizing an effective functional reduction factor, γ(r)^{1/W(r)}, where γ(r) is the functional reduction expected after refining a fraction r of the elements and W(r) is the anticipated amount of work required to solve on that refined grid. This is equivalent to choosing

\[
r_{\mathrm{opt}} = \arg\min_{r \in (0,1]} \frac{\log(\gamma(r))}{W(r)}. \tag{1.6}
\]

The logic behind this minimization is the desire to maximize functional reduction per unit of work. ACE refinement has been shown to perform well on a wide variety of problems, including FOSLS discretizations of PDEs where the FOSLS functional is used as the a posteriori error measure for refinement [2].

1.1.3 Nested iteration

Nested iteration (NI) is an efficient, multilevel method for solving PDEs that proceeds by solving first on a very coarse grid, refining, and interpolating the coarse-grid solution to serve as an initial guess on the next grid. This process of solving and interpolating is then repeated until a desired finest grid is reached. Since it leverages coarse-grid solutions, which are relatively cheap to obtain, NI is usually much more efficient than simply solving the fine-grid problem starting from a random or zero initial guess. Furthermore, when paired with a suitable adaptive refinement scheme (such as ACE), NI not only provides a solution with optimal cost but also generates an optimal fine-grid mesh for that solution.

In a geometric multigrid setting, where the solves on each grid level are accomplished by V-cycles, NI is sometimes also referred to as full multigrid (FMG). In this case, it is straightforward to show that a single FMG cycle (using a sufficient, constant number of V-cycles to solve on each grid level) yields a solution with accuracy on the order of the discretization on the finest grid with O(N) computational cost, where N is the number of fine-grid unknowns. A brief inductive proof of this fact can be found in [14] and is summarized here. The goal of the proof is to show that performing a constant number of V-cycles on each level of the FMG cycle is sufficient to reduce the fine-grid error to the level of discretization accuracy, resulting in an overall O(N) cost for the whole cycle.

Let u^h and u^{2h} be fine- and coarse-grid discrete solutions to the given problem on grid levels h and 2h, respectively, where h is the step size of the fine-grid discretization:

\[
A^h u^h = f^h, \qquad A^{2h} u^{2h} = f^{2h}.
\]

Now assume that the approximation of the solution on the coarse grid, v^{2h}, has accuracy on the order of the grid-2h discretization:

\[
\| u^{2h} - v^{2h} \|_{A^{2h}} \leq K (2h)^p,
\]

where p is the discretization order and K is a constant. The A^{2h}-norm used here is the grid-2h energy norm, defined as ‖w‖_{A^{2h}} = ⟨A^{2h} w, w⟩^{1/2}, where ⟨·,·⟩ is the discrete L² inner product. A similar definition follows for the A^h-norm. This is simply for ease of presentation, since the argument may be made in any norm for which the assumptions presented here hold. Now assume that interpolation has accuracy on the order of the fine-grid discretization:

\[
\| u^h - P u^{2h} \|_{A^h} \leq K \alpha h^p, \tag{1.7}
\]

where P is the interpolation operator from grid 2h to grid h and α is a positive constant of O(1). Putting these assumptions together yields a bound on the error present in the interpolated approximation P v^{2h}:

\[
\begin{aligned}
\| u^h - P v^{2h} \|_{A^h} &\leq \| u^h - P u^{2h} \|_{A^h} + \| P u^{2h} - P v^{2h} \|_{A^h} \\
&= \| u^h - P u^{2h} \|_{A^h} + \| u^{2h} - v^{2h} \|_{A^{2h}} \\
&\leq K \alpha h^p + K (2h)^p \\
&= K (\alpha + 2^p) h^p \\
&= K \beta h^p,
\end{aligned}
\]

where β = (α + 2^p) is the amount of error reduction required to achieve discretization accuracy on the fine grid. Note that the equality in the second line above follows from the Galerkin condition (A^{2h} = P^T A^h P):

\[
\begin{aligned}
\| P u^{2h} - P v^{2h} \|_{A^h}^2 &= \langle A^h P (u^{2h} - v^{2h}), P (u^{2h} - v^{2h}) \rangle \\
&= \langle P^T A^h P (u^{2h} - v^{2h}), (u^{2h} - v^{2h}) \rangle \\
&= \langle A^{2h} (u^{2h} - v^{2h}), (u^{2h} - v^{2h}) \rangle \\
&= \| u^{2h} - v^{2h} \|_{A^{2h}}^2.
\end{aligned}
\]

Thus, given a V-cycle with a grid-independent convergence factor, ρ, performing log_{1/ρ} β = O(1) cycles is sufficient to reduce the fine-grid error to the level of discretization accuracy:

\[
\| u^h - v^h \|_{A^h} \leq K h^p.
\]

Note that the proof above shows that NI (or FMG) has optimal scaling in terms of computational cost, but says nothing about the scalability of communication cost for FMG in a parallel setting. The efficiency of FMG comes from its ability to leverage information from coarser grids, which are computationally much cheaper than the fine grid. This strength does not translate to communication cost for traditional NI implementations, however, due to the need for a constant number of communications on each level of the multigrid hierarchy. Thus, traditional NI in a weak scaling context (where the number of unknowns per processor is kept fixed and the number of processors is scaled) accrues an O((log P)^2) communication cost.

Similarly, applying standard V-cycles to the fine-grid problem yields O((log P)^2) communication cost (though with a larger constant than FMG) due to the need for O(log P) cycles to solve the problem. Again, the algorithms presented in this thesis attempt to significantly reduce this communication cost to O(log P).
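For a rough sense of the gap between these two scalings, the short sketch below tabulates (log₂ P)² versus log₂ P communication steps. The proportionality constants are omitted (they differ between methods and machines), so the numbers are only illustrative of the asymptotic claim above, not measured costs.

```python
# Illustrative comparison of the communication-step estimates quoted above:
# O((log P)^2) for a traditional nested iteration (FMG) solve versus the
# O(log P) target of the low-communication algorithms studied in this thesis.
# Constants are deliberately left out, so these are only asymptotic counts.
for log2_P in (10, 14, 17, 20):           # P = 2^10 ... 2^20 processors
    P = 2 ** log2_P
    ni_steps = log2_P ** 2                # ~ (log2 P)^2 communication steps
    target_steps = log2_P                 # ~ log2 P communication steps
    print(f"P = 2^{log2_P:2d}: traditional NI ~ {ni_steps:4d} steps, "
          f"O(log P) target ~ {target_steps:3d} steps")
```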

1.1.4 Algebraic multigrid (AMG)

The algebraic multigrid with domain decomposition (AMG-DD) algorithm discussed in Chapter 3 is built on top of an underlying algebraic multigrid (AMG) hierarchy. Thus, some understanding of basic AMG hierarchies and how they are constructed is needed before discussing AMG-DD in detail.

All multigrid algorithms are composed of the complementary processes of relaxation and coarse-grid correction. In geometric multigrid, oscillatory components of the error are quickly damped by relaxation, while coarse-grid correction addresses smooth components of the error. The key observation of multigrid is that smooth errors are representable on coarser grids, where their representation becomes relatively more oscillatory and relaxation again becomes a viable method of damping this error. The two-level ideas of relaxation and coarse-grid correction can be applied recursively to form a multigrid cycle.

Algebraic multigrid [10] attempts to utilize the above multigrid principles in a purely algebraic setting; that is, it attempts to construct a multigrid hierarchy based solely on a given fine-grid matrix. Thus, the concepts of oscillatory and smooth error must be generalized and adapted to the algebraic setting. Relaxation schemes are generally chosen to be simple, point-wise iterative methods like Jacobi or Gauss-Seidel, though block versions of these methods or more complicated smoothers may also be used, depending on the problem. In an algebraic setting, relaxation is typically used to attenuate components of the error associated with large eigenvalues of the fine-grid matrix (generalizing what is meant by "oscillatory"), leaving the lower end of the spectrum (the algebraically smooth error) to be addressed by coarse-grid correction.

AMG must also algebraically construct coarse grids, interpolation and restriction operators, and a coarse representation of the problem. Developing strategies for how to build each of these pieces is an active area of research and is quite problem dependent. This thesis, however, focuses on the classical AMG algorithm originally designed to solve elliptic PDEs [10, 28]. Here, coarse grids are generated using algebraic strength of connection, which is determined by the relative size of the matrix entries connecting different degrees of freedom, the guiding idea being that fine-grid points may be accurately interpolated from strongly connected coarse-grid points. Interpolation (discussed in further detail in Section 3.2.2) is also constructed based on the matrix entries, restriction is then typically chosen as the transpose of interpolation, and the coarse-grid operator, A_c, is constructed via the Galerkin condition,

\[
A_c = P^T A P, \tag{1.8}
\]

where P is the interpolation operator and A is the fine-grid operator.
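As a small illustration of the Galerkin condition (1.8), the sketch below forms A_c = P^T A P with SciPy sparse matrices. It is a stand-in, not the thesis code: it uses a 1D Poisson matrix and a hand-built linear interpolation operator in place of the strength-of-connection-based interpolation that classical AMG would construct.

```python
# Minimal sketch of the Galerkin coarse-grid operator A_c = P^T A P (1.8),
# using a 1D Poisson matrix and linear interpolation as a stand-in for an
# AMG-built P.
import scipy.sparse as sp

n = 7                                     # fine-grid interior points (odd)
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")

nc = (n - 1) // 2                         # coarse points at every other fine point
P = sp.lil_matrix((n, nc))
for j in range(nc):
    i = 2 * j + 1                         # fine index of coarse point j
    P[i, j] = 1.0                         # identity at C-points
    P[i - 1, j] = 0.5                     # linear interpolation to F-point neighbors
    P[i + 1, j] = 0.5
P = P.tocsr()

A_c = P.T @ A @ P                         # Galerkin coarse-grid operator
print(A_c.toarray())
```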

Chapter 2: Nested Iteration with Range Decomposition (NIRD)

On a parallel machine, traditional PDE solvers decompose the computational domain over processors, each of which is responsible for solving only over its home domain. This way of partitioning work requires the continual exchange of boundary information as the algorithm progresses. In the NIRD algorithm, on the other hand, the PDE itself is decomposed over processors, each of which solves its own subproblem on the global computational domain. Thus, the subproblems are solved without the need for communication. Solutions from the subproblems are then recombined to improve the global solution.

Previous work on this algorithm by Appelhans et al. [3] developed an initial implementation of the algorithm and tested it on model Poisson and advection-diffusion problems. Ultimately, it was shown that, for the model Poisson problem, NIRD achieves accuracy similar to traditional methods within a single iteration, and some convergence theory was established to motivate this single-cycle convergence. This thesis extends the work done in [3] by reimplementing NIRD with several new capabilities and advantages over the previous code, revisiting and updating the convergence theory, and applying NIRD to a wider set of test problems to gain a better overall understanding of the algorithm's behavior, the validity of the convergence theory, and the class of problems for which NIRD performs well.

This chapter begins with a description of NIRD in Section 2.1, followed by a more detailed discussion of some of the important improvements to the algorithm over the implementation used in [3] in Section 2.2. Section 2.3 then examines and updates the previously developed convergence

theory with the goal of providing a more general theoretical framework that gives better insight into the behavior of NIRD. A performance model is then developed in Section 2.4, which demonstrates the new insight that NIRD not only achieves a significant reduction in the number of communications compared with a traditional nested iteration approach, but also provides even larger savings for problems that are more difficult and expensive to solve using traditional methods. Finally, Section 2.5 shows numerical results for NIRD applied to a wide variety of elliptic test problems. These results not only give some insight into the theoretical understanding of the algorithm as discussed in Section 2.3 but also showcase the utility of many of the improvements to NIRD discussed in Section 2.2, which enhance the algorithm's performance on more difficult problems. Ultimately, NIRD is shown to perform well for elliptic problems with a variety of difficulties as well as for problems using higher-order finite elements.

2.1 NIRD algorithm description

The NIRD algorithm begins with a preprocessing step performed simultaneously on all processors, requiring no communication. First, adaptively solve the original PDE until there are at least P elements in the mesh, where P is the number of processors. This coarse mesh is then partitioned into home domains, Ω_l, one for each processor, l, and a corresponding partition of unity (PoU) is defined. As a simple example, consider the PoU defined by the characteristic functions,

\[
\chi_l =
\begin{cases}
1, & \text{in } \Omega_l, \\
0, & \text{elsewhere.}
\end{cases} \tag{2.1}
\]

Note that adaptive refinement should place approximately equal portions of the error in each Ω_l, though the Ω_l's may be very different sizes. Also note that what is meant by "error" here is left intentionally vague, since the NIRD algorithm may be applied to any discretization method with an accurate a posteriori error measure. The discussion here uses a FOSLS discretization along with ACE refinement as described in Sections 1.1.1 and 1.1.2. The definition of the home domains and corresponding PoU completes the preprocessing step.
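A minimal sketch of the equal-error partitioning goal stated above is given below. It uses a 1D greedy sweep over per-element error estimates with random stand-in data; the actual mesh partitioner used in the thesis is not described here, so this is only an illustration of the intent, not the implementation.

```python
# Simplified sketch of the home-domain idea: split a sequence of coarse
# elements (with a posteriori error estimates) into P contiguous home domains
# carrying roughly equal error.
import numpy as np

def partition_by_error(errors, P):
    """Return P index lists, each holding roughly 1/P of the total error."""
    total = float(np.sum(errors))
    domains, current, accumulated = [], [], 0.0
    for i, e in enumerate(errors):
        current.append(i)
        accumulated += e
        # close this home domain once the running total reaches its share,
        # keeping exactly P domains overall
        if accumulated >= total * (len(domains) + 1) / P and len(domains) < P - 1:
            domains.append(current)
            current = []
    domains.append(current)
    return domains

rng = np.random.default_rng(0)
elem_errors = rng.random(32)              # stand-in for per-element LLSF values
for l, dom in enumerate(partition_by_error(elem_errors, P=4)):
    print(f"home domain {l}: {len(dom)} elements, "
          f"error = {elem_errors[dom].sum():.3f}")
```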

As suggested by its name, the NIRD algorithm then decomposes the PDE problem by decomposing the range (or right-hand side) of the residual equation:

\[
L\,\delta u = f - L u^h_0 = \sum_l \chi_l (f - L u^h_0),
\]

where u^h_0 is the coarsely meshed preprocessing solution. Each processor then adaptively solves a subproblem over the entire domain for its right-hand side:

\[
\delta u^h_l = \arg\min_{v^h \in W_l} \| L v^h - f_l \|, \tag{2.2}
\]

where f_l = χ_l (f - L u^h_0) is non-zero only on or near processor l's home domain. This solve is the main computation step of the algorithm and involves no communication, since each processor solves its subproblem over the entire domain. Note that the PoU should be chosen so that the support of the right-hand side, f_l, is localized on or near the home domain, Ω_l, for each processor. This should lead to an adaptive mesh that is refined primarily in, and remains coarse away from, the home domain, which allows each processor to obtain accurate subproblem solutions while using a reasonable number of elements. The global solution update is then calculated by summing the solutions from each processor:

\[
u^h_1 = u^h_0 + \sum_l \delta u^h_l.
\]

This sum can be accomplished with O(log P) communication complexity [3] and yields a solution that lives on a globally refined union mesh obtained by taking the union of the meshes from each subproblem. A new global residual may then be formed and new updates calculated to produce an iteration. Algorithm 1 shows the pseudocode for the NIRD algorithm in its entirety.

Algorithm 1: NIRD algorithm.

    Solve L u^h_0 = f using NI with AMR until the coarse mesh has at least P elements.
    Partition the coarse mesh into home domains, Ω_l.
    Define a PoU with characteristic functions, χ_l.
    for i = 0, ..., (num iterations) do
        Let f_l = χ_l (f - L u^h_i), where l is this processor's rank.
        Solve the subproblem, L δu^h_l = f_l, using NI with AMR.
        Update the global solution, u^h_{i+1} = u^h_i + Σ_l δu^h_l (communication step).
    end for

Having described the NIRD algorithm, it is worth comparing it here to other low-communication algorithms that have been previously proposed. Perhaps the most natural comparison is with the parallel adaptive meshing paradigm presented by Bank and Holst in [7, 8], which shares several similarities with NIRD. After an adaptive preprocessing step, in which a global coarse mesh is partitioned into home domains with approximately equal error, each processor independently does its own adaptive refinement focusing on its home domain. These independent, adaptively refined meshes are then unioned to form the global mesh, very much like what is done in NIRD. The main distinction between the algorithms is how the actual global solution on the union mesh is obtained. The Bank-Holst paradigm does not partition the right-hand side. Rather, each processor solves its subproblem with the full right-hand side, and the adaptive refinement is guided to focus in the home domain by simply multiplying the a posteriori error measure by 10^{-6} outside the home domain. The subproblem solutions are then used as initial guesses over the home domains, and the final global solution is obtained through standard, parallel domain decomposition or multigrid techniques. Thus, while this algorithm generates the adaptively refined parallel mesh very efficiently, obtaining the solution on that mesh still requires the application of less communication-efficient, traditional methods. NIRD, on the other hand, achieves both a good global mesh and a good global solution with very low communication cost. By partitioning the right-hand side and allowing adaptive refinement on each processor to proceed as usual, NIRD generates subproblem solutions that are well-resolved and thus yield a good global solution when added together.
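Returning to the global update in Algorithm 1, that update is the only step that communicates, and the corrections can be summed pairwise in a binary tree, which is why O(log P) communication rounds suffice. The toy sketch below (plain Python lists stand in for processors; a real implementation would use an MPI reduction over the union mesh) only illustrates the counting argument.

```python
# Sketch of why u_1 = u_0 + sum_l du_l needs only O(log P) communication
# rounds: corrections are combined pairwise in a binary tree.
import numpy as np

def tree_sum(corrections):
    """Sum per-processor corrections, counting pairwise-combination rounds."""
    rounds = 0
    while len(corrections) > 1:
        paired = []
        for i in range(0, len(corrections) - 1, 2):
            paired.append(corrections[i] + corrections[i + 1])  # one exchange per pair
        if len(corrections) % 2 == 1:
            paired.append(corrections[-1])                      # odd one out waits
        corrections = paired
        rounds += 1
    return corrections[0], rounds

P = 1024
corrections = [np.full(4, 1.0) for _ in range(P)]               # toy du_l vectors
total, rounds = tree_sum(corrections)
print(total, "communication rounds:", rounds)                   # rounds == log2(P) == 10
```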

Another important note about both the Bank-Holst parallel adaptive meshing algorithm and NIRD is their ease of implementation. Both algorithms very naturally leverage existing software and solution techniques in order to perform the serial, adaptive solves on each processor. The only new pieces of code that need to be written are the communication routines and, for NIRD, the implementation of the partition of unity characteristic functions. This makes NIRD relatively easy to implement within the framework of an existing finite element or solver package.

2.2 Algorithm details

This section gives some further details on a few key pieces of the NIRD algorithm that are newly implemented or improved since the previous implementation used in [3]. The preprocessing step is briefly discussed, followed by some significant additions to the choice of characteristic functions for the partition of unity. The final subsection reveals that naive functional measurement on the NIRD subproblems is insufficient as an error measure for adaptive refinement and proposes a fix to restore the validity of the functional as an a posteriori error measure.

2.2.1 The preprocessing step

One contribution of this work is the implementation of an adaptive preprocessing step. Although the preprocessing step was discussed in [3], that implementation was not adaptive. Instead, the initial coarse mesh was uniformly refined until it had at least P elements, which may result in an unequal distribution of the error across processors and hinder overall NIRD performance. Assuming that computational resources are equally distributed across processors, the error should be as well. Using adaptive refinement to establish the coarse mesh should roughly equidistribute error among processors. Allowing for more elements in the coarse mesh gives adaptive refinement the best chance of equally distributing error among those elements (and subsequently among processors), but processors should still reserve most of their memory resources for solving the local subproblem. This raises the question of how much refinement should be done during the preprocessing step.

For the purposes of this thesis, the preprocessing step enforces N_c ≥ P, where N_c is the number of elements in the coarse mesh. This is only for ease of implementation and discussion: as P gets large compared to the memory available on each processor, this is clearly untenable. In that case, one may consider P to be the number of nodes or clusters of processors and apply NIRD at the level of parallelism between these nodes. Such an approach would still result in greatly enhanced performance, as inter-node communication is generally much more expensive than on-node communication.

With the above consideration in mind, the strategy for stopping refinement during the preprocessing step may be described as follows. If E is the total number of elements that may be stored on a processor, most of these, say 90%, should be reserved for solving the processor's subproblem. If it is possible to equidistribute the error with fewer elements, however, it is preferable to use as few elements as possible. Judging whether error is well-equidistributed can be tricky and requires the major features (sources of error) of the problem to be resolved with 0.1E or fewer elements. Assuming this is the case, the preprocessing step considers error to be equidistributed if the ratio, η_1/η_0, is below some threshold factor, say 10, where η_1 is the largest error over a home domain and η_0 is the smallest error over a home domain. To summarize, the preprocessing step refines while (N_c < P) or ((N_c < 0.1E) and (η_1/η_0 > 10)). Algorithm 2 shows pseudocode for the preprocessing step.

Algorithm 2: Preprocessing refinement.

    Solve on the coarsest mesh and evaluate the LLSF.
    while (N_c < P) or ((N_c < 0.1E) and (η_1/η_0 > 10)) do
        Refine using ACE.
        Solve on the refined mesh.
        Partition the mesh into home domains.
        Evaluate the LLSF over home domains to get η_1, η_0.
    end while
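Because the compound stopping test in Algorithm 2 is easy to misread, it is transcribed directly as a small helper below. The 0.1E memory fraction and the factor-10 ratio threshold mirror the values suggested above; the example numbers passed in are made up for illustration.

```python
# Direct transcription of the stopping test in Algorithm 2: keep refining while
# (N_c < P) or (N_c < 0.1*E and eta_1/eta_0 > 10), where eta_1/eta_0 is the
# max/min home-domain error ratio.
def keep_refining(N_c, P, E, eta_max, eta_min, mem_fraction=0.1, ratio_tol=10.0):
    return (N_c < P) or (N_c < mem_fraction * E and eta_max / eta_min > ratio_tol)

# Example: 1,024 processors, room for 10^6 elements per processor.
print(keep_refining(N_c=800,  P=1024, E=10**6, eta_max=5.0,  eta_min=1.0))  # True: too few elements
print(keep_refining(N_c=2000, P=1024, E=10**6, eta_max=50.0, eta_min=1.0))  # True: error not equidistributed
print(keep_refining(N_c=2000, P=1024, E=10**6, eta_max=5.0,  eta_min=1.0))  # False: stop refining
```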

To demonstrate the importance of the adaptive preprocessing step, consider a 2D Poisson problem with the given right-hand side:

\[
\begin{aligned}
-\Delta p &= f, && \text{on } \Omega = [0,1]\times[0,1], \\
p &= 0, && \text{on } \partial\Omega,
\end{aligned}
\qquad
f(x, y) =
\begin{cases}
100, & (x, y) \in [0.49, 0.51] \times [0.49, 0.51], \\
0, & \text{else.}
\end{cases}
\]

The very localized, discontinuous right-hand side demands adaptive refinement during the preprocessing step in order to equidistribute error. Comparing the overall performance of NIRD with and without an adaptive preprocessing step on this problem shows a significant increase in accuracy when using adaptivity, as shown in Figure 2.1. The LSF values shown in this plot are normalized by the LSF of the preprocessing solution using uniform refinement. The reference lines shown are the expected accuracies using a traditional nested iteration approach with N_U or N_T elements, where N_U is the number of elements in the union mesh produced by NIRD and N_T > N_U is the total number of elements used by NIRD for all subproblems. Note that the union meshes produced by NIRD with and without the adaptive preprocessing step have similar numbers of elements, but their accuracies are quite different. This indicates that, without the adaptive preprocessing step, NIRD is unable to generate a union mesh with optimal element distribution. As such, the accuracy suffers significantly, whereas with the adaptive preprocessing step, the NIRD iterates quickly achieve accuracy very similar to that achieved by a traditional nested iteration approach using ACE refinement. Thus, the adaptive preprocessing step has allowed NIRD to generate a nearly optimal union mesh, even for this problem with a highly localized, discontinuous right-hand side.

Figure 2.1: Comparison of NIRD convergence with and without an adaptive preprocessing step using 1,024 processors.

2.2.2 Partitions of unity

In addition to an adaptive preprocessing step, another major advancement in the implementation of the NIRD algorithm is the ability to construct continuous or even infinitely smooth characteristic functions for the partition of unity (PoU). The previous implementation of the algorithm supported only the discontinuous PoU shown in equation (2.1). As mentioned in Section 2.1, PoUs should be chosen so that the characteristic functions have local support near a processor's home domain, thereby inducing adaptive refinement that focuses on the home domain as much as possible. The discontinuous characteristic functions in equation (2.1) have support exactly over Ω_l, but these may not be the only good choice for a PoU.

A C^0 PoU may be constructed from piecewise-linear nodal basis functions over the coarse elements in the preprocessing mesh with nodal values

\[
\chi_l(x_i) =
\begin{cases}
1/n_i, & x_i \in \Omega_l, \\
0, & x_i \notin \Omega_l,
\end{cases}
\]

where n_i is the number of home domains incident to node x_i. This yields C^0 characteristic functions but also expands their support to neighboring elements in the coarse preprocessing mesh, resulting in more refinement outside of a processor's home domain when adaptively solving its subproblem.

Even smoother characteristic functions may be constructed by combining C^∞ exponential bump functions over the coarse mesh triangulation. The bump functions, w_k, window the coarse

elements, τ_k, and have the form

\[
w_k(x, y) =
\begin{cases}
e^{-1/(\lambda_1 \lambda_2 \lambda_3)^p}, & \text{if } \lambda_i > 0 \;\, \forall i, \\
0, & \text{otherwise,}
\end{cases}
\]

where the Cartesian coordinates, (x, y), are converted to the barycentric coordinates, (λ_1, λ_2, λ_3). The value of p controls the steepness of these bump functions, and for the discussion here, p = 1/2. The windowing functions, w_k, need to overlap each other, so the barycentric coordinates, (λ_1, λ_2, λ_3), are based on the vertices of an extended triangle defined by pushing the vertices of τ_k away from the triangle's midpoint by a distance of d_min, where d_min is the minimum diameter of an inscribed circle over all elements that are adjacent to τ_k. In this way, the support of the bump function is limited to τ_k and its immediate neighboring coarse elements. The smooth, overlapping windowing functions, w_k, may now be combined using Shepard's method [27] to obtain a PoU over the home domains, Ω_l, with characteristic functions,

\[
\chi_l = \frac{\sum_{\tau_k \subset \Omega_l} w_k}{\sum_{\tau_j \subset \Omega} w_j}.
\]

These characteristic functions retain the compact support and C^∞ smoothness of the windowing functions, but now sum to 1 over the domain, generating a PoU. Also note that, while the sum in the denominator appears to include all coarse elements in the domain, only a few of the windowing functions, w_j, are nonzero when evaluating χ_l, namely those that are directly adjacent to Ω_l, resulting in a reasonable number of required operations in order to evaluate χ_l.

A heuristic comparison between the various PoUs presented here may be made by comparing their ACE refinement patterns. Figure 2.2 shows example characteristic functions from each PoU along with the corresponding subproblem ACE refinement pattern when applied to a Poisson problem with a smooth right-hand side using 16 processors. The discontinuous PoU does the best job of localizing the refinement in and around the home domain, but there is a lot of refinement near the discontinuities in the right-hand side at corners of home domains. The C^0 PoU smooths out this error, resulting in more evenly distributed refinement, but more refinement thus focuses away from the home domain. The C^∞ PoU combines the better qualities of the other two, both doing a decent job of localizing refinement and avoiding excessive refinement near edges or corners of the support of the characteristic function. The overall effect of the choice of PoU on NIRD performance is discussed further and better quantified in Section 2.5.
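A small sketch of the Shepard construction is given below. It uses 1D exponential windows as a stand-in for the barycentric bump functions w_k described above (which are not reproduced here) and verifies numerically that the resulting home-domain characteristic functions sum to 1.

```python
# 1D stand-in for the smooth PoU above: overlapping C-infinity windows w_k are
# combined with Shepard's method, chi_l = sum_{k in Omega_l} w_k / sum_j w_j,
# so the chi_l sum to 1 wherever some window is positive.
import numpy as np

def window(x, center, radius):
    """C-infinity bump supported on (center - radius, center + radius)."""
    t = (x - center) / radius
    w = np.zeros_like(x)
    inside = np.abs(t) < 1.0
    w[inside] = np.exp(-1.0 / (1.0 - t[inside] ** 2))
    return w

x = np.linspace(0.0, 1.0, 1001)
centers = np.linspace(0.0, 1.0, 9)              # one window per coarse "element"
radius = 0.25                                   # windows overlap their neighbors
windows = np.array([window(x, c, radius) for c in centers])

chi = windows / windows.sum(axis=0)             # Shepard normalization
chi_home = chi.reshape(3, 3, -1).sum(axis=1)    # group 3 coarse elements per home domain
print(np.allclose(chi_home.sum(axis=0), 1.0))   # the home-domain PoU sums to 1
```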

2.2.3 Subproblem functional measurement

Another important note pertaining to the PoUs is that, regardless of the PoU selected, it is possible that f_l may not be in the range of L, even if f is. That is, applying the characteristic function χ_l may push the subproblem right-hand side out of the range of L. Thus, f_l may be decomposed as f_l = L δu_l + φ_l, where δu_l is the continuum solution to the subproblem (i.e., the continuum minimizer for equation (2.2)) and φ_l ∈ Ker(L*) is nontrivial. This does not change the fact that u = u^h_0 + Σ_l δu_l, as can be seen through the simple inner product argument presented below. Note that the FOSLS solution u in the appropriate space, W, satisfies

\[
\langle Lu, Lv \rangle = \langle f, Lv \rangle, \quad \forall v \in W.
\]

In addition, the continuum solutions for each subproblem obey similar relations:

\[
\langle L\,\delta u_l, Lv \rangle = \langle f_l, Lv \rangle, \quad \forall l, \; \forall v \in W.
\]

Summing over all l and using the linearity of L and the definitions of the f_l yields

\[
\begin{aligned}
\sum_l \langle L\,\delta u_l, Lv \rangle &= \sum_l \langle f_l, Lv \rangle, && \forall v \in W, \\
\Big\langle L \Big( \sum_l \delta u_l \Big), Lv \Big\rangle &= \Big\langle \sum_l \chi_l (f - L u^h_0), Lv \Big\rangle, && \forall v \in W, \\
\Big\langle L \Big( u^h_0 + \sum_l \delta u_l \Big), Lv \Big\rangle &= \langle f, Lv \rangle, && \forall v \in W.
\end{aligned}
\]

Then, the uniqueness of u implies u = u^h_0 + Σ_l δu_l.

Figure 2.2: Visualization of example characteristic functions (top) and the resulting subproblem refinement patterns (bottom) for the discontinuous (left), C^0 (middle), and C^∞ (right) PoUs when applying NIRD to Poisson with a smooth right-hand side using 16 processors.

Thus, the fact that f_l ∉ Ran(L) does not affect NIRD on a fundamental level: the solution updates found in each subproblem are still correct.

This fact can be problematic, however, when measuring the LSF for the subproblems for adaptive refinement. The functional, ‖L δu^h_l - f_l‖, measures the nontrivial component, φ_l, which prevents the functional from converging with refinement. To remedy this, it is necessary to subtract off φ_l when measuring the functional, resulting in the measurement

\[
\| L\,\delta u^h_l - (f_l - \varphi_l) \| = \| L(\delta u^h_l - \delta u_l) \|.
\]

This functional now converges with refinement and provides a better indication of where adaptive refinement should occur.

Calculating the Ker(L*) components, φ_l, clearly varies from problem to problem, depending on the operator L. As an example, consider the div-curl system with homogeneous Dirichlet boundary conditions,

\[
\begin{aligned}
\nabla\cdot\mathbf{u} &= f, && \text{on } \Omega, \\
\nabla\times\mathbf{u} &= 0, && \text{on } \Omega, \\
\boldsymbol{\tau}\cdot\mathbf{u} &= 0, && \text{on } \Gamma.
\end{aligned}
\]

This system yields the following form for L:

\[
L = \begin{bmatrix} \nabla\cdot \\ \nabla\times \end{bmatrix}.
\]

Adopt the following notation for the domains of L and L*:

\[
\mathbf{u} \in D(L), \qquad V = \begin{bmatrix} q \\ s \end{bmatrix} \in D(L^*).
\]

The discussion here is limited to the two-dimensional case and makes use of the following identities, which result from simple integration by parts:

\[
\begin{aligned}
\langle q, \nabla\cdot\mathbf{u} \rangle &= -\langle \nabla q, \mathbf{u} \rangle + \int_\Gamma q\,(\mathbf{u}\cdot\mathbf{n}), \\
\langle \nabla\times\mathbf{u}, s \rangle &= \langle \mathbf{u}, \nabla\times s \rangle - \int_\Gamma s\,(\mathbf{u}\cdot\boldsymbol{\tau}),
\end{aligned}
\qquad \text{where} \qquad
\nabla\times q = \begin{bmatrix} \partial_y q \\ -\partial_x q \end{bmatrix}.
\]

Applying these identities yields the following calculation of L*:

\[
\begin{aligned}
\langle L\mathbf{u}, V \rangle &= \langle \nabla\cdot\mathbf{u}, q \rangle + \langle \nabla\times\mathbf{u}, s \rangle \\
&= -\langle \mathbf{u}, \nabla q \rangle + \int_\Gamma q\,(\mathbf{u}\cdot\mathbf{n}) + \langle \mathbf{u}, \nabla\times s \rangle - \int_\Gamma s\,(\mathbf{u}\cdot\boldsymbol{\tau}) \\
&= \langle \mathbf{u}, -\nabla q + \nabla\times s \rangle + \int_\Gamma q\,(\mathbf{u}\cdot\mathbf{n}) - \int_\Gamma s\,(\mathbf{u}\cdot\boldsymbol{\tau}).
\end{aligned}
\]

Utilizing the boundary condition, u·τ = 0 on Γ, eliminates one of the boundary integrals, and the other is dealt with by imposing the adjoint boundary condition, q = 0 on Γ, leaving

\[
L^* = \begin{bmatrix} -\nabla & \nabla\times \end{bmatrix}.
\]

To find Ker(L*), set

\[
L^* V = -\nabla q + \nabla\times s = 0.
\]

Taking the divergence of the above equation yields

\[
\begin{aligned}
-\Delta q &= 0, && \text{on } \Omega, \\
q &= 0, && \text{on } \Gamma,
\end{aligned}
\]

which implies that q = 0 on all of Ω. Then what remains is

\[
\nabla\times s = 0, \quad \text{on } \Omega,
\]

which implies that s is constant. Thus, Ker(L*) for the div-curl problem is the simple one-dimensional space,

\[
\operatorname{Ker}(L^*) = \begin{bmatrix} 0 \\ C \end{bmatrix},
\]

where C is constant. The calculation of φ_l = [0, C_l]^T for a NIRD subproblem then involves a simple, one-dimensional minimization problem in which C_l can be directly calculated:

\[
C_l = \arg\min_{c} \left\| f_l - \begin{bmatrix} 0 \\ c \end{bmatrix} \right\|, \qquad C_l = \frac{\left\langle f_l, \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right\rangle}{\langle 1, 1 \rangle}.
\]
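In discrete form, this one-dimensional minimization is just an L² projection of the second component of f_l onto constants. The sketch below, with made-up quadrature weights and right-hand-side samples, computes C_l and checks that the corrected right-hand side is orthogonal to the constant kernel component.

```python
# Discrete sketch of the kernel correction for the div-curl example: with
# phi_l = [0, C]^T, minimizing ||f_l - phi_l|| over C gives the L2 projection
# of the second component of f_l onto constants, C_l = <f_l,[0,1]^T>/<1,1>.
# The quadrature weights and samples below are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
w = rng.random(200); w /= w.sum()       # quadrature weights over the domain (|Omega| = 1)
f2 = rng.standard_normal(200)           # samples of the second component of f_l

C_l = np.sum(w * f2) / np.sum(w)        # discrete <f_l, [0,1]^T> / <1, 1>
f2_corrected = f2 - C_l                 # second component of f_l - phi_l

# the corrected right-hand side has no constant (kernel) component left
print(np.isclose(np.sum(w * f2_corrected), 0.0))
```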

A more complicated case occurs when solving a general advection-diffusion-reaction equation:

\[
\begin{aligned}
-\nabla\cdot(A\nabla p) + \mathbf{b}\cdot\nabla p + cp &= f, && \text{on } \Omega = [0,1]\times[0,1], \\
p &= 0, && \text{on } \Gamma_D, \\
\mathbf{n}\cdot\nabla p &= 0, && \text{on } \Gamma_N,
\end{aligned}
\]

where, for simplicity, b and c are constant, and A is a symmetric positive definite 2x2 matrix. Casting this as a first-order system yields

\[
\begin{aligned}
\mathbf{u} - \nabla p &= 0, && \text{on } \Omega, \\
-\nabla\cdot(A\mathbf{u}) + \mathbf{b}\cdot\mathbf{u} + cp &= f, && \text{on } \Omega, \\
\nabla\times\mathbf{u} &= 0, && \text{on } \Omega, \\
p &= 0, && \text{on } \Gamma_D, \\
\boldsymbol{\tau}\cdot\mathbf{u} &= 0, && \text{on } \Gamma_D, \\
\mathbf{n}\cdot\mathbf{u} &= 0, && \text{on } \Gamma_N.
\end{aligned}
\]

Thus, the operator, L, has the form

\[
L = \begin{bmatrix} I & -\nabla \\ -\nabla\cdot(A\,\cdot\,) + \mathbf{b}\cdot & c \\ \nabla\times & 0 \end{bmatrix},
\]

and the domains of L and L* are denoted

\[
U = \begin{bmatrix} \mathbf{u} \\ p \end{bmatrix} \in D(L), \qquad V = \begin{bmatrix} \mathbf{v} \\ q \\ s \end{bmatrix} \in D(L^*).
\]

As for the div-curl system, the adjoint operator L* can be found via integration by parts:

\[
\begin{aligned}
\langle LU, V \rangle &= \langle \mathbf{u} - \nabla p, \mathbf{v} \rangle + \langle -\nabla\cdot(A\mathbf{u}) + \mathbf{b}\cdot\mathbf{u} + cp, q \rangle + \langle \nabla\times\mathbf{u}, s \rangle \\
&= \langle \mathbf{u}, \mathbf{v} \rangle + \langle p, \nabla\cdot\mathbf{v} \rangle - \int_\Gamma p\,(\mathbf{v}\cdot\mathbf{n}) + \langle \mathbf{u}, A\nabla q \rangle - \int_\Gamma q\,(A\mathbf{u}\cdot\mathbf{n}) \\
&\qquad + \langle \mathbf{u}, q\mathbf{b} \rangle + \langle p, cq \rangle + \langle \mathbf{u}, \nabla\times s \rangle - \int_\Gamma s\,(\mathbf{u}\cdot\boldsymbol{\tau}) \\
&= \langle U, L^* V \rangle - \int_\Gamma p\,(\mathbf{v}\cdot\mathbf{n}) - \int_\Gamma q\,(A\mathbf{u}\cdot\mathbf{n}) - \int_\Gamma s\,(\mathbf{u}\cdot\boldsymbol{\tau}),
\end{aligned}
\]

where the formal adjoint, L*, is given by

\[
L^* = \begin{bmatrix} I & A\nabla + \mathbf{b} & \nabla\times \\ \nabla\cdot & c & 0 \end{bmatrix}.
\]

The boundary integrals are again eliminated by imposing boundary conditions on the original problem along with the corresponding adjoint boundary conditions. First, consider the case of Dirichlet conditions on the entire boundary, i.e., Γ_D = Γ. Then, the adjoint boundary condition is q = 0 on Γ. The form of Ker(L*) may be found by setting

\[
L^* V = \begin{bmatrix} \mathbf{v} + A\nabla q + q\mathbf{b} + \nabla\times s \\ \nabla\cdot\mathbf{v} + cq \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]

and taking the divergence of the first equation to obtain

\[
\nabla\cdot\mathbf{v} + \nabla\cdot(A\nabla q) + \mathbf{b}\cdot\nabla q = 0.
\]

Then, using the second equation to substitute -cq for ∇·v yields a homogeneous advection-diffusion-reaction equation with homogeneous Dirichlet boundary conditions,

\[
\begin{aligned}
-cq + \nabla\cdot(A\nabla q) + \mathbf{b}\cdot\nabla q &= 0, && \text{on } \Omega, \\
q &= 0, && \text{on } \Gamma,
\end{aligned}
\]

which implies that q = 0 on all of Ω. The second equation then becomes ∇·v = 0, which implies that v = ∇×ψ for some function ψ ∈ H¹. Substituting the results for q and v back into the first equation yields ∇×(ψ + s) = 0, which implies s = -ψ. Thus, Ker(L*) can be expressed as

\[
\operatorname{Ker}(L^*) = \begin{bmatrix} \nabla\times\psi \\ 0 \\ -\psi \end{bmatrix},
\]

where ψ is any function in H¹. Then, to find φ_l for a NIRD subproblem, find ψ_l such that

\[
\psi_l = \arg\min_{\psi} \left\| \begin{bmatrix} \nabla\times\psi \\ 0 \\ -\psi \end{bmatrix} - f_l \right\|^2 .
\]

Note that this is not quite as straightforward as the case where Ker(L*) was one-dimensional, as for the div-curl system. It is possible, however, to solve the above minimization for ψ in a finite element space, W^{(1)}_l, a scalar subspace of the full vector finite element space, W_l, associated with processor l's subproblem. This results in an approximation, φ^h_l, to φ_l, and the error in this approximation should converge at the same rate as the error in the approximation to the subproblem solution, δu_l. Thus, the modified functional has the form

\[
\| L\,\delta u^h_l - (f_l - \varphi^h_l) \| = \| L(\delta u^h_l - \delta u_l) + (\varphi^h_l - \varphi_l) \| \leq \| L(\delta u^h_l - \delta u_l) \| + \| \varphi^h_l - \varphi_l \|,
\]

and convergence of this functional is again restored, resulting in superior local error measurement over the naive functional, ‖L δu^h_l - f_l‖. Figure 2.3 shows convergence of both the naive and modified functionals for each subproblem of NIRD applied to a Poisson problem using 16 processors and linear finite elements. The modified functional achieves the expected O(h) convergence, whereas the naive functional stalls at the level of ‖φ_l‖ for each processor. The effect of the functional measurement on the ACE refinement patterns can be seen in the example shown in Figure 2.4, which shows that additional refinement occurs far from the home domain when using naive functional measurement, due to the fact that φ_l begins to dominate the functional measurement at some point during refinement.

The modified functional described in this subsection is used throughout the numerical results presented in Section 2.5, which examines several variations on the general advection-diffusion-reaction problem stated here. The test problems in that section also include a case where Neumann conditions are imposed on one side of the domain and Dirichlet conditions are used elsewhere. Specifically, Γ_N is the East edge of the square domain, and the North, South, and West edges comprise Γ_D. Working through the same argument as above yields the same space for Ker(L*) with the additional constraint that ψ = 0 on the East edge.

Figure 2.3: Convergence of the LSF for the subproblems on each of 16 processors (note that many of the curves lie on top of each other due to the symmetry of this problem) for Poisson's equation with naive functional measurement (left) and measurement with the Ker(L*) components removed from the functional (right).

2.3 NIRD convergence theory

This section examines a convergence proof previously presented by Appelhans et al. in [3] and then presents an updated convergence proof with streamlined assumptions and support for arbitrary partitions of unity. The previous proof is rooted in several strong, heuristic assumptions about the PDE problem and its decomposition under NIRD. By measuring these assumptions in practice for some simple test problems, the assumptions are shown to be potentially unnecessary or at least too pessimistic in predicting good NIRD convergence. The updated proof attempts to identify crucial assumptions and provide a more general framework that allows for the use of other partitions of unity.

2.3.1 Examining a previous NIRD convergence proof

The previous convergence proof by Appelhans [3] suggests that, for many elliptic problems, NIRD requires only a single iteration in order to achieve a solution with functional error within a small factor of that achieved by a traditional nested iteration solve using the same resources.

Figure 2.4: ACE refinement patterns for the subproblem on processor 0 for Poisson's equation with naive functional measurement (left) and measurement with the Ker(L*) components removed from the functional (right). Note that naive functional measurement uses more elements outside the home domain due to the measurement of superfluous error.

The argument for why such immediate convergence should occur can be broken down into two parts: first showing that the exact union mesh solution has accuracy similar to a traditional solve, and then showing that the NIRD solution has accuracy similar to the exact union mesh solution.

First, assume that the adaptive meshes used here adequately resolve problem features and that error is equidistributed across elements of the adaptive meshes. Under these assumptions, it is possible to bound the LSF for some discrete solution, u^h, in terms of the number of elements in the adaptive mesh, N, that is,

\[
\| L(u^h - u) \| \leq C_I N^{-q/d} |u|_{q+1},
\]

where q is the polynomial order of the finite element space, d is the dimension of the problem, and C_I is an interpolation constant [4, 32]. Furthermore, assume bounds of this form are sharp, so that

\[
\| L(u^h - u) \| \approx C_I N^{-q/d} |u|_{q+1}.
\]

With the above assumptions in mind, consider the union mesh on which the NIRD solution is defined. Denote the exact solution on the union mesh by u_U and the exact solution generated by a traditional nested iteration solve by u_T, and define Q = N_U/N_T, where N_U and N_T are the number of elements in the union and traditional meshes, respectively. Then the LSFs for each of these solutions are related by

\[
\| L(u_U - u) \| \approx K_0 \, Q^{-q/d} \, \| L(u_T - u) \|, \tag{2.3}
\]

where K_0 = O(1). There is also an implicit assumption about the element distribution of the union mesh here. The traditional nested iteration solution is arrived at through some global, parallel, adaptive refinement strategy and is, thus, assumed to have an element distribution that is optimal in some sense; that is, the appropriate error is as small as possible for a mesh with N_T elements. If the union mesh produced by the first NIRD iteration also has a nearly optimal element distribution, then K_0 ≈ 1. As shown later in Section 2.5, however, NIRD does not always produce a union mesh with optimal element distribution. In such a case, K_0 will be larger.
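To get a feel for the size of the factor in (2.3), consider illustrative values that are not taken from the experiments: linear elements, q = 1, in d = 2 dimensions, and a union mesh containing 80% as many elements as the traditional mesh, so Q = 0.8. Then

\[
\| L(u_U - u) \| \approx K_0 \, (0.8)^{-1/2} \, \| L(u_T - u) \| \approx 1.12 \, K_0 \, \| L(u_T - u) \|,
\]

so, when K_0 ≈ 1, the union-mesh solution gives up only about ten percent in the functional relative to a traditional solve with the same total resources.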

In the case where $K_0$ is close to 1, the union mesh solution will have an LSF within a small fraction of the traditional solution if $Q$ is also close to 1. This means that NIRD has effectively used the available resources by placing most elements in the home domains during the subproblem solves, resulting in more of those elements ending up in the union mesh.

Now it remains to be shown that the first NIRD iteration yields a solution with LSF within a small fraction of that achieved by $u_U$. It is actually possible to prove that this occurs under some strong, heuristic assumptions about the problem and its decomposition under NIRD as shown by Appelhans in [3]. The rest of this section gives an overview of the convergence proof from [3] and discusses whether the assumptions made appear to be valid in practice.

Consider the error in the first NIRD solution, $u_1^h$, over a processor's home domain, $\Omega_k$, and use the triangle inequality to split this error into the contributions from each processor's subproblem:
$$L(u_1^h - u)_{\Omega_k} = L\Big(\sum_{l=1}^P (\delta u_l^h - \delta u_l)\Big)_{\Omega_k} \leq \sum_{l=1}^P L(\delta u_l^h - \delta u_l)_{\Omega_k}.$$
Note that this proof assumes the use of a discontinuous PoU, and, as such, the error contributions from other processors, $l \neq k$, that have zero right-hand side over $\Omega_k$ are split off and treated separately from the home processor's error contribution:
$$\sum_{l=1}^P L(\delta u_l^h - \delta u_l)_{\Omega_k} = L(\delta u_k^h - \delta u_k)_{\Omega_k} + \sum_{l \neq k} L(\delta u_l^h - \delta u_l)_{\Omega_k}.$$
The sum over $l \neq k$ is now manipulated through the use of some heuristic assumptions in order to bound it in terms of a constant times $L(\delta u_k^h - \delta u_k)_{\Omega_k}$. First, assume a symmetry between the error contributions of other processors to processor $k$'s home domain and processor $k$'s contribution to other home domains:
$$L(\delta u_l^h - \delta u_l)_{\Omega_k} \leq C_{s,k}\, L(\delta u_k^h - \delta u_k)_{\Omega_l}, \quad l \neq k, \qquad (2.4)$$
$$\sum_{l \neq k} L(\delta u_l^h - \delta u_l)_{\Omega_k} \leq C_{s,k} \sum_{l \neq k} L(\delta u_k^h - \delta u_k)_{\Omega_l}, \qquad (2.5)$$
where $C_{s,k} = O(1)$. Furthermore, assume a quasiuniform distribution of home domains and decay

of processor $k$'s error over $\Omega_l$ as the distance between $\Omega_l$ and $\Omega_k$ grows such that
$$\sum_{l \neq k} L(\delta u_k^h - \delta u_k)_{\Omega_l} \leq C_{\rho,k} \Big( \sum_{l \neq k} L(\delta u_k^h - \delta u_k)_{\Omega_l}^2 \Big)^{1/2}, \qquad (2.6)$$
where $C_{\rho,k} = O(1)$ (see Lemma 5.2 from [3]). For example, if the error decays exponentially away from the home domain, then (2.6) is satisfied. Both of the assumptions above are motivated by the exponentially decaying Green's functions for the elliptic PDE being solved. Now, define
$$\hat{Q}_k = Q_k \frac{\min_{\tau \subset \Omega_k} L(\delta u_k^h - \delta u_k)_\tau^2}{\max_{\tau \subset \Omega} L(\delta u_k^h - \delta u_k)_\tau^2}, \qquad (2.7)$$
where $Q_k$ is the ratio of elements in processor $k$'s home domain over the total number of elements in its subproblem mesh after the subproblem solve and $\tau$ is an element in processor $k$'s subproblem mesh. $\hat{Q}_k$ is assumed to be reasonably close to 1 since refinement is expected to focus inside the home domain and ACE should approximately equidistribute error among elements. With this definition for $\hat{Q}_k$, some algebra yields
$$\sum_{l \neq k} L(\delta u_k^h - \delta u_k)_{\Omega_l}^2 \leq \frac{1 - \hat{Q}_k}{\hat{Q}_k}\, L(\delta u_k^h - \delta u_k)_{\Omega_k}^2. \qquad (2.8)$$
Combining all of the above inequalities yields
$$L(u_1^h - u)_{\Omega_k} \leq L(\delta u_k^h - \delta u_k)_{\Omega_k} + \sum_{l \neq k} L(\delta u_l^h - \delta u_l)_{\Omega_k} \leq L(\delta u_k^h - \delta u_k)_{\Omega_k} + C_{s,k} \sum_{l \neq k} L(\delta u_k^h - \delta u_k)_{\Omega_l}$$
$$\leq L(\delta u_k^h - \delta u_k)_{\Omega_k} + C_{s,k} C_{\rho,k} \Big( \sum_{l \neq k} L(\delta u_k^h - \delta u_k)_{\Omega_l}^2 \Big)^{1/2} \leq \Big( 1 + C_{s,k} C_{\rho,k} \Big( \frac{1 - \hat{Q}_k}{\hat{Q}_k} \Big)^{1/2} \Big) L(\delta u_k^h - \delta u_k)_{\Omega_k}.$$
Renaming the constant in front as $C_{b,k}$ and taking $C_b = \max_k C_{b,k}$ yields the following crucial bound:
$$L(u_1^h - u)_{\Omega_k} \leq C_b\, L(\delta u_k^h - \delta u_k)_{\Omega_k}, \quad k = 1, \ldots, P. \qquad (2.9)$$

With this bound in hand, the proof follows by splitting the total error in the first NIRD iteration over home domains, using bound (2.9), and then using some interpolation assumptions. First, assume that $L(\delta u_k^h - \delta u_k)_{\Omega_k} \leq L(\delta u_{I,k}^h - \delta u_k)_{\Omega_k}$, where $\delta u_{I,k}^h$ is the interpolant of the exact $\delta u_k$ on processor $k$'s subproblem mesh. Then,
$$L(u_1^h - u)_\Omega^2 = \sum_{k=1}^P L(u_1^h - u)_{\Omega_k}^2 \leq C_b^2 \sum_{k=1}^P L(\delta u_k^h - \delta u_k)_{\Omega_k}^2 \leq C_b^2 \sum_{k=1}^P L(\delta u_{I,k}^h - \delta u_k)_{\Omega_k}^2 \leq C_b^2 \sum_{k=1}^P \big( C_I h_k^q |\delta u_k|_{q+1,\Omega_k} \big)^2,$$
where $h_k$ is the maximum element diameter over $\Omega_k$. Further assuming that the $h_k$'s may be replaced by an effective mesh spacing for the union mesh, $N_U^{-1/d}$, and that there is some constant, $C_n = O(1)$, such that $|\delta u_k|_{q+1,\Omega_k} \leq C_n |u|_{q+1,\Omega_k}$ for all $k$, yields
$$L(u_1^h - u)_\Omega \leq C_b C_I N_U^{-q/d} \Big( \sum_{k=1}^P |\delta u_k|_{q+1,\Omega_k}^2 \Big)^{1/2} \leq C_b C_I C_n N_U^{-q/d} \Big( \sum_{k=1}^P |u|_{q+1,\Omega_k}^2 \Big)^{1/2} = C N_U^{-q/d} |u|_{q+1,\Omega},$$
where $C = C_b C_I C_n$. This bound is now of the same form as that achieved by directly solving on the union mesh but with a different leading constant. If these bounds are sharp, then there is a constant, $K_1$, such that
$$L(u_1^h - u) \leq K_1 L(u_U - u), \qquad (2.10)$$
that is, the first NIRD iterate, $u_1^h$, achieves error within a factor, $K_1$, of the error achieved by the union mesh solution, $u_U$. Combining this with bound (2.3) yields the desired result:
$$L(u_1^h - u) \leq K_1 L(u_U - u) \leq K_1 K_0 Q^{-q/d} L(u_T - u) \;\Rightarrow\; L(u_1^h - u) \leq K\, L(u_T - u).$$

This proof relies heavily on several heuristic assumptions to ensure reasonable values for the constants defined in (2.5), (2.6), and (2.7). These assumptions come together to influence the important bound (2.9). In order to investigate the validity of these assumptions and the overall strategy of the proof, it is possible to directly measure these constants in practice. Consider a simple Poisson problem,
$$-\Delta p = f, \ \text{on } \Omega = [0, 1]\times[0, 1], \qquad p = 0, \ \text{on } \Gamma,$$
discretized with FOSLS using linear elements. NIRD is known to converge to a solution with accuracy that is very close to that of the union mesh solution in one iteration for this problem (see [3] or Section 2.5 of this thesis). Evaluating the above line of proof by measuring key constants, even for this very simple test problem where NIRD performs well, reveals some potential problems and weaknesses in the assumptions.

Table 2.1 shows measurements of several constants from the proof while scaling the number of processors, $P$, from 64 up to 1,024 and keeping the number of elements per processor fixed at 20,000. The first column measures
$$C_s = \max_{k,l} \frac{L(\delta u_l^h - \delta u_l)_{\Omega_k}}{L(\delta u_k^h - \delta u_k)_{\Omega_l}}.$$
Note that this quantity grows with problem size, indicating that one cannot rely on the symmetry previously assumed of the form $L(\delta u_l^h - \delta u_l)_{\Omega_k} \approx L(\delta u_k^h - \delta u_k)_{\Omega_l}$, $l \neq k$. The next column, however, measures a similar quantity,
$$\bar{C}_s = \max_k \frac{\sum_{l \neq k} L(\delta u_l^h - \delta u_l)_{\Omega_k}}{\sum_{l \neq k} L(\delta u_k^h - \delta u_k)_{\Omega_l}}.$$
Note that, in order for $\bar{C}_s$ to be an $O(1)$ constant, it is no longer necessary to have some exact symmetry between a processor $k$'s error over $\Omega_l$ and processor $l$'s error over $\Omega_k$ for each pair of processors, $k$ and $l$. Rather, the sums above must be of similar size, which relaxes the assumption

somewhat. Interestingly, Table 2.1 shows $\bar{C}_s$ is significantly smaller than $C_s$ and also remains essentially constant as $P$ grows. Note that the proof still follows if the bound in equation (2.5) is simply replaced by a bound of the form
$$\sum_{l \neq k} L(\delta u_l^h - \delta u_l)_{\Omega_k} \leq \bar{C}_s \sum_{l \neq k} L(\delta u_k^h - \delta u_k)_{\Omega_l}.$$
Motivating or justifying this bound from a theoretical perspective, however, is not as straightforward as the simpler symmetry assumption used in the proof.

The next quantity measured in Table 2.1 is the smallest value, $C_\rho$, that satisfies the bound (2.6). Again, even for this simple Poisson problem, $C_\rho$ is somewhat large, and may or may not be growing with $P$. Thus, this bound is, at best, quite pessimistic. Similarly, the use of $\hat{Q}$ in the proof requires no assumptions, whatsoever. Simple algebra yields the bound
$$\sum_{l \neq k} L(\delta u_k^h - \delta u_k)_{\Omega_l}^2 \leq \frac{1 - \hat{Q}_k}{\hat{Q}_k}\, L(\delta u_k^h - \delta u_k)_{\Omega_k}^2.$$
As shown in Table 2.1, however, the value of $\hat{Q}$ is quite small and shrinks as $P$ grows. As a result, the above bound (while true) is extremely pessimistic to the point where it is no longer useful.

Lastly, Table 2.1 shows a bottom-line comparison between the value for $C_b$ that is predicted by the theory vs. the best value for the bound (2.9) as measured in practice, that is,
$$C_b = \max_k \Big( 1 + C_{s,k} C_{\rho,k} \Big( \frac{1 - \hat{Q}_k}{\hat{Q}_k} \Big)^{1/2} \Big), \qquad (2.11)$$
$$\bar{C}_b = \max_k \frac{L(u_1^h - u)_{\Omega_k}}{L(\delta u_k^h - \delta u_k)_{\Omega_k}}. \qquad (2.12)$$
Clearly, the actual bound, (2.12), is satisfied quite nicely, as $\bar{C}_b$ is small and constant with respect to $P$, whereas the theory stemming from the various heuristic assumptions predicts a much larger value, $C_b$, which grows with $P$. This suggests that the heuristic assumptions, including the symmetry assumption, (2.5), and the assumption of exponentially decaying error, (2.6), as well as some pessimistic bounds such as (2.8), do not do a good job of describing the behavior of NIRD. As an examination of this simple Poisson problem shows, these assumptions and bounds fail to hold or provide extremely pessimistic estimates even in an easy case where NIRD has ideal performance.
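To make the definitions of the measured constants concrete, the following sketch (a hypothetical helper, not code from the thesis) computes $C_s$, $\bar{C}_s$, and $\bar{C}_b$ from arrays of measured per-domain functional values; `Lmat[k, l]` is assumed to hold $L(\delta u_l^h - \delta u_l)_{\Omega_k}$ and `Lglobal[k]` to hold $L(u_1^h - u)_{\Omega_k}$.

```python
import numpy as np

def measure_constants(Lmat, Lglobal):
    """Compute C_s, bar C_s, and bar C_b from measured per-domain functional values.

    Lmat[k, l] : L(du_l^h - du_l) restricted to Omega_k (P x P array)
    Lglobal[k] : L(u_1^h - u) restricted to Omega_k (length-P array)
    """
    P = Lmat.shape[0]
    off = ~np.eye(P, dtype=bool)                   # mask for l != k entries

    # C_s: worst-case per-pair asymmetry, max over k != l of Lmat[k, l] / Lmat[l, k]
    ratios = Lmat / Lmat.T
    C_s = ratios[off].max()

    # bar C_s: worst-case ratio of row sums to column sums (both excluding l = k)
    row_sums = (Lmat * off).sum(axis=1)            # sum over l != k of Lmat[k, l]
    col_sums = (Lmat * off).sum(axis=0)            # sum over l != k of Lmat[l, k]
    bar_C_s = (row_sums / col_sums).max()

    # bar C_b: smallest constant satisfying bound (2.9) in practice
    bar_C_b = (Lglobal / np.diag(Lmat)).max()

    return C_s, bar_C_s, bar_C_b
```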

Table 2.1: Measured constants ($C_s$, $\bar{C}_s$, $C_\rho$, $\hat{Q}$, $C_b$, $\bar{C}_b$ vs. $P$) for Poisson with a smooth RHS.

2.3.2 An updated NIRD convergence proof

While the previous section shows that the proof from [3] utilizes some assumptions and bounds which are either unrealistic or simply too pessimistic in practice, the overall structure of the proof may be helpful in understanding NIRD convergence. One other weakness of the proof is that it assumes the use of a discontinuous PoU. This section presents a generalization of the proof that is applicable to arbitrary PoU's and does away with some of the previous assumptions and bounds. In doing so, this proof identifies what appears to be a single, crucial bound that correlates well with good NIRD performance as shown in Section 2.5. Thus, the new, adapted proof is a much more useful tool in understanding why and predicting when NIRD will perform well or poorly.

In order to account for arbitrary PoU's, which may overlap, this proof proceeds by dividing the domain, $\Omega$, over the coarse elements of the preprocessing mesh, rather than the home domains, $\Omega_l$. Define $N_c$ to be the number of elements in the preprocessing mesh, and call each of these elements $\tau_k$ for $k = 1, \ldots, N_c$. Then, define $S_l = \mathrm{supp}(\chi_l)$ for $l = 1, \ldots, P$ and the sets $L_k = \{l : S_l \cap \tau_k \neq \emptyset\}$ and $M_k = \{m : S_m \cap \tau_k = \emptyset\}$ for each $\tau_k$. Now, break up the total functional for the solution produced by the first full NIRD iteration, $u_1^h$, over the $\tau_k$'s:
$$L(u_1^h - u)_\Omega^2 = \sum_{k=1}^{N_c} L(u_1^h - u)_{\tau_k}^2,$$
and, similar to the proof in the last section, use the triangle inequality on each term in the sum above:
$$L(u_1^h - u)_{\tau_k} \leq \sum_{l \in L_k} L(\delta u_l^h - \delta u_l)_{\tau_k} + \sum_{m \in M_k} L(\delta u_m^h - \delta u_m)_{\tau_k}.$$
Now, rather than employing the heuristic assumptions from the previous proof in order to proceed, simply assume a bound which is analogous to (2.9). Specifically, assume that the sum over $L_k$

contains most of the error and that the remaining sum over $M_k$ may be swept into an $O(1)$ constant, $C_0$:
$$L(u_1^h - u)_{\tau_k} \leq C_0 \sum_{l \in L_k} L(\delta u_l^h - \delta u_l)_{\tau_k}, \quad \forall \tau_k. \qquad (2.13)$$
In the case of the discontinuous PoU, if each home domain, $\Omega_l$, consists of a single coarse element, $\tau_k$, the above reduces to (2.9) with $C_0 = C_b$. Now, proceed by bounding the sum above by the error contribution from a single processor:
$$\sum_{l \in L_k} L(\delta u_l^h - \delta u_l)_{\tau_k} \leq |L_k|\, L(\delta u_{l_k}^h - \delta u_{l_k})_{\tau_k}, \quad \forall \tau_k, \qquad (2.14)$$
where $l_k = \arg\max_{l \in L_k} L(\delta u_l^h - \delta u_l)_{\tau_k}$. Combining the above steps yields the following important bound:
$$L(u_1^h - u)_{\tau_k} \leq C_0 |L_k|\, L(\delta u_{l_k}^h - \delta u_{l_k})_{\tau_k}, \quad \forall \tau_k, \qquad (2.15)$$
where, again, $C_0$ is assumed to be an $O(1)$ constant independent of $P$. To complete the proof, assume that the seminorm of the subproblem solution for processor $l_k$ can be related to the seminorm of the PDE solution, $u$, over $\tau_k$ by
$$|\delta u_{l_k}|_{q+1,\tau_k} \leq \frac{C_n}{|L_k|}\, |u|_{q+1,\tau_k}, \qquad (2.16)$$
where $C_n$ is an $O(1)$ constant. The proof then follows using interpolation assumptions similar to those used in the previous subsection:
$$L(u_1^h - u)_\Omega^2 = \sum_{k=1}^{N_c} L(u_1^h - u)_{\tau_k}^2 \leq \sum_{k=1}^{N_c} \big( C_0 |L_k|\, L(\delta u_{l_k}^h - \delta u_{l_k})_{\tau_k} \big)^2 \leq \sum_{k=1}^{N_c} \big( C_0 |L_k|\, C_I h_k^q |\delta u_{l_k}|_{q+1,\tau_k} \big)^2$$
$$\leq (C_0 C_I N^{-q/d})^2\, C_n^2 \sum_{k=1}^{N_c} |u|_{q+1,\tau_k}^2 = \big( C_0 C_I C_n N^{-q/d} |u|_{q+1,\Omega} \big)^2.$$
Taking the square root yields the bound, $L(u_1^h - u)_\Omega \leq C N^{-q/d} |u|_{q+1,\Omega}$, where $C = C_0 C_I C_n$. Again, directly solving on the union mesh yields the same bound with a different constant, implying

that when the above assumptions hold and the bounds are sharp, there is a constant, $K_1$, such that
$$L(u_1^h - u) \leq K_1 L(u_U - u).$$
As before, combining this with bound (2.3) yields the desired result:
$$L(u_1^h - u) \leq K_1 L(u_U - u) \leq K_1 K_0 Q^{-q/d} L(u_T - u) \;\Rightarrow\; L(u_1^h - u) \leq K\, L(u_T - u).$$
Thus, the first iteration of NIRD yields a solution with LSF within a factor $K$ of that achieved by a traditional solve, and it does so with only a single communication phase with $O(\log P)$ cost. The size of the constant $K = K_1 K_0 Q^{-q/d}$ is determined by the degree to which the assumptions made in the proof hold. The size of the constants $C_0$ and $Q$ both play a significant role in determining $K$ and are also both directly measurable. Section 2.5 shows numerical tests in which these quantities are measured and convergence is studied for some different elliptic test problems.

When considering a discontinuous PoU where each processor's home domain consists of a single element from the coarse preprocessing mesh, the above proof reduces to the same argument made in [3] and the previous subsection (combining some of the heuristic assumptions). The generalized framework presented here is designed to accommodate other PoU's. For example, consider the extreme case in which $\chi_l = 1/P$ over the entire domain for each processor, $l$. Clearly, these global characteristic functions yield an impractical overall algorithm, but they also trivially lead to
$$L(u_1^h - u) = L(u_U - u) = K_0 P^{q/d} L(u_T - u),$$
which provides a good sanity check for the structure of the proof. The $C^0$ and $C^\infty$ PoU's or other PoU's with extended support beyond the processor home domains fall somewhere between the discontinuous and global PoU's, but should still be accommodated successfully by this version of the proof.

2.4 Modeling communication cost

This section outlines some simple performance models comparing NIRD with a traditional nested iteration (NI) approach. Similar models and descriptions may be found in [3], though the

updated models presented here take into account the effect of problem difficulty on performance. The following models subsequently predict larger performance gains for NIRD over traditional NI for more difficult problems. The dominant cost of communication on most modern machines is latency rather than limited bandwidth. In addition, bandwidth costs may be further reduced by sending only boundary information for communicated subdomains and then recalculating the solution over that subdomain on the receiving processor [3]. Thus, only the latency cost, or the number of communications, is considered here.

2.4.1 Traditional NI and NIRD models

For the purposes of this discussion, assume that traditional NI begins by refining on a single processor until some maximum number of elements is reached, then distributes this mesh among all available processors and continues to refine, solving on each level using some fixed number of AMG V-cycles, until a desired finest mesh is obtained. In a weak scaling context, the size of the coarse, single-processor mesh is $E$, and the size of the finest mesh is $N = EP$, where $E$ is the maximum number of elements per processor, $P$ is the number of processors, and $E \gg P$. Assume that each refinement multiplies the number of degrees of freedom by some constant $c$, and, for simplicity, assume the AMG coarsening factor is similar, so that the number of levels in the multigrid hierarchy for the V-cycles is equal to the number of NI refinements. Furthermore, assume that the number of communications on each level of the V-cycle is a constant, $\nu$. As shown in Section 1.1.3, there should be some fixed amount of error reduction, $\beta$, required on each NI level in order to maintain a solution with accuracy on the order of the discretization. Thus, with a V-cycle convergence factor, $\rho$, that is independent of problem size, NI requires $\log_{1/\rho} \beta$ V-cycles per level. The resulting, approximate communication cost for traditional NI, $C_T$, can then be obtained by summing over the NI levels:

$$C_T \approx \sum_{i=\log_c E}^{\log_c(EP)} \nu (\log_{1/\rho} \beta)\, i = \nu(\log_{1/\rho} \beta) \Big( \sum_{i=1}^{\log_c(EP)} i - \sum_{j=1}^{\log_c E - 1} j \Big)$$
$$= \nu(\log_{1/\rho} \beta)\, \tfrac{1}{2} \big( \log_c(EP)(\log_c(EP) + 1) - (\log_c E - 1)\log_c E \big)$$
$$= \nu(\log_{1/\rho} \beta) \big( \log_c E \log_c P + \tfrac{1}{2}(\log_c P)^2 + \log_c E + \tfrac{1}{2}\log_c P \big) = O(\log_c E \log_c P).$$
Note that, in the regime where $E > P$, it follows that $\log_c E \log_c P > (\log_c P)^2$.

While the $O((\log_c P)^2)$ scaling is the major concern, the constant, $\nu(\log_{1/\rho} \beta)$, should not be ignored. This constant can be as small as about 20 for the easiest of problems, such as a 5-point Laplacian problem, but can also grow dramatically for more difficult problems. Larger stencils induced by higher-dimensional problems or higher-order elements as well as the problem of increased operator complexity on the coarse levels of the AMG hierarchies can greatly increase the value of $\nu$ [20,21]. Problems which are more difficult for AMG will also yield larger convergence factors, $\rho$, leading to an increase in the $\log_{1/\rho} \beta$ term. Also, the base of the logarithms, $c$, may vary from problem to problem. For problems that demand highly localized adaptive refinement, the total number of NI levels will grow. Also, AMG may not be able to coarsen some difficult problems as aggressively, leading to growth in the number of levels in the multigrid hierarchy. This is all to say that traditional NI exhibits not only poor asymptotic scaling of communication cost, but also suffers larger communication cost depending on the difficulty of the problem.

The main thrust of this section is to compare the problem dependence of communication cost for traditional NI vs. NIRD. Note that, regardless of the problem, NIRD incurs a communication cost, $C_N$, of
$$C_N = \alpha \log_2 P, \qquad (2.17)$$

where $\alpha$ is simply the number of NIRD iterations. As shown in Section 2.5, $\alpha = 1$ or 2 is usually sufficient to provide a very accurate result across a wide variety of problems and discretizations, even when these problems are more difficult to solve using traditional methods.

2.4.2 Model comparisons with varying problem difficulty

A notable difference between the communication paradigms for NIRD vs. traditional NI is that the communication cost for NIRD is agnostic to the linear system solver used and subsequently also largely agnostic to problem difficulty. Additional work performed in solving the linear system is all done in parallel for traditional NI and, thus, incurs larger communication cost, whereas this extra effort for difficult problems is performed in serial, independently on each processor in NIRD.

Figure 2.5 shows the predicted number of communications, $C_T$ and $C_N$, for a traditional NI solve and NIRD. The cost for NIRD assumes the use of $\alpha = 2$ NIRD iterations. For traditional NI, predicted performance is shown for both an easy and a more difficult problem. The easy problem uses $\nu = 20$ communications per level, which is a reasonable estimate for doing two relaxations, prolongation, restriction, and residual calculation with four processor neighbors as one might expect for a 5-point Laplacian problem. The convergence factor for the easy problem is $\rho = 0.1$, the coarsening factor is $c = 4$, and the amount of reduction needed on each level is $\beta = 9$. For the hard problem, $\nu = 40$, as one might expect for a problem with eight processor neighbors on each level, the convergence factor is $\rho = 0.8$, and the coarsening factor is $c = 2$, as one might expect for a problem that demands semicoarsening. Each problem uses $E = 10^6$ elements per processor.

As shown in Figure 2.5, even for the easy problem, which is basically an ideal, textbook case for full multigrid, traditional NI requires almost two orders of magnitude more communications than NIRD. For the hard problem, this difference is magnified, with traditional NI requiring between three and four orders of magnitude more communications than NIRD. Thus, there is a clear benefit here for NIRD when attempting to solve more difficult problems. The communication cost of traditional methods will continue to rise as the difficulty of the linear systems that need to be solved increases, whereas NIRD retains its already comparatively small communication cost.
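The curves in Figure 2.5 follow directly from these formulas. The short sketch below is illustrative only: the helper names are hypothetical, and the hard problem is assumed to reuse $\beta = 9$, since the text does not restate it.

```python
import numpy as np

def C_T(P, E, c, nu, rho, beta):
    """Traditional NI communication model: sum of per-level V-cycle communications."""
    levels = np.arange(np.log(E) / np.log(c), np.log(E * P) / np.log(c) + 1)
    vcycles_per_level = np.log(beta) / np.log(1.0 / rho)
    return nu * vcycles_per_level * levels.sum()

def C_N(P, alpha=2):
    """NIRD communication model: alpha iterations, each costing log2(P) communications."""
    return alpha * np.log2(P)

E = 10**6
for P in [64, 1024, 16384]:
    easy = C_T(P, E, c=4, nu=20, rho=0.1, beta=9)
    hard = C_T(P, E, c=2, nu=40, rho=0.8, beta=9)
    print(P, round(easy), round(hard), round(C_N(P)))
```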

Figure 2.5: Plot of model predictions of number of communications vs. number of processors for NI vs. NIRD.

2.5 NIRD numerical results

Numerical results from [3] have already shown the NIRD algorithm to have performance benefits over traditional methods on high-performance machines (where communication is a dominant part of the cost) for simple Poisson and advection-diffusion problems, and the performance model discussed in Section 2.4 predicts even larger performance benefits for more difficult problems. The focus here is to compare variants of the NIRD algorithm (i.e. using different PoU's) over a wider set of test problems and to study the convergence of NIRD on these problems in greater detail, especially as it relates to the theory presented in Section 2.3.2. The general form of the test problems used in this section is
$$-\nabla \cdot (A \nabla p) + \mathbf{b} \cdot \nabla p + cp = f, \ \text{on } \Omega = [0, 1]\times[0, 1], \qquad (2.18)$$
$$p = 0, \ \text{on } \Gamma_D, \qquad (2.19)$$
$$\mathbf{n} \cdot \nabla p = 0, \ \text{on } \Gamma_N, \qquad (2.20)$$

where $c$ is a scalar coefficient, $\mathbf{b}$ is a vector representing the direction and strength of the advection term, $\mathbf{n}$ is the unit outward normal vector, and
$$A = \alpha \begin{pmatrix} 1 & 0 \\ 0 & \epsilon \end{pmatrix} \qquad (2.21)$$
represents the diffusion coefficient, $\alpha$, along with grid-aligned anisotropy with strength $\epsilon$. The following subsections give numerical results for several different variations of this general, elliptic problem.

2.5.1 Poisson

Beginning with the simplest example, consider a Poisson problem, $\alpha = 1$, $\epsilon = 1$, $\mathbf{b} = 0$, $c = 0$, with zero Dirichlet conditions on the entire boundary. Let the right-hand side be defined by the smooth function,
$$f_0(x, y) = \sin(\pi x)\sin(\pi y).$$
Figure 2.6 shows NIRD convergence for this Poisson problem with the smooth right-hand side, $f_0$, using a discontinuous PoU and scaling the number of processors, $P$, from 64 to 256 to 1,024 while keeping the number of elements per processor, $E$, fixed at 20,000. The LSF's shown are normalized by the LSF achieved by the solution on the preprocessing mesh, and the reference lines plotted are the accuracy expected using traditional nested iteration (with ACE refinement) on $N_U$ or $N_T$ elements, where $N_U$ is the number of elements in the union mesh produced by NIRD and $N_T$ is the total number of elements used by NIRD across all subproblems.

As expected, NIRD performs well on this problem, producing an accurate solution within the first one or two iterations. Recall that each NIRD iteration requires only $\log_2 P$ communications, so the number of communications required to solve the problem with NIRD is significantly less than what is required by a traditional nested iteration approach. Another thing to note about this problem is that $N_U$ is quite close to $N_T$, meaning that most of the elements used by NIRD end up in the union mesh. Also, the NIRD iterates obtain accuracy close to the traditional solution with $N_U$ elements, indicating that

Figure 2.6: NIRD convergence for Poisson with a smooth right-hand side using a discontinuous PoU.

Figure 2.7: Compare NIRD convergence using different PoU's for Poisson with a smooth right-hand side and 1,024 processors.

the union mesh produced by NIRD is achieving nearly optimal element distribution, even for the discontinuous PoU. Comparing the performance of NIRD using different PoU's, as shown in Figure 2.7, shows that the discontinuous and $C^\infty$ PoU's have similar performance, whereas for the $C^0$ PoU, there is some widening of the gap between $N_U$ and $N_T$, meaning that the NIRD union mesh is unable to achieve as good accuracy. This can be explained by the fact that the $C^0$ PoU has the largest support of any of the PoU's, and therefore requires the most refinement outside of the home domains for each subproblem.

In addition to simply showing the performance of NIRD on different problems, this section also seeks to measure some important quantities of interest from the convergence proof given in Section 2.3.2 in order to better understand the behavior of NIRD and potentially to validate some of the assumptions made. Table 2.2 shows some measured quantities for Poisson with a smooth right-hand side. First, $\eta_1/\eta_0$, where $\eta_1$ and $\eta_0$ are the largest and smallest LSF's over home domains

after the preprocessing step, and $N_c$, the number of elements in the preprocessing mesh, are shown. This gives an idea of how good a job the preprocessing step does of equidistributing the error among processors. For this simple problem, the error is, unsurprisingly, quite well equidistributed. Next, the smallest constant, $C_0$, that satisfies bound (2.13) is measured. For this problem, $C_0$ remains small and constant even as $P$ grows. The fact that this is correlated with excellent NIRD convergence on the first iteration lends some validity to the line of proof given in Section 2.3.2. Finally, the table also shows $Q^{(i)}$, the ratio of the number of elements in the union mesh on the $i$th NIRD iteration vs. the total elements used by NIRD, and $K^{(i)} = \|Lu_i^h - f\| / \|Lu_T^h - f\|$, where $u_T^h$ is the solution arrived at by a traditional solve using the same number of elements. One trend present here, which holds for many of the test problems tried throughout this section, is that $Q^{(i)}$ grows on the second iteration (sometimes quite significantly), which is part of the reason why the overall functional for the second NIRD iterate is noticeably smaller than that for the first. The major takeaway here, however, is that even on the first NIRD iteration, the functional error for the NIRD solution is within a factor of about 2 or 3 of the error expected for a traditional solve using the same resources. Thus, NIRD is achieving accuracy which is quite close to traditional methods with minimal communication.

Table 2.3 and Figure 2.8 show results for NIRD applied to the Poisson problem with the more oscillatory right-hand side,
$$f_1(x, y) = \alpha(x, y) \sin(3\sqrt{P}\pi x)\sin(3\sqrt{P}\pi y),$$
where $\alpha(x, y)$ yields random values in $[-1, 1]$ over the domain such that the amplitudes of the sine humps for $f_1$ vary randomly while retaining the function's continuity. Note that for the smooth right-hand side, $f_0$, the same problem is solved as the number of processors, $P$, grows, but for $f_1$, the problem actually changes with $P$. NIRD performance for the oscillatory right-hand side is even better than for the smooth right-hand side, which is notable since the problem with the more oscillatory right-hand side (and subsequently more oscillatory solution) is traditionally more difficult to solve, requiring finer meshes to properly resolve the solution. For this problem, NIRD

convergence on the first iteration and the measured value for $C_0$ are both excellent across all PoU's and processor counts.

Table 2.2: Table of relevant values ($\eta_1/\eta_0$, $N_c$, $C_0$, $Q^{(1)}$, $K^{(1)}$, $Q^{(2)}$, $K^{(2)}$ vs. $P$) for NIRD using the discontinuous, $C^0$, and $C^\infty$ PoU's for Poisson with a smooth RHS.

Figure 2.8: NIRD convergence for Poisson with an oscillatory right-hand side using a discontinuous PoU.

Table 2.3: Table of relevant values ($\eta_1/\eta_0$, $N_c$, $C_0$, $Q^{(1)}$, $K^{(1)}$, $Q^{(2)}$, $K^{(2)}$ vs. $P$) for NIRD using the discontinuous, $C^0$, and $C^\infty$ PoU's for Poisson with an oscillatory RHS.

2.5.2 Advection-diffusion

The next test problem introduces a constant advection term, $\mathbf{b}$, and modifies the boundary conditions. A Neumann condition is used on the East side of the square domain, and Dirichlet conditions are retained on the North, South, and West sides. The advection term is grid aligned,
$$\mathbf{b} = \begin{pmatrix} \pm 15 \\ 0 \end{pmatrix},$$
so that the direction of advection is either straight into or away from the Dirichlet condition on the West side, depending on the sign. Note that when using the smooth right-hand side, $f_0$, the solution to the problem has a very different character depending on the sign of $\mathbf{b}$. For a negative sign, the solution has a steep boundary layer against the West side of the domain, whereas this boundary layer is not present in the case where $\mathbf{b}$ is positive. Figure 2.9 shows the solutions to the problem for each sign of $\mathbf{b}$.

As shown in Figures 2.10 and 2.11, NIRD performance is quite different, depending on the presence of the boundary layer. When no boundary layer is present (as shown in Figure 2.10), performance is quite good, not unlike the simple Poisson case. With the boundary layer (shown in Figure 2.11), NIRD performance is clearly not as good, however. As the number of processors grows, more NIRD iterations are required to obtain a solution with acceptable accuracy. Examining the values for $C_0$ shown in Tables 2.4 and 2.5 shows a correlation between small values for $C_0$ and good first-iteration convergence for NIRD for the problem with no boundary layer, as well as a correlation between larger values for $C_0$ and worse first-iteration convergence for the problem with a boundary layer. Again, this suggests the validity of the line of proof presented in Section 2.3.2 and also suggests that the boundary layer, which is present in the subproblem solutions as well as the global solution, may be causing issues for NIRD by making the NIRD error less localized. That is, processors that are far from the West side of the domain still need to resolve the boundary layer there in their subproblem solution, and they do a somewhat poor job of this, resulting in larger values of $C_0$ and poor NIRD convergence on the first iteration.

Changing the PoU used by NIRD does not make a substantial difference in performance on

Figure 2.9: Solutions to the advection-diffusion problem with positive $\mathbf{b}$ (left) and negative $\mathbf{b}$ (right).

Figure 2.10: NIRD convergence for advection-diffusion with a smooth right-hand side and no boundary layer in the solution using a discontinuous PoU.

Figure 2.11: NIRD convergence for advection-diffusion with a smooth right-hand side and a boundary layer in the solution using a discontinuous PoU.

Figure 2.12: Compare NIRD convergence using different PoU's for advection-diffusion with a smooth right-hand side, a boundary layer in the solution, and 1,024 processors.

Table 2.4: Table of relevant values ($\eta_1/\eta_0$, $N_c$, $C_0$, $Q^{(1)}$, $K^{(1)}$, $Q^{(2)}$, $K^{(2)}$ vs. $P$) for NIRD using the discontinuous, $C^0$, and $C^\infty$ PoU's for advection-diffusion with a smooth RHS and no boundary layer in the solution.

Table 2.5: Table of relevant values ($\eta_1/\eta_0$, $N_c$, $C_0$, $Q^{(1)}$, $K^{(1)}$, $Q^{(2)}$, $K^{(2)}$ vs. $P$) for NIRD using the discontinuous, $C^0$, and $C^\infty$ PoU's for advection-diffusion with a smooth RHS and a boundary layer in the solution.

the advection-diffusion problem with the boundary layer, as seen in Figure 2.12. In each case, multiple NIRD iterations are required in order to converge to a solution with accuracy comparable to a traditional nested iteration solve with $N_U$ elements. Performance is almost identical for the discontinuous and $C^\infty$ PoU's, while the $C^0$ PoU converges to the union mesh solution the fastest. As is the case for many of the test problems here, however, the $C^0$ PoU causes NIRD to allocate the fewest elements inside the home domains, resulting in smaller values for $Q^{(i)}$ (see Table 2.5) and a wider gap between $N_U$ and $N_T$.

One idea for improving NIRD performance on this problem is to remove the boundary layer through a transformation of the problem. For example, changing variables from $p$ in equation (2.20) to $\hat{p} = e^{\frac{15}{2}x} p$ results in the self-adjoint problem,
$$-\Delta \hat{p} + \Big(\frac{15}{2}\Big)^2 \hat{p} = e^{\frac{15}{2}x} f, \ \text{on } \Omega,$$
$$\hat{p} = 0, \ \text{on } N \cup S \cup W,$$
$$\partial_x \hat{p} - \frac{15}{2}\hat{p} = 0, \ \text{on } E,$$
where $N$, $S$, $W$, and $E$ refer to the North, South, West, and East edges of the domain. This reformulation of the problem does away with the advection term altogether (thus eliminating the boundary layer on the West edge of the domain), but also scales the right-hand side by an exponential. Thus, the problem of a steep layer in the solution is simply moved to the East edge of the domain, and NIRD still requires multiple iterations to converge for this version of the problem. On top of the fact that NIRD performance is not improved, the reformulation also encourages adaptive refinement to focus near the East edge, making it difficult to recover an accurate solution to the original problem (where more resolution is needed near the boundary layer on the West edge). Thus, transforming the problem in this (or a similar) way is not helpful in this case. Other fixes to alleviate the issue of the boundary layer, such as the use of exponential finite elements, may be viable, but have not yet been tested.

It is worth noting again that NIRD is actually still performing quite well on the advection-diffusion problem with the boundary layer, even though it is not achieving the same immediate

convergence observed for Poisson. An accurate solution is achieved in 3 or 4 iterations for $P$ = 1,024, each of which uses only $\log_2 P$ communications, making this still significantly cheaper than traditional methods.

Examining the advection-diffusion problem with negative $\mathbf{b}$ and the oscillatory right-hand side, $f_1$, yields yet more evidence that the dominance of the boundary layer feature in the solution is what is causing problems for NIRD. The solution for this problem setup still has a boundary layer on the West side of the domain, but the oscillatory right-hand side results in relatively steep changes in the solution that occur over the whole domain. The effect is that the boundary layer plays a much less significant role, and the error in the NIRD solution is once again dominated by the error contributions from local processors, as shown by the small values of $C_0$ in Table 2.6. Subsequently, good NIRD convergence on the first iteration is also restored for this problem, as indicated by the small values of $K^{(1)}$ in Table 2.6.

Table 2.6: Table of relevant values ($\eta_1/\eta_0$, $N_c$, $C_0$, $Q^{(1)}$, $K^{(1)}$, $Q^{(2)}$, $K^{(2)}$ vs. $P$) for NIRD using the discontinuous, $C^0$, and $C^\infty$ PoU's for advection-diffusion with an oscillatory RHS and a boundary layer in the solution.

Figure 2.13: NIRD convergence for jump-coefficient Poisson with a smooth right-hand side using a discontinuous PoU.

2.5.3 Poisson with anisotropy or jump coefficients

The next set of test problems includes anisotropy or jump coefficients. First, consider a jump-coefficient Poisson problem with $\alpha$ from equation (2.21) set as
$$\alpha(x, y) = \begin{cases} 1, & \text{if } (x, y) \in [0, 0.5]\times[0, 0.5] \,\cup\, [0.5, 1]\times[0.5, 1], \\ 100, & \text{if } (x, y) \in (0.5, 1]\times[0, 0.5) \,\cup\, [0, 0.5)\times(0.5, 1], \end{cases}$$
which yields a standard, $2\times 2$ checkerboard pattern of discontinuous coefficients, and set $\epsilon = 1$, $\mathbf{b} = 0$, and $c = 0$ with Dirichlet conditions on the entire boundary. This problem seems to cause no issues for NIRD, and excellent convergence is observed for the smooth right-hand side using the discontinuous PoU, as shown in Figure 2.13. Similar to the standard Poisson case, there is no real benefit to using either the $C^0$ or $C^\infty$ PoU for this problem, since the discontinuous PoU is sufficient for good convergence, and when solving the problem with the oscillatory right-hand side, NIRD again performs very well.

Next, consider the anisotropic Poisson problem where $\epsilon = 0.001$, again keeping all other parameters at their default values, $\alpha = 1$, $\mathbf{b} = 0$, $c = 0$, Dirichlet conditions on the entire boundary, and again starting by studying the smooth right-hand side, $f_0$. This problem, more so than any of the other test problems presented here, really exposes the differences between the various PoU's. As shown in Figure 2.14, there is a wide gap between the NIRD solution and the optimal solution using $N_U$ elements for the discontinuous PoU, indicating that NIRD is doing a poor job of distributing

the elements. Figure 2.15 shows the refinement pattern produced by traditional ACE refinement vs. the union meshes produced by NIRD using the various PoU's for a small problem with 16 processors and 1,000 elements per processor. When using the discontinuous PoU, NIRD is placing too many elements near the corners and edges of home domains and, as a consequence, is not globally well refined. The $C^0$ and $C^\infty$ PoU's, however, yield much more sensible refinement patterns, with the $C^\infty$ PoU obtaining more elements in the union mesh. NIRD using the $C^0$ PoU does a much better job of approximating a traditional solution with $N_U$ elements, as shown in Figure 2.14, though, again, the gap between $N_U$ and $N_T$ is the largest of all the choices of PoU. In this case, the $C^\infty$ PoU appears to be the best choice, achieving both a narrow gap between $N_U$ and $N_T$ and good approximation of the traditional solution with $N_U$ elements, as well as simply the most accurate NIRD solution. When using the $C^\infty$ PoU, NIRD performance scales well with the number of processors, as shown in Figure 2.16. It is also worth noting that, as shown in Table 2.7, the values for $C_0$ are still relatively small and constant across all PoU's, which again correlates to swift NIRD convergence in all cases. The difference in quality of solution here is almost entirely due to the differences in element distribution for the union mesh.

It is again notable that solving the same anisotropic Poisson problem with the oscillatory right-hand side, $f_1$, yields excellent NIRD convergence for all PoU's. As shown by the small values of $K^{(1)}$ in Table 2.8, NIRD achieves accuracy very close to that achieved by a traditional solve within the first iteration, indicating that the oscillatory right-hand side has alleviated the problem of bad element distribution encountered with the smooth right-hand side.

2.5.4 Higher-order finite elements

All results from the previous subsections used linear, Lagrange finite elements. This subsection examines the performance of NIRD when applied to higher-order discretizations. In order to justify the use of higher-order elements, consider a solution that has steep gradients. Specifically, this section uses a test problem inspired by Problem 2.9 from Mitchell's set of test problems in [25].

Figure 2.14: Compare NIRD convergence using different PoU's for anisotropic Poisson with a smooth right-hand side and 1,024 processors.

Figure 2.15: Comparison of traditional ACE refinement pattern (top left) vs. the union mesh produced by NIRD using the discontinuous PoU (top right), the $C^0$ PoU (bottom left), and the $C^\infty$ PoU (bottom right).

Figure 2.16: NIRD convergence for anisotropic Poisson with a smooth right-hand side using a $C^\infty$ PoU.

Table 2.7: Table of relevant values ($\eta_1/\eta_0$, $N_c$, $C_0$, $Q^{(1)}$, $K^{(1)}$, $Q^{(2)}$, $K^{(2)}$ vs. $P$) for NIRD using the discontinuous, $C^0$, and $C^\infty$ PoU's for anisotropic Poisson with a smooth RHS.

Table 2.8: Table of relevant values ($\eta_1/\eta_0$, $N_c$, $C_0$, $Q^{(1)}$, $K^{(1)}$, $Q^{(2)}$, $K^{(2)}$ vs. $P$) for NIRD using the discontinuous, $C^0$, and $C^\infty$ PoU's for anisotropic Poisson with an oscillatory RHS.

Figure 2.17: Solution to the wave front problem used for testing higher-order elements.

Consider a solution of the form
$$r = \sqrt{(x - x_c)^2 + (y - y_c)^2}, \qquad p(x, y) = \tan^{-1}(a(r - r_0)),$$
which describes a circular wave front centered at $(x_c, y_c)$ with radius $r_0$ and steepness determined by $a$. Taking the Laplacian yields
$$\Delta p = \frac{a + a^3(r_0^2 - r^2)}{r(1 + a^2(r_0 - r)^2)^2}.$$
The test problem here takes a linear combination of functions of this form to arrive at the right-hand side,
$$r_1 = \sqrt{(x - x_1)^2 + (y - y_1)^2}, \qquad r_2 = \sqrt{(x - x_2)^2 + (y - y_2)^2},$$
$$f_2(x, y) = \frac{a + a^3(r_0^2 - r_1^2)}{r_1(1 + a^2(r_0 - r_1)^2)^2} - \frac{a + a^3(r_0^2 - r_2^2)}{r_2(1 + a^2(r_0 - r_2)^2)^2},$$
where $(x_1, y_1) = (0.65, 0.65)$, $(x_2, y_2) = (0.35, 0.35)$, $r_0 = 0.3$, and $a = 100$. Using this right-hand side for a Poisson problem with zero Dirichlet conditions on the entire boundary yields the solution shown in Figure 2.17. This is the test problem used throughout this subsection.

One important practical note is that increasing element order necessitates decreasing the number of elements per processor, $E$, due to memory constraints. For the tests presented here,

$E$ = 20,000 is used for degree 1 elements, $E$ = 5,000 for degree 2, $E$ = 2,000 for degree 3, and $E$ = 1,000 for degree 4. Note that this also limits the number of processors that may be used, since NIRD only works well in the regime where $E \gg P$. As previously mentioned, this limitation may be overcome by applying NIRD at the level of parallelism between nodes, treating $P$ as the number of nodes, and thus increasing the available memory and allowing for larger $E$.

Figure 2.18 shows that NIRD with the discontinuous PoU achieves an accurate solution within one iteration as polynomial order increases from one to four (note the difference in scale for the vertical axis as polynomial order increases). The gap between $N_U$ and $N_T$ does widen somewhat as the degree increases, but this is due to the fact that $E$ is decreasing relative to $P$, leaving fewer elements to be used for adaptive refinement on the subproblems after the preprocessing step.

Figure 2.18: NIRD convergence using a discontinuous PoU, 256 processors, and varying the polynomial order of the finite element discretization for the wave front problem (note the difference in scale for the vertical axis as polynomial order increases).
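For reference, the wave front right-hand side $f_2$ defined above is straightforward to evaluate in code. The sketch below is an illustrative transcription of the displayed formulas (with the sign between the two wave-front terms taken as reconstructed there); the function and constant names are hypothetical and not part of the thesis implementation.

```python
import numpy as np

# Parameters of the wave front test problem
A_STEEP = 100.0          # steepness a
R0 = 0.3                 # wave front radius r_0
CENTER_1 = (0.65, 0.65)  # (x_1, y_1)
CENTER_2 = (0.35, 0.35)  # (x_2, y_2)

def laplacian_wave_front(x, y, xc, yc, a=A_STEEP, r0=R0):
    """Laplacian of p = arctan(a (r - r0)), with r the distance to (xc, yc)."""
    r = np.sqrt((x - xc) ** 2 + (y - yc) ** 2)
    return (a + a**3 * (r0**2 - r**2)) / (r * (1.0 + a**2 * (r0 - r) ** 2) ** 2)

def f2(x, y):
    """Right-hand side f_2: difference of two wave-front Laplacians."""
    return (laplacian_wave_front(x, y, *CENTER_1)
            - laplacian_wave_front(x, y, *CENTER_2))

# Example: evaluate f_2 on a coarse sample of the unit square (avoiding the centers)
xs, ys = np.meshgrid(np.linspace(0.01, 0.99, 5), np.linspace(0.01, 0.99, 5))
print(f2(xs, ys).round(2))
```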

When comparing different PoU's, NIRD performance is quite similar across all polynomial orders. The excellent performance achieved by the discontinuous PoU shown in Figure 2.18 indicates that smoother PoU's are not necessary as the polynomial degree of the finite elements increases. This is likely due to the fact that, although the discontinuous PoU clearly reduces the smoothness of the right-hand side, it introduces discontinuities only along element boundaries. For Lagrange elements, the finite element approximation is only $C^0$ at these boundaries anyway for any polynomial order on the interior of the element. Coupling these facts with adaptive mesh refinement yields no degradation in the finite element convergence for the subproblems, even when using the discontinuous PoU. As shown in Figure 2.19, the LSF for each processor's subproblem (using the discontinuous PoU) converges at a rate of $O(h^q)$, where $q$ is the polynomial degree of the finite elements (here $h$ is defined as $N^{-1/d}$).

2.5.5 Discussion

Throughout studying the above test problems, a few broad observations can be made. First, note that for many of the test problems above, the preprocessing step achieves a fairly small ratio of $\eta_1/\eta_0$, and the NIRD solution is able to recover accuracy fairly close to what traditional ACE refinement using $N_U$ elements achieves. This is especially notable in the cases of the advection-diffusion problem with the boundary layer and the wave front problem, both of which have highly localized features demanding adaptive refinement. This indicates that the adaptive preprocessing strategy developed earlier in this chapter is performing well and doing a good job of equidistributing error among processors, subsequently allowing NIRD to achieve a union mesh with nearly optimal element distribution.

Second, some general statements comparing the different PoU's tested here may be made. In general, the discontinuous PoU, though the simplest of the PoU's, was sufficient to obtain good NIRD performance on many of the test problems. In fact, even for higher-order finite elements, the NIRD subproblems achieve the proper order of finite element convergence due to the fact that the discontinuities in the right-hand sides are only along element boundaries. With that said, there

Figure 2.19: NIRD subproblem convergence using a discontinuous PoU, 16 processors, and varying the polynomial order of the finite element discretization for the wave front problem (note the difference in scale for the vertical axis as polynomial order increases).

are some problems for which the $C^0$ and $C^\infty$ PoU's performed better due to the discontinuous PoU's suboptimal element distribution. In general, the discontinuous PoU, which has the smallest support for the characteristic functions, achieves the highest number of elements in the union mesh but has the least optimal element distribution, whereas the $C^0$ PoU, which has the largest support for the characteristic functions, obtains union meshes that are closest to optimal in their element distribution but have significantly fewer elements. The $C^\infty$ PoU proposed here seems to be a middle ground between the discontinuous and $C^0$ PoU's in that its smoothness generally leads to better element distribution than the discontinuous PoU, while it also places more elements in the union mesh than the $C^0$ PoU since the support of its characteristic functions is smaller. Thus, for problems such as anisotropic Poisson, where the accuracy achieved by the discontinuous PoU

really suffers due to poor element distribution, the $C^\infty$ PoU actually performs best.

The final important observation is that the numerical results here both validate the line of proof described in Section 2.3.2 and contribute additional understanding and intuition about what causes NIRD to work well. The assumption that $C_0$ is small and constant with respect to $P$ is the crux of the updated convergence proof presented in this thesis. A strong correlation is observed throughout the numerical results between small values of $C_0$ and good NIRD convergence on the first iteration: for example, $C_0$ is, at most, about 2 or 3 for Poisson, where NIRD converges well on the first iteration, and $C_0$ is as large as 50 for the advection-diffusion problem with a boundary layer, where NIRD converges poorly on the first iteration. This suggests that the line of proof in Section 2.3.2 is valid in practice and can be predictive of NIRD performance. This is a step in the right direction for the convergence theory, but a better understanding of how to bound $C_0$ using a priori knowledge of the PDE is still needed.

A further interesting note is that NIRD performs very well (and $C_0$ is small) across all the test problems when an oscillatory right-hand side is used. The intuitive explanation here is that the use of an oscillatory right-hand side generates NIRD subproblems in which the dominant error is focused inside the support of the characteristic function, where the solution is more difficult to resolve given a more oscillatory right-hand side. This dominance of error in a compactly supported region is one of the guiding principles that enables NIRD to work in the first place, but starting with a highly oscillatory right-hand side seems to magnify this dominance, resulting in exceptional NIRD performance.

Chapter 3

Algebraic multigrid with domain decomposition (AMG-DD)

While NIRD shows great promise as an $O(\log P)$ solver for elliptic PDE's, the implementation of this algorithm requires access to problem geometry, discretization, and adaptive mesh refinement, as well as solvers for the linear systems produced by discretizing on each level of the nested iteration process. This makes for a rather intrusive implementation that potentially limits the algorithm's usability for some applications. Algebraic methods such as algebraic multigrid (AMG), on the other hand, attempt to function as black-box solvers for a given linear system. As such, these methods are relatively easy to apply and widely used. AMG, in particular, is often the method of choice when solving or preconditioning linear systems resulting from discretizing elliptic PDE's.

The Algebraic Multigrid with Domain Decomposition (AMG-DD) algorithm is similar in spirit to NIRD in that it aims to achieve discretization accuracy in a fixed number of cycles with $O(\log P)$ cost. Unlike NIRD, however, AMG-DD is a purely algebraic method, giving it the same ease of use and versatility as other algebraic methods. Using only algebraic information, AMG-DD mimics some of the ideas from NIRD: each processor solves a subproblem over the entire domain, and the grids for these subproblems are refined in and near the processor's home domain and are coarser farther away. There are some important differences in approach between the two methods, however. NIRD decomposes the problem at the level of the PDE, whereas AMG-DD decomposes the algebraic problem. NIRD begins with a coarse representation of the problem and uses adaptive refinement to build up a fine solution, whereas AMG-DD begins with some fine grid matrix and utilizes AMG to coarsen the problem. These differences add up to NIRD achieving generally better

convergence properties, while AMG-DD achieves the benefits of being a totally algebraic method.

As briefly discussed in the introduction, traditional parallel implementations of AMG usually require $O((\log P)^2)$ communication cost to solve the linear system to an acceptable level of accuracy, and this cost is a major challenge for the scalability of AMG as parallelism continues to increase on modern supercomputers [21]. Assuming a weak scaling context in which the global number of degrees of freedom, $N$, is proportional to the number of processors, $P$, an AMG hierarchy will have $O(\log P)$ levels. The need for a constant number of communications on each level of the hierarchy then results in an $O(\log P)$ communication cost per AMG V-cycle. Further assuming some fixed V-cycle convergence factor results in the need for $O(\log P)$ V-cycles, and thus $O((\log P)^2)$ total communication cost, in order to solve a given problem to the level of discretization accuracy.

Previous work by Bank et al. [6] considered AMG-DD as a low-communication alternative to AMG V-cycles. The communication cost per cycle for AMG-DD is lower than that of AMG V-cycles, but only by a constant factor. That is, AMG-DD retains $O(\log P)$ communication cost per cycle. As such, examining asymptotic convergence factors for AMG-DD vs. AMG V-cycles predicted only marginal performance benefits for AMG-DD on current machines. This thesis examines AMG-DD instead as a discretization-accuracy solver. In this context, the focus is on convergence to the solution of the PDE over the first few iterations of AMG-DD rather than asymptotic convergence to the solution of the matrix problem. Results presented in this chapter indicate that AMG-DD may be capable of achieving discretization-accuracy convergence within a small, fixed number of cycles, thus significantly reducing the cost of solving the PDE problem from $O((\log P)^2)$ to $O(\log P)$.

The following section describes the AMG-DD algorithm. This is followed by a discussion of the underlying AMG hierarchy in Section 3.2, which examines AMG interpolation in a discretization-accuracy context. Section 3.3 then shows numerical results for AMG-DD varying the underlying AMG hierarchy and the construction of the AMG-DD subproblems. These results suggest that AMG-DD is capable of achieving discretization accuracy within a fixed number of cycles with $O(\log P)$ communication cost. Finally, in Section 3.4, a new, fully parallel implementation of

Figure 3.1: Visualization of composite creation steps (note that here $\eta = 2$ on each level).

AMG-DD is discussed. This is the first parallel implementation of the algorithm, and some results are shown verifying the correct scaling of communication costs.

3.1 AMG-DD algorithm description

Assume an elliptic PDE has been discretized on some fine grid, and denote the associated algebraic system to be solved by $Au = f$. $A$, $u$, and $f$ are assumed to be split over processors in a standard way such that each processor owns the information associated with some partition of the degrees of freedom. The owned degrees of freedom on a processor comprise its home domain. To employ the AMG-DD algorithm, first construct a global AMG hierarchy for the matrix $A$. This is all done in parallel in the standard way so that each processor ends up owning all information on all levels of the AMG hierarchy for its home domain.

The next step is to set up a composite grid on each processor. Composite grids are easiest to visualize on a geometric mesh, but the process for creating them is purely algebraic, relying only on the operators from the global AMG hierarchy. A composite grid may be constructed as follows (see visualization in Figure 3.1). Start by including a processor's home domain on the fine grid. Select a padding $\eta$, and add all points within distance $\eta$ in the stencil of $A$ (white points in Figure 3.1). Then move to the next coarse grid and repeat, including points within distance $\eta$ in the stencil of the coarse-grid operator. This process is repeated until the entire domain is covered by the composite grid. Since each processor's composite grid covers the entire computational domain (and is thus

a representation of the global problem), there is no need to communicate with other processors in order to cycle on the composite grids. Thus, each processor may independently solve its composite version of the global problem. Also, the composite grid construction maintains a refined mesh in and near each processor's home domain, resulting in composite-grid solutions that are (hopefully) locally accurate over the associated processor's home domain.

Armed with the definition of a composite grid, it is now straightforward to describe an AMG-DD cycle. Let $P_k$ be the interpolation operator between processor $k$'s composite grid and the global fine grid. Given an initial guess, $u$, the AMG-DD cycle calculates a global residual, $r = f - Au$, and then restricts this residual to all levels of the global AMG hierarchy. The algorithm then communicates the appropriate pieces of the residual on each level corresponding to the composite grid for each processor. The composite residual communication algorithm is nontrivial, but may be implemented in such a way that it requires only $O(\log P)$ communication cost [6]. This is followed by a solve of the composite problems, $(P_k^T A P_k) u_k = P_k^T r$, on each processor. The global solution is then updated by $u \leftarrow u + \sum_k \chi_k u_k$. Here, $\chi_k$ is a discrete partition of unity so that
$$\chi_k u_k(x) = \begin{cases} u_k(x), & \text{if } x \in \text{processor } k\text{'s home domain}, \\ 0, & \text{otherwise}. \end{cases}$$
Thus, the algorithm patches together the finely meshed home domains from the composite solutions on each processor. Again, note that solving the composite problems on each processor, which comprises the main computation step of the algorithm, requires no communication. Pseudocode for the AMG-DD algorithm is shown in Algorithm 3.
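As a concrete illustration of the composite-grid construction described above, the following minimal serial sketch pads an index set by graph distance $\eta$ on each level of an existing hierarchy. This is illustrative only (it is not the parallel implementation discussed in Section 3.4): `operators` is assumed to be a list of scipy CSR matrices $A_0, A_1, \ldots$ from an AMG hierarchy, and `fine_to_coarse[l]` is an assumed map from a level-$l$ index to its coarse-grid index (or $-1$ for points that do not appear on the next level).

```python
def expand_by_graph_distance(A, indices, eta):
    """Return `indices` plus all points within graph distance eta in the stencil of A (CSR)."""
    current = set(indices)
    frontier = set(indices)
    for _ in range(eta):
        next_frontier = set()
        for i in frontier:
            row_cols = A.indices[A.indptr[i]:A.indptr[i + 1]]  # nonzero columns of row i
            next_frontier.update(int(j) for j in row_cols)
        next_frontier -= current
        current |= next_frontier
        frontier = next_frontier
    return current

def build_composite_grid(operators, fine_to_coarse, home_indices, eta):
    """Collect, per level, the indices that form one processor's composite grid."""
    composite = []
    indices = set(home_indices)
    for l, A in enumerate(operators):
        padded = expand_by_graph_distance(A, indices, eta)
        composite.append(sorted(padded))
        if l + 1 < len(operators):
            # Move to the next coarser level: keep the coarse images of the padded region.
            indices = {fine_to_coarse[l][i] for i in padded if fine_to_coarse[l][i] >= 0}
    return composite
```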

Algorithm 3 AMG-DD cycle.
  Form initial guess, u.
  for i = 0, ..., (num iterations) do
    Calculate the global residual, r = f - Au, and restrict it to all levels.
    Communicate the residual on the composite grids to each processor.
    Solve (P_k^T A P_k) u_k = P_k^T r on each processor.
    Update the solution: u <- u + sum_k chi_k u_k.
  end for

In the broader context of recent, novel, low-communication algorithms, the AMG-DD algorithm is most easily compared to Mitchell's parallel multigrid algorithm using full domain partitions [23,24,26]. Both methods leverage underlying multigrid hierarchies to construct subproblems that may be cycled on with low communication cost compared with ordinary V-cycles. Mitchell's algorithm, however, constructs subproblems using a full approximation scheme [11] based on the hierarchical basis [31], which demands some additional geometric information compared to the purely algebraic approach taken in AMG-DD. In addition, the V-cycle proposed by Mitchell employs communication on the finest and coarsest levels, whereas in AMG-DD, cycling on the subproblems is completely communication-free. This is a major advantage of AMG-DD, since the subproblems may be solved as accurately as desired before updating the global solution. This is crucial to AMG-DD's performance and allows for its potential as a fast, discretization-accuracy solver.

3.2 Algebraic multigrid interpolation

Since the goal of AMG-DD is to perform well as a discretization-accuracy solver, it is reasonable to ask what properties of the underlying AMG hierarchy are favorable to discretization-accuracy convergence. In general, AMG methods are designed with the goal of obtaining scalable convergence factors, and there is not much focus on achieving any kind of scalable discretization-accuracy convergence. This section reveals that obtaining a good discretization-accuracy solver is, in fact, a much different (and perhaps more difficult) goal than obtaining good convergence factors.

3.2.1 Full algebraic multigrid (FAMG)

To examine AMG in a discretization-accuracy context, consider the full multigrid (FMG) cycle discussed in Section 1.1.3. In a geometric multigrid (GMG) setting, a single FMG cycle with a fixed number of V-cycles on each grid level is sufficient to reduce the algebraic error to the level of discretization accuracy, as shown by the convergence proof in Section 1.1.3. Thus, FMG is an $O(N)$ discretization-accuracy solver for elliptic PDE's when applied to a geometric hierarchy. When applying a full multigrid cycle to an algebraic hierarchy (denoted FAMG), however, this scalable convergence is typically lost, indicating some deficiency in the AMG hierarchy that limits AMG's capabilities as a discretization-accuracy solver.

To demonstrate the above claims about FMG and FAMG, consider the model problem of a Poisson equation discretized on a square with bilinear finite elements:
$$-\Delta u = f, \ \text{on } \Omega = [-1, 1]\times[-1, 1], \qquad u = 0, \ \text{on } \partial\Omega.$$
A manufactured solution, $u(x, y) = (x + 1)(1 - x)(y + 1)(1 - y)$, yields right-hand side $f(x, y) = 2((y + 1)(1 - y) + (x + 1)(1 - x))$. For such a problem, standard AMG is known to have excellent V-cycle convergence, but standard FAMG fails to converge to discretization accuracy in a single cycle. Figure 3.2 shows convergence of the relative total error,
$$\text{Relative Error} = \frac{\|v_i^h - u\|}{\|u\|},$$
for FMG vs. FAMG, where $u$ is the true solution evaluated on the fine grid and $v_i^h$ is the solution obtained by the nested iteration process (solving on each coarse grid using a single V-cycle, then projecting up) plus $i$ additional V-cycles on the fine grid. This discussion uses the discrete $L^2$ norm throughout. Note that the error shown stalls at the level of discretization accuracy since it is measured against an analytic solution, $u$. As the grid is refined (and problem size increases),

As the grid is refined (and the problem size increases), FAMG yields less accurate fine-grid solutions after the nested iteration process, thus requiring more fine-grid V-cycles to reach the level of discretization accuracy. Notice, however, that FMG achieves accuracy on the order of the discretization with one (or at most two) fine-grid V-cycles, independent of problem size.

Figure 3.2: Relative total error convergence for standard FAMG vs. FMG on n × n finite element grids of increasing size.

The difference in convergence between FMG and FAMG can be attributed to a difference in the way interpolation is constructed between the geometric and algebraic hierarchies. In order to guarantee the single-cycle discretization-accuracy convergence obtained by FMG, as established in the proof in Section 1.1.3, the interpolation error must scale like the discretization order. That is,

\[ \| u^h - P u^{2h} \| \le K h^p, \qquad (3.1) \]

where u^h and u^{2h} are the fine- and coarse-grid algebraic solutions, respectively, for any pair of grids in the hierarchy, P is the interpolation operator between those grids, K is a constant, h is the step size of the fine grid, and p is the order of the discretization. If (3.1) holds on every level of a multigrid hierarchy, then the simple inductive proof from Section 1.1.3 shows that a single FMG cycle achieves discretization-accuracy convergence [14].

For the model problem with the appropriate coloring scheme, the coarse grids generated in the AMG hierarchy are the same as the coarse grids generated by GMG. The interpolation operators between grids are also very similar between AMG and GMG. In fact, these operators differ only in the way they choose interpolation weights near the boundary of the domain. This subtle difference in interpolation is enough, however, to prevent the interpolation operators generated by AMG from satisfying assumption (3.1). Figure 3.3 shows the interpolation error, ‖u^h − P u^{2h}‖, on each level (with 0 representing the finest level) for the GMG and AMG hierarchies. These errors are calculated with the manufactured right-hand side, f, specified above. GMG obtains exactly O(h²) convergence of the interpolation error (which is to be expected for the given first-order discretization using the energy norm for the error), while AMG fails this requirement. Even though the difference in interpolation operators between the AMG and GMG methods is subtle (again, they differ only in the way they choose weights near the boundary for the model problem), they exhibit very different behavior across the levels of the hierarchy.

A visualization of the error, u^h − P u^{2h}, over the domain is helpful in better understanding the nature of the interpolation error for GMG vs. AMG. As seen in Figure 3.4, the interpolation error for GMG is small and oscillatory. This is precisely the sort of error that is effectively removed by a V-cycle on the fine grid. For AMG, however, there is a large smooth component to the error, produced by large discrepancies near the boundary, which pollutes the entire domain. More V-cycles are required to remove this smooth mode in the error. Thus, the above analysis suggests that an FAMG cycle might be improved by changing AMG interpolation in some way to recover O(h²) scaling of the interpolation error.
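The per-level interpolation errors plotted in Figure 3.3 can be measured with a short diagnostic of the following form (a sketch in Python/NumPy, assuming the level operators, interpolation operators, and level right-hand sides are available; exact solves are used on every level, which is affordable only for this kind of study).

    import numpy as np

    def interpolation_errors(A_levels, P_levels, f_levels):
        """Return ||u^h - P u^{2h}|| for each consecutive pair of levels,
        where each level solution solves its level equation exactly."""
        exact = [np.linalg.solve(A, f) for A, f in zip(A_levels, f_levels)]
        errors = []
        for l in range(len(A_levels) - 1):
            e = exact[l] - P_levels[l] @ exact[l + 1]            # interpolation error on level l
            errors.append(np.linalg.norm(e) / np.sqrt(e.size))   # discrete L2 scaling
        return errors

Plotting these values against the level index (or against h) is how the O(h^p) scaling demanded by (3.1) can be checked for a given hierarchy.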

Figure 3.3: Interpolation error across levels of the multigrid hierarchies generated by standard AMG vs. GMG (with 0 representing the finest level).

Figure 3.4: Plot of the finest-level interpolation error over the grid for GMG (top) vs. standard AMG (bottom). The cross section is taken through the middle of the domain.

3.2.2 Improving AMG interpolation

Throughout this paper, standard AMG refers to the Ruge-Stüben method for generating an AMG hierarchy [10, 28]. The basic principle of any multigrid scheme is that of the

complementary process of fine-grid relaxation and coarse-grid correction. AMG assumes a relaxation scheme (such as weighted Jacobi or Gauss-Seidel) for which the slowly converging error components have relatively small residuals. So the slowly decaying error, e, which should be addressed by the coarse-grid correction, loosely yields A^h e ≈ 0. This motivates AMG's choice of interpolation. Rewriting the assumption A^h e ≈ 0 componentwise and separating the strongly connected coarse points from the other connections yields

\[ a_{ii} e_i \approx -\sum_{j \in C_i} a_{ij} e_j - \sum_{j \in D_i} a_{ij} e_j, \]

where the a_{ij}'s are entries from the matrix A^h, C_i is the set of strongly connected coarse points, and D_i is the set of remaining connections. To obtain an interpolation formula from the above heuristic, assuming that the goal is to interpolate point i from all strongly connected coarse-grid points, the contributions from D_i must be collapsed either to point i or to points in C_i so that e_i is calculated as a weighted sum of only the interpolatory coarse-grid values:

\[ e_i = \sum_{j \in C_i} w_{ij} e_j. \]

The w_{ij}'s above are referred to as interpolation weights and become the entries in the interpolation matrix, P. Standard AMG collapses weakly connected points to point i and strongly connected fine points to C_i to obtain the interpolation weights:

\[ w_{ij} = -\frac{a_{ij} + \sum_{k \in D_i^s} a_{ik} \dfrac{a_{kj}}{\sum_{m \in C_i} a_{km}}}{a_{ii} + \sum_{m \in D_i^w} a_{im}}, \qquad j \in C_i, \qquad (3.2) \]

where D_i^s is the set of strongly connected fine-grid points and D_i^w is the set of weakly connected points. Note that the strongly connected fine points are collapsed to points in C_i by a simple averaging, which implicitly assumes that the error is approximately constant among these points. Thus, standard AMG fits the constant vector in this way. With interpolation defined, the remaining

pieces of an AMG hierarchy follow from the usual variational property and Galerkin condition: restriction is defined as the transpose of interpolation, R = P^T, and the coarse-grid operators are formed via A^{2h} = P^T A^h P.

In a GMG hierarchy, linear interpolation between grids can be explicitly enforced, guaranteeing O(h²) scaling of the interpolation error. In an algebraic setting, there is no notion of geometry and, as such, no way to explicitly enforce linear interpolation. Thus, it may be necessary to leverage algebraic information in order to emulate linear interpolation as much as possible. In an algebraic setting, the main approach to constructing interpolation is to ensure that the appropriate vectors lie in the range of interpolation on each level. As mentioned above, the standard Ruge-Stüben way of choosing interpolation weights in AMG ensures that the constant vector is in the range. The analysis of Section 3.2.1 shows that this property is not sufficient for achieving good interpolation near the boundary. Thus, the range of interpolation must be corrected or enriched in some way to recover good interpolation everywhere in the domain.

One approach to improving the range of interpolation is to modify the way in which fine-grid connections are collapsed to coarse-grid points when forming interpolation. Standard AMG seeks to fit the constant vector when collapsing these connections, but this approach may be modified to fit an arbitrary vector, x, resulting in the following formula for interpolation weights [12]:

\[ w_{ij} = -\frac{a_{ij} + \sum_{k \in F_i^s} a_{ik} \dfrac{a_{kj} x_k}{\sum_{m \in C_i} a_{km} x_m}}{a_{ii} + \sum_{m \in F_i^w} a_{im}}, \qquad j \in C_i. \qquad (3.3) \]

Note that letting x = 1 yields the Ruge-Stüben formula for the interpolation weights. In general, the coarse grid should correct components of the error that are in the near-kernel of A^h. For the model problem, the eigenvectors are known, so let x be the eigenvector associated with the smallest eigenvalue (the sine hump) and define interpolation according to (3.3). As shown in Figure 3.5, defining interpolation in this way restores O(h²) convergence of the interpolation error across the multigrid levels and yields FAMG convergence to discretization accuracy in a single cycle.
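To make the weight formula concrete, the following sketch (Python/NumPy, not the thesis code) computes the weights of (3.3) for a single row, i, given the coarse set C_i, the strong and weak fine sets, and a target vector x; the function name and the assumption that A is a dense array are purely illustrative. Passing the all-ones vector for x recovers the standard Ruge-Stüben weights of (3.2).

    import numpy as np

    def interpolation_weights_row(A, i, Ci, Fs, Fw, x):
        """Interpolation weights w_{ij}, j in Ci, following formula (3.3).

        A  : dense NumPy array for the level operator
        Ci : strongly connected coarse neighbors of i
        Fs : strongly connected fine neighbors of i (collapsed to Ci)
        Fw : weakly connected neighbors of i (collapsed to the diagonal)
        x  : vector the interpolation should fit (x = 1 gives Ruge-Stueben)
        """
        diag = A[i, i] + sum(A[i, m] for m in Fw)          # collapsed diagonal
        weights = {}
        for j in Ci:
            num = A[i, j]
            for k in Fs:                                   # collapse strong F-connections
                denom_k = sum(A[k, m] * x[m] for m in Ci)
                num += A[i, k] * (A[k, j] * x[k]) / denom_k
            weights[j] = -num / diag
        return weights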

Figure 3.5: Interpolation error across multigrid levels and FAMG convergence when choosing interpolation to fit the sine hump.

In practice, however, the near-kernel of any given operator is generally not known a priori. A fully adaptive method like the one described in [12] could be used to find the near-kernel, but such methods have expensive setup costs. A more efficient approach here is to try to fit the local kernel at each node, that is, to choose a different vector, x, at each point to satisfy (A^h x)_i = 0. Throughout the rest of this section, boundary nodes refer to nodes that are directly adjacent to the boundary of the domain (the actual nodes on the boundary of the domain are assumed to be eliminated from the system because of the Dirichlet conditions there), and the remaining points are referred to as interior nodes. For the model problem, the operator A^h has row sum zero for all interior nodes, so letting x = 1 satisfies (A^h x)_i = 0 for all interior nodes, i. For the boundary nodes, however, the operator does not have row sum zero, and the constant is no longer in the local kernel. A better approach is to fit a constant vector that has been smoothed. Thus, interpolation is constructed on each level using (3.3) as the formula for the weights and choosing the vector, x, to use at each point, i, according to

\[ x_i = \begin{cases} 1, & i \in I, \\ \left( (M^{-1})^{\nu} \mathbf{1} \right)_i, & i \in B, \end{cases} \qquad (3.4) \]

where I is the set of all interior nodes, B is the set of boundary nodes, and (M^{-1})^ν represents ν applications of some smoother, M^{-1}.

This has the effect of modifying interpolation only near the boundary, which is appropriate: recall from Section 3.2.1 that the boundary is, in fact, the only place where the interpolation weights differ from geometric multigrid for the model problem. As shown in Figure 3.6, this method also restores O(h²) convergence of the interpolation error and convergence to discretization accuracy in a single cycle for FAMG. As the problem size increases, however, the smoothed constant vector used at the boundary requires more smoothing iterations in order for the method to perform well (that is, ν grows with problem size). When n = 64 (where the fine grid consists of n × n finite elements), two Jacobi iterations are sufficient, whereas for n = 512, eight Jacobi iterations are required to obtain good FAMG convergence. Figure 3.6 shows the effect on the interpolation error for different values of ν.

Figure 3.6: Interpolation error across multigrid levels and FAMG convergence when choosing interpolation to fit x as defined in (3.4). Convergence is shown when using ν = 2 smoothings and when using ν = 8 smoothings. Interpolation error is shown for ν = 2, 4, 8 smoothings.

Taking the ideas presented above one step further, a much better choice for x may be obtained by performing a local smoothing only on the boundary points (where, again, a point is determined to be on the boundary if its row sum is nonzero). That is, rather than applying a global smoother, M^{-1}, to a constant vector, relax only at points whose row sum is nonzero, leaving the other points unchanged. Denote this boundary-only smoother by M_B^{-1}. For the model problem, x = (M_B^{-1})^ν 1 converges in only ν = 2 boundary Gauss-Seidel relaxations (independent of n) to a vector, x, that closely approximates linear interpolation to the Dirichlet boundaries of the model problem. Constructing interpolation using formula (3.3) and replacing (3.4) with

\[ x_i = \begin{cases} 1, & i \in I, \\ \left( (M_B^{-1})^{\nu} \mathbf{1} \right)_i, & i \in B, \end{cases} \qquad (3.5) \]

yields a much more scalable method for obtaining good interpolation error. For the model problem, only ν = 2 applications of M_B^{-1} were sufficient to obtain the O(h²) interpolation error and good FAMG convergence shown in Figure 3.7.

Figure 3.7: Interpolation error across multigrid levels and FAMG convergence when choosing interpolation to fit x as defined in (3.5) using ν = 2 boundary smoothings.
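As a concrete illustration of how the fit vector in (3.5) might be built (a minimal sketch, assuming a dense NumPy operator and using the row-sum test described above; the function name is illustrative and not taken from the thesis implementation):

    import numpy as np

    def boundary_smoothed_vector(A, nu=2, tol=1e-12):
        """Build x per (3.5): x = 1 everywhere, then boundary-only Gauss-Seidel
        sweeps applied to the equations A x = 0 at points with nonzero row sum."""
        A = np.asarray(A, dtype=float)
        x = np.ones(A.shape[0])
        boundary = np.where(np.abs(A.sum(axis=1)) > tol)[0]   # points with nonzero row sum
        for _ in range(nu):                                   # nu boundary-only sweeps
            for i in boundary:                                # relax row i of A x = 0
                off_diag = A[i, :] @ x - A[i, i] * x[i]
                x[i] = -off_diag / A[i, i]
        return x

For the model problem, two such sweeps already produce a vector that decays approximately linearly toward the Dirichlet boundary, which is the behavior the modified interpolation is trying to capture.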

The boundary smoothing technique can be used to successfully generate AMG interpolation that achieves O(h²) interpolation error for a variety of problems. Figure 3.8 shows the interpolation error scaling across multigrid levels and the FAMG performance when using boundary smoothing for a jump-coefficient Poisson problem with constant right-hand side, f = 1:

\[ -\nabla \cdot (q \nabla u) = f \quad \text{in } \Omega = [-1,1] \times [-1,1], \qquad u = 0 \quad \text{on } \partial\Omega, \]

\[ q(x, y) = \begin{cases} 1, & (x, y) \in [-1,0] \times [-1,0] \,\cup\, [0,1] \times [0,1], \\ 1000, & (x, y) \in (0,1] \times [-1,0) \,\cup\, [-1,0) \times (0,1]. \end{cases} \]

Figure 3.8: Interpolation error across multigrid levels and FAMG convergence when choosing interpolation to fit x as defined in (3.5) using ν = 2 boundary smoothings for the jump-coefficient problem.

An even more compelling example where AMG interpolation with boundary smoothing works well is the linear elasticity problem with constant right-hand side, f = 200:

\[ -\nabla \cdot \sigma(u) = f, \qquad \sigma(u) = (\nabla \cdot u) I + (\nabla u + \nabla u^T). \]

Results for this problem are shown in Figure 3.9. These results indicate that the boundary smoothing technique may be successfully applied to systems of equations by combining it with the unknown-based approach for constructing AMG interpolation for systems, in which interpolation is constructed separately for each variable in a block fashion [28]. It should be noted that for both the jump-coefficient and elasticity problems, standard AMG interpolation does not scale appropriately, resulting in poor FAMG convergence.

Figure 3.9: Interpolation error across multigrid levels and FAMG convergence when choosing interpolation to fit x as defined in (3.5) using ν = 2 boundary smoothings for the linear elasticity problem.

Thus, the boundary smoothing technique provides a meaningful fix for both of these problems. Another important consideration, however, is that for both of these problems, regular coarse grids and homogeneous Dirichlet boundary conditions were used. Modifying the boundary conditions or using irregular coarse grids (as are often generated by AMG in general) results in worse performance for FAMG using interpolation constructed by boundary smoothing.

To provide further theoretical motivation, the boundary smoothing process can also be thought of as solving a one-dimensional subproblem around the boundary of the domain. Let A_I, A_{IB}, A_{BI}, and A_B be the matrices describing the interior-interior, interior-boundary, boundary-interior, and boundary-boundary connections in the matrix A^h, respectively. Then one can rewrite A^h x = 0 in block form as

\[ \begin{bmatrix} A_I & A_{IB} \\ A_{BI} & A_B \end{bmatrix} \begin{bmatrix} x_I \\ x_B \end{bmatrix} = 0, \qquad (3.6) \]

where x_I are the interior degrees of freedom and x_B are the boundary degrees of freedom of the vector x. Note that this matrix equation has only the trivial solution when A^h is nonsingular. Setting x = 1, however, satisfies the first block of equations,

\[ \begin{bmatrix} A_I & A_{IB} \end{bmatrix} \begin{bmatrix} x_I \\ x_B \end{bmatrix} = 0, \]

because 1 is in the local kernel for all interior nodes. Now, the bottom block of equations may similarly be used to determine an appropriate x for the boundary by fixing x_I = 1 and solving for x_B in

\[ A_B x_B = -A_{BI} \mathbf{1}. \qquad (3.7) \]

The boundary smoothing technique, x = (M_B^{-1})^ν 1, converges to the solution of (3.7).
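A small sketch of this block point of view (again in Python/NumPy, with the interior/boundary split taken from the row sums as above; not part of the thesis code) partitions the operator and solves (3.7) directly, which is the vector the boundary-only smoother converges to:

    import numpy as np

    def boundary_subproblem_solution(A, tol=1e-12):
        """Solve A_B x_B = -A_BI * 1 directly, per equation (3.7)."""
        A = np.asarray(A, dtype=float)
        boundary = np.abs(A.sum(axis=1)) > tol           # rows with nonzero row sum
        interior = ~boundary
        A_B = A[np.ix_(boundary, boundary)]
        A_BI = A[np.ix_(boundary, interior)]
        x = np.ones(A.shape[0])                          # x_I = 1 on the interior
        x[boundary] = np.linalg.solve(A_B, -A_BI @ np.ones(interior.sum()))
        return x

Comparing this vector with the output of a few boundary-only Gauss-Seidel sweeps is a quick way to check how many sweeps, ν, are needed in practice.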

Writing things in this way exposes some of the guiding principles for constructing the modified interpolation described in this section. Finding a global near-kernel vector, x, and constructing interpolation to fit this vector is effective but also expensive. For some problems, it can be much cheaper to find multiple vectors, x, that fit local kernels over certain subsets of the domain (such as the interior or the boundary), and constructing interpolation based on these vectors also yields good results.

3.3 Numerical results for AMG-DD

This section uses the serial AMG-DD code (simulating parallel computation) developed in [6] to examine AMG-DD performance in a discretization-accuracy context. The test problem used throughout this section is the same Poisson problem used in the previous section's study of FAMG, with the same manufactured solution and right-hand side. The number of degrees of freedom per processor is kept at approximately 2,500 (i.e., all results here are presented in a weak scaling context). Unless otherwise noted, the padding used to construct the composite grids is η = 2, and the underlying multigrid hierarchy uses geometric coarsening and interpolation. Also, for all the tests below, the composite-grid problems are solved very accurately, as undersolving the composite problems results in significant degradation of overall AMG-DD convergence. One of the main advantages of

the AMG-DD algorithm is that cycling on the composite problems requires no communication, so the extra computational effort spent here is relatively cheap and efficient.

The first question of interest is how the underlying AMG hierarchy affects convergence of AMG-DD. The previous section examined different methods for constructing AMG coarsening and interpolation with a focus on achieving good discretization-accuracy convergence of FAMG. Figure 3.10 shows AMG-DD convergence with different underlying AMG hierarchies built using either geometric or algebraic coarsening (where the algebraic coarsening routine produces coarse grids that differ from geometric coarsening) and geometric, algebraic (i.e., standard Ruge-Stüben), or modified interpolation, where the modified interpolation is built using the boundary smoothing technique proposed in the last section. Again, since the error is measured against an analytic solution, it plateaus at the level of discretization accuracy. AMG-DD convergence is nearly identical when using geometric coarsening and either geometric or modified interpolation, since these interpolation strategies produce nearly identical interpolation operators in this case. Interestingly, standard algebraic interpolation yields very similar performance, achieving discretization accuracy in the same number of AMG-DD cycles as the other interpolation strategies. Thus, the subtle differences between standard AMG and geometric or modified interpolation, which had such a large impact on FAMG convergence as observed in the previous section, seem to have only minor significance in the context of the AMG-DD algorithm. The coarsening strategy, on the other hand, seems to play a more significant role in determining AMG-DD performance than the choice of interpolation: Figure 3.10 shows that AMG-DD requires an additional iteration to converge to the level of discretization accuracy when using algebraic rather than geometric coarsening.

After the underlying AMG hierarchy, the other major choice made in setting up the AMG-DD algorithm is the size of the padding used to create the composite grids. Figure 3.11 shows AMG-DD convergence using 256 simulated processors and varying the padding, η = 2, 4, 8. Much like the choice of coarse grids, the padding has a significant effect on AMG-DD convergence, reducing the number of AMG-DD iterations required to obtain discretization accuracy by one each time the padding is doubled. This, together with the results shown in Figure 3.10, seems to indicate

that the size and shape of the composite grids are the most important factors in determining the performance of AMG-DD as a discretization-accuracy solver. The larger and more regular composite grids produced by larger paddings and geometric coarsening give the best results.

Figure 3.10: AMG-DD convergence for 256 simulated processors comparing different underlying AMG coarsening and interpolation strategies.

Figure 3.11: AMG-DD convergence for 256 simulated processors comparing different paddings.

With the above results in mind, Figure 3.12 shows that AMG-DD is, in fact, capable of obtaining discretization accuracy in a fixed number of cycles if the composite grids are constructed with sufficient padding. The left plot in Figure 3.12 shows AMG-DD converging in four iterations for all processor counts when the padding is scaled with the number of processors: for P = 16, 64, 256, the paddings are η = 2, 4, 8, respectively, so η = O(√P).

Figure 3.12: AMG-DD convergence when scaling the padding with problem size (left) vs. when keeping a fixed padding (right).

If the padding is kept fixed as problem size grows, however, this scalable convergence is lost, as shown in the right plot of Figure 3.12. This is a somewhat mixed result. On one hand, it is encouraging that AMG-DD is capable of converging in a fixed number of cycles independent of the number of processors, making it an O(log P), algebraic, discretization-accuracy solver. On the other hand, it is impractical to continually increase the padding as problem size grows if the aim is to scale AMG-DD up to very large problem sizes, since the padding controls both the size of the composite grids (and thus the memory required on each processor to store them) and the bandwidth cost of communicating residuals between AMG-DD iterations. Thus, the regime where η = O(√P) is not viable in practice. While the fixed-padding result indicates that AMG-DD requires O((log P)²) communication steps, it still requires less communication than solving the problem to the same accuracy using V-cycles (though the number of communications differs only by a constant factor rather than a factor of log P). Although this is not the desired result of superior scaling, it may still be motivation enough to use AMG-DD when the goal is to achieve discretization accuracy with as little communication as possible in an algebraic framework.

3.4 Parallel implementation

The results in the previous section were all generated using a serial code that simulates parallel computation. As such, problem size is limited by the memory and time constraints of running on a single core. In order to further develop the AMG-DD algorithm in a more realistic environment, a fully parallel implementation is needed. This thesis presents in detail the first such implementation. Section 3.4.1 discusses the residual communication algorithm, which is also used to set up the algorithm by generating the composite grids from a given parallel AMG hierarchy. Details on the implementation developed for this thesis, as well as some preliminary performance results, are given in Section 3.4.2. Section 3.4.3 then discusses how the composite problems are solved in the parallel implementation. Rather than generate matrices for the composite problems, a fast adaptive composite grid (FAC) [22] approach is taken in order to avoid unnecessary construction of matrices and new AMG hierarchies on each processor.

3.4.1 The residual communication algorithm

The most important piece of the AMG-DD algorithm is efficient setup of the composite grids and communication of the residual. This must be done in an intelligent way in order for the algorithm to achieve O(log P) communication cost for these steps. The algorithm described below is used both in the setup phase of the algorithm and again on each iteration for the residual communication step. For ease of understanding, the description of composite grid construction given in Section 3.1 starts from the home domain on the finest grid, extends out a distance, η, coarsens, and repeats down to the coarsest grid. The parallel residual communication algorithm, however, communicates information starting on the coarsest level and works up to the fine grid. On each level, each processor that owns nodes on that level communicates with its nearest neighbors (where neighbors are determined through algebraic distance). As the algorithm proceeds up through the grid levels, information is accumulated and passed on to the appropriate processors such that each processor

gets the information for its entire composite grid by the time the finest level is reached. This yields a total number of communications that is O(L), where L is the number of levels. Assuming a constant coarsening factor, L = O(log N), where N is the number of fine-grid degrees of freedom, and N = O(P), since a weak scaling context is assumed. Thus, the algorithm achieves the desired O(log P) cost per residual communication. The data to be communicated on each step of the algorithm is determined by creating small, local composite grids (denoted Ψ^c in Algorithm 4) that start with the degrees of freedom within an algebraic distance η of the home domain of the processor that will receive the data. To write pseudocode for the residual communication algorithm, first define D_0^p as the nodes on the finest grid (i.e., level 0) that comprise processor p's home domain, and denote the entire grid on level k as Ω_k. Then Algorithm 4 provides pseudocode for the communication algorithm from the perspective of processor p.

Algorithm 4 Parallel residual communication algorithm for AMG-DD.
    for k = L, ..., 0 (loop over levels, starting on the coarsest) do
        if D_0^p ∩ Ω_k ≠ ∅ (processor owns nodes on this level) then
            Identify the processors {p_1, ..., p_m} that own nodes within distance η_k of D_0^p ∩ Ω_k.
            for j = 1, ..., m (loop over neighboring processors) do
                Let Ψ = {x : dist(x, D_0^{p_j} ∩ Ω_k) ≤ η_k}.
                Form a sub-composite grid, Ψ^c, starting from Ψ ∩ Ω_{k+1}.
                Send the residual at all nodes in Ψ and Ψ^c to p_j.
                Receive information on nodes sent from p_j and add it to the composite grid.
            end for
        end if
    end for

It can be shown that when the algorithm completes, each processor has the correct information at each of the points in its composite grid [6]. Figure 3.13 illustrates this from the perspective of a single processor, showing the information it receives from its neighbors.

Figure 3.13: Visualization of the residual communication algorithm from the perspective of the receiving processor (home domain in the red box). Examples of the sets Ψ and Ψ^c are shown in blue boxes. The nodes are numbered by the communication in which they are received, with boxed numbers denoting the neighboring nodes that were the root of the communicated Ψ^c grid.
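The level loop in Algorithm 4 can be summarized with the following structural sketch (Python-style; every behavior is injected through callables that are hypothetical stand-ins for the corresponding hypre routines, so this is an outline of the control flow rather than a working implementation of the communication itself):

    def residual_communication(num_levels, owns_level, neighbors_within, make_psi,
                               make_sub_composite, send, receive, add_to_composite):
        """Structure of Algorithm 4 from the perspective of one processor.

        owns_level(k)             -> True if this processor owns nodes on level k
        neighbors_within(k)       -> ranks owning nodes within distance eta_k
        make_psi(k, p)            -> nodes within eta_k of p's home domain on level k
        make_sub_composite(k, s)  -> the small composite grid Psi^c hanging below s
        send(p, data), receive(p) -> point-to-point exchange with processor p
        add_to_composite(data)    -> accumulate received composite-grid information
        """
        for k in range(num_levels - 1, -1, -1):          # coarsest level first
            if not owns_level(k):
                continue
            for p in neighbors_within(k):
                psi = make_psi(k, p)
                psi_c = make_sub_composite(k, psi)
                send(p, {"psi": psi, "psi_c": psi_c})    # residual data for p
                add_to_composite(receive(p))             # data for this processor
        # After the finest level, the local composite grid is complete.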

Algorithm 4 is used both for residual communication on each AMG-DD cycle and for generating the composite grids when setting up the algorithm. Note that no specific information about the composite grids is required for the algorithm to run except for the paddings used on each level. Each processor does not know a priori which degrees of freedom will comprise its composite grid, nor which other processors own the information for those degrees of freedom. As such, Algorithm 4 actually communicates some redundant information; that is, information at a given point in a processor's composite grid may be received from multiple processors. In practice, however, one can avoid this redundant communication after the initial setup phase by storing and communicating some additional information that makes subsequent residual communications simpler and cheaper.

3.4.2 Implementation details

This subsection describes some of the details of the parallel implementation of AMG-DD developed for this thesis. This is the first parallel implementation of the algorithm and, as such, there is much room for improvements and optimizations, which are pointed out along the way. All of the implementation described here is done in hypre, a high-performance solver library developed at Lawrence Livermore National Laboratory, and is specifically built on top of BoomerAMG [30].

Implementation of the setup phase consists of three major components. The first is generating and communicating the Ψ^c grids as described in Algorithm 4. The second involves checking for and communicating information redundancies so that these may be avoided in future residual communications. Finally, the third major component of the setup involves each processor obtaining some additional information on what are referred to here as ghost nodes. These are degrees of freedom

that do not lie in a processor's composite grid, but some information about them is required in order to do fast adaptive composite cycling on the composite problems (discussed in greater detail in Section 3.4.3).

When generating the sets Ψ^c, it is necessary to find the set of neighboring processors within distance η through the stencil of A_k on each level, k. The current implementation of AMG-DD leverages code already available in hypre that implements an assumed partition algorithm, which efficiently finds these neighboring processors when η = 1 [5]. In order to find neighbors of arbitrary degree, η, the current code simply raises the matrix operator, A_k, on each level k to the power η and then uses the available hypre routines to find neighbors of distance 1 for the matrix A_k^η. While this provides the correct list of neighboring processors, actually calculating the matrix A_k^η is quite costly, so there is room for improvement here. Starting with the set of nodes within distance η of a neighboring processor, generating the set Ψ^c to be sent to that processor may be done by coarsening the set Ψ, traversing distance η through the stencil of A_k, coarsening again, and repeating until the coarsest level is reached.

To facilitate the construction and accumulation of information for the small composite grids, Ψ^c, a new composite-grid data structure was implemented. For each node in the composite grid, as well as for the ghost nodes, this structure includes a solution, a right-hand side, a global index, coarse-grid global and local indices (indicating where the node lives on the next coarse grid), and the corresponding rows of the matrix operators. This composite-grid data structure is used both for the Ψ^c composite grids communicated between processors and for the home composite grid that represents the processor's subproblem. Since the size of Ψ^c is not known a priori, the communication of Ψ^c actually requires two steps: a communication of the size of the buffer containing the composite grid, followed by the buffer itself. For simplicity in the current implementation, every value communicated is cast to the same data type (HYPRE_Complex) when the buffer is packed. Thus, the communication routines may be optimized by utilizing different data types with smaller memory requirements and potentially by estimating or bounding the size of Ψ^c in some way to eliminate the need for communicating the buffer size.
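A rough illustration of the per-node contents of this composite-grid data structure is given below (a Python-flavored sketch; the actual hypre implementation uses C data structures, and all field and function names here are illustrative assumptions, including the flattening into a single numeric type that mirrors the HYPRE_Complex simplification noted above):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class CompositeGridNode:
        """Data held for one composite-grid or ghost node (sketch)."""
        global_index: int               # index of the node in the global grid
        solution: float = 0.0           # current solution value
        rhs: float = 0.0                # right-hand side / residual value
        coarse_global_index: int = -1   # where the node lives on the next coarse grid
        coarse_local_index: int = -1    # local index of that coarse node
        row_cols: List[int] = field(default_factory=list)    # column indices of the A row
        row_vals: List[float] = field(default_factory=list)  # entries of the A row

    def pack_node(node: CompositeGridNode) -> List[float]:
        """Flatten one node into a buffer of a single numeric type."""
        return ([float(node.global_index), node.solution, node.rhs,
                 float(node.coarse_global_index), float(node.coarse_local_index),
                 float(len(node.row_cols))]
                + [float(c) for c in node.row_cols]
                + list(node.row_vals))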

The routine that unpacks the received Ψ^c buffers checks whether each incoming node has already been accounted for (i.e., has already been received from another processor). This check is currently done by naively searching over the nodes already in the composite grid. This is quite inefficient in the current implementation, though an experimental fix involving the construction of a map using a hashing function is under development. There may be other ways to optimize this portion of the algorithm as well. If redundant information is found, it is flagged as such, and an array of these flags is communicated back to the processors it was received from. Thus, each processor knows which of the data it sent is redundant and will not send data for those nodes on future iterations. The cost saved by checking for redundant information is not always clear, as there is a trade-off between doing this extra communication during the setup phase and incurring the extra bandwidth cost of communicating redundant information in future residual communications. Further study of these costs is required in order to fully optimize this part of the algorithm.

Finally, after the above communications have been performed on each level, each processor has matrix data and residual values at all of its composite-grid points, as well as an array flagging any redundant information that should not be sent on future iterations. In order to perform an FAC cycle, however, each processor requires some additional information at ghost nodes around the edge of its composite grid on each level. The number of ghost nodes needed, and from which processor the information should be requested, is not known a priori. Also, since the ghost nodes are neighbors of composite-grid points that are generally not owned nodes under the original decomposition of the problem among processors, the previously utilized assumed partition code in hypre for finding processor neighbors cannot be used to find the ghost nodes. Thus, new routines were implemented using some of the same principles and tools as the assumed partition algorithm. The problem of finding the ghost nodes presents a scenario in which processors will receive requests for information, but they do not know how many requests they will get nor the size of each request. Neither do the processors requesting ghost-node information know how large the

response buffers will be. The hypre_DataExchangeList() function in hypre is designed exactly for such difficult communication scenarios [5]. This function uses a spanning tree of the processors, continuous probing for messages while filling responses, and a termination message after all responses have been filled to achieve the above communication in O(log P) cost.

For the current parallel implementation, the overall cost of the setup phase is somewhat large. Figure 3.14 shows the wallclock times for the AMG-DD composite-grid setup phase vs. the setup of the initial AMG hierarchy, scaling the number of processors from 4 up to 4,096 and keeping the number of degrees of freedom per processor (under the initial problem decomposition) fixed at n² for n = 100 and n = 1,000. The AMG-DD setup time is much larger than the initial AMG setup time (two orders of magnitude for the n = 1,000 case), but it does appear to scale well. To see where most of the effort is being spent, different sections of the setup phase were wrapped with timers and MPI barriers. Figure 3.15 shows a rough breakdown of how much time was spent in different sections of the setup phase. For the n = 100 case, the amount of time spent in each part of the setup phase is relatively balanced, though the amount of time spent obtaining ghost-node information appears to be growing. The n = 1,000 case, however, really shows a weakness in the current implementation: all of the time is spent unpacking and adding nodes to the composite grids. This is likely due to the fact that, when checking for redundant nodes, a linear search is performed over nodes not owned by the given processor. This slow piece of the algorithm is repeated many times and is much worse for large n, which is why the effect is so obvious in the n = 1,000 case. As previously mentioned, a few different optimizations are being explored that should significantly reduce the time spent checking for redundant information, including the use of a hash table to check incoming nodes against what is already in the composite grid. This, in addition to several other optimizations mentioned throughout this section, should significantly decrease the time it takes to set up AMG-DD.

A previous performance model from [6] estimated the cost of the residual communication as

\[ T = 16\alpha(L + 1) + \beta\left( 2\Gamma_0 n\eta + 12L\eta^2 \right) + 55\gamma n^2, \]

where α, β, and γ are the standard constants associated with the latency, bandwidth, and computational costs, respectively, on a given machine, L is the number of levels in the AMG hierarchy, Γ_0 is the size of a processor boundary on the fine grid, n is the number of degrees of freedom in one direction per processor (i.e., a processor owns n² nodes in the two-dimensional case), and η is the padding used to create the composite grids. This model assumes regular grids in two dimensions with a coarsening factor of four.

Figure 3.14: Comparison of AMG-DD composite-grid setup time vs. the setup time of the initial AMG hierarchy for n = 100, 1,000 (where n² is the number of nodes per processor).

Figure 3.16 compares the above model with timing results obtained on Cab, a high-performance machine at Lawrence Livermore National Laboratory. A ping-pong message test was used to estimate the latency and bandwidth constants, α (in seconds) and β (in seconds per MB), and the computation constant, γ, was roughly estimated as well. While the O(log P) scaling is evident in both the model and the measured timings, there is a discrepancy in slope between the model prediction and what is measured in practice. Several factors may account for this. First, the composite grids generated in practice do not have exactly the size and shape assumed by the model. Also, the current implementation performs extra communications to get ghost-node data, which is not accounted for in the model. Last, the model does not account for the cost of constructing the Ψ^c composite grids before packing and sending the data, which is a nontrivial part of the code that also requires optimization.
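The model is easy to evaluate for a weak scaling study; the sketch below (Python, not the thesis code) does so under two explicitly assumed quantities: the processor-boundary size is taken as Γ_0 = 4n for a square n × n block, and the number of levels as L roughly equal to log base 4 of the global problem size, matching the coarsening factor of four. The constants passed in the example call are placeholders, not the measured Cab values.

    import math

    def amgdd_comm_model(P, n, eta, alpha, beta, gamma, coarsening_factor=4):
        """Evaluate T = 16*alpha*(L+1) + beta*(2*Gamma0*n*eta + 12*L*eta^2) + 55*gamma*n^2."""
        N = P * n * n                                                  # global problem size
        L = max(1, round(math.log(N) / math.log(coarsening_factor)))  # assumed level count
        Gamma0 = 4 * n                                                 # assumed boundary size
        return (alpha * 16 * (L + 1)
                + beta * (2 * Gamma0 * n * eta + 12 * L * eta**2)
                + gamma * 55 * n**2)

    # Example with placeholder machine constants, weak scaling at n = 100, eta = 1:
    for P in (4, 64, 1024, 4096):
        print(P, amgdd_comm_model(P, n=100, eta=1, alpha=1e-6, beta=1e-9, gamma=1e-10))

Because P enters only through L, the predicted growth is logarithmic in P, which is the O(log P) trend visible in Figure 3.16.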

Figure 3.15: Breakdown of where time is spent in the composite-grid setup phase.

Figure 3.16: Measured residual communication time vs. model predictions for η = 1 and n = 100, 1,000.

3.4.3 Composite solves

As previously alluded to, the composite problems in the parallel implementation of AMG-DD are solved by fast adaptive composite grid (FAC) cycles [22]. In the earlier serial version of the AMG-DD code from [6], the composite problems were solved by constructing the composite prolongation operator, P_k, on each processor, k, forming the matrix for the composite operator, A_k^c = (P_k)^T A P_k, constructing a local AMG hierarchy for this matrix, and finally performing V-cycles on this composite hierarchy. This involves a significant amount of overhead in constructing the operators and the associated hierarchy. After the parallel setup phase described in the previous section, however, all of the required pieces are in place to perform an FAC cycle with no additional construction of operators or multilevel hierarchies.

An FAC cycle on a composite grid that is a subset of some global hierarchy of grids may be thought of as equivalent to V-cycles on the full grid where relaxation is suppressed over points not in the composite grid. Over the interior nodes belonging to each level of the composite grid, relaxation, residual calculation, interpolation, and restriction are all straightforward and proceed as usual. The difficulties associated with cycling on the composite grid occur near its edges. Here, for any given node in the composite grid, the connected nodes in the stencils of A_k or P_k may not all be present in the composite grid. This is where the previously mentioned ghost nodes come into play. Ghost nodes are additional points from the global grids that are required in order to perform relaxation, residual calculation, interpolation, and restriction on the composite grid. Figure 3.17 illustrates an FAC cycle on a one-dimensional problem where the stencils all simply include nearest neighbors. Points in blue are real nodes belonging to the composite grid (which is constructed with η = 1 in this illustration), and points in white are the required ghost nodes. Note that, depending on how the fine and coarse grids in the composite grid line up on different levels, the number of required ghost nodes may change. Relaxation is performed only on the real nodes (ghost nodes are never relaxed), which requires all immediate neighbors of the real nodes to be either included in the composite grid or added as ghost nodes.

Figure 3.17: Diagram of an FAC cycle in 1D. Real nodes in the composite grid are blue, and ghost nodes are white. The orange box represents a processor's home domain.

Restriction is done first from all real nodes. Then any point on the coarse grid that was restricted to from a fine-grid real node must receive a residual contribution from every point in its restriction stencil. This, in turn, demands that any point in the restriction stencil be able to calculate a residual, which generally requires one additional layer of ghost nodes. As shown in Figure 3.17, if restriction stencils only draw from points that are within distance one through the stencil of A_k, then residual calculation is necessary at most distance two away from the real nodes, resulting in the need for three layers of ghost nodes on each level. In general, if restriction stencils reach a distance, ζ, in the operator, A_k, then at most 2ζ + 1 layers of ghost nodes are necessary.

Implementation of the setup and residual communication algorithms, as well as the FAC cycle, is complete, resulting in a functional, parallel implementation of the AMG-DD algorithm. Preliminary results show convergence for smaller problems, but larger problem sizes cause the algorithm to diverge. It is currently unclear whether this divergence is due to a bug in the residual communication or FAC cycle code, to the methodology of the FAC implementation, or to some mathematical property of the algorithm that causes divergence in certain cases.
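To tie the pieces of this subsection together, the view of FAC as a V-cycle with relaxation suppressed outside the composite grid can be sketched as follows (Python/NumPy, operating on the full-grid operators for clarity; real_mask_levels marks the real composite-grid nodes on each level, and everything else is treated as a ghost node that is never relaxed; this is a conceptual sketch, not the hypre FAC implementation):

    import numpy as np

    def fac_vcycle(A_levels, P_levels, f, u, real_mask_levels, nu=1):
        """One FAC-style V-cycle: Gauss-Seidel only on real nodes, Galerkin
        restriction (P^T) and interpolation as usual."""
        def relax(A, rhs, v, mask, sweeps):
            for _ in range(sweeps):
                for i in np.where(mask)[0]:              # relax real nodes only
                    v[i] += (rhs[i] - A[i, :] @ v) / A[i, i]
            return v

        def cycle(l, f_l, u_l):
            A = A_levels[l]
            u_l = relax(A, f_l, u_l, real_mask_levels[l], nu)
            if l + 1 < len(A_levels):
                P = P_levels[l]                          # interpolation from level l+1
                r_c = P.T @ (f_l - A @ u_l)              # restrict the residual
                e_c = cycle(l + 1, r_c, np.zeros(P.shape[1]))
                u_l = u_l + P @ e_c                      # coarse-grid correction
                u_l = relax(A, f_l, u_l, real_mask_levels[l], nu)
            return u_l

        return cycle(0, f, u.copy())

In the actual implementation, of course, only the composite-grid rows and the ghost-node data described above are stored, so the loops run over far fewer points than this full-grid sketch suggests.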


More information

Multigrid solvers for equations arising in implicit MHD simulations

Multigrid solvers for equations arising in implicit MHD simulations Multigrid solvers for equations arising in implicit MHD simulations smoothing Finest Grid Mark F. Adams Department of Applied Physics & Applied Mathematics Columbia University Ravi Samtaney PPPL Achi Brandt

More information

Experimental designs for multiple responses with different models

Experimental designs for multiple responses with different models Graduate Theses and Dissertations Graduate College 2015 Experimental designs for multiple responses with different models Wilmina Mary Marget Iowa State University Follow this and additional works at:

More information

Chapter 5. Methods for Solving Elliptic Equations

Chapter 5. Methods for Solving Elliptic Equations Chapter 5. Methods for Solving Elliptic Equations References: Tannehill et al Section 4.3. Fulton et al (1986 MWR). Recommended reading: Chapter 7, Numerical Methods for Engineering Application. J. H.

More information

Solving PDEs with CUDA Jonathan Cohen

Solving PDEs with CUDA Jonathan Cohen Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear

More information

An Introduction to Algebraic Multigrid (AMG) Algorithms Derrick Cerwinsky and Craig C. Douglas 1/84

An Introduction to Algebraic Multigrid (AMG) Algorithms Derrick Cerwinsky and Craig C. Douglas 1/84 An Introduction to Algebraic Multigrid (AMG) Algorithms Derrick Cerwinsky and Craig C. Douglas 1/84 Introduction Almost all numerical methods for solving PDEs will at some point be reduced to solving A

More information

6. Multigrid & Krylov Methods. June 1, 2010

6. Multigrid & Krylov Methods. June 1, 2010 June 1, 2010 Scientific Computing II, Tobias Weinzierl page 1 of 27 Outline of This Session A recapitulation of iterative schemes Lots of advertisement Multigrid Ingredients Multigrid Analysis Scientific

More information

Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts 1 and Stephan Matthai 2 3rd Febr

Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts 1 and Stephan Matthai 2 3rd Febr HIGH RESOLUTION POTENTIAL FLOW METHODS IN OIL EXPLORATION Stephen Roberts and Stephan Matthai Mathematics Research Report No. MRR 003{96, Mathematics Research Report No. MRR 003{96, HIGH RESOLUTION POTENTIAL

More information

Linear Solvers. Andrew Hazel

Linear Solvers. Andrew Hazel Linear Solvers Andrew Hazel Introduction Thus far we have talked about the formulation and discretisation of physical problems...... and stopped when we got to a discrete linear system of equations. Introduction

More information

(bu) = f in Ω, (1.1) u = g on Γ I, (1.2)

(bu) = f in Ω, (1.1) u = g on Γ I, (1.2) A DUAL LEAST-SQUARES FINITE ELEMENT METHOD FOR LINEAR HYPERBOLIC PDES: A NUMERICAL STUDY LUKE OLSON Abstract. In this paper, we develop a least-squares finite element method for linear Partial Differential

More information

R T (u H )v + (2.1) J S (u H )v v V, T (2.2) (2.3) H S J S (u H ) 2 L 2 (S). S T

R T (u H )v + (2.1) J S (u H )v v V, T (2.2) (2.3) H S J S (u H ) 2 L 2 (S). S T 2 R.H. NOCHETTO 2. Lecture 2. Adaptivity I: Design and Convergence of AFEM tarting with a conforming mesh T H, the adaptive procedure AFEM consists of loops of the form OLVE ETIMATE MARK REFINE to produce

More information

Efficient smoothers for all-at-once multigrid methods for Poisson and Stokes control problems

Efficient smoothers for all-at-once multigrid methods for Poisson and Stokes control problems Efficient smoothers for all-at-once multigrid methods for Poisson and Stoes control problems Stefan Taacs stefan.taacs@numa.uni-linz.ac.at, WWW home page: http://www.numa.uni-linz.ac.at/~stefant/j3362/

More information

LECTURE # 0 BASIC NOTATIONS AND CONCEPTS IN THE THEORY OF PARTIAL DIFFERENTIAL EQUATIONS (PDES)

LECTURE # 0 BASIC NOTATIONS AND CONCEPTS IN THE THEORY OF PARTIAL DIFFERENTIAL EQUATIONS (PDES) LECTURE # 0 BASIC NOTATIONS AND CONCEPTS IN THE THEORY OF PARTIAL DIFFERENTIAL EQUATIONS (PDES) RAYTCHO LAZAROV 1 Notations and Basic Functional Spaces Scalar function in R d, d 1 will be denoted by u,

More information

Iterative Methods and Multigrid

Iterative Methods and Multigrid Iterative Methods and Multigrid Part 1: Introduction to Multigrid 1 12/02/09 MG02.prz Error Smoothing 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Initial Solution=-Error 0 10 20 30 40 50 60 70 80 90 100 DCT:

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Classical Iterations We have a problem, We assume that the matrix comes from a discretization of a PDE. The best and most popular model problem is, The matrix will be as large

More information

Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners

Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners Eugene Vecharynski 1 Andrew Knyazev 2 1 Department of Computer Science and Engineering University of Minnesota 2 Department

More information

Local Mesh Refinement with the PCD Method

Local Mesh Refinement with the PCD Method Advances in Dynamical Systems and Applications ISSN 0973-5321, Volume 8, Number 1, pp. 125 136 (2013) http://campus.mst.edu/adsa Local Mesh Refinement with the PCD Method Ahmed Tahiri Université Med Premier

More information

Discrete Projection Methods for Incompressible Fluid Flow Problems and Application to a Fluid-Structure Interaction

Discrete Projection Methods for Incompressible Fluid Flow Problems and Application to a Fluid-Structure Interaction Discrete Projection Methods for Incompressible Fluid Flow Problems and Application to a Fluid-Structure Interaction Problem Jörg-M. Sautter Mathematisches Institut, Universität Düsseldorf, Germany, sautter@am.uni-duesseldorf.de

More information

Optimal Interface Conditions for an Arbitrary Decomposition into Subdomains

Optimal Interface Conditions for an Arbitrary Decomposition into Subdomains Optimal Interface Conditions for an Arbitrary Decomposition into Subdomains Martin J. Gander and Felix Kwok Section de mathématiques, Université de Genève, Geneva CH-1211, Switzerland, Martin.Gander@unige.ch;

More information

IMPLEMENTATION OF A PARALLEL AMG SOLVER

IMPLEMENTATION OF A PARALLEL AMG SOLVER IMPLEMENTATION OF A PARALLEL AMG SOLVER Tony Saad May 2005 http://tsaad.utsi.edu - tsaad@utsi.edu PLAN INTRODUCTION 2 min. MULTIGRID METHODS.. 3 min. PARALLEL IMPLEMENTATION PARTITIONING. 1 min. RENUMBERING...

More information

On Multigrid for Phase Field

On Multigrid for Phase Field On Multigrid for Phase Field Carsten Gräser (FU Berlin), Ralf Kornhuber (FU Berlin), Rolf Krause (Uni Bonn), and Vanessa Styles (University of Sussex) Interphase 04 Rome, September, 13-16, 2004 Synopsis

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 11 Partial Differential Equations Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002.

More information

Convergence Behavior of a Two-Level Optimized Schwarz Preconditioner

Convergence Behavior of a Two-Level Optimized Schwarz Preconditioner Convergence Behavior of a Two-Level Optimized Schwarz Preconditioner Olivier Dubois 1 and Martin J. Gander 2 1 IMA, University of Minnesota, 207 Church St. SE, Minneapolis, MN 55455 dubois@ima.umn.edu

More information

Block-Structured Adaptive Mesh Refinement

Block-Structured Adaptive Mesh Refinement Block-Structured Adaptive Mesh Refinement Lecture 2 Incompressible Navier-Stokes Equations Fractional Step Scheme 1-D AMR for classical PDE s hyperbolic elliptic parabolic Accuracy considerations Bell

More information

6. Iterative Methods for Linear Systems. The stepwise approach to the solution...

6. Iterative Methods for Linear Systems. The stepwise approach to the solution... 6 Iterative Methods for Linear Systems The stepwise approach to the solution Miriam Mehl: 6 Iterative Methods for Linear Systems The stepwise approach to the solution, January 18, 2013 1 61 Large Sparse

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 19: Computing the SVD; Sparse Linear Systems Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical

More information

Domain decomposition on different levels of the Jacobi-Davidson method

Domain decomposition on different levels of the Jacobi-Davidson method hapter 5 Domain decomposition on different levels of the Jacobi-Davidson method Abstract Most computational work of Jacobi-Davidson [46], an iterative method suitable for computing solutions of large dimensional

More information

Matrix Assembly in FEA

Matrix Assembly in FEA Matrix Assembly in FEA 1 In Chapter 2, we spoke about how the global matrix equations are assembled in the finite element method. We now want to revisit that discussion and add some details. For example,

More information

SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA

SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 1 SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 2 OUTLINE Sparse matrix storage format Basic factorization

More information

AMG for a Peta-scale Navier Stokes Code

AMG for a Peta-scale Navier Stokes Code AMG for a Peta-scale Navier Stokes Code James Lottes Argonne National Laboratory October 18, 2007 The Challenge Develop an AMG iterative method to solve Poisson 2 u = f discretized on highly irregular

More information

Introduction to Multigrid Method

Introduction to Multigrid Method Introduction to Multigrid Metod Presented by: Bogojeska Jasmina /08/005 JASS, 005, St. Petersburg 1 Te ultimate upsot of MLAT Te amount of computational work sould be proportional to te amount of real

More information

ADAPTIVE ALGEBRAIC MULTIGRID

ADAPTIVE ALGEBRAIC MULTIGRID ADAPTIVE ALGEBRAIC MULTIGRID M. BREZINA, R. FALGOUT, S. MACLACHLAN, T. MANTEUFFEL, S. MCCORMICK, AND J. RUGE Abstract. Efficient numerical simulation of physical processes is constrained by our ability

More information

Contents. Preface... xi. Introduction...

Contents. Preface... xi. Introduction... Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism

More information

A New Multilevel Smoothing Method for Wavelet-Based Algebraic Multigrid Poisson Problem Solver

A New Multilevel Smoothing Method for Wavelet-Based Algebraic Multigrid Poisson Problem Solver Journal of Microwaves, Optoelectronics and Electromagnetic Applications, Vol. 10, No.2, December 2011 379 A New Multilevel Smoothing Method for Wavelet-Based Algebraic Multigrid Poisson Problem Solver

More information

Von Neumann Analysis of Jacobi and Gauss-Seidel Iterations

Von Neumann Analysis of Jacobi and Gauss-Seidel Iterations Von Neumann Analysis of Jacobi and Gauss-Seidel Iterations We consider the FDA to the 1D Poisson equation on a grid x covering [0,1] with uniform spacing h, h (u +1 u + u 1 ) = f whose exact solution (to

More information

An Adaptive Mixed Finite Element Method using the Lagrange Multiplier Technique

An Adaptive Mixed Finite Element Method using the Lagrange Multiplier Technique An Adaptive Mixed Finite Element Method using the Lagrange Multiplier Technique by Michael Gagnon A Project Report Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE In partial fulfillment

More information

Multigrid Methods for Elliptic Obstacle Problems on 2D Bisection Grids

Multigrid Methods for Elliptic Obstacle Problems on 2D Bisection Grids Multigrid Methods for Elliptic Obstacle Problems on 2D Bisection Grids Long Chen 1, Ricardo H. Nochetto 2, and Chen-Song Zhang 3 1 Department of Mathematics, University of California at Irvine. chenlong@math.uci.edu

More information

Chapter 7 Iterative Techniques in Matrix Algebra

Chapter 7 Iterative Techniques in Matrix Algebra Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition

More information

EFFICIENT MULTIGRID BASED SOLVERS FOR ISOGEOMETRIC ANALYSIS

EFFICIENT MULTIGRID BASED SOLVERS FOR ISOGEOMETRIC ANALYSIS 6th European Conference on Computational Mechanics (ECCM 6) 7th European Conference on Computational Fluid Dynamics (ECFD 7) 1115 June 2018, Glasgow, UK EFFICIENT MULTIGRID BASED SOLVERS FOR ISOGEOMETRIC

More information

Markov Chains and Web Ranking: a Multilevel Adaptive Aggregation Method

Markov Chains and Web Ranking: a Multilevel Adaptive Aggregation Method Markov Chains and Web Ranking: a Multilevel Adaptive Aggregation Method Hans De Sterck Department of Applied Mathematics, University of Waterloo Quoc Nguyen; Steve McCormick, John Ruge, Tom Manteuffel

More information

Multigrid Methods for Linear Systems with Stochastic Entries Arising in Lattice QCD. Andreas Frommer

Multigrid Methods for Linear Systems with Stochastic Entries Arising in Lattice QCD. Andreas Frommer Methods for Linear Systems with Stochastic Entries Arising in Lattice QCD Andreas Frommer Collaborators The Dirac operator James Brannick, Penn State University Björn Leder, Humboldt Universität Berlin

More information

hypre MG for LQFT Chris Schroeder LLNL - Physics Division

hypre MG for LQFT Chris Schroeder LLNL - Physics Division hypre MG for LQFT Chris Schroeder LLNL - Physics Division This work performed under the auspices of the U.S. Department of Energy by under Contract DE-??? Contributors hypre Team! Rob Falgout (project

More information

Solving Ax = b, an overview. Program

Solving Ax = b, an overview. Program Numerical Linear Algebra Improving iterative solvers: preconditioning, deflation, numerical software and parallelisation Gerard Sleijpen and Martin van Gijzen November 29, 27 Solving Ax = b, an overview

More information

Overlapping Schwarz preconditioners for Fekete spectral elements

Overlapping Schwarz preconditioners for Fekete spectral elements Overlapping Schwarz preconditioners for Fekete spectral elements R. Pasquetti 1, L. F. Pavarino 2, F. Rapetti 1, and E. Zampieri 2 1 Laboratoire J.-A. Dieudonné, CNRS & Université de Nice et Sophia-Antipolis,

More information

Multilevel Preconditioning of Graph-Laplacians: Polynomial Approximation of the Pivot Blocks Inverses

Multilevel Preconditioning of Graph-Laplacians: Polynomial Approximation of the Pivot Blocks Inverses Multilevel Preconditioning of Graph-Laplacians: Polynomial Approximation of the Pivot Blocks Inverses P. Boyanova 1, I. Georgiev 34, S. Margenov, L. Zikatanov 5 1 Uppsala University, Box 337, 751 05 Uppsala,

More information