
The Pennsylvania State University
The Graduate School

THE AUXILIARY SPACE SOLVERS AND THEIR APPLICATIONS

A Dissertation in Mathematics
by Lu Wang

© 2014 Lu Wang

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

December 2014

The dissertation of Lu Wang was reviewed and approved by the following:

Jinchao Xu, Professor of Mathematics, Dissertation Advisor, Chair of Committee
James Brannick, Associate Professor of Mathematics
Ludmil Zikatanov, Professor of Mathematics
Chao-Yang Wang, Professor of Materials Science and Engineering
Yuxi Zheng, Professor of Mathematics, Department Head

Signatures are on file in the Graduate School.

Abstract

Developing efficient iterative methods and parallel algorithms for solving sparse linear systems discretized from partial differential equations (PDEs) remains a challenging task in scientific computing and practical applications. Although many mathematically optimal solvers, such as multigrid methods, have been analyzed and developed, the unfortunate reality is that these solvers have not been used much in practical applications. In order to narrow the gap between theory and practice, we develop, formulate, and analyze mathematically optimal solvers that are robust and easy to use in practice, based on the methodology of Fast Auxiliary Space Preconditioning (FASP).

We develop a multigrid method on unstructured shape-regular grids by constructing an auxiliary coarse grid hierarchy on which a multigrid method can be applied using the FASP technique. Such a construction is realized by a cluster tree, which can be obtained in O(N log N) operations for a grid of N nodes. This tree structure is used to define the grid hierarchy from coarse to fine. For the constructed grid hierarchy, we prove that the condition number of the preconditioned system for an elliptic PDE is O(log N). Then, we present a new colored block Gauss-Seidel method for general unstructured grids. Using the auxiliary grid, we aggregate the degrees of freedom lying in the same cells of the auxiliary grids into blocks. By developing a parallel coloring algorithm for the tree structure, a colored block Gauss-Seidel method can be applied with the aggregates serving as non-overlapping blocks. We also develop a new parallel unsmoothed aggregation algebraic multigrid method for PDEs defined on an unstructured mesh, based on the auxiliary grid. It provides (nearly) optimal load balance and predictable communication patterns, factors that make our new algorithm suitable for parallel computing.

Furthermore, we extend the FASP techniques to saddle point and indefinite problems. Two auxiliary space preconditioners are presented. An abstract framework for the symmetric positive definite auxiliary preconditioner is presented so that an optimal multigrid method can be applied to the indefinite problem on an unstructured grid. We also numerically verify the optimality of the two preconditioners for the Stokes equations.

Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1  Introduction

Chapter 2  Iterative Method
    Stationary Iterative Methods
        Jacobi Method
        Gauss-Seidel Method
        Successive Over-Relaxation Method
        Block Iterative Method
    Krylov Space Method and Preconditioners
        Conjugate Gradient Method
    Preconditioned Iterations
        Preconditioned Conjugate Gradient Method
        Preconditioning Techniques
    Numerical Example
        Comparison of the Iterative Method
        Comparison of the Preconditioners

Chapter 3  Multigrid Method and Fast Auxiliary Space Preconditioner
    Method of Subspace Correction
        Parallel Subspace Correction and Successive Subspace Correction
    Multigrid Viewed as Multilevel Subspace Corrections
    Convergence Analysis
    The Auxiliary Space Method
    Algebraic Multigrid Method
        Classical AMG
        UA-AMG

Chapter 4  FASP for Poisson-like Problems on Unstructured Grids
    Preliminaries and Assumptions
    Construction of the Auxiliary Grid Hierarchy
        Clustering and Auxiliary Box-trees
        Closure of the Auxiliary Box-tree
        Construction of a Conforming Auxiliary Grid Hierarchy
        Adaptation of the Auxiliary Grids to the Boundary
        Near-Boundary Correction
    Estimate of the Condition Number
        Convergence of the MG on the Auxiliary Grids
        Stable Decomposition: Proof of (A1)
        Strengthened Cauchy-Schwarz Inequality: Proof of (A2)
        Condition Number Estimation

Chapter 5  Colored Gauss-Seidel Method by Auxiliary Grids
    Graph Coloring
    Quadtree Coloring
        Tree Representations
        Parallel Implementation of the Coloring Algorithm
    Block Colored Gauss-Seidel Methods

Chapter 6  Parallel FASP-AMG Solvers
    Parallel Auxiliary Grid Aggregation
    Parallel Prolongation, Restriction, and Coarse-level Matrices
    Parallel Smoothers Based on the Auxiliary Grid
    GPU Implementation
        Sparse Matrix-Vector Multiplication on GPUs
        Parallel Auxiliary Grid Aggregation

Chapter 7  Numerical Applications for Poisson-like Problems on Unstructured Grids
    Auxiliary Space Multigrid Method
        Geometric Multigrid
        ASMG for the Dirichlet Problem
        ASMG for the Neumann Problem
    FASP-AMG
        Test Platform
        Performance

Chapter 8  FASP for Indefinite Problems
    Krylov Space Methods for Indefinite Problems
        The Minimal Residual Method
        Generalized Minimal Residual Method
    Preconditioners for Indefinite Problems
    FASP Preconditioner

Chapter 9  Fast Preconditioners for the Stokes Equation on Unstructured Grids
    Block Preconditioners
    Analysis of the FASP SPD Preconditioner
    Some Examples
        Use a Lower Order Velocity Space Pair as an Auxiliary Space
        Use a Lower Order Pressure Space as an Auxiliary Space

Chapter 10  Conclusions
    Conclusions
    Future Work

Bibliography

List of Figures

- Matrix splitting of A
- Comparison of the number of iterations
- Comparison of the CPU time
- Comparison of the number of iterations for preconditioners
- Left: the 2D triangulation T of Ω with elements τ_i. Right: the barycenters ξ_i (dots) and the minimal distance h between barycenters
- Examples of the region quadtree on different domains
- Tree of regular boxes with root B_1 in 2D. The black dots mark the corresponding barycenters ξ_i of the triangles τ_i. Boxes with fewer than three points ξ_i are leaves
- The subdivision of the marked (red) box on level l would create two boxes (blue) with more than one hanging node at one edge
- The subdivision of the red box makes it necessary to subdivide nodes on all levels
- Hanging nodes can be treated by a local subdivision within the box B_ν. The top row shows a box with 1, 2, 2, 3, 4 hanging nodes, respectively, and the bottom row shows the corresponding triangulation of the box
- The final hierarchy of nested grids. Red edges were introduced in the last (local) closure step
- Case 1: σ_i is subdivided in the fine level
- Case 2: σ_i is not subdivided in the fine level
- A triangulation of the Baltic Sea with local refinement and small inclusions
- Hanging nodes can be treated by a local subdivision within the cube B_ν: first erasing the hanging nodes on the face and then connecting the center of the cube
- The boundary Γ of Ω is drawn as a red line; boxes not intersecting Ω are light green, boxes intersecting Γ are dark green, and all other boxes (inside Ω) are blue
- The boundary Γ of Ω is drawn as a red line; boxes not intersecting Ω are light green, and all other boxes (intersecting Ω) are blue
- The finest auxiliary grid σ^(10) contains elements of different size. Left: Dirichlet b.c. (852 degrees of freedom); right: Neumann b.c. (2100 degrees of freedom)
- A balanced quadtree requires at least five colors
- Forced coloring rectangles
- Adaptive quadtree and its binary graph
- Six-coloring of an adaptive quadtree
- The Morton code of an adaptive quadtree
- Adaptive quadtree and its binary graph
- Coloring of a 3D adaptive octree
- Aggregation on level L
- Aggregation on the coarse levels
- Coloring on the finest level L
- Sparse matrix representation using the ELL format and the memory access pattern of SpMV
- Convergence rates for the Auxiliary Space Multigrid with n_4 = 737,933, n_5 = 2,970,149, n_6 = 11,917,397, and n_7 = 47,743,157 degrees of freedom
- Convergence rates for the Auxiliary Space Multigrid with n_4 = 756,317, n_5 = 3,006,917, n_6 = 11,990,933, and n_7 = 47,890,229 degrees of freedom
- Quasi-uniform grid for a 2D unit square
- Shape-regular grid for a 2D unit square
- Shape-regular grid for a 2D circular domain
- Shape-regular grid for the 3D heat transfer problem on a cubic domain (left) and a cavity domain (right)
- P 2 P 0 elements and P 1 + P0 1 elements
- P 2 + P1 1 elements and P2 0 P0 1 elements
- P 2 + P1 1 elements and P2 0 P0 1 elements

List of Tables

- Comparison of the different iterative methods for the Poisson equation
- Comparison of the different preconditioners for the Poisson equation
- The time in seconds for the setup of the matrices and for ten steps of V-cycle (geometric) multigrid
- The storage complexity in bytes per degree of freedom (auxiliary grids, auxiliary matrices and H-solvers) and the solve time in seconds for an ASMG-preconditioned CG iteration
- The storage complexity in bytes per degree of freedom (auxiliary grids, auxiliary matrices and H-solvers) and the solve time in seconds for an ASMG-preconditioned CG iteration
- Test platform
- Wall time and number of iterations for the Poisson problem on a 2D uniform grid
- Wall time and number of iterations for the Poisson problem on a 2D quasi-uniform grid
- Wall time and number of iterations for the Poisson problem on a 2D shape-regular grid
- Wall time and number of iterations for the Poisson problem on a disk domain
- Time and number of iterations for the heat-transfer problem on a 3D unit cube
- Wall time and number of iterations for the heat-transfer problem on a cavity domain
- Number of iterations for using P 1 + P0 1 elements as a preconditioner for the Stokes equation with P2 0 P0 1 elements
- Number of iterations for using P 1 + P0 1 elements as a preconditioner for the Stokes equation with P 2 P 0 elements
- Number of iterations for using P2 0 P0 1 elements as a preconditioner for the Stokes equation with P 2 + P1 1 elements

Acknowledgments

First and foremost, I want to express my sincere gratitude to my advisor, Prof. Jinchao Xu, for his enthusiasm, patience, encouragement, and inspiration. I have learned a great deal from him, academically and beyond. His insightful advice, patient guidance, and constant support and encouragement were essential to the completion of my education. His fine intuition and deep understanding of numerical analysis, the finite element method, and multigrid methods have been very important to my Ph.D. studies and research. His support for my work, life, and career development is invaluable, and my words simply cannot express my gratitude strongly enough. His work ethic has been an enormous influence on me. For me, he has redefined the word advisor. It has been my extreme privilege to know him and to work with him.

I also want to thank Prof. James Brannick and Prof. Ludmil Zikatanov for serving as my committee members and for providing me with a mathematical perspective on my research and thesis. Their comments and advice have been essential for my understanding of the new methods described in this work. I would like to thank my committee member, Prof. Chao-Yang Wang, who valued my research and generously took the time to evaluate my thesis.

I want to thank all the excellent post-docs and colleagues in our group: Dr. Xiaozhe Hu, Dr. Maximilian Metti, Dr. Fei Wang, Kai Yang, Changhe Qiao, and Yicong Ma. Without their advice and helpful discussions, it would have been impossible for me to complete my Ph.D. thesis.

Last but not least, I want to thank my family, especially my wife Ying Chen, for her unconditional trust, constant encouragement, and sweet love over the years.

Dedication

TO GOD BE THE GLORY.

Chapter 1
Introduction

Numerical simulation plays an important role in scientific research and engineering design, since experimental investigation is both very expensive and time consuming. Numerical simulation helps us understand important features and reduces development time. Progress in computer science and engineering helps to meet the corresponding need for computational power: the rapid development of the computer industry provides ever more powerful computing capability for numerical simulation, which in turn makes numerical simulation applicable to wider fields and more complex physical phenomena. As the complexity and difficulty of numerical simulations increase, the linear solvers become the most stringent bottleneck as measured by their share of the execution time. The need for fast and stable linear solvers, especially on massively parallel computers, is becoming increasingly urgent.

Assume $V$ is a Hilbert space and $V'$ is the dual space of $V$. Consider the following linear system
$$Au = f, \qquad (1.1)$$
where $A : V \to V'$ is a nonsingular linear operator and $f \in V'$ is a given functional on $V$. Since we consider $V$ to be a finite-dimensional space, we identify $V'$ with $V$.

There are two different ways to solve the system (1.1): a direct solver or an iterative solver. Direct solvers theoretically give the exact solution in finitely many steps; examples are Gaussian elimination [1], multifrontal solvers [2], and the like. The review papers [3, 4, 5] serve as excellent references for various direct solvers. These methods would give the precise solution if they were performed in infinite-precision arithmetic. However, in practice this is rarely true because of rounding errors. The error made in one step propagates through all

following steps. This makes it difficult to solve the equations by direct solvers for complex problems in applications. On the other hand, as scientific computing develops, many of the problems are extremely large and complex. For example, Grand Challenge problems require PetaFLOPs and PetaBytes of computing resources. The large computational complexity of direct methods makes them infeasible for these problems, even with the best available computing power.

Consider the Gaussian elimination method as an example, since it is still the most commonly used method in practice. Gaussian elimination is a row-reduction algorithm for solving linear equations. To perform row reduction on a matrix, one uses a sequence of elementary row operations to modify the matrix until the lower left-hand corner of the matrix is filled with as many zeros as possible. There are three types of elementary row operations: 1) swapping two rows, 2) multiplying a row by a non-zero number, 3) adding a multiple of one row to another row. By using these operations, a matrix can always be transformed into an upper triangular matrix. Once all of the leading coefficients (the left-most non-zero entry in each row) are 1 and every column containing a leading coefficient has zeros elsewhere, the matrix is said to be in reduced row echelon form. This final form is unique; in other words, it is independent of the sequence of row operations used.

The advantage of Gaussian elimination is that it is the most user-friendly solver: for any matrix and right-hand side, it is guaranteed to solve the equations. However, its computational efficiency is very low. The number of arithmetic operations is one way of measuring an algorithm's computational efficiency. Gaussian elimination requires $N(N-1)/2$ divisions, $(2N^3 + 3N^2 - 5N)/6$ multiplications, and $(2N^3 + 3N^2 - 5N)/6$ subtractions, for a total of approximately $2N^3/3$ operations, so the arithmetic complexity is $O(N^3)$. When the problem scale is large, it costs a great deal of time and memory to solve the problem; sometimes it is even impossible. For example, for the input size $N = 10^9$, even the top supercomputer Tianhe-2, which ranked as the world's fastest in 2013, would need about 560 years to solve one problem with an $O(N^3)$ algorithm.

In contrast to direct methods, iterative methods are not expected to terminate in a finite number of steps. Starting from an initial guess, iterative methods form successive approximations that converge to the exact solution only in the limit. These methods are

relatively easy to implement and require less memory. Therefore, iterative methods are generally needed for large-scale problems. Two main classes of iterative methods are the stationary iterative methods and the more general Krylov subspace methods. Stationary iterative methods solve a linear system with an operator approximating the original one. Examples of stationary iterative methods are the Jacobi method, the Gauss-Seidel method, and the successive over-relaxation (SOR) method [1]. While these methods are simple to derive, implement, and analyze, their convergence is only guaranteed for a limited class of matrices. Krylov subspace methods, on the other hand, work by forming a basis of the sequence of successive matrix powers times the initial residual (the Krylov sequence). The approximations to the solution are then formed by minimizing the residual over the subspace formed. Prototypical methods in this class are the conjugate gradient method (CG) [6], the minimal residual method (MINRES) [7], and the generalized minimal residual method (GMRES) [8]. Since these methods form a basis, it is evident that they converge in N iterations, where N is the system size. However, in the presence of rounding errors this statement does not hold. Moreover, in practice N can be very large, and the iterative process often reaches sufficient accuracy far earlier. Krylov space methods can also be accelerated by preconditioners. For example, if N = 10^9, the CG method with a multigrid preconditioner can solve the problem in about 500 seconds.

It has been observed that classical iterative methods reduce the high-frequency components of the error rapidly, but can hardly reduce the low-frequency components [9, 1, 10]. The multigrid principle was motivated by this observation. Another crucial observation is that low-frequency errors on a fine mesh become high-frequency errors on a coarser mesh. For the coarse grid problem, we can apply the smoothing and the separation of scales again. Recursive application of smoothing on each level results in the classical formulation of multigrid. A natural realization of this idea is the geometric multigrid (GMG) method [11, 9]. GMG provides substantial acceleration compared to basic iterative solvers like Jacobi or Gauss-Seidel, and even better performance has been observed when these methods are used as preconditioners for Krylov methods. With GMG, a Poisson equation can be solved in O(N) operations.

Roughly speaking, two different types of theories have been developed for the convergence of GMG. For the first kind of theory, which makes critical use of the elliptic regularity of the underlying partial differential equations as well as approximation and inverse properties

of the discrete hierarchy of grids, we refer to Bank and Dupont [12], Braess and Hackbusch [13], Hackbusch [11], and Bramble and Pasciak [14]. The second kind of theory makes minor or no elliptic regularity assumptions; we refer to Yserentant [15], Bramble, Pasciak and Xu [16], Bramble, Pasciak, Wang and Xu [17], Xu [18, 19], Yserentant [20], and Chen, Nochetto and Xu [21, 22].

The GMG method, however, relies on a given hierarchy of geometric grids. Such a hierarchy of grids is sometimes naturally available, for example due to adaptive grid refinement, or can be obtained in some special cases by a coarsening algorithm [23]. But in most practical cases, only a single (fine) unstructured grid is given. This makes it difficult to generate a sequence of nested meshes. To circumvent this difficulty, two different approaches have been developed: algebraic multigrid (AMG) methods and non-nested geometric multigrid.

One practical way to generate a grid hierarchy for general unstructured grids is algebraic multigrid (AMG). Most AMG methods, although their derivations are purely algebraic in nature, can be interpreted as nested MG when they are applied to finite element systems based on a geometric grid. AMG methods are usually very robust and converge quickly for Poisson-like problems [24, 25]. There are many different types of AMG methods: classical AMG [26, 27, 28], smoothed aggregation AMG [29, 30, 31], AMGe [32, 33], unsmoothed aggregation AMG [34, 35], and many others. Highly efficient sequential and parallel implementations are also available for both CPU and GPU systems [36, 37, 38]. AMG methods have been demonstrated to be among the most efficient solvers for many practical problems [39]. Despite this great success in practical applications, AMG still lacks solid theoretical justification, except for two-level theories [40, 30, 41, 42, 43, 44, 45]. For a truly multilevel theory, using the theoretical framework developed in [17, 19], Vaněk, Mandel, and Brezina [46] provide a theoretical bound for smoothed aggregation AMG under some assumptions on the aggregates. Such assumptions have recently been investigated in [45] for aggregates that are controlled by auxiliary grids similar to those used in [47].

Another way to construct a multigrid method for unstructured grids is non-nested geometric multigrid. One example of such a theory is by Bramble, Pasciak and Xu [48]. In that work, optimal convergence theories are established under the assumption that a non-nested sequence of quasi-uniform meshes can be obtained. Another example is the work by Bank and Xu [49], which gives a nearly optimal convergence estimate for a hierarchical-basis-type method on general shape-regular grids in two dimensions. This theory is based on

non-nested geometric grids that have nested sets of nodal points from different levels.

One feature of the aforementioned MG algorithms and their theories is that the underlying multilevel finite element subspaces are not nested, which is not always desirable from either a theoretical or a practical point of view. To avoid the non-nestedness, many different MG techniques and theories have been explored in the literature. One such theory was developed by Xu [47] for a semi-nested MG method on an unstructured but quasi-uniform grid based on an auxiliary grid approach. Instead of generating a sequence of non-nested grids from the initial grid, this method is based on a single auxiliary structured grid whose size is comparable to the original quasi-uniform grid. While the auxiliary grid is not nested with the original grid, it contains a natural nested hierarchy of coarse grids. Under the assumption that the original grid is quasi-uniform, an optimal convergence theory was developed in [47] for second-order elliptic boundary value problems with Dirichlet boundary conditions.

The first goal of my thesis is to extend the algorithm and theory of Xu [47] to shape-regular grids that are not necessarily quasi-uniform. The lack of quasi-uniformity of the original grid makes the extension nontrivial for both the algorithm and the theory. First, it is difficult to construct auxiliary hierarchical grids without increasing the grid complexity, especially for grids on complicated domains. The way we construct the hierarchical structure is to generate a cluster tree based on the geometric information of the original grid [50, 51, 52, 53]. Secondly, it is also not straightforward to establish optimal convergence for the geometric multigrid applied to a hierarchy of auxiliary grids that can be highly locally refined.

As the need to solve extremely large systems grows in urgency, researchers study not only MG methods and theories but also their parallelization. Parallel multigrid approaches have been implemented in various frameworks. For example, waLBerla [54] targets finite difference discretizations on fully structured grids; BoomerAMG (included in the Hypre package [55]) is a parallelization of the classical AMG methods and their variants for unstructured grids; ML [56] focuses on parallel versions of smoothed aggregation AMG methods; Peano [57] is based on space-filling curves; and the Distributed and Unified Numerics Environment (DUNE) is a general software framework for solving PDEs.

Not only are researchers rapidly developing algorithms, they are doing the same with the hardware. GPUs, based on the single-instruction multiple-thread (SIMT) hardware architecture, have provided an efficient platform for large-scale scientific computing since November 2006, when NVIDIA released the Compute Unified Device Architecture (CUDA) toolkit. The CUDA toolkit made programming on GPUs considerably easier than it had

previously been. MG methods have also been parallelized and implemented on GPUs in a number of studies. GMG methods, as the typical case of MG, were implemented on GPUs first [58, 59, 60, 61, 62, 63]. These studies demonstrate that the speedup afforded by GPUs can result in GMG methods achieving a high level of performance on CUDA-enabled GPUs. However, to the best of our knowledge, parallelizing an AMG method on GPUs or CPUs remains very challenging, mainly due to the sequential nature of the coarsening processes (setup phase) used in AMG methods. In most AMG algorithms, coarse-grid points are selected sequentially using graph-theoretical tools (such as maximal independent sets and graph partitioning algorithms), and the coarse-grid matrices are constructed by a triple-matrix multiplication. Although extensive research has been devoted to improving the performance of parallel coarsening algorithms, leading to marked improvements on CPUs [64, 65, 66, 36, 67, 68, 69] and on GPUs [65, 70, 71] over time, the setup phase is still considered to be the major bottleneck in parallel AMG methods. On the other hand, the task of designing an efficient and robust parallel smoother in the solver phase is no less challenging. Most of these difficulties can be alleviated by using auxiliary spaces: the special structure of the auxiliary grid makes the parallelization more efficient. Therefore, the second goal of this thesis is to design new parallel smoothers and an MG method based on auxiliary space preconditioning techniques.

Many problems in physics and engineering, like the Navier-Stokes equations in fluid dynamics, the Helmholtz equation, or multi-physics problems, lead to indefinite problems. Other problems, like the biharmonic plate problem, may be formulated as coupled problems in different variables, which leads to a saddle point problem. It is important to find efficient preconditioners for these indefinite problems. Preconditioner design and analysis for saddle-point and indefinite problems have been the subject of active research in a variety of areas of applied mathematics, for example groundwater flow, Stokes and Navier-Stokes flow [72, 73, 74, 75], elasticity, and magnetostatics. While most results address the symmetric case, non-symmetric preconditioning has also been analyzed for some practical problems. Various iterative methods and preconditioning methods for saddle-point-type problems have been the subject of research. Some methods focus on clever renumbering schemes in combination with a classical iterative approach, like the SILU scheme and the ILU schemes proposed by Wille et al. Other methods are based on a splitting of the saddle point operator.

A number of block preconditioners have been devised, for example block diagonal preconditioners [74], block triangular preconditioners, the Pressure-Convection Diffusion commutator (PCD) of Kay, Loghin and Wathen [76, 77], the Least Squares Commutator (LSC) of Elman, Howle, Shadid, Shuttleworth and Tuminaro [75], the Augmented Lagrangian approach (AL) of Benzi and Olshanskii [78], the Artificial Compressibility (AC) preconditioner [79], and the Grad-Div (GD) preconditioner [79]. For an overview of block preconditioners, we refer to [80, 81, 82]. Although we can apply MG to solve the sub-blocks, MG for indefinite problems on general unstructured grids has rarely been analyzed. So the last goal of this thesis is to carry out a general analysis of FASP preconditioners for finite element discretizations and to describe sufficient conditions for optimal preconditioning.

The rest of the thesis is organized as follows. In Chapter 2, we review iterative methods and preconditioning techniques. In Chapter 3, we introduce the basic concepts and theories of MG and auxiliary space preconditioning. Next, we introduce the algorithm and theory of our new auxiliary space MG on shape-regular grids in Chapter 4. In Chapter 5, we present a parallel colored Gauss-Seidel method. In Chapter 6, we discuss the parallelization of the UA-AMG algorithm based on auxiliary grids. After that, we give some numerical examples in Chapter 7. In Chapter 8, we review iterative methods for indefinite problems and introduce the new FASP theory for indefinite problems. We then apply the FASP theory to the Stokes equations in Chapter 9. Finally, we draw some conclusions and describe future work in Chapter 10.

Chapter 2
Iterative Method

A single-step linear iterative method uses an old approximation, $u^{\mathrm{old}}$, of the solution $u$ of (1.1) to produce a new approximation, $u^{\mathrm{new}}$, and usually consists of three steps:

1. Form $r^{\mathrm{old}} = f - Au^{\mathrm{old}}$;
2. Solve $Ae = r^{\mathrm{old}}$ approximately: $\hat e = Br^{\mathrm{old}}$;
3. Update $u^{\mathrm{new}} = u^{\mathrm{old}} + \hat e$,

where $B$ is a linear operator on $V$ that can be thought of as an approximate inverse of $A$. As a result, we have the following algorithm.

Algorithm 1 Iterative Method
  Given $u^0 \in V$;
  for $m = 0, 1, \ldots$, until convergence do
    $u^{m+1} = u^m + B(f - Au^m)$

We say that an iterative scheme like Algorithm 1 converges if $\lim_{m\to\infty} u^m = u$ for any $u^0 \in V$. The core element of the above iterative scheme is the operator $B$. Notice that if $B = A^{-1}$, then after one iteration $u^1$ is the exact solution. In general, $B$ may be regarded as an approximate inverse of $A$. For general iterative methods, we have the following simple convergence result.

Lemma 2.0.1. The iterative Algorithm 1 converges if and only if $\rho(I - BA) < 1$, where $\rho(\cdot)$ denotes the spectral radius.

If $A$ is symmetric positive definite (SPD), we can define a new inner product $(u, v)_A = (Au, v)$. Sometimes it is desirable that the operator $B$ be symmetric. If $B$ is not symmetric, there is a natural way to symmetrize it; consider the following Algorithm 2.

Algorithm 2 Symmetrized Iterative Method
  Given $u^0 \in V$;
  for $m = 0, 1, \ldots$, until convergence do
    $u^{m+1/2} = u^m + B(f - Au^m)$
    $u^{m+1} = u^{m+1/2} + B^T(f - Au^{m+1/2})$

The symmetrization of the iterator $B$ is denoted by $\bar B$. Since
$$I - \bar BA = (I - B^TA)(I - BA) = I - B^TA - BA + B^TABA,$$
we have $\bar B = B^T + B - B^TAB$. The convergence theory is as follows.

Theorem 2.0.2. A sufficient condition for the convergence of Algorithm 2 is that $B^{-T} + B^{-1} - A$ is SPD.

Proof. If $B^{-T} + B^{-1} - A$ is SPD, then $\bar B = B^T(B^{-T} + B^{-1} - A)B$ is SPD. Since $\bar B$ is symmetric, $(\bar BAu, v)_A = (u, \bar BAv)_A$, so $\bar BA$ is SPD with respect to $(\cdot,\cdot)_A$. Because $I - \bar BA = (I - B^TA)(I - BA)$ and $I - B^TA$ is the $A$-adjoint of $I - BA$, we have $((I - \bar BA)u, u)_A = \|(I - BA)u\|_A^2 \ge 0$, so $\sigma(I - \bar BA) \subset [0, \infty)$.

21 10 Assume λ σ(i BA) [0, ) and µ σ( BA), then µ > 0, so 0 λ = 1 µ < 1. which leads the conclusion. Theorem If A is SPD, A sufficient condition for the convergence of Algorithm 1 is that B T + B 1 A is SPD. Proof. By Theorem 2.0.2, ρ(i BA) I BA A = (I BA) T (I BA) 1 2 A = I BA 1 2 A < Stationary Iterative Methods Most of the stationary iterative methods involve passing from one iterate to the next by modifying one or a few components of an approximate vector solution at a time. This is natural since it is simple to modify a component. The convergence of these methods is rarely guaranteed for all matrices, but a large body of theory exists. Assume V = R N and A = (a ij ) R N N which is symmetric positive definite (SPD). And we begin with the matrix splitting A = D + L + U, (2.1) where D is the diagonal of A, L is the strict lower part, and U is the strict upper part (as shown in Figure 2.1) Jacobi Method The Jacobi method determines the i-th component by eliminating the i-th component of the residual vector of the previous solution. If u m i denotes the i-th component of the m-th

Figure 2.1. Matrix splitting of A

If $u_i^m$ denotes the $i$-th component of the $m$-th iterate $u^m$, the Jacobi method annihilates the $i$-th residual component computed with the previous iterate,
$$f_i - \sum_{j=1,\,j\ne i}^{N} a_{ij}u_j^m - a_{ii}u_i^{m+1} = 0,$$
so that $u^{m+1}$ is determined by
$$a_{ii}u_i^{m+1} = f_i - \sum_{j=1,\,j\ne i}^{N} a_{ij}u_j^m \qquad (2.2)$$
or
$$u_i^{m+1} = \frac{1}{a_{ii}}\Big(f_i - \sum_{j=1,\,j\ne i}^{N} a_{ij}u_j^m\Big), \qquad i = 1, 2, \ldots, N. \qquad (2.3)$$
This is the component-wise form of the Jacobi iteration. It can be written in vector form as
$$u^{m+1} = D^{-1}\big(f - (L + U)u^m\big) \qquad (2.4)$$
or
$$u^{m+1} = u^m + D^{-1}(f - Au^m). \qquad (2.5)$$
This leads to the following algorithm:

Algorithm 3 Jacobi Method
  for $m = 0, 1, \ldots$ until convergence do
    for $i = 1$ to $N$ do
      $\sigma = 0$
      for $j = 1$ to $N$ do
        if $j \ne i$ then $\sigma \leftarrow \sigma + a_{ij}u_j^m$
      $u_i^{m+1} \leftarrow \frac{1}{a_{ii}}(f_i - \sigma)$
    check if convergence is reached

The convergence theory of the Jacobi method is well established.

Theorem (Jacobi Method). Assume $A$ is symmetric positive definite (SPD). The Jacobi method converges if and only if $2D - A$ is SPD.

Proof. Since $B^{-T} + B^{-1} - A = D + D - A = 2D - A$, the desired result follows from Theorem 2.0.3.

It is worth mentioning that the Jacobi method sometimes converges even if these conditions are not satisfied.

Gauss-Seidel Method

The Gauss-Seidel method determines the $i$-th component by eliminating the $i$-th component of the residual vector of the current solution, in the order $i = 1, 2, \ldots, N$. This time the approximate solution is updated immediately after each new component is determined. The $i$-th component of the residual is
$$f_i - \sum_{j=1}^{i-1} a_{ij}u_j^{m+1} - a_{ii}u_i^{m+1} - \sum_{j=i+1}^{N} a_{ij}u_j^m = 0, \qquad (2.6)$$
which leads to the iteration
$$u_i^{m+1} = \frac{1}{a_{ii}}\Big(f_i - \sum_{j=1}^{i-1} a_{ij}u_j^{m+1} - \sum_{j=i+1}^{N} a_{ij}u_j^m\Big), \qquad i = 1, 2, \ldots, N. \qquad (2.7)$$

The vector form of equation (2.6) can be written as
$$f - Lu^{m+1} - Du^{m+1} - Uu^m = 0.$$
Therefore, the vector form of the Gauss-Seidel method is
$$u^{m+1} = (D + L)^{-1}(f - Uu^m), \qquad (2.8)$$
or
$$u^{m+1} = u^m + (D + L)^{-1}(f - Au^m). \qquad (2.9)$$
This leads to the following algorithm:

Algorithm 4 Gauss-Seidel Method
  $u = u^0$
  while convergence is not reached do
    for $i = 1$ to $N$ do
      $\sigma = 0$
      for $j = 1$ to $N$ do
        if $j \ne i$ then $\sigma \leftarrow \sigma + a_{ij}u_j$
      $u_i \leftarrow \frac{1}{a_{ii}}(f_i - \sigma)$
    check if convergence is reached

The convergence properties of the Gauss-Seidel method depend on the matrix $A$. Namely, the procedure is known to converge if either $A$ is symmetric positive definite or $A$ is strictly or irreducibly diagonally dominant.

Theorem (Gauss-Seidel Method). Assume $A$ is SPD. Then the Gauss-Seidel method always converges.

Proof. Since $B^{-T} + B^{-1} - A = (D + L)^T + D + L - A = D$ is SPD, the convergence of the Gauss-Seidel method follows from Theorem 2.0.3.
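To make the component-wise sweeps of Algorithms 3 and 4 concrete, the following is a minimal NumPy sketch for a dense SPD matrix. The function names, the relative-residual stopping test, and the iteration caps are illustrative assumptions and are not part of the original algorithms.

import numpy as np

def jacobi(A, f, u0, tol=1e-8, max_iter=10_000):
    """Algorithm 3: u^{m+1} = u^m + D^{-1}(f - A u^m)."""
    u = u0.copy()
    D = np.diag(A)                      # diagonal of A
    for m in range(max_iter):
        r = f - A @ u                   # residual r^m
        u = u + r / D                   # Jacobi update: all components at once
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            break
    return u

def gauss_seidel(A, f, u0, tol=1e-8, max_iter=10_000):
    """Algorithm 4: forward sweep, components updated in place (i = 1..N)."""
    u = u0.copy()
    N = len(f)
    for m in range(max_iter):
        for i in range(N):
            sigma = A[i, :i] @ u[:i] + A[i, i+1:] @ u[i+1:]
            u[i] = (f[i] - sigma) / A[i, i]
        r = f - A @ u
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            break
    return u

Note that the Jacobi update touches all components simultaneously (and is therefore naturally parallel), while the Gauss-Seidel sweep reuses the freshly updated entries u[:i] within the same sweep.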

The Gauss-Seidel method has several variants. The backward Gauss-Seidel method can be defined as
$$u^{m+1} = u^m + (D + U)^{-1}(f - Au^m), \qquad (2.10)$$
which is equivalent to making the corrections in the order $N, N-1, \ldots, 1$. The symmetric Gauss-Seidel method consists of a forward sweep followed by a backward sweep, which can be written as
$$u^{m+\frac12} = u^m + (D + L)^{-1}(f - Au^m), \qquad u^{m+1} = u^{m+\frac12} + (D + U)^{-1}(f - Au^{m+\frac12}). \qquad (2.11)$$
We can rewrite this as one equation:
$$u^{m+1} = u^m + \big[(D + U)^{-1} + (D + L)^{-1} - (D + U)^{-1}A(D + L)^{-1}\big](f - Au^m). \qquad (2.12)$$
It is simple to prove that when $A$ is SPD, the backward Gauss-Seidel method and the symmetric Gauss-Seidel method converge.

Successive Over-Relaxation Method

The successive over-relaxation (SOR) method is derived by extrapolating the Gauss-Seidel method. This extrapolation takes the form of a weighted average between the previous iterate and the computed Gauss-Seidel iterate, applied successively to each component:
$$u_i^{m+1} = \omega\bar u_i^{m+1} + (1 - \omega)u_i^m, \qquad i = 1, 2, \ldots, N,$$
where $\bar u_i^{m+1}$ denotes a Gauss-Seidel iterate and $\omega$ is the extrapolation factor. The idea is to choose a value of $\omega$ that accelerates the rate of convergence of the iterates to the solution. The vector form of the SOR method can be written as
$$u^{m+1} = (D + \omega L)^{-1}\big[-\omega U + (1 - \omega)D\big]u^m + \omega(D + \omega L)^{-1}f. \qquad (2.13)$$
This leads to the following algorithm:

Algorithm 5 SOR Method
  $u = u^0$
  while convergence is not reached do
    for $i = 1$ to $N$ do
      $\sigma = 0$
      for $j = 1$ to $N$ do
        if $j \ne i$ then $\sigma \leftarrow \sigma + a_{ij}u_j$
      $u_i \leftarrow (1 - \omega)u_i + \frac{\omega}{a_{ii}}(f_i - \sigma)$
    check if convergence is reached

If $\omega = 1$, the SOR method reduces to the Gauss-Seidel method; for $\omega = 0$ the iterate is left unchanged. A theorem due to Kahan [83] shows that SOR fails to converge if $\omega$ is outside the interval $(0, 2)$.

Theorem. Assume $A$ is SPD. Then the SOR method converges if $0 < \omega < 2$.

Proof. Since $B^{-T} + B^{-1} - A = \omega^{-1}(D + \omega L)^T + \omega^{-1}(D + \omega L) - A = \frac{2-\omega}{\omega}D$ is SPD for $0 < \omega < 2$, the result follows from Theorem 2.0.3.

In general, it is not possible to compute in advance the value of $\omega$ that will maximize the rate of convergence of SOR. If the coefficient matrix $A$ is symmetric positive definite, the SOR iteration is guaranteed to converge for any value of $\omega$ between 0 and 2, though the choice of $\omega$ can significantly affect the rate at which the SOR iteration converges. Frequently, some heuristic estimate is used, such as $\omega = 2 - O(h)$, where $h$ is the mesh spacing of the discretization of the underlying physical domain.

A backward SOR sweep can be defined analogously to the backward Gauss-Seidel sweep (2.10). A symmetric SOR (SSOR) step consists of the SOR step (2.13) followed by a backward SOR step:
$$u^{m+\frac12} = (D + \omega L)^{-1}\big[-\omega U + (1 - \omega)D\big]u^m + \omega(D + \omega L)^{-1}f,$$
$$u^{m+1} = (D + \omega U)^{-1}\big[-\omega L + (1 - \omega)D\big]u^{m+\frac12} + \omega(D + \omega U)^{-1}f. \qquad (2.14)$$
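The following is a minimal sketch of one symmetric SOR step in the sense of (2.14), written as a forward SOR sweep followed by a backward sweep over the same components. The in-place update of u and the default value of omega are illustrative assumptions.

import numpy as np

def ssor_step(A, f, u, omega=1.5):
    """One symmetric SOR step (forward sweep then backward sweep), cf. (2.14)."""
    N = len(f)
    # forward sweep: i = 1, ..., N
    for i in range(N):
        sigma = A[i, :i] @ u[:i] + A[i, i+1:] @ u[i+1:]
        u[i] = (1 - omega) * u[i] + omega * (f[i] - sigma) / A[i, i]
    # backward sweep: i = N, ..., 1
    for i in reversed(range(N)):
        sigma = A[i, :i] @ u[:i] + A[i, i+1:] @ u[i+1:]
        u[i] = (1 - omega) * u[i] + omega * (f[i] - sigma) / A[i, i]
    return u

With omega = 1 this reduces to one symmetric Gauss-Seidel step (2.11); repeating the step until the residual is small gives the SSOR iteration.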

Block Iterative Method

We now discuss the extension of the stationary iterative methods to block schemes. The block iterative methods are generalizations of the pointwise iterative methods described above. They update a whole set of components at a time, typically a sub-vector of the solution vector, instead of only one component. First of all, we assume that $\tilde A$ is a matrix with sub-blocks as entries:
$$\tilde A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1J} \\ A_{21} & A_{22} & \cdots & A_{2J} \\ \vdots & \vdots & \ddots & \vdots \\ A_{J1} & A_{J2} & \cdots & A_{JJ} \end{pmatrix}, \qquad \tilde u = \begin{pmatrix}\xi_1 \\ \xi_2 \\ \vdots \\ \xi_J\end{pmatrix}, \qquad \tilde f = \begin{pmatrix}\beta_1 \\ \beta_2 \\ \vdots \\ \beta_J\end{pmatrix},$$
with $A_{ij} \in \mathbb{R}^{N_i\times N_j}$, $\xi_j \in \mathbb{R}^{N_j}$, $\beta_i \in \mathbb{R}^{N_i}$, $1 \le i, j \le J$, and $N = \sum_i N_i$. Similarly, we can define the following block matrix splitting:
$$\tilde A = \tilde D + \tilde L + \tilde U,$$
where $\tilde D$ is the block diagonal of $\tilde A$, and $\tilde L$ and $\tilde U$ are the strictly block lower and upper triangular parts of $\tilde A$, respectively. Namely,
$$\tilde D = \begin{pmatrix} A_{11} & & \\ & A_{22} & \\ & & \ddots \\ & & & A_{JJ}\end{pmatrix}, \qquad \tilde L = \begin{pmatrix} 0 & & \\ A_{21} & 0 & \\ \vdots & \ddots & \ddots \\ A_{J1} & \cdots & A_{J,J-1} & 0\end{pmatrix}, \qquad \tilde U = \begin{pmatrix} 0 & A_{12} & \cdots & A_{1J} \\ & 0 & \ddots & \vdots \\ & & \ddots & A_{J-1,J} \\ & & & 0\end{pmatrix}.$$
With these definitions, it is easy to generalize the three iterative procedures defined earlier, namely Jacobi, Gauss-Seidel, and SOR; we can simply take
$$B = \begin{cases} \tilde D^{-1} & \text{block Jacobi;} \\ (\tilde D + \tilde L)^{-1} & \text{block Gauss-Seidel;} \\ \omega(\tilde D + \omega\tilde L)^{-1} & \text{block SOR,} \end{cases} \qquad (2.15)$$

and the iterative method can be written as
$$\tilde u^{m+1} = \tilde u^m + B(\tilde f - \tilde A\tilde u^m). \qquad (2.16)$$
In addition, a block can also correspond to the unknowns associated with a few consecutive lines in the plane.

Unlike the pointwise iterative methods, it is not easy to apply $B$ exactly, because solving with $A_{ii}$ may be very expensive. In order to make the block iterative methods more practical, the modified block iterative methods can be defined by replacing the block diagonal inverse $\tilde D^{-1}$ by a modified block diagonal inverse, denoted by $\tilde R$. Namely, we can choose
$$B = \begin{cases} \tilde R & \text{modified block Jacobi;} \\ (\tilde R^{-1} + \tilde L)^{-1} & \text{modified block Gauss-Seidel;} \\ \omega(\tilde R^{-1} + \omega\tilde L)^{-1} & \text{modified block SOR,} \end{cases} \qquad (2.17)$$
where $\tilde R$ is a modification or an approximation of $\tilde D^{-1}$. This means that on each step we do not have to solve the sub-blocks exactly, which is more practical to implement. Mathematically, it can be proved that, under the same conditions as for the pointwise iterative methods, the block iterative methods and the modified block iterative methods are also convergent. In the following, we give a convergence analysis of the modified block Gauss-Seidel method:
$$\tilde u^{m+1} = \tilde u^m + (\tilde R^{-1} + \tilde L)^{-1}(\tilde f - \tilde A\tilde u^m), \qquad m = 1, 2, \ldots \qquad (2.18)$$

Lemma. Assume that $\tilde A$ is semi-SPD and that $\hat D := \tilde R^{-T} + \tilde R^{-1} - \tilde D$ is nonsingular. Then
$$\bar B^{-1} = (\tilde R^{-T} + \tilde U)^T\hat D^{-1}(\tilde R^{-T} + \tilde U) = \tilde A + (\tilde D + \tilde U - \tilde R^{-1})^T\hat D^{-1}(\tilde D + \tilde U - \tilde R^{-1}) \qquad (2.19)$$
and, for any $\tilde v$,
$$\tilde v^T\bar B^{-1}\tilde v = \big((\tilde R^{-T} + \tilde U)\tilde v\big)^T\hat D^{-1}(\tilde R^{-T} + \tilde U)\tilde v \qquad (2.20)$$
$$\tilde v^T\bar B^{-1}\tilde v = \tilde v^T\tilde A\tilde v + \big((\tilde D + \tilde U - \tilde R^{-1})\tilde v\big)^T\hat D^{-1}(\tilde D + \tilde U - \tilde R^{-1})\tilde v. \qquad (2.21)$$

Proof. It follows that
$$\bar B = B + B^T - B^T\tilde AB = B^T(B^{-T} + B^{-1} - \tilde A)B = B^T\hat DB, \qquad \hat D = B^{-T} + B^{-1} - \tilde A = \tilde R^{-T} + \tilde R^{-1} - \tilde D.$$
Hence
$$\bar B^{-1} = B^{-1}\hat D^{-1}B^{-T} = (\hat D + \tilde A - B^{-T})\hat D^{-1}(\hat D + \tilde A - B^{-1}) = \tilde A + (\tilde A - B^{-T})\hat D^{-1}(\tilde A - B^{-1}).$$
The desired results then follow.

Theorem. Assume $\tilde A$ is SPD. The modified block Gauss-Seidel method converges if
$$\hat D = \tilde R^{-T} + \tilde R^{-1} - \tilde D > 0, \qquad (2.22)$$
and the following convergence estimate holds:
$$\|I - \bar B\tilde A\|_{\tilde A}^2 = 1 - \frac{1}{c_1}, \qquad c_1 = 1 + c_0, \qquad (2.23)$$
where
$$c_1 = \sup_{\|v\|_{\tilde A}=1}\big((\tilde R^{-T} + \tilde U)v\big)^T\hat D^{-1}(\tilde R^{-T} + \tilde U)v \qquad (2.24)$$
and
$$c_0 = \sup_{\|v\|_{\tilde A}=1}\big((\tilde D + \tilde U - \tilde R^{-1})v\big)^T\hat D^{-1}(\tilde D + \tilde U - \tilde R^{-1})v. \qquad (2.25)$$
In particular, if $\tilde R = \tilde D^{-1}$, then $\hat D = \tilde D$ and
$$c_1 = \sup_{\|v\|_{\tilde A}=1}\big((\tilde D + \tilde U)v\big)^T\tilde D^{-1}(\tilde D + \tilde U)v \qquad (2.26)$$

and
$$c_0 = \sup_{\|v\|_{\tilde A}=1}(\tilde Uv)^T\tilde D^{-1}\tilde Uv. \qquad (2.27)$$

2.2 Krylov Space Method and Preconditioners

In terms of their influence on the development and practice of science and engineering in the 20th century, Krylov space methods are considered among the ten most important classes of numerical methods. Unlike the stationary iterative methods, Krylov methods do not have an iteration matrix. A Krylov subspace method extracts an approximate solution $u^m$ from an affine subspace $u^0 + K_m$, where $u^0$ is an arbitrary initial guess and $K_m$ is the Krylov subspace
$$K_m(A, r^0) = \mathrm{span}\{r^0, Ar^0, A^2r^0, \ldots, A^{m-1}r^0\}. \qquad (2.28)$$
The residual is $r = f - Au$, and $\{r^m\}_{m\ge0}$ denotes the sequence of residuals $r^m = f - Au^m$. When there is no ambiguity, $K_m(A, r^0)$ will be denoted by $K_m$. From approximation theory, it is clear that the approximations obtained from a Krylov subspace method are of the form
$$A^{-1}f \approx u^m = u^0 + q_{m-1}(A)r^0, \qquad (2.29)$$
in which $q_{m-1}$ is a certain polynomial of degree $m-1$. In the simplest case, let $u^0 = 0$; then $A^{-1}f$ is approximated by $q_{m-1}(A)f$. In other words, the polynomial $q_{m-1}(t)$ is an approximation of $1/t$. The relative residual norm of the Krylov subspace method can be bounded as
$$\frac{\|r^m\|}{\|r^0\|} \le \min_{q\in P_{m-1}}\|I - Aq(A)\|, \qquad (2.30)$$
where $P_{m-1}$ is the set of polynomials of degree at most $m-1$.
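As an illustration of (2.28), the sketch below builds an orthonormal basis of K_m(A, r^0) by repeatedly multiplying by A and orthogonalizing against the previous basis vectors (essentially the Arnoldi process). It is a toy construction for dense matrices, assumed here purely for illustration; practical Krylov methods such as CG never form this basis explicitly.

import numpy as np

def krylov_basis(A, r0, m):
    """Orthonormal basis of K_m(A, r0) = span{r0, A r0, ..., A^{m-1} r0}."""
    V = [r0 / np.linalg.norm(r0)]
    for k in range(1, m):
        w = A @ V[-1]
        for v in V:                      # Gram-Schmidt against previous vectors
            w = w - (v @ w) * v
        nrm = np.linalg.norm(w)
        if nrm < 1e-14:                  # Krylov space became invariant; stop early
            break
        V.append(w / nrm)
    return np.column_stack(V)            # columns span K_m(A, r0)

Projecting the problem onto the columns of this matrix is exactly the "minimize over u^0 + K_m" step described above.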

Although all these techniques provide the same type of polynomial approximation, different choices give rise to different versions of Krylov subspace methods. In this section, we introduce the conjugate gradient method (CG); the minimal residual method (MINRES) and the generalized minimal residual method (GMRES) are introduced in Chapter 8.

Conjugate Gradient Method

The conjugate gradient (CG) method is due to Hestenes and Stiefel [6]. It is one of the best known iterative techniques for solving sparse symmetric positive definite (SPD) linear systems. The conjugate gradient method was invented in the 1950s as a direct method. It has come into wide use over the last 15 years as an iterative method and has generally superseded the stationary iterative methods. The CG method is the prototype of the Krylov subspace methods. It is an orthogonal projection method and satisfies a minimality condition: the error is minimal in the energy norm or $A$-norm, $\|u\|_A = (u^TAu)^{1/2}$.

Consider the quadratic function
$$\varphi(u) = \tfrac12 u^TAu - f^Tu + \gamma. \qquad (2.31)$$
Since $\varphi(u)$ is convex, it has a unique minimum; assume $\varphi(u^*)$ is the minimal value. It satisfies $\nabla\varphi(u^*) = Au^* - f = 0$, so $u^*$ is the solution. If we choose $\gamma = \tfrac12 f^TA^{-1}f$, it is easy to see that
$$\|f - Au\|_{A^{-1}}^2 = \|u - u^*\|_A^2 = u^TAu - 2(u^*)^TAu + (u^*)^TAu^* = u^TAu - 2f^Tu + f^TA^{-1}f = 2\varphi(u).$$
So solving the equation $Au = f$ is equivalent to minimizing the quadratic function $\varphi(u)$, and also equivalent to minimizing the energy norm of the error vector $u - u^*$. In the $m$-th step, we want to find $u^m$ such that
$$u^m = \mathop{\arg\min}_{u\in u^0 + K_m}\varphi(u). \qquad (2.32)$$

By doing this iteratively, we can find the solution within $N$ steps. Choosing search directions $\{v^m\}$ that are conjugate ($A$-orthogonal) to each other, i.e. $(v^m)^TAv^j = 0$ for $j = 0, 1, \ldots, m-1$, we define
$$u^m = u^{m-1} + \omega_{m-1}v^{m-1}, \qquad r^m = r^{m-1} - \omega_{m-1}Av^{m-1},$$
where $\omega_{m-1}$ is chosen to minimize the $A$-norm of the error on the line $u^{m-1} + \omega v^{m-1}$. This means
$$\omega_{m-1} = \frac{(r^{m-1})^Tv^{m-1}}{(v^{m-1})^TAv^{m-1}}. \qquad (2.33)$$
Associated with this minimality is the Galerkin condition $r^m \perp K_m$, which implies that the $\{r^m\}$ form an orthogonal basis of $K_N$. Constructing the conjugate search directions from the residuals $\{r^m\}$ yields the standard CG method. The detailed algorithm is as follows:

Algorithm 6 CG Method
  $r^0 = f - Au^0$; $p^0 = r^0$;
  for $m = 0, 1, \ldots$, until convergence do
    $\alpha_m \leftarrow \dfrac{(r^m)^Tr^m}{(p^m)^TAp^m}$
    $u^{m+1} \leftarrow u^m + \alpha_mp^m$, $r^{m+1} \leftarrow r^m - \alpha_mAp^m$
    if $r^{m+1}$ is sufficiently small then exit loop
    $\beta_m \leftarrow \dfrac{(r^{m+1})^Tr^{m+1}}{(r^m)^Tr^m}$
    $p^{m+1} \leftarrow r^{m+1} + \beta_mp^m$

The conjugate gradient method can theoretically be viewed as a direct method, as it produces the exact solution after a finite number of iterations, not larger than the size of the matrix, in the absence of round-off error. However, the conjugate gradient method is sensitive to even small perturbations: in practice most directions are not exactly conjugate, and the exact solution is never obtained. Fortunately, the conjugate

gradient method can be used as an iterative method, as it provides monotonically improving approximations $u^m$ to the exact solution, which may reach the required tolerance after a relatively small (compared to the problem size) number of iterations. The improvement is typically linear and its speed is determined by the condition number $\kappa(A)$ of the system matrix $A$:
$$\kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)},$$
where $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ are the largest and smallest eigenvalues of $A$. The convergence analysis of CG from [84] is as follows.

Theorem (Convergence of CG). Assume that $u^m$ is the $m$-th iterate of the CG method and $u^*$ is the exact solution. Then we have the estimate
$$\|u^* - u^m\|_A \le 2\left(\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\right)^m\|u^* - u^0\|_A. \qquad (2.34)$$

Proof. For an arbitrary polynomial $q_{m-1}$ of degree $m-1$, denote $\tilde u^m = q_{m-1}(A)f = q_{m-1}(A)Au^* = Aq_{m-1}(A)u^*$. By (2.32) (taking $u^0 = 0$ for simplicity), we have
$$(u^* - u^m)^TA(u^* - u^m) \le \min_{q_{m-1}}(u^* - \tilde u^m)^TA(u^* - \tilde u^m) = \min_{q_{m-1}}\big((I - Aq_{m-1}(A))u^*\big)^TA(I - Aq_{m-1}(A))u^*$$
$$= \min_{q_m(0)=1}(q_m(A)u^*)^TAq_m(A)u^* \le \min_{q_m(0)=1}\max_{\lambda\in\sigma(A)}q_m(\lambda)^2\,(u^*)^TAu^* \le \min_{q_m(0)=1}\max_{\lambda\in[a,b]}q_m(\lambda)^2\,(u^*)^TAu^*.$$
Here $a = \lambda_{\min}(A)$ and $b = \lambda_{\max}(A)$. We choose
$$q_m(\lambda) = \frac{T_m\!\left(\dfrac{b + a - 2\lambda}{b - a}\right)}{T_m\!\left(\dfrac{b + a}{b - a}\right)}.$$

Here $T_m(t)$ is the Chebyshev polynomial of degree $m$ given by
$$T_m(t) = \begin{cases}\cos(m\cos^{-1}t) & \text{if } |t| \le 1; \\ (\operatorname{sign}(t))^m\cosh(m\cosh^{-1}|t|) & \text{if } |t| \ge 1. \end{cases} \qquad (2.35)$$
Notice that $\big|T_m\!\big(\frac{b+a-2\lambda}{b-a}\big)\big| \le 1$ for $\lambda\in[a,b]$. Thus
$$\max_{\lambda\in[a,b]}|q_m(\lambda)| \le \left[T_m\!\left(\frac{b+a}{b-a}\right)\right]^{-1}.$$
We set
$$\frac{b+a}{b-a} = \cosh\sigma = \frac{e^{\sigma} + e^{-\sigma}}{2}.$$
Solving this equation for $e^{\sigma}$, we have
$$e^{\sigma} = \frac{\sqrt{\kappa(A)}+1}{\sqrt{\kappa(A)}-1}$$
since $\kappa(A) = b/a$. We then obtain
$$\cosh m\sigma = \frac{e^{m\sigma} + e^{-m\sigma}}{2} \ge \frac{e^{m\sigma}}{2} = \frac12\left(\frac{\sqrt{\kappa(A)}+1}{\sqrt{\kappa(A)}-1}\right)^m.$$
Consequently,
$$\min_{q_m(0)=1}\max_{\lambda\in[a,b]}|q_m(\lambda)| \le 2\left(\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\right)^m.$$
The desired result then follows.

Even though the estimate given above is sufficient for many applications, in general it is not sharp. There are many ways to sharpen the estimate. For example, the following improved estimate shows that the convergence of the CG method depends on the distribution of the spectrum of $A$: it is possible for the CG method to converge fast even when the condition number of $A$ is large.

Theorem [85]. Assume that $\sigma(A) = \sigma_0(A)\cup\sigma_1(A)$ and that $l$ is the number of elements in $\sigma_0(A)$. Then
$$\|u^* - u^m\|_A \le 2M\left(\frac{\sqrt{b/a}-1}{\sqrt{b/a}+1}\right)^{m-l}\|u^* - u^0\|_A,$$

where $a = \min_{\lambda\in\sigma_1(A)}\lambda$, $b = \max_{\lambda\in\sigma_1(A)}\lambda$, and
$$M = \max_{\lambda\in\sigma_1(A)}\prod_{\mu\in\sigma_0(A)}\left|1 - \frac{\lambda}{\mu}\right|.$$
In practice, $\kappa(A) \gg 1$ if $A$ is ill-conditioned, and in that case a fast convergence rate cannot be guaranteed.

2.3 Preconditioned Iterations

Although all of the iterative methods above are well founded theoretically, they are all likely to suffer from slow convergence for problems arising from typical applications. Preconditioning is a key ingredient for the success of Krylov subspace methods in applications. Both the efficiency and the robustness of iterative techniques can be improved by using preconditioners. Preconditioning simply transforms the original linear system into one that has the same solution but is easier to solve with an iterative solver. In general, the reliability of iterative techniques, when dealing with various applications, depends much more on the quality of the preconditioner than on the particular Krylov subspace accelerator used. In this section, we first discuss the preconditioned versions of the Krylov subspace algorithms using a generic preconditioner. Then, we cover some commonly used preconditioners.

To begin with, it is worthwhile to consider the options available for preconditioning a system. The first step in preconditioning is to find a preconditioning matrix $B$. The matrix $B$ can be defined in many different ways, but it must satisfy a few minimal requirements. $B$ should be an approximation of $A^{-1}$ in some sense. From a practical point of view, the most important requirement for $B$ is that it be inexpensive to construct and apply, because the preconditioned algorithms all require the application of $B$ at each step. Once a preconditioning matrix $B$ is defined, there are three known ways of applying the preconditioner. The preconditioner can be applied from the left, leading to the preconditioned system
$$BAu = Bf. \qquad (2.36)$$
Alternatively, it can also be applied to the right:
$$ABv = f, \qquad u = Bv. \qquad (2.37)$$

Note that the above formulation amounts to making the change of variables $u = Bv$ and solving the system with respect to the new unknown $v$. The third common situation is applying the preconditioner on both sides, which is called a split preconditioner:
$$B_1AB_2v = B_1f, \qquad u = B_2v, \qquad B = B_1B_2. \qquad (2.38)$$

Preconditioned Conjugate Gradient Method

Recall that CG is designed for symmetric positive definite systems, so it is imperative to preserve symmetry. In general, the right and left preconditioned operators are no longer symmetric. In order to design the preconditioned conjugate gradient method (PCG), we need to consider strategies for preserving symmetry.

When $B$ is available in the form of an incomplete Cholesky factorization, i.e. $B = LL^T$, it is simple to use the split preconditioner as in (2.38), which leads to
$$LAL^Tv = Lf, \qquad u = L^Tv. \qquad (2.39)$$
Applying CG to this system, we obtain the corresponding PCG method.

Algorithm 7 PCG Method for a split preconditioner
  $r^0 = f - Au^0$; $\hat r^0 = Lr^0$; $p^0 = L^T\hat r^0$
  for $m = 0, 1, \ldots$, until convergence do
    $\alpha_m \leftarrow \dfrac{(\hat r^m)^T\hat r^m}{(p^m)^TAp^m}$
    $u^{m+1} \leftarrow u^m + \alpha_mp^m$, $\hat r^{m+1} \leftarrow \hat r^m - \alpha_mLAp^m$
    $\beta_m \leftarrow \dfrac{(\hat r^{m+1})^T\hat r^{m+1}}{(\hat r^m)^T\hat r^m}$
    $p^{m+1} \leftarrow L^T\hat r^{m+1} + \beta_mp^m$

However, it is not necessary to split the preconditioner to preserve symmetry. Since $BA$ is self-adjoint in the $B^{-1}$ inner product,
$$(BAu, v)_{B^{-1}} = (Au, v) = (u, BAv)_{B^{-1}},$$
an alternative is to replace the usual Euclidean inner product in the Conjugate

Gradient algorithm by the $B^{-1}$ inner product. Note that the $B^{-1}$ inner product does not have to be computed explicitly. With this observation, the following algorithm is obtained.

Algorithm 8 PCG Method for a left preconditioner
  $r^0 = f - Au^0$; $z^0 = Br^0$; $p^0 = z^0$
  for $m = 0, 1, \ldots$, until convergence do
    $\alpha_m \leftarrow \dfrac{(r^m)^Tz^m}{(p^m)^TAp^m}$
    $u^{m+1} \leftarrow u^m + \alpha_mp^m$, $r^{m+1} \leftarrow r^m - \alpha_mAp^m$
    $z^{m+1} \leftarrow Br^{m+1}$
    $\beta_m \leftarrow \dfrac{(r^{m+1})^Tz^{m+1}}{(r^m)^Tz^m}$
    $p^{m+1} \leftarrow z^{m+1} + \beta_mp^m$

Observing that $BA$ is also self-adjoint with respect to the $A$ inner product, i.e.
$$(BAu, v)_A = (BAu, Av) = (u, BAv)_A,$$
a similar PCG with a left preconditioner can be written under this inner product. Consider now the right preconditioned system (2.37). $AB$ is self-adjoint in the $B$ inner product, so the PCG algorithm can be written with respect to the variable $u$ under this new inner product. It can be verified that the same sequence of computations is obtained as with Algorithm 8. The implication is that the left preconditioned CG algorithm with the $B^{-1}$ inner product is mathematically equivalent to the right preconditioned CG algorithm with the $B$ inner product. The following convergence estimate holds for PCG:
$$\|u^* - u^m\|_A \le 2\left(\frac{\sqrt{\kappa(BA)}-1}{\sqrt{\kappa(BA)}+1}\right)^m\|u^* - u^0\|_A. \qquad (2.40)$$
So a good preconditioner for PCG should make $BA$ better conditioned, i.e. $\kappa(BA) < \kappa(A)$. Knowing that any iterative method can be seen as an operator $B$ that approximates $A^{-1}$, we now prove that any convergent iterative method can accelerate CG.

Theorem. Assume that $B$ is symmetric with respect to the inner product $(\cdot,\cdot)$. If the

iterative scheme $u^{m+1} = u^m + B(f - Au^m)$ is convergent, then the operator $B$ from the scheme can be used as a preconditioner for $A$, and the PCG method converges at a faster rate.

Proof. If the iterative scheme is convergent, then $\rho := \rho(I - BA) < 1$. Since $B$ is symmetric, $I - BA$ is symmetric with respect to $(\cdot,\cdot)_A$, and by definition
$$1 > \rho = \|I - BA\|_A \ge \sup_{v\ne0}\frac{((I - BA)v, v)_A}{\|v\|_A^2} = 1 - \inf_{v\ne0}\frac{(BAv, v)_A}{\|v\|_A^2},$$
hence
$$\inf_{v\ne0}\frac{(BAv, Av)}{\|v\|_A^2} > 0,$$
which means $B$ is symmetric positive definite (since $w = Av$ ranges over all of $V$), and we can use $B$ as a preconditioner in PCG. Then, since
$$(1 - \|I - BA\|_A)\,\|v\|_A^2 \le (BAv, v)_A \le (1 + \|I - BA\|_A)\,\|v\|_A^2,$$
we have
$$\kappa(BA) \le \frac{1+\rho}{1-\rho}.$$
So
$$\frac{\sqrt{\kappa(BA)}-1}{\sqrt{\kappa(BA)}+1} \le \frac{\sqrt{1+\rho}-\sqrt{1-\rho}}{\sqrt{1+\rho}+\sqrt{1-\rho}} = \frac{\rho}{1+\sqrt{1-\rho^2}} < \rho.$$
The desired result then follows.

We conclude that for any linear iterative scheme for which $B$ is symmetric, a preconditioner for $A$ can be obtained and the convergence rate can be accelerated by using the PCG method. For example, a preconditioner can be obtained from the symmetric SOR method as $B = SS^T$ with $S = (D + \omega U)^{-1}D^{1/2}$. A more interesting case is that the scheme may not be convergent at all, whereas $B$ can still be a preconditioner. For example, the Jacobi method is not convergent for all SPD systems, but $B = D^{-1}$ can always be used as a preconditioner. This preconditioner is often known as the diagonal preconditioner.

Preconditioning Techniques

Finding a good preconditioner to solve a sparse linear system is often viewed as a combination of art and science. A preconditioner can be defined as a solver that is combined with an

outer iteration, like the Krylov subspace iterations. Roughly speaking, a preconditioner is any form of implicit or explicit modification of the linear system. In general, a good preconditioner should be cheap to construct and apply and easy to solve.

Generally speaking, there are two approaches to constructing preconditioners. One popular approach is purely algebraic methods that only use the information in the matrix $A$. The algebraic methods are often easy to develop and to use. For example, all of the stationary iterative methods can be applied as preconditioners: for a matrix splitting $A = M - N$, any $B = M^{-1}$ can serve as a preconditioner. Ideally, $M$ should be close to $A$ in some sense. The other common algebraic preconditioner is defined by an incomplete factorization of $A$. Assume we have a decomposition of the form $A = LU - R$, where $L$ and $U$ have the same nonzero structure as the lower and upper parts of $A$, respectively, and $R$ is the residual or error of the factorization. This factorization, known as ILU(0), is rather easy and inexpensive to compute. On the other hand, it often leads to a crude approximation, which may result in the Krylov subspace accelerator requiring many iterations to converge. To remedy this, several alternative incomplete factorizations have been developed by allowing more fill-in in $L$ and $U$. In general, the more accurate ILU factorizations require fewer iterations to converge, but the preprocessing cost to compute the factors is higher.

The algebraic methods may not always be efficient. The other approach is to design specialized algorithms that use more information for a narrow class of problems. By using knowledge of the problem, such as the continuous equations, the problem domain, and details of the discretization, very stable and efficient preconditioners can be developed. One typical example is the multigrid preconditioner.

2.4 Numerical Example

Finally, we use one example to illustrate the differences among the iterative methods discussed above and to make their advantages and drawbacks clearer. All of the methods are tested in MATLAB on a 2.3 GHz processor.

Comparison of the Iterative Method

The first test problem we consider is the Laplacian on the unit square with homogeneous Dirichlet boundary conditions:
$$-\Delta u = f \text{ in } \Omega := [0,1]^2, \qquad u = 0 \text{ on } \Gamma := \partial\Omega. \qquad (2.41)$$
Set $f = 2x(1-x) + 2y(1-y)$, so the exact solution is $u = x(1-x)y(1-y)$. Define the P1 finite element space $V$ on a $(2^l - 1)\times(2^l - 1)$ structured mesh with $N := (2^l - 1)^2$ vertices. Let $(\varphi_i)_{i=1}^N$ denote the Lagrange basis of $V$. The discrete stiffness matrix $A$ is defined by
$$A_{i,j} := \int_\Omega \nabla\varphi_j(x)\cdot\nabla\varphi_i(x)\,dx,$$
and the right-hand-side vector is defined by $f_i = \int_\Omega f\varphi_i(x)\,dx$. The linear systems generated from this equation are symmetric positive definite.

We compare the different stationary iterative methods using CG as a baseline (Table 2.1). As shown in Figure 2.2 and Figure 2.3, the Jacobi method converges slowest in both the number of iterations and the CPU time, and is considered an inefficient method in terms of run time. However, the primary advantage of the Jacobi method is its parallelism: the low cost of each iteration makes the Jacobi method attractive for parallel computing. The Gauss-Seidel method is faster because it uses the most recently updated values and therefore produces better corrections than the Jacobi method. The SOR method with an optimal parameter converges fastest among the stationary methods; however, it is difficult to find the best parameter in practice. On the other hand, the numerical results show that all of the stationary iterative methods are much slower than the CG method, and their convergence rates depend heavily on the conditioning of the matrix $A$; they may need thousands of iterations to reach an accurate solution. Because of these disadvantages, stationary iterative methods are rarely used as standalone solvers for linear systems.

Jacobi          129 (2.7)    368 (6.45)   1204 (18.4)   4309 (60.2)
Gauss-Seidel     65 (1.4)    185 (3.25)    603 (9.25)   2156 (30.15)
Symmetric GS     38 (0.9)     98 (1.9)     308 (4.9)    1084 (15.4)
SOR              22 (0.7)     37 (1.1)      67 (1.85)    127 (3.35)
CG                4 (0.1)     10 (0.2)      24 (0.5)      46 (1.2)

Table 2.1. Comparison of the different iterative methods for the Poisson equation. Each entry gives the number of iterations with the CPU time in parentheses; the columns correspond to a sequence of increasingly refined meshes.

Figure 2.2. Comparison of the number of iterations.
Figure 2.3. Comparison of the CPU time.
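The qualitative behaviour reported in Table 2.1 is easy to reproduce with a short script. The following sketch (written in Python with NumPy/SciPy rather than the MATLAB code that produced the table; the mesh level l = 5, the tolerance, and the helper name poisson2d are illustrative assumptions) assembles the standard five-point discretization of (2.41) and counts iterations of the Jacobi, Gauss-Seidel, and CG methods. SOR behaves analogously with the splitting matrix D + omega*L.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def poisson2d(l):
        """Five-point discretization of -Laplace u = f on the unit square
        (homogeneous Dirichlet data), (2^l - 1) x (2^l - 1) interior grid."""
        n = 2**l - 1
        h = 1.0 / (n + 1)
        T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
        I = sp.identity(n)
        A = (sp.kron(I, T) + sp.kron(T, I)) / h**2
        x = np.arange(1, n + 1) * h
        X, Y = np.meshgrid(x, x, indexing="ij")
        f = 2*X*(1 - X) + 2*Y*(1 - Y)          # right-hand side of (2.41)
        return A.tocsr(), f.ravel()

    def splitting_iteration(A, b, M, tol=1e-6, maxit=100000):
        """Stationary iteration x <- x + M^{-1}(b - A x) for a lower-triangular splitting matrix M."""
        M = sp.csr_matrix(M)
        x = np.zeros_like(b)
        nb = np.linalg.norm(b)
        for k in range(1, maxit + 1):
            r = b - A @ x
            if np.linalg.norm(r) <= tol * nb:
                return k
            x += spla.spsolve_triangular(M, r, lower=True)
        return maxit

    A, b = poisson2d(5)
    print("Jacobi      :", splitting_iteration(A, b, sp.diags(A.diagonal())))   # M = D
    print("Gauss-Seidel:", splitting_iteration(A, b, sp.tril(A)))               # M = D + L

    iters = [0]
    spla.cg(A, b, callback=lambda xk: iters.__setitem__(0, iters[0] + 1))
    print("CG          :", iters[0])

The absolute counts depend on the tolerance and mesh level, but the ordering Jacobi > Gauss-Seidel > CG matches Table 2.1.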

2.4.2 Comparison of the Preconditioners

In this subsection, we compare the effect of the different preconditioners, independently of the choice of the preconditioned iterative method. Table 2.2 shows the number of iterations for PCG with the different preconditioners (see also Figure 2.4). Clearly, the preconditioning helps significantly.

No preconditioner   Jacobi   Symmetric GS   Symmetric SOR   Incomplete Cholesky Factorization

Table 2.2. Comparison of the different preconditioners for the Poisson equation.

For this problem, PCG without a preconditioner (plain CG) took 230 iterations. With the symmetric Gauss-Seidel preconditioner the number of iterations drops to 122, a modest improvement; with the incomplete Cholesky IC(0) factorization preconditioner the number of iterations is 103; and with the symmetric SOR preconditioner with the optimal parameter it takes only 57 iterations. Therefore, by applying an appropriate preconditioner, the convergence improves significantly.

Figure 2.4. Comparison of the number of iterations for the different preconditioners.

The running time is another issue when choosing a good preconditioner. Normally, the computation time is proportional to the number of iterations: the fewer the iterations, the less computation time is required. However, since the computational cost per iteration is

different, a smaller iteration count does not always mean a smaller run time. For example, PCG with the Jacobi preconditioner may run faster than PCG with the symmetric SOR preconditioner on high-performance parallel computing resources. Choosing a good preconditioner for the situation and the matrix properties at hand is therefore very important: it can significantly reduce the computation time and allow fast convergence.
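As an illustration, the comparison behind Table 2.2 can be sketched in a few lines. The following Python/SciPy sketch is not the code used for the thesis experiments: the grid size and right-hand side are arbitrary, and SciPy's spilu (an ILU with restricted fill) is used only as a stand-in for the IC(0) factorization mentioned above, so the iteration counts will not match Table 2.2 exactly.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 2**6 - 1                                   # interior grid points per direction (assumption)
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr() * (n + 1)**2
    b = np.ones(A.shape[0])                        # generic right-hand side for the comparison

    D = A.diagonal()
    L = sp.tril(A).tocsr()                         # D + strictly lower part
    U = sp.triu(A).tocsr()                         # D + strictly upper part

    def jacobi(r):                                 # B = D^{-1}
        return r / D

    def sym_gs(r):                                 # B = (D + U)^{-1} D (D + L)^{-1}
        y = spla.spsolve_triangular(L, r, lower=True)
        return spla.spsolve_triangular(U, D * y, lower=False)

    ilu = spla.spilu(A.tocsc(), drop_tol=0.0, fill_factor=1.0)   # no-fill ILU, a stand-in for IC(0)

    def pcg_iterations(apply_prec=None):
        count = [0]
        M = spla.LinearOperator(A.shape, matvec=apply_prec) if apply_prec else None
        spla.cg(A, b, M=M, callback=lambda xk: count.__setitem__(0, count[0] + 1))
        return count[0]

    print("no preconditioner :", pcg_iterations())
    print("Jacobi            :", pcg_iterations(jacobi))
    print("symmetric GS      :", pcg_iterations(sym_gs))
    print("incomplete fact.  :", pcg_iterations(ilu.solve))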

Chapter 3

Multigrid Method and Fast Auxiliary Space Preconditioner

In recent decades, multigrid (MG) methods have become well established as among the most efficient iterative solvers and preconditioners for the linear system (1.1). Moreover, intensive research has been devoted to analyzing the convergence of MG. In particular, it can be proven that the geometric multigrid (GMG) method has linear complexity O(N), in terms of both computational work and memory, for a large class of elliptic boundary value problems, where N is the number of degrees of freedom. In this chapter, we introduce some basic ideas of the multigrid method from the viewpoint of subspace corrections.

3.1 Method of Subspace Correction

In the spirit of divide and conquer, we decompose the space $V$ into a sum of subspaces and correspondingly decompose the problem (1.1) into sub-problems of smaller size that are relatively easy to solve. This methodology was developed by Xu [19]. Let $V_i \subset V$ (for $0 \le i \le L$) be subspaces of $V$ that form a decomposition of $V$, i.e.,

$V = \sum_{i=0}^{L} V_i.$    (3.1)

45 34 This means that, for each v V, there exists v i V i (0 i L) such that v = L i=1 v i. This representation of v may not be unique in general, namely (3.1) is not necessarily a direct sum. We define the following operators, for i = 1,..., L: Q i : V V i the projection in the L 2 inner product (, ); I i : V i V the natural inclusion to V; P i : V V i the projection in the inner product (, ) A ; A i : V i V i the restriction of A to the subspace V i ; R i : V i V i an approximation of (A i ) 1 which means the smoother; T i : V V i T i = R i Q i A = R i A i P i. For any u V and u i, v i V i, these operators fulfil the trivial equalities (Q i u, v i ) = (u, I i v i ) = (I t i u, v i ), (A i P i u, v i ) = a(u, v i ) = (Q i Au, v i ), (A i u i, v i ) = a(u i, v i ) = (Au i, v i ) = (Q i AI i u i, v i ). Q i and P i are both orthogonal projections and A i is the restriction of A on V i and is SPD. Q T i coincides with the natural inclusion I i and thus sometimes are omitted. The matrix or operator A is understood as the bilinear function on V V. Then the restriction on subspaces is A i = Ii T AI i. It follows from the definition that A i P i = Q i A. (3.2) This identity is of fundamental importance and will be used frequently in this chapter. A consequence of it is that, if u is the solution of (1.1), then A i u i = f i (3.3) with u i = P i u and f i = Q i f. This equation may be regarded as the restriction of (1.1) to V i.

46 35 Since V i V, we may consider the natural inclusion I i : V i V defined by I i v i = v i, v i V i. We notice that Q i = I T i because (Q i u, v i ) = (u, v i ) = (u, I i v i ) = (I T i, u, v i ). Similarly, we have P i = I T i, where I T i is the transpose of I i w.r.t. (, ) A. We note that the solution u i of (3.3) is the best approximation of the solution u of (1.1) in the subspace V i in the sense that and J(u i ) = min J(v), with J(v) = 1 (Av, v) (f, v) v V i 2 u u i A = min v V i u v A. The subspace equation (3.3) will be in general solved approximately. To describe this, we introduce, for each i, another non-singular operator R i : V i V i that represents an approximate inverse of A i in certain sense. Thus an approximate solution of (3.3) may be given by û i = R i f i. The consistent notation for the smoother R i is B i, the iterator for each local problem. But we reserve the notation B for the iterator of the original problem. Last, let us look at T i = R i Q i A = R i A i P i. When R i = A 1 i, from the definition, T i = P i = A 1 i Q i A. When T i Vi : V i V i, the projection P i is identity and thus T i Vi = R i A i. With a slight abuse of notation, we use T 1 i (T i u i, u i ) A = (R i A i u i, A i u i ), (T 1 i = (T i Vi ) 1. The action of T 1 i u, u) A = (R 1 i u, u). and T i is Parallel Subspace Correction and Successive Subspace Correction From the viewpoint of subspace correction, most linear iterative methods can be classified into two major algorithms, namely the parallel subspace correction (PSC) method and the successive subspace correction method (SSC).

47 each subspace V i A i e i = Q i r old. 36 PSC: Parallel subspace correction. This type of algorithm is similar to Jacobi method. The idea is to correct the residue equation on each subspace in parallel. Let u old be a given approximation of the solution u of (1.1). The accuracy of this approximation can be measured by the residual: r old = f Au old. If r old = 0 or very small, we are done. Otherwise, we consider the residual equation: Ae = r old. Obviously u = u old + e is the solution of (1.1). Instead we solve the restricted equation to Since we can get Q i Ae = Q i r old and we know that Q i A = A i P i, it is easy to see A i P i e = Q i r old. Then it is clear that e i = P i e. It should be helpful to notice that the solution e i is the best possible correction u old in the subspace V i in the sense that and J(u old + e i ) = min J(u old + e), with J(v) = 1 (Av, v) (f, v) e V i 2 u (u old + e i ) A = min e V i u (u old + e) A. As we are only seeking for a correction, we only need to solve this equation approximately using the subspace solver R i described earlier ê i = R i Q i r old. An update of the approximation of u is obtained by u new = u old + J I i ê i i=1 which can be written as u new = u old + B(f Au old ),

48 37 where B = J I i R i Q i = J I i R i Ii t. (3.4) i=1 i=1 We have therefore Algorithm 9 Parallel subspace correction Given u 0 V; apply the Algorithm 1 with B given in (3.4). It is well-known that the Jacobi method is not convergent for all SPD problems, hence Algorithm 9 is not always convergent. However the preconditioner obtained from this algorithm is of great performance. Lemma The operator B given by (3.4) is SPD if each R i : V i V i is SPD. Proof. The symmetry of B follows from the symmetry of R i. Now, for any v V, we have (Bv, v) = J i=1 (R iq i v, Q i v) 0. If (Bv, v) = 0, we then have Q i v = 0 for all i. Let v i V i be such that v = i v i, then (v, v) = i (v, v i) = i (Q iv, v i ) = 0. Therefore v = 0 and B hence is positive and definite. SSC: Successive subspace correction. This type of algorithm is similar to the Gauss- Seidel method. To improve the PSC method that makes simultaneous correction, we here make the correction in one subspace at a time by using the most updated approximation of u. More precisely, starting from v 0 = u old and correcting its residule in V 1 gives v 1 = v 0 + I 1 R 1 I t 1(f Av 1 ). By correcting the new approximation v 1 in the next space V 2, we get v 2 = v 1 + I 2 R 2 I t 2(f Av 1 ). Proceeding this way successively for all V i leads to Let T i = R i Q i A. By (3.2), T i = R i A i P i. Note that T i : V V i is symmetric with respect to A(, ) and nonnegative and that T i = P i if R i = A 1 i.

49 38 Algorithm 10 Successive subspace correction Given u 0 V; for k = 0, 1,, until convergence do v u k ; for i = 1 : L do v v + R i Q i (f Av); u k+1 v. If u is the exact solution of (1.1), then f = Au. Let v i be the ith iterate (with v 0 = u k ) from Algorithm 10, we have by definition u v i+1 = (I T i )(u v i ), i = 1,, L. A successive application of this identity yields u u k+1 = E L (u u k ), (3.5) where E L = (I T L )(I T L 1 ) (I T 1 ). (3.6) Algorithm 10 can also be symmetrized. Algorithm 11 Symmetric successive subspace correction Given u 0 V; v u 0 ; for k = 0, 1,... until convergence do for i = 1 : J and i = J : 1 : 1 do v v + R i Q i (f Av) The advantage of the symmetrized algorithm is that it can be used as a preconditioner. In fact, 11 can be formulated in the Algorithm 1 with operator B defined as follows: For f V, let Bf = u 1 with u 1 obtained by 11 applied to (1.1) with u 0 = 0. Similar to Young s SOR method, let us introduce a relaxation method.

Algorithm 12 SOR successive subspace correction
Input: $u^0 \in V$. Output: $v \in V$.
$v \leftarrow u^0$;
for $k = 0, 1, \ldots$ until convergence do
  for $i = 1 : J$ do
    $v \leftarrow v + \omega R_i Q_i (f - Av)$

As in the SOR method, a proper choice of $\omega$ can improve the convergence rate, but it is not easy to find an optimal $\omega$ in general. The above algorithm is essentially the same as Algorithm 10, since we can absorb the relaxation parameter $\omega$ into the definition of $R_i$.

Like the colored Gauss-Seidel method, the SSC iteration can also be colored and parallelized. Associated with the given decomposition (3.1), a coloring of the index set $J = \{0, 1, 2, \ldots, L\}$ is a disjoint decomposition $J = \cup_{t=1}^{L_c} J^{(t)}$ such that $P_i P_j = 0$ for any $i, j \in J^{(t)}$ with $i \neq j$ ($1 \le t \le L_c$). We say that $i$ and $j$ have the same color if they both belong to the same $J^{(t)}$. The important property of the coloring is that the SSC iteration can be carried out in parallel within each color.

Algorithm 13 Colored SSC
Input: $u^0 \in V$. Output: $v \in V$.
$v \leftarrow u^0$;
for $k = 0, 1, \ldots$ until convergence do
  for $t = 1 : L_c$ do
    $v \leftarrow v + \sum_{i \in J^{(t)}} I_i R_i I_i^t (f - Av)$

We note that the terms under the sum in the above algorithm can be evaluated in parallel (for each $t$, namely within the same color).
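The additive and multiplicative nature of PSC and SSC is easy to see in matrix form. The following sketch is an illustration only: the SPD matrix is a small 1D model, the overlapping index blocks playing the role of the subspaces $V_i$ are chosen arbitrarily, and exact subspace solvers $R_i = A_i^{-1}$ are used.

    import numpy as np

    n = 64
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)               # small SPD model matrix
    blocks = [np.arange(i, min(i + 8, n)) for i in range(0, n, 4)]   # overlapping index sets for the V_i

    def psc_apply(r):
        """PSC preconditioner (3.4): B r = sum_i I_i A_i^{-1} I_i^t r, exact subspace solvers."""
        u = np.zeros_like(r)
        for idx in blocks:
            u[idx] += np.linalg.solve(A[np.ix_(idx, idx)], r[idx])
        return u

    def ssc_sweep(u, f):
        """One SSC sweep (Algorithm 10): each correction uses the most recent residual."""
        for idx in blocks:
            r = f - A @ u
            u[idx] += np.linalg.solve(A[np.ix_(idx, idx)], r[idx])
        return u

    f = np.ones(n)
    u = np.zeros(n)
    for _ in range(100):                     # SSC used as a stand-alone iteration
        u = ssc_sweep(u, f)
    print("SSC residual after 100 sweeps:", np.linalg.norm(f - A @ u))
    print("(B f, f) with the PSC operator (positive, since B is SPD):", psc_apply(f) @ f)

The PSC operator is typically used inside PCG rather than as an iteration of its own, in line with the remark after Algorithm 9.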

Multigrid viewed as Multilevel Subspace Corrections

From the space decomposition point of view, a multigrid algorithm can be viewed as a subspace correction method based on subspaces defined on a nested sequence of triangulations. In this section, we rederive a class of multigrid methods by a successive application of the overlapping domain decomposition method. We also provide a complete convergence analysis using the Xu-Zikatanov identity and related results. For simplicity, we illustrate the technique by considering the linear finite element method for the Poisson equation

$-\Delta u = f$ in $\Omega$,  $u = 0$ on $\partial\Omega$,    (3.7)

where $\Omega \subset \mathbb{R}^d$ is a polyhedral domain. The weak formulation of (3.7) is as follows: given $f \in H^{-1}(\Omega)$, find $u \in H_0^1(\Omega)$ such that

$a(u, v) = \langle f, v\rangle$ for all $v \in H_0^1(\Omega)$,    (3.8)

where $a(u, v) = (\nabla u, \nabla v) = \int_\Omega \nabla u \cdot \nabla v\, dx$ and $\langle\cdot,\cdot\rangle$ is the duality pairing between $H^{-1}(\Omega)$ and $H_0^1(\Omega)$. By the Poincaré inequality, $a(\cdot,\cdot)$ is an inner product on $H_0^1(\Omega)$; thus, by the Riesz representation theorem, for any $f \in H^{-1}(\Omega)$ there exists a unique $u \in H_0^1(\Omega)$ such that (3.8) holds. Furthermore, we have the following regularity result: there exists $\alpha \in (0, 1]$, which depends on the smoothness of $\Omega$, such that

$\|u\|_{1+\alpha} \lesssim \|f\|_{\alpha-1}.$    (3.9)

This inequality is valid if $\Omega$ is convex or $\partial\Omega$ is $C^{1,1}$.

We assume that the triangulations $\{T_k\}$, $k = 1, \ldots, J$, are constructed by a successive refinement process. More precisely, $T_h = T_J$ for some $J > 1$, and the $T_k$ for $k \le J$ form a nested sequence of quasi-uniform triangulations, i.e., $T_k$ consists of simplexes $T_k = \{\tau_k^i\}$ of size $h_k$ such that $\Omega = \cup_i \tau_k^i$, the quasi-uniformity constants are independent of $k$ (cf. [86]), and each $\tau_{k-1}^l$ is a union of simplexes of $\{\tau_k^i\}$. We further assume that there is a constant $\gamma < 1$, independent of $k$, such that $h_k$ is proportional to $\gamma^{2k}$. We then have a nested sequence of

52 41 quasi-uniform triangulations T 0 T 1 T J = T h As an example, in the two dimensional case, a finer grid is obtained by connecting the midpoints of the edges of the triangles of the coarser grid, with T 1 being the given coarsest initial triangulation, which is quasi-uniform. In this example, γ = 1/ 2. Corresponding to each triangulation T k, a finite element space V k can be defined by V k = {v H 1 0(Ω) : v τ P 1 (τ), τ T k }. Obviously, the following inclusion relation holds: V 0 V 1... V J = V. We assume that h = h J is sufficiently small and h 1 is of unit size. Note that J = O( log h ). Naturally we have a macro space decomposition V = J k=0 V k Let N k be the dimension of V k, i.e., the number of interior vertices of T k. The standard nodal basis in V k will be denoted by ϕ k,i, i = 1,, N k. The micro decomposition is V k = N k i=1 V k,i with V k,i = span{ϕ k,i }. By choosing the right R k,i, the PSC method on V k is equivalent to Richardson method or Jacobi method. In summary, we choose the space decomposition: V = J V k = k=0 J N k V k,i (3.10) k=0 i=1 If we apply PSC to the decomposition (3.10) with R k,i = h 2 k I k,i, we obtain I k,i R k,i Ik,i T v = 2 2 d (v, ϕ k,i )ϕ k,i. The resulting operator B, according to (3.4), is the so-called BPX preconditioner [16]: J N k Bv = 2 2 d (v, ϕ k,i )ϕ k,i. (3.11) k=0 i=1 Let T J be the finest triangulation in the multilevel structure described earlier with nodes

53 42 {x i } N J i=1. With such a triangulation, a natural domain decomposition is Ω = Ω ( ) N J h 0 supp φ i, i=1 where φ i is the nodal basis function in V J associated with the node x i and Ω h 0, which may be empty, is the region where all functions in V J vanish. It is easy to see that the corresponding decomposition method without a coarse space is exactly the Gauss-Seidel method which is known to be inefficient (its convergence rate is known to be 1 O(h 2 J )). The more interesting case is when a coarse space is introduced. The choice of such a coarse space is clear here, namely V J 1. There remains to choose a solver for V J 1. To do this, we may repeat the above process by using the space V J 2 as a coarser space with the supports of the nodal basis function in V J 1 as a domain decomposition. We continue in this way until we reach a coarse space V 0 where a direct solver can be used. As a result, a multilevel algorithm based on domain decomposition is obtained. This procedure can be illustrated by the following diagram: V J (GS) J + V J 1 (GS) J 1 + V J 2 (GS) J 2 + V J 3... This resulting algorithm is a very basic multigrid method cycle, which may be called the backslash (\) cycle. Interpretation of the multigrid method as a special Gauss-Seidel method: A careful inspection on the multigrid algorithm derived above shows clearly that this algorithm is nothing but the SSC to the decomposition (3.10) with exact subspace problems solvers R k,i = A 1 k,i. Apparently the SSC method for this decomposition is nothing but the simple Gauss-Seidel iteration for the following matrix A MG = ( a(φ k i, φ l j) ) k,l=1:j,i,j=1:n k

This matrix is symmetric and positive semidefinite and is called the extended stiffness matrix by Michael Griebel; see Griebel [87] and Griebel and Oswald [88].

We shall first present this method from a more classical point of view. This classical approach makes it easier to introduce many different variants of multigrid methods and also makes it possible to analyze their convergence with classical tools. A multigrid process can be viewed as defining a sequence of operators $B_k : V_k \to V_k$ which are approximate inverses of $A_k$ in the sense that $\|I - B_k A_k\|_A$ is bounded away from 1. A typical way of defining such a sequence of operators is the following backslash cycle multigrid procedure.

Algorithm 14 \-cycle MG
For $k = 0$, define $B_0 = A_0^{-1}$. Assume that $B_{k-1} : V_{k-1} \to V_{k-1}$ is defined. We now define $B_k : V_k \to V_k$, an iterator for an equation of the form $A_k v = g$.
Fine grid smoothing: for $v^0 = 0$ and $l = 1, 2, \ldots, m$ do
  $v^l = v^{l-1} + R_k (g - A_k v^{l-1})$
Coarse grid correction: $e_{k-1} \in V_{k-1}$ is the approximate solution of the residual equation $A_{k-1} e = Q_{k-1}(g - A_k v^m)$ obtained by the iterator $B_{k-1}$:
  $e_{k-1} = B_{k-1} Q_{k-1}(g - A_k v^m)$.
Define $B_k g = v^m + e_{k-1}$.

After the first step, the error $v - v^m$ is small in the high frequencies. In other words, $v - v^m$ is smooth and hence can be well approximated in the coarse space $V_{k-1}$. The second step in the above algorithm plays the role of correcting the low frequencies using the coarser space $V_{k-1}$ and the coarse grid solver $B_{k-1}$ given by induction.
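Algorithm 14 translates almost line by line into code once the prolongation matrices of the grid hierarchy are available. The sketch below is illustrative only: a 1D model problem, weighted Jacobi smoothing as $R_k$, and linear interpolation between levels stand in for the general setting; the helper names are not part of the thesis implementation.

    import numpy as np

    def laplace1d(n):                       # tridiagonal model matrix A_k
        return 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

    def prolongation(nc):                   # linear interpolation: n_c coarse -> 2*n_c + 1 fine points
        P = np.zeros((2*nc + 1, nc))
        for j in range(nc):
            P[2*j, j], P[2*j + 1, j], P[2*j + 2, j] = 0.5, 1.0, 0.5
        return P

    def backslash_cycle(A, g, level, m=2):
        """B_k g of Algorithm 14: m smoothing steps, then one coarse-grid correction."""
        n = A.shape[0]
        if level == 0 or n <= 3:
            return np.linalg.solve(A, g)                 # B_0 = A_0^{-1}
        v = np.zeros(n)
        omega = 2.0 / 3.0
        for _ in range(m):                               # fine grid smoothing (weighted Jacobi R_k)
            v += omega * (g - A @ v) / np.diag(A)
        P = prolongation((n - 1) // 2)
        Ac = P.T @ A @ P                                 # Galerkin coarse matrix
        e = backslash_cycle(Ac, P.T @ (g - A @ v), level - 1, m)   # coarse grid correction
        return v + P @ e

    n, L = 2**7 - 1, 5
    A, f = laplace1d(n), np.ones(n)
    u = np.zeros(n)
    for k in range(20):                                  # simple iteration u <- u + B_J(f - A u)
        u += backslash_cycle(A, f - A @ u, L)
    print("residual:", np.linalg.norm(f - A @ u))

Adding post-smoothing after the coarse-grid correction in this routine turns it into the V-cycle of Algorithm 16.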

55 44 With the above defined B k, we may consider the following simple iteration u k+1 = u k + B J (f Au k ) (3.12) There are many different ways to make use of B k, which will be discussed late. Before we now study its convergence, we now discuss briefly the algebraic version of the above algorithm. Let Φ k = (φ k 1,, φ k n k ) be the nodal basis vector for the space V k, we define the so-called prolongation matrix I k+1 k R n k+1 n k as follows Φ k = Φ k+1 I k+1 k, (3.13) Algorithm 15 Matrix version of MG method Let B 0 = A 1 0. Assume that B k 1 R n k 1 n k 1 is defined; then for η R n k, Bk R n k n k is defined as follows: Fine grid smoothing: for ν 0 = 0 and l = 1, 2,, m do ν l = ν l 1 + R k (η A k ν l 1 ) Coarse grid correction: ε k 1 R n k 1 is the approximate solution of the residual equation A k 1 ε = (I) t (η A k ν m ) by using B k 1 ε k 1 = B k 1 (I k k 1) t (η A k ν m ). Define B k η = ν m + I k k 1ε k 1. The above algorithm is given in recurrence. But it can also be easily implemented in a non-recursive fashion. V-cycle and W-cycle Two important variants of the above backslash cycle are the so-called V-cycle and W-cycle. A V-cycle algorithm is obtained from the backslash cycle by performing more smoothings after the coarse grid corrections. Such an algorithm, roughly speaking, is like a backslash

56 45 (\) cycle plus a slash (/) (a reversed backslash) cycle. The detailed algorithm is given as follows. Algorithm 16 V-cycle MG For k = 0, define B 0 = A 1 0. Assume that B k 1 : V k 1 V k 1 is defined. We shall now define B k : V k V k which is an iterator for the equation of the form A k v = g. 1. pre-smoothing: For v 0 = 0 and l = 1, 2,, m v l = v l 1 + R k (g A k v l 1 ) 2. Coarse grid correction: e k 1 V k 1 is the approximate solution of the residual equation A k 1 e = Q k 1 (g Av m ) by the iterator B k 1 : e k 1 = B k 1 Q k 1 (g Av m ). 3. post-smoothing: For v m+1 = v m + e k 1 and l = m + 2, 2,, 2m v l = v l 1 + R k (g A k v l 1 ) Convergence Analysis The analysis of additive multilevel operator relies on the following identity which is well known in the literature [89, 19, 88, 90]. For completeness, we include a concise proof taken from [90]. Theorem (Identity for PSC). If R i is SPD on V i for i = 0,, L, then B defined by (3.4) is also SPD on V. Furthermore (B 1 v, v) = inf L i=0 v i=v L i=0 (R 1 i v i, v i ), (3.14)

57 46 and λ 1 min (BA) = sup inf (R 1 L v A =1 i=0 v i=v i v i, v i ). (3.15) Proof. Note that B is symmetric and L (Bv, v) = ( I i R i Ii T v, v) = i=0 L R i Q i v, Q i v), i=0 hence B is invertible and thus SPD. We now prove (3.14) by constructing a decomposition achieving the infimum. let ṽ i = R i Q i B 1 v, i = 0,, L. By the definition of B, we have a decomposition v = i ṽi. And it satisfies inf L i=0 v i=v Since L i=0 (R 1 i v i, v i ) = inf L i=0 w i=0 L i=0 = L i=0 (R 1 i ṽ i, u i ) = L i=0 (R 1 i (ṽ i + w i ), ṽ i + w i ) (R 1 i ṽ i, ṽ i ) + inf L i=0 w i=0 L (Q i B 1 v, u i ) = i=0 for any u i V i, i = 0,, L, we deduce inf L i=0 v i=v L i=0 (R 1 i v i, v i ) = (B 1 v, L ṽ i ) + i=0 [ 2 L i=0 (R 1 i ṽ i, w i ) + L (B 1 v, u i ) = (B 1 v, i=0 inf L i=0 w i=0 = (B 1 v, v) + inf L i=0 w i=0 = (B 1 v, v). L i=0 [ 2(B 1 v, (R 1 i w i, w i ) L w i ) + i=0 L i=0 (R 1 i w i, w i ) L u i ), i=0 L i=0 (R 1 i w i, w i ) ] ]. The proof of the equality (3.15) λ 1 min (BA) = λ 1 max((ba) 1 ((BA) 1 v, v) A (B 1 v, v) ) = sup = sup v V\{0} (v, v) A v V\{0} v 2 A = sup (B 1 v, v) = sup v A =1 inf (R 1 L v A =1 i=0 v i=v i v i, v i ).

58 47 As for additive methods, we now present an identity developed by Xu and Zikatanov [90] for multiplicative methods. For simplicity, we focus on the case R i = A 1 i, i = 0,..., L, i.e., the subspace solvers are exact. In this case I I i R i Ii T A = I P i. Theorem (X-Z Identity for SSC). The following identity is valid L (I P i ) 2 A = 1 1, (3.16) 1 + c 0 i=0 where c 0 = sup inf L v A =1 i=0 v i=v L P i i=0 L j=i+1 v j 2 A. (3.17) For SSC method with general smoothers, we assume that each subspace smoother R i induces a convergent iteration, i.e. the error operator I T i is a contraction. (T) Contraction of Subspace Error Operator: There exists ρ < 1 such that I T i Ai ρ for all i = 1,..., L. We associate with T i the adjoint operator Ti with respect to the inner product (, ) A. To deal with general non-symmetric smoothers R i, we introduce the symmetrization of T i : T i = T i + T i T i T i, i = 0,, L. We use a simplified version of XZ identity given by Cho, Xu, and Zikatanov [91]. Theorem (General X-Z Identity of SSC). if assumption (T) is valid, then the following identity holds where with w i = j>i v j. We now present a convergence analysis based on assumption (T) and two other assumptions. K = sup L (I T i ) i=0 inf L ( L v A =1 i=0 v i=v i=0 2 A = 1 1 K, 1 T i (v i + Ti w i ), v i + Ti w i ) A,

59 48 (A1) Stable Decomposition: For any v V, there exists a decomposition v = L v i, v i V i, i = 1,..., L, such that i=1 L v i 2 A i K 1 v 2 A. (3.18) i=1 (A2) Strengthened Cauchy-Schwarz (SCS) Inequality: For any u i, v i V i, i = 1,..., L L i=1 j=i+1 ( L L ) 1/2 ( L 1/2 (u i, v j ) A K 2 u i 2 A v j A) 2. i=1 The convergence theory of the method is as follows. Theorem Let V = L i=1 V i be a decomposition satisfying assumptions (A1) and (A2), and let the subspace smoothers R i satisfy (T). Then L (I I i R i Q i A) i=1 2 A 1 j=1 1 ρ 2 2K 1 (1 + (1 + ρ) 2 K 2 2). The proof can be found from [22, 21], which is simplified by using the XZ identity [90]. We shall give an upper bound of the constant K in Theorem by choosing the stable decomposition satisfying (3.18). For the optimality of the BPX preconditioner, we are to prove that the condition number κ(ba) is uniformly bounded and thus PCG using BPX preconditioner converges in a fixed number of steps for a given tolerance regardless of the mesh size. The estimate λ min (BA) 1 follows from the stability of the subspace decomposition. The first result is on the macro decomposition V = J k=0 V k. Lemma (Stability of macro decomposition). For any v V, there exists a decomposition v = J k=0 v k with v k V k, k = 0,, J such that J k=0 h 2 k v k 2 v 2 1. (3.19) Proof. Following the chronological development, we present two proofs. The first one uses full regularity and the second one minimal regularity. Full regularity H 2 : If we assume α = 1 in (3.9), which holds for convex polygons or

60 49 polyhedrons. Let P 1 = 0, we have the following decomposition v = J v k. (3.20) k=0 where v k = (P k P k 1 )v V k. The full regularity assumption leads to the L 2 error estimate of P k via a standard duality argument: (I P k )v h k (I P k )v 1, v H 1 0(Ω). Since V k 1 V k, we have P k 1 P k = P k 1 and P k P k 1 = (I P k 1 )(P k P k 1 ). So we have J k=0 h 2 k v k 2 = = J k=0 J k=0 h 2 k (P k P k 1 )v 2 h 2 k (I P k 1)(P k P k 1 )v 2 J (P k P k 1 )v 2 1 k=0 = v 2 1,Ω. In the last step, we have used the fact (P k P k 1 )v is the orthogonal decomposition in the A-inner product. Minimal regularity H 1 : If we relax the H 2 -regularity, we can use decomposition by L 2 projections J J v = v k = (Q k Q k 1 )v, k=0 k=0 where Q k : V k V k 1 is the L 2 projection onto V k and v k = (Q k Q k 1 )v. A simple proof of nearly optimal stability of (3.23) proceeds as follows. Invoking approximability and

61 50 H 1 -stability of the L 2 -projection Q k on quasi-uniform grids, we infer that (Q k Q k 1 )v = (I Q k 1 )Q k v h k Q k v 1 h k v 1. Therefore, J k=0 h 2 k v k 2 = J k=0 h 2 k (Q k Q k 1 )v 2 J u 2 1 log h u 2 1. The factor log h in the estimate can be removed by a more careful analysis based on the theory of Besov spaces and interpolation spaces. The following crucial inequality can be found, for example, in [19, 92, 93, 94, 95]: This completes the proof. J k=0 h 2 k v k 2 u 2 1. We next state the stability of the micro decomposition. For a finite element space V with nodal basis {ϕ i } N i=1, let Q ϕi be the L 2 -projection to the one dimensional subspace spanned by ϕ i. We have the following norm equivalence which says the nodal decomposition is stable in L 2. The proof is classical in the finite element analysis and thus omitted here. Lemma (Stability of micro decomposition). For any u V over a quasi-uniform mesh T, we have the norm equivalence N u 2 = Q ϕi u 2. (3.21) Theorem For any v V, there exists a decomposition of v of the form i=1 v = J N k v k,i, v k,i V k,i = span{ϕ k,i }, i = 1,, N k, k = 0,, J, (3.22) k=0 i=1 such that J N k h 2 k v k,i 2 v 2 1. (3.23) k=0 i=1 Consequently λ min (BA) 1 for the BPX preconditioner B defined in (3.11).

62 51 Proof. Since λ min (BA) = (ABAv, v) inf v V\{0} (Av, v) = inf v V\{0} ( (Av, v) (B 1 v, v) = sup v V\{0} Combining Lemma and 3.1.7, it is easy to prove (3.23) for (3.11). ) 1 (B 1 v, v). (Av, v) To estimate λ max (BA), we first present a strengthened Cauchy-Schwarz (SCS) inequality for the macro decomposition. Lemma (Strengthened Cauchy-Schwarz Inequality (SCS)). For any u i V i, v j V j, j > i, we have where γ < 1 is a constant such that h i = γ 2i. (u i, v j ) A γ j i u i 1 h 1 j v j 0, Proof. Let us first prove the inequality on one element τ T i. Using integration by parts, Cauchy-Schwarz inequality, trace theorem, and inverse inequality, we have (u i, v j ) A,τ = ( u i, v j ) τ = τ u i n v jds u i 0, τ v j 0, τ h 1 2 i u i 0,τ h 1 2 j v j 0,τ ( hj ) 1 2 u i 1,τ h 1 j v j 0,τ = h γ j i u i 1,τ h 1 j v j 0,τ. i Adding over τ T i and using Cauchy-Schwarz inequality, we have which is the asserted estimate. (u i, v j ) A = τ T i (u i, v j ) A,τ τ T i γ j i u i 1,τ h 1 j v j 0,τ ( ) 1 ( ) 1 γ j i h 1 j u i 2 2 1,τ v j 2 2 0,τ τ T i τ T i = γ j i h 1 j u i 1 v j 0 We also needs the following estimates for our main results

63 52 Lemma (Auxiliary estimate). Given γ < 1, we have N i,j=1 γ j i x i y j 2 ( N 1 γ i=1 x i 2 )( N i=1 y i 2 ), x i, y j R N, i = 1,, N, j = 1,, N. Proof. Let Γ R N N be the matrix such that Γ i,j = γ j i. The spectral radius ρ(γ) satisfies ρ(γ) Γ 1 = max 1 j N N i=1 γ j i 2 1 γ. x and y denotes (x i ) N i=1 and (y i ) N i=1. By Cauchy-Schwarz inequality, N γ j i x i y j = (Γx, y) ρ(γ) x 2 y 2. i,j=1 which is the desired estimate. Now we can have our estimates for the largest eigenvalue of BA. Theorem For any v V, we have (Av, v) inf J k=0 v k=v J k=0 h 2 k v k 2. So, λ max 1 for the BPX preconditioner B defined in (3.11). Proof. For v V, let v = J k=0 v k, v k V k, k = 0,, J be an arbitrary decomposition. By the SCS inequality of Lemma 3.1.9, J J J ( v, v) = 2 ( v k, v j ) + ( v k, v k ) k=0 j=k+1 k=0 J J k=0 j=k γ j k u k 1 h 1 j v j 0. Combining Lemma with the inverse inequality v k 1 h 1 k v k 0, we obtain ( J ( v, v) v k 2 1 k=0 ) 1 2 ( J k=0 ) 1 h 2 k v k J k=0 h 2 k v k 2 0. which is the assertion.
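In matrix form, the BPX preconditioner (3.11) analysed above is a sum over all levels of scaled nodal contributions, $B = \sum_k P_k D_k^{-1} P_k^T$, where $P_k$ interpolates from level $k$ to the finest level and $D_k$ is a diagonal scaling. The following sketch is an algebraic illustration under simplifying assumptions: a 1D model problem, linear interpolation, and $D_k = \mathrm{diag}(A_k)$ (a common diagonally scaled variant of BPX); it is not the preconditioner of the thesis experiments.

    import numpy as np

    def laplace1d(n):
        return 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

    def prolongation(nc):                          # linear interpolation from n_c to 2*n_c + 1 points
        P = np.zeros((2*nc + 1, nc))
        for j in range(nc):
            P[2*j, j], P[2*j + 1, j], P[2*j + 2, j] = 0.5, 1.0, 0.5
        return P

    def bpx_setup(A):
        """Collect (P_k, diag(A_k)) for all levels, P_k interpolating level k to the finest level."""
        ops, Pk, Ak = [], np.eye(A.shape[0]), A
        while True:
            ops.append((Pk.copy(), np.diag(Ak).copy()))
            nc = (Ak.shape[0] - 1) // 2
            if nc < 1:
                return ops
            P = prolongation(nc)
            Ak = P.T @ Ak @ P                      # Galerkin coarse matrix
            Pk = Pk @ P                            # compose interpolations up to the finest level

    def bpx_apply(ops, r):
        """B r = sum_k P_k D_k^{-1} P_k^T r  -- additive multilevel (BPX-type) preconditioner."""
        return sum(Pk @ ((Pk.T @ r) / dk) for Pk, dk in ops)

    A = laplace1d(2**7 - 1)
    ops = bpx_setup(A)
    B = sum(Pk @ np.diag(1.0 / dk) @ Pk.T for Pk, dk in ops)
    lam = np.sort(np.linalg.eigvals(B @ A).real)   # eigenvalues of BA are real and positive (B, A SPD)
    print("levels:", len(ops), " approximate kappa(BA):", lam[-1] / lam[0])

The printed condition number stays moderate as the mesh is refined, which is the behaviour the theorems of this section establish for the genuine finite element setting.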

64 53 Finally we can conclude the optimality of the BPX preconditioner defined in (3.11). Theorem (Optimality of BPX Preconditioner). For the preconditioner B defined in (3.11), we have κ(ba) 1. Next, we prove the uniform convergence of V-cycle multigrid, namely SSC applied to the decomposition (3.10) with exact subspace solvers. Lemma (Nodal Decomposition). Let T be a quasi-uniform triangulation with N nodal basis ϕ i. For the nodal decomposition v = N v i, v i = v(x i )ϕ i, i=1 we have N N P i v j 2 1 h 2 v 2. i=1 j>i Proof. For every 1 < i < N, we define the index set L i = {j N : i < j N, supp ϕ j supp ϕ i } and ω i = j Li supp ϕ j. Since T is quasi-uniform, the numbers of integers in each L i is uniformly bounded. So we have N N P i v j 2 1,Ω = i=1 j>i N N N N P i v j 2 1,Ω v j 2 1, ω i i=1 j L i i=1 j L i N N v i 2 1, ω i h 2 i v i 2 0, ω i i=1 i=1 where we have used an inverse inequality in the last step. Since the nodal basis decomposition is stable in the L 2 -inner product (Lemma 3.1.7), we deduce the desired estimate. Lemma The following inequality holds for all v V: J (P k Q k )v 2 1,Ω k=0 J k=0 h 2 k (Q k Q k 1 )v 2 0,Ω. (3.24)

65 54 Proof. Since (I Q k )v = J j=k+1 (Q j Q j 1 )v, we have J (P k Q k )v 2 1,Ω = k=0 = = J ((P k Q k )v, (P k Q k )v) A k=0 J ((P k Q k )v, (I Q k )v) A k=0 J k=0 j=k+1 J ((P k Q k )v, (Q j Q j 1 )v) A. Applying the strengthened Cauchy-Schwarz inequality (Lemma 3.1.9) yields J ( J (P k Q k )v 2 1,Ω (P k Q k )v 2 1,Ω k=0 The desired result then follows. k=1 ( J ) 1 2 h k 2 (Q j Q j 1 )v 2 2 0,Ω. Theorem (Optimality of \-cycle Multigrid). The \-cycle multigrid method, using SSC applied to the decomposition (3.10) with exact subspace solvers R i = A 1 i, converges uniformly. Proof. We use the following macro decomposition ) 1 k=0 with the nodal decomposition v = J v k, k=0 v k = (Q k Q k 1 )v, N k v k = v k,i, v k,i = v k (x i )ϕ k,i, i=1 For simplicity, let us use a lexicographical ordering: we say (l, j) (k, i) if l = k and j i, or l > k. It is easy to see that (l,j) (k,i) v l,j = N k j=i+1 v k,j + J N l l=k+1 j=1 v l,j = N k j=i+1 v k,j + (v v k )

66 55 In view of the expression (3.16), it follows that J N k P k,i k=0 k=0 i=1 i=1 (l,j)) (k,i) v l,j 2 1,Ω ( ) J N k N k P k,i (v P k v) 2 1,Ω + P k,i v k,j 2 1,Ω J N k k=0 i=1 N k j=i+1 J n k v k,i 2 1,Ω k,i k=0 J k=0 i=1 v 1,Ω. h 2 k v k 2 0,Ω v k,j 2 1,Ω k,i j=i+1 This means that the constant c 0 defined by (3.16) is bounded above by a constant that is independent of any mesh parameters. Consequently, by Theorem 3.1.3, the multigrid method converges uniformly with respect to any mesh parameters. The convergence result discussed above has been for quasi-uniform grids, but we would like to remark that uniform convergence can also be obtained for locally refined grids, For details, we refer to [16, 96, 97, 98]. The proof of Theorem hinges on Theorem (X-Z identity), which in turn requires exact solvers R i = A 1 i and makes P i = A 1 i Q i A the key operator to appear in, then the key operator becomes (3.16). If the smoothers R i are not exact, namely R i A 1 i T i = R i Q i A and Theorem must be replaced by Theorem We refer to [91] for details. 3.2 The Auxiliary Space Method In previous sections, we study multilevel methods formulated over a hierarchy of quasiuniform or graded meshes. The geometric structure of these meshes is essential for both the design and analysis of such methods. Unfortunately, many grids in practice are not hierarchical. We use the term unstructured grids to refer those grids that do not possess

much geometric or topological structure. The design and analysis of efficient multilevel solvers for unstructured grids is a topic of great theoretical and practical interest. The method of subspace correction consists of solving a system of equations in a vector space by solving problems on appropriately chosen subspaces of the original space. Such subspaces are, however, not always available. The auxiliary space method, developed in [99, 47], is a framework for designing preconditioners using auxiliary spaces which are not necessarily subspaces of the original space. Here, the original space is the finite element space $V$ for the given grid $T$, and the preconditioner is the multigrid method on the sequence $(\tilde V_l)_{l=1}^{J}$ of FE spaces for the auxiliary grids $(\tilde T_l)_{l=1}^{J}$. The idea of the method is to generate an auxiliary space $\tilde V$ with inner product $\tilde a(\cdot,\cdot) = (\tilde A\cdot,\cdot)$ and energy norm $\|\cdot\|_{\tilde A}$. Between the spaces there is a suitable linear transfer operator $\Pi : \tilde V \to V$, which is continuous and surjective. $\Pi^t : V \to \tilde V$ is the dual operator of $\Pi$ in the default inner products, $\langle \Pi^t u, \tilde v\rangle = \langle u, \Pi\tilde v\rangle$ for all $u \in V$, $\tilde v \in \tilde V$. In order to solve the linear system $Ax = b$, we use a preconditioner $B$ defined by

$B := S + \Pi \tilde B \Pi^t,$    (3.25)

where $S$ is the smoother and $\tilde B$ is the preconditioner of $\tilde A$. The estimate of the condition number $\kappa(BA)$ is given below.

Theorem 3.2.1. Assume that there are nonnegative constants $c_0$, $c_1$, and $c_s$ such that

1. the smoother $S$ is bounded, in the sense that
$\|v\|_A \le c_s \|v\|_{S^{-1}}$ for all $v \in V$;    (3.26)

2. the transfer operator $\Pi$ is bounded,
$\|\Pi w\|_A \le c_1 \|w\|_{\tilde A}$ for all $w \in \tilde V$;    (3.27)

68 57 3. the transfer is stable, i.e. for all v V there exists v 0 V and w V such that v = v 0 + Πw and v 0 2 S 1 + w 2 Ã c2 0 v 2 A, and (3.28) 4. the preconditioner B on the auxiliary space is optimal, i.e. for any ṽ V, there exists m 1 > m 0 > 0, such that m 0 ṽ 2 Ã ( BÃṽ, ṽ) Ã m 1 ṽ 2 Ã. (3.29) Then, the condition number of the preconditioned system defined by (3.25) can be bounded by κ(ba) max{1, 1 m 0 }c 2 0(c 2 s + m2 1 m 0 c 2 1). (3.30) Proof. Firstly, we estimate the upper bound of (BAv, v) A and consequently λ max (BA). By the definition of B from (3.25), we have (BAv, v) A = (SAv, v) A + (Π BΠ t Av, v) A. By Cauchy-Schwarz inequality, (3.26), and (3.29), we can control the first term as (SAv, v) A (SAv, SAv) 1/2 A (v, v) A c s (SAv, SAv) 1/2 S 1 (v, v) A c s (SAv, Av) 1/2 (v, v) A. which leads to (SAv, v) A c 2 s(v, v) A. Similarly by Cauchy-Schwarz inequality and (3.27), we have (Π BΠ t Av, v) A = ( BÃÃ 1 Π t Av, Ã 1 Π t Av)Ã m 1 (Ã 1 Π t Av, Ã 1 Π t Av)Ã = m 1 (ΠÃ 1 Π t Av, v) A m 1 (ΠÃ 1 Π t Av, ΠÃ 1 Π t Av) 1/2 m 1 c 1 (Ã 1 Π t Av, Ã 1 Π t Av) 1/2 m 1 m 1/2 0 = m 1 m 1/2 0 Ã A (v, v)1/2 A (v, v)1/2 A c 1 ( BÃÃ 1 Π t Av, Ã 1 Π t Av) 1/2 c 1 (Π B 1 Π t Av, v) 1/2 A Ã (v, v)1/2 A, (v, v)1/2 A

69 58 which leads to (Π BΠ t Av, v) A m2 1 m 2 0 c 2 1(v, v) A. Hence, the upper bound is Then, λ max (BAv, v) A sup (c 2 s + m2 1 c 2 v V\{0} (v, v) A m 1). 0 Next, we estimate the lower bound of λ min. We choose v = v 0 + Πw satisfying (3.28). (v, v) A = (v 0 + Πw, v) A = (v 0, SAv) S 1 + (w, Ã 1 Π t Av)Ã ((v 0, v 0 ) S 1 + (w, w)ã) 1 2 ((SAv, Av) + ( Ã 1 Π t Av)Ã, Π t Av)) 1 2 c 0 v A ( (SAv, Av) + ( Ã 1 Π t Av, Π t Av) ) 1 2 = c 0 v A ( (SAv, Av) + ( Ã 1 Π t Av, Ã 1 Π t Av)Ã ) 1 2 c 0 v A ( (SAv, Av) + 1 m 0 ( BÃÃ 1 Π t Av, Ã 1 Π t Av)Ã ( 1 c 0 v A (SAv, Av) + (Π m BΠ ) t 1 2 Av, v) A 0 c 0 max{1, 1 m 1/2 0 } v A (BAv, v) 1 2 A, which leads to (BAv, v) A max{1, 1 m 0 }c 2 0. Combining the estimate of λ min and λ max, we can have the desired conclusion. We also can combine the smoother S and the auxiliary grid correction multiplicatively with a preconditioner B in the form [100, 101] ) 1 2 I B co A = (I S T A)(I BA)(I SA) (3.31) which leads to Algorithm 17. The combined preconditioner, under suitable scaling assumptions performs no worse than its components. Theorem [100] Suppose there exists ρ [0, 1) such that for all v V, we have (I SA)v 2 A ρ v 2 A,

Algorithm 17 Multiplicative Auxiliary Space Iteration Step
Input: $u^k$. Output: $u^{k+1}$. Given $S$, $B$;
$u^{k,0} := u^k$;
(1) $u^{k,1} := u^{k,0} + S(b - Au^{k,0})$;
(2) $u^{k,2} := u^{k,1} + B(b - Au^{k,1})$;
(3) $u^{k+1} := u^{k,2} + S^T(b - Au^{k,2})$;

then the multiplicative preconditioner $B_{co}$ yields the bound

$\kappa(B_{co} A) \le \dfrac{(1 - m_1)(1 - \rho) + m_1}{(1 - m_0)(1 - \rho) + m_0}$    (3.32)

for the condition number, and

$\kappa(B_{co} A) \le \kappa(BA).$    (3.33)

According to Theorem 3.2.1 and Theorem 3.2.2, our goal is to construct an auxiliary space $\tilde V$ in which we are able to define an efficient preconditioner. The preconditioner will be the geometric multigrid method on a suitably chosen hierarchy of auxiliary grids. Additionally, the space has to be close enough to $V$ so that the transfer from $\tilde V$ to $V$ fulfils (3.27) and (3.28). This goal is achieved by a density fitting of the finest auxiliary grid $\tilde T_J$ to $T$. In order to prove (3.29), we use the multigrid theory for the auxiliary grids $\{\tilde T_l\}_{l=1}^{J}$ from the viewpoint of the method of subspace corrections.
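The additive preconditioner (3.25) combines a smoother on the original space with a solve on the auxiliary space, glued together by the transfer operator $\Pi$. The sketch below is a minimal algebraic illustration and nothing more: the "auxiliary space" is simply a coarser uniform grid, $\Pi$ is linear interpolation, $S$ is a Jacobi smoother, and $\tilde B$ is an exact coarse solve; all of these concrete choices are assumptions made for the example only, not the construction of this thesis.

    import numpy as np

    def laplace1d(n):
        return 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

    def interpolation(nc):                          # transfer Pi : auxiliary (coarse) space -> original space
        P = np.zeros((2*nc + 1, nc))
        for j in range(nc):
            P[2*j, j], P[2*j + 1, j], P[2*j + 2, j] = 0.5, 1.0, 0.5
        return P

    n  = 2**7 - 1
    A  = laplace1d(n)                               # original operator
    Pi = interpolation((n - 1) // 2)                # transfer operator Pi
    At = Pi.T @ A @ Pi                              # auxiliary operator  A~ = Pi^t A Pi
    Bt = np.linalg.inv(At)                          # auxiliary preconditioner B~ (exact solve for simplicity)
    dS = 1.0 / np.diag(A)                           # smoother S: one Jacobi step

    def B_fasp(r):
        """Additive auxiliary space preconditioner (3.25): B r = S r + Pi B~ Pi^t r."""
        return dS * r + Pi @ (Bt @ (Pi.T @ r))

    # preconditioned conjugate gradients with B_fasp
    b = np.ones(n)
    x = np.zeros(n)
    r = b - A @ x
    z = B_fasp(r)
    p = z.copy()
    for it in range(1, 201):
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) <= 1e-8 * np.linalg.norm(b):
            break
        z_new = B_fasp(r_new)
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    print("PCG iterations with the auxiliary space preconditioner:", it)

In the chapters that follow, the exact coarse solve $\tilde B$ is replaced by a multigrid cycle on the auxiliary grid hierarchy, and $\Pi$ by an interpolation between the unstructured and the auxiliary meshes.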

3.3 Algebraic Multigrid Method

The AMG method was popularized by Ruge and Stüben [26, 27, 28]. It has grown even more popular in the past several years and is still under development. Basically, AMG involves the following key steps:

1. Setup phase: (a) selecting coarse grid nodes or generating aggregates, (b) building interpolation and restriction operators, (c) generating the coarse grid matrix.

2. Solve phase: (a) choosing multigrid cycles, (b) applying pre-smoothing, (c) coarse grid correction, (d) applying post-smoothing.

The various forms of AMG that have been proposed involve various combinations of these key steps. In what follows we give a brief summary of several popular choices.

Classical AMG

In Ruge and Stüben's basic AMG, coarsening is based on the selection of strongly coupled unknowns [26, 27, 28]. The strong or weak coupling between unknowns is decided based on the coefficients of the matrix. Each fine node is chosen such that it is strongly connected to at least one coarse node. The interpolation operator approximates the value at a fine node from the values at all coarse nodes to which it is strongly connected. The choice of the AMG smoother varies. Usually classical smoothers like Gauss-Seidel are chosen, as in Stüben's classical AMG. There are also alternatives, such as ILU smoothers as proposed by Saad et al. [102], sparse approximate inverses proposed by Bröker et al. [103], and polynomial smoothing based on matrix-vector products [104].

Algebraic multigrid methods are designed for the solution of the sparse linear system (1.1) by using multigrid principles. In general, there is no guarantee that AMG solves an arbitrary system efficiently. However, some results can be obtained under the assumption that $A$ is symmetric and

$A_{ij} \le 0 \ (j \neq i), \qquad \sum_{j=1}^{N} A_{ij} = 0.$    (3.34)

Definition 1. Nodes $i$ and $j$ are connected if $A_{ij} \neq 0$. The neighborhood of $i$, $N_i$, is the set of all points connected to $i$. The point $j$ is strongly connected to $i$ if
$-A_{ij} \ge \theta \max_{k \neq i}(-A_{ik}).$
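Definition 1 can be evaluated directly from the matrix entries. The sketch below (the threshold $\theta = 0.25$ is a typical but arbitrary choice, and the helper name is hypothetical) computes, for each node $i$, the set of points to which $i$ is strongly connected, for a small finite difference matrix satisfying (3.34).

    import numpy as np
    import scipy.sparse as sp

    def strong_connections(A, theta=0.25):
        """For each i return S_i = { j != i : -A[i,j] >= theta * max_{k != i}(-A[i,k]) } (Definition 1)."""
        A = sp.csr_matrix(A)
        S = []
        for i in range(A.shape[0]):
            row = A.getrow(i).toarray().ravel()
            row[i] = 0.0                        # exclude the diagonal
            off = -row                          # with (3.34), off-diagonal entries are <= 0
            thresh = theta * off.max()
            S.append(set(np.flatnonzero((off >= thresh) & (off > 0))))
        return S

    # 1D Laplacian: every node is strongly connected to both of its neighbours
    n = 6
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    for i, Si in enumerate(strong_connections(A)):
        print("node", i, "strongly connected to", sorted(Si))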

72 61 Definition 2. Assume e is a smooth error, when e satisfies that for any i, j A ij A ii (e i e j ) 2 e 2 i < ɛ. Then, 1 2 ( A ij (e i e j ) 2 ) < ɛ i,j i A ii e 2 i. Lemma If e is a smooth error, and j S i = {j : A ij θ max k i ( A ik)}, then, (e i e j ) 2 < Ce 2 i ɛ, where C depends on θ Proof. Since N j=1 A ij = 0, we have A ii = j i A ij Therefore, for all j S i, we have N i max k i ( A ik). ɛ > j A ij A ii (e i e j ) 2 e 2 i This completes the prove. = j S i A ij A ii (e i e j ) 2 e 2 i = j S i A ij max k i A ik max k i A ik A ii (e i e j ) 2 e 2 i j S i θ N i (e i e j ) 2 e 2 i θ N i (e i e j ) 2. e 2 i

73 62 Define d(i, j) := a ij /s i, d(i, P ) := j P d(i, j), $ i (α) := {j N i : d(i, j) α}, $ T i (α) := {j : i $ j (α)}. We denote the set of points j to be used in interpolation to an F-point i by $ I i. This will be a subset of C $ i. we can assume that after relaxation: a m ii e m i + i/ F m j i a m ij e m j + j C m a m ij e m j = 0. (i F ). (3.35) In order to make an improvement in interpolation, we will suppose that C has been construted so that it satisfies the following criterion, which also implicitly defines $ I i : (CG1) : C should be chosen such that for each point i in F, there is a set $ I i C $ i such that every point j $ i is either in $ I i, or strongly depends on it. The error at the point j can be approximately expressed as a weighted average of the errors at i and points in $ I i as following: e j (a ji e i + k $ I i a jk e k )/(a ji + k $ I i a jk ) (j N i $ I i ). (3.36) Then we have the interpolation formula: v m+1 vi m i, (i C m ), := ( j $ (a I ij + c ij )vj m )/(a ii + c ii ) (i F m ), i (3.37) where c ij := a ik a kj /[a ki + a kl ] (j $ I i {i}). (3.38) k / $ I i k i l $ I i Strength of connection of a point i to a point j is measured against the largest connection

of $i$, which is denoted by $s_i$ in the following:
$s_i := \max\{\, -a_{ik} : k \neq i \,\}.$

UA-AMG

The coarsening in the UA-AMG method is performed by simply aggregating the unknowns, and the prolongation and restriction are Boolean matrices that characterize the aggregates. It has been shown that under certain conditions the Galerkin-type coarse-level matrix has a controllable sparsity pattern [105]. These properties make the UA-AMG method more suitable for parallel computation than traditional AMG methods, such as the classical AMG and the smoothed aggregation AMG, especially on GPUs. Recent work also shows that the UA-AMG method is both theoretically well founded and practically efficient, provided the aggregates are formed in an appropriate way and certain enhanced cycles are applied, such as the AMLI-cycle or the nonlinear AMLI-cycle (K-cycle) [106, 107, 108].

Given the $k$-th-level matrix $A_k \in \mathbb{R}^{n_k \times n_k}$, in the UA-AMG method we define the prolongation matrix $P_{k-1}^k$ from a non-overlapping partition of the $n_k$ unknowns at level $k$ into $n_{k-1}$ nonempty disjoint sets $G_j^k$, $j = 1, \ldots, n_{k-1}$, which are referred to as aggregates. There are many different approaches to generating the aggregates. However, in most existing approaches the aggregates are chosen one by one, i.e. sequentially, and standard parallel aggregation algorithms are variants of the parallel maximal independent set algorithm, which chooses the points of the independent sets based on random numbers assigned to all points. Therefore, the quality of the aggregates cannot be guaranteed. Once the aggregates are constructed, the prolongation $P_{k-1}^k$ is an $n_k \times n_{k-1}$ matrix given by

$(P_{k-1}^k)_{ij} = 1$ if $i \in G_j^k$ and $0$ otherwise,  $i = 1, \ldots, n_k$, $j = 1, \ldots, n_{k-1}$.    (3.39)

With such a piecewise constant prolongation $P_{k-1}^k$, the Galerkin-type coarse-level matrix $A_{k-1} \in \mathbb{R}^{n_{k-1} \times n_{k-1}}$ is defined by

$A_{k-1} = (P_{k-1}^k)^t A_k (P_{k-1}^k).$    (3.40)
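Formulas (3.39) and (3.40) amount to a few lines of sparse linear algebra once the aggregates are given. In the sketch below the aggregates are hard-coded for illustration; in the method developed later in this thesis they come from the auxiliary grid.

    import numpy as np
    import scipy.sparse as sp

    def ua_prolongation(aggregates, n_fine):
        """Boolean prolongation of (3.39): (P)_{ij} = 1 iff fine unknown i belongs to aggregate G_j."""
        rows, cols = [], []
        for j, agg in enumerate(aggregates):
            for i in agg:
                rows.append(i)
                cols.append(j)
        data = np.ones(len(rows))
        return sp.csr_matrix((data, (rows, cols)), shape=(n_fine, len(aggregates)))

    def galerkin_coarse(A, P):
        """Coarse-level matrix of (3.40): A_{k-1} = P^t A_k P (entry-wise, the summation (3.41))."""
        return (P.T @ A @ P).tocsr()

    n = 8
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
    aggregates = [[0, 1], [2, 3], [4, 5], [6, 7]]      # a non-overlapping partition of the unknowns
    P = ua_prolongation(aggregates, n)
    Ac = galerkin_coarse(A, P)
    print(Ac.toarray())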

Note that the entries in the coarse-grid matrix $A_{k-1}$ can be obtained by a simple summation process:

$(A_{k-1})_{ij} = \sum_{p \in G_i} \sum_{q \in G_j} a_{pq}$,  $i, j = 1, 2, \ldots, n_{k-1}$.    (3.41)

Enhanced cycles are often used with the UA-AMG method in order to achieve uniform convergence for model problems such as the Poisson problem [107, 109, 108] and M-matrices [106]. Here we use the nonlinear AMLI-cycle, also known as the variable AMLI-cycle or the K-cycle, because it is parameter-free. The nonlinear AMLI-cycle uses a Krylov subspace iterative method to accelerate the coarse-grid correction. Therefore, for completeness of the presentation, let us first recall the nonlinear preconditioned conjugate gradient (PCG) method originating in [110]. Algorithm 18 is a simplified version designed for symmetric positive definite (SPD) problems; the original version was meant for more general cases, including nonsymmetric and possibly indefinite matrices. Let $\hat B_k[\cdot] : \mathbb{R}^{n_k} \to \mathbb{R}^{n_k}$ be a given nonlinear preconditioner intended to approximate the inverse of $A_k$. We now formulate the nonlinear PCG method that provides an iterated approximate inverse to $A_k$ based on the given nonlinear operator $\hat B_k[\cdot]$. This procedure gives another nonlinear operator $B_k[\cdot] : \mathbb{R}^{n_k} \to \mathbb{R}^{n_k}$, which can be viewed as an improved approximation of the inverse of $A_k$. In practice, we choose the number of cycles $n$ of the nonlinear PCG to be 2.

Algorithm 18 Nonlinear PCG Method: $B_k[f] = \mathrm{NPCG}(A, f, \hat B_k[\cdot], n)$
$u_0 = 0$, $r_0 = f$, and $p_0 = \hat B_k[r_0]$
for $i = 1, 2, \ldots, n$ do
  $\alpha_i = \dfrac{(\hat B_k[r_{i-1}], r_{i-1})}{(A p_{i-1}, p_{i-1})}$;
  $u_i = u_{i-1} + \alpha_i p_{i-1}$;
  $r_i = r_{i-1} - \alpha_i A p_{i-1}$;
  $\beta_i^j = \dfrac{(\hat B_k[r_i], A p_j)}{(A p_j, p_j)}$, $j = 0, 1, \ldots, i-1$;
  $p_i = \hat B_k[r_i] - \sum_{j=0}^{i-1} \beta_i^j p_j$;
$B_k[f] := u_n$;

We also need to introduce a smoothing operator in order to define the multigrid method. In general, any smoother, such as the Jacobi smoother or the Gauss-Seidel (GS) smoother, can be used here. In this work we use parallel smoothers based on the auxiliary grid.
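Algorithm 18 differs from standard PCG only in that the preconditioner is a (possibly nonlinear) map and the new search direction is re-orthogonalized against all previous ones. The following direct transcription is illustrative only: here the nonlinear preconditioner $\hat B_k[\cdot]$ is played by a single Jacobi sweep, a stand-in for the coarse-level AMLI cycle of Algorithm 19.

    import numpy as np

    def npcg(A, f, Bhat, n=2):
        """Nonlinear PCG (Algorithm 18): returns B_k[f] = u_n for a nonlinear preconditioner Bhat."""
        u = np.zeros_like(f)
        r = f.copy()
        z = Bhat(r)
        p = z.copy()
        P = [p]                                   # previous directions, for re-orthogonalization
        for _ in range(n):
            Ap = A @ p
            alpha = (z @ r) / (Ap @ p)
            u = u + alpha * p
            r = r - alpha * Ap
            z = Bhat(r)
            # make the new direction A-orthogonal to all previous ones
            p = z - sum(((z @ (A @ q)) / ((A @ q) @ q)) * q for q in P)
            P.append(p)
        return u

    # a toy use: A SPD, Bhat a single Jacobi sweep
    n = 50
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    Bhat = lambda r: r / np.diag(A)
    f = np.ones(n)
    u2 = npcg(A, f, Bhat, n=2)
    print("residual after the n = 2 inner cycle:", np.linalg.norm(f - A @ u2))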

If a smoother is available, then, using Algorithm 18, we can recursively define the nonlinear AMLI-cycle MG method as an approximation of $A_k^{-1}$ (see Algorithm 19).

Algorithm 19 Nonlinear AMLI-cycle MG: $\hat B_k[f] = \mathrm{NAMLIcycle}(u_k, A_k, f_k, k)$
if $k == 1$ then
  $u_1 = A_1^{-1} f_1$;
else
  $u_k = \mathrm{smoothing}(u_k, A_k, f_k, k)$;  // Pre-smoothing
  $f_{k-1} = (P_{k-1}^k)^T (f_k - A_k u_k)$;  // Restriction
  $e_{k-1} = \mathrm{NPCG}(A_{k-1}, f_{k-1}, \mathrm{NAMLIcycle}(0, A_{k-1}, \cdot, k-1), n_{k-1})$;  // Coarse grid correction
  $u_k = u_k + P_{k-1}^k e_{k-1}$;  // Prolongation
  $u_k = \mathrm{smoothing}(u_k, A_k, f_k, k)$;  // Post-smoothing
$\hat B_k[f] = u_k$;

Our parallel AMG method is mainly based on Algorithm 19, with the prolongation $P_{k-1}^k$ and the coarse-grid matrix $A_{k-1}$ defined by (3.39) and (3.40), respectively. The main idea of our new parallel AMG method is to use an auxiliary structured grid to (1) efficiently construct the aggregates in parallel; (2) simplify the construction of the prolongation and the coarse-grid matrices; (3) develop a robust and effective colored smoother; and (4) better control the sparsity pattern of the coarse-grid matrices and the load balance.

Chapter 4

FASP for Poisson-like Problems on Unstructured Grids

Finite element methods are often used for solving elliptic PDEs because of their flexibility, especially on unstructured grids. However, general unstructured grids do not offer a suitable framework for the multigrid method, because of the difficulty of constructing a grid hierarchy. One way to overcome this difficulty is to apply a refinement of the given unstructured grid, as done in PLTMG [111], KASKADE [112], and HHG [113]. The other way focuses on the development of AMG methods. Although there has been much progress in this field, AMG methods often lack theoretical justification. In this chapter, a multigrid algorithm on unstructured grids is explained. The algorithm is based on the theory of auxiliary space preconditioning: an auxiliary structured grid hierarchy is generated, and a geometric multigrid method on this hierarchy is then combined with a smoothing on the original grid by means of the auxiliary space preconditioning technique. In Section 4.1, we discuss some basic assumptions on the given triangulation; an abstract analysis is provided based on four assumptions. In Section 4.2 we introduce the detailed construction of the structured auxiliary space by means of an auxiliary cluster tree, together with an improved treatment of the boundary region for Neumann boundary conditions. In Section 4.3, we describe the auxiliary space multigrid preconditioner (ASMG) and estimate the condition number by verifying the assumptions of the abstract theory.

4.1 Preliminaries and Assumptions

The triangulation is assumed to be conforming and shape-regular, in the sense that the ratio of the circumscribed and inscribed circles is bounded uniformly [114], and it is a K-mesh in the sense that the ratio of diameters between neighboring elements is bounded uniformly. All elements $\tau_i$ are assumed to be shape-regular but not necessarily quasi-uniform, so the diameters can vary globally, which allows a strong local refinement. The vertices of the triangulation are denoted by $(p_j)_{j \in J}$. Some of the vertices are Dirichlet nodes, $J_D \subset J$, where we impose essential Dirichlet boundary conditions, and some are Neumann vertices, $J_N \subset J$, where we impose natural Neumann boundary conditions.

For each simplex $\tau_i \in T$, define the barycenter

$\xi_i := \frac{1}{N_{\tau_i}} \sum_{k=1}^{N_{\tau_i}} p_k(\tau_i),$

where $p_k(\tau_i) \in \mathbb{R}^d$ is the $k$-th vertex of $\tau_i$ and $N_{\tau_i}$ is the number of vertices of $\tau_i$, as shown in Figure 4.1.

Definition 3. We denote the minimal distance between the triangle barycenters of the grid by
$h := \min_{i \neq j \in I} \|\xi_i - \xi_j\|_2.$
The diameter of $\Omega$ is denoted by
$H := \max_{x, y \in \Omega} \|x - y\|_2.$

In order to prove the desired nearly linear complexity estimate, we have to assume that the refinement level of the grid is algebraically bounded in the following sense.

Assumption 1. We assume that $H/h = N^q$ for a small number $q$, e.g. $q = 2$.

The above assumption allows an algebraic grading towards a point, but it forbids a geometric grading. The assumption is sufficient but not necessary: the construction of the auxiliary grids might still be of complexity $O(N \log N)$ or less if Assumption 1 is not valid, but it would require more technical assumptions in order to prove this.

Figure 4.1. Left: The 2D triangulation $T$ of $\Omega$ with elements $\tau_i$. Right: The barycenters $\xi_i$ (dots) and the minimal distance $h$ between barycenters.

4.2 Construction of the Auxiliary Grid Hierarchy

In this section, we explain how to generate a hierarchy of auxiliary grids based on the given (unstructured) grid $T$. The idea is to analyse and split the element barycenters by their geometric position, regardless of the initial grid structure. Our aim is to obtain a structured hierarchy of grids that preserves some properties of the initial grid, e.g. the local mesh size. A similar idea has already been applied in [53, 115, 38].

Clustering and Auxiliary Box-trees

In order to present the construction algorithm, we need to introduce the notion of a cluster.

Definition 4 (Cluster). A cluster $t$ is a subset of $I$. If $t$ is a cluster, the corresponding subdomain of $\Omega$ is $\Omega_t := \cup_{i \in t} \tau_i$. The clusters are collected in a hierarchical cluster tree $T_T$.

Definition 5 (Cluster Tree). A tree $T_T$ is a cluster tree if the following conditions are satisfied.
1. The nodes in $T_T$ are clusters.
2. The root of $T_T$ is $I$.
3. The leaves of $T_T$ are denoted by $L(T_T)$, and the tree hierarchy is given by a father/son relation: for each interior node $t \in T_T \setminus L(T_T)$, the set of sons of $t$, $\mathrm{sons}(t)$, is a subset

of $T_T \setminus \{t\}$ such that $t = \cup_{s \in \mathrm{sons}(t)} s$ holds. Vice versa, the father of any $s \in \mathrm{sons}(t)$ is $t$.

The standard (geometrically regular) construction of the cluster tree $T_T$ is as follows. For the initial step, we choose a (minimal) hypercube containing the domain $\Omega$:

$B^1 := [a_1, b_1) \times [a_2, b_2) \times \cdots \times [a_d, b_d) \supset \Omega$, where $a_i = \min x_i$, $b_i = \max x_i$, for $i = 1, 2, \ldots, d$, $(x_1, x_2, \ldots, x_d) \in \Omega$.    (4.1)

Define the level of $B^1$ to be $g(B^1) = 1$. Then we regularly subdivide $B^1$ into $2^d$ children $\{B_i^2\}$. When $d = 2$, the four children $B_1^2, B_2^2, B_3^2, B_4^2$ are the four quadrants

$[a_1, b_1') \times [a_2, b_2')$,  $[a_1', b_1) \times [a_2, b_2')$,  $[a_1, b_1') \times [a_2', b_2)$,  $[a_1', b_1) \times [a_2', b_2)$,

where $a_1' = b_1' := (a_1 + b_1)/2$ and $a_2' = b_2' := (a_2 + b_2)/2$. The level of $B_i^2$ is $g(B_i^2) = g(B^1) + 1 = 2$, for $i = 1, 2, \ldots, 2^d$. Finally, we apply the same subdivision process recursively and define the level of the boxes $B_i^l$ recursively (cf. Figure 4.3). This yields an infinite tree $T_{box}$ with root $B^1$. For any $i = \sum_{m=1}^{d} 2^{(m-1)k} t_m$ with $0 \le t_m \le 2^k - 1$, we can denote the subregion $\Omega_i^k$ on level $k$ as

$\Omega_i^k := (a_1 + t_1 \tfrac{b_1 - a_1}{2^k},\, a_1 + (t_1 + 1)\tfrac{b_1 - a_1}{2^k}) \times \cdots \times (a_d + t_d \tfrac{b_d - a_d}{2^k},\, a_d + (t_d + 1)\tfrac{b_d - a_d}{2^k}).$    (4.2)

Figure 4.2 gives two examples of 2D region quadtrees, on a unit square domain and on a circular domain.

Letting $B_j^l$ denote a box in this tree, we can define the cluster $t$, a subset of $I$, by

$t_j^l := t(B_j^l) := \{ i \in I \mid \xi_i \in B_j^l \}.$

This yields an infinite cluster tree with root $t(B^1)$. We construct a finite cluster tree $T_I$ by not subdividing nodes whose cardinality is below a minimal cardinality $n_{min}$, e.g. $n_{min} := 3$. The nodes which have no child nodes are called the leaf nodes.
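The construction of the finite cluster tree $T_I$ is a standard region-quadtree recursion over the barycenters. The following sketch (2D only, with random points standing in for the barycenters $\xi_i$ of an actual triangulation; the data layout is an assumption of the example) subdivides a box into its $2^d = 4$ children until at most $n_{min}$ indices remain; only the indices of the father are distributed to the sons, which is what yields the near-linear construction cost discussed below.

    import numpy as np

    def build_cluster_tree(points, idx, box, n_min=3, level=1):
        """Region-quadtree clustering of barycenters; returns a nested dict representing T_I.
        'box' is ((ax, ay), (bx, by)); a node stores its index set t(B) and, if #t > n_min, four sons."""
        (ax, ay), (bx, by) = box
        node = {"level": level, "box": box, "cluster": idx, "sons": []}
        if len(idx) <= n_min:
            return node                               # leaf: do not subdivide further
        mx, my = (ax + bx) / 2.0, (ay + by) / 2.0     # midpoints a_1' = b_1', a_2' = b_2'
        children = [((ax, ay), (mx, my)), ((mx, ay), (bx, my)),
                    ((ax, my), (mx, by)), ((mx, my), (bx, by))]
        for (cax, cay), (cbx, cby) in children:
            cidx = [i for i in idx
                    if cax <= points[i][0] < cbx and cay <= points[i][1] < cby]
            node["sons"].append(
                build_cluster_tree(points, cidx, ((cax, cay), (cbx, cby)), n_min, level + 1))
        return node

    rng = np.random.default_rng(0)
    barycenters = rng.random((200, 2))                # stand-in for the xi_i
    root = build_cluster_tree(barycenters, list(range(len(barycenters))), ((0.0, 0.0), (1.0, 1.0)))

    def depth(node):
        return node["level"] if not node["sons"] else max(depth(s) for s in node["sons"])
    print("depth of T_I:", depth(root), " size of the root cluster:", len(root["cluster"]))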

Figure 4.2. Examples of the region quadtree on different domains.

The cardinality $\#t_j^l = \#t(B_j^l)$ is the number of barycenters in $B_j^l$. Leaves of the cluster tree contain at most $n_{min}$ indices. For any leaf node, its parent node contains at least 4 barycenters, so the total number of leaf nodes is bounded by the number of barycenters $N$.

Figure 4.3. Tree of regular boxes with root $B^1$ in 2D. The black dots mark the corresponding barycenters $\xi_i$ of the triangles $\tau_i$. Boxes with fewer than three points $\xi_i$ are leaves.

Remark. The size of a leaf box $B_j$ can be much larger than the size of the simplices $\tau_j$ that intersect $B_j$, since a large box $B_j$ may intersect only one very small element and will then not be further subdivided.

Lemma. Suppose Assumption 1 holds. Then the complexity of the construction of $T_I$ is $O(qN \log N)$.

82 71 Therefore, the box B ν has a diameter of at least h. After each subdivision step the diameter of the boxes is exactly halved. Let l denote the number of subdivisions after which B ν was created. Then diam(b ν ) = 2 l diam(b 1 ). Consequently, we obtain h diam(b ν ) = 2 l diam(b 1 ) 2 l 2H so that by Assumption 1, l log(h/h) = q log N. Therefore the depth of T I is in O(q log N). Next, we estimate the complexity for the construction of T I. The subdivision of a single node t T I and corresponding box B ν is of complexity #t. On each level of the tree T I, the nodes are disjoint, so that the subdivision of all nodes on one level is of complexity at most O(N). For all levels this sums up to at most O(qN log N). Remark The boxes used in the clustering can be replaced by arbitrary shaped elements, e.g. triangles/tetrahedra or anisotropic elements depending on the application or operator at hand. For ease of presentation we restrict ourselves to the case of boxes. Remark The complexity of the construction can also be bounded from below by O(N log N), as is the case for a uniform (structured) grid. However, this complexity arises only in the construction step and this step will typically be of negligible complexity. For each cluster t ν T we have an associated box B ν. The depth of the cluster tree is p := depth(t ). Notice that the tree of boxes is not the regular grid that we need for the multigrid method. A further refinement as well as deletion of elements is necessary Closure of the Auxiliary Box-tree The hierarchy of box-meshes from Figure 4.3 is exactly what we want to construct: each box has at most one hanging node per edge, namely, the fineness of two neighbouring boxes differs by at most one level. In general this is not fulfilled. We construct the grid hierarchy of nested uniform meshes starting from a coarse mesh σ (0) consisting of only a single box B 1 = [a 1, b 1 ) [a 2, b 2 ) [a d, b d ), the root of the

83 72 box tree. All boxes in the meshes σ (1),..., σ (J) to be constructed will either correspond to a cluster t in the cluster tree or will be created by refinement of a box that corresponds to a leaf of the cluster tree. Let l {1,..., J} be a level that is already constructed (the trivial start l = 1 of the induction is given above). Part I: Mark elements for refinement We mark all elements of the mesh which are then refined regularly. Let Bν l be an arbitrary box in σ (l). The box Bν l corresponds to a cluster t ν = t(bν) l T I. The following two situations can occur: 1. (Mark) If #t ν > n min then Bν l is marked for refinement. 2. (Retain) If #t ν n min, e.g. t ν =, then Bν l is not marked in this step. Figure 4.4. The subdivision of the marked (red) box on level l would create two boxes (blue) with more than one hanging node at one edge. After processing all boxes on level l, it may occur that there are boxes on level l 1 that would have more than one hanging node on an edge after refinement of the marked boxes, cf. Figure 4.4. Since we want to avoid this, we have to perform a closure operation for all such elements and for all coarser levels l 1,..., 1. Part II: Closure and Refinement 3. (Close) Let L (l 1) be the set of all boxes on level l 1 having too many hanging nodes. All of these are marked for refinement. By construction a single refinement of each box is sufficient. However, a refinement on level l 1 might then produce too many hanging nodes in a box on level l 2. Therefore, we have to form the lists L (j), j = l 1,..., 1 of boxes with too many hanging nodes successively on all levels and mark the elements.

84 73 Figure 4.5. The subdivision of the red box makes it necessary to subdivide nodes on all levels. 4. (Refine) At last we refine all boxes (on all levels) that are marked for refinement. The result of the closure operation is depicted in Figure 4.5. Each of the boxes in the closed grids lives on a unique level l {1,..., J}. It is important that a box is either refined regularly (split into four successors on the next level) or it is not refined at all. For each box that is marked in step 1, there are at most O(log N) boxes marked during the closure step 3. Lemma The complexity for the construction and storage of the (finite) box tree with boxes B l ν and corresponding cluster tree T I with clusters t ν = t(b l ν) is of complexity O(N log N), where N is the number of barycenters, i.e., the number of triangles in the triangulation τ. Proof. For the level l of the tree, let n l be the number of leaf boxes and m l be the boxes which have child boxes. Accordingly, the total number of the boxes on level l is n l + m l. By definition, n l + m l = 2 d m l 1, where l 2 and n 1 + m 1 = m 1 = 1. Since J l=1 n l N, we have N J n l = l=1 J J (2 d m l 1 m l ) = (2 d 1) m l + 1. l=2 l=1 As a result, J m l N. l=1 The total work for generating the tree is J l=1 n l + m l N. Given l, let α l denote the number of boxes in L (l 1) (the set of boxes that have more than 1 hanging node). Since every box in L (l 1) has to be a leaf box, we have α l n l. As

As the process of closing each hanging node passes through at most $d$ boxes on any given level, the total number of marked boxes in this closure process is bounded by $\sum_{l=1}^{J} d\,J\,\alpha_l \lesssim J N \lesssim N\log N$.

Construction of a Conforming Auxiliary Grid Hierarchy

At last, we create a hierarchy of nested conforming triangulations by subdivision of the boxes and by discarding those hypercubes that lie outside the domain $\Omega$. If $d=2$, for any given box $B_\nu$ there can be at most one hanging node per edge. The possible situations and the corresponding local closure operations are presented in Figure 4.6. The closure operation introduces new elements on the next finer level.

Figure 4.6. Hanging nodes can be treated by a local subdivision within the box $B_\nu$. The top row shows a box with 1, 2, 2, 3, 4 hanging nodes, respectively, and the bottom row shows the corresponding triangulation of the box.

The final hierarchy of triangular grids $\sigma^{(1)},\ldots,\sigma^{(J)}$ is nested and conforming without hanging nodes. All triangles have a minimum angle of 45 degrees, i.e., they are shape-regular, cf. Figure 4.7. The triangles in the quasi-regular meshes $\sigma^{(1)},\ldots,\sigma^{(J)}$ have the following properties:

1. All triangles in $\sigma^{(1)},\ldots,\sigma^{(J)}$ that have children which are themselves further subdivided are refined regularly (four congruent successors), cf. Figure 4.8.
2. Each triangle $\sigma_i \in \sigma^{(j)}$ that is subdivided but not regularly refined has successors that will not be further subdivided, cf. Figure 4.9.
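One simple local closure rule that achieves the 45-degree bound is to fan triangles from the box center to all boundary vertices (the four corners plus any hanging edge midpoints). The sketch below illustrates that idea only; it is not necessarily the exact set of patterns shown in Figure 4.6, and the function and edge names are hypothetical.

```python
def triangulate_box(lo, hi, hanging):
    """Fan triangulation of the axis-aligned box [lo, hi] with optional hanging
    mid-edge nodes.  `hanging` is a subset of {'bottom', 'right', 'top', 'left'}
    naming the edges that carry a hanging node.  Every triangle produced has a
    minimum angle of 45 degrees."""
    (x0, y0), (x1, y1) = lo, hi
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]          # counter-clockwise
    mids = {'bottom': ((x0 + x1) / 2, y0), 'right': (x1, (y0 + y1) / 2),
            'top': ((x0 + x1) / 2, y1),    'left': (x0, (y0 + y1) / 2)}
    edges = ['bottom', 'right', 'top', 'left']
    boundary = []                       # boundary vertices in counter-clockwise order
    for k, e in enumerate(edges):
        boundary.append(corners[k])
        if e in hanging:
            boundary.append(mids[e])
    center = ((x0 + x1) / 2, (y0 + y1) / 2)
    # connect the center to each pair of consecutive boundary vertices
    return [(center, boundary[i], boundary[(i + 1) % len(boundary)])
            for i in range(len(boundary))]

# Example: one hanging node on the bottom edge produces 5 triangles
print(len(triangulate_box((0.0, 0.0), (1.0, 1.0), {'bottom'})))
```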

Figure 4.7. The final hierarchy of nested grids (levels 1 to 5). Red edges were introduced in the last (local) closure step.

Figure 4.8. Case 1: $\sigma_i$ is subdivided on the fine level.

Figure 4.9. Case 2: $\sigma_i$ is not subdivided on the fine level.

The hierarchy of grids constructed so far covers on each level the whole box $B$. This hierarchy now has to be adapted to the boundary of the given domain $\Omega$. In order to explain the construction we will consider the domain $\Omega$ and triangulation $T$ (5837 triangles) from Figure 4.10. The triangulation consists of shape-regular elements, it is locally refined, it contains many small inclusions, and the boundary $\Gamma$ of the domain $\Omega$ is rather complicated.

If $d=3$, there is at most one hanging node per edge and one inner hanging node per face for any cube $B_\nu$. If there is no hanging node on any edge or face of $B_\nu$, we can divide the cube regularly into 6 tetrahedra. If there are hanging nodes on the edges or faces, the local closure operation has two steps: we first perform the closure operations for all faces, and then we connect the triangles on each face to the center of the cube to form the tetrahedra. This closure operation also generates the final hierarchy of tetrahedral grids $\sigma^{(1)},\ldots,\sigma^{(J)}$, which are nested and conforming without hanging nodes.

Figure 4.10. A triangulation of the Baltic sea with local refinement and small inclusions.

Figure 4.11 shows the different types of possible situations and the corresponding local closure operations.

Adaptation of the Auxiliary Grids to the Boundary

The Dirichlet boundary: On the Dirichlet boundary we want to satisfy homogeneous boundary conditions (b.c.), i.e., $u|_\Gamma = 0$ (non-homogeneous b.c. can trivially be transformed to homogeneous ones). On the given fine triangulation $\tau$ this is achieved by use of basis functions that fulfil the b.c. Since the auxiliary triangulations $\sigma^{(1)},\ldots,\sigma^{(J)}$ do not necessarily resolve the boundary, we have to use a slight modification.

Definition 6 (Dirichlet auxiliary grids). We define the auxiliary triangulations $T_l^D$ by
$$T_l^D := \{\tau \in \sigma^{(l)} \mid \tau \subset \Omega\}, \quad l = 1,\ldots,J.$$

Figure 4.11. Hanging nodes can be treated by a local subdivision within the cube $B_\nu$: first the hanging nodes on the faces are removed, and then the face triangles are connected to the center of the cube.

Figure 4.12. The boundary $\Gamma$ of $\Omega$ is drawn as a red line (levels 3 to 5); boxes not intersecting $\Omega$ are light green, boxes intersecting $\Gamma$ are dark green, and all other boxes (inside of $\Omega$) are blue.

In Figure 4.12 the Dirichlet auxiliary grids are formed by the blue boxes. All other elements (light green and dark green) are not used for the Dirichlet problem. On an auxiliary grid we impose homogeneous Dirichlet b.c. on the boundary $\Gamma_l := \partial\Omega_l^D$, where $\Omega_l^D := \bigcup_{\tau \in T_l^D} \tau$. The auxiliary grids are still nested, but the area covered by the triangles grows with increasing level number:
$$\Omega_1^D \subset \Omega_2^D \subset \cdots \subset \Omega_J^D \subset \Omega.$$

The Neumann boundary: On the Neumann boundary we want to satisfy natural (Neumann) b.c., i.e., $\partial_n u|_\Gamma = 0$. For the auxiliary triangulations $\sigma^{(1)},\ldots,\sigma^{(J)}$ we approximate the true b.c. by natural b.c. on an auxiliary boundary.

Definition 7 (Neumann auxiliary grids). Define the auxiliary triangulations $T_1^N,\ldots,T_J^N$ by
$$T_l^N := \{\tau \in \sigma^{(l)} \mid \tau \cap \Omega \neq \emptyset\}, \quad l = 1,\ldots,J.$$

Figure 4.13. The boundary $\Gamma$ of $\Omega$ is drawn as a red line (levels 3 to 5); boxes not intersecting $\Omega$ are light green, and all other boxes (intersecting $\Omega$) are blue.

In Figure 4.13 the Neumann auxiliary grids are formed by the blue boxes. All other elements (light green) are not used for the Neumann problem. On an auxiliary grid we impose natural Neumann b.c.; the auxiliary grids are non-nested. The area covered by the triangles grows with decreasing level number:
$$\Omega \subset \Omega_J^N \subset \cdots \subset \Omega_1^N, \qquad \Omega_l^N := \bigcup_{\tau \in T_l^N} \tau.$$

Remark (Mixed Dirichlet/Neumann b.c.). For mixed boundary conditions of Dirichlet (on $\Gamma_D$) and Neumann type we use the grids
$$T_l^M := \{\tau \in \sigma^{(l)} \mid \tau \cap \Omega \neq \emptyset \text{ and } \tau \cap \Gamma_D = \emptyset\}, \quad l = 1,\ldots,J.$$
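The element selection in Definitions 6 and 7 can be sketched as follows. The helper `box_vs_domain` is an assumption: it is supposed to classify an auxiliary box against $\Omega$ and return 'inside', 'cut', or 'outside'; how that geometric test is implemented is not specified here.

```python
def select_auxiliary_elements(level_boxes, box_vs_domain, bc='dirichlet'):
    """Select the boxes of one auxiliary level according to Definitions 6 and 7.
    `box_vs_domain(box)` is assumed to return 'inside' (box contained in Omega),
    'cut' (box intersects the boundary), or 'outside' (box does not meet Omega)."""
    kept = []
    for box in level_boxes:
        c = box_vs_domain(box)
        if bc == 'dirichlet' and c == 'inside':           # T_l^D: boxes fully inside Omega
            kept.append(box)
        elif bc == 'neumann' and c in ('inside', 'cut'):  # T_l^N: boxes meeting Omega
            kept.append(box)
    return kept
```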

The b.c. on the auxiliary grid are of Neumann type except for neighbours of boxes with $\sigma \cap \Gamma_D \neq \emptyset$, where essential Dirichlet b.c. are imposed.

Figure 4.14. The finest auxiliary grid $\sigma^{(10)}$ contains elements of different size. Left: Dirichlet b.c. (852 degrees of freedom); right: Neumann b.c. (2100 degrees of freedom).

Near Boundary Correction

Since the boundaries of different levels do not coincide, the near-boundary error cannot be reduced very well by the standard multigrid method for the Neumann boundary condition. Therefore we introduce a near-boundary region $\Omega^{(l,j)}$ in which a correction for the boundary approximation is performed. The near-boundary region is defined in layers around the boundary $\Gamma_l$:

Definition 8 (Near-boundary region). We define the $j$-th near-boundary region $T^{(l,j)}$ on level $l$ of the auxiliary grids by
$$T^{(l,0)} := \{\tau \in T_l \mid \operatorname{dist}(\Gamma_l, \tau) = 0\}, \qquad T^{(l,i)} := \{\tau \in T_l \mid \operatorname{dist}(T^{(l,i-1)}, \tau) = 0\}, \quad i = 1,\ldots,j.$$

The idea for solving the linear system on level $l$ is to perform a near-boundary correction after the coarse grid correction. The error introduced by the coarse grid correction is eliminated by solving the subsystem for the degrees of freedom in the near-boundary region $T^{(l,j)}$.
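In code, such a near-boundary correction step (made precise in Definition 9 below) might look like the following sketch. The names are illustrative, and a sparse direct solve stands in for the H-matrix solver used in the thesis.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def near_boundary_correction(A, f, u, nb_dofs):
    """One near-boundary correction: restrict the current residual to the
    near-boundary index set J_{l,j}, solve the small subsystem, and add the
    correction back.  `A` is the level-l stiffness matrix (scipy CSR), `u` the
    current iterate, `nb_dofs` an integer array with the indices in J_{l,j}."""
    A = sp.csr_matrix(A)
    r = f - A @ u                                    # residual r^l = f - A^l u^l
    A_nb = A[nb_dofs][:, nb_dofs].tocsc()            # A^l restricted to J_{l,j} x J_{l,j}
    e_nb = spla.spsolve(A_nb, r[nb_dofs])            # solve the near-boundary subsystem
    u = u.copy()
    u[nb_dofs] += e_nb                               # add the correction
    return u
```

Only degrees of freedom in a fixed number of layers around $\Gamma_l$ enter this solve, which is why the extra cost stays proportional to the number of near-boundary elements.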

The extra computational complexity is $O(N)$ because only the elements which are close to the boundary are considered.

Definition 9 (Partition of degrees of freedom). Let $J_l$ denote the index set for the degrees of freedom on the auxiliary grid $T_l$. We define the near-boundary degrees of freedom by
$$J_{l,j} := \{i \in J_l \mid i \text{ belongs to an element } \tau \in T^{(l,j)}\}.$$

Let $(u_i)_{i\in J_l}$ be a coefficient vector on level $l$ of the auxiliary grids. Then we extend the standard coarse grid correction by the solve step
$$r^l := f - A^l u^l, \qquad u^l_{J_{l,j}} := \big(A^l_{J_{l,j},J_{l,j}}\big)^{-1} r^l_{J_{l,j}}.$$
The small system $A^l_{J_{l,j},J_{l,j}}$ of near-boundary elements is solved by an H-matrix solver, cf. [116].

4.3 Estimate of the Condition Number

In this section, we investigate and analyze the new algorithm by verifying the assumptions of the theorem of the auxiliary grid method. Based on the auxiliary hierarchy constructed in Section 4.2, we can define the auxiliary space preconditioners (3.25) and (3.31) as follows. Let the auxiliary space be $\tilde V = V_J$ and let $\tilde A$ be generated from (3.8). Since we already have the hierarchy of grids $\{V_l\}_{l=1}^J$, we can apply MG on the auxiliary space $V_J$ as the preconditioner $B$. On the space $V$, we can apply a traditional smoother $S$, e.g. Richardson, Jacobi, or Gauß-Seidel. For the stiffness matrix $A = D - L - U$ (diagonal, lower and upper triangular parts), the matrix representation of the Jacobi iteration is $S = D^{-1}$ and for the Gauß-Seidel iteration it is $S = (D-L)^{-1}$. (More generally, one could use any smoother that features the spectral equivalence $\|v\|_{S^{-1}} \eqsim \|h^{-1} v\|_{L^2(\Omega)}$.)

Since the auxiliary grid may be over-refined, it can happen that an element $\tau_i \in T$ intersects much smaller auxiliary elements $\tau^J_j \in T_J$:
$$\tau_i \cap \tau^J_j \neq \emptyset, \quad h_{\tau^J_j} \lesssim h_{\tau_i} \quad\text{but}\quad h_{\tau^J_j} \not\eqsim h_{\tau_i}. \qquad (4.3)$$

Algorithm 20 Auxiliary Space MultiGrid

For $l = 0$, define $B_0 = A_0^{-1}$. Assume that $B_{l-1}: V_{l-1} \to V_{l-1}$ is defined. We shall now define $B_l: V_l \to V_l$, which is an iterator for an equation of the form $A_l u = f$.

Pre-smoothing: For $u^0 = 0$ and $k = 1, 2, \ldots, \nu$,
$$u^k = u^{k-1} + R_l (f - A_l u^{k-1}).$$

Coarse grid correction: $e_{l-1} \in V_{l-1}$ is the approximate solution of the residual equation $A_{l-1} e = Q_{l-1}(f - A_l u^\nu)$ obtained by the iterator $B_{l-1}$:
$$u^{\nu+1} = u^\nu + e_{l-1} = u^\nu + B_{l-1} Q_{l-1}(f - A_l u^\nu).$$

Near boundary correction:
$$u^{\nu+2} = u^{\nu+1} + u^l_{J_{l,j}} = u^{\nu+1} + \big(A^l_{J_{l,j},J_{l,j}}\big)^{-1}\big(f - A_l u^{\nu+1}\big)_{J_{l,j}}.$$

Post-smoothing: For $k = \nu+3, \ldots, 2\nu+3$,
$$u^k = u^{k-1} + R_l (f - A_l u^{k-1}).$$

In this case, we do not have the local approximation and stability properties of the standard nodal interpolation operator. Therefore, we need a stronger interpolation between the original space and the auxiliary space. This is accomplished by the Scott-Zhang quasi-interpolation operator $\Pi: H^1(\Omega) \to V$ for a triangulation $T$ [117]. Let $\{\psi_i\}$ be an $L^2$-dual basis to the nodal basis $\{\varphi_i\}$. We define the interpolation operator as
$$\Pi v(x) := \sum_{i \in J} \varphi_i(x) \int_\Omega \psi_i(\xi)\, v(\xi)\, d\xi.$$
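For concreteness, the following is a compact recursive sketch of one cycle of Algorithm 20 above. The data layout (`levels[l]` as a dictionary holding the operator, a smoother, transfer operators, and the near-boundary index set) is an assumption made for the sketch; it is not the data structure of the dissertation or of FASP.

```python
import numpy as np
import scipy.sparse.linalg as spla

def asmg_cycle(l, f, levels, nu=2):
    """One cycle of the auxiliary space multigrid iterator B_l (sketch of
    Algorithm 20).  levels[l] is assumed to provide:
      'A'        level-l operator (scipy sparse),
      'smooth'   a function smooth(A, f, u, steps) applying the smoother R_l,
      'Q', 'P'   restriction to and prolongation from level l-1,
      'nb_dofs'  the near-boundary index set J_{l,j}."""
    lev = levels[l]
    A = lev['A']
    if l == 0:
        return spla.spsolve(A.tocsc(), f)              # B_0 = A_0^{-1}
    u = np.zeros_like(f, dtype=float)
    u = lev['smooth'](A, f, u, nu)                     # pre-smoothing
    r_c = lev['Q'] @ (f - A @ u)                       # restrict the residual
    u = u + lev['P'] @ asmg_cycle(l - 1, r_c, levels, nu)   # coarse grid correction
    nb = lev['nb_dofs']                                # near-boundary correction
    r = f - A @ u
    u[nb] += spla.spsolve(A[nb][:, nb].tocsc(), r[nb])
    u = lev['smooth'](A, f, u, nu)                     # post-smoothing
    return u
```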

By definition, $\Pi$ preserves piecewise linear functions and satisfies (cf. [117]), for all $v \in H^1(\Omega)$,
$$\|\Pi v\|_{1,\Omega}^2 + \sum_{\tau\in T} h_\tau^{-2}\,\|v - \Pi v\|_{0,\tau}^2 \lesssim \|v\|_{1,\Omega}^2. \qquad (4.4)$$

We define the new interpolation between the auxiliary space $\tilde V$ and $V$ by the Scott-Zhang interpolation $\Pi: \tilde V \to V$ and the reverse interpolation $\tilde\Pi: V \to \tilde V$. Then we can apply the auxiliary space theorem with $\tilde V = V_J$. In order to estimate the condition number, we need to verify that the multigrid preconditioner $B$ on the auxiliary space is bounded and that the finest auxiliary grid and the corresponding FE space we constructed yield stable and bounded transfer and smoothing operators.

Convergence of the MG on the Auxiliary Grids

First, we prove the convergence of the multigrid method on the auxiliary space. For the Dirichlet boundary, we have the nestedness $\Omega_1^D \subset \cdots \subset \Omega_J^D \subset \Omega$, which induces the nestedness of the finite element spaces defined on the auxiliary grids $T_l^D$, $l = 1,\ldots,J$:
$$V_1 \subset V_2 \subset \cdots \subset V_J.$$
In order to avoid overloading the notation, we drop the superscript $D$ in the following. In order to prove the convergence of the local multilevel methods by Theorem 3.1.5, we only need to verify the assumptions for the decomposition
$$V_J = \sum_{l=1}^J \sum_{k\in \tilde N_l} V_{l,k}, \qquad (4.5)$$
where $\tilde N_l = \{k \in J_l \mid k \in J_l\setminus J_{l-1} \text{ or } \varphi_{k,l} \neq \varphi_{k,l-1}\}$. Since $T_l \subset \sigma^{(l)}$ is a local refinement of $T_{l-1}$, the sizes of the triangles in $T_l$ may differ. We denote by $\tilde T_l$ the refinement of the grid $T_l$ in which all elements are regularly refined so that all elements of $\tilde T_l$ are congruent to the smallest element of $T_l$. The finite element spaces corresponding to $\tilde T_l$ are denoted by $\tilde V_l$.

In the triangulations $\tilde T_l$ we have $h_\tau \eqsim 2^{-l}$ for all $\tau \in \tilde T_l$. For an element $\tau \in T_l$ we denote by $g_\tau$ the level number of the triangulation $\tilde T_{g_\tau}$ to which $\tau$ belongs, i.e. $h_\tau \eqsim 2^{-g_\tau}$. For any vertex $p_i$, if $i \in J_l$ but $i \notin J_{l-1}$, we define $g_{p_i} = l$. The following properties about the generation of elements and vertices hold [22, 21], where $J(\tau)$ is the set of vertices of $\tau$: $\tau \in \tilde T_l$ if and only if $g_\tau = l$; $i \in J_l$ if and only if $g_{p_i} \le l$; for $\tau \in T_l$, $\max_{i\in J(\tau)} g_{p_i} = l = g_\tau$.

With the space decomposition (4.5), we can verify the assumptions of Theorem 3.1.5.

Stable decomposition: Proof of (A1)

The purpose of this subsection is to prove that the decomposition is stable.

Theorem. For any $v \in V$, there exist functions $v_i^l \in V_{l,i}$, $i \in \tilde N_l$, $l = 1,\ldots,J$, such that
$$v = \sum_{l=1}^J \sum_{i\in\tilde N_l} v_i^l \quad\text{and}\quad \sum_{l=1}^J \sum_{i\in\tilde N_l} \|v_i^l\|_A^2 \lesssim \log(N)\, \|v\|_A^2. \qquad (4.6)$$

Proof. Following the argument of [22, 21], we define the Scott-Zhang interpolations between different levels, $\Pi_l: V_{l+1} \to V_l$ for $1 \le l < J$, $\Pi_J: V_J \to V_J$, and $\Pi_0: V_1 \to \{0\}$. With these we define the decomposition as
$$v = \sum_{l=1}^J v^l, \qquad v^l = (\Pi_l - \Pi_{l-1})v \in V_l.$$
Write $v^l = \sum_{i\in J_l} \xi_{l,i}\,\varphi_i^l$ and set $v_i^l = \xi_{l,i}\,\varphi_i^l \in V_{l,i}$. Then
$$\|v_i^l\|_0^2 = \|v_i^l\|_{0,\omega_i^l}^2 \lesssim \sum_{\tau\subset\omega_i^l} h_\tau^d\, |v^l(p_i)|^2 \lesssim \|v^l\|_{0,\omega_i^l}^2 = \|(\Pi_l - \Pi_{l-1})v\|_{0,\omega_i^l}^2,$$
where $\omega_i^l$ is the support of $\varphi_i^l$ and $p_i$ is its center vertex.

By the inverse inequality, we conclude
$$\sum_{i\in J_l}\|v_i^l\|_A^2 \lesssim \sum_{i\in J_l}\sum_{\tau\subset\omega_i^l} h_\tau^{-2}\|v_i^l\|_{0,\tau}^2 \lesssim \sum_{\tau\in T_l} h_\tau^{-2}\|(\Pi_l-\Pi_{l-1})v\|_{0,\tau}^2.$$
Invoking the approximability and stability of $\Pi_l$ and following the same argument as in Lemma 4.3.6, we have
$$\sum_{\tau\in T_l} h_\tau^{-2}\|v - \Pi_l v\|_{0,\tau}^2 \lesssim \|v\|_{1,\Omega_{l+1}}^2 \quad\text{and}\quad \|\Pi_l v\|_{0,\Omega_l} \lesssim \|v\|_{0,\Omega_{l+1}}.$$
So,
$$\sum_{\tau\in T_l} h_\tau^{-2}\|(\Pi_l - \Pi_{l-1})v\|_{0,\tau}^2 = \sum_{\tau\in T_l} h_\tau^{-2}\|\Pi_l(I-\Pi_{l-1})v\|_{0,\tau}^2 \lesssim \sum_{\tau\in T_l} h_\tau^{-2}\|(I-\Pi_{l-1})v\|_{0,\omega_\tau^l}^2 \lesssim \sum_{\tau\in T_l} h_\tau^{-2}\|(I-\Pi_{l-1})v\|_{0,\tilde\omega_\tau^l}^2 \lesssim \sum_{\tau\in T_{l-1}} h_\tau^{-2}\|(I-\Pi_{l-1})v\|_{0,\tau}^2 \lesssim \|v\|_1^2,$$
where $\omega_\tau^l$ is the union of the elements in $T_l$ that intersect $\tau \in T_l$ and $\tilde\omega_\tau^l$ is the union of the elements in $T_{l-1}$ that intersect $\omega_\tau^l$. Therefore,
$$\sum_{l=1}^J\sum_{i\in\tilde N_l}\|v_i^l\|_A^2 \lesssim \sum_{l=1}^J\sum_{\tau\in T_l} h_\tau^{-2}\|(\Pi_l - \Pi_{l-1})v\|_{0,\tau}^2 \lesssim J\,\|v\|_1^2 \lesssim \log(N)\,\|v\|_A^2.$$

Strengthened Cauchy-Schwarz inequality: Proof of (A2)

In this subsection, we establish the strengthened Cauchy-Schwarz inequality for the space decomposition (4.5). Assume there is an ordered index set $\Lambda = \{\alpha \mid \alpha = (l_\alpha, k_\alpha),\ k_\alpha \in \tilde N_{l_\alpha},\ l_\alpha = 1,\ldots,J\}$, with the ordering defined as follows: for any $\alpha, \beta \in \Lambda$, if $l_\alpha > l_\beta$, or $l_\alpha = l_\beta$ and $k_\alpha > k_\beta$, then $\alpha > \beta$. The strengthened Cauchy-Schwarz inequality is given as follows.

Theorem. For any $u_\alpha = v_k^l \in V_\alpha = V_{l,k}$ and $v_\beta = v_j^m \in V_\beta = V_{m,j}$, with $\alpha = (l,k)$, $\beta = (m,j) \in \Lambda$, we have
$$\sum_{\alpha\in\Lambda}\sum_{\beta\in\Lambda,\,\beta>\alpha} (u_\alpha, v_\beta)_A \lesssim \Big(\sum_{\alpha\in\Lambda}\|u_\alpha\|_A^2\Big)^{1/2}\Big(\sum_{\beta\in\Lambda}\|v_\beta\|_A^2\Big)^{1/2}.$$

Proof. For any $\alpha \in \Lambda$, denote
$$n(\alpha) = \{\beta \in \Lambda \mid \beta > \alpha,\ \omega_\beta \cap \omega_\alpha \neq \emptyset\}, \qquad v_k^\alpha = \sum_{\beta\in n(\alpha),\, g_\beta = k} v_\beta,$$
where $\omega_\alpha$ is the support of $V_\alpha$ and $g_\alpha = \max_{\tau\subset\omega_\alpha} g_\tau$. Since the mesh is a K-mesh, for any $\tau \subset \omega_\alpha$ we have
$$(u_\alpha, v_k^\alpha)_{1,\tau} \lesssim \Big(\frac{h_k}{h_{g_\alpha}}\Big)^{1/2} |u_\alpha|_{1,\tau}\, h_k^{-1}\, \|v_k^\alpha\|_{0,\tau}.$$
Then, fixing $u_\alpha$,
$$(u_\alpha, v_k^\alpha)_{1,\omega_\alpha} = \sum_{\tau\subset\omega_\alpha} (u_\alpha, v_k^\alpha)_{1,\tau} \lesssim \Big(\frac{h_k}{h_{g_\alpha}}\Big)^{1/2} |u_\alpha|_{1,\omega_\alpha}\, h_k^{-1} \Big(\sum_{\beta\in n(\alpha),\, g_\beta = k} \|v_\beta\|_{0,\omega_\alpha}^2\Big)^{1/2},$$
and therefore
$$\Big(u_\alpha, \sum_{\beta\in\Lambda,\,\beta>\alpha} v_\beta\Big)_A = \Big(u_\alpha, \sum_{\beta\in n(\alpha)} v_\beta\Big)_{1,\omega_\alpha} = \sum_{k=g_\alpha}^{J} (u_\alpha, v_k^\alpha)_{1,\omega_\alpha} \lesssim \sum_{k=g_\alpha}^{J} \Big(\frac{h_k}{h_{g_\alpha}}\Big)^{1/2} |u_\alpha|_{1,\omega_\alpha}\, h_k^{-1} \Big(\sum_{\beta\in n(\alpha),\, g_\beta=k} \|v_\beta\|_{0,\omega_\alpha}^2\Big)^{1/2}.$$
We sum over $u_\alpha$ level by level:
$$\sum_{\alpha\in\Lambda}\sum_{\beta>\alpha} (u_\alpha, v_\beta)_A \lesssim \sum_{l=1}^{J}\sum_{g_\alpha=l}\sum_{k=l}^{J} \Big(\frac{h_k}{h_l}\Big)^{1/2} |u_\alpha|_{1,\omega_\alpha}\, h_k^{-1} \Big(\sum_{\beta\in n(\alpha),\, g_\beta=k} \|v_\beta\|_{0,\omega_\alpha}^2\Big)^{1/2}.$$

Applying the Cauchy-Schwarz inequality over $\alpha$ with $g_\alpha = l$ and using $h_k/h_l \eqsim 2^{-(k-l)}$, we obtain
$$\sum_{\alpha\in\Lambda}\sum_{\beta>\alpha}(u_\alpha,v_\beta)_A \lesssim \sum_{l=1}^{J}\sum_{k=l}^{J}\Big(\frac{h_k}{h_l}\Big)^{1/2} \Big(\sum_{g_\alpha=l}|u_\alpha|_{1,\omega_\alpha}^2\Big)^{1/2} \Big(\sum_{g_\beta=k} h_k^{-2}\|v_\beta\|_{0}^2\Big)^{1/2} \lesssim \sum_{l=1}^{J}\sum_{k=l}^{J} 2^{-(k-l)/2} \Big(\sum_{g_\alpha=l}\|u_\alpha\|_A^2\Big)^{1/2} \Big(\sum_{g_\beta=k}\|v_\beta\|_A^2\Big)^{1/2} \lesssim \Big(\sum_{\alpha\in\Lambda}\|u_\alpha\|_A^2\Big)^{1/2}\Big(\sum_{\beta\in\Lambda}\|v_\beta\|_A^2\Big)^{1/2},$$
where the last step uses the discrete Cauchy-Schwarz inequality for the geometrically decaying double sum. This gives us the desired estimate.

Using the Gauß-Seidel method as the smoother means choosing the exact inverse on each of the subspaces $V_{l,k}$. Therefore, the assumption on the smoother is satisfied as well. Consequently, we have the (nearly) uniform convergence of the multigrid method on the auxiliary grid.

Theorem. The multigrid method on the auxiliary grid based on the space decomposition (4.5) is nearly optimal; its convergence rate is bounded by $1 - \frac{1}{C\log(N)}$.

Condition number estimation

Now we estimate the condition number of the auxiliary space preconditioner by verifying the assumptions of the auxiliary space theorem. The assumption (3.26) is the continuity of the smoother $S$. We first prove this assumption for the Jacobi and Gauß-Seidel iterations. For the Jacobi method, the square of the energy norm can be computed by summing local contributions from the cells $\tau_i$ of the mesh $T$:
$$\|v\|_A^2 = \Big\|\sum_{i\in J}\xi_i\varphi_i\Big\|_A^2 = a\Big(\sum_{i\in J}\xi_i\varphi_i, \sum_{j\in J}\xi_j\varphi_j\Big) = \sum_{i\in J}\sum_{j:\,\omega_i\cap\omega_j\neq\emptyset} a(\xi_i\varphi_i, \xi_j\varphi_j) \le \sum_{i\in J}\sum_{j:\,\omega_i\cap\omega_j\neq\emptyset}\tfrac12\big(\|\xi_i\varphi_i\|_A^2 + \|\xi_j\varphi_j\|_A^2\big) \le K \sum_{i\in J}\|\xi_i\varphi_i\|_A^2 = K\,\langle Dv, v\rangle,$$

where $K$ is the maximal number of non-zeros in a row of $A$. Thus the choice $c_s = K$ fulfills the continuity assumption. The continuity of the Gauß-Seidel method can also be proved.

Lemma (Continuity for Gauß-Seidel). The stiffness matrix $A = D - L - U$ fulfills
$$\frac{1}{K}\,\langle (D-L)\xi, \xi\rangle \le \langle D\xi, \xi\rangle \le 2\,\langle (D-L)\xi, \xi\rangle, \qquad \forall \xi \in \mathbb{R}^N. \qquad (4.7)$$

Proof. For any $\xi\in\mathbb{R}^N$ we have
$$\langle (D-L)\xi, \xi\rangle = \sum_{i\in I}\sum_{j\le i} a(\xi_i\varphi_i, \xi_j\varphi_j) \le \sum_{i\in I}\sum_{j:\,\Phi_i\cap\Phi_j\neq\emptyset} a(\xi_i\varphi_i, \xi_j\varphi_j) \le K\,\langle D\xi,\xi\rangle,$$
and conversely $\langle D\xi,\xi\rangle \le \langle (A+D)\xi,\xi\rangle = 2\,\langle (D-L)\xi,\xi\rangle$. So we get the desired estimate.

In order to prove assumptions (3.27) and (3.28), we need the following lemmas for the transfer operator between $V$ and $\tilde V$.

Lemma (Local stability property). For any auxiliary space function $v \in \tilde V$ and any element $\tau \in T$, the quasi-interpolation $\Pi$ satisfies
$$\|\Pi v\|_{k,\tau} \lesssim h_\tau^{j-k}\, \|v\|_{j,\omega_\tau}, \qquad j, k \in \{0, 1\},$$
where $\omega_\tau$ is the union of elements in the auxiliary grid $\tilde T$ that intersect $\tau$.

Proof. Assume that $p_i$, $i = 1, 2, 3$, are the nodal points of $\tau$ and $\varphi_{i,\tau}$ ($\psi_{i,\tau}$) the corresponding nodal (dual) basis functions. Let $\Phi_i = \bigcup_{m\in M_i} \tau_m$ denote the union of elements adjacent to $p_i$ in the grid $T$, and let $\tilde\Phi_i = \bigcup_{m\in \tilde M_i} \tilde\tau_m$ denote the union of elements in the auxiliary grid $\tilde T$ that intersect $\Phi_i$. Then we can estimate
$$\|\Pi v\|_{k,\tau} \le \sum_{i=1}^3 |\Pi v(p_i)|\, \|\varphi_{i,\tau}\|_{k,\tau} \lesssim h_\tau^{d/2-k} \sum_{i=1}^3 |\Pi v(p_i)|.$$

Continuing the estimate,
$$h_\tau^{d/2-k}\sum_{i=1}^3 |\Pi v(p_i)| = h_\tau^{d/2-k}\sum_{i=1}^3\Big|\int_{\tilde\Phi_i}\psi_{i,\tau}(\xi)\,v(\xi)\,d\xi\Big| \le h_\tau^{d/2-k}\sum_{i=1}^3\|\psi_{i,\tau}\|_{\Phi_i}\,\|v\|_{0,\tilde\Phi_i} \lesssim h_\tau^{d/2-k}\sum_{i=1}^3 h_\tau^{-d}\sum_{m\in \tilde M_i} h_{\tilde\tau_m}^{d/2+j}\,\|v\|_{j,\tilde\tau_m} \le h_\tau^{d/2-k}\sum_{i=1}^3 h_\tau^{-d}\Big(\sum_{m\in \tilde M_i} h_{\tilde\tau_m}^{d+2j}\Big)^{1/2}\Big(\sum_{m\in \tilde M_i}\|v\|_{j,\tilde\tau_m}^2\Big)^{1/2} \lesssim h_\tau^{d/2-k}\sum_{i=1}^3 h_\tau^{-d/2+j}\,\|v\|_{j,\tilde\Phi_i} \lesssim h_\tau^{j-k}\,\|v\|_{j,\omega_\tau},$$
which proves the desired inequality.

Lemma. For any auxiliary space function $v \in \tilde V$, the interpolation operator $\Pi$ satisfies
$$\sum_{\tau\in T} h_\tau^{-2}\,\|v - \Pi v\|_{0,\tau}^2 \lesssim \|v\|_{1,\Omega}^2 \quad\text{and}\quad \|\Pi v\|_{1,\Omega}^2 \lesssim \|v\|_{1,\Omega_J}^2. \qquad (4.8)$$

Proof. The proof follows an argument presented in Xu [47]. Let $\hat T$ be the set of elements in $T$ which lie in $\Omega_J$ and do not intersect its boundary, i.e.
$$\hat T = \{\tau \mid \tau \in T,\ \tau \subset \Omega_J,\ \tau \cap \partial\Omega_J = \emptyset\}, \qquad \hat\Omega = \bigcup_{\tau\in\hat T} \tau.$$
Then,
$$\sum_{\tau\in T} h_\tau^{-2}\|v - \Pi v\|_{0,\tau}^2 \lesssim \sum_{\tau\in\hat T} h_\tau^{-2}\|v - \Pi v\|_{0,\tau}^2 + \sum_{\tau\in T\setminus\hat T} h_\tau^{-2}\|v\|_{0,\tau}^2 + \sum_{\tau\in T\setminus\hat T} h_\tau^{-2}\|\Pi v\|_{0,\tau}^2.$$
For any element $\tau \in \hat T$, with $\omega_\tau$ the union of elements in the auxiliary grid $T_J$ that intersect $\tau$,
$$h_\tau^{-2}\|v - \Pi v\|_{0,\tau}^2 \lesssim \sum_{\tilde\tau\subset\omega_\tau} h_{\tilde\tau}^{-2}\|v - \Pi v\|_{0,\tilde\tau}^2 \lesssim \|v\|_{1,\omega_\tau}^2. \qquad (4.9)$$

So,
$$\sum_{\tau\in\hat T} h_\tau^{-2}\|v - \Pi v\|_{0,\tau}^2 \lesssim \sum_{\tau\in\hat T}\|v\|_{1,\omega_\tau}^2 \lesssim \|v\|_{1,\Omega_J}^2 \lesssim \|v\|_{1,\Omega}^2.$$
By the Poincaré inequality and scaling, if $G_\eta$ is a reference square ($d=2$) or cube ($d=3$) of side length $\eta$, then $\eta^{-2}\|w\|_{0,G_\eta}^2 \lesssim \int_{G_\eta} |\nabla w|^2\,dx$ holds for all functions $w$ vanishing on one edge of $G_\eta$. For any $\tau \in T\setminus\hat T$, by covering $\tau$ with subregions that can be mapped onto $G_{\eta_\tau}$, $\eta_\tau \eqsim h_\tau$, we conclude that
$$\sum_{\tau\in T\setminus\hat T} h_\tau^{-2}\|w\|_{0,\tau}^2 \lesssim \sum_{\tau\in T\setminus\hat T}\|w\|_{1,G_{\eta_\tau}}^2 \lesssim \|w\|_{1,\Omega}^2.$$
Applying the above estimate with $w = v$ and $w = \Pi v$, one has
$$\sum_{\tau\in T\setminus\hat T} h_\tau^{-2}\|v\|_{0,\tau}^2 + \sum_{\tau\in T\setminus\hat T} h_\tau^{-2}\|\Pi v\|_{0,\tau}^2 \lesssim \|v\|_{1,\Omega}^2 + \|\Pi v\|_{1,\Omega}^2 \lesssim \|v\|_{1,\Omega}^2.$$
For the second inequality in (4.8), $\|\Pi v\|_{1,\Omega}^2 \lesssim \|v\|_{1,\Omega_J}^2 \lesssim \|v\|_{1,\Omega}^2$, so we have the desired estimate.

We can now verify the remaining assumptions of the auxiliary space theorem.

Lemma. For any $v \in \tilde V$, we have $\|\Pi v\|_{1,\Omega} \lesssim \|v\|_{1,\Omega_J}$.

Proof. By the local stability of $\Pi$,
$$\|\Pi v\|_{1,\Omega}^2 = \sum_{\tau\in T}\|\Pi v\|_{1,\tau}^2 \lesssim \sum_{\tau\in T}\|v\|_{1,\omega_\tau}^2 \lesssim \|v\|_{1,\Omega_J}^2.$$
The desired estimate then follows.

Lemma. For any $v \in V$, there exist $v_0 \in V$ and $w \in \tilde V$ such that
$$\|v_0\|_{S^{-1}}^2 + \|w\|_{1,\Omega_J}^2 \lesssim \|v\|_{1,\Omega}^2.$$

Proof. For any $v \in V$, let $w := \tilde\Pi v$ and $v_0 = v - \Pi w$. Then
$$\|v_0\|_{S^{-1}}^2 + \|w\|_{1,\Omega_J}^2 \lesssim \sum_{\tau\in T} h_\tau^{-2}\|v - \Pi w\|_{0,\tau}^2 + \|w\|_{1,\Omega_J}^2 \lesssim \sum_{\tau\in T} h_\tau^{-2}\|v - \Pi v\|_{0,\tau}^2 + \sum_{\tau\in T} h_\tau^{-2}\|w - \Pi w\|_{0,\tau}^2 + \|w\|_{1,\Omega_J}^2 \lesssim \|v\|_{1,\Omega}^2.$$

Theorem. If the multigrid method on the Dirichlet auxiliary grid is used as the preconditioner $B$ on the auxiliary space, and the ASMG preconditioner is defined by (3.25) or (3.31), then
$$\kappa(BA) \lesssim \log(N).$$

Chapter 5

Colored Gauss-Seidel Method by Auxiliary Grid

The Gauss-Seidel method is attractive in many applications, for example for solving a set of inter-dependent constraints or as a smoother in multigrid methods. The heart of the algorithm is a loop that sequentially processes each unknown. Since the algorithm is inherently sequential, it is not easy to parallelize efficiently. If we relax the strict precedence relationships in the Gauss-Seidel method, parallelism can be extracted even for highly coupled problems [118]: instead of a true Gauss-Seidel method, each processor performs Gauss-Seidel as a subdomain solver within a block Jacobi method. With this approach, as we add more processing units the algorithm tends toward a Jacobi iteration. The main drawback is that the convergence rate may suffer and the iteration may even diverge [104].

Much effort has been devoted to the parallelization of the Gauss-Seidel method, and many parallelization schemes have been developed for regular problems. A red-black coloring scheme is used to solve the Poisson equation with the finite difference method in [119, 120]; this scheme has been extended to multi-coloring [121, 104]. Other parallel Gauss-Seidel methods for distributed systems can be found in [122, 123, 124, 125]. One of the key procedures of the parallelization is grouping the independent unknowns based on a graph-coloring approach. However, for the multi-colored Gauss-Seidel method the number of parallel communications required per iteration is proportional to the number of colors, hence it tends to be slow, especially on unstructured grids. Therefore, the purpose of this chapter is to develop a new coloring scheme for unstructured grids so that an efficient parallel Gauss-Seidel method can be implemented for practical problems on general unstructured grids.
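Before discussing the coloring itself, the following sketch shows how a coloring is consumed by a Gauss-Seidel sweep: unknowns of the same color are mutually independent, so the inner loop over one color could be executed in parallel (here it is written sequentially for clarity). The names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def colored_gauss_seidel_sweep(A, b, x, colors):
    """One Gauss-Seidel sweep driven by a proper coloring.  `A` is a sparse CSR
    matrix, `colors[i]` the color of unknown i.  Within one color the updates do
    not depend on each other (a_ij = 0 for same-colored i, j)."""
    A = sp.csr_matrix(A)
    x = x.copy()
    for c in np.unique(colors):
        for i in np.where(colors == c)[0]:          # independent updates within one color
            start, end = A.indptr[i], A.indptr[i + 1]
            cols, vals = A.indices[start:end], A.data[start:end]
            diag = vals[cols == i][0]
            off = vals @ x[cols] - diag * x[i]      # sum_{j != i} a_ij x_j
            x[i] = (b[i] - off) / diag
    return x
```

One parallel synchronization per color is needed, which is why a coloring with few colors is essential for efficiency.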

Thanks to the auxiliary grids, an unstructured grid can be associated with a structured grid which is generated by an adaptive tree. Since trees admit numerous parallel constructs, it is plausible that efficient parallel coloring procedures can be developed on shared-memory or distributed-memory parallel computers. We develop a coloring algorithm for quadtrees in 2D and octrees in 3D which runs in linear time. By coloring the adaptive tree, we can color the DoFs by blocks, where each block collects the DoFs sitting in the same hypercube of the auxiliary structured grid. Therefore, a colored blockwise Gauss-Seidel smoother can be applied with the aggregates serving as non-overlapping blocks.

5.1 Graph Coloring

In order to implement the colored Gauss-Seidel method, we need to assign a color to each DoF such that all DoFs with the same color are independent. For a general unstructured grid this is usually done by a graph coloring algorithm. Define a graph $G = (V, E)$, where each vertex in $V$ is one DoF and there is an edge $e = (v_i, v_j) \in E$ whenever $a_{ij} \neq 0$. We introduce the following definitions:

Definition 10 (Proper Coloring, k-coloring). A proper coloring is an assignment of colors to the vertices of a graph so that no two adjacent vertices have the same color. A k-coloring of a graph is a proper coloring involving a total of k colors.

In order to describe how many colors are needed to color a graph, we need the following definition:

Definition 11 (Chromatic Number). The chromatic number of a graph is the minimum number of colors in a proper coloring of that graph, denoted by $\chi(G)$.

In general, it is difficult to compute or estimate the chromatic number of complicated graphs. Theoretical results on graph coloring do not offer much good news: even approximating the chromatic number of a graph is known to be NP-hard [126]. There is, however, a greedy algorithm for properly coloring vertices that gives an upper bound on the number of colors needed. The complexity of the greedy algorithm is $O(n\,d)$, where $d$ is the average degree. The parallelization of the greedy algorithm is difficult since it is inherently sequential. Several approaches based on maximal independent sets exist, but they generate more colors than a sequential implementation and the speedup is poor, especially on unstructured grids [127].
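A minimal sketch of this greedy procedure (listed as Algorithm 21 below), applied directly to the adjacency graph of a sparse matrix, might look as follows; the function name and interface are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def greedy_coloring(A, order=None):
    """Greedy vertex coloring of the adjacency graph of a sparse matrix A:
    vertices i and j are adjacent when a_ij != 0.  Returns an integer array of
    colors 0, 1, 2, ...; the number of colors is at most max_degree + 1."""
    A = sp.csr_matrix(A)
    n = A.shape[0]
    order = range(n) if order is None else order
    colors = -np.ones(n, dtype=int)                  # -1 means "not yet colored"
    for i in order:
        nbrs = A.indices[A.indptr[i]:A.indptr[i + 1]]
        forbidden = {colors[j] for j in nbrs if j != i and colors[j] >= 0}
        c = 0
        while c in forbidden:                        # smallest permissible color
            c += 1
        colors[i] = c
    return colors
```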

Algorithm 21 The Greedy Algorithm for Coloring Vertices of a Graph
  Let $v_1, \ldots, v_n$ be an ordering of $V$
  for $i = 1$ to $n$ do
    Determine the colors forbidden for $v_i$
    Assign $v_i$ the smallest permissible color

Clearly, the greedy algorithm provides a proper coloring. However, the number of colors it uses depends on the graph and on the order in which we color the vertices. An upper bound on the number of colors is given by the following theorem.

Theorem (Greedy Coloring Theorem). If $d$ is the largest of the degrees of the vertices in a graph $G$, then $G$ has a proper coloring with $d+1$ or fewer colors, i.e., the chromatic number of $G$ is at most $d+1$.

5.2 Quadtree Coloring

The quadtree coloring problem was introduced by Benantar et al. [128, 129], motivated by problems of parallel computation on quadtree-structured finite element meshes. There are several variants of the problem: the quadtree may be balanced or unbalanced (balanced quadtrees are the ones typically used in the finite element method), and two squares may be considered neighbors when they share an edge (edge adjacency) or when they share any vertex or edge (vertex adjacency). We focus on the coloring of vertex-adjacency balanced quadtrees.

The coloring of the quadtree can be done by the general graph coloring algorithm applied to the dual graph $G$ of the quadtree, obtained by representing each rectangle as a node and putting an edge between two nodes if the corresponding rectangles share an edge or a vertex. By Theorem 5.1.1, the maximum number of colors needed to color a vertex-adjacency balanced quadtree with the greedy algorithm is 12; in 3D, the number is 32. However, the special structure of the quadtree allows better algorithms. For a full quadtree the coloring procedure is simple: 4 colors are sufficient for a 2D quadtree and 8 colors for a 3D octree. For an adaptive quadtree, we seek coloring procedures that (1) use a small number of colors, (2) have a reasonable parallel complexity, and (3) are applicable to arbitrary quadtree/octree structures. Recall that in a balanced tree, one-level differences are permitted between quadrants sharing an edge and two-level differences between quadrants sharing a common vertex.

Benantar et al. [128] provide an algorithm showing that with corner adjacency, balanced quadtrees require at most six colors. Eppstein et al. [130] provide lower-bound examples showing that at least five colors are necessary for balanced corner adjacency.

Theorem ([130]). There is a balanced quadtree requiring five colors for any coloring in which no two squares sharing an edge or a corner have the same color.

Proof. Consider the balanced quadtree shown in Figure 5.1 and assume it can be colored with only 4 colors; 4 different colors are then used for the center rectangles, see Figure 5.1.

Figure 5.1. A balanced quadtree requires at least five colors.

Rectangle A has three possible options: color 1, 2, or 4. Once we choose the color for A, we can fill in the colors of some rectangles by the following two rules:

1. If rectangle $s$ has 3 colored neighbors, assign the remaining fourth color to $s$;
2. If rectangle $s$ has a corner shared by three other rectangles, each of which is adjacent to some rectangle of color $i$, assign color $i$ to $s$.

For the different choices of color for A, Figure 5.2 shows the results of choosing A as color 1 and as color 4. Since the quadtree is symmetric with respect to $y = x$, coloring A as 2 is similar to coloring A as 1.

No matter which color is chosen for A, it leads to an inconsistency at B: B would have to be both color 1 and color 2. Therefore, the quadtree cannot be colored with 4 colors.

Figure 5.2. Forced coloring of rectangles.

The six-color algorithm in [128] is suitable for parallel implementation. In order to present the algorithm, a 2D binary graph is obtained from the finite quadtree.

Definition 12. A 2D binary graph is a directed binary graph obtained from a finite quadtree by the following construction:

1. The root of the quadtree corresponds to the root of the binary graph.
2. Every terminal quadrant is associated with a node of the binary graph.
3. Nodes across a common horizontal edge in the quadtree are connected in the binary graph.
4. When a quadrant is divided, its parent node in the binary graph becomes the root of a subgraph.

Assume we want to distribute all quadrants among six colors 1, 2, 3, 4, 5, and 6, and group the six colors into three sets, $a_0 = \{1, 2\}$, $a_1 = \{3, 4\}$, and $a_2 = \{5, 6\}$. Each set consists of two disjoint colors that alternate in a column-order traversal of the quadtree representation of the domain. A column-order traversal of the quadtree is equivalent to a depth-first traversal of the binary graph.

Figure 5.3. Adaptive quadtree and its binary graph.

Whenever the left and right branches of the binary graph merge, the traversal continues using the color set associated with the left branch. Two of the three color sets are passed to a node of the binary graph; assume they are $a$ and $b$. At each branching, the color set $a$ and the third color set $c$ are passed to the left offspring, while the sets $a$ and $b$ are passed in reverse order to the right offspring. This process results in the recursive coloring procedure described in Algorithm 22.

Algorithm 22 Six-coloring method for the binary graph in 2D
  coloring_binary_graph(root, $a_0$, $a_1$, $a_2$):
  if not (root = NULL or root is colored) then
    color root using an alternating coloring from set $a_0$;
    coloring_binary_graph(left offspring, $a_0$, $a_2$, $a_1$);
    coloring_binary_graph(right offspring, $a_1$, $a_0$, $a_2$);

Theorem ([131]). For any balanced quadtree, the coloring given by Algorithm 22 is a proper vertex-adjacent coloring, i.e., no two squares sharing an edge or a corner have the same color.

Proof. There are three possible relationships between two rectangles $S_1$ and $S_2$: they may share an edge in the $x$ direction, share an edge in the $y$ direction, or share one vertex.

If $S_1$ and $S_2$ share an $x$-direction edge, then $S_1$ and $S_2$ are parent and child in the binary graph; assume $S_1$ is the parent of $S_2$. The possible cases are:

1. The depths of $S_1$ and $S_2$ are the same;

2. The depth of $S_1$ is larger than that of $S_2$ and $S_2$ is the left offspring;
3. The depth of $S_1$ is larger than that of $S_2$ and $S_2$ is the right offspring;
4. The depth of $S_1$ is smaller than that of $S_2$ and $S_1$ is on the left branch;
5. The depth of $S_1$ is smaller than that of $S_2$ and $S_1$ is on the right branch.

All cases lead to one of two outcomes: either $S_1$ and $S_2$ are colored by the alternating coloring of the same color set, or $S_1$ and $S_2$ are colored from different color sets. In both cases, $S_1$ and $S_2$ have different colors.

If $S_1$ and $S_2$ share a $y$-direction edge, then $S_1$ and $S_2$ have a common ancestor $S_0$. Assume $S_1$ is the rightmost descendant of the left offspring of $S_0$ and $S_2$ is the leftmost descendant of the right offspring of $S_0$, and denote the left and right offspring of $S_0$ by $S_3$ and $S_4$. Assume the color sets assigned to $S_0$ are $\{a_0, a_1, a_2\}$. Then the color sets of $S_3$ are $\{a_0, a_2, a_1\}$ and the color sets of $S_4$ are $\{a_1, a_0, a_2\}$. So the color sets of $S_1$ are $\{a_0, a_2, a_1\}$ or $\{a_2, a_0, a_1\}$, and the color sets of $S_2$ are $\{a_1, a_0, a_2\}$ or $\{a_1, a_2, a_0\}$. Therefore $S_1$ and $S_2$ are colored from different color sets, and hence have different colors.

If $S_1$ and $S_2$ share a vertex, then $S_1$ and $S_2$ again have a common ancestor $S_0$, and the possible cases are the same as when $S_1$ and $S_2$ share a $y$-direction edge, so $S_1$ and $S_2$ have different colors.

Therefore, any two neighboring rectangles have different colors, which completes the proof.

The actual implementation utilizes an additional stack structure instead of the binary graph in order to reduce the number of tree traversals. Figure 5.4 shows the coloring result for the binary tree of Figure 5.3.

Remark. A corner-adjacent balanced quadtree may require either five or six colors; whether a coloring algorithm using only 5 colors exists is still an open problem.

5.3 Tree Representations

The classic representation of a quadtree/octree uses pointers to store the data. Besides the data, each node stores pointers to each of its child nodes; in leaf nodes, the pointers to the children are NULL. A pointer from a child node to its parent node can also be added in order to simplify upward traversals.

Figure 5.4. Six-coloring of an adaptive quadtree.

In order to speed up the traversal, one option is to replace pointers by indices. In that case, the reference to a child node is replaced by a computation on the parent's index, and the nodes are stored in an index table. This structure is more compact than the pointer representation and allows direct constant-time access to any node of the tree given its index, while the pointer representation only allows direct access to the root node. Since the tree structure is generated from the grid, the index can also be generated systematically from the geometrical position of a node by encoding the node's position in the tree hierarchy. A common choice for such an index is the Morton code. This method efficiently generates a unique index for each node, while offering good spatial locality and easy computation. Another advantage of Morton codes is their hierarchical order: it is possible to create a single index for each node while preserving the tree hierarchy. The index can be calculated from the tree hierarchy recursively when traversing the tree: the root has index 0, and the index of each child node is the concatenation of its parent's index with the direction of the child, coded over $d$ bits. Bottom-up traversal is also possible, since to find the parent index we only have to truncate the last $d$ bits of a child index.

On the other hand, the index can equivalently be computed from the geometric position of the node's box and its size. Assume the root box is the unit box, and that the position of a box is $(x_1, x_2, \ldots, x_d)$ with size $2^{-l}$. Then the node is at depth $l$ and the Morton code can be generated as
$$a_1^1 a_2^1 \cdots a_d^1\; a_1^2 a_2^2 \cdots a_d^2\; \cdots\; a_1^l a_2^l \cdots a_d^l, \qquad (5.1)$$
where $a_i^l a_i^{l-1}\cdots a_i^1$ is the binary decomposition of $2^l x_i$, $i = 1, 2, \ldots, d$. Figure 5.5 shows an example of the Morton codes of the leaf nodes of an adaptive quadtree.

Figure 5.5. The Morton codes of an adaptive quadtree.

5.4 Parallel Implementation of the Coloring Algorithm

With the index representation of the tree structure, we can rewrite Algorithm 22 and generalize the algorithm to 3D octrees. If $d = 2$, assume that the index of a node $i$ is $I^i = a_1^1 a_2^1\, a_1^2 a_2^2\, \cdots\, a_1^l a_2^l$.
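A small sketch of computing such an index from the geometric position, in the spirit of (5.1): interleave, level by level, the bits of the scaled coordinates, and obtain the parent's index by truncating the last $d$ bits. The function names are illustrative, and the bit-ordering convention here is only one of several possibilities.

```python
def morton_code(x, l, d=2):
    """Morton index of a box at depth l with lower corner x in the unit box:
    interleave the bits of the binary decompositions of 2^l * x_i."""
    coords = [int(round(x_i * (1 << l))) for x_i in x]   # 2^l * x_i as integers
    code = 0
    for k in range(l):                                    # level k+1 contributes d bits
        for i in range(d):
            bit = (coords[i] >> (l - 1 - k)) & 1          # k-th most significant bit
            code = (code << 1) | bit
    return code

def parent_code(code, d=2):
    """Truncate the last d bits to obtain the parent's index."""
    return code >> d

# Example: the four children of the unit square at depth 1 (d = 2)
print([morton_code((x, y), 1) for x in (0.0, 0.5) for y in (0.0, 0.5)])
```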

We can rewrite the index as a primary index $I_1^i = a_1^1 a_1^2\cdots a_1^l$ and a secondary index $I_2^i = a_2^1 a_2^2\cdots a_2^l$. The primary index determines the color set used for the node: if two nodes have the same primary index, they have the same color set (cf. Figure 5.6). Since the lengths of the primary indices may differ, we only compare the values of the indices, i.e. we ignore leading 0s in the primary index; for example, 010 = 10. The color set is determined by the remainder of the division of $I_1^i$ by 3: if $I_1^i \bmod 3 = k$, the color of node $i$ is taken from color set $a_k$. We then order the nodes with the same primary index by their secondary index and color them alternately from the chosen color set. The comparison of the secondary codes is by binary digits from right to left; for example, the ordered sequence of $\{00, 001, 10, 11, 101\}$ is $\{00, 10, 001, 101, 11\}$. In order to color all the nodes, we only need to reorder the nodes by the primary and secondary indices, use the primary index to determine the color set, and color the nodes alternately within the chosen color set.

Figure 5.6. Adaptive quadtree and its binary graph.

Theorem. Six colors are sufficient to color the nodes of a quadtree so that no two adjacent squares, i.e. squares sharing a vertex or an edge, have the same color.

Proof. Since the new algorithm is equivalent to Algorithm 22, the claim follows from Theorem 5.2.2.
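The index-based coloring just described can be sketched as follows. The names are illustrative, and the right-to-left comparison of secondary indices is simplified here (string reversal), so this is a sketch of the idea rather than the exact parallel implementation.

```python
def split_index(bits, d=2):
    """Split an interleaved index, given as a bit string of length d*l, into the
    primary (first coordinate) and secondary (second coordinate) parts."""
    return bits[0::d], bits[1::d]

def color_quadtree(node_bits):
    """Assign one of six colors (0..5) to each node given its interleaved index
    bit string.  Color sets are a_k = {2k, 2k+1}; the set is chosen by
    (primary index) mod 3, and the two colors of the set alternate along the
    right-to-left ordering of the secondary index."""
    groups = {}
    for bits in node_bits:
        p, s = split_index(bits)
        key = int(p, 2) if p else 0                    # compare values, ignoring leading zeros
        groups.setdefault(key, []).append((bits, s))
    colors = {}
    for key, members in groups.items():
        k = key % 3                                    # color set a_k
        members.sort(key=lambda bs: bs[1][::-1])       # right-to-left digit comparison
        for rank, (bits, _) in enumerate(members):
            colors[bits] = 2 * k + (rank % 2)          # alternate within the set
    return colors

# Example: the four depth-1 quadrants with interleaved indices '00', '01', '10', '11'
print(color_quadtree(['00', '01', '10', '11']))
```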

If $d = 3$, we can design a similar algorithm for the octree with 18 colors. First, we separate the 18 colors into 9 sets, $a_1 = \{1,2\}$, $a_2 = \{3,4\}$, $\ldots$, $a_9 = \{17,18\}$, and then group the 9 sets into 3 set groups: $b_1 = \{a_1, a_2, a_3\}$, $b_2 = \{a_4, a_5, a_6\}$, and $b_3 = \{a_7, a_8, a_9\}$. For a node $i$ with index $I^i = a_1^1 a_2^1 a_3^1\, a_1^2 a_2^2 a_3^2\,\cdots\, a_1^l a_2^l a_3^l$, we separate the index into three indices: the primary-1 index $I_1^i = a_1^1 a_1^2\cdots a_1^l$, the primary-2 index $I_2^i = a_2^1 a_2^2\cdots a_2^l$, and the secondary index $I_3^i = a_3^1 a_3^2\cdots a_3^l$. The primary-1 index determines the set group and the primary-2 index determines the color set. Ordering the nodes with the same primary indices by the secondary index, we can then color the nodes by the ordering of the primary-1, primary-2, and secondary indices (cf. Figure 5.7).

Figure 5.7. Coloring of a 3D adaptive octree.

Theorem. Eighteen colors are sufficient to color the nodes of an octree so that no two adjacent cubes, i.e. cubes sharing a vertex, an edge, or a face, have the same color.

Proof. Assume the two neighboring cubes are A and B. There are four cases. If A and B have the same primary-1 and primary-2 indices, they are colored from the same color set, denoted $a_i$; since we color the two cubes by the alternating colors in $a_i$, A and B have different colors. Suppose A and B have the same primary-1 index but different primary-2 indices. Assume the primary-2 index of A is $I_2^a$ with level $l_1$, and the primary-2 index of B is $I_2^b$ with level $l_2$. Since A and B are neighbors, $l_2 \in \{l_1 - 1, l_1, l_1 + 1\}$. If $l_2 = l_1$, then


Computers and Mathematics with Applications Computers and Mathematics with Applications 68 (2014) 1151 1160 Contents lists available at ScienceDirect Computers and Mathematics with Applications journal homepage: www.elsevier.com/locate/camwa A GPU

More information

Numerical Methods I Non-Square and Sparse Linear Systems

Numerical Methods I Non-Square and Sparse Linear Systems Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant

More information

Course Notes: Week 1

Course Notes: Week 1 Course Notes: Week 1 Math 270C: Applied Numerical Linear Algebra 1 Lecture 1: Introduction (3/28/11) We will focus on iterative methods for solving linear systems of equations (and some discussion of eigenvalues

More information

Algebraic Multigrid Methods for the Oseen Problem

Algebraic Multigrid Methods for the Oseen Problem Algebraic Multigrid Methods for the Oseen Problem Markus Wabro Joint work with: Walter Zulehner, Linz www.numa.uni-linz.ac.at This work has been supported by the Austrian Science Foundation Fonds zur Förderung

More information

Bootstrap AMG. Kailai Xu. July 12, Stanford University

Bootstrap AMG. Kailai Xu. July 12, Stanford University Bootstrap AMG Kailai Xu Stanford University July 12, 2017 AMG Components A general AMG algorithm consists of the following components. A hierarchy of levels. A smoother. A prolongation. A restriction.

More information

10.6 ITERATIVE METHODS FOR DISCRETIZED LINEAR EQUATIONS

10.6 ITERATIVE METHODS FOR DISCRETIZED LINEAR EQUATIONS 10.6 ITERATIVE METHODS FOR DISCRETIZED LINEAR EQUATIONS 769 EXERCISES 10.5.1 Use Taylor expansion (Theorem 10.1.2) to give a proof of Theorem 10.5.3. 10.5.2 Give an alternative to Theorem 10.5.3 when F

More information

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit V: Eigenvalue Problems Lecturer: Dr. David Knezevic Unit V: Eigenvalue Problems Chapter V.4: Krylov Subspace Methods 2 / 51 Krylov Subspace Methods In this chapter we give

More information

Iterative Solution methods

Iterative Solution methods p. 1/28 TDB NLA Parallel Algorithms for Scientific Computing Iterative Solution methods p. 2/28 TDB NLA Parallel Algorithms for Scientific Computing Basic Iterative Solution methods The ideas to use iterative

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

A SHORT NOTE COMPARING MULTIGRID AND DOMAIN DECOMPOSITION FOR PROTEIN MODELING EQUATIONS

A SHORT NOTE COMPARING MULTIGRID AND DOMAIN DECOMPOSITION FOR PROTEIN MODELING EQUATIONS A SHORT NOTE COMPARING MULTIGRID AND DOMAIN DECOMPOSITION FOR PROTEIN MODELING EQUATIONS MICHAEL HOLST AND FAISAL SAIED Abstract. We consider multigrid and domain decomposition methods for the numerical

More information

Solving Sparse Linear Systems: Iterative methods

Solving Sparse Linear Systems: Iterative methods Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccs Lecture Notes for Unit VII Sparse Matrix Computations Part 2: Iterative Methods Dianne P. O Leary c 2008,2010

More information

Solving Sparse Linear Systems: Iterative methods

Solving Sparse Linear Systems: Iterative methods Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 2: Iterative Methods Dianne P. O Leary

More information

FEniCS Course. Lecture 0: Introduction to FEM. Contributors Anders Logg, Kent-Andre Mardal

FEniCS Course. Lecture 0: Introduction to FEM. Contributors Anders Logg, Kent-Andre Mardal FEniCS Course Lecture 0: Introduction to FEM Contributors Anders Logg, Kent-Andre Mardal 1 / 46 What is FEM? The finite element method is a framework and a recipe for discretization of mathematical problems

More information

Optimal multilevel preconditioning of strongly anisotropic problems.part II: non-conforming FEM. p. 1/36

Optimal multilevel preconditioning of strongly anisotropic problems.part II: non-conforming FEM. p. 1/36 Optimal multilevel preconditioning of strongly anisotropic problems. Part II: non-conforming FEM. Svetozar Margenov margenov@parallel.bas.bg Institute for Parallel Processing, Bulgarian Academy of Sciences,

More information

Math Introduction to Numerical Analysis - Class Notes. Fernando Guevara Vasquez. Version Date: January 17, 2012.

Math Introduction to Numerical Analysis - Class Notes. Fernando Guevara Vasquez. Version Date: January 17, 2012. Math 5620 - Introduction to Numerical Analysis - Class Notes Fernando Guevara Vasquez Version 1990. Date: January 17, 2012. 3 Contents 1. Disclaimer 4 Chapter 1. Iterative methods for solving linear systems

More information

From Stationary Methods to Krylov Subspaces

From Stationary Methods to Krylov Subspaces Week 6: Wednesday, Mar 7 From Stationary Methods to Krylov Subspaces Last time, we discussed stationary methods for the iterative solution of linear systems of equations, which can generally be written

More information

Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners

Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners Solving Symmetric Indefinite Systems with Symmetric Positive Definite Preconditioners Eugene Vecharynski 1 Andrew Knyazev 2 1 Department of Computer Science and Engineering University of Minnesota 2 Department

More information

6.4 Krylov Subspaces and Conjugate Gradients

6.4 Krylov Subspaces and Conjugate Gradients 6.4 Krylov Subspaces and Conjugate Gradients Our original equation is Ax = b. The preconditioned equation is P Ax = P b. When we write P, we never intend that an inverse will be explicitly computed. P

More information

Review of matrices. Let m, n IN. A rectangle of numbers written like A =

Review of matrices. Let m, n IN. A rectangle of numbers written like A = Review of matrices Let m, n IN. A rectangle of numbers written like a 11 a 12... a 1n a 21 a 22... a 2n A =...... a m1 a m2... a mn where each a ij IR is called a matrix with m rows and n columns or an

More information

CAAM 454/554: Stationary Iterative Methods

CAAM 454/554: Stationary Iterative Methods CAAM 454/554: Stationary Iterative Methods Yin Zhang (draft) CAAM, Rice University, Houston, TX 77005 2007, Revised 2010 Abstract Stationary iterative methods for solving systems of linear equations are

More information

Multigrid Methods for Elliptic Obstacle Problems on 2D Bisection Grids

Multigrid Methods for Elliptic Obstacle Problems on 2D Bisection Grids Multigrid Methods for Elliptic Obstacle Problems on 2D Bisection Grids Long Chen 1, Ricardo H. Nochetto 2, and Chen-Song Zhang 3 1 Department of Mathematics, University of California at Irvine. chenlong@math.uci.edu

More information

7.4 The Saddle Point Stokes Problem

7.4 The Saddle Point Stokes Problem 346 CHAPTER 7. APPLIED FOURIER ANALYSIS 7.4 The Saddle Point Stokes Problem So far the matrix C has been diagonal no trouble to invert. This section jumps to a fluid flow problem that is still linear (simpler

More information

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication.

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication. CME342 Parallel Methods in Numerical Analysis Matrix Computation: Iterative Methods II Outline: CG & its parallelization. Sparse Matrix-vector Multiplication. 1 Basic iterative methods: Ax = b r = b Ax

More information

K.S. Kang. The multigrid method for an elliptic problem on a rectangular domain with an internal conductiong structure and an inner empty space

K.S. Kang. The multigrid method for an elliptic problem on a rectangular domain with an internal conductiong structure and an inner empty space K.S. Kang The multigrid method for an elliptic problem on a rectangular domain with an internal conductiong structure and an inner empty space IPP 5/128 September, 2011 The multigrid method for an elliptic

More information

Iterative techniques in matrix algebra

Iterative techniques in matrix algebra Iterative techniques in matrix algebra Tsung-Ming Huang Department of Mathematics National Taiwan Normal University, Taiwan September 12, 2015 Outline 1 Norms of vectors and matrices 2 Eigenvalues and

More information

Spectral element agglomerate AMGe

Spectral element agglomerate AMGe Spectral element agglomerate AMGe T. Chartier 1, R. Falgout 2, V. E. Henson 2, J. E. Jones 4, T. A. Manteuffel 3, S. F. McCormick 3, J. W. Ruge 3, and P. S. Vassilevski 2 1 Department of Mathematics, Davidson

More information

Multigrid Methods for Saddle Point Problems

Multigrid Methods for Saddle Point Problems Multigrid Methods for Saddle Point Problems Susanne C. Brenner Department of Mathematics and Center for Computation & Technology Louisiana State University Advances in Mathematics of Finite Elements (In

More information

Iterative methods for positive definite linear systems with a complex shift

Iterative methods for positive definite linear systems with a complex shift Iterative methods for positive definite linear systems with a complex shift William McLean, University of New South Wales Vidar Thomée, Chalmers University November 4, 2011 Outline 1. Numerical solution

More information

A Novel Aggregation Method based on Graph Matching for Algebraic MultiGrid Preconditioning of Sparse Linear Systems

A Novel Aggregation Method based on Graph Matching for Algebraic MultiGrid Preconditioning of Sparse Linear Systems A Novel Aggregation Method based on Graph Matching for Algebraic MultiGrid Preconditioning of Sparse Linear Systems Pasqua D Ambra, Alfredo Buttari, Daniela Di Serafino, Salvatore Filippone, Simone Gentile,

More information

Algebraic Multigrid Preconditioners for Computing Stationary Distributions of Markov Processes

Algebraic Multigrid Preconditioners for Computing Stationary Distributions of Markov Processes Algebraic Multigrid Preconditioners for Computing Stationary Distributions of Markov Processes Elena Virnik, TU BERLIN Algebraic Multigrid Preconditioners for Computing Stationary Distributions of Markov

More information

Boundary Value Problems - Solving 3-D Finite-Difference problems Jacob White

Boundary Value Problems - Solving 3-D Finite-Difference problems Jacob White Introduction to Simulation - Lecture 2 Boundary Value Problems - Solving 3-D Finite-Difference problems Jacob White Thanks to Deepak Ramaswamy, Michal Rewienski, and Karen Veroy Outline Reminder about

More information

APPLIED NUMERICAL LINEAR ALGEBRA

APPLIED NUMERICAL LINEAR ALGEBRA APPLIED NUMERICAL LINEAR ALGEBRA James W. Demmel University of California Berkeley, California Society for Industrial and Applied Mathematics Philadelphia Contents Preface 1 Introduction 1 1.1 Basic Notation

More information

Jae Heon Yun and Yu Du Han

Jae Heon Yun and Yu Du Han Bull. Korean Math. Soc. 39 (2002), No. 3, pp. 495 509 MODIFIED INCOMPLETE CHOLESKY FACTORIZATION PRECONDITIONERS FOR A SYMMETRIC POSITIVE DEFINITE MATRIX Jae Heon Yun and Yu Du Han Abstract. We propose

More information

Introduction to Scientific Computing

Introduction to Scientific Computing (Lecture 5: Linear system of equations / Matrix Splitting) Bojana Rosić, Thilo Moshagen Institute of Scientific Computing Motivation Let us resolve the problem scheme by using Kirchhoff s laws: the algebraic

More information

Next topics: Solving systems of linear equations

Next topics: Solving systems of linear equations Next topics: Solving systems of linear equations 1 Gaussian elimination (today) 2 Gaussian elimination with partial pivoting (Week 9) 3 The method of LU-decomposition (Week 10) 4 Iterative techniques:

More information