Parallele Numerik. Miriam Mehl. Institut für Parallele und Verteilte Systeme Universität Stuttgart



Contents

1 Introduction
2 Parallelism and Parallel Architecture Basics
    2.1 Levels of Parallelism
        Bit-Level Parallelism
        Instruction-Level Parallelism / Pipelining
        Data Parallelism
        Task Parallelism
    2.2 Performance Analysis
        Strong Scaling and Amdahl's Law
        Weak Scaling and Gustafson's Law
    2.3 Cost, Chances, and Pitfalls in Parallel Computing
        Costs
        Chances
        Pitfalls
    2.4 Data Dependency Graphs
3 Elementary Linear Algebra Problems (BLAS)
    3.1 Level-1 BLAS: Vector-Vector Operations
        DOT
        SAXPY
        Further Level-1 BLAS Routines
    3.2 Level-2 BLAS: Matrix-Vector Operations
        Dense Matrices
        Banded Matrices
    3.3 Level-3 BLAS: Matrix-Matrix Operations
4 Linear Systems of Equations with Dense Matrices
    4.1 Gaussian Elimination
        Basic Properties
        Parallel Gaussian Elimination
    4.2 Parallel Solving of Triangular Systems
    4.3 QR-Decomposition
        Basics
        QR-Decomposition in Parallel
5 Sparse Matrices
    Storage Schemes for Sparse Matrices
    Matrices and Graphs
    Gaussian Elimination for Sparse Matrices
    Parallel Direct Solvers
6 Iterative Methods for Sparse Linear Systems of Equations
    Relaxation Methods
    Krylov-Subspace Methods
    Preconditioning
    Multigrid
7 Domain Decomposition
    Overlapping Domain Decomposition
    Non-overlapping Domain Decomposition
    Partitioned Multi-Physics
8 Parareal
    Basic Ideas
    A Simple Example
    Convergence/Efficiency


Chapter 1 Introduction

The scope of this lecture is to revise standard numerical methods with parallel computations in mind. As a prerequisite, basic knowledge in numerical methods and linear algebra is required.

Why should we consider parallel computing? Many important problems in science and engineering can be solved only with a huge computational effort. Think of climate and weather predictions, astrophysics, fusion reactors, and biomedical applications as examples. The required compute power for such simulations can only be provided by massively parallel computer architectures, as energy density limits prohibit a further increase of the clock rates of single cores.

[Figure: Simulated probabilities for a warming by more than 2 degrees after different time periods, calculated with different models.]

[Figure: Part of the Millennium simulation in astrophysics involving different scales; wwwmpa.mpa-garching.mpg.de.]

[Figure: Fusion reactor as a climate-friendly future energy source?]

[Figure: Stress and stability simulation for orthopedic implants.]

[Figure: Number of cores of the TOP500 list of the world's most powerful computers (Nov 2015); an interactive version is available online.]

Currently number 8 of the Top500 list: Cray XC40@HLRS (Hazel Hen), 7.4 PFlop/s theoretical peak performance, more than 300,000 cores.

Having to perform these highly complex simulations on a massively parallel supercomputer, we face a problem that is similar to the situation in an open plan office. Back in the 1970s, computing was easy, comparable to a single worker at a single desk working through a pile of data and related tasks. If we imagine this worker as a one-man start-up company and assume that the company is successful, the single worker will be able to cope with the increasing workload for a while simply by getting more experienced:

This is what happened to computers in the 1980s and 1990s: they got more experienced, i.e., faster, and provided more memory, but still had only a single CPU (except for high performance computing architectures). However, at some point, where the workload exceeds the abilities of a single worker even if he or she is a professional, a burnout is going to happen. For computer architectures, this is true even literally, as the theoretically possible energy density reached its limit. At this point, hiring a first employee is highly recommended. This was done with the development of the first dual-core processors such as the Intel Xeon Woodcrest (1.6 GHz, 2 cores) in the 2000s. As professional employees are hard to find, an alternative approach is to hire a lot of less skilled people. A corresponding computer architecture can be found, e.g., in the Intel Xeon Phi 5120D (1.05 GHz, 60 cores). Such a team can now work in parallel and theoretically do the same work in half the time. However, there are also some new challenges. E.g., if both workers try to calculate the sum s = a + b + c + d + ..., there might be conflicts on who is allowed to write the result of the next addition to s first. If one of the workers writes a new result to s after his colleague has read s and before the colleague writes back his result to s, we even get wrong results in the end (race condition).

s = a + b + c + d + ...
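The race condition on s can be reproduced directly in code. The following OpenMP sketch is not part of the lecture (the variable names are made up); it contrasts the unsynchronized update with a corrected variant using a reduction:

#include <stdio.h>

int main(void) {
    double x[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* racy version: both threads read and write s without coordination,
       so an update of one thread can overwrite the other thread's result */
    double s = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < 8; ++i)
        s += x[i];                     /* unsynchronized read-modify-write */

    /* correct version: each thread accumulates a private partial sum,
       the partial sums are combined once at the end */
    double s_ok = 0.0;
    #pragma omp parallel for reduction(+ : s_ok)
    for (int i = 0; i < 8; ++i)
        s_ok += x[i];

    printf("racy: %g, correct: %g\n", s, s_ok);
    return 0;
}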

Another problem can occur when both workers make a copy of parts of the data in the shared large data pile to be able to work faster on this small copy (private cache) than on the unhandy large pile (shared main memory).

a = a + b, a = a + c

If the two copies overlap in terms of their content, a mechanism is required that ensures that both workers immediately notice changes in the copy of the colleague. This becomes relevant if they want to add two numbers to the same variable, for example. As soon as the first number has been added by the first worker, the result has to be made public to the other one immediately in order to ensure correct results (cache coherence).

Even if the two workers have clearly separated responsibilities in terms of data, a disadvantageous choice of data may lead to severe efficiency breakdowns: copies from the large data pile (main memory) always have to be made from connected extracts of the total data. Thus, if worker one works on the variables a, c, e, ..., and worker two on the variables b, d, f, ..., e.g., to increment each of them, the copies (cache lines) the two workers have are going to overlap to a large extent if the data are stored in the order a, b, c, d, e, f, ... in main memory. Thus, cache coherency has to be ensured in a costly synchronization step after each operation!

But also the pure act of having to look for data in the large shared data pile (main memory) can be very time consuming, whereas data available in the small private copy (cache) of a worker can be retrieved very fast. Therefore, it is not a good idea to execute s = a + h + z + f + r + w + k + g + ... if the data are stored in the order a, b, c, d, .... This frequently requires fetching data from the large pile as they are not yet in the local copy (cache miss). If we use the correct order, our mechanism of copying a whole set of data (cache line) and not only a single variable will ensure that most variables are already available in the local copy when needed (cache hit).

s = a + h + z + f + r + w + k + g + ...

The next obvious step for a growing company is to increase not only the number of employees (cores), but also the number of desks (chips), leading to a typical open plan office with all its strengths and challenges:
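The effect of the access order can be illustrated with a small sketch (hypothetical code, not part of the lecture): summing array elements in storage order reuses every fetched cache line, whereas a large stride wastes most of each line.

/* Hypothetical sketch: the same reduction with a cache-friendly and a
   cache-unfriendly access pattern. The caller must provide an array with
   at least n (respectively n*stride) elements. */
double sum_in_storage_order(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i];            /* neighbouring addresses: mostly cache hits */
    return s;
}

double sum_with_stride(const double *a, int n, int stride) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i * stride];   /* jumps in memory: once the stride exceeds the
                                 cache-line size, (almost) every access misses */
    return s;
}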

The first challenge is the noise created by the need for global communication (all-to-all), e.g., to let everybody know that the calculation of a new time step of a simulation is finished. If everybody tries to talk to everybody in a pairwise manner, this is going to cause a lot of noise and take a long time to finish the communication. Therefore, in spite of the trend towards flat hierarchies, a strict organization in, e.g., a binary tree is required to increase the efficiency. This means that we organize the workers in pairs that first agree on the finalization of a time step before sharing this information via the head of the pair with a second group. This pair of pairs again communicates with another pair of pairs via their leader, and so forth. We are going to see this binary tree concept again several times throughout the lecture.

Another obvious problem is the communication per se. Walking to your colleague to share a piece of information often takes longer than the actual work you have invested to create this information. This issue becomes even worse if the open plan office becomes larger and you, thus, have to walk further (we didn't talk about the coffee yet that you might have together at the occasion). Similar observations also hold in computer architectures, where the speed and bandwidth of communication increase much more slowly than the compute power. The only way out is to try to minimize the amount and the frequency of data communication as well as the number of colleagues (other processes) you have to communicate with.

Concluding, we can state that open plan offices are great if

- responsibilities are clearly defined,
- important information is available to all,
- work and material are well-organized,
- work packages have small overlaps and dependencies,
- not everybody has to talk to everybody,
- talking is minimized in general,
- no-one has to wait for others.

The worst management failure for the leader of the open plan office, however, is to formulate work packages for the employees in such a way that only a part of them can work simultaneously whereas the other part has to wait for the results. This sounds trivial but can be a severe problem in some cases. Just think of logistics experts trying to plan the packing of a truck. If half the workers are responsible for the heavy items that need to be loaded first, the second half of the workers, who are responsible for the light-weight items that need to be piled on top of the heavy ones, will be sleeping until the first have completed their tasks. This situation is very similar to doing a simulation of a time-dependent process and trying to parallelize in time by assigning time intervals to different groups of processors.

With this picture of an open plan office in mind, we will answer the following question in this lecture: What do we need to simulate the mentioned examples efficiently on a parallel computer? Recalling other lectures such as Grundlagen des Wissenschaftlichen Rechnens, we observe that a whole hierarchy of building blocks is required in a simulation code: simple algebraic operations such as +, -, * and /, more complex scalar operations, but in particular a lot of matrix and vector operations as components of solvers for large linear or nonlinear systems of equations. In this lecture, we will have a look at parallelization options on all these levels in subsequent chapters. But before we do so, we revisit various aspects of parallel computer architectures that are relevant for our tasks. We do not have a closer look at parallel programming models. For those, refer, e.g., to the lecture High Performance Computing.

Chapter 2 Parallelism and Parallel Architecture Basics

2.1 Levels of Parallelism

Bit-Level Parallelism

Bit-level parallelism is completely transparent to the user and refers to the length of a computer word that can be handled in a single clock cycle. If this word length were only one, 16 machine instructions would be required to add two 2-byte = 16-bit integers. Today's computers usually have a 64-bit architecture, i.e., even long 8-byte integers can be added with a single machine instruction. This level of parallelism will not be very relevant for the remainder of this lecture. However, we should keep it in mind if we, e.g., have to decide whether to use integers or vectors of booleans for a certain purpose.

Let's look at an example: We have already had a look at octree and quadtree computational grids in Grundlagen des Wissenschaftlichen Rechnens. For quadtrees, we used the Morton order as a numbering and storage order for the grid cells. The respective numbers are very easy to compute in a bitcode representation. At the first refinement level, only two bits are required:

[Figure: the four first-level cells in z-order with bit codes 0 = 00, 1 = 01, 2 = 10, 3 = 11.]

If we further refine the quadtree, the same z-order is applied to the children of our four first-level grid cells. To take this into account in our numbering, we simply append two further bits to our codes:

A common task in solvers for partial differential equations such as the heat equation on such a grid is to evaluate so-called difference stencils that approximate derivatives of a function discretized on the grid. This requires access to data of neighbouring cells. Computing the Morton numbers of neighbours of a cell is very easy: Having a closer look at the Morton numbers, we observe that the last bit of each bit pair denotes the position in x-direction on the respective level, whereas the first bit denotes the position in y-direction. I.e., if we look for the right neighbour of a grid cell, we increment the last bit by one. If the respective cell is not a child of the same father, this leads to a bit overflow. In this case, we flip the last bit and proceed with the x-bit of the next level, i.e., increment the x-position of the father, and so forth until we reach a bit that can be incremented without overflow. We look at this using the bit-code approach for the example of cell number 39:

39, translated to bit-code 100111; incrementing the x-bits with carries yields 110010, translated to a decimal number 50.

The cost for this calculation is obviously O(L) if L is the refinement level of the quadtree. In our example, we have L = 3. How could we have executed this calculation faster using bit-level parallelism? We have observed above that every second bit denotes the position in x-direction on the different refinement levels: The last bit denotes the position on the finest level, the last but two bit the position on the last but one level, etc. The other bits do the same for the position in y-direction. What we actually did in the example above when we looked for the neighbouring cell in x-direction was to increment the integer given by the last, the last but two, and the last but four bit by one. This would have been possible in O(1) time using bit-level parallelism. Thus, not storing the bit-code as a vector of boolean variables but as two integers (one for the x-direction, one for the y-direction) would have been the smarter implementation of the neighbour calculation:

39, translated to bit-code 100111, translated to two integers 101 = 5 (y-bits) and 011 = 3 (x-bits); y = 101 = 5 remains unchanged, x = 3 + 1 = 4 = 100; recombination gives 110010, translated to a decimal number 50.

Exercise: Up to which resolution of the quadtree can we calculate neighbour cells in O(1) time?
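The neighbour computation via two separate integers can be sketched as follows. This is a hypothetical helper, not the lecture's code; it assumes a 32-bit Morton code, i.e., at most 16 refinement levels.

#include <stdint.h>

/* extract the bits at the even positions (0, 2, 4, ...) of a Morton code */
static uint32_t even_bits(uint32_t m) {
    uint32_t r = 0;
    for (int b = 0; b < 16; ++b)
        r |= ((m >> (2 * b)) & 1u) << b;
    return r;
}

/* interleave x into the even and y into the odd bit positions */
static uint32_t interleave(uint32_t x, uint32_t y) {
    uint32_t m = 0;
    for (int b = 0; b < 16; ++b) {
        m |= ((x >> b) & 1u) << (2 * b);
        m |= ((y >> b) & 1u) << (2 * b + 1);
    }
    return m;
}

/* right neighbour in x-direction: de-interleave, increment x, re-interleave;
   e.g., morton_right_neighbour(39) yields 50 as in the example above */
uint32_t morton_right_neighbour(uint32_t code) {
    uint32_t x = even_bits(code);        /* x-position */
    uint32_t y = even_bits(code >> 1);   /* y-position */
    return interleave(x + 1, y);
}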

Instruction-Level Parallelism / Pipelining

Like bit-level parallelism, instruction-level parallelism is transparent to the user, but it has to be kept in mind if we want to implement numerical algorithms in a hardware-efficient way. In all modern CPUs, elementary operations are carried out in pipelines, which means that the operations are divided into smaller subtasks, and each subtask is executed on a piece of hardware that operates concurrently with the other stages of the pipeline. As an example, we have a look at the (floating-point) addition pipeline:

[Figure: four-stage addition pipeline: compare exponents, align exponents, add mantissas, normalize result; two operands enter, one result leaves.]

This pipeline is filled with a stream of input (two operands in case of addition) that is handed over from one stage to the next. Thus, with four stages, up to four operand pairs can be worked on in parallel, as we can see from the following visualization of pipelining:

[Figure: successive clock cycles of the addition pipeline; with every cycle a new operand pair (x_i, y_i) enters the pipeline while the pairs fed in earlier advance by one stage, so that after the startup phase one result x_i + y_i leaves the pipeline per cycle.]

Thus, the startup time, i.e., the time until the operation for the first operand pair is finished, is k (= 4) clock units. All further results are delivered within one additional clock unit u per operator evaluation. The total time for n operations is

k · u + n · u = (k + n) · u.
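For instance, with k = 4 stages and n = 1000 operand pairs, the pipelined execution takes about (4 + 1000) · u = 1004 u clock units, whereas executing the additions one after another (each occupying all k stages) would take 4 · 1000 · u = 4000 u, i.e., the pipeline comes close to the ideal speedup of k = 4.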

This leads to the following efficiency considerations for pipelines: If the pipeline is filled, one result is achieved per clock unit. Therefore, all operations should be organized such that the pipeline is always filled! If the pipeline is nearly empty, e.g., at the beginning of the computations, it is not efficient! The major task for the CPU is to organize all operations such that operands are available just in time at the right position to fill the pipeline and keep it filled.

CPU Pipelining. In general, pipelining includes the following steps:

1. instruction fetch: get the next command,
2. decoding: analyse the instruction, compute addresses of operands,
3. operand fetch: get the values of the next operands,
4. execution step: carry out the command on the operands,
5. result write: write the result to memory.

Pipelining is done for all these steps, but also inside each step. A special case of pipelining is applied for vector scaling, where one of the operands (the scaling factor) is paired with all components of the vector to execute a multiplication.

(y_1, y_2, ..., y_n)^T = α (x_1, x_2, ..., x_n)^T, i.e., y_j = α x_j for j = 1, 2, ..., n.   (2.1)

[Figure: the scaling factor α stays in the pipeline while the components x_1, x_2, ... stream through it.]

This is similar to pipelining if, for a set of data, the same operation has to be executed on all components. The total cost is given as

(startup time + vector length) · clock time = (pipeline length + vector length) · T.

Pipelining does not work properly as soon as we have data dependencies. We can easily see this if we try to compute the Fibonacci numbers using pipelining:

x_0 = 0, x_1 = 1, x_2 = x_0 + x_1, ..., x_i = x_{i-2} + x_{i-1}.   (2.2)

Here, it is obvious that we cannot compute x_3 before x_2 has been computed. Thus, the pair x_0 and x_1 has to traverse the whole pipeline before we can feed in the next pair x_1 and x_2.

Similar problems occur for recursive subroutine calls. Although pipelining happens transparently to the programmer of numerical software, it has to be kept in mind when developing numerical algorithms to run efficiently on any architecture. Another typical example for a difficult problem is summation:

sum = x(1)*x(1)
For i = 2 : N
    sum = sum + x(i)*x(i)
End

The pipeline for updating the sum has to wait until the partial result for sum has left the pipeline. In this case, loop splitting can help to make better use of the pipeline:

sum1 = x(1)*x(1)
sum2 = x(2)*x(2)
For i = 2 : n/2
    sum1 = sum1 + x(2*i-1)*x(2*i-1)
    sum2 = sum2 + x(2*i)*x(2*i)
End
sum = sum1 + sum2

Superscalar processors can execute several instances of the same stage of a pipeline concurrently such that more than one operation can be finished within a single clock cycle.

Data Parallelism

Data parallelism refers to the same operation executed on a set of data. A well-known example is the addition of two vectors, for which we can execute the addition simultaneously in all vector components. This type of parallelism is typical for vector processors. There was a big vector processor hype in the 1970s and 1980s and a series of supercomputers based on this architecture in the late 1990s and early 2000s, e.g., the Bavarian vector computer CRAY T90 with 4 vector CPUs and 7.2 GFlop/s peak performance (LRZ Munich) or the Fujitsu VPP vector system with 52 vector CPUs (LRZ Munich). After a few years during which this architecture seemed to disappear, recent architectures use vector processors again to boost their performance. Prominent examples are video game consoles, e.g., the Cell processor of the PlayStation 3, released in the 2000s and consisting of one scalar and eight vector processors, graphical processing units (GPUs), the Intel Xeon Phi (512-bit vector units), ... Vector processors can be directly addressed by the programmer using SIMD (single instruction multiple data) instruction sets such as MMX, SSE or AVX for CPUs, or extensions of C such as CUDA or OpenCL for GPUs. In addition, compilers often perform an automatic vectorization of loops which, however, does not deliver the expected result in all cases.
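The summation example from above is a typical case where the programmer has to help the compiler: a reduction clause states that the partial sums may be reordered, so the compiler can apply the loop-splitting idea automatically in pipeline stages or SIMD lanes. A minimal sketch (assuming OpenMP 4.0 or newer; not part of the lecture):

/* the reduction clause allows several partial sums to be kept in flight
   and combined at the end, i.e., the manual loop splitting from above */
double sum_of_squares(const double *x, int n) {
    double sum = 0.0;
    #pragma omp simd reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += x[i] * x[i];
    return sum;
}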

Task Parallelism

Task parallelism refers to the concurrent execution of several tasks, which can be the same or different operations on the same or different data. In the remainder of this lecture, this type of parallelism will be the one we put our focus on. If we want to provide a picture for task parallelism, we switch from a sequence of tasks t_1, t_2, ..., t_N processed one after another by a single CPU to the same problem distributed over several CPUs working concurrently.

Task parallelism can be achieved on every parallel architecture. We list the most important types here:

Multi-core processors: Several processing units called cores are located on a single chip. They can execute operations from several instruction streams called threads simultaneously.

Shared-memory architectures: Several (multi-core) processors are connected to the same main memory via a common bus (symmetric multi-processing, limited scalability).

[Figure: shared-memory architecture; several CPUs, each with its own cache, access one main memory and the I/O system via a common bus.]

The fact that different CPUs can have different caches induces the need for hardware architects to ensure cache coherence:

We look at an example to see the issue:

parallel for(i=proc_number; i<n; i=i+2)
    x[i] = 2.;

We execute this with two threads, i.e., proc_number 0 and 1: the thread with proc_number 0 changes x(0, 2, 4, ...) and the thread with proc_number 1 changes x(1, 3, 5, ...). Each changing step of thread 1 also changes data that is contained in the cache of thread 2 (and vice versa). Otherwise, the data in the two caches would not be consistent anymore! To retain the correct values in both caches after each changing step, the value in the other cache also has to be renewed! This leads to a dramatic increase of computational time, resulting in a code that is possibly slower than the sequential computation! Thus, although cache coherence does not explicitly have to be ensured by the numerical programmer, it has to be kept in mind as a potential pitfall for the efficiency of a parallel code.

Distributed-memory architectures: Several shared-memory units are connected via a network within the same machine (high scalability). This in general means that we do not have a common address space any more and the programmer explicitly has to call suitable library functions to send data from one memory to the other where required by the numerical algorithm.

[Figure: distributed-memory architecture; several nodes, each with CPU, cache, and local memory, connected via a bus/network.]

In case of virtually shared memory, we have physically distributed data that are virtually organized as shared memory. A cluster of multi-CPU processors usually combines shared and distributed memory units (Non-uniform Memory Access, NUMA):

[Figure: NUMA architecture; two groups of CPUs with caches, each group attached to its own memory via a memory controller, both connected by a bus.]

This requires different types of communication! The time required for sending data from one processor to another depends on the connection network topology (e.g., mesh, hypercube, ring, torus, binary tree).
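Coming back to the shared-memory example above, the distribution of loop iterations can be chosen to mitigate the cache-coherence traffic. A hedged OpenMP sketch (function names are made up, not the lecture's code):

/* cyclic distribution as in the example: neighbouring elements belong to
   different threads, so the same cache line is written by several threads
   and keeps bouncing between their caches */
void init_cyclic(double *x, int n) {
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < n; ++i)
        x[i] = 2.0;
}

/* block distribution: each thread writes a contiguous chunk, so a cache
   line is (except at chunk boundaries) touched by only one thread */
void init_blocked(double *x, int n) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        x[i] = 2.0;
}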

Clusters: loosely connected separate computers, typically standard PC hardware, connected via a network, at a single location.

Grids: combinations of computer resources from multiple administrative domains.

Clouds: dynamically scalable and often virtualized resources (data, software/algorithms, computing power, ...) offered as a service over the Internet on a utility basis.

2.2 Performance Analysis

Analyzing the performance of a parallel program, our main focus often is on the scalability, i.e., the suitability of the algorithm or the program to be run on an increasing number of processing units. Scalability is measured in two different settings: with a constant problem size (How much faster can we solve the same problem putting in more resources? = strong scaling) and with a problem size increasing proportionally to the number of processing units (How much larger problems can we solve in the same compute time with more resources? = weak scaling).

Strong Scaling and Amdahl's Law. Using p processors in parallel on a problem of fixed size, we can (hopefully) achieve a speedup defined as

S_p = t_1 / t_p,

i.e., the ratio of execution times with 1 vs. p processing units. In the ideal case, we observe t_1 = p · t_p, corresponding to a speedup equal to the number of processors p. The parallel efficiency

E_p = S_p / p, 0 ≤ E_p ≤ 1,   (2.3)

using p processors is an alternative measure for the performance of a parallel algorithm. A very good program yields E_p ≈ 1. We can analyze what to expect in a realistic setting making some simplifying assumptions:

1. For a fixed problem size, the program can be separated into a parallel fraction f and a sequential fraction 1 - f.
2. We can neglect other overhead of a parallel program.

With this, we get the parallel runtime on p processing units

t_p = f · t_1 / p + (1 - f) · t_1,   (2.4)

where the first term is the ideally parallel part and the second term the strongly sequential part.

This directly leads to Amdahl's law for the parallel speedup and the parallel efficiency:

S_p = t_1 / t_p = 1 / (f/p + (1 - f)) = p / (f + (1 - f)p) ≤ 1 / (1 - f),

E_p = S_p / p = 1 / (f + (1 - f)p) → 0 for p → ∞.

We always have a small portion of our algorithm that is not parallelizable. Thus, the efficiency will always tend to zero in the limit! The speedup depends on p and saturates at 1/(1 - f).

Weak Scaling and Gustafson's Law. In contrast to the strong scaling described above, a weak scaling approach assumes the problem size to grow proportionally to the number of processing units used. Again, we can make simplifying assumptions for an analysis:

1. Our problem can be solved in 1 unit of time on a parallel machine with p processors.
2. The fraction f of this parallel cost is parallelizable, 1 - f is not.
3. We can neglect other overhead of a parallel program.

Accordingly, compared with the parallel implementation, a single processor would require the compute time

(1 - f) + f · p   (2.5)

for the same job. With this, we get Gustafson's law for speedup and efficiency:

S_p = t_1 / t_p = ((1 - f) + f · p) / 1 = p + (1 - p)(1 - f),

E_p = S_p / p = (1 - f)/p + f → f for p → ∞.

Gustafson's law assumes that a fraction f of the parallel cost on p processors is parallelizable. To keep this fraction constant (i.e., independent of p), we would in practice have to increase the problem size proportionally to p.

Example: f = 0.99, p = 100 or p = 1000.
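As a quick check of both laws, a small helper program (hypothetical, not part of the lecture notes) can evaluate the formulas; the table below lists the resulting values.

#include <stdio.h>

/* f: parallelizable fraction, p: number of processing units */
static double amdahl_speedup(double f, double p)    { return p / (f + (1.0 - f) * p); }
static double gustafson_speedup(double f, double p) { return (1.0 - f) + f * p; }

int main(void) {
    const double f = 0.99;
    const double ps[] = {100.0, 1000.0};
    for (int i = 0; i < 2; ++i) {
        double p = ps[i];
        printf("p = %5.0f: Amdahl    S = %7.2f, E = %5.3f\n",
               p, amdahl_speedup(f, p), amdahl_speedup(f, p) / p);
        printf("p = %5.0f: Gustafson S = %7.2f, E = %5.3f\n",
               p, gustafson_speedup(f, p), gustafson_speedup(f, p) / p);
    }
    return 0;
}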

For f = 0.99, we obtain:

law         speedup S_p            efficiency E_p         S_100     E_100     S_1000     E_1000
Amdahl      p / (f + (1-f)p)       1 / (f + (1-f)p)       ≈ 50.3    ≈ 0.5     ≈ 91       ≈ 0.1
Gustafson   (1-f) + f·p            (1-f)/p + f            99.01     ≈ 0.99    990.01     ≈ 0.99

2.3 Cost, Chances, and Pitfalls in Parallel Computing

Costs

Sequential Algorithmic Parts. These are the only costs of parallelization that we have explicitly considered above in Amdahl's and Gustafson's law. Sequential parts can, e.g., be the calculation of a common threshold for the time-step width or checking a global convergence criterion composed of criteria from all processes. In many practical applications, also more costly parts such as the computational mesh generation are not parallelized and, thus, have to be counted as a sequential part.

Tool Overhead. A parallel code usually uses OpenMP or MPI for shared and distributed memory parallelization, respectively. This means that we introduce new pragmas (OpenMP) and function calls (MPI) in our code which increase the runtime even if we choose the number of processors to be one. In particular for MPI, also the initialization and finalization costs are non-negligible. Tasks need to be started and terminated, parallel compilers induce a software overhead, ...

Algorithmic Overhead. Parallelization usually requires additional algorithmic steps to determine parallel work packages. Examples are the domain decomposition and the colouring described later in the lecture.

Communication. The amount of communication between tasks depends upon the problem: Some problems can be decomposed and executed in parallel with virtually no need for tasks to share data. They are often called embarrassingly parallel as they are straight-forward. Most parallel applications are not so simple. They require tasks to share data with each other. The computation/communication ratio is an important qualitative measure. Assuming that periods of computation are separated from periods of communication by synchronization events, this ratio directly affects the parallel efficiency. For massively parallel machines, communication is a major part of the whole work done and, thus, has to be hidden behind

computations, i.e., executed by dedicated parts of the processing units while other parts are doing computations. In this case, the computation/communication ratio becomes an issue only if it is smaller than one (in terms of execution times). Note that considering only explicit communication between distributed memory units is not the whole story. We've seen above that also the implicit communication ensuring cache coherence within a shared memory unit has to be kept in mind.

Load Imbalances. A job should be distributed such that all processors are busy all the time! This implies that not only the computational load but also the communication load has to be distributed equally on a parallel machine with homogeneous processing units, or according to the power of the units in heterogeneous architectures. Communication has long been neglected in load balancing but is considered to be an important if not the decisive issue on massively parallel computers.

Synchronization. Many numerical algorithms require synchronization points where all processing units must halt until all other units get to this point as well. E.g., we want to decide whether an iterative solver can be stopped by checking a global residual accumulated from partial residuals computed in each processing unit. To define synchronization, there are various tools available:

1. A barrier implies that all tasks are involved. Each task performs its work until reaching the barrier. It then stops. When the last task reaches the barrier, all tasks are synchronized and continue their work.
2. A lock/semaphore can involve any number of tasks and can be blocking or non-blocking. It is used to serialize/protect access to global data or a code section. Only one task at a time may use (own) the lock/semaphore/flag. The first task to acquire the lock sets it. This task can then safely access the protected data or code. Other tasks can attempt to get the lock, but have to wait until it is released.
3. Synchronous communication operations involve only the tasks executing a communication operation. When a task performs a communication operation, coordination with the participating tasks is required.

Due to the obvious costs of task synchronization, asynchronous numerical methods, which reduce synchronization and communication at the price of higher computational costs, have been reconsidered as efficient variants of numerical schemes over the last years in the context of massively parallel systems.

Chances

Despite all the overhead caused by parallelization, it sometimes also yields an automatic increase in performance and even a superlinear speedup, i.e., a speedup larger than p for p processors.

This happens often for distributed memory parallelization, the reason being better cache usage due to the reduced amount of data to be processed in each memory unit. Thus, for a non-cache-optimal sequential program, we tend to see a sudden increase in performance when increasing the number of processing units at the point where the data of a single task are small enough to completely fit into the cache. This must not be considered a merit of efficient parallelization but rather a sign of bad cache usage and, thus, performance drawbacks of the sequential program.

Pitfalls

There are several pitfalls that can make a parallel program not only inefficient but lead to failure:

1. If two or more processors are waiting indefinitely for an event that can be caused only by one of them, each processor is waiting for results of the other processor! We speak of a deadlock.
2. In other cases, we observe race conditions, i.e., a non-deterministic outcome/result depending on the chronology of parallel tasks.

2.4 Data Dependency Graphs

One of the most important general problems in parallel computing is the dependency of data. As an example, we look at the computation of

(x_1 + x_2)(x_2 + x_3) = x_1 x_2 + x_2^2 + x_1 x_3 + x_2 x_3.

We can represent the data dependencies of these two alternatives by directed graphs:

[Figure: two dependency graphs. Left: step 1 computes x_1 + x_2 and x_2 + x_3, step 2 the product (x_1 + x_2)(x_2 + x_3). Right: step 1 computes the products x_1 x_2, x_2^2, x_2 x_3, x_1 x_3, step 2 partial sums, step 3 the total sum x_1 x_2 + x_2^2 + x_2 x_3 + x_1 x_3.]

This directly shows that the parallel computation of the first version takes 2 time steps whereas the computation of the second version takes 3 time steps due to a worse data dependency graph!

A further example is the computation of f(x, y) = x^2 - 2xy + y^2 = (x - y)^2.

[Figure: two dependency graphs. Left: step 1 computes x^2, xy, y^2, step 2 computes 2xy, step 3 the sum x^2 - 2xy + y^2. Right: step 1 computes x - y, step 2 the square (x - y)^2.]

The expanded form takes 3 time steps in parallel; the factored form takes only 2 time steps, executed sequentially!

Besides such simple analyses of algorithms, an important use case of graphs is the convergence improvement of iterative algorithms. Let's assume we have a given iterative (Jacobi-type) algorithm of the form

x^(k+1) = f(x^(k)), i.e., x_i^(k+1) = f_i(x_1^(k), ..., x_n^(k)) for i = 1, ..., n,   (2.6)

with f defined as follows:

x_1^(k+1) = f_1(x_1^(k), x_3^(k)),
x_2^(k+1) = f_2(x_1^(k), x_2^(k)),
x_3^(k+1) = f_3(x_2^(k), x_3^(k), x_4^(k)),
x_4^(k+1) = f_4(x_2^(k), x_4^(k)).

This corresponds to the dependency graph on the vertices x_1, x_2, x_3, x_4 with an edge from m to i iff x_m is an argument of f_i. Regardless of the dependency graph for f, we can always calculate all components of x^(k+1) in parallel, but we often observe slow convergence x^(k) → x.

One idea to accelerate the convergence is to always use the newest available information:

x_1^(k+1) = f_1(x_1^(k), x_3^(k)),
x_2^(k+1) = f_2(x_1^(k+1), x_2^(k)),
x_3^(k+1) = f_3(x_2^(k+1), x_3^(k), x_4^(k)),
x_4^(k+1) = f_4(x_2^(k+1), x_4^(k)).

This often leads to faster convergence, but also to a loss of parallelism, as updating the next component x_i in one iteration step can be done only after updating all the previous components x_1, ..., x_{i-1}!

[Schedule: step 1 updates x_1, step 2 updates x_2, step 3 updates x_3 and x_4.]

Such methods are therefore also called single-step or Gauss-Seidel iterations. The iteration depends on the ordering of the components of the vector x. We try a different ordering by updating x_2 last:

x_1^(k+1) = f_1(x_1^(k), x_3^(k)),
x_3^(k+1) = f_3(x_2^(k), x_3^(k), x_4^(k)),
x_4^(k+1) = f_4(x_2^(k), x_4^(k)),
x_2^(k+1) = f_2(x_1^(k+1), x_2^(k)).

[Schedule: step 1 updates x_1, x_3, and x_4 in parallel, step 2 updates x_2.]

With this algorithm, we complete one iteration in 2 time steps and, by interleaving the first step of an iteration with the second step of the previous iteration, we can even get every further iteration within one additional time step. We have seen from the example that our aim is to find the optimal ordering without losing parallel efficiency and with fast convergence! For this purpose, we can use so-called colouring algorithms for the dependency graph: These algorithms use k colours for the vertices of the graph, where k is the minimal natural number that allows a colouring such that no cycles connect vertices of the same colour! This ensures an ordering such that vertices with the same colour are independent and can be updated in parallel!

First step: Find a colouring without cycles.

[Figure: the dependency graph with x_1, x_3, x_4 in one colour (red) and x_2 in the other (black); the vertices of each colour can be updated in parallel.]

Second step: Apply an ordering for each colour based on the subgraph corresponding to the colour. This subgraph is a tree. So we can start the numbering with the leaves and end with the root. In our simple example, we can use the ordering (x_1, x_3, x_4) and (x_2), corresponding to the update sequence (x_2), (x_1, x_3, x_4), (x_2), (x_1, x_3, x_4), ... For computing x_2, we only need the newest red indices. For computing x_1, we need x_3, which is not updated yet. For computing x_3, we need the black x_2 and x_4, which is not updated yet. For computing x_4, we need the black x_2. Therefore, the red indices can be updated in parallel. In contrast to this good colouring, the following wrong colouring does not help as it has a cycle in the red vertices:

[Figure: a different colouring of the same graph in which the red vertices contain a cycle.]

There is no suitable ordering of the red vertices that avoids the data dependency in updating the red components.

Comparison of Different Colourings.

[Figure: first colouring.] Ordering: (x_4, x_2, x_1), (x_3). Newest information used: for x_3: new x_2 and x_4; for x_1: new x_3; in total: 3.

[Figure: second colouring.] Ordering: (x_1, x_4), (x_3, x_2). Newest information used: for x_1: new x_3; for x_4: new x_2; for x_3: new x_4; for x_2: new x_1; in total: 4.

Thus, the second colouring might be superior in terms of convergence properties as, in total, 4 new values are already reused within each iteration, whereas the first colouring leads to a reuse of only 3 new values. Both colourings use two colours corresponding to two time steps in a parallel implementation with a sufficient number of compute resources.

Comparison of Different k. At first sight, a minimal k is always the optimum. However, this is true only if we have a sufficient number of processors: To see this, let's assume that we have 2 processors available, all function evaluations are similar, and the distribution over the colours is balanced in terms of the number of vertices. Using the Jacobi-type iteration without any reuse of new values (corresponding to k = 1), we can put n/2 function evaluations on each processor and can get one iteration done in n/2 time steps. For k = 2, we can put n/4 vertices of the first colour on p_1 and n/4 on p_2, followed by the calculation of n/4 vertices of the second colour on p_1 and n/4 on p_2. The parallel time is again n/2. The number k of colours is only important if a large number of processors is available! If we have n processors for n functions, Jacobi takes 1 time step, Gauss-Seidel with k = 2 takes 2 time steps.
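As an aside, one simple way to obtain some valid colouring (not necessarily with the minimal k) is a greedy proper vertex colouring: adjacent vertices receive different colours, which in particular rules out cycles within one colour. The following sketch is hypothetical and not the lecture's algorithm; since its criterion is stricter than only forbidding monochromatic cycles, it may use more colours than the optimal k = 2 for the small example graph.

#define NV 4

/* adj[i][j] != 0 means that x_j is an argument of f_i or vice versa */
void greedy_colouring(const int adj[NV][NV], int colour[NV]) {
    for (int i = 0; i < NV; ++i) colour[i] = -1;          /* uncoloured */
    for (int i = 0; i < NV; ++i) {
        int used[NV] = {0};
        for (int j = 0; j < NV; ++j)                      /* colours of already */
            if (j != i && (adj[i][j] || adj[j][i]) && colour[j] >= 0)
                used[colour[j]] = 1;                      /* coloured neighbours */
        int c = 0;
        while (used[c]) ++c;                              /* smallest free colour */
        colour[i] = c;
    }
}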

A Partial Differential Equation Example

Discretization of a rectangular region:

[Figure: rectangular grid with unknowns u_{i,j}, coloured in a red-black (checkerboard) pattern.]

5-point discretization of the 2D Laplacian, e.g., in x-direction:

u_xx = ∂²u/∂x² ≈ (u_{i-1,j} - 2 u_{i,j} + u_{i+1,j}) / h².   (2.7)

Components with odd index sum i + j: red, components with even index sum: black, i.e., two colours (k = 2). If we reorder the system matrix according to the above red-black ordering, the red/black components can be updated in parallel. This colouring meets the requirements of having as few colours as possible (#colours = #time steps) and using as much new information as possible to accelerate convergence. The total number of time steps to solve the system with parallel Jacobi corresponds to the number of iterations it_Jac (provided that enough CPUs are available), whereas for the parallel coloured Gauss-Seidel, we get 2 · it_GS time steps. The uncoloured GS takes it_GS · n time steps if n is the number of vertices. k = 2 is the minimal number of colours for Gauss-Seidel unless the matrix is diagonal! Hence, the number of iterations for GS has to be less than half of the number of iterations of the Jacobi method. Otherwise, Jacobi is faster in parallel!
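A red-black Gauss-Seidel sweep for the 5-point stencil can be sketched as follows. This is a minimal OpenMP sketch under the assumption that we solve -Δu = f on a square n × n grid with mesh width h, with boundary values stored in the boundary entries of u; the function name is made up.

/* one red-black Gauss-Seidel sweep: first all cells of one colour are
   updated in parallel, then all cells of the other colour, which only
   need already updated values of the first colour */
void red_black_gauss_seidel_sweep(int n, double h, double *u, const double *f) {
    /* colour 0: cells with (i + j) even, colour 1: cells with (i + j) odd */
    for (int colour = 0; colour < 2; ++colour) {
        #pragma omp parallel for
        for (int i = 1; i < n - 1; ++i) {
            int jstart = 2 - (i + colour) % 2;     /* first j with (i+j)%2 == colour */
            for (int j = jstart; j < n - 1; j += 2) {
                /* 5-point update for -Laplace(u) = f */
                u[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j]
                                     + u[i * n + j - 1] + u[i * n + j + 1]
                                     + h * h * f[i * n + j]);
            }
        }
    }
}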


Chapter 3 Elementary Linear Algebra Problems (BLAS)

Basic linear algebra subroutines such as dot products, matrix-vector and matrix-matrix multiplications are important ingredients of many numerical problems. We will have a short look at the underlying algorithms in this chapter and recall in later chapters where we can use which of the algorithms presented here. BLAS, a library of basic linear algebra operations, was first published in 1979. Nowadays, BLAS denotes a specification of basic linear algebra routines for which various implementations are available, partially specialized for specific hardware types. BLAS is organized in levels according to complexity (number of operations in dependence on the dimensionality n of the underlying vector space):

BLAS level   type of operation   complexity
BLAS-1       vector-vector       O(n)
BLAS-2       matrix-vector       O(n^2)
BLAS-3       matrix-matrix       O(n^3)

There are machine-specific optimized BLAS libraries, and BLAS is also the basis of other libraries, e.g., LAPACK (linear algebra package for solving linear equations, least squares problems, QR-decomposition, eigenvalues, singular values, ...).

3.1 Level-1 BLAS: Vector-Vector Operations

Level-1 BLAS provides routines for O(n) problems, i.e., it operates with vectors only. In the following, x and y denote vectors in R^n.

DOT

The first example of level-1 BLAS routines is the DOT-product:

s = x^T y = Σ_{j=1}^{n} x_j y_j.

To compute this efficiently in parallel, a fan-in process is used, based on pairwise summation in a hierarchical manner:

[Figure: fan-in (binary tree) summation of the products x_j y_j; pairs of partial sums are added level by level until only the final result remains.]

Thus, the number of parallel time steps of the DOT-product cannot be less than log(n) (every computation involving a fan-in takes at least log(n) time steps in parallel). With n = 2^N and s_j^(0) = x_j y_j, we can write every step of the fan-in process as a vector update:

(s_1^(k), s_2^(k), ..., s_{2^{N-k}}^(k))^T = (s_1^(k-1), ..., s_{2^{N-k}}^(k-1))^T + (s_{2^{N-k}+1}^(k-1), ..., s_{2^{N-k+1}}^(k-1))^T.   (3.1)

Pseudocode:

for (k=1; k<=N; k++)
    for (j=1; j<=2^(N-k); j++)
        s_j = s_j + s_{j+2^(N-k)}
    end
end

Exploitation of the Different Levels of Parallelism. Computing the dot product in parallel, we can in principle exploit all levels of parallelism, although task parallelism on a distributed memory machine will rarely be required for a stand-alone dot product, but may make sense if the involved vectors are already available in a distributed way.

This can, for example, be the case if we compute an overall residual norm for a huge set of equations solved on a supercomputer.

Bit-level parallelism is not going to be explicitly addressed in the following. It is automatically used in all basic scalar algebraic operations without any involvement of the programmer.

Instruction-level parallelism can be exploited easily in a similar way as for the summation of n variables that we presented above. The computation of the products for all entries is inherently parallel and ideally suited for pipelining, and the summation is done according to the fan-in process that we already considered well-suited for summation exploiting pipelining.

Data parallelism can be used in the component-wise multiplications and for the summation of vector parts as in the pseudocode above, with decreasing degree of parallelism from one stage of the fan-in to the next.

Task-level parallelism on a shared memory machine can be achieved basically the same way as the instruction-level parallelism, with only one fundamental difference: To avoid inefficiencies due to frequent cache coherence checks and cache synchronizations, we should be careful to arrange data such that there is as little overlap as possible between the cache contents of the involved cores, e.g., by ensuring that data neighbouring in main memory are added by one core. The granularity of the parallelization obviously decreases with increasing computer size, i.e., in particular for distributed memory parallelization, we usually assign a whole section of the vectors with far more than only two elements to each shared-memory unit. Explicit communication is required on distributed memory machines as soon as the summation in each of these sections is finished. Using a similar hierarchical procedure as the fan-in shown above, the communication costs for the calculation of a dot product on a distributed memory machine are O(p), where p is the number of shared memory units involved.

SAXPY

The BLAS notation SAXPY stands for adding two vectors, one scaled with an arbitrary scalar factor α: y = αx + y.

S: single precision (D for double, C for complex); A: the scalar α; X: vector; P: plus operation; Y: vector.

A SAXPY operation is inherently perfectly parallel (embarrassingly parallel) as all multiplications and additions can be performed for all vector entries in parallel without any requirements for communication or conflicts in data access.

Exploitation of the Different Levels of Parallelism. Accordingly, also the exploitation of instruction-level parallelism, data parallelism and task-level parallelism is straightforward:

Instruction-level parallelism: A SAXPY operation can naturally be vectorized by pipelining: the factor α stays in the pipeline, each component x_k is multiplied by α, and the result α x_k is added to y_k.

Data parallelism: SAXPY is the standard example for perfect data parallelism.

Task-level parallelism: For further parallelization, in particular on distributed memory architectures, we partition the index set and the vectors and perform the operations on the resulting shorter vectors:

{1, 2, ..., n} = I_1 ∪ I_2 ∪ ... ∪ I_R,   (3.2)

x = (x_1, x_2, ..., x_n)^T = (X_1, ..., X_R)^T, y = (y_1, y_2, ..., y_n)^T = (Y_1, ..., Y_R)^T.   (3.3)

We decompose the long vectors of length n into short vectors X_j and Y_j of length n/R. Each processor P_j gets the partial vectors X_j and Y_j and computes

Y_j = α X_j + Y_j, j = 1, 2, ..., R.

Further Level-1 BLAS Routines. There are further level-1 BLAS routines that are more or less special cases of DOT and SAXPY:

SCOPY copies the values of a vector to a second vector (compare the algorithm for SAXPY): y = x or y ← x.

Norm computes the (Euclidean) norm of a vector by computing the dot product of the vector with itself and taking the square root of the result:

||x||_2 = sqrt(Σ_{j=1}^{n} x_j^2) = sqrt(x^T x).
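For the distributed-memory case, the level-1 routines can be sketched with MPI as follows (hypothetical code, not the lecture's; each rank is assumed to own a contiguous block of n_local entries of the vectors):

#include <math.h>
#include <mpi.h>

/* DOT: local partial sums without communication, then one global reduction
   (internally typically a tree/fan-in over the p ranks) */
double dot_distributed(const double *x, const double *y, int n_local, MPI_Comm comm) {
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; ++i)
        local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}

/* SAXPY: embarrassingly parallel, no communication at all */
void axpy_local(double alpha, const double *x, double *y, int n_local) {
    for (int i = 0; i < n_local; ++i)
        y[i] += alpha * x[i];
}

/* Euclidean norm via the dot product of x with itself */
double norm2_distributed(const double *x, int n_local, MPI_Comm comm) {
    return sqrt(dot_distributed(x, x, n_local, comm));
}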

3.2 Level-2 BLAS: Matrix-Vector Operations

Matrix-vector operations require O(n^2) operations (sequentially). The general operation provided by BLAS reads

y = α A x + β y.

In BLAS notation, it is called SGEMV, standing for S: single precision, GE: general, M: matrix, V: vector. Another level-2 BLAS routine solves a system Lx = b with a triangular matrix L. The analysis of the matrix-vector product is a little more involved than the same analysis for SAXPY as done in the previous section. We simplify SGEMV to the pure matrix-vector product Ab = c with

A = (a_ij), i = 1, ..., n, j = 1, ..., m, A ∈ R^{n×m}, b ∈ R^m, c ∈ R^n.

Dense Matrices

We reduce the matrix-vector product to compositions of simpler building blocks: the entries of c = Ab can be computed either as

c_i = Σ_{j=1}^{m} a_ij b_j, i = 1, ..., n,   (n DOT-products of length m)

or as

c = Σ_{j=1}^{m} b_j (a_1j, ..., a_nj)^T,   (m SAXPYs of length n, i.e., a GAXPY)

i.e., as dot products of the rows of A with b, or as a sum of the columns of A scaled by the entries of b.

Pseudocode: ij-form

c = 0;
for i=1,...,n
    for j=1,...,m
        c_i = c_i + a_ij * b_j    (inner loop: DOT-product)
    end
end

This version computes the entries c_i = A_i· b as DOT-products of the ith row of A with the vector b.

Pseudocode: ji-form

c = 0;
for j=1,...,m
    for i=1,...,n
        c_i = c_i + a_ij * b_j    (inner loop: SAXPY)
    end
end

A SAXPY (y = y + αx) updates the vector c with b_j times the jth column of A. A GAXPY is a sequence of SAXPYs related to the same vector:

y = y_0
for i = 1,...,n
    y = y + α_i x_i
end

The advantage is that the vector c that is updated can be kept in fast memory. No additional data transfer is necessary.

Exploitation of the Different Levels of Parallelism. The two options above have different implications on the parallelism:

Instruction-level parallelism: Having SAXPY operations in the inner loop results in an ideal use of pipelining, whereas dot products in the inner loop limit the efficient use of pipelining.

Data parallelism: Also in terms of data parallelism, SAXPY operations in the inner loop are favourable.

Task-level parallelism: For parallelization on the task level, we reduce the matrix-vector product to smaller matrix-vector products. For this purpose, we decompose the index sets of the matrix both in row-wise and column-wise direction:

{1, 2, ..., n} = I_1 ∪ I_2 ∪ ... ∪ I_R, disjoint: I_j ∩ I_k = ∅ for j ≠ k,   (3.4)
{1, 2, ..., m} = J_1 ∪ J_2 ∪ ... ∪ J_S, disjoint: J_j ∩ J_k = ∅ for j ≠ k.   (3.5)

We use a (virtually) two-dimensional array of processors P_rs. Processor P_rs gets the matrix block A_rs = A(I_r, J_s), b_s = b(J_s), c_r = c(I_r), and we have

c_r = Σ_{s=1}^{S} A_rs b_s = Σ_{s=1}^{S} c_r^(s).

Pseudocode:

for r = 1,..., R
    for s = 1,..., S
        c_r^(s) = A_rs b_s;
    end
end
for r = 1,..., R
    c_r = 0
    for s = 1,..., S
        c_r = c_r + c_r^(s);
    end
end

Using a personal copy of c_r on each processor P_rs results in small and independent matrix-vector products on each processor. No communication is necessary during the computations of this first step! In a second step, however, we have to perform a block-wise collection and addition of vectors. This requires row-wise communication on a shared memory machine and is a fan-in process with O(log S) complexity.

Special Case Row-Wise Blocking (S = 1): If S = 1, no communication is necessary between the processors P_1, ..., P_R. The data in b have to be available in (copied to) all processors, but each processor only has to store its part of the result c. Data locality in read operations from b is not optimal, but data locality in write operations to c is good.

[Figure: row-wise blocking; each processor holds a row slice A_r· of A and computes its part c_r of the result.]

Special Case Column-Wise Blocking (R = 1): If R = 1, the products A_·s b_s are independent in the sense that disjoint data sets from A and b are required. However, partial results have to be collected from the processors P_1, ..., P_S (fan-in) to accumulate c. Only parts of b have to be available on each processor, but the vector c has to be stored in S copies. Data locality in read operations from b is good, but data locality in write operations to c is not optimal.

[Figure: column-wise blocking; each processor holds a column slice A_·s of A and the corresponding part b_s of b and computes a partial result for the whole vector c.]

Banded Matrices. Banded matrices are very common in many numerical applications, e.g., discretized partial differential equations. They are characterized by the fact that all entries further than a certain distance (the bandwidth β) from the main diagonal are zero:

[Figure: a banded matrix; only the main diagonal and β sub- and superdiagonals contain non-zero entries.]

Thus, a banded matrix is given by 2β + 1 diagonals (the main diagonal plus β subdiagonals plus β superdiagonals). If we have β = 1, the matrix is called tridiagonal. We can write a banded matrix in two different notations: the standard matrix notation A with entries a_ij, or a notation considering the diagonals as vectors, with entries ã_{i,s} = a_{i,i+s} for s = -β, ..., β. Storing the diagonals as columns of a new matrix Ã means that we have to store only n(2β + 1) matrix entries instead of the n^2 entries in A:

Ã = (ã_{i,s}), i = 1, ..., n, s = -β, ..., β,   (3.6)

where entries with i + s < 1 or i + s > n do not exist and are stored as zeros. If we compute c = Ab with this notation, we get:

for i = 1,..., n
    c_i = A_i· b = Σ_j a_ij b_j = Σ_{s=l_i}^{r_i} a_{i,i+s} b_{i+s} = Σ_{s=l_i}^{r_i} ã_{i,s} b_{i+s}
end

where l_i = max{-β, 1-i} and r_i = min{β, n-i}. Swapping the loops over the indices i and s yields:

for s = -β,...,β
    for i = max{1-s, 1},...,min{n-s, n}
        c_i = c_i + ã_{i,s} b_{i+s}
    end
end

This corresponds to a general TRIAD computation per sweep of the s-loop, not a SAXPY (note the shifted index of b). The TRIADs (i.e., the sweeps of the s-loop) have to be computed sequentially or as a fan-in. Keeping the original loop order gives:

for i = 1,...,n
    for s = max{-β, 1-i},...,min{β, n-i}
        c_i = c_i + ã_{i,s} b_{i+s}
    end
end

Thus, we get a partial DOT-product per sweep of the i-loop (not summing over all indices of the vector b in this case). All these DOT-products can be computed in parallel without communication if b is available to all processes. Thus, sparsity means fewer operations, but also a loss of efficiency in terms of vectorizability.

Task-Level Parallelism: For parallelization with R processors P_1, ..., P_R, we use a partitioning of the index set from 1 to n,

{1, ..., n} = ∪_{r=1}^{R} I_r, I_k ∩ I_m = ∅,   (3.7)

and compute on processor P_r the I_r part of the result vector c:

for i ∈ I_r
    c_i = Σ_{s=l_i}^{r_i} ã_{i,s} b_{i+s}
end

If processor P_r gets the rows associated to the index set I_r = [m_r, M_r] in order to compute its part of the final vector c, what part of the vector b does processor P_r need? Necessary for I_r are the entries b_j = b_{i+s} with

j ≥ m_r + l_{m_r} = m_r + max{-β, 1-m_r} = max{m_r - β, 1} and
j ≤ M_r + r_{M_r} = M_r + min{β, n-M_r} = min{M_r + β, n}.

Thus, processor P_r with the index set I_r needs b_j with

j ∈ [max{1, m_r - β}, min{n, M_r + β}].   (3.8)

3.3 Level-3 BLAS: Matrix-Matrix Operations

Matrix-matrix operations typically come with O(n^3) operations (sequentially). The general BLAS operation is

C = α A B + β C

and is denoted as SGEMM: S: single precision, GE: general, M: matrix, M: matrix. As for the matrix-vector operations, we simplify our task also for the matrix-matrix operations and consider only the pure matrix-matrix product. The other ingredients of SGEMM are handled either as described for matrix-vector or for vector-vector operations (adding matrices). We consider

A = (a_ij) ∈ R^{n×m}, i = 1, ..., n, j = 1, ..., m,   B = (b_ij) ∈ R^{m×q}, i = 1, ..., m, j = 1, ..., q,   (3.9)
C = AB = (c_ij) ∈ R^{n×q}, i = 1, ..., n, j = 1, ..., q,   (3.10)

where each entry c_ij is the dot product of the ith row of A, (a_i,1, a_i,2, ..., a_i,m), with the jth column of B, (b_1,j, b_2,j, ..., b_m,j). In a naive implementation following the mathematical definition, this corresponds to the pseudocode:

for i = 1,...,n
    for j = 1,...,q
        c_ij = Σ_{k=1}^{m} a_ik b_kj
    end
end

Exploitation of the Different Levels of Parallelism. As for the level-2 BLAS routines, the parallelizability on the instruction level and the data parallel implementation options depend on the order in which we arrange the three for-loops and on the resulting type of inner operations. For parallelization on the task level, we again use block decompositions.

Instruction-level parallelism and data parallelism: We remember from the analysis of level-2 BLAS that moving the fan-in with its rather poor properties to outer loops is advantageous. The matrix-matrix product algorithm comprises three loops, as already mentioned, that can be switched from inner to outer. Thus, we have six possible settings listed in the following table:

loop order           ijk (Alg. 1)  ikj       kij        jik        jki (Alg. 2)  kji (Alg. 3)
access to A          by row        by row    by column  by row     by column     by column
access to B          by column     by row    by row     by column  by column     by row
computation of C     row           row       row        column     column        column
computation of c_ij  direct        delayed   delayed    direct     delayed       delayed
vector operation     DOT           GAXPY     SAXPY      DOT        GAXPY         SAXPY
vector length        m             q         q          m          n             n

The table lists the access patterns of the matrices A, B, and C as well as the type of vector operations resulting from the loop order in the inner loops. Here, GAXPY in the inner loops is a good case as it allows for a high degree of parallelism and results in a local access pattern to the result vector of the GAXPY. The optimal access to the matrices is organized according to the chosen storage scheme (row-wise or column-wise). We discuss some of the options in the following:

Algorithm 1: (ijk)-form: Algorithm 1 considers the matrix A as a combination of rows,

A = Σ_{i=1}^{n} e_i A_i·,   (3.11)

with the ith unit vectors e_i, i = 1, ..., n, and the matrix B as a collection of columns,

B = Σ_{j=1}^{q} B_j e_j^T.   (3.12)

The corresponding algorithm computes the dot products of rows of A and columns of B in the inner loop:

for i = 1,...,n
    for j = 1,...,q
        for k = 1,...,m
            c_ij = c_ij + a_ik b_kj    (inner loop: DOT-product of length m)
        end
    end
end

c_ij = A_i· B_j for all i, j. All entries c_ij are fully computed, one after another. Access to A and C is row-wise, to B column-wise (this depends on the innermost loops!).

Algorithm 2: (jki)-form: If we want to bring also GAXPY into play, we have to additionally swap the k-loop with the j-loop to keep operating on the same vector c_j in the two inner loops:

for j=1,...,q
    for k=1,...,m
        for i=1,...,n
            c_ij = c_ij + a_ik b_kj
        end
    end
end

We have the vector update c_j = c_j + a_k b_kj in the innermost loop and get a sequence of SAXPYs for the same vector:

c_j = Σ_k b_kj a_k.

C is computed column-wise; access to A happens column-wise, access to B column-wise also, but delayed (we access the same entry b_kj n times before proceeding to the next one, i.e., b_(k+1)j).

Algorithm 3: (kji)-form: Swapping the view of the matrices from Algorithm 1, i.e., decomposing A into columns and B into rows, we can reformulate the matrix-matrix product as

AB = (Σ_{j=1}^{m} A_j e_j^T)(Σ_{k=1}^{m} e_k B_k·) = Σ_{k,j} A_j (e_j^T e_k) B_k· = Σ_{k=1}^{m} A_k B_k·,   (3.13)

with full n × q matrices A_k B_k·, i.e., as a sum of full matrices obtained by the outer product of the kth column of A and the kth row of B.

for k=1,...,m
    for j=1,...,q
        for i=1,...,n
            c_ij = c_ij + a_ik b_kj
        end
    end
end

In the inner loop, we have the vector update c_j = c_j + a_k b_kj. Thus, we get sequences of SAXPYs for different vectors c_j, but no GAXPY. Access to A happens column-wise, access to B row-wise and delayed. C is computed with intermediate values c_ij^(k) which are computed column-wise.

Task-level parallelism is based on block-wise calculations as for level-2 BLAS. For parallelization via blocking, we have three index sets that we can partition: the indices for rows of A (1, ..., n), the indices for columns of A and rows of B (1, ..., m), and the indices for columns of B (1, ..., q):

{1, ..., n} = ∪_{r=1}^{R} I_r,  {1, ..., m} = ∪_{s=1}^{S} K_s,  {1, ..., q} = ∪_{t=1}^{T} J_t.   (3.14)

Again, we can derive special cases applying the blocking in only one or two index sets or in all three index sets:

1D-Parallelization (R = S = 1): Each processor gets the complete matrix A and a column slice B_·t of B and computes the related column slice C_·t of C = AB. This yields a communication volume of n·m·p in order to communicate the matrix A to all p processors and q·m to communicate the p slices of B of size (q·m)/p to their respective processors. The granularity (computation/communication), a relevant measure for the efficiency in case of a distributed memory parallelization, thus is

N^3 / (N^2 (1 + p)) = N / (1 + p) if n = m = q = N.

Memory requirements: A needs to be provided at all processes; in addition, each process holds its part of B and its part of the result matrix C.

48 48 them has a memory requirement of n m + m q p + n q p resulting in a total memory requirement over all processes of matrix entries. n m p + m q + n q 2D-Parallelization (S = 1): For the 2D parallelization, we assume that we have p processors where p is a square number. Each processors gets a row slice of A and a column slice of B, thus computes a full subblock of C = AB: I r A r. B.t = C rt I r J t J t With this, we get communication requirements of nm p to communicate p slices of A of size m n p to p processors, each. For sending p slices of B of size m q p to p processors each, analoguously requires the communication of qm p entries. The granularity (data total N 3 / data communicated), accordingly, is 2N 2 p = N 2 if n = m = q = N. p Each processor P rt can compute its part of c, c rt, independently and without communication. In the extreme case of n q processors, the matrix product per processors reduces to one DOT-product m c rt = a rk b kt (3.15) k=1 that can be computed with a fan-in by m nq additional processors reducing the number of parallel time steps to O(log(m)). With no further available processors, m parallel time steps are required. Memory requirements: Each slice of A and B needs to be provided to p processes, each. If we have p processors, each of them has a memory requirement of resulting in total memory over all processes of n m p + m q p + n q p n m p + m q p + n q

49 49 matrix entries. Thus, the memory requirements per process are equal to what we had in the 1D parallelization case. 3D-Parallelization: The 3D parallelization corresponds to the general blocking case (S, R, T 1) in all three index sets. We assume that the number p of processors is cubic, each processor gets a sub-block of A and a sub-block of B, and can compute a part of a subblock of C = AB. I r A rs B st K s J t K s = C C rt C rt C rt rt C rt J J t J t J t t J t I I r I r I r r I r This results in an additional fan-in to collect the parts to the full sub-block of C from q = p 1 3 processors. The communication need is nmp 1 3 to send p 2 3 pieces of A of size nm Analoguosly, we get the communication need mqp 1 3 for B. The granularity is N 3 3N 2 p 1 3 = N 3p 1 3 if n = m = p. p 2 3 to p 1 3 processors, each. Memory requirements: Each block of A and B needs to be provided to 3 p processes, each. If we have p processors, each of them has a memory requirement of n m 3 + m q 2 p 3 + n q 2 p 3 2 p resulting in a total memory over all processes of (n m + m q + n a) 3 2 p matrix entries. Thus, the memory requirements per process are equal to what we had in the 1D and in the 2D parallelization case. Actually, the memory requirements per processor can only be reduced with concepts communicating required data in just in time manner to the resoective processors (see tutorial).
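To make the block decompositions concrete, the following minimal NumPy sketch emulates the 2D parallelization serially: each pair of index sets (I_r, J_t) corresponds to one processor that owns the row slice A(I_r, :) and the column slice B(:, J_t) and computes the full block C(I_r, J_t) without communication. Function name and test sizes are chosen for illustration only.

import numpy as np

def matmul_2d_blocks(A, B, R, T):
    """Serial emulation of the 2D-parallel scheme: the block C[I_r, J_t] is the
    work of one process holding A[I_r, :] and B[:, J_t]."""
    n, m = A.shape
    _, q = B.shape
    C = np.zeros((n, q))
    row_blocks = np.array_split(np.arange(n), R)   # index sets I_r
    col_blocks = np.array_split(np.arange(q), T)   # index sets J_t
    for I_r in row_blocks:
        for J_t in col_blocks:
            # in a real run, this block would be computed by processor (r, t)
            C[np.ix_(I_r, J_t)] = A[I_r, :] @ B[:, J_t]
    return C

A = np.random.rand(6, 4)
B = np.random.rand(4, 8)
assert np.allclose(matmul_2d_blocks(A, B, R=2, T=4), A @ B)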


Chapter 4 Linear Systems of Equations with Dense Matrices

In this chapter, we consider a linear system of equations
a_{11} x_1 + ... + a_{1n} x_n = b_1, ..., a_{n1} x_1 + ... + a_{nn} x_n = b_n.
We are looking for a vector x that fulfils this system, i.e., we solve Ax = b with
$$\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix} \qquad (4.1)$$

4.1 Gaussian Elimination

Basic Properties

Gaussian elimination is a well-known standard algorithm to solve such a system, or rather to transform it into a new one that is easier to solve. We transform the system Ax = b to LUx = b with A = LU, L, U ∈ R^{n×n}, L lower triangular, U upper triangular. With this, the system Ax = b can be solved in two steps by solving Ly = b and Ux = y. Algorithms to solve these triangular systems are presented in 4.2. We recapitulate the algorithm to compute the two matrices L and U here before discussing its parallelization. The major work in Gaussian elimination is done to generate a sequence of simpler linear systems (matrices). At the end of this sequence, we have transformed A to triangular form: A → A^{(1)} → ... → A^{(n-1)} = U:

52 52 a 11 a 12 a 1n a 21 a 22 a 2n a n1 a n2 a nn (4.2) The row transformations (2) (2) a 21 a 11 (1),..., (n) (n) a n1 a 11 (1) lead to a 11 a 12 a 13 a 1n 0 a (1) A (2) 22 a (1) 23 a (1) 2n = 0 a (1) 32 a (1) 33 a (1). (4.3) 3n 0 a (1) n2 a (1) n3 a nn (1) The next transformations are (3) (3) a(2) 32 a (2) 22 (2),..., (n) (n) a(2) n2 (2) and yield a (2) 22 a 11 a 12 a 13 a 1n 0 a (1) A (3) 22 a (1) 23 a (1) 2n = 0 0 a (2) 33 a (2), (4.4) 3n 0 0 a (2) n3 a (2) nn followed by (4) (4) a(3) 43 a (3) 33 (3),..., (n) (n) a(3) n3 (3) and so on, finally giving a (3) 33 a 11 a 12 a 13 a 1n 0 a (1) A (n) 22 a (1) 23 a (1) 2n = 0 0 a (2) 33 a (2) = U. (4.5) 3n a (n 1) nn Pseudocode. To simplify the code, we assume that no pivoting is necessary (recapitulate what pivoting does to a matrix and why and when it is required), i.e., a (k 1) kk 0 or a (k 1) kk ρ > 0 for k = 1, 2,..., n 1 (4.6)

53 53 With this, we get the algorithm for k = 1 n 1...for i = k + 1 n...l i,k = a i,k a k,k...end...for i = k + 1 n...for j = k + 1 n...a i,j = a i,j l i,k a k,j...end...end end In practice, we in addition have to include pivoting and the transformation of the right hand side b, and solve the resulting triangular system in U using backward substitution! To store the actions during the elimination, we define auxiliary matrices L k = l k+1,k l n,k 0 0. (4.7) We know that we can express the elimination step in terms of these auxiliary matrices: with A (k) = (I L k ) A (k 1) U = A (n 1) = (I L n 1 ) A (n 2) =... = (I L n 1 ) (I L 1 )A A = (I + L 1 )(I + L2)... (I + L n 1 )U = LU L = l 2,1 1 0 l n,1 l n,n 1 1 and U = A (n) (4.8) Parallel Gaussian Elimination Before we parallelize the algorithm of Gaussian elimination (or LU-decomposition), we first note that 1. all entries in the same column of A can be eliminated in parallel, i.e., the n k entries of L k can be computed in parallel and (I L k ) can be applied to the (n k) (n k) involved entries of A k in parallel,

54 54 2. several columns of A can not be updated in parallel. Exploitation of the Different Levels of Parallelism. following parallelization strategies on the various levels: This observation leads to the Instruction-level parallelism and data parallelism can be exploited within the columns as mentioned above. Since the number of entries below the diagonal decreases from one column to the next, also the degree of parallelism obviously decreases. Task-level parallelism is based in a block-wise approach again. The main idea of block-wise algorithms for Gaussian elimination is to compute the entries of L and U in block-entities, e.g., L L 21 L 22 0 L 31 L 32 L 33 U 11 U 12 U 13 0 U 22 U U 33 = A 11 A 12 A 13 A 21 A 22 A 23 A 31 A 32 A 33 = (4.9) = = L 11 U 11 L 11 U 12 L 11 U 13 L 21 U 11 L 21 U 12 + L 22 U 22 L 21 U 13 + L 22 U 23 L 31 U 11 L 31 U 12 + L 32 U 22 (4.10) In the illustration above and in all following illustrations, known blocks of the matrices are marked green, those we want to compute next are red, and parts of the matrix which are to be considered later are grey. I.e., we assume that L 11, L 21, L 31, U 11, U 12, U 13 are already known and we want to compute L 2,2, L 3,2, U 2,2, and U 2,3 in the next step. The size of the block exemplarily indicates the actual size of the blocks. From the considerations above, we know that the entries in L 22 are independent from the entries in L 32, e.g., and can thus be computed on seperate processors without communication. In addition, L 22 and L 32 are responsible for the modification of different parts of A. Thus, also these modifications can be done on two different processors without communication. From (4.9) and (4.10), we have: A 22 = L 21 U 12 + L 22 U 22, A 32 = L 31 U 12 + L 32 U 22, A 23 = L 21 U 13 + L 22 U 23 Thus, we get L 22 and U 22 from the small LU-decomposition L 22 U 22 = A 22 L 21 U 12.

55 55 L 32 can be determined by solving (in parallel) a set of lower triangular systems L 32 U 22 = A 32 L 31 U 12 U T 22L T 32 = A T 32 U T 12L T 31. with the lower triangular system matrix U22 T. This step mirrors the parallelism in the lines of A as stated above. Finally, we get U 23 from solving (in parallel) the sequence of linear lower triangular systems L 22 U 23 = A 23 L 21 U 13. We apply these steps successively for the next column block in L and the next ro block in U. We repeat this overall procedure until L and U are fully computed. The corresponding algorithm is called the Crout form of block-wise LU-decomposition and consists of the following four steps within each block-column elimination (the steps have to be executed sequentially, but allow for parallelism within each of them): 1. Update involved blocks of A using A A LU (this step is skipped if no previous column-blocks have already been eliminated, i.e., A 11 R 0 0 ): A 22 = A 22 L 21 U 12 = A 23 = A 23 L 21 U 13 =

56 56 A 32 L 31 U 12 = To do this in parallel, we use parallel matrix-matrix multiplication as presented in the previous chapter, e.g., with a 3D decomposition indicated by the blocking in the illustration above. 2. Compute the small block-lu-decomposition: L 22 U 22 = A 22 = 3. Solving a collection of small independent triangular systems: L 32 U 22 = A 32 = L 22 U 23 = A 23 = To exploit task-level parallelism at this stage, we can, e.g., assign blocks of the unknown matrices to different processes as indicated in the illustrations above.

57 57 4. Repartition the matrices: After computing L 22, L 32, U 22, and U 23, we enlarge the number rows and lines in the first row and line blocks by the number of rows and lines of L 22 and define new second blocks to be computed next in the remaining parts of the matrices by splitting the until now ignored parts L 33 and U 33 into new columns/rows. Thus, we get ( L 11 L 21 L 22 ) L 11, ( L 31 L 32 ) ( L 21 L 31 ), L 33 ( L 22 L 32 L 33 ), ( U 11 U 12 U 22 ) U 11, ( U 13 U 23 ) ( U 12 U 13 ), U 33 ( U 22 U 23 U 33 ). L 11 L 11 L21 L22 U11 U12 U 13 U 11 U 12 U 13 U12 U22 U 23 L 21 L 22 L31 L32 L33 L 31 L 32 L 33 U 22 U 23 U 33 U 33 There are different ways of computing L and U following this basic concept depending on where we start, the grouping in known blocks, blocks newly to compute, and blocks to be computed later and on how we compute the new entry/row/column of L/U. We present a few commonly used approaches: The Crout Form. The crout form corresponds to the method sketched above. It proceeds column-block-wisely in L and row-block-wisely in U. Left Looking Gaussian Elimination. block-column-wisely both in L and in U: The left looking Gaussian elimination proceeds = 1. Compute U 12 by solving by a couple of parallel triangular solves. L 11 U 12 = A 12

58 58 2. Update entries in A: ( A 22 A 32 ) = ( A 22 A 32 ) ( L 21 L 31 ) U 12 (4.11) 3. Compute L 22 and U 22 by the small LU decomposition L 22 U 22 = A 22. (4.12) 4. Compute L 32 by solving L 32 U 22 = A 32. (4.13) After these three steps, reorder blocks and repeat until ready. Initial Steps: L 11 U 11 = A 11, L 21 U 11 = A 21, L 31 U 11 = A 31. Right Looking Gaussian Elimination. The right looking Gaussian elimination proceeds block-line-wisely in L and block-column-wisely in U. The difference to the Crout form is that the updates in A are done at the end of the update step for the next line and column blocks, already: This allows us to completely focus on the lower right part of the matrix A that is not eliminated yet: = 1. Compute L 22 and U 22 based on the small LU decomposition L 22 U 22 = A Compute L 31 and U 13 by solving the sets of triangular systems L 32 U 22 = A 32 and L 22 U 23 = A Update the entries in A: A 33 = A 33 L 32 U 23. After these three steps, what remains to be done is the decomposition L 33 U 33 = A 33. L 11, L 21, L 31, U 11, U 12, U 13, A 11, A 12, A 13, A 21, A 31 are not used above. Thus, the algorithm actually works the matrices ( L 22 L 32 L 33 ), ( U 22 U 23 U 33 ) and ( A 22 A 23 A 32 A 33 ), only and we can repeat the respective (2 2)-blocking for A 33 and apply the algorithm recursively.
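The right-looking variant can be summarized in a short serial sketch (NumPy/SciPy; names and block size are illustrative, and, as above, we assume that no pivoting is necessary, which holds, e.g., for the diagonally dominant test matrix below):

import numpy as np
from scipy.linalg import lu, solve_triangular

def block_lu_right_looking(A, bs):
    """Right-looking block LU without pivoting: factor the diagonal block,
    solve the two panels, update the trailing block."""
    A = A.copy()
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(0, n, bs):
        e = min(k + bs, n)
        # small LU decomposition L_22 U_22 = A_22 (we assume the permutation is the identity)
        _, Lkk, Ukk = lu(A[k:e, k:e])
        L[k:e, k:e], U[k:e, k:e] = Lkk, Ukk
        if e < n:
            # panel solves: L_22 U_23 = A_23 and L_32 U_22 = A_32
            U[k:e, e:] = solve_triangular(Lkk, A[k:e, e:], lower=True, unit_diagonal=True)
            L[e:, k:e] = solve_triangular(Ukk, A[e:, k:e].T, trans='T', lower=False).T
            # trailing update A_33 <- A_33 - L_32 U_23 (a parallel matrix-matrix product)
            A[e:, e:] -= L[e:, k:e] @ U[k:e, e:]
    return L, U

A = np.random.rand(8, 8) + 8 * np.eye(8)   # diagonally dominant, so no pivoting occurs
L, U = block_lu_right_looking(A, bs=3)
assert np.allclose(L @ U, A)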

Comparison and Overview. In comparison, all methods have nearly the same efficiency in parallel. Their performance is better than that of the unblocked variants because they are based on BLAS-3. Elementary steps of all blocking methods are matrix-matrix products and sums (easy to parallelize), a couple of triangular solves (easy to parallelize), and small LU-decompositions (parallelizable for long rows).

Parallel Solving of Triangular Systems

As a result of Gaussian elimination and QR decomposition, we get a triangular system that we have to solve in a last step. We examine the parallelization potential of this step in this section. To begin with, we examine the Data Dependencies for Solving a Triangular System:
$$\begin{pmatrix} a_{11} & & & 0 \\ a_{21} & a_{22} & & \\ \vdots & & \ddots & \\ a_{n1} & \cdots & a_{n,n-1} & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix} \qquad (4.14)$$
The calculation of x_1, ..., x_n is inherently strongly sequential:
x_1 = b_1/a_{11},
x_2 = (b_2 - a_{21} x_1)/a_{22},
x_3 = (b_3 - a_{31} x_1 - a_{32} x_2)/a_{33},
x_4 = (b_4 - a_{41} x_1 - a_{42} x_2 - a_{43} x_3)/a_{44},
and in general, for k = 1, ..., n:
$$x_k = \frac{b_k - \sum_{j=1}^{k-1} a_{kj} x_j}{a_{kk}}.$$
Altogether, we get the following dependency graph (for a_{ii} = 1):

[Dependency graph for a triangular system with four unknowns (a_{ii} = 1): after x_1 = b_1, the products a_{21} x_1, a_{31} x_1, a_{41} x_1 and the corresponding updates of b_2, b_3, b_4 can be formed in parallel; this yields x_2, then the products a_{32} x_2, a_{42} x_2 and their updates, then x_3, the product a_{43} x_3 and the update of b_4, and finally x_4.] This corresponds to groups of operations on matrix data that can be handled in parallel. Thus, a triangular system with four unknowns can be solved in seven parallel time steps. In general, at least 2n - 1 time steps are required.

Exploitation of Different Levels of Parallelism. Based on the above observations, the strategies for parallelisation are straight-forward:

Instruction-level and data parallelism can be used based on BLAS level-1 (DOT) with increasing vector length in the calculation of the sums $\sum_{i=1}^{j-1} a_{j,i} x_i$.
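In code, the row-oriented forward substitution looks as follows (a minimal NumPy sketch; the inner product A[k, :k] @ x[:k] is exactly the DOT of growing length mentioned above):

import numpy as np

def forward_substitution(A, b):
    """Solve the lower triangular system A x = b row by row."""
    n = len(b)
    x = np.zeros(n)
    for k in range(n):
        # x_k = (b_k - sum_{j<k} a_kj x_j) / a_kk
        x[k] = (b[k] - A[k, :k] @ x[:k]) / A[k, k]
    return x

A = np.tril(np.random.rand(5, 5)) + np.eye(5)
b = np.random.rand(5)
assert np.allclose(A @ forward_substitution(A, b), b)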

61 61 Task-level parallelism uses (not surprisingly) a block-wise approach resulting in smaller triangular systems and matrix-vector products: A = A 1,1 0 0 A 2,1 A 2,2 0 A 3,1 A 3,2 A 3,3, b = with A i,j R k k, B i, X i R k for all i, j = 1,..., N/k: B 1 B 2 B 3, x = The algorithms looks similar to the one above using blocks instead of scalars: X 1 X 2 X 3 Solve A 1,1 X 1 = B 1 Solve A 2,2 X 2 = B 2 A 21 X 1 Solve A 3,3 X 3 = B 3 A 31 X 1 A 32 X 2 ) QR-Decomposition Gaussian elimination and LU-decomposition are sometimes numerically not stable (cancellation effects!) and can not be used for overdetermined (more rows than columns in the system matrix A) or underdetermined (more columns than rows in the system matrix A) systems. In these cases a different type of decomposition can be applied: A = QR with Q T Q = I, R upper triangular. Solving the linear system Ax = b is then done in a numerically stable way via by cheap matrix-vector multiplication and a triangular solve. b = Ax = QRx Rx = Q T b (4.15) Basics Overdetermined Systems. An overdetermined system Ax = b = has a matrix A R m n with n < m, the unknown vector x is in R n and the right-hand side b in R m. The best approximate solution can be found by solving the minimization problem min Ax b 2 x 2 = min(x T A T Ax 2x T A T b + b T b) (4.16) x

62 62 The gradient of the norm to be minimized is to be set equal to zero to find the minimum A T Ax = A T b (normal equations). Thus, the solution can be found by considering the linear system with the system matrix A T A, but the condition number is worse than that of A (cond(a T A) = cond(a) 2 ). For overdetermined systems with full rank n, we can do a QR-decomposition A = QR R 1 = We get A T Ax = A T b (QR) T (QR)x = (QR) T b R T Rx = R T (Q T b) (R T 1 0) ( R 1 0 ) x = (RT 1 0) ˆb R T 1 R 1 x = R T 1 ˆb 1 R 1 x = ˆb 1. = Q T b = ( ˆb 1 ˆb2 ) R T 1 R 1 x = (R T 1 0) ( ˆb 1 ˆb2 ) Instead of solving the normal equations, we only have to consider the triangular system in R 1. This is cheap and has a better condition number. Calculating only the first n columns of Q and the first n lines of R is sufficient as we have A = QR = (Q 1 Q 2 ) ( R 1 0 ) = Q 1R 1 R 1 R 1 = Q 1 Q2 = Q 1 an we get the least squares solution from x = R 1 1 ˆb 1 = R 1 1 Q T 1 b. Underdetermined Systems. An unterdetermined system Ax = b = has a matrix A R m n with n > m, an unknown vector x R n and a right-hand side b R m. A QR decomposition of such a systems (in case of full rank m) yields A = QR = R 1 If we solve R 1 x 1 = Q T b

63 63 where x 1 comprises the first m entries of the unknown vector x, we get a solution: A ( x 1 0 ) = Q ( R 1 ) ( x 1 0 ) = QQT b = b. However, we do not have direct control on further properties of this solution. With a slightly different approach based on a QR-decomposition of A T, this is different: The QRdecomposition of A T A T = Q T R T R T,1 = gives the decomposition A = R T T Q T T = T R T,1 of A, such that we can solve the system by solving R T T,1y = b, and computing x = Q T ( y 0 ). Proving that this actually solves the system is straight-forward: Ax = (R T T,1 0) Q T T x = (R T T,1 0) ( y 0 ) = RT T,1y = b. In addition to solving the sytem Ax = b, this is the solution with minimal Euklidian norm x 2 as Ax = b (R T T,1 0) Q T T x = b R T T,1 (QT T x) 1 = b x = Q T ( R T T,1 b with x 2 2 = Q T ( R T T,1 b R T T,1b ) 2 = ( R T T,1 b 0 as Q is an orthonormal matrix. = ( R T T,1 b 2 ) 2 2 ) = RT,1b T = Q T ( R T T,1 b 0 2 ) = x Again, calculating only the first n columns of Q T and the first n lines of R T is sufficient as we have we calculate the minimum norm solution as x = (Q T,1 Q T,2 ) ( R T T,1 b 0 ) = Q T,1 R T T,1b. )
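For the overdetermined case, the complete procedure (reduced QR decomposition followed by one triangular solve) fits in a few lines. The sketch below uses NumPy/SciPy and assumes full column rank; it avoids the badly conditioned normal equations A^T A x = A^T b.

import numpy as np
from scipy.linalg import qr, solve_triangular

def least_squares_qr(A, b):
    """Least-squares solution of an overdetermined system A x ~ b via A = Q_1 R_1."""
    Q1, R1 = qr(A, mode='economic')      # Q_1: m x n, R_1: n x n upper triangular
    return solve_triangular(R1, Q1.T @ b, lower=False)

m, n = 20, 5
A = np.random.rand(m, n)
b = np.random.rand(m)
x = least_squares_qr(A, b)
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])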

64 64 After these considerations, we restrict our attention in the following to QR-decompositions of matrices A R m n with m n. Gram-Schmidt Orthonormalization is a method to orthonormalize a given set of vectors. If we apply the method to the columns A,j of our matrix A, this results directly in a QRdecomposition. The orthogonalization step reads which corresponds to A = Q R with Q,1 = A,1, Q,2 = A,2 Q,n = n 1 A,n i=1 Q = ( Q,1 Q,2 Q,n ), R = Normalizing the columns of Q yields Algorithm A = QR with Q,1 Q = ( Q,1 Q,2 Q,n ) R = Q,1 r 1,2 r 1,n 0 Q,2 r n 1,n 0 0 Q,n Q T,1 A,2 Q,1 2 Q,1, Q T,i A,n Q,i 2 Q,i 1 r 1,2 r 1,n 0 1 r n 1,n Q,2, r i,j = Q,n, r i,j = Q T,iA,j. 1 Q T,i A,j Q,i 2. initialize Q := A for i = 1 n...compute r i,i = Q,i...Q,i = Q,i /r i,i...compute (r i,i+1 r i,i+2... r i,n ) = Q T,i (A,i+1 A,i+2... A,n )...(Q,i+1 Q,i+2... Q,n ) = (Q,i+1 Q,i+2... Q,n ) Q,i (r i,i+1 r i,i+2... r i,n ) end and QR-Decomposition in Parallel Instruction-level and data parallelism is mainly based on the parallelisation of BLAS level-1 and level-2 building blocks of the algorithm:

65 65 1. in the computation of the norm Q,i (DOT), 2. the normalization Q,i /r i,i (SAXPY / vector scaling), 3. the scalar products Q T,i (A,i+1 A,i+2... A,n ) (matrix-vector product), 4. Q,i (s i+1 s i+2... s n ) (multiple vector scaling), 5. the update of (Q,i+1 Q,i+2... Q,n ) (SAXPY for the vector of matrix entries). Task-level parallelism is based on a block-decomposition. The idea is again to decompose the matrix A into blocks in order to get block-operations at BLAS level-3 instead of the BLAS level-2 operations in the column-wise algorithms above. As this increases the locality in terms of number of operations per datum, it increases the efficieny of memory hierarchy usage and reduces the communication requirements on a distributed memory system. To achieve this, we decompose A into two blocks, one containing the first k columns, the other one the rest of A: A = (A 1 A 2 ) A 1 A 2 In a first step, we compute the QR-decomposition of A 1, only. For this part, we can not resolve the sequential character of the algorithms, i.e., we have to handle one column after the other according to the algorithm given above. The only parallelization potential would be a further row-wise decomposition of A 1, i.e., using block-wise BLAS level-1 and level-2 routines. A 1 = Q 1 R 1 = In the second part of the block-wise parallel algorithm, we apply the respective transformations (step 3-4 in the list above) to the rest matrix A 2. Here, we can use whole k columns of Q now at a time instead of only one on the original algorithm: 3. compute S = Q T 1 A 2 (matrix-matrix product, use block-wise algorithm), = 4. Q 1 S (matrix-matrix product, use block-wise algorithm),

66 66 = Communication Avoiding QR. For large matrices A, we might want to actually use parallelism not only in terms of columns of A 2 but also in lines of A without the large communication overhead due to the numerous scalar products of whole columns. In this case, the following communication avoiding algorithm starting with QR decompositions of blocks of A 1 and combining them (after many entries have become zero) to an overall transformation of A 1 : Tall and Skinny QR. The same idea as in the communication avoiding QR can be used if our matrix A is very tall and skinny, i.e., in R n m with n >> m: A = A 0 A 1 A 2 A 3 = Q 0 R 0 Q 1 R 1 Q 2 R 2 Q 3 R 3 = Q 0 Q 1 Q 2 Q 3 R 0 R 1 R 2 R 3 (4.17) R 0 R 1 R 2 R 3 = ( R 0 R 1 ) ( R 2 R 3 ) = ( Q 01R 01 Q 23 R 23 ) = ( Q 01 Q 23 ) ( R 01 R 23 ) (4.18) ( R 01 R 23 ) = Q 0123 R 0123 (4.19) A = A 0 A 1 A 2 A 3 = Q 0 Q 1 ( Q 01 Q 2 Q 3 ) Q Q 0123 R 0123 (4.20) 23

[Figure: reduction tree of the tall-and-skinny QR — the blocks A_0, ..., A_3 are factorized independently into R_0, ..., R_3, the stacked pairs (R_0, R_1) and (R_2, R_3) are reduced to R_01 and R_23, and these are finally combined into R_0123.] The advantage is that we now have O(log(P)) messages compared to O(2n log(P)) for ScaLAPACK (standard parallel QR as described above).
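A serial sketch of the R-reduction in the tall-and-skinny QR (NumPy; only the final triangular factor is formed, the Q factors of the reduction tree would have to be accumulated separately; block count and sizes are illustrative):

import numpy as np

def tsqr_R(A, blocks=4):
    """Factor row blocks A_i = Q_i R_i independently, then combine the stacked
    R_i pairwise until a single R remains."""
    Rs = [np.linalg.qr(Ai)[1] for Ai in np.array_split(A, blocks)]  # local QRs, fully parallel
    while len(Rs) > 1:
        # one level of the reduction tree: combine neighbouring R factors
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]))[1] for i in range(0, len(Rs), 2)]
    return Rs[0]

A = np.random.rand(64, 5)              # n >> m: tall and skinny
R = tsqr_R(A)
# R agrees with the R of a direct QR decomposition up to the signs of its rows
assert np.allclose(np.abs(R), np.abs(np.linalg.qr(A)[1]))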


69 Chapter 5 Sparse Matrices Sparse matrices are matrices with a lot of entries being zero. More specifically, we speak of a sparse matrix if the number of non-zero entries is less that O(n 2 ) for an n n matrix. Prominent examples are tridiagonal matrices, band matrices, or block band matrices. Such matrices are typical for discretized partial differential equations for example. Treating a sparse matrix the same way as a dense matrix would lead to storage requirements of O(n 2 ) and cost for solving a system with such a matrix of O(n 3 ) if we use a direct solver such as Gaussian elimination or QR decomposition. If we formulate a given problem in a clever way (example: finite element discretizations with nodal bases), that leads, however, to a sparse linear system with O(n) matrix entries only. The question is: Can we compute the solution of such a sparse system also with O(n) operations? 5.1 Storage Schemes for Sparse Matrices As a first step towards efficient solvers for linear systems with sparse matrices, we present suitable storage schemes that avoid storing all zero entries. Note that, in Chapter 3, we have already seen an efficient storage scheme for band matrices. Let s start with the sparse matrix example (5.1) Additionally to the non-zero matrix entries, we need to store the size of the matrix n = 5, the number of nonzero entries nnz = 12, the positions of non-zero entries. 69

70 70 Storage in Coordinate Form. The straight forward way to store a sparse matrix is the coordinate form that can be seen as a tabular containing the non-zero matrix entries and their row and column indices: values AA row JR column JC There s no particular order given for the matrix entries, which implies that we store redundant information that we could avoid if we would prescribe an order. Pseudocode for computing c = A b: c=0; for j = 1 : nnz c[jr[j]] = c[jr[j]] + AA[j] * b[jc[j]]; end; as AA[j] = A JR[j],JC[j] according to the definition of the storage format. The disadvantage of this storage format is the indirect addressing (indexing) in the vectors c and b that induces two memory access events (in JR or JC and c or b, respectively) and, in addition jumps in memory (within b and c). The advantage, however, is that there s no systematical difference between columns and rows such that switching from A to A T is simple. Parallelization: We can partition AA into equal portions, but we have to dublicate both c and b. Communication is required to accumulate all contributions to c from all processes. Compressed Sparse Row Format: CSR The compressed sparse row storage format for sparse matrices orders the matrix entries according to the row they belong to, i.e., entries if row one are listed first, followed by entries of row two and so forth. Thus, there is no need to store the row index for each matrix entry any more. We only need the information, at which position of the list of matrix entries a new row starts: values AA column JA start row IA Pseudocode for computing c = A b: c=0; for i = 1 : n for j=ia[i] : IA[i+1]-1 c[i] = c[i] + AA[j] * b[ja[j]]; end end
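The CSR matrix-vector product from the pseudocode above translates directly into the following sketch (0-based indices; the 3 x 3 example matrix and its values are chosen for illustration and are not the 5 x 5 example from above):

import numpy as np

def csr_matvec(AA, JA, IA, b):
    """c = A b for a matrix in CSR form: AA holds the non-zero values, JA their
    column indices, IA the start of each row within AA."""
    n = len(IA) - 1
    c = np.zeros(n)
    for i in range(n):
        for j in range(IA[i], IA[i + 1]):
            c[i] += AA[j] * b[JA[j]]       # indirect addressing only in b
    return c

# CSR representation of A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
AA = np.array([1., 2., 3., 4., 5.])
JA = np.array([0, 2, 1, 0, 2])
IA = np.array([0, 2, 3, 5])
assert np.allclose(csr_matvec(AA, JA, IA, np.array([1., 1., 1.])), [3., 3., 9.])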

71 71 Thus, we need indirect addressing only in b. The columnwise analogon is called the compressed sparse column format. Parallelization: We can partition AA into row-blocks of A. This requires to store b in all processes. No communication is required for the final computation of c (row-blocking of A). CSR with Extracted Main Diagonal We can save further n integers if we list the diagonal entries first (ordered according to their row/column indices). In this case, we know the exact position of the first n entries of the matrix entry list, i.e., can omit n of the column indices. Instead, we can use the free positions in JA to store the indices of the first nondiagonal entry of each row: main diagonal entries nondiagonal entries values AA (start row) / column JA start row coumn Pseudocode for computing c = A b: c=0; for i = 1 : n c[i] = AA[i] * b[i]; for $j=ja[i] : JA[i+1]-1 c[i] = c[i] + AA[j] * b[ja[j]]; end end Parallelization: We can partition AA into row-blocks of A, but have to take care that we distribute the main diagonal entries to the respective processor s part of AA. As for CSR, this requires to store b in all processes. No communication is required for the final computation of c (row-blocking of A). Rectangular Storage Scheme by Pressing from the Right pressing from right (5.2) gives the compressed matrix and the corresponding matrix of column indices COEF = JCOEF = (5.3)

72 72 Pseudocode for computing c = A b: c=0; for i = 1 : n for j = 1 : nl c[i] = c[i] + COEF[i,j] * b[jcoef[i,j]]; end end This format was used in ELLPACK (package of subroutines for elliptic PDEs). Parallelization: Any type of partitioning (row and column blocks) is possible. Diagonalwise Storage. For sparse matrices with a particular structure, we can often find even more efficient storage methods. An important class (typical for system matrices of discretized partial differential equations) is given for example by matrices with band-structure which have only a non-zero diagonal and a small number of non-zero off-diagonals: (5.4) New matrix A! Different matrix to slides before! An obvious idea is to store the matrix entries diagonal- and off-diagobal-wise, i.e., we first need the diagonal numbers that give the distance in upward (positve) or downward (negative) direction from the main diagonal. In ou example matrix, we have non-zeros in diangonals 1, 0, and 2. i.e, we can store A as (compare matrix-multiplication with banded matrices) DIAG = Pseudocode for computing c = A b: c=0; for i = 1 : n for j = 1 : nd c[i] = c[i] + DIAG[i,j] * b[ioff[i+ioff[j]]; end end, IOFF = ( 1 0 2) (5.5) Parallelization: Any type of partitioning (row and column blocks) is possible. Survey on Sparse Storage Formats.

Scheme                   Integers                                                          Float
Coordinate Form          2(1 + nnz)    (n, nnz, row and column indices)                    nnz
CSR                      3 + nnz + n   (n, nnz, column indices, n + 1 pointers to rows)    nnz
CSR with Extracted MD    3 + nnz       (n, nnz, nnz - n column indices, pointers to rows)  nnz + 1
Rectangular Storage      1 + n · nl    (n, column indices)                                 n · nl
Diagonal                 2 + nd        (n, nd, offsets)                                    n · nd

where n is the dimension of the matrix, nnz the number of non-zero matrix entries, nl the maximal number of non-zeros per matrix row, and nd the number of non-zero diagonals. Coordinate form and CSR are the traditional ways to specify sparse matrices in MATLAB.

5.2 Matrices and Graphs

Each symmetric n × n-matrix with entries a_{ik}, i, k = 1, ..., n corresponds to an undirected graph with vertices e_1, ..., e_n and edges (e_i, e_k) for all nonzero entries a_{ik} ≠ 0. [Figure: example matrix A, its undirected graph G(A), and G(A) drawn as a directed graph.] The adjacency matrix for G(A) or A can be obtained directly by replacing each nonzero in A by 1 (5.6). A(G(A)) contains all information on the sparsity pattern of A. As, for nonsymmetric matrices, we don't necessarily have the property a_{ik} = 0 ⇔ a_{ki} = 0, we get a directed graph in this case:

74 74 A(G(A)) = G(A) Remark: We can even use a graph to store the whole marix information. For this purpose, edges of the graph haver to be annotated with weights corresponding to the respective matrix entries. This storage scheme is usually not used for matrix operations, but in other contexts such as algebraic multigrid methods in order to identify entries of the unknown vector that are tightly and those that are loosly connected. 5.3 Gaussian Elimination for Sparse Matrices As an example for harmful effects of Gaussian elimination for sparse matrices, we consider the matrix... A = n 1 n Already the first elimination step leads to a dense matrix: Reordering the matrix entries might help to improve this, i.e., to reduce the fil-in effect. For example, we can strive for minimal bandwidth. For the given matrix with one line that is completely filled, it is obvious, that the minimal bandwidth is n 2 :

75 75 This would reduce the fill-in to the lower left quarter of the matrix. In the upper right quarter, the elimination of the first (n 1)/2 entries of row (n 1)/2 + 1 does not cause any fill-in and can be achieved with by adding only two scaled non-zero elements of respective row to row (n 1)/ Thus, this part of the matrix can be eliminated with O(n/2) cost. For the lower left quarter, we have the same problem as we had before for the whole matrix. A complete fill-in of this (n 1)/2 (n 1)/2 matrix happens immediately while eliminating column (n 1)/ Thus, the costs here are O(n 3 ), but a factor of 1/8 smaller than for the original matrix due to the reduced size of the problematic matrix. We could, however, do an optimal reordering in one step: 1 n. This leads to optimal costs O(n) Motivated by this observation, we consider some reordering approaches in the following. Most of them are based on a graph representation of the sparse matrix. A reordering of the lines and columns of the matrix A corresponds to symmetric permutations of A in the form P AP T that change the ordering of the rows and columns of A simultaneously and results in a graph of P AP T that can be obtained by the graph of A by renumbering the vertices. Example: Assume that P is the permutation that changes 3 4: A(G(A)) = A(G(P AP T )) = (5.7) G(A) G(P AP T )
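In code, a symmetric permutation P A P^T is simply a simultaneous reindexing of rows and columns (a minimal NumPy sketch, 0-based indices):

import numpy as np

def symmetric_permutation(A, perm):
    """Return P A P^T: perm[i] is the old index moved to new position i."""
    perm = np.asarray(perm)
    return A[np.ix_(perm, perm)]

# swapping the third and fourth vertex (0-based indices 2 and 3) of a 5 x 5 matrix
A = np.arange(25.).reshape(5, 5)
A = A + A.T                                 # make A symmetric
B = symmetric_permutation(A, [0, 1, 3, 2, 4])
assert B[2, 2] == A[3, 3] and B[2, 4] == A[3, 4]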

76 76 Reordering Algorithms. If we try to improve the sparsity pattern of a matrix to achieve a faster Gaussian elimination or QR decpomposition, we have to answer the question how we can characterize good sparsity patterns? Good obviously should be such that Gaussian elimination can be reduced to smaller subproblems and produces no (or small) fill-in (original zeros being replaced by nonzeros during the elimination and, thus, increasing the elimination cost or the cost of backward substitution of the U-system). Cuthill McKee aims at reducing the fill-in by minimizing the bandwidth of a matrix. As an example, we use a sparse matrix A with the graph G(A): With the numbering of nodes (= rows and lines of the matrix) from the graph, this corresponds to a matrix where denotes a non-zero matrix entry. The Cuthill McKee algorithm starts with reordering the vertices of the graph in level sets, starting with: S 1 = {2} S 2 = {6, 2, 9}, all vertices directly connected with S 1 S 3 = {7, 3, 10}, all vertices directly connected with S 2 S 4 = {8, 4, 11}, all vertices directly connected with S 3 S 5 = {5} Inside the level sets, we order the vertices such that the first group of indices in S i are neighbors of the first vertex in S i 1 and so forth. If there is a choice left (e.g., several vertices in S i are neighbours of the first vertex in S i 1 ), number indices with small degree first!.

77 77 S 1 = {1} S 2 = {2, 6, 9}, (could also be different!) S 3 = {3, 7, 10}, (as we start with the neighbors of 2, then 6, and then 10) S 4 = {4, 8, 11}, (as we start with neighbors of 3,... ) S 5 = {5} This yields a matrix with a smaller bandwidth (4 instead of 9 as for the original matrix): Often, Reverse Cuthill Mckee gives slightly better results. It reverses the ordering of Cuthill McKee (11 1, 10 2,...). Alternatively, we can choose the start vertex as an extreme vertex (maximal diameter, i.e., maximal number of directly connected other vertices)! The Cuthill McKee method is far from being optimal and strongly heuristical as our example from the beginning of the section shows:... A = n 1 n We apply Cuthill McKee to see if the respective reordered matrix has better properties: If we start with vertex 1, Cuthill McKee keeps all numbers unchanged and we, thus, get no improvement. Dissection Reordering reorders the rows and columns such that we get almost independent blocks that can be solved independently, in particular in parallel to each other. As an extreme example, we consider a new matrix A with a typical banded pattern: A(G(A)) = G(A)

78 78 Exercise: Apply Gaussian elimination to this matrix and record the sparsity pattern at all stages of the algorithm. What do you observe? If we apply a symmetric permutation 2 3, we get P AP T = = ( A 1 0 ) 0 A 2 G(A) By this permutation, A can, thus, be transformed into block diagonal form which is easy to solve! ( A ) = ( A A 2 0 A 1 ) (5.8) 2 Algebraic Pivoting in GE. If we use numerical pivoting, we choose the largest entry in column/row/block k and permute this element on the diagonal position for eliminating elements in column k. The disadvantage of this method for sparse matrices is that it may lead to a large fill-in in the sparsity pattern of A. Thus, the idea of algebraic pivoting is to choose a pivot element according to the minimal fill-in! As a heuristic, we can choose the pivot element according to the degree in the graph G(A). This method is called minimum degree reordering. We consider this method in more detail, first for the special case A = A T : For elimination in the kth column of A, we 1. define r m as the number of nonzero entries in row m, 2. choose the pivot index i by r i = min m r m, 3. do the pivot permutation and the elimination, 4. go to the next column k. If we revisit our example... A = n 1 n With algebraic Pivoting, we would swap the first line and column with the second line and column in a first step, the second line and column with the third and so fort:

79 We see that this is exactly the strategy yielding the optimal reordering. As r m is the number of nonzeros in the mth row, it is also the number of vertices directly connected with the vertex m. Hence, the pivot vertex is the vertex with minimal degree in G(A k ). The heuristics works as few entries in the mth row/column yield little fill-in in because there are only few elements to eliminate and the pivot row used in the elimination is very sparse. For eliminating the whole matrix, we apply multiple minimum degree reorderings. We could even do an optimal reordering in one step: 1 n. This leads to optimal costs O(n) Generalization to Nonsymmetric Problems: Markowitz. Also for nonsymmetric matrices, we define r m = nnz in row m, but additionally c p = nnz in column p. We choose a pivot element with the index pair (i, j) such that (r i 1)(c j 1) = min m,p (r m 1)(c p 1) This heuristic is expected to work well as small c j leads to few elimination steps, small r i leads to a sparse pivot row used in the elimination. In the special case r i = 1 or c j = 1, we even get no fill-in. We can also include numerical pivoting by applying algebraic pivoting only on indices with absolute value that is not to small, e.g., a i,j 0.1 max r,s a r,s. (5.9) 5.4 Parallel Direct Solvers More General Dissection Reordering. For a more general case of dissection reordering, consider a matrix A with the graph G(A) and a numbering of unknows in two groups separated by a third group:
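A small sketch of the Markowitz pivot selection combined with the numerical safeguard (5.9); it only returns the index pair of one admissible pivot, the elimination itself and the update of the graph are not shown, and restricting the search to diagonal entries of a symmetric matrix would give the minimum degree rule (NumPy, example matrix chosen for illustration):

import numpy as np

def markowitz_pivot(A, tol=0.1):
    """Pick a pivot (i, j) minimizing (r_i - 1)(c_j - 1) over all entries that
    are not too small, where r_i / c_j count the non-zeros of row i / column j."""
    nz = A != 0
    r = nz.sum(axis=1)                      # non-zeros per row
    c = nz.sum(axis=0)                      # non-zeros per column
    threshold = tol * np.abs(A).max()
    best, best_cost = None, None
    for i, j in zip(*np.nonzero(np.abs(A) >= threshold)):
        cost = (r[i] - 1) * (c[j] - 1)
        if best_cost is None or cost < best_cost:
            best, best_cost = (i, j), cost
    return best

A = np.array([[4., 1., 1., 1.],
              [1., 5., 0., 0.],
              [1., 0., 6., 0.],
              [1., 0., 0., 7.]])
print(markowitz_pivot(A))   # avoids the dense first row/column, e.g. picks (1, 1)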

80 80 This numbering leads to the dissection form A 1 0 F 1 0 A 2 F 2 G 1 G 2 A 3 (5.10) This system can be solved, e.g., based on the Schur complement: A 1 0 F 1 0 A 2 F 2 G 1 G 2 A 3 x 1 x 2 x 3 = b 1 b 2 b 3 x 1 = A 1 1 (b 1 F 1 x 3 ), x 2 = A 1 2 (b 2 F 2 x 3 ), A 3 x 3 = b 3 G 1 x 1 G 2 x 2 A 3 x 3 = b 3 G 1 A 1 1 (b 1 F 1 x 3 ) G 2 A 1 2 (b 2 F 2 x 3 ), x 1 = A 1 1 (b 1 F 1 x 3 ), x 2 = A 1 2 (b 2 F 2 x 3 ) x 3 = (A 3 G 1 A 1 1 F 1 G 2 A 1 2 F 2) 1 (b 3 G 1 A 1 1 b 1 G 2 A 1 2 b 2), x 1 = A 1 1 (b 1 F 1 x 3 ), x 2 = A 1 2 (b 2 F 2 x 3 ). General idea: 1. Cut G(A) into pieces seperated by separator. I.e., if you remove the seperator, unconnected subgraphs remain. 2. Number the separator last. 3. If we repeat this recursively, we get a nested dissection form. As solving the system for the separating unknowns is the expensive part (even if we exploit the parallelism in computing A 1 1 F 1 and A 1 2 F 2), we are looking for a partitioning of the graph with a minimal number of connections! In our example from the beginning of the section

81 81... A = n 1 n we can partition the graph in n groups where n 1 of them are all only connected to the nth group yielding the optimal reordering Frontal Methods. All our efforts in the previous section focused on reducing the fil-in of a sparse matrix during Gaussian elimination. However, from the perspective of parallel computing, this has the clear drawback that, at the same time, all potential for parallelsim in the elimination of the entries of a column and in the update of entries of a row is also minimized. However, at least for band matrices with not too small bandwidth, we can exploit parallelism at least to a certain degree using the so-called frontal methods. Assume that we have a given matrix with bandwidth β. β β β frontal matrix Frontal methods treat parts of this matrix containing only few zeros as dense matrices based on the following algorithmic steps: 1. Define a frontal dense matrix of size (β + 1) (2β + 1) as shown above and treat it as a dense matrix, 2. eliminate the first column of the frontal matrix, 3. move the frontal matrix one entry down-right and do the next elimination, 4. repeat until done. This involves limited parallelism until now and the elimination in subsequent frontal matrices is obviously equential. Parallelism can be exploited in this version only within the dense frontal systems.
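To close the chapter, a small NumPy sketch of the Schur-complement solve for the dissection form (5.10) from the beginning of this section; the two subdomain solves with A_1 and A_2 are independent of each other and are exactly the part that runs in parallel (all names and sizes are illustrative):

import numpy as np

def dissection_solve(A1, A2, F1, F2, G1, G2, A3, b1, b2, b3):
    """Solve the dissection-ordered system via the Schur complement of the separator block."""
    # subdomain solves needed for the Schur complement (independent of each other)
    A1inv_F1, A1inv_b1 = np.linalg.solve(A1, F1), np.linalg.solve(A1, b1)
    A2inv_F2, A2inv_b2 = np.linalg.solve(A2, F2), np.linalg.solve(A2, b2)
    # separator system: (A3 - G1 A1^{-1} F1 - G2 A2^{-1} F2) x3 = b3 - G1 A1^{-1} b1 - G2 A2^{-1} b2
    S = A3 - G1 @ A1inv_F1 - G2 @ A2inv_F2
    x3 = np.linalg.solve(S, b3 - G1 @ A1inv_b1 - G2 @ A2inv_b2)
    # back substitution into the two subdomains (again independent)
    x1 = A1inv_b1 - A1inv_F1 @ x3
    x2 = A2inv_b2 - A2inv_F2 @ x3
    return x1, x2, x3

n1, n2, n3 = 4, 3, 2
A1 = np.random.rand(n1, n1) + 10 * np.eye(n1)
A2 = np.random.rand(n2, n2) + 10 * np.eye(n2)
A3 = np.random.rand(n3, n3) + 10 * np.eye(n3)
F1, F2 = np.random.rand(n1, n3), np.random.rand(n2, n3)
G1, G2 = np.random.rand(n3, n1), np.random.rand(n3, n2)
A = np.block([[A1, np.zeros((n1, n2)), F1],
              [np.zeros((n2, n1)), A2, F2],
              [G1, G2, A3]])
b = np.random.rand(n1 + n2 + n3)
x1, x2, x3 = dissection_solve(A1, A2, F1, F2, G1, G2, A3, b[:n1], b[n1:n1 + n2], b[n1 + n2:])
assert np.allclose(np.concatenate([x1, x2, x3]), np.linalg.solve(A, b))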


Chapter 6 Iterative Methods for Sparse Linear Systems of Equations

In the previous chapter, we have seen that direct methods have several disadvantages if applied to sparse matrices: the computations become strongly sequential due to the sparsity; direct solvers may lead to a fill-in of zeros and, thus, to the same costs as for dense matrices; the use of sparse matrix formats is difficult as the sparsity pattern might change during the solving steps; sparse matrix formats imply indirect addressing, ... We, thus, consider iterative solvers as a promising alternative, i.e., we
1. choose an initial guess = starting vector x^{(0)}, e.g., x^{(0)} = 0,
2. apply an iteration function x^{(k+1)} = Φ(x^{(k)}).
If we apply this idea for solving a linear system, the main part of Φ should be a matrix-vector multiplication Ax (matrix-free!?). This is easy to parallelize and induces no change in the pattern of A. The main problem is to achieve fast convergence:
$$x^{(k)} \xrightarrow{k \to \infty} x^* = A^{-1} b. \qquad (6.1)$$

6.1 Relaxation Methods

Relaxation methods are basic, widespread iteration methods that use entries of the residual r^{(k)} = b - Ax^{(k)} = A(x^* - x^{(k)}) = A e^{(k)}, where x^* is the exact solution, to update x^{(k)} to the next iterate x^{(k+1)}. In contrast to the error, which would be the ideal update leading to the exact solution in a single iteration, the residual can easily be computed from x^{(k)}.

84 84 Richardson iterations construct an iteration process using the plain residual: x (k+1) = x (k) + r (k). = Φ(x (k) ) Thus, one iteration requires a matrix-vector multiplication and a SAXPY which we can both execute in parallel in an efficient way. However, Richardson iterations converge only in some cases as the following short convergence analysis via Neumann series shows: x (k) = b + (I A) x (k 1) = b + N(b + Nx (k 2) ) = b + Nb + N 2 x (k 2) = = N... = b + Nb + N 2 b + + N k 1 b + N k x (0) = ( k 1 j=0 N j ) b + N k x (0). With this, we can do an error analysis for e (k) = x x (k) : e (k+1) = x x (k+1) = Φ(x ) Φ(x (k) ) = (x (k) + b Ax (k) ) (x + b Ax ) = Ne (k). We get convergence, if Thus, e (k) N e (k 1) N 2 e (k 2) N k e (0), N < 1 N k k 0 e (k) k 0 where ρ is the spectral radius of N = I A N < 1 ρ(n) = ρ(i A) < 1 ρ(n) = λmax = max( λ i ) (λ i is eigenvalue of N). (6.2) i I.e., the eigenvalues of A have to be all in circle around 1 with radius 1. In other words, A I is required to achieve convergence of the Richardson iteration. This is obviously a very strict requirement of the system matrix A. Generalization: splittings of A We observed that we get convergence of Richardson only in very special cases! Thus, we try to improve the iteration for better convergence! We do this by writing A in the form A = M N. Analogue to the splitting A = I (I A) used for Richardson, we can derive an iterative method from this splitting based on the observation that our exact solution x fulfills b = Ax = (M N)x = Mx Nx x = M 1 b + M 1 Nx x = x + M 1 (b Ax ).

85 85 This brings us to the iteration x (k+1) = x (k) + M 1 (b Ax (k) ) = x (k) + M 1 r (k). = Φ(x (k) ) M should be such that M 1 y can be evaluated efficiently. Note that the iteration with the splitting M N is equivalent to Richardson on M 1 Ax = M 1 b. (6.3) Thus, the iteration with the splitting A = M N is convergent if M 1 A is close to the identity or, more precisely, if ρ(m 1 N) = ρ(i M 1 A) < 1 Such a matrix M is called a preconditioner for A. It is used also in other iterative methods to accelerate convergence. In the following paragraphs, we shortly recall commonly used choices for the matrix M to spped up the Richardson iteration: Jacobi (Diagonal) Splitting. The Jacobi method uses A = M N = D (L + U) with the diagonal part of A D = diag(a), L the lower triangular part of A, and U the upper triangular part of A. L D U This yields the iteration x (k+1) = x (k) + D 1 r (k). This method is obviously convergent for A diag(a) or diagonal dominant matrices ρ(d 1 N) = ρ(i D 1 A) < 1 (6.4) and the matrix D is very easy to invert (just invert the diagonal entries one by one). Thus, we get an iteration that convergees for a larger class of matrices than the Richardson iteration but is still very simple to implement and features an optimal degree of parallelism as the multiplication with A is a standard matrix-vector product, the multplication with D 1 a simple componentwise scaling. However, the convergence is often too slow! Possible improvement are block Jacobi iterations, where D is a block-diagonal matrix. In this case, D 1 requires the solution of small systems corresponding to the diagonal blocks.
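A minimal sketch of the Jacobi iteration x^{(k+1)} = x^{(k)} + ω D^{-1} r^{(k)} (the damping parameter ω is discussed below; ω = 1 gives the plain Jacobi method). The test matrix is diagonally dominant so that the iteration converges; all names are illustrative.

import numpy as np

def jacobi(A, b, omega=1.0, tol=1e-10, maxiter=1000):
    """(Damped) Jacobi iteration: residual = parallel mat-vec, update = componentwise scaling."""
    x = np.zeros_like(b)
    d = np.diag(A)                      # D = diag(A), trivially invertible
    for _ in range(maxiter):
        r = b - A @ x                   # residual r^(k)
        if np.linalg.norm(r) < tol:
            break
        x = x + omega * r / d
    return x

A = np.random.rand(20, 20) + 20 * np.eye(20)   # diagonally dominant
b = np.random.rand(20)
assert np.allclose(jacobi(A, b), np.linalg.solve(A, b), atol=1e-8)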

86 86 D 1 U D 2 L D 3 Alternatively or in addition, damping can be applied for improving convergence which means that we introduce a step length for this correction step: x (k+1) = x (k) + D 1 r (k) x (k+1) = x (k) + ωd 1 r (k), i.e., an additional damping parameter ω. This yields the damped Jacobi iteration: or which is convergent iff x (k+1) = x (k) + ωd 1 (b Ax (k) ). x (k+1) = (I ωd 1 A)x (k) + ωd 1 b ρ((i ωd 1 A) ) < 1. ω 0 I Thus, we look for optimal ω with best convergence. Note that this introduces an additional degree of freedom in our method for which we don t have a generally applicable method to compute a suitable value in an automated way. Gauss-Seidel. The Gauss-seidel iteration tries to improve Jacobi by always using the newest information available! To see how this works, we write the Jacobi iteration in componentwise form: x (k+1) j = x (k) j + 1 n a j,m x (k) m = 1 a j,j m=1 a j,j b j j 1 m=1 a j,m x (k) m n m=j+1 a j,m x m (k). For all entries in the first part of the sum, i.e., x (k) 1,..., x(k) j 1, we have already computed new values x (k+1) 1,..., x (k+1) j 1 when we get to entry j. The Gauss-Seidel iteration exploits this already available new information by changing the iteration to x (k+1) j = 1 a jj b j j 1 m=1 a j,m x (k+1) m n m=j+1 a j,m x (k) m.

87 87 This is equivalent to solving the system j a j,m x (k+1) m m=1 = b j n m=j+1 a j,m a (k) m, j = 1,..., n or (D + L)x (k+1) = b Ux (k). Thus, the Gauss-Seidel iteration corresponds to the splitting A = (D + L) + U = M N x (k+1) = (D + L) 1 b (D + L) 1 Ux (k) = (D + L) 1 b + (D + L) 1 (D + L A)x (k) = x (k) + (D L) 1 r (k) Thus iteration converges if the spectral radius of I (D + L) 1 A is smaller than one: ρ(i (D + L) 1 A) < 1. The linear system in D L is easy to solve because D L is lower triangular but strongly sequential as we have seen in Chapter 4! However, there are cases for which regaining a high degree of parallelism is easy. We habe already considered a realted problem in Chapter 1 when we coloured data dependency graphs for iterative updates into an as small number of trees as possible (look this up if you don t remember). The standard example for Gauss-Seidel colouring is a datset that describes, e.g., temperatures or displacements of a linear elastic structure on a two-dimensional square domain discretized with a regular mesh: In an easy discretized form of the underlying Poisson equation for temperatures or displacements, the graph of the system matrix corresponds to the mesh, i.e., each node is connected (via non-zero matrix entries) to its left, right, upper and lower neighbour. Solving a line of the triangular system as required in the Gauss-Seidel iteration then means that we have to use exactly these four neighbours. We write this in an (i, j) indexing form where i gives the position number in x-direction y the position number in y-direction and use a lexicographic order. This gives the Gauss-Seidel update formula at a given position (i, j): x (k+1) i,j = 1 (b i,j a (i,j),(i 1,j) x (k+1) i 1,j a a (i,j),(i+1,j)x (k) i+1,j a (i,j),(i,j 1)x (k+1) i,j 1 a (i,j),(i,j+1)x (k) i,j+1 ). (i,j),(i,j) Thus, the iteration is strongly sequential in the sense that the positions (i 1, j) and (i, j 1) have to be updated before (i, j). However, if we change the order and first traverse a group of points for which i + j is even (red nodes) and afterwards the group for which i + j is odd,

88 88 the Gauss-Seidel update reads x (k+1) i,j = 1 a (i,j),(i,j) (b i,j a (i,j),(i 1,j) x (k) i 1,j a (i,j),(i+1,j)x (k) i+1,j a (i,j),(i,j 1)x (k) i,j 1 a (i,j),(i,j+1)x (k) i,j+1 ). for all (i, j) with i + j even, x (k+1) i,j = 1 (b i,j a (i,j),(i 1,j) x (k+1) i 1,j a a (i,j),(i+1,j)x (k+1) i+1,j a (i,j),(i,j 1)x (k+1) i,j 1 a (i,j),(i,j+1)x (k+1) i,j+1 ). (i,j),(i,j) for all (i, j) with i + j odd. Thus, we can execute all operations for the updates of even (red) nodes in parallel as they only depend on old values from the kth iterate. The same holds for all od (black) nodes. They use only new values from iteration k+1, but these are already avalaible from the sweep over the even / red nodes. As for Jacobi, we can improve the convergence by introducing a relaxation which leads to the so-called successive over relaxation (SOR): x (k+1) = x (k) + ω(d L) 1 r (k) = ω(d L) 1 b + [(1 ω) + ω(d L) 1 U]x (k) The convergence now depends on the spectral radius of the iteration matrix (1 ω) + ω(d L) 1 U The parallelization of SOR is done exactly the same as the parallelization of GS. 6.2 Krylov-Subspace Methods Krylov-subspace methods are based on a more complex idea, but almost as simple to implement as the relaxation methods. They solve a linear system by minimizing a quadratic function: Consider A = A T > 0 (A symmetric positive definite) and the function Φ(x) = 1 2 xt Ax b T x. Φ is an n-dimensional paraboloid R n R with the gradient Φ(x) = Ax b. The position with Φ(x) = 0 is exactly the minimum of the paraboloid. Therefore, instead of solving Ax = b, we can consider min x Φ(x).

89 89 Steepest Descent uses the local direction of steepest descent (at the actual solution approximation) to look for the minimum: Φ(x) y is minimal for y = Φ(x) = b Ax. Starting from an iterate x (k), we minimize along the search direction, i.e., our next iterate is x (k+1) = x (k) + α k r (k) with α k defined by this one-dimensional minimization: min g(α) = min(φ(x (k) + α(b Ax (k) ))) α α = min α ( 1 2 (x(k) + αr (k) ) T A(x (k) + αr (k) ) b T (x (k) + αr (k) )) α k = r(k)t r (k) r (k)t Ar (k). = min α ( 1 2 α2 r (k)t Ar (k) αr (k)t r (k) x(k)t Ax (k) x (k)t b), The algorithm reads: 1. Start with x (0). 2. Iteration step: with r (k) = b Ax (k) x (k+1) = x (k) + α k r (k) (6.5) and α k = r(k)t r (k) r (k)t Ar (k). Thus, the algorithm is very simple and also very easy to parallelize. The components required are matrix-vector multiplication, SAXPY and DOT, for which we know efficient parallel algorithms. Disadvantage of the steepest descent method: Contour lines of Φ can be distorted. This leads to slow convergence (zig zag path). The local descent direction is not globally optimal. A detailed error analysis of the steepest descent method yields x x (k+1) 2 A (1 1 κ(a) ) x x (k) 2 A (6.6) Therefore, we get very slow convergence for κ(a) 1! Recall the definition of the condition number: κ(a) = A 1 A, λ max or σ max. λ min σ min
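A minimal sketch of the steepest descent iteration (NumPy; the symmetric positive definite test matrix is illustrative). The building blocks per iteration are one matrix-vector product, two DOTs, and two SAXPY-type updates.

import numpy as np

def steepest_descent(A, b, tol=1e-10, maxiter=10000):
    """Line search along the negative gradient r = b - A x with alpha = (r^T r)/(r^T A r)."""
    x = np.zeros_like(b)
    r = b - A @ x
    for _ in range(maxiter):
        if np.linalg.norm(r) < tol:
            break
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)
        x = x + alpha * r
        r = r - alpha * Ar              # cheap residual update instead of recomputing b - A x
    return x

M = np.random.rand(10, 10)
A = M @ M.T + 10 * np.eye(10)           # symmetric positive definite
b = np.random.rand(10)
assert np.allclose(steepest_descent(A, b), np.linalg.solve(A, b), atol=1e-8)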

90 90 The Conjugate Gradient Method improves the descent direction to be globally optimal: x (k+1) = x (k) + α k p (k) with a search direction not being the negative gradient, but a projection of the gradient that is A-conjugate to all previous search directions: p (k) Ap (j) p (k) A p (j) for all j < k or or p (k)t Ap (j) = 0 for j < k We choose new search direction as the component of the last residual that is A-conjugate to all previous search directions. The stepsize α k is again determined by one-dimensional minimization as before (for the chosen direction p (k) ). The CGradient algorithm reads p (0) = r (0) = b Ax (0) for k = 1, 2,... do α (k) = r(k),r (k) p (k),ap (k) x (k+1) = x (k) α (k) p (k) r (k+1) = r (k) + α (k) Ap (k) if r (k+1) 2 2 ɛ then break β (k) = r(k+1),r (k+1) r (k),r (k) p (k+1) = r (k+1) + β (k) p (k) Due to the conugacy of the search directions, we get the xact solution after n steps: x (n) = A 1 b Unfortunately, this is not true in floating point arithmetic. Furthermore, O(n) iteration steps would be too costly. Thus, we examine the convergence properties as for the steepest descent. After a somewhat detailed analysis, we get k e (k) 1 A ) e(0) κ(a) 1 T k ( κ(a)+1 A 2 e (0) A κ(a) 1 κ(a) + 1 In addition, one can show that for clustered eigenvalues, e.g., if we assume that A has only m different eigenvalues, we get convergence after m steps. Summarizing, we state the following:

91 91 To get fast convergence, we have to reduce the condition number of the system matrix by a preconditioner M, such that M 1 Ax = M 1 b has clustered eigenvalues. Conjugate gradients (CG) are applicable only for symmetric positive definite A. CG has two important properties: optimal search directions and cheap computations. For parallelization, we can again decompose the algorithmic steps into easy to parallelize BLAS rountines: scalar products, matrix-vector multiplication, SAXPY. 6.3 Preconditioning In this section, we think about possibilities to further improve the tools we have at hand to solve large sparse linear systems for the following reasons: Direct solvers are strongly sequential and tend to loose the sparsity. Iterative solvers are easy to parallelize and sparse, but possibly slowly convergent. Thus, a combination of both methods (2 variants) seems to be promising, i.e., we include a preconditioner M A in the form of transforming our original system Ax = b to the preconditioned system M 1 Ax = M 1 b such that M is easy to invert deal with in parallel and the spectrum of M 1 A is much better clustered. M 1 can for example be the result of in inexact direct solver. For sparse matrices, in particular inexact direct solvers preserving the sparsity pattern of A are of interest. We will see example later in this section. Preconditioning for Relaxation Methods. Consider preconditioner M A in the form M 1 Ax = M 1 b. The Richardson iteration for this preconditioned system reads x (k+1) = x (k) + M 1 b M 1 Ax (k), i.e., corresponds to the relaxation method with the splitting A = M N. The convergence depends on I M 1 A < 1. That is exactly the condition for a good preconditioner: The spectrum of M 1 A should be clustered around 1 or M 1 A I. If the splitting with M leads to fast convergence of the splitting method, M is, thus, also a good preconditioner.

92 92 Preconditioned Conjugate Gradients (PCG). The statements obove for the relaxation methods shows that in a sense, PCG with preconditioner M can be seen as an acceleration of the corresponding splitting method. It replaces Richardson iteration for the preconditioned system by CG for the preconditioned system. The Jacobi splitting with M = D = diag(a) gives a Jacobi preconditioner. The Gauss-Seidel splitting with M = D L leads to a Gauss- Seidel preconditioner. We know that the CG algorithm may be applied only if the system matrix is symmetric positive definite. This is, in general not the case for M 1 A. Thus, we have to modify our preconditioning to achieve symmetry in two steps: 1. We have to choose a preconditioner that is symmetric, i.e, M = M T. 2. Symmetrization of M 1 A by transformation to E 1 AE T : M 1 as defined above is symmetric positive definite like A, however, M 1 A does not have to be symmetric positive definite which would be a requirement for the applicability of the conjugate gradient method. Solution: M can be decomposed to M = EE T, E 1 AE T is symmetric positive definite as x T E 1 AE T x = (E T x) T A (E T x) > 0 for all x, E 1 AE T has the same eigenvalues and condition number as M 1 A: If v is an eigenvector of M 1 A with eigenvalue λ, than E T v is an eigenvalue of E 1 AE T with eigenvalue lambda since E 1 AE T E T v = E 1 Av = E T E T E 1 Av = E T M 1 Av = E T λv = λe T v. Thus, we apply the conjugate gradient method to E 1 AE T û = E 1 b. Afterwards compute u = E T û. PCG Algorithm with Symmetrized Preconditioning Recapitulation Algorithm cg: α (k) = (r (k) ) T r (k) (p (k) ) T Ap (k), x (k+1) = x (k) + α (k) p (k), r (k+1) = r (k) α (k) Ap (k), β (k+1) = (r(k+1) ) T r (k+1) (r (k) ) T r (k), p (k+1) = r (k+1) β (k+1) p (k). Resulting Algorithm for the preconditioned system:

93 93 α (k) = (ˆr (k) ) T ˆr (k) (ˆp (k) ) T E 1 AE T ˆp (k), ˆx (k+1) = ˆx (k) + α (k) ˆp (k), ˆr (k+1) = ˆr (k) α (k) E 1 AE T ˆp (k), β (k+1) = (ˆr(k+1) ) T ˆr (k+1) (ˆr (k) ) T ˆr (k), ˆp (k+1) = ˆr (k+1) β (k+1) ˆp (k). Disadvantage of this method: E has to be computed and an additional equation has to be solved for u. However, we observe: ˆr (k) = E 1 b E 1 AE T ˆx (k) = E 1 r (k), ˆx (k) = E T x (k), E T E 1 = M. Thus, we define p (k) with the help of ˆp (k) = E T p (k). This yields: α (k) = (r (k) ) T E T E 1 r (k) (p (k) ) T = (r(k) T ) M 1 r (k) EE 1 AE T E T p (k) (p (k) ) T, Ap (k) x (k+1) = E T (E T x (k) + α (k) E T p (k) ) = x (k) + α (k) p (k), r (k+1) = E (E 1 r (k) α (k) E 1 AE T E T p (k) ) = r (k) α (k) Ap (k), β (k+1) = (r(k+1) ) T E T E 1 r (k+1) (r (k) ) T = (r(k+1) T ) M 1 r (k+1) E T E 1 r (k) (r (k) ) T, M 1 r (k) p (k+1) = E T (E 1 r (k) β (k+1) E T p (k) ) = M 1 r (k) β (k+1) p (k).

94 94 Resulting algorithm: α (k) = (r (k) ) T v (k) (p (k) ) T Ap (k), x (k+1) = x (k) + α (k) p (k), r (k+1) = r (k) α (k) Ap (k), v (k+1) = M 1 r (k+1), β (k+1) = (r(k+1) ) T v (k+1) (r (k) ) T v (k), p (k+1) = v (k+1) + β (k+1) p (k). Relaxation Methods as Preconditioners. As mentioned above, PCG can be interpreted as a speedup for the splitting methods. For the related methods, we apply a CG iteration to M 1 Ax = M 1 b, where M is derived from the splitting A = M N. If we revisit the PCG algorithm above and assume that we have a function solver iteration(v, A, rhs) given that performs an iteration of a splitting method for the matrix A, a given right-hand side rhs, and the solution approximation v, we can rewrite PCG as α (k) = (r (k) ) T v (k) (p (k) ) T Ap (K), x (k+1) = x (k) + α (k) p (k), r (k+1) = r (k) α (k) Ap (k), v (k+1) = 0, v (k+1) = solver iterations(v (k+1), A, r (k+1) ), β (k+1) = (r(k+1) ) T v (k+1) (r (k) ) T v (k), p (k+1) = v (k+1) β (k+1) p (k). There is one issues to be discussed here: Not all splitting methods use a symmetric matrix M. E.g., for Gauss-Seidel, we have seen that M = D L. To turn this into a symmetric preconditioner, we use a combination of two iterations for M, more specifically, we perform a first iteration with the splitting A = M N and a second iteration with A = M T N T (note that A is symeetric positive definite). Applying two iterations of a splitting method to a system with right-hand side r (k+1) and initial guess zero yields: u = M 1 r (k+1) first iteration, v (k+1) = u + M T (r (k+1) Au) = (M 1 + M T M T AM 1 )r (k+1) second iteration. Thus, these two iteration correspond to a preconditioning with the symmetric matrix M 1 new = M 1 + M T M T AM 1.

ILU Preconditioning. ILU preconditioners apply the Gaussian elimination algorithm, but only on an allowed pattern, resulting in an incomplete LU factorization called ILU. With this approach, we preserve the sparsity pattern of A or at least a guaranteed sparsity, i.e., we reduce the non-zero entries in L and U either to an allowed pattern (e.g., ILU(0) for the pattern of A) or to values that are not too small (ILUT for ILU with threshold). This leads to an approximate LU factorization with all ignored fill-in entries collected in R:

A = LU + R,   preconditioner M = LU.   (6.7)

For our PCG, this means that we compute M^{-1} r^(k+1) by solving LU v^(k+1) = r^(k+1), i.e., two sparse triangular systems (see the sketch below).

We revisit all preconditioners we have seen up to now regarding the following requirements for efficient parallel preconditioning:

(i) The computation of M^{-1} is fast in parallel.
(ii) The application M^{-1} r^(k+1) in each iteration step is easy in parallel.
(iii) The spectrum of AM^{-1} (or M^{-1}A) is clustered, which gives fast convergence.

The observations are that
- for Gauss-Seidel, (i) is ok (there is nothing to compute), but not (ii), and often also not (iii),
- for Jacobi, (i) and (ii) are ok, but not (iii),
- for ILU, (iii) is ok, but not (i) (limited parallelism of direct solvers for a sparse matrix) and not (ii) (sequential character of solvers for triangular systems).
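Applying the ILU preconditioner, i.e., computing v = (LU)^{-1} r, amounts to one forward and one backward substitution. A minimal dense sketch (a real implementation would of course exploit the sparsity of L and U; the factors are assumed to be given, e.g., by an incomplete factorization):

    import numpy as np

    def apply_ilu(L, U, r):
        # v = U^{-1} (L^{-1} r): forward solve L y = r, then backward solve U v = y
        n = len(r)
        y = np.zeros(n)
        for i in range(n):                    # forward substitution (L lower triangular)
            y[i] = (r[i] - L[i, :i] @ y[:i]) / L[i, i]
        v = np.zeros(n)
        for i in range(n - 1, -1, -1):        # backward substitution (U upper triangular)
            v[i] = (y[i] - U[i, i+1:] @ v[i+1:]) / U[i, i]
        return v

The strictly sequential dependence of y[i] on y[0], ..., y[i-1] (and of v[i] on v[i+1], ..., v[n-1]) is exactly the parallelization problem mentioned in requirement (ii).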

The class of methods presented in the next paragraph is similar to the ILU approach but performs much better in terms of parallel scalability:

Sparse Approximate Inverses (SPAI). Sparse approximate inverses are a possibility to approximate A^{-1} in a sparse and parallel way. They approximate A^{-1} by norm minimization

min_{M^{-1} ∈ P} ||AM^{-1} − I||

over some sparsity pattern P. The efficiency and parallelizability of the method strongly depend on the choice of the norm. The optimal norm is the Frobenius norm

||A||_F^2 = Σ_{i,j=1}^n a_{i,j}^2 = Σ_{j=1}^n ||A_{:,j}||_2^2.

Minimization in this norm yields

min_{M^{-1} ∈ P} ||AM^{-1} − I||_F^2 = min_{M^{-1} ∈ P} Σ_{k=1}^n ||(AM^{-1} − I) e_k||_2^2 = Σ_{k=1}^n min_{M^{-1}_k ∈ P_k} ||A M^{-1}_k − e_k||_2^2.

Hence, to minimize the Frobenius norm, we have to solve n least squares problems (one for each of the sparse columns M^{-1}_k of M^{-1})! This can be done fully in parallel! But what are the costs of the least squares problems? If we denote by J_k the set of allowed indices in the kth column of M^{-1}, we observe that only a part of the matrix A is involved in the least squares problem for the kth column:

min_{M^{-1}_k ∈ P_k} ||A M^{-1}_k − e_k||_2^2 = min_{M^{-1}_k ∈ P_k} ||A(:, J_k) M^{-1}_k(J_k) − e_k||_2^2.   (6.8)

A(:, J_k) is a sparse rectangular matrix. In addition, we delete superfluous zeros in the least squares problem, i.e., in A(:, J_k), we keep only the non-zero rows A(I_k, J_k), where I_k denotes the corresponding row index set.

Hence, we achieve a reduction of the sparse least squares problem to A(I_k, J_k):

min_{M^{-1}_k ∈ P_k} ||A(I_k, J_k) M^{-1}_k(J_k) − e_k(I_k)||_2^2.

We solve the small least squares problems by Householder QR. Since A(I_k, J_k) is a smaller and denser matrix than A, we can have two cases:

1. A(I_k, J_k) is really small. Then we don't need more parallelism than we already have, since we compute all columns of M^{-1} simultaneously.
2. A(I_k, J_k) is still considerably large. In this case, we can use parallel QR for each of the least squares problems.

Even for sparse A, A^{-1} will in general not be sparse! In spite of this fact, we want to force at least our preconditioner M^{-1} ≈ A^{-1} to be sparse. As an a priori choice of a good approximate sparsity pattern for M^{-1}, we can choose the pattern of

- A^k or (A^T)^k or (A^T A)^k A^T for some k = 1, 2, ...,
- A_ε with a sparsified A,
- a combination of the above-mentioned possibilities.

We get A_ε by sparsification of A: delete all entries with |A_ij| < ε.

Instead of choosing an a priori sparsity pattern, we can determine the pattern in a more adaptive way: We start with a thin approximate pattern J_k for M^{-1}_k and compute the optimal column M^{-1}_{k,opt}(J_k) by least squares. In a second step, we try to find a new entry j for M^{-1}_k such that M^{-1}_{k,opt}(J_k) + λ e_j has a smaller residual:

min_λ ||A(M^{-1}_k + λ e_j) − e_k||_2^2 = min_λ ||(AM^{-1}_k − e_k) + λ A e_j||_2^2 = min_λ ( ||r_k||_2^2 + 2λ r_k^T A_j + λ^2 ||A_j||_2^2 )

with r_k = AM^{-1}_k − e_k and A_j = A e_j the jth column of A. Minimization over λ yields

λ_j = − r_k^T A_j / ||A_j||_2^2.

Finally, we choose the index j with r_k^T A_j ≠ 0 and j = argmin_j (λ_j r_k^T A_j), since

min_λ ||A(M^{-1}_k + λ e_j) − e_k||_2^2 = ||r_k||_2^2 − (r_k^T A_j)^2 / ||A_j||_2^2 = ||r_k||_2^2 + λ_j r_k^T A_j.
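A minimal NumPy sketch of the least squares step for a single column (dense arrays and the pattern of A itself as the allowed pattern J_k; both choices are made here only for illustration):

    import numpy as np

    def spai_column(A, k, Jk):
        # solve min || A(I_k, J_k) m(J_k) - e_k(I_k) ||_2 for the kth column of M^{-1}
        n = A.shape[0]
        e_k = np.zeros(n); e_k[k] = 1.0
        A_J = A[:, Jk]                                      # A(:, J_k)
        Ik = np.nonzero(np.any(A_J != 0.0, axis=1))[0]      # keep only non-zero rows -> I_k
        m_J, *_ = np.linalg.lstsq(A_J[Ik, :], e_k[Ik], rcond=None)
        m_k = np.zeros(n)
        m_k[Jk] = m_J                                       # scatter the result back into the full column
        return m_k

    n = 8
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    # all n column problems are independent and could be solved in parallel
    M_inv = np.column_stack([spai_column(A, k, np.nonzero(A[:, k])[0]) for k in range(n)])
    print(np.linalg.norm(A @ M_inv - np.eye(n)))            # Frobenius norm of the residual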

In the literature, several variants of this basic algorithm are known, together with error estimates for the approximation of A^{-1} by M^{-1}. Summarizing, we can state that SPAI preconditioners achieve a good approximation of A^{-1}, can be computed efficiently in parallel, and can also be applied efficiently (multiplication with a sparse matrix M^{-1}, see the PCG algorithm above).

6.4 Multigrid

Basic Multigrid Principles Revisited. For some sparse systems (e.g., stemming from the discretization of partial differential equations), iterative relaxation methods show a very characteristic behaviour:

1. The error decreases only very slowly after a certain number of iterations,
2. the error becomes smooth after only a few iterations,
3. the convergence of the same stationary method would be faster on a reduced (coarse) version of the system.

As an example, we consider the two-dimensional Poisson equation

−Δu = f in ]0; 1[²,   u = 0 on ∂(]0; 1[²).

Discretization of the Laplacian on a regular two-dimensional grid with mesh width h in x- and y-direction and finite differences gives

(−u[i + N] − u[i + 1] + 4 u[i] − u[i − 1] − u[i − N]) / h²,

with N denoting the number of grid points per row (N = 3 in the small example grid). This yields a system of equations with the usual block-tridiagonal 5-point-stencil system matrix A = (1/h²) blocktridiag(−I, T, −I), T = tridiag(−1, 4, −1). Starting with an arbitrary initial guess for u_h leads to error plots showing the initial error and the error after ten Gauss-Seidel iterations (the sketch below reproduces this experiment).
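A NumPy sketch of this experiment (grid size, the random initial error, and the number of sweeps are illustrative choices):

    import numpy as np

    def poisson_2d(N, h):
        # block-tridiagonal 5-point Laplacian on an N x N interior grid
        I = np.eye(N)
        T = 4.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
        return (np.kron(I, T) - np.kron(np.eye(N, k=1), I) - np.kron(np.eye(N, k=-1), I)) / h**2

    def gauss_seidel(A, b, x, sweeps):
        # forward Gauss-Seidel sweeps, updating x in place
        n = len(b)
        for _ in range(sweeps):
            for i in range(n):
                x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
        return x

    N = 31
    h = 1.0 / (N + 1)
    A = poisson_2d(N, h)
    e0 = np.random.default_rng(0).uniform(-1.0, 1.0, N * N)        # error of a random initial guess
    e10 = gauss_seidel(A, np.zeros(N * N), e0.copy(), sweeps=10)    # error obeys the same iteration with b = 0
    print(np.max(np.abs(e0)), np.max(np.abs(e10)))                  # maximal error before and after smoothing

Plotting e0 and e10 as 31 x 31 images shows the typical behaviour: the error becomes smooth quickly, while its maximal value decreases only slowly.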

Note that the size of the maximal value of the error decreased only by a factor of about two! From these observations, the following general two-grid algorithm can be derived:

1. Iterate the original system a few times with a Gauss-Seidel solver → x_1.
2. Establish an equation for the remaining error (which is smooth now): A (x − x_1) = b − A x_1, i.e., A e = res with the error e = x − x_1 and the residual res = b − A x_1.
3. Transport this equation to a coarser level (e.g., every second grid node in both spatial directions) using a prolongation P and a restriction R (e.g., bilinear interpolation and injection): R A P e_c = R res, where e_c is a coarse representation of the error.
4. Solve the coarse system.
5. Prolongate the computed error back to the fine grid: e_f = P e_c.
6. Improve the fine guess x_1 using this error approximation: x_2 = x_1 + e_f.
7. Iterate the original system a few times with the Gauss-Seidel solver → x_3.

Obviously, if using multiple levels is a good idea for the original system, it also makes sense for the new, coarser system. This leads to a recursive multilevel method (a code sketch follows after the list):

1. Iterate the original system a few times with a Gauss-Seidel solver → x_1.
2. Establish an equation for the remaining error (which is smooth now): A (x − x_1) = b − A x_1, i.e., A e = res.

3. Transport this equation to a coarser level (e.g., every second grid node in both spatial directions) using a prolongation P and a restriction R (e.g., bilinear interpolation and injection): R A P e_c = R res, where e_c is a coarse representation of the error.
4. Solve the coarse system. To do this, go to step 1) for the coarse system.
5. Prolongate the computed error back to the fine grid: e_f = P e_c.
6. Improve the fine guess x_1 using this error approximation: x_2 = x_1 + e_f.
7. Iterate the original system a few times with the chosen stationary solver → x_3.

We have a look at the resulting error at the steps of this algorithm for the 2D Poisson example described above.

[Figure: error at the individual steps: random initial error, after smoothing (2 GS), residual before restriction, restricted residual, coarse grid solution = error approximation, interpolated error approximation, fine grid error after correction, fine grid error after further 2 GS.]
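A minimal, self-contained sketch of such a recursive multilevel (V-)cycle, here for the 1D Poisson problem to keep the code short (the damped-Jacobi smoother, injection restriction, linear interpolation, and all sizes are illustrative choices):

    import numpy as np

    def apply_A(u, h):
        # 1D Poisson: (-u[i-1] + 2 u[i] - u[i+1]) / h^2 with zero Dirichlet boundary values
        Au = 2.0 * u
        Au[1:] -= u[:-1]
        Au[:-1] -= u[1:]
        return Au / h**2

    def smooth(u, b, h, sweeps, omega=2.0/3.0):
        # damped Jacobi (the diagonal of A is 2 / h^2)
        for _ in range(sweeps):
            u = u + omega * (h**2 / 2.0) * (b - apply_A(u, h))
        return u

    def restrict(r):
        return r[1::2]                       # injection to every second node

    def prolong(e_c):
        # linear interpolation to the next finer grid (2 m + 1 interior points)
        e_f = np.zeros(2 * len(e_c) + 1)
        e_f[1::2] = e_c
        e_f[2:-1:2] = 0.5 * (e_c[:-1] + e_c[1:])
        e_f[0], e_f[-1] = 0.5 * e_c[0], 0.5 * e_c[-1]
        return e_f

    def v_cycle(u, b, h, level):
        u = smooth(u, b, h, sweeps=2)                            # pre-smoothing
        if level == 0:
            return smooth(u, b, h, sweeps=50)                    # "solve" the coarsest system
        res = b - apply_A(u, h)                                  # residual equation A e = res
        e_c = v_cycle(np.zeros((len(u) - 1) // 2), restrict(res), 2.0 * h, level - 1)
        u = u + prolong(e_c)                                     # coarse grid correction
        return smooth(u, b, h, sweeps=2)                         # post-smoothing

    L = 7
    n = 2 ** (L + 1) - 1
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1.0 - h, n)
    b = np.pi**2 * np.sin(np.pi * x)                             # -u'' = b with exact solution sin(pi x)
    u = np.zeros(n)
    for it in range(10):
        u = v_cycle(u, b, h, level=L)
        print(it, np.linalg.norm(b - apply_A(u, h)))             # residual should decrease from cycle to cycle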

Parallel Scalability of Multigrid. Considering the parallel scalability, we have the freedom to decide which parts of the algorithm we want to examine. Computer scientists usually look at a component that is repeated identically a couple of times, not at all repetitions together. In terms of solvers for sparse linear systems, this means considering a single iteration. If we measure the weak scalability of a single iteration of a conjugate gradient solver, we get a graph plotting the runtime over the number of processors, which grows proportionally to the number of entries in the unknown vector x; the runtime stays nearly constant due to the almost optimal scalability of all operations in such an iteration. Only the scalar product induces an O(log p) term, where p is the number of processors. The constant of this term is small, such that it is almost invisible in the scaling plot.
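A hedged sketch of the per-iteration cost model behind this observation (N is the number of unknowns, p the number of processors, c_1 and c_2 are machine-dependent constants introduced here only for illustration):

    T_{\mathrm{CG\,iter}}(N, p) \;\approx\; c_1\,\frac{N}{p} \;+\; c_2\,\log p.

Under weak scaling, N/p stays constant, so only the small log p term from the global reductions in the scalar products grows.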

If we compare this with the scalability of a multigrid iteration, we first observe a longer runtime on a single processor, but also a worse scaling, which we will try to explain in the following. These measurements seem to lead to the conclusion that the conjugate gradient method is very well-suited for parallel architectures while multigrid is not. Let's have a closer look at what happens if we change our point of view by looking at the time to solution instead of single iterations: in this case, we have a complete view of the total costs of the solution process, i.e., the product of the number of iterations and the costs per iteration. Coming back to the Poisson equation from above, we know (Grundlagen des Wissenschaftlichen Rechnens, vast amount of literature) that the condition number κ of the discretized system matrix increases with a decreasing mesh width h, in other words with increasing length of the vector of unknowns. As a consequence, the convergence rate (√κ − 1)/(√κ + 1) of the conjugate gradient method increases towards one and the number of iterations, thus, also grows. For multigrid methods, this is not the case. This changes the scalability graph:

This graph shows a weak scaling again, i.e., the number of unknowns grows proportionally to the number of processors. Other than the figure suggests, multigrid does not scale perfectly: the runtime per iteration and, thus, of the whole solution process increases with O(log p). This obviously leads to the opposite conclusion from what we had before: multigrid scales whereas conjugate gradients do not. Of course, we've restricted our analysis to weak scalability. For strong scalability, i.e., constant problem size, CG would be in favour. However, in general, the task of massively parallel computing is to shift the borders of computability, i.e., to solve larger systems leading to larger scenario domains and higher accuracy. For such systems, a method has to be chosen that scales best in the weak scaling sense!

Detailed Theoretical Scalability Analysis. We should have a closer look at the theoretical potential of CG and multigrid in terms of parallel scalability. In a simplified view, we have for the weak scalability of CG the per-iteration costs of the matrix-vector product Ax and the scalar products x^T y discussed above. For the Poisson equation from above, the number of iterations of conjugate gradient solvers increases with 1/h if h is the mesh width of the computational grid. If we want to solve our system with a remaining error that is of the same order as the discretization error (O(h²)), the number of iterations increases even more due to the decreasing error tolerance.
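A hedged sketch of the resulting total costs (assumptions made only for this estimate: a 2D problem with N unknowns, weak scaling with N/p constant, and an iteration count proportional to (1/h) log(1/h) when the tolerance shrinks with the discretization error):

    T_{\mathrm{CG\,total}} \;\approx\;
    \underbrace{\mathcal{O}\!\left(\frac{1}{h}\,\log\frac{1}{h}\right)}_{\text{iterations}}
    \cdot
    \underbrace{\left(c_1\,\frac{N}{p} + c_2\,\log p\right)}_{\text{cost per iteration}},
    \qquad \frac{1}{h} \sim \sqrt{N} \sim \sqrt{p}.

Even though a single iteration scales almost perfectly, the time to solution then grows roughly like √p · log p under weak scaling.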

This leads to the total costs of a complete solution, i.e., the product of the growing number of iterations and the per-iteration costs. If we parallelize multigrid, we have to parallelize the operations on all levels. For the finest levels, we get a good scalability: here, the computational cost per grid partition still outweighs the communication costs at the partition boundaries. Therefore, the runtime in this range decreases with a factor of 1/4 in case of a two-dimensional mesh for each coarsening step h → h/2.

Then, there is a group of grid levels where the communication costs between partitions are larger than the computational costs. This means that the runtime decreases only with the factor by which the communication costs decrease in each coarsening step, i.e., with a factor of 1/2 in the two-dimensional example. Finally, we have the very coarse levels, where there is not enough work left to keep all processors busy. Here, we don't get a further decrease of the parallel runtime when further coarsening the grid.

We look at the whole picture of the three groups of grid levels: In the green group (finest levels), we can easily calculate an upper bound of the costs relative to the costs on the finest level L_max using the geometric series Σ_{i=0}^∞ (1/4)^i = 4/3. In the yellow group (communication-dominated levels), we can do the same as in the green group with 1/2 instead of 1/4, i.e., with the geometric series Σ_{i=0}^∞ (1/2)^i = 2.

In the red group (coarsest levels), we don't get a reduction factor at all; thus, this group induces a factor O(log p). Putting everything together yields the costs of a single multigrid cycle (see the sketch below). For the total solution process, we have to multiply these costs with the number of required iterations, which includes a log(1/h) factor, again, if we don't solve up to a fixed error tolerance but up to an error tolerance that decreases with increasing problem size, corresponding to a decreasing discretization error.
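A hedged sketch of the combined estimate (the split of the finest-level cost into a computation part T_comp and a communication part T_comm, and the constant c for the coarsest levels, are illustrative assumptions):

    T_{\mathrm{MG\,cycle}}(p) \;\lesssim\; \frac{4}{3}\,T_{\mathrm{comp}}(L_{\max}) \;+\; 2\,T_{\mathrm{comm}}(L_{\max}) \;+\; c\,\log p,
    \qquad
    T_{\mathrm{MG\,total}} \;\approx\; \mathcal{O}\!\left(\log\frac{1}{h}\right)\cdot T_{\mathrm{MG\,cycle}}(p).

Compared to the CG estimate above, the total runtime under weak scaling thus grows only polylogarithmically in p instead of like √p · log p.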

Multigrid as a Preconditioner for Conjugate Gradients as an Alternative. Now that we have seen the advantages and drawbacks of both CG and multigrid, we can ask the question whether there is a way to combine the advantages of both and eliminate the drawbacks as far as possible. The way to do this is to combine CG methods with multigrid preconditioners. If multigrid is used as a preconditioner instead of a solver, weaker requirements have to be fulfilled. This makes, e.g., the use of so-called additive multigrid solvers as preconditioners quite attractive. Additive (BPX) multigrid preconditioners replace the sequential processing order of a standard multigrid V-cycle by a different cycle where only restrictions and interpolations have to be performed sequentially and the smoothing operations can be done in parallel on all grid levels at the same time.
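A compact 1D sketch of applying such an additive multilevel preconditioner to a residual r (injection/linear interpolation as transfer operators and one damped-Jacobi step per level are illustrative simplifications, not the BPX preconditioner in its original form):

    import numpy as np

    def restrict(r):
        return r[1::2]                        # injection to every second node

    def prolong(e_c):
        # linear interpolation to the next finer grid
        e_f = np.zeros(2 * len(e_c) + 1)
        e_f[1::2] = e_c
        e_f[2:-1:2] = 0.5 * (e_c[:-1] + e_c[1:])
        e_f[0], e_f[-1] = 0.5 * e_c[0], 0.5 * e_c[-1]
        return e_f

    def additive_mg_apply(r, h, levels, omega=2.0/3.0):
        # restrict the residual to all levels first (sequential part) ...
        residuals, widths = [r], [h]
        for _ in range(levels):
            residuals.append(restrict(residuals[-1]))
            widths.append(2.0 * widths[-1])
        # ... then smooth on all levels independently (this loop could run in parallel) ...
        corrections = [omega * (w**2 / 2.0) * res for res, w in zip(residuals, widths)]
        # ... and finally interpolate and sum up the contributions (sequential part again)
        v = corrections[-1]
        for c in reversed(corrections[:-1]):
            v = c + prolong(v)
        return v

    n, levels = 2**6 - 1, 4
    r = np.random.default_rng(1).standard_normal(n)
    print(np.linalg.norm(additive_mg_apply(r, 1.0 / (n + 1), levels)))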

Chapter 7 Domain Decomposition

A very high-level approach for parallelization is based on the domain decomposition idea that we shortly sketch in the following: Consider an elliptic PDE on the region Ω with boundary Γ, e.g., with Dirichlet boundary conditions. Example:

Δu = u_xx + u_yy = ∂²u/∂x² + ∂²u/∂y² = f(x, y) in Ω,   (7.1)
u(x, y)|_Γ = g(x, y) on Γ.   (7.2)

How can we efficiently solve this partial differential equation exploiting parallelism at the highest level, i.e., parallelize even before discretization? The general idea of domain decomposition is to split the domain into two or more subdomains, solve a PDE in each of them, exchange values at boundaries between subdomains, and iterate this procedure until we get convergence.

7.1 Overlapping Domain Decomposition

We sketch the ideas of overlapping domain decomposition with only two subdomains first: Partition the region Ω into two regions Ω_1 and Ω_2 whose boundaries are partially given by the old boundary Γ and partially by new artificial boundary parts Γ_1 and Γ_2:

[Figure: overlapping subdomains Ω_1 and Ω_2 of Ω with the artificial boundaries Γ_1 and Γ_2.]

We discretize and solve the given PDE on Ω_1 and Ω_2 with boundaries Γ_1 and Γ_2. This means we need the values of u(x, y) at the new artificial boundaries Γ_1 and Γ_2. First, we assume initial approximation values at Γ_1 and Γ_2, e.g., u(x, y) = 0. Then we solve the linear systems in both Ω_1 and Ω_2. Values of the resulting solution in Ω_1 are used as new values at Γ_2, and vice versa, the solution in Ω_2 is used as new approximation at Γ_1. So we generate solutions on partial regions which provide us with approximate values for the unknown boundary values of the other partial solution. The sequence of solutions converges to the solution on Ω in each subdomain. Note that, in contrast to our parallelization approaches for linear solvers in the previous chapter, where line-blocks of the system matrix also correspond to subdomains for discretized PDEs, the domain decomposition approach allows us to even use different discretization schemes and different grids in the subdomains. We summarize the algorithm for the overlapping domain decomposition (a code sketch for a 1D model problem follows below):

1. Solve (in parallel) the PDE on all subdomains with given (Dirichlet) boundary values.
2. Reset the boundary values using the solution in the neighbouring subdomains (overlapping with our subdomain).
3. Repeat until convergence.

[Figure: iterates after the 1st and the 2nd step of the iteration.]
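A minimal NumPy sketch of this (additive) iteration for the 1D model problem u_xx = 0 on ]0; 8[ with u(0) = 0, u(8) = 8 that is also used below (grid spacing h = 1, the overlap ]2; 6[, and the iteration count are illustrative choices):

    import numpy as np

    def solve_laplace_1d(u_left, u_right, n_inner):
        # exact discrete solution of u_xx = 0 on a subdomain: linear between the boundary values
        return np.linspace(u_left, u_right, n_inner + 2)[1:-1]

    u = np.zeros(9)                                  # grid nodes x = 0, 1, ..., 8
    u[0], u[8] = 0.0, 8.0                            # physical Dirichlet values
    for it in range(15):
        # additive variant: both subdomain solves use boundary values of the previous iterate
        u1 = solve_laplace_1d(u[0], u[6], 5)         # Omega_1 = ]0; 6[, artificial boundary at x = 6
        u2 = solve_laplace_1d(u[2], u[8], 5)         # Omega_2 = ]2; 8[, artificial boundary at x = 2
        u_new = u.copy()
        u_new[1:6] = u1
        u_new[3:8] = u2                              # in the overlap, simply keep the Omega_2 values
        u = u_new
        print(it, np.max(np.abs(u - np.arange(9.0))))   # error w.r.t. the exact solution u(x) = x

The printed error decreases geometrically; the rate depends on the size of the overlap, as discussed in the convergence remark below.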

We test this algorithm for the following very simple example:

u_xx = 0 in ]0; 8[,   u(0) = 0,   u(8) = 8,

which obviously has the analytical solution u(x) = x. We denote our solution on Ω_1 = ]0; 6[ as u_1 and the solution on Ω_2 = ]2; 8[ as u_2. As an initial guess for the inner boundaries, we use u_1(6) = 0, u_2(2) = 0. Iterations:

1. Solve { (u_1)_xx = 0 with u_1(0) = 0, u_1(6) = 0 } and { (u_2)_xx = 0 with u_2(2) = 0, u_2(8) = 8 }. This gives u_1(x) = 0, u_2(x) = 4/3 (x − 2).
2. Reset boundary values: u_1(6) = u_2(6) = 16/3, u_2(2) = u_1(2) = 0.
3. Solving the new problems in Ω_1 and Ω_2 gives u_1(x) = 8/9 x, u_2(x) = 4/3 (x − 2).
4. Reset boundary values: u_1(6) = u_2(6) = 16/3, u_2(2) = u_1(2) = 16/9.
5. Solving the new problems in Ω_1 and Ω_2 gives u_1(x) = 8/9 x, u_2(x) = (28 x − 8)/27.

Remark: In order to be able to start with better initial values for the interior boundary nodes, we can use a kind of preprocessing step: solve the problem on the whole domain Ω on a coarse mesh and interpolate the resulting values to the finer meshes at the subdomain boundaries.

We have a look at the matrix representation of overlapping domain decomposition to be able to compare the domain decomposition with our previously used parallel iterative solvers:

[Figure: block view of the system with blocks A_ij, i, j ∈ {0, 1, 2}, and unknowns x_0, x_1, x_2; in step 1 the block rows belonging to Ω_1 are solved, in step 2 those belonging to Ω_2.]

Grey parts are related to the other domain, and we assume that we know the related components in the vector x of unknowns. They are moved to the right-hand side b in the solution step of the corresponding subdomain.

Additive versus Multiplicative Domain Decomposition. From the matrix representation, we observe that we actually perform an overlapping block-Jacobi iteration over the whole problem, i.e., we solve both all lines in Ω_1 and all lines in Ω_2 together with a suitable solver, exchange values in the overlap region, and go on iterating. This approach is also called additive domain decomposition. The multiplicative (also called alternating) approach solves the problem first on Ω_1, prescribes new boundary values for Ω_2, solves the problem in Ω_2, prescribes new boundary values for Ω_1, and so forth. This obviously corresponds to an overlapping block-Gauss-Seidel iteration and is inherently sequential. For testing purposes, we try the multiplicative method for our one-dimensional example as well:

1. Solve { (u_1)_xx = 0 with u_1(0) = 0, u_1(6) = 0 }. This gives u_1(x) = 0.
2. Reset boundary values: u_2(2) = u_1(2) = 0.
3. Solve { (u_2)_xx = 0 with u_2(2) = 0, u_2(8) = 8 }. This gives u_2(x) = 4/3 (x − 2).
4. Reset boundary values: u_1(6) = u_2(6) = 16/3.
5. Solving the new problem in Ω_1 gives u_1(x) = 8/9 x.

We observe that we are as fast as the additive version, but save the unnecessary work the additive version invests in solving, in each iteration, one system in which the solution does not change at all. This seems to make the additive (the only parallel) variant useless. However, the observed behaviour was due to the fact that we started with zero initial guesses for the boundary values. For other choices, also the additive variant would change both solutions u_1 and u_2 in every iteration!

Convergence. The convergence speed of the overlapping domain decomposition approach depends on the size of the overlap, such that we typically get a trade-off between the higher costs per iteration and the smaller number of iterations for a larger overlap.

7.2 Non-overlapping Domain Decomposition

Non-overlapping domain decomposition methods do not require overlapping regions, but have to use different boundary conditions at the inter-subdomain boundaries to ensure convergence. We consider an example with only two subdomains again.

[Figure: non-overlapping subdomains Ω_1 and Ω_2 separated by the new interface Γ_new.]

After discretization of the original problem and numbering the unknowns relative to the partitioning given by Ω_1 and Ω_2, this leads to a linear system with a matrix in dissection form. In matrix-vector notation, Au = f can be written as

A = ( A^(1)_{I,I}    0             A^(1)_{I,Γ}
      0              A^(2)_{I,I}   A^(2)_{I,Γ}
      A^(1)_{Γ,I}    A^(2)_{Γ,I}   A_{Γ,Γ} ),
u = ( u^(1)_I, u^(2)_I, u_Γ )^T,   f = ( f^(1)_I, f^(2)_I, f_Γ )^T,   (7.3)

where the degrees of freedom are partitioned into those internal to Ω_1, those internal to Ω_2, and those on the interior boundary Γ separating Ω_1 from Ω_2. A_{Γ,Γ} is the so-called interface matrix. For the matrix form, we have already seen how we can reduce the original problem to two partial subproblems and one interface Schur complement system (with b_1 = f^(1)_I, b_2 = f^(2)_I, b_3 = f_Γ):

u_Γ = S^{-1} B_Γ   with
S = A_{Γ,Γ} − A^(1)_{Γ,I} (A^(1)_{I,I})^{-1} A^(1)_{I,Γ} − A^(2)_{Γ,I} (A^(2)_{I,I})^{-1} A^(2)_{I,Γ}   (Schur complement),
B_Γ = b_3 − A^(1)_{Γ,I} (A^(1)_{I,I})^{-1} b_1 − A^(2)_{Γ,I} (A^(2)_{I,I})^{-1} b_2,   (7.4)

u^(1)_I = (A^(1)_{I,I})^{-1} (b_1 − A^(1)_{I,Γ} u_Γ),   u^(2)_I = (A^(2)_{I,I})^{-1} (b_2 − A^(2)_{I,Γ} u_Γ).

To solve the system iteratively, we can, e.g., use preconditioned conjugate gradients with the block preconditioner

M^{-1} = blockdiag( (A^(1)_{I,I})^{-1}, (A^(2)_{I,I})^{-1}, M_Γ^{-1} ).   (7.5)

For M_Γ^{-1}, we can use the identity or an approximate inverse of the Schur complement, e.g., using the SPAI approach. Note that, for computing the SPAI preconditioner for S, we do not have to compute S as a matrix, but only have to calculate S M^{-1}_{:,i} for the columns M^{-1}_{:,i} of M_Γ^{-1}. For this, we compute

- y_1 = A^(1)_{I,Γ} M^{-1}_{:,i} and y_2 = A^(2)_{I,Γ} M^{-1}_{:,i} (simple matrix-vector products),
- approximations of z_1 = (A^(1)_{I,I})^{-1} y_1 and z_2 = (A^(2)_{I,I})^{-1} y_2 using iterative subdomain solvers,
- a_1 = A^(1)_{Γ,I} z_1 and a_2 = A^(2)_{Γ,I} z_2 (simple matrix-vector products),

and combine them to S M^{-1}_{:,i} = A_{Γ,Γ} M^{-1}_{:,i} − a_1 − a_2. We solve the small subdomain problems, e.g., with multigrid in parallel.
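A minimal sketch of this matrix-free application of S to a vector v (dense NumPy arrays and exact subdomain solves via numpy.linalg.solve are used purely for illustration; in practice the subdomain solves would be iterative, e.g., multigrid, and everything would be sparse and distributed):

    import numpy as np

    def apply_schur(A_GG, A1_IG, A1_GI, A1_II, A2_IG, A2_GI, A2_II, v):
        # S v = A_GG v - A1_GI (A1_II)^{-1} A1_IG v - A2_GI (A2_II)^{-1} A2_IG v
        y1, y2 = A1_IG @ v, A2_IG @ v              # interface -> interior couplings
        z1 = np.linalg.solve(A1_II, y1)            # subdomain solves (iterative in practice)
        z2 = np.linalg.solve(A2_II, y2)
        a1, a2 = A1_GI @ z1, A2_GI @ z2            # interior -> interface couplings
        return A_GG @ v - a1 - a2

Feeding each allowed pattern column of M_Γ^{-1} through such a routine is again an embarrassingly parallel set of independent tasks.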

This can be easily generalized to more than two subdomains. A decomposition into, e.g., 16 subdomains leads to 16 block matrices A_1, ..., A_16 on the diagonal, coupling blocks F_i and G_i, the interface block A_17, and the Schur complement S:

A = ( A_1                  F_1
           A_2             F_2
                ...        ...
                     A_16  F_16
      G_1  G_2  ...  G_16  A_17 ),
S = A_17 − G_1 A_1^{-1} F_1 − ... − G_16 A_16^{-1} F_16.

Fully Partitioned Solution. In the non-overlapping domain decomposition approach described above, the subdomains are independent, but we still need the interaction matrices A^(1)_{I,Γ}, A^(2)_{I,Γ}, A^(1)_{Γ,I}, and A^(2)_{Γ,I} given explicitly as a result of a discretization of the PDE in the whole domain Ω. In the overlapping approach, this was not required; independent discretizations in each subdomain were possible. In order to recover a similar independence for the non-overlapping method, we start from the following observation: The Poisson equation on the domain Ω,

−Δu = f in Ω,   u = 0 on ∂Ω,

is equivalent to

−Δu_1 = f in Ω_1,   u_1 = 0 on ∂Ω_1 \ Γ,   u_1 = u_2 on Γ,   ∂u_1/∂n_1 = −∂u_2/∂n_2 on Γ,
−Δu_2 = f in Ω_2,   u_2 = 0 on ∂Ω_2 \ Γ.

From this, we derive an alternating iterative solution approach:

1. Solve −Δu_1 = f in Ω_1 with u_1 = u_2 on Γ,
2. solve −Δu_2 = f in Ω_2 with ∂u_2/∂n_2 = −∂u_1/∂n_1 on Γ,
3. iterate these two steps until convergence.

We test this iteration for the example already used in the overlapping domain decomposition,

u_xx = 0 in ]0; 8[,   u(0) = 0,   u(8) = 8.   (7.6)

Starting with the initial approximations

u_1(x) = 0 in Ω_1 = ]0; 4[,   u_2(x) = 2(x − 4) in Ω_2 = ]4; 8[,   (7.7)

yields:

1st iteration:
Solve u_1,xx = 0 in ]0; 4[, u_1(0) = 0, u_1(4) = u_2(4) = 0  →  u_1(x) = 0,
solve u_2,xx = 0 in ]4; 8[, u_2,x(4) = u_1,x(4) = 0, u_2(8) = 8  →  u_2(x) = 8.

2nd iteration:
Solve u_1,xx = 0 in ]0; 4[, u_1(0) = 0, u_1(4) = u_2(4) = 8  →  u_1(x) = 2x,
solve u_2,xx = 0 in ]4; 8[, u_2,x(4) = u_1,x(4) = 2, u_2(8) = 8  →  u_2(x) = 2(x − 4).

3rd iteration:
Solve u_1,xx = 0 in ]0; 4[, u_1(0) = 0, u_1(4) = u_2(4) = 0  →  u_1(x) = 0,
solve u_2,xx = 0 in ]4; 8[, u_2,x(4) = u_1,x(4) = 0, u_2(8) = 8  →  u_2(x) = 8.

Thus, we get back to where we have already been, i.e., the iteration does not converge (see the small demonstration below). Writing this iterative method in matrix notation again, we see the connection to the Schur complement system:

A^(1)_{I,I} u^(1),k+1_I = b_1 − A^(1)_{I,Γ} u^k_Γ   (Dirichlet values at Γ),

[ A^(2)_{I,I}  A^(2)_{I,Γ} ; A^(2)_{Γ,I}  A^(2)_{Γ,Γ} ] [ u^(2),k+1_I ; u^{k+1}_Γ ] = [ b_2 ; b_Γ − A^(1)_{Γ,I} u^(1),k+1_I − A^(1)_{Γ,Γ} u^k_Γ ]   (normal derivatives at Γ),

where A_{Γ,Γ} = A^(1)_{Γ,Γ} + A^(2)_{Γ,Γ} collects the subdomain contributions to the interface matrix. We can transform this to

(A^(2)_{Γ,Γ} − A^(2)_{Γ,I} (A^(2)_{I,I})^{-1} A^(2)_{I,Γ}) (u^{k+1}_Γ − u^k_Γ) = B_Γ − S u^k_Γ

with the right-hand side B_Γ of the Schur complement system (7.4) and the Schur complement matrix S. Thus, the fully partitioned scheme corresponds to a preconditioned Richardson iteration for the interface system (7.4). It does not always converge and might, thus, need further preconditioning / convergence acceleration.
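The cycling behaviour of this example can be reproduced with a few lines (the interface value d = u_2(4) and the analytical, piecewise linear subdomain solutions are used; this tiny script is an illustration only):

    # Dirichlet-Neumann iteration for u_xx = 0 on ]0; 8[, u(0) = 0, u(8) = 8, interface at x = 4
    d = 0.0                        # u_2(4) of the initial guess u_2(x) = 2 (x - 4)
    for k in range(6):
        slope1 = d / 4.0           # Omega_1: u_1(0) = 0, u_1(4) = d  ->  u_1(x) = (d / 4) x
        d = 8.0 - 4.0 * slope1     # Omega_2: u_2'(4) = slope1, u_2(8) = 8  ->  u_2(4) = 8 - 4 slope1
        print(k, d)                # alternates between 8 and 0; the exact interface value would be 4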

Recursive Form of Non-Overlapping DD. Instead of decomposing the domain only on one level, we can recursively further subdivide the subdomains. This leads to the nested (recursive) dissection.

Summarizing, we state that overlapping domain decomposition is easy to parallelize, but features slow convergence. Non-overlapping domain decomposition is harder to parallelize, but we have more influence on the convergence via the Schur complement S. The domain decomposition approach can be generalized to non-conforming discretizations (mortar methods / Lagrange multipliers / FETI), time-dependent systems, ...

Literature:
A. Toselli, O. Widlund: Domain Decomposition Methods: Algorithms and Theory, Springer, 2004.
A. Quarteroni, A. Valli: Domain Decomposition Methods for Partial Differential Equations, Oxford Science Publications, 1999.

7.3 Partitioned Multi-Physics

In this section, we shortly discuss the application of the fully partitioned non-overlapping domain decomposition approach to an example problem where the two domains Ω_1 and Ω_2 correspond to two different physical fields: Ω_1 is filled with a fluid, whereas Ω_2 is an elastic solid. We assume that we have software packages F and S solving fluid flow and structure deformation, respectively, and establish a simulation environment calculating the bi-directional interaction between flow and structure deformation based on these two solvers. Such a setup has many practical applications, e.g., flapping aircraft wings, interaction of blood flow and blood vessels, simulations of a pumping heart, wind turbines, ... Typically, a flow solver uses Dirichlet boundary values for the flow velocities, i.e., it uses position and velocity values at the interface between fluid and structure as an input.

As an output, the flow solver can calculate the forces exerted on the structure by the fluid flow. These are used as an input (Neumann boundary values) for the structure solver, which in turn computes new displacements and velocities at the interface to the fluid. Iterating alternating executions of the flow and structure solvers (either for a stationary solution or for the solution of an implicit time step) yields exactly the fully partitioned domain decomposition iteration introduced above:

flow:      A^(1)_{I,I} u^(1),k+1_I = b_1 − A^(1)_{I,Γ} u^k_Γ   (velocities/displacements as input),
structure: [ A^(2)_{I,I}  A^(2)_{I,Γ} ; A^(2)_{Γ,I}  A^(2)_{Γ,Γ} ] [ u^(2),k+1_I ; u^{k+1}_Γ ] = [ b_2 ; b_Γ − A^(1)_{Γ,I} u^(1),k+1_I − A^(1)_{Γ,Γ} u^k_Γ ]   (forces as input).

However, note that neither the flow nor the structure solver is linear in general, such that the formulas above are to be seen only as a matrix-like notation. Fully considering the nonlinearity, we can write the iterative scheme shortly as

f^{k+1} = F(d^k),   d^{k+1} = S(f^{k+1}),

if F and S denote the interaction of the solvers at the surface between flow and structure, i.e., the flow solver F maps interface displacements d to forces f and the structure solver S maps forces at the surface back to displacements. This can be interpreted as a fixed-point iteration for the fixed-point equation S(F(d)) = d (a sketch of the resulting coupling loop is given below) and has two major drawbacks:

1. It converges if and only if S ∘ F is a contraction, which is usually not the case, in particular for incompressible fluids and relatively elastic and lightweight structures.
2. The flow and structure solvers can only be executed one after the other, which limits the parallel scalability as, typically, the structure solver scales only on a much smaller number of cores than the flow solver, resulting in a pattern of idle cores over the runtime of the simulation.
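A schematic Python sketch of this staggered coupling loop; the solver calls F and S are hypothetical black-box placeholders (here replaced by linear toy maps), and the under-relaxation factor omega is an illustrative stabilization not discussed in the text above:

    import numpy as np

    def staggered_coupling(F, S, d0, omega=0.5, tol=1e-10, maxit=200):
        # fixed-point iteration for d = S(F(d)) with simple under-relaxation
        d = d0.copy()
        for k in range(maxit):
            f = F(d)                                   # flow solver: interface displacements -> forces
            d_tilde = S(f)                             # structure solver: forces -> displacements
            if np.linalg.norm(d_tilde - d) < tol:
                return d_tilde, k
            d = (1.0 - omega) * d + omega * d_tilde    # relaxed update of the interface displacements
        return d, maxit

    # toy stand-ins for the black-box solvers (purely illustrative)
    F = lambda d: 0.9 * d + 1.0
    S = lambda f: 0.3 * f + 2.0
    d, its = staggered_coupling(F, S, np.zeros(1))
    print(d, its)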

The latter problem seems to be easy to solve by simply switching to the slightly modified, parallel iteration

f^{k+1} = F(d^k),   d^{k+1} = S(f^k).

This solves the fixed-point equation (in matrix-like notation)

[ 0  S ; F  0 ] [ d ; f ] = [ d ; f ],

but it only helps us to satisfy (maybe) people at the supercomputing center measuring the idle times of our simulation, as it corresponds to two separate iterations of the original, staggered type:

The good news is that there is a class of methods that helps to resolve the convergence problems of the staggered approach and of the parallel approach in a way that ensures convergence for most physical scenarios and, at the same time, reduces the number of iterations required for the parallel method to approximately the same number as required for the original, staggered iteration using the same acceleration methods. This class of methods is called quasi-Newton or Anderson mixing (depending on the community publishing the method). We explain the method coming from the Newton view: Both the staggered and the parallel iteration described above solve a fixed-point equation. To avoid having to distinguish between both, we write H(x) = x in the following for our fixed-point problem. The important characteristic of our fluid-structure interaction scenario is that we are able to compute H(x) for a given input vector x, i.e., either S(F(d)) or (S(f), F(d))^T, but have no further access to details such as the underlying discretization or Jacobian matrices of H. We call this the black-box property. If we had the Jacobian J_H(x) of H at every point x, a Newton iteration for the non-linear equation H(x) − x = 0 would read:

solve (J_H(x^k) − I) Δx^k = −(H(x^k) − x^k),   with J_H(x^k) − I = J_R(x^k),
set x^{k+1} = x^k + Δx^k,

where R denotes the mapping of x to the residual R(x) = H(x) − x of the fixed-point iteration and J_R is the Jacobian of the residual mapping. After some transformations, this becomes:

solve Δx^k = −J_R^{-1}(x^k) R(x^k),
set x^{k+1} = x^k + Δx^k = H(x^k) − (I + J_R^{-1}(x^k)) R(x^k).
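Since the text refers to Anderson mixing, here is a minimal generic Anderson-acceleration sketch for a black-box fixed-point problem H(x) = x (the depth m, the least-squares formulation, and the toy H used for the test are illustrative choices and not the specific quasi-Newton variant used in fluid-structure interaction codes):

    import numpy as np

    def anderson(H, x0, m=5, tol=1e-10, maxit=100):
        # Anderson mixing for H(x) = x, using only black-box evaluations of H
        x = x0.copy()
        Hx = H(x)
        r = Hx - x                                   # residual R(x) = H(x) - x
        dH_hist, dR_hist = [], []
        for k in range(maxit):
            if np.linalg.norm(r) < tol:
                break
            if k == 0:
                x_new = Hx                           # plain fixed-point step to start
            else:
                dH_hist.append(Hx - Hx_old)          # differences of H values ...
                dR_hist.append(r - r_old)            # ... and of residuals
                if len(dH_hist) > m:                 # keep only the m most recent pairs
                    dH_hist.pop(0); dR_hist.pop(0)
                V = np.column_stack(dR_hist)
                W = np.column_stack(dH_hist)
                gamma, *_ = np.linalg.lstsq(V, r, rcond=None)
                x_new = Hx - W @ gamma               # quasi-Newton-like update from observed differences
            Hx_old, r_old = Hx, r
            x = x_new
            Hx = H(x)
            r = Hx - x
        return x, k

    # toy test: a linear contraction with a known fixed point (illustrative only)
    A = np.array([[0.5, 0.2], [0.1, 0.4]])
    b = np.array([1.0, 2.0])
    H = lambda x: A @ x + b
    x_star, its = anderson(H, np.zeros(2))
    print(x_star, its, np.linalg.norm(H(x_star) - x_star))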


More information

CSCI Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm

CSCI Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm CSCI 1760 - Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm Shay Mozes Brown University shay@cs.brown.edu Abstract. This report describes parallel Java implementations of

More information

CSC 5170: Theory of Computational Complexity Lecture 4 The Chinese University of Hong Kong 1 February 2010

CSC 5170: Theory of Computational Complexity Lecture 4 The Chinese University of Hong Kong 1 February 2010 CSC 5170: Theory of Computational Complexity Lecture 4 The Chinese University of Hong Kong 1 February 2010 Computational complexity studies the amount of resources necessary to perform given computations.

More information

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,

More information

- Part 4 - Multicore and Manycore Technology: Chances and Challenges. Vincent Heuveline

- Part 4 - Multicore and Manycore Technology: Chances and Challenges. Vincent Heuveline - Part 4 - Multicore and Manycore Technology: Chances and Challenges Vincent Heuveline 1 Numerical Simulation of Tropical Cyclones Goal oriented adaptivity for tropical cyclones ~10⁴km ~1500km ~100km 2

More information

Lab 1: Iterative Methods for Solving Linear Systems

Lab 1: Iterative Methods for Solving Linear Systems Lab 1: Iterative Methods for Solving Linear Systems January 22, 2017 Introduction Many real world applications require the solution to very large and sparse linear systems where direct methods such as

More information

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology, USA SPPEXA Symposium TU München,

More information

Analytical Modeling of Parallel Systems

Analytical Modeling of Parallel Systems Analytical Modeling of Parallel Systems Chieh-Sen (Jason) Huang Department of Applied Mathematics National Sun Yat-sen University Thank Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar for providing

More information

Elliptic Problems / Multigrid. PHY 604: Computational Methods for Physics and Astrophysics II

Elliptic Problems / Multigrid. PHY 604: Computational Methods for Physics and Astrophysics II Elliptic Problems / Multigrid Summary of Hyperbolic PDEs We looked at a simple linear and a nonlinear scalar hyperbolic PDE There is a speed associated with the change of the solution Explicit methods

More information

Overview: Synchronous Computations

Overview: Synchronous Computations Overview: Synchronous Computations barriers: linear, tree-based and butterfly degrees of synchronization synchronous example 1: Jacobi Iterations serial and parallel code, performance analysis synchronous

More information

Lecture Notes to Accompany. Scientific Computing An Introductory Survey. by Michael T. Heath. Chapter 2. Systems of Linear Equations

Lecture Notes to Accompany. Scientific Computing An Introductory Survey. by Michael T. Heath. Chapter 2. Systems of Linear Equations Lecture Notes to Accompany Scientific Computing An Introductory Survey Second Edition by Michael T. Heath Chapter 2 Systems of Linear Equations Copyright c 2001. Reproduction permitted only for noncommercial,

More information

Performance and Scalability. Lars Karlsson

Performance and Scalability. Lars Karlsson Performance and Scalability Lars Karlsson Outline Complexity analysis Runtime, speedup, efficiency Amdahl s Law and scalability Cost and overhead Cost optimality Iso-efficiency function Case study: matrix

More information

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 1: Direct Methods Dianne P. O Leary c 2008

More information

A Divide-and-Conquer Algorithm for Functions of Triangular Matrices

A Divide-and-Conquer Algorithm for Functions of Triangular Matrices A Divide-and-Conquer Algorithm for Functions of Triangular Matrices Ç. K. Koç Electrical & Computer Engineering Oregon State University Corvallis, Oregon 97331 Technical Report, June 1996 Abstract We propose

More information

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and

More information

Sparse solver 64 bit and out-of-core addition

Sparse solver 64 bit and out-of-core addition Sparse solver 64 bit and out-of-core addition Prepared By: Richard Link Brian Yuen Martec Limited 1888 Brunswick Street, Suite 400 Halifax, Nova Scotia B3J 3J8 PWGSC Contract Number: W7707-145679 Contract

More information