DELFT UNIVERSITY OF TECHNOLOGY

Size: px
Start display at page:

Download "DELFT UNIVERSITY OF TECHNOLOGY"

Transcription

1 DELFT UNIVERSITY OF TECHNOLOGY REPORT Efficient Two-Level Preconditionined Conjugate Gradient Method on the GPU. Rohit Gupta, Martin B. van Gijzen and Kees Vuik ISSN Reports of the Department of Applied Mathematical Analysis Delft 2011

2 Copyright 2011 by Department of Applied Mathematical Analysis, Delft, The Netherlands. No part of the Journal may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission from Department of Applied Mathematical Analysis, Delft University of Technology, The Netherlands.

3 Abstract We present an implementation of Two-Level Preconditioned Conjugate Gradient Method for the GPU. We investigate a Truncated Neumann Series based preconditioner in combination with deflation and compare it with Block Incomplete Cholesky schemes. This combination exhibits fine-grain parallelism and hence we gain considerably in execution time. It s numerical performance is also comparable to the Block Incomplete Cholesky approach. Our method provides a speedup of up to 16 times for a system of one million unknowns when compared to an optimized implementation on the CPU. 1 Introduction Multi-phase flows occur when different fluids interact. These fluids have different properties, e.g., in densities. To simulate two-phase flow we use the Incompressible Navier Stokes Equation. A representative example of such a flow could be thought of as an air bubble rising in water. Such phenomena frequently arise in physical processes like oil refineries and nuclear reactors and understanding them can affect the design and efficiency of such processes. 1.1 Motivation Our work is motivated by the Mass-Conserving Level Set approach (8) to solve the Navier Stokes equations for multi-phase flow. The most time consuming step in this approach is the solution of the (discretized) pressurecorrection equation, which is a poisson equation with discontinuous coefficients. For our research we choose a model problem with an interface layer that divides a square domain. More details are provided in Section 2. The discretized pressure-correction equation, takes the form of a linear system Ax = b, A R N N, N N (1) where N is the number of degrees of freedom. A is symmetric positive definite (SPD). The entries of this matrix depend on the densities of the fluids involved. So, e.g., if we consider a gas and liquid mixture then their densities have a ratio of the order of This leads to a large condition number κ for the matrix A, and slow convergence. κ = λ N λ1, where 0 < λ 1 λ 2... λ N are eigenvalues of the matrix A arranged in ascending order 1

4 1.2 Focus of this research The convergence for the system Ax = b mentioned in the previous section is slow when an iterative method like the Conjugate Gradient (CG) is applied. To achieve faster convergence, the linear system is preconditioned: M 1 Ax = M 1 b, (2) where the matrix M is symmetric and positive. The choice of M is such that the operation M 1 y, for some vector y, is computationally cheap and M can also be stored efficiently. We first consider Block Incomplete Cholesky Precondtioner which is an established method and provides some level of parallelism. However, due to the block structure of the preconditioner and also because of small eigenvalues (in the matrix A) the preconditioned matrix M 1 A still has a high condition number. In order to further reduce it we have to apply a second level of preconditioning called deflation (10). Deflation removes the smaller eigenvalues so that the condition number of the matrix M 1 A is reduced. For more details we refer the interested reader to (23). For the GPU, however, Block Incomplete Cholesky preconditioning is not the optimal choice. It is inherently sequential in every block. The amount of data-parallelism is limited by the number of blocks. However, increasing the number of blocks degrades the effectiveness of the preconditioner. Through this research we aim to find preconditioning schemes that offer fine-grain parallelism. Hence they would be better suited to the GPU and at the same time should prove effective in bringing down the condition number of M 1 A. We compare the schemes we have developed with Block-Incomplete Cholesky (Block-IC) Preconditioners, as a benchmark to check the quality of our preconditioning schemes. The numerical performance of the preconditioners we introduce in this paper comes close to it s Block-IC counterparts for our model problem and they also offer more parallelism for the GPU. 1.3 Related work With the advent of CUDA in 2007 scientific computing, it became easier to develop for the NVIDIA GPU platform. Some of the earliest advances into many core computing had to do otherwise (6). Developing with CUDA had its caveats and it required an understanding of the way applications are executed on the GPU hardware (22) to get the maximum performance and parallelism promised by a GPU. This challenge was duly taken up by the scientific community. In particular for iterative methods the GPUs were proven to be extremely useful as suggested in a number of earlier works ((15), (18), (19), (20) and (14)). There have been some previous works that have explored preconditioners with similar properties and we provide a brief overview in this section. In (1), 2

5 a highly parallelizable preconditioner was introduced that was specifically designed for a Poisson type problem. Our experience with this preconditioner tells us that it is not an effective preconditioner for ill-conditioned matrices, though it works very well for smooth Poisson problems. The preconditioning technique we introduce in this paper is able to overcome this limitation. Comparisons with existing schemes (implemented on both CPU and GPU) establish its relevance to achieve convergence quickly and accurately. We discuss this preconditioning in more detail in Section 3.3. The preconditioning technique mentioned in (11), utilizes the same idea as presented in (1) albeit with a relaxation factor. In (11) the authors use the Neumann Series approximation to devise a new preconditioner with the same focus i.e. to adapt the preconditioning techniques for the many-core platform. In (12), the authors used an LU decomposition based preconditioner with fill-in, reordered using multi-coloring. They decide in advance on the sparsity pattern of the incomplete factorization based on the matrix power A p+1 and it s multi-coloring permutation. In contrast, our approach is based on the strictly lower triangular part of the scaled version of the coefficient matrix A. The scaling is applied once and can be done in parallel so the cost is minimal. This paper is organized as follows: in the next section we present a brief discussion of the discretization approach used to construct the matrix A. A brief overview of the preconditioning schemes and their features can be found in Section 3. We discuss the approach of second level preconditioning in Section 4. In Section 5 we list the Conjugate Gradient Algorithm with Preconditioning and Deflation. Furthermore we comment on two different implementation methods for this method in Section 6. In Section 7 we present our results and we end with a discussion in Section 8. 2 Problem Definition In this section we present the test problem used to evaluate the different algorithms. 2.1 Test Problem The discretized pressure-correction equation for a 2-D square grid (n n) takes the form of 5-point Poisson type matrix for our method (as mentioned in Section 1.1). One set of diagonals have offsets ±1 and the other two have offsets of ±n with respect to the main diagonal. For a system with only one phase, the matrix would have the stencil, [ 1, 1, 4, 1, 1] for an inner cell. However, with the introduction of multiple phases we see a jump in the coefficient values at the interface. This jump is also visible in the eigenval- 3

6 magnitude two phase matrix single phase matrix eigenvalues Figure 1: 2D grid (16 16) with 256 unknowns. Jump at the Interface due to density contrast. Figure 2: Two phase Flow Computational Model Figure 3: Unit Square with boundary conditions. Figure 4: Discretized Unit Square with Interface ues. To show this we consider the simple case of a 2-D grid and plot the eigenvalues in Figure 1. The jump results from the density contrast of the two fluids. This jump in eigenvalues leads to a large condition number, κ. The computational domain can be pictured as in Figure 2. It has two fluids with a high density contrast and appropriate boundary conditions. We define a unit square as our domain (Figure 3) and an interface at the middle of this square (Figure 4). 2.2 Solution of the Discrete System After making the coefficient matrix using the stencils, we solve the resulting system with an iterative method. We define a desired accuracy of the solution we wish to achieve. The iterative method refines the solution vector x 4

7 at every step until it reaches a desired norm of the residual r k 2 b 2 ǫ, (3) where r k is the residual at the k-th step, b is the right-hand side and ǫ is the tolerance. In our experiments we choose ǫ = We measure the accuracy of our results using the relative error norm of the solution. It is defined as the difference of the computed solution and the exact solution, x exact x k 2 x exact 2, (4) where x k is the solution computed by the iterative method at the k th step of the iteration and x exact is the exact solution. The initial guess (x 0 ) is a random vector to avoid artificially fast convergence due to a smooth solution. The ill-conditioned two-phase matrix requires that we augment the Preconditioned Conjugate Gradient method with a second level preconditioner. This is essential to achieve reasonable performance of the iterative solver. This technique of second level preconditioning, called Deflation, was investigated in detail in (5). It is essential to mention here that our simple model can be easily extended to a 3D model since then the only change is to the coefficient matrix where the number of non-zeros in each row of A increases. More diagonals are added with larger offsets. For example for a 7-point stencil in 3D we can have two more diagonals at n x n y offsets if the grid dimensions are n x n y n z. For the realistic problem of bubbly flow we consider the domain being composed of air bubbles rising in water. The coefficient matrix for this problem has similar properties as defined in Section 1.1. Only now the interfaces are not as clear as in the model problem shown in Figure 2. Instead there would be spheres/circles cutting a cell partially. To formulate the matrix A for such a multi-bubble/multi-phase case, suitable approaches (cut cell, weighted averaged) can be used as suggested in (8). 3 Preconditioning Schemes In this section we first discuss standard preconditioning schemes for matrix A when using an iterative method like Conjugate Gradient. Further, we provide details of the preconditioning schemes that we have developed. b = Ax exact The exact solution is generated using the cosine function x exact(i) = cos(i 1), where i = 1...N. N is the number of unknowns. 5

8 Figure 5: Block-IC block structure. Elements excluded from a 8 8 grid for a block size of 2n = Block Incomplete Cholesky preconditioning We consider an adapted incomplete Cholesky decomposition: A = LD 1 L T R where the elements of the lower triangular matrix L and diagonal matrix D satisfy the following rules: 1. l ij = 0 (i,j) where a ij = 0 i > j, 2. l ii = d ii, 3. (LD 1 L T ) ij = a ij (i,j) where a ij 0 i j. In order to make blocks for the Block-Incomplete Cholesky approach we first apply the block structure on the matrix A. In Figure 5 we show how some of the elements belonging to the 5-point Poisson type matrix are dropped when a block incomplete scheme is applied to make the preconditioner for A. The red colored dots are the elements that are dropped since they lie outside the blocks. With this A we make the L as suggested in (13). In each iteration we have to compute y from y = M 1 BIC r, where M BIC = LD 1 L T. (5) This is done by doing forward substitution followed by diagonal scaling and then backward substitution. The number of blocks is N/g, where g is the size of the blocks. Within a block all calculations are sequential. However, each block forms a system that can be solved independently of other blocks. Hence they can be solved in parallel. The parallelism offered by Block-IC is limited to the number of blocks. In order to increase the number of parallel operations we must decrease the block size, g. However, doing this would lead to more loss of information, and consequently, delayed convergence. So, we must think of alternatives. In this paper the Block-IC Preconditioners are characterized by their block size. We denote it by a number in brackets e.g. blkic(8n) or M 1 Blk IC(8n). 6

9 Here 8n is the block-size so for a matrix A with N = n n unknowns, hence the number of blocks is n Incomplete Poisson preconditioning In (1), a new kind of incomplete preconditioning is presented. The preconditioner is based on a splitting of the coefficient matrix A using it s lower triangular part L and the diagonal D, Specifically the preconditioner is defined as, A = L + D + L T. (6) M 1 IP = (I LD 1 )(I D 1 L T ), (7) where L is the strictly lower triangular part of A and D is the diagonal matrix containing diagonal elements of A. The arrangement of the terms in (7) seems unusual (transpose terms after non-transpose terms) but the authors use this expression in their work. After calculation of the entries of M 1, the values that are below a certain threshold value are dropped, which results in an incomplete decomposition that is applied as the preconditioner. As this preconditioner was used for a Poisson type problem, the authors call it the Incomplete Poisson (IP) preconditioner. A detailed heuristic analysis about the effectiveness of this preconditioning technique can be found in (1). The main advantage of this technique is that it (Equation (5)) is reduced to sparse matrix vector products which have been heavily optimized for many-core platforms (7). However, this algorithm if used in the form as presented in the original publication gives slow convergence for a two-phase matrix. It must be used with the scaled versions of A, x and b. Ã = D 1 2 AD 1 2. (8) x = D 1 2 x. (9) b = D 1 2b. (10) In such a setting the results are much better. We call this improvement Incomplete Poisson preconditioning with scaling. This technique can be further improved as we show in the next section. 3.3 Neumann Series based preconditioning Building up on the improvements we found in the previous section for Incomplete Poisson Preconditioning we now define the preconditioning matrix, M as M 1 IP is stored just like A in DIA format M = (I + L)(I + L T ), (11) 7

10 where L is the strictly lower triangular part of à as mentioned in (8). In order to calculate M 1 we use the Neumann Series of (I + L) and (I + L T ). This is defined as The series converges if (I + L) 1 I L + L 2 L 3 +. (12) L < 1. (13) This is true for our problem and hence the Neumann Series is a valid choice for approximating the inverse of (I + L). So we can redefine M 1 as M 1 = (I L T + )(I L + ) (14) For making our preconditioners we truncate the series (12) after 1 or 2 terms. We refer to these as the Neu1 and Neu2 Preconditioners. Note that and M 1 Neu1 = (I L T )(I L) (15) M 1 Neu2 = (I L T + ( L T ) 2 )(I L + L 2 ). (16) We define K = (I L) for (15) and K = (I L + L 2 ) for (16). Note that the order of terms (transposed and non-transposed) is as expected. It appears that M 1 1 Neu1 and MIPScal have approximately the same convergence behavior. M 1 Neu1 has the same number of operations per application when compared to Incomplete Poisson and Block-IC variants. MNeu2 1, which has another higher order term in K has 2 times as many. For Incomplete Poisson, we store M 1 IP one time (similar to A or Ã) and use it over and over again in the iteration for the operation M 1 r. Since M 1 has the same sparsity structure as A (5 diagonals). Total computation cost of M 1 r is 10N operations which is the same as for Ax. For the preconditioner as given by (15) we calculate K T Kx by term in every iteration. Every term in the expansion of M 1 = K T K can be (roughly) computed at the cost of one Lx operation. This is around 2N multiplications and N additions. 3.4 Eigenvalue Spectrum Comparison of different preconditioning schemes In this section we show how the eigenvalue spectrum changes for different preconditioning schemes. The plots are for a 2-D grid with N = 4096 unknowns arranged in an n n grid where n = 64. The contrast in densities is 1000 : 1. Condition numbers for the different preconditioners applied to this matrix are listed in Table 1. 8

11 Preconditioning Scheme Condition Number for (M 1 A or M 1 Ã) Block-IC (blocksize=2n) Block-IC (blocksize=4n) Block-IC (blocksize=8n) IP (without scaling) IP (with scaling) Truncated Neumann(Neu1) Truncated Neumann(Neu1) 2.50e e e e e e e+03 Table 1: Condition Numbers after Preconditioning. Condition number of A = 9.68e In each of the plots of Figure 6 the eigenvalues of A and the eigenvalues of the preconditioned versions M 1 A are plotted. The eigenvalues of A show a jump due to the contrast in densities. All preconditioners are efficient except for IP without scaling (M 1 1 IP ). IP with scaling (MIPScal ) performs comparable to M 1 Neu1. The Block Incomplete Cholesky preconditioners are successful in bringing most eigenvalues to 1 which also brings down the condition number, κ. Note that κ decreases as the block size increases. The Neumann type preconditioning schemes perform similar to Block-IC in these respects, with the second type (with K = I L + L 2 ) performing as good as Block-IC(8n). For the Truncated Neumann Series preconditioners we always use the diagonally scaled version of A. For Block-IC and Truncated Neumann Series based preconditioners we provide a separate zoom to show the effect of preconditioner on smaller eigenvalues. 4 Deflation To improve the convergence of our method we also use a second level of preconditioning. Deflation aims to remove the remaining bad eigenvalues from the preconditioned matrix, M 1 A or M 1 Ã. This operation increases the convergence rate of the Preconditioned Conjugate Gradient (PCG) method. If we assume P = I AQ,Q = ZE 1 Z T,E = Z T AZ, (17) where E R k k is the invertible Galerkin Matrix, Q R n n is the correction Matrix, and P R n n is the deflation operator. Z is the so-called deflation-subspace matrix whose k columns are called deflation vectors K T Kx = (I L T )y, where y = (I L)x Number of operations = (2N+2N)[multiplications] + (4N)[additions] 9

12 magnitude A M 1 BIk IC(2n) A M 1 Blk IC(4n) A M 1 Blk IC(8n) A eigenvalues (a) Blk-IC 10 1 magnitude 10 1 A M 1 BIk IC(2n) A M 1 Blk IC(4n) A M 1 Blk IC(8n) A eigenvalues (b) Blk-IC : small eigenvalues magnitude A M 1 IP A M 1 IPScal B eigenvalues (c) IP magnitude magnitude A M 1 Neu1 B M 1 Neu2 B eigenvalues (d) Truncated Neumann A M 1 Neu1 B M 1 Neu2 B eigenvalues (e) Truncated Neumann : eigenvalues small Figure 6: Spectrum of Preconditioned Matrix for a 2D two-phase problem. B = Ã = D 1 2AD

13 or projection vectors. The linear system Ax = b can then be solved by employing the splitting x = (I P T )x + P T x x = Qb + P T x (18) Ax = AQb + AP T x (19) b = AQb + PAx (20) Pb = PAx. (21) The x at the end of the expression is not necessarily a solution of the original linear system, since it might contain components of the null space of P A, N(P A). Therefore this deflated solution is denoted as ˆx rather than x. The deflated system is now PAˆx = Pb. (22) 4.1 Deflated Preconditioned Conjugate Gradient Method The deflated preconditioned version of the Conjugate Gradient Method can now be presented. The deflated system (22) can be solved using a symmetric positive definite (SPD) preconditioner, M 1. We therefore seek a solution of M 1 PAˆx = M 1 Pb. (23) The resulting method is called the Deflated Preconditioned Conjugate Gradient (DPCG) method (details in (23)). We choose Sub-domain Deflation and use piecewise constant deflation vectors. We make stripe-wise deflation vectors (see Figure 9) unlike the block deflation vectors suggested in (5). These vectors lead to a regular structure for AZ and, therefore, an efficient storage of AZ. 5 Two Level Preconditioned Conjugate Gradient -Implementation In Algorithm 1 we list out the steps involved in a typical implementation of the Deflated Preconditioned Conjugate Gradient method. We follow this implementation in writing out the code for GPU and CPU. The deflation operation ŵ j := PAp j is broken into the following steps: 1. Set a 1 = Z T p j. 2. Solve Ea 2 = a Set a 3 = AZa Set w j = p j a 3. 11

14 Algorithm 1 Deflated Preconditioned Conjugate Gradient Algorithm 1: Select x 0. Compute r 0 := b Ax 0 and ˆr 0 = Pr 0, Solve My 0 = ˆr 0 and set p 0 := y 0. 2: for j:=0,..., until convergence do 3: ŵ j := PAp j 4: α j := ( ˆr j,y j ) (p j,ŵ j ) 5: ˆx j+1 := ˆx j + α j p j 6: ˆr j+1 := ˆr j α j ŵ j 7: Solve My j+1 = ˆr j+1 8: β j := (ˆr j+1,y j+1 ) (ˆr j,y j ) 9: p j+1 := y j+1 + β j p j 10: end for 11: x it := Qb + P T x j+1 Step 2 can be solved in two different ways as we will see in Section 7. The kernel for step 1 is a sum operation. Steps 3 and 4 can be done by one kernel. Given the storage format we choose for AZ these operations reduce to similar number of operations, memory access pattern and performance as the sparse matrix vector product Ax. 5.1 Storage of the matrix AZ The structure of the matrix AZ if stored as an N d matrix, where d is the number of domains/deflation vectors, can be seen in Figure 7. In Figure 7 to 9 it must be noted that d = 2 n here and N = n n = 64, n = 8. The AZ matrix is formed by multiplying the Z matrix (a part of which is shown in the adjoining figure of matrix AZ in Figure 7) with the coefficient matrix, A. The colored boxes indicate non-zero elements in AZ. They have been color coded to provide reference for how they are stored in the compact form. The red elements are in the same space as the deflation vector. The green elements result from the horizontal fill-in and the blue elements result from the vertical fill-in. The arrangement of the deflation vectors (on the grid) is shown in Figure 9. Each ellipse corresponds to the non-zero part of the corresponding deflation vector in matrix Z. The trick to store AZ in an efficient way (for the GPU) is to make sure that memory accesses are ordered. For this we need to have a look at how the operation a 3 = AZa 2 works, where a 2 is a d 1 vector. For each element of the resulting vector a 3 we need an element from at most 5 different columns of the AZ matrix. Now it must be recalled that in case of A multiplied with x we have 5 elements of A in a single row multiplied with 5 elements of x as detailed in (7). So we start looking at the different colored elements and group them so that the access pattern to calculate each element of a 3 is similar to the Sparse- 12

15 Figure 8: AZ matrix after compression. Figure 7: Parts of Z and AZ matrix. Number of Deflation Vectors =2n. Figure 9: Deflation Vectors for the 8 8 grid. Matrix Vector Product operation. Wherever there is no element in AZ we can store a zero. So writing out all the red elements is trivial as they are N in number. Similarly the blue elements can also be written in a row. Only they would have an offset at the beginning or the end that must be padded with zeros in order to make them N entries long. For the green elements we look at each of the columns in which they are present. We store one set of green non-zeros which correspond to the left of the domain in the second row of the data structure shown in Figure 8 and the other set is stored in the 4th row. Thus in the compacted form the N d matrix AZ can be stored in 5N elements as illustrated in Figure 8. Stored in such a way the AZ matrix can be used to do faster calculation of matrix vector products within the iteration. The golden arrows in Figure 8 show how each thread on the GPU can compute one element when the operation AZa 2 is performed where a 2 is a d 1 vector. The black arrows show the accesses done by multiple threads. This is similar to the DIA format of storage and calculating Sparse Matrix Vector Product as suggested in (7). Other than deflation and preconditioning the algorithm involves Sparse Matrix Vector Products (SpMVs), Dot Products, Saxpy s etc. These operations form the building blocks that must be optimized for a parallel/manycore version of the code. For standard operations like dot products, norms, daxpys we use the CUBLAS library on the GPU and ATLAS library on the 13

16 (a) CPU (b) GPU Figure 10: Setup+Initialization Time as a percentage of the total time for triangular solve approach across different sizes of deflation vectors for DPCG. CPU. 6 GPU Implementation of Deflation Since we are working with a 5-point stencil we have a regular sparse matrix. We store the matrix in the Diagonal (DIA) format and we follow the implementation as detailed in (7). We also managed to store the matrix AZ in a similar format. The details can be found in (9). Other operations include solving the system E 1 x = b which can be solved in two ways. Here E is the Galerkin Matrix given by E = Z T AZ. 1. Calculating E 1 explicitly so that the E 1 b becomes a dense matrix vector product which can be calculated using the MAGMA BLAS library for the GPU. 2. Using the dpotrs and dpotrf functions to solve the system Ex = b on the CPU using an optimized BLAS library. In the second method we have to do a host-to-device transfer and back in every iteration. This adversely affects the run-time of the complete algorithm when we consider the GPU implementation. In the first method calculation of E 1 (which is only done once in the setup phase) becomes prohibitively expensive (almost 100%) as the number of unknowns increases. In Figure 10 we see how the time for setup (for the triangular solve approach using dpotrs and dpotrf ) stays below 10% for the CPU while for the GPU it scales with the number of deflation vectors. dpotrs solves a system of linear equations Ax = B with a symmetric positive definite matrix A using the Cholesky factorization A = U T U or A = LL T computed by dpotrf dpotrf computes the Cholesky factorization of a real symmetric positive definite matrix A. 14

17 (a) with preconditioning only (b) with deflation and preconditioning Figure 11: convergence rate variation with two levels of preconditioning. 7 Numerical Results In this section we present the results for the preconditioning methods we propose compared with some well known preconditioners. We comment on the numerical performance of the preconditioner and further present results on its ability to exploit the parallel computation power of the GPU. 7.1 Comparing Preconditioning Schemes We first demonstrate with MATLAB implementations of our DPCG algorithm, how the two levels of preconditioning incrementally work at reducing the number of iterations it takes for the Conjugate Gradient Method to converge. The experiments are done for three different 2 dimensional grids with sizes 16 16, and The ratio of densities for the two fluids modeled is Figure 11 shows how the convergence rate (measured in terms of number of iterations) varies with different preconditioners for the CG method. We note that the Neumann type Preconditioners denoted by Neu1 and Neu2 are comparable to the Block-IC approaches with block sizes 2n (blkic(2n)) and 8n (blkic(8n)). We have not shown the results for plain Incomplete Poisson preconditioning (without scaling) here since they are at least 3 times higher than the diagonal preconditioning results. In Figure 11 we also notice how deflation effectively halves the number of iterations it requires to converge. 7.2 Experiments on the GPU We performed our results on the hardware available with the Delft Institute of Applied Mathematics. 15

18 For the CPU version of the code we used a single core of 2.83 Ghz with 6MB L1 cache and 8 GB main memory. For the GPU version we used a NVIDIA Tesla(Fermi) C2070 with 6GB memory. The time we report for our implementations is the total time it takes to complete k iterations (excluding the setup time) required for convergence. A single iteration involves steps 2 to 10 in Algorithm 1. In our results, speedup is measured as a ratio of the time taken to complete k iterations (of the DPCG method) on the two different architectures, The setup phase includes the following operations Speedup = k CPU k GPU (24) 1. Assigning space to variables required for temporary storage during the iterations. 2. Making matrix AZ. 3. Making matrix E. 4. Populating x, b. 5. Doing the operations as specified in the first line of Algorithm 1 in Section 5. We kept the setup time out of our results since we had two alternate approaches for handling the operation Ea 2 = a 1 as mentioned in Section 6. This way it is easier to see that explicit inverse calculation could be beneficial for speedup whereas the triangular solve is not. The problem description is as described in Section 2. The stopping criteria is defined at ǫ = For deflation we have used 2n deflation vectors. The number of unknowns is N, where N = n n. The CPU version of the code uses optimized BLAS library ATLAS and is compiled with O3 optimization flags. The GPU version uses MAGAMABLAS for some of the operations like daxpy, dcopy, ddot etc. For the triangular solve every step we use a MAGMABLAS dpotrs function every iteration after having used dpotrf once before entering the iteration loop. This function is executed in co-operation with the CPU. In Figure 12 we see that the speedup values for all schemes stay more or less constant. This is because in these cases the largest part of the time for the iteration is spent in the highly sequential Block Incomplete Cholesky Preconditioning on the GPU. The speedup increases with increasing problem size since more and more parallelism (fine-grain) is exposed. In Figure 13 and 14 we note that: 16

19 Figure 12: Comparison of Explicit versus triangular solve strategy for DPCG. Block-IC Preconditioning with 2n, 4n and 8n block sizes. Figure 13: Comparison of Explicit versus triangular solve strategy for DPCG. Incomplete Poisson Preconditioning (with and without scaling). Figure 14: Comparison of Explicit versus triangular solve strategy for DPCG. Neumann Series based Preconditioners M 1 = K T K. 1. For Incomplete Poisson Schemes with dpotrs and dpotrf based calculation of E 1 a 1 the speedup is comparable to the results for the choice where E 1 is explicitly calculated. 2. Using explicit E 1 calculation combined with Incomplete Poisson(IP), and both the variants of Truncated Neumann Series based preconditioners (M 1 1 Neu1 /MNeu2), we observe a factor 4 increase in speed with respect to the triangular solve solve approach using dpotrs and dpotrf. A comparison of how the wall-clock times for the different preconditioning algorithms vary for the Deflated Preconditioned Conjugate Gradient Method is presented in Figure 15. Finally in Figure 16 we present the number of iterations required for convergence for the different preconditioning schemes considered in this paper. It can be noticed that the second type of Neumann Series based Preconditioner (with K = (I L + L 2 ) lies between 17

20 Figure 15: Deflated Preconditioned Conjugate Gradient. Wall- Clock Times for a grid. 2n deflation vectors. Stopping Criteria 10e 06. Figure 16: Deflated Preconditioned Conjugate Gradient. Iterations required for convergence. the Block-IC scheme with block sizes 4n and 8n. 8 Conclusions and Future work We have shown how two level preconditioning can be adapted to the GPU for computational efficiency. In order to achieve this we have investigated preconditioners that are suited to the GPU. At the same time we have made new data structures in order to optimize deflation operations. Through our results we demonstrate that the combination of Truncated Neumann based preconditioning and deflation proves to be computationally efficient on the GPU. At the same time it s numerical performance is also comparable to the established method of Block-Incomplete Cholesky Preconditioning. We have also evaluated two different approaches of implementing the deflation step. From the model problem we have learned that the choice of implementing deflation method could be crucial in the overall run-time of the method. Using this knowledge we are now working on model problems with bubbles instead of simple interfaces. With these geometries we can use the Level-Set Sub-domain based deflation vectors to capture and eliminate small eigenvalues with considerably less deflation vectors. This way we can use the explicit inverse calculation for E and have double-digit speedups for larger problems. 18

21 References [1] Ament, M. and Knittel, G. and Weiskopf, D. and Strasser, W.: A Parallel Preconditioned Conjugate Gradient Solver for the Poisson Problem on a Multi-GPU Platform, pp , 2010, 18th Euromicro Conference on Parallel, Distributed and Network-based Processing. [2] Stanley O. and Sethian, James A.: Fronts Propagating with Curvature Dependent Speed: Algorithms Based on Hamilton-Jacobi Formulations, Journal of Computational Physics, [3] Demmel, J. and Hoemmen, M.F. and Mohiyuddin, M. and Yelick, K. A., Avoiding Communication in Computing Krylov Subspaces, EECS Department, University of California, Berkeley, 2007, Oct, html, UCB/EECS [4] Saad, Y. :Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics; 2 nd edition, Philadelphia, (2003). [5] Tang J.M.: Two-Level Preconditioned Conjugate Gradient Methods with Applications to Bubbly Flow Problems, PhD Thesis, Delft University of Technology, Delft, The Netherlands. [6] Bolz, J. and Farmer, I. and Grinspun, E. and Schröoder, P.: Sparse matrix solvers on the GPU: conjugate gradients and multigrid, ACM Trans. Graph. 22,3, 2003, [7] Bell, N. and Garland, M.: Efficient Sparse Matrix-Vector Multiplication on CUDA,2008. NVIDIA Corporation, NVR [8] Pijl, S.P. Van der and Segal, A. and Vuik, C. and Wesseling, P.: A mass conserving Level-Set Method for modeling of multi-phase flows, International Journal for Numerical Methods in Fluids,47, 2005, [9] Gupta, R.: Implementation of the Deflated Preconditioned Conjugate Gradient Method for Bubbly Flow on the Graphical Processing Unit(GPU), Master Thesis, Delft University of Technology, Delft, The Netherlands. [10] Tang, J.M.,Vuik, C.: Efficient Deflation Methods applied to 3- D Bubbly Flow Problems. Electronic Transactions on Numerical Analysis 26 (2007)

22 [11] Helfenstein R. and Koko J., Parallel preconditioned conjugate gradient algorithm on GPU. Journal of Computational and Applied Mathematics.In Press, Corrected Proof,2011, [12] V. Heuveline, D. Lukarski, C. Subramanian, J.-P. Weiss, Parallel Preconditioning and Modular Finite Element Solvers on Hybrid CPU-GPU Systems, in P. Ivnyi, B.H.V. Topping, (Editors), Proceedings of the Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, Civil-Comp Press, Stirlingshire, UK, Paper 36, [13] Meijerink, J. A. and Vorst, H. A. van der., An Iterative Solution Method for Linear Systems of Which the Coefficient Matrix is a Symmetric M-Matrix. Mathematics of Computation, 31, 137, Jan., 1977, pp , American Mathematical Society. [14] Asgasri, A. and Tate J. E., Implementing the Chebyshev Polynomial Preconditioner for the Iterative Solution of Linear Systems on Massively Parallel Graphics Processors, CIGRE Canada Conference on Power Systems, 2009, Toronto, Canada. [15] Griebel M. and Zaspel P., A multi-gpu accelerated solver for the three-dimensional two-phase incompressible Navier-Stokes equations, Computer Science - Research and Development, 2010, 25, Springer Berlin / Heidelberg. [16] Baskaran, M. and Bordawekar, R., Opimizing Sparse Matrix-Vector Multiplication on GPUs, IBM Research Division, 2008, NY, USA. [17] M. Baboulin and Buttari, A. and Dongarra, J. and Kurzak, J. and Langou, J. and Langou, J. and Luszczek, P. and Tomov, S., Accelerating Scientific Computations with Mixed Precision Algorithms, CoRR, abs/ , [18] Buatois, L. and Caumon, G. and Levy, B., Concurrent number cruncher: a GPU implementation of a general sparse linear solver, Int. J. Parallel Emerg. Distrib. Syst., 24, 3, 2009, Taylor & Francis, Inc., Bristol, PA, USA. [19] Monakov, A. and Avetisyan, A., Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs, SAMOS 09: Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation, 2009, , Samos, Greece, Springer-Verlag. 20

23 [20] Wang, M. and Klie, H. and Parashar, M. and Sudan, H., Solving Sparse Linear Systems on NVIDIA Tesla GPUs, ICCS 09: Proceedings of the 9th International Conference on Computational Science, 2009, , Baton Rouge, LA, Springer-Verlag. [21] NVIDIA CUDA Programming Guide v4.0, NVIDIA Corporattion,Santa Clara, [22] NVIDIA CUDA C Programming Best Practices Guide CUDA Toolkit v4.0, NVIDIA Corporattion, Santa Clara, [23] C. Vuik and A. Segal and J.A. Meijerink, An efficient preconditioned CG method for the solution of a class of layered problems with extreme contrasts in the coefficients, J. Comp. Phys., 152, 1999,

Robust Preconditioned Conjugate Gradient for the GPU and Parallel Implementations

Robust Preconditioned Conjugate Gradient for the GPU and Parallel Implementations Robust Preconditioned Conjugate Gradient for the GPU and Parallel Implementations Rohit Gupta, Martin van Gijzen, Kees Vuik GPU Technology Conference 2012, San Jose CA. GPU Technology Conference 2012,

More information

On the choice of abstract projection vectors for second level preconditioners

On the choice of abstract projection vectors for second level preconditioners On the choice of abstract projection vectors for second level preconditioners C. Vuik 1, J.M. Tang 1, and R. Nabben 2 1 Delft University of Technology 2 Technische Universität Berlin Institut für Mathematik

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Classical Iterations We have a problem, We assume that the matrix comes from a discretization of a PDE. The best and most popular model problem is, The matrix will be as large

More information

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Accelerating linear algebra computations with hybrid GPU-multicore systems. Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)

More information

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method

Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Algorithm for Sparse Approximate Inverse Preconditioners in the Conjugate Gradient Method Ilya B. Labutin A.A. Trofimuk Institute of Petroleum Geology and Geophysics SB RAS, 3, acad. Koptyug Ave., Novosibirsk

More information

Computing least squares condition numbers on hybrid multicore/gpu systems

Computing least squares condition numbers on hybrid multicore/gpu systems Computing least squares condition numbers on hybrid multicore/gpu systems M. Baboulin and J. Dongarra and R. Lacroix Abstract This paper presents an efficient computation for least squares conditioning

More information

The Deflation Accelerated Schwarz Method for CFD

The Deflation Accelerated Schwarz Method for CFD The Deflation Accelerated Schwarz Method for CFD J. Verkaik 1, C. Vuik 2,, B.D. Paarhuis 1, and A. Twerda 1 1 TNO Science and Industry, Stieltjesweg 1, P.O. Box 155, 2600 AD Delft, The Netherlands 2 Delft

More information

Solution to Laplace Equation using Preconditioned Conjugate Gradient Method with Compressed Row Storage using MPI

Solution to Laplace Equation using Preconditioned Conjugate Gradient Method with Compressed Row Storage using MPI Solution to Laplace Equation using Preconditioned Conjugate Gradient Method with Compressed Row Storage using MPI Sagar Bhatt Person Number: 50170651 Department of Mechanical and Aerospace Engineering,

More information

C. Vuik 1 R. Nabben 2 J.M. Tang 1

C. Vuik 1 R. Nabben 2 J.M. Tang 1 Deflation acceleration of block ILU preconditioned Krylov methods C. Vuik 1 R. Nabben 2 J.M. Tang 1 1 Delft University of Technology J.M. Burgerscentrum 2 Technical University Berlin Institut für Mathematik

More information

DELFT UNIVERSITY OF TECHNOLOGY

DELFT UNIVERSITY OF TECHNOLOGY DELFT UNIVERSITY OF TECHNOLOGY REPORT 16-02 The Induced Dimension Reduction method applied to convection-diffusion-reaction problems R. Astudillo and M. B. van Gijzen ISSN 1389-6520 Reports of the Delft

More information

The solution of the discretized incompressible Navier-Stokes equations with iterative methods

The solution of the discretized incompressible Navier-Stokes equations with iterative methods The solution of the discretized incompressible Navier-Stokes equations with iterative methods Report 93-54 C. Vuik Technische Universiteit Delft Delft University of Technology Faculteit der Technische

More information

9.1 Preconditioned Krylov Subspace Methods

9.1 Preconditioned Krylov Subspace Methods Chapter 9 PRECONDITIONING 9.1 Preconditioned Krylov Subspace Methods 9.2 Preconditioned Conjugate Gradient 9.3 Preconditioned Generalized Minimal Residual 9.4 Relaxation Method Preconditioners 9.5 Incomplete

More information

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics)

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012 Introduction Linear systems

More information

Linear Solvers. Andrew Hazel

Linear Solvers. Andrew Hazel Linear Solvers Andrew Hazel Introduction Thus far we have talked about the formulation and discretisation of physical problems...... and stopped when we got to a discrete linear system of equations. Introduction

More information

Boundary Value Problems - Solving 3-D Finite-Difference problems Jacob White

Boundary Value Problems - Solving 3-D Finite-Difference problems Jacob White Introduction to Simulation - Lecture 2 Boundary Value Problems - Solving 3-D Finite-Difference problems Jacob White Thanks to Deepak Ramaswamy, Michal Rewienski, and Karen Veroy Outline Reminder about

More information

Lecture 17: Iterative Methods and Sparse Linear Algebra

Lecture 17: Iterative Methods and Sparse Linear Algebra Lecture 17: Iterative Methods and Sparse Linear Algebra David Bindel 25 Mar 2014 Logistics HW 3 extended to Wednesday after break HW 4 should come out Monday after break Still need project description

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)

AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 19: Computing the SVD; Sparse Linear Systems Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical

More information

SOLUTION of linear systems of equations of the form:

SOLUTION of linear systems of equations of the form: Proceedings of the Federated Conference on Computer Science and Information Systems pp. Mixed precision iterative refinement techniques for the WZ factorization Beata Bylina Jarosław Bylina Institute of

More information

On deflation and singular symmetric positive semi-definite matrices

On deflation and singular symmetric positive semi-definite matrices Journal of Computational and Applied Mathematics 206 (2007) 603 614 www.elsevier.com/locate/cam On deflation and singular symmetric positive semi-definite matrices J.M. Tang, C. Vuik Faculty of Electrical

More information

Numerical Methods I Non-Square and Sparse Linear Systems

Numerical Methods I Non-Square and Sparse Linear Systems Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant

More information

MARCH 24-27, 2014 SAN JOSE, CA

MARCH 24-27, 2014 SAN JOSE, CA MARCH 24-27, 2014 SAN JOSE, CA Sparse HPC on modern architectures Important scientific applications rely on sparse linear algebra HPCG a new benchmark proposal to complement Top500 (HPL) To solve A x =

More information

SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA

SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS. Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 1 SOLVING SPARSE LINEAR SYSTEMS OF EQUATIONS Chao Yang Computational Research Division Lawrence Berkeley National Laboratory Berkeley, CA, USA 2 OUTLINE Sparse matrix storage format Basic factorization

More information

Master Thesis Literature Study Presentation

Master Thesis Literature Study Presentation Master Thesis Literature Study Presentation Delft University of Technology The Faculty of Electrical Engineering, Mathematics and Computer Science January 29, 2010 Plaxis Introduction Plaxis Finite Element

More information

DELFT UNIVERSITY OF TECHNOLOGY

DELFT UNIVERSITY OF TECHNOLOGY DELFT UNIVERSITY OF TECHNOLOGY REPORT 09-08 Preconditioned conjugate gradient method enhanced by deflation of rigid body modes applied to composite materials. T.B. Jönsthövel ISSN 1389-6520 Reports of

More information

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning

Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Fine-Grained Parallel Algorithms for Incomplete Factorization Preconditioning Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology, USA SPPEXA Symposium TU München,

More information

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations Sparse Linear Systems Iterative Methods for Sparse Linear Systems Matrix Computations and Applications, Lecture C11 Fredrik Bengzon, Robert Söderlund We consider the problem of solving the linear system

More information

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix Scientific Computing with Case Studies SIAM Press, 2009 http://www.cs.umd.edu/users/oleary/sccswebpage Lecture Notes for Unit VII Sparse Matrix Computations Part 1: Direct Methods Dianne P. O Leary c 2008

More information

Preface to the Second Edition. Preface to the First Edition

Preface to the Second Edition. Preface to the First Edition n page v Preface to the Second Edition Preface to the First Edition xiii xvii 1 Background in Linear Algebra 1 1.1 Matrices................................. 1 1.2 Square Matrices and Eigenvalues....................

More information

Various ways to use a second level preconditioner

Various ways to use a second level preconditioner Various ways to use a second level preconditioner C. Vuik 1, J.M. Tang 1, R. Nabben 2, and Y. Erlangga 3 1 Delft University of Technology Delft Institute of Applied Mathematics 2 Technische Universität

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota

Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota Multilevel low-rank approximation preconditioners Yousef Saad Department of Computer Science and Engineering University of Minnesota SIAM CSE Boston - March 1, 2013 First: Joint work with Ruipeng Li Work

More information

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION

FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros

More information

Scientific Computing

Scientific Computing Scientific Computing Direct solution methods Martin van Gijzen Delft University of Technology October 3, 2018 1 Program October 3 Matrix norms LU decomposition Basic algorithm Cost Stability Pivoting Pivoting

More information

Fine-grained Parallel Incomplete LU Factorization

Fine-grained Parallel Incomplete LU Factorization Fine-grained Parallel Incomplete LU Factorization Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Sparse Days Meeting at CERFACS June 5-6, 2014 Contribution

More information

LINEAR SYSTEMS (11) Intensive Computation

LINEAR SYSTEMS (11) Intensive Computation LINEAR SYSTEMS () Intensive Computation 27-8 prof. Annalisa Massini Viviana Arrigoni EXACT METHODS:. GAUSSIAN ELIMINATION. 2. CHOLESKY DECOMPOSITION. ITERATIVE METHODS:. JACOBI. 2. GAUSS-SEIDEL 2 CHOLESKY

More information

- Part 4 - Multicore and Manycore Technology: Chances and Challenges. Vincent Heuveline

- Part 4 - Multicore and Manycore Technology: Chances and Challenges. Vincent Heuveline - Part 4 - Multicore and Manycore Technology: Chances and Challenges Vincent Heuveline 1 Numerical Simulation of Tropical Cyclones Goal oriented adaptivity for tropical cyclones ~10⁴km ~1500km ~100km 2

More information

Solving Ax = b, an overview. Program

Solving Ax = b, an overview. Program Numerical Linear Algebra Improving iterative solvers: preconditioning, deflation, numerical software and parallelisation Gerard Sleijpen and Martin van Gijzen November 29, 27 Solving Ax = b, an overview

More information

Iterative Solvers in the Finite Element Solution of Transient Heat Conduction

Iterative Solvers in the Finite Element Solution of Transient Heat Conduction Iterative Solvers in the Finite Element Solution of Transient Heat Conduction Mile R. Vuji~i} PhD student Steve G.R. Brown Senior Lecturer Materials Research Centre School of Engineering University of

More information

Algebraic Multigrid as Solvers and as Preconditioner

Algebraic Multigrid as Solvers and as Preconditioner Ò Algebraic Multigrid as Solvers and as Preconditioner Domenico Lahaye domenico.lahaye@cs.kuleuven.ac.be http://www.cs.kuleuven.ac.be/ domenico/ Department of Computer Science Katholieke Universiteit Leuven

More information

Lecture 18 Classical Iterative Methods

Lecture 18 Classical Iterative Methods Lecture 18 Classical Iterative Methods MIT 18.335J / 6.337J Introduction to Numerical Methods Per-Olof Persson November 14, 2006 1 Iterative Methods for Linear Systems Direct methods for solving Ax = b,

More information

Today s class. Linear Algebraic Equations LU Decomposition. Numerical Methods, Fall 2011 Lecture 8. Prof. Jinbo Bi CSE, UConn

Today s class. Linear Algebraic Equations LU Decomposition. Numerical Methods, Fall 2011 Lecture 8. Prof. Jinbo Bi CSE, UConn Today s class Linear Algebraic Equations LU Decomposition 1 Linear Algebraic Equations Gaussian Elimination works well for solving linear systems of the form: AX = B What if you have to solve the linear

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 24: Preconditioning and Multigrid Solver Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 5 Preconditioning Motivation:

More information

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a

More information

Chapter 7 Iterative Techniques in Matrix Algebra

Chapter 7 Iterative Techniques in Matrix Algebra Chapter 7 Iterative Techniques in Matrix Algebra Per-Olof Persson persson@berkeley.edu Department of Mathematics University of California, Berkeley Math 128B Numerical Analysis Vector Norms Definition

More information

c 2015 Society for Industrial and Applied Mathematics

c 2015 Society for Industrial and Applied Mathematics SIAM J. SCI. COMPUT. Vol. 37, No. 2, pp. C169 C193 c 2015 Society for Industrial and Applied Mathematics FINE-GRAINED PARALLEL INCOMPLETE LU FACTORIZATION EDMOND CHOW AND AFTAB PATEL Abstract. This paper

More information

Finite-choice algorithm optimization in Conjugate Gradients

Finite-choice algorithm optimization in Conjugate Gradients Finite-choice algorithm optimization in Conjugate Gradients Jack Dongarra and Victor Eijkhout January 2003 Abstract We present computational aspects of mathematically equivalent implementations of the

More information

7.2 Steepest Descent and Preconditioning

7.2 Steepest Descent and Preconditioning 7.2 Steepest Descent and Preconditioning Descent methods are a broad class of iterative methods for finding solutions of the linear system Ax = b for symmetric positive definite matrix A R n n. Consider

More information

Solving PDEs with Multigrid Methods p.1

Solving PDEs with Multigrid Methods p.1 Solving PDEs with Multigrid Methods Scott MacLachlan maclachl@colorado.edu Department of Applied Mathematics, University of Colorado at Boulder Solving PDEs with Multigrid Methods p.1 Support and Collaboration

More information

DELFT UNIVERSITY OF TECHNOLOGY

DELFT UNIVERSITY OF TECHNOLOGY DELFT UNIVERSITY OF TECHNOLOGY REPORT 01-13 The influence of deflation vectors at interfaces on the deflated conjugate gradient method F.J. Vermolen and C. Vuik ISSN 1389-6520 Reports of the Department

More information

Solving PDEs with CUDA Jonathan Cohen

Solving PDEs with CUDA Jonathan Cohen Solving PDEs with CUDA Jonathan Cohen jocohen@nvidia.com NVIDIA Research PDEs (Partial Differential Equations) Big topic Some common strategies Focus on one type of PDE in this talk Poisson Equation Linear

More information

Numerical Analysis: Solutions of System of. Linear Equation. Natasha S. Sharma, PhD

Numerical Analysis: Solutions of System of. Linear Equation. Natasha S. Sharma, PhD Mathematical Question we are interested in answering numerically How to solve the following linear system for x Ax = b? where A is an n n invertible matrix and b is vector of length n. Notation: x denote

More information

Chapter 9 Implicit integration, incompressible flows

Chapter 9 Implicit integration, incompressible flows Chapter 9 Implicit integration, incompressible flows The methods we discussed so far work well for problems of hydrodynamics in which the flow speeds of interest are not orders of magnitude smaller than

More information

Notes for CS542G (Iterative Solvers for Linear Systems)

Notes for CS542G (Iterative Solvers for Linear Systems) Notes for CS542G (Iterative Solvers for Linear Systems) Robert Bridson November 20, 2007 1 The Basics We re now looking at efficient ways to solve the linear system of equations Ax = b where in this course,

More information

Conjugate Gradient Method

Conjugate Gradient Method Conjugate Gradient Method direct and indirect methods positive definite linear systems Krylov sequence spectral analysis of Krylov sequence preconditioning Prof. S. Boyd, EE364b, Stanford University Three

More information

J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1. March, 2009

J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1. March, 2009 Parallel Preconditioning of Linear Systems based on ILUPACK for Multithreaded Architectures J.I. Aliaga M. Bollhöfer 2 A.F. Martín E.S. Quintana-Ortí Deparment of Computer Science and Engineering, Univ.

More information

Contents. Preface... xi. Introduction...

Contents. Preface... xi. Introduction... Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Decompositions, numerical aspects Gerard Sleijpen and Martin van Gijzen September 27, 2017 1 Delft University of Technology Program Lecture 2 LU-decomposition Basic algorithm Cost

More information

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects

Program Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects Numerical Linear Algebra Decompositions, numerical aspects Program Lecture 2 LU-decomposition Basic algorithm Cost Stability Pivoting Cholesky decomposition Sparse matrices and reorderings Gerard Sleijpen

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning

AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 23: GMRES and Other Krylov Subspace Methods; Preconditioning Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 18 Outline

More information

Math 471 (Numerical methods) Chapter 3 (second half). System of equations

Math 471 (Numerical methods) Chapter 3 (second half). System of equations Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular

More information

M.A. Botchev. September 5, 2014

M.A. Botchev. September 5, 2014 Rome-Moscow school of Matrix Methods and Applied Linear Algebra 2014 A short introduction to Krylov subspaces for linear systems, matrix functions and inexact Newton methods. Plan and exercises. M.A. Botchev

More information

AMS Mathematics Subject Classification : 65F10,65F50. Key words and phrases: ILUS factorization, preconditioning, Schur complement, 1.

AMS Mathematics Subject Classification : 65F10,65F50. Key words and phrases: ILUS factorization, preconditioning, Schur complement, 1. J. Appl. Math. & Computing Vol. 15(2004), No. 1, pp. 299-312 BILUS: A BLOCK VERSION OF ILUS FACTORIZATION DAVOD KHOJASTEH SALKUYEH AND FAEZEH TOUTOUNIAN Abstract. ILUS factorization has many desirable

More information

Scalable Non-blocking Preconditioned Conjugate Gradient Methods

Scalable Non-blocking Preconditioned Conjugate Gradient Methods Scalable Non-blocking Preconditioned Conjugate Gradient Methods Paul Eller and William Gropp University of Illinois at Urbana-Champaign Department of Computer Science Supercomputing 16 Paul Eller and William

More information

A robust multilevel approximate inverse preconditioner for symmetric positive definite matrices

A robust multilevel approximate inverse preconditioner for symmetric positive definite matrices DICEA DEPARTMENT OF CIVIL, ENVIRONMENTAL AND ARCHITECTURAL ENGINEERING PhD SCHOOL CIVIL AND ENVIRONMENTAL ENGINEERING SCIENCES XXX CYCLE A robust multilevel approximate inverse preconditioner for symmetric

More information

A Domain Decomposition Based Jacobi-Davidson Algorithm for Quantum Dot Simulation

A Domain Decomposition Based Jacobi-Davidson Algorithm for Quantum Dot Simulation A Domain Decomposition Based Jacobi-Davidson Algorithm for Quantum Dot Simulation Tao Zhao 1, Feng-Nan Hwang 2 and Xiao-Chuan Cai 3 Abstract In this paper, we develop an overlapping domain decomposition

More information

Lab 1: Iterative Methods for Solving Linear Systems

Lab 1: Iterative Methods for Solving Linear Systems Lab 1: Iterative Methods for Solving Linear Systems January 22, 2017 Introduction Many real world applications require the solution to very large and sparse linear systems where direct methods such as

More information

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1 1 Deparment of Computer

More information

Lecture # 20 The Preconditioned Conjugate Gradient Method

Lecture # 20 The Preconditioned Conjugate Gradient Method Lecture # 20 The Preconditioned Conjugate Gradient Method We wish to solve Ax = b (1) A R n n is symmetric and positive definite (SPD). We then of n are being VERY LARGE, say, n = 10 6 or n = 10 7. Usually,

More information

Incomplete Cholesky preconditioners that exploit the low-rank property

Incomplete Cholesky preconditioners that exploit the low-rank property anapov@ulb.ac.be ; http://homepages.ulb.ac.be/ anapov/ 1 / 35 Incomplete Cholesky preconditioners that exploit the low-rank property (theory and practice) Artem Napov Service de Métrologie Nucléaire, Université

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

PDE Solvers for Fluid Flow

PDE Solvers for Fluid Flow PDE Solvers for Fluid Flow issues and algorithms for the Streaming Supercomputer Eran Guendelman February 5, 2002 Topics Equations for incompressible fluid flow 3 model PDEs: Hyperbolic, Elliptic, Parabolic

More information

A SPARSE APPROXIMATE INVERSE PRECONDITIONER FOR NONSYMMETRIC LINEAR SYSTEMS

A SPARSE APPROXIMATE INVERSE PRECONDITIONER FOR NONSYMMETRIC LINEAR SYSTEMS INTERNATIONAL JOURNAL OF NUMERICAL ANALYSIS AND MODELING, SERIES B Volume 5, Number 1-2, Pages 21 30 c 2014 Institute for Scientific Computing and Information A SPARSE APPROXIMATE INVERSE PRECONDITIONER

More information

EXAMPLES OF CLASSICAL ITERATIVE METHODS

EXAMPLES OF CLASSICAL ITERATIVE METHODS EXAMPLES OF CLASSICAL ITERATIVE METHODS In these lecture notes we revisit a few classical fixpoint iterations for the solution of the linear systems of equations. We focus on the algebraic and algorithmic

More information

Lecture 11: CMSC 878R/AMSC698R. Iterative Methods An introduction. Outline. Inverse, LU decomposition, Cholesky, SVD, etc.

Lecture 11: CMSC 878R/AMSC698R. Iterative Methods An introduction. Outline. Inverse, LU decomposition, Cholesky, SVD, etc. Lecture 11: CMSC 878R/AMSC698R Iterative Methods An introduction Outline Direct Solution of Linear Systems Inverse, LU decomposition, Cholesky, SVD, etc. Iterative methods for linear systems Why? Matrix

More information

Notes on PCG for Sparse Linear Systems

Notes on PCG for Sparse Linear Systems Notes on PCG for Sparse Linear Systems Luca Bergamaschi Department of Civil Environmental and Architectural Engineering University of Padova e-mail luca.bergamaschi@unipd.it webpage www.dmsa.unipd.it/

More information

Direct and Incomplete Cholesky Factorizations with Static Supernodes

Direct and Incomplete Cholesky Factorizations with Static Supernodes Direct and Incomplete Cholesky Factorizations with Static Supernodes AMSC 661 Term Project Report Yuancheng Luo 2010-05-14 Introduction Incomplete factorizations of sparse symmetric positive definite (SSPD)

More information

CLASSICAL ITERATIVE METHODS

CLASSICAL ITERATIVE METHODS CLASSICAL ITERATIVE METHODS LONG CHEN In this notes we discuss classic iterative methods on solving the linear operator equation (1) Au = f, posed on a finite dimensional Hilbert space V = R N equipped

More information

Preconditioning Techniques Analysis for CG Method

Preconditioning Techniques Analysis for CG Method Preconditioning Techniques Analysis for CG Method Huaguang Song Department of Computer Science University of California, Davis hso@ucdavis.edu Abstract Matrix computation issue for solve linear system

More information

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication.

CME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication. CME342 Parallel Methods in Numerical Analysis Matrix Computation: Iterative Methods II Outline: CG & its parallelization. Sparse Matrix-vector Multiplication. 1 Basic iterative methods: Ax = b r = b Ax

More information

A PRECONDITIONER FOR THE HELMHOLTZ EQUATION WITH PERFECTLY MATCHED LAYER

A PRECONDITIONER FOR THE HELMHOLTZ EQUATION WITH PERFECTLY MATCHED LAYER European Conference on Computational Fluid Dynamics ECCOMAS CFD 2006 P. Wesseling, E. Oñate and J. Périaux (Eds) c TU Delft, The Netherlands, 2006 A PRECONDITIONER FOR THE HELMHOLTZ EQUATION WITH PERFECTLY

More information

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit V: Eigenvalue Problems. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit V: Eigenvalue Problems Lecturer: Dr. David Knezevic Unit V: Eigenvalue Problems Chapter V.4: Krylov Subspace Methods 2 / 51 Krylov Subspace Methods In this chapter we give

More information

A Robust Preconditioned Iterative Method for the Navier-Stokes Equations with High Reynolds Numbers

A Robust Preconditioned Iterative Method for the Navier-Stokes Equations with High Reynolds Numbers Applied and Computational Mathematics 2017; 6(4): 202-207 http://www.sciencepublishinggroup.com/j/acm doi: 10.11648/j.acm.20170604.18 ISSN: 2328-5605 (Print); ISSN: 2328-5613 (Online) A Robust Preconditioned

More information

Topics. The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems

Topics. The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems Topics The CG Algorithm Algorithmic Options CG s Two Main Convergence Theorems What about non-spd systems? Methods requiring small history Methods requiring large history Summary of solvers 1 / 52 Conjugate

More information

Implementation of the Deflated Pre-conditioned Conjugate Gradient Method applied to the Poisson equation related to bubbly flow on the GPU

Implementation of the Deflated Pre-conditioned Conjugate Gradient Method applied to the Poisson equation related to bubbly flow on the GPU Faculty of Electrical Engineering, Mathematics, and Computer Science Delft University of Technology Implementation of the Deflated Pre-conditioned Conjugate Gradient Method applied to the Poisson equation

More information

Incomplete LU Preconditioning and Error Compensation Strategies for Sparse Matrices

Incomplete LU Preconditioning and Error Compensation Strategies for Sparse Matrices Incomplete LU Preconditioning and Error Compensation Strategies for Sparse Matrices Eun-Joo Lee Department of Computer Science, East Stroudsburg University of Pennsylvania, 327 Science and Technology Center,

More information

2.29 Numerical Fluid Mechanics Spring 2015 Lecture 9

2.29 Numerical Fluid Mechanics Spring 2015 Lecture 9 Spring 2015 Lecture 9 REVIEW Lecture 8: Direct Methods for solving (linear) algebraic equations Gauss Elimination LU decomposition/factorization Error Analysis for Linear Systems and Condition Numbers

More information

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES

ITERATIVE METHODS BASED ON KRYLOV SUBSPACES ITERATIVE METHODS BASED ON KRYLOV SUBSPACES LONG CHEN We shall present iterative methods for solving linear algebraic equation Au = b based on Krylov subspaces We derive conjugate gradient (CG) method

More information

Conjugate gradient method. Descent method. Conjugate search direction. Conjugate Gradient Algorithm (294)

Conjugate gradient method. Descent method. Conjugate search direction. Conjugate Gradient Algorithm (294) Conjugate gradient method Descent method Hestenes, Stiefel 1952 For A N N SPD In exact arithmetic, solves in N steps In real arithmetic No guaranteed stopping Often converges in many fewer than N steps

More information

Accelerating Model Reduction of Large Linear Systems with Graphics Processors

Accelerating Model Reduction of Large Linear Systems with Graphics Processors Accelerating Model Reduction of Large Linear Systems with Graphics Processors P. Benner 1, P. Ezzatti 2, D. Kressner 3, E.S. Quintana-Ortí 4, Alfredo Remón 4 1 Max-Plank-Institute for Dynamics of Complex

More information

Computational Methods. Systems of Linear Equations

Computational Methods. Systems of Linear Equations Computational Methods Systems of Linear Equations Manfred Huber 2010 1 Systems of Equations Often a system model contains multiple variables (parameters) and contains multiple equations Multiple equations

More information

An advanced ILU preconditioner for the incompressible Navier-Stokes equations

An advanced ILU preconditioner for the incompressible Navier-Stokes equations An advanced ILU preconditioner for the incompressible Navier-Stokes equations M. ur Rehman C. Vuik A. Segal Delft Institute of Applied Mathematics, TU delft The Netherlands Computational Methods with Applications,

More information

THEORETICAL AND NUMERICAL COMPARISON OF PROJECTION METHODS DERIVED FROM DEFLATION, DOMAIN DECOMPOSITION AND MULTIGRID METHODS

THEORETICAL AND NUMERICAL COMPARISON OF PROJECTION METHODS DERIVED FROM DEFLATION, DOMAIN DECOMPOSITION AND MULTIGRID METHODS THEORETICAL AND NUMERICAL COMPARISON OF PROJECTION METHODS DERIVED FROM DEFLATION, DOMAIN DECOMPOSITION AND MULTIGRID METHODS J.M. TANG, R. NABBEN, C. VUIK, AND Y.A. ERLANGGA Abstract. For various applications,

More information

On the Preconditioning of the Block Tridiagonal Linear System of Equations

On the Preconditioning of the Block Tridiagonal Linear System of Equations On the Preconditioning of the Block Tridiagonal Linear System of Equations Davod Khojasteh Salkuyeh Department of Mathematics, University of Mohaghegh Ardabili, PO Box 179, Ardabil, Iran E-mail: khojaste@umaacir

More information

Preconditioners for the incompressible Navier Stokes equations

Preconditioners for the incompressible Navier Stokes equations Preconditioners for the incompressible Navier Stokes equations C. Vuik M. ur Rehman A. Segal Delft Institute of Applied Mathematics, TU Delft, The Netherlands SIAM Conference on Computational Science and

More information

A MULTIGRID ALGORITHM FOR. Richard E. Ewing and Jian Shen. Institute for Scientic Computation. Texas A&M University. College Station, Texas SUMMARY

A MULTIGRID ALGORITHM FOR. Richard E. Ewing and Jian Shen. Institute for Scientic Computation. Texas A&M University. College Station, Texas SUMMARY A MULTIGRID ALGORITHM FOR THE CELL-CENTERED FINITE DIFFERENCE SCHEME Richard E. Ewing and Jian Shen Institute for Scientic Computation Texas A&M University College Station, Texas SUMMARY In this article,

More information

Computational Linear Algebra

Computational Linear Algebra Computational Linear Algebra PD Dr. rer. nat. habil. Ralf Peter Mundani Computation in Engineering / BGU Scientific Computing in Computer Science / INF Winter Term 2017/18 Part 2: Direct Methods PD Dr.

More information

An Efficient Low Memory Implicit DG Algorithm for Time Dependent Problems

An Efficient Low Memory Implicit DG Algorithm for Time Dependent Problems An Efficient Low Memory Implicit DG Algorithm for Time Dependent Problems P.-O. Persson and J. Peraire Massachusetts Institute of Technology 2006 AIAA Aerospace Sciences Meeting, Reno, Nevada January 9,

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012

MAGMA. Matrix Algebra on GPU and Multicore Architectures. Mark Gates. February 2012 MAGMA Matrix Algebra on GPU and Multicore Architectures Mark Gates February 2012 1 Hardware trends Scale # cores instead of clock speed Hardware issue became software issue Multicore Hybrid 1.E+07 1e7

More information

Parallel Algorithms for Four-Dimensional Variational Data Assimilation

Parallel Algorithms for Four-Dimensional Variational Data Assimilation Parallel Algorithms for Four-Dimensional Variational Data Assimilation Mie Fisher ECMWF October 24, 2011 Mie Fisher (ECMWF) Parallel 4D-Var October 24, 2011 1 / 37 Brief Introduction to 4D-Var Four-Dimensional

More information

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems

A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Outline A High-Performance Parallel Hybrid Method for Large Sparse Linear Systems Azzam Haidar CERFACS, Toulouse joint work with Luc Giraud (N7-IRIT, France) and Layne Watson (Virginia Polytechnic Institute,

More information