Solving PDEs with CUDA Jonathan Cohen


Solving PDEs with CUDA
Jonathan Cohen (jocohen@nvidia.com)
NVIDIA Research

PDEs (Partial Differential Equations)
- Big topic; some common strategies
- Focus on one type of PDE in this talk: the Poisson equation
- Linear equation => linear solvers
- Parallel approaches for solving the resulting linear systems

Poisson Equation
Classic Poisson equation: ∇²p = rhs (p, rhs scalar fields), i.e. the Laplacian of p (the sum of its second derivatives) equals the right-hand side:

    ∂²p/∂x² + ∂²p/∂y² + ∂²p/∂z² = rhs

The second derivative is the derivative of the first derivative:

    ∂²p/∂x² ≈ ( ∂p/∂x(x + Δx) − ∂p/∂x(x) ) / Δx

To compute the Laplacian at P[4] on a 1D grid P[0] ... P[7], apply the stencil (1, −2, 1)/Δx².

First derivatives on both sides:
    ∂p/∂x ≈ (P[4] − P[3]) / Δx
    ∂p/∂x ≈ (P[5] − P[4]) / Δx

Derivative of the first derivatives:
    ( (P[5] − P[4])/Δx − (P[4] − P[3])/Δx ) / Δx = 1/Δx² · ( P[3] − 2·P[4] + P[5] )
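
As a quick illustration (a sketch, not from the talk), the same stencil applied across the whole grid, with periodic wrap-around matching the matrix on the next slide, looks like this in plain C:

    /* Discrete 1D Laplacian: out[i] = (p[i-1] - 2*p[i] + p[i+1]) / dx^2,
       with periodic wrap-around at both ends. */
    void laplacian_1d(const float* p, float* out, int n, float dx)
    {
        for (int i = 0; i < n; i++) {
            float left  = p[(i - 1 + n) % n];
            float right = p[(i + 1) % n];
            out[i] = (left - 2.0f*p[i] + right) / (dx*dx);
        }
    }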

Poisson Matrix
The Poisson equation is a sparse linear system (here 1D with periodic boundaries):

    [ -2  1                    1 ] [ P[0] ]
    [  1 -2  1                   ] [ P[1] ]
    [     1 -2  1                ] [ P[2] ]
    [        1 -2  1             ] [ P[3] ]
    [           1 -2  1          ] [ P[4] ]  =  Δx² · RHS
    [              1 -2  1       ] [ P[5] ]
    [                 1 -2  1    ] [ P[6] ]
    [  1                    1 -2 ] [ P[7] ]

Approach 1: Iterative Solver
Solve M·p = r, where M and r are known. The error is easy to estimate via the residual E = M·p − r.

Basic iterative scheme:
    Start with a guess for p, call it p
    Until ||M·p − r|| < tolerance:
        p <= Update(p, M, r)
    Return p
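
A minimal self-contained sketch of this loop in C for the 1D periodic Poisson matrix (the function names and the choice of infinity norm are illustrative, not from the talk):

    #include <math.h>

    /* Residual norm ||M*p - r|| (infinity norm) for the 1D periodic Poisson
       matrix, where (M*p)[j] = (p[j-1] - 2*p[j] + p[j+1]) / (h*h). */
    float residual_norm(const float* p, const float* r, int n, float h)
    {
        float worst = 0.0f;
        for (int j = 0; j < n; j++) {
            float Mp = (p[(j-1+n)%n] - 2.0f*p[j] + p[(j+1)%n]) / (h*h);
            worst = fmaxf(worst, fabsf(Mp - r[j]));
        }
        return worst;
    }

    /* Basic iterative scheme: refine p until the residual is below tolerance.
       `update` is any relaxation sweep (e.g. the Gauss-Seidel sweep below). */
    void iterative_solve(float* p, const float* r, int n, float h,
                         float tol, int max_iters,
                         void (*update)(float*, const float*, int, float))
    {
        for (int it = 0; it < max_iters; it++) {
            if (residual_norm(p, r, n, h) < tol) break;  /* converged */
            update(p, r, n, h);                          /* p <= Update(p, M, r) */
        }
    }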

Serial Gauss-Seidel Relaxation
Loop until convergence:
    For each equation j = 1 to n:
        Solve for P[j]

E.g. the equation for P[1]:
    P[0] − 2·P[1] + P[2] = h·h·rhs[1]
Rearranging terms:
    P[1] = ( P[0] + P[2] − h·h·rhs[1] ) / 2
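
One sweep of this serial relaxation might look like the following in C (a sketch; periodic boundaries assumed, as in the matrix above):

    /* One serial Gauss-Seidel sweep for the 1D periodic Poisson problem:
       each P[j] is solved in place using its neighbors' most recent values. */
    void gauss_seidel_sweep(float* P, const float* rhs, int n, float h)
    {
        for (int j = 0; j < n; j++) {
            float left  = P[(j - 1 + n) % n];
            float right = P[(j + 1) % n];
            P[j] = 0.5f * (left + right - h*h*rhs[j]);
        }
    }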

One Pass of the Serial Algorithm (each update uses the most recent neighbor values):
    P[0] = ( P[7] + P[1] − h·h·rhs[0] ) / 2
    P[1] = ( P[2] + P[0] − h·h·rhs[1] ) / 2
    P[2] = ( P[1] + P[3] − h·h·rhs[2] ) / 2
    ...
    P[7] = ( P[6] + P[0] − h·h·rhs[7] ) / 2

Red-Black Gauss-Seidel Relaxation
- Can choose any order in which to update the equations
- The convergence rate may change, but convergence is still guaranteed
- Red-black ordering over the grid P[0] P[1] P[2] P[3] P[4] P[5] P[6] P[7]:
  the red (odd) equations are independent of each other, and the black (even) equations are independent of each other

Parallel Gauss-Seidel Relaxation
Loop n times (until convergence):
    For each even equation j = 0, 2, ..., n−2: solve for P[j]
    For each odd equation j = 1, 3, ..., n−1: solve for P[j]

The for loops are parallel: perfect for a CUDA kernel.

One Pass of the Parallel Algorithm
Even equations (updated in parallel):
    P[0] = ( P[7] + P[1] − h·h·rhs[0] ) / 2
    P[2] = ( P[1] + P[3] − h·h·rhs[2] ) / 2
    P[4] = ( P[3] + P[5] − h·h·rhs[4] ) / 2
    P[6] = ( P[5] + P[7] − h·h·rhs[6] ) / 2
Then odd equations (updated in parallel):
    P[1] = ( P[0] + P[2] − h·h·rhs[1] ) / 2
    P[3] = ( P[2] + P[4] − h·h·rhs[3] ) / 2
    P[5] = ( P[4] + P[6] − h·h·rhs[5] ) / 2
    P[7] = ( P[6] + P[0] − h·h·rhs[7] ) / 2

CUDA Pseudo-code

    __global__ void RedBlackGaussSeidel(Grid P, Grid RHS, float h, int red_black)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        i *= 2;
        if (j % 2 != red_black) i++;   // stagger odd rows so each pass hits one color
        int idx = j*RHS.jstride + i*RHS.istride;
        // 2D 5-point update (the slide's 1/6 factor is the 3D 7-point variant)
        P.buf[idx] = 1.0f/4.0f * ( -h*h*RHS.buf[idx]
                   + P.buf[idx + P.istride] + P.buf[idx - P.istride]
                   + P.buf[idx + P.jstride] + P.buf[idx - P.jstride] );
    }

    // on host:
    for (int i = 0; i < 100; i++) {
        RedBlackGaussSeidel<<<Dg, Db>>>(P, RHS, h, 0);
        RedBlackGaussSeidel<<<Dg, Db>>>(P, RHS, h, 1);
    }

Optimizing the Poisson Solver
- The red-black scheme is bad for coalescing: reading every other grid cell => half the memory bandwidth
- Lots of reuse between adjacent threads
- The texture cache (the L1 cache on Fermi) improves performance by 2x: lots of immediate reuse, very small working set
- In my tests, it (barely) beats software-managed shared memory
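
One possible way to exploit that cache on later GPUs (compute capability 3.5+) is to route the neighbor loads through the read-only data cache with __ldg. This is a hedged sketch, not the implementation from the talk; it reuses the Grid struct assumed by the pseudo-code above:

    __global__ void RedBlackGaussSeidelLDG(Grid P, Grid RHS, float h, int red_black)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        i *= 2;
        if (j % 2 != red_black) i++;
        int idx = j*RHS.jstride + i*RHS.istride;
        // Neighbor cells are the opposite color, so no thread writes them in
        // this launch; reading them through __ldg is therefore safe here.
        P.buf[idx] = 0.25f * ( -h*h*__ldg(&RHS.buf[idx])
                   + __ldg(&P.buf[idx + P.istride]) + __ldg(&P.buf[idx - P.istride])
                   + __ldg(&P.buf[idx + P.jstride]) + __ldg(&P.buf[idx - P.jstride]) );
    }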

Generalizing to Non-Grids
- What about discretization over non-Cartesian grids? Finite element, finite volume, etc.
- Need a discrete version of the differential operator (the Laplacian) adapted to this geometry (figure: an unstructured stencil with example weights 1.2, 1.2, 1.6, and −4)
- Works out to the same thing: L·p = r, where L is the matrix of Laplacian discretizations

Graph Coloring
- Need to partition the unknowns into non-adjacent sets: the classic graph-coloring problem
- Red-black is the special case where 2 colors suffice
- Too many colors => not enough parallelism within each color
- Not enough colors => a hard coloring problem
- Until convergence: update the green terms in parallel, then the orange terms in parallel, then the blue terms in parallel (a sketch of this multi-color sweep follows below)
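
A sketch of the resulting multi-color sweep, assuming the matrix L is stored in CSR form and the unknowns of each color have been gathered into an index list (the kernel and array names are illustrative, not from the talk):

    // One multi-color Gauss-Seidel pass over an unstructured Laplacian in CSR
    // form. color_idx lists the unknowns of a single color; they are mutually
    // independent, so the kernel may update them in parallel.
    __global__ void ColorSweep(const int* row_ptr, const int* cols, const float* vals,
                               const float* r, float* p,
                               const int* color_idx, int color_count)
    {
        int t = blockIdx.x*blockDim.x + threadIdx.x;
        if (t >= color_count) return;
        int row = color_idx[t];
        float diag = 0.0f, sum = r[row];
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; k++) {
            if (cols[k] == row) diag = vals[k];               // diagonal entry
            else                sum -= vals[k] * p[cols[k]];  // off-diagonal neighbors
        }
        p[row] = sum / diag;
    }

    // on host, one outer iteration launches one kernel per color:
    // for (int c = 0; c < num_colors; c++)
    //     ColorSweep<<<nblocks(c), 256>>>(row_ptr, cols, vals, r, p,
    //                                     color_idx + color_start[c],
    //                                     color_count[c]);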

Back to the Poisson Matrix
The 1D Poisson matrix has a particular sparse structure: 3 non-zeros per row, clustered around the diagonal (as in the matrix shown earlier: −2 on the diagonal, 1 on the sub- and super-diagonals, and wrap-around 1s in the corners).

Approach 2: Direct Solver
- Solve the matrix equation directly
- Exploit the sparsity pattern: all zeroes except the diagonal, 1 above, and 1 below => a tridiagonal matrix
- Many applications for tridiagonal matrices:
  - Vertical diffusion (adjacent columns do not interact)
  - ADI methods (e.g. a separable 2D blur: x blur, then y blur)
  - Linear solvers (multigrid, preconditioners, etc.)
- Typically many small tridiagonal systems => a per-CTA algorithm

What is a tridiagonal system?

A Classic Sequential Algorithm
Gaussian elimination in the tridiagonal case (the Thomas algorithm):
- Phase 1: forward elimination
- Phase 2: backward substitution
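
A standard serial formulation in C (a sketch; assumes no pivoting is needed, which holds for diagonally dominant matrices like the Poisson matrix):

    /* Thomas algorithm: solve a tridiagonal system with lower diagonal a,
       diagonal b, upper diagonal c, and right-hand side r (all length n).
       c and r are overwritten as scratch; the solution is written to x. */
    void thomas_solve(const float* a, const float* b, float* c, float* r,
                      float* x, int n)
    {
        /* Phase 1: forward elimination */
        c[0] /= b[0];
        r[0] /= b[0];
        for (int i = 1; i < n; i++) {
            float m = 1.0f / (b[i] - a[i]*c[i-1]);
            c[i] = c[i] * m;
            r[i] = (r[i] - a[i]*r[i-1]) * m;
        }
        /* Phase 2: backward substitution */
        x[n-1] = r[n-1];
        for (int i = n-2; i >= 0; i--)
            x[i] = r[i] - c[i]*x[i+1];
    }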

Cyclic Reduction: Parallel Algorithm
Basic linear algebra: take any row, multiply it by a scalar, and add it to another row => the solution is unchanged.

    [ B1 C1                   A1 ] [ X1 ]   [ R1 ]
    [ A2 B2 C2                   ] [ X2 ]   [ R2 ]
    [    A3 B3 C3                ] [ X3 ]   [ R3 ]
    [       A4 B4 C4             ] [ X4 ] = [ R4 ]
    [          A5 B5 C5          ] [ X5 ]   [ R5 ]
    [             A6 B6 C6       ] [ X6 ]   [ R6 ]
    [                A7 B7 C7    ] [ X7 ]   [ R7 ]
    [ C8                   A8 B8 ] [ X8 ]   [ R8 ]

Scale Equations 3 and 5
Multiply equation 3 by −A4/B3 and equation 5 by −C4/B5, so that their diagonal entries become −A4 and −C4.

Add Scaled Equations 3 & 5 to Equation 4
The scaled diagonal entries −A4 and −C4 cancel equation 4's couplings A4·X3 and C4·X5, while introducing new couplings to X2 and X6.

Zeroes Entries (4,3) and (4,5)
Equation 4 now couples only X2, X4, and X6, with new coefficients
    A'4 = −A3·(A4/B3)
    B'4 = B4 − C3·(A4/B3) − A5·(C4/B5)
    C'4 = −C5·(C4/B5)
and a correspondingly updated right-hand side R'4 = R4 − R3·(A4/B3) − R5·(C4/B5).

Repeat the Operation for All Equations
After every equation eliminates its immediate neighbors, each equation couples only unknowns two apart: the odd-indexed and even-indexed unknowns no longer interact.

Permute into 2 Independent Blocks
Reordering the equations and unknowns (odd-numbered first, then even-numbered) splits the matrix into two independent tridiagonal blocks of half the size, each with the same periodic structure.

Cyclic Reduction Ingredients
- Apply this transformation (pivot + permute): split an (n x n) system into 2 independent (n/2 x n/2) systems
- Proceed recursively; two approaches:
  1. Recursively reduce both submatrices until n 1x1 matrices are obtained, then solve the resulting diagonal matrix.
  2. Recursively reduce only the odd submatrix until a single 1x1 system remains; solve it, then reverse the process via back-substitution.
(A serial sketch of one reduction level follows below.)
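
For reference, here is one forward-reduction level written out directly from the row operations above, as serial C. It is a sketch under stated assumptions: array names are illustrative, and a non-periodic system with a[0] = c[n-1] = 0 is assumed (unlike the periodic example):

    /* One forward-reduction level of cyclic reduction with stride s.
       Each reduced equation absorbs its neighbors at i-s and i+s, so after
       the pass the reduced equations form an independent system of half the
       size. Call with s = 1, 2, 4, ... to halve the system repeatedly. */
    void cr_reduce_level(float* a, float* b, float* c, float* r, int n, int s)
    {
        for (int i = 2*s - 1; i < n; i += 2*s) {
            float k1 = a[i] / b[i - s];       /* scale factor for equation i-s */
            b[i] -= c[i - s] * k1;
            r[i] -= r[i - s] * k1;
            a[i]  = -a[i - s] * k1;           /* new coupling two strides left */
            if (i + s < n) {
                float k2 = c[i] / b[i + s];   /* scale factor for equation i+s */
                b[i] -= a[i + s] * k2;
                r[i] -= r[i + s] * k2;
                c[i]  = -c[i + s] * k2;       /* new coupling two strides right */
            } else {
                c[i] = 0.0f;
            }
        }
    }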

Cyclic Reduction (CR)
Forward reduction, with fewer and fewer threads working: 8-unknown system → 4-unknown system → 2-unknown system. Solve the 2 unknowns. Backward substitution: solve the remaining 2 unknowns, then the remaining 4 unknowns.

2·log2(8) − 1 = 2·3 − 1 = 5 steps

Parallel Cyclic Reduction (PCR)
All 4 threads working at every step. Forward reduction only: one 8-unknown system → two 4-unknown systems → four 2-unknown systems → solve all unknowns.

log2(8) = 3 steps

Work vs. Step Efficiency
- CR does O(n) work but requires 2·log2(n) steps
- PCR does O(n·log n) work but requires only log2(n) steps
- The smallest granularity of work is 32 threads (one warp): performing fewer than 32 math ops costs the same as 32 math ops
- Here's an idea: save work while more than 32 threads are active (CR); save steps once fewer than 32 threads are active (PCR)

Hybrid Algorithm
Start with CR, switch to PCR once the system is small, then switch back to CR for back-substitution:
- The system size is reduced at the beginning
- No idle processors
- Fewer algorithmic steps
- Even more beneficial because of bank conflicts and control overhead

PCR vs Hybrid
The cross-over point trades off computation, memory access, and control: the earlier you switch from CR to PCR, the fewer bank conflicts and the fewer algorithmic steps, but the more work.

Hybrid Solver Optimal Cross-over
(Figure: solver time in milliseconds, 0 to 1.2, vs. intermediate system size from 2 to 512, solving 512 systems of 512 unknowns. The hybrid curve dips below both pure CR and pure PCR at an optimal intermediate system size.)

Results: Tridiagonal Linear Solver
Solving 512 systems of 512 unknowns (times in milliseconds):

    CR       1.07
    PCR      0.53
    Hybrid   0.42   (2.5x faster than CR, 1.3x faster than PCR, ~12x faster than the CPU solvers)
    PCI-E    4.08   (CPU-GPU data transfer alone)
    MT GE    5.24   (multi-threaded CPU Gaussian elimination)
    GE       9.30   (CPU Gaussian elimination)
    GEP     11.8    (CPU Gaussian elimination with pivoting, from LAPACK)

From Zhang et al., "Fast Tridiagonal Solvers on the GPU", PPoPP 2010.

Summary
- Linear PDE => linear solver (e.g. the Poisson equation)
- 2 basic approaches: iterative vs. direct
- Parallel iterative solver (Red-Black Gauss-Seidel): design the update procedure so multiple terms can be updated in parallel
- Parallel direct solver (Cyclic Reduction): exploit the structure of the matrix to solve using parallel operations
- General-purpose solvers are largely mythical; most people use special-purpose solvers => lots of good research potential mining this field