Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics)

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012

Introduction

Linear systems of equations are ubiquitous in engineering, and their solution is often the bottleneck for entire applications. NxN systems Ax=b arising from physics have special properties:
- Almost always: sparse, with O(N) nonzero entries
- Very often: symmetric
- Often enough: positive definite, diagonally dominant, etc.
Dense-matrix complexity ~O(N^2.38) does not apply. For special matrices, solvers with cost as low as O(N) exist; the hidden constant becomes important, and can be lowered via parallelism. Efficient algorithms need to be very frugal with storage.
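To make "sparse, with O(N) nonzero entries" concrete, here is a minimal sketch of sparse matrix-vector multiplication in compressed sparse row (CSR) storage; the struct layout and function names are illustrative choices, not code from the lecture.

```c
#include <stddef.h>

/* Compressed Sparse Row (CSR) storage: for an NxN matrix with O(N)
 * nonzeros we keep one value and one column index per nonzero entry,
 * plus N+1 row pointers -- O(N) memory instead of the O(N^2) needed
 * for a dense matrix. */
typedef struct {
    size_t n;          /* number of rows/columns       */
    const size_t *ptr; /* row pointers, length n+1     */
    const size_t *col; /* column index of each nonzero */
    const double *val; /* value of each nonzero        */
} csr_matrix;

/* y = A*x: each row touches only its few nonzeros, so the whole product
 * costs O(N) work and O(N) memory traffic for a matrix with O(N) entries. */
void csr_spmv(const csr_matrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->n; ++i) {
        double sum = 0.0;
        for (size_t k = A->ptr[i]; k < A->ptr[i + 1]; ++k)
            sum += A->val[k] * x[A->col[k]];
        y[i] = sum;
    }
}
```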

Introduction

Example: the 3D Poisson equation. The discretized system Ax = b has the 7-point Laplacian matrix: 6 on the diagonal and -1 in the position of each of a cell's six neighbors, i.e. at most seven nonzero entries per row.
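Since the case study discussed later avoids storing A altogether, here is a minimal matrix-free sketch of applying this 7-point stencil on an nx-by-ny-by-nz grid; the zero treatment of values outside the grid, the OpenMP pragma and the names are illustrative assumptions rather than the lecture's implementation.

```c
#include <stddef.h>

/* Matrix-free application of the 3D Poisson operator (7-point stencil):
 * y = L*x on an nx*ny*nz grid, treating values outside the grid as zero.
 * The matrix is never formed; each output entry is computed directly
 * from its six neighbors. */
void laplacian_apply(size_t nx, size_t ny, size_t nz,
                     const double *x, double *y)
{
    #define IDX(i, j, k) (((i) * ny + (j)) * nz + (k))
    #pragma omp parallel for collapse(2)
    for (size_t i = 0; i < nx; ++i)
        for (size_t j = 0; j < ny; ++j)
            for (size_t k = 0; k < nz; ++k) {
                double v = 6.0 * x[IDX(i, j, k)];
                if (i > 0)      v -= x[IDX(i - 1, j, k)];
                if (i + 1 < nx) v -= x[IDX(i + 1, j, k)];
                if (j > 0)      v -= x[IDX(i, j - 1, k)];
                if (j + 1 < ny) v -= x[IDX(i, j + 1, k)];
                if (k > 0)      v -= x[IDX(i, j, k - 1)];
                if (k + 1 < nz) v -= x[IDX(i, j, k + 1)];
                y[IDX(i, j, k)] = v;
            }
    #undef IDX
}
```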

Applications Rasmussen et al, Smoke Simulation for Large Scale Phenomena (SIGGRAPH 2003)

Applications Nielsen et al, Out-Of-Core and Compressed Level Set Methods (ACM TOG 2007)

Applications Horvath & Geiger, Directable, high-resolution simulation of fire on the GPU (ACM SIGGRAPH 2009)

Applications

Introduction 700M equations, 40 sec on 16 cores, 16GB RAM required

Introduction 450M equations 25 sec on 16 cores, 12GB RAM required

Introduction 135M equations 12 sec on 16 cores, 3GB RAM required

Introduction

How good (or not) are these runtimes? For one of these problems, the system Ax=b has 700M equations and 700M unknowns:
- Storing x (in float precision) requires 2.8GB; the same holds for b.
- The matrix A has 7 nonzero entries per equation; MATLAB requires ~40GB to store it in its sparse format.
Our case study computes the solution with ~40 sec of runtime (on a 16-core SMP), while using <16GB of RAM (caveat: it avoids storing A).

Development Plan

Design
- Define your objectives
- Choose a parallel-friendly theoretical formulation
- Set performance expectations
- Choose a promising algorithm
Implement
- Implement a prototype
- Organize code into reusable kernels
Accelerate
- Reorder/combine/pipeline operations
- Reduce resource utilization (try harder...)
- Parallelize component kernels

Clarifying objectives

What kind of accuracy do we need?
- Solve Ax=b down to machine precision?
- Ensure that x is correct to k significant digits?
- Ensure that x (an initial guess) is improved by k significant digits?
What is the real underlying problem we care about? The system Ax=b is rarely the ultimate objective; typically, it's a means to an end:
- Solve the system to create a simulation
- Solve the system to generate a solution to a physical law
We have some flexibility to make Ax=b better suited for parallel algorithmic solution.

Clarifying objectives

The system Ax = b, where each equation discretizes the Poisson operator on a grid:

(-4·u_{i,j} + u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1}) / h² = f_{i,j}

Development Plan (recap): Design → Implement → Accelerate

Speedup vs. Efficiency

Well-intended evaluation practices... "My serial implementation of algorithm X on machine Y ran in Z seconds. When I parallelized my code, I got a speedup of 15x on 16 cores..."
... are sometimes abused like this: "... when I ported my implementation to CUDA, this numerical solver ran 200 times faster than my original MATLAB code..."
So, what is wrong with that premise?

Speedup vs. Efficiency

Watch for warning signs: speedup across platforms grossly exceeding specification ratios, e.g. NVIDIA GTX580 vs. Intel i7-2600. Relative (peak) specifications:
- The GPU has about 3x higher (peak) compute capacity
- The GPU has about 8x higher (peak) memory bandwidth
Significantly higher speedups likely indicate:
- Different implementations on the 2 platforms
- Baseline code that was not optimal/parallel enough
Standard parallelization yielding linear speedups on many cores can mean:
- [Reasonable scenario] The implementation is CPU-bound
- [Problematic scenario] The implementation is CPU-wasteful

Speedup vs. Efficiency

A different perspective: "... after optimizing my code, the runtime is about 5x slower than the best possible performance that I could expect from this machine..." ... i.e. 20% of maximum theoretical efficiency!
Challenge: How can we tell how fast the best implementation could have been? (without implementing it...)
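One common way to approach this challenge is to bound the runtime from below by the memory traffic and arithmetic the computation cannot avoid. The sketch below does this for one residual evaluation of the 700M-unknown problem; the peak bandwidth and compute numbers, and all names, are placeholder assumptions rather than measurements from the lecture.

```c
#include <stdio.h>

/* Roofline-style lower bound: a computation can be no faster than the
 * time needed to move its mandatory data through memory, nor faster
 * than the time needed to execute its mandatory arithmetic. */
double runtime_lower_bound(double bytes_moved, double flops,
                           double peak_bytes_per_s, double peak_flops_per_s)
{
    double t_mem  = bytes_moved / peak_bytes_per_s;
    double t_comp = flops / peak_flops_per_s;
    return t_mem > t_comp ? t_mem : t_comp;
}

int main(void)
{
    double n     = 700e6;            /* unknowns                               */
    double bytes = 3.0 * n * 4.0;    /* e.g. stream x and b in, r out, in float */
    double flops = 13.0 * n;         /* ~7 multiplies + 6 adds per equation     */
    /* 40 GB/s and 100 GFLOP/s are illustrative machine peaks. */
    double t = runtime_lower_bound(bytes, flops, 40e9, 100e9);
    printf("one residual evaluation needs at least ~%.2f s\n", t);
    return 0;
}
```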

Performance bounds and textbook efficiency

Example: solving the quadratic equation ax² + bx + c = 0. What is the minimum amount of time needed to solve this?
- Data access cost bound: we cannot solve this faster than the time needed to read a, b, c and write x.
- Solution verification bound: we cannot solve this faster than the time needed to evaluate the polynomial for given values of a, b, c and x (i.e. 2 ADDs, 2 MULTs, plus data access).
- Equivalent operation bound: we cannot solve this faster than the time it takes to compute a square root.

Performance bounds and textbook efficiency

What about linear systems of equations Ax = b?
Textbook efficiency (for elliptic systems): it is theoretically possible to compute the solution to a linear system (with certain properties) at a cost comparable to 10x the cost of verifying that a given value x is an actual solution... or, equivalently, at a cost comparable to 10x the cost of computing the expression r = b - Ax and verifying that r = 0 (i.e. slightly over 10x the cost of a matrix-vector multiplication).
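To make the cost of that verification concrete, here is a minimal sketch of the r = b - Ax check written against a user-supplied operator, so it works with a stored sparse matrix or a matrix-free stencil alike; the callback interface and names are illustrative assumptions.

```c
#include <math.h>
#include <stddef.h>

typedef void (*apply_fn)(const double *x, double *y, void *ctx);

/* Verification cost: one application of A, N subtractions and one norm
 * reduction.  "Textbook efficiency" means a full solve should cost only
 * about 10x this much. */
double residual_max_norm(apply_fn apply_A, void *ctx,
                         const double *x, const double *b,
                         double *scratch, size_t n)
{
    apply_A(x, scratch, ctx);              /* scratch = A*x  */
    double nrm = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double r = b[i] - scratch[i];      /* r = b - A*x    */
        if (fabs(r) > nrm) nrm = fabs(r);
    }
    return nrm;                            /* ||b - A*x||_inf */
}
```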

Development Plan (recap): Design → Implement → Accelerate

Sample 2D domain (512x512 resolution)

Exact solution

Conjugate Gradients (w/o preconditioning)

Conjugate Gradients (with a stock preconditioner)

Multigrid Conjugate Gradients (with parallel multigrid preconditioner)

Preconditioned Conjugate Gradients (for the system Lx = f)

Performance
- Converges in O(Nd) with stock preconditioners
- Converges in O(N) with multigrid preconditioners
Prerequisites
- Requires a symmetric system matrix
- Matrix needs to be positive definite (other variants exist, too)
Benefits
- Low storage overhead
- Simple component kernels

1:  procedure MGPCG(r, x)
2:    r ← r − Lx,  µ ← mean(r),  ν ← ‖r − µ‖∞
3:    if (ν < ν_max) then return
4:    r ← r − µ,  p ← M⁻¹(r),  ρ ← pᵀr
5:    for k = 0 to k_max do
6:      z ← Lp,  σ ← pᵀz
7:      α ← ρ/σ
8:      r ← r − αz,  µ ← mean(r),  ν ← ‖r − µ‖∞
9:      if (ν < ν_max or k = k_max) then
10:       x ← x + αp
11:       return
12:     end if
13:     r ← r − µ,  z ← M⁻¹(r),  ρ_new ← zᵀr
14:     β ← ρ_new/ρ
15:     ρ ← ρ_new
16:     x ← x + αp,  p ← z + βp
17:   end for
18: end procedure
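As a concrete reference point, here is a minimal serial C sketch of the same loop, with L and M⁻¹ passed in as callbacks; the mean-subtraction steps are omitted for simplicity, and the identity placeholder merely stands in for the multigrid preconditioner. All names and the callback interface are illustrative, not the lecture's code.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

typedef void (*op_fn)(const double *x, double *y, size_t n, void *ctx);

/* Placeholder preconditioner: z = r.  In the case study a multigrid
 * V-cycle plays this role instead. */
void identity_precond(const double *r, double *z, size_t n, void *ctx)
{
    (void)ctx;
    memcpy(z, r, n * sizeof(double));
}

static double dot(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

static double max_abs(const double *a, size_t n)
{
    double m = 0.0;
    for (size_t i = 0; i < n; ++i) if (fabs(a[i]) > m) m = fabs(a[i]);
    return m;
}

/* Preconditioned CG for L*x = f; on entry r must hold f, on exit x holds
 * the (approximate) solution and r the final residual. */
void pcg(op_fn apply_L, op_fn apply_Minv, void *ctx,
         double *x, double *r, size_t n, double nu_max, int k_max)
{
    double *p = malloc(n * sizeof(double));
    double *z = malloc(n * sizeof(double));
    double *t = malloc(n * sizeof(double));

    apply_L(x, t, n, ctx);                            /* r = f - L*x  */
    for (size_t i = 0; i < n; ++i) r[i] -= t[i];

    if (max_abs(r, n) >= nu_max) {
        apply_Minv(r, p, n, ctx);                     /* p = M^{-1}*r */
        double rho = dot(p, r, n);

        for (int k = 0; k <= k_max; ++k) {
            apply_L(p, z, n, ctx);                    /* z = L*p      */
            double alpha = rho / dot(p, z, n);
            for (size_t i = 0; i < n; ++i) r[i] -= alpha * z[i];
            for (size_t i = 0; i < n; ++i) x[i] += alpha * p[i];
            if (max_abs(r, n) < nu_max || k == k_max) break;

            apply_Minv(r, z, n, ctx);                 /* z = M^{-1}*r */
            double rho_new = dot(z, r, n);
            double beta = rho_new / rho;
            rho = rho_new;
            for (size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        }
    }
    free(p); free(z); free(t);
}
```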

Development Plan (recap): Design → Implement → Accelerate

Preconditioned Conjugate Gradients (for the system Lx = f)

Kernels: Multiply(), Saxpy(), Subtract(), Copy(), Inner_Product(), Norm()

(These slides step through the MGPCG pseudocode above, mapping each of its lines onto these reusable kernels.)
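Below is a minimal sketch of what the vector kernels might look like as OpenMP parallel loops; the exact signatures, the max-norm convention for Norm() and the OpenMP parallelization are illustrative assumptions (Multiply() would be the stencil application shown earlier).

```c
#include <math.h>
#include <stddef.h>

/* Reusable vector kernels of the PCG solver.  Each one is a single pass
 * over the data, so all of them are memory-bandwidth bound. */

void saxpy(size_t n, double a, const double *x, double *y)  /* y += a*x */
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) y[i] += a * x[i];
}

void subtract(size_t n, const double *x, double *y)         /* y -= x */
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) y[i] -= x[i];
}

void copy(size_t n, const double *x, double *y)             /* y = x */
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) y[i] = x[i];
}

double inner_product(size_t n, const double *x, const double *y)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+ : s)
    for (size_t i = 0; i < n; ++i) s += x[i] * y[i];
    return s;
}

double norm(size_t n, const double *x)                      /* max norm */
{
    double m = 0.0;
    #pragma omp parallel for reduction(max : m)
    for (size_t i = 0; i < n; ++i) m = fmax(m, fabs(x[i]));
    return m;
}
```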

Development Plan (recap): Design → Implement → Accelerate

Preconditioned Conjugate Gradients

(These slides revisit the MGPCG pseudocode above segment by segment, looking for operations that can be reordered, combined or pipelined.)
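One typical way to combine operations (an illustrative sketch, not necessarily the exact fusion used in the case study) is to merge the x and r updates of lines 8, 10 and 16 with the reductions that follow them into a single sweep over the data, instead of running several separate memory-bound kernels:

```c
#include <math.h>
#include <stddef.h>

/* Fused update kernel: in one pass over the data it
 *   - applies x <- x + alpha*p,
 *   - applies r <- r - alpha*z,
 *   - accumulates sum(r) (from which the mean mu is obtained) and
 *     max|r[i]| (used for the convergence test in the sketch above),
 * saving several extra trips through memory. */
void fused_update(size_t n, double alpha,
                  const double *p, const double *z,
                  double *x, double *r,
                  double *sum_r, double *max_abs_r)
{
    double s = 0.0, m = 0.0;
    #pragma omp parallel for reduction(+ : s) reduction(max : m)
    for (size_t i = 0; i < n; ++i) {
        x[i] += alpha * p[i];
        double ri = r[i] - alpha * z[i];
        r[i] = ri;
        s += ri;
        m = fmax(m, fabs(ri));
    }
    *sum_r = s;      /* mu = *sum_r / n */
    *max_abs_r = m;
}
```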

Development Plan (recap): Design → Implement → Accelerate

Results and performance

Results and performance

Most kernels reach 60-70% of maximum theoretical efficiency (based on architectural limitations). Preconditioning overhead: ~60% of iteration cost.

Results and performance

Parallel programming practices for the solution of Sparse Linear Systems (motivated by computational physics and graphics) Eftychios Sifakis CS758 Guest Lecture - 19 Sept 2012