SPARSE SOLVERS FOR THE POISSON EQUATION
Margreet Nool
CWI, Multiscale Dynamics
November 9, 2015

OUTLINE OF THIS TALK
1. FISHPACK, LAPACK, PARDISO
2. SYSTEM OVERVIEW OF CARTESIUS
3. POISSON EQUATION
4. SOLVERS
5. EXAMPLE OF USE
6. CONCLUSIONS AND REMARKS

2D POISSON PROBLEM
[Figure: 2D Poisson solve times on Cartesius versus problem size n_x = n_y; curves for PARDISO (1, 12 and 24 threads), FISHPACK and LAPACK (MKL).]
Results on 1 node of Cartesius with 24 cores
LAPACK: fastest implementation on Cartesius
PARDISO: shared-memory parallel direct sparse solver by Olaf Schenk ['00-'04], optimized for Intel® processors

2D POISSON PROBLEM
[Figure: 2D Poisson problem accuracy; residual versus n_x = n_y for PARDISO, FISHPACK and LAPACK.]
LAPACK: maximum problem size n_x = n_y = 1300
FISHPACK: convergence up to problem size n_x = n_y = 1400
PARDISO: maximum problem size n_x = n_y = 5600


CLUSTER MACHINE
Cartesius, the Dutch supercomputer at SURFsara, is a cluster machine.

  Node type   Number   Cores   CPU          Clock     Memory
  thin        1080     24      E5-2690 v3   2.6 GHz   64 GB
  thin        540      24      E5-2695 v2   2.4 GHz   64 GB
  fat         32       32      E5-4650      2.7 GHz   256 GB
  gpu         64       16      E5-2450 v2   2.5 GHz   96 GB

40,960 cores + 132 GPUs: 1.559 Pflop/s peak performance
117 TB memory (CPU + GPGPU)
Fat nodes have 4 times more memory than thin nodes, but are slower.

NODES AND CORES
A Cartesius node can have 24 or 32 cores.
Within a node: shared memory; over nodes: distributed memory.
Nodes can be configured in different ways.
[Figure: an 8-core node configured as (a) 8 MPI processes, (b) 8 OpenMP threads, (c) 4 MPI processes.]
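As a toy illustration of the difference between these configurations (a sketch assuming the mpi4py package, which is not part of the talk), each MPI rank can report itself and the OpenMP thread count it would hand to a threaded library such as MKL:

```python
# Sketch, assuming mpi4py; launch e.g. with: mpirun -np 4 python probe.py
from mpi4py import MPI
import os

comm = MPI.COMM_WORLD
threads = os.environ.get("OMP_NUM_THREADS", "unset")  # per-rank OpenMP threads
print(f"MPI rank {comm.Get_rank()} of {comm.Get_size()} "
      f"on {MPI.Get_processor_name()}, OMP_NUM_THREADS={threads}")
```

Configuration (a) corresponds to 8 ranks with 1 thread each, (b) to 1 rank with 8 threads, and (c) to a hybrid in between.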

SOFTWARE: MKL LIBRARY
The Intel® Math Kernel Library is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, and vector math. The routines in MKL are hand-optimized specifically for Intel® processors.
Sparse solvers:
MKL PARDISO: Parallel Direct Sparse Solver interface
Parallel Direct Sparse Solver for Clusters interface
Direct Sparse Solver (DSS) interface routines
Iterative sparse solvers (based on the Reverse Communication Interface)

SOFTWARE
Intel® Poisson solvers for a single node:
Two-dimensional Helmholtz problem on a Cartesian plane
Two-dimensional Poisson problem on a Cartesian plane
Two-dimensional Laplace problem on a Cartesian plane
Helmholtz problem on a sphere
Poisson problem on a sphere
Three-dimensional Helmholtz problem
Three-dimensional Poisson problem
Three-dimensional Laplace problem


1D CELL CENTERED DIRICHLET BC
Hundsdorfer and Verwer: consider a cell-centered grid with nodes x_i = (i - 1/2)h, i = 1, ..., M, h = 1/M.
For Dirichlet BC we need, at x_0 = -h/2 and x_{M+1} = 1 + h/2, the virtual values u_0 and u_{M+1} such that
  (1/2)(u_0 + u_1) = γ_0,   (1/2)(u_M + u_{M+1}) = γ_M.
We obtain the following semi-discrete system:
  u'_1 = (-3 u_1 + u_2)/h² + (2/h²) γ_0,
  u'_i = (u_{i-1} - 2 u_i + u_{i+1})/h²,   2 ≤ i ≤ M-1,
  u'_M = (u_{M-1} - 3 u_M)/h² + (2/h²) γ_M.

1D CELL CENTERED DIRICHLET BC
The 1D Poisson matrix A of size M x M and the RHS vector b are defined by

  A = (1/h²) *
      [ -3   1                 ]
      [  1  -2   1             ]
      [       .   .   .        ]
      [           1  -2   1    ]
      [               1  -3    ]

  b = ( b_1 + (2/h²) γ_0,  b_2,  b_3,  ...,  b_{M-1},  b_M + (2/h²) γ_M )^T

Note: the Poisson matrix is symmetric negative definite (so -A is symmetric positive definite)
Note: the boundary values enter as corrections to the RHS vector b
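As an illustration, a minimal sketch in Python with NumPy/SciPy (not the Fortran codes used in the talk; poisson_1d is a hypothetical helper name) that builds A and applies the boundary corrections to b exactly as above:

```python
import numpy as np
import scipy.sparse as sp

def poisson_1d(M, gamma0, gammaM, b):
    """1D cell-centered Poisson matrix (Dirichlet BC) and corrected RHS."""
    h = 1.0 / M
    main = np.full(M, -2.0)
    main[0] = main[-1] = -3.0            # boundary cells
    off = np.ones(M - 1)
    A = sp.diags([off, main, off], [-1, 0, 1], format="csr") / h**2
    b = b.copy()
    b[0]  += 2.0 * gamma0 / h**2         # correction from the left boundary
    b[-1] += 2.0 * gammaM / h**2         # correction from the right boundary
    return A, b
```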

2D AND 3D CELL CENTERED DIRICHLET BC
The 2D Poisson matrix A of size M² x M² for M = 4 is block tridiagonal:

  A = (1/h²) *
      [ T_1   I              ]
      [  I   T_2   I         ]
      [       I   T_2   I    ]
      [            I   T_1   ]

with I the 4 x 4 identity, T_1 tridiagonal with off-diagonal entries 1 and diagonal (-6, -5, -5, -6), and T_2 tridiagonal with off-diagonal entries 1 and diagonal (-5, -4, -4, -5): corner cells get diagonal -6, edge cells -5, inner cells -4.
For the 3D case we distinguish 3 diagonal patterns:
  [ -9  -8  ...  -8  -9 ]   for cells on edges
  [ -8  -7  ...  -7  -8 ]   for cells on surfaces
  [ -7  -6  ...  -6  -7 ]   for inner cells
supplemented with 3 sub-diagonals and 3 super-diagonals (entries 1).
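The same 1D stencil extends to 2D and 3D via Kronecker sums. A sketch (again SciPy; poisson_nd is a hypothetical name) that reproduces the diagonal patterns above, e.g. -6/-5/-4 in 2D and -9/-8/-7/-6 in 3D:

```python
import numpy as np
import scipy.sparse as sp

def laplace_1d(M):
    """1D cell-centered Laplacian times h^2: diag(-3,-2,...,-2,-3), off-diag 1."""
    main = np.full(M, -2.0)
    main[0] = main[-1] = -3.0
    return sp.diags([np.ones(M - 1), main, np.ones(M - 1)], [-1, 0, 1])

def poisson_nd(M, dim):
    """Kronecker-sum construction of the cell-centered Poisson matrix."""
    T, I = laplace_1d(M), sp.identity(M)
    if dim == 2:
        A = sp.kron(I, T) + sp.kron(T, I)
    else:  # dim == 3
        A = (sp.kron(sp.kron(I, I), T) + sp.kron(sp.kron(I, T), I)
             + sp.kron(sp.kron(T, I), I))
    return (A * M**2).tocsr()            # multiply by 1/h^2 with h = 1/M

# Diagonal of the 2D matrix for M = 4 (times h^2): -6,-5,-5,-6, -5,-4,-4,-5, ...
print(poisson_nd(4, 2).diagonal() / 16)
```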


POISSON SOLVER FOR LARGE 2D AND 3D SIMULATIONS
Poisson solvers considered:
PARDISO (MKL)
CLUSTER_SPARSE_SOLVER (MKL)
MUMPS release 5.0.1

ANALYSIS, FACTORIZATION, SOLVE
To solve A x = b we factorize A as A = L D L^T.
For PARDISO, CLUSTER_SPARSE_SOLVER and MUMPS alike we can distinguish three main phases:
analysis and reordering
factorization
solution
Note 1: each phase can be called independently (not so for FISHPACK)
Note 2: once the matrix has been factorized, we may restrict ourselves to the solution phase

ANALYSIS, FACTORIZATION, SOLVE
Analysis phase:
reordering of the matrix to reduce fill-in
choosing pivots, using a selection criterion to preserve sparsity
matrix input distributions:
  CRS for PARDISO and CLUSTER_SPARSE_SOLVER
  centralized assembled matrix format for MUMPS: matrix on the host only, or distributed over processes
if desired, an analysis report is made

ANALYSIS, FACTORIZATION, SOLVE
Factorization phase:
most time-consuming phase
most memory-consuming phase
if desired, a report about the factorization is made
pivot strategy required only once?

ANALYSIS, FACTORIZATION, SOLVE
Solution phase:
post-processing: iterative refinement
error analysis:
  compute r = A x - b, then require max_{i=1,...,M} |r_i| < 1.E-12
  let x_cont be the solution of the continuous problem; then
  residual: ||x - x_cont||_2   or   residual: max_{i=1,...,M} |x(i) - x_cont(i)|
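Neither PARDISO nor MUMPS is exposed by SciPy, but SciPy's SuperLU wrapper illustrates the same factorize-once, solve-many pattern together with the residual check above (a sketch reusing the hypothetical poisson_nd helper from the earlier sketch):

```python
import numpy as np
from scipy.sparse.linalg import splu

M = 64
A = poisson_nd(M, 2).tocsc()   # assemble once

lu = splu(A)                   # analysis, reordering and factorization in one call
                               # (separate, reusable phases in PARDISO/MUMPS)
for _ in range(3):             # many right-hand sides, one factorization
    b = np.random.rand(M * M)
    x = lu.solve(b)            # solution phase only
    r = A @ x - b              # residual r = Ax - b
    print(np.abs(r).max())     # direct solve: max |r_i| near machine precision
```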


2D POISSON PROBLEM
Solve ΔU(x, y) = (∂²/∂x² + ∂²/∂y²) U(x, y) = f(x, y), using a 4-pt centered 2nd-order difference scheme.

2D POISSON PROBLEM WITH KNOWN SOLUTION
  U(x, y) = exp(-C((x - x_0)² + (y - y_0)²)) + 1.0
  ΔU(x, y) = (-4C + 4C²((x - x_0)² + (y - y_0)²)) exp(-C((x - x_0)² + (y - y_0)²))
on a uniform grid defined on x ∈ [0, 1] and y ∈ [0, 1], with C ∈ {1, 10², 10⁴, 10⁶} and x_0 = y_0 = 0.5

2D POISSON PROBLEM
[Figure: surface plots of the exact solution U(x, y) for the four values of C, panels (d)-(g).]

2D POISSON PROBLEM
[Figure: 2D Poisson problem with PARDISO on 1 node, versus n_x = n_y for the four values of C: (h) accuracy (2-norm of the residual), (i) reordering time, (j) factorization time, (k) solution time.]
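To make the experiment concrete, here is a sketch of the 2D test with the manufactured solution, with SciPy's splu standing in for PARDISO, C fixed to one value, the hypothetical poisson_nd helper from before, and signs consistent with the matrix convention used in that sketch:

```python
import numpy as np
from scipy.sparse.linalg import splu

# Manufactured 2D solution from the slides (here with C = 10^2).
C, x0, y0 = 1.0e2, 0.5, 0.5

def U(x, y):                       # exact solution
    return np.exp(-C * ((x - x0)**2 + (y - y0)**2)) + 1.0

def f(x, y):                       # f = Laplacian of U
    r2 = (x - x0)**2 + (y - y0)**2
    return (-4.0*C + 4.0*C**2 * r2) * np.exp(-C * r2)

M = 400
h = 1.0 / M
xc = (np.arange(M) + 0.5) * h      # cell centres
X, Y = np.meshgrid(xc, xc, indexing="ij")

b = f(X, Y)
# Dirichlet data enters as corrections at the boundary cells:
b[0, :]  -= 2.0 * U(0.0, xc) / h**2
b[-1, :] -= 2.0 * U(1.0, xc) / h**2
b[:, 0]  -= 2.0 * U(xc, 0.0) / h**2
b[:, -1] -= 2.0 * U(xc, 1.0) / h**2

A = poisson_nd(M, 2).tocsc()       # hypothetical helper from the earlier sketch
u = splu(A).solve(b.ravel()).reshape(M, M)
print(np.abs(u - U(X, Y)).max())   # max error; drops by ~4x when M is doubled
```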

3D POISSON PROBLEM
Solve ΔU(x, y, z) = (∂²/∂x² + ∂²/∂y² + ∂²/∂z²) U(x, y, z) = f(x, y, z), using a 6-pt centered 2nd-order difference scheme.

3D POISSON PROBLEM WITH KNOWN SOLUTION
  U(x, y, z) = exp(-C((x - x_0)² + (y - y_0)² + (z - z_0)²)) + 1.0
  ΔU(x, y, z) = (-6C + 4C²((x - x_0)² + (y - y_0)² + (z - z_0)²)) exp(-C((x - x_0)² + (y - y_0)² + (z - z_0)²))
on a uniform grid defined on x ∈ [0, 1], y ∈ [0, 1] and z ∈ [0, 1], with C ∈ {1, 10², 10⁴, 10⁶} and x_0 = y_0 = z_0 = 0.5

3D POISSON PROBLEM CLUSTER_SPARSE_SOLVER
[Figure: 3D Poisson problem with CLUSTER_SPARSE_SOLVER on 12 cores: (l) accuracy (max-norm of the residual), (m) reordering time, (n) factorization time, (o) solution time.]

3D POISSON PROBLEM CLUSTER_SPARSE_SOLVER
[Figure: reordering (p, s), factorization (q, t) and solution (r, u) times of CLUSTER_SPARSE_SOLVER with 12 cores per node (upper row) and 24 cores per node (lower row).]

3D POISSON PROBLEM MUMPS
[Figure: reordering (a, d), factorization (b, e) and solution (c, f) times of MUMPS with 12 cores per node (upper row) and 24 cores per node (lower row).]

3D POISSON PROBLEM CLUSTER_SPARSE_SOLVER VERSUS MUMPS
[Figure: reordering (a, d), factorization (b, e) and solution (c, f) times of CLUSTER_SPARSE_SOLVER (upper row) versus MUMPS (lower row), with 24 cores per node.]

3D POISSON PROBLEM MUMPS
[Figure: speedup of the MUMPS (a) reordering, (b) factorization and (c) solution phases compared with 1 node, as a function of problem size.]

3D POISSON PROBLEM
Analysis report for 3D MUMPS on 64 nodes:

  n_x    N          NNZ        operations   host (MB)  avg (MB)  total (MB)
  64     262144     1036288    3.75 E+11    155        69        4465
  80     512000     2028800    1.49 E+12    217        165       10614
  96     884736     3511296    4.47 E+12    446        358       22925
  112    1404928    5582080    1.17 E+13    851        659       42189
  128    2097152    8339456    2.72 E+13    1644       1087      69627
  160    4096000    16307200   1.06 E+13    3298       2596      166191
  192    7077888    28200960   3.24 E+14    8711       5784      370210
  224    11239424   44807168   8.24 E+14    14399      10114     647308
  256    16777216   66912256   1.87 E+14    21989      16270     1041342
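The N and NNZ columns can be reproduced from the grid size alone: N = n_x³ unknowns, and, storing only one triangle of the symmetric matrix, the diagonal plus three off-diagonal bands of n²(n - 1) entries each. A quick check (hypothetical helper name):

```python
def poisson3d_stats(n):
    """N and NNZ of the 3D cell-centered Poisson matrix, one triangle stored:
    the diagonal (n^3 entries) plus three bands of n^2 * (n - 1) entries."""
    N = n ** 3
    nnz = N + 3 * n ** 2 * (n - 1)
    return N, nnz

for n in (64, 80, 96, 112, 128, 160, 192, 224, 256):
    print(n, *poisson3d_stats(n))   # reproduces the N and NNZ columns above
# e.g. n = 64 -> N = 262144, NNZ = 1036288
```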


CONCLUSIONS, REMARKS AND QUESTIONS
2D Poisson problems up to n_x = n_y = 5400 on a single node
2D Poisson problems up to n_x = n_y = 13000 on 32 nodes
3D Poisson problems up to n_x = n_y = n_z = 128 on a single node
3D Poisson problems up to n_x = n_y = n_z = 256 on 64 nodes
MUMPS is very suitable for cluster machines
CLUSTER_SPARSE_SOLVER can handle larger problems than MUMPS
the solution phase of CLUSTER_SPARSE_SOLVER is slower than that of MUMPS
use MKL software where possible, also for MUMPS
parallelization with MUMPS or CLUSTER_SPARSE_SOLVER is NOT difficult
forget about FISHPACK: it is no longer the fastest solver, and results obtained with FISHPACK are not reliable

CONCLUSIONS, REMARKS AND QUESTIONS
Is it possible to accelerate Anna's code?
Is the 3D approach suitable for Anna?
More questions?