Towards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters

Towards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters
HIM Workshop on Sparse Grids and Applications
Alexander Heinecke, Chair of Scientific Computing
May 18th 2011

Motivation
Sparse Grids have been successfully applied to the following fields: interpolation, data mining, solving PDEs, and many more.
Adaptivity! Performance? HPC with spatially adaptive Sparse Grids?

Motivation - We Need Adaptive Grids!
[Figure: surface plot of a function V over the coordinates S1 and S2.]

Motivation - We Need to be Hardware Aware!
[Figure: dual-socket test platform: two six-core CPUs (Core 0-5 each) with a shared L3 cache and 3-channel DDR3-1333 per socket, connected via QPI (6.4 GT/s) to two Tylersburg-EP chipsets, each attaching a GTX470 GPU over PCIe x16.]

Outline
- Overview: parallel solution of PDEs
- Shared memory systems: results
- Distributed memory systems: results
- Additional possibilities
- Conclusion

Solving PDEs - An Overview
- the second large application domain of our SG-code (SG++)
- currently available: Poisson equation, heat equation, Black-Scholes equation, Hull-White equation
- solution is done by finite elements
- currently available: a highly optimized shared memory implementation; execution times are still too high
- currently under construction: hybrid parallelization on distributed and shared memory systems

Some Basic Definitions...
- φ_{l,i} are suitable Sparse Grid basis functions (e.g. piecewise linear); where needed, φ_j with a suitable hierarchical ordering
- u contains the grid's coefficients (hierarchical surpluses)
- N is the number of grid points
- basis functions are addressed by level and index (hash map; see the sketch below)
- refinement and coarsening algorithms based on the hierarchical surplus allow spatially adaptive sparse grids
- in the case of adaptive sparse grids, a well-formed grid is assumed
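
As a minimal illustration of the level/index addressing (hypothetical C++ names, not the actual SG++ storage classes), the (level, index) pairs of a grid point can be hashed to find the position of its surplus in the coefficient vector u:

#include <cstdint>
#include <unordered_map>
#include <vector>

// One (level, index) pair per dimension identifies a grid point / basis function.
struct GridPoint {
    std::vector<uint32_t> level;
    std::vector<uint32_t> index;
    bool operator==(const GridPoint& o) const { return level == o.level && index == o.index; }
};

// Simple hash over all (level, index) pairs; for illustration only.
struct GridPointHash {
    size_t operator()(const GridPoint& p) const {
        size_t h = 0;
        for (size_t d = 0; d < p.level.size(); ++d) {
            uint64_t key = (uint64_t(p.level[d]) << 32) | p.index[d];
            h ^= std::hash<uint64_t>()(key) + 0x9e3779b9u + (h << 6) + (h >> 2);
        }
        return h;
    }
};

// Maps each grid point to the position of its hierarchical surplus in u.
using GridStorage = std::unordered_map<GridPoint, size_t, GridPointHash>;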

A Parallel PDE Solver, I of II

G := initial grid
p(x) := function for the initial condition
refineGrid(G, u, p(x))
for t = 0 to T do
    solveTimestep(G, u, δt)
    restructureGrid(G, u)
    t := t + δt
end for
return u, G

- the CG/BiCGSTAB solver consumes 100% of the solver time
- grid manipulation can also be done in parallel
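
A hedged C++ sketch of this driver loop (the types and routines below are placeholders for the corresponding SG++ components, not their real interfaces):

#include <functional>
#include <vector>

struct Grid;        // adaptive sparse grid
struct DataVector;  // hierarchical surpluses u

// Placeholders for the routines named in the pseudocode above.
void refineGrid(Grid& G, DataVector& u, const std::function<double(const std::vector<double>&)>& p);
void solveTimestep(Grid& G, DataVector& u, double dt);  // one time step, solved with CG/BiCGSTAB
void restructureGrid(Grid& G, DataVector& u);           // refine/coarsen based on the surpluses

void solvePDE(Grid& G, DataVector& u,
              const std::function<double(const std::vector<double>&)>& initialCondition,
              double T, double dt) {
    refineGrid(G, u, initialCondition);   // adapt the grid to the initial condition
    for (double t = 0.0; t < T; t += dt) {
        solveTimestep(G, u, dt);          // virtually all of the runtime is spent here
        restructureGrid(G, u);            // grid manipulation can run in parallel as well
    }
}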

A Parallel PDE Solver, II of II

i := 0
r := b − Lx
d := r
δ_0 := r^T r
while i < i_max and δ > ε² δ_0 do
    q := L d
    update x, d, δ, i
end while

- the MV-product consumes 100% of the solver time
- the MV-product can be parallelized (Up/Down scheme)
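
A hedged, matrix-free C++ sketch of this CG loop; applyL is a placeholder for the operator application via the Up/Down scheme, not the SG++ interface:

#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solves L x = b with plain CG; the operator is only available as a routine computing q = L d.
void cg(const std::function<void(const Vec&, Vec&)>& applyL,
        const Vec& b, Vec& x, int i_max, double eps) {
    Vec r(b.size()), d(b.size()), q(b.size());
    applyL(x, q);
    for (size_t j = 0; j < b.size(); ++j) r[j] = b[j] - q[j];   // r = b - L x
    d = r;
    double delta = dot(r, r);
    const double delta0 = delta;
    for (int i = 0; i < i_max && delta > eps * eps * delta0; ++i) {
        applyL(d, q);                                           // q = L d (Up/Down scheme)
        const double alpha = delta / dot(d, q);
        for (size_t j = 0; j < x.size(); ++j) { x[j] += alpha * d[j]; r[j] -= alpha * q[j]; }
        const double deltaNew = dot(r, r);
        const double beta = deltaNew / delta;
        for (size_t j = 0; j < d.size(); ++j) d[j] = r[j] + beta * d[j];
        delta = deltaNew;
    }
}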

Application of a PDE's FEM System Matrix

Calculating the 1D L² scalar product:

    (A u)_m = Σ_{n=1}^{N} u_n ∫ φ_n(x) φ_m(x) dx
            = Σ_{n≤m} u_n ∫ φ_n(x) φ_m(x) dx + Σ_{n>m} u_n ∫ φ_n(x) φ_m(x) dx,   m = 1, ..., N

Assuming a suitable hierarchical ordering φ_{1,1}, φ_{2,1}, φ_{2,3}, φ_{3,1}, ..., the matrix equation that follows is

    A u = A_Down u + A_Up u.
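
In matrix terms (my reading of the slide), A_Down collects the contributions with n ≤ m and A_Up those with n > m, i.e. the lower- and strictly upper-triangular parts of A with respect to the chosen hierarchical ordering:

(A_{\mathrm{Down}})_{m,n} = \int \varphi_n(x)\,\varphi_m(x)\,dx \ \text{for } n \le m, \ 0 \text{ otherwise};
\qquad
(A_{\mathrm{Up}})_{m,n} = \int \varphi_n(x)\,\varphi_m(x)\,dx \ \text{for } n > m, \ 0 \text{ otherwise};
\qquad
A = A_{\mathrm{Down}} + A_{\mathrm{Up}}.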

Application of a PDE's FEM System Matrix

UpDown(1): copy u; one copy is sent through down, the other through up; both results are added to give A u.
UpDown(d): copy u; one copy is sent through UpDown(d−1) and then down in dimension d, the other through up in dimension d and then UpDown(d−1); both results are added to give A u.

- the Up/Down parallelization constructs 2^d tasks (with barriers between the recursion levels)
- only for d > 5 are up to 12 cores usable
- the Up/Down parallelization is therefore only suitable for moderate core counts
- today's HPC boxes (up to 80 hardware threads) cannot be exploited

Some Code: Up/Down Parallel

int size = u.getSize();
Vector temp(size), result_temp(size), temp_two(size);

#pragma omp task shared(u, result)
{
    up(u, temp, dim);                   // up-sweep in dimension dim
    updown(temp, result, dim - 1);      // recurse over the remaining dimensions
}

#pragma omp task shared(u, result_temp)
{
    updown(u, temp_two, dim - 1);       // recurse over the remaining dimensions
    down(temp_two, result_temp, dim);   // down-sweep in dimension dim
}

#pragma omp taskwait
result.add(result_temp);                // combine the results of both branches

Example: Heat Equation

Laplacian: calculated as the sum of 1d operators B_i:

    A ∂u(t)/∂t = Σ_{i=1}^{d} B_i u(t) =: L u(t)

Implicit Euler:

    (A − δt L) u(t+1) = A u(t)

with x = u(t+1) as the unknown and b = A u(t) as the right-hand side.

More parallelism: d + 1 additional summands!
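
The summands are independent, so they can be computed concurrently and reduced afterwards. A hedged OpenMP sketch (applyB is a placeholder for the 1d operator B_i, not SG++ code):

#include <algorithm>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

void applyB(int dim, const Vec& u, Vec& result);   // placeholder: 1d operator B_i acting in dimension dim

// Applies L u = sum_{i=1}^{d} B_i u; each summand is an independent task.
void applyL(int d, const Vec& u, Vec& result) {
    std::fill(result.begin(), result.end(), 0.0);
    std::vector<Vec> partial(d, Vec(u.size(), 0.0));

    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < d; ++i)
        applyB(i, u, partial[i]);                  // d-way parallelism across the summands

    for (int i = 0; i < d; ++i)                    // serial reduction, cheap compared to applyB
        for (size_t j = 0; j < result.size(); ++j)
            result[j] += partial[i][j];
}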

Black-Scholes Parallelized in Shared Memory

Black-Scholes equation (for basket options), reversed in time:

    ∂V/∂τ − (1/2) Σ_{i=1}^{d} Σ_{j=1}^{d} σ_i σ_j ρ_{i,j} S_i S_j ∂²V/(∂S_i ∂S_j) − Σ_{i=1}^{d} μ_i S_i ∂V/∂S_i + rV = 0

FEM matrix equation:

    A ∂u(τ)/∂τ = ( Σ_{i=1}^{d} ν_i F_i − (1/2) Σ_{i=1}^{d} Σ_{j=1}^{d} σ_i σ_j ρ_{i,j} E_{i,j} + r A ) u(τ) =: L u(τ)

    with ν_i = μ_i − (1/2) Σ_{j=1}^{d} σ_i σ_j ρ_{i,j} (1 + δ_{i,j})

(d² + 3d)/2 + 2 summands (including symmetry)!
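
One way to arrive at this count (my reading of the slide: the correlation symmetry ρ_{i,j} = ρ_{j,i} is exploited, and the mass matrix from the time discretization is included):

\underbrace{\tfrac{d(d+1)}{2}}_{\text{diffusion terms, } i \le j}
+ \underbrace{d}_{\text{convection terms } \nu_i F_i}
+ \underbrace{1}_{r\,A}
+ \underbrace{1}_{A \text{ from the implicit Euler step}}
= \frac{d^2 + 3d}{2} + 2 .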

Black-Scholes: OpenMP, I of II
[Figure: speed-up over the number of threads (1-12) for the 2d and 3d Black-Scholes problem, comparing Up/Down (U/D) and sum (Sum.) parallelization on regular and adaptive grids.]

Black-Scholes: OpenMP, II of II
[Figure: speed-up over the number of threads (1-12) for the 4d and 5d Black-Scholes problem on adaptive grids, comparing Up/Down (U/D) and sum (Sum.) parallelization.]

PDEs Parallel: Hybrid Parallelization

- current status: the solution of the higher-dimensional Black-Scholes equation is still too slow; only shared memory systems can be used
- under construction: sum parallelization suitable for MPI (see the sketch below); Up/Down parallelization only in shared memory
- goal: better use of NUMA; solving on compute clusters
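
A hedged sketch of how the sum parallelization could map to MPI (hypothetical names; the actual MPI/OpenMP implementation was still under construction at the time): each rank applies a subset of the summands with the shared-memory Up/Down scheme, and the partial results are combined with an all-reduce.

#include <cstddef>
#include <vector>
#include <mpi.h>

using Vec = std::vector<double>;

// Placeholder: applies summand s of the system matrix; internally Up/Down- or OpenMP-parallel.
void applySummand(int s, const Vec& u, Vec& partial);

// result must already be sized like u; afterwards every rank holds the full matrix-vector product.
void applySystemMatrix(int numSummands, const Vec& u, Vec& result, MPI_Comm comm) {
    int rank = 0, size = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    Vec local(u.size(), 0.0), partial(u.size(), 0.0);
    for (int s = rank; s < numSummands; s += size) {        // round-robin distribution of the summands
        applySummand(s, u, partial);
        for (size_t j = 0; j < local.size(); ++j) local[j] += partial[j];
    }
    MPI_Allreduce(local.data(), result.data(), static_cast<int>(u.size()),
                  MPI_DOUBLE, MPI_SUM, comm);
}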

Test Platforms - Revisited

- now: use hardware threading for latency hiding (incl. pinning)
- now: GPUs cannot be exploited (no data parallelism)

[Figure: the dual-socket platform from before: two six-core CPUs with a shared L3 cache and 3-channel DDR3-1333 per socket, QPI (6.4 GT/s), two Tylersburg-EP chipsets, each attaching a GTX470 GPU over PCIe x16.]

PDEs Parallel: First Results - Poisson (regular grids)

Current high-end workstation (2x Xeon X5650, 2.66 GHz):
    6D-Poisson (level 8) | 12 Thr., no HT: 1390 s | 24 Thr., HT: 1050 s | 2 Pr. x 6 Thr., no HT: 1250 s | 2 Pr. x 12 Thr., HT: 905 s

SGI ICE (2x Xeon E5540, 2.53 GHz, per blade; HT; InfiniBand):
    6D-Poisson (level 8) | 2 Pr. x 8 Thr. (1 node): 1270 s | 6 Pr. x 8 Thr. (3 nodes): 630 s

SGI UV (2x Xeon X7550, 2.00 GHz, per blade; SGI NUMAlink):
    6D-Poisson (level 8) | 2 Pr. x 8 Thr. (no HT): 2050 s | 6 Pr. x 8 Thr. (no HT): 1200 s

SGI UV: MPI pinning not possible :-(

PDEs Parallel: First Results - Poisson (regular grids)

Current high-end workstation (2x Xeon X5650, 2.66 GHz):
    8D-Poisson (level 6) | 12 Thr.: 4700 s | 24 Thr., HT: 3300 s (-30%)

SGI ICE (2x Xeon E5540, 2.53 GHz, per blade; no HT; InfiniBand):
    8D-Poisson (level 6) | 2 Pr. x 8 Thr. (2 nodes): 4400 s | 4 Pr. x 8 Thr. (4 nodes): 2600 s | 8 Pr. x 8 Thr. (8 nodes): 1400 s

PDEs Parallel: First Results - Poisson (adaptive grids)

Current high-end workstation (2x Xeon X5650, 2.66 GHz):
    6D-Poisson (adaptive) | 12 Thr., no HT: 1200 s | 24 Thr., HT: 890 s | 2 Pr. x 6 Thr., no HT: 1050 s | 2 Pr. x 12 Thr., HT: 800 s

SGI ICE (2x Xeon E5540, 2.53 GHz, per blade; HT; InfiniBand):
    6D-Poisson (adaptive) | 2 Pr. x 8 Thr. (1 node): 1530 s | 6 Pr. x 8 Thr. (3 nodes): 790 s

PDEs Parallel: Beyond Up/Down Parallelism

- TifaMMy: support for hybrid matrices (neither dense nor sparse)
- TifaMMy: (fast) ILU preconditioner
- parallel (hierarchical) preconditioner?

Conclusion and Future Work

Conclusion:
- a parallelization approach for the Up/Down scheme with respect to today's hardware was proposed
- successful demonstration with several prototypes

Future Work:
- reduce boundary overhead
- implement the Black-Scholes equation with the MPI/OpenMP parallelization scheme
- try TifaMMy (hybrid matrices) for preconditioning