Solving RODEs on GPU clusters


HIGH TEA @ SCIENCE
Solving RODEs on GPU clusters
Christoph Riesinger, Technische Universität München
March 4, 206

Motivation - Parallel Computing

Motivation - Multiple Levels of Parallelism

Building Blocks
Each GPU, from GPU 0 to GPU N-1, runs the same four building blocks:
  Pseudo Random Number Generation
  Ornstein-Uhlenbeck Process
  Averaging
  Numerical Solver
Monte Carlo: the building blocks are replicated across all GPUs.
[Figure: the four building blocks on GPU 0, ..., GPU N-1]

Pseudo Random Number Generation - Ziggurat
The area under the Gaussian function is approximated by strips R_i.
These strips are further subdivided into central (green), tail (purple), and cap (red) regions and a base strip (blue).
To do the transformation, a strip is selected at random.
A uniform random number u ∈ [0, 1[ is stretched by a lookup table value based on the selected strip.
If a central region is hit, the transformation is very cheap; otherwise it is much more expensive.
[Figure: Ziggurat with the strips R_0, ..., R_7 = R_B, bounded by x_1, ..., x_7 = r and y_0, ..., y_7]

Pseudo Random Number Generation - Trade-off
The more strips the Ziggurat uses, the larger the ratio of the total area of all central regions to the total area of all strips.
The larger this ratio, the more likely a draw hits a (cheap) central region.
On GPUs, this also reduces the likelihood of warp divergence.
So runtime can be reduced by using more strips, which results in larger lookup tables: a runtime/memory trade-off.
[Figure: three Ziggurat panels]
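
A minimal Python sketch of the Ziggurat idea from the two slides above; it is not the CUDA implementation measured in the talk. The function names (`build_tables`, `sample_normal`, `cheap_fraction`) are made up for illustration. The table layout assumes the usual Marsaglia-Tsang construction with equal-area strips, the tail is sampled with Marsaglia's tail method, and counting the rectangular part of the base strip as cheap is an assumption as well.

```python
import math
import random


def f(x):
    """Unnormalised Gaussian density exp(-x^2 / 2)."""
    return math.exp(-0.5 * x * x)


def tail_area(r):
    """Area under f to the right of r."""
    return math.sqrt(math.pi / 2.0) * math.erfc(r / math.sqrt(2.0))


def stack_top(r, n_strips):
    """Upper edge of the top strip when n_strips equal-area strips start at r."""
    v = r * f(r) + tail_area(r)           # area of the base strip
    x, y = r, f(r)
    for _ in range(n_strips - 2):
        y += v / x
        if y >= 1.0:                      # stack passed the peak too early
            return math.inf
        x = math.sqrt(-2.0 * math.log(y))
    return y + v / x


def build_tables(n_strips):
    """Lookup tables xs, ys with n_strips + 1 entries each (n_strips >= 2)."""
    lo, hi = 0.01, 10.0                   # bracket for the base strip edge r
    for _ in range(100):                  # bisection: the top edge must be f(0) = 1
        mid = 0.5 * (lo + hi)
        if stack_top(mid, n_strips) > 1.0:
            lo = mid
        else:
            hi = mid
    r = hi
    v = r * f(r) + tail_area(r)
    xs, ys = [v / f(r), r], [0.0, f(r)]   # xs[0]: virtual width of the base strip
    for _ in range(n_strips - 2):
        ys.append(ys[-1] + v / xs[-1])
        xs.append(math.sqrt(-2.0 * math.log(ys[-1])))
    xs.append(0.0)
    ys.append(1.0)
    return xs, ys


def sample_normal(xs, ys, rng):
    """Draw one standard normal variate from the tables."""
    n_strips = len(xs) - 1
    while True:
        i = rng.randrange(n_strips)       # select a strip at random
        x = rng.random() * xs[i]          # stretch u in [0, 1[ by a table value
        if x < xs[i + 1]:                 # central region: cheap path
            break
        if i == 0:                        # tail of the base strip
            r = xs[1]
            while True:
                a = -math.log(1.0 - rng.random()) / r
                b = -math.log(1.0 - rng.random())
                if 2.0 * b > a * a:
                    break
            x = r + a
            break
        y = ys[i] + rng.random() * (ys[i + 1] - ys[i])
        if y < f(x):                      # cap region: accept only below the curve
            break
    return x if rng.random() < 0.5 else -x


def cheap_fraction(xs):
    """Share of the total strip area that is covered by the cheap path."""
    n_strips = len(xs) - 1
    return sum(xs[i + 1] / xs[i] for i in range(n_strips)) / n_strips
```

Calling `cheap_fraction(build_tables(n)[0])` for growing `n` shows the trend the trade-off slide describes: the cheap share rises with the number of strips, while the tables `xs` and `ys` grow linearly with it.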

Pseudo Random Number Generation - Results 1/2
Performance of the Ziggurat Method

GPU architecture                   Fermi          Kepler         Maxwell
Model                              Tesla M2090    Tesla K40m     GTX 750 Ti
#Processing elements               16 x 32        15 x 192       5 x 128
Peak performance SP (TFLOPS)       1.3312         3.84192        1.6384
Peak performance DP (TFLOPS)       0.6656         1.28064        0.0512
Peak memory bandwidth (GByte/s)    177.4          288.384        96.128

[Figure: one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); x axis: number of strips, 2^4 to 2^13; y axis: giga pseudo random numbers per second; series: local and shared, each for the grid configurations (2^5, 2^11), (2^6, 2^10), (2^7, 2^9), (2^8, 2^8), (2^9, 2^7)]

Pseudo Random Number Generation - Results 2/2
Comparison with other Normal PRNGs
[Figure: one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); x axis: grid configuration, (2^5, 2^11) to (2^9, 2^7); y axis: giga pseudo random numbers per second; legend: Ziggurat, Inverse CDF, Rational Polynomial, curand, Wallace, XORWOW, MKL on Xeon E5-2680 v2]

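Ornstein-Uhlenbeck process - Link to Prefix Sum 1/2
$$
\begin{aligned}
O_{t+h}  &= \mu O_t + \sigma_X n^{(1)} \\
O_{t+2h} &= \mu O_{t+h} + \sigma_X n^{(2)}
          = \mu \left( \mu O_t + \sigma_X n^{(1)} \right) + \sigma_X n^{(2)}
          = \mu^2 O_t + \sigma_X \left( \mu n^{(1)} + n^{(2)} \right) \\
O_{t+3h} &= \mu O_{t+2h} + \sigma_X n^{(3)}
          = \mu \left( \mu O_{t+h} + \sigma_X n^{(2)} \right) + \sigma_X n^{(3)} \\
         &= \mu \left( \mu \left( \mu O_t + \sigma_X n^{(1)} \right) + \sigma_X n^{(2)} \right) + \sigma_X n^{(3)}
          = \mu^3 O_t + \sigma_X \left( \mu^2 n^{(1)} + \mu n^{(2)} + n^{(3)} \right) \\
         &\;\;\vdots \\
O_{t+ih} &= \mu^i O_t + \sigma_X \sum_{k=1}^{i} \mu^{i-k} n^{(k)}
\end{aligned}
$$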

Ornstein-Uhlenbeck process - Link to Prefix Sum 2/2
This looks very similar to the prefix sum or scan operation:
$$ O_{t+ih} = \mu^i O_t + \sigma_X \sum_{k=1}^{i} \mu^{i-k} n^{(k)} $$
$$ O_{t+ih} = \sum_{k=1}^{i} n^{(k)} $$
[Figure: two arrays with the elements x_1, ..., x_15]
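
The link to the scan can be checked numerically. The sketch below assumes the one-step update O_{t+h} = μ O_t + σ_X n^{(k)} and compares stepping it sequentially with evaluating the closed form, in which the random numbers enter only through a μ-weighted prefix sum. The helper names and any parameter values passed in are hypothetical and chosen for illustration only.

```python
def ou_sequential(o_t, mu, sigma_x, normals):
    """Step the recurrence O <- mu * O + sigma_x * n one random number at a time."""
    path = []
    o = o_t
    for n_k in normals:
        o = mu * o + sigma_x * n_k
        path.append(o)
    return path


def ou_closed_form(o_t, mu, sigma_x, normals):
    """Evaluate O_{t+ih} = mu^i O_t + sigma_x * sum_k mu^(i-k) n^(k) for every i."""
    path = []
    for i in range(1, len(normals) + 1):
        weighted_sum = sum(mu ** (i - k) * normals[k - 1] for k in range(1, i + 1))
        path.append(mu ** i * o_t + sigma_x * weighted_sum)
    return path
```

Both functions return the same path up to rounding. The closed form is quadratic in the path length when evaluated naively like this, which is why a parallel scan is the interesting way to compute the weighted sums.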

Ornstein-Uhlenbeck process - Parallel Prefix Sum Up-Sweep
Algorithm 1 Up-sweep phase
  for d = 1; d ≤ log_2(n); d++ do
    for i = 0; i < n / 2^d; i++ do
      x_{(i+1) 2^d - 1} ← x_{(i+1) 2^d - 1} + x_{(i+1/2) 2^d - 1}
    end for
  end for
[Figure: up-sweep tree over n^(1), ..., n^(15) for the levels d = 0, ..., 4; the edges are labelled with the powers μ (d = 1), μ^2 (d = 2), μ^4 (d = 3), and μ^8 (d = 4)]

Ornstein-Uhlenbeck process - Parallel Prefix Sum Down-Sweep
Algorithm 2 Down-sweep phase
  for d = log_2(n) - 1; d > 0; d-- do
    for i = 0; i < n / 2^d - 1; i++ do
      x_{(i+3/2) 2^d - 1} ← x_{(i+1) 2^d - 1} + x_{(i+3/2) 2^d - 1}
    end for
  end for
[Figure: down-sweep tree over n^(1), ..., n^(15) for the levels d = 3, ..., 0; the edges are labelled with powers of μ]
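
A sequential Python sketch of the two phases, written for readability rather than speed. It assumes zero-based indexing and a power-of-two length, and it adds a weight μ^(2^(d-1)) at level d, as the powers of μ in the figures suggest, so that the result is the μ-weighted sum from the Ornstein-Uhlenbeck recurrence; with μ = 1 it reduces to the plain inclusive prefix sum. The name `weighted_scan` is illustrative. On a GPU, the inner loops over i are the part that could run in parallel, since their iterations touch disjoint elements.

```python
def weighted_scan(values, mu):
    """Inclusive scan y[i] = sum over k <= i of mu**(i - k) * values[k]."""
    x = list(values)
    n = len(x)
    levels = n.bit_length() - 1
    if n == 0 or (1 << levels) != n:
        raise ValueError("length must be a power of two")

    # Up-sweep: merge two neighbouring blocks of size 2^(d-1) into one of size 2^d.
    for d in range(1, levels + 1):
        half = 1 << (d - 1)
        weight = mu ** half
        for i in range(n >> d):           # independent iterations
            right = (i + 1) * (1 << d) - 1
            x[right] += weight * x[right - half]

    # Down-sweep: push finished prefixes into the middle of the following block.
    for d in range(levels - 1, 0, -1):
        half = 1 << (d - 1)
        weight = mu ** half
        for i in range((n >> d) - 1):     # independent iterations
            src = (i + 1) * (1 << d) - 1
            x[src + half] += weight * x[src]
    return x
```

Given the scan result `y` of the random numbers n^(1), ..., n^(i), the process value follows as O_{t+ih} = μ^i O_t + σ_X y[i-1].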

Ornstein-Uhlenbeck process - Results
[Figure: one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); x axis: elements per thread (2, 4, 8, 16); y axis: giga realizations of OU process; series: float and double, each for 2^7, 2^8, 2^9, and 2^10 threads/block]

Averaging
[Figure: stepwise combination of the elements up to x_23 into a single value]

Averaging - Results
[Figure: one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell); x axis: threads per block, 2^5 to 2^10; y axis: ratio of maximum bandwidth; series: float and double, each for single averaging, double averaging, 3-tridiagonal, and 4-tridiagonal]

architecture                     Tesla M2090    Tesla K40m     GTX 750 Ti
ratio peak memory bandwidth      88.5%          72.%           8.%
configuration (threads/block)    28, double     2^8, double    2^6, double

Solving one instance of the RODE on a single GPU
[Figure: one panel each for Tesla M2090 (Fermi), Tesla K40m (Kepler), and GTX 750 Ti (Maxwell), with one bar between 0 and 1 per float and double configuration; kernels in the legend: initstatesnormalkernel(), getrandomnumbersnormalkernel(), scanexclusiveoukernel(), scanoufixkernel(), realizeouprocesskernel(), singleaveragekernel(), averagedeulerkernel(); legend colors: purple, blue, green, red; legend groups: numerical solver, averaging, Ornstein-Uhlenbeck process, pseudo random number generation]

Solving several instances of the RODE on multiple GPUs

cluster            JuDGE             Hydra             TSUBAME 2.5
location           FZJ               RZG               GSIC
GPUs per node      2                 2                 3
total # of GPUs    206               338               4224
Interconnect       QDR InfiniBand    FDR InfiniBand    QDR InfiniBand

[Figure: one panel each for JuDGE, Hydra, and TSUBAME 2.5; x axis: number of GPUs (up to 2^7 on JuDGE and Hydra, 2^4 to 2^10 on TSUBAME 2.5); y axis: efficiency; series: float and double, each for GPU computations, MPI_Reduce(), and total]
