Practical Combustion Kinetics with CUDA

Funded by: U.S. Department of Energy Vehicle Technologies Program. Program Managers: Gurpreet Singh & Leo Breton. Practical Combustion Kinetics with CUDA. GPU Technology Conference, March 20, 2015. Russell Whitesides & Matthew McNenly. Session S5468. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Security, LLC under Contract DE-AC52-07NA27344.

Collaborators: Cummins Inc., Convergent Science, NVIDIA, Indiana University. Good guys to work with. 2

Does plus equal? The big question. 3

Lots of smaller questions: What has already been done in this area? How are we approaching the problem? What have we accomplished? What's left to do? There won't be a quiz at the end. 4

Why? NVIDIA GPUs/CUDA Toolkit: more FLOP/s, more GB/s, faster growth in both. Data from NVIDIA's CUDA C Programming Guide, Version 6.0, 2014. 5

Reacting flow simulation: Computational Fluid Dynamics (CFD) with detailed chemical kinetics, tracking 10s-1000s of species. ConvergeCFD (internal combustion engines). The approach is also used to simulate gas turbines, burners, flames, etc. 6

What has been done already in combustion kinetics on GPUs? Recent review by Niemeyer & Sung [1]: Spafford, Sankaran & co-workers (ORNL, first published 2010); Shi, Green & co-workers (MIT); Stone (CS&E LLC); Niemeyer & Sung (CWRU/OSU, UConn). Most approaches use explicit or semi-implicit Runge-Kutta techniques. Some only use the GPU for the derivative calculation. From [1]: "Furthermore, no practical demonstration of a GPU chemistry solver capable of handling stiff chemistry has yet been made. This is one area where efforts need to be focused." [1] K.E. Niemeyer, C.-J. Sung, Recent progress and challenges in exploiting graphics processors in computational fluid dynamics, J. Supercomput. 67 (2014) 528-564. doi:10.1007/s11227-013-1015-7. A few groups are working (publicly) on this. Some progress has been made. 7

Problem: Can't directly port CPU chemistry algorithms to the GPU. GPUs need dense data, and lots of it. Large chemical mechanisms are sparse. Small chemical mechanisms don't have enough data (and even large mechanisms aren't large in a GPU context). Solution: Re-frame many uncoupled reactor calculations into a single system of coupled reactors. For chemistry it's not as simple as adding new hardware. 8

How do we solve chemistry on the CPU? [Figures: temperature and Y_O2 fields.] Example: engine simulation in Converge CFD. 9

How do we solve chemistry on the CPU? [Figures: temperature and Y_O2 fields.] Example: engine simulation in Converge CFD. 10

Detailed Chemistry in Reacting Flow CFD: Operator Splitting Technique: Solve an independent Initial Value Problem in each cell (or zone) to calculate chemical source terms for the species and energy advection/diffusion equations. Each cell is treated as an isolated system for chemistry. 11

Detailed Chemistry in Reacting Flow CFD: Operator Splitting Technique: Solve an independent Initial Value Problem in each cell (or zone) to calculate chemical source terms for the species and energy advection/diffusion equations. Each cell is treated as an isolated system for chemistry. 12

Detailed Chemistry in Reacting Flow CFD: Operator Splitting Technique: Solve an independent Initial Value Problem in each cell (or zone), from t to t+Δt, to calculate chemical source terms for the species and energy advection/diffusion equations. Each cell is treated as an isolated system for chemistry. 13

CPU (un-coupled) chemistry integration, t to t+Δt. Each cell is treated as an isolated system for chemistry. 14
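To make that structure concrete, the sketch below shows the un-coupled CPU pattern: a loop over cells, each advancing its own isolated constant-volume IVP over one CFD step. All names are hypothetical, and a toy forward-Euler stand-in replaces the stiff implicit solver the real code requires; it only illustrates the per-cell independence that the GPU approach later exploits.

```cpp
// Structure-only sketch of operator-split chemistry on the CPU.
// Hypothetical names; a toy forward-Euler stand-in replaces the stiff solver.
#include <cstdio>
#include <vector>

struct CellState { double T; double rho; std::vector<double> Y; };

// stand-in chemical source term dY/dt (the real code evaluates detailed kinetics)
static void sourceTerm(const CellState& s, std::vector<double>& dYdt) {
    for (size_t i = 0; i < s.Y.size(); ++i) dYdt[i] = -0.1 * s.Y[i];  // toy decay
}

// advance one isolated cell from t to t+dt with many small sub-steps
static void integrateCell(CellState& s, double dt, int nSub) {
    std::vector<double> dYdt(s.Y.size());
    for (int k = 0; k < nSub; ++k) {
        sourceTerm(s, dYdt);
        for (size_t i = 0; i < s.Y.size(); ++i) s.Y[i] += (dt / nSub) * dYdt[i];
    }
}

int main() {
    // a handful of cells; in the CFD each holds its own T, rho, and mass fractions
    std::vector<CellState> cells(4, CellState{1450.0, 10.0, std::vector<double>(3, 0.2)});
    const double dt = 1.0e-5;                          // one CFD time step
    for (auto& c : cells) integrateCell(c, dt, 100);   // un-coupled: cell by cell
    printf("Y[0] of cell 0 after one step: %g\n", cells[0].Y[0]);
    return 0;
}
```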

GPU (coupled) chemistry integration, t to t+Δt. For the GPU we solve chemistry simultaneously in large groups of cells. 15

What about variations in practical engine CFD? vs. If the systems are not similar, how much extra work needs to be done? 16

What are the equations we're trying to solve? Derivative equations (vector calculations): dY_i/dt = (w_i/ρ) dc_i/dt; dT/dt = −(RT/(ρ c_v)) Σ_i u_i dc_i/dt (sum over species i). Jacobian matrix solution: A = L·U, with A stored in dense or sparse format. The derivative represents the system of equations to be solved (perfectly stirred reactor). A matrix solution is required due to stiffness. Matrix storage in dense or sparse formats. Significant effort to transform the fastest CPU algorithms into GPU-appropriate versions. 17

We want to solve many of these simultaneously Not as easy as copy and paste. 18

Example: Species production rates.
Net rates of production: dc_i/dt = Σ_{j,create} R_j − Σ_{j,destroy} R_j
Chemical reaction rates of progress: R_i = k_i Π_{j∈species} C_j^{ν_ij}
Chemical reaction step rate coefficients:
Arrhenius rates: k_i = A_i T^{n_A,i} exp(−E_i/RT)
Third-body enhanced rates: k_i = k_i · Σ_{j∈species} α_j C_j
Equilibrium reverse rates: k_i = k_{i,f}/K_eq, with K_eq = exp(−(Σ_{j∈prod} G_j^0 − Σ_{j∈reac} G_j^0)/RT)
Fall-off rates: k_i = k_i · ...
Major component of the derivative; lots of sparse operations. 19

Example: Species production rates (same rate expressions as the previous slide). Chemical species connectivity is generally sparse, which leads to poor memory locality: bad for GPU performance. Major component of the derivative; lots of sparse operations. 20

Example: Species production rates. Each column is the data for a single reactor (cell); each row is one data element across all reactors. Data is now arranged for coalesced access. Approach: couple together reactors (or cells) and make smart use of GPU memory. 21
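As an illustration of that layout, the sketch below (hypothetical names and toy values, not the authors' code) evaluates Arrhenius rate coefficients with one thread per reactor and the reactor index as the fastest-varying dimension, so consecutive threads touch consecutive addresses and the loads and stores coalesce.

```cuda
// Illustrative sketch: batched Arrhenius rate evaluation with a
// reactor-fastest (coalesced) data layout.  Toy sizes and values.
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void arrheniusRates(const double* T,   // [nReactors] temperatures (K)
                               const double* A,   // [nReactions] pre-exponential factors
                               const double* n,   // [nReactions] temperature exponents
                               const double* E,   // [nReactions] activation energies (J/mol)
                               double* k,         // [nReactions][nReactors] rate coefficients
                               int nReactions, int nReactors)
{
    const double Ru = 8.314462618;                 // gas constant, J/(mol K)
    int r = blockIdx.x * blockDim.x + threadIdx.x; // reactor (cell) index
    if (r >= nReactors) return;
    double Tr = T[r];
    for (int i = 0; i < nReactions; ++i) {
        // element [i*nReactors + r]: neighbouring threads read/write neighbouring doubles
        k[i * nReactors + r] = A[i] * pow(Tr, n[i]) * exp(-E[i] / (Ru * Tr));
    }
}

int main()
{
    const int nReactions = 4, nReactors = 2048;
    double *T, *A, *n, *E, *k;
    cudaMallocManaged(&T, nReactors * sizeof(double));
    cudaMallocManaged(&A, nReactions * sizeof(double));
    cudaMallocManaged(&n, nReactions * sizeof(double));
    cudaMallocManaged(&E, nReactions * sizeof(double));
    cudaMallocManaged(&k, nReactions * nReactors * sizeof(double));
    for (int r = 0; r < nReactors; ++r) T[r] = 1450.0;
    for (int i = 0; i < nReactions; ++i) { A[i] = 1.0e10; n[i] = 0.0; E[i] = 1.0e5; }
    arrheniusRates<<<(nReactors + 255) / 256, 256>>>(T, A, n, E, k, nReactions, nReactors);
    cudaDeviceSynchronize();
    printf("k[reaction 0, reactor 0] = %g\n", k[0]);
    cudaFree(T); cudaFree(A); cudaFree(n); cudaFree(E); cudaFree(k);
    return 0;
}
```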

Benchmarking Platforms. Big Red 2: AMD Opteron Interlagos (16 cores), 1x Tesla K20. Surface (not pictured): Intel Xeon E5-2670 (16 cores), 2x Tesla K40m. CPU and GPU used; both matter. 22

Big Red 2. [Chart: speedup of simultaneous dc_i/dt (net production rate) calculations vs. number of species, for 128, 256, 512, 1024, and 2048 reactors.] Significant speedup achieved for species production rates. 23

Surface. [Chart: speedup of simultaneous dc_i/dt (net production rate) calculations vs. number of species, for 128, 256, 512, 1024, and 2048 reactors.] Less speedup than Big Red 2 because the CPU is faster. 24

We have implemented or borrowed algorithms for the rest of the chemistry integration. Derivative equations (vector calculations) and Jacobian matrix solution (A = L·U, dense or sparse), as above. Need to put the rest of the calculations on the GPU. 25

We have implemented or borrowed algorithms for the rest of the chemistry integration. Apart from dc_i/dt, the derivative is straightforward on the GPU. Jacobian matrix solution: A = L·U, dense or sparse. Need to put the rest of the calculations on the GPU. 26

We have implemented or borrowed algorithms for the rest of the chemistry integration. Apart from dc_i/dt, the derivative is straightforward on the GPU. We are able to use NVIDIA-developed algorithms to perform the matrix operations (A = L·U, dense or sparse) on the GPU. Need to put the rest of the calculations on the GPU. 27

Matrix Solution Methods (A = L·U).
Dense:
CPU: LAPACK (dgetrf, dgetrs)
GPU: cuBLAS (dgetrfBatched, dgetriBatched, batched matrix-vector multiplication)
Sparse:
CPU: SuperLU (dgetrf, dgetrs)
GPU: GLU (soon cuSOLVER Sp, CUDA 7.0); LU refactorization (SuperLU for the first factorization); LU solve
Conglomerate matrix (CUDA < 6.5) vs. batched matrices (CUDA >= 6.5, 2-4x faster). 28
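For the dense GPU path, a minimal self-contained sketch of the batched-LU pattern is shown below. The cublasDgetrfBatched and cublasDgetrsBatched calls are real cuBLAS API; the sizes, fill values, and the use of the batched solve routine (rather than the batched inversion plus matrix-vector multiplication listed on the slide) are illustrative assumptions, not the authors' solver.

```cuda
// Sketch: factor and solve a batch of small dense systems with cuBLAS.
// Compile with: nvcc batched_lu.cu -lcublas
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 4, batch = 256;            // many tiny systems, one per reactor
    cublasHandle_t handle;
    cublasCreate(&handle);

    // contiguous device storage for all matrices and right-hand sides
    double *A, *b;
    cudaMalloc(&A, sizeof(double) * n * n * batch);
    cudaMalloc(&b, sizeof(double) * n * batch);

    // host-side fill: A_m = 2*I and b_m = ones, so every solution is 0.5
    std::vector<double> hA(n * n * batch, 0.0), hb(n * batch, 1.0);
    for (int m = 0; m < batch; ++m)
        for (int i = 0; i < n; ++i) hA[m * n * n + i * n + i] = 2.0;
    cudaMemcpy(A, hA.data(), hA.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(b, hb.data(), hb.size() * sizeof(double), cudaMemcpyHostToDevice);

    // device arrays of pointers, one entry per matrix / rhs in the batch
    std::vector<double*> hAp(batch), hbp(batch);
    for (int m = 0; m < batch; ++m) { hAp[m] = A + m * n * n; hbp[m] = b + m * n; }
    double **dAp, **dbp;
    cudaMalloc(&dAp, batch * sizeof(double*));
    cudaMalloc(&dbp, batch * sizeof(double*));
    cudaMemcpy(dAp, hAp.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dbp, hbp.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    int *dPiv, *dInfo, hInfo = 0;
    cudaMalloc(&dPiv, batch * n * sizeof(int));
    cudaMalloc(&dInfo, batch * sizeof(int));

    // LU-factor all matrices in one call, then solve all systems in one call
    cublasDgetrfBatched(handle, n, dAp, n, dPiv, dInfo, batch);
    cublasDgetrsBatched(handle, CUBLAS_OP_N, n, 1, dAp, n, dPiv, dbp, n, &hInfo, batch);

    cudaMemcpy(hb.data(), b, n * sizeof(double), cudaMemcpyDeviceToHost);
    printf("x[0] of first system = %g (expect 0.5)\n", hb[0]);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(b); cudaFree(dAp); cudaFree(dbp); cudaFree(dPiv); cudaFree(dInfo);
    return 0;
}
```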

Test case for full chemistry integration Ignition delay time calculation (i.e. shock tube simulation): 256-2048 constant volume reactor calculations No coupling to CFD Comparing CPU and GPU with both dense and sparse matrix operations This provides a gauge of what the ideal speedup will be in CFD simulations. 29

0D, Uncoupled, Ideal Case: Max speedup. [Chart, Big Red 2: speedup (CPU time/GPU time, 0-16) vs. number of species (10-160), for 256-2048 reactors; cases: CPU dense vs. GPU dense, CPU sparse vs. GPU dense, CPU sparse vs. GPU sparse.] As with dc_i/dt, the best speedup is for a large number of reactors. 30

Synchronization Penalty Test Case: Converge CFD. Rectilinear volume (16x8x8 mm). Initial conditions: variable gradients in temperature (T) and equivalence ratio (φ); uniform zero velocity; uniform pressure (20 bar). Boundary conditions: no flux for all variables. ~50 CFD steps capturing complete fuel conversion. Every-cell chemistry (2048 cells, 1 CPU core, 1 GPU device). 7 kinetic mechanisms from 10-160 species. Solved with both sparse and dense matrix algorithms. Testing the effect of non-identical reactors. 31

We compared the total chemistry cost for sequential auto-ignition in a constant volume chamber. Initial conditions: increasing temperature and increasing equivalence ratio across the chamber. [Chart: pressure (MPa, 1.5-5.5) vs. time (0-50 µs) showing the sequential ignition.] 32

We compared the total chemistry cost for sequential auto-ignition in a constant volume chamber. Initial conditions: increasing temperature and increasing equivalence ratio across the chamber.
Condition   T spread (K)   φ spread
Grad0       1450           1.0
Grad1       1400-1450      0.95-1.05
Grad2       1350-1450      0.90-1.10
Grad3       1250-1450      0.80-1.20
33

Converge GPU: Sequential Auto-ignition. [Chart, Big Red 2: speedup (CPU time/GPU time) vs. number of species (10-160) and amount of gradation (grad0-grad3); cases: CPU dense vs. GPU dense, CPU sparse vs. GPU dense, CPU sparse vs. GPU sparse.] Even in the non-ideal case we find significant speedup. 34

Finally ready to run engine simulation on GPU. [Figures: temperature and Y_O2 fields.] Big Red 2. Compared cost of every-cell chemistry from -20 to 15 CAD. 24 nodes of Big Red 2: 24 CPU cores vs. 24 GPU devices. Should be close to the worst-case scenario w.r.t. synchronization penalty. What's the speedup on a real problem? 35

Engine calculation on GPU. Big Red 2: 24 CPU cores = 53.8 hours; 24 GPU devices = 14.5 hours. Speedup = 53.8/14.5 = 3.7. Good speedup, with caveats. 36

CPU-GPU Work-sharing, Ideal Case. S_total = (N_CPU + N_GPU·(S − 1)) / N_CPU, where S is the speedup of one GPU device over one CPU core, N_CPU is the number of CPU cores, and N_GPU is the number of GPU devices. [Chart: S_total (1-8) vs. N_GPU (1-4) for S = 8 and N_CPU = 4, 8, 16, 32; starred points: Big Red 2 (1.4375) and Surface (1.8750).] Let's make use of the whole machine. 37
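As a check on the formula, plugging in the per-node configurations from the benchmarking-platforms slide (16 CPU cores per node) and the plotted S = 8 reproduces the starred values:

```latex
S_{\mathrm{total}}^{\text{Big Red 2}} = \frac{16 + 1\,(8-1)}{16} = \frac{23}{16} = 1.4375,
\qquad
S_{\mathrm{total}}^{\text{Surface}} = \frac{16 + 2\,(8-1)}{16} = \frac{30}{16} = 1.8750
```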

CPU-GPU Work-sharing: Strong scaling. [Chart, Surface: chemistry time (100-10,000 seconds, log scale) vs. number of processors (1-16), CPU chemistry only.] Sequential auto-ignition case, grad0, 53 species, ~10,000 cells. Strong scaling is good for this problem on CPU. 38

CPU-GPU Work-sharing: Strong scaling. [Chart, Surface: chemistry time vs. number of processors (1-16); CPU chemistry and GPU chemistry with standard work sharing, with a ~7x gap between the two curves.] Sequential auto-ignition case, grad0, 53 species, ~10,000 cells. Poor scaling with GPUs if all processors get the same amount of work. 39

CPU-GPU Work-sharing: Strong scaling. [Chart, Surface: chemistry time vs. number of processors (1-16); CPU chemistry, GPU chemistry with standard work sharing, and GPU chemistry with custom work sharing; ~7x gap at small counts, ~1.7x overall (S_total) at 16 processors with S = 6.6.] Sequential auto-ignition case, grad0, 53 species, ~10,000 cells. Better scaling if the GPU processors are given an appropriate workload. 40

Proof of Principle: Engine calculation on GPU + CPU cores. Surface: 16 CPU cores = 21.2 hours; 16 CPU cores + 2 GPU devices = 17.6 hours. Speedup = 21.2/17.6 = 1.20 (S_total, for S = 2.6). In line with expectations. 41

Does plus equal? Sort of. 42

Future directions. Improve CPU/GPU parallel task management: minimize synchronization penalty; work stealing. Improvements to derivative calculation: custom code generation; reframe parts as matrix multiplication. Improvements to matrix calculations: analytical Jacobian; mixed-precision calculations. Possibilities for significant further improvements. 43

Conclusions. GPU chemistry for stiff integration implemented. Implemented as a Converge CFD UDF but flexible for incorporation into other CFD codes. Continuing development: further speedup envisioned; more work can improve applicability. Thank you! 44

Supplemental Slides + Just in case. 45

0D, Uncoupled, Ideal Case: Cost Breakdown. [Chart, Surface: normalized computation time (0-1) vs. number of species (10-160), dense and sparse, CPU and GPU; costs split into matrix formation, matrix factor, matrix solve, derivatives, and other.] Evenly distributed costs, both on CPU and GPU. 46

0D, Uncoupled, Ideal Case: Cost Breakdown (detail). [Chart, Surface: normalized computation time (0-0.2) vs. number of species (10-160), dense and sparse; costs split into matrix formation, matrix factor, matrix solve, derivatives, and other.] Evenly distributed costs, both on CPU and GPU. 47

0D, Uncoupled, Ideal Case: Max speedup. [Chart, Surface: speedup (CPU time/GPU time, 0-16) vs. number of species (10-160), for 256-2048 reactors; cases: CPU dense vs. GPU dense, CPU sparse vs. GPU sparse.] As with dc_i/dt, the best speedup is for a large number of reactors. 48