Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs


Faster Kinetics: Accelerate Your Finite-Rate Combustion Simulation with GPUs Christopher P. Stone, Ph.D. Computational Science and Engineering, LLC Kyle Niemeyer, Ph.D. Oregon State University

2 Outline: Combustion Kinetics (Operator Splitting); Stiff ODE Integration (Algorithms); CUDA Implementations; Benchmarks (VODE vs. RKx); Performance Analysis; Summary

3 Combustion Kinetics. Must solve the Navier-Stokes and species conservation equations: a massive coupled PDE system. This coupled system is too expensive to solve directly when we have finite-rate kinetics, large mechanisms, and fine meshes. Instead, decouple the reaction terms spatially and solve them over the CFD time-step: operator splitting.

4 Combustion Kinetics. Integrate the convection-diffusion and reaction components separately within each CFD time-step: still 2nd-order time-accurate. Valid when reaction time-scales are much faster than convection-diffusion time-scales. Transforms the massive PDE into thousands or millions of independent ODE systems that can be solved by a stiff solver in parallel. Each point yields an Ns+1 coupled ODE system (Ns species plus temperature) with a non-linear RHS function.
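The splitting idea can be sketched on a toy scalar problem (all names hypothetical): a 2nd-order Strang step alternates a half-step of the transport term, a full step of the reaction term, and a second transport half-step. Here both sub-problems are linear, so each sub-step can be advanced exactly; in the real solver the reaction sub-step would call the stiff ODE integrator.

```python
import math

def react(y, dt, r=50.0):
    # stand-in for the stiff reaction sub-integration: exact solution of dy/dt = -r*y
    return y * math.exp(-r * dt)

def strang_step(y, dt, k=2.0):
    # 2nd-order Strang splitting: half transport, full reaction, half transport
    y = y * math.exp(-k * dt / 2)   # transport half-step (dy/dt = -k*y, solved exactly)
    y = react(y, dt)                # reaction sub-step over the full CFD dt
    y = y * math.exp(-k * dt / 2)   # second transport half-step
    return y

y = 1.0
for _ in range(10):
    y = strang_step(y, 1e-2)
# for commuting linear operators the split solution matches exp(-(k + r) * t) exactly
```

For non-commuting operators (the real convection-diffusion/reaction case), the splitting error is O(dt^2), which is the source of the "still 2nd-order time-accurate" claim above.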

5 Combustion Kinetics. ODE integration is still expensive due to: 1. Numerical stiffness. 2. The costly RHS reaction function, which scales as ~Ns. 3. Jacobian matrix factorization, which scales as ~Ns^3. Targeting modest mechanisms: tens to hundreds of species (10s < Ns < 100s). Embarrassingly parallel on the host assuming thousands of grid points per core, but must be careful with the variable workload: 1. Each ODE is unique: different initial conditions. 2. Adaptive step-size: different number of steps per ODE.

6 Keys to GPU Performance. Stream data from global memory: optimal bandwidth if threads within a warp read contiguous data from global memory. Oversubscribe threads to hide latency: run many warps per SM to hide slow global memory. Warp-level vector processing: optimal throughput if each thread within a warp executes the same instruction on different data (SIMD). Must use all three for best performance; SIMD processing is challenging in the ODE setting.

7 Mapping ODEs to the GPU. A. ODE solver logic runs on the device: 1. One-thread-per-ODE: 10,000+ ODEs concurrently. 2. One-block-per-ODE: 100+ ODEs concurrently. B. ODE solver logic runs on the host: 1. One-kernel-per-ODE: 1 ODE at a time. 2. Offload expensive components: RHS, Jacobian, etc. 3. Not considered in this study: mechanisms must be very large to expose enough parallelism.

8 Mapping ODEs to the GPU. [Layout diagram: in one-thread-per-ODE, thread i owns problem i's full state (y1, y2, y3, ..., yn, T, p, dt); threads 1-32 form Warp 1, threads 33-64 form Warp 2, and so on. In one-block-per-ODE, an entire thread block (Block 1, Block 2, ...) works on a single ODE's state vector.]

9 Mapping ODEs to the GPU. One-thread-per-ODE: 1. Straightforward: replicate serial code. 2. Warp-level conflict between ODEs. 3. Only global memory. 4. O(10k) problems for full occupancy. One-block-per-ODE: 1. Complex: exploit parallelism within the solver and across ODEs. 2. No warp-level conflict between ODEs. 3. Only shared memory: limits problem size. 4. O(100) problems for full occupancy.
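The one-thread-per-ODE mapping only gets coalesced global loads if the state vectors are stored species-major (struct-of-arrays), so consecutive threads read consecutive addresses for each species. A small NumPy sketch of that layout transform (array names hypothetical; the real code would do this on the host before the kernel launch):

```python
import numpy as np

n_odes, n_spec = 8, 4                 # tiny example: 8 ODE systems, 4 species each
rng = np.random.default_rng(0)
aos = rng.random((n_odes, n_spec))    # array-of-structs: each row = one ODE's state

# struct-of-arrays: species k for all ODEs stored contiguously, so consecutive
# threads (= consecutive ODE indices) touch consecutive addresses on each load
soa = np.ascontiguousarray(aos.T)     # shape (n_spec, n_odes)

# thread `ode` reads species k at soa[k, ode]: same values, coalesced layout
ode = 3
```

The one-block-per-ODE mapping sidesteps this by staging one ODE's state in shared memory instead, which is why it is limited by shared-memory capacity.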

10 Stiff ODE Solvers. VODE: variable-coefficient ODE solver using backward differentiation formulas (BDF). Maintained by LLNL since the early '80s. Flavors: DVODE (Fortran) and CVODE (C/C++). 1st- through 5th-order implicit BDF with non-linear Newton-Raphson iteration and h/p-adaptation. Solves a generic system of ODEs with a user-defined RHS function.

11 VODE Stiff Algorithm
1. Estimate initial step-size: h0 = h0(f(y,t0), e)
2. Initialize order p = 1
3. While (t < t1):
   A. Predict y(t+h) using (p-1)-th order polynomial extrapolation
   B. Correct y(t+h) using p-th order (1 <= p <= 5) BDF:
      i. Compute (approximate) Jacobian matrix J, if necessary
      ii. Factorize iteration matrix M = (I - gJ), if necessary
      iii. While (not converged), iterate the Newton-Raphson solver:
         a) Compute the RHS function at ym
         b) Solve the correction dym+1 = M^-1 f(ym)
   C. If solution accepted: t = t + h
   D. Adjust h and p
4. Interpolate solution backwards if overshoot (t > t1)
Heuristic optimization: the "if necessary" steps execute only when they will reduce run-time.
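The predict/correct/Newton structure in step 3 can be illustrated with the lowest-order member of the BDF family (backward Euler). This is a sketch, not VODE itself (names hypothetical, order fixed at 1, no h/p-adaptation): it factors the iteration matrix M = I - h*J once per step and iterates the Newton correction until the update is small.

```python
import numpy as np

def bdf1_newton_step(f, jac, y, t, h, tol=1e-10, max_iter=20):
    """One backward-Euler (BDF order-1) step: solve y1 = y + h*f(y1, t+h)
    by Newton iteration on the iteration matrix M = I - h*J."""
    y1 = y + h * f(y, t)                       # predictor: explicit Euler extrapolation
    M = np.eye(len(y)) - h * jac(y1, t + h)    # build once; VODE recycles J and the factors
    for _ in range(max_iter):
        r = y1 - y - h * f(y1, t + h)          # residual of the implicit equation
        dy = np.linalg.solve(M, -r)            # Newton correction
        y1 += dy
        if np.linalg.norm(dy) < tol:
            break
    return y1

# stiff linear test problem: dy/dt = -1000*y
f = lambda y, t: -1000.0 * y
jac = lambda y, t: np.array([[-1000.0]])
y = bdf1_newton_step(f, jac, np.array([1.0]), 0.0, 1e-3)
# backward Euler gives y1 = y0 / (1 + 1000*h) = 0.5
```

For a linear problem the Newton loop converges in one iteration; for combustion RHS functions several iterations per step are typical, which is exactly the divergence-prone branching the next slide discusses.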

12 Stiff ODE Solvers. The VODE implicit BDF algorithm is highly optimized to reduce sequential execution time for stiff problems, but its complex implicit logic makes parallel (SIMD) execution less efficient: Newton iterations, p-adaptation, Jacobian recycling, selective factorization. Fixed-order, explicit schemes should have greater parallel efficiency in SIMD environments but usually perform poorly on stiff problems. Sacrifice numerical efficiency for higher computational throughput.

13 RKF Algorithm
1. Estimate initial step-size: h0 = h0(f(y,t0), e)
2. While (t < t1):
   A. Compute trial y(t+h) and err
   B. If solution accepted: t = t + h
   C. Adjust h
Cash-Karp (RKCK) and Dormand-Prince (DOPRI) are other RK variants similar to Fehlberg (RKF): explicit 4th-order with 5th-order error estimation and an adaptive step-size for error control. 6 RHS function evaluations per step. Minimal branching within the time-loop: high SIMD efficiency. A variable number of steps between ODEs is still likely.
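The accept/reject loop above can be sketched with the simplest possible embedded pair (Heun's 2nd-order solution with an explicit-Euler error estimate) instead of the full 6-stage Cash-Karp tableau; the controller logic has the same shape. Names are hypothetical:

```python
import math

def adaptive_integrate(f, y, t0, t1, h, rtol=1e-6, atol=1e-10):
    # embedded-pair loop: trial step, error test, accept/reject, adjust h
    t = t0
    while t < t1:
        h = min(h, t1 - t)                  # don't overshoot the interval end
        k1 = f(y, t)
        k2 = f(y + h * k1, t + h)
        y_hi = y + 0.5 * h * (k1 + k2)      # 2nd-order (Heun) trial solution
        err = abs(0.5 * h * (k2 - k1))      # gap to the 1st-order (Euler) solution
        scale = atol + rtol * abs(y_hi)
        ratio = max(err / scale, 1e-12)     # normalized error, guarded against 0
        if ratio <= 1.0:                    # accept the step
            t += h
            y = y_hi
        h *= min(5.0, max(0.2, 0.9 * ratio ** -0.5))  # clamped step-size controller
    return y

y_end = adaptive_integrate(lambda y, t: -y, 1.0, 0.0, 1.0, 1e-2)
# decays toward exp(-1)
```

Because every accepted or rejected step follows the same straight-line code path, threads in a warp diverge only through the *number* of steps each ODE takes, not through per-step branching.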

14 RKC Algorithm
1. Estimate initial step-size: h0 = h0(f(y,t0), e)
2. While (t < t1):
   A. Estimate the spectral radius
   B. Compute trial y(t+h) with s stages and err
   C. If solution accepted: t = t + h
   D. Adjust h
Stabilized Runge-Kutta-Chebyshev (RKC) is still explicit but can handle more stiffness than RKF. 2nd-order scheme with adaptive h for error control. The number of stages s >= 2 is estimated every 25 steps, or after a rejection, based on the spectral radius. Adaptive s introduces additional variability.
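The stage count in step B comes from a stability condition: the real stability interval of damped RKC grows like ~0.653*s^2 (the Sommeijer-Verwer heuristic), so s is chosen as the smallest value covering h times the estimated spectral radius. A minimal sketch of that estimate (function name hypothetical):

```python
import math

def rkc_num_stages(h, rho, beta=0.653):
    # damped RKC is stable for h*rho up to ~beta*s^2; pick the smallest such s >= 2
    s = math.ceil(math.sqrt(h * rho / beta))
    return max(s, 2)

# mildly stiff:  h*rho = 0.1 needs only the minimum 2 stages
# stiffer:       h*rho = 100 needs 13 stages
```

Since rho varies from cell to cell, neighboring ODEs can demand different stage counts, which is the "additional variability" the slide warns about.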

15 Benchmark Details. Host: Opteron 6134 @ 2.3 GHz (the baseline DVODE is SERIAL!). GPU: Fermi M2050. ODE parameters: relative tolerance = 10^-10, absolute tolerance = 10^-13, dt ~ 10^-7 s. Simulation database constructed from a stochastic Counter-Flow Linear-Eddy Model (LEM-CF) with many different fuel-oxidizer conditions: 19 million samples available, ~300 cells per simulation. 19-species C2H4 mechanism with 167 reversible reactions and 10 quasi-steady species: low stiffness.

16 RKF Solver Performance. RKF is ~2x slower than VODE on the host, but ~2-3x faster than VODE on the GPU. Overhead is less than 0.05%.

17 RKF Solver Performance. One-thread RKF: 20.2x max speed-up. One-block RKF: 10.7x max speed-up. One-thread VODE: 7.7x max speed-up. One-block VODE: 7.3x max speed-up. One-block breaks even at M ~ 10; one-thread breaks even at M ~ 1000.

18 RKC Solver Performance. RKC vs. VODE with a moderately stiff natural-gas mechanism (GRI-Mech 3.0): 53 species, 634 reactions. Reltol = 10^-6; Abstol = 10^-10; dt = 10^-6 s. Homogeneous ignition test problem. Host: Xeon X5650 @ 2.67 GHz; GPU: C2075. 57x speed-up over CPU VODE.

19 RKC Solver Performance. C2H4-air mechanism: 111 species, 1566 reactions. Reltol = 10^-6; Abstol = 10^-10; dt = 10^-6 s. 27x speed-up over CPU VODE: still very effective.

20 RKC Solver Performance. Increase the stiffness by increasing the global time-step: dt = 10^-4 s. RKC is 2.5x slower than VODE for this very stiff problem.

21 Impact of Ordering. The ODE performance above is based on a linear database: neighboring grid points have similar initial conditions, which leads to better SIMD vectorization. Randomizing the database amplifies the variation in the number of steps per ODE within each warp. RKF is strongly impacted; VODE is minimally impacted (it is already running poorly). Randomization slow-down in the one-thread implementation: VODE 12%, RKF 71%. The slow-down is due to increased divergence.

22 Performance Summary. VODE achieved ~8x speed-up on the GPU with the small C2H4 mechanism: its complex algorithm has too much divergence. Fixed-order, explicit schemes performed better on the device. RKF: >20x speed-up over serial host VODE with the 19-species ethylene mechanism. RKC: >55x speed-up with the modestly stiff CH4 mechanism. RKC performance diminished as stiffness increased: very stiff problems still need an implicit solver.

23 ODE Solver Recipe
1. What's best on the host isn't always best on the GPU:
   a) Explicit RK beats implicit VODE on many problems if the global dt is small.
   b) Very stiff problems still need an implicit solver.
2. The one-thread-per-ODE mapping provides effective speed-up assuming 10,000s of concurrent ODEs:
   a) Straightforward translation from single-threaded host code.
   b) One-block-per-ODE could be useful if ODEs vary widely or if there are only 100s of ODEs.
3. Be careful with ordering:
   a) A variable number of steps can severely degrade performance.
   b) Group ODEs with similar conditions in the same warp.
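Point 3 can be sketched directly: sort the batch by a similarity proxy (here temperature, with hypothetical data) before launching, so each 32-wide warp holds ODEs with nearby conditions, then scatter results back through the sort permutation.

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.uniform(300.0, 2500.0, size=256)   # per-cell temperature: a proxy for stiffness
order = np.argsort(T)                       # permutation grouping similar conditions
warps = order.reshape(-1, 32)               # 32 consecutive sorted ODEs map to one warp

# within-warp temperature spread is now much smaller than the full-batch spread,
# so per-warp step counts (and hence divergence) are far more uniform
max_warp_spread = np.ptp(T[warps], axis=1).max()
full_spread = np.ptp(T)
```

After integration, results computed in sorted order are written back to cell order via `order` (e.g., `out[order] = sorted_results`), so the CFD solver never sees the permutation.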

24 Acknowledgments. Funding for the RKF and VODE GPU implementations was provided by the DoD HPCMP's User Productivity Enhancement, Technology Transfer, and Training (PETTT) Program (Contract No. GS04T09DBC0017) under project PP-CFD-KY02-115-P3.

25 More Information. Stone and Davis, "Techniques for solving stiff chemical kinetics on GPUs," AIAA Journal of Propulsion and Power, 2013. Niemeyer and Sung, "Accelerating moderately stiff chemical kinetics in reactive-flow simulations using GPUs," Journal of Computational Physics, vol. 256, pp. 854-871, 2014.

26 Comments?