Performance Analysis of a List-Based Lattice-Boltzmann Kernel

First Talk, MuCoSim, 29 June 2016
Michael Hußnätter
RRZE HPC Group, Friedrich-Alexander University of Erlangen-Nuremberg

Outline
- Lattice Boltzmann
- List-Based Data Layout
- Run Length Encoding
- Roofline Analysis

Lattice Boltzmann Overview (1)
- Originating from the lattice gas automaton
- Discrete time steps and a discrete particle grid
- Particles reside only at the grid nodes
- Grid nodes are connected by velocity vectors ($c_\alpha$)
- The particle distribution is changed in a two-step approach
- Particle Distribution Functions (PDFs) aggregate particles ($f_\alpha$)

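Lattice Boltzmann Overview (2)
Combining the cellular gas automaton and the Boltzmann equation leads to:

$$f_\alpha(x + c_\alpha \Delta t,\ t + \Delta t) - f_\alpha(x, t) = -\omega \left( f_\alpha - f_\alpha^{eq} \right)$$

where $f_\alpha^{eq}$ depends on the macroscopic velocity and density of the lattice.

[Figure: D2Q9 stencil with the directions NW, N, NE, W, C, E, SW, S, SE]

Easy implementation by a two-step approach:
- Stream step: $f_\alpha(x + c_\alpha \Delta t,\ t + \Delta t) = f_\alpha(x,\ t + \Delta t)$
- Collide step: $f_\alpha(x,\ t + \Delta t) = f_\alpha(x, t) - \omega \left( f_\alpha - f_\alpha^{eq} \right)$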
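
The collide step for a single D2Q9 cell can be sketched in Python as follows. The slide does not spell out $f_\alpha^{eq}$; this sketch assumes the standard second-order D2Q9 equilibrium with lattice weights 4/9, 1/9 and 1/36. The direction ordering and the names `C`, `W`, `macroscopic`, `equilibrium` and `collide` are illustrative, not taken from the talk.

```python
# D2Q9 velocity set: C (rest), then N, S, W, E, NW, NE, SW, SE.
# The ordering is an assumption made for this sketch.
C = [(0, 0), (0, 1), (0, -1), (-1, 0), (1, 0),
     (-1, 1), (1, 1), (-1, -1), (1, -1)]
# Standard D2Q9 lattice weights (assumed; not given on the slide).
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def macroscopic(f):
    """Return density and velocity (rho, ux, uy) of one cell."""
    rho = sum(f)
    ux = sum(fa * c[0] for fa, c in zip(f, C)) / rho
    uy = sum(fa * c[1] for fa, c in zip(f, C)) / rho
    return rho, ux, uy


def equilibrium(rho, ux, uy):
    """Second-order equilibrium f_eq for all nine directions."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(C, W):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1 + 3 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def collide(f, omega):
    """Relax the nine PDFs of one cell towards equilibrium."""
    rho, ux, uy = macroscopic(f)
    feq = equilibrium(rho, ux, uy)
    return [fa - omega * (fa - fe) for fa, fe in zip(f, feq)]
```

With this equilibrium the collision leaves density and momentum of the cell unchanged, and `omega = 1` replaces every PDF by its equilibrium value.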

Lattice Boltzmann PDF Streaming
Two possibilities for PDF streaming:
- pull scheme (a cell reads the incoming PDFs from its neighbors)
- push scheme (a cell writes its outgoing PDFs to its neighbors)

Lattice Boltzmann No-Slip Boundary
Reflecting PDFs into the same cell but in the opposite direction.
[Figure: a fluid node (F) next to a solid node (S), shown at the time steps t = 0, t = 0.5 and t = 1]

Lattice Boltzmann Data Layout
[Figure: grid section of 5 x 3 cells with fluid (F) and solid (S) nodes, numbered row by row from the bottom: 0 to 4, 5 to 9, 10 to 14. The cell storage holds the N PDFs of cells 0 to 14 contiguously, followed by the S PDFs of cells 0 to 14.]
- Field data layout (SoA)
- Easy address calculation for neighboring PDFs
- Source and destination cell storage
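
The "easy address calculation" of the full-grid SoA layout can be illustrated with a short Python sketch. The grid size follows the slide; the reduced direction set, its order after N and S, and the names `pdf_index` and `pull_index` are assumptions made for illustration. Domain edges are not handled.

```python
NX, NY = 5, 3            # grid section from the slide: cells 0..14
N_CELLS = NX * NY
# Direction order inside the SoA storage; N before S as on the slide,
# the rest of the order is an assumption for this sketch.
DIRS = ["N", "S", "W", "E"]
# Cell-index offset of the neighbor in each direction (row-major grid).
OFFSET = {"N": NX, "S": -NX, "W": -1, "E": 1}


def pdf_index(direction, cell):
    """Position of one PDF in the SoA array: all N values, then all S, ..."""
    return DIRS.index(direction) * N_CELLS + cell


def pull_index(direction, cell):
    """PDF that streams into `cell` along `direction` (pull scheme).

    A PDF travelling north arrives from the southern neighbor, so the
    source cell is `cell - OFFSET[direction]`. Domain edges are ignored.
    """
    return pdf_index(direction, cell - OFFSET[direction])
```

For cell 7, for example, the incoming N PDF is read from cell 2 and the incoming E PDF from cell 6, both with plain integer arithmetic and without any lookup table.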

Lattice Boltzmann Simple Kernel
foreach cell in cellstorage do
    if cell is fluidcell then
        stream
        collide
    end
end
swap cell storages

List-Layout Motivation
- LBM performance is usually limited by memory capacity and memory bandwidth
- A direct addressing scheme wastes valuable memory resources when it comes to complex domains
- Goal: reduce memory requirements by omitting non-fluid cells, which at the same time eliminates the if in the main loop
- Challenge: convenient address calculation is lost (Godenschwager)

List-Layout Basics
[Figure: the same 5 x 3 grid section (cells 0 to 14) with fluid (F) and solid (S) nodes. The cell storage holds the N PDFs of the fluid cells only: 1, 5, 6, 7, 8, 10, 11, 13, 14. The adjacency list holds one pull pointer N* per stored cell, in the same order.]

List-Layout No-Slip Boundary
No-slip without any intermediate time step.
[Figure: cell storage with the PDFs N, NE, E, SE, S, SW, W, NW and adjacency list with the pull pointers N*, NE*, E*, SE*, S*, SW*, W*, NW*]
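
How a list layout with built-in no-slip boundaries could be constructed is sketched below in Python. The fluid cells are read off the grid on the slides. Everything else is an assumption for illustration: the reduced direction set N, S, W, E, the treatment of cells outside the section as solid, and the name `build_list_layout`. A pull pointer whose upstream neighbor is not a fluid cell is redirected to the opposite-direction PDF of the same cell, which is one way to obtain the reflection without an intermediate time step.

```python
NX, NY = 5, 3
FLUID = {1, 5, 6, 7, 8, 10, 11, 13, 14}   # fluid cells read off the slide
DIRS = ["N", "S", "W", "E"]               # reduced set, assumed order
STEP = {"N": (0, 1), "S": (0, -1), "W": (-1, 0), "E": (1, 0)}
OPPOSITE = {"N": "S", "S": "N", "W": "E", "E": "W"}


def build_list_layout():
    """Return (cells, adjacency) for the list-based layout.

    cells[k] is the grid index of the k-th stored fluid cell.
    adjacency[d * n + k] is the position in the compact SoA storage
    from which cell k pulls its PDF for direction DIRS[d].
    """
    cells = sorted(FLUID)
    slot = {cell: k for k, cell in enumerate(cells)}
    n = len(cells)
    adjacency = [0] * (len(DIRS) * n)
    for d, name in enumerate(DIRS):
        dx, dy = STEP[name]
        for k, cell in enumerate(cells):
            x, y = cell % NX, cell // NX
            sx, sy = x - dx, y - dy            # upstream neighbor
            src = sy * NX + sx
            inside = 0 <= sx < NX and 0 <= sy < NY
            if inside and src in FLUID:
                adjacency[d * n + k] = d * n + slot[src]
            else:
                # No-slip: pull the opposite PDF of the same cell.
                # Cells outside the section count as solid (assumption).
                adjacency[d * n + k] = DIRS.index(OPPOSITE[name]) * n + k
    return cells, adjacency
```

In this sketch the adjacency list is a permutation of the storage positions, so streaming moves every PDF exactly once and no PDF is lost at a wall.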

List-Layout Kernel
foreach cell in cellstorage do
    get pullpointers from adjacencylist
    stream
    collide
end
swap cell storages
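
A runnable toy version of this kernel is sketched below in Python. It is not the D3Q19 kernel from the talk: it assumes a row of six fluid cells with only two directions (E and W), no-slip walls at both ends, and a toy equilibrium of half the cell density per direction. The names `ADJACENCY`, `list_kernel` and `run` are illustrative.

```python
N = 6                                    # fluid cells 0..5 in a row
# Pull pointers into the SoA storage [E0..E5, W0..W5]; the end cells
# pull their own opposite PDF (no-slip walls on both sides, assumed).
ADJACENCY = [6, 0, 1, 2, 3, 4,           # E*: from the western neighbor
             7, 8, 9, 10, 11, 5]         # W*: from the eastern neighbor


def list_kernel(src, dst, omega):
    """One time step: pull via the adjacency list, collide, store."""
    for cell in range(N):                # no fluid check needed
        f_e = src[ADJACENCY[cell]]       # stream (pull)
        f_w = src[ADJACENCY[N + cell]]
        f_eq = 0.5 * (f_e + f_w)         # toy equilibrium: rho / 2
        dst[cell] = f_e - omega * (f_e - f_eq)       # collide
        dst[N + cell] = f_w - omega * (f_w - f_eq)


def run(steps, omega=1.0):
    """Start with all mass in the E PDF of cell 0 and advance `steps` steps."""
    src = [0.0] * (2 * N)
    src[0] = 1.0
    dst = [0.0] * (2 * N)
    for _ in range(steps):
        list_kernel(src, dst, omega)
        src, dst = dst, src              # swap cell storages
    return src
```

The loop body contains no branch on the cell type; the boundary handling is encoded entirely in the pull pointers.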

List-Layout Run Length Encoding
[Figure (animation): one-dimensional grid section of six cells (0 to 5) with fluid (F) and solid (S) nodes. Cell storage: E PDFs of cells 0 to 5, followed by W PDFs of cells 0 to 5. Adjacency list: pull pointers E* for cells 0 to 5, followed by W* for cells 0 to 5. RLE list: 0, 1, 5.]

List-Layout Kernel with RLE
foreach rleblock in rlelist do            // RLE loop
    get pullpointers from adjacencylist
    foreach cell in rleblock do           // one macroscopic loop
        calculate macroscopic values
    end
    foreach cell in rleblock do           // nine collide loops
        collide and store directions pairwise
    end
end
swap cellstorages
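
One plausible way to derive the RLE list from an adjacency list is sketched below in Python. The exact encoding used in the talk cannot be recovered from the slides. The sketch assumes that a block is a run of cells whose pull pointers are consecutive in every direction, so the pointers need to be fetched only once per block. The function name `rle_block_starts` and the example adjacency list (six fluid cells in a row, directions E and W, no-slip walls at both ends) are illustrative.

```python
def rle_block_starts(adjacency, n_cells):
    """Return the first cell of every run-length-encoded block.

    A new block starts whenever, for at least one direction, the pull
    pointer of a cell is not the successor of the previous cell's pointer.
    """
    n_dirs = len(adjacency) // n_cells
    starts = [0]
    for cell in range(1, n_cells):
        for d in range(n_dirs):
            here = adjacency[d * n_cells + cell]
            prev = adjacency[d * n_cells + cell - 1]
            if here != prev + 1:
                starts.append(cell)
                break
    return starts


# Storage order [E0..E5, W0..W5]; the end cells pull their own
# opposite PDF (assumed no-slip walls).
EXAMPLE_ADJACENCY = [6, 0, 1, 2, 3, 4,
                     7, 8, 9, 10, 11, 5]
blocks = rle_block_starts(EXAMPLE_ADJACENCY, 6)
```

Under these assumptions `blocks` is `[0, 1, 5]`, matching the RLE list shown on the slide: the two wall-adjacent cells form blocks of their own and cells 1 to 4 form one block.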

Roofline Analysis

Roofline Emmy's Characteristics
- Maximum floating-point performance for operands in L1:
  - 2 load ports, 1 store port, 1 cy throughput per add (mul)
  - AVX @ 2.2 GHz delivers 88 GFLOP/s
- Achievable memory bandwidth:
  - Determined on a full socket with likwid-bench's copy_avx, which yielded 40.6 GByte/s

Roofline Determining Bottleneck
[Figure: roofline estimation, performance in GFLOP/s (1 to 128, logarithmic) over operational intensity in FLOP/Byte (1/16 to 64, logarithmic); the kernel at 0.43 FLOP/Byte lies in the memory-limited region ("Mem Limit")]
- 198 FLOP / LUP
- 3 * 19 * 8 Byte / LUP = 456 Byte / LUP
- Operational intensity: 0.43 FLOP / Byte
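
The bottleneck determination can be reproduced with a few lines of Python. The sketch uses the simple roofline formula, the minimum of peak performance and intensity times bandwidth, with the figures quoted in the talk for Emmy (88 GFLOP/s, 40.6 GByte/s) and for the kernel (198 FLOP and 456 Byte per lattice update). The function and variable names are illustrative.

```python
PEAK_GFLOPS = 88.0         # peak performance quoted for Emmy
BANDWIDTH_GBS = 40.6       # copy_avx bandwidth on a full socket
FLOP_PER_LUP = 198
BYTE_PER_LUP = 3 * 19 * 8  # 456 Byte per lattice update


def roofline(intensity, peak=PEAK_GFLOPS, bandwidth=BANDWIDTH_GBS):
    """Attainable GFLOP/s for a given operational intensity in FLOP/Byte."""
    return min(peak, intensity * bandwidth)


intensity = FLOP_PER_LUP / BYTE_PER_LUP
attainable = roofline(intensity)
ridge = PEAK_GFLOPS / BANDWIDTH_GBS   # intensity where both limits meet
print(f"{intensity:.2f} FLOP/Byte, {attainable:.1f} GFLOP/s, ridge {ridge:.2f}")
```

The kernel's intensity of about 0.43 FLOP/Byte is well below the ridge point of about 2.17 FLOP/Byte, so the estimate is set by the memory bandwidth at roughly 17.6 GFLOP/s rather than by the floating-point peak.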

Roofline FLOP vs FLUP
- Lattice Boltzmann: more FLOPs will not necessarily lead to a shorter time to solution
- FLOPs per lattice update highly depend on the implementation
- Fluid Lattice UPdate(s) per second (FLUP/s) introduced for comparable results
- The considered implementation requires 456 Byte per FLUP
- Adapted Roofline performance estimation based on the achievable memory bandwidth for a certain number of cores
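
The adapted estimate divides the achievable memory bandwidth by the data volume per fluid lattice update. A minimal Python sketch, using the 456 Byte per FLUP and the full-socket bandwidth of 40.6 GByte/s quoted in the talk (the function name `max_mflups` is illustrative):

```python
BYTE_PER_FLUP = 456


def max_mflups(bandwidth_gbyte_s, byte_per_flup=BYTE_PER_FLUP):
    """Bandwidth-limited upper bound in million fluid lattice updates per second."""
    return bandwidth_gbyte_s * 1e9 / byte_per_flup / 1e6


full_socket = max_mflups(40.6)   # copy_avx bandwidth quoted for a full socket
```

For the full socket this gives an upper bound of about 89 MFLUP/s. For fewer cores, the bandwidth measured at that core count would be passed in instead.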

Roofline TestCase: Channel
- 25,000,000 cells
- High pressure boundary (green)
- Low pressure boundary (red)

Roofline Emmy's MemBandwidth
[Figure: memory bandwidth in GByte/s (0 to 50) over the number of cores (1 to 10); curves: theoretical limit (1600 MHz, quad-channel), copy_avx with 1 load / 1 store, copy_avx with 19 load / 1 store]

Roofline Performance Evaluation
[Figure: performance in MFLUP/s (0 to 100) over the number of cores (1 to 10); curves: Roofline, Roofline 19/1, List LBM]

Upcoming Talk Overview
- Short recap of Lattice Boltzmann
- Detailed ECM performance estimation and evaluation for IvyBridge and Haswell

Backup Slide SoA vs AoS
- Struct of Arrays (SoA): C C C ... N N N ... S S S ...
- Array of Structs (AoS): C N S W E NW NE SW SE, C N S ...
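
The difference between the two layouts comes down to the index formula. The following Python sketch uses the direction order printed on the slide for the AoS example; the function names `soa_index`, `aos_index` and `layout` are illustrative.

```python
# Direction order as printed on the slide for the AoS example.
DIRS = ["C", "N", "S", "W", "E", "NW", "NE", "SW", "SE"]
Q = len(DIRS)


def soa_index(cell, d, n_cells):
    """Struct of Arrays: all PDFs of one direction are contiguous."""
    return d * n_cells + cell


def aos_index(cell, d):
    """Array of Structs: all PDFs of one cell are contiguous."""
    return cell * Q + d


def layout(index_of, n_cells):
    """Labels in memory order, e.g. 'N1' for the N PDF of cell 1."""
    memory = [None] * (Q * n_cells)
    for cell in range(n_cells):
        for d, name in enumerate(DIRS):
            memory[index_of(cell, d)] = f"{name}{cell}"
    return memory
```

For three cells, `layout(lambda c, d: soa_index(c, d, 3), 3)` starts with `C0, C1, C2, N0, N1, N2`, whereas `layout(aos_index, 3)` starts with `C0, N0, S0, W0, E0, NW0, NE0, SW0, SE0, C1`.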