Performance Analysis of a List-Based Lattice-Boltzmann Kernel

First Talk, MuCoSim, 29 June 2016
Michael Hußnätter
RRZE HPC Group, Friedrich-Alexander University of Erlangen-Nuremberg

Outline
- Lattice Boltzmann
- List-Based Data Layout
- Run Length Encoding
- Roofline Analysis

Lattice Boltzmann Overview (1)
- Originating from the lattice gas automaton
- Discrete time steps and discrete particle grid
- Particles only reside at the grid nodes
- Grid nodes are connected by velocity vectors ($c_\alpha$)
- Particle distribution is changed in a two-step approach
- Particle Distribution Functions (PDFs) aggregate particles ($f_\alpha$)

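Lattice Boltzmann Overview (2)
Combining the cellular gas automaton and the Boltzmann equation leads to:

$$f_\alpha(x + c_\alpha \Delta t,\ t + \Delta t) - f_\alpha(x, t) = -\omega \left( f_\alpha - f_\alpha^{eq} \right)$$

where $f_\alpha^{eq}$ depends on the macroscopic velocity and density of the lattice.

[Figure: D2Q9 stencil with the directions NW, N, NE, W, C, E, SW, S, SE]

Easy implementation by a two-step approach, where $\tilde{f}_\alpha$ denotes the post-collision value:

Collide step: $$\tilde{f}_\alpha(x,\ t + \Delta t) = f_\alpha(x, t) - \omega \left( f_\alpha - f_\alpha^{eq} \right)$$

Stream step: $$f_\alpha(x + c_\alpha \Delta t,\ t + \Delta t) = \tilde{f}_\alpha(x,\ t + \Delta t)$$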
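The collide step can be sketched for a single D2Q9 cell as follows. The slide does not spell out the equilibrium distribution, so this sketch assumes the common second-order D2Q9 form with weights 4/9, 1/9 and 1/36. The names `moments`, `equilibrium` and `collide` are illustrative and not taken from the talk.

```python
# D2Q9 velocity set (order: C, E, N, W, S, NE, NW, SW, SE) and lattice weights.
# Assumption: standard D2Q9 weights; the talk does not list them.
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def moments(f):
    """Return density and macroscopic velocity of one cell's nine PDFs."""
    rho = sum(f)
    ux = sum(fa * cx for fa, (cx, _) in zip(f, C)) / rho
    uy = sum(fa * cy for fa, (_, cy) in zip(f, C)) / rho
    return rho, ux, uy


def equilibrium(rho, ux, uy):
    """Second-order equilibrium distribution (assumed form, see lead-in)."""
    usq = ux * ux + uy * uy
    feq = []
    for w, (cx, cy) in zip(W, C):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def collide(f, omega):
    """Relax every PDF towards equilibrium: f - omega * (f - f_eq)."""
    rho, ux, uy = moments(f)
    feq = equilibrium(rho, ux, uy)
    return [fa - omega * (fa - fe) for fa, fe in zip(f, feq)]
```

Because the equilibrium has the same density and momentum as the input PDFs, the collide step leaves both unchanged, which makes a convenient sanity check.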

Lattice Boltzmann PDF Streaming
Two possibilities for PDF streaming:
- pull scheme
- push scheme
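The difference between the two schemes can be illustrated on a small example. The sketch below assumes a hypothetical one-dimensional periodic lattice with three directions (C, E, W). The function names are made up for illustration and do not come from the talk.

```python
# Hypothetical D1Q3 example: direction name -> lattice velocity c_alpha.
VELOCITIES = {"C": 0, "E": 1, "W": -1}


def stream_pull(src):
    """Pull: every cell gathers its PDFs from the upstream neighbours."""
    n = len(src["C"])
    dst = {d: [0.0] * n for d in VELOCITIES}
    for x in range(n):
        for d, c in VELOCITIES.items():
            dst[d][x] = src[d][(x - c) % n]   # remote read, local write
    return dst


def stream_push(src):
    """Push: every cell scatters its PDFs to the downstream neighbours."""
    n = len(src["C"])
    dst = {d: [0.0] * n for d in VELOCITIES}
    for x in range(n):
        for d, c in VELOCITIES.items():
            dst[d][(x + c) % n] = src[d][x]   # local read, remote write
    return dst
```

Both functions produce the same destination field. They differ only in whether the non-local accesses are loads (pull) or stores (push).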

Lattice Boltzmann No-Slip Boundary
Reflecting PDFs into the same cell but the opposite direction.
[Figure: fluid node (F) next to a solid node (S), shown at time steps t = 0, t = 0.5 and t = 1]

Lattice Boltzmann Data Layout
Grid section (cell indices; F = fluid node, S = solid node):
  10 11 12 13 14
   5  6  7  8  9
   0  1  2  3  4
Cell Storage:
  N: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  S: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
- Field data layout (SoA)
- Easy address calculation for neighboring PDFs
- Source and destination cell storage
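The "easy address calculation" of the full-grid SoA layout can be sketched as below. The 5 x 3 grid and the two arrays N and S follow the slide. The function names and the ordering of the arrays in memory are assumptions made for illustration, and boundary handling is left out on purpose.

```python
NX, NY = 5, 3                       # grid section from the slide: cells 0..14
N_CELLS = NX * NY
DIRECTIONS = {"N": (0, 1), "S": (0, -1)}   # the two PDF arrays shown on the slide
DIR_INDEX = {"N": 0, "S": 1}               # assumed order of the arrays in memory


def cell_index(x, y):
    """Row-major cell number, matching the numbering of the grid section."""
    return y * NX + x


def pdf_index(direction, cell):
    """SoA: all PDFs of one direction are stored contiguously."""
    return DIR_INDEX[direction] * N_CELLS + cell


def pull_source(direction, cell):
    """Storage index of the PDF that `cell` pulls for `direction`.

    The upstream neighbour is a fixed offset away, so no lookup table is needed.
    Cells at the edge of the grid are not handled in this sketch.
    """
    dx, dy = DIRECTIONS[direction]
    return pdf_index(direction, cell - (dy * NX + dx))
```

For cell 7, the N-moving PDF comes from cell 2 below it and the S-moving PDF from cell 12 above it. Both offsets are constant for all interior cells.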

Lattice Boltzmann Simple Kernel
foreach cell in cellstorage do
    if cell is fluidcell then
        stream and collide
    end
end
swap cell storages

List-Layout Motivation
- LBM performance is usually limited by memory capacity and memory bandwidth
- A direct addressing scheme wastes valuable memory resources on complex domains
- Goal: reduce memory requirements by omitting non-fluid cells, which at the same time eliminates the if in the main loop
- Challenge: convenient address calculation is lost
(Godenschwager)

List-Layout Basics
Grid section (cell indices; F = fluid node, S = solid node):
  10 11 12 13 14
   5  6  7  8  9
   0  1  2  3  4
Cell Storage (one PDF N per stored cell): 1 5 6 7 8 10 11 13 14
Adjacency List (one entry N* per stored cell): 1 5 6 7 8 10 11 13 14

List-Layout No-Slip Boundary
No-slip without any intermediate time step.
[Figure: cell storage with the PDFs N, NE, E, SE, S, SW, W, NW and the adjacency list with the matching entries N*, NE*, E*, SE*, S*, SW*, W*, NW*]

List-Layout Kernel
foreach cell in cellstorage do
    get pullpointers from adjacencylist
    stream and collide
end
swap cell storages
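One way to build such an adjacency list is sketched below for the 5 x 3 grid section and the stored cells 1 5 6 7 8 10 11 13 14 from the slides. Several points are assumptions: the eight non-rest D2Q9 directions, the SoA slot numbering, treating everything outside the grid section as solid, and all function names. No-slip is folded into the pull pointers: if the upstream neighbour is not a fluid cell, the pointer refers to the opposite direction of the same cell.

```python
NX, NY = 5, 3
FLUID = [1, 5, 6, 7, 8, 10, 11, 13, 14]           # stored cells from the slide
# Eight moving directions; the rest direction needs no pull pointer.
DIRS = {"N": (0, 1), "NE": (1, 1), "E": (1, 0), "SE": (1, -1),
        "S": (0, -1), "SW": (-1, -1), "W": (-1, 0), "NW": (-1, 1)}
OPPOSITE = {"N": "S", "NE": "SW", "E": "W", "SE": "NW",
            "S": "N", "SW": "NE", "W": "E", "NW": "SE"}
DIR_ID = {d: i for i, d in enumerate(DIRS)}
LIST_POS = {cell: i for i, cell in enumerate(FLUID)}   # grid index -> list index
N_FLUID = len(FLUID)


def slot(direction, list_pos):
    """SoA position of one PDF inside the compact cell storage."""
    return DIR_ID[direction] * N_FLUID + list_pos


def build_adjacency():
    """adjacency[slot] is the slot to pull from; solid neighbours bounce back."""
    adjacency = [0] * (len(DIRS) * N_FLUID)
    for cell, pos in LIST_POS.items():
        x, y = cell % NX, cell // NX
        for d, (dx, dy) in DIRS.items():
            sx, sy = x - dx, y - dy                # upstream neighbour
            inside = 0 <= sx < NX and 0 <= sy < NY
            src = sy * NX + sx if inside else None
            if src in LIST_POS:
                adjacency[slot(d, pos)] = slot(d, LIST_POS[src])
            else:                                  # solid node or outside the section
                adjacency[slot(d, pos)] = slot(OPPOSITE[d], pos)
    return adjacency


def stream(src, adjacency):
    """Pull streaming over the compact storage; no fluid/solid test in the loop."""
    return [src[a] for a in adjacency]
```

With this construction every slot is read exactly once, so the streaming step is a permutation of the stored PDFs and conserves their sum.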

List-Layout Run Length Encoding
Grid section (F = fluid node, S = solid node): 0 1 2 3 4 5
Cell Storage:
  E: 0 1 2 3 4 5
  W: 0 1 2 3 4 5
Adjacency List:
  E*: 0 1 2 3 4 5
  W*: 0 1 2 3 4 5
RLE List: 0 1 5
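A possible reading of the RLE list is sketched below: it stores the first cell of every block in which the pull pointers of all directions are consecutive, so inside a block the pointers can be incremented instead of loaded. The sketch assumes that the six stored cells form one row enclosed by solid nodes at both ends, and that the E array precedes the W array in memory. Neither detail is stated on the slide, and the function names are illustrative.

```python
N_CELLS = 6
E, W = 0, 1                         # direction ids; assumed SoA order: E, then W


def slot(direction, cell):
    """SoA position of one PDF in the cell storage."""
    return direction * N_CELLS + cell


def build_adjacency():
    """Pull pointers for a row of six fluid cells between two solid nodes."""
    adjacency = {E: [], W: []}
    for cell in range(N_CELLS):
        # The E-moving PDF comes from the west neighbour, the W-moving one
        # from the east neighbour; at a wall the opposite PDF of the same
        # cell is used instead (no-slip).
        adjacency[E].append(slot(E, cell - 1) if cell > 0 else slot(W, cell))
        adjacency[W].append(slot(W, cell + 1) if cell < N_CELLS - 1 else slot(E, cell))
    return adjacency


def run_length_encode(adjacency):
    """Return the first cell of every block with consecutive pull pointers."""
    starts = [0]
    for cell in range(1, N_CELLS):
        consecutive = all(ptrs[cell] == ptrs[cell - 1] + 1
                          for ptrs in adjacency.values())
        if not consecutive:
            starts.append(cell)
    return starts
```

Under these assumptions the two wall cells each form their own block and the four interior cells share one, which gives the block starts 0, 1 and 5 shown on the slide.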

List-Layout Kernel with RLE
foreach rleblock in rlelist do            // RLE loop
    get pullpointers from adjacencylist
    foreach cell in rleblock do           // one macroscopic loop
        calculate macroscopic values
    end
    foreach cell in rleblock do           // nine collide loops
        collide and store directions pairwise
    end
end
swap cellstorages

Roofline Analysis

Roofline Emmy's Characteristics
Maximal floating point performance for operands in L1:
- 2 load ports, 1 store port, 1 cy throughput per add (mul)
- AVX @ 2.2 GHz delivers 88 GFLOP/s
Achievable memory bandwidth:
- Determined on a full socket with likwid-bench's copy_avx, which yielded 40.6 GByte/s

Roofline Determining Bottleneck
[Figure: Roofline estimation, GFLOP/s (logarithmic, 1 to 128) over operational intensity in FLOP/Byte (logarithmic, 1/16 to 64); the kernel is marked at 0.43 FLOP/Byte, below the "Mem Limit" part of the roofline]
- 198 FLOP / LUP
- 3 * 19 * 8 Byte / LUP = 456 Byte / LUP
- Operational Intensity: 0.43 FLOP / Byte
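The position of the kernel under the roofline can be worked out from the numbers given in this talk. The symbols $I$, $P_{\max}$ and $b_S$ are introduced here for illustration, and the roofline is taken in its usual form as the minimum of peak performance and intensity times bandwidth.

```latex
I = \frac{198\ \mathrm{FLOP/LUP}}{456\ \mathrm{Byte/LUP}} \approx 0.43\ \mathrm{FLOP/Byte}

P = \min\left(P_{\max},\ I \cdot b_S\right)
  = \min\left(88,\ \tfrac{198}{456} \cdot 40.6\right)\ \mathrm{GFLOP/s}
  \approx 17.6\ \mathrm{GFLOP/s}
```

Since the memory term is far below the 88 GFLOP/s peak, the estimate places the kernel in the memory-bound regime.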

Roofline FLOP vs FLUP
- Lattice Boltzmann: more FLOPs will not necessarily lead to a shorter time to solution
- FLOPs per lattice update depend heavily on the implementation
- Fluid Lattice UPdate(s) per second (FLUP/s) introduced for comparable results
- Considered implementation requires 456 Byte per FLUP
- Adapted Roofline performance estimation based on the achievable memory bandwidth for a certain number of cores
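The adapted estimate amounts to dividing a memory bandwidth by the data volume per fluid lattice update. A minimal sketch follows; the function name is hypothetical, and only the full-socket bandwidth of 40.6 GByte/s is taken from this talk.

```python
BYTES_PER_FLUP = 3 * 19 * 8         # 456 Byte per fluid lattice update


def roofline_mflups(bandwidth_gbyte_s):
    """Upper performance bound in MFLUP/s for a memory bandwidth in GByte/s."""
    return bandwidth_gbyte_s * 1e9 / BYTES_PER_FLUP / 1e6


full_socket = roofline_mflups(40.6)   # achievable bandwidth of a full socket
print(f"{full_socket:.1f} MFLUP/s")   # prints "89.0 MFLUP/s"
```

Passing the bandwidth measured for a given number of cores yields the corresponding bound for that core count.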

Roofline Test Case: Channel
- 25,000,000 cells
- High pressure boundary (green)
- Low pressure boundary (red)

Roofline Emmy's Memory Bandwidth
[Figure: memory bandwidth in GByte/s (0 to 50) over the number of cores (1 to 10); series: theoretical limit (1600 MHz, quad-channel), copy_avx with 1 load / 1 store, copy_avx with 19 loads / 1 store]

Roofline Performance Evaluation
[Figure: performance in MFLUP/s (0 to 100) over the number of cores (1 to 10); series: Roofline, Roofline 19/1, List LBM]

Upcoming Talk Overview
- Short recap of Lattice Boltzmann
- Detailed ECM performance estimation and evaluation for IvyBridge and Haswell


Backup Slide SoA vs AoS
Struct of Arrays (SoA): C C C N N N S S S
Array of Structs (AoS): C N S W E NW NE SW SE C N S