Performance Analysis of a List-Based Lattice-Boltzmann Kernel


1 Performance Analysis of a List-Based Lattice-Boltzmann Kernel. First Talk, MuCoSim, 29 June 2016. Michael Hußnätter, RRZE HPC Group, Friedrich-Alexander University of Erlangen-Nuremberg.

2 Outline: Lattice Boltzmann, List-Based Data Layout, Run Length Encoding, Roofline Analysis.

3-4 Lattice Boltzmann Overview (1): Originating from the lattice gas automaton. Discrete time steps and a discrete particle grid; particles reside only at the grid nodes. Grid nodes are connected by velocity vectors (c_α). The particle distribution is changed in a two-step approach; Particle Distribution Functions (PDFs) aggregate particles (f_α).

5 Lattice Boltzmann Overview (2): Combining the cellular gas automaton with the Boltzmann equation leads to f_α(x + c_α Δt, t + Δt) - f_α(x, t) = -ω (f_α - f_α^eq), where f_α^eq depends on the macroscopic velocity and density of the lattice. [Figure: D2Q9 stencil with center C and the directions N, NE, E, SE, S, SW, W, NW.] Easy implementation by a two-step approach, with f_α* denoting the post-collision PDF. Collide step: f_α*(x, t) = f_α(x, t) - ω (f_α(x, t) - f_α^eq). Stream step: f_α(x + c_α Δt, t + Δt) = f_α*(x, t).
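The slide only notes that f_α^eq depends on the macroscopic density and velocity. For reference, a common choice is the standard second-order BGK equilibrium; this exact form is an assumption for illustration, not taken from the talk:

    % Moments of the PDFs and the generic second-order BGK equilibrium;
    % w_\alpha are the lattice weights, c_s the lattice speed of sound.
    \rho = \sum_\alpha f_\alpha, \qquad
    \rho\,\mathbf{u} = \sum_\alpha \mathbf{c}_\alpha f_\alpha, \qquad
    f_\alpha^{eq} = w_\alpha\,\rho \left[ 1
        + \frac{\mathbf{c}_\alpha \cdot \mathbf{u}}{c_s^2}
        + \frac{(\mathbf{c}_\alpha \cdot \mathbf{u})^2}{2 c_s^4}
        - \frac{\mathbf{u} \cdot \mathbf{u}}{2 c_s^2} \right]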

6-8 Lattice Boltzmann PDF Streaming: Two possibilities for PDF streaming, the pull scheme (each cell reads the PDFs from its neighboring cells and writes locally) and the push scheme (each cell reads locally and writes to its neighboring cells).
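A minimal sketch of the two streaming variants for one PDF direction on a direct-addressed grid; f_src/f_dst are the SoA slices of that direction and nbr(c, d) maps a cell to its neighbor in direction d. All names are illustrative, not the talk's code:

    /* pull: every cell gathers PDF d from its upstream neighbor */
    typedef long (*nbr_fn)(long cell, int dir);

    static void stream_pull(double *f_dst, const double *f_src,
                            long num_cells, int inv_dir, nbr_fn nbr)
    {
        for (long c = 0; c < num_cells; ++c)
            f_dst[c] = f_src[nbr(c, inv_dir)];
    }

    /* push: every cell scatters its PDF d to the downstream neighbor */
    static void stream_push(double *f_dst, const double *f_src,
                            long num_cells, int dir, nbr_fn nbr)
    {
        for (long c = 0; c < num_cells; ++c)
            f_dst[nbr(c, dir)] = f_src[c];
    }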

9-11 Lattice Boltzmann No-Slip Boundary: Reflecting PDFs into the same cell but into the opposite direction. [Figure: a fluid node F next to a solid node S, shown at time steps t = 0 and t = 1; the PDF pointing into the solid node is reflected back into the fluid node.]

12 Lattice Boltzmann Data Layout: Field data layout (SoA), easy address calculation for neighboring PDFs, separate source and destination cell storage. [Figure: grid section with fluid (F) and solid (S) nodes; the cell storage holds one array per direction, e.g. all N-direction PDFs followed by all S-direction PDFs, including entries for the solid cells.]

13 Lattice Boltzmann Simple Kernel (pseudocode):
    foreach cell in cellstorage do
        if cell is fluidcell then
            stream and collide
        end
    end
    swap cell storages
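A minimal C sketch of this direct-addressed kernel for D2Q9, assuming an SoA layout src[dir * ncells + cell] with cell = y * nx + x, a fluid-flag array, and pull streaming; boundary and no-slip handling are omitted, and all names are illustrative rather than the talk's actual code:

    /* Direct-addressed D2Q9 sweep: SoA field, fluid flag test, pull stream,
     * BGK collision. Interior cells only; no-slip handling omitted. */
    enum { Q = 9 };
    static const int    cx[Q] = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };
    static const int    cy[Q] = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };
    static const double w[Q]  = { 4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                                  1.0/36, 1.0/36, 1.0/36, 1.0/36 };

    void lbm_sweep(double *dst, const double *src, const unsigned char *is_fluid,
                   long nx, long ny, double omega)
    {
        const long ncells = nx * ny;
        for (long y = 1; y < ny - 1; ++y) {
            for (long x = 1; x < nx - 1; ++x) {
                const long c = y * nx + x;      /* direct address of the cell */
                if (!is_fluid[c]) continue;     /* the if the list layout eliminates */

                double f[Q], rho = 0.0, ux = 0.0, uy = 0.0;
                for (int d = 0; d < Q; ++d) {
                    /* pull: read PDF d from the upstream neighbor */
                    const long n = (y - cy[d]) * nx + (x - cx[d]);
                    f[d] = src[(long)d * ncells + n];
                    rho += f[d];
                    ux  += cx[d] * f[d];
                    uy  += cy[d] * f[d];
                }
                ux /= rho; uy /= rho;

                for (int d = 0; d < Q; ++d) {   /* BGK collision, second-order f_eq */
                    const double cu  = 3.0 * (cx[d] * ux + cy[d] * uy);
                    const double feq = w[d] * rho *
                        (1.0 + cu + 0.5 * cu * cu - 1.5 * (ux * ux + uy * uy));
                    dst[(long)d * ncells + c] = f[d] - omega * (f[d] - feq);
                }
            }
        }
        /* the caller swaps dst and src afterwards */
    }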

14 List-Layout Motivation: LBM performance is usually limited by memory capacity and memory bandwidth, and the direct addressing scheme wastes valuable memory in complex domains. Goal: reduce memory requirements by omitting non-fluid cells, which at the same time eliminates the if in the main loop. Challenge: the convenient address calculation is lost (Godenschwager et al., SC13).

15-18 List-Layout Basics: Only fluid cells are kept in the cell storage; an additional adjacency list stores, for every PDF of every fluid cell, a pull pointer (N*, ...) to the location the PDF is streamed from. [Figure: grid section with fluid (F) and solid (S) nodes, the compacted cell storage with the N-direction PDFs of the fluid cells, and the adjacency list with the corresponding pull pointers N*.]
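A sketch of the data structures this layout implies, using index-based pull pointers and Q = 19 (consistent with the 3 * 19 * 8 Byte/LUP figure later in the talk); names and types are illustrative assumptions:

    /* List-based (indirect addressing) layout: only fluid cells are stored,
     * plus one pull index per stored PDF. */
    #include <stdlib.h>

    enum { Q = 19 };

    typedef struct {
        long    num_fluid;   /* number of fluid cells (no solid cells stored)      */
        double *f_src;       /* SoA cell storage, Q * num_fluid PDFs (source)      */
        double *f_dst;       /* SoA cell storage, Q * num_fluid PDFs (destination) */
        long   *adj;         /* adjacency list: Q * num_fluid pull indices         */
    } lbm_list_t;

    static lbm_list_t lbm_list_alloc(long num_fluid)
    {
        lbm_list_t l;
        l.num_fluid = num_fluid;
        l.f_src = malloc((size_t)(Q * num_fluid) * sizeof(double));
        l.f_dst = malloc((size_t)(Q * num_fluid) * sizeof(double));
        l.adj   = malloc((size_t)(Q * num_fluid) * sizeof(long));
        return l;
    }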

19-20 List-Layout No-Slip Boundary: No-slip without any intermediate time step. [Figure: cell storage with the eight non-center PDFs (N, NE, E, SE, S, SW, W, NW) of a fluid cell and the adjacency list with the corresponding pull pointers (N*, NE*, ...); for directions blocked by a solid neighbor, the pull pointer refers back to the opposite-direction PDF of the same cell.]
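A sketch of how the adjacency list can encode this, so that the no-slip reflection needs no extra branch in the kernel; neighbor_fluid_index() and inv[] are assumed helpers, not part of the talk's code:

    /* Build the pull index for PDF d of fluid cell c.
     * neighbor_fluid_index(c, d) returns the compacted index of the upstream
     * neighbor, or -1 if that node is solid; inv[d] is the opposite direction. */
    extern long neighbor_fluid_index(long c, int d);   /* assumed helper */
    extern const int inv[];                            /* assumed table  */

    static long pull_index(long c, int d, long num_fluid)
    {
        long n = neighbor_fluid_index(c, d);
        if (n >= 0)
            return (long)d * num_fluid + n;        /* regular pull from neighbor     */
        else
            return (long)inv[d] * num_fluid + c;   /* bounce back: own opposite PDF  */
    }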

21 List-Layout Kernel (pseudocode):
    foreach cell in cellstorage do
        get pull pointers from adjacency list
        stream and collide
    end
    swap cell storages
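A minimal C sketch of this sweep: every PDF is gathered through the adjacency list, so the loop body contains no fluid/solid branch. The collision is reduced to a placeholder (see the direct-addressed sketch above for a full BGK collision); names are illustrative only:

    enum { Q = 19 };

    void lbm_list_sweep(double *f_dst, const double *f_src, const long *adj,
                        long num_fluid, double omega)
    {
        for (long c = 0; c < num_fluid; ++c) {
            double f[Q], rho = 0.0;
            for (int d = 0; d < Q; ++d) {
                f[d] = f_src[adj[(long)d * num_fluid + c]];  /* pull via adjacency list */
                rho += f[d];
            }
            for (int d = 0; d < Q; ++d) {
                double feq = rho / Q;                        /* placeholder equilibrium */
                f_dst[(long)d * num_fluid + c] = f[d] - omega * (f[d] - feq);
            }
        }
        /* the caller swaps f_src and f_dst afterwards */
    }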

22-36 List-Layout Run Length Encoding: Within a run of consecutive fluid cells the pull pointers of a given direction are consecutive as well, so they do not have to be stored and loaded per cell. An RLE list records these runs; the pull pointers are fetched from the adjacency list only once per run and then simply advanced for the following cells. [Figure: grid section with fluid (F) and solid (S) nodes; cell storage with the E- and W-direction PDFs of a run of six fluid cells, the corresponding adjacency-list pointers (E*, W*), and the RLE list entries that replace the per-cell pointers.]
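A sketch of what an RLE entry could look like under this scheme, based on the figure and the kernel on the next slide; this is an assumption, not the talk's actual data structure:

    /* One run-length-encoded block of consecutive fluid cells: only the
     * starting pull index per direction is needed, since inside the run
     * the pull indices advance by one per cell. */
    enum { Q = 19 };

    typedef struct {
        long start;        /* index of the first fluid cell of the run     */
        long length;       /* number of consecutive fluid cells in the run */
        long pull[Q];      /* starting pull index per direction            */
    } rle_block_t;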

37 List-Layout Kernel with RLE (pseudocode):
    foreach rleblock in rlelist do                // RLE loop
        get pull pointers from adjacency list
        foreach cell in rleblock do               // one macroscopic loop
            calculate macroscopic values
        end
        foreach cell in rleblock do               // nine collide loops
            collide and store directions pairwise
        end
    end
    swap cell storages
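A structural C sketch of this kernel, reusing Q and the hypothetical rle_block_t from the sketch above. The talk splits the collision into nine pairwise loops; here a single loop with a placeholder equilibrium stands in, so only the loop structure is meant to be accurate:

    void lbm_rle_sweep(double *f_dst, const double *f_src,
                       const rle_block_t *rle, long num_blocks,
                       long num_fluid, double omega)
    {
        for (long b = 0; b < num_blocks; ++b) {               /* RLE loop          */
            const rle_block_t *blk = &rle[b];
            double rho[blk->length];                          /* per-run scratch   */

            for (long i = 0; i < blk->length; ++i) {          /* macroscopic loop  */
                double r = 0.0;
                for (int d = 0; d < Q; ++d)
                    r += f_src[blk->pull[d] + i];             /* consecutive pulls */
                rho[i] = r;                                   /* velocity omitted  */
            }

            for (long i = 0; i < blk->length; ++i) {          /* collide loop(s)   */
                for (int d = 0; d < Q; ++d) {
                    double f   = f_src[blk->pull[d] + i];
                    double feq = rho[i] / Q;                  /* placeholder f_eq  */
                    f_dst[(long)d * num_fluid + blk->start + i]
                        = f - omega * (f - feq);
                }
            }
        }
    }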

38 Roofline Analysis

39 Roofline: Emmy's Characteristics. Maximal floating-point performance for operands in L1: 2 load ports, 1 store port, 1-cycle throughput per add (mul); at 2.2 GHz this delivers 88 GFLOP/s. Achievable memory bandwidth: determined on a full socket with likwid-bench's copy_avx benchmark, yielding 40.6 GByte/s.

40-43 Roofline: Determining the Bottleneck. The kernel performs 198 FLOP per lattice update (LUP) and transfers 3 * 19 * 8 Byte/LUP = 456 Byte/LUP, giving an operational intensity of about 0.43 FLOP/Byte. [Figure: roofline plot of GFLOP/s over operational intensity (FLOP/Byte), with the memory-bandwidth limit drawn in; the kernel's operational intensity lies well below the ridge point, i.e. in the memory-bound region.]
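A quick check with these numbers (a back-of-the-envelope consequence of the figures on this slide, not an additional measurement):

    P \leq \min\left(P_{\mathrm{peak}},\; I \cdot b_S\right)
        = \min\left(88~\mathrm{GFLOP/s},\;
            0.43~\mathrm{FLOP/Byte} \times 40.6~\mathrm{GByte/s}\right)
        \approx 17.6~\mathrm{GFLOP/s},
    \qquad
    \frac{40.6~\mathrm{GByte/s}}{456~\mathrm{Byte/FLUP}} \approx 89~\mathrm{MFLUP/s}

So the kernel is far from the 88 GFLOP/s roof and clearly limited by memory bandwidth.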

44 Roofline: FLOP vs. FLUP. For Lattice Boltzmann, more FLOP/s does not necessarily lead to a shorter time to solution, since the FLOPs per lattice update depend strongly on the implementation. Fluid Lattice UPdates per second (FLUP/s) are therefore introduced for comparable results. The considered implementation requires 456 Byte per FLUP. The roofline performance estimation is adapted accordingly, based on the achievable memory bandwidth for the given number of cores.

45 Roofline Test Case: Channel. 25,000,000 cells; high-pressure boundary (green in the figure) and low-pressure boundary (red).

46 Roofline: Emmy's Memory Bandwidth. [Figure: memory bandwidth in GByte/s over the number of cores; curves for the theoretical limit of 1600 MHz quad-channel memory, copy_avx with 1 load / 1 store stream, and copy_avx with 19 load / 1 store streams.]

47 Roofline Performance Evaluation. [Figure: performance in MFLUP/s over the number of cores; curves for the roofline estimate, the roofline estimate based on the 19-load/1-store bandwidth ("Roofline 19/1"), and the measured list-based LBM kernel.]

48-49 Upcoming Talk Overview: Short recap of Lattice Boltzmann; detailed ECM performance estimation and evaluation for Ivy Bridge and Haswell.


51 Backup Slide: SoA vs. AoS. [Figure: Structure of Arrays (SoA) stores each PDF direction contiguously (all C values, then all N, then all S, ...), while Array of Structures (AoS) stores all directions of one cell together (C, N, S, W, E, NW, NE, SW, SE per cell).]
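A minimal illustration of the two layouts for D2Q9 as plain C declarations (array sizes and names are illustrative only):

    /* Structure of Arrays (SoA): one contiguous array per PDF direction,
     * the layout used by the field and list kernels in this talk. */
    enum { Q = 9, NCELLS = 1024 };

    double f_soa[Q][NCELLS];            /* f_soa[dir][cell] */

    /* Array of Structures (AoS): all directions of one cell stored together. */
    typedef struct {
        double c, n, s, w, e, nw, ne, sw, se;
    } cell_pdfs_t;

    cell_pdfs_t f_aos[NCELLS];          /* f_aos[cell].dir  */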

Reference: C. Godenschwager, F. Schornbaum, M. Bauer, H. Köstler, U. Rüde: A Framework for Hybrid Parallel Flow Simulations with a Trillion Cells in Complex Geometries. SC13, November 21, 2013.
