Performance Evaluation of Scientific Applications on POWER8
16 November 2014
Andrew V. Adinetz 1, Paul F. Baumeister 1, Hans Böttiger 3, Thorsten Hater 1, Thilo Maurer 3, Dirk Pleiter 1, Wolfram Schenck 4, and S. Fabio Schifano 2
1 Jülich Supercomputing Centre, Forschungszentrum Jülich, 52425 Jülich, Germany
2 Dip. di Matematica e Informatica, Università di Ferrara and INFN, Ferrara, Italy
3 IBM Deutschland Research & Development GmbH, 71032 Böblingen, Germany
4 SimLab Neuroscience, IAS, Forschungszentrum Jülich, 52425 Jülich, Germany
Outline
- POWER8 architecture intro
- Micro-benchmarks
  - STREAM memory bandwidth
  - OpenMP overheads
- Scientific HPC applications
  - LBM: Lattice Boltzmann method for fluid dynamics
  - NEST: neuroscience network simulator
  - MAFIA: subspace-clustering data analysis tool
POWER8 chip architecture
- 3.42 GHz core frequency
- 10 cores per socket (max. 12)
- core architecture: POWER ISA v2.07
  - 8-way SMT
  - 2x FXU, 2x LSU, 2x LU
  - 2x VMX, 4x FPU
Memory hierarchy of POWER8 (with the PMU events used to track data flow and load-request propagation)
- Registers: LD_CMPL + LSU_LDX
- L1$: 64 KiB per core
- L2$: 512 KiB per core (DATA_FROM_L2)
- L3$: 8 MiB per core, shared (DATA_ALL_FROM_L3, L3_PREF_ALL)
- Centaur L4$: max. 128 MiB (DATA_ALL_FROM_{RDL}L4)
- Memory: max. 1 TiB (DATA_ALL_FROM_{RDL}MEM)
- Coherency via remote cast-outs
STREAM benchmark
- with 2 streams:
  - copy:  c[i] = a[i];
  - scale: b[i] = s*c[i];
- with 3 streams:
  - sum:   a[i] = b[i] + c[i];
  - triad: a[i] = s*b[i] + c[i];
- efficient prefetching expected
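The four kernels above can be sketched in plain C. This is a minimal sketch only: the real benchmark times each loop over arrays far larger than the caches and repeats the measurement many times; the function names are illustrative, not from the STREAM source.

```c
/* The four STREAM kernels; s is the scale factor.
   Sketch: the benchmark proper times these loops over huge arrays. */
void stream_copy(double *c, const double *a, int n) {
    for (int i = 0; i < n; ++i) c[i] = a[i];
}
void stream_scale(double *b, const double *c, double s, int n) {
    for (int i = 0; i < n; ++i) b[i] = s * c[i];
}
void stream_sum(double *a, const double *b, const double *c, int n) {
    for (int i = 0; i < n; ++i) a[i] = b[i] + c[i];
}
void stream_triad(double *a, const double *b, const double *c, double s, int n) {
    for (int i = 0; i < n; ++i) a[i] = s * b[i] + c[i];
}
```

Copy and scale move two streams per iteration, sum and triad three, which is why they are reported separately.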
STREAM memory traffic, copy kernel
[Figure: bytes read from each level of the hierarchy (registers, L1$, L2$, L3$, memory) as a function of array size n = 2^11 to 2^25.]
STREAM performance
- sampled 1000x per configuration
[Figure: bandwidth (GB/s), median values, vs. thread count (1 to 160, marking #cores, #LSU, #HWT). Inset: distributions for 40 threads of copy, scale, sum, and triad, spanning roughly 200 to 320 GB/s.]
OpenMP overheads
Explicit vs. implicit constructs: overhead = t(n) - t_serial / n
[Figure: overhead (µs) vs. thread count (1 to 160, marking #cores and #HWT) for the constructs atomic, barrier, critical, ordered, for, parallel, reduction, and single.]
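The overhead metric on this slide can be written out directly. A sketch of the definition only: t(n) is the measured time of the construct on n threads, t_serial the serial reference time, and t_serial / n the ideal parallel time; the function name is illustrative.

```c
/* Overhead of an OpenMP construct on n threads, as defined above:
   measured parallel time minus the ideal time t_serial / n. */
double omp_overhead(double t_n, double t_serial, int n) {
    return t_n - t_serial / n;
}
```

A perfectly scaling construct thus has overhead zero, and any serialization or synchronization cost shows up as a positive value.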
Lattice Boltzmann method
- fluid dynamics: discrete sampling of positions and velocities
- complex and irregular structures
- multi-phase flow possible
Example: Rayleigh-Taylor instability
Lattice Boltzmann method, D2Q37
- two kernels:
  - collide (6200 FP ops per lattice site)
  - propagate (bandwidth limited)
- on most platforms t_collide >> t_propagate
- high degree of parallelism
- high arithmetic intensity
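The propagate kernel can be sketched as pure data movement. This is a hypothetical simplified layout, not the authors' code: NPOP population planes over a 1-D site index, each shifted by a per-velocity offset; the real D2Q37 kernel uses 37 populations, 2-D offsets, and halo exchange.

```c
#define NPOP 4   /* hypothetical; D2Q37 has 37 populations */

/* Shift each population plane by its lattice offset.  Pure data
   movement, no arithmetic: this is why propagate is bandwidth bound. */
void propagate(double *dst, const double *src,
               const int *offset, int nsites) {
    for (int p = 0; p < NPOP; ++p)
        for (int i = 0; i < nsites; ++i) {
            int j = i + offset[p];
            if (j >= 0 && j < nsites)      /* sketch: no halo handling */
                dst[p * nsites + j] = src[p * nsites + i];
        }
}
```

Collide, by contrast, reads all populations of one site and performs thousands of FP operations on them, so it is compute bound and profits from vectorization.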
LBM D2Q37 performance
- best results ~200 GF/s for collide at 40 or 80 threads
- best results ~75 GB/s for propagate at 20 threads
- shortest runtime at 40 threads
[Figure: collide (GF/s) and propagate (GB/s) vs. thread count (1 to 160, marking #cores, #FPU/LSU, #HWT).]
MAFIA
- a parallelized algorithm to identify subspace clusters in high-dimensional spaces
- kernel: pcount (bitwise AND)
- GPU acceleration possible
Adinetz et al., Euro-Par'13 Proceedings, pp. 838-849
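The pcount kernel can be sketched as bitmap intersection. The data layout here is an assumption for illustration: one bitmap per candidate dimension, with bit i set iff point i falls into that dimension's window, so the candidate's point count is the popcount of the AND of all its bitmaps. __builtin_popcountll is a GCC/Clang builtin.

```c
#include <stdint.h>

/* Count the points contained in a candidate dense unit: AND the
   per-dimension bitmaps word by word and sum the set bits. */
int pcount(const uint64_t *bitmaps, int ndims, int nwords) {
    int count = 0;
    for (int w = 0; w < nwords; ++w) {
        uint64_t x = bitmaps[0 * nwords + w];
        for (int d = 1; d < ndims; ++d)
            x &= bitmaps[d * nwords + w];
        count += __builtin_popcountll(x);  /* GCC/Clang builtin */
    }
    return count;
}
```

The inner loop is a stream of independent 64-bit AND operations, which maps well onto both VMX units and GPU warps, consistent with the GPU-acceleration note above.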
MAFIA performance
[Figure: speedup vs. thread count (1 to 160) for strides 1, 2, 4, and 8.]
NEST: NEural Simulation Tool (nest-initiative.org)
- discrete event simulator on a distributed graph
- uses a domain-specific interpreter language (SLI)
- stochastic input
- hybrid parallelization with M x T virtual processes
- n = 9375 neurons per MPI process
- dry-run mode for M processes
- kernels:
  - neuron update (some FP ops)
  - spike delivery (dominant)
Brette et al., J. Comp. Neuroscience, Vol. 23, 3 (2007)
NEST total runtime
- control the binding of threads to cores!
- use 160/T as stride
- dry-run mode with M=512 and M=16,384
[Figure: runtime (s) vs. thread count (10 to 160) for strides 1, 2, 4, and 8; left panel M=512 (up to ~60 s), right panel M=16,384 (up to ~1200 s).]
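The stride rule can be sketched as a binding helper. This is a hypothetical function written for illustration, assuming 160 hardware threads (2 sockets x 10 cores x SMT8) as on the evaluated node; the slide's rule spreads T software threads evenly across them, filling cores before SMT lanes.

```c
#define N_HWT 160   /* hardware threads on the evaluated node (assumed) */

/* Map software thread t to a hardware thread using stride N_HWT / T,
   so threads land on distinct cores before sharing SMT lanes. */
int hw_thread(int t, int T) {
    return t * (N_HWT / T);
}
```

With T = 40, for example, the stride is 4, placing one thread on every second core's first two SMT lanes rather than packing eight threads per core.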
NEST spike delivery
total amount of work fitted to C = c_0 + c_1*M*T + c_2*M*T^2
[Figure: instructions vs. thread count (1 to 160) for M = 512, 2048, 4096, and 16,384; points: measured, lines: fit.]
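The fitted model can be written out directly. The coefficient values are not given on the slide, so the function below just evaluates the model for given c_0, c_1, c_2; the name is illustrative.

```c
/* Instruction-count model for spike delivery from the fit above:
   C(M, T) = c0 + c1*M*T + c2*M*T^2
   M: number of (virtual) MPI processes, T: threads per process. */
double spike_delivery_model(double c0, double c1, double c2,
                            double M, double T) {
    return c0 + c1 * M * T + c2 * M * T * T;
}
```

The quadratic term in T is the one that dominates at high thread counts and motivates the restructuring discussed on the next slide.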
NEST: potential for improvement
- each of the T threads runs through M*T buffer elements in the spike-delivery kernel; avoiding the c_2*M*T^2 component would improve thread scalability
- instructions-per-cycle throughput benefits from SMT8; needs deeper understanding
Summary
- LBM and MAFIA benefit from vectorization and instruction-level parallelism
- NEST benefits from SMT8, needs restructuring
- POWER8 suitable as host CPU in a GPU-accelerated system
Outlook
- POWER8 with GPUs installed
- porting in progress