Performance Evaluation of Scientific Applications on POWER8

1 Performance Evaluation of Scientific Applications on POWER8
2014 Nov 16
Andrew V. Adinetz [1], Paul F. Baumeister [1], Hans Böttiger [3], Thorsten Hater [1], Thilo Maurer [3], Dirk Pleiter [1], Wolfram Schenck [4], and S. Fabio Schifano [2]
[1] Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany
[2] Dip. di Matematica e Informatica, Università di Ferrara and INFN, Ferrara, Italy
[3] IBM Deutschland Research & Development GmbH, Böblingen, Germany
[4] SimLab Neuroscience, IAS, Forschungszentrum Jülich, Jülich, Germany

2 Outline
- POWER8 architecture intro
- Micro benchmarks
  - STREAM memory bandwidth
  - OpenMP overheads
- Scientific HPC applications
  - LBM: Lattice Boltzmann method for fluid dynamics
  - NEST: neuroscience network simulator
  - MAFIA: subspace clustering data analysis tool

3 POWER8 chip architecture
- 3.42 GHz core frequency
- 10 cores per socket (max 12)
- core architecture: POWER ISA, 8-way SMT
- execution units: 2xFXU, 2xLSU, 2xLU, 2xVMX, 4xFPU

4 Memory hierarchy of POWER8
- L1$: 64k per core
- L2$: 512k per core
- L3$: 8M per core, shared
- Centaur L4$: max. 128M
- Memory: max. 1T
[Figure: data flow and load request propagation through Reg, L1$, L2$, L3$, L4$ and memory, including coherency and remote cast-outs; annotated with the counters LD_CMPL + LSU_LDX, DATA_FROM_L2, DATA_ALL_FROM_L3, L3_PREF_ALL, DATA_ALL_FROM_{RDL}L4 and DATA_ALL_FROM_{RDL}MEM]

5 STREAM benchmark
- with 2 streams
  - copy: c[i] = a[i];
  - scale: b[i] = s*c[i];
- with 3 streams
  - sum: a[i] = b[i]+c[i];
  - triad: a[i] = s*b[i]+c[i];
- efficient prefetching expected
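
The four kernels above can be sketched in a few lines. This is an illustration only, not the benchmark code used for the measurements: it runs on plain Python lists, the function names are hypothetical, and the 8-byte element size is an assumption (write-allocate traffic is ignored).

```python
# Minimal sketch of the four STREAM kernels on plain Python lists.
# Assumption: 8-byte (double precision) elements, no write-allocate traffic.

BYTES_PER_ELEMENT = 8  # assumed element size

# Number of arrays ("streams") each kernel touches, as grouped on the slide.
STREAMS = {"copy": 2, "scale": 2, "sum": 3, "triad": 3}


def stream_copy(a):
    """c[i] = a[i]"""
    return [x for x in a]


def stream_scale(c, s):
    """b[i] = s*c[i]"""
    return [s * x for x in c]


def stream_sum(b, c):
    """a[i] = b[i]+c[i]"""
    return [x + y for x, y in zip(b, c)]


def stream_triad(b, c, s):
    """a[i] = s*b[i]+c[i]"""
    return [s * x + y for x, y in zip(b, c)]


def bytes_moved(kernel, n):
    """Bytes read plus written for n elements under the assumptions above."""
    return STREAMS[kernel] * BYTES_PER_ELEMENT * n


if __name__ == "__main__":
    b, c = [1.0, 2.0], [10.0, 20.0]
    print(stream_triad(b, c, 2.0))
    print(bytes_moved("triad", len(b)))
```

Dividing `bytes_moved` by the measured kernel time gives the bandwidth figure that STREAM-style benchmarks report.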

6 STREAM memory traffic: copy
[Figure: data read (B) versus array size n, with series for the transfers between Reg, L1$, L2$, L3$ and Mem]

7 STREAM performance
- sampling 1000x
- distributions for 40 threads: copy, scale, sum, triad
- median values in GB/s
[Figure: bandwidth (GB/s) versus number of threads, with markers at #Cores, #LSU and #HWT]
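
Repeated sampling with a median summary could look like the sketch below. The helper names and the timing approach are assumptions made for illustration; the slides do not show how the samples were actually collected.

```python
import statistics
import time


def sample_bandwidth_gbs(kernel, bytes_moved, repetitions=1000):
    """Time kernel() repeatedly and return one GB/s value per repetition.

    kernel is any zero-argument callable; bytes_moved is the traffic the
    caller attributes to a single call (an input, not measured here).
    """
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        kernel()
        elapsed = time.perf_counter() - start
        # Guard against a zero reading on very coarse timers.
        samples.append(bytes_moved / max(elapsed, 1e-12) / 1e9)
    return samples


def median_bandwidth(samples):
    """Median of the sampled bandwidths; robust against outliers."""
    return statistics.median(samples)


if __name__ == "__main__":
    a = [1.0] * 100_000
    samples = sample_bandwidth_gbs(lambda: [x for x in a], 16 * len(a), 100)
    print(f"median: {median_bandwidth(samples):.3f} GB/s")
```

The median is used here because a few slow repetitions (for example from OS interference) shift a mean noticeably but leave the median almost unchanged.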

8 OpenMP overheads
- explicit vs. implicit constructs
- overhead = t(n) - t_serial/n
- constructs measured: atomic, barrier, critical, ordered, for, parallel, reduction, single
[Figure: overhead (µs) versus number of threads, two panels, with markers at #Cores and #HWT]
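
As a worked example, assume the overhead definition reads overhead = t(n) - t_serial/n, i.e. the measured time on n threads minus the ideal time for perfectly divided work. The function name and the timings below are hypothetical.

```python
def construct_overhead(t_parallel, t_serial, n_threads):
    """overhead = t(n) - t_serial / n, in the time unit of the inputs."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    return t_parallel - t_serial / n_threads


if __name__ == "__main__":
    # Hypothetical timings in microseconds, not measured values.
    t_serial = 8.0   # reference loop without the construct, one thread
    t_4 = 2.5        # same work with the construct on 4 threads
    print(construct_overhead(t_4, t_serial, 4))
```

With these made-up numbers the ideal parallel time is 2.0 µs, so 0.5 µs is attributed to the construct.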

9 Lattice Boltzmann method
Fluid dynamics
- discrete sampling of positions and velocities
- complex and irregular structures
- multi-phase flow possible
Example: Rayleigh-Taylor instability

10 Lattice Boltzmann method D2Q37
- two kernels:
  - collide (6200 FP/lattice site)
  - propagate (bandwidth limited)
- on most platforms t_collide >> t_propagate
- high degree of parallelism
- high arithmetic intensity

11 LBM D2Q37 performance
- best results for collide: ~200 GF/s at 40 or 80 threads
- best results for propagate: ~75 GB/s at 20 threads
- shortest runtime at 40 threads
[Figure: collide (GF/s) and propagate (GB/s) versus number of threads, with markers at #Cores, #FPU/LSU and #HWT]
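
A back-of-the-envelope estimate relates the quoted figures to the claim t_collide >> t_propagate. It takes 6200 FP operations per site, ~200 GF/s and ~75 GB/s from the slides. The assumptions are: double precision, and each of the 37 populations read once and written once per kernel. The real memory traffic may differ.

```python
POPULATIONS = 37       # D2Q37: 37 populations per lattice site
BYTES_PER_VALUE = 8    # assumption: double precision
FLOP_PER_SITE = 6200   # collide cost quoted on the slides


def bytes_per_site():
    """Assumed traffic per site: every population read once, written once."""
    return 2 * POPULATIONS * BYTES_PER_VALUE


def collide_intensity():
    """Arithmetic intensity of collide in flop per byte."""
    return FLOP_PER_SITE / bytes_per_site()


def time_per_site_ns(collide_gflops=200.0, propagate_gbs=75.0):
    """Per-site time of both kernels in nanoseconds at the quoted rates."""
    t_collide = FLOP_PER_SITE / collide_gflops       # flop / (Gflop/s) = ns
    t_propagate = bytes_per_site() / propagate_gbs   # byte / (GB/s) = ns
    return t_collide, t_propagate


if __name__ == "__main__":
    print(f"intensity: {collide_intensity():.2f} flop/byte")
    t_c, t_p = time_per_site_ns()
    print(f"collide: {t_c:.1f} ns/site, propagate: {t_p:.1f} ns/site")
```

Under these assumptions collide takes roughly four times as long per site as propagate, which is consistent with collide dominating the runtime.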

12 MAFIA
- a parallelized algorithm to identify subspace clusters in high-dimensional spaces
- kernel: pcount, bitwise &
- GPU acceleration possible
Adinetz et al., Euro-Par'13 Proceedings, pp
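
The combination of a bitwise & and a population count can be sketched as follows. The bitmap layout is an assumption for illustration: bit i is set when data point i falls into a given dense unit, so the AND of two bitmaps marks the points that lie in both. The helper names are hypothetical and this is not the MAFIA implementation.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def points_to_bitmap(point_ids):
    """Build a bitmap with bit i set for every point id i (assumed layout)."""
    bitmap = 0
    for i in point_ids:
        bitmap |= 1 << i
    return bitmap


def joint_count(bitmap_a, bitmap_b):
    """Count points present in both bitmaps: popcount of the bitwise AND."""
    return popcount(bitmap_a & bitmap_b)


if __name__ == "__main__":
    unit_a = points_to_bitmap([0, 2, 5, 7])
    unit_b = points_to_bitmap([2, 3, 7])
    print(joint_count(unit_a, unit_b))
```

A kernel of this shape consists almost entirely of integer AND and popcount operations, which maps well to vector units and GPUs.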

13 MAFIA performance
[Figure: speedup versus number of threads for stride = 1, 2, 4 and 8]

14 NEST: NEural Simulation Tool (nest-initiative.org)
- discrete event simulator on a distributed graph
- uses domain-specific interpreter language (SLI)
- stochastic input
- hybrid parallelization with M × T virtual processes (M MPI processes, T threads each)
- n=9375 neurons per MPI process
- dry-run mode for M processes
- kernels:
  - neuron update (some FP-ops)
  - spike delivery (dominant)
Brette et al., J. Comp. Neuroscience, Vol. 23, 3 (2007)

15 NEST total runtime
- control the binding of threads to cores!
- dry-run mode with M=512 and M=16,384
- use 160/T as stride
[Figure: time (s) versus number of threads for stride = 1, 2, 4 and 8, two panels]
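
The stride rule can be illustrated by computing which hardware thread each software thread would be bound to. The value 160 is taken from the slide. The assumption here is that hardware threads are numbered consecutively, so that neighbouring ids share a core. The function name is hypothetical.

```python
HW_THREADS = 160  # from the slide ("use 160/T as stride")


def binding(n_threads, stride=None, hw_threads=HW_THREADS):
    """Return the hardware thread id for each of n_threads software threads.

    With stride=None the rule from the slide is applied: hw_threads / T.
    """
    if not 1 <= n_threads <= hw_threads:
        raise ValueError("n_threads must be between 1 and hw_threads")
    if stride is None:
        stride = hw_threads // n_threads
    ids = [i * stride for i in range(n_threads)]
    if ids[-1] >= hw_threads:
        raise ValueError("stride too large for this thread count")
    return ids


if __name__ == "__main__":
    print(binding(20))            # spread over the whole machine
    print(binding(20, stride=1))  # packed onto the first hardware threads
```

With the stride chosen as 160/T the threads are spread evenly over all hardware threads instead of being packed onto a few cores.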

16 NEST spike delivery
- total amount of work fitted to C = c_0 + c_1 MT + c_2 MT^2
- points: measured, lines: fit
[Figure: instructions versus number of threads for M=512, M=2048, M=4096, M=]

17 NEST potential for improvement
- Each of T threads runs through MT buffer elements in the spike delivery kernel; avoiding the c_2 MT^2 component would improve thread scalability
- Instructions-per-cycle throughput benefits from SMT8
- Needs deeper understanding
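
The origin of the quadratic term can be made explicit with a small sketch, reading the work model as C = c_0 + c_1 MT + c_2 MT^2. The coefficients used in the example are hypothetical placeholders, not the fitted values, and the function names are made up for illustration.

```python
def spike_delivery_work(m, t, c0, c1, c2):
    """Work model C = c0 + c1*M*T + c2*M*T^2."""
    return c0 + c1 * m * t + c2 * m * t ** 2


def buffer_visits(m, t):
    """Count buffer elements visited when each of T threads scans all M*T."""
    visits = 0
    for _thread in range(t):
        visits += m * t  # one full pass over the buffer per thread
    return visits


if __name__ == "__main__":
    m = 512
    for t in (1, 2, 4, 8):
        # Hypothetical coefficients, chosen only to show the trend.
        total = spike_delivery_work(m, t, c0=0.0, c1=1.0, c2=1.0)
        print(t, buffer_visits(m, t), total)
```

Because every thread passes over the full buffer, the number of visits grows as M·T², so doubling T at fixed M quadruples this part of the work.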

18 Summary
- LBM and MAFIA benefit from vectorization and instruction-level parallelism
- NEST benefits from SMT8, needs restructuring
- POWER8 suitable as host CPU in GPU-accelerated system

19 Outlook
- POWER8 with GPUs installed
- porting in progress
