Performance Evaluation of Scientific Applications on POWER8


Performance Evaluation of Scientific Applications on POWER8
16 Nov 2014

Andrew V. Adinetz [1], Paul F. Baumeister [1], Hans Böttiger [3], Thorsten Hater [1], Thilo Maurer [3], Dirk Pleiter [1], Wolfram Schenck [4], and S. Fabio Schifano [2]

[1] Jülich Supercomputing Centre, Forschungszentrum Jülich, 52425 Jülich, Germany
[2] Dip. di Matematica e Informatica, Università di Ferrara and INFN, Ferrara, Italy
[3] IBM Deutschland Research & Development GmbH, 71032 Böblingen, Germany
[4] SimLab Neuroscience, IAS, Forschungszentrum Jülich, 52425 Jülich, Germany

Outline
- POWER8 architecture intro
- Micro-benchmarks
  - STREAM memory bandwidth
  - OpenMP overheads
- Scientific HPC applications
  - LBM: Lattice Boltzmann method for fluid dynamics
  - NEST: neuroscience network simulator
  - MAFIA: subspace clustering data analysis tool

POWER8 chip architecture
- 3.42 GHz core frequency
- 10 cores per socket (max. 12)
- core architecture: POWER ISA v2.07
  - 8-way SMT
  - 2x FXU, 2x LSU, 2x LU
  - 2x VMX, 4x FPU

Memory hierarchy of POWER8
- L1$: 64 KB per core
- L2$: 512 KB per core
- L3$: 8 MB per core, shared
- Centaur buffer chip: L4$ up to 128 MB, memory up to 1 TB
[Figure: data-flow diagram Reg - L1$ - L2$ - L3$ - L4$ - Memory, annotated with the performance counter events used to track load request propagation (LD_CMPL + LSU_LDX, DATA_FROM_L2, DATA_ALL_FROM_L3, L3_PREF_ALL, DATA_ALL_FROM_{RDL}L4, DATA_ALL_FROM_{RDL}MEM) and coherency via remote cast-outs]

STREAM benchmark
- with 2 streams:
  - copy:  c[i] = a[i];
  - scale: b[i] = s*c[i];
- with 3 streams:
  - sum:   a[i] = b[i]+c[i];
  - triad: a[i] = s*b[i]+c[i];
- efficient prefetching expected
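For reference, a minimal OpenMP sketch of the four kernels. The function signature and the single fixed problem size are illustrative only; the measurements on the following slides sweep the array size and time each kernel separately.

    /* Illustrative size; the measurements sweep n from 2^11 to 2^25. */
    enum { N = 1 << 25 };

    /* The four STREAM kernels, each timed separately in the benchmark. */
    void stream_kernels(double *a, double *b, double *c, double s)
    {
        #pragma omp parallel for          /* copy: 2 streams */
        for (long i = 0; i < N; ++i) c[i] = a[i];

        #pragma omp parallel for          /* scale: 2 streams */
        for (long i = 0; i < N; ++i) b[i] = s * c[i];

        #pragma omp parallel for          /* sum: 3 streams */
        for (long i = 0; i < N; ++i) a[i] = b[i] + c[i];

        #pragma omp parallel for          /* triad: 3 streams */
        for (long i = 0; i < N; ++i) a[i] = s * b[i] + c[i];
    }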

STREAM memory traffic - copy
[Figure: data read (B) from L1$, L2$, L3$, and main memory, measured via counters, versus array size n from 2^11 to 2^25; the dominant traffic source shifts from L1$ to L2$, L3$, and finally memory as the working set outgrows each cache level]

STREAM performance
- each configuration sampled 1000x; median values reported
[Figure: median bandwidth (GB/s) of copy, scale, sum, and triad versus thread count (1 to 160), with markers at #Cores, #LSU, and #HWT; inset: bandwidth distributions for 40 threads, spanning roughly 200-320 GB/s]

OpenMP overheads
- explicit vs. implicit constructs: overhead = t(n) - t_serial/n
[Figure: overhead (µs, log scale) of the constructs atomic, barrier, critical, ordered, for, parallel, reduction, and single versus thread count (1 to 160), with markers at #Cores and #HWT]
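The slide does not detail the measurement procedure; as a hedged illustration of how such a per-construct overhead can be estimated, the sketch below times repeated empty barriers inside one parallel region (REPS and the output format are arbitrary choices, not the benchmark used here).

    #include <omp.h>
    #include <stdio.h>

    #define REPS 100000            /* illustrative repetition count */

    int main(void)
    {
        double t0, t1;

        #pragma omp parallel
        {
            #pragma omp master
            t0 = omp_get_wtime();

            /* time REPS empty barriers; per-iteration cost ~ barrier overhead */
            for (int r = 0; r < REPS; ++r) {
                #pragma omp barrier
            }

            #pragma omp master
            t1 = omp_get_wtime();
        }

        printf("barrier overhead: %.3f us\n", (t1 - t0) / REPS * 1e6);
        return 0;
    }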

Lattice Boltzmann method
Fluid dynamics:
- discrete sampling of positions and velocities
- complex and irregular structures
- multi-phase flow possible
Example: Rayleigh-Taylor instability

Lattice Boltzmann method D2Q37
- two kernels:
  - collide (6200 FP operations per lattice site)
  - propagate (bandwidth limited)
- on most platforms t_collide >> t_propagate
- high degree of parallelism
- high arithmetic intensity
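To illustrate why propagate is purely bandwidth limited, below is a minimal sketch of such a kernel for an array-of-structures D2Q37 lattice. The data layout, the off[] offset table, and all names are assumptions for illustration, not the authors' implementation; halo handling is omitted.

    #define NPOP 37                      /* populations per site in D2Q37 */

    /* propagate: each population is moved from the neighbouring site along
     * its discrete velocity; pure data movement, no floating-point work,
     * hence limited by memory bandwidth. */
    void propagate(const double *restrict src, double *restrict dst,
                   int nx, int ny, const int *restrict off /* NPOP offsets */)
    {
        #pragma omp parallel for collapse(2)
        for (int x = 1; x < nx - 1; ++x)
            for (int y = 1; y < ny - 1; ++y) {
                long site = (long)(x * ny + y) * NPOP;
                for (int p = 0; p < NPOP; ++p)
                    dst[site + p] = src[site + p + off[p]];
            }
    }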

LBM D2Q37 performance
- best results ~200 GF/s for collide at 40 or 80 threads
- best results ~75 GB/s for propagate at 20 threads
- shortest runtime at 40 threads
[Figure: collide performance (GF/s) and propagate bandwidth (GB/s) versus thread count (1 to 160), with markers at #Cores, #FPU/LSU, and #HWT]

MAFIA
- a parallelized algorithm to identify subspace clusters in high-dimensional spaces
- kernel: pcount (bitwise &)
- GPU acceleration possible
Adinetz et al., Euro-Par'13 Proceedings, pp. 838-849
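As a rough illustration of the pcount kernel's character, the sketch below counts the points falling into a candidate dense unit by AND-ing per-dimension bitmaps and summing the set bits; the names, data layout, and use of a compiler builtin are assumptions, not MAFIA's actual code.

    #include <stdint.h>

    /* Count points in the intersection of 'ndims' one-dimensional windows.
     * Each window is a bitmap with one bit per point; the candidate's point
     * count is the population count of the bitwise AND of its windows. */
    long pcount(const uint64_t *const *bitmaps, int ndims, long nwords)
    {
        long count = 0;
        #pragma omp parallel for reduction(+ : count)
        for (long w = 0; w < nwords; ++w) {
            uint64_t x = bitmaps[0][w];
            for (int d = 1; d < ndims; ++d)
                x &= bitmaps[d][w];            /* bitwise & over dimensions */
            count += __builtin_popcountll(x);  /* GCC/Clang builtin */
        }
        return count;
    }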

MAFIA performance
[Figure: speedup (log scale) versus thread count (1 to 160) for thread-binding strides 1, 2, 4, and 8]

NEST - NEural Simulation Tool
- discrete event simulator on a distributed graph
- uses a domain-specific interpreter language (SLI)
- stochastic input
- hybrid parallelization with M x T virtual processes
- n = 9375 neurons per MPI process
- dry-run mode for M processes
- kernels:
  - neuron update (some FP ops)
  - spike delivery (dominant)
nest-initiative.org
Brette et al., J. Comp. Neuroscience, Vol. 23, 3 (2007)

NEST total runtime
- control the binding of threads to cores!
- use 160/T as stride
- dry-run mode with M=512 and M=16,384
[Figure: total runtime (s) versus thread count (10 to 160) for thread-binding strides 1, 2, 4, and 8; left panel M=512 (up to ~60 s), right panel M=16,384 (up to ~1200 s)]

NEST spike delivery
- total amount of work fitted to C = c_0 + c_1*M*T + c_2*M*T^2
[Figure: instruction counts (points: measured, lines: fit) versus thread count T (1 to 160) for M = 512, 2048, 4096, and 16384]

NEST potential for improvement
- each of the T threads runs through M*T buffer elements in the spike delivery kernel; avoiding the c_2*M*T^2 component would improve thread scalability
- instructions-per-cycle throughput benefits from SMT8; needs deeper understanding

Summary
- LBM and MAFIA benefit from vectorization and instruction-level parallelism
- NEST benefits from SMT8, needs restructuring
- POWER8 suitable as host CPU in a GPU-accelerated system

Outlook
- POWER8 with GPUs installed
- porting in progress