On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code

Size: px

Start display at page:

Download "On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code"

Francis Ferguson
5 years ago
Views:

1 On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy 7 th Workshop on UnConventional High Performance Computing August 26, 2014 Porto, Portugal E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

2 Outline 1 LBM at glance, D2Q37 model 2 OpenCL 3 Implementation details 4 Results and conclusions We addressed the issue of portability of code across several computing architectures preserving performances. E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

3 The D2Q37 Lattice Boltzmann Model Lattice Boltzmann method (LBM) is a class of computational fluid dynamics (CFD) methods simulation of synthetic dynamics described by the discrete Boltzmann equation, instead of the Navier-Stokes equations a set of virtual particles called populations arranged at edges of a discrete and regular grid interacting by propagation and collision reproduce after appropriate averaging the dynamics of fluids D2Q37 is a D2 model with 37 components of velocity (populations) suitable to study behaviour of compressible gas and fluids optionally in presence of combustion 1 effects correct treatment of Navier-Stokes, heat transport and perfect-gas (P = ρt ) equations 1 chemical reactions turning cold-mixture of reactants into hot-mixture of burnt product. E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

$parallel applying in sequence propagate and collide. Challenge Design an efficient implementation able exploit a large fraction of available peak performance. E.$

4 Computational Scheme of LBM foreach time step foreach lattice point propagate ( ) ; endfor foreach lattice point collide ( ) ; endfor endfor Embarassing parallelism All sites can be processed in parallel applying in sequence propagate and collide. Challenge Design an efficient implementation able exploit a large fraction of available peak performance. E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

5 D2Q37: propagation scheme perform accesses to neighbour-cells at distance 1,2, and 3 generate memory-accesses with sparse addressing patterns E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

6 D2Q37: boundary-conditions After propagation, boundary conditions are enforced at top and bottom edges of the lattice. 2D lattice with period-boundaries along X-direction at the top and the bottom boundary conditions are enforced: to adjust some values at sites y = and y = N y 3... N y 1 e.g. set vertical velocity to zero At left and and right edges we apply periodic boundary conditions. E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

7 D2Q37 collision collision is computed at each lattice-cell after computation of boundary conditions computational intensive: for the D2Q37 model requires 7500 DP floating-point operations completely local: arithmetic operations require only the populations associate to the site computation of propagate and collide kernels are kept separate after propagate but before collide we may need to perform collective operations (e.g. divergence of of the velocity field) if we include computations conbustion effects. E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

8 Open Computing Language (OpenCL) programming framework for heterogenous architectures: CPU+accelerators computing model: host-code plus one or more kernels running on accelerators kernels are executed by a set of work-items each processing an item of the data-set (data-parallelism) work-items are grouped into work-groups, each executed by a compute-unit and processing K work-items in parallel using vector instructions e.g.: on Xeon-Phi work-groups are mapped on (virtual-)cores processing each up to 8 double-precisions floting-point data memory model identifies a hierarchy of four spaces which differ for size and access-time : private, local, global and constant memory OCL aims to guarantee portability of both code and performances across several architectures E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

9 OCL Saxpy kernel C = s A B, s R, A, B, C R n kernel void saxpy ( global double A, global double B, global double C, const double s ) { / / get g l o b a l thread ID int id = get_global_id ( 0 ) ; } C [ id ] = s A [ id ] + B [ id ] ; each work-item executes the saxpy kernel computing just one data-item of the output array first it computes its unique global identifier id and then uses it to address the id th data-item of arrays A, B and C. E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

10 Memory layout for LB : AoS vs SoA / / l a t t i c e stored as AoS : typedef struct { double p1 ; / / p o p u l a t i o n 1 double p2 ; / / p o p u l a t i o n 2... double p37 ; / / p o p u l a t i o n 37 } pop_t ; pop_t lattice2d [ SIZEX SIZEY ] ; AoS: corresponding populations of different sites are interleaved, causing strided memory-access and leading to coalescing issues. / / l a t t i c e stored as AoS : typedef struct { double p1 [ SIZEX SIZEY ] ; / / p o p u l a t i o n 1 array double p2 [ SIZEX SIZEY ] ; / / p o p u l a t i o n 2 array... double p37 [ SIZEX SIZEY ] ; / / p o p u l a t i o n 37 array } pop_t ; pop_t lattice2d ; SoA: corresponding populations of different sites are allocated at contiguous memory addresses, enabling coalescing of accesses, and making use of full memory bandwidth. E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

Example: physical lattice of 11 16 cells; the size of work-groups is 1 4.

11 Grids Layout Uni-dimensional array of NTHREADS, each thread processing one lattice site. Example: physical lattice of cells; the size of work-groups is 1 4. L y = α N wi, α N; (L y L x )/N wi = N wg E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

Hardware used: Eurora prototype Eurora (Eurotech and Cineca) Hot water cooling system Deliver 3,209 MFLOPs per Watt of sustained performance 1 st in the Green500 of June 2013 Computing Nodes: 64

12 Hardware used: Eurora prototype Eurora (Eurotech and Cineca) Hot water cooling system Deliver 3,209 MFLOPs per Watt of sustained performance 1 st in the Green500 of June 2013 Computing Nodes: 64 Processor Type: Intel Xeon 2.10GHz Intel Xeon 3.10GHz Accelerator Type: MIC - Intel Xeon-Phi 5120D GPU - NVIDIA Tesla K20x E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

13 OpenCL Benchmark of Propagate (Xeon-Phi) Performance of propagate as function of the number of work-items N wi per work-group, and the number of work-groups N wg. E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

14 OpenCL Benchmark of Collide (Xeon-Phi) Performance of collide as function of the number of work-items N wi per work-group, and the number of work-groups N wg. E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

15 2 x NVIDIA K20s GPU CUDA OpenCL Run time on 2 x GPU (NVIDIA K20s) [msec] per iteration Propagate BC Collide E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

16 2 x Intel Xeon Phi MIC C OpenCL Run time on 2 x MIC (Intel Xeon Phi) [msec] per iteration Propagate BC Collide E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

17 Propagate [msec] per iteration C C Opt. CUDA OpenCL Run time (Propagate x2048 lattice) 0 MIC GPU CPU2 CPU3 E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

18 Collide C C Opt. CUDA OpenCL Run time (Collide x2048 lattice) [msec] per iteration MIC GPU CPU2 CPU3 E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

19 Scalability on Eurora Nodes Weak regime lattice size: No_devices. Strong regime lattice size: E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

20 Limitations to strong scalability E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

21 Conclusions 1 we have presented an OCL implementation of a fluid-dynamic simulation based on Lattice Boltzmann methods 2 code portability: it has been succesfully ported and run on several computing architectures, including CPU, GPU and MIC systems 3 performance portability: results are of the same level of codes written using more native programming frameworks, such as CUDA or C 4 the good news: this results make OpenCL a good framework to develop code, easily portable across several architecture preserving performances 5 the bad news: not all vendors are today commited to support this standard because considered a low-level approach. E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

22 Simulation of the Rayleigh-Taylor (RT) Instability Instability at the interface of two fluids of different densities triggered by gravity. A cold-dense fluid over a less dense and warmer fluid triggers an instability that mixes the two fluid-regions (till equilibrium is reached). E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

23 Acknowledgments Luca Biferale, Mauro Sbragaglia, Patrizio Ripesi University of Tor Vergata and INFN Roma, Italy Andrea Scagliarini University of Barcelona, Spain Filippo Mantovani BSC institute, Spain Enrico Calore, Sebastiano Fabio Schifano, Raffaele Tripiccione, University and INFN of Ferrara, Italy Federico Toschi Eindhoven University of Technology The Netherlands, and CNR-IAC, Roma Italy This work has been performed in the framework of the INFN COKA and SUMA projects. We would like to thank CINECA (ITALY) and JSC (GERMANY) institutes for access to their systems. E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

24 Thanks for Your attention E. Calore (INFN of Ferrara) Portability, performance and scalability UCHPC, August 26, / 24

Performance Evaluation of Scientific Applications on POWER8

Performance Evaluation of Scientific Applications on POWER8 2014 Nov 16 Andrew V. Adinetz 1, Paul F. Baumeister 1, Hans Böttiger 3, Thorsten Hater 1, Thilo Maurer 3, Dirk Pleiter 1, Wolfram Schenck 4,