High-performance processing and development with Madagascar July 24, 2010 Madagascar development team

Outline
1. HPC terminology and frameworks
2. Utilizing data parallelism
3. HPC development with Madagascar: OpenMP, MPI, GPU/CUDA

The vast majority of modern HPC systems are hybrid. The rise of GPUs has added another level of hardware and software complexity.

Application frameworks for HPC

- OpenMP: shared memory; supported via compiler pragmas
- MPI: distributed memory; supported via libraries [a]
- GPU/CUDA: special hardware; supported via an extra compiler + SDK [b]

[a] Executables need a special environment to run: a remote shell + a launcher (mpirun).
[b] Executables need proprietary drivers to run GPU kernels, but can be recompiled in emulation mode.

Most big clusters have job batch systems. On such systems, programs cannot be run directly; they have to be submitted to a queue.

Design goals for HPC support
- Support for all combinations of systems
- Fault tolerance and portability of SConstructs

How?
- Utilize explicit (data) parallelism whenever possible: keep basic programs simple and sequential, and handle parallelization inside Flow() statements automatically
- Define system-dependent execution parameters outside of SConstructs

How to define a parallel Flow()

Flow('target', 'source', 'workflow', split=[n,m], reduce='cat')
# Process source independently along the n-th dimension (of length m)
# and concatenate all results into target.

How to run

$ export RSF_THREADS=16
$ export RSF_CLUSTER='hostname1 8 hostname2 8'
$ pscons
# or
$ scons -j 16 CLUSTER='hostname1 8 hostname2 8'

What happens inside
1. The source file gets split into several independent temporary files.
2. SCons executes the same workflow on cluster nodes for each file independently through a remote shell (ssh).
3. SCons assembles the independent outputs into one target.
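As a quick illustration of the two previous slides, here is a minimal SConstruct sketch (the file names and the bandpass step are illustrative, not from the original slides). Nothing machine-specific appears in the workflow; the degree of parallelism is chosen at run time through RSF_THREADS and RSF_CLUSTER.

from rsf.proj import *

# No host names or CPU counts appear in the workflow itself.
Flow('spikes', None, 'spike n1=1000 n2=64')

# split/reduce only describe the data parallelism (64 traces along axis 2);
# how many jobs run, and on which hosts, comes from RSF_THREADS and
# RSF_CLUSTER when pscons is invoked.
Flow('filtered', 'spikes', 'bandpass fhi=40', split=[2,64], reduce='cat')

End()

The same SConstruct runs unchanged with plain scons on a laptop and with pscons on a cluster.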

Figure: Parallel workflows with PSCons

Scratch directory for temporary files
$ export TMPDATAPATH=/local/fast/file/system

Monitor running programs
$ sftop

Kill remote programs
$ sfkill sfprogram

Fallbacks

If a remote shell is not available directly, it is possible to try to bypass it.

Directly through MPI
Flow('target', 'source', 'workflow', split=[n,'mpi',m], reduce='cat')
# Try to invoke the MPI execution environment directly with mpirun -np m.

Directly through OpenMP
Flow('target', 'source', 'workflow', split=[n,'omp'], reduce='cat')
# Try to run the workflow in an OpenMP environment locally.

Note
These options are not as portable as the general one.

The "Hello World" program of the HPC world: c = a + b

Approaches
- First, the obvious one: pscons
- Then, let us assume that the problem is not data-parallel and explicit parallelism has to be expressed at the level of the source code

Obvious solution

from rsf.proj import *

Flow('a', None, 'math n1=512 n2=512 output=x1+x2')
Flow('b', None, 'math n1=512 n2=512 output=x1*x2')
Flow('c', 'a b', 'math b=${SOURCES[1]} output=input+b', split=[2,512])
End()

How to run
$ pscons

Note
- This approach will work on any type of system, and on a single-CPU machine as well.
- Splitting inputs and collecting outputs adds overhead.

vecsum.job - job file for running inside the LSF batch system

#!/bin/bash
#BSUB -J vecsum      # Job name
#BSUB -o vecsum.out  # Name of the output file
#BSUB -q normal      # Queue name
#BSUB -P rsfdev      # Account
#BSUB -W 0:05        # Runtime
#BSUB -n 16          # Number of CPUs
export RSF_THREADS=16
export RSF_CLUSTER=$LSB_MCPU_HOSTS
pscons

How to run
$ bsub < vecsum.job

Figure: Summation of vectors with OpenMP

omphello.c

#include <rsf.h>

int main(int argc, char* argv[])
{
    int n1, n2, i, j;
    float *a, *b, *c;
    sf_file ain, bin, cout = NULL;

    sf_init(argc, argv);
    ain = sf_input("in"); /* Input vector a */
    bin = sf_input("b");  /* Input vector b */
    if (SF_FLOAT != sf_gettype(ain) ||
        SF_FLOAT != sf_gettype(bin)) sf_error("Need float");
    /* Vector size */
    if (!sf_histint(ain, "n1", &n1)) sf_error("No n1=");
    /* Number of vectors */
    n2 = sf_leftsize(ain, 1);
    /* Output vector */
    cout = sf_output("out");
    /* Vectors in memory */
    a = sf_floatalloc(n1);
    b = sf_floatalloc(n1);
    c = sf_floatalloc(n1);
    /* Outer loop over vectors */
    for (i = 0; i < n2; i++) {
        sf_floatread(a, n1, ain);
        sf_floatread(b, n1, bin);
        /* Parallel summation */
#pragma omp parallel for private(j) shared(a,b,c)
        for (j = 0; j < n1; j++)
            c[j] = a[j] + b[j];
        sf_floatwrite(c, n1, cout);
    }
    sf_fileclose(ain);
    sf_fileclose(bin);
    sf_fileclose(cout);
    return 0;
}

How to compile
Like any regular Madagascar program.

How to run
$ scons/pscons

Note
- Even if OpenMP is not supported, the source code will still compile as a sequential program.
- It can be combined with upper-level parallelization in the SConstruct (see the sketch below).
- For a good real-life example, look into RSFSRC/user/psava/sfawefd.c
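A minimal sketch of that combination (assuming omphello.c has been installed as a Madagascar program so it can be called as omphello inside a Flow(); the data sizes are illustrative): pscons distributes groups of vectors across cluster nodes, and on each node the OpenMP pragma spreads the summation over the local cores.

from rsf.proj import *

Flow('a', None, 'math n1=512 n2=4096 output=x1+x2')
Flow('b', None, 'math n1=512 n2=4096 output=x1*x2')

# Upper level: pscons splits both inputs along the second axis (4096 vectors)
# and runs the workflow on several nodes.
# Lower level: on each node, the omp parallel for loop in omphello.c
# sums its own slice using all available cores.
Flow('c', 'a b', 'omphello b=${SOURCES[1]}',
     split=[2,4096,[0,1]], reduce='cat')

End()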

Figure: Summation of vectors with MPI

mpihello.c

#include <rsf.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    int n1, n2, nc, esize, i, j, k = 0;
    float *a, *b, **c;
    sf_file ain, bin, cout = NULL;
    int cpuid;           /* CPU id */
    int ncpu;            /* Number of CPUs */
    MPI_Status mpi_stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &cpuid);
    MPI_Comm_size(MPI_COMM_WORLD, &ncpu);

    sf_init(argc, argv);
    ain = sf_input("input"); /* Input vector a */
    bin = sf_input("b");     /* Input vector b */
    if (SF_FLOAT != sf_gettype(ain) ||
        SF_FLOAT != sf_gettype(bin)) sf_error("Need float");
    /* Size of an element */
    if (!sf_histint(ain, "esize", &esize)) esize = sizeof(float);
    /* Vector size */
    if (!sf_histint(ain, "n1", &n1)) sf_error("No n1=");
    /* Total number of vectors */
    n2 = sf_leftsize(ain, 1);
    /* Only the first CPU will do output */
    if (0 == cpuid) {
        cout = sf_output("out");
        sf_putint(cout, "n1", n1);
        sf_putint(cout, "n2", n2);
        sf_warning("Running on %d CPUs", ncpu);
    }

    a = sf_floatalloc(n1);
    b = sf_floatalloc(n1);
    /* How many vectors per CPU */
    nc = (int)(n2/(float)ncpu + 0.5f);
    c = sf_floatalloc2(n1, nc);
    /* Starting position in input files */
    sf_seek(ain, n1*cpuid*esize, SEEK_CUR);
    sf_seek(bin, n1*cpuid*esize, SEEK_CUR);
    for (i = cpuid; i < n2; i += ncpu, k++) {
        /* Read local portion of input data */
        sf_floatread(a, n1, ain);
        sf_floatread(b, n1, bin);
        /* Parallel summation here */
        for (j = 0; j < n1; j++)
            c[k][j] = a[j] + b[j];
        /* Move on to the next portion */
        sf_seek(ain, n1*(ncpu - 1)*esize, SEEK_CUR);
        sf_seek(bin, n1*(ncpu - 1)*esize, SEEK_CUR);
    }

    if (0 == cpuid) {
        /* Collect results from all nodes */
        for (i = 0; i < n2; i++) {
            k = i / ncpu; /* Iteration number */
            j = i % ncpu; /* CPU number to receive from */
            if (j)        /* Receive from non-zero CPU */
                MPI_Recv(&c[k][0], n1, MPI_FLOAT, j, j,
                         MPI_COMM_WORLD, &mpi_stat);
            sf_floatwrite(c[k], n1, cout);
        }
        sf_fileclose(cout);
    } else {
        /* Send results to CPU #0 */
        for (i = 0; i < k; i++) /* Vector by vector */
            MPI_Send(&c[i][0], n1, MPI_FLOAT, 0, cpuid,
                     MPI_COMM_WORLD);
    }
    sf_fileclose(ain);
    sf_fileclose(bin);
    MPI_Finalize();
    return 0;
}

How to compile
Look into RSFSRC/system/main/SConstruct for mpi.c

How to run
$ export MPIRUN=/path/to/my/mpirun   # If it is not standard
$ scons

Note
- No standard input in the program.
- Pipes are not allowed in Flow() (see the sketch below).
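A sketch of how such an MPI program might be invoked from an SConstruct by hand (the sfmpihello name and -np 8 are assumptions): because there is no standard input and pipes are not allowed, every dataset is passed through a command-line tag and both stdin and stdout are switched off for the rule.

from rsf.proj import *

Flow('a', None, 'math n1=512 n2=512 output=x1+x2')
Flow('b', None, 'math n1=512 n2=512 output=x1*x2')

# input=, b= and out= match the tags opened by sf_input()/sf_output()
# in mpihello.c; stdin=0 and stdout=-1 tell SCons not to use pipes.
Flow('c', 'a b',
     'mpirun -np 8 sfmpihello input=${SOURCES[0]} b=${SOURCES[1]} out=${TARGETS[0]}',
     stdin=0, stdout=-1)

End()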

Figure: Summation of vectors on GPU

gpuhello.c

#include <rsf.h>

#define BLOCK_SIZE 128

__global__ void gpu_vec_sum(float *a, float *b, float *c)
{
    const unsigned int j = blockIdx.x*blockDim.x + threadIdx.x;
    c[j] = a[j] + b[j];
}

int main(int argc, char* argv[])
{
    int n1, n2, esize, i;
    float *a, *b, *c;
    sf_file ain, bin, cout = NULL;
    dim3 dimgrid(1, 1, 1);           /* GPU grid */
    dim3 dimblock(BLOCK_SIZE, 1, 1); /* GPU block */
    float *d_a, *d_b, *d_c;          /* GPU pointers */

    sf_init(argc, argv);
    cuInit(0); /* Use first GPU device */
    sf_check_gpu_error("Device initialization");
    cudaSetDevice(0);

    ain = sf_input("in"); /* Input vector a */
    bin = sf_input("b");  /* Input vector b */
    if (SF_FLOAT != sf_gettype(ain) ||
        SF_FLOAT != sf_gettype(bin)) sf_error("Need float");
    /* Size of an element */
    if (!sf_histint(ain, "esize", &esize)) esize = sizeof(float);
    /* Vector size */
    if (!sf_histint(ain, "n1", &n1)) sf_error("No n1=");
    /* Number of vectors */
    n2 = sf_leftsize(ain, 1);
    /* Output vector */
    cout = sf_output("out");
    /* Vectors in CPU memory */
    a = sf_floatalloc(n1);
    b = sf_floatalloc(n1);
    c = sf_floatalloc(n1);
    /* Vectors in GPU memory */
    cudaMalloc((void**)&d_a, n1*esize);
    cudaMalloc((void**)&d_b, n1*esize);
    cudaMalloc((void**)&d_c, n1*esize);
    sf_check_gpu_error("GPU mallocs");
    /* Kernel configuration for this data */
    dimgrid = dim3(n1/BLOCK_SIZE, 1, 1);
    /* Outer loop over vectors */
    for (i = 0; i < n2; i++) {
        sf_floatread(a, n1, ain); /* Input */
        sf_floatread(b, n1, bin);
        cudaMemcpy(d_a, a, n1*esize, /* a -> GPU */
                   cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, n1*esize, /* b -> GPU */
                   cudaMemcpyHostToDevice);
        sf_check_gpu_error("Copying a&b to GPU");
        /* Parallel summation on GPU */
        gpu_vec_sum<<<dimgrid, dimblock>>>(d_a, d_b, d_c);
        sf_check_gpu_error("Kernel execution");
        cudaMemcpy(c, d_c, n1*esize, /* GPU -> c */
                   cudaMemcpyDeviceToHost);
        sf_check_gpu_error("Copying c from GPU");
        sf_floatwrite(c, n1, cout); /* Output */
    }
    sf_fileclose(ain);
    sf_fileclose(bin);
    sf_fileclose(cout);
    return 0;
}

How to compile
Look into RSFSRC/user/cuda/SConstruct

How to run
$ scons/pscons

Note
- Can be launched on a multi-GPU cluster by pscons.
- Can be piped like a sequential Madagascar program (see the sketch below).
- You will need a dedicated GPU to run this code efficiently.
- CUDA is proprietary.
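A sketch of the two usage modes noted above (the executable name gpuhello.exe, the clip value, and the data sizes are assumptions; the program is assumed to be compiled following RSFSRC/user/cuda/SConstruct): the GPU code behaves like an ordinary Madagascar filter, so it can sit inside a pipe within a Flow() command, and pscons can split the input across several nodes/GPUs.

from rsf.proj import *

Flow('a', None, 'math n1=512 n2=2048 output=x1+x2')
Flow('b', None, 'math n1=512 n2=2048 output=x1*x2')

# The GPU summation is piped straight into a regular CPU program (sfclip);
# with pscons, split= sends each group of vectors to a different node/GPU.
Flow('c', 'a b',
     './gpuhello.exe b=${SOURCES[1]} | clip clip=1e3',
     split=[2,2048,[0,1]], reduce='cat')

End()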

Takeaway message
Exploit data parallelism when possible. It keeps programs simple and SConstructs portable.