11 Parallel programming models
There are many different models for expressing parallelism in programming languages:

- Actor model: Erlang, Scala
- Coordination languages: Linda
- CSP-based (Communicating Sequential Processes): FortranM, Occam
- Dataflow: SISAL (Streams and Iteration in a Single Assignment Language)
- Distributed: Sequoia, Bloom
- Event-driven and hardware description: Verilog hardware description language (HDL)
- Multi-threaded: Clojure
- Functional: Concurrent Haskell
- Object-oriented: Charm++, Smalltalk
- GPU languages: CUDA, OpenMP, OpenACC, OpenHMPP (HMPP for Hybrid Multicore Parallel Programming)
- Message-passing: MPI, PVM
- Partitioned global address space (PGAS): High Performance Fortran (HPF)
- Logic programming: Parlog
11.1 HPF

High Performance Fortran (HPF):

- partitioned global address space parallel programming model
- a Fortran 90 extension
- SPMD (Single Program Multiple Data) model: each process operates on its own part of the data
- HPF directives specify which processor gets which part of the data
- concurrency is defined by HPF directives, building on Fortran 90
- HPF directives are written as comments: !HPF$ <directive>
HPF example: matrix multiplication

      PROGRAM ABmult
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 100
      INTEGER, DIMENSION (N,N) :: A, B, C
      INTEGER :: i, j
!HPF$ PROCESSORS square(2,2)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
!HPF$ ALIGN A(i,*) WITH C(i,*)
!     replicate copies of row A(i,*) onto processors which compute C(i,j)
!HPF$ ALIGN B(*,j) WITH C(*,j)
!     replicate copies of column B(*,j) onto processors which compute C(i,j)
      A = 1
      B = 2
      C = 0
      DO i = 1, N
        DO j = 1, N
!         All the work is local due to the ALIGNs
          C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
        END DO
      END DO
      WRITE(*,*) C
      END
HPF programming methodology

- Need to find a balance between concurrency and communication: the more processes, the more communication
- aim for a balanced load, based on the "owner computes" rule
- exploit data locality
- It is easy to write a program in HPF, but difficult to obtain good efficiency

Programming in HPF goes more or less like this:

1. Write a correctly working serial program; test and debug it
2. Add distribution directives, introducing as little communication as possible (see the sketch below)
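As a small illustration of step 2 (a hypothetical sketch, not from the slides; the array names and sizes are invented): starting from a working serial loop, distribution directives are added so that each processor owns, and computes on, its own block of the arrays, following the "owner computes" rule:

      REAL, DIMENSION(10000) :: x, y
      INTEGER :: i
!HPF$ PROCESSORS procs(4)
!HPF$ DISTRIBUTE (BLOCK) ONTO procs :: x
!HPF$ ALIGN y(i) WITH x(i)
!     y(i) lives on the owner of x(i), so the loop below needs no communication
      DO i = 1, 10000
        y(i) = 2.0*x(i)
      END DO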
11.2 OpenMP

OpenMP tutorial: http://www.llnl.gov/computing/tutorials/openmp/

- programming model based on thread parallelism
- a helping tool for the programmer
- based on compiler directives, for C/C++ and Fortran
- nested parallelism is supported (though not all implementations support it; see the sketch below)
- dynamic threads
- OpenMP has become a standard
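A minimal sketch of dynamic threads and nested parallelism (not from the slides; it assumes an OpenMP-capable compiler, e.g. gfortran with -fopenmp):

      program nested_demo
        use omp_lib
        implicit none
        call omp_set_dynamic(.true.)  ! let the runtime adjust team sizes
        call omp_set_nested(.true.)   ! allow parallel regions inside parallel regions
!$omp parallel num_threads(2)
        print *, 'outer thread', omp_get_thread_num()
!$omp parallel num_threads(2)
        print *, '  inner thread', omp_get_thread_num(), &
                 ' under outer thread', omp_get_ancestor_thread_num(1)
!$omp end parallel
!$omp end parallel
      end program nested_demo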
Fork-Join model

OpenMP uses the fork-join model of parallel execution: the program starts as a single master thread, which forks a team of threads at the beginning of each parallel region; at the end of the region the threads synchronize and join, leaving only the master thread.
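The model can be shown in a few lines (a sketch under the same compiler assumption as above, not from the slides):

      program fork_join
        use omp_lib
        implicit none
        print *, 'serial region: master thread only'
!$omp parallel                 ! fork: a team of threads starts here
        print *, 'parallel region: thread', omp_get_thread_num(), &
                 ' of', omp_get_num_threads()
!$omp end parallel             ! join: the team ends, the master continues
        print *, 'serial region again: master thread only'
      end program fork_join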
OpenMP example: matrix multiplication

! http://www.llnl.gov/computing/tutorials/openmp/exercise.html

C******************************************************************************
C OpenMP Example - Matrix Multiply - Fortran Version
C FILE: omp_mm.f
C DESCRIPTION:
C   Demonstrates a matrix multiply using OpenMP. Threads share row iterations
C   according to a predefined chunk size.
C LAST REVISED: 1/5/04  Blaise Barney
C******************************************************************************
      PROGRAM MATMULT
      INTEGER NRA, NCA, NCB, TID, NTHREADS, I, J, K, CHUNK,
     +        OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
C     number of rows in matrix A
      PARAMETER (NRA=62)
C     number of columns in matrix A
      PARAMETER (NCA=15)
C     number of columns in matrix B
      PARAMETER (NCB=7)
      REAL*8 A(NRA,NCA), B(NCA,NCB), C(NRA,NCB)

C     Set loop iteration chunk size
      CHUNK = 10

C     Spawn a parallel region explicitly scoping all variables
!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(TID,I,J,K)
      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Starting matrix multiple example with', NTHREADS,
     +           'threads'
        PRINT *, 'Initializing matrices'
      END IF

C     Initialize matrices
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 30 I=1, NRA
        DO 30 J=1, NCA
          A(I,J) = (I-1)+(J-1)
   30 CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 40 I=1, NCA
        DO 40 J=1, NCB
          B(I,J) = (I-1)*(J-1)
   40 CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 50 I=1, NRA
        DO 50 J=1, NCB
          C(I,J) = 0
   50 CONTINUE

C     Do matrix multiply sharing iterations on outer loop
C     Display who does which iterations for demonstration purposes
      PRINT *, 'Thread', TID, 'starting matrix multiply...'
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 60 I=1, NRA
        PRINT *, 'Thread', TID, 'did row', I
        DO 60 J=1, NCB
          DO 60 K=1, NCA
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
   60 CONTINUE

C     End of parallel region
!$OMP END PARALLEL

C     Print results
      PRINT *, '******************************************************'
      PRINT *, 'Result Matrix:'
      DO 90 I=1, NRA
        DO 80 J=1, NCB
          WRITE(*,70) C(I,J)
   70     FORMAT(2x,f8.2,$)
   80   CONTINUE
        PRINT *, ' '
   90 CONTINUE
      PRINT *, '******************************************************'
      PRINT *, 'Done.'
      END
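As a usage note (an assumption about tooling, not stated on the slides): with GNU Fortran this fixed-form example can typically be built with gfortran -fopenmp omp_mm.f, and the number of threads is controlled through the OMP_NUM_THREADS environment variable.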
11.3 GPGPU

General-purpose computing on graphics processing units.

CUDA:

- parallel programming platform and model created by NVIDIA
- CUDA-extended programming languages, compiled with nvcc: C, C++, Fortran
- CUDA gives access to the GPU's virtual instruction set and to the memory of its parallel computational elements
- based on executing a large number of threads simultaneously
- CUDA-accelerated libraries
CUDA Fortran matrix multiplication example (for more information, see http://):

Host (CPU) code:

      subroutine mmul( A, B, C )
        use cudafor
        real, dimension(:,:) :: A, B, C
        integer :: N, M, L
        real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
        type(dim3) :: dimGrid, dimBlock
        N = size(A,1); M = size(A,2); L = size(B,2)
        allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )
        Adev = A(1:N,1:M)
        Bdev = B(1:M,1:L)
        dimGrid = dim3( N/16, L/16, 1 )
        dimBlock = dim3( 16, 16, 1 )
        call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )
        C(1:N,1:L) = Cdev
        deallocate( Adev, Bdev, Cdev )
      end subroutine

GPU code:

      attributes(global) subroutine MMUL_KERNEL( A, B, C, N, M, L )
        real, device :: A(N,M), B(M,L), C(N,L)
        integer, value :: N, M, L
        integer :: i, j, kb, k, tx, ty
        real, shared :: Ab(16,16), Bb(16,16)
        real :: Cij
        tx = threadidx%x; ty = threadidx%y
        i = (blockidx%x-1)*16 + tx
        j = (blockidx%y-1)*16 + ty
        Cij = 0.0
        do kb = 1, M, 16
          ! Fetch one element each into Ab and Bb; note that the 16x16 = 256
          ! threads in this thread block are fetching separate elements
          ! of Ab and Bb
          Ab(tx,ty) = A(i,kb+ty-1)
          Bb(tx,ty) = B(kb+tx-1,j)
          ! Wait until all elements of Ab and Bb are filled
          call syncthreads()
          do k = 1, 16
            Cij = Cij + Ab(tx,k) * Bb(k,ty)
          enddo
          ! Wait until all threads in the thread block finish with
          ! this iteration's Ab and Bb
          call syncthreads()
        enddo
        C(i,j) = Cij
      end subroutine

Note that the launch configuration assumes the matrix dimensions are multiples of 16, matching the 16x16 thread blocks used by the kernel.

OpenCL

Open Computing Language (OpenCL):

- heterogeneous systems of CPUs (central processing units), GPUs (graphics processing units), DSPs (digital signal processors), FPGAs (field-programmable gate arrays) and other processors
- a language for writing kernels (functions that execute on OpenCL devices), based on C99, plus application programming interfaces (APIs); a Fortran interface exists (fortrancl)
- parallel computing using task-based and data-based parallelism
- GPGPU
- open standard by the Khronos Group (Apple, AMD, IBM, Intel and Nvidia), later also adopted by Altera, Samsung, Vivante and ARM Holdings

OpenCL example: matrix multiplication
/* kernel.cl
 * Matrix multiplication: C = A * B  (device code)
 * (http://gpgpu-computing4.blogspot.co.uk//09/matrix-multiplication-2-opencl.html)
 */

// OpenCL kernel
__kernel void matrixMul(__global float* C,
                        __global float* A,
                        __global float* B,
                        int wA, int wB)
{
   // 2D thread ID
   // Old CUDA code:
   // int tx = blockIdx.x * TILE_SIZE + threadIdx.x;
   // int ty = blockIdx.y * TILE_SIZE + threadIdx.y;
   int tx = get_global_id(0);
   int ty = get_global_id(1);

   // value stores the element that is computed by the thread
   float value = 0;
   for (int k = 0; k < wA; ++k)
   {
      float elementA = A[ty * wA + k];
      float elementB = B[k * wB + tx];
      value += elementA * elementB;
   }

   // Write the matrix to device memory; each thread writes one element
   C[ty * wA + tx] = value;
}

A complete program also needs host code that creates an OpenCL context, command queue and device buffers, builds this kernel, and enqueues it over a 2D work range.

OpenACC

- a standard (by Cray, CAPS, Nvidia and PGI) to simplify parallel programming of heterogeneous CPU/GPU systems
- uses annotations (like OpenMP) in C, C++ and Fortran source code
- code is started on both the CPU and the GPU automatically
- OpenACC is expected to merge into OpenMP in a future release of OpenMP
! A simple OpenACC kernel for matrix multiplication
!$acc kernels
      do k = 1, n1
        do i = 1, n3
          c(i,k) = 0.0
          do j = 1, n2
            c(i,k) = c(i,k) + a(i,j) * b(j,k)
          enddo
        enddo
      enddo
!$acc end kernels

A larger example combining OpenMP and OpenACC:
! http://www.bu.edu/tech/research/computation/linux-cluster/katana-cluster/gpu-computing/openacc-fortran/matrix-multiply-fortran/

      program matrix_multiply
        use omp_lib
        use openacc
        implicit none
        integer :: i, j, k, myid, m, n, compiled_for, option
        integer, parameter :: fd = 11
        integer :: t1, t2, dt, count_rate, count_max
        real, allocatable, dimension(:,:) :: a, b, c
        real :: tmp, secs

        open(fd, file='wallclocktime', form='formatted')

        option = compiled_for(fd)  ! 1 - serial; 2 - OpenMP; 3 - OpenACC; 4 - both

!$omp parallel
!$      myid = OMP_GET_THREAD_NUM()
!$      if (myid .eq. 0) then
!$        write(fd,"('Number of procs is ',i4)") OMP_GET_NUM_THREADS()
!$      endif
!$omp end parallel

        call system_clock(count_max=count_max, count_rate=count_rate)

        do m = 1, 4  ! compute for different size matrix multiplies
          call system_clock(t1)
          n = 1000 * 2**(m-1)  ! 1000, 2000, 4000, 8000
          allocate( a(n,n), b(n,n), c(n,n) )

          ! Initialize matrices
          do j = 1, n
            do i = 1, n
              a(i,j) = real(i + j)
              b(i,j) = real(i - j)
            enddo
          enddo

!$omp parallel do shared(a,b,c,n) reduction(+:tmp)
!$acc data copyin(a,b) copy(c)
!$acc kernels
          ! Compute matrix multiplication.
          do j = 1, n
            do i = 1, n
              tmp = 0.0  ! enables ACC parallelism for the k-loop
              do k = 1, n
                tmp = tmp + a(i,k) * b(k,j)
              enddo
              c(i,j) = tmp
            enddo
          enddo
!$acc end kernels
!$acc end data
!$omp end parallel do

          call system_clock(t2)
          dt = t2 - t1
          secs = real(dt) / real(count_rate)
          write(fd,"('For n = ',i4,', wall clock time is ',f12.2,' seconds')") &
                n, secs
          deallocate(a, b, c)
        enddo

        close(fd)
      end program matrix_multiply

      integer function compiled_for(fd)
        implicit none
        integer :: fd
#if defined _OPENMP && defined _OPENACC
        compiled_for = 4
        write(fd,"('This code is compiled with OpenMP & OpenACC')")
#elif defined _OPENACC
        compiled_for = 3
        write(fd,"('This code is compiled with OpenACC')")
#elif defined _OPENMP
        compiled_for = 2
        write(fd,"('This code is compiled with OpenMP')")
#else
        compiled_for = 1
        write(fd,"('This code is compiled for serial operations')")
#endif
      end function compiled_for
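As a usage note (again an assumption about tooling, not from the slides): with the PGI compilers used on the BU cluster page above, this example is typically built with flags along the lines of pgfortran -Mpreprocess -mp -acc, where -mp enables the OpenMP directives, -acc enables the OpenACC directives, and -Mpreprocess runs the C preprocessor required by the #ifdef logic in compiled_for.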