Panorama des modèles et outils de programmation parallèle (Overview of parallel programming models and tools)
1 Panorama des modèles et outils de programmation parallèle. Sylvain HENRY, University of Bordeaux - LaBRI - Inria - ENSEIRB, April 19th
2 Outline: Introduction, Accelerators & Many-Core, Runtime Systems, Conclusion
3 Introduction
4 Unicore [diagram: a single processing unit with a control unit and an ALU executing Load / c := a+b / Store against memory]
5 Unicore (continued):
- asm (x86...), C, Fortran, C++
6 Moore's Law + Frequency Wall = Parallelism (mandatory slide)
7-8 Multicore [diagram: a shared memory accessed by four PUs]
- PThread
9 POSIX Threads (PThread)
    #include <pthread.h>

    void *mythread(void *arg) {
        // C
        return NULL;
    }

    int main() {
        pthread_t tid;
        // A
        pthread_create(&tid, NULL, &mythread, NULL);
        // B
        pthread_join(tid, NULL);
        // D
    }
[diagram: the main thread runs A, creates the worker, runs B, then joins; the worker runs C; D runs after the join]
10 Multicore (continued):
- PThread
- OpenMP
11 OpenMP
    #include <omp.h>
    #include <stdio.h>

    int main() {
        // A
        #pragma omp parallel num_threads(5)
        {
            int id = omp_get_thread_num();
            printf("My id: %d\n", id);   // B
        }
        // C
    }

    % gcc -fopenmp test.c
    % ./a.out
    My id: 1
    My id: 3
    My id: 0
    My id: 2
    My id: 4
[diagram: A runs on the initial thread, B runs on 5 threads (id 0 to 4), C runs after the parallel region]
12 OpenMP
    #pragma omp parallel for num_threads(3)
    for (int i = 0; i < 10; i++) {
        int id = omp_get_thread_num();
        printf("%d: %d\n", id, i);
    }

    % ./a.out
    2: 7
    2: 8
    2: 9
    1: 4
    1: 5
    1: 6
    0: 0
    0: 1
    0: 2
    0: 3
13-15 Multicore with NUMA [diagram: four CPUs, each attached to its own memory bank, interconnected, with I/O attached to two of them]
- Issues: thread scheduling (data affinity...), data migration... (see the affinity sketch below)
- e.g. OpenMP: ForestGomp
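Not on the slides: a minimal sketch of how an OpenMP program can address the affinity issue itself, assuming an OpenMP 4.5 compiler (proc_bind clause, omp_get_place_num) and first-touch page allocation by the operating system.

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 1 << 24;
        double *a = malloc(n * sizeof *a);

        /* Run e.g. with: OMP_PLACES=sockets ./a.out
           Threads are spread over the NUMA nodes, and the first-touch
           initialization allocates each page on the node of the thread
           that will use it in later parallel loops. */
        #pragma omp parallel for proc_bind(spread) schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] = 0.0;

        /* Report which place (NUMA node/socket) each thread is bound to. */
        #pragma omp parallel proc_bind(spread)
        printf("thread %d bound to place %d\n",
               omp_get_thread_num(), omp_get_place_num());

        free(a);
        return 0;
    }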
16-17 Clusters [diagram: nodes Node0 ... NodeN connected by a network]
- Synchronous message passing: MPI
18 MPI
    int cnt = 1;
    if (id == 0) {
        MPI_Send(&cnt, 1, MPI_INT, id+1, 0, MPI_COMM_WORLD);
        MPI_Recv(&cnt, 1, MPI_INT, N-1, 0, MPI_COMM_WORLD, NULL);
        printf("Counter: %d\n", cnt);
    } else {
        MPI_Recv(&cnt, 1, MPI_INT, id-1, 0, MPI_COMM_WORLD, NULL);
        cnt += 1;
        MPI_Send(&cnt, 1, MPI_INT, (id+1) % N, 0, MPI_COMM_WORLD);
    }

    % mpicc test.c
    % mpirun -n 5 --hostfile hosts ./a.out
    Counter: 5
[diagram: a ring of processes id=0, id=1, id=2, id=3, id=4 passing the counter around]
19 Clusters (continued):
- Synchronous message passing: MPI
- Process placement issue
- Depends on network topology
- e.g. TreeMatch
20 TreeMatch [figure: ZEUSMP/2 communication pattern, sender rank vs. receiver rank, metric: msg]
- Trace communications between processes
- Algorithm to map processes to the network topology
21 Clusters (continued):
- Synchronous message passing: MPI
- Actor model / Active objects: Charm++, Erlang...
22 Actor model [diagram: two actors (id=0, id=1), each with its own memory and mailbox; id=0 sends to 1, id=1 receives]
- Asynchronous message passing
- Dynamic actor creation
(see the mailbox sketch below)
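Not from the slides: a minimal shared-memory sketch of the mailbox idea using PThreads (not the Charm++ or Erlang API); a real actor runtime would give each actor an unbounded message queue and support remote actors.

    #include <pthread.h>
    #include <stdio.h>

    typedef struct {                      /* one-slot mailbox */
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
        int full, value;
    } mailbox_t;

    static mailbox_t box = { PTHREAD_MUTEX_INITIALIZER,
                             PTHREAD_COND_INITIALIZER, 0, 0 };

    static void mailbox_send(mailbox_t *mb, int v) {   /* asynchronous send */
        pthread_mutex_lock(&mb->lock);
        mb->value = v;
        mb->full = 1;
        pthread_cond_signal(&mb->nonempty);
        pthread_mutex_unlock(&mb->lock);
    }

    static int mailbox_recv(mailbox_t *mb) {           /* blocking receive */
        pthread_mutex_lock(&mb->lock);
        while (!mb->full)
            pthread_cond_wait(&mb->nonempty, &mb->lock);
        int v = mb->value;
        mb->full = 0;
        pthread_mutex_unlock(&mb->lock);
        return v;
    }

    static void *actor1(void *arg) {                   /* actor id=1 */
        (void)arg;
        printf("actor 1 received %d\n", mailbox_recv(&box));
        return NULL;
    }

    int main(void) {                                   /* actor id=0 */
        pthread_t t;
        pthread_create(&t, NULL, actor1, NULL);        /* dynamic creation */
        mailbox_send(&box, 42);                        /* send to 1 */
        pthread_join(t, NULL);
        return 0;
    }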
23 Clusters (continued):
- Synchronous message passing: MPI
- Actor model / Active objects: Charm++, Erlang...
- PGAS (Partitioned Global Address Space): HPF, XcalableMP, X10, ZPL/Chapel...
24 PGAS: XcalableMP
    !$xmp nodes P(4)
    !$xmp template T(100)
    !$xmp distribute T(block) onto P
          real G(80,100)        ! global variable
    !$xmp align G(*,i) with T(i)
          real L(50,40)         ! local variable (default)
[diagram: the virtual global address space holds G(80,100); in the local view (actual data allocation) the four nodes hold G(80,1:25), G(80,26:50), G(80,51:75) and G(80,76:100) respectively, and each node has its own L(50,40)]
25 Clusters (continued):
- Synchronous message passing: MPI
- Actor model / Active objects: Charm++, Erlang...
- PGAS: HPF, XcalableMP, X10, ZPL/Chapel...
- MapReduce: Google MapReduce, Hadoop...
26 MapReduce [diagram: elements of type A are mapped through f to elements of type B, which are combined pairwise by g down to a single result C]
- reduce g . map f
- Most operations on collections can be written this way
- Easily parallelizable
- Users only have to provide f and g
- The runtime system provides: task distribution, communications, fault tolerance...
- Keywords: Bird-Meertens Formalism, Catamorphism
(see the sketch below)
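Not on the slides: a minimal sequential C sketch of the reduce g . map f pattern, with a hypothetical f (square) and g (sum); a MapReduce runtime would distribute the map over workers and combine partial reductions.

    #include <stdio.h>

    /* user-provided functions */
    static long f(long a) { return a * a; }           /* map:    A -> B    */
    static long g(long x, long y) { return x + y; }   /* reduce: B, B -> B */

    /* reduce g . map f over an array */
    static long map_reduce(const long *in, int n) {
        long acc = f(in[0]);
        for (int i = 1; i < n; i++)
            acc = g(acc, f(in[i]));
        return acc;
    }

    int main(void) {
        long data[] = {1, 2, 3, 4, 5};
        printf("%ld\n", map_reduce(data, 5));   /* prints 55 */
        return 0;
    }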
27 Summary
    Architecture      | Programming models
    Multicore (NUMA)  | OpenMP, PThread...
    Cluster           | MPI, PGAS, Actors, MapReduce...
- They all assume homogeneous architectures
- Same (or similar) architecture duplicated
28 Accelerators & Many-Core
29-37 Frameworks... [diagram, built up incrementally: OpenMP targets the CPU and MPI the cluster; libspe targets the Cell; CUDA and PTX target the GPU, with ATI Stream as the AMD counterpart and Ocelot retargeting PTX; OpenCL spans cluster, Cell, CPU, GPU and others; the Xeon Phi joins the set of targets; SnuCL extends OpenCL to clusters]
38 OpenCL [diagram: OpenCL = the OpenCL host API, used by the host program, plus the OpenCL C language, used for kernels; targets are CPUs, GPUs and accelerators]
39 OpenCL C Language [diagram: work-items grouped into work-groups; each work-group has a local memory, all share the global memory]
- SPMD
- get_local_id(int dim), get_global_id(int dim)
- __local, __global (address-space qualifiers)
- work-group barrier
(see the kernel sketch below)
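Not on the slides: a minimal OpenCL C kernel sketch using local memory and a work-group barrier (a per-work-group sum, assuming a power-of-two work-group size and the hypothetical buffer names in and partial).

    __kernel void group_sum(__global const float *in,
                            __global float *partial,
                            __local  float *tmp) {
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);
        size_t lsz = get_local_size(0);

        tmp[lid] = in[gid];                  // each work-item loads one element
        barrier(CLK_LOCAL_MEM_FENCE);        // wait for the whole work-group

        // tree reduction in local memory
        for (size_t s = lsz / 2; s > 0; s /= 2) {
            if (lid < s)
                tmp[lid] += tmp[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial[get_group_id(0)] = tmp[0];   // one result per work-group
    }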
40 OpenCL Host API
- Device management: query available devices and their capabilities
- Memory management: allocate/release buffers in device memory; transfer data between host memory and devices
- Kernel management: compilation; scheduling on devices
- Synchronizations: command graph (events...); with the host program (barriers...)
41-45 OpenCL example [animated diagram: interaction between the host and an OpenCL device]
46 OpenCL example: kernel
    __kernel void add2(__global float *out) {
        size_t id = get_global_id(0);
        out[id] += 2.0;
    }
47 OpenCL example: host API
    float data[] = { ... };

    clGetPlatformIDs(max_pfs, pfs, NULL);
    clGetDeviceIDs(pfs[0], CL_DEVICE_TYPE_ALL, 1, devs, NULL);
    ctx = clCreateContext(NULL, 1, &devs[0], NULL, NULL, NULL);
    cq  = clCreateCommandQueue(ctx, devs[0], 0, NULL);

    /* 256 floats, matching the grid below */
    buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 256*sizeof(float), NULL, NULL);
    clEnqueueWriteBuffer(cq, buf, CL_FALSE, 0, 256*sizeof(float), data, 0, NULL, NULL);

    program = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    ker = clCreateKernel(program, "add2", NULL);
    clSetKernelArg(ker, 0, sizeof(cl_mem), &buf);

    size_t group[3] = {32, 1, 1};
    size_t grid[3]  = {256, 1, 1};
    clEnqueueNDRangeKernel(cq, ker, 3, NULL, grid, group, 0, NULL, NULL);

    clEnqueueReadBuffer(cq, buf, CL_FALSE, 0, 256*sizeof(float), data, 0, NULL, NULL);
    clFinish(cq);
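Not on the slides: the example above synchronizes with clFinish; a minimal sketch of the event-based command graph mentioned earlier, reusing the variables from the example above, where the kernel launch waits explicitly on the completion of the write.

    cl_event write_done, kernel_done;

    /* the write produces an event... */
    clEnqueueWriteBuffer(cq, buf, CL_FALSE, 0, 256*sizeof(float), data,
                         0, NULL, &write_done);

    /* ...which the kernel launch depends on */
    clEnqueueNDRangeKernel(cq, ker, 1, NULL, grid, group,
                           1, &write_done, &kernel_done);

    /* block the host program until the kernel has completed */
    clWaitForEvents(1, &kernel_done);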
48-49 Frameworks... [diagram: as before; slide 49 adds OpenACC]
50 OpenACC Concepts
- Generate kernels from annotated C/Fortran code
- Simpler memory model (one accelerator)
- Ensure memory coherence with annotations
51 OpenACC example
    float data[] = { ... };

    #pragma acc kernels copy(data)
    for (int i = 0; i < 256; i++)
        data[i] += 2.0;
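Not on the slides: a minimal sketch of the coherence annotations mentioned in the concepts, assuming an OpenACC 2.x compiler; an explicit data region keeps the array on the accelerator across two loops and refreshes the host copy only where needed.

    float data[256];

    #pragma acc data copy(data[0:256])
    {
        #pragma acc kernels present(data)
        for (int i = 0; i < 256; i++)
            data[i] = i;

        #pragma acc kernels present(data)
        for (int i = 0; i < 256; i++)
            data[i] += 2.0f;

        /* make the host copy coherent while still inside the data region */
        #pragma acc update self(data[0:256])
    }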
52 Limitations of these low-level approaches
- No support for multi-accelerator (OpenACC)
- Hard to manage several accelerators (OpenCL)
53 Runtime Systems
54-56 Runtime Systems [diagrams: a task graph (tasks A, B, C, D with read, write and read-write accesses to data in a virtual memory) is handed to the runtime system, which drives the devices (CPU, GPU..., Xeon Phi) through drivers (CPU, CUDA, OpenCL...); tasks such as matmul carry both cpu and gpu implementations and consume data X, Y, Z, V to produce W; data A and B live in a virtual global space and are replicated in device memories as needed]
- Automatically schedule kernels on available accelerators
- Automatically manage distributed memory
- e.g. StarSS, StarPU...
57-58 Frameworks... [diagram: as before; slide 58 adds StarPU]
59 StarPU example
    void scal_cpu_func(void *buffers[], void *args);
    void scal_cuda_func(void *buffers[], void *args);

    starpu_codelet scal_cl = {
        .where     = STARPU_CPU | STARPU_CUDA,
        .cpu_func  = { scal_cpu_func, NULL },
        .cuda_func = { scal_cuda_func, NULL },
        .nbuffers  = 1,
        .modes     = { STARPU_RW }
    };

    float vector[N];
    starpu_data_handle vector_handle;
    starpu_vector_data_register(&vector_handle, 0, vector, N, sizeof(float));

    starpu_insert_task(&scal_cl, STARPU_RW, vector_handle, 0);
    starpu_task_wait_for_all();
    starpu_data_unregister(vector_handle);
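Not on the slides: the CPU implementation is only declared above; a minimal sketch of what its body could look like, assuming the StarPU vector interface macros (STARPU_VECTOR_GET_PTR, STARPU_VECTOR_GET_NX) and a hypothetical scaling factor of 2.

    void scal_cpu_func(void *buffers[], void *args) {
        float   *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        for (unsigned i = 0; i < n; i++)
            v[i] *= 2.0f;    /* scale the registered vector in place */
    }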
60 StarPU performance [figure: UTK Magma (linear algebra), QR factorization, on 16 CPUs (AMD) + 4 GPUs (C1060)]
61 Limitations
- Task/data granularity adaptation
- Task graph optimizations
[diagram: a task graph over data X, Y, Z, V, W with two matmul tasks, each providing cpu and gpu implementations]
62 What I am working on
- Use a functional language to describe the task graph
- More amenable to optimizations
- Granularity adaptation expected
- e.g. HaskellPU, ViperVM
63-64 Frameworks... [diagram: as before; slide 64 adds HaskellPU]
65 HaskellPU
    let res = a*x + b*x + c*x + d*x
    computeSync res

    let res = (a + b + c + d) * x
    computeSync res

    let res = (reduce (+) [a, b, c, d]) * x
    computeSync res
[diagram: the three expressions yield progressively smaller task graphs: four multiplications by x followed by additions, then a sum of a, b, c, d followed by a single multiplication by x, then a reduction tree of additions followed by the multiplication by x]
66 Conclusion
- Things are evolving quickly!
  - Architectures
  - Programming models
- Runtime systems are now accepted in the HPC community
  - Ensure portability and performance
- New challenges
  - Fault tolerance
  - Power management
  - Collaboration with the programming language community (FP...)
67 Thank you!