Panorama des modèles et outils de programmation parallèle (Overview of parallel programming models and tools)
1 Panorama des modèles et outils de programmation parallèle. Sylvain HENRY, University of Bordeaux - LaBRI - Inria - ENSEIRB, April 19th
2 Outline: Introduction, Accelerators & Many-Core, Runtime Systems, Conclusion
3 Introduction
4 Unicore [diagram: a single processing unit with a control unit and an ALU executing Load / c := a+b / Store against memory]
5 Unicore (continued):
- asm (x86...), C, Fortran, C++
6 Moore's Law + Frequency Wall = Parallelism (mandatory slide)
7-8 Multicore [diagram: a shared memory accessed by four PUs]
- PThread
9 POSIX Threads (PThread)
    #include <pthread.h>

    void *mythread(void *arg) {
        // C
        return NULL;
    }

    int main() {
        pthread_t tid;
        // A
        pthread_create(&tid, NULL, &mythread, NULL);
        // B
        pthread_join(tid, NULL);
        // D
    }
[diagram: the main thread runs A, creates the worker, runs B, then joins; the worker runs C; D runs after the join]
10 Multicore (continued):
- PThread
- OpenMP
11 OpenMP
    #include <omp.h>
    #include <stdio.h>

    int main() {
        // A
        #pragma omp parallel num_threads(5)
        {
            int id = omp_get_thread_num();
            printf("My id: %d\n", id);   // B
        }
        // C
    }

    % gcc -fopenmp test.c
    % ./a.out
    My id: 1
    My id: 3
    My id: 0
    My id: 2
    My id: 4
[diagram: A runs on the initial thread, B runs on 5 threads (id 0 to 4), C runs after the parallel region]
12 OpenMP
    #pragma omp parallel for num_threads(3)
    for (int i = 0; i < 10; i++) {
        int id = omp_get_thread_num();
        printf("%d: %d\n", id, i);
    }

    % ./a.out
    2: 7
    2: 8
    2: 9
    1: 4
    1: 5
    1: 6
    0: 0
    0: 1
    0: 2
    0: 3
13-15 Multicore with NUMA [diagram: four CPUs, each attached to its own memory bank, interconnected, with I/O attached to two of them]
- Issues: thread scheduling (data affinity...), data migration... (see the affinity sketch below)
- e.g. OpenMP: ForestGomp
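Not on the slides: a minimal sketch of how an OpenMP program can address the affinity issue itself, assuming an OpenMP 4.5 compiler (proc_bind clause, omp_get_place_num) and first-touch page allocation by the operating system.

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 1 << 24;
        double *a = malloc(n * sizeof *a);

        /* Run e.g. with: OMP_PLACES=sockets ./a.out
           Threads are spread over the NUMA nodes, and the first-touch
           initialization allocates each page on the node of the thread
           that will use it in later parallel loops. */
        #pragma omp parallel for proc_bind(spread) schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] = 0.0;

        /* Report which place (NUMA node/socket) each thread is bound to. */
        #pragma omp parallel proc_bind(spread)
        printf("thread %d bound to place %d\n",
               omp_get_thread_num(), omp_get_place_num());

        free(a);
        return 0;
    }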
16-17 Clusters [diagram: nodes Node0 ... NodeN connected by a network]
- Synchronous message passing: MPI
18 MPI
    int cnt = 1;
    if (id == 0) {
        MPI_Send(&cnt, 1, MPI_INT, id+1, 0, MPI_COMM_WORLD);
        MPI_Recv(&cnt, 1, MPI_INT, N-1, 0, MPI_COMM_WORLD, NULL);
        printf("Counter: %d\n", cnt);
    } else {
        MPI_Recv(&cnt, 1, MPI_INT, id-1, 0, MPI_COMM_WORLD, NULL);
        cnt += 1;
        MPI_Send(&cnt, 1, MPI_INT, (id+1) % N, 0, MPI_COMM_WORLD);
    }

    % mpicc test.c
    % mpirun -n 5 --hostfile hosts ./a.out
    Counter: 5
[diagram: a ring of processes id=0, id=1, id=2, id=3, id=4 passing the counter around]
19 Clusters (continued):
- Synchronous message passing: MPI
- Process placement issue
- Depends on network topology
- e.g. TreeMatch
20 TreeMatch [figure: ZEUSMP/2 communication pattern, sender rank vs. receiver rank, metric: msg]
- Trace communications between processes
- Algorithm to map processes to the network topology
21 Clusters (continued):
- Synchronous message passing: MPI
- Actor model / Active objects: Charm++, Erlang...
22 Actor model [diagram: two actors (id=0, id=1), each with its own memory and mailbox; id=0 sends to 1, id=1 receives]
- Asynchronous message passing
- Dynamic actor creation
(see the mailbox sketch below)
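Not from the slides: a minimal shared-memory sketch of the mailbox idea using PThreads (not the Charm++ or Erlang API); a real actor runtime would give each actor an unbounded message queue and support remote actors.

    #include <pthread.h>
    #include <stdio.h>

    typedef struct {                      /* one-slot mailbox */
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
        int full, value;
    } mailbox_t;

    static mailbox_t box = { PTHREAD_MUTEX_INITIALIZER,
                             PTHREAD_COND_INITIALIZER, 0, 0 };

    static void mailbox_send(mailbox_t *mb, int v) {   /* asynchronous send */
        pthread_mutex_lock(&mb->lock);
        mb->value = v;
        mb->full = 1;
        pthread_cond_signal(&mb->nonempty);
        pthread_mutex_unlock(&mb->lock);
    }

    static int mailbox_recv(mailbox_t *mb) {           /* blocking receive */
        pthread_mutex_lock(&mb->lock);
        while (!mb->full)
            pthread_cond_wait(&mb->nonempty, &mb->lock);
        int v = mb->value;
        mb->full = 0;
        pthread_mutex_unlock(&mb->lock);
        return v;
    }

    static void *actor1(void *arg) {                   /* actor id=1 */
        (void)arg;
        printf("actor 1 received %d\n", mailbox_recv(&box));
        return NULL;
    }

    int main(void) {                                   /* actor id=0 */
        pthread_t t;
        pthread_create(&t, NULL, actor1, NULL);        /* dynamic creation */
        mailbox_send(&box, 42);                        /* send to 1 */
        pthread_join(t, NULL);
        return 0;
    }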
23 Clusters (continued):
- Synchronous message passing: MPI
- Actor model / Active objects: Charm++, Erlang...
- PGAS (Partitioned Global Address Space): HPF, XcalableMP, X10, ZPL/Chapel...
24 PGAS: XcalableMP
    !$xmp nodes P(4)
    !$xmp template T(100)
    !$xmp distribute T(block) onto P
          real G(80,100)        ! global variable
    !$xmp align G(*,i) with T(i)
          real L(50,40)         ! local variable (default)
[diagram: the virtual global address space holds G(80,100); in the local view (actual data allocation) the four nodes hold G(80,1:25), G(80,26:50), G(80,51:75) and G(80,76:100) respectively, and each node has its own L(50,40)]
25 Clusters (continued):
- Synchronous message passing: MPI
- Actor model / Active objects: Charm++, Erlang...
- PGAS: HPF, XcalableMP, X10, ZPL/Chapel...
- MapReduce: Google MapReduce, Hadoop...
26 MapReduce [diagram: elements of type A are mapped through f to elements of type B, which are combined pairwise by g down to a single result C]
- reduce g . map f
- Most operations on collections can be written this way
- Easily parallelizable
- Users only have to provide f and g
- The runtime system provides: task distribution, communications, fault tolerance...
- Keywords: Bird-Meertens Formalism, Catamorphism
(see the sketch below)
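Not on the slides: a minimal sequential C sketch of the reduce g . map f pattern, with a hypothetical f (square) and g (sum); a MapReduce runtime would distribute the map over workers and combine partial reductions.

    #include <stdio.h>

    /* user-provided functions */
    static long f(long a) { return a * a; }           /* map:    A -> B    */
    static long g(long x, long y) { return x + y; }   /* reduce: B, B -> B */

    /* reduce g . map f over an array */
    static long map_reduce(const long *in, int n) {
        long acc = f(in[0]);
        for (int i = 1; i < n; i++)
            acc = g(acc, f(in[i]));
        return acc;
    }

    int main(void) {
        long data[] = {1, 2, 3, 4, 5};
        printf("%ld\n", map_reduce(data, 5));   /* prints 55 */
        return 0;
    }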
27 Summary
    Architecture      | Programming models
    Multicore (NUMA)  | OpenMP, PThread...
    Cluster           | MPI, PGAS, Actors, MapReduce...
- They all assume homogeneous architectures
- Same (or similar) architecture duplicated
28 Accelerators & Many-Core
29-37 Frameworks... [diagram, built up incrementally: OpenMP targets the CPU and MPI the cluster; libspe targets the Cell; CUDA and PTX target the GPU, with ATI Stream as the AMD counterpart and Ocelot retargeting PTX; OpenCL spans cluster, Cell, CPU, GPU and others; the Xeon Phi joins the set of targets; SnuCL extends OpenCL to clusters]
38 OpenCL [diagram: OpenCL = the OpenCL host API, used by the host program, plus the OpenCL C language, used for kernels; targets are CPUs, GPUs and accelerators]
39 OpenCL C Language [diagram: work-items grouped into work-groups; each work-group has a local memory, all share the global memory]
- SPMD
- get_local_id(int dim), get_global_id(int dim)
- __local, __global (address-space qualifiers)
- work-group barrier
(see the kernel sketch below)
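Not on the slides: a minimal OpenCL C kernel sketch using local memory and a work-group barrier (a per-work-group sum, assuming a power-of-two work-group size and the hypothetical buffer names in and partial).

    __kernel void group_sum(__global const float *in,
                            __global float *partial,
                            __local  float *tmp) {
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);
        size_t lsz = get_local_size(0);

        tmp[lid] = in[gid];                  // each work-item loads one element
        barrier(CLK_LOCAL_MEM_FENCE);        // wait for the whole work-group

        // tree reduction in local memory
        for (size_t s = lsz / 2; s > 0; s /= 2) {
            if (lid < s)
                tmp[lid] += tmp[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial[get_group_id(0)] = tmp[0];   // one result per work-group
    }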
40 OpenCL Host API
- Device management: query available devices and their capabilities
- Memory management: allocate/release buffers in device memory; transfer data between host memory and devices
- Kernel management: compilation; scheduling on devices
- Synchronizations: command graph (events...); with the host program (barriers...)
41-45 OpenCL example [animated diagram: interaction between the host and an OpenCL device]
46 OpenCL example: kernel
    __kernel void add2(__global float *out) {
        size_t id = get_global_id(0);
        out[id] += 2.0;
    }
47 OpenCL example: host API
    float data[] = { ... };

    clGetPlatformIDs(max_pfs, pfs, NULL);
    clGetDeviceIDs(pfs[0], CL_DEVICE_TYPE_ALL, 1, devs, NULL);
    ctx = clCreateContext(NULL, 1, &devs[0], NULL, NULL, NULL);
    cq  = clCreateCommandQueue(ctx, devs[0], 0, NULL);

    /* 256 floats, matching the grid below */
    buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 256*sizeof(float), NULL, NULL);
    clEnqueueWriteBuffer(cq, buf, CL_FALSE, 0, 256*sizeof(float), data, 0, NULL, NULL);

    program = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    ker = clCreateKernel(program, "add2", NULL);
    clSetKernelArg(ker, 0, sizeof(cl_mem), &buf);

    size_t group[3] = {32, 1, 1};
    size_t grid[3]  = {256, 1, 1};
    clEnqueueNDRangeKernel(cq, ker, 3, NULL, grid, group, 0, NULL, NULL);

    clEnqueueReadBuffer(cq, buf, CL_FALSE, 0, 256*sizeof(float), data, 0, NULL, NULL);
    clFinish(cq);
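Not on the slides: the example above synchronizes with clFinish; a minimal sketch of the event-based command graph mentioned earlier, reusing the variables from the example above, where the kernel launch waits explicitly on the completion of the write.

    cl_event write_done, kernel_done;

    /* the write produces an event... */
    clEnqueueWriteBuffer(cq, buf, CL_FALSE, 0, 256*sizeof(float), data,
                         0, NULL, &write_done);

    /* ...which the kernel launch depends on */
    clEnqueueNDRangeKernel(cq, ker, 1, NULL, grid, group,
                           1, &write_done, &kernel_done);

    /* block the host program until the kernel has completed */
    clWaitForEvents(1, &kernel_done);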
48-49 Frameworks... [diagram: as before; slide 49 adds OpenACC]
50 OpenACC Concepts
- Generate kernels from annotated C/Fortran code
- Simpler memory model (one accelerator)
- Ensure memory coherence with annotations
51 OpenACC example
    float data[] = { ... };

    #pragma acc kernels copy(data)
    for (int i = 0; i < 256; i++)
        data[i] += 2.0;
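Not on the slides: a minimal sketch of the coherence annotations mentioned in the concepts, assuming an OpenACC 2.x compiler; an explicit data region keeps the array on the accelerator across two loops and refreshes the host copy only where needed.

    float data[256];

    #pragma acc data copy(data[0:256])
    {
        #pragma acc kernels present(data)
        for (int i = 0; i < 256; i++)
            data[i] = i;

        #pragma acc kernels present(data)
        for (int i = 0; i < 256; i++)
            data[i] += 2.0f;

        /* make the host copy coherent while still inside the data region */
        #pragma acc update self(data[0:256])
    }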
52 Limitations of these low-level approaches
- No support for multi-accelerator (OpenACC)
- Hard to manage several accelerators (OpenCL)
53 Runtime Systems
54-56 Runtime Systems [diagrams: a task graph (tasks A, B, C, D with read, write and read-write accesses to data in a virtual memory) is handed to the runtime system, which drives the devices (CPU, GPU..., Xeon Phi) through drivers (CPU, CUDA, OpenCL...); tasks such as matmul carry both cpu and gpu implementations and consume data X, Y, Z, V to produce W; data A and B live in a virtual global space and are replicated in device memories as needed]
- Automatically schedule kernels on available accelerators
- Automatically manage distributed memory
- e.g. StarSS, StarPU...
57-58 Frameworks... [diagram: as before; slide 58 adds StarPU]
59 StarPU example
    void scal_cpu_func(void *buffers[], void *args);
    void scal_cuda_func(void *buffers[], void *args);

    starpu_codelet scal_cl = {
        .where     = STARPU_CPU | STARPU_CUDA,
        .cpu_func  = { scal_cpu_func, NULL },
        .cuda_func = { scal_cuda_func, NULL },
        .nbuffers  = 1,
        .modes     = { STARPU_RW }
    };

    float vector[N];
    starpu_data_handle vector_handle;
    starpu_vector_data_register(&vector_handle, 0, vector, N, sizeof(float));

    starpu_insert_task(&scal_cl, STARPU_RW, vector_handle, 0);
    starpu_task_wait_for_all();
    starpu_data_unregister(vector_handle);
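Not on the slides: the CPU implementation is only declared above; a minimal sketch of what its body could look like, assuming the StarPU vector interface macros (STARPU_VECTOR_GET_PTR, STARPU_VECTOR_GET_NX) and a hypothetical scaling factor of 2.

    void scal_cpu_func(void *buffers[], void *args) {
        float   *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        for (unsigned i = 0; i < n; i++)
            v[i] *= 2.0f;    /* scale the registered vector in place */
    }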
60 StarPU performance [figure: UTK Magma (linear algebra), QR factorization, on 16 CPUs (AMD) + 4 GPUs (C1060)]
61 Limitations
- Task/data granularity adaptation
- Task graph optimizations
[diagram: a task graph over data X, Y, Z, V, W with two matmul tasks, each providing cpu and gpu implementations]
62 What I am working on
- Use a functional language to describe the task graph
- More amenable to optimizations
- Granularity adaptation expected
- e.g. HaskellPU, ViperVM
63-64 Frameworks... [diagram: as before; slide 64 adds HaskellPU]
65 HaskellPU
    let res = a*x + b*x + c*x + d*x
    computeSync res

    let res = (a + b + c + d) * x
    computeSync res

    let res = (reduce (+) [a, b, c, d]) * x
    computeSync res
[diagram: the three expressions yield progressively smaller task graphs: four multiplications by x followed by additions, then a sum of a, b, c, d followed by a single multiplication by x, then a reduction tree of additions followed by the multiplication by x]
66 Conclusion
- Things are evolving quickly!
  - Architectures
  - Programming models
- Runtime systems are now accepted in the HPC community
  - Ensure portability and performance
- New challenges
  - Fault tolerance
  - Power management
  - Collaboration with the programming language community (FP...)
67 Thank you!