Overview of Parallel Programming Models and Tools


1 Overview of Parallel Programming Models and Tools. Sylvain HENRY, University of Bordeaux - LaBRI - Inria - ENSEIRB. April 19th. 1/45

2 Outline: Introduction, Accelerators & Many-Core, Runtime Systems, Conclusion 2/45

3 Introduction 3/45

4 Unicore P.U. (Diagram: Control Unit and ALU; Load/Store to Memory; executing "c := a+b") 4/45

5 Unicore P.U. (same diagram) - asm (x86...), C, Fortran, C++ 4/45

6 Moore's Law + Frequency Wall = Parallelism (mandatory slide) 5/45

7 Multicore (Diagram: four PUs sharing one Memory) 6/45

8 Multicore (same diagram) - PThread 6/45

9 POSIX Threads (PThread)

    #include <pthread.h>

    void *mythread(void *arg) {
        // C
    }

    int main() {
        pthread_t tid;
        // A
        pthread_create(&tid, NULL, &mythread, NULL);
        // B
        pthread_join(tid, NULL);
        // D
    }

(Diagram: 4 PUs; main runs A, create spawns the thread; B (main) and C (thread) run in parallel; join; then D)
7/45

10 Multicore (same diagram) - PThread - OpenMP 8/45

11 OpenMP

    #include <omp.h>

    int main() {
        // A
        #pragma omp parallel num_threads(5)
        {
            int id = omp_get_thread_num();
            printf("My id: %d\n", id);
            // B
        }
        // C
    }

    % gcc -fopenmp test.c
    % ./a.out
    My id: 1
    My id: 3
    My id: 0
    My id: 2
    My id: 4

(Diagram: A, then threads id=0..4 each execute B in parallel, then C)
9/45

12 OpenMP

    #pragma omp parallel for num_threads(3)
    for (int i = 0; i < 10; i++) {
        int id = omp_get_thread_num();
        printf("%d: %d\n", id, i);
    }

    % ./a.out
    2: 7
    2: 8
    2: 9
    1: 4
    1: 5
    1: 6
    0: 0
    0: 1
    0: 2
    0: 3

10/45

13 Multicore with NUMA (Diagram: CPUs #0 to #3, each with its local MEM, plus two I/O hubs) 11/45

14 Multicore with NUMA (same diagram) - Issues: thread scheduling (data affinity...), data migration... 11/45

15 Multicore with NUMA (same diagram) - Issues: thread scheduling (data affinity...), data migration... - e.g. OpenMP: ForestGOMP 11/45
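The data-affinity issue can be made concrete with libnuma. The following is a minimal sketch (not from the slides) that assumes a Linux machine with libnuma installed (compile with -lnuma): each buffer is explicitly allocated on one NUMA node, so a thread pinned to that node only performs local memory accesses.

    /* Sketch: NUMA-aware allocation with libnuma (assumed available). */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return EXIT_FAILURE;
        }
        int nodes = numa_max_node() + 1;
        size_t bytes = 1 << 20;

        /* One buffer per NUMA node, each backed by that node's local memory. */
        for (int n = 0; n < nodes; n++) {
            double *buf = numa_alloc_onnode(bytes, n);
            if (!buf) continue;
            buf[0] = 0.0;   /* touch the memory so pages are committed on node n */
            printf("buffer for node %d allocated at %p\n", n, (void *)buf);
            numa_free(buf, bytes);
        }
        return 0;
    }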

16 Clusters (Diagram: Node0 ... NodeN connected by a Network) 12/45

17 Clusters (same diagram) - Synchronous Message Passing: MPI 12/45

18 MPI

    int cnt = 1;
    if (id == 0) {
        MPI_Send(&cnt, 1, MPI_INT, id+1, 0, MPI_COMM_WORLD);
        MPI_Recv(&cnt, 1, MPI_INT, N-1, 0, MPI_COMM_WORLD, NULL);
        printf("Counter: %d\n", cnt);
    } else {
        MPI_Recv(&cnt, 1, MPI_INT, id-1, 0, MPI_COMM_WORLD, NULL);
        cnt += 1;
        MPI_Send(&cnt, 1, MPI_INT, (id+1) % N, 0, MPI_COMM_WORLD);
    }

    % mpicc test.c
    % mpirun -n 5 --hostfile hosts ./a.out
    Counter: 5

(Diagram: the five processes id=0 through id=4 arranged in a ring)
13/45

19 Clusters (same diagram)
- Synchronous Message Passing: MPI
- Process placement issue
  - Depends on network topology
  - e.g. TreeMatch
14/45

20 TreeMatch (Figure: ZEUSMP/2 communication pattern, sender rank vs. receiver rank, metric: msg)
- Trace communications between processes
- Algorithm to map processes to the network topology
15/45

21 Clusters (same diagram)
- Synchronous Message Passing: MPI
- Actor model / Active Objects: Charm++, Erlang...
16/45

22 Actor model (Diagram: two actors id=0 and id=1 on separate PUs, each with its own memory and mailbox; id=0 sends to 1, id=1 receives)
- Asynchronous message passing
- Dynamic actor creation
17/45
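As an illustration of the mailbox mechanism on this slide, here is a minimal actor-style example in C built on POSIX threads. This is not Charm++ or Erlang code; the mailbox_* helpers are names introduced only for this sketch. Sends are asynchronous (they enqueue and return immediately) and the actor reacts to messages as they arrive.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct msg { int value; struct msg *next; } msg_t;

    typedef struct {
        msg_t *head, *tail;
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
    } mailbox_t;

    static void mailbox_init(mailbox_t *mb) {
        mb->head = mb->tail = NULL;
        pthread_mutex_init(&mb->lock, NULL);
        pthread_cond_init(&mb->nonempty, NULL);
    }

    /* Asynchronous send: enqueue the message and return immediately. */
    static void mailbox_send(mailbox_t *mb, int value) {
        msg_t *m = malloc(sizeof *m);
        m->value = value;
        m->next  = NULL;
        pthread_mutex_lock(&mb->lock);
        if (mb->tail) mb->tail->next = m; else mb->head = m;
        mb->tail = m;
        pthread_cond_signal(&mb->nonempty);
        pthread_mutex_unlock(&mb->lock);
    }

    /* Blocking receive: wait until a message is available, then dequeue it. */
    static int mailbox_recv(mailbox_t *mb) {
        pthread_mutex_lock(&mb->lock);
        while (!mb->head)
            pthread_cond_wait(&mb->nonempty, &mb->lock);
        msg_t *m = mb->head;
        mb->head = m->next;
        if (!mb->head) mb->tail = NULL;
        pthread_mutex_unlock(&mb->lock);
        int v = m->value;
        free(m);
        return v;
    }

    static mailbox_t box;

    /* Actor body: react to each incoming message; a negative value stops it. */
    static void *actor(void *arg) {
        (void)arg;
        for (;;) {
            int v = mailbox_recv(&box);
            if (v < 0) break;
            printf("actor received %d\n", v);
        }
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        mailbox_init(&box);
        pthread_create(&tid, NULL, actor, NULL);   /* dynamic actor creation */
        mailbox_send(&box, 1);                     /* asynchronous sends */
        mailbox_send(&box, 2);
        mailbox_send(&box, -1);
        pthread_join(tid, NULL);
        return 0;
    }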

23 Clusters (same diagram)
- Synchronous Message Passing: MPI
- Actor model / Active Objects: Charm++, Erlang...
- PGAS: HPF, XcalableMP, x10, ZPL/Chapel...
  - PGAS = Partitioned Global Address Space
18/45

24 PGAS: XcalableMP

    !$xmp nodes P(4)
    !$xmp template T(100)
    !$xmp distribute T(block) onto P
    real G(80,100)                ! global variable
    !$xmp align G(*,i) with T(i)
    real L(50,40)                 ! local variable (default)

(Diagram: Global Address Space (virtual) holding G(80,100); Local View (data allocation): G(80,1:25), G(80,26:50), G(80,51:75), G(80,76:100), plus one L(50,40) per node)
19/45

25 Clusters (same diagram)
- Synchronous Message Passing: MPI
- Actor model / Active Objects: Charm++, Erlang...
- PGAS: HPF, XcalableMP, x10, ZPL/Chapel...
- MapReduce: Google MapReduce, Hadoop...
20/45

26 MapReduce (Diagram: elements A are mapped by f to elements B, which are reduced by g to a single C)
- reduce g . map f
- Most operations on collections can be written this way
- Easily parallelizable
- Users only have to provide f and g
- The runtime system provides:
  - Task distribution
  - Communications
  - Fault tolerance
  - ...
- Keywords: Bird-Meertens Formalism, Catamorphism
21/45
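The "reduce g . map f" pattern can be sketched in plain C with OpenMP; this is only an illustration of the scheme, not the Google MapReduce or Hadoop API. Here f is the user-provided per-element function and g is addition, expressed through OpenMP's reduction clause, so the runtime distributes the map work and combines the partial sums.

    #include <omp.h>
    #include <stdio.h>

    /* f: the user-provided map function (here: squaring). */
    static double f(double x) { return x * x; }

    int main(void) {
        enum { N = 1000000 };
        static double data[N];
        for (int i = 0; i < N; i++)
            data[i] = 1.0;

        /* reduce (+) . map f : each thread maps a chunk of the collection
           and the reduction clause combines the per-thread partial sums. */
        double acc = 0.0;
        #pragma omp parallel for reduction(+:acc)
        for (int i = 0; i < N; i++)
            acc += f(data[i]);

        printf("result: %f\n", acc);
        return 0;
    }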

27 Summary

    Architecture       Programming models
    Multicore (NUMA)   OpenMP, PThread...
    Cluster            MPI, PGAS, Actors, MapReduce...

- They all assume homogeneous architectures
- Same (or similar) architecture duplicated
22/45

28 Accelerators & Many-Core 23/45

29 Frameworks... (Diagram: MPI targets clusters; OpenMP targets CPUs) 24/45

30 Frameworks... (adds libspe, targeting the Cell processor) 24/45

31 Frameworks... (adds CUDA and PTX, targeting GPUs) 24/45

32 Frameworks... (adds ATI Stream, targeting GPUs) 24/45

33 Frameworks... (adds Ocelot, retargeting PTX to other devices) 24/45

34 Frameworks... (adds OpenCL, spanning Cluster, Cell, CPU, GPU and others) 24/45

35 Frameworks... (ATI Stream dropped; OpenCL becomes the common layer) 24/45

36 Frameworks... (adds the Xeon Phi) 24/45

37 Frameworks... (adds SnuCL, bringing OpenCL to clusters) 24/45

38 OpenCL OpenCL OpenCL host API OpenCL C language Host program Kernels CPUs GPUs Accelerators 25/45

39 OpenCL C Language (Diagram: work-items grouped into work-groups; each work-group has its own Local Memory, all share the Global Memory)
- SPMD
- get_local_id(int dim), get_global_id(int dim)
- __local, __global address space qualifiers
- work-group barrier
26/45
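The add2 kernel shown later uses only global memory, so here is a minimal kernel sketch (assumed, not from the slides) that also exercises the __local address space and the work-group barrier: each work-group stages its elements in local memory, synchronizes, then every work-item reads its neighbour's element. The host is expected to pass a __local buffer of one float per work-item via clSetKernelArg.

    /* OpenCL C: local memory staging and work-group synchronization. */
    __kernel void shift_left(__global const float *in,
                             __global float *out,
                             __local  float *tile)
    {
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);
        size_t lsz = get_local_size(0);

        tile[lid] = in[gid];              /* stage data in fast local memory */
        barrier(CLK_LOCAL_MEM_FENCE);     /* wait for the whole work-group */

        /* Read the next work-item's element, wrapping inside the work-group. */
        out[gid] = tile[(lid + 1) % lsz];
    }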

40 OpenCL Host API
- Device management: query available devices and their capabilities
- Memory management: allocate/release buffers in device memory; transfer data between host memory and devices
- Kernel management: compilation; scheduling on devices
- Synchronizations: command graph (events...); with the host program (barriers...)
27/45

41-45 OpenCL example (animated diagram: commands issued by the Host to an OpenCL device) 28/45

46 OpenCL example: kernel

    __kernel void add2(__global float *out) {
        size_t id = get_global_id(0);
        out[id] += 2.0;
    }

29/45

47 OpenCL example: host API

    float data[] = { ... };

    clGetPlatformIDs(max_pfs, pfs, NULL);
    clGetDeviceIDs(pfs[0], CL_DEVICE_TYPE_ALL, 1, devs, NULL);
    ctx = clCreateContext(NULL, 1, &devs[0], NULL, NULL, NULL);
    cq  = clCreateCommandQueue(ctx, devs[0], 0, NULL);

    buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 256*sizeof(float), NULL, NULL);
    clEnqueueWriteBuffer(cq, buf, CL_FALSE, 0, 256*sizeof(float), data, 0, NULL, NULL);

    program = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    ker = clCreateKernel(program, "add2", NULL);
    clSetKernelArg(ker, 0, sizeof(cl_mem), &buf);

    size_t group[3] = {32, 1, 1};
    size_t grid[3]  = {256, 1, 1};
    clEnqueueNDRangeKernel(cq, ker, 3, NULL, grid, group, 0, NULL, NULL);

    clEnqueueReadBuffer(cq, buf, CL_FALSE, 0, 256*sizeof(float), data, 0, NULL, NULL);
    clFinish(cq);

30/45

48 Frameworks... (same diagram as before) 31/45

49 Frameworks... (adds OpenACC) 31/45

50 OpenACC Concepts
- Generate kernels from annotated C/Fortran code
- Simpler memory model (one accelerator)
- Ensure memory coherence with annotations
32/45

51 OpenACC example

    float data[] = { ... };

    #pragma acc kernels copy(data)
    for (int i = 0; i < 256; i++)
        data[i] += 2.0;

33/45

52 Limitations: low-level approaches
- No support for multi-accelerator (OpenACC)
- Hard to manage several accelerators (OpenCL)
34/45
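To illustrate why several accelerators are hard to manage with plain OpenCL, here is a minimal host-side sketch (error handling omitted; at most 16 devices assumed): every device needs its own command queue, and deciding which transfers and kernels go to which queue, and keeping the queues consistent, is left entirely to the programmer. This is the burden that the runtime systems of the next section take over.

    #include <CL/cl.h>
    #include <stdio.h>

    int main(void) {
        cl_platform_id pf;
        cl_uint ndev;
        cl_device_id devs[16];

        clGetPlatformIDs(1, &pf, NULL);
        clGetDeviceIDs(pf, CL_DEVICE_TYPE_ALL, 16, devs, &ndev);

        cl_context ctx = clCreateContext(NULL, ndev, devs, NULL, NULL, NULL);

        /* One command queue per device: the programmer must dispatch buffers
           and kernels to each queue and synchronize them by hand. */
        cl_command_queue queues[16];
        for (cl_uint i = 0; i < ndev; i++)
            queues[i] = clCreateCommandQueue(ctx, devs[i], 0, NULL);

        printf("%u devices, %u command queues to drive by hand\n", ndev, ndev);

        for (cl_uint i = 0; i < ndev; i++)
            clReleaseCommandQueue(queues[i]);
        clReleaseContext(ctx);
        return 0;
    }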

53 Runtime Systems 35/45

54 Runtime Systems (Diagram: tasks A, B, C, D with read/write access modes (r, w, rw) on data held in a Virtual Memory; the Runtime System dispatches them through Drivers (CPU, CUDA, OpenCL...) onto CPU, GPU, ..., Xeon Phi)
- Automatically schedule kernels on available accelerators
- Automatically manage distributed memory
- e.g. StarSS, StarPU...
36/45

55 Runtime Systems (Diagram: a task graph of two matmul tasks, each with cpu and gpu implementations, combining X, Y, Z and V into W)
- Automatically schedule kernels on available accelerators
- Automatically manage distributed memory
- e.g. StarSS, StarPU...
36/45

56 Runtime Systems (Diagram: data A and B in a virtual Global Space, with copies replicated in the memories of the devices)
- Automatically schedule kernels on available accelerators
- Automatically manage distributed memory
- e.g. StarSS, StarPU...
36/45

57 Frameworks... (same diagram as before) 37/45

58 Frameworks... (adds StarPU) 37/45

59 StarPU example

    void scal_cpu_func(void *buffers[], void *args);
    void scal_cuda_func(void *buffers[], void *args);

    starpu_codelet scal_cl = {
        .where     = STARPU_CPU | STARPU_CUDA,
        .cpu_func  = { scal_cpu_func, NULL },
        .cuda_func = { scal_cuda_func, NULL },
        .nbuffers  = 1,
        .modes     = { STARPU_RW }
    };

    float vector[N];
    starpu_data_handle vector_handle;
    starpu_vector_data_register(&vector_handle, 0, vector, N, sizeof(float));

    starpu_insert_task(&scal_cl, STARPU_RW, vector_handle, 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(vector_handle);

38/45

60 StarPU performance UTK Magma (linear algebra) - QR factorization 16 CPUs (AMD) + 4 GPUs (C1060) 39/45

61 Limitations
- Task/data granularity adaptation
- Task graph optimizations
(Diagram: the matmul task graph on X, Y, Z, V and W from slide 55)
40/45

62 What I am working on
- Use a functional language to describe the task graph
- More amenable to optimizations
- Granularity adaptation expected
- e.g. HaskellPU, ViperVM
41/45

63 Frameworks... (same diagram as before) 42/45

64 Frameworks... (adds HaskellPU on top of StarPU) 42/45

65 HaskellPU

    let res = a*x + b*x + c*x + d*x
    computeSync res

    let res = (a + b + c + d) * x
    computeSync res

    let res = (reduce (+) [a, b, c, d]) * x
    computeSync res

(Diagram: the task graphs built for the three expressions over x, a, b, c, d)
43/45

66 Conclusion
- Things are evolving quickly!
  - Architectures
  - Programming models
- Runtime systems are now accepted in the HPC community
  - Ensure portability and performance
- New challenges
  - Fault tolerance
  - Power management
  - Collaboration with the programming language community (FP...)
44/45

67 Thank you! 45/45
