11 Parallel programming models

Size: px
Start display at page:

Download "11 Parallel programming models"

Transcription

1 237 // Program Design 10.3 Assessing parallel programs 11 Parallel programming models Many different models for expressing parallelism in programming languages Actor model Erlang Scala Coordination languages Linda CSP-based (Communicating Sequential Processes) FortranM Occam Dataflow SISAL (Streams and Iteration in a Single Assignment Language) Distributed Sequoia Bloom Event-driven and hardware description

2 238 // Program Design 10.3 Assessing parallel programs Verilog hardware description Multi-threaded language (HDL) Clojure Functional Object-oriented Concurrent Haskell Charm++ GPU languages Smalltalk CUDA Message-passing OpenMP OpenACC MPI OpenHMPP (HMPP for Hybrid Multicore Parallel Pro- PVM Partitioned global address space gramming) (PGAS) Logic programming Parlog High Performance Fortran (HPF)

3 239 // programming models 11.1 HPF 11.1 HPF Partitioned global address space parallel programming model Fortran90 extension SPMD (Single Program Multiple Data) model each process operates with its own part of data HPF commands specify which processor gets which part of the data Concurrency is defined by HPF commands based on Fortran90 HPF directives as comments:!hpf$ <directive>

4 240 // programming models 11.1 HPF HPF example: matrix multiplication PROGRAM ABmult IMPLICIT NONE INTEGER, PARAMETER : : N = 100 INTEGER, DIMENSION (N, N) : : A, B, C INTEGER : : i, j! HPF$ PROCESSORS s q u a r e ( 2, 2 )! HPF$ DISTRIBUTE (BLOCK, BLOCK) ONTO s q u a r e : : C! HPF$ ALIGN A( i, ) WITH C( i, )! r e p l i c a t e c o p i e s o f row A ( i, ) onto proc. s which compute C( i, j )! HPF$ ALIGN B(, j ) WITH C(, j )! r e p l i c a t e c o p i e s o f c o l. B (, j ) ) onto proc. s which compute C( i, j ) A = 1 B = 2 C = 0 DO i = 1, N DO j = 1, N! A l l t h e work i s l o c a l due t o ALIGNs C( i, j ) = DOT_PRODUCT(A( i, : ), B ( :, j ) ) END DO END DO WRITE(, ) C END

5 241 // programming models 11.1 HPF HPF programming methodology Need to find balance between concurrency and communication the more processes the more communication aiming to find balanced load based from the owner calculates rule data locality Easy to write a program in HPF but difficult to gain good efficiency Programming in HPF technique is more or less like this: 1. Write a correctly working serial program, test and debug it 2. add distribution directives introducing as less as possible communication

6 242 // programming models 11.2 OpenMP 11.2 OpenMP OpenMP tutorial ( openmp/) Programming model based on thread parallelism Helping tool for a programmer Based on compiler directives C/C++, Fortran* Nested Parallelism is supported (though, not all implementations support it) Dynamic threads OpenMP has become a standard

7 243 // programming models 11.2 OpenMP Fork-Join model

8 244 // programming models 11.2 OpenMP OpenMP Example: Matrix Multiplication:! h t t p : / / www. l l n l. gov / computing / t u t o r i a l s / openmp / e x e r c i s e. h tml C C OpenMp Example M a t r i x M u l t i p l y F o r t r a n V e r s i o n C FILE : omp_mm. f C DESCRIPTION : C D e m o n s t r a t e s a m a t r i x m u l t i p l y u s i n g OpenMP. Threads s h a r e row i t e r a t i o n s C a c c o r d i n g to a p r e d e f i n e d chunk s i z e. C LAST REVISED : 1 / 5 / 0 4 B l a i s e Barney C PROGRAM MATMULT INTEGER NRA, NCA, NCB, TID, NTHREADS, I, J, K, CHUNK, + OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM C number of rows in m a t r i x A PARAMETER (NRA=62) C number of columns in m a t r i x A PARAMETER (NCA=15) C number of columns in m a t r i x B PARAMETER (NCB=7) REAL 8 A(NRA,NCA), B(NCA, NCB), C(NRA, NCB)

9 245 // programming models 11.2 OpenMP C S e t loop i t e r a t i o n chunk s i z e CHUNK = 10 C Spawn a p a r a l l e l r e g i o n e x p l i c i t l y s c o p i n g a l l v a r i a b l e s!$omp PARALLEL SHARED(A, B, C,NTHREADS,CHUNK) PRIVATE ( TID, I, J,K) TID = OMP_GET_THREAD_NUM( ) IF ( TID.EQ. 0) THEN NTHREADS = OMP_GET_NUM_THREADS( ) PRINT, S t a r t i n g m a t r i x m u l t i p l e example with, NTHREADS, + t h r e a d s PRINT, I n i t i a l i z i n g m a t r i c e s END IF C I n i t i a l i z e m a t r i c e s!$omp DO SCHEDULE( STATIC, CHUNK) DO 30 I =1, NRA DO 30 J =1, NCA A( I, J ) = ( I 1) +( J 1) 30 CONTINUE!$OMP DO SCHEDULE( STATIC, CHUNK) DO 40 I =1, NCA DO 40 J =1, NCB B( I, J ) = ( I 1) ( J 1) 40 CONTINUE

10 246 // programming models 11.2 OpenMP!$OMP DO SCHEDULE( STATIC, CHUNK) DO 50 I =1, NRA DO 50 J =1, NCB C( I, J ) = 0 50 CONTINUE C Do m a t r i x m u l t i p l y s h a r i n g i t e r a t i o n s on o u t e r l oop C D i s p l a y who does which i t e r a t i o n s f o r d e m o n s t r a t i o n p u r p o s e s PRINT, Thread, TID, s t a r t i n g m a t r i x m u l t i p l y...!$omp DO SCHEDULE( STATIC, CHUNK) DO 60 I =1, NRA PRINT, Thread, TID, d i d row, I DO 60 J =1, NCB DO 60 K=1, NCA C( I, J ) = C( I, J ) + A( I,K) B(K, J ) 60 CONTINUE C End of p a r a l l e l r e g i o n!$omp END PARALLEL C P r i n t r e s u l t s PRINT, PRINT, R e s u l t M a t r i x : DO 90 I =1, NRA

11 247 // programming models 11.2 OpenMP DO 80 J =1, NCB WRITE(, 7 0 ) C( I, J ) 70 FORMAT(2 x, f8. 2, $ ) 80 CONTINUE PRINT, 90 CONTINUE PRINT, PRINT, Done. END

12 248 // programming models 11.3 GPGPU 11.3 GPGPU General-purpose computing on graphics processing units CUDA Parallel programming platform and model created by NVIDIA CUDA-extended programming languages (like nvcc) C, C++, Fortran CUDA gives access to virtual instruction set memory on parallel computational elements on GPUs based on executing a large number of threads simultaneously CUDA-accelerated libraries

13 249 // programming models 11.3 GPGPU CUDA Fortran Matrix Multiplication example (for more information, see http: // Host (CPU) code: s u b r o u t i n e mmul ( A, B, C ) use c u d a f o r real, dimension ( :, : ) : : A, B, C i n t e g e r : : N, M, L real, device, a l l o c a t a b l e, dimension ( :, : ) : : Adev, Bdev, Cdev type ( dim3 ) : : dimgrid, dimblock N = s i z e (A, 1 ) ; M = s i z e (A, 2 ) ; L = s i z e (B, 2 ) a l l o c a t e ( Adev (N,M), Bdev (M, L ), Cdev (N, L ) ) Adev = A( 1 : N, 1 :M) Bdev = B ( 1 :M, 1 : L ) dimgrid = dim3 ( N/ 1 6, L / 1 6, 1 ) dimblock = dim3 ( 16, 16, 1 ) c a l l mmul_kernel <<<dimgrid, dimblock >>>( Adev, Bdev, Cdev, N,M, L ) C ( 1 : N, 1 :M) = Cdev d e a l l o c a t e ( Adev, Bdev, Cdev ) end s u b r o u t i n e

14 250 // programming models 11.3 GPGPU GPU code: a t t r i b u t e s ( g l o b a l ) s u b r o u t i n e MMUL_KERNEL( A, B, C, N,M, L ) real, d e v i c e : : A(N,M),B(M, L ),C(N, L ) i n t e g e r, v a l u e : : N,M, L i n t e g e r : : i, j, kb, k, tx, t y real, s h a r e d : : Ab ( 1 6, 1 6 ), Bb ( 1 6, 1 6 ) r e a l : : C i j t x = t h r e a d i d x%x ; t y = t h r e a d i d x%y i = ( b l o c k i d x%x 1) 16 + t x j = ( b l o c k i d x%y 1) 16 + t y C i j = 0. 0 do kb = 1, M, 16! Fetch one e l e m e n t each i n t o Ab and Bb ; n o t e t h a t 16 x16 = 256! t h r e a d s i n t h i s thread b l o c k are f e t c h i n g s e p a r a t e e l e m e n t s! o f Ab and Bb Ab ( tx, t y ) = A( i, kb+ty 1) Bb ( tx, t y ) = B( kb+tx 1, j )! Wait u n t i l a l l e l e m e n t s o f Ab and Bb are f i l l e d c a l l s y n c t h r e a d s ( ) do k = 1, 16 C i j = C i j + Ab ( tx, k ) Bb ( k, t y ) enddo! Wait u n t i l a l l t h r e a d s i n t h e thread b l o c k f i n i s h w i t h

15 251 // programming models 11.3 GPGPU! t h i s i t e r a t i o n s Ab and Bb c a l l s y n c t h r e a d s ( ) enddo C( i, j ) = C i j end s u b r o u t i n e OpenCL Open Computing Language (OpenCL): Heterogeneous systems of CPUs (central processing units) GPUs (graphics processing units) DSPs (digital signal processors) FPGAs (field-programmable gate arrays) and other processors

16 252 // programming models 11.3 GPGPU language (based on C99) kernels (functions that execute on OpenCL devices) plus application programming interfaces (APIs) ( fortrancl ) parallel computing using task-based and data-based parallelism GPGPU open standard by Khronos Group (Apple, AMD, IBM, Intel and Nvidia) + adopted by Altera, Samsung, Vivante and ARM Holdings Example: Matrix Multiplication

17 253 // programming models 11.3 GPGPU / k e r n e l. c l M a t r i x m u l t i p l i c a t i o n : C = A B. ( ( Device code ) ) ( h t t p : / / gpgpu computing4. b l o g s p o t. co. uk / / 0 9 / matrix m u l t i p l i c a t i o n 2 o p e n c l. html ) / / / OpenCL K e r n e l k e r n e l void matrixmul ( g l o b a l f l o a t C, g l o b a l f l o a t A, g l o b a l f l o a t B, i n t wa, i n t wb) { / / 2D Thread ID / / Old CUDA code / / i n t t x = b l o c k I d x. x TILE_SIZE + t h r e a d I d x. x ; / / i n t t y = b l o c k I d x. y TILE_SIZE + t h r e a d I d x. y ; i n t t x = g e t _ g l o b a l _ i d ( 0 ) ; i n t t y = g e t _ g l o b a l _ i d ( 1 ) ; / / v a l u e s t o r e s t h e e l e m e n t t h a t i s / / computed by t h e t h r e a d f l o a t v a l u e = 0 ; f o r ( i n t k = 0 ; k < wa; ++k ) { f l o a t elementa = A[ t y wa + k ] ; f l o a t elementb = B[ k wb + t x ] ;

18 254 // programming models 11.3 GPGPU } v a l u e += elementa elementb ; } / / W r i t e t h e m a t r i x t o d e v i c e memory each / / t h r e a d w r i t e s one e l e m e n t C[ t y wa + t x ] = v a l u e ; OpenACC standard (Cray, CAPS, Nvidia and PGI) to simplify parallel programming of heterogeneous CPU/GPU systems using annotations (like OpenMP), C, C++, Fortran source code code started on both CPU and GPU automatically OpenACC to merge into OpenMP in a future release of OpenMP

19 255 // programming models 11.3 GPGPU! A s i m p l e OpenACC k e r n e l f o r M a t r i x M u l t i p l i c a t i o n! $acc k e r n e l s do k = 1, n1 do i = 1, n3 c ( i, k ) = 0. 0 do j = 1, n2 c ( i, k ) = c ( i, k ) + a ( i, j ) b ( j, k ) enddo enddo enddo! $acc end k e r n e l s! h t t p : / / www. bu. edu / t e c h / r e s e a r c h / c o m p u t a t i o n / l i n u x c l u s t e r / katana c l u s t e r / gpu computing / openacc f o r t r a n / matrix m u l t i p l y f o r t r a n / program m a t r i x _ m u l t i p l y use omp_ lib use openacc i m p l i c i t none i n t e g e r : : i, j, k, myid, m, n, c o m p i l e d _ f o r, o p t i o n i n t e g e r, parameter : : fd = 11 i n t e g e r : : t1, t2, dt, c o u n t _ r a t e, count_max real, a l l o c a t a b l e, dimension ( :, : ) : : a, b, c

20 256 // programming models 11.3 GPGPU r e a l : : tmp, s e c s open ( fd, f i l e = w a l l c l o c k t i m e, form= f o r m a t t e d ) o p t i o n = c o m p i l e d _ f o r ( fd )! 1 s e r i a l, 2 OpenMP, 3 OpenACC, 4 both! $omp p a r a l l e l! $ myid = OMP_GET_THREAD_NUM( )! $ i f ( myid. eq. 0) t h e n! $ w r i t e ( fd, " ( Number of p r o c s i s, i 4 ) " ) OMP_GET_NUM_THREADS( )! $ e n d i f! $omp end p a r a l l e l c a l l system_clock ( count_max=count_max, c o u n t _ r a t e = c o u n t _ r a t e ) do m=1,4! compute f o r d i f f e r e n t s i z e m a t r i x m u l t i p l i e s c a l l system_clock ( t 1 ) n = (m 1)! 1000, 2000, 4000, 8000 a l l o c a t e ( a ( n, n ), b ( n, n ), c ( n, n ) )! I n i t i a l i z e m a t r i c e s do j =1, n do i =1, n a ( i, j ) = r e a l ( i + j ) b ( i, j ) = r e a l ( i j ) enddo enddo! $omp p a r a l l e l do s h a r e d ( a, b, c, n, tmp ) r e d u c t i o n ( + : tmp )! $ acc d a t a c o p y i n ( a, b ) copy ( c )! $ acc k e r n e l s

21 257 // programming models 11.3 GPGPU! Compute m a t r i x m u l t i p l i c a t i o n. do j =1, n do i =1, n tmp = 0. 0! e n a b l e s ACC p a r a l l e l i s m f o r k loop do k =1, n tmp = tmp + a ( i, k ) b ( k, j ) enddo c ( i, j ) = tmp enddo enddo! $ acc end k e r n e l s! $ acc end d a t a! $omp end p a r a l l e l do c a l l system_clock ( t 2 ) d t = t2 t 1 s e c s = r e a l ( d t ) / r e a l ( c o u n t _ r a t e ) w r i t e ( fd, " ( For n =, i4,, w a l l c l o c k time i s, f12. 2, seconds ) " ) & n, s e c s d e a l l o c a t e ( a, b, c ) enddo c l o s e ( fd ) end program m a t r i x _ m u l t i p l y i n t e g e r f u n c t i o n c o m p i l e d _ f o r ( fd )

22 258 // programming models 11.3 GPGPU i m p l i c i t none i n t e g e r : : fd # i f d e f i n e d _OPENMP && d e f i n e d _OPENACC c o m p i l e d _ f o r = 4 w r i t e ( fd, " ( This code i s compiled with OpenMP & OpenACC ) " ) # e l i f d e f i n e d _OPENACC c o m p i l e d _ f o r = 3 w r i t e ( fd, " ( This code i s compiled with OpenACC ) " ) # e l i f d e f i n e d _OPENMP c o m p i l e d _ f o r = 2 w r i t e ( fd, " ( This code i s compiled with OpenMP ) " ) # e l s e c o m p i l e d _ f o r = 1 w r i t e ( fd, " ( This code i s compiled f o r s e r i a l o p e r a t i o n s ) " ) # e n d i f end f u n c t i o n c o m p i l e d _ f o r

Panorama des modèles et outils de programmation parallèle

Panorama des modèles et outils de programmation parallèle Panorama des modèles et outils de programmation parallèle Sylvain HENRY sylvain.henry@inria.fr University of Bordeaux - LaBRI - Inria - ENSEIRB April 19th, 2013 1/45 Outline Introduction Accelerators &

More information

CEE 618 Scientific Parallel Computing (Lecture 7): OpenMP (con td) and Matrix Multiplication

CEE 618 Scientific Parallel Computing (Lecture 7): OpenMP (con td) and Matrix Multiplication 1 / 26 CEE 618 Scientific Parallel Computing (Lecture 7): OpenMP (con td) and Matrix Multiplication Albert S. Kim Department of Civil and Environmental Engineering University of Hawai i at Manoa 2540 Dole

More information

First, a look at using OpenACC on WRF subroutine advance_w dynamics routine

First, a look at using OpenACC on WRF subroutine advance_w dynamics routine First, a look at using OpenACC on WRF subroutine advance_w dynamics routine Second, an estimate of WRF multi-node performance on Cray XK6 with GPU accelerators Based on performance of WRF kernels, what

More information

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

Julian Merten. GPU Computing and Alternative Architecture

Julian Merten. GPU Computing and Alternative Architecture Future Directions of Cosmological Simulations / Edinburgh 1 / 16 Julian Merten GPU Computing and Alternative Architecture Institut für Theoretische Astrophysik Zentrum für Astronomie Universität Heidelberg

More information

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!

More information

High-performance processing and development with Madagascar. July 24, 2010 Madagascar development team

High-performance processing and development with Madagascar. July 24, 2010 Madagascar development team High-performance processing and development with Madagascar July 24, 2010 Madagascar development team Outline 1 HPC terminology and frameworks 2 Utilizing data parallelism 3 HPC development with Madagascar

More information

Exascale challenges for Numerical Weather Prediction : the ESCAPE project

Exascale challenges for Numerical Weather Prediction : the ESCAPE project Exascale challenges for Numerical Weather Prediction : the ESCAPE project O Olivier Marsden This project has received funding from the European Union s Horizon 2020 research and innovation programme under

More information

arxiv: v1 [hep-lat] 7 Oct 2010

arxiv: v1 [hep-lat] 7 Oct 2010 arxiv:.486v [hep-lat] 7 Oct 2 Nuno Cardoso CFTP, Instituto Superior Técnico E-mail: nunocardoso@cftp.ist.utl.pt Pedro Bicudo CFTP, Instituto Superior Técnico E-mail: bicudo@ist.utl.pt We discuss the CUDA

More information

The Lattice Boltzmann Simulation on Multi-GPU Systems

The Lattice Boltzmann Simulation on Multi-GPU Systems The Lattice Boltzmann Simulation on Multi-GPU Systems Thor Kristian Valderhaug Master of Science in Computer Science Submission date: June 2011 Supervisor: Anne Cathrine Elster, IDI Norwegian University

More information

Perm State University Research-Education Center Parallel and Distributed Computing

Perm State University Research-Education Center Parallel and Distributed Computing Perm State University Research-Education Center Parallel and Distributed Computing A 25-minute Talk (S4493) at the GPU Technology Conference (GTC) 2014 MARCH 24-27, 2014 SAN JOSE, CA GPU-accelerated modeling

More information

B629 project - StreamIt MPI Backend. Nilesh Mahajan

B629 project - StreamIt MPI Backend. Nilesh Mahajan B629 project - StreamIt MPI Backend Nilesh Mahajan March 26, 2013 Abstract StreamIt is a language based on the dataflow model of computation. StreamIt consists of computation units called filters connected

More information

Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters

Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters Dynamic Scheduling for Work Agglomeration on Heterogeneous Clusters Jonathan Lifflander, G. Carl Evans, Anshu Arya, Laxmikant Kale University of Illinois Urbana-Champaign May 25, 2012 Work is overdecomposed

More information

Optimization Techniques for Parallel Code 1. Parallel programming models

Optimization Techniques for Parallel Code 1. Parallel programming models Optimization Techniques for Parallel Code 1. Parallel programming models Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr OPT - 2017 Goals of

More information

Parallel Rabin-Karp Algorithm Implementation on GPU (preliminary version)

Parallel Rabin-Karp Algorithm Implementation on GPU (preliminary version) Bulletin of Networking, Computing, Systems, and Software www.bncss.org, ISSN 2186-5140 Volume 7, Number 1, pages 28 32, January 2018 Parallel Rabin-Karp Algorithm Implementation on GPU (preliminary version)

More information

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)

More information

Parallelization of the QC-lib Quantum Computer Simulator Library

Parallelization of the QC-lib Quantum Computer Simulator Library Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer September 9, 23 PPAM 23 1 Ian Glendinning / September 9, 23 Outline Introduction Quantum Bits, Registers

More information

Parallelization of the Dirac operator. Pushan Majumdar. Indian Association for the Cultivation of Sciences, Jadavpur, Kolkata

Parallelization of the Dirac operator. Pushan Majumdar. Indian Association for the Cultivation of Sciences, Jadavpur, Kolkata Parallelization of the Dirac operator Pushan Majumdar Indian Association for the Cultivation of Sciences, Jadavpur, Kolkata Outline Introduction Algorithms Parallelization Comparison of performances Conclusions

More information

An Integrative Model for Parallelism

An Integrative Model for Parallelism An Integrative Model for Parallelism Victor Eijkhout ICERM workshop 2012/01/09 Introduction Formal part Examples Extension to other memory models Conclusion tw-12-exascale 2012/01/09 2 Introduction tw-12-exascale

More information

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay SP-CNN: A Scalable and Programmable CNN-based Accelerator Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay Motivation Power is a first-order design constraint, especially for embedded devices. Certain

More information

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015 Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done

More information

Accelerating linear algebra computations with hybrid GPU-multicore systems.

Accelerating linear algebra computations with hybrid GPU-multicore systems. Accelerating linear algebra computations with hybrid GPU-multicore systems. Marc Baboulin INRIA/Université Paris-Sud joint work with Jack Dongarra (University of Tennessee and Oak Ridge National Laboratory)

More information

Recent advances in the GFDL Flexible Modeling System

Recent advances in the GFDL Flexible Modeling System Recent advances in the GFDL Flexible Modeling System 4th ENES HPC Workshop Toulouse, FRANCE V. Balaji and many others NOAA/GFDL and Princeton University 6 April 2016 V. Balaji (balaji@princeton.edu) GFDL

More information

Benchmarking program performance evaluation of Parallel programming language XcalableMP on Many core processor

Benchmarking program performance evaluation of Parallel programming language XcalableMP on Many core processor XcalableMP 1 2 2 2 Xeon Phi Xeon XcalableMP HIMENO L Phi XL 16 Xeon 1 16 Phi XcalableMP MPI XcalableMP OpenMP 16 2048 Benchmarking program performance evaluation of Parallel programming language XcalableMP

More information

CycleTandem: Energy-Saving Scheduling for Real-Time Systems with Hardware Accelerators

CycleTandem: Energy-Saving Scheduling for Real-Time Systems with Hardware Accelerators CycleTandem: Energy-Saving Scheduling for Real-Time Systems with Hardware Accelerators Sandeep D souza and Ragunathan (Raj) Rajkumar Carnegie Mellon University High (Energy) Cost of Accelerators Modern-day

More information

PSEUDORANDOM numbers are very important in practice

PSEUDORANDOM numbers are very important in practice Proceedings of the Federated Conference on Computer Science and Information Systems pp 571 578 ISBN 978-83-681-51-4 Parallel GPU-accelerated Recursion-based Generators of Pseudorandom Numbers Przemysław

More information

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy 7 th Workshop on UnConventional High Performance

More information

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems

TR A Comparison of the Performance of SaP::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems TR-0-07 A Comparison of the Performance of ::GPU and Intel s Math Kernel Library (MKL) for Solving Dense Banded Linear Systems Ang Li, Omkar Deshmukh, Radu Serban, Dan Negrut May, 0 Abstract ::GPU is a

More information

Model Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University

Model Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul

More information

Fortran program + Partial data layout specifications Data Layout Assistant.. regular problems. dynamic remapping allowed Invoked only a few times Not part of the compiler Can use expensive techniques HPF

More information

Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics

Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics Using a CUDA-Accelerated PGAS Model on a GPU Cluster for Bioinformatics Jorge González-Domínguez Parallel and Distributed Architectures Group Johannes Gutenberg University of Mainz, Germany j.gonzalez@uni-mainz.de

More information

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers

Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric

More information

Introduction to numerical computations on the GPU

Introduction to numerical computations on the GPU Introduction to numerical computations on the GPU Lucian Covaci http://lucian.covaci.org/cuda.pdf Tuesday 1 November 11 1 2 Outline: NVIDIA Tesla and Geforce video cards: architecture CUDA - C: programming

More information

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:

More information

Parallelization of Molecular Dynamics (with focus on Gromacs) SeSE 2014 p.1/29

Parallelization of Molecular Dynamics (with focus on Gromacs) SeSE 2014 p.1/29 Parallelization of Molecular Dynamics (with focus on Gromacs) SeSE 2014 p.1/29 Outline A few words on MD applications and the GROMACS package The main work in an MD simulation Parallelization Stream computing

More information

Advanced parallel Fortran

Advanced parallel Fortran 1/44 Advanced parallel Fortran A. Shterenlikht Mech Eng Dept, The University of Bristol Bristol BS8 1TR, UK, Email: mexas@bristol.ac.uk coarrays.sf.net 12th November 2017 2/44 Fortran 2003 i n t e g e

More information

Parallelization of the QC-lib Quantum Computer Simulator Library

Parallelization of the QC-lib Quantum Computer Simulator Library Parallelization of the QC-lib Quantum Computer Simulator Library Ian Glendinning and Bernhard Ömer VCPC European Centre for Parallel Computing at Vienna Liechtensteinstraße 22, A-19 Vienna, Austria http://www.vcpc.univie.ac.at/qc/

More information

Administrivia. Course Objectives. Overview. Lecture Notes Week markem/cs333/ 2. Staff. 3. Prerequisites. 4. Grading. 1. Theory and application

Administrivia. Course Objectives. Overview. Lecture Notes Week markem/cs333/ 2. Staff. 3. Prerequisites. 4. Grading. 1. Theory and application Administrivia 1. markem/cs333/ 2. Staff 3. Prerequisites 4. Grading Course Objectives 1. Theory and application 2. Benefits 3. Labs TAs Overview 1. What is a computer system? CPU PC ALU System bus Memory

More information

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS

Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS Jorge González-Domínguez*, Bertil Schmidt*, Jan C. Kässens**, Lars Wienbrandt** *Parallel and Distributed Architectures

More information

S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA

S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA S0214 : GPU Based Stacking Sequence Generation For Composite Skins Using GA Date: 16th May 2012 Wed, 3pm to 3.25pm(Adv. Session) Sathyanarayana K., Manish Banga, and Ravi Kumar G. V. V. Engineering Services,

More information

Welcome to MCS 572. content and organization expectations of the course. definition and classification

Welcome to MCS 572. content and organization expectations of the course. definition and classification Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson

More information

Interpolation with Radial Basis Functions on GPGPUs using CUDA

Interpolation with Radial Basis Functions on GPGPUs using CUDA Interpolation with Radial Basis Functions on GPGPUs using CUDA Gundolf Haase in coop. with: Dirk Martin [VRV Vienna] and Günter Offner [AVL Graz] Institute for Mathematics and Scientific Computing University

More information

EECS 579: Logic and Fault Simulation. Simulation

EECS 579: Logic and Fault Simulation. Simulation EECS 579: Logic and Fault Simulation Simulation: Use of computer software models to verify correctness Fault Simulation: Use of simulation for fault analysis and ATPG Circuit description Input data for

More information

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017 HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher

More information

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University

Che-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University } 2017/11/15 Midterm } 2017/11/22 Final Project Announcement 2 1. Introduction 2.

More information

Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and

Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and Accelerating the Multifrontal Method Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes {rflucas,genew,ddavis}@isi.edu and grimes@lstc.com 3D Finite Element

More information

Performance Analysis of Lattice QCD Application with APGAS Programming Model

Performance Analysis of Lattice QCD Application with APGAS Programming Model Performance Analysis of Lattice QCD Application with APGAS Programming Model Koichi Shirahata 1, Jun Doi 2, Mikio Takeuchi 2 1: Tokyo Institute of Technology 2: IBM Research - Tokyo Programming Models

More information

An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8))

An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8)) An Implementation of SPELT(31, 4, 96, 96, (32, 16, 8)) Tung Chou January 5, 2012 QUAD Stream cipher. Security relies on MQ (Multivariate Quadratics). QUAD The Provably-secure QUAD(q, n, r) Stream Cipher

More information

CSE 548: Analysis of Algorithms. Lecture 12 ( Analyzing Parallel Algorithms )

CSE 548: Analysis of Algorithms. Lecture 12 ( Analyzing Parallel Algorithms ) CSE 548: Analysis of Algorithms Lecture 12 ( Analyzing Parallel Algorithms ) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Fall 2017 Why Parallelism? Moore s Law Source: Wikipedia

More information

High-performance Technical Computing with Erlang

High-performance Technical Computing with Erlang High-performance Technical Computing with Erlang Alceste Scalas Giovanni Casu Piero Pili Center for Advanced Studies, Research and Development in Sardinia ACM ICFP 2008 Erlang Workshop September 27th,

More information

Targeting Extreme Scale Computational Challenges with Heterogeneous Systems

Targeting Extreme Scale Computational Challenges with Heterogeneous Systems Targeting Extreme Scale Computational Challenges with Heterogeneous Systems Oreste Villa, Antonino Tumeo Pacific Northwest Na/onal Laboratory (PNNL) 1 Introduction! PNNL Laboratory Directed Research &

More information

Predicting the Electrical and Thermal Transport Properties in Direct Energy Conversion Material: A Practical HPC Perspective

Predicting the Electrical and Thermal Transport Properties in Direct Energy Conversion Material: A Practical HPC Perspective Predicting the Electrical and Thermal Transport Properties in Direct Energy Conversion Material: A Practical HPC Perspective Plasmonic Thermal Absorber Nanostructure Thermoelectrics Dr. Terry Musho Assistant

More information

Digital Systems EEE4084F. [30 marks]

Digital Systems EEE4084F. [30 marks] Digital Systems EEE4084F [30 marks] Practical 3: Simulation of Planet Vogela with its Moon and Vogel Spiral Star Formation using OpenGL, OpenMP, and MPI Introduction The objective of this assignment is

More information

CS-206 Concurrency. Lecture 13. Wrap Up. Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/

CS-206 Concurrency. Lecture 13. Wrap Up. Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/ CS-206 Concurrency Lecture 13 Wrap Up Spring 2015 Prof. Babak Falsafi parsa.epfl.ch/courses/cs206/ Created by Nooshin Mirzadeh, Georgios Psaropoulos and Babak Falsafi EPFL Copyright 2015 EPFL CS-206 Spring

More information

Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method

Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method NUCLEAR SCIENCE AND TECHNIQUES 25, 0501 (14) Research on GPU-accelerated algorithm in 3D finite difference neutron diffusion calculation method XU Qi ( 徐琪 ), 1, YU Gang-Lin ( 余纲林 ), 1 WANG Kan ( 王侃 ),

More information

Improvement of MPAS on the Integration Speed and the Accuracy

Improvement of MPAS on the Integration Speed and the Accuracy ICAS2017 Annecy, France Improvement of MPAS on the Integration Speed and the Accuracy Wonsu Kim, Ji-Sun Kang, Jae Youp Kim, and Minsu Joh Disaster Management HPC Technology Research Center, Korea Institute

More information

Cyclops Tensor Framework

Cyclops Tensor Framework Cyclops Tensor Framework Edgar Solomonik Department of EECS, Computer Science Division, UC Berkeley March 17, 2014 1 / 29 Edgar Solomonik Cyclops Tensor Framework 1/ 29 Definition of a tensor A rank r

More information

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,

More information

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications Christopher Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, Wen-Mei W. Hwu University of Illinois at Urbana-Champaign

More information

The Thermoflow60 Finite-Element. Program

The Thermoflow60 Finite-Element. Program The Thermoflow60 Finite-Element Program Ulrich Wepler,, German Aerospace enter (DLR) Dieter an Mey, enter for omputing and ommunication,, RWTH Alexander Spiegel, enter for omputing and ommunication,, RWTH

More information

Modern Functional Programming and Actors With Scala and Akka

Modern Functional Programming and Actors With Scala and Akka Modern Functional Programming and Actors With Scala and Akka Aaron Kosmatin Computer Science Department San Jose State University San Jose, CA 95192 707-508-9143 akosmatin@gmail.com Abstract For many years,

More information

S8241 VERSIONING GPU- ACCLERATED WRF TO Jeff Adie, 26 March, 2018 (Presented by Stan Posey, NVIDIA)

S8241 VERSIONING GPU- ACCLERATED WRF TO Jeff Adie, 26 March, 2018 (Presented by Stan Posey, NVIDIA) S8241 VERSIONING GPU- ACCLERATED WRF TO 3.7.1 Jeff Adie, 26 March, 2018 (Presented by Stan Posey, NVIDIA) 1 ACKNOWLEDGEMENT The work presented here today would not have been possible without the efforts

More information

OpenCL: Programming Heterogeneous Architectures

OpenCL: Programming Heterogeneous Architectures OpenCL: Programming Heterogeneous Architectures Porting BigDFT to OpenCL Brice Videau (LIG - NANOSIM) April 24, 2012 1 / 45 Introduction 2 / 45 Needs in computing resources are innite Benets for physicists

More information

The Panel: What does the future look like for NPW application development? 17 th ECMWF Workshop on High Performance Computing in Meteorology

The Panel: What does the future look like for NPW application development? 17 th ECMWF Workshop on High Performance Computing in Meteorology The Panel: What does the future look like for NPW application development? 17 th ECMWF Workshop on High Performance Computing in Meteorology 16:00-17:30 27 October 2016 Panelists John Michalakes (UCAR,

More information

Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization

Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization Martin Takáč The University of Edinburgh Based on: P. Richtárik and M. Takáč. Iteration complexity of randomized

More information

COMP 515: Advanced Compilation for Vector and Parallel Processors. Vivek Sarkar Department of Computer Science Rice University

COMP 515: Advanced Compilation for Vector and Parallel Processors. Vivek Sarkar Department of Computer Science Rice University COMP 515: Advanced Compilation for Vector and Parallel Processors Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP 515 Lecture 10 10 February 2009 Announcement Feb 17 th

More information

CRYPTOGRAPHIC COMPUTING

CRYPTOGRAPHIC COMPUTING CRYPTOGRAPHIC COMPUTING ON GPU Chen Mou Cheng Dept. Electrical Engineering g National Taiwan University January 16, 2009 COLLABORATORS Daniel Bernstein, UIC, USA Tien Ren Chen, Army Tanja Lange, TU Eindhoven,

More information

Paralleliza(on and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures

Paralleliza(on and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures Paralleliza(on and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures Mark Gove? NOAA Earth System Research Laboratory We Need Be?er Numerical Weather Predic(on Superstorm Sandy Hurricane

More information

Analysis of Parallel Montgomery Multiplication in CUDA

Analysis of Parallel Montgomery Multiplication in CUDA San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2013 Analysis of Parallel Montgomery Multiplication in CUDA Yuheng Liu San Jose State University

More information

Adaptive Heterogeneous Computing with OpenCL: Harnessing hundreds of GPUs and CPUs

Adaptive Heterogeneous Computing with OpenCL: Harnessing hundreds of GPUs and CPUs Adaptive Heterogeneous Computing with OpenCL: Harnessing hundreds of GPUs and CPUs Simon McIntosh-Smith simonm@cs.bris.ac.uk Head of Microelectronics Research University of Bristol, UK 1 ! Collaborators

More information

Marawacc: A Framework for Heterogeneous Computing in Java

Marawacc: A Framework for Heterogeneous Computing in Java : A Framework for Heterogeneous Computing in Java Juan Fumero, Michel Steuwer, Christophe Dubach Code The University of Edinburgh UK Many-Core Developer Conference 2016 1 / 23 Code 2 / 23 Code 3 / 23 Code

More information

ACCELERATING WEATHER PREDICTION WITH NVIDIA GPUS

ACCELERATING WEATHER PREDICTION WITH NVIDIA GPUS ACCELERATING WEATHER PREDICTION WITH NVIDIA GPUS Alan Gray, Developer Technology Engineer, NVIDIA ECMWF 18th Workshop on high performance computing in meteorology, 28 th September 2018 ESCAPE NVIDIA s

More information

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic

GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic GPU acceleration of Newton s method for large systems of polynomial equations in double double and quad double arithmetic Jan Verschelde joint work with Xiangcheng Yu University of Illinois at Chicago

More information

Lightweight Superscalar Task Execution in Distributed Memory

Lightweight Superscalar Task Execution in Distributed Memory Lightweight Superscalar Task Execution in Distributed Memory Asim YarKhan 1 and Jack Dongarra 1,2,3 1 Innovative Computing Lab, University of Tennessee, Knoxville, TN 2 Oak Ridge National Lab, Oak Ridge,

More information

Performance of the fusion code GYRO on three four generations of Crays. Mark Fahey University of Tennessee, Knoxville

Performance of the fusion code GYRO on three four generations of Crays. Mark Fahey University of Tennessee, Knoxville Performance of the fusion code GYRO on three four generations of Crays Mark Fahey mfahey@utk.edu University of Tennessee, Knoxville Contents Introduction GYRO Overview Benchmark Problem Test Platforms

More information

Empowering Scientists with Domain Specific Languages

Empowering Scientists with Domain Specific Languages Empowering Scientists with Domain Specific Languages Julian Kunkel, Nabeeh Jum ah Scientific Computing Department of Informatics University of Hamburg SciCADE2017 2017-09-13 Outline 1 Developing Scientific

More information

Code Generation for GPU Accelerators in the Domain of Image Preprocessing

Code Generation for GPU Accelerators in the Domain of Image Preprocessing Code Generation for GPU Accelerators in the Domain of Image Preprocessing Oliver Reiche, Richard Membarth, Frank Hannig, and Jürgen Teich Hardware/Software Co-Design, University of Erlangen-Nuremberg Dagstuhl,

More information

Antonio Falabella. 3 rd nternational Summer School on INtelligent Signal Processing for FrontIEr Research and Industry, September 2015, Hamburg

Antonio Falabella. 3 rd nternational Summer School on INtelligent Signal Processing for FrontIEr Research and Industry, September 2015, Hamburg INFN - CNAF (Bologna) 3 rd nternational Summer School on INtelligent Signal Processing for FrontIEr Research and Industry, 14-25 September 2015, Hamburg 1 / 44 Overview 1 2 3 4 5 2 / 44 to Computing The

More information

HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU

HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU April 4-7, 2016 Silicon Valley HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU Minmin Sun, NVIDIA minmins@nvidia.com April 5th Brief Introduction of CTC AGENDA Alpha/Beta Matrix

More information

A microsecond a day keeps the doctor away: Efficient GPU Molecular Dynamics with GROMACS

A microsecond a day keeps the doctor away: Efficient GPU Molecular Dynamics with GROMACS GTC 20130319 A microsecond a day keeps the doctor away: Efficient GPU Molecular Dynamics with GROMACS Erik Lindahl erik.lindahl@scilifelab.se Molecular Dynamics Understand biology We re comfortably on

More information

Accelerating Model Reduction of Large Linear Systems with Graphics Processors

Accelerating Model Reduction of Large Linear Systems with Graphics Processors Accelerating Model Reduction of Large Linear Systems with Graphics Processors P. Benner 1, P. Ezzatti 2, D. Kressner 3, E.S. Quintana-Ortí 4, Alfredo Remón 4 1 Max-Plank-Institute for Dynamics of Complex

More information

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)

More information

Dense Arithmetic over Finite Fields with CUMODP

Dense Arithmetic over Finite Fields with CUMODP Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,

More information

Lecture 22: Multithreaded Algorithms CSCI Algorithms I. Andrew Rosenberg

Lecture 22: Multithreaded Algorithms CSCI Algorithms I. Andrew Rosenberg Lecture 22: Multithreaded Algorithms CSCI 700 - Algorithms I Andrew Rosenberg Last Time Open Addressing Hashing Today Multithreading Two Styles of Threading Shared Memory Every thread can access the same

More information

Available online at ScienceDirect. Procedia Engineering 61 (2013 ) 94 99

Available online at  ScienceDirect. Procedia Engineering 61 (2013 ) 94 99 Available online at www.sciencedirect.com ScienceDirect Procedia Engineering 6 (203 ) 94 99 Parallel Computational Fluid Dynamics Conference (ParCFD203) Simulations of three-dimensional cavity flows with

More information

Marwan Burelle. Parallel and Concurrent Programming. Introduction and Foundation

Marwan Burelle.  Parallel and Concurrent Programming. Introduction and Foundation and and marwan.burelle@lse.epita.fr http://wiki-prog.kh405.net Outline 1 2 and 3 and Evolutions and Next evolutions in processor tends more on more on growing of cores number GPU and similar extensions

More information

Multicore Semantics and Programming

Multicore Semantics and Programming Multicore Semantics and Programming Peter Sewell Tim Harris University of Cambridge Oracle October November, 2015 p. 1 These Lectures Part 1: Multicore Semantics: the concurrency of multiprocessors and

More information

Parallelism of MRT Lattice Boltzmann Method based on Multi-GPUs

Parallelism of MRT Lattice Boltzmann Method based on Multi-GPUs Parallelism of MRT Lattice Boltzmann Method based on Multi-GPUs 1 School of Information Engineering, China University of Geosciences (Beijing) Beijing, 100083, China E-mail: Yaolk1119@icloud.com Ailan

More information

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization)

A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) A model leading to self-consistent iteration computation with need for HP LA (e.g, diagonalization and orthogonalization) Schodinger equation: Hψ = Eψ Choose a basis set of wave functions Two cases: Orthonormal

More information

SIMULATION OF ISING SPIN MODEL USING CUDA

SIMULATION OF ISING SPIN MODEL USING CUDA SIMULATION OF ISING SPIN MODEL USING CUDA MIRO JURIŠIĆ Supervisor: dr.sc. Dejan Vinković Split, November 2011 Master Thesis in Physics Department of Physics Faculty of Natural Sciences and Mathematics

More information

GPU Computing with Applications in Digital Logic

GPU Computing with Applications in Digital Logic Tampere International Center for Signal Processing. TICSP series # 62 J. Astola, M. Kameyama, M. Lukac & R. S. Stanković (eds.) GPU Computing with Applications in Digital Logic Tampere International Center

More information

Carlo Cavazzoni, HPC department, CINECA

Carlo Cavazzoni, HPC department, CINECA Large Scale Parallelism Carlo Cavazzoni, HPC department, CINECA Parallel Architectures Two basic architectural scheme: Distributed Memory Shared Memory Now most computers have a mixed architecture + accelerators

More information

All-electron density functional theory on Intel MIC: Elk

All-electron density functional theory on Intel MIC: Elk All-electron density functional theory on Intel MIC: Elk W. Scott Thornton, R.J. Harrison Abstract We present the results of the porting of the full potential linear augmented plane-wave solver, Elk [1],

More information

Parallel Scientific Computing

Parallel Scientific Computing IV-1 Parallel Scientific Computing Matrix-vector multiplication. Matrix-matrix multiplication. Direct method for solving a linear equation. Gaussian Elimination. Iterative method for solving a linear equation.

More information

Parallel Sparse Tensor Decompositions using HiCOO Format

Parallel Sparse Tensor Decompositions using HiCOO Format Figure sources: A brief survey of tensors by Berton Earnshaw and NVIDIA Tensor Cores Parallel Sparse Tensor Decompositions using HiCOO Format Jiajia Li, Jee Choi, Richard Vuduc May 8, 8 @ SIAM ALA 8 Outline

More information

Introduction The Nature of High-Performance Computation

Introduction The Nature of High-Performance Computation 1 Introduction The Nature of High-Performance Computation The need for speed. Since the beginning of the era of the modern digital computer in the early 1940s, computing power has increased at an exponential

More information

Language GTC April 7, These slides at Data Science Applications of GPUs in the R.

Language GTC April 7, These slides at   Data Science Applications of GPUs in the R. Data Science April 7, 2016 These slides at http://heather.cs.ucdavis.edu/gtc.pdf Why R? Why R? The lingua franca for the data science community. (R-Python-Julia battle looming?) Statistically Correct:

More information

Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting. Thomas C. Schulthess

Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting. Thomas C. Schulthess Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting Thomas C. Schulthess 1 Cray XC30 with 5272 hybrid, GPU accelerated compute nodes Piz Daint Compute node:

More information

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1 1 Deparment of Computer

More information

Performance Evaluation of Scientific Applications on POWER8

Performance Evaluation of Scientific Applications on POWER8 Performance Evaluation of Scientific Applications on POWER8 2014 Nov 16 Andrew V. Adinetz 1, Paul F. Baumeister 1, Hans Böttiger 3, Thorsten Hater 1, Thilo Maurer 3, Dirk Pleiter 1, Wolfram Schenck 4,

More information