11 Parallel programming models
There are many different models for expressing parallelism in programming languages:

- Actor model: Erlang, Scala
- Coordination languages: Linda
- CSP-based (Communicating Sequential Processes): FortranM, Occam
- Dataflow: SISAL (Streams and Iteration in a Single Assignment Language)
- Distributed: Sequoia, Bloom
- Event-driven and hardware description: Verilog hardware description language (HDL)
- Multi-threaded: Clojure
- Functional: Concurrent Haskell
- Object-oriented: Charm++, Smalltalk
- GPU languages: CUDA, OpenMP, OpenACC, OpenHMPP (HMPP for Hybrid Multicore Parallel Programming)
- Message-passing: MPI, PVM
- Partitioned global address space (PGAS): High Performance Fortran (HPF)
- Logic programming: Parlog
11.1 HPF

High Performance Fortran (HPF):

- partitioned global address space parallel programming model
- a Fortran 90 extension
- SPMD (Single Program Multiple Data) model: each process operates on its own part of the data
- HPF directives specify which processor gets which part of the data
- concurrency is defined by HPF directives, building on Fortran 90
- HPF directives are written as comments: !HPF$ <directive>
HPF example: matrix multiplication

      PROGRAM ABmult
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 100
      INTEGER, DIMENSION (N,N) :: A, B, C
      INTEGER :: i, j
!HPF$ PROCESSORS square(2,2)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
!HPF$ ALIGN A(i,*) WITH C(i,*)
!     replicate copies of row A(i,*) onto processors which compute C(i,j)
!HPF$ ALIGN B(*,j) WITH C(*,j)
!     replicate copies of column B(*,j) onto processors which compute C(i,j)
      A = 1
      B = 2
      C = 0
      DO i = 1, N
        DO j = 1, N
!         All the work is local due to the ALIGNs
          C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
        END DO
      END DO
      WRITE(*,*) C
      END
HPF programming methodology

- Need to find a balance between concurrency and communication: the more processes, the more communication
- aim for a balanced load, based on the "owner computes" rule
- exploit data locality
- It is easy to write a program in HPF, but difficult to obtain good efficiency

Programming in HPF goes more or less like this:

1. Write a correctly working serial program; test and debug it
2. Add distribution directives, introducing as little communication as possible (see the sketch below)
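As a small illustration of step 2 (a hypothetical sketch, not from the slides; the array names and sizes are invented): starting from a working serial loop, distribution directives are added so that each processor owns, and computes on, its own block of the arrays, following the "owner computes" rule:

      REAL, DIMENSION(10000) :: x, y
      INTEGER :: i
!HPF$ PROCESSORS procs(4)
!HPF$ DISTRIBUTE (BLOCK) ONTO procs :: x
!HPF$ ALIGN y(i) WITH x(i)
!     y(i) lives on the owner of x(i), so the loop below needs no communication
      DO i = 1, 10000
        y(i) = 2.0*x(i)
      END DO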
11.2 OpenMP

OpenMP tutorial: http://www.llnl.gov/computing/tutorials/openmp/

- programming model based on thread parallelism
- a helping tool for the programmer
- based on compiler directives, for C/C++ and Fortran
- nested parallelism is supported (though not all implementations support it; see the sketch below)
- dynamic threads
- OpenMP has become a standard
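A minimal sketch of dynamic threads and nested parallelism (not from the slides; it assumes an OpenMP-capable compiler, e.g. gfortran with -fopenmp):

      program nested_demo
        use omp_lib
        implicit none
        call omp_set_dynamic(.true.)  ! let the runtime adjust team sizes
        call omp_set_nested(.true.)   ! allow parallel regions inside parallel regions
!$omp parallel num_threads(2)
        print *, 'outer thread', omp_get_thread_num()
!$omp parallel num_threads(2)
        print *, '  inner thread', omp_get_thread_num(), &
                 ' under outer thread', omp_get_ancestor_thread_num(1)
!$omp end parallel
!$omp end parallel
      end program nested_demo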
Fork-Join model

OpenMP uses the fork-join model of parallel execution: the program starts as a single master thread, which forks a team of threads at the beginning of each parallel region; at the end of the region the threads synchronize and join, leaving only the master thread.
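The model can be shown in a few lines (a sketch under the same compiler assumption as above, not from the slides):

      program fork_join
        use omp_lib
        implicit none
        print *, 'serial region: master thread only'
!$omp parallel                 ! fork: a team of threads starts here
        print *, 'parallel region: thread', omp_get_thread_num(), &
                 ' of', omp_get_num_threads()
!$omp end parallel             ! join: the team ends, the master continues
        print *, 'serial region again: master thread only'
      end program fork_join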
OpenMP example: matrix multiplication

! http://www.llnl.gov/computing/tutorials/openmp/exercise.html

C******************************************************************************
C OpenMP Example - Matrix Multiply - Fortran Version
C FILE: omp_mm.f
C DESCRIPTION:
C   Demonstrates a matrix multiply using OpenMP. Threads share row iterations
C   according to a predefined chunk size.
C LAST REVISED: 1/5/04  Blaise Barney
C******************************************************************************
      PROGRAM MATMULT
      INTEGER NRA, NCA, NCB, TID, NTHREADS, I, J, K, CHUNK,
     +        OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
C     number of rows in matrix A
      PARAMETER (NRA=62)
C     number of columns in matrix A
      PARAMETER (NCA=15)
C     number of columns in matrix B
      PARAMETER (NCB=7)
      REAL*8 A(NRA,NCA), B(NCA,NCB), C(NRA,NCB)

C     Set loop iteration chunk size
      CHUNK = 10

C     Spawn a parallel region explicitly scoping all variables
!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(TID,I,J,K)
      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Starting matrix multiple example with', NTHREADS,
     +           'threads'
        PRINT *, 'Initializing matrices'
      END IF

C     Initialize matrices
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 30 I=1, NRA
        DO 30 J=1, NCA
          A(I,J) = (I-1)+(J-1)
   30 CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 40 I=1, NCA
        DO 40 J=1, NCB
          B(I,J) = (I-1)*(J-1)
   40 CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 50 I=1, NRA
        DO 50 J=1, NCB
          C(I,J) = 0
   50 CONTINUE

C     Do matrix multiply sharing iterations on outer loop
C     Display who does which iterations for demonstration purposes
      PRINT *, 'Thread', TID, 'starting matrix multiply...'
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 60 I=1, NRA
        PRINT *, 'Thread', TID, 'did row', I
        DO 60 J=1, NCB
          DO 60 K=1, NCA
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
   60 CONTINUE

C     End of parallel region
!$OMP END PARALLEL

C     Print results
      PRINT *, '******************************************************'
      PRINT *, 'Result Matrix:'
      DO 90 I=1, NRA
        DO 80 J=1, NCB
          WRITE(*,70) C(I,J)
   70     FORMAT(2x,f8.2,$)
   80   CONTINUE
        PRINT *, ' '
   90 CONTINUE
      PRINT *, '******************************************************'
      PRINT *, 'Done.'
      END
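As a usage note (an assumption about tooling, not stated on the slides): with GNU Fortran this fixed-form example can typically be built with gfortran -fopenmp omp_mm.f, and the number of threads is controlled through the OMP_NUM_THREADS environment variable.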
11.3 GPGPU

General-purpose computing on graphics processing units.

CUDA:

- parallel programming platform and model created by NVIDIA
- CUDA-extended programming languages, compiled with nvcc: C, C++, Fortran
- CUDA gives access to the GPU's virtual instruction set and to the memory of its parallel computational elements
- based on executing a large number of threads simultaneously
- CUDA-accelerated libraries
CUDA Fortran matrix multiplication example (for more information, see http://):

Host (CPU) code:

      subroutine mmul( A, B, C )
        use cudafor
        real, dimension(:,:) :: A, B, C
        integer :: N, M, L
        real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
        type(dim3) :: dimGrid, dimBlock
        N = size(A,1); M = size(A,2); L = size(B,2)
        allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )
        Adev = A(1:N,1:M)
        Bdev = B(1:M,1:L)
        dimGrid = dim3( N/16, L/16, 1 )
        dimBlock = dim3( 16, 16, 1 )
        call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )
        C(1:N,1:L) = Cdev
        deallocate( Adev, Bdev, Cdev )
      end subroutine

GPU code:

      attributes(global) subroutine MMUL_KERNEL( A, B, C, N, M, L )
        real, device :: A(N,M), B(M,L), C(N,L)
        integer, value :: N, M, L
        integer :: i, j, kb, k, tx, ty
        real, shared :: Ab(16,16), Bb(16,16)
        real :: Cij
        tx = threadidx%x; ty = threadidx%y
        i = (blockidx%x-1)*16 + tx
        j = (blockidx%y-1)*16 + ty
        Cij = 0.0
        do kb = 1, M, 16
          ! Fetch one element each into Ab and Bb; note that the 16x16 = 256
          ! threads in this thread block are fetching separate elements
          ! of Ab and Bb
          Ab(tx,ty) = A(i,kb+ty-1)
          Bb(tx,ty) = B(kb+tx-1,j)
          ! Wait until all elements of Ab and Bb are filled
          call syncthreads()
          do k = 1, 16
            Cij = Cij + Ab(tx,k) * Bb(k,ty)
          enddo
          ! Wait until all threads in the thread block finish with
          ! this iteration's Ab and Bb
          call syncthreads()
        enddo
        C(i,j) = Cij
      end subroutine

Note that the launch configuration assumes the matrix dimensions are multiples of 16, matching the 16x16 thread blocks used by the kernel.

OpenCL

Open Computing Language (OpenCL):

- heterogeneous systems of CPUs (central processing units), GPUs (graphics processing units), DSPs (digital signal processors), FPGAs (field-programmable gate arrays) and other processors
- a language for writing kernels (functions that execute on OpenCL devices), based on C99, plus application programming interfaces (APIs); a Fortran interface exists (fortrancl)
- parallel computing using task-based and data-based parallelism
- GPGPU
- open standard by the Khronos Group (Apple, AMD, IBM, Intel and Nvidia), later also adopted by Altera, Samsung, Vivante and ARM Holdings

OpenCL example: matrix multiplication
/* kernel.cl
 * Matrix multiplication: C = A * B  (device code)
 * (http://gpgpu-computing4.blogspot.co.uk//09/matrix-multiplication-2-opencl.html)
 */

// OpenCL kernel
__kernel void matrixMul(__global float* C,
                        __global float* A,
                        __global float* B,
                        int wA, int wB)
{
   // 2D thread ID
   // Old CUDA code:
   // int tx = blockIdx.x * TILE_SIZE + threadIdx.x;
   // int ty = blockIdx.y * TILE_SIZE + threadIdx.y;
   int tx = get_global_id(0);
   int ty = get_global_id(1);

   // value stores the element that is computed by the thread
   float value = 0;
   for (int k = 0; k < wA; ++k)
   {
      float elementA = A[ty * wA + k];
      float elementB = B[k * wB + tx];
      value += elementA * elementB;
   }

   // Write the matrix to device memory; each thread writes one element
   C[ty * wA + tx] = value;
}

A complete program also needs host code that creates an OpenCL context, command queue and device buffers, builds this kernel, and enqueues it over a 2D work range.

OpenACC

- a standard (by Cray, CAPS, Nvidia and PGI) to simplify parallel programming of heterogeneous CPU/GPU systems
- uses annotations (like OpenMP) in C, C++ and Fortran source code
- code is started on both the CPU and the GPU automatically
- OpenACC is expected to merge into OpenMP in a future release of OpenMP
! A simple OpenACC kernel for matrix multiplication
!$acc kernels
      do k = 1, n1
        do i = 1, n3
          c(i,k) = 0.0
          do j = 1, n2
            c(i,k) = c(i,k) + a(i,j) * b(j,k)
          enddo
        enddo
      enddo
!$acc end kernels

A larger example combining OpenMP and OpenACC:
! http://www.bu.edu/tech/research/computation/linux-cluster/katana-cluster/gpu-computing/openacc-fortran/matrix-multiply-fortran/

      program matrix_multiply
        use omp_lib
        use openacc
        implicit none
        integer :: i, j, k, myid, m, n, compiled_for, option
        integer, parameter :: fd = 11
        integer :: t1, t2, dt, count_rate, count_max
        real, allocatable, dimension(:,:) :: a, b, c
        real :: tmp, secs

        open(fd, file='wallclocktime', form='formatted')

        option = compiled_for(fd)  ! 1 - serial; 2 - OpenMP; 3 - OpenACC; 4 - both

!$omp parallel
!$      myid = OMP_GET_THREAD_NUM()
!$      if (myid .eq. 0) then
!$        write(fd,"('Number of procs is ',i4)") OMP_GET_NUM_THREADS()
!$      endif
!$omp end parallel

        call system_clock(count_max=count_max, count_rate=count_rate)

        do m = 1, 4  ! compute for different size matrix multiplies
          call system_clock(t1)
          n = 1000 * 2**(m-1)  ! 1000, 2000, 4000, 8000
          allocate( a(n,n), b(n,n), c(n,n) )

          ! Initialize matrices
          do j = 1, n
            do i = 1, n
              a(i,j) = real(i + j)
              b(i,j) = real(i - j)
            enddo
          enddo

!$omp parallel do shared(a,b,c,n) reduction(+:tmp)
!$acc data copyin(a,b) copy(c)
!$acc kernels
          ! Compute matrix multiplication.
          do j = 1, n
            do i = 1, n
              tmp = 0.0  ! enables ACC parallelism for the k-loop
              do k = 1, n
                tmp = tmp + a(i,k) * b(k,j)
              enddo
              c(i,j) = tmp
            enddo
          enddo
!$acc end kernels
!$acc end data
!$omp end parallel do

          call system_clock(t2)
          dt = t2 - t1
          secs = real(dt) / real(count_rate)
          write(fd,"('For n = ',i4,', wall clock time is ',f12.2,' seconds')") &
                n, secs
          deallocate(a, b, c)
        enddo

        close(fd)
      end program matrix_multiply

      integer function compiled_for(fd)
        implicit none
        integer :: fd
#if defined _OPENMP && defined _OPENACC
        compiled_for = 4
        write(fd,"('This code is compiled with OpenMP & OpenACC')")
#elif defined _OPENACC
        compiled_for = 3
        write(fd,"('This code is compiled with OpenACC')")
#elif defined _OPENMP
        compiled_for = 2
        write(fd,"('This code is compiled with OpenMP')")
#else
        compiled_for = 1
        write(fd,"('This code is compiled for serial operations')")
#endif
      end function compiled_for
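As a usage note (again an assumption about tooling, not from the slides): with the PGI compilers used on the BU cluster page above, this example is typically built with flags along the lines of pgfortran -Mpreprocess -mp -acc, where -mp enables the OpenMP directives, -acc enables the OpenACC directives, and -Mpreprocess runs the C preprocessor required by the #ifdef logic in compiled_for.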