OpenCL: Programming Heterogeneous Architectures

1 OpenCL: Programming Heterogeneous Architectures - Porting BigDFT to OpenCL
Brice Videau (LIG - NANOSIM), April 24

2 Introduction

3 Needs in computing resources are infinite
Benefits for physicists and chemists. More computing power means:
Bigger systems,
Fewer approximations,
Improved accuracy,
Numerical experimentation.
CEA's hybrid cluster Titane, built by Bull

4 Current and future architectures
3 trends are seen in current computers:
Bigger systems: the number of nodes in clusters and grids increases, the number of processors in supercomputers increases.
Green computing: low-power components, high efficiency, huge number of components.
More powerful components: increased frequency, increased number of processors and cores, specialized co-processors (GPU, CELL...).

5 Exploiting those architectures
Middleware is available to program those machines. Each middleware covers a range of usage.
Some examples
Distributed machines: MPI, KAAPI, CHARM...
Multicore architectures: MPI, OpenMP, OmpSs, StarPU, OpenCL...
GPU: OpenCL, NVIDIA CUDA, ATI Stream...

6 Talk Outline
2 OpenCL: a Standard for Parallel Computing
3 Life and Death of OpenCL in a Program
4 Writing Kernels
5 BigDFT
6 Conclusions and perspectives

7 OpenCL: a Standard for Parallel Computing

8 Host-Devices model
OpenCL Architecture Model: 1 host and several devices. Devices are connected to the host. The host issues commands to the devices. Data transport is done via memory copies.
[Figure: the host sending commands to several devices]
Several devices support OpenCL:
NVIDIA for GPUs and, in the future, Tegra.
AMD and Intel for CPUs and GPUs, and MIC?
IBM CELL processor.
ARM GPUs (Mali) + CPUs.

9 Context and Queues
Contexts aggregate resources, programs and devices belonging to a common platform (e.g. NVIDIA or ATI). Host and devices communicate via buffers defined in a context. Commands are sent to devices using command queues. Commands are called kernels.
Command queues
Can be synchronous or asynchronous.
Can be event driven.
Several queues can point to the same device, allowing concurrent execution.

10 OpenCL Processing Model
[Figure: work items grouped into work groups, mapped onto compute units 1..m of a compute device]
Kernel executions are split into one-, two- or three-dimensional ranges, divided into work groups. Work groups are mapped to compute units. Individual items are processed by work items.
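For illustration (not from the talk), the work-item index helpers relate the three levels as follows:

__kernel void index_demo(__global int *out) {
  /* with no global work offset:
     get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0) */
  size_t i = get_global_id(0);
  out[i] = (int)(get_group_id(0) * get_local_size(0) + get_local_id(0));
}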

11 OpenCL Memory Model
4 different memory spaces are defined on an OpenCL device:
Global memory: corresponds to the device RAM; input data are stored there.
Constant memory: cached global memory.
Local memory: high-speed memory shared among the work items of a compute unit.
Private memory: registers of a work item.
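A minimal kernel (ours, for illustration) showing how the four spaces appear in OpenCL C:

__kernel void spaces_demo(__global const float *in,  /* global: device RAM */
                          __constant float *coef,    /* constant: cached global memory */
                          __local float *scratch,    /* local: shared by the work group */
                          __global float *out) {
  float acc;                                         /* private: registers */
  size_t i = get_global_id(0);
  scratch[get_local_id(0)] = in[i];
  barrier(CLK_LOCAL_MEM_FENCE);                      /* make local writes visible */
  acc = scratch[get_local_id(0)] * coef[0];
  out[i] = acc;
}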

12 Life and Death of OpenCL in a Program
The Host Side of OpenCL

13 General Workflow
Select desired platform,
Select desired devices,
Create associated context,
Create command queues associated to devices,
Compile or load kernels on the devices,
Send data to devices using command queues,
Send commands to devices using command queues,
Get data from devices using command queues,
Release every resource used.
A condensed host-side sketch of these steps follows.
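The sketch below strings the workflow together, assuming a kernel source string src containing a hypothetical kernel named "square" taking (uint n, __global double *buf); error checking is omitted for brevity, and this is an illustration of the steps above, not code from the talk.

#include <CL/cl.h>

void run(const char *src, size_t n, double *host_in, double *host_out) {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, NULL);                          /* select a platform */
  cl_device_id device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL); /* select a device */
  cl_context_properties props[] =
    { CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0 };
  cl_context ctx = clCreateContext(props, 1, &device, NULL, NULL, NULL);
  cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);
  cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
  clBuildProgram(program, 1, &device, NULL, NULL, NULL);          /* compile kernels */
  cl_kernel kernel = clCreateKernel(program, "square", NULL);
  cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                              n * sizeof(double), NULL, NULL);
  clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(double), /* send data */
                       host_in, 0, NULL, NULL);
  cl_uint size = (cl_uint)n;
  clSetKernelArg(kernel, 0, sizeof(size), &size);
  clSetKernelArg(kernel, 1, sizeof(buf), &buf);
  size_t global = n;
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,   /* send commands */
                         0, NULL, NULL);
  clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(double), /* get data back */
                      host_out, 0, NULL, NULL);
  clReleaseMemObject(buf);                                        /* release resources */
  clReleaseKernel(kernel);
  clReleaseProgram(program);
  clReleaseCommandQueue(queue);
  clReleaseContext(ctx);
}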

14 Platform Selection
In the near future every platform will support OpenCL, but the user may not be interested in all of them: select an appropriate platform.
Get Platforms
#include <stdlib.h>
#include <CL/cl.h>
cl_uint num_platforms;
clGetPlatformIDs(0, NULL, &num_platforms);
cl_platform_id *platforms = malloc(sizeof(cl_platform_id) * num_platforms);
clGetPlatformIDs(num_platforms, platforms, NULL);
/* ... */
for (int i = 0; i < num_platforms; i++) {
  /* ... */
  char vendor[128];
  clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, sizeof(vendor), vendor, NULL);
  /* ... */
}

15 Device Selection
Several devices from the same vendor are also common: one device for the screen and one device for computations.
Get Devices
#include <stdlib.h>
#include <CL/cl.h>
cl_uint num_devices;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
cl_device_id *devices = malloc(sizeof(cl_device_id) * num_devices);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, num_devices, devices, NULL);
/* ... */
for (int i = 0; i < num_devices; i++) {
  /* ... */
  char name[128];
  clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
  /* ... */
}

16 Context Creation
Contexts gather devices from the same platform. Those devices will be able to share resources.
Create Context
cl_context_properties properties[] =
  { CL_CONTEXT_PLATFORM, (cl_context_properties)platform_id, 0 };
cl_device_id devices[] = { device_id_1, device_id_2 };
cl_context context =
  clCreateContext(properties, 2, devices, NULL, NULL, NULL);
A shortcut exists, skipping device selection:
Create Context from Type
cl_context_properties properties[] =
  { CL_CONTEXT_PLATFORM, (cl_context_properties)platform_id, 0 };
cl_context context =
  clCreateContextFromType(properties, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

17 Building Program from Source
Once the context is created, the program has to be built (or loaded from binary).
Building Program
/* strings is an array of string_count NULL-terminated strings */
cl_program program =
  clCreateProgramWithSource(context, string_count, strings, NULL, NULL);
/* if device_list is NULL, the program is built
   for all available devices in the context */
clBuildProgram(program, num_devices, device_list, options, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "kernel_name", NULL);
Kernels are extracted from the built program using their name.
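If clBuildProgram fails, the compiler output can be recovered with clGetProgramBuildInfo and CL_PROGRAM_BUILD_LOG. A minimal sketch (the helper name is ours, not from the talk):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Print the build log of a program for one device. */
void print_build_log(cl_program program, cl_device_id device) {
  size_t log_size;
  clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                        0, NULL, &log_size);
  char *log = malloc(log_size);
  clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                        log_size, log, NULL);
  fprintf(stderr, "%s\n", log);
  free(log);
}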

18 Creating Command Queues
A command queue is used to send commands to a device. It has to be associated with a device.
Creating Command Queues
cl_command_queue queue =
  clCreateCommandQueue(context, devices[chosen_device], 0, NULL);
Options can be specified instead of 0; CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, for instance, allows out-of-order execution.

19 Using OpenCL
Using OpenCL is (hopefully) easier than setting it up.
Create buffers to store data on devices.
Send data to devices using command queues.
Send commands to devices using command queues.
Get data from devices using command queues.

20 Buffer Creation
In OpenCL, buffer creation and deletion are explicitly managed. Note that buffers are tied to a context and not to a particular command queue. The implementation is free to transfer buffers from device to host memory, or to another device.
Creating Simple Buffers
cl_mem read_buffer =
  clCreateBuffer(context, CL_MEM_READ_ONLY, buffer_size, NULL, NULL);
cl_mem write_buffer =
  clCreateBuffer(context, CL_MEM_WRITE_ONLY, buffer_size, NULL, NULL);

21 Pinned Buffer Creation
Pinned buffer creation can offer premium performance. Here is a code sample that can be used on NVIDIA devices. The final pointers obtained can be used to transfer data between the host and the device.
Creating Pinned Simple Buffers
cl_mem pinned_read_buffer =
  clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                 buffer_size, NULL, NULL);
cl_mem pinned_write_buffer =
  clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_WRITE_ONLY,
                 buffer_size, NULL, NULL);
unsigned char *data_in =
  clEnqueueMapBuffer(queue, pinned_read_buffer, CL_TRUE, CL_MAP_WRITE, 0,
                     buffer_size, 0, NULL, NULL, NULL);
unsigned char *data_out =
  clEnqueueMapBuffer(queue, pinned_write_buffer, CL_TRUE, CL_MAP_READ, 0,
                     buffer_size, 0, NULL, NULL, NULL);
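Mapped pointers should eventually be released with clEnqueueUnmapMemObject before the buffers themselves are freed. A minimal sketch reusing the names above:

/* unmap the pinned buffers once host-side access is finished */
clEnqueueUnmapMemObject(queue, pinned_read_buffer, data_in, 0, NULL, NULL);
clEnqueueUnmapMemObject(queue, pinned_write_buffer, data_out, 0, NULL, NULL);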

22 Transferring Data
The implementation is free to move buffers in memory. Nonetheless, memory is often kept on the device associated with the command queue used to transfer the data.
Data Transfer
clEnqueueWriteBuffer(queue, read_buffer, CL_TRUE, 0,
                     buffer_size, data_in, 0, NULL, NULL);
/* processing that reads read_buffer and writes write_buffer */
/* ... */
clEnqueueReadBuffer(queue, write_buffer, CL_TRUE, 0,
                    buffer_size, data_out, 0, NULL, NULL);

23 Performing Calculations
Once data is transferred, kernels are used to perform calculations.
Kernel Usage
/* place kernel parameters in the kernel structure */
clSetKernelArg(kernel, 0, sizeof(data_size), (void *)&data_size);
clSetKernelArg(kernel, 1, sizeof(read_buffer), (void *)&read_buffer);
clSetKernelArg(kernel, 2, sizeof(write_buffer), (void *)&write_buffer);
/* enqueue a 1-dimensional kernel with a local size of 32 */
size_t localWorkSize[] = { 32 };
size_t globalWorkSize[] = { shrRoundUp(32, data_size) };
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       globalWorkSize, localWorkSize, 0, NULL, NULL);
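shrRoundUp comes from the NVIDIA SDK samples and is not part of OpenCL. A minimal equivalent, assuming only that the global size must be rounded up to a multiple of the work-group size:

/* round global_size up to the next multiple of group_size */
size_t round_up(size_t group_size, size_t global_size) {
  size_t r = global_size % group_size;
  return r == 0 ? global_size : global_size + group_size - r;
}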

24 Event Management
Almost all the functions presented end with:
..., 0, NULL, NULL );
These 3 arguments are used for event management, and thus asynchronous queue handling. Functions can wait for a number of events, and can generate 1 event.
cl_event event_list[] = { event1, event2 };
cl_event event;
clEnqueueReadBuffer(queue, write_buffer, CL_FALSE, 0,
                    buffer_size, data_out, 2, event_list, &event);
The previous buffer read waits for 2 events and generates a third, which will be signaled when the read is completed.
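On the host side, clWaitForEvents blocks until the listed events have completed; together with clReleaseEvent this gives a minimal synchronization pattern (a sketch reusing the event generated above):

/* block the host until the asynchronous read has completed */
clWaitForEvents(1, &event);
/* events are resources too and must be released */
clReleaseEvent(event);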

25 Release Resources
OpenCL uses reference counts to manage memory. In order to exit cleanly from an OpenCL program, all allocated resources have to be freed:
buffers (clReleaseMemObject)
events (clReleaseEvent)
kernels (clReleaseKernel)
programs (clReleaseProgram)
queues (clReleaseCommandQueue)
contexts (clReleaseContext)
etc.
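As a sketch, the teardown for the objects created in the previous slides; each clRelease* call decrements the reference count, and the object is freed when the count reaches zero:

clReleaseMemObject(read_buffer);
clReleaseMemObject(write_buffer);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);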

26 Writing Kernels

27 Recall: OpenCL Memory Model
4 different memory spaces are defined on an OpenCL device:
Global memory: corresponds to the device RAM; input data are stored there.
Constant memory: cached global memory.
Local memory: high-speed memory shared among the work items of a compute unit.
Private memory: registers of a work item.

28 OpenCL Language: a Subset of C
Kernels are written using a C-like language.
Recursion is prohibited.
Helper functions are defined:
Barriers
Work-item indexes
Atomic operations
Vector operations
New keywords:
kernel
global, local, constant, private
read_only, write_only, read_write
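A minimal example kernel (ours, not from BigDFT) using these keywords and helpers:

__kernel void vector_add(uint n,
                         __global const float *a,
                         __global const float *b,
                         __global float *out) {
  size_t i = get_global_id(0);  /* work-item index helper */
  if (i < n)
    out[i] = a[i] + b[i];
}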

29 Example: Unidimensional Convolutions
[Figure: an Ne x Nl input matrix is convolved with a filter of length n along one dimension and written transposed to an Nl x Ne output matrix; element (i,j) of the input maps to element (j,i) of the output]
One unidimensional convolution with transposition: simple, but not too much. Real-world code used in BigDFT, an electronic structure calculation program.

30 OpenCL memory management
Example using a 4x4 block processing a filter of length 5.
[Figure: a 4x4 block of the input matrix, plus the border elements needed by the filter, is loaded into local memory and written back transposed to the output matrix]
1 work item processes one element of the final matrix.

31 Benefits from this approach
3D convolutions can be expressed as a succession of 3 1D convolution/transpositions.
Memory accesses are coalesced while reading the input matrix and writing the output matrix.
Bank conflicts are almost avoided by padding the buffer to the number of banks + 1.
Maps easily to current architectures.
Extends to the more complex convolutions found in BigDFT.
Almost every convolution in BigDFT has been ported, including free boundary and semi-periodic boundary conditions.

32 Kernel Declaration
Kernel Declaration
// activate double precision support
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#define FILT_W 16
#define WG_S 16
__kernel void
__attribute__((reqd_work_group_size(WG_S, WG_S, 1)))
magicfilter1dKernel_d(uint n, uint ndat,
                      __global const double *psi,
                      __global double *out) {
  // padded local buffer size:
  __local double tmp[WG_S * (WG_S + FILT_W + 1)];
Works on double-precision floats.
The kernel expects a work-group size of 16 x 16.
n and ndat are scalar arguments, private to each work item.
tmp is a storage buffer in local memory, shared among work items.

33 Work with Indexes
Get Indexes and Load Data
// get our position in the result matrix
const size_t ig = get_global_id(0);
const size_t jg = get_global_id(1);
// get our position in the local work group
const size_t i = get_local_id(0);
const size_t j = get_local_id(1);
// transpose indexes in the work group in order to read transposed data
ptrdiff_t igt = ig - i + j - FILT_W / 2;
ptrdiff_t jgt = jg - j + i;
// if we are on the outside, select a border element to load, wrapping around;
// we will be loading 2 elements each
if (igt < 0)
  tmp[i * (WG_S + FILT_W + 1) + j] = psi[jgt + (n + igt) * ndat];
else
  tmp[i * (WG_S + FILT_W + 1) + j] = psi[jgt + igt * ndat];
igt += FILT_W;
if (igt >= n)
  tmp[i * (WG_S + FILT_W + 1) + j + FILT_W] = psi[jgt + (igt - n) * ndat];
else
  tmp[i * (WG_S + FILT_W + 1) + j + FILT_W] = psi[jgt + igt * ndat];

34 Compute Convolution and Write Output
Performing Computations
// initialize result
double tt = 0.0;
// reset position in the buffer to the first element involved in the convolution
// (j2 and i2 are transposed local indexes defined earlier in the kernel, not shown)
tmp += j2 * (WG_S + FILT_W + 1) + i2;
// wait for the buffer to be full
barrier(CLK_LOCAL_MEM_FENCE);

// apply filter
tt += *tmp++ * FILT0;
tt += *tmp++ * FILT1;
/* ... */
tt += *tmp++ * FILT15;
// store the result
out[jg * n + ig] = tt;
}

35 BigDFT OpenCL Port Evaluation

36 Motivations
Part of BigDFT was already ported to GPU architectures using CUDA, so why a new port?
Only part of the program was ported.
OpenCL can target several platforms while CUDA can't.
OpenCL is a standard, while CUDA is a vendor-provided solution.
This time most of BigDFT is expected to run on the GPU.
The ported code is relatively simple and well suited to GPUs, so the performance loss is hoped to be minimal.

37 Objectives
Most of BigDFT's operations can be expressed as unidimensional convolutions. There are several dozen convolutions to implement. The convolutions of BigDFT share common traits.
Objectives
Find a common parallelization technique fitting most convolutions.
Be as efficient as CUDA.

38 Test System Setup
GPU 1: Tesla C2070 (Fermi), 6 GB of RAM. Driver version:
GPU 2: Radeon HD 6970, GB of RAM. Driver version:

39 Test System Setup
Host: Lenovo D20
1 Xeon 2.83 GHz (4 Nehalem cores)
8 GB of RAM
Linux x86_64, icc

40 Comparison CPU, Fermi, HD 6970
[Chart: performance in GFLOPS of the kernels Magicfilter, Kinetic, Synthesis, Analysis, Kinetic_k, Magicfilter_shrink, Magicfilter_grow, Magicfilter_reverse, Compress, Uncompress, Analysis_shrink and Synthesis_grow on CPU, NVIDIA and AMD]

41 Comparison CPU, Fermi, HD 6970
[Chart: the same kernels, shown as performance ratio to the CPU for NVIDIA and AMD]

42 Comparison CUDA, OpenCL, CPU
[Chart: Badiane, Fermi C2070, ZnO 64 atoms, CPU vs. hybrid; time breakdown (Conv, LinAlg, Comms, Other CPU) and speedup for the configurations CPU-mkl, CUDA, CPU-mkl-mpi, OCL-mkl, OCL-cublas, CUDA-mkl, OCL-cublas-mpi, CUDA-mkl-mpi, CUDA-mpi and OCL-mkl-mpi]

43 Hybrid ATI + NVIDIA
[Table: performance results for different configurations of BigDFT using MPI + GPUs (Tesla C2070 and Radeon HD 6970): execution time (s) and speedup for NVIDIA-only, AMD-only and hybrid (4 + NVIDIA) + (4 + AMD) runs]

44 Conclusions
OpenCL
OpenCL proved easy to use.
Performance is on par with the previous CUDA implementation.
Kernels have been shown to run on other architectures: ATI and CPU.
BigDFT
Full port of the BigDFT convolutions to OpenCL.
Part of the code was rewritten during this work.
Complexity is reduced compared to the CUDA support.

45 Perspectives
OpenCL
Some OpenCL implementations are still recent and buggy.
Best way to do multi-GPU, GPU + OpenCL CPU?
Optimizing kernels for multiple devices?
Automated kernel generation.

46 Questions?
Thanks for your attention.
