OpenCL: Programming Heterogeneous Architectures


OpenCL: Programming Heterogeneous Architectures Porting BigDFT to OpenCL Brice Videau (LIG - NANOSIM) April 24, 2012 1 / 45

Introduction 2 / 45

Needs in computing resources are infinite. Benefits for physicists and chemists: more computing power means bigger systems, fewer approximations, improved accuracy, and numerical experimentation. CEA's hybrid cluster Titane, built by Bull. 3 / 45

Current and future architectures. 3 trends are seen in current machines: Bigger systems: the number of nodes in clusters and grids increases, and the number of processors in supercomputers increases. Green Computing: low power components, high efficiency, huge number of components. More powerful components: increased frequency, increased number of processors and cores, specialized co-processors (GPU, CELL...). 4 / 45

Exploiting those architectures. Middlewares are available to program those machines. Each middleware covers a range of usage. Some examples: Distributed machines: MPI, KAAPI, CHARM... Multicore architectures: MPI, OpenMP, OmpSs, StarPU, OpenCL... GPU: OpenCL, NVIDIA CUDA, ATI Stream... 5 / 45

Talk Outline
1 Introduction
2 OpenCL: a Standard for Parallel Computing
3 Life and Death of OpenCL in a Program
4 Writing Kernels
5 BigDFT
6 Conclusions and perspectives
6 / 45

OpenCL: a Standard for Parallel Computing 7 / 45

Host-Devices model. OpenCL Architecture Model: 1 host and several devices. Devices are connected to the host. The host issues commands to the devices. Data transport is done via memory copies. [diagram: the host sending commands to several devices] Several devices support OpenCL: NVIDIA for GPUs (and, in the future, Tegra); AMD and Intel for CPUs and GPUs (and MIC?); the IBM CELL processor; ARM GPUs (Mali) + CPUs. 8 / 45

Context and Queues. Contexts aggregate resources, programs and devices belonging to a common platform (i.e. NVIDIA or ATI). Host and devices communicate via buffers defined in a context. Commands are sent to devices using command queues. Commands are called kernels. Command queues: can be synchronous or asynchronous; can be event driven; several queues can point to the same device, allowing concurrent execution. 9 / 45

OpenCL Processing Model. [diagram: a compute device made of compute units 1..m, each compute unit running work items 1..n] Kernels are split into one-, two- or three-dimensional ranges called work groups. Work groups are mapped to compute units. Individual items are processed by work items. 10 / 45
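As a minimal illustration (a hypothetical kernel, not part of the BigDFT code), the built-in index helpers expose this hierarchy; in each dimension the global index is the work group index times the group size plus the local index:

  kernel void index_demo( global int * out ){
    size_t gid   = get_global_id(0);  /* position in the whole index range */
    size_t group = get_group_id(0);   /* which work group we belong to */
    size_t lid   = get_local_id(0);   /* position inside that work group */
    /* equals gid by construction */
    out[gid] = (int)( group * get_local_size(0) + lid );
  }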

OpenCL Memory Model. 4 different memory spaces are defined on an OpenCL device: Global memory: corresponds to the device RAM; input data are stored there. Constant memory: cached global memory. Local memory: high speed memory shared among the work items of a compute unit. Private memory: registers of a work item. 11 / 45
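A hypothetical kernel (not from BigDFT) showing where each space appears in practice:

  kernel void memory_spaces_demo( global const double * in,   /* global memory */
                                  constant double * coeffs,   /* constant memory */
                                  global double * out,
                                  local double * scratch ){   /* local memory */
    double acc = 0.0;                /* private memory (registers) */
    size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);    /* make the local writes visible to the whole group */
    acc = scratch[lid] * coeffs[0];
    out[get_global_id(0)] = acc;
  }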

Life and Death of OpenCL in a Program The Host Side of OpenCL 12 / 45

General Workflow
Select desired platform
Select desired devices
Create associated context
Create command queues associated to devices
Compile or load kernels on the devices
Send data to devices using command queues
Send commands to devices using command queues
Get data from devices using command queues
Release every resource used
13 / 45
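Condensed into one hypothetical host-side sketch (a single GPU, no error checking; the kernel source and buffer sizes are made up for illustration, and each step is detailed on the following slides):

  #include <CL/cl.h>

  static void run_once( float * data_in, float * data_out, size_t buffer_size ){
    const char * source =
      "kernel void my_kernel( global float * buf ){"
      "  buf[get_global_id(0)] *= 2.0f; }";
    cl_platform_id platform;   clGetPlatformIDs( 1, &platform, NULL );
    cl_device_id device;       clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );
    cl_context context = clCreateContext( NULL, 1, &device, NULL, NULL, NULL );
    cl_command_queue queue = clCreateCommandQueue( context, device, 0, NULL );
    cl_program program = clCreateProgramWithSource( context, 1, &source, NULL, NULL );
    clBuildProgram( program, 1, &device, NULL, NULL, NULL );
    cl_kernel kernel = clCreateKernel( program, "my_kernel", NULL );
    cl_mem buf = clCreateBuffer( context, CL_MEM_READ_WRITE, buffer_size, NULL, NULL );
    clEnqueueWriteBuffer( queue, buf, CL_TRUE, 0, buffer_size, data_in, 0, NULL, NULL );
    clSetKernelArg( kernel, 0, sizeof(buf), &buf );
    /* assumes buffer_size / sizeof(float) is a multiple of the local size */
    size_t global_size = buffer_size / sizeof(float), local_size = 32;
    clEnqueueNDRangeKernel( queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL );
    clEnqueueReadBuffer( queue, buf, CL_TRUE, 0, buffer_size, data_out, 0, NULL, NULL );
    clReleaseMemObject( buf ); clReleaseKernel( kernel ); clReleaseProgram( program );
    clReleaseCommandQueue( queue ); clReleaseContext( context );
  }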

Platform Selection
In the near future every platform will support OpenCL, but the user may not be interested in all of them: select an appropriate platform.
Get Platforms

  #include <CL/cl.h>
  cl_uint num_platforms;
  clGetPlatformIDs( 0, NULL, &num_platforms );
  cl_platform_id * platforms = malloc( sizeof(cl_platform_id) * num_platforms );
  clGetPlatformIDs( num_platforms, platforms, NULL );
  /* ... */
  for( int i = 0; i < num_platforms; i++ ){
    /* ... */
    clGetPlatformInfo( platforms[i], CL_PLATFORM_VENDOR, ... );
    /* ... */
  }

14 / 45

Device Selection
Several devices from the same vendor are also common: one device for the screen and one device for computations.
Get Devices

  #include <CL/cl.h>
  cl_uint num_devices;
  clGetDeviceIDs( platform, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices );
  cl_device_id * devices = malloc( sizeof(cl_device_id) * num_devices );
  clGetDeviceIDs( platform, CL_DEVICE_TYPE_ALL, num_devices, devices, NULL );
  /* ... */
  for( int i = 0; i < num_devices; i++ ){
    /* ... */
    clGetDeviceInfo( devices[i], CL_DEVICE_NAME, ... );
    /* ... */
  }

15 / 45

Context Creation
A context gathers devices from the same platform. Those devices will be able to share resources.
Create Context

  cl_context_properties properties[] =
    { CL_CONTEXT_PLATFORM, (cl_context_properties)platform_id, 0 };
  cl_device_id devices[] = { device_id_1, device_id_2 };
  cl_context context =
    clCreateContext( properties, 2, devices, NULL, NULL, NULL );

A shortcut exists, skipping device selection:
Create Context from Type

  cl_context_properties properties[] =
    { CL_CONTEXT_PLATFORM, (cl_context_properties)platform_id, 0 };
  cl_context context =
    clCreateContextFromType( properties, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL );

16 / 45

Building Program from Source
Once the context is created, the program is to be built (or loaded from binary).
Building Program

  /* strings is an array of string_count NULL terminated strings */
  cl_program program =
    clCreateProgramWithSource( context, string_count, strings, NULL, NULL );
  /* if device_list is NULL, program is built
     for all available devices in the context */
  clBuildProgram( program, num_devices, device_list, options, NULL, NULL );
  cl_kernel kernel = clCreateKernel( program, "kernel_name", NULL );

Kernels are extracted from the built program using their name.
17 / 45

Creating Command Queues
A command queue is used to send commands to a device. It has to be associated with a device.
Creating Command Queues

  cl_command_queue queue =
    clCreateCommandQueue( context, devices[chosen_device], 0, NULL );

Options can be specified instead of 0; CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE allows out-of-order execution, for instance.
18 / 45

Using OpenCL. Using OpenCL is (hopefully) easier than setting it up. Create buffers to store data on devices. Send data to devices using command queues. Send commands to devices using command queues. Get data from devices using command queues. 19 / 45

Buffer Creation
In OpenCL, buffer creation and deletion are explicitly managed. As can be noted, buffers are tied to a context and not to a particular command queue. The implementation is free to transfer buffers from device to host memory, or to another device.
Creating Simple Buffers

  cl_mem read_buffer =
    clCreateBuffer( context, CL_MEM_READ_ONLY, buffer_size, NULL, NULL );
  cl_mem write_buffer =
    clCreateBuffer( context, CL_MEM_WRITE_ONLY, buffer_size, NULL, NULL );

20 / 45

Pinned Buffer Creation
Pinned buffer creation can offer premium performance. Here is a code sample that can be used on NVIDIA devices. The final pointers obtained can be used to transfer data between the host and the device.
Creating Pinned Simple Buffers

  cl_mem pinned_read_buffer =
    clCreateBuffer( context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                    buffer_size, NULL, NULL );
  cl_mem pinned_write_buffer =
    clCreateBuffer( context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_WRITE_ONLY,
                    buffer_size, NULL, NULL );
  unsigned char * data_in =
    clEnqueueMapBuffer( queue, pinned_read_buffer, CL_TRUE, CL_MAP_WRITE, 0,
                        buffer_size, 0, NULL, NULL, NULL );
  unsigned char * data_out =
    clEnqueueMapBuffer( queue, pinned_write_buffer, CL_TRUE, CL_MAP_READ, 0,
                        buffer_size, 0, NULL, NULL, NULL );

21 / 45

Transferring Data
The implementation is free to move buffers in memory. Nonetheless, memory is often kept on the device associated with the command queue used to transfer the data.
Data Transfer

  clEnqueueWriteBuffer( queue, read_buffer, CL_TRUE, 0,
                        buffer_size, data_in, 0, NULL, NULL );
  /* Processing that reads read_buffer and writes write_buffer */
  /* ... */
  clEnqueueReadBuffer( queue, write_buffer, CL_TRUE, 0,
                       buffer_size, data_out, 0, NULL, NULL );

22 / 45

Performing Calculations
Once data is transferred, kernels are used to perform calculations.
Kernel Usage

  /* Place kernel parameters in the kernel structure. */
  clSetKernelArg( kernel, 0, sizeof(data_size), (void *)&data_size );
  clSetKernelArg( kernel, 1, sizeof(read_buffer), (void *)&read_buffer );
  clSetKernelArg( kernel, 2, sizeof(write_buffer), (void *)&write_buffer );
  /* Enqueue a 1-dimensional kernel with a local size of 32 */
  size_t localWorkSize[] = { 32 };
  size_t globalWorkSize[] = { shrRoundUp( 32, data_size ) };
  clEnqueueNDRangeKernel( queue, kernel, 1, NULL,
                          globalWorkSize, localWorkSize, 0, NULL, NULL );

23 / 45

Event Management
Almost all functions presented end with:

  ..., 0, NULL, NULL );

These 3 arguments are used for event management, and thus asynchronous queue handling. Functions can wait for a number of events, and can generate 1 event.

  cl_event event_list[] = { event1, event2 };
  cl_event event;
  clEnqueueReadBuffer( queue, write_buffer, CL_FALSE, 0,
                       buffer_size, data_out, 2, event_list, &event );

The previous buffer read waits for 2 events and generates a third one that will be signaled when the read is completed.
24 / 45

Release Resources. OpenCL uses reference counts to manage memory. In order to exit cleanly from an OpenCL program, all allocated resources have to be freed: buffers (clReleaseMemObject), events (clReleaseEvent), kernels (clReleaseKernel), programs (clReleaseProgram), queues (clReleaseCommandQueue), context (clReleaseContext), etc. 25 / 45
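For instance, matching the objects created on the previous slides, a minimal release sketch (assuming each object was created exactly once) looks like:

  clReleaseMemObject( read_buffer );
  clReleaseMemObject( write_buffer );
  clReleaseKernel( kernel );
  clReleaseProgram( program );
  clReleaseCommandQueue( queue );
  clReleaseContext( context );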

Writing Kernels 26 / 45

Recall: OpenCL Memory Model. 4 different memory spaces are defined on an OpenCL device: Global memory: corresponds to the device RAM; input data are stored there. Constant memory: cached global memory. Local memory: high speed memory shared among the work items of a compute unit. Private memory: registers of a work item. 27 / 45

OpenCL Language: a Subset of C. Kernels are written using a C-like language. Recursion is prohibited. Helper functions are defined: barriers, work item indexes, atomic operations, vector operations. New keywords: kernel, global, local, constant, private, read_only, write_only, read_write. 28 / 45
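A tiny hypothetical kernel (not from BigDFT) exercising some of these helpers: a vector type, a barrier, and an atomic counter.

  kernel void helpers_demo( global float4 * data,
                            global int * counter,
                            local float4 * scratch ){
    size_t lid = get_local_id(0);
    /* vector operation: multiply all four components at once */
    scratch[lid] = data[get_global_id(0)] * (float4)(2.0f);
    /* synchronize the work items of the work group */
    barrier(CLK_LOCAL_MEM_FENCE);
    /* one atomic increment per work group */
    if (lid == 0)
      atomic_inc( counter );
    data[get_global_id(0)] = scratch[lid];
  }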

Example: Unidimensional Convolutions. [figure: element (i,j) of the input matrix mapped to the transposed output matrix; dimensions n, Ne, Nl] One unidimensional convolution with transposition: simple, but not too much. Real-world code used in BigDFT, an electronic structure calculation program. 29 / 45

OpenCL memory management. Example using a 4*4 block processing a filter of length 5. [figure: a 4x4 block of the input matrix loaded into a local buffer, then written back transposed into the output matrix] 1 work item processes one element of the final matrix. 30 / 45

Benefits from this approach. 3D convolutions can be expressed as a succession of three 1D convolution/transpositions. Memory accesses are coalesced while reading the input matrix and writing the output matrix. Bank conflicts are almost avoided by padding the buffer to the number of banks + 1. Maps easily to current architectures. Extends to the more complex convolutions found in BigDFT. Almost every convolution in BigDFT has been ported, including free boundary and semi-periodic boundary conditions. 31 / 45

Kernel Declaration
Kernel Declaration

  /* Activate double precision support */
  #pragma OPENCL EXTENSION cl_khr_fp64 : enable
  #define FILT_W 16
  #define WG_S 16
  kernel void
  __attribute__((reqd_work_group_size(WG_S, WG_S, 1)))
  magicfilter1dKernel_d( uint n, uint ndat,
                         global const double * psi,
                         global double * out ){
    // padded local buffer size: 33 * 16
    local double tmp[WG_S * (WG_S + FILT_W + 1)];

Works on double precision floats. The kernel expects a work group size of 16 x 16. n and ndat are scalar arguments, private to each work item. tmp is a storage buffer in local memory, shared among work items.
32 / 45

Work with Indexes
Get Indexes and Load Data

    // get our position in the result matrix
    const size_t ig = get_global_id(0);
    const size_t jg = get_global_id(1);
    // get our position in the local work group
    const size_t i = get_local_id(0);
    const size_t j = get_local_id(1);
    // transpose indexes in the work group in order to read transposed data
    ptrdiff_t igt = ig - i + j - FILT_W/2;
    ptrdiff_t jgt = jg - j + i;
    // if we are on the outside, select a border element to load, wrapping around
    // we will be loading 2 elements each
    if ( igt < 0 )
      tmp[i * (WG_S + FILT_W + 1) + j] = psi[jgt + ( n + igt ) * ndat];
    else
      tmp[i * (WG_S + FILT_W + 1) + j] = psi[jgt + igt * ndat];
    igt += FILT_W;
    if ( igt >= n )
      tmp[i * (WG_S + FILT_W + 1) + j + FILT_W] = psi[jgt + ( igt - n ) * ndat];
    else
      tmp[i * (WG_S + FILT_W + 1) + j + FILT_W] = psi[jgt + igt * ndat];

33 / 45

Compute Convolution and Write Output
Performing Computations

    // initialize result
    double tt = 0.0;
    // reset position in the buffer to the first element involved in the convolution
    tmp += j * (WG_S + FILT_W + 1) + i;
    // wait for the buffer to be full
    barrier(CLK_LOCAL_MEM_FENCE);

    // apply filter
    tt += *tmp++ * FILT0;
    tt += *tmp++ * FILT1;
    /* ... */
    tt += *tmp++ * FILT15;
    // store the result
    out[(jg * n + ig)] = tt;
  };

34 / 45

BigDFT OpenCL Port Evaluation 35 / 45

Motivations. Part of BigDFT was already ported to GPU architectures using CUDA, so why a new port? Only part of the program was ported. OpenCL can target several platforms while CUDA can't. OpenCL is a standard, while CUDA is a vendor-provided solution. This time most of BigDFT is expected to run on the GPU. The ported code is relatively simple and well suited to GPUs, so the performance loss is hoped to be minimal. 36 / 45

Objectives. Most of BigDFT's operations can be expressed as unidimensional convolutions. There are several dozen convolutions to implement. The convolutions of BigDFT share common traits. Objectives: find a common parallelization technique fitting most convolutions; be as efficient as CUDA. 37 / 45

Test System Setup. GPU 1: Tesla C2070 (Fermi), 6 GB of RAM, driver version 260.14. GPU 2: Radeon HD6970, 2 GB of RAM, driver version 11.6. 38 / 45

Test System Setup Host : Lenovo D20 1 Xeon 5550 @ 2.83 GHz (4 Nehalem cores) 8 GB of RAM Linux 2.6.38-11 x86_64 icc 11.1 39 / 45

Comparison CPU, Fermi, HD6970. [bar chart: "Performances of CPU vs NVIDIA vs AMD", GFLOPS (0-180) for CPU, NVIDIA and AMD across the kernels Magicfilter, Kinetic, Synthesis, Analysis, Kinetic_k, Magicfilter_shrink, Magicfilter_grow, Magicfilter_reverse, Compress, Uncompress, Analysis_shrink, Synthesis_grow] 40 / 45

Comparison CPU, Fermi, HD6970. [bar chart: "Performances of CPU vs NVIDIA vs AMD", ratio to CPU (0-70) for CPU, NVIDIA and AMD across the same kernels as the previous slide] 40 / 45

Comparison CUDA, OpenCL, CPU. [chart: "Badiane, X5550 + Fermi C2070, ZnO 64 at.: CPU vs. Hybrid"; run-time breakdown in percent (Other, CPU, Conv, LinAlg, Comms) and overall speedup (0-10) for the configurations CPU-mkl, CPU-mkl-mpi, CUDA, CUDA-mkl, CUDA-mpi, CUDA-mkl-mpi, OCL-mkl, OCL-cublas, OCL-mkl-mpi, OCL-cublas-mpi] 41 / 45

Hybrid ATI + NVIDIA
Tesla C2070 + Radeon HD 6970

  MPI + NVIDIA/AMD        Execution Time (s)   Speedup
  1                       6020                 1
  4                       1660                 3.6
  1 + NVIDIA              300                  20
  4 + NVIDIA              160                  38
  1 + AMD                 347                  17
  4 + AMD                 197                  30
  (4 + NV) + (4 + AMD)    109                  55

Table: Performance results for different configurations of BigDFT, using MPI + GPUs.
42 / 45

Conclusions. OpenCL: OpenCL proved easy to use. Performance is on par with the previous CUDA implementation. Kernels have been shown to run on other architectures: ATI and CPU. BigDFT: Full port of the BigDFT convolutions to OpenCL. Part of the code was rewritten during this work. Complexity is reduced compared to the CUDA support. 43 / 45

Perspectives. OpenCL: Some OpenCL implementations are still recent and buggy. What is the best way to do multi-GPU, or GPU + OpenCL CPU? Optimizing kernels for multiple devices? Automated kernel generation. 44 / 45

Questions? Thanks for your attention. 45 / 45