Benchmarking program performance evaluation of Parallel programming language XcalableMP on Many core processor

Size: px
Start display at page:

Download "Benchmarking program performance evaluation of Parallel programming language XcalableMP on Many core processor"

Transcription

1 XcalableMP Xeon Phi Xeon XcalableMP HIMENO L Phi XL 16 Xeon 1 16 Phi XcalableMP MPI XcalableMP OpenMP Benchmarking program performance evaluation of Parallel programming language XcalableMP on Many core processor MITSURU IKEI 1 2 MASAHIRO NAKAO 2 MITSUHISA SATO 2 We have evaluated Parallel programming language XcalableMP by running a standard performance benchmark on Xeon Phi many-core co-processor cluster. We used Himeno benchmark L size for single Phi node measurement and XL size for 16 nodes of Xeon computing server with a Phi card each. As the result, we could confirm that XcalableMP version of the program can get the performance of MPI version of the same benchmark both on single node of Phi and 16 nodes of Phi cluster. We also experiment XcalableMP + OpenMP hybrid version and confirmed its performance scales up to 2048 threads on 16 compute nodes. 1. MIC Many Integrated Core Intel Xeon Phi PCI Express GDDR5 60 GPGPU General Purpose Graphics Processing Unit GPGPU GPGPU Intel K.K 2 University of Tsukuba XcalableMP XMP 1) XMP Phi XMP HIMNO 2) Phi Phi 2. XcalableMP XMP OpenMP C Fortran PC XcalableMP Fortran Fortran XMP SPMD OpenMP XMP XMP 1

2 XMP!$xmp nodes n(4)!$xmp template t(100)!$xmp distribute t(block) onto n real g(80,100)!$xmp align g(*,j) with t(j) real l(50,40) (80,1:25) (80,25:50) (80,100) (80,50:75) (80,75:100) l(50,40) l(50,40) l(50,40) l(50,40) 1 XMP XMP XMP Fortran XMP XMP XMP nodes node n node template t(100) t(100) distribute t P t(block)template t block real g(80,100) align g template t g(*,j) with t(j)j g 2 template t g real l(50,40) l node real l XMP AICS Omni XcalbleMP Compiler version 1.1 build 1222 source-to-source MPI MPI MPI runtime MPI 3. Xeon Phi 3.1 Xeon Phi PCI Express 2 Phi Phi 3 PCI Ex 2 Phi CPU Phi 61 CPU 2 2 GDDR5 8 3 PCI-Express PCI-Express 3.2 Phi Phi GPGPU Phi Phi Phi 3.3 Phi 2 2

3 Off-load native Off-load GPGPU off-load Phi native Native OpenMP native Phi PCI Express 1 Xeon 1 CPU 2.6 GHz 1600 MHz 32 GB GFLOPS 5.2 GFLOPS 4.2 HIMENO 1 1 Phi L 512x256x Phi XL 1024x512x512XMP HIMENO Fortran90 OMP L 1 XMP 3 1 module alloc1 PARAMETER (mimax = 512, mjmax = 256, mkmax = 256) real p(mimax,mjmax,mkmax), a(mimax,mjmax,mkmax,4), b(mimax,mjmax,mkmax,3) real c(mimax,mjmax,mkmax,3), bnd(mimax,mjmax,mkmax) real wrk1(mimax,mjmax,mkmax), wrk2(mimax,mjmax,mkmax)!$xmp nodes n(*)!$xmp template t(mimax,mjmax,mkmax)!$xmp distribute t(*,*,block) onto n!$xmp align (*,*,k) with t(*,*,k) :: p, bnd, wrk1, wrk2!$xmp align (*,*,k,*) with t(*,*,k) :: a, b, c!$xmp shadow p(0,0,1)!$xmp shadow a(0,0,1,0)!$xmp shadow b(0,0,1,0)!$xmp shadow c(0,0,1,0)!$xmp shadow bnd(0,0,1)!$xmp shadow wrk1(0,0,1)!$xmp shadow wrk2(0,0,1)... ( )!$xmp reflect (p)!$xmp loop (K) on t(*,*,k) reduction (+:GOSA) DO K = 2, kmax-1 DO J = 2, jmax-1 DO I = 2, imax-1 S0 = a(i,j,k,1)*p(i+1,j,k)+a(i,j,k,2)*p(i,j+1,k) 4.3 Phi 1 L Xeon 4 GFLOPS Phi Phi Xeon XMP-Phi Fortran90 MPI Xeon Phi mpiifort -O3 AVX Xeon -mmic Phi ) MPI 1 2 Xeon 4 32 Phi

4 3 (1) Phi Himemo 2 Xeon (2) Phi 1 2 (3) XMP MPI InfiniBand 16 Xeon 1 Phi Phi Himeno XL 1024x512x512 XMP !$xmp nodes n(4,*)!$xmp template t(mimax,mjmax,mkmax)!$xmp distribute t(*,block,block) onto n!$xmp align (*,j,k) with t(*,j,k) :: p, bnd, wrk1, wrk2!$xmp align (*,j,k,*) with t(*,j,k) :: a, b, c!$xmp shadow p(0,1,1)!$xmp shadow a(0,1,1,0) 5 2 node n(4,*) template t 2 4 p,bnd,wk1 align XL 2048=4x Phi 2 XMP 16 Xeon 6 Fortran90 MPI Xeon Phi mpiifort -O3 -AVX -O3 -mmic Xeon MPI x16 Phi 2 256=128x2XL (1) 4 Phi Xeon 4 64 Xeon (2) Phi MPI (3) XMP MPI XMP MPI XMP MPI OpenMP OpenMP XMP Omni XcalableMP OpenMP 7 XMP MPI XMP Fortran MPI-Phi Xeon XMP-Phi HYBRD-P 6 Phi 4

5 !$OMP PARALLEL SHARED (kmax,jmax,imax,nn, &!$OMP t5,xmp_loop_lb9,xmp_loop_ub10,xmp_loop_step11,gosa) &!$OMP PRIVATE (k6,k8,j,i,s0,ss,gosa1,loop) DO loop = 1, nn, 1 # 133 "himenol.f90"!$omp BARRIER!$OMP MASTER gosa = 0.0_8 CALL xmpf_reflect_ ( XMP_DESC_p ) CALL xmpf_ref_templ_alloc_ ( t5, XMP_DESC_t, 3 ) CALL xmpf_ref_set_loop_info_ ( t5, 0, 2, 0 ) CALL xmpf_ref_init_ ( t5 ) XMP_loop_lb9 = 2 XMP_loop_ub10 = kmax - 1 XMP_loop_step11 = 1 CALL xmpf_loop_sched_ ( XMP_loop_lb9, XMP_loop_ub10, XMP_loop_step11, 0, t5 )!$OMP END MASTER!$OMP BARRIER gosa1 = 0.0_8!$OMP DO COLLAPSE(2) DO k6 = XMP_loop_lb9, XMP_loop_ub10, XMP_loop_step11 DO j = 2, jmax - 1, 1 DO i = 2, imax - 1, 1 7 XMP!$OMP OpenMP Parallel 2 MPI XMP XMP 1 1 MASTER 1 OpenMP 3 Do XMP Phi XMP OpenMP 512, 1024 MPI 2048 =128x16 Phi InfiniBand DO template XMP MIC HIMENO XMP XMP MIC XMP MIC 1) XcalableMP / What s XcalableMP 2) / 5. Phi MIC Phi XMP template 5

SC09 HPC Challenge Submission for XcalableMP

SC09 HPC Challenge Submission for XcalableMP SC09 HPC Challenge Submission for XcalableMP Jinpil Lee Graduate School of Systems and Information Engineering University of Tsukuba jinpil@hpcs.cs.tsukuba.ac.jp Mitsuhisa Sato Center for Computational

More information

Performance evaluation of scalable optoelectronics application on large-scale Knights Landing cluster

Performance evaluation of scalable optoelectronics application on large-scale Knights Landing cluster Performance evaluation of scalable optoelectronics application on large-scale Knights Landing cluster Yuta Hirokawa Graduate School of Systems and Information Engineering, University of Tsukuba hirokawa@hpcs.cs.tsukuba.ac.jp

More information

Panorama des modèles et outils de programmation parallèle

Panorama des modèles et outils de programmation parallèle Panorama des modèles et outils de programmation parallèle Sylvain HENRY sylvain.henry@inria.fr University of Bordeaux - LaBRI - Inria - ENSEIRB April 19th, 2013 1/45 Outline Introduction Accelerators &

More information

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters

A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters A Quantum Chemistry Domain-Specific Language for Heterogeneous Clusters ANTONINO TUMEO, ORESTE VILLA Collaborators: Karol Kowalski, Sriram Krishnamoorthy, Wenjing Ma, Simone Secchi May 15, 2012 1 Outline!

More information

WRF performance tuning for the Intel Woodcrest Processor

WRF performance tuning for the Intel Woodcrest Processor WRF performance tuning for the Intel Woodcrest Processor A. Semenov, T. Kashevarova, P. Mankevich, D. Shkurko, K. Arturov, N. Panov Intel Corp., pr. ak. Lavrentieva 6/1, Novosibirsk, Russia, 630090 {alexander.l.semenov,tamara.p.kashevarova,pavel.v.mankevich,

More information

First, a look at using OpenACC on WRF subroutine advance_w dynamics routine

First, a look at using OpenACC on WRF subroutine advance_w dynamics routine First, a look at using OpenACC on WRF subroutine advance_w dynamics routine Second, an estimate of WRF multi-node performance on Cray XK6 with GPU accelerators Based on performance of WRF kernels, what

More information

Some thoughts about energy efficient application execution on NEC LX Series compute clusters

Some thoughts about energy efficient application execution on NEC LX Series compute clusters Some thoughts about energy efficient application execution on NEC LX Series compute clusters G. Wellein, G. Hager, J. Treibig, M. Wittmann Erlangen Regional Computing Center & Department of Computer Science

More information

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver

Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Scalable Hybrid Programming and Performance for SuperLU Sparse Direct Solver Sherry Li Lawrence Berkeley National Laboratory Piyush Sao Rich Vuduc Georgia Institute of Technology CUG 14, May 4-8, 14, Lugano,

More information

11 Parallel programming models

11 Parallel programming models 237 // Program Design 10.3 Assessing parallel programs 11 Parallel programming models Many different models for expressing parallelism in programming languages Actor model Erlang Scala Coordination languages

More information

Massively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem

Massively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem Massively parallel semi-lagrangian solution of the 6d Vlasov-Poisson problem Katharina Kormann 1 Klaus Reuter 2 Markus Rampp 2 Eric Sonnendrücker 1 1 Max Planck Institut für Plasmaphysik 2 Max Planck Computing

More information

Direct Self-Consistent Field Computations on GPU Clusters

Direct Self-Consistent Field Computations on GPU Clusters Direct Self-Consistent Field Computations on GPU Clusters Guochun Shi, Volodymyr Kindratenko National Center for Supercomputing Applications University of Illinois at UrbanaChampaign Ivan Ufimtsev, Todd

More information

Research of the new Intel Xeon Phi architecture for solving a wide range of scientific problems at JINR

Research of the new Intel Xeon Phi architecture for solving a wide range of scientific problems at JINR Research of the new Intel Xeon Phi architecture for solving a wide range of scientific problems at JINR Podgainy D.V., Streltsova O.I., Zuev M.I. on behalf of Heterogeneous Computations team HybriLIT LIT,

More information

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017 HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher

More information

Massively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling

Massively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling 2019 Intel extreme Performance Users Group (IXPUG) meeting Massively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling Hoon Ryu, Ph.D. (E: elec1020@kisti.re.kr)

More information

Performance Analysis of Lattice QCD Application with APGAS Programming Model

Performance Analysis of Lattice QCD Application with APGAS Programming Model Performance Analysis of Lattice QCD Application with APGAS Programming Model Koichi Shirahata 1, Jun Doi 2, Mikio Takeuchi 2 1: Tokyo Institute of Technology 2: IBM Research - Tokyo Programming Models

More information

Performance of the fusion code GYRO on three four generations of Crays. Mark Fahey University of Tennessee, Knoxville

Performance of the fusion code GYRO on three four generations of Crays. Mark Fahey University of Tennessee, Knoxville Performance of the fusion code GYRO on three four generations of Crays Mark Fahey mfahey@utk.edu University of Tennessee, Knoxville Contents Introduction GYRO Overview Benchmark Problem Test Platforms

More information

Massively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling

Massively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling 2019 Intel extreme Performance Users Group (IXPUG) meeting Massively scalable computing method to tackle large eigenvalue problems for nanoelectronics modeling Hoon Ryu, Ph.D. (E: elec1020@kisti.re.kr)

More information

Interpolation with Radial Basis Functions on GPGPUs using CUDA

Interpolation with Radial Basis Functions on GPGPUs using CUDA Interpolation with Radial Basis Functions on GPGPUs using CUDA Gundolf Haase in coop. with: Dirk Martin [VRV Vienna] and Günter Offner [AVL Graz] Institute for Mathematics and Scientific Computing University

More information

Parallel Transposition of Sparse Data Structures

Parallel Transposition of Sparse Data Structures Parallel Transposition of Sparse Data Structures Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng Department of Computer Science, Virginia Tech Niels Bohr Institute, University of Copenhagen Scientific Computing

More information

arxiv: v1 [hep-lat] 10 Jul 2012

arxiv: v1 [hep-lat] 10 Jul 2012 Hybrid Monte Carlo with Wilson Dirac operator on the Fermi GPU Abhijit Chakrabarty Electra Design Automation, SDF Building, SaltLake Sec-V, Kolkata - 700091. Pushan Majumdar Dept. of Theoretical Physics,

More information

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures

A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,

More information

Parallelization of Molecular Dynamics (with focus on Gromacs) SeSE 2014 p.1/29

Parallelization of Molecular Dynamics (with focus on Gromacs) SeSE 2014 p.1/29 Parallelization of Molecular Dynamics (with focus on Gromacs) SeSE 2014 p.1/29 Outline A few words on MD applications and the GROMACS package The main work in an MD simulation Parallelization Stream computing

More information

Implementing NNLO into MCFM

Implementing NNLO into MCFM Implementing NNLO into MCFM Downloadable from mcfm.fnal.gov A Multi-Threaded Version of MCFM, J.M. Campbell, R.K. Ellis, W. Giele, 2015 Higgs boson production in association with a jet at NNLO using jettiness

More information

Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and

Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes and Accelerating the Multifrontal Method Information Sciences Institute 22 June 2012 Bob Lucas, Gene Wagenbreth, Dan Davis, Roger Grimes {rflucas,genew,ddavis}@isi.edu and grimes@lstc.com 3D Finite Element

More information

Introduction to Benchmark Test for Multi-scale Computational Materials Software

Introduction to Benchmark Test for Multi-scale Computational Materials Software Introduction to Benchmark Test for Multi-scale Computational Materials Software Shun Xu*, Jian Zhang, Zhong Jin xushun@sccas.cn Computer Network Information Center Chinese Academy of Sciences (IPCC member)

More information

Parallelization Strategies for Density Matrix Renormalization Group algorithms on Shared-Memory Systems

Parallelization Strategies for Density Matrix Renormalization Group algorithms on Shared-Memory Systems Parallelization Strategies for Density Matrix Renormalization Group algorithms on Shared-Memory Systems G. Hager HPC Services, Computing Center Erlangen, Germany E. Jeckelmann Theoretical Physics, Univ.

More information

Model Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University

Model Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University Model Order Reduction via Matlab Parallel Computing Toolbox E. Fatih Yetkin & Hasan Dağ Istanbul Technical University Computational Science & Engineering Department September 21, 2009 E. Fatih Yetkin (Istanbul

More information

MPI at MPI. Jens Saak. Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory

MPI at MPI. Jens Saak. Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory MAX PLANCK INSTITUTE November 5, 2010 MPI at MPI Jens Saak Max Planck Institute for Dynamics of Complex Technical Systems Computational Methods in Systems and Control Theory FOR DYNAMICS OF COMPLEX TECHNICAL

More information

arxiv: v1 [hep-lat] 8 Nov 2014

arxiv: v1 [hep-lat] 8 Nov 2014 Staggered Dslash Performance on Intel Xeon Phi Architecture arxiv:1411.2087v1 [hep-lat] 8 Nov 2014 Department of Physics, Indiana University, Bloomington IN 47405, USA E-mail: ruizli AT umail.iu.edu Steven

More information

Targeting Extreme Scale Computational Challenges with Heterogeneous Systems

Targeting Extreme Scale Computational Challenges with Heterogeneous Systems Targeting Extreme Scale Computational Challenges with Heterogeneous Systems Oreste Villa, Antonino Tumeo Pacific Northwest Na/onal Laboratory (PNNL) 1 Introduction! PNNL Laboratory Directed Research &

More information

Large-scale Electronic Structure Simulations with MVAPICH2 on Intel Knights Landing Manycore Processors

Large-scale Electronic Structure Simulations with MVAPICH2 on Intel Knights Landing Manycore Processors Large-scale Electronic Structure Simulations with MVAPICH2 on Intel Knights Landing Manycore Processors Hoon Ryu, Ph.D. (E: elec1020@kisti.re.kr) Principal Researcher / Korea Institute of Science and Technology

More information

R. Glenn Brook, Bilel Hadri*, Vincent C. Betro, Ryan C. Hulguin, and Ryan Braby Cray Users Group 2012 Stuttgart, Germany April 29 May 3, 2012

R. Glenn Brook, Bilel Hadri*, Vincent C. Betro, Ryan C. Hulguin, and Ryan Braby Cray Users Group 2012 Stuttgart, Germany April 29 May 3, 2012 R. Glenn Brook, Bilel Hadri*, Vincent C. Betro, Ryan C. Hulguin, and Ryan Braby Cray Users Group 2012 Stuttgart, Germany April 29 May 3, 2012 * presenting author Contents Overview on AACE Overview on MIC

More information

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry

Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry Heterogeneous programming for hybrid CPU-GPU systems: Lessons learned from computational chemistry and Eugene DePrince Argonne National Laboratory (LCF and CNM) (Eugene moved to Georgia Tech last week)

More information

Weather and Climate Modeling on GPU and Xeon Phi Accelerated Systems

Weather and Climate Modeling on GPU and Xeon Phi Accelerated Systems Weather and Climate Modeling on GPU and Xeon Phi Accelerated Systems Mike Ashworth, Rupert Ford, Graham Riley, Stephen Pickles Scientific Computing Department & STFC Hartree Centre STFC Daresbury Laboratory

More information

Performance and Energy Analysis of the Iterative Solution of Sparse Linear Systems on Multicore and Manycore Architectures

Performance and Energy Analysis of the Iterative Solution of Sparse Linear Systems on Multicore and Manycore Architectures Performance and Energy Analysis of the Iterative Solution of Sparse Linear Systems on Multicore and Manycore Architectures José I. Aliaga Performance and Energy Analysis of the Iterative Solution of Sparse

More information

Perm State University Research-Education Center Parallel and Distributed Computing

Perm State University Research-Education Center Parallel and Distributed Computing Perm State University Research-Education Center Parallel and Distributed Computing A 25-minute Talk (S4493) at the GPU Technology Conference (GTC) 2014 MARCH 24-27, 2014 SAN JOSE, CA GPU-accelerated modeling

More information

Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting. Thomas C. Schulthess

Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting. Thomas C. Schulthess Piz Daint & Piz Kesch : from general purpose supercomputing to an appliance for weather forecasting Thomas C. Schulthess 1 Cray XC30 with 5272 hybrid, GPU accelerated compute nodes Piz Daint Compute node:

More information

Quantum ESPRESSO Performance Benchmark and Profiling. February 2017

Quantum ESPRESSO Performance Benchmark and Profiling. February 2017 Quantum ESPRESSO Performance Benchmark and Profiling February 2017 2 Note The following research was performed under the HPC Advisory Council activities Compute resource - HPC Advisory Council Cluster

More information

Presentation Outline

Presentation Outline Parallel Multi-Zone Methods for Large- Scale Multidisciplinary Computational Physics Simulations Ding Li, Guoping Xia and Charles L. Merkle Purdue University The 6th International Conference on Linux Clusters

More information

Weather Research and Forecasting (WRF) Performance Benchmark and Profiling. July 2012

Weather Research and Forecasting (WRF) Performance Benchmark and Profiling. July 2012 Weather Research and Forecasting (WRF) Performance Benchmark and Profiling July 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell,

More information

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics

SPARSE SOLVERS POISSON EQUATION. Margreet Nool. November 9, 2015 FOR THE. CWI, Multiscale Dynamics SPARSE SOLVERS FOR THE POISSON EQUATION Margreet Nool CWI, Multiscale Dynamics November 9, 2015 OUTLINE OF THIS TALK 1 FISHPACK, LAPACK, PARDISO 2 SYSTEM OVERVIEW OF CARTESIUS 3 POISSON EQUATION 4 SOLVERS

More information

Parallelization of the Molecular Orbital Program MOS-F

Parallelization of the Molecular Orbital Program MOS-F Parallelization of the Molecular Orbital Program MOS-F Akira Asato, Satoshi Onodera, Yoshie Inada, Elena Akhmatskaya, Ross Nobes, Azuma Matsuura, Atsuya Takahashi November 2003 Fujitsu Laboratories of

More information

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum

More information

Monte Carlo Methods for Electron Transport: Scalability Study

Monte Carlo Methods for Electron Transport: Scalability Study Monte Carlo Methods for Electron Transport: Scalability Study www.hp-see.eu Aneta Karaivanova (Joint work with E. Atanassov and T. Gurov) Institute of Information and Communication Technologies Bulgarian

More information

Cactus Tools for Petascale Computing

Cactus Tools for Petascale Computing Cactus Tools for Petascale Computing Erik Schnetter Reno, November 2007 Gamma Ray Bursts ~10 7 km He Protoneutron Star Accretion Collapse to a Black Hole Jet Formation and Sustainment Fe-group nuclei Si

More information

A Data Communication Reliability and Trustability Study for Cluster Computing

A Data Communication Reliability and Trustability Study for Cluster Computing A Data Communication Reliability and Trustability Study for Cluster Computing Speaker: Eduardo Colmenares Midwestern State University Wichita Falls, TX HPC Introduction Relevant to a variety of sciences,

More information

ERLANGEN REGIONAL COMPUTING CENTER

ERLANGEN REGIONAL COMPUTING CENTER ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,

More information

Performance Analysis of Parallel Alternating Directions Algorithm for Time Dependent Problems

Performance Analysis of Parallel Alternating Directions Algorithm for Time Dependent Problems Performance Analysis of Parallel Alternating Directions Algorithm for Time Dependent Problems Ivan Lirkov 1, Marcin Paprzycki 2, and Maria Ganzha 2 1 Institute of Information and Communication Technologies,

More information

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code

On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code On Portability, Performance and Scalability of a MPI OpenCL Lattice Boltzmann Code E Calore, S F Schifano, R Tripiccione Enrico Calore INFN Ferrara, Italy 7 th Workshop on UnConventional High Performance

More information

Toward High Performance Matrix Multiplication for Exact Computation

Toward High Performance Matrix Multiplication for Exact Computation Toward High Performance Matrix Multiplication for Exact Computation Pascal Giorgi Joint work with Romain Lebreton (U. Waterloo) Funded by the French ANR project HPAC Séminaire CASYS - LJK, April 2014 Motivations

More information

Towards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters

Towards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters Towards a highly-parallel PDE-Solver using Adaptive Sparse Grids on Compute Clusters HIM - Workshop on Sparse Grids and Applications Alexander Heinecke Chair of Scientific Computing May 18 th 2011 HIM

More information

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors

MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors MAGMA MIC 1.0: Linear Algebra Library for Intel Xeon Phi Coprocessors J. Dongarra, M. Gates, A. Haidar, Y. Jia, K. Kabir, P. Luszczek, and S. Tomov University of Tennessee, Knoxville 05 / 03 / 2013 MAGMA:

More information

Julian Merten. GPU Computing and Alternative Architecture

Julian Merten. GPU Computing and Alternative Architecture Future Directions of Cosmological Simulations / Edinburgh 1 / 16 Julian Merten GPU Computing and Alternative Architecture Institut für Theoretische Astrophysik Zentrum für Astronomie Universität Heidelberg

More information

Stochastic Modelling of Electron Transport on different HPC architectures

Stochastic Modelling of Electron Transport on different HPC architectures Stochastic Modelling of Electron Transport on different HPC architectures www.hp-see.eu E. Atanassov, T. Gurov, A. Karaivan ova Institute of Information and Communication Technologies Bulgarian Academy

More information

Advanced Vectorization of PPML Method for Intel Xeon Scalable Processors

Advanced Vectorization of PPML Method for Intel Xeon Scalable Processors Advanced Vectorization of PPML Method for Intel Xeon Scalable Processors Igor Chernykh 1, Igor Kulikov 1, Boris Glinsky 1, Vitaly Vshivkov 1, Lyudmila Vshivkova 1, Vladimir Prigarin 1 Institute of Computational

More information

Practical Implementation of Lattice QCD Simulation on Intel Xeon Phi Knights Landing

Practical Implementation of Lattice QCD Simulation on Intel Xeon Phi Knights Landing Practical Implementation of Lattice QCD Simulation on Intel Xeon Phi Knights Landing Issaku Kanamori Department of Physics, Hiroshima University Higashi-hiroshima 739-8526, Japan Email: kanamori@hiroshima-u.ac.jp

More information

Performance of machines for lattice QCD simulations

Performance of machines for lattice QCD simulations Performance of machines for lattice QCD simulations Tilo Wettig Institute for Theoretical Physics University of Regensburg Lattice 2005, 30 July 05 Tilo Wettig Performance of machines for lattice QCD simulations

More information

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems

Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Static-scheduling and hybrid-programming in SuperLU DIST on multicore cluster systems Ichitaro Yamazaki University of Tennessee, Knoxville Xiaoye Sherry Li Lawrence Berkeley National Laboratory MS49: Sparse

More information

All-electron density functional theory on Intel MIC: Elk

All-electron density functional theory on Intel MIC: Elk All-electron density functional theory on Intel MIC: Elk W. Scott Thornton, R.J. Harrison Abstract We present the results of the porting of the full potential linear augmented plane-wave solver, Elk [1],

More information

Logo. A Massively-Parallel Multicore Acceleration of a Point Contact Solid Mechanics Simulation DRAFT

Logo. A Massively-Parallel Multicore Acceleration of a Point Contact Solid Mechanics Simulation DRAFT Paper 1 Logo Civil-Comp Press, 2017 Proceedings of the Fifth International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering, P. Iványi, B.H.V Topping and G. Várady (Editors)

More information

Performance Evaluation of MPI on Weather and Hydrological Models

Performance Evaluation of MPI on Weather and Hydrological Models NCAR/RAL Performance Evaluation of MPI on Weather and Hydrological Models Alessandro Fanfarillo elfanfa@ucar.edu August 8th 2018 Cheyenne - NCAR Supercomputer Cheyenne is a 5.34-petaflops, high-performance

More information

Introduction to numerical computations on the GPU

Introduction to numerical computations on the GPU Introduction to numerical computations on the GPU Lucian Covaci http://lucian.covaci.org/cuda.pdf Tuesday 1 November 11 1 2 Outline: NVIDIA Tesla and Geforce video cards: architecture CUDA - C: programming

More information

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts

More information

SPECIAL PROJECT PROGRESS REPORT

SPECIAL PROJECT PROGRESS REPORT SPECIAL PROJECT PROGRESS REPORT Progress Reports should be 2 to 10 pages in length, depending on importance of the project. All the following mandatory information needs to be provided. Reporting year

More information

591 TFLOPS Multi-TRILLION Particles Simulation on SuperMUC

591 TFLOPS Multi-TRILLION Particles Simulation on SuperMUC International Supercomputing Conference 2013 591 TFLOPS Multi-TRILLION Particles Simulation on SuperMUC W. Eckhardt TUM, A. Heinecke TUM, R. Bader LRZ, M. Brehm LRZ, N. Hammer LRZ, H. Huber LRZ, H.-G.

More information

Algebraic Multi-Grid solver for lattice QCD on Exascale hardware: Intel Xeon Phi

Algebraic Multi-Grid solver for lattice QCD on Exascale hardware: Intel Xeon Phi Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Algebraic Multi-Grid solver for lattice QCD on Exascale hardware: Intel Xeon Phi A. Abdel-Rehim aa, G. Koutsou a, C. Urbach

More information

Everyday Multithreading

Everyday Multithreading Everyday Multithreading Parallel computing for genomic evaluations in R C. Heuer, D. Hinrichs, G. Thaller Institute of Animal Breeding and Husbandry, Kiel University August 27, 2014 C. Heuer, D. Hinrichs,

More information

Some notes on efficient computing and setting up high performance computing environments

Some notes on efficient computing and setting up high performance computing environments Some notes on efficient computing and setting up high performance computing environments Andrew O. Finley Department of Forestry, Michigan State University, Lansing, Michigan. April 17, 2017 1 Efficient

More information

Multi-GPU Simulations of the Infinite Universe

Multi-GPU Simulations of the Infinite Universe () Multi-GPU of the Infinite with with G. Rácz, I. Szapudi & L. Dobos Physics of Complex Systems Department Eötvös Loránd University, Budapest June 22, 2018, Budapest, Hungary Outline 1 () 2 () Concordance

More information

Open-Source Parallel FE Software : FrontISTR -- Performance Considerations about B/F (Byte per Flop) of SpMV on K-Supercomputer and GPU-Clusters --

Open-Source Parallel FE Software : FrontISTR -- Performance Considerations about B/F (Byte per Flop) of SpMV on K-Supercomputer and GPU-Clusters -- Parallel Processing for Energy Efficiency October 3, 2013 NTNU, Trondheim, Norway Open-Source Parallel FE Software : FrontISTR -- Performance Considerations about B/F (Byte per Flop) of SpMV on K-Supercomputer

More information

Progress in NWP on Intel HPC architecture at Australian Bureau of Meteorology

Progress in NWP on Intel HPC architecture at Australian Bureau of Meteorology Progress in NWP on Intel HPC architecture at Australian Bureau of Meteorology www.cawcr.gov.au Robin Bowen Senior ITO Earth System Modelling Programme 04 October 2012 ECMWF HPC Presentation outline Weather

More information

Optimization Techniques for Parallel Code 1. Parallel programming models

Optimization Techniques for Parallel Code 1. Parallel programming models Optimization Techniques for Parallel Code 1. Parallel programming models Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr OPT - 2017 Goals of

More information

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015

Tips Geared Towards R. Adam J. Suarez. Arpil 10, 2015 Tips Geared Towards R Departments of Statistics North Carolina State University Arpil 10, 2015 1 / 30 Advantages of R As an interpretive and interactive language, developing an algorithm in R can be done

More information

CRYPTOGRAPHIC COMPUTING

CRYPTOGRAPHIC COMPUTING CRYPTOGRAPHIC COMPUTING ON GPU Chen Mou Cheng Dept. Electrical Engineering g National Taiwan University January 16, 2009 COLLABORATORS Daniel Bernstein, UIC, USA Tien Ren Chen, Army Tanja Lange, TU Eindhoven,

More information

Performance Evaluation of Scientific Applications on POWER8

Performance Evaluation of Scientific Applications on POWER8 Performance Evaluation of Scientific Applications on POWER8 2014 Nov 16 Andrew V. Adinetz 1, Paul F. Baumeister 1, Hans Böttiger 3, Thorsten Hater 1, Thilo Maurer 3, Dirk Pleiter 1, Wolfram Schenck 4,

More information

Welcome to MCS 572. content and organization expectations of the course. definition and classification

Welcome to MCS 572. content and organization expectations of the course. definition and classification Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson

More information

Optimization strategy for MASNUM surface wave model

Optimization strategy for MASNUM surface wave model Hillsboro, September 27, 2018 Optimization strategy for MASNUM surface wave model Zhenya Song *, + * First Institute of Oceanography (FIO), State Oceanic Administrative (SOA), China + Intel Parallel Computing

More information

A simple Concept for the Performance Analysis of Cluster-Computing

A simple Concept for the Performance Analysis of Cluster-Computing A simple Concept for the Performance Analysis of Cluster-Computing H. Kredel 1, S. Richling 2, J.P. Kruse 3, E. Strohmaier 4, H.G. Kruse 1 1 IT-Center, University of Mannheim, Germany 2 IT-Center, University

More information

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique Claude Tadonki MINES ParisTech PSL Research University Centre de Recherche Informatique claude.tadonki@mines-paristech.fr Monthly CRI Seminar MINES ParisTech - CRI June 06, 2016, Fontainebleau (France)

More information

The Memory Intensive System

The Memory Intensive System DiRAC@Durham The Memory Intensive System The DiRAC-2.5x Memory Intensive system at Durham in partnership with Dell Dr Lydia Heck, Technical Director ICC HPC and DiRAC Technical Manager 1 DiRAC Who we are:

More information

APPLICATION OF CUDA TECHNOLOGY FOR CALCULATION OF GROUND STATES OF FEW-BODY NUCLEI BY FEYNMAN'S CONTINUAL INTEGRALS METHOD

APPLICATION OF CUDA TECHNOLOGY FOR CALCULATION OF GROUND STATES OF FEW-BODY NUCLEI BY FEYNMAN'S CONTINUAL INTEGRALS METHOD APPLICATION OF CUDA TECHNOLOGY FOR CALCULATION OF GROUND STATES OF FEW-BODY NUCLEI BY FEYNMAN'S CONTINUAL INTEGRALS METHOD M.A. Naumenko, V.V. Samarin Joint Institute for Nuclear Research, Dubna, Russia

More information

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise

More information

Binding Performance and Power of Dense Linear Algebra Operations

Binding Performance and Power of Dense Linear Algebra Operations 10th IEEE International Symposium on Parallel and Distributed Processing with Applications Binding Performance and Power of Dense Linear Algebra Operations Maria Barreda, Manuel F. Dolz, Rafael Mayo, Enrique

More information

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano ... Our contribution PIPS-PSBB*: Multi-level parallelism for Stochastic

More information

MP CBM-Z V1.0: design for a new CBM-Z gas-phase chemical mechanism architecture for next generation processors

MP CBM-Z V1.0: design for a new CBM-Z gas-phase chemical mechanism architecture for next generation processors MP CBM-Z V1.0: design for a new CBM-Z gas-phase chemical mechanism architecture for next generation processors Hui Wang 1, Junmin Lin 2, Qizhong Wu 1, Huansheng Chen 3, Xiao Tang 3, Zifa Wang 3, Xueshun

More information

Hideyuki Usui 1,3, M. Nunami 2,3, Y. Yagi 1,3, T. Moritaka 1,3, and JST/CREST multi-scale PIC simulation team

Hideyuki Usui 1,3, M. Nunami 2,3, Y. Yagi 1,3, T. Moritaka 1,3, and JST/CREST multi-scale PIC simulation team Hideyuki Usui 1,3, M. Nunami 2,3, Y. Yagi 1,3, T. Moritaka 1,3, and JST/CREST multi-scale PIC simulation team 1 Kobe Univ., Japan, 2 NIFS,Japan, 3 JST/CREST, Outline Multi-scale interaction between weak

More information

Solving Quadratic Equations with XL on Parallel Architectures

Solving Quadratic Equations with XL on Parallel Architectures Solving Quadratic Equations with XL on Parallel Architectures Cheng Chen-Mou 1, Chou Tung 2, Ni Ru-Ben 2, Yang Bo-Yin 2 1 National Taiwan University 2 Academia Sinica Taipei, Taiwan Leuven, Sept. 11, 2012

More information

Digital Systems EEE4084F. [30 marks]

Digital Systems EEE4084F. [30 marks] Digital Systems EEE4084F [30 marks] Practical 3: Simulation of Planet Vogela with its Moon and Vogel Spiral Star Formation using OpenGL, OpenMP, and MPI Introduction The objective of this assignment is

More information

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA Jacobi-Based Eigenvalue Solver on GPU Lung-Sheng Chien, NVIDIA lchien@nvidia.com Outline Symmetric eigenvalue solver Experiment Applications Conclusions Symmetric eigenvalue solver The standard form is

More information

Software optimization for petaflops/s scale Quantum Monte Carlo simulations

Software optimization for petaflops/s scale Quantum Monte Carlo simulations Software optimization for petaflops/s scale Quantum Monte Carlo simulations A. Scemama 1, M. Caffarel 1, E. Oseret 2, W. Jalby 2 1 Laboratoire de Chimie et Physique Quantiques / IRSAMC, Toulouse, France

More information

Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners

Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners José I. Aliaga Leveraging task-parallelism in energy-efficient ILU preconditioners Universidad Jaime I (Castellón, Spain) José I. Aliaga

More information

Measuring freeze-out parameters on the Bielefeld GPU cluster

Measuring freeze-out parameters on the Bielefeld GPU cluster Measuring freeze-out parameters on the Bielefeld GPU cluster Outline Fluctuations and the QCD phase diagram Fluctuations from Lattice QCD The Bielefeld hybrid GPU cluster Freeze-out conditions from QCD

More information

A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method

A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method Jee Choi 1, Aparna Chandramowlishwaran 3, Kamesh Madduri 4, and Richard Vuduc 2 1 ECE, Georgia Tech 2 CSE, Georgia

More information

Environment (Parallelizing Query Optimization)

Environment (Parallelizing Query Optimization) Advanced d Query Optimization i i Techniques in a Parallel Computing Environment (Parallelizing Query Optimization) Wook-Shin Han*, Wooseong Kwak, Jinsoo Lee Guy M. Lohman, Volker Markl Kyungpook National

More information

Background. Another interests. Sieve method. Parallel Sieve Processing on Vector Processor and GPU. RSA Cryptography

Background. Another interests. Sieve method. Parallel Sieve Processing on Vector Processor and GPU. RSA Cryptography Background Parallel Sieve Processing on Vector Processor and GPU Yasunori Ushiro (Earth Simulator Center) Yoshinari Fukui (Earth Simulator Center) Hidehiko Hasegawa (Univ. of Tsukuba) () RSA Cryptography

More information

Efficient algorithms for symmetric tensor contractions

Efficient algorithms for symmetric tensor contractions Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to

More information

Huge-Scale Molecular Dynamics Simulation of Multi-bubble Nuclei

Huge-Scale Molecular Dynamics Simulation of Multi-bubble Nuclei 1/20 Huge-Scale Molecular Dynamics Simulation of Multi-bubble Nuclei H. Watanabe ISSP, The M. Suzuki H. Inaoka N. Ito Kyushu University RIKEN AICS The, RIKEN AICS Outline 1. Introduction 2. Benchmark results

More information

RWTH Aachen University

RWTH Aachen University IPCC @ RWTH Aachen University Optimization of multibody and long-range solvers in LAMMPS Rodrigo Canales William McDoniel Markus Höhnerbach Ahmed E. Ismail Paolo Bientinesi IPCC Showcase November 2016

More information

Multi-threaded adaptive extrapolation procedure for Feynman loop integrals in the physical region

Multi-threaded adaptive extrapolation procedure for Feynman loop integrals in the physical region Journal of Physics: Conference Series OPEN ACCESS Multi-threaded adaptive extrapolation procedure for Feynman loop integrals in the physical region To cite this article: E de Doncker et al 213 J. Phys.:

More information

Cluster Computing: Updraft. Charles Reid Scientific Computing Summer Workshop June 29, 2010

Cluster Computing: Updraft. Charles Reid Scientific Computing Summer Workshop June 29, 2010 Cluster Computing: Updraft Charles Reid Scientific Computing Summer Workshop June 29, 2010 Updraft Cluster: Hardware 256 Dual Quad-Core Nodes 2048 Cores 2.8 GHz Intel Xeon Processors 16 GB memory per

More information