Benchmark Program Performance Evaluation of the Parallel Programming Language XcalableMP on a Many-Core Processor

MITSURU IKEI (1)(2)   MASAHIRO NAKAO (2)   MITSUHISA SATO (2)
(1) Intel K.K.   (2) University of Tsukuba

Abstract: We have evaluated the parallel programming language XcalableMP by running a standard performance benchmark on a Xeon Phi many-core co-processor cluster. We used the Himeno benchmark, size L, for single-node Phi measurements and size XL for 16 Xeon compute servers each equipped with one Phi card. As a result, we confirmed that the XcalableMP version of the program achieves the performance of the MPI version of the same benchmark, both on a single Phi and on the 16-node Phi cluster. We also experimented with an XcalableMP + OpenMP hybrid version and confirmed that its performance scales up to 2048 threads on the 16 compute nodes.

1. Introduction

Intel's MIC (Many Integrated Core) architecture, shipped as the Xeon Phi co-processor, packs about 60 general-purpose cores and GDDR5 memory onto a card attached to the host over PCI Express. Unlike a GPGPU (General-Purpose Graphics Processing Unit), each core runs ordinary CPU code, so conventional parallel programming models can be applied directly to the device.

In this report we evaluate the parallel programming language XcalableMP (XMP) [1] on the Xeon Phi. We port the HIMENO benchmark [2] to XMP and measure it on a single Phi and on a cluster of Phi-equipped nodes, comparing against the MPI version of the same benchmark.

2. XcalableMP

XcalableMP (XMP) is a directive-based parallel language extension for Fortran and C, similar in style to OpenMP, designed for distributed-memory systems such as PC clusters. An XMP program is executed in SPMD fashion; the programmer describes data distribution and work mapping with directives, and the inter-node communication is generated by the compiler. Figure 1 shows a small XMP Fortran example of data distribution.

!$xmp nodes n(4)
!$xmp template t(100)
!$xmp distribute t(block) onto n
      real g(80,100)
!$xmp align g(*,j) with t(j)
      real l(50,40)

Figure 1: XMP data distribution example. The 100 template elements are block-distributed over the 4 nodes, so the nodes own the sections g(80,1:25), g(80,26:50), g(80,51:75) and g(80,76:100), while every node keeps its own full copy of l(50,40).

In Figure 1 the nodes directive declares a node array n of four executing nodes, and the template directive declares a template t(100), a virtual index space used to describe the distribution. The distribute directive with the block format splits the 100 template elements over the four nodes into blocks of 25. The array g(80,100) is mapped onto the template with the align directive: g(*,j) with t(j) distributes the second dimension of g along t, so each node holds an 80 x 25 block of g. The array l(50,40), which has no align directive, is an ordinary local array; every node allocates its own 50 x 40 copy.

As the XMP compiler we used the Omni XcalableMP Compiler version 1.1 (build 1222) developed at RIKEN AICS. It is a source-to-source translator: the XMP directives are converted into calls to an MPI-based runtime library, and the translated program is then compiled and linked with an ordinary MPI compiler.
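
Work is mapped onto this distribution with the XMP loop directive; the benchmark code in Figure 3 uses it on the Himeno kernel. As a minimal sketch, assuming the declarations of Figure 1, a loop over the distributed array g can be parallelized as follows, with each node executing only the j indices it owns:

      ! Illustrative sketch (assumes the nodes/template/align declarations
      ! of Figure 1): each node updates only its own block of g, so no
      ! communication is needed for this update.
!$xmp loop (j) on t(j)
      do j = 1, 100
        do i = 1, 80
          g(i,j) = 2.0 * g(i,j)
        end do
      end do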

3. The Xeon Phi co-processor

3.1 Architecture

The Xeon Phi used in this work is a co-processor card attached to the host over PCI Express (Figure 2). It carries 61 general-purpose cores and 8 GB of GDDR5 memory; each core provides multiple hardware threads and wide SIMD units.

3.2 Programming the Phi

Unlike a GPGPU, the Phi runs its own operating system and executes ordinary compiled code, so existing programming models such as MPI and OpenMP can be used on the card itself.

3.3 Execution models

Two execution models are available. In off-load mode the program runs on the host and, as with a GPGPU, selected kernels are off-loaded to the Phi. In native mode the entire program is compiled for the Phi and runs directly on the card, where OpenMP and MPI can be used as on an ordinary compute node. In this work we run the benchmarks in native mode.

4. Evaluation

4.1 Evaluation environment

Each compute node is a Xeon server with one Phi card attached over PCI Express. The host has a Xeon CPU running at 2.6 GHz and 32 GB of 1600 MHz memory.

4.2 HIMENO benchmark

The HIMENO benchmark measures the floating-point performance of a Jacobi-type solver for a pressure Poisson equation on a structured grid. We use size L (512 x 256 x 256) for the single-node Phi measurements and size XL (1024 x 512 x 512) for the 16-node measurements. The XMP version was written from the Fortran90 + OpenMP version of HIMENO. Figure 3 shows the L-size XMP code with a one-dimensional distribution.

      module alloc1
      PARAMETER (mimax = 512, mjmax = 256, mkmax = 256)
      real p(mimax,mjmax,mkmax), a(mimax,mjmax,mkmax,4), b(mimax,mjmax,mkmax,3)
      real c(mimax,mjmax,mkmax,3), bnd(mimax,mjmax,mkmax)
      real wrk1(mimax,mjmax,mkmax), wrk2(mimax,mjmax,mkmax)
!$xmp nodes n(*)
!$xmp template t(mimax,mjmax,mkmax)
!$xmp distribute t(*,*,block) onto n
!$xmp align (*,*,k) with t(*,*,k) :: p, bnd, wrk1, wrk2
!$xmp align (*,*,k,*) with t(*,*,k) :: a, b, c
!$xmp shadow p(0,0,1)
!$xmp shadow a(0,0,1,0)
!$xmp shadow b(0,0,1,0)
!$xmp shadow c(0,0,1,0)
!$xmp shadow bnd(0,0,1)
!$xmp shadow wrk1(0,0,1)
!$xmp shadow wrk2(0,0,1)
      ... (omitted) ...
!$xmp reflect (p)
!$xmp loop (k) on t(*,*,k) reduction (+:GOSA)
      DO K = 2, kmax-1
        DO J = 2, jmax-1
          DO I = 2, imax-1
            S0 = a(i,j,k,1)*p(i+1,j,k) + a(i,j,k,2)*p(i,j+1,k) + ...

Figure 3: XMP HIMENO benchmark, size L, one-dimensional distribution (excerpt).

In Figure 3 the arrays are distributed only in the third (k) dimension: template t is distributed as (*,*,block), and the arrays are aligned with the k dimension of t. The shadow directives allocate one halo plane on each side of the local block in the k direction, the reflect directive updates those halo planes before the stencil sweep, and the loop directive maps the k loop onto the owning nodes while accumulating GOSA with a reduction clause.
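
For comparison, the shadow and reflect directives replace the halo exchange that must be written by hand in the MPI version. The following is a rough sketch, not the code generated by the Omni translator, of such an exchange for the one-dimensional k distribution; the neighbor ranks north and south and the local extent klocal are assumed names used only for illustration:

      ! Rough sketch only -- not the Omni-generated code.  north/south are
      ! the neighbor ranks in the k direction (MPI_PROC_NULL at the domain
      ! boundaries) and klocal is the local extent of the distributed k axis.
      subroutine reflect_p_by_hand(p, mimax, mjmax, klocal, north, south)
        use mpi
        implicit none
        integer :: mimax, mjmax, klocal, north, south, ierr
        integer :: status(MPI_STATUS_SIZE)
        real    :: p(mimax, mjmax, 0:klocal+1)   ! local block plus one shadow plane per side
        ! send the uppermost interior plane to north, receive the lower shadow from south
        call MPI_Sendrecv(p(1,1,klocal), mimax*mjmax, MPI_REAL, north, 0, &
                          p(1,1,0),      mimax*mjmax, MPI_REAL, south, 0, &
                          MPI_COMM_WORLD, status, ierr)
        ! send the lowermost interior plane to south, receive the upper shadow from north
        call MPI_Sendrecv(p(1,1,1),        mimax*mjmax, MPI_REAL, south, 1, &
                          p(1,1,klocal+1), mimax*mjmax, MPI_REAL, north, 1, &
                          MPI_COMM_WORLD, status, ierr)
      end subroutine reflect_p_by_hand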

4.3 Single-node results

Figure 4 shows the Himeno size L performance on a single node. Both the XMP and the MPI versions are Fortran90 codes compiled with mpiifort -O3, with AVX code generation for the Xeon host and -mmic for the Phi. The number of MPI processes (XMP nodes) was varied up to 32 on the Xeon host and up to 128 on the Phi.

Figure 4: Himeno size L performance in GFLOPS versus number of processes (1 to 128), for the XMP and MPI versions on the Phi and on the Xeon host (chart).

Two points can be read from Figure 4: the best Himeno performance on a single Phi is roughly twice that of the Xeon host, and the XMP version achieves essentially the same performance as the MPI version on both processors.

4.4 Multi-node measurement

For the multi-node measurements we use 16 Xeon servers connected by InfiniBand, each with one Phi card. With up to 128 XMP nodes (MPI processes) per Phi, the 16 servers give up to 2048 XMP nodes in total. The problem size is Himeno XL (1024 x 512 x 512), and the XMP code is changed to a two-dimensional distribution; the modified directives are shown in Figure 5.

!$xmp nodes n(4,*)
!$xmp template t(mimax,mjmax,mkmax)
!$xmp distribute t(*,block,block) onto n
!$xmp align (*,j,k) with t(*,j,k) :: p, bnd, wrk1, wrk2
!$xmp align (*,j,k,*) with t(*,j,k) :: a, b, c
!$xmp shadow p(0,1,1)
!$xmp shadow a(0,1,1,0)

Figure 5: XMP directives for the two-dimensional distribution used for size XL (excerpt).

In Figure 5 the node array n(4,*) is two-dimensional: the first extent is fixed at four and the second is determined at run time from the number of executing nodes, so a run with 2048 XMP nodes forms a 4 x 512 node grid. Template t is now block-distributed in both the j and the k dimensions, the arrays p, bnd, wrk1 and wrk2 (and likewise a, b, c) are aligned with both distributed dimensions, and the shadow width becomes one plane in each of the j and k directions.
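
Under this distribution, a run with a 4 x 512 node grid on the XL problem would give each XMP node a local slab of about 1024 x (512/4) x (512/512) = 1024 x 128 x 1 interior points plus its shadow planes. The sketch below is our illustration, assuming the declarations of Figures 3 and 5 and using a stand-in loop body rather than the actual Himeno kernel, of how a stencil sweep is mapped over both distributed dimensions:

      ! Illustrative sketch of a sweep mapped onto the (*,block,block)
      ! distribution of Figure 5; the body is a stand-in for the Himeno kernel.
!$xmp reflect (p)
!$xmp loop (j,k) on t(*,j,k) reduction (+:gosa)
      do k = 2, kmax-1
        do j = 2, jmax-1
          do i = 2, imax-1
            wrk2(i,j,k) = 0.25 * ( p(i,j-1,k) + p(i,j+1,k)   &
                                 + p(i,j,k-1) + p(i,j,k+1) )
            gosa = gosa + (wrk2(i,j,k) - p(i,j,k))**2
          end do
        end do
      end do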

4.5 Multi-node results

Figure 6 shows the XL-size results on the 16-node cluster for the Xeon hosts (MPI), the Phi cluster (MPI and XMP) and the hybrid version (HYBRD-P) described in Section 4.6. As before, the Fortran90 codes were compiled with mpiifort -O3, with AVX for the Xeon hosts and -mmic for the Phi. On the Xeon hosts 16 nodes x 16 processes = 256 MPI processes were used; on the Phi cluster the measurements cover 512, 1024 and 2048 processes. The Phi cluster clearly outperforms the Xeon hosts from 512 processes onward, and the XMP version again tracks the MPI version up to 2048 processes.

Figure 6: Himeno size XL performance in GFLOPS on 16 nodes; series MPI-Phi, XMP-Phi, HYBRD-P and Xeon (chart).

4.6 XMP + OpenMP hybrid

In the flat XMP and MPI runs above, one process is started for every core or hardware thread of the Phi. To reduce the number of processes we also evaluated an XMP + OpenMP hybrid version, in which XMP/MPI is used across nodes and OpenMP threads are used inside each node. Figure 7 shows an excerpt of the Fortran code translated by the Omni XcalableMP compiler for the main Himeno loop, parallelized with OpenMP.

!$OMP PARALLEL SHARED (kmax,jmax,imax,nn, &
!$OMP   t5,xmp_loop_lb9,xmp_loop_ub10,xmp_loop_step11,gosa) &
!$OMP   PRIVATE (k6,k8,j,i,s0,ss,gosa1,loop)
      DO loop = 1, nn, 1
# 133 "himenol.f90"
!$OMP BARRIER
!$OMP MASTER
      gosa = 0.0_8
      CALL xmpf_reflect_ ( XMP_DESC_p )
      CALL xmpf_ref_templ_alloc_ ( t5, XMP_DESC_t, 3 )
      CALL xmpf_ref_set_loop_info_ ( t5, 0, 2, 0 )
      CALL xmpf_ref_init_ ( t5 )
      XMP_loop_lb9 = 2
      XMP_loop_ub10 = kmax - 1
      XMP_loop_step11 = 1
      CALL xmpf_loop_sched_ ( XMP_loop_lb9, XMP_loop_ub10, XMP_loop_step11, 0, t5 )
!$OMP END MASTER
!$OMP BARRIER
      gosa1 = 0.0_8
!$OMP DO COLLAPSE(2)
      DO k6 = XMP_loop_lb9, XMP_loop_ub10, XMP_loop_step11
        DO j = 2, jmax - 1, 1
          DO i = 2, imax - 1, 1

Figure 7: Translated Fortran code for the main loop of the XMP HIMENO benchmark with OpenMP directives (excerpt).

In Figure 7 the whole iteration loop is enclosed in an OpenMP PARALLEL region. The calls into the XMP runtime, that is, the halo exchange xmpf_reflect_ and the loop-scheduling calls that compute the local bounds XMP_loop_lb9 and XMP_loop_ub10, are executed by a single thread inside a MASTER construct guarded by barriers, because there is one XMP node (MPI process) per compute process and the runtime must be entered once per process. The triple DO loop over the locally owned indices is then parallelized across the threads with an OMP DO COLLAPSE(2).

The hybrid version was run on the Phi cluster over InfiniBand with 128 OpenMP threads per node, so that 16 MPI processes x 128 threads = 2048 threads in total; Figure 6 (HYBRD-P) includes its measurements at 512, 1024 and 2048 threads. Its performance scales up to 2048 threads on the 16 nodes.

5. Summary

We evaluated the parallel programming language XcalableMP on the MIC (Xeon Phi) many-core processor using the HIMENO benchmark. On a single Phi and on a 16-node Phi cluster connected by InfiniBand, the XMP versions of the benchmark achieved the performance of the equivalent MPI versions, and the change from the one-dimensional to the two-dimensional distribution was expressed entirely with XMP directives such as nodes, template, distribute, align and shadow, without rewriting the computation itself. An XMP + OpenMP hybrid version scaled up to 2048 threads on the 16 nodes. These results suggest that XMP can be used effectively on MIC-based systems.

References
[1] XcalableMP: What's XcalableMP. http://www.xcalablemp.org
[2] Himeno Benchmark, RIKEN. http://accc.riken.jp/2145.htm