Benchmark Program Performance Evaluation of the Parallel Programming Language XcalableMP on a Many-Core Processor

MITSURU IKEI (1)(2)   MASAHIRO NAKAO (2)   MITSUHISA SATO (2)
(1) Intel K.K.   (2) University of Tsukuba

Abstract: We have evaluated the parallel programming language XcalableMP by running a standard performance benchmark on a Xeon Phi many-core co-processor cluster. We used the Himeno benchmark, size L, for single-node Phi measurements and size XL for 16 Xeon compute servers each equipped with one Phi card. As a result, we confirmed that the XcalableMP version of the program achieves the performance of the MPI version of the same benchmark, both on a single Phi and on the 16-node Phi cluster. We also experimented with an XcalableMP + OpenMP hybrid version and confirmed that its performance scales up to 2048 threads on the 16 compute nodes.

1. Introduction

Intel's MIC (Many Integrated Core) architecture, shipped as the Xeon Phi co-processor, packs about 60 general-purpose cores and GDDR5 memory onto a card attached to the host over PCI Express. Unlike a GPGPU (General-Purpose Graphics Processing Unit), each core runs ordinary CPU code, so conventional parallel programming models can be applied directly to the device.

In this report we evaluate the parallel programming language XcalableMP (XMP) [1] on the Xeon Phi. We port the HIMENO benchmark [2] to XMP and measure it on a single Phi and on a cluster of Phi-equipped nodes, comparing against the MPI version of the same benchmark.

2. XcalableMP

XcalableMP (XMP) is a directive-based parallel language extension for Fortran and C, similar in style to OpenMP, designed for distributed-memory systems such as PC clusters. An XMP program is executed in SPMD fashion; the programmer describes data distribution and work mapping with directives, and the inter-node communication is generated by the compiler. Figure 1 shows a small XMP Fortran example of data distribution.

!$xmp nodes n(4)
!$xmp template t(100)
!$xmp distribute t(block) onto n
      real g(80,100)
!$xmp align g(*,j) with t(j)
      real l(50,40)

Figure 1: XMP data distribution example. The 100 template elements are block-distributed over the 4 nodes, so the nodes own the sections g(80,1:25), g(80,26:50), g(80,51:75) and g(80,76:100), while every node keeps its own full copy of l(50,40).

In Figure 1 the nodes directive declares a node array n of four executing nodes, and the template directive declares a template t(100), a virtual index space used to describe the distribution. The distribute directive with the block format splits the 100 template elements over the four nodes into blocks of 25. The array g(80,100) is mapped onto the template with the align directive: g(*,j) with t(j) distributes the second dimension of g along t, so each node holds an 80 x 25 block of g. The array l(50,40), which has no align directive, is an ordinary local array; every node allocates its own 50 x 40 copy.

As the XMP compiler we used the Omni XcalableMP Compiler version 1.1 (build 1222) developed at RIKEN AICS. It is a source-to-source translator: the XMP directives are converted into calls to an MPI-based runtime library, and the translated program is then compiled and linked with an ordinary MPI compiler.
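
Work is mapped onto this distribution with the XMP loop directive; the benchmark code in Figure 3 uses it on the Himeno kernel. As a minimal sketch, assuming the declarations of Figure 1, a loop over the distributed array g can be parallelized as follows, with each node executing only the j indices it owns:

      ! Illustrative sketch (assumes the nodes/template/align declarations
      ! of Figure 1): each node updates only its own block of g, so no
      ! communication is needed for this update.
!$xmp loop (j) on t(j)
      do j = 1, 100
        do i = 1, 80
          g(i,j) = 2.0 * g(i,j)
        end do
      end do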

3. The Xeon Phi co-processor

3.1 Architecture

The Xeon Phi used in this work is a co-processor card attached to the host over PCI Express (Figure 2). It carries 61 general-purpose cores and 8 GB of GDDR5 memory; each core provides multiple hardware threads and wide SIMD units.

3.2 Programming the Phi

Unlike a GPGPU, the Phi runs its own operating system and executes ordinary compiled code, so existing programming models such as MPI and OpenMP can be used on the card itself.

3.3 Execution models

Two execution models are available. In off-load mode the program runs on the host and, as with a GPGPU, selected kernels are off-loaded to the Phi. In native mode the entire program is compiled for the Phi and runs directly on the card, where OpenMP and MPI can be used as on an ordinary compute node. In this work we run the benchmarks in native mode.

4. Evaluation

4.1 Evaluation environment

Each compute node is a Xeon server with one Phi card attached over PCI Express. The host has a Xeon CPU running at 2.6 GHz and 32 GB of 1600 MHz memory.

4.2 HIMENO benchmark

The HIMENO benchmark measures the floating-point performance of a Jacobi-type solver for a pressure Poisson equation on a structured grid. We use size L (512 x 256 x 256) for the single-node Phi measurements and size XL (1024 x 512 x 512) for the 16-node measurements. The XMP version was written from the Fortran90 + OpenMP version of HIMENO. Figure 3 shows the L-size XMP code with a one-dimensional distribution.

      module alloc1
      PARAMETER (mimax = 512, mjmax = 256, mkmax = 256)
      real p(mimax,mjmax,mkmax), a(mimax,mjmax,mkmax,4), b(mimax,mjmax,mkmax,3)
      real c(mimax,mjmax,mkmax,3), bnd(mimax,mjmax,mkmax)
      real wrk1(mimax,mjmax,mkmax), wrk2(mimax,mjmax,mkmax)
!$xmp nodes n(*)
!$xmp template t(mimax,mjmax,mkmax)
!$xmp distribute t(*,*,block) onto n
!$xmp align (*,*,k) with t(*,*,k) :: p, bnd, wrk1, wrk2
!$xmp align (*,*,k,*) with t(*,*,k) :: a, b, c
!$xmp shadow p(0,0,1)
!$xmp shadow a(0,0,1,0)
!$xmp shadow b(0,0,1,0)
!$xmp shadow c(0,0,1,0)
!$xmp shadow bnd(0,0,1)
!$xmp shadow wrk1(0,0,1)
!$xmp shadow wrk2(0,0,1)
      ... (omitted) ...
!$xmp reflect (p)
!$xmp loop (k) on t(*,*,k) reduction (+:GOSA)
      DO K = 2, kmax-1
        DO J = 2, jmax-1
          DO I = 2, imax-1
            S0 = a(i,j,k,1)*p(i+1,j,k) + a(i,j,k,2)*p(i,j+1,k) + ...

Figure 3: XMP HIMENO benchmark, size L, one-dimensional distribution (excerpt).

In Figure 3 the arrays are distributed only in the third (k) dimension: template t is distributed as (*,*,block), and the arrays are aligned with the k dimension of t. The shadow directives allocate one halo plane on each side of the local block in the k direction, the reflect directive updates those halo planes before the stencil sweep, and the loop directive maps the k loop onto the owning nodes while accumulating GOSA with a reduction clause.
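
For comparison, the shadow and reflect directives replace the halo exchange that must be written by hand in the MPI version. The following is a rough sketch, not the code generated by the Omni translator, of such an exchange for the one-dimensional k distribution; the neighbor ranks north and south and the local extent klocal are assumed names used only for illustration:

      ! Rough sketch only -- not the Omni-generated code.  north/south are
      ! the neighbor ranks in the k direction (MPI_PROC_NULL at the domain
      ! boundaries) and klocal is the local extent of the distributed k axis.
      subroutine reflect_p_by_hand(p, mimax, mjmax, klocal, north, south)
        use mpi
        implicit none
        integer :: mimax, mjmax, klocal, north, south, ierr
        integer :: status(MPI_STATUS_SIZE)
        real    :: p(mimax, mjmax, 0:klocal+1)   ! local block plus one shadow plane per side
        ! send the uppermost interior plane to north, receive the lower shadow from south
        call MPI_Sendrecv(p(1,1,klocal), mimax*mjmax, MPI_REAL, north, 0, &
                          p(1,1,0),      mimax*mjmax, MPI_REAL, south, 0, &
                          MPI_COMM_WORLD, status, ierr)
        ! send the lowermost interior plane to south, receive the upper shadow from north
        call MPI_Sendrecv(p(1,1,1),        mimax*mjmax, MPI_REAL, south, 1, &
                          p(1,1,klocal+1), mimax*mjmax, MPI_REAL, north, 1, &
                          MPI_COMM_WORLD, status, ierr)
      end subroutine reflect_p_by_hand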

4.3 Single-node results

Figure 4 shows the Himeno size L performance on a single node. Both the XMP and the MPI versions are Fortran90 codes compiled with mpiifort -O3, with AVX code generation for the Xeon host and -mmic for the Phi. The number of MPI processes (XMP nodes) was varied up to 32 on the Xeon host and up to 128 on the Phi.

Figure 4: Himeno size L performance in GFLOPS versus number of processes (1 to 128), for the XMP and MPI versions on the Phi and on the Xeon host (chart).

Two points can be read from Figure 4: the best Himeno performance on a single Phi is roughly twice that of the Xeon host, and the XMP version achieves essentially the same performance as the MPI version on both processors.

4.4 Multi-node measurement

For the multi-node measurements we use 16 Xeon servers connected by InfiniBand, each with one Phi card. With up to 128 XMP nodes (MPI processes) per Phi, the 16 servers give up to 2048 XMP nodes in total. The problem size is Himeno XL (1024 x 512 x 512), and the XMP code is changed to a two-dimensional distribution; the modified directives are shown in Figure 5.

!$xmp nodes n(4,*)
!$xmp template t(mimax,mjmax,mkmax)
!$xmp distribute t(*,block,block) onto n
!$xmp align (*,j,k) with t(*,j,k) :: p, bnd, wrk1, wrk2
!$xmp align (*,j,k,*) with t(*,j,k) :: a, b, c
!$xmp shadow p(0,1,1)
!$xmp shadow a(0,1,1,0)

Figure 5: XMP directives for the two-dimensional distribution used for size XL (excerpt).

In Figure 5 the node array n(4,*) is two-dimensional: the first extent is fixed at four and the second is determined at run time from the number of executing nodes, so a run with 2048 XMP nodes forms a 4 x 512 node grid. Template t is now block-distributed in both the j and the k dimensions, the arrays p, bnd, wrk1 and wrk2 (and likewise a, b, c) are aligned with both distributed dimensions, and the shadow width becomes one plane in each of the j and k directions.
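
Under this distribution, a run with a 4 x 512 node grid on the XL problem would give each XMP node a local slab of about 1024 x (512/4) x (512/512) = 1024 x 128 x 1 interior points plus its shadow planes. The sketch below is our illustration, assuming the declarations of Figures 3 and 5 and using a stand-in loop body rather than the actual Himeno kernel, of how a stencil sweep is mapped over both distributed dimensions:

      ! Illustrative sketch of a sweep mapped onto the (*,block,block)
      ! distribution of Figure 5; the body is a stand-in for the Himeno kernel.
!$xmp reflect (p)
!$xmp loop (j,k) on t(*,j,k) reduction (+:gosa)
      do k = 2, kmax-1
        do j = 2, jmax-1
          do i = 2, imax-1
            wrk2(i,j,k) = 0.25 * ( p(i,j-1,k) + p(i,j+1,k)   &
                                 + p(i,j,k-1) + p(i,j,k+1) )
            gosa = gosa + (wrk2(i,j,k) - p(i,j,k))**2
          end do
        end do
      end do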

4.5 Multi-node results

Figure 6 shows the XL-size results on the 16-node cluster for the Xeon hosts (MPI), the Phi cluster (MPI and XMP) and the hybrid version (HYBRD-P) described in Section 4.6. As before, the Fortran90 codes were compiled with mpiifort -O3, with AVX for the Xeon hosts and -mmic for the Phi. On the Xeon hosts 16 nodes x 16 processes = 256 MPI processes were used; on the Phi cluster the measurements cover 512, 1024 and 2048 processes. The Phi cluster clearly outperforms the Xeon hosts from 512 processes onward, and the XMP version again tracks the MPI version up to 2048 processes.

Figure 6: Himeno size XL performance in GFLOPS on 16 nodes; series MPI-Phi, XMP-Phi, HYBRD-P and Xeon (chart).

4.6 XMP + OpenMP hybrid

In the flat XMP and MPI runs above, one process is started for every core or hardware thread of the Phi. To reduce the number of processes we also evaluated an XMP + OpenMP hybrid version, in which XMP/MPI is used across nodes and OpenMP threads are used inside each node. Figure 7 shows an excerpt of the Fortran code translated by the Omni XcalableMP compiler for the main Himeno loop, parallelized with OpenMP.

!$OMP PARALLEL SHARED (kmax,jmax,imax,nn, &
!$OMP   t5,xmp_loop_lb9,xmp_loop_ub10,xmp_loop_step11,gosa) &
!$OMP   PRIVATE (k6,k8,j,i,s0,ss,gosa1,loop)
      DO loop = 1, nn, 1
# 133 "himenol.f90"
!$OMP BARRIER
!$OMP MASTER
      gosa = 0.0_8
      CALL xmpf_reflect_ ( XMP_DESC_p )
      CALL xmpf_ref_templ_alloc_ ( t5, XMP_DESC_t, 3 )
      CALL xmpf_ref_set_loop_info_ ( t5, 0, 2, 0 )
      CALL xmpf_ref_init_ ( t5 )
      XMP_loop_lb9 = 2
      XMP_loop_ub10 = kmax - 1
      XMP_loop_step11 = 1
      CALL xmpf_loop_sched_ ( XMP_loop_lb9, XMP_loop_ub10, XMP_loop_step11, 0, t5 )
!$OMP END MASTER
!$OMP BARRIER
      gosa1 = 0.0_8
!$OMP DO COLLAPSE(2)
      DO k6 = XMP_loop_lb9, XMP_loop_ub10, XMP_loop_step11
        DO j = 2, jmax - 1, 1
          DO i = 2, imax - 1, 1

Figure 7: Translated Fortran code for the main loop of the XMP HIMENO benchmark with OpenMP directives (excerpt).

In Figure 7 the whole iteration loop is enclosed in an OpenMP PARALLEL region. The calls into the XMP runtime, that is, the halo exchange xmpf_reflect_ and the loop-scheduling calls that compute the local bounds XMP_loop_lb9 and XMP_loop_ub10, are executed by a single thread inside a MASTER construct guarded by barriers, because there is one XMP node (MPI process) per compute process and the runtime must be entered once per process. The triple DO loop over the locally owned indices is then parallelized across the threads with an OMP DO COLLAPSE(2).

The hybrid version was run on the Phi cluster over InfiniBand with 128 OpenMP threads per node, so that 16 MPI processes x 128 threads = 2048 threads in total; Figure 6 (HYBRD-P) includes its measurements at 512, 1024 and 2048 threads. Its performance scales up to 2048 threads on the 16 nodes.

5. Summary

We evaluated the parallel programming language XcalableMP on the MIC (Xeon Phi) many-core processor using the HIMENO benchmark. On a single Phi and on a 16-node Phi cluster connected by InfiniBand, the XMP versions of the benchmark achieved the performance of the equivalent MPI versions, and the change from the one-dimensional to the two-dimensional distribution was expressed entirely with XMP directives such as nodes, template, distribute, align and shadow, without rewriting the computation itself. An XMP + OpenMP hybrid version scaled up to 2048 threads on the 16 nodes. These results suggest that XMP can be used effectively on MIC-based systems.

References
[1] XcalableMP: What's XcalableMP. http://www.xcalablemp.org
[2] Himeno Benchmark, RIKEN. http://accc.riken.jp/2145.htm