Benchmarking program performance evaluation of Parallel programming language XcalableMP on Many core processor



MITSURU IKEI (1, 2), MASAHIRO NAKAO (2), MITSUHISA SATO (2)

(1) Intel K.K.
(2) University of Tsukuba

We evaluated the parallel programming language XcalableMP (XMP) by running a standard performance benchmark on a cluster equipped with Xeon Phi many-core coprocessors. We used the Himeno benchmark at size L for measurements on a single Phi node, and at size XL on 16 Xeon compute servers, each with one Phi card. The results confirm that the XcalableMP version of the program reaches the performance of the MPI version of the same benchmark, both on a single Phi node and on the 16-node Phi cluster. We also tested an XcalableMP + OpenMP hybrid version and confirmed that its performance scales up to 2048 threads on 16 compute nodes.

1.
[Body text lost in extraction. Surviving terms: MIC (Many Integrated Core), Intel Xeon Phi, PCI Express, GDDR5, 60, GPGPU (General Purpose Graphics Processing Unit). The section cites 1) for XcalableMP (XMP) and 2) for the Himeno benchmark.]

2. XcalableMP
[Body text lost in extraction. Surviving terms: XMP, OpenMP, C, Fortran, PC, XcalableMP Fortran, SPMD.]

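Figure 1: XMP data distribution example.

    !$xmp nodes n(4)
    !$xmp template t(100)
    !$xmp distribute t(block) onto n
    real g(80,100)
    !$xmp align g(*,j) with t(j)
    real l(50,40)

Layout shown in Figure 1: the global array g(80,100) is split across the four nodes as (80,1:25), (80,26:50), (80,51:75) and (80,76:100), and every node holds its own copy of l(50,40).

In XMP Fortran, the nodes directive declares the node array n. The template directive declares the template t(100), and the distribute directive distributes t onto n in block fashion. The align directive ties the array g(80,100) to the template: g(*,j) with t(j) aligns the second dimension of g with t, so g is divided along that dimension in the same way as t. The array l(50,40) has no align directive, so each node allocates it as an ordinary local array.

The XMP compiler used here is the Omni XcalableMP Compiler version 1.1 (build 1222) from AICS. It is a source-to-source compiler: it translates XMP directives into calls to a runtime library that is built on MPI.

3. Xeon Phi

3.1
[Body text lost in extraction. Surviving terms: Xeon Phi, PCI Express, CPU, 61, GDDR5, 8.]

3.2
[Body text lost in extraction. Surviving terms: Phi, GPGPU.]

3.3
[Body text lost in extraction. Surviving terms: Phi, 2.]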
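The block distribution in the directive example above (template t(100) over the node array n(4), with the second dimension of g(80,100) aligned to it) comes down to simple index arithmetic. The following is a minimal plain-Python sketch of that arithmetic, not XMP code: the function names `block_range` and `g_local_shape` are hypothetical, and the sketch assumes the usual block rule of a chunk width of ceil(size / nodes) with node indices counted from 1.

```python
def block_range(size, nodes, rank):
    """Return the inclusive 1-based (lower, upper) index range owned by `rank`.

    `rank` counts from 1, like the node indices in the directive example.
    Assumption: the chunk width is ceil(size / nodes), so the last node may
    own fewer indices when `size` is not a multiple of `nodes`.
    """
    chunk = -(-size // nodes)  # ceiling division
    lower = (rank - 1) * chunk + 1
    upper = min(rank * chunk, size)
    return lower, upper


def g_local_shape(rows, cols, nodes, rank):
    """Shape of the part of g(rows, cols) stored on one node when only the
    second dimension is aligned with the block-distributed template."""
    lower, upper = block_range(cols, nodes, rank)
    return rows, max(upper - lower + 1, 0)


if __name__ == "__main__":
    # template t(100) distributed over nodes n(4); g is 80 x 100
    for rank in range(1, 5):
        lo, hi = block_range(100, 4, rank)
        shape = g_local_shape(80, 100, 4, rank)
        print(f"n({rank}) owns g(1:80, {lo}:{hi}), local shape {shape}")
```

With 100 template elements on 4 nodes, each node owns 25 consecutive columns of g, while an unaligned array such as l(50,40) is simply replicated.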

[Continuation of 3.3; body text lost in extraction. Surviving terms: off-load, native, GPGPU, OpenMP, Phi, PCI Express.]

[Evaluation environment; surviving values: Xeon, CPU 2.6 GHz, 1600 MHz, 32 GB, GFLOPS, 5.2 GFLOPS.]

4.2 HIMENO
[Body text lost in extraction. Surviving terms: single Phi with size L (512x256x, third extent missing here; Figure 3 sets mimax = 512, mjmax = 256, mkmax = 256); Phi cluster with size XL (1024x512x512); XMP HIMENO; Fortran90 OMP.]

Figure 3: XMP version of HIMENO (excerpt).

    module alloc1
      PARAMETER (mimax = 512, mjmax = 256, mkmax = 256)
      real p(mimax,mjmax,mkmax), a(mimax,mjmax,mkmax,4), b(mimax,mjmax,mkmax,3)
      real c(mimax,mjmax,mkmax,3), bnd(mimax,mjmax,mkmax)
      real wrk1(mimax,mjmax,mkmax), wrk2(mimax,mjmax,mkmax)
    !$xmp nodes n(*)
    !$xmp template t(mimax,mjmax,mkmax)
    !$xmp distribute t(*,*,block) onto n
    !$xmp align (*,*,k) with t(*,*,k) :: p, bnd, wrk1, wrk2
    !$xmp align (*,*,k,*) with t(*,*,k) :: a, b, c
    !$xmp shadow p(0,0,1)
    !$xmp shadow a(0,0,1,0)
    !$xmp shadow b(0,0,1,0)
    !$xmp shadow c(0,0,1,0)
    !$xmp shadow bnd(0,0,1)
    !$xmp shadow wrk1(0,0,1)
    !$xmp shadow wrk2(0,0,1)
    ...
    !$xmp reflect (p)
    !$xmp loop (K) on t(*,*,k) reduction (+:GOSA)
      DO K = 2, kmax-1
        DO J = 2, jmax-1
          DO I = 2, imax-1
            S0 = a(i,j,k,1)*p(i+1,j,k)+a(i,j,k,2)*p(i,j+1,k)

4.3
[Single Phi node, size L. Body text lost in extraction. Surviving terms: Xeon, 4, GFLOPS, Phi, XMP-Phi, Fortran90 MPI version, compiled with mpiifort -O3 (AVX for Xeon, -mmic for Phi), MPI, 32.]
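The shadow and reflect directives in the listing above can be pictured with a one-dimensional model. In the listing, the arrays are block-distributed along k with a shadow width of 1, and reflect (p) runs before the stencil loop. The sketch below is plain Python rather than XMP, the names `split_with_shadow` and `reflect` are hypothetical, and it assumes an even split and that reflect copies each neighbour's edge cells into the local shadow cells.

```python
def split_with_shadow(values, nodes, width=1):
    """Split `values` into `nodes` equal blocks, each padded with `width`
    shadow cells (initialised to None) on both sides."""
    chunk = len(values) // nodes
    assert chunk * nodes == len(values), "sketch assumes an even split"
    pad = [None] * width
    return [pad + values[r * chunk:(r + 1) * chunk] + pad for r in range(nodes)]


def reflect(blocks, width=1):
    """Fill each block's shadow cells from the owned edge cells of its
    neighbours. Only shadow cells are written and only owned cells are read,
    so updating in place is safe. The outermost shadows stay None."""
    for r, block in enumerate(blocks):
        if r > 0:
            left = blocks[r - 1]
            block[:width] = left[-2 * width:-width]  # left neighbour's last owned cells
        if r < len(blocks) - 1:
            right = blocks[r + 1]
            block[-width:] = right[width:2 * width]  # right neighbour's first owned cells
    return blocks


if __name__ == "__main__":
    blocks = reflect(split_with_shadow(list(range(1, 9)), nodes=2))
    for rank, block in enumerate(blocks, start=1):
        print(f"node {rank}: {block}")
```

After the exchange, a stencil that reads the k-1 and k+1 neighbours of an owned cell finds those values locally, which is why the reflect precedes the distributed loop.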

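[Continuation of 4.3; body text lost in extraction. Three observations on Figure 4 survive only as fragments: (1) Phi, Himeno, 2, Xeon; (2) Phi, 1, 2; (3) XMP, MPI.]

[Phi cluster measurement; body text lost in extraction. Surviving terms: InfiniBand, 16 Xeon nodes with 1 Phi each, Himeno XL (1024x512x512), XMP.]

Figure 5: XMP directives for the two-dimensional distribution.

    !$xmp nodes n(4,*)
    !$xmp template t(mimax,mjmax,mkmax)
    !$xmp distribute t(*,block,block) onto n
    !$xmp align (*,j,k) with t(*,j,k) :: p, bnd, wrk1, wrk2
    !$xmp align (*,j,k,*) with t(*,j,k) :: a, b, c
    !$xmp shadow p(0,1,1)
    !$xmp shadow a(0,1,1,0)

[Body text lost in extraction. Surviving terms: two-dimensional node array n(4,*); template t distributed in 2 dimensions; 4; align of p, bnd, wrk1; XL; 2048=4x (truncated in the source); Phi; XMP; 16 Xeon.]

[Figure 6; body text lost in extraction. Surviving terms: Fortran90 MPI version; mpiifort -O3 -AVX for Xeon and -O3 -mmic for Phi; MPI; x16; Phi; 256=128x2; XL. Three observations survive only as fragments: (1) 4, Phi, Xeon, 64; (2) Phi, MPI; (3) XMP, MPI.]

[Hybrid version; body text lost in extraction. Surviving terms: XMP, MPI, OpenMP, Omni XcalableMP, Figure 7, XMP Fortran; result labels MPI-Phi, Xeon, XMP-Phi, HYBRD-P.]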
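The two-dimensional distribution above (nodes n(4,*) with distribute t(*,block,block)) can be checked with a small plain-Python sketch that maps a node number to the j and k ranges it owns. The function names are hypothetical. The sketch assumes mjmax = mkmax = 512 for size XL (by analogy with the size L parameters), node numbers counted from 1, the first node dimension varying fastest, and a block width of ceil(size / nodes).

```python
def block_range(size, nodes, rank):
    """Inclusive 1-based (lower, upper) range owned by `rank` (counted from 1),
    assuming a block width of ceil(size / nodes)."""
    chunk = -(-size // nodes)  # ceiling division
    return (rank - 1) * chunk + 1, min(rank * chunk, size)


def local_extent(total_nodes, node_id, mjmax=512, mkmax=512, first_dim=4):
    """Owned (j_range, k_range) for one node of an n(first_dim, *) node array.

    Assumptions: `node_id` counts from 1 and the first node dimension varies
    fastest (Fortran ordering); the j dimension is split over the first node
    dimension and the k dimension over the second.
    """
    if total_nodes % first_dim != 0:
        raise ValueError("total_nodes must be a multiple of first_dim")
    second_dim = total_nodes // first_dim
    pj = (node_id - 1) % first_dim + 1
    pk = (node_id - 1) // first_dim + 1
    return block_range(mjmax, first_dim, pj), block_range(mkmax, second_dim, pk)


if __name__ == "__main__":
    for total in (64, 256, 2048):
        j_range, k_range = local_extent(total, 1)
        print(f"{total} nodes as n(4,{total // 4}): node 1 owns j={j_range}, k={k_range}")
```

Under these assumptions, 2048 nodes arranged as n(4,512) leave a single k plane per node, whereas a one-dimensional split along k alone could use at most 512 nodes.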

Figure 7: Code generated by the XMP compiler for the hybrid version.

    !$OMP PARALLEL SHARED (kmax,jmax,imax,nn, &
    !$OMP t5,xmp_loop_lb9,xmp_loop_ub10,xmp_loop_step11,gosa) &
    !$OMP PRIVATE (k6,k8,j,i,s0,ss,gosa1,loop)
    DO loop = 1, nn, 1
    # 133 "himenol.f90"
    !$omp BARRIER
    !$OMP MASTER
    gosa = 0.0_8
    CALL xmpf_reflect_ ( XMP_DESC_p )
    CALL xmpf_ref_templ_alloc_ ( t5, XMP_DESC_t, 3 )
    CALL xmpf_ref_set_loop_info_ ( t5, 0, 2, 0 )
    CALL xmpf_ref_init_ ( t5 )
    XMP_loop_lb9 = 2
    XMP_loop_ub10 = kmax - 1
    XMP_loop_step11 = 1
    CALL xmpf_loop_sched_ ( XMP_loop_lb9, XMP_loop_ub10, XMP_loop_step11, 0, t5 )
    !$OMP END MASTER
    !$OMP BARRIER
    gosa1 = 0.0_8
    !$OMP DO COLLAPSE(2)
    DO k6 = XMP_loop_lb9, XMP_loop_ub10, XMP_loop_step11
    DO j = 2, jmax - 1, 1
    DO i = 2, imax - 1, 1

[Body text lost in extraction. Surviving terms: !$OMP, OpenMP PARALLEL region, MPI, XMP, MASTER, OpenMP, 3, DO loop, Phi, XMP + OpenMP, 512, 1024, MPI, 2048 = 128x16, Phi, InfiniBand, DO, template, XMP, MIC, HIMENO.]

5.
[Body text lost in extraction. Surviving terms: Phi, MIC, XMP, template.]

References
1) XcalableMP: What's XcalableMP
2) [Entry text lost in extraction; cited in Section 1 for the Himeno benchmark.]
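A note on the !$OMP DO COLLAPSE(2) in the generated hybrid code: collapsing the k and j loops enlarges the iteration space that OpenMP divides among threads. The plain-Python sketch below models this with an even static split. The function names are hypothetical and the plane and row counts in the example are illustrative numbers, not measurements from the paper; only the figure of 128 threads per node is taken from the 2048 = 128x16 configuration.

```python
def static_chunks(iterations, threads):
    """Iterations per thread under an even static split
    (a simple model: chunk sizes differ by at most one)."""
    base, extra = divmod(iterations, threads)
    return [base + (1 if t < extra else 0) for t in range(threads)]


def busy_threads(k_planes, j_rows, threads, collapse):
    """Number of threads that receive at least one iteration.

    Without collapse only the outer k loop is divided; with collapse the
    k and j loops form one iteration space of k_planes * j_rows iterations.
    """
    iterations = k_planes * j_rows if collapse else k_planes
    return sum(1 for n in static_chunks(iterations, threads) if n > 0)


if __name__ == "__main__":
    # Illustrative assumption: 32 local k planes, 510 j rows, 128 threads per node
    for collapse in (False, True):
        busy = busy_threads(32, 510, 128, collapse)
        print(f"collapse={collapse}: {busy} of 128 threads have work")
```

When a node owns fewer k planes than it has threads, dividing only the outer loop leaves threads idle, while the collapsed k-j space gives every thread some work.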
