Open-Source Parallel FE Software : FrontISTR -- Performance Considerations about B/F (Byte per Flop) of SpMV on K-Supercomputer and GPU-Clusters --

1 Parallel Processing for Energy Efficiency
October 3, 2013, NTNU, Trondheim, Norway
Open-Source Parallel FE Software: FrontISTR -- Performance Considerations about B/F (Byte per Flop) of SpMV on K-Supercomputer and GPU-Clusters --
Hiroshi Okuda, The University of Tokyo, okuda@k.u-tokyo.ac.jp
Olav Aanes Fagerlund, Serban Georgescu (Fujitsu Lab. Europe)

2 Outline
- Background
- CoDesign towards post-peta scale supercomputers from the viewpoint of CAE applications
- FEM, SpMV
- Overview of FrontISTR
- Nonlinear structural analysis
- Industrial applications
- Inter-node Parallelism
- Intra-node Parallelism
- Performance Estimation
- Summary

3 FEM, Iterative Solvers and SpMV (1/2)
- FEM: one of the most important CAE applications
- Continuous media (PDE) are discretized into a system of linear equations [A^(n)]{x^(n)} = {b^(n)}, where n is the increment step
- Features of the matrix:
  - Large scale, DOF O(10^10): use of iterative solvers
  - Non-zero density (%) O(10^-2): stored in a compressed manner
- [Figure: mesh partitioning and domain decomposition (partitioning tool)]

4 FEM, Iterative Solvers and SpMV (2/2)
- Matrices are sparse and stored in a compressed manner.
- SpMV (sparse matrix-vector product) is a hotspot for iterative equation solvers.
- [Figure: sparsity pattern, with an enlargement of rows 1-100]
- Model hinges1: number of nodes 84,056; number of non-zeros 19,043, [remaining digits lost in extraction]; density of non-zeros [value lost] %
- [Figure: breakdown of CPU time in CG operations: SpMV, DOT, AXPY & AYPX]
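As a minimal sketch of the SpMV kernel named on this slide, the Python function below multiplies a CSR matrix by a vector. The array names value, colindx and rowptr follow the labels on slide 12; the function name spmv_csr and the 3x3 test matrix are illustrative assumptions and are not taken from FrontISTR.

```python
def spmv_csr(rowptr, colindx, value, x):
    """Return y = A x for a matrix A stored in CSR form.

    rowptr[i]..rowptr[i+1] delimits the non-zeros of row i,
    colindx[k] is the column of the k-th non-zero, value[k] its value.
    """
    n = len(rowptr) - 1
    y = [0.0] * n
    for i in range(n):                           # outer loop over rows
        acc = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):  # inner loop over non-zeros
            acc += value[k] * x[colindx[k]]      # x is read through an index (indirect access)
        y[i] = acc                               # y is written once per row
    return y


# Hypothetical 3x3 matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
rowptr = [0, 2, 3, 5]
colindx = [0, 2, 1, 0, 2]
value = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(rowptr, colindx, value, x)
```

Each non-zero contributes one multiply and one add, while the matrix entry, its column index and the indirectly addressed entry of x all have to be fetched, which is why the kernel tends to be limited by memory traffic rather than by arithmetic.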

5 Trends in Parallel Architecture and Parallel Programming Strategies (1/2)
- Inter-node parallelism (via network)
  - Point of concern: memory distribution over the network (InfiniBand, Ethernet, Myrinet)
- Intra-node parallelism, points of concern:
  - CPU: number of cores O(1); programmability good; memory size O(100) GB; memory throughput O(10) GB/s
  - GPU: number of cores O(100); memory size O(1) GB; memory throughput O(100) GB/s
- MSU: large and slow; L1~L3: small and fast
- Between CPU and GPU: PCIe, O(1)

6 Trends in Parallel Architecture and Parallel Programming Strategies (2/2)
- Overall parallel efficiency: E1 x E2
- Inter-node parallelism (via network)
  - Parallel efficiency: E1
  - Programming model: MPI
  - Strategy: high work ratio (localized mesh)
  - Scalability: weak scale
- Intra-node parallelism
  - Parallel efficiency: E2
  - Programming model: MPI, Thread, OpenMP, OpenCL, OpenACC
  - Strategy: appropriate B/F & long vector (blocking, padding, reordering)
  - Scalability: strong scale

7 FrontISTR
- [Figure: mesh size = 0.1 mm]

8 FrontISTR built on HEC-MW
- FrontISTR: nonlinear analysis functions
  - Static linear
  - Dynamic linear
  - Eigen mode
  - Material nonlinear: hyper-elasticity / thermal elastic-plastic / visco-elastic / creep, combined hardening rule
  - Geometrical nonlinear: total / updated Lagrangian
  - Boundary nonlinear (assembly structure): finite slip contact, friction
  - etc.
- [Diagram: FEM applications developed on PC call HEC-MW through interfaces (I/F) for I/O, matrix assembly, linear solvers, coupler and visualization; HEC-MW kernels are shown for Linux PC clusters, Windows PC clusters, PC clusters on GRID, vector and SMP machines; from MPP to PC]
- Advanced features of parallel FEM
  - Hierarchical mesh refinement
  - Assembly structure
  - Up to O(10^5) nodes
  - Portability
  - CAE cloud
- Nonlinear structural analysis functions have been deployed on a parallel FEM basis: HEC-MW.

9 Large-grain Parallelism: Parallelization based on domain decomposition
- Mesh partitioning and domain decomposition (partitioning tool)
- [Diagram: four subdomains, each with its own local data, FEM code and solver subsystem; the solver subsystems are connected via MPI]

10 Strong Scale with Refiner
- Static linear analysis of machine part
- 2nd order tetra element
- PCG (eps=10^-6)
- FX10@UT: SPARC64 IXfx (1.848 GHz), 1 CPU (16 cores)/node

11 [Figure: Total FE comp.]
Acknowledgements: Research Organization for Information Science and Technology, RIKEN AICS

12 Acceleration of SpMV Computation
- Parallelization: rows are distributed among threads.
- Load balancing: reallocate rows to balance loads.
- Blocking: matrix format is crucial.
  - CSR (Compressed Sparse Row): B/F = 6.25~12.5
  - BCSR (Blocked CSR): B/F = 4.76~5.56
- [Figure: CSR arrays value, colindx and rowptr; assignment of rows to threads before and after balancing]
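The load-balancing step above can be sketched as follows: instead of giving every thread the same number of rows, contiguous row ranges are cut so that each thread receives roughly the same number of non-zeros. This is one common way to do it and is not necessarily the scheme FrontISTR uses; the function name balance_rows and the example row lengths are assumptions made for illustration.

```python
from bisect import bisect_left


def balance_rows(rowptr, nthreads):
    """Split rows into contiguous chunks with roughly equal non-zero counts.

    Returns bounds such that thread t owns rows bounds[t] .. bounds[t+1]-1.
    """
    n = len(rowptr) - 1
    nnz = rowptr[-1]
    bounds = [0]
    for t in range(1, nthreads):
        target = nnz * t / nthreads
        # first row boundary whose cumulative non-zero count reaches the target
        r = bisect_left(rowptr, target)
        r = min(max(r, bounds[-1]), n)  # keep the boundaries monotone and in range
        bounds.append(r)
    bounds.append(n)
    return bounds


# Hypothetical matrix with 8 rows holding 8, 1, 1, 1, 1, 1, 1, 2 non-zeros
rowptr = [0, 8, 9, 10, 11, 12, 13, 14, 16]

# Equal row counts: rows 0-3 and rows 4-7
even_loads = [rowptr[4] - rowptr[0], rowptr[8] - rowptr[4]]

# Equal non-zero counts
bounds = balance_rows(rowptr, 2)
balanced_loads = [rowptr[bounds[t + 1]] - rowptr[bounds[t]] for t in range(2)]
```

In this example the equal-rows split gives the two threads 11 and 5 non-zeros, while the balanced split gives each thread 8.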

13 [Code figure: outer and inner loops annotated with directives for SECTOR CACHE]
- X: referred only
- Y: substituted only
- Indirect access: cache does NOT work
- Continuous access: cache works

14 Example: hinge
- 252,168 DOFs
- 2,115,968 non-zeros
- Density of non-zeros: 0.03%
- Distribute non-zeros among threads: simple cyclic; simple block cyclic
- [Figure: number of non-zeros per row vs. row number]
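The two distribution schemes named on this slide can be illustrated with a small sketch. It assumes the usual definitions (cyclic: row i goes to thread i mod T; block cyclic: blocks of b consecutive rows are dealt out in turn); the slide does not spell these out, and the function names and sizes below are hypothetical.

```python
def cyclic_owner(row, nthreads):
    """Simple cyclic: consecutive rows go to consecutive threads."""
    return row % nthreads


def block_cyclic_owner(row, nthreads, block):
    """Simple block cyclic: blocks of `block` consecutive rows are dealt out in turn."""
    return (row // block) % nthreads


rows = range(8)
cyclic = [cyclic_owner(r, 2) for r in rows]
block_cyclic = [block_cyclic_owner(r, 2, 2) for r in rows]
```

With 8 rows and 2 threads, the cyclic scheme alternates owners row by row, while the block-cyclic scheme with block size 2 alternates owners every two rows, so each thread keeps short runs of contiguous rows.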

15 Tuning parameters were selected empirically for a hinge example.

16 Hybrid Parallel Computation on K
- Number of CPUs: 8,192; number of cores: 65,536
- FE mesh: 2,476,141,184 elements; 2,513,793,437 nodes
- Results by parallelization:
  - FlatMPI: CPU 13.7h; work ratio 82.6%; to-peak 4.2%
  - Hybrid: CPU 21.7h; work ratio 50.3%; to-peak -
- [Figures: z-displacement; Mises stress]

17 Hardware test-bed
- CPU: Nehalem, Core2 Quad, Core2 Duo
- GPU: GeForce 280GTX, GeForce 295GTX, GeForce 8800GTX, GeForce 9800GTX+

19 Actual problems
- FEM 1: V6 engine
- FEM 2: Motorcycle frame
- CFD: 3D Poisson

20 Test matrices, SpMV
- [Figure: GPU performance (GFlop/s) in double and quasi-double precision]
- [Figure: Intel MKL 11 performance (GFlop/s) in double precision, 1-4 cores]

21 Test matrices, solvers
- [Figures: GMRES speedup for GPU (times) and CG speedup for GPU (times), each for GPU vs Nehalem and GPU vs Core2Quad]

22 FEM test case 1: V6 engine
- Order: 204,000 x 204,000
- Non-zeros: 15,621,894
- Requires approx. 200 MB to solve
- Condition number (after Jacobi): O(10^6)
- CPU = Nehalem, GPU = 280GTX
- Results (+ ILU / + Jacobi):
  - FrontSTR on CPU: 29.1s (732 itrs.) / 34.9s (1,839 itrs.)
  - FrontSTR + CUKr on GPU (double precision): - / 4.3s (1,830 itrs.)
  - FrontSTR + CUKr on GPU (iterative refinement): - / 3.5s (8 : 2,884 itrs.)
- CUKr: CUDA Krylov (S. Georgescu and H. Okuda, 2009)
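The "iterative refinement" rows refer to a mixed-precision scheme: an inexpensive low-precision inner solve is wrapped in an outer loop that recomputes the residual in double precision, and a count such as "8 : 2,884" presumably gives outer and inner iterations. The sketch below only illustrates that structure and is not the CUKr implementation. As stated assumptions, single precision is simulated by rounding through struct, a plain Jacobi iteration stands in for the Krylov inner solver, and the 3x3 system and all function names are hypothetical.

```python
import struct


def to_single(v):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def jacobi_single(A, r, iters=30):
    """Approximate A d = r; every update is rounded to single precision (simulated)."""
    n = len(r)
    d = [0.0] * n
    for _ in range(iters):
        d = [
            to_single((r[i] - sum(A[i][j] * d[j] for j in range(n) if j != i)) / A[i][i])
            for i in range(n)
        ]
    return d


def refine(A, b, tol=1e-12, max_outer=20):
    """Outer loop in double precision, inner solve in (simulated) single precision."""
    x = [0.0] * len(b)
    bnorm = max(abs(v) for v in b)
    for outer in range(1, max_outer + 1):
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]  # residual in double
        if max(abs(v) for v in r) <= tol * bnorm:
            return x, outer - 1                           # number of corrections applied
        d = jacobi_single(A, r)                           # low-precision correction
        x = [xi + di for xi, di in zip(x, d)]             # update in double
    return x, max_outer


# Hypothetical diagonally dominant system; exact solution is (5/28, 2/7, 19/28)
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x, n_outer = refine(A, b)
```

The inner solve alone cannot do better than single-precision accuracy, but because each outer step solves for the current double-precision residual, the accumulated solution reaches double-precision accuracy after a few corrections.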

23 FEM test case 2: Motorcycle frame
- Order: 1,552,170 x 1,552,170
- Non-zeros: 109,522,566
- Requires approx. 1.3 GB to solve
- Condition number (after Jacobi): O(10^7)
- 10x speedup on an iteration-by-iteration basis, but the solver could not converge for lack of a good preconditioner.

24 CFD test case: 3D Poisson with N=250 / dim
- Order: 15,625,000 x 15,625,000
- Non-zeros: 109,249,498
- Requires approx. 2.2 GB
- Condition number: O(10^5)
- CPU = Nehalem, GPU = 3x280GTX
- Results, Time (s) and Performance (GFlop/s) [values lost in extraction]:
  - CUKr on CPU (1,139 it)
  - CUKr on GPU (double precision) (1,139 it)
  - CUKr on GPU (iterative refinement) (10 : 2021 it)
- CUKr: CUDA Krylov (S. Georgescu and H. Okuda, 2009)

25 Case: AYPX on NVIDIA GTX 560Ti (and AMD Radeon HD 7970)
- [Figure: performance (MFlop/s) vs. vector size for CLBLAS and CCBLAS, each in SINGLE, QDOUBLE and DOUBLE precision]
- Performance of kernels written in both OpenCL (CL) and CUDA C (CC) is shown, for all precisions.
- Running the same kernels on the AMD Radeon HD 7970 gives a performance well below 1 GFLOP/s, for all vector sizes and with adapted Global and Local sizes.
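For reference, AYPX is the BLAS-style update y <- a*y + x. The sketch below shows the operation and a simple count of its memory traffic per flop, assuming every operand is streamed from memory exactly once with no cache reuse and no extra write-allocate traffic. The function names are illustrative and are not part of CLBLAS or CCBLAS.

```python
def aypx(a, x, y):
    """In-place update y <- a*y + x."""
    for i in range(len(y)):
        y[i] = a * y[i] + x[i]
    return y


def aypx_bytes_per_flop(word_bytes):
    """Bytes moved per flop under the streaming assumption stated above."""
    bytes_moved = 3 * word_bytes  # load x[i], load y[i], store y[i]
    flops = 2                     # one multiply and one add per element
    return bytes_moved / flops


result = aypx(2.0, [1.0, 1.0], [3.0, 4.0])
bf_double = aypx_bytes_per_flop(8)
bf_single = aypx_bytes_per_flop(4)
```

Under these assumptions the kernel needs 12 bytes per flop in double precision and 6 in single, so its speed is set almost entirely by the memory subsystem.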

26 Sustained Performance Model (1/2)
- The K computer's roofline model.
- Sustained performance can be predicted w.r.t. the application's Flop/Byte ratio.
- [Figure: roofline plot; 8 / 128, estimated to-peak 6.25%]

27 Performance Model (2/2)
- B/F of FISTR:
  - SpMV with CSR: B/F = 6.25~12.5
  - SpMV with BCSR: B/F = 4.76~5.56
- Machines:
  - K: node performance 128 Gflops; BW (catalog) 64 GB/s; BW (STREAM) 46.6 GB/s; B/F 0.36
  - FX: node performance [value lost] Gflops; BW (catalog) 85 GB/s; BW (STREAM) 64 GB/s; B/F 0.27
- To-peak:
  - K: SpMV with CSR 2.9~5.8 %; SpMV with BCSR 4.9~7.6 %
  - FX: SpMV with CSR 2.2~4.3 %; SpMV with BCSR 3.7~5.7 %
- Measured performance by profiler on FX: [value lost] %
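The CSR to-peak figures on this slide are consistent with a simple memory-bound estimate: the attainable fraction of peak is the machine's B/F divided by the kernel's B/F, capped at 1. The sketch below works through that arithmetic. The function names are assumptions, and attributing the first pair of to-peak rows to K and the second to FX is an inference from this arithmetic rather than something the flattened slide states.

```python
def to_peak(machine_bf, kernel_bf):
    """Fraction of peak flops a memory-bound kernel can sustain.

    machine_bf: bytes the node can deliver per peak flop (B/F of the machine)
    kernel_bf:  bytes the kernel needs per flop it executes (B/F of the kernel)
    """
    return min(1.0, machine_bf / kernel_bf)


def required_machine_bf(target_fraction, kernel_bf):
    """Machine B/F needed to reach target_fraction of peak for a given kernel."""
    return target_fraction * kernel_bf


machines = {"K": 0.36, "FX": 0.27}  # machine B/F values quoted on the slide
csr_bf = (6.25, 12.5)               # kernel B/F range for SpMV with CSR

# Percent of peak for CSR: (worst case, best case) per machine
csr_to_peak = {
    name: (
        round(100 * to_peak(bf, csr_bf[1]), 1),
        round(100 * to_peak(bf, csr_bf[0]), 1),
    )
    for name, bf in machines.items()
}

# Machine B/F needed for 50% of peak with BCSR (kernel B/F = 4.76~5.56)
bcsr_need = [required_machine_bf(0.5, bf) for bf in (4.76, 5.56)]
```

The same relation applied in reverse gives a required machine B/F of roughly 2.4 to 2.8 for SpMV with BCSR to run at half of peak, which is in line with a target of about 2.5.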

28 Summary (1/2)
- Hybrid parallel strategy of FrontISTR (FE structural analysis code) is presented.
  - MPI for inter-node distributed mesh
  - OpenMP for intra-node loop decomposition with blocking, padding and re-ordering
- Widely used by industries, ISVs and researchers
- Memory centric with consistent CPU and low power consumption design is crucial for CAE applications (memory wall)
- To attain 50% of peak performance, B/F of such machines should be ~2.5

29 Summary (2/2)
- How does Nvidia's optimized code perform on AMD hardware?
  - An illustration of how OpenCL code is portable while its performance is not necessarily so.
  - Memory-bound kernels are very sensitive to changes in memory subsystems, i.e., unmodified code brought to a different platform can give very bad performance.
- Features of FEM using iterative solvers
  - Short vector
  - Small loop body
  - Indirect access
