1 / 23
Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption
Langshi CHEN 1,2,3
Supervised by Serge PETITON 2
Maison de la Simulation, Lille 1 University, CNRS
March 18, 2013
2 / 23
Outline
1 Background Introduction
2 Energy efficiency in Krylov iterative methods
3 Current Work
4 Future Work
3 / 23
Background Introduction
Petascale and Exascale Supercomputing
High Performance Computing is shifting from the Petaflops (10^15) era to the Exaflops (10^18) era.
Titan at Oak Ridge National Laboratory:
- Speed: 17.59 PetaFLOPS (LINPACK)
- Power: 8.2 MW
- Architecture: 18,688 AMD Opteron 6274 16-core CPUs; 18,688 Nvidia Tesla K20X GPUs
- Rank: 1 in the Top500, November 2012
4 / 23
Background Introduction
Problem of Energy Consumption
A modern supercomputer draws 4-6 megawatts: enough electricity to supply about 5,000 homes.
A potential exascale computer could draw 1.5 gigawatts: a nuclear-plant-sized power supply.
Green500 List: ranks HPC systems by FLOPS per watt.
Energy efficiency involves:
- CPU, GPU, memory, disk, etc.
- hardware design, algorithms, etc.
5 / 23
Background Introduction
Energy efficiency in components of HPC
Hardware: processor, memory, disk, etc.
Algorithmic: floating-point computation, data communication, etc.
Improving the energy efficiency of HPC on the algorithmic side:
- Power-aware programming (e.g. Dynamic Voltage Scaling)
- Communication avoiding (communication consumes a lot of energy)
- Auto-tuning methods, parameter optimization, etc.
7 / 23
Energy efficiency in Krylov iterative methods
Krylov iterative methods
Iterative methods are widely used for solving:
- nonlinear equations
- large-scale linear problems (on the order of millions of variables)
Krylov subspace:
K_r(A, b) = span{b, Ab, A^2 b, ..., A^{r-1} b}   (1)
Examples: Conjugate Gradient, GMRES, etc.
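As a concrete illustration of equation (1), here is a minimal Python sketch (plain lists, no libraries; the 2x2 matrix and vector are arbitrary toy data, not from the thesis) that builds the vectors spanning K_r(A, b):

```python
def matvec(A, x):
    """Dense matrix-vector product y = A x, with A as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def krylov_basis(A, b, r):
    """Return the r vectors b, Ab, ..., A^{r-1} b spanning K_r(A, b)."""
    basis = [b]
    for _ in range(r - 1):
        basis.append(matvec(A, basis[-1]))
    return basis

A = [[2.0, 1.0], [0.0, 3.0]]
b = [1.0, 1.0]
print(krylov_basis(A, b, 3))  # [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]]
```

In practice these raw power vectors become nearly parallel, which is why methods like GMRES orthogonalize them as they are generated.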
8 / 23
Energy efficiency in Krylov iterative methods
Auto-tuning technology
Runtime optimization of parameters in Krylov methods. Example: changing the size r of the Krylov subspace K_r(A, b):
- Smaller size: less time spent on orthogonalization, but slower convergence
- Larger size: more time spent on orthogonalization, but faster convergence
Find the best size r dynamically to shorten the computation time.
We will also use energy consumption as a criterion for auto-tuning optimization.
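The trade-off above can be sketched with a toy tuner. The cost model here is entirely hypothetical (iteration count assumed to shrink like 1/sqrt(r), per-iteration orthogonalization cost growing linearly in r, arbitrary constants); it only illustrates the shape of the search, not the thesis method:

```python
import math

def modeled_time(r, work=600.0, t_spmv=1.0, t_orth=0.02):
    """Hypothetical runtime model for a restarted Krylov method with
    subspace size r: the total iteration count is assumed to shrink like
    1/sqrt(r), while each iteration pays an orthogonalization cost that
    grows linearly with r."""
    iters = work / math.sqrt(r)
    return iters * (t_spmv + t_orth * r)

def tune_subspace_size(candidates):
    """Return the candidate r with the smallest modeled time; a runtime
    auto-tuner would do the same with measured times (or energy)."""
    return min(candidates, key=modeled_time)

best = tune_subspace_size([5, 10, 20, 40, 80])
```

Replacing `modeled_time` with a measured energy-per-solve reading turns the same loop into an energy-driven tuner, which is the criterion proposed on this slide.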
9 / 23
Energy efficiency in Krylov iterative methods
Communication Avoiding
Constructing the Krylov subspace requires many sparse matrix-vector multiplications (SpMV), which are heavily communication-bound, especially for:
- large-scale problems
- parallel computing environments
Communication avoiding is a family of algorithms that use redundant computation to reduce the data communicated, and thus:
- shorten the total elapsed time
- improve energy efficiency
but their effectiveness depends on the matrix structure, as with TSQR (Tall Skinny QR), a communication-avoiding method specialized for dense matrices with many more rows than columns.
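TSQR is one communication-avoiding kernel; another, closer to the SpMV setting of this work, is the matrix-powers idea. The sketch below is illustrative only (a 1D 3-point stencil stands in for a banded sparse matrix; this is not the thesis code): a process that owns x[lo:hi] and fetches s "ghost" entries on each side ONCE can compute s stencil applications with no further communication, trading a few redundant flops for s-1 message rounds.

```python
def stencil(x):
    """One application of the 3-point stencil y_i = x_{i-1} + x_i + x_{i+1}
    with zero boundary values (a stand-in for a banded SpMV)."""
    n = len(x)
    return [(x[i-1] if i > 0 else 0.0) + x[i] +
            (x[i+1] if i < n - 1 else 0.0) for i in range(n)]

def local_powers(x, lo, hi, s):
    """Compute (A^s x)[lo:hi] from a single ghost exchange of width s:
    fetch s extra entries on each side once, then apply the stencil s
    times locally, letting the valid region shrink by one per step."""
    n = len(x)
    glo, ghi = max(0, lo - s), min(n, hi + s)   # the one-time exchange
    block = x[glo:ghi]
    for _ in range(s):
        new_glo = glo if glo == 0 else glo + 1  # ghosted edges lose one
        new_ghi = ghi if ghi == n else ghi - 1  # valid entry per step
        block = [(block[g - glo - 1] if g > 0 else 0.0) + block[g - glo] +
                 (block[g - glo + 1] if g < n - 1 else 0.0)
                 for g in range(new_glo, new_ghi)]
        glo, ghi = new_glo, new_ghi
    return block[lo - glo:hi - glo]
```

Checking against two global stencil applications confirms the locally computed slice is exact; the price is the redundant work done on the ghost region.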
11 / 23
Current Work
SpMV algorithms
Sparse matrix-vector multiplication (SpMV) is a basic building block of Krylov subspace construction.
- Choose among different sparse matrix formats
- Evaluate communication-avoiding methods
- Analyze energy consumption
Experiments on the Maison de la Simulation machine Poincare, a mixed CPU/GPU cluster (4 nodes):
- 2 Sandy Bridge E5-2670 processors per node
- 64 GB of memory per node
- 4 Tesla K10 GPUs (CUDA capability 3.0, 3.5 GB of memory) per node
12 / 23
Current Work
SpMV algorithms
Codes originally written by Maxime Hugues and modified to run on Poincare. The input sparse matrices come from two sources:
1 Generated structured sparse matrices:
  1 a chosen number of continuous diagonals above the main diagonal
  2 equidistributed diagonals
2 Unstructured sparse matrices from real industrial applications.
13 / 23
Current Work
SpMV algorithms
The following steps are executed:
1 Generate a sparse matrix A = [a_ij]_{n x n}
2 Compute Y = A^r X over r iterations
3 m MPI processes divide A into m submatrices B = [b_ij]_{n/m x n}
4 Each MPI process solves its sub-SpMV on its own bound GPU (CUDA 5.0)
5 MPI communication forms the final Y (OpenMPI 1.6.3)
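The row-block decomposition in steps 3-5 can be sketched sequentially in Python (the real code uses C with MPI and CUDA; here the m "processes" are just loop iterations, and the input vector is fully replicated for simplicity):

```python
def csr_spmv(vals, cols, rowptr, x):
    """y = A x for a matrix stored in CSR form."""
    return [sum(vals[k] * x[cols[k]] for k in range(rowptr[i], rowptr[i + 1]))
            for i in range(len(rowptr) - 1)]

def partitioned_spmv(vals, cols, rowptr, x, m):
    """Row-block SpMV: each of m 'processes' owns a slice of consecutive
    rows, multiplies locally, and the partial results are concatenated
    as the MPI communication in step 5 would do."""
    n = len(rowptr) - 1
    step = n // m
    y = []
    for p in range(m):
        lo = p * step
        hi = (p + 1) * step if p < m - 1 else n
        base = rowptr[lo]
        # local CSR slice for rows lo..hi (column indices stay global)
        sub_rowptr = [rp - base for rp in rowptr[lo:hi + 1]]
        y.extend(csr_spmv(vals[base:rowptr[hi]], cols[base:rowptr[hi]],
                          sub_rowptr, x))
    return y
```

In the real distributed setting x is also partitioned, so each process must fetch the x entries its block references; that exchange is the communication cost targeted by the avoiding scheme on the following slides.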
14 / 23
Current Work
Different sparse matrix formats
Figure: Gflops for SpMV on a C-diagonal generated matrix (y-axis: Gflops, 0-16; x-axis: number of MPI processes, 2-16), comparing four sparse matrix formats: CSR, CSC, Ellpack Row, Ellpack Col.
Generated structured sparse matrix with single precision; dimension row = col = 9,000,000; continuous diagonal elements (15 diagonals above the main diagonal); diagonal values = 1 (without perturbation).
15 / 23
Current Work
Different sparse matrix formats
All four formats scale poorly (communication cost dominates).
CSR is outperformed by the others: the gap is large when m is small and narrows as m grows.
The results depend on:
- the sparse matrix structure
- the hardware environment
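The formats compared above store the same nonzeros differently. As a small illustrative sketch (toy data, not the benchmark matrices): ELLPACK pads every row to the length of the longest row, which wastes space on irregular matrices but gives the regular memory access pattern that suits GPUs, whereas CSR stores only the nonzeros.

```python
def to_ellpack(dense, pad_val=0.0, pad_col=0):
    """Convert a dense row-list matrix to ELLPACK (vals, cols), padding
    every row to the maximum number of nonzeros per row so all rows
    have identical length."""
    rows = [[(v, j) for j, v in enumerate(r) if v != 0.0] for r in dense]
    width = max(len(r) for r in rows)          # longest row sets the width
    vals = [[v for v, _ in r] + [pad_val] * (width - len(r)) for r in rows]
    cols = [[j for _, j in r] + [pad_col] * (width - len(r)) for r in rows]
    return vals, cols
```

For the banded test matrices on this slide (a fixed number of diagonals), every row has nearly the same nonzero count, so ELLPACK padding costs almost nothing, one plausible reason the Ellpack variants compete well here.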
16 / 23
Current Work
Communication Avoiding implementation
Precompute the positions of the nonzero entries of Y:
1 Record the column indices of B that contain only zero entries
2 These column indices correspond to the row indices of Y whose entries must be zero
3 These rows of Y are excluded from communication
This reduces the data movement between MPI processes from O(n) to O(bnnz), where bnnz is the number of nonzero entries in Y.
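One plausible reading of the precomputation above can be sketched as follows (illustrative only, not the thesis code): a global column with no nonzero in the local CSR block contributes nothing to the local product, so the matching vector entries never need to be exchanged, and the exchange shrinks from O(n) entries to only the structurally needed ones.

```python
def needed_columns(cols):
    """Distinct global column indices the local CSR block references;
    only these entries of the shared vector must be exchanged."""
    return sorted(set(cols))

def comm_savings(cols, n):
    """(entries exchanged with the optimization, naive O(n) volume)."""
    return len(set(cols)), n

# toy block referencing 3 of 10 global columns: exchange 3 entries, not 10
print(comm_savings([0, 3, 3, 7], 10))  # (3, 10)
```

The index sets can be computed once, before the iteration loop, since the sparsity pattern of the generated matrices does not change between the r SpMV steps.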
17 / 23
Current Work
Communication Avoiding implementation
Figure: Gflops for SpMV on a C-diagonal generated matrix (y-axis: Gflops, 0-36; x-axis: number of MPI processes, 2-16), comparing CSR and CSR with communication avoiding.
Generated structured sparse matrix with single precision; dimension row = col = 9,000,000; continuous diagonal elements (15 diagonals above the main diagonal); diagonal values = 1 (without perturbation).
18 / 23
Current Work
Communication Avoiding implementation
Figure: MPI time as a percentage of total time on a C-diagonal generated matrix (y-axis: MPI/total time (%), 0-36; x-axis: number of MPI processes, 2-16), comparing CSR and CSR with communication avoiding.
Generated structured sparse matrix with single precision; dimension row = col = 9,000,000; continuous diagonal elements (15 diagonals above the main diagonal); diagonal values = 1 (without perturbation).
19 / 23
Current Work
Communication Avoiding implementation
The results show that CSR with communication avoiding:
- has good scalability
- has a low proportion of communication time
For standard CSR, total time:
delta * O(nnz/m) + t * O(n(m-1)/m) + latency * O(m-1)
For CSR with communication avoiding, total time:
delta * O(nnz/m) + t * O(bnnz(m-1)/m) + latency * O(m-1), with bnnz << n
(delta: time per floating-point operation; t: time per transferred word)
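Plugging illustrative numbers into the cost model above shows why bnnz << n makes the avoiding variant scale better; every constant and problem size below is hypothetical (chosen only to be plausible for a 9,000,000-row banded matrix), not a measured value:

```python
def total_time(nnz, comm_entries, m, delta=1e-9, t=1e-8, latency=1e-6):
    """Modeled time = delta*O(nnz/m) flops + t*O(comm*(m-1)/m) words
    + latency*O(m-1) messages, mirroring the formula on this slide."""
    return ((nnz / m) * delta
            + (comm_entries * (m - 1) / m) * t
            + (m - 1) * latency)

# hypothetical sizes: n rows, ~16 nonzeros/row, a small nonzero result set
n, nnz, bnnz, m = 9_000_000, 144_000_000, 30_000, 16
standard = total_time(nnz, n, m)      # exchanges O(n) entries
avoiding = total_time(nnz, bnnz, m)   # exchanges only O(bnnz) entries
```

With these numbers the flop term is identical for both variants; only the word-transfer term shrinks, which matches the measured drop in MPI time share on the previous slide.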
21 / 23
Future Work
Energy evaluation on K20
Tests on the NVIDIA K20:
- Poincare has upgraded its GPUs to K20
- Redo the tests for comparison
Energy evaluation via the NVIDIA Management Library (NVML):
- nvmlDeviceGetTemperature
- nvmlDeviceGetPerformanceState
- ...
Add energy variation as a criterion for evaluating the various auto-tuning methods, e.g. change the Krylov subspace size and measure the variation in energy consumption.
22 / 23
Future Work
New ways to minimize energy
1 Find more parameters for auto-tuning
  - parameters controlling communication avoiding
  - ...
2 Machine learning for smart tuning
  - semantics in optimization
  - supervised learning from historical records
3 Stochastic approaches to communication avoiding
23 / 23
Future Work
End
Thank you. Any questions?