Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2


1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3, supervised by Serge PETITON 2. Maison de la Simulation, Lille 1 University, CNRS. March 18, 2013

2 / 23 Outline: 1 Background Introduction 2 Energy efficiency in Krylov iterative methods 3 Current Work 4 Future Work

3 / 23 Background Introduction Petascale and Exascale Supercomputing. High Performance Computing is shifting from the Petaflops (10^15) era to the Exaflops (10^18) era. Example: Titan at Oak Ridge National Laboratory. Speed: 17.59 PetaFLOPS (LINPACK). Power: 8.2 MW. Architecture: 18,688 AMD Opteron 6274 16-core CPUs and 18,688 Nvidia Tesla K20X GPUs. Rank: 1 in the Top500 of November 2012.

4 / 23 Background Introduction Problem of Energy Consumption. A modern supercomputer draws 4-6 megawatts: enough electricity to supply 5,000 homes. A potential exascale computer could draw 1.5 gigawatts: roughly the output of a nuclear power plant. The Green500 list evaluates HPC systems by FLOPS per watt. Energy efficiency involves the CPU, GPU, memory, disk, etc., and spans both hardware design and algorithms.

5 / 23 Background Introduction Energy efficiency in components of HPC. Hardware: processor, memory, disk, etc. Algorithmic: floating-point computation, data communication, etc. Ways to improve the energy efficiency of HPC on the algorithmic side: power-aware programming (e.g. Dynamic Voltage Scaling); Communication Avoiding (communication consumes a lot of energy); auto-tuning methods, parameter optimization, etc.

6 / 23 Outline: 1 Background Introduction 2 Energy efficiency in Krylov iterative methods 3 Current Work 4 Future Work

7 / 23 Energy efficiency in Krylov iterative methods Krylov iterative methods. Iterative methods are widely used for solving nonlinear equations and large-scale linear problems (on the order of millions of variables). Krylov subspace: $K_r(A, b) = \operatorname{span}\{b, Ab, A^2b, \ldots, A^{r-1}b\}$ (1). Examples: Conjugate Gradient, GMRES, etc.
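As a small illustration of Eq. (1) (our sketch, not from the slides): the basis is generated by repeated matrix-vector products, with a dense matvec standing in for the SpMV used in practice; real solvers such as GMRES orthogonalize these vectors as they go (Arnoldi).

```c
#include <stdlib.h>

/* w = A * v for a dense n x n matrix A stored row-major
   (a stand-in for the sparse SpMV used in practice) */
static void matvec(int n, const double *A, const double *v, double *w) {
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j) s += A[(size_t)i * n + j] * v[j];
        w[i] = s;
    }
}

/* fills K (r contiguous blocks of length n) with b, Ab, ..., A^{r-1} b */
void krylov_basis(int n, int r, const double *A, const double *b, double *K) {
    for (int i = 0; i < n; ++i) K[i] = b[i];
    for (int k = 1; k < r; ++k)
        matvec(n, A, K + (size_t)(k - 1) * n, K + (size_t)k * n);
}
```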

8 / 23 Energy efficiency in Krylov iterative methods Auto-tuning technology. Runtime optimization of the parameters of Krylov methods. Example: change the size r of the Krylov subspace $K_r(A, b)$. A smaller size means less time spent on orthogonalization but slower convergence; a larger size means more time on orthogonalization but faster convergence. The goal is to find the best size r dynamically to shorten the computation time; we will also use energy consumption as a criterion for the auto-tuning optimization. A sketch follows.
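A minimal sketch of such an auto-tuning loop (our illustration: the analytic cost_model below stands in for real time or energy measurements, and the names r_min, r_max, step are ours). It hill-climbs on r, moving toward whichever neighbouring subspace size is cheaper:

```c
#include <stdio.h>

/* Assumed stand-in for a measured cost per unit of residual reduction:
   orthogonalization grows like r^2 per restart cycle while the number
   of cycles shrinks roughly like 1/r, giving a U-shaped curve. */
static double cost_model(int r) {
    return 1e-3 * (double)r * r + 50.0 / r;
}

int main(void) {
    int r = 8, r_min = 4, r_max = 64, step = 4;
    double best = cost_model(r);
    for (;;) {
        int up   = (r + step <= r_max) ? r + step : r;
        int down = (r - step >= r_min) ? r - step : r;
        double cu = cost_model(up), cd = cost_model(down);
        if (cu < best)      { r = up;   best = cu; }
        else if (cd < best) { r = down; best = cd; }
        else break;                    /* local optimum: keep this r */
    }
    printf("chosen subspace size r = %d (model cost %.3f)\n", r, best);
    return 0;
}
```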

9 / 23 Energy efficiency in Krylov iterative methods Communication Avoiding. Constructing the Krylov subspace relies largely on sparse matrix-vector multiplication (SpMV), which is communication-intensive, especially for large-scale problems in parallel computing environments. Communication Avoiding is a family of algorithms that use redundant computation to reduce the data communicated, which shortens the total runtime and improves energy efficiency, but the gain depends on the structure of the matrix. Example: TSQR (Tall Skinny QR), a communication-avoiding factorization designed for dense matrices with many more rows than columns.

10 / 23 Outline: 1 Background Introduction 2 Energy efficiency in Krylov iterative methods 3 Current Work 4 Future Work

11 / 23 Current Work SpMV algorithms. Sparse matrix-vector multiplication (SpMV) is a basic component of Krylov subspace construction. Tasks: choose among different sparse matrix formats; evaluate communication-avoiding methods; analyze energy consumption. Experiments run on MdS's machine Poincare, a mixed CPU/GPU cluster (4 nodes): 2 Sandy Bridge E5-2670 processors per node, 64 GB of memory per node, 4 Tesla K10 GPUs (CUDA compute capability 3.0, 3.5 GB of memory) per node.

12 / 23 Current Work SpMV algorithms. Codes originally written by Maxime Hugues and modified to run on Poincare. The input sparse matrix has two sources: 1 generated structured sparse matrices, with either (a) a chosen number of contiguous diagonals above the main diagonal or (b) equidistributed diagonals; 2 unstructured sparse matrices from real industrial applications.

13 / 23 Current Work SpMV algorithms. The following steps are executed (see the sketch below): 1 Generate the sparse matrix $A = [a_{ij}]_{n \times n}$. 2 Compute $Y = A^r X$, iterating the product r times. 3 The m MPI processes divide A into m submatrices $B = [b_{ij}]_{(n/m) \times n}$. 4 Each MPI process solves its local SpMV on its own bound GPU (CUDA 5.0). 5 MPI communication assembles the final Y (OpenMPI 1.6.3).
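A sketch of steps 2-5 (assumed structure, not the original code; a dense CPU stand-in replaces the SpMV that the real code launches on each rank's bound GPU). Each of the m ranks owns an (n/m) x n row block B and a full copy of X; after the local product, an MPI_Allgather rebuilds the full Y, which feeds the next of the r iterations:

```c
#include <mpi.h>
#include <stdlib.h>

/* dense stand-in for the local row-block product y_loc = B * x */
static void spmv_local(int nloc, int n, const double *B,
                       const double *x, double *y_loc) {
    for (int i = 0; i < nloc; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j) s += B[(size_t)i * n + j] * x[j];
        y_loc[i] = s;
    }
}

/* computes x <- A^r x, with A row-partitioned across the communicator */
void power_iterations(int n, int r, const double *B, double *x) {
    int m;
    MPI_Comm_size(MPI_COMM_WORLD, &m);
    int nloc = n / m;                      /* rows owned by this rank */
    double *y_loc = malloc((size_t)nloc * sizeof(double));
    for (int it = 0; it < r; ++it) {
        spmv_local(nloc, n, B, x, y_loc);
        /* every rank gathers all row blocks to rebuild the full Y,
           which becomes X for the next iteration: O(n) words moved */
        MPI_Allgather(y_loc, nloc, MPI_DOUBLE,
                      x, nloc, MPI_DOUBLE, MPI_COMM_WORLD);
    }
    free(y_loc);
}
```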

14 / 23 Current Work Different sparse matrix formats. [Figure: Gflops test on a C-diagonal generated matrix; x-axis: number of MPI processes (2 to 16); y-axis: Gflops (0 to 16); one curve per format: CSR, CSC, Ell, Ell-col.] Generated structured sparse matrix with single precision. Dimension: rows = cols = 9,000,000. Contiguous diagonal elements (15 diagonals above the main diagonal). Diagonal values = 1 (without perturbation). Four different sparse matrix formats: CSR, CSC, Ellpack Row, Ellpack Col.
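For reference, a generic single-precision CSR SpMV kernel of the textbook kind (our sketch; the benchmarked kernels run on the GPU and are not reproduced here):

```c
/* y = A*x with A in the three standard CSR arrays; single precision
   to match the experiments above */
void spmv_csr(int n_rows,
              const int *row_ptr,   /* size n_rows+1: row start offsets */
              const int *col_idx,   /* size nnz: column of each value   */
              const float *val,     /* size nnz: nonzero values         */
              const float *x, float *y) {
    for (int i = 0; i < n_rows; ++i) {
        float s = 0.0f;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            s += val[k] * x[col_idx[k]];
        y[i] = s;
    }
}
```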

15 / 23 Current Work Different sparse matrix formats. All of the tested formats scale poorly (communication cost). CSR is outperformed by the others when m is small; the gap is large at first but narrows as m grows. Results depend on the sparse matrix structure and on the hardware environment.

16 / 23 Current Work Communication Avoiding implementation. Pre-compute the positions of the nonzero entries of Y: 1 record the column indices of B at which only zero entries exist; 2 these column indices correspond to the row indices of Y whose entries must be zero; 3 those rows of Y are excluded from communication. This reduces the data movement between MPI processes from O(n) to O(bnnz), where bnnz is the number of nonzero entries of Y. A sketch of the exchange follows.
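A sketch of what the reduced exchange could look like (assumed shape of the optimization, not the original code): each rank packs only the entries of its Y block at the precomputed nonzero row indices, together with their global indices, so only O(bnnz) words travel instead of O(n):

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* y_loc: this rank's block of Y (first global row = row0)
   nz: local indices of the bnnz_loc rows that may be nonzero
   counts/displs (per rank) and bnnz: precomputed once from the
   matrix structure; y_full: assembled length-n result */
void exchange_nonzero(int n, int row0, const int *nz, int bnnz_loc,
                      int *counts, int *displs, int bnnz,
                      const double *y_loc, double *y_full) {
    double *vloc = malloc((size_t)bnnz_loc * sizeof(double));
    int    *iloc = malloc((size_t)bnnz_loc * sizeof(int));
    double *vall = malloc((size_t)bnnz * sizeof(double));
    int    *iall = malloc((size_t)bnnz * sizeof(int));
    for (int k = 0; k < bnnz_loc; ++k) {   /* pack value + global row */
        vloc[k] = y_loc[nz[k]];
        iloc[k] = row0 + nz[k];
    }
    MPI_Allgatherv(vloc, bnnz_loc, MPI_DOUBLE,
                   vall, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
    MPI_Allgatherv(iloc, bnnz_loc, MPI_INT,
                   iall, counts, displs, MPI_INT, MPI_COMM_WORLD);
    memset(y_full, 0, (size_t)n * sizeof(double));  /* zero rows stay 0 */
    for (int k = 0; k < bnnz; ++k) y_full[iall[k]] = vall[k];
    free(vloc); free(iloc); free(vall); free(iall);
}
```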

17 / 23 Current Work Communication Avoiding implementation. [Figure: Gflops test on a C-diagonal generated matrix; x-axis: number of MPI processes (2 to 16); y-axis: Gflops (0 to 36); curves: CSR, CSR-avoid.] Generated structured sparse matrix with single precision. Dimension: rows = cols = 9,000,000. Contiguous diagonal elements (15 diagonals above the main diagonal). Diagonal values = 1 (without perturbation). Comparison between CSR and communication-avoiding CSR in FLOPS.

18 / 23 Current Work Communication Avoiding implementation. [Figure: MPI time percentage test on a C-diagonal generated matrix; x-axis: number of MPI processes (2 to 16); y-axis: MPI/total time (%) (0 to 36); curves: CSR, CSR-avoid.] Generated structured sparse matrix with single precision. Dimension: rows = cols = 9,000,000. Contiguous diagonal elements (15 diagonals above the main diagonal). Diagonal values = 1 (without perturbation). Comparison between CSR and communication-avoiding CSR in the percentage of MPI time within the total time.

19 / 23 Current Work Communication Avoiding implementation. The results show that CSR with communication avoiding has good scalability and a low proportion of communication time. For standard CSR, total time: $O\!\left(\frac{nnz}{m}\right) + \delta_t \cdot O\!\left(\frac{n(m-1)}{m}\right) + \mathrm{latency} \cdot O(m-1)$. For communication-avoiding CSR, total time: $O\!\left(\frac{nnz}{m}\right) + \delta_t \cdot O\!\left(\frac{bnnz(m-1)}{m}\right) + \mathrm{latency} \cdot O(m-1)$, where $bnnz \ll n$.

20 / 23 Outline: 1 Background Introduction 2 Energy efficiency in Krylov iterative methods 3 Current Work 4 Future Work

21 / 23 Future Work Energy evaluation on K20. Test on NVIDIA K20: Poincare has upgraded its GPUs to K20; redo the tests for comparison. Evaluate energy using the NVIDIA Management Library (NVML): nvmlDeviceGetTemperature, nvmlDeviceGetPerformanceState, ... (a power-sampling sketch follows). Add energy variation as a criterion to evaluate the various auto-tuning methods, e.g. change the Krylov subspace size and evaluate the variation in energy consumption.
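The slides name only the temperature and performance-state queries; the sketch below is our illustration of the kind of measurement planned, using nvmlDeviceGetPowerUsage (which reports milliwatts on Tesla-class boards such as the K20) and summing samples times the interval to approximate energy. Link with -lnvidia-ml:

```c
#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

int main(void) {
    nvmlDevice_t dev;
    unsigned int mw;
    double joules = 0.0;
    const double dt = 0.1;                       /* 100 ms sample period */
    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;
    for (int i = 0; i < 100; ++i) {              /* sample for ~10 s */
        if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS)
            joules += (mw / 1000.0) * dt;        /* W * s = J */
        usleep((useconds_t)(dt * 1e6));
    }
    printf("approx. energy over window: %.1f J\n", joules);
    nvmlShutdown();
    return 0;
}
```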

22 / 23 Future Work New ways for energy minimization. 1 Find more parameters for auto-tuning: parameters to control communication avoiding, ... 2 Machine learning for smart tuning: semantics in optimization; supervised learning from historical records. 3 Stochastic approaches to communication avoiding.

23 / 23 Future Work End. Thank you. Any questions?