Available online at www.prace-ri.eu

Partnership for Advanced Computing in Europe

A hybrid Hermitian general eigenvalue solver

Raffaele Solcà*, Thomas C. Schulthess
Institute for Theoretical Physics ETHZ, Swiss National Supercomputing Centre (ETHZ/CSCS)

Azzam Haidar, Stanimire Tomov, Ichitaro Yamazaki, Jack Dongarra
Innovative Computing Laboratory, University of Tennessee

Abstract

The adoption of hybrid GPU-CPU nodes in traditional supercomputing platforms opens acceleration opportunities for electronic structure calculations in materials science and chemistry applications, where medium-sized Hermitian generalized eigenvalue problems must be solved many times. The small size of the problems limits their scalability on distributed-memory systems, so they can benefit from the massive computational performance concentrated on a single node of a hybrid GPU-CPU system. However, this requires new algorithms that efficiently exploit the heterogeneity and massive parallelism not just of GPUs, but of multi/many-core CPUs as well. Addressing these demands, we implemented a novel Hermitian general eigensolver algorithm. It reduces the generalized problem to a standard eigenvalue problem, for which existing solvers can be reused. The resulting eigensolvers are state-of-the-art in HPC, significantly outperforming existing libraries. We analyze their performance impact on applications of interest when different fractions of the eigenvectors are needed by the host electronic structure code.

Project ID: #283493

1. Introduction

Scientific computing applications, ranging from computing the frequencies that will propagate through a medium, to the earthquake response of a bridge, or the energy levels of electrons in nanostructured materials, require the solution of eigenvalue problems. There are many ways to formulate these problems mathematically and to solve them numerically [1]. In this work we are interested in dense eigensolvers, and in particular in generalized Hermitian-definite problems of the form Ax = λBx, where A is a dense Hermitian matrix and B is Hermitian positive definite. These solvers are needed for solving electronic structure problems in materials science and chemistry [2], [3], [4]. In particular, problems with modest matrix dimensions, from a few thousand to ten or twenty thousand, pose a challenge for most practical purposes. Typically these problems must be solved many times in the context of a parallel code. The power limits of current processors, which prevent the clock frequency from increasing further, leave us with the challenge of solving the eigenvalue problem on nodes with a large number of threads executing on slower cores.

To implement a general eigensolver, several subroutines are needed. Although the routines of interest are available in a number of standard libraries, recent hardware changes have motivated their redesign (see Section 2) to achieve a more efficient use of the underlying hardware. In particular, hybrid architectures based on GPUs and multicore CPUs call for the development of new algorithms that can efficiently exploit the heterogeneity and the massive parallelism of not just the GPUs, but of the multi/many-core CPUs as well.

* Corresponding author. E-mail address: rasolca@itp.phys.ethz.ch.

Hybrid GPU-CPU nodes are already widely adopted in traditional supercomputing platforms such as the Cray XK6. This has opened many new acceleration opportunities in electronic structure calculations for materials science and chemistry applications, where generalized eigenvalue problems for medium-sized Hermitian matrices must be solved many times; these problems can benefit from the massive computational performance concentrated on a single node of a hybrid GPU-CPU system.

2. Related Work

The main standard libraries that include eigensolver routines are LAPACK [5] and ScaLAPACK [6]. LAPACK is a linear algebra library for dense and banded matrices on shared-memory systems, while ScaLAPACK is its extension to distributed-memory systems. LAPACK is based on calls to the BLAS (Basic Linear Algebra Subprograms) interface. Vendors usually provide well-tuned LAPACK and ScaLAPACK versions based on optimized BLAS implementations.

Work specifically on eigenvalue problems has concentrated on accelerating separate components of the solvers, in particular the reduction of the matrix to tridiagonal form, which is the most time-consuming task. The standard approach from LAPACK [7] is a single-phase (also referred to as "one-stage") reduction. Alternatives are two- or more-stage approaches, where the matrix is first reduced to band form and subsequently to its final tridiagonal form. One of the first uses of a two-step reduction occurred in the context of out-of-core solvers for generalized symmetric eigenvalue problems [8]. With this approach, it was possible to recast the expensive memory-bound operations that occur during the panel factorization into a compute-bound procedure. Subsequently, a framework called Successive Band Reduction (SBR) was created [9], integrating some of the multi-stage work. The SBR toolbox applies two-sided orthogonal transformations based on Householder reflectors and successively reduces the matrix bandwidth until a suitable width is reached. The off-diagonal elements are then annihilated column-wise, which produces large fill-in blocks, or bulges, that need to be chased down towards the bottom-right corner of the matrix. The bulge-chasing procedure may result in a substantial increase in the floating-point operation count compared with the standard single-phase approach from LAPACK. If eigenvectors are required in addition to eigenvalues, the transformations may be efficiently accumulated using level 3 BLAS operations to generate them. SBR relies heavily on multithreaded BLAS optimized to achieve satisfactory parallel performance. However, such a parallelization paradigm incurs substantial overheads [10], as it follows the Bulk Synchronous Parallel (BSP) model [11]. Communication bounds for such two-sided reductions have been established under the communication-avoiding framework [12].

Tile algorithms have recently seen rekindled interest when applied to two-stage tridiagonal reductions [10]. Using high-performance kernels combined with a data translation layer to execute on top of the tile data layout format, these implementations achieve a substantial improvement compared to the equivalent routines from the state-of-the-art numerical libraries.
In these implementations, the off-diagonal elements are annihilated column-wise during the bulge-chasing procedure, which incurs significant extra flops due to the size of the bulges introduced. An element-wise annihilation based on cache-aware kernels has since been implemented in the context of the two-stage tridiagonal reduction [13]. Using the task-coalescing technique of [13], the performance achieved is far greater than that of any other available implementation.

With the emergence of high-bandwidth, highly efficient devices such as GPUs, it became possible to accelerate memory-bound and compute-bound operations by orders of magnitude. Tomov et al. [14] presented novel hybrid reduction algorithms for the two-sided factorizations, which take advantage of the high bandwidth of the GPU by offloading the expensive level 2 BLAS operations of the panel factorization to the GPU. The opposite was done by Bientinesi et al. [15], who accelerated the two-stage approach of the SBR toolbox by offloading the compute-bound kernels of the first stage to the GPU; the computation of the second stage (reduction to tridiagonal form), however, remains on the host. Vömel et al. [16] extended the tridiagonalization approach of [14] to the symmetric standard eigenvalue problem.

A recent distributed-memory eigensolver library, developed for electronic structure codes, is ELPA [3]. ELPA is based on ScaLAPACK and does not support GPUs. It includes new one-stage and two-stage tridiagonalizations, the corresponding eigenvector transformations, and a modified divide-and-conquer routine that can compute the entire eigenspace or just a part of it.

3. The general eigensolver

The steps to solve the general eigenproblem are described in Algorithm 1. The implementation combines the following routines:
- the MAGMA Cholesky decomposition,
- the MAGMA standard eigensolvers (Section 4),
- the CUBLAS triangular solver,
- a hybrid implementation of the xhegst routine, which transforms the general problem into a standard one.
With these routines it is possible to build the general eigensolver. The last routine has been implemented following the LAPACK algorithm, but with all the level 3 BLAS operations executed on the GPU.

Algorithm 1. Hermitian general eigenvalue solver.
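The algorithm box itself is not reproduced in this transcription. As an illustration only, the following minimal NumPy/SciPy sketch shows the mathematical steps that Algorithm 1 combines, with CPU routines standing in for the MAGMA Cholesky factorization, the hybrid xhegst-style transform, the MAGMA standard eigensolver, and the CUBLAS triangular solver; it is not the MAGMA implementation.

```python
import numpy as np
from scipy.linalg import solve_triangular

def generalized_hermitian_eig(A, B):
    """Solve A x = lambda B x for Hermitian A and Hermitian positive definite B.

    CPU sketch of the steps of Algorithm 1; in the hybrid solver each step is
    a MAGMA/CUBLAS routine running (partly) on the GPU.
    """
    # Step 1: Cholesky factorization B = L L^H (MAGMA Cholesky in the hybrid code).
    L = np.linalg.cholesky(B)

    # Step 2: transform to a standard problem C = L^{-1} A L^{-H} (xhegst-like step).
    tmp = solve_triangular(L, A, lower=True)            # L^{-1} A
    C = solve_triangular(L, tmp.conj().T, lower=True)   # L^{-1} (L^{-1} A)^H = L^{-1} A L^{-H}
    C = 0.5 * (C + C.conj().T)                          # enforce Hermitian symmetry numerically

    # Step 3: standard Hermitian eigensolver (Section 4).
    w, Y = np.linalg.eigh(C)

    # Step 4: back-transform the eigenvectors, X = L^{-H} Y (CUBLAS triangular solve).
    X = solve_triangular(L.conj().T, Y, lower=False)
    return w, X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    A = M + M.conj().T                                  # Hermitian test matrix
    N = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    B = N @ N.conj().T + n * np.eye(n)                  # Hermitian positive definite

    w, X = generalized_hermitian_eig(A, B)
    # Residual of A X = B X diag(w); should be close to machine precision.
    print(np.linalg.norm(A @ X - B @ X @ np.diag(w)) / np.linalg.norm(A))
```

For a quick cross-check, scipy.linalg.eigh(A, B) solves the same generalized problem in a single call.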

4. The standard eigensolver

The solution of the standard eigenproblem is described in Algorithm 2.

Algorithm 2. Hermitian standard eigensolver. n represents the number of stages in the tridiagonalization.

4.1. One-stage standard eigensolver

The one-stage tridiagonalization is the approach used in the LAPACK eigensolvers. It was already implemented in MAGMA [16]. It has been improved by including the hybrid tridiagonal eigensolver (Section 4.3) as well as the possibility to compute only a fraction of the eigenspace. The performance of this tridiagonalization is limited, since 50% of its operations are level 2 BLAS operations.

4.2. Two-stage standard eigensolver

To overcome the limited performance of the tridiagonalization step, a two-stage approach has been used. In the first stage the matrix is reduced to a band matrix, while in the second stage the band matrix is reduced to a tridiagonal one using a bulge-chasing technique. The algorithms are described in [13]. However, the reduction to band form is implemented using block techniques instead of tile techniques, since they are better suited to GPU operations. The algorithm is described in Figure 1. For the second stage we use a bulge-chasing algorithm very similar to [13]; we differ by using a column-wise elimination, which gives better locality when computing or applying the orthogonal matrix resulting from this phase. The drawback of this approach lies in the eigenvector back-transformation: since the tridiagonalization is performed in two stages, the back-transformation also requires two steps.

Figure 1. Reduction to band algorithm. The operation in (a) (CPUs) can be overlapped with (c) (GPU).

4.3. Tridiagonal eigensolver

The standard divide-and-conquer algorithm computes all the eigenvalues and eigenvectors of a real tridiagonal matrix. A detailed description of the algorithm can be found in [25]. The algorithm recursively splits the problem into two smaller problems and then merges their solutions as follows: the multicore CPU is used to compute the eigenvalues and eigenvectors of the rank-one modification, while the GPU updates the eigenvectors with a matrix-matrix multiplication. During the last merge, one can compute only the needed eigenvectors, reducing the time to solution.
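To make the merge step concrete, here is a small NumPy sketch (not the MAGMA code) of a single divide-and-conquer merge: the tridiagonal matrix is split into two halves plus a rank-one correction, the two subproblems are solved, and the merged eigenvectors are obtained by a large matrix-matrix multiplication, which is the operation the hybrid solver offloads to the GPU. For brevity the rank-one modified problem is solved with a dense eigensolver instead of the secular-equation solver with deflation used in the real algorithm.

```python
import numpy as np

def dc_merge(T):
    """One divide-and-conquer merge for a symmetric tridiagonal matrix T.

    Illustrative sketch only: no recursion, no deflation, and the rank-one
    modified problem is solved densely. The final GEMM (Q_blk @ S) is the
    step executed on the GPU in the hybrid solver.
    """
    n = T.shape[0]
    m = n // 2
    beta = T[m, m - 1]                     # coupling element removed by the split

    # Split: T = diag(T1, T2) + beta * v v^T with v = e_m + e_{m+1}.
    T1 = T[:m, :m].copy()
    T2 = T[m:, m:].copy()
    T1[-1, -1] -= beta
    T2[0, 0] -= beta

    # Solve the two half-size subproblems (done recursively in the real algorithm).
    d1, Q1 = np.linalg.eigh(T1)
    d2, Q2 = np.linalg.eigh(T2)

    # Rank-one modified problem D + beta * z z^T, solved on the multicore CPU.
    D = np.concatenate([d1, d2])
    z = np.concatenate([Q1[-1, :], Q2[0, :]])
    lam, S = np.linalg.eigh(np.diag(D) + beta * np.outer(z, z))

    # Eigenvector update: one large matrix-matrix multiplication (GPU in MAGMA).
    Q_blk = np.zeros((n, n))
    Q_blk[:m, :m] = Q1
    Q_blk[m:, m:] = Q2
    Q = Q_blk @ S
    return lam, Q

if __name__ == "__main__":
    n = 200
    d = np.random.rand(n)
    e = np.random.rand(n - 1)
    T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
    lam, Q = dc_merge(T)
    print(np.linalg.norm(T @ Q - Q @ np.diag(lam)))   # residual, ~1e-13
```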

5. Results

Figure 2. Time to solution of the double complex general eigensolver with matrix size 8000. On the left the whole eigenspace is computed, on the right only 10% of the eigenvectors.

Figure 2 shows the results of the hybrid algorithms compared with a shared-memory library (MKL) and a distributed-memory library (ELPA, with two different algorithms). To have a fair comparison on a system with two 6-core Intel Xeon 5650 CPUs and one Nvidia M2090 GPU, we compare our hybrid routines, run with six threads on one CPU socket plus the GPU, against non-GPU routines run on both CPU sockets: 12 threads for the shared-memory routines and 12 processes for the distributed-memory routines. The one-stage and two-stage approaches perform similarly when the complete eigenspace is computed, while the two-stage approach shows better performance when only a fraction of the eigenspace is needed. In every case the hybrid algorithms outperform the CPU-only algorithms by a factor of more than 2.

6. MultiGPU support

We extended the dense symmetric generalized one-stage eigensolver of MAGMA to utilize multiple GPUs. The data on the GPUs is distributed in a 1D block-cyclic way. Figure 3 shows the time spent in each step of the algorithm on up to eight Nvidia M2090 GPUs and two 6-core Intel Xeon X5660 CPUs.

Figure 3. Time to solution for different numbers of GPUs of the double complex general eigensolver with matrix size 15000, computing the whole eigenspace.
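The report gives no further layout details; purely as an illustration of what a 1D block-cyclic distribution means, the sketch below assigns block columns of a matrix to GPUs in round-robin fashion. The block size and GPU count are hypothetical parameters chosen for the example.

```python
def block_cyclic_1d(n, nb, num_gpus):
    """Map each block column of an n-column matrix (block size nb) to a GPU.

    Sketch of a 1D block-cyclic layout: block column j is owned by GPU
    j mod num_gpus, so consecutive block columns land on different GPUs
    and the work per GPU stays balanced.
    """
    owners = {}
    for j, start in enumerate(range(0, n, nb)):
        cols = list(range(start, min(start + nb, n)))
        owners.setdefault(j % num_gpus, []).append(cols)
    return owners

if __name__ == "__main__":
    # 16 columns, block size 4, distributed over 3 GPUs.
    for gpu, blocks in sorted(block_cyclic_1d(16, 4, 3).items()):
        print(f"GPU {gpu}: column blocks {blocks}")
    # GPU 0: [[0, 1, 2, 3], [12, 13, 14, 15]]
    # GPU 1: [[4, 5, 6, 7]]
    # GPU 2: [[8, 9, 10, 11]]
```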

7. Conclusion

In this report we presented two different approaches to solving an eigenvalue problem while benefiting from the computational power of a GPU. Both approaches, running on one CPU socket plus a GPU, show a significant speedup with respect to shared-memory and distributed-memory approaches using two CPU sockets. The hybrid algorithms show more than a 2x speedup with respect to the CPU-only libraries and can be used to reduce the time needed for electronic structure simulations, since the general eigensolver is the most time-consuming part of these codes. Moreover, the extension of the routines to multiple GPUs allows a further reduction of the time to solution, as well as the handling of larger problems.

Acknowledgements

This work was financially supported by the PRACE project, funded in part by the EU's 7th Framework Programme (FP7/2007-2013) under grant agreements no. RI-211528 and FP7-261557. The work was partially carried out using the PRACE Research Infrastructure resources at the Swiss National Supercomputing Centre (CSCS) in Switzerland. The authors would also like to thank the National Science Foundation, the Department of Energy, NVIDIA, and MathWorks for supporting this research effort.

References

[1] J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst, Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, Z. Bai, Ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2000.
[2] P. Kent, "Computational challenges of large-scale, long-time, first-principles molecular dynamics," Journal of Physics: Conference Series, vol. 125, no. 1, p. 012058, 2008. [Online]. Available: http://stacks.iop.org/1742-6596/125/i=1/a=012058
[3] T. Auckenthaler, V. Blum, H. J. Bungartz, T. Huckle, R. Johanni, L. Krämer, B. Lang, H. Lederer, and P. R. Willems, "Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations," Parallel Comput., vol. 37, no. 12, pp. 783-794, Dec. 2011. [Online]. Available: http://dx.doi.org/10.1016/j.parco.2011.05.002
[4] D. J. Singh, Planewaves, Pseudopotentials, and the LAPW Method. Boston: Kluwer, 1994.
[5] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, 3rd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1999.
[6] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley, ScaLAPACK Users' Guide. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1997.
[7] E. Anderson, Z. Bai, C. Bischof, S. L. Blackford, J. W. Demmel, J. J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. C. Sorensen, LAPACK Users' Guide, 3rd ed. Philadelphia: Society for Industrial and Applied Mathematics, 1999.
[8] R. G. Grimes and H. D. Simon, "Solution of large, dense symmetric generalized eigenvalue problems using secondary storage," ACM Transactions on Mathematical Software, vol. 14, pp. 241-256, September 1988. [Online]. Available: http://doi.acm.org/10.1145/44128.44130
[9] C. H. Bischof, B. Lang, and X. Sun, "Algorithm 807: The SBR Toolbox - software for successive band reduction," ACM Transactions on Mathematical Software, vol. 26, no. 4, pp. 602-616, 2000.
[10] P. Luszczek, H. Ltaief, and J. Dongarra, "Two-stage tridiagonal reduction for dense symmetric matrices using tile algorithms on multicore architectures," in IPDPS 2011: IEEE International Parallel and Distributed Processing Symposium, Anchorage, Alaska, USA, May 16-20, 2011.
[11] L. G. Valiant, "A bridging model for parallel computation," Communications of the ACM, vol. 33, no. 8, Aug. 1990, doi:10.1145/79173.79181.
[12] G. Ballard, J. Demmel, and I. Dumitriu, "Communication-optimal parallel and sequential eigenvalue and singular value algorithms," EECS, University of California, Berkeley, CA, USA, Technical Report EECS-2011-14, February 2011, LAPACK Working Note 237.
[13] A. Haidar, H. Ltaief, and J. Dongarra, "Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels," in SC11: International Conference for High Performance Computing, Networking, Storage and Analysis, Seattle, WA, USA, November 12-18, 2011.
[14] S. Tomov, R. Nath, and J. Dongarra, "Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing," Parallel Comput., vol. 36, no. 12, pp. 645-654, Dec. 2010.
[15] P. Bientinesi, F. D. Igual, D. Kressner, and E. S. Quintana-Ortí, "Reduction to condensed forms for symmetric eigenvalue problems on multi-core architectures," in Proceedings of the 8th International Conference on Parallel Processing and Applied Mathematics: Part I, ser. PPAM'09. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 387-395. [Online]. Available: http://dl.acm.org/citation.cfm?id=1882792.1882839
[16] C. Vömel, S. Tomov, and J. Dongarra, "Divide and conquer on hybrid GPU-accelerated multicore systems," SIAM Journal on Scientific Computing, vol. 34, no. 2, pp. C70-C82, 2012.