Opportunities for ELPA to Accelerate the Solution of the Bethe-Salpeter Eigenvalue Problem

Peter Benner, Andreas Marek, Carolin Penke
August 16, 2018
ELSI Workshop 2018

The Problem

The Bethe-Salpeter Eigenvalue Problem: find eigenvalues, right and left eigenvectors for $H_{BS} x = \lambda x$ with

\[
H_{BS} = \begin{bmatrix} A & B \\ -\bar{B} & -\bar{A} \end{bmatrix}
       = \begin{bmatrix} A & B \\ -B^H & -A^T \end{bmatrix},
\qquad A = A^H,\ B = B^T \in \mathbb{C}^{n \times n} \text{ dense.}
\]

- Comes up in quantum chemical simulations.
- $n$ is proportional to $n_e^2$, where $n_e$ is the number of electrons in the system.
- $n$ can become very large (> 50,000).
- Parallel and scalable algorithms running on supercomputers are necessary.

penke@mpi-magdeburg.mpg.de, C. Penke, ELPA for the Bethe-Salpeter Eigenvalue Problem
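To make the block structure and the size of the problem concrete, here is a minimal NumPy sketch. The helper names `bse_matrix` and `dense_bytes` are hypothetical and used only for illustration; the memory figure assumes the full $2n \times 2n$ matrix is stored densely in complex double precision (16 bytes per entry).

```python
import numpy as np

def bse_matrix(A, B):
    """Assemble H_BS = [[A, B], [-conj(B), -conj(A)]] from its two blocks."""
    return np.block([[A, B], [-B.conj(), -A.conj()]])

def dense_bytes(n, itemsize=16):
    """Bytes needed to store the dense 2n x 2n matrix (complex double by default)."""
    return (2 * n) ** 2 * itemsize

rng = np.random.default_rng(0)
n = 4
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (G + G.conj().T) / 2      # Hermitian: A = A^H
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = (G + G.T) / 2             # complex symmetric: B = B^T
H_bs = bse_matrix(A, B)

print(H_bs.shape)                  # (8, 8)
print(dense_bytes(50_000) / 1e9)   # 160.0 (GB)
```

Under these assumptions a single copy of the matrix for $n = 50{,}000$ already needs about 160 GB, which is more than one node's memory on many machines and motivates a distributed-memory solver.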

Applications in Quantum Chemistry

Eigenvalues $\lambda_i$, right eigenvectors $x_i$ and left eigenvectors $y_i$ are used to compute

- the spectral density
\[
\varphi(\omega) = \frac{1}{2n} \sum_{j=1}^{2n} \delta(\omega - \lambda_j),
\]
- the optical absorption spectrum
\[
\epsilon^{+}(\omega) = \sum_{j=1}^{n} \frac{(d_r^H x_j)(y_j^H d_l)}{y_j^H x_j}\, \delta(\omega - \lambda_j).
\]

We are interested in all (or most) eigenpairs.
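The spectral density above is a sum of delta peaks, so plotting it on a frequency grid requires some broadening. The sketch below replaces each $\delta(\omega - \lambda_j)$ by a normalized Gaussian of width `sigma`. The Gaussian, the width and the function name `spectral_density` are assumptions for illustration and are not taken from the slides.

```python
import numpy as np

def spectral_density(omega, eigvals, sigma=0.1):
    """Approximate (1/2n) * sum_j delta(omega - lambda_j) with Gaussians of width sigma."""
    omega = np.asarray(omega, dtype=float)[:, None]    # shape (n_omega, 1)
    lam = np.asarray(eigvals, dtype=float)[None, :]    # shape (1, 2n)
    peaks = np.exp(-0.5 * ((omega - lam) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return peaks.mean(axis=1)                          # the mean supplies the 1/(2n) factor

# Toy spectrum with the +/- pairing typical of the definite case.
omega = np.linspace(-5.0, 5.0, 2001)
phi = spectral_density(omega, [-2.0, -1.0, 1.0, 2.0])
integral = phi.sum() * (omega[1] - omega[0])           # should be close to 1
```

Because every peak is normalized and the prefactor is $1/(2n)$, the broadened density still integrates to approximately one, which is a quick sanity check on any implementation.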

Properties of the (definite) Bethe-Salpeter EVP

Due to the special structure, eigenvalues appear in quadruples $(\lambda, \bar{\lambda}, -\lambda, -\bar{\lambda})$ [Benner, Faßbender, Yang 14/18]. (Figure: complex plane with axes $\mathbb{R}$ and $i\mathbb{R}$.)

$H_{BS}$ is called definite if

\[
\begin{bmatrix} I_n & 0 \\ 0 & -I_n \end{bmatrix} H_{BS} = \begin{bmatrix} A & B \\ \bar{B} & \bar{A} \end{bmatrix} > 0,
\]

which holds in many physical settings. In this case the eigenvalues come in real pairs and the following holds.

Theorem [Shao, da Jornada, Yang, Deslippe, Louie 15]. There exist $X_1, X_2 \in \mathbb{C}^{n \times n}$ and $\lambda_1, \dots, \lambda_n \in \mathbb{R}_+$, $\Lambda_+ = \operatorname{diag}\{\lambda_1, \dots, \lambda_n\}$, such that

\[
H_{BS} X = X \begin{bmatrix} \Lambda_+ & \\ & -\Lambda_+ \end{bmatrix},
\qquad
Y^H H_{BS} = \begin{bmatrix} \Lambda_+ & \\ & -\Lambda_+ \end{bmatrix} Y^H,
\]

where

\[
X = \begin{bmatrix} X_1 & \bar{X}_2 \\ X_2 & \bar{X}_1 \end{bmatrix},
\qquad
Y = \begin{bmatrix} X_1 & -\bar{X}_2 \\ -X_2 & \bar{X}_1 \end{bmatrix},
\qquad
Y^H X = I_{2n}.
\]
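As a small worked example (not part of the original slides), the case $n = 1$ can be checked by hand. With $a \in \mathbb{R}$ and $b \in \mathbb{C}$:

\[
H_{BS} = \begin{bmatrix} a & b \\ -\bar{b} & -a \end{bmatrix},
\qquad
\det(H_{BS} - \lambda I) = (a - \lambda)(-a - \lambda) + |b|^2 = \lambda^2 - \left(a^2 - |b|^2\right),
\]
\[
\lambda = \pm \sqrt{a^2 - |b|^2}.
\]

The definiteness condition $\begin{bmatrix} a & b \\ \bar{b} & a \end{bmatrix} > 0$ is equivalent to $a > |b|$, and in that case the two eigenvalues form a real pair $\pm\lambda$ with $\lambda > 0$. If instead $|b| > a$, the pair is purely imaginary, which is the degenerate form of the quadruple $(\lambda, \bar{\lambda}, -\lambda, -\bar{\lambda})$.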

A Structure-Preserving Method [Shao, da Jornada, Yang, Deslippe, Louie 15]

- A direct method is implemented in BSEPACK.
- The structure-preserving acquisition of all eigenpairs in high precision relies on a connection to Hamiltonian matrices.

Theorem. Let

\[
Q = \frac{1}{\sqrt{2}} \begin{bmatrix} I_n & -i I_n \\ I_n & i I_n \end{bmatrix},
\]

then

\[
Q^H \begin{bmatrix} A & B \\ -\bar{B} & -\bar{A} \end{bmatrix} Q
= i \begin{bmatrix} \operatorname{Im}(A + B) & -\operatorname{Re}(A - B) \\ \operatorname{Re}(A + B) & \operatorname{Im}(A - B) \end{bmatrix} =: iH,
\]

where $H$ is real Hamiltonian, i.e. $JH = (JH)^T$ with $J = \begin{bmatrix} 0 & I_n \\ -I_n & 0 \end{bmatrix}$.
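The identity in the theorem is easy to check numerically. The following sketch builds random $A = A^H$ and $B = B^T$ (no definiteness is needed for this identity) and compares both sides. The variable names are chosen here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (G + G.conj().T) / 2                      # Hermitian
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = (G + G.T) / 2                             # complex symmetric

I, O = np.eye(n), np.zeros((n, n))
H_bs = np.block([[A, B], [-B.conj(), -A.conj()]])
Q = np.block([[I, -1j * I], [I, 1j * I]]) / np.sqrt(2)

# Real Hamiltonian matrix H from the right-hand side of the theorem.
H = np.block([[(A + B).imag, -(A - B).real],
              [(A + B).real,  (A - B).imag]])
J = np.block([[O, I], [-I, O]])

lhs = Q.conj().T @ H_bs @ Q                   # should equal i * H
print(np.allclose(lhs, 1j * H), np.allclose(J @ H, (J @ H).T))
```

The second check confirms the Hamiltonian property $JH = (JH)^T$, which is what allows the rest of the method to work with a real symmetric matrix.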

The resulting algorithm

Algorithm 1: Algorithm for the complex Bethe-Salpeter eigenvalue problem

Require: $A = A^H$, $B = B^T \in \mathbb{C}^{n \times n}$, such that
\[
\begin{bmatrix} I_n & 0 \\ 0 & -I_n \end{bmatrix} H_{BS} = \begin{bmatrix} A & B \\ \bar{B} & \bar{A} \end{bmatrix} > 0.
\]
Ensure: $X_1, X_2 \in \mathbb{C}^{n \times n}$ and $\Lambda_+ = \operatorname{diag}\{\lambda_1, \dots, \lambda_n\}$ satisfying
\[
H_{BS} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \Lambda_+ .
\]

1: Construct $M = \begin{bmatrix} \operatorname{Re}(A + B) & \operatorname{Im}(A - B) \\ -\operatorname{Im}(A + B) & \operatorname{Re}(A - B) \end{bmatrix}$.
2: Compute the Cholesky factorization $M = L L^T$.
3: Construct $W = L^T \begin{bmatrix} 0 & I_n \\ -I_n & 0 \end{bmatrix} L$.
4: Compute the spectral decomposition $-iW = \begin{bmatrix} Z_+ & Z_- \end{bmatrix} \operatorname{diag}\{\Lambda_+, -\Lambda_+\} \begin{bmatrix} Z_+ & Z_- \end{bmatrix}^H$.
5: Set $\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} I_n & 0 \\ 0 & -I_n \end{bmatrix} Q L Z_+ \Lambda_+^{-1/2}$.

Main workload: solve a strictly imaginary Hermitian eigenvalue problem. Its imaginary part is skew-symmetric.
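The five steps of Algorithm 1 can be sketched in a few lines of dense NumPy. This is a serial illustration under stated assumptions, not the BSEPACK implementation: the function name `bse_definite_eig` is hypothetical, `numpy.linalg.eigh` stands in for the structured eigensolver, and the test matrix is made definite by shifting $A$ so that $\lambda_{\min}(A) > \|B\|_2$ (a sufficient condition chosen here for convenience).

```python
import numpy as np

def bse_definite_eig(A, B):
    """Dense sketch of Algorithm 1; returns (lambda_+, X1, X2)."""
    n = A.shape[0]
    I, O = np.eye(n), np.zeros((n, n))
    # Step 1: real symmetric positive definite M.
    M = np.block([[(A + B).real, (A - B).imag],
                  [-(A + B).imag, (A - B).real]])
    L = np.linalg.cholesky(M)                  # Step 2: M = L L^T
    J = np.block([[O, I], [-I, O]])
    W = L.T @ J @ L                            # Step 3: real skew-symmetric
    lam, Z = np.linalg.eigh(-1j * W)           # Step 4: ascending, in +/- pairs
    lam_pos, Z_pos = lam[n:], Z[:, n:]         # keep the n positive eigenvalues
    Q = np.block([[I, -1j * I], [I, 1j * I]]) / np.sqrt(2)
    S = np.block([[I, O], [O, -I]])
    X = S @ Q @ L @ Z_pos / np.sqrt(lam_pos)   # Step 5: scale column j by lambda_j^(-1/2)
    return lam_pos, X[:n], X[n:]

rng = np.random.default_rng(1)
n = 6
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = (G + G.T) / 2
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (G + G.conj().T) / 2
A += (np.linalg.norm(B, 2) + 1.0 - np.linalg.eigvalsh(A)[0]) * np.eye(n)

lam, X1, X2 = bse_definite_eig(A, B)
H_bs = np.block([[A, B], [-B.conj(), -A.conj()]])
X = np.vstack([X1, X2])
residual = np.linalg.norm(H_bs @ X - X * lam)  # should be near machine precision
```

With the scaling in step 5 the computed vectors also satisfy $X_1^H X_1 - X_2^H X_2 = I_n$, which is the first diagonal block of the normalization $Y^H X = I_{2n}$ from the theorem.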

Implementation

$W$ is skew-symmetric and can be reduced to tridiagonal form (e.g. via Householder transformations):

\[
U W U^T = T = \begin{bmatrix}
0 & \alpha_1 & & \\
-\alpha_1 & 0 & \ddots & \\
 & \ddots & \ddots & \alpha_{2n-1} \\
 & & -\alpha_{2n-1} & 0
\end{bmatrix}.
\]

$T$ can be transformed to symmetric form using $D = \operatorname{diag}\{1, i, i^2, \dots, i^{2n-1}\}$:

\[
-i D^H T D = \begin{bmatrix}
0 & \alpha_1 & & \\
\alpha_1 & 0 & \ddots & \\
 & \ddots & \ddots & \alpha_{2n-1} \\
 & & \alpha_{2n-1} & 0
\end{bmatrix}.
\]
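The diagonal scaling can be verified directly on a small example. The sketch below builds a random skew-symmetric tridiagonal matrix of order $m = 2n$, applies $-i D^H T D$ with $D = \operatorname{diag}\{1, i, \dots, i^{m-1}\}$, and checks that the result is real symmetric with the same off-diagonal entries. Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 8                                          # matrix order, m = 2n
alpha = rng.standard_normal(m - 1)

T = np.diag(alpha, 1) - np.diag(alpha, -1)     # real skew-symmetric tridiagonal
D = np.diag(1j ** np.arange(m))                # diag{1, i, i^2, ..., i^(m-1)}
S = -1j * D.conj().T @ T @ D                   # should be real symmetric tridiagonal

S_expected = np.diag(alpha, 1) + np.diag(alpha, -1)
print(np.allclose(S, S_expected))

# Eigenvalues of T are i times the (real) eigenvalues of S.
eig_T = np.sort(np.linalg.eigvals(T).imag)
eig_S = np.linalg.eigvalsh(S_expected)
print(np.allclose(eig_T, eig_S))
```

The practical consequence is that a standard real symmetric tridiagonal eigensolver can be reused unchanged; only the eigenvectors pick up the complex factor $D$.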

Implementation

- Reduction to tridiagonal form by editing ScaLAPACK's routine for tridiagonal reduction of symmetric matrices: PDSYTRD → PDSSTRD.
- Solve the symmetric tridiagonal EVP via
  1. Bisection method (PDSTEBZ)
  2. Inverse iteration (PDSTEIN).
- Back transformation of eigenvectors:

\[
\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} I_n & 0 \\ 0 & -I_n \end{bmatrix} Q L U D V_+ \Lambda_+^{-1/2},
\]

where $L$, $U$, $V_+$ and $\Lambda_+$ are real; apart from the fixed unitary matrix $Q$, only $D = \operatorname{diag}\{1, i, i^2, \dots, i^{2n-1}\}$ is complex.
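To illustrate the bisection step named above, here is a serial sketch of eigenvalue bisection for a symmetric tridiagonal matrix based on Sturm sequence counts. It shows the general technique only and makes no claim about how PDSTEBZ is organized internally; `count_below` and `kth_eigenvalue` are hypothetical names, and inverse iteration for the eigenvectors is not shown.

```python
import numpy as np

def count_below(d, e, x):
    """Number of eigenvalues smaller than x for the symmetric tridiagonal
    matrix with diagonal d and off-diagonal e (Sturm sequence count)."""
    count, q = 0, 1.0
    for i in range(len(d)):
        off = e[i - 1] ** 2 if i > 0 else 0.0
        q = d[i] - x - off / q
        if q == 0.0:
            q = 1e-300                 # avoid division by zero in the next step
        if q < 0.0:
            count += 1
    return count

def kth_eigenvalue(d, e, k, tol=1e-12):
    """k-th smallest eigenvalue (0-based) located by bisection."""
    r = np.abs(d).max() + 2.0 * np.abs(e).max()   # Gershgorin bound on the spectrum
    lo, hi = -r, r
    while hi - lo > tol * max(1.0, r):
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) > k:
            hi = mid                   # at least k+1 eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(3)
m = 8
d = np.zeros(m)                        # zero diagonal, as in the matrix -i D^H T D
e = rng.standard_normal(m - 1)
approx = np.array([kth_eigenvalue(d, e, k) for k in range(m)])
exact = np.linalg.eigvalsh(np.diag(e, 1) + np.diag(e, -1))
```

Each eigenvalue is located independently of the others, which is why bisection parallelizes naturally, in contrast to the Divide-and-Conquer approach that trades this independence for higher arithmetic efficiency.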

The ELPA Library

- BSEPACK is just a proof of concept, not performance-optimized.
- Room for improvement by using better libraries!

The ELPA Project: Eigenvalue SoLvers for Petaflop-Applications

The publicly available ELPA library provides highly efficient and highly scalable direct eigensolvers for symmetric matrices. Though especially designed for use for PetaFlop/s applications solving large problem sizes on massively parallel supercomputers, ELPA eigensolvers have proven to be also very efficient for smaller matrices.

- Mainly developed at the Max Planck Computing and Data Facility (MPCDF) in Garching.
- Original application area: electronic structure calculations.

Tridiagonalisation in ELPA

Figure: ELPA employs a two-step tridiagonalisation for symmetric or Hermitian matrices. [Marek et al. 14]

ELPA vs. ScaLAPACK

- BLAS level 3 routines for the reduction to banded form: more data locality, less communication, highly efficient GEMM routines.
- Carefully crafted communication patterns in the reduction to tridiagonal form and in the eigenvector back transformation.
- Solution of tridiagonal systems: Divide-and-Conquer instead of bisection and inverse iteration.
- OpenMP and GPU support.

Better performance and scalability!

Using ELPA to enhance BSEPACK

Two promising points of attack for ELPA:

1. Diagonalize the complex Hermitian matrix $-iW = Z \Lambda Z^H$ using ELPA.
   + The main portion of the workload could benefit from the performance and scalability of ELPA.
   - Uses complex arithmetic, while BSEPACK mainly works on real data.
2. Extend the ELPA algorithm to skew-symmetric matrices, just as ScaLAPACK was extended in BSEPACK.
   + Easy from a mathematical point of view.
   - A major revision of the ELPA software stack is necessary.

Numerical Test Setup

System DRACO
- Located at the Max Planck Computing and Data Facility
- HPC extension to the HPC system HYDRA
- 880 nodes, Intel Haswell Xeon E5-2698, 32 cores @ 2.[...] GHz
- 128 GB main memory per node
- Interconnect: fast InfiniBand FDR14 network

- $A \in \mathbb{C}^{n \times n}$ and $B \in \mathbb{C}^{n \times n}$ were initialized with random complex values.
- The diagonal of $A$ is positive and scaled up such that $\begin{bmatrix} A & B \\ \bar{B} & \bar{A} \end{bmatrix}$ is positive definite (Gershgorin circle theorem).
- To work on a matrix on a distributed-memory machine, it is divided into many subblocks of a certain block size $n_b \times n_b$.

Numerical tests:
1. Test the impact of the block size on performance for $n = 25{,}000$.
2. Test strong scalability for medium-sized matrices, $n = 25{,}000$.
3. Compare runtimes for larger matrices up to $n = 75{,}000$.
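The test matrices described above can be generated along the following lines. This is a sketch of one way to apply the Gershgorin argument, not the authors' generator: the function name `random_definite_bse` and the margin of 1 added to each Gershgorin radius are assumptions made for illustration.

```python
import numpy as np

def random_definite_bse(n, rng):
    """Random A = A^H, B = B^T with [[A, B], [conj(B), conj(A)]] positive definite."""
    G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    A = (G + G.conj().T) / 2
    G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    B = (G + G.T) / 2
    np.fill_diagonal(A, 0.0)
    # Gershgorin radius of row i: all off-diagonal moduli in row i of [A  B].
    radius = np.abs(A).sum(axis=1) + np.abs(B).sum(axis=1)
    # A real diagonal larger than the radius keeps every disc in the right half-plane.
    np.fill_diagonal(A, radius + 1.0)
    return A, B

rng = np.random.default_rng(4)
A, B = random_definite_bse(50, rng)
K = np.block([[A, B], [B.conj(), A.conj()]])
lam_min = np.linalg.eigvalsh(K).min()      # bounded below by the margin 1
```

Rows $n+1, \dots, 2n$ of the block matrix have the same radii as rows $1, \dots, n$, so scaling the diagonal of $A$ alone is enough to make the whole matrix strictly diagonally dominant.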

Block Size in BSEPACK and BSEPACK+ELPA

Figure: Runtimes for different block sizes at n = 25,000. (Plot of runtime in s against block size $n_b$ for BSEPACK and BSEPACK+ELPA.)
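For readers unfamiliar with the role of the block size, the sketch below shows how a block-cyclic layout maps a global index to an owning process and a local index in one dimension. It assumes the usual ScaLAPACK-style convention with the first block on process 0; a 2D layout applies the same mapping independently to rows and columns of a process grid. The function name `owner_and_local` is hypothetical.

```python
def owner_and_local(i, nb, nprocs):
    """Map global index i (0-based) to (owning process, local index)
    in a 1D block-cyclic layout with block size nb."""
    block = i // nb                            # which block the index falls in
    owner = block % nprocs                     # blocks are dealt out round-robin
    local = (block // nprocs) * nb + i % nb    # position inside the owner's storage
    return owner, local

# Block size 2 on 3 processes: indices 0,1 -> p0; 2,3 -> p1; 4,5 -> p2; 6,7 -> p0; ...
layout = [owner_and_local(i, nb=2, nprocs=3) for i in range(9)]
print(layout[6], layout[8])                    # (0, 2) (1, 2)

# Number of block rows for a matrix of order 2n = 50,000 with nb = 64.
print(-(-50_000 // 64))                        # 782
```

A small block size spreads the work more evenly across processes, while a large one favours efficient local BLAS calls, which is the trade-off behind block-size experiments of this kind.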

Scalability of BSEPACK and BSEPACK+ELPA

Figure: Strong scalability for a medium-sized matrix (n = 25,000) in BSEPACK and BSEPACK enhanced with ELPA. 32 threads were used per node. (Plot of runtime in s against number of nodes; curves: BSEPACK with $n_b = 64$, BSEPACK with $n_b = 256$, BSEPACK + ELPA with $n_b = 64$.)

Scalability of BSEPACK and BSEPACK+ELPA

Figure: Strong scalability for a medium-sized matrix (n = 25,000) in BSEPACK and BSEPACK enhanced with ELPA. 32 threads were used per node. (Plot of runtime in s against number of nodes; curves: BSEPACK, Tridiagonalization (ScaLAPACK), BSEPACK + ELPA, Hermitian Solve (ELPA), Perfect Scaling.)

Runtimes for large matrices

Figure: Runtimes of BSEPACK and BSEPACK+ELPA for large matrices of size $2n \times 2n$. (Plot of runtime in s against the size of the submatrices $A, B \in \mathbb{C}^{n \times n}$; curves: BSEPACK with $n_b = 64$, BSEPACK with $n_b = 256$, BSEPACK + ELPA.)

Runtimes for large matrices

Figure: Runtimes of BSEPACK and BSEPACK+ELPA for large matrices of size $2n \times 2n$. (Plot of runtime in s against the size of the submatrices $A, B \in \mathbb{C}^{n \times n}$, with axis ticks up to 65,000; legend entries: BSEPACK $n_b = 64$, BSEPACK $n_b = 256$, BSEPACK + ELPA, Exponent 2.79, Exponent 2.40, Exponent [...].)
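An exponent of the kind quoted in the legend is typically the slope of a straight-line fit in a log-log plot, i.e. the $k$ in runtime $\approx c\, n^k$. The sketch below shows such a fit. The runtimes are synthetic values generated from an exact cubic law purely to demonstrate the procedure; they are not measurements from the talk.

```python
import numpy as np

# Matrix sizes as they appear on the plot axis; runtimes are SYNTHETIC (exact n^3 law).
sizes = np.array([15_000, 25_000, 40_000, 65_000], dtype=float)
runtimes = 2e-11 * sizes ** 3

# Fit log(t) = k * log(n) + log(c); the slope k is the scaling exponent.
k, log_c = np.polyfit(np.log(sizes), np.log(runtimes), 1)
print(round(k, 6))
```

A dense direct eigensolver performs $O(n^3)$ floating-point operations, so a measured exponent below 3 over a limited range of sizes usually indicates that efficiency is still improving as the problem grows.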

ELPA for Skew-Symmetric Matrices

Stays the same:
- Computation of Householder vectors.
- Application to off-diagonal blocks.
- Tridiagonal solve.
- Most of the back transformation of eigenvectors.
- All communication and synchronizations.

Change wherever symmetry is implicitly assumed, i.e. in updates on blocks including the diagonal:

\[
A \leftarrow (I - \tau v v^T) A (I - \tau v v^T)
= A - v \underbrace{\left(\tau v^T A - 0.5\, \tau^2\, v^T A v\, v^T\right)}_{u_1^T}
    - \underbrace{\left(\tau A v - 0.5\, \tau^2\, v\, v^T A v\right)}_{u_2} v^T,
\]
\[
A = A^T \Rightarrow u_2 = u_1, \qquad A = -A^T \Rightarrow u_2 = -u_1 .
\]
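The rank-2 form of the two-sided Householder update and the sign relation between $u_1$ and $u_2$ can be checked with a short dense sketch. The function name `two_sided_householder` is hypothetical, and the choice $\tau = 2 / v^T v$ (which makes $I - \tau v v^T$ an orthogonal reflector) is an assumption for the test.

```python
import numpy as np

def two_sided_householder(A, v, tau):
    """Return (I - tau v v^T) A (I - tau v v^T) in rank-2 form, plus u1 and u2."""
    Av, vA = A @ v, v @ A                        # A v and (v^T A) as 1D arrays
    vAv = v @ Av
    u1 = tau * vA - 0.5 * tau**2 * vAv * v       # u1^T = tau v^T A - 0.5 tau^2 (v^T A v) v^T
    u2 = tau * Av - 0.5 * tau**2 * vAv * v       # u2   = tau A v   - 0.5 tau^2 v (v^T A v)
    return A - np.outer(v, u1) - np.outer(u2, v), u1, u2

rng = np.random.default_rng(5)
m = 6
G = rng.standard_normal((m, m))
v = rng.standard_normal(m)
tau = 2.0 / (v @ v)
P = np.eye(m) - tau * np.outer(v, v)

A_sym, A_skew = G + G.T, G - G.T
new_sym, u1_sym, u2_sym = two_sided_householder(A_sym, v, tau)
new_skew, u1_skew, u2_skew = two_sided_householder(A_skew, v, tau)

print(np.allclose(u2_sym, u1_sym), np.allclose(u2_skew, -u1_skew))
```

Only the sign in front of one of the two outer products differs between the symmetric and the skew-symmetric case, which is why the change is confined to the kernels that update diagonal blocks.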

ELPA for Skew-Symmetric Matrices

- Use the skew-symmetric BLAS routines provided by BSEPACK: DSYR2 → DSSR2, DSYMV → DSSMV.
- Multiplication with $D = \operatorname{diag}\{1, i, i^2, \dots, i^{2n-1}\}$ in the back transformation.

Preliminary results from a smaller compute server, Bruno:
- 2 × Intel Haswell Xeon E5-2640v3, 8 cores each
- 32 GB main memory per CPU
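The slides do not show the interfaces of DSSMV and DSSR2. Assuming they are the skew-symmetric counterparts of the BLAS routines DSYMV (matrix-vector product) and DSYR2 (rank-2 update), the operations they would perform can be sketched in dense NumPy as follows. The function names and the exact update formula are assumptions made here for illustration, not BSEPACK's documented API.

```python
import numpy as np

def skew_matvec_lower(A, x):
    """y = A x for skew-symmetric A, reading only the strictly lower triangle."""
    low = np.tril(A, -1)
    return low @ x - low.T @ x                 # A = low - low^T, zero diagonal

def skew_rank2_update(A, x, y, alpha=1.0):
    """A + alpha * (x y^T - y x^T); the result stays skew-symmetric."""
    return A + alpha * (np.outer(x, y) - np.outer(y, x))

rng = np.random.default_rng(6)
m = 7
G = rng.standard_normal((m, m))
A = G - G.T                                    # skew-symmetric test matrix
x, y = rng.standard_normal(m), rng.standard_normal(m)

y_ref = A @ x
y_tri = skew_matvec_lower(A, x)
A_new = skew_rank2_update(A, x, y, alpha=0.5)
```

As in the symmetric case, only one triangle of the matrix needs to be stored and touched; the difference is the minus sign applied to the transposed part.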

Runtimes for large matrices

Figure: Preliminary results for matrices of size $2n \times 2n$ running on 16 cores. (Plot of runtime in s against the size of the submatrices $A, B \in \mathbb{C}^{n \times n}$; curves: BSEPACK with $n_b = 64$, BSEPACK + complex ELPA with $n_b = 64$, BSEPACK + skew-symmetric ELPA.)

Easy further improvements:

- The rank-2 update becomes easier for skew-symmetric matrices:

\[
A \leftarrow (I - \tau v v^T) A (I - \tau v v^T)
= A - v \big(\tau v^T A - 0.5\, \tau^2 \underbrace{v^T A v}_{=0}\, v^T\big)
    - \big(\tau A v - 0.5\, \tau^2\, v \underbrace{v^T A v}_{=0}\big) v^T .
\]

- Eigenvectors resulting from the tridiagonal solve have to be multiplied by the complex diagonal matrix $D$ before the back transformation.
- Instead of applying the Householder transformations directly, we apply them to the identity matrix and then multiply. This is easier to implement but costs more FLOPs.
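The reason the quadratic term vanishes is a one-line argument, spelled out here for completeness. Since $v^T A v$ is a scalar, it equals its own transpose, and for $A = -A^T$:

\[
v^T A v = (v^T A v)^T = v^T A^T v = -\,v^T A v
\quad\Longrightarrow\quad v^T A v = 0 .
\]

Writing $u := \tau A v$ and using $v^T A = -(A v)^T$, the update above therefore reduces to

\[
A \leftarrow A - u v^T + v u^T ,
\]

which needs a single matrix-vector product and no inner product $v^T A v$.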

Future Work

- ELPA for skew-symmetric matrices on production level (including OpenMP and GPU support).
- HPC implementations of algorithms for Hamiltonian matrices.
- Structure-preserving Divide-and-Conquer scheme based on the matrix disk function / doubling algorithm [Bai, Demmel, Gu 97] and permuted Lagrangian graph bases [Mehrmann, Poloni 12].

Thank you for your attention!

The ELPA library can be found at [...]

- M. Shao, F. H. da Jornada, C. Yang, J. Deslippe, and S. G. Louie. Structure preserving parallel algorithms for solving the Bethe-Salpeter eigenvalue problem. Linear Algebra and its Applications, 488(Supplement C).
- P. Benner, H. Faßbender, and C. Yang. Some remarks on the complex J-symmetric eigenproblem. Preprint MPIMD/15-12, Max Planck Institute Magdeburg.
- T. Auckenthaler, V. Blum, H.-J. Bungartz, T. Huckle, R. Johanni, L. Krämer, B. Lang, H. Lederer, and P. R. Willems. Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations. Parallel Computing, 37(12). International Workshop on Parallel Matrix Algorithms and Applications (PMAA'10).
- Z. Bai, J. Demmel, and M. Gu. An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems. 76(3).
- P. Benner, H. Faßbender, and C. Yang. Some remarks on the complex J-symmetric eigenproblem. Linear Algebra and its Applications, 544.
- V. Mehrmann and F. Poloni. Doubling algorithms with permuted Lagrangian graph bases. SIAM J. Matrix Anal. Appl., 33(3).
