Member of the Helmholtz Association. Linear algebra tasks in Materials Science: optimization and portability

1 Member of the Helmholtz Association. Linear algebra tasks in Materials Science: optimization and portability. ADAC Workshop, July. Edoardo Di Napoli

2 Outline
- Jülich Supercomputing Centre
- Chebyshev Accelerated Subspace Iteration (ChASE)
- Hamiltonian and Overlap matrices initialization in FLAPW

3 Forschungszentrum Jülich

4 Jülich Supercomputing Centre
Supercomputing operations for:
- Center: FZJ
- Region: JARA
- Germany: Gauss Centre for Supercomputing, John von Neumann Institute for Computing
- Europe: PRACE, EU projects
Application support:
- Primary support & Simulation Labs
- Peer review support & coordination
R&D work:
- Methods and algorithms, computational science, performance analysis and tools
- Scientific Big Data Analytics with HPC
- Computer architectures, Co-Design, Exascale Labs together with IBM, Intel, NVIDIA
Education and training

5 Supercomputing systems: dual architecture strategy

6 Simulation Laboratory as HPC enabler

7 SimLab Quantum Materials (SLQM)
Team
Mission: The Simulation Laboratory Quantum Materials (SLQM) provides expertise in the field of quantum-based simulations in Materials Science with a special focus on high-performance computing. SLQM acts as a high-level support structure in dedicated projects and hosts research projects dealing with fundamental aspects of code development, algorithmic optimization, and performance improvement. The Lab acts as an enabler of large-scale simulations on current HPC platforms as well as on future architectures by targeting domain-specific co-design processes.

8 High-performance computations
Observations:
- Linear algebra algorithms are ubiquitous in Materials Science.
- Numerical libraries are used as black boxes.
- Domain-specific knowledge does not influence algorithm choice.
- Rigid legacy codes are hard to modernize.
Goal: design and optimize linear algebra algorithms in order to
- exploit available knowledge,
- increase the parallelism of complex tasks,
- facilitate performance portability.

9 SLQM active areas of research
- Adaptive integration for the functional Renormalization Group (fRG)
- HPC tensor operations (contractions, generalized contractions, transpositions, etc.) with applications in fRG, Quantum Chemistry, etc.
- Algebraic eigensolvers (sequences of dense eigenproblems, sparse eigenproblems) with applications in Density Functional Theory (DFT), Numerical RG, Excitonic Hamiltonians, etc.
- Articulated linear algebra tasks (matrix initialization, etc.)
- Structured linear systems (Dyson, Poisson, and Keldysh equations)
- Self-consistent field preconditioners
- Predicting material properties through machine learning

10 Two examples from FLAPW methods (DFT)
An eigensolver for sequences of dense matrices: the Chebyshev Accelerated Subspace iteration Eigensolver (ChASE), an exterior iterative eigensolver tailored to sequences of dense Hermitian eigenvalue problems arising in DFT methods based on the LAPW basis set.
A complex linear algebra task: the H and S Dense Linear Algebra (HSDLA) algorithm, a combination of dense linear algebra kernels for the initialization of the Hamiltonian and Overlap matrices in Density Functional Theory.

11 Density Functional Theory: Self-Consistent Field, general framework
1. Initial guess for the charge density $n_{\text{start}}(\mathbf{r})$.
2. Initialize the matrices $A_k^{(l)}$ and $B_k^{(l)}$.
3. Solve a set of eigenproblems $P_{k_1}^{(l)}, \dots, P_{k_N}^{(l)}$.
4. Compute the new charge density $n^{(l)}(\mathbf{r})$.
5. Converged? Test $|n^{(l)} - n^{(l-1)}| < \eta$. If no, return to step 2. If yes, output the electronic structure, ...
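The loop on this slide can be sketched in a few lines of NumPy. This is only a toy illustration under loud assumptions: `toy_matrices` is a hypothetical stand-in for the real FLAPW initialization, a single k-point is used, the density is built from the three lowest eigenvectors, and simple linear mixing (not mentioned on the slide) is added to keep the fixed-point iteration stable.

```python
import numpy as np

def lowest_eigenpairs(A, B, n_occ):
    """Lowest n_occ eigenpairs of A x = lambda B x (A Hermitian, B Hermitian positive definite)."""
    L = np.linalg.cholesky(B)                    # B = L L^H
    L_inv = np.linalg.inv(L)
    A_std = L_inv @ A @ L_inv.conj().T           # equivalent standard Hermitian problem
    lam, Y = np.linalg.eigh(A_std)               # eigenvalues in ascending order
    X = L_inv.conj().T @ Y[:, :n_occ]            # back-transform the eigenvectors
    return lam[:n_occ], X

def toy_matrices(n):
    """Hypothetical model (assumption): A depends on the density n, B is the identity."""
    size = n.size
    A0 = np.diag(np.arange(size, dtype=float)) + 0.1 * (np.ones((size, size)) - np.eye(size))
    return A0 + 0.1 * np.diag(n), np.eye(size)

def scf(n_start, n_occ, eta=1e-10, mixing=0.5, max_cycles=200):
    n_old = n_start
    for cycle in range(1, max_cycles + 1):
        A, B = toy_matrices(n_old)                        # initialize A^(l) and B^(l)
        lam, X = lowest_eigenpairs(A, B, n_occ)           # solve the eigenproblem
        n_out = np.sum(np.abs(X) ** 2, axis=1)            # new charge density
        n_new = (1.0 - mixing) * n_old + mixing * n_out   # linear mixing (assumption)
        if np.linalg.norm(n_new - n_old) < eta:           # converged?
            return n_new, lam, cycle
        n_old = n_new
    raise RuntimeError("SCF did not converge")

density, energies, cycles = scf(n_start=np.full(8, 3 / 8), n_occ=3)
```

In a real FLAPW code, steps 2 and 3 dominate the cost, which is why the rest of the talk concentrates on the eigensolver and on the matrix initialization.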

12 The case of the FLAPW method
Observations:
1. every $P_k^{(l)}$: $A_k^{(l)} x = \lambda B_k^{(l)} x$ is a generalized eigenvalue problem;
2. $A$ and $B$ are Hermitian ($B$ is positive definite);
3. required: the lowest 2 to 20 % of eigenpairs;
4. eigenvectors of problems with the same $k$ are seemingly uncorrelated across iterations;
5. k-vector index: k = 1 : ;
6. iteration cycle index: l = 1 :

13 Outline
- Jülich Supercomputing Centre
- Chebyshev Accelerated Subspace Iteration (ChASE)
- Hamiltonian and Overlap matrices initialization in FLAPW

14 Generalized eigenvalue solver: subspace iteration
Two distinct optimization opportunities:
1. Knowledge exploitation: the solution of the previous SCF cycle can be used to precondition the solver at the next iteration.
2. Algorithm optimization: the polynomial degree can be pre-computed in order to minimize matrix-vector multiplications.

15 Chebyshev Accelerated Subspace Eigensolver: Chebyshev filter
A generic vector $v = \sum_{i=1}^{n} s_i x_i$ is very quickly aligned in the direction of the eigenvector corresponding to the extremal eigenvalue $\lambda_1$:
$$v^{m} = p_m(A)\,v = \sum_{i=1}^{n} s_i\, p_m(A)\, x_i = \sum_{i=1}^{n} s_i\, p_m(\lambda_i)\, x_i = s_1 x_1 + \sum_{i=2}^{n} s_i \frac{C_m\!\left(\frac{\lambda_i - c}{e}\right)}{C_m\!\left(\frac{\lambda_1 - c}{e}\right)} x_i \approx s_1 x_1$$
[Figure: Chebyshev polynomial of degree m = 6 plotted over the eigenspectrum, with c, λmin, λlower and λupper marked on the axis.]

16 The core of the algorithm: Chebyshev filter (> 80% of computing time)
Three-term recurrence relation:
$$C_{m+1}(t) = 2t\,C_m(t) - C_{m-1}(t), \quad m \in \mathbb{N}, \quad C_0(t) = 1, \quad C_1(t) = t$$
$Z_m \doteq p_m(\hat{H})\, Z_0$ with $\hat{H} = H - c I_n$
FOR i = 1 → DEG − 1
  $Z_{i+1} \leftarrow \frac{2\sigma_{i+1}}{e}\, \hat{H} Z_i - \sigma_{i+1}\sigma_i\, Z_{i-1}$ (xGEMM)
END FOR
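A minimal NumPy sketch of this filter is given below. The slide does not show how the scaling factors are initialized, so the code assumes the commonly used choice $\sigma_1 = e/(\lambda_1 - c)$, $Z_1 = (\sigma_1/e)\hat{H}Z_0$ and $\sigma_{i+1} = 1/(2/\sigma_1 - \sigma_i)$, which normalizes the polynomial so that $p_m(\lambda_1) = 1$. The function name, the test spectrum and the exact spectral bounds used here are illustrative assumptions.

```python
import numpy as np

def chebyshev_filter(H, Z0, deg, lam1, lower, upper):
    """Apply the scaled degree-`deg` Chebyshev polynomial that damps [lower, upper]."""
    c = 0.5 * (upper + lower)                 # center of the damped interval
    e = 0.5 * (upper - lower)                 # half-width of the damped interval
    H_hat = H - c * np.eye(H.shape[0])
    sigma1 = e / (lam1 - c)                   # assumed initialization of the scaling
    sigma = sigma1
    Z_prev = Z0
    Z = (sigma1 / e) * (H_hat @ Z0)           # Z_1
    for _ in range(1, deg):                   # i = 1 .. DEG - 1
        sigma_next = 1.0 / (2.0 / sigma1 - sigma)
        Z_next = (2.0 * sigma_next / e) * (H_hat @ Z) - (sigma * sigma_next) * Z_prev
        Z_prev, Z, sigma = Z, Z_next, sigma_next
    return Z

# Symmetric test matrix with known spectrum: one isolated eigenvalue at 0,
# the remaining ones spread over [2, 10].
n = 50
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.concatenate(([0.0], np.linspace(2.0, 10.0, n - 1)))
H = (Q * lam) @ Q.T                           # H = Q diag(lam) Q^T

x1 = Q[:, 0]                                  # eigenvector of the lowest eigenvalue
v = Q @ np.full(n, 1.0 / np.sqrt(n))          # equal weight on every eigenvector
v_filtered = chebyshev_filter(H, v, deg=10, lam1=lam[0], lower=lam[1], upper=lam[-1])

overlap_before = abs(x1 @ v) / np.linalg.norm(v)
overlap_after = abs(x1 @ v_filtered) / np.linalg.norm(v_filtered)
```

The only expensive operation inside the loop is the product `H_hat @ Z`, which for a block of vectors is a matrix-matrix multiplication, consistent with the xGEMM annotation on the slide.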

17 Preconditioning ChASE: exploiting knowledge
At SCF iteration $(l)$, each problem $P_{k_j}^{(l)}$, $j = 1, \dots, N$, is passed to the iterative solver, which returns $(X_{k_j}^{(l)}, \Lambda_{k_j}^{(l)})$. At the next cycle $(l+1)$, the solver is applied to $P_{k_j}^{(l+1)}$ and returns $(X_{k_j}^{(l+1)}, \Lambda_{k_j}^{(l+1)})$, with the solution of iteration $(l)$ for the same $k_j$ available as input.
$X \equiv \{x_1, \dots, x_n\}$, $\Lambda \equiv \mathrm{diag}(\lambda_1, \dots, \lambda_n)$

18 Speed-up
Speed-up = CPU time (input random vectors) / CPU time (input approximate eigenvectors)
[Figure: speed-up versus iteration index for Au98Ag10.]

19 ChASE with polynomial optimization
Speed-up = CPU time (m constant) / CPU time (m optimized for each vector)
[Figure: speed-up versus iteration cycle index; series: CPU cores with multi-threaded BLAS, and 24 cores plus 1 GPU with cuBLAS.]

20 ChASE pseudocode
INPUT: Hamiltonian $H$, TOL, DEG. OPTIONAL: approximate eigenvectors $Z_0$, extreme eigenvalues $\{\lambda_1, \lambda_{\mathrm{NEV}}\}$.
OUTPUT: NEV wanted eigenpairs $(\Lambda, W)$.
1. Lanczos DoS step. Identify the bounds for the eigenspectrum interval corresponding to the wanted eigenspace.
REPEAT UNTIL CONVERGENCE:
2. Chebyshev filter. Filter a block of vectors $W \leftarrow Z_0$.
3. Re-orthogonalize the vectors output by the filter: $W = QR$.
4. Compute the Rayleigh quotient $G = Q^{\dagger} H Q$.
5. Compute the primitive Ritz pairs $(\Lambda, Y)$ by solving $GY = Y\Lambda$.
6. Compute the approximate Ritz pairs $(\Lambda, W \leftarrow QY)$.
7. Check which of the Ritz vectors have converged.
8. Deflate and lock the converged vectors.
END REPEAT
END
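The pseudocode can be turned into a small NumPy sketch. It is a simplification, not the ChASE library: the Lanczos DoS step is replaced by an initial Rayleigh-Ritz projection plus the bound $\|H\|_1$ on the largest eigenvalue, deflation and locking (step 8) are omitted, and the filter scaling uses the assumed initialization $\sigma_1 = e/(\lambda_1 - c)$. The names `chase_sketch`, `nex` (extra search vectors) and the test matrices are hypothetical.

```python
import numpy as np

def chebyshev_filter(H, Z0, deg, lam1, lower, upper):
    """Scaled Chebyshev filter damping the interval [lower, upper]."""
    c, e = 0.5 * (upper + lower), 0.5 * (upper - lower)
    H_hat = H - c * np.eye(H.shape[0])
    sigma1 = e / (lam1 - c)
    sigma, Z_prev, Z = sigma1, Z0, (sigma1 / e) * (H_hat @ Z0)
    for _ in range(1, deg):
        sigma_next = 1.0 / (2.0 / sigma1 - sigma)
        Z_prev, Z = Z, (2.0 * sigma_next / e) * (H_hat @ Z) - (sigma * sigma_next) * Z_prev
        sigma = sigma_next
    return Z

def rayleigh_ritz(H, W):
    Q, _ = np.linalg.qr(W)                    # step 3: re-orthogonalize
    G = Q.conj().T @ H @ Q                    # step 4: Rayleigh quotient
    theta, Y = np.linalg.eigh(G)              # step 5: primitive Ritz pairs
    return theta, Q @ Y                       # step 6: approximate Ritz pairs

def chase_sketch(H, nev, nex=5, deg=20, tol=1e-8, max_iter=200, Z0=None, seed=0):
    n = H.shape[0]
    if Z0 is None:                            # no approximate eigenvectors supplied
        Z0 = np.random.default_rng(seed).standard_normal((n, nev + nex))
    upper = np.linalg.norm(H, 1)              # max column sum bounds the largest eigenvalue
    theta, W = rayleigh_ritz(H, Z0)           # stand-in for step 1 (assumption)
    for it in range(1, max_iter + 1):
        W = chebyshev_filter(H, W, deg, theta[0], theta[-1], upper)   # step 2
        theta, W = rayleigh_ritz(H, W)                                # steps 3 to 6
        resid = np.linalg.norm(H @ W[:, :nev] - W[:, :nev] * theta[:nev], axis=0)
        if resid.max() < tol:                 # step 7: all wanted pairs converged
            return theta[:nev], W, it
    raise RuntimeError("subspace iteration did not converge")

n, nev = 100, 5
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = (Q * np.linspace(0.0, 1.0, n)) @ Q.T
H = 0.5 * (H + H.T)                           # remove rounding asymmetry
evals, W, iters_cold = chase_sketch(H, nev)

# Next "SCF cycle": a slightly perturbed matrix, warm-started with the previous vectors.
E = rng.standard_normal((n, n))
H_next = H + 1e-4 * (E + E.T)
evals_next, _, iters_warm = chase_sketch(H_next, nev, Z0=W)
```

Comparing `iters_cold` with `iters_warm` gives a rough feel for the knowledge-exploitation idea: the second call starts from the eigenvectors of the previous problem instead of random vectors.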

21 Execution time for distinct platforms: an example from Density Functional Theory
[Figure: time [s] versus iteration; series: Intel Xeon (16 cores), 1x NVIDIA K20m, 2x NVIDIA K20m, 1x Xeon Phi optimized (compact).]

22 ChASE new implementation
C++ library:
- Separation between algorithm and implementation
- Few kernels for low-level tasks (Lanczos, QR, matrix-matrix multiply, etc.)
- General interface
- Kernels adapted to the parallel computing platform
- Library templated for real and complex, single and double precision
Integration into applications:
- FLEUR code
- Jena BSE code (based on VASP)
- Early-stage collaboration with the Exciting developers
- Early-stage collaboration with developers from AICS (Kobe)

23 Outline
- Jülich Supercomputing Centre
- Chebyshev Accelerated Subspace Iteration (ChASE)
- Hamiltonian and Overlap matrices initialization in FLAPW

24 Hamiltonian and Overlap matrices: matrix form
$$H = \sum_{a=1}^{N_A} \Big( \underbrace{A_a^H T_a^{[AA]} A_a}_{H_{AA}} + \underbrace{A_a^H T_a^{[AB]} B_a + B_a^H T_a^{[BA]} A_a + B_a^H T_a^{[BB]} B_a}_{H_{AB+BA+BB}} \Big)$$
$$S = \underbrace{\sum_{a=1}^{N_A} A_a^H A_a}_{S_{AA}} + \underbrace{\sum_{a=1}^{N_A} B_a^H U_a^H U_a B_a}_{S_{BB}}$$
$A_a$ and $B_a \in \mathbb{C}^{N_L \times N_G}$, while $T_a^{[\ldots]} \in \mathbb{C}^{N_L \times N_L}$ and Hermitian. $N_L = (l_{\max} + 1)^2$ (for example 121 for $l_{\max} = 10$), $N_G$ = 1,000 to 50,000, and $N_A$ = number of atoms.

25 Constructing $S_{AA}$: an example of memory layout re-structuring
$$S_{AA} = \sum_{a=1}^{N_A} A_a^H A_a$$
Original loop, one rank-k update per atom:
for a := 1 → N_A do
  S_AA += A_a^H A_a (zherk: 4 N_L N_G^2 flops)
end for
Stacking the blocks $A_1, A_2, \dots, A_{N_A}$ (each $N_L \times N_G$) vertically into a single matrix $A$ of size $N_A N_L \times N_G$ reduces the loop to one call:
S_AA = A^H A (zherk: 4 N_A N_L N_G^2 flops)
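The equivalence of the two layouts is easy to check numerically. The sketch below uses small, arbitrary sizes and random complex blocks as stand-ins for the real $A_a$ coefficients, and plain NumPy products where a production code would call zherk.

```python
import numpy as np

rng = np.random.default_rng(0)
N_A, N_L, N_G = 4, 9, 20          # illustrative sizes, far smaller than in practice

# One N_L x N_G complex block per atom (random stand-ins for the A_a).
A_blocks = [rng.standard_normal((N_L, N_G)) + 1j * rng.standard_normal((N_L, N_G))
            for _ in range(N_A)]

# Original layout: accumulate one small product per atom.
S_loop = np.zeros((N_G, N_G), dtype=complex)
for A_a in A_blocks:
    S_loop += A_a.conj().T @ A_a

# Re-structured layout: stack all blocks vertically, then a single large product.
A_stacked = np.vstack(A_blocks)               # shape (N_A * N_L, N_G)
S_stacked = A_stacked.conj().T @ A_stacked
```

The flop count is the same in both versions; the gain comes from handing the BLAS library one large operation instead of many small ones.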

26 Constructing $H_{AB+BA+BB}$: an example of algorithm re-structuring
$$H_{AB+BA+BB} = \sum_{a=1}^{N_A} B_a^H \big(T_a^{[BA]} A_a\big) + \big(A_a^H T_a^{[AB]}\big) B_a + \tfrac{1}{2} B_a^H \big(T_a^{[BB]} B_a\big) + \tfrac{1}{2} \big(B_a^H T_a^{[BB]}\big) B_a$$
$$= \sum_{a=1}^{N_A} B_a^H \big(T_a^{[BA]} A_a + \tfrac{1}{2} T_a^{[BB]} B_a\big) + \big(A_a^H T_a^{[AB]} + \tfrac{1}{2} B_a^H T_a^{[BB]}\big) B_a$$
for a := 1 → N_A do
  Z_a = T_a^{[BA]} A_a (zgemm: 8 N_L^2 N_G flops)
  Z_a = Z_a + 1/2 T_a^{[BB]} B_a (zhemm: 8 N_L^2 N_G flops)
  Stack Z_a to Z and B_a to B
end for
H = Z^H B + B^H Z (zher2k: 8 N_A N_L N_G^2 flops)
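A short NumPy check of this re-structuring follows. It assumes that $T_a^{[AB]} = (T_a^{[BA]})^H$ and that $T_a^{[BB]}$ is Hermitian, which is what makes $Z_a^H = A_a^H T_a^{[AB]} + \tfrac{1}{2} B_a^H T_a^{[BB]}$ hold; the sizes and the random matrices are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
N_A, N_L, N_G = 3, 4, 6           # illustrative sizes

def rand_complex(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

A = [rand_complex(N_L, N_G) for _ in range(N_A)]
B = [rand_complex(N_L, N_G) for _ in range(N_A)]
T_BA = [rand_complex(N_L, N_L) for _ in range(N_A)]
T_AB = [t.conj().T for t in T_BA]                      # assumption: T^[AB] = (T^[BA])^H
T_BB = []
for _ in range(N_A):
    M = rand_complex(N_L, N_L)
    T_BB.append(0.5 * (M + M.conj().T))                # assumption: T^[BB] is Hermitian

# Reference: the three terms summed atom by atom.
H_ref = sum(A[a].conj().T @ T_AB[a] @ B[a]
            + B[a].conj().T @ T_BA[a] @ A[a]
            + B[a].conj().T @ T_BB[a] @ B[a] for a in range(N_A))

# Re-structured: build Z_a, stack, then one rank-2k style update.
Z = np.vstack([T_BA[a] @ A[a] + 0.5 * (T_BB[a] @ B[a]) for a in range(N_A)])
B_stacked = np.vstack(B)
H_fast = Z.conj().T @ B_stacked + B_stacked.conj().T @ Z
```

The final line has exactly the shape of a her2k update, so the whole sum over atoms collapses into a single BLAS-3 call on the stacked matrices.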

27 FLEUR early optimizations: minimization of FLOPs in FLEUR
$$S_{AA} = \frac{(4\pi)^2}{\Omega} \sum_{a} \sum_{l=0}^{l_{sph}} \frac{f_{l,a,G'}\, f_{l,a,G}\, \exp[i(K_G - K_{G'}) \cdot x_a]}{W_{l,a}^2} \underbrace{\sum_{m=-l}^{l} Y_{l,m}\big(R_a \hat{K}_{G'}\big)\, Y^{*}_{l,m}\big(R_a \hat{K}_{G}\big)}_{=\, \frac{2l+1}{4\pi} P_l\big(\hat{K}_{G'} \cdot \hat{K}_{G}\big)}$$
The sum over $m$ can be removed by using the well-known identity shown under the brace (the addition theorem for spherical harmonics). Removing the sum over $m$ reduces the number of operations by a factor of about $l_{sph} + 1 \approx 10$, but it destroys the matrix form: a premature optimization.
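One way to see where a factor of this size comes from is to count the terms of the double sum, as a rough estimate that ignores the cost of evaluating each term:

```latex
\[
\underbrace{\sum_{l=0}^{l_{sph}} \sum_{m=-l}^{l} 1}_{\text{with the sum over } m}
  = \sum_{l=0}^{l_{sph}} (2l+1) = (l_{sph}+1)^2,
\qquad
\underbrace{\sum_{l=0}^{l_{sph}} 1}_{\text{without it}} = l_{sph}+1,
\qquad
\frac{(l_{sph}+1)^2}{l_{sph}+1} = l_{sph}+1 .
\]
```

The saving is real, but the entries of $S_{AA}$ are then computed one by one instead of as a product $A^H A$, so the work can no longer be delegated to optimized BLAS-3 kernels.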

28 Hamiltonian and Overlap matrices initialization: main features
- Boosted performance through highly optimized computational kernels.
- Performance portability through modular implementation.
- Compute-bound implementation at the cost of increased complexity (more FLOPs).
Figure 1: Speedup of our algorithm over FLEUR with kmax = 4 and increasing parallelism. [Plot: speedup versus number of threads for NaCl (Kmax = 4.0) and TiO2 (Kmax = 3.6) on IvyBridge and Haswell.]

29 Porting to heterogeneous architectures: back-of-the-envelope analysis
- 5 lines of the algorithm constitute 97% of the flops.
- They correspond to BLAS-3 operations (gemm, herk, her2k).
- They have high arithmetic intensity and should fit GPUs well.
First step: offload these routine calls. All 5 are BLAS kernels. Can we use some library?
- cuBLAS
- cuBLAS-XT
- MAGMA
- BLASX
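The arithmetic-intensity claim can be made concrete with a rough estimate for the stacked zherk call. The sketch below assumes double-complex entries (16 bytes), that the stacked matrix is read once and the full result written once, and an illustrative $N_G$ of 10,000 chosen from the range quoted earlier in the talk; real memory traffic depends on blocking and caches.

```python
def zherk_intensity(rows, n_g, bytes_per_entry=16):
    """Flops per byte for C = A^H A with A of shape (rows, n_g), idealized traffic."""
    flops = 4 * rows * n_g ** 2                           # flop count quoted for zherk
    traffic = bytes_per_entry * (rows * n_g + n_g * n_g)  # read A once, write C once
    return flops / traffic

N_A, N_L = 108, 121       # sizes of the AuAg test case
N_G = 10_000              # assumption: a value inside the 1,000 to 50,000 range
intensity = zherk_intensity(N_A * N_L, N_G)
```

An intensity of this order (far more flops than bytes moved) is what makes the kernel compute-bound and a natural candidate for offloading.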

30 Porting to heterogeneous architectures: additional code?
- 3 wrappers (zgemm, zherk, zher2k)
- Init and cleanup of the CUDA runtime and devices: get the number of devices, initialize devices, create handles, ...; destroy handles, free devices, ...
- Allocate data in page-locked memory: avoids hidden copies and gives fast data transfer
Only around 100 lines of additional code.

31 Experimental results
Test case 2: AuAg ($N_A = 108$, $N_L = 121$, $N_G$ = [ ])
[Figure: time (seconds) versus Kmax for CPU, CPU + 1 GPU, and CPU + 2 GPUs; annotated speedups: 3.03x and 4.96x.]
Platform (Sandy Bridge): CPU E5-2650, 2 x 8 cores, 2.0 GHz, 64 GB RAM, 2 NVIDIA Tesla K20X. Peak performance: 256 GFlop/s + 2 x 1.3 TFlop/s.

32 Conclusions
- Modernizing the algorithmic structure of legacy code is critical
- Tailoring algorithms to application-specific tasks
- Exploiting knowledge from domain-specific applications
- Layered design built on top of standardized libraries
- Increase in performance
- (Almost) free lunch: performance portability

33 References
- J. Winkelmann, M. Berljafa, P. Springer, and E. Di Napoli. ChASE: a Chebyshev Accelerated Subspace iteration Eigensolver for sequences of eigenvalue problems. In preparation.
- M. Berljafa, D. Wortmann, and E. Di Napoli. An Optimized and Scalable Eigensolver for Sequences of Eigenvalue Problems. Concurrency Computat.: Pract. Exper. 27 (2015).
- E. Di Napoli, E. Peise, M. Hrywniak, and P. Bientinesi. High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods. Comp. Phys. Comm. 211 (2017).
- D. Fabregat-Traver, D. Davidović, M. Höhnerbach, and E. Di Napoli. Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPW methods. Lecture Notes in Computer Science, High-Performance Scientific Computing (2016).

34 Thank you
