Improving the performance of applied science numerical simulations: an application to Density Functional Theory

1 Improving the performance of applied science numerical simulations: an application to Density Functional Theory Edoardo Di Napoli Jülich Supercomputing Center - Institute for Advanced Simulation Forschungszentrum Jülich GmbH Aachen Institute for Advanced Study in Computational Engineering Science RWTH Aachen University Columbia University, March 5th 2013 Edoardo Di Napoli (AICES/JSC) An example of reverse simulation Columbia University, March 5th / 40

3 Motivation and Goals
Two Facts
- Algorithmic libraries are often used as black boxes.
- Very little information coming from the physics of the specific application is exploited by scientific computing codes.
One Objective
Exploiting physical information extracted from the simulations in order to:
- increase the performance of large legacy codes;
- improve the computational paradigm on which the codes are based.

4 Computational science: Traditional approach
Science-driven workflow: Mathematical Model ⇒ Algorithmic Structure ⇒ Simulations ⇒ Physics
- Math/Algorithm: often a literal translation of the mathematical model into a combination of algorithmic tasks, usually designed in abstraction from other considerations (kernel optimization, task harmonization, etc.);
- Algorithm/Sim: the simulation is an end product, the successful implementation of an algorithm as a series of machine-accessible operations resulting in the computation of meaningful physical quantities;
- Math/Sim: the mathematical model and the simulation are considered as practically disjoint.

5 Computational science: Modern approach
Science-driven + high-performance workflow: Mathematical Model, Algorithmic Structure, Simulations ⇒ Physics
- Math/Algorithm: the mathematical model undergoes substantial reformulation in order to cast operations in terms of optimized algorithmic structures. At the same time, concerns regarding numerical stability and accuracy are carefully taken into consideration and included in the reformulation;
- Algorithm/Sim: simulations can become a tool to uncover hidden aspects of physical phenomena missed by the mathematical model. Information extracted from a careful analysis of the computations can greatly influence the choice of the optimal algorithm and how it should be used;
- Math/Sim: some information extracted from a careful analysis of the simulations can influence the algorithmic choice so strongly as to suggest a complete revolution in the formulation of the mathematical model.

6 Reverse Simulation: Math, Algorithm/Sim
[Diagram: within the FLAPW paradigm, the mathematical model and the algorithmic structure drive a series of simulations (SIM) that produce total energy, band energy gap, conductivity, forces, etc.; feedback from the analysis of the correlated eigenproblems {P_i} flows back to the mathematical model and the algorithmic structure.]

7 Sequences of eigenproblems in Density Functional Theory: A glimpse of the results
[Figure, Speedup: speed-up vs. iteration index for Nev=256 and three distinct NaCl matrix sizes.]
[Figure, Scalability: BChFSI strong scalability, speed-up vs. number of threads for AuAg n=8970, NaCl n=6217, and CaFeAs n=2612, against the ideal speed-up.]
[Table, Efficiency: percentage of peak performance per self-consistent cycle.]
- Fast: a great improvement of the execution time
- Efficient: an extensive use of computing resources
- Scalable: ideal use of parallel computing architectures

8 Outline
1. Stating the problem: how sequences of generalized eigenproblems arise in all-electron computations
2. Algorithmic choice ⇐ Eigenvectors angle evolution
3. Tailoring the algorithm: Chebyshev Filtered Sub-space Iteration method (ChFSI)
4. Physics-driven performance boost: numerical results

10 The Foundations: Investigative framework
Quantum Mechanics and its ingredients

Hamiltonian:
$$H = -\frac{\hbar^2}{2m}\sum_{i=1}^{n}\nabla_i^2 \;-\; \sum_{i=1}^{n}\sum_{\alpha}\frac{Z_\alpha}{|x_i - a_\alpha|} \;+\; \sum_{i<j}\frac{1}{|x_i - x_j|}$$

Wavefunction: $\Phi(x_1;s_1,x_2;s_2,\ldots,x_n;s_n)$, with $\Phi : \left(\mathbb{R}^3 \times \{\pm\tfrac{1}{2}\}\right)^n \to \mathbb{R}$, a high-dimensional anti-symmetric function that describes the orbitals of atoms and molecules. In the Born-Oppenheimer approximation, it is the solution of the Electronic Schrödinger Equation
$$H\,\Phi(x_1;s_1,x_2;s_2,\ldots,x_n;s_n) = E\,\Phi(x_1;s_1,x_2;s_2,\ldots,x_n;s_n)$$

11 Density Functional Theory (DFT)
Hohenberg-Kohn theorem (1964): there is a one-to-one correspondence $n(r) \leftrightarrow V_0(r)$, so that $V_0(r) = V_0(r)[n]$, and there exists a functional $E[n]$ such that $E_0 = \min_n E[n]$.
1. $\Phi(x_1;s_1,x_2;s_2,\ldots,x_n;s_n) \Longrightarrow \Lambda_{i,a}\,\phi_a(x_i;s_i)$
2. Charge density: $n(r) = \sum_a f_a\,|\phi_a(r)|^2$
3. In the Schrödinger equation the exact Coulomb interaction is substituted with an effective potential $V_0(r) = V_I(r) + V_H(r) + V_{xc}(r)$
The high-dimensional Schrödinger equation translates into a set of coupled, non-linear, low-dimensional, self-consistent Kohn-Sham (KS) equations: for all $a$, solve
$$\hat H_{KS}\,\phi_a(r) = \left(-\frac{\hbar^2}{2m}\nabla^2 + V_0(r)\right)\phi_a(r) = \varepsilon_a\,\phi_a(r)$$

12 Kohn-Sham scheme and DFT: Self-consistent cycle
Typically this set of equations is solved using an iterative self-consistent cycle:
1. Initial guess $n_{init}(r)$
2. Compute the KS potential $V_0(r)[n]$
3. Solve the KS equations $\hat H_{KS}\,\phi_a(r) = \varepsilon_a\,\phi_a(r)$
4. Compute the new density $n(r) = \sum_a f_a\,|\phi_a(r)|^2$
5. Converged? Test $|n^{(\ell+1)} - n^{(\ell)}| < \eta$. If No, go back to step 2; if Yes, OUTPUT: energy, forces, ...
In practice this iterative cycle is much more computationally challenging and requires some form of broadly defined discretization. A common way of discretizing the KS equations is to expand the wave functions $\phi_a(r)$ on a basis set.

13 Introduction to FLAPW: Linearized augmented plane wave (LAPW) in a Muffin Tin (MT) geometry
LAPW expansion of the wavefunction, with $k$ the Bloch vector and $\nu$ the band index:
$$\phi_{k,\nu}(r) = \sum_{|G+k| \le G_{max}} c^{G}_{k,\nu}\,\psi_G(k,r)$$
Basis functions:
$$\psi_G(k,r) = \begin{cases} \dfrac{1}{\sqrt{\Omega}}\, e^{i(k+G)r} & \text{Interstitial (I)} \\[2mm] \sum_{l,m}\left[a^{\alpha,G}_{lm}(k)\,u^{\alpha}_{l}(r) + b^{\alpha,G}_{lm}(k)\,\dot u^{\alpha}_{l}(r)\right] Y_{lm}(\hat r_\alpha) & \text{Muffin Tin (MT) } \alpha \end{cases}$$
Boundary conditions: continuity of the wavefunction and of its derivative at the MT boundary determines $a^{\alpha,G}_{lm}(k)$ and $b^{\alpha,G}_{lm}(k)$.

14 Introduction to FLAPW: Linearized augmented plane wave (LAPW) in a Muffin Tin (MT) geometry ... cont.
The radial functions $u^{\alpha}_{l}(r)$ are the solutions of the atomic Schrödinger equations where the potential retains only the spherical part:
$$\hat H^{\alpha}_{sph}\,u^{\alpha}_{l}(r) = \left[-\frac{\hbar^2}{2m}\frac{\partial^2}{\partial r^2} + \frac{\hbar^2}{2m}\frac{l(l+1)}{r^2} + V^{\alpha}_{sph}(r)\right] r\,u^{\alpha}_{l}(r) = E_l\,u^{\alpha}_{l}(r)$$
where $\dot u^{\alpha}_{l}$ can be determined from the energy derivative of the Schrödinger-like equation
$$\hat H^{\alpha}_{sph}\,\dot u^{\alpha}_{l} = E_l\,\dot u^{\alpha}_{l} + u^{\alpha}_{l}$$
- $E_l$ is a parameter and it is predetermined by optimization;
- like the plane wave case, there is a cut-off $G_{max}$;
- unlike the plane waves, there is also a cut-off (for open-shell atoms) on $l_{max}$ (typically 8);
- unlike the plane waves, this set of basis functions is overcomplete, so it is NOT an orthogonal set;
- a new set of basis functions is computed at each self-consistent cycle.

15 Introduction to FLAPW: Generalized eigenvalue problems
The expansion of $\phi_{k,\nu}(r)$ in terms of the LAPW basis set is then inserted in the KS equations
$$\psi^{*}_{G}(k,r)\sum_{G'} \hat H_{KS}\,c^{G'}_{k,\nu}\,\psi_{G'}(k,r) = \lambda_{k\nu}\,\psi^{*}_{G}(k,r)\sum_{G'} c^{G'}_{k,\nu}\,\psi_{G'}(k,r)$$
and, by defining the matrix entries for the left and right hand side respectively as the Hamiltonian $A_k$ and overlap $B_k$ matrices,
$$[A_k\;\; B_k]_{GG'} = \sum_{\alpha}\int dr\; \psi^{*}_{G}(k,r)\,[\hat H_{KS}\;\; \hat 1]\,\psi_{G'}(k,r)$$
one arrives at generalized eigenvalue equations parametrized by $k$:
$$P_k : \sum_{G'} (A_k)_{GG'}\,c^{G'}_{k\nu} = \lambda_{k\nu}\sum_{G'} (B_k)_{GG'}\,c^{G'}_{k\nu}$$
or, in compact form, $A_k x_i = \lambda_i B_k x_i$ with $x_i \doteq c^{G'}_{k\nu}$.

16 Discretized Kohn-Sham scheme: Self-consistent cycle
1. Initial guess $n_{start}(r)$
2. Compute the KS potential $V_0(r)[n]$
3. Solve a set of eigenproblems $P_{k_1} \ldots P_{k_N}$
4. Compute the new density $n(r) = \sum_{k,\nu} f_{k,\nu}\,|\phi_{k,\nu}(r)|^2$
5. Converged? Test $|n^{(\ell+1)} - n^{(\ell)}| < \eta$. If No, go back to step 2; if Yes, OUTPUT: energy, ...
Observations:
1. every $P_k : Ax = \lambda Bx$ is a generalized eigenvalue problem;
2. A and B are DENSE and Hermitian (B is also positive definite);
3. the $P_k$ with different k index have different sizes and are independent from each other;
4. k = 1: ; i = 1:20-50

17 A (mostly) converging process: The traditional approach
Initialization: at every iteration of the cycle all numerical quantities are entirely re-computed.
- A new set of basis functions is assembled at each iteration cycle (Math/Sim)
- The entries of the matrices A and B are re-initialized at each iteration cycle (Algorithm/Sim)
Eigenproblems:
- Each eigenproblem $P^{(\ell)} : A^{(\ell)}x = \lambda B^{(\ell)}x$ at iteration $\ell$ is solved in total independence from the eigenproblem $P^{(\ell-1)}$ of the previous iteration (Algorithm/Sim)
Convergence: starting with an electron density close enough to the one minimizing the energy $E_0$, convergence is likely to be reached within a few tens of iterations. Unfortunately there is no general theorem establishing the conditions for convergence, so the number of steps is still uncertain.
- The number of steps depends on the material and on the initial guess (area of convergence) (Math/Algorithm)
- The difference between successive densities $n(r)$ undergoes relatively small decreasing oscillations (Math/Algorithm & Sim)

20 FLAPW self-consistent cycle: Computational costs
[Diagram: the FLAPW self-consistent cycle. Stages shown: initialization with a spherical $V_0(r)$ and $n_0(r)$; calculation of the full potential $V(r)[n]$ (pseudo-charge method); determination of the LAPW wavefunctions $\psi_G(k,r)$ (coefficients $a^{G}_{lm}$ and $b^{G}_{lm}$; 3-D stars, 2-D stars, lattice harmonics); generation of the matrices $A_k = \langle\psi(k)|H|\psi(k)\rangle$ and $B_k = \langle\psi(k)|S|\psi(k)\rangle$; distribution and setup of the eigenproblems; solution of the generalized eigenproblems $A_k x = \lambda B_k x$; eigenpairs selection and Fermi energy calculation; calculation of the new charge density $n(r)$; convergence check; mixing of $n(r)$ for the next cycle. The diagram annotates stages with time-usage shares of 40%, 5-10%, and <1%.]

21 Sequences of eigs: an ALGORITHM/SIM case
Sequences of eigenproblems
- Consider the set of generalized eigenproblems $P^{(1)} \ldots P^{(\ell)}\,P^{(\ell+1)} \ldots P^{(N)}$ not as a set of disjoint problems $(P)_N$, but as a sequence $\{P^{(\ell)}\}$;
- Could this sequence $\{P^{(\ell)}\}$ of eigenproblems evolve following a convergence pattern in line with the convergence of $n(r)$?
REVERSE SIMULATION method
- numerical simulations analyzed employing a parameter-based inverse problem method;
- collected data on deviation angles between eigenvectors of adjacent eigenproblems;
- identified evolutions of eigenvectors along the sequence.

22 Angle evolution at fixed k
Example: a metallic compound at fixed k.
[Figure: evolution of the subspace angle for the eigenvectors of k-point 1 and the lowest 75 eigs of AuAg. y-axis: angle between eigenvectors of adjacent iterations (logarithmic scale); x-axis: iterations (2 → 22).]

23 Correlation and its exploitation
There is a correlation between successive eigenvectors $x^{(\ell-1)}$ and $x^{(\ell)}$:
- Angles decrease monotonically with some oscillation
- The majority of angles are small after the first few iterations
Note: the correlation does not follow from the mathematical model; it emerges from the numerical analysis of the simulation.
ALGORITHM/SIM: the stage is favorable to an iterative eigensolver where the eigenvectors of $P^{(\ell-1)}$ are fed to the solver for $P^{(\ell)}$.
Next stages of the investigation:
1. Develop a block iterative eigensolver that can exploit the correlation
2. Investigate whether approximate eigenvectors can speed up the iterative solver of choice
3. Understand whether such an iterative method can be competitive with direct methods for dense problems
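The angle measurement behind this correlation can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a standard (not generalized) symmetric eigenproblem, a toy diagonal matrix with a small random perturbation standing in for two adjacent iterations, and it pairs eigenvectors by eigenvalue order. The helper name `eigvec_angles` is hypothetical and not taken from the talk.

```python
import numpy as np

def eigvec_angles(X_prev, X_curr):
    """Angle (radians) between matching columns of two eigenvector blocks."""
    # Normalize columns so that the inner product is a cosine.
    Xp = X_prev / np.linalg.norm(X_prev, axis=0)
    Xc = X_curr / np.linalg.norm(X_curr, axis=0)
    # The absolute value removes the arbitrary sign/phase of an eigenvector.
    cosines = np.abs(np.sum(Xp.conj() * Xc, axis=0))
    return np.arccos(np.clip(cosines, 0.0, 1.0))

# Toy stand-ins for the matrices of two adjacent iterations (assumption).
rng = np.random.default_rng(0)
n, nev = 50, 5
A_prev = np.diag(np.arange(1.0, n + 1))
S = rng.standard_normal((n, n))
A_curr = A_prev + 1e-3 * (S + S.T) / 2   # small symmetric perturbation

_, X_prev = np.linalg.eigh(A_prev)
_, X_curr = np.linalg.eigh(A_curr)

# Angles for the lowest nev eigenvectors, paired by eigenvalue order.
angles = eigvec_angles(X_prev[:, :nev], X_curr[:, :nev])
print(angles)
```

Small angles mean the previous eigenvectors are good starting guesses, which is the property an iterative solver can exploit.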

26 Algorithmic digression
Direct solvers. Iterative solvers.
For eigenvalues ordered as $\lambda_1 > \lambda_2 > \lambda_3 > \ldots$ with $A x_j = \lambda_j x_j$:
$$v = \sum_j \gamma_j x_j, \qquad A v = \sum_j \lambda_j \gamma_j x_j$$
$$A^k v = \sum_j \lambda_j^k \gamma_j x_j = \lambda_1^k\left[\gamma_1 x_1 + \sum_{j \ge 2}\left(\frac{\lambda_j}{\lambda_1}\right)^k \gamma_j x_j\right]$$
Rate of convergence: set by the magnitude of the ratios $\lambda_j/\lambda_1$.
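The expansion on this slide is the power method in miniature. A minimal sketch, assuming a small diagonal test matrix whose dominant eigenvalue is known in advance; the function name `power_iteration` and the test data are illustrative choices, not from the talk.

```python
import numpy as np

def power_iteration(A, v0, num_steps):
    """Repeatedly apply A and normalize; returns (Rayleigh quotient, vector)."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(num_steps):
        w = A @ v
        v = w / np.linalg.norm(w)   # normalization keeps the iterate bounded
    return v @ A @ v, v

# Eigenvalues 4 > 2 > 1: the error shrinks roughly like (2/4)^k.
A = np.diag([4.0, 2.0, 1.0])
lam, v = power_iteration(A, np.ones(3), 50)
print(lam, v)
```

With a ratio of 1/2, each step halves the unwanted components, which is why the convergence rate is tied to the eigenvalue ratios.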

27 Algorithmic digression
Direct solvers. Iterative solvers.
Dense matrices. Sparse matrices.

32 Chebyshev Filtered Sub-space Iteration method (ChFSI)
Our iterative algorithm has to comply with two essential properties:
1. The ability to receive as input a sizable set $Z_0$ of approximate eigenvectors;
2. The capacity to solve simultaneously for a substantial portion of eigenpairs.
ChFSI constitutes the natural choice:
- it accepts the full set of multiple starting vectors;
- it avoids stalling when facing small clusters of eigenvalues;
- when augmented with polynomial accelerators it has a much faster convergence rate;
- converged eigenvectors can be easily locked.

33 ChFSI pseudocode
INPUT: Hamiltonian H, approximate eigenpairs $(\Lambda, Z_0)$, TOL, DEG.
OUTPUT: Wanted eigenpairs $(\Lambda, W)$.
1. Lanczos step. Identify the bounds for the interval to be filtered out.
REPEAT UNTIL CONVERGENCE:
2. Chebyshev filter. Filter a block of vectors $W \leftarrow Z_0$.
3. QR decomposition. Re-orthogonalize the vectors output by the filter; $W = QR$.
4. Compute the Rayleigh quotient $G = Q^{\dagger} H Q$.
5. Compute the primitive Ritz pairs $(\Lambda, Y)$ by solving $GY = Y\Lambda$.
6. Compute the approximate Ritz pairs $(\Lambda, W \leftarrow QY)$.
7. Check which of the Ritz vectors have converged.
8. Deflate and lock the converged vectors.
END REPEAT

34 The core of the algorithm: Chebyshev filter
Chebyshev polynomials
The Chebyshev polynomial $C_m$ of the first kind of order $m$ is defined as
$$C_m(x) = \begin{cases} \cos(m \arccos(x)), & x \in [-1,1]; \\ \cosh(m\,\mathrm{arccosh}(x)), & |x| > 1. \end{cases}$$
Three-term recurrence relation:
$$C_{m+1}(x) = 2x\,C_m(x) - C_{m-1}(x); \qquad m \in \mathbb{N}, \quad C_0(x) = 1, \quad C_1(x) = x$$
[Figure: plots of the Chebyshev polynomials of degree 5, 10, 15, and 20 as functions of x.]
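The recurrence and the closed forms above can be checked against each other numerically. A minimal sketch in plain Python; the function name `chebyshev` and the sample points are illustrative choices.

```python
import math

def chebyshev(m, x):
    """Evaluate C_m(x) with the recurrence C_{m+1} = 2x C_m - C_{m-1}."""
    if m == 0:
        return 1.0
    c_prev, c_curr = 1.0, x          # C_0 and C_1
    for _ in range(m - 1):
        c_prev, c_curr = c_curr, 2.0 * x * c_curr - c_prev
    return c_curr

# Inside [-1, 1] the polynomial oscillates between -1 and 1.
inside = chebyshev(5, 0.3)
closed_inside = math.cos(5 * math.acos(0.3))

# Outside [-1, 1] it grows rapidly with the degree.
outside = chebyshev(10, 1.2)
closed_outside = math.cosh(10 * math.acosh(1.2))

print(inside, closed_inside)
print(outside, closed_outside)
```

The contrast between bounded oscillation inside the interval and fast growth outside it is what makes these polynomials useful as filters.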

35 The core of the algorithm: Chebyshev filter
The basic principle
Theorem. Let $|\gamma| > 1$ and let $\mathbb{P}_m$ denote the set of polynomials of degree smaller than or equal to $m$. Then the extremum
$$\min_{p \in \mathbb{P}_m,\; p(\gamma)=1}\;\max_{t \in [-1,1]} |p(t)|$$
is reached by
$$p_m(t) \doteq \frac{C_m(t)}{C_m(\gamma)}.$$
A generic vector $v = \sum_{i=1}^{n} s_i x_i$ is very quickly aligned in the direction of the eigenvector corresponding to the extremal eigenvalue $\lambda_1$:
$$v^{m} = p_m(A)\,v = \sum_{i=1}^{n} s_i\,p_m(A)\,x_i = \sum_{i=1}^{n} s_i\,p_m(\lambda_i)\,x_i = s_1 x_1 + \sum_{i=2}^{n} s_i\,\frac{C_m\!\left(\frac{\lambda_i - c}{e}\right)}{C_m\!\left(\frac{\lambda_1 - c}{e}\right)}\,x_i \;\sim\; s_1 x_1$$
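A worked instance may help gauge how strong the damping is. The values $\gamma = 1.25$ and $m = 10$ are chosen here purely for illustration and do not come from the talk.

```latex
\begin{align*}
\mathrm{arccosh}(1.25) &= \ln\!\left(1.25 + \sqrt{1.25^2 - 1}\right) = \ln(1.25 + 0.75) = \ln 2, \\
C_{10}(1.25) &= \cosh(10 \ln 2) = \frac{2^{10} + 2^{-10}}{2} \approx 512.0005, \\
\max_{t \in [-1,1]} |p_{10}(t)| &= \frac{\max_{t \in [-1,1]} |C_{10}(t)|}{C_{10}(1.25)} = \frac{1}{C_{10}(1.25)} \approx 1.95 \times 10^{-3}.
\end{align*}
```

Under these assumptions a single degree-10 filter application shrinks every component inside the filtered interval by a factor of about 512 relative to the component at $\gamma$.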

36 The core of the algorithm: Chebyshev filter
The pseudocode
A simple linear transformation maps $[-1,1] \to [\alpha,\beta] \subset \mathbb{R}$, with
$c = \frac{\beta + \alpha}{2}$ the center of the interval and $e = \frac{\beta - \alpha}{2}$ the width of the interval.
INPUT: Hamiltonian H; approximate eigenvectors $Z_0$; lowest eigenvalue $\lambda_1$; parameters for the interval c, e; DEG.
OUTPUT: Filtered vectors W.
1. $\sigma_1 \leftarrow e/(\lambda_1 - c)$
2. $Z_1 \leftarrow \frac{\sigma_1}{e}\,(H - c I_n)\,Z_0$
FOR: i = 1 → DEG − 1
3. $\sigma_{i+1} \leftarrow \dfrac{1}{2/\sigma_1 - \sigma_i}$
4. $Z_{i+1} \leftarrow \dfrac{2\sigma_{i+1}}{e}\,(H - c I_n)\,Z_i - \sigma_{i+1}\sigma_i\,Z_{i-1}$
END FOR.
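The filter pseudocode, combined with the outer loop from the ChFSI pseudocode, can be sketched as follows. This is an illustration under several assumptions rather than the implementation from the talk: the interval bounds `alpha`, `beta` and the lowest eigenvalue `lam1` are passed in by hand instead of being estimated by a Lanczos step, locking and deflation are omitted, the problem is in standard form, and the test matrix with eigenvalues 0, 1, ..., 39 is made up. The function names are hypothetical.

```python
import numpy as np

def chebyshev_filter(H, Z0, deg, lam1, c, e):
    """Apply C_deg((H - cI)/e) / C_deg((lam1 - c)/e) to the block Z0 (deg >= 1)."""
    sigma1 = e / (lam1 - c)
    sigma = sigma1
    Z_prev = Z0
    Z_curr = (sigma1 / e) * (H @ Z0 - c * Z0)                 # Z_1
    for _ in range(1, deg):
        sigma_next = 1.0 / (2.0 / sigma1 - sigma)
        Z_next = (2.0 * sigma_next / e) * (H @ Z_curr - c * Z_curr) \
            - sigma_next * sigma * Z_prev
        Z_prev, Z_curr, sigma = Z_curr, Z_next, sigma_next
    return Z_curr

def chfsi(H, Z0, lam1, alpha, beta, deg=10, tol=1e-8, max_iter=50):
    """Filtered subspace iteration without locking (simplification)."""
    c, e = (beta + alpha) / 2.0, (beta - alpha) / 2.0
    W = Z0
    for _ in range(max_iter):
        W = chebyshev_filter(H, W, deg, lam1, c, e)   # filter the block
        Q, _ = np.linalg.qr(W)                        # re-orthogonalize
        G = Q.conj().T @ H @ Q                        # Rayleigh quotient
        ritz_vals, Y = np.linalg.eigh(G)              # primitive Ritz pairs
        W = Q @ Y                                     # approximate Ritz vectors
        residuals = np.linalg.norm(H @ W - W * ritz_vals, axis=0)
        if np.all(residuals < tol):
            break
    return ritz_vals, W

# Made-up symmetric test matrix with eigenvalues 0, 1, ..., 39.
rng = np.random.default_rng(1)
n, nev = 40, 4
Q0, _ = np.linalg.qr(rng.standard_normal((n, n)))
H = Q0 @ np.diag(np.arange(n, dtype=float)) @ Q0.T
H = (H + H.T) / 2

# Keep the lowest 4 eigenvalues, filter out the interval [4, 39].
Z0 = rng.standard_normal((n, nev))
ritz_vals, W = chfsi(H, Z0, lam1=0.0, alpha=4.0, beta=39.0)
print(ritz_vals)
```

In the setting of the talk, `Z0` would hold the eigenvectors of the previous self-consistent iteration rather than random vectors, which is where the speed-up is expected to come from.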

38 Experimental tests setup
Solving with a ChFSI version implemented in C:
- Approximate vs. random vectors fed to ChFSI, against iteration index;
- Parallel ChFSI vs. direct methods, against iteration index.
Matrix sizes: 2,600 to 13,300.
B is in general almost singular (ill-conditioned). Examples: size(A) = 50 ⇒ κ(A) ≈ 10^4; size(A) = 500 ⇒ κ(A) ≈ 10^7.
We used the standard form for the problem: $Ax = \lambda Bx \;\longrightarrow\; \tilde A y = \lambda y$ with $\tilde A = L^{-1} A L^{-T}$ and $y = L^{T} x$.
Tests were performed on JUROPA using 1 node with 8 cores:
- 2 Intel Xeon 5570 (Nehalem-EP) quad-core processors, 2.93 GHz; 24 GB/node; theoretical peak performance per core = 11.71 GFLOPS;
- Minimum absolute tolerance of residuals r_x = ;
- All numerical data are MEDIAN values over 12 distinct measurements.
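The reduction to standard form mentioned on this slide can be sketched as follows, assuming that L is the Cholesky factor of the overlap matrix, $B = L L^{T}$. The matrices are small random stand-ins, the helper names `to_standard_form` and `back_transform` are hypothetical, and `np.linalg.solve` is used for brevity where a production code would use dedicated triangular solves.

```python
import numpy as np

def to_standard_form(A, B):
    """Return (A_tilde, L) with B = L L^H and A_tilde = L^{-1} A L^{-H}."""
    L = np.linalg.cholesky(B)
    tmp = np.linalg.solve(L, A)                   # L^{-1} A
    A_tilde = np.linalg.solve(L, tmp.conj().T)    # L^{-1} A L^{-H}, A Hermitian
    return A_tilde, L

def back_transform(L, Y):
    """Recover x from y = L^H x."""
    return np.linalg.solve(L.conj().T, Y)

# Random stand-ins (assumption): symmetric A, well-conditioned SPD B.
rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = (M + M.T) / 2
C = rng.standard_normal((n, n))
B = C @ C.T + n * np.eye(n)

A_tilde, L = to_standard_form(A, B)
lam, Y = np.linalg.eigh(A_tilde)      # standard problem
X = back_transform(L, Y)              # eigenvectors of A x = lambda B x

print(np.max(np.abs(A @ X - (B @ X) * lam)))
```

Note that an almost singular B, as described above, makes this reduction numerically delicate; the well-conditioned B used here sidesteps that issue.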

39 Speed-up for sequential ChFSI
Na15Cl14Li with increasing G_max = 3.0, 3.5, ...
[Figure: speed-up vs. iteration index for Nev=256 and three distinct matrix sizes (NaCl n=3893, NaCl n=6217, and a third NaCl size).]

40 Speed-up for sequential ChFSI
Au98Ag10 with increasing G_max = 3.0, ...
[Figure: speed-up vs. iteration index for Nev=972 and 2 distinct matrix sizes (AuAg n=5638 and a second AuAg size).]

41 Sequential to Parallel: what and how?
Fraction of computing time for sequential ChFSI, extracted from the AuAg system (n=8970, nev=972) at iteration 23:
- Chebyshev filter: 90%
- Rayleigh-Ritz: 6%
- Residuals convergence: 4%
- Lanczos: < 1%

42 Sequential to Parallel: what and how?
Speed-up of multi-threaded BLAS ChFSI vs. sequential ChFSI over 8 cores when fed approximate eigenvectors.
[Figure: speed-up of multi-threaded ChFSI over sequential ChFSI vs. iteration index, for AuAg n=8970 and one NaCl system.]

43 Sequential to Parallel: what and how?
Speed-up of sequential ChFSI and multi-threaded BLAS ChFSI.
[Figure: speed-up of approximate vs. random input vectors vs. iteration index, for one AuAg system run on multiple cores and on 1 core.]

44 Efficiency
Ratio of performance with respect to theoretical peak performance.
Table: Efficiency for Na15Cl14Li (n=9273). Columns: self-consistent cycle, Full Algorithm, Chebyshev filter. Each of the twelve rows pairs a percentage whose digits are lost with a legible one; the legible percentages, in order, are: 91.4%, 90.9%, 90.4%, 90.8%, 91.5%, 91.6%, 91.6%, 92.1%, 90.8%, 91.4%, 90.9%, 90.8%.

46 Recapping
1. Approximate vs. Random: sequential ChFSI achieves speed-ups in the range 1.5X to 3.5X; the multi-threaded version of ChFSI achieves speed-ups up to 5X;
2. Multi-threaded ChFSI: by just using the multi-threaded version of BLAS, ChFSI achieves speed-ups above 7X; the larger the eigenproblem, the better ChFSI performs.
3. Performance: extensive use of the BLAS library in the filter facilitates close to optimal performance.
UNANSWERED QUESTIONS:
- Can we do better on shared memory architectures? YES ⇒ OpenMP.
- Can we be faster than the direct methods? It depends on the percentage of the eigenspectrum sought after and on the iteration index.

47 Parallelizing the filter with OpenMP
Input: Hamiltonian H; vectors to be filtered Z; DEG, c, e.
Output: Filtered vectors W.
1: for i = 1 to DEG do
2:   Compute α. Compute β.
3:   W = αHZ + βW.
4:   SWAP(Z, W).
5: end for
6: SWAP(Z, W).
[Figure: matrix-matrix multiplication scheme: Hamiltonian × vectors to be filtered = filtered vectors.]

48 OpenMP ChFSI vs multi-threaded ChFSI
[Figure: time to completion for AuAg (n=13379), OpenMP ChFSI vs. multi-threaded BLAS ChFSI. y-axis: CPU time (seconds); x-axis: iteration index.]

49 Scalability
OpenMP ChFSI obtains almost ideal strong scalability: the larger the better.
[Figure: BChFSI strong scalability, speed-up vs. number of threads for AuAg n=8970, NaCl n=6217, and CaFeAs n=2612, against the ideal speed-up.]

50 OpenMP ChFSI vs Direct eigensolvers on shared memory architectures
Comparison with multi-threaded LAPACK MR3 and MKL multi-threaded LAPACK MR3.
[Figure: time for LAPACK + multi-threaded BLAS, MKL LAPACK + multi-threaded BLAS, and OpenMP ChFSI for AuAg (n=13379), solving for the lowest 7.3% of the eigenspectrum. y-axis: CPU time (seconds); x-axis: iteration index.]

51 OpenMP ChFSI vs Direct eigensolvers on shared memory architectures
Comparison with multi-threaded LAPACK MR3 and MKL multi-threaded LAPACK MR3.
[Figure: time for LAPACK + multi-threaded BLAS, MKL LAPACK + multi-threaded BLAS, and OpenMP ChFSI for NaCl (n=9273), solving for the lowest 2.8% of the eigenspectrum. y-axis: CPU time (seconds); x-axis: iteration index.]

52 Conclusions and future work
CONCLUSIONS
- Feeding the eigenvectors of $P^{(\ell-1)}$ to a block iterative solver like ChFSI speeds up the iterative solver for $P^{(\ell)}$;
- The improved use of computing resources, together with strong scalability, will give access to larger physical systems;
- The algorithmic structure of FLAPW needed to be re-thought: reverse simulation can substantially influence the computational paradigm of an application.
ONGOING AND FUTURE WORK
ChFSI can be extended and improved:
1. Finalizing the filter optimization by adjusting the degree of the polynomial so as to just achieve the required eigenvector residuals;
2. Ongoing work to parallelize ChFSI for distributed memory architectures ⇒ Elemental;
3. Parallelization of ChFSI for GPUs by using OpenACC directives.
Analysis of the structure of the entries of A and B across adjacent iterations seems to suggest an exploitation of low-rank updates for sequences of eigenproblems ⇒ Hierarchical matrices.

53 References
1. M. Berljafa and E. Di Napoli, A parallel Chebyshev Filtered Subspace Iteration optimized for LAPW-based methods, in preparation.
2. E. Di Napoli and M. Berljafa, Block iterative eigensolvers for sequences of correlated eigenvalue problems, submitted to Comp. Phys. Comm. [arxiv: ].
3. E. Di Napoli, P. Bientinesi, and S. Blügel, Correlation in sequences of generalized eigenproblems arising in Density Functional Theory, Comp. Phys. Comm. 183 (2012), pp , [arxiv: ].
