Making electronic structure methods scale: Large systems and (massively) parallel computing


Ville Havu, Department of Applied Physics, Helsinki University of Technology - TKK (Ville.Havu@tkk.fi)

Outline
Part I: Scaling to large systems of many atoms
- Localisation as a key to linear scaling
- Implications for the different tasks in FHI-aims
Part II: Scaling to large computers of many cores
- Basics of parallel computing
- Parallel solutions to the tasks in FHI-aims

Different Scalings: From Small to Large Systems
[Plot: time T versus system size N for a T = N and a T = N^3 method on a linear scale; the curves intersect at a crossover point below which the cubic-scaling method is faster.]

Different Scalings: Logarithmic Scale
[The same comparison on a logarithmic scale, where the crossover between the T = N and T = N^3 curves is visible over a wide range of N.]
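The crossover can be made concrete with two hypothetical timing models. The prefactors a and b below are assumptions chosen only to produce a visible crossover, not measured FHI-aims numbers; a minimal numpy sketch:

```python
import numpy as np

# Purely illustrative prefactors (assumptions): the O(N) method does more
# work per atom than the small-prefactor O(N^3) method does at small N.
a, b = 200.0, 0.01

N = np.arange(1, 501)
t_linear = a * N            # T = a N
t_cubic = b * N**3          # T = b N^3

# Crossover where a N = b N^3, i.e. N = sqrt(a / b)
print("analytic crossover: N =", np.sqrt(a / b))
print("first N where the linear method wins:", N[t_linear < t_cubic][0])
```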

What is Linear Scaling?
(N = N_atoms, N_electrons, or N_points)
Assume: 1. there are O(N) items of input, and 2. we want fewer than O(N) items of output.
If the complexity (time) to get the output from the input is O(N), the method is linear scaling.
A method can be linear scaling only if it uses O(1) operations per item of input.
Example: cubature in R^3.
Input: N points r_i, N weights w_i, N function values f_i = f(r_i).
Output: Σ_i w_i f_i ≈ ∫ f(r) dr. Trivially O(N).

Approaches Towards O(N) in Electronic Structure Theory
Key requirements: localisation, localisation, and localisation (in R^3 or in Fourier space).
Popular approaches:
1. Minimise the total energy directly using the density matrix (and Wannier functions), i.e. skip / localise the Kohn-Sham orbitals (SIESTA, CONQUEST, OpenMX, ONETEP)
2. Accelerate the calculation / use of the entries of h_ij and s_ij (Gaussian basis functions (GAUSSIAN, Q-Chem, TURBOMOLE), regular Cartesian grids, FFT methods (Quickstep), wavelets (BigDFT))
3. Use fast solvers for the Hartree potential (multigrid, fast multipole, wavelets)
4. Employ a divide & conquer framework (LS3DF)
S. Goedecker, Rev. Mod. Phys. 71, 1085 (1999)
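A minimal numpy sketch of the O(N) cubature example: one weighted function evaluation per grid point. The uniform grid, weights and Gaussian integrand are illustrative assumptions, not FHI-aims' atom-centred integration grid.

```python
import numpy as np

def integrate(points, weights, f):
    """O(N) cubature: sum_i w_i f(r_i), constant work per grid point."""
    return np.sum(weights * f(points))

# Example: integrate exp(-|r|^2) over a box with a uniform grid.
n = 20
x = np.linspace(-3.0, 3.0, n)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)   # N = n**3 points r_i
weights = np.full(len(points), (x[1] - x[0]) ** 3)     # w_i = cell volume

f = lambda r: np.exp(-np.sum(r**2, axis=-1))
print(integrate(points, weights, f))   # close to pi**1.5 = 5.568...
```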

Main Tasks for Scalability in FHI-aims
Key enabling technology: localisation from the basis functions.
1. Integration of h_ij and s_ij on partitioned atom-centred grids
2. Update of the electron density n(r) on the same grid
3. Solution of the Hartree potential using an atom-wise multipole decomposition
4. Solution of the eigenproblem h c_k = ε_k s c_k with a small-prefactor O(N^3) method (recall the crossover!)
Warning: naïve approaches lead to O(N^3) - O(N^4) scaling.

First Thing: Integration of h_ij and s_ij
1. Partition the grid into spatially localised batches.
2. Due to localisation, at each grid point eventually #{basis_fn ≠ 0} ~ O(1).
3. For each batch, consider only the non-zero basis functions.
Constant work per integration point → an O(N) algorithm.
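A schematic sketch of the batch idea in Python, not the FHI-aims implementation: the hypothetical `basis`, `v_eff` and `nonzero` arguments stand in for the real quantities, and only the potential term of h_ij is formed. The point is that each batch touches only a constant-size block of h.

```python
import numpy as np

def integrate_batch(h, points, weights, basis, v_eff, nonzero):
    """Add one batch's contribution to the Hamiltonian matrix h.

    `nonzero` lists the O(1) basis functions that do not vanish on this
    batch; only their block of h is touched (hypothetical data layout)."""
    # phi[a, p] = value of basis function nonzero[a] at point p
    phi = np.array([basis[i](points) for i in nonzero])
    # Integrand for <phi_i | v_eff | phi_j> on this batch (kinetic term omitted)
    integrand = phi * (weights * v_eff(points))   # broadcast over points
    block = integrand @ phi.T                     # constant-size matmul
    h[np.ix_(nonzero, nonzero)] += block
    return h

# Minimal usage with three 1D Gaussian "basis functions" and a flat potential
basis = [lambda r, c=c: np.exp(-(r - c) ** 2) for c in (0.0, 0.5, 5.0)]
v_eff = lambda r: np.ones_like(r)
points = np.linspace(-1.0, 1.5, 50)               # one spatially compact batch
weights = np.full(points.shape, points[1] - points[0])
h = np.zeros((3, 3))
integrate_batch(h, points, weights, basis, v_eff, nonzero=[0, 1])
print(h)   # only the 2x2 block of the two nearby functions is filled
```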

Second Thing: Update of the Electron Density
1. Stick with the same grid.
2. Construct the density matrix D_lk = Σ_j f_j c_lj c_kj.
3. Compute the new electron density as either
   a) n(r_i) = Σ_lk D_lk φ_l(r_i) φ_k(r_i), or
   b) n(r_i) = Σ_j f_j |ψ_j(r_i)|^2.
Again, there are only O(1) non-zero basis functions per point r_i → a) is an O(N) algorithm.
NB! The construction of D_lk in step 2 is formally an O(N^2) operation, but with a small prefactor.
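A minimal sketch of variant a), again with hypothetical `basis` and `nonzero` arguments and dense numpy arrays rather than the FHI-aims data structures: the density matrix is built from occupations and eigenvector coefficients, and the density on a batch uses only the locally non-zero basis functions.

```python
import numpy as np

def density_matrix(C, occ):
    """D_lk = sum_j f_j c_lj c_kj, with C[l, j] = c_lj and occ[j] = f_j."""
    return (C * occ) @ C.T

def density_on_batch(points, D, basis, nonzero):
    """n(r_i) = sum_lk D_lk phi_l(r_i) phi_k(r_i), restricted to the O(1)
    basis functions that are non-zero on this batch (illustrative layout)."""
    phi = np.array([basis[i](points) for i in nonzero])   # (m, Npts)
    D_block = D[np.ix_(nonzero, nonzero)]                  # (m, m)
    return np.einsum("lk,lp,kp->p", D_block, phi, phi)     # constant work/point

# Tiny usage with random coefficients
rng = np.random.default_rng(0)
C = rng.standard_normal((4, 2))          # 4 basis functions, 2 occupied states
D = density_matrix(C, occ=np.array([2.0, 2.0]))
basis = [lambda r, c=c: np.exp(-(r - c) ** 2) for c in (0.0, 0.4, 0.8, 5.0)]
pts = np.linspace(-0.5, 1.0, 10)
print(density_on_batch(pts, D, basis, nonzero=[0, 1, 2]))
```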

More Use for the Density Matrix
Instead of solving the eigenproblem it is possible to minimise the total energy E = 2 Tr[D h] subject to the idempotency condition D^2 = D.
This leads to linear scaling if
1. the minimisation of E converges in O(1) steps, and
2. D and h are sparse enough so that their product is O(N).
D. R. Bowler, T. Miyazaki and M. J. Gillan, Journal of Physics: Condensed Matter 14 (11), 2781 (2002)
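The reference above describes a specific linear-scaling minimiser; purely as an illustration of the idempotency constraint, the sketch below applies McWeeny purification, D → 3D² − 2D³, a standard way to push an approximate density matrix towards D² = D. Dense numpy arrays are used for clarity only; linear scaling would require sparse D and h and sparsity-preserving products.

```python
import numpy as np

def mcweeny_purify(D, steps=5):
    """Drive an approximate density matrix towards idempotency (D @ D = D)
    with McWeeny purification, D <- 3 D^2 - 2 D^3."""
    for _ in range(steps):
        D2 = D @ D
        D = 3.0 * D2 - 2.0 * D2 @ D
    return D

# Tiny check: start from a slightly perturbed projector onto 3 occupied states
rng = np.random.default_rng(1)
C = np.linalg.qr(rng.standard_normal((6, 3)))[0]   # orthonormal occupied space
D0 = C @ C.T + 1e-2 * rng.standard_normal((6, 6))
D0 = 0.5 * (D0 + D0.T)                             # keep it symmetric
D = mcweeny_purify(D0)
print(np.linalg.norm(D @ D - D))                   # ~ 0 after a few steps
```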

Third Thing: Calculate the Hartree Potential
Run the whole show on differences: δn(r) = n(r) − Σ_at n_free,at(|r − R_at|).
1. Compute the atom-wise multipole density components: δn_at,lm(r) = ∫ p_at(r') δn(r') Y_lm(Ω_at) dΩ, evaluated at r = |r' − R_at|.
2. Solve for the corresponding potential components v_at,lm(r).
3. Build the potential: v_es(r_i) = Σ_at Σ_lm v_at,lm(|r_i − R_at|) Y_lm(Ω_at), naïvely N_atoms × N_points work.
Potentially resource consuming: the higher multipoles must be cut off at shorter distances.

Fourth Thing: Solve the Eigenproblem
The problem: there are N_basis × N_states coefficients c_li to solve for → at least an O(N^2) algorithm.
Conventional direct solvers lead to an O(N^3) method.
An iterative solver could be O(N^2), but a full matrix-vector product is already O(N^2) → O(N^3) again.
What would be needed: a sparse matrix and an iterative algorithm that converges in a constant number of steps for all eigenpairs; severe problems with initialisation.
FHI-aims: the matrices are not very sparse → LAPACK / modified ScaLAPACK; an iterative LOBPCG solver is under investigation.

Conventional Direct Solvers for the Eigenproblem: O(N^3)
Implementation in LAPACK / ScaLAPACK:
1. Factorise s = L L^T and let A = L^-1 h L^-T to get the standard problem A x_k = ε_k x_k.
2. Use Householder transformations to reduce A to tridiagonal form: T = H_1 H_2 H_3 ... H_n A H_n ... H_3 H_2 H_1.
3. Solve the eigenproblem for T with one of
   a. bisection & inverse iteration
   b. QR iteration
   c. divide & conquer
   d. the MRRR method (Multiple Relatively Robust Representations)
4. Substitute back (twice) to get the eigenvectors c_k.
This is the method of choice in FHI-aims due to its parallel scalability.
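A minimal dense sketch of steps 1 and 4 in Python/SciPy (not the ScaLAPACK code path used by FHI-aims): the generalized problem is reduced to standard form via the Cholesky factor of s, solved with scipy.linalg.eigh (which performs the tridiagonalisation internally), and the eigenvectors are transformed back.

```python
import numpy as np
from scipy.linalg import cholesky, eigh, solve_triangular

def solve_ks_eigenproblem(h, s):
    """Solve h c_k = eps_k s c_k via s = L L^T, A = L^-1 h L^-T,
    A x_k = eps_k x_k, and the back-substitution c_k = L^-T x_k."""
    L = cholesky(s, lower=True)
    A = solve_triangular(L, h, lower=True)          # L^-1 h
    A = solve_triangular(L, A.T, lower=True).T      # (L^-1 h) L^-T
    eps, X = eigh(A)                                # standard symmetric problem
    C = solve_triangular(L, X, trans="T", lower=True)   # c_k = L^-T x_k
    return eps, C

# Check against SciPy's direct generalized solver on random test matrices
rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
h = M + M.T                      # symmetric "Hamiltonian"
s = M @ M.T + 5 * np.eye(5)      # symmetric positive definite "overlap"
eps, C = solve_ks_eigenproblem(h, s)
print(np.allclose(eps, eigh(h, s, eigvals_only=True)))   # True
```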

The Scalability Test Setting
1. Physical system: fully extended polyalanine chains.
2. Hardware: IBM Power6 575 at RZG; 205 nodes, 32 cores / node, 6560 cores total; 18 TB main memory (64 / 128 / 256 GB per node); InfiniBand interconnect.

Current FHI-aims Scaling: Light Settings, 32 cores
[Plot: time per s.c.f. iteration (s) versus the number of atoms in the polyalanine chain, with curves for the total time, integration, density update (via the density matrix, dm, and via the orbitals, orb), the Hartree potential, and the eigenvalue (EV) solution. The integration scales as about O(N^1.1), the other individual terms with exponents up to about O(N^1.8), and a crossover point separates the dm and orb density updates. In the large-system region the EV solution scales as O(N^2.7) and the total time as O(N^1.9).]

Current FHI-aims Scaling: Tight Settings, 32 cores
[Plot: time per s.c.f. iteration (s) versus the number of atoms in the polyalanine chain, with curves for the total time, density update (dm), Hartree potential, integration and EV solution; the fitted exponents lie between roughly O(N^1.2) and O(N^2.3). In the large-system region the EV solution scales as O(N^2.9) and the total time as O(N^2.0).]

Current FHI-aims Scaling: GFlops, Light Settings
[Plot: GFlops per s.c.f. iteration versus the number of atoms in the polyalanine chain for the density-matrix (dm) and orbital-based (orb) density updates, with fitted exponents between roughly O(N^1.3) and O(N^2.3) and a crossover point between the two variants.]

Current FHI-aims Scaling: Light Settings, 32 cores
[Plot: time per s.c.f. iteration (s) versus the number of atoms in the polyalanine chain for the total time, density update (dm), Hartree potential, integration and EV solution, compared against the absolute minimum (total).]

Part II: Scaling to large computers of many cores

Parallel Computing: Your Desktop
Why you need to care:
[IBM POWER processor roadmap (2008 IBM Corporation): clock frequencies level off in the few-GHz range (1+ GHz, 1.5+ GHz, 1.9 GHz, 3.5 GHz, 4+ GHz) while successive generations add more cores and threads, larger caches, AltiVec units, distributed switches and other advanced system features, with binary compatibility maintained across generations.]

Parallel Computing: Supercomputers (June 2009)
[Image courtesy of the National Center for Computational Sciences, Oak Ridge National Laboratory.]
With supercomputers it's getting even more complicated:
1. IBM's Roadrunner: 129,600 cores
2. Cray XT5 Jaguar: 150,152 cores
3. & 5. IBM's Blue Gene/P & /L: 294,912 & 212,992 cores
4. SGI's Altix Pleiades: 51,200 cores
See the TOP500 list for more info.

Theory of Parallel Computing
Recall "The Computer": a core executes a stream of instructions on a stream of data.
Four-field classification of parallelism (Flynn's taxonomy), by the number of instruction streams and data streams:
- SISD (single instruction, single data): serial computing
- SIMD (single instruction, multiple data): vector machines
- MISD (multiple instruction, single data): no such thing
- MIMD (multiple instruction, multiple data): cluster machines
In practice: SPMD, Single Program Multiple Data.
M. Flynn, IEEE Trans. Comput. C-21, 948 (1972)

Theory of Parallel Computing
Standard programming layer: the Message Passing Interface (MPI).
[Diagram: several cores connected by a communication network.]
Two main modes of communication:
1. Collective communication
2. Point-to-point communication
W. Gropp, E. Lusk, A. Skjellum: Using MPI, MIT Press (1999)

MPI Communication
Task #0 ... sends 42 to ... task #1:
Task #0: 1. post MPI_Send, 2. copy the data to a buffer, 3. send the data.
Task #1: 1. post MPI_Recv, 2. receive the data, 3. copy it into the destination, which now holds 42.
The compiler sees only the posted MPI_Send / MPI_Recv calls; the buffering and the actual transfer happen inside the MPI library.
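A minimal sketch of the same two-task exchange, assuming mpi4py and an MPI runtime are available (the file name in the comment is illustrative):

```python
# Run with:  mpiexec -n 2 python send42.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send(42, dest=1, tag=0)          # task #0 posts the send
    print("task 0 sent 42")
elif rank == 1:
    value = comm.recv(source=0, tag=0)    # task #1 posts the receive
    print("task 1 received", value)       # -> 42
```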

MPI Today
Pros:
+ Uniform layer for programmers
+ Can run on a large variety of platforms
+ Tested and bug-free
Cons:
- Opaque to the compiler, so no optimisation is possible
- Fragmented global address space
- Implementation is vendor dependent
- Lots of platform-dependent parameters (e.g. buffer size)
Future replacements: Co-Array Fortran / Unified Parallel C (already here); X10, Chapel, and Fortress (still in the future). All with a partitioned global address space.

Example: Parallel Integration of h_ij
Recall the grid and the batches:
1. Each task integrates its own set of batches → h_ij^(id)
2. Collective communication to obtain h_ij = Σ_id h_ij^(id)
Same code on different sets of points: SPMD.
Similarly: 1. update of the electron density, 2. update of the Hartree potential.
N.B.: the grid batches and their distribution define a parallel iterator over the grid.
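A minimal sketch of step 2, assuming mpi4py: each task produces a partial matrix for its batches (here just a rank-dependent dummy contribution), and one collective Allreduce yields the full h_ij on every task.

```python
# Run with:  mpiexec -n 4 python integrate_parallel.py   (illustrative name)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_basis = 8

# Stand-in for "integrate my set of batches": a rank-dependent symmetric
# dummy contribution h_ij^(id), not a real integration.
rng = np.random.default_rng(comm.Get_rank())
m = rng.standard_normal((n_basis, n_basis))
h_local = m + m.T

h_total = np.empty_like(h_local)
comm.Allreduce(h_local, h_total, op=MPI.SUM)   # h_ij = sum_id h_ij^(id)
print(comm.Get_rank(), np.trace(h_total))      # same value on every task
```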

Parallel Dense Linear Algebra
Standard solution: ScaLAPACK / PBLAS / BLACS
- Provides the same functionality as LAPACK / BLAS
- Uses a block-cyclic distribution of the matrix elements to optimise load balancing and cache utilisation
[Diagram: the matrix elements h_11, h_12, h_13, h_21, h_22, h_31, ... mapped onto the task numbers of a grid of 4 tasks.]
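A small sketch of the mapping behind a 2D block-cyclic distribution (the layout ScaLAPACK uses); the process-grid shape, block size and row-major task numbering are illustrative choices, not a particular BLACS configuration.

```python
import numpy as np

def owner(i, j, block=2, prow=2, pcol=2):
    """Task (0-based) of a prow x pcol process grid that owns matrix
    element (i, j) under a 2D block-cyclic distribution with `block`."""
    return ((i // block) % prow) * pcol + (j // block) % pcol

# Ownership map of an 8x8 matrix on a 2x2 process grid, block size 2
n = 8
print(np.array([[owner(i, j) for j in range(n)] for i in range(n)]))
```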

Parallel Dense Linear Algebra
Example: PxGEMM
( A0 A1 ; A2 A3 ) ( B0 B1 ; B2 B3 ) = ( C0 C1 ; C2 C3 ), with one block per task (task #0, task #1, task #2, task #3).
Computed in two stages:
C0 = A0 B0 + A1 B2
C1 = A0 B1 + A1 B3
C2 = A2 B0 + A3 B2
C3 = A2 B1 + A3 B3
Each stage: only one send per task, point-to-point communication.
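A minimal numpy check of the block algebra the four tasks evaluate; the communication is omitted (everything is local here), and the block size is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
b = 3                                     # illustrative block size
A = rng.standard_normal((2 * b, 2 * b))
B = rng.standard_normal((2 * b, 2 * b))

blk = lambda M, r, c: M[r * b:(r + 1) * b, c * b:(c + 1) * b]
A0, A1, A2, A3 = blk(A, 0, 0), blk(A, 0, 1), blk(A, 1, 0), blk(A, 1, 1)
B0, B1, B2, B3 = blk(B, 0, 0), blk(B, 0, 1), blk(B, 1, 0), blk(B, 1, 1)

# Stage 1 + stage 2 contributions, one block per task
C0 = A0 @ B0 + A1 @ B2
C1 = A0 @ B1 + A1 @ B3
C2 = A2 @ B0 + A3 @ B2
C3 = A2 @ B1 + A3 @ B3

C = np.block([[C0, C1], [C2, C3]])
print(np.allclose(C, A @ B))              # True
```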

Measures of Parallel Performance
1. Speedup: S_p = T_1 / T_p (optimally S_p = p)
2. Efficiency: E_p = S_p / p (optimally E_p = 1)
Amdahl's law, i.e. The Unfortunate Law of Diminishing Returns:
Assume that a proportion α of the program can be run in parallel; then (1 − α) is the serial part of the code, and
S_p ≤ ( (1 − α) + α/p )^-1 → (1 − α)^-1 as p → ∞.
[Plot: S_p versus p for α = 0.50, 0.75 and 0.95; even for α = 0.95 the speedup saturates at 20.]
G. Amdahl, AFIPS Conference Proceedings 30 (1967)
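A minimal sketch of Amdahl's bound for the parallel fractions shown in the plot; the core counts are arbitrary example values.

```python
import numpy as np

def amdahl_speedup(alpha, p):
    """S_p = 1 / ((1 - alpha) + alpha / p) for parallel fraction alpha."""
    return 1.0 / ((1.0 - alpha) + alpha / np.asarray(p, dtype=float))

p = np.array([1, 8, 64, 512, 4096])
for alpha in (0.50, 0.75, 0.95):
    print(alpha, np.round(amdahl_speedup(alpha, p), 1))
# alpha = 0.95 saturates at 1 / (1 - 0.95) = 20, no matter how many cores
```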

FHI-aims: Speedup on IBM's BlueGene: Ala100, α-helical
[Plot: speedup of one s.c.f. iteration (total time) versus the number of cores, compared with the ideal Speedup = p line.] With R. Johanni (RZG).

FHI-aims on IBM's BlueGene: Scaling to Many Cores
[Plot: time for one s.c.f. iteration (s) versus the number of cores for the total time, density update, Hartree potential, integration and EV solution, compared with a linear reference line; at the largest core counts only a 300 x 300 matrix remains per task.] With R. Johanni (RZG).

FHI-aims: Parallel Scalability
Overall current status:
1. Optimised communication: up to thousands of processors on BlueGene.
2. Non-optimised communication: up to hundreds of processors on Power6, Cray XT5, and an Opteron cluster from HP (lots of parameters are needed for optimal MPI performance: buffer size, message size, transfer method, ...).

Parallel Data Storage
In addition to the distributed instructions, the data must also be distributed over the MPI tasks:
1. Grid-based quantities (electron density, potential): distribute the grid batches to different tasks; the parallel iterator is local to each task.
2. Splines describing the Hartree potential v_at,lm(r): store different splines on different tasks; the splines must be communicated to compute v_es(r).
3. Matrices and Kohn-Sham eigenvectors: as dictated by ScaLAPACK.
Conversely, each MPI task holds
1. grid batches and the associated quantities,
2. splines for the Hartree potential v_at,lm(r), and
3. pieces of the matrices and Kohn-Sham eigenvectors.

Conclusions
Part I:
- The integration, the density update and the calculation of the Hartree potential can be made to scale as O(N) (or O(N log N)).
- This requires localisation of the basis functions.
- The hard part that remains: the solution of the eigenproblem.
Part II:
- Electronic structure theory codes need to scale to large parallel systems today.
- This is achieved by minimising the serial part of the program.
