CRYSTAL in parallel: replicated and distributed (MPP) data


CRYSTAL in parallel: replicated and distributed (MPP) data
Roberto Orlando
Dipartimento di Chimica, Università di Torino
Via Pietro Giuria 5, Torino (Italy)

Why parallel?
- Faster time to solution;
- More available memory;
- High Performance Computing (HPC) resources are available, and not many software packages can run efficiently on thousands of processors.

The programmer's concerns:
- Load imbalance: the time taken will be that of the longest job;
- Handling communications: the processors need to talk to each other, and communication is slow;
- Handling Input/Output (I/O): in most cases I/O is slow and should be avoided.

The user's concerns:
- Choose an appropriate number of processors for a job, depending on the problem size (mostly determined by the number of basis functions in the unit cell).

Amdahl's law

    S(n) = (S + P) / (S + P/n),   with S + P = 1

where n is the number of processors, S is the fraction of serial instructions, and P is the fraction of parallelized instructions. This is a somewhat frightening equation: however many processors are used, the speedup can never exceed 1/S.

Gustafson's law
BUT the relative values of S and P are a function of system size:
- Parallelize the more expensive parts (first);
- These typically become rapidly more expensive as the system size is increased;
- Parallelism is good for large systems!
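As a worked illustration of Amdahl's law (a minimal sketch, not part of CRYSTAL): with only 5% serial code the speedup saturates near its limit of 1/S = 20 long before the processor count matters.

```python
def amdahl_speedup(n, serial_fraction):
    """Amdahl's law: speedup on n processors when a fraction S of the
    work is serial and P = 1 - S is parallelized."""
    s = serial_fraction
    p = 1.0 - s
    return (s + p) / (s + p / n)

# With 5% serial code, speedup saturates well below the processor count:
for n in (4, 16, 64, 1024):
    print(n, round(amdahl_speedup(n, 0.05), 1))
# 4 -> 3.5, 16 -> 9.1, 64 -> 15.4, 1024 -> 19.6; the limit is 1/S = 20
```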

Load imbalance
Say we have twenty totally independent tasks and twenty processors. This is easy to parallelize: give each task to one of the processors. But what if the tasks don't all take the same time? The time taken will be that of the longest job. Because of load imbalance our speedup is less than perfect: we have too few tasks for too many processors. Don't use too many processors for too small a job.

Communications and I/O
But what if the tasks are not independent? The processors will need to talk to each other; this is known as communication. Communication is SLOW, but usually the computation requirement scales more rapidly than the communication. Depending on how the machine is set up, I/O on parallel machines can be VERY slow, so in general it is best to run "direct" (recomputing quantities rather than storing them on disk). This may not be true for medium-sized jobs on machines where each processor has a fast local disk.
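A toy illustration of why the longest task sets the wall time (hypothetical task durations, not CRYSTAL data):

```python
# Toy model: one task per processor; the wall time is the longest task.
task_times = [1.0] * 19 + [3.0]   # 19 uniform tasks plus one slow outlier

serial_time = sum(task_times)           # 22.0 time units on one processor
parallel_time = max(task_times)         # 3.0 time units on 20 processors
speedup = serial_time / parallel_time   # 7.3x instead of the ideal 20x

print(f"speedup on 20 processors: {speedup:.1f}x (ideal: 20x)")
```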

Towards large unit cell systems
A CRYSTAL job can be run:
- serially: crystal
- in parallel: Pcrystal, MPPcrystal

Both parallel versions use the Message Passing Interface (MPI) for communications. Pcrystal uses replicated data storage in memory. MPPcrystal targets large unit cell systems on high-performance computers: it uses parallel linear algebra library routines (ScaLAPACK) for diagonalization, matrix products and Cholesky decomposition, with enhanced distribution of data in memory among processors.

Running CRYSTAL in parallel

Pcrystal:
- full parallelism in the calculation of the interactions (one- and two-electron integrals);
- distribution of tasks in reciprocal space: one k point per processor;
- no calls to external libraries;
- few inter-process communications.

MPPcrystal:
- full parallelism in the calculation of the interactions (one- and two-electron integrals);
- double-level distribution of tasks in reciprocal space: one k point to a subset of processors;
- parallel linear algebra library routines (ScaLAPACK) for diagonalization, matrix products and Cholesky decomposition;
- enhanced distribution of data in memory among processors;
- many inter-process communications.
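A back-of-envelope sketch of the replicated vs distributed distinction (illustrative only; real CRYSTAL memory use involves many more arrays than one matrix):

```python
def matrix_memory_gib(n_basis, n_procs, distributed):
    """Memory per process (GiB) to hold one N x N double-precision
    matrix: a full copy if replicated, a 1/n_procs share if distributed."""
    bytes_total = n_basis**2 * 8
    per_proc = bytes_total / n_procs if distributed else bytes_total
    return per_proc / 1024**3

n = 77_560   # X10 supercell of MCM-41 (from a later slide)
print(f"replicated : {matrix_memory_gib(n, 256, False):6.1f} GiB/process")
print(f"distributed: {matrix_memory_gib(n, 256, True):6.1f} GiB/process")
# replicated: ~44.8 GiB per process; distributed over 256 procs: ~0.2 GiB
```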

Pcrystal and MPPcrystal in action
Example of a CRYSTAL calculation: 10,000 basis functions; 16 processors available; 4 k points sampled in reciprocal space.

[Diagram: the tasks in real space (integrals) keep all 16 processors active in both versions. For the tasks in reciprocal space, Pcrystal activates only 4 processors (one per k point, k1 to k4) and leaves 12 idle, whereas MPPcrystal keeps all 16 active by assigning each k point to a subset of processors.]

Pcrystal - Implementation
Standard compliant: Fortran 90; MPI for message passing.
Replicated data:
- Each k point is independent: each processor performs the linear algebra (FC = EC) for a subset of the k points that the job requires;
- Very few communications (potentially good scaling), but potential load imbalance;
- Each processor has a complete copy of all the matrices used in the linear algebra;
- The limit on the job size is given by the memory required to store the linear algebra matrices for one k point;
- The number of k points limits the number of processors that can be exploited: in general Pcrystal scales very well, provided the number of processors ≤ the number of k points.
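A minimal sketch of the replicated-data assignment above (hypothetical helper, not CRYSTAL code): k points are dealt out round-robin, so any processor beyond the k-point count sits idle during the reciprocal-space step.

```python
def assign_k_points(n_kpoints, n_procs):
    """Round-robin assignment of k points to processors,
    mimicking the replicated-data (Pcrystal) scheme."""
    assignment = {rank: [] for rank in range(n_procs)}
    for k in range(n_kpoints):
        assignment[k % n_procs].append(k)
    return assignment

# 4 k points on 16 processors: ranks 0-3 each get one k point,
# ranks 4-15 are idle while the linear algebra runs.
work = assign_k_points(4, 16)
idle = [rank for rank, ks in work.items() if not ks]
print(f"{len(idle)} of 16 processors idle: {idle}")
```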

MPPcrystal - Implementation I
Standard compliant: Fortran 90; MPI for message passing; ScaLAPACK 1.7 (Dongarra et al.) for linear algebra on distributed matrices.
Distributed data:
- Each processor holds only a part of each of the matrices used in the linear algebra (FC = EC);
- The number of processors that can be exploited is NOT limited by the number of k points (great for large Γ-point-only calculations);
- ScaLAPACK is used for, e.g., Cholesky decomposition, matrix-matrix multiplies and linear equation solves;
- As the data are distributed, communications are required to perform the linear algebra;
- However, there are O(N^3) operations but only O(N^2) data to communicate.

MPPcrystal - Implementation II
Scaling:
- Scaling gets better for larger systems;
- Very rough rule of thumb: a job with N basis functions can exploit up to around N/20 processors (optimal ratio: N/50);
- One further method that MPPcrystal uses is multilevel parallelism: with 4 real k points and 32 processors, each diagonalization is done by 8 processors, so each diagonalization only has to scale to a smaller processor count. This is complicated by complex k points. It is very useful for medium-to-large systems (for a big enough problem it can scale very well).

Non-implemented features in MPPcrystal:
MPPcrystal will fail quickly and cleanly if a requested feature is not implemented, such as:
- symmetry adaptation of the crystalline orbitals (for large high-symmetry systems Pcrystal may be more effective);
- CPHF;
- Raman intensities.
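The rule of thumb above is easy to encode; a hedged sketch (the N/20 and N/50 ratios are the rough guidance from these slides, not hard limits):

```python
def suggested_procs(n_basis_functions):
    """Rough guidance from the slides: up to ~N/20 processors can be
    exploited, with ~N/50 as the more efficient ratio."""
    return {"max_useful": n_basis_functions // 20,
            "optimal": n_basis_functions // 50}

# For the 10,000 basis-function example above:
print(suggested_procs(10_000))   # {'max_useful': 500, 'optimal': 200}
```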

MCM-41 mesoporous material model
P. Ugliengo, M. Sodupe, F. Musso, I. J. Bush, R. Orlando, R. Dovesi, Advanced Materials 20, (2008).
- B3LYP approximation
- Hexagonal lattice with P1 symmetry
- 580 atoms per cell (7,800 basis functions)

[Figure: IR spectrum recorded on a micelle-templated silica (MTS) calcined at 823 K, water outgassed at 423 K, compared with the B3LYP simulation. Simulated powder spectrum: no relevant reflections at higher 2θ because of short-range disorder.]

MCM-41: increasing the unit cell
R. Orlando, M. Delle Piane, I. J. Bush, P. Ugliengo, M. Ferrabone, R. Dovesi, J. Comput. Chem. 33, 2276 (2012).
Supercells of MCM-41 have been grown along the c crystallographic axis: Xn (side along c is n times that in X1). X10 contains 77,560 AOs in the unit cell. Calculations were run on the IBM SP6 at Cineca: Power6 processors (4.7 GHz) with a peak performance of 101 Tflop/s and an InfiniBand X4 DDR internal network.

[Figure: speedup vs number of cores (NC) for SCF + total energy gradient calculations.]

MCM-41: scaling of the main steps in MPPcrystal

[Figure: parallelization efficiency of the main steps for the X4 supercell (SCF + total energy gradient): two-electron integrals, one-electron integrals, Fock matrix diagonalization, exchange-correlation functional integration, and preliminary steps. Percentages measure parallelization efficiency; data in parentheses give the amount of time spent in each task.]

Running MCM-41 on different HPC architectures

[Figure: scaling of the X1 cell on the IBM Blue Gene/P at Cineca (Bologna), the Cray XE6 HECToR (Edinburgh), and the IBM SP6 at Cineca (Bologna).]

Memory storage optimization
Error messages such as "TOO MANY K POINTS IN THE PACK-MONKHORST NET: INCREASE LIM001" belong to the past: most of the static allocations have been made dynamic, so array sizes now fit the exact memory requirement and there is no need to recompile the code for large calculations. A few remaining fixed limits can be extended from input:
- CLUSTSIZE (maximum number of atoms in a generated cluster; default setting: the number of atoms in the unit cell);
- LATVEC (maximum number of lattice vectors to be classified; default value: 3500).

Arrays of size n_atoms^2 are distributed among the cores, and data are removed from memory as soon as they are no longer in use.

LOWMEM option
The LOWMEM keyword avoids the allocation of large arrays, generally with a slight increase in CPU time (it is the default in MPPcrystal):
- atomic orbital pair elements in matrices are located in real time, without storing a large lookup table in memory;
- Fock and density matrices are stored only in their irreducible forms; symmetry-related elements are computed in real time;
- the expansion of the AO pair electron density into multipole moments, for the bipolar approximation of two-electron integrals, is performed in real time instead of storing large buffers in memory;
- information about the grid of points used in the DFT exchange-correlation functional integration (point Cartesian coordinates, multiplicity, Becke's weights) is distributed among processors.

Dynamically allocated memory can be monitored by means of the MEMOPRT and MEMOPRT2 keywords.
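As an illustration of the last point (a generic sketch, not CRYSTAL's actual data layout): block-distributing DFT grid-point records across MPI ranks so each processor stores only its own share.

```python
def local_grid_points(n_points, n_procs, rank):
    """Return the slice of DFT grid-point indices owned by `rank`
    when points are block-distributed across processors."""
    base, extra = divmod(n_points, n_procs)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return range(start, start + size)

# 1,000,003 grid points over 64 ranks: each rank stores ~1/64 of the
# coordinates, multiplicities and Becke weights instead of all of them.
print(local_grid_points(1_000_003, 64, rank=0))   # range(0, 15626)
print(local_grid_points(1_000_003, 64, rank=63))  # range(984378, 1000003)
```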

Speeding up two-electron integrals

The two-electron contribution to the Fock matrix has the form

    F_12^g ← Σ_{3,4} Σ_{h,l} P_34^l [ (1_0 2_g | 3_h 4_{h+l}) − ½ (1_0 3_h | 2_g 4_{h+l}) ]

where electron 1 is carried by the AO pair (1_0, 2_g) and electron 2 by the pair (3_h, 4_{h+l}).

Integrals are screened on the basis of the overlap between atomic orbitals: in large unit cells a lot of (3, 4) pairs do not overlap with (1, 2_g). Moreover, integrals that differ only by a permutation of the atomic orbitals within each pair, or by an exchange of the two pairs (with the corresponding translation of the cell indices 0, g, h, h+l), are equivalent, so only one representative per class needs to be computed. Implemented for P1 symmetry.

[Figure: wall-clock time T (sec) vs supercell size Xn for the linearization and permutation-symmetry implementations.]

Improved memory storage in Pcrystal

    F^g → F^k → V^k† F^k V^k

The transformation of the Fock and the density matrix into the basis set of the Symmetry-Adapted Crystalline Orbitals (SACO) is operated directly from the irreducible F^g to each block of V^k† F^k V^k (one block per irreducible representation), without forming the full blocks of F^k:
- the maximum size of the matrices to be diagonalized is that of the largest block;
- parallelization goes from k points down to the irreducible representations (many more than the number of k points in highly symmetric cases).
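A minimal molecular-case sketch of the permutation symmetry (ignoring the cell indices g, h, l that the periodic case adds): map each index quadruplet to a canonical representative, so each unique integral is computed once.

```python
def canonical_quartet(i, j, k, l):
    """Canonical representative of a two-electron integral (ij|kl)
    under the 8 index permutations that leave its value unchanged
    (molecular case; the periodic case also carries cell indices)."""
    ij = (max(i, j), min(i, j))          # (ij| = (ji|
    kl = (max(k, l), min(k, l))          # |kl) = |lk)
    return max(ij, kl) + min(ij, kl)     # (ij|kl) = (kl|ij)

# Count unique integrals among 10 basis functions:
n = 10
unique = {canonical_quartet(i, j, k, l)
          for i in range(n) for j in range(n)
          for k in range(n) for l in range(n)}
print(len(unique), "unique vs", n**4, "total")   # 1540 unique vs 10000 total
```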

Memory storage for fullerenes of increasing size

[Table: (n,n)-fullerenes up to n = 7, with columns n, N_AO, S_irr, S_red; the numerical entries are not reproduced here.]

N_AO: number of basis functions
S_irr: size of the irreducible part of the overlap matrix represented in real space (number of matrix elements)
S_red: size of the full overlap matrix represented in real space (number of matrix elements)

Fullerenes: matrix block size in the SACO basis

[Table: (n,n)-fullerenes from (1,1) to (10,10), with the block sizes for the irreducible representations A_g, A_u, F_1g, F_1u, F_2g, F_2u, G_g, G_u, H_g, H_u, together with N_AO and t_SCF.]

t_SCF: wall-clock time (in seconds) for running 20 SCF cycles on a single core
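Why the largest SACO block sets the cost: diagonalizing a block-diagonal matrix block by block scales with the cube of the largest block, not of the full N_AO. A generic numpy sketch (illustrative block sizes, not the fullerene data):

```python
import numpy as np

def eigvals_blockwise(blocks):
    """Eigenvalues of a block-diagonal symmetric matrix, one block
    (irreducible representation) at a time: O(sum b_i^3) work instead
    of O((sum b_i)^3) for the full matrix."""
    return np.concatenate([np.linalg.eigvalsh(b) for b in blocks])

rng = np.random.default_rng(0)
sizes = [60, 60, 240, 240, 300]        # illustrative SACO block sizes
blocks = []
for s in sizes:
    a = rng.standard_normal((s, s))
    blocks.append(a + a.T)             # make each block symmetric

evals = eigvals_blockwise(blocks)
print(evals.shape)                     # (900,) = sum of the block sizes
```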

Conclusions
CRYSTAL:
- can be run in parallel on a large number of processors efficiently, with very good scalability;
- is portable to different HPC platforms;
- allowed the calculation of the total energy and wavefunction of MCM-41 X14, containing more than 100,000 basis functions (8,000 atoms), on 2,048 processors;
- has been improved as concerns data storage in memory;
- has been made more efficient in the calculation of the Coulomb and exchange series.

Memory storage for highly symmetric cases has been drastically reduced by extending the use of SACOs to all steps in reciprocal space. Task farming in Pcrystal will soon be moved from the k-point level to that of the irreducible representations.
