Calibrating quantum chemistry: A multi-teraflop, parallel-vector, full-configuration interaction program for the Cray-X1


Zhengting Gan and Robert J. Harrison
Oak Ridge National Laboratory, MS6367, P.O. Box 2008, Oak Ridge TN 37831, USA
harrisonrj@ornl.gov

Abstract

We describe an efficient parallel and vector algorithm for solving huge eigenvector problems in quantum chemistry. An automatically adaptive, single-vector, iterative diagonalization method was also developed to reduce the memory requirement and avoid an I/O bottleneck. Our initial full-configuration interaction calculation solved for an eigenvector with 65 billion coefficients and was performed on 432 MSPs of the Oak Ridge National Laboratory Cray-X1. One matrix-vector multiplication took about 4 minutes, with 25 iterations being required for a tightly converged result. The aggregate performance was 3.4 TFLOP/s (62% of peak speed).

1. INTRODUCTION

Full configuration interaction (FCI) [1,2] solves the non-relativistic many-electron Schrödinger equation exactly in a given finite one-electron basis and provides a vital tool in the evaluation and development of other quantum chemistry methods. Parallel FCI algorithms on massively parallel (MPP) architectures [3,4,5,6] as well as on clusters of workstations [7] have been widely pursued. However, efficient utilization of both of these architectures remains a challenging task because of the huge sparse matrix-vector multiplication that must be performed.

(c) 2005 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by a contractor or affiliate of the [U.S.] Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. SC'05, November 12-18, 2005, Seattle, Washington, USA. (c) 2005 ACM /05/0011 $5.00

First, the Hamiltonian matrices in FCI calculations are extremely sparse, which makes efficient utilization of the memory system difficult because of data locality issues. Due to memory limitations, existing sparse matrix-vector multiplication algorithms [8,9] cannot be readily applied. Instead, we use a matrix-free (or direct) method: the Hamiltonian matrix elements are re-computed on the fly, so efficient computation of these elements is also an important performance factor.

Parallel scalability is also a complicated issue in FCI implementations. Various factors may affect the parallel performance, including, but not limited to, communication, load imbalance and the fraction of sequential computation. A well known but unsolved scalability problem is the redundant computation of Hamiltonian matrix elements between electrons of the same spin. Such computation has historically been replicated on all the processors to avoid the otherwise enormous communication cost. However, the overall scalability may then be severely limited according to Amdahl's law [12].

In this paper we present an efficient parallel and vector FCI implementation which converts the sparse matrix-vector multiplication in FCI calculations into dense matrix-matrix multiplications by exploiting the structure of the FCI Hamiltonian matrix. The algorithm avoids the explicit computation of Hamiltonian matrix elements, thus removing the major scalability bottleneck. In addition, the overhead of memory movement and the communication cost are reduced to a minimum. To reduce the memory constraint in the large-scale benchmark calculations we also propose in this paper an automatically adjusted single-vector diagonalization method.

This paper is organized as follows. In Section 2 we introduce the DGEMM-based FCI algorithm and the automatically adjusted single-vector diagonalization method. In Section 3 the detailed parallel implementation is described. In Section 4 the results of the parallel performance and the benchmark calculations are presented and discussed.

2. METHOD

2.1 DGEMM based FCI algorithm

In the FCI method the N-electron wavefunction is expressed as a linear combination of all possible determinants generated from the given one-electron basis. The expansion coefficients are obtained by iterative solution of the matrix eigenvector problem. The major computational task is to compute the product between the sparse Hamiltonian matrix H and the coefficient vector C,

$$\sigma = H C \quad (1)$$

The spin-free Hamiltonian operator in second-quantized form is

$$\hat{H} = \sum_{pq} h_{pq}\, \hat{E}_{pq} + \frac{1}{2} \sum_{pqrs} (pq,rs)\, \hat{e}_{pr,qs} \quad (2)$$

$$\hat{E}_{pq} = \hat{a}_p^{+} \hat{a}_q, \qquad \hat{e}_{pr,qs} = \hat{a}_p^{+} \hat{a}_r^{+} \hat{a}_s \hat{a}_q \quad (3)$$

where h_pq and (pq,rs) are molecular integrals, and Ê_pq and ê_pr,qs are the one- and two-electron operators defined by the creation and annihilation operators â_p^+ and â_p.

In sequential minimum operation count (MOC) algorithms [2-7] only the non-zero Hamiltonian matrix elements are computed, and the σ vector is updated by indexed multiply-and-add operations. However, on cache-based computer systems such indexed operations generally perform poorly. In addition, the parallel scalability of MOC algorithms is greatly limited by the need to replicate some computation [4,5] and by the high ratio of communication to computation [7].

In order to avoid the indexed computation kernel, we adopted a matrix-matrix multiplication based method using the N-2 electron string space [13]. By exploiting the structure of the Hamiltonian matrix using N-2 electron intermediate states, the major computation can be organized in three steps: 1) build the coefficient matrix of N-2 electron states from the vector C, 2) perform a dense matrix-matrix multiplication locally, and 3) accumulate the resulting coefficients into the σ vector. In the most time-consuming mixed-spin (alpha-beta) routine the contribution to the σ vector from the combination of single alpha and single beta excitations is computed as

$$D(qs, K'_\alpha K'_\beta) = \sum_{J_\alpha J_\beta} B^{K'_\alpha K'_\beta,\, J_\alpha J_\beta}_{qs}\, C(J_\alpha J_\beta) \quad (4)$$

$$E(pr, K'_\alpha K'_\beta) = \sum_{qs} (pr,qs)\, D(qs, K'_\alpha K'_\beta) \quad (5)$$

$$\sigma(I_\alpha I_\beta) = \sum_{pr,\, K'_\alpha K'_\beta} A^{I_\alpha I_\beta,\, K'_\alpha K'_\beta}_{pr}\, E(pr, K'_\alpha K'_\beta) \quad (6)$$

The interaction between electrons of the same spin (beta-beta) is

$$D(qs, I_\alpha K'_\beta) = \sum_{J_\beta} B^{K'_\beta J_\beta}_{qs}\, C(I_\alpha J_\beta) \quad (7)$$

$$E(pr, I_\alpha K'_\beta) = \sum_{q>s} \left[ (pq,rs) - (ps,rq) \right] D(qs, I_\alpha K'_\beta) \quad (8)$$

$$\sigma(I_\alpha I_\beta) = \sum_{p>r,\, K'_\beta} A^{I_\beta K'_\beta}_{pr}\, E(pr, I_\alpha K'_\beta) \quad (9)$$

The coupling coefficient matrices A and B are the matrix representations of two creation and two annihilation operators, respectively. The matrix-multiply-based algorithm avoids explicit computation of Hamiltonian matrix elements, which is the key to attaining parallel scalability for the same-spin interactions. In addition, the communication requirement of the DGEMM-based mixed-spin routine is significantly less than that of the MOC algorithm. In Table 1 we summarize the computational costs of the DGEMM and MOC algorithms.

Table 1. Performance model of the alpha-beta routine in minimum operation count FCI algorithms and in the DGEMM-based FCI algorithm. (a)

                    MOC                               DGEMM
Comput. kernel      DAXPY, or indexed multiply-add    DGEMM (gather & scatter)
Operation count     Nci*(n-Na)*Na*(n-Nb)*Nb           ~Nci*n*n*Na*Nb
Commu. count        Nci*Na*(n-Na)                     3*Nci*Na
Commu. kernel       Collective gather                 Gather & accumulate

(a) Nci is the dimension of the FCI vector, n is the number of orbitals, and Na and Nb are the numbers of alpha- and beta-spin electrons, respectively.
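To make the three-step structure of Eqs. (7)-(9) concrete, here is a minimal NumPy sketch that loops over intermediate strings, copies coefficient columns into a dense matrix D, performs one dense matrix multiply, and accumulates the result into σ. The dimensions, random coupling tables, and array names are illustrative assumptions, not the data structures of the actual program.

```python
import numpy as np

# Toy sizes (illustrative only): alpha strings, beta strings,
# intermediate (N_beta - 2)-electron strings, and orbital pairs.
n_alpha, n_beta, n_inter, n_pairs = 8, 6, 4, 5

rng = np.random.default_rng(0)
C = rng.standard_normal((n_alpha, n_beta))          # C(I_alpha, J_beta)
ints = rng.standard_normal((n_pairs, n_pairs))      # stand-in for (pq,rs)-(ps,rq)

# Hypothetical coupling tables standing in for the sparse A and B matrices:
# for each intermediate string K and orbital pair, one (beta-string, sign) entry.
B = rng.integers(0, n_beta, size=(n_inter, n_pairs))
A = rng.integers(0, n_beta, size=(n_inter, n_pairs))
sgnB = rng.choice([-1.0, 1.0], size=(n_inter, n_pairs))
sgnA = rng.choice([-1.0, 1.0], size=(n_inter, n_pairs))

sigma = np.zeros_like(C)
for K in range(n_inter):
    # 1) gather: copy the coupled columns of C into the dense matrix D   (Eq. 7)
    D = np.zeros((n_pairs, n_alpha))
    for qs in range(n_pairs):
        D[qs, :] = sgnB[K, qs] * C[:, B[K, qs]]
    # 2) one dense matrix-matrix multiply, the DGEMM step                (Eq. 8)
    E = ints @ D
    # 3) scatter/accumulate the result back into the sigma vector        (Eq. 9)
    for pr in range(n_pairs):
        sigma[:, A[K, pr]] += sgnA[K, pr] * E[pr, :]
```

In the real code the gather/scatter lists are precomputed per intermediate string, and the DGEMM call dominates the run time, which is what makes the kernel both vectorizable and multi-streamable.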

2.2 An automatically adjusted single-vector iterative diagonalization method

The limiting factor in FCI calculations is the storage of the subspace vectors in the iterative Davidson diagonalization method [14]. On most supercomputers the I/O bandwidth is so limited that storing the subspace vectors on disk implies a huge waste of computing resources. To partially alleviate this memory limitation, we developed a single-vector diagonalization method for the lowest eigenpair.

Suppose that at iteration n the approximate CI vector is C^(n) and the corresponding estimated eigenvalue is E^(n) = <C^(n)|Ĥ|C^(n)>. Following Olsen's correction scheme [15], the new correction vector can be computed as

$$t^{(n)} = -\left(H_0 - E^{(n)}\right)^{-1} \left( H - E^{(n)} - \Delta^{(n)} \right) C^{(n)} \quad (11)$$

where H_0 is an approximation to H, normally the diagonal elements of H as suggested by Davidson. Δ^(n) is the first-order correction to the eigenvalue, introduced to ensure orthogonality between the correction vector and the vector C^(n),

$$\Delta^{(n)} = \frac{\langle C^{(n)} | (H_0 - E^{(n)})^{-1} (H - E^{(n)}) | C^{(n)} \rangle}{\langle C^{(n)} | (H_0 - E^{(n)})^{-1} | C^{(n)} \rangle} \quad (12)$$

In Olsen's scheme the correction vector t^(n) is directly added to the approximation vector C^(n) to form the new approximation vector C^(n+1). Since there is no minimization procedure to determine the expansion coefficients, convergence is not guaranteed [16]. A more general way to construct the new approximate vector is to use a step length λ to mix the vectors t^(n) and C^(n),

$$C^{(n+1)} = S^{(n)} \left( C^{(n)} + \lambda^{(n)} t^{(n)} \right) \quad (13)$$

where S^(n) is the normalization parameter. The optimal step length λ_opt^(n) could be obtained by diagonalization of the corresponding 2x2 sub-matrix. However, the matrix element <t^(n)|Ĥ|t^(n)> cannot be computed unless both the correction vector t^(n) and its Hamiltonian projection vector are stored, which doubles the memory requirement. In this paper we use the optimal step length λ_opt^(n) from the previous iteration n as the step length λ^(n+1), since at iteration n+1 we can compute

$$\langle t^{(n)} | \hat{H} | t^{(n)} \rangle = \frac{E^{(n+1)}/(S^{(n)})^2 - E^{(n)} - 2\lambda^{(n)} \langle C^{(n)} | \hat{H} | t^{(n)} \rangle}{(\lambda^{(n)})^2} \quad (14)$$

Thus λ_opt^(n) can be obtained by diagonalization of the 2x2 Hamiltonian matrix of iteration n, and the new step length λ^(n+1) is estimated as

$$\lambda^{(n+1)} = \lambda_{opt}^{(n)} \quad (15)$$

In the first iteration the matrix element <t^(1)|Ĥ|t^(1)> is estimated more crudely.
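As a concrete illustration of Eqs. (11)-(15), the following dense-matrix sketch applies the automatically adjusted single-vector update to a small symmetric test matrix. The function name, starting guess, denominator guard, and test matrix are assumptions made to keep the toy runnable; the production code works with the distributed σ = HC product rather than an explicit H.

```python
import numpy as np
from scipy.linalg import eigh

def auto_single_vector(H, maxit=60, tol=1e-10):
    """Toy dense-matrix sketch of the automatically adjusted single-vector
    method (Eqs. 11-15), with H0 taken as diag(H) as suggested by Davidson."""
    H0 = H.diagonal().copy()
    C = np.zeros(H.shape[0])
    C[np.argmin(H0)] = 1.0                       # crude starting vector
    lam, prev = 1.0, None                        # first-iteration step length is a crude guess
    for it in range(maxit):
        sigma = H @ C                            # the expensive sigma = H*C step
        E = C @ sigma
        resid = sigma - E * C                    # (H - E) C
        if np.linalg.norm(resid) < tol:
            return E, C, it
        if prev is not None:
            # Eq. (14): recover <t|H|t> of iteration n from E^(n+1), then Eq. (15)
            E_p, CHt_p, Ct_p, tt_p, lam_p, S_p = prev
            tHt_p = (E / S_p**2 - E_p - 2.0 * lam_p * CHt_p) / lam_p**2
            h2 = np.array([[E_p, CHt_p], [CHt_p, tHt_p]])
            s2 = np.array([[1.0, Ct_p], [Ct_p, tt_p]])
            w, v = eigh(h2, s2)                  # 2x2 generalized eigenproblem of iteration n
            lam = v[1, 0] / v[0, 0]              # lambda_opt(n) used as lambda(n+1)
        # Eqs. (11)-(12): Olsen correction vector with eigenvalue shift Delta
        d = H0 - E
        d[np.abs(d) < 1e-6] = 1e-6               # guard near-zero denominators
        delta = (C @ (resid / d)) / (C @ (C / d))
        t = -(resid - delta * C) / d
        # Eq. (13): mix, renormalize, and remember what Eq. (14) will need next time
        Cnew = C + lam * t
        S = 1.0 / np.linalg.norm(Cnew)
        prev = (E, t @ sigma, C @ t, t @ t, lam, S)
        C = S * Cnew
    return E, C, maxit

# small diagonally dominant symmetric test problem
rng = np.random.default_rng(1)
A = 0.01 * rng.standard_normal((200, 200))
H = (A + A.T) / 2 + np.diag(np.arange(200, dtype=float))
E, C, nit = auto_single_vector(H)
print(nit, E, np.linalg.eigvalsh(H)[0])
```

The essential property is that only C, σ, and a handful of scalars from the previous iteration are kept; no growing subspace and no second Hamiltonian-projected vector are ever stored.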

3. IMPLEMENTATION

3.1 Data distribution

In this section we describe our implementation strategy on parallel and vector supercomputers, especially on the massively parallel vector machine Cray-X1. The Cray-X1 is a distributed shared memory architecture, consisting of shared-memory parallel (SMP) nodes linked by a high performance interconnect. Each node has four multi-streaming processors (MSPs) sharing flat access to local memory, and each MSP consists of four single streaming vector processors (SSPs) and a 1 MB cache. Standard strategies for distributed parallel programming can be applied on the Cray-X1 at the coarse level. However, to achieve high performance the code running on each MSP must be cache aware, multi-streamed and vectorized.

Fig. 1. FCI data distribution and the strategies of parallel implementation (gathered C columns, local computation, and stored σ updates on each node).

The FCI coefficient vector can be viewed as a matrix with rows and columns indexed by beta and alpha strings (occupation patterns), respectively. The coefficient matrix is distributed by columns evenly among all the processors. In cases where the coefficient matrix is symmetry blocked, each symmetry block is distributed separately. In addition to the distributed vectors C and σ, each processor also requires memory to hold the replicated integral matrix and a working area to store the gathered C coefficients and the computed update coefficients for the σ vector. As illustrated in Fig. 1, the computation strategy can be summarized in three steps: gather the required coefficient columns from the C vector, perform the matrix multiplication locally, and accumulate the computed result into the σ vector.

The one-sided remote gather and accumulation are handled by the distributed data interface (DDI) [17], which is a derivative of Global Arrays [18]. On the Cray-X1, DDI uses the SHMEM library [19], which maps directly onto the hardware capabilities. The vector gather and accumulate operations are accomplished by the DDI_GET and DDI_ACC functions, respectively. A remote accumulation actually involves twice the amount of communication of a remote get: to accumulate a result remotely, the DDI_ACC routine first acquires the mutex lock for the remote node and fetches the data to be updated with the SHMEM_GET routine. The actual accumulation is done locally, and the resulting data are sent back with the SHMEM_PUT routine. A SHMEM_QUIET call is made to ensure the completion of SHMEM_PUT before the mutex lock is released at the end of the DDI_ACC call.
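The gather/compute/accumulate pattern and the DDI_ACC locking sequence can be mimicked in a few lines. The sketch below is a single-process stand-in with invented names (ToyDDI, get, acc); it is not the DDI or SHMEM API, but it mirrors the lock, fetch, local add, write-back sequence described above.

```python
import threading
import numpy as np

class ToyDDI:
    """Single-process stand-in (names and layout are assumptions, not the real
    DDI API) for a column-distributed array with one-sided get/accumulate."""

    def __init__(self, nrows, ncols_per_node, nnodes):
        self.locks = [threading.Lock() for _ in range(nnodes)]        # per-node mutex
        self.data = [np.zeros((nrows, ncols_per_node)) for _ in range(nnodes)]

    def get(self, node, cols):
        # DDI_GET-like one-sided fetch of selected columns
        return self.data[node][:, cols].copy()

    def acc(self, node, cols, contrib):
        # DDI_ACC-like remote accumulate: lock, fetch, add locally, write back
        with self.locks[node]:                       # acquire the remote node's mutex
            block = self.data[node][:, cols].copy()  # fetch (SHMEM_GET)
            block += contrib                         # accumulate locally
            self.data[node][:, cols] = block         # write back (SHMEM_PUT + QUIET)

# usage sketch: gather C columns, compute locally, accumulate into sigma
rng = np.random.default_rng(0)
C_dist = ToyDDI(nrows=6, ncols_per_node=4, nnodes=2)
sigma_dist = ToyDDI(nrows=6, ncols_per_node=4, nnodes=2)
C_dist.data[1][:] = rng.standard_normal((6, 4))

cols = [0, 2]
gathered = C_dist.get(node=1, cols=cols)     # remote gather of C columns
update = 0.5 * gathered                      # placeholder for the local DGEMM work
sigma_dist.acc(node=1, cols=cols, contrib=update)
```

The factor-of-two communication cost of a remote accumulate relative to a remote get is visible directly: acc both fetches and writes back the block it updates.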

3.2 The implementation of same-spin and mixed-spin routines

The implementations of the same-spin and mixed-spin routines are illustrated in Figs. 2a and 2b, respectively. In the same-spin routine the transposed local C and σ coefficient matrices are used to facilitate the gather and scatter operations. The calculation is organized by looping through intermediate N_β-2 electron beta strings. For each N_β-2 electron string a list of related N_β electron strings is pre-computed. The computation over this string list can be multi-streamed, and the corresponding column of each beta string is directly copied to the matrix D.

Fig. 2. The DGEMM-based same-spin (2a) and mixed-spin (2b) routines (multi-streamed, vectorized local copy or gather into D, DGEMM with the integral matrix, and local or remote accumulate).

In the mixed-spin routine the intermediate coefficient matrix D can be constructed in a similar multi-streamed manner, except that the remote coefficient columns need to be gathered and stored locally before being used to construct the matrix D. In both routines the major computation occurs in the matrix-matrix multiplication step, so the optimization effort is greatly leveraged by using the system DGEMM routine.
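For the pre-computed string list used in the same-spin routine, the sketch below shows the underlying combinatorics: given an (N_β-2)-electron intermediate string, it enumerates the N_β-electron strings reached by creating two electrons, together with the orbital pair and a permutation sign. The string representation (sorted tuples of occupied orbitals) and the sign convention are illustrative choices, not the packed index tables of the actual code.

```python
from itertools import combinations

def related_strings(K_occ, n_orb):
    """List (string, (q, s), sign) for all N_beta-electron strings obtained from
    the (N_beta - 2)-electron intermediate K_occ by adding electrons in q > s."""
    unocc = [o for o in range(n_orb) if o not in K_occ]
    out = []
    for s, q in combinations(unocc, 2):          # s < q by construction
        occ_s = sorted(K_occ + (s,))             # apply a_s^+ first ...
        occ = sorted(occ_s + [q])                # ... then a_q^+
        # phase: (-1) raised to the number of occupied orbitals passed over
        sign = (-1) ** (sum(o < s for o in K_occ) + sum(o < q for o in occ_s))
        out.append((tuple(occ), (q, s), sign))
    return out

# intermediate string with electrons in orbitals 0 and 3, in a 5-orbital toy space
for string, pair, sign in related_strings((0, 3), n_orb=5):
    print(string, pair, sign)
```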

3.3 Load balancing schemes

In the same-spin routine a static load balancing scheme is employed. The task of each processor is to update its local segment of the σ vector. Since the loop over the entire set of N_β-2 electron beta strings is required on every processor, the workload is evenly distributed across processors. At the same time, only the local parts of the C and σ coefficients are required, so no network communication is involved.

In the mixed-spin routine, each processor is assigned different sets of N_α-1 electron alpha occupations. The workload of each set cannot be easily estimated, so a dynamic load balancing scheme is adopted. The load balancing is implemented using a task pool with a manager/worker style, centrally controlled scheduling policy. The task pool is pre-computed and replicated on all the processors, and each processor repeatedly requests a new task number from the dynamic load balancing server. In DDI, the load balancing function is implemented using the SHMEM_SWAP function.

Fig. 3. The task list generated in the alpha-beta routine for dynamic load balancing (NLtask aggregated tasks of decreasing size followed by an NStask tail of fine-granularity tasks).

Good load balance can be achieved by executing the tasks in order of decreasing size. Generally, a large number of tasks of very fine granularity is preferred for good load balance, but this increases the communication overhead. As a compromise, task aggregation is performed to artificially produce larger tasks in order of decreasing size. Three parameters are used in our code to define the parallel subtasks: (i) NFineTask_proc defines the number of initial fine-grained tasks per processor; (ii) NLtask_proc defines the number of final aggregated large tasks; (iii) NStask_proc defines the number of small tasks at the end of the task pool. As illustrated in Fig. 3, large tasks are generated with decreasing size, and the extra short tail of fine-grained tasks helps to further ensure that the load imbalance is limited by the work of a fine-grained task even in the worst cases.
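The task aggregation can be sketched as follows: the function below bundles a list of fine-grained task costs into large tasks of decreasing size plus a short tail of small tasks. The parameter names echo NLtask_proc and NStask_proc, but the specific heuristic (linearly decreasing targets filled greedily from the largest fine-grained tasks) is only an assumed illustration of the idea, not the program's actual scheme.

```python
import random

def build_task_pool(fine_costs, n_large, n_small):
    """Aggregate fine-grained tasks into n_large tasks of decreasing size,
    keeping n_small of the smallest fine-grained tasks as a trailing tail."""
    order = sorted(range(len(fine_costs)), key=lambda i: -fine_costs[i])
    head, tail = order[:len(order) - n_small], order[len(order) - n_small:]
    total = sum(fine_costs[i] for i in head)
    weights = [n_large - k for k in range(n_large)]          # n_large, ..., 1
    targets = [total * w / sum(weights) for w in weights]    # decreasing target sizes
    pool, pos = [], 0
    for tgt in targets:
        task, acc = [], 0.0
        while pos < len(head) and (acc < tgt or not task):
            task.append(head[pos])
            acc += fine_costs[head[pos]]
            pos += 1
        if task:
            pool.append(task)
    if pos < len(head):                                      # sweep up any leftovers
        pool[-1].extend(head[pos:])
    return pool + [[i] for i in tail]                        # big tasks first, fine tail last

random.seed(0)
costs = [random.random() for _ in range(60)]
pool = build_task_pool(costs, n_large=6, n_small=10)
print([round(sum(costs[i] for i in task), 2) for task in pool])
```

Workers then simply fetch the next task index from the shared counter (SHMEM_SWAP in DDI), so the expensive aggregated tasks are handed out first and the cheap tail absorbs whatever imbalance remains.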

4. RESULTS

Table 2 compares the convergence of the Davidson subspace method, Olsen's scheme and the automatically adjusted method of this paper. In the subspace method, the Olsen correction vector is used as a basis vector, and the optimal step length for mixing the correction vector with the current approximation vector is computed at each iteration by diagonalization of the 2x2 subspace. In the modified Olsen scheme, a fixed λ=0.7 is used to construct the new approximation vectors. All the calculations are tightly converged in the residual norm. In all the calculations a model space is selected to improve the convergence: inside the model space the exact Hamiltonian is used to compute the correction vector; outside the model space the diagonal elements are used.

Table 2. Iterations required by various diagonalization methods for a 10^-10 Eh convergence criterion ("NC" denotes no convergence).

Molecule   Sym., state   Dimension    Model space   Davidson   Olsen/(Modified)   Auto
H3COH      C1, 1A        41,409,                               NC/19 (λ=0.7)      15
H2O2       C2, 1A        506,383,                              NC/22 (λ=0.7)      15
CN+        C2v, 1Σ+      104,806,                              >>60               22
O          D2h, 3P       2 18,441,

The chemical systems chosen here include the strongly multi-reference CN+ system as well as the open-shell O atom. Table 2 indicates that the original Olsen scheme has serious problems in producing tightly converged eigenvectors. A damping factor of 0.7 corrected the problem in some cases, but still failed for CN+. For all four systems both the subspace method and the automatically adjusted single-vector method reached tightly converged results. Although the mixing coefficient λ used in the subspace method is considered to be optimal at each iteration, surprisingly, the automatically adjusted method requires fewer iterations than the subspace method does. In the calculation of CN+ the number of iterations is even cut in half by the automatically adjusted single-vector method.

Figure 4 reports times for a series of FCI calculations on the oxygen atom on the Cray-X1 using 16 to 128 MSPs, comparing the performance and parallel scalability of the minimum operation count and DGEMM-based FCI algorithms. The large aug-cc-pVQZ basis set was employed in the calculations, so the difference between the operation counts of the two algorithms is insignificant, as can be estimated from the performance model in Table 1. The differences arise from the efficiencies of the computation kernels and the overheads in the parallel implementation. Figure 4 clearly shows the performance advantage of the DGEMM-based algorithm. Although the major computation is multi-streamed and vectorized in both implementations, the performance of the DGEMM and DAXPY kernels is quite different on the Cray-X1. The Cray-X1 evaluation report [20] indicates that out-of-cache DAXPY can only realize 2 GFlop/s per MSP, whereas DGEMM attains 10-11 GFlop/s per MSP for matrices beyond 300x300.

Fig. 4. Comparison of timing and parallel scalability between the MOC and DGEMM-based FCI algorithms (time in seconds versus MSPs; curves: alpha-beta and beta-beta routines for MOC and DGEMM).

Fig. 5. Parallel scalability of FCI calculations of the oxygen anion ground state on the Cray-X1 (speedup versus MSPs).

The major performance boost of the DGEMM-based same-spin routine arises from the elimination of the explicit computation of Hamiltonian matrix elements. As shown in Figure 4, the same-spin routine using the sequential MOC algorithm does not scale at all: redundant computation of the entire double excitation list completely overwhelms any parallel efficiency in this calculation. For the mixed-spin routine, a significant saving of the DGEMM-based implementation also comes from the reduced communication cost. In the oxygen calculation the communication cost is reduced by about a factor of 25 in the DGEMM-based implementation.

The parallel scalability of our DGEMM-based implementation was further evaluated by running the oxygen anion calculation from 128 MSPs to 256 MSPs on the Cray-X1. The aug-cc-pVQZ basis set was used and the calculation includes 14,851,999,576 determinants. As shown in Figure 5, almost perfect speedup was obtained. The same-spin routine sustained about 9.6 GFlop/s per MSP, and the mixed-spin routine ran at 8.5 to 8.1 GFlop/s per MSP from 128 MSPs to 256 MSPs.

In Table 3 the efficiency and capacity of our FCI program are demonstrated by a benchmark calculation on the C2 ground state X 1Σg+. Previous FCI benchmark calculations on C2 [22] have been reported, but in a basis set of double-zeta quality. The basis set in our new benchmark consists of aug-cc-pVTZ without the augmented diffuse d and f functions. In D2h symmetry, the FCI space includes about 65 billion determinants. This is, to our knowledge, by far the largest FCI calculation ever explicitly performed. The previous largest FCI calculation was reported by Ansaloni et al. for the N2 molecule with about 10 billion determinants; that calculation took more than 7 hours per iteration on 128 nodes of a Cray-T3E [5]. Our C2 benchmark calculation was run on 432 MSPs of the Cray-X1. It took about 249 seconds for each iteration, and 25 iterations were required to reach a residual norm of 10^-5 using the automatically adjusted single-vector diagonalization method. In each iteration about 6.2 TB of data is transferred via network communication in the mixed-spin routine. The entire calculation sustained approximately 8 GFlop/s per MSP, or about 62% of the peak performance of the Cray-X1. The aggregate performance over the entire 432 MSPs is about 3.4 TFLOP/s.

Table 3. Timing and performance results of the FCI benchmark calculation on C2 using 432 MSPs of the Cray-X1.

Molecule          C2
State             X 1Σg+
Basis             cc-pVTZ(+1s,1p)
CI space          FCI(8,66)
CI dimension      64,931,348,928
MSPs              432
Beta-beta         62 s / 8.5 GFlop/s per MSP
Alpha-beta        167 s / 8.8 GFlop/s per MSP
Load imbalance    9 s
Vector symm.      11 s
Total             249 s / ~8.0 GFlop/s per MSP
Disk I/O          293 MB/s (read), 246 MB/s (write)

5. CONCLUSION

In this paper we presented an efficient parallel algorithm for solving the eigenproblem arising in FCI calculations. By exploiting the structure of the Hamiltonian matrix we were able to express the problem using dense matrix-matrix multiplications with the help of vector gather and scatter operations. At the same time, the explicit construction of Hamiltonian matrix elements is avoided, which removes a major parallel-scalability bottleneck that was not well solved in prior work by ourselves and others. To perform much larger calculations and also avoid the I/O bottleneck on supercomputers, we also introduced an automatically adjusted single-vector iterative diagonalization method. The new diagonalization method enabled us to perform a 65 billion determinant benchmark calculation on the C2 molecule using 432 MSPs of the Cray-X1. Each matrix-vector multiplication took about 4 minutes, and the calculation sustained an aggregate performance of 3.4 TFLOP/s, or about 62% of the peak performance of the Cray-X1.

6. ACKNOWLEDGMENTS

This work is supported by the Scientific Discovery through Advanced Computing (SciDAC) program of the U.S. Department of Energy, Division of Basic Energy Sciences, Office of Science, using the resources of the Center for Computational Sciences at Oak Ridge National Laboratory under contract DE-AC05-00OR.

7. REFERENCES

[1] P. J. Knowles and N. C. Handy, A new determinant-based full configuration interaction method. Chem. Phys. Lett. 111.
[2] J. Olsen, B. O. Roos, P. Jorgensen, and H. J. Aa. Jensen, Determinant based configuration interaction algorithms for complete and restricted configuration interaction spaces. J. Chem. Phys. 89.
[3] S. Evangelisti, G. L. Bendazzoli, R. Ansaloni, F. Duri, and E. Rossi, A one billion determinant full CI benchmark on the Cray T3D. Chem. Phys. Lett. 252.
[4] E. Rossi, G. L. Bendazzoli and S. Evangelisti, Full configuration interaction algorithm on a massively parallel architecture: Direct-list implementation. J. Comput. Chem. 19.
[5] R. Ansaloni, G. L. Bendazzoli, S. Evangelisti and E. Rossi, A parallel Full-CI algorithm. Comput. Phys. Commun. 128.
[6] M. Klene, M. A. Robb, M. J. Frisch, and P. Celani, Parallel implementation of the CI-vector evaluation in full CI/CAS-SCF. J. Chem. Phys. 113.
[7] Z. Gan, Y. Alexeev, M. S. Gordon and R. A. Kendall, The parallel implementation of a full CI program. J. Chem. Phys. 119.
[8] R. C. Agarwal, F. G. Gustavson, and M. Zubair, A high performance algorithm using pre-processing for the sparse matrix vector multiplication. Proceedings of Supercomputing '92.
[9] S. Toledo, Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development 41.
[12] I. Foster, Designing and Building Parallel Programs. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[13] R. J. Harrison and S. Zarrabian, An efficient implementation of the full-CI method using an (n-2)-electron projection space. Chem. Phys. Lett. 158.
[14] E. R. Davidson, The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real-symmetric matrices. J. Comput. Phys. 17.
[15] J. Olsen, P. Jorgensen and J. Simons, Passing the one-billion limit in full configuration-interaction (FCI) calculations. Chem. Phys. Lett. 169.
[16] M. L. Leininger, C. D. Sherrill, W. D. Allen, and H. F. Schaefer, Systematic study of selected diagonalization methods for configuration interaction matrices. J. Comput. Chem. 22.
[17] M. W. Schmidt, K. K. Baldridge, J. A. Boatz et al., J. Comput. Chem. 14, 1347 (1993); G. D. Fletcher, M. W. Schmidt, and M. S. Gordon, Adv. Chem. Phys. 110, 267 (1999); Comput. Phys. Commun. 128, 190 (2000).
[18] J. Nieplocha, R. J. Harrison, and R. J. Littlefield, Global Arrays: A nonuniform memory access programming model for high-performance computers. The Journal of Supercomputing 10, 1996.
[19] R. Bariuso and A. Knies, SHMEM's User's Guide. Cray Research, Inc., Eagan, MN, USA.
[20] P. H. Worley and T. H. Dunigan, Early evaluation of the Cray X1 at Oak Ridge National Laboratory. In Proceedings of the 45th Cray User Group Conference, Columbus, OH.
[22] M. L. Leininger, C. D. Sherrill, W. D. Allen and H. F. Schaefer III, Benchmark configuration interaction spectroscopic constants for X 1Σg+ C2 and X 1Σ+ CN+. J. Chem. Phys. 108.
