Calibrating quantum chemistry: A multi-teraflop, parallel-vector, full-configuration interaction program for the Cray-X1


Zhengting Gan and Robert J. Harrison
Oak Ridge National Laboratory, MS6367, P.O. Box 2008, Oak Ridge TN 37831, USA
harrisonrj@ornl.gov

Abstract

We describe an efficient parallel and vector algorithm for solving huge eigenvector problems in quantum chemistry. An automatically adaptive, single-vector, iterative diagonalization method was also developed to reduce the memory requirement and avoid an I/O bottleneck. Our initial full-configuration interaction calculation solved for an eigenvector with 65 billion coefficients and was performed on 432 MSPs of the Oak Ridge National Laboratory Cray-X1. One matrix-vector multiplication took about 4 minutes, with 25 iterations being required for a tightly converged result. The aggregate performance was 3.4 TFLOP/s (62% of peak speed).

1. INTRODUCTION

Full configuration interaction (FCI) [1,2] solves the non-relativistic many-electron Schrödinger equation exactly in a given finite one-electron basis and provides a vital tool in the evaluation and development of other quantum chemistry methods. Parallel FCI algorithms on massively parallel (MPP) architectures [3,4,5,6] as well as on clusters of workstations [7] have been widely pursued. However, efficient utilization of both of these architectures remains a challenging task because of the huge sparse matrix-vector multiplication that must be performed.

(c) 2005 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by a contractor or affiliate of the [U.S.] Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. SC'05, November 12-18, 2005, Seattle, Washington, USA. (c) 2005 ACM /05/0011 $5.00

First, the Hamiltonian matrices in FCI calculations are extremely sparse, which makes efficient utilization of the memory system difficult because of data locality issues. Due to memory limitations, existing sparse matrix-vector multiplication algorithms [8,9] cannot be readily applied. Instead, we use a matrix-free (or direct) method: the Hamiltonian matrix elements are re-computed on the fly, so efficient computation of these elements is also an important performance factor.

Parallel scalability is also a complicated issue in FCI implementations. Various factors may affect the parallel performance, including, but not limited to, communication, load imbalance and the fraction of sequential computation. A well known but unsolved scalability problem is the redundant computation of Hamiltonian matrix elements between electrons of the same spin. Such computation has historically been replicated on all the processors to avoid the otherwise enormous communication cost. However, the overall scalability may then be severely limited according to Amdahl's law [12].

In this paper we present an efficient parallel and vector FCI implementation which converts the sparse matrix-vector multiplication in FCI calculations into dense matrix-matrix multiplications by exploiting the structure of the FCI Hamiltonian matrix. The algorithm avoids the explicit computation of Hamiltonian matrix elements, thus removing the major scalability bottleneck. In addition, the overhead of memory movement and the communication cost are reduced to a minimum. To reduce the memory constraint in the large-scale benchmark calculations we also propose in this paper an automatically adjusted single-vector diagonalization method.

This paper is organized as follows. In Section 2 we introduce the DGEMM-based FCI algorithm and the automatically adjusted single-vector diagonalization method. In Section 3 the detailed parallel implementation is described. In Section 4 the results of the parallel performance and the benchmark calculations are presented and discussed.

2. METHOD

2.1 DGEMM based FCI algorithm

In the FCI method the N-electron wavefunction is expressed as a linear combination of all possible determinants generated from the given one-electron basis. The expansion coefficients are obtained by iterative solution of the matrix eigenvector problem. The major computational task is to compute the product between the sparse Hamiltonian matrix H and the coefficient vector C,

$$\sigma = H C \quad (1)$$

The spin-free Hamiltonian operator in second-quantized form is

$$\hat{H} = \sum_{pq} h_{pq}\, \hat{E}_{pq} + \frac{1}{2} \sum_{pqrs} (pq,rs)\, \hat{e}_{pr,qs} \quad (2)$$

$$\hat{E}_{pq} = \hat{a}_p^{+} \hat{a}_q, \qquad \hat{e}_{pr,qs} = \hat{a}_p^{+} \hat{a}_r^{+} \hat{a}_s \hat{a}_q \quad (3)$$

where h_pq and (pq,rs) are molecular integrals, and Ê_pq and ê_pr,qs are the one- and two-electron operators defined by the creation and annihilation operators â_p^+ and â_p.

In sequential minimum operation count (MOC) algorithms [2-7] only the non-zero Hamiltonian matrix elements are computed, and the σ vector is updated by indexed multiply-and-add operations. However, on cache-based computer systems such indexed operations generally perform poorly. In addition, the parallel scalability of MOC algorithms is greatly limited by the need to replicate some computation [4,5] and by the high ratio of communication to computation [7].

In order to avoid the indexed computation kernel, we adopted a matrix-matrix multiplication based method using the N-2 electron string space [13]. By exploiting the structure of the Hamiltonian matrix using N-2 electron intermediate states, the major computation can be organized in three steps: 1) build the coefficient matrix of N-2 electron states from the vector C, 2) perform a dense matrix-matrix multiplication locally, and 3) accumulate the resulting coefficients into the σ vector. In the most time-consuming mixed-spin (alpha-beta) routine the contribution to the σ vector from the combination of single alpha and single beta excitations is computed as

$$D(qs, K'_\alpha K'_\beta) = \sum_{J_\alpha J_\beta} B^{K'_\alpha K'_\beta,\, J_\alpha J_\beta}_{qs}\, C(J_\alpha J_\beta) \quad (4)$$

$$E(pr, K'_\alpha K'_\beta) = \sum_{qs} (pr,qs)\, D(qs, K'_\alpha K'_\beta) \quad (5)$$

$$\sigma(I_\alpha I_\beta) = \sum_{pr,\, K'_\alpha K'_\beta} A^{I_\alpha I_\beta,\, K'_\alpha K'_\beta}_{pr}\, E(pr, K'_\alpha K'_\beta) \quad (6)$$

The interaction between electrons of the same spin (beta-beta) is

$$D(qs, I_\alpha K'_\beta) = \sum_{J_\beta} B^{K'_\beta J_\beta}_{qs}\, C(I_\alpha J_\beta) \quad (7)$$

$$E(pr, I_\alpha K'_\beta) = \sum_{q>s} \left[ (pq,rs) - (ps,rq) \right] D(qs, I_\alpha K'_\beta) \quad (8)$$

$$\sigma(I_\alpha I_\beta) = \sum_{p>r,\, K'_\beta} A^{I_\beta K'_\beta}_{pr}\, E(pr, I_\alpha K'_\beta) \quad (9)$$

The coupling coefficient matrices A and B are the matrix representations of two creation and two annihilation operators, respectively. The matrix-multiply-based algorithm avoids explicit computation of Hamiltonian matrix elements, which is the key to attaining parallel scalability for the same-spin interactions. In addition, the communication requirement of the DGEMM-based mixed-spin routine is significantly less than that of the MOC algorithm. In Table 1 we summarize the computational costs of the DGEMM and MOC algorithms.

Table 1. Performance model of the alpha-beta routine in minimum operation count FCI algorithms and in the DGEMM-based FCI algorithm. (a)

                    MOC                               DGEMM
Comput. kernel      DAXPY, or indexed multiply-add    DGEMM (gather & scatter)
Operation count     Nci*(n-Na)*Na*(n-Nb)*Nb           ~Nci*n*n*Na*Nb
Commu. count        Nci*Na*(n-Na)                     3*Nci*Na
Commu. kernel       Collective gather                 Gather & accumulate

(a) Nci is the dimension of the FCI vector, n is the number of orbitals, and Na and Nb are the numbers of alpha- and beta-spin electrons, respectively.
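To make the three-step structure of Eqs. (7)-(9) concrete, here is a minimal NumPy sketch that loops over intermediate strings, copies coefficient columns into a dense matrix D, performs one dense matrix multiply, and accumulates the result into σ. The dimensions, random coupling tables, and array names are illustrative assumptions, not the data structures of the actual program.

```python
import numpy as np

# Toy sizes (illustrative only): alpha strings, beta strings,
# intermediate (N_beta - 2)-electron strings, and orbital pairs.
n_alpha, n_beta, n_inter, n_pairs = 8, 6, 4, 5

rng = np.random.default_rng(0)
C = rng.standard_normal((n_alpha, n_beta))          # C(I_alpha, J_beta)
ints = rng.standard_normal((n_pairs, n_pairs))      # stand-in for (pq,rs)-(ps,rq)

# Hypothetical coupling tables standing in for the sparse A and B matrices:
# for each intermediate string K and orbital pair, one (beta-string, sign) entry.
B = rng.integers(0, n_beta, size=(n_inter, n_pairs))
A = rng.integers(0, n_beta, size=(n_inter, n_pairs))
sgnB = rng.choice([-1.0, 1.0], size=(n_inter, n_pairs))
sgnA = rng.choice([-1.0, 1.0], size=(n_inter, n_pairs))

sigma = np.zeros_like(C)
for K in range(n_inter):
    # 1) gather: copy the coupled columns of C into the dense matrix D   (Eq. 7)
    D = np.zeros((n_pairs, n_alpha))
    for qs in range(n_pairs):
        D[qs, :] = sgnB[K, qs] * C[:, B[K, qs]]
    # 2) one dense matrix-matrix multiply, the DGEMM step                (Eq. 8)
    E = ints @ D
    # 3) scatter/accumulate the result back into the sigma vector        (Eq. 9)
    for pr in range(n_pairs):
        sigma[:, A[K, pr]] += sgnA[K, pr] * E[pr, :]
```

In the real code the gather/scatter lists are precomputed per intermediate string, and the DGEMM call dominates the run time, which is what makes the kernel both vectorizable and multi-streamable.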

2.2 An automatically adjusted single-vector iterative diagonalization method

The limiting factor in FCI calculations is the storage of the subspace vectors in the iterative Davidson diagonalization method [14]. On most supercomputers the I/O bandwidth is so limited that storing the subspace vectors on disk implies a huge waste of computing resources. To partially alleviate this memory limitation, we developed a single-vector diagonalization method for the lowest eigenpair.

Suppose that at iteration n the approximate CI vector is C^(n) and the corresponding estimated eigenvalue is E^(n) = <C^(n)|Ĥ|C^(n)>. Following Olsen's correction scheme [15], the new correction vector can be computed as

$$t^{(n)} = -\left(H_0 - E^{(n)}\right)^{-1} \left( H - E^{(n)} - \Delta^{(n)} \right) C^{(n)} \quad (11)$$

where H_0 is an approximation to H, normally the diagonal elements of H as suggested by Davidson. Δ^(n) is the first-order correction to the eigenvalue, introduced to ensure orthogonality between the correction vector and the vector C^(n),

$$\Delta^{(n)} = \frac{\langle C^{(n)} | (H_0 - E^{(n)})^{-1} (H - E^{(n)}) | C^{(n)} \rangle}{\langle C^{(n)} | (H_0 - E^{(n)})^{-1} | C^{(n)} \rangle} \quad (12)$$

In Olsen's scheme the correction vector t^(n) is directly added to the approximation vector C^(n) to form the new approximation vector C^(n+1). Since there is no minimization procedure to determine the expansion coefficients, convergence is not guaranteed [16]. A more general way to construct the new approximate vector is to use a step length λ to mix the vectors t^(n) and C^(n),

$$C^{(n+1)} = S^{(n)} \left( C^{(n)} + \lambda^{(n)} t^{(n)} \right) \quad (13)$$

where S^(n) is the normalization parameter. The optimal step length λ_opt^(n) could be obtained by diagonalization of the corresponding 2x2 sub-matrix. However, the matrix element <t^(n)|Ĥ|t^(n)> cannot be computed unless both the correction vector t^(n) and its Hamiltonian projection vector are stored, which doubles the memory requirement. In this paper we use the optimal step length λ_opt^(n) from the previous iteration n as the step length λ^(n+1), since at iteration n+1 we can compute

$$\langle t^{(n)} | \hat{H} | t^{(n)} \rangle = \frac{E^{(n+1)}/(S^{(n)})^2 - E^{(n)} - 2\lambda^{(n)} \langle C^{(n)} | \hat{H} | t^{(n)} \rangle}{(\lambda^{(n)})^2} \quad (14)$$

Thus λ_opt^(n) can be obtained by diagonalization of the 2x2 Hamiltonian matrix of iteration n, and the new step length λ^(n+1) is estimated as

$$\lambda^{(n+1)} = \lambda_{opt}^{(n)} \quad (15)$$

In the first iteration the matrix element <t^(1)|Ĥ|t^(1)> is estimated more crudely.
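As a concrete illustration of Eqs. (11)-(15), the following dense-matrix sketch applies the automatically adjusted single-vector update to a small symmetric test matrix. The function name, starting guess, denominator guard, and test matrix are assumptions made to keep the toy runnable; the production code works with the distributed σ = HC product rather than an explicit H.

```python
import numpy as np
from scipy.linalg import eigh

def auto_single_vector(H, maxit=60, tol=1e-10):
    """Toy dense-matrix sketch of the automatically adjusted single-vector
    method (Eqs. 11-15), with H0 taken as diag(H) as suggested by Davidson."""
    H0 = H.diagonal().copy()
    C = np.zeros(H.shape[0])
    C[np.argmin(H0)] = 1.0                       # crude starting vector
    lam, prev = 1.0, None                        # first-iteration step length is a crude guess
    for it in range(maxit):
        sigma = H @ C                            # the expensive sigma = H*C step
        E = C @ sigma
        resid = sigma - E * C                    # (H - E) C
        if np.linalg.norm(resid) < tol:
            return E, C, it
        if prev is not None:
            # Eq. (14): recover <t|H|t> of iteration n from E^(n+1), then Eq. (15)
            E_p, CHt_p, Ct_p, tt_p, lam_p, S_p = prev
            tHt_p = (E / S_p**2 - E_p - 2.0 * lam_p * CHt_p) / lam_p**2
            h2 = np.array([[E_p, CHt_p], [CHt_p, tHt_p]])
            s2 = np.array([[1.0, Ct_p], [Ct_p, tt_p]])
            w, v = eigh(h2, s2)                  # 2x2 generalized eigenproblem of iteration n
            lam = v[1, 0] / v[0, 0]              # lambda_opt(n) used as lambda(n+1)
        # Eqs. (11)-(12): Olsen correction vector with eigenvalue shift Delta
        d = H0 - E
        d[np.abs(d) < 1e-6] = 1e-6               # guard near-zero denominators
        delta = (C @ (resid / d)) / (C @ (C / d))
        t = -(resid - delta * C) / d
        # Eq. (13): mix, renormalize, and remember what Eq. (14) will need next time
        Cnew = C + lam * t
        S = 1.0 / np.linalg.norm(Cnew)
        prev = (E, t @ sigma, C @ t, t @ t, lam, S)
        C = S * Cnew
    return E, C, maxit

# small diagonally dominant symmetric test problem
rng = np.random.default_rng(1)
A = 0.01 * rng.standard_normal((200, 200))
H = (A + A.T) / 2 + np.diag(np.arange(200, dtype=float))
E, C, nit = auto_single_vector(H)
print(nit, E, np.linalg.eigvalsh(H)[0])
```

The essential property is that only C, σ, and a handful of scalars from the previous iteration are kept; no growing subspace and no second Hamiltonian-projected vector are ever stored.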

3. IMPLEMENTATION

3.1 Data distribution

In this section we describe our implementation strategy on parallel and vector supercomputers, especially on the massively parallel vector machine Cray-X1. The Cray-X1 is a distributed shared memory architecture, consisting of shared-memory parallel (SMP) nodes linked by a high performance interconnect. Each node has four multi-streaming processors (MSPs) sharing flat access to local memory, and each MSP consists of four single streaming vector processors (SSPs) and a 1 MB cache. Standard strategies for distributed parallel programming can be applied on the Cray-X1 at the coarse level. However, to achieve high performance the code running on each MSP must be cache aware, multi-streamed and vectorized.

Fig. 1. FCI data distribution and the strategies of parallel implementation (gathered C columns, local computation, and stored σ updates on each node).

The FCI coefficient vector can be viewed as a matrix with rows and columns indexed by beta and alpha strings (occupation patterns), respectively. The coefficient matrix is distributed by columns evenly among all the processors. In cases where the coefficient matrix is symmetry blocked, each symmetry block is distributed separately. In addition to the distributed vectors C and σ, each processor also requires memory to hold the replicated integral matrix and a working area to store the gathered C coefficients and the computed update coefficients for the σ vector. As illustrated in Fig. 1, the computation strategy can be summarized in three steps: gather the required coefficient columns from the C vector, perform the matrix multiplication locally, and accumulate the computed result into the σ vector.

The one-sided remote gather and accumulation are handled by the distributed data interface (DDI) [17], which is a derivative of Global Arrays [18]. On the Cray-X1, DDI uses the SHMEM library [19], which maps directly onto the hardware capabilities. The vector gather and accumulate operations are accomplished by the DDI_GET and DDI_ACC functions, respectively. A remote accumulation actually involves twice the amount of communication of a remote get: to accumulate a result remotely, the DDI_ACC routine first acquires the mutex lock for the remote node and fetches the data to be updated with the SHMEM_GET routine. The actual accumulation is done locally, and the resulting data are sent back with the SHMEM_PUT routine. A SHMEM_QUIET call is made to ensure the completion of SHMEM_PUT before the mutex lock is released at the end of the DDI_ACC call.
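The gather/compute/accumulate pattern and the DDI_ACC locking sequence can be mimicked in a few lines. The sketch below is a single-process stand-in with invented names (ToyDDI, get, acc); it is not the DDI or SHMEM API, but it mirrors the lock, fetch, local add, write-back sequence described above.

```python
import threading
import numpy as np

class ToyDDI:
    """Single-process stand-in (names and layout are assumptions, not the real
    DDI API) for a column-distributed array with one-sided get/accumulate."""

    def __init__(self, nrows, ncols_per_node, nnodes):
        self.locks = [threading.Lock() for _ in range(nnodes)]        # per-node mutex
        self.data = [np.zeros((nrows, ncols_per_node)) for _ in range(nnodes)]

    def get(self, node, cols):
        # DDI_GET-like one-sided fetch of selected columns
        return self.data[node][:, cols].copy()

    def acc(self, node, cols, contrib):
        # DDI_ACC-like remote accumulate: lock, fetch, add locally, write back
        with self.locks[node]:                       # acquire the remote node's mutex
            block = self.data[node][:, cols].copy()  # fetch (SHMEM_GET)
            block += contrib                         # accumulate locally
            self.data[node][:, cols] = block         # write back (SHMEM_PUT + QUIET)

# usage sketch: gather C columns, compute locally, accumulate into sigma
rng = np.random.default_rng(0)
C_dist = ToyDDI(nrows=6, ncols_per_node=4, nnodes=2)
sigma_dist = ToyDDI(nrows=6, ncols_per_node=4, nnodes=2)
C_dist.data[1][:] = rng.standard_normal((6, 4))

cols = [0, 2]
gathered = C_dist.get(node=1, cols=cols)     # remote gather of C columns
update = 0.5 * gathered                      # placeholder for the local DGEMM work
sigma_dist.acc(node=1, cols=cols, contrib=update)
```

The factor-of-two communication cost of a remote accumulate relative to a remote get is visible directly: acc both fetches and writes back the block it updates.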

3.2 The implementation of same-spin and mixed-spin routines

The implementations of the same-spin and mixed-spin routines are illustrated in Figs. 2a and 2b, respectively. In the same-spin routine the transposed local C and σ coefficient matrices are used to facilitate the gather and scatter operations. The calculation is organized by looping through intermediate N_β-2 electron beta strings. For each N_β-2 electron string a list of related N_β electron strings is pre-computed. The computation over this string list can be multi-streamed, and the corresponding column of each beta string is directly copied to the matrix D.

Fig. 2. The DGEMM-based same-spin (2a) and mixed-spin (2b) routines (multi-streamed, vectorized local copy or gather into D, DGEMM with the integral matrix, and local or remote accumulate).

In the mixed-spin routine the intermediate coefficient matrix D can be constructed in a similar multi-streamed manner, except that the remote coefficient columns need to be gathered and stored locally before being used to construct the matrix D. In both routines the major computation occurs in the matrix-matrix multiplication step, so the optimization effort is greatly leveraged by using the system DGEMM routine.
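For the pre-computed string list used in the same-spin routine, the sketch below shows the underlying combinatorics: given an (N_β-2)-electron intermediate string, it enumerates the N_β-electron strings reached by creating two electrons, together with the orbital pair and a permutation sign. The string representation (sorted tuples of occupied orbitals) and the sign convention are illustrative choices, not the packed index tables of the actual code.

```python
from itertools import combinations

def related_strings(K_occ, n_orb):
    """List (string, (q, s), sign) for all N_beta-electron strings obtained from
    the (N_beta - 2)-electron intermediate K_occ by adding electrons in q > s."""
    unocc = [o for o in range(n_orb) if o not in K_occ]
    out = []
    for s, q in combinations(unocc, 2):          # s < q by construction
        occ_s = sorted(K_occ + (s,))             # apply a_s^+ first ...
        occ = sorted(occ_s + [q])                # ... then a_q^+
        # phase: (-1) raised to the number of occupied orbitals passed over
        sign = (-1) ** (sum(o < s for o in K_occ) + sum(o < q for o in occ_s))
        out.append((tuple(occ), (q, s), sign))
    return out

# intermediate string with electrons in orbitals 0 and 3, in a 5-orbital toy space
for string, pair, sign in related_strings((0, 3), n_orb=5):
    print(string, pair, sign)
```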

3.3 Load balancing schemes

In the same-spin routine a static load balancing scheme is employed. The task of each processor is to update its local segment of the σ vector. Since the loop over the entire set of N_β-2 electron beta strings is required on every processor, the workload is evenly distributed across processors. At the same time, only the local parts of the C and σ coefficients are required, so no network communication is involved.

In the mixed-spin routine, each processor is assigned different sets of N_α-1 electron alpha occupations. The workload of each set cannot be easily estimated, so a dynamic load balancing scheme is adopted. The load balancing is implemented using a task pool with a manager/worker style, centrally controlled scheduling policy. The task pool is pre-computed and replicated on all the processors, and each processor repeatedly requests a new task number from the dynamic load balancing server. In DDI, the load balancing function is implemented using the SHMEM_SWAP function.

Fig. 3. The task list generated in the alpha-beta routine for dynamic load balancing (NLtask aggregated tasks of decreasing size followed by an NStask tail of fine-granularity tasks).

Good load balance can be achieved by executing the tasks in order of decreasing size. Generally, a large number of tasks of very fine granularity is preferred for good load balance, but this increases the communication overhead. As a compromise, task aggregation is performed to artificially produce larger tasks in order of decreasing size. Three parameters are used in our code to define the parallel subtasks: (i) NFineTask_proc defines the number of initial fine-grained tasks per processor; (ii) NLtask_proc defines the number of final aggregated large tasks; (iii) NStask_proc defines the number of small tasks at the end of the task pool. As illustrated in Fig. 3, large tasks are generated with decreasing size, and the extra short tail of fine-grained tasks helps to further ensure that the load imbalance is limited by the work of a fine-grained task even in the worst cases.
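The task aggregation can be sketched as follows: the function below bundles a list of fine-grained task costs into large tasks of decreasing size plus a short tail of small tasks. The parameter names echo NLtask_proc and NStask_proc, but the specific heuristic (linearly decreasing targets filled greedily from the largest fine-grained tasks) is only an assumed illustration of the idea, not the program's actual scheme.

```python
import random

def build_task_pool(fine_costs, n_large, n_small):
    """Aggregate fine-grained tasks into n_large tasks of decreasing size,
    keeping n_small of the smallest fine-grained tasks as a trailing tail."""
    order = sorted(range(len(fine_costs)), key=lambda i: -fine_costs[i])
    head, tail = order[:len(order) - n_small], order[len(order) - n_small:]
    total = sum(fine_costs[i] for i in head)
    weights = [n_large - k for k in range(n_large)]          # n_large, ..., 1
    targets = [total * w / sum(weights) for w in weights]    # decreasing target sizes
    pool, pos = [], 0
    for tgt in targets:
        task, acc = [], 0.0
        while pos < len(head) and (acc < tgt or not task):
            task.append(head[pos])
            acc += fine_costs[head[pos]]
            pos += 1
        if task:
            pool.append(task)
    if pos < len(head):                                      # sweep up any leftovers
        pool[-1].extend(head[pos:])
    return pool + [[i] for i in tail]                        # big tasks first, fine tail last

random.seed(0)
costs = [random.random() for _ in range(60)]
pool = build_task_pool(costs, n_large=6, n_small=10)
print([round(sum(costs[i] for i in task), 2) for task in pool])
```

Workers then simply fetch the next task index from the shared counter (SHMEM_SWAP in DDI), so the expensive aggregated tasks are handed out first and the cheap tail absorbs whatever imbalance remains.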

4. RESULTS

Table 2 compares the convergence of the Davidson subspace method, Olsen's scheme and the automatically adjusted method of this paper. In the subspace method, the Olsen correction vector is used as a basis vector, and the optimal step length for mixing the correction vector with the current approximation vector is computed at each iteration by diagonalization of the 2x2 subspace. In the modified Olsen scheme, a fixed λ=0.7 is used to construct the new approximation vectors. All the calculations are tightly converged in the residual norm. In all the calculations a model space is selected to improve the convergence: inside the model space the exact Hamiltonian is used to compute the correction vector; outside the model space the diagonal elements are used.

Table 2. Iterations required by various diagonalization methods for a 10^-10 Eh convergence criterion ("NC" denotes no convergence).

Molecule   Sym., state   Dimension    Model space   Davidson   Olsen/(Modified)   Auto
H3COH      C1, 1A        41,409,                               NC/19 (λ=0.7)      15
H2O2       C2, 1A        506,383,                              NC/22 (λ=0.7)      15
CN+        C2v, 1Σ+      104,806,                              >>60               22
O          D2h, 3P       2 18,441,

The chemical systems chosen here include the strongly multi-reference CN+ system as well as the open-shell O atom. Table 2 indicates that the original Olsen scheme has serious problems in producing tightly converged eigenvectors. A damping factor of 0.7 corrected the problem in some cases, but still failed for CN+. For all four systems both the subspace method and the automatically adjusted single-vector method reached tightly converged results. Although the mixing coefficient λ used in the subspace method is considered to be optimal at each iteration, surprisingly, the automatically adjusted method requires fewer iterations than the subspace method does. In the calculation of CN+ the number of iterations is even cut in half by the automatically adjusted single-vector method.

Figure 4 reports times for a series of FCI calculations on the oxygen atom on the Cray-X1 using 16 to 128 MSPs, comparing the performance and parallel scalability of the minimum operation count and DGEMM-based FCI algorithms. The large aug-cc-pVQZ basis set was employed in the calculations, so the difference between the operation counts of the two algorithms is insignificant, as can be estimated from the performance model in Table 1. The differences arise from the efficiencies of the computation kernels and the overheads in the parallel implementation. Figure 4 clearly shows the performance advantage of the DGEMM-based algorithm. Although the major computation is multi-streamed and vectorized in both implementations, the performance of the DGEMM and DAXPY kernels is quite different on the Cray-X1. The Cray-X1 evaluation report [20] indicates that out-of-cache DAXPY can only realize 2 GFlop/s per MSP, whereas DGEMM attains 10-11 GFlop/s per MSP for matrices beyond 300x300.

Fig. 4. Comparison of timing and parallel scalability between the MOC and DGEMM-based FCI algorithms (time in seconds versus MSPs; curves: alpha-beta and beta-beta routines for MOC and DGEMM).

Fig. 5. Parallel scalability of FCI calculations of the oxygen anion ground state on the Cray-X1 (speedup versus MSPs).

The major performance boost of the DGEMM-based same-spin routine arises from the elimination of the explicit computation of Hamiltonian matrix elements. As shown in Figure 4, the same-spin routine using the sequential MOC algorithm does not scale at all: redundant computation of the entire double excitation list completely overwhelms any parallel efficiency in this calculation. For the mixed-spin routine, a significant saving of the DGEMM-based implementation also comes from the reduced communication cost. In the oxygen calculation the communication cost is reduced by about a factor of 25 in the DGEMM-based implementation.

The parallel scalability of our DGEMM-based implementation was further evaluated by running the oxygen anion calculation from 128 MSPs to 256 MSPs on the Cray-X1. The aug-cc-pVQZ basis set was used and the calculation includes 14,851,999,576 determinants. As shown in Figure 5, almost perfect speedup was obtained. The same-spin routine sustained about 9.6 GFlop/s per MSP, and the mixed-spin routine ran at 8.5 to 8.1 GFlop/s per MSP from 128 MSPs to 256 MSPs.

In Table 3 the efficiency and capacity of our FCI program are demonstrated by a benchmark calculation on the C2 ground state X 1Σg+. Previous FCI benchmark calculations on C2 [22] have been reported, but in a basis set of double-zeta quality. The basis set in our new benchmark consists of aug-cc-pVTZ without the augmented diffuse d and f functions. In D2h symmetry, the FCI space includes about 65 billion determinants. This is, to our knowledge, by far the largest FCI calculation ever explicitly performed. The previous largest FCI calculation was reported by Ansaloni et al. for the N2 molecule with about 10 billion determinants; that calculation took more than 7 hours per iteration on 128 nodes of a Cray-T3E [5]. Our C2 benchmark calculation was run on 432 MSPs of the Cray-X1. It took about 249 seconds for each iteration, and 25 iterations were required to reach a residual norm of 10^-5 using the automatically adjusted single-vector diagonalization method. In each iteration about 6.2 TB of data is transferred via network communication in the mixed-spin routine. The entire calculation sustained approximately 8 GFlop/s per MSP, or about 62% of the peak performance of the Cray-X1. The aggregate performance over the entire 432 MSPs is about 3.4 TFLOP/s.

Table 3. Timing and performance results of the FCI benchmark calculation on C2 using 432 MSPs of the Cray-X1.

Molecule          C2
State             X 1Σg+
Basis             cc-pVTZ(+1s,1p)
CI space          FCI(8,66)
CI dimension      64,931,348,928
MSPs              432
Beta-beta         62 s / 8.5 GFlop/s per MSP
Alpha-beta        167 s / 8.8 GFlop/s per MSP
Load imbalance    9 s
Vector symm.      11 s
Total             249 s / ~8.0 GFlop/s per MSP
Disk I/O          293 MB/s (read), 246 MB/s (write)

5. CONCLUSION

In this paper we presented an efficient parallel algorithm for solving the eigenproblem arising in FCI calculations. By exploiting the structure of the Hamiltonian matrix we were able to express the problem using dense matrix-matrix multiplications with the help of vector gather and scatter operations. At the same time, the explicit construction of Hamiltonian matrix elements is avoided, which removes a major parallel-scalability bottleneck that was not well solved in prior work by ourselves and others. To perform much larger calculations and also avoid the I/O bottleneck on supercomputers, we also introduced an automatically adjusted single-vector iterative diagonalization method. The new diagonalization method enabled us to perform a 65 billion determinant benchmark calculation on the C2 molecule using 432 MSPs of the Cray-X1. Each matrix-vector multiplication took about 4 minutes, and the calculation sustained an aggregate performance of 3.4 TFLOP/s, or about 62% of the peak performance of the Cray-X1.

6. ACKNOWLEDGMENTS

This work is supported by the Scientific Discovery through Advanced Computing (SciDAC) program of the U.S. Department of Energy, Division of Basic Energy Sciences, Office of Science, using the resources of the Center for Computational Sciences at Oak Ridge National Laboratory under contract DE-AC05-00OR.

7. REFERENCES

[1] P. J. Knowles and N. C. Handy, A new determinant-based full configuration interaction method. Chem. Phys. Lett. 111.
[2] J. Olsen, B. O. Roos, P. Jorgensen, and H. J. Aa. Jensen, Determinant based configuration interaction algorithms for complete and restricted configuration interaction spaces. J. Chem. Phys. 89.
[3] S. Evangelisti, G. L. Bendazzoli, R. Ansaloni, F. Duri, and E. Rossi, A one billion determinant full CI benchmark on the Cray T3D. Chem. Phys. Lett. 252.
[4] E. Rossi, G. L. Bendazzoli and S. Evangelisti, Full configuration interaction algorithm on a massively parallel architecture: Direct-list implementation. J. Comput. Chem. 19.
[5] R. Ansaloni, G. L. Bendazzoli, S. Evangelisti and E. Rossi, A parallel Full-CI algorithm. Comput. Phys. Commun. 128.
[6] M. Klene, M. A. Robb, M. J. Frisch, and P. Celani, Parallel implementation of the CI-vector evaluation in full CI/CAS-SCF. J. Chem. Phys. 113.
[7] Z. Gan, Y. Alexeev, M. S. Gordon and R. A. Kendall, The parallel implementation of a full CI program. J. Chem. Phys. 119.
[8] R. C. Agarwal, F. G. Gustavson, and M. Zubair, A high performance algorithm using pre-processing for the sparse matrix vector multiplication. Proceedings of Supercomputing '92.
[9] S. Toledo, Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development 41.
[12] I. Foster, Designing and Building Parallel Programs. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[13] R. J. Harrison and S. Zarrabian, An efficient implementation of the full-CI method using an (n-2)-electron projection space. Chem. Phys. Lett. 158.
[14] E. R. Davidson, The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real-symmetric matrices. J. Comput. Phys. 17.
[15] J. Olsen, P. Jorgensen and J. Simons, Passing the one-billion limit in full configuration-interaction (FCI) calculations. Chem. Phys. Lett. 169.
[16] M. L. Leininger, C. D. Sherrill, W. D. Allen, and H. F. Schaefer, Systematic study of selected diagonalization methods for configuration interaction matrices. J. Comput. Chem. 22.
[17] M. W. Schmidt, K. K. Baldridge, J. A. Boatz et al., J. Comput. Chem. 14, 1347 (1993); G. D. Fletcher, M. W. Schmidt, and M. S. Gordon, Adv. Chem. Phys. 110, 267 (1999); Comput. Phys. Commun. 128, 190 (2000).
[18] J. Nieplocha, R. J. Harrison, and R. J. Littlefield, Global Arrays: A nonuniform memory access programming model for high-performance computers. The Journal of Supercomputing 10, 1996.
[19] R. Bariuso and A. Knies, SHMEM's User's Guide. Cray Research, Inc., Eagan, MN, USA.
[20] P. H. Worley and T. H. Dunigan, Early evaluation of the Cray X1 at Oak Ridge National Laboratory. In Proceedings of the 45th Cray User Group Conference, Columbus, OH.
[22] M. L. Leininger, C. D. Sherrill, W. D. Allen and H. F. Schaefer III, Benchmark configuration interaction spectroscopic constants for X 1Σg+ C2 and X 1Σ+ CN+. J. Chem. Phys. 108.
