Novel Architectures

Graphical Processing Units for Quantum Chemistry

Ivan S. Ufimtsev and Todd J. Martínez
University of Illinois at Urbana-Champaign

The authors provide a brief overview of electronic structure theory and detail their experiences implementing quantum chemistry methods on a graphical processing unit. They also analyze algorithm performance in terms of floating-point operations and memory bandwidth, and assess the adequacy of single-precision accuracy for quantum chemistry applications.

In 1830, Auguste Comte wrote in his work Philosophie Positive:

"Every attempt to employ mathematical methods in the study of chemical questions must be considered profoundly irrational and contrary to the spirit of chemistry. If mathematical analysis should ever hold a prominent place in chemistry (an aberration which is happily almost impossible) it would occasion a rapid and widespread degeneration of that science."

Fortunately, Comte's assessment was far off the mark and, instead, the opposite has occurred. Detailed simulations based on the principles of quantum mechanics now play a large role in suggesting, guiding, and explaining experiments in chemistry and materials science. In fact, quantum chemistry is a major consumer of CPU cycles at national supercomputer centers. The field's rise to prominence is largely due to early demonstrations of quantum mechanics applied to chemical problems and to the tremendous advances in computing power over the past decades, two developments that Comte could not have foreseen.

However, limited computational resources remain a serious obstacle to the application of quantum chemistry to problems of widespread importance, such as the design of more effective drugs to treat diseases or of new catalysts for use in applications such as fuel cells or environmental remediation. Thus, researchers have considerable impetus to relieve this bottleneck in any way possible, both by developing new and more effective algorithms and by exploring new computer architectures. In our own work, we've recently begun exploring the use of graphical processing units (GPUs), and this article presents some of our experiences with GPUs for quantum chemistry.

GPU Architecture

Low precision (generally, 24-bit arithmetic) and limited programmability stymied early attempts to use GPUs for general-purpose scientific computing.[1-4] However, the release of the Nvidia G80 series and the compute unified device architecture (CUDA) application programming interface (API) have ushered in a new era in which these difficulties are largely ameliorated. The CUDA API lets developers control the GPU via an extension of the standard C programming language (as opposed to specialized assembler or graphics-oriented APIs, such as OpenGL and DirectX). The G80 supports 32-bit floating-point arithmetic,

which is largely (but not entirely) compliant with the IEEE-754 standard. In some cases, this might not be sufficient precision (many scientific applications use 64-bit or double precision), but Nvidia has already released its next generation of GPUs that supports 64-bit arithmetic in hardware. We performed all the calculations presented in this article on a single Nvidia GeForce 8800 GTX card using the CUDA API (the CUDA documentation offers a detailed overview of the hardware and API). We sketch some of the important concepts here.

As depicted schematically in Figure 1, the GeForce 8800 GTX consists of 16 independent streaming multiprocessors (SMs), each comprised of eight scalar units operating in SIMD fashion and running at 1.35 GHz. The device can process a large number of concurrent threads; these threads are organized into a one- or two-dimensional (1- or 2D) grid of 1-, 2-, or 3D blocks with up to 512 threads in each block. Threads in the same block are guaranteed to execute on the same SM and have access to fast on-chip shared memory (16 Kbytes per SM) for efficient data exchange. Threads belonging to different blocks can execute on different SMs and must exchange data through the GPU DRAM (768 Mbytes), which has much larger latency than shared memory. There's no efficient way to synchronize thread block execution (that is, to control in which order or on which SM the blocks will be processed); thus, an efficient algorithm should avoid communication between thread blocks as much as possible.

Figure 1. Schematic block diagram of the Nvidia GeForce 8800 GTX. It has 16 streaming multiprocessors (SMs), each containing 8 SIMD streaming processors and 16 Kbytes of shared memory.

The thread block grid can contain up to 65,535 blocks in each dimension. Every thread block has its own unique serial number (or two numbers, if the grid is 2D); likewise, each thread also has a set of indices identifying it within a thread block. Together, the thread block and thread serial numbers provide enough information to precisely identify a given thread and thereby split computational work in an application. Much of the challenge in developing an efficient algorithm on the GPU involves determining an ideal mapping between computational tasks and the grid/thread block/thread hierarchy. Ideal mappings lead to load balance, with interthread communication restricted to within the thread blocks.

Because the number of threads running on a GPU is much larger than the total number of processing units (to fully utilize the device's computational capabilities and efficiently hide the DRAM's access latency, the program must spawn at least 10^4 threads), the hardware executes all threads in time-slicing. The thread scheduler splits the thread blocks into 32-thread warps, which the SMs then process in SIMD fashion, with all 32 threads executed by eight scalar processors in four clock cycles. The thread scheduler periodically switches between warps, maximizing overall application performance. An important point here is that once all the threads in a warp have completed, this warp is no longer scheduled for execution, and no load-balancing penalty is incurred. Another important consideration is the amount of on-chip resources available for an active thread. Active means that the thread has GPU context (registers and so on) attached to it and is included in the thread scheduler's to-do list.
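To make this indexing concrete, a minimal CUDA kernel (a generic sketch, not code from our quantum chemistry application) combines the built-in block and thread indices into a unique global work-item ID:

    __global__ void scaleArray(const float *in, float *out, int n)
    {
        // Built-in indices: blockIdx selects this thread's block within the
        // grid, threadIdx selects this thread within its block.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // guard against the final, partially full block
            out[i] = 2.0f * in[i];  // each thread processes one array element
    }

    // Host-side launch: enough 256-thread blocks to cover all n elements.
    // scaleArray<<<(n + 255) / 256, 256>>>(d_in, d_out, n);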
Once a thread is activated, it won't be deactivated until all of its instructions are executed. A G80 SM can support up to 768 active threads (24 warps), and the warp-switching overhead is negligible compared to the time required to execute a typical instruction. Such cost-free switching is possible because every thread has its own context, which implies that the whole register space (32 Kbytes per SM) is evenly distributed among active threads (for 768 active threads, every thread has 10 registers available). If the threads need more registers, fewer threads are activated, leading to partial SM occupation. This important parameter determines whether the GPU DRAM access latency can be efficiently hidden (a large number of active threads means that although some threads are waiting for data, others can execute instructions, and vice versa). Thus, any GPU kernel (a primitive routine each GPU thread executes) should consume as few registers as possible to maximize SM occupation.

Quantum Chemistry Overview

Two of the most basic questions in chemistry are, where are the electrons? and where are the nuclei? Electronic structure theory, that is, quantum chemistry, focuses on the first one. Because the electrons are very light, we must apply the laws of quantum mechanics, which are described

with an electronic wave function determined by solving the time-independent Schrödinger equation. As usual in quantum mechanics, this wave function's absolute square is interpreted as a probability distribution for electron positions. Once we know the electronic distribution for a fixed nuclear configuration, it's straightforward to calculate the resulting forces on the nuclei. Thus, the answer to the second question follows from the answer to the first through either a search for the nuclear arrangement that minimizes the energy (molecular geometry optimization) or solution of the classical Newtonian equations of motion. (It's also possible, and in some cases necessary, to solve quantum mechanical equations of motion for the nuclei, but we don't consider this further here.) The great utility of quantum chemistry comes from the resulting ability to predict molecular shapes and chemical rearrangements.

Denoting the set of all electronic coordinates as r and all nuclear coordinates as R, we can write the electronic time-independent Schrödinger equation for a molecule as

    H(r, R) ψ_elec(r, R) = E(R) ψ_elec(r, R),   (1)

where H(r, R) is the electronic Hamiltonian operator describing the electronic kinetic energy as well as the Coulomb interactions between all electrons and nuclei, ψ_elec(r, R) is the electronic wave function, and E(R) is the total energy. This total energy depends only on the nuclear coordinates and is often referred to as the molecule's potential energy surface.

For most molecules, exact solution of Equation 1 is impossible, so we must invoke approximations. Researchers have developed numerous such approximate approaches, the simplest of which is the Hartree-Fock (HF) method.[5] In HF theory, we write the electronic wave function as a single antisymmetrized product of orbitals, which are functions describing a single electron's probability distribution. Physically, this means that the electrons see each other only in an averaged sense while obeying the appropriate Fermi statistics. We can obtain higher accuracy by including many antisymmetrized orbital products or by using density functional theory (DFT) to describe the detailed electronic correlations present in real molecules.[6] We consider only HF theory in this article because it illustrates many key computational points.

We express each electronic orbital φ_i(r) in HF theory as a linear combination of K basis functions χ_μ(r) specified in advance:

    φ_i(r) = Σ_{μ=1}^{K} C_{iμ} χ_μ(r).   (2)

Notice that we no longer write the electronic coordinates in boldface, to emphasize that these are a single electron's coordinates. The computational task is then to determine the linear coefficients C_{iμ}. Collecting these coefficients into a matrix C and introducing the overlap matrix S with elements

    S_{μν} = ∫ χ_μ(r) χ_ν(r) d³r,   (3)

we can write the HF equations as

    F(C) C = S C ε,   (4)

where ε is a diagonal matrix of one-electron orbital energies. The Fock matrix F is a one-electron analog of the Hamiltonian operator in Equation 1, describing the electronic kinetic energy, the electron-nuclear Coulomb attraction, and the averaged electron-electron repulsion. Because F depends on the unknown matrix C, we solve for the unknowns C and ε with a self-consistent field (SCF) procedure. After guessing C, we construct the Fock matrix and solve the generalized eigenvalue problem in Equation 4, giving ε and a new set of coefficients C.
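In outline, the SCF procedure is the following host-side loop (a schematic sketch; Matrix, buildFock, solveGeneralizedEigen, and maxAbsDiff are hypothetical placeholder names, not routines from any particular package):

    // Schematic SCF driver (host code); helpers are placeholders only.
    struct Matrix { /* dense K x K storage omitted */ };

    Matrix buildFock(const Matrix &C);                     // Equation 5; dominated by ERI work
    void   solveGeneralizedEigen(const Matrix &F, const Matrix &S,
                                 Matrix &C, Matrix &eps);  // F C = S C eps (Equation 4)
    double maxAbsDiff(const Matrix &a, const Matrix &b);   // convergence measure

    void scf(Matrix &C, const Matrix &S)
    {
        const double tol = 1e-6;                 // coefficient-change tolerance
        for (int iter = 0; iter < 100; ++iter) {
            Matrix F = buildFock(C);             // Fock matrix from current guess
            Matrix Cnew, eps;
            solveGeneralizedEigen(F, S, Cnew, eps);
            double change = maxAbsDiff(C, Cnew);
            C = Cnew;                            // accept the new coefficients
            if (change < tol) break;             // C and F are now self-consistent
        }
    }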
The process iterates until C remains unchanged within some tolerance, that is, until C and F are self-consistent. The dominant effort in the formation of F lies in the evaluation of the two-electron repulsion integrals (ERIs) representing the Coulomb (J) and exchange (K) interactions between pairs of electrons. The exchange interaction is a nonclassical Coulomb-like term arising from the electronic wave function's antisymmetry. Specifically, we construct the Fock matrix for a molecule with N electrons as

    F(C) = H^core + J(C) - (1/2) K(C),   (5)

where H^core includes the electronic kinetic energy and the electron-nuclear attraction. (The equations given here are specific to closed-shell singlet molecules, with no unpaired electrons.) The Coulomb and exchange matrices are given by

    J_{μν} = Σ_{λσ} P_{λσ} (μν|λσ)   (6)

    K_{μν} = Σ_{λσ} P_{λσ} (μλ|σν),   (7)

in terms of the density matrix P and the two-electron integrals (μν|λσ),

which are defined as

    P_{λσ} = 2 Σ_{i=1}^{N/2} C_{λi} C_{σi}   (8)

    (μν|λσ) = ∫∫ χ_μ(r_1) χ_ν(r_1) χ_λ(r_2) χ_σ(r_2) / |r_1 - r_2| d³r_1 d³r_2.   (9)

Usually, we choose the basis functions as Gaussians centered on the nuclei, which leads to analytic expressions for the two-electron integrals. Nevertheless, K⁴ such integrals must be evaluated, where K grows linearly with the size of the molecule under consideration. In practice, many of these integrals are small and can be neglected, but the number of non-negligible integrals still grows faster than O(K²), making their evaluation a critical bottleneck in quantum chemistry. We can calculate and store the ERIs for use in constructing the J and K matrices throughout the iterative SCF procedure (conventional SCF), or we can recalculate them in each iteration (direct SCF). The direct SCF method is often more efficient in practice because it minimizes the I/O associated with reading and writing ERIs.

The form of the basis functions χ_μ(r) is arbitrary in principle, as long as the basis set is sufficiently flexible to describe the electronic orbitals. The natural choice for these functions is the Slater-type orbital (STO), which comes from the exact analytic solution of the electronic structure problem for the hydrogen atom:

    χ_μ^STO(r) = (x - x_μ)^l (y - y_μ)^m (z - z_μ)^n e^{-α_μ |r - R_μ|},   (10)

where R_μ is the position of the nucleus on which the basis function is centered (with components x_μ, y_μ, and z_μ), and the integers l, m, and n represent the orbital's angular momentum. The orbital's total angular momentum, l_total, is given by the sum of these integers and is often referred to as s, p, and d for l_total = 0, 1, and 2, respectively. Unfortunately, it's difficult to evaluate the required two-electron integrals using these basis functions, so we use Gaussian-type orbitals (GTOs) in their stead:

    χ_μ^GTO(r) = (x - x_μ)^l (y - y_μ)^m (z - z_μ)^n e^{-α_μ |r - R_μ|²}.   (11)

Relatively simple analytic expressions are available for the required integrals when using GTO basis sets, but the functional form is qualitatively different. To mimic the more physically motivated STOs, we typically contract these GTOs as

    χ_μ^STO(r) ≈ χ_μ^{GTO,contracted}(r) = Σ_{i=1}^{N_μ} d_{μi} χ_{μi}^{GTO,primitive}(r).   (12)

This procedure leads to two-electron integrals over contracted basis functions, which are given as sums of integrals over primitive basis functions. Unlike the elements of the C matrix, the contraction coefficients d_{μi} aren't allowed to vary during the SCF iterative process. The number of primitives in a contracted basis function, N_μ, is the contraction length and usually varies between one and eight. Given this basis set construction, we can now talk about ERIs over contracted or primitive basis functions:

    (μν|λσ) = Σ_{p=1}^{N_μ} Σ_{q=1}^{N_ν} Σ_{r=1}^{N_λ} Σ_{s=1}^{N_σ} d_{μp} d_{νq} d_{λr} d_{σs} [pq|rs],   (13)

where square brackets denote integrals over primitive basis functions and parentheses denote integrals over contracted basis functions.

Several algorithms can evaluate the primitive integrals [pq|rs] for GTO basis sets. We won't discuss these in any detail here, except to say that we've used the McMurchie-Davidson scheme,[7] which requires relatively few intermediates per integral. The resulting low memory requirements for the kernels let us maximize SM occupancy. The operations involved in evaluating the primitive integrals include evaluation of reciprocals, square roots, and exponentials, in addition to simple arithmetic operations such as addition and multiplication.
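In code, Equation 13 amounts to a fourfold loop over primitives. The following C-style sketch (with hypothetical Prim, Shell, and primitiveERI stand-ins for a real integral library) shows the contraction step:

    // Sketch of Equation 13: a contracted ERI is a coefficient-weighted
    // sum of primitive ERIs [pq|rs]. Types and primitiveERI are
    // illustrative placeholders, not a real integral code.
    struct Prim  { double alpha; };   // primitive exponent (centers etc. omitted)
    struct Shell { int nPrim; const double *d; const Prim *prim; };

    double primitiveERI(const Prim &p, const Prim &q,
                        const Prim &r, const Prim &s);   // e.g., McMurchie-Davidson

    double contractedERI(const Shell &mu, const Shell &nu,
                         const Shell &lam, const Shell &sig)
    {
        double sum = 0.0;
        for (int p = 0; p < mu.nPrim; ++p)
          for (int q = 0; q < nu.nPrim; ++q)
            for (int r = 0; r < lam.nPrim; ++r)
              for (int s = 0; s < sig.nPrim; ++s)
                sum += mu.d[p] * nu.d[q] * lam.d[r] * sig.d[s]
                     * primitiveERI(mu.prim[p], nu.prim[q],
                                    lam.prim[r], sig.prim[s]);  // [pq|rs]
        return sum;
    }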
GPU Algorithms for ERI Evaluation

In other work, we've explored three different algorithms to evaluate the O(K⁴) ERIs over contracted basis functions and store them in the GPU memory.[8] Here, we summarize the algorithms and comment on several aspects of their performance. As a test case, we consider a molecular system composed of 64 hydrogen atoms arranged on a lattice. We use two basis sets: the first (denoted STO-6G) has six primitive functions for each contracted basis function, with one contracted basis function per atom; the second (denoted 6-311G) has three contracted functions per atom, built from three, one, and one primitive basis functions, respectively. These two basis sets represent highly contracted and relatively uncontracted basis sets, and they serve to show how the degree of contraction in the basis set affects algorithm performance. For the hydrogen atom lattice

test case, the number of contracted basis functions is 64 and 192 for the STO-6G and 6-311G basis sets, respectively, which leads to O(10⁶) and O(10⁸) ERIs over contracted basis functions.

Figure 2. Schematic of three different mapping schemes for evaluating ERIs on the GPU. The large square represents the matrix of contracted integrals; small squares below the main diagonal (blue) represent integrals that don't need to be computed because the integral matrix is symmetric. Each of the contracted integrals is a sum over primitive integrals. The mapping schemes differ in how the computational work is apportioned: red squares superimposed on the integral matrix denote work done by a representative thread block, and the three blow-ups show how the work is apportioned to threads within the thread block.

As we can see in Equation 9, the ERIs have several permutation symmetries; for example, interchange of the first or last two indices in the (μν|λσ) ERI doesn't change the integral's value. Thus, we can represent the contracted ERIs as a square matrix of dimension K(K + 1)/2 × K(K + 1)/2, as Figure 2 shows; here, the rows and columns represent unique μν and λσ index pairs. Furthermore, we can interchange the first pair of indices with the last pair without changing the ERI value, that is, (μν|λσ) = (λσ|μν). This implies that the ERI matrix is symmetric, and only the ERIs on or above the main diagonal need to be calculated. Figure 2 shows the primitive integrals contributing to each contracted integral as small squares (see the blow-up labeled "primitive integrals"). We've simplified here to the case in which each contracted basis function is a linear combination of the same number of primitives. In realistic cases, each of the contracted basis functions can involve a different number of primitive basis functions.

This organization of the contracted ERIs immediately suggests three different mappings of the computational work to thread blocks. We could assign a thread to each contracted ERI (TCI, Thread-Contracted Integral in Figure 2); a thread block to each contracted ERI (BCI, Block-Contracted Integral in Figure 2); or a thread to each primitive ERI (TPI, Thread-Primitive Integral in Figure 2). We've implemented all three of these schemes on the GPU; the grain of parallelism and the degree of load balancing differ in all three cases. The TPI scheme is the most fine-grained and provides the largest number of threads for calculation; the TCI scheme is the least fine-grained, providing a larger amount of work for each active thread.

In the TCI scheme, each thread calculates its contracted integral by directly looping over all contributing primitive ERIs and accumulating the results according to Equation 13. Once the primitive ERI evaluation and summation completes, the contracted integral is stored in the GPU memory.
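A minimal CUDA sketch of the TCI mapping (hypothetical data layout; single precision, as on the G80; not our production kernel) might look like this:

    // TCI sketch: one thread per contracted ERI. Each thread loops over
    // the primitive quartets of its integral and accumulates Equation 13.
    // PrimQuartet and evalPrimitiveERI are illustrative stand-ins.
    struct PrimQuartet { float a, b, c, d; };   // packed primitive data (schematic)

    __device__ float evalPrimitiveERI(const PrimQuartet &q)
    {
        return q.a * q.b * q.c * q.d;           // placeholder arithmetic only
    }

    __global__ void tciKernel(const PrimQuartet *quartets,
                              const int *first,   // start of each integral's quartets
                              const int *count,   // quartets per contracted integral
                              float *contracted, int nContracted)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one contracted ERI per thread
        if (i >= nContracted) return;
        float sum = 0.0f;
        for (int k = 0; k < count[i]; ++k)              // contraction loop (Equation 13)
            sum += evalPrimitiveERI(quartets[first[i] + k]);
        contracted[i] = sum;                            // store result in GPU DRAM
    }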

Neighboring integrals can have different numbers of contributing primitives and hence a different number of loop cycles. When the threads responsible for these neighboring integrals belong to the same warp, they execute in SIMD fashion, which produces load imbalance. We can minimize the impact by further organizing the integrals into subgrids according to the contraction length of the basis functions involved. In this case, all threads in a warp have similar workloads (thus minimizing load-balancing issues), but this requires further reorganization of the computation, with both programming and runtime overhead; the latter, however, is usually small when compared to typical computation times.

The BCI mapping scheme is finer-grained and maps each contracted integral to a whole thread block rather than a single thread. This organization avoids the load-imbalance issues inherent to the TCI algorithm because distinct thread blocks never share common warps. Within a block, we have several ways to assign primitive integrals to GPU threads. We chose to assign them cyclically, with each successive term in the sum of Equation 13 mapped to a successive GPU thread; when the last thread is reached, the subsequent integral is assigned to the first thread, and so on. After all threads compute their integrals, the results are summed using the shared on-chip memory, and the final result is stored in GPU DRAM.

Unfortunately, the BCI scheme sometimes experiences load-balancing issues that are difficult to eliminate. Consider a contracted integral comprised of just one primitive integral; such a situation is possible when all the basis functions have unit contraction length (that is, they aren't contracted at all). In this case, only one thread in the whole block will have work assigned to it, but because the warps are processed in SIMD fashion, the other 31 threads in the warp will execute the same set of instructions and waste the computational time. Direct tests, performed on a system with a large number of weakly contracted integrals, confirm this prediction.

The TPI mapping scheme exhibits the finest-grained level of parallelism of the schemes presented. Unlike the two previous approaches, the target integral grid comprises primitive rather than contracted integrals, and each GPU thread calculates just one primitive integral, no matter which contracted integrals it contributes to. As soon as we calculate and store all the primitives on the GPU, another GPU kernel further transforms them into the final array of contracted integrals. This second step isn't required in the TCI and BCI algorithms because all required primitives are stored either in registers or shared memory and thus are easily assembled into a contracted integral. In contrast, in the TPI scheme, the primitives constituting the same contracted integral can belong to different thread blocks running on different SMs. In this case, data exchange is possible only through the GPU DRAM, incurring hundreds of clock cycles of latency.

Table 1 shows benchmark results for the 64 hydrogen atom lattice. As mentioned earlier, we used two different basis sets to determine the contraction length's effect. For the weakly contracted basis set (6-311G), we found that BCI mapping performs poorly (as predicted), mostly because of the large number of empty threads that still execute instructions due to the SIMD hardware model. For this case, we estimated that the BCI algorithm incurs a 4.2x computational overhead, assuming each warp contains 32 threads.
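For reference, the within-block summation that the BCI scheme relies on is a standard shared-memory reduction. A generic sketch (single precision; a hypothetical block size of 128 threads; not our production kernel):

    // BCI-style summation sketch: each block reduces its threads' partial
    // primitive-ERI sums to one contracted integral in shared memory.
    __global__ void blockSumExample(const float *primVals, // per-block primitive values
                                    float *contracted,     // one output per block
                                    int nPrim)             // primitives per contracted ERI
    {
        __shared__ float partial[128];                     // blockDim.x assumed to be 128
        int tid = threadIdx.x;
        float sum = 0.0f;
        for (int k = tid; k < nPrim; k += blockDim.x)      // cyclic assignment of terms
            sum += primVals[blockIdx.x * nPrim + k];
        partial[tid] = sum;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];     // tree reduction
            __syncthreads();
        }
        if (tid == 0)
            contracted[blockIdx.x] = partial[0];           // final result to GPU DRAM
    }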
The TPI mapping was the fastest, but two considerations are important here. First, the summation of Equation 13 doesn't produce much overhead because most of the contracted basis functions consist of a single primitive. Second, the GPU integral evaluation kernel is relatively simple and consumes a small number of registers, which allows more active threads to run on an SM and hence provides better instruction pipelining. For the highly contracted basis set (STO-6G), the situation is reversed: the TPI algorithm is the slowest because of the summation of the primitive integrals, which is more likely to require communication across thread blocks. The BCI scheme avoids this overhead and distributes the work more evenly (all contracted integrals require the same number of primitive ERIs because all basis functions have the same contraction length). We found that the TCI algorithm represents a compromise that's less sensitive to the degree of basis set contraction. In both cases, it's either nearly the fastest or the fastest outright, and thus would be recommended for conventional SCF.

An additional issue is the time required to move the contracted integrals between the GPU and CPU main memory (in practice, the integrals rarely fit in the GPU DRAM). Table 1 shows that the time for this GPU-to-CPU transfer can exceed the integral evaluation time for weakly contracted basis sets. An alternate approach that avoids transferring the ERIs would clearly be advantageous. By substituting Equation 13 into Equations 6 and 7, we can avoid the formation of the contracted ERIs completely. This is the usual strategy when we use direct SCF methods to re-evaluate the ERIs in every step of the SCF procedure.

Table 1. Two-electron integral evaluation of the 64 hydrogen atom lattice on a GPU using three algorithms (BCI, TCI, and TPI).

    Basis set | GPU BCI | GPU TCI | GPU TPI | GPU-CPU transfer* | GAMESS**
    6-311G    | ... s   | ... s   | ... s   | ... s             | 70.8 s
    STO-6G    | 1.608 s | 1.099 s | ... s   | 0.02 s            | 90.6 s

*The amount of time required to copy the contracted integrals from the GPU to CPU memory.
**The same test case using the GAMESS program package on a single Opteron 175 CPU, for comparison.

In this case, we avoid the Achilles' heel of the TPI scheme, the formation of the contracted ERIs, so it becomes the recommended scheme. We've implemented construction of the J and K matrices on the GPU concurrent with ERI evaluation via the TPI scheme. This has the added advantage of avoiding CPU-GPU transfer of the ERIs: the J and K matrices contain only O(K²) elements, compared to the O(K⁴) ERIs. Due to limited space, we won't discuss the details of the algorithms here, but we will present some results that demonstrate the accuracy and performance achieved so far.

As mentioned earlier, the basis functions used in quantum chemistry have an associated angular momentum, that is, the polynomial prefactor in Equations 10 and 11. In the hydrogen lattice test case, all basis functions were of s type, meaning no polynomial prefactor. For atoms heavier than hydrogen, it becomes essential to also include higher angular momentum functions. Treating these efficiently requires computing all components, such as p_x, p_y, and p_z, simultaneously. In the context of the GPU, this means that we should write separate kernels for ERIs that have different angular momentum combinations. These kernels involve more arithmetic operations as the angular momentum increases, simply because more ERIs are computed simultaneously. We've written such kernels for all ERIs in basis sets including s and p type basis functions.

To better quantify the GPU's performance, we've investigated our algorithms for J matrix construction. The GPU's peak performance is 350 Gflops, and we were curious to see how close our algorithms came to this theoretical limit. Table 2 shows performance results for a subset of the kernels we coded. For each kernel, we counted the corresponding number of floating-point instructions it executed. We counted each instruction as 1 flop, except MAD (multiply-add), which we counted as 2 flops. We hand-counted the resulting floating-point operations from the compiler-generated PTXAS file (an intermediate assembler-type code that the compiler transforms into actual machine instructions). To evaluate the application's DRAM bandwidth, we also counted the number of memory operations (Mops) each thread needed to execute to load data from the GPU main memory for integral batch evaluation. We counted each 32-bit load instruction as 1 Mop, and the 64- and 128-bit load instructions as 2 and 4 Mops, respectively. Because we use texture memory (which can be cached), the resulting bandwidth is likely overestimated. In our algorithm, we found that using texture memory was even more efficient than the textbook scheme of global memory load, shared memory store, synchronization, and broadcast through shared memory. This is due to synchronization overhead, which hinders effective parallelization. We also determined the number of active threads (the hardware supports 768 at most) that are actually launched on every streaming multiprocessor. Finally, we determined the number of registers each active thread requires, which, in turn, determines GPU occupancy as discussed earlier.
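To make this accounting concrete, sustained Gflops and bandwidth follow directly from the counted operations and the measured kernel time, as in this small illustrative calculation (all numbers are placeholders, not the values in Table 2):

    // Illustrative flop/bandwidth accounting (placeholder numbers only).
    #include <cstdio>

    int main()
    {
        double flopsPerThread = 5000.0;  // counted from the PTXAS listing (MAD = 2 flops)
        double mopsPerThread  = 16.0;    // 32-bit loads = 1 Mop, 64-bit = 2, 128-bit = 4
        double nThreads       = 1.0e8;   // total threads launched over the run
        double seconds        = 5.0;     // measured kernel execution time

        double gflops  = flopsPerThread * nThreads / seconds / 1e9;
        double gbytess = 4.0 * mopsPerThread * nThreads / seconds / 1e9; // 4 bytes per Mop

        printf("sustained: %.1f Gflops, %.1f Gbytes/s\n", gflops, gbytess);
        return 0;
    }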
As expected, the kernels involving higher angular momentum functions require more floating-point operations and registers, but the need for more registers per thread leads to fewer threads being active. Although the sustained performance is less than 30 percent of the theoretical peak value, the GPU performance is still impressive compared to commodity CPUs; for example, a single AMD Opteron core demonstrates 1 to 3 Gflops in the Linpack benchmark. Given that a general quantum chemistry code is far less optimized than Linpack, we can estimate 1 Gflop as the upper bound for integral generation performance on this CPU. In contrast, we achieve 70 to 100 Gflops on the GPU. Comparing the performance in Table 2 for the sspp and pppp kernels, we can see that the GPU performance grows with arithmetic complexity for a fixed number of memory accesses, which suggests that our application is memory-bound on the GPU. Furthermore, the total memory bandwidth observed (although sometimes overestimated due to texture caching) is close to the 80 Gbytes/s peak value (in practice, we usually got 40 to 70 Gbytes/s bandwidth in global memory reads).
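For context, the cached texture path we compare against the textbook shared-memory scheme can be expressed with the legacy CUDA texture-reference API of that era; a generic sketch, not our kernel:

    // Reading input through the texture cache on G80-class hardware
    // (legacy texture-reference API; illustrative only).
    texture<float, 1, cudaReadModeElementType> inTex;   // 1D texture reference

    __global__ void scaleViaTexture(float *out, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = s * tex1Dfetch(inTex, i);          // cached fetch from DRAM
    }

    // Host side: bind the device array before launching the kernel.
    // cudaBindTexture(0, inTex, d_in, n * sizeof(float));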

Table 2. Integral evaluation GPU kernel specifications and performance results. (Values in parentheses give the performance measured with most memory loads replaced by constants; see the main text.)

    Kernel | Floating-point operations | Memory operations | Registers per thread | Active threads per SM | Performance (Gflops) | Bandwidth (Gbytes/s)
    ssss   | ...                       | ...               | ...                  | ...                   | ... (75)             | 3
    sssp   | ...                       | ...               | ...                  | ...                   | ... (74)             | 7
    sspp   | ...                       | ...               | ...                  | ...                   | ... (227)            | 64
    pppp   | ...                       | ...               | ...                  | ...                   | ... (98)             | 20

Table 3. Performance and accuracy of GPU algorithms for direct self-consistent field (SCF) benchmarks.

    Molecule    | Time per direct SCF iteration: GPU (s) | GAMESS (s) | Electronic energy: GPU, 32-bit (atomic units) | GAMESS (atomic units) | Speedup
    Caffeine    | ... | ... | ... | ... | ...
    Cholesterol | ... | ... | ... | ... | ...
    Buckyball   | ... | ... | ... | ... | ...
    Taxol       | ... | ... | ... | ... | ...
    Valinomycin | ... | ... | ... | ... | ...

To further verify the conclusion that our application is memory-bound, we performed a direct test. Of the 48 to 84 bytes required to evaluate each batch, we kept only the 24 bytes that were vitally important for the application to run and replaced the other quantities with constants. We evaluated the resulting performance and present it in parentheses in Table 2's performance column. Although the number of arithmetic operations was unchanged, the Gflops achieved increased by a factor of two or more, which clearly demonstrates our conclusion's correctness.

In anticipation of upcoming double-precision hardware, we're pleased that our algorithms are currently memory-bound. Although the memory bandwidth will decrease by a factor of two (due to 64- instead of 32-bit number representation) when the next generation of GPUs uses double precision, the more dramatic decrease will come in arithmetic performance. However, we anticipate that the increased arithmetic intensity won't much affect our algorithms; instead, they will only be roughly a factor of two slower in double precision.

Our code, which is still under development, successfully competes with modern, well-optimized, general-purpose quantum chemistry programs such as GAMESS.[9] We performed benchmark tests on the following molecules using the 3-21G basis set: caffeine (C8N4H10O2), cholesterol (C27H46O), buckyball (C60), taxol (C45NH49O15), and valinomycin (C54N6H90O18). Figure 3 shows all these molecules, and Table 3 summarizes the benchmark results. The GPU is up to 93 times faster than a single 3.0-GHz Intel Pentium D CPU for these molecules.

The GPU we used for these tests supports only 32-bit arithmetic operations, meaning that we can expect only six or seven significant figures of accuracy in the final results. This might not always be sufficient for quantum chemistry applications, where chemical accuracy is typically considered to be 10^-3 atomic units. As we can see by comparing the GPU and GAMESS electronic energies in Table 3, this level of accuracy isn't always achieved. Fortunately, the next generation of GPUs will provide hardware support for double precision, and we expect our algorithms will only be two times slower because they're currently limited by the GPU's memory bandwidth and don't saturate the GPU's floating-point capabilities.

In this article, we've demonstrated that GPUs can significantly outpace commodity CPUs in the central bottleneck of most quantum chemistry problems: evaluation of two-electron repulsion integrals and the subsequent Coulomb and exchange operator matrix formation. Speedups on the order of 100 times are readily achievable for chemical systems of practical interest, and the inherent high level of parallelism results in complete elimination of interblock communication during Fock matrix formation, making further parallelization over multiple GPUs an obvious step in the near future.

For very large molecules, the 32-bit precision provided by the Nvidia G80 series hardware isn't sufficient because the total energy grows with the molecule's size; in chemical problems, relative energies are the primary objects of interest. In fact, using an incremental Fock matrix scheme to compute only the difference between Fock matrices in successive iterations, and accumulating the Fock matrix on the CPU with double-precision accuracy, can improve the final result's precision by up to a factor of 10. Nevertheless, we still require higher precision as the molecules being studied get larger. To maintain chemical accuracy for energy differences in 32-bit precision, we're limited in practice to molecules with fewer than 100 atoms. Fortunately, Nvidia recently released the next generation of GPUs, which supports 64-bit precision in hardware. Because 32-bit arithmetic will remain significantly faster than 64-bit arithmetic, we anticipate that a mixed-precision computational model will be ideal. In this case, the program will process a small fraction of the ERIs (those with the largest absolute value) using 64-bit arithmetic and evaluate the vast majority of ERIs using the faster 32-bit arithmetic. Because the number of ERIs that require double-precision accuracy scales linearly with system size, the impact of double-precision calculations on overall computational performance should be low for large molecules.

The computational methods presented here can be easily augmented to allow calculations within the framework of DFT, which is known to be significantly more accurate than HF theory. We're currently implementing a general-purpose electronic structure code, including DFT, that runs almost entirely on the GPU, in anticipation of upcoming hardware advances. There is good reason to believe that these advances will enable the calculation of structures for small proteins directly from quantum mechanics, as well as the computational design of new small-molecule drugs targeted to specific proteins, with unprecedented accuracy and speed.

Figure 3. Molecules used to test GPU performance. The set of molecules used spans the size range from 20 to 256 atoms.

References

1. J. Bolz et al., "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid," ACM Trans. Graphics, vol. 22, no. 3, 2003, p. 917.
2. J. Hall, N. Carr, and J. Hart, "GPU Algorithms for Radiosity and Subsurface Scattering," tech. report UIUCDCS-R, Univ. of Illinois, Urbana-Champaign, 2003.
3. K. Fatahalian, J. Sugerman, and P. Hanrahan, "Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication," Graphics Hardware, T. Akenine-Moller and M. McCool, eds., Wellesley, 2004, p. 133.
4. A.G. Anderson, W.A. Goddard III, and P. Schroder, "Quantum Monte Carlo on Graphical Processing Units," Computer Physics Comm., vol. 177, no. 3, 2007, p. 298.
5. A. Szabo and N.S. Ostlund, Modern Quantum Chemistry, Dover, 1996.
6. R.G. Parr and W. Yang, Density-Functional Theory of Atoms and Molecules, Oxford, 1989.
7. L.E. McMurchie and E.R. Davidson, "One- and Two-Electron Integrals Over Cartesian Gaussian Functions," J. Computational Physics, vol. 26, no. 2, 1978, p. 218.
8. I.S. Ufimtsev and T.J. Martínez, "Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation," J. Chemical Theory and Computation, vol. 4, no. 2, 2008, p. 222.
9. M.W. Schmidt et al., "General Atomic and Molecular Electronic Structure System," J. Computational Chemistry, vol. 14, no. 11, 1993, p. 1347.

Ivan S. Ufimtsev is a graduate student and research assistant in the chemistry department at the University of Illinois.
His research interests include leveraging non-traditional architectures for scientific computing. Contact him at iufimts2@uiuc.edu.

Todd J. Martínez is the Gutgsell Chair of Chemistry at the University of Illinois. His research interests center on understanding the interplay between electronic and nuclear motion in molecules, especially in the context of chemical reactions initiated by light. Martínez became interested in computer architectures and videogame design at an early age, writing and selling his first game programs (coded in assembler for the 6502 processor) in the early 1980s. Contact him at toddjmartinez@gmail.com.


More information

Efficient algorithms for symmetric tensor contractions

Efficient algorithms for symmetric tensor contractions Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to

More information

PROJECT C: ELECTRONIC BAND STRUCTURE IN A MODEL SEMICONDUCTOR

PROJECT C: ELECTRONIC BAND STRUCTURE IN A MODEL SEMICONDUCTOR PROJECT C: ELECTRONIC BAND STRUCTURE IN A MODEL SEMICONDUCTOR The aim of this project is to present the student with a perspective on the notion of electronic energy band structures and energy band gaps

More information

Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization

Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization Martin Takáč The University of Edinburgh Based on: P. Richtárik and M. Takáč. Iteration complexity of randomized

More information

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013

More information

Lecture 4: Linear Algebra 1

Lecture 4: Linear Algebra 1 Lecture 4: Linear Algebra 1 Sourendu Gupta TIFR Graduate School Computational Physics 1 February 12, 2010 c : Sourendu Gupta (TIFR) Lecture 4: Linear Algebra 1 CP 1 1 / 26 Outline 1 Linear problems Motivation

More information

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise

More information

Real-time signal detection for pulsars and radio transients using GPUs

Real-time signal detection for pulsars and radio transients using GPUs Real-time signal detection for pulsars and radio transients using GPUs W. Armour, M. Giles, A. Karastergiou and C. Williams. University of Oxford. 15 th July 2013 1 Background of GPUs Why use GPUs? Influence

More information

Optimization Techniques for Parallel Code 1. Parallel programming models

Optimization Techniques for Parallel Code 1. Parallel programming models Optimization Techniques for Parallel Code 1. Parallel programming models Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr OPT - 2017 Goals of

More information

MICROPROCESSOR REPORT. THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE

MICROPROCESSOR REPORT.   THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE MICROPROCESSOR www.mpronline.com REPORT THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE ENERGY COROLLARIES TO AMDAHL S LAW Analyzing the Interactions Between Parallel Execution and Energy Consumption By

More information

Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David

Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David 1.2.05 1 Topic Overview Sources of overhead in parallel programs. Performance metrics for parallel systems. Effect of granularity on

More information

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017 HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher

More information

Intermission: Let s review the essentials of the Helium Atom

Intermission: Let s review the essentials of the Helium Atom PHYS3022 Applied Quantum Mechanics Problem Set 4 Due Date: 6 March 2018 (Tuesday) T+2 = 8 March 2018 All problem sets should be handed in not later than 5pm on the due date. Drop your assignments in the

More information

Transposition Mechanism for Sparse Matrices on Vector Processors

Transposition Mechanism for Sparse Matrices on Vector Processors Transposition Mechanism for Sparse Matrices on Vector Processors Pyrrhos Stathis Stamatis Vassiliadis Sorin Cotofana Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands

More information

Julian Merten. GPU Computing and Alternative Architecture

Julian Merten. GPU Computing and Alternative Architecture Future Directions of Cosmological Simulations / Edinburgh 1 / 16 Julian Merten GPU Computing and Alternative Architecture Institut für Theoretische Astrophysik Zentrum für Astronomie Universität Heidelberg

More information

Accelerating Quantum Chromodynamics Calculations with GPUs

Accelerating Quantum Chromodynamics Calculations with GPUs Accelerating Quantum Chromodynamics Calculations with GPUs Guochun Shi, Steven Gottlieb, Aaron Torok, Volodymyr Kindratenko NCSA & Indiana University National Center for Supercomputing Applications University

More information

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts

More information

From Physics to Logic

From Physics to Logic From Physics to Logic This course aims to introduce you to the layers of abstraction of modern computer systems. We won t spend much time below the level of bits, bytes, words, and functional units, but

More information

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,

More information

Brief review of Quantum Mechanics (QM)

Brief review of Quantum Mechanics (QM) Brief review of Quantum Mechanics (QM) Note: This is a collection of several formulae and facts that we will use throughout the course. It is by no means a complete discussion of QM, nor will I attempt

More information

This is called a singlet or spin singlet, because the so called multiplicity, or number of possible orientations of the total spin, which is

This is called a singlet or spin singlet, because the so called multiplicity, or number of possible orientations of the total spin, which is 9. Open shell systems The derivation of Hartree-Fock equations (Chapter 7) was done for a special case of a closed shell systems. Closed shell means that each MO is occupied by two electrons with the opposite

More information

ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers

ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers Victor Yu and the ELSI team Department of Mechanical Engineering & Materials Science Duke University Kohn-Sham Density-Functional

More information

CHEM3023: Spins, Atoms and Molecules

CHEM3023: Spins, Atoms and Molecules CHEM3023: Spins, Atoms and Molecules Lecture 4 Molecular orbitals C.-K. Skylaris Learning outcomes Be able to manipulate expressions involving spin orbitals and molecular orbitals Be able to write down

More information

Exploring performance and power properties of modern multicore chips via simple machine models

Exploring performance and power properties of modern multicore chips via simple machine models Exploring performance and power properties of modern multicore chips via simple machine models G. Hager, J. Treibig, J. Habich, and G. Wellein Erlangen Regional Computing Center (RRZE) Martensstr. 1, 9158

More information

Chemistry 334 Part 2: Computational Quantum Chemistry

Chemistry 334 Part 2: Computational Quantum Chemistry Chemistry 334 Part 2: Computational Quantum Chemistry 1. Definition Louis Scudiero, Ben Shepler and Kirk Peterson Washington State University January 2006 Computational chemistry is an area of theoretical

More information

Quantum Chemical Simulations and Descriptors. Dr. Antonio Chana, Dr. Mosè Casalegno

Quantum Chemical Simulations and Descriptors. Dr. Antonio Chana, Dr. Mosè Casalegno Quantum Chemical Simulations and Descriptors Dr. Antonio Chana, Dr. Mosè Casalegno Classical Mechanics: basics It models real-world objects as point particles, objects with negligible size. The motion

More information

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu Performance Metrics for Computer Systems CASS 2018 Lavanya Ramapantulu Eight Great Ideas in Computer Architecture Design for Moore s Law Use abstraction to simplify design Make the common case fast Performance

More information

Quantum Chemistry on Graphics Processing Units

Quantum Chemistry on Graphics Processing Units CHAPTER 2 Quantum Chemistry on Graphics Processing Units Andreas W. Go tz 1, Thorsten Wo lfle 1,2, and Ross C. Walker 1 Contents 1. Introduction 22 2. Software Development for Graphics Processing Units

More information

Basic Physical Chemistry Lecture 2. Keisuke Goda Summer Semester 2015

Basic Physical Chemistry Lecture 2. Keisuke Goda Summer Semester 2015 Basic Physical Chemistry Lecture 2 Keisuke Goda Summer Semester 2015 Lecture schedule Since we only have three lectures, let s focus on a few important topics of quantum chemistry and structural chemistry

More information

Cyclops Tensor Framework

Cyclops Tensor Framework Cyclops Tensor Framework Edgar Solomonik Department of EECS, Computer Science Division, UC Berkeley March 17, 2014 1 / 29 Edgar Solomonik Cyclops Tensor Framework 1/ 29 Definition of a tensor A rank r

More information

Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB

Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB Supplemental material Guido Klingbeil, Radek Erban, Mike Giles, and Philip K. Maini This document

More information

Self-consistent Field

Self-consistent Field Chapter 6 Self-consistent Field A way to solve a system of many electrons is to consider each electron under the electrostatic field generated by all other electrons. The many-body problem is thus reduced

More information

Massive Parallelization of First Principles Molecular Dynamics Code

Massive Parallelization of First Principles Molecular Dynamics Code Massive Parallelization of First Principles Molecular Dynamics Code V Hidemi Komatsu V Takahiro Yamasaki V Shin-ichi Ichikawa (Manuscript received April 16, 2008) PHASE is a first principles molecular

More information

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum

More information

ERLANGEN REGIONAL COMPUTING CENTER

ERLANGEN REGIONAL COMPUTING CENTER ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,

More information

Same idea for polyatomics, keep track of identical atom e.g. NH 3 consider only valence electrons F(2s,2p) H(1s)

Same idea for polyatomics, keep track of identical atom e.g. NH 3 consider only valence electrons F(2s,2p) H(1s) XIII 63 Polyatomic bonding -09 -mod, Notes (13) Engel 16-17 Balance: nuclear repulsion, positive e-n attraction, neg. united atom AO ε i applies to all bonding, just more nuclei repulsion biggest at low

More information

Molecular Modelling for Medicinal Chemistry (F13MMM) Room A36

Molecular Modelling for Medicinal Chemistry (F13MMM) Room A36 Molecular Modelling for Medicinal Chemistry (F13MMM) jonathan.hirst@nottingham.ac.uk Room A36 http://comp.chem.nottingham.ac.uk Assisted reading Molecular Modelling: Principles and Applications. Andrew

More information

GPU- Accelerated Quantum Chemistry

GPU- Accelerated Quantum Chemistry GPU- Accelerated Quantum Chemistry Ivan Ufimtsev Stanford University TCBG GPU Programming Workshop, 2013 GPU (and Epiphany)- Accelerated Quantum Chemistry Ivan Ufimtsev TCBG GPU Programming Workshop, 2013

More information