Novel Architectures

Graphical Processing Units for Quantum Chemistry

Ivan S. Ufimtsev and Todd J. Martínez
University of Illinois at Urbana-Champaign

The authors provide a brief overview of electronic structure theory and detail their experiences implementing quantum chemistry methods on a graphical processing unit. They also analyze algorithm performance in terms of floating-point operations and memory bandwidth, and assess the adequacy of single-precision accuracy for quantum chemistry applications.

In 1830, Auguste Comte wrote in his work Philosophie Positive:

"Every attempt to employ mathematical methods in the study of chemical questions must be considered profoundly irrational and contrary to the spirit of chemistry. If mathematical analysis should ever hold a prominent place in chemistry (an aberration which is happily almost impossible) it would occasion a rapid and widespread degeneration of that science."

Fortunately, Comte's assessment was far off the mark and, instead, the opposite has occurred. Detailed simulations based on the principles of quantum mechanics now play a large role in suggesting, guiding, and explaining experiments in chemistry and materials science. In fact, quantum chemistry is a major consumer of CPU cycles at national supercomputer centers. The field's rise to prominence is largely due to early demonstrations of quantum mechanics applied to chemical problems and to the tremendous advances in computing power over the past decades, two developments that Comte could not have foreseen.

However, limited computational resources remain a serious obstacle to the application of quantum chemistry to problems of widespread importance, such as the design of more effective drugs to treat diseases or of new catalysts for use in applications such as fuel cells or environmental remediation. Thus, researchers have considerable impetus to relieve this bottleneck in any way possible, both by developing new and more effective algorithms and by exploring new computer architectures. In our own work, we've recently begun exploring the use of graphical processing units (GPUs), and this article presents some of our experiences with GPUs for quantum chemistry.

GPU Architecture

Low precision (generally, 24-bit arithmetic) and limited programmability stymied early attempts to use GPUs for general-purpose scientific computing.[1-4] However, the release of the Nvidia G80 series and the compute unified device architecture (CUDA) application programming interface (API) have ushered in a new era in which these difficulties are largely ameliorated. The CUDA API lets developers control the GPU via an extension of the standard C programming language (as opposed to specialized assembler or graphics-oriented APIs, such as OpenGL and DirectX). The G80 supports 32-bit floating-point arithmetic,

which is largely (but not entirely) compliant with the IEEE-754 standard. In some cases, this might not be sufficient precision (many scientific applications use 64-bit or double precision), but Nvidia has already released its next generation of GPUs that supports 64-bit arithmetic in hardware. We performed all the calculations presented in this article on a single Nvidia GeForce 8800 GTX card using the CUDA API (the CUDA documentation offers a detailed overview of the hardware and API). We sketch some of the important concepts here.

As depicted schematically in Figure 1, the GeForce 8800 GTX consists of 16 independent streaming multiprocessors (SMs), each comprised of eight scalar units operating in SIMD fashion and running at 1.35 GHz. The device can process a large number of concurrent threads; these threads are organized into a one- or two-dimensional (1- or 2D) grid of 1-, 2-, or 3D blocks with up to 512 threads in each block. Threads in the same block are guaranteed to execute on the same SM and have access to fast on-chip shared memory (16 Kbytes per SM) for efficient data exchange. Threads belonging to different blocks can execute on different SMs and must exchange data through the GPU DRAM (768 Mbytes), which has much larger latency than shared memory. There's no efficient way to synchronize thread block execution (that is, to control in which order or on which SM the blocks will be processed); thus, an efficient algorithm should avoid communication between thread blocks as much as possible.

Figure 1. Schematic block diagram of the Nvidia GeForce 8800 GTX. It has 16 streaming multiprocessors (SMs), each containing 8 SIMD streaming processors and 16 Kbytes of shared memory.

The thread block grid can contain up to 65,535 blocks in each dimension. Every thread block has its own unique serial number (or two numbers, if the grid is 2D); likewise, each thread also has a set of indices identifying it within a thread block. Together, the thread block and thread serial numbers provide enough information to precisely identify a given thread and thereby split computational work in an application. Much of the challenge in developing an efficient algorithm on the GPU involves determining an ideal mapping between computational tasks and the grid/thread block/thread hierarchy. Ideal mappings lead to load balance, with interthread communication restricted to within the thread blocks.

Because the number of threads running on a GPU is much larger than the total number of processing units (to fully utilize the device's computational capabilities and efficiently hide the DRAM's access latency, the program must spawn at least 10^4 threads), the hardware executes all threads in time-slicing. The thread scheduler splits the thread blocks into 32-thread warps, which the SMs then process in SIMD fashion, with all 32 threads executed by eight scalar processors in four clock cycles. The thread scheduler periodically switches between warps, maximizing overall application performance. An important point here is that once all the threads in a warp have completed, this warp is no longer scheduled for execution, and no load-balancing penalty is incurred. Another important consideration is the amount of on-chip resources available for an active thread. Active means that the thread has GPU context (registers and so on) attached to it and is included in the thread scheduler's to-do list.
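To make this indexing concrete, a minimal CUDA kernel (a generic sketch, not code from our quantum chemistry application) combines the built-in block and thread indices into a unique global work-item ID:

    __global__ void scaleArray(const float *in, float *out, int n)
    {
        // Built-in indices: blockIdx selects this thread's block within the
        // grid, threadIdx selects this thread within its block.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // guard against the final, partially full block
            out[i] = 2.0f * in[i];  // each thread processes one array element
    }

    // Host-side launch: enough 256-thread blocks to cover all n elements.
    // scaleArray<<<(n + 255) / 256, 256>>>(d_in, d_out, n);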
Once a thread is activated, it won't be deactivated until all of its instructions are executed. A G80 SM can support up to 768 active threads (24 warps), and the warp-switching overhead is negligible compared to the time required to execute a typical instruction. Such cost-free switching is possible because every thread has its own context, which implies that the whole register space (32 Kbytes per SM) is evenly distributed among active threads (for 768 active threads, every thread has 10 registers available). If the threads need more registers, fewer threads are activated, leading to partial SM occupation. This important parameter determines whether the GPU DRAM access latency can be efficiently hidden (a large number of active threads means that although some threads are waiting for data, others can execute instructions, and vice versa). Thus, any GPU kernel (a primitive routine each GPU thread executes) should consume as few registers as possible to maximize SM occupation.

Quantum Chemistry Overview

Two of the most basic questions in chemistry are, where are the electrons? and where are the nuclei? Electronic structure theory, that is, quantum chemistry, focuses on the first one. Because the electrons are very light, we must apply the laws of quantum mechanics, which are described

with an electronic wave function determined by solving the time-independent Schrödinger equation. As usual in quantum mechanics, this wave function's absolute square is interpreted as a probability distribution for electron positions. Once we know the electronic distribution for a fixed nuclear configuration, it's straightforward to calculate the resulting forces on the nuclei. Thus, the answer to the second question follows from the answer to the first through either a search for the nuclear arrangement that minimizes the energy (molecular geometry optimization) or solution of the classical Newtonian equations of motion. (It's also possible, and in some cases necessary, to solve quantum mechanical equations of motion for the nuclei, but we don't consider this further here.) The great utility of quantum chemistry comes from the resulting ability to predict molecular shapes and chemical rearrangements.

Denoting the set of all electronic coordinates as r and all nuclear coordinates as R, we can write the electronic time-independent Schrödinger equation for a molecule as

    H(r, R) ψ_elec(r, R) = E(R) ψ_elec(r, R),   (1)

where H(r, R) is the electronic Hamiltonian operator describing the electronic kinetic energy as well as the Coulomb interactions between all electrons and nuclei, ψ_elec(r, R) is the electronic wave function, and E(R) is the total energy. This total energy depends only on the nuclear coordinates and is often referred to as the molecule's potential energy surface.

For most molecules, exact solution of Equation 1 is impossible, so we must invoke approximations. Researchers have developed numerous such approximate approaches, the simplest of which is the Hartree-Fock (HF) method.[5] In HF theory, we write the electronic wave function as a single antisymmetrized product of orbitals, which are functions describing a single electron's probability distribution. Physically, this means that the electrons see each other only in an averaged sense while obeying the appropriate Fermi statistics. We can obtain higher accuracy by including many antisymmetrized orbital products or by using density functional theory (DFT) to describe the detailed electronic correlations present in real molecules.[6] We consider only HF theory in this article because it illustrates many key computational points.

We express each electronic orbital φ_i(r) in HF theory as a linear combination of K basis functions χ_μ(r) specified in advance:

    φ_i(r) = Σ_{μ=1}^{K} C_{iμ} χ_μ(r).   (2)

Notice that we no longer write the electronic coordinates in boldface, to emphasize that these are a single electron's coordinates. The computational task is then to determine the linear coefficients C_{iμ}. Collecting these coefficients into a matrix C and introducing the overlap matrix S with elements

    S_{μν} = ∫ χ_μ(r) χ_ν(r) d³r,   (3)

we can write the HF equations as

    F(C) C = S C ε,   (4)

where ε is a diagonal matrix of one-electron orbital energies. The Fock matrix F is a one-electron analog of the Hamiltonian operator in Equation 1, describing the electronic kinetic energy, the electron-nuclear Coulomb attraction, and the averaged electron-electron repulsion. Because F depends on the unknown matrix C, we solve for the unknowns C and ε with a self-consistent field (SCF) procedure. After guessing C, we construct the Fock matrix and solve the generalized eigenvalue problem in Equation 4, giving ε and a new set of coefficients C.
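In outline, the SCF procedure is the following host-side loop (a schematic sketch; Matrix, buildFock, solveGeneralizedEigen, and maxAbsDiff are hypothetical placeholder names, not routines from any particular package):

    // Schematic SCF driver (host code); helpers are placeholders only.
    struct Matrix { /* dense K x K storage omitted */ };

    Matrix buildFock(const Matrix &C);                     // Equation 5; dominated by ERI work
    void   solveGeneralizedEigen(const Matrix &F, const Matrix &S,
                                 Matrix &C, Matrix &eps);  // F C = S C eps (Equation 4)
    double maxAbsDiff(const Matrix &a, const Matrix &b);   // convergence measure

    void scf(Matrix &C, const Matrix &S)
    {
        const double tol = 1e-6;                 // coefficient-change tolerance
        for (int iter = 0; iter < 100; ++iter) {
            Matrix F = buildFock(C);             // Fock matrix from current guess
            Matrix Cnew, eps;
            solveGeneralizedEigen(F, S, Cnew, eps);
            double change = maxAbsDiff(C, Cnew);
            C = Cnew;                            // accept the new coefficients
            if (change < tol) break;             // C and F are now self-consistent
        }
    }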
The process iterates until C remains unchanged within some tolerance, that is, until C and F are self-consistent. The dominant effort in the formation of F lies in the evaluation of the two-electron repulsion integrals (ERIs) representing the Coulomb (J) and exchange (K) interactions between pairs of electrons. The exchange interaction is a nonclassical Coulomb-like term arising from the electronic wave function's antisymmetry. Specifically, we construct the Fock matrix for a molecule with N electrons as

    F(C) = H^core + J(C) - (1/2) K(C),   (5)

where H^core includes the electronic kinetic energy and the electron-nuclear attraction. (The equations given here are specific to closed-shell singlet molecules, with no unpaired electrons.) The Coulomb and exchange matrices are given by

    J_{μν} = Σ_{λσ} P_{λσ} (μν|λσ)   (6)

    K_{μν} = Σ_{λσ} P_{λσ} (μλ|σν),   (7)

in terms of the density matrix P and the two-electron integrals (μν|λσ),

which are defined as

    P_{λσ} = 2 Σ_{i=1}^{N/2} C_{λi} C_{σi}   (8)

    (μν|λσ) = ∫∫ χ_μ(r_1) χ_ν(r_1) χ_λ(r_2) χ_σ(r_2) / |r_1 - r_2| d³r_1 d³r_2.   (9)

Usually, we choose the basis functions as Gaussians centered on the nuclei, which leads to analytic expressions for the two-electron integrals. Nevertheless, K⁴ such integrals must be evaluated, where K grows linearly with the size of the molecule under consideration. In practice, many of these integrals are small and can be neglected, but the number of non-negligible integrals still grows faster than O(K²), making their evaluation a critical bottleneck in quantum chemistry. We can calculate and store the ERIs for use in constructing the J and K matrices throughout the iterative SCF procedure (conventional SCF), or we can recalculate them in each iteration (direct SCF). The direct SCF method is often more efficient in practice because it minimizes the I/O associated with reading and writing ERIs.

The form of the basis functions χ_μ(r) is arbitrary in principle, as long as the basis set is sufficiently flexible to describe the electronic orbitals. The natural choice for these functions is the Slater-type orbital (STO), which comes from the exact analytic solution of the electronic structure problem for the hydrogen atom:

    χ_μ^STO(r) = (x - x_μ)^l (y - y_μ)^m (z - z_μ)^n e^{-α_μ |r - R_μ|},   (10)

where R_μ is the position of the nucleus on which the basis function is centered (with components x_μ, y_μ, and z_μ), and the integers l, m, and n represent the orbital's angular momentum. The orbital's total angular momentum, l_total, is given by the sum of these integers and is often referred to as s, p, and d for l_total = 0, 1, and 2, respectively. Unfortunately, it's difficult to evaluate the required two-electron integrals using these basis functions, so we use Gaussian-type orbitals (GTOs) in their stead:

    χ_μ^GTO(r) = (x - x_μ)^l (y - y_μ)^m (z - z_μ)^n e^{-α_μ |r - R_μ|²}.   (11)

Relatively simple analytic expressions are available for the required integrals when using GTO basis sets, but the functional form is qualitatively different. To mimic the more physically motivated STOs, we typically contract these GTOs as

    χ_μ^STO(r) ≈ χ_μ^{GTO,contracted}(r) = Σ_{i=1}^{N_μ} d_{μi} χ_{μi}^{GTO,primitive}(r).   (12)

This procedure leads to two-electron integrals over contracted basis functions, which are given as sums of integrals over primitive basis functions. Unlike the elements of the C matrix, the contraction coefficients d_{μi} aren't allowed to vary during the SCF iterative process. The number of primitives in a contracted basis function, N_μ, is the contraction length and usually varies between one and eight. Given this basis set construction, we can now talk about ERIs over contracted or primitive basis functions:

    (μν|λσ) = Σ_{p=1}^{N_μ} Σ_{q=1}^{N_ν} Σ_{r=1}^{N_λ} Σ_{s=1}^{N_σ} d_{μp} d_{νq} d_{λr} d_{σs} [pq|rs],   (13)

where square brackets denote integrals over primitive basis functions and parentheses denote integrals over contracted basis functions.

Several algorithms can evaluate the primitive integrals [pq|rs] for GTO basis sets. We won't discuss these in any detail here, except to say that we've used the McMurchie-Davidson scheme,[7] which requires relatively few intermediates per integral. The resulting low memory requirements for the kernels let us maximize SM occupancy. The operations involved in evaluating the primitive integrals include evaluation of reciprocals, square roots, and exponentials, in addition to simple arithmetic operations such as addition and multiplication.
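In code, Equation 13 amounts to a fourfold loop over primitives. The following C-style sketch (with hypothetical Prim, Shell, and primitiveERI stand-ins for a real integral library) shows the contraction step:

    // Sketch of Equation 13: a contracted ERI is a coefficient-weighted
    // sum of primitive ERIs [pq|rs]. Types and primitiveERI are
    // illustrative placeholders, not a real integral code.
    struct Prim  { double alpha; };   // primitive exponent (centers etc. omitted)
    struct Shell { int nPrim; const double *d; const Prim *prim; };

    double primitiveERI(const Prim &p, const Prim &q,
                        const Prim &r, const Prim &s);   // e.g., McMurchie-Davidson

    double contractedERI(const Shell &mu, const Shell &nu,
                         const Shell &lam, const Shell &sig)
    {
        double sum = 0.0;
        for (int p = 0; p < mu.nPrim; ++p)
          for (int q = 0; q < nu.nPrim; ++q)
            for (int r = 0; r < lam.nPrim; ++r)
              for (int s = 0; s < sig.nPrim; ++s)
                sum += mu.d[p] * nu.d[q] * lam.d[r] * sig.d[s]
                     * primitiveERI(mu.prim[p], nu.prim[q],
                                    lam.prim[r], sig.prim[s]);  // [pq|rs]
        return sum;
    }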
GPU Algorithms for ERI Evaluation

In other work, we've explored three different algorithms to evaluate the O(K⁴) ERIs over contracted basis functions and store them in the GPU memory.[8] Here, we summarize the algorithms and comment on several aspects of their performance. As a test case, we consider a molecular system composed of 64 hydrogen atoms arranged on a lattice. We use two basis sets: the first (denoted STO-6G) has six primitive functions for each contracted basis function, with one contracted basis function per atom; the second (denoted 6-311G) has three contracted functions per atom, built from three, one, and one primitive basis functions, respectively. These two basis sets represent highly contracted and relatively uncontracted basis sets, and they serve to show how the degree of contraction in the basis set affects algorithm performance. For the hydrogen atom lattice

test case, the number of contracted basis functions is 64 and 192 for the STO-6G and 6-311G basis sets, respectively, which leads to O(10⁶) and O(10⁸) ERIs over contracted basis functions.

Figure 2. Schematic of three different mapping schemes for evaluating ERIs on the GPU. The large square represents the matrix of contracted integrals; small squares below the main diagonal (blue) represent integrals that don't need to be computed because the integral matrix is symmetric. Each of the contracted integrals is a sum over primitive integrals. The mapping schemes differ in how the computational work is apportioned: red squares superimposed on the integral matrix denote work done by a representative thread block, and the three blow-ups show how the work is apportioned to threads within the thread block.

As we can see in Equation 9, the ERIs have several permutation symmetries; for example, interchange of the first or last two indices in the (μν|λσ) ERI doesn't change the integral's value. Thus, we can represent the contracted ERIs as a square matrix of dimension K(K + 1)/2 × K(K + 1)/2, as Figure 2 shows; here, the rows and columns represent unique μν and λσ index pairs. Furthermore, we can interchange the first pair of indices with the last pair without changing the ERI value, that is, (μν|λσ) = (λσ|μν). This implies that the ERI matrix is symmetric, and only the ERIs on or above the main diagonal need to be calculated. Figure 2 shows the primitive integrals contributing to each contracted integral as small squares (see the blow-up labeled "primitive integrals"). We've simplified here to the case in which each contracted basis function is a linear combination of the same number of primitives. In realistic cases, each of the contracted basis functions can involve a different number of primitive basis functions.

This organization of the contracted ERIs immediately suggests three different mappings of the computational work to thread blocks. We could assign a thread to each contracted ERI (TCI, Thread-Contracted Integral in Figure 2); a thread block to each contracted ERI (BCI, Block-Contracted Integral in Figure 2); or a thread to each primitive ERI (TPI, Thread-Primitive Integral in Figure 2). We've implemented all three of these schemes on the GPU; the grain of parallelism and the degree of load balancing differ in all three cases. The TPI scheme is the most fine-grained and provides the largest number of threads for calculation; the TCI scheme is the least fine-grained, providing a larger amount of work for each active thread.

In the TCI scheme, each thread calculates its contracted integral by directly looping over all contributing primitive ERIs and accumulating the results according to Equation 13. Once the primitive ERI evaluation and summation completes, the contracted integral is stored in the GPU memory.
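A minimal CUDA sketch of the TCI mapping (hypothetical data layout; single precision, as on the G80; not our production kernel) might look like this:

    // TCI sketch: one thread per contracted ERI. Each thread loops over
    // the primitive quartets of its integral and accumulates Equation 13.
    // PrimQuartet and evalPrimitiveERI are illustrative stand-ins.
    struct PrimQuartet { float a, b, c, d; };   // packed primitive data (schematic)

    __device__ float evalPrimitiveERI(const PrimQuartet &q)
    {
        return q.a * q.b * q.c * q.d;           // placeholder arithmetic only
    }

    __global__ void tciKernel(const PrimQuartet *quartets,
                              const int *first,   // start of each integral's quartets
                              const int *count,   // quartets per contracted integral
                              float *contracted, int nContracted)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one contracted ERI per thread
        if (i >= nContracted) return;
        float sum = 0.0f;
        for (int k = 0; k < count[i]; ++k)              // contraction loop (Equation 13)
            sum += evalPrimitiveERI(quartets[first[i] + k]);
        contracted[i] = sum;                            // store result in GPU DRAM
    }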

Neighboring integrals can have different numbers of contributing primitives and hence a different number of loop cycles. When the threads responsible for these neighboring integrals belong to the same warp, they execute in SIMD fashion, which produces load imbalance. We can minimize the impact by further organizing the integrals into subgrids according to the contraction length of the basis functions involved. In this case, all threads in a warp have similar workloads (thus minimizing load-balancing issues), but this requires further reorganization of the computation, with both programming and runtime overhead; the latter, however, is usually small when compared to typical computation times.

The BCI mapping scheme is finer-grained and maps each contracted integral to a whole thread block rather than a single thread. This organization avoids the load-imbalance issues inherent to the TCI algorithm because distinct thread blocks never share common warps. Within a block, we have several ways to assign primitive integrals to GPU threads. We chose to assign them cyclically, with each successive term in the sum of Equation 13 mapped to a successive GPU thread; when the last thread is reached, the subsequent integral is assigned to the first thread, and so on. After all threads compute their integrals, the results are summed using the shared on-chip memory, and the final result is stored in GPU DRAM.

Unfortunately, the BCI scheme sometimes experiences load-balancing issues that are difficult to eliminate. Consider a contracted integral comprised of just one primitive integral; such a situation is possible when all the basis functions have unit contraction length (that is, they aren't contracted at all). In this case, only one thread in the whole block will have work assigned to it, but because the warps are processed in SIMD fashion, the other 31 threads in the warp will execute the same set of instructions and waste the computational time. Direct tests, performed on a system with a large number of weakly contracted integrals, confirm this prediction.

The TPI mapping scheme exhibits the finest-grained level of parallelism of the schemes presented. Unlike the two previous approaches, the target integral grid comprises primitive rather than contracted integrals, and each GPU thread calculates just one primitive integral, no matter which contracted integrals it contributes to. As soon as we calculate and store all the primitives on the GPU, another GPU kernel further transforms them into the final array of contracted integrals. This second step isn't required in the TCI and BCI algorithms because all required primitives are stored either in registers or shared memory and thus are easily assembled into a contracted integral. In contrast, in the TPI scheme, the primitives constituting the same contracted integral can belong to different thread blocks running on different SMs. In this case, data exchange is possible only through the GPU DRAM, incurring hundreds of clock cycles of latency.

Table 1 shows benchmark results for the 64 hydrogen atom lattice. As mentioned earlier, we used two different basis sets to determine the contraction length's effect. For the weakly contracted basis set (6-311G), we found that BCI mapping performs poorly (as predicted), mostly because of the large number of empty threads that still execute instructions due to the SIMD hardware model. For this case, we estimated that the BCI algorithm incurs a 4.2x computational overhead, assuming each warp contains 32 threads.
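For reference, the within-block summation that the BCI scheme relies on is a standard shared-memory reduction. A generic sketch (single precision; a hypothetical block size of 128 threads; not our production kernel):

    // BCI-style summation sketch: each block reduces its threads' partial
    // primitive-ERI sums to one contracted integral in shared memory.
    __global__ void blockSumExample(const float *primVals, // per-block primitive values
                                    float *contracted,     // one output per block
                                    int nPrim)             // primitives per contracted ERI
    {
        __shared__ float partial[128];                     // blockDim.x assumed to be 128
        int tid = threadIdx.x;
        float sum = 0.0f;
        for (int k = tid; k < nPrim; k += blockDim.x)      // cyclic assignment of terms
            sum += primVals[blockIdx.x * nPrim + k];
        partial[tid] = sum;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];     // tree reduction
            __syncthreads();
        }
        if (tid == 0)
            contracted[blockIdx.x] = partial[0];           // final result to GPU DRAM
    }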
The TPI mapping was the fastest, but two considerations are important here. First, the summation of Equation 13 doesn't produce much overhead because most of the contracted basis functions consist of a single primitive. Second, the GPU integral evaluation kernel is relatively simple and consumes a small number of registers, which allows more active threads to run on an SM and hence provides better instruction pipelining. For the highly contracted basis set (STO-6G), the situation is reversed: the TPI algorithm is the slowest because of the summation of the primitive integrals, which is more likely to require communication across thread blocks. The BCI scheme avoids this overhead and distributes the work more evenly (all contracted integrals require the same number of primitive ERIs because all basis functions have the same contraction length). We found that the TCI algorithm represents a compromise that's less sensitive to the degree of basis set contraction. In both cases, it's either nearly the fastest or the fastest outright, and thus would be recommended for conventional SCF.

An additional issue is the time required to move the contracted integrals between the GPU and CPU main memory (in practice, the integrals rarely fit in the GPU DRAM). Table 1 shows that the time for this GPU-to-CPU transfer can exceed the integral evaluation time for weakly contracted basis sets. An alternate approach that avoids transferring the ERIs would clearly be advantageous. By substituting Equation 13 into Equations 6 and 7, we can avoid the formation of the contracted ERIs completely. This is the usual strategy when we use direct SCF methods to re-evaluate the ERIs in every step of the SCF procedure.

Table 1. Two-electron integral evaluation of the 64 hydrogen atom lattice on a GPU using three algorithms (BCI, TCI, and TPI).

    Basis set | GPU BCI | GPU TCI | GPU TPI | GPU-CPU transfer* | GAMESS**
    6-311G    | ... s   | ... s   | ... s   | ... s             | 70.8 s
    STO-6G    | 1.608 s | 1.099 s | ... s   | 0.02 s            | 90.6 s

*The amount of time required to copy the contracted integrals from the GPU to CPU memory.
**The same test case using the GAMESS program package on a single Opteron 175 CPU, for comparison.

In this case, we avoid the Achilles' heel of the TPI scheme, the formation of the contracted ERIs, so it becomes the recommended scheme. We've implemented construction of the J and K matrices on the GPU concurrent with ERI evaluation via the TPI scheme. This has the added advantage of avoiding CPU-GPU transfer of the ERIs: the J and K matrices contain only O(K²) elements, compared to the O(K⁴) ERIs. Due to limited space, we won't discuss the details of the algorithms here, but we will present some results that demonstrate the accuracy and performance achieved so far.

As mentioned earlier, the basis functions used in quantum chemistry have an associated angular momentum, that is, the polynomial prefactor in Equations 10 and 11. In the hydrogen lattice test case, all basis functions were of s type, meaning no polynomial prefactor. For atoms heavier than hydrogen, it becomes essential to also include higher angular momentum functions. Treating these efficiently requires computing all components, such as p_x, p_y, and p_z, simultaneously. In the context of the GPU, this means that we should write separate kernels for ERIs that have different angular momentum combinations. These kernels involve more arithmetic operations as the angular momentum increases, simply because more ERIs are computed simultaneously. We've written such kernels for all ERIs in basis sets including s and p type basis functions.

To better quantify the GPU's performance, we've investigated our algorithms for J matrix construction. The GPU's peak performance is 350 Gflops, and we were curious to see how close our algorithms came to this theoretical limit. Table 2 shows performance results for a subset of the kernels we coded. For each kernel, we counted the corresponding number of floating-point instructions it executed. We counted each instruction as 1 flop, except MAD (multiply-add), which we counted as 2 flops. We hand-counted the resulting floating-point operations from the compiler-generated PTXAS file (an intermediate assembler-type code that the compiler transforms into actual machine instructions). To evaluate the application's DRAM bandwidth, we also counted the number of memory operations (Mops) each thread needed to execute to load data from the GPU main memory for integral batch evaluation. We counted each 32-bit load instruction as 1 Mop, and the 64- and 128-bit load instructions as 2 and 4 Mops, respectively. Because we use texture memory (which can be cached), the resulting bandwidth is likely overestimated. In our algorithm, we found that using texture memory was even more efficient than the textbook scheme of global memory load, shared memory store, synchronization, and broadcast through shared memory. This is due to synchronization overhead, which hinders effective parallelization. We also determined the number of active threads (the hardware supports 768 at most) that are actually launched on every streaming multiprocessor. Finally, we determined the number of registers each active thread requires, which, in turn, determines GPU occupancy as discussed earlier.
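To make this accounting concrete, sustained Gflops and bandwidth follow directly from the counted operations and the measured kernel time, as in this small illustrative calculation (all numbers are placeholders, not the values in Table 2):

    // Illustrative flop/bandwidth accounting (placeholder numbers only).
    #include <cstdio>

    int main()
    {
        double flopsPerThread = 5000.0;  // counted from the PTXAS listing (MAD = 2 flops)
        double mopsPerThread  = 16.0;    // 32-bit loads = 1 Mop, 64-bit = 2, 128-bit = 4
        double nThreads       = 1.0e8;   // total threads launched over the run
        double seconds        = 5.0;     // measured kernel execution time

        double gflops  = flopsPerThread * nThreads / seconds / 1e9;
        double gbytess = 4.0 * mopsPerThread * nThreads / seconds / 1e9; // 4 bytes per Mop

        printf("sustained: %.1f Gflops, %.1f Gbytes/s\n", gflops, gbytess);
        return 0;
    }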
As expected, the kernels involving higher angular momentum functions require more floating-point operations and registers, but the need for more registers per thread leads to fewer threads being active. Although the sustained performance is less than 30 percent of the theoretical peak value, the GPU performance is still impressive compared to commodity CPUs; for example, a single AMD Opteron core demonstrates 1 to 3 Gflops in the Linpack benchmark. Given that a general quantum chemistry code is far less optimized than Linpack, we can estimate 1 Gflop as the upper bound for integral generation performance on this CPU. In contrast, we achieve 70 to 100 Gflops on the GPU. Comparing the performance in Table 2 for the sspp and pppp kernels, we can see that the GPU performance grows with arithmetic complexity for a fixed number of memory accesses, which suggests that our application is memory-bound on the GPU. Furthermore, the total memory bandwidth observed (although sometimes overestimated due to texture caching) is close to the 80 Gbytes/s peak value (in practice, we usually got 40 to 70 Gbytes/s bandwidth in global memory reads).
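For context, the cached texture path we compare against the textbook shared-memory scheme can be expressed with the legacy CUDA texture-reference API of that era; a generic sketch, not our kernel:

    // Reading input through the texture cache on G80-class hardware
    // (legacy texture-reference API; illustrative only).
    texture<float, 1, cudaReadModeElementType> inTex;   // 1D texture reference

    __global__ void scaleViaTexture(float *out, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = s * tex1Dfetch(inTex, i);          // cached fetch from DRAM
    }

    // Host side: bind the device array before launching the kernel.
    // cudaBindTexture(0, inTex, d_in, n * sizeof(float));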

Table 2. Integral evaluation GPU kernel specifications and performance results. (Values in parentheses give the performance measured with most memory loads replaced by constants; see the main text.)

    Kernel | Floating-point operations | Memory operations | Registers per thread | Active threads per SM | Performance (Gflops) | Bandwidth (Gbytes/s)
    ssss   | ...                       | ...               | ...                  | ...                   | ... (75)             | 3
    sssp   | ...                       | ...               | ...                  | ...                   | ... (74)             | 7
    sspp   | ...                       | ...               | ...                  | ...                   | ... (227)            | 64
    pppp   | ...                       | ...               | ...                  | ...                   | ... (98)             | 20

Table 3. Performance and accuracy of GPU algorithms for direct self-consistent field (SCF) benchmarks.

    Molecule    | Time per direct SCF iteration: GPU (s) | GAMESS (s) | Electronic energy: GPU, 32-bit (atomic units) | GAMESS (atomic units) | Speedup
    Caffeine    | ... | ... | ... | ... | ...
    Cholesterol | ... | ... | ... | ... | ...
    Buckyball   | ... | ... | ... | ... | ...
    Taxol       | ... | ... | ... | ... | ...
    Valinomycin | ... | ... | ... | ... | ...

To further verify the conclusion that our application is memory-bound, we performed a direct test. Of the 48 to 84 bytes required to evaluate each batch, we kept only the 24 bytes that were vitally important for the application to run and replaced the other quantities with constants. We evaluated the resulting performance and present it in parentheses in Table 2's performance column. Although the number of arithmetic operations was unchanged, the Gflops achieved increased by a factor of two or more, which clearly demonstrates our conclusion's correctness.

In anticipation of upcoming double-precision hardware, we're pleased that our algorithms are currently memory-bound. Although the memory bandwidth will decrease by a factor of two (due to 64- instead of 32-bit number representation) when the next generation of GPUs uses double precision, the more dramatic decrease will come in arithmetic performance. However, we anticipate that the increased arithmetic intensity won't much affect our algorithms; instead, they will only be roughly a factor of two slower in double precision.

Our code, which is still under development, successfully competes with modern, well-optimized, general-purpose quantum chemistry programs such as GAMESS.[9] We performed benchmark tests on the following molecules using the 3-21G basis set: caffeine (C8N4H10O2), cholesterol (C27H46O), buckyball (C60), taxol (C45NH49O15), and valinomycin (C54N6H90O18). Figure 3 shows all these molecules, and Table 3 summarizes the benchmark results. The GPU is up to 93 times faster than a single 3.0-GHz Intel Pentium D CPU for these molecules.

The GPU we used for these tests supports only 32-bit arithmetic operations, meaning that we can expect only six or seven significant figures of accuracy in the final results. This might not always be sufficient for quantum chemistry applications, where chemical accuracy is typically considered to be 10^-3 atomic units. As we can see by comparing the GPU and GAMESS electronic energies in Table 3, this level of accuracy isn't always achieved. Fortunately, the next generation of GPUs will provide hardware support for double precision, and we expect our algorithms will only be two times slower because they're currently limited by the GPU's memory bandwidth and don't saturate the GPU's floating-point capabilities.

In this article, we've demonstrated that GPUs can significantly outpace commodity CPUs in the central bottleneck of most quantum chemistry problems: evaluation of two-electron repulsion integrals and the subsequent Coulomb and exchange operator matrix formation. Speedups on the order of 100 times are readily achievable for chemical systems of practical interest, and the inherent high level of parallelism results in complete elimination of interblock communication during Fock matrix formation, making further parallelization over multiple GPUs an obvious step in the near future.

For very large molecules, the 32-bit precision provided by the Nvidia G80 series hardware isn't sufficient because the total energy grows with the molecule's size; in chemical problems, relative energies are the primary objects of interest. In fact, using an incremental Fock matrix scheme to compute only the difference between Fock matrices in successive iterations, and accumulating the Fock matrix on the CPU with double-precision accuracy, can improve the final result's precision by up to a factor of 10. Nevertheless, we still require higher precision as the molecules being studied get larger. To maintain chemical accuracy for energy differences in 32-bit precision, we're limited in practice to molecules with fewer than 100 atoms. Fortunately, Nvidia recently released the next generation of GPUs, which supports 64-bit precision in hardware. Because 32-bit arithmetic will remain significantly faster than 64-bit arithmetic, we anticipate that a mixed-precision computational model will be ideal. In this case, the program will process a small fraction of the ERIs (those with the largest absolute value) using 64-bit arithmetic and evaluate the vast majority of ERIs using the faster 32-bit arithmetic. Because the number of ERIs that require double-precision accuracy scales linearly with system size, the impact of double-precision calculations on overall computational performance should be low for large molecules.

The computational methods presented here can be easily augmented to allow calculations within the framework of DFT, which is known to be significantly more accurate than HF theory. We're currently implementing a general-purpose electronic structure code, including DFT, that runs almost entirely on the GPU, in anticipation of upcoming hardware advances. There is good reason to believe that these advances will enable the calculation of structures for small proteins directly from quantum mechanics, as well as the computational design of new small-molecule drugs targeted to specific proteins, with unprecedented accuracy and speed.

Figure 3. Molecules used to test GPU performance. The set of molecules used spans the size range from 20 to 256 atoms.

References

1. J. Bolz et al., "Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid," ACM Trans. Graphics, vol. 22, no. 3, 2003, p. 917.
2. J. Hall, N. Carr, and J. Hart, "GPU Algorithms for Radiosity and Subsurface Scattering," tech. report UIUCDCS-R, Univ. of Illinois, Urbana-Champaign, 2003.
3. K. Fatahalian, J. Sugerman, and P. Hanrahan, "Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication," Graphics Hardware, T. Akenine-Moller and M. McCool, eds., Wellesley, 2004, p. 133.
4. A.G. Anderson, W.A. Goddard III, and P. Schroder, "Quantum Monte Carlo on Graphical Processing Units," Computer Physics Comm., vol. 177, no. 3, 2007, p. 298.
5. A. Szabo and N.S. Ostlund, Modern Quantum Chemistry, Dover, 1996.
6. R.G. Parr and W. Yang, Density-Functional Theory of Atoms and Molecules, Oxford, 1989.
7. L.E. McMurchie and E.R. Davidson, "One- and Two-Electron Integrals Over Cartesian Gaussian Functions," J. Computational Physics, vol. 26, no. 2, 1978, p. 218.
8. I.S. Ufimtsev and T.J. Martínez, "Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation," J. Chemical Theory and Computation, vol. 4, no. 2, 2008, p. 222.
9. M.W. Schmidt et al., "General Atomic and Molecular Electronic Structure System," J. Computational Chemistry, vol. 14, no. 11, 1993, p. 1347.

Ivan S. Ufimtsev is a graduate student and research assistant in the chemistry department at the University of Illinois.
His research interests include leveraging non-traditional architectures for scientific computing. Contact him at iufimts2@uiuc.edu.

Todd J. Martínez is the Gutgsell Chair of Chemistry at the University of Illinois. His research interests center on understanding the interplay between electronic and nuclear motion in molecules, especially in the context of chemical reactions initiated by light. Martínez became interested in computer architectures and videogame design at an early age, writing and selling his first game programs (coded in assembler for the 6502 processor) in the early 1980s. Contact him at toddjmartinez@gmail.com.


More information

Efficient algorithms for symmetric tensor contractions

Efficient algorithms for symmetric tensor contractions Efficient algorithms for symmetric tensor contractions Edgar Solomonik 1 Department of EECS, UC Berkeley Oct 22, 2013 1 / 42 Edgar Solomonik Symmetric tensor contractions 1/ 42 Motivation The goal is to

More information

PROJECT C: ELECTRONIC BAND STRUCTURE IN A MODEL SEMICONDUCTOR

PROJECT C: ELECTRONIC BAND STRUCTURE IN A MODEL SEMICONDUCTOR PROJECT C: ELECTRONIC BAND STRUCTURE IN A MODEL SEMICONDUCTOR The aim of this project is to present the student with a perspective on the notion of electronic energy band structures and energy band gaps

More information

Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization

Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization Martin Takáč The University of Edinburgh Based on: P. Richtárik and M. Takáč. Iteration complexity of randomized

More information

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2

Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013

More information

Lecture 4: Linear Algebra 1

Lecture 4: Linear Algebra 1 Lecture 4: Linear Algebra 1 Sourendu Gupta TIFR Graduate School Computational Physics 1 February 12, 2010 c : Sourendu Gupta (TIFR) Lecture 4: Linear Algebra 1 CP 1 1 / 26 Outline 1 Linear problems Motivation

More information

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark Block AIR Methods For Multicore and GPU Per Christian Hansen Hans Henrik B. Sørensen Technical University of Denmark Model Problem and Notation Parallel-beam 3D tomography exact solution exact data noise

More information

Real-time signal detection for pulsars and radio transients using GPUs

Real-time signal detection for pulsars and radio transients using GPUs Real-time signal detection for pulsars and radio transients using GPUs W. Armour, M. Giles, A. Karastergiou and C. Williams. University of Oxford. 15 th July 2013 1 Background of GPUs Why use GPUs? Influence

More information

Optimization Techniques for Parallel Code 1. Parallel programming models

Optimization Techniques for Parallel Code 1. Parallel programming models Optimization Techniques for Parallel Code 1. Parallel programming models Sylvain Collange Inria Rennes Bretagne Atlantique http://www.irisa.fr/alf/collange/ sylvain.collange@inria.fr OPT - 2017 Goals of

More information

MICROPROCESSOR REPORT. THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE

MICROPROCESSOR REPORT.   THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE MICROPROCESSOR www.mpronline.com REPORT THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE ENERGY COROLLARIES TO AMDAHL S LAW Analyzing the Interactions Between Parallel Execution and Energy Consumption By

More information

Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David

Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David Analytical Modeling of Parallel Programs (Chapter 5) Alexandre David 1.2.05 1 Topic Overview Sources of overhead in parallel programs. Performance metrics for parallel systems. Effect of granularity on

More information

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017

HYCOM and Navy ESPC Future High Performance Computing Needs. Alan J. Wallcraft. COAPS Short Seminar November 6, 2017 HYCOM and Navy ESPC Future High Performance Computing Needs Alan J. Wallcraft COAPS Short Seminar November 6, 2017 Forecasting Architectural Trends 3 NAVY OPERATIONAL GLOBAL OCEAN PREDICTION Trend is higher

More information

Intermission: Let s review the essentials of the Helium Atom

Intermission: Let s review the essentials of the Helium Atom PHYS3022 Applied Quantum Mechanics Problem Set 4 Due Date: 6 March 2018 (Tuesday) T+2 = 8 March 2018 All problem sets should be handed in not later than 5pm on the due date. Drop your assignments in the

More information

Transposition Mechanism for Sparse Matrices on Vector Processors

Transposition Mechanism for Sparse Matrices on Vector Processors Transposition Mechanism for Sparse Matrices on Vector Processors Pyrrhos Stathis Stamatis Vassiliadis Sorin Cotofana Electrical Engineering Department, Delft University of Technology, Delft, The Netherlands

More information

Julian Merten. GPU Computing and Alternative Architecture

Julian Merten. GPU Computing and Alternative Architecture Future Directions of Cosmological Simulations / Edinburgh 1 / 16 Julian Merten GPU Computing and Alternative Architecture Institut für Theoretische Astrophysik Zentrum für Astronomie Universität Heidelberg

More information

Accelerating Quantum Chromodynamics Calculations with GPUs

Accelerating Quantum Chromodynamics Calculations with GPUs Accelerating Quantum Chromodynamics Calculations with GPUs Guochun Shi, Steven Gottlieb, Aaron Torok, Volodymyr Kindratenko NCSA & Indiana University National Center for Supercomputing Applications University

More information

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters H. Köstler 2nd International Symposium Computer Simulations on GPU Freudenstadt, 29.05.2013 1 Contents Motivation walberla software concepts

More information

From Physics to Logic

From Physics to Logic From Physics to Logic This course aims to introduce you to the layers of abstraction of modern computer systems. We won t spend much time below the level of bits, bytes, words, and functional units, but

More information

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation

Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Nano-scale Integrated Circuit and System (NICS) Laboratory Sparse LU Factorization on GPUs for Accelerating SPICE Simulation Xiaoming Chen PhD Candidate Department of Electronic Engineering Tsinghua University,

More information

Brief review of Quantum Mechanics (QM)

Brief review of Quantum Mechanics (QM) Brief review of Quantum Mechanics (QM) Note: This is a collection of several formulae and facts that we will use throughout the course. It is by no means a complete discussion of QM, nor will I attempt

More information

This is called a singlet or spin singlet, because the so called multiplicity, or number of possible orientations of the total spin, which is

This is called a singlet or spin singlet, because the so called multiplicity, or number of possible orientations of the total spin, which is 9. Open shell systems The derivation of Hartree-Fock equations (Chapter 7) was done for a special case of a closed shell systems. Closed shell means that each MO is occupied by two electrons with the opposite

More information

ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers

ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers ELSI: A Unified Software Interface for Kohn-Sham Electronic Structure Solvers Victor Yu and the ELSI team Department of Mechanical Engineering & Materials Science Duke University Kohn-Sham Density-Functional

More information

CHEM3023: Spins, Atoms and Molecules

CHEM3023: Spins, Atoms and Molecules CHEM3023: Spins, Atoms and Molecules Lecture 4 Molecular orbitals C.-K. Skylaris Learning outcomes Be able to manipulate expressions involving spin orbitals and molecular orbitals Be able to write down

More information

Exploring performance and power properties of modern multicore chips via simple machine models

Exploring performance and power properties of modern multicore chips via simple machine models Exploring performance and power properties of modern multicore chips via simple machine models G. Hager, J. Treibig, J. Habich, and G. Wellein Erlangen Regional Computing Center (RRZE) Martensstr. 1, 9158

More information

Chemistry 334 Part 2: Computational Quantum Chemistry

Chemistry 334 Part 2: Computational Quantum Chemistry Chemistry 334 Part 2: Computational Quantum Chemistry 1. Definition Louis Scudiero, Ben Shepler and Kirk Peterson Washington State University January 2006 Computational chemistry is an area of theoretical

More information

Quantum Chemical Simulations and Descriptors. Dr. Antonio Chana, Dr. Mosè Casalegno

Quantum Chemical Simulations and Descriptors. Dr. Antonio Chana, Dr. Mosè Casalegno Quantum Chemical Simulations and Descriptors Dr. Antonio Chana, Dr. Mosè Casalegno Classical Mechanics: basics It models real-world objects as point particles, objects with negligible size. The motion

More information

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu Performance Metrics for Computer Systems CASS 2018 Lavanya Ramapantulu Eight Great Ideas in Computer Architecture Design for Moore s Law Use abstraction to simplify design Make the common case fast Performance

More information

Quantum Chemistry on Graphics Processing Units

Quantum Chemistry on Graphics Processing Units CHAPTER 2 Quantum Chemistry on Graphics Processing Units Andreas W. Go tz 1, Thorsten Wo lfle 1,2, and Ross C. Walker 1 Contents 1. Introduction 22 2. Software Development for Graphics Processing Units

More information

Basic Physical Chemistry Lecture 2. Keisuke Goda Summer Semester 2015

Basic Physical Chemistry Lecture 2. Keisuke Goda Summer Semester 2015 Basic Physical Chemistry Lecture 2 Keisuke Goda Summer Semester 2015 Lecture schedule Since we only have three lectures, let s focus on a few important topics of quantum chemistry and structural chemistry

More information

Cyclops Tensor Framework

Cyclops Tensor Framework Cyclops Tensor Framework Edgar Solomonik Department of EECS, Computer Science Division, UC Berkeley March 17, 2014 1 / 29 Edgar Solomonik Cyclops Tensor Framework 1/ 29 Definition of a tensor A rank r

More information

Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB

Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB Parallel stochastic simulation using graphics processing units for the Systems Biology Toolbox for MATLAB Supplemental material Guido Klingbeil, Radek Erban, Mike Giles, and Philip K. Maini This document

More information

Self-consistent Field

Self-consistent Field Chapter 6 Self-consistent Field A way to solve a system of many electrons is to consider each electron under the electrostatic field generated by all other electrons. The many-body problem is thus reduced

More information

Massive Parallelization of First Principles Molecular Dynamics Code

Massive Parallelization of First Principles Molecular Dynamics Code Massive Parallelization of First Principles Molecular Dynamics Code V Hidemi Komatsu V Takahiro Yamasaki V Shin-ichi Ichikawa (Manuscript received April 16, 2008) PHASE is a first principles molecular

More information

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA

Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA S7255: CUTT: A HIGH- PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA MOTIVATION Tensor contractions are the most computationally intensive part of quantum

More information

ERLANGEN REGIONAL COMPUTING CENTER

ERLANGEN REGIONAL COMPUTING CENTER ERLANGEN REGIONAL COMPUTING CENTER Making Sense of Performance Numbers Georg Hager Erlangen Regional Computing Center (RRZE) Friedrich-Alexander-Universität Erlangen-Nürnberg OpenMPCon 2018 Barcelona,

More information

Same idea for polyatomics, keep track of identical atom e.g. NH 3 consider only valence electrons F(2s,2p) H(1s)

Same idea for polyatomics, keep track of identical atom e.g. NH 3 consider only valence electrons F(2s,2p) H(1s) XIII 63 Polyatomic bonding -09 -mod, Notes (13) Engel 16-17 Balance: nuclear repulsion, positive e-n attraction, neg. united atom AO ε i applies to all bonding, just more nuclei repulsion biggest at low

More information

Molecular Modelling for Medicinal Chemistry (F13MMM) Room A36

Molecular Modelling for Medicinal Chemistry (F13MMM) Room A36 Molecular Modelling for Medicinal Chemistry (F13MMM) jonathan.hirst@nottingham.ac.uk Room A36 http://comp.chem.nottingham.ac.uk Assisted reading Molecular Modelling: Principles and Applications. Andrew

More information

GPU- Accelerated Quantum Chemistry

GPU- Accelerated Quantum Chemistry GPU- Accelerated Quantum Chemistry Ivan Ufimtsev Stanford University TCBG GPU Programming Workshop, 2013 GPU (and Epiphany)- Accelerated Quantum Chemistry Ivan Ufimtsev TCBG GPU Programming Workshop, 2013

More information