Molecular Science Modelling


Technology Watch Report
Lorna Smith
Edinburgh Parallel Computing Centre
The University of Edinburgh
Version 1.0
Available from:

Table of Contents

1 Introduction
2 Programming Models
    High Performance Fortran (HPF)
    Message Passing Interface (MPI)
    Linda
3 Theory
    Molecular Dynamics
    First Principle (Ab-initio) Calculations
        Hartree-Fock Self Consistent Field Method
        Density Functional Theory
4 Parallelisation
    Parallelisation Strategies
    Parallelisation of Molecular Dynamics
        Short Range Forces
        Long Range Forces
        Summary
    Parallelisation of First Principle Calculations
        Hartree-Fock Theory
        Density Functional Theory
        Summary
5 Parallel Codes
    Ab-initio electronic structure methods
        CETEP
        AIMPRO
        CRYSTAL
        GAMESS_UK
        GAUSSIAN
        Other Codes
    Molecular Dynamics
        AMBER
        DL_POLY
        DPD
        GBMEGA
        Other Codes
6 Conclusions
References

1 Introduction

The term modelling has a specific meaning in science, one that does not relate to drawing or visualising models on a workstation or PC (although modellers may spend part of their time doing so). It refers to techniques in which a set of mathematical equations is used to represent a specific scientific phenomenon accurately. Molecular modelling, that is, the modelling of molecules or molecular systems, has diverse applications and is the basis of most computational chemistry techniques.

In the past, the application of certain molecular science modelling techniques was limited by a lack of computational power. This limitation has been eased by the development of parallel computing, together with low-cost computational components and fast interconnect technology. Parallel computing offers a cost-effective way of carrying out larger and more realistic simulations.

Parallelism in chemistry applications began in the 1980s, when applications used matrix-vector and matrix-matrix operations to take advantage of multiple vector registers on vector supercomputers (Clementi et al., 1984). The trend has continued ever since, for example with the use of asynchronous disk operations (overlapping computation with disk reads and writes) on clusters of workstations. The current challenge is to develop applications that run efficiently on hundreds, and even thousands, of processors.

This report reviews the current status of molecular science modelling techniques that have specific application to parallel computing. Two techniques are reviewed: molecular dynamics and ab-initio techniques. Arguably others, such as Monte Carlo methods, should also be discussed; however, this review focuses on the principal techniques currently in use in the UK.
First, brief descriptions are given of the programming paradigms most commonly used by these applications. This is followed by a brief description of the modelling techniques and a longer description of the parallelisation strategies they use. Finally, the programs currently in use are reviewed.


2 Programming Models

To write an efficient parallel algorithm, a number of attributes must be considered, such as load balancing, scalability, and tolerance of the latency and low bandwidth associated with references to remote memory locations. The concept of nonuniform memory access (NUMA) is essential when designing parallel algorithms: memory can be nonuniform not only in the latency and bandwidth of access but also in the way it is accessed. Parallel programming languages deal with NUMA in a variety of ways. Some hide all complexity behind an automatic parallelising compiler; others use a data-driven model in which the distribution of data is made explicit but all data is referenced with the same language constructs (e.g. HPF). A subroutine interface can be used to access remote or distributed data (e.g. Linda), or in some cases no direct access to remote data is allowed (e.g. MPI).

There are a large number of parallel programming environments, not all of which are discussed here. This section briefly reviews some of the more common environments used to write parallel programs in the area of scientific modelling.

2.1 High Performance Fortran (HPF)

HPF is a set of constructs and extensions to Fortran 90 that allow the user to express parallelism in a relatively simple manner. It was designed to promote the wider use of parallelism by hiding the details of the architecture from the programmer, and to provide a portable language. HPF is primarily a data parallel language. A program resembles one written in an ordinary sequential language, with the flow of control following strict sequential order except when parallel intrinsics or built-in procedures are called. The programmer sees a single memory; the actual distribution of data and the communication between processors are handled by the compiler with guidance from the programmer.
The programmer places specially designed comments, called compiler directives, within the code to help the compiler distribute the data and work on a parallel machine. Example directives include DISTRIBUTE and ALIGN, which are used to partition data among memory regions. The FORALL construct defines the assignment of multiple elements of an array without enforcing any order on the assignments to individual elements. HPF works well where regular data structures are present (as in some molecular dynamics algorithms); however, if either the load balancing or the data structures are irregular (as in much of quantum chemistry), HPF becomes less effective.

Further information on High Performance Fortran can be obtained from the Technology Watch Report High Performance Fortran. History, Overview and Current Status (Richardson, 1995).
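As a rough illustration of the kind of data partitioning a DISTRIBUTE directive requests, the Python sketch below works out which processor would own each element of a one-dimensional array under a block-style distribution. The function names `block_ranges` and `block_owner` are hypothetical, and the block size of ceil(n/p) is an assumption made for this sketch; it is not HPF code and does not reproduce what any particular compiler does.

```python
def block_ranges(n, p):
    """Half-open index ranges [start, stop) owned by each of p processors
    when an n-element array is split into contiguous blocks."""
    block = -(-n // p)  # ceiling division: assumed block size
    return [(min(n, r * block), min(n, (r + 1) * block)) for r in range(p)]


def block_owner(index, n, p):
    """Processor (0-based) that owns element `index` under the same layout."""
    block = -(-n // p)
    return index // block


if __name__ == "__main__":
    # Ten array elements spread over four processors.
    for rank, (start, stop) in enumerate(block_ranges(10, 4)):
        print(f"processor {rank} owns elements {start}..{stop - 1}")
```

With a regular layout of this kind every processor can compute ownership locally, which is one reason the data parallel model suits regular data structures better than irregular ones.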

2.2 Message Passing Interface (MPI)

MPI is based on the message passing model, in which each process has a local memory that no other process can directly read or write. Unlike the data parallel model discussed above, there is no globally addressable memory. This model is effective for algorithms that use static domain decomposition and for systolic loop algorithms (both of which are used in some molecular dynamics codes), and for coarse-grain algorithms (used in SCF calculations). Domain decomposition and systolic loop algorithms are described in more detail in later sections. The program is written in a standard language (e.g. Fortran or C), with data movement controlled by calls to routines from a communication library.

MPI was designed as a standard for message passing, and its development involved a large number of users and software and hardware vendors. Although MPI represents a standard, other message passing packages exist and are used by the scientific modelling community. Three of the most common are PVM (Parallel Virtual Machine), TCGMSG (Theoretical Chemistry Group Message Passing System) and occam (which was inspired by the CSP (communicating sequential processes) model). Further information on MPI can be obtained from the Technology Watch Report MPI: A Message-Passing Interface Standard. History, Overview and Current Status (Malard, 1996).

2.3 Linda

Linda is a coordination language built on a base language such as C or Fortran. It is based on a distributed data structure model and creates, for every Linda program, a virtual shared memory called tuple space. Tuple space is simply a medium for sharing data between processes without the need for physically shared memory, and it can be accessed by any process within a given Linda program.
Data is moved to and from tuple space in the form of tuples. Tuples are the data structures of tuple space and are simply collections of typed data objects or place holders (called elements). Linda is limited in its ability to provide information on how tuples are stored or accessed, and general tuples must be matched, which can lead to inefficiencies in memory usage and communication. The lack of primitives for efficient global communication can also be a problem.
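To make the idea of tuple matching concrete, here is a minimal single-process Python sketch of a tuple space. It is an illustration only: the class `TupleSpace` is hypothetical, a Python type in a pattern stands in for a typed place holder, and the retrieval calls return `None` instead of blocking as a real Linda implementation would. The method is named `in_` because `in` is a reserved word in Python.

```python
class TupleSpace:
    """Toy tuple space: tuples are deposited with out() and retrieved by
    matching a pattern of concrete values and typed place holders."""

    def __init__(self):
        self._tuples = []

    def out(self, *tup):
        """Deposit a tuple into the space."""
        self._tuples.append(tuple(tup))

    @staticmethod
    def _matches(pattern, tup):
        if len(pattern) != len(tup):
            return False
        for want, have in zip(pattern, tup):
            if isinstance(want, type):      # typed place holder
                if not isinstance(have, want):
                    return False
            elif want != have:              # concrete value must be equal
                return False
        return True

    def rd(self, *pattern):
        """Return the first matching tuple without removing it (or None)."""
        for tup in self._tuples:
            if self._matches(pattern, tup):
                return tup
        return None

    def in_(self, *pattern):
        """Return and remove the first matching tuple (or None)."""
        tup = self.rd(*pattern)
        if tup is not None:
            self._tuples.remove(tup)
        return tup


if __name__ == "__main__":
    space = TupleSpace()
    space.out("force", 3, 1.5)
    print(space.rd("force", int, float))   # read, tuple stays in the space
    print(space.in_("force", 3, float))    # read and remove
    print(space.rd("force", int, float))   # nothing left to match
```

Every retrieval here scans the stored tuples for a match, which hints at why general tuple matching can become costly in memory and communication.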

3 Theory

In this section a brief description of the two techniques, molecular dynamics and ab-initio methods, is given, followed by a more detailed examination of the parallelisation strategies. Reviews of these (with application to parallel computing) include those of Harrison and Shepard (1994), Kendall et al. (1995) and Smith (1993a).

3.1 Molecular Dynamics

Molecular dynamics is a method for solving the many-particle equations of motion for a molecular system (Allen et al., 1997). It is used to determine equilibrium and transport properties of a classical many-body system and involves the iterative computation of the total potential energy, forces and coordinates of every atom in the system at each of a number of time steps. A molecular dynamics simulation involves the following steps:

1) An initial starting configuration is constructed. This requires a set of initial positions for the atoms, which can, for example, be taken from a known crystal structure or can simply be a set of random numbers. Initial velocities for the atoms are also required, and these can again be random numbers.
2) The system is then initialised by scaling the velocities to the desired temperature.
3) The forces on each particle are calculated.
4) The movement of the particles within some time interval is calculated (from the atomic positions, velocities and forces).
5) The atomic positions are updated and the process is repeated.

This cycle is repeated many times within a molecular dynamics simulation. Molecular dynamics describes the molecular system as a function of time. This is an advantage over Monte Carlo methods in that time-dependent phenomena such as transport properties (e.g. viscosity) can be calculated directly. The most computationally expensive part of a molecular dynamics simulation is the force calculation, i.e. the calculation of the interactions between particles.
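Steps 3 to 5 of the cycle above can be sketched in a few lines of Python. The choices here are assumptions made purely for illustration and are not taken from the report: a Lennard-Jones pair potential in reduced units (epsilon = sigma = 1), unit masses, no periodic boundaries or cut-off, and a velocity Verlet integrator. The function names `lj_forces` and `run_md` are hypothetical, and velocity scaling (step 2) is omitted.

```python
def lj_forces(pos):
    """Step 3: pairwise Lennard-Jones forces, visiting each pair once."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = [pos[i][k] - pos[j][k] for k in range(3)]
            r2 = sum(x * x for x in d)
            inv6 = 1.0 / r2 ** 3
            # F_i = 24 * (2/r^12 - 1/r^6) / r^2 * (r_i - r_j)
            scale = 24.0 * (2.0 * inv6 * inv6 - inv6) / r2
            for k in range(3):
                forces[i][k] += scale * d[k]
                forces[j][k] -= scale * d[k]   # Newton's third law
    return forces


def run_md(pos, vel, dt, steps):
    """Steps 4 and 5: velocity Verlet updates, repeated `steps` times.
    Positions and velocities are updated in place and returned."""
    forces = lj_forces(pos)
    for _ in range(steps):
        for i in range(len(pos)):
            for k in range(3):
                vel[i][k] += 0.5 * dt * forces[i][k]   # half kick
                pos[i][k] += dt * vel[i][k]            # drift
        forces = lj_forces(pos)                        # new forces
        for i in range(len(pos)):
            for k in range(3):
                vel[i][k] += 0.5 * dt * forces[i][k]   # second half kick
    return pos, vel


if __name__ == "__main__":
    positions = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
    velocities = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    run_md(positions, velocities, dt=0.001, steps=100)
    print("separation:", positions[1][0] - positions[0][0])
```

The double loop in `lj_forces` is where almost all the time goes, while the integration loops touch each atom only once per step.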
The integration of the equations of motion is also computationally intensive; however, the force calculation is of order $N^2$ (where N is the number of particles), whereas the integration of the equations of motion is of order N. The pairwise interactions can be calculated independently of one another. This makes this part of the calculation inherently parallel, a fact that has been exploited by parallel molecular dynamics codes.

3.2 First Principle (Ab-initio) Calculations

These calculations involve the direct calculation of material properties from fundamental quantum mechanical theory (Gillan, 1994) and essentially involve solving various approximations to the Schrödinger equation that describes their basic structures. Fundamental properties, for example bond strengths and reaction energies, have been calculated from first principles. The system being studied (an atom or a molecule) is described by wavefunctions (generally complex mathematical expressions). The wavefunction can then be used to determine the energy and various other properties of the system.

The problem is that, for the systems being studied, the wavefunctions cannot be determined analytically. Approximations need to be made, the most common being the Born-Oppenheimer approximation. This allows the nuclei to be regarded as stationary whilst the electrons move (because the nuclei are much heavier than the electrons). The motion of the electrons is correlated, i.e. the motion of one electron is affected by the other electrons. However, another approximation that is often made is that the motion of the particles is not correlated. The particles still interact, but each particle experiences an interaction arising from a smeared representation of the average position of all the other particles, rather than an instantaneous interaction which changes as they move. The problem now consists of finding a set of individual wavefunctions, one for each particle. These individual wavefunctions are known as molecular orbitals.

Hartree-Fock Self Consistent Field Method

The simplest method of determining the electronic structure of molecules by forming approximate solutions to the Schrödinger equation (within the Born-Oppenheimer approximation) is the Hartree-Fock or self-consistent field (SCF) method. This method takes the molecular wavefunction (a non-separable function of the coordinates of all N electrons) and approximates it as an antisymmetric product of N one-electron functions.
Each of these one-electron functions (called molecular orbitals) is expanded in an underlying basis set (typically atom-centred Gaussian-like functions), and the molecular orbitals are then determined by minimising the total energy with respect to the expansion coefficients (C). The simplest N-electron wavefunction in use is a single antisymmetric product (Slater determinant) of one-electron functions which are orthonormal linear combinations of the atomic orbital basis functions.

The most computationally expensive part of the calculation is computing the derivatives of the energy with respect to the molecular orbital coefficients. This is closely related to the Fock matrix (F):

\[ F_{ij} = h_{ij} + \frac{1}{2} \sum_{kl} \left[ 2\,(ij|kl) - (ik|jl) \right] D_{kl} \]

where $D_{kl}$ is the density matrix, and $h_{ij}$ and $(ij|kl)$ are one- and two-electron integrals over the underlying basis functions. The basis is typically of dimension N=O( ) and thus, even allowing for sparsity, the two-electron integrals (which have four labels) are numerous. The major computation is the evaluation of the non-zero integrals, and the largest data requirements are due to the Fock and density matrices (both $O(N^2)$). The problems are:

1) Distributing and accessing the matrices to minimise communication costs.
2) Maintaining load balance in the presence of sparsity and large variation in the cost of integral evaluation.

The Fock matrix construction involves the contraction of a large, sparse four-index matrix (the electron-repulsion integrals) with a two-index matrix (the electronic density) to yield another two-index matrix (the Fock matrix). Both two-index matrices are of size N*N, where N is the size of the underlying basis set (N= typically). The number of integrals scales between $O(N^2)$ and $O(N^4)$, depending on the nature of the system and the level of accuracy required.

The methodology is as follows:

1) A position is chosen for the atomic nuclei.
2) A set of Gaussian basis functions is chosen.
3) An initial guess of the form of the one-electron wavefunctions is generated by choosing the coefficients of the Gaussian basis functions representing each molecular orbital.
4) The density matrix is computed.
5) The Fock matrix is constructed.
6) The N equations are solved.
7) If solving the N equations yields improved molecular orbitals, these are used as the starting point of a new iteration. Otherwise the iteration is terminated and the process continues to step 8.
8) The total energy of the system can now be evaluated.

Density Functional Theory

Hartree-Fock theory uses an exact Hamiltonian and approximate many-electron wavefunctions. The correlations between electrons can be either long-range or short-range. Self-consistent field theory deals with long-range forces by using averaging techniques, i.e. the field experienced by an atom depends on the global distribution of the atoms. Short-range correlations, which involve the local environment around the atoms (i.e. deviations), are not treated by the self-consistent field method. These short-range forces are often minor; however, in some cases, such as high-temperature ceramic superconductors, the correlations are strong and need to be considered. Kohn and Sham (1965) developed a theory to deal with this problem. It has been termed density functional theory, since the electron density plays a crucial role. Effectively, the energy is written as a functional of the electron density rather than in terms of the many-electron wavefunctions. Approximations are made to the Hamiltonian.
The difference between the two techniques can be seen by considering the forms of the energy for Hartree-Fock and for density functional theory.

Hartree-Fock:

\[ E_{HF} = V + hP + \frac{1}{2} P J(P) - \frac{1}{2} P K(P) \]

where V is the nuclear repulsion energy, P is the density matrix, $hP$ is the one-electron (kinetic + potential) energy, $\frac{1}{2} P J(P)$ is the classical Coulomb repulsion of the electrons and $-\frac{1}{2} P K(P)$ is the exchange energy resulting from the quantum nature of the electrons.

Density functional theory:

\[ E_{KS} = V + hP + \frac{1}{2} P J(P) + E_x[P] + E_c[P] \]

where $E_x[P]$ is the exchange functional and $E_c[P]$ is the correlation functional. The exchange and correlation functionals are integrals of some function of the density and, in some cases, of the density gradient.

The methodology is similar to that of Hartree-Fock methods. The Hamiltonian is broken down into basic one-electron and two-electron components as before; however, the two-electron components are further reduced to a combination of the Coulomb term and the exchange-correlation term(s). This extra term is incorporated into the Fock matrix. The correlation term is normally integrated numerically on a grid, or fitted to a Gaussian basis and then integrated analytically. The computationally intensive parts of the calculation are the fitting of the density, the construction of the Coulomb potential, the construction of the exchange-correlation potential and the subsequent diagonalisation of the resulting equations.

The exchange-correlation potential in density functional theory is determined only by the electron density. The precise dependence on density is not known except for the homogeneous electron gas. In other situations the electron density varies through space, and the assumption is made that the exchange-correlation potential at a given point is that of the homogeneous electron gas with the density at the same point.

In outline, the charge density is determined and compared with the charge density previously used to generate the effective potential; if it is an improvement on the original charge density, the cycle continues. In more detail, an initial guess is made for the electronic charge density, and the Hartree and exchange-correlation potentials are then calculated. The Hamiltonian matrices for each of the k points included in the calculation are constructed and diagonalised to obtain the Kohn-Sham eigenstates.
These eigenstates can then be used to generate the charge density. A new set of Hamiltonian matrices is then generated, and the process is repeated until the output charge density is self-consistent with the charge density used to construct the electronic potentials. In general, the Kohn-Sham equations are used rather than the Hartree-Fock equations, the methodology being very similar.
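The contraction at the heart of the Fock matrix construction, $F_{ij} = h_{ij} + \frac{1}{2} \sum_{kl} [2(ij|kl) - (ik|jl)] D_{kl}$, can be written out as a direct four-index loop. The Python sketch below is a naive illustration only: the name `build_fock` is hypothetical, the two-electron integrals are supplied through an assumed callable `eri(i, j, k, l)`, and no use is made of sparsity or of the permutational symmetry of the integrals, both of which a production code would need to exploit.

```python
def build_fock(h, density, eri):
    """Contract two-electron integrals with the density matrix.

    h       -- one-electron matrix, list of lists (n x n)
    density -- density matrix D, list of lists (n x n)
    eri     -- callable returning the integral (ij|kl) for four indices
    """
    n = len(h)
    fock = [[h[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                for l in range(n):
                    coulomb = 2.0 * eri(i, j, k, l)   # 2 (ij|kl)
                    exchange = eri(i, k, j, l)        # (ik|jl)
                    fock[i][j] += 0.5 * (coulomb - exchange) * density[k][l]
    return fock


if __name__ == "__main__":
    # One basis function with made-up integral values, for illustration.
    h = [[-1.0]]
    density = [[2.0]]
    print(build_fock(h, density, lambda i, j, k, l: 0.5))
```

The four nested loops make the $O(N^4)$ worst-case cost explicit. Each (i, j, k, l) contribution is independent of the others, which is what allows the integrals to be shared out among processors, at the price of summing partial Fock matrices afterwards.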

4 Parallelisation

Most of the original work on parallel processing of chemistry applications was done by the LCAP (Loosely Coupled Array Processors) project (Clementi, 1990), whose aim was "to couple readily available commercial processors to form a system that is not massively parallel, but rather is modular and can be expanded to match the degree of parallelism that a set of applications can support". For example, a direct SCF calculation was carried out using a master/slave model in which each processor calculated a subset of the electron integrals and passed a partial Fock matrix back to the master processor, which added these together. This project was probably the pioneering work in parallel computational chemistry and provided the incentive for later developments, many of which use the replicated data technique developed in this project.

In quantum chemistry, the growth in parallel codes has been relatively slow since this project, and considerably slower than the large growth in users of quantum chemistry codes. This is partly due to the size of the standard codes such as SPARTAN (Carpenter et al.) and Gaussian (Frisch et al.), and partly due to a lack of code developers. The situation has been somewhat remedied, with parallel versions of the standard packages Gaussian, GAMESS (Guest et al., 1987), HONDO (Dupuis et al., 1993) and TURBOMOLE (Ahlrichs et al., 1989) now in existence. The emphasis is, however, still primarily on users' own codes.

Parallelisation of molecular dynamics calculations has fared rather better historically, with the inherently parallel nature of molecular dynamics described extensively in the literature (a number of these reviews are referenced in the text). A number of widely used simulation packages have been parallelised, such as CHARMM (Brooks et al., 1992), AMBER (Weiner et al., 1984), Discover (Biosym Technologies) and GROMOS (Biomos B.V.).
In this section the principal parallelisation techniques used in molecular dynamics and ab-initio codes are discussed.

4.1 Parallelisation Strategies

There are five basic strategies for parallelisation. The first two are discussed briefly here, whilst the latter three, which have been applied more extensively to molecular dynamics simulations and first principle calculations, are discussed in more detail in the relevant later sections.

1) Cloning
2) Master-Slave
3) Replicated Data
4) Systolic Loops
5) Domain Decomposition

Hybrids of these are also possible.

Cloning simply involves allocating P independent simulations to P processors. This technique is both easily implemented and very efficient, but is limited in application. It is particularly suited to Monte Carlo simulations, where each processor conducts an independent random walk.

The Master-Slave model uses a master processor to run, or control, the simulation. This processor allocates work to other processors when necessary. The drawbacks of this model are communication difficulties and load balancing problems.

4.2 Parallelisation of Molecular Dynamics

When considering the parallelisation of molecular dynamics there are two important considerations. Firstly, the algorithm must be effective for a relatively small number of atoms (e.g. fewer than 1000), as the aim of any simulation must be to model the system accurately with the smallest number of atoms (and thus to perform each time step as quickly as possible). Most molecular dynamics simulations are carried out on systems ranging in size from a few hundred to a few thousand atoms. Secondly, truly scalable algorithms are important, so that larger and faster parallel machines developed in the future can be exploited.

The parallelisation of molecular dynamics calculations is discussed in this section. The force terms involved in a simulation are typically non-linear functions of the distance between pairs of atoms and can be either long-range or short-range. Both types of force are discussed.

Short Range Forces

The three most common methods of parallelising short-range molecular dynamics simulations were suggested by Plimpton (1995), who developed:

1) Atom Decomposition. This is based on the replicated data method.
2) Force Decomposition. This involves either a systolic loop method or a force matrix method.
3) Domain Decomposition. A common domain decomposition method is the Linked Cell method. An extension of this method is called spatial decomposition (also known as geometric methods).

1. Atom Decomposition

This method, which is based on the replicated data strategy, keeps identical copies of the configuration data on all the processors. A subgroup of atoms is assigned to each processor, and the processor computes the forces on its atoms no matter where they move in the simulation domain, hence the name atom decomposition. Initially, each processor has a complete copy of the coordinates and velocities of the atoms in the system. Each processor is assigned a sub-block of the N*N force matrix (where N is the number of atoms) to calculate. For example, if there are P processors and approximately N(N-1)/2 interactions (the factor of 1/2 is a result of Newton's third law, $F_{ij} = -F_{ji}$), then each processor calculates N(N-1)/2P of these interactions.
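The division of the N(N-1)/2 pair interactions among P processors can be illustrated with a short serial Python sketch. Several things here are assumptions made for illustration: pairs are dealt out round-robin rather than as contiguous sub-blocks of the force matrix, the "processors" are simply list entries, and the pair force is a made-up one-dimensional expression. The names `partition_pairs` and `partial_forces` are hypothetical.

```python
from itertools import combinations


def partition_pairs(n, p):
    """Deal the n(n-1)/2 unique atom pairs out to p processors in turn."""
    pairs = list(combinations(range(n), 2))
    return [pairs[rank::p] for rank in range(p)]


def partial_forces(x, pairs):
    """Force array built from one processor's share of the pairs.
    Every processor holds the full coordinate list x (replicated data)."""
    forces = [0.0] * len(x)
    for i, j in pairs:
        f_ij = x[j] - x[i]      # toy pair force, for illustration only
        forces[i] += f_ij
        forces[j] -= f_ij       # Newton's third law: F_ji = -F_ij
    return forces


if __name__ == "__main__":
    x = [0.0, 1.0, 3.0, 6.0, 10.0, 15.0]
    shares = partition_pairs(len(x), 3)
    print([len(s) for s in shares])          # pairs handled per processor
    partials = [partial_forces(x, s) for s in shares]
    total = [sum(column) for column in zip(*partials)]
    print(total)
```

Each partial array is incomplete on its own; only the element-wise sum over all processors gives the true force on every atom, which is why a global summation step has to follow.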

At this point no processor has a complete representation of the force matrix, and hence none can build the total particle forces. The incomplete force arrays must be circulated to all the other processors to complete the summation of the forces on each processor. This requires a global pass-and-sum (Smith, 1991). In this strategy each processor exchanges its data (N/P of the data) with an adjacent processor and the arrays are summed. Each processor then exchanges its data (now 2N/P) with a processor two positions away (+2 or -2) and the arrays are again summed. The procedure is repeated with a processor four positions away (+4 or -4), and so on. As every processor follows this sequence concurrently, the end result is an identical sum, on all processors, of all the original arrays local to each processor. Figure 1 shows the scheme.

Figure 1: The global pass-and-sum scheme

The idea was outlined by Fox et al. (1988). Finally, the equations of motion are integrated independently on each processor, again without reference to any other processor.

The scheme benefits from simplicity. Routines exist for the global pass-and-sum and these can be inserted at the appropriate locations in the code. Few other changes are typically required to parallelise the code. The duplication of information on each processor allows for straightforward computation of three- and four-body force terms. However, data replication on each processor means that the strategy is expensive in terms of memory. The efficiency of the algorithm is limited by the global summation of the forces, where the communication scales as N, independent of P. This is demonstrated by Plimpton (1995) in his benchmark of a Lennard-Jones code, where communication costs started to dominate as the number of processors increased.
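The doubling pattern of the global pass-and-sum (partners 1, then 2, then 4 positions away) can be simulated serially. The Python sketch below is a simplification under stated assumptions: the number of processors is a power of two, the partner at each stage is chosen as rank XOR distance (which gives the +distance or -distance partner), and complete arrays are exchanged at every stage rather than the growing N/P, 2N/P, ... portions described in the text. The function name `global_sum` is hypothetical.

```python
def global_sum(arrays):
    """Simulate a pairwise-exchange global sum over len(arrays) processors.

    arrays[r] is the partial array held by processor r. Returns the list
    of arrays held by every processor once all stages are complete.
    """
    p = len(arrays)
    if p == 0 or (p & (p - 1)) != 0:
        raise ValueError("this sketch assumes a power-of-two processor count")
    data = [list(a) for a in arrays]
    distance = 1
    while distance < p:
        # Every processor swaps with its partner and adds what it receives.
        data = [
            [mine + theirs for mine, theirs in zip(data[r], data[r ^ distance])]
            for r in range(p)
        ]
        distance *= 2
    return data


if __name__ == "__main__":
    partial = [[1, 0], [2, 1], [3, 0], [4, 1]]
    for rank, summed in enumerate(global_sum(partial)):
        print(f"processor {rank}: {summed}")
```

After log2(P) stages every processor holds the same complete sum, at the cost of each processor having sent and received data at every stage.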

Examples of the use of this algorithm include the parallel implementations of GROMOS (Skeel, 1991) and CHARMM (Brooks et al., 1983). Bruge et al. (1988) also developed a molecular dynamics program for ST2 water molecules using this technique.

2. Force Decomposition

There are two types of force decomposition: the systolic loop algorithm and the force-matrix formalism described by Plimpton (1995). There are a number of different types of systolic algorithm; Raine et al. (1989) described three:

1) Systolic Loop Single Group (SLS-G)
2) Systolic Loop Double Group (SLD-G)
3) Systolic Loop Bidirectional Group (SLB-G)

The general concept involves packets of data being circulated between processors, with each packet containing data relating to a subset of atoms (e.g. the atom coordinates, velocities and force accumulators). All these algorithms are fully distributed: each processor processes only a subset of the total system data, and hence the memory demands are lower than those of the replicated data (atom decomposition) strategy.

In the SLD-G algorithm the data is shared between processors so that each processor has a group of atoms (with the force accumulators set to zero). This algorithm requires that the processors are connected in a ring topology with an odd number of processors. Each processor duplicates its packet, which contains the coordinate arrays and force accumulators. One packet remains on the home processor (i.e. it is fixed) and the other is passed between processors. The pair forces within a home group are calculated and added to the home force accumulators. The duplicated packet is then passed to the next processor in a specific direction around the ring, so that each processor now has the atomic coordinates and force accumulators for two packets, and the forces between these groups can be calculated and added to the force accumulators.
The travelling packets are then passed again in the same direction to the next processor and the process is repeated. With P processors the packet must be passed (P-1)/2 times for all possible pair forces to be calculated (this is the reason there must be an odd number of processors). The duplicated packets must then be passed back (in the opposite direction to the way they were sent) to their original home processors, and the force accumulators of the replicated packet and the home packet are added. Figure 2 shows the passing scheme for five processors.
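The claim that (P-1)/2 passes are enough to cover every pair of atom groups exactly once can be checked with a small serial simulation of the SLD-G passing pattern. This Python sketch tracks only which group's travelling packet sits on which processor; no forces are computed, and the name `sldg_schedule` is hypothetical.

```python
def sldg_schedule(p):
    """Group pairings produced by the SLD-G passing pattern on p processors.

    Returns one list per pass; each entry (home, visitor) means that the
    processor holding home group `home` also holds the travelling copy of
    group `visitor` and computes the forces between the two groups.
    """
    if p % 2 == 0:
        raise ValueError("SLD-G requires an odd number of processors")
    travelling = list(range(p))   # travelling[r]: group whose copy is on r
    schedule = []
    for _ in range((p - 1) // 2):
        # Every processor passes its travelling packet to the next one.
        travelling = [travelling[(r - 1) % p] for r in range(p)]
        schedule.append([(r, travelling[r]) for r in range(p)])
    return schedule


if __name__ == "__main__":
    for step, pairing in enumerate(sldg_schedule(5), start=1):
        print(f"pass {step}: {pairing}")
```

For five processors the two passes generate ten pairings, one for each of the 5*4/2 distinct pairs of groups. With an even P the final pass would pair each group with the group diametrically opposite from both ends of the ring, so that pair would be counted twice.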

Figure 2: The scheme utilised by the SLD-G algorithm to pass packets of data.

This algorithm benefits from good load balancing and, as mentioned, is less expensive in memory than atom decomposition (replicated data). It suffers, however, from the need to send the replicated data packets back to their home nodes. This rewind step is wasteful.

The SLB-G algorithm was developed to reduce this wastefulness. As with the SLD-G algorithm, the data is shared between processors in a ring topology with an odd number of processors, and duplicated data packets are again produced. In this case, however, the two packets on each processor are sent in opposite directions, i.e. there is no home packet. At the end of (P-1)/2 data passes the duplicate data packets are within one pass of each other, and hence the rewind step is much shorter than for the SLD-G algorithm. This technique is shown schematically in Figure 3. The velocities of the atoms must be included in the data packets, as there is no home processor to return to and they are needed when the force calculations are complete.

Figure 3: The scheme utilised by the SLB-G algorithm to pass packets of data

The last algorithm, SLS-G, initially has two data packets assigned to each processor. However, unlike the previous two algorithms, the two packets on a node represent different groups of atoms. The number of processors can be odd or even, and the processors are connected in a line with a head and a tail processor at either end. Initially the forces within packets are calculated, and then the forces between the different packets on each processor are calculated. The data packets are then exchanged. Each processor that is not a head or tail processor sends its first data packet to the right and simultaneously receives one from the left. It then sends its second packet to the left and receives one from the right. The tail processor (at the far right of the chain) sends its first data packet (packet A) to the left; this is then replaced by the second data packet on the same processor (packet B). The tail processor also receives one packet from the left. On the next pulse, packet B is sent to the left. The head processor (at the far left) has one data packet permanently fixed. Its other packet is sent to the right and a packet is received from the right. See Figure 4. The number of sends required to return the packets to their home processors with completed force accumulators is 2P-1.

Figure 4: The scheme utilised by the SLS-G algorithm to pass packets of data

This algorithm has the advantage of being more generally applicable than the previous two. Some attempts have been made to improve the efficiency of these algorithms, mainly by overlapping the communications with the computation of the forces. Systolic loop algorithms have been used successfully by a number of authors; for example, Heller et al. (1990) built a sixty-node MIMD parallel computer with a systolic loop architecture and programmed it in occam 2. They were interested in carrying out molecular dynamics simulations of large biopolymers.

Force-matrix algorithms differ from atom decomposition in that they are based on a block decomposition of the force matrix rather than a row-wise decomposition. This method is advantageous in that the memory and communication costs are reduced by a factor of $\sqrt{P}$ compared with the atom decomposition methods. Plimpton's Lennard-Jones benchmark problem (Plimpton, 1995) continued to speed up even when hundreds of processors were used.

3. Domain Decomposition

The Linked Cell method is a commonly used domain decomposition method. In the sequential version of the linked cell algorithm, the molecular dynamics simulation cell is divided into smaller, identically sized subcells. Their width must be slightly greater than the minimum cut-off radius, but apart from this the number of cells is chosen to be as large as possible. Each atom is assigned to the appropriate subcell and a linked list is created, by means of which each atom may be located. A header list is also constructed which identifies the first member of each subcell. For each subcell, the interactions between each of its atoms and the other atoms in the same subcell or in one of the neighbouring subcells are calculated. Half of the neighbouring cells are excluded to avoid double counting of pairwise interactions. This is carried out for each subcell, leading to the force evaluation.

To parallelise this scheme, the molecular dynamics simulation cell is divided into regions in a manner similar to that used in the serial version. Each region has the same shape and size (to ensure good load balancing), although it should be noted that the regions need not be cubic. Each region is then assigned to a specific processor; the region must be several times larger than the pair-force cut-off. The region on each processor is then further subdivided into subcells (as in the sequential algorithm), taking the cut-off range into account. The mapping of the regions onto the parallel processors is important and should ensure that neighbouring processors on the network handle neighbouring regions of the molecular dynamics cell.

Every subcell within a region has enough neighbouring subcells to complete the calculation, with the exception of the subcells which lie on the region's boundaries. To calculate the forces on the atoms in subcells at the boundaries, neighbouring processors must exchange copies of the relevant boundary subcell data. Two possible strategies exist for this. The first involves exchanging copies of the atoms occupying the relevant boundary regions, then using these copies to calculate the pair forces for the resident boundary regions only, i.e. no force data is communicated, only coordinate data, which is passed in both directions (Pinches et al., 1991). The second method involves passing the boundary region coordinates in one direction only (e.g. the north, east and up boundary regions).
These are then used to calculate the atom forces (on atoms in the south, west and down boundary regions of neighbouring cells), and the calculated forces are communicated back to the nodes in the south, west and down directions, where they are added to the forces in the north, east and up regions of these nodes. The first method benefits from overlapped two-way communication but involves some duplication of the force calculation. The second method involves no force duplication but has no overlapped communications. Smith (1991) suggested that there was unlikely to be any substantial difference between the two methods.

After the boundary data has been communicated the pair forces are calculated. The process then continues as in the sequential algorithm. The integration of the equations of motion has the advantage that each processor integrates the equations of motion for its own atoms only. It is important to keep track of atoms which move out of the region allocated to one processor and into the region of another. After the equations of motion have been integrated, the location of the atoms must be checked and, if necessary, the atom coordinates and velocities must be reallocated to the appropriate neighbouring processor. This algorithm is relatively easy to parallelise and is appropriate for simulations of very large systems. Pinches et al. (1991) used the linked cell algorithm successfully for systems of atoms in two and three dimensions.

Spatial decomposition (also called the geometric method) is very similar in nature to the linked cell method. As with the linked cell method, the simulation box is divided into smaller three dimensional boxes. However, only one box is assigned to each processor. The size and shape of the boxes depend on the total number of atoms and the number of processors, and a cubic box is favoured to minimise communication costs. The method differs in that the box lengths may be smaller or larger than the force cut-off length.
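As an illustration of the sequential linked cell bookkeeping described earlier in this section (a header list, a linked list, and the exclusion of half of the neighbouring cells), the following is a minimal Python sketch. It assumes a cubic, non-periodic box with coordinates in the range 0 to `box`; the function names and the choice of a dictionary for the header list are illustrative assumptions, not taken from any of the codes discussed in this report.

```python
import itertools
import math

def build_cells(positions, box, cutoff):
    """Build the header list and linked list for a cubic, non-periodic box."""
    ncell = max(1, int(box // cutoff))    # cells per side, so cell width >= cutoff
    width = box / ncell
    head = {}                             # header list: cell index -> first atom in the cell
    next_atom = [-1] * len(positions)     # linked list: atom -> next atom in the same cell
    for i, p in enumerate(positions):
        cell = tuple(min(int(x / width), ncell - 1) for x in p)
        next_atom[i] = head.get(cell, -1)
        head[cell] = i
    return ncell, head, next_atom

def atoms_in(cell, head, next_atom):
    """Walk the linked list of one cell."""
    i = head.get(cell, -1)
    while i != -1:
        yield i
        i = next_atom[i]

# 13 of the 26 neighbour offsets; the other half is skipped to avoid double counting.
HALF_SHELL = [d for d in itertools.product((-1, 0, 1), repeat=3) if d > (0, 0, 0)]

def pairs_within_cutoff(positions, box, cutoff):
    """Return the set of atom pairs (i, j), i < j, closer than the cut-off."""
    ncell, head, next_atom = build_cells(positions, box, cutoff)
    found = set()
    for cell in list(head):
        members = list(atoms_in(cell, head, next_atom))
        # Pairs inside the same cell.
        for a, b in itertools.combinations(members, 2):
            if math.dist(positions[a], positions[b]) < cutoff:
                found.add((min(a, b), max(a, b)))
        # Pairs between this cell and half of its neighbouring cells.
        for d in HALF_SHELL:
            other = tuple(c + s for c, s in zip(cell, d))
            if any(not 0 <= c < ncell for c in other):
                continue
            for a in members:
                for b in atoms_in(other, head, next_atom):
                    if math.dist(positions[a], positions[b]) < cutoff:
                        found.add((min(a, b), max(a, b)))
    return found
```

In a real simulation the pair loop would accumulate forces rather than collect index pairs; the set is used here only so that the result can be compared directly with a brute-force search over all pairs.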
Each processor maintains two data structures, one for the atoms within its own box (N/P atoms) and one for atoms in neighbouring boxes. In the first structure each processor stores a complete set of information, i.e. coordinates, velocities etc. The data is stored in a linked list to keep track of the atoms moving between different cells. The second data structure contains atom positions only. In order to calculate the pairwise forces on each processor, the second data structures need to be communicated between the relevant processors. The scheme for communicating this data can be described in a number of steps:

1) Each processor first exchanges the second data structure with its neighbouring processors in the east and west directions. For example, in figure 5 processor 1 fills a buffer with the atom positions that are within a cut-off length of processor 0's box. When the length of the box in the east/west direction (d) is less than the cut-off length this will be all of processor 1's atoms; otherwise it will contain those nearest to box 0. The message buffer is then sent from processor 1 to processor 0 (i.e. westwards). All processors do this, and hence processor 1 also receives a message from processor 2 (i.e. from the east). The process is then repeated in the opposite direction (processor 1 sends to processor 2 and receives from processor 0). If the length of the box is greater than the cut-off length then all the necessary atoms have been received. If, however, the length of the box is less than the cut-off length, further communication is required and the east/west procedure is repeated. For example, processor 1 sends to processor 2 the atom positions from processor 0 (which processor 1 now has). This process can be repeated until each processor has all the atom positions within the cut-off range of its box.

2) The procedure is repeated in the north/south direction. In this case, however, the data packet sent to an adjacent processor contains not only the atom positions that the processor owns but also those atom positions in its second data structure that are needed by that processor (e.g. when the box length equals the cut-off limit, three boxes are sent). See figure 5.

3) The procedure is repeated in the up and down direction. When the box length equals the cut-off limit, an entire plane of boxes is sent. See figure 5.

Figure 5: Schematic representation of data passing for spatial decomposition (after Plimpton, 1995): a) east/west exchanges, b) north/south exchanges, c) up/down exchanges.
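The east/west step above, including the repeated exchange needed when the box length d is less than the cut-off, can be sketched as a serial simulation in one dimension. This is a minimal sketch under stated assumptions: the processors form a non-periodic line, box p covers the interval from `p*d` to `(p+1)*d`, and the function name `east_west_exchange` and the list-based "message buffers" are hypothetical stand-ins for real message passing.

```python
import math

def east_west_exchange(owned, d, rc):
    """Simulate the east/west exchange for a non-periodic line of processors.

    owned[p] holds the x-coordinates of the atoms in box p = [p*d, (p+1)*d).
    Returns ghosts[p], the positions processor p has received from other boxes.
    """
    nproc = len(owned)
    ghosts = [[] for _ in range(nproc)]
    # Data each processor will offer to its west / east neighbour in the next pulse.
    to_west = [list(a) for a in owned]
    to_east = [list(a) for a in owned]
    for _ in range(math.ceil(rc / d)):        # one pulse if d >= rc, more otherwise
        from_east = [[] for _ in range(nproc)]
        from_west = [[] for _ in range(nproc)]
        for p in range(nproc):
            if p > 0:
                # Send west: positions within rc of the boundary with box p-1.
                from_east[p - 1] = [x for x in to_west[p] if x - p * d < rc]
            if p < nproc - 1:
                # Send east: positions within rc of the boundary with box p+1.
                from_west[p + 1] = [x for x in to_east[p] if (p + 1) * d - x < rc]
        for p in range(nproc):
            ghosts[p] += from_east[p] + from_west[p]
        # Positions that have just arrived are the ones forwarded in the next pulse.
        to_west, to_east = from_east, from_west
    return ghosts
```

With d = 1.0 and a cut-off of 2.5, three pulses are made, and each processor only ever talks to its two immediate neighbours in this direction even though it ends up with positions from boxes up to three away.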

An important feature is that when the box length is less than the cut-off distance and atom information is needed from more distant boxes, this requires only a few extra data exchanges, all of which occur with the six immediate neighbour processors. This allows the algorithm to perform efficiently even when a large number of processors is used for a small problem. One example of the use of spatial decomposition techniques is the large-scale molecular dynamics code developed by Belak (1993) for the BBN-TC.

Long Range Forces

The long range forces typically encountered in molecular dynamics simulations include coulombic interactions in ionic solids or biological systems, and normally involve each atom interacting with all the other atoms. Direct computation of these forces scales as $N^2$ and becomes increasingly prohibitive for large values of N (the number of particles). One method used is the Ewald summation. This method (Ewald, 1921) involves calculating three different terms, the sum of which gives the coulombic energy of the system. These terms are:

1) A sum in reciprocal space. This term has a cubic dependence on the chosen range of the reciprocal space and a linear dependence on the number of ions in the replicating cell.

2) A sum in real space. This is quadratically dependent on the number of ions.

3) A constant which only requires calculation once in the simulation.

Smith et al. (1993b) discussed the parallelisation of the reciprocal and real space sums using a replicated data strategy. They described two different methods for determining the reciprocal sum: the first assigns a specific subset of the ions to each processor (the reduced ion list method) and the second allocates each processor a unique set of the reciprocal vectors (k) in the sum (the reduced k vector list method). The former method needs to communicate between the processors during the force calculation while the latter does not.
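The reduced k vector list idea can be illustrated with a short Python sketch of the reciprocal space energy. This is a hedged illustration rather than the scheme of Smith et al.: it assumes a cubic cell, Gaussian units, and the standard form $E_{rec} = (2\pi/V) \sum_{k \neq 0} \exp(-k^2/4\alpha^2)\, |S(k)|^2 / k^2$ with $S(k) = \sum_j q_j \exp(i k \cdot r_j)$. The function names and the round-robin assignment of k vectors to processors are assumptions made for this example.

```python
import cmath
import itertools
import math

def k_vectors(box, kmax):
    """All non-zero reciprocal vectors 2*pi*n/box with |n_x|, |n_y|, |n_z| <= kmax."""
    f = 2.0 * math.pi / box
    return [(f * nx, f * ny, f * nz)
            for nx, ny, nz in itertools.product(range(-kmax, kmax + 1), repeat=3)
            if (nx, ny, nz) != (0, 0, 0)]

def reciprocal_energy(kvecs, charges, positions, box, alpha):
    """Reciprocal space energy contributed by the given k vectors (Gaussian units)."""
    total = 0.0
    for k in kvecs:
        k2 = sum(c * c for c in k)
        # Structure factor S(k): every ion contributes, so ion data is replicated.
        s = sum(q * cmath.exp(1j * sum(kc * rc for kc, rc in zip(k, r)))
                for q, r in zip(charges, positions))
        total += math.exp(-k2 / (4.0 * alpha ** 2)) / k2 * abs(s) ** 2
    return 2.0 * math.pi / box ** 3 * total

def reduced_k_list(kvecs, rank, nproc):
    """Unique subset of the k vectors handled by processor `rank` (round-robin)."""
    return kvecs[rank::nproc]
```

Each processor would evaluate `reciprocal_energy` over its own `reduced_k_list`, and the partial results would be combined in a single global sum at the end. The number of k vectors grows as the cube of `kmax` and the work per k vector is linear in the number of ions, consistent with the scaling quoted for the reciprocal space term.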
Kalia et al. (1993) described atom and spatial decomposition methods for the treatment of the Ewald sum. Another method for calculating the long range forces, the cell multipole method, was described by Ding et al. (1992); this technique is well suited to parallel systems.

Summary

Atom decomposition, or replicated data, techniques are effective due to their relative ease of implementation. The need for global communications, however, results in poor scaling, with communication costs dominating when large numbers of processors are used. The method is also expensive in memory due to the replication of data on each processor. This technique is used extensively and successfully in a number of codes, as later sections show. Spatial decomposition is ideally suited to large molecular dynamics simulations and scales well. Although it is more complex to implement than other techniques, this method is likely to give the best performance increase.

4.3 Parallelisation of First Principle Calculations

The parallelisation of ab-initio calculations uses some of the techniques described for molecular dynamics. Replicated data strategies have been utilised extensively and systolic loop strategies have also been described. Distributed data techniques have also been exploited successfully to parallelise ab-initio codes. In this section a number of the techniques utilised to parallelise ab-initio codes are described.

Hartree-Fock Theory

The most computationally intensive part of the construction of the Fock matrix is the calculation of the two electron integrals. The density matrix, D, and the Fock matrix, F, are symmetric, and for any (i,j,k,l) the following integrals are equivalent:

$$(ij|kl) = (ji|kl) = (ij|lk) = (ji|lk) = (kl|ij) = (kl|ji) = (lk|ij) = (lk|ji)$$

Hence, once $(ij|kl)$ has been computed, the elements $F_{ij}$, $F_{ik}$, $F_{il}$, $F_{jk}$, $F_{jl}$ and $F_{kl}$ can be updated with the product of this integral and the appropriate element of the density matrix. Thus, rather than having to compute $N^4$ integrals, only about $N^4/8$ integrals need to be calculated. Screening is often used to reduce the number of integrals requiring calculation: integrals that are so small as to be negligible are eliminated. This can reduce the number of integrals from $O(N^4)$ to $O(N^2)$ in some cases.

The fact that each integral may be computed separately means that integral evaluation can be parallelised. The replicated data method follows the same principle as that described for molecular dynamics. The density and Fock matrices are replicated on each processor, and each processor computes a subset of the integrals to form a partial Fock matrix. The partial Fock matrices are then globally summed to form the total Fock matrix. This method benefits from the fact that the integrals and the required density and Fock matrix elements are all local to each other.
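The replicated data scheme just described can be sketched in Python as follows. This is a minimal sketch under several assumptions that are not taken from the codes discussed here: the two electron contribution is taken in the standard closed-shell form $G_{ij} = \sum_{kl} D_{kl} [(ij|kl) - \frac{1}{2}(ik|jl)]$, `toy_integral` is a hypothetical stand-in that merely has the eightfold symmetry of real integrals, and the unique integrals are dealt to processors round-robin.

```python
def toy_integral(i, j, k, l):
    """Hypothetical stand-in for (ij|kl); it only mimics the eightfold symmetry."""
    def g(a, b):
        return 1.0 / (1 + a + b) + 0.1 * a * b
    return g(i, j) * g(k, l) + 0.01 * (g(i, j) + g(k, l)) ** 2

def unique_quartets(n):
    """Yield one representative (i >= j, k >= l, ij >= kl) per set of equivalent integrals."""
    pairs = [(i, j) for i in range(n) for j in range(i + 1)]
    for a, (i, j) in enumerate(pairs):
        for k, l in pairs[: a + 1]:
            yield i, j, k, l

def equivalents(i, j, k, l):
    """The distinct index orderings that share the value of (ij|kl)."""
    return {(i, j, k, l), (j, i, k, l), (i, j, l, k), (j, i, l, k),
            (k, l, i, j), (l, k, i, j), (k, l, j, i), (l, k, j, i)}

def partial_fock(n, density, rank, nproc):
    """Partial two electron matrix built by processor `rank` from its share of integrals."""
    g = [[0.0] * n for _ in range(n)]
    for count, (i, j, k, l) in enumerate(unique_quartets(n)):
        if count % nproc != rank:
            continue                                   # another processor's integral
        v = toy_integral(i, j, k, l)                   # computed once, reused below
        for a, b, c, d in equivalents(i, j, k, l):
            g[a][b] += density[c][d] * v               # Coulomb-type update
            g[a][c] -= 0.5 * density[b][d] * v         # exchange-type update
    return g

def global_sum(partials):
    """Element-wise sum of the partial matrices (stands in for a global reduction)."""
    n = len(partials[0])
    return [[sum(p[a][b] for p in partials) for b in range(n)] for a in range(n)]
```

Each integral touches only the elements $F_{ij}$, $F_{kl}$, $F_{ik}$, $F_{il}$, $F_{jk}$ and $F_{jl}$ (and their transposes), and because the density and the partial matrix are both local, no communication is needed until `global_sum` is called.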
There is no communication between processors except for the global summation of the Fock matrix and the broadcast of the density matrix. The implementation is also relatively simple. The main drawback of this technique is that the size of the problem is limited by the memory of each individual processor rather than by the complete (aggregate) memory of the machine. There are several examples of the use of the replicated data scheme for SCF methods; one of the best known is that of Cooper et al. (1991), who parallelised GAMESS-UK (Guest et al., 1992) on a transputer based system.

An alternative to replicated data schemes is the use of distributed data algorithms. Burkhart et al. (1993) distributed the integral evaluation between processors and accumulated the Fock matrix on one fast processor with a large amount of memory. One problem with this algorithm was the serial accumulation of the Fock matrix on the one processor, which limited the speed-up to the ratio:

$$\frac{\text{time taken to compute the integrals}}{\text{time required to send them to the master processor and add them to the Fock matrix}}$$
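A rough feel for this limit can be obtained from a very simple model, sketched here in Python. The model is an assumption made for illustration and is not taken from Burkhart et al.: every batch of integrals takes `t_compute` to evaluate on a server and `t_master` to send to and accumulate on the master, the servers are otherwise perfectly load balanced, and the speed-up is measured against a single processor that only computes integrals.

```python
def farm_speedup(n_servers, t_compute, t_master):
    """Estimated speed-up of a master/server farm with serial accumulation.

    The servers can at best deliver a speed-up of n_servers, while the master
    can absorb results no faster than one batch per t_master, which caps the
    speed-up at t_compute / t_master.
    """
    return min(float(n_servers), t_compute / t_master)

# With t_compute = 10.0 and t_master = 0.5 the cap is 20:
# 8 servers are still below the cap, 64 servers are limited by the master.
few = farm_speedup(8, 10.0, 0.5)      # 8.0
many = farm_speedup(64, 10.0, 0.5)    # 20.0
```

In this model adding servers beyond the ratio `t_compute / t_master` brings no further gain, which is the bottleneck described above.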

The model Burkhart et al. considered was a farm model, in which one specific processor (the master processor) generates the jobs and distributes them to the other (server) processors. In this situation the number of data sets must significantly exceed the number of processors. To optimise the efficiency, the authors considered two different communication options and two different methods of updating the Fock matrix.

Communication options:

1) The master processor distributes the jobs to all the server processors and then receives the results from all the server processors. The master processor then schedules new jobs on the now idle server processors (called global communication management).

2) Each server processor decides whether to process a given job or to send it to another server (local communication management).

Fock matrix generation:

1) All the calculated integrals are returned to the master processor, where the Fock matrix is calculated using the integrals and the density matrix (sequential Fock matrix update).

2) Each server receives the density matrix and builds its own partial Fock matrix (distributed Fock matrix update).

The problem with the sequential Fock matrix update was that the Fock matrix determination created a bottleneck for the communications required. The authors achieved better results with this technique when local rather than global communication management was utilised; however, they concluded that for more than sixteen processors the distributed Fock matrix update was most effective.

Another advance came with Colvin et al. (1993), who utilised a systolic loop system similar to those mentioned previously for molecular dynamics. Colvin's method involves setting up a systolic loop and passing data packets around the ring. Each processor hosts a sub-block of the Fock matrix and of the density matrix. A second copy of both the Fock and density matrices is formed and these are passed around the ring.
After each send, the processor forms all the interactions that connect the current density and Fock matrix elements, and then passes the data to the left and receives new data from the right, where the process is repeated. If there are P processors, P sends are required in order to send all the density and Fock matrix blocks around the ring. After this the full two electron Fock matrix is formed, the one electron terms are added and the Fock matrix is ready for transformation and diagonalisation. The number of integrals that need to be calculated is $3N^4/8$, and the problems with load balancing have forced the use of asynchronous communications and double buffering.

As mentioned before, the number of integrals requiring computation can be reduced to $N^4/8$ by considering equivalency. In general the calculation of each element of the Fock matrix requires access to all the elements of the density matrix and to the array which holds the elements i,j,k,l (the Z matrix). Each integral requires six elements of the density matrix and contributes to six of the Fock matrix elements, i.e. there are $N^4/8$ integral computations. Each data element is accessed by a number of integral computations, hence the computations need to be able to access the data in an asynchronous and distributed fashion. Each integral computation must perform sixteen communications: six to obtain the density matrix elements, four to obtain the Z matrix elements and six to store the Fock matrix elements. Rather than using replicated data techniques, partial replication techniques can be used; these have been described in detail by Foster et al. (1996).
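The ring pass used in the systolic loop scheme of Colvin et al., described earlier in this section, can be illustrated with a small serial simulation in Python. This is a sketch of the communication pattern only: block identifiers stand in for the travelling copies of the density and Fock sub-blocks, the function name `systolic_ring` is hypothetical, and each processor is assumed to interact its resident block with the visiting block before every send, so that the final send simply returns each travelling copy to its home processor.

```python
def systolic_ring(nproc):
    """Simulate passing one travelling block per processor around a ring.

    Returns (met, travelling, sends): met[p] is the set of travelling blocks
    that processor p has combined with its resident block, travelling[p] is
    the block held by processor p at the end, and sends is the pulse count.
    """
    travelling = list(range(nproc))          # travelling[p]: block visiting processor p
    met = [set() for _ in range(nproc)]
    sends = 0
    for _ in range(nproc):
        for p in range(nproc):
            met[p].add(travelling[p])        # form interactions: resident p with visitor
        # Every processor passes its visitor to the left and receives from the right.
        travelling = travelling[1:] + travelling[:1]
        sends += 1
    return met, travelling, sends

met, travelling, sends = systolic_ring(5)
# After 5 sends every processor has met all 5 blocks and each block is home again.
```

Because all processors send and receive in the same pulse, the pattern involves only nearest-neighbour communication, but it also means that a slow processor delays the whole ring, which is consistent with the load balancing problems noted above.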
