Biomolecules are dynamic: no single structure is a perfect model

Molecular Dynamics Simulations of Biomolecules

References:
A. R. Leach, Molecular Modelling: Principles and Applications, Prentice Hall, 2001.
M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids, Clarendon Press, Oxford, 1996.

What is a classical force field?

The molecules in an MD simulation need to behave similarly to how they would in the real world. Ideally, quantum mechanics would be used to describe the electronic structure of the model. In quantum mechanics, solutions of the Schrödinger equation (wave functions) provide the information used to determine the physical properties of a model to a high degree of accuracy. Unfortunately, these solutions are time consuming to compute, and for larger molecules such as biomolecules they cannot be obtained within a reasonable amount of time. Incorporating dynamics only adds to the difficulty (dynamics is explained in a later section).

The solution to this problem is to use a simpler approximation. Molecular mechanics uses classical (Newtonian) physics to describe the motion of the molecules by approximating the potential energy. The force field equation is the sum of the potentials from bond stretching, angle bending, torsional rotation, van der Waals interactions (repulsion and attraction), and electrostatics.

A classical mechanical force field is a mathematical model based on Newtonian physics that defines the potential energy of a molecule as a function of its 3-D structure:

$$V = \underbrace{V_{\text{bond stretching and bending}} + V_{\text{bond torsion rotation}}}_{\text{valence terms}} + \underbrace{\sum_{i=1}^{\text{Atoms}} \sum_{j=i+1}^{\text{Atoms}} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right) + \sum_{i=1}^{\text{Atoms}} \sum_{j=i+1}^{\text{Atoms}} \frac{q_i q_j}{\varepsilon r_{ij}}}_{\text{non-bonded terms}}$$

The first double sum is the van der Waals term (repulsion and attraction) and the second is the electrostatic term. The valence terms maintain molecular connectivity and define internal flexibility; the non-bonded and torsion terms control 3-D shape and molecular interactions.

Bond Stretching and Bending

The equilibrium bond length, $r_{i,0}$, and force constant, $k_i$, in the bond potential confine the bond stretching motion to a parabolic function. Similarly, the equilibrium angle, $\theta_{i,0}$, and force constant, $k_i$, in the angle potential confine the opening and closing of an angle to a parabolic expression.

The graph of $E_{bond}$ indicates several features:
1) The higher the force constant ($k_r$), the narrower the range of allowed motion.
2) Longer equilibrium bonds tend to have lower frequency motions (CT-I versus CT-F).

A harmonic approximation does not account for anharmonicity and does not allow atoms to dissociate from one another; it is therefore a good approximation only for a narrow range of distortion. Harmonic energy terms are in effect penalty functions that penalize any distortion away from the equilibrium position.

Equilibrium bond lengths and force constants from the AMBER (PARM99) force field

Bond                  Length (r_0, Å)   Force constant (k_r, kcal/mol·Å²)
C=O                   1.229             570
C-F                   1.38              367
CT-CT (Csp3-Csp3)     1.526             310
CT-I                  2.166             148
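The effect of the force constant on the allowed range of motion can be checked numerically. The sketch below assumes the AMBER convention $E = k_r (r - r_0)^2$ (no factor of 1/2), which is not stated explicitly in these notes; the function and dictionary names are illustrative only.

```python
# Parameters copied from the table above: (r0 in Å, k_r in kcal/mol·Å²).
BOND_PARAMS = {
    "C=O": (1.229, 570.0),
    "C-F": (1.38, 367.0),
    "CT-CT": (1.526, 310.0),
    "CT-I": (2.166, 148.0),
}

def bond_energy(bond, r):
    """Harmonic penalty in kcal/mol, assuming the form E = k_r * (r - r0)**2."""
    r0, k_r = BOND_PARAMS[bond]
    return k_r * (r - r0) ** 2

# Energy cost of stretching each bond by 0.1 Å away from equilibrium.
for bond, (r0, _) in BOND_PARAMS.items():
    print(f"{bond:6s} {bond_energy(bond, r0 + 0.1):6.2f} kcal/mol")
```

For the same 0.1 Å distortion the stiff C=O bond is penalized almost four times as much as the soft CT-I bond, which is why its range of thermally accessible lengths is narrower.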

Bond Torsion Rotation

The torsional rotation equation is composed of a parameterized term describing the relative barrier heights, $V_n$, together with the multiplicity, $n$, of the function:

$$E_{torsion} = \sum_{torsions} \frac{V_n}{2} \left[ 1 + \cos(n\omega - \gamma) \right]$$

The $\gamma$ term is a phase shift of the cosine function that adjusts the location of the minima, while $\omega$ is the rotational angle. The magnitude of $V_n$ determines the size of the contribution from the cosine term, while the sign of $V_n$ determines whether a given angle corresponds to a maximum or a minimum. The periodicity, $n$, sets the number of maxima and minima in one full rotation and thus which angles are raised or lowered relative to one another. For example, the accompanying graphs show the changes for n = 1, 2 and 3, as well as the impact of increasing or decreasing $V_n$ and of changing the sign of $V_n$, with the phase shift, $\gamma$, set to zero.
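The roles of $V_n$, $n$ and $\gamma$ can be explored with a few lines of code. This is a minimal sketch that assumes the common form $E = (V_n/2)[1 + \cos(n\omega - \gamma)]$ for a single torsion term; the function name and the barrier value of 2.0 kcal/mol are hypothetical.

```python
import math

def torsion_energy(omega_deg, v_n, n, gamma_deg=0.0):
    """Single torsion term, E = (V_n / 2) * [1 + cos(n*omega - gamma)], angles in degrees."""
    omega = math.radians(omega_deg)
    gamma = math.radians(gamma_deg)
    return 0.5 * v_n * (1.0 + math.cos(n * omega - gamma))

# Scan a threefold term (n = 3) with a positive and a negative V_n.
for angle in range(0, 181, 30):
    positive = torsion_energy(angle, v_n=2.0, n=3)
    negative = torsion_energy(angle, v_n=-2.0, n=3)
    print(f"{angle:4d}  {positive:6.2f}  {negative:6.2f}")
```

With $\gamma = 0$ and a positive $V_n$ the threefold term has maxima at 0° and 120° and minima at 60° and 180°; changing the sign of $V_n$ swaps the positions of the maxima and minima.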

Torsion terms are corrections for differences in the rotation barrier, as can be seen by comparing the QM (quantum mechanics) and MM (molecular mechanics) rotational energy curves. Without torsion terms, the other forces, including van der Waals and electrostatics, are usually able to reproduce a pattern similar to the QM curve; however, the MM maxima at 0° and 120° are slightly below those of the QM curve. Using torsion parameters, the barrier is closer to the QM predicted height and the absolute average error for this torsion reduces from kcal/mol to kcal/mol.

van der Waals (repulsion and attraction)

The last terms in the force field equation describe the electrostatic and van der Waals contributions to the potential. The first part of this potential represents the van der Waals interactions, which are approximated by the Lennard-Jones potential:

$$E_{vdW} = \sum_{i=1}^{\text{Atoms}} \sum_{j=i+1}^{\text{Atoms}} 4\varepsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right]$$

The $\varepsilon_{ij}$ term is the well depth and the $\sigma_{ij}$ term is the separation, measured from zero separation (i.e. overlapping atoms), at which the energy is zero (see Fig. 4.34, p. 207 in Leach). The $r_{ij}$ term is the distance between the two atoms.

Electrostatics

Coulomb's law is used to express the electrostatic interaction in MD simulations. This equation depends on static point charges on each atom. These charges do not necessarily possess a real physical meaning, as they are often partial atomic charges used to describe electronic features caused by differences in electron density. The electrostatic equation consists of the charge on atom i, $q_i$, and the charge on atom j, $q_j$, separated by the distance $r_{ij}$:

$$E_{elec} = \sum_{i=1}^{\text{Atoms}} \sum_{j=i+1}^{\text{Atoms}} \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}}$$

where $\varepsilon_0$ is the permittivity of free space. Generally, partial atomic charges are derived by fitting to the molecular electrostatic potential.
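The two non-bonded terms can be evaluated for a single pair of atoms as follows. This is only a sketch: the conversion constant of roughly 332 kcal·Å/(mol·e²) is an approximate value, the `dielectric` argument is an added convenience, and the parameter values in the usage lines are hypothetical rather than taken from any force field.

```python
COULOMB_CONSTANT = 332.06   # approximate 1/(4*pi*eps0) in kcal·Å/(mol·e²)

def lennard_jones(r, epsilon, sigma):
    """4*eps*[(sigma/r)**12 - (sigma/r)**6]; energy is zero at r = sigma."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def coulomb(r, q_i, q_j, dielectric=1.0):
    """Point-charge interaction in kcal/mol for charges in units of e and r in Å."""
    return COULOMB_CONSTANT * q_i * q_j / (dielectric * r)

sigma, epsilon = 3.4, 0.1            # hypothetical values (Å, kcal/mol)
r_min = 2.0 ** (1.0 / 6.0) * sigma   # position of the energy minimum
print(lennard_jones(sigma, epsilon, sigma))   # zero crossing
print(lennard_jones(r_min, epsilon, sigma))   # well depth, -epsilon
print(coulomb(3.0, 0.4, -0.4))                # attractive: opposite charges
```

The Lennard-Jones form used here is equivalent to the $A_{ij}/r^{12} - B_{ij}/r^{6}$ form with $A_{ij} = 4\varepsilon_{ij}\sigma_{ij}^{12}$ and $B_{ij} = 4\varepsilon_{ij}\sigma_{ij}^{6}$.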

Parameter generation

The accuracy of the computed energy depends on the parameters (k, V, A, B, q). Many of the parameters in the latest force fields are computed from quantum mechanical calculations on small, related molecular systems. Sometimes legacy parameters from earlier experimental data are still in use, most commonly for the van der Waals terms.

Limits associated with energy minimization

Minimization of the potential energy function (force field) leads to a single structure which, since it is devoid of motion, can only exist at absolute zero. This structure may represent a reasonable "average" conformation if the molecule is essentially rigid. But when the molecule exists as two or more distinct conformations, no single structure is very representative. In some examples the average can be totally non-physical or virtual.

"The properties of an average conformation are not always the same as the average of the properties for the individual conformations."

This is particularly relevant when the properties are not linearly dependent on the conformation. For example, experimentally determined nuclear Overhauser intensities (NOEs) depend on the inverse sixth power of the distance between the interacting protons. They can be used to deduce the conformation of a molecule, but care must be taken in their calculation.

Approaches to generating a group (or ensemble) of "realistic" conformations, that is, conformations that represent structures present under experimental conditions

Conformational (adiabatic) mapping can be achieved through grid searching with energy evaluation (minimization).
- practically, this approach is limited to searching only a few degrees of freedom
- this approach ignores direct influences of solvent
- the resultant energies reflect absolute zero (zero motion) conformational states
- the final energy is at best an enthalpy (ΔH), not a free energy (ΔG)

Two alternative approaches to adiabatic mapping are molecular dynamics (MD) simulation and Monte Carlo (MC) sampling.
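The earlier warning about NOE intensities, that the property of an average conformation differs from the average of the property, can be illustrated with a small calculation. The sketch below assumes two equally populated conformers with hypothetical proton-proton distances of 2.5 Å and 5.0 Å, and uses the inverse-sixth-power dependence stated above.

```python
def noe_effective_distance(distances):
    """Distance implied by averaging r**-6 over the conformations."""
    mean_inv6 = sum(r ** -6 for r in distances) / len(distances)
    return mean_inv6 ** (-1.0 / 6.0)

distances = [2.5, 5.0]   # hypothetical distances (Å) in two equally populated conformers
mean_distance = sum(distances) / len(distances)
effective = noe_effective_distance(distances)
print(f"arithmetic mean distance: {mean_distance:.2f} Å")
print(f"NOE-derived distance:     {effective:.2f} Å")
```

The NOE-derived distance is dominated by the short-distance conformer and is much shorter than the distance in a simple "average" structure, so fitting one structure to such data can give a conformation that the molecule never actually adopts.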

What to do with molecular motion?

In flexible molecules, such as RNA, peptides or carbohydrates, it is impossible to describe their "structure" accurately as a single conformation.

"What is the shape of a snake?"

The "structure" can be thought of as an ensemble of individual conformations (also called snapshots), which give rise to characteristic average properties. The average behavior can be described (or computed) either by looking at the motion as a function of time (time-average) or from a collection of energetically accessible structures (ensemble-average).

In the limit, the time-average and the ensemble-average have to be identical. This is known as the ergodic hypothesis, a fundamental principle that relates time-dependent Molecular Dynamics (MD) and time-independent Monte Carlo (MC) simulations.

The conclusion: "A snake is mostly elongated, but wiggles from side to side"

This conclusion would be reached from either average, but the time-dependent approach (MD) lets you predict where the snake will go next and understand how it got to where it is now. This ability means that MD is a deterministic method.

Molecular Dynamics Simulation

Molecules are dynamic (despite the frequently static images presented by some experimental and theoretical techniques). All molecules are in motion except at absolute zero. The motion can be associated with overall translation, rotation or vibration, which are relevant to the temperature and pressure of a system. More importantly, however, some of the motions are internal. These internal motions can drastically alter the conformation of the molecule, particularly through torsion rotation.

In reality, one conformational state changes to another in a smooth, time-dependent manner. In MD simulations the motion is divided into very small time steps (Δt = 1-2 fs). Given the position (x) of a particle at time t, its new position after time Δt may be obtained by the familiar Taylor expansion:

$$x(t + \Delta t) = x(t) + v(t)\,\Delta t + \frac{1}{2} a(t)\,\Delta t^2 + \dots$$

where v = atomic velocity and a = acceleration. This is an infinite series and must at some point be truncated. Several methods have been developed for the treatment of this truncation. The method of Verlet is one of the most common.

The Verlet algorithm uses the positions and accelerations at time t and the positions from the previous step (t - Δt) to compute the new positions at t + Δt. The equations for the position at (t + Δt) and (t - Δt) are:

$$x(t + \Delta t) = x(t) + v(t)\,\Delta t + \frac{1}{2} a(t)\,\Delta t^2 + \dots$$

$$x(t - \Delta t) = x(t) - v(t)\,\Delta t + \frac{1}{2} a(t)\,\Delta t^2 - \dots$$

These may be added together and rewritten as:

$$x(t + \Delta t) = 2x(t) - x(t - \Delta t) + a(t)\,\Delta t^2$$

The accelerations are computed from the force field by solving Newton's second law:

$$a = \frac{F}{m} = -\frac{1}{m} \frac{dV}{dx}$$

where V is the potential energy function (i.e. the force field).

In the Verlet algorithm the position is not defined in terms of velocity. It is fast and efficient, but it can suffer from numerical inaccuracies. Other methods have been developed which offer some advantages, but the principles are similar. In choosing a suitable integration scheme, a key concern should be how big a time step can be used, since the larger the step, the more conformational or configurational space can be sampled. A second issue is that the simulation should remain in equilibrium, that is, the total energy should not change once equilibrium has been reached.

[Figure: the variables x, v and a at times t - Δt, t and t + Δt; the variables stored at each step are shown in red.]

Note that the velocities are not needed to compute the trajectories, but they are useful for estimating the kinetic energy, pressure and temperature. They are obtained simply from:

$$v(t) = \frac{x(t + \Delta t) - x(t - \Delta t)}{2\,\Delta t}$$

The general scheme of a stepwise MD simulation may be summarized as:

1) Predict the positions at time t + Δt, using the current acceleration and the current and previous positions:

$$x(t + \Delta t) = 2x(t) - x(t - \Delta t) + a(t)\,\Delta t^2$$

2) Evaluate the forces (from the force field), and hence the accelerations, at the new positions:

$$a(t + \Delta t) = -\frac{1}{m} \left. \frac{dV}{dx} \right|_{x(t + \Delta t)}$$

3) Calculate any variables of interest, such as the velocities or energy, then return to step 1:

$$v(t) = \frac{x(t + \Delta t) - x(t - \Delta t)}{2\,\Delta t}$$
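The stepwise scheme above can be sketched for the simplest possible "force field", a one-dimensional harmonic spring. Everything here is illustrative: the harmonic potential, reduced units (k = m = 1), and the function names are assumptions made for the example, not part of any MD package.

```python
import math

def verlet_trajectory(x0, v0, k, m, dt, n_steps):
    """Positions of a 1-D harmonic oscillator, V(x) = 0.5*k*x**2, by Verlet integration."""
    def acceleration(x):
        return -k * x / m                     # a = F/m = -(1/m) dV/dx

    # The scheme needs a previous position; estimate it with a backward Taylor step.
    x_prev = x0 - v0 * dt + 0.5 * acceleration(x0) * dt ** 2
    x = x0
    positions = [x0]
    for _ in range(n_steps):
        # Steps 1 and 2: the acceleration at the current position gives the next position.
        x_next = 2.0 * x - x_prev + acceleration(x) * dt ** 2
        x_prev, x = x, x_next
        positions.append(x)
    return positions

def central_velocity(positions, step, dt):
    """Step 3: v(t) = [x(t + dt) - x(t - dt)] / (2 dt), valid for an interior step."""
    return (positions[step + 1] - positions[step - 1]) / (2.0 * dt)

trajectory = verlet_trajectory(x0=1.0, v0=0.0, k=1.0, m=1.0, dt=0.01, n_steps=628)
print(trajectory[-1], math.cos(6.28))         # numerical vs analytical position
```

For this system the exact solution is x(t) = cos(t), so the printed pair gives a direct check on the integration error. Note that the velocity at time t can only be computed once the position at t + Δt is known.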

Considerations in Multimolecular Simulations

One of the powers of molecular mechanics is the ability to approximate the interactions between a large number of atoms. This can be extended to interactions between multiple molecules. In biomolecular systems, the molecules never exist in an isolated gas phase. The ability to include solvent molecules and counter ions explicitly in a simulation has the potential to lead to a realistic model for the molecule and its environment.

There are several issues that need to be considered before and during a simulation on such a complex system. These include:

- Thermodynamic Ensemble (NPT, NVT, etc.)
- Initial Configuration
- Temperature Control
- Pressure Control
- Non-Bonded Interaction Cutoffs
- Updating Neighbor Lists
- Solvent Model
- Measuring Equilibration

Thermodynamic Ensembles

MD simulations are often performed with either the volume or the pressure held constant. These conditions are referred to as thermodynamic ensembles (not to be confused with conformational ensembles!). There are several possible thermodynamic ensembles.

Traditionally MD was performed with a constant number of particles (N), in a constant volume (V), with a constant energy (E). This is referred to as the microcanonical or constant NVE ensemble. For comparison with many experimental conditions, an ensemble in which the temperature (T) and pressure (P) are constant may be used. This is the isothermal-isobaric or NPT ensemble. MC is traditionally performed in the canonical or NVT ensemble, which, not surprisingly, keeps N, V and T constant.

Following from definitions in physical chemistry, the NVT ensemble gives rise to the Helmholtz free energy, whereas the NPT ensemble corresponds to the Gibbs free energy. Note that in either case the energy is not simply an enthalpy.

Pressure and Temperature

Translational motion (or atomic velocity) is directly related to pressure and temperature. For an ideal gas, the pressure is a function of the mean-square atomic velocity ($\langle v^2 \rangle$), the atomic mass ($m$) and the number of particles in a unit volume ($N$):

$$P = \frac{1}{3} N m \langle v^2 \rangle$$

If we replace $N$ by the product of Avogadro's number ($L$) and the number of moles ($n$) and consider the entire volume ($V$), we can rewrite this as:

$$PV = \frac{1}{3} n L m \langle v^2 \rangle$$

But we know also that:

$$PV = nRT = n L k_B T$$

where $T$ is the absolute temperature and $k_B$ is Boltzmann's constant. Therefore, given the atomic velocities we can calculate the system pressure. Similarly, the temperature can be expressed in terms of the atomic velocities as:

$$T = \frac{m \langle v^2 \rangle}{3 k_B}$$

Initial Velocities

To begin an MD simulation, initial velocities must be assigned to each of the atoms. Since the initial temperature of the simulation is very low, the initial velocities are very small. The velocities are usually assigned randomly from a Maxwell-Boltzmann probability distribution at the initial temperature (typically 5 K). That is:

1) Select the temperature (T).
2) For each atom (i), choose a random number ρ between 0 and 1 (from a generator that produces a value distributed according to a Boltzmann probability) and calculate the velocity component ($v_x$, $v_y$, $v_z$) for that atom. Remember that velocity is a vector property.
3) Repeat for each component and each atom.
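The assignment procedure above can be sketched as follows. As a simplification, the sketch draws each velocity component directly from the standard library's Gaussian generator with standard deviation $\sqrt{k_B T / m}$ instead of transforming a uniform random number ρ. SI units, the function names and the 5000-carbon test system are assumptions made for illustration.

```python
import math
import random

K_B = 1.380649e-23        # Boltzmann constant, J/K
AMU = 1.66053907e-27      # atomic mass unit, kg

def assign_velocities(masses_amu, temperature, rng):
    """Draw (vx, vy, vz) in m/s for each atom from a Maxwell-Boltzmann distribution."""
    velocities = []
    for mass in masses_amu:
        sigma = math.sqrt(K_B * temperature / (mass * AMU))   # std. dev. per component
        velocities.append(tuple(rng.gauss(0.0, sigma) for _ in range(3)))
    return velocities

def instantaneous_temperature(masses_amu, velocities):
    """T = sum(m_i * |v_i|^2) / (3 * N * k_B), ignoring any constrained degrees of freedom."""
    twice_kinetic = sum(m * AMU * sum(c * c for c in v)
                        for m, v in zip(masses_amu, velocities))
    return twice_kinetic / (3.0 * len(masses_amu) * K_B)

rng = random.Random(0)
masses = [12.011] * 5000          # hypothetical system: 5000 carbon atoms
velocities = assign_velocities(masses, 5.0, rng)
print(instantaneous_temperature(masses, velocities))   # close to, but not exactly, 5 K
```

Because the velocities are random, the temperature computed from them only approaches the requested value for large numbers of atoms; production codes typically rescale the velocities afterwards and remove any net momentum.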

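Temperature Control (NPT or NVT)

The system temperature is related to the average kinetic energy $\langle K \rangle$ by:

$$\langle K \rangle = \frac{3}{2} N k_B T$$

which can be rewritten as:

$$T = \frac{2 \langle K \rangle}{3 N k_B}$$

The average kinetic energy is also related to the atomic velocities $v$ or momenta $p$:

$$\langle K \rangle = \left\langle \sum_{i=1}^{N} \frac{1}{2} m_i v_i^2 \right\rangle = \left\langle \sum_{i=1}^{N} \frac{p_i^2}{2 m_i} \right\rangle$$

Thus temperature can be expressed as:

$$T = \frac{1}{3 N k_B} \left\langle \sum_{i=1}^{N} m_i v_i^2 \right\rangle$$

Therefore the temperature can be controlled by scaling the atomic velocities. If the temperature at time t is $T(t)$ and the velocities are multiplied by a factor $\lambda$, then the associated temperature change can be calculated as:

$$\Delta T = (\lambda^2 - 1)\, T(t)$$

where

$$\Delta T = T_{new} - T(t)$$

The simplest way to control temperature is to multiply the velocities at each time step by:

$$\lambda = \sqrt{\frac{T_{new}}{T(t)}}$$

because

$$T_{new} = \lambda^2\, T(t)$$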

Coupling to an External Temperature Bath (See Leach 6.7.1)

An alternative approach to controlling the temperature is to imagine placing the system in an external bath at a constant temperature. By coupling the system to the external bath, a constant temperature can be maintained. By varying the strength of the coupling, the method can be tuned for different systems. This is particularly important for the case of a large solute in a box of water. Simplistically, since the solvent molecules have no mechanism for internal motion, increasing their atomic velocities results directly in increasing their temperature. In contrast, a large solute can absorb the atomic velocity increases through internal motions, and so heats up more slowly. The result is that simple velocity scaling can lead to the phenomenon of hot solvent / cold solute.

The velocities are scaled in this case such that the rate of change of the temperature is proportional to the difference in temperature between the bath and the system:

$$\frac{dT(t)}{dt} = \frac{1}{\tau_T} \left( T_{bath} - T(t) \right)$$

$\tau_T$ is a coupling parameter whose magnitude determines how tightly the bath is coupled to the system. If $\tau_T$ is large the coupling will be weak; if $\tau_T$ is small the coupling will be strong. The change in temperature between successive time steps (Δt) will be:

$$\Delta T = \frac{\Delta t}{\tau_T} \left( T_{bath} - T(t) \right)$$

And the scale factor for the velocities is:

$$\lambda^2 = 1 + \frac{\Delta t}{\tau_T} \left( \frac{T_{bath}}{T(t)} - 1 \right)$$

Note that when the coupling parameter ($\tau_T$) equals the integration time step (Δt), this approach is equivalent to simple velocity scaling:

$$\lambda^2 = \frac{T_{bath}}{T(t)}$$
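The difference between simple velocity scaling and weak coupling to a bath can be seen by computing the scale factor in each case. The sketch assumes the weak-coupling scale factor $\lambda^2 = 1 + (\Delta t / \tau_T)(T_{bath}/T(t) - 1)$; the function names and the numerical settings are hypothetical, and the loop is idealized in that it ignores the temperature changes that the dynamics themselves would cause.

```python
import math

def simple_scaling_factor(t_current, t_target):
    """Velocity scale factor that brings the temperature to t_target in a single step."""
    return math.sqrt(t_target / t_current)

def bath_scaling_factor(t_current, t_bath, dt, tau):
    """Weak-coupling factor: lambda = sqrt(1 + (dt/tau) * (T_bath/T(t) - 1))."""
    return math.sqrt(1.0 + (dt / tau) * (t_bath / t_current - 1.0))

# Relax a system at 280 K towards a 300 K bath (dt and tau in ps).
temperature = 280.0
for _ in range(500):
    lam = bath_scaling_factor(temperature, 300.0, dt=0.002, tau=0.1)
    temperature *= lam ** 2      # kinetic energy, and hence T, scales with lambda**2
print(temperature)
```

With $\tau_T$ fifty times the time step, each step removes only 2% of the remaining temperature difference, so the system drifts towards the bath temperature gradually instead of being forced there at every step.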

Pressure Control (NPT)

Coordinate scaling may be used to maintain a constant pressure. The coordinates are rescaled in response to increasing or decreasing the volume. The volume is scaled by a factor $\lambda$ through coupling to an external "pressure" bath, analogous to the coupling applied to control temperature:

$$\lambda = 1 - \kappa \frac{\Delta t}{\tau_P} \left( P_{bath} - P(t) \right)$$

Here $\kappa$ is the isothermal compressibility and $\tau_P$ is the coupling constant. If the volume is scaled by $\lambda$, then the individual coordinates are scaled by $\lambda^{1/3}$. Therefore the new atomic positions are given by $r' = \lambda^{1/3} r$.

This expression can be applied isotropically (equally in all three directions) or anisotropically (independently for each direction). Anisotropic position scaling allows the box dimensions to change independently and is preferable.
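Isotropic coordinate scaling can be sketched in a few lines. The sketch assumes a volume scale factor of the form $\lambda = 1 - \kappa (\Delta t / \tau_P)(P_{bath} - P(t))$, so that the box expands when the instantaneous pressure is above the bath pressure; the compressibility is an approximate value for liquid water, and the remaining numbers and names are hypothetical.

```python
def pressure_scaling_factor(p_current, p_bath, dt, tau_p, kappa):
    """Volume scale factor: lambda = 1 - kappa * (dt / tau_p) * (P_bath - P)."""
    return 1.0 - kappa * (dt / tau_p) * (p_bath - p_current)

def scale_coordinates(coords, lam):
    """Isotropic scaling: every coordinate is multiplied by lambda**(1/3)."""
    factor = lam ** (1.0 / 3.0)
    return [(x * factor, y * factor, z * factor) for x, y, z in coords]

KAPPA_WATER = 4.5e-5      # approximate isothermal compressibility of water, 1/bar
lam = pressure_scaling_factor(p_current=50.0, p_bath=1.0,
                              dt=0.002, tau_p=1.0, kappa=KAPPA_WATER)
new_coords = scale_coordinates([(1.0, 2.0, 3.0)], lam)
print(lam, new_coords)
```

Because the compressibility of a liquid is small, the scale factor differs from 1 by only a few parts per billion per step in this example, which is why pressure equilibrates over many steps.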

Initial Configuration

Before beginning a simulation it is necessary to have an initial 3-D structure for the molecule. In the case of a protein, the initial conformation may be one for which a crystallographically determined structure has been reported. In this case it is straightforward to download the Cartesian coordinates from the RCSB Protein Data Bank. The coordinates are stored in a "standard" format known as PDB format. At the PDB web site it is possible to search for and retrieve structures from NMR, although by far the majority of structures are from X-ray diffraction.

Alternatively, the initial structure may have been obtained through an earlier experimental (NMR) or modeling (homology modeling) study. For more flexible molecules, such as carbohydrates, the initial structure may be more hypothetical, since it will be expected to change and "converge" to a realistic ensemble of structures during the simulation.

Once the structure is obtained there are still a few details to address, in particular:

1) Are all of the residues recognized by the force field of choice? That is, are all of the atom types in the PDB file the same as those that the force field expects? If not, they may have to be manually corrected. Does the structure contain structurally important metal ions? Are these parameterized in the force field? Does the structure contain any counter ions (SO₄²⁻, Ca²⁺, Na⁺, etc.), and are they treated properly by the force field?

2) Does the structure contain hydrogen atoms? Most X-ray determined protein structures do not, and they must be added. This is usually an automated procedure based on simple valence geometry rules. For example, if the atom is sp3 hybridized (such as the CA in an amino acid), the hydrogen is tetrahedrally positioned with respect to the CA atom. But what about charged groups? Most amino acids that have ionizable side chains (Asp, Glu, Lys, Arg), as well as the C- and N-termini, are ionized at physiological pH (i.e. 6-8). Note that the imidazole side chain of histidine may be neutral or charged (its observed pKa = 6-7); therefore its ionization state must be specified and a hydrogen atom added as necessary. Further, in the neutral state the side chain must contain a hydrogen atom at one of the nitrogen atoms (either ND1 or NE2), usually NE2, but this depends on the local pH.

3) Does the structure contain any "waters of crystallization"? If so, they should be retained in the structure if they appear to be filling any surface or interior cavities. Otherwise they may be deleted. Note that the names for the waters (atomic and residue) must agree with the water model used in the simulation, and these individual waters must be treated the same as the rest of the solvent waters. That is, they must be treated as part of the solvent, not part of the solute.

Water Model

The choice of solvent model depends on which properties are important in the simulation. Generally, the more sophisticated the water model, the slower the calculation. Therefore you must decide on a suitable level of accuracy. Are you more interested in the solute or the solvent? Does a rigid water model that displays the correct bulk water behavior (density, radial distribution) suffice? Or is a flexible model that allows the O-H bonds to stretch and the valence angle to bend necessary?

Water Model: General considerations

Regardless of the geometry of the model, the electrostatic interactions between the solvent and the solute will be important. It is good practice to model the electrostatic interactions between the water and the solute in the same way that the water model was designed for. For example, a poor approach is to combine a protein model whose partial atomic charges were derived from one approximation with a water model whose partial atomic charges were derived from a different approximation. As an extreme example, an unbalanced model would be expected to result from employing MM3 (which does not use partial atomic charges) to model a solute together with a model for water (TIP3P) that incorporates partial atomic charges.

This sort of apples and oranges mixture of models is sometimes the result when an investigator creates a new force field for a particular class of solute. The solute force field may be (indeed should be!) internally consistent, but it may not have been derived with attention to applying it with a given water model.

Water Model: Validation

How is a water (or any solvent) model judged? One obvious criterion is the density of the simulated solvent. But that doesn't say anything about the dynamics of the model. Other experimentally observable properties include the diffusion coefficient and the viscosity. The diffusion coefficient (D) can be calculated in a straightforward way from a simulation by knowing the initial and final position (r) of a molecule after a time t, during which the molecule has diffused through the solvent.
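One common way to turn initial and final positions into a diffusion coefficient is the Einstein relation, $D = \langle |r(t) - r(0)|^2 \rangle / 6t$ in three dimensions, which holds in the long-time limit. The notes do not state which expression is intended, so the sketch below is an assumption; it also assumes that the positions have not been wrapped back into the periodic box, and the coordinates are made up.

```python
def diffusion_coefficient(initial, final, elapsed):
    """Einstein relation in 3-D: D = <|r(t) - r(0)|^2> / (6 t)."""
    squared_displacements = [
        sum((f - i) ** 2 for i, f in zip(r0, rt))
        for r0, rt in zip(initial, final)
    ]
    msd = sum(squared_displacements) / len(squared_displacements)
    return msd / (6.0 * elapsed)

# Hypothetical unwrapped positions (Å) of two molecules, 10 ps apart.
start = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
end = [(3.0, 0.0, 0.0), (5.0, 9.0, 5.0)]
print(diffusion_coefficient(start, end, elapsed=10.0))   # in Å^2/ps
```

In practice the mean-square displacement is averaged over many molecules and many time origins, and D is taken from the slope of the linear region of the mean-square displacement plotted against time.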

Information about the detailed "structure" of a liquid may be obtained from a study of the radial distribution function (rdf), also known as g(r). The rdf is a measure of the number of molecules at a given distance from a central molecule. Since liquids are dynamic, the rdf gives a characteristic average structure. X-ray diffraction can be used to measure the rdf of liquids.

The rdf gives the probability of finding a molecule a distance r from another molecule. In practice, the environment of the molecule is divided up into thin shells of thickness dr. The number of molecules in each shell is then counted and averaged over the course of the simulation.

For short distances (r < the molecular radius) the rdf is zero, since there can be no other molecules within the molecular surface. Thereafter, the rdf exhibits ripples corresponding to solvation shells. The first peak is the largest, indicating a high probability of finding another molecule at that intermolecular separation. As the separation increases, there is less order and the peak intensities decrease. The area under a peak defines the number of molecules in that solvation shell. For water, there is a high probability of finding another water molecule ~3 Å away, corresponding to a water-water hydrogen bond.
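The shell-counting procedure described above can be sketched for a single configuration. The sketch assumes a cubic periodic box and normalizes each shell count by the count expected for a uniform (ideal gas) distribution at the same density, so that g(r) tends to 1 where there is no structure. The function name and the random test configuration are illustrative; a real analysis would average over many snapshots.

```python
import math
import random

def radial_distribution(coords, box_length, dr, r_max):
    """Histogram-based g(r) for one configuration in a cubic periodic box."""
    n = len(coords)
    n_bins = int(r_max / dr)
    counts = [0] * n_bins
    for a in range(n - 1):
        for b in range(a + 1, n):
            d2 = 0.0
            for k in range(3):
                delta = coords[a][k] - coords[b][k]
                delta -= box_length * round(delta / box_length)   # nearest periodic image
                d2 += delta * delta
            bin_index = int(math.sqrt(d2) / dr)
            if bin_index < n_bins:
                counts[bin_index] += 2      # each pair counts for both molecules
    density = n / box_length ** 3
    g = []
    for i, count in enumerate(counts):
        shell_volume = 4.0 / 3.0 * math.pi * (((i + 1) * dr) ** 3 - (i * dr) ** 3)
        g.append(count / (n * density * shell_volume))
    return g

rng = random.Random(1)
points = [tuple(rng.uniform(0.0, 10.0) for _ in range(3)) for _ in range(400)]
g_of_r = radial_distribution(points, box_length=10.0, dr=0.5, r_max=5.0)
print([round(value, 2) for value in g_of_r])
```

For the structureless random points used here every bin fluctuates around 1; for a liquid such as water the same function would show a zero region at short range followed by the solvation-shell peaks. The largest distance analyzed should not exceed half the box length.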

Boundary Conditions

Molecular Clusters or Droplets

If a molecule is simply surrounded by a droplet of other molecules (perhaps solvent), there will exist a boundary between the droplet and the vacuum around it. It then becomes difficult to prevent the molecules from diffusing into the vacuum. An artificial restraint may be applied to force the molecules to stay within the boundary, but this does not correspond to a traditional thermodynamic ensemble.

Periodic Boundary Conditions (PBC)

An alternative to the droplet model is to arrange the molecules into a regular lattice structure. By replicating the contents (positions and velocities) of the central "box" in every direction, a periodic system is generated. This periodic boundary system avoids edge effects. When a molecule diffuses out of one side of the box, it re-enters on the other. Thus a constant density can be maintained.

If the box dimensions are allowed to change with temperature, it is possible to maintain a constant internal pressure (NPT ensemble). Alternatively, if the box dimensions are kept frozen, the internal pressure will fluctuate with temperature (NVT ensemble).

Non-Bonded Interaction Cutoffs: Minimum Image Convention

The calculation of the interactions between the non-bonded molecules is the most time-consuming part of an MD or MC simulation. The number of non-bonded terms (vdW, electrostatic) that need to be calculated increases with the square of the number of molecules. Since all of these interactions decrease with distance, it is possible to invoke a cutoff distance (R), beyond which it is assumed that two molecules are independent of one another.

In a periodic boundary simulation, a molecule should not be able to sense the presence of its own periodic image. Nor should it see more than one copy of any other molecule. To ensure this, the non-bonded interaction cutoff (R) must be less than 1/2 of the box length (L). This is the minimum image convention.

A cube is commonly used for the periodic boundary system, but other choices are possible, for example a truncated octahedron or a rhombic dodecahedron. Using non-cubic lattices minimizes the number of non-bonded contacts that need to be computed. For example, a cubic box might contain 256 molecules, whereas the truncated octahedron would contain 197. The rhombic dodecahedron would only contain 181 molecules. The choice of lattice can therefore greatly affect the time required for the simulation.
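For a cubic box, both the re-entry of a molecule that leaves the box and the minimum image convention reduce to simple arithmetic on each coordinate. The sketch below assumes a cubic box with its origin at one corner; the function names are illustrative.

```python
def wrap(position, box_length):
    """Return the position mapped back into the central box [0, L)."""
    return tuple(c % box_length for c in position)

def minimum_image_distance(r_i, r_j, box_length):
    """Distance between i and the nearest periodic image of j in a cubic box."""
    d2 = 0.0
    for a, b in zip(r_i, r_j):
        delta = a - b
        delta -= box_length * round(delta / box_length)   # shift by whole box lengths
        d2 += delta * delta
    return d2 ** 0.5

box = 10.0
print(wrap((10.5, -0.5, 3.0), box))                                    # re-enters the box
print(minimum_image_distance((0.5, 0.0, 0.0), (9.5, 0.0, 0.0), box))   # 1.0, not 9.0
```

The two atoms in the last line sit at opposite faces of the box, yet their nearest images are only 1 Å apart. With a cutoff below half the box length, at most one image of any atom can fall within the cutoff.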

Updating Neighbor Lists

In order to calculate the forces between the non-bonded molecules, it is first necessary to calculate all of the intermolecular (interatomic) distances. This is a very time-consuming step (almost as slow as calculating the energies themselves). Those molecules that are within the cutoff are then included in the energy calculation. The rest are ignored.

A molecule will probably not drift far from its neighbors in a short time step, and this fact can be used to speed up the simulation further. Normally, more than 20 MD time steps (or MC iterations) are required before a molecule's position will change significantly. Therefore, once the list of non-bonded neighbors (those within the cutoff) is created, it can be reused for a number of steps. This saves having to recalculate all of the intermolecular distances at each step.
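The idea of building the pair list once and reusing it can be sketched as follows. The `skin` argument, an extra margin added to the cutoff so that pairs drifting slightly inward between list updates are not missed, is an assumption of this sketch and is not discussed in these notes; periodic images are ignored for brevity, and the coordinates and names are made up.

```python
def build_neighbor_list(coords, cutoff, skin=0.0):
    """Return pairs (i, j), i < j, closer than cutoff + skin (no periodic images)."""
    limit2 = (cutoff + skin) ** 2
    pairs = []
    for i in range(len(coords) - 1):
        for j in range(i + 1, len(coords)):
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            if d2 < limit2:
                pairs.append((i, j))
    return pairs

def pair_energy_sum(coords, pairs, cutoff, pair_energy):
    """Sum pair_energy(r) over listed pairs that are currently inside the cutoff."""
    total = 0.0
    for i, j in pairs:
        r = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])) ** 0.5
        if r < cutoff:
            total += pair_energy(r)
    return total

coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
neighbors = build_neighbor_list(coords, cutoff=2.0, skin=2.5)   # rebuilt every ~20 steps
print(neighbors)
print(pair_energy_sum(coords, neighbors, cutoff=2.0, pair_energy=lambda r: 1.0 / r))
```

The expensive all-pairs loop runs only when the list is rebuilt; on the steps in between, only the listed pairs are visited, and distances are recomputed for those pairs alone.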

Measuring Equilibration

At what point in a simulation is it possible to say that all of the particles have reached equilibrium? Monitoring the temperature, pressure or energy can all give an indication of equilibration. Certainly a simulation has not equilibrated until it has reached the desired temperature, but is temperature alone a sufficient indication?

Earlier we discussed the structure of liquids and showed that a radial distribution function (rdf) can be used to characterize a liquid. An rdf gives a measure of the bulk structural properties, but says nothing about the behavior of an isolated solvent molecule.

As a simulation proceeds, the solvent molecules move away from their initial positions. The degree to which they become more randomly oriented can be measured by their order parameters. The rotational order parameter is very useful, since nearly spherical molecules (like water) should display no preferred orientation at room temperature.

Root mean-squared deviation (rmsd) as a measure of equilibration

Configurational rmsd

The rmsd is easily computed from the difference ($d_i$) between the current atomic positions and those of a reference structure:

$$\text{rmsd} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^2}$$

The reference structure may be the initial simulation configuration, or some other structure (such as the experimental structure). Because of translational and rotational motion, the rmsd for the solvent and the solute should gradually increase with time.

Conformational rmsd

In many biomolecular simulations it is more useful to measure the change in the structure of the solute as a function of time. To do this, the solute must be aligned with respect to a reference coordinate set before the rmsd is computed. In doing this, the displacements arising from rotation and translation are removed and the rmsd gives a direct measure of the change in shape of the solute.

Usually in an MD simulation a protein will expand from its initial structure and then stabilize at a slightly looser but similar structure. The rmsd will then increase approximately linearly with temperature from 0 to ~1-2 Å and then remain at that value. This is a very good measure of structural equilibration.
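The distinction between the configurational and conformational rmsd can be sketched with a small example. The code below computes the rmsd as the square root of the mean squared atomic displacement and removes only the translational part of the motion by centering both structures; a full conformational rmsd also requires an optimal rotational superposition, which is omitted here. The function names and coordinates are hypothetical.

```python
import math

def rmsd(coords, reference):
    """sqrt((1/N) * sum_i d_i^2), with d_i the displacement of atom i."""
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q))
                for p, q in zip(coords, reference))
    return math.sqrt(total / len(coords))

def remove_translation(coords):
    """Shift a coordinate set so that its centroid sits at the origin."""
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    return [tuple(p[k] - centroid[k] for k in range(3)) for p in coords]

reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
moved = [(x + 1.0, y + 1.0, z + 1.0) for x, y, z in reference]   # rigid translation only

print(rmsd(moved, reference))                                           # configurational
print(rmsd(remove_translation(moved), remove_translation(reference)))   # after centering
```

The translated copy has a non-zero configurational rmsd even though its shape is unchanged, while the centered comparison correctly reports no change in shape.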