6 Similarities between principal components of protein dynamics and random diffusion


B. Hess, Phys. Rev. E 62:8438-8448 (2000)

Principal component analysis, also called essential dynamics, is a powerful tool for finding global, correlated motions in atomic simulations of macromolecules. It has become an established technique for analyzing molecular dynamics simulations of proteins. The first few principal components of simulations of large proteins often resemble cosines. We derive the principal components for high-dimensional random diffusion, which are almost perfect cosines. This resemblance between protein simulations and noise implies that for many proteins the time scales of current simulations are too short to obtain convergence of collective motions.

6.1 Introduction

This chapter focuses on the use of covariance analysis as a tool to gain information about the conformational freedom of proteins. Knowledge about the conformational freedom of a protein can give insight into the stability and functional motions of the protein. Conformational freedom is related to the potential energy: in equilibrium the distribution of conformations can, in principle, be calculated from the potential. In practice this is not possible, since proteins are too complex, not only because of their high dimensionality, but also because the energy landscape usually consists of many minima. Normal mode analysis can be used to analyze any of these potential energy minima, but it does not take into account the entropy due to the occupation of several minima, which plays an important role at room temperature. For reviews on normal mode and principal component analysis applied to proteins see [55, 56]. The first application of principal component analysis to macromolecules was performed to estimate the configurational entropy [57]. More recently the application of principal component analysis to proteins has been termed molecule optimal dynamic coordinates [58, 59] and essential dynamics [60]. It has now become a standard technique for analyzing molecular dynamics trajectories. However, the effects of insufficient sampling on the results are not well understood.

Covariance analysis, also called principal component analysis, is a mathematical technique for analyzing high-dimensional data sets. Essentially, it defines a new coordinate system for the data set, with the special property that the covariance is zero for any two coordinates. In this sense these new coordinates can be called uncorrelated. The coordinates are then ordered according to the variance of the data along each coordinate. This can allow for a reduction of the dimensionality of the space by neglecting the coordinates with small variance, thus concentrating on the coordinates with larger spread or fluctuations.

The procedure for N-dimensional data x(t) is as follows. First the covariance matrix has to be constructed. The covariance C_ij of coordinate i and coordinate j is defined as:

    C_ij = \overline{(x_i - \overline{x}_i)(x_j - \overline{x}_j)}

where an overline denotes the average over all data points. The data can consist of a finite number of N-dimensional points, in which case the average is a summation; x(t) can also be an N-dimensional function, in which case the average is an integral. The symmetric N x N matrix C can be diagonalized with an orthonormal transformation matrix R:

    R^T C R = diag(λ_1, λ_2, ..., λ_N)

The columns of R are the eigenvectors or principal modes. The eigenvalues λ_i are equal to the variance in the direction of the corresponding eigenvector; they can be ordered such that λ_1 >= λ_2 >= ... >= λ_N. The original data can be projected on the eigenvectors to give the principal components p_i(t), i = 1, ..., N:

    p(t) = R^T (x(t) - \overline{x})
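This procedure maps directly onto a few lines of linear algebra. The following minimal sketch in Python/NumPy is not part of the original chapter (all names are ours); it builds the covariance matrix of a data set of n frames in N dimensions, diagonalizes it, and projects the data onto the principal modes:

```python
import numpy as np

def covariance_analysis(x):
    """Covariance (principal component) analysis of an (n_frames, N) array x:
    build C, diagonalize it, and project the data onto the eigenvectors."""
    xm = x - x.mean(axis=0)           # subtract the average structure
    C = xm.T @ xm / x.shape[0]        # C_ij = mean of (x_i - <x_i>)(x_j - <x_j>)
    lam, R = np.linalg.eigh(C)        # C is symmetric: real eigenvalues, orthonormal R
    order = np.argsort(lam)[::-1]     # sort so that lambda_1 >= lambda_2 >= ...
    lam, R = lam[order], R[:, order]
    p = xm @ R                        # principal components p = R^T (x - <x>)
    return lam, R, p
```

For a molecular system obeying Newton's laws, each coordinate would first be multiplied by the square root of its mass, as discussed in the next paragraph.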

When the data are the result of a dynamic process, the principal components are a function of time, and p_1(t) will be the principal component with the largest mean square fluctuation. Note that if the system obeys Newton's laws, every coordinate has to be weighted with the square root of its mass to obtain physically relevant dynamic principal components. For a system moving in a (quasi-)harmonic potential, the mass-weighted principal modes are similar to the normal modes, which are the eigenvectors of the mass-weighted Hessian of the potential function in the energy minimum. The advantage of normal mode analysis is that it depends only on the shape of the potential energy surface; however, its use is restricted to systems for which the motion in a single potential well is a relevant representation of the dynamics. Covariance analysis can be applied to any dynamic system, but the results will also depend on the sampling. The problem is how to separate intrinsic properties of the system from sampling artifacts. As was shown by Linssen [61], the principal components of multidimensional random diffusion are cosine shaped, with amplitudes inversely proportional to the eigenvalue ranking number. This behavior is very similar to that observed in simulations of proteins. In order to interpret protein data correctly, it is important that the behavior of random diffusion is properly understood. In the next section we will perform a theoretical analysis of random diffusion. The results for random diffusion depend only on the sampling, because there is no potential.

6.2 Analysis of random diffusion

N-dimensional diffusion is described by a system of N independent stochastic differential equations:

    dx_i(t)/dt = η_i(t),   i = 1, ..., N    (6.1)

η_i(t) is a stationary stochastic noise term, which is δ-correlated:

    ⟨η_i(t)⟩ = 0    (6.2)
    ⟨η_i(t) η_j(t+τ)⟩ = 2D δ_ij δ(τ)    (6.3)

where ⟨ ⟩ denotes the expectation, δ_ij is the Kronecker delta and δ(τ) is the Dirac delta function. The solution of this system is a set of N non-stationary stochastic functions x_i(t), which represents a Wiener process [62]. The displacements x_i(t+τ) - x_i(t) are Gaussian distributed with:

    ⟨x_i(t+τ) - x_i(t)⟩ = 0    (6.4)
    ⟨(x_i(t+τ) - x_i(t))(x_j(t+τ) - x_j(t))⟩ = 2Dτ δ_ij    (6.5)
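As a quick numerical illustration of equations (6.4) and (6.5) — our own sketch, with arbitrarily chosen parameters — a discretized realization of this process is a random walk whose displacement variance grows as 2Dτ:

```python
import numpy as np

rng = np.random.default_rng(0)
N, nsteps, dt, D = 10, 100_000, 1e-3, 0.5

# Integrate dx_i/dt = eta_i(t): each discrete step is Gaussian with variance 2*D*dt.
steps = rng.normal(0.0, np.sqrt(2 * D * dt), size=(nsteps, N))
x = np.cumsum(steps, axis=0)

# Check <(x_i(t+tau) - x_i(t))^2> = 2*D*tau for a lag of 100 steps.
lag = 100
dx = x[lag:] - x[:-lag]
print(dx.var(axis=0).mean(), 2 * D * lag * dt)   # both close to 0.1
```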

We will use this continuum description for the analysis of random diffusion. Realizations can be generated only for a finite number of time points. This is not a problem, since the results of the covariance analysis will be virtually identical when we have more than N time points. When the number of time points n is smaller than N, the results for the first n principal components will be the same; the last principal components will all be zero.

We will now prove that the first few principal components of the randomly diffusing system (6.1)-(6.3) are cosines, by expanding the solution x(t) in an appropriate series of functions. We are interested in the eigenvalues and eigenvectors of the covariance matrix C and in the functional shape of the principal components:

    C_ij = \overline{(x_i(t) - \overline{x}_i)(x_j(t) - \overline{x}_j)}    (6.6)
    p_k(t) = e_k · (x(t) - \overline{x})    (6.7)

where e_k is eigenvector k and \overline{x}_i is the average of x_i, the overline now denoting the time average (1/T) ∫_0^T ... dt over the analysis interval [0, T]:

    \overline{x}_i = \overline{x_i(t)}    (6.8)

The eigenvectors are ordered according to descending eigenvalue. The first eigenvector e_1 is the vector that produces the linear combination of the x_i with the largest mean square fluctuation. Thus e_1 is the vector with ||e|| = 1 which maximizes:

    λ_1 = max_{||e||=1} \overline{(e · (x(t) - \overline{x}))^2}    (6.9)

We can write x(t) - \overline{x} as a sum of a series of orthonormal functions f_k:

    λ_1 = max_{||e||=1} \overline{(e · Σ_k c^k f_k(t))^2}    (6.10)
    \overline{f_k(t) f_l(t)} = δ_kl    (6.11)
    c_i^k = \overline{f_k(t)(x_i(t) - \overline{x}_i)}    (6.12)

We can choose the f_k such that f_k(t) is proportional to the projection on eigenvector k: f_k(t) ∝ e_k · (x(t) - \overline{x}); thus f_k contributes only to principal component k. f_1(t) is then the function that maximizes the variance:

    λ_1 = max_{||e||=1} max_{f_1} (e · c^1)^2 = max_{f_1} max_{||e||=1} (e · c^1)^2    (6.13)

It can easily be derived that e_1 is proportional to c^1:

    e_1 = c^1 / sqrt(c^1 · c^1)    (6.14)

When N is 1, f_1(t) is proportional to x_1(t) - \overline{x}_1. As N gets larger, the set of x_i's forms a better representation of the full ensemble. For large N we can approximate the maximum with an ensemble average:

    λ_1 = max_{f_1} (c^1 · c^1) ≈ max_{f_1} ⟨c^1 · c^1⟩ = max_{f_1} N ⟨(c_i^1)^2⟩ = λ_1^*    (6.15)

The integral over f_k is zero:

    \overline{f_k(t)} = 0,   since f_k(t) ∝ e_k · (x(t) - \overline{x})    (6.16)

For large N we can approximate f_1 by the function that maximizes λ_1^*; we will call this function f_1^*. Since the integral over f_1 is zero and we want to approximate f_1, we demand that the integral over f_1^* is zero as well. Using this we can calculate the expectation of (c_i^1)^2:

    ⟨(c_i^1)^2⟩ = ⟨(\overline{f_1^*(t) x_i(t)})^2⟩ = (2D/T^2) ∫_0^T ( ∫_t^T f_1^*(u) du )^2 dt    (6.17)

The full derivation is given in appendix 6.A. When we define g(t) as:

    g(t) = ∫_t^T f_1^*(u) du    (6.18)

we can rephrase the optimization problem in terms of g and add the constraints (6.11) and (6.16) using Lagrange multipliers μ_1 and μ_2. We have to find the function g which maximizes:

    ∫_0^T L(t; g, g') dt,   L(t; g, g') = (g(t))^2 - μ_1 g'(t) - μ_2 ((g'(t))^2 - 1)    (6.19)

under the constraints

    \overline{f_1^*(t)} = 0    (6.20)
    \overline{(f_1^*(t))^2} = 1    (6.21)

which, together with the definition of g, give the boundary conditions g(0) = g(T) = 0. According to the Euler-Lagrange formalism, the optimal g satisfies ∂L/∂g = d/dt (∂L/∂g'), which gives:

    g(t) = -μ_2 g''(t)    (6.22)

The solutions are:

    g_k(t) = C sin(πkt/T),   with k = 0, 1, 2, 3, ...    (6.23)

g_0 satisfies the boundary conditions only when C = 0, so we discard this function. By differentiating the g_k and normalizing using equation (6.21) we obtain a set of orthonormal functions h_k:

    h_k(t) = dg_k(t)/dt, normalized: h_k(t) = sqrt(2) cos(πkt/T),   with k = 1, 2, 3, ...    (6.24)

We have to find the h_k that maximizes λ_1^*:

    f_1^* = h_1   =>   λ_1^* = N ⟨(c_i^1)^2⟩ = 2NDT/π^2    (6.25)

Thus the best guess for the projection on the first eigenvector, f_1^*, is h_1. We can apply the same procedure for f_k^*, with the extra restriction that equation (6.11) should hold for all 1 <= l < k. The result is:

    f_k^* = h_k    (6.26)
    λ_k^* = N ⟨(c_i^k)^2⟩ = 2NDT/(π^2 k^2)    (6.27)

The derivation of λ_k^* is based on expectations; statistical fluctuations are not taken into account. Since the difference between λ_1^* and λ_2^* is a factor of 4, we expect that the projection on the first eigenvector is very close to a half cosine. But for large k the statistical fluctuations might cause mixing of the cosines.

The projections f_k^* suggest that the covariance matrix can be approximately diagonalized with a matrix of cosine coefficients. To test this approximation we have to write each x_i as a sum of cosines:

    x_i(t) - \overline{x}_i = Σ_k c_i^k f_k^*(t) = sqrt(2) Σ_k c_i^k cos(πkt/T)    (6.28)

where c_i^k is defined as:

    c_i^k = \overline{x_i(t) f_k^*(t)} = (sqrt(2)/T) ∫_0^T x_i(t) cos(πkt/T) dt    (6.29)

The cosine coefficients are zero on average and uncorrelated:

    ⟨c_i^k⟩ = 0,   where k = 1, 2, 3, ...    (6.30)
    ⟨c_i^k c_j^l⟩ = (2DT/(π^2 k l)) δ_ij δ_kl,   where k, l = 1, 2, 3, ...    (6.31)
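Equations (6.29)-(6.31) are easy to verify numerically. The sketch below is ours, not from the chapter (parameters chosen arbitrarily); it generates many independent diffusive coordinates and checks that the variance of the cosine coefficients follows 2DT/(π^2 k^2):

```python
import numpy as np

rng = np.random.default_rng(1)
nwalks, nsteps, D, T = 2000, 1000, 0.5, 1.0
dt = T / nsteps
t = (np.arange(nsteps) + 0.5) * dt

# Independent 1-D random walks; each plays the role of one coordinate x_i(t).
x = np.cumsum(rng.normal(0, np.sqrt(2 * D * dt), (nwalks, nsteps)), axis=1)

for k in (1, 2, 3):
    fk = np.sqrt(2.0) * np.cos(np.pi * k * t / T)       # f*_k(t), eq. (6.24)
    ck = (x * fk).mean(axis=1)                          # c_i^k, eq. (6.29)
    print(k, ck.var(), 2 * D * T / (np.pi * k) ** 2)    # eq. (6.31): 2DT/(pi^2 k^2)
```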

Full derivations are given in appendix 6.A. The covariance matrix can be expressed in terms of the cosine coefficients:

    C_ij = \overline{ (Σ_k c_i^k f_k^*(t)) (Σ_l c_j^l f_l^*(t)) }    (6.32)
         = Σ_k c_i^k c_j^k    (6.33)

Defining C^* as the truncation of this sum to the first N terms,

    C^*_ij = Σ_{k=1}^{N} c_i^k c_j^k    (6.34)
    C_ij = C^*_ij + Σ_{k>N} c_i^k c_j^k    (6.35)

From (6.31) we can see that on average C^*_ij is about N times larger than the second term in equation (6.35). We can write C^* as a matrix product which involves a diagonal matrix and a matrix X with independent stochastic elements of unit variance:

    C^*_ij = Σ_k X_ik Y_kk X_jk = (X Y X^T)_ij,   X_ik = (πk/sqrt(2DT)) c_i^k    (6.36)

where Y is a diagonal matrix with Y_kk = 2DT/(π^2 k^2). Recently Bai and Silverstein proved that for large N the eigenvalues of (1/N)C^* and Y have the same separation [63]. This means that when there are intervals that contain no eigenvalues, the numbers of eigenvalues in the intervals between such empty intervals are equal for both matrices. Since Y is a diagonal matrix with separated eigenvalues Y_kk = λ_k^*/N, the eigenvalues of C^* are, for large N, identical to λ_k^*. Because C^* is a good approximation of C, the eigenvalues of C are λ_k^*.

We want to prove that the projections on the first few eigenvectors of C resemble cosines. When this is the case, we can approximate the eigenvector matrix of C with a matrix of the first N normalized cosine coefficients for each coordinate. For i, k = 1, ..., N:

    R^*_ik = c_i^k / sqrt(c^k · c^k)    (6.37)
           ≈ c_i^k / sqrt(⟨c^k · c^k⟩)    (6.38)
           = (πk/sqrt(2NDT)) c_i^k    (6.39)
           = R_ik    (6.40)

For reasonably large N, R will be a good approximation of R^*. The use of R instead of R^* simplifies the analysis. On average the rows and columns of R are uncorrelated

and of unit norm:

    ⟨ Σ_{k=1}^N R_ik R_jk ⟩ = δ_ij    (6.41)
    ⟨ Σ_{i=1}^N R_ik R_il ⟩ = δ_kl    (6.42)

However, the columns are only approximately orthogonal; for k ≠ l the mean square inner product of two columns is 1/N. We can use the matrix R to transform the covariance matrix:

    B_ij = (R^T C R)_ij    (6.43)
         = Σ_{l=1}^N Σ_{m=1}^N R_li C_lm R_mj    (6.44)
         = Σ_{l,m} (πi/sqrt(2NDT)) c_l^i C_lm (πj/sqrt(2NDT)) c_m^j    (6.45)
         = (π^2 ij/(2NDT)) Σ_{l,m} c_l^i c_m^j C_lm    (6.46)
         = (π^2 ij/(2NDT)) Σ_{l,m} c_l^i c_m^j Σ_k c_l^k c_m^k    (6.47)
         = (π^2 ij/(2NDT)) Σ_k (c^i · c^k)(c^j · c^k)    (6.48)

Since the columns of R are not completely orthogonal, matrix B will not inherit all the properties of C. We hope that the matrix B is approximately diagonal. To check this we have to calculate the expectation of each matrix element. For the diagonal elements, the Gaussian statistics of the cosine coefficients give:

    ⟨B_ii⟩ = (π^2 i^2/(2NDT)) Σ_{l,m,k} ⟨ c_l^i c_m^i c_l^k c_m^k ⟩    (6.49)
           = (π^2 i^2/(2NDT)) [ N ⟨(c^i)^2⟩ Σ_{k≠i} ⟨(c^k)^2⟩ + (N^2 + 2N) ⟨(c^i)^2⟩^2 ]    (6.50-6.51)
           = (2DT/(π^2 i^2)) ( N + 1 + π^2 i^2/6 )    (6.52)
    ⟨B_ij⟩ = 0   for i ≠ j    (6.53)

where we have used the fourth moment of a Gaussian distribution for ⟨(c_l^i)^4⟩. The trace of B is twice as large as the trace of C. Comparing ⟨B_ii⟩ with λ_i^* shows that this difference is caused by the term π^2 i^2/6, which arises from the correlations between the columns of R; this term is negligible for i << sqrt(N). The expectation of the off-diagonal elements is zero, because every term contains at least one c with an odd power. For the same reason the cross correlation between elements of B is zero:

    ⟨B_ij B_kl⟩ = 0   for i ≠ j or k ≠ l    (6.54)

The off-diagonal elements are zero on average, but they might still be large. To leading order in N, the variance of the off-diagonal elements of B is:

    ⟨(B_ij)^2⟩ = (4ND^2T^2/π^4) (1/i^2 + 1/j^2)^2 + O(1)    (6.55)

The derivation is given in appendix 6.A. When i << sqrt(N), the off-diagonal elements are of order sqrt(N), which is a factor sqrt(N) smaller than the diagonal elements. Although there are N^2 off-diagonal elements, they are completely uncorrelated. The sum of N random elements is sqrt(N) larger than each element and also random. A cumulative effect can be achieved for one inner product of an eigenvector with a row of B, but for an eigenvalue of order N the inner products with all N rows would have to be N times larger than the eigenvector elements and of the correct sign. Since this is impossible, and since for i << sqrt(N) the eigenvalues of C are B_ii + O(1), the projections will resemble cosines. When i ≈ sqrt(N), the off-diagonal elements are of the same order as the diagonal elements and the resemblance is lost.

6.3 Validation

To see how much real principal components resemble cosines, we have to compare the theory with results from Langevin dynamics simulations. An x_i(t) can only be generated for a finite number of time points. This can be done using the algorithm x_i((n+1)τ) = x_i(nτ) + Δx. The value of x_i(0) is irrelevant. Δx should be drawn from a Gaussian distribution which obeys equations (6.4) and (6.5). When τ is small, the distribution does not actually have to be Gaussian, since the convolution of many distributions will converge to a Gaussian distribution. The simulations and covariance analysis were performed with the Gromacs package [3]. For the simulations we always used a convolution of 4 uniform distributions per step, which corresponds to four times as many steps with a single uniform distribution. Figure 6.1 shows the eigenvalues obtained from a simulation and the B_ii from equation (6.46) with normalized eigenvector length, for D = 0.5 and T = 1.
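A minimal version of this validation can be sketched as follows (our own reimplementation, not the Gromacs code; step counts and parameters are illustrative). Each step is drawn as a sum of four uniform random numbers, scaled so that the displacements obey equations (6.4) and (6.5), and the covariance eigenvalues are compared with λ_k^* = 2NDT/(π^2 k^2):

```python
import numpy as np

rng = np.random.default_rng(2)
N, nsteps, D, T = 100, 10_000, 0.5, 1.0
dt = T / nsteps

# Each step is a sum of 4 uniform deviates: nearly Gaussian by the central
# limit theorem, scaled to variance 2*D*dt per step (cf. the text above).
u = rng.uniform(-0.5, 0.5, (nsteps, N, 4)).sum(axis=2)
steps = u * np.sqrt(2 * D * dt / (4 / 12.0))   # 4 uniforms have variance 4/12
x = np.cumsum(steps, axis=0)

xm = x - x.mean(axis=0)
C = xm.T @ xm / nsteps
lam = np.sort(np.linalg.eigvalsh(C))[::-1]

k = np.arange(1, 11)
print(lam[:10])                          # simulated eigenvalues
print(2 * N * D * T / (np.pi * k) ** 2)  # lambda*_k = 2NDT/(pi^2 k^2)
```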

Figure 6.1: The eigenvalues of the covariance matrix of a high-dimensional Brownian dynamics simulation, together with the eigenvalue estimates: the B_ii from equation (6.46), using the normalized vectors of equation (6.37).

Figure 6.2: The first 5 principal components of the same high-dimensional Brownian dynamics simulation.

The first 3 eigenvalues are predicted quite accurately. For higher indices the diagonal elements of B become too small compared to the off-diagonal elements to be a reasonable approximation of the eigenvalues. The predictions are close to the estimated values (equation (6.52)); the term π^2 i^2/6 in equation (6.52) starts to dominate for larger i. When this term is discarded and the eigenvalues are estimated by λ_i^* (equation (6.27)), a good correspondence is obtained over the whole range. Using λ_i^* it can be calculated that on average the first 3 eigenvalues contain just over 80% of the mean square fluctuation at the system size used here and 84% for large N. The first 5 principal components are shown in Figure 6.2; they resemble cosines. The cosine content of an eigenvector can be determined by taking inner products with the columns of R^*. Because these inner products fluctuate heavily, we calculated average inner products over many Langevin dynamics simulations at three different system sizes. Figure 6.3 shows the average inner product of eigenvector i with column i of R^*. The first eigenvector is almost a perfect cosine, with an inner product larger than 0.996 for all three system sizes. The inner products drop to one half for i ≈ sqrt(N), which is exactly the behavior we expected.
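The cosine resemblance can equivalently be measured in the time domain, directly on a principal component p_k(t). The sketch below is our formulation (the chapter itself takes inner products of eigenvectors with the columns of R^*):

```python
import numpy as np

def cosine_content(p, k=1):
    """Cosine content of a principal component time series p(t): the
    normalized squared overlap with cos(pi*k*t/T). Returns 1.0 for a
    perfect cosine with k/2 periods, and values near 0 for converged
    sampling of a bounded system."""
    n = len(p)
    t = (np.arange(n) + 0.5) / n
    ck = np.cos(np.pi * k * t)
    # discretized (2/T) * (integral of p*cos)^2 / (integral of p^2)
    return 2.0 * np.dot(p, ck) ** 2 / (n * np.dot(p, p))
```

Applied to the first few principal components of the Brownian simulation above, this measure is expected to give values close to 1.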

Figure 6.3: The average inner products of the eigenvectors of the covariance matrix with the estimated eigenvectors R^* (equation (6.37)), for three system sizes. The inner products were averaged over many simulations for each system size. Since the columns of R^* are cosine coefficients, these inner products show how much each principal component resembles a cosine with the number of periods equal to half the principal component index. The inner product drops to one half at the square root of the system size. The error bars show the standard deviation.

Figure 6.4: The mean square displacement of the first 5 principal components of the high-dimensional Brownian simulation. The principal components are ballistic (MSD ∝ t^2) over a large range of times. The MSD of the whole system is proportional to time.

The cosine-like principal components lead to strange dynamic effects. One example is the mean square displacement, which is proportional to time for a single purely diffusive coordinate. The mean square displacement along the cosine-shaped principal components is instead proportional to the time squared. This can be shown with a simple derivation. Principal component i can be approximated as:

    p_i(t) ≈ (sqrt(4NDT)/(iπ)) cos(iπt/T)    (6.56)

From this we can calculate the mean square displacement along eigenvector i:

    msd_i(t) = (1/(T-t)) ∫_0^{T-t} ( p_i(y+t) - p_i(y) )^2 dy    (6.57)
             ≈ (8NDT/(i^2 π^2)) sin^2(iπt/(2T))    (6.58)
             ≈ (2ND/T) t^2   for t < T/i    (6.59)

The mean square displacements of the principal components for the Brownian system (Figure 6.2) are shown in Figure 6.4. The whole system behaves diffusively, but the first few eigenvectors exhibit ballistic motion.
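The ballistic signature is easy to reproduce; the sketch below (ours) computes the mean square displacement as a function of lag time for any principal component time series, which for a cosine-shaped component follows the t^2 law of equation (6.59) before leveling off:

```python
import numpy as np

def msd(p, max_lag):
    """Mean square displacement <(p(t+s) - p(t))^2> for lags s = 1..max_lag
    (in frames), averaged over all available time origins t."""
    return np.array([np.mean((p[lag:] - p[:-lag]) ** 2)
                     for lag in range(1, max_lag + 1)])

# For principal component i of the diffusive system, eq. (6.59) predicts
# msd_i(t) ~ (2*N*D/T) * t**2 for t < T/i, while the msd summed over all
# coordinates grows linearly, as 2*N*D*t.
```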

6.4 Protein simulations

Molecular simulations of macromolecules are good examples of high-dimensional systems where principal component analysis can be useful. It will reveal global motions when they are present in the system, without one having to visually inspect the whole trajectory. In order to analyze the internal motions of a molecule, the overall rotation has to be removed. The simplest way to accomplish this is to least-squares fit each structure onto a reference structure. Care should be taken when applying this procedure: because different reference structures will lead to different orientations of the structures with respect to each other, the reference structure should be representative of the whole ensemble of structures. When large movements take place, the fit might not be well determined and the fitted structures might jump between different orientations. This problem is most apparent in elongated linear chains, such as polymers, where the rotational orientation around the chain axis is not well defined.

We will present principal component analysis results for three different protein systems. In all these systems the dynamics of the first few principal components is mainly diffusive, but the sampled part of the free-energy landscape has different characteristics for the different simulations. The three simulations and the covariance analyses were performed with the Gromacs package [3].

OmpF porin in a bilayer

OmpF is a trimeric protein that consists of three β-barrels of 340 residues each. The protein was simulated with molecular dynamics (MD) at pH 3 in a bilayer of DMPC lipids, surrounded by SPC water molecules and chloride ions. The system was simulated for 3 nanoseconds with a time step of 2 femtoseconds; the coordinates were written to file every few picoseconds. A twin-range cut-off was used: 1.0 nm for the Lennard-Jones and short-range Coulomb interactions and 1.8 nm for the long-range Coulomb interactions, updated every 10 steps. The temperature and pressure were coupled to 300 K and 1 bar, respectively. The system is similar to the system in [64]. A description of the force-field parameters can be found in [65].

We split the trajectory into 3 parts of one nanosecond. Covariance analyses were performed for each part on the Cα atoms of monomer one (340 atoms, 1020 dimensions). All structures were least-squares fitted on the Cα atoms of monomer one at time 0. Since the number of structures in each part is smaller than the number of dimensions, the number of nonzero eigenvalues is limited to the number of structures minus one. The eigenvalues of the covariance matrix for one of the one-nanosecond parts are shown in Figure 6.5. The eigenvalues decay approximately as a power law. The first 5 principal components are shown in Figure 6.6. The first 4 resemble cosines with the number of periods equal to half the eigenvalue index. Thus the sampling in these directions is far from complete. The principal components resemble those of the Brownian system, which suggests that the protein might be diffusing on a part of the free-energy landscape that is almost flat in a few dimensions.

Figure 6.5: Eigenvalues of the covariance matrix of subunit 1 of the OmpF trimer. The analysis was done over one nanosecond of the MD simulation.

If this is the case, then these eigenvectors do not describe relevant motions, but only give an indication in which directions the protein is more free to move. To check the relevance of the eigenvectors, we can look at the overlap between subspaces spanned by eigenvectors for different parts of the simulation. The overlap between two sets of n orthonormal vectors v_1, ..., v_n and w_1, ..., w_n can be quantified as follows:

    overlap(v, w) = (1/n) Σ_{i=1}^n Σ_{j=1}^n (v_i · w_j)^2    (6.60)

We can also look at the total overlap s of two covariance matrices, where the squared inner products of the eigenvectors are weighted with the square root of the product of the corresponding eigenvalues; this measure s is introduced in chapter 7, equations (7.8) to (7.10). When the sets v and w span the same subspace, the overlap is 1. The overlap between the subspaces spanned by the first 10 eigenvectors is about 0.3 between the first and second nanosecond, slightly lower between the second and third, and lowest between the first and third nanosecond. The weighted overlap s of the total space behaves similarly. This means that one nanosecond is not enough to get a good impression of the conformational freedom of this protein.
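Equation (6.60) translates directly into code; a minimal sketch (ours):

```python
import numpy as np

def subspace_overlap(V, W):
    """Overlap (eq. 6.60) between two sets of n orthonormal column vectors
    V and W, each of shape (dim, n): (1/n) * sum_ij (v_i . w_j)^2.
    Equals 1 when the two sets span the same subspace."""
    n = V.shape[1]
    return np.sum((V.T @ W) ** 2) / n

# e.g. subspace_overlap(R1[:, :10], R2[:, :10]) for the first 10 eigenvectors
# obtained from two different parts of a trajectory.
```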

Figure 6.6: The first 5 principal components (nm) of subunit 1 of the OmpF trimer. The analysis was done over one nanosecond of the MD simulation. Each principal component is fitted to a cosine with the number of periods equal to half its index.

Figure 6.7: The mean square displacement of the first 5 principal components of subunit 1 of the OmpF trimer, analyzed over one nanosecond of the MD simulation; the first principal component for the analysis over the full 3 nanoseconds is shown as well. These principal components are subdiffusive on short time scales. The first principal components show ballistic behavior, due to their cosine shape (see Figure 6.6).

We have also calculated the mean square displacement along the eigenvectors; these are shown in Figure 6.7. On short time scales the behavior is subdiffusive, which is reasonable for a set of connected atoms on a relatively flat part of the free-energy landscape. On longer time scales eigenvectors 1 and 2 go ballistic. This is not a property of the system, but rather an artifact of the short simulation time in combination with the covariance analysis, which filters ballistic motions out of a diffusive system. This can be illustrated by doing the same analysis for the whole 3 nanoseconds. Again the principal components look like cosines, but now the wavelength is 3 times as long. Thus the cosine behavior is linked to the analysis interval and not to inherent properties of the protein. The mean square displacement looks similar as well, but the transition to ballistic motion is shifted by approximately a factor of 3 (see Figure 6.7).

Table 6.1: Properties of the first 5 eigenvectors of the β-hairpin simulation: the eigenvector index i, the eigenvalue λ_i (nm^2), an estimate k = k_B T/λ_i (kJ mol^-1 nm^-2) of the harmonic force constant in the direction of eigenvector i, the parameters a and τ (ps) of an exponential fit a exp(-t/τ) to the initial decay of the autocorrelation function of principal component i, and a rough estimate γ = kτ (u ps^-1) of the friction coefficient.

β-hairpin in water

The second system is a 16-residue β-hairpin solvated in a box of SPC water molecules with 3 sodium ions. This system was simulated for 10 nanoseconds; structures were written to the trajectory every 0.1 ps. A complete description of the simulation setup can be found in [66]. We performed the covariance analysis on the 24 000 structures from 5.3 to 7.7 ns. We chose this period because the peptide seems to reside in one free-energy minimum for these 2.4 nanoseconds. Because the termini are very flexible, only the backbone atoms of residues 2 to 15 were included in the analysis. The first 5 eigenvalues can be found in Table 6.1; they contain 65% of the fluctuations, and the first 10 eigenvalues contain 80% of the fluctuations. The eigenvectors are well defined for this part of the simulation: the subspace overlap between the first and second 1.2-nanosecond interval is 0.89 for the first 5 eigenvectors, and the weighted overlap (expression 7.10) of the total space is 0.84.

The first five principal components are shown in Figure 6.8. These look very noisy and resemble diffusion in a harmonic potential; their distributions are almost Gaussian. The force constants k for the harmonic wells can be estimated from the eigenvalues and the temperature (see Table 6.1). Kinetic information can be obtained from the autocorrelation functions of the principal components, shown in Figure 6.9. There is a rapid drop in the first half picosecond, followed by an approximately exponential decay. After roughly 10 ps the curves level off, which is an indication that the peptide is not diffusing in a single harmonic well, but in several connected wells. The fast effect can also be found in the velocity autocorrelation functions of the eigenvectors; these have a negative peak at 0.3 to 0.4 ps. This time scale corresponds to the momentum effect, which should occur on a time scale of sqrt(m/k), where m is the effective mass of the mode. The long-time-scale motions are overdamped, since the velocity autocorrelation is almost zero after a few picoseconds. In a diffusive system the correlations decay exponentially with time constant τ = γ/k, where γ is the friction coefficient. We can estimate the friction coefficient for the first 3 eigenvectors by fitting an exponential to the initial decay of the autocorrelation function (see Table 6.1).
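This kinetic analysis can be sketched as follows (our code, not from the chapter; scipy's curve_fit stands in for whatever fitting tool was actually used, and the force constant kharm = k_B T/λ_i and the fit window (t0, t1) are user-supplied assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

def autocorr(p):
    """Normalized autocorrelation of a principal component time series."""
    p = p - p.mean()
    ac = np.correlate(p, p, mode="full")[len(p) - 1:]
    return ac / ac[0]

def friction_estimate(p, dt, kharm, t0, t1):
    """Fit a*exp(-t/tau) to the autocorrelation between t0 and t1 (ps) and
    return gamma = kharm * tau (overdamped diffusion in a harmonic well)."""
    ac = autocorr(p)
    t = np.arange(len(ac)) * dt
    sel = (t >= t0) & (t <= t1)
    (a, tau), _ = curve_fit(lambda t, a, tau: a * np.exp(-t / tau),
                            t[sel], ac[sel], p0=(1.0, 1.0))
    return kharm * tau
```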

Figure 6.8: The first 5 principal components (nm) of the β-hairpin.

Figure 6.9: The autocorrelation of the first 5 principal components of the β-hairpin.

Figure 6.10: The first 5 principal components (nm) of HPr.

HPr in water

The last system is the 85-residue protein HPr, simulated in a box of SPC water molecules. The total simulation time was 5 nanoseconds; structures were written to a trajectory file every picosecond. We left the first nanosecond out of the analysis to remove equilibration effects. The first 5 and 10 eigenvalues contain 63% and 74% of the fluctuations, respectively. The first 5 principal components show plateaus with distinct jumps at 1.5 and 4.0 nanoseconds (Figure 6.10). The global shapes of the first and second eigenvector resemble a half and a full cosine. The jumping behavior can also be observed in the root mean square deviation (RMSD) matrix (Figure 6.11), which has several light triangles along the diagonal. The structures can be clustered on RMSD with, for instance, a Jarvis-Patrick algorithm. In this algorithm a list of the 9 closest neighbors is constructed for each structure. Two structures are in the same cluster when they are in each other's list and have at least 3 list members in common. The algorithm finds 3 large clusters (Figure 6.11). The first 5 eigenvectors mainly describe the differences between the clusters. Because of the jumping behavior, the overlap between the first and the second half of the 4 analyzed nanoseconds is only 0.3 and 0.46 for the subspaces spanned by the first 5 and 10 eigenvectors, respectively.
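A compact implementation of the clustering just described (our sketch; the neighbor-list length and the shared-neighbor threshold are the adjustable Jarvis-Patrick parameters):

```python
import numpy as np

def jarvis_patrick(dist, n_neighbors=9, min_shared=3):
    """Jarvis-Patrick clustering on an (n, n) distance (e.g. RMSD) matrix:
    two structures join the same cluster when each is among the other's
    n_neighbors nearest neighbors and they share >= min_shared neighbors."""
    n = len(dist)
    # nearest-neighbor lists (position 0 is the structure itself, so skip it)
    nn = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]
    nnsets = [set(row) for row in nn]

    # union-find over the Jarvis-Patrick linking criterion
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in nn[i]:
            if i in nnsets[j] and len(nnsets[i] & nnsets[j]) >= min_shared:
                parent[find(i)] = find(int(j))
    return np.array([find(i) for i in range(n)])  # cluster label per structure
```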

Figure 6.11: Matrix showing the RMS deviation of the Cα atoms of HPr for each pair of structures, in the upper left half. The lower right half shows the result of Jarvis-Patrick clustering: a black dot means that two structures are in the same cluster. There are 3 large clusters.

The weighted overlap (expression 7.10) of the total space is 0.43. When the simulation is prolonged, jumps to new clusters might occur, which can cause large changes in the eigenvectors.

6.5 Discussion

Principal component analysis is considered to be a powerful tool for finding large-scale motions in proteins. The advantages of the analysis do not lie in the analysis itself, but in the fact that most of the fluctuations can be captured with the first few principal modes. This means that many analyses can be performed in only a few dimensions, which makes visual inspection of the results easier. This can reveal features of the free-energy landscape. Kinetic properties can also be analyzed, since all the time information is still present in the principal components. However, one cannot conclude from the eigenvalues alone that only a small subspace is important. We have shown that for high-dimensional random diffusion the first 3 eigenvalues contain 84% of the fluctuations. This does not mean that the first 3 principal modes reveal special properties of the system. In random diffusion all directions are equivalent; a second simulation will produce completely different principal modes.

How can the relevance of the principal modes be determined? One should always divide the simulation into two or more parts and compare the principal modes for each part. A simple measure is the subspace overlap of the first few principal modes (equation (6.60)). Amadei et al. extensively analyzed the overlap for MD simulations of protein L and cytochrome c [67]. They found that the overlap for the first 10 principal components is about 0.6 for two trajectory parts of a few hundred picoseconds, irrespective of the time interval between the two parts. This indicates that the protein samples only one free-energy minimum, as is the case for the β-hairpin simulation. The overlap can be lower when the protein samples multiple minima, as was shown for HPr. In that case the principal modes describe the differences between the minima, rather than the shapes of the energy minima. Although the subspace of the first 5 or 10 principal modes might not be well defined, there will always be a splitting into a subspace of diffusive modes, which contains most of the fluctuations, and a subspace of (near-)harmonic modes. This splitting will not change significantly between different simulations.

We have derived that the first few principal components of high-dimensional random diffusion are a half cosine, a full cosine, one and a half cosines, and so on. Many protein simulations with similar principal components can be found in the literature, for instance [59, 68, 69]. When the trajectory is projected on the plane of the first two principal components, the half and full cosine produce a characteristic semi-circle, as can be seen for BPTI simulations in [70]. Often these simulations are relatively short (a few nanoseconds) and of relatively large proteins (more than 100 residues), like the OmpF simulation presented in this chapter. It is very improbable that properties of the protein will produce cosine-shaped atomic displacements, consisting of exactly a half period, a full period, and so on.

It is more likely that this behavior is the result of random diffusion, since the simulation time is not long enough to repeatedly reach barriers on the free-energy landscape. The first principal component, a half cosine, can be very misleading, as it seems as if the protein moves from one well-defined conformation to another. The cosine-shaped principal components lead to ballistic behavior on longer time scales. García et al. [68] write that this behavior belongs to the Lévy flight class. However, this behavior is caused by the incomplete sampling and scales with the simulation time, as shown for OmpF, and it will disappear once a free-energy minimum has been sampled to a reasonable extent, as shown for the β-hairpin. When one sees cosine-like principal components, one should interpret the results carefully and always keep in mind that most of the fluctuations could be caused by random diffusion. The analysis is still useful, because it can separate the random-diffusion degrees of freedom from the more restricted degrees of freedom. In such cases insight into the principal modes of random diffusion in a harmonic potential will make better estimates of the conformational freedom possible. In the next chapter we present simulations of proteins of reasonable size, which are long enough to analyze the local convergence.

6.A Appendix: variances

The variance of the time average of f_1(t)x_i(t) can be calculated given that \overline{f_1} = 0. We apply partial integration, using d/dt ∫_t^T f_1(v)dv = -f_1(t); the boundary term vanishes because ∫_0^T f_1(v)dv = 0:

    ⟨ ( \overline{f_1(t) x_i(t)} )^2 ⟩
      = ⟨ ( (1/T) ∫_0^T f_1(t) x_i(t) dt )^2 ⟩    (A.1)
      = ⟨ ( (1/T) ∫_0^T ( ∫_t^T f_1(v) dv ) (dx_i(t)/dt) dt )^2 ⟩    (A.2-A.4)
      = (1/T^2) ∫_0^T ∫_0^T ( ∫_t^T f_1(v) dv )( ∫_u^T f_1(w) dw ) ⟨ (dx_i(t)/dt)(dx_i(u)/du) ⟩ dt du    (A.5-A.6)
      = (1/T^2) ∫_0^T ∫_0^T ( ∫_t^T f_1(v) dv )( ∫_u^T f_1(w) dw ) 2D δ(t-u) dt du    (A.7)
      = (2D/T^2) ∫_0^T ( ∫_t^T f_1(v) dv )^2 dt    (A.8)

The expectations of c_i^k and c_i^k c_j^l can be calculated using integration by parts; the boundary terms vanish because sin(0) = sin(πk) = 0. For k = 1, 2, 3, ...:

    ⟨c_i^k⟩ = ⟨ (sqrt(2)/T) ∫_0^T x_i(t) cos(πkt/T) dt ⟩    (A.9)
            = -⟨ (sqrt(2)/(πk)) ∫_0^T (dx_i(t)/dt) sin(πkt/T) dt ⟩    (A.10-A.11)
            = 0    (A.12)

For k, l = 1, 2, 3, ...:

    ⟨c_i^k c_j^l⟩ = (2/(π^2 kl)) ∫_0^T ∫_0^T ⟨ (dx_i(s)/ds)(dx_j(t)/dt) ⟩ sin(πks/T) sin(πlt/T) ds dt    (A.13-A.14)

      = (2/(π^2 kl)) ∫_0^T ∫_0^T 2D δ_ij δ(s-t) sin(πks/T) sin(πlt/T) ds dt    (A.15)
      = (4D/(π^2 kl)) δ_ij ∫_0^T sin(πks/T) sin(πls/T) ds    (A.16)
      = (2DT/(π^2 kl)) δ_ij δ_kl    (A.17-A.18)

For the size of the off-diagonal elements of B we write, for i ≠ j,

    B_ij = (π^2 ij/(2NDT)) ( U + V_ij + V_ji )    (A.19)

where U collects the terms of equation (6.48) with k ≠ i and k ≠ j, and V_ij is the term with k = i:

    U = Σ_k (c^i · c^k)(c^j · c^k)(1 - δ_ik)(1 - δ_jk),   V_ij = (c^i · c^i)(c^i · c^j)    (A.20)

so that

    ⟨(B_ij)^2⟩ = (π^2 ij/(2NDT))^2 ⟨ (U + V_ij + V_ji)^2 ⟩    (A.21-A.22)

The required moments follow from the Gaussian statistics of the cosine coefficients. Writing σ_k^2 = ⟨(c_i^k)^2⟩ = 2DT/(π^2 k^2), the leading-order results are:

    ⟨U^2⟩ = N^2 σ_i^2 σ_j^2 Σ_{k≠i,j} σ_k^4 + O(N)    (A.25-A.27)
    ⟨V_ij^2⟩ = N^3 σ_i^6 σ_j^2 + O(N^2),   ⟨V_ji^2⟩ = N^3 σ_i^2 σ_j^6 + O(N^2)    (A.28)
    ⟨U V_ij⟩, ⟨U V_ji⟩ = O(N^2)    (A.29-A.30)

    ⟨V_ij V_ji⟩ = N^3 σ_i^4 σ_j^4 + O(N^2)    (A.31)

Combining these terms in (A.21), the V contributions dominate for large N and give

    ⟨(B_ij)^2⟩ = (4ND^2T^2/π^4) ( 1/i^4 + 2/(i^2 j^2) + 1/j^4 ) + O(1)
               = (4ND^2T^2/π^4) ( 1/i^2 + 1/j^2 )^2 + O(1)    (A.32-A.33)

which is equation (6.55).


Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation) Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation) PCA transforms the original input space into a lower dimensional space, by constructing dimensions that are linear combinations

More information

4 Linear Algebra Review

4 Linear Algebra Review Linear Algebra Review For this topic we quickly review many key aspects of linear algebra that will be necessary for the remainder of the text 1 Vectors and Matrices For the context of data analysis, the

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

L26: Advanced dimensionality reduction

L26: Advanced dimensionality reduction L26: Advanced dimensionality reduction The snapshot CA approach Oriented rincipal Components Analysis Non-linear dimensionality reduction (manifold learning) ISOMA Locally Linear Embedding CSCE 666 attern

More information

An Introduction to Algebraic Multigrid (AMG) Algorithms Derrick Cerwinsky and Craig C. Douglas 1/84

An Introduction to Algebraic Multigrid (AMG) Algorithms Derrick Cerwinsky and Craig C. Douglas 1/84 An Introduction to Algebraic Multigrid (AMG) Algorithms Derrick Cerwinsky and Craig C. Douglas 1/84 Introduction Almost all numerical methods for solving PDEs will at some point be reduced to solving A

More information

Langevin Dynamics of a Single Particle

Langevin Dynamics of a Single Particle Overview Langevin Dynamics of a Single Particle We consider a spherical particle of radius r immersed in a viscous fluid and suppose that the dynamics of the particle depend on two forces: A drag force

More information

ADAPTIVE ANTENNAS. SPATIAL BF

ADAPTIVE ANTENNAS. SPATIAL BF ADAPTIVE ANTENNAS SPATIAL BF 1 1-Spatial reference BF -Spatial reference beamforming may not use of embedded training sequences. Instead, the directions of arrival (DoA) of the impinging waves are used

More information

Statistical Mechanics

Statistical Mechanics Statistical Mechanics Newton's laws in principle tell us how anything works But in a system with many particles, the actual computations can become complicated. We will therefore be happy to get some 'average'

More information

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD DATA MINING LECTURE 8 Dimensionality Reduction PCA -- SVD The curse of dimensionality Real data usually have thousands, or millions of dimensions E.g., web documents, where the dimensionality is the vocabulary

More information

Gaussian Basics Random Processes Filtering of Random Processes Signal Space Concepts

Gaussian Basics Random Processes Filtering of Random Processes Signal Space Concepts White Gaussian Noise I Definition: A (real-valued) random process X t is called white Gaussian Noise if I X t is Gaussian for each time instance t I Mean: m X (t) =0 for all t I Autocorrelation function:

More information

Monte Carlo. Lecture 15 4/9/18. Harvard SEAS AP 275 Atomistic Modeling of Materials Boris Kozinsky

Monte Carlo. Lecture 15 4/9/18. Harvard SEAS AP 275 Atomistic Modeling of Materials Boris Kozinsky Monte Carlo Lecture 15 4/9/18 1 Sampling with dynamics In Molecular Dynamics we simulate evolution of a system over time according to Newton s equations, conserving energy Averages (thermodynamic properties)

More information

Introduction to Geometry Optimization. Computational Chemistry lab 2009

Introduction to Geometry Optimization. Computational Chemistry lab 2009 Introduction to Geometry Optimization Computational Chemistry lab 9 Determination of the molecule configuration H H Diatomic molecule determine the interatomic distance H O H Triatomic molecule determine

More information

Linear Algebra Exercises

Linear Algebra Exercises 9. 8.03 Linear Algebra Exercises 9A. Matrix Multiplication, Rank, Echelon Form 9A-. Which of the following matrices is in row-echelon form? 2 0 0 5 0 (i) (ii) (iii) (iv) 0 0 0 (v) [ 0 ] 0 0 0 0 0 0 0 9A-2.

More information

Asymptotic Eigenvalue Moments for Linear Multiuser Detection

Asymptotic Eigenvalue Moments for Linear Multiuser Detection Asymptotic Eigenvalue Moments for Linear Multiuser Detection Linbo Li and Antonia M Tulino and Sergio Verdú Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA flinbo, atulino,

More information

PEAT SEISMOLOGY Lecture 2: Continuum mechanics

PEAT SEISMOLOGY Lecture 2: Continuum mechanics PEAT8002 - SEISMOLOGY Lecture 2: Continuum mechanics Nick Rawlinson Research School of Earth Sciences Australian National University Strain Strain is the formal description of the change in shape of a

More information

Rotational motion of a rigid body spinning around a rotational axis ˆn;

Rotational motion of a rigid body spinning around a rotational axis ˆn; Physics 106a, Caltech 15 November, 2018 Lecture 14: Rotations The motion of solid bodies So far, we have been studying the motion of point particles, which are essentially just translational. Bodies with

More information

Cartesian Tensors. e 2. e 1. General vector (formal definition to follow) denoted by components

Cartesian Tensors. e 2. e 1. General vector (formal definition to follow) denoted by components Cartesian Tensors Reference: Jeffreys Cartesian Tensors 1 Coordinates and Vectors z x 3 e 3 y x 2 e 2 e 1 x x 1 Coordinates x i, i 123,, Unit vectors: e i, i 123,, General vector (formal definition to

More information

1. Introductory Examples

1. Introductory Examples 1. Introductory Examples We introduce the concept of the deterministic and stochastic simulation methods. Two problems are provided to explain the methods: the percolation problem, providing an example

More information

MARTINI simulation details

MARTINI simulation details S1 Appendix MARTINI simulation details MARTINI simulation initialization and equilibration In this section, we describe the initialization of simulations from Main Text section Residue-based coarsegrained

More information

Clifford Algebras and Spin Groups

Clifford Algebras and Spin Groups Clifford Algebras and Spin Groups Math G4344, Spring 2012 We ll now turn from the general theory to examine a specific class class of groups: the orthogonal groups. Recall that O(n, R) is the group of

More information

Handout 10. Applications to Solids

Handout 10. Applications to Solids ME346A Introduction to Statistical Mechanics Wei Cai Stanford University Win 2011 Handout 10. Applications to Solids February 23, 2011 Contents 1 Average kinetic and potential energy 2 2 Virial theorem

More information

Supporting information for. Direct imaging of kinetic pathways of atomic diffusion in. monolayer molybdenum disulfide

Supporting information for. Direct imaging of kinetic pathways of atomic diffusion in. monolayer molybdenum disulfide Supporting information for Direct imaging of kinetic pathways of atomic diffusion in monolayer molybdenum disulfide Jinhua Hong,, Yuhao Pan,, Zhixin Hu, Danhui Lv, Chuanhong Jin, *, Wei Ji, *, Jun Yuan,,*,

More information

Chapter 6: Orthogonality

Chapter 6: Orthogonality Chapter 6: Orthogonality (Last Updated: November 7, 7) These notes are derived primarily from Linear Algebra and its applications by David Lay (4ed). A few theorems have been moved around.. Inner products

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

Math 21b Final Exam Thursday, May 15, 2003 Solutions

Math 21b Final Exam Thursday, May 15, 2003 Solutions Math 2b Final Exam Thursday, May 5, 2003 Solutions. (20 points) True or False. No justification is necessary, simply circle T or F for each statement. T F (a) If W is a subspace of R n and x is not in

More information

Multiple Degree of Freedom Systems. The Millennium bridge required many degrees of freedom to model and design with.

Multiple Degree of Freedom Systems. The Millennium bridge required many degrees of freedom to model and design with. Multiple Degree of Freedom Systems The Millennium bridge required many degrees of freedom to model and design with. The first step in analyzing multiple degrees of freedom (DOF) is to look at DOF DOF:

More information

Citation for published version (APA): Feenstra, K. A. (2002). Long term dynamics of proteins and peptides. Groningen: s.n.

Citation for published version (APA): Feenstra, K. A. (2002). Long term dynamics of proteins and peptides. Groningen: s.n. University of Groningen Long term dynamics of proteins and peptides Feenstra, Klaas Antoni IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

Recurrences and Full-revivals in Quantum Walks

Recurrences and Full-revivals in Quantum Walks Recurrences and Full-revivals in Quantum Walks M. Štefaňák (1), I. Jex (1), T. Kiss (2) (1) Department of Physics, Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague,

More information

Tensors, and differential forms - Lecture 2

Tensors, and differential forms - Lecture 2 Tensors, and differential forms - Lecture 2 1 Introduction The concept of a tensor is derived from considering the properties of a function under a transformation of the coordinate system. A description

More information

e 2 e 1 (a) (b) (d) (c)

e 2 e 1 (a) (b) (d) (c) 2.13 Rotated principal component analysis [Book, Sect. 2.2] Fig.: PCA applied to a dataset composed of (a) 1 cluster, (b) 2 clusters, (c) and (d) 4 clusters. In (c), an orthonormal rotation and (d) an

More information

Adaptive Filtering. Squares. Alexander D. Poularikas. Fundamentals of. Least Mean. with MATLABR. University of Alabama, Huntsville, AL.

Adaptive Filtering. Squares. Alexander D. Poularikas. Fundamentals of. Least Mean. with MATLABR. University of Alabama, Huntsville, AL. Adaptive Filtering Fundamentals of Least Mean Squares with MATLABR Alexander D. Poularikas University of Alabama, Huntsville, AL CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is

More information

Q1 Q2 Q3 Q4 Tot Letr Xtra

Q1 Q2 Q3 Q4 Tot Letr Xtra Mathematics 54.1 Final Exam, 12 May 2011 180 minutes, 90 points NAME: ID: GSI: INSTRUCTIONS: You must justify your answers, except when told otherwise. All the work for a question should be on the respective

More information