FIRST PASSAGE TIMES AND RELAXATION TIMES OF UNFOLDED PROTEINS AND THE FUNNEL MODEL OF PROTEIN FOLDING


By WEI DAI

A dissertation submitted to the Graduate School-New Brunswick, Rutgers, The State University of New Jersey, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Graduate Program in Physics and Astronomy, written under the direction of Ronald M. Levy and approved by

New Brunswick, New Jersey
May, 2016

ABSTRACT OF THE DISSERTATION

First Passage Times and Relaxation Times of Unfolded Proteins and the Funnel Model of Protein Folding

By WEI DAI

Dissertation Director: Ronald M. Levy

Protein folding has been a challenging puzzle for decades and is still not fully understood. One important way to gain insight into the mechanism is to study how kinetics in the unfolded state affects protein folding. The answer to this fundamental issue hinges on the time scale needed to equilibrate the unfolded state and on the energy landscape of the unfolded state. We construct Markov state models (MSMs) of several mini-proteins to study the kinetics of their unfolded state ensembles and find that the folding kinetics are two-state even though there are multiple folding pathways with nonuniform barriers, which is a direct consequence of rapid mixing within the unfolded state. We also introduce the time integral of a suitable correlation function, namely the relaxation time, to characterize the time scale of equilibration within the unfolded state. However, the mean first passage times (MFPTs) between different regions of the unfolded state are observed to be orders of magnitude longer than the folding time. This apparent paradox is resolved by deriving a simple relation showing that the mean first passage time to any state is equal to the relaxation time of that state divided by its equilibrium population. This relation explains why MFPTs among unfolded states can be very long while the energy landscape remains smooth (minimally frustrated). In fact, when the folding kinetics is two-state, all of the unfolded state relaxation times are shorter than the folding time. This result supports the well-established funnel-like energy landscape picture and resolves an apparent contradiction between this model and the recently proposed kinetic hub model of protein folding.

Markov state models are a powerful tool, but we also seek alternative ways of studying kinetics when MSMs do not work well. For example, diffusion maps for dimensionality reduction and the discrete transition-based reweighting analysis method are very useful, respectively, for determining a geometrical measure that preserves the intrinsic dynamics and for fully utilizing enhanced sampling simulation data.

Acknowledgments

I would like to sincerely thank my adviser Ronald M. Levy, whose enthusiasm for science is always contagious and sustained my morale whenever I encountered a bottleneck in the research. Many thanks to my colleagues in our group, Weihua, Emilio, Nanjie, Junchao and Bin, with whom I consulted, discussed and collaborated on the projects presented here. I would also like to thank the fellow graduate students I've known during my studies here at Rutgers: Allen, Bill, Bin, Can, Juan, Liyang, Mauro, Michael, Omar, Peng, Wenhu, Wenshuo, Xueyun, Yiwen and Yuanjun. Last but not least, special thanks to my parents and my wife Jia Tang, who always brightens my day! Thanks to all!

Table of Contents

Abstract
Acknowledgments
List of Tables
List of Figures
1. Introduction
    Protein Folding: Challenges and Theories
    Computer Simulations of Protein Folding
    Thesis Organization
2. Kinetics Study of Proteins Using Markov State Models
    Introduction to Markov Processes
    Construction of Markov State Models
    Results and Discussion
    Conclusion
    Appendix I: Simple Models and Derivation
    Appendix II: Publication Attached
3. Important Time Scales in Unfolded State Ensemble of Proteins
    Motivation: Funnel Picture or Kinetic Hubs
    Simple Relation between MFPTs and Relaxation Times
    Results and Discussion
    Conclusion
    3.5. Appendix: Publication Attached
4. Alternative Ways of Studying Kinetics
    Introduction to Diffusion Maps
    Results and Discussions on Diffusion Maps
    Introduction to Discrete Transition-Based Reweighting Analysis Method
    Results and Discussions on Discrete Transition-Based Reweighting Analysis Method
    Conclusions and Future Works
Bibliography

List of Tables

2.1. The statistics of fig. 2.14, fig and fig. For each way of choosing the unfolded states, MFPTs are collected. Each MFPT data point is averaged over 1000 successful kinetic Monte Carlo trajectories.
Important timescales, including the folding time (the slowest implied timescale with the absorbing boundary condition at the F state), the slowest implied timescale (unmodified equilibrium boundary condition), and total relaxation times with different boundary conditions (unmodified equilibrium and reflecting at the F state), for NTL9 and the rugged NTL9 network with an internal barrier introduced in the unfolded state. To study the intra-U-state relaxation, a reflecting boundary condition at the F state is applied. All total relaxation times under the reflecting-at-F boundary condition are about an order of magnitude shorter than the slowest implied timescale, except for the modified NTL9 network, owing to the internal barrier in the U ensemble.

List of Figures

2.1. The population distribution of a 1-D quadri-well potential diffusion model. There are 100,000 samples (x coordinates), which are drawn from the potential using the Langevin equation. Four free energy basins can be observed.
The top four eigenvectors of a 100 by 100 transition matrix estimated from a 1-D quadri-well potential diffusion model. For each eigenvector, the components are plotted against the corresponding cluster center x-coordinates. (a) The first eigenvector indicates the equilibrium population of the potential. (b) The second eigenvector describes the dynamics between the negative and positive regions, so the boundary is established at x = 0. (c) The third eigenvector splits the negative region into two basins. (d) The fourth eigenvector divides the positive region into two macrostates.
The NMR structure (PDB entry 2JOF) of Trp-cage. It consists of an α-helix, a 3₁₀-helix and a hydrophobic core, which contains a Trp in the middle.
The distribution of Cα RMSD from the NMR structure (PDB entry 2JOF) of Trp-cage.
The time series of Cα RMSD from the NMR structure (PDB entry 2JOF) of Trp-cage.
The time series of native contacts with respect to the NMR structure (PDB entry 2JOF) of Trp-cage.

2.7. The implied timescales corresponding to the 10 slowest decaying eigenmodes using transition matrices T(Δt) at the fine-grained level (25,000 microstates), with different boundary conditions, estimated from Shaw's Trp-cage trajectory [1]. (a) Unmodified equilibrium boundary condition. (b) Reflecting at F and I states boundary condition, where the dynamics are constrained within the unfolded ensemble. The reflecting at F and I boundary condition is realized by adding the value of the transition probability from any state i to F or I (T_iF or T_iI) to the component T_ii and then setting T_iF and T_iI to zero. The implied timescales approximately level off after lag time Δt = 5 ns. The optimal lag time of 10 ns is chosen for further analysis based on the trade-off between the network being Markovian and the resolution being sufficient for studying the folding mechanism.
(a) The tube fluxes J(α), (b) the population which folds through tube α, P(α), and (c) the tube folding rates k(α) for the top 16 folding tubes obtained from the 25,000-node MSM.
In the 25,000-cluster MSM of Trp-cage, the RMSD of the cluster centers is plotted versus the corresponding components of the second and third eigenvectors of the transition matrix at a lag time of 5 ns. (a) RMSD vs the second eigenvector components. (b) RMSD vs the third eigenvector components. The first eigenvector corresponds to the equilibrium population, so all of its components share the same sign.

2.10. The implied timescales corresponding to the 10 slowest decaying eigenmodes using transition matrices T(Δt) at the coarse-grained level (20 macrostates, PCCA+ clustering from the 31,284-node network in figure 2.9), with different boundary conditions, estimated from Shaw's Trp-cage trajectory [1]. (a) Unmodified equilibrium boundary condition. (b) Reflecting at F and I states boundary condition, where the dynamics are constrained within the unfolded ensemble. The implied timescales approximately level off after lag time Δt = 5 ns. The optimal lag time of 10 ns is chosen for further analysis based on the trade-off between the network being Markovian and the resolution being sufficient for studying the folding mechanism.
In the 20-node MSM of Trp-cage, (a) a typical propagator matrix element T_ij(Δt) from state i to state j. Both states i and j are picked from the U ensemble. The element is calculated in three ways: using absorbing at F (black), unmodified equilibrium (blue) and reflecting at F (red) boundary conditions. The upper and lower dash-dotted horizontal lines correspond to the equilibrium population of state j under the reflecting at F and unmodified equilibrium boundary conditions. (b) The typical transition matrix element T_ij is shown in a longer time window. All the elements at different times are obtained from spectral decomposition using all 20 eigenmodes at a lag time of 10 ns.
The population fluctuation relaxation functions (Eq. 2.36, 2.37, 2.38) of the 20-node MSM of Trp-cage at a lag time of 10 ns, using two different boundary conditions, (a) unmodified equilibrium and (b) reflecting at F. The time integrals of the functions are the relaxation times, which are 543 ns and 100 ns, respectively.

2.13. The population fluctuation relaxation functions (Eq. 2.36, 2.37, 2.38) of the node MSM of Trp-cage at a lag time of 10 ns, using two different boundary conditions, (a) unmodified equilibrium and (b) reflecting at F. The time integrals of the functions are the relaxation times, which are evaluated by spectral decomposition of the transition matrix T(Δt = 10 ns). Only the 100 largest eigenvalues and their eigenvectors are used. In addition, the relaxation functions are renormalized by the population of the states which have a relaxation time larger than 8.2 ns, since we do not have enough statistics or sufficient resolution to estimate relaxation times smaller than 8.2 ns. The corresponding relaxation times are 825 ns and 108 ns.
The MFPT distribution between unfolded states of the 25,000-node MSM of Trp-cage. The top 101 most populated unfolded states are selected to calculate the pairwise MFPTs (10,100 MFPTs in total) by running kinetic Monte Carlo trajectories. The transition matrix is built with the unmodified boundary condition at a lag time of 10 ns.
The MFPT distribution between unfolded states of the 25,000-node MSM of Trp-cage. Pairs of unfolded states are randomly chosen according to their weights, which means that unfolded states with larger weights are more likely to be picked than unfolded states with smaller weights. The forward and backward MFPTs between the chosen pairs are calculated by running kinetic Monte Carlo trajectories. The transition matrix is built with the unmodified boundary condition at a lag time of 10 ns.
The MFPT distribution between unfolded states of the 25,000-node MSM of Trp-cage. Pairs of unfolded states are randomly chosen with equal probability. The forward and backward MFPTs between the chosen pairs are calculated by running kinetic Monte Carlo trajectories. The transition matrix is built with the unmodified boundary condition at a lag time of 10 ns.

2.17. In the 20-node MSM of Trp-cage, the lifetime distribution and population fluctuation relaxation function of a typical unfolded state are plotted. (a) The lifetime distribution of the unfolded state, which is obtained by running hundreds of thousands of kinetic Monte Carlo trajectories. These trajectories are started at the unfolded state and terminated once they leave it. (b) The population fluctuation relaxation function of the unfolded state, which is calculated from Eq by spectral decomposition of the transition matrix T(Δt = 10 ns). The lifetime and the relaxation function decay on a very similar time scale.
(a) The implied timescale spectrum to an unfolded state, which is highlighted as a red circle in (b). (b) The average lifetime of the collective state (U i), which excludes the state i, versus the MFPT to state i using Eq. The transition matrix of the 20-node MSM of Trp-cage has the absorbing boundary condition at the red state. The lag time is 10 ns.
The MFPTs to node 3, which is highlighted as a red circle in Fig. 2.18, of the 20-node MSM of Trp-cage. It can be seen that the MFPT to node 3 does not strongly depend on where the trajectory starts. This implies a rapid equilibration within the unfolded state ensemble.
The MFPT to state i versus the reciprocal of the population of state i. The MFPTs are calculated by running kinetic Monte Carlo trajectories on the 25,000-node Markov state network. 300 states are randomly chosen, including the folded state, which is labelled as a green star. The fitted curve is shown in the plot. The coefficient of the variable x (population) is 0.428 µs, which is considered an approximation of the total relaxation time of the Markov state model, 825 ns. The lag time of the transition matrix is 10 ns.

2.21. The ensemble view of the 20-node MSM of Trp-cage. The MSM is schematically shown by nodes and edges. The size of the nodes is positively correlated with the populations. A thousand trajectories are initiated from an unfolded state. At some particular later time, we track where the trajectories are and estimate the instantaneous populations. If the population is higher than the equilibrium value, the node is labelled with warm colors. If the population is lower than the equilibrium value, the node is labelled with cool colors. After approximately 500 ns, all the unfolded states carry the excess population and decay to their equilibrium with the slowest implied timescale.
A typical folding trajectory representing a general diffusion process. The outer sphere is the conformational space. The inner sphere is the folded state. In a folding trajectory, only a small portion of the conformational space is visited.
A simple 5-node model for studying the influence of unfolded state kinetics on folding.
Effects of the unfolded state ensemble equilibration rate constant k_12 on the folding rates for the two tubes in the 5-node simple model (Fig. 2.23). The green dashed line is the total folding rate, $k_{tot} = k_\alpha \frac{P(\alpha)}{P_{eq}(U)} + k_\beta \frac{P(\beta)}{P_{eq}(U)}$.
The schematic plot of a 3-node simple model. The combination of node 1 and node 3 yields the collective state U i. The edges connecting the nodes represent the rate constants between the two connected nodes.
The indicator function of the collective state U i is one when the trajectory is in state U i and zero otherwise. t_1, t_2 and t_3 are three realizations of the lifetimes of state U i. Point A is a typical starting point of the first passage time to state i.
The NMR structure (PDB entry 2HBA) of NTL9, with ribbons representing a three-stranded anti-parallel β-sheet and an α-helix.

3.3. The distribution of Cα RMSD from the NMR structure (PDB entry 2HBA) of NTL9. The RMSD data shown here is one tenth (uniformly subsampled) of the whole trajectory. The part of the distribution beyond 16 Å is truncated since it is nearly zero and of less interest.
The implied timescales corresponding to the 10 slowest decaying eigenmodes using transition matrices T(Δt) at the coarse-grained level (20 macrostates, PCCA+ clustering from the 24,216-node network), with different boundary conditions, estimated from Shaw's NTL9 trajectories [1]. (a) Unmodified equilibrium boundary condition. (b) Reflecting at F and I states boundary condition, where the dynamics are constrained within the unfolded ensemble. The implied timescales approximately level off after lag time Δt = 10 ns. The optimal lag time of 20 ns is chosen for further analysis based on the trade-off between the network being Markovian and the resolution being sufficient for studying the folding mechanism.
The implied timescales corresponding to the 10 slowest decaying eigenmodes using transition matrices T(Δt) at the coarse-grained level (100 macrostates, PCCA+ clustering from the 24,216-node network), with different boundary conditions, estimated from Shaw's NTL9 trajectories [1]. (a) Unmodified equilibrium boundary condition. (b) Reflecting at F and I states boundary condition, where the dynamics are constrained within the unfolded ensemble. The implied timescales approximately level off after lag time Δt = 10 ns. The optimal lag time of 20 ns is chosen for further analysis based on the trade-off between the network being Markovian and the resolution being sufficient for studying the folding mechanism.

node MSM of NTL9. (a) The y coordinates of all the circles are the MFPTs calculated from Eq. The x coordinates of the blue circles are calculated from the right hand side of Eq, while the x coordinates of the red circles are calculated from the right hand side of Eq. The solid line is y = x. (b) The right plot shows the implied timescales at a lag time of 20 ns. The red line highlights the implied timescale corresponding to the dynamics between the F and U ensembles.
node MSM of NTL9 with an internal barrier introduced in the U ensemble. (a) The y coordinates of all the circles are the MFPTs calculated from Eq. The x coordinates of the blue circles are calculated from the right hand side of Eq, while the x coordinates of the red circles are calculated from the right hand side of Eq. The solid line is y = x. (b) The right plot shows the implied timescales at a lag time of 20 ns. The red line highlights the implied timescale corresponding to the dynamics between the F and U ensembles. A divergence between the red circles and the y = x line is observed, which means that Eq breaks down. In (b) the implied timescale of the dynamics within the U ensemble is even longer than that of the dynamics between the F and U ensembles (red line), which demonstrates that rapid equilibration does not hold in this case.
The distribution of Cα RMSD from the NMR structure (PDB entry 1UAO) of chignolin.
The y axis is the logarithm of the sum of all the elements in the transition matrix A (Eq. 4.2). This plot does not show a consistent linearity between log(Σ_{i,j} W_ij) and log ε². In the linear region, however, both the statistical and bias errors are small. The optimal ε should be chosen from this region, and is labelled as a red point.
The first 20 eigenvalues of the transition matrix A (Eq. 4.2). There are significant gaps between the first and second, and between the second and third, nonequilibrium eigenvalues.

4.4. The implied timescale spectrum versus lag time of model 1, built from the first 10,000 conformations of Shaw's chignolin trajectory. The distance metric is the conventional RMSD.
The implied timescale spectrum versus lag time of model 2, built from the first 10,000 conformations of Shaw's chignolin trajectory. The diffusion coordinates are used to describe geometrical similarity.
The implied timescale spectrum versus lag time of model 3, built from the first 10,000 conformations of Shaw's chignolin trajectory. Besides the diffusion maps used in model 2, model 3 also implements fuzzy C-means and transition-based assignment.
The 10,000 conformations used to build MSMs of chignolin are projected onto the first and second diffusion coordinates. Each conformation is colored according to the macrostate definition. All of the folded conformations sit on top of each other. The intermediate and unfolded state ensembles are well spread in the diffusion map space.
Fig. 1(a) of [2]. A schematic plot of the energy landscape of the simple double-well, with three discrete states: well A, well B and the transition state.
Comparison among conventional UWHAM, wise UWHAM and dTRAM results for macrostate populations using three independent trajectories (two unbiased trajectories starting from macrostates A and B, and one biased trajectory sampling from a flat potential). Two dashed lines represent the theoretical equilibrium populations of macrostates A and B. The transition state is omitted since its population is very small and follows a similar trend to the other macrostates. Error bars are calculated over 30 independent runs.

Chapter 1
Introduction

1.1 Protein Folding: Challenges and Theories

Proteins play vital roles in every aspect of living matter: they provide architectural support in tissues and bones, transport biological resources (e.g., the oxygen carriers hemoglobin and myoglobin), and regulate basic biological processes fundamental to life (e.g., immunological functions and reaction catalysis). Protein molecules consist of a linear chain of amino acids, which is not functional until it reaches a specific 3-D configuration, namely the folded state (or native state). The process by which an unfolded polypeptide chain reaches its folded state is called protein folding. It has been studied for decades by biophysicists and biochemists, not only because of its overwhelming complexity but also because advances in understanding it shed light on degenerative diseases such as Alzheimer's and Parkinson's diseases, which are caused by protein misfolding and aggregation [3].

Cyrus Levinthal [4] first asked how a protein can fold on a time scale of microseconds to seconds when the number of possible configurations is so astronomically large that sampling them all would take longer than the age of the universe. This argument assumes that all conformations are equally probable in configuration space, and it describes an energy landscape that looks like a flat golf course with a single hole at the free energy minimum. To resolve the paradox, Levinthal argued that the folded conformation of a protein should be determined by kinetic pathways. A variety of phenomenological models were developed to explain how the conformational space is sampled so that the folding time falls into the experimental range, such as the framework model [5, 6], the jigsaw puzzle model [7, 8], the diffusion-collision model [9], the hydrophobic collapse model [10-12] and the nucleation-condensation mechanism [13-16]. Later on, new approaches to the protein folding problem focused on the energy landscape of a polypeptide chain, which is much more general in character than the phenomenological models. As the energy landscape is one of the most fundamental factors governing reactions, including protein folding, there must be some energetic features that favor the formation of the native state. The emergence of the funnel-shaped energy landscape [17-20] provided a way to visualize the solution to Levinthal's paradox: reaching a global energy minimum and doing so quickly at the same time. Rather than looking for a specific folding pathway characterized by well-defined intermediate states, the funnel picture emphasizes a smooth (minimally frustrated) energy landscape with multiple competing folding pathways [21-24]. Such a landscape is compatible with local minima, which correspond to the ruggedness of the landscape resulting from inter-residue interactions. Transient trapping in the local minima slows down the exploration toward the stable native state. The principle of minimal frustration states that the interactions in foldable proteins should have fewer conflicts than expected; it was confirmed by extensive studies of many sequences [25].

1.2 Computer Simulations of Protein Folding

As computing power grows in line with Moore's law, computer simulation of proteins at the atomic level offers an accessible way of understanding the process of protein folding, provided that the simulation can capture both the thermodynamics and the kinetics of biological systems. However, it is still extremely challenging for most molecular dynamics (MD) simulations to reach biologically relevant time scales. Many advanced sampling methods have been developed to bridge the gap between biological and simulation time scales.
One of the popular approaches involves the use of generalized ensembles (GE), such as umbrella sampling [26] and replica exchange (RE) [27]. In umbrella sampling, biasing potentials are added to bridge regions of minima separated by high free energy barriers. The simulation then experiences a much less rugged energy landscape and nearly free diffusion. In RE, a set of parallel simulations at several different temperatures is performed so that replicas can walk randomly in temperature space. Periodically, temperature swaps are attempted using a Metropolis criterion so that the canonical distribution at every temperature is maintained. Ideally, the simulation at low temperature (most likely the target temperature of interest) gradually explores the landscape of interest with the help of more frequent transitions among different regions of conformational space at high temperatures. Combined with the weighted histogram analysis method (WHAM) [28-30], the distorted or biased energy landscape can be unbiased to recover the canonical ensemble using all configurations, no matter in which thermodynamic state they were sampled.

However, every coin has two sides. These advanced sampling methods have limited advantage, or even a disadvantage, when entropic rather than energetic barriers dominate at high temperatures, as can happen for replica exchange on larger systems. Moreover, some methods, such as umbrella sampling, require prior knowledge about the system so that appropriate biasing functions can be applied, and finding an efficient arrangement of the biasing functions usually takes considerable computational time. Even when these advanced methods perform better than traditional molecular dynamics simulation, it remains nontrivial to extract insightful and humanly comprehensible information from the massive simulation data (usually 10^6 or more snapshots). Markov state models (MSMs) are a powerful tool for dealing with these challenges. MSMs can be used not only to study thermodynamic properties of biomolecular systems, such as the equilibrium populations of metastable states, but also to investigate the kinetics of conformational changes, such as the time scales of transitions from the unfolded state to the folded state; these kinetic properties are not accessible from the histogram analysis method mentioned above.
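As a minimal sketch of how an MSM yields both equilibrium populations and kinetic time scales, the following Python example estimates a transition matrix from a discretized trajectory. The function names, the toy three-state chain and the naive count symmetrization are assumptions made for illustration only; they are not the estimator used in this thesis.

```python
import numpy as np

def simulate(T, n_steps, seed=0):
    """Generate a toy discrete trajectory from a row-stochastic matrix T."""
    rng = np.random.default_rng(seed)
    cdf = np.cumsum(T, axis=1)
    u = rng.random(n_steps)
    traj = np.empty(n_steps, dtype=int)
    state = 0
    for k in range(n_steps):
        traj[k] = state
        # Inverse-CDF sampling of the next state; min() guards against round-off in cdf
        nxt = int(np.searchsorted(cdf[state], u[k], side="right"))
        state = min(nxt, T.shape[0] - 1)
    return traj

def estimate_msm(dtraj, n_states, lag):
    """Row-normalized transition matrix from transition counts at the given lag."""
    counts = np.zeros((n_states, n_states))
    np.add.at(counts, (dtraj[:-lag], dtraj[lag:]), 1.0)
    # Naive symmetrization (an assumed, simple way to impose detailed balance)
    counts = 0.5 * (counts + counts.T)
    return counts / counts.sum(axis=1, keepdims=True)

def stationary_and_timescales(T, lag):
    """Equilibrium populations and implied timescales t_k = -lag / ln(lambda_k)."""
    evals, evecs = np.linalg.eig(T.T)
    order = np.argsort(-evals.real)
    evals = evals.real[order]
    pi = evecs[:, order[0]].real
    pi = pi / pi.sum()
    # Assumes the non-stationary eigenvalues are positive, as in the toy chain below
    timescales = -lag / np.log(evals[1:])
    return pi, timescales

# Hypothetical three-state chain with eigenvalues 1, 0.98 and 0.96
T_true = np.array([[0.98, 0.02, 0.00],
                   [0.01, 0.98, 0.01],
                   [0.00, 0.02, 0.98]])
```

Calling `estimate_msm(simulate(T_true, 100000), 3, lag=1)` and passing the result to `stationary_and_timescales` recovers populations close to (0.25, 0.5, 0.25) and a slowest implied timescale of roughly 50 steps, up to sampling noise.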
Essentially, an MSM is a graph whose nodes are clusters of configurations from detailed simulations such as MD, connected by a set of directed edges. Each cluster is an ensemble of geometrically similar or kinetically related configurations obtained by discretizing the data set. Edges represent the transitions between the two connected clusters at some time resolution, namely the lag time, and are parametrized from the trajectory. MSMs discretize and coarse-grain the conformational space of the system in a way that reflects the underlying dynamics and free energy landscape. In this sense, MSMs are extremely useful for identifying the metastable macrostates, the free energy differences between them, and dynamic processes such as the equilibration within an ensemble of macrostates or the relaxation to equilibrium.

Recently, a class of estimators that exploit ideas from both reweighting techniques and Markov modeling has emerged for studying the thermodynamics and kinetics of simulation data without requiring the data to be uncorrelated. They are known as the transition-based reweighting analysis method (TRAM) [2, 31, 32].

Commonly, the free energy landscape is projected onto order parameters to illustrate the resulting data set. However, if the order parameters are not chosen from the intrinsic reaction coordinates of the process of interest, the projection of the free energy landscape is not reliable. Unfortunately, the reaction coordinates are clear in only a very few cases, such as the dihedral angles (ψ and φ) of alanine dipeptide. For a complicated process such as protein folding, which takes place in a high-dimensional space (usually hundreds or even thousands of dimensions), there is still no rigorous way of determining a simple reaction coordinate. Root mean square deviation (RMSD), the number of native contacts and the radius of gyration are popular choices, but if the shape of the energy landscape depends strongly on the choice of reaction coordinate, these projections are of limited use. The question comes down to "how to find a meaningful geometric measure of data sets?", which is critical for analyzing high-dimensional molecular dynamics simulation data on a much lower-dimensional manifold and is the subject of dimensionality reduction. Dimensionality reduction seeks a mapping from the original data space to another space. The new space is supposed to have a simpler description, at the cost of discarding part of the information in the original data sets.
Constraints applied to the mapping function determine which information is preserved. Principal Component Analysis (PCA) [33] is a linear dimensionality reduction technique which aims to capture most of the variability in the data. Multidimensional Scaling (MDS) [34] maps data into a lower-dimensional space in such a way that pairwise distances among data points are preserved. Diffusion maps [35] embed high-dimensional data sets into a lower-dimensional Euclidean space via diagonalization of the transition matrix of a suitably defined Markov random walk on the given data.
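The diffusion map idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gaussian kernel, the bandwidth parameter `epsilon`, the plain row normalization and the function name `diffusion_map` are choices made here and need not match the construction used later in the thesis.

```python
import numpy as np

def diffusion_map(X, epsilon, n_coords=2):
    """Embed the rows of X using eigenvectors of a Markov random walk on the data."""
    # Pairwise squared Euclidean distances between data points
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / epsilon)      # Gaussian kernel (assumed choice)
    d = W.sum(axis=1)                    # row sums; M = D^-1 W is row-stochastic
    # S = D^-1/2 W D^-1/2 is symmetric and has the same eigenvalues as M
    S = W / np.sqrt(np.outer(d, d))
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(-evals)
    evals, evecs = evals[order], evecs[:, order]
    # Right eigenvectors of M are recovered as D^-1/2 times eigenvectors of S
    psi = evecs / np.sqrt(d)[:, None]
    # Skip the trivial constant eigenvector (eigenvalue 1)
    lam = evals[1:n_coords + 1]
    return lam, psi[:, 1:n_coords + 1] * lam
```

For data made of two well-separated clusters, the first diffusion coordinate returned by this sketch takes opposite signs on the two clusters, which is the sense in which the embedding reflects the slow dynamics of the random walk rather than raw geometric distance.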

1.3 Thesis Organization

In chapter 2, Markov processes and the construction of an MSM are introduced. I provide a general view of Markov state models and describe in more technical detail how an MSM is built for the miniprotein Trp-cage. The study of the Trp-cage MSM raises an important question: "how long does it take to equilibrate the unfolded state of a protein?" The answer to this question has important implications for our understanding of why many small proteins (single domain, fewer than 100 residues) fold with two-state kinetics. When the equilibration within the unfolded (U) state is much faster than folding, the folding kinetics will be two-state even if there are many folding pathways with different barriers. Yet the mean first passage times (MFPTs) between different regions of the unfolded state can be much longer than the folding time. This seems to imply that the equilibration within U is much slower than folding. We attempt to resolve this paradox by presenting a formula for estimating the time to equilibrate the unfolded state of a protein. We also present a formula for the MFPT to any state within U, which is proportional to the average lifetime of that state divided by the state population. This relation is valid when the equilibration within U is very fast compared with folding, as it often is for small proteins. To illustrate the concepts, we apply the formulas to estimate the time to equilibrate the unfolded state of Trp-cage and the MFPTs within the unfolded state, based on a Markov state model parameterized from an ultra-long 208 microsecond trajectory of the miniprotein. The time to equilibrate the unfolded state of Trp-cage is 100 ns, while the typical MFPTs within U are tens of microseconds or longer. In chapter 3, a simple relation that shows the mean first passage time to any state is equal to the relaxation time of that state divided by the equilibrium population is derived.
This explains why mean first passage times from state to state within the unfolded ensemble can be very long but the energy landscape can still be smooth (minimally frustrated). In fact, when the folding kinetics is two-state, all of the unfolded state relaxation times within the unfolded free energy basin are faster than the folding time. This result supports the well-established funnel energy landscape picture and resolves an apparent contradiction between this model and the recently proposed kinetic hub model of protein folding. We

validate these concepts by analyzing a Markov state model of the unfolded-state kinetics and folding of the mini-protein NTL9 (the N-terminal domain of the ribosomal protein L9), constructed from a 2.9 millisecond simulation provided by D. E. Shaw Research. In chapter 4, we continue to study kinetics in biological systems, but with methods other than MSMs. First, a nonlinear dimensionality reduction technique, diffusion maps, is introduced. Three models are built based on different distance metrics and state assignment methods. We aim to show that the diffusion map embedding better preserves the intrinsic dynamics and systematically extracts reaction coordinates from data sets. Second, the discrete transition-based reweighting analysis method (dTRAM) is used to evaluate the equilibrium populations of a simple illustrative model. We examine how quickly the dTRAM population estimates converge to the reference values as the simulation length increases. Finally, we show that the way thermodynamic states are defined significantly affects the performance of the unbinned weighted histogram analysis method.

Chapter 2

Kinetics Study of Proteins Using Markov State Models

2.1 Introduction to Markov Processes

Discrete Time Processes

Markov processes are stochastic processes that describe the time evolution of a system whose state is given by a random variable $X$ taking values in a state space $\Omega$. The states in $\Omega$ are labeled $\{1, 2, \ldots, i, \ldots\}$ according to some discretization. Here we discuss only a finite, discrete state space. We denote by $\mathrm{Prob}\{j, t_n \mid 1, t_1; 2, t_2; \ldots; i, t_{n-1}\}$ the probability that the random variable is in state $j$ at time $t_n$ given that it was in the series of states $1, 2, \ldots, i$ at times $t_1, t_2, \ldots, t_{n-1}$. A stochastic process is called a Markov process, or Markovian, if

$$\mathrm{Prob}\{j, t_n \mid 1, t_1; 2, t_2; \ldots; i, t_{n-1}\} = \mathrm{Prob}\{j, t_n \mid i, t_{n-1}\}. \tag{2.1}$$

The equation above states that the probability that the random variable $X$ is in state $j$ at time $t_n$ is independent of all previous states except the one just before it, state $i$ at time $t_{n-1}$. A Markov process is therefore often described as memoryless: predictions about the future realization of the random variable depend only on its current value. This probability is usually called the transition probability, with $T_{ij}$ denoting the transition probability from state $i$ to state $j$, $i, j \in \Omega$. Since the state space is discrete, the stochastic process is also represented by a Markov chain. Let the row vector $\mathbf{P} = \{P_1, P_2, \ldots, P_N\}$ be the population distribution over all the discretized states in $\Omega$. A normalization condition is maintained,

$$\sum_{i=1}^{N} P_i = 1. \tag{2.2}$$

Two definitions are introduced as follows. A homogeneous process is one such that

$\mathrm{Prob}\{j, t_n \mid i, t_{n-1}\}$ depends only on the time interval $t_n - t_{n-1}$ rather than on any particular time. Secondly, a stationary process is defined as a homogeneous process whose probability distribution $\mathbf{P}$ is time-independent. However, one is usually interested in the time evolution of the probability distribution of a system. Discretize time into units of $\Delta t$, the smallest time increment, so that any time $t$ can be expressed as the product of an integer and the time unit, $m\Delta t$. The probability distribution can be evolved forward by repeated multiplication with the transition matrix,

$$\mathbf{P}(t + m\Delta t) = \mathbf{P}(t)\,[\mathbf{T}(\Delta t)]^{m}, \tag{2.3}$$

where $\mathbf{T}(\Delta t)$ is the transition matrix with elements $T_{ij}$. From now on, $\Delta t$ is implied for the transition matrix $\mathbf{T}$ and its elements $T_{ij}$ unless otherwise noted. The transition matrix $\mathbf{T}$ is a row-stochastic matrix, so its normalization condition is

$$\sum_{j=1}^{N} T_{ij} = 1. \tag{2.4}$$

By increasing $m$, the probability distribution at any later time can be predicted. When $m$ goes to infinity, the stationary distribution of $\mathbf{P}$ is reached, which is also called the equilibrium distribution $\mathbf{P}_{\mathrm{eq}} = \{P_{\mathrm{eq}}(1), P_{\mathrm{eq}}(2), \ldots, P_{\mathrm{eq}}(N)\}$,

$$\mathbf{P}_{\mathrm{eq}} = \lim_{m \to \infty} \mathbf{P}(t)\,[\mathbf{T}(\Delta t)]^{m}. \tag{2.5}$$

Some remarks on the classification of the state space $\Omega$ are in order. We define a set of classes for the states in $\Omega$. State $j$ is said to be accessible from state $i$ if it can be reached in a finite number of steps, in other words, if there is a finite $t$ such that $T_{ij}(t) > 0$. If state $j$ is accessible from state $i$ and vice versa, then states $i$ and $j$ are in the same communication class. $\Omega$ can then be divided into a set of non-overlapping communication classes. If $\Omega$ consists of a single communication class, the process is irreducible; otherwise, it is reducible. In the directed graph representation, the process is irreducible if and only if the graph is strongly connected. When a system is in an equilibrium steady state, detailed balance is obeyed,

$$P_i T_{ij} = P_j T_{ji}. \tag{2.6}$$

The left-hand side of equation 2.6 is the flux (number of transitions per unit time) flowing from state $i$ to state $j$, which at equilibrium must equal the flux flowing from state $j$ to state $i$. Otherwise, there would be a net flux between states, which would drain some states and is therefore inconsistent with the definition of the steady state. Because detailed balance provides a symmetry property for the transition matrix, it guarantees that $\mathbf{T}$ can be diagonalized and decomposed in terms of its eigenvalues and eigenvectors. Let the eigenvalue spectrum of $\mathbf{T}$ be $\lambda_i$, where $i$ runs from 1 to $N$. The sorted eigenvalues then satisfy $1 = \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_N \geq 0$. We can obtain a set of left and right eigenvectors,

$$\mathbf{T}\psi_i^{R} = \lambda_i \psi_i^{R}, \tag{2.7}$$

$$\psi_i^{L}\mathbf{T} = \lambda_i \psi_i^{L}, \tag{2.8}$$

where $\psi_i^{R}$ is the $i$th right eigenvector, a column vector, and $\psi_i^{L}$ is the $i$th left eigenvector, a row vector. The normalization condition for these two sets of eigenvectors is

$$\psi_i^{L}\psi_j^{R} = \delta_{ij}, \tag{2.9}$$

where $\delta_{ij}$ is the Kronecker delta. It can be verified that $\psi_1^{L} = \mathbf{P}_{\mathrm{eq}}$ and $\psi_1^{R} = \{1, 1, \ldots, 1\}$. In fact, the two sets of eigenvectors are related by

$$\psi_n^{R}(i) = \frac{\psi_n^{L}(i)}{\psi_1^{L}(i)}. \tag{2.10}$$

If $t = m\Delta t$, the transition matrix element has the spectral decomposition

$$T_{ij}(t) = \sum_{n=1}^{N} \lambda_n^{m}\,\psi_n^{R}(i)\,\psi_n^{L}(j). \tag{2.11}$$

With the spectral decomposition, it is convenient to write the probability distribution in terms of the eigenvalues and eigenvectors,

$$\mathbf{P}(t) = \sum_{n=1}^{N} \psi_n^{L}\,[\mathbf{P}(0)\cdot\psi_n^{R}]\,\lambda_n^{m}, \tag{2.12}$$

where $\mathbf{P}(0)$ is the initial probability distribution.
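As an illustration of equations 2.3 to 2.12, the following minimal Python sketch propagates a distribution, extracts the equilibrium distribution and checks detailed balance. The three-state matrix and all function names are hypothetical and chosen only for illustration; they are not taken from the models studied in this thesis, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical 3-state row-stochastic transition matrix (illustrative values only).
T = np.array([[0.90, 0.10, 0.00],
              [0.05, 0.90, 0.05],
              [0.00, 0.10, 0.90]])

def stationary_distribution(T):
    """Left eigenvector of T with eigenvalue 1, normalized to sum to 1."""
    evals, evecs = np.linalg.eig(T.T)   # right eigenvectors of T^T = left eigenvectors of T
    p = evecs[:, np.argmax(evals.real)].real
    return p / p.sum()

def propagate(p0, T, m):
    """Equation 2.3: P(t + m*dt) = P(t) T^m."""
    return p0 @ np.linalg.matrix_power(T, m)

def spectral_propagate(p0, T, m):
    """Equation 2.12: the same propagation written with eigenvalues and eigenvectors."""
    evals, R = np.linalg.eig(T)         # columns of R: right eigenvectors (eq. 2.7)
    L = np.linalg.inv(R)                # rows of L: left eigenvectors, L @ R = I (eq. 2.9)
    return (((p0 @ R) * evals**m) @ L).real

p_eq = stationary_distribution(T)
flux = p_eq[:, None] * T                # flux[i, j] = P_eq(i) * T_ij
p0 = np.array([1.0, 0.0, 0.0])          # all population starts in state 0

print("equilibrium distribution:", p_eq)
print("detailed balance holds:", np.allclose(flux, flux.T))
print("distribution after 50 steps:", propagate(p0, T, 50))
```

Inverting the matrix of right eigenvectors is a convenient way to obtain left eigenvectors that already satisfy the normalization of equation 2.9; for large MSMs a sparse eigensolver would be a more practical choice.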

Continuous Time Processes

We now discuss the continuous time limit of Markov processes. Setting $m = 1$ in equation 2.3, subtracting $\mathbf{P}(t)$ from both sides and dividing by $\Delta t$, we have

$$\frac{\mathbf{P}(t + \Delta t) - \mathbf{P}(t)}{\Delta t} = \mathbf{P}(t)\,\frac{\mathbf{T} - \mathbf{I}}{\Delta t}, \tag{2.13}$$

where $\mathbf{I}$ is the identity matrix. Taking the right-hand limit $\Delta t \to 0^{+}$, we obtain the master equation for continuous time Markov processes,

$$\frac{d\mathbf{P}}{dt} = \mathbf{P}\mathbf{K}, \tag{2.14}$$

where $\mathbf{K} = \lim_{\Delta t \to 0^{+}} \frac{\mathbf{T} - \mathbf{I}}{\Delta t}$ is called the rate matrix. The element $K_{ij}$ of the rate matrix is the rate constant from state $i$ to state $j$ in a physical system. Because $\mathbf{T}$ is row-stochastic, each row of the rate matrix $\mathbf{K}$ in equation 2.14 sums to zero,

$$\sum_{j=1}^{N} K_{ij} = 0, \tag{2.15}$$

where the diagonal term $K_{ii}$ is the negative sum of all the outgoing rates from state $i$. Invoking the detailed balance condition,

$$P_i K_{ij} = P_j K_{ji}, \tag{2.16}$$

the spectral decomposition can be derived in the same way as for discrete Markov processes. It can also be obtained in an alternative way. Let us define a symmetric matrix $\mathbf{K}^{\mathrm{sym}}$, which shares the diagonal elements of the rate matrix $\mathbf{K}$ [36]. The off-diagonal elements are defined as $K^{\mathrm{sym}}_{ij} = \sqrt{K_{ij}K_{ji}} = P_{\mathrm{eq}}^{-1/2}(j)\,K_{ij}\,P_{\mathrm{eq}}^{1/2}(i)$, which yields

$$\mathbf{K}^{\mathrm{sym}} = \mathbf{P}_{\mathrm{eq}}^{1/2}\,\mathbf{K}\,\mathbf{P}_{\mathrm{eq}}^{-1/2}, \tag{2.17}$$

where the diagonal matrix $\mathbf{P}_{\mathrm{eq}}^{\pm 1/2} = \mathrm{diag}[P_{\mathrm{eq}}^{\pm 1/2}(1), P_{\mathrm{eq}}^{\pm 1/2}(2), \ldots, P_{\mathrm{eq}}^{\pm 1/2}(N)]$. The symmetric matrix $\mathbf{K}^{\mathrm{sym}}$ therefore has a real set of eigenvalues, and so does $\mathbf{K}$, since it is related to $\mathbf{K}^{\mathrm{sym}}$ by a similarity transformation. The eigenvalues and eigenvectors are expressed as

$$\mathbf{K}^{\mathrm{sym}}\phi_i = \mu_i \phi_i \tag{2.18}$$

$$\mathbf{K}\psi_i^{R} = \mu_i \psi_i^{R} \tag{2.19}$$

$$\psi_i^{L}\mathbf{K} = \mu_i \psi_i^{L}, \tag{2.20}$$

where $\mu_i$ are the eigenvalues, which are less than or equal to zero, $\phi_i$ is the $i$th orthonormal eigenvector of $\mathbf{K}^{\mathrm{sym}}$, $\psi_i^{R}$ is the $i$th right eigenvector of $\mathbf{K}$, and $\psi_i^{L}$ is the $i$th left eigenvector of $\mathbf{K}$. They are normalized according to the Euclidean inner product,

$$\psi_m^{L}\psi_n^{R} = \delta_{mn} \tag{2.21}$$

$$\phi_m^{T}\phi_n = \delta_{mn}. \tag{2.22}$$

In fact, the left and right eigenvectors of $\mathbf{K}$ are related through the set of eigenvectors $\phi$,

$$\psi_n^{L} = \mathbf{P}_{\mathrm{eq}}^{1/2}\phi_n \tag{2.23}$$

$$\psi_n^{R} = \mathbf{P}_{\mathrm{eq}}^{-1/2}\phi_n. \tag{2.24}$$

We can now write the solution of the master equation 2.14 as a spectral decomposition,

$$\mathbf{P}(t) = \sum_{n=1}^{N} \psi_n^{L}\,[\mathbf{P}(0)\cdot\psi_n^{R}]\,e^{\mu_n t}. \tag{2.25}$$

By comparison, the relationship between the transition matrix and the rate matrix can be written as

$$\mathbf{T}(t) = e^{\mathbf{K}t} \tag{2.26}$$

$$\lambda_i = e^{\mu_i t}. \tag{2.27}$$

2.2 Construction of Markov State Models

Several software packages, such as MSMBuilder2 [37] and EMMA [38], provide a systematic way to construct Markov state models and automate the procedure. Although these packages save a great deal of technical effort, it is still useful and necessary to understand how the simulation data are handled inside the black box. This section gives a practical guide to building a Markov state model from scratch. Currently, the most common approach to building an MSM is a two-phase process. The first phase is state space discretization, which divides the state space into many small-volume

boxes based on geometric similarity. The second phase is to estimate the transition matrix on the graph built in the first phase by projecting either several long trajectories or a large number of short trajectories onto the graph. Once the transition matrix is obtained, many further analyses can be performed, including estimation of the important time scales of the system [39], macrostate identification, calculation of observables [40–42] and calculation of transition pathways [43].

State Space Discretization

Generally speaking, samples obtained from computer simulation are discrete points drawn from a continuous state space. The probability of visiting exactly the same conformation twice is essentially zero. To generate an MSM, the high-dimensional state space of a molecule has to be divided into small-volume boxes, the microstates. We assume that all the conformations within a microstate are both kinetically and geometrically similar. One would like to generate microstates as finely as possible so that there are no appreciable free energy barriers within a microstate. However, the finer the microstate network is, the poorer the statistics within each microstate are, so a trade-off has to be made in choosing the resolution. There are many fine-grained clustering methods that can group kinetically and geometrically related conformations together, such as hierarchical clustering, the k-means algorithm, the k-centers algorithm, hybrid k-centers/k-medoids clustering and fuzzy clustering. There is no single correct answer for clustering, and different methods work under different circumstances. The key step before clustering is to pick a good distance metric, for which the RMSD between atoms is a popular choice. The RMSD over all atoms is computationally expensive. It is generally sufficient to use the RMSD between α-carbons or between all backbone heavy atoms.
However, alternative distance metrics, such as diffusion coordinates [44], were recently reported to give a better discretization than the traditional RMSD. MSMs at the fine-grained level are quite useful for predicting observables that can be determined experimentally [41, 42]. However, because of the large size of these models, it is difficult to comprehend them and to extract insight about the system. To build a humanly comprehensible model, one can perform a coarse-grained clustering, which groups kinetically related sets of microstates into larger aggregates called macrostates.

Macrostates lose the ability to make quantitative predictions but are very useful for gaining insight into the free energy basins of a system. New hypotheses can be generated at the coarse-grained level and tested at the fine-grained level.

Clustering Algorithms at Fine-grained Level

We first introduce the clustering algorithms at the fine-grained level, where snapshots are grouped into tens of thousands of microstates. A few of the most commonly used clustering algorithms are described here. In the k-centers clustering algorithm, given a graph $G = (V, E)$ that represents the conformations of a state space and the connections among them, the task is to find a subset $C \subseteq V$ that minimizes

$$\min_{c_i \in C}\ \max_{x \in V}\ d(x, c_i(x)), \tag{2.28}$$

where $d(x, y)$ is the distance between the two conformations $x$ and $y$, $c_i$ is the cluster center of the $i$th cluster, and $c_i(x)$ denotes the cluster center closest to $x$, to which $x$ is assigned. In other words, the task is to find a set of cluster centers $\{c_1, c_2, \ldots, c_k\}$ for which the largest distance of any conformation $x$ to its closest cluster center $c_i(x)$ is minimized. One property of k-centers is that the state space is discretized relatively evenly and the clusters have similar radii. Although the k-centers problem is NP-hard, a fast approximation algorithm has been developed [45], with a running time of $O(kN)$, where $k$ is the number of clusters and $N$ is the number of conformations. The steps of this approximate algorithm are as follows:

1. Pick a random starting conformation as the initial cluster center and assign all the other conformations to this cluster center. Calculate the distances between the cluster center and all other conformations.
2. Select the conformation which has the largest distance as the next cluster center.
Reassign all the conformations other than the cluster centers to the new cluster center if the distance between the conformations and the new cluster center is shorter than the distance between the conformations and their previously-assigned cluster center.

3. Repeat the above step (always selecting the conformation farthest from all the existing cluster centers as the next cluster center) until the preset number of cluster centers is reached or the maximum size of any cluster falls to the preset value.

In contrast to the k-centers algorithm, the k-medoids clustering algorithm minimizes the mean squared distance between each conformation and its cluster center,

$$\min_{c_i}\ \frac{1}{N}\sum_{x} d(x, c_i(x))^{2}, \tag{2.29}$$

where $d(x, y)$ is the distance between the two conformations $x$ and $y$, $c_i$ is the cluster center of the $i$th cluster, $c_i(x)$ denotes the cluster center closest to $x$, and $N$ is the number of conformations. The steps of k-medoids are as follows:

1. Pick $k$ random starting conformations as the initial cluster centers and assign all the other conformations to their closest cluster center.
2. In each cluster $c_i$, choose a random conformation $y$ in the cluster as a candidate cluster center and evaluate $\sum_{x_i \in c_i} d(x_i, y)$. If this sum is reduced by the choice of the new cluster center, then swap the old cluster center for the new one, $y$.
3. Repeat the above step until a preset number of iterations is reached.

One advantage of k-medoids over k-centers is that k-medoids is more likely to generate clusters with a more even number of conformations. This feature is preferable because k-centers sometimes creates clusters with very few conformations, so that the transition probability estimates among those clusters are not statistically reliable. k-medoids does not have this problem, since it generates more clusters in the dense regions of the state space and fewer clusters in the sparse regions, at the potential cost of over-discretization or under-discretization. The k-centers clustering algorithm minimizes the worst-case clustering error, since the maximum distance between the cluster center and any other conformation within that cluster is minimized.
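The two sets of steps above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the implementation used for the models in this thesis: conformations are represented as points in a plain Euclidean space instead of being compared by RMSD, cluster assignments are kept fixed during the k-medoids sweep, and the function names and toy data are hypothetical.

```python
import numpy as np

def k_centers(X, k, seed=0):
    """Approximate k-centers: repeatedly promote the point farthest from all centers."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]              # step 1: random first center
    dist = np.linalg.norm(X - X[centers[0]], axis=1)   # distance to the closest center so far
    assign = np.zeros(len(X), dtype=int)               # everything starts in cluster 0
    while len(centers) < k:
        new = int(np.argmax(dist))                     # step 2: farthest point becomes a center
        centers.append(new)
        d_new = np.linalg.norm(X - X[new], axis=1)
        closer = d_new < dist                          # reassign only points that are now closer
        assign[closer] = len(centers) - 1
        dist[closer] = d_new[closer]
    return np.array(centers), assign, dist

def k_medoids_sweep(X, centers, assign, n_iter=10, seed=0):
    """k-medoids update: try a random member as center, keep it if the summed distance drops."""
    rng = np.random.default_rng(seed)
    centers = centers.copy()
    for _ in range(n_iter):
        for c in range(len(centers)):
            members = np.flatnonzero(assign == c)
            y = int(rng.choice(members))               # random candidate center
            cost_old = np.linalg.norm(X[members] - X[centers[c]], axis=1).sum()
            cost_new = np.linalg.norm(X[members] - X[y], axis=1).sum()
            if cost_new < cost_old:
                centers[c] = y                         # accept the swap
    return centers

# Toy data: two well-separated groups of points (illustrative only).
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centers, assign, dist = k_centers(X, k=2)
refined = k_medoids_sweep(X, centers, assign)
print("k-centers:", centers, "assignments:", assign, "largest radius:", dist.max())
print("centers after k-medoids sweep:", refined)
```

Each new center costs one pass over all conformations, which is the origin of the $O(kN)$ running time quoted above.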
In contrast, k-medoids minimizes the average error, since the objective function 2.29 penalizes the ensemble-averaged deviation from the cluster centers. To combine the advantages of the k-centers and k-medoids clustering algorithms, a hybrid k-centers/k-medoids clustering algorithm was developed [37]. The general steps are:

1. Obtain a set of initial cluster centers $c_i$ by using the approximate k-centers algorithm.
2. Take $c_i$ as the input of k-medoids and reject any swap that increases the objective function 2.28.

Hybrid k-centers/k-medoids clustering improves the state definition because the discretization of the state space is more even, while the cluster centers are shifted toward the densest regions of the conformational space.

Clustering Algorithms at Coarse-grained Level

After the clustering methods described above have been applied, the trajectories are reduced to clusters, which constitute the fine-grained models. The size of these MSMs is still too large for a person to visualize. The model becomes humanly comprehensible, and can be used to gain intuition about a system, if the number of clusters is further reduced to, for example, tens of clusters. Here we review a few methods that have been developed to build coarse-grained MSMs. Perron Cluster Cluster Analysis (PCCA) is one such coarse-graining method [46]. Let the fine-grained partition of the state space $\Omega$ be $c_i$, where $i$ runs over all microstate cluster centers. The goal is to group the $c_i$ into macrostates $m_j$ (where $j$ is the macrostate index) so that the dynamics remains in each cluster $m_j$ for a long time before jumping to another cluster $m_k$. In other words, each macrostate cluster $m_j$ is associated with a free energy basin. It is known that the slow kinetics are described by the dominant eigenvectors of the transition matrix. The basic idea of PCCA is to use the sign structure of these dominant eigenvectors to decompose the conformational space into metastable states. As an example, I take a 1-D quadri-well potential diffusion model, in which a particle moves on a 1-D potential and is governed by the Langevin equation. The potential consists of four wells, which can be characterized as four populated metastable states (Fig. 2.1).
One hundred microstate cluster centers are obtained from 100,000 samples by using k-means. A 100 by 100 transition matrix is estimated from the trajectory. The transition matrix is diagonalized, and its eigenvectors are ordered by decreasing magnitude of the corresponding eigenvalues. In Fig. 2.2a, the components of the

first eigenvector, corresponding to the largest eigenvalue, are proportional to the equilibrium population of each cluster. The remaining eigenvectors characterize the dynamics of the different eigenmodes of the system. Using the second eigenvector (Fig. 2.2b), one splits the microstates into one group with negative components and another group with positive components. PCCA then uses the third eigenvector (Fig. 2.2c), which gives rise to two macrostates in the negative region. Next, the positive region is divided into two macrostates based on the sign structure of the components of the fourth eigenvector (Fig. 2.2d). We now have four macrostates suggested by PCCA using the three leading eigenvectors (not including the equilibrium eigenvector). The partition is consistent with our intuition, since the potential has four free energy basins. PCCA is a good and natural coarse-graining algorithm if different free energy basins are separated by significant free energy barriers and good statistics are obtained in every macrostate. Not all real-world systems coarse-grain as cleanly as this simple model. PCCA does not always work properly, because uncertainties in the eigenvector components propagate, so that the signs of small-magnitude components can be arbitrary. The microstates corresponding to these small-magnitude components can then be assigned to macrostates arbitrarily, which causes problems. To overcome the limitations of PCCA, PCCA+ was developed [47, 48] to create a robust coarse-graining. PCCA+ considers all $n$ eigenvectors simultaneously if $n$ macrostates are intended. PCCA+ generates a membership vector for every microstate, which allows a microstate to belong to more than one macrostate with different levels of membership. More details about this algorithm can be found in [47].

Estimation of Transition Matrices

The second phase of building an MSM is to estimate the transition matrices, which involves counting the transitions among the microstates.
After the clustering, the original trajectory is mapped onto the cluster centers to obtain a count matrix $\mathbf{C}$, where the element $C_{ij}$ is the number of transitions observed from state $i$ to state $j$. The estimated transition matrix is then obtained as

$$T_{ij}(\Delta t) = \frac{C_{ij}}{\sum_{l} C_{il}}. \tag{2.30}$$
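Equation 2.30 can be illustrated with a short Python sketch. It assumes that the trajectory has already been mapped to integer microstate indices; the toy trajectory, the function names and the `sliding` switch (which selects between counting from every snapshot and counting only from snapshots one lag time apart) are hypothetical and serve only as an illustration.

```python
import numpy as np

def count_matrix(dtraj, n_states, lag, sliding=True):
    """C[i, j] = number of observed transitions i -> j separated by `lag` snapshots."""
    dtraj = np.asarray(dtraj)
    step = 1 if sliding else lag          # sliding window: every snapshot is a starting point
    start = dtraj[:-lag:step]
    end = dtraj[lag::step]
    C = np.zeros((n_states, n_states))
    np.add.at(C, (start, end), 1)         # accumulate counts, including repeated (i, j) pairs
    return C

def transition_matrix(C):
    """Row-normalize the count matrix (equation 2.30); assumes every state has outgoing counts."""
    return C / C.sum(axis=1, keepdims=True)

# Toy discretized trajectory over two microstates (illustrative only).
dtraj = [0, 0, 1, 1, 0, 0, 1, 1, 0]
C_sliding = count_matrix(dtraj, n_states=2, lag=2, sliding=True)
C_independent = count_matrix(dtraj, n_states=2, lag=2, sliding=False)
T_est = transition_matrix(count_matrix(dtraj, n_states=2, lag=1))

print("sliding-window counts:\n", C_sliding)
print("independent counts:\n", C_independent)
print("estimated T at lag 1:\n", T_est)
```

The sliding-window counts use more of the data than the independent counts, at the price of statistically correlated transition counts.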

Figure 2.1: The population distribution of a 1-D quadri-well potential diffusion model. There are 100,000 samples (x coordinates), which are drawn from the potential using the Langevin equation. Four free energy basins can be observed.

Here $\Delta t$ in equation 2.30 is the lag time, which sets the time resolution of the MSM. When collecting the statistics for the count matrix $\mathbf{C}$, a sliding window method is often used to make full use of the data. Suppose there is a long molecular dynamics trajectory in which a conformation is saved every 200 ps, and a lag time of 2 ns is chosen to estimate the count matrix. For independent counting, a transition is recorded every ten snapshots. This is equivalent to subsampling the trajectory so that only one tenth of the data is used, which is extremely inefficient. As the lag time increases, less and less data remains, until it is no longer statistically sufficient to estimate the count matrices. In contrast, the sliding window method treats every snapshot as a starting point and counts the transition between every snapshot and the snapshot recorded ten snapshots later. The detailed balance condition is very important for the transition matrices, since it ensures time reversibility and real eigenvalues. A symmetrized count matrix is used to guarantee equation 2.6,

$$C^{\mathrm{sym}}_{ij} = \frac{1}{2}(C_{ij} + C_{ji}). \tag{2.31}$$

Imposing detailed balance in this way introduces statistical bias in both equilibrium and kinetic properties [49], especially in the case of many short trajectories. It is suggested to

Figure 2.2: The top four eigenvectors of a 100 by 100 transition matrix estimated from a 1-D quadri-well potential diffusion model. For each eigenvector, the components and the corresponding cluster center x-coordinates are plotted. a) The first eigenvector indicates the equilibrium population of the potential. b) The second eigenvector describes the dynamics between the negative and positive regions. So the boundary is established at x = 0. c) The third eigenvector splits the negative region into two basins. d) The fourth eigenvector divides the positive region into two macrostates.

remove the minimum number of links needed to obtain an irreducible directed graph and to keep the data in the maximal ergodic subgraph. In other words, every vertex is reachable from every other vertex in an ergodic subgraph. Nodes between which transitions are observed in only one direction are trimmed off, so that only strongly connected components remain. Tarjan's algorithm was designed to achieve this goal [50]. After that, a reversible count matrix is estimated using a maximum likelihood estimator for reversible Markov state models. The steps are listed below:

1. Obtain the observed count matrix $\mathbf{C}$. Suppose a symmetric count matrix $\mathbf{X}$ exists. Define $C_i := \sum_j C_{ij}$ and $X_i := \sum_j X_{ij}$.
2. Update

$$X_{ii} = \frac{C_{ii} X_i}{C_i} \quad \text{and} \quad X_{ij} = \frac{C_{ij} + C_{ji}}{\dfrac{C_i}{X_i} + \dfrac{C_j}{X_j}}.$$

Repeat the last step until the error falls below a chosen tolerance. If the trajectory is long enough, equation 2.30 is the maximum likelihood estimator for the transition matrices.

Build MSM for Trp-Cage

In this section, we use the mini-protein Trp-cage to illustrate the generation of Markov state models, and we study its folding mechanism in the following sections. Trp-cage, a 20-residue artificial protein designed by Neidigh et al. [51], forms a compact hydrophobic core containing a tryptophan residue and an α-helix. Its experimental folding time is around 4 µs [52] and the folding is cooperative. This fast folding enables computational biophysicists to study the folding mechanism and the thermodynamic and kinetic properties of Trp-cage in all-atom computer simulations using both implicit and explicit solvent models [52–62]. The MD simulation trajectory of Trp-cage analyzed here was generated on the Anton supercomputer, which was designed and built by D. E. Shaw Research [63], using a modified CHARMM22 [64] all-atom force field in TIP3P explicit solvent. The simulation is 208 µs long and contains 12 folding and unfolding events.
The Trp-cage with the sequence ASP-ALA-TYR-ALA-GLN-TRP-LEU-ALA-ASP-GLY-GLY-PRO-SER-SER-GLY-ARG-PRO-PRO-PRO-SER was solvated in 65

Figure 2.3: The NMR structure (PDB entry 2JOF) of Trp-cage. It consists of an α-helix, a $3_{10}$-helix and a hydrophobic core, which contains a Trp in the middle.

Figure 2.4: The distribution of Cα RMSD from the NMR structure (PDB entry 2JOF) of Trp-cage.

mM NaCl in a cubic box of 37 Å side length containing 1,700 water molecules. The MD trajectory contains about one million conformations saved every 200 ps. The Cα RMSD distribution of Trp-cage (Fig. 2.4) shows two peaks. A sharp peak is located at RMSD = 1.3 Å. A broad peak lies at approximately RMSD = 6 Å and is separated from the sharp peak by a weakly populated intermediate region. Based on the RMSD distribution, we divide all the conformations into three ensembles (macrostates): folded (RMSD ≤ 2.2 Å, denoted F), intermediate (2.2 Å < RMSD ≤ 5 Å, I) and unfolded (RMSD > 5 Å, U). The populations of F, I and U are 17.5%, 15.1% and 67.4%, respectively. These definitions are used in the following analysis, for example when calculating the important time scales. Different definitions of the macrostates are used in Shaw et al. [1], where the fraction of native contacts is the criterion. Even though there are some quantitative differences between their work and ours, such as in the mean transition path time, these do not affect the interpretation of the main results. The consistency between the two definitions of macrostates can be seen in Fig. 2.5 and Fig. 2.6. When the Q value is close to 1, the RMSD approaches the folded region (RMSD ≤ 2.2 Å). When the Q value is close to 0, the RMSD lies in the unfolded region. To analyze the folding kinetics of Trp-cage, we constructed an MSM based on the Shaw trajectory. Since the number of conformations is too large (about one million snapshots),

Figure 2.5: The time series of Cα RMSD from the NMR structure (PDB entry 2JOF) of Trp-cage.

Figure 2.6: The time series of native contact with respect to the NMR structure (PDB entry 2JOF) of Trp-cage.

which may cause computational problems in the clustering phase, we uniformly subsampled the trajectory so that one tenth of the original trajectory was used for the clustering. A set of 25,000 cluster centers was generated by k-centers, and the rest of the conformations were assigned to their closest cluster centers based on geometric distance (RMSD). The transition matrices at different lag times were then estimated. The eigenvalues $\lambda_i$ of the transition matrices in equation 2.7 correspond to physical time scales, indicating how fast the eigenmodes propagate the population toward equilibrium [65, 66]:

$$\tau_i^{\mathrm{implied}}(\Delta t) = -\frac{\Delta t}{\ln \lambda_i(\Delta t)}, \tag{2.32}$$

where $\Delta t$ is the lag time of the MSM. $\tau_i^{\mathrm{implied}}$ is called the $i$th implied timescale, excluding the case $i = 1$, since $\lambda_1 = 1$ is the equilibrium eigenvalue, corresponding to the stationary state, whose time scale is infinite. Generally, for a discrete MSM there is an underlying rate matrix, which is difficult to infer (but can be inferred [67]) and which generates the transition matrices at any lag time (given by equation 2.27). A self-consistent and well-constructed MSM should therefore have implied timescales that are independent of the lag time. A plot of implied timescale versus lag time (Fig. 2.7a) is a common way of testing whether the dynamics are Markovian (Markovianity). Further details about the MSM of Trp-cage are presented in a later section.

Important Quantities of Kinetics

In order to obtain folding pathways, some important quantities have to be defined. The main ingredient for calculating the folding pathways is the committor probability of folding, $p_{\mathrm{fold}}$ [68]. It is the probability that a trajectory reaches the folded state before the unfolded state. It is 0 in the unfolded state and 1 in the folded state. The flux $J$ to the folded state through a pathway is the number of folding events through that pathway per unit time, which can be calculated as [43]

$$J = \frac{\sum_{i \in U,\, j \notin U} T_{ij}(\Delta t)\,P_{\mathrm{eq}}(i)\,p_{\mathrm{fold}}(j)}{\Delta t}. \tag{2.33}$$

The folding rate $k_f$ is

$$k_f = \frac{J}{\sum_i P_{\mathrm{eq}}(i)\,[1 - p_{\mathrm{fold}}(i)]}. \tag{2.34}$$
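Equations 2.33 and 2.34 can be evaluated as in the following Python sketch. The text above does not specify how $p_{\mathrm{fold}}$ is computed, so the sketch assumes the commonly used linear system in which $p_{\mathrm{fold}}(i) = \sum_j T_{ij}\,p_{\mathrm{fold}}(j)$ for every state outside U and F, with $p_{\mathrm{fold}}$ fixed to 0 in U and 1 in F. The three-state matrix, its equilibrium distribution and the function names are hypothetical.

```python
import numpy as np

def committor(T, unfolded, folded):
    """p_fold: 0 on U, 1 on F; elsewhere solves p_i = sum_j T_ij p_j (assumed standard form)."""
    unfolded, folded = np.asarray(unfolded), np.asarray(folded)
    n = T.shape[0]
    q = np.zeros(n)
    q[folded] = 1.0
    inter = np.setdiff1d(np.arange(n), np.concatenate([unfolded, folded]))
    if inter.size:
        A = np.eye(inter.size) - T[np.ix_(inter, inter)]
        b = T[np.ix_(inter, folded)].sum(axis=1)   # direct jumps into F contribute p_fold = 1
        q[inter] = np.linalg.solve(A, b)
    return q

def folding_flux_and_rate(T, p_eq, q, unfolded, lag):
    """Flux out of U weighted by p_fold (eq. 2.33) and the folding rate (eq. 2.34)."""
    unfolded = np.asarray(unfolded)
    not_u = np.setdiff1d(np.arange(T.shape[0]), unfolded)
    J = (p_eq[unfolded][:, None] * T[np.ix_(unfolded, not_u)] * q[not_u][None, :]).sum() / lag
    k_f = J / (p_eq * (1.0 - q)).sum()
    return J, k_f

# Hypothetical 3-state chain: state 0 plays the role of U, state 1 of I, state 2 of F.
T = np.array([[0.90, 0.10, 0.00],
              [0.05, 0.90, 0.05],
              [0.00, 0.10, 0.90]])
p_eq = np.array([0.25, 0.50, 0.25])                # equilibrium distribution of this T
q = committor(T, unfolded=[0], folded=[2])
J, k_f = folding_flux_and_rate(T, p_eq, q, unfolded=[0], lag=1.0)
print("p_fold:", q, "flux:", J, "folding rate:", k_f)
```

The flux and the rate are returned in inverse units of the lag time passed to `folding_flux_and_rate`.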

Figure 2.7: The implied timescales corresponding to the 10 slowest decaying eigenmodes using transition matrices $\mathbf{T}(\Delta t)$ at the fine-grained level (25,000 microstates), with different boundary conditions, estimated from Shaw's Trp-cage trajectory [1]. (a) Unmodified equilibrium boundary condition. (b) Reflecting at F and I states boundary condition, where the dynamics are constrained within the unfolded ensemble. The reflecting at F and I boundary condition is realized by adding the value of the transition probability from any state $i$ to F or I ($T_{iF}$ or $T_{iI}$) to the component $T_{ii}$ and then setting $T_{iF}$ and $T_{iI}$ to zero. The implied timescales approximately level off after lag time $\Delta t$ = 5 ns. The optimal lag time of 10 ns is chosen for further analysis based on the trade-off between the network being Markovian and the resolution being sufficient for studying the folding mechanism.
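The implied timescales of equation 2.32 and the reflecting boundary condition described in the caption of Fig. 2.7 can be sketched in Python as follows. The three-state matrix is hypothetical, with its last state standing in for the folded state; restricting the modified matrix to the remaining states is a choice made in this sketch, and the function names are illustrative.

```python
import numpy as np

def implied_timescales(T, lag):
    """Equation 2.32 for all non-stationary eigenvalues; assumes they are real and positive."""
    evals = np.sort(np.linalg.eigvals(T).real)[::-1]
    evals = evals[1:]                      # drop lambda_1 = 1 (stationary state)
    evals = evals[evals > 0]               # the logarithm is undefined otherwise
    return -lag / np.log(evals)

def reflect_at(T, blocked):
    """Reflecting boundary: move the probability of entering `blocked` onto the diagonal."""
    T = np.array(T, dtype=float)           # work on a copy
    keep = np.setdiff1d(np.arange(T.shape[0]), blocked)
    leak = T[np.ix_(keep, blocked)].sum(axis=1)
    T[keep, keep] += leak                  # add T_iF to T_ii
    T[np.ix_(keep, blocked)] = 0.0         # then set T_iF to zero
    return T[np.ix_(keep, keep)]           # dynamics restricted to the remaining states

# Hypothetical 3-state chain; state 2 plays the role of the folded state F.
T = np.array([[0.90, 0.10, 0.00],
              [0.05, 0.90, 0.05],
              [0.00, 0.10, 0.90]])
its_full = implied_timescales(T, lag=1.0)
T_reflect = reflect_at(T, blocked=[2])
its_reflect = implied_timescales(T_reflect, lag=1.0)
print("implied timescales, unmodified:", its_full)
print("implied timescales, reflecting at F:", its_reflect)
```

Repeating the calculation for transition matrices estimated at several lag times gives the kind of timescale versus lag time curves shown in Fig. 2.7.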

The number of pathways is usually too large to yield insight. One can cluster the pathways into tubes according to structural similarity. The flux through folding tube $\alpha$ is $J(\alpha)$. The rate constant $k(\alpha)$ is obtained as the reciprocal of the mean first passage time of folding events through tube $\alpha$. The population $P(\alpha)$ of the unfolded state which folds through tube $\alpha$ is calculated as

$$P(\alpha) = \frac{\sum_{i \in U} t(i \mid \alpha)}{T_{\mathrm{total}}}, \tag{2.35}$$

where $t(i \mid \alpha)$ is the waiting time at unfolded state $i$ before folding through tube $\alpha$ and $T_{\mathrm{total}}$ is the total length of the simulation. The sum of all $P(\alpha)$ should equal the population of all unfolded states.

2.3 Results and Discussion

Let us first investigate some important kinetic quantities associated with the folding tubes. Fig. 2.8 shows the kinetic quantities of the top 16 tubes. Although the fluxes and the populations that fold through different tubes vary considerably from tube to tube, all tubes have very similar folding rates. This is not obvious, since the barriers along different tubes are not necessarily similar. However, a toy model study (section 2.5.1) shows that the constant rates through different tubes are a direct consequence of significant mixing within the unfolded state ensemble before folding. In a Markov state model, the kinetics is characterized by the implied timescale spectrum of the transition matrix, which contains all the information about the relaxation times of the states within the discrete time MSM. As can be seen in Fig. 2.7a, there is a substantial gap between the longest implied timescale and the second longest one. The slowest implied timescale is associated with the folding and unfolding kinetics. Imposing the reflecting at F boundary condition provides a model for the dynamics of the unfolded ensemble alone. The resulting spectrum is very similar, except that the slowest implied timescale of the unmodified boundary condition is missing from the spectrum with the reflecting boundary condition at F.
The reflecting at F boundary condition is realized by adding the value of transition probability from any state i to F (T if ) to the component T ii and then setting T if to zero. The substantial gap between the slowest implied timescale and the others means

Figure 2.8: (a) The tube fluxes J(α), (b) the population that folds through tube α, P(α), and (c) the tube folding rates k(α) for the top 16 folding tubes obtained from the 25,000-node MSM.

that the folding is two-state [40, 69–74] and the implied timescale (∼1.2 µs) is the inverse of the sum of the folding and unfolding rates. In Fig. 2.9, subplot (a) shows that the slowest eigenmode involves the kinetics between the region where RMSD < 2.5 Å and the region where RMSD > 5.0 Å, while subplot (b) indicates that the second slowest eigenmode describes the dynamics mainly within the unfolded region. Therefore, the fluctuations within the U state can be separated from the folding.

A network of 25,000 microstates is still too large to be visualized and comprehended directly, so we built a much more coarse-grained Markov state model for Trp-cage by using the PCCA+ subroutine in MSMBuilder2 [37]. We used MSMBuilder2 to cluster all the snapshots into 31,284 cluster centers with a hybrid of the k-centers and k-medoids algorithms. Then a 20-node macrostate Markov state model was constructed with PCCA+. This 20-node model contains a native state with a population of 17%, which is consistent with the statistics obtained directly from the trajectory. To validate the kinetics of the 20-node Markov state model, the implied timescales are plotted in Fig. 2.10. The implied timescale spectra agree very well with those of the 25,000-node Markov state model (Fig. 2.7). This 20-node network therefore serves as a good representation of the 25,000-node network while describing the conformational space of Trp-cage in a much simpler and more comprehensible way.

In Fig. 2.11a we show a typical transition matrix element $T_{ij}$ from state i to state j, both within the unfolded ensemble, estimated in three ways: using absorbing, unmodified equilibrium, and reflecting boundary conditions at F. The time dependence of $T_{ij}$ describes the relaxation process following an initial point perturbation at state i. On a time scale of a few hundred nanoseconds the three curves look very similar. Each rises rapidly to a plateau value that overshoots the equilibrium population of state j by a small amount. When added up over all the states in the unfolded ensemble, the excess corresponds to the equilibrium population of F, which folds from U to F on a slower time scale of 5 µs. After a few hundred nanoseconds, the matrix elements $T_{ij}$ shown in Fig. 2.11b have the following longer-time behavior. Under reflecting boundary conditions $T_{ij}$ is approximately constant; the unmodified transition matrix element $T_{ij}$ relaxes to the equilibrium population on a time scale of 1.2 µs; and under the absorbing boundary condition the matrix elements relax to zero (since all the population

Figure 2.9: RMSD of the cluster centers in the 25,000-cluster MSM of Trp-cage versus the corresponding components of the second and third eigenvectors of the transition matrix at a lag time of 5 ns. (a) RMSD vs. the second eigenvector components. (b) RMSD vs. the third eigenvector components. The first eigenvector corresponds to the equilibrium population, so all of its components share the same sign.

Figure 2.10: The implied timescales corresponding to the 10 slowest-decaying eigenmodes of the transition matrices $T(\Delta t)$ at the coarse-grained level (20 macrostates, PCCA+ clustering from the 31,284-node network in Figure 2.9) with different boundary conditions, estimated from Shaw's Trp-cage trajectory [1]. (a) Unmodified equilibrium boundary condition. (b) Reflecting boundary condition at the F and I states, where the dynamics are constrained within the unfolded ensemble. The implied timescales approximately level off after a lag time of $\Delta t = 5$ ns. A lag time of 10 ns is chosen for further analysis based on the trade-off between the network being Markovian and the resolution being sufficient for studying the folding mechanism.
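As a rough illustration of how an implied timescale spectrum such as the one in this figure is obtained, the sketch below uses the standard MSM relation $\tau_n = -\Delta t / \ln \lambda_n$, which is assumed here to match the definition used earlier in the dissertation. The function name and the two-state demo matrix are hypothetical; its transition probabilities were picked only so that the result lands near a microsecond at a 10 ns lag time.

```python
import numpy as np

def implied_timescales(T, lag, k=10):
    """Sketch: implied timescales tau_n = -lag / ln(lambda_n) of a
    row-stochastic transition matrix T estimated at lag time `lag`."""
    evals = np.linalg.eigvals(np.asarray(T, dtype=float)).real
    # Sort in descending order and drop the stationary eigenvalue (lambda_1 = 1).
    evals = np.sort(evals)[::-1][1:]
    # Only eigenvalues in (0, 1) correspond to decaying modes with a real timescale.
    evals = evals[(evals > 0.0) & (evals < 1.0)][:k]
    return -lag / np.log(evals)

# Hypothetical two-state chain; its single non-trivial eigenvalue is 1 - a - b.
a, b = 0.002, 0.006
T_two = np.array([[1.0 - a, a],
                  [b, 1.0 - b]])
ts = implied_timescales(T_two, lag=10.0)  # in ns, since the lag is given in ns
```

Repeating the calculation for transition matrices estimated at several lag times, and checking where the timescales level off, is the usual way a plot of this kind is produced.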

Figure 2.11: In the 20-node MSM of Trp-cage, (a) a typical propagator matrix element $T_{ij}(\Delta t)$ from state i to state j, where both states are picked from the U ensemble. The element is calculated in three ways: using absorbing boundary conditions at F (black), unmodified equilibrium boundary conditions (blue), and reflecting boundary conditions at F (red). The upper and lower dash-dotted horizontal lines correspond to the equilibrium population of state j under the reflecting-at-F and unmodified equilibrium boundary conditions, respectively. (b) The same matrix element $T_{ij}$ shown in a longer time window. The elements at all times are obtained from spectral decomposition using all 20 eigenmodes at a lag time of 10 ns.

would be absorbed into the sink) on a time scale of 5 µs. The results shown in Fig. 2.11 are suggestive as to the time scales for equilibrating the unfolded state, but the full relaxation involves all the elements $T_{ij}$ of the propagator. We therefore consider the full expression for the relaxation time. In equilibrium statistical mechanics, the time it takes to equilibrate a system is estimated by calculating the integral of an appropriate time correlation function [75]. The correlation function of interest here corresponds to the decay of the population fluctuations in the unfolded state. After some derivation (see the appendix on the population fluctuation relaxation function for details), this correlation function can be expressed as

$$\hat{C}_{tot}(t) = \sum_{i=1}^{N} P_{eq}(i)\, \frac{\langle \delta p_i(0)\, \delta p_i(t) \rangle}{\langle \delta p_i(0)^2 \rangle}, \tag{2.36}$$

$$\hat{C}_{tot}(t) = \sum_{i=1}^{N} \frac{(T_{ii}(t) - P_{eq}(i))\, P_{eq}(i)}{1 - P_{eq}(i)}, \tag{2.37}$$

$$\hat{C}_{tot}(t) = \sum_{n=2}^{N} \left[ \sum_{i=1}^{N} \frac{P_{eq}(i)\, \psi_n^R(i)\, \psi_n^L(i)}{1 - P_{eq}(i)}\, \lambda_n \right], \tag{2.38}$$

where $\psi_n^R(i)$ and $\psi_n^L(i)$ are the ith elements of the nth right and left eigenvectors of the T matrix, and $\lambda_n$ is the nth eigenvalue of the T matrix. The fluctuation is $\delta p_i(t) = P_i(t) - P_{eq}(i)$, where $P_{eq}(i)$ is the equilibrium population of state i and $P_i(t)$ is an indicator function, which is 1 when the trajectory is at state i at time t and 0 otherwise.

In Figure 2.12 we show the unfolded state population fluctuation correlation function of the 20-node MSM of Trp-cage. When the dynamics is restricted to the unfolded state, the time to equilibrate the unfolded state is estimated from the time integral of $\hat{C}_{tot}(t)$ to be 100 ns; when the additional relaxation of the unfolded states U due to the much slower equilibration between U and F is also considered, the time to equilibrate the unfolded state increases to 540 ns. Consistent behavior is also observed in the 25,000-node Markov state model of Trp-cage (Fig. 2.13). The separation of time scales between the equilibration within the unfolded state U and folding to the folded state F is implicit in the folding funnel model of protein folding. While folding on a flat golf-course landscape, which lacks an energy bias, can also produce a separation of time scales, the very fast equilibration (∼100 ns) within the unfolded state is a feature of the funnel-like landscape. Our estimate of the time to equilibrate the protein unfolded state based on the decay of fluctuations of the U

Figure 2.12: The population fluctuation relaxation functions (Eqs. 2.36, 2.37, 2.38) of the 20-node MSM of Trp-cage at a lag time of 10 ns, using two different boundary conditions: (a) unmodified equilibrium and (b) reflecting at F. The time integrals of the functions are the relaxation times, which are 543 ns and 100 ns, respectively.
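The relaxation function of Eq. 2.37 and its time integral can be sketched directly from a transition matrix. The code below is a minimal illustration rather than the procedure used for this figure: it evaluates $T_{ii}(m\Delta t)$ by repeated matrix multiplication instead of spectral decomposition, integrates with the trapezoidal rule, and uses a hypothetical two-state chain for which the result is known to be $\hat{C}_{tot}(m\Delta t) = \lambda_2^m$. All function names are assumptions.

```python
import numpy as np

def stationary(T):
    """Equilibrium populations: left eigenvector of T with eigenvalue 1."""
    w, v = np.linalg.eig(np.asarray(T, dtype=float).T)
    p = np.real(v[:, np.argmax(w.real)])
    return p / p.sum()

def relaxation_function(T, n_steps):
    """C_tot(m * lag) for m = 0..n_steps, following Eq. 2.37:
    sum_i P_eq(i) * (T_ii(t) - P_eq(i)) / (1 - P_eq(i))."""
    T = np.asarray(T, dtype=float)
    p = stationary(T)
    weights = p / (1.0 - p)
    Tm = np.eye(T.shape[0])           # T to the power m, starting at m = 0
    C = np.empty(n_steps + 1)
    for m in range(n_steps + 1):
        C[m] = np.sum(weights * (np.diag(Tm) - p))
        Tm = Tm @ T
    return C

def relaxation_time(C, lag):
    """Time integral of the relaxation function (trapezoidal rule)."""
    return lag * (C.sum() - 0.5 * (C[0] + C[-1]))

# Hypothetical two-state chain with non-trivial eigenvalue 1 - a - b = 0.92.
a, b = 0.02, 0.06
T_two = np.array([[1.0 - a, a],
                  [b, 1.0 - b]])
C = relaxation_function(T_two, n_steps=2000)
tau = relaxation_time(C, lag=10.0)
```

For a large network the spectral form of Eq. 2.38 is cheaper, since only the slowest eigenmodes contribute appreciably at long times.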

Figure 2.13: The population fluctuation relaxation functions (Eqs. 2.36, 2.37, 2.38) of the 25,000-node MSM of Trp-cage at a lag time of 10 ns, using two different boundary conditions: (a) unmodified equilibrium and (b) reflecting at F. The time integrals of the functions are the relaxation times, which are evaluated by spectral decomposition of the transition matrix $T(\Delta t = 10\ \mathrm{ns})$. Only the 100 largest eigenvalues and their eigenvectors are used. In addition, the relaxation functions are renormalized by the population of the states that have a relaxation time larger than 8.2 ns, since we do not have enough statistics or sufficient resolution to estimate relaxation times smaller than 8.2 ns. The corresponding relaxation times are 825 ns and 108 ns.

state population (Eq. 2.37) is independent of the kind of experiment chosen to monitor the system. Any particular experiment will measure the time evolution of the population fluctuations reweighted by how sensitive that particular probe is to the different modes by which the population fluctuations relax. If, for example, the experiment is sensitive to the fluctuations of some property f, then the experimental relaxation function measured for that probe of the unfolded state dynamics would be

$$\hat{C}_f(t) = \frac{\sum_{i,j} f(i)\, f(j)\, P_{eq}(j)\, (T_{ji}(t) - P_{eq}(i))}{\sum_{i,j} f(i)\, f(j)\, P_{eq}(j)\, (T_{ji}(0) - P_{eq}(i))}, \tag{2.39}$$

where f(i) and f(j) are the values of the experimental observable in state i and state j. Note that a fine-grained Markov state model is preferable for estimating the experimental relaxation time. A common choice of the experimental observable f is the FRET efficiency, which is a nonlinear function of the distance between two particular residues within the protein. The relaxation time thus determined depends on the choice of those residues.

We now turn to an analysis of the mean first passage times (MFPTs) between different states within the unfolded state ensemble. The MFPT is an important measure of time scales since it directly measures how long it takes to go from one state to another. The MFPT to state i is defined as the mean time it takes to reach state i for the first time, starting from a state other than i. The MFPT to an unfolded state i can be obtained from the formula [40]:

$$\mathrm{MFPT}_i = \sum_j \frac{P_{eq}(j)}{1 - P_{eq}(i)} \int_0^{\infty} t\, \frac{dT^{abs\,i}_{ji}}{dt}\, dt, \tag{2.40}$$

$$\mathrm{MFPT} = \sum_j \left\{ \frac{P_{eq}(j)}{1 - P_{eq}(i)} \sum_{n=2}^{N} \psi_n^R(j)\, \psi_n^L(i)\, (-\tau_n^{implied}) \right\}, \tag{2.41}$$

where $\psi_n^R(j)$ and $\psi_n^L(i)$ are the jth and ith elements of the nth right and left eigenvectors of the transition matrix with an absorbing boundary at state i, $T^{abs\,i}$, and $\tau_n^{implied}$ is its nth implied timescale, as defined earlier. The average shown in Eq. 2.41 is taken over all the other states j and includes a sum over all the eigenmodes.

As we found that the relaxation time (∼540 ns) within the unfolded ensemble is much shorter than the reported folding time (5.56 µs, Table 3.1) for Trp-cage, it is interesting to investigate the MFPT distribution within the unfolded state ensemble and ask how it compares with the fast relaxation time. To acquire sufficient statistics for the MFPT distribution, the fine-grained Markov state model of Trp-cage is used. The MFPTs are obtained by directly running many kinetic Monte Carlo simulations from the starting state to the target state based on the 25,000 × 25,000 transition matrix. However, it is time-consuming and unnecessary to compute the MFPTs for all possible pairs within the unfolded ensemble, which number in the hundreds of millions. In Fig. 2.14, the top 101 most populated unfolded states, whose combined weight is 9.69% of the total unfolded population, are chosen to calculate the MFPT distribution. In total, 10,100 MFPTs are calculated, with a mean of 13.71 µs, which is comparable to the folding time. In Fig. 2.15 and Fig. 2.16, however, the mean of the MFPT distribution is much larger than the folding time: 168 µs and 373 µs, respectively. An overview of the MFPT statistics can be found in Table 2.1. The findings in Fig. 2.15 and Fig. 2.16 are not surprising, since more weakly populated unfolded states are chosen under the selection criteria of these two figures. It takes a longer time for a trajectory to hit a smaller target (with a population of $\sim 10^{-6}$) in the conformational space.

Table 2.1: The statistics of Fig. 2.14, Fig. 2.15 and Fig. 2.16. The columns are Min (µs), Max (µs), Mean (µs) and Std (µs); the rows are the three ways to pick the unfolded states: top 101 most populated, according to the weights, and totally random. For each way of choosing the unfolded states, MFPTs are collected. Each MFPT data point is averaged over 1000 successful kinetic Monte Carlo trajectories.

So what measure would best characterize the equilibration within the unfolded state of a protein: the relaxation time, the mean first passage time, or something else? When a protein folds along multiple pathways, as suggested by the funnel picture, the folding kinetics will still be two-state regardless of differences in the intrinsic barriers along each pathway, provided that the equilibration within the unfolded state ensemble is much faster than the time it takes to fold. In other words, a two-state kinetic model is justified if proteins rapidly equilibrate between different unfolded conformations prior to complete folding. Yet the mean first passage times between different regions of the unfolded state ensemble are typically much

Figure 2.14: The MFPT distribution between unfolded states of the 25,000-node MSM of Trp-cage. The top 101 most populated unfolded states are selected to calculate the pairwise MFPTs (10,100 MFPTs in total) by running kinetic Monte Carlo trajectories. The transition matrix is built with the unmodified boundary condition at a lag time of 10 ns.
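The kinetic Monte Carlo estimate of an MFPT mentioned in this caption can be sketched as follows. This is a generic discrete-time sampler under stated assumptions (row-stochastic matrix, one jump per lag time); the function name `kmc_mfpt`, the seed, and the two-state demo matrix are hypothetical and are not the dissertation's implementation.

```python
import numpy as np

def kmc_mfpt(T, start, target, lag, n_traj=1000, seed=0):
    """Sketch: estimate the MFPT from `start` to `target` by averaging the
    first passage times of `n_traj` kinetic Monte Carlo trajectories."""
    T = np.asarray(T, dtype=float)
    rng = np.random.default_rng(seed)
    cum = np.cumsum(T, axis=1)        # cumulative transition probabilities per row
    last = T.shape[0] - 1
    total_steps = 0
    for _ in range(n_traj):
        s = start
        while s != target:
            # Draw the next state from row s; the clamp guards against round-off
            # in the final cumulative entry.
            s = min(int(np.searchsorted(cum[s], rng.random(), side="right")), last)
            total_steps += 1
    return lag * total_steps / n_traj

# Hypothetical two-state chain: the exact MFPT from 0 to 1 is lag / T[0, 1] = 100 ns.
T_two = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
estimate = kmc_mfpt(T_two, start=0, target=1, lag=10.0, n_traj=5000)
```

The statistical error shrinks only as the inverse square root of the number of trajectories, which is one reason why sampling every pair of states in a 25,000-node network is impractical.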

Figure 2.15: The MFPT distribution between unfolded states of the 25,000-node MSM of Trp-cage. Pairs of unfolded states are randomly chosen according to their weights, which means that unfolded states with larger weights are more likely to be picked than unfolded states with smaller weights. The forward and backward MFPTs between the chosen pairs are calculated by running kinetic Monte Carlo trajectories. The transition matrix is built with the unmodified boundary condition at a lag time of 10 ns.

Figure 2.16: The MFPT distribution between unfolded states of the 25,000-node MSM of Trp-cage. Pairs of unfolded states are randomly chosen with equal probability. The forward and backward MFPTs between the chosen pairs are calculated by running kinetic Monte Carlo trajectories. The transition matrix is built with the unmodified boundary condition at a lag time of 10 ns.

longer than the folding time. There appears to be a paradox: the single exponential kinetics can be explained by very fast equilibration within the unfolded state ensemble relative to folding, but the long MFPTs within the unfolded state ensemble seem to imply that the equilibration of the unfolded state is slow relative to folding.

In Fig. 2.18a, we show the implied timescale spectrum of the transition matrix with an absorbing boundary condition at a typical unfolded state i. The large gap between the largest implied timescale and the rest is the signature of the exponential distribution of first passage times to unfolded state i. The longest implied timescale is of the order of 100 µs. Because the unfolded state ensemble relaxes on a time scale a hundred to a thousand times faster than the time it takes on average to reach state i, the MFPT to state i does not depend on the starting point within the unfolded state ensemble. The kinetics involving the transitions between any specific state i and all the other states taken collectively is then effectively two-state, and the MFPT to state i can be written as

$$\mathrm{MFPT} = \sum_j \left\{ \frac{P_{eq}(j)}{1 - P_{eq}(i)}\, \psi_2^R(j)\, \psi_2^L(i)\, (-\tau_2^{implied}) \right\}. \tag{2.42}$$

The MFPT to the unfolded state i chosen for the example shown in Fig. 2.18 is found to be 106 µs. To understand why the MFPTs to states within the unfolded state ensemble are so long, we consider the relationship between the average lifetime of a state i within U and the average lifetime of the collective state consisting of the remainder of U excluding state i:

$$\tau^{life}_{U-i} = \tau^{life}_i \left( \frac{1}{P_{eq}(i)} - 1 \right), \tag{2.43}$$

where $\tau^{life}_i$ is the average lifetime of state i and $\tau^{life}_{U-i}$ is the average lifetime of the collective state U − i consisting of the remainder of the unfolded state ensemble excluding state i. Here we define the lifetime distribution of a state as the distribution of times recorded by starting a clock upon entering the state and stopping it upon leaving, during a single very long trajectory in which the state is visited many times (see the appendix on mean first passage times and average lifetime for the derivation of Eq. 2.43). In Fig. 2.18b we plot the MFPT to state i against the average lifetime of the collective state $\tau^{life}_{U-i}$ for each of the unfolded states in the 20-node Markov state model. It can be

seen that these times are almost equal. This is true when the time to equilibrate within the collective state U − i is much shorter than the average lifetime of U − i. Under these circumstances, the MFPT to any unfolded state i is proportional to the average lifetime of state i divided by its population, and there is an approximate equality involving Eqs. 2.40, 2.42 and 2.43,

$$\mathrm{MFPT}_i \approx \tau^{life}_{U-i}. \tag{2.44}$$

Because the average lifetimes of the unfolded states decay on the same time scale as the population fluctuations (Fig. 2.17), we find that the MFPT to any state in the network is approximately equal to the time to equilibrate the network divided by the population of the target state. This is shown in Fig. 2.20, where the MFPT to state i is linearly correlated with the reciprocal of the equilibrium population of state i. Importantly, the MFPTs depend on the resolution of the network for the unfolded state: the more fine-grained the network, the longer the MFPTs to an individual state. On the other hand, the time to equilibrate the unfolded state is a characteristic of the free energy landscape and depends only weakly on the resolution. For the 20-node model of Trp-cage studied in this chapter, the longest MFPT (∼200 µs) is to the state with the smallest population, 0.003, while the average lifetime of that state is 48 ns, comparable to the time to equilibrate the unfolded state.

2.4 Conclusion

In this chapter, a way of building Markov state models from a long molecular dynamics simulation trajectory of the mini-protein Trp-cage is presented. Markov state models have proven to be a powerful tool for studying the dynamics of proteins. For example, an MSM can help one study the relaxation of the unfolded state ensemble to the folded state, which is extremely important since the relaxation time of U reflects the underlying free energy landscape that determines the system's thermodynamic and kinetic properties. Markov state models provide insights into biological problems, such as protein folding or ligand binding, in terms of states and the transition probabilities among them. We attempted to resolve a paradox about the kinetics within the unfolded state of

Figure 2.17: The lifetime distribution and population fluctuation relaxation function of a typical unfolded state in the 20-node MSM of Trp-cage. (a) The lifetime distribution of the unfolded state, which is obtained by running hundreds of thousands of kinetic Monte Carlo trajectories. These trajectories are started at the unfolded state and terminated once they leave it. (b) The population fluctuation relaxation function of the unfolded state, which is calculated by spectral decomposition of the transition matrix $T(\Delta t = 10\ \mathrm{ns})$. The lifetime distribution and the relaxation function decay on very similar time scales.

Figure 2.18: (a) The implied timescale spectrum of the transition matrix with an absorbing boundary at an unfolded state, which is highlighted as a red circle in (b). (b) The average lifetime of the collective state (U − i), which excludes state i, versus the MFPT to state i. The transition matrix of the 20-node MSM of Trp-cage has an absorbing boundary condition at the red state. The lag time is 10 ns.
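The lifetime relation of Eq. 2.43 can be checked numerically on any stationary Markov chain. The sketch below assumes a discrete-time chain in which the average lifetime of state i is $\Delta t / (1 - T_{ii})$ and the average lifetime of the collective state is its population divided by the probability flux into i per step; the function name and the random 6-state reversible matrix are hypothetical and serve only as an illustration.

```python
import numpy as np

def lifetimes(T, p_eq, i, lag):
    """Sketch: average lifetime of state i and of the collective state U - i
    for a row-stochastic transition matrix T with equilibrium populations p_eq."""
    tau_i = lag / (1.0 - T[i, i])                 # mean dwell time in state i
    others = np.arange(len(p_eq)) != i
    flux = np.sum(p_eq[others] * T[others, i])    # probability flux (U - i) -> i per step
    tau_rest = lag * (1.0 - p_eq[i]) / flux       # population of U - i divided by the flux
    return tau_i, tau_rest

# Hypothetical reversible chain built from a symmetric matrix of pseudo-counts.
rng = np.random.default_rng(1)
counts = rng.random((6, 6))
counts = counts + counts.T
T_demo = counts / counts.sum(axis=1, keepdims=True)
p_demo = counts.sum(axis=1) / counts.sum()        # stationary distribution of T_demo

tau_i, tau_rest = lifetimes(T_demo, p_demo, i=2, lag=10.0)
predicted = tau_i * (1.0 / p_demo[2] - 1.0)       # right-hand side of Eq. 2.43
```

In this discrete-time setting the two numbers agree to round-off because stationarity forces the flux into state i to equal the flux out of it; whether the MFPT also matches depends on how quickly the collective state equilibrates.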

Figure 2.19: The MFPTs to node 3, which is highlighted as a red circle in Fig. 2.18, of the 20-node MSM of Trp-cage. It can be seen that the MFPT to node 3 does not strongly depend on where the trajectory starts. This implies a rapid equilibration within the unfolded state ensemble.

Figure 2.20: The MFPT to state i versus the reciprocal of the population of state i. The MFPTs are calculated by running kinetic Monte Carlo trajectories on the 25,000-node Markov state network. 300 states are randomly chosen, including the folded state, which is labelled as a green star. The fitted curve is shown in the plot. The fitted coefficient of the variable x is 0.428 µs, which is considered an approximation of the total relaxation time of the Markov state model, 825 ns. The lag time of the transition matrix is 10 ns.

proteins, which leads to a better understanding of why most small proteins fold with two-state kinetics. When the equilibration of the unfolded state ensemble is very fast, as it is for most small proteins, the protein will fold with single exponential kinetics. While it seems paradoxical that the time to equilibrate the unfolded state can be orders of magnitude shorter than the MFPTs within the unfolded state ensemble, we have shown that there is no inconsistency. Using a time-correlation function approach, we have presented a general formula for the time scale of population relaxation within the unfolded state ensemble (Eq. 2.38). Applying this formula to the folding of the two-state mini-protein Trp-cage, we found that the folding follows a two-step process. Starting from an arbitrary nonequilibrium conformational distribution within the unfolded region, the protein population quickly relaxes to a pre-equilibrium within the unfolded state on time scales (∼100 ns for Trp-cage) much shorter than the folding time. From this time forward, while the relative populations of all the unfolded microstates remain constant, the excess population within the unfolded state ensemble, which will populate the folded state at equilibrium, folds with single exponential kinetics at a rate of 1/(5.5 µs) (Fig. 2.21).

It should be noted that, as Deng et al. [72] reported, an individual Trp-cage folding trajectory visits only a fraction (e.g., 25%) of the unfolded state space. The key to reconciling this with the rapid equilibration in the unfolded state is to realize that while any one trajectory explores only a small part of the unfolded state before folding (Fig. 2.22), an ensemble of such trajectories starting from the same initial condition within the unfolded state ensemble will explore all of the unfolded states, each with a probability close to its equilibrium population, before folding [70, 72, 73].

The methodology developed in this study is also well suited for studying the kinetics of larger and more complex proteins or biological systems [76], where the time scale to equilibrate within the unfolded state ensemble (or another region of interest) and the time scale to fold (or to reach other regions of the system) may overlap. Simple two-state kinetics is no longer suitable for describing such systems.

Figure 2.21: The ensemble view of the 20-node MSM of Trp-cage. The MSM is shown schematically by nodes and edges. The size of each node is positively correlated with its population. A thousand trajectories are initiated from an unfolded state. At a particular later time, we record where the trajectories are and estimate the instantaneous populations. If the population of a node is higher than its equilibrium value, the node is shown in warm colors; if it is lower, the node is shown in cool colors. After approximately 500 ns, all the unfolded states carry the excess population and decay to their equilibrium values with the slowest implied timescale.

Figure 2.22: A typical folding trajectory representing a general diffusion process. The outer sphere is the conformational space. The inner sphere is the folded state. In a folding trajectory, only a small portion of the conformational space is visited.

2.5 Appendix I: Simple Models and Derivation

2.5.1 Kinetic Study of a 5-node Simple Model

In Fig. 2.23, a 5-node kinetic model is presented. The equilibrium populations are $P_1 = P_2 = 1/3$, $P_3 = P_4 = 1/12$ and $P_F = 1/6$. There are two distinct tubes from the unfolded states to the folded state, whose folding rate constants differ by 10-fold, $k_{13}/k_{24} = 10$. The rate constant $k_{12}$ characterizes the mixing within the unfolded state ensemble. Fig. 2.24 shows that when the mixing within the U ensemble is very rapid compared to the folding process, the folding rates through the two distinct tubes become indistinguishable.

2.5.2 Population Fluctuation Relaxation Function

In this subsection, the population fluctuation relaxation function is introduced [75]. Let $\delta p_i(t)$ denote the instantaneous deviation in $P_i(t)$ of state i from its time-independent

Figure 2.23: A simple 5-node model for studying the influence of unfolded state kinetics on folding.

Figure 2.24: Effects of the unfolded state ensemble equilibration rate constant $k_{12}$ on the folding rates for the two tubes in the 5-node simple model (Fig. 2.23). The green dashed line is the total folding rate, $k_{tot} = k_\alpha \frac{P(\alpha)}{P_{eq}(U)} + k_\beta \frac{P(\beta)}{P_{eq}(U)}$.

equilibrium value $\langle P_i(t) \rangle$,

$$\delta p_i(t) = P_i(t) - \langle P_i(t) \rangle. \tag{2.45}$$

Let s(t) be the location of the trajectory at time t. The indicator function is defined as follows,

$$P_i(t) = \begin{cases} 1, & s(t) = i \\ 0, & s(t) \neq i \end{cases}. \tag{2.46}$$

Denote $\langle P_i(t) \rangle$ as $P_{eq}(i)$. The population fluctuation relaxation function of state i can be written in terms of $\delta p_i(t)$,

\begin{align}
C_i(t) &= \langle \delta p_i(0)\, \delta p_i(t) \rangle \tag{2.47} \\
&= (1 - P_{eq}(i))\, P_{eq}(i)\, (T_{ii}(t) - P_{eq}(i)) \notag \\
&\quad + (0 - P_{eq}(i))\, (1 - P_{eq}(i)) \left( \sum_{j \neq i} T_{ji}(t)\, \frac{P_{eq}(j)}{1 - P_{eq}(i)} - P_{eq}(i) \right) \tag{2.48} \\
&= (1 - P_{eq}(i))\, P_{eq}(i) \left( T_{ii}(t) - \sum_{j \neq i} T_{ji}(t)\, \frac{P_{eq}(j)}{1 - P_{eq}(i)} \right) \tag{2.49} \\
&= (1 - P_{eq}(i))\, P_{eq}(i) \left( T_{ii}(t) - \frac{P_{eq}(i)\, (1 - T_{ii}(t))}{1 - P_{eq}(i)} \right) \tag{2.50} \\
&= (T_{ii}(t) - P_{eq}(i))\, P_{eq}(i). \tag{2.51}
\end{align}

The normalization condition (equation 2.4) and the detailed balance condition (equation 2.6) are applied. It is conventional to normalize the population fluctuation relaxation function of state i so that $\hat{C}_i(0) = 1$,

$$\hat{C}_i(t) = \frac{T_{ii}(t) - P_{eq}(i)}{1 - P_{eq}(i)}. \tag{2.52}$$

We define the total population fluctuation relaxation function of an ensemble as the weighted sum of the population fluctuation relaxation functions of each state within that ensemble,

$$\hat{C}_{tot}(t) = \sum_i \frac{(T_{ii}(t) - P_{eq}(i))\, P_{eq}(i)}{1 - P_{eq}(i)}. \tag{2.53}$$

Substituting $T_{ij}(t)$ by equation 2.11 and setting m = 1 without loss of generality, we obtain

$$\hat{C}_{tot}(t) = \sum_{n=2}^{N} \left[ \sum_i \frac{P_{eq}(i)\, \psi_n^R(i)\, \psi_n^L(i)}{1 - P_{eq}(i)}\, \lambda_n \right]. \tag{2.54}$$

2.5.3 Mean First Passage Times (MFPT) and Average Lifetime

MFPTs can be determined by solving a set of linear equations [77]. Let $m_{ij}$ denote the MFPT from state i to state j. Then

$$m_{ij} = \Delta t + \sum_{k \neq j} T_{ik}\, m_{kj}. \tag{2.55}$$

The average MFPT to state j is

$$\mathrm{MFPT}_j = \sum_i \frac{P_{eq}(i)}{1 - P_{eq}(j)}\, m_{ij}. \tag{2.56}$$

We now show that the relationship between the average lifetimes in Eq. 2.43 is an identity. First, the unfolded state ensemble is divided into two metastable states, i and U − i, which are complements of each other. Without loss of generality, we constrain the dynamics within the unfolded state ensemble. In an exhaustively long trajectory, we have

$$P_{eq}(i) = \frac{\sum_{k=1}^{N_i} t_k(i)}{t_{tot}}, \tag{2.57}$$

and

$$P_{eq}(U-i) = \frac{\sum_{k=1}^{N_{U-i}} t_k(U-i)}{t_{tot}}. \tag{2.58}$$

Here $t_k(i)$ is the time span of the kth segment during which the trajectory is in state i. There are $N_i$ such segments. Similarly, $t_k(U-i)$ is the time span of the kth segment in the collective state U − i, and there are $N_{U-i}$ such segments. Note that the difference between $N_i$ and $N_{U-i}$ is at most 1 for a very long continuous trajectory, since the trajectory is in either state i or state U − i at any moment. As $N_i$ and $N_{U-i}$ are very large, we can approximately write $N_i = N_{U-i} = N$. Combining Eq. 2.57 and Eq. 2.58 then gives

$$\frac{P_{eq}(i)}{P_{eq}(U-i)} = \frac{\sum_{k=1}^{N} t_k(i)}{\sum_{k=1}^{N} t_k(U-i)}. \tag{2.59}$$

Dividing both the numerator and the denominator of the right-hand side of Eq. 2.59 by N, we obtain

$$\frac{P_{eq}(i)}{P_{eq}(U-i)} = \frac{t_i}{t_{U-i}}, \tag{2.60}$$

where $t_i$ and $t_{U-i}$ are the average lifetimes of state i and U − i, respectively. Eq. 2.43 is obtained by rearranging the equation above.
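The linear equations of Eq. 2.55 and the population-weighted average of Eq. 2.56 can be solved directly with NumPy. The sketch below is a minimal illustration, assuming a row-stochastic transition matrix; the function names and the 3-state demo matrix (chosen so that the answer is easy to verify by hand) are hypothetical.

```python
import numpy as np

def mfpt_to_target(T, target, lag):
    """Sketch: solve m_ij = lag + sum_{k != j} T_ik m_kj (Eq. 2.55) for a fixed
    target j. Returns the vector m with m[target] = 0."""
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    keep = np.flatnonzero(np.arange(n) != target)
    # Rearranged form: (I - T_sub) m = lag * 1, with the target row and column removed.
    A = np.eye(n - 1) - T[np.ix_(keep, keep)]
    m = np.zeros(n)
    m[keep] = np.linalg.solve(A, np.full(n - 1, lag))
    return m

def average_mfpt(T, p_eq, target, lag):
    """Population-weighted average over starting states (Eq. 2.56)."""
    m = mfpt_to_target(T, target, lag)
    p_eq = np.asarray(p_eq, dtype=float)
    return float(np.dot(p_eq, m) / (1.0 - p_eq[target]))

# Hypothetical 3-state chain with equilibrium populations (0.25, 0.25, 0.5).
T_demo = np.array([[0.80, 0.10, 0.10],
                   [0.10, 0.80, 0.10],
                   [0.05, 0.05, 0.90]])
p_demo = np.array([0.25, 0.25, 0.50])
m = mfpt_to_target(T_demo, target=2, lag=10.0)
avg = average_mfpt(T_demo, p_demo, target=2, lag=10.0)
```

For a network with tens of thousands of states the dense solve becomes expensive, which is consistent with the use of sampled kinetic Monte Carlo trajectories for the fine-grained model.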

From Eq. 2.60, after reorganization,

\begin{align}
t_{U-i} &= t_i\, \frac{P_{eq}(U-i)}{P_{eq}(i)} \tag{2.61} \\
&= \frac{1}{\sum_{j \neq i} k_{ij}}\, \frac{1 - P_{eq}(i)}{P_{eq}(i)} \tag{2.62} \\
&= \frac{1 - P_{eq}(i)}{\sum_{j \neq i} k_{ji}\, P_{eq}(j)}. \tag{2.63}
\end{align}

The denominator of the equation above is the flux from state U − i to state i, $J_{U-i}$. Therefore, we have also shown that the average lifetime of state U − i can be determined by

$$t_{U-i} = \frac{P_{eq}(U-i)}{J_{U-i}}. \tag{2.64}$$

Generally speaking, the MFPT to state i and the average lifetime $t_{U-i}$ are not equal unless the equilibration within the collective state U − i is much faster than the MFPT to state i. We can demonstrate this analytically in a 3-node toy model (see Fig. 2.25). Suppose we want to calculate the average lifetime of the collective state that consists of node 1 and node 3. In the following steps, we show that the average lifetime of the collective state, $t_{1+3}$, and the MFPT to node 2, $\mathrm{MFPT}_2$, are equal to each other when the equilibration is rapid within the collective state, in other words, when $k_u \gg k_{slow}$ and $k_u \gg k_{fast}$. From Eq. 2.43,

$$t_{1+3} = t_2 \left( \frac{1}{P_{eq}(2)} - 1 \right) = \frac{2}{k_{slow} + k_{fast}}, \tag{2.65}$$

where the average lifetime of node 2, $t_2$, is the reciprocal of the sum of all the outgoing rate constants of node 2. When the rate constants are not accessible from simulation or experiment, the average lifetime of node 2 can also be estimated from the transition probability, $t_2 = \Delta t / (1 - T_{22}(\Delta t))$. To calculate the MFPT to node 2 we can use the linear equations given above. For this special case, however, we can also use the rate matrix directly. First set the outgoing rate constants from node 2 to zero. Then diagonalize the rate matrix; the first non-equilibrium eigenvalue gives the reciprocal of the MFPT to node 2,

$$\mathrm{MFPT}_2 = \frac{k_{slow} + k_{fast} + 4 k_u}{2\, (k_{slow} k_{fast} + k_u (k_{slow} + k_{fast}))}. \tag{2.66}$$
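Eq. 2.66 can be checked numerically. Since Fig. 2.25 is not reproduced here, the sketch below assumes a particular topology for the toy model: node 1 reaches node 2 with rate $k_{slow}$, node 3 reaches node 2 with rate $k_{fast}$, nodes 1 and 3 interconvert with rate $k_u$ in both directions, and nodes 1 and 3 are equally populated. Under that assumption, solving the continuous-time first passage equations and averaging over the two starting nodes gives the same value as the closed form. The function names are hypothetical.

```python
import numpy as np

def toy_mfpt_numeric(k_slow, k_fast, k_u):
    """MFPT to node 2, averaged over equally populated nodes 1 and 3, from the
    first passage equations of the assumed 3-node topology:
      (k_slow + k_u) m1 - k_u m3 = 1
      -k_u m1 + (k_fast + k_u) m3 = 1"""
    A = np.array([[k_slow + k_u, -k_u],
                  [-k_u, k_fast + k_u]])
    m1, m3 = np.linalg.solve(A, np.ones(2))
    return 0.5 * (m1 + m3)

def toy_mfpt_closed_form(k_slow, k_fast, k_u):
    """Closed-form expression of Eq. 2.66."""
    return (k_slow + k_fast + 4.0 * k_u) / (
        2.0 * (k_slow * k_fast + k_u * (k_slow + k_fast)))

k_slow, k_fast = 0.1, 1.0
# Scan the mixing rate from very slow to very fast (arbitrary rate units).
scan = [(k_u,
         toy_mfpt_numeric(k_slow, k_fast, k_u),
         toy_mfpt_closed_form(k_slow, k_fast, k_u))
        for k_u in (1e-3, 1.0, 1e2, 1e6)]
```

Scanning $k_u$ this way moves the MFPT between its slow-mixing and fast-mixing limits, with the fast-mixing limit approaching the collective-state lifetime $2/(k_{slow} + k_{fast})$ of Eq. 2.65.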

When $k_u$ goes to zero and $k_{fast} \gg k_{slow}$, which mimics the case in which the relaxation within the collective state is very slow and the free energy barriers along the two pathways are quite different, the MFPT to node 2 can be approximated as

$$\mathrm{MFPT}_2 = \frac{1}{2 k_{slow}} + \frac{1}{2 k_{fast}} \approx \frac{1}{2 k_{slow}}. \tag{2.67}$$

Comparing this limit with the average lifetime of the collective state shows that in general the average lifetime of the collective state (node 1 + node 3) is not equal to the MFPT to node 2. However, in the rapid equilibration case, when $k_u \gg k_{fast} \gg k_{slow}$, $\mathrm{MFPT}_2$ converges to $t_{1+3}$,

$$\mathrm{MFPT}_2 = \frac{2}{k_{slow} + k_{fast}} \approx \frac{2}{k_{fast}}. \tag{2.68}$$

Thus, depending on whether the equilibration within the collective state (1+3) is very slow or very fast, the mean first passage time to node 2 goes to very different limits. It can also be shown for this toy model that when the equilibration of the collective state (1+3) is very fast, the first passage time distribution to node 2 is single exponential with a rate equal to the smaller non-zero eigenvalue of the rate matrix. As shown analytically here using the 3-node toy model, the MFPT and the average lifetime of the collective state agree when the equilibration within the collective state is rapid. A similar result is reported in reference [78].

2.6 Appendix II: Publication Attached

Parts of this chapter were published in [40, 72]. The publications are attached.

69 Article pubs.acs.org/jpcb How Kinetics within the Unfolded State Affects Protein Folding: An Analysis Based on Markov State Models and an Ultra-Long MD Trajectory Nan-jie Deng, Wei Dai, and Ronald M. Levy* BioMaPS Institute for Quantitative Biology and Department of Chemistry and Chemical Biology, Rutgers, the State University of New Jersey, Piscataway, New Jersey 08854, United States *S Supporting Information ABSTRACT: Understanding how kinetics in the unfolded state affects protein folding is a fundamentally important yet less well-understood issue. Here we employ three different models to analyze the unfolded landscape and folding kinetics of the miniprotein Trp-cage. The first is a 208 μs explicit solvent molecular dynamics (MD) simulation from D. E. Shaw Research containing tens of folding events. The second is a Markov state model (MSM-MD) constructed from the same ultralong MD simulation; MSM-MD can be used to generate thousands of folding events. The third is a Markov state model built from temperature replica exchange MD simulations in implicit solvent (MSM-REMD). All the models exhibit multiple folding pathways, and there is a good correspondence between the folding pathways from direct MD and those computed from the MSMs. The unfolded populations interconvert rapidly between extended and collapsed conformations on time scales 40 ns, compared with the folding time of 5 μs. The folding rates are independent of where the folding is initiated from within the unfolded ensemble. About 90% of the unfolded states are sampled within the first 40 μs of the ultralong MD trajectory, which on average explores 27% of the unfolded state ensemble between consecutive folding events. We clustered the folding pathways according to structural similarity into tubes, and kinetically partitioned the unfolded state into populations that fold along different tubes. 
From our analysis of the simulations and a simple kinetic model, we find that, when the mixing within the unfolded state is comparable to or faster than folding, the folding waiting times for all the folding tubes are similar and the folding kinetics is essentially single exponential despite the presence of heterogeneous folding paths with nonuniform barriers. When the mixing is much slower than folding, different unfolded populations fold independently, leading to nonexponential kinetics. A kinetic partition of the Trp-cage unfolded state is constructed which reveals that different unfolded populations have almost the same probability to fold along any of the multiple folding paths. We are investigating whether the results for the kinetics in the unfolded state of the 20-residue Trp-cage are representative of larger single domain proteins.

INTRODUCTION

Although much progress has been made on the protein folding problem, unresolved questions still exist concerning some of the fundamental aspects of how proteins fold. For example, how does the energy landscape of the unfolded state affect folding? 3,5,16,17 Does residual structure within the unfolded ensemble influence folding rates? Why do some proteins which theory and simulation suggest have multiple folding pathways exhibit two-state, single exponential kinetics? Molecular dynamics simulations (MD) in atomic detail provide the spatial and temporal resolution required to investigate the mechanisms of protein folding in aqueous solutions. However, the time scale covered by MD is usually too short for direct unbiased folding simulations. In recent years, the D. E.
Shaw lab has developed a special-purpose computer that greatly accelerates MD simulations of biomolecules, and the gap between direct simulation and biological time scales is now beginning to be closed. Using their ANTON technology on 12 structurally diverse fast-folding proteins, Shaw and co-workers were able to fold 11 of them to experimental structures and observe numerous reversible folding transitions in simulations ranging from microseconds to milliseconds. 20 Other methods which do not require special-purpose hardware are being developed to overcome the time scale limitation of direct MD and more efficiently sample the rare events associated with biomolecular transitions. In this area, Markov state models (MSM) constructed from atomistic simulations have been particularly successful in sampling the rare events associated with protein folding and protein conformational transitions.

Special Issue: Peter G. Wolynes Festschrift
Received: February 25, 2013
Revised: May 20, 2013
Published: May 24, American Chemical Society dx.doi.org/ /jp401962k J. Phys. Chem. B 2013, 117,

In this approach, the protein

conformational space is discretized into a network of coarse grained substates. Transitions on the network are modeled by a master equation; the kinetics on the network is Markovian. The network approach provides an efficient way to extract mechanistic insights from a large amount of MD trajectory data without losing the accuracy of the underlying atomistic simulations. The folding pathways and their fluxes can be obtained by applying transition path theory (TPT) on the network, 30,31,39 yielding a statistical description of how a protein acquires the specific native conformation starting from an extremely large number of possibilities. Noe et al. studied the folding of the PinWW domain by constructing a Markov state network model from many relatively short MD simulations. 31 They found many nonoverlapping pathways passing through intermediate regions to reach the native state. On the basis of their Markov state modeling of small single domain proteins, Pande and co-workers proposed that protein native states act as kinetic hubs connected to unfolded structures by stochastic jumps through metastable states. 34,36 In this kinetic hub model, the unfolded state ensemble is divided into collections of states that fold along different folding paths; to get from states which fold along one path to those which fold along another involves transiting through the folded state. 34 To overcome the sampling limitations of constant temperature MD in constructing a network model, over the past several years, our group has developed an approach that takes advantage of replica exchange molecular dynamics (REMD) in accelerating barrier crossing, and extracts kinetic information from REMD by assuming that a network of transitions can be reconstructed by applying structural similarity criteria together with reweighting techniques.
28,40,41 By exploiting transition path theory together with stochastic simulations, the kinetic network can be interrogated and information concerning the temperature dependent folding pathways can be obtained. Application of this approach to the miniprotein Trp-cage indicated that, below the folding temperature, the folding flux is dominated by a small number of localized pathways. 40 Above the folding temperature, the folding pathway ensemble becomes much more diverse. The effect of the unfolded state heterogeneity on folding was the focus of an insightful study by Ellison and Cavagnero. 16 One important finding from their simple kinetic model is that, for proteins with heterogeneous folding pathways, deviations from single exponential kinetics are observed only when unfolded conformations exchange at rates slower than folding. This result may provide a simple explanation for the apparent two-state, single exponential kinetics shown by some proteins, even though these proteins may fold through multiple diverse pathways. In the present study, we employ stochastic simulations, transition path theory, and Markov state models constructed from atomistic simulations to investigate how the kinetics within the unfolded state ensemble affects folding. The Trp-cage miniprotein (Figure 1) has served as a model system for studying folding in numerous experimental and theoretical studies. Here we investigate the kinetics in the unfolded state and its effects on folding using the following models: (1) a 208 μs explicit solvent MD simulation from the Shaw lab 20 that contains several folding events, (2) a Markov state model constructed from the same ultralong MD trajectory (MSM-MD), and (3) a kinetic network model built from REMD simulations using an implicit solvent effective potential over a wide temperature range (MSM-REMD).

Figure 1. NMR structure of Trp-cage miniprotein.
Direct comparison between the ultralong MD and MSM-MD trajectories serves to test the validity of the Markov model. Pande and co-workers have reported the first such comparison using two 100 μs folding trajectories of the FIP35 WW domain. 53 They found that the MSM has a hub-like topology, and the analysis yielded more insights into the diversity of folding pathways and dynamics between two alternative native structures. Here our emphasis is on sampling within the unfolded state and its effects on folding pathways. Because stochastic simulations on a discretized network are extremely efficient, we use it in the present study to extensively explore the kinetics within the unfolded ensemble. We have developed techniques to map the reactive stochastic trajectories onto the folding pathways computed using TPT. 40 By combining stochastic simulations with TPT pathway analysis, we can evaluate the folding rate along each pathway and the probability of folding along any pathway from any place within the unfolded state ensemble. By analyzing the Trp-cage kinetics in the light of a simple kinetic model calculation, we determine a general relationship between the folding kinetics and the rate of mixing in the unfolded states. We discuss our network model analysis in relation to the study of Ellison and Cavagnero 16 and other folding models. 5,34,54 The main result of the present study is that proteins with heterogeneous pathways will fold with single exponential kinetics, as long as the rate of mixing within the unfolded state is comparable to or faster than folding. While the mixing within the unfolded state modulates the apparent waiting times for folding along individual paths, the overall folding rate depends only on the total folding flux and the equilibrium unfolded state population. METHODS Analysis of the Ultralong MD Trajectory. 
An MD simulation of Trp-cage was performed by Shaw and co-workers on the Anton computer for 208 μs using a modified CHARMM22 all-atom force field in the TIP3P explicit solvent. 20 The MD trajectory contains about $10^6$ snapshots saved every 200 ps. During the course of the simulation, the Trp-cage fluctuates between the low and high rmsd regions, via a transiently occupied intermediate region (Figure S1a, Supporting Information). The distribution of rmsd is bimodal, containing a sharp peak at rmsd = 1.3 Å, which is separated from a broad peak at rmsd = 6 Å by a weakly populated intermediate region (Figure S1b,

Supporting Information). On the basis of the rmsd distribution, we define three macrostates as follows: folded (rmsd ≤ 2.2 Å), intermediate (2.2 Å < rmsd ≤ 5 Å), and unfolded (rmsd > 5 Å); the populations carried by these macrostates are 17.5% folded, 15.1% intermediate, and 67.4% unfolded. Note that the definition of the three macrostates is somewhat different from that used in the study by Shaw et al., 20 where the folded and unfolded states are defined on the basis of the fraction of native contacts, Q. As a result, there are some quantitative differences between the kinetic properties calculated in this work and those found by Shaw and co-workers. In particular, using the RMSD-based cutoff scheme, the trajectory contains a number of rapid folding transitions that are not considered as folding events in the study of Shaw and co-workers, which used a Q-based definition of the macrostates. This does not affect the interpretation of the main results except for the reported value of the folding transit time, as discussed in the following section.

Construction of MSM-MD. We constructed a Markov state model based on the 208 μs MD simulation to analyze the Trp-cage folding kinetics. The MSM-MD consists of a collection of conformational microstates and the transition probability matrix describing the memory-less jumps among these microstates. A set of microstates is generated by geometrically clustering the $10^6$ MD snapshots according to their mutual rmsd using the k-means clustering method. The average rmsd between a structure and its cluster center is 2.45 Å. The transition matrix $T_{ij}(\tau)$ is estimated by projecting the MD trajectory onto the network nodes and counting the number of transitions from node i to node j within lag time τ, i.e., $T_{ij}(\tau) \equiv P(j, \tau \mid i, 0) = C_{i \to j}(\tau) / \sum_k C_{i \to k}(\tau)$.
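The counting estimate of the transition matrix can be sketched in a few lines. This is a minimal illustration, assuming the trajectory has already been projected onto integer microstate labels and that transitions are counted with a sliding window; the function name `estimate_transition_matrix` and the toy trajectory are hypothetical, and the article does not state which counting scheme was used.

```python
from collections import defaultdict


def estimate_transition_matrix(dtraj, lag):
    """Row-normalized transition probabilities T[i][j] at the given lag.

    dtraj: sequence of microstate labels, one per saved snapshot.
    lag:   lag time in units of the snapshot spacing.
    """
    counts = defaultdict(lambda: defaultdict(int))
    # Sliding-window count of i -> j transitions separated by `lag` frames.
    for t in range(len(dtraj) - lag):
        counts[dtraj[t]][dtraj[t + lag]] += 1
    transition = {}
    for i, row in counts.items():
        total = sum(row.values())
        transition[i] = {j: c / total for j, c in row.items()}
    return transition


# Toy discrete trajectory with two microstates (illustrative only).
T = estimate_transition_matrix([0, 0, 1, 0, 1, 1, 0, 0], lag=1)
print(T)
```

Each row sums to 1 by construction. States that are never left within the trajectory (for example, a state visited only in the last frames) get no row, which is one reason adequate statistics per node matter.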
To choose a lag time for which the transitions on the network are Markovian, we used a criterion based on the mean first passage time (MFPT) of folding: when the transitions are Markovian, the folding MFPT computed using $T_{ij}(\tau)$ should not depend on the choice of lag time τ; see Figure S2a, Supporting Information. Here the folding MFPT is obtained from the inverse of the folding rate $k_f = J / (\sum_i P_{eq}(i)[1 - P_{fold}(i)])$, 31 where $P_{fold}(i)$ is the committor probability of folding and J is the folding flux computed using TPT, 31 $J = (\sum_{i \in U, j \notin U} T_{ij}(\tau) P_{eq}(i) P_{fold}(j)) / \tau$. The calculated folding MFPTs at different lag times are shown in Figure S2a, Supporting Information. At τ ≥ 5 ns, the folding time for the node MSM begins to level off to a plateau value close to the MFPT observed in the ultralong MD simulation, suggesting that the model behaves in a Markovian manner for lag times τ ≥ 5 ns. We also tested a coarser model with 6000 states: although the MFPT shows similar curvature, the plateau value of the MFPT is considerably smaller than the folding time observed from the MD trajectory. To further verify the Markovian property of the network, we used another criterion based on the implied time scales $t_i$, calculated from the eigenvalues $\lambda_i(\tau)$ of $T'(\tau)$, $t_i = -\tau / \ln[\lambda_i(\tau)]$. 27 Here $T'(\tau)$ is identical to the transition matrix $T(\tau)$ except that all the rates leaving the folded state are set to zero, which corresponds to the absorbing boundary condition for folding. The use of $T'(\tau)$ allows its slowest eigenvalue to reproduce the MFPT of folding computed from TPT. As seen from Figure S2b, Supporting Information, at τ ≥ 5 ns, the implied time scales begin to level off. The implied time scale computed for the slowest decaying eigenmode of $T'(\tau)$ at τ = 5 ns is 5.5 μs, which is in excellent agreement with the MFPT obtained using the flux from the TPT calculation.
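The implied time scale test rests on a simple property: for a truly Markovian chain, the eigenvalue at lag nτ is the n-th power of the eigenvalue at lag τ, so t = -τ / ln λ(τ) does not change with lag. The sketch below illustrates this under that assumption; the function name `implied_timescale` and the numerical values are hypothetical and are not taken from the article's data.

```python
import math


def implied_timescale(eigenvalue, lag):
    """Implied time scale t = -lag / ln(eigenvalue), in the units of `lag`."""
    if not 0.0 < eigenvalue < 1.0:
        raise ValueError("expected a decaying eigenvalue in (0, 1)")
    return -lag / math.log(eigenvalue)


# For an ideal Markov chain, lambda(n * tau) = lambda(tau) ** n, so the
# implied time scale is flat as a function of lag time.
lam, tau = 0.9, 5.0  # illustrative eigenvalue at a 5 ns lag
for n in (1, 2, 4, 8):
    print(n * tau, implied_timescale(lam ** n, n * tau))
```

In practice the eigenvalues come from matrices estimated separately at each lag, and the plateau appears only once the lag is long enough for memory effects within the microstates to decay.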
The choice of lag time τ not only determines the Markovian behavior of the MSM but also strongly affects the kinetic resolution of the network model. At small lag time, $T_{ij}(\tau)$ is equivalent to the rate matrix in the continuous-time Markov model via $k_{ij} = \lim_{\tau \to 0} (T_{ij}(\tau)/\tau)$, which gives the highest possible kinetic resolution on the network. At large τ, many of the unfolded states are connected to the folded state in one jump. At τ = 1, 5, and 20 ns, the one-jump folding pathways were found to account for 2.7, 10, and 28% of the total folding flux, respectively. In contrast, in the 208 μs MD simulation, a one-jump folding event was observed only once out of the 31 folding transitions, which corresponds to 3.2% of the folding flux. Therefore, our results show that, while the lag time needs to be long enough to satisfy the Markov property for memory-less transitions, τ should also be small enough to allow sufficient kinetic resolution for studying the folding mechanism. Likewise, the radius of the clustered nodes needs to be large enough to have adequate statistics, but too much coarse graining could lead to non-Markovian behavior by grouping structures separated by significant barriers. The MSM-MD used in the present study is based on state clustering and a lag time of 5 ns, which we found gives a good trade-off between satisfying the Markov property and providing adequate kinetic resolution and statistics.

Construction of MSM-REMD from Replica Exchange Simulation. We also constructed a kinetic network model for Trp-cage (MSM-REMD) by clustering the snapshots, obtained from REMD simulations at temperatures from 363 to 566 K, into a set of conformational microstates. The details about the REMD simulations are described in ref The clustering is performed on the basis of the Cα-RMSD between pairs of snapshots, using a cutoff radius of 1.1 Å. All the neighboring conformations found within the cutoff RMSD from a selected central node are merged to create a composite node.
The resulting clustered nodes generally consist of contributions from many REMD snapshots observed at several temperatures. The rates for the memory-less transitions on the network were parametrized using a scheme involving many short MD simulations. The rate constant $k_{ij}$ for the transition from state j to state i is

$$k_{ij} = C_{ij} \left( \frac{P_{i,eq}}{P_{j,eq}} \right)^{1/2} \qquad (1)$$

Here the prefactor is $C_{ij} = C_{ji}$, which satisfies the detailed balance $k_{ij} P_{j,eq} = k_{ji} P_{i,eq}$. By definition, the rate $k_{ij}$ can be expressed in terms of the branching probability $P_{j \to i}$ and the mean lifetime at node j, $T_j$:

$$k_{ij} = \frac{P_{j \to i}}{T_j} \qquad (2)$$

Equation 2 suggests a way to parametrize $k_{ij}$ based on the lifetime observed from many short MD trajectories. The branching probability $P_{j \to i}$ can be approximately expressed in terms of the RMSD distance between node i and j, $\Delta r_{ij}$. From running many short MD trajectories, we found that the average probabilities of jumping to a neighboring node at Δr can be fitted with

$$P_{j \to i}(\Delta r_{ij}) \propto \frac{\Delta r_{ij}^{-6}}{\langle \Delta r_{ij}^{-6} \rangle_j} \qquad (3)$$

(Figure S3, Supporting Information). Additionally, $P_{j \to i}$ decreases approximately in inverse proportion to the number of neighbors of node j, i.e., $P_{j \to i}(\Delta r_{ij}) \propto 1/N_j^{nb}$. From eqs 2 and 3, the rate constant $k_{ij}$ is expressed as

$$k_{ij} \approx \frac{\Delta r_{ij}^{-6}}{\langle \Delta r_{ij}^{-6} \rangle_j N_j^{nb} T_{j,MD}} \left( \frac{P_{i,eq}}{P_{j,eq}} \right)^{1/2} \qquad (4)$$

Here the prefactor is identified with $C_{ij}$ in eq 1, i.e., $C_{ij} \approx \Delta r_{ij}^{-6} / (\langle \Delta r_{ij}^{-6} \rangle_j N_j^{nb} T_{j,MD})$. Since $C_{ij} = C_{ji}$ (needed to maintain detailed balance), we symmetrize $C_{ij}$ and write $C_{ij} \approx (1/2)[\Delta r_{ij}^{-6} / (\langle \Delta r_{ij}^{-6} \rangle_j N_j^{nb} T_{j,MD}) + \Delta r_{ij}^{-6} / (\langle \Delta r_{ij}^{-6} \rangle_i N_i^{nb} T_{i,MD})]$. Taking these considerations together, we obtain

$$k_{ij} \approx \frac{1}{2} \Delta r_{ij}^{-6} \left[ \frac{1}{\langle \Delta r_{ij}^{-6} \rangle_j N_j^{nb} T_{j,MD}} + \frac{1}{\langle \Delta r_{ij}^{-6} \rangle_i N_i^{nb} T_{i,MD}} \right] \left( \frac{P_{i,eq}}{P_{j,eq}} \right)^{1/2} \qquad (5)$$

To test how well the rates parametrized using eq 5 describe the kinetics, we compare the distributions of state lifetimes obtained from many short MD simulations and those from stochastic simulations on the MSM-REMD. The results show that the two distributions of the state lifetimes agree well with each other (Figure S4, Supporting Information). The procedures for the decomposition of the folding flux into folding pathways, the clustering of folding pathways into folding tubes, and the mapping of stochastic simulation trajectories onto folding tubes were described in a previous paper. 41

RESULTS

Below we first present the results of sampling the Trp-cage unfolded state by MD and the MSM built from MD (MSM-MD), including the time scales of structural reorganization in the unfolded state, the kinetic partitioning of the unfolded state into populations that fold along different paths, and the folding rates associated with different folding paths. We then analyze the distribution of the folding passage times, transit times, and the nature of the heterogeneity in the folding pathways.
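As a brief aside on the Methods, the symmetrized rate of eq 5 can be sketched and checked against detailed balance. This is an illustration under stated assumptions, not the authors' implementation: it reads the distance dependence in eq 5 as the inverse sixth power of Δr, and the angle-bracket term as the mean of that quantity over the neighbors of a node. The function name `symmetrized_rate`, the dictionary keys, and all numerical values are hypothetical.

```python
import math


def symmetrized_rate(dr_ij, node_i, node_j):
    """Rate for the jump j -> i in the assumed reading of eq 5.

    Each node is a dict with hypothetical keys:
      'p_eq'       equilibrium population
      'n_nb'       number of neighbors
      'lifetime'   mean lifetime from short MD runs
      'mean_dr_m6' mean of dr**-6 over the node's neighbors
    """
    w = dr_ij ** -6

    def prefactor(node):
        return w / (node["mean_dr_m6"] * node["n_nb"] * node["lifetime"])

    # Symmetrized prefactor, so that C_ij equals C_ji.
    c_ij = 0.5 * (prefactor(node_j) + prefactor(node_i))
    return c_ij * math.sqrt(node_i["p_eq"] / node_j["p_eq"])


a = {"p_eq": 0.02, "n_nb": 12, "lifetime": 0.8, "mean_dr_m6": 0.05}
b = {"p_eq": 0.005, "n_nb": 7, "lifetime": 0.3, "mean_dr_m6": 0.11}
k_ab = symmetrized_rate(1.4, a, b)  # jump b -> a
k_ba = symmetrized_rate(1.4, b, a)  # jump a -> b
print(k_ab * b["p_eq"], k_ba * a["p_eq"])
```

The two printed products agree, which is the detailed balance condition stated after eq 1.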
Finally, we discuss the results for the MSM constructed from REMD sampling (MSM-REMD) to investigate how the folding kinetics is influenced as a function of temperature and for comparison with the ultralong MD trajectory from the Shaw group.

Sampling of the Unfolded States by MD and MSM-MD. We first examine to what extent the sampling of the unfolded states has converged in the 208 μs MD trajectory, by estimating the fraction of the unfolded conformational space sampled by the trajectory as a function of simulation time. To this end, we cluster the trajectory; the clustering scheme we employed is described in the Methods. We calculated the fraction of the unfolded state clusters visited by the trajectory as a function of simulation time and found that about 90% of the unfolded states are sampled within the first 40 μs, which is one-fifth of the total simulation time (Figure S5, Supporting Information). The trajectory spends the remaining 80% of the simulation time mostly revisiting the structures seen earlier. This result is reasonably robust with respect to the variation in the granularity of the clustering (see Figure S5, Supporting Information). It is therefore an indication that the ultralong MD simulation exhibits good convergence in the sampling of the unfolded state ensemble. 55 We characterized the structural reorganization in the unfolded state to address the following: (1) How heterogeneous are the structures explored between two adjacent unfolding/folding events? (2) What is the time scale for chain extension and collapse for unfolded Trp-cage before it folds? We choose the radius of gyration ($R_g$) as the order parameter to characterize the structural reorganization in the unfolded region. Figure 2 shows the distribution of $R_g$ and its fluctuations in the MD trajectory.
The folded structure has an R g value of 7 Å; for the unfolded state, R g spans broadly the range from 6.5 to 15 Å, which correspond, respectively, to compact unfolded conformations and fully extended chains, two examples of which are shown in Figure S6, Supporting Information. As seen from Figure 2b, the MD trajectory visits both extended conformations (R g 14 Å) and compact unfolded structures (R g 8 Å) many times before it folds. We computed the distribution Γ(τ) of relaxation times τ for the radius of gyration in the unfolded state (see Figure S7, Supporting Information) and found a dominant relaxation mode at τ = 6 ns along with a much weaker mode centered at τ = 38 ns. The relaxation times for the fluctuation between the extended and collapsed forms of Trp-cage are much shorter than the average residence time of 5 μs in the unfolded state between adjacent folding events. We also examined the time scale of collective motions in the unfolded state by computing the autocorrelation functions for the principal components. The relaxation time along the slowest principal component is found to be 40 ns, i.e., similar to the time scale of fluctuations in R g (Figure S8, Supporting Information). In order to gain further insight into the kinetic properties of the unfolded state ensemble, we analyze the fraction of the total conformational space of the unfolded ensemble visited by the MD trajectory, between two adjacent unfolding/folding transitions. We compute this quantity by analyzing unfolded intervals between each consecutive unfolding/folding event. Here, an unfolded interval starts from the time when the trajectory enters the unfolded region and ends when the trajectory enters the folded state. Figure 3 shows the fraction of unfolded conformations visited during each of the unfolded intervals before the trajectory folds. In 45% of the folding events, the trajectory visits >30% of the unfolded states before it folds. 
On average, a trajectory typically explores about 27% of the unfolded conformational space between consecutive unfolding/folding events. Figure 3 also shows that the fraction of unfolded states visited is strongly correlated with the folding passage time. This correlation is an indication of substantial mixing within the unfolded state ensemble, as discussed below. Using the ultralong MD trajectory, we constructed a state kinetic network model (MSM-MD); see Methods. We performed a kinetic Monte Carlo (KMC) simulation for 64 ms, which contains folding and unfolding events. The radius of gyration time series are nearly indistinguishable from those observed in the direct molecular dynamics simulation (Figure 2b; see also Table 1). We computed the folding pathways and their fluxes using transition path theory (TPT) 31,39 to analyze the MSM-MD network model; 5000 pathways were generated. To obtain


mechanistic insights, the pathways were clustered into a much smaller number of folding tubes (about 100), each containing between 10 and 100 structurally similar pathways. The grouping of folding pathways into tubes is based on structural similarity between the structures along two pathways; 40,41 the average RMSD distance between two pathways in different tubes is at least 4 Å.

Figure 2. (a) The distribution of radius of gyration from the 208 μs MD trajectory. 20 The $R_g$ corresponding to the native structure is indicated by the red arrow. (b) A 50 μs portion of the time series of $R_g$ obtained from the MD trajectory and from the stochastic simulation trajectory on the MSM-MD. The letter U indicates a time span in the unfolded state.

Figure 3. For each unfolding/folding transition observed in the ultralong MD trajectory, from top to bottom: the fraction of unfolded state space sampled before folding ordered from largest to smallest; the corresponding passage time for folding; the folding transit time. The x-axis is the index of each folding transition, in descending order of fraction of visited unfolded states.

Table 1. Time Scales of Trp-Cage Folding and Unfolded State Kinetics from the Ultralong MD and MSM-MD Model

model | mean folding time | conformational relaxation time in the unfolded states (a) | time to sample 90% of the unfolded states | transit time of folding
MD (208 μs) (b) | 5.5 μs | mode 1: 6 ns; mode 2: 38 ns | 40 μs | 23.6 ns
MSM-MD (c) | 5.3 μs | mode 1: 7 ns; mode 2: 30 ns | 44 μs | 30 ns

(a) Estimated from the time correlation functions of $R_g$. (b) The MD trajectory contains 31 folding events. (c) The Markov state model contains microstates. The time scales are obtained by running kinetic Monte Carlo simulation which generated folding events.

Kinetic Partition of the Unfolded State by Folding Tubes.
We now discuss the results on the kinetic partitioning of the unfolded state by folding tubes and the characterization of the unfolded populations and folding rates associated with different folding tubes. By projecting the stochastic MSM-MD

trajectories onto the different folding tubes, we determined three important kinetic quantities associated with each folding tube: J(α), the flux through tube α; k(α), the folding rate corresponding to the tube; and P(α), the fraction of the unfolded population that folds through tube α. The flux J(α) is defined by the number of folding events through tube α per unit time. The tube rate constant k(α) is obtained from the inverse of the mean first passage time for the folding events through tube α. The population P(α) of the unfolded state which folds through tube α is calculated using $P(\alpha) = \sum_{i \in U} t(i|\alpha) / T_{total}$, where $t(i|\alpha)$ is the residence time that trajectories which fold through tube α spend on unfolded node i and $T_{total}$ is the total simulation time. The set {P(α)} corresponds to a kinetic partition of the unfolded state ensemble into populations which fold along each tube; the partition has the property that $\sum_\alpha P(\alpha) = P_{unfold}$. Additionally, for the hub folding model, as different unfolded populations fold independently along different paths, $\sum_i P(i|\alpha) P(i|\beta)$ is expected to be small, although this is not observed for the kinetic network model of Trp-cage folding constructed from the ultralong MD trajectory (see below). The values for P(α), J(α), and k(α) calculated for the top 16 folding tubes are shown in Figure 4. Although the fluxes vary by more than 3-fold along the different folding tubes, they all have very similar folding rates, i.e., k(α) ≈ constant. Consequently, the tube fluxes are proportional to the corresponding populations, i.e., P(α) ∝ J(α). We discuss how these results are a direct consequence of the significant mixing within the unfolded state before folding.

Figure 4. The tube fluxes J(α), tube population P(α), and tube folding rates k(α) for the top 16 folding tubes obtained using MSM-MD.
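The three per-tube quantities can be accumulated directly from a list of folding events. The sketch below is illustrative only. It assumes each folding event has already been assigned to a tube and comes with its first passage time and its residence times on unfolded nodes; the function name `tube_statistics`, the event layout, and the toy numbers are hypothetical.

```python
def tube_statistics(events, total_time):
    """Per-tube flux J, rate k, and unfolded population P.

    events: list of (tube, passage_time, residence), where residence
            maps an unfolded node index to the time spent on it before
            the folding event.
    total_time: total simulation time, in the same units.
    """
    acc = {}
    for tube, passage_time, residence in events:
        s = acc.setdefault(tube, {"n": 0, "fpt": 0.0, "t_unf": 0.0})
        s["n"] += 1
        s["fpt"] += passage_time
        s["t_unf"] += sum(residence.values())
    return {
        tube: {
            "J": s["n"] / total_time,      # folding events per unit time
            "k": s["n"] / s["fpt"],        # inverse mean first passage time
            "P": s["t_unf"] / total_time,  # unfolded population for this tube
        }
        for tube, s in acc.items()
    }


events = [
    ("A", 4.0, {0: 3.0, 1: 1.0}),
    ("A", 6.0, {0: 2.0, 2: 4.0}),
    ("B", 5.0, {1: 5.0}),
]
print(tube_statistics(events, total_time=20.0))
```

In this toy input both tubes have the same rate k while tube A carries twice the flux and twice the population of tube B, mirroring the proportionality between P(α) and J(α) described in the text.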
We have also computed the overlap between the distributions of the unfolded state populations which fold along different tubes. This is another indication of the extent of mixing within the unfolded state between folding events. We define the conditional probability $P(i|\alpha) = t(i|\alpha) / (\sum_{i \in U} t(i|\alpha))$. It corresponds to the fraction of the time the system spends on unfolded node i given that it folds along tube α. It is obtained by normalizing $t(i|\alpha)$ by the total time that trajectories which fold through tube α spend in the entire unfolded region. The distribution $P(i|\alpha)$ over all the unfolded nodes describes the extent to which the unfolded states are explored before folding through tube α. In the case of extensive mixing between unfolded state populations, $P(i|\alpha)$ should be only weakly dependent on α. In contrast, for a kinetic hub-like scenario, in which the exchanges between unfolded states are severely limited, each folding tube's $P(i|\alpha)$ distribution is confined to a local area of the unfolded ensemble. To examine the extent to which the $P(i|\alpha)$ distributions overlap, we define a quantitative measure of the overlap between the two normalized distributions $P(i|\alpha)$ and $P(i|\beta)$ in discretized space, $\Omega(\alpha, \beta) = [\sum_{i \in U} P(i|\alpha) P(i|\beta)] / [(\sum_{i \in U} P(i|\alpha)^2)^{1/2} (\sum_{i \in U} P(i|\beta)^2)^{1/2}]$. For the case of rapid mixing, Ω(α, β) will be close to 1. In the opposite regime, if the two folding tubes α and β are connected with very different regions of the unfolded ensemble, then Ω(α, β) will be close to 0. We found that all the matrix elements of Ω(α, β) for the top 16 folding tubes are greater than 0.95, which implies extensive mixing prior to folding.

Another unresolved question in protein folding concerns the role of residual structures in the unfolded states in modulating folding kinetics. 14 For example, UV-Raman measurements found significant α-helical content of Trp-cage under denaturing conditions. 43 It has been speculated that residual secondary structure may help accelerate Trp-cage folding.
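The overlap measure Ω(α, β) defined above is the normalized inner product of the two conditional distributions over unfolded nodes. A minimal sketch follows, assuming both distributions are given as equal-length lists indexed by unfolded node; the function name `tube_overlap` and the example distributions are hypothetical.

```python
import math


def tube_overlap(p_alpha, p_beta):
    """Overlap between two distributions over the same unfolded nodes.

    Returns sum_i p_a(i) p_b(i) / (|p_a| |p_b|), which lies in [0, 1]
    for non-negative inputs.
    """
    if len(p_alpha) != len(p_beta):
        raise ValueError("distributions must cover the same nodes")
    dot = sum(a * b for a, b in zip(p_alpha, p_beta))
    norm_a = math.sqrt(sum(a * a for a in p_alpha))
    norm_b = math.sqrt(sum(b * b for b in p_beta))
    return dot / (norm_a * norm_b)


well_mixed = tube_overlap([0.4, 0.3, 0.2, 0.1], [0.38, 0.32, 0.19, 0.11])
hub_like = tube_overlap([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.6, 0.4])
print(well_mixed, hub_like)
```

Nearly identical distributions give a value near 1, while distributions supported on disjoint sets of unfolded nodes give exactly 0, the two regimes contrasted in the text.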
To probe the role of preexisting residual structure in folding, we performed a large number of stochastic simulations initiated from unfolded conformations with and without the residual secondary structure. In the MD trajectory, about 7% of the unfolded conformations contain an intact N-terminal α-helix (residues 2–9). This value is consistent with the UV-resonance Raman study. 43 We initiated 8000 folding simulations from (1) unfolded conformations with an intact N-terminal α-helix and (2) unfolded states with a disordered N-terminal segment. The folding starting from the conformations with the α-helix is only slightly faster than that starting from conformations without the secondary structure. We also examined in a similar fashion the influence of nonnative compactness in the unfolded region. Folding simulations starting from collapsed unfolded conformations ($R_g$ < 7.0 Å) and from extended conformations ($R_g$ > 15 Å) result in virtually identical MFPTs. Therefore, neither the preexisting N-terminal α-helix nor nonnative compactness was found to significantly influence the Trp-cage folding rate.

Comparison of the Folding Kinetics and Pathways from MD and MSM-MD. There are 31 folding events in the 208 μs MD trajectory. The distribution of the first passage times of the folding events can be approximately fit to a single exponential (Figure S9, Supporting Information). The mean first passage time (MFPT) of the 31 folding transitions is found

to be 5.5 μs, in good agreement with the experimental folding time of 4 μs at room temperature. 56 Another quantity describing the folding kinetics is the transit time, which is the time for a folding trajectory to traverse the intermediate region. The average transit time of all the folding transitions is 23.6 ns; the range is between 1.8 and 267 ns. The observation that the average transit time is about 200 times smaller than the mean folding passage time of 5 μs indicates that the Trp-cage folding is highly cooperative. Examination of the folding transitions sampled by the ultralong MD trajectory revealed heterogeneous structural pathways leading to the folded state. Here we discuss two representative paths (Figure 5). In pathway A, the polypeptide chain first undergoes a hydrophobic collapse, forming a compact molten globule containing multiple non-native H-bonds; later on, the non-native interactions are loosened, which is followed by the formation of the N-terminal α-helix and native hydrophobic core. In pathway B, the folding starts from a more extended conformation with a preformed α-helix in the unfolded state; the hydrophobic core and the helix then form in concert to complete the folding process. The two pathways have very different transit times: in pathway A, the trajectory has to loosen the non-native contacts and gradually replace them with a native hydrophobic core; these localized structural rearrangements take place in a relatively long transit time of 44 ns. By contrast, the folding along pathway B is much simpler because the starting unfolded structure contains fewer non-native interactions; the associated transit time in this pathway is only 3 ns. We also found that pathway B is the more dominant pathway; i.e., there are more folding transitions in which the α-helix forms before the hydrophobic collapse.

Figure 5. Two representative folding transitions extracted from the 208 μs MD trajectory.
By carrying out stochastic simulations on the kinetic network generated from the MD simulation, we obtained a large number of folding transitions. The folding passage time distribution exhibits single-exponential decay (Figure S10, Supporting Information), with a folding time close to the average of the 31 transitions observed in the ultralong MD simulation. We have compared the folding tubes constructed using transition path theory applied to the MSM-MD kinetic network with the folding transitions observed in the MD trajectory. Using rmsd = 3.0 Å as the cutoff distance between a TPT folding pathway and an MD folding pathway, we found that 29 out of 31 MD folding transitions can be assigned to TPT folding tubes. Figure 6 shows the fluxes of the top 12 TPT folding tubes compared with the number of MD folding transitions assigned to each folding tube. There is a general correspondence between the folding transitions observed in the ultralong MD simulation and the flux through folding tubes generated from the kinetic network (Figure 6). The MSM folding tube with the largest flux is also the one that contains the largest number of MD folding transitions. The folding mechanism in this tube is the same as that in MD pathway B discussed above, which features early formation of the α-helix (Figure 5). The analysis of the TPT folding tubes shows that about 45% of the flux is carried by pathways in which the α-helix forms early. In the remaining pathways, the hydrophobic compaction either occurs early or forms in concert with the α-helix.

Figure 6. The blue histogram shows the distribution of fluxes along the top 12 folding tubes generated from the MSM-MD kinetic network using TPT. The red histogram represents the fluxes of MD folding transitions projected onto the folding tubes.

It should be noted that, in general, the MSM predicts more folding pathways than are contained in the raw MD trajectory. For example, two such pathways that are predicted by MSM-MD but not observed in the original MD data are shown in Figure S11, Supporting Information. The reason for the richer set of folding pathways in the MSM can be qualitatively understood by considering the schematic transition diagram shown in Figure S12, Supporting Information, where an MD trajectory contains transitions U1 → I → U2 and, separately, N → I → N. There is no direct folding transition in this MD transition diagram. The corresponding MSM, however, would predict folding pathways U1 → I → N and U2 → I → N.

MSM Constructed from REMD Simulations. The MSM-MD model we have analyzed in the previous sections was based on MD simulations performed well above the Trp-cage folding temperature, with just 17% native population (ref 20). What is the kinetic picture of Trp-cage folding below the folding temperature? To address the temperature dependence of folding kinetics, here we study a Markov network model of Trp-cage built from temperature replica exchange (REMD) simulations with implicit solvation over a wide temperature range (refs 40 and 41). We call this Markov network model MSM-REMD. Using the REMD data obtained over a wide temperature range, we determined the Trp-cage melting behavior (Figure 7); the folding temperature T_f was found to be 468 K. The high melting temperature compared to the experimental T_f is typical of the results found with implicit solvent models and is partially attributable to the overly attractive intramolecular interactions in the OPLS-AA force field with the AGBNP implicit solvent model (ref 57) used in the REMD simulations.

Figure 7. The melting curve of Trp-cage obtained from REMD simulation.
To investigate Trp-cage kinetics below and above the folding temperature, we performed TPT pathway calculations and stochastic simulations using the MSM-REMD model at T = 465 and 539 K, at which 54 and 11% of the populations are folded, respectively. We found that both the rate of mixing within the unfolded state and the diversity of the folding pathways vary strongly with temperature. The folding pathway ensemble becomes more diverse at higher temperatures. At T = 465 K, the top folding tube carries 61% of the total flux and the top three folding tubes account for 90% of the total flux. In contrast, at T = 539 K, the top folding tube carries just 30% of the flux and it takes nine folding tubes to accumulate 90% of the total flux (Figure 8, top row). Below the folding temperature, we observe very slow folding through one of the folding tubes (Figure 8). The folding through this tube is 60 times slower than through the fastest folding tube.

Next, we examine the temperature dependence of the mixing within the unfolded states by computing the conditional probabilities P(i|α) for the different folding tubes α, and the overlaps of P(i|α) with P(i|β) among the folding tubes at both T = 465 K and T = 539 K. Table 2 shows the overlap factor Ω(α, β) between different pairs of unfolded population distributions P(i|α) and P(i|β) associated with the top folding tubes. At T = 465 K, the overlaps between the P(i|α) of the slow folding tube (No. 3) and those of the rest of the tubes are zero, indicating that the unfolded populations associated with the slow tube and those associated with the other tubes fold independently. At the high temperature T = 539 K, the overlaps between the P(i|α) of the slow tube (No. 6) and those of the other tubes increase significantly (Table 2). This trend reflects more extensive mixing within the unfolded ensemble above the folding temperature. How does this enhanced mixing in the unfolded state affect the folding kinetics at higher temperature?
To answer this, we compare the tube populations P(α), fluxes J(α), and folding rates k(α) for the different folding tubes at the two temperatures (Figure 8). The difference in the folding rates between the slowest and fastest folding tubes decreases significantly as the temperature is increased (Figure 8, bottom row). At the lower temperature T = 465 K, the ratio of the slowest folding rate to the fastest folding rate is k_slow/k_fast ≈ 1/60, corresponding to the 60-fold slower folding through the slow tube. At the higher temperature T = 539 K, this ratio becomes k_slow/k_fast ≈ 0.2. Another observation from Figure 8 is that, at the higher temperature T = 539 K, there is a clear correlation between J(α) and P(α), i.e., J(α) ∝ P(α). The plot of J(α) against P(α) at this temperature shows a correlation with R² ≈ 0.7 (Figure S13, Supporting Information). Such a correlation between J(α) and P(α) is not observed at the lower temperature T = 465 K.

We have identified the conformational species that folds through the slow folding tube at the lower temperature. The average structure of the slow folding population adopts a hairpin-like conformation stabilized by between five and seven nonnative hydrogen bonds. It also contains a nonnative hydrophobic core featuring Trp6–Arg16 stacking. The same compact conformation is also sampled by the ultralong MD trajectory in explicit solvent, but in explicit solvent these conformations are not metastable. In contrast, their lifetime is 500 ns with the AGBNP implicit solvent model. The results using the MSM-REMD trajectory at the lower temperature reflect the increased ruggedness of the free energy landscape of the unfolded ensemble at lower temperatures with the AGBNP implicit solvent model, and a more hub-like partitioning of the unfolded state ensemble, in which slow folding populations and fast folding populations fold independently.
At the higher temperature, however, the MSM-REMD results are qualitatively more similar to those observed using the MSM-MD model (compare Figure 4 and the right half of Figure 8).

Figure 8. Results of J(α), P(α), and k(α) calculated using the MSM-REMD model at T = 465 K and T = 539 K.

DISCUSSION

In this study, we have focused on kinetics within the unfolded state ensemble and its influence on folding, which is less well understood than other aspects of protein folding. We begin with the following observations. First, the sampling of the unfolded states in the ultralong MD simulation shows good convergence (Figure S5, Supporting Information). Second, the kinetic properties observed in the direct MD simulation are well reproduced by the Markov state model constructed from the MD simulation: the folding passage times, transit times, unfolded state dynamics, and folding pathways obtained from the 208 μs MD simulation and the 64 ms stochastic MSM-MD simulation are in good agreement (Table 1 and Figure 6).

The analysis of the MD and MSM-MD data suggests that the unfolded population of Trp-cage mixes well before folding. The relaxation times of the autocorrelation functions for the radius of gyration and the principal components are 40 ns, much shorter than the folding time of 5 μs. The experimentally determined time scales for large-scale motions in unfolded proteins have been reported in several studies. Using laser temperature jump, Sadqi et al. found that the hydrophobic collapse of the acid-denatured 40-residue BBL occurs on a 60 ns time scale (ref 58). Using single-molecule spectroscopy, Schuler and co-workers found that the chain reconfiguration time for the unfolded, 70-residue cold shock protein (Csp) was approximately 100 ns (refs 59 and 61). The orders of magnitude of the relaxation times for the radius of gyration calculated in the present study for the 20-residue Trp-cage are consistent with those measured for the somewhat larger polypeptides BBL and Csp. Additional evidence of significant mixing within the unfolded state ensemble comes from the fact that the folding rates are independent of where within the unfolded basin folding is initiated, and from the extensive overlaps among the unfolded state populations that fold along different pathways.
The strong correlation between folding passage times and the fraction of the unfolded nodes visited before each MD folding transition also reflects the absence of major internal barriers in the unfolded basin (Figure 3).

To further analyze how the kinetics within the unfolded state affects folding, we have studied a simple five-state model (Figure S14a, Supporting Information), in which two unfolded nodes 1 and 2 have very different microscopic escape rates k_13 and k_24, with k_13/k_24 = 10. We examine how the rates of the fast folding tube α and the slow folding tube β are affected by changes in the U-state interconversion rate k_12. The simulation shows that the tube folding rates k_α and k_β strongly depend on the rate of transition within the U-state (Figure S14b, Supporting Information). When the transition rate within the U-state, k_12, is small relative to the microscopic escape rates k_13 and k_24, the unfolded populations on nodes 1 and 2 fold independently with very different k_α and k_β, governed by the intrinsic escape rates k_13 and k_24, respectively, producing biexponential folding time distributions (Figure S15, Supporting Information). As k_12 increases, the difference between k_α and k_β decreases monotonically. When k_12 is comparable to or faster than k_13 and k_24, the two tube folding rates k_α and k_β converge to the overall folding rate k_tot (Figures S14b and S15, Supporting Information).

Table 2. MSM-REMD Results: The Overlap Factor Matrix Ω(α, β) Involving the Top Folding Tubes That Account for 95% of the Total Folding Flux. Panels: (a) T = 465 K; (b) T = 539 K. Footnote a: The slow folding tubes are indicated by bold font.

On the basis of the simple model results and using the concept of P_α introduced earlier, we can write expressions for the folding rates k_α when mixing within the unfolded free energy basin is much slower or much faster than folding (see Table 3). The tube folding rate k_α has the simple, general expression k_α = J_α/P_α, where the tube flux is obtained from transition path theory (refs 30 and 31):

$$J_\alpha = \sum_{i \in U,\ j \notin U} k_{ij}\, p_i^{\mathrm{eq}}\, p_j^{\mathrm{fold}}$$

Table 3. Results of k_α, P_α, and J_α for Folding Tube α, Determined from Studying a Simple Folding Model (Figure S13, Supporting Information) (Neglecting the Small Intermediate State Population)

quantity | general | fast U-state mixing (funneled folding landscape) | slow U-state mixing (hub folding landscape)
k_α | J_α/P_α | J_total/P^eq_total,U | k_13 (tube α); k_24 (tube β)
P_α | […] | (J_α/J_total) P^eq_total,U | p_1^eq (tube α); p_2^eq (tube β)
J_α | Σ_{i∈U, j∉U, j∈α} k_ij p_i^eq p_j^fold | k_13 p_1^eq (tube α); k_24 p_2^eq (tube β) | k_13 p_1^eq (tube α); k_24 p_2^eq (tube β)

In the fast exchange limit, the folding rate along a folding tube becomes k_α ≈ J_total/P^eq_total,U ≈ k_tot, which is the same for all the folding tubes, independent of the intrinsic rates (k_13 and k_24). In the limit of slow exchange within the unfolded basin, the result is k_α = J_α/P^eq_α,U, where P^eq_α,U is the unfolded population locally associated with tube α. In this regime, k_α depends on the escape rates from the local population P^eq_α,U. While the rates along individual folding tubes are modulated by the rate of U-state mixing, k_tot, which is the simple average of folding events per unit time (and also the inverse of the mean first passage time), is constant and can be written as the weighted average of the k_α:

$$k_{\mathrm{tot}} = \frac{J_{\mathrm{total}}}{P_{\mathrm{total},U}} = \frac{J_\alpha + J_\beta}{P_{\mathrm{total},U}} = \frac{P_\alpha}{P_{\mathrm{total},U}}\, k_\alpha + \frac{P_\beta}{P_{\mathrm{total},U}}\, k_\beta$$

We now apply these insights to interpret the results of Trp-cage folding obtained from the MSM-MD model. As shown in Figure 4, for the results from the stochastic simulations on the MSM-MD kinetic network, P_α ∝ J_α and k_α ≈ constant. Comparing with Table 3, we can see that such behavior is consistent with the scenario of significant U-state mixing.

The result that under the fast exchange condition the folding kinetics is single exponential was first pointed out by Ellison and Cavagnero in an insightful study on the role of unfolded state kinetics (ref 16). The authors studied different types of folding energy landscapes using simple kinetic models and concluded that, under the condition of fast exchange in the unfolded basin, it is not possible to determine the microscopic rate constants for different parallel folding routes by a simple experiment in bulk solution. They also observed that the folding flux along a given route is controlled by the intrinsic escape rate along that route. These results agree well with our analysis of the Trp-cage folding kinetics and the simple model.
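The dependence of the tube folding rates on U-state mixing described for the simple model can be illustrated with a short sketch. This is not the five-state model of Figure S14a: it is a reduced, hypothetical two-node version in which the intermediate states are dropped, each unfolded node folds irreversibly through its own exit, and the two nodes interconvert symmetrically so that their populations are equal. As a further assumption, the tube "rate" is approximated here by the inverse of the mean folding time conditioned on the exit used, rather than by J_α/P_α. The function name `tube_mean_times` and the numerical values are illustrative only.

```python
import numpy as np

def tube_mean_times(k12, k13=10.0, k24=1.0):
    """Mean folding time conditioned on the exit (tube) used.

    Node 1 folds through tube alpha at rate k13, node 2 through tube beta
    at rate k24, and the nodes interconvert at rate k12 in both directions.
    """
    # Generator restricted to the two unfolded (transient) nodes.
    q = np.array([[-(k12 + k13), k12],
                  [k12, -(k12 + k24)]])
    inv = np.linalg.inv(-q)
    p0 = np.array([0.5, 0.5])   # start from equal unfolded populations
    exits = {"alpha": np.array([k13, 0.0]), "beta": np.array([0.0, k24])}
    times = {}
    for tube, r in exits.items():
        prob = p0 @ inv @ r                 # probability of folding via this tube
        first_moment = p0 @ inv @ inv @ r   # integral of t times the exit density
        times[tube] = first_moment / prob
    return times

slow = tube_mean_times(k12=1e-6)   # negligible mixing: independent folding
fast = tube_mean_times(k12=1e5)    # rapid mixing: a single common rate
gaps = [tube_mean_times(k)["beta"] - tube_mean_times(k)["alpha"]
        for k in (1e-3, 1.0, 1e3)]
print(slow, fast, gaps)
```

With negligible mixing the two conditional times approach 1/k13 and 1/k24, while with rapid mixing both approach 2/(k13 + k24), the inverse of the population-weighted average escape rate for this equal-population toy model.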
As we show in Table 3, the flux for a folding route is determined by the product of the intrinsic rate and the equilibrium population of the unfolded region from which the folding route originates.

The results for P_α, J_α, and k_α calculated using the MSM-REMD model (Figure 8) reveal the temperature dependence of the unfolded state landscape: at low temperature, the landscape contains a deep basin whose population folds only through a slow folding tube and does not exchange with other regions of the unfolded ensemble; at higher temperature, there is considerable mixing between the slow folding population and the rest of the unfolded basin, which is reflected in the overlap factor Ω(α, β) (Table 2). The relationships between J_α and P_α at the different temperatures provide additional evidence for the greater mixing within the unfolded basin at higher temperature. We have shown that a strong correlation between J_α and P_α is a signature of significant mixing in the unfolded state relative to folding (Table 3). Here we look at J_α and P_α obtained from the MSM-REMD model. At T = 465 K, there is little correlation between J_α and P_α; however, at T = 539 K, a stronger correlation between the two quantities emerges (R² ≈ 0.7, Figure S13, Supporting Information). This suggests that, at the higher temperature, the unfolded state landscape becomes substantially smoother, which allows for more rapid exchange between the different folding tube populations. The MSM-REMD result at the higher temperature is qualitatively similar to the results we obtained at ambient temperature using the MSM-MD model based on the ultralong MD trajectory.

We now examine our results in the light of the insightful paper by Bicout and Szabo (ref 5), who studied different folding landscapes by modeling the protein dynamics in conformational space as diffusion under a spherically symmetric potential. They showed that the folding kinetics on both a golf-course landscape (Levinthal) and a funnel landscape (ref 2) is single exponential, which arises from the entropic barrier to folding. They also showed that, to get such two-state behavior, a folding trajectory on these landscapes need not explore most of the unfolded states before folding (ref 5). Our results for Trp-cage folding are consistent with theirs: for the single-exponential, two-state folding behavior of Trp-cage, a trajectory typically explores 27% of the unfolded space before it folds (Figure 3).

Finally, we discuss our results from the perspective of the kinetic hub model of folding introduced by Pande (refs 34, 53, 54, and 62). In this model, the folded state F acts as a hub, so that most paths that connect pairs of unfolded states U1 and U2 pass through F. Hub-like behavior also appears to imply that the unfolded state partitions into subspaces that largely fold along different pathways (ref 14). However, as we have reported in this paper, we find no evidence of a kinetic partitioning of the U-state space into regions that mostly fold along different pathways. Dickson and Brooks (ref 63) introduced a hub score to quantify the hub-like character of a network; the hub score for (U1, U2) corresponds to the fraction of trajectories starting at U1 that pass through the native state F before reaching U2. We have calculated the distribution of hub scores for the MSM-MD network constructed from the Shaw trajectory and obtained an average hub score of […]. Such a high hub score is not inconsistent with the observation of single-exponential folding kinetics of Trp-cage and the rapid mixing within its unfolded state. It is simply a manifestation of the fact that, on a funnel landscape, because of the energetic bias toward the native state, two sufficiently separated unfolded states will be connected by pathways that include folding events. It is therefore not clear how the hub score can be used to distinguish a rugged landscape from a smooth folding funnel.

CONCLUSIONS

An important problem in protein folding is to understand the relationship between the structural heterogeneity and kinetics within the unfolded free energy basin and the folding kinetics. We have investigated the unfolded state kinetics and folding pathways of the miniprotein Trp-cage using (1) a 208 μs MD trajectory in explicit solvent, (2) Markov state model simulations based on the ultralong MD trajectory, and (3) a Markov state model constructed from replica exchange molecular dynamics simulations in implicit solvent over a wide temperature range. Using stochastic simulations and transition path theory, we have explored the kinetics of the unfolded state ensemble and studied its impact on the kinetics of folding.
By comparing the folding behavior observed in the fully atomistic Trp-cage simulations with the kinetics in a simple five-state folding model, we have obtained a relationship between the rate of mixing in the unfolded state and the folding kinetics along individual pathways (tubes). The main result is that conformational mixing in the unfolded state modulates the apparent protein folding rates by affecting the waiting times for folding along different routes. When this mixing is comparable to or faster than folding, the folding rates associated with different folding routes converge to the same value, which is independent of the intrinsic rates along any given route; despite the presence of multiple folding routes with nonuniform barriers, the folding kinetics is essentially single exponential. In the slow exchange limit, the folding rate along a folding route is controlled by the intrinsic rates along that route. In this case, the different unfolded populations fold independently and the overall folding kinetics can deviate from single exponential.

We have presented results showing that, based on atomistic Trp-cage models in explicit and implicit solvent, the Trp-cage unfolded state ensemble does not contain long-lived metastable states; there exists significant mixing in the unfolded state. These results include the time scale for chain extension and compaction within the unfolded state, the approximately uniform folding rates among different folding tubes, the extensive overlaps among the unfolded populations associated with the different folding tubes, and the strong correlation between the flux along folding tubes and the unfolded state populations associated with the corresponding tubes.
Because of the significant internal mixing of the unfolded state, the probability of folding along any of the multiple folding paths is almost the same regardless of where in the unfolded state the folding is initiated.

Analysis of the Markov state model constructed from the temperature replica exchange data provides an opportunity to probe the temperature dependence of the unfolded state kinetics. By studying the results below and above the midpoint of the folding transition, we found that in implicit solvent at low temperature the unfolded state landscape contains a slow folding basin; internally, the exchange between the slow folding population and other regions of the unfolded state basin is much slower than folding. Above the folding temperature, the unfolded state landscape becomes less rugged, allowing more rapid mixing and considerable overlap among the unfolded populations associated with the different folding tubes.

Our study reinforces and extends the simple kinetic model of Ellison and Cavagnero (ref 16) in providing a physical basis for the apparent two-state, single-exponential kinetics exhibited by many proteins with heterogeneous folding pathways. The current work makes use of Markov state kinetic network models built from atomic simulations, stochastic simulations on the network, and transition path theory to analyze how kinetics within the unfolded state affects folding rates. For the models we have studied, the unfolded state of Trp-cage is well mixed, and the rate of exchange within the unfolded state ensemble is comparable to or faster than the folding rate. We emphasize that Trp-cage is a small system and its kinetics may not be representative of the folding of larger and more complex proteins. It would be interesting to apply the computational tools and the concepts of P_α and the overlap matrix introduced in the present study to investigate the folding mechanisms of proteins with different native topology and more complex unfolded state kinetics.

ASSOCIATED CONTENT

Supporting Information. Figures that illustrate the results of the kinetics of the unfolded state and its effects on protein folding. This material is available free of charge via the Internet.

AUTHOR INFORMATION

Corresponding Author: * ronlevy@lutece.rutgers.edu. Phone:

Notes: The authors declare no competing financial interest.

ACKNOWLEDGMENTS

This work has been supported by a grant from the National Institutes of Health (GM30580). Some of the calculations were performed using the XSEDE allocation TG-MCB. We thank Dr. David Shaw and Dr. Piana-Agostinetti for reading the manuscript and for making the long MD trajectory of Trp-cage available for analysis. Dr. Dmitrii Makarov read the manuscript and made very helpful comments. Dr. Emilio Gallicchio also made helpful suggestions. Dr. Weihua Zheng performed the REMD simulations of Trp-cage. Dr. Junchao Xia helped with the figures. My (R.M.L.) interactions with Peter go back to the days of Prince House II at Harvard 35 years ago. His passion for science was clear from the first time I spoke with him. And so too were his brilliance and strong opinions. It is always exciting and energizing talking with Peter Wolynes. Happy Birthday!

REFERENCES

(1) Bryngelson, J. D.; Wolynes, P. G. Intermediates and Barrier Crossing in a Random Energy Model (with Applications to Protein Folding). J. Phys. Chem. 1989, 93,
(2) Bryngelson, J. D.; Onuchic, J. N.; Socci, N. D.; Wolynes, P. G. Funnels, Pathways, and the Energy Landscape of Protein Folding: a Synthesis. Proteins 1995, 21,
(3) Wang, J.; Onuchic, J.; Wolynes, P. Statistics of Kinetic Pathways on Biased Rough Energy Landscapes with Applications to Protein Folding. Phys. Rev. Lett. 1996, 76,
(4) Onuchic, J. N.; Luthey-Schulten, Z.; Wolynes, P. G. Theory of Protein Folding: the Energy Landscape Perspective. Annu. Rev. Phys. Chem.
1997, 48,
(5) Bicout, D. J.; Szabo, A. Entropic Barriers, Transition States, Funnels, and Exponential Protein Folding Kinetics: A Simple Model. Protein Sci. 2000, 9,
(6) Shea, J.-E.; Brooks, C. L., III. From Folding Theories to Folding Proteins: A Review and Assessment of Simulation Studies of Protein Folding and Unfolding. Annu. Rev. Phys. Chem. 2001, 52,
(7) Onuchic, J. N.; Wolynes, P. G. Theory of Protein Folding. Curr. Opin. Struct. Biol. 2004, 14,
(8) Wolynes, P. G. Energy Landscapes and Solved Protein-Folding Problems. Philos. Trans. R. Soc., A 2005, 363,
(9) Kubelka, J.; Hofrichter, J.; Eaton, W. A. The Protein Folding Speed Limit. Curr. Opin. Struct. Biol. 2004, 14,
(10) Shakhnovich, E. Protein Folding Thermodynamics and Dynamics: Where Physics, Chemistry, and Biology Meet. Chem. Rev. 2006, 106,
(11) Dill, K. A.; Ozkan, S. B.; Shell, M. S.; Weikl, T. R. The Protein Folding Problem. Annu. Rev. Biophys. 2008, 37,
(12) Thirumalai, D.; O'Brien, E. P.; Morrison, G.; Hyeon, C. Theoretical Perspectives on Protein Folding. Annu. Rev. Biophys. 2010, 39,
(13) Karplus, M. Behind the Folding Funnel Diagram. Nat. Chem. Biol. 2011, 7,
(14) Sosnick, T. R.; Barrick, D. The Folding of Single Domain Proteins: Have We Reached a Consensus? Curr. Opin. Struct. Biol. 2011, 21,
(15) Zheng, W.; Schafer, N. P.; Wolynes, P. G. Frustration in the Energy Landscapes of Multidomain Protein Misfolding. Proc. Natl. Acad. Sci. U.S.A. 2013, 110,
(16) Ellison, P. A.; Cavagnero, S. Role of Unfolded State Heterogeneity and En-Route Ruggedness in Protein Folding Kinetics. Protein Sci. 2006, 15,
(17) Gin, B. C.; Garrahan, J. P.; Geissler, P. L. The Limited Role of Nonnative Contacts in the Folding Pathways of a Lattice Protein. J. Mol. Biol. 2009, 392,
(18) Shaw, D. E.; Bowers, K. J.; Chow, E.; Eastwood, M. P.; Ierardi, D. J.; Klepeis, J. L.; Kuskin, J. S.; Larson, R. H.; Lindorff-Larsen, K.; Maragakis, P.; et al.
Millisecond-Scale Molecular Dynamics Simulations on Anton; Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis; ACM Press: New York, DOI: /
(19) Shaw, D. E.; Maragakis, P.; Lindorff-Larsen, K.; Piana, S.; Dror, R. O.; Eastwood, M. P.; Bank, J. A.; Jumper, J. M.; Salmon, J. K.; Shan, Y.; et al. Atomic-Level Characterization of the Structural Dynamics of Proteins. Science 2010, 330,
(20) Lindorff-Larsen, K.; Piana, S.; Dror, R. O.; Shaw, D. E. How Fast-Folding Proteins Fold. Science 2011, 334,
(21) Piana, S.; Lindorff-Larsen, K.; Shaw, D. E. How Robust Are Protein Folding Simulations with Respect to Force Field Parameterization? Biophys. J. 2011, 100, L47–L49.
(22) Dellago, C.; Bolhuis, P. G.; Csajka, F. S.; Chandler, D. Transition Path Sampling and the Calculation of Rate Constants. J. Chem. Phys. 1998, 108,
(23) Faradjian, A. K.; Elber, R. Computing Time Scales from Reaction Coordinates by Milestoning. J. Chem. Phys. 2004, 120,
(24) Laio, A. Escaping Free-Energy Minima. Proc. Natl. Acad. Sci. U.S.A. 2002, 99,
(25) A Beccara, S.; Skrbic, T.; Covino, R.; Faccioli, P. Dominant Folding Pathways of a WW Domain. Proc. Natl. Acad. Sci. U.S.A. 2012, 109,
(26) Zheng, W.; Qi, B.; Rohrdanz, M. A.; Caflisch, A.; Dinner, A. R.; Clementi, C. Delineation of Folding Pathways of a β-sheet Miniprotein. J. Phys. Chem. B 2011, 115,
(27) Swope, W. C.; Pitera, J. W.; Suits, F. Describing Protein Folding Kinetics by Molecular Dynamics Simulations. 1. Theory. J. Phys. Chem. B 2004, 108,
(28) Andrec, M.; Felts, A.; Gallicchio, E.; Levy, R. M. Chemical Theory and Computation Special Feature: Protein Folding Pathways from Replica Exchange Simulations and a Kinetic Network Model. Proc. Natl. Acad. Sci. U.S.A. 2005, 102,
(29) Chodera, J. D.; Swope, W. C.; Pitera, J. W.; Dill, K. A. Long-Time Protein Folding Dynamics from Short-Time Molecular Dynamics Simulations. Multiscale Model. Simul. 2006, 5,
(30) Berezhkovskii, A.; Hummer, G.; Szabo, A.
Reactive Flux and Folding Pathways in Network Models of Coarse-Grained Protein Dynamics. J. Chem. Phys. 2009, 130,
(31) Noe, F.; Schütte, C.; Vanden-Eijnden, E.; Reich, L.; Weikl, T. R. From the Cover: Constructing the Equilibrium Ensemble of Folding Pathways from Short off-equilibrium Simulations. Proc. Natl. Acad. Sci. U.S.A. 2009, 106,
(32) Bowman, G. R.; Beauchamp, K. A.; Boxer, G.; Pande, V. S. Progress and Challenges in the Automated Construction of Markov State Models for Full Protein Systems. J. Chem. Phys. 2009, 131,
(33) Pande, V. S.; Beauchamp, K.; Bowman, G. R. Everything You Wanted to Know about Markov State Models but Were Afraid to Ask. Methods 2010, 52,
(34) Bowman, G. R.; Pande, V. S. Protein Folded States Are Kinetic Hubs. Proc. Natl. Acad. Sci. U.S.A. 2010, 107,
(35) Marinelli, F.; Pietrucci, F.; Laio, A.; Piana, S. A Kinetic Model of Trp-Cage Folding from Multiple Biased Molecular Dynamics Simulations. PLoS Comput. Biol. 2009, 5, e
(36) Voelz, V. A.; Bowman, G. R.; Beauchamp, K.; Pande, V. S. Molecular Simulation of ab Initio Protein Folding for a Millisecond Folder NTL9(1–39). J. Am. Chem. Soc. 2010, 132,
(37) Prinz, J.-H.; Wu, H.; Sarich, M.; Keller, B.; Senne, M.; Held, M.; Chodera, J. D.; Schütte, C.; Noe, F. Markov Models of Molecular Kinetics: Generation and Validation. J. Chem. Phys. 2011, 134,
(38) Prinz, J.-H.; Keller, B.; Noe, F. Probing Molecular Kinetics with Markov Models: Metastable States, Transition Pathways and Spectroscopic Observables. Phys. Chem. Chem. Phys. 2011, 13,
(39) Metzner, P.; Schütte, C.; Vanden-Eijnden, E. Transition Path Theory for Markov Jump Processes. Multiscale Model. Simul. 2009, 7,
(40) Zheng, W.; Gallicchio, E.; Deng, N.; Andrec, M.; Levy, R. M. Kinetic Network Study of the Diversity and Temperature Dependence of Trp-Cage Folding Pathways: Combining Transition Path Theory with Stochastic Simulations. J. Phys. Chem. B 2011, 115,
(41) Deng, N.; Zheng, W.; Gallicchio, E.; Levy, R. M. Insights into the Dynamics of HIV-1 Protease: A Kinetic Network Model Constructed from Atomistic Simulations. J. Am. Chem. Soc. 2011, 133,
(42) Neidigh, J. W.; Fesinmeyer, R. M.; Andersen, N. H. Designing a 20-Residue Protein. Nat. Struct. Biol. 2002, 9,
(43) Ahmed, Z.; Beta, I. A.; Mikhonin, A. V.; Asher, S. A. UV Resonance Raman Thermal Unfolding Study of Trp-Cage Shows That It Is Not a Simple Two-State Miniprotein. J. Am. Chem. Soc. 2005, 127,
(44) Neuweiler, H. A Microscopic View of Miniprotein Folding: Enhanced Folding Efficiency through Formation of an Intermediate. Proc. Natl. Acad. Sci. U.S.A. 2005, 102,
(45) Mok, K. H.; Kuhn, L. T.; Goez, M.; Day, I. J.; Lin, J. C.; Andersen, N. H.; Hore, P. J.
A Pre-Existing Hydrophobic Collapse in the Unfolded State of an Ultrafast Folding Protein. Nature 2007, 447,
(46) Simmerling, C.; Strockbine, B.; Roitberg, A. E. All-Atom Structure Prediction and Folding Simulations of a Stable Protein. J. Am. Chem. Soc. 2002, 124,
(47) Zagrovic, B.; Snow, C. D.; Shirts, M. R.; Pande, V. S. Simulation of Folding of a Small Alpha-helical Protein in Atomistic Detail Using Worldwide-Distributed Computing. J. Mol. Biol. 2002, 323,
(48) Chowdhury, S.; Lee, M. C.; Xiong, G.; Duan, Y. Ab Initio Folding Simulation of the Trp-Cage Mini-Protein Approaches NMR Resolution. J. Mol. Biol. 2003, 327,
(49) Pitera, J. W. Understanding Folding and Design: Replica-Exchange Simulations of "Trp-Cage" Miniproteins. Proc. Natl. Acad. Sci. U.S.A. 2003, 100,
(50) Zhou, R. Trp-Cage: Folding Free Energy Landscape in Explicit Water. Proc. Natl. Acad. Sci. U.S.A. 2003, 100,
(51) Paschek, D.; Hempel, S.; Garcia, A. E. Computing the Stability Diagram of the Trp-Cage Miniprotein. Proc. Natl. Acad. Sci. U.S.A. 2008, 105,
(52) Juraszek, J.; Bolhuis, P. G. Rate Constant and Reaction Coordinate of Trp-Cage Folding in Explicit Water. Biophys. J. 2008, 95,
(53) Lane, T. J.; Bowman, G. R.; Beauchamp, K.; Voelz, V. A.; Pande, V. S. Markov State Model Reveals Folding and Functional Dynamics in Ultra-Long MD Trajectories. J. Am. Chem. Soc. 2011, 133,
(54) Bowman, G. R.; Voelz, V. A.; Pande, V. S. Taming the Complexity of Protein Folding. Curr. Opin. Struct. Biol. 2011, 21,
(55) Du, R.; Grosberg, A.; Tanaka, T. Random Walks in the Space of Conformations of Toy Proteins. Phys. Rev. Lett. 2000, 84,
(56) Qiu, L.; Pabit, S. A.; Roitberg, A. E.; Hagen, S. J. Smaller and Faster: The 20-Residue Trp-Cage Protein Folds in 4 μs. J. Am. Chem. Soc. 2002, 124,
(57) Gallicchio, E.; Paris, K.; Levy, R. M. The AGBNP2 Implicit Solvation Model. J. Chem. Theory Comput. 2009, 5,
(58) Sadqi, M.; Lapidus, L.; Munoz, V. How Fast Is Protein Hydrophobic Collapse? Proc. Natl. Acad. Sci.
U.S.A. 2003, 100, (59) Nettels, D.; Gopich, I. V.; Hoffmann, A.; Schuler, B. Ultrafast Dynamics of Protein Collapse from Single-Molecule Photon Statistics. Proc. Natl. Acad. Sci. U.S.A. 2007, 104, (60) Neuweiler, H.; Johnson, C. M.; Fersht, A. R. Direct Observation of Ultrafast Folding and Denatured State Dynamics in Single Protein Molecules. Proc. Natl. Acad. Sci. U.S.A. 2009, 106, (61) Soranno, A.; Buchli, B.; Nettels, D.; Cheng, R. R.; Muller-Spath, S.; Pfeil, S. H.; Hoffmann, A.; Lipman, E. A.; Makarov, D. E.; Schuler, B. Quantifying Internal Friction in Unfolded and Intrinsically Disordered Proteins with Single-Molecule Spectroscopy. Proc. Natl. Acad. Sci. U.S.A. 2012, 109, (62) Bowman, G. R.; Voelz, V. A.; Pande, V. S. Atomistic Folding Simulations of the Five-Helix Bundle Protein λ6 85. J. Am. Chem. Soc. 2011, 133, (63) Dickson, A.; Brooks, C. L. Quantifying Hub-Like Behavior in Protein Folding Networks. J. Chem. Theory Comput. 2012, 8, dx.doi.org/ /jp401962k J. Phys. Chem. B 2013, 117,

ACCELERATED COMMUNICATIONS

How long does it take to equilibrate the unfolded state of a protein?

Ronald M. Levy,1* Wei Dai,2 Nan-Jie Deng,1 and Dmitrii E. Makarov3

1 Department of Chemistry and Chemical Biology, Rutgers, the State University of New Jersey, Piscataway, New Jersey
2 Department of Physics and Astronomy, Rutgers, the State University of New Jersey, Piscataway, New Jersey
3 Department of Chemistry and Biochemistry and Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin, Texas

Received 22 July 2013; Revised 9 August 2013; Accepted 12 August 2013
DOI: /pro.2335
Published online 21 August 2013 proteinscience.org
Published by Wiley-Blackwell. © 2013 The Protein Society. PROTEIN SCIENCE 2013 VOL 22

Abstract: How long does it take to equilibrate the unfolded state of a protein? The answer to this question has important implications for our understanding of why many small proteins fold with two-state kinetics. When the equilibration within the unfolded state U is much faster than the folding, the folding kinetics will be two-state even if there are many folding pathways with different barriers. Yet the mean first passage times (MFPTs) between different regions of the unfolded state can be much longer than the folding time. This seems to imply that the equilibration within U is much slower than the folding. In this communication we resolve this paradox. We present a formula for estimating the time to equilibrate the unfolded state of a protein. We also present a formula for the MFPT to any state within U, which is proportional to the average lifetime of that state divided by the state population. This relation is valid when the equilibration within U is very fast compared with folding, as it often is for small proteins. To illustrate the concepts, we apply the formulas to estimate the time to equilibrate the unfolded state of Trp-cage and the MFPTs within the unfolded state, based on a Markov State Model parameterized with an ultra-long 208 microsecond trajectory of the miniprotein. The time to equilibrate the unfolded state of Trp-cage is ~100 ns, while the typical MFPTs within U are tens of microseconds or longer.

Keywords: protein unfolded state; protein folding; mean first passage time; Markov state model

Additional Supporting Information may be found in the online version of this article.
Grant sponsor: National Institutes of Health; Grant number: GM30580; Grant sponsor: National Science Foundation; Grant number: CHE
*Correspondence to: Ronald Levy, Department of Chemistry and Chemical Biology, Rutgers, the State University of New Jersey, Piscataway, NJ. E-mail: ronlevy@lutece.rutgers.edu

Introduction

How long does it take to equilibrate the unfolded state of a protein? The answer to this question has important implications for our understanding of why many small proteins fold with two-state kinetics. The protein folding funnel picture provides key insights [3,5,6]. When a protein folds along multiple pathways as suggested by the funnel picture, the folding kinetics will still be two-state regardless of differences in the intrinsic barriers along each pathway, provided the equilibration within the unfolded state ensemble is much faster than the time it takes to fold. Yet the mean first passage times (MFPTs) between different regions of the unfolded state ensemble are typically much longer than the folding time; this suggests that the time to equilibrate the unfolded state ensemble is much longer than the time to fold [29,30]. So there is a paradox: the single exponential kinetics can be explained by very fast equilibration within the unfolded state U relative to folding, but the long MFPTs within U seem to imply that the equilibration of the unfolded state is slow relative to folding.

In this communication we resolve this paradox. It arises when the average time for a single molecule trajectory to hit a specific location within U (the MFPT to state i) is compared with the time for population fluctuations within the unfolded state to relax. This relaxation time provides a quantitative measure of the time to equilibrate the unfolded state. We will show that the MFPT to any state within the unfolded ensemble is approximately equal to the time to equilibrate the unfolded state divided by the population of the target state. The smaller the size of the target state, the longer the MFPT to that state, even though the equilibration of the unfolded state ensemble is very fast. For the Trp-cage example we use for discussion, MFPTs between different regions of the unfolded state ensemble are tens to hundreds of microseconds, while the time to equilibrate the unfolded state is of the order of 100 ns. These times are to be compared with the folding time for Trp-cage, which is 5.5 microseconds.

An estimate of the time required to equilibrate the protein unfolded state is also needed to understand the implications of the recently introduced kinetic hub model of protein folding [29,31,32]. In this model, the folded state F acts as a hub, so that most paths that connect pairs of unfolded states U1 and U2 pass through F [33,34]. Hub-like behavior appears to imply that the unfolded state partitions into subspaces that largely fold along different pathways, but we have shown that this is not the case for Trp-cage [28]. Furthermore, when the equilibration within the unfolded state ensemble is much faster than folding, the hub-like behavior simply reflects the fact that the F state has sufficient population to have a high probability of being on most paths between typical points U1 and U2 within the unfolded state ensemble. It has recently become clear that hub-like behavior is consistent with a smooth folding funnel [28].

We use the integral of the time correlation function, which quantifies how the population fluctuations within the unfolded state relax to equilibrium, as the measure of the time to equilibrate the unfolded state [35]. There are two contributions to the relaxation of population fluctuations within the unfolded state ensemble of a protein, or equivalently the equilibration of the unfolded state. The first corresponds to relaxation of fluctuations that originate and propagate entirely within the U state, and the second to relaxation within U that arises from the equilibration between the unfolded and folded states. When the former relaxation process is much faster than the latter, the protein folding is two-state. In this communication we mostly focus on the fast relaxation processes entirely within U. For our analysis we use a discrete master equation model of Trp-cage with 20 states, parameterized on a 208-microsecond all-atom molecular dynamics simulation of this mini-protein in water provided by the D.E. Shaw group [23]. The kinetics is characterized by the implied timescale spectrum of the transition matrix, which contains all the information about the relaxation times of the states within the discrete-time Markov State Model (MSM). The Trp-cage implied timescale spectrum has a substantial gap between the longest implied timescale, which is associated with folding, and the others; therefore the intra-U-state fluctuations can be separated from the folding, and the mini-protein folds in a two-state manner with single exponential kinetics.
That the remaining eigenmodes correspond to intra-U-state relaxation can be verified by comparing the spectrum with the corresponding implied timescale spectrum obtained using reflecting boundary conditions at F, as we do in the following section.

Results and Discussion

We use a master equation to study the timescales over which the unfolded state equilibrates. The formal solution to the master equation is:

$$\vec{P}(t) = T(t)\,\vec{P}(0) \tag{1}$$

where $\vec{P}$ is a vector of state probabilities and the transition matrix $T$ (also called the propagator) contains all the information about the kinetics of the system (see Supporting Information). The propagator matrix element $T_{ij}(t)$ is the probability that the system is in state j at time t given that it was in state i at time zero. All observables of the system can be calculated in terms of functions of the $T_{ij}(t)$. The $T_{ij}(t)$ in turn can be expressed in terms of the eigenvalues and eigenvectors of $T$. Figure 1 shows the spectrum of implied timescales for the Trp-cage transition matrix constructed from the Shaw trajectory and for a modified transition matrix with a reflecting boundary added at F. Imposing the reflecting boundary condition here provides a model for the dynamics of the unfolded state alone. It can be seen that the spectra are very similar, except that the largest nonzero eigenvalue is missing from the spectrum with the reflecting boundary condition at F; this eigenmode corresponds to the relaxation between the unfolded (U) state ensemble and the folded (F) state. The large gap between the largest implied timescale and the others means that the folding is two-state, and the implied timescale (1.2 μs) is the inverse of the sum of the folding and unfolding rates.

Figure 1. The implied timescales corresponding to the 10 slowest decaying eigenmodes using transition matrices $T(\tau)$ with different boundary conditions: (a) unmodified equilibrium and (b) reflecting at F and I states. The optimal lag time of 10 ns is chosen for further analysis based on the tradeoff between the network being Markovian and the resolution being sufficient for studying the folding mechanism.

In Figure 2 we show a typical propagator matrix element $T_{ij}(t)$ from state i to state j, both within U, calculated three ways: using absorbing, unmodified equilibrium, and reflecting boundary conditions at F. The time dependence of $T_{ij}(t)$ describes the relaxation process following an initial point perturbation at state i. On a timescale of a few hundred nanoseconds the three curves look very similar. Each rises rapidly to a plateau value that overshoots the equilibrium population of state j by a small amount. When added up over all the states in U, the excess corresponds to the equilibrium population of F that folds from U to F on the slower timescale of 5 μs. After a few hundred nanoseconds, the $T_{ij}(t)$ matrix elements shown in Figure 2 have the following longer time behavior. Under reflecting boundary conditions $T_{ij}(t)$ is approximately constant; for the unmodified transition matrix $T_{ij}(t)$ relaxes to the equilibrium population with a relaxation time of 1.2 μs; and under the absorbing boundary condition the matrix elements relax to zero with a relaxation time of 5 μs.

The results shown in Figure 2 are suggestive as to the timescales for equilibrating the unfolded state, but the full relaxation involves all the elements $T_{ij}(t)$ of the propagator. We consider the full expression for the relaxation now. The way to estimate the time it takes to equilibrate a system from equilibrium statistical mechanics is to calculate an integral of the appropriate time correlation function [35]. The correlation function of interest here corresponds to the decay of the population fluctuations in the unfolded state. After some manipulation (see Supporting Information), this correlation function can be expressed as:

$$C_{\mathrm{tot}}(t) = \sum_i P_{eq}(i)\,\frac{\langle \Delta P_i(0)\,\Delta P_i(t)\rangle}{\langle \Delta P_i(0)^2\rangle} \tag{2a}$$

$$C_{\mathrm{tot}}(t) = \sum_i \frac{\left(T_{ii}(t) - P_{eq}(i)\right) P_{eq}(i)}{1 - P_{eq}(i)} \tag{2b}$$

$$C_{\mathrm{tot}}(t) = \sum_{n=2}^{N} \left[\sum_i \frac{P_{eq}(i)\,\psi_n^{R}(i)\,\psi_n^{L}(i)}{1 - P_{eq}(i)}\right] \lambda_n \tag{2c}$$

where $\psi_n^{R}(i)$ and $\psi_n^{L}(i)$ are the ith elements of the nth right and left eigenvectors of the $T$ matrix, and $\lambda_n$ is the nth eigenvalue of the $T$ matrix. $\Delta P_i(t) = P_i(t) - P_{eq}(i)$, where $P_{eq}(i)$ is the equilibrium population of state i and $P_i(t)$ is an indicator function, which is 1 when the trajectory is in state i at time t and 0 otherwise.

In Figure 3 we show the unfolded state population fluctuation correlation function. When the motions are restricted to the unfolded state, the time to equilibrate the unfolded state is estimated from the time integral of $C_{\mathrm{tot}}(t)$ to be 100 ns; when the additional relaxation of U due to the much slower equilibration between U and F is also considered, the time to equilibrate the unfolded state increases to 540 ns. The separation of timescales between the equilibration within U and the folding is implicit in the folding funnel model of protein folding [3,5]. While folding on a flat golf-course landscape [11], which lacks the energy bias, can also produce a separation of timescales, the very fast equilibration (100 ns) within the unfolded state is a feature of the funneled landscape.

Our estimate of the time to equilibrate the protein unfolded state based on the decay of fluctuations of the U state population [Eq. (2b)] is independent of the kind of experiment chosen to monitor the system. Any particular experiment will measure the time evolution of the population fluctuations reweighted by how sensitive that particular probe is to the different modes by which the population fluctuations relax. If, for example, the experiment is sensitive to the fluctuations of some property f, then the experimental relaxation time measured for that probe of the unfolded state dynamics would be:

$$C_f(t) = \frac{\sum_{i,j} f(i)\,f(j)\,P_{eq}(j)\left(T_{ij}(t) - P_{eq}(i)\right)}{\sum_{i,j} f(i)\,f(j)\,P_{eq}(j)\left(T_{ij}(0) - P_{eq}(i)\right)} \tag{3}$$

where f(i) and f(j) are the values of the experimental observable in states i and j. A common choice of the experimental observable f is the FRET efficiency, which is a nonlinear function of the distance between two particular residues within the protein. The relaxation time thus determined depends on the choice of those residues [25].

Figure 2. A typical propagator matrix element $T_{ij}(t)$ from state i to state j, both within U, calculated three ways: using absorbing boundary conditions at F (black), unmodified equilibrium (blue), and reflecting boundary conditions at F (red). The upper and lower dash-dot lines in each subplot are the equilibrium populations of state j under reflecting and unmodified equilibrium boundary conditions, respectively. All the propagator elements are calculated from the spectral decomposition using all 20 eigenmodes at a lag time of 10 ns. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

We turn now to an analysis of the MFPTs between different states within the unfolded state ensemble. From MSMs, the MFPTs between unfolded states have been reported to be tens of microseconds or longer [29,31,32]. For the Trp-cage model we studied, they extend to 200 microseconds. The MFPT to an unfolded state i can be obtained from the formula:

$$\mathrm{MFPT}_i = \left\langle \int_0^{\infty} t\,\frac{dT^{\mathrm{abs}\to i}_{ji}}{dt}\,dt \right\rangle = \sum_j \frac{P_{eq}(j)}{1 - P_{eq}(i)} \int_0^{\infty} t\,\frac{dT^{\mathrm{abs}\to i}_{ji}}{dt}\,dt \tag{4a}$$

$$\mathrm{MFPT}_i = \sum_j \frac{P_{eq}(j)}{1 - P_{eq}(i)} \sum_{n=2}^{N} \psi_n^{R}(j)\,\psi_n^{L}(i)\,(-\mu_n) \tag{4b}$$

where $\psi_n^{R}(j)$ and $\psi_n^{L}(i)$ are elements of the nth right and left eigenvectors of the transition matrix with an absorbing boundary at i, $T^{\mathrm{abs}\to i}$, and $\mu_n$ is its nth implied timescale (see Supporting Information). The average shown in Eq. (4a) is taken over all the other states j in U and includes a sum over all the eigenmodes n.
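The equilibrium-weighted average in Eq. (4a) can be illustrated on a toy model. The sketch below is not the authors' implementation: instead of the spectral sum of Eq. (4b), it uses the standard first-step equations, $m_j = \tau + \sum_{k \neq i} T_{jk} m_k$, which give the same mean first passage times for a discrete-time model with lag time $\tau$. The 4-state count matrix, the 10 ns lag, and all function names are hypothetical and chosen only for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[r]] + [float(b[r])] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - acc) / M[r][r]
    return x


def msm_from_counts(C):
    """Row-normalize a symmetric count matrix; the result obeys detailed balance."""
    total = float(sum(sum(row) for row in C))
    T = [[c / float(sum(row)) for c in row] for row in C]
    p_eq = [sum(row) / total for row in C]
    return T, p_eq


def mfpt_to_state(T, p_eq, i, lag):
    """Equilibrium-weighted MFPT to state i (the average in Eq. 4a)."""
    others = [j for j in range(len(T)) if j != i]
    # First-step equations restricted to the states other than i: (I - T_UU) m = lag
    A = [[(1.0 if j == k else 0.0) - T[j][k] for k in others] for j in others]
    m = solve(A, [lag] * len(others))
    avg = sum(p_eq[j] * mj for j, mj in zip(others, m)) / (1.0 - p_eq[i])
    return avg, dict(zip(others, m))


# Hypothetical symmetric counts; state 3 has a very small population.
counts = [[900, 40, 40, 2],
          [40, 500, 30, 1],
          [40, 30, 400, 1],
          [2, 1, 1, 6]]
lag_ns = 10.0  # assumed lag time

T, p_eq = msm_from_counts(counts)
for target in range(len(T)):
    avg, _ = mfpt_to_state(T, p_eq, target, lag_ns)
    print("state %d: P_eq = %.4f, MFPT = %.1f ns" % (target, p_eq[target], avg))
```

In this toy model the MFPT to the sparsely populated state comes out far longer than the MFPT to the most populated state, which is the qualitative behavior discussed in the text.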
In Figure 4(a) we show the implied timescale spectrum of the transition matrix with an absorbing boundary at a typical unfolded state i. The large gap between the largest implied timescale and the rest is the signature of the exponential distribution of first passage times to unfolded state i. The longest implied timescale is of the order of 100 microseconds. Because the unfolded state ensemble relaxes on a timescale a hundred to a thousand times faster than the time it takes on average to reach state i, the MFPT to state i does not depend on the starting point within U. The kinetics involving the transitions between any specific state i and all the other states taken collectively is then effectively two-state, and the MFPT to state i can be written as:

$$\mathrm{MFPT}_i \approx \sum_j \frac{P_{eq}(j)}{1 - P_{eq}(i)}\,\psi_2^{R}(j)\,\psi_2^{L}(i)\,(-\mu_2) \tag{5}$$

The MFPT to the unfolded state i chosen for the example shown in Figure 4(a) is found to be 106 microseconds.

Figure 3. The population fluctuation relaxation functions [Eq. (2)] of the 20-node network at a lag time of 10 ns, using two different boundary conditions: (a) unmodified equilibrium and (b) reflecting at F. The integrals of the functions are the relaxation times, which are 543 and 100 ns, respectively. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

To understand why the MFPTs to states within U are so long, we consider the relationship between the average lifetime of a state i within U and the average lifetime of the collective state consisting of the remainder of U excluding state i:

$$t_{U-i} = t_i \left(\frac{1}{P_{eq}(i)} - 1\right) \tag{6}$$

where $t_i$ is the average lifetime of state i and $t_{U-i}$ is the average lifetime of the collective state U-i consisting of the remainder of U excluding state i. Here we define the lifetime distribution of a state as the distribution of times recorded between entering the state, when the clock starts, and leaving it, when the clock stops, during a single very long trajectory in which the state is visited many times [see Supporting Information for the derivation of Eq. (6)]. In Figure 4(b) we plot the MFPT to state i [Eq. (4b)] against the average lifetime of the collective state, $t_{U-i}$ [Eq. (6)], for each of the unfolded states in the 20-state model. It can be seen that these times are almost equal. This is true when the time to equilibrate within the unfolded state (U-i) is much shorter than the average lifetime of (U-i). Under these circumstances, the MFPT to any unfolded state i is proportional to the average lifetime $t_i$ of the state divided by its population, and Eqs. (4), (5), and (6) give the same result. Because the average lifetimes of the unfolded states decay on the same timescale as the decay of the population fluctuations, we find that the MFPT to any state within U is approximately equal to the time to equilibrate U divided by the population of the target state. Importantly, the MFPTs depend on the resolution of the model for the unfolded state: the more fine-grained the model, the longer the MFPTs to an individual state. On the other hand, the time to equilibrate the unfolded state is a characteristic of the macrostate, which depends only weakly on the resolution. For the 20-state model of Trp-cage studied here, the longest MFPT (200 μs) is to the state with the smallest population, 0.003, while the average lifetime of that state is 48 ns, comparable to the time to equilibrate the unfolded state.

In this communication we have resolved a paradox about kinetics within the unfolded state of proteins, which leads to a better understanding of why most small proteins fold with two-state kinetics. When the equilibration of the unfolded state ensemble is very fast, as it is for most small proteins, the protein will fold with single exponential kinetics. While it seems paradoxical that the time to equilibrate the unfolded state can be orders of magnitude shorter than MFPTs within U, we have shown there is no inconsistency. Using a time-correlation function approach, we have presented a general formula for the timescale of population relaxation within U [Eq. (2c)].
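As a rough illustration of how such a relaxation time could be evaluated, the sketch below integrates the correlation function of Eq. (2b) for a hypothetical 4-state model (state 0 plays the role of F, states 1 to 3 the role of U). It assumes a discrete-time model with $T(k\tau) = T(\tau)^k$ and approximates the time integral by the trapezoidal rule on the lag-time grid rather than using the eigen-decomposition of Eq. (2c). The reflecting boundary at F is imitated simply by deleting F from the count matrix before normalization, which is an assumption and not necessarily the construction used in the Supporting Information. All names and numbers are illustrative.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def msm_from_counts(C):
    """Row-normalize a symmetric count matrix; returns (T, P_eq)."""
    total = float(sum(sum(row) for row in C))
    T = [[c / float(sum(row)) for c in row] for row in C]
    p_eq = [sum(row) / total for row in C]
    return T, p_eq


def c_tot(Tt, p_eq):
    """Eq. (2b) evaluated for the propagator Tt = T(t)."""
    return sum((Tt[i][i] - p) * p / (1.0 - p) for i, p in enumerate(p_eq))


def relaxation_time(T, p_eq, lag, n_steps):
    """Trapezoidal estimate of the time integral of C_tot(t) on t = 0, lag, 2*lag, ..."""
    n = len(T)
    Tt = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]  # T(0) = identity
    values = []
    for _ in range(n_steps + 1):
        values.append(c_tot(Tt, p_eq))
        Tt = matmul(Tt, T)
    tau = lag * (sum(values) - 0.5 * (values[0] + values[-1]))
    return tau, values


# Hypothetical counts: state 0 (F) exchanges slowly with states 1-3 (U).
counts_full = [[2000, 3, 3, 3],
               [3, 600, 60, 40],
               [3, 60, 500, 50],
               [3, 40, 50, 300]]
counts_u = [row[1:] for row in counts_full[1:]]  # F removed: U-only model
lag_ns = 10.0  # assumed lag time

T_full, p_full = msm_from_counts(counts_full)
T_u, p_u = msm_from_counts(counts_u)
tau_full, _ = relaxation_time(T_full, p_full, lag_ns, 3000)
tau_u, _ = relaxation_time(T_u, p_u, lag_ns, 3000)
print("relaxation time with U-F exchange: %.1f ns" % tau_full)
print("relaxation time within U only:     %.1f ns" % tau_u)
```

With these made-up counts the U-only relaxation time is much shorter than the one that includes the slow exchange with F, mirroring the qualitative difference between the two boundary conditions described for Figure 3.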
Applying this formula to the folding of the two-state mini-protein Trp-cage, we found that the folding follows a two-step process: starting from an arbitrary nonequilibrium conformational distribution within the unfolded region, the protein population will quickly relax to a pre-equilibrium within the unfolded state on timescales (100 ns for Trp-cage) much faster than folding. From this time forward, while the relative populations of all the unfolded microstates remain constant, the excess population within U, which will populate the folded state at equilibrium, folds with single exponential kinetics (rate 1/5.5 μs). It should be noted that, as we reported in a recent article, an individual Trp-cage folding trajectory only visits a fraction (e.g., 25%) of the unfolded state space [28]. The key to reconciling this with the rapid equilibration in the U state is to realize that while any one trajectory explores only a small part of U before folding, an ensemble of such trajectories starting from the same initial condition within U will explore all of the U states, each with a probability that is close to the equilibrium population of that state, before folding [11,28,30]. The methodology developed in this study is also well suited for studying the kinetics of larger and more complex proteins, where the timescales to equilibrate within U and to fold may overlap and the folding is no longer two-state.

Figure 4. (a) The implied timescale spectrum of the transition matrix with an absorbing boundary at the state highlighted in red in Figure 4(b). (b) The average lifetime of the collective state (U-i), which excludes the state i, versus the MFPT to state i using Eq. (4b). [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

Materials and Methods

An MD trajectory of Trp-cage, containing 1 million snapshots saved every 200 ps, was obtained from D.E. Shaw Research [23]. The simulation length is 208 μs, using a modified CHARMM22 all-atom force field in TIP3P explicit solvent.
A node fine-grained network and a 20-node coarse-grained network were generated from the trajectory (see Supporting Information for detailed descriptions of how the fine-grained network was generated).

Acknowledgments

Some of the calculations were performed using the XSEDE allocation TG-MCB The authors thank Dr. Attila Szabo for very helpful discussions. ND would like to thank Dr. Kyle Beauchamp from Dr. Vijay Pande's group for help with MSMBuilder2 [36].

References

1. Creighton TE (1988) Toward a better understanding of protein folding pathways. Proc Natl Acad Sci U S A 85:
2. Jackson SE, Fersht AR (1991) Folding of chymotrypsin inhibitor Evidence for a two-state transition. Biochemistry (Mosc.) 30:
3. Bryngelson JD, Onuchic JN, Socci ND, Wolynes PG (1995) Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins 21:
4. Eaton WA, Thompson PA, Chan C-K, Hagen SJ, Hofrichter J (1996) Fast events in protein folding. Structure 4:
5. Onuchic JN, Luthey-Schulten Z, Wolynes PG (1997) Theory of protein folding: the energy landscape perspective. Annu Rev Phys Chem 48:
6. Dill KA, Chan HS (1997) From Levinthal to pathways to funnels. Nat Struct Biol 4:
7. Zwanzig R (1997) Two-state models of protein folding kinetics. Proc Natl Acad Sci U S A 94:
8. Perl D, Welker C, Schindler T, Schröder K, Marahiel MA, Jaenicke R, Schmid FX (1998) Conservation of rapid two-state folding in mesophilic, thermophilic and hyperthermophilic cold shock proteins. Nat Struct Biol 5:
9. Jackson SE (1998) How do small single-domain proteins fold? Fold Des 3:R81 R
10. Cieplak M, Henkel M, Karbowski J, Banavar J (1998) Master equation approach to protein folding and kinetic traps. Phys Rev Lett 80:
11. Bicout DJ, Szabo A (2000) Entropic barriers, transition states, funnels, and exponential protein folding kinetics: a simple model. Protein Sci 9:
12. Dinner AR, Sali A, Smith LJ, Dobson CM, Karplus M (2000) Understanding protein folding via free-energy surfaces from theory and experiment. Trends Biochem Sci 25:
13. Mirny L, Shakhnovich E (2001) Protein folding theory: From lattice to all-atom models. Annu Rev Biophys Biomol Struct 30:
14. Makarov DE (2002) How the folding rate constant of simple, single-domain proteins depends on the number of native contacts. Proc Natl Acad Sci U S A 99:
15. Yang WY, Gruebele M (2003) Folding at the speed limit. Nature 423:
16. Kaya H, Chan HS (2003) Simple two-state protein folding kinetics requires near-Levinthal thermodynamic cooperativity. Proteins 52:
17. Weikl TR (2004) Cooperativity in two-state protein folding kinetics. Protein Sci 13:
18. Rhoades E, Cohen M, Schuler B, Haran G (2004) Two-state folding observed in individual protein molecules. J Am Chem Soc 126:
19. Ellison PA, Cavagnero S (2006) Role of unfolded state heterogeneity and en-route ruggedness in protein folding kinetics. Protein Sci 15:
20. Barrick D (2009) What have we learned from the studies of two-state folders, and what are the unanswered questions about two-state protein folding? Phys Biol 6:
21. Zheng W, Andrec M, Gallicchio E, Levy RM (2009) Recovering kinetics from a simplified protein folding model using replica exchange simulations: A kinetic network and effective stochastic dynamics. J Phys Chem B 113:
22. Best RB, Hummer G (2009) Coordinate-dependent diffusion in protein folding. Proc Natl Acad Sci U S A 107:
23. Lindorff-Larsen K, Piana S, Dror RO, Shaw DE (2011) How fast-folding proteins fold. Science 334:
24. Karplus M (2011) Behind the folding funnel diagram. Nat Chem Biol 7:
25. Soranno A, Buchli B, Nettels D, Cheng RR, Müller-Späth S, Pfeil SH, Hoffmann A, Lipman EA, Makarov DE, Schuler B (2012) Quantifying internal friction in unfolded and intrinsically disordered proteins with single-molecule spectroscopy. Proc Natl Acad Sci U S A 109:
26. Zhang Z, Chan HS (2012) Transition paths, diffusive processes, and preequilibria of protein folding. Proc Natl Acad Sci U S A 109:
27. De Sancho D, Mittal J, Best RB (2013) Folding kinetics and unfolded state dynamics of the GB1 hairpin from molecular simulation. J Chem Theory Comput 9:
28. Deng N, Dai W, Levy RM (2013) How kinetics within the unfolded state affects protein folding: An analysis based on Markov state models and an ultra-long MD trajectory. J Phys Chem B
29. Bowman GR, Pande VS (2010) Protein folded states are kinetic hubs. Proc Natl Acad Sci U S A 107:
30. Lane TJ, Schwantes CR, Beauchamp KA, Pande VS. Probing the origins of two-state folding. arXiv [physics.bio-ph]
31. Voelz VA, Bowman GR, Beauchamp K, Pande VS (2010) Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39). J Am Chem Soc 132:
32. Bowman GR, Voelz VA, Pande VS (2011) Taming the complexity of protein folding. Curr Opin Struct Biol 21:
33. Dickson A, Brooks CL (2012) Quantifying hub-like behavior in protein folding networks. J Chem Theory Comput 8:
34. Dickson A, Brooks CL (2013) Native states of fast-folding proteins are kinetic traps. J Am Chem Soc 135:
35. Chandler D (1987) Introduction to modern statistical mechanics. New York: Oxford University Press.
36. Beauchamp KA, Bowman GR, Lane TJ, Maibaum L, Haque IS, Pande VS (2011) MSMBuilder2: Modeling conformational dynamics on the picosecond to millisecond scale. J Chem Theory Comput 7:

Figure 2.25: Schematic plot of a simple 3-node model. The combination of node 1 and node 3 yields the collective state U−i. The edges connecting the nodes represent the rate constants between the two connected nodes.
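A 3-node model of this kind is small enough to check the lifetime relation of Eq. (6) by hand or numerically. The sketch below is only an illustration under stated assumptions: the equilibrium populations and the symmetric fluxes that define the rate constants are made up, all three nodes are assumed to be connected, and node 2 is taken as state i so that nodes 1 and 3 form the collective state U−i. The average lifetime of U−i is computed directly from the rate constants (mean exit time, weighted by where the trajectory enters U−i) and compared with $t_i\,(1/P_{eq}(i) - 1)$ and with the equilibrium-weighted MFPT to node 2.

```python
def solve2(S, b):
    """Solve a 2x2 linear system S x = b by Cramer's rule."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [(b[0] * S[1][1] - S[0][1] * b[1]) / det,
            (S[0][0] * b[1] - S[1][0] * b[0]) / det]


# Hypothetical equilibrium populations of nodes 1, 2, 3 and symmetric
# equilibrium fluxes (per ns) along the three edges.
p_eq = {1: 0.6, 2: 0.1, 3: 0.3}
flux = {(1, 2): 0.020, (2, 3): 0.015, (1, 3): 0.030}


def rate(a, b):
    """Rate constant a -> b; detailed balance holds by construction."""
    f = flux.get((a, b), flux.get((b, a), 0.0))
    return f / p_eq[a]


def lifetimes(i=2):
    """Return (t_i, t_U-i computed directly, t_U-i from Eq. 6, MFPT to i)."""
    u = [n for n in sorted(p_eq) if n != i]          # the two nodes forming U-i
    lam_i = sum(rate(i, j) for j in u)               # total escape rate from i
    t_i = 1.0 / lam_i                                # average lifetime of i
    # S = minus the rate matrix restricted to U-i (the diagonal holds total escape rates)
    S = [[sum(rate(a, b) for b in p_eq if b != a) if a == c else -rate(a, c)
          for c in u] for a in u]
    m = solve2(S, [1.0, 1.0])                        # mean time to reach i from each node
    alpha = [rate(i, j) / lam_i for j in u]          # where a visit to U-i starts
    t_u_direct = sum(w * x for w, x in zip(alpha, m))
    t_u_eq6 = t_i * (1.0 / p_eq[i] - 1.0)
    mfpt = sum(p_eq[j] * x for j, x in zip(u, m)) / (1.0 - p_eq[i])
    return t_i, t_u_direct, t_u_eq6, mfpt


t_i, t_u_direct, t_u_eq6, mfpt = lifetimes(2)
print("lifetime of node 2:            %.3f ns" % t_i)
print("lifetime of U-i (direct):      %.3f ns" % t_u_direct)
print("lifetime of U-i (Eq. 6):       %.3f ns" % t_u_eq6)
print("equilibrium-weighted MFPT to 2: %.3f ns" % mfpt)
```

For this toy model the two lifetime estimates agree, and the MFPT is close to, though not identical with, the lifetime of U−i, because the lifetime distribution of the collective state is only approximately exponential.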

90 54 Chapter 3 Important Time Scales in Unfolded State Ensemble of Proteins 3.1 Motivation: Funnel Picture or Kinetic Hubs The funnel-like energy landscape [17, 79] provides a way to visualize the solution to the Levinthal paradox [4], but it has deeper meaning beyond the pictures [80]. The recently introduced kinetic hub model [81] of folding calls into question one aspect of the protein folding funnel namely the smoothness of the funnel; yet it is an important characteristic which is associated with the principle of minimal frustration [79, 82]. Also, in contrast to the smooth funnel model, the kinetic hub model assigns a significant role to non-native interactions in the folding process. The hub model refers to kinetic features of protein folding, i.e. the properties of first passage times within the unfolded free energy basin and between unfolded and folded states; and also to topological features of protein folding, i.e. the connectivity of unfolded states with each other. In this chapter we resolve an apparent contradiction between kinetic features of the hub model and the smooth funnel model of protein folding; we also comment on the meaning of the hub-like network topology in light of our kinetic analysis. An interesting and challenging question how long does it take to equilibrate the unfolded state of a protein has been addressed in the reference [40], which has important implications for our understanding of why many small proteins fold with two-state kinetics. It is found in that paper that when a protein equilibrates within the unfolded state free energy basin on a faster time scale than the time it takes to fold, the folding will follow two-state kinetics [22, 73, 78], regardless of the number of folding pathways and barriers. So if there are multiple pathways with different barriers, these features of the energy landscape will be hidden from direct observation when the population fluctuations within the

91 55 unfolded ensemble equilibrate more rapidly than the time course of the folding [72]. The process of equilibration within the unfolded free energy basin is quantified by calculating the time integral of the total time correlation function of the population fluctuations and its decomposition in terms of the eigenvectors and eigenvalues of the corresponding transition matrix. The mean first passage times from state to state within the unfolded state ensemble can be expressed in terms of the eigenvalues and eigenvectors of the transition matrix with an absorbing boundary condition at the target state. It is showed that for mini-proteins, the MFPTs within the unfolded basin are typically much longer than the time required for the population fluctuations to relax. In this chapter we derive a simple expression that relates the relaxation times of the individual coarse grained states to the MFPTs to those states; this is a very general relation that is valid whether the relaxation is fast or slow compared to the folding process. 3.2 Simple Relation between MFPTs and Relaxation Times The relaxation time quantifies the process of the population fluctuations decaying to their equilbrium values. The definitions of the relaxation time of state i and the total relaxation time, which are the time integral of Eq and Eq respectively, are τ relax tot = τi relax = 0 0 = i i=1 T ii (t) P eq (i) dt (3.1) 1 P eq (i) (T ii (t) P eq (i))p eq (i) dt 1 P eq (i) P eq (i)τ relax i (3.2) where T ii (t) is the ith diagonal element of the transition matrix and P eq (i) is the equilibrium population of state i. τ relax tot is simply the weighted average of the relaxation times of all the states within a particular ensemble. Another important time scale is the MFPT to a state i. The MFPT to state i is the average time that a trajectory takes to reach state i for the first time, with the initial conditions chosen according to the thermodynamic equilibrium populations excluding state i (Eq. 
2.40). A third time scale of the system which can be used to characterize the dynamics is the lifetime of a state. The lifetime of state i is defined

92 56 as the average time that a trajectory stays at state i during each visit. Note that the probability density P i (t) for the lifetime distribution of a state i in a Markov state model is an exponential distribution, P i (t) = λ i e λ it (3.3) where λ i is the sum of all the outgoing rate constants from state i. We consider the following scheme to characterize the dynamics of the unfolded free energy basin of a protein. We choose any state i and follow the motions between state i and all the other states within the unfolded basin which we label U i. In general, the distribution of U i lifetimes is not single exponential. However, we can relate the average lifetimes of the states i and U i to their corresponding populations (Eq. 2.43). The lifetimes and relaxation times of the states within the unfolded basin are fundamental time scales which characterize the kinetics within the unfolded basin, many kinetic and thermodynamic quantities can be written in terms of them. For example, the equilibrium conditions can be expressed as: P eq (i) = = P eq (U i) = λ 1 i λ 1 i + t l 1, (3.4) 1 + λ i t l l t l T = 1 P eq (i), (3.5) where t l denotes the time that a trajectory stays at state U i in the lth visit, as schematically illustrated in Fig. 3.1, and T is the total length of the trajectory. We now show that the MFPT to state i can be expressed as a ratio of the second and first moments of the lifetime distribution of the collective state U i. We consider a very long trajectory of total length T which moves between state i and state U i many times as shown schematically in Fig The MFPT to state i can be calculated by picking a point at random while the trajectory is in state U i, for example point A in Fig. 3.1, and clocking the time it takes to get to state i from point A in state U i. This is repeated many times to build up the passage time distribution. Suppose point A is chosen as the starting

point, which as shown in the figure is located in the second visit to $U_i$, with lifetime $t_2$. The contribution of this visit to the MFPT is the product of the probability of choosing a starting point within the second visit of the trajectory to $U_i$, which is $t_2 / \sum_n t_n$, and the average first passage time to state $i$ from a point within that visit, which is $t_2/2$, since point A can be located at any time along the second visit with equal probability. The weighted sum of the first passage times to state $i$ over all visits of the trajectory to $U_i$ can then be written as

$$\mathrm{MFPT}_i = \sum_l \frac{t_l}{\sum_n t_n} \, \frac{t_l}{2}, \tag{3.6}$$

where $t_l$ and $t_n$ are the lifetimes of state $U_i$ during the $l$th and $n$th visits of the trajectory to state $U_i$. Dividing both the numerator and the denominator of the right-hand side of Eq. 3.6 by $N$, the total number of visits to state $U_i$, gives

$$\mathrm{MFPT}_i = \frac{1}{2} \frac{\langle t_l^2 \rangle}{\langle t_l \rangle}. \tag{3.7}$$

Eq. 3.7 shows that the MFPT to state $i$ is one half of the second moment divided by the first moment of the lifetime distribution of state $U_i$.

Figure 3.1: The indicator function of the collective state $U_i$ is one when the trajectory is in state $U_i$ and zero otherwise. $t_1$, $t_2$ and $t_3$ are three realizations of the lifetimes of state $U_i$. Point A is a typical starting point of the first passage time to state $i$.

Writing the relaxation times in terms of lifetimes is more complicated. The diagonal transition matrix elements in Eq. 3.2 need to be decomposed into the sum of the contributions from various

classes of trajectories, which are associated with different numbers of departures from the starting state:

$$
\begin{aligned}
T_{ii}(t) = {} & e^{-\lambda_i t} + \int_0^t \lambda_i e^{-\lambda_i t_1}\,dt_1 \int_{t_1}^{t} P_{U_i}(t_2 - t_1)\, e^{-\lambda_i (t - t_2)}\,dt_2 \\
& + \int_0^t \lambda_i e^{-\lambda_i t_1}\,dt_1 \int_{t_1}^{t} dt_2\, P_{U_i}(t_2 - t_1) \int_{t_2}^{t} \lambda_i e^{-\lambda_i (t_3 - t_2)}\,dt_3 \int_{t_3}^{t} dt_4\, P_{U_i}(t_4 - t_3)\, e^{-\lambda_i (t - t_4)} + \cdots
\end{aligned} \tag{3.8}
$$

where $P_{U_i}(t_2 - t_1)$ is the probability density that the trajectory leaves state $U_i$ after a residence time $t_2 - t_1$ ($t_2 > t_1$). The first term in the equation above is the probability that the trajectory never leaves state $i$ within the time $t$. The second term is the probability that the trajectory leaves state $i$ at time $t_1$, comes back to state $i$ at time $t_2$, and then stays in state $i$ through the end of the time $t$. The third term corresponds to the case in which the trajectory leaves state $i$ twice, at times $t_1$ and $t_3$, and returns to state $i$ at times $t_2$ and $t_4$, respectively ($t_4 > t_3 > t_2 > t_1$). The remaining terms can be expressed similarly. Laplace transforming Eq. 3.8 and using the convolution theorem [83], we have

$$
\begin{aligned}
\tilde{T}_{ii}(s) & = \frac{1}{s + \lambda_i} + \frac{\lambda_i}{s + \lambda_i} \tilde{P}_{U_i}(s) \frac{1}{s + \lambda_i} + \frac{\lambda_i}{s + \lambda_i} \tilde{P}_{U_i}(s) \frac{\lambda_i}{s + \lambda_i} \tilde{P}_{U_i}(s) \frac{1}{s + \lambda_i} + \cdots \\
& = \frac{1}{s + \lambda_i} \left[ 1 + \frac{\lambda_i}{s + \lambda_i} \tilde{P}_{U_i}(s) + \left( \frac{\lambda_i}{s + \lambda_i} \tilde{P}_{U_i}(s) \right)^2 + \cdots \right],
\end{aligned} \tag{3.9}
$$

where $s$ is the complex argument of the Laplace transform. Eq. 3.9 is a geometric series which can be summed,

$$\tilde{T}_{ii}(s) = \frac{1/(s + \lambda_i)}{1 - \lambda_i \tilde{P}_{U_i}(s)/(s + \lambda_i)}. \tag{3.10}$$

To express $\tilde{P}_{U_i}(s)$ in terms of the complex argument $s$, we use the Taylor expansion,

$$\tilde{P}_{U_i}(s) = \int_0^\infty e^{-st} P_{U_i}(t)\,dt = 1 - s \langle t_l \rangle + \frac{1}{2} s^2 \langle t_l^2 \rangle - \cdots. \tag{3.11}$$
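As a numerical sanity check of Eqs. 3.4, 3.7 and 3.10, the following sketch evaluates both sides of each relation for a small continuous-time Markov chain. The four-state model, the populations `PI`, the symmetric flux matrix `S`, and all helper names are illustrative assumptions and are not taken from the dissertation. The lifetime moments of $U_i$ are computed from standard phase-type formulas (linear solves on the rate matrix restricted to $U_i$) rather than from a simulated trajectory.

```python
# Toy check of Eqs. 3.4, 3.7 and 3.10 on a hypothetical 4-state Markov chain.
# PI and S are made-up numbers chosen so that detailed balance holds.
PI = [0.1, 0.2, 0.3, 0.4]              # assumed equilibrium populations
S = [[0.00, 0.02, 0.01, 0.03],         # assumed symmetric equilibrium fluxes
     [0.02, 0.00, 0.05, 0.04],
     [0.01, 0.05, 0.00, 0.06],
     [0.03, 0.04, 0.06, 0.00]]


def generator(pi, flux):
    """Rate matrix Q with Q[a][b] = rate a -> b; every row sums to zero."""
    n = len(pi)
    Q = [[flux[a][b] / pi[a] for b in range(n)] for a in range(n)]
    for a in range(n):
        Q[a][a] = -sum(Q[a][b] for b in range(n) if b != a)
    return Q


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def shifted(Q, idx, s):
    """Matrix s*I - Q restricted to the states listed in idx."""
    return [[(s if a == b else 0.0) - Q[a][b] for b in idx] for a in idx]


def check(i, s=0.3):
    """Return (lhs, rhs) pairs for Eqs. 3.4, 3.7 and 3.10 with target state i."""
    Q = generator(PI, S)
    n = len(PI)
    U = [j for j in range(n) if j != i]           # collective state U_i
    lam = -Q[i][i]                                # total exit rate of state i
    alpha = [Q[i][j] / lam for j in U]            # where a visit to U_i starts
    exit_time = solve(shifted(Q, U, 0.0), [1.0] * len(U))  # mean time to reach i
    m1 = dot(alpha, exit_time)                              # <t_l>
    m2 = 2.0 * dot(alpha, solve(shifted(Q, U, 0.0), exit_time))  # <t_l^2>
    # MFPT to i, starting from the equilibrium distribution inside U_i
    mfpt = dot([PI[j] / (1.0 - PI[i]) for j in U], exit_time)
    # Laplace transform of the U_i lifetime density, then Eq. 3.10
    p_s = dot(alpha, solve(shifted(Q, U, s), [Q[j][i] for j in U]))
    series = (1.0 / (s + lam)) / (1.0 - lam * p_s / (s + lam))
    # Direct evaluation: element (i, i) of the resolvent (s*I - Q)^(-1)
    unit = [1.0 if a == i else 0.0 for a in range(n)]
    direct = solve(shifted(Q, list(range(n)), s), unit)[i]
    return {
        "Eq. 3.4": (PI[i], 1.0 / (1.0 + lam * m1)),
        "Eq. 3.7": (mfpt, 0.5 * m2 / m1),
        "Eq. 3.10": (direct, series),
    }


if __name__ == "__main__":
    for name, (lhs, rhs) in check(0).items():
        print(name, lhs, rhs)
```

Each pair returned by `check` should agree to within floating-point error for any choice of target state, which is consistent with the derivation above not depending on the details of the rate matrix.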

To express the relaxation time of state $i$ in Eq. 3.1 in terms of the lifetimes of state $U_i$, we evaluate the integral by taking the $s \to 0$ limit,

$$\int_0^\infty \left( T_{ii}(t) - P_{eq}(i) \right) dt = \lim_{s \to 0} \left( \tilde{T}_{ii}(s) - \frac{P_{eq}(i)}{s} \right). \tag{3.12}$$

Substituting $\tilde{T}_{ii}(s)$ and $P_{eq}(i)$ using Eqs. 3.10 and 3.4, with the expansion of Eq. 3.11 truncated beyond second order in $s$, leads to the following relation:

$$
\begin{aligned}
\int_0^\infty \left( T_{ii}(t) - P_{eq}(i) \right) dt
& = \lim_{s \to 0} \left( \frac{1}{s (1 + \lambda_i \langle t_l \rangle) - \frac{\lambda_i}{2} s^2 \langle t_l^2 \rangle + O(s^3)} - \frac{1}{s (1 + \lambda_i \langle t_l \rangle)} \right) & (3.13) \\
& = \frac{1}{2} \frac{\langle t_l^2 \rangle}{\langle t_l \rangle} \cdot \frac{1}{1 + \lambda_i \langle t_l \rangle} \cdot \frac{\lambda_i \langle t_l \rangle}{1 + \lambda_i \langle t_l \rangle}. & (3.14)
\end{aligned}
$$

On the right-hand side of Eq. 3.14, the first factor is the MFPT to state $i$ (Eq. 3.7), the second factor is $P_{eq}(i)$ (Eq. 3.4), and the third factor is $1 - P_{eq}(i)$ (Eq. 3.5). Dividing by $1 - P_{eq}(i)$, which we take to be the normalization of the correlation function in the definition of the relaxation time (Eq. 3.1), the general relationship between the MFPT and the relaxation time can be written as

$$\mathrm{MFPT}_i = \frac{\tau_i^{relax}}{P_{eq}(i)}. \tag{3.15}$$

A similar result was derived by Szabo et al. [84, 85] in a different context using a Green's function approach. Noh et al. [86] also derived this relationship in the context of network theory and centrality. The analysis above is exact and general: the relationship between the MFPTs and the relaxation times holds regardless of the shape of the energy landscape.

3.3 Results and Discussion

Eq. 3.15 explains why mean first passage times from state to state within the unfolded ensemble can be very long while the energy landscape remains smooth (minimally frustrated). In fact, when the folding kinetics is two-state, all of the unfolded state relaxation times within the unfolded free energy basin are faster than the folding time. This result supports the well-established funnel energy landscape picture and resolves an apparent contradiction between this model and the recently proposed kinetic hub model of protein folding. We validate these concepts by analyzing a Markov state model of the kinetics in the unfolded state and folding of the mini-protein NTL9 (the 39 amino acid N-terminal domain of the ribosomal protein L9), constructed from a 2.9 millisecond (ms)

simulation provided by D. E. Shaw Research [1]. The polypeptide (see Fig. 3.2) was solvated in 100 mM NaCl in a cubic box of 50 Å side length containing 3,800 water molecules. A force-shifted cutoff of 9.5 Å was used for the LJ and electrostatic interactions.

The 39-residue NTL9 is a slightly more complicated system than Trp-cage, as it forms a small mixed α-β structure. α-β proteins are a class of proteins in which the secondary structure is composed of α-helices and β-strands along the backbone. In addition, NTL9 folds much more slowly than Trp-cage; its folding time was reported experimentally as 1.5 ms [87]. The larger number of residues adds heterogeneity and complexity to the unfolded state ensemble of NTL9, and the much longer folding time makes it more challenging to fully understand the kinetics from the unfolded state to the folded state and vice versa. However, NTL9 shares one feature with Trp-cage: both are two-state folders [87-91].

Let us first examine the RMSD distribution of NTL9 (see Fig. 3.3) to see whether it has two modes. The native state of NTL9 is quite stable in the simulation: 22% of the samples have an RMSD larger than 2 Å. We divide all the conformations into three ensembles: folded (RMSD ≤ 2.4 Å, F), intermediate (2.4 Å < RMSD < 6 Å, I) and unfolded (RMSD ≥ 6 Å, U).

A Markov state model was constructed from the NTL9 trajectories to study the folding kinetics. We uniformly subsampled the trajectories so that one in one hundred conformations was used for the clustering. A set of 24,216 cluster centers was generated by hybrid k-centers/k-medoids clustering, and the rest of the conformations were assigned to their closest cluster centers based on RMSD. Furthermore, two coarse-grained Markov state models, a 20-node network and a 100-node network, were built from this 24,216-node network by using PCCA+.
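The bookkeeping behind the three-ensemble division can be sketched as follows. This is a minimal illustration and not the pipeline used in this work, which clusters conformations into 24,216 states before coarse-graining. The function names, the treatment of the boundary values, the lag of one frame, and the short RMSD series are all assumptions made for the example; the RMSD values are synthetic and are not simulation data.

```python
# Minimal sketch: label frames as folded (F), intermediate (I) or unfolded (U)
# from their RMSD, then estimate a row-normalized transition matrix by
# counting transitions at a fixed lag. All names and numbers are hypothetical.

STATES = ("F", "I", "U")


def classify(rmsd, folded_max=2.4, unfolded_min=6.0):
    """Assign one frame to F, I or U from its RMSD in angstroms."""
    if rmsd <= folded_max:
        return "F"
    if rmsd >= unfolded_min:
        return "U"
    return "I"


def transition_matrix(labels, lag=1):
    """Count transitions separated by `lag` frames and normalize each row."""
    index = {name: k for k, name in enumerate(STATES)}
    counts = [[0] * len(STATES) for _ in STATES]
    for a, b in zip(labels[:-lag], labels[lag:]):
        counts[index[a]][index[b]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that is never left within the data keeps an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


if __name__ == "__main__":
    rmsd_series = [1.8, 2.1, 3.0, 6.5, 7.2, 5.0, 2.2, 1.9]  # synthetic values
    labels = [classify(r) for r in rmsd_series]
    print("".join(labels))
    for name, row in zip(STATES, transition_matrix(labels)):
        print(name, row)
```

Uniform subsampling, as used before clustering, would correspond to slicing the frame list with a stride (for example `frames[::100]`) before any further processing.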
To validate the kinetics of the two coarse-grained MSMs, the implied timescales are plotted in Fig. 3.4 and Fig. 3.5. In Fig. 3.4a, with the unmodified equilibrium boundary conditions, there is a major gap between the slowest implied timescale and the other implied timescales, which means that the folding is two-state. It can also be seen that the slowest implied timescale, which corresponds to the inverse of the sum of the folding and unfolding rates, does not depend significantly on the granularity, increasing from 7 µs at the 20-node level to 8.3 µs at the 100-node level (see Table 3.1). After applying a reflecting boundary at the folded state, the dynamics is restricted to be within

Figure 3.2: The NMR structure (PDB entry 2HBA) of NTL9, with ribbons representing a three-stranded anti-parallel β-sheet and an α-helix.
