ALGORITHMS FOR BUILDING MODELS OF MOLECULAR MOTION FROM SIMULATIONS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Nina Singhal Hinrichs
September 2007

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Vijay S. Pande) Principal Co-Advisor

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Serafim Batzoglou) Principal Co-Advisor

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
(Leonidas Guibas)

Approved for the University Committee on Graduate Studies.

Abstract

Many important processes in biology occur at the molecular scale. A detailed understanding of these processes can lead to significant advances in the medical and life sciences; for example, many diseases are caused by protein aggregation or misfolding. One approach to studying these systems is to use physically-based computational simulations to model the interactions and movement of the molecules. While molecular simulations are computationally expensive, it is now possible to simulate many independent molecular dynamics trajectories in parallel by using distributed computing methods such as Folding@Home. The analysis of these large, high-dimensional data sets presents new computational challenges.

This dissertation presents a novel approach to analyzing large ensembles of molecular dynamics trajectories to generate a compact model of the dynamics. The model groups conformations into discrete states and describes the dynamics as Markovian, or history-independent, transitions between the states. We will discuss why the Markovian state model (MSM) is suitable for macromolecular dynamics, and how it can be used to answer many interesting and relevant questions about the molecular system. We will also present new approaches for many of the computational and statistical challenges in building such a model: a novel algorithm for defining the states, methods for comparing different state definitions and determining the optimal number of states, efficient error analysis techniques to determine the statistical reliability, and adaptive algorithms to efficiently design new simulations. The methods are applied to model systems as well as molecular dynamics simulation data of several small peptides.

Acknowledgements

I would first like to thank my family and friends for their love and support: my parents Kumud and Kishore, who always inspired and encouraged me; my sister Monica, for her guidance and advice; Tim Knight, Eran Guendelman, and Andrea Tompa for filling graduate school with fun memories; and especially Tim Hinrichs, my best friend and husband, for sharing this wonderful experience with me.

I would also like to acknowledge some of my collaborators: Peter Kasson for collaborations on lipid vesicle simulations; John Chodera and Bill Swope for interesting discussions about Markov state models and excellent collaborations on state decomposition algorithms; and all of the Pande lab members, past and present, with whom I had the pleasure of working.

There were numerous people who made their simulation data available for the analysis presented in this thesis: Christopher Snow for the trpzip2 data set (Chapter 2); Eric Sorin for the F_s peptide data set (Chapter 3); Jed Pitera for the trpzip2 data set (Chapter 3); John Chodera for the alanine data set (Chapters 4 and 6); and Guha Jayachandran for the villin headpiece model (Chapter 6).

Several people had helpful comments on various parts of this thesis: Hans Andersen and Frank Noé for enlightening conversations on the nature of Markov chain models; Vishal Vaidyanathan for assistance with clustering algorithms (Chapter 3); Jed Pitera for insightful discussions and constructive comments on Chapter 3; Libusha Kelly, David Mobley and Guha Jayachandran for critical comments on Chapter 3; Kishore Singhal for insightful discussions about sensitivity analysis (Chapters 5 and 6); and John Chodera for helpful comments on Chapters 4 and 6.

My thesis committee members deserve special thanks: Axel Brunger, as the chair of my orals committee; Jean-Claude Latombe for inspiration about graphical kinetic models and for serving on my orals committee; Leonidas Guibas for discussions about the geometric nature of conformation space and for being a committee member; Serafim Batzoglou, as my co-advisor and for discussions about alignment which helped motivate many of the ideas in this thesis; and especially my advisor Vijay Pande, for his help and guidance throughout my graduate career.

Contents

Abstract
Acknowledgements

1 Introduction

2 Markovian state models
    Introduction
    Theory and methods
        Direct rate calculations
        Sampling of paths
        MSM generation
        Post-processing of MSMs
        Reweighting of edges
        Mean first passage time and P_fold calculation
    Results
        Model system
        Trpzip2 kinetics
    Discussion and conclusions

3 Automatic state decomposition
    Introduction
    Theory
        Markov chain and master equation models of conformational dynamics
        Markov model construction from simulation data given a state partitioning
        Requirements for a useful Markov model
        Validation of Markov models
    The automatic state decomposition algorithm
        Practical considerations for an automatic state decomposition algorithm
        Sketch of the method
        Implementation
        Validation
    Applications
        Alanine dipeptide
        The F_s helical peptide
        The trpzip2 β-peptide
    Discussion
    Supporting Information

4 Model selection
    Introduction
    Methods
        Bayesian Networks
        Parameter estimation in Bayesian Networks
        Scoring of Bayesian Networks
        Markovian state models as Bayesian Networks
        Comparison between different Markovian state models
        Non-equilibrium data
    Results
        Model system
        Alanine peptide
    Conclusions

5 Error analysis methods
    Introduction
    Methods
        Mean first passage times
        Transition probability distribution
        Sampling based error analysis methods
        Non-sampling based error analysis method
        Adaptive sampling algorithm
        Extension to large systems
    Results
        Demonstration of method
        Validity of approximations
        Adaptive sampling
    Discussion and conclusions

6 Eigenvalue and eigenvector error analysis
    Introduction
    Methods
        Eigenvalue and eigenvector equations
        Transition probability distribution
        Distribution of eigenvalues and eigenvectors
        Adaptive sampling
    Results
        Eigenvalue distributions
        Eigenvector distributions
        Adaptive sampling
    Discussion and Conclusions

Conclusions

A Sampling from a Dirichlet distribution
B Sampling from a Multivariate Normal distribution
C MFPT sensitivity analysis
D Solving a bordered sparse matrix
E Eigenvalue sensitivity analysis
F Eigenvector sensitivity analysis

Bibliography

List of Tables

Macrostates from a 20-state state decomposition of the F_s helical peptide
Four state definitions for the transition model between 9 conformations
Summary of sampling based methods for calculating the error of the MFPT from the initial state due to sampling
Means and standard deviations of the MFPT distributions generated for the four sampling and the non-sampling based error analysis methods
Running times for the error analysis methods on calculating the MFPT distribution of an 87 state example

List of Figures

The shooting algorithm for sampling paths
Clustering of MSM points
Clustering of nodes to guarantee that all nodes can reach the final state
Contour graph of the potential energy, E(x, y), of the model energy landscape
The correlation between P_fold values calculated directly from many simulations and MSM simulations on the model energy landscape
The comparison between the MFPT calculated directly from many simulations and from the MSM simulations as a function of temperature
The comparison between the MFPT calculated from many simulations to the MFPT calculated from reweighted versions of a single MSM as a function of temperature
Error analysis of direct simulations and the various MSM techniques
The effect of clustering cutoff on the calculated MFPT for the model system and trpzip2 peptide
Flowchart of the automatic state decomposition algorithm
Potential of mean force and manual state decomposition for alanine dipeptide
Comparison of manual and automatic state decompositions for alanine dipeptide
Stability and recovery of optimal state decomposition for alanine dipeptide
Implied time scales of the F_s peptide as a function of lag time for 20-state automatic state decomposition
Reproduction of observed state population evolution by Markov model for the F_s peptide
Comparison of some trpzip2 macrostates found by automatic state decomposition with misregistered hydrogen bonding states identified in a previous study
Implied time scales of trpzip2 as a function of lag time for 40-state automatic state decomposition
The Dynamic Bayesian Network corresponding to a Markovian state model
The transition probabilities and state definitions for a simple model with 9 conformations
The difference in scores between a 9-state and 3-state definition of the transition model of 9 conformations for different lag times τ and number of data instances M
Comparison of MSMs corresponding to the subdivision of states
Several state decompositions for the terminally blocked alanine dipeptide
Comparison of different state definitions for the terminally blocked alanine peptide
Distributions of the mean first passage time as generated by the first sampling based method on the 87 state example
Distribution of the mean first passage time as calculated by the five error analysis methods
Effect of adaptive sampling on the variance of the mean first passage time
Relationship between the number of samples and the variance for the even and adaptive sampling algorithms
Percent of samples required for each state in the optimal allocation of samples per state
Potential of mean force and manual state decomposition for terminally-blocked alanine peptide
Distributions of the five non-unit eigenvalues of the system shown in Fig
The percent contribution of each state to the variance for the five non-unit eigenvectors (Eq. 6.24)
Distributions of the eigenvector components corresponding to the third and fifth eigenvalues of the six-state model of the terminally blocked alanine peptide
The contributions to the variance of the eigenvector components as decomposed by transitions from each state
The mean and standard deviation of the variance of the largest non-unit eigenvalue of the six-state model of terminally blocked alanine peptide and the 2454-state model of the villin headpiece for different sampling algorithms

Chapter 1

Introduction

Many important processes in biology occur at the molecular scale. A detailed understanding of these processes can lead to significant advances in the medical and life sciences; for example, many diseases are caused by protein aggregation or misfolding [Dob03], and potential drug molecules can be designed by understanding their binding properties and conformational changes. These processes have typically been studied through experiments. While such experiments can yield a wealth of insight, they are often insufficient to describe the system dynamics on an atomic scale, which is desirable for many of the problems of interest.

An alternate approach is to use physically-based computational simulations to model the interactions and movement of the molecules. These simulations are typically performed in atomic detail and with small time steps in order to accurately reproduce the underlying dynamics. The interesting movements of these systems occur over a relatively long time scale (protein folding times range from microseconds to seconds) and, unfortunately, atomistic molecular dynamics simulations of this length would take thousands of CPU-years to complete.

Several methods have been developed for making the computational problem tractable. It is possible to trade accuracy for speed through the use of simplified models to simulate the molecular system (e.g., coarse-grained atoms, lattice models, simplified forcefields). When the resulting loss in accuracy is unacceptable, another approach harnesses the power of parallel and distributed computing. One strategy is to parallelize a single simulation across multiple processors. However, parallelizing a single simulation requires significant inter-processor communication because of the long-range interactions between atoms. An alternate strategy is to perform a large number of independent simulations, distributed across multiple processors with little or no communication.
While this strategy does not reduce the required time to generate long trajectories, it can efficiently generate a large number of short, independent trajectories. With sufficient sampling, a set of short, independent simulations can represent the full system dynamics by describing all short pieces that can then be merged to describe the overall dynamics.

This dissertation involves the analysis of large ensembles (tens of thousands) of short, independent molecular dynamics trajectories (nanoseconds to microseconds) to extract as much information as possible about the underlying system by constructing a compact model of the dynamics. The actual dynamics are a series of transitions between the detailed configurations of the system, for example, the Cartesian coordinates of all the atoms. To overcome the high dimensionality of the configuration space, we developed a model which groups sets of similar configurations into discrete states and describes the dynamics as transitions between these states. Once the states are defined, it is possible to calculate the transition probabilities between these states by counting the transitions that occur in the simulation data between the configurations assigned to each state. We call the states and transition probabilities a Markovian state model (MSM), because the model assumes that the transitions between states are history independent. When the transitions of the underlying system are Markovian over the defined states, a random walk in the model will mimic the true dynamics. It is possible to build this model from short (nanosecond length) trajectories, and then use the constructed model to project to infinite time scales. We can thus study the mechanism of long time scale events, such as protein folding, which were previously inaccessible through simulations. The model naturally handles complex kinetic behavior and can accurately describe kinetic mechanisms through intermediate and trap states in the landscape. Important quantities, such as the rate of folding, can be computed efficiently from this graph-based model and compared with experimental results to validate the model.
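The counting step described above can be illustrated with a minimal sketch. It assumes each trajectory has already been reduced to a sequence of integer state indices; the function name `estimate_transition_matrix` and the `lag` parameter are hypothetical names chosen for this illustration, not part of the dissertation.

```python
import numpy as np

def estimate_transition_matrix(state_trajs, n_states, lag=1):
    """Count transitions at a fixed lag and row-normalize the counts.

    state_trajs: iterable of sequences of integer state indices (assumed input).
    Returns (counts, probs); rows of probs with no observed data stay all zero.
    """
    counts = np.zeros((n_states, n_states))
    for traj in state_trajs:
        # Pair each frame with the frame `lag` steps later in the same trajectory.
        for a, b in zip(traj[:-lag], traj[lag:]):
            counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, row_sums,
                      out=np.zeros_like(counts), where=row_sums > 0)
    return counts, probs

# Two short, independent trajectories over three states (state 2 is never visited).
counts, probs = estimate_transition_matrix([[0, 0, 1, 1, 0], [1, 1, 0]], n_states=3)
```

Transitions are only counted within a trajectory, never across the boundary between two independent runs, which is what allows many short simulations to be pooled into one model.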
The MSM has been applied to small peptides, non-biological polymers, and vesicle fusion, with good agreement to experimental rates.

Chapter 2 describes how to build a Markovian state model from molecular dynamics data and how to efficiently compute kinetic properties from this model, such as the probability that a conformation will fold before unfolding, and the average time to transition between two states. It also describes a method for modifying the MSM to calculate kinetic properties at parameters other than the simulation parameters. The proof of concept of the Markovian state model is demonstrated on dynamics over a simple two-dimensional energy landscape and simulation data of the 12-residue β-peptide trpzip2. The material presented in this chapter was published in the Journal of Chemical Physics [SSP04].

One of the main difficulties in building a MSM is in defining the states, as the model is only useful and representative of the underlying dynamics if transitions on the defined states are, in fact, Markovian. Previous descriptions of states usually relied either on order parameters, which assume that the relevant degrees of freedom are known, or on structural clustering. In Chapter 3, we describe an iterative algorithm for automatically finding important states from the simulation data. Because the configuration space is high dimensional, an automatic algorithm is required to avoid any inadvertent subjective bias that could be introduced through manual methods. Our randomized heuristic algorithm combines clustering by structural similarity with clustering by kinetic similarity to produce states that faithfully describe the dynamics of the system. Including kinetic similarity measures greatly improves the history-independence of the states, since structurally similar configurations may have different kinetic properties, and structurally diverse configurations may behave similarly. The state decomposition algorithm is applied to three peptide systems: the terminally blocked alanine peptide, the 21-residue helical F_s peptide, and trpzip2. This chapter describes joint work with John Chodera and William Swope which was published in the Journal of Chemical Physics [CSP+07]; I contributed jointly to the design and implementation of the algorithm and the application to the test systems.

MSMs are not unique, and models with different numbers of states may all describe the system dynamics. Chapter 4 adapts and applies concepts from structure learning of Bayesian Networks to quantitatively compare models with different numbers of states and state definitions. We show how to convert a MSM into a Bayesian Network and then evaluate how well the MSM fits the data through maximum likelihood and Bayesian scoring functions. The advantage of the Bayesian scoring function is that it automatically accounts for the amount of simulation data and can better discriminate the appropriate number of states in the MSM.
No previous methods for evaluating Markovian state models explicitly quantified the tradeoff between the number of states in the model and the amount of simulation data needed for their parameterization. The scoring functions are evaluated on MSMs corresponding to a simple transition model and the alanine peptide.

Once a model has been created, there are numerous quantities which can be calculated from it, and determining their accuracy is an important task. Since there is only a finite amount of simulation data, the transition probabilities in the MSM, which are defined from this data, will have uncertainties associated with them. We developed techniques to assess the resulting uncertainty of many important kinetic properties that can be calculated from the model. Previous methods for uncertainty analysis relied on repeatedly sampling possible transition probabilities and calculating the quantity of interest, which is computationally expensive and does not scale well for large models. We instead derived efficient closed-form functions that approximate the distributions of several properties. Chapter 5 describes methods for the error analysis of the average time to transition between two states and Chapter 6 extends these techniques to calculate the uncertainties in such properties as the equilibrium distribution and reaction rates. We show that these functions are excellent approximations to the distributions obtained through repeated sampling and present efficient sparse matrix techniques that allow these calculations to scale to large systems. In addition, we developed methods to identify the states that contribute the most to the uncertainty and designed an adaptive sequential sampling scheme where one can selectively start new simulations to reduce the uncertainty with greatly reduced computational cost. When applied to a MSM of the 12-residue trpzip2 (Chapter 5), the adaptive sampling scheme required 20 times fewer simulations as compared with the usual unguided simulations. When applied to a MSM of the 36-residue α-helical villin headpiece (Chapter 6), the adaptive sampling scheme required three orders of magnitude fewer simulations, thus demonstrating the power of simulation planning. Chapter 5 and Chapter 6 are based on papers published previously in the Journal of Chemical Physics [SP05a, HP07].

Chapter 2

Markovian state models

We introduce an efficient new method for predicting protein folding rate constants and mechanisms from molecular dynamics simulations. The Markovian state model (MSM) is a discrete representation of the underlying kinetics of the molecular system. Using the MSM framework, we show how to quickly calculate the folding probability (P_fold) and mean first passage time of all the sampled conformations. In addition, we provide techniques for evaluating these values under perturbed conditions, specifically different temperatures, without expensive recomputations. To demonstrate this method on a challenging system, we apply these techniques to a two-dimensional model energy landscape and the folding of a tryptophan zipper β-hairpin.

2.1 Introduction

The direct simulation of protein folding has been a grand challenge of computational biology for several decades [SB01]. Simulating protein folding is particularly challenging due to the long time scales involved. While the fastest proteins fold on the microsecond to millisecond time scale, atomistic molecular dynamics simulations are typically constrained to the nanosecond time scale. In order to overcome this fundamental computational barrier, several new computational methods have been proposed.

One such approach to studying protein folding events is transition path sampling [DBCC98]. Given an initial trajectory between the unfolded and folded regions, this method generates an ensemble of different pathways that join the unfolded and folded regions. From these path ensembles, Bolhuis and coworkers determined the formation order of hydrogen bonds and the hydrophobic core in a β-hairpin [Bol03]. Using the fluctuation-dissipation theorem [Cha87], it is possible to calculate folding rates from these ensembles [DBCC98]. More recently, a new method called transition interface sampling [vemb03] introduced an alternate way to calculate transition rates.

One drawback of these methods is that they do not utilize all the simulation results. To ensure that trajectories are decorrelated, only every fifth or tenth pathway generated is added to the path ensemble. Also, these methods require many individual path sampling simulations corresponding to different boundary conditions in order to calculate rates. Since path sampling methods are computationally demanding, it is interesting to consider whether one can construct an algorithm which can more efficiently utilize simulation data (e.g., folding trajectories) in order to predict folding rates and mechanisms.

There are also techniques that analyze the nature and kinetics of the folding process by representing possible pathways in a graph, or roadmap. These methods sample configuration space and connect nearby points with weights according to their Monte Carlo probabilities. From these graphs, it is possible to calculate such properties as the shortest path, most probable path, and P_fold values [ABG+02], as well as analyze the order in which secondary structures form [STD+03]. The primary challenge of these techniques is how to sample the conformational states in order to construct the pathway graph. The graph representation of protein folding pathways does not solve the sampling problem, but recasts it, and sampling any continuous, high dimensional space is still a difficult challenge. Previous graph-based methods have sampled configuration space uniformly (e.g., choosing conformations at random) or used sampling methods biased towards the native state. Clearly, as the protein size increases, it becomes exponentially difficult to sample the biologically important conformations with random sampling.
In addition, while probabilistic roadmap methods can predict P_fold values [ABG+02] and suggest structure formation order [STD+03], they have not included the time involved in the transitions. Because of this, one cannot predict time dependent properties such as folding rates, and thus it is difficult to assess the experimental validity of these methods.

This chapter outlines a novel combination of the techniques described above. We propose transforming the simulation data gathered from transition path sampling algorithms into a probabilistic roadmap that includes information about the transition times. As opposed to traditional transition path sampling analysis, this method would incorporate all of the simulated data into the results, therefore potentially yielding an increase in efficiency. This method is also general enough to work on any molecular dynamics data sets, not just those gathered from path sampling simulations. We call our model a Markovian state model, or MSM, as it assumes Markovian transitions between states.

From this MSM we can quickly and simultaneously calculate such properties as the P_fold for all configurations sampled and the mean first passage time (MFPT) from the unfolded state to the folded state from a single transition path sampling simulation. In addition, this method provides a compact representation of the possible pathways in the system, which may be useful for understanding the mechanisms involved in folding. We suggest that our method would improve on the current roadmap techniques by sampling points using molecular dynamics, thereby greatly increasing the probability that the configurations that are included are kinetically relevant. In addition, the simulation time between points would inherently capture transition times, making the calculation of folding rates possible.

In the following sections we describe the algorithms necessary to transform molecular trajectories into a MSM with the correct transition probabilities and times (see "MSM generation" and "Post-processing of MSMs", Sec. 2.2.4). We also provide methods that allow for data gathered at one set of parameters, such as temperature, to be analyzed easily at other parameter values without the need for additional simulations (see "Reweighting of edges"). We then describe linear algebra techniques to quickly calculate such values as P_fold and MFPT (see "Mean first passage time and P_fold calculation"). We first give results on a model energy landscape and find that they are in good agreement with results from direct simulations (see "Model system"). Finally, we apply these methods to the analysis of existing simulation data of the folding of a small protein: the tryptophan zipper β-hairpin [SQD+04] (see "Trpzip2 kinetics").

2.2 Theory and methods

Direct rate calculations

One purpose of the MSM is to understand kinetics when one cannot easily simulate transitions from one state to another (e.g., for slow transitions from the unfolded state to the folded state). However, to validate the new methods for calculating kinetic properties, it is important to test the methods on systems in which the direct kinetics simulations can be performed.
In this case, one can calculate the mean first passage time (in terms of number of Monte Carlo steps for Monte Carlo (MC) simulations and simulated time for Langevin simulations) directly from many independent simulations, even if these simulations are each shorter than the mean folding time. If one assumes first order kinetics, the probability that a particle has reached the final state at some time t is given by

\[ P_f(t) = 1 - e^{-kt}, \tag{2.1} \]

where t is the time, k is the rate, and P_f(t) is the probability of having reached a final state by time t. By running many independent simulations shorter than 1/k, one can estimate the cumulative distribution P_f(t), and hence fit the value for the rate, k. The mean first passage time is the average time when a particle will first reach the final state, given that it is in an initial state at t = 0,

\[ \mathrm{MFPT} = \int_{t=0}^{\infty} \left( \frac{d}{dt} P_f(t) \right) t \, dt = \int_{t=0}^{\infty} k t e^{-kt} \, dt. \tag{2.2} \]

Integrating by parts yields the solution

\[ \mathrm{MFPT} = \frac{1}{k}. \tag{2.3} \]

One could also find the MFPT by directly calculating the average time when each simulation first reached a final state. However, if, because of simulation time constraints, some simulations are stopped before reaching the final state, the calculated MFPT will be too low. By first fitting the rate to the P_f(t) data (which can be calculated accurately even if some simulations do not finish), the MFPT value will be more accurate. For simple systems (such as the two-dimensional energy landscape presented below), one can directly simulate kinetics on long time scales.

Sampling of paths

For systems where the probability of reaching the final state is very low, the above direct method would require a large number of simulations to get a reasonable estimate of the rate of folding and the mean first passage time. The method that we describe below is a modified version of the shooting algorithm [DBC98] that has been shown to efficiently generate a sample of uncorrelated transition paths leading from the initial region to the final region.

First, we must obtain some initial path between the initial and final regions. This can be obtained from previous data, high temperature unfolding simulations, direct MC or Langevin simulations, or some other means. We keep points on this path such that successive points are separated by some time interval, τ. We can label the points along this path as {p_0, p_1, ..., p_n}, where n is the length of the path. We generate new paths by picking a random point along the current path, p_i, and shooting a new path from it by starting a new simulation from this point.
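Before continuing with the shooting procedure, the direct rate estimate of Eqs. 2.1 to 2.3 can be made concrete with a short sketch. It assumes that every simulation has the same length and that runs which never reached the final state are recorded as `np.inf`; the name `fit_rate`, the evaluation grid, and the choice of a least-squares line through the origin for -ln(1 - P_f(t)) = kt are assumptions made for this illustration, not necessarily the fitting procedure used in the dissertation.

```python
import numpy as np

def fit_rate(first_passage_times, sim_length, n_grid=50):
    """Fit k in P_f(t) = 1 - exp(-k t) and return (k, MFPT = 1/k).

    first_passage_times: time at which each run first reached the final state,
    or np.inf if it was stopped at sim_length without reaching it (assumed input).
    """
    fpt = np.asarray(first_passage_times, dtype=float)
    t = np.linspace(sim_length / n_grid, sim_length, n_grid)
    # Empirical cumulative distribution: fraction of runs finished by each time t.
    p_f = (fpt[None, :] <= t[:, None]).mean(axis=1)
    if p_f.max() == 0.0:
        raise ValueError("no run reached the final state; k cannot be fitted")
    keep = p_f < 1.0  # avoid log(0) where every run has already finished
    y = -np.log1p(-p_f[keep])  # equals k * t under first order kinetics
    k = float(np.dot(t[keep], y) / np.dot(t[keep], t[keep]))
    return k, 1.0 / k

# Synthetic check: exact exponential quantiles for k = 2, censored at t = 0.3 < 1/k.
k_true, n_runs = 2.0, 1000
u = (np.arange(n_runs) + 0.5) / n_runs
fpt = -np.log1p(-u) / k_true
fpt[fpt > 0.3] = np.inf
k_est, mfpt_est = fit_rate(fpt, sim_length=0.3)
```

Note that the naive average of the finite first passage times in this example would be far below 1/k, which is the bias that fitting P_f(t) avoids.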
Points are recorded along this path every τ and are labeled {np_0, np_1, ..., np_m}. If neither the initial nor final state is reached within some simulation time cutoff, we reject this path and the current path remains the same for the next iteration. Otherwise, if either of these states is reached, we stop the simulation at that time point and define our new current path as the combination of the previous current path and the newly generated path. If the new path reached the initial state, then the new current path is {np_m, np_{m-1}, ..., np_0, p_i, p_{i+1}, ..., p_n}. If the new path reached the final state, then the new current path is {p_0, p_1, ..., p_i, np_0, np_1, ..., np_m} (Fig. 2.1). We repeat this shooting step for some set number of trials.

Figure 2.1: The shooting algorithm for sampling paths. The solid path shows an original path between the initial and final regions. The red paths represent two possible new path segments, corresponding to the new path reaching either the initial or final regions. In the case of the path reaching the initial state, the accepted path would be {np_2, np_1, np_0, p_4, p_5, p_6}. In the case of the path reaching the final state, the accepted path would be {p_0, p_1, p_2, p_3, np_0, np_1, np_2}.

This sampling strategy will capture paths between the boundaries of the initial and final region. If we are to calculate the MFPT between the initial and final regions, we must also simulate the time a molecule can spend within the initial region. To do this, we start many simulations from within the initial region and stop the simulations once the boundary of that region has been crossed.

MSM generation

Here, we describe how to generate the MSM of conformational states, including the probability and time to traverse from node to node in the MSM. Each conformation in the paths accepted while sampling paths is represented by a node in the MSM, node_i, for some unique index i. Successive points in each accepted path segment are represented by edges in the MSM, edge_ij, representing an edge between node_i and node_j. Each edge has associated with it the simulation time taken to traverse that edge, time_ij, and the probability of taking that edge, P_ij. We initialize all edge probabilities to one and renormalize in the post-processing step. If there is no edge between states i and j, we set the probability P_ij to zero.
The MSM may be generated from data from the transition path sampling shooting algorithm as above, or on any existing simulation data, so long as the time between points in the simulation is known. Simulations of different time resolutions may also be included in a single MSM.
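The path-combination rule of the shooting step and the edge bookkeeping described above can be sketched as follows. This is a minimal illustration, not the implementation used for the results in this chapter: the function names `splice_path` and `build_msm`, the use of hashable labels as node identifiers, and the representation of each accepted segment as a `(tau, nodes)` pair are all assumptions made for the example.

```python
def splice_path(current, i, new_segment, reached):
    """Combine the current path with a newly shot segment (hypothetical helper).

    current     -- list of points p_0 .. p_n
    i           -- index of the shooting point p_i in `current`
    new_segment -- list of points np_0 .. np_m recorded every tau
    reached     -- "initial" or "final": the region where the segment stopped
    """
    if reached == "initial":
        # {np_m, ..., np_0, p_i, p_{i+1}, ..., p_n}
        return new_segment[::-1] + current[i:]
    if reached == "final":
        # {p_0, ..., p_i, np_0, ..., np_m}
        return current[: i + 1] + new_segment
    raise ValueError("segment must end in the initial or the final region")


def build_msm(segments):
    """Turn accepted path segments into edge probabilities and edge times.

    segments -- iterable of (tau, nodes) pairs; tau is the time between
                successive points of that segment, so segments recorded at
                different time resolutions can share one MSM.
    Returns two dicts keyed by (i, j): unnormalized probabilities (all 1.0,
    renormalized later in post-processing) and traversal times.
    """
    prob, time = {}, {}
    for tau, nodes in segments:
        for a, b in zip(nodes, nodes[1:]):
            prob[(a, b)] = 1.0
            time[(a, b)] = tau
    return prob, time
```

Edges that are absent from the returned dictionaries correspond to $P_{ij} = 0$.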

The MSM is designed to embody the possible pathways that the molecule may take while traversing the conformation space. Different paths generated by our simulation methods may pass through very similar conformations, but since the conformation space is continuous, these points will never be exactly the same. However, we wish to capture the fact that these paths reach essentially the same point. We can do this by clustering nearby points in conformation space according to some metric. We define some cutoff value that represents how close two points need to be in order for us to consider them to be the same kinetically. Then, we combine points that are within this distance from one another according to some clustering algorithm. We may choose different cutoffs for the different regions of conformation space: the initial region, the final region, and the transition region. To combine two points, we remove all the incoming and outgoing edges from one of the points and connect them to the other point. If there are now multiple edges between two nodes, we combine them into a single edge with the following values (Fig. 2.2):

$$P^{\mathrm{new}}_{ij} = P^{1}_{ij} + P^{2}_{ij}, \tag{2.4}$$

$$\mathrm{time}^{\mathrm{new}}_{ij} = \frac{P^{1}_{ij}\,\mathrm{time}^{1}_{ij} + P^{2}_{ij}\,\mathrm{time}^{2}_{ij}}{P^{1}_{ij} + P^{2}_{ij}}. \tag{2.5}$$

The coordinates of clustered points are represented as the weighted average of all points belonging to the cluster.

Post-processing of MSMs

We need to ensure that every node in the MSM is able to reach a final state. Otherwise, since these nodes will have an infinite mean first passage time, calculations done on the MSM will fail. We identify the nodes that can reach a final state by performing a depth-first search from the final states over the incoming edges, and marking all nodes that are reachable. We propose two different methods for removing the nodes that were not marked. In the first, we simply delete those nodes, thus ensuring that all nodes in the MSM can reach a node in the final state.
If there are not many such nodes, this should not bias the results very much. However, if there are many unmarked nodes, deleting these nodes could distort the results. An alternate method for removing nodes that cannot reach the final state is to merge each into its closest neighbor until all nodes can reach the final state (Fig. 2.3). This nearest neighbor provides the best guess to the future dynamics of the unmarked node with respect to reaching the final state.
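The marking step and the first (deletion) strategy can be sketched as below. This is an illustrative sketch under assumptions: edges are stored as a dictionary keyed by `(i, j)` pairs, and the helper names `nodes_reaching_final` and `delete_unmarked` are hypothetical. The nearest-neighbor merging strategy is not shown because it depends on the chosen distance metric.

```python
from collections import defaultdict


def nodes_reaching_final(edges, final_nodes):
    """Mark every node from which some final node can be reached.

    edges       -- dict keyed by (i, j) for a directed edge i -> j
    final_nodes -- iterable of node ids in the final region
    The search starts at the final nodes and walks edges backwards,
    i.e. a depth-first search over incoming edges.
    """
    incoming = defaultdict(list)
    for (i, j) in edges:
        incoming[j].append(i)

    marked = set(final_nodes)
    stack = list(marked)
    while stack:
        node = stack.pop()
        for pred in incoming[node]:
            if pred not in marked:
                marked.add(pred)
                stack.append(pred)
    return marked


def delete_unmarked(edges, marked):
    """Keep only edges whose endpoints can both reach the final state."""
    return {(i, j): w for (i, j), w in edges.items()
            if i in marked and j in marked}
```

An iterative stack is used instead of recursion so that long paths do not hit Python's recursion limit.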

In addition, we normalize the probabilities on all the edges so that on each node, the sum of the probabilities for all outgoing edges is one,

$$P^{\mathrm{new}}_{ij} = \frac{P_{ij}}{\sum_k P_{ik}}. \tag{2.6}$$

The probability on each edge equals the number of times that transition was made divided by the total number of transitions from that node. Given sufficient sampling, these probabilities will converge towards the actual probabilities of each transition from that node.

Figure 2.2: Clustering of MSM points. If two nodes are closer than a cutoff for some metric, we cluster together these points by replacing them with a new point containing all of their edges. The left picture shows the nodes before clustering, with the dotted circle indicating the nodes that will be merged. The right picture shows the nodes after this clustering step.

Figure 2.3: Clustering of nodes to guarantee that all nodes can reach the final state. In the left picture, the gray nodes and edges are not able to reach a final state. The dotted circle indicates which two nodes will be merged. The right picture shows the MSM after this step. All nodes can now reach the final state.
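The edge-combination rule (Eqs. 2.4 and 2.5) and the normalization (Eq. 2.6) are simple enough to state directly in code. The following is a sketch only; the names `merge_edges` and `normalize` and the dictionary layout keyed by `(i, j)` are assumptions introduced for illustration.

```python
from collections import defaultdict


def merge_edges(p1, t1, p2, t2):
    """Combine two parallel edges i -> j into one.

    The new weight is the sum of the weights (Eq. 2.4) and the new time is
    the weight-averaged traversal time (Eq. 2.5).
    """
    p_new = p1 + p2
    t_new = (p1 * t1 + p2 * t2) / p_new
    return p_new, t_new


def normalize(prob):
    """Rescale edge weights so outgoing weights of each node sum to one (Eq. 2.6)."""
    totals = defaultdict(float)
    for (i, _j), p in prob.items():
        totals[i] += p
    return {(i, j): p / totals[i] for (i, j), p in prob.items()}
```

For example, merging an edge of weight 1 and time 2.0 with an edge of weight 3 and time 4.0 gives weight 4 and time $(1 \cdot 2 + 3 \cdot 4)/4 = 3.5$.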

Reweighting of edges

The MSM now represents a discrete sampling of the conformation space, and the edges represent the transitions between these states, weighted with the correct probabilities. If some parameters of the system were to change, one could simply adjust the edge weights by the relative probabilities at each value of the parameters to generate an MSM at the new parameters. This assumes that the states and transitions that would be sampled at the new parameters are the same as those sampled at the original parameters. For example, it is common to examine folding at a series of temperatures ($T$); instead of rerunning the calculation for each temperature, it would be ideal if one could reweight an existing MSM for different temperatures.

This reweighting scheme is loosely analogous to thermodynamic reweighting schemes [Iba01]. While our methodology is for kinetic properties, both methods share the idea of reweighting an ensemble generated at one temperature to yield information at another, and thus both rest on the assumption that the ensemble generated would be useful under the perturbed conditions. Accordingly, one would not expect reasonable results for perturbations that are too large (e.g., temperatures far from the original sampling).

The probability of moving between nodes depends on the nature of the dynamics used, since the dynamics simulations were used to estimate the transition probabilities. Below, we will examine how to reweight edges to build an MSM at a different temperature under two dynamics schemes, Monte Carlo dynamics and Langevin dynamics.

Monte Carlo dynamics

First, consider simulations performed using the Metropolis Monte Carlo algorithm to generate moves. Given a current point, $x$, a new point $x'$ is chosen from a distribution $\eta(x, x')$.
This move is accepted according to the Metropolis criterion [MRR+53],

$$P_{\mathrm{acc}} = \begin{cases} 1 & E(x') \le E(x) \\ e^{-[E(x') - E(x)]/k_B T} & E(x') > E(x) \end{cases}, \tag{2.7}$$

where $E(x)$ is the energy at point $x$, $T$ is the temperature, and $k_B$ is Boltzmann's constant. The transition between two states as defined by this algorithm is then

$$P_{ij} = \eta(\mathrm{node}_i, \mathrm{node}_j) \begin{cases} 1 & E(\mathrm{node}_j) \le E(\mathrm{node}_i) \\ e^{-[E(\mathrm{node}_j) - E(\mathrm{node}_i)]/k_B T} & E(\mathrm{node}_j) > E(\mathrm{node}_i) \end{cases}. \tag{2.8}$$

To reweight the edges at a new temperature, we need the relative probability of each transition at the two temperatures. Dividing Eq. 2.8 at the two temperatures of interest, we get

$$\frac{P_{ij}(T_1)}{P_{ij}(T_2)} = \frac{\eta(\mathrm{node}_i, \mathrm{node}_j, T_1)}{\eta(\mathrm{node}_i, \mathrm{node}_j, T_2)} \begin{cases} 1 & E(\mathrm{node}_j) \le E(\mathrm{node}_i) \\ \dfrac{e^{-[E(\mathrm{node}_j, T_1) - E(\mathrm{node}_i, T_1)]/k_B T_1}}{e^{-[E(\mathrm{node}_j, T_2) - E(\mathrm{node}_i, T_2)]/k_B T_2}} & E(\mathrm{node}_j) > E(\mathrm{node}_i) \end{cases}. \tag{2.9}$$

Assuming that $\eta(\mathrm{node}_i, \mathrm{node}_j)$ and $E(\mathrm{node})$ are independent of temperature, we get the following equation for the transition probabilities at the new temperature:

$$P_{ij}(T_2) = \begin{cases} P_{ij}(T_1) & E(\mathrm{node}_j) \le E(\mathrm{node}_i) \\ e^{-\Delta E\,[(1/k_B T_2) - (1/k_B T_1)]}\, P_{ij}(T_1) & E(\mathrm{node}_j) > E(\mathrm{node}_i) \end{cases}, \tag{2.10}$$

where $\Delta E = E(\mathrm{node}_j) - E(\mathrm{node}_i)$.

Langevin dynamics

Langevin dynamics is likely more representative of dynamical properties than Metropolis Monte Carlo, especially since the kinetic interpretation of Monte Carlo relies on the physical nature of the Monte Carlo moves chosen, $\eta(x, x')$. In Langevin simulations, one performs simulations using the Langevin equation of motion,

$$F_{\mathrm{ext}} - m\gamma \frac{dx}{dt} + R = 0, \tag{2.11}$$

$$\langle R(t) R(0) \rangle = 2 m \gamma k_B T\, \delta(t), \tag{2.12}$$

to move particles, where $F_{\mathrm{ext}}$ are the external forces acting on the particle, $m$ is the mass, $\gamma$ is the friction coefficient, and $R$ is a random force, assumed to be a Gaussian random variable with the property given by Eq. 2.12. Rewriting this equation, we can find the change in position with respect to the forces,

$$\Delta x = \frac{R\,\Delta t}{m\gamma} + \frac{F_{\mathrm{ext}}\,\Delta t}{m\gamma}, \tag{2.13}$$

$$P\!\left(\frac{R\,\Delta t}{m\gamma}\right) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-[(R\,\Delta t/m\gamma)^2/2\sigma^2]}, \qquad \sigma^2 = \frac{2 k_B T\,\Delta t}{m\gamma}, \tag{2.14}$$

where $T$ is the temperature and the random displacement is distributed according to a normal distribution with standard deviation $\sigma$ defined in Eq. 2.14 [AT91].

We stress that if a Langevin probability is to be used for transition probabilities, it is imperative that successive nodes be highly related conformationally. Otherwise, if one tries to take large steps (e.g., large $\Delta t$), the constant external force approximation will not hold and the transition probabilities become irrelevant. This is the result of overextending the Langevin integrator, and it is unclear whether the resulting MSM probabilities will have the desired physical interpretation.

We can now define the transition probability between two nodes as the probability of the random displacement needed,

$$\frac{R\,\Delta t}{m\gamma} = x(\mathrm{node}_j) - x(\mathrm{node}_i) - \frac{F_{\mathrm{ext}}(x(\mathrm{node}_i))\,\Delta t}{m\gamma}, \tag{2.15}$$

$$P_{ij} = \prod_{\alpha} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-[(\Delta x^{R}_{\alpha})^2/2\sigma^2]}, \tag{2.16}$$

where $\alpha$ represents the dimension of the system and $\Delta x^{R}_{\alpha}$ is the $\alpha$ component of the random displacement in Eq. 2.15. We again wish to compute the relative probabilities at two temperatures, so we divide the above equation at the two temperatures,

$$\frac{P_{ij}(T_1)}{P_{ij}(T_2)} = \frac{\prod_{\alpha} \frac{1}{\sigma(T_1)\sqrt{2\pi}}\, e^{-[(\Delta x^{R}_{\alpha}(T_1))^2/2\sigma(T_1)^2]}}{\prod_{\alpha} \frac{1}{\sigma(T_2)\sqrt{2\pi}}\, e^{-[(\Delta x^{R}_{\alpha}(T_2))^2/2\sigma(T_2)^2]}}. \tag{2.17}$$

Assuming that the forces, mass, and friction coefficient are independent of temperature, and substituting for $\sigma(T)$ as defined in Eq. 2.14, we get the following equation:

$$\frac{P_{ij}(T_1)}{P_{ij}(T_2)} = \prod_{\alpha} \sqrt{\frac{T_2}{T_1}}\; e^{-[(\Delta x^{R}_{\alpha})^2 m\gamma/4\Delta t]\,[(1/k_B T_1) - (1/k_B T_2)]}. \tag{2.18}$$

For both the Monte Carlo and Langevin dynamics schemes, we can now reweight MSMs generated from simulation data at one temperature to build an MSM at a different temperature. Since we know the transition probability at the first temperature from the current MSM, we can easily calculate the probability at the new temperature using Eq. 2.10 for Monte Carlo dynamics and Eq. 2.18 for Langevin dynamics. We again normalize all edge probabilities so that for each node, the sum of the outgoing probabilities is one. In this way, we generate an MSM at a different temperature without additional simulations. This analysis can be done on any parameter where it is possible to define the relative transition probabilities in terms of the two parameter values.
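As a concrete illustration of the two reweighting rules, the sketch below applies the Monte Carlo rule (Eq. 2.10) and the inverse of the Langevin ratio (Eq. 2.18) to a single edge. It is a sketch under stated assumptions rather than the code used in this work: the function names are hypothetical, temperatures are passed as the products $k_B T_1$ and $k_B T_2$ (`kT1`, `kT2`), and the per-dimension random displacements $\Delta x^R_\alpha$ are assumed to have been computed beforehand from Eq. 2.15. The reweighted values still have to be renormalized per node afterwards.

```python
import math


def reweight_mc(p_t1, e_i, e_j, kT1, kT2):
    """Reweight one Metropolis Monte Carlo edge from T1 to T2 (Eq. 2.10).

    Assumes the move distribution eta and the node energies do not depend
    on temperature. Downhill (or flat) moves keep their probability.
    """
    delta_e = e_j - e_i
    if delta_e <= 0.0:
        return p_t1
    return p_t1 * math.exp(-delta_e * (1.0 / kT2 - 1.0 / kT1))


def reweight_langevin(p_t1, dx_random, m, gamma, dt, kT1, kT2):
    """Reweight one Langevin edge from T1 to T2 by inverting Eq. 2.18.

    dx_random -- iterable of random displacements, one per dimension
                 (hypothetical input, precomputed from Eq. 2.15)
    """
    ratio = 1.0  # accumulates P_ij(T1) / P_ij(T2)
    for dx in dx_random:
        exponent = -(dx * dx * m * gamma / (4.0 * dt)) * (1.0 / kT1 - 1.0 / kT2)
        ratio *= math.sqrt(kT2 / kT1) * math.exp(exponent)
    return p_t1 / ratio
```

Because $\sqrt{T_2/T_1} = \sqrt{k_B T_2 / k_B T_1}$, working with $k_B T$ directly avoids having to fix a unit system for $k_B$.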

Mean first passage time and $P_{\mathrm{fold}}$ calculation

The MSM consists of a set of nodes and a set of transitions or edges between these nodes. Each edge has a probability associated with it as well as the time taken to traverse that edge. One can define the $P_{\mathrm{fold}}$ of a node as the probability that a particle started at that node would reach the final state before reaching the initial state [DPG+98]. $P_{\mathrm{fold}}$ values have been shown to be useful in understanding the nature of the folding pathway in simplified [DPG+98, PGTR98] and atomistic [PR99, LS01, GC02] models. Typically, one calculates $P_{\mathrm{fold}}$ values by running multiple simulations (differing by random number seeds or initial velocities) and recording the fraction that fold before they unfold. While this is computationally tractable (compared with a full folding simulation starting from the unfolded state) and naturally parallelizable on massively-parallel or grid-computing architectures, it can still be a demanding computational task, especially if the $P_{\mathrm{fold}}$ values for many conformations are sought.

Following Apaydin et al. [ABG+02], we will use the MSM to calculate $P_{\mathrm{fold}}$ values. The $P_{\mathrm{fold}}$ can be defined conditionally based on the first transition made from the node,

$$P_{\mathrm{fold}}(\mathrm{node}_i) = \sum_j P(\mathrm{transition}(i \to j))\, P_{\mathrm{fold}}(\mathrm{node}_i \mid \mathrm{transition}(i \to j)), \tag{2.19}$$

where the sum is over all possible transitions from $\mathrm{node}_i$. The possible transitions must be mutually exclusive and the sum of their probabilities must be one. The probabilities of the transitions from $\mathrm{node}_i$, $P(\mathrm{transition}(i \to j))$, are simply the $P_{ij}$ values defined previously. By the post-processing step, these probabilities satisfy the above condition.
$P_{\mathrm{fold}}(\mathrm{node}_i \mid \mathrm{transition}(i \to j))$ is simply the $P_{\mathrm{fold}}$ of $\mathrm{node}_j$, which results in the following set of equations:

$$\begin{aligned} P_{\mathrm{fold}}(\mathrm{node}_i) &= \sum_j P_{ij}\, P_{\mathrm{fold}}(\mathrm{node}_j), \\ P_{\mathrm{fold}}(\mathrm{node}_i) &= 1, \quad \mathrm{node}_i \in F, \\ P_{\mathrm{fold}}(\mathrm{node}_i) &= 0, \quad \mathrm{node}_i \in I, \end{aligned} \tag{2.20}$$

where $I$ is the initial region and $F$ is the final region. This definition results in a system of $n$ equations, one for each of the $n$ nodes in the system, in $n$ unknowns, the $P_{\mathrm{fold}}(\mathrm{node}_i)$ variables. This system of equations can be solved by iteration as follows:

$$\begin{aligned} \text{Initialize:} \quad & P^{0}_{\mathrm{fold}}(\mathrm{node}_i) = 1, \quad \mathrm{node}_i \in F, \\ & P^{0}_{\mathrm{fold}}(\mathrm{node}_i) = 0, \quad \mathrm{node}_i \notin F; \\ \text{Iterate:} \quad & P^{t+1}_{\mathrm{fold}}(\mathrm{node}_i) = \sum_j P_{ij}\, P^{t}_{\mathrm{fold}}(\mathrm{node}_j), \end{aligned} \tag{2.21}$$

until each $P_{\mathrm{fold}}$ converges. This iteration method is known as Jacobi iteration [GvL96]. Instead of always using the $P_{\mathrm{fold}}$ values from the previous iteration, one can use the new values from the current iteration as soon as they become available. This results in the following iterative scheme known as Gauss-Seidel iteration, which converges twice as fast as the Jacobi method [GvL96],

$$P^{t+1}_{\mathrm{fold}}(\mathrm{node}_i) = \sum_{j \ge i} P_{ij}\, P^{t}_{\mathrm{fold}}(\mathrm{node}_j) + \sum_{j < i} P_{ij}\, P^{t+1}_{\mathrm{fold}}(\mathrm{node}_j). \tag{2.22}$$

Analogously, we can also get rate information from the MSM. Indeed, rates are a primary means of comparison to experiment and are thus a critically important quantity to calculate in order to experimentally validate any folding simulation. Rates have not been previously calculated from a roadmap-type representation of states. Below we present a natural generalization of the method to calculate $P_{\mathrm{fold}}$ values for the calculation of rates in an efficient and precise manner.

One can define the mean first passage time (MFPT) of any node as the average time taken to get from that node to any node in the final state. The MFPT can be defined conditionally based on the first transition made from the node,

$$\mathrm{MFPT}(\mathrm{node}_i) = \sum_j P(\mathrm{transition}(i \to j))\, \mathrm{MFPT}(\mathrm{node}_i \mid \mathrm{transition}(i \to j)), \tag{2.23}$$

where the sum is over all possible transitions from $\mathrm{node}_i$. The MFPT of $\mathrm{node}_i$ given that a transition to $\mathrm{node}_j$ was made is the time it took to get from $\mathrm{node}_i$ to $\mathrm{node}_j$ plus the MFPT from $\mathrm{node}_j$. This leads to the following equation for the MFPT:

$$\mathrm{MFPT}(\mathrm{node}_i) = \sum_j P_{ij}\, (\mathrm{time}_{ij} + \mathrm{MFPT}(\mathrm{node}_j)). \tag{2.24}$$

In addition, we define the boundary condition

$$\mathrm{MFPT}(\mathrm{node}_i) = 0, \quad \mathrm{node}_i \in F. \tag{2.25}$$
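The two linear systems (Eqs. 2.20 and 2.24 to 2.25) can be iterated with a few lines of code. The sketch below uses in-place (Gauss-Seidel style) updates. It is illustrative only: the function names, the dictionary layout keyed by `(i, j)`, the convergence tolerance, and the choice to start the MFPT iteration from zero everywhere are assumptions of this example, and it presumes that every node can reach the final region.

```python
from collections import defaultdict


def _outgoing(prob):
    """Group normalized edge probabilities by source node."""
    out = defaultdict(list)
    for (i, j), p in prob.items():
        out[i].append((j, p))
    return out


def p_fold(prob, nodes, initial, final, tol=1e-12, max_iter=100000):
    """Iterate Eq. 2.20 with P_fold fixed at 1 in F and 0 in I."""
    out = _outgoing(prob)
    pf = {n: (1.0 if n in final else 0.0) for n in nodes}
    for _ in range(max_iter):
        change = 0.0
        for n in nodes:
            if n in final or n in initial:
                continue  # boundary values stay fixed
            new = sum(p * pf[j] for j, p in out[n])
            change = max(change, abs(new - pf[n]))
            pf[n] = new  # updated value is reused immediately
        if change < tol:
            break
    return pf


def mfpt(prob, time, nodes, final, tol=1e-12, max_iter=100000):
    """Iterate Eq. 2.24 with MFPT fixed at 0 in F (Eq. 2.25)."""
    out = _outgoing(prob)
    t = {n: 0.0 for n in nodes}  # assumed starting values
    for _ in range(max_iter):
        change = 0.0
        for n in nodes:
            if n in final:
                continue
            new = sum(p * (time[(n, j)] + t[j]) for j, p in out[n])
            change = max(change, abs(new - t[n]))
            t[n] = new
        if change < tol:
            break
    return t
```

On a four-node chain 0, 1, 2, 3 with node 0 initial, node 3 final, unit edge times, and equal probabilities of stepping left or right from the interior nodes, the iteration gives $P_{\mathrm{fold}}$ values of 1/3 and 2/3 for nodes 1 and 2 and an MFPT of 9 from node 0.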

This system of linear equations can be iterated in the same way as above except that the initial values for the system should be

$$\begin{aligned} \mathrm{MFPT}^{0}(\mathrm{node}_i) &= 0, & \mathrm{node}_i &\in F, \\ \mathrm{MFPT}^{0}(\mathrm{node}_i) &= \ , & \mathrm{node}_i &\notin F. \end{aligned} \tag{2.26}$$

2.3 Results

Model system

We first test the methods outlined above on a simple, two-dimensional model system. Due to their tractability, such model systems are useful for demonstrating the benefits of the proposed method. Our model system is defined by an energy potential of

$$E(x, y) = \frac{4(1 - x^2 - y^2)^2 + 2(x^2 - 2)^2 + ((x + y)^2 - 1)^2 + ((x - y)^2 - 1)^2 - 2}{6} \tag{2.27}$$

and has been used previously to test transition path sampling methods [DBCC98]. The initial and final regions were defined by circles centered at $(-1, 0)$ with a radius of 0.2 and $(1, 0)$ with a radius of 0.3, respectively. A contour graph of this energy landscape is shown in Fig. 2.4.

Since this model system is computationally tractable, we can directly compare our proposed methods to direct, brute force simulations of the kinetics. In particular, we will compare the two kinetics methods described in the section on reweighting of edges: Monte Carlo and Langevin dynamics. We performed 10,000 Monte Carlo simulations for each temperature ranging from 0.1 to 1.0, at intervals of 0.1. The move set was defined in each dimension as a normal distribution centered on the current point,

$$\eta(x_\alpha, x'_\alpha) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-[(x_\alpha - x'_\alpha)^2/2\sigma^2]}. \tag{2.28}$$

The standard deviation was defined according to the distance the particle is expected to travel because of diffusion, σ = D t 2, (2.29) where $\Delta t$ is the time step and was defined as ps and $D$ is the diffusion constant and equals 91.0 ps$^{-1}$.

In addition, we also ran 10,000 Langevin simulations for each temperature ranging from 0.2 to 1.0, at intervals of 0.1. The forces, $F_{\mathrm{ext}}$, were defined as the gradient of the energy potential given in Eq. 2.27, the mass $m$ was defined to be 1, and the viscosity $\gamma$ was defined as The time step $\Delta t$ for these simulations was 1.
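For readers who want to experiment with the model landscape, the sketch below evaluates the potential of Eq. 2.27, tests membership in the two circular regions, and performs a single Metropolis move with a Gaussian move set as in Eq. 2.28. It is a minimal sketch, not the simulation code used here: the function names are hypothetical, the temperature is passed as the product $k_B T$ (`kT`), and the move width `sigma` is left as a free parameter.

```python
import math
import random


def energy(x, y):
    """Model potential E(x, y) of Eq. 2.27 (arbitrary energy units)."""
    return (4.0 * (1.0 - x * x - y * y) ** 2
            + 2.0 * (x * x - 2.0) ** 2
            + ((x + y) ** 2 - 1.0) ** 2
            + ((x - y) ** 2 - 1.0) ** 2
            - 2.0) / 6.0


def in_initial(x, y):
    """Circle of radius 0.2 centered at (-1, 0)."""
    return math.hypot(x + 1.0, y) <= 0.2


def in_final(x, y):
    """Circle of radius 0.3 centered at (1, 0)."""
    return math.hypot(x - 1.0, y) <= 0.3


def metropolis_step(x, y, sigma, kT, rng):
    """Propose a Gaussian move in each dimension and accept it per Eq. 2.7."""
    nx, ny = rng.gauss(x, sigma), rng.gauss(y, sigma)
    delta_e = energy(nx, ny) - energy(x, y)
    if delta_e <= 0.0 or rng.random() < math.exp(-delta_e / kT):
        return nx, ny
    return x, y
```

The potential is zero at the centers of both regions, equals 1 at $(0, \pm 1)$, and equals 2 at the origin, which matches the valley and hill heights quoted for the contour graph.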

Figure 2.4: Contour graph of the potential energy, $E(x, y)$, of the model energy landscape. The initial and final regions are represented by the circles labeled I and F respectively. The energy difference between the stable regions and the valley in between them is approximately 1 and between the stable regions and the hill in between them is approximately 2 (energy in arbitrary units).

For each temperature and type of simulation, five sets of 10,000 independent simulations were started from the initial state, and the time at which they reached the final state was recorded. The initial point in each simulation was sampled randomly from points on the border of the initial region. The mean first passage time for each set was calculated from these 10,000 trials.

MSM generation

To test Monte Carlo kinetics, MSMs were generated on the model energy landscape at temperatures ranging from 0.1 to 1.0, at intervals of 0.1. The time step, $\Delta t$, was and the interval at which points on the paths were recorded, $\tau$, was Each shooting step was stopped if neither the initial nor final regions were reached after a time of 1.0. Four independent MSMs were generated at each temperature, and each MSM consisted of 10,000 attempted shooting moves. In addition, 50 paths were sampled from the initial state for each MSM. All points in either the initial or final regions were clustered together. For the remaining points, the distance metric chosen was Euclidean

distance and the clustering cutoff for each simulation was $\sigma$ 5, where $\sigma$ is the standard deviation of the normal distribution from which the moves were selected, as defined in Eq. 2.29. Points were clustered hierarchically with average-link clustering, in which the distance between two clusters is equal to the average distance from any member of one cluster to any member of the other cluster. After clustering, any points that could not reach the final state were deleted.

Analogously, Langevin dynamics was examined by generating MSMs at temperatures ranging from 0.2 to 1.0, at intervals of 0.1. The time step was 1.0 and the interval at which points on the paths were recorded was Each shooting step was stopped if neither the initial nor final regions were reached after a time of 10,000. Five MSMs were generated at each temperature, and each MSM consisted of 10,000 attempted shooting moves. In addition, 50 paths were sampled from the initial state for each MSM. Clustering was as above with the clustering cutoff as $\sigma$ 1.5, where $\sigma$ is the standard deviation of the normal distribution from which the random component of each move was selected, as defined in Eq. 2.14.

$P_{\mathrm{fold}}$ comparison

For one MC and one Langevin MSM at each temperature, the $P_{\mathrm{fold}}$ values were calculated for every node. Since it would be too time consuming to compute all $P_{\mathrm{fold}}$ values from many direct simulations, about nodes were chosen at random from each MSM for comparison. 10,000 MC or Langevin simulations were started at the given temperature from each of these coordinates to compute its $P_{\mathrm{fold}}$ value directly. Figure 2.5 shows the $P_{\mathrm{fold}}$ values calculated by many direct simulations compared to those calculated from the MSMs for both simulation types. The correlation coefficient between the direct MC values and MSM values is over all temperature values.
For each individual MSM at a given temperature, the correlation coefficient ranges from to The correlation coefficient between the direct Langevin values and MSM values is over all temperatures. The correlation coefficient ranges from to for each MSM at a given temperature. This shows excellent agreement of $P_{\mathrm{fold}}$ values over a wide range of temperatures for both simulation types.

MFPT comparison

In addition to being able to calculate $P_{\mathrm{fold}}$ values at every node, the use of simulation data allows us to estimate the transition times between nodes, and therefore to estimate the MFPT from the initial state. We can compare the MFPTs calculated from the MSMs with the MFPTs we calculated

directly from many MC or Langevin simulations (Fig. 2.6). The MFPTs calculated from the MSMs agree well with those calculated from direct simulations, although the variance among MSM simulations is greater for all temperatures in the MC simulations and for high temperatures in the Langevin simulations.

Figure 2.5: The correlation between $P_{\mathrm{fold}}$ values calculated directly from many simulations and MSM simulations on the model energy landscape. The left graph shows the comparison for Monte Carlo simulations and the right one shows the same comparison for Langevin simulations.

MFPT from reweighting of edges

We also tested how well our formulation for the reweighting of MSM edges based on temperature was able to predict MFPTs at the new temperatures. For both MC and Langevin dynamics, five additional MSMs were generated at temperatures of 0.2, 0.6, and 1.0. The edges on these MSMs were reweighted to give MSMs at temperatures of 0.2 to 1.0 at an interval of 0.1. The MFPTs calculated from these reweighted MSMs were then compared with those from the direct simulations (Fig. 2.7). For the MC simulations, the MFPT calculated from the reweighted MSMs generated at all three temperatures agrees reasonably well over the entire temperature range. The MSMs generated at a temperature of 0.2 show a systematic overestimation of the MFPT when reweighted to high temperatures. For the Langevin simulations, the MFPT calculated from the reweighted MSMs generated at

temperatures of 0.6 and 1.0 also agreed well over the entire temperature range. However, the MFPT calculated from the MSMs generated at a temperature of 0.2 greatly overestimated the MFPT for higher temperatures.

Figure 2.6: The comparison between the MFPT calculated directly from many simulations and from the MSM simulations as a function of temperature. The left graph shows the result for Monte Carlo simulations and the right one shows the result for Langevin simulations.

When generating an MSM at lower temperatures, we may not be sampling the transitions relevant at the higher temperatures. We examined this possibility by looking at the shortest possible path between the initial and final regions in the MSMs generated at different temperatures. For the Monte Carlo simulations, the MSMs at temperatures of 0.6 and 0.2 showed an increase over the shortest path length of the MSM at a temperature of 1.0 of 40% and 80% respectively. For the Langevin simulations, the MSMs at temperatures of 0.6 and 0.2 showed an increase over the shortest path length of the MSM at a temperature of 1.0 of 90% and 300% respectively. These differences may account for the low-temperature MSMs' inability to scale, since we never sample the faster transitions.

To estimate the error in the reweighted MSMs compared to the direct simulations and to MSMs generated individually at each temperature, we compare the MFPT standard deviations at different temperatures for the various methods (Fig. 2.8). Specifically, we examine the standard deviation relative to the average MFPT at each temperature calculated from the five direct simulations, from the four MSMs generated individually at that temperature, and from the five reweighted MSMs generated at temperatures of 0.2, 0.6, and 1.0 and reweighted to that temperature.

Figure 2.7: The comparison between the MFPT calculated from many simulations and the MFPT calculated from reweighted versions of a single MSM as a function of temperature. The square, diamond, and circle points represent the MSMs generated at temperatures of 0.2, 0.6, and 1.0 respectively. The cross points are from the direct MFPT calculations. The left graph shows the results for Monte Carlo simulations and the right graph shows the results for Langevin simulations.

For both the MC and Langevin simulations, the reweighted MSMs have essentially the same error as the MSMs generated individually at each temperature, except for the Langevin MSM generated at a temperature of 0.2. At low temperatures, the MFPT calculated directly from simulations has a percent standard deviation which is approximately half that of the MSMs. However, at higher temperatures, the error of the MFPT calculated directly from simulations is only slightly lower than that of the MSMs for MC simulations and is higher for Langevin simulations.

Trpzip2 kinetics

In addition to the model energy landscape, we applied our methods above to a small protein, the 12-residue tryptophan zipper β-hairpin, trpzip2 [CSS01]. Trpzip2 has previously been simulated on Folding@Home [SP00]. Our goal here is to use these trajectories from Folding@Home to build an MSM to further study the folding of trpzip2. This is a much more challenging test of our methods than the simple two-dimensional example above. If successful, we propose that MSM-based methods would allow one to extend the Folding@Home distributed computing methods to examine the folding of slower and more complex proteins. Indeed, MSM methods combined with

sampling may also be able to tackle some fundamental issues in the simulation of protein folding, especially that of proteins with non-single-exponential folding kinetics [Fer02]. The results for trpzip2 presented here are intended as a proof-of-concept application of our methods to fully atomistic simulation.

Figure 2.8: Error analysis of direct simulations and the various MSM techniques. The graphs show the standard deviation relative to the average direct simulation value at each temperature, divided by that average value. The left graph shows the results for Monte Carlo simulations and the right graph shows the results for Langevin simulations.

Using Folding@Home [SP00], trpzip2 folding has been simulated using the OPLSaa all-atom parameter set [JMTR96] and the generalized Born/surface area implicit solvent model [QSHS97] at a temperature of 296 K. Trajectories were started from an extended conformation and ranged in length from 10 nanoseconds to 450 nanoseconds. To define the initial and final regions, we used a combination of alpha carbon root mean square distance (RMSD) to the native state (pdb code 1LE1 [CSS01]) and hydrogen bond and trp-trp distances. In particular, we track the following set of interatomic distances, where the number indicates the residue, n indicates the backbone amide nitrogen, o indicates the backbone carbonyl, and w indicates the CD2 side chain atom:

$$\begin{aligned} d_1 &= (\text{n3–o10}) + (\text{o3–n10}) + (\text{n5–o8}) + (\text{o5–n8}) + (\text{w2–w11}) + (\text{w2–w9}) + (\text{w9–w4}), \\ d_2 &= d_1 + (\text{n1–o12}) + (\text{o1–n12}). \end{aligned} \tag{2.30}$$

The initial region was defined as any configuration that had (RMSD 2.5 or RMSD d ) and (RMSD d 1 9.5). The final region was defined as any configuration that had (RMSD < 2.5) and (RMSD d 1 < 7.75) and (d 2 < 45). These cutoffs follow Snow et al., except for the dependence on $d_2$, which was added to discriminate between native states and a set of frayed native-like states. For more on the simulation details, see Snow et al. [SQD+04].

To generate the MSM, we chose a tenth of this data set at random, resulting in 1,750 independent trajectories. Of these trajectories, 14 reached the final folded state. Frames from the non-folding trajectories were selected every 10 ns and frames from the folding trajectories were selected every 250 ps. This was done so that there would be more representative conformations in the transition and final states, while still allowing the number of nodes to stay manageable. As discussed in the MSM generation section, because the edges contain the time taken to traverse them, multiresolution data can be accommodated. This selection of data resulted in a total of approximately 22,400 nodes.

The distance metric for clustering was defined as the root mean square deviation of the inter-heavy-atom distance matrix for two conformations. The clustering was performed hierarchically using the average-link distance as the distance between two clusters. After clustering, nodes which could not reach the final state were merged into their nearest neighbor.

The usefulness of an MSM depends upon the type of ensemble used for its construction (similar to the concept of a basis set). Here, the underlying ensemble is fairly one-sided, has relatively few transitions, and is not at equilibrium. Specifically, very few trajectories reached the final state and even fewer unfolded after having folded, so the set of transitions to the initial state was not very well sampled.
Accordingly, any $P_{\mathrm{fold}}$ values calculated would have been biased, since the $P_{\mathrm{fold}}$ measures the percentage of trajectories that reach the final state before reaching the initial state. On the other hand, a good estimate of the MFPT was possible since it measures the average time taken for a particle to reach the final state having started in the initial state, which is exactly what our data set represents. We examined a wide range of clustering cutoffs for both the initial and transitional regions to estimate the MFPT.

To compare the effects of the clustering cutoff, we also performed a similar experiment on the model energy landscape. We ran 500 MC simulations started from the initial state at a temperature of 0.6 with a time step of , total time of 0.1, and recording points every Of these trajectories, 135 reached the final state. Again, we varied the clustering cutoffs in the initial and transitional regions (Fig. 2.9). Holding the transitional region cutoff constant, increasing the initial clustering cutoff causes

no change in the MFPT in the model system. For the trpzip2 data, increasing the initial region clustering cutoff causes the MFPT to increase until a cutoff of 2 Å and then plateau. One reason the trpzip2 data shows an initial increase in MFPT that is not seen in the model system data is that the trpzip2 conformations are in a much higher dimensional space. If we sample equally in these two spaces, we expect the points in the higher dimensional space to be farther apart. After clustering the two dimensional data to , there are 87% fewer nodes. In comparison, after clustering the initial region of the trpzip2 data to 2 Å, there are only 76% fewer nodes. There is an increase in the model system data, but only when the transitional cutoff is zero.

Figure 2.9: The effect of clustering cutoff on the calculated MFPT for the model system and trpzip2 peptide. One axis shows the cutoff in the initial region and the other shows the cutoff in the transitional region. The vertical axis shows the MFPT at each point. The graph on the left is from the model energy landscape and the graph on the right is for the trpzip2 protein.

Holding the initial region cutoff constant and increasing the transitional region cutoff causes the MFPT to decrease in both systems and for all values of the initial cutoff. In the transitional region, we expect that the molecule will go through a series of sequential points on the way to the final state. The rationale behind clustering is that we can merge together points which are similar by some metric, thus assuming that transitions into or from one of the points are equally likely to go to or come from the other point. This does not apply for points which are sequential along a pathway. Therefore, merging these points causes the MFPT to decrease since we are essentially shortening the length of the path.
Because we do not expect any sequential patterns within states in the initial region, increasing the clustering cutoff within this area does not decrease the MFPT as it does in the transitional region.

Over reasonable ranges of cutoffs for the initial (> 2 Å) and transitional (1–2.5 Å) regions, we can estimate the MFPT as 2–9 microseconds. This estimate agrees well with experimental results of 1.8 ± 0.01 µs from fluorescence and 2.47 ± 0.05 µs from IR [SQD+04]. This estimate also agrees well with previous analysis of this simulation data, which gave 8 µs by fitting the rate directly for the full data set and 4.5 µs for the random one-tenth sample used in the MSM analysis.

2.4 Discussion and conclusions

We have introduced new computational tools for efficiently analyzing the data collected from Monte Carlo or molecular dynamics simulations. These methods capture the probabilistic and time dependent nature of the kinetics in a compact Markovian state model representation which can easily be analyzed for properties such as the $P_{\mathrm{fold}}$ and MFPT of every node. The $P_{\mathrm{fold}}$ values calculated for a wide range of temperatures for both Monte Carlo and Langevin simulation types show excellent agreement with those calculated by brute force.

One area of improvement for this method is that while the MSM values give the same average MFPT as the MFPT calculated from direct simulations, the variance is much higher. This is probably because the transition path sampling simulations used in building the MSM are somewhat dependent on the initial path chosen. The only way in which edges can lead from the initial state is either from the initial path or from the sampling of the initial state. Edges resulting from the shooting algorithm all lead into the initial state. One could fix this problem by having many initial paths. However, in a protein example, folding trajectories may be very difficult to generate beforehand. Another way to fix this problem and achieve more precise results would be to include shooting paths which move backwards in time, thus allowing for more edges leading from the initial state.
Monte Carlo and Langevin dynamics cannot be run backward in time because the velocity is not maintained between steps. If one were to use some other molecular dynamics simulation system that maintained velocity, then the trajectories could be run backward in time and this problem could be averted.

In addition to the algorithms necessary to create the MSM from simulation data, we have also described methods for reweighting the edges of the MSM to analyze the system at different parameter values. In particular, we provided transformations for the edge weights at different temperatures, given that the simulations were from either MC or Langevin dynamics. These methods show promise since we can analyze the system at many different parameter values without the need for any additional simulations. The reweighting of the MSMs seemed to work well in general and gave results with errors similar to those of MSMs generated individually at each temperature. The one exception

was the Langevin MSMs generated at a temperature of 0.2. These MSMs did not give accurate results when rescaled for temperatures greater than 0.3. One reason for this may be that at the low temperature, the system did not have enough samples of the relevant transitions to scale to higher temperatures. It may be necessary to generate composite MSMs consisting of data from many different temperatures in order to properly rescale to a wider range of temperatures.

This method was then applied to existing folding simulation data from a small β-hairpin protein. We were able to calculate folding rates which were in reasonable agreement with experimental data and previous analysis of the simulation data.

The majority of time in these simulations is spent in the clustering step. Currently, we compute the full $n^2$ distance matrix between all nodes in the MSM, a very expensive calculation. Better clustering algorithms can reduce this computation time.

Chapter 3 Automatic state decomposition

In the previous chapter, we described a method for modeling the conformational dynamics of biological macromolecules over long time scales: a discrete-state Markovian state model, which can be built from short molecular dynamics simulations. To construct useful models that faithfully represent dynamics at the time scales of interest, it is necessary to decompose configuration space into a set of kinetically metastable states. Previous attempts to define these states have relied upon either prior knowledge of the slow degrees of freedom or on the application of conformational clustering techniques, as in Chapter 2, which assume that conformationally distinct clusters are also kinetically distinct. In this chapter, we present a first version of an automatic algorithm for the discovery of kinetically metastable states that is generally applicable to solvated macromolecules. Given molecular dynamics trajectories initiated from a well-defined starting distribution, the algorithm discovers long-lived, kinetically metastable states through successive iterations of partitioning and aggregating conformation space into kinetically related regions. We apply this method to three peptides in explicit solvent (terminally blocked alanine, the 21-residue helical $F_s$ peptide, and the engineered 12-residue β-hairpin trpzip2) to assess its ability to generate physically meaningful states and faithful kinetic models.

3.1 Introduction

Many biomolecular processes are fundamentally dynamic in nature. Protein folding, for example, involves the ordering of a polypeptide chain into a particular topology over the course of microseconds to seconds, a process which can go awry and lead to misfolding or aggregation, causing disease [Dob03]. Enzymatic catalysis may involve transitions between multiple conformational substates,

only some of which may allow substrate access or catalysis [EBAK02, YR06, BMDW06]. Posttranslational modification events, ligand binding, or catalytic events may alter the transition kinetics among multiple conformational states by modulating catalytic function, allowing work to be performed, or transducing a signal through allosteric change [FMA+01, CE05, MMGD06]. A purely static description of these processes is insufficient for mechanistic understanding; the dynamical nature of these events must be accounted for as well.

Unfortunately, these processes may involve molecular time scales of microseconds or longer, placing them well outside the range of typical detailed atomistic simulations employing explicit models of solvent. However, due to the presence of many energetic barriers on the order of the thermal energy, the uncertainty in initial microscopic conditions, and the stochasticity introduced into the system by the surrounding solvent in contact with a heat bath, any suitable description of conformational dynamics must by necessity be statistical in nature. This has motivated the development of stochastic kinetic models of macromolecular dynamics which might conceivably be constructed from short dynamics simulations, yet provide a useful and accurate statistical description of dynamical evolution over long times.

Several approaches have been used to construct these models. Transition interface sampling (TIS) [MvEB04], milestoning [FE04], and methods based on commitment probability distributions [RP05, BS05] describe dynamics on a one-dimensional reaction coordinate, but can be applied only if an appropriate reaction coordinate can be identified such that relaxation transverse to this coordinate is fast compared to diffusion along it.
Discrete-state, continuous-time master equation models, characterized by a matrix of phenomenological rate constants describing the rate of interconversion between states [vk97], can be constructed by identifying local potential energy minima as states and estimating interstate transition rates by transition state theory [CE90, KB95, BB98, LJB01, MW01, MEW02, EW04]. Unfortunately, the number of minima, and hence the number of states, grows exponentially with system size, making the procedure prohibitively expensive for larger proteins or systems containing explicit solvent molecules. Others have suggested that stochastic models of dynamics can be constructed by expansion of the appropriate dynamical operator in a basis set [Sha96, US98, SF03], but this approach appears to be limited by the great difficulty of choosing rapidly convergent basis sets for large molecules, a process that is not fundamentally different from identifying the slow degrees of freedom.

Instead, much work has focused on the construction of discrete- or continuous-time Markov models to describe dynamics among a small number of states which may each contain many minima within large regions of configuration space [GT94, dgdmg01, SPS+04b, SSP04, AFGL05, SP05c,

SKH05, SHCT05, SP05a, EPP05b, PP06]. In these models, it is hoped that a separation of time scales between fast intrastate motion and slow interstate motion allows the statistical dynamics to be modeled by stochastic transitions among the discrete set of metastable conformational states governed by first-order kinetics. Consider, for example, the isomerization of butane, which has three main metastable conformational states (gauche-plus, gauche-minus, and trans). At sufficiently low temperature, dynamics is dominated by long dwell times within each of these three states, punctuated by infrequent transitions between them. The slow interstate transition process is well described by first-order reaction kinetics for observation intervals longer than the fast molecular relaxation time for intrastate dynamics due to the presence of a separation of time scales [Cha78]. Such a separation of time scales would be a natural consequence of the widely held belief that the nature of the energy landscape of biomacromolecules is hierarchical [ABB+85, BF89, BK97, LJB01, LJB02].

If the system reaches local equilibrium within the state before attempting to exit, the probability of transitioning to any other state will be independent of all but the current state. This allows the process to be modeled with either a discrete-time Markov chain (e.g. Ref. [SSP04]) or a continuous-time master equation model with coarse-grained time (e.g. Ref. [SKH05]). In either model, processes occurring on time scales faster than the time to reach equilibrium within each state cannot be resolved.

Markov models embody a concise description of the various kinetic pathways and their relative likelihood, facilitating comparison with experimental data and providing a powerful tool for mechanistic insight.
Once the model is constructed and the time scale for Markovian behavior determined, it can be used to compute the stochastic temporal evolution of either a single macromolecule or a population of noninteracting macromolecules, allowing direct comparison of simulated and experimental observables for both single-molecule and ensemble kinetics experiments. In addition, useful properties difficult to access experimentally, such as state lifetimes [SPS04a], relaxation from experimentally inaccessible prepared states [CSPD06b], mean first passage times [SSP04], the existence of hidden intermediates [ODB02], and $P_{\mathrm{fold}}$ values or transmission coefficients [LZSP04], can easily be obtained. This allows for both a thorough understanding of mechanism and the generation of new, experimentally testable hypotheses.

To build such a model, it is necessary to decompose configuration space into an appropriate set of metastable states. If the low-dimensional manifold containing all the slow degrees of freedom is known a priori, then this can be partitioned into free energy basins to define the states, such as by examination of the potential of mean force [SPS+04b, SKH05, SP05c, EPP05b, CSPD06b]. In

the absence of this knowledge, others have turned to conformational clustering techniques to identify conformationally distinct regions which may also be kinetically distinct [KTB93, dgdmg01, SSP04, AFGL05]. In this chapter, we adopt a strategy first suggested for the discovery of metastable states in molecular systems by researchers at the Konrad-Zuse-Zentrum für Informationstechnik [SFHD99]. The principal idea is this: If configuration space could be decomposed into a large number of small cells, the probability of transitioning between these cells in a fixed evolution time could be measured. This probability is a measure of kinetic connectivity among the cells, which allows the identification of aggregates of these cells that approximate true metastable states [SH02]. Unfortunately, the choice of how to divide configuration space into cells is not straightforward. Suppose one is to consider the analysis of some fixed amount of simulation data. If configuration space is decomposed very finely, the boundaries between metastable states can in principle be well-approximated, but the estimated cell-to-cell transition probabilities will become statistically unreliable. On the other hand, if configuration space is decomposed too coarsely, the transition probabilities may be well-determined, but the boundaries between metastable states cannot be clearly resolved, potentially disrupting or destroying the Markovian behavior of interstate dynamics. An optimal choice would ultimately require knowledge of the metastable regions in order to determine the best decomposition of space into cells. We propose an iterative procedure to determine both the choice of cells and their aggregates to approximate the desired metastable states.
We use a conformational clustering method to carve configuration space into an initial crude set of cells (splitting), and a Monte Carlo simulated annealing procedure to collect metastable collections of cells into states (lumping). This cycle is repeated, with the splitting procedure now applied individually to each state to generate a new set of cells, and the lumping procedure applied to the entire set of cells to redefine states until further application of this procedure leaves the approximations to metastable states unchanged. This procedure allows state boundaries to be iteratively refined, as regions that mistakenly have been included in one state can be split off and regrouped with the proper state. Throughout this process, we require that the cells never become so small that estimation of the relevant transition matrix elements is statistically unreliable. Our proposed method is efficient, of O(N) complexity in the number of stored configurations, and can easily be parallelized. This chapter is organized as follows: In Section 3.2, we give an overview of the Markov chain model and its construction, elaborate on desirable properties of an algorithm to partition configuration space into states, and outline the principles underlying the algorithm we present here. In

Section 3.3, we provide a detailed description of the automatic state decomposition algorithm and its implementation. In Section 3.4, we apply this algorithm to three model peptide systems in explicit solvent to assess its performance: alanine dipeptide, the 21-residue $F_s$ helix-forming peptide, and the 12-residue engineered trpzip2 hairpin. Finally, in Section 3.5, we discuss the advantages and shortcomings of our algorithm, with the hope that future state decomposition algorithms can address the remaining challenges.

3.2 Theory

Some discussion of the stochastic model of kinetics considered here and the theory underlying the method is appropriate before describing the algorithmic implementation in detail. The actual implementation of the algorithm used here is described in detail in Section 3.3.

Markov chain and master equation models of conformational dynamics

Consider the dynamics of a macromolecule immersed in solvent, where the solvent is at equilibrium at some particular temperature of interest. We presume that all of configuration space has already been decomposed into a set of nonoverlapping regions, or states, which together form a complete decomposition of configuration space. The method by which these states are identified is described in subsequent sections. If we observe the evolution of this system at times t = 0, τ, 2τ, ..., where τ denotes the observation interval, we can represent this sequence of observations in terms of the state the system visits at each of these discrete times. The sequence of states produced is a realization of a discrete-time stochastic process. For this process to be described by a Markov chain, it must satisfy the Markov property, whereby the probability of observing the system in any state in the sequence is independent of all but the previous state.
For a stationary process on a finite set of L states, this process can be completely characterized by an $L \times L$ transition matrix T(τ) dependent only on the observation interval, or lag time, τ. (Footnote 1: In this chapter, we adopt the notation for a column-stochastic transition matrix, in which the columns sum to unity. This differs from the notation in some previously cited references and the other chapters, which use a row-stochastic transition matrix, equal to the transpose of the column-stochastic matrix used here.) The element $T_{ji}(\tau)$ denotes the probability of observing the system in state j at time t given that it was previously in state i at time $t - \tau$. If this process satisfies detailed balance (which we will assume to be the case for physical systems of the sort we consider here [vk97]) we

additionally have the requirement

$$T_{ji}\, p_{\mathrm{eq},i} = T_{ij}\, p_{\mathrm{eq},j}, \tag{3.1}$$

where $p_{\mathrm{eq},i}$ denotes the equilibrium probability of state i. The vector of probabilities of occupying any of the L states at time t (here also referred to as the vector of state populations, such as in an experiment involving a population of noninteracting macromolecules) can be written as p(t). If the initial probability vector is given by p(0), we can write the probability vector at some later time t = nτ as

$$\mathbf{p}(n\tau) = \mathbf{T}(n\tau)\,\mathbf{p}(0) = [\mathbf{T}(\tau)]^{n}\,\mathbf{p}(0). \tag{3.2}$$

This is a form of the Chapman-Kolmogorov equation.

Alternatively, the process can be characterized in continuous time by a matrix of phenomenological rate constants K, where the element $K_{ji}$, $j \neq i$, denotes the nonnegative phenomenological rate from state i to state j. The diagonal elements are determined by $K_{ii} = -\sum_{j \neq i} K_{ji}$ to ensure the columns sum to zero so as to conserve probability mass. Time evolution is then governed by the equation

$$\dot{\mathbf{p}}(t) = \mathbf{K}\,\mathbf{p}(t), \tag{3.3}$$

where the dot represents differentiation with respect to time. This evolution equation has formal solution

$$\mathbf{p}(t) = e^{\mathbf{K}t}\,\mathbf{p}(0), \tag{3.4}$$

where the exponential denotes the formal matrix exponential. Eq. 3.3 is often referred to as a master equation [vk97, OSW77] describing evolution among a discrete set of states in continuous time.

It is important to note that, despite the fact that p(t) is formally defined for all times t, we do not expect Eq. 3.4 to hold for all times t for physical systems of the sort we consider here. In particular, for states of finite extent in configuration space, there exists a corresponding limit for the time resolution for which dynamics will appear Markovian; processes that occur on time scales shorter than this will be incorrectly described by the master equation.

There is an obvious relationship between the transition matrix T(τ) and the rate matrix K

evident from comparison of Eqs. 3.2 and 3.4:

$$\mathbf{T}(\tau) = e^{\mathbf{K}\tau}. \tag{3.5}$$

If the process can be described by a continuous-time Markov process at all times, then this process can be equivalently described at discrete time intervals by the corresponding transition matrix. The converse may not always be true due to sampling errors in T(τ), though methods exist to recover rate matrices K consistent with the observed data and the requirements of detailed balance and nonnegative rates [GT94, SKH05].

The transition and rate matrices have eigenvalues $\mu_k(\tau)$ and $\lambda_k$, respectively, and share corresponding right eigenvectors $\mathbf{u}_k$. The detailed balance requirement additionally ensures that all eigenvalues are real, and we here presume them to be sorted in descending order. $\mu_k(\tau)$ and $\lambda_k$ are related by

$$\mu_k(\tau) = e^{\lambda_k \tau}. \tag{3.6}$$

The eigenvalues each imply a time scale

$$\tau_k = -\lambda_k^{-1} = -\tau\,[\ln \mu_k(\tau)]^{-1}, \tag{3.7}$$

and the associated eigenvector gives information about the aggregate conformational transitions that are associated with this time scale [Sch99, SFHD99, Hui01, SH02]. In particular, the components of $\mathbf{u}_k$ sum to zero for each $k \geq 2$, and the aggregate dynamical mode corresponds to transitions from states with positive eigenvector components to states with negative components, and vice versa, with the degree of participation in the mode governed by the magnitude of the eigenvector component. This property can be useful in identifying metastable states.

For the remainder of this chapter, we will refer exclusively to the discrete-time Markov chain model picture without loss of generality (Eq. 3.2).

Markov model construction from simulation data given a state partitioning

Once a statistical-mechanical ensemble describing equilibrium and a microscopic model describing dynamical evolution in phase space have been selected, the transition matrix T(τ) can be estimated from molecular dynamics simulations.
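The relations in Eqs. 3.5 to 3.7 can be checked numerically on a toy model. The following Python sketch is only an illustration: the three-state rate matrix `K`, its numerical values, the lag time `tau`, and the helper names are invented for the example and do not come from the systems studied in this chapter. It follows the column convention used here (columns of K sum to zero, columns of T(τ) sum to one), and it assumes a symmetric K so that a symmetric eigendecomposition can be used to form the matrix exponential.

```python
import numpy as np

# Hypothetical 3-state rate matrix in the column convention of this chapter:
# K[j, i] is the rate from state i to state j, and each column sums to zero.
# The rates are symmetric, so detailed balance (Eq. 3.1) holds with a uniform
# equilibrium distribution. The numbers are illustrative only.
K = np.array([[-0.3,  0.2,  0.1],
              [ 0.2, -0.7,  0.5],
              [ 0.1,  0.5, -0.6]])

def transition_matrix(K, tau):
    """T(tau) = exp(K tau) (Eq. 3.5), via eigendecomposition of a symmetric K."""
    lam, U = np.linalg.eigh(K)
    # Scale each eigenvector column by exp(lambda_k * tau), then recombine.
    return (U * np.exp(lam * tau)) @ U.T

def implied_timescales(T, tau):
    """tau_k = -tau / ln(mu_k) for k >= 2 (Eq. 3.7), eigenvalues sorted descending."""
    mu = np.sort(np.linalg.eigvals(T).real)[::-1]
    return -tau / np.log(mu[1:])  # skip the stationary eigenvalue mu_1 = 1

tau = 0.5
T = transition_matrix(K, tau)
lam = np.sort(np.linalg.eigvalsh(K))[::-1]  # 0 = lambda_1 > lambda_2 > lambda_3

print("column sums of T:", T.sum(axis=0))
print("time scales from K:", -1.0 / lam[1:])
print("implied time scales from T:", implied_timescales(T, tau))
```

For an exactly Markovian model such as this one, the implied time scales recovered from T(τ) do not depend on the choice of τ, which is the property exploited later when validating models built from simulation data.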
For a system in which dynamical evolution is Newtonian and, at equilibrium, configurations are distributed according to a canonical distribution at a given

temperature, Swope et al. [SPS04a] show that the transition probability $T_{ji}(\tau)$ can be written as the following ratio of canonical ensemble averages:

$$T_{ji}(\tau) = \frac{\int dz(0)\, e^{-\beta H(z(0))}\, \chi_j(z(\tau))\, \chi_i(z(0))}{\int dz(0)\, e^{-\beta H(z(0))}\, \chi_i(z(0))} \tag{3.8}$$

$$= \frac{\langle \chi_j(\tau)\, \chi_i(0) \rangle}{\langle \chi_i \rangle}, \tag{3.9}$$

where z(t) denotes a point in phase space visited by a trajectory at time t, $\chi_i(z)$ denotes the indicator function for state i (which assumes a value of unity if z is in state i, and zero otherwise), $\beta \equiv (k_B T)^{-1}$ the inverse temperature, H(z) the Hamiltonian, and $\langle A \rangle$ the canonical ensemble expectation of a phase function A(z) at inverse temperature β.

Given a set of simulations initiated from an equilibrium distribution, the expectations in Eq. 3.9 can be computed independently by standard analysis methods [AT91]. Estimation of the correlation function in the numerator can make use of both the stationarity of an equilibrium distribution (by considering overlapping intervals of time τ), and the microscopic reversibility (by considering also time-reversed versions of the simulations) of Newtonian trajectories. Alternatively, if an equilibrium distribution within each state can be prepared, one can also directly estimate a column of transition matrix elements by computing the fraction of trajectories initially at equilibrium within state i that terminate in state j a time τ later. More elaborate methods based on equilibrium ensembles prepared within special selection cells that are not coincident with the states [SPS04a, SPS+04b] or partition of unity restraints [Web06] can also be used to compute transition matrix elements efficiently.

Requirements for a useful Markov model

For any given state partitioning, the dynamics of the system will be Markovian on some time scale.
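As a concrete illustration of the estimation described above for equilibrium trajectories, the following Python sketch counts transitions over overlapping windows of length τ and normalizes each column. It is a minimal sketch under stated assumptions, not the estimator used in this work: the function name `estimate_transition_matrix`, its arguments, and the toy trajectory are hypothetical, the trajectories are assumed to have already been assigned to integer state indices at a fixed sampling interval, and the optional `reversible` flag simply adds the counts from the time-reversed trajectories.

```python
import numpy as np

def estimate_transition_matrix(trajs, n_states, lag, reversible=False):
    """Column-stochastic estimate of T(lag) from discrete state trajectories.

    trajs: list of integer sequences of state indices (hypothetical input format).
    lag:   lag time in units of the sampling interval.
    """
    C = np.zeros((n_states, n_states))
    for s in trajs:
        s = np.asarray(s, dtype=int)
        if len(s) <= lag:
            continue  # trajectory too short to contain a transition at this lag
        # Overlapping windows: C[j, i] counts observations of i at t and j at t + lag.
        np.add.at(C, (s[lag:], s[:-lag]), 1.0)
    if reversible:
        # Also count the time-reversed trajectories (appropriate for equilibrium data).
        C = C + C.T
    col = C.sum(axis=0)
    col[col == 0] = 1.0  # states never observed as a starting point keep a zero column
    return C / col

# Toy example with two states and a single short trajectory.
T_est = estimate_transition_matrix([[0, 0, 1, 1, 0]], n_states=2, lag=1)
print(T_est)
```

Because overlapping windows are statistically correlated, the raw counts overstate the number of independent observations; this matters for uncertainty estimates but not for the point estimate shown here.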
For example, if the lag time τ is so long as to approach the time for the system to relax to an equilibrium distribution from any arbitrary starting distribution, a single application of the transition matrix T(τ) produces the invariant equilibrium distribution. However, if this τ exceeds the time scale of the process of interest, our model is not useful for describing it, and therefore it is advantageous to attempt to find a state decomposition that is Markovian on a shorter time scale in order to extract useful dynamical information about this process. (Footnote 2: Equilibrium probabilities can still be extracted from the stationary eigenvector (the eigenvector corresponding to an eigenvalue of unity) of such a transition matrix, which may have some utility if one had constructed the transition matrix from trajectories not initiated from distributions at equilibrium globally.)

For a given state i, we will define its internal equilibration time, $\tau_{\mathrm{int},i}$, as the characteristic time

one must wait before the system, initially in a configuration within state i, generates a new uncorrelated configuration within the state by dynamical evolution. This internal equilibration time, or memory time, closely related to the molecular relaxation time scale $\tau_{\mathrm{mol}}$ in Chandler's reactive flux formulation of transition state theory [Cha78], depends, of course, on the choice of state decomposition. We can denote the longest of these times over all states by $\tau_{\mathrm{int}}$. If the lag time is longer than $\tau_{\mathrm{int}}$, we will expect the system to have lost memory of its previous location within any state it may have been in, either remaining within that state or transitioning to a new one, and for dynamics on this set of states to be independent of history. On the other hand, for lag times shorter than $\tau_{\mathrm{int}}$, we cannot guarantee that transition probabilities are independent of history everywhere.

This suggests a way in which the utility of various decompositions can be measured. For a fixed number of states, the most useful model will partition configuration space to yield the shortest $\tau_{\mathrm{int}}$, as this model can be used to study the widest range of dynamical processes.

In addition to producing transition probabilities that are history independent at a relevant lag time, we impose additional conditions on our states to ensure the resulting model also provides physical and chemical insight. In order for the states to be defined such that equilibration within a state is rapid, we desire that the region of configuration space defining each state be connected.
A state composed of two or more unconnected regions of configuration space defies the assumption that equilibration within the state is much faster than the characteristic time to leave it.

Validation of Markov models

Once a decomposition of configuration space is chosen, we are faced with the task of determining the observation time interval τ at which dynamics in this state space appears Markovian. Unfortunately, we cannot directly compute the internal state equilibration times, though examination of the eigenvalues of the transition matrix restricted to a state may give a lower bound on this time in the absence of statistical uncertainty [MSF05].

The most rigorous test for Markovian behavior would be a direct check of history independence. The simplest test of this type is to compute second-order transition probabilities and compare them to the appropriate products of the first-order transition probabilities to see if their disagreement is statistically significant. While it is possible to estimate the second-order probabilities from the simulation data, this requires the estimation of three-time correlation functions, which often possess statistical uncertainties so large as to render them useless for this kind of test [CSPD06a]. Additionally, this would miss possible yet unlikely higher-order history dependencies.

Information-theoretic metric

Another approach, from Park et al. [PP06], uses concepts from information theory to compute the conditional mutual information conveyed by the second-to-last state, which quantifies the discrepancy between observed second-order transition probabilities and the estimate modeled from first-order transition probabilities. The result of this analysis is a scalar that quantifies the degree of history dependence. For a pure first-order Markov process, the mutual information will be zero, as no additional information is gained by including additional history. While this method also requires computing three-time correlation functions, which may individually have substantial uncertainties, the weighted combination of these into a single value reduces the uncertainty in the resulting metric. Unfortunately, there is no rigorous criterion for how small this measure must be in order for the model to be considered acceptably Markovian.

Chapman-Kolmogorov

Alternatively, raising the transition matrix to a power n (hence summing over the intermediate states) and comparing with the observed transition probabilities for a lag time of nτ, such that one is effectively determining whether the Chapman-Kolmogorov equation (Eq. 3.2) is satisfied, helps to reduce the uncertainty so that the test becomes practical. This is equivalent to propagating the population in time out of a probability distribution confined to each state i initially, and comparing the model evolution with the observed transition probabilities over times much longer than $\tau_{\mathrm{int}}$. This serves as a check to ensure that the model is at least consistent with the dataset from which it was constructed, to within the statistical uncertainty of the transition matrices obtained from the dataset. This method was employed, for example, in Refs. [SPS04a, CSPD06b], and is used here as well.
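The Chapman-Kolmogorov comparison can be illustrated on synthetic data. The Python sketch below is a hypothetical example rather than the procedure used in the cited references: it generates a trajectory from a known three-state Markov chain (the matrix `T_true`, the trajectory length, and all function names are invented for the illustration), estimates T(τ) and T(nτ) by simple transition counting, and reports the largest element-wise difference between [T(τ)]^n and the directly estimated T(nτ). It ignores statistical uncertainty, which a real test would need to take into account, and it assumes every state is visited.

```python
import numpy as np

def count_matrix(traj, n_states, lag):
    """C[j, i] = number of observed transitions from state i to state j at this lag."""
    C = np.zeros((n_states, n_states))
    np.add.at(C, (traj[lag:], traj[:-lag]), 1.0)
    return C

def column_normalize(C):
    """Convert counts to a column-stochastic matrix (assumes every state was visited)."""
    return C / C.sum(axis=0, keepdims=True)

def chapman_kolmogorov_error(traj, n_states, lag, n):
    """Largest |[T(lag)]^n - T(n*lag)| over all matrix elements."""
    T1 = column_normalize(count_matrix(traj, n_states, lag))
    Tn_model = np.linalg.matrix_power(T1, n)
    Tn_observed = column_normalize(count_matrix(traj, n_states, n * lag))
    return np.abs(Tn_model - Tn_observed).max()

# Synthetic, exactly Markovian data from a made-up column-stochastic matrix.
rng = np.random.default_rng(0)
T_true = np.array([[0.90, 0.05, 0.02],
                   [0.08, 0.90, 0.08],
                   [0.02, 0.05, 0.90]])
cumulative = np.cumsum(T_true, axis=0)
n_steps = 200_000
u = rng.random(n_steps)
traj = np.zeros(n_steps, dtype=int)
for t in range(1, n_steps):
    # Draw the next state from the column of the current state.
    nxt = np.searchsorted(cumulative[:, traj[t - 1]], u[t])
    traj[t] = min(nxt, 2)  # guard against floating-point round-off in the cumulative sum

err = chapman_kolmogorov_error(traj, n_states=3, lag=1, n=5)
print("max Chapman-Kolmogorov deviation:", err)
```

For data generated by a true Markov chain the deviation should be small and shrink as more data are added; a deviation well outside the statistical uncertainty would indicate that the state decomposition is not Markovian at the chosen lag time.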
Implied Timescales

Swope et al. [SPS04a] suggested a number of additional tests for signatures of Markov behavior, the most sensitive of which appears to be examining the behavior of the implied time scales of the transition matrix T(τ), which can be computed from the eigenvalues of the transition matrix by Eq. 3.7, as a function of increasing lag time τ [CSPD06a]. At sufficiently large τ, the implied time scales will be independent of τ, implying that exponentiation of the transition matrix is nearly identical to constructing the transition matrix using longer observation time intervals (Eq. 3.2). The

shortest observation time interval for which this holds can be correlated with the internal equilibration time $\tau_{\mathrm{int}}$, and descriptions of the behavior of the system using that state decomposition should be Markovian for all lag times $\tau \geq \tau_{\mathrm{int}}$. This is also a test of whether the Chapman-Kolmogorov equation holds, but as it computes only L numbers and orders them by time scale, it allows emphasis to be placed on the longest time scales in the system. Implied time scales were used for all systems considered here.

Unfortunately, this last method has some drawbacks. First, small uncertainties in the eigenvalues of the transition matrix can induce very large uncertainties in the implied time scales. With increasing lag time τ, the number of statistically independent observed transitions from which T(τ) is estimated diminishes, and the statistical uncertainty in the implied time scales $\tau_k$ will grow. Second, while stability of the implied time scales with respect to lag time is a necessary consequence of history independence, it is not itself sufficient to guarantee history independence, though we may be unlikely to encounter physical systems for which this is problematic. However, tests on simple models indicate that the information-theoretic metric suggests the emergence of Markovian behavior on similar lag times to this method, suggesting some degree of fundamental equivalence [PP06].

3.3 The automatic state decomposition algorithm

Based on the theory above, we provide a list of practical considerations for an automatic state decomposition algorithm and then present an algorithm that meets them. The algorithm operates on an ensemble of molecular dynamics trajectories where conformations have been stored at regular time intervals.
In this work, we apply the method to a set of equilibrium trajectories at the temperature of interest, but the algorithm can in principle be applied to trajectories generated from biased initial conditions, provided the unbiased transition probabilities between regions of configuration space can be computed. We stress that the algorithm presented here is simply a first attempt at a truly general and automatic algorithm for use with biomacromolecules.

Practical considerations for an automatic state decomposition algorithm

There are several desirable properties that a state decomposition should possess to be both useful and practical:
1. It is not uncommon for simulations performed on distributed computing platforms such as Folding@Home [SP00, PBC+03], supercomputers such as Blue Gene [FGM+03, GFR+05],

or even computer clusters to generate datasets that may contain $10^5$ to $10^7$ configurations in up to $10^4$ trajectories, therefore rendering impractical the use of any algorithm with a computational complexity greater than O(N log N) in the number of configurations.
2. We assume configurations lie exclusively in the configuration space of the macromolecule. We presume decorrelation of momenta and reorganization of the solvent is faster than processes of interest (see footnote 3).
3. Molecules may have symmetries due to the presence of chemically equivalent atoms such as in aromatic rings, methyl protons, and the oxygens of carboxylate groups. The state decomposition should be invariant to permutations of these atoms.
4. The state decomposition algorithm should produce a decomposition for which dynamics appears to be Markovian at the shortest possible lag time τ, so as to produce the most useful model.
5. The resulting model should not include so many states that the elements of the transition matrix will be statistically unreliable.

Sketch of the method

A state decomposition algorithm intended to produce the most useful Markov models, as discussed above under the requirements for a useful Markov model, would generate models that minimize the internal equilibration time $\tau_{\mathrm{int}}$, the minimum time for which the model behaves in a Markovian fashion. If states can be constructed where the time scale for equilibration within each state is much shorter than the time scale for transitions among the states, we would expect interstate dynamics to be well-modeled by a Markov chain after sufficiently long observation intervals. Unfortunately, $\tau_{\mathrm{int}}$ is difficult to determine directly, so we are instead forced to identify some surrogate quantity whose maximization will hopefully lead to improved separation between the time scales for intrastate and interstate transitions. Following the approach of Ref.
[HS05], we define a measure of the metastability Q of a partitioning into L macrostates as the sum of the self-transition probabilities for a given lag time τ:

\[ Q \equiv \sum_{i=1}^{L} T_{ii}(\tau). \tag{3.10} \]

Footnote 3: We recognize that solvent coordinates may be critical in some phenomena, but dealing with solvent degrees of freedom would also require accounting for the indistinguishability of solvent molecules upon their exchange. We leave this to further versions of the algorithm.
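As a rough illustration of Eq. 3.10, the sketch below estimates a transition matrix from a small count matrix and evaluates Q as its trace. The row normalization used here is a simplifying assumption (it is not the estimator of Eq. 3.9), and the function names and the toy counts are hypothetical.

```python
import numpy as np

def transition_matrix(counts):
    """Row-normalize a matrix of observed transition counts (simplifying assumption)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def metastability(T):
    """Q of Eq. 3.10: the sum of the self-transition probabilities T_ii."""
    return float(np.trace(T))

# Hypothetical counts among L = 3 macrostates at one lag time.
T = transition_matrix([[90, 5, 5], [4, 92, 4], [3, 3, 94]])
Q_one_lag = metastability(T)

# Longer lag times correspond to powers of T under the Markov assumption.
Q_zero_lag = metastability(np.linalg.matrix_power(T, 0))     # identity matrix
Q_long_lag = metastability(np.linalg.matrix_power(T, 1000))  # rows near equilibrium
```

In this toy case Q equals the number of states at zero lag and approaches one at very long lag, because every row of the matrix power tends to the equilibrium distribution, whose entries sum to one.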

For τ = 0, Q = L, and Q decays to unity as τ grows large enough for the self-transition probabilities T_ii to reach the equilibrium probabilities of each macrostate. Poor partitionings will result in a small Q, as trajectories started in some states will rapidly exit; conversely, good partitionings into strongly metastable states will result in a large Q, as trajectories will remain in each macrostate for long times. In the absence of statistical uncertainty, Q is bounded from above by the sum of the L largest eigenvalues of the true dynamical propagator for the system [HS05].

The goal of our algorithm is to identify a partitioning into L contiguous macrostates that maximizes the metastability Q. While in principle, the boundaries between these macrostates can be varied directly to optimize Q, in analogy to variational transition state theory [TGK96], a complicated parameterization may be necessary to describe the potentially highly convoluted hypersurfaces separating the states, and Q may have multiple maxima in these parameters. Instead, we choose an approach based on splitting the conformation space into a large number of small contiguous microstates and then lumping these microstates into macrostates to maximize the metastability.

This approach is similar to the approach of Schütte and coworkers described in Ref. [SFHD99], but with a substantial difference. In their work, each degree of freedom of the molecule (such as a torsion angle) is subdivided independently to produce a multidimensional grid. As the number of states is exponential in the number of degrees of freedom, this approach quickly becomes intractable for macromolecules that possess large numbers of degrees of freedom, even if the sparsity of the transition matrix is taken into account.
Instead, we choose to let the data define the low-dimensional manifold of configuration space accessible to the macromolecule, and we can apply any clustering algorithm that is O(N log N) in the number of configurations to decompose the sampled conformation space into a set of K contiguous microstates. This step corresponds to the first split step in Figure 3.1. Once the conformation space is divided into K microstates, we lump the microstates together to produce L < K macrostates with high metastability, Q. This corresponds to the first lump step in Figure 3.1. The difficulty here is that the uncertainty in the metastability of a partitioning can be large if any macrostate contains very few configurations. Since a macrostate may consist of a single microstate, the microstates must be large enough for the self-transition elements to be statistically well-determined. This comes at a price: with large microstates, the procedure may have difficulty accurately determining the boundaries between macrostates because the resolution of partitioning is limited by the finite extent of the microstates. Additionally, the choice of decomposition into microstates is arbitrary, whereas we would like the state decomposition algorithm to produce equivalent sets of macrostates regardless of the quality of the initial partitioning.

Figure 3.1: Flowchart of the automatic state decomposition algorithm. Steps shown: SPLIT (K-medoid partitioning of entire sampled space); LUMP (maximize trace of macrostate transition matrix); SPLIT (K-medoid partitioning on each macrostate); LUMP (lump over all microstates to maximize trace); ITERATE. We consider K microstates which are used as the basis to construct L < K macrostates that are the approximations to the true metastable states in the system.

To overcome these difficulties, we iterate the aforementioned procedure. After microstates are combined into macrostates, each macrostate is again fragmented into a new set of microstates (the second split step in Figure 3.1). The refined set of all microstates is then lumped to form refined macrostates (the second lump step in Figure 3.1). In this way, the boundaries between macrostates are iteratively refined, and regions incorrectly lumped in previous iterations may be split off and lumped with the correct macrostate in subsequent iterations. At convergence, no shuffling of conformations between macrostates should occur.

There is unfortunately no unambiguous way to choose the number of states L. If there is a clean separation of time scales, examination of the eigenvalue spectrum of the microstate transition matrix may suggest an appropriate value of L [SH02]. In a hierarchical system, there will be many gaps in the eigenvalue spectrum and many choices of L will lead to good Markovian models of varying complexity. There is, however, a tradeoff between the number of states and the amount of data needed to obtain a model with the same statistical precision. It may be necessary to apply the algorithm repeatedly with different choices of L to produce a model adequate for describing the time scales of interest. L could even be chosen dynamically at each iteration of the algorithm, though we did not choose to do so in this version.

Implementation

There are a number of implementation choices to be made in the algorithm given above, and here we briefly summarize and justify our selections.

Splitting

For the split step, we choose to apply K-medoid clustering [HTF01] for a fixed number of iterations because of its O(KN) time complexity (where K can be taken to be constant) and ease of parallelization.
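To make the split step more concrete, here is a minimal sketch of a fixed-iteration K-medoid pass of the kind detailed in the paragraphs that follow. It is an illustration under stated assumptions, not the implementation used in this work: plain Euclidean distance between feature vectors stands in for the RMSD metric, and the names `k_medoid_split`, `n_iter`, and `n_trial` are hypothetical.

```python
import numpy as np

def k_medoid_split(X, K, n_iter=5, n_trial=None, seed=0):
    """Partition the rows of X into K microstates around K generators (medoids)."""
    rng = np.random.default_rng(seed)
    n_trial = K if n_trial is None else n_trial
    everyone = np.arange(len(X))

    def dist(a, b):
        # Euclidean distances between points a and points b (stand-in for RMSD).
        return np.linalg.norm(X[a][:, None, :] - X[b][None, :, :], axis=2)

    # Start from K unique, randomly chosen conformations as generators.
    medoids = rng.choice(len(X), size=K, replace=False)
    labels = dist(everyone, medoids).argmin(axis=1)
    for _ in range(n_iter):
        for k in range(K):
            members = np.flatnonzero(labels == k)
            if members.size == 0:
                continue
            # Try a few random members as the new generator; keep the current one
            # unless a candidate lowers the mean squared distance within the state.
            trial = rng.choice(members, size=min(n_trial, members.size), replace=False)
            trial = np.append(trial, medoids[k])
            variance = (dist(trial, members) ** 2).mean(axis=1)
            medoids[k] = trial[variance.argmin()]
        # Reassign every conformation to its closest generator.
        labels = dist(everyone, medoids).argmin(axis=1)
    return labels, medoids
```

Each pass costs on the order of K distance evaluations per conformation, which is what makes a fixed number of passes attractive for large datasets.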
Additionally, K-medoid clustering has an advantage over the more popular K-means clustering [Mac67] in this application, as it does not require averaging over conformations, which may produce nonsensical constructs when drastically different conformations are included in the average. Splitting by K-medoid clustering is initiated from a random choice of K unique conformations to function as generators. All conformations are assigned to the microstate identified by the generator they are closest to by some distance metric (defined below). Next, an attempt is made to update the generator of each microstate. K members of the microstate, drawn at random, are evaluated to see if they reduce the intrastate variance of some distance metric from the generator. If

so, the configuration for which the intrastate variance is minimal is assigned as the new generator. All conformations are then reassigned to the closest generator, and the process of updating the generators is repeated. In standard K-medoid applications, this procedure is iterated to convergence, but since the purpose of the splitting phase is simply to divide the sampled manifold of configuration space into contiguous states, ensuring that each state is significantly populated, only five iterations of this procedure were used.

For the distance metric, we selected the root-mean-square deviation (RMSD), computed after a minimizing rigid body translation and rotation using the rapid algorithm of Theobald [The05]. In the first splitting iteration, only Cα atoms were used to compute the RMSD due to the expense of having to cluster all conformations in the dataset; in subsequent iterations, all heavy atoms (excepting those indistinguishable by symmetry) were used, as well as sidechain polar hydrogens. This metric was chosen because it possesses all the qualities of a proper distance metric [Ste02], accounts for both local similarities between pairs of conformations as well as global ones, and runs in time proportional to the number of atoms, as opposed to a metric such as distance matrix error (DME or drmsd), which scales as the square of the number of atoms. In molecules with additional symmetry, the distance metric can be adjusted accordingly. Our choice of distance metric is not the only one that would suffice; any distance metric which can distinguish between kinetically distinct conformations is sufficient for this algorithm. In contrast, using something like backbone RMSD throughout the process may be a poor choice since it would ignore potentially relevant sidechain kinetics.

Lumping

Lumping to L states so as to maximize the metastability Q of the macrostates proceeds in two stages.
In the first stage, information on the metastable state structure contained in the eigenvectors associated with the slowest time scales [Sch99, Hui01, DHFS00, SH02] is used to construct an initial guess at the optimal lumping. Because the eigenvectors contain statistical noise, this may not actually be optimal; so, we include a second stage that uses a Monte Carlo simulated annealing (MCSA) optimization algorithm to further improve the metastability. Though the MCSA algorithm could in principle be used without the first stage to find optimal lumpings, we find its convergence is greatly accelerated by use of the initial guess. Ensuring connectivity during the lumping stage would be difficult due to the need to enumerate neighbors in configuration space, but in practice, we find this unnecessary. In the first stage, a transition matrix among microstates is computed (using Eq. 3.9) taking

advantage of both stationarity and time-reversibility, for a short lag time τ, typically the shortest interval at which configurations were stored. Motivated by the Perron cluster cluster analysis (PCCA) algorithm of Deuflhard et al. [DHFS00], an initial guess for the optimal lumping of microstates to macrostates is generated using the left eigenvectors (footnote 4) associated with the largest eigenvalues of the microstate transition matrix. We begin by assigning all microstates to a single macrostate. For each eigenvalue, the corresponding eigenvector contains information about an aggregate transition between the set of microstates with positive eigenvector components and the set with negative components, with a time scale determined by the eigenvalue. Equilibration within each set must occur on a faster time scale, provided the eigenvalues are non-degenerate. We can therefore use this information to identify one macrostate to divide in two. We select the macrostate with the largest L1 norm of eigenvector components (restricted to microstates belonging to the macrostate), after subtracting the mean of these components. In Ref. [DHFS00], the sign structure alone was used to split these sets, but since we restrict the splitting to a single macrostate, we split about the mean, so that microstates with eigenvector components above the mean become one macrostate and the rest go into another. This procedure is performed for eigenvectors k = 2, ..., L in order, which should correspond to the slowest processes in the system, generating a total of L macrostates.

Due to statistical noise in the eigenvectors and near-degeneracy in the eigenvalues, this procedure does not always result in the lumping with the maximal metastability Q. Therefore, in the second stage, the metastability was maximized using a Monte Carlo simulated annealing (MCSA) algorithm, using the eigenvector-generated lumping as an initial seed.
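The two lumping stages can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code used in this work: the transition matrix is a plain row-normalized count matrix instead of Eq. 3.9, the eigenvectors are the right eigenvectors of that row-stochastic matrix (assumed to play the role of the mean-split eigenvectors in the text), a single annealing run is shown with the inverse temperature set to the step number, and all function names are hypothetical.

```python
import numpy as np

def metastability(C, labels, L):
    """Q of a lumping: trace of the row-normalized macrostate count matrix."""
    M = np.zeros((L, len(labels)))
    M[labels, np.arange(len(labels))] = 1.0   # macrostate-by-microstate membership
    Cm = M @ C @ M.T                          # counts lumped into macrostates
    return float(np.sum(np.diag(Cm) / Cm.sum(axis=1)))

def eigenvector_lumping(C, L):
    """Stage 1: for eigenvectors 2..L, split one macrostate about its mean component."""
    T = C / C.sum(axis=1, keepdims=True)
    w, V = np.linalg.eig(T)
    order = np.argsort(-w.real)               # slowest processes first
    labels = np.zeros(C.shape[0], dtype=int)  # start with a single macrostate
    for k in range(1, L):
        v = V[:, order[k]].real
        best, best_norm = 0, -1.0
        for m in range(k):                    # macrostate with largest L1 norm about its mean
            dev = v[labels == m] - v[labels == m].mean()
            if np.abs(dev).sum() > best_norm:
                best, best_norm = m, np.abs(dev).sum()
        members = np.flatnonzero(labels == best)
        labels[members[v[members] > v[members].mean()]] = k
    return labels

def anneal(C, labels, L, n_steps=20000, seed=0):
    """Stage 2: Monte Carlo simulated annealing on Q, seeded with `labels`."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    q = metastability(C, labels, L)
    best_labels, best_q = labels.copy(), q
    for step in range(1, n_steps + 1):
        i, new = rng.integers(len(labels)), rng.integers(L)
        old = labels[i]
        if new == old or np.count_nonzero(labels == old) == 1:
            continue                          # no change, or would empty a macrostate
        labels[i] = new
        q_new = metastability(C, labels, L)
        # Accept with probability min(1, exp(beta * dQ)), with beta = step number.
        if q_new >= q or rng.random() < np.exp(step * (q_new - q)):
            q = q_new
            if q > best_q:
                best_labels, best_q = labels.copy(), q
        else:
            labels[i] = old
    return best_labels, best_q
```

A call such as `anneal(C, eigenvector_lumping(C, L), L)` runs both stages in sequence; repeating the annealing with different seeds and keeping the partitioning with the highest Q would mirror the use of several independent runs.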
In each step of the Monte Carlo procedure, a microstate was selected with uniform probability and assigned to a random macrostate. If this proposed move would leave a macrostate empty or did not change the partitioning, it was rejected immediately. Otherwise, the proposed partitioning was accepted with probability $\min\{1, e^{\beta \Delta Q}\}$, where ΔQ is the change in metastability produced by the move. The effective inverse temperature parameter β was set to be equal to the step number, and the MCSA procedure was run for 20,000 steps. Twenty independent MCSA runs were initiated from the initial eigenvector-based partitioning, and the partitioning with the highest metastability sampled in any run was selected to define the lumping into macrostates. No attempt was made to optimize the annealing schedule. It should be noted that the metastability Q is not the only surrogate that could be optimized in order to produce a useful state decomposition (footnote 5).

Footnote 4: The left eigenvector v_k is simply related to the right eigenvector u_k by $(v_k)_i = p_{\mathrm{eq},i}^{-1} (u_k)_i$ [OSW77].

Footnote 5: One could choose to maximize the largest eigenvalues or fastest time scales of the lumped transition matrix, the product of eigenvalues (which would give more weight to faster time scales), or even a weighted sum of the eigenvalues, where the weights might be due to the equilibrium importance of the eigenmode in dynamics or in modeling a process of interest. Unfortunately, these quantities all necessitate computing some eigenvalues or the determinant of the lumped transition matrix for every proposed lumping to be evaluated by the MCSA algorithm, which would add a significant

Iteration

For the remaining iterations, the K-medoid clustering is repeated independently on each macrostate for five iterations. In general, we split each macrostate into 10 microstates, unless otherwise noted. However, we wish to ensure statistical reliability of the transition probability matrix, so if the expected microstate size (estimated by the population of the macrostate divided by K) falls below some threshold (100 configurations unless otherwise noted), we split to a smaller number of states such that the expected size is above the threshold. The lumping step is then repeated on all resulting microstates. The entire procedure of splitting and lumping is repeated for a total of 10 iterations, which for the applications considered here was sufficient for convergence of the metastability.

Validation

To validate the model, we examine the largest implied time scales as a function of lag time, as computed from the eigenvalues of the transition matrix. In particular, we attempt to determine the minimum lag time after which the implied time scales appear to be independent of lag time to within the estimated statistical uncertainty (see Section 3.2.4).

To estimate statistical uncertainties in the implied time scales and other quantities, we perform a bootstrap procedure [Efr79] on the pool of independent trajectories. Forty bootstrap replicates, each consisting of a number of trajectories equal to the number of independent trajectories in the dataset pool, are generated by drawing from the pool with replacement. For alanine dipeptide, 100 bootstrap replicates were used. For each replicate, the implied time scales or other quantity is computed, and either the standard deviation over the sample of replicates is computed (if reported in the text as a ± b) or a 68% confidence interval centered on the sample mean is estimated (if depicted in a figure as vertical error bars).
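The bootstrap over trajectories might be sketched as follows. This is a hedged illustration: the implied time scale uses the standard relation τ_k = -τ / ln μ_k(τ), where μ_k is an eigenvalue of the transition matrix at lag time τ, and the transition matrix is estimated from a symmetrized count matrix, which is an assumption standing in for the estimator of Eq. 3.9. The function names and arguments are hypothetical.

```python
import numpy as np

def count_matrix(trajs, n_states, lag):
    """Count transitions i -> j separated by `lag` frames in each state trajectory."""
    C = np.zeros((n_states, n_states))
    for s in trajs:
        np.add.at(C, (s[:-lag], s[lag:]), 1.0)
    return C

def implied_timescales(C, lag_time, n_slowest=1):
    """tau_k = -lag_time / ln(mu_k) for the slowest non-stationary eigenvalues."""
    C = 0.5 * (C + C.T)                        # impose reversibility (assumption)
    T = C / C.sum(axis=1, keepdims=True)
    mu = np.sort(np.linalg.eigvals(T).real)[::-1]
    return -lag_time / np.log(mu[1:1 + n_slowest])

def bootstrap_timescale(trajs, n_states, lag, dt, n_rep=40, seed=0):
    """Mean and standard deviation of the slowest time scale over bootstrap replicates."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(n_rep):
        # Draw whole trajectories with replacement, as many as are in the pool.
        pick = rng.integers(len(trajs), size=len(trajs))
        C = count_matrix([trajs[i] for i in pick], n_states, lag)
        values.append(implied_timescales(C, lag * dt)[0])
    values = np.array(values)
    return float(values.mean()), float(values.std(ddof=1))
```

Resampling whole trajectories rather than individual frames keeps the temporal correlation within each trajectory intact, which is why the trajectory is the resampling unit here.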
We also estimate the number of statistically independent visits to each macrostate. Since sequential samples from a single trajectory are temporally correlated, we compute the integrated autocorrelation time [SABW82, Jan02] τ_ac,i for each macrostate i. Ignoring statistical uncertainty, this correlation time is an upper bound on the equilibration time within a state; long-lived states will necessarily have long autocorrelation times, but trajectories trapped within them may contain many uncorrelated samples if the internal equilibration time is short. In the absence of a convenient way to quantify the internal equilibration time for each state, the autocorrelation time provides a better estimate of the appropriate time scale than the time to reach global equilibrium τ_eq. The effective number of independent samples for each state is estimated by summing the number of independent samples from each trajectory (which are assumed independent), where the effective number of independent samples of state i from trajectory n is computed as $N^{\mathrm{eff}}_{n,i} \equiv \min\{1, N_{n,i}/g_i\}$, where $N_{n,i}$ is the number of configurations from trajectory n in state i, and $g_i = \tau_{\mathrm{ac},i}/\tau_{\mathrm{sample}}$ is the statistical inefficiency of state i, where τ_sample is the sampling interval between conformations.

Footnote 5 (continued): computational burden. Alternatively, other quantities could be computed from the transition matrix directly, such as the state lifetimes estimated from the self-transition probabilities as $\tau_{L,i} = (1 - T_{ii})^{-1}$. However, the combination of computational and theoretical convenience makes the use of metastability a natural choice here.

3.4 Applications

Alanine dipeptide

We first demonstrate application of the automatic state decomposition algorithm to a simple model system, terminally blocked alanine peptide (sequence Ace-Ala-Nme) in explicit solvent. Because the slow degrees of freedom (φ and ψ torsions, labeled in Figure 3.2, left) are known a priori (footnote 6), it is relatively straightforward to manually identify metastable states from examination of the potential of mean force, making it a popular choice for the study of biomolecular dynamics [AFC99, BDC00, MW01, HK03, CIL04, CSPD06b]. Previously, a master equation model constructed using six manually identified states (Figure 3.2, right) was shown to reproduce dynamics over long times (with the time to reach equilibrium over 100 ps at 302 K) given trajectories only 6 ps in length [CSPD06b]. We therefore determine whether the automatic algorithm can recover a model of equivalent utility to this manually constructed six-state decomposition for this system, as well as study its convergence properties. Because the algorithm uses the solute Cartesian coordinates, rather than the (φ, ψ) torsions, this is a good test of whether good approximations to the true metastable states can be discovered without prior knowledge of the slow degrees of freedom. For ease of visualization, however, we project the state assignments onto the (φ, ψ) torsion map for comparison with our manually constructed states.
Simulation details

Trajectories were obtained from the 400 K replica of a 20 ns/replica parallel tempering simulation (footnote 7) described in Ref. [CSPD06b], and consisted of an equilibrium pool of 1,000 constant-energy, constant-volume trajectory segments 20 ps in length with configurations stored every 0.1 ps. The peptide was modeled by the AMBER parm96 forcefield [KDC+97], and solvated in TIP3P water [JCM+83]. The previous study [CSPD06b] considered the dynamics at 302 K, but resorted to a focused sampling strategy where a number of trajectories were initiated from equilibrium distributions within constricted selection cells [SPS04a] in order to obtain statistically reliable estimates of the transition matrix. Here, as the focus was on locating these metastable states from equilibrium data, we found it necessary to use equilibrium data from a higher temperature (here, the 400 K replica) in order to obtain sufficient numbers of trajectories covering the entirety of the landscape.

A 2D potential of mean force (PMF) at 400 K in the (φ, ψ) backbone torsions was estimated from the parallel tempering simulation using the weighted histogram analysis method [KBS+92, CSP+06] by discretizing each degree of freedom into 10 bins (Figure 3.2). Because the (φ, ψ) torsions are supposed to be the only slow degrees of freedom in the system, we can associate basins in the potential of mean force with metastable states. The six such states identified from the 302 K PMF in the previous study [CSPD06b], shown as dark lines in Figure 3.2, can be seen to adequately separate the free energy basins observed at 400 K. We take this decomposition as our reference gold standard, and compare the one obtained from our automatic state decomposition algorithm with it.

Figure 3.2: Potential of mean force and manual state decomposition for alanine dipeptide. Left: The terminally-blocked alanine peptide with φ, ψ, and ω backbone torsions labeled. Right: The potential of mean force in the (φ, ψ) torsions at 400 K estimated from the parallel tempering simulation, truncated at 10 k_B T (white regions), with reference scale (far right) labeled in units of k_B T. Boundaries defining the six states manually identified in Ref. [CSPD06b] from examining the 302 K PMF are superimposed, and the states labeled.

Footnote 6: Simulations of alanine dipeptide examining the committor distribution have implicated solvent coordinates as the next-slowest degrees of freedom [BDC00, MD05], but we have previously verified that φ and ψ torsions form a sufficient basis for the slow degrees of freedom on time scales of 6 ps and greater [CSPD06b].

Footnote 7: Note that only 10 ns/replica were used in Ref. [CSPD06b]; the data presented here include an additional 10 ns/replica of production simulation. Additionally, configurations containing cis-ω torsions discussed in the text were not observed in the first 10 ns/replica cited in the previous study; these conformations only appeared in the latter 10 ns/replica.

Automatic State Decomposition

First, the automatic state decomposition method described in Section 3.3 was applied to this dataset in a fully automatic way to obtain six macrostates that could be compared with the gold standard. Since there is only one Cα atom in the peptide, we opted to use the backbone RMSD (including the amide proton and carbonyl oxygen) in the first stage, splitting to 100 microstates; subsequent iterations used the distance metric and splitting procedure described above. A lag time of a single sampling interval (0.1 ps) was used for the calculation of the metastability Q used in lumping, as described above.

Application of the state decomposition algorithm to the entire dataset revealed a state that heavily overlapped with several others when projected onto the (φ, ψ) map, along with an extremely long time scale associated with its transitions (data not shown). Closer examination of the ensembles of configurations contained in this overlapping state revealed that the overlapping regions differed by a peptide bond isomerization; a small population of the trajectories contained an N-terminal ω peptide bond in the cis state, rather than the typical trans state. The number of trajectories that connected these states was extremely small. Examination of the parallel tempering data revealed that the majority of these transitions had occurred at a much higher temperature, and that the cis-ω configurations found at 400 K had reached this temperature by annealing from the higher temperature; in the majority of trajectories at 400 K that contained cis-ω configurations, the peptide remained in this state over the duration of the trajectory. This is a clear demonstration of how the automatic algorithm can discover additional slow degrees of freedom that the experimenters may not realize are important.
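One way to flag such configurations is to compute the ω torsion directly and label frames by its magnitude. The sketch below is only an illustration: the 90° cutoff between cis and trans, the function names, and the assumption that ω is given by four consecutive backbone atom positions are choices made here, not details taken from the analysis above.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees defined by four atom positions."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0, b1, b2 = p0 - p1, p2 - p1, p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    v = b0 - np.dot(b0, b1) * b1   # component of b0 perpendicular to the central bond
    w = b2 - np.dot(b2, b1) * b1   # component of b2 perpendicular to the central bond
    return float(np.degrees(np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w))))

def is_cis(omega_deg, cutoff=90.0):
    """Treat |omega| below the cutoff as cis (assumed threshold)."""
    return abs(omega_deg) < cutoff

def trans_only_trajectories(omega_by_traj, cutoff=90.0):
    """Indices of trajectories in which no frame has a cis omega torsion."""
    return [n for n, series in enumerate(omega_by_traj)
            if not any(is_cis(w, cutoff) for w in series)]
```

Given a per-trajectory list of ω values, `trans_only_trajectories` returns the indices that would be kept if every trajectory visiting the cis state were discarded.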
For subsequent investigation, due to the extremely small number of transitions, trajectories containing conformations that included cis-ω bonds (a total of 25 trajectories) were removed from the set of trajectories used for analysis, leaving 975 trajectories.

Comparison with manual state decomposition

The results of the automatic state decomposition algorithm applied to this reduced dataset can be seen in Figure 3.3, in comparison with the gold standard manual state decomposition from Ref. [CSPD06b] and a poor manual decomposition that is expected to fail to reproduce kinetics well because its states include internal kinetic barriers (footnote 8). Independent applications of the automatic method were observed to yield two distinct decompositions with metastabilities within statistical uncertainty, both of which slightly exceeded the metastability of the manual decomposition (Figure 3.3, bottom two plots). In the first automatic decomposition, six states in the same general locations as the manual gold standard decomposition are observed, though the boundaries are somewhat perturbed. However, the time scales as a function of lag time are not significantly different from those of the manual gold standard decomposition (Figure 3.3, right). In the other automatic decomposition, states 3 and 4 of the manual decomposition (numbering given in Figure 3.2) have been merged into a single state, and state 5 of the manual decomposition has been fragmented into two states. Despite this, the time scales as a function of lag time again appear to be statistically indistinguishable from those of the gold standard, suggesting that this model may have equal utility. This suggests that the phenomenological rates may not be very sensitive to the exact choice of state boundaries after the Markov time, as recrossings will have been suppressed by this time. The fact that this lumping does not disrupt the behavior of the model substantially is not altogether surprising, because the barrier separating states 3 and 4 is rather small, and these states act like a single state even on time scales of a few picoseconds or greater. In contrast, the poor decomposition has extremely short time scales which do not appear to level off over the course of 10 ps.

Figure 3.3: Comparison of manual and automatic state decompositions for alanine dipeptide. The left panels depict state partitionings, and the right panels the associated time scales (in picoseconds) as a function of lag time (in picoseconds) with uncertainties shown, as estimated from the procedure described under Validation. Axes are the same in all plots. Top two panels: Manual good or gold standard state decomposition from Ref. [CSPD06b] (Q = 5.59 ± 0.02) and manual poor state decomposition (Q = 3.21 ± 0.05), where the state boundaries are grossly distorted so as to include internal kinetic barriers within the states. Bottom two panels: Two nearly-equivalent partitionings obtained from the automatic state decomposition algorithm (Q = 5.64 ± 0.02 and Q = 5.64 ± 0.03).

Footnote 8: The poor partitioning was defined as follows: (1) φ [(179, 135], ψ (98, 48]; (2) φ ( 135, 60], ψ (98, 48]; (3) φ (179, 135], ψ (48, 98]; (4) φ ( 135, 60], ψ (48, 98]; (5) φ ( 60, 179], ψ (98, 45]; (6) φ ( 60, 179], ψ ( 45, 98]. Specified intervals denote intervals on the torus, which is continuous from -180 to 180. All torsion angles are specified in degrees.

Stability of state decomposition

To examine the ability of the algorithm to recover optimal partitionings, the automatic state decomposition algorithm was applied to both the gold standard and poor manual decompositions (Figure 3.4) to see whether these partitionings would be maintained over the course of subsequent iterations.
Ten iterations were conducted, with each macrostate split to ten microstates in the first iteration, rather than the entire configuration space being split into 100 states. In both cases, the algorithm converged to nearly equivalent partitionings after ten iterations (Figure 3.4), as verified by examination of the converged time scales (data not shown). This suggests the method yields partitionings that are relatively stable and optimal. From the poor manual decomposition, however, a number of conformations in manual states 5 and 6 are incorrectly grouped with state 2, though this did not significantly affect the time scales. Further investigation showed that the algorithm never split these conformations from state 2, partly due to their comprising only 1 % of the population of the state. Splitting each macrostate into more microstates should alleviate this problem.


Figure 3.4: Stability and recovery of optimal state decomposition for alanine dipeptide. Top: Ten cycles of automatic state decomposition applied to a good manual partitioning (left) to yield an automatic partitioning (right). Bottom: Ten cycles of automatic state decomposition applied to a poor manual partitioning (left) to yield an automatic partitioning (right).

The F_s helical peptide

To illustrate behavior of the automatic state decomposition method on a larger peptide system with fast kinetics, we applied it to the 21-residue helix-forming F_s peptide, which has been studied extensively both experimentally [LK92, LK93, WCG+96, TEH97, LKSA01] and computationally [GS02, ZLCD04, SP05b, SP05c]. Since helix formation occurs on the nanosecond time scale, Sorin et al. were able to reach equilibrium from both helix and coil conformations and observe equilibrium conformational dynamics using ensembles of molecular dynamics trajectories on the distributed computing platform Folding@Home [SP05c]. Two sets of 1,000 trajectories at 302 K of varying length of the capped F_s peptide (sequence Ace-A_5[AAARA]_3A-Nme), one set initiated from an ideal helix and another from a random coil, were obtained from Sorin et al. [SP05c]; details of the simulation protocol are available therein. The first 40 ns of each trajectory, a conservative overestimate of the time to reach equilibrium from either helix or coil, was discarded, and the two sets of trajectories combined to yield a total of 1,689 trajectories varying in length from 10 ns to 95 ns with a sampling interval of 100 ps. In total, this equilibrium dataset contained nearly 65 µs



of simulation data in 642,604 conformations. The peptide was modeled using the AMBER-99φ forcefield [WCK00, SP05c] and solvated in TIP3P water [JCM+83]. Though the Berendsen weak-coupling scheme [BPvG+84] was employed for thermal and pressure control (footnote 9), we presume the trajectories still obey microscopic reversibility when only the coordinates of the macromolecular solute are considered for the purposes of computing transition probabilities.

Footnote 9: We note that Berendsen thermal control, here applied independently to the peptide and solvent, modulates the velocities of the peptide atoms during the course of the simulation, which may have a nonphysical effect on dynamics and affect interstate transition rates. However, since we compare our Markov model with the original simulation dataset, rather than directly with experiment, this is not of concern.

Table 3.1: Macrostates from a 20-state state decomposition of the F_s helical peptide. Columns: state, members, τ_ac (ns). The backbone is depicted in alpha carbon trace, and arginine sidechains are shown in blue (Arg10), magenta (Arg15), and green (Arg20) for clarity.

Comparison of states

We performed automatic state decomposition on this dataset to generate a set of 20 macrostates through 10 iterations of splitting and lumping. In the first iteration, the sampled region of conformation space was split into 400 microstates. In subsequent iterations, each macrostate was split into 50 microstates (or, if the expected microstate size was less than 500 configurations, the maximum number of microstates such that the expected microstate size was above 500). Automatic state decomposition produced a structurally diverse set of states (Table 3.1), ranging in size from over 350,000 members to 500 members, with the majority containing from 5,000 to 20,000 members. The states include a large state (state 1 of Table 3.1), consisting of slightly over half the total conformations in the dataset and containing both extended coil and helical conformations; a pure helix state (state 15); a number of helix/coil states which are bent in half to different degrees to form tertiary contacts (states 2-14); and a number of smaller helical states which are bent into circles to form tertiary interactions (states 16-20). A previous analysis [SP05c] of this data clustered conformations into states based on various order parameters: the number of helical residues, number of helical segments (stretches of helical residues), length of the longest helical segment, and radius of gyration. We compared the macrostates generated by the automatic algorithm with these clusters, and found that while some states are similar, namely the bi-nucleated helices of different sizes, most were quite different. The most significant difference was the grouping of helix and coil conformations into a single macrostate in the lumping phase of the automatic algorithm, whereas the order parameter-based clustering kept helix and coil states distinct [SP05c]. When examining individual trajectories, we noticed conformations would rapidly transition between helices and coils between consecutive 100 ps frames of the trajectory, suggesting that their rapid interconversion justifies their lumping into a single macrostate. Additionally, the clustering based on helical order parameters was unable to distinguish certain structures that involved long-lived tertiary contacts, such as the bent and circular helical states.
Interestingly, a previous study employing the related AMBER parm03 forcefield [DWC+03] identified configurations similar to those noted by the automatic state decomposition, terming these states helix (state 15), helix-turn-helix (states 3, 6–8), adjusted helix-turn-helix (states 4–5, 9–12, 14), and globular helix (states 16–20).

Kinetic analysis

We then examined the implied time scales as a function of lag time (Figure 3.5). Lumping appeared to preserve the longest time scales found in the microstate transition matrix (data not shown), indicating that our lumping scheme had been successful in identifying a nondestructive lumping into kinetically metastable states at each iteration. Over the course of 10 iterations, the metastability (as optimized with a lag time of 100 ps) increased from 12.5 ± 0.3 to 14.5 ± 0.1, suggesting that the iterative refinement was improving the quality of the state decomposition. On the first iteration, the longest time scales increase nearly linearly with lag time, while on the last iteration, some of the longest time scales become stable by a lag time of 4–5 ns, suggesting Markovian behavior for some of the processes.

Figure 3.5: Implied time scales of the Fs peptide as a function of lag time for 20-state automatic state decomposition (axes: implied time scale τ_k (ns) versus lag time τ (ns)). The five longest time scales are shown. Circles represent the maximum likelihood estimate, and vertical bars depict 68% symmetric confidence intervals about the mean. Note that the time scales associated with two processes appear to cross; here, however, the time scales are colored, and their uncertainties estimated by the bootstrap procedure, by ordering the time scales computed from each bootstrap replicate by rank. This may cause the uncertainties depicted here to be an underestimate of the true uncertainties of each process.

Using the interpretation of eigenvector components in terms of aggregate modes described in Section 3.2.1, the longest time scale was found to correspond to movement between the extended helix/coil state (state 1) and one of the twisted helix-turn-helix states (state 18) with only 500 members. We found, however, that state 18 appeared a small number of times in thirty trajectories, and over 450 times in a single trajectory. Further examination revealed that conformations belonging to this state were almost exclusively temporally adjacent to conformations belonging to state 5, and structural comparison of conformations of these two states showed they were strikingly similar. This suggests that slight conformational differences between conformations in states 18 and 5 allowed the K-medoid clustering algorithm to partition between these states in a splitting step, and since state 18 was mainly isolated in a single trajectory, its self-transition probability was maximized by not lumping it with state 5, even though the two behaved in a similar kinetic fashion. Indeed, when we manually lump states 18 and 5, the longest time scale, corresponding to transitions involving state 18, disappears, but the remaining time scales are all preserved (data not shown).

A potential cause of the increase with lag time observed in some of the other long time scales is the finite length of the trajectories. If a state is long-lived, and occurs near the beginning or end of a trajectory, then the estimated self-transition probability T_ii artificially increases as a function of lag time. This effect is most pronounced when a state occurs in very few trajectories, and appears to be mitigated when the state occurs in many trajectories at random times within the trajectory.
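The relation between transition matrix eigenvalues and implied time scales used in this analysis can be sketched as follows. This is a minimal illustration, assuming the standard definition τ_k = -τ / ln μ_k(τ), where μ_k(τ) is an eigenvalue of the transition matrix estimated at lag time τ; the function names and the two-state transition probabilities are hypothetical and are not taken from the Fs peptide data.

```python
import math


def implied_timescale(eigenvalue, lag_time):
    """Return -lag_time / ln(eigenvalue), defined for 0 < eigenvalue < 1."""
    if not 0.0 < eigenvalue < 1.0:
        raise ValueError("eigenvalue must lie strictly between 0 and 1")
    return -lag_time / math.log(eigenvalue)


def two_state_slow_eigenvalue(p_ab, p_ba):
    """Non-unit eigenvalue of [[1 - p_ab, p_ab], [p_ba, 1 - p_ba]].

    The eigenvalues of a 2x2 row-stochastic matrix are 1 and (trace - 1).
    """
    return 1.0 - p_ab - p_ba


# Hypothetical two-state model estimated at a 5 ns lag time.
lag = 5.0
slow = two_state_slow_eigenvalue(0.10, 0.05)
tau_1 = implied_timescale(slow, lag)

# For exactly Markovian dynamics the eigenvalue at lag n*tau is slow**n,
# so the implied time scale does not change with the lag time.
tau_3 = implied_timescale(slow ** 3, 3 * lag)
print(round(tau_1, 2), round(tau_3, 2))
```

A time scale that keeps growing with lag time, as on the first iteration above, indicates that the eigenvalue at lag nτ is larger than the n-th power of the eigenvalue at lag τ, which is the signature of non-Markovian behavior at that lag time.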

In order to determine which states are poorly characterized, we estimated the number of statistically independent visits to each macrostate using the autocorrelation time given in Sec . As the correlation functions became statistically unreliable at times larger than 10 ns, a least squares linear fit to the log of the computed correlation function over the first 10 ns was used to estimate the tail of the function at times greater than 10 ns, and this combined correlation function was integrated to obtain the autocorrelation time. Computed state autocorrelation times are given in Table 3.1. For many states, the correlation time was 1–2 ns, giving thousands of independent samples; however, for five states, including the four involved in the four longest time scales, the correlation times were between 10 and 50 ns, suggesting that the dataset contained fewer than 50 independent samples of these states. Currently, in the automatic state decomposition algorithm, we try to reduce the statistical uncertainty in the transition matrix by limiting the expected population of each state to be greater than some minimum number of configurations. Since the conformations appearing within some states may be highly correlated, the number of conformations within a state is not the best measure of how statistically well-determined its transition elements are; instead, it may be advantageous to place a lower limit on the effective number of independent visits to each state, which is far less than the number of configurations it contains. Alternatively, it may be necessary to ensure better characterization of these states by conducting additional simulations from them, provided the equilibrium transition probabilities can still be computed.

We constructed a Markov model from the transition matrix estimated at a 5 ns lag time, at which some (though apparently not all) of the time scales appear to have stabilized.
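The tail-extrapolation procedure used above for the state autocorrelation times can be sketched as follows. This is only an illustration under stated assumptions: the correlation function is taken to be positive and normalized to 1 at time zero, the autocorrelation time is taken to be its integral, the head is integrated with the trapezoid rule, and the function names are hypothetical. The exact definition used in the dissertation is the one given in the section it cites.

```python
import math


def fit_log_linear(times, values):
    """Least-squares fit of ln(values) = intercept + slope * times."""
    logs = [math.log(v) for v in values]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    slope = sxy / sxx
    return mean_y - slope * mean_t, slope


def autocorrelation_time(times, corr, fit_until):
    """Integrate corr up to fit_until, then add a fitted exponential tail."""
    head = [(t, c) for t, c in zip(times, corr) if t <= fit_until]
    ts = [t for t, _ in head]
    cs = [c for _, c in head]
    intercept, slope = fit_log_linear(ts, cs)
    if slope >= 0.0:
        raise ValueError("fitted correlation function does not decay")
    # Trapezoid rule over the reliable part of the correlation function.
    area = sum(0.5 * (cs[i] + cs[i + 1]) * (ts[i + 1] - ts[i])
               for i in range(len(ts) - 1))
    # Analytic integral of exp(intercept + slope * t) from ts[-1] to infinity.
    tail = -math.exp(intercept + slope * ts[-1]) / slope
    return area + tail


# Synthetic check: a single-exponential decay with a 4 ns correlation time,
# sampled every 0.1 ns, with only the first 10 ns treated as reliable.
sample_times = [i / 10.0 for i in range(201)]
sample_corr = [math.exp(-t / 4.0) for t in sample_times]
print(autocorrelation_time(sample_times, sample_corr, fit_until=10.0))
```

For a single-exponential decay the log-linear fit is exact, so the result differs from the true correlation time only by the trapezoid-rule error.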
The Chapman-Kolmogorov test (Sec ) can assess how well the model reproduces the observed kinetics. The time evolution of probability density out of three states (state 2, a populous state; state 13, a moderately populated state; and state 19, a sparsely populated state) over the course of 50 ns is shown in Figure 3.6. The Markov model appears to do a very reasonable job of predicting the time evolution of the system to within statistical uncertainty over many times longer than the lag time used to construct it. In fact, the time evolution was well modeled for evolution out of all states, except for state 13, for which dynamics seemed to be particularly poorly reproduced. This state has a long correlation time, and many trajectories seem to contain only a single configuration that is part of this state, suggesting its boundaries are simply poorly resolved. Regardless, the time evolution is generally well-modeled for this system.
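The comparison described above amounts to propagating an initial population vector with powers of the transition matrix and comparing the result with populations observed directly in the trajectories. The following sketch shows only the propagation step, assuming a row-stochastic transition matrix; the two-state matrix and the function name are hypothetical and are not taken from the Fs peptide model.

```python
def propagate(populations, transition_matrix, n_steps):
    """Return p(n * tau) = p(0) T^n for a row-stochastic matrix T."""
    p = list(populations)
    n = len(p)
    for _ in range(n_steps):
        p = [sum(p[i] * transition_matrix[i][j] for i in range(n))
             for j in range(n)]
    return p


# Hypothetical two-state matrix estimated at a 5 ns lag time.
T = [[0.9, 0.1],
     [0.2, 0.8]]

# Start with all probability in state 0 and evolve for 10 steps (50 ns).
predicted = propagate([1.0, 0.0], T, 10)
print(predicted)
```

In a full test, each predicted population would be compared with the fraction of trajectories found in that state at the corresponding time, together with the statistical uncertainty of both quantities.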

Figure 3.6: Reproduction of observed state population evolution by Markov model for the Fs peptide (axes: population versus time (ns)). The time evolution of the Markov model constructed from the 5 ns lag time transition matrix is shown by the filled circles with flat error bars, which denote the 68% confidence interval estimated from a sample of 40 bootstrap realizations, each realization being the result of a 5 ns transition matrix estimated from a bootstrap sample of trajectories. Vertical bars without flat ends denote the 68% confidence interval centered on the sample mean for the probability of finding the system in each of the 20 macrostates at a given time after initial preparation in a specific state. The system was originally prepared in state 2 (top, red), 13 (middle, yellow), or 19 (bottom, purple). The most populous states are colored green (state 1), red (state 2), and blue (state 3).

The trpzip2 β-peptide

As an illustration of the application of the state decomposition algorithm to a system with complex kinetics implying the existence of multiple metastable states [YG04], we considered the engineered 12-residue β-peptide trpzip2 [CSS01]. A set of ns constant-energy, constant-volume simulations of the unblocked peptide (footnote 10) simulated using the AMBER parm96 forcefield [KDC+97] in TIP3P water [JCM+83] was obtained from Pitera et al. [PHS06]; details of the simulation protocol are provided therein. The trajectories were initiated from an equilibrium sampling of configurations at 425 K, a temperature high enough to observe repeated unfolding and refolding events at equilibrium. Configurations were sampled every 10 ps, giving a total of 3.23 µs of data in 323,000 configurations.

Footnote 10: Note that the peptide studied experimentally in Refs. [CSS01] and [YG04] was synthesized with an amidated C-terminus, whereas the termini of the simulated peptide in the dataset considered here were left zwitterionic.

Comparison of states

The automatic state decomposition method was applied to obtain a set of 40 macrostates in 10 iterations of splitting and lumping. The algorithm was performed as described in Section 3.3.3, except for the first iteration, where the conformations were split into 400 microstates. Figure 3.7 depicts some of the final set of 40 macrostates compared with a set of states identified by consideration of backbone hydrogen bonding patterns in the previous study by Pitera et al. [PHS06] (footnote 11). As the trajectories considered here were resampled to 10 ps intervals (rather than 1 ps in Ref. [PHS06]), we found fewer than five examples of the +2 and -2 hydrogen bonding states identified in Ref. [PHS06], and therefore exclude them from comparison. The automatic state decomposition method recovers states corresponding to the native, +1C, and +1N hydrogen bonding patterns, and often further resolves them based on the orientation of the tryptophan sidechains (Figure 3.7, A, C, D). However, the -1N hydrogen bonding pattern is not further resolved, and instead is grouped into a state of mostly disordered hairpins; further examination is necessary to determine whether the algorithm failed to resolve this state or whether the state is simply not long-lived. In addition to recovering most of the manually identified misregistered states, the algorithm was also able to greatly resolve the state labeled as unfolded in Pitera et al. (in that it did not conform to any of the enumerated hydrogen bonding patterns) into substates which exhibit considerable structure (E–J). Some of these kinetically resolved states have distinct hydrogen bonding patterns, such as where both strands are rotated (H), causing the tryptophan sidechains to appear on the opposite face, or where the misregistration is greater than two residues (G, J).
This demonstrates the utility of the method in identifying additional kinetically relevant states that were not initially part of the experimental hypothesis space.

Figure 3.7: Comparison of some trpzip2 macrostates found by automatic state decomposition with misregistered hydrogen bonding states identified in a previous study. Left: The five hydrogen bonding patterns enumerated in Pitera et al. [PHS06] that occurred in sufficient numbers in the subsampled trpzip2 dataset used here, with representative conformational ensembles. Blue squares denote backbone amide hydrogen bond donors, and red circles denote backbone carbonyl hydrogen bond acceptors. Right: A selection of macrostates discovered by automatic state decomposition that contain the largest numbers of hydrogen bonding pattern states. The backbone is depicted in alpha carbon trace, and tryptophan sidechains are shown in light blue (Trp2), orange (Trp4), magenta (Trp9), and teal (Trp11).

Footnote 11: The complete set of macrostates is shown in a figure included as Supplementary Information of Ref. [CSP+07].

Kinetic analysis

Figure 3.8 depicts the implied time scales of the kinetic model as a function of lag time. The longest time scale ranges between 25 and 35 ns and appears to stabilize over the range of lag times considered, though the uncertainty is quite large. Eigenvector analysis (described in Sec ) shows that this time scale corresponds to transitions between the unfolded and disordered hairpin states (E) and the hairpin with both strands rotated (H). The states labeled H together totaled 935 conformations, but appeared in only 13 trajectories, with over 95% of the conformations appearing in a single trajectory. Correlation time analysis (Sec ) suggests there are fewer than 10 independent samples for each of the three states, so proper resolution of this time scale would require more data. The second longest time scale grows to about 15 ns, levels off by around 4 ns, and corresponds to transitions between the unfolded and disordered hairpin states (E) and the native backbone states (A). The states involved in this transition are much better characterized, with a total of over 25,000 conformations appearing in over half the trajectories. The next three longest time scales were all between 3 and 4 ns and correspond to movement between the unfolded state (E) and various sets of misregistered states, namely the newly identified misregistered states I and J, and the +1C state (C). Unfortunately, these time scales are on the order of the time to reach global equilibrium, so it is difficult to characterize these transitions well.

Figure 3.8: Implied time scales of trpzip2 as a function of lag time for 40-state automatic state decomposition (axes: implied time scale τ_k (ns) versus lag time τ (ns)). The five longest time scales are shown.

3.5 Discussion

Markov models are expected to be effective and efficient ways to statistically summarize information about the pathways (mechanism) and time scales for heterogeneous biomolecular processes such as protein folding. The great challenge in their use lies in defining an appropriate state space. Here, we have presented a new algorithm for automatically generating a set of configurational states that is appropriate for describing peptide conformational dynamics in terms of a Markov model, though we expect it to be applicable to macromolecular dynamics in general. The algorithm uses molecular dynamics simulations as input, and generates state definitions using information about the temporal order of conformations seen in the trajectories. The importance of having an automatic algorithm, i.e., one that requires little or no human intervention, is that without it, human bias may inadvertently produce incorrect interpretations of the mechanism of conformational change by imposing a particular view of the simulation data. Additionally, molecular simulation datasets are becoming so large and complex that effectively summarizing the data or extracting insight becomes increasingly impractical unless the experimenter analyzes the data with a specific hypothesis in mind. Construction of a Markov model, however, allows for a hypothesis-free investigation of conformational dynamics, provided that the state space is sufficiently well sampled.

Our algorithm is based on the availability of large numbers of molecular dynamics simulations of appropriate simulation length such as might be generated by a supercomputer or a large (possibly distributed) cluster. Current technology allows for the production of thousands of simulations that can be tens of nanoseconds in length, hundreds of trajectories of up to hundreds of nanoseconds in length, or dozens that are on the order of a microsecond in length. Since our goal has been to develop Markov models that accurately characterize the time evolution of ensembles of macromolecules over experimental time scales (that can range from microseconds to milliseconds) from short simulations of single molecules, our approach places strong emphasis on the longest time scales observed in molecular simulations. For example, recognizing that ill-formed states often result in artificially shortened time scales, we sought to find states that maximize the time scales implied by their corresponding transition matrix for a particular choice of lag time and number of states. This resulted in the maximization of the metastability as a computationally convenient surrogate for minimizing the internal equilibration time τ_int.

For the three data sets to which we have applied the method, there have been a number of important successes.
For alanine dipeptide, the algorithm discovered a distinct manifold of states that consisted of conformations containing a cis-ω peptide bond. This manifold was discovered because it was kinetically distinct, rather than structurally distinct. Also, for alanine dipeptide, the method produces states that are robust and structurally very similar to the best ones produced manually, as well as kinetically indistinguishable to within statistical uncertainty according to our validation metrics. The application of the method to the Fs peptide data set produced a set of states somewhat different from those identified previously from the clustering of helical order parameters [SP05c]. The states produced by the algorithm properly identified many very long-lived (metastable) conformations whose lifetimes and kinetics might be experimentally relevant. The Markov model produced from this state decomposition and a 5 ns transition matrix was shown to reproduce the observed state populations over 50 ns to within statistical uncertainty. Finally, for the trpzip2 peptide, the states constructed were consistent with ones previously identified [PHS06]. This was very encouraging since the previously constructed states used an intramolecular hydrogen bonding criterion and the automatic algorithm utilized different observables and metrics, heavy atom RMSD and kinetics, to resolve states. Moreover, the automatic algorithm more finely resolved what was considered to be the unfolded ensemble into metastable states that were not identified by the decomposition based on hydrogen bonding patterns.

Therefore, the algorithm is achieving many of its design objectives. It provides a method for identifying and characterizing the slower degrees of freedom of a molecular system. It correctly identifies metastable states, dividing structurally similar conformations into multiple sets that have short times for intraconversion but long times for interconversion, and combines conformations that rapidly interconvert even though they may be structurally diverse. This is a prerequisite to capturing a concise description of the pathways for conformational changes. Once meaningful states are identified, the transition matrix itself encapsulates the branching ratios for various pathways and the time scales for overall relaxation to equilibrium from any arbitrary starting ensemble.

Work is ongoing to establish standards for the amount and nature of simulation data (number and length of simulations) needed to develop useful and sufficiently precise Markov models, as well as to investigate the effect of quality metrics other than the metastability on the nature of the resulting states and time scales. Metrics for assessing the quality of the resulting model also need to be examined to complement, or as alternatives to, seeking stability of the implied time scales with respect to lag time.
Finally, alternative approaches to performing this state decomposition are a further matter of current study, such as the method of Noé and coworkers appearing in this issue, motivated by much the same ideas of metastability but employing different methods for the construction of a microstate space [NHSS07].

A general observation about the models produced using states defined by our method is that Markovian behavior is not obtained until lag times that are only an order of magnitude shorter than the longest time scales. Recall that the utility of a state space depends to a large extent on how early Markovian behavior is observed compared to the processes of interest. There are multiple possibilities for why this might be the case. For some molecular systems, there may be no identifiable metastable states in the usual sense. The existence of experimentally observed metastable states in protein systems (e.g., native, intermediate, unfolded) combined with the observation of metastable states in models of small solvated peptides [CSPD06b] argues that this is unlikely. It could be that statistical uncertainty is undermining both the metastability quality metric and the tests for Markovian behavior. Alternatively, the way we establish boundaries between states may not be flexible enough to adequately divide true metastable regions. It may also be that we simply need to allow more states to be produced, resulting in subdivision of states that have internal barriers, to reduce the Markov times. Both of these latter possibilities could in principle be easily addressed by allowing the creation of more states. However, the creation of more states, especially ones with low populations, leads inevitably to situations where transition probabilities become statistically unreliable given a fixed quantity of equilibrium data. Long time scales are ultimately the result of infrequent events, and even for large but finite equilibrium datasets these will be small in number, with resulting small off-diagonal transition probabilities that are statistically unreliable. This has placed us in the particularly difficult but unavoidable situation of attempting to optimize a statistically uncertain objective function.

One solution to this problem, of course, is to consider this algorithm as only the first step of an iterative process where important states and transitions are identified, and then further simulations are performed to improve the characterization of important regions of conformation space. This will allow refinement of the state space and improved precision for important selected transition probabilities. Information from the subsequent simulations could be combined with that from the first set using the selection cell approach described previously [SPS04a]. States, or regions of configuration space, from which further simulations should be initiated could be chosen based on uncertainty considerations, to be described in Chapters 5 and 6.

Supporting Information

A Fortran 90/95 implementation of the automatic state decomposition algorithm presented here is available for download as part of the Supplementary Information of Ref. [CSP+07].
The latest version of the code, along with the alanine dipeptide dataset, can be obtained from dillgroup.ucsf.edu/ jchodera/code/automatic-state-decomposition/. The trpzip2 dataset is available directly from WCS upon request (swope@almaden.ibm.com). A gallery of all macrostates produced by the 40-state decomposition of the trpzip2 peptide is also available as part of the Supplementary Information of Ref. [CSP+07].

Chapter 4

Model selection

A common approach for describing the conformational dynamics of biological molecules is to model the dynamics as a discrete-state Markovian model. The previous chapter gave an automatic method for building a state decomposition, given a target number of states. However, it is currently difficult to determine the correct number of states in the decomposition for a given set of simulation data. This chapter outlines a maximum likelihood score and a Bayesian score for the fit of a given model to the observed data. We show how these scores can be used to compare state decompositions, which may differ in the state definitions themselves and the number of states. The scoring functions are tested on decompositions of a simple transition model between 9 conformations and on state decompositions of the terminally blocked alanine peptide. We demonstrate how the maximum likelihood score always prefers a state definition with more states, while the Bayesian score correctly assesses the tradeoff between the number of states and the amount of data.

4.1 Introduction

Computational simulations are often used to study the movement of biological molecules. A Markovian state model (MSM) is a convenient method for analyzing the simulation data by first discretizing the conformation space of the molecule into some number of states. The kinetics of the system are then described as Markovian, or history-independent, transitions between the states. We have described an automatic algorithm which tries to find stable states in the conformation space such that the dynamics over the states will be Markovian (Chapter 3). Once the states are defined, we find the Markovian transition probabilities by simply counting the number of transitions between states observed in the simulation data, at some lag time between successive conformations, τ. The model is then a compact description of the dynamics, and can be projected out to long time scales.

There are a few tests for evaluating how well a given MSM describes the simulation data. One previous method for evaluating an MSM involves calculating the implied time scales, which are calculated from the eigenvalues of the transition probability matrix, as a function of the lag time at which the transitions are counted [SPS04a]. If the model is Markovian after some lag time, τ_int, the implied time scales will be constant for lag times τ ≥ τ_int. A good model is one for which τ_int is short. An alternate method for evaluating an MSM is to calculate the metastability of the model [HS05]. The metastability, Q, of an MSM is defined as the sum of the self-transition probabilities calculated at some lag time τ.

One way to reduce the lag time after which the MSM will appear to be Markovian is to add more states to the MSM. Adding more states will make the extent of each state smaller, therefore reducing the lag time after which the dynamics will appear to be Markovian. However, by making the states smaller, there will be more parameters in the MSM, namely the transition probabilities between the states. By adding more states to the model, we are estimating more parameters with the same amount of data, thus increasing the uncertainty in the parameters. The previous metrics for evaluating an MSM do not take into account the amount of simulation data, and therefore do not account for this increased uncertainty in the transition probabilities. Errors in estimating the transition probabilities, in turn, lead to errors in the prediction of kinetic properties such as the rates of folding, as we will discuss in later chapters (Chapters 5 and 6).

In this chapter, we introduce a maximum likelihood scoring function and a Bayesian scoring function, both based on the scoring of Bayesian Networks, to evaluate how appropriate an MSM is for a given data set.
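The counting estimate of the transition probabilities and the metastability Q described above can be sketched as follows. This is a minimal illustration rather than the dissertation's implementation: it assumes each trajectory is a list of integer state labels, counts transitions with a sliding window of the chosen lag (one of several possible counting conventions), and uses hypothetical function names and toy data.

```python
def count_transitions(trajectories, n_states, lag):
    """Count observed i -> j transitions separated by `lag` frames."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        # Sliding window: every frame pair (t, t + lag) within a trajectory.
        for t in range(len(traj) - lag):
            counts[traj[t]][traj[t + lag]] += 1
    return counts


def transition_matrix(counts):
    """Row-normalize a count matrix; rows with no counts are left as zeros."""
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def metastability(matrix):
    """Sum of the self-transition probabilities (the trace of the matrix)."""
    return sum(matrix[i][i] for i in range(len(matrix)))


# Two toy trajectories over two states, counted at a lag of one frame.
toy_trajectories = [[0, 0, 0, 1, 1, 0], [1, 1, 1, 1, 0, 0]]
toy_counts = count_transitions(toy_trajectories, n_states=2, lag=1)
toy_matrix = transition_matrix(toy_counts)
print(toy_counts, metastability(toy_matrix))
```

Transitions are never counted across trajectory boundaries, since consecutive trajectories are independent simulations.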
We first introduce Bayesian Networks and scoring functions, and then show how to use the scoring functions to compare different MSMs. The maximum likelihood and Bayesian scoring functions are tested on MSMs of a simple transition model between 9 conformations. We show that the scoring functions are able to select the correct 3-state decomposition of the transition model. We also show that the Bayesian scoring function is better than the maximum likelihood scoring function at determining the appropriate number of states in the decomposition, which depends on both the amount of data and the computed transition probabilities. The Bayesian scoring function is also used to determine which state in an MSM should be subdivided in order to best improve the model. The scoring functions are then tested on state decompositions of the terminally blocked alanine peptide, and we show how they are able to provide more information about the quality of the resulting MSMs than previous techniques.

Methods

In this section, we first introduce the basic formalism of a Bayesian Network (Sec ), including techniques for estimating the parameters (Sec ) and different scoring functions (Sec ). We then describe how a Markovian state model is transformed into a corresponding Bayesian Network (Sec ) and how the scoring functions are used to compare different MSMs (Sec ).

Bayesian Networks

Assume that there exists a set of variables $\{X_1, X_2, \ldots, X_N\}$, each of which has possible values $x_{1j} \in Val(X_1), x_{2j} \in Val(X_2), \ldots, x_{Nj} \in Val(X_N)$. Here, we adopt the notation of an uppercase letter representing a variable and a lowercase letter representing possible values for that variable. The variables $\{X_1, X_2, \ldots, X_N\}$ have a joint probability distribution over them, $P(X_1, X_2, \ldots, X_N)$. A Bayesian Network (BN) is a way to compactly represent this joint distribution by assuming certain conditional independence relationships between the variables [HM81, Pea88]. Explicitly, a Bayesian Network is a directed acyclic graph, G, where the probability of each node in the graph is given just in terms of its parents. If a variable $X_i$ has parents $Pa(X_i)$, then $P(X_i \mid X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_N) = P(X_i \mid Pa(X_i))$. For each variable $X_i$, the Bayesian Network will have a set of parameters $\Theta_{X_i \mid Pa(X_i)}$, where each parameter $\theta_{x_{ij} \mid \pi_i \in Pa(X_i)}$ defines the probability that a certain value of $X_i$, $x_{ij}$, occurs given specific values for all the parents, $\pi_i$.

Sometimes, the data set we are interested in does not consist of a fixed set of variables, but instead involves observations at different points through time. A Dynamic Bayesian Network (DBN) can be used to represent this type of time-series data. In a DBN, at each time slice, we have some set of variables whose joint probability distribution is represented as a Bayesian Network.
In addition, we can have edges between variables in different time slices, indicating how the variables evolve through time, given previous values. The process modeled by a DBN is assumed to be stationary, that is, the dependencies between time slices are independent of time Parameter estimation in Bayesian Networks Assume that we know the structure of a Bayesian Network (the set of variables and edges) and that we want to estimate values for the parameters, Θ. Also assume that we are given complete instances of the variables, drawn from their joint probability distribution. Let M be the number of data instances we have, where each data instance consists of an assignment of values to all the

variables: $\{x_1[m], x_2[m], \ldots, x_N[m]\}$. We can summarize the data, $D$, with count variables, where $M[(\cdot)]$ is the number of data instances where $(\cdot)$ holds.

Maximum likelihood estimation

The likelihood of observing the data, $D$, given some value of the parameters, $\Theta$, is simply the product over all the variables of the product over all the data instances of the probability of observing that instance of the variable:

$$L(\Theta : D) = \prod_{i=1}^{N} \prod_{m=1}^{M} \theta_{x_i[m] \mid \pi_i[m]}. \tag{4.1}$$

The maximum likelihood estimates of the parameters are simply those values which maximize this likelihood function. By grouping terms and using the count variables, the likelihood function reduces to

$$L(\Theta : D) = \prod_{i=1}^{N} \prod_{\pi_i \in Pa(X_i)} \prod_{j=1}^{k_i} \theta_{x_{ij} \mid \pi_i}^{M[x_{ij}, \pi_i]}, \tag{4.2}$$

where $k_i$ is the number of possible values of $X_i$, $\mathrm{size}(Val(X_i))$. Typically, we take the natural log of the likelihood function:

$$l(\Theta : D) = \sum_{i=1}^{N} \sum_{\pi_i \in Pa(X_i)} \sum_{j=1}^{k_i} M[x_{ij}, \pi_i] \ln \theta_{x_{ij} \mid \pi_i}. \tag{4.3}$$

The likelihood function decomposes by variable $X_i$, and by maximizing the log-likelihood, the maximum likelihood estimates of the parameters $\Theta_{X_i \mid Pa(X_i)}$ are given as

$$\hat{\theta}_{x_{ij} \mid \pi_i} = \frac{M[x_{ij}, \pi_i]}{M[\pi_i]}, \tag{4.4}$$

the fraction of instances where $x_{ij}$ is true, restricted to the case where the parents have value $\pi_i$.

Bayesian estimation

Instead of calculating a single value for the parameters in the Bayesian Network, it is possible to ascribe a distribution over the parameters. Using Bayes' rule, the probability of a particular

parameter is

$$P(\theta_{x_{ij} \mid \pi_i} \mid D) = \frac{P(D \mid \theta_{x_{ij} \mid \pi_i})\, P(\theta_{x_{ij} \mid \pi_i})}{P(D)}, \tag{4.5}$$

where $P(\theta_{x_{ij} \mid \pi_i})$ is some prior probability over the parameters. A convenient choice for the prior distribution is the Dirichlet distribution, which is the conjugate prior of the multinomial distribution from which our data is observed. The Dirichlet distribution with variables $p$ and parameters $u$ is defined as

$$\mathrm{Dirichlet}(p; u) = \frac{1}{Z(u)} \prod_{i=1}^{K} p_i^{u_i - 1}, \tag{4.6}$$

$$Z(u) = \frac{\prod_{i=1}^{K} \Gamma(u_i)}{\Gamma\left(\sum_{i=1}^{K} u_i\right)}, \tag{4.7}$$

where $Z(u)$ is a normalizing constant and $\Gamma$ is the gamma function. If we define the prior of the parameters $\Theta_{X_i \mid \pi_i}$ as a Dirichlet distribution with parameters $\alpha_{x_{i1} \mid \pi_i}, \alpha_{x_{i2} \mid \pi_i}, \ldots, \alpha_{x_{ik_i} \mid \pi_i}$, and we observe counts $M[x_{i1}, \pi_i], M[x_{i2}, \pi_i], \ldots, M[x_{ik_i}, \pi_i]$, then the posterior distribution is again a Dirichlet distribution, whose parameters are the prior parameters plus the observed counts:

$$P(\Theta_{X_i \mid \pi_i} \mid D) = \mathrm{Dirichlet}(\Theta_{X_i \mid \pi_i};\ \alpha_{X_i \mid \pi_i} + M[x_i, \pi_i]). \tag{4.8}$$

Scoring of Bayesian Networks

If we have a Bayesian Network, we can evaluate how well it represents the data by calculating different likelihoods of observing the data given the model. Below, we present two scoring functions corresponding to the maximum likelihood and the marginal likelihood.

Maximum likelihood scoring function

One choice of scoring function is the maximum likelihood scoring function ($\mathrm{score}_L$), which is the maximum possible likelihood of the data given the Bayesian Network:

$$\mathrm{score}_L(G, D) = \max_{\Theta}\, l(\langle G, \Theta \rangle : D). \tag{4.9}$$
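As a small illustration of the two parameter estimates introduced above (the maximum likelihood estimate of Eq. 4.4 and the Dirichlet posterior of Eq. 4.8), the following sketch treats a single variable under one fixed parent assignment. The function names, the example counts, and the uniform prior parameters are hypothetical and chosen only for illustration.

```python
def mle_estimate(counts):
    """Eq. 4.4: theta_hat_j = M[x_j, pi] / M[pi] for one parent assignment pi."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no data instances for this parent assignment")
    return [c / total for c in counts]


def dirichlet_posterior(alpha, counts):
    """Eq. 4.8: Dirichlet(alpha) prior plus observed counts gives Dirichlet(alpha + counts)."""
    return [a + c for a, c in zip(alpha, counts)]


def dirichlet_mean(params):
    """Mean of a Dirichlet distribution: params_j / sum(params)."""
    total = sum(params)
    return [p / total for p in params]


# Hypothetical counts M[x_j, pi] for a variable with three possible values.
counts = [6, 3, 1]
# Assumed prior parameters (uniform, one pseudo-count per value).
alpha = [1.0, 1.0, 1.0]

theta_hat = mle_estimate(counts)
posterior = dirichlet_posterior(alpha, counts)
posterior_mean = dirichlet_mean(posterior)
```

With few data instances the posterior mean is pulled toward the prior, whereas the maximum likelihood estimate depends on the counts alone.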

The maximum likelihood occurs when the maximum likelihood parameters, $\hat{\Theta}$, are used, which were defined above in Eq. 4.4:

$$\mathrm{score}_L(G, D) = l(\langle G, \hat{\Theta} \rangle : D). \tag{4.10}$$

The maximum likelihood score is therefore equal to

$$\begin{aligned} \mathrm{score}_L(G, D) &= \sum_{i=1}^{N} \sum_{\pi_i \in Pa(X_i)} \sum_{j=1}^{k_i} M[x_{ij}, \pi_i] \ln \hat{\theta}_{x_{ij} \mid \pi_i} \\ &= \sum_{i=1}^{N} \sum_{\pi_i \in Pa(X_i)} \sum_{j=1}^{k_i} M[x_{ij}, \pi_i] \ln \frac{M[x_{ij}, \pi_i]}{M[\pi_i]}. \end{aligned} \tag{4.11}$$

The major downside of using $\mathrm{score}_L$ is that as more parameters and dependencies are added to the Bayesian Network, the fit to the data will never decrease. Even if the underlying probability distribution satisfies conditional independence between variables, it is unlikely that the empirical data will also satisfy these independencies. While a more complicated model will provide a better fit to the given data set, it is likely that it will overfit to the training data, and the model may lose its ability to predict new data, as each parameter must be estimated with fewer data samples.

Bayesian scoring function

The maximum likelihood scoring function prefers more complicated models since it selects the best parameters, $\hat{\Theta}$, to calculate the score. An alternative scoring function is the Bayesian score ($\mathrm{score}_B$), which uses the entire distribution over the parameters to calculate the marginal likelihood. As opposed to Eq. 4.9, where the maximum likelihood values for $\Theta$ were chosen, in the Bayesian score, we integrate over all possible values for $\Theta$:

$$\mathrm{score}_B(G, D) = \int_{\Theta} l(\langle G, \Theta \rangle : D)\, d\Theta. \tag{4.12}$$
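The observation that $\mathrm{score}_L$ never decreases when a dependency is added, which motivates the Bayesian score, can be checked on a small example. The sketch below evaluates the contribution of one variable to Eq. 4.11 with and without a candidate parent. The count table and the function name are hypothetical.

```python
import math


def loglik_score(table):
    """Contribution of one variable to Eq. 4.11.

    `table` holds one row of counts per parent assignment pi; the score is the
    sum of M[x_j, pi] * ln(M[x_j, pi] / M[pi]). Zero counts contribute nothing.
    """
    score = 0.0
    for row in table:
        row_total = sum(row)
        for count in row:
            if count > 0:
                score += count * math.log(count / row_total)
    return score


# Hypothetical counts of Y (columns) for two values of a candidate parent X (rows).
with_parent = [[40, 10], [12, 38]]
# Removing the edge X -> Y pools the rows into a single parentless table.
without_parent = [[a + b for a, b in zip(*with_parent)]]

score_with = loglik_score(with_parent)
score_without = loglik_score(without_parent)
```

In this example `score_with` is at least as large as `score_without`, and the same holds for any count table, because each row is scored with the frequencies that fit that row best.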

Substituting in the likelihood of observing the data, $D$, given the graph, $G$, and the parameters, $\Theta$, the Bayesian scoring function becomes

$$\begin{aligned} \mathrm{score}_B(G, D) &= \ln \int_{\Theta} P(D \mid G, \Theta)\, P(\Theta \mid G)\, d\Theta \\ &= \ln \int_{\Theta} \prod_{i=1}^{N} \prod_{\pi_i \in Pa(X_i)} \prod_{j=1}^{k_i} \theta_{x_{ij} \mid \pi_i}^{M[x_{ij}, \pi_i]}\, P(\Theta \mid G)\, d\Theta, \end{aligned} \tag{4.13}$$

where $P(\Theta \mid G)$ is some prior distribution of the parameters, given the graph, $G$.

For each set of parameters $\Theta_{X_i \mid \pi_i}$, if the prior distribution $P(\Theta_{X_i \mid \pi_i} \mid G)$ is defined as a Dirichlet distribution with parameters $\alpha_{x_{ij} \mid \pi_i}$, as in Eq. 4.8, the above integral has a closed-form solution [CH92]:

$$\begin{aligned} \mathrm{score}_B(G, D) &= \ln \prod_{i=1}^{N} \prod_{\pi_i \in Pa(X_i)} \frac{\Gamma(\alpha_{X_i \mid \pi_i})}{\Gamma(\alpha_{X_i \mid \pi_i} + M[\pi_i])} \prod_{j=1}^{k_i} \frac{\Gamma(\alpha_{x_{ij} \mid \pi_i} + M[x_{ij}, \pi_i])}{\Gamma(\alpha_{x_{ij} \mid \pi_i})} \\ &= \sum_{i=1}^{N} \sum_{\pi_i \in Pa(X_i)} \left[ \ln \frac{\Gamma(\alpha_{X_i \mid \pi_i})}{\Gamma(\alpha_{X_i \mid \pi_i} + M[\pi_i])} + \sum_{j=1}^{k_i} \ln \frac{\Gamma(\alpha_{x_{ij} \mid \pi_i} + M[x_{ij}, \pi_i])}{\Gamma(\alpha_{x_{ij} \mid \pi_i})} \right], \end{aligned} \tag{4.14}$$

where $\alpha_{X_i \mid \pi_i} = \sum_{j=1}^{k_i} \alpha_{x_{ij} \mid \pi_i}$.

Comparison of scores

If we have two different Bayesian Networks, we may wish to evaluate which is a better model for the data. Often, this test is done in terms of a likelihood ratio, known as a Bayes factor when the marginal likelihood is used. A likelihood ratio is simply the ratio of the probabilities of observing each of the two events, which in our case are the probabilities that the given Bayesian Network produced the data. If we assume that the two BNs are equally likely a priori, then

$$\mathrm{ratio} = \frac{P(BN_2 \mid D)}{P(BN_1 \mid D)}, \tag{4.15}$$

where the probability is calculated either as the maximum likelihood ($\mathrm{score}_L$) or as the marginal likelihood ($\mathrm{score}_B$). The ratio value gives how much more likely $BN_2$ is to have generated the data than $BN_1$. Since the scoring functions we have outlined calculate the natural log of the likelihood,

to compare the scores we simply take the natural log of Eq. 4.15:

$$\mathrm{logratio} = \ln P(BN_2 \mid D) - \ln P(BN_1 \mid D). \tag{4.16}$$

For each of the scoring functions, the difference of the scores between the two BNs which we are comparing gives how many orders of magnitude (in powers of $e$, since natural logarithms are used) the second BN is more likely than the first to have produced the data set. When comparing scores, we will use the normalized log ratio, where we normalize by the number of data instances, $M$:

$$\mathrm{nlr} = \frac{\ln P(BN_2 \mid D) - \ln P(BN_1 \mid D)}{M}. \tag{4.17}$$

This normalized log ratio tells on average how many orders of magnitude the second BN is more likely than the first on any particular data instance.

Markovian state models as Bayesian Networks

Assume we have a molecular dynamics trajectory of the form $\{c(0), c(1), c(2), \ldots\}$, where $c(t)$ represents the conformation of the molecule at time $t$, and $\tau = 1$ is the lag time between consecutive observations. The conformation of the molecule may, for example, be represented as the spatial coordinates of all of the atoms of the molecule. Estimating the dynamics between conformations is difficult because the high dimensionality of the conformation space makes it unlikely that a given conformation will be visited more than once in the trajectory data. A Markovian state model attempts to reduce this dimensionality by mapping each conformation of the molecule to one of $k_S$ discrete states:

$$s(t) = f(c(t)), \tag{4.18}$$

where $f$ is the function of the Markovian state model which maps conformations to states. The goal in a Markovian state model is to group conformations together into states such that the conformations within a state will transition between each other rapidly. When the transitions within a state are faster than the transitions between states, the transitions between states are well approximated as Markovian, or history-independent, transitions. We can represent a Markovian state model with the Dynamic Bayesian Network shown in Figure 4.1.
In this network, the variables X and Y represent the state which the system is in at time points separated by lag time τ. The variables A and B represent the conformation which the system is in

at the corresponding times. We assume the transitions between states are Markovian, and thus the probability of state $Y$ only depends on the previous state $X$. We also assume that the probability of a conformation at a given time, $A$ or $B$, is only dependent on the current state, $X$ or $Y$ respectively.

Figure 4.1: The Dynamic Bayesian Network corresponding to a Markovian state model. The variables $X$ and $Y$ represent the state of the system at consecutive time points and the variables $A$ and $B$ represent the corresponding conformations.

Comparison between different Markovian state models

If we have two different Markovian state models, $\mathrm{MSM}_1$ and $\mathrm{MSM}_2$, each will correspond to different functions $f$ which map the conformations to states, $f_1$ and $f_2$. These functions may differ both in the number of states, $k_{S_1}$ and $k_{S_2}$, and in the mapping itself. Each of these Markovian state models will correspond to different Dynamic Bayesian Networks. The goal of this work is to determine which Markovian state model is more likely, given the data set, $D$. We therefore measure how well each Dynamic Bayesian Network fits the data set, $D$, with the different scoring functions described above.

The mapping from trajectory data to DBN data instance is relatively straightforward. We first divide the trajectories into non-overlapping pairs of conformations separated by lag time $\tau$, since we assume that each data instance is independent. Then, for each pair of conformations, we calculate the values of the corresponding state variables using the functions $f_1$ and $f_2$. For $\mathrm{MSM}_1$, we call the corresponding DBN $G_1$, which has data instances

$$\{X = f_1(c(t)),\ Y = f_1(c(t + \tau)),\ A = c(t),\ B = c(t + \tau)\}, \tag{4.19}$$

and for $\mathrm{MSM}_2$, we call the corresponding DBN $G_2$, which has data instances

$$\{X = f_2(c(t)),\ Y = f_2(c(t + \tau)),\ A = c(t),\ B = c(t + \tau)\}. \tag{4.20}$$
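The mapping from a trajectory to DBN data instances (Eqs. 4.19 and 4.20) can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the one-dimensional "conformations", the mapping `state_of`, and the choice to advance by $2\tau$ after each pair (one possible reading of "non-overlapping") are all assumptions made for the example.

```python
from collections import Counter


def build_instances(trajectory, f, tau):
    """Split one trajectory into pairs (c(t), c(t + tau)) that share no
    conformation, and return data instances (X, Y, A, B) as in Eq. 4.19."""
    instances = []
    t = 0
    while t + tau < len(trajectory):
        a, b = trajectory[t], trajectory[t + tau]
        instances.append((f(a), f(b), a, b))
        # Assumption: start the next pair after the end of this one,
        # so that no conformation is used in two data instances.
        t += 2 * tau
    return instances


def state_of(c):
    """Hypothetical mapping f: state 0 for negative coordinates, state 1 otherwise."""
    return 0 if c < 0 else 1


# Hypothetical trajectory of 1-D conformations.
trajectory = [-1.0, -0.5, 0.2, 0.7, -0.3, 0.9, 1.1, -0.8]

instances = build_instances(trajectory, state_of, tau=1)
# Count variables M[y, x], keyed here as (x, y).
transition_counts = Counter((x, y) for x, y, _, _ in instances)
```

Scoring a second MSM on the same data only requires calling `build_instances` again with a different mapping function, since the conformations $A$ and $B$ are unchanged.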

Since the probability of a conformation given a state ($P(A \mid X)$ or $P(B \mid Y)$ in Fig. 4.1) is independent of the time slice, we share one set of parameters, $\Theta_{C \mid S}$, instead of the two sets, $\Theta_{A \mid X}$ and $\Theta_{B \mid Y}$. The parameters $\Theta_{C \mid S}$ are estimated using all the data instances $\{X[m], A[m]\} \cup \{Y[m], B[m]\}$. This reduces the number of parameters in the DBN and will give more precise estimates of the probability of a conformation given a state.

Calculating the maximum likelihood score ($\mathrm{score}_L$) for each of the two DBNs ($G_1$ and $G_2$) is straightforward using Eq. 4.11, and is simply:

$$\mathrm{score}_L(G, D) = \sum_{x=1}^{k_S} M[x] \ln \frac{M[x]}{M} + \sum_{x=1}^{k_S} \sum_{y=1}^{k_S} M[y, x] \ln \frac{M[y, x]}{M[x]} + \sum_{s=1}^{k_S} \sum_{c=1}^{k_C} M[c, s] \ln \frac{M[c, s]}{M[s]}. \tag{4.21}$$

To calculate the Bayesian score ($\mathrm{score}_B$) for each DBN, we must also define the prior distribution over the parameters. The scoring function assumes the prior distribution is a Dirichlet distribution, but we still need to select the prior parameters. The weight of a prior distribution is equal to the sum of the prior distribution parameters. We could select a uniform distribution for the prior by setting all prior parameter values to one, but then different DBNs may have different prior weights, since a DBN with more states would have more prior parameters. Instead, we chose to use a BDe prior [HGC95], so that the weight of the prior is the same over all DBNs. If we define some joint probability distribution over the variables, $P(X_1, X_2, \ldots, X_N)$, and some total prior weight, $M'$ (distinct from the number of data instances, $M$), the BDe prior is defined as:

$$\alpha_{x_{ij} \mid \pi_i} = M' \cdot P(x_{ij}, \pi_i). \tag{4.22}$$

In our case, we calculated the prior parameters of the distributions of $X$ and $Y$, $\alpha_X$ and $\alpha_{Y \mid X}$, with prior weight $M' = 1$ and uniform distribution $P(X, Y)$. Therefore, the prior parameters for the two DBNs are

$$\alpha^{G_1}_{x} = \frac{1}{k_{S_1}}; \quad \alpha^{G_1}_{y \mid x} = \frac{1}{k_{S_1}^2}; \quad \alpha^{G_2}_{x} = \frac{1}{k_{S_2}}; \quad \alpha^{G_2}_{y \mid x} = \frac{1}{k_{S_2}^2}. \tag{4.23}$$

We calculated the prior parameters of the distribution of the conformations, $\alpha_{C \mid S}$, with prior weight $M' = 1$ and the distribution $P(C, S)$ as uniform where $s = f(c)$ and zero otherwise. Once the prior is defined, we calculate the Bayesian score for each DBN by substituting these

prior parameter values into Eq. 4.14:

$$\mathrm{score}_B(G, D) = \ln \frac{1}{\Gamma(1 + M)} + \sum_{x=1}^{k_S} \sum_{y=1}^{k_S} \ln \frac{\Gamma(1/k_S^2 + M[y, x])}{\Gamma(1/k_S^2)} + \sum_{s=1}^{k_S} \left[ \ln \frac{\Gamma(\alpha_{C \mid s})}{\Gamma(\alpha_{C \mid s} + M[s])} + \sum_{c : f(c) = s} \ln \frac{\Gamma(\alpha_{c \mid s} + M[c, s])}{\Gamma(\alpha_{c \mid s})} \right]. \tag{4.24}$$

Non-equilibrium data

All of the above analysis assumed that each data instance was selected from the joint equilibrium distribution over all the variables. For molecular dynamics simulations, this assumption corresponds to the assumption that the simulation is at equilibrium. However, simulations typically take a long time to equilibrate, and it will be useful to use non-equilibrium data as well.

In the context of scoring the corresponding Dynamic Bayesian Networks, it is possible to model non-equilibrium data using interventions. An intervention is when the value of one or more of the variables, $X_i$, is assigned to a particular value $X_i = x_i$. The Bayesian Network corresponding to a data instance with an intervention simply removes all incoming edges to the variables whose values were assigned and defines $P(X_i = x_i) = 1$.

To model non-equilibrium data, we assume that we assign the value of the first conformation of each data instance ($A$ in Fig. 4.1) instead of selecting it from its equilibrium distribution. This corresponds to an intervention at the variable $A$, and we thus remove the edge from $X$ to $A$ in that DBN. The probability distribution associated with $A$ is then defined as $P(A = a) = 1$, where $a$ is the specific conformation for a given data instance. Since the value of the variable $X$ is determined as a function of the first conformation, we also define the probability $P(X = x) = 1$, where $x = f(a)$.

We can now calculate the maximum likelihood and Bayesian scores of non-equilibrium data. We remove the term corresponding to $P(X)$, and we calculate the term $P(C \mid S)$ using only the data instances $\{Y[m], B[m]\}$ as opposed to the set $\{X[m], A[m]\} \cup \{Y[m], B[m]\}$.
It is also possible to calculate the scores of a mixture of equilibrium and non-equilibrium data [Pe 03].
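To make the role of the prior and of the intervention concrete, the sketch below evaluates only the state-level terms of the Bayesian score (the terms for $P(X)$ and $P(Y \mid X)$ in Eq. 4.14) from a table of transition counts, using the prior parameters of Eq. 4.23. The conformation terms for $P(C \mid S)$ are deliberately left out, so the values are not complete scores and should not be used on their own to compare MSMs with different state definitions. The function names and the counts are hypothetical.

```python
import math


def family_score(alphas, counts):
    """One (variable, parent assignment) term of Eq. 4.14, in natural-log units."""
    alpha_total, count_total = sum(alphas), sum(counts)
    score = math.lgamma(alpha_total) - math.lgamma(alpha_total + count_total)
    for alpha, count in zip(alphas, counts):
        score += math.lgamma(alpha + count) - math.lgamma(alpha)
    return score


def bayesian_state_score(counts, equilibrium=True):
    """State-level part of the Bayesian score for counts[x][y] = M[y, x].

    Uses the prior of Eq. 4.23 (weight 1, uniform): alpha_x = 1/k and
    alpha_{y|x} = 1/k^2. The P(C | S) terms are omitted in this sketch.
    """
    k = len(counts)
    score = 0.0
    if equilibrium:
        # Term for P(X); dropped when X is fixed by an intervention.
        score += family_score([1.0 / k] * k, [sum(row) for row in counts])
    for row in counts:
        # Term for P(Y | X = x), one per starting state x.
        score += family_score([1.0 / k ** 2] * k, row)
    return score


# Hypothetical 2-state table: counts[x][y] transitions observed from x to y.
counts = [[8, 2], [3, 7]]
score_eq = bayesian_state_score(counts)
score_noneq = bayesian_state_score(counts, equilibrium=False)
```

For equilibrium data, the $\Gamma(1/k_S + M[x])$ factors from the $P(X)$ term cancel against those from the $P(Y \mid X)$ terms, which is why only $\Gamma(1 + M)$ and the $1/k_S^2$ factors remain in the state-level part of Eq. 4.24.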

Figure 4.2: The transition probabilities and state definitions for a simple model with 9 conformations ($c_1$ through $c_9$). Self transition probabilities are set so the outgoing probability for each conformation is equal to one. All other transition probabilities not specified are equal to zero. The conformations are grouped into three states ($s_1$, $s_2$, $s_3$), as shown by the dotted lines.

4.3 Results

Model system

To test the methods outlined above, we first use a simple transition model between 9 conformations (Figure 4.2) which has been studied previously [SPS04a]. To generate data from this transition model, we first select a conformation at random from its equilibrium distribution, and then select a transition at random from the possible transitions from that conformation, with the probabilities given in the figure. The transition probabilities given are the transition probabilities at lag time $\tau = 1$. To generate data at different lag times, we can generate sequential transitions and only record data at intervals corresponding to $\tau$, or equivalently, we can generate data from the transition matrix raised to the power $\tau$. To better represent molecular dynamics data, we will assume that each time we visit $c_i$, we generate a unique instance of that conformation, which is not equivalent to other visits to $c_i$ (this is important when estimating $\Theta_{C \mid S}$). While this model has transitions between 9 conformations, Swope et al. have shown that by grouping the conformations into 3 states (shown in Fig. 4.2), the dynamics appear to be Markovian after some lag time [SPS04a].
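The data generation procedure described above can be sketched as follows. The transition matrix `T` is a hypothetical 3-conformation example, not the 9-conformation model of Fig. 4.2, and it is chosen to be symmetric so that its equilibrium distribution is uniform. All names are illustrative.

```python
import random


def mat_mul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(t, tau):
    """Transition matrix at lag time tau: t raised to the integer power tau."""
    n = len(t)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(tau):
        result = mat_mul(result, t)
    return result


def sample_pairs(t, start_dist, tau, m, rng):
    """Draw m independent pairs (conformation at time 0, conformation at time tau)."""
    t_tau = mat_pow(t, tau)
    indices = list(range(len(t)))
    pairs = []
    for _ in range(m):
        a = rng.choices(indices, weights=start_dist)[0]
        b = rng.choices(indices, weights=t_tau[a])[0]
        pairs.append((a, b))
    return pairs


# Hypothetical transition matrix at lag time 1 (rows sum to one).
T = [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]]

# T is symmetric, so the uniform distribution is its equilibrium distribution.
pairs = sample_pairs(T, [1 / 3, 1 / 3, 1 / 3], tau=10, m=1000, rng=random.Random(0))
```

Passing a non-equilibrium `start_dist` gives the kind of data treated with interventions in the previous section.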

Comparison of different state decompositions

As described above, the scoring functions can be used to compare MSMs with different state definitions. In this section, we focus only on the effect of the state definition. We compare the scores of different possible state definitions with $k_S = 3$ over the transition model shown in Fig. 4.2. We computed the scores for the 27 possible MSMs corresponding to all 3-state decompositions of the conformations, where the conformations in a state had to be sequential. For lag times $\tau$ ranging from 1 to 1,000 and total number of data instances $M$ ranging from 1,000 to 50,000, we generated independent data sets from the transition model at the given lag time. We then calculated the maximum likelihood score and Bayesian score for each of the DBNs (Fig. 4.1) corresponding to the 27 MSMs.

Over this range of lag times and amount of data, the highest scoring MSM for both the maximum likelihood and Bayesian scoring functions had the state definition:

$$\{c_1, c_2, c_3\},\ \{c_4, c_5, c_6\},\ \{c_7, c_8, c_9\}. \tag{4.25}$$

The next highest MSMs were consistently the following state definitions:

$\{c_1, c_2, c_3\},\ \{c_4, c_5\},\ \{c_6, c_7, c_8, c_9\}$
$\{c_1, c_2\},\ \{c_3, c_4, c_5, c_6\},\ \{c_7, c_8, c_9\}$
$\{c_1, c_2, c_3\},\ \{c_4, c_5, c_6, c_7\},\ \{c_8, c_9\}$
$\{c_1, c_2, c_3, c_4\},\ \{c_5, c_6\},\ \{c_7, c_8, c_9\}$

The decomposition with the highest score for both scoring functions was the one designed such that equilibration within each state would be faster than transitions out of the state [SPS04a]. Indeed, by examining the eigenvectors of the transition matrix between the 9 conformations, we can see that the conformations grouped together in the highest-scoring decomposition behave the same for the two longest time scales of the system (data not shown). Thus, the scoring functions are able to select meaningful state definitions.

Comparing MSMs with different numbers of states

We now demonstrate how the scoring functions perform when comparing MSMs with different numbers of states. Using the transition model between 9 conformations depicted in Fig. 4.2, we define two Markovian state models over the conformations. The first MSM is defined such that the first three conformations map to one state, the second three conformations map to a second state,

and the last three conformations map to a third state, as shown in Fig. 4.2, and as validated as the best 3-state definition in the previous section. We call the DBN (Fig. 4.1) corresponding to this MSM $G_1$. The second MSM is simply defined such that each conformation maps to a unique state. We call the DBN corresponding to this MSM $G_2$. We compare the scores of these two DBNs for differing lag times $\tau$ and total independent data instances $M$.

The top panel of Fig. 4.3 shows the normalized difference in maximum likelihood scores for the two DBNs and the bottom panel shows the normalized difference in Bayesian scores, computed as the normalized log ratio. Values above zero indicate that DBN $G_2$, corresponding to the MSM with 9 states, is that many orders of magnitude more likely to have produced an average data instance than DBN $G_1$, the MSM with 3 states.

We see that the maximum likelihood score (top panel) always prefers the MSM with 9 states, since the score difference is always greater than zero. The preference is stronger when the lag time $\tau$ is short than when it is long. A lag time of $\tau$ indicates selecting transitions with probability equal to the transition matrix given in Fig. 4.2 raised to the power $\tau$. It is easy to verify that as the lag time increases, the transition probabilities on conformations within the same state in the 3-state MSM become more similar. Therefore, the relative difference of the probabilities of the 3-state MSM and the 9-state MSM is smaller than at shorter lag times, when the transition probabilities within each state are more different. The maximum likelihood score difference is relatively insensitive to changes in the number of data instances.

Conversely, the Bayesian score (bottom panel) differs in preference between the 9-state MSM and the 3-state MSM depending on both the lag time $\tau$ and the number of data instances $M$. The contour indicating equal probability between the two MSMs is shown in bold.
For a fixed amount of data, the Bayesian score prefers the 9-state MSM at short lag times, but switches preference to the 3-state MSM at longer lag times. As discussed above, this is because the transition probabilities for the 3-state MSM are better approximations of the true transition probabilities between the 9 conformations at longer lag times, since the transition probabilities of conformations belonging to a single state become more similar. In addition, the Bayesian score also depends on the number of data instances. For a fixed lag time $\tau$, the Bayesian score prefers the 3-state MSM for low amounts of data and switches to the 9-state MSM at higher amounts of data. This is because, even though the true underlying kinetic model can only be represented by the 9-state MSM, at small amounts of data, the 3-state MSM is more predictive of future data instances since there are fewer parameters which need to be estimated.

Figure 4.3: The difference in scores between a 9-state and 3-state definition of the transition model of 9 conformations for different lag times $\tau$ and number of data instances $M$. The top contour plot shows the average normalized score difference (Eq. 4.17) for the maximum likelihood score. The bottom contour plot shows the average normalized score difference for the Bayesian score, where the bold contour corresponds to zero. Both plots are averaged over 20 independent data sets for each lag time and number of data instances. (Axes in both panels: lag time ($\tau$) and data instances ($M$).)
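The normalized score difference plotted in this figure (Eq. 4.17) is a simple function of two scores. The sketch below computes it and converts it into a per-instance likelihood factor by exponentiating, since the scores are natural logarithms. The score values and the function names are hypothetical and serve only as an example.

```python
import math


def normalized_log_ratio(score_2, score_1, m):
    """Eq. 4.17: difference of two natural-log scores divided by the number
    of data instances m."""
    return (score_2 - score_1) / m


def per_instance_factor(nlr):
    """Average factor by which the second model is more likely than the
    first on a single data instance."""
    return math.exp(nlr)


# Hypothetical scores of two DBNs on the same M = 1000 data instances.
score_g1 = -1500.0
score_g2 = -1480.0

nlr = normalized_log_ratio(score_g2, score_g1, m=1000)
factor = per_instance_factor(nlr)
```

A positive `nlr` means the second model is preferred, and a value of zero corresponds to the bold contour in the bottom panel.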

Table 4.1: Four state definitions for the transition model between 9 conformations.

MSM                        State definition
$\mathrm{MSM}_{orig}$      $\{c_1, c_2, c_3\}$   $\{c_4, c_5, c_6\}$   $\{c_7, c_8, c_9\}$
$\mathrm{MSM}_1$           $\{c_1\}, \{c_2\}, \{c_3\}$   $\{c_4, c_5, c_6\}$   $\{c_7, c_8, c_9\}$
$\mathrm{MSM}_2$           $\{c_1, c_2, c_3\}$   $\{c_4\}, \{c_5\}, \{c_6\}$   $\{c_7, c_8, c_9\}$
$\mathrm{MSM}_3$           $\{c_1, c_2, c_3\}$   $\{c_4, c_5, c_6\}$   $\{c_7\}, \{c_8\}, \{c_9\}$

The Bayesian score is thus superior to the maximum likelihood score in determining an appropriate number of states in a state decomposition. While a maximum likelihood score will always prefer dividing the conformation space into more states, the Bayesian score will only prefer more states if there is sufficient data to justify adding more parameters to the DBN. When the transition probabilities from the suggested new states are significantly different from one another, as in the above case for short lag times $\tau$, less data is necessary for the justification, and as the transition probabilities become more similar, as for longer lag times $\tau$, more data is necessary for the justification.

Determination of weakest state

By designing state decompositions which are hierarchical, we can determine which state in a given decomposition should be subdivided in order to create a better decomposition. In this section, we will show how this state corresponds to the least Markovian state which has sufficient data instances to warrant further subdivision.

We define 4 MSMs over the same transition model. The first MSM, $\mathrm{MSM}_{orig}$, maps the conformations to 3 states, as shown in Fig. 4.2 and as described in the previous sections. We also define three MSMs, each with 5 states, which subdivide one of the states into 3 states corresponding to the conformations. The MSMs which subdivide states $s_1$, $s_2$, or $s_3$ are named $\mathrm{MSM}_1$, $\mathrm{MSM}_2$, or $\mathrm{MSM}_3$, respectively. The state definitions are summarized in Table 4.1. We compute the maximum likelihood and Bayesian scores of the DBNs (Fig. 4.1) corresponding to each of the 4 MSMs over a range of lag time $\tau$ and total number of data instances $M$.

Figure 4.4 shows the normalized score difference between either $\mathrm{MSM}_1$, $\mathrm{MSM}_2$, or $\mathrm{MSM}_3$ and $\mathrm{MSM}_{orig}$. Scores greater than zero indicate the given subdivided MSM is preferred, and scores less than zero indicate the original MSM is preferred. The top panels show the normalized score difference for a constant amount of data $M = 10{,}000$

and varying lag time $\tau$ for the maximum likelihood score (left) and the Bayesian score (right). If we were to compute the differences of the scores of the subdivided MSMs, $\mathrm{MSM}_1$, $\mathrm{MSM}_2$, and $\mathrm{MSM}_3$, we would see that both scoring functions prefer $\mathrm{MSM}_1$, followed by $\mathrm{MSM}_2$, followed by $\mathrm{MSM}_3$. This corresponds to the scoring functions wanting to subdivide state $s_1$, $s_2$, and $s_3$ in that order. By calculating the eigenvalues and eigenvectors of the transition matrix between all 9 conformations, we see that state $s_1$ takes the longest time to equilibrate within the state, followed by state $s_2$, followed by state $s_3$. The scoring functions are thus able to discern that subdividing state $s_1$ would improve the MSM the most, since $s_1$ has the longest internal equilibration time, and thus is the least Markovian.

Figure 4.4: Comparison of MSMs corresponding to the subdivision of states. The normalized score difference for either the maximum likelihood score (left panels) or the Bayesian score (right panels) is shown for constant number of data instances $M = 10{,}000$ (top panels) or constant lag time $\tau = 400$ (bottom panels). Each line compares the listed MSM with $\mathrm{MSM}_{orig}$. Each data point shows the mean and standard deviation as calculated from 20 independent data sets of the given size and lag time. (Horizontal axes: lag time ($\tau$) in the top panels and data amount ($M$) in the bottom panels.)

The bottom panels of Fig. 4.4 show the normalized score difference for a constant lag time $\tau = 400$ and differing amounts of data. We see that the maximum likelihood score is relatively insensitive to the amount of data, though at very low amounts of data the preference for the subdivided MSMs is slightly higher. This is because there are greater fluctuations in the empirical data, causing more deviation between transition probabilities of conformations belonging to a single state in $\mathrm{MSM}_{orig}$. The Bayesian score, on the other hand, prefers the original MSM for low amounts of data and only prefers the subdivided MSMs when there is sufficient data to be able to correctly parameterize the additional transition probabilities.

For this example, each state has the same equilibrium probability, and thus there are the same number of data instances, on average, for transitions out of each of the three states. We can vary the amount of data for each state by not selecting the first conformation ($A$ in Fig. 4.1) from its equilibrium distribution, and then model the resulting DBN using interventions as discussed above. When we do this, the Bayesian score correctly varies the preference between the three subdivided MSMs depending on the number of data samples from each state, while the maximum likelihood score always retains the same preferences as in the equal data case (data not shown). Thus, the Bayesian score is better at discriminating when it is possible to subdivide a state to produce a better model.

Alanine peptide

We now evaluate how well the scoring functions perform on several previously generated state decompositions of the terminally blocked alanine peptide. We use a data set consisting of 975 trajectories from the 400 K replica of a 20 ns/replica parallel tempering simulation with conformations stored every 0.1 ps [CSPD06b]. The peptide was modeled by the AMBER parm96 forcefield [KDC+97], and solvated in TIP3P water [JCM+83].
Full simulation details can be found in Ref. [CSPD06b].

The alanine peptide molecule is depicted on the left side of Fig. 4.5, with the main degrees of freedom, the $\phi$ and $\psi$ torsion angles, labeled. The middle panels show four previously computed state decompositions for the alanine peptide, projected into the $\phi$-$\psi$ space (Chapter 3). The right panels show the implied time scales as a function of lag time $\tau$, and the metastability $Q$, computed at lag time $\tau = 0.1$ ps. The implied time scales are calculated from the non-unit eigenvalues of the transition matrix as

$$\tau_k = -\frac{\tau}{\ln \lambda_k}, \tag{4.26}$$


where $\tau$ is the lag time, $\lambda_k$ is the $k$th eigenvalue of the transition matrix, and $\tau_k$ is the corresponding $k$th implied time scale [SPS04a]. If the dynamics over the states in the MSM are Markovian after some lag time $\tau_{int}$, then the implied time scales will be constant for $\tau > \tau_{int}$. The metastability $Q$ is calculated as the sum of the self-transition probabilities of the transition probability matrix.

Figure 4.5: Several state decompositions for the terminally blocked alanine dipeptide. The $\phi$ and $\psi$ angles of the alanine peptide are shown in the left figure. Four state decompositions, $\mathrm{MSM}_1$, $\mathrm{MSM}_2$, $\mathrm{MSM}_3$, and $\mathrm{MSM}_4$, are shown in the center panels, with the state definitions labeled for $\mathrm{MSM}_1$. The right panels show the implied time scales as a function of lag time for each MSM, as well as the metastability $Q$, calculated at a lag time of $\tau = 0.1$ ps.

We compare the MSMs defined by the four 6-state decompositions depicted in Fig. 4.5: a manually defined good state decomposition ($\mathrm{MSM}_1$), a manually defined poor state decomposition ($\mathrm{MSM}_2$), an automatically generated state decomposition nearly equivalent to the good decomposition ($\mathrm{MSM}_3$), and an automatically generated state decomposition which groups together two of the manually defined good states (3 and 4 as labeled in Fig. 4.5) and subdivides another of the

manually defined good states (5 as labeled in Fig. 4.5) ($\mathrm{MSM}_4$). We calculate the maximum likelihood and Bayesian scores for the DBNs (Fig. 4.1) corresponding to the four MSMs over a range of lag times.

Figure 4.6 shows the maximum likelihood scores (left panels) and Bayesian scores (right panels) for the four MSMs. We see that the manually defined poor state decomposition ($\mathrm{MSM}_2$) scores much worse than the other state decompositions for both the maximum likelihood and Bayesian scores over all lag times. The scores of the other three state decompositions are more similar. This is consistent with other metrics for evaluating state decompositions, namely the metastability and the implied time scales shown with the state decompositions.

Figure 4.6: Comparison of different state definitions for the terminally blocked alanine peptide. The left panels show the normalized maximum likelihood score difference for the four MSMs for different lag times and the right panels show the Bayesian scores. The differences are calculated with respect to the average score over the four MSMs (top) or with respect to the average of the three good MSMs, $\mathrm{MSM}_1$, $\mathrm{MSM}_3$, and $\mathrm{MSM}_4$ (bottom). (Horizontal axes: lag time (ps).)

For the three good state definitions, there are some variations in the preferences with lag time, which are emphasized in the bottom panels of the score comparison figure. For short lag times, both the maximum

likelihood and the Bayesian scores prefer $\mathrm{MSM}_1$ and $\mathrm{MSM}_3$ over $\mathrm{MSM}_4$. By examining the eigenvectors at short lag times $\tau$, we see that the average time to transition between manually defined states 3 and 4 (as labeled on Fig. 4.5) is approximately 1 to 2 ps. The preference for $\mathrm{MSM}_1$ and $\mathrm{MSM}_3$, which separate these states, over $\mathrm{MSM}_4$, which groups these states together, indicates that we have sufficient data to support separating these states, in order to capture the different transition probabilities from these states, at short lag times $\tau$.

At longer lag times, the maximum likelihood scores and Bayesian scores disagree over which is the preferred MSM. The maximum likelihood scores give equal preference to the three good decompositions, while the Bayesian score prefers $\mathrm{MSM}_4$ over $\mathrm{MSM}_1$ and $\mathrm{MSM}_3$. For a given data set, the number of independent data instances decreases with lag time, since we take non-overlapping pairs of conformations, separated by lag time $\tau$, to ensure independence of the data instances. At the longer lag times $\tau$, the transition probabilities from manually defined states 3 and 4 are more similar, and we no longer have sufficient data to justify separating them into their own states. The preferences determined by the Bayesian score take into account the amount of data, and thus differ from the maximum likelihood score preferences when the number of data instances decreases.

The previous scoring metrics, metastability and implied time scales, as shown in Fig. 4.5, were unable to determine the differences between manual decomposition $\mathrm{MSM}_1$, automatic decomposition $\mathrm{MSM}_3$, and automatic decomposition $\mathrm{MSM}_4$. The maximum likelihood and Bayesian scoring functions, on the other hand, were able to resolve finer preferences between the MSMs, which can be validated by looking at the eigenvectors of the transition matrix.
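For reference, the two earlier metrics mentioned above, the implied time scales of Eq. 4.26 and the metastability $Q$, can be sketched as follows. The 2-state transition matrix is hypothetical, and the eigenvalues are obtained with a shortcut that holds only for 2 x 2 row-stochastic matrices; for a larger matrix a numerical eigenvalue routine would be needed.

```python
import math


def implied_time_scales(eigenvalues, tau):
    """Eq. 4.26: tau_k = -tau / ln(lambda_k), skipping the unit eigenvalue.

    Eigenvalues outside the open interval (0, 1) are also skipped, because
    their logarithm is undefined or gives no positive time scale (a choice
    made for this sketch).
    """
    return [-tau / math.log(lam) for lam in eigenvalues if 0.0 < lam < 1.0]


def metastability(transition_matrix):
    """Q: the sum of the self-transition probabilities (the trace)."""
    return sum(row[i] for i, row in enumerate(transition_matrix))


# Hypothetical 2-state transition matrix estimated at lag time tau = 0.1 ps.
T = [[0.9, 0.1],
     [0.2, 0.8]]
tau = 0.1

# For a 2 x 2 row-stochastic matrix the eigenvalues are 1 and (trace - 1).
eigenvalues = [1.0, metastability(T) - 1.0]
time_scales = implied_time_scales(eigenvalues, tau)
q = metastability(T)
```

Repeating the calculation with transition matrices estimated at several lag times and checking whether `time_scales` levels off corresponds to the implied time scale test described for Fig. 4.5.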
4.4 Conclusions

It is becoming common to study the dynamic properties of biomolecular systems through computer simulations. From these simulations, it is possible to build a Markovian state model which assumes Markovian transitions between discrete regions of the conformation space. In the previous chapters, we have given algorithms for automatically decomposing the conformation space into states (Chapter 3), and methods for efficiently computing kinetic properties such as the average time for the molecule to fold (Chapter 2). The main intuition behind a Markovian state model is that if we define states such that the conformations within a state transition between each other rapidly, then the dynamics are well approximated by a Markov chain over the states. If the conformations within a state interconvert

quickly, then their outgoing transition probabilities will become similar at lag times longer than the time it takes for the interconversion. In this chapter, we showed how to convert a MSM into a Dynamic Bayesian Network, and then how to use functions for scoring DBNs, the maximum likelihood scoring function and the Bayesian scoring function, to compare between different MSMs. These functions give higher scores when the conformations within a state have similar transition probabilities. The key advantage of the Bayesian scoring function over the maximum likelihood scoring function is that the maximum likelihood scoring function always prefers MSMs which divide the conformation space into more states, while the Bayesian scoring function will only prefer a MSM with more states if there is sufficient data to characterize the new states. For a simple transition model between 9 conformations, we have shown how the scoring functions are able to select the best 3-state decomposition of the space. We have also shown how the Bayesian scoring function can determine the appropriate number of states for a given amount of data. One of the unresolved problems in performing a state decomposition (Chapter 3) is in knowing how small (in terms of number of member conformations) we can allow a state to become. Previously, we bounded the lower size of a state based on some arbitrary number of conformations. However, we can use the Bayesian scoring function to determine the optimal size of any given state. We can generate a new MSM which subdivides any state into substates, and then, as we showed for the model system, compare the Bayesian scores of the two MSMs, thus determining whether we have sufficient data in that region of conformation space to support the new state definition. We have also compared different state definitions of the terminally blocked alanine peptide.
Four state decompositions for the alanine peptide were previously defined, a manually defined good decomposition, a manually defined poor decomposition, and two decompositions automatically created by the state decomposition algorithm (Chapter 3). The four decompositions were analyzed previously using the implied time scales as a function of lag time and the metastability of the decomposition. These metrics found that the manually defined poor decomposition was worse than the other three decompositions, which performed nearly equivalently. Using the maximum likelihood and Bayesian scoring functions, we were better able to characterize these state decompositions. The scoring functions gave preferences between the three good decompositions as a function of lag time, which were validated by looking at the eigenvalues and eigenvectors of the transition matrices. The Bayesian scoring function additionally gave preferences based on the decreasing amount of independent data with lag time. This chapter has shown how to distinguish between different MSMs to determine which is the best MSM for the data set. But, being the best MSM for the data set does not imply that the MSM

is adequate for studying the underlying problem of biological importance. In order to use a MSM to calculate kinetic properties of the system, the transitions between the states need to be Markovian at the lag time used to calculate the transition probabilities. We can calculate this lag time using the implied time scale test previously mentioned [SPS04a]. If the MSM is not Markovian at the lag time at which we wish to use the model, we can try to reduce this Markovian lag time $\tau_{\mathrm{int}}$ by subdividing states and testing the new state definitions using the Bayesian score as above. However, if the Bayesian score prefers the original MSM, this indicates that we need more data in that region of the conformation space. Integrating the state decomposition problem, model selection problem, and simulation planning problem to build MSMs is the subject of future work.

Chapter 5

Error analysis methods

Once a Markovian state model (MSM) has been built for a given set of simulation data, there are numerous properties which we can calculate from the model. In this chapter, we analyze how errors in the model caused by finite sampling affect the calculated mean first passage time (MFPT) from the initial to the final states. We give different methods with various approximations to determine the precision of the reported MFPTs. These approximations are validated on an 87 state toy Markovian system. In addition, we propose an efficient and practical sampling algorithm that uses these error calculations to build a MSM that has the same precision in mean first passage time values but requires an order of magnitude fewer samples. We also show how these methods can be scaled to large systems using sparse matrix methods.

5.1 Introduction

To meet the challenge of modeling the conformational dynamics of biological macromolecules over long time scales, much recent effort has been devoted to constructing stochastic kinetic models, often in the form of discrete-state Markovian state models, from short molecular dynamics simulations [SSP04, SPS04a, SPS+04b]. It is efficient to calculate kinetic properties such as the probability that a conformation will fold ($P_{\mathrm{fold}}$) or the average time taken for a given conformation to fold (MFPT) from this type of model [SSP04]. The MSM also allows one to easily combine and analyze simulation data started from various conformations and naturally handles intermediate states and traps. This approach has been applied to small protein systems [SSP04, SPS+04b], a non-biological polymer [EP04, EPP05b], and vesicle fusion [KKS+06] with good agreement with experimental rates. While these kinetic models agree well with experiments, only a single value for the rate was

used in the comparison. It is also important to determine the uncertainty in this value, so one can know the confidence of the results. One main source of error is caused by grouping conformations into states and assuming that transitions between these states are Markovian. If we look at a protein and consider each conformation as its own state, then on time scales of tens of picoseconds and longer the transitions follow a Markovian pattern. Unfortunately, sampling transitions between an infinite number of states is impractical; therefore, the Markovian state model groups conformations into a finite number of discrete states. However, it has been shown that if the conformations are grouped incorrectly, the state space is no longer Markovian, and any analysis that assumes a Markovian process may produce incorrect results [SPS04a]. Even if the states are defined such that the transitions between them are Markovian, the results could still be in error. This second source of error results from the finite sampling of transitions between states, which gives uncertainties in the transition probability estimates and in turn leads to uncertainties in the values we calculate, such as the MFPT.

There has been some recent work on error analysis in these kinetic models of a protein conformation space. Swope et al. focused on the problem of defining states which meet the Markovian criteria. They provided tests for whether or not a given state space definition is history independent [SPS04a]. Here, we will focus on the error caused by finite sampling. Some recent work has involved a Bayesian approach to sampling possible transition probability matrices and solving each sample for the value of interest [SKH05]. While this approach can estimate errors, it does not scale well for systems with large numbers of states.
In addition, we want to determine the transitions that contribute the most to the uncertainty, which the current techniques do not allow. If we can identify these transitions, additional simulations can be started from them to increase the overall precision. In this chapter, we will discuss novel methods for computing the error in a Markovian state model for molecular dynamics caused by finite sampling of transitions. We will give different methods for calculating the error from finite sampling and show how it translates into errors in the mean first passage time and other estimates. The methods employ a set of approximations and lead to an efficient and practical closed-form solution for the uncertainty. We will also present a new sequential sampling algorithm that uses these error estimates to improve the sampling efficiency by over an order of magnitude. In addition, we discuss how the use of sparse matrix techniques will allow these methods to scale to systems with large numbers of states. These algorithms are then tested, and the approximations validated, on a toy Markovian system.

Methods

Molecular dynamics simulations are often used to understand protein kinetics. A question then arises as to how best to analyze these molecular dynamics trajectories. In Chapter 3, we discussed new methods for clustering conformations from the trajectories into discrete states, which tried to ensure that the transitions between the states were Markovian, or history independent. The transition probabilities between states were estimated by counting the number of times each transition was observed in the trajectories. From this Markovian state model (MSM), it was possible to efficiently calculate kinetic properties such as the $P_{\mathrm{fold}}$ and the mean first passage time (MFPT). In this chapter, we are interested in determining the uncertainty in the kinetic results that can be calculated from this graph-based model.

We will assume that we can define a Markovian state space for the protein, though forming states that meet this criterion is not a trivial task [SPS04a, CSP+07]. Even with this assumption, one can still have errors in the results. Since we can only finitely sample the transitions between states, we will have some statistical uncertainty in the transition probabilities. Therefore, any value we calculate from the transition probabilities will also have an uncertainty associated with it.

In this section, we will first discuss how to calculate the MFPT from the transition probabilities. We then derive the distribution for the transition probabilities, and define both sampling and non-sampling based methods for calculating the distribution of the MFPT. From these error estimates, we develop an efficient adaptive sampling technique designed to increase precision.
Lastly, we will show how to use sparse matrix manipulations that permit the scaling of these methods to systems with large numbers of states.

Mean first passage times

In a Markovian state model, we represent the conformation space by $K$ discrete states, each of which corresponds to some distinct group of protein conformations. Let us define the probability of transitioning from state $i$ to state $j$ at a time step of $\Delta t$ as $p_{ij}$. We also assume that these states are Markovian, i.e., that the transitions between them are history independent at $\Delta t$. We can use the transition probabilities to calculate kinetic properties of the system such as the probability of folding or mean first passage time to reach the final state. These quantities are defined by sets of linear equations that are based on the transition probabilities. For example, the equations for the mean first passage time, $x$, from any state to the final state,

are of the form

\[
x_i = \begin{cases} \Delta t + \sum_{j=1}^{K} p_{ij}\, x_j, & i \neq K, \\ 0, & i = K, \end{cases} \tag{5.1}
\]

where the $K$th state represents the final state [SSP04]. Writing this in matrix form, we have

\[
\begin{bmatrix}
p_{11} - 1 & p_{12} & \cdots & p_{1K} \\
p_{21} & p_{22} - 1 & \cdots & p_{2K} \\
\vdots & & \ddots & \vdots \\
p_{(K-1)1} & \cdots & p_{(K-1)(K-1)} - 1 & p_{(K-1)K} \\
0 & \cdots & 0 & 1
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{K-1} \\ x_K \end{bmatrix}
=
\begin{bmatrix} -\Delta t \\ -\Delta t \\ \vdots \\ -\Delta t \\ 0 \end{bmatrix}, \tag{5.2}
\]

where the last line is the boundary condition that the mean first passage time from the final state is zero. The matrix on the left side of Eq. 5.2 will be referred to as $A$, with rows $a_i$. We will use these mean first passage time equations as an example throughout the chapter.

Transition probability distribution

Finite sampling causes uncertainties in the estimates of the transition probabilities between states. In this section, we derive a distribution over possible transition probability vectors. Define $p_{ij}$ as the actual transition probability from state $i$ to $j$ at a time step of $\Delta t$. The sum of the transition probabilities from state $i$ is equal to 1:

\[
\sum_{j=1}^{K} p_{ij} = 1. \tag{5.3}
\]

We do not know these actual transition probabilities, but we can estimate them by sampling transitions between states. Since we make the assumption that our state space is Markovian, each transition sample originating from state $i$ will be a random variable with $K$ possible values occurring with probabilities $p_{ij}$ for $j = 1, \ldots, K$. Define the transition count $z_{ij}$ as the total number of transition samples which start in state $i$ and end in state $j$, and define $n_i$ as the total number of samples originating from state $i$:

\[
\sum_{j=1}^{K} z_{ij} = n_i. \tag{5.4}
\]

The distribution of the $z_{ij}$ variables follows the multinomial distribution with parameters $n_i$, $p_{i1}$,

$p_{i2}, \ldots, p_{iK}$ [JKB97]. From these transition counts, we can calculate the maximum likelihood estimates of the transition probabilities, $\hat{p}_{ij}$, which, for the multinomial distribution, are simply the number of transitions from state $i$ to state $j$ divided by the total number of transitions from state $i$ [JKB97]. Thus,

\[
\hat{p}_{ij} = \frac{z_{ij}}{n_i}. \tag{5.5}
\]

However, the maximum likelihood estimates give no indication of the uncertainties in the transition probabilities. Using these same transition counts, $z_{ij}$, and Bayesian analysis, we can compute the distribution over all possible vectors of transition probabilities, as opposed to simply calculating the most likely transition probability vector. Each set of possible transition probabilities, $p_{i1}, p_{i2}, \ldots, p_{iK}$, where $0 \le p_{ij} \le 1$ and $\sum_{j=1}^{K} p_{ij} = 1$, has some chance of producing the transition counts, $z_{ij}$, that we observed. The probability of a particular vector $p_i$ being the true transition probability vector, given the observed transition counts, is, from Bayes' rule,

\[
P(p_i \mid z_i) \propto P(z_i \mid p_i)\, P(p_i) \propto p_{i1}^{z_{i1}}\, p_{i2}^{z_{i2}} \cdots p_{iK}^{z_{iK}}\, P(p_i), \tag{5.6}
\]

where $P(p_i)$ is the prior probability over the transition probability vectors, i.e., the distribution of transition probability vectors before observing any data. A typical choice for the prior is the Dirichlet distribution, the conjugate prior of the multinomial distribution. This means that if the prior, $P(p_i)$, is a Dirichlet distribution, then the posterior, $P(p_i \mid z_i)$, is also a Dirichlet distribution [KBJ00]. The Dirichlet distribution with variables $p$ and parameters $u$ is defined as

\[
\mathrm{Dirichlet}(p; u) = \frac{1}{Z(u)} \prod_{i=1}^{K} p_i^{u_i - 1}, \tag{5.7}
\]

where $Z(u)$ is a normalizing constant defined in Appendix A.
If we define the prior of the transition probabilities as a Dirichlet distribution with parameters $\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{iK}$ and we observe transition counts $z_{i1}, z_{i2}, \ldots, z_{iK}$, the posterior of the transition probabilities is a Dirichlet distribution with parameters $\alpha_{i1} + z_{i1}, \alpha_{i2} + z_{i2}, \ldots, \alpha_{iK} + z_{iK}$. For notational convenience, we define the Dirichlet counts as

\[
u_{ij} = \alpha_{ij} + z_{ij}. \tag{5.8}
\]

Therefore, assuming a Dirichlet prior, the distribution of the transition probabilities, $p_i$, given the observed data counts is $\mathrm{Dirichlet}(p_i; u_i)$.
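To make the update concrete, the following minimal Python sketch computes the maximum likelihood estimate of Eq. 5.5, the Dirichlet counts of Eq. 5.8, the posterior mean, and one draw from the posterior for a single state. The counts and the function names are hypothetical, and the sampler normalizes independent gamma variates, a standard construction that is not necessarily the method of Appendix A.

```python
import random

def mle_probabilities(z_row):
    """Maximum likelihood estimate z_ij / n_i for one row of counts (Eq. 5.5)."""
    n = sum(z_row)
    return [z / n for z in z_row]

def posterior_counts(z_row, alpha_row):
    """Dirichlet counts u_ij = alpha_ij + z_ij for one row (Eq. 5.8)."""
    return [a + z for a, z in zip(alpha_row, z_row)]

def posterior_mean(u_row):
    """Mean of Dirichlet(u): each u_ij divided by the row total."""
    w = sum(u_row)
    return [u / w for u in u_row]

def sample_dirichlet(u_row, rng):
    """Draw one probability vector from Dirichlet(u) by normalizing
    independent Gamma(u_j, 1) variates (requires every u_j > 0)."""
    g = [rng.gammavariate(u, 1.0) for u in u_row]
    total = sum(g)
    return [x / total for x in g]

# Hypothetical transition counts out of one state of a 3-state model.
z = [8, 2, 0]
alpha = [1.0, 1.0, 1.0]            # uniform prior (an assumption of this example)
u = posterior_counts(z, alpha)
p_hat = mle_probabilities(z)       # the unobserved transition gets probability 0
p_bar = posterior_mean(u)          # the unobserved transition stays positive
p_draw = sample_dirichlet(u, random.Random(0))
```

With a positive prior, the transition that was never observed keeps a small nonzero posterior probability, whereas the maximum likelihood estimate sets it to exactly zero.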

Choosing the parameters for the prior completes the description of the distribution. The Dirichlet distribution is non-informative for any parameter $\alpha_{ij} = 0$. If we set $\alpha_{ij} = 0$ and do not observe any transitions from state $i$ to state $j$, the posterior of $p_{ij}$ will always equal zero. However, for molecular dynamics, not seeing a particular transition over some finite sampling does not imply that the transition can never occur. So, we will restrict ourselves to positive priors. Possible choices for the prior distribution are the uniform distribution, $\alpha_{i1} = \alpha_{i2} = \cdots = \alpha_{iK} = 1$, and the symmetric Dirichlet, $\alpha_{i1} = \alpha_{i2} = \cdots = \alpha_{iK}$. In the limit, as the sampling (and therefore the transition counts) increases, the distribution of the transition probabilities no longer depends on the choice of the prior distribution, which makes further calculations insensitive to that choice. It will be useful to state the expected values of the posterior distribution of the transition probabilities for future reference, where $w_i$ is a normalizing weight variable [KBJ00]:

\[
\bar{p}_{ij} = E(p_{ij}) = \frac{u_{ij}}{w_i}, \qquad w_i = \sum_{j=1}^{K} u_{ij}. \tag{5.9}
\]

Sampling based error analysis methods

In Chapter 2, we used $\hat{p}_{ij}$ in Eq. 5.2 to calculate an estimate of the true values of the mean first passage times, $x$. We now wish to calculate the distribution of $x$ given the distribution of the $p_i$ values. In particular, we are interested in the MFPT from the initial state to the final state, since this value can be converted to the rate of folding and compared with experiments. Therefore, we are interested in the distribution of the term $x_1$. We propose a number of sampling-based methods for calculating the distribution of $x_1$. All of these methods involve repeatedly generating a sample of transition probabilities and converting it to a sample of $x_1$.
We can either sample the transition probabilities from the Dirichlet distributions or from approximate multivariate normal (MVN) distributions. Then, we can either solve the above set of linear equations or we can substitute into a first order Taylor series approximation to the set of equations. Each of these four options will be described below, and can be combined to give four sampling-based methods for calculating error, which are summarized below.
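Every method in this section ultimately evaluates Eq. 5.2 for some matrix of transition probabilities. As a point of reference, here is a minimal Python sketch that builds the matrix $A$ and the right-hand side of Eq. 5.2 and solves for the MFPT vector. The 3-state matrix `P` is a made-up example, the last state is taken to be the final state, and the small Gaussian elimination routine stands in for whatever linear algebra library one would actually use.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def mfpt_system(P, dt):
    """Build A and the right-hand side of Eq. 5.2; the last state is final."""
    K = len(P)
    A = [[P[i][j] - (1.0 if i == j else 0.0) for j in range(K)]
         for i in range(K - 1)]
    A.append([0.0] * (K - 1) + [1.0])      # boundary condition x_K = 0
    b = [-dt] * (K - 1) + [0.0]
    return A, b

def mean_first_passage_times(P, dt):
    """Return the MFPT from every state to the final state."""
    A, b = mfpt_system(P, dt)
    return solve(A, b)

# Hypothetical 3-state transition matrix; state 3 is the final state.
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.00, 1.00]]
x = mean_first_passage_times(P, dt=1.0)
```

For this example the equations reduce to $x_1 = 2 + x_2$ and $x_2 = 2 + x_1/2$, so the MFPTs are 8 and 6 time steps from states 1 and 2 respectively.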

Sampling from the Dirichlet distribution

As shown above, if we assume a Dirichlet prior, the posterior distribution of $p_i$, given the transition counts, is a Dirichlet distribution with parameters $u_i$, as defined in Eq. 5.8. A method for generating samples from the Dirichlet distribution is given in Appendix A. A sample of the $A$ matrix, defined in Eq. 5.2, consists of $K$ independent samples from Dirichlet distributions, each corresponding to one row of the matrix. As was shown in Appendix A, each sample from a Dirichlet distribution takes expected time $O(8KQ)$, where $Q$ is the time to sample from a normal distribution. Therefore, each sample of transition probabilities requires time $O(8K^2 Q)$.

Sampling from a Multivariate Normal distribution

Sampling from the Dirichlet distribution is very expensive. In an attempt to reduce this cost, the true Dirichlet distribution of the $p_i$ parameters is approximated by a multivariate normal distribution (MVN). In addition, the MVN has some nice properties that we will exploit in the non-sampling based method described later. If $p_i$ is distributed as $\mathrm{Dirichlet}(p_i; u_i)$, then by the central limit theorem, the distribution of $p_i$ converges to a multivariate normal distribution [Rao73] with mean $\mu_i$ and covariance matrix $\Sigma_i$ given by

\[
\mu_i = \frac{u_i}{w_i}, \tag{5.10}
\]

\[
\Sigma_i = \frac{1}{w_i^2 (w_i + 1)} \left[ w_i\, \mathrm{Diag}(u_i) - u_i u_i^T \right], \tag{5.11}
\]

where the superscript $T$ denotes the transpose and $\mathrm{Diag}(u_i)$ represents a matrix with entries $u_{ij}$ along the diagonal. A method for creating samples of the $p_i$ variables from this approximate MVN distribution is given in Appendix B. For each sample of the $A$ matrix, we must generate $K$ independent samples from the MVN distributions. As described in Appendix B, each sample of a MVN distribution requires time $O(KQ + K)$, where $Q$ is the time to sample from a normal distribution. Therefore, each sample of transition probabilities takes time $O(K^2 Q + K^2)$.
In addition, there is a one-time cost of O(K) for each of the K MVN distributions. This approximation assumes that the central limit theorem holds and that the transition probabilities are well approximated by multivariate normal distributions. One drawback of the MVN approximation is that it permits negative values of the transition probabilities. While these negative values are invalid from a physical perspective, they generally do not affect the error calculations.
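As an illustration of Eqs. 5.10 and 5.11, the short Python sketch below computes the mean vector and covariance matrix of the MVN approximation for one row of Dirichlet counts. The counts are hypothetical and the function name is not from the original text; drawing samples from the resulting distribution (Appendix B) is not shown.

```python
def mvn_parameters(u_row):
    """Mean (Eq. 5.10) and covariance (Eq. 5.11) of the MVN approximation
    to Dirichlet(u) for a single row of Dirichlet counts."""
    K = len(u_row)
    w = sum(u_row)
    mu = [u / w for u in u_row]
    scale = 1.0 / (w * w * (w + 1.0))
    cov = [[scale * ((w * u_row[j] if j == k else 0.0) - u_row[j] * u_row[k])
            for k in range(K)]
           for j in range(K)]
    return mu, cov

# Hypothetical Dirichlet counts for one state of a 3-state model.
u = [9.0, 3.0, 1.0]
mu, cov = mvn_parameters(u)
row_sums = [sum(row) for row in cov]
```

Each row of the covariance matrix sums to zero, which reflects the constraint that the transition probabilities out of a state sum to one. The covariance is therefore singular, and a sampler has to account for that.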

Solving sets of linear equations

Once we have generated a sample of transition probabilities, and therefore of the $A$ matrix, from either Dirichlet or MVN distributions, we can simply solve Eq. 5.2 to find the MFPT vector. This can be done by factoring the matrix into the form $A = LU$, where $L$ is a lower triangular matrix and $U$ is an upper triangular matrix with unit entries along the diagonal, followed by forward and back substitutions [GvL96]. The cost of this algorithm is $(1/3)K^3$ for the factoring plus $O(K^2)$ for the substitutions.

Taylor series approximation

Solving the set of linear equations for each sample directly is expensive. Instead, we can approximate the solution using a first order Taylor series expansion. The mean first passage time from the initial state, as given by Eq. 5.2, depends on all the parameters $a_{ij}$ as well as the parameter $\Delta t$. We assume for now that $\Delta t$ is a constant, but can modify this if we allow variable time steps in the simulation data. Thus,

\[
x_1 = f(a_{11}, a_{12}, \ldots, a_{KK}) = f(A). \tag{5.12}
\]

The function $f$ does not, in general, have a simple form. We therefore perform a first order Taylor series expansion of $x_1$ around the expected values of the parameters, as given in Eq. 5.9:

\[
x_1 = \bar{x}_1 + \Delta x_1 = f(\bar{A}) + \left.\frac{\partial f}{\partial a_{11}}\right|_{\bar{A}} \Delta a_{11} + \left.\frac{\partial f}{\partial a_{12}}\right|_{\bar{A}} \Delta a_{12} + \cdots + \left.\frac{\partial f}{\partial a_{KK}}\right|_{\bar{A}} \Delta a_{KK}, \tag{5.13}
\]

where $\bar{x}_1 \equiv f(\bar{A})$ and the $\Delta a_{ij}$ are small perturbations in the parameters. Thus,

\[
\Delta x_1 = \left.\frac{\partial f}{\partial a_{11}}\right|_{\bar{A}} \Delta a_{11} + \left.\frac{\partial f}{\partial a_{12}}\right|_{\bar{A}} \Delta a_{12} + \cdots + \left.\frac{\partial f}{\partial a_{KK}}\right|_{\bar{A}} \Delta a_{KK}. \tag{5.14}
\]

Appendix C gives an efficient way of computing all the terms of the form $\partial f / \partial a_{ij}|_{\bar{A}}$ in this expansion. Once we have generated a sample of the transition probabilities from either Dirichlet or MVN distributions, we convert it to a sample of $a_{ij}$ using Eq. 5.2 and then to a sample of $\Delta a_{ij}$ by subtracting the expected values $\bar{a}_{ij}$. We next substitute these values into the expansion to find a sample from the distribution of $x_1$. There are $K^2$ terms in Eq. 5.13, so the substitution will take time $O(K^2)$ per sample.
In addition, as shown in Appendix C, there is a one-time cost of $O((1/3)K^3 + 3K^2)$ to both solve for $\bar{x}_1 = f(\bar{A})$ and generate the partial derivative terms in the Taylor series.
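Appendix C is not reproduced here, so the Python sketch below uses a standard adjoint identity as a stand-in, which may or may not match the appendix's procedure. Differentiating $A x = b$ gives $\partial x_1 / \partial a_{ij} = -y_i x_j$, where $y$ solves $A^T y = e_1$ and $e_1$ is the first unit vector. All $K^2$ partial derivatives therefore follow from two linear solves. The function names and the 3-state example are illustrative assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def mfpt_and_sensitivities(A, b):
    """Return x = A^{-1} b and S with S[i][j] = d x_1 / d a_ij = -y_i x_j,
    where y solves the adjoint system A^T y = e_1."""
    K = len(b)
    x = solve(A, b)
    A_t = [list(col) for col in zip(*A)]
    y = solve(A_t, [1.0] + [0.0] * (K - 1))
    S = [[-y[i] * x[j] for j in range(K)] for i in range(K)]
    return x, S

# A and b of Eq. 5.2 for a hypothetical 3-state model with dt = 1.
A = [[-0.50, 0.50, 0.00],
     [0.25, -0.50, 0.25],
     [0.00, 0.00, 1.00]]
b = [-1.0, -1.0, 0.0]
x, S = mfpt_and_sensitivities(A, b)

# First-order prediction of x_1 after a small perturbation of a_12.
delta = 1e-3
x1_taylor = x[0] + S[0][1] * delta
```

The first-order prediction can be checked against a direct solve of the perturbed system; for small perturbations the two agree up to second-order terms.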

| | Dirichlet distribution | Multivariate normal distribution |
|---|---|---|
| Linear algebra | Method 1: $N\left(8K^2Q + \tfrac{1}{3}K^3\right) = O(NK^3)$. No assumptions. | Method 2: $K^2 + N\left(K^2Q + K^2 + \tfrac{1}{3}K^3\right) = O(NK^3)$. Assumes central limit theorem holds. Permits negative transition probabilities. |
| Taylor series approximation | Method 3: $\tfrac{1}{3}K^3 + K^2 + N\left(8K^2Q + K^2\right) = O(K^3 + NK^2)$. Ignores higher order terms in Taylor series. | Method 4: $\tfrac{1}{3}K^3 + K^2 + K^2 + N\left(K^2Q + K^2 + K^2\right) = O(K^3 + NK^2)$. Assumes central limit theorem holds. Ignores higher order terms in Taylor series. |

Table 5.1: Summary of sampling based methods for calculating the error of the MFPT from the initial state due to sampling. Each cell gives the running time, where $N$ is the number of samples, $K$ is the number of states, and $Q$ is the time taken to sample from a normal distribution, and advantages or assumptions for the method.

Sampling methods summary

Combining the techniques for sampling and the solution of the linear system given above, we get four sampling-based methods for finding the uncertainty in the MFPT from the initial state. Table 5.1 shows the running times and various assumptions of each of the four sampling-based methods, where $Q$ is the time to sample from a normal distribution, and we assume that we take a total of $N$ samples to estimate the distribution of $x_1$. Method 1 is similar to previous methods for calculating error in these models [SKH05]. We have also introduced new methods that use different approximations and improve the running time of the error analysis.
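A minimal Python sketch of Method 1 follows: each non-final row is drawn from its Dirichlet posterior, Eq. 5.2 is solved for that draw, and the resulting values of $x_1$ are collected. The Dirichlet counts `U`, the seed, and the sample size are hypothetical, the gamma-based sampler is a standard construction rather than the method of Appendix A, and the final-state row is treated as the fixed boundary condition of Eq. 5.2.

```python
import random
import statistics

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def sample_dirichlet(u_row, rng):
    """Draw from Dirichlet(u) by normalizing Gamma(u_j, 1) variates."""
    g = [rng.gammavariate(u, 1.0) for u in u_row]
    total = sum(g)
    return [v / total for v in g]

def sample_mfpt(U, dt, n_samples, rng):
    """Method 1 sketch: sample each non-final row of transition
    probabilities, solve Eq. 5.2, and record the MFPT from state 1."""
    K = len(U)
    b = [-dt] * (K - 1) + [0.0]
    draws = []
    for _ in range(n_samples):
        A = []
        for i in range(K - 1):
            p = sample_dirichlet(U[i], rng)
            A.append([p[j] - (1.0 if i == j else 0.0) for j in range(K)])
        A.append([0.0] * (K - 1) + [1.0])   # boundary condition x_K = 0
        draws.append(solve(A, b)[0])
    return draws

# Hypothetical 2-state Dirichlet counts; the final-state row is unused.
U = [[900.0, 100.0], [1.0, 1.0]]
draws = sample_mfpt(U, dt=1.0, n_samples=2000, rng=random.Random(0))
mean_x1 = statistics.mean(draws)
std_x1 = statistics.stdev(draws)
```

In this two-state example $x_1 = \Delta t / p_{12}$, so the sampled mean should sit near 10 time steps, with a spread of roughly one time step.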

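Non-sampling based error analysis method

The methods in the previous section all relied on sampling possible transition probabilities from either the Dirichlet or MVN distributions to get samples from the distribution of $x_1$. If we make both the MVN approximation and the Taylor series expansion, we can derive a closed-form representation for the distribution of $x_1$. First, we will rewrite Eq. 5.14 by grouping $K$ terms at a time as

\[
\Delta x_1 = \begin{bmatrix} \left.\frac{\partial f}{\partial a_{11}}\right|_{\bar{A}} & \cdots & \left.\frac{\partial f}{\partial a_{1K}}\right|_{\bar{A}} \end{bmatrix} \begin{bmatrix} \Delta a_{11} \\ \vdots \\ \Delta a_{1K} \end{bmatrix} + \cdots + \begin{bmatrix} \left.\frac{\partial f}{\partial a_{K1}}\right|_{\bar{A}} & \cdots & \left.\frac{\partial f}{\partial a_{KK}}\right|_{\bar{A}} \end{bmatrix} \begin{bmatrix} \Delta a_{K1} \\ \vdots \\ \Delta a_{KK} \end{bmatrix}.
\]

For notational convenience, we define the above vectors as the sensitivity $s_i$ and deviation $\Delta a_i$,

\[
s_i^T = \begin{bmatrix} \left.\frac{\partial f}{\partial a_{i1}}\right|_{\bar{A}} & \cdots & \left.\frac{\partial f}{\partial a_{iK}}\right|_{\bar{A}} \end{bmatrix}, \tag{5.15}
\]

\[
\Delta a_i^T = \begin{bmatrix} \Delta a_{i1} & \cdots & \Delta a_{iK} \end{bmatrix}. \tag{5.16}
\]

Therefore,

\[
\Delta x_1 = \sum_{i=1}^{K} s_i^T \Delta a_i. \tag{5.17}
\]

The vector $\Delta a_i$ is equal to $a_i - \bar{a}_i$ and, with the MVN approximation, has mean 0 and covariance matrix $\Sigma_i$ given by Eq. 5.11. As described in Appendix B, linear combinations of MVN distributions are also MVN distributions, and Eq. B.3 gives that $\Delta x_1$ is distributed as normal with mean 0 and variance $\sigma^2$, where

\[
\sigma^2 = \sum_{i=1}^{K} s_i^T \Sigma_i s_i. \tag{5.18}
\]

Substituting Eq. 5.11 for $\Sigma_i$, we see that

\[
\sigma^2 = \sum_{i=1}^{K} \frac{1}{w_i^2 (w_i + 1)}\, s_i^T \left[ w_i\, \mathrm{Diag}(u_i) - u_i u_i^T \right] s_i
= \sum_{i=1}^{K} \frac{1}{w_i^2 (w_i + 1)} \left[ w_i\, s_i^T \mathrm{Diag}(u_i)\, s_i - (s_i^T u_i)(u_i^T s_i) \right]. \tag{5.19}
\]

Therefore, $x_1$, which equals $\bar{x}_1 + \Delta x_1$, has a normal distribution with mean $\bar{x}_1$ and variance given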

by Eq. 5.19. These closed-form expressions for the mean and variance of $x_1$ give the closed-form distribution of $x_1$. In addition, the distribution of $x_1$ can be computed efficiently. As described in the discussion of the Taylor series approximation, we can calculate $\bar{x}_1$ and all the partial derivative terms in the sensitivity vectors in time $O((1/3)K^3 + 3K^2)$. Since the variance is the sum of vector dot products (rather than matrix vector products), we can calculate the sum in Eq. 5.19 in time $O(K^2)$. The running time for calculating the distribution of $x_1$ is thus $O((1/3)K^3 + 4K^2)$, a large improvement over the running time for any of the sampling based methods.

Adaptive sampling algorithm

To generate molecular dynamics data, the typical method is either to start all simulations from one state, or to generate a representative set of starting conformations from, for example, high temperature unfolding [DL93] or replica exchange [SO99], and start a number of simulations from each of these conformations. If our trajectories are sufficiently long or we have enough trajectories from relevant conformations, we can hope to sample all the important transitions. In the framework of a Markovian state model with the state space defined, we are sampling transitions with no guidance as to which ones are more or less uncertain given our current data. In this section, we present an algorithm that uses the error analysis techniques described in the previous sections to achieve much higher precision in the quantities of interest for the same number of total simulations. In addition to the total error in the values, which we have already discussed, we also want to determine the main contributors to this error so that we can selectively add simulations to those regions that give rise to the greatest uncertainties.
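The closed-form calculation can be sketched in a few lines of Python. The sketch evaluates Eq. 5.2 at the expected transition probabilities, obtains the sensitivities from an adjoint solve ($\partial x_1 / \partial a_{ij} = -y_i x_j$ with $A^T y = e_1$, a standard identity used here in place of Appendix C), and sums the per-row terms of Eq. 5.19. Two assumptions are specific to the sketch: the Dirichlet counts `U` are made up, and the final-state row is skipped because Eq. 5.2 fixes it as a boundary condition.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def closed_form_mfpt_distribution(U, dt):
    """Return (mean, variance) of x_1 under the MVN and first-order
    Taylor approximations, given Dirichlet counts U (last state final)."""
    K = len(U)
    w = [sum(row) for row in U]
    A = [[U[i][j] / w[i] - (1.0 if i == j else 0.0) for j in range(K)]
         for i in range(K - 1)]
    A.append([0.0] * (K - 1) + [1.0])            # boundary condition x_K = 0
    b = [-dt] * (K - 1) + [0.0]
    x = solve(A, b)                               # MFPTs at the expected probabilities
    A_t = [list(col) for col in zip(*A)]
    y = solve(A_t, [1.0] + [0.0] * (K - 1))       # adjoint solve
    var = 0.0
    for i in range(K - 1):                        # skip the fixed final row
        s = [-y[i] * x[j] for j in range(K)]      # sensitivity vector s_i
        quad = sum(U[i][j] * s[j] * s[j] for j in range(K))   # s^T Diag(u) s
        dot = sum(U[i][j] * s[j] for j in range(K))           # s^T u
        var += (w[i] * quad - dot * dot) / (w[i] ** 2 * (w[i] + 1.0))
    return x[0], var

# Hypothetical 2-state Dirichlet counts; the final-state row is unused.
U = [[900.0, 100.0], [1.0, 1.0]]
mean_x1, var_x1 = closed_form_mfpt_distribution(U, dt=1.0)
```

For these counts the sketch gives a mean of 10 time steps and a variance of $0.9 / 1.001 \approx 0.899$, obtained without drawing a single sample.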
One advantage of the Taylor series methods (sampling based methods 3 and 4 and the non-sampling based method) outlined above is that they naturally decompose the contribution of each element in the matrix to the variation in x 1. Since the elements in the same row of A are not independent, we look at the combined contributions associated with each row. Each row of A corresponds to transitions from a single state, so if we find that one row contributes the most to the uncertainty in x 1, we can decrease the uncertainty from that row by generating new transitions from that state. In principle, we could also find the main error contributors directly from the set of linear equations using techniques such as analysis of variance and statistical design of experiments [BHH78]. However, these are computationally expensive when the problem dimension is high. In this section, we will focus only on the non-sampling based method for calculating the error. First, we will show how to minimize the variance given the actual transition probability matrix. In practice, we do

not know this matrix, but it will be useful to compare the variance achieved by different sampling algorithms to this optimal variance. Assume that we know the actual transition probability matrix, $P$. With a total of $M$ simulations, we can calculate the optimal allocation of simulations per row that minimizes the variance in the MFPT from the initial state. If we allocate $w_i$ simulations to row $i$, the expected counts for that row are $u_{ij} = p_{ij} w_i$. Substituting into Eq. 5.19, the variance of $x_1$ is

\[
\sigma^2 = \sum_{i=1}^{K} \frac{\nu_i}{w_i + 1}, \qquad
\nu_i = \frac{1}{w_i^2}\, s_i^T \left[ w_i\, \mathrm{Diag}(u_i) - u_i u_i^T \right] s_i = s_i^T \left[ \mathrm{Diag}(p_i) - p_i p_i^T \right] s_i, \tag{5.20}
\]

where we separate out the $\nu_i$ terms, which do not depend on the allocation of simulations, $w_i$. We can minimize the quantity $\sigma^2$ with respect to the variables $w_i$ subject to the constraint that the total number of simulations is equal to $M$,

\[
\sum_{i=1}^{K} w_i = M. \tag{5.21}
\]

Solving this minimization problem gives

\[
w_i = (M + K)\, \frac{\sqrt{\nu_i}}{\sum_{j=1}^{K} \sqrt{\nu_j}} - 1. \tag{5.22}
\]

If we know the transition probability matrix and make the MVN and Taylor series approximations, Eq. 5.22 gives the optimal number of simulations per row that minimizes the variance of the MFPT from the initial state. Strictly speaking, the variables $w_i$ should be constrained to be positive integers, but for large sample size $M$ we can round the values to get good approximations. In general, however, we do not know the true transition probability matrix. We now outline an adaptive sampling algorithm that attempts to approximate the optimal number of simulations per row. Assume that we have observed some transitions and have the Dirichlet transition counts $u_{ij}$.
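The allocation rule can be illustrated with a short Python sketch. It assumes the constrained minimizer of $\sum_i \nu_i / (w_i + 1)$ subject to $\sum_i w_i = M$, which makes $w_i + 1$ proportional to $\sqrt{\nu_i}$; the $\nu$ values and the budget below are made up, and the function names are not from the original text.

```python
import math

def optimal_allocation(nu, M):
    """Real-valued allocation minimizing sum_i nu_i / (w_i + 1)
    subject to sum_i w_i = M; w_i + 1 is proportional to sqrt(nu_i)."""
    K = len(nu)
    roots = [math.sqrt(v) for v in nu]
    total = sum(roots)
    return [(M + K) * r / total - 1.0 for r in roots]

def allocation_variance(nu, w):
    """Variance of x_1 for an allocation w, in the form sum_i nu_i / (w_i + 1)."""
    return sum(v / (wi + 1.0) for v, wi in zip(nu, w))

# Hypothetical per-row variance contributions and simulation budget.
nu = [4.0, 1.0]
M = 28
w_opt = optimal_allocation(nu, M)
var_opt = allocation_variance(nu, w_opt)
var_equal = allocation_variance(nu, [M / 2.0, M / 2.0])
```

Here the optimal allocation is 19 and 9 simulations, giving a variance of 0.3 compared with about 0.333 for an equal split. A row with a very small $\nu_i$ can receive a negative value from this formula, in which case it would have to be clamped to zero and the remaining rows re-solved; that case is not handled in the sketch.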

From Eq. 5.19, the variance of $x_1$ is

\[
\sigma^2 = \sum_{i=1}^{K} \frac{\nu_i}{w_i + 1}, \qquad \nu_i = s_i^T \left[ \mathrm{Diag}(\bar{p}_i) - \bar{p}_i \bar{p}_i^T \right] s_i, \tag{5.23}
\]

since $\bar{p}_{ij} = u_{ij} / w_i$. This equation is similar to Eq. 5.20, but we have replaced the actual values of the transition probabilities, $p_i$, with the expected values of the transition probabilities, $\bar{p}_i$. Our goal is to decrease the variance of $x_1$, $\sigma^2$, given a total number of simulations, $M$. Since the expected values of the transition probabilities are just estimates of the actual values, they may change as we add new simulations. Therefore, we will use these estimates to start a few new simulations, re-evaluate the expected values, and repeat.

Let us assume that with our current expected value estimates, we can start $m$ more simulations from any states. In the simplest implementation of the adaptive sampling algorithm, we will start all $m$ simulations from the same state $j$, but we could easily modify this using the analysis above to start a total of $m$ simulations from different states. The expected values of the transition probabilities may change after these $m$ simulations, but our best guess for them is the current expected values. Therefore, the only term in Eq. 5.23 that changes with these additional simulations is the term corresponding to the $j$th row. The expected change of variance in $x_1$ is

\[
\Delta \sigma^2 = \frac{\nu_j}{w_j + m + 1} - \frac{\nu_j}{w_j + 1}, \tag{5.24}
\]

which is simply the difference of the $j$th term with $m$ additional simulations and the original $j$th term. Using the above equation, we calculate the expected decrease in variance caused by adding $m$ more simulations to any specified row, select the row that reduces the variance the most, and start $m$ more simulations from that state. Repeating this process, we adaptively add samples to our transition counts. The adaptive sampling algorithm is given as Algorithm 1.
The tolerance criteria for the while loop could be that the total number of simulations is less than some maximum (as we motivated above), the total variance σ 2 is larger than some tolerance, or the decrease in σ 2 is larger than some value. Using this algorithm, we either decrease the total variance with the same number of simulations, or decrease the number of simulations necessary for a given precision.
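The row-selection step based on Eq. 5.24 can be illustrated with a short Python sketch. The names are hypothetical, and the toy loop makes a simplifying assumption that the text does not: it holds the $\nu_i$ values fixed, whereas the algorithm described here re-estimates them from the updated counts after every round.

```python
def expected_variance_decrease(nu_j, w_j, m):
    """Negative of Eq. 5.24: drop in sigma^2 from adding m samples to row j."""
    return nu_j / (w_j + 1.0) - nu_j / (w_j + m + 1.0)

def choose_row(nu, w, m):
    """Index of the row whose m extra samples reduce the variance the most."""
    gains = [expected_variance_decrease(v, wi, m) for v, wi in zip(nu, w)]
    return max(range(len(gains)), key=gains.__getitem__)

def adaptive_allocation(nu, w0, m, rounds):
    """Toy loop: nu is held fixed here (an assumption made for illustration)."""
    w = list(w0)
    for _ in range(rounds):
        w[choose_row(nu, w, m)] += m
    return w

if __name__ == "__main__":
    # Made-up nu values; row 0 dominates the variance at first.
    print(adaptive_allocation([4.0, 1.0, 0.25], [10, 10, 10], m=5, rounds=4))
```

In this toy run the first three rounds go to the row with the largest $\nu_i$; once its term has shrunk enough, the fourth round goes to the next row.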

Algorithm 1 The adaptive sampling algorithm
1: Generate initial simulations and transition counts
2: while some tolerance criteria do
3:   $\nu_i \leftarrow s_i^T \left[ \mathrm{Diag}(\bar{p}_i) - \bar{p}_i \bar{p}_i^T \right] s_i$
4:   $\mathit{best} \leftarrow \arg\max_j \left( \frac{\nu_j}{w_j + 1} - \frac{\nu_j}{w_j + m + 1} \right)$
5:   Start $m$ more simulations from state $\mathit{best}$
6: end while

Extension to large systems

As the simulated system becomes large, or if we include spatial degrees of freedom in the MSM, the number of states required may become unwieldy. In these cases, both the storage requirements and the cost associated with the linear algebra of Eq. 5.2 become prohibitive. In general, since we are looking at molecular dynamics at a small time step, we will not see transitions between all pairs of states. We only expect to see transitions between states that are sufficiently close conformationally to move between each other in time, t. For this reason, we expect that the observed transition counts, $z_{ij}$, will be sparse, i.e., most of them will equal zero. However, the Dirichlet distribution of the transition probabilities also depends on the prior probability distribution, which may not be sparse. In this section, we will describe how to maintain the sparsity of the transition counts in our calculations, even with a dense prior, since sparse matrix calculations are much more efficient in terms of both storage and computation.

First we show how to decompose the transition probabilities into a dense term and a sparse term and how to write this as a bordered sparse matrix [BR74]. Then, we show how to efficiently solve this system using sparse matrix techniques. We also discuss the implementation of the adaptive sampling algorithm using similar techniques.

We will focus first on solving Eq. 5.2 at the expected transition probability values $\bar{p}_{ij}$. In our system, the matrix $Z$, with elements $z_{ij}$, is sparse. Assume that the prior probabilities $\alpha_{ij}$ are symmetric for each row:

$$\begin{aligned} \alpha_{11} = \alpha_{12} = \cdots = \alpha_{1K} &= c_1, \\ \alpha_{21} = \alpha_{22} = \cdots = \alpha_{2K} &= c_2, \\ &\;\;\vdots \\ \alpha_{K1} = \alpha_{K2} = \cdots = \alpha_{KK} &= c_K. \end{aligned} \tag{5.25}$$

We can represent this prior compactly as the product of two vectors,

$$\alpha = c\,\mathbf{1}^T, \tag{5.26}$$

where $\mathbf{1}$ is a column vector with all unit entries. Each expected transition probability value, $\bar{p}_{ij}$, is defined by Eq. 5.9 as

$$\bar{p}_{ij} = \frac{z_{ij} + \alpha_{ij}}{w_i}, \qquad w_i = \sum_{j=1}^{K} (z_{ij} + \alpha_{ij}). \tag{5.27}$$

Therefore, the expected values of the transition probabilities are

$$\bar{P} = W^{-1}(Z + \alpha), \tag{5.28}$$

where $W$ is a diagonal matrix with entries $w_i$ along the diagonal. Eq. 5.2 can be rewritten as

$$(-I + \bar{P})\,x = b, \qquad \left(-I + W^{-1}(Z + c\,\mathbf{1}^T)\right) x = b, \tag{5.29}$$

where $I$ is the identity matrix and $Z$ is sparse. Technically, we need the last row of the matrix to correspond to the boundary condition. This does not change the sparse structure of the matrix, so we will ignore it for notational simplicity. The matrix in Eq. 5.29 is generally dense. However, noting that it is a rank one update of a matrix with sparse structure, $Z$, simple algebra gives the augmented system

$$\begin{bmatrix} F & c \\ \mathbf{1}^T & -1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} W b \\ 0 \end{bmatrix}, \tag{5.30}$$

where

$$F = Z - W \tag{5.31}$$

and $F$ is a sparse matrix. We have thus shown how to rewrite Eq. 5.2 as a bordered sparse matrix.

Using standard LU decomposition to solve the system of equations in Eq takes time $O((1/3)K^3)$. However, using sparse matrix techniques, it is possible to store only the nonzero

entries and solve for the LU factors of the matrix $F$ efficiently, where both $L$ and $U$ are sparse [DER86]. The complexity of solving the sparse matrix is very system dependent; however, the running time is much less than $O((1/3)K^3)$. Appendix D shows how to efficiently solve the system given in Eq. 5.30 by using the LU factors of $F$. Thus, we can leverage sparse matrix algorithms even with a dense prior.

Recall that this decomposition is possible because the expected values of the transition probabilities can be separated into a sparse term and a dense term. We cannot use these sparse matrix schemes for methods 1 and 2 outlined above. In those cases, we need to solve the system of linear equations after generating a sample of the transition probabilities, which is unlikely to be the sum of a sparse term and a low-rank dense term. However, we can use these schemes for the Taylor series methods, since they rely on solving Eq. 5.2 at the expected values of the transition probabilities, $\bar{p}_{ij}$, which is what we have outlined above. This reduces the time for finding $x_1 = f(\bar{a})$ and the partial derivative terms in the Taylor series from $O(K^3)$ to $O(K^2 + \text{sparse matrix time})$. In particular, this reduces the running time of the non-sampling based method to $O(K^2 + \text{sparse matrix time})$.

The above discussion assumes a symmetric prior, but we can generalize these results to any prior that is the outer product of two vectors. Specifically, we can have a prior that is the Boltzmann probability of transitioning between two states as given by their energy difference, which may be a more natural choice of prior parameters for molecular dynamics:

$$\alpha \propto \begin{bmatrix} e^{E_1/T} \\ \vdots \\ e^{E_K/T} \end{bmatrix} \begin{bmatrix} e^{-E_1/T} & \cdots & e^{-E_K/T} \end{bmatrix}. \tag{5.32}$$

In addition to using sparse matrices for the error analysis, we can also use these techniques during the adaptive sampling algorithm.
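The equivalence between the dense system of Eq. 5.29 and the bordered system of Eq. 5.30 can be checked numerically on a tiny example. The sketch below is a hedged illustration, not the method of Appendix D: it uses a plain dense Gaussian elimination as a stand-in for a sparse LU factorization, the function names are hypothetical, and the boundary condition is assumed to be imposed by replacing the last row with $x_K = 0$ (the text only says that the last row corresponds to the boundary condition).

```python
def solve(A, b):
    """Dense Gaussian elimination with partial pivoting (stand-in for sparse LU)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def dense_system(Z, c, b):
    """Rows of (-I + W^{-1}(Z + c 1^T)) x = b, last row replaced by x_K = 0."""
    K = len(Z)
    A, rhs = [], []
    for i in range(K - 1):
        w = sum(Z[i]) + K * c[i]
        A.append([(Z[i][j] + c[i]) / w - (1.0 if i == j else 0.0) for j in range(K)])
        rhs.append(b[i])
    A.append([0.0] * (K - 1) + [1.0])
    rhs.append(0.0)
    return A, rhs

def bordered_system(Z, c, b):
    """Rows of [F c; 1^T -1][x; y] = [W b; 0] with F = Z - W, same boundary row."""
    K = len(Z)
    A, rhs = [], []
    for i in range(K - 1):
        w = sum(Z[i]) + K * c[i]
        row = [float(Z[i][j]) - (w if i == j else 0.0) for j in range(K)]
        A.append(row + [c[i]])          # only the extra column carries the prior
        rhs.append(w * b[i])
    A.append([0.0] * (K - 1) + [1.0, 0.0])  # boundary condition x_K = 0
    rhs.append(0.0)
    A.append([1.0] * K + [-1.0])            # y = sum_j x_j
    rhs.append(0.0)
    return A, rhs

if __name__ == "__main__":
    Z = [[8, 2, 0], [1, 6, 3], [0, 0, 10]]   # made-up sparse counts
    c = [1.0 / 3] * 3                         # symmetric prior, c_i = 1/K
    b = [-1.0, -1.0, 0.0]                     # unit time step (assumed)
    print(solve(*dense_system(Z, c, b)))
    print(solve(*bordered_system(Z, c, b))[:3])
```

Both calls print the same vector $x$. In the bordered form the prior appears only in the extra column, so the block $F$ keeps the sparsity pattern of the counts.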
Since each iteration of the adaptive sampling algorithm only adds simulations from a single state $i$, only the $i$th row of the matrix $F$, as defined by Eq. 5.31, is updated in this iteration. Say we observe a total number of new transitions from state $i$ to each state $j$, written here as $\Delta z_{ij}$. The new $i$th row of the matrix $F$ will now equal

$$f'_{ij} = \begin{cases} z_{ij} + \Delta z_{ij}, & i \neq j, \\ z_{ii} + \Delta z_{ii} - w_i - \sum_{k=1}^{K} \Delta z_{ik}, & i = j. \end{cases} \tag{5.33}$$

We can represent this change as a rank one update to $F$,

$$F' = F + e_i \begin{bmatrix} \Delta z_{i1} & \cdots & \Delta z_{i(i-1)} & \Delta z_{ii} - \sum_{k=1}^{K} \Delta z_{ik} & \Delta z_{i(i+1)} & \cdots & \Delta z_{iK} \end{bmatrix}, \tag{5.34}$$

and use the previously described techniques for converting to a bordered matrix to reuse the LU factors and reduce the computation time. After some number of adaptive iterations, it will be worthwhile to add the updated counts to the $F$ matrix and re-factor this matrix, since each update increases the size of the system by one and the updated rows and columns of the factors are generally dense.

5.3 Results

The error analysis methods presented in this paper assume that the defined state space is Markovian. For molecular systems, it is difficult to define states that meet this criterion. Though there are tests for Markovian behavior in a system [SPS04a], it is unclear whether these tests are both necessary and sufficient. Therefore, since we have assumed that the state space is Markovian in order to calculate the error from sampling, we test the methods given above on a toy system with 87 states and Markovian transitions between the states.

We want the transition probabilities to be representative of molecular kinetics, which random transition probabilities are not. Therefore, we construct the transition probabilities of the toy system, $p_{ij}$, from existing simulation data of a small protein, the 12-residue tryptophan zipper β-hairpin, TZ2 [CSS01]. We took a subset of 1,750 independent molecular dynamics trajectories generated by Snow et al. [SQD+04] which were started from the unfolded state and taken at a resolution of 10 ns (a total of approximately 12,000 conformations). These conformations were then clustered using hierarchical clustering with a cutoff of 3.25 angstroms to result in a total of 87 states [SSP04]. We define the transition count $z_{ij}$ as the sum over all trajectories of the number of transitions from state $i$ to state $j$ at a time step t of 10 ns.
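The definition of the transition counts $z_{ij}$ can be written as a short Python sketch. It assumes (this is not stated in the text) that each trajectory has already been reduced to a list of integer state indices, one per saved snapshot, so that consecutive entries are one time step apart; the function name and the `lag` parameter are hypothetical.

```python
def count_transitions(trajectories, K, lag=1):
    """z_ij = number of i -> j transitions, summed over all trajectories.

    trajectories: iterable of state-index sequences (assumed input format).
    K: number of states.
    lag: number of snapshots between the start and end of a counted transition.
    """
    Z = [[0] * K for _ in range(K)]
    for traj in trajectories:
        for a, b in zip(traj, traj[lag:]):
            Z[a][b] += 1
    return Z

if __name__ == "__main__":
    # Two made-up trajectories over three states.
    print(count_transitions([[0, 0, 1, 2], [1, 1, 0]], K=3))
```

Transitions are never counted across trajectory boundaries, since each trajectory is processed on its own.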
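We define the transition probability matrix $P$ of our toy system as the expected transition probabilities as defined by Eq. 5.9,

$$P = \begin{bmatrix} \dfrac{u_{11}}{w_1} & \cdots & \dfrac{u_{1K}}{w_1} \\ \vdots & \ddots & \vdots \\ \dfrac{u_{K1}}{w_K} & \cdots & \dfrac{u_{KK}}{w_K} \end{bmatrix}, \tag{5.35}$$

where $u_{ij} = \alpha_{ij} + z_{ij}$, we use a symmetric Dirichlet prior of $\alpha_{i1} = \alpha_{i2} = \cdots = \alpha_{iK} = 1/K$,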

and $w_i$ are the normalization constants. Since this cluster space is not Markovian on the time scales of the source trajectories, the transition matrix does not represent a Markovian model for protein folding and thus we will not use the analysis presented below to draw conclusions about the protein system. However, there certainly exists a Markovian model with these transition probabilities, and thus we can use this matrix in our analysis as long as we restrict our comments to the nature of the error analysis and sampling, which is the goal of this chapter. The results presented below are on the toy Markovian system with 87 states, transition probabilities given by Eq. 5.35, and a time step t of 10 ns.

Demonstration of method 1

Given a transition probability matrix $P$ we can calculate the MFPT from the initial state, $x_1$, using Eq. In this section we will demonstrate that if we sample transitions from the matrix $P$ and use method 1 on these transition counts, the distribution of $x_1$ which we calculate is a good approximation to the actual value of $x_1$.

For our toy system, the transition probability matrix $P$ is given by Eq. 5.35. We sample transitions from this matrix by first selecting a row $i$, and then choosing a transition $j$, with probability $p_{ij}$. We generate transition counts by sampling transitions from each row of $P$ independently and summing the number of transitions from state $i$ to state $j$. For each of these transition counts, we use method 1 to estimate the distribution of $x_1$. We have taken 10,000 independent samples of possible transition probability matrices and solved each for $x_1$.

Figure 5.1 shows the actual value of $x_1$ for the matrix $P$ as well as the distributions of $x_1$ for six different transition count matrices, generated with either 500, 1,000, 5,000, 10,000, 50,000, or 100,000 independent transition samples per row. It is easy to verify that the actual value of $x_1$ falls within each distribution.
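The procedure for generating synthetic transition counts from a known matrix $P$ might look like the following Python sketch. The function name and the small example matrix are hypothetical; the only library call used is `random.Random.choices`, which draws indices with the given weights.

```python
import random

def sample_counts(P, n_per_row, rng):
    """Draw n_per_row transitions from every row of P and tally them.

    P: row-stochastic matrix as a list of lists.
    rng: a random.Random instance, passed in so that runs are reproducible.
    """
    K = len(P)
    Z = [[0] * K for _ in range(K)]
    for i, row in enumerate(P):
        # Choose the destination state j with probability p_ij.
        for j in rng.choices(range(K), weights=row, k=n_per_row):
            Z[i][j] += 1
    return Z

if __name__ == "__main__":
    P = [[0.9, 0.1, 0.0],
         [0.2, 0.7, 0.1],
         [0.0, 0.0, 1.0]]   # made-up 3-state example
    print(sample_counts(P, 500, random.Random(0)))
```

Each row of the returned count matrix sums to `n_per_row`, and entries with zero transition probability are never sampled.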
Also, as the number of transition samples increases, the distribution of $x_1$ becomes narrower and centers around the actual value of $x_1$. While we have only shown one distribution for each number of samples per row in the figure, these results are typical. Also, we repeated these experiments for random transition probability matrices $P$ and found similar results (data not shown).

Validity of approximations

We have demonstrated above that given transition counts, we can calculate the distribution of $x_1$ by sampling from possible transition matrices that could have produced the observed data and solving the system of linear equations for each sample. But, as described in Sec. 5.2, this procedure is

Figure 5.1: Distributions of the mean first passage time as generated by the first sampling based method on the 87 state example. The solid line shows the true value of the mean first passage time for the toy system given by the transition probability matrix $P$. The dotted distributions were generated using method 1 for transition count matrices sampled from $P$ with total numbers of 500, 1,000, 5,000, 10,000, 50,000, or 100,000 samples per row.

computationally expensive. Therefore, we proposed two approximations, the MVN approximation to the Dirichlet distribution and the Taylor series approximation to the set of linear equations, which, when taken together, give an efficient closed-form approximation to the distribution of $x_1$. We now demonstrate the validity of these approximations.

First, we generated transition counts from 2,000 independent samples per row of the toy system as described above. We then ran methods 1–4, as well as the non-sampling based method, to calculate the distribution of $x_1$ from these transition counts. For each of methods 1–4, we used 10,000 independent samples of transition probability matrices. Figure 5.2 shows the resulting distributions of $x_1$ for each of the four sampling based methods, as well as the density for the non-sampling based method. Methods 1 and 2 and methods 3 and 4 overlay almost exactly, showing that the MVN approximation to the Dirichlet distribution is valid. Between the linear equation methods (1 and 2) and the Taylor series methods (3 and 4), there is a slight difference since the Taylor series method ignores higher-order terms. However, this difference is mostly in the tails of the distributions and is minimal if one only cares about the mean and variance of the distribution (Table 5.2). In addition, it is clear that the non-sampling based method overlays sampling based method 4 exactly, which is expected

Figure 5.2: Distribution of the mean first passage time as calculated by the five error analysis methods. The vertical line indicates the mean first passage time at the expected values of the transition probabilities, $x_1$.

since they solve the same approximations to the problem.

For this example, we compared the running times of the various methods. The code was implemented in MATLAB and run on a dual Athlon MP (1.8 GHz) computer. Table 5.3 gives the running times for the five different methods required to generate the histograms shown in Figure 5.2. While we did not fully optimize the code for sampling from the Dirichlet and MVN distributions and did not yet implement the sparse matrix solver, these running times clearly demonstrate the superiority of the non-sampling based method for error analysis.

These tests were repeated on random matrices and the toy system with varying levels of total transitions per row with similar results (data not shown). In addition, we tried different prior probability distributions and found no noticeable change in the results for small priors.

Adaptive sampling

Our goals for this chapter were both to calculate the error in the MFPT and to use this error to improve the results. Above, we have shown how to efficiently calculate the error in the MFPT, and

Table 5.2 (columns: Mean, Standard deviation; rows: Methods 1–4 and Non-Sampling): Means and standard deviations of the MFPT distributions generated for the four sampling and the non-sampling based error analysis methods (all units are in nanoseconds). All methods show reasonable agreement for these quantities, but differ between the linear equation methods (methods 1 and 2) and the Taylor series methods (methods 3 and 4 and non-sampling method).

we have demonstrated that the approximations we made were reasonable. Now, we show how using these error estimates improves the sampling through an adaptive algorithm.

We demonstrate how the adaptive sampling algorithm compares to both an even allocation of samples and the optimal allocation of samples. We will generate samples from the toy system given by the transition probability matrix, $P$, defined in Eq. 5.35. Assume that we can take $m$ transition samples in each round, that we can decide where to allocate the samples before each round, and that we have a limit on the total number of samples. An even sampling algorithm will always take the same number of samples from each state in each round, $m/K$. In the simplest implementation of the adaptive sampling algorithm, we calculate the contribution to the variance of $x_1$ for each row and add all $m$ samples to the row that is expected to decrease the variance the most. We ran both the even and adaptive sampling algorithms by generating samples from the matrix

Method          Pre-processing   Sampling   Solving   Total for 10,000 samples
Method 1        NA               4164 s     14.5 s    s
Method 2        s                273 s      15.2 s    s
Method 3        NA               4164 s     4.6 s     s
Method 4        s                273 s      5.1 s     s
Non-Sampling    0.05 s           NA         0.02 s    0.07 s

Table 5.3: Running times for the error analysis methods on calculating the MFPT distribution of an 87 state example. The sampling based methods 1–4 used 10,000 independent samples of the transition probabilities.

Figure 5.3: Effect of adaptive sampling on the variance of the mean first passage time. The blue points are the variance generated from the even sampling algorithm and the purple points are the variance from the adaptive sampling algorithm. The black line shows the variance when the samples are distributed optimally.

$P$. We started with the Dirichlet prior of $\alpha_{ij} = 1/K$ and an initial 10 samples per row and added 870 samples in each round until we had a maximum of 500,000 samples. Figure 5.3 shows the variance of $x_1$ versus the total number of samples over 20 independent runs of each algorithm, with the blue points from the even sampling algorithm and the purple points from the adaptive sampling algorithm. The dark blue and purple lines on the figure represent the variance in $x_1$ for the even and adaptive sampling schemes respectively, generated by keeping the transition counts proportional to $P$. The black line on the figure is the variance of $x_1$ when the samples are distributed optimally, as described in Sec. These solid lines were generated by scaling each row in the matrix by the desired number of samples for that row. We can see that the adaptive sampling algorithm rapidly achieves the optimal variance.

It is clear that the adaptive sampling algorithm achieves either a much higher precision in MFPT with the same number of samples or, conversely, requires many fewer samples to achieve a given precision. Figure 5.4 shows these relations and demonstrates that the adaptive sampling algorithm,

Figure 5.4: Relationship between the number of samples and the variance for the even and adaptive sampling algorithms. The top panel shows the ratio of the number of even samples to the number of adaptive samples required for a desired variance. The bottom panel shows the ratio of the variance of the even sampling to the variance of the adaptive sampling for the same number of samples.

for this data set, achieved a greater than 20-fold reduction in the number of samples or increase in precision. If we look at the optimal allocation of samples per row, we see that the distribution is far from uniform across the states (Figure 5.5). Again, this algorithm was tested on random matrices with similar results (data not shown).

5.4 Discussion and conclusions

Given that we can generate a large number of molecular dynamics trajectories using distributed computing methods, such as Folding@Home, it is important to develop efficient techniques for analyzing the data. In previous chapters, we described a technique for building a graph of the important states of a protein and estimating transition probabilities between these states. In this

Figure 5.5: Percent of samples required for each state in the optimal allocation of samples per state.

chapter, we discuss methods for ascertaining the uncertainties in important kinetic properties that can be calculated from this graph.

The focus of this chapter is on the computation of error from finite sampling. Given that a state definition is Markovian, we have shown that the distribution of transition probabilities, estimated from molecular dynamics data, is Dirichlet if one assumes a Dirichlet prior. From these distributions, we gave a method which, given sufficient samples of transitions, can calculate the distribution of a desired quantity, such as the mean first passage time. We also presented and validated two approximations that have little effect on the accuracy of the results and improve the efficiency of the first method. When taken together, these approximations yield an efficient closed-form approximation to the uncertainty, which can be calculated in the same amount of time as one solution to the set of linear equations. While we have not gone into detail in this chapter, it should be noted that the analysis presented above could easily be modified to calculate errors in $P_{\text{fold}}$ values or any other quantity that can be represented by a set of linear equations. Similarly, it is possible to modify the sensitivity analysis to calculate the error of other functions of the MFPT vector, such as the sum of errors or the norm.

The analysis presented here assumes that it is possible to define states that behave in a Markovian manner at a given time step. This is not a trivial task, and incorrect state definitions may lead to unpredictable results. In this chapter, we did not address the errors that may arise from incorrect state definitions. Instead, we cite tests that can be performed on a specific state definition to see whether or not it is Markovian.
Some of these tests rely on finding eigenvalues or other properties

of transition probability matrices. The methods that we presented here for finding uncertainties in solutions to linear equations can also be generalized to finding uncertainties in eigenvalues [VS83], which we will do in Chapter 6. It is important when running these Markovian tests to find the errors from finite sampling, since it may be possible to pass the tests within the error, or we may find that the sampling errors are too large to draw any meaningful conclusions about the Markovian-ness of the system.

For systems that can be reduced to a small number of states, the efficiency gains in the uncertainty estimates are minimal compared to the time it takes to generate the molecular dynamics data and cluster the conformations. However, systems may have important translational and rotational degrees of freedom, e.g., two proteins moving in relation to one another or a binding event. In these cases, for each relevant spatial state of the protein, we would need separate states for each conformation of the protein, and an MSM with tens of thousands of states may be necessary. The sampling based methods will hardly be practical for such large systems. Our non-sampling based, closed-form solution for uncertainty, when taken together with the sparse matrix manipulations given in Sec , would provide efficient ways to measure uncertainties.

In addition to the error analysis techniques, we also presented an adaptive sampling method that can produce a given precision with over an order of magnitude reduction in the number of samples required by a naive sampling algorithm. We outlined how this algorithm can be applied efficiently for systems with many states and demonstrated the large gains in either sampling time or precision. We also outlined sparse matrix techniques that can improve the efficiency of both the error analysis and adaptive sampling.
In conclusion, we have developed efficient and practical computational tools and algorithms that find the main sources of error in a Markovian state model caused by finite sampling. We have shown that our approximate solutions are in good agreement with the actual distributions, and are computationally far more efficient. In addition, we gave an algorithm that uses the error analysis to greatly reduce the number of samples necessary to build an MSM with the same precision. Lastly, we gave techniques for using sparse matrix manipulations that will allow the handling of systems with large numbers of states. In the future, when the adaptive sampling algorithm generates new simulations of molecular systems, some modifications to the algorithm may be necessary. For example, it is likely that we will need to re-cluster the conformations as we gather new data in order to meet the Markovian criteria. Also, if we begin new simulations from a given state, we should pick the starting conformation at random from all the conformations in the state, so that we do not bias the transitions from that state.

Chapter 6

Eigenvalue and eigenvector error analysis

Markovian state models (MSMs) are a convenient and efficient means to compactly describe the kinetics of a molecular system as well as a formalism for using many short simulations to predict long time scale behavior. Building an MSM consists of grouping the conformations into states and estimating the transition probabilities between these states. In the previous chapter, we described an efficient method for calculating the uncertainty due to finite sampling in the mean first passage time between two states. In this chapter, we extend the uncertainty analysis to derive similar closed-form solutions for the distributions of the eigenvalues and eigenvectors of the transition matrix, quantities that have numerous applications when using the model. We demonstrate the accuracy of the distributions on a six-state model of the terminally blocked alanine peptide. We also show how to significantly reduce the total number of simulations necessary to build a model with a given precision using these uncertainty estimates for the blocked alanine system and for a 2454-state MSM of the villin headpiece.

6.1 Introduction

One approach to studying the movement of biomolecules is to use molecular dynamics. After generating large ensembles of molecular dynamics simulations, we wish to analyze these trajectories to find thermodynamic properties such as the equilibrium conformational distribution of the protein and kinetic properties such as the rate and mechanism of folding. A recent approach for such analysis involves graph-based models of protein kinetics that divide the conformation space into

124 CHAPTER 6. EIGENVALUE AND EIGENVECTOR ERROR ANALYSIS 112 discrete states and calculate transition probabilities or rates between the states based on molecular dynamics trajectories [KTB93, GT94, SSP04, SPS04a, SPS + 04b, SKH05, AFGL05, SHCT05]. These Markovian state models (MSMs) allow one to easily combine and analyze simulation data started from arbitrary conformations and naturally handle the existence of intermediate states and traps. This approach has been applied to small protein systems [SSP04, SPS + 04b, JVP06], a nonbiological polymer [EP04, EPP05a], and vesicle fusion [KKS + 06, KZP + 07] with good agreement with experimental rates. The MSM uses discrete states, and we expect the transition probabilities to be insensitive to the exact state boundaries after sufficient transition time [Cha78, AD81, VD85]. An alternative approach uses fuzzy partitions and partial membership of conformations into states, and may be able to better characterize transition regions and describe dynamics at shorter time scales [WG02, Web06]. For any quantities which can be calculated from the MSM, such as the mean first passage times between states [SSP04], probability of folding from a given conformation [SSP04], or rates [SPS04a], it is also important to determine the uncertainty in these values, so that one can form an idea about the confidence of the results. One main source of error is caused by grouping conformations into states and assuming that transitions between these states are Markovian. It has been shown that if the conformations are grouped incorrectly or if the transition probabilities are calculated from a time step which is too short, the transitions are no longer history independent, and any analysis that assumes a Markovian process may produce incorrect results [SPS04a]. Even if the states are defined such that the transitions between them are Markovian, the results could still be in error. 
This second source of error results from the finite sampling of transitions between states, which gives uncertainties in the transition probability estimates and, in turn, leads to uncertainties in the values we calculate. In the previous chapter (Chapter 5), we focused on the uncertainties caused by finite sampling and showed how to efficiently calculate the resulting uncertainty in the mean first passage time between two states. Those methods can easily be applied to calculate the uncertainty in any quantity that can be expressed as the solution of a set of linear equations of the transition probabilities. However, many interesting collective properties of the system are described using the eigenvalues and eigenvectors of the transition matrix. For example, the eigenvalues correspond to the aggregate time scales of the system, and thus can be compared with experiments to validate the model [KTB93, SPS + 04b]. Additionally, they are used in some tests for determining the time at which the system becomes Markovian [SPS04a]. The eigenvectors are useful in determining the states which participate in the relaxation process corresponding to a given eigenvalue, and can be used to group

125 CHAPTER 6. EIGENVALUE AND EIGENVECTOR ERROR ANALYSIS 113 kinetically similar states [Sch99, SFHD99, Hui01, SH02, DW04]. In this chapter, we will extend the uncertainty analysis methods presented previously [SP05a] to estimate the uncertainties in the eigenvalues and eigenvectors of a transition matrix caused by finite sampling. These error estimates can again be calculated with efficient closed-form solutions. Moreover, these error estimates can be used to adaptively direct further simulations to reduce the uncertainties of functions of the eigenvalues or eigenvectors. The validity of the error estimates is demonstrated on a small system, the terminally blocked alanine peptide, and the power of adaptive sampling is demonstrated on the alanine peptide and a model of the villin headpiece. 6.2 Methods Molecular dynamics simulations are a popular tool for understanding molecular motion. Analyzing these trajectories to extract kinetic information is a difficult task. Recent work [KTB93, GT94, SSP04, SPS04a, SPS + 04b, SKH05, AFGL05, SHCT05] has involved modeling the system as a Markovian state model, where the conformation space of the molecule is divided into discrete regions, or states, and transition probabilities are calculated between the states. If the transitions between the states are Markovian, or history independent on some time scale, it is possible to model the long time scale behavior of the system as a Markov chain on the Markovian state model graph. Determining a state space over which transitions are Markovian is a difficult task and there has been much work on determining appropriate decompositions [CSP + 07, NHSS07]. Even if an appropriate decomposition can be found for which the dynamics are Markovian at some lag time, the kinetic properties calculated from the model still have uncertainties. 
Since we can only sample a finite number of transitions between states, the estimated transition probabilities between states will have statistical uncertainty. Therefore, any value calculated from the transition probabilities will also have an uncertainty associated with it. In the previous chapter, we mapped the uncertainty in the transition probabilities to uncertainties in the mean first passage time between two states, or other similar quantities that are solutions of linear equations in the transition probabilities. In the following section, we calculate efficient closed-form expressions for the uncertainties in the eigenvalues and eigenvectors of the Markovian state model, which describe the full kinetics of the system. The basis for the derivation and many of the equations are similar to those for the mean first passage time (Chapter 5). However, we reproduce them here for clarity.

Eigenvalue and eigenvector equations

In a Markovian state model, we represent the conformation space by $K$ discrete states, each of which corresponds to some distinct group of molecular conformations. Let us define the probability of transitioning from state $i$ to state $j$ at a time step of t as $p_{ij}$. An eigenvalue $\lambda$ of a matrix $P$ is defined as

$$P v_\lambda = \lambda v_\lambda, \tag{6.1}$$

where $v_\lambda$ is the eigenvector corresponding to eigenvalue $\lambda$. We define the matrix $A$ with rows $a_i$ as

$$A = P - \lambda I = \begin{bmatrix} p_{11} - \lambda & p_{12} & \cdots & p_{1K} \\ p_{21} & p_{22} - \lambda & \cdots & p_{2K} \\ \vdots & & \ddots & \vdots \\ p_{K1} & \cdots & p_{K(K-1)} & p_{KK} - \lambda \end{bmatrix}, \tag{6.2}$$

where $I$ is the identity matrix. Eq. 6.1 is then equivalent to

$$A v_\lambda = 0, \tag{6.3}$$

and has a non-trivial solution $v_\lambda$ when the determinant of the matrix is zero,

$$\det(A) = 0. \tag{6.4}$$

Transition probability distribution

Finite sampling causes uncertainties in the estimates of the transition probabilities between states. A derivation and complete explanation of the distribution over transition probability vectors has been given before in Sec. Here, we summarize the main results.

We define $p_{ij}$ as the actual transition probability from state $i$ to $j$ at a time step of t, where the sum of the transition probabilities from state $i$ is equal to one. We can estimate these transition probabilities by independently sampling transitions between states $i$ and $j$, either through independent simulations or by only including transitions separated by the lag time at which the transitions are Markovian. We generate counts $z_{ij}$ which are the total number of transition samples from state

$i$ to state $j$. We define $n_i$ as the total number of samples originating from state $i$,

$$n_i = \sum_{j=1}^{K} z_{ij}. \tag{6.5}$$

The distribution of the $z_{ij}$ variables follows the multinomial distribution with parameters $n_i, p_{i1}, p_{i2}, \ldots, p_{iK}$ [JKB97]. Using Bayesian analysis, we can compute the distribution over all possible vectors of transition probabilities. The probability of a particular column vector $p_i$ being the true transition probability vector, given the observed transition counts, is, from Bayes' rule,

$$P(p_i \mid z_i) \propto P(z_i \mid p_i)\, P(p_i) = p_{i1}^{z_{i1}}\, p_{i2}^{z_{i2}} \cdots p_{iK}^{z_{iK}}\, P(p_i), \tag{6.6}$$

where $P(p_i)$ is the prior probability over the transition probability vectors, i.e., the distribution representing the state of knowledge of transition probability vectors before observing any data. A convenient choice for the prior is the Dirichlet distribution, the conjugate prior of the multinomial distribution. The Dirichlet distribution with variables $p$ and parameters $u$ is defined as

$$\mathrm{Dirichlet}(p; u) = \frac{1}{Z(u)} \prod_{i=1}^{K} p_i^{u_i - 1}, \tag{6.7}$$

where $Z(u)$ is a normalizing constant defined in Appendix A and $\Gamma$ is the gamma function. If we define the prior of the transition probabilities as a Dirichlet distribution with parameters $\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{iK}$ and we observe transition counts $z_{i1}, z_{i2}, \ldots, z_{iK}$, the posterior of the transition probabilities is a Dirichlet distribution with parameters $\alpha_{i1} + z_{i1}, \alpha_{i2} + z_{i2}, \ldots, \alpha_{iK} + z_{iK}$. For notational convenience, we define the Dirichlet counts as

$$u_{ij} = \alpha_{ij} + z_{ij}. \tag{6.8}$$

Therefore, assuming a Dirichlet prior, the distribution of the transition probabilities, $p_i$, given the observed data counts is $\mathrm{Dirichlet}(p_i; u_i)$. In the limit, as the sampling (and therefore transition counts) increases, the distribution of the transition probabilities will not depend on the choice of the prior distribution.
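The posterior update described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the counts `z_i`, the three-state size, and the uniform prior `alpha_ij = 1/K` are hypothetical values chosen for the example, and the helper names `posterior_counts` and `sample_dirichlet` are not from the text. Sampling uses the standard construction of a Dirichlet draw as normalized gamma variates.

```python
import random

def posterior_counts(z_row, alpha_row):
    """Dirichlet posterior parameters u_ij = alpha_ij + z_ij (Eq. 6.8) for one state i."""
    return [a + z for a, z in zip(alpha_row, z_row)]

def sample_dirichlet(u_row, rng):
    """Draw one probability vector from Dirichlet(u_row) by normalizing gamma variates."""
    g = [rng.gammavariate(u, 1.0) for u in u_row]
    total = sum(g)
    return [x / total for x in g]

# Hypothetical transition counts out of one state of a three-state model
# (illustrative numbers, not data from the text).
z_i = [90, 8, 2]
alpha_i = [1.0 / 3.0] * 3           # assumed uniform prior, alpha_ij = 1/K
n_i = sum(z_i)                      # total samples from state i (Eq. 6.5)
u_i = posterior_counts(z_i, alpha_i)

rng = random.Random(0)              # fixed seed so the sketch is repeatable
p_i = sample_dirichlet(u_i, rng)    # one plausible transition probability vector
```

Each call to `sample_dirichlet` returns one candidate row of the transition matrix; repeating it for every state gives one sample of the full matrix.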

It will be useful to state the expected values of the posterior distribution of the transition probabilities for future reference, where $w_i$ are normalizing weight variables [KBJ00]:

$$\bar{p}_{ij} = E(p_{ij}) = \frac{u_{ij}}{w_i}, \qquad w_i = \sum_{j=1}^{K} u_{ij}. \tag{6.9}$$

Distribution of eigenvalues and eigenvectors

It is possible to repeatedly sample from the transition probability posterior distribution and find the eigenvalues and eigenvectors for each sample to determine the posterior distributions of these quantities. However, this method is very expensive, both because many samples are required to accurately describe the distribution and because the solution of the eigenvalue system is expensive for each sample ($O(K^3)$ plus some small number of iterations [GvL96]). For these reasons, we will make two approximations that will yield efficient closed-form solutions for the distributions of the eigenvalue $\lambda$ and the corresponding eigenvector $v^{\lambda}$. If the distributions of multiple eigenvalue/eigenvector pairs are desired, this procedure would need to be repeated independently for each pair.

Taylor series approximation

First, we will approximate the eigenvalue and eigenvector of interest with a Taylor series expansion about these values calculated at the mean values of the transition probabilities. We define the mean matrix $\bar{A}$ as

$$\bar{A} = \begin{bmatrix} \bar{p}_{11} - \bar{\lambda} & \bar{p}_{12} & \cdots & \bar{p}_{1K} \\ \bar{p}_{21} & \bar{p}_{22} - \bar{\lambda} & \cdots & \bar{p}_{2K} \\ \vdots & & \ddots & \vdots \\ \bar{p}_{K1} & \cdots & \bar{p}_{K(K-1)} & \bar{p}_{KK} - \bar{\lambda} \end{bmatrix}, \tag{6.10}$$

where the variables $\bar{p}_{ij}$ are the posterior expected values defined above. The mean eigenvalue $\bar{\lambda}$ satisfies the equation

$$\det(\bar{A}) = 0, \tag{6.11}$$

and the mean eigenvector $\bar{v}^{\lambda}$ satisfies the equation

$$\bar{A}\, \bar{v}^{\lambda} = 0. \tag{6.12}$$
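The brute-force "sample and solve" approach mentioned above can be illustrated on a two-state model, where the eigenvalue problem has a trivial solution. The sketch below is an assumption-laden toy, not the calculation used in this chapter: the Dirichlet counts `U` are hypothetical, and it relies on the fact that a 2x2 row-stochastic matrix has eigenvalues 1 and trace(P) - 1, so no general eigensolver is needed. For larger K each sample would require a full eigendecomposition, which is the cost the closed-form approach avoids.

```python
import random
import statistics

def sample_dirichlet(u_row, rng):
    """Draw one probability vector from Dirichlet(u_row) via normalized gamma variates."""
    g = [rng.gammavariate(u, 1.0) for u in u_row]
    total = sum(g)
    return [x / total for x in g]

def second_eigenvalue_2state(P):
    """Non-unit eigenvalue of a 2x2 row-stochastic matrix: trace(P) - 1."""
    return P[0][0] + P[1][1] - 1.0

# Hypothetical Dirichlet counts u_ij for a two-state model (not from the text).
U = [[900.5, 100.5],
     [50.5, 950.5]]

# Posterior mean matrix (Eq. 6.9) and the eigenvalue evaluated at the mean.
P_mean = [[u / sum(row) for u in row] for row in U]
lam_mean = second_eigenvalue_2state(P_mean)

# Brute-force posterior: sample every row independently, solve each sample.
rng = random.Random(1)
lams = []
for _ in range(5000):
    P = [sample_dirichlet(row, rng) for row in U]
    lams.append(second_eigenvalue_2state(P))

lam_sample_mean = statistics.mean(lams)
lam_sample_std = statistics.stdev(lams)
```

The spread of `lams` is the sampling-based estimate of the eigenvalue uncertainty that the closed-form expressions are meant to reproduce at much lower cost.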

The first-order Taylor series expansion for the eigenvalue $\lambda$ as a function of the transition probabilities is

$$\lambda = \bar{\lambda} + \left.\frac{\partial \lambda}{\partial p_{11}}\right|_{\bar{A}} \Delta p_{11} + \left.\frac{\partial \lambda}{\partial p_{12}}\right|_{\bar{A}} \Delta p_{12} + \cdots + \left.\frac{\partial \lambda}{\partial p_{KK}}\right|_{\bar{A}} \Delta p_{KK}, \tag{6.13}$$

where the $\Delta p_{ij}$ are small perturbations in the parameters. Appendix E shows how to compute the terms $\left.\partial \lambda / \partial p_{ij}\right|_{\bar{A}}$ in this expansion efficiently. Similarly, the first-order Taylor series expansion for the eigenvector $v^{\lambda}$ as a function of the transition probabilities is

$$v^{\lambda} = \bar{v}^{\lambda} + \left.\frac{\partial v^{\lambda}}{\partial p_{11}}\right|_{\bar{A}} \Delta p_{11} + \left.\frac{\partial v^{\lambda}}{\partial p_{12}}\right|_{\bar{A}} \Delta p_{12} + \cdots + \left.\frac{\partial v^{\lambda}}{\partial p_{KK}}\right|_{\bar{A}} \Delta p_{KK}. \tag{6.14}$$

Appendix F shows how to calculate all the terms $\left.\partial v^{\lambda} / \partial p_{ij}\right|_{\bar{A}}$ in this expansion efficiently.

Multivariate normal approximation

As shown in the section on the transition probability distribution, the transition probabilities $p_i$ are distributed according to Dirichlet distributions. If the sample size is sufficiently large, then, by the central limit theorem, the distribution of $p_i$ converges to a multivariate normal distribution (MVN) with mean $\mu_i$ and covariance matrix $\Sigma_i$ given by

$$\mu_i = \frac{u_i}{w_i}, \tag{6.15}$$

$$\Sigma_i = \frac{1}{w_i^2 (w_i + 1)} \left[ w_i\, \mathrm{Diag}(u_i) - u_i u_i^T \right], \tag{6.16}$$

where the superscript $T$ denotes the transpose, $\mathrm{Diag}(u_i)$ represents a matrix with entries $u_{ij}$ along the diagonal, and the $w_i$ terms are the normalizing weight variables defined in Eq. 6.9 [Rao73]. The covariance matrix in this distribution enforces the constraint that each possible transition probability vector $p_i$ must sum to unity.

Closed-form solutions

Making both the Taylor series and multivariate normal approximations leads to closed-form expressions for the distributions of the eigenvalue $\lambda$ and its corresponding eigenvector $v^{\lambda}$. For notational convenience, we define the deviation vector $\Delta p_i$, sensitivity of $\lambda$ vector $s^{\lambda}_i$, and sensitivity of $v^{\lambda}$

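matrix $S^{v^{\lambda}}_i$,

$$\begin{aligned} \Delta p_i^T &= \begin{bmatrix} \Delta p_{i1} & \cdots & \Delta p_{iK} \end{bmatrix}, \\ \left(s^{\lambda}_i\right)^T &= \begin{bmatrix} \left.\dfrac{\partial \lambda}{\partial p_{i1}}\right|_{\bar{A}} & \cdots & \left.\dfrac{\partial \lambda}{\partial p_{iK}}\right|_{\bar{A}} \end{bmatrix}, \\ S^{v^{\lambda}}_i &= \begin{bmatrix} \left.\dfrac{\partial v^{\lambda}_1}{\partial p_{i1}}\right|_{\bar{A}} & \cdots & \left.\dfrac{\partial v^{\lambda}_1}{\partial p_{iK}}\right|_{\bar{A}} \\ \vdots & \ddots & \vdots \\ \left.\dfrac{\partial v^{\lambda}_K}{\partial p_{i1}}\right|_{\bar{A}} & \cdots & \left.\dfrac{\partial v^{\lambda}_K}{\partial p_{iK}}\right|_{\bar{A}} \end{bmatrix}. \end{aligned} \tag{6.17}$$

We can then rewrite the eigenvalue expansion and Eq. 6.14 by grouping $K$ terms at a time as

$$\lambda = \bar{\lambda} + \sum_{i=1}^{K} \left(s^{\lambda}_i\right)^T \Delta p_i, \qquad v^{\lambda} = \bar{v}^{\lambda} + \sum_{i=1}^{K} S^{v^{\lambda}}_i \Delta p_i. \tag{6.18}$$

The vector $\Delta p_i$ is equal to $p_i - \bar{p}_i$ and, with the MVN approximation, has mean 0 and the covariance matrix $\Sigma_i$ of the multivariate normal approximation. Linear combinations of MVN random variables are also MVN random variables, as described in Appendix B [Rao73]. Therefore, $\lambda$ has a normal distribution with mean $\bar{\lambda}$ and variance $\sigma^2$ (see footnote 1),

$$\lambda \sim N(\bar{\lambda}, \sigma^2), \tag{6.19}$$

where

$$\sigma^2 = \sum_{i=1}^{K} \left(s^{\lambda}_i\right)^T \Sigma_i\, s^{\lambda}_i. \tag{6.20}$$

Substituting the expression for $\Sigma_i$, we see that

$$\sigma^2 = \sum_{i=1}^{K} \frac{1}{w_i^2 (w_i + 1)} \left(s^{\lambda}_i\right)^T \left[ w_i\, \mathrm{Diag}(u_i) - u_i u_i^T \right] s^{\lambda}_i = \sum_{i=1}^{K} \frac{1}{w_i^2 (w_i + 1)} \left[ w_i \left(s^{\lambda}_i\right)^T \mathrm{Diag}(u_i)\, s^{\lambda}_i - \left( \left(s^{\lambda}_i\right)^T u_i \right) \left( u_i^T s^{\lambda}_i \right) \right]. \tag{6.21}$$

Footnote 1: If the distribution of multiple eigenvalues is desired, it is possible to group the terms of the eigenvalue expansion in the same way as for the eigenvectors to find the covariance matrix between the eigenvalues.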
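As a worked check of Eqs. 6.16 and 6.21, the sketch below evaluates the closed-form eigenvalue variance for a two-state model in plain Python. Several pieces are assumptions made for the illustration: the counts `U` are hypothetical, and the sensitivities are obtained from the standard first-order perturbation identity for a simple eigenvalue, ∂λ/∂p_ij = l_i r_j / (l·r) with left eigenvector l and right eigenvector r, rather than from the Appendix E procedure, which is not reproduced here. For two states the eigenvectors of the non-unit eigenvalue are known analytically (l = (1, -1), r = (p_12, -p_21)).

```python
def dirichlet_covariance(u):
    """Covariance of Dirichlet(u): [w Diag(u) - u u^T] / (w^2 (w + 1)), as in Eq. 6.16."""
    w = sum(u)
    scale = 1.0 / (w * w * (w + 1.0))
    K = len(u)
    return [[scale * ((w * u[j] if j == k else 0.0) - u[j] * u[k])
             for k in range(K)] for j in range(K)]

def eigenvalue_variance(U, S):
    """Eq. 6.21: sum over states of s_i^T Sigma_i s_i, using only dot products."""
    var = 0.0
    for u, s in zip(U, S):
        w = sum(u)
        s_diag_s = sum(sj * uj * sj for sj, uj in zip(s, u))  # s^T Diag(u) s
        s_dot_u = sum(sj * uj for sj, uj in zip(s, u))        # s^T u
        var += (w * s_diag_s - s_dot_u ** 2) / (w * w * (w + 1.0))
    return var

# Hypothetical Dirichlet counts for a two-state model (not from the text).
U = [[900.5, 100.5],
     [50.5, 950.5]]
P = [[x / sum(row) for x in row] for row in U]   # posterior mean matrix

# Analytic eigenvectors of the non-unit eigenvalue for two states.
left = [1.0, -1.0]
right = [P[0][1], -P[1][0]]
lr = left[0] * right[0] + left[1] * right[1]

# Assumed sensitivities S[i][j] = d(lambda)/d(p_ij) = l_i r_j / (l . r).
S = [[left[i] * right[j] / lr for j in range(2)] for i in range(2)]

sigma2 = eigenvalue_variance(U, S)
```

Because the non-unit eigenvalue of a two-state chain is p_11 + p_22 - 1, `sigma2` should equal the sum of the two diagonal posterior variances, which gives an independent check on the formula.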

Similarly, $v^{\lambda}$ has a multivariate normal distribution with mean $\bar{v}^{\lambda}$ and covariance matrix $\Sigma^2_{v^{\lambda}}$,

$$v^{\lambda} \sim \mathrm{MVN}\left( \bar{v}^{\lambda}, \Sigma^2_{v^{\lambda}} \right), \tag{6.22}$$

where

$$\Sigma^2_{v^{\lambda}} = \sum_{i=1}^{K} S^{v^{\lambda}}_i\, \Sigma_i \left( S^{v^{\lambda}}_i \right)^T = \sum_{i=1}^{K} \frac{1}{w_i^2 (w_i + 1)} \left[ w_i\, S^{v^{\lambda}}_i\, \mathrm{Diag}(u_i) \left( S^{v^{\lambda}}_i \right)^T - \left( S^{v^{\lambda}}_i u_i \right) \left( S^{v^{\lambda}}_i u_i \right)^T \right]. \tag{6.23}$$

Computational cost

The closed-form distributions for the eigenvalue and for the eigenvector (Eq. 6.22) require solving for $\bar{\lambda}$ and $\bar{v}^{\lambda}$, which takes time $O(K^3)$ [GvL96]. Appendix E shows that we can find all the partial derivative terms for the eigenvalue in time $O(K^2)$ and Appendix F shows we can find all the partial derivative terms for the eigenvector in time $O(K^3)$. Since the variance of the eigenvalue is the sum of vector dot products (rather than matrix-vector products), we can calculate it in time $O(K^2)$. Similarly, since the covariance matrix of the eigenvector is the sum of matrix-vector products (rather than matrix-matrix products), we can calculate it in time $O(K^3)$.

Adaptive sampling

As described previously, we can decompose the closed-form normal or multivariate normal distributions for the eigenvalue or eigenvector to calculate the contribution to the variance from the elements in each row of the transition matrix, corresponding to the transitions from a single state. We can then start new simulations from the states which contribute the most to the variance in order to improve the overall precision. The variance of the eigenvalue $\lambda$ decomposes as

$$\sigma^2 = \sum_{i=1}^{K} \frac{q_i}{w_i + 1}, \qquad q_i = \left(s^{\lambda}_i\right)^T \left[ \mathrm{Diag}(\bar{p}_i) - \bar{p}_i \bar{p}_i^T \right] s^{\lambda}_i, \tag{6.24}$$

where we have separated out the $q_i$ terms which do not depend on the allocation of samples, $w_i$.
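The decomposition in Eq. 6.24 can be sketched as follows. The helper also evaluates the expected drop in variance from adding m samples to a single state, which is the quantity maximized by the selection rule in Eq. 6.25. Everything numeric here is hypothetical: the counts `U`, the sensitivity vectors `S`, and the batch size `m` are made-up inputs for illustration, and states are indexed from zero in the code.

```python
def q_term(p_bar, s):
    """q_i = s^T [Diag(p_bar) - p_bar p_bar^T] s (Eq. 6.24)."""
    mean_s = sum(p * x for p, x in zip(p_bar, s))
    mean_s2 = sum(p * x * x for p, x in zip(p_bar, s))
    return mean_s2 - mean_s ** 2

def variance_contributions(U, S):
    """Per-state contributions q_i / (w_i + 1) to the eigenvalue variance."""
    out = []
    for u, s in zip(U, S):
        w = sum(u)
        p_bar = [x / w for x in u]
        out.append(q_term(p_bar, s) / (w + 1.0))
    return out

def best_state(U, S, m):
    """Zero-based index of the state whose m extra samples reduce the variance most."""
    gains = []
    for u, s in zip(U, S):
        w = sum(u)
        q = q_term([x / w for x in u], s)
        gains.append(q / (w + 1.0) - q / (w + m + 1.0))
    return max(range(len(gains)), key=gains.__getitem__)

# Hypothetical two-state inputs: state 0 is sampled far less than state 1.
U = [[90.5, 10.5],
     [500.5, 500.5]]
S = [[1.0, 0.0],      # assumed sensitivities, for illustration only
     [0.0, 1.0]]

contrib = variance_contributions(U, S)
shares = [c / sum(contrib) for c in contrib]   # normalized to sum to one
best = best_state(U, S, m=100)
```

In this toy input the poorly sampled state dominates the variance, so it is also the state selected for the next batch of simulations.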

If we were to add $m$ more samples and assume that the expected transition probabilities remain constant, we can choose the state $i$ which will decrease this variance the most as

$$i = \operatorname*{argmax}_i \left( \frac{q_i}{w_i + 1} - \frac{q_i}{w_i + m + 1} \right). \tag{6.25}$$

Similar calculations can be performed to obtain the state which contributes the most to any function of the covariance matrix of the eigenvector.

6.3 Results

To test the closed-form solutions for the distribution of an eigenvalue $\lambda$ and of an eigenvector $v^{\lambda}$ (Eq. 6.22), we compare the distributions with those obtained from sampling from the posterior transition probability distribution and solving each sample for the eigenvalue or eigenvector of interest. We can test all combinations of the two assumptions given above using different sampling and solving methods. Namely, method 1 will sample from the Dirichlet distributions and solve for the eigenvalues or eigenvectors directly, method 2 will sample from the MVN distributions and solve for the eigenvalues or eigenvectors directly, method 3 will sample from the Dirichlet distributions and substitute into the Taylor series approximations, and method 4 will sample from the MVN distributions and substitute into the Taylor series approximations. In this way, we can determine independently whether the MVN approximations and the Taylor series approximations are valid. The closed-form equations derived above are simply closed-form solutions to the sampling-based method 4.

We apply these methods to calculate the distributions of eigenvalues and eigenvectors in the terminally blocked alanine peptide (Fig. 6.1) to demonstrate that the multivariate normal and Taylor series approximations are valid. Stable states on the conformational landscape have previously been identified [CSPD06b]. 
A set of 30,000 shooting trajectories (5,000 initiated from equilibrium distributions within each of the six states) at 302 K was obtained from Chodera et al. [CSPD06b]. We count transitions between these states at a lag time $\Delta t$ of 0.1 ps, counting only one transition per trajectory to ensure independence of the data. The state decomposition is non-Markovian at this lag time; therefore, the eigenvalues and eigenvectors of the transition matrix may not correspond to the true underlying alanine dynamics. However, it is still important to determine the error from sampling in the eigenvalues and eigenvectors, since these values are used in tests for Markovian behavior [SPS04a] and clustering of states [CSP+07]. Further, the primary focus in this chapter is


in validating the mathematical modeling of the distributions of eigenvalues and eigenvectors.

Figure 6.1: Potential of mean force and manual state decomposition for terminally-blocked alanine peptide. Left: The terminally-blocked alanine peptide with $\phi$ and $\psi$ backbone torsions labeled. Right: The potential of mean force in the $(\phi, \psi)$ torsions at 302 K estimated from the parallel tempering simulation. Boundaries defining the six states manually identified by Chodera et al. [CSPD06b] are superimposed and the states labeled.

The counts for this system are

$Z =$ [count matrix not recoverable from the source], (6.26)

and we set the prior $\alpha_{ij} = 1/6$, as previously described [SP05a].

Eigenvalue distributions

Figure 6.2 shows the distributions for the five non-unit eigenvalues as calculated from the closed-form normal distribution (red lines), and from the four sampling-based methods described above. It is clear that for the fifth and sixth eigenvalues, the normal distributions are excellent matches with the sampling-based distributions. For the second eigenvalue, there are slight discrepancies between

the Dirichlet samples and the MVN samples. For the third and fourth eigenvalues, there appear to be differences between the methods which solve for the eigenvalues directly and those which make the Taylor series approximation. When there are multiple eigenvalues that are close in magnitude, small perturbations in the transition probabilities may result in the shifting of the rank of the eigenvalues of the corresponding perturbed matrix with respect to the original eigenvalues. Therefore, when the eigenvalues of the perturbed matrix are solved directly, one cannot simply take, for example, the second largest eigenvalue of the perturbed matrix to calculate the distribution of the eigenvalue that was second largest in the original matrix. In this system, the third and fourth eigenvalues overlap in range. In the direct solutions of eigenvalues, the distribution of the third largest eigenvalue is therefore shifted to the right of the distribution of the eigenvalue ranked third in the original matrix. It is possible to match eigenvalues based on their corresponding eigenvectors, but we have not done that here. A benefit of the Taylor series approximation is that it automatically calculates deviations to the particular eigenvalue of interest, and thus is insensitive to these changes in rank.

The Taylor series expansion also immediately decomposes into contributions from the transitions out of each state, as discussed in the section on adaptive sampling. Figure 6.3 shows the contribution of each state to the variance in each eigenvalue (normalized such that the total contribution for each eigenvalue sums to one). These values are the $q_i$ terms of the variance decomposition. If we wished to determine from which states to start new simulations to reduce the variance in any of the eigenvalues, we would use Eq. 6.25, since the expected decrease in variance depends on the current number of samples from a given state. 
However, since the shooting trajectories have an equal number of samples from each state, we can use Fig. 6.3 to see that, for example, we should add more samples to state 5 to decrease the variance of the second eigenvalue. It would be very difficult to extract this information if one were to sample possible transition probabilities and solve each sample for the eigenvalues.

Eigenvector distributions

In addition to the distributions of the eigenvalues, we are also interested in the distribution of the eigenvectors. Figure 6.4 shows the mean and variance calculated from the closed-form distribution (Eq. 6.22) and the four sampling-based methods for the eigenvector components corresponding to the third (top panel) and fifth (bottom panel) eigenvalues. The inset in the top panel shows the full distributions for the second eigenvector component. Since the eigenvalues may shift in rank with the perturbations to the transition probabilities, the eigenvector component distribution generated from solving for the eigenvectors directly is clearly bimodal. Because the third and fourth eigenvalues

Figure 6.2: Distributions of the five non-unit eigenvalues of the terminally blocked alanine peptide system. The red lines indicate the normal distributions calculated using Eq. 6.21, and the magenta, green, blue, and cyan density plots indicate the distributions generated from the four sampling-based methods (Dirichlet sampling and direct solving, MVN sampling and direct solving, Dirichlet sampling and Taylor series substitution, and MVN sampling and Taylor series substitution, respectively) for 20,000 samples.


overlap in range (as shown in Fig. 6.2), the eigenvectors calculated by ranking the eigenvalues in order actually correspond to different processes. As in the case with the eigenvalues, the Taylor series methods are insensitive to rank ordering changes, and calculate the true, unimodal distribution of the eigenvector components. The fifth eigenvalue, however, is well separated from the other eigenvalues, and the full distribution of the third eigenvector component (shown in the bottom inset of Fig. 6.4) is a good approximation of the actual distribution. The distributions of eigenvector components are not independent; they also have some covariance between them, which is not shown here. While we have only shown the mean and variances for two of the eigenvectors and the full distributions for two of the components, the results are similar across eigenvectors and components (data not shown).

Figure 6.3: The percent contribution of each state to the variance for the five non-unit eigenvalues (Eq. 6.24).

The variance of each eigenvector component can again be decomposed into contributions from transitions leaving each state. Figure 6.5 shows this decomposition for the eigenvectors corresponding to the third (top panel) and fifth (bottom panel) eigenvalues. The values shown are the percent contribution to the sum of the variances of all the components. We can see that for the third eigenvector, the sixth component has the most variance and can be improved by adding more samples from states 5 and 6. For the fifth eigenvector, the fifth and sixth components are quite precise, and the remaining four components depend to different degrees on the transitions from the first four states. Again, this information would be very difficult to extract by sampling from the transition matrix and solving the eigenvectors for each sample.

Figure 6.4: Distributions of the eigenvector components corresponding to the third (top panel) and fifth (bottom panel) eigenvalues of the six-state model of the terminally blocked alanine peptide. Distributions are calculated either from the closed-form MVN distribution or from the samples obtained by methods 1 to 4 described above. The insets show the actual distribution for the second eigenvector component (top inset) and third eigenvector component (bottom inset).


Figure 6.5: The contributions to the variance of the eigenvector components as decomposed by transitions from each state. The top panel corresponds to the third eigenvalue and the bottom panel corresponds to the fifth eigenvalue.

Adaptive sampling

In addition to efficiently calculating the uncertainties in the eigenvalues and eigenvectors, we also wish to use these estimates to improve the sampling, as described earlier in the section on adaptive sampling. We compare the adaptive sampling algorithm to equilibrium sampling, where the number of trajectories initiated from each state is proportional to the equilibrium probability of the state, and even sampling, where an equal number of trajectories are initiated from each state. Assuming we can take $m$ transition samples in each round, we can decide where to allocate the
