Using Dimensionality Reduction to Better Capture RNA and Protein Folding Motions


Lydia Tapia, Shawna Thomas, Nancy M. Amato
Technical Report TR8-5
Parasol Lab, Department of Computer Science, Texas A&M University
October 5, 28

Abstract

Molecular motions, including those of both proteins and RNA, play an essential role in many biochemical processes. Simulations have attempted to study these detailed large-scale molecular motions, but they are often limited by the expense of representing complex molecular structures. For example, enumerating all possible RNA conformations with valid contacts is an exponential endeavor, and the complexity of protein motion increases with the model's detail and protein length. In this paper, we explore the use of dimensionality reduction techniques to better approximate protein and RNA motions. We present two new methods to study motions: (1) an evaluation technique to compare different distributions of conformations and (2) a way to identify likely local motion transitions. We combine these two methods in an existing motion framework to study large-scale motions for both proteins and RNA. We show that dimensionality reduction can be effectively applied, even to discrete conformation spaces (as for RNA secondary structure) that do not typically lend themselves to reduction techniques.

1 Introduction

Molecular motions are critical for many biological processes. For example, ribonucleic acid (RNA) motions are responsible for synthesizing proteins, catalyzing reactions, splicing introns, and regulating cellular activities [49, 24, 3]. Proteins are also well known for their conformational flexibility when binding with other proteins, ligands, sugars, or other small molecules. The different conformations, or molecular shapes, that each of these molecules adopts influence its function. While experimental methods have been highly successful at determining some dynamics and many static three-dimensional structures of proteins and RNA, they do not operate at the time scales necessary to record detailed large-scale protein motions.
Methods that simulate the folding in silico have attempted to fill this gap, but they are often limited by the complexity of representing detailed molecular structures. For example, enumerating all possible RNA conformations with valid contacts is an exponential endeavor and has been shown to be impractical for sequences longer than 4 nucleotides [6]. Protein conformations are also represented in a complex space. Even when a protein conformation is considered only as a set of 3-D atom coordinates, a conformation of N atoms is represented by a vector of size 3N.

Due to the high complexity of representing molecular motion, the study of many systems is limited. However, if the complex data can be summarized by fewer, more likely motions, then it may be possible to study larger systems. Mathematical techniques of dimensionality reduction give a clear way of reducing a high-dimensional data set to a low-dimensional representation. They have been applied in many domains including computational biology [7, 12, 14, 15, 29, 34, 35, 47].

Our Contribution. In this paper we explore the use of dimensionality reduction to study the motions of both RNA and proteins. We contribute two novel methods to the study of motions: a way to evaluate and compare different distributions of conformations and a way to identify local motion transitions. The combination of these two new methods enables the study of biologically relevant large-scale motions.

In order to evaluate different distributions of samples, we use dimensionality reduction to identify the principal features of a large conformational space. The coverage of different distributions on this reduced manifold can then be calculated. We demonstrate this new technique on an RNA reduction where many different conformation sets can be evaluated.
This is particularly interesting because RNA folding is often studied in a discrete setting, and dimensionality reduction techniques have not typically been applied to discrete problems.

Small, local motions from one nearby conformation to another are often used as a way to piece together larger, more interesting motions. However, identifying which conformations are nearby is not always easy or inexpensive. We demonstrate how dimensionality reduction can be applied to identify candidates for this localized motion. This new method is applied to a set of 35 proteins and demonstrates a significant improvement over previous methods.

Both of these contributions, sampling distributions and local transitional motions, are general and can be applied to any conformational set. We demonstrate their use within the framework of probabilistic roadmap methods (PRMs) [22].

2 Related work

For years, mathematical dimensionality reduction techniques have been applied to a variety of problems that exist in a complex space. Often, the data from these problems is too large and complex to analyze by hand, so these reduction techniques approximate the complex space with a smaller representation that includes the features of interest. High-dimensional data from a variety of domains has been successfully reduced. These domains include human subject studies [52], stellar spectra [11], and facial images [5]. Recently, dimensionality reduction has been applied to the biological problems of analyzing protein folding trajectories [7, 12, 14, 15, 29, 34, 35] and protein flexibility [46, 47].

Many approaches have been taken to explore the reduction of high-dimensional molecular data, including linear dimensionality reduction [21], non-linear dimensionality reduction [27], and Normal Mode Analysis [51]. One of the most common techniques for dimensionality reduction, principal component analysis (PCA), was used to study the high-amplitude fluctuations in a molecular dynamics simulation of a small 46-residue protein [12]. From there, it has been applied to examine dynamics problems such as identifying protein conformational sub-states [23, 5, 36], extending the timescale of molecular dynamics simulations [1, 25], and performing conformational sampling [9, 8, 45]. PCA has also been applied to compare interpretations of the reduced space against experimental data, e.g., as was done with extensive mutation data [34].

Because protein motion was shown to be generally non-linear [12], non-linear dimensionality reduction techniques have been applied to proteins. Non-linear techniques were used to analyze hundreds of thousands of conformations generated from a statistical mechanical method in order to define the most relevant reaction coordinates for the system [7]. Later, techniques to speed up the analysis were introduced in [35].

Another common approach to dimensionality reduction, normal mode analysis (NMA), determines the collective modes for dynamic systems. When applied to proteins, it has given insight into the vibrational motions [29, 3]. Recently, adaptations have been made to the standard NMA approach that allow the study of larger systems, e.g., a 12 amino acid protein in water [54].
The combination of PCA and NMA can also provide useful insight when the two measures agree or disagree [23]. Using information gained from the two methods, proteins such as bovine pancreatic trypsin inhibitor [15] and T4 lysozyme [14] have been studied.

3 Methods

There are two main classes of methods used for our analysis of protein and RNA landscapes. First, we must have a method of generating molecular conformations and connecting them with local motion transitions. We refer to this process as roadmap construction because it originates from PRMs for studying robot motion [22]. The strength of PRMs lies in the ability to probabilistically approximate the conformations and local transitions required to capture critical events in the folding process. However, to represent the motions of even a small protein using an atomic-level model, thousands of sample conformations and hundreds of thousands of local motion transitions are required. Second, after describing how a set of conformations can be generated, we explore the use of various dimensionality reduction techniques for determining a low-dimensional representation for our conformation sets. We discuss each of these two components in more detail below.

3.1 Roadmaps for Protein and RNA Folding

In previous work, we introduced approaches for studying protein folding [2] and RNA folding [4] based on the probabilistic roadmap approach for motion planning [22]. We successfully applied our method to a large number of structures and were able to identify subtle differences that had been experimentally determined in the secondary structure formation order for proteins with very similar structures [38, 48] and kinetic differences for mutated proteins [43] and RNA [41, 42].

Our method is simple and consists of two main steps: (1) sampling conformations on the landscape and (2) making transitions between sampled conformations.
In the first step, conformations (roadmap nodes) are sampled on the folding landscape, with a bias to increase density near the known native, folded state (Figure 1(a)). In the second step, connections (roadmap edges) are made between sampled conformations with similar structure (Figure 1(b)). Weights are assigned to directed edges to reflect the energetic feasibility of transitioning between the two endpoint conformations. This combination of nodes and weighted edges forms a roadmap that approximates the energy landscape. This roadmap encodes thousands of folding pathways. The most energetically feasible pathways in the roadmap can be extracted using these weights (Figure 1(c)).

Models. To study protein motion, we model the protein as an articulated linkage. Using a standard modeling assumption for proteins that bond angles and bond lengths are fixed [39], the only degrees of freedom in our model are the backbone's φ and ψ torsional angles. These are represented as revolute joints with values in the range [0, 2π).


Figure 1: A PRM roadmap for molecular folding shown imposed on a visualization of the potential energy landscape: (a) after node generation, (b) after the connection phase, and (c) using it to extract folding paths to the known native structure.

In the results shown in this paper, we use a coarse potential function similar to [28]. We use a step function approximation of the van der Waals potential component and model side chains as spheres with zero degrees of freedom. If any two spheres are too close (i.e., less than 2.4 Å during sampling and 1.Å during connection), a very high potential is returned such that these conformations are rejected from the landscape model. Otherwise, the potential is:

$$U_{tot} = \sum_{restraints} K_d \left\{ \left[ (d_i - d_0)^2 + d_c^2 \right]^{1/2} - d_c \right\} + E_{hp} \qquad (1)$$

where $K_d$ is 1 kJ/mol and $d_0 = d_c = 2$ Å as in [28]. The first term represents constraints favoring known secondary structure through main-chain hydrogen bonds and disulphide bonds, and the second term ($E_{hp}$) is the hydrophobic effect. The hydrophobic effect is computed as follows: if two hydrophobic residues are within 6 Å of each other, then the potential is decreased by 2 kJ/mol. For more details, please see [37].

To study RNA motion, we focus the RNA model on the formation of secondary structure. Secondary structure is a planar representation of an RNA conformation, which is commonly used to study RNA folding [55, 56, 16]. We adopt the definition in [16] that eliminates other types of contacts that are not physically favored. In the results shown in this paper, we use a common energy function called the Turner or nearest neighbor rules [55]. This method involves determining the types of loops that exist in the molecule and looking up their free energy in a table of experimentally determined values. Intuitively, adjacent contacts typically form stable subunits (called stacks or stems) that have low energy.

Sampling. Conformation samples are retained based on their energy.
In our protein work, a sample $q$, with potential energy $E_q$, is accepted with probability:

$$P(\text{accept } q) = \begin{cases} 1 & \text{if } E_q < E_{min} \\ \dfrac{E_{max} - E_q}{E_{max} - E_{min}} & \text{if } E_{min} \le E_q \le E_{max} \\ 0 & \text{if } E_q > E_{max} \end{cases} \qquad (2)$$

where $E_{min}$ is the potential energy of the open chain and $E_{max}$ is $2E_{min}$.

The roadmap produced by our technique is an approximation of the protein's energy landscape. The quality of the approximation largely depends on the sampling distribution. Generally, we are most interested in regions near the native conformation and so seek to concentrate sampling there. In the results shown here, we use the sampling technique presented in [48] based on rigidity analysis [18, 19, 2, 17, 26]. We have shown that this method provides a denser distribution of samples near the native conformation, increasing the size of the proteins we can study.

In our previous work with RNA, we have explored a variety of sampling methods: complete base-pair enumeration (BPE), stack-pair enumeration (SPE), and probabilistic Boltzmann sampling (PBS) [42]. While a BPE roadmap describes the complete energy landscape, it is infeasible for large RNA (e.g., more than 4 nucleotides). SPE roadmaps are smaller (one or two orders of magnitude smaller than BPE roadmaps). PBS roadmaps are the smallest (up to 1 orders of magnitude smaller than BPE roadmaps), and we have shown they scale well for much larger RNA (e.g., with hundreds of nucleotides) [41].

PBS uses Wuchty's method [53] to enumerate suboptimal (low energy) conformations within a given energy threshold. We take these suboptimal conformations as seeds and include additional random conformations. Then, we use a probabilistic filter to retain a subset of the conformations based on their Boltzmann distribution factors. For a given conformation $q$ with free energy $E_q$, the probability of accepting it is:

$$P(\text{accept } q) = \begin{cases} e^{-(E_q - E_0)/kT} & \text{if } (E_q - E_0) > 0 \\ 1 & \text{if } (E_q - E_0) \le 0 \end{cases} \qquad (3)$$

where $E_0$ is a reference energy threshold that we can use to control the number of samples kept, $k$ is the Boltzmann constant, and $T$ is the temperature.

Connection. Connections between two conformations, $q_i$ and $q_j$, are labeled with edge weights that reflect the energetic feasibility of transitioning between them. For proteins, this is done by first identifying all the intermediate nodes, $q_i = c_0, c_1, \ldots, c_{n-1}, c_n = q_j$, that connect $q_i$ to $q_j$. For each pair of consecutive conformations $c_i$ and $c_{i+1}$, the probability $P_i$ of transitioning from $c_i$ to $c_{i+1}$ depends on the difference between their potential energies, $\Delta E_i = E(c_{i+1}) - E(c_i)$:

$$P_i = \begin{cases} e^{-\Delta E_i / kT} & \text{if } \Delta E_i > 0 \\ 1 & \text{if } \Delta E_i \le 0 \end{cases} \qquad (4)$$

where $k$ is the Boltzmann constant and $T$ is the temperature. This keeps the detailed balance between two adjacent states and enables the edge weight to be computed by summing the negative logarithms of the probabilities for all pairs of consecutive conformations in the sequence. With this edge weight definition, we can use simple graph search algorithms to extract the most energetically feasible pathways in the roadmap between two given states (e.g., from the unfolded state to the folded state).

Similar to the method described for proteins (above), we calculate a weight $w_{ij}$ for edge $(q_i, q_j)$ that reflects the Boltzmann transition probability between $q_i$ and $q_j$ for RNA. First, we determine the energy barrier (the maximum energetic cost) $E_b$ between $q_i$ and $q_j$. Then, we calculate the Boltzmann transition probability $k_{ij}$ (or transition rate) of moving from $q_i$ to $q_j$ using Metropolis rules [1]:

$$k_{ij} = \begin{cases} e^{-\Delta E / kT} & \text{if } \Delta E > 0 \\ 1 & \text{if } \Delta E \le 0 \end{cases} \qquad (5)$$

where $\Delta E = \max(E_b, E_j) - E_i$, $k$ is the Boltzmann constant, and $T$ is the temperature. Note that the same energy barrier $E_b$ is also used to estimate the transition probability $k_{ji}$, so the calculation satisfies detailed balance.
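The edge-weight definitions of Equations (4) and (5), and the graph search they enable, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the function names, the toy energies and barriers, the value chosen for kT, and the use of Dijkstra's algorithm as the "simple graph search" are all hypothetical choices made for the example.

```python
import heapq
import math

KT = 0.6  # assumed value of k*T, in the same arbitrary units as the toy energies


def transition_probability(delta_e, kt=KT):
    """Metropolis-style probability: exp(-dE/kT) for uphill moves, 1 otherwise."""
    return math.exp(-delta_e / kt) if delta_e > 0 else 1.0


def protein_edge_weight(energies, kt=KT):
    """Sum of -log(P_i) over consecutive intermediates c_0, ..., c_n (Eq. 4)."""
    return sum(
        -math.log(transition_probability(e_next - e_cur, kt))
        for e_cur, e_next in zip(energies, energies[1:])
    )


def rna_edge_weight(e_i, e_j, e_barrier, kt=KT):
    """-log(k_ij) with dE = max(E_b, E_j) - E_i (Eq. 5)."""
    return -math.log(transition_probability(max(e_barrier, e_j) - e_i, kt))


def most_feasible_path(graph, start, goal):
    """Dijkstra search over {node: {neighbor: weight}}; returns (weight, path)."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return math.inf, []


# Toy roadmap: three conformations with made-up energies and energy barriers.
energy = {"unfolded": 0.0, "intermediate": -1.0, "native": -3.0}
barriers = {
    ("unfolded", "intermediate"): 0.5,
    ("intermediate", "native"): -0.5,
    ("unfolded", "native"): 4.0,
}
roadmap = {name: {} for name in energy}
for (a, b), e_b in barriers.items():
    roadmap[a][b] = rna_edge_weight(energy[a], energy[b], e_b)  # forward edge
    roadmap[b][a] = rna_edge_weight(energy[b], energy[a], e_b)  # same barrier reversed

cost, path = most_feasible_path(roadmap, "unfolded", "native")
```

Because every transition probability is at most 1, every weight is non-negative, which is what makes a shortest-path search such as Dijkstra's applicable here. In the toy roadmap the direct edge has a high barrier, so the search prefers the route through the intermediate conformation.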
As with the proteins, the edge weight $w_{ij}$ is the negative logarithm of the transition probability.

3.2 Dimensionality Reduction Techniques

A variety of dimensionality reduction methods have been developed that analyze a set of points (input) and produce a low-dimensional representation for each input point (output). The methods vary in the speed of calculation and the complexity of the data the models are able to represent. As in many data mining techniques, there are two main classes of methods: those that are able to capture data that is linearly representable and those that are able to capture non-linear data. Two popular methods for linear reduction are the classical techniques of Principal Component Analysis (PCA) [21] and Multidimensional Scaling (MDS) [4]. These methods are very popular because they are easy to implement, compute solutions efficiently, and can guarantee a globally optimal linear subspace reduction of the high-dimensional data. However, if the data being studied is non-linear, then more recent non-linear reduction techniques have been used to obtain better reductions [27].

In this paper we explore two methods for dimensionality reduction: PCA (linear) and Isomap [44] (non-linear). While these two methods both provide a reduction of some given model (see Algorithm 3.1), they differ greatly in how this model is obtained and internally represented. In our description of these methods we will use:

n as the size of the original dataset (in our case RNA or protein conformations),
D as the dimensionality of the original dataset,
R as the number of dimensions in the reduced space required to represent the original dataset.

PCA. Principal Component Analysis (PCA) is one of the most well known methods for dimensionality reduction. Its popularity stems from the ease of calculation and the longevity of the method [21]. The goal of PCA is to compute the D Principal Components (PCs) of the original data set.
Even though there are D resulting PCs, often the variance in the data can be fully represented by a smaller set of the PCs, e.g., of size R. The general algorithm for PCA is briefly outlined in Algorithm 3.2. The critical step of the PCA method is the calculation of the D PCs for an initial data set of dimensionality D. Each resulting PC is a vector that is aligned with a direction of maximal variance in the initial data set. They are ordered, e.g., the first PC represents the direction of maximal variance, the second with the second maximal, etc. Again, despite the fact that there are D resulting PCs, often the variance in the data can be fully represented by a smaller set of the PCs, e.g., of size R.

Algorithm 3.1 Dimensionality Reduction for Molecules
Input. A set of n conformations, represented in D dimensions
Output. A set of size n in R dimensions where R << D

Algorithm 3.2 Principal Component Analysis for Molecules
Input. n × D matrix, X
Output. Set of R principal components, PC
1: Center the data in X by subtracting the data mean from each point
2: Construct the D × D covariance matrix $C = X^T X$
3: Compute the top D eigenvalues and eigenvectors of C via singular value decomposition (SVD) of C.
4: Set PC as the ordered D eigenvectors of C.
5: return The first R of PC where the variance of the representation of the original dataset is minimized and R < D.

Isomap. A popular non-linear dimensionality reduction technique is Isomap [44]. It retains the features of efficiency and global optimality while being able to represent non-linearity in the data. Isomap has been shown to work well on large and complex data sets [44] and has been applied to proteins [7].

Algorithm 3.3 Isomap for Molecules
Input. A set of n conformations.
1: Construct a neighborhood graph G. For each conformation $n_i$, connect it to neighbor $n_j$ with edge length d(i, j) if $n_j$ is a k nearest neighbor of $n_i$. If $n_j$ is not a k nearest neighbor of $n_i$, connect with an edge weight of d(i, j) = ∞.
2: Compute the shortest paths in a matrix $D_G$. For every pair of points i, j, compute the shortest path distances between those points, e.g., min[d(i, j), d(i, k) + d(k, j)] for every k from 1 to n.
3: Construct an R-dimensional embedding. Apply classical multi-dimensional scaling to the matrix of graph distances $D_G$.
This will construct an embedding of the data in an R-dimensional Euclidean space while preserving intrinsic geometry.

The general algorithm for Isomap is briefly outlined in Algorithm 3.3. The algorithm works by obtaining a geometric representation of the input data, e.g., distances from one conformation to another. By using these geodesic distances, Isomap can preserve the topology of a complex and non-linear manifold even with a low-dimensional representation, e.g., of size R.

4 Application: Capturing RNA and Protein Landscapes

In this section, we explore the application of linear and non-linear dimensionality reduction techniques to both RNA and protein conformation sets. We also investigate the parameters that affect the reduction quality.

4.1 Selecting Linear vs. Non-Linear Reduction

Here we compare the efficiency of dimensionality reduction performed by both PCA (linear) and Isomap (non-linear).

For the PCA reduction, we take all the roadmap conformations as input. For example, with proteins, each conformation is the series of backbone φ and ψ torsional angles. Then, we run PCA through MATLAB and plot the variance of the residuals.

For the Isomap reduction, we again take all the roadmap conformations as input. Then, we construct a neighborhood graph (see Algorithm 3.3) using a distance measure. For the RNA shown, we use a distance measure calculated from base-pair differences [16]. For the proteins shown, we use all backbone atom root mean square distance (RMSD). The implementation of Isomap is from [44].

Figure 2 shows the variance of the residuals for both PCA and Isomap as a function of the number of reduced dimensions. For both methods, residual variance decreases rapidly with each additional dimension and then tapers off as the number of dimensions increases. To completely represent the data, both methods would require greater than 6 dimensions.

Figure 2: Variance of the residuals from the dimensionality reduction for Protein G (PDB ID: 1GB1) from PCA and Isomap (axes: residual variance versus number of dimensions).

Note that the non-linear representation given by Isomap is better able to capture the complexity of the data (as shown by lower and continuously decreasing residuals). This non-linearity in protein folding landscapes also corresponds to previous studies. For example, [12, 7] also demonstrated that protein folding landscapes were better represented by non-linear reduction techniques.

4.2 Parameter Setting

For the geometric representation required by the Isomap method, we need to define the k nearest neighbors for each conformation. Recall that for the protein results shown in this paper, RMSD is used to define the distance, and for the RNA results, the number of contact pair differences is used. However, the parameter k also affects the quality of the representation. Figure 3 shows the variance of the residuals for Isomap reductions of a 21 nucleotide RNA where k is varied between the values of 8 and 5. Note that there is little difference between the quality of the reductions.

Figure 3: Variance of the residuals from Isomap reductions for a 21 nucleotide RNA with varying values of k.

Similar results were seen for reductions of protein conformations (data not shown). Due to this, a value of k = 8 was used for all reductions.
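The comparison in Sections 4.1 and 4.2 can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions, not the MATLAB and Isomap code used for the reported results: the data is a synthetic 2-D spiral rather than molecular conformations, plain Euclidean distance stands in for RMSD or base-pair distance, the function names are hypothetical, and the residual variance is taken to be 1 − R² between each method's target distances and the distances in its embedding (the definition used in the original Isomap paper).

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix for the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def pca_embedding(points, n_dims):
    """Project centered data onto its first `n_dims` principal components."""
    centered = points - points.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_dims] * s[:n_dims]


def geodesic_distances(dist, k):
    """Symmetric k-nearest-neighbor graph, then Floyd-Warshall shortest paths."""
    n = dist.shape[0]
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(n):
        neighbors = np.argsort(dist[i])[1:k + 1]  # index 0 is the point itself
        graph[i, neighbors] = dist[i, neighbors]
        graph[neighbors, i] = dist[i, neighbors]
    for mid in range(n):
        graph = np.minimum(graph, graph[:, [mid]] + graph[[mid], :])
    return graph


def classical_mds(dist, n_dims):
    """Classical MDS embedding of a distance matrix into `n_dims` dimensions."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))


def residual_variance(target_dist, embedding):
    """1 - R^2 between target distances and distances within the embedding."""
    upper = np.triu_indices_from(target_dist, k=1)
    r = np.corrcoef(target_dist[upper], pairwise_distances(embedding)[upper])[0, 1]
    return 1.0 - r ** 2


# Synthetic non-linear data: a one-dimensional spiral embedded in two dimensions.
t = np.linspace(0.5, 3 * np.pi, 60)
spiral = np.column_stack([t * np.cos(t), t * np.sin(t)])

euclid = pairwise_distances(spiral)
geodesic = geodesic_distances(euclid, k=3)

pca_residual = residual_variance(euclid, pca_embedding(spiral, 1))
isomap_residual = residual_variance(geodesic, classical_mds(geodesic, 1))
```

On this curved toy data, a single Isomap dimension leaves less residual variance than a single PCA dimension, which mirrors the qualitative behavior described for Figure 2. Repeating the last lines for several values of `k` gives the kind of sensitivity check described in Section 4.2.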

4.3 Selecting an Appropriate Number of Dimensions

Once a reduction is performed, another question arises: how many dimensions appropriately capture the space at the lowest complexity? This is determined by the application the reduction is being used for. In the context of the results shown in this paper, we are interested in using simple representations that allow us to capture motions of RNA and proteins. We explore two measures for selecting the number of dimensions.

The first, the residual variances, is standard and often used when the highest-quality reduction is required. Ideally, a reduction would exactly capture the complexity of the space (as represented by the residual variances reaching 0). However, in complex spaces, extremely low-dimensional representations are not always possible or necessary.

The second measure we investigate, the elbow criterion, is a measure commonly used in data clustering techniques to evaluate how well a particular clustering represents the data and to determine an appropriate number of clusters [13, 32]. The elbow criterion monitors the percentage of the variance explained by different clusterings and selects the one where this value no longer significantly changes, i.e., adding additional clusters (or in our case additional dimensions) does not add sufficient information. Given the variance of the data set, $\sigma^2$, the percentage of the variance explained is $\left(\sum_{i=1}^{R} \sigma_i^2\right)/\sigma^2$ for each residual. In our case of principal dimensions, this measure captures the point at which the growth in the quality of the representation is maximized.

Figure 4: The elbow (star) is shown for an example reduction of the protein Ubiquitin (axes: residual variance versus dimensionality).

Figure 4 demonstrates an elbow calculated from a reduction of the protein Ubiquitin (PDB ID: 1UBI). For this reduction, we would select 4 dimensions to represent the data.
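The text does not spell out how the elbow is located numerically, so the following Python sketch uses one common heuristic as an assumption: after rescaling both axes to [0, 1], pick the point on the residual-variance curve that lies farthest from the straight line joining the first and last points. The function names and the sample curve are hypothetical.

```python
def variance_explained(component_variances, n_dims):
    """Fraction of total variance captured by the first `n_dims` components."""
    return sum(component_variances[:n_dims]) / sum(component_variances)


def elbow_dimension(residuals):
    """Return the 1-based dimension farthest from the chord between the curve's ends.

    `residuals[d - 1]` is the residual variance of a d-dimensional reduction.
    """
    n = len(residuals)
    if n < 3:
        return n
    low, high = min(residuals), max(residuals)
    span = (high - low) or 1.0  # avoid dividing by zero on a flat curve
    ys = [(value - low) / span for value in residuals]
    y_first, y_last = ys[0], ys[-1]
    best_dim, best_gap = 1, -1.0
    for idx, y in enumerate(ys):
        x = idx / (n - 1)
        # Unnormalized distance to the line through (0, y_first) and (1, y_last);
        # the normalizing constant is the same for every point, so it is omitted.
        gap = abs((y_last - y_first) * x - y + y_first)
        if gap > best_gap:
            best_dim, best_gap = idx + 1, gap
    return best_dim


# Made-up residual curve: steep improvement up to 4 dimensions, then a plateau.
sample_residuals = [1.0, 0.7, 0.4, 0.1, 0.09, 0.08, 0.07]
chosen = elbow_dimension(sample_residuals)
```

Other elbow heuristics (for example, the largest second difference of the curve) can pick a different dimension on the same data, so the choice of rule is itself a parameter of the analysis.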
The elbow indicates the point at which the growth in the quality of the representation is maximized.

4.4 Discovering Landscape Characteristics

One of the most exciting things about reduced landscapes is the insight they give us as an approximation to the full energy landscape. In this section, we take a full enumeration of the conformations of a 21 nucleotide RNA (5,353 conformations). Note that the residuals clearly indicate that increasing dimensionality more accurately represents this conformation space (see Figure 3). However, even two dimensions reduce the residuals significantly.

Figure 5 shows the first two dimensions of the reduction plotted against the potential of the conformations. Despite the low-dimensional representation and the fact that potential was not used for the reduction, we see striking characteristics. Conformations of similar potential are clearly grouped together (red = high potential, blue = low potential). This reduction also demonstrates the typical ruggedness of RNA landscapes.

Figure 5: The first two dimensions of reduction for a 21 nucleotide RNA plotted against potential energy.

5 Application: Evaluating Sampling

In this section, we demonstrate how the reduced space can be used to evaluate the quality and importance of different sample sets. A perfect test case for this is the 21 nucleotide RNA. Due to the small size of this RNA, we are able to fully enumerate the conformation space. In addition to this Base Pair Enumeration (BPE) set, we can generate samples in two other ways: Stack Pair Enumeration (SPE) and Probabilistic Boltzmann Sampling (PBS) (see Section 3.1).

SPE generates conformations such that all contacts in a conformation are part of a stack (a set of consecutive contacts). These conformations are a subset of the BPE conformations. The 21 nucleotide RNA has 25 SPE conformations.

PBS probabilistically selects a subset of the conformations, favoring those with smaller energies. We can adjust the severity of this bias by altering the reference energy threshold, E. This threshold consequently determines the size of the subset. For this evaluation, we selected two reference energy thresholds: 4 and. The first threshold (labeled "higher") generates more conformations (213) than the second threshold (labeled "lower") with only 58.

In previous experiments, we have seen that our BPE, SPE, and PBS roadmaps produce similar simulated kinetics results despite their drastically different roadmap sizes [42]. Figure 6 shows how the different conformation subsets cover a reduction of a full enumeration of the landscape (BPE). The two dimensions displayed here are the same two dimensions in Figure 5.

For this 21 nucleotide hairpin, BPE generated 5,353 possible conformations. In Figure 6(a), each gray dot represents a BPE conformation and the star indicates the native state. Even though there are only 25 SPE conformations, it is clear from Figure 6(b) that they cover much of the reduced space. This implies that even though there are only about 5% of the samples, they still capture the general characteristics and distribution of the full set.

Figure 6: (a) First two dimensions of a reduction of full enumeration of all possible conformations (5,353). The native state is indicated with a star. (b-d) Comparison of different subsets of conformations (black circles) overlaid on the reduction (gray dots). Subsets include: (b) 25 SPE conformations, (c) 213 PBS conformations (higher energy threshold), and (d) 58 PBS conformations (lower energy threshold).
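The comparison in Figure 6 is visual. One hypothetical way to attach a number to it, sketched below, is to lay a regular grid over the first two reduced dimensions and report the fraction of cells occupied by the full enumeration that also contain at least one conformation from the subset. The grid-based measure, the bin count, and the function names are assumptions for illustration and are not taken from the report.

```python
def occupied_cells(points, bounds, bins):
    """Set of (row, col) grid cells that contain at least one 2-D point."""
    (x_min, x_max), (y_min, y_max) = bounds
    width = (x_max - x_min) or 1.0   # guard against a degenerate axis
    height = (y_max - y_min) or 1.0
    cells = set()
    for x, y in points:
        col = min(int((x - x_min) / width * bins), bins - 1)
        row = min(int((y - y_min) / height * bins), bins - 1)
        cells.add((row, col))
    return cells


def coverage(subset, full_set, bins=10):
    """Fraction of cells occupied by `full_set` that `subset` also occupies."""
    xs = [p[0] for p in full_set]
    ys = [p[1] for p in full_set]
    bounds = ((min(xs), max(xs)), (min(ys), max(ys)))
    full_cells = occupied_cells(full_set, bounds, bins)
    subset_cells = occupied_cells(subset, bounds, bins) & full_cells
    return len(subset_cells) / len(full_cells)


# Toy stand-in for a reduced landscape: a 10 x 10 lattice and its diagonal.
full_enumeration = [(float(x), float(y)) for x in range(10) for y in range(10)]
diagonal_subset = [(float(i), float(i)) for i in range(10)]
diagonal_coverage = coverage(diagonal_subset, full_enumeration, bins=5)
```

The result depends on the bin count: a coarse grid flatters small subsets, while a very fine grid approaches the raw fraction of conformations kept, so any such number should be reported together with the grid resolution.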
Figure 6(c) shows a similar plot for the 213 PBS conformations generated with the higher reference energy threshold. Again, even though there are far fewer samples, much of the fully enumerated space is still captured. It is interesting to note that the PBS distribution with the higher threshold and the SPE distribution are not exactly the same. Stack-based conformations have lower energies than conformations with isolated contacts, but they are not guaranteed to have low energies. This becomes apparent as we compare the SPE distribution to the PBS distribution, which is probabilistically biased towards lower energy regions of the landscape. The PBS distribution is missing a fraction of the SPE subset (in the lower right quadrant of the reduction) that has higher energy.

Finally, we plot the 58 PBS conformations generated with the lower reference energy threshold on the reduced space; see Figure 6(d). Despite the fact that only 58 conformations are generated, they still cover a large portion of the primary dimensions of the reduction. Also, as expected with a low energy threshold, they cover a large portion of space near the native state. A comparison with the higher threshold samples (Figure 6(c)) indicates that many of the high-energy conformations are eliminated by using this lower energy threshold. However, despite this reduction, there are some samples left to represent the region of higher energy conformations.

6 Application: Capturing Motions

It was clear from the previously shown reductions that conformations of similar energetics and structure were grouped together, even at low-dimensional representations. One way to take advantage of this grouping is to use the reduction to identify likely motion transitions. In the past, we have identified likely transitions from a conformation by using a distance metric to define nearby conformations. Then, we make connections between them as described in Section 3.1.

6.1 Methods

We identify likely motion transitions by defining a new distance metric based on the reduction of a set of conformations C. After performing a reduction (as described in Section 3.2), we obtain a vector, $r^i$, of length R for each conformation, $c_i$. Here, the number of dimensions R used from the reduction is computed from the elbow criterion (see Section 4.3). We then calculate the distance between two conformations $c_a$ and $c_b$ by calculating their distance in the reduced space as

$$d_R(c_a, c_b) = \sqrt{\frac{(r_1^a - r_1^b)^2 + \cdots + (r_d^a - r_d^b)^2}{2n}} \qquad (6)$$

We call this measure the reduction distance.

In previous work, we defined neighbors through a metric based on the amount of rigid structure in two conformations called rigidity distance [48]. This metric provided results that were able to capture experimental findings with two major benefits: fewer required edges and low edge weights.

6.2 Experimental Setup

In order to compare the two ways of identifying neighbors for local motion transitions, we applied the two metrics to connect a single set of conformations: the previously developed rigidity distance and our new metric, reduction distance. We took the proteins from our protein folding server, which includes both our previously published results and user submissions.
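Selecting connection candidates with the reduction distance can be sketched in Python as follows. This is a hypothetical illustration rather than the code behind the reported results: the reduced vectors are made up, the function names are invented for the example, and the distance is plain Euclidean distance in the reduced space. Equation (6) may also carry a constant normalization factor; any such constant leaves the ranking of neighbors unchanged, so it is omitted here.

```python
import math


def reduction_distance(r_a, r_b):
    """Euclidean distance between two conformations' reduced-space vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r_a, r_b)))


def nearest_neighbors(reduced, k):
    """Map each conformation index to the indices of its k nearest neighbors."""
    neighbors = {}
    for i, r_i in enumerate(reduced):
        ranked = sorted(
            (j for j in range(len(reduced)) if j != i),
            key=lambda j: reduction_distance(r_i, reduced[j]),
        )
        neighbors[i] = ranked[:k]
    return neighbors


# Made-up reduced coordinates (R = 2) for four conformations.
reduced_vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
candidates = nearest_neighbors(reduced_vectors, k=2)
```

Each entry of `candidates` lists the conformations to which a local connection would be attempted. The brute-force search costs O(n²) distance evaluations, but each evaluation touches only R numbers, which is far cheaper than a full structural comparison when R is small.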
This protein set consisted of 35 proteins ranging from 46 to 153 residues with varied secondary structure (Table 1). All proteins listed are referenced by their PDB ID except MMP19, which was a submission to our publicly available online folding server. The conformation sets varied in size from 4, to 1, conformations (as defined previously by the amount needed to maintain a stable secondary structure formation order). Isomap was run on the set of conformations as defined in Section 3.2. As discussed in Section 4.2, the nearest neighbor parameter used by Isomap was set to k = 8. The number of dimensions used to represent the data was automatically determined by the elbow criterion (Section 4.3). Each metric was used to attempt local connections to each conformation's 2 nearest neighbors. Recall that this is the neighbor rate as defined for roadmap connection (Section 3.1).

6.3 Results

Table 1 displays the differences caused by the two distance metrics for each protein studied. Edge Number Difference is the number of edges in the reduction connected map over the number of edges in the rigidity connected map. Edge Weight Difference is the average edge weight in the reduction connected map over the average edge weight in the rigidity connected map. On average, using the reduction distance causes a 6% decrease in the number of edges and almost a 1% decrease in the average edge weight.

Figure 7(a) demonstrates the difference in the number of edges in the roadmaps constructed by the two metrics. Since all 35 points fall below the red line, all maps connected by reduction distance were smaller than the corresponding maps connected by rigidity distance. This was true even for maps with larger numbers of conformations (reflected in a larger number of edges). Since the edge weight reflects the energetic feasibility of making a local transition from one conformation to another, it is worth examining the changes in edge weight caused by this new connection method.
Figure 7(b) shows the average edge weights from the maps connected by the rigidity distance against those from the maps connected by the reduction distance. Overall, the average edge weights from the reduction distance maps were almost 1% smaller than those of the original maps. While not all reduction connection maps had smaller average edge weight, 3 of 35 maps had averages that were similar to or less than the original average edge weight.

PDB Identifier   Length   Structure   Nodes   Edge Number Difference   Edge Weight Difference
1AB1             46       2α + 2β
CCM              46       1α + 3β
RDV              52       2α + 3β
EGF              53       3β
PRB              53       5α
IY5              54       1α + 3β
SMU              54       3α + 3β
FCA              55       2α + 4β
VGH              55       1α + 4β
GB1              56       1α + 4β
MHX              57       1α + 4β
MI               57       1α + 4β
BPI              58       2α + 2β
PTI              58       2α + 2β
BDD              6        3α
TCP              6        2α + 2β
ADR              6        2α + 2β
CRS              6        6β
PTL              62       1α + 4β
COA              64       1α + 5β
SRM              64       1α + 5β
CI2              65       2α + 5β
NYF              67       5β
HOE              74       7β
AIT              74       7β
UBI              76       3α + 5β
UBQ              76       1α + 5β
O6X              81       2α + 3β
A2P              18       4α + 6β
YCC              18       5α
VYN              117      5α + 8β
RBX              124      4α + 7β
L                129      7α + 3β
AFG              14       4α + 1β
MMP19*           153      3α + 7β
Average                                       .6                       .91

Table 1: Comparison of reduction distance connection to previous work for 35 proteins. In all cases, reduction distance connection reduced the number of edges needed, and in many proteins, it decreased the average edge weight. [* User submission without a PDB ID.]

In addition to reducing the required number of edges and the average edge weight, using a reduction distance to connect a roadmap dramatically changed the connectivity of the map. The degree of a conformation (or vertex) v in the roadmap is the number of edges connected to v. In the reduction distance maps, the average degree dropped relative to the rigidity distance maps. More striking differences are seen in the conformations of maximum degree. For example, with the rigidity distance, the maximum degree in all roadmaps was in the range [32, 1,832], while in the reduction distance maps the maximum degree was in the range [36, 47]. From these changes, it is clear that the reduction distance maps are more evenly connected. For example, the reduction of maximum degree implies that massive connectivity hubs are removed, and the average degree change implies that all conformations are more equally connected.

From the previous statistics, it is clear that local motion transitions are changing the roadmaps.
These changes seem to be for the better: smaller roadmaps, smaller edge weights, and more dispersed connectivity. Another, more biologically inspired, measure is the order in which secondary structure is formed along the pathways in the roadmap. In previous work [48], we validated a set of roadmaps against experimental results. We showed that our roadmaps, connected by rigidity distance, were able to capture the same secondary structure formation orders as found in experiment. Table 2 shows the secondary structure formation orders from the reduction distance roadmaps for 4 proteins with similar folding structure but differing folding behavior. It also indicates the decrease in map size relative to the previously built rigidity distance roadmaps. In all cases, the reduction connected maps were able to predict the secondary structure formation order seen in experiment with almost 5% fewer edges than previously required.
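The comparison statistics used in Section 6.3 can be sketched as follows. This is a minimal illustration, assuming each roadmap is available as a list of undirected (u, v, weight) edges; the function names and data layout are hypothetical and do not come from the original implementation. The two ratios follow the definitions of Edge Number Difference and Edge Weight Difference given for Table 1, with the reduction connected map in the numerator.

```python
from collections import Counter


def roadmap_stats(edges):
    """Summarize a roadmap given as (u, v, weight) tuples for undirected edges."""
    degree = Counter()
    total_weight = 0.0
    for u, v, w in edges:
        degree[u] += 1
        degree[v] += 1
        total_weight += w
    n_edges = len(edges)
    return {
        "edges": n_edges,
        "avg_weight": total_weight / n_edges if n_edges else 0.0,
        # Averaged over vertices that appear in at least one edge.
        "avg_degree": sum(degree.values()) / len(degree) if degree else 0.0,
        "max_degree": max(degree.values(), default=0),
    }


def compare_roadmaps(reduction_edges, rigidity_edges):
    """Ratios of the reduction connected map over the rigidity connected map."""
    red = roadmap_stats(reduction_edges)
    rig = roadmap_stats(rigidity_edges)
    return {
        "edge_number_difference": red["edges"] / rig["edges"],
        "edge_weight_difference": red["avg_weight"] / rig["avg_weight"],
        "max_degree": (red["max_degree"], rig["max_degree"]),
        "avg_degree": (red["avg_degree"], rig["avg_degree"]),
    }
```

A ratio below 1 for either difference means the reduction connected map is smaller, or has lower average edge weight, than the rigidity connected map. A large gap between maximum and average degree is one simple indicator of the connectivity hubs discussed above.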

[Figure 7 plots. Panel (a), "Number of Edges Comparison": Number of Edges in Reduction Distance Maps (vertical axis) against Number of Edges in Rigidity Distance Maps (horizontal axis). Panel (b), "Edge Weight Comparison": Average Edge Weight in Reduction Distance Maps (vertical axis) against Average Edge Weight in Rigidity Distance Maps (horizontal axis).]

Figure 7: (a) Number of edges from original maps vs. maps connected using reduction distance. (b) Average edge weights from original maps vs. maps connected using reduction distance.

Protein   Experimental Order        Roadmap Order (%)        Size Decrease
G         [α,β1,β3,β4], β2 ¹        α, β3-4, β1-2 (1.)       51%
          [α,β4], [β1,β2,β3] ²
L         [α,β1,β2,β4], β3 ¹        α, β1-2, β3-4 (1.)       5%
          [α,β1], [β2,β3,β4] ²
NuG1      β1-2, β3-4 ³              α, β1-2, β3-4 (98.)      47%
                                    α, β1-2, β3-4 (1.9)
NuG2      β1-2, β3-4 ³              β1-2, α, β3-4 (99.2)     54%
                                    β1-2, α, β3-4 (1.1)
                                    β3-4, β1-2, α (1.1)

Table 2: Comparison of secondary structure formation orders and ratio of edges needed (Size Decrease) for proteins G, L, NuG1, and NuG2 with known experimental results: ¹ hydrogen out-exchange experiments [31], ² pulsed labeling/competition experiments [31], and ³ Φ-value analysis [33]. Brackets indicate no clear order. In all cases, our new technique predicted the secondary structure formation order seen in experiment with significantly reduced numbers of edges. Only formation orders greater than 1% are shown.

7 Conclusions

In this work we proposed two new methods for studying molecular motions based on dimensionality reduction techniques. First, we demonstrated how dimensionality reduction can be used to compare different distributions of conformations. We illustrated this technique with a small RNA which could be fully enumerated. We showed how to evaluate 3 different sampling distributions by looking at the coverage and distribution of samples against the fully enumerated landscape in a reduced space. Second, we developed a new way to identify likely local motion transitions using dimensionality reduction.
We defined a new distance measure, reduction distance, to select neighboring conformations for localized motions. This new metric yields a significant improvement over previous techniques, resulting in a 4% reduction in landscape model size (number of edges) for the 35 proteins studied. Both of these new methods are general and can be applied to any set of conformations. We showcased their utility in an existing motion framework.

Acknowledgments

We would like to acknowledge Mark Moll of the Physical and Biological Computing Group at Rice University for inspiring us to work with dimensionality reduction. This research was supported in part by NSF Grants EIA-13742, ACR-8151, ACR , CCR , ACI , CRI , by the DOE and HP. Computing resources were generously donated by Chevron. Tapia was supported in part by a PEO scholarship, an NIH Molecular Biophysics Training Grant (T32GM6588), and a Department of Education (GAANN) Fellowship. Thomas was supported in part by an NSF Graduate Research Fellowship, a PEO scholarship, a Dept. of Education Graduate Fellowship (GAANN), and an IBM TJ Watson PhD Fellowship.

References

[1] A. Amadei, A. Linssen, B. de Groot, D. van Aalten, and H. Berendsen. An efficient method for sampling the essential subspace of proteins. J. Biomol. Struct. Dyn., 13: ,
[2] N. M. Amato, K. A. Dill, and G. Song. Using motion planning to map protein folding landscapes and analyze folding kinetics of known native structures. J. Comput. Biol., 1(3-4): , 23. Special issue of Int. Conf. Comput. Molecular Biology (RECOMB) 22.
[3] D. Bartel. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell, 116: , 24.
[4] I. Borg and P. J. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, New York, NY, 25.
[5] L. S. Caves, J. D. Evanseck, and M. Karplus. Locally accessible conformations of proteins: Multiple molecular dynamics simulations of crambin. Protein Sci., 7: ,
[6] J. Cupal, C. Flamm, A. Renner, and P. F. Stadler. Density of states, metastable states, and saddle points exploring the energy landscape of an RNA molecule. In Proc. Int. Conf. Intelligent Systems for Molecular Biology (ISMB), pages 88-91,
[7] P. Das, M. Moll, H. Stamati, L. E. Kavraki, and C. Clementi. Low-dimensional, free-energy landscapes of protein-folding reactions by nonlinear dimensionality reduction. Proc. Natl. Acad. Sci. USA, 13(26): , 26.
[8] B. de Groot, A. Amadei, R. Scheek, N. van Nuland, and H. Berendsen. An extended sampling of the configurational space of HPr from E. coli. Proteins Struct. Funct. Genet., 26: ,
[9] B. de Groot, A. Amadei, D. van Aalten, and H. Berendsen. Toward an exhaustive sampling of the configurational spaces of the two forms of the peptide hormone guanylin. J. Biomol. Struct. Dyn., 13: ,
[10] K. A. Dill and H. S. Chan. From Levinthal to pathways to funnels. Nat. Struct. Biol., 4:1-19,
[11] P. R. Fiorentin, C. A. L. Bailer-Jones, Y. S. Lee, T. C. Beers, T. Sivarani, R. Wilhelm, C. A. Prieto, and J. E. Norris. Estimation of stellar atmospheric parameters from SDSS/SEGUE spectra. Astronomy & Astrophysics, 467: , 27.
[12] A. E. García. Large-amplitude nonlinear motions in proteins. Physical Review Letters, 68(17): ,
[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 21.
[14] S. Hayward, A. Kitao, and H. J. Berendsen. Model-free methods of analyzing domain motions in proteins from simulation: A comparison of normal mode analysis and molecular dynamics simulation of lysozyme. Proteins Struct. Funct. Genet., 27: ,
[15] S. Hayward, A. Kitao, and N. Gō. Harmonic and anharmonic aspects in the dynamics of BPTI: A normal mode analysis and principal component analysis. Protein Sci., 3: ,
[16] I. L. Hofacker. RNA secondary structures: A tractable model of biopolymer folding. J. Theor. Biol., 212:35-46,
[17] D. Jacobs. Generic rigidity in three-dimensional bond-bending networks. J. Phys. A: Math. Gen., 31: ,
[18] D. Jacobs and M. Thorpe. Generic rigidity percolation: The pebble game. Phys. Rev. Lett., 75(22): ,
[19] D. Jacobs and M. Thorpe. Generic rigidity percolation in two dimensions. Phys. Rev. E, 53(4): ,
[20] D. J. Jacobs and B. Hendrickson. An algorithm for two dimensional rigidity percolation: The pebble game. J. Comp. Phys, 137: ,
[21] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag,
[22] L. E. Kavraki, P. Švestka, J. C. Latombe, and M. H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Trans. Robot. Automat., 12(4):566-58, August
[23] A. Kitao and N. Gō. Investigating protein dynamics in collective coordinate space. Curr. Op. Str. Biol., 9: ,
[24] P. Klaff, D. Riesner, and G. Steger. RNA structure and the regulation of gene expression. Plant Mol. Biol., 32:89-16,
[25] M. B. Kubitzki and B. L. de Groot. Molecular dynamics simulations using temperature-enhanced essential dynamics replica exchange. Biophys. J., 92: , 27.
[26] A. Lee and I. Streinu. Pebble game algorithms and sparse graphs. European Conference on Combinatorics, Graph Theory and Applications, 25.
[27] J. A. Lee and M. Verleysen. Nonlinear Dimensionality Reduction. Springer, New York, NY, 27.
[28] M. Levitt. Protein folding by restrained energy minimization and molecular dynamics. J. Mol. Biol., 17: ,
[29] M. Levitt. Real-time interactive frequency filtering of molecular dynamics trajectories. J. Mol. Biol., 22:1-4,
[30] R. M. Levy and M. Karplus. Vibrational approach to the dynamics of an α-helix. Biopoly., 18: ,
[31] R. Li and C. Woodward. The hydrogen exchange core and protein folding. Protein Sci., 8(8): ,
[32] L. Lieu and N. Saito. Automated shapes discrimination in high dimensions. Proc. of SPIE, 671 Wavelets XII(6711W), 27.
[33] S. Nauli, B. Kuhlman, and D. Baker. Computer-based redesign of a protein folding pathway. Nature Struct. Biol., 8(7):62-65, 21.
[34] S. B. Nolde, A. S. Arseniev, V. Y. Orekhov, and M. Billeter. Essential domain motions in barnase revealed by MD simulations. Proteins Struct. Funct. Genet., 46:25-258, 22.
[35] E. Plaku, H. Stamati, C. Clementi, and L. E. Kavraki. Fast and reliable analysis of molecular motion using proximity relations and dimensionality reduction. Proteins: Structure, Function, and Bioinformatics, 67(4):897-97, 27.
[36] T. Romo, J. Clarage, D. Sorensen, and G. Phillips, Jr. Automatic identification of discrete substates in proteins: Singular value decomposition analysis of time-averaged crystallographic refinements. Proteins Struct. Funct. Genet., 22: ,
[37] G. Song. A Motion Planning Approach to Protein Folding. Ph.D. dissertation, Dept. of Computer Science, Texas A&M University, December 24.
[38] G. Song, S. Thomas, K. Dill, J. Scholtz, and N. Amato. A path planning-based study of protein folding with a case study of hairpin formation in protein G and L. In Proc. Pacific Symposium of Biocomputing (PSB), pages , 23.
[39] M. J. Sternberg. Protein Structure Prediction. OIRL Press at Oxford University Press,
[40] X. Tang, B. Kirkpatrick, S. Thomas, G. Song, and N. M. Amato. Using motion planning to study RNA folding kinetics. J. Comput. Biol., 12(6): , 25. Special issue of Int. Conf. Comput. Molecular Biology (RECOMB) 24.
[41] X. Tang, S. Thomas, L. Tapia, and N. M. Amato. Tools for simulating and analyzing RNA folding kinetics. In Proc. Int. Conf. Comput. Molecular Biology (RECOMB), pages ,
[42] X. Tang, S. Thomas, L. Tapia, D. P. Giedroc, and N. M. Amato. Simulating RNA folding kinetics on approximated energy landscapes. J. Mol. Biol., 28. doi: 1.116/j.jmb
[43] L. Tapia, X. Tang, S. Thomas, and N. M. Amato. Kinetics analysis methods for approximate folding landscapes. Bioinformatics, 23(13): , 27. Special issue of Int. Conf. on Intelligent Systems for Molecular Biology (ISMB) & European Conf. on Computational Biology (ECCB) 27.
[44] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 29: , 2.
[45] M. Teodoro. Molecular conformational sampling using collective coordinate expansive spaces. Master's thesis, Dept. of Computer Science, Rice University, 23.
[46] M. L. Teodoro, G. N. Phillips, Jr., and L. E. Kavraki. A dimensionality reduction approach to modeling protein flexibility. In Proc. Int. Conf. Comput. Molecular Biology (RECOMB), pages , 22.
[47] M. L. Teodoro, G. N. Phillips, Jr., and L. E. Kavraki. Understanding protein flexibility through dimensionality reduction. J. of Computational Biology, 1(3-4): , 23.
[48] S. Thomas, X. Tang, L. Tapia, and N. M. Amato. Simulating protein motions with rigidity analysis. J. Comput. Biol., 14(6): , 27. Special issue of Int. Conf. Comput. Molecular Biology (RECOMB) 26.
[49] I. Tinoco and C. Bustamante. How RNA folds. J. Mol. Biol., 293: ,
[50] M. Turk and A. P. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86,
[51] E. B. Wilson, J. Decius, and P. C. Cross. Molecular Vibrations: The Theory of Infrared and Raman Vibrational Spectra. McGraw-Hill, Dover, 198.
[52] M. Wish and J. D. Carroll. Multidimensional scaling and its applications. In P. Krishnaiah and L. Kanal, editors, Handbook of Statistics 2: Classification Pattern Recognition and Reduction of Dimensionality, chapter 14, pages North-Holland, Amsterdam, The Netherlands,
[53] S. Wuchty. Suboptimal secondary structures of RNA. Master's thesis, University of Vienna, Austria, March
[54] L. Zhou and S. A. Siegelbaum. Effects of surface water on protein dynamics studied by a novel coarse-grained normal mode approach. Biophys. J., 94: , 28.
[55] M. Zuker, D. H. Mathews, and D. H. Turner. Algorithms and thermodynamics for RNA secondary structure prediction: A practical guide. In J. Barciszewski and B. F. C. Clark, editors, RNA Biochemistry and Biotechnology, NATO ASI Series. Kluwer Academic Publishers,
[56] M. Zuker and D. Sankoff. RNA secondary structure and their prediction. Bulletin of Mathematical Biology, 46: ,
