Protein Structure Comparison Methods
D. Petrova

Key Words: protein structure comparison; models; comparison algorithms; similarity measure

Abstract. Existing methods for protein structure comparison are examined and their basic properties are presented. The methods are analysed in terms of their three major components: the model of the structure, the comparison algorithm and the similarity measure. Each method is evaluated against criteria that include the level of representation and the ease of handling of the model, the complexity and strategy of the comparison algorithm, and how meaningful the similarity measure is. The methods are grouped according to these criteria. The advantages and disadvantages of each group are discussed with respect to their application to different types of problems in contemporary structural biology.

Introduction

Proteins are long chains of amino acids and, like other biological macromolecules, are essential parts of organisms and participate in every process within cells. Many proteins are enzymes that catalyze biochemical reactions and are vital to metabolism. Proteins also have structural, mechanical or transport functions, or form a system of scaffolding that maintains cell shape. Other proteins are important in cell signaling, immune responses, cell adhesion and the cell cycle.

The number of newly determined protein structures is growing fast. When the Protein Data Bank [1] was originally founded it contained just 7 protein structures. Since then the number of structures has grown approximately exponentially and shows no sign of levelling off. Today there are protein structures, deposited in Protein Data Bank and up to 2015 their number would reach The need for new methods and algorithms for studying protein structure, function and evolution is evident. One of the main tasks in these studies is the precise and adequate comparison of protein structures.
Protein structure comparisons are employed in almost all branches of contemporary structural biology. They are applied for:

Protein fold classification. SCOP [2] is one of the databases in which the classification of protein structures is based on evolutionary relationships and on the principles that govern their three-dimensional structure. The protein classification in SCOP is constructed essentially by visual inspection and comparison of structures. CATH [3] includes an adaptation of a method for rapid structure comparison based on secondary structure matching.

Protein structure modelling for protein structure prediction [4]. Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry. Its aim is to predict the three-dimensional structure of proteins from their amino acid sequences, sometimes using additional relevant information such as the structures of related proteins. In other words, it deals with the prediction of a protein's tertiary structure from its primary structure, where the prediction is based on comparison with proteins of already known structure.

Structure-based function prediction [5]. The functions of proteins are strongly connected with their structure and chemical characteristics, and the structure of a protein can directly reveal the mechanisms of its function. Structure comparison methods are therefore used to determine unknown protein functions: conclusions are drawn from detected structural similarity with proteins whose functions are already known.

Protein Structure Comparison Methods

In order to compare two (or more) protein structures three components are necessary: a model of the protein, an alignment algorithm and a similarity measure. 
The ideal combination of these three components would be sensitive (give high scores to similar structures), selective (give low scores to structures that are not similar) and fast.

A: Protein Structure Representation/Model

One of the main characteristics of a protein model is its level of representation, that is, how detailed it is. Two sets of structure elements, $A = \{a_1, a_2, \ldots, a_n\}$ for protein $P_A$ and $B = \{b_1, b_2, \ldots, b_m\}$ for protein $P_B$, are given. The choice of structure elements defines the following possibilities for the level of representation:

1) Low, atomic level, where the coordinates of the Cα atoms of the proteins are used to construct the model. Let the set of Cα atoms in protein X be defined by eq. (1).

(1) $C\alpha(X) = \{C\alpha_1, C\alpha_2, \ldots, C\alpha_p\}$

Then the sets A and B of structure elements for the two compared proteins can be defined as:

(2) $A = C\alpha(P_A)$, $B = C\alpha(P_B)$

2) Atomic level of representation, where proteins are represented by fragments of fixed size. A fragment is defined as a sequence of amino acids:

(3) $f = \{C\alpha_1, C\alpha_2, \ldots, C\alpha_k\}$

The parameter k in eq. (3) is the fragment size (the number of amino acids included in the fragment). The sets of structure elements A and B can then be defined as sets of fragments of fixed size:

(4) $A = \{f_1, f_2, \ldots, f_r\}$, $B = \{f_1, f_2, \ldots, f_s\}$

3) The level of secondary structure elements, where
proteins are represented by their sets of helices and sheets/strands. Let the set of secondary structure elements in protein X be defined by eq. (5).

(5) $SSE(X) = \{h_1, \ldots, h_p, \underbrace{\{s_{11}, s_{12}, \ldots, s_{1l}\}}_{SH_1}, \ldots, \underbrace{\{s_{q1}, s_{q2}, \ldots, s_{ql}\}}_{SH_q}\}$

The parameter p is the number of helices ($h_i$) and q is the number of sheets ($SH_j$) in the structure. Sheets are represented by the strands that compose them ($s_{jk}$). In this case the sets of structure elements for the two proteins can be defined as:

(6) $A = SSE(P_A)$, $B = SSE(P_B)$

4) A combination of the previous levels.

Once the level is specified, the model can be further defined by the type of distances that are compared and superimposed. If the model consists of a set of points or secondary structure elements, then intermolecular distances are computed between elements that are fixed to be equivalent. This type of model does not usually consider the flexibility of the molecules. The other type of model uses intramolecular distances (distance matrices or contact maps). The model then consists of all relative positions or distances between elements of one protein molecule, which are compared with the relative distances and positions in the other protein molecule.

The oldest and most common way to compare protein structures is to select a set of equivalent points, usually Cα atoms, and superimpose them in 3D space so as to minimize the least-squares distances between corresponding atoms of the two proteins. Proteins are represented by the sequences of their Cα atoms as points in 3D space in RAND and RAG [6], proposed by Akutsu, and in MINRMS [7]. Their sets of structure elements to be compared are defined by eq. (2). Figure 1 shows this kind of representation for two protein molecules $P_A$ and $P_B$ with their corresponding sets A and B of coordinates, which are rotated and translated to find the alignment as a set of pairs $(a_i, b_j)$.

Figure 1. 
Protein structure alignment by selecting equivalent points in 3D space

These models are detailed and precise if the only requirement is to compare points in 3D space, namely the atomic coordinates of two different macromolecules. However, a plain comparison of the coordinates of Cα atoms is not sufficient when these are biological macromolecules. The disadvantage of such models is that they can miss similarities between divergent structures that are biologically similar. These models are very detailed, and the information they carry in some cases exceeds what is needed.

Combinatorial Extension (CE) of Shindyalov and Bourne [8], FlexProt [9] and FATCAT [10] are also comparison methods that examine the protein structure at the low atomic (Cα atoms) level, but they group the atoms of the compared proteins into fragments, parts of fixed size. Fragment pairs, whose size is defined by the number of residues that compose them, are aligned. The models of CE, FlexProt and FATCAT are defined by eq. (4).

Bostick and Vaisman made an interesting proposal for a model [11]. They applied Delaunay-based topological mapping to protein structure comparison. The model of the protein is based on the topology of its core, specifically the Cα atoms of eq. (2), which form the protein backbone. Four-body nearest neighbors of Cα atoms are clustered and used for comparison. A Delaunay tessellation is built over the set of points that represent the amino acid residues: the four nearest neighbors are arranged at the vertices of an irregular tetrahedron, which is called a Delaunay simplex. The length of a simplex edge is defined as the number of Cα atoms that compose the segment of the protein between its simplex vertices. If all residues are enumerated with consecutive integers according to their occurrence in the primary structure of the protein, the length of a simplex edge $d_{ij}$ is defined as:

(7) $d_{ij} = j - i - 1$

Indices i and j in eq. 
(7) are the integers assigned to the residues by this enumeration. Each simplex in the tessellation has three lengths, associated with the four vertices.

Some protein structure comparison methods use the coordinates of Cα atoms to construct a model called a distance matrix. The structure of the compared proteins is examined at the level of Cα atoms, as in the group discussed above. The intramolecular distances within the sets defined in eq. (2) are computed, and the results are distance matrices that describe the geometrical properties of the compared proteins. The idea of the DALI method [12], proposed by Holm and Sander, is that similar 3D structures have similar inter-residue distances. This method uses the distance matrix as a 2D representation of the 3D structure, containing all pairwise distances between residue centres, i.e. the Cα atoms of a protein. Szustakowski and Weng [13] also use intramolecular distance matrices, together with the elastic similarity score developed by Holm and Sander [12], to evaluate the alignments between compared proteins.

Another type of matrix used as a model is the contact map. The contact map is a matrix of pairwise distances between atoms, residues or secondary structure elements of a protein, depending on the model chosen for analysis. The comparison of two contact maps is defined as the alignment of the set A of structure elements of the contact map that describes the first protein with the corresponding set B of the contact map of the second protein. Aligned components from
the sets are considered equivalent. In the maximal contact map overlap (Max CMO) problem, the degree of similarity between two proteins is defined as the number of equivalent contacts between the proteins. This number is called the overlap of the contact maps, and the goal is to maximize it. The Max CMO problem has been proven to be NP-complete [14]. Carr and Hart [15] simplify this representation of protein structure by defining a contact as a pair of residues that are closer than a given threshold, which usually ranges between 2 and 9 Angstroms. The result is a binary contact map, which has the advantage that structural properties of proteins can be visualized and compared more easily.

The detailed models discussed above have the disadvantage of being expensive to compare because of the huge number of elements involved (Cα atoms, fragments of Cα atoms, or distances between them); the problem belongs to the NP-complete class. It is a challenge for comparison methods that use these models to be fast and exact at the same time. Either simplifications are sought to decrease the complexity of the model, as Carr and Hart proposed, or heuristic comparison methods are used to achieve reasonable running times.

Another approach to speeding up the comparison is to use a less detailed model and to examine the protein structure at the level of helices and sheets, which are secondary structure elements. Vectors of secondary structure elements are used for the initial model in MATRAS [16]. Kawabata and Nishikawa use a Markov transition model of evolution, which is similar to Dayhoff's substitution model [17] for amino acids. The transition probability matrix of the Markov transition is generated from the numbers of structure transitions observed within pairs of proteins with similar sequences. The protein structure is examined first at the level of secondary structure elements, and then the similarity is refined at the residue level. 
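The binary contact map described earlier in this section can be illustrated with a short sketch. It assumes that each residue is represented by a single point (for example its Cα coordinates) and that a contact is any pair of distinct residues closer than a cutoff. The function name `binary_contact_map` and the default cutoff of 8 Å are illustrative choices within the 2 to 9 Å range mentioned in the text, not values taken from Carr and Hart.

```python
import math

def binary_contact_map(coords, threshold=8.0):
    """Build a symmetric 0/1 contact matrix from per-residue 3D points.

    coords    : list of (x, y, z) tuples, one per residue (assumed Calpha positions)
    threshold : distance cutoff in Angstroms (hypothetical default)
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # A contact is a pair of residues closer than the threshold.
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Toy chain: four points on a line, 3.8 Angstroms apart.
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
for row in binary_contact_map(chain):
    print(row)
```

In the toy chain only the two end points are farther apart than the cutoff, so every other pair is marked as a contact. The quadratic loop is adequate for single chains; the overlap of two such maps under a given alignment is the quantity that the Max CMO problem maximizes.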
The sets defined by eq. (2) and eq. (6) are used to construct the models of the proteins.

Other methods use graph models to represent protein secondary structure. When the scope of interest is the atomic level and the average number of objects, which have to be represented is about (the common number of amino acids in a protein with one chain) the graph model is not appropriate: such a number of vertices is large, and the comparison time would be enormous. The situation is different when the level is high and secondary structure elements are represented as vertices of a graph. Their average number is about 10 to 20, which makes the graph model suitable.

VAST [18], like MATRAS, first detects and aligns secondary structure elements and then refines this alignment by finding the residue equivalences, but VAST uses a graph model. All pairs of secondary structure elements (one from each protein) that are of the same type are represented as vertices in a graph. An edge is defined between two vertices if the distance and angle between the corresponding secondary structure elements are within some threshold. The resulting model of the proteins is a graph composed of the structure elements defined by eq. (6). The graph model represents the correspondence between pairs of secondary structure elements that have the same type, relative orientation and connectivity.

Taylor [19] also proposes a graph model for protein structure comparison. The model is defined by the interactions between secondary structure elements, which are represented by line segments, and by other geometric properties of the protein molecules. The degree of interaction between two line segments is evaluated by the degree of overlap between the segments, and these interactions are represented by a bipartite graph.

Figure 2. 
Line segment overlap measure

Two line segments corresponding to secondary structure elements (A -> B and C -> D) are shown as thick lines in figure 2, with their mutually perpendicular connecting line (p and q). A series of fine lines covers the span in which the line segments overlap; the end points of these lines are equidistant from the corresponding ends of the mutual perpendicular. A measure of interaction is calculated as the sum of the lengths (x) of these lines.

In the method proposed by Wang, Makedon and Ford [20], a bipartite graph is constructed to compare protein structure $P_A$ with protein structure $P_B$. The two parts of the vertex set represent structural elements from the two proteins: the left part for protein $P_A$ and the right part for protein $P_B$. In contrast to the graph models discussed previously, the level of representation here can vary: the structural elements can be all atoms, only Cα atoms, amino acid residues or secondary structure elements. Each vertex is connected to all vertices in the opposite part. The weight of an edge between two vertices is defined by the similarity measure, which can include geometric and chemical properties of the compared protein structures. Given the two sets of structure elements A and B for proteins $P_A$ and $P_B$, an undirected weighted bipartite graph $G(V, E)$ can be constructed, where $V = A \cup B$ and $E = \{e_{ij}\}$ for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$ (figure 3).

Figure 3. Bipartite graph matching proposed by Wang, Makedon and Ford
Each edge $e_{ij}$ corresponds to a weighted connection between $a_i$ and $b_j$, and the weight $w(e_{ij})$ gives the degree of similarity between $a_i$ and $b_j$. Edges between nodes from the same part are not allowed.

In the graph model proposed by Krissinel and Henrick [21], the secondary structure elements (helices and strands, eq. (6)) are used as graph vertices with composite labels, which have one part for the type of the element and one part for the number of residues that compose it. Any two vertices of the graph are connected by an edge whose label describes the geometry of the mutual position and orientation of the connected elements. Figure 4 shows the properties that are considered when the graph is constructed. Vertices $v_i$ and $v_j$ are represented by vectors $r_{SSE}$; edge $e_{ij}$ connects their centers. The edge length $p_{ij}$ and the angles $\alpha^k_{ij}$, k = 1..4, define the mutual positions and orientations of all vertices in the graph.

Figure 4. Properties of vertices and edges of the SSE graph

Protein models that examine the structure at the level of helices and sheets/strands are more compact and easier to construct and handle. The disadvantage of using only a model of secondary structure elements is that the information may not be sufficient for a precise comparison. This is why many methods first use this model for a fast comparison and then refine the result with a detailed comparison at the atomic level, with the alignment of secondary structure elements already available. At the same time, geometric properties of the structures are taken into account when the model at the SSE level is constructed, by detecting the mutual position and orientation of secondary structure elements relative to each other, including distances, angles, connectivity and overlaps.

B: Comparison Algorithm/Strategy

The variety of models for protein structure representation brings a variety of comparison algorithms. 
There are cases in which the models are almost the same but the algorithms that operate on them are different, and cases in which the models are different yet the search strategy for equivalences is the same. Comparison algorithms can be grouped into classes, each with specific characteristics that determine the advantages, disadvantages and applications of the algorithms.

According to their dependence on chain order, alignment algorithms can be:

Sequence-order dependent. These algorithms use the order of atoms in the protein chain, thus reducing the problem to 3D curve matching. The comparison task becomes easier when the order of the chain is considered.

Sequence-order independent. The structural similarity between the compared models is measured without requiring that the matched residues follow the same order along the chains of the two proteins. Since these algorithms do not exploit the chain order, they can detect non-sequential motifs in proteins, such as molecular surface motifs, especially binding sites. An important advantage of such algorithms is that they can be applied to other molecular structures (drugs, for example), not only to proteins.

Protein structure alignment algorithms can search for global similarity, when the purpose is to compare the molecules as a whole, or for local similarity. Global comparison algorithms are mainly used for protein structure classification and for the identification of evolutionary links between distant homologues. For the purposes of protein function prediction, local structural comparison methods are applied. Local structural comparison refers to the detection of a similar 3D arrangement of a small set of residues, possibly in the context of completely different protein structures. 
A more detailed comparison of alignment algorithms is made here according to the strategies they use: branch-and-bound, dynamic programming, geometric hashing, genetic algorithms, subgraph isomorphism, bipartite graph matching, etc. Some methods use a combination of two strategies when the comparison is made at different levels. The proposed algorithms can be compared and evaluated according to their complexity. The protein structure comparison problem is NP-complete, and reported results on complexity are noted below.

Branch-and-bound is a widely used strategy for solving large-scale NP combinatorial optimization problems, and many comparison methods adopt it. The technique consists of a systematic enumeration of candidate solutions using upper and lower estimated bounds of the quantity being optimized, so that large subsets of useless candidates are discarded together during the search. This is done by a recursive procedure that extends initial candidate solutions, or matching seeds. The extension stops when the algorithm determines that the current path cannot lead to solutions better than the current best one. In that case the recursion goes one step back, and the candidate is extended in another direction or another candidate is selected. The running time of these methods depends dramatically on how similar the compared proteins are. If the structures are very similar, there will be a large number of seed matches to explore.

One of the comparison methods that use this technique is DALI. The alignment algorithm compares protein $P_A$ and protein $P_B$ in two steps: 1) their distance matrices are first decomposed into elementary contact patterns (hexapeptide submatrices). All elementary contact patterns in protein $P_A$ are compared pairwise with all elementary contact patterns in protein $P_B$. Similar contact patterns are stored in a non-exclusive list of pairs, which is the raw material for structural alignment; 2) the goal of the second step is to assemble pairs of contact patterns into a larger consistent set of pairs (a larger alignment), maximizing the similarity score. A Monte Carlo procedure is used to build up the full alignment.

MATRAS also uses a branch-and-bound algorithm for the initial alignment of SSEs. A residue-based alignment is then performed iteratively by dynamic programming, using the previous results and refining them.

Combinatorial Extension (CE), proposed by Shindyalov and Bourne, finds an optimal alignment between two protein structures using combinatorial extension of an alignment path defined by aligned fragment pairs. It is based on local similarity detection. The algorithm first applies rigid-body superposition of the fragment pairs. It then tries to extend this alignment using a greedy heuristic, followed by an optimization of the best alignment.

FATCAT also aligns fragment pairs and then uses dynamic programming to connect them, while taking the flexibility of the protein molecule into account. FATCAT aligns flexible protein molecules by allowing twists in the peptide backbone within the alignment algorithm. This permits an alignment of two domains that are structurally similar but have local structural differences that would preclude a full alignment if each domain were treated as a rigid body.

FlexProt is a sequence-order dependent method for the alignment of two protein structures, one of which can be a flexible molecule. First, FlexProt detects congruent fragment pairs, one fragment from each protein, that can be superimposed with minimum RMSD. Matching atom pairs are extended along the protein backbone by one or more atom pairs as long as the RMSD and the length of the matching fragments remain within given thresholds. FlexProt then builds a directed acyclic graph in which vertices are fragment pairs and edges reflect the order of the fragments (according to the amino acid sequence). 
Weights are assigned to edges to reward long matching fragments and to penalize large gaps. A single-source shortest-paths algorithm is applied to this graph, and the resulting paths are compared with regard to total size and minimum RMSD. The last step of the algorithm clusters consecutive fragment pairs that have a similar 3D transformation. The first step of the algorithm takes $O(n^2)$, the second step takes $O(n^4)$, and the clustering step takes $O(n^2)$. Thus, the overall complexity is bounded by $O(n^4)$, where n is the number of Cα atoms in the larger protein.

Bostick and Vaisman use a comparison of 3D arrays to find similarity between protein molecules. They apply a transformation T to the Delaunay simplices, which are the models of the compared proteins. T maps each edge length of a simplex to an integer value. Equation (8) defines the transformation:

(8) $T: d \mapsto a$, where the integer a is determined by which of the following ranges d falls into: $d = 0$; $d = 1$; $d = 2$; $d = 3$; $4 \le d \le 6$; $7 \le d \le 11$; $12 \le d \le 20$; $21 \le d \le 49$; $50 \le d \le 100$; $d \ge 101$.

Each simplex is mapped into a 3D array M, where $M_{npr}$ is the number of simplices whose edges satisfy the following conditions: (1) the Euclidean length of each simplex edge is less than 10 Å; (2) $d_{ij} = n$; (3) $d_{jk} = p$; (4) $d_{ki} = r$. Two proteins, represented by arrays $M$ and $M'$, are compared by evaluating the difference between their corresponding elements $M_{npr}$ and $M'_{npr}$, accumulated over all indices n, p and r starting from 1 (eq. (9)); the resulting score Q is the measure of similarity.

Geometric hashing [22] is a technique that originates from computer vision [23], [24] and was first applied to the comparison of structural biology data by Fischer [25]. The coordinates of one structure are expressed relative to several local reference frames, which can be any triplets of points (Cα atoms) of the protein. Since the points used as a reference belong to the structure itself, this representation is invariant under both rotation and translation. 
The positions of the other points in each frame are used as keys in a hash table. Once this representation has been calculated, two structures can be compared using a series of fast searches: the hash table is queried with structural features from the second molecule. Each hit in the table identifies a transformation between the two molecules. Transformations that receive many hits are those that are likely to superimpose essential structural features of both molecules. The complexity of this algorithm is $O(n^3)$, where n is the number of atoms in the protein to be compared. The method does not assume any order of the protein Cα atoms and has the advantage of sequence-order independent algorithms: it can be applied to any type of molecule, not only to proteins.

Some of the alignment techniques that examine the protein structure at the atomic level can be summarized in two distinct steps: 1) generation of all initial superpositions and 2) identification of the optimal alignment by RMSD.

MINRMS compares all consecutive fragments of four residues from one protein with all such fragments from the other to generate the initial superpositions. A dynamic programming algorithm is then used to evaluate the similarity between the two protein structures for each superposition. MINRMS generates and fills a score pyramid, which is composed of matrices, stacked on
top of one another (figure 5). The lowest layer represents the score matrix for the alignment of a single pair of residues. Each layer above is the score matrix for the alignment with one newly added pair of residues and is evaluated using the matrix from the layer below. The value of a cell in the pyramid is derived from one of three adjacent cells in the layer below: by row, by column or by diagonal. The optimal alignment can be reconstructed by backtracking from the maximum value of each scoring matrix. The complexity of the MINRMS algorithm is $O(m^3 n^2)$, where m and n are the numbers of Cα atoms in the two protein molecules being compared.

Figure 5. Alignment with MINRMS

Akutsu proposes two algorithms, RAND and RAG, which consist of the two steps described above. RAND finds the initial superposition by a random sampling technique, while RAG uses a fragment search method. The part common to RAND and RAG is the second step, the use of bipartite graph matching for protein structure alignment. After an initial superposition between A and B has been found, a bipartite graph $G(A, B, E)$ is built from the sequences A and B, where E is the set of edges between A and B (figure 6). The edge $(a_i, b_j) \in A \times B$ is contained in E if the distance between $a_i$ and $b_j$ is less than a threshold δ (in Å). The complexity of this method is $O(mns)$, where m and n are the sequence lengths and s is the number of initial superpositions.

Genetic algorithms are a general-purpose global optimization technique that provides promising results across the whole area of computational structural biology [26]. Genetic algorithms mimic the process of evolution. A generation within this process comprises a set of configurations that are coded as chromosomes. Chromosomes are manipulated by genetic operators such as crossover and mutation. The information content of the chromosomes varies depending on the application. 
Typically, it comprises the intramolecular matches, or a coding of the orientational degrees of freedom together with a coding of the torsional degrees of freedom when molecular flexibility is considered. The fitness function that drives the selection process is typically an efficiently computable similarity function. Szustakowski and Weng propose a genetic algorithm to determine the optimal alignment between protein structures.

Figure 6. Bipartite graph matching, proposed by Akutsu

The genetic algorithm is applied after the initial superpositions of all secondary structure elements of the compared proteins, which
generate all initial populations of possible alignments between secondary structure elements. Each population is altered with the genetic operators mutate, hop, swap and crossover to recombine randomly chosen alignments. The resulting alignments are accepted or rejected according to rules that guarantee the validity of the alignment.

Carr and Hart, whose goal is to maximize the number of equivalent contacts between the compared proteins in their contact maps (this number defines the degree of similarity), also use a genetic algorithm to solve the maximal contact map overlap problem. The chromosome is a vector c of dimension n, where each position can take values in the range [-1, ..., m-1], m is the length of the longer protein and n is the length of the shorter. The value c[j] at position j specifies that the j-th residue of the shorter protein is aligned to the c[j]-th residue of the longer protein. The value -1 at a position specifies that the j-th residue is not aligned to any residue of the other protein. An important aspect of the method is that infeasible configurations are not allowed. For this purpose the genetic operators are defined so as to preserve feasibility.

When the models of the compared proteins are graphs, two types of comparison algorithms can be applied: subgraph isomorphism detection or bipartite graph matching. In complexity theory, maximum common subgraph isomorphism (MCS) is an optimization problem that is known to be NP-hard, so this approach is suitable only for graph models with a small number of vertices; otherwise the comparison would be slow and cumbersome. That is why common subgraph isomorphism is searched for in models that represent protein structure at the level of secondary structure elements. After this initial alignment, some of the algorithms perform a refinement at the atomic level. 
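The chromosome encoding described above for the maximal contact map overlap problem can be sketched as follows. This is a minimal illustration, not Carr and Hart's implementation: contacts are assumed to be stored as sets of index pairs (i, j) with i < j, position j of the chromosome is taken to index the shorter protein, and the feasibility rule (aligned targets must strictly increase, so the alignment is one-to-one and non-crossing) is an assumption. The names `is_feasible` and `overlap_fitness` are hypothetical.

```python
def is_feasible(chromosome):
    """Assumed feasibility rule: aligned targets strictly increase,
    so no residue is used twice and aligned pairs never cross."""
    aligned = [c for c in chromosome if c != -1]
    return all(a < b for a, b in zip(aligned, aligned[1:]))

def overlap_fitness(chromosome, contacts_short, contacts_long):
    """Count contacts of the shorter protein whose aligned images
    are also contacts in the longer protein (the contact map overlap)."""
    overlap = 0
    for i, j in contacts_short:
        k, l = chromosome[i], chromosome[j]
        if k == -1 or l == -1:
            continue  # at least one residue of the contact is unaligned
        if (min(k, l), max(k, l)) in contacts_long:
            overlap += 1
    return overlap

# Toy data: a 3-residue protein aligned against a 4-residue protein.
contacts_short = {(0, 1), (1, 2), (0, 2)}
contacts_long = {(0, 1), (1, 3), (0, 3), (2, 3)}

full = [0, 1, 3]      # residues 0, 1, 2 -> 0, 1, 3
partial = [0, -1, 3]  # residue 1 left unaligned
print(is_feasible(full), overlap_fitness(full, contacts_short, contacts_long))
print(is_feasible(partial), overlap_fitness(partial, contacts_short, contacts_long))
```

A genetic algorithm would use `overlap_fitness` as the fitness function and restrict mutation and crossover to operations for which `is_feasible` stays true.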
VAST is one of the techniques that use subgraph isomorphism to detect similarity between protein molecules. The model of the proteins is a graph that represents the correspondence between pairs of secondary structure elements that have the same type, relative orientation and connectivity. This graph is searched for cliques, and the detected cliques are the starting point for alignment. The initial alignment is extended to a residue-level alignment with a Gibbs sampling technique. Since the maximum clique problem is known to be NP-hard, the search is feasible only for small graphs with fewer than about 30 vertices.

SSM [21] is another tool, which includes an original procedure for matching graphs built on the protein's secondary structure elements, followed by an iterative three-dimensional alignment of the protein backbone Cα atoms. This matching technique uses a backtracking algorithm for common subgraph isomorphism, CSIA, described by its authors as optimal [27], which is an advancement of the widely known algorithm of Ullman [28] for exact subgraph isomorphism. The time complexity of CSIA is bounded by $O(m^{n+1} n)$, which makes it applicable to graphs with up to 70 unlabelled vertices (n, m).

One of the main difficulties in aligning structures whose sequences are not similar is to determine the correspondence between equivalent residues. Typically the process either iterates between residue assignment and a minimization step, or uses a stochastic optimization procedure to find the maximal subset of equivalent residues within some constraints. Some methods first use a fast but less accurate filter to define the initial alignment, followed by a slower and more accurate residue-based alignment, which is performed only for the subsets that pass the first step. One of the most intuitive methods for initial alignment is first to detect and align secondary structure elements and then to refine this alignment by finding the residue equivalences. This method is used by VAST and SSM. 
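The clique-based matching of secondary structure elements described above for VAST can be illustrated with a small sketch. It is not the VAST implementation: each SSE is reduced to a type label, only pairwise distances between SSEs are compared (angles and connectivity are omitted for brevity), the tolerance `tol` is a hypothetical parameter, and the clique search is a plain Bron-Kerbosch enumeration chosen here for simplicity.

```python
from itertools import combinations

def correspondence_graph(types_a, dist_a, types_b, dist_b, tol=2.0):
    """Vertices are pairs (i, j) of same-type SSEs, one from each protein.
    Two vertices are adjacent if the SSE-SSE distances agree within tol."""
    vertices = [(i, j) for i, ta in enumerate(types_a)
                for j, tb in enumerate(types_b) if ta == tb]
    adjacency = {v: set() for v in vertices}
    for (i, j), (k, l) in combinations(vertices, 2):
        if i != k and j != l and abs(dist_a[i][k] - dist_b[j][l]) <= tol:
            adjacency[(i, j)].add((k, l))
            adjacency[(k, l)].add((i, j))
    return adjacency

def maximum_clique(adjacency):
    """Bron-Kerbosch enumeration of maximal cliques, keeping the largest."""
    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            if len(clique) > len(best):
                best = list(clique)
            return
        for v in list(candidates):
            expand(clique + [v], candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand([], set(adjacency), set())
    return sorted(best)

# Toy proteins: H = helix, E = strand; symmetric SSE distance matrices.
types_a = ["H", "E", "H"]
dist_a = [[0, 10, 15], [10, 0, 8], [15, 8, 0]]
types_b = ["H", "E", "H", "E"]
dist_b = [[0, 10.5, 14, 30], [10.5, 0, 9, 25], [14, 9, 0, 28], [30, 25, 28, 0]]

graph = correspondence_graph(types_a, dist_a, types_b, dist_b)
print(maximum_clique(graph))
```

The largest clique is a set of mutually consistent SSE pairs and can serve as the seed for a residue-level refinement. The exhaustive search is exponential in the worst case, which matches the remark in the text that clique detection is practical only for small graphs.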
In contrast, Taylor uses a representation at the secondary structure level without any refinement at the atomic level, and a bipartite graph matching technique (BGMT) instead of a heavy search for the full subgraph isomorphism, in order to construct a fast filter for protein structure comparison. The interactions between secondary structure elements are represented by a graph, and a bipartite graph-matching algorithm ("stable marriage") is used to search between the two sets of interactions. This method takes about 1/10 of a second for a typical comparison between two protein structures, which makes it suitable as a fast filter for slower and more complex structure comparison algorithms.
Wang, Makedon and Ford also use bipartite graph matching in their framework for finding correspondences between structural elements (atoms, residues or secondary structure elements) in two proteins. They define a maximum weight matching as a matching in which the sum of the weights of the edges is maximized. A maximum weight maximum cardinality matching is a matching that, among all matchings with the maximum number of edges, has the greatest weight. Figure 3 shows an example of the maximum weight bipartite matching. In the context of correspondence between protein structural elements, a maximum weight matching returns the correspondence with the maximum weight, but there is no guarantee of maximum cardinality. Therefore, some elements in the smaller protein may not be matched to any element in the other protein. In other words, a maximum weight matching favors good local matches. A maximum weight maximum cardinality matching, on the other hand, always returns a matching with maximum cardinality, even if some edges in the matching have relatively small weights. It guarantees that every element in the smaller protein will be matched to an element in the other protein. In other words, a maximum weight maximum cardinality matching favors good global matches.
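The difference between the two kinds of matching can be seen on a toy example. The sketch below enumerates all matchings by brute force, which is practical only for a handful of elements; it is neither the algorithm of the cited framework nor the Hungarian method, and the weight matrix and function names are made up for illustration. `None` marks a pair of elements that cannot be matched.

```python
def matchings(weights, row=0, used=frozenset()):
    """Yield every matching as a list of (row, column) pairs.

    weights[i][j] is the weight of matching element i of the smaller protein
    to element j of the larger one, or None if the pair is incompatible.
    """
    if row == len(weights):
        yield []
        return
    # Option 1: leave this element unmatched.
    yield from matchings(weights, row + 1, used)
    # Option 2: match it to any free, compatible element of the other protein.
    for col, w in enumerate(weights[row]):
        if w is not None and col not in used:
            for rest in matchings(weights, row + 1, used | {col}):
                yield [(row, col)] + rest

def total_weight(weights, matching):
    return sum(weights[i][j] for i, j in matching)

def max_weight_matching(weights):
    """Largest total weight, regardless of how many pairs are matched."""
    return max(matchings(weights), key=lambda m: total_weight(weights, m))

def max_weight_max_cardinality_matching(weights):
    """Largest number of pairs first, then largest total weight."""
    return max(matchings(weights),
               key=lambda m: (len(m), total_weight(weights, m)))

# Hypothetical weights: element 1 of the small protein fits only element 0.
weights = [[10, 4],
           [5, None]]
print(max_weight_matching(weights))                  # [(0, 0)]
print(max_weight_max_cardinality_matching(weights))  # [(0, 1), (1, 0)]
```

Here the maximum weight matching keeps the single strong pair (weight 10) and leaves one element unmatched, while the maximum cardinality variant accepts a lower total weight (9) to match both elements.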
The best-known strongly polynomial time bound for weighted bipartite matching is that of the classical Hungarian method due to Kuhn [29], which runs in time $O(|V|(|E| + |V| \log |V|))$. Weighted bipartite matching algorithms can be implemented efficiently and can be applied to graphs of reasonably large size (about 100,000 vertices) [30].
Once the comparison algorithms have produced their results, a similarity measure is needed to evaluate them and to support conclusions about the presence or absence of similarity between the compared objects.
C. Similarity Measure
It is important for a similarity measure to be sensitive and to rank more similar structures higher than less similar ones when the measure is positive, and the opposite when the measure is negative. Another important fact is that structural alignment lacks a theory that defines and describes the distribution of structural similarity scores. The most commonly used metric in this category is the root-mean-square deviation (RMSD), in which the root-mean-square distance between corresponding residues is calculated after an optimal rotation of one structure onto the other. This metric has a lower value if the structures are similar and a higher value otherwise. RMSD is defined as follows:
(10) $\mathrm{RMSD}(A, B) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x(i) - y(i) \right)^2}$
In eq. (10), N is the number of aligned atoms and x(i) and y(i) are their coordinates. Since the RMSD weights the distances between all residue pairs equally, a small number of local structural deviations can result in a high RMSD, even when the global topologies of the compared structures are similar. Furthermore, the average RMSD of randomly related proteins depends on the length of the compared structures, which makes the absolute magnitude of RMSD meaningless [31]. Carugo shows [32] that the RMSD of an alignment is linearly related to the resolution of the compared domains. Alignment of two domains with low or significantly different resolution results in a higher RMSD than alignment of two domains with high resolution. For the RMSD to be meaningful, the number of aligned residues has to be considered and the metric should be normalized. For their relative RMSD, Betancourt and Skolnick [31] normalize the RMSD by the average RMSD of random structure pairs of similar size:
(11) $\mathrm{RRMSD}(A, B) = \mathrm{RMSD}(A, B) / D(A, B)$
RMSD(A, B) is the RMSD between proteins A and B, and D(A, B) is an estimate of the average RMSD between two random protein fragments of the same length when the proteins are aligned. For the definition of the RMSD100 score [33], Carugo and Pongor divide the RMSD by a factor of $1 + \ln\sqrt{N/100}$, with N representing the protein length.
(12) $\mathrm{RMSD}_{100}(A, B) = \frac{\mathrm{RMSD}(A, B)}{1 + \ln\sqrt{N/100}}$
Zemla [34,35] proposes two different metrics, which are used to evaluate protein structure prediction and serve as major assessment criteria for the results of CASP experiments: LCS, the longest continuous segment, and GDT, the global distance test. The LCS is a measure that shows the longest continuous segment that can be aligned with an RMSD between Cα atoms below a specified value of 1, 2 or 5 Å. The GDT score is calculated as the largest set of amino acid residue Cα atoms in the model structure that fall within a defined distance cutoff of their position in the experimental structure. It is typical to calculate the GDT score under several cutoff distances (1, 2, 4 and 8 Å are used in the CASP5 experiment), and scores generally increase with increasing cutoff. GDT aims to identify accurately matched similar substructures, which need not be continuous. It attempts to find the maximum number of residues in one protein that can be superimposed on the other protein within a given threshold. This measure can be applied to find the largest similar subsets.
Levitt and Gerstein [36] propose a different structure similarity measure, which takes into account the number of gaps when the structures are aligned:
(13) $S_{str} = M \left( \sum \frac{1}{1 + (d_{ij}/d_0)^2} - \frac{N_{gap}}{2} \right)$
In eq. (13) the sum runs over the aligned pairs, $d_{ij}$ is the distance between aligned Cα atoms, $N_{gap}$ is the number of gaps, M = 20 and $d_0$ = 20 Å. This measure and GDT are the bases for other protein structure similarity measures. With MaxSub [37], Siew, Elofsson, Rychlewski and Fischer try to identify the maximum substructure in which the distances between equivalent residues of two structures after superposition are below some threshold value, such as 3.5 Å. MaxSub counts only the residues in the substructure, and the spatial information of the templates outside this substructure is omitted. This measure is based on principles similar to those of GDT.
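The distance-based measures above can be illustrated with a short sketch. It assumes that the two structures are already superposed and that the residue correspondence is known; a real GDT computation additionally searches over superpositions, which is omitted here. The RMSD100 normalization is written as $1 + \ln\sqrt{N/100}$, following Carugo and Pongor [33], the defaults for M and d0 are the values quoted in the text, and the function names and toy coordinates are hypothetical.

```python
import math

def pair_distances(xs, ys):
    """Distances between aligned atom pairs of two already superposed structures."""
    return [math.dist(x, y) for x, y in zip(xs, ys)]

def rmsd(dists):
    """Eq. (10): root-mean-square of the pair distances."""
    return math.sqrt(sum(d * d for d in dists) / len(dists))

def rmsd100(rmsd_value, n):
    """Eq. (12): size-normalized RMSD; leaves the value unchanged for n = 100."""
    return rmsd_value / (1.0 + math.log(math.sqrt(n / 100.0)))

def gdt_fractions(dists, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Fraction of aligned pairs within each distance cutoff (in angstroms)."""
    return {c: sum(d <= c for d in dists) / len(dists) for c in cutoffs}

def levitt_gerstein(dists, n_gap, m=20.0, d0=20.0):
    """Eq. (13), using the M and d0 values quoted in the text as defaults."""
    return m * (sum(1.0 / (1.0 + (d / d0) ** 2) for d in dists) - n_gap / 2.0)

# Toy coordinates (made up): three close pairs and one outlier.
xs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
ys = [(0.5, 0.0, 0.0), (2.5, 0.0, 0.0), (5.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
dists = pair_distances(xs, ys)   # [0.5, 1.5, 3.0, 9.0]
print(rmsd(dists))               # dominated by the single 9 A outlier
print(gdt_fractions(dists))      # {1.0: 0.25, 2.0: 0.5, 4.0: 0.75, 8.0: 0.75}
print(levitt_gerstein(dists, n_gap=1))
```

The single outlier pushes the RMSD above 4.8 Å even though three of the four pairs lie within 4 Å, which is the weakness of RMSD that the cutoff-based and weighted scores are meant to address.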
MaxSub computes a single scalar in the range of 0 to 1, which measures the similarity between the compared structures. This scalar is a normalization of the size of the most similar subset and is computed using a variation of a formula suggested by Levitt and Gerstein [36]. Whereas RMSD considers intermolecular distances, Holm and Sander propose the DALI similarity score [12], which is based on intramolecular rather than intermolecular distances:
(14) $S(A, B) = \sum_{i} \sum_{j} \left( 0.2 - \frac{\left| d^{A}_{ij} - d^{B}_{ij} \right|}{d^{avg}_{AB}} \right) e^{-\left( d^{avg}_{AB} / 20 \right)^2}$
Here $d^{A}_{ij}$ and $d^{B}_{ij}$ are the distances between residues i and j within proteins A and B, and $d^{avg}_{AB}$ is their average.
Kawabata and Nishikawa [17] propose a theory for evaluating protein structure similarity by a log-odds score, which is based on the Markov transition model of evolution. Their similarity score between structures i and j is defined as:
(15) $S(i, j) = \log \frac{P(i \to j)}{P(j)}$
$P(i \to j)$ is the probability that structure i changes to structure j during the evolutionary process, and P(j) is the probability that structure j appears by chance. This is a reasonable definition of structure similarity, especially for finding evolutionarily related (homologous) structures. The probability $P(i \to j)$ is estimated by the Markov transition model, which is similar to Dayhoff's substitution model for amino acids. To evaluate pairwise similarities between proteins A and B, Kawabata and Nishikawa define the following score:
(16) $R(A, B) = \frac{S(A, B) - S_{min}}{S_{max} - S_{min}} \times 100$
S(A, B) is the log-odds score between proteins A and B, and $S_{min}$ and $S_{max}$ are the minimum and maximum scores. This score is similar to that proposed by Feng and Doolittle [38].
Template modeling score, or TM-score [39], also uses a variation of the Levitt-Gerstein (LG) weight factor, which weights the residue pairs at smaller distances relatively more strongly than those at larger
distances. For that reason the TM-score is more sensitive to the global topology than to local structural variations. Its value is normalized in such a way that the score magnitude relative to random structures does not depend on the protein's size, with a value of 0.17 for an average pair of randomly related structures. For aligning a template model to a native structure, the TM-score is defined as follows:
(17) $\mathrm{TM\text{-}score} = \mathrm{Max} \left[ \frac{1}{L_N} \sum_{i=1}^{L_T} \frac{1}{1 + (d_i/d_0)^2} \right]$
$L_N$ is the length of the native structure, $L_T$ is the number of residues aligned to the template structure, $d_i$ is the distance between the i-th pair of aligned residues and $d_0$ is a scale that normalizes the match difference.
Yang and Honig propose the protein structural distance (PSD) [40]. PSD(A, B) combines a logarithmic term in the normalized alignment score, built from a, s(A, B), max(a, b) and s(A, A), with the RMSD, weighted by the adjustable parameters x and y. Here a is the number of SSEs in protein A, b is the number of SSEs in protein B, s(A, A) is the self-alignment score for protein A, s(A, B) is the score for the alignment of the SSEs of both proteins, and x and y are adjustable parameters. The PSD is designed to describe relationships between protein structures in quantitative rather than descriptive terms and is applicable both when two structures are very similar and when they are very different. It is calculated with a structural alignment procedure that uses double dynamic programming to align secondary structure elements and an iterative rigid-body superposition that minimizes the root-mean-square deviation of the Cα atoms.
Krasnogor and Pelta showed [41] how the Universal Similarity Metric (USM), introduced by Li in [42], can be used to calculate similarities between protein pairs. The USM approximates every possible similarity metric (i.e. those that exist today and those that are yet to be defined) and is based on the concept of Kolmogorov complexity. The Kolmogorov complexity K(o) of an object o is defined as the length of the shortest program for a Universal Turing Machine U that is needed to output o.
It is an objective measure of the amount of information contained in a given object. A related measure is the conditional Kolmogorov complexity of $o_1$ given $o_2$, which measures how much information is needed to produce object 1 if object 2 is known:
(18) $K(o_1 \mid o_2) = \min \{ |P| : P \text{ is a program}, \; U(P, o_2) = o_1 \}$
The information distance between two objects is equivalent to:
(19) $ID(o_1, o_2) = \max \{ K(o_1 \mid o_2), K(o_2 \mid o_1) \}$
The Universal Similarity Metric is a proper metric; it is universal and also normalized. The metric is formally defined as:
(20) $d(o_1, o_2) = \frac{\max \{ K(o_1 \mid o_2^{*}), K(o_2 \mid o_1^{*}) \}}{\max \{ K(o_1), K(o_2) \}}$
where $o_1^{*}$ (or $o_2^{*}$) indicates a shortest program for $o_1$ (or $o_2$).
The magnitudes of some of the metrics discussed above depend on the size of the evaluated proteins [39], which makes the absolute magnitude of these scoring functions meaningless. To eliminate the dependence on protein size, some authors (Levitt and Gerstein; Ortiz, Strauss and Olmea [43]) convert their structure alignment score into a statistical significance score, called the P-value, on the basis of the statistics of their random structure database. VAST also uses a P-value, which is based on the likelihood of aligning a given number of secondary structure elements of a certain length. A Z-score is defined and used by Shindyalov and Bourne for CE and by Holm and Sander for DALI. The Z-score of CE is based on a Gaussian distribution of the similarity score between aligned fragments of proteins.
The information about the protein structure comparison methods discussed in this paper is summarized in Table 1. The properties of the three major components of each method (model of the protein structure, comparison algorithm and similarity measure) are considered. Whatever the preferred model, algorithm and similarity score, most protein structure comparison methods will detect similar proteins and will give a positive score for their alignment.
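Kolmogorov complexity is not computable, so in practice it has to be approximated, for example by the length of a compressed encoding. The sketch below uses the normalized compression distance with zlib as a computable stand-in for eq. (20). The choice of compressor, the string encoding of the structures (here, randomly generated secondary-structure strings over H, E and L) and the function names are assumptions made for illustration, not the procedure of [41].

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a crude upper bound on K(data)."""
    return len(zlib.compress(data, 9))

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance, a computable proxy for eq. (20)."""
    size_a, size_b = compressed_size(a), compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)

# Made-up secondary-structure strings (H = helix, E = strand, L = loop).
rng = random.Random(0)
base = bytes(rng.choice(b"HEL") for _ in range(2000))
variant = bytearray(base)
for pos in range(0, len(variant), 200):
    variant[pos] = ord("L")      # a few point changes relative to base
variant = bytes(variant)
unrelated = bytes(rng.choice(b"HEL") for _ in range(2000))

# The variant shares almost all of its information with base, so its
# distance is expected to be the smaller of the two.
print(ncd(base, variant))
print(ncd(base, unrelated))
```

Because a real compressor only approximates the shortest program, the values need not lie exactly in [0, 1] and `ncd(x, x)` is usually slightly above zero.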
The situation is different when the comparison is made between less similar protein structures, the so-called challenging sets [44]. In such cases the alignment algorithms sometimes produce different results regarding the degree of detected similarity. Some of the works dedicated to this problem are [44,45,46]. Novotny presents [45] a comprehensive and critical comparative analysis and evaluation of 11 publicly available, Web-based servers for automatic fold comparison. The conclusion is that the tested algorithms differ in their performance, i.e., in how well established structural similarities are recognized. Sierk and Pearson also examine the sensitivity and selectivity of protein structure comparison methods in [46]. Seven protein structure comparison methods and two sequence comparison programs are evaluated on their ability to detect either protein homologs or domains with the same topology (fold), as defined by the CATH structure database. The programs show distinct differences in their misclassifications according to structural class. After analysis of the results, the authors conclude that, with some exceptions, the relative performance of the tested methods is the same regardless of the error model, and that these results accurately reflect the general characteristics of the methods. Mayr, Domingues and Lackner [44] also present a comparative analysis of protein structure alignment methods. They analyze and compare several methods with regard to their performance in the identification of structurally or evolutionarily related proteins. Three sets of pairs of structurally related proteins are used, including remote homologous proteins according to the
SCOP database (ASTRAL40 set), the SISY set, derived from the SISYPHUS database, which includes 69 protein pairs, and 40 pairs that are challenging to align (RIPC set). Two methods are applied to align the proteins in the ASTRAL40 set, and the resulting alignments agree on average in more than half of the aligned positions. Six methods are compared using the SISY and RIPC sets. The alignments generated by the different methods on average match more than half of the reference alignments in the SISY set. The alignments obtained in the more challenging RIPC set tend to differ considerably and match reference alignments less successfully than the SISY set alignments. The authors come to the conclusion that the alignments produced by different methods tend to agree to a considerable extent, but the agreement is lower for the more challenging pairs.

Table 1. Protein structure comparison methods

| Name | Properties of the model | Properties of the comparison algorithm | Similarity measure |
|---|---|---|---|
| RAND and RAG [6] | Proteins are presented with sequences of their Cα atoms as points in 3D space. | Generation of all initial superpositions and refinement with bipartite graph matching technique. | RMSD |
| MINRMS [7] | Proteins are presented with sequences of their Cα atoms as points in 3D space. | Generation of all initial superpositions, dynamic programming to evaluate the similarity and refinement with min RMSD. | RMSD |
| CE [8] | Proteins are presented with fragments of amino acids with fixed size. | Combinatorial extension of the optimal path with greedy algorithm. | Z-score |
| Flexprot [9] | Proteins are presented with fragments of amino acids with fixed size. | Superposition with min RMSD, single-source shortest path and clustering. It takes flexibility into account. | RMSD |
| FATCAT [10] | Proteins are presented with fragments of amino acids with fixed size. | Dynamic programming. It takes flexibility into account. | RMSD |
| Delaunay tessellation [11] | Topology is presented with Delaunay tessellation, a composition of simplexes. Cα atoms that are nearest neighbors are grouped and define a simplex. | Comparison of corresponding elements in 3D arrays. | Q-score |
| DALI [12] | Proteins are presented at the atomic level with distance matrices. | Branch-and-bound strategy. | Z-score |
| Szustakowski and Weng's method [13] | Proteins are presented at the atomic level with distance matrices. | Genetic algorithm. | Elastic similarity score |
| Carr and Hart's method [15] | Proteins are presented at the atomic level with contact maps. | Genetic algorithm. | Maximum contact map overlap |
| MATRAS [16] | Different levels of representation are used: secondary structure elements for the initial model, followed by a model at the amino acid level. Markov transition model of evolution is applied. | Branch-and-bound strategy for secondary structure element comparison and dynamic programming for residue-based alignment. | Log-odds score |
| VAST [18] | Different levels of representation are used: secondary structure elements for the initial model, followed by a model at the amino acid level. Graph model is preferred here. | Subgraph isomorphism algorithm by clique detection. | P-value |
| Taylor [19] | Proteins are presented with a bipartite graph whose vertices represent secondary structure elements. | Bipartite graph matching technique. | Score, which includes type, distance, angle and packing |
| MWBM [20] | The level is chosen among the following: atoms, residues or secondary structure elements. | Bipartite graph matching technique. | Weight of the match |
| SSM [21] | Proteins are presented with a graph model whose vertices are secondary structure elements. | Common subgraph isomorphism detection. | P-value |
| Geometric Hashing [25] | Hash table with positions of all points (Cα atoms), situated according to different reference frames. | Geometric hashing. | RMSD |
The results for the comparison to reference alignments are encouraging, but also indicate that there is still room for improvement.
Conclusion
The comparison of protein structures is a task with many applications: it is a part of protein classification methods and of protein structure and protein function prediction. Different approaches are used when models are constructed, and suitable algorithms are chosen according to the application. The detailed models are precise and carry a lot of structural information, but sometimes it exceeds the necessary one
More informationPrediction and refinement of NMR structures from sparse experimental data
Prediction and refinement of NMR structures from sparse experimental data Jeff Skolnick Director Center for the Study of Systems Biology School of Biology Georgia Institute of Technology Overview of talk
More informationA Simple Topological Representation of Protein Structure: Implications for New, Fast, and Robust Structural Classification
PROTEINS: Structure, Function, and Bioinformatics 56:487 501 (2004) A Simple Topological Representation of Protein Structure: Implications for New, Fast, and Robust Structural Classification David L. Bostick,
More informationJoana Pereira Lamzin Group EMBL Hamburg, Germany. Small molecules How to identify and build them (with ARP/wARP)
Joana Pereira Lamzin Group EMBL Hamburg, Germany Small molecules How to identify and build them (with ARP/wARP) The task at hand To find ligand density and build it! Fitting a ligand We have: electron
More informationproteins Refinement by shifting secondary structure elements improves sequence alignments
proteins STRUCTURE O FUNCTION O BIOINFORMATICS Refinement by shifting secondary structure elements improves sequence alignments Jing Tong, 1,2 Jimin Pei, 3 Zbyszek Otwinowski, 1,2 and Nick V. Grishin 1,2,3
More informationAn Introduction to Bioinformatics Algorithms Hidden Markov Models
Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training