Protein Structure Comparison Methods

D. Petrova

Key Words: Protein structure comparison; models; comparison algorithms; similarity measure

Abstract. Existing methods for protein structure comparison are examined and their basic properties are presented. The methods are analysed in terms of their three major components: the model of the structure, the comparison algorithm and the similarity measure. Each method is evaluated against several criteria: the level of representation and the way the model is handled, the complexity and strategy of the comparison algorithm, and how meaningful the similarity measure is. The methods are grouped according to these criteria. The advantages and disadvantages of each group are discussed with respect to their application to different types of problems in contemporary structural biology.

Introduction

Proteins are long chains of amino acids and, like other biological macromolecules, are essential parts of organisms and participate in every process within cells. Many proteins are enzymes that catalyze biochemical reactions and are vital to metabolism. Proteins also have structural, mechanical or transport functions, or form a system of scaffolding that maintains cell shape. Other proteins are important in cell signaling, immune responses, cell adhesion and the cell cycle.

The number of newly determined protein structures is growing fast. When the Protein Data Bank [1] was founded it contained just 7 protein structures. Since then the number of structures has grown approximately exponentially and shows no sign of levelling off; the number of deposited structures is expected to keep rising through 2015. The need for new methods and algorithms for studying protein structure, function and evolution is evident. One of the main tasks in these studies is the precise and adequate comparison of protein structures.
Protein structure comparisons are employed in almost all branches of contemporary structural biology. They are applied for:

Protein fold classification. SCOP [2] is one of the databases in which the classification of protein structures is based on evolutionary relationships and on the principles that govern their three-dimensional structure. The protein classification in SCOP is constructed essentially by visual inspection and comparison of structures. CATH [3] includes an adaptation of a method for rapid structure comparison based on secondary structure matching.

Protein structure modelling for protein structure prediction [4]. Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry. Its aim is to predict the three-dimensional structure of a protein from its amino acid sequence, sometimes using additional relevant information such as the structures of related proteins. In other words, it predicts a protein's tertiary structure from its primary structure, and the prediction is based on comparison with proteins of known structure.

Structure-based function prediction [5]. The functions of proteins are strongly connected with their structure and chemical characteristics, and the structure of a protein can directly reveal the mechanisms of its functions. Protein structure comparison methods can therefore supply structural information that aids function prediction: an unknown protein function is inferred from detected structural similarity to proteins with known functions.

Protein Structure Comparison Methods

To compare two (or more) protein structures, three components are necessary: a model of the protein, an alignment algorithm and a similarity measure.
The ideal combination of these three components would be sensitive (giving high scores to similar structures), selective (giving low scores to dissimilar structures) and fast.

A: Protein Structure Representation/Model

One of the main characteristics of a protein model is its level of representation, that is, how detailed it is. Two sets of structure elements are given: $A = \{a_1, a_2, \ldots, a_n\}$ for protein $P_A$ and $B = \{b_1, b_2, \ldots, b_m\}$ for protein $P_B$. The choice of structure elements defines the following possible levels of representation:

1) Low, atomic level, where the coordinates of the Cα atoms of the proteins are used to construct the model. Let the set of Cα atoms in protein $X$ be defined by eq. (1).

(1) $C\alpha(X) = \{C\alpha_1, C\alpha_2, \ldots, C\alpha_p\}$

Then the sets $A$ and $B$ of structure elements for the two compared proteins can be defined as:

(2) $A = C\alpha(P_A)$, $B = C\alpha(P_B)$

2) Atomic level of representation, where proteins are represented by fragments of fixed size. A fragment is defined as a sequence of amino acids:

(3) $f = \{C\alpha_1, C\alpha_2, \ldots, C\alpha_k\}$

Parameter $k$ in eq. (3) is the fragment size (the number of amino acids included in the fragment). The sets of structure elements $A$ and $B$ can then be defined as sets of fixed-size fragments:

(4) $A = \{f_1, f_2, \ldots, f_r\}$, $B = \{f_1, f_2, \ldots, f_s\}$

3) The level of secondary structure elements, where

proteins are represented by their sets of helices and sheets/strands. Let the set of secondary structure elements in protein $X$ be defined by eq. (5).

(5) $SSE(X) = \{h_1, \ldots, h_p, \underbrace{\{s_{11}, s_{12}, \ldots, s_{1l}\}}_{SH_1}, \ldots, \underbrace{\{s_{q1}, s_{q2}, \ldots, s_{ql}\}}_{SH_q}\}$

Parameter $p$ is the number of helices ($h_i$) and $q$ is the number of sheets ($SH_j$) in the structure. Sheets are represented by the strands that compose them ($s_{jk}$). In this case the sets of structure elements for the two proteins can be defined as:

(6) $A = SSE(P_A)$, $B = SSE(P_B)$

4) A combination of the previous levels.

Once the level is specified, the model can be further defined by the type of distances that are compared and superimposed. If the model consists of a set of points or secondary structure elements, intermolecular distances are computed between elements that are fixed as equivalent. This type of model usually does not consider the flexibility of the molecules. The other type of model uses intramolecular distances (distance matrices or contact maps). It consists of all relative positions or distances between elements of one protein molecule, which are compared with the relative distances and positions in the other protein molecule.

The oldest and most common way to compare protein structures is to select a set of equivalent points, usually Cα atoms, and superimpose them in 3D space so as to minimize the least-squares distances between corresponding atoms of the two proteins. Proteins are represented by the sequences of their Cα atoms as points in 3D space in RAND and RAG [6], proposed by Akutsu, and in MINRMS [7]. Their sets of structure elements to be compared are defined by eq. (2). Figure 1 shows this kind of representation of two protein molecules $P_A$ and $P_B$ with their corresponding coordinate sets $A$ and $B$, which are rotated and translated to find an alignment of equivalent pairs $(a_i, b_j)$.

Figure 1.
Protein structure alignment by selecting equivalent points in 3D space

These models are detailed and precise if the only requirement is to compare points in 3D space, namely the atomic coordinates of two different macromolecules. However, a plain comparison of Cα coordinates is not sufficient when the objects are biological macromolecules. The disadvantage of such models is that they can miss similarities between divergent structures that are biologically similar. These models are very detailed, and in some cases they carry more information than is needed.

Combinatorial Extension (CE) of Shindyalov and Bourne [8], FlexProt [9] and FATCAT [10] are also comparison methods that examine the protein structure at the low atomic (Cα) level, but they group the atoms into fragments, that is, fixed-size parts of the compared proteins. Fragment pairs, whose size is defined by the number of residues that compose them, are aligned. The models of CE, FlexProt and FATCAT are defined by eq. (4).

Bostick and Vaisman made an interesting proposal for a model [11]. They applied Delaunay-based topological mapping to protein structure comparison. The model of the protein is based on the topology of its core, specifically the Cα atoms, eq. (2), which form the protein backbone. Four-body nearest neighbors of Cα atoms are clustered and used for comparison. A Delaunay tessellation is built on the set of points that represent the amino acid residues: each group of four nearest neighbors lies at the vertices of an irregular tetrahedron, which is called a Delaunay simplex. The length of a simplex edge is defined as the number of Cα atoms in the segment of the protein between its simplex vertices. If all residues are numbered with consecutive integers according to their occurrence in the primary structure of the protein, the length of a simplex edge $d_{ij}$ is defined as:

(7) $d_{ij} = j - i - 1$

Indices $i$ and $j$ in eq.
(7) are the integers assigned to the residues by this numbering. Each simplex in the tessellation has three lengths associated with its four vertices.

Some protein structure comparison methods use the coordinates of Cα atoms to construct a model called a distance matrix. The structure of the compared proteins is examined at the level of Cα atoms, as in the group discussed above. The intramolecular distances within the sets defined in eq. (2) are computed, and the results are distance matrices that represent the geometrical properties of the compared proteins. The idea of the DALI method [12], proposed by Holm and Sander, is that similar 3D structures have similar inter-residue distances. This method uses the distance matrix as a 2D representation of the 3D structure; it contains all pairwise distances between residue centres, i.e. the Cα atoms of a protein. Szustakowski and Weng [13] also use intramolecular distance matrices, together with the elastic similarity score developed by Holm and Sander [12], to evaluate the alignments between compared proteins.

Another type of matrix used as a model is the contact map. The contact map is a matrix of pairwise distances between atoms, residues or secondary structure elements of a protein, depending on the model chosen for analysis. The comparison of two contact maps is defined as the alignment of the set $A$ of structure elements from the contact map of the first protein with the corresponding set $B$ from the contact map of the second protein. Aligned components from

the sets are considered equivalent. In the maximal contact map overlap (Max CMO) problem, the degree of similarity between two proteins is defined as the number of equivalent contacts between them. This number is called the overlap of the contact maps, and the goal is to maximize it. The Max CMO problem has been proven to be NP-complete [14]. Carr and Hart [15] simplify this representation of protein structure by defining a contact as a pair of residues that are closer than a given threshold, usually between 2 and 9 Angstroms. The result is a binary contact map, which has the advantage that the structural properties of proteins can be visualized and compared more easily.

The detailed models discussed above have the disadvantage of being complicated to compare: because of the huge number of elements to be compared (Cα atoms, fragments of Cα atoms, or distances between them), the problem belongs to the NP-complete class. It is a challenge for comparison methods that use these models to be fast and exact at the same time. Either simplifications are sought to reduce the complexity of the model, as Carr and Hart proposed, or heuristic comparison methods are used to achieve reasonable running times.

Another approach to speeding up the comparison is to use a less detailed model and to examine the protein structure at the level of helices and sheets, which are secondary structure elements. Vectors of secondary structure elements are used for the initial model in MATRAS [16]. Kawabata and Nishikawa use a Markov transition model of evolution, which is similar to Dayhoff's substitution model [17] for amino acids. The transition probability matrix of the Markov model is generated from the numbers of structure transitions within pairs of proteins with similar sequences. The protein structure is examined first at the level of secondary structure elements, and then the similarity is refined at the residue level.
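The distance matrix and the binary contact map described above can be illustrated with a minimal Python sketch. It assumes that the Cα coordinates are available as (x, y, z) tuples in Angstroms. The function names, the toy chain and the 8 Angstrom default threshold are illustrative assumptions (the text only states that thresholds usually lie between 2 and 9 Angstroms); none of them is taken from the cited methods.

```python
import math

def distance_matrix(coords):
    """Return all pairwise Euclidean distances between C-alpha coordinates."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def contact_map(coords, threshold=8.0):
    """Return the binary contact map as a set of index pairs (i, j), i < j.

    Two residues are in contact when their C-alpha atoms are closer than
    `threshold` (an assumed value, in Angstroms).
    """
    d = distance_matrix(coords)
    n = len(coords)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if d[i][j] < threshold}

# Hypothetical toy chain: four C-alpha atoms on a line, 3.8 Angstroms apart.
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
contacts = contact_map(chain)

# Intramolecular distances do not change when the molecule is translated,
# so the shifted copy yields the same contact map.
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in chain]
same_after_shift = contact_map(shifted) == contacts
```

Storing the map as a set of index pairs rather than a full 0/1 matrix keeps the sketch short and makes counting shared contacts a simple membership test.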
The sets defined by eq. (2) and eq. (6) are used to construct the models of the proteins.

Other methods use graph models to represent protein secondary structure. When the level of interest is atomic, the number of objects to be represented equals the number of amino acids in a single-chain protein, and a graph model is not appropriate: with so many vertices the comparison time would be enormous. The situation is different at the higher level, where secondary structure elements are the vertices of the graph: their average number is about 10 to 20, which makes the graph model suitable.

VAST [18], like MATRAS, first detects and aligns secondary structure elements and then refines this alignment by finding the residue equivalences, but VAST uses a graph model. All pairs of secondary structure elements (one from each protein) that are of the same type are represented as vertices in a graph. An edge is defined between two vertices if the distance and angle between the corresponding secondary structure elements are within some threshold. The resulting model of the proteins is a graph composed of the structure elements defined by eq. (6). The graph model represents the correspondence between pairs of secondary structure elements that have the same type, relative orientation and connectivity.

Taylor [19] also proposes a graph model for protein structure comparison. The model is defined by the interactions between secondary structure elements, which are represented by line segments, and by other geometric properties of the protein molecules. The degree of interaction between two line segments is evaluated by the degree of overlap between the segments, and these interactions are represented by a bipartite graph.

Figure 2.
Line segment overlap measure

Two line segments corresponding to secondary structure elements (A -> B and C -> D) are shown as thick lines in Figure 2, with their mutually perpendicular connecting line (p and q). A series of fine lines covers the span in which the line segments overlap; the end points of these lines are equidistant from the corresponding ends of the mutual perpendicular. A measure of interaction is calculated as the sum of the lengths (x) of these lines.

In the method proposed by Wang, Makedon and Ford [20], a bipartite graph is constructed to compare protein structure $P_A$ with protein structure $P_B$. The two parts of the vertex set represent structural elements from the two proteins: the left part for protein $P_A$ and the right part for protein $P_B$. In contrast with the graph models discussed previously, the level of representation can vary here: the structural elements can be all atoms, only Cα atoms, amino acid residues or secondary structure elements. Each vertex is connected to all vertices in the opposite part. The weight of an edge between two vertices is defined by the similarity measure, which can include geometric and chemical properties of the compared protein structures. Given the two sets of structure elements $A$ and $B$ for proteins $P_A$ and $P_B$, an undirected weighted bipartite graph $G(V, E)$ can be constructed, where $V = A \cup B$ and $E = \{e_{ij}\}$ for $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$ (Figure 3).

Figure 3. Bipartite graph matching proposed by Wang, Makedon and Ford

Each edge $e_{ij}$ corresponds to a weighted connection between $a_i$ and $b_j$, and the weight $w(e_{ij})$ expresses the degree of similarity between $a_i$ and $b_j$. Edges between nodes of the same part are not allowed.

In the graph model proposed by Krissinel and Henrick [21], the secondary structure elements (helices and strands, eq. (6)) are used as graph vertices with composite labels: one part of the label gives the type of the element and the other the number of residues that compose it. Any two vertices of the graph are connected by an edge whose label describes the geometry of the mutual position and orientation of the connected elements. Figure 4 shows the properties that are considered when the graph is constructed. Vertices $v_i$ and $v_j$ are represented by vectors $r_{SSE}$; edge $e_{ij}$ connects their centers. The edge length $p_{ij}$ and the angles $\alpha^k_{ij}$, $k = 1..4$, define the mutual positions and orientations of all vertices in the graph.

Figure 4. Properties of vertices and edges of the SSE graph

Models that examine protein structure at the level of helices and sheets/strands are more compact and easier to construct and maintain. The disadvantage of using only a model of secondary structure elements is that the information may not be sufficient for a precise comparison. For this reason many methods first use this model for a fast comparison and then refine the result with a detailed comparison at the atomic level, for which the alignment of secondary structure elements is already available. At the same time, geometric properties of the structures are taken into account when the SSE-level model is constructed, by determining the mutual position and orientation of the secondary structure elements relative to one another, including distances, angles, connectivity and overlaps.

B: Comparison Algorithm/Strategy

The variety of models for protein structure representation brings a variety of comparison algorithms.
There are cases in which the models are almost the same but the algorithms that operate on them differ, and cases in which the models differ yet the search strategy for equivalences is the same. Comparison algorithms can be grouped into classes, each with specific characteristics that determine the advantages, disadvantages and applications of its algorithms.

According to their dependence on chain order, alignment algorithms can be:

sequence-order dependent: they use the order of atoms in the protein chain, thus reducing the problem to 3D curve matching. The comparison task becomes easier when the order of the chain is considered.

sequence-order independent: the structural similarity between the compared models is measured without requiring each residue of one protein to be structurally matched with the corresponding residue of the other protein. Since these algorithms do not exploit the chain order, they can detect non-sequential motifs in proteins, such as molecular surface motifs and especially binding sites. An important advantage of such algorithms is that they can be applied to other molecular structures (drugs, for example), not only to proteins.

Protein structure alignment algorithms can search for global similarity, when the purpose is to compare the molecules as a whole, or for local similarity. Global comparison algorithms are mainly used for protein structure classification and for identifying evolutionary links between distant homologues. For protein function prediction, local structural comparison methods are applied. Local structural comparison refers to detecting a similar 3D arrangement of a small set of residues, possibly in the context of completely different protein structures.
A more detailed comparison of alignment algorithms is made here according to the strategies they use: branch-and-bound, dynamic programming, geometric hashing, genetic algorithms, subgraph isomorphism, bipartite graph matching, etc. Some methods combine two strategies when the comparison is made at different levels. The proposed algorithms can also be compared and evaluated according to their complexity. The protein structure comparison problem is NP-complete, and the complexity results reported in this field are noted below.

Branch-and-bound is a widely used strategy for solving large-scale NP combinatorial optimization problems, and many comparison methods adopt it. The technique consists of a systematic enumeration of all candidate solutions using upper and lower estimated bounds on the quantity being optimized, so that large subsets of useless candidates are discarded together during the search. This is done by a recursive procedure that extends initial candidate solutions, or matching seeds. The extension stops when the algorithm determines that the current path cannot lead to solutions better than the current best one. In that case the recursion goes one step back, and the candidate is extended in another direction or another candidate is selected. The running time of these methods depends dramatically on how similar the compared proteins are: if the structures are very similar, there will be a large number of seed matches to explore.

One of the comparison methods that use this technique is DALI. The alignment algorithm compares protein $P_A$ and protein $P_B$ in two steps: 1) Their distance matrices are first decomposed into elementary contact patterns (hexapeptide submatrices). All elementary contact patterns in protein $P_A$ are compared pairwise with all elementary contact patterns in protein $P_B$. Similar contact patterns are stored in a non-exclusive list of pairs, which is the raw material for structural alignment. 2) The goal of the second step is to assemble pairs of contact patterns into a larger consistent set of pairs (a larger alignment), maximizing the similarity score. A Monte Carlo procedure is used to build up the full alignment.

MATRAS also uses a branch-and-bound algorithm for the initial alignment of SSEs. A residue-based alignment is then performed iteratively by dynamic programming, using the previous results and refining them.

Combinatorial Extension (CE), proposed by Shindyalov and Bourne, finds an optimal alignment between two protein structures by combinatorial extension of an alignment path defined by aligned fragment pairs. It is based on local similarity detection. The algorithm first applies rigid-body superposition to the fragment pairs. It then tries to extend this alignment using a greedy heuristic, followed by an optimization of the best alignment.

FATCAT also aligns fragment pairs and then uses dynamic programming to connect them, while taking the flexibility of the protein molecule into account. FATCAT aligns flexible protein molecules by allowing twists in the peptide backbone within the alignment algorithm. This permits an alignment of two domains that are structurally similar but have local structural differences that preclude a full alignment when each domain is treated as a rigid body.

FlexProt is a sequence-order dependent method for the alignment of two protein structures, one of which can be a flexible molecule. First, FlexProt detects congruent fragment pairs, one fragment from each protein, which can be superimposed with minimum RMSD. Matching atom pairs are extended along the protein backbone by one or more atom pairs, with the RMSD and the length of the matching fragments constrained by thresholds. Then FlexProt composes a directed acyclic graph in which vertices are fragment pairs and edges represent the order of the fragments (according to the amino acid sequence).
Weights are assigned to the edges to reward long matching fragments and to penalize large gaps. A single-source shortest paths algorithm is applied to this graph, and the resulting paths are compared with respect to total size and minimum RMSD. The last step of the algorithm clusters consecutive fragment pairs that have a similar 3D transformation. The first step of the algorithm takes $O(n^2)$, the second step takes $O(n^4)$, and the clustering step takes $O(n^2)$. Thus the overall complexity is bounded by $O(n^4)$, where $n$ is the number of Cα atoms in the larger protein.

Bostick and Vaisman compare 3D arrays to find similarity between protein molecules. They apply a transformation $T$ to the Delaunay simplices that model the compared proteins. $T$ maps each simplex edge length $d$ to an integer value. Equation (8) defines the transformation piecewise:

(8) $T$ assigns one integer value to each of the following ranges of $d$: $d = 0$; $d = 1$; $d = 2$; $d = 3$; $4 \le d \le 6$; $7 \le d \le 11$; $12 \le d \le 20$; $21 \le d \le 49$; $50 \le d \le 100$; $d \ge 101$.

The simplices are mapped into a 3D array $M$, where $M_{npr}$ is the number of simplices whose edges satisfy the following conditions: (1) the Euclidean length of each simplex edge is less than 10 Å; (2) $d_{ij} = n$; (3) $d_{jk} = p$; (4) $d_{ki} = r$. Two proteins, represented by the arrays $M$ and $M'$, are compared by evaluating the difference between their corresponding elements ($Q$ is the score value and the measure of similarity):

(9) $Q$ is computed from the differences $M_{npr} - M'_{npr}$, accumulated over the indices $n$, $p$ and $r$, each running from 1.

The geometric hashing technique [22] originates from computer vision [23], [24] and was first applied to the comparison of structural biology data by Fischer [25]. The coordinates of one structure are expressed relative to several local reference frames, which can be any triplets of points (Cα atoms) of the protein. Since the points used as a reference belong to the structure itself, this representation is invariant under both rotation and translation.
The positions of the other points in each frame are used as keys in a hash table. Once this representation has been calculated, two structures can be compared by a series of fast searches: the hash table is queried with structural features from the second molecule. Each hit in the table identifies a transformation between the two molecules. Transformations that receive many hits are likely to superimpose essential structural features of both molecules. The complexity of this algorithm is $O(n^3)$, where $n$ is the number of atoms in the protein to be compared. The method does not rely on the order of the protein Cα atoms and has the advantage of sequence-order independent algorithms: it can be applied to any molecule type, not only to proteins.

Some of the alignment techniques that examine the protein structure at the atomic level can be summarized in two distinct steps: 1) generation of all initial superpositions and 2) identification of the optimal alignment by RMSD.

MINRMS compares all fragments of four consecutive residues from one protein with all such fragments from the other to generate initial superpositions. A dynamic programming algorithm is then used to evaluate the similarity between the two protein structures at this step. MINRMS generates and fills a score pyramid, which is composed of matrices stacked on

top of one another (Figure 5). The lowest layer represents the score matrix for the alignment of a pair of residues. Each layer above is the score matrix for the alignment with a newly added pair of residues and is evaluated using the matrix from the layer below. The value of a cell in the pyramid is derived from one of three adjacent cells in the layer below: by row, by column or by diagonal. The optimal alignment can be reconstructed by backtracking from the maximum value of each scoring matrix. The complexity of the MINRMS algorithm is $O(m^3 n^2)$, where $m$ and $n$ are the numbers of Cα atoms in the two protein molecules to be compared.

Figure 5. Alignment with MINRMS

Akutsu proposes two algorithms, RAND and RAG, which consist of the two steps described above. RAND finds the initial superposition by random sampling, while RAG uses a fragment search method. The part common to RAND and RAG is the second step: the use of bipartite graph matching for protein structure alignment. After an initial superposition between $A$ and $B$ has been found, a bipartite graph $G(A, B, E)$ is built on the sequences $A$ and $B$, where $E$ is the set of edges between $A$ and $B$ (Figure 6). The edge $(a_i, b_j) \in A \times B$ is contained in $E$ if the distance between $a_i$ and $b_j$ is less than a threshold $\delta$ (in Å). The complexity of this method is $O(mns)$, where $m$ and $n$ are the sequence lengths and $s$ is the number of initial superpositions.

Genetic algorithms are a general-purpose global optimization technique that gives promising results across computational structural biology [26]. Genetic algorithms mimic the process of evolution. A generation within this process comprises a set of configurations that are coded as chromosomes. Chromosomes are manipulated by genetic operators such as crossover and mutation. The information content of the chromosomes varies depending on the application.
Typically, it comprises the intramolecular matches, or a coding of the orientational degrees of freedom together with a coding of the torsional degrees of freedom when molecular flexibility is considered. The fitness function that drives selection is typically an efficiently computable similarity function.

Figure 6. Bipartite graph matching, proposed by Akutsu

Szustakowski and Weng propose a genetic algorithm to determine the optimal alignment between protein structures. The genetic algorithm is applied after the initial superposition of all secondary structure elements of the compared proteins, which generates the initial populations of possible alignments between secondary structure elements. Each population is altered with the genetic operators mutate, hop, swap and crossover, which recombine randomly chosen alignments. The resulting alignments are accepted or rejected according to rules that preserve the validity of the alignment.

Carr and Hart, whose goal is to maximize the number of equivalent contacts between the contact maps of the compared proteins (this number defines the degree of similarity), also use a genetic algorithm to solve the maximal contact map overlap problem. The chromosome is a vector $c$ of dimension $n$ in which each position takes a value in the range $[-1, \ldots, m-1]$, where $m$ is the length of the longer protein and $n$ is the length of the shorter one. Position $j$ of $c$ specifies that the $j$-th residue of the shorter protein is aligned to the $c[j]$-th residue of the longer protein. The value $-1$ in a position specifies that the corresponding residue is not aligned to any residue of the other protein. An important aspect of the method is that infeasible configurations are not allowed. For this purpose the genetic operators are defined so as to preserve feasibility.

When the models of the compared proteins are graphs, two types of comparison algorithms can be applied: subgraph isomorphism detection or bipartite graph matching. In complexity theory, the maximum common subgraph isomorphism (MCS) problem is an optimization problem known to be NP-hard, so this approach is suitable for graph models with a small number of vertices; otherwise the comparison would be slow. This is why common subgraph isomorphism is searched for in models that represent protein structure at the level of secondary structure elements. After this initial alignment, some of the algorithms perform a refinement at the atomic level.
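The chromosome encoding used for the maximal contact map overlap problem can be made concrete with a small Python sketch. Several points are assumptions made for illustration only: the chromosome is indexed by the residues of the shorter protein and holds positions in the longer one, a chromosome is treated as feasible when its aligned positions are strictly increasing (an order-preserving, one-to-one alignment), contact maps are given as sets of index pairs (i, j) with i < j, and the exhaustive search at the end merely stands in for the genetic search on a toy instance. The function names and the data are hypothetical.

```python
from itertools import product

def is_feasible(c, m):
    """Check an assumed feasibility rule: aligned values are valid indices
    of the longer protein and strictly increasing along the chromosome."""
    aligned = [v for v in c if v != -1]
    in_range = all(0 <= v < m for v in aligned)
    increasing = all(a < b for a, b in zip(aligned, aligned[1:]))
    return in_range and increasing

def overlap(c, contacts_short, contacts_long):
    """Count contacts (i, j) of the shorter protein whose aligned images
    (c[i], c[j]) are also contacts of the longer protein."""
    return sum(
        1
        for i, j in contacts_short
        if c[i] != -1 and c[j] != -1 and (c[i], c[j]) in contacts_long
    )

# Hypothetical toy contact maps: n = 3 residues versus m = 4 residues.
contacts_short = {(0, 1), (1, 2), (0, 2)}
contacts_long = {(0, 1), (1, 3), (0, 3), (2, 3)}
n, m = 3, 4

# Exhaustive search over all feasible chromosomes; a genetic algorithm
# would explore the same space heuristically for realistic sizes.
best = max(
    (c for c in product(range(-1, m), repeat=n) if is_feasible(c, m)),
    key=lambda c: overlap(c, contacts_short, contacts_long),
)
best_overlap = overlap(best, contacts_short, contacts_long)
```

In this sketch the overlap plays the role of the fitness function, and `is_feasible` is the kind of test that feasibility-preserving genetic operators would keep satisfied by construction.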
VAST is one of the techniques that use subgraph isomorphism to detect similarity between protein molecules. The model of the proteins is a graph that represents the correspondence between pairs of secondary structure elements with the same type, relative orientation and connectivity. This graph is searched for cliques, and the detected cliques are the starting point for the alignment. The initial alignment is extended to a residue-level alignment with a Gibbs sampling technique. Since the maximum clique problem is known to be NP-hard, the search is feasible only for small graphs with fewer than about 30 vertices.

SSM [21] is another tool; it includes an original procedure for matching graphs built on the protein's secondary structure elements, followed by an iterative three-dimensional alignment of the protein backbone Cα atoms. The matching uses a backtracking algorithm for common subgraph isomorphism described by the authors [27], CSIA, which is an advancement of Ullman's widely known algorithm [28] for exact subgraph isomorphism. The time complexity of CSIA is bounded by $O(m^{n+1} n)$, which makes it applicable to graphs with $n$ and $m$ up to about 70 unlabelled vertices.

One of the main difficulties in aligning structures whose sequences are not similar is to determine the correspondence between equivalent residues. Typically the process either iterates between residue assignment and a minimization step, or uses a stochastic optimization procedure to find the maximal subset of equivalent residues within some constraints. Some methods first use a fast but less accurate filter to define the initial alignment and then a slower, more accurate residue-based alignment, which is performed only for the subsets that pass the first step. One of the most intuitive methods for initial alignment is first to detect and align secondary structure elements and then to refine this alignment by finding the residue equivalences. This method is used by VAST and SSM.
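The clique-based search on a correspondence graph of secondary structure elements can be sketched as follows. This is not the VAST implementation: it is a simplified illustration that assumes each protein is given as a list of SSE types plus a matrix of distances between SSE centres, uses only a distance tolerance for compatibility (the angle and connectivity criteria are omitted), and finds the largest clique with the basic Bron-Kerbosch recursion. All names, the tolerance and the toy data are hypothetical.

```python
from itertools import combinations

def correspondence_graph(types_a, dist_a, types_b, dist_b, tol=2.0):
    """Vertices are pairs (i, j) of same-type SSEs, one from each protein.
    Two vertices are adjacent when they use different SSEs in both proteins
    and the inter-SSE distances agree within `tol` (assumed, in Angstroms)."""
    nodes = [(i, j) for i, ta in enumerate(types_a)
             for j, tb in enumerate(types_b) if ta == tb]
    adj = {v: set() for v in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i != k and j != l and abs(dist_a[i][k] - dist_b[j][l]) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj

def max_clique(adj):
    """Largest clique found by the basic Bron-Kerbosch recursion
    (exponential in the worst case, so only suitable for small graphs)."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        for v in list(p):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return sorted(best)

# Hypothetical toy proteins: helix (H), strand (E), helix (H) in both.
types_a = ["H", "E", "H"]
types_b = ["H", "E", "H"]
dist_a = [[0, 10, 15], [10, 0, 8], [15, 8, 0]]
dist_b = [[0, 11, 14], [11, 0, 9], [14, 9, 0]]

graph = correspondence_graph(types_a, dist_a, types_b, dist_b)
sse_alignment = max_clique(graph)
```

Each vertex of the resulting clique is one SSE equivalence, so the clique as a whole is an initial SSE-level alignment of the kind that a residue-level refinement step would start from.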
In contrast, Taylor uses a representation at the secondary structure level without any refinement at the atomic level, and applies bipartite graph matching (BGMT) instead of a heavy search for the full subgraph isomorphism in order to construct a fast filter for protein structure comparison. The interactions between secondary structure elements are represented by a graph, and a bipartite graph-matching algorithm ("stable marriage") is used to search for a correspondence between the two sets of interactions. This method takes 1/10 of a second for a typical comparison between two protein structures, which makes it suitable as a fast filter for slower and more complex algorithms for protein structure comparison.

Wang, Makedon and Ford also use bipartite graph matching in their framework for finding correspondences between structural elements (atoms, residues or secondary structure elements) in two proteins. They define a maximum weight matching as a matching in which the sum of the weights of the edges is maximized. A maximum weight maximum cardinality matching is a matching that has the maximum number of edges and, among all such matchings, the greatest weight. Figure 3 shows an example of the maximum weight bipartite matching.

In the context of correspondences between protein structural elements, a maximum weight matching returns the correspondence with the maximum weight, but there is no guarantee of maximum cardinality. Therefore, some elements in the smaller protein may not be matched to any element in the other protein. In other words, a maximum weight matching favors good local matches. A maximum weight maximum cardinality matching, on the other hand, always returns a matching with maximum cardinality, even if some edges in the matching have relatively small weights. It guarantees that every element in the smaller protein is matched to an element in the other protein. In other words, a maximum weight maximum cardinality matching favors good global matches.
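The difference between the two kinds of matching can be seen on a very small graph. The sketch below is a brute-force enumeration written only for illustration. It is exponential and is not the algorithm used by Wang, Makedon and Ford, and the element names and weights are invented. A missing key in `edges` means that the two elements cannot be matched at all.

```python
def matchings(left, edges):
    """Yield every matching as a list of (u, v, weight) triples.
    `edges` maps (u, v) to a weight; absent pairs are not matchable."""
    if not left:
        yield []
        return
    u, rest = left[0], left[1:]
    for partial in matchings(rest, edges):
        yield partial                              # leave u unmatched
        used = {v for _, v, _ in partial}
        for (a, v), w in edges.items():
            if a == u and v not in used:
                yield [(u, v, w)] + partial        # match u with v

def weight(matching):
    return sum(w for _, _, w in matching)

def max_weight_matching(left, edges):
    return max(matchings(left, edges), key=weight)

def max_weight_max_cardinality_matching(left, edges):
    # Cardinality is compared first, weight only breaks ties.
    return max(matchings(left, edges), key=lambda m: (len(m), weight(m)))

# Hypothetical elements a1, a2 of one protein and b1, b2 of the other.
left = ["a1", "a2"]
edges = {("a1", "b1"): 10, ("a1", "b2"): 1, ("a2", "b1"): 1}

local_best = max_weight_matching(left, edges)
global_best = max_weight_max_cardinality_matching(left, edges)
print(local_best)   # one strong pair, a2 stays unmatched
print(global_best)  # both elements matched, lower total weight
```

In this example the maximum weight matching keeps only the single strong pair (total weight 10), while the maximum weight maximum cardinality matching pairs both elements at a total weight of 2, which is the local versus global trade-off described above.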
The best-known strongly polynomial algorithm for weighted bipartite matching is the classical Hungarian method due to Kuhn [29], which runs in time $O(|V|(|E| + |V| \log |V|))$. Weighted bipartite matching algorithms can be implemented efficiently and can be applied to graphs of reasonably large size (about 100,000 vertices) [30].

Once the comparison algorithms have produced their results, a similarity measure is needed to evaluate them and to support conclusions about the presence or absence of similarity between the compared objects.

C. Similarity Measure

It is important for a similarity measure to be sensitive: it should rank more similar structures higher than less similar ones when the measure is positive, and lower when the measure is negative. Another important fact is that structural alignment lacks a theory that defines and describes the distribution of structural similarity scores.

The most commonly used metric is the root-mean-square deviation (RMSD), in which the root-mean-square distance between corresponding residues is calculated after an optimal rotation of one structure onto the other. This metric has a lower value if the structures are similar and a higher value otherwise. RMSD is defined as follows:

(10) $\mathrm{RMSD}(A, B) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x(i) - y(i) \right)^2}$

In eq. (10), N is the number of aligned atoms and x and y are their coordinates. Since the RMSD weights the distances between all residue pairs equally, a small number of local structural deviations can result in a high RMSD even when the global topologies of the compared structures are similar. Furthermore, the average RMSD of randomly related proteins depends on the length of the compared structures, which makes the absolute magnitude of RMSD meaningless [31]. Carugo shows [32] that the RMSD of an alignment is linearly related to the resolution of the compared domains. Alignment of two domains with low or significantly different resolution results in a higher RMSD than alignment of two domains with high resolution.

For the RMSD to be meaningful, the number of aligned residues has to be taken into account and the metric should be normalized. For their relative RMSD, Betancourt and Skolnick [31] normalize the RMSD by the average RMSD of random structure pairs of similar size:

(11) $\mathrm{RRMSD}(A, B) = \mathrm{RMSD}(A, B) / D(A, B)$

RMSD(A, B) is the RMSD between proteins A and B, and D(A, B) is an estimate of the average RMSD between two random protein fragments of the same length when the proteins are aligned. For the definition of the RMSD100 score [33], Carugo and Pongor divide the RMSD by a factor of $1 + \ln\sqrt{N/100}$, with N representing the protein length.
(12) $\mathrm{RMSD}_{100}(A, B) = \dfrac{\mathrm{RMSD}(A, B)}{1 + \ln\sqrt{N/100}}$

Zemla [34,35] proposes two metrics that are used to evaluate protein structure prediction and serve as major assessment criteria for the results of the CASP experiments: LCS, the longest common segment, and GDT, the global distance test. The LCS is the longest continuous segment that can be aligned with an RMSD between Cα atoms below a specified value of 1, 2 or 5 Å. The GDT score is calculated from the largest set of residues whose Cα atoms in the model structure fall within a defined distance cutoff of their position in the experimental structure. It is typical to calculate the GDT score under several cutoff distances (1, 2, 4 and 8 Å are used in the CASP5 experiment), and scores generally increase with increasing cutoff. GDT aims to identify accurately similar substructures, which need not be continuous. It attempts to find the maximum number of residues in one protein that can be superimposed on the other protein within a given threshold. This measure can be applied to find the largest similar subsets.

Levitt and Gerstein [36] propose a different structure similarity measure, which takes into account the number of gaps when the structures are aligned:

(13) $S_{str} = M \left( \sum \frac{1}{1 + (d_{ij}/d_0)^2} - \frac{N_{gap}}{2} \right)$

In eq. (13), $d_{ij}$ is the distance between aligned Cα atoms, $N_{gap}$ is the number of gaps, M = 20 and $d_0$ = 20 Å. This measure and GDT are the bases for other protein structure similarity measures.

With MaxSub [37], Siew, Elofsson, Rychlewski and Fischer try to identify the maximum substructure in which the distances between equivalent residues of two structures after superposition are below some threshold value, such as 3.5 Å. MaxSub counts only the residues in the substructure, and the spatial information of the templates outside this substructure is omitted. This measure is based on principles similar to those of GDT.
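The distance-based measures introduced so far (eq. (10), eq. (12) and the cutoff counting behind GDT) can be sketched in a few lines. The sketch rests on several assumptions that are not part of the original text: the two coordinate lists are already superposed and in one-to-one correspondence, RMSD100 is taken in the Carugo-Pongor form RMSD / (1 + ln √(N/100)), and the four cutoff fractions are simply averaged as in the common GDT_TS convention. A real GDT calculation searches over many superpositions for each cutoff, which is not attempted here.

```python
import math

def rmsd(x, y):
    """Eq. (10) for two equal-length lists of superposed 3D points."""
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(x, y))
    return math.sqrt(squared / len(x))

def rmsd100(value, n):
    """Eq. (12): length-normalized RMSD.
    The denominator is not positive for very short chains (n below ~14)."""
    return value / (1.0 + math.log(math.sqrt(n / 100.0)))

def fraction_within(x, y, cutoff):
    """Share of aligned positions closer than `cutoff` angstroms."""
    return sum(math.dist(p, q) <= cutoff for p, q in zip(x, y)) / len(x)

def gdt_like(x, y, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average of the cutoff fractions for ONE fixed superposition."""
    return sum(fraction_within(x, y, c) for c in cutoffs) / len(cutoffs)

# Toy coordinates (hypothetical): deviations of 0.5, 1.5, 3.0 and 9.0 angstroms.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(0.5, 0.0, 0.0), (2.5, 0.0, 0.0), (5.0, 0.0, 0.0), (12.0, 0.0, 0.0)]

print(rmsd(model, reference))      # dominated by the single 9 angstrom outlier
print(gdt_like(model, reference))  # 0.5625: three of four residues fit by 4 angstroms
print(rmsd100(3.0, 100))           # unchanged at the reference length of 100
```

The toy data show the behaviour discussed in the text: one badly placed residue pushes the RMSD to almost 5 Å, while the cutoff-based score still reports that most positions agree.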
MaxSub computes a single scalar in the range of 0 to 1, which measures the similarity between the compared structures. This scalar is a normalization of the size of the most similar subset and is computed using a variation of the formula suggested by Levitt and Gerstein [36].

While RMSD considers intermolecular distances, Holm and Sander propose the DALI similarity score [12], which is based on intramolecular rather than intermolecular distances:

(14) $S(A, B) = \sum_{i} \sum_{j} \left( 0.2 - \frac{|d^{A}_{ij} - d^{B}_{ij}|}{d^{avg}_{ij}} \right) e^{-(d^{avg}_{ij}/20)^2}$

Here $d^{A}_{ij}$ and $d^{B}_{ij}$ are the distances between aligned residues i and j within proteins A and B, and $d^{avg}_{ij}$ is their average.

Kawabata and Nishikawa [17] propose a theory for evaluating protein structure similarity by a log-odds score, which is based on the Markov transition model of evolution. Their similarity score between structures i and j is defined as:

(15) $S(i, j) = \log \frac{P(i \to j)}{P(j)}$

$P(i \to j)$ is the probability that structure i changes to structure j during the evolutionary process, and P(j) is the probability that structure j appears by chance. This is a reasonable definition of structure similarity, especially for finding evolutionarily related (homologous) similarity. The probability $P(i \to j)$ is estimated by the Markov transition model, which is similar to Dayhoff's substitution model for amino acids. To evaluate pairwise similarities between proteins A and B, Kawabata and Nishikawa define the following equation:

(16) $R(A, B) = \frac{S(A, B) - S_{min}}{S_{max} - S_{min}} \times 100$

S(A, B) is the log-odds score between proteins A and B, and $S_{min}$ and $S_{max}$ are the minimum and maximum scores. This score is similar to the one proposed by Feng and Doolittle [38].

The template modeling score, or TM-score [39], also uses a variation of the Levitt-Gerstein (LG) weight factor, which weights the residue pairs at smaller distances relatively more strongly than those at larger distances. For that reason the TM-score is more sensitive to the global topology than to local structural variations. Its value is normalized in such a way that the score magnitude relative to random structures does not depend on the protein's size, with a value of 0.17 for an average pair of randomly related structures. For aligning a template model to a native structure, the TM-score is defined as follows:

(17) $\text{TM-score} = \max \left[ \frac{1}{L_N} \sum_{i=1}^{L_T} \frac{1}{1 + (d_i/d_0)^2} \right]$

$L_N$ is the length of the native structure, $L_T$ is the number of residues aligned to the template structure, $d_i$ is the distance between the i-th pair of aligned residues and $d_0$ is a scale that normalizes the match difference.

Yang and Honig propose the protein structural distance (PSD) [40], which combines a logarithmic term in the alignment scores of secondary structure elements (SSEs) with the RMSD. In its definition, a is the number of SSEs of protein A, b is the number of SSEs of protein B, s(A, A) is the self-alignment score of protein A, s(A, B) is the score for the alignment of the SSEs of both proteins, and x and y are adjustable parameters. The PSD is designed to describe relationships between protein structures in quantitative rather than descriptive terms and is applicable both when two structures are very similar and when they are very different. It is calculated with a structural alignment procedure that uses double dynamic programming to align secondary structure elements and an iterative rigid body superposition that minimizes the root-mean-square deviation of Cα atoms.

Krasnogor and Pelta showed [41] how the Universal Similarity Metric (USM), introduced by Li [42], can be used to calculate similarities between protein pairs. The USM approximates every possible similarity metric (i.e. those that exist today and those that are yet to be defined) and is based on the concept of Kolmogorov complexity. The Kolmogorov complexity K(o) of an object o is defined as the length of the shortest program for a Universal Turing Machine U that is needed to output o.
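Returning briefly to the TM-score of eq. (17): the sum inside the maximization is easy to evaluate once the distances between aligned residue pairs are known. The sketch below assumes that these distances have already been computed for one or more candidate superpositions and that the scale $d_0$ is supplied by the caller. The search over superpositions that the original method performs is replaced here by a plain maximum over the supplied candidates, and all names and numbers are illustrative.

```python
def tm_sum(distances, native_length, d0):
    """Bracketed term of eq. (17) for one fixed superposition.
    `distances` holds d_i for the aligned residue pairs only."""
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / native_length

def tm_score(candidates, native_length, d0):
    """Maximum of the bracketed term over the candidate superpositions."""
    return max(tm_sum(ds, native_length, d0) for ds in candidates)

# Hypothetical example: native structure of 5 residues, 4 of them aligned.
d0 = 2.0
superposition_1 = [0.0, 0.0, 2.0, 2.0]    # two exact matches, two at d0
superposition_2 = [1.0, 1.0, 1.0, 20.0]   # three close pairs, one far outlier

score = tm_score([superposition_1, superposition_2], native_length=5, d0=d0)
print(score)
```

A pair at distance $d_0$ contributes exactly one half, and the outlier at ten times $d_0$ contributes about 0.01, which illustrates why distant pairs barely affect the score. Dividing by the native length rather than by the number of aligned pairs penalizes alignments that cover only part of the structure.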
The Kolmogorov complexity is an objective measure of the amount of information contained in a given object. A related measure is the conditional Kolmogorov complexity of $o_1$ given $o_2$, which measures how much information is needed to produce object $o_1$ if object $o_2$ is known:

(18) $K(o_1 | o_2) = \min\{ |P| : P \text{ is a program}, U(P, o_2) = o_1 \}$

The information distance between two objects is equivalent to:

(19) $ID(o_1, o_2) = \max\{ K(o_1 | o_2), K(o_2 | o_1) \}$

The Universal Similarity Metric is a proper metric; it is universal and also normalized. The metric is formally defined as:

(20) $d(o_1, o_2) = \frac{\max\{ K(o_1 | o_2^*), K(o_2 | o_1^*) \}}{\max\{ K(o_1), K(o_2) \}}$

where $o_1^*$ (or $o_2^*$) indicates a shortest program for $o_1$ (or $o_2$).

The magnitudes of some of the metrics discussed above depend on the size of the evaluated proteins [39], which makes the absolute magnitude of these scoring functions meaningless. To eliminate the dependence on protein size, some authors (Levitt and Gerstein; Ortiz, Strauss and Olmea [43]) convert their structure alignment score into a statistical significance score, called the P-value, on the basis of the statistics of their random structure database. VAST also uses a P-value, which is based on the likelihood of aligning a given number of secondary structure elements of a certain length. A Z-score is defined and used by Shindyalov and Bourne for their CE and by Holm and Sander for DALI. The Z-score of CE is based on a Gaussian distribution of the similarity score between aligned fragments of proteins.

The information about the protein structure comparison methods discussed in this paper is summarized in Table 1. The properties of the three major components of each method (model of the protein structure, comparison algorithm and similarity measure) are considered. Given a suitable model, an appropriate algorithm and a well-chosen similarity score, most protein structure comparison methods will detect similar proteins and will assign a positive score to their alignment.
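Because Kolmogorov complexity is not computable, the USM of eq. (20) can only be approximated in practice. A common approximation, assumed here and not taken from the text above, replaces K with the length of a compressed encoding, which yields the normalized compression distance. The sketch uses `zlib` as the compressor and short secondary structure strings as a purely hypothetical encoding of the proteins. Both choices are illustrative, and any byte serialization of a structure could be substituted.

```python
import zlib

def compressed_length(data: bytes) -> int:
    """Stand-in for K(o): size of the zlib-compressed encoding."""
    return len(zlib.compress(data, 9))

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance, a computable proxy for eq. (20)."""
    ca, cb = compressed_length(a), compressed_length(b)
    return (compressed_length(a + b) - min(ca, cb)) / max(ca, cb)

# Hypothetical encodings: a repetitive helix/strand/coil pattern
# and an unrelated byte sequence of similar length.
protein_a = b"HHHHEEEECCCC" * 50
protein_b = bytes(range(256)) * 3

same = ncd(protein_a, protein_a)
different = ncd(protein_a, protein_b)
print(same, different)  # the self-distance is the smaller of the two
```

With a real compressor the distance of an object to itself is small but not exactly zero, and values slightly above 1 can occur, so results of this kind are best read as a ranking rather than as exact metric values.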
The situation is different when the comparison is made between less similar protein structures, the so-called challenging sets [44]. In such cases the alignment algorithms sometimes produce different results regarding the degree of detected similarity. Some of the works dedicated to this problem are [44,45,46].

Novotny [45] presents a comprehensive and critical comparative analysis and evaluation of 11 publicly available, Web-based servers for automatic fold comparison. The conclusion is that the tested algorithms differ in their performance, i.e. in how well established structural similarities are recognized.

Sierk and Pearson also examine the sensitivity and selectivity of protein structure comparison methods in [46]. Seven protein structure comparison methods and two sequence comparison programs are evaluated on their ability to detect either protein homologs or domains with the same topology (fold) as defined by the CATH structure database. The programs show distinct differences in their misclassifications according to structural class. After analysis of the results, the authors conclude that, with some exceptions, the relative performance of the methods tested is the same regardless of the error model, and that these results accurately reflect the general characteristics of the methods.

Mayr, Domingues and Lackner [44] also present a comparative analysis of protein structure alignment methods. They analyze and compare several methods with regard to their performance in the identification of structurally or evolutionarily related proteins. Three sets of pairs of structurally related proteins are used: remote homologous proteins according to the SCOP database (ASTRAL40 set); the SISY set, derived from the SISYPHUS database, which includes 69 protein pairs; and 40 pairs that are challenging to align (RIPC set). Two methods are applied to align the proteins in the ASTRAL40 set, and the resulting alignments agree on average in more than half of the aligned positions. Six methods are compared using the SISY and RIPC sets. The alignments generated by the different methods on average match more than half of the reference alignments in the SISY set. The alignments obtained for the more challenging RIPC set tend to differ considerably and match the reference alignments less successfully than the SISY set alignments. The authors conclude that the alignments produced by different methods tend to agree to a considerable extent, but the agreement is lower for the more challenging pairs. The results for the comparison to reference alignments are encouraging, but also indicate that there is still room for improvement.

Table 1. Protein structure comparison methods

| Name | Properties of the model | Properties of the comparison algorithm | Similarity measure |
|---|---|---|---|
| RAND and RAG [6] | Proteins are presented with sequences of their Cα atoms as points in 3D space. | Generation of all initial superpositions and refinement with bipartite graph matching technique. | RMSD |
| MINRMS [7] | Proteins are presented with sequences of their Cα atoms as points in 3D space. | Generation of all initial superpositions, dynamic programming to evaluate the similarity and refinement with min RMSD. | RMSD |
| CE [8] | Proteins are presented with fragments of amino acids with fixed size. | Combinatorial extension of the optimal path with greedy algorithm. | Z-score |
| FlexProt [9] | Proteins are presented with fragments of amino acids with fixed size. | Superposition with min RMSD, single-source shortest path and clustering. It takes flexibility into account. | RMSD |
| FATCAT [10] | Proteins are presented with fragments of amino acids with fixed size. | Dynamic programming. It takes flexibility into account. | RMSD |
| Delaunay tessellation [11] | Topology is presented with Delaunay tessellation, a composition of simplexes. Cα atoms that are nearest neighbors are grouped and define a simplex. | Comparison of corresponding elements in 3D arrays. | Q-score |
| DALI [12] | Proteins are presented at the atomic level with distance matrices. | Branch-and-bound strategy. | Z-score |
| Szustakowski and Weng's method [13] | Proteins are presented at the atomic level with distance matrices. | Genetic algorithm. | Elastic similarity score |
| Carr and Hart's method [15] | Proteins are presented at the atomic level with contact maps. | Genetic algorithm. | Maximum contact map overlap |
| MATRAS [16] | Different levels of representation are used: secondary structure elements for the initial model, followed by a model at the amino acid level. Markov transition model of evolution is applied. | Branch-and-bound strategy for secondary structure element comparison and dynamic programming for residue-based alignment. | Log-odds score |
| VAST [18] | Different levels of representation are used: secondary structure elements for the initial model, followed by a model at the amino acid level. Graph model is preferred here. | Subgraph isomorphism algorithm by clique detection. | P-value |
| Taylor [19] | Proteins are presented with a bipartite graph whose vertices represent secondary structure elements. | Bipartite graph matching technique. | Score that includes type, distance, angle and packing |
| MWBM [20] | The level is chosen among the following: atoms, residues or secondary structure elements. | Bipartite graph matching technique. | Weight of the match |
| SSM [21] | Proteins are presented with a graph model whose vertices are secondary structure elements. | Common subgraph isomorphism detection. | P-value |
| Geometric Hashing [25] | Hash table with positions of all points (Cα atoms), situated according to different reference frames. | Geometric hashing. | RMSD |

Conclusion

The comparison of protein structures is a task with many applications: it is a part of protein classification methods and of protein structure and protein function prediction. Different approaches are used when models are constructed, and suitable algorithms are chosen according to the particular application. The detailed models are precise and bring a lot of structural information, but sometimes it exceeds the necessary one
