
CS5238 Combinatorial methods in bioinformatics, 2004/2005 Semester 1
Lecture 8: Finding structural similarities among proteins (I)
Lecturer: Prof Jean-Claude Latombe
Scribe: Chen Ding, Huang Yang, Meghna Agrawal

"Could the search for ultimate truth really have revealed so hideous and visceral-looking an object?"
Max Perutz, Nobel Prize Winner (1962)

Part I: Introduction to Protein Structure Similarity

Living systems have evolved a duality of information storage and function. Heritable information is stored encoded in DNA sequences, but function is obtained only after the encoded information is translated into a polypeptide sequence (a copolymer synthesized from the 20 different proteinogenic amino acids) and this polypeptide spontaneously folds into a well-defined three-dimensional structure. In general, proteins obtain their biological function only after this folding process. It is the spatial arrangement, the precise and reproducible patterning of hydrophobic and hydrophilic components, of positive and negative charges, and of hydrogen-bond donors and acceptors, that gives proteins their unique properties. The prediction of this structure from sequence data alone remains one of the grand challenges of computational biology.

1. Levels of Protein Structures

There are mainly three levels of protein structure [1].

1.1 Primary

The sequence of amino acids, connected by peptide bonds, that constitutes a protein is referred to as its primary structure. Even a slight change in the sequence can change the overall structure and function of the protein.

Note: If there are cysteines in the amino-acid sequence, they often react two by two to form disulphide bridges. Disulphide bridges are part of the primary structure.

Figure 1. Primary Structure with a Disulphide Bridge

[1] Some biologists may consider up to 5 levels.


1.2 Secondary

The secondary structure of a protein is the spatial arrangement of the atoms constituting the main protein backbone. There are two main structures in this category:

- Alpha-helix: a spiral arrangement of the protein backbone in the form of a helix, held together by hydrogen bonds between backbone groups (the amide hydrogen of one residue and the carbonyl oxygen of a residue further along the chain). It has a rod shape; in other words, the peptide chain is coiled around an imaginary cylinder.
- Beta-sheets: consist of parallel or anti-parallel strands of amino acids linked to adjacent strands. The amide hydrogen of one strand is hydrogen bonded to the amide oxygen of the neighboring strand. The pleated-sheet effect arises from the fact that the amide structure is planar while the "bends" occur at the carbon carrying the side chain. E.g.: Collagen

Figure 2. Secondary Structure with alpha-helix and beta-sheet

Often there are parts of the structure that fall in neither of these categories.

1.3 Tertiary

The tertiary structure of a protein, also called its folded or native state, is the natural folding of the entire protein chain into a very compact structure. It is the combination of elements of secondary structure linked by loops and turns as well as random coils. The tertiary structure of a protein gives it most of its functions.

Figure 3. Tertiary Structure

The following factors determine this structure:

- Hydrogen bonds: essential in stabilizing the basic secondary structures
- Hydrophobic effects: strongest determinants of protein structures


- Van der Waals forces: stabilizing the hydrophobic cores
- Electrostatic forces: oppositely charged side chains form salt bridges

A protein sequence almost always folds into the same tertiary structure in the same environment.

Quaternary structure: The quaternary structure is the arrangement of polypeptide subunits within complex proteins made up of two or more subunits.

2. Protein Structure Determination Techniques

2.1 X-ray Diffraction Crystallography

In this method the protein is crystallized and an X-ray beam is projected onto the crystals. The beam interacts with the electron cloud of the crystal to produce diffracted X-ray beams. The diffraction pattern is recorded on a phosphor screen, and an electron density map is generated from it; this map is then used to build the 3D structure of the protein. The map tends to be fuzzy in some parts (due to the problem of phasing loops), but the software used can usually predict up to 90% of the structure correctly, and the rest is worked out manually.

This method is expensive and takes time, sometimes longer than a year. It is well suited to relatively large proteins, but the proteins have to be folded. It also requires the protein in the form of a crystal, and not every protein can be crystallized.

Figure 4. X-ray Diffraction Crystallography

2.2 Nuclear Magnetic Resonance Spectroscopy

NMR spectroscopy allows structure determination in solution under conditions that can approximate the physiological environment of a protein. It is based on the phenomenon that the energy levels of atomic nuclei are split by a magnetic field. Transitions between these levels can be induced by exciting the sample with radiation of


an appropriate frequency. Such transitions can get coupled. This technique has low sensitivity and the data obtained are noisy. It is used for smaller proteins.

Figure 5. Experimental Sources for Depositions in PDB (pie chart with slices labelled X-Ray Diffraction, NMR, Theoretical Model, and Electron and Neutron Diffraction; NMR accounts for 14%, and the other slice values shown are 82%, 3% and 1%)

3. Protein Data Bank

The Protein Data Bank, a freely accessible database, was established at Brookhaven National Laboratory in 1971 as an archive for biological macromolecular crystal structures, starting with 7 structures. It is used in comparative modeling, which takes previously solved structures as starting points or templates. For example, the amino acid sequence of an unknown structure can be scanned against a database of solved structures, with a scoring function that assesses the compatibility of the sequence with each structure, thus yielding possible three-dimensional models.

Figure 6. Structures deposited in PDB (bar chart of new structures per year; 2000: more than 20,000 structures in total; 2004: about 30,000 structures in total)

Protein structure determination still lags behind protein sequencing by a factor of about 50. The Protein Structure Initiative (PSI) is a federal, university, and industry effort aimed at dramatically reducing the cost and the time it takes to determine a three-dimensional protein structure.


Figure 7. Protein Structure Initiative

4. Structure Similarity and Its Importance

Structure similarity refers to how well, or how poorly, the 3D structures of molecules can be aligned. A high structural similarity usually implies a high functional similarity. For example, two proteins with similar structures would interact similarly with other molecules.

Figure 8. Proteins in the TIM barrel fold family


Figure 9. Alignment of 1xis and 1nar (TIM-Barrels), computed by DALI (α helix axes shown)

By the year 2000, some 20,000 structures had been deposited in PDB, with approximately 4000 different folds [2]. The number of folds is only 1/5 of the number of structures, which suggests that many proteins essentially share the same folds. In fact, 90% of the structures submitted to PDB in the last 3 years have folds similar to known ones. Thus, given a new structure, it is highly probable that it is similar to a known structure.

Figure 10. Different Tertiary Structures

[2] Two proteins have a common fold if they have comparable elements of secondary structure with the same topology of connections.

There could be several reasons for this. First, in terms of evolution, a populated non-native structure is not available for its function and thus constitutes a selective disadvantage. Mutation and selection must in general have arrived at sequences for which a single structure, and none other, is "frozen out" of the ensemble of all unfolded conformations upon folding. Secondly, certain physical constraints might prevent the protein from taking certain shapes. Physicochemically, there are only a few ways to maximize a polypeptide's hydrophobic interactions, and the small amount of free energy that finally stabilizes a native structure requires that almost all available stabilizing interactions are indeed formed. Another reason could be the limitations of the current techniques used to determine structures. For example, X-ray diffraction requires proteins to be crystallized but not all proteins can be crystallized, NMR has low sensitivity and is useful only for smaller proteins, etc.

Generally, the primary structure (the amino acid sequence) determines the tertiary structure of the protein, which in turn determines its biological functions. This means that highly similar sequences would correspond to high functional similarity as well. However, there are exceptions to this rule. Sometimes low sequence similarity yields very similar structures, whereas high sequence similarity may yield different structures. For example, 1xis and 1nar (see Figure 9) have only 7% sequence similarity, but approximately 70% of the residues are structurally similar. Therefore, structure comparison is expected to provide more pertinent information about functional (dis)similarity among proteins, especially for proteins with non-evolutionary relationships or non-detectable evolutionary relationships.

The problem of structure similarity is ill-posed because biologists have different goals and expectations, and often the end results are unknown. Many different terms are used in the literature to describe this problem, such as (dis-)similarity analysis, alignment of molecules, superposition, matching, classification and many others.

5. Mathematical relative: Largest Common Point Set Problem

Certain mathematical tools can be used to assist in structure comparison. For example, if there are two curves, f and g, we can compute how similar they are using the average distance between them:

$$ s = \| f - g \|^2 $$

Figure 11. Mathematical functions to measure Similarity

However, applying the above function to proteins would not be sufficient. Instead, a very similar computing problem, the largest common point set problem, is mapped to the

problem of structure comparison. The largest common point set problem is defined as follows.

Given:
- Two point sets A and B
- A distance measure d
- A space of transforms $\mathcal{T}$
- A threshold ε

Determine $P \subseteq A$ and $Q \subseteq B$ such that

$$ \min_{T \in \mathcal{T}} d(P, T(Q)) \le \varepsilon $$

and $|P|$ is maximal.

The problem can easily be mapped onto proteins. The 3D molecular structure is defined as a collection of atoms, groups of atoms, or features, in a given 3D relative placement. The placement of a group of atoms is defined by the position of a reference point (e.g., the centre of an atom) and the orientation of a reference direction (e.g., the direction of a bond). Types can be assigned to the atoms or groups of atoms, where the type can be the identity of an atom, an amino acid, etc. Types might be important during matching, as one point may not be allowed to match another point unless they have the same or compatible types. In fact, types do not add to the complexity of the problem; they reduce the number of possible combinations.

Given that two molecular structures are defined as above, the problem can be stated as follows. Two structures A and B match iff:
1. There is a one-to-one correspondence between their elements
2. There exists a rigid body transform T such that the RMSD between the elements in A and those in T(B) is less than some threshold ε

6. Alignments

6.1 Complete Alignment

The above definition corresponds to complete alignment, where the whole of structure A matches the whole of structure B. This is illustrated by Figure 12.

Figure 12. Complete Alignment

There will be some error in the match, as the structures are not identical. Complete alignments are rarely possible because the molecules compared are of different sizes, so there cannot be a one-to-one correspondence. Also, their shapes might match only locally. However, even when there is no satisfactory global alignment, good local alignments may be of interest, as some functions might still be deduced from them.
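The match test for a complete alignment can be sketched in a few lines. This is only an illustration under stated assumptions: the structures are given as `(n, 3)` NumPy arrays whose rows are already in one-to-one correspondence, the rigid transform is supplied by the caller as a rotation matrix `R` and a translation `t`, and the function names `rmsd`, `apply_rigid` and `structures_match` are hypothetical, not taken from the lecture.

```python
import numpy as np

def rmsd(A, B):
    """Root-mean-square deviation between two (n, 3) arrays of paired points."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if A.shape != B.shape:
        # A complete alignment needs a one-to-one correspondence.
        raise ValueError("A and B must contain the same number of points")
    return float(np.sqrt(np.mean(np.sum((A - B) ** 2, axis=1))))

def apply_rigid(R, t, B):
    """Apply the rigid transform T(b) = t + R b to every row of B."""
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)
    return np.asarray(B, dtype=float) @ R.T + t

def structures_match(A, B, R, t, eps):
    """True if the RMSD between A and T(B) is below the threshold eps."""
    return rmsd(A, apply_rigid(R, t, B)) < eps
```

Note that this only verifies a candidate transform; finding the transform (and the correspondence) is the hard part of the problem.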


6.2 Partial Alignment

Partial alignment refers to a part of structure A matching a part of structure B. These matching portions are referred to as supports, denoted σ(A) and σ(B); the match is between σ(A) and σ(B).

Figure 13. Partial Alignment

The notion of support gives rise to the dual problem of finding the support as well as the transformation. Often, one has to choose between a bigger support with a worse alignment and a smaller support with a better alignment. Almost all systems that examine protein similarity have to make such decisions. Also, there might be many possible supports in different regions of the structure. If a support is small, it is referred to as a motif [3].

Figure 14 shows the alignment of 3adk and 1gky. Both matching and non-matching secondary structure elements can be seen.

Figure 14. Partial Alignment of 3adk and 1gky

Distributed Support

If partial matches are allowed, then there may be cases where the support is not contiguous.

[3] Explained later.

Figure 15(a). Contiguous Support
Figure 15(b). Distributed Support (with a gap in the support)

Figure 15(a) shows a contiguous support, while the support in Figure 15(b) has gaps. A similarity measure is needed to decide which of these two supports is better, whether gaps should be allowed in the support, and how gaps should be penalized. This measure differs from system to system.

In the case illustrated by Figure 16, the support matching does not conserve the sequence order along the backbone. This support would be rejected by many systems. However, the function of a protein depends on the interactions of the atoms on its outer surface. Therefore, it should not matter that atoms positioned inwards are out of order in the support.

Figure 16. Distributed Support without sequence preservation

In the case of partial alignments, the similarity measure is unlikely to satisfy the triangular inequality, a property otherwise considered to be the holy grail in computational biology.

Figure 17. Partial Alignment and Triangular Inequality
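A toy example may help to see why partial alignment tends to break the triangular inequality. The sketch below is not a structural measure from the lecture: it assumes, purely for illustration, that each structure is reduced to a string of secondary-structure letters and that the dissimilarity of two strings is one minus the length of their longest common substring divided by the shorter length. The names `longest_common_substring` and `partial_dissimilarity` are hypothetical.

```python
def longest_common_substring(x, y):
    """Length of the longest contiguous run shared by strings x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0] * (len(y) + 1)
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def partial_dissimilarity(x, y):
    """Toy partial-alignment dissimilarity in [0, 1]; 0 means one string
    is fully covered by a contiguous piece of the other."""
    return 1.0 - longest_common_substring(x, y) / min(len(x), len(y))

# A is all helix, C is all strand, and B contains both as sub-parts.
A, B, C = "HHH", "HHHEEE", "EEE"
d_ab = partial_dissimilarity(A, B)
d_bc = partial_dissimilarity(B, C)
d_ac = partial_dissimilarity(A, C)
```

Here A matches a part of B perfectly and C matches another part of B perfectly, yet A and C share nothing, so `d_ac` exceeds `d_ab + d_bc`. The same effect arises whenever two supports lie in different regions of the middle structure.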

Scoring Issues

Certain scoring issues have to be taken into consideration in order to come up with a good measure:

- Trade-off between the size of the support and the RMSD (the bigger the support, the bigger the RMSD, and vice versa)
- How should gaps be counted?
- Is there a quality of the correspondence? [The correspondence may, or may not, satisfy type and/or backbone sequence preferences]
- Should the accessible surface be given more importance, since it is the atoms on the periphery of the protein that react and hence define the function of the protein?

Similarity measures may be different from the inverse [4] of RMSD. There is no consensus on the best measure! However, RMSD is computationally very convenient.

[4] Since the RMS distance measures dissimilarity.

Some examples follow. Note that in the first example the squared distance is in the numerator, while in the second example it is in the denominator. This results in a slight difference in the behaviour of the function.

$$ \min_{T} \sqrt{ \frac{1}{|\sigma(T)|} \sum_{i \in \sigma(T)} \| a_i - T(b_i) \|^2 } $$

(The RMSD dissimilarity measure emphasizes differences.)

$$ \max_{T} \left[ \sum_{i \in \sigma(T)} \frac{A}{1 + \| a_i - T(b_i) \|^2 / B} - \frac{A \, N_{GAP}}{2} \right] $$

(STRUCTAL's similarity measure emphasizes similarities; A and B are constants and $N_{GAP}$ is the number of gaps.)

There are many other measures in use. The paper: A.C.M. May. Towards more meaningful hierarchical classification of amino acids scoring functions. Protein Engineering, 12, 1999, reviews 37 protein structure similarity measures. The difficulty of defining a similarity score is probably due to the facts that structure comparison is an ill-posed problem and has multiple solutions. In fact, the measure depends on what is to be extracted from the comparisons.

6.3 Conclusion on Protein Alignments

- Finding an optimal partial alignment is NP-hard.
- No fast algorithm is guaranteed to give an optimal answer for any given similarity measure.
- Rely on combinations of heuristic/approximate algorithms


- Probably not a single best solution, but application-dependent solutions
- But there exist general algorithmic principles

Part II: Typical Applications of Structure Matching Techniques

1. Overview

1.1 Find Similarities Among Protein Structures

The problem is to find similar substructures in two given protein structures. The substructure can be, for instance, a sequence of (possibly contiguous) Cα atoms in each molecule, or a set of secondary structure elements. There are many possible similarity measures, as mentioned before, and different algorithms for finding similarities among protein structures exploit different measures. Variants of this problem include 1-to-1 (finding structure similarities between two proteins), 1-to-many (finding structure similarities for one given protein against many other proteins), and many-to-many (finding structure similarities between two sets of proteins, where both sets contain a great number of proteins). Each of the variants, especially the 1-to-many and many-to-many problems, must be automatic and reasonably fast.

1.2 Classify Proteins

This problem is closely related to the previous one. Once we have the similarities among protein structures, we can use them to classify proteins. There are many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al. 1997]. With this knowledge, we can classify proteins into fold families, which brings insight into function and structure stabilization and forms the basis for threading and homology modeling.

Earlier classifications of proteins were done manually. For instance, SCOP [Murzin et al. 1995] is a manual protein classification scheme which takes into account not only structural similarities but also evolutionary information. SCOP classifies proteins at three levels, Class, Fold and Family; Figure 18 shows an example of SCOP classification.

Figure 18. Three levels of SCOP classification

As the size of PDB increases, automatic classification is required. Examples of automatic classifications are CATH [Orengo et al. 1997], Pclass [Singh et al.] and FSSP [Holm and Sander]. However, the performance of automatic classification is not very satisfactory. Figure 19 shows two different classifications of the same protein set, using SCOP and Pclass. As shown in the figure, Pclass produces a classification quite different from that of SCOP.

Figure 19. Manual vs. Automatic Classification

1.3 Finding Motif in Protein Structure

A motif is a small collection of atoms corresponding to a binding site or a stabilizing scaffold. The problem is to find whether the motif matches a substructure of a given protein. A variant of this problem is to search for the motif in many proteins. Motifs are often repeated across different molecules, and detecting these common motifs in a new molecule can provide useful clues to its functional properties. In its simplest form, the motif finding problem can be formulated as follows: given a sample of sequences and an unknown pattern (motif) that is implanted at different unknown positions, can we find the unknown pattern? However, motifs are subject to mutations and usually do not appear exactly.

The problem of finding motifs has two parts. The first part is finding the best match for the pattern in the example. The second part is evaluating whether this match is significant enough: e.g., finding a close match for a pattern of 10 amino acids is more significant than finding a close match for a pattern of 2 amino acids.

One example of motif finding results is shown in Figure 20, taken from the paper Identifying Structural Motifs in Proteins (R. Singh and M. Saha). The Y axis shows the RMSD between the motif (the trypsin active site) and the matching part of each protein. The X axis represents the 42 proteins belonging to the trypsin family. Observe that the quality of the match is nearly perfect for most of the proteins (RMSDs below 0.5). This confirms that all these proteins share the same structural and functional properties.
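The first part of the motif problem, in its simplest sequence formulation, can be sketched as a sliding-window search. This is only an illustration: it assumes the motif is a known string, allows substitutions but no insertions or deletions, and measures closeness by the number of mismatches (Hamming distance). The function name `best_motif_match` and the strings in the usage line are made up for the example; assessing the significance of the match (the second part of the problem) is not covered.

```python
def best_motif_match(sequence, motif):
    """Return (position, mismatches) of the window of `sequence` that is
    closest to `motif` in Hamming distance. Ties go to the leftmost window."""
    k = len(motif)
    if k == 0 or k > len(sequence):
        raise ValueError("motif must be non-empty and no longer than the sequence")
    best_pos, best_mis = 0, k + 1
    for pos in range(len(sequence) - k + 1):
        window = sequence[pos:pos + k]
        # Count the positions where the window and the motif disagree.
        mis = sum(1 for x, y in zip(window, motif) if x != y)
        if mis < best_mis:
            best_pos, best_mis = pos, mis
    return best_pos, best_mis

# The pattern occurs at position 3 with one mutated residue.
pos, mismatches = best_motif_match("AAGHDSGGPLAA", "GDSGGP")
```

The structural version discussed in the text replaces the mismatch count by an RMSD between the motif atoms and the candidate substructure.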

Figure 20. Results of motif finding in trypsin-like proteins

1.4 Finding Pharmacophore in Ligands

Given a small collection of 5-10 small flexible ligands with similar activity (hence assumed to bind at the same protein site), and several dozen to a few hundred low-energy conformations for each ligand, the problem is to find a substructure (called a pharmacophore) that occurs in at least one conformation of each ligand. The problem requires comparing a relatively large number of conformations for each ligand.

1.5 Searching for Ligands Containing a Pharmacophore

This problem is the complement of problem 1.4, which finds a pharmacophore for given ligands. Here, conversely, the problem is to find all ligands that have a low-energy conformation containing a given pharmacophore, in a database of small flexible ligands that usually holds 100,000 entries or more. The paper A randomized kinematics-based approach to pharmacophore-constrained conformational search and database screening (S.M. LaValle, et al.) presents an approach to conformational search and data mining of pharmaceutical databases.

So far we have presented an overview of the five problems in finding protein similarities, which will be elaborated in the following sections. Several web sites provide useful information about protein classification tools and protein alignment methods:

Protein classification: SCOP, CATH
Protein alignment: DALI, LOCK

2. Problem 1 - Find Similarities Among Protein Structures

2.1 Problem Definition

Input: Two sets of features (atoms or groups of atoms) $\{a_1, \ldots, a_n\}$ and $\{b_1, \ldots, b_m\}$ belonging to two different proteins A and B

Output:
- Maximal correspondence set C of pairs $(a_i, b_j)$, where all $a_i$ and all $b_j$ are distinct
- Alignment transform T such that the RMSD of the pairs $(a_i, T(b_j))$ is less than a given threshold ε

Some Explanations

1) For the output pairs $(a_i, b_j)$, all $a_i$ and all $b_j$ being distinct means that an $a_i$ cannot be mapped to more than one $b_j$, and a $b_j$ cannot be mapped to more than one $a_i$.
2) There may be several possible outputs, as there can be more than one optimal solution.
3) Similarity measures other than RMSD may also be used and will be examined later.

Possible Correspondence Constraints

Special constraints are often placed on the correspondence, depending on the application. The most common correspondence constraints are typed features and ordered features.

- Typed features: $(a_i, b_j)$ is a possible correspondence pair iff $\mathrm{Type}(a_i) = \mathrm{Type}(b_j)$
- Ordered features: $(a_i, b_j)$ and $(a_{i'}, b_{j'})$, where $i' > i$, are possible correspondence pairs iff $j' > j$

2.2 Some Existing Software

The following list shows examples of existing software for finding similarities among protein structures.

Cα atoms:
- DALI [Holm and Sander, 1993]
- STRUCTAL [Gerstein and Levitt, 1996]
- MINAREA [Falicov and Cohen, 1996]
- CE [Shindyalov and Bourne, 1998]
- ProtDex [Aung, Fu and Tan, 2003]

Secondary structure elements and Cα atoms:
- VAST [Gibrat et al., 1996]
- LOCK [Singh and Brutlag, 1996]
- 3dSEARCH [Singh and Brutlag, 1999]
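The correspondence constraints from the problem definition (distinct elements, typed features, ordered features) can be checked mechanically for a candidate correspondence set. The sketch below is illustrative only: it assumes that features are referred to by integer indices, that types are given as sequences indexable by those indices, and the function name `is_valid_correspondence` and its parameters are hypothetical.

```python
def is_valid_correspondence(pairs, type_a=None, type_b=None, ordered=False):
    """Check a candidate correspondence set C, given as (i, j) index pairs
    meaning that feature a_i is matched with feature b_j."""
    a_idx = [i for i, _ in pairs]
    b_idx = [j for _, j in pairs]

    # Distinctness: no a_i and no b_j may appear in more than one pair.
    if len(set(a_idx)) != len(a_idx) or len(set(b_idx)) != len(b_idx):
        return False

    # Typed features: matched elements must carry the same type.
    if type_a is not None and type_b is not None:
        if any(type_a[i] != type_b[j] for i, j in pairs):
            return False

    # Ordered features: i' > i must imply j' > j.
    if ordered:
        by_a = sorted(pairs)
        if any(j2 <= j1 for (_, j1), (_, j2) in zip(by_a, by_a[1:])):
            return False

    return True
```

For example, `is_valid_correspondence([(0, 2), (1, 1)], ordered=True)` returns `False` because the pairs cross, while the same set is accepted when `ordered` is left at its default.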

Two Simultaneous Sub-problems

In order to find similarities among protein structures, one has to solve the following two sub-problems simultaneously.

a. Find correspondence set C
b. Find alignment transform T

There is a chicken-and-egg issue between the two sub-problems. Each sub-problem is simple on its own: if we know C, we can easily compute T with a known algorithm; if we know T, we can obtain C by proximity, possibly with the help of dynamic programming. However, the combination of the two problems is extremely hard.

Find Alignment Transform

The computation of C involves many more parameters than the computation of T, which requires only 6 parameters (3 for translation, 3 for orientation). Hence computing T should be easier than computing C, and the combined problem should be approached by computing T first. With T known, C can be obtained by proximity, and the combined problem is solved.

The problem takes as input two point sets of equal size, $A = \{a_1, \ldots, a_n\}$ and $B = \{b_1, \ldots, b_n\}$, where the $(a_i, b_i)$ are correspondence pairs. The goal is to find the transform T such that the RMSD between A and T(B) is minimized, denoted

$$ T = \arg\min_{T} \mathrm{RMSD}(A, T(B)). $$

There is an O(n) closed-form solution, the SVD-based algorithm (Arun, Huang, and Blostein, 87; Horn, 87; Horn, Hilden, and Negahdaripour, 88), where SVD stands for singular value decomposition. Only a very brief introduction is presented here; further details are available in the referenced papers.

O(n) SVD-Based Algorithm

T combines a translation t and a rotation R, such that $T(b_i) = t + R(b_i)$. The origin of the coordinate system is placed at $\bar{b}$, the mean of the $b_i$, i.e., $\bar{b} = \frac{1}{n} \sum_{i=1}^{n} b_i$. Then $\min_T \mathrm{RMSD}(A, T(B))$ simplifies to (up to some constants, which can be ignored for the moment)

$$ \min_{t, R} \; \sum_{i=1}^{n} \| a_i - t \|^2 \; - \; 2 \sum_{i=1}^{n} \langle a_i, R(b_i) \rangle . $$

Here t and R can be computed separately. The translation t is just the mean of the $a_i$, so $t = \bar{a}$, where $\bar{a} = \frac{1}{n} \sum_{i=1}^{n} a_i$.

The rotation R is computed as follows. Form the matrices $A_{3 \times n} = [a_1 - \bar{a}, \ldots, a_n - \bar{a}]$ and $B_{3 \times n} = [b_1 - \bar{b}, \ldots, b_n - \bar{b}]$, and compute the SVD of $A B^T$ (where $B^T$ denotes the transpose of B):

$$ A B^T = U D V^T , $$

where D is a diagonal matrix with decreasing non-negative entries (the singular values) along the diagonal. If $\det(U)\det(V) = 1$ then $S = I$; otherwise $S = \mathrm{diag}(1, 1, -1)$. Finally, R is computed as

$$ R = U S V^T . $$

Trial-and-Error Approach to Protein Structure Alignment

The trial-and-error approach solves the two sub-problems by iteratively updating the correspondence set CS and the transformation T, starting from an initial guess of CS that is a small set.

T is computed from the initial guess of CS and subsequently from the updated CS. T is then applied to bring the two sets of features, A and B, into a common frame, and the correspondence set CS is updated by proximity under T. The updated CS is again used to compute a new T, and the process stops when CS no longer changes. Figure 21 depicts this approach and the algorithm is given in Figure 22.

Figure 21. Overview of Trial-and-Error Approach

Figure 22. Trial-and-Error Algorithm

2.4 Seed Generation by Fragment Matching

When the trial-and-error approach is used for protein structure alignment, it is desirable that the initial correspondence set is valid. Indeed, if the elements in the set do not come from similar substructures, then in step 3 of the trial-and-error approach the transformation is unlikely to bring in good elements. Many techniques have been proposed to generate promising seed correspondence sets. Three algorithms are introduced in the lecture: DALI [Holm and Sander, 1996], where the seed set is generated from local shape; LOCK [Singh and Brutlag, 1996], where the seed set is generated from secondary structure elements; and 3dSEARCH [Singh and Brutlag, 2000], where the seed set is generated from a voting scheme.
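The SVD-based transform computation and the trial-and-error loop can be sketched together as follows. This is a minimal illustration under several assumptions, not the algorithm of Figure 22: points are rows of `(n, 3)` NumPy arrays, the matrix decomposed is the sum of the outer products of centred a_i with centred b_i (for which R = U S V^T), the correspondence update is a simple greedy nearest-neighbour rule with threshold `eps` rather than dynamic programming, and all function names are hypothetical.

```python
import numpy as np

def best_rigid_transform(A, B):
    """Return (R, t) minimizing the RMSD between rows a_i and t + R b_i."""
    a_mean, b_mean = A.mean(axis=0), B.mean(axis=0)
    H = (A - a_mean).T @ (B - b_mean)            # 3x3 matrix, sum of a'_i b'_i^T
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation.
    d = 1.0 if np.linalg.det(U) * np.linalg.det(Vt) > 0 else -1.0
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    t = a_mean - R @ b_mean
    return R, t

def update_correspondence(A, B_moved, eps):
    """Greedy proximity rule: pair each a_i with its nearest unused b_j
    that lies within distance eps (an illustrative choice)."""
    dist = np.linalg.norm(A[:, None, :] - B_moved[None, :, :], axis=2)
    pairs, used = [], set()
    for i in np.argsort(dist.min(axis=1)):       # closest candidates first
        for j in np.argsort(dist[i]):
            if dist[i, j] > eps:
                break
            if int(j) not in used:
                pairs.append((int(i), int(j)))
                used.add(int(j))
                break
    return sorted(pairs)

def trial_and_error(A, B, seed, eps, max_iter=50):
    """Alternate between fitting T to the correspondence set and
    rebuilding the correspondence set by proximity under T."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)

    def fit(cs):
        ia = [i for i, _ in cs]
        ib = [j for _, j in cs]
        return best_rigid_transform(A[ia], B[ib])

    cs = sorted(seed)
    R, t = fit(cs)
    for _ in range(max_iter):
        new_cs = update_correspondence(A, B @ R.T + t, eps)
        if new_cs == cs or len(new_cs) < 3:      # converged, or too few pairs
            break
        cs = new_cs
        R, t = fit(cs)
    return cs, R, t
```

The quality of the result depends heavily on the seed: with a seed drawn from dissimilar substructures, the first transform is poor and the proximity step has little chance of recovering, which is why seed generation receives so much attention.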


2.4.1 DALI

Figure 23. Distance Matrix vs Inferred Protein Structure

DALI makes use of a distance matrix to compute the seed set. As shown on the right-hand side of Figure 23, the Cα atoms of a protein are laid out along each axis of the distance matrix (140 Cα atoms in the above example). Each entry in the matrix gives the distance between a pair of Cα atoms within the molecule, and the darkness of the entry reflects the closeness of the two Cα atoms. Areas along the diagonal are darker, since a Cα atom is certainly close to itself.

We can use the distance matrix to compute the seed set because intra-molecular distances are invariant to rigid-body transformations. So if we have the distance matrices for protein A and protein B, these two matrices are independent of the position and orientation of the two proteins.

If there is an α-helix in the protein, there will typically be a thick line along the diagonal of the matrix. Patches of thick lines along the diagonal denote the several α-helices in the protein. For example, in the above example, the distance matrix shows that amino acids 1 to 40 form one helix and amino acids 45 to 80 form another. The thick line in the orange circle shows that the two helices are close to each other: if we take the last amino acid (85) of the helix in blue and the first amino acid (1) of the red helix, the corresponding entry in the distance matrix is quite dark.

DALI looks for similar hexapeptides by searching the distance matrix of another protein for 7x7 Cα-Cα distance sub-matrices similar to those of the first protein (the green square illustrated in Figure 23). The seed set is generated from these matches.

2.4.2 LOCK

Unlike DALI, which uses only the carbon atoms along the protein backbone, LOCK uses secondary structure elements (SSEs) as the initial seeds. This choice has a biological motivation. Most of the time, the types of atoms and amino acids in proteins vary considerably during evolution and thus are not a promising basis for correspondence. Secondary structure elements are better conserved, and they are responsible for most of the stability and functionality of proteins. Taking these biological facts into account, programs like LOCK combine information from two levels of features (see Figure 24):


Stage 1 (SSE alignment): Compute an initial alignment using SSEs represented as vectors. Only three SSEs match in the example shown in Figure 24.
Stage 2 (atom alignment): Refine the alignment using Cα atoms represented as points.

Figure 24. SSE Alignment and Atom Alignment (Stage 1, Stage 2)

Before going into the details of LOCK, note that using SSEs for substructure alignment narrows the range of possible applications. The resulting program cannot find small motifs at the atomic level. LOCK also has some deficiencies when it comes to many-to-many comparison.

SSEs and Their Vector-Based Representation

LOCK considers three types of SSE: α-helices, β-strands and loops. α-helices and β-sheets are stabilized by hydrogen bonds between backbone oxygen and hydrogen atoms. Loops are treated as secondary because they do not have definite shapes. SSEs are represented as a triplet vector: (helix, strand, loop). See Figure 25.

Figure 25. A Typical Secondary Structure Element (β-strands, loops, α-helices)

The vectors are computed as follows:
1. LOCK uses the DSSP software [Kabsch and Sander, 1983] to classify residues into α-helices and β-strands.
2. Each α-helix and β-strand is represented as a vector with a start point and an end point, each computed as a weighted average of Cα backbone atoms. For example, for an α-helix starting at residue i:

$$X_{origin} = \frac{0.74\,X_i + X_{i+1} + X_{i+2} + 0.74\,X_{i+3}}{3.48}$$

where $X_i$ is the position of the Cα atom of residue i. The unusual weights arise because successive residues in an α-helix are rotated by about 100° around the helix axis. $X_{end}$ and the vectors for β-strands are computed similarly.

Preparatory Work: Scoring Similarity

Once the vectors of the two proteins are ready, LOCK scores the similarity between the two proteins.

Figure 26. Alignment of Two Proteins

To score similarity, one question to ask is: if i and p are aligned, what is the score of aligning k with r? Two types of differences are considered.

Position-independent differences (for example, in Figure 26):
  angle(i,k) - angle(p,r)
  angle(i,j) - angle(p,q)
  angle(j,k) - angle(q,r)
  distance(i,k) - distance(p,r)
  length(k) - length(r)

Position-dependent differences (used only when the pair of proteins is already reasonably well aligned):
  angle(k,r) (if the two proteins are well aligned, this value should be small)
  distance(k,r)

Every difference is associated with a score, and the scores are additive, since all differences are considered. The $d_i$ in the formula below is one of the differences listed above. If a difference is small, its score approaches the maximum: $M_i$ is the maximal score obtainable from a specific difference, since $S(0) = 2M_i/(1+0) - M_i = M_i$. When $d_i$ equals $d_{i0}$, $S(d_i) = 2M_i/(1+1) - M_i = 0$.
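As a minimal sketch of the two computations described above, the following code evaluates the helix start point and the per-difference score. The function names are hypothetical, and the helix weights (0.74, 1, 1, 0.74) are an assumption chosen so that they sum to the 3.48 in the denominator.

```python
def helix_origin(ca, i):
    """Start point of an alpha-helix vector: weighted average of the four
    consecutive C-alpha positions ca[i] .. ca[i+3] (each an (x, y, z) tuple).
    Weights are assumed to be 0.74, 1, 1, 0.74 (sum 3.48)."""
    weights = (0.74, 1.0, 1.0, 0.74)
    return tuple(
        sum(w * ca[i + k][axis] for k, w in enumerate(weights)) / 3.48
        for axis in range(3)
    )

def score(d, d0, m):
    """S(d) = 2M / (1 + (d/d0)^2) - M: equals M at d = 0, 0 at d = d0,
    and tends to -M as d grows."""
    return 2.0 * m / (1.0 + (d / d0) ** 2) - m

def total_score(differences):
    """Additive score over (d, d0, M) triples, one per difference considered."""
    return sum(score(d, d0, m) for d, d0, m in differences)
```

For instance, `score(0.0, 2.0, 10.0)` gives the maximal score 10.0 and `score(2.0, 2.0, 10.0)` gives 0.0, matching the two special cases worked out in the text.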

$$S(d_i) = \frac{2M_i}{1 + (d_i/d_{i0})^2} - M_i, \qquad \text{Score} = \sum_i S(d_i)$$

Figure 27. Calculation of Score ($M_i$ is the maximal score; $d_{i0}$ is the value of $d_i$ for which the score is 0)

Stage 1: Secondary Structure Elements Alignment

To assess an alignment of a pair of SSEs, LOCK uses a set of position-independent differences, for example the angles or distances between successive SSEs, or the lengths of the SSE vectors. For every pair of SSE vectors of protein A, LOCK finds all pairs of vectors in B that align well with it, which yields seed correspondence sets. For each such correspondence set, LOCK computes the alignment transformation and applies it to all the SSEs of protein B; a dynamic programming algorithm (a variation of the Smith-Waterman algorithm) is used to find the correspondence set with maximal score. Finally, LOCK records the transform T and correspondence set CS that yield the maximal score.

The dynamic programming algorithm works as follows. Suppose there are 5 SSEs in each of proteins A and B, namely i, j, k, l, m and p, q, r, s, t, and that SSE i aligns well with p and j aligns well with q. The remaining correspondences are found by exploring a tree. As illustrated in Figure 28, SSE k may correspond to r, s or t. LOCK computes the score for each correspondence, e.g. (k, r). If the score is positive, LOCK goes down the tree and computes the score for the next correspondence, e.g. (l, s). LOCK keeps adding pairs to the correspondence set as long as the score continues to increase. LOCK allows a gap in only one structure at a time, which is why (m, t) does not fan out from (k, r) in the tree: that step would introduce a gap in both A and B.

A = (i, j, k, l, m), B = (p, q, r, s, t), seed correspondence {(i,p), (j,q)}
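The tree exploration described above can be sketched as a memoized recursive search. This is a simplified illustration, not LOCK's implementation: it assumes the score of a candidate pair depends only on the pair itself (in LOCK the score depends on differences relative to SSEs that are already aligned), and `extend_seed` and `pair_score` are hypothetical names. SSEs are identified by their 0-based index along each chain.

```python
from functools import lru_cache

def extend_seed(n_a, n_b, seed, pair_score):
    """Extend a seed correspondence (a list of (index in A, index in B) pairs)
    and return (added score, full correspondence list).

    From the last pair (a, b), the next pair may skip SSEs in A or in B,
    but never in both at once. A path stops when the score of a new
    correspondence is not positive."""

    @lru_cache(maxsize=None)
    def best_from(a, b):
        cands = []
        if a + 1 < n_a:
            # no gap, or a gap in B only
            cands += [(a + 1, j) for j in range(b + 1, n_b)]
        if b + 1 < n_b:
            # a gap in A only
            cands += [(i, b + 1) for i in range(a + 2, n_a)]
        best = (0.0, ())
        for i, j in cands:
            s = pair_score(i, j)
            if s <= 0:                      # terminate this path
                continue
            tail_score, tail = best_from(i, j)
            if s + tail_score > best[0]:
                best = (s + tail_score, ((i, j),) + tail)
        return best

    a, b = seed[-1]
    total, pairs = best_from(a, b)
    return total, list(seed) + list(pairs)

if __name__ == "__main__":
    # Toy data: SSE lengths only; the score follows S(d) with M = 1, d0 = 2.
    len_a = [10, 8, 12, 6, 9]
    len_b = [10, 8, 20, 12, 6, 9]

    def length_score(i, j):
        d = abs(len_a[i] - len_b[j])
        return 2.0 / (1.0 + (d / 2.0) ** 2) - 1.0

    print(extend_seed(len(len_a), len(len_b), [(0, 0), (1, 1)], length_score))
```

Because a child of (a, b) is either (a+1, j) or (i, b+1), a pair such as (a+2, b+2) is never generated directly, which mirrors the rule that simultaneous gaps in both structures are not allowed.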

Figure 28. DP to Compute Correspondence Set
(Search tree rooted at the seed correspondence {(i,p), (j,q)}; its nodes are candidate pairs such as (k,r), (k,s), (k,t), (l,s), (l,t), (m,s) and (m,t). Annotations in the figure:)
  Simultaneous gaps in both structures are not allowed (not in SCOP2).
  Terminate a path when the score of a new correspondence is negative.
  Re-compute a new transform with each new correspondence (?)

Stage 2: Atom (Core) Alignment

With the correspondence set of SSEs in hand, a more precise alignment is obtained by computing the set of pairs of atoms from A and T(B) that are closest to each other and within a given threshold distance. Steps 2, 3 and 4 of the algorithm are similar to the trial-and-error approach described earlier: the best transformation and the largest correspondence set are reached by iteratively computing the transform that minimizes the RMSD of the atoms in the CS.

Experiments and Results

To assess how LOCK copes with proteins that differ significantly in sequence, 685 proteins were selected from the PDB such that each pair has less than 25% sequence similarity. Three families of folds are considered (myoglobins, TIM barrels and immunoglobulins). One query protein from each family is compared against all other proteins to find the other members of its family (685 × 3 = 2055 alignments). For each query, LOCK sorts the 685 structures by score and selects the top k proteins. The counts of family members (true positives) and non-members (false positives) are shown in the sensitivity vs. specificity tables below.

Myoglobins (11)                        TIM-barrels (50)                       Immunoglobulins (38)
# True positives   # False positives   # True positives   # False positives   # True positives   # False positives

Experiments show that for the three folds considered, LOCK correctly identifies most of the members of each fold type. It performs especially well for myoglobins (the top 11 proteins are all family members) and well for TIM-barrels (among the top 55 proteins selected, only 5 are not members of the family). However, there are still some false positives. Considering that immunoglobulins have diverse structures, the result is already quite good.

Regarding running time, comparing all proteins with LOCK was quite slow in the 1990s, so geometric hashing was used to speed up stage 1. The rationale behind geometric hashing is pre-computation: the scores of pairs of SSEs are computed only once instead of many times.

Possible Improvements for LOCK

LOCK uses RMSD to refine transformations. However, a small RMSD does not necessarily mean similarity from a biological point of view. In general, similarity is evaluated by computing a similarity measure S over the transforms between proteins A and B, where S is much more complex than RMSD. Usually, RMSD is minimized first, and once the best transformation is at hand, the transformation is adjusted further to maximize the similarity measure. Another popular technique for establishing correspondence by proximity is Iterative Closest Point (ICP) [Besl and McKay, 1992].

References
[1] R. Singh, M. Saha, Identifying Structural Motifs in Proteins. Pacific Symposium on Biocomputing 8, 2003.
[2] L. Cowen, P. Bradley, M. Menke, J. King and B. Berger, Predicting the Beta-Helix Fold from Protein Sequence Data. J. Comput. Biol. 9(2), 2002.
[3] A.P. Singh and D.L. Brutlag, Hierarchical Protein Structure Superposition Using Both Secondary and Atomic Representations. Proc. ISMB, 1997.
[4] J. Shapiro and D.L. Brutlag, FoldMiner: Structural Motif Discovery Using an Improved Superposition Algorithm. Protein Science 13, 2004.
[5] P.W. Finn, L.E. Kavraki, J.C. Latombe, R. Motwani, C. Shelton, S. Venkatasubramanian, and A. Yao, RAPID: Randomized Pharmacophore Identification for Drug Design. Computational Geometry: Theory and Applications 10, 1998.
[6] A.G. Murzin, S.E. Brenner, T. Hubbard and C. Chothia, SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures. J. Mol. Biol. 247, 1995.
[7] C. Chothia, One thousand families for the molecular biologist. Nature 357, 1992.
[8] C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, J.M. Thornton, CATH - A Hierarchic Classification of Protein Domain Structures. Structure 5(8), 1997.
[9] L. Holm, C. Sander, Dali/FSSP classification of three-dimensional protein folds. Nucl. Acids Res. 25, 1997.
[10] L. Holm, C. Sander, Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 1993.
[11] A.C.M. May, Towards more meaningful hierarchical classification of amino acids scoring function. Protein Engineering 12, 1999.
[12] PDB Annual Report 2003.
[13]
[14] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, P. Walter, Molecular Biology of the Cell, 4th Ed., Garland Science.
