Part I: Introduction to Protein Structure Similarity


CS5238 Combinatorial methods in bioinformatics, 2004/2005 Semester 1
Lecture 8: Finding structural similarities among proteins (I)
Lecturer: Prof Jean-Claude Latombe
Scribe: Chen Ding, Huang Yang, Meghna Agrawal

"Could the search for ultimate truth really have revealed so hideous and visceral-looking an object?" Max Perutz, Nobel Prize Winner (1962)

Part I: Introduction to Protein Structure Similarity

Living systems have evolved a duality of information storage and function. Heritable information is encoded in DNA sequences, but function is obtained only after the encoded information is translated into a polypeptide sequence - a copolymer synthesized from the 20 different proteinogenic amino acids - and this polypeptide spontaneously folds into a well-defined three-dimensional structure. In general, proteins obtain their biological function only after this folding process. It is the spatial arrangement, the precise and reproducible patterning of hydrophobic and hydrophilic components, of positive and negative charges, of hydrogen-bond donors and acceptors, which gives proteins their unique properties. And it is the prediction of this structure from sequence data alone that remains one of the grand challenges of computational biology.

1. Levels of Protein Structures

There are mainly three levels of protein structure (1).

1.1 Primary

The sequence of amino acids, connected by peptide bonds, that constitutes a protein is referred to as its primary structure. Even a slight change in the sequence can change the overall structure and function of the protein.

Note: If there are some cysteines in the amino-acid sequence, they often react two by two to form disulphide bridges. Disulphide bridges are part of the primary structure.

Figure 1. Primary Structure with a Disulphide Bridge

(1) Some biologists may consider up to 5 levels.

1.2 Secondary

The secondary structure of a protein is the spatial arrangement of the atoms constituting the main protein backbone. There are two main structures in this category:

Alpha-helix - a spiral arrangement of the protein backbone in the form of a helix, stabilized by hydrogen bonds between backbone amide (N-H) and carbonyl (C=O) groups. It has a rod shape; in other words, the peptide chain is coiled around an imaginary cylinder.

Beta-sheets - consist of parallel or anti-parallel strands of amino acids linked to adjacent strands by hydrogen bonds. The amide hydrogen of one strand is hydrogen-bonded to the carbonyl oxygen of the neighbouring strand. The pleated-sheet effect arises from the fact that the amide group is planar while the "bends" occur at the carbon carrying the side chain. E.g.: silk fibroin.

Figure 2. Secondary Structure with alpha-helix and beta-sheet

Often there are parts of the structure that fall in neither of these categories.

1.3 Tertiary

The tertiary structure of a protein, also called its folded or native state, is the natural folding of the entire protein chain into a very compact structure. It is the combination of elements of the secondary structure linked by loops and turns as well as random coils. The tertiary structure of a protein gives it most of its functions.

Figure 3. Tertiary Structure

The following factors determine this structure:
Hydrogen bonds: essential in stabilizing the basic secondary structures
Hydrophobic effects: strongest determinants of protein structures

Van der Waals forces: stabilizing the hydrophobic cores
Electrostatic forces: oppositely charged side chains form salt bridges

Protein sequences almost always fold into the same tertiary structure in the same environment.

Quaternary Structure: The quaternary structure is the arrangement of polypeptide subunits within complex proteins made up of two or more subunits.

2. Protein Structure Determination Techniques

2.1 X-ray Diffraction Crystallography

In this method the protein is crystallized and an X-ray beam is directed at the crystals. The beam interacts with the electron cloud of the crystal to produce diffracted X-ray beams. The diffraction pattern is recorded on a phosphor screen, an electron density map is computed from it, and the 3D structure of the protein is built from that map. The map tends to be fuzzy in some parts (due to the problem of phasing loops), but the software used can usually predict up to 90% of the structure correctly; the rest is built manually.

This method is expensive and takes time, sometimes longer than a year. It is well suited to relatively large proteins, but the proteins have to be folded. It also requires the protein in the form of a crystal, and not every protein can be crystallized.

Figure 4. X-ray Diffraction Crystallography

2.2 Nuclear Magnetic Resonance Spectroscopy

NMR spectroscopy allows structure determination in solution under conditions that can approximate the physiological environment of a protein. It is based on the phenomenon that the energy levels of the atomic nuclei are split up by a magnetic field. Transitions between these levels can be induced by exciting the sample with radiation of

an appropriate frequency. Such transitions can get coupled. This technique has low sensitivity and the data obtained are noisy. It is used for smaller proteins.

Figure 5. Experimental Sources for Depositions in PDB (pie chart with four categories: X-Ray Diffraction, NMR, Theoretical Model, Electron and Neutron Diffraction; slices of 82%, 14%, 3% and 1%)

3. Protein Data Bank

The Protein Data Bank, a freely accessible database, was established at Brookhaven National Laboratory in 1971 as an archive for biological macromolecular crystal structures, starting with 7 structures. It is used in comparative modeling, which uses previously solved structures as starting points or templates. For example, the amino acid sequence of an unknown structure can be scanned against a database of solved structures with a scoring function that assesses the compatibility of the sequence with each structure, thus yielding possible three-dimensional models.

Figure 6. Structures deposited in PDB (new structures per year; 2000: more than 20,000 structures in total; 2004: about 30,000 structures in total)

Protein structure determination still lags behind protein sequencing by a factor of about 50. The Protein Structure Initiative (PSI) is a federal, university, and industry effort aimed at dramatically reducing the cost and the time it takes to determine a three-dimensional protein structure.

Figure 7. Protein Structure Initiative

4. Structure Similarity and Its Importance

Structure similarity refers to how well, or how poorly, the 3D structures of molecules can be aligned. High structural similarity generally goes with high functional similarity. For example, two proteins with similar structures would interact similarly with other molecules.

Figure 8. Proteins in the TIM barrel fold family

Figure 9. Alignment of 1xis and 1nar (TIM-Barrels), computed by DALI; α-helix axes are shown

By the year 2000, some 20,000 structures had been deposited in PDB, with approximately 4000 different folds (2), i.e. about one fold for every five structures, suggesting that many proteins essentially share the same folds. In fact, 90% of the structures submitted to PDB in the last 3 years have similar folds. Thus, given a new structure, it is highly probable that it is similar to a known structure.

Figure 10. Different Tertiary Structures

(2) Two proteins have a common fold if they have comparable elements of secondary structure with the same topology of connections.

There could be two main reasons for this. First, in terms of evolution, a populated non-native structure is not available for its function and thus constitutes a selective disadvantage. Mutation and selection must in general have arrived at sequences for which a single structure, and none other, is "frozen out" of the ensemble of all unfolded conformations upon folding. Secondly, certain physical constraints might prevent the protein from taking certain shapes. For example, physicochemically, there are only a few ways to maximize a polypeptide's hydrophobic interactions, and the small amount of free energy that finally stabilizes a native structure requires that almost all available stabilizing interactions are indeed formed.

Another reason could be the limitations of the current techniques used to determine the structures. For example, X-ray diffraction requires proteins to be crystallized but not all proteins can be crystallized; NMR has low sensitivity and is useful only for smaller proteins.

Generally, the primary structure (the amino acid sequence) determines the tertiary structure of the protein, which in turn determines its biological functions. This means that high sequence similarity would correspond to high functional similarity as well. However, there are exceptions to this rule. Sometimes low sequence similarity yields very similar structures, whereas high sequence similarity may yield different structures. For example, 1xis and 1nar (see Figure 9) have only 7% sequence similarity, but approximately 70% of the residues are structurally similar. Therefore, structure comparison is expected to provide more pertinent information about functional (dis)similarity among proteins, especially when the evolutionary relationship is absent or undetectable.
The problem of structure similarity is ill-posed because biologists have different goals and expectations, and often the end results are unknown. Many different terms are used in the literature to describe this problem, such as (dis-)similarity analysis, alignment of molecules, superposition, matching, classification and many others.

5. Mathematical relative: Largest Common Point Set Problem

Certain mathematical tools can be used to assist in structure comparison. For example, if there are two curves, f and g, we can compute how similar they are using the average distance between them.

Figure 11. Mathematical functions to measure similarity (two curves f and g)

However, applying the above function to proteins would not be sufficient. Instead, a very similar computing problem, the largest common point set problem, is mapped to the

problem of structure comparison. The largest common point set problem is defined as follows.

Given:
Two point sets A and B
A distance measure d
A space of transforms $\mathcal{T}$
A threshold ε

Determine $P \subseteq A$ and $Q \subseteq B$ such that $\min_{T \in \mathcal{T}} d(P, T(Q)) \le \varepsilon$ and $|P|$ is maximal.

The problem can easily be mapped onto proteins. A 3D molecular structure is defined as a collection of atoms, groups of atoms, or features, in a given 3D relative placement. The placement of a group of atoms is defined by the position of a reference point (e.g., the centre of an atom) and the orientation of a reference direction (e.g., the direction of a bond). Types can be assigned to the atoms or groups of atoms, where a type can be the identity of an atom, an amino acid, etc. Types may matter during matching, as one point may not be allowed to match another unless they have the same or compatible types. In fact, types do not add to the complexity of the problem; they reduce the number of possible combinations.

Given two molecular structures defined as above, the problem can be stated as follows. Two structures A and B match iff:
1. There is a one-to-one correspondence between their elements
2. There exists a rigid body transform T such that the RMSD between the elements in A and those in T(B) is less than some threshold ε

6. Alignments

6.1 Complete Alignment

The above definition corresponds to complete alignment, where the whole of structure A matches the whole of structure B. This is illustrated by Figure 12.

Figure 12. Complete Alignment

There will be some error in the match, as the structures are not identical. Complete alignments are rarely possible because the molecules compared are of different sizes, hence there cannot be a one-to-one correspondence. Also, their shapes might match only locally.
However, even though there is no satisfactory global alignment, good local alignments may be of interest as some functions might still be deduced from them.
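The matching condition above (a one-to-one correspondence plus a rigid transform that brings the RMSD under ε) can be checked directly once a correspondence and a transform are given. The following is a minimal sketch under stated assumptions: the helper names `apply_rigid`, `rmsd` and `matches`, the toy coordinates, and the threshold value are all illustrative and do not come from the lecture.

```python
import math

def apply_rigid(R, t, p):
    """Apply T(p) = t + R p, with R a 3x3 nested list and t, p 3-vectors."""
    return [t[r] + sum(R[r][c] * p[c] for c in range(3)) for r in range(3)]

def rmsd(A, B):
    """Root-mean-square distance between two equally long lists of 3D points,
    where A[i] is assumed to correspond to B[i]."""
    if len(A) != len(B) or not A:
        raise ValueError("need a non-empty one-to-one correspondence")
    total = sum(sum((x - y) ** 2 for x, y in zip(a, b)) for a, b in zip(A, B))
    return math.sqrt(total / len(A))

def matches(A, B, R, t, eps):
    """Complete-alignment test: is RMSD(A, T(B)) below the threshold eps?"""
    return rmsd(A, [apply_rigid(R, t, b) for b in B]) < eps

# Hypothetical toy structures: B is A rotated by -90 degrees about the z axis.
A = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
B = [[0.0, -1.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ZERO = [0.0, 0.0, 0.0]

print(matches(A, B, ROT_Z_90, ZERO, eps=0.5))  # the rotation superimposes B on A
print(matches(A, B, IDENTITY, ZERO, eps=0.5))  # without a transform the RMSD is too large
```

Note that this only verifies a given transform and correspondence; finding them is the hard part discussed in the rest of the lecture.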

6.2 Partial Alignment

Partial alignment refers to a part of structure A matching a part of structure B. These matching portions are referred to as supports, denoted σ(A) and σ(B): the match is between σ(A) and σ(B).

Figure 13. Partial Alignment

The notion of support gives rise to the dual problem of finding the support as well as the transformation. Often, one has to choose between a bigger support with a worse alignment and a smaller support with a better alignment. Almost all systems that examine protein similarity have to make such decisions. Also, there might be many possible supports in different regions of the structure. If a support is small, it is referred to as a motif (3).

Figure 14 shows the alignment of 3adk and 1gky. Both matching and non-matching secondary structure elements can be seen.

Figure 14. Partial Alignment of 3adk and 1gky

Distributed Support

If partial matches are allowed, then there may be cases where the support is not contiguous.

(3) Explained later

Figure 15(a). Contiguous Support
Figure 15(b). Distributed Support (the support is interrupted by a gap)

Figure 15(a) shows a contiguous support, while the support in Figure 15(b) has gaps. A similarity measure is needed to decide which of these two supports is better, whether gaps should be allowed in the support, and how to penalize them. This measure differs from system to system.

In the case illustrated by Figure 16, the support matching does not conserve the sequence order along the backbone. This support would be rejected by many systems. However, the function of the protein depends on the interactions of the atoms on its outer surface. Therefore, it should not matter that the atoms positioned inwards are out of order in the support.

Figure 16. Distributed Support without sequence preservation

In the case of partial alignments, the similarity measure is unlikely to satisfy the triangle inequality, otherwise considered to be the holy grail in computational biology.

Figure 17. Partial Alignment and Triangular Inequality

Scoring Issues

Certain scoring issues have to be taken into consideration in order to come up with a good measure:
Trade-off between the size of the support and the RMSD (the bigger the support, the bigger the RMSD, and vice versa)
How should gaps be counted?
Is there a quality of the correspondence? [The correspondence may, or may not, satisfy type and/or backbone sequence preferences]
Should the accessible surface be given more importance, since it is the atoms on the periphery of the protein that react and hence define the function of the protein?

Similarity measures may be different from the inverse (4) of RMSD. There is no consensus on the best measure! However, RMSD is computationally very convenient.

Some examples follow. Note that in the first example the square of the distance is in the numerator, while in the second example it is in the denominator. This results in a slight difference in the function's behaviour.

$$\min_T \sqrt{\frac{1}{|\sigma(T)|} \sum_{i \in \sigma(T)} \|a_i - T(b_i)\|^2}$$
(RMSD: a dissimilarity measure that emphasizes differences)

$$\max_T \; A \left( \sum_{i \in \sigma(T)} \frac{1}{1 + \|a_i - T(b_i)\|^2 / B} - \frac{N_{GAP}}{2} \right)$$
(STRUCTAL's similarity measure, which emphasizes similarities; $N_{GAP}$ is the number of gaps, A and B are constants)

There are many other measures in use. The paper: A.C.M. May. Towards more meaningful hierarchical classification of amino acids scoring functions. Protein Engineering, 12, 1999, reviews 37 protein structure similarity measures.

The difficulty of defining a similarity score is probably due to the fact that structure comparison is an ill-posed problem with multiple solutions. In fact, the measure depends on what is to be extracted from the comparisons.

6.3 Conclusion on Protein Alignments

Finding an optimal partial alignment is NP-hard. No fast algorithm is guaranteed to give an optimal answer for any given similarity measure.
Rely on combinations of heuristic/approximate algorithms

(4) Since RMS distance measures the dissimilarity.

Probably not a single best solution, but application-dependent solutions
But there exist general algorithmic principles

Part II: Typical Applications of Structure Matching Techniques

1. Overview

1.1 Find Similarities Among Protein Structures

The problem is to find similar substructures in two given protein structures. The substructure can be, for instance, a sequence of (possibly contiguous) Cα atoms in each molecule, or a set of secondary structure elements. There are many possible similarity measures, as mentioned before, and different algorithms for finding similarities among protein structures exploit different measures.

Variants of this problem include 1-to-1 (finding structure similarities between two proteins), 1-to-many (finding structure similarities between one given protein and many other proteins), and many-to-many (finding structure similarities between two sets of proteins, where both sets contain a great number of proteins). Each of the variants, especially the 1-to-many and many-to-many problems, must be automatic and reasonably fast.

1.2 Classify Proteins

This problem is closely related to the previous one. Once we have the similarities among protein structures, we can use them to classify proteins. There are many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al. 1997]. With this knowledge, we can classify proteins into fold families, which brings insight into function and structure stabilization and forms the basis for threading and homology modeling.

Earlier classifications of proteins were done manually. For instance, SCOP [Murzin et al. 1995] is a manual protein classification scheme which takes into account not only structural similarities but also evolutionary information. SCOP classifies proteins at three levels, Class, Fold and Family; Figure 18 shows an example of SCOP classification.

Figure 18.
Three levels of SCOP classification

As the size of PDB increases, automatic classifications are required. Examples of automatic classifications are CATH [Orengo et al. 1997], Pclass [Singh et al.] and FSSP [Holm and Sander]. However, the performance of automatic classifications is not very satisfactory. Figure 19 shows two different classifications of the same protein set using SCOP and Pclass. As shown in the picture, Pclass produces a classification quite different from SCOP's.

Figure 19. Manual vs. Automatic Classification

1.3 Finding Motif in Protein Structure

A motif is a small collection of atoms corresponding to a binding site or a stabilizing scaffold. The problem is to find whether the motif matches a substructure of a given protein. A variant of this problem is to search for a motif in many proteins.

Motifs are often repeated across different molecules. Detecting these common motifs in a new molecule can provide useful clues to its functional properties. In its simplest form, the motif finding problem can be formulated as follows: given a sample of sequences and an unknown pattern (motif) that is implanted at different unknown positions, can we find the unknown pattern? However, motifs are subject to mutations and usually do not appear exactly.

The problem of finding motifs has two parts. The first part is finding the best match for the pattern in the example. The second part is evaluating whether this match is significant enough: e.g., finding a close match for a pattern of 10 amino acids is more significant than finding a close match for a pattern of 2 amino acids.

One example of motif finding results is shown in Figure 20, taken from the paper Identifying Structural Motifs in Proteins (R. Singh and M. Saha). The Y axis shows the RMSD between the motif (trypsin active site) and the matching part of each protein. The X axis represents the 42 proteins belonging to the Trypsin family.
Observe that the quality of the match is nearly perfect for most of the proteins (RMSDs below 0.5). This confirms that all these proteins share the same structural and functional properties.
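The two parts of motif finding described above (best match, then significance) can be illustrated on the simplest sequence form of the problem. This is only a sketch under stated assumptions: it uses Hamming distance on amino-acid strings rather than RMSD on atoms, the significance estimate assumes independent, uniformly distributed residues over a 20-letter alphabet, and the names `best_match` and `chance_probability` as well as the example sequence are hypothetical.

```python
from math import comb

def best_match(sequence, pattern):
    """Return (mismatches, position) of the window of `sequence` closest to
    `pattern` under Hamming distance; ties go to the leftmost window."""
    m = len(pattern)
    if m == 0 or m > len(sequence):
        raise ValueError("pattern must be non-empty and no longer than the sequence")
    best = None
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        mismatches = sum(1 for x, y in zip(window, pattern) if x != y)
        if best is None or mismatches < best[0]:
            best = (mismatches, start)
    return best

def chance_probability(m, k, alphabet=20):
    """Probability that one random window of length m matches a fixed pattern
    with at most k mismatches (assumed model: uniform, independent residues)."""
    p = 1.0 / alphabet
    return sum(comb(m, j) * (1.0 - p) ** j * p ** (m - j) for j in range(k + 1))

# Hypothetical example: the pattern occurs with one mutation (I instead of L).
mismatches, position = best_match("MKTAYIAKQR", "AYLA")
print(mismatches, position)

# An exact match of 10 residues is far less likely by chance than one of 2.
print(chance_probability(2, 0), chance_probability(10, 0))
```

For structural motifs, the Hamming distance would be replaced by an RMSD after superposition, but the same two questions (best match, then significance relative to motif size) remain.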

Figure 20. Results of motif finding in trypsin-like proteins

1.4 Finding Pharmacophore in Ligands

Given a small collection of 5-10 small flexible ligands with similar activity (hence assumed to bind at the same protein site), and several dozen to a few hundred low-energy conformations for each ligand, the problem is to find a substructure (called a pharmacophore) that occurs in at least one conformation of each ligand. The problem requires comparing a relatively large number of conformations for each ligand.

1.5 Searching for Ligands Containing a Pharmacophore

This problem is the complement of problem 1.4, which finds a pharmacophore for given ligands. Here, the problem is to find, in a database of small flexible ligands (usually 100,000 or more), all ligands that have a low-energy conformation containing a given pharmacophore. The paper A randomized kinematics-based approach to pharmacophore-constrained conformational search and database screening (S.M. LaValle, et al.) presents an approach to conformational search and data mining of pharmaceutical databases.

So far we have presented an overview of the five problems in finding protein similarities, which are elaborated in detail in the following sections. Several web sites provide useful information about protein classification tools and protein alignment methods:
Protein classification: SCOP, CATH
Protein alignment: DALI, LOCK

2. Problem 1 - Find Similarities Among Protein Structures

2.1 Problem Definition

Input: Two sets of features (atoms or groups of atoms) {a_1, ..., a_n} and {b_1, ..., b_m} belonging to two different proteins A and B

Output:
- Maximal correspondence set C of pairs (a_i, b_j), where all a_i and all b_j are distinct
- Alignment transform T such that the RMSD of the pairs (a_i, T(b_j)) is less than a given threshold ε

Some Explanations

1) For the output pairs (a_i, b_j), "all a_i and all b_j are distinct" means that an a_i cannot be mapped to more than one b_j and a b_j cannot be mapped to more than one a_i.
2) There may be several possible outputs, as there can be more than one optimal solution.
3) Similarity measures other than RMSD may also be used and will be examined later.

Possible Correspondence Constraints

Special constraints are often imposed on the correspondence, depending on the application. The most common correspondence constraints are typed features and ordered features.
Typed features: (a_i, b_j) is a possible correspondence pair iff Type(a_i) = Type(b_j)
Ordered features: (a_i, b_j) and (a_i', b_j'), where i' > i, are possible correspondence pairs iff j' > j

2.2 Some Existing Software

The following list shows examples of existing software for finding similarities among protein structures.

Cα atoms:
DALI [Holm and Sander, 1993]
STRUCTAL [Gerstein and Levitt, 1996]
MINAREA [Falicov and Cohen, 1996]
CE [Shindyalov and Bourne, 1998]
ProtDex [Aung, Fu and Tan, 2003]

Secondary structure elements and Cα atoms:
VAST [Gibrat et al., 1996]
LOCK [Singh and Brutlag, 1996]
3dSEARCH [Singh and Brutlag, 1999]
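Returning to the correspondence constraints of Section 2.1, the distinctness, typed-feature and ordered-feature conditions can be written as a small validity check. This is a hedged sketch: the function name `is_valid_correspondence`, the representation of a correspondence as index pairs (i, j), and the example type labels are assumptions made for illustration and are not taken from any of the listed programs.

```python
def is_valid_correspondence(pairs, types_a=None, types_b=None, ordered=False):
    """Check a correspondence set given as (i, j) index pairs meaning a_i <-> b_j.

    - distinctness: no a_i and no b_j may appear in more than one pair
    - typed features (if both type lists are given): Type(a_i) == Type(b_j)
    - ordered features (if ordered=True): i' > i implies j' > j
    """
    a_idx = [i for i, _ in pairs]
    b_idx = [j for _, j in pairs]
    if len(set(a_idx)) != len(a_idx) or len(set(b_idx)) != len(b_idx):
        return False
    if types_a is not None and types_b is not None:
        if any(types_a[i] != types_b[j] for i, j in pairs):
            return False
    if ordered:
        # Sort by i, then require the matched j indices to increase strictly.
        js = [j for _, j in sorted(pairs)]
        if any(later <= earlier for earlier, later in zip(js, js[1:])):
            return False
    return True

# Hypothetical feature types (e.g. helix 'H' and strand 'E') for two proteins.
types_a = ["H", "E", "H", "E"]
types_b = ["E", "H", "E", "H", "E"]

print(is_valid_correspondence([(0, 1), (1, 2), (2, 3)], types_a, types_b, ordered=True))
print(is_valid_correspondence([(0, 3), (2, 1)], types_a, types_b, ordered=True))
```

As the text notes, such constraints do not make the problem harder; they prune candidate pairs before any geometric test is run.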

Two Simultaneous Sub-problems

In order to find similarities among protein structures, one has to solve the following two sub-problems simultaneously:
a. Find correspondence set C
b. Find alignment transform T

There is a chicken-and-egg issue between the two sub-problems. Each sub-problem alone is simple: if we know C, we can easily compute T with a known algorithm; if we know T, we can obtain C by proximity, possibly with the help of dynamic programming. However, the combination of the two problems is extremely hard.

Find Alignment Transform

The computation of C involves many more parameters than the computation of T, which requires only 6 parameters (3 for translation, 3 for orientation). Hence computing T should be easier than computing C, and the combined problem should be approached by computing T first. With T known, C can then be obtained by proximity.

The input consists of two sets with equal numbers of points, A = {a_1, ..., a_n} and B = {b_1, ..., b_n}, where the (a_i, b_i) are correspondence pairs. The goal is to find the transform T that minimizes the RMSD between A and T(B), denoted $T = \arg\min_T \mathrm{RMSD}(A, T(B))$. There is an O(n) closed-form solution, the SVD-based algorithm (Arun, Huang, and Blostein, 87; Horn, 87; Horn, Hilden, and Negahdaripour, 88), where SVD stands for singular value decomposition. A very brief introduction is presented here; further details are available in the referenced papers.

O(n) SVD-Based Algorithm

T combines a translation t and a rotation R, such that $T(b_i) = t + R(b_i)$. The origin of the coordinate system is placed at $\bar{b}$, the mean of the $b_i$, i.e., $\bar{b} = \frac{1}{n}\sum_{i=1}^{n} b_i$. Then $\min_T \mathrm{RMSD}(A, T(B))$ simplifies (up to some constants, which can be ignored for the moment) to

$$\min_{t,R} \; \sum_{i=1}^{n} \|a_i - t\|^2 - 2 \sum_{i=1}^{n} \langle a_i, R(b_i) \rangle .$$

Here t and R can be computed separately.
The translation t is just the mean of the $a_i$: $t = \bar{a}$, where $\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i$.

The rotation R is computed as follows. Form the $3 \times n$ matrices $A_{3 \times n} = [a_1 - \bar{a}, \ldots, a_n - \bar{a}]$ and $B_{3 \times n} = [b_1 - \bar{b}, \ldots, b_n - \bar{b}]$, and compute the singular value decomposition $B A^T = U D V^T$ (where $A^T$ denotes the transpose of A), with D a diagonal matrix whose non-negative entries (the singular values) decrease along the diagonal. If $\det(U)\det(V) = 1$ then $S = I$; otherwise $S = \mathrm{diag}(1, 1, -1)$. Finally, the second term of the objective equals $-2\,\mathrm{tr}(R\,B A^T)$, which is minimized by $R = V S U^T$. (Equivalently, if the decomposition is taken of $A B^T = U D V^T$ instead, the rotation is $R = U S V^T$.)

Trial-and-Error Approach to Protein Structure Alignment

The trial-and-error approach solves the two sub-problems by iteratively updating the correspondence set CS and the transformation T, starting from an initial guess of CS that is a small set.

T is computed from the initial guess of CS and subsequently from the updated CS. T is then applied to bring the two sets of features, A and B, together, and the correspondence set CS is updated by proximity under T. The updated CS is used to compute a new T, and the process repeats until CS no longer changes. Figure 21 depicts this approach and the algorithm is given in Figure 22.

Figure 21. Overview of Trial-and-Error Approach
Figure 22. Trial-and-Error Algorithm

2.4 Seed Generation by Fragment Matching

When the trial-and-error approach is used for protein structure alignment, it is desirable that the initial correspondence set be valid. Indeed, if the elements in the set do not come from similar substructures, then in step 3 of the trial-and-error approach the transformation is unlikely to bring in good elements. Many techniques have been proposed to generate promising seed correspondence sets. Three algorithms are introduced in the lecture: DALI [Holm and Sander, 1996], where the seed set is generated from local shape; LOCK [Singh and Brutlag, 1996], where the seed set is generated from secondary structure elements; and 3dSEARCH [Singh and Brutlag, 2000], where the seed set is generated from a voting scheme.
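Before turning to the individual seeding schemes, the trial-and-error loop and the closed-form transform it relies on can be sketched together. This is an illustrative sketch, not the algorithm of Figure 22: the function names `fit_rigid` and `trial_and_error`, the greedy nearest-pair update of the correspondence set (the text also mentions dynamic programming), the default threshold, and the stopping rule are assumptions. The rotation is computed as R = V S U^T from the SVD of the 3x3 matrix B A^T, which is the minimizer of the RMSD for T(b) = t + R b.

```python
import numpy as np

def fit_rigid(A, B):
    """Least-squares rigid transform with a_i ~ t + R b_i for paired (n, 3) arrays."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    a_mean, b_mean = A.mean(axis=0), B.mean(axis=0)
    H = (B - b_mean).T @ (A - a_mean)      # 3x3 matrix: sum of (b_i - b)(a_i - a)^T
    U, _, Vt = np.linalg.svd(H)            # H = U D V^T
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U) * np.linalg.det(Vt))])
    R = Vt.T @ S @ U.T                     # proper rotation, det(R) = +1
    t = a_mean - R @ b_mean                # equals the mean of A if B is centred
    return R, t

def trial_and_error(A, B, seed, eps=2.0, max_iter=50):
    """Alternate between fitting T on the current correspondence set and
    rebuilding the set by proximity. `seed` is a list of (i, j) pairs
    (a_i <-> b_j) with at least 3 non-collinear points.
    Returns (R, t, correspondence) from the last fit."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    cs = sorted(seed)
    for _ in range(max_iter):
        R, t = fit_rigid(A[[i for i, _ in cs]], B[[j for _, j in cs]])
        moved = B @ R.T + t
        dist = np.linalg.norm(A[:, None, :] - moved[None, :, :], axis=2)
        # Assumed update rule: greedy one-to-one pairing by increasing distance.
        new_cs, used_a, used_b = [], set(), set()
        for flat in np.argsort(dist, axis=None):
            i, j = divmod(int(flat), dist.shape[1])
            if dist[i, j] > eps:
                break
            if i not in used_a and j not in used_b:
                new_cs.append((i, j))
                used_a.add(i)
                used_b.add(j)
        new_cs.sort()
        if new_cs == cs or len(new_cs) < 3:
            break                           # converged, or too few pairs to refit
        cs = new_cs
    return R, t, cs
```

The quality of the result depends heavily on the seed, which is why the seeding schemes introduced above matter: a seed drawn from dissimilar substructures gives a transform under which proximity pairs are essentially arbitrary.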

DALI

Figure 23. Distance Matrix vs Inferred Protein Structure

DALI makes use of a distance matrix to compute the seed set. As shown on the right-hand side of Figure 23, the Cα atoms of a protein are laid out along each axis of the distance matrix (140 Cα atoms in the above example). Each entry in the matrix gives the distance between a pair of Cα atoms within the molecule. The darkness of an entry reflects the closeness of the two Cα atoms. Areas along the main diagonal (the bisector) are darker, since a Cα atom is certainly close to itself.

We can use distance matrices to compute the seed set because intra-molecular distances are invariant under rigid-body transformations. So if we have the distance matrices for protein A and protein B, these two matrices are independent of the position and orientation of the two proteins.

If there is an α-helix in the protein, there will typically be a thick line along the diagonal of the matrix. Patches of thick lines along the diagonal denote the several α-helices in the protein. In the above example, the distance matrix shows that amino acids 1 to 40 form a helix and 45 to 80 form another helix. The thick line in the orange circle shows that the two helices are close to each other: if we take the last amino acid (85) of the blue helix and the first amino acid (1) of the red helix, the corresponding entry in the distance matrix is quite dark.

DALI looks for similar hexapeptides by searching for similar 7x7 Cα-Cα distance matrices (the green square illustrated in Figure 23) in the distance matrix of another protein. The seed set is generated from these matches.

LOCK

Unlike DALI, which uses only the carbon atoms along the protein backbone, LOCK uses the secondary structure elements (SSEs) as the initial seeds. This is due to a biological concern.
Most of the time, the types of atoms and amino acids in proteins vary considerably during evolution and thus are not a promising basis for correspondence. Secondary structure elements are better conserved, and they are responsible for most of the stability and functionality of proteins. Taking these biological facts into account, programs like LOCK combine information from two levels of features (see Figure 24):

Stage 1 (SSE alignment): Compute an initial alignment using SSEs represented as vectors. Only three SSEs match in the example shown in Figure 24.
Stage 2 (atom alignment): Refine the alignment using Cα atoms represented as points.

Figure 24. SSE alignment (Stage 1) and Atom alignment (Stage 2)

Before we go into the details of LOCK, we should note that taking SSEs into account for substructure alignment narrows down the possible applications. The resulting program will not be able to find small motifs at the atomic level. LOCK also has some deficiencies when it comes to many-to-many comparison.

SSEs and Their Vector-Based Representation

LOCK considers three kinds of SSEs: α-helices, β-strands and loops. α-helices and β-sheets are stabilized by hydrogen bonds between backbone oxygen and hydrogen atoms. Loops, on the other hand, are considered in a secondary way, as they do not have definite shapes. SSEs are represented as a triplet vector: (helix, strand, loop). See Figure 25.

Figure 25. A Typical Secondary Structure Element (α-helices, β-strands, loops)

The vectors are computed as follows:
1. LOCK uses the software DSSP [Kabsch and Sander, 1983] to classify residues into α-helices and β-strands.
2. Each α-helix and β-strand is represented as a vector with a start and an end point. Each end point is computed as a weighted average of backbone Cα atoms. For example, for an α-helix starting at residue i:

$$X_{\mathrm{origin}} = \frac{0.74\,X_i + X_{i+1} + X_{i+2} + 0.74\,X_{i+3}}{3.48}$$

where $X_i$ is the position of the Cα atom of residue i. The unusual weights come from the fact that consecutive residues of an α-helix are rotated by about 100 degrees around the helix axis; with these weights, the average of four consecutive Cα positions falls close to the axis. $X_{\mathrm{end}}$ and the end points of β-strands are computed similarly.

Preparatory Work: Scoring Similarity

Once the vectors of the two proteins are ready, LOCK scores the similarity between the two proteins.

Figure 26. Alignment of Two Proteins

To score similarity, one question to ask is: if i and p are aligned, what is the score of aligning k with r? Two types of differences are considered.

Position-independent differences (for example, in Figure 26):
   angle(i,k) - angle(p,r)
   angle(i,j) - angle(p,q)
   angle(j,k) - angle(q,r)
   distance(i,k) - distance(p,r)
   length(k) - length(r)

Position-dependent differences (used only when the two proteins are already reasonably well aligned):
   angle(k,r) (small if the two proteins are well aligned)
   distance(k,r)

Every difference is associated with a score, and the scores are additive over all the differences considered. In the formula below (Figure 27), $d_i$ is one of the differences listed above, $M_i$ is the maximal score obtainable from that difference, and $d_{i0}$ is the value of $d_i$ at which the score is zero. When the difference is zero, the maximal score is reached: $S(0) = 2M_i/(1+0) - M_i = M_i$. When $d_i$ equals $d_{i0}$, $S(d_{i0}) = 2M_i/(1+1) - M_i = 0$.
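The two computations described above, the weighted-average helix end point and the per-difference score S(d) = 2M/(1 + (d/d0)^2) - M, can be sketched as follows. This is an illustrative sketch, not LOCK's code: the function names are hypothetical, and the ideal-helix geometry used for the check (radius 2.3 Å, rise 1.5 Å per residue, 100 degrees per residue) is a textbook approximation assumed here, not a value taken from the lecture.

```python
import math
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float, float]

# Weights from the lecture's formula; they sum to 3.48.
WEIGHTS = (0.74, 1.0, 1.0, 0.74)


def helix_origin(ca: Sequence[Point], i: int) -> Point:
    """Weighted average of the C-alpha atoms of residues i..i+3."""
    total = sum(WEIGHTS)
    x, y, z = (
        sum(w * ca[i + k][axis] for k, w in enumerate(WEIGHTS)) / total
        for axis in range(3)
    )
    return (x, y, z)


def score(d: float, d0: float, m: float) -> float:
    """S(d) = 2M / (1 + (d/d0)^2) - M: equals M at d = 0 and 0 at d = d0."""
    return 2.0 * m / (1.0 + (d / d0) ** 2) - m


def total_score(differences: Iterable[Tuple[float, float, float]]) -> float:
    """Scores are additive: sum S(d_i) over (d_i, d_i0, M_i) triples."""
    return sum(score(d, d0, m) for d, d0, m in differences)


# Ideal helix along the z-axis (assumed geometry, for checking only).
ideal_helix = [
    (2.3 * math.cos(math.radians(100 * k)), 2.3 * math.sin(math.radians(100 * k)), 1.5 * k)
    for k in range(8)
]
origin = helix_origin(ideal_helix, 0)
# Distance of the computed point from the helix axis (the z-axis here).
off_axis = math.hypot(origin[0], origin[1])
```

On the ideal helix the computed point lies within a few thousandths of an ångström of the axis, which is the reason for the 0.74 weights. The score function is bounded: it never exceeds M and tends to -M as the difference grows, so a single badly matching feature cannot dominate the sum.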

$$S(d_i) = \frac{2M_i}{1 + (d_i/d_{i0})^2} - M_i, \qquad \mathrm{Score} = \sum_i S(d_i)$$

Figure 27. Calculation of Score ($M_i$: maximal score; $d_{i0}$: value of $d_i$ for which the score is 0)

Stage 1: Secondary Structure Elements Alignment

To assess an alignment of a pair of SSEs, LOCK uses a set of position-independent differences, for example the angles or distances between successive SSEs or the lengths of the SSE vectors. For every pair of SSE vectors of protein A, LOCK finds all pairs of vectors in B that align well with it; these generate the seed correspondence sets. For each seed correspondence set, LOCK computes the alignment transformation and applies it to all the SSEs of protein B. A dynamic programming algorithm (a variation of the Smith-Waterman algorithm) is used to find the correspondence set with maximal score. Finally, LOCK records the transform T and the correspondence set CS that yield the maximal score.

The dynamic programming algorithm works as follows. Suppose protein A has five SSEs, i, j, k, l, m, and protein B has five SSEs, p, q, r, s, t, and suppose that SSE i aligns well with p and j aligns well with q. To compute the remaining correspondences, we explore a tree. As illustrated in Figure 28, SSE k may correspond to r, s or t. LOCK computes the score of each candidate correspondence, e.g. (k, r). If the score is positive, LOCK descends in the tree and computes the score of the next correspondence, e.g. (l, s). LOCK keeps adding pairs to the correspondence set as long as the score continues to increase. LOCK allows a gap in only one structure at a time, which is why (k, r) has no child (m, t) in the tree: that step would create a gap in both A and B.

A = (i, j, k, l, m)
B = (p, q, r, s, t)
Seed correspondence: {(i,p), (j,q)}
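The tree exploration described above can be sketched as a memoized recursive search. This is a simplified illustration under stated assumptions, not LOCK's actual procedure: SSEs are referred to by index (i, j, k, l, m become 0..4, and likewise p..t), `pair_score` is a hypothetical callback that returns a fixed score for a candidate pair, and the transform is not recomputed as pairs are added. The function names are invented for this sketch.

```python
from functools import lru_cache
from typing import Callable, List, Tuple

Pair = Tuple[int, int]


def next_candidates(a: int, b: int, n_a: int, n_b: int) -> List[Pair]:
    """Pairs that may follow (a, b): a gap is allowed in A or in B, never in both."""
    cands: List[Pair] = []
    if a + 1 < n_a:
        # Next SSE of A paired with any later SSE of B (gap, if any, only in B).
        cands += [(a + 1, b2) for b2 in range(b + 1, n_b)]
    if b + 1 < n_b:
        # Next SSE of B paired with a later SSE of A, skipping SSEs of A only.
        cands += [(a2, b + 1) for a2 in range(a + 2, n_a)]
    return cands


def best_extension(
    last_seed_pair: Pair,
    n_a: int,
    n_b: int,
    pair_score: Callable[[int, int], float],
) -> Tuple[float, List[Pair]]:
    """Best-scoring chain of pairs that extends the seed correspondence."""

    @lru_cache(maxsize=None)
    def extend(a: int, b: int) -> Tuple[float, Tuple[Pair, ...]]:
        best: Tuple[float, Tuple[Pair, ...]] = (0.0, ())
        for a2, b2 in next_candidates(a, b, n_a, n_b):
            s = pair_score(a2, b2)
            if s <= 0:
                continue  # terminate this path: the new pair does not raise the score
            sub_score, sub_pairs = extend(a2, b2)
            if s + sub_score > best[0]:
                best = (s + sub_score, ((a2, b2),) + sub_pairs)
        return best

    total, pairs = extend(*last_seed_pair)
    return total, list(pairs)


# Toy scores (made up): every pair not listed scores -1.
toy = {(2, 2): 1.0, (3, 3): -1.0, (3, 4): 2.0, (4, 3): 0.5, (4, 4): 0.5}
result = best_extension((1, 1), 5, 5, lambda a, b: toy.get((a, b), -1.0))
```

With the seed {(i,p), (j,q)} encoded as last pair (1, 1), the children of (k, r) = (2, 2) are (3, 3), (3, 4) and (4, 3), matching the rule that (m, t) = (4, 4) cannot follow (k, r) directly. Memoizing on the last pair is valid only because the toy score of a pair does not depend on the path that led to it.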

Figure 28. DP to Compute Correspondence Set (tree rooted at the seed {(i,p), (j,q)}; first-level candidates are (k,r), (k,s), (k,t), (l,r) and (m,r), and deeper levels extend each path, e.g. (k,r) to (l,s), (l,t) or (m,s))

Notes on Figure 28:
   Simultaneous gaps in both structures are not allowed (not in SCOP2)
   Terminate a path when the score of a new correspondence is negative
   Re-compute new transform with each new correspondence (?)

Stage 2: Atom (Core) Alignment

With the correspondence set of SSEs in hand, we can compute a more precise alignment: the set of pairs of atoms from A and T(B) that are closest to each other and within a given threshold distance. Steps 2, 3 and 4 of the algorithm are similar to the earlier trial-and-error approach, in which we try to arrive at the best transformation and the largest correspondence set by iteratively computing the transform that minimizes the RMSD of the atoms in the CS.

Experiments and Results

To validate how LOCK copes with proteins whose sequences differ significantly, 685 proteins were selected from the PDB such that each pair has less than 25% sequence similarity. These proteins are drawn from three families of folds (myoglobins, TIM barrels and immunoglobulins). One query protein from each family is then compared against all the other proteins to find the other members of its family (685*3 = 2055 alignments). For each query, LOCK sorts the 685 structures by score and selects the top k proteins. The numbers of family members (true positives) and non-members (false positives) among them are shown in the sensitivity vs. specificity tables below.

[Tables: # True positives and # False positives among the top k results, for Myoglobins (11), TIM-barrels (50) and Immunoglobulins (38)]

The experiments show that, for the three folds considered, LOCK correctly identifies most of the members of each fold type. It performs especially well for myoglobins (the top 11 proteins are all true positives) and quite well for TIM-barrels (among the top 55 proteins selected, only 5 are not members of the family). However, there are still some false positives. Considering that immunoglobulins have diverse structures, the result is already quite good.

As for running time, an all-against-all comparison of proteins with LOCK was quite slow in the 1990s, so geometric hashing is used to speed up Stage 1. The rationale behind geometric hashing is to precompute the scores of pairs of SSEs once, instead of computing them many times.

Possible Improvements for LOCK

LOCK uses RMSD to refine transformations. However, a small RMSD does not necessarily imply similarity from a biological point of view. In general, similarity is assessed by computing a similarity measure S over the transforms between proteins A and B, where S is much more complex than RMSD. Usually, one computes the RMSD first and, once the best transformation is at hand, adjusts the transformation further to maximize the similarity measure. Another popular proximity-based technique for finding correspondences is Iterative Closest Point (ICP) [Besl and McKay, 1992].

Reference

[1] R. Singh, M. Saha, Identifying Structural Motifs in Proteins. Pacific Symposium on Biocomputing 8: , 2003
[2] L. Cowen, P. Bradley, M. Menke, J. King and B. Berger, Predicting the Beta-Helix Fold from Protein Sequence Data. J. Comput. Biol. 9(2), , 2002
[3] A.P. Singh and D.L. Brutlag, Hierarchical Protein Structure Superposition Using Both Secondary and Atomic Representations. Proc. ISMB, , 1997
[4] J. Shapiro and D.L. Brutlag, FoldMiner: Structural Motif Discovery Using an Improved Superposition Algorithm. Protein Science, 13: , 2004
[5] P.W. Finn, L.E. Kavraki, J.C. Latombe, R. Motwani, C. Shelton, S. Venkatasubramanian, and A. Yao, RAPID: Randomized Pharmacophore Identification for Drug Design. Computational Geometry: Theory and Applications, 10, , 1998
[6] A.G. Murzin, S.E. Brenner, T. Hubbard and C. Chothia, SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures. J. Mol. Biol. 247: , 1995
[7] C. Chothia, One thousand families for the molecular biologist. Nature 357: , 1992
[8] C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, J.M. Thornton, CATH - A Hierarchic Classification of Protein Domain Structures. Structure. Vol 5. No 8: , 1997
[9] L. Holm, C. Sander, Dali/FSSP classification of three-dimensional protein folds. Nucl. Acids Res. 25, , 1997
[10] L. Holm, C. Sander, Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, , 1993
[11] A.C.M. May, Towards more meaningful hierarchical classification of amino acids scoring function. Protein Engineering, 12: , 1999
[12] PDB Annual Report 2003
[13]
[14] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, P. Walter, Molecular Biology of the Cell, 4th Ed., Garland Science.


FlexSADRA: Flexible Structural Alignment using a Dimensionality Reduction Approach FlexSADRA: Flexible Structural Alignment using a Dimensionality Reduction Approach Shirley Hui and Forbes J. Burkowski University of Waterloo, 200 University Avenue W., Waterloo, Canada ABSTRACT A topic

More information

BIOCHEMISTRY Course Outline (Fall, 2011)

BIOCHEMISTRY Course Outline (Fall, 2011) BIOCHEMISTRY 402 - Course Outline (Fall, 2011) Number OVERVIEW OF LECTURE TOPICS: of Lectures INSTRUCTOR 1. Structural Components of Proteins G. Brayer (a) Amino Acids and the Polypeptide Chain Backbone...2

More information

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Course Name: Structural Bioinformatics Course Description: Instructor: This course introduces fundamental concepts and methods for structural

More information

From Amino Acids to Proteins - in 4 Easy Steps

From Amino Acids to Proteins - in 4 Easy Steps From Amino Acids to Proteins - in 4 Easy Steps Although protein structure appears to be overwhelmingly complex, you can provide your students with a basic understanding of how proteins fold by focusing

More information

Biochemistry Prof. S. DasGupta Department of Chemistry Indian Institute of Technology Kharagpur. Lecture - 06 Protein Structure IV

Biochemistry Prof. S. DasGupta Department of Chemistry Indian Institute of Technology Kharagpur. Lecture - 06 Protein Structure IV Biochemistry Prof. S. DasGupta Department of Chemistry Indian Institute of Technology Kharagpur Lecture - 06 Protein Structure IV We complete our discussion on Protein Structures today. And just to recap

More information

HIV protease inhibitor. Certain level of function can be found without structure. But a structure is a key to understand the detailed mechanism.

HIV protease inhibitor. Certain level of function can be found without structure. But a structure is a key to understand the detailed mechanism. Proteins are linear polypeptide chains (one or more) Building blocks: 20 types of amino acids. Range from a few 10s-1000s They fold into varying three-dimensional shapes structure medicine Certain level

More information

Alpha-helical Topology and Tertiary Structure Prediction of Globular Proteins Scott R. McAllister Christodoulos A. Floudas Princeton University

Alpha-helical Topology and Tertiary Structure Prediction of Globular Proteins Scott R. McAllister Christodoulos A. Floudas Princeton University Alpha-helical Topology and Tertiary Structure Prediction of Globular Proteins Scott R. McAllister Christodoulos A. Floudas Princeton University Department of Chemical Engineering Program of Applied and

More information

F. Piazza Center for Molecular Biophysics and University of Orléans, France. Selected topic in Physical Biology. Lecture 1

F. Piazza Center for Molecular Biophysics and University of Orléans, France. Selected topic in Physical Biology. Lecture 1 Zhou Pei-Yuan Centre for Applied Mathematics, Tsinghua University November 2013 F. Piazza Center for Molecular Biophysics and University of Orléans, France Selected topic in Physical Biology Lecture 1

More information

RNA and Protein Structure Prediction

RNA and Protein Structure Prediction RNA and Protein Structure Prediction Bioinformatics: Issues and Algorithms CSE 308-408 Spring 2007 Lecture 18-1- Outline Multi-Dimensional Nature of Life RNA Secondary Structure Prediction Protein Structure

More information

Reconstructing Amino Acid Interaction Networks by an Ant Colony Approach

Reconstructing Amino Acid Interaction Networks by an Ant Colony Approach Author manuscript, published in "Journal of Computational Intelligence in Bioinformatics 2, 2 (2009) 131-146" Reconstructing Amino Acid Interaction Networks by an Ant Colony Approach Omar GACI and Stefan

More information

Molecular Modelling. part of Bioinformatik von RNA- und Proteinstrukturen. Sonja Prohaska. Leipzig, SS Computational EvoDevo University Leipzig

Molecular Modelling. part of Bioinformatik von RNA- und Proteinstrukturen. Sonja Prohaska. Leipzig, SS Computational EvoDevo University Leipzig part of Bioinformatik von RNA- und Proteinstrukturen Computational EvoDevo University Leipzig Leipzig, SS 2011 Protein Structure levels or organization Primary structure: sequence of amino acids (from

More information

Bio nformatics. Lecture 23. Saad Mneimneh

Bio nformatics. Lecture 23. Saad Mneimneh Bio nformatics Lecture 23 Protein folding The goal is to determine the three-dimensional structure of a protein based on its amino acid sequence Assumption: amino acid sequence completely and uniquely

More information

The protein folding problem consists of two parts:

The protein folding problem consists of two parts: Energetics and kinetics of protein folding The protein folding problem consists of two parts: 1)Creating a stable, well-defined structure that is significantly more stable than all other possible structures.

More information

Design of a Novel Globular Protein Fold with Atomic-Level Accuracy

Design of a Novel Globular Protein Fold with Atomic-Level Accuracy Design of a Novel Globular Protein Fold with Atomic-Level Accuracy Brian Kuhlman, Gautam Dantas, Gregory C. Ireton, Gabriele Varani, Barry L. Stoddard, David Baker Presented by Kate Stafford 4 May 05 Protein

More information

Protein Dynamics. The space-filling structures of myoglobin and hemoglobin show that there are no pathways for O 2 to reach the heme iron.

Protein Dynamics. The space-filling structures of myoglobin and hemoglobin show that there are no pathways for O 2 to reach the heme iron. Protein Dynamics The space-filling structures of myoglobin and hemoglobin show that there are no pathways for O 2 to reach the heme iron. Below is myoglobin hydrated with 350 water molecules. Only a small

More information

D Dobbs ISU - BCB 444/544X 1

D Dobbs ISU - BCB 444/544X 1 11/7/05 Protein Structure: Classification, Databases, Visualization Announcements BCB 544 Projects - Important Dates: Nov 2 Wed noon - Project proposals due to David/Drena Nov 4 Fri PM - Approvals/responses

More information

Computational Molecular Modeling

Computational Molecular Modeling Computational Molecular Modeling Lecture 1: Structure Models, Properties Chandrajit Bajaj Today s Outline Intro to atoms, bonds, structure, biomolecules, Geometry of Proteins, Nucleic Acids, Ribosomes,

More information

The Structure and Functions of Proteins

The Structure and Functions of Proteins Wright State University CORE Scholar Computer Science and Engineering Faculty Publications Computer Science and Engineering 2003 The Structure and Functions of Proteins Dan E. Krane Wright State University

More information

Principles of Physical Biochemistry

Principles of Physical Biochemistry Principles of Physical Biochemistry Kensal E. van Hold e W. Curtis Johnso n P. Shing Ho Preface x i PART 1 MACROMOLECULAR STRUCTURE AND DYNAMICS 1 1 Biological Macromolecules 2 1.1 General Principles

More information

Automated Assignment of Backbone NMR Data using Artificial Intelligence

Automated Assignment of Backbone NMR Data using Artificial Intelligence Automated Assignment of Backbone NMR Data using Artificial Intelligence John Emmons στ, Steven Johnson τ, Timothy Urness*, and Adina Kilpatrick* Department of Computer Science and Mathematics Department

More information

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier *

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier * Protein Structure Prediction Using Multiple Artificial Neural Network Classifier * Hemashree Bordoloi and Kandarpa Kumar Sarma Abstract. Protein secondary structure prediction is the method of extracting

More information

Orientational degeneracy in the presence of one alignment tensor.

Orientational degeneracy in the presence of one alignment tensor. Orientational degeneracy in the presence of one alignment tensor. Rotation about the x, y and z axes can be performed in the aligned mode of the program to examine the four degenerate orientations of two

More information

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror Protein structure prediction CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror 1 Outline Why predict protein structure? Can we use (pure) physics-based methods? Knowledge-based methods Two major

More information

BIRKBECK COLLEGE (University of London)

BIRKBECK COLLEGE (University of London) BIRKBECK COLLEGE (University of London) SCHOOL OF BIOLOGICAL SCIENCES M.Sc. EXAMINATION FOR INTERNAL STUDENTS ON: Postgraduate Certificate in Principles of Protein Structure MSc Structural Molecular Biology

More information

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinff18.html Proteins and Protein Structure

More information

STRUCTURAL BIOLOGY AND PATTERN RECOGNITION

STRUCTURAL BIOLOGY AND PATTERN RECOGNITION STRUCTURAL BIOLOGY AND PATTERN RECOGNITION V. Cantoni, 1 A. Ferone, 2 O. Ozbudak, 3 and A. Petrosino 2 1 University of Pavia, Department of Electrical and Computer Engineering, Via A. Ferrata, 1, 27, Pavia,

More information

Model Mélange. Physical Models of Peptides and Proteins

Model Mélange. Physical Models of Peptides and Proteins Model Mélange Physical Models of Peptides and Proteins In the Model Mélange activity, you will visit four different stations each featuring a variety of different physical models of peptides or proteins.

More information

Outline. Levels of Protein Structure. Primary (1 ) Structure. Lecture 6:Protein Architecture II: Secondary Structure or From peptides to proteins

Outline. Levels of Protein Structure. Primary (1 ) Structure. Lecture 6:Protein Architecture II: Secondary Structure or From peptides to proteins Lecture 6:Protein Architecture II: Secondary Structure or From peptides to proteins Margaret Daugherty Fall 2004 Outline Four levels of structure are used to describe proteins; Alpha helices and beta sheets

More information

AP Biology. Proteins. AP Biology. Proteins. Multipurpose molecules

AP Biology. Proteins. AP Biology. Proteins. Multipurpose molecules Proteins Proteins Multipurpose molecules 2008-2009 1 Proteins Most structurally & functionally diverse group Function: involved in almost everything u enzymes (pepsin, DNA polymerase) u structure (keratin,

More information

Assignment 2 Atomic-Level Molecular Modeling

Assignment 2 Atomic-Level Molecular Modeling Assignment 2 Atomic-Level Molecular Modeling CS/BIOE/CME/BIOPHYS/BIOMEDIN 279 Due: November 3, 2016 at 3:00 PM The goal of this assignment is to understand the biological and computational aspects of macromolecular

More information

Docking. GBCB 5874: Problem Solving in GBCB

Docking. GBCB 5874: Problem Solving in GBCB Docking Benzamidine Docking to Trypsin Relationship to Drug Design Ligand-based design QSAR Pharmacophore modeling Can be done without 3-D structure of protein Receptor/Structure-based design Molecular

More information