Pairwise Sequence Alignment. Robert Giegerich and David Wheeler. Version 2.01 (2 typos corrected w.r.t Version 2.0) May 21, 1996.


An HTML version is available at [...]. Instructions for obtaining the Solution Sheet are available by sending a message to [...] with no subject line and the following message body:

subscribe vsns-bcd-solutions

VSNS-BCD Copyright. Copyright 1995/1996 by the author, who agreed to the following conditions for all time: The text may be redistributed verbatim, in whole or in part, for research and educational purposes only. If distributed in part, it must include this copyright notice and a pointer (URL) to the full text. It may not be sold, or placed in something else for sale, without explicit permission in writing. The text is provided as is, without any express or implied warranty.

Contents

1 Distance and Similarity
  1.1 Introduction
  1.2 Alphabets and Sequences
  1.3 Edit Distances
2 Pairwise Alignment via Dynamic Programming
  2.1 Calculating Edit Distances and Optimal Alignments
  2.2 A Word on the Dynamic Programming Paradigm
  2.3 A Word on Scoring Functions and Related Notions
3 Weight Matrices for Sequence Similarity Scoring
4 Realistic Gap Models
5 Variations of Pairwise Alignment
  5.1 Local Alignment and Local Similarity
  5.2 Heuristic Methods
6 Appendix
  6.1 The DNA alphabet
  6.2 The extended genetic alphabet
  6.3 The single-letter amino-acid code
  6.4 Abstract alphabets
  6.5 Some examples of taking subsequences
  6.6 Some simple properties of subsequences
  6.7 Metric axioms
  6.8 Ambiguity of optimal alignment
7 Some Recommended Reading

1 Distance and Similarity

1.1 Introduction

This chapter is about sequence similarity. Let us start with a warning: there is no unique, precise, and universally applicable notion of similarity. Consider the words "beer", "here" and "hear". They sound very similar (if not the same), are somewhat similar in their spelling, and are totally unrelated in their meaning. The situation is a little more regular in molecular biology: proteins and DNA can be similar with respect to their function, their structure, or their primary sequence of amino or nucleic acids. The general rule is that sequence determines shape, and shape determines function. So when we study sequence similarity, we eventually hope to discover or validate similarity in shape and function. This approach is often successful. However, there are many examples where two sequences have little or no similarity, but still the molecules fold into the same shape and share the same function.

The present chapter speaks of neither shape nor function. Sequences are seen as strings of characters. In fact, the ideas and techniques we discuss have important applications in text processing, too.

Similarity has both a quantitative and a qualitative aspect. A similarity measure gives a quantitative answer, saying that two sequences show a certain degree of similarity. An alignment is a mutual arrangement of two sequences, which is a sort of qualitative answer: it exhibits where the two sequences are similar and where they differ. An optimal alignment, of course, is one that exhibits the most correspondences and the fewest differences.

1.2 Alphabets and Sequences

Generally, an alphabet is a supply of symbols or a set of characters from which sequences are composed.
Some important alphabets are

- the 4-letter DNA alphabet (see Section 6.1),
- the extended genetic alphabet or IUPAC code (see Section 6.2),
- the single-letter amino acid code (see Section 6.3),
- some smaller alphabets which are abstractions of the above (see Section 6.4).

All the techniques we consider are independent of the particular alphabet used. We shall use the following notation:

- $A$ is the alphabet,
- $A^*$ is the set of all finite sequences of characters from $A$,
- $a, b, c$ are variables denoting individual characters,
- $s, t, u, v$ are variables denoting sequences from $A^*$,
- $|s|$ denotes the length of the sequence $s$ (in characters).

To address subsequences and individual characters in a sequence $s$, we mark the boundaries between the characters of $s$ by numbers, as shown here for s = ACCACGTA:

  0 A 1 C 2 C 3 A 4 C 5 G 6 T 7 A 8

By $i{:}s{:}j$ we denote the subsequence of $s$ between positions $i$ and $j$. Of course, $0 \le i \le j \le |s|$. Subsequences $0{:}s{:}i$ are called prefixes, and subsequences $i{:}s{:}|s|$ are called suffixes of $s$. (A few examples of the use of this notation are given in Section 6.5.)

Two important special cases are $i = j$ and $i = j - 1$:

- $i{:}s{:}i$ denotes the empty sequence (with 0 characters),
- $(j-1){:}s{:}j$ denotes the $j$-th character in $s$. Sometimes we shall abbreviate this and simply write $s_j$ for $(j-1){:}s{:}j$.

The concatenation of two sequences $s$ and $t$ is denoted $s \mathbin{+\!+} t$, e.g. ACC ++ CTA = ACCCTA. There are quite a few convenient properties of this subsequence notation (see Section 6.6).

1.3 Edit Distances

Two ways are used to quantify the similarity of two sequences:

- A similarity measure is a function that associates a numeric value with a pair of sequences, with the idea that a higher value indicates greater similarity. Beyond this, similarity measures vary widely, and care must be taken when interpreting them.
- The notion of distance is somewhat dual to similarity. It treats sequences as points in a metric space. A distance measure is a function that also associates a numeric value with a pair of sequences, but with the idea that the larger the distance, the smaller the similarity, and vice versa. Distance measures usually satisfy the mathematical axioms of a metric (see Section 6.7 for more information). In particular, distance values are never negative.

In most cases, distance and similarity measures are interchangeable in the sense that a small distance means high similarity, and vice versa.
Sometimes similarity measures are a little more flexible (see the section on "local similarity").

Maybe the simplest notion of distance is the so-called Hamming distance: for two sequences of equal length, we just count the character positions in which they differ. For example:

  sequence s               AAT    AGCAA    AGCACACA
  sequence t               TAA    ACATA    ACACACTA
  HammingDistance(s, t)    2      3        6
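The counting rule behind the Hamming distance is short enough to write down directly. The following is a minimal Python sketch; the function name `hamming_distance` is chosen here for illustration and does not come from the original text.

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        # The Hamming distance is only defined for sequences of equal length.
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))


# The three sequence pairs from the example table.
for s, t in [("AAT", "TAA"), ("AGCAA", "ACATA"), ("AGCACACA", "ACACACTA")]:
    print(s, t, hamming_distance(s, t))
```

The explicit length check reflects the restriction stated in the definition; sequences of different length need the edit-distance machinery instead.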

This distance measure is very useful in some cases, but in general it is not flexible enough. First of all, the sequences may have different lengths. Second, there is generally no fixed correspondence between their character positions. In the mechanism of DNA replication, errors like deleting or inserting a nucleotide are not unusual. Although the rest of the sequences may be identical, such a shift of position leads to exaggerated values in the Hamming distance. Look at the rightmost example above. The Hamming distance says that s and t are apart by 6 characters (out of 8). On the other hand, by deleting G from s and T from t, both become equal to ACACACA. In this sense, they are only two characters apart!

Let us instead model the distance of s and t by considering the simple, one-character edit operations that turn s into t. We introduce a gap character "-" and say that the pair

- $(a, a)$ denotes a match (no change from s to t),
- $(a, -)$ denotes deletion of character $a$ (in s),
- $(a, b)$ denotes replacement of $a$ (in s) by $b$ (in t), where $a \ne b$,
- $(-, b)$ denotes insertion of character $b$ (in s).

Since the problem is symmetric in s and t, a deletion in s can be seen as an insertion in t, and vice versa. An alignment of two sequences s and t is an arrangement of s and t by position, where s and t can be padded with gap symbols to achieve the same length:

  s:  A G C A C A C - A          s:  A G - C A C A C A
  t:  A - C A C A C T A    or    t:  A C A C A C T - A

If we read the alignment column-wise, we have a protocol of edit operations that lead from s to t.

  Left:                Right:
  Match   (A, A)       Match   (A, A)
  Delete  (G, -)       Replace (G, C)
  Match   (C, C)       Insert  (-, A)
  Match   (A, A)       Match   (C, C)
  Match   (C, C)       Match   (A, A)
  Match   (A, A)       Match   (C, C)
  Match   (C, C)       Replace (A, T)
  Insert  (-, T)       Delete  (C, -)
  Match   (A, A)       Match   (A, A)

The left-hand alignment shows one Delete and one Insert; the other edit operations are Matches.

The right-hand alignment shows one Insert, one Delete, two Replaces, and some trivial ones (Matches).

Next we turn the edit protocol into a measure of distance by assigning a "cost" or "weight" $w$ to each operation. For example, for arbitrary characters $a, b$ from $A$ we may define

- $w(a, a) = 0$,
- $w(a, b) = 1$ for $a \ne b$,
- $w(a, -) = w(-, b) = 1$.

This scheme is known as the Levenshtein distance, also called the unit cost model. Its predominant virtue is its simplicity. In general, more sophisticated cost models must be used. For example, replacing an amino acid by a biochemically similar one should weigh less than a replacement by an amino acid with totally different properties. Section 3 is exclusively dedicated to the choice of cost models.

Now we are ready to define the most important notion for sequence analysis:

- The cost of an alignment of two sequences s and t is the sum of the costs of all the edit operations that lead from s to t.
- An optimal alignment of s and t is an alignment which has minimal cost among all possible alignments.
- The edit distance of s and t is the cost of an optimal alignment of s and t under a cost function $w$. We denote it by $d_w(s, t)$.

Using the unit cost model for $w$ in our previous example, we obtain the following costs:

  s:  A G C A C A C - A          s:  A G - C A C A C A
  t:  A - C A C A C T A    or    t:  A C A C A C T - A
  cost: 2                        cost: 4

Here it is easily seen that the left-hand alignment is optimal under the unit cost model, and hence the edit distance is $d_w(s, t) = 2$.

At this point, you should improve your understanding by doing some exercises.

Some Exercises involving Edit Distances

(1) Verify: For arbitrary sequences $s, t$ and an arbitrary cost function $w$, $d_w(s, t)$ is always unique, but their optimal alignment is not.

(2) What is the edit distance of the words "INDUSTRY" and "INTEREST" under the unit cost model? What of "CCT" and "ACGCTT"? Show an optimal alignment for each case.

(3) If $|s| < n + 1$, $|t| < n + 1$ and $w$ is the unit cost model, we always have $d_w(s, t) < n + 1$. Why?

(4) Can you think of a weight function that weights replacements so highly that an optimal alignment will never show replacements? (This is simple!)

(5) Return to the alignment example given in the text. Try to devise a different cost function $w$ such that the alignment on the right-hand side has smaller cost than the one on the left.

(6) Consider the gap symbol "-" as a regular character. What is the difference between $d_w(s, t)$ and HammingDistance$(s', t')$, where $s'$ and $t'$ are equal to $s, t$, but expanded by gap symbols according to an optimal alignment?

2 Pairwise Alignment via Dynamic Programming

2.1 Calculating Edit Distances and Optimal Alignments

The number of possible alignments between two sequences is gigantic, and unless the weight function is very simple, it may seem difficult to pick out an optimal alignment. Fortunately, there is an easy and systematic way to find it. The algorithm described now is very famous in biocomputing; it is usually called "the dynamic programming algorithm".

Consider two prefixes $0{:}s{:}i$ and $0{:}t{:}j$, with $i, j \ge 1$. Let us assume we already know optimal alignments between all shorter prefixes of s and t, in particular of

(1) $0{:}s{:}(i-1)$ and $0{:}t{:}(j-1)$,
(2) $0{:}s{:}(i-1)$ and $0{:}t{:}j$,
(3) $0{:}s{:}i$ and $0{:}t{:}(j-1)$.

An optimal alignment of $0{:}s{:}i$ and $0{:}t{:}j$ must be an extension of one of the above by

(1) a Replacement $(s_i, t_j)$ or a Match $(s_i, t_j)$, depending on whether $s_i = t_j$,
(2) a Deletion $(s_i, -)$, or
(3) an Insertion $(-, t_j)$.

We simply have to choose the minimum:

\[
d_w(0{:}s{:}i,\; 0{:}t{:}j) = \min \left\{
\begin{array}{l}
d_w(0{:}s{:}(i-1),\; 0{:}t{:}(j-1)) + w(s_i, t_j), \\
d_w(0{:}s{:}(i-1),\; 0{:}t{:}j) + w(s_i, -), \\
d_w(0{:}s{:}i,\; 0{:}t{:}(j-1)) + w(-, t_j)
\end{array}
\right\}
\]

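There is no choice when one of the prefixes is empty, i.e. $i = 0$, or $j = 0$, or both:

\[
\begin{array}{ll}
d_w(0{:}s{:}0,\; 0{:}t{:}0) = 0 & \\
d_w(0{:}s{:}i,\; 0{:}t{:}0) = d_w(0{:}s{:}(i-1),\; 0{:}t{:}0) + w(s_i, -) & \text{for } i = 1, \ldots, m \\
d_w(0{:}s{:}0,\; 0{:}t{:}j) = d_w(0{:}s{:}0,\; 0{:}t{:}(j-1)) + w(-, t_j) & \text{for } j = 1, \ldots, n
\end{array}
\]

According to this scheme, and for a given $w$, the edit distances of all prefixes of s and t define an $(m + 1) \times (n + 1)$ distance matrix $D = (d_{i,j})$ with $d_{i,j} = d_w(0{:}s{:}i,\; 0{:}t{:}j)$. The three-way choice in the minimization formula for $d_{i,j}$ leads to the following pattern of dependencies between matrix elements:

\[
\begin{array}{ccc}
d_{i-1,j-1} & & d_{i-1,j} \\
 & \searrow & \downarrow \\
d_{i,j-1} & \rightarrow & d_{i,j}
\end{array}
\]

The bottom right corner of the distance matrix contains the desired result: $d_{m,n} = d_w(0{:}s{:}m,\; 0{:}t{:}n) = d_w(s, t)$.

This is the distance matrix for our previous example with s = AGCACACA and t = ACACACTA under the unit cost model (rows are indexed by s, columns by t):

          -  A  C  A  C  A  C  T  A
      -   0  1  2  3  4  5  6  7  8
      A   1  0  1  2  3  4  5  6  7
      G   2  1  1  2  3  4  5  6  7
      C   3  2  1  2  2  3  4  5  6
      A   4  3  2  1  2  2  3  4  5
      C   5  4  3  2  1  2  2  3  4
      A   6  5  4  3  2  1  2  3  3
      C   7  6  5  4  3  2  1  2  3
      A   8  7  6  5  4  3  2  2  2

A second diagram repeats this matrix with the optimal path drawn in. The path runs through the entries $d_{0,0}, d_{1,1}, d_{2,1}, d_{3,2}, d_{4,3}, d_{5,4}, d_{6,5}, d_{7,6}, d_{7,7}, d_{8,8}$.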
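The recursion and its border cases translate almost line by line into code. The following Python sketch fills the distance matrix for the running example; it assumes the unit cost model and writes the gap as "-". The names `unit_cost` and `distance_matrix` are illustrative and not part of the original text.

```python
def unit_cost(a: str, b: str) -> int:
    """Unit cost model: a match costs 0, every other edit operation costs 1."""
    return 0 if a == b else 1


def distance_matrix(s: str, t: str, w=unit_cost):
    """Return the (m+1) x (n+1) matrix D with D[i][j] = d_w(0:s:i, 0:t:j)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Border cases: one of the two prefixes is empty.
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w(s[i - 1], "-")
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w("-", t[j - 1])
    # General case, computed line by line, each line from left to right.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + w(s[i - 1], t[j - 1]),  # match / replacement
                d[i - 1][j] + w(s[i - 1], "-"),           # deletion
                d[i][j - 1] + w("-", t[j - 1]),           # insertion
            )
    return d


matrix = distance_matrix("AGCACACA", "ACACACTA")
for row in matrix:
    print(" ".join(f"{value:2d}" for value in row))
print("edit distance:", matrix[-1][-1])
```

Note that Python strings are indexed from 0, so the character written $s_i$ in the text is `s[i - 1]` in the code.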

In the second diagram, we have drawn a path through the distance matrix indicating which case was chosen when taking the minimum. A diagonal line means Replacement or Match, a vertical line means Deletion, and a horizontal line means Insertion. Thus, this path indicates the edit operation protocol of the optimal alignment with $d_w(s, t) = 2$. Note that in some cases the minimal choice is not unique, and different paths could have been drawn which indicate alternative optimal alignments. Another example is given in Section 6.8.

In which order should we calculate the matrix entries? The only constraint is the above pattern of dependencies. The most common order of calculation is line by line (each line from left to right), or column by column (each column from top to bottom).

Some Exercises involving Dynamic Programming

(1) Find out the cost model used by the BioMOO aligner. Calculate a dynamic programming matrix and alignment for the sequences ATT and TTC. Check your results using the BioMOO alignment, i.e. type "opt_align ATT TTC matrix with #90" on the MOO. (You can also use the WWW-Interface, see [...].) How many optimal alignments are there?

(2) The number of possible alignments is described as "gigantic". How many are there for the sequences ATT and TTC? (Extra Credit.) If you wish to devise a formula for the number of alignments, which method can be used to enumerate them systematically? Devise such a formula.

2.2 A Word on the Dynamic Programming Paradigm

"Dynamic Programming" is a very general programming technique. It is applicable when a large search space can be structured into a succession of stages, such that

- the initial stage contains trivial solutions to sub-problems,
- each partial solution in a later stage can be calculated by recurring on only a fixed number of partial solutions in an earlier stage,
- the final stage contains the overall solution.
This applies to our distance matrix: the columns are the stages, the first column is trivial, and the final one contains the overall result. A matrix entry $d_{i,j}$ is the partial solution $d_w(0{:}s{:}i,\; 0{:}t{:}j)$ and can be determined from two solutions in the previous column, $d_{i-1,j-1}$ and $d_{i,j-1}$, plus one in the same column, namely $d_{i-1,j}$.

Since calculating edit distances is the predominant approach to sequence comparison, some people simply call this THE dynamic programming algorithm. Just note that the dynamic programming paradigm has many other applications as well, even within bioinformatics.
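The path-following idea from Section 2.1 (diagonal step for Match or Replacement, vertical step for Deletion, horizontal step for Insertion) can also be sketched in code. The version below is a minimal illustration under the unit cost model; the function names are hypothetical, and the tie-breaking order (diagonal first, then deletion, then insertion) is an arbitrary choice made here, so only one of possibly several optimal alignments is returned.

```python
def unit_cost(a: str, b: str) -> int:
    """Unit cost model: a match costs 0, every other edit operation costs 1."""
    return 0 if a == b else 1


def optimal_alignment(s: str, t: str, w=unit_cost):
    """Return (padded_s, padded_t, cost) for one optimal alignment of s and t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w(s[i - 1], "-")
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w("-", t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + w(s[i - 1], t[j - 1]),
                d[i - 1][j] + w(s[i - 1], "-"),
                d[i][j - 1] + w("-", t[j - 1]),
            )
    # Walk back from the bottom right corner, re-checking which case
    # produced the minimum. Exact equality is safe for integer costs;
    # with floating-point weights a tolerance would be needed.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + w(s[i - 1], t[j - 1]):
            top.append(s[i - 1]); bottom.append(t[j - 1])   # diagonal step
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + w(s[i - 1], "-"):
            top.append(s[i - 1]); bottom.append("-")        # vertical step
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])        # horizontal step
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), d[m][n]


row_s, row_t, cost = optimal_alignment("AGCACACA", "ACACACTA")
print(row_s)
print(row_t)
print("cost:", cost)
```

For the running example this recovers the left-hand alignment from Section 1.3. Storing the whole matrix is what makes the walk back possible.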

2.3 A Word on Scoring Functions and Related Notions

Many authors use different words for essentially the same idea: scores, weights, costs, distance and similarity functions all attribute a numeric value to a pair of sequences.

- "Distance" should only be used when the metric axioms are satisfied. In particular, distance values are never negative. The optimal alignment minimizes distance.
- The term "costs" usually implies positive values, with the overall cost to be minimized. However, metric axioms are not assumed.
- "Weights" and "scores" can be positive or negative. The most popular use is that a high score is good, i.e. it indicates a lot of similarity. Hence, the optimal alignments maximize scores.
- The term "similarity" immediately implies that large values are good, i.e. an optimal alignment maximizes similarity. Intuitively, one would expect that similarity values should not be negative (what is less than zero similarity?). But don't be surprised to see negative similarity scores shortly.

Mathematically, distances are a little more tractable than the others. In terms of programming, general scoring functions are a little more flexible. For example, the algorithm for local similarity presented in Section 5.1 depends on the use of both positive and negative scores. The accumulated score of two subsequences may rise above the threshold value, and may fall below it after encountering some negative scores.

Let us close with another caveat concerning the influence of sequence length on similarity. Let us just count exact matches and let us assume that two sequences of length $n$ and $m$, respectively, have 99 exact matches. Let $c_w(s, t)$ be the similarity score calculated for s and t under this cost model. So, $c_w(s, t) = 99$. What this means depends on $n$ and $m$: If $n = m = 100$, the sequences are very similar, almost identical. If $n = m = 1000$, we have only about 10% identity!
(Two typos were corrected in this paragraph on Wed May 15 17:06:38 MDT 1996.)

So if we relate sequences of varying length, it makes sense to use length-relative scores: rather than $c_w(s, t)$ we use $c_w(s, t)/(n + m)$ for sequence comparison.

3 Weight Matrices for Sequence Similarity Scoring

This section is available at [...]. A PostScript version of it is available at [...].

Some Exercises involving Weight Matrices

(Author: Georg Fuellen. Further exercises are planned.)

(1) Why does 2 PAM, i.e. 1 PAM multiplied with itself, not correspond to exactly 2% of the amino acids having mutated, but a little less than 2%? Or, in other words, why does a 250 PAM matrix not correspond to 250% accepted mutations?

(2) Is it biologically plausible that the C-C and W-W entries in the matrix are the most prominent? Which entries (or groups of entries) are the least prominent?

(3) Why do some people talk about the 256 PAM matrix, and not the 250 PAM matrix? What's so special about the value 256?

4 Realistic Gap Models

So far, we have treated the gap symbol "-" as yet another character, denoting an individual insertion or deletion. However, this view is not always adequate.

Sometimes we want no-gap alignments. For example, in a family of proteins there may be a strongly conserved subunit which is the site of some protein-protein interaction. Any deletion or insertion in the chain of amino acids would be likely to destroy its biochemical function. Such regions we want to align using matches and replacements only. This of course can be achieved by a very simple algorithm. But our dynamic programming algorithm can also be geared to do this by setting the costs for insertion and deletion to infinity (or something close to it). Hence, an optimal alignment will not use gaps.

Sometimes, from an evolutionary point of view, it is more realistic to assume that nature inserts or deletes entire substrings as a unit. This is called the block-indel model. It means that we charge a certain set-up cost for introducing a new gap, whereas extending an existing gap is less expensive. For example, a linear gap cost function is of the form $\mathrm{gapcost}(n) = g_i + (n - 1)\, g_e$, where $n$ is the length of the gap (number of consecutive "-"), and $g_i > g_e \ge 0$. Our dynamic programming scheme can be adapted to this model without much effect on its efficiency.
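To make the block-indel charging rule concrete, the following Python sketch evaluates the cost of a given alignment (it does not search for an optimal one). It assumes unit cost for replacements, writes the gap as "-", and treats each maximal run of gap symbols in one row as a single gap. The function names and the example alignment ACGTTTA / AC---TA are made up for illustration and do not come from the original text.

```python
def gap_cost(n: int, g_init: float, g_ext: float) -> float:
    """gapcost(n) = g_i + (n - 1) * g_e for a gap of length n >= 1."""
    if n < 1:
        raise ValueError("a gap has length at least 1")
    return g_init + (n - 1) * g_ext


def block_indel_cost(row_s: str, row_t: str,
                     g_init: float = 1.0, g_ext: float = 0.2) -> float:
    """Cost of an alignment given as two equally long rows padded with '-'."""
    if len(row_s) != len(row_t):
        raise ValueError("alignment rows must have equal length")
    cost = 0.0
    open_gap = None  # which row ('s' or 't') currently has an open gap, if any
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("a column must not consist of two gap symbols")
        if a == "-" or b == "-":
            side = "s" if a == "-" else "t"
            # Extending an open gap in the same row is cheaper than opening one.
            cost += g_ext if open_gap == side else g_init
            open_gap = side
        else:
            cost += 0.0 if a == b else 1.0  # match or replacement at unit cost
            open_gap = None
    return cost


print(gap_cost(3, 1.0, 0.2))                    # one gap of length 3
print(block_indel_cost("ACGTTTA", "AC---TA"))   # same gap inside an alignment
```

With $g_i = g_e = 1$ the function reduces to the plain unit cost model, which is a convenient sanity check.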
(A typo in this section was corrected Tue May 21 01:08:57 MET DST 1996.)

Some Exercises involving Gap Cost Models

(1) Consider the alignments of CAAAAGAT and CGAGGGGT shown below. Calculate the cost of each under
a) the unit cost model,
b) the unit cost model with block-indels, with $g_i = 1$ and $g_e = 0.2$.

  (1)  C A A A A G A T
       C G A G G G G T
       cost a) =           cost b) =

  (2)  C A A A A G A - - T
       C - - G A G G G G T
       cost a) =           cost b) =

  (3)  C A A A A G A - - - - T
       C - - - - G A G G G G T
       cost a) =           cost b) =

(2) (Extra Credit.) Design the dynamic programming (recursive) formula for the block-indel case.

5 Variations of Pairwise Alignment

5.1 Local Alignment and Local Similarity

The notion of edit distance and its implementation via dynamic programming are easily adapted to variations of the original problem. Two such variations are discussed here. We first discuss the problem of local alignment, which is also called approximate pattern matching. We then turn to the problem of local similarity, which (unfortunately) has often been called "local alignment".

Consider the problem of locating the famous TATAAT-box (a bacterial promoter) in a piece of DNA. Typically, this motif does not occur exactly as spelled here, but may show up with one or two bases changed. Local alignment is the problem where s is relatively short with respect to t and we seek the subunit of t with which s aligns best. More precisely: Given $0{:}s{:}m$ and $0{:}t{:}n$, find $i{:}t{:}j$ such that $d_w(s,\; i{:}t{:}j)$ is minimal among all choices of $0 \le i \le j \le n$.

This is also called the approximate matching problem (of s against t), and very sophisticated methods have been developed to do this efficiently. But a simple change to the dynamic programming scheme will also do the job, albeit a little slower.

Local alignment means that we do not charge costs for

(1) deletion of a prefix $0{:}t{:}i$,
(2) deletion of a suffix $j{:}t{:}n$.

(1) means that $d_w(0{:}s{:}0,\; 0{:}t{:}i) = 0$, and hence we initialize the first row of the distance matrix accordingly:

\[
d_{0,j} = 0 \quad \text{for } 0 \le j \le n
\]

The recursion formula for the inner rows is unchanged (as given in Section 2.1). However, (2) means that

\[
d_w(0{:}s{:}m,\; 0{:}t{:}j) = \min \left\{
\begin{array}{l}
d_w(0{:}s{:}(m-1),\; 0{:}t{:}(j-1)) + w(s_m, t_j), \\
d_w(0{:}s{:}(m-1),\; 0{:}t{:}j) + w(s_m, -), \\
d_w(0{:}s{:}m,\; 0{:}t{:}(j-1))
\end{array}
\right\}
\]

So the last line of the cost matrix is calculated according to

\[
d_{m,j} = \min \{\, d_{m-1,j-1} + w(s_m, t_j),\;\; d_{m-1,j} + w(s_m, -),\;\; d_{m,j-1} \,\}
\]

As before, $d_{m,n}$ gives the cost of the optimal local alignment. The matching subsequence $i{:}t{:}j$ is then found as follows: $j = \min \{\, k \mid d_{m,k} = d_{m,n} \,\}$, and $i$ is the point where the optimal path leading to $d_{m,j}$ starts from the first row.

An interesting alternative is to calculate the last line of the matrix according to the general case. Then, minimal values in the last line indicate those positions where the best matches of s against substrings of t end. This is the way the "local" option of the BioMOO alignment server is programmed.

For the second variation of our similarity theme, consider the following case. Let s and t be two proteins which we suspect to carry some functionally related subunits. However, most parts of s and t do not contribute to this function, and may well be very different. A pairwise alignment of s and t would be unlikely to clearly exhibit small regions of high similarity, as it is geared to minimize distance over the full length of both sequences.

Thus we are faced with another problem, this one being symmetric between s and t. We ask for those subunits of s and t that exhibit the most similarity. This is called the local similarity problem, since we use a similarity rather than a distance measure in the following way:

- $w(a, b) > 0$ if $a, b$ are similar,
- $w(a, b) < 0$ if $a, b$ are not similar, and
- $w(a, -) < 0$ and $w(-, b) < 0$, in particular.

This still leaves a lot of freedom to differentiate degrees of (dis-)similarity. The point is that we use the score 0 as a cut-off value between subsequences with and without similarity.
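Returning briefly to approximate matching as described in the first half of this section: the variant in which the first row is set to zero and all other rows (including the last) follow the general recursion can be sketched as below. This is a minimal illustration under the unit cost model. The function names, the start-position bookkeeping, and the example text CCGTATGATCC are assumptions made here and are not taken from the original text or from the BioMOO server.

```python
def unit_cost(a: str, b: str) -> int:
    """Unit cost model: a match costs 0, every other edit operation costs 1."""
    return 0 if a == b else 1


def approximate_match(s: str, t: str, w=unit_cost):
    """Find a substring i:t:j that s aligns best with; return (cost, i, j)."""
    m, n = len(s), len(t)
    # d[i][j]: cost of the best alignment of 0:s:i with a substring of t ending at j.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # start[i][j]: column of the first row where that best path begins.
    start = [list(range(n + 1)) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w(s[i - 1], "-")
        start[i][0] = 0
        for j in range(1, n + 1):
            candidates = [
                (d[i - 1][j - 1] + w(s[i - 1], t[j - 1]), start[i - 1][j - 1]),
                (d[i - 1][j] + w(s[i - 1], "-"), start[i - 1][j]),
                (d[i][j - 1] + w("-", t[j - 1]), start[i][j - 1]),
            ]
            # min() keeps the first minimal candidate, so ties prefer the diagonal.
            d[i][j], start[i][j] = min(candidates, key=lambda c: c[0])
    # A minimal value in the last row marks where a best match ends.
    end = min(range(n + 1), key=lambda j: d[m][j])
    return d[m][end], start[m][end], end


text = "CCGTATGATCC"  # made-up example containing TATGAT, one base off TATAAT
cost, i, j = approximate_match("TATAAT", text)
print(cost, i, j, text[i:j])
```

Because the first row costs nothing, a match may begin anywhere in t; because the minimum is taken over the whole last row, it may end anywhere as well.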
We are now maximizing similarity rather than minimizing distance. The border cases now become

\[
d_{0,j} = 0 \quad \text{for } 0 \le j \le n, \qquad d_{i,0} = 0 \quad \text{for } 1 \le i \le m,
\]

and the general recursion formula is

\[
d_{i,j} = \max \{\, 0,\;\; d_{i-1,j-1} + w(s_i, t_j),\;\; d_{i-1,j} + w(s_i, -),\;\; d_{i,j-1} + w(-, t_j) \,\}
\]

The cut-off value of zero means that long stretches of dissimilarity show as regions of zeroes in the matrix, from which stretches of local similarity rise as islands of positive values.

Some Exercises involving Local Alignment / Local Similarity

(1) Calculate a dynamic programming matrix and local alignment for the sequences ATT and ATCTTC. Compare your results with the BioMOO server, i.e. type "opt_align ATT ATCTTC matrix zero with #90". (You can also use the WWW-Interface, see [...].)

(2) Justify intuitively the treatment of the border cases in the local similarity problem.

(3) What happens if you formulate "local alignment" in a symmetric fashion, not charging for terminal gaps in either of the two sequences?

(4) If you use "2" as the cut-off value for local similarity calculations (see text), can you expect meaningful results?

5.2 Heuristic Methods

Let us now consider the computational resources needed for calculating edit distances: The dynamic programming scheme calculates (for input sequences $0{:}s{:}m$ and $0{:}t{:}n$) $m \cdot n$ matrix entries. Each matrix element can be calculated from the three adjacent elements with a fixed number of operations. Thus the execution time of this program is proportional to $m \cdot n$. In the jargon of computer science, we say that its time complexity is $O(m \cdot n)$.

If we are only interested in the edit distance of s and t, we only need to store one column (or row) of the matrix at a time in order to calculate the next one. In this case, the space complexity is $O(m)$ or $O(n)$. If we need to retrace the path that leads to the optimal alignment, the whole matrix must be stored and the space complexity moves up to $O(m \cdot n)$, too. This is quite feasible on today's computers, even when s and t are [...] or [...] characters long.
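The observation that one row at a time suffices when only the distance is needed can be sketched as follows. This is a minimal Python illustration under the unit cost model; the function names are chosen here and are not from the original text.

```python
def unit_cost(a: str, b: str) -> int:
    """Unit cost model: a match costs 0, every other edit operation costs 1."""
    return 0 if a == b else 1


def edit_distance(s: str, t: str, w=unit_cost):
    """Edit distance d_w(s, t) using O(n) space, where n = len(t)."""
    n = len(t)
    # Row 0 of the distance matrix: the empty prefix of s against prefixes of t.
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        prev[j] = prev[j - 1] + w("-", t[j - 1])
    for a in s:
        # Each new row depends only on the previous row and on itself.
        cur = [prev[0] + w(a, "-")] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j - 1] + w(a, t[j - 1]),  # match / replacement
                prev[j] + w(a, "-"),           # deletion
                cur[j - 1] + w("-", t[j - 1]), # insertion
            )
        prev = cur
    return prev[n]


print(edit_distance("AGCACACA", "ACACACTA"))
```

The time is still proportional to $m \cdot n$; only the memory drops. The price is that the path, and hence the alignment itself, can no longer be read off afterwards.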
However, for comparing a sequence s with a whole database of tens or hundreds of thousands of sequences, we must seek a more efficient solution.

The answer is to use heuristic approaches, which only approximate the proper edit distance and the optimal alignment, but do this with a time complexity close to $O(n + m)$ for a given pair $0{:}s{:}m$ and $0{:}t{:}n$. We will discuss the ideas of two such heuristics, and you will learn how to use them in the next chapter.

FASTA was developed by Lipman and Pearson in [...]. FASTA considers exact matches between short substrings of s and t, i.e. all pairs $(i, j)$ such that $i{:}s{:}(i+k) = j{:}t{:}(j+k)$, for a given parameter $k$. If a significant number of such exact matches is found, FASTA uses the dynamic programming algorithm to compute optimal alignments. This approach allows us to trade precision for speed: the larger we choose the parameter $k$, the smaller is the number of exact matches. This makes the program faster, but loses precision: it becomes less likely that the optimal alignment contains enough exact matches of length $k$, and the procedure may find nothing. Nevertheless, experience shows that with sensibly chosen parameters, FASTA misses very few cases of significant homology.

BLAST, developed by Altschul et al. in 1990, is another heuristic based on a similar idea. BLAST focuses on no-gap alignments of (again) a certain, fixed length $k$. Rather than requiring exact matches, BLAST uses a scoring function to measure similarity (rather than distance). In particular for proteins, one can argue that segment pairs with no gaps and a high similarity score indicate regions of functional similarity. For a given threshold parameter $S$, BLAST reports to the user all database entries which have a segment pair with the query sequence that scores higher than $S$. If the scoring function used has a probabilistic interpretation, BLAST can also give an assessment of the statistical significance of the matches it reports.

Some Exercises involving Heuristic Methods

These exercises should be done after going through Chapter 2, which tells you where to find the FASTA and BLAST servers.
More exercises are available at [...], thanks to Francisco M. De La Vega. These also include a solution sheet.

(1) Retrieve an arbitrary sequence from GenBank/EMBL or SwissProt. (You may choose a sequence from your research context.) Slightly modify it and call the modified sequence s.

(2) Search for homologues using FASTA. Is the original sequence among the reported matches? Save the results.

(3) Repeat (2) using BLAST.

(4) Compare the matches found in (2) and (3) with each other. Can you explain the differences?

(5) Redo (2) or (3) and try to use more restrictive parameters, such that the original sequence is reported as the only match. If this is not possible, explain why!
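As a small illustration of the exact-match step described in the FASTA paragraph above (all pairs $(i, j)$ with $i{:}s{:}(i+k) = j{:}t{:}(j+k)$), here is a Python sketch based on a lookup table of the length-$k$ substrings of t. It shows only this first step, not the FASTA program itself; the function name `exact_k_matches` is an assumption made here.

```python
from collections import defaultdict


def exact_k_matches(s: str, t: str, k: int):
    """Return all pairs (i, j) such that s[i:i+k] == t[j:j+k]."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Index every length-k substring of t by its starting position.
    index = defaultdict(list)
    for j in range(len(t) - k + 1):
        index[t[j:j + k]].append(j)
    # Look up every length-k substring of s in that index.
    return [
        (i, j)
        for i in range(len(s) - k + 1)
        for j in index.get(s[i:i + k], [])
    ]


s, t = "AGCACACA", "ACACACTA"
for k in (3, 5):
    print(k, exact_k_matches(s, t, k))
```

On the running example the number of exact matches drops from eight for $k = 3$ to two for $k = 5$, which is the speed versus precision trade-off mentioned in the text. Building the index takes time roughly proportional to the length of t, and each lookup is a single table access.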

6 Appendix

6.1 The DNA alphabet

All DNA is built from nucleotides, each containing a sugar, a phosphate group, and one of the four bases Adenine, Cytosine, Guanine, and Thymine. The DNA alphabet consists of their initials, {A, C, G, T}. In RNA, Thymine is replaced by Uracil, so U takes the place of T. However, to determine the nucleotide sequence of RNA, it is usually retranscribed into DNA (for technical reasons). So many RNA sequences are represented by their DNA equivalents.

It is not always the case that all positions in a sequence are precisely known. In those cases, an extended alphabet is used.

6.2 The extended genetic alphabet

In sequencing experiments, gel readings are not always unique. Some positions may be unknown or allow a choice of one of several nucleotides. Similarly, when a sequence actually represents a consensus of several related sequences, there may still be different degrees of variation at certain positions. The IUPAC extended genetic alphabet makes it possible to express such ambiguities. Some tools, like automatic fluorescent sequencers, use this code where gel readings do not allow a precise interpretation.

Summary of single-letter code recommendations

  Symbol   Meaning             Origin of designation
  G        G                   Guanine
  A        A                   Adenine
  T        T                   Thymine
  C        C                   Cytosine
  R        G or A              purine
  Y        T or C              pyrimidine
  M        A or C              amino
  K        G or T              Keto
  S        G or C              Strong interaction (3 H bonds)
  W        A or T              Weak interaction (2 H bonds)
  H        A or C or T         not-G, H follows G in the alphabet
  B        G or T or C         not-A, B follows A
  V        G or C or A         not-T (not-U), V follows U
  D        G or A or T         not-C, D follows C
  N        G or A or T or C    any
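The table above can be read as a mapping from each symbol to the set of nucleotides it may stand for. The following Python sketch encodes it that way and uses it for a simple position-by-position compatibility test. The names `IUPAC`, `compatible` and `matches`, and the pattern TATAWT, are illustrative choices made here; treating two symbols as compatible when their nucleotide sets overlap is one possible convention, not something prescribed by the text.

```python
# Each IUPAC symbol mapped to the nucleotides it may represent (from the table).
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "GATC",
}


def compatible(a: str, b: str) -> bool:
    """True if the two symbols can stand for at least one common nucleotide."""
    return bool(set(IUPAC[a]) & set(IUPAC[b]))


def matches(pattern: str, sequence: str) -> bool:
    """Position-wise compatibility of two equally long sequences."""
    return len(pattern) == len(sequence) and all(
        compatible(a, b) for a, b in zip(pattern, sequence)
    )


print(compatible("R", "A"))           # purine vs. Adenine
print(compatible("R", "Y"))           # purine vs. pyrimidine
print(matches("TATAWT", "TATAAT"))    # W stands for A or T
```

A test like `compatible` could serve as the basis of a cost function $w$ over the extended alphabet, for instance by charging 0 for compatible symbols and 1 otherwise.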

6.3 The single-letter amino-acid code

There are 20 amino acids from which all proteins are built. A three-letter and a single-letter code have been designed.

  Single-letter code   Abbreviation   Full name
  A                    Ala            Alanine
  R                    Arg            Arginine
  N                    Asn            Asparagine
  D                    Asp            Aspartic acid
  C                    Cys            Cysteine
  Q                    Gln            Glutamine
  E                    Glu            Glutamic acid
  G                    Gly            Glycine
  H                    His            Histidine
  I                    Ile            Isoleucine
  L                    Leu            Leucine
  K                    Lys            Lysine
  M                    Met            Methionine
  F                    Phe            Phenylalanine
  P                    Pro            Proline
  S                    Ser            Serine
  T                    Thr            Threonine
  W                    Trp            Tryptophan
  Y                    Tyr            Tyrosine
  V                    Val            Valine

6.4 Abstract alphabets

Sometimes we want to know whether a particular property of amino acids shows a characteristic pattern. For example, amino acids are hydrophobic, neutral, or hydrophilic. To express this, we map the single-letter code into the alphabet {+, 0, -}, each amino acid according to its behaviour. Characteristic patterns that would go undetected in the original encoding may become visible by such a translation.

For DNA, we may only care whether a pyrimidine (C or T) or a purine (A or G) resides at a given point. Thus, we map A and G onto R, C and T onto Y, and analyse strings over the binary alphabet {R, Y} instead.

6.5 Some examples of taking subsequences

Let u = ACCACGTAC, a sequence over the DNA alphabet.

- $0{:}u{:}1$ = A, $0{:}u{:}2$ = AC, and so forth,
- $0{:}u{:}2 = 3{:}u{:}5 = 7{:}u{:}9$ = AC,
- u ++ u = ACCACGTACACCACGTAC (++ denotes sequence concatenation),
- $7{:}(u \mathbin{+\!+} u){:}11$ = ACAC.

6.6 Some simple properties of subsequences

Let s and t be sequences, with $|s| = m$ and $|t| = n$:

- $s = 0{:}s{:}m$, $t = 0{:}t{:}n$,
- $|i{:}s{:}j| = j - i$, and $|s \mathbin{+\!+} t| = m + n$,
- $i{:}s{:}j \mathbin{+\!+} j{:}s{:}l = i{:}s{:}l$,
- if we delete the subsequence $i{:}s{:}j$ from s, the result is $(0{:}s{:}i) \mathbin{+\!+} (j{:}s{:}m)$.

6.7 Metric axioms

A distance measure $d$ is a metric on sequences if it has the following properties for all sequences $s, t, u$:

(1) $d(s, t) \ge 0$
(2) $d(s, t) = 0$ if and only if $s = t$
(3) $d(s, t) = d(t, s)$
(4) $d(s, t) \le d(s, u) + d(u, t)$ (the triangle inequality)

(Note that (1) actually is a consequence of (2) to (4).)
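The four axioms can be checked by brute force on a small set of sequences. The sketch below does this for the unit cost edit distance over all sequences of length at most 3 over the alphabet {A, C}. It is a finite sanity check under these assumptions, not a proof, and the function names are chosen here for illustration.

```python
from itertools import product


def edit_distance(s: str, t: str) -> int:
    """Edit distance under the unit cost model, one matrix row at a time."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i] + [0] * len(t)
        for j, b in enumerate(t, start=1):
            cur[j] = min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[len(t)]


def satisfies_metric_axioms(strings) -> bool:
    """Check axioms (1) to (4) for every pair and triple from `strings`."""
    d = {(s, t): edit_distance(s, t) for s in strings for t in strings}
    for s in strings:
        for t in strings:
            if d[s, t] < 0:                      # (1) non-negativity
                return False
            if (d[s, t] == 0) != (s == t):       # (2) zero exactly for s = t
                return False
            if d[s, t] != d[t, s]:               # (3) symmetry
                return False
            for u in strings:
                if d[s, t] > d[s, u] + d[u, t]:  # (4) triangle inequality
                    return False
    return True


# All sequences of length 0 to 3 over the two-letter alphabet {A, C}.
samples = ["".join(p) for k in range(4) for p in product("AC", repeat=k)]
print(len(samples), satisfies_metric_axioms(samples))
```

Symmetry here relies on the unit cost model charging the same for insertions and deletions; an asymmetric cost function $w$ would in general break axiom (3).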

6.8 Ambiguity of optimal alignment

Here is an example where the optimal alignment is not unique. Let s = TTCC, t = AATT.

[Figure: three matrices over T T C C and A A T T, from left to right: the distance matrix, the minimal choices, the optimal paths.]

In the center matrix, we have recorded which choice leads to the minimum for each d_{i,j}. Optimal paths follow these arrows and must reach the "target" d_{m,n}. The right-hand matrix shows the network of all optimal paths. They represent alignments as diverse as

s = T T C C
t = A A T T
(with four proper Replaces)

s = - - T T C C
t = A A T T - -
(with two Inserts and two Deletes)

And of course there are some others, too.

7 Some Recommended Reading

A classical, but still interesting reference is [KS83].

The following book gives a detailed treatment of pairwise alignment, as well as many related problems in sequence analysis: [Wat89].

A new book by Waterman has just come out: [Wat95].

Classical references on BLAST and FASTA are [AGM+89] and [WL83].

A paper by Pearson on the new version of FASTA appeared in Protein Science in 1995: [Pea95].

Before you start developing faster alignment algorithms for special cases of pairwise alignment, take a look at the following overview: [Mye91].
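As a supplement to the example of Section 6.8, the distance matrix and the number of optimal alignments can be recomputed with a short dynamic-programming sketch in Python. It assumes the unit-cost model (each Replace, Insert, and Delete costs 1, a match costs 0), which is consistent with both alignments shown above having cost 4. The function names are hypothetical, and the orientation (s along the rows, t along the columns) is a choice made for this sketch.

```python
# Assumption: unit costs (Replace = Insert = Delete = 1, match = 0).

def distance_matrix(s: str, t: str) -> list[list[int]]:
    """Fill d[i][j] = edit distance between the first i characters of s
    and the first j characters of t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                      # i deletions
    for j in range(1, n + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            replace = d[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else 1)
            delete = d[i - 1][j] + 1
            insert = d[i][j - 1] + 1
            d[i][j] = min(replace, delete, insert)
    return d

def count_optimal_alignments(s: str, t: str) -> int:
    """Count the paths from d[0][0] to d[m][n] that only use minimal choices."""
    d = distance_matrix(s, t)
    m, n = len(s), len(t)
    paths = [[0] * (n + 1) for _ in range(m + 1)]
    paths[0][0] = 1
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            total = 0
            if i > 0 and j > 0:
                step = 0 if s[i - 1] == t[j - 1] else 1
                if d[i][j] == d[i - 1][j - 1] + step:
                    total += paths[i - 1][j - 1]
            if i > 0 and d[i][j] == d[i - 1][j] + 1:
                total += paths[i - 1][j]
            if j > 0 and d[i][j] == d[i][j - 1] + 1:
                total += paths[i][j - 1]
            paths[i][j] = total
    return paths[m][n]

s, t = "TTCC", "AATT"
for row in distance_matrix(s, t):
    print(row)
print(distance_matrix(s, t)[len(s)][len(t)])   # 4
print(count_optimal_alignments(s, t))          # 6
```

Under these assumptions the distance is 4 and six distinct optimal alignments exist, among them the two shown in Section 6.8. A different cost function would generally change both the matrix and the count.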

References

[AGM+89] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. A Basic Local Alignment Search Tool. J. Mol. Biol., 215:403-410.

[KS83] J.B. Kruskal and D. Sankoff. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.

[Mye91] E.W. Myers. An Overview of Sequence Comparison Algorithms in Molecular Biology. Technical Report TR 91-29, University of Arizona, Tucson, Department of Computer Science.

[Pea95] W.R. Pearson. Comparison of methods for searching protein sequence databases. Prot. Sci., 4.

[Wat89] M.S. Waterman. Sequence Alignments. In M.S. Waterman, editor, Mathematical Methods for DNA Sequences, pages 53-92. CRC Press, Boca Raton, FL.

[Wat95] M.S. Waterman. Introduction to Computational Biology. Chapman and Hall, London.

[WL83] W.J. Wilbur and D.J. Lipman. Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proc. Natl. Acad. Sci. USA, 80:726-730.

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler STATC141 Spring 2005 The materials are from Pairise Sequence Alignment by Robert Giegerich and David Wheeler Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (I) Sequence similarity

More information

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS 1 Prokaryotes and Eukaryotes 2 DNA and RNA 3 4 Double helix structure Codons Codons are triplets of bases from the RNA sequence. Each triplet defines an amino-acid.

More information

Proteins: Characteristics and Properties of Amino Acids

Proteins: Characteristics and Properties of Amino Acids SBI4U:Biochemistry Macromolecules Eachaminoacidhasatleastoneamineandoneacidfunctionalgroupasthe nameimplies.thedifferentpropertiesresultfromvariationsinthestructuresof differentrgroups.thergroupisoftenreferredtoastheaminoacidsidechain.

More information

Viewing and Analyzing Proteins, Ligands and their Complexes 2

Viewing and Analyzing Proteins, Ligands and their Complexes 2 2 Viewing and Analyzing Proteins, Ligands and their Complexes 2 Overview Viewing the accessible surface Analyzing the properties of proteins containing thousands of atoms is best accomplished by representing

More information

Lecture 15: Realities of Genome Assembly Protein Sequencing

Lecture 15: Realities of Genome Assembly Protein Sequencing Lecture 15: Realities of Genome Assembly Protein Sequencing Study Chapter 8.10-8.15 1 Euler s Theorems A graph is balanced if for every vertex the number of incoming edges equals to the number of outgoing

More information

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013 Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

Using Higher Calculus to Study Biologically Important Molecules Julie C. Mitchell

Using Higher Calculus to Study Biologically Important Molecules Julie C. Mitchell Using Higher Calculus to Study Biologically Important Molecules Julie C. Mitchell Mathematics and Biochemistry University of Wisconsin - Madison 0 There Are Many Kinds Of Proteins The word protein comes

More information

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009 8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009 2 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance and alignment 4. The number

More information

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011 8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011 2 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance and alignment 4. The number

More information

Protein structure. Protein structure. Amino acid residue. Cell communication channel. Bioinformatics Methods

Protein structure. Protein structure. Amino acid residue. Cell communication channel. Bioinformatics Methods Cell communication channel Bioinformatics Methods Iosif Vaisman Email: ivaisman@gmu.edu SEQUENCE STRUCTURE DNA Sequence Protein Sequence Protein Structure Protein structure ATGAAATTTGGAAACTTCCTTCTCACTTATCAGCCACCT...

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Properties of amino acids in proteins

Properties of amino acids in proteins Properties of amino acids in proteins one of the primary roles of DNA (but not the only one!) is to code for proteins A typical bacterium builds thousands types of proteins, all from ~20 amino acids repeated

More information

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas Informal inductive proof of best alignment path onsider the last step in the best

More information

PROTEIN SECONDARY STRUCTURE PREDICTION: AN APPLICATION OF CHOU-FASMAN ALGORITHM IN A HYPOTHETICAL PROTEIN OF SARS VIRUS

PROTEIN SECONDARY STRUCTURE PREDICTION: AN APPLICATION OF CHOU-FASMAN ALGORITHM IN A HYPOTHETICAL PROTEIN OF SARS VIRUS Int. J. LifeSc. Bt & Pharm. Res. 2012 Kaladhar, 2012 Research Paper ISSN 2250-3137 www.ijlbpr.com Vol.1, Issue. 1, January 2012 2012 IJLBPR. All Rights Reserved PROTEIN SECONDARY STRUCTURE PREDICTION:

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Amino Acids and Peptides

Amino Acids and Peptides Amino Acids Amino Acids and Peptides Amino acid a compound that contains both an amino group and a carboxyl group α-amino acid an amino acid in which the amino group is on the carbon adjacent to the carboxyl

More information

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best alignment path onsider the last step in

More information

Translation. A ribosome, mrna, and trna.

Translation. A ribosome, mrna, and trna. Translation The basic processes of translation are conserved among prokaryotes and eukaryotes. Prokaryotic Translation A ribosome, mrna, and trna. In the initiation of translation in prokaryotes, the Shine-Dalgarno

More information

UNIT TWELVE. a, I _,o "' I I I. I I.P. l'o. H-c-c. I ~o I ~ I / H HI oh H...- I II I II 'oh. HO\HO~ I "-oh

UNIT TWELVE. a, I _,o ' I I I. I I.P. l'o. H-c-c. I ~o I ~ I / H HI oh H...- I II I II 'oh. HO\HO~ I -oh UNT TWELVE PROTENS : PEPTDE BONDNG AND POLYPEPTDES 12 CONCEPTS Many proteins are important in biological structure-for example, the keratin of hair, collagen of skin and leather, and fibroin of silk. Other

More information

Sequence comparison: Score matrices

Sequence comparison: Score matrices Sequence comparison: Score matrices http://facultywashingtonedu/jht/gs559_2013/ Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Exam III. Please read through each question carefully, and make sure you provide all of the requested information.

Exam III. Please read through each question carefully, and make sure you provide all of the requested information. 09-107 onors Chemistry ame Exam III Please read through each question carefully, and make sure you provide all of the requested information. 1. A series of octahedral metal compounds are made from 1 mol

More information

Chemistry Chapter 22

Chemistry Chapter 22 hemistry 2100 hapter 22 Proteins Proteins serve many functions, including the following. 1. Structure: ollagen and keratin are the chief constituents of skin, bone, hair, and nails. 2. atalysts: Virtually

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Introduction to Comparative Protein Modeling. Chapter 4 Part I Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature

More information

Similarity or Identity? When are molecules similar?

Similarity or Identity? When are molecules similar? Similarity or Identity? When are molecules similar? Mapping Identity A -> A T -> T G -> G C -> C or Leu -> Leu Pro -> Pro Arg -> Arg Phe -> Phe etc If we map similarity using identity, how similar are

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution Massachusetts Institute of Technology 6.877 Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution 1. Rates of amino acid replacement The initial motivation for the neutral

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

Lecture Notes: Markov chains

Lecture Notes: Markov chains Computational Genomics and Molecular Biology, Fall 5 Lecture Notes: Markov chains Dannie Durand At the beginning of the semester, we introduced two simple scoring functions for pairwise alignments: a similarity

More information

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, 2008 39 5 Blast This lecture is based on the following, which are all recommended reading: R. Merkl, S. Waack: Bioinformatik Interaktiv. Chapter 11.4-11.7

More information

Packing of Secondary Structures

Packing of Secondary Structures 7.88 Lecture Notes - 4 7.24/7.88J/5.48J The Protein Folding and Human Disease Professor Gossard Retrieving, Viewing Protein Structures from the Protein Data Base Helix helix packing Packing of Secondary

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

Lecture 14 - Cells. Astronomy Winter Lecture 14 Cells: The Building Blocks of Life

Lecture 14 - Cells. Astronomy Winter Lecture 14 Cells: The Building Blocks of Life Lecture 14 Cells: The Building Blocks of Life Astronomy 141 Winter 2012 This lecture describes Cells, the basic structural units of all life on Earth. Basic components of cells: carbohydrates, lipids,

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético Paulo Mologni 1, Ailton Akira Shinoda 2, Carlos Dias Maciel

More information

BLAST: Basic Local Alignment Search Tool

BLAST: Basic Local Alignment Search Tool .. CSC 448 Bioinformatics Algorithms Alexander Dekhtyar.. (Rapid) Local Sequence Alignment BLAST BLAST: Basic Local Alignment Search Tool BLAST is a family of rapid approximate local alignment algorithms[2].

More information

The translation machinery of the cell works with triples of types of RNA bases. Any triple of RNA bases is known as a codon. The set of codons is

The translation machinery of the cell works with triples of types of RNA bases. Any triple of RNA bases is known as a codon. The set of codons is Relations Supplement to Chapter 2 of Steinhart, E. (2009) More Precisely: The Math You Need to Do Philosophy. Broadview Press. Copyright (C) 2009 Eric Steinhart. Non-commercial educational use encouraged!

More information

A Theoretical Inference of Protein Schemes from Amino Acid Sequences

A Theoretical Inference of Protein Schemes from Amino Acid Sequences A Theoretical Inference of Protein Schemes from Amino Acid Sequences Angel Villahoz-Baleta angel_villahozbaleta@student.uml.edu ABSTRACT Proteins are based on tri-dimensional dispositions generated from

More information

Comparing whole genomes

Comparing whole genomes BioNumerics Tutorial: Comparing whole genomes 1 Aim The Chromosome Comparison window in BioNumerics has been designed for large-scale comparison of sequences of unlimited length. In this tutorial you will

More information

Part 4 The Select Command and Boolean Operators

Part 4 The Select Command and Boolean Operators Part 4 The Select Command and Boolean Operators http://cbm.msoe.edu/newwebsite/learntomodel Introduction By default, every command you enter into the Console affects the entire molecular structure. However,

More information

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

More information

Read more about Pauling and more scientists at: Profiles in Science, The National Library of Medicine, profiles.nlm.nih.gov

Read more about Pauling and more scientists at: Profiles in Science, The National Library of Medicine, profiles.nlm.nih.gov 2018 Biochemistry 110 California Institute of Technology Lecture 2: Principles of Protein Structure Linus Pauling (1901-1994) began his studies at Caltech in 1922 and was directed by Arthur Amos oyes to

More information

PROTEIN STRUCTURE AMINO ACIDS H R. Zwitterion (dipolar ion) CO 2 H. PEPTIDES Formal reactions showing formation of peptide bond by dehydration:

PROTEIN STRUCTURE AMINO ACIDS H R. Zwitterion (dipolar ion) CO 2 H. PEPTIDES Formal reactions showing formation of peptide bond by dehydration: PTEI STUTUE ydrolysis of proteins with aqueous acid or base yields a mixture of free amino acids. Each type of protein yields a characteristic mixture of the ~ 20 amino acids. AMI AIDS Zwitterion (dipolar

More information

CHEMISTRY ATAR COURSE DATA BOOKLET

CHEMISTRY ATAR COURSE DATA BOOKLET CHEMISTRY ATAR COURSE DATA BOOKLET 2018 2018/2457 Chemistry ATAR Course Data Booklet 2018 Table of contents Periodic table of the elements...3 Formulae...4 Units...4 Constants...4 Solubility rules for

More information

Problem Set 1

Problem Set 1 2006 7.012 Problem Set 1 Due before 5 PM on FRIDAY, September 15, 2006. Turn answers in to the box outside of 68-120. PLEASE WRITE YOUR ANSWERS ON THIS PRINTOUT. 1. For each of the following parts, pick

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

Potentiometric Titration of an Amino Acid. Introduction

Potentiometric Titration of an Amino Acid. Introduction NAME: Course: DATE Sign-Off: Performed: Potentiometric Titration of an Amino Acid Introduction In previous course-work, you explored the potentiometric titration of a weak acid (HOAc). In this experiment,

More information

C CH 3 N C COOH. Write the structural formulas of all of the dipeptides that they could form with each other.

C CH 3 N C COOH. Write the structural formulas of all of the dipeptides that they could form with each other. hapter 25 Biochemistry oncept heck 25.1 Two common amino acids are 3 2 N alanine 3 2 N threonine Write the structural formulas of all of the dipeptides that they could form with each other. The carboxyl

More information

Sequence Alignment. Johannes Starlinger

Sequence Alignment. Johannes Starlinger Sequence Alignment Johannes Starlinger his Lecture Approximate String Matching Edit distance and alignment Computing global alignments Local alignment Johannes Starlinger: Bioinformatics, Summer Semester

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

Towards Understanding the Origin of Genetic Languages

Towards Understanding the Origin of Genetic Languages Towards Understanding the Origin of Genetic Languages Why do living organisms use 4 nucleotide bases and 20 amino acids? Apoorva Patel Centre for High Energy Physics and Supercomputer Education and Research

More information

Protein Struktur (optional, flexible)

Protein Struktur (optional, flexible) Protein Struktur (optional, flexible) 22/10/2009 [ 1 ] Andrew Torda, Wintersemester 2009 / 2010, AST nur für Informatiker, Mathematiker,.. 26 kt, 3 ov 2009 Proteins - who cares? 22/10/2009 [ 2 ] Most important

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Solutions In each case, the chirality center has the R configuration

Solutions In each case, the chirality center has the R configuration CAPTER 25 669 Solutions 25.1. In each case, the chirality center has the R configuration. C C 2 2 C 3 C(C 3 ) 2 D-Alanine D-Valine 25.2. 2 2 S 2 d) 2 25.3. Pro,, Trp, Tyr, and is, Trp, Tyr, and is Arg,

More information

The Select Command and Boolean Operators

The Select Command and Boolean Operators The Select Command and Boolean Operators Part of the Jmol Training Guide from the MSOE Center for BioMolecular Modeling Interactive version available at http://cbm.msoe.edu/teachingresources/jmol/jmoltraining/boolean.html

More information

Protein Structure Bioinformatics Introduction

Protein Structure Bioinformatics Introduction 1 Swiss Institute of Bioinformatics Protein Structure Bioinformatics Introduction Basel, 27. September 2004 Torsten Schwede Biozentrum - Universität Basel Swiss Institute of Bioinformatics Klingelbergstr

More information

1.5 Sequence alignment

1.5 Sequence alignment 1.5 Sequence alignment The dramatic increase in the number of sequenced genomes and proteomes has lead to development of various bioinformatic methods and algorithms for extracting information (data mining)

More information

Practice Midterm Exam 200 points total 75 minutes Multiple Choice (3 pts each 30 pts total) Mark your answers in the space to the left:

Practice Midterm Exam 200 points total 75 minutes Multiple Choice (3 pts each 30 pts total) Mark your answers in the space to the left: MITES ame Practice Midterm Exam 200 points total 75 minutes Multiple hoice (3 pts each 30 pts total) Mark your answers in the space to the left: 1. Amphipathic molecules have regions that are: a) polar

More information

Major Types of Association of Proteins with Cell Membranes. From Alberts et al

Major Types of Association of Proteins with Cell Membranes. From Alberts et al Major Types of Association of Proteins with Cell Membranes From Alberts et al Proteins Are Polymers of Amino Acids Peptide Bond Formation Amino Acid central carbon atom to which are attached amino group

More information

CSE 549: Computational Biology. Substitution Matrices

CSE 549: Computational Biology. Substitution Matrices CSE 9: Computational Biology Substitution Matrices How should we score alignments So far, we ve looked at arbitrary schemes for scoring mutations. How can we assign scores in a more meaningful way? Are

More information

EXAM 1 Fall 2009 BCHS3304, SECTION # 21734, GENERAL BIOCHEMISTRY I Dr. Glen B Legge

EXAM 1 Fall 2009 BCHS3304, SECTION # 21734, GENERAL BIOCHEMISTRY I Dr. Glen B Legge EXAM 1 Fall 2009 BCHS3304, SECTION # 21734, GENERAL BIOCHEMISTRY I 2009 Dr. Glen B Legge This is a Scantron exam. All answers should be transferred to the Scantron sheet using a #2 pencil. Write and bubble

More information

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

More information

Lecture 5,6 Local sequence alignment

Lecture 5,6 Local sequence alignment Lecture 5,6 Local sequence alignment Chapter 6 in Jones and Pevzner Fall 2018 September 4,6, 2018 Evolution as a tool for biological insight Nothing in biology makes sense except in the light of evolution

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

Ranjit P. Bahadur Assistant Professor Department of Biotechnology Indian Institute of Technology Kharagpur, India. 1 st November, 2013

Ranjit P. Bahadur Assistant Professor Department of Biotechnology Indian Institute of Technology Kharagpur, India. 1 st November, 2013 Hydration of protein-rna recognition sites Ranjit P. Bahadur Assistant Professor Department of Biotechnology Indian Institute of Technology Kharagpur, India 1 st November, 2013 Central Dogma of life DNA

More information

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

More information

Biochemistry Quiz Review 1I. 1. Of the 20 standard amino acids, only is not optically active. The reason is that its side chain.

Biochemistry Quiz Review 1I. 1. Of the 20 standard amino acids, only is not optically active. The reason is that its side chain. Biochemistry Quiz Review 1I A general note: Short answer questions are just that, short. Writing a paragraph filled with every term you can remember from class won t improve your answer just answer clearly,

More information

Scoring Matrices. Shifra Ben-Dor Irit Orr

Scoring Matrices. Shifra Ben-Dor Irit Orr Scoring Matrices Shifra Ben-Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison

More information

Studies Leading to the Development of a Highly Selective. Colorimetric and Fluorescent Chemosensor for Lysine

Studies Leading to the Development of a Highly Selective. Colorimetric and Fluorescent Chemosensor for Lysine Supporting Information for Studies Leading to the Development of a Highly Selective Colorimetric and Fluorescent Chemosensor for Lysine Ying Zhou, a Jiyeon Won, c Jin Yong Lee, c * and Juyoung Yoon a,

More information

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid. 1. A change that makes a polypeptide defective has been discovered in its amino acid sequence. The normal and defective amino acid sequences are shown below. Researchers are attempting to reproduce the

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information
