Pairwise Sequence Alignment. Robert Giegerich and David Wheeler. Version 2.01 (2 typos corrected w.r.t Version 2.0) May 21, 1996.


An HTML version is available at [...].

Instructions for obtaining the Solution Sheet are available by sending e-mail to [...] with no subject line, and the following message body:

    subscribe vsns-bcd-solutions

VSNS-BCD Copyright. Copyright 1995/1996 by the author, who agreed to the following conditions for all time: The text may be redistributed verbatim, in whole or in part, for research and educational purposes only. If distributed in part, it must include this copyright notice, and a pointer (URL) to the full text. It may not be sold, or placed in something else for sale, without explicit permission in writing. The text is provided as is without any express or implied warranty.

Contents

1 Distance and Similarity
  1.1 Introduction
  1.2 Alphabets and Sequences
  1.3 Edit Distances
2 Pairwise Alignment via Dynamic Programming
  2.1 Calculating Edit Distances and Optimal Alignments
  2.2 A Word on the Dynamic Programming Paradigm
  2.3 A Word on Scoring Functions and Related Notions
3 Weight Matrices for Sequence Similarity Scoring
4 Realistic Gap Models
5 Variations of Pairwise Alignment
  5.1 Local Alignment and Local Similarity
  5.2 Heuristic Methods
6 Appendix
  6.1 The DNA alphabet
  6.2 The extended genetic alphabet
  6.3 The single-letter amino-acid code
  6.4 Abstract alphabets
  6.5 Some examples of taking subsequences
  6.6 Some simple properties of subsequences
  6.7 Metric axioms
  6.8 Ambiguity of optimal alignment
7 Some Recommended Reading

1 Distance and Similarity

1.1 Introduction

This chapter is about sequence similarity. Let us start with a warning: there is no unique, precise, and universally applicable notion of similarity. Consider the words "beer", "here" and "hear". They sound very similar (if not the same), are somewhat similar in their spelling, and totally unrelated in their meaning. The situation is a little more regular in molecular biology: proteins and DNA can be similar with respect to their function, their structure, or their primary sequence of amino or nucleic acids. The general rule is that sequence determines shape, and shape determines function. So when we study sequence similarity, we eventually hope to discover or validate similarity in shape and function. This approach is often successful. However, there are many examples where two sequences have little or no similarity, but still the molecules fold into the same shape and share the same function.

The present chapter speaks of neither shape nor function. Sequences are seen as strings of characters. In fact, the ideas and techniques we discuss have important applications in text processing, too.

Similarity has both a quantitative and a qualitative aspect: A similarity measure gives a quantitative answer, saying that two sequences show a certain degree of similarity. An alignment is a mutual arrangement of two sequences which is a sort of qualitative answer; it exhibits where the two sequences are similar, and where they differ. An optimal alignment, of course, is one that exhibits the most correspondences and the fewest differences.

1.2 Alphabets and Sequences

Generally, an alphabet is a supply of symbols or a set of characters, from which sequences are composed. Some important alphabets are

- the 4-letter DNA alphabet (see Section 6.1),
- the extended genetic alphabet or IUPAC code (see Section 6.2),
- the single-letter amino acid code (see Section 6.3),
- some smaller alphabets which are abstractions (see Section 6.4) of the above.

All the techniques we consider are independent of the particular alphabet used. We shall use the following notation:

- $A$ is the alphabet,
- $A^*$ is the set of all finite sequences of characters from $A$,

- $a, b, c$ are variables denoting individual characters,
- $s, t, u, v$ are variables denoting sequences from $A^*$,
- $|s|$ denotes the length of the sequence $s$ (in characters).

To address subsequences and individual characters in a sequence $s$, we mark the boundaries between the characters of $s$ by numbers, as shown here for $s$ = ACCACGTA:

    0 A 1 C 2 C 3 A 4 C 5 G 6 T 7 A 8

By $i:s:j$ we denote the subsequence of $s$ between positions $i$ and $j$. Of course, $0 \le i \le j \le |s|$. Subsequences $0:s:i$ are called prefixes, subsequences $i:s:|s|$ are called suffixes of $s$. (A few examples for the use of this notation are available in Section 6.5.) Two important special cases are $i = j$ and $i = j-1$:

- $i:s:i$ denotes the empty sequence (with 0 characters),
- $(j-1):s:j$ denotes the $j$-th character in $s$.

Sometimes we shall abbreviate this and simply write $s_j$ for $(j-1):s:j$. The concatenation of two sequences $s$ and $t$ is denoted $s \mathbin{+\!\!+} t$, e.g. ACC ++ CTA = ACCCTA. There are quite a few convenient properties of this subsequence notation (see Section 6.6).

1.3 Edit Distances

Two ways are used to quantify similarity of two sequences:

- A similarity measure is a function that associates a numeric value with a pair of sequences, with the idea that a higher value indicates greater similarity. Beyond this, similarity measures vary widely, and care must be taken when interpreting them.
- The notion of distance is somewhat dual to similarity. It treats sequences as points in a metric space. A distance measure is a function that also associates a numeric value with a pair of sequences, but with the idea that the larger the distance, the smaller the similarity, and vice versa. Distance measures usually satisfy the mathematical axioms of a metric (see Section 6.7 for more information). In particular, distance values are never negative.

In most cases, distance and similarity measures are interchangeable in the sense that a small distance means high similarity, and vice versa.
Sometimes similarity measures are a little more flexible (see the section on "local similarity").

Maybe the simplest notion of distance is the so-called Hamming distance: for two sequences of equal length, we just count the character positions in which they differ. For example:

    sequence s               AAT    AGCAA    AGCACACA
    sequence t               TAA    ACATA    ACACACTA
    HammingDistance(s, t)    2      3        6
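The counting rule just described can be written down in a few lines. This is only a minimal sketch; the function name `hamming_distance` is an assumption made for illustration and does not come from the original text.

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        # The Hamming distance is only defined for sequences of equal length.
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


# The three example pairs from the table above.
for s, t in [("AAT", "TAA"), ("AGCAA", "ACATA"), ("AGCACACA", "ACACACTA")]:
    print(s, t, hamming_distance(s, t))
```

Raising an error for unequal lengths, rather than padding or truncating, mirrors the restriction stated in the text.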

This distance measure is very useful in some cases, but in general it is not flexible enough. First of all, the sequences may have different lengths. Second, there is generally no fixed correspondence between their character positions. In the mechanism of DNA replication, errors like deleting or inserting a nucleotide are not unusual. Although the rest of the sequences is identical, such a shift of position leads to exaggerated values in the Hamming distance. Look at the rightmost example above. The Hamming distance says that $s$ and $t$ are apart by 6 characters (out of 8). On the other hand, by deleting G from $s$ and T from $t$, both become equal to ACACACA. In this sense, they are only two characters apart!

Let us instead model the distance of $s$ and $t$ by considering the simple, one-character edit operations that turn $s$ into $t$. We introduce a gap character "-" and say that the pair

- $(a, a)$ denotes a match (no change from $s$ to $t$),
- $(a, -)$ denotes deletion of character $a$ (in $s$),
- $(a, b)$ denotes replacement of $a$ (in $s$) by $b$ (in $t$), where $a \ne b$,
- $(-, b)$ denotes insertion of character $b$ (in $s$).

Since the problem is symmetric in $s$ and $t$, a deletion in $s$ can be seen as an insertion in $t$, and vice versa. An alignment of two sequences $s$ and $t$ is an arrangement of $s$ and $t$ by position, where $s$ and $t$ can be padded with gap symbols to achieve the same length:

    s:  A G C A C A C - A            s:  A G - C A C A C A
    t:  A - C A C A C T A     or     t:  A C A C A C T - A

If we read the alignment column-wise, we have a protocol of edit operations that lead from $s$ to $t$:

    Left:                 Right:
    Match   (A, A)        Match   (A, A)
    Delete  (G, -)        Replace (G, C)
    Match   (C, C)        Insert  (-, A)
    Match   (A, A)        Match   (C, C)
    Match   (C, C)        Match   (A, A)
    Match   (A, A)        Match   (C, C)
    Match   (C, C)        Replace (A, T)
    Insert  (-, T)        Delete  (C, -)
    Match   (A, A)        Match   (A, A)

The left-hand alignment shows one Delete and one Insert; all other edit operations are Matches.

The right-hand alignment shows one Insert, one Delete and two Replaces; the remaining operations are Matches.

Next we turn the edit protocol into a measure of distance by assigning a "cost" or "weight" $w$ to each operation. For example, for arbitrary characters $a, b$ from $A$ we may define

$$w(a, a) = 0, \qquad w(a, b) = 1 \ \text{for } a \ne b, \qquad w(a, -) = w(-, b) = 1.$$

This scheme is known as the Levenshtein distance, also called the unit cost model. Its predominant virtue is its simplicity. In general, more sophisticated cost models must be used. For example, replacing an amino acid by a biochemically similar one should weigh less than a replacement by an amino acid with totally different properties. Section 3 is exclusively dedicated to the choice of cost models.

Now we are ready to define the most important notion for sequence analysis:

- The cost of an alignment of two sequences $s$ and $t$ is the sum of the costs of all the edit operations that lead from $s$ to $t$.
- An optimal alignment of $s$ and $t$ is an alignment which has minimal cost among all possible alignments.
- The edit distance of $s$ and $t$ is the cost of an optimal alignment of $s$ and $t$ under a cost function $w$. We denote it by $d_w(s, t)$.

Using the unit cost model for $w$ in our previous example, we obtain the following costs:

    s:  A G C A C A C - A            s:  A G - C A C A C A
    t:  A - C A C A C T A     or     t:  A C A C A C T - A
    cost: 2                          cost: 4

Here it is easily seen that the left-hand alignment is optimal under the unit cost model, and hence the edit distance $d_w(s, t) = 2$.

At this point, you should improve your understanding by doing some exercises.

Some Exercises involving Edit Distances

(1) Verify: For arbitrary sequences $s, t$ and an arbitrary cost function $w$, $d_w(s, t)$ is always unique, but their optimal alignment need not be.

(2) What is the edit distance of the words "INDUSTRY" and "INTEREST" under the unit cost model? What of "CCT" and "ACGCTT"? Show an optimal alignment for each case.

(3) If $|s| < n + 1$, $|t| < n + 1$ and $w$ is the unit cost model, we always have $d_w(s, t) < n + 1$. Why?

(4) Can you think of a weight function that weights replacements so highly that an optimal alignment will never show replacements? (This is simple!)

(5) Return to the alignment example given in the text. Try to devise a different cost function $w$, such that the alignment on the right-hand side has smaller cost than the one on the left.

(6) Consider the gap symbol "-" as a regular character. What is the difference between $d_w(s, t)$ and HammingDistance$(s', t')$, where $s'$ and $t'$ are equal to $s, t$, but expanded by gap symbols according to an optimal alignment?

2 Pairwise Alignment via Dynamic Programming

2.1 Calculating Edit Distances and Optimal Alignments

The number of possible alignments between two sequences is gigantic, and unless the weight function is very simple, it may seem difficult to pick out an optimal alignment. But fortunately, there is an easy and systematic way to find it. The algorithm described now is very famous in biocomputing; it is usually called "the dynamic programming algorithm".

Consider two prefixes $0:s:i$ and $0:t:j$, with $i, j \ge 1$. Let us assume we already know optimal alignments between all shorter prefixes of $s$ and $t$, in particular of

(1) $0:s:(i-1)$ and $0:t:(j-1)$,
(2) $0:s:(i-1)$ and $0:t:j$, and
(3) $0:s:i$ and $0:t:(j-1)$.

An optimal alignment of $0:s:i$ and $0:t:j$ must be an extension of one of the above by

(1) a Replacement $(s_i, t_j)$ or a Match $(s_i, t_j)$, depending on whether $s_i = t_j$,
(2) a Deletion $(s_i, -)$, or
(3) an Insertion $(-, t_j)$.

We simply have to choose the minimum:

$$
d_w(0:s:i,\ 0:t:j) = \min \left\{
\begin{array}{l}
d_w(0:s:(i-1),\ 0:t:(j-1)) + w(s_i, t_j), \\
d_w(0:s:(i-1),\ 0:t:j) + w(s_i, -), \\
d_w(0:s:i,\ 0:t:(j-1)) + w(-, t_j)
\end{array}
\right\}
$$

There is no choice when one of the prefixes is empty, i.e. $i = 0$, or $j = 0$, or both:

$$
\begin{array}{ll}
d_w(0:s:0,\ 0:t:0) = 0 & \\
d_w(0:s:i,\ 0:t:0) = d_w(0:s:(i-1),\ 0:t:0) + w(s_i, -) & \text{for } i = 1, \ldots, m \\
d_w(0:s:0,\ 0:t:j) = d_w(0:s:0,\ 0:t:(j-1)) + w(-, t_j) & \text{for } j = 1, \ldots, n
\end{array}
$$

According to this scheme, and for a given $w$, the edit distances of all prefixes of $s$ and $t$ define an $(m+1) \times (n+1)$ distance matrix $D = (d_{i,j})$ with $d_{i,j} = d_w(0:s:i,\ 0:t:j)$.

The three-way choice in the minimization formula for $d_{i,j}$ leads to the following pattern of dependencies between matrix elements:

    d(i-1,j-1)    d(i-1,j)
             \       |
              v      v
    d(i,j-1) ---> d(i,j)

The bottom right corner of the distance matrix contains the desired result: $d_{m,n} = d_w(0:s:m,\ 0:t:n) = d_w(s, t)$.

This is the distance matrix for our previous example with $s$ = AGCACACA, $t$ = ACACACTA (rows are indexed by $s$, columns by $t$):

               A  C  A  C  A  C  T  A
            0  1  2  3  4  5  6  7  8
         A  1  0  1  2  3  4  5  6  7
         G  2  1  1  2  3  4  5  6  7
         C  3  2  1  2  2  3  4  5  6
         A  4  3  2  1  2  2  3  4  5
         C  5  4  3  2  1  2  2  3  4
         A  6  5  4  3  2  1  2  3  3
         C  7  6  5  4  3  2  1  2  3
         A  8  7  6  5  4  3  2  2  2

[Second diagram: the same matrix with the path of the optimal alignment drawn through it.]
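The border cases and the three-way minimum translate almost literally into code. The following is a minimal sketch under the unit cost model; the names `unit_cost` and `distance_matrix` are assumptions introduced for illustration, and the gap symbol is written as the string "-" (so the sequences themselves must not contain that character).

```python
def unit_cost(a: str, b: str) -> int:
    """Unit cost model: a match is free, every other edit operation costs 1.

    A deletion is passed as (a, "-"), an insertion as ("-", b).
    """
    return 0 if a == b else 1


def distance_matrix(s: str, t: str, w=unit_cost) -> list:
    """Return the (m+1) x (n+1) matrix D with D[i][j] = d_w(0:s:i, 0:t:j)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Border cases: one of the two prefixes is empty.
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w(s[i - 1], "-")
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w("-", t[j - 1])
    # General case: row by row, each row from left to right.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + w(s[i - 1], t[j - 1]),  # match or replacement
                d[i - 1][j] + w(s[i - 1], "-"),           # deletion
                d[i][j - 1] + w("-", t[j - 1]),           # insertion
            )
    return d


matrix = distance_matrix("AGCACACA", "ACACACTA")
for row in matrix:
    print(" ".join(f"{value:2d}" for value in row))
print("edit distance:", matrix[-1][-1])
```

Passing the cost function as a parameter keeps the sketch independent of the cost model, in the same way that the recursion in the text is stated for an arbitrary $w$.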

In the second diagram, we have drawn a path through the distance matrix indicating which case was chosen when taking the minimum. A diagonal line means Replacement or Match, a vertical line means Deletion, and a horizontal line means Insertion. Thus, this path indicates the edit operation protocol of the optimal alignment with $d_w(s, t) = 2$. Note that in some cases, the minimal choice is not unique, and different paths could have been drawn which indicate alternative optimal alignments. Another example is given in Section 6.8.

In which order should we calculate the matrix entries? The only constraint is the above pattern of dependencies. The most common order of calculation is line by line (each line from left to right), or column by column (each column from top to bottom).

Some Exercises involving Dynamic Programming

(1) Find out the cost model used by the BioMOO aligner. Calculate a dynamic programming matrix and alignment for the sequences ATT and TTC. Check your results using the BioMOO alignment, i.e. type "opt_align ATT TTC matrix with #90" on the MOO. (You can also use the WWW interface, see [...].) How many optimal alignments are there?

(2) The number of possible alignments is described as "gigantic". How many are there for the sequences ATT and TTC? (Extra Credit.) If you wish to devise a formula for the number of alignments, which method can be used to enumerate them systematically? Devise such a formula.

2.2 A Word on the Dynamic Programming Paradigm

"Dynamic Programming" is a very general programming technique. It is applicable when a large search space can be structured into a succession of stages, such that

- the initial stage contains trivial solutions to sub-problems,
- each partial solution in a later stage can be calculated by recurring on only a fixed number of partial solutions in an earlier stage,
- the final stage contains the overall solution.

This applies to our distance matrix: the columns are the stages, the first column is trivial, the final one contains the overall result. A matrix entry $d_{i,j}$ is the partial solution $d_w(0:s:i,\ 0:t:j)$ and can be determined from two solutions in the previous column, $d_{i-1,j-1}$ and $d_{i,j-1}$, plus one in the same column, namely $d_{i-1,j}$.

Since calculating edit distances is the predominant approach to sequence comparison, some people simply call this THE dynamic programming algorithm. Just note that the dynamic programming paradigm has many other applications as well, even within bioinformatics.
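The path through the distance matrix discussed in Section 2.1 can be recovered by walking back from the final stage to the initial one. The sketch below assumes the unit cost model and an arbitrary tie-breaking order (diagonal first, then deletion, then insertion), so it returns just one of possibly several optimal alignments. The function name `optimal_alignment` is hypothetical.

```python
def optimal_alignment(s: str, t: str):
    """Return (cost, padded s, padded t) for one optimal unit-cost alignment."""
    m, n = len(s), len(t)

    def w(a, b):
        # Unit cost model: 0 for a match, 1 for everything else.
        return 0 if a == b else 1

    # Fill the distance matrix as in Section 2.1.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j - 1] + w(s[i - 1], t[j - 1]),
                          d[i - 1][j] + 1,
                          d[i][j - 1] + 1)

    # Walk back from the bottom right corner to the top left corner.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + w(s[i - 1], t[j - 1]):
            top.append(s[i - 1]); bottom.append(t[j - 1])   # diagonal step
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            top.append(s[i - 1]); bottom.append("-")        # vertical step: deletion
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])        # horizontal step: insertion
            j -= 1
    return d[m][n], "".join(reversed(top)), "".join(reversed(bottom))


cost, row_s, row_t = optimal_alignment("AGCACACA", "ACACACTA")
print(row_s)
print(row_t)
print("cost:", cost)
```

A different tie-breaking order would trace a different path wherever the minimal choice is not unique, which is exactly the ambiguity mentioned in the text.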

2.3 A Word on Scoring Functions and Related Notions

Many authors use different words for essentially the same idea: scores, weights, costs, distance and similarity functions all attribute a numeric value to a pair of sequences.

- "distance" should only be used when the metric axioms are satisfied. In particular, distance values are never negative. The optimal alignment minimizes distance.
- The term "costs" usually implies positive values, with the overall cost to be minimized. However, metric axioms are not assumed.
- "weights" and "scores" can be positive or negative. The most popular use is that a high score is good, i.e. it indicates a lot of similarity. Hence, the optimal alignments maximize scores.
- The term "similarity" immediately implies that large values are good, i.e. an optimal alignment maximizes similarity. Intuitively, one would expect that similarity values should not be negative (what is less than zero similarity?). But don't be surprised to see negative similarity scores shortly.

Mathematically, distances are a little more tractable than the others. In terms of programming, general scoring functions are a little more flexible. For example, the algorithm for local similarity presented in Section 5.1 depends on the use of both positive and negative scores. The accumulated score of two subsequences may rise above the threshold value, and may fall below it after encountering some negative scores.

Let us close with another caveat concerning the influence of sequence length on similarity. Let us just count exact matches and let us assume that two sequences of length $n$ and $m$, respectively, have 99 exact matches. Let $c_w(s, t)$ be the similarity score calculated for $s$ and $t$ under this cost model. So, $c_w(s, t) = 99$. What this means depends on $n$: If $n = m = 100$, the sequences are very similar, almost identical. If $n = m = 1000$, we have only 10% identity! (Two typos were corrected in this paragraph on Wed May 15 17:06:38 MDT 1996.) So if we relate sequences of varying length, it makes sense to use length-relative scores: rather than $c_w(s, t)$ we use $c_w(s, t)/(n + m)$ for sequence comparison.

3 Weight Matrices for Sequence Similarity Scoring

This section is available at [...]. A Postscript version of it is available at [...].

Some Exercises involving Weight Matrices

(Author: Georg Fuellen. Further exercises are planned.)

(1) Why does 2 PAM, i.e. 1 PAM multiplied with itself, not correspond to exactly 2% of the amino acids having mutated, but a little less than 2%? Or, in other words, why does a 250 PAM matrix not correspond to 250% accepted mutations?

(2) Is it biologically plausible that the C-C and W-W entries in the matrix are the most prominent? Which entries (or groups of entries) are the least prominent?

(3) Why do some people talk about the 256 PAM matrix, and not the 250 PAM matrix? What's so special about the value 256?

4 Realistic Gap Models

So far, we have treated the gap symbol "-" as yet another character, denoting an individual insertion or deletion. However, this view is not always adequate.

Sometimes we want no-gap alignments. For example, in a family of proteins there may be a strongly conserved subunit which is the site of some protein-protein interaction. Any deletion/insertion in the chain of amino acids would be likely to destroy its biochemical function. Such regions we want to align using matches/replacements only. This of course can be achieved by a very simple algorithm. But our dynamic programming algorithm can also be geared to do this by setting the costs for insertion and deletion to infinity (or something close to it). Hence, an optimal alignment will not use gaps.

Sometimes, from an evolutionary point of view, it is more realistic to assume that nature inserts or deletes entire substrings as a unit. This is called the block-indel model. It means that we charge a certain set-up cost for introducing a new gap, whereas extending an existing gap is less expensive. For example, a linear gap cost function (such functions are often called affine) is of the form

$$\mathrm{gapcost}(n) = g_i + (n - 1)\, g_e,$$

where $n$ is the length of the gap (number of consecutive "-"), and $g_i > g_e \ge 0$. Our dynamic programming scheme can be adapted to this model without much effect on its efficiency.
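To make the gap cost formula concrete, the sketch below scores an alignment that is already given, rather than searching for an optimal one. It assumes that replacements cost 1 as in the unit cost model, that no column contains a gap in both rows, and that every maximal run of "-" within one row counts as a single gap. The function names and the example alignment are hypothetical.

```python
import re


def gap_cost(n: int, g_i: float = 1.0, g_e: float = 0.2) -> float:
    """gapcost(n) = g_i + (n - 1) * g_e for a gap of length n >= 1."""
    return g_i + (n - 1) * g_e


def block_indel_cost(row_s: str, row_t: str, g_i: float = 1.0,
                     g_e: float = 0.2, replace_cost: float = 1.0) -> float:
    """Cost of a given alignment (two padded rows) under the block-indel model."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    # Replacements: columns without a gap whose characters differ.
    cost = sum(replace_cost for a, b in zip(row_s, row_t)
               if a != "-" and b != "-" and a != b)
    # Gaps: every maximal run of '-' in either row is charged as one block.
    for row in (row_s, row_t):
        for run in re.findall(r"-+", row):
            cost += gap_cost(len(run), g_i, g_e)
    return cost


# One block of three gap symbols: 1 + 2 * 0.2 under the block-indel model.
print(block_indel_cost("ACGTTTA", "AC---TA"))
# The same alignment with g_i = g_e = 1 reduces to the unit cost model.
print(block_indel_cost("ACGTTTA", "AC---TA", g_i=1.0, g_e=1.0))
```

Setting `g_i` equal to `g_e` recovers the earlier per-character gap cost, which is a convenient sanity check when experimenting with other parameter values.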
(A typo in this section was corrected Tue May 21 01:08:57 MET DST 1996.)

Some Exercises involving Gap Cost Models

(1) Consider the four alignments of CAAAAGAT and CGAGGGGT shown below. Calculate the cost of each under
    a) the unit cost model,
    b) the unit cost model with block-indels, with $g_i = 1$ and $g_e = 0.2$.

    (1)  C A A A A G A T
         C G A G G G G T                  cost a) =        cost b) =

    (2)  C A A A A G A - - T
         C - - G A G G G G T              cost a) =        cost b) =

    (3)  C A A A A G A - - - - T
         C - - - - G A G G G G T          cost a) =        cost b) =

(2) (Extra Credit.) Design the dynamic programming (recursive) formula for the block-indel case.

5 Variations of Pairwise Alignment

5.1 Local Alignment and Local Similarity

The notion of edit distance and its implementation via dynamic programming are easily adapted to variations of the original problem. Two such variations are discussed here. We first discuss the problem of local alignment, which is also called approximate pattern matching. We then turn to the problem of local similarity, which (unfortunately) has often been called "local alignment".

Consider the problem of locating the famous TATAAT box (a bacterial promoter) in a piece of DNA. Typically, this motif does not occur exactly as spelled here, but may show up with one or two bases changed. Local alignment is the problem where $s$ is relatively short with respect to $t$ and we seek that subunit of $t$ which $s$ aligns best with. More precisely: Given $0:s:m$ and $0:t:n$, find $i:t:j$ such that $d_w(s,\ i:t:j)$ is minimal among all choices of $0 \le i \le j \le n$. This is also called the approximate matching problem (of $s$ against $t$), and very sophisticated methods have been developed to do this efficiently. But a simple change to the dynamic programming scheme will also do the job, albeit a little slower.

Local alignment means that we do not charge costs for

(1) deletion of a prefix $0:t:i$,
(2) deletion of a suffix $j:t:n$.

(1) means that $d_w(0:s:0,\ 0:t:i) = 0$, and hence we initialize the first row of the distance matrix accordingly:

$$d_{0,j} = 0 \quad \text{for } 0 \le j \le n.$$

The recursion formula for the inner rows is unchanged (as given in Section 2.1). However, (2) means that

$$
d_w(0:s:m,\ 0:t:j) = \min \left\{
\begin{array}{l}
d_w(0:s:(m-1),\ 0:t:(j-1)) + w(s_m, t_j), \\
d_w(0:s:(m-1),\ 0:t:j) + w(s_m, -), \\
d_w(0:s:m,\ 0:t:(j-1))
\end{array}
\right\}
$$

So the last line of the cost matrix is calculated according to

$$d_{m,j} = \min \{\, d_{m-1,j-1} + w(s_m, t_j),\ \ d_{m-1,j} + w(s_m, -),\ \ d_{m,j-1} \,\}.$$

As before, $d_{m,n}$ gives the cost of the optimal local alignment. The matching subsequence $i:t:j$ is then found as follows: $j = \min \{ k \mid d_{m,k} = d_{m,n} \}$, and $i$ is the point where the optimal path leading to $d_{m,j}$ starts from the first row.

An interesting alternative is to calculate the last line of the matrix according to the general case. Then, minimal values in the last line indicate those positions where the best matches of $s$ against substrings of $t$ end. This is the way the "local" option of the BioMOO alignment server is programmed.

For the second variation of our similarity theme, consider the following case. Let $s$ and $t$ be two proteins, which we suspect to carry some functionally related subunits. However, most parts of $s$ and $t$ do not contribute to this function, and may well be very different. A pairwise alignment of $s$ and $t$ would be unlikely to clearly exhibit small regions of high similarity, as it is geared to minimize distance over the full length of both sequences. Thus we are faced with another problem, this one being symmetric between $s$ and $t$. We ask for those subunits of $s$ and $t$ that exhibit most similarity. This is called the local similarity problem, since we use a similarity rather than a distance measure in the following way:

$$
\begin{array}{ll}
w(a, b) > 0 & \text{if } a, b \text{ are similar,} \\
w(a, b) < 0 & \text{if } a, b \text{ are not similar, and} \\
w(a, -) < 0,\ \ w(-, b) < 0 & \text{in particular.}
\end{array}
$$

This still leaves a lot of freedom to differentiate degrees of (dis-)similarity. The point is that we use the score 0 as a cut-off value between subsequences with/without similarity.
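A minimal sketch of the local similarity calculation follows. It uses the zero cut-off together with the zero border cases and the maximizing recursion stated in this section. The concrete scores (+1 for a match, -1 for a mismatch, -1 per gap symbol) and the two test sequences are assumptions chosen only for illustration; any weights with the signs required above could be substituted.

```python
def local_similarity(s: str, t: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Return (best score, (i, j) where it is reached, full score matrix)."""
    m, n = len(s), len(t)
    # Border cases: the first row and the first column are all zero.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            d[i][j] = max(
                0,                      # cut-off: restart instead of going negative
                d[i - 1][j - 1] + w,    # match or replacement
                d[i - 1][j] + gap,      # deletion
                d[i][j - 1] + gap,      # insertion
            )
            if d[i][j] > best:
                best, best_pos = d[i][j], (i, j)
    return best, best_pos, d


# Two otherwise unrelated sequences sharing the motif TATAAT.
score, end, matrix = local_similarity("GGTATAATCC", "CCCTATAATGG")
print("best local similarity:", score, "ending at", end)
```

The position of the maximum marks where the most similar pair of subsequences ends; tracing back from there until a zero entry is reached would recover the subsequences themselves.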
We are now maximizing similarity rather than minimizing distance. The border cases now become

$$d_{0,j} = 0 \ \text{ for } 0 \le j \le n, \qquad d_{i,0} = 0 \ \text{ for } 1 \le i \le m,$$

and the general recursion formula is

$$d_{i,j} = \max \{\, 0,\ \ d_{i-1,j-1} + w(s_i, t_j),\ \ d_{i-1,j} + w(s_i, -),\ \ d_{i,j-1} + w(-, t_j) \,\}.$$

The cut-off value of zero means that long stretches of dissimilarity show as regions of zeroes in the matrix, from which stretches of local similarity rise as islands of positive values.

Some Exercises involving Local Alignment / Local Similarity

(1) Calculate a dynamic programming matrix and local alignment for the sequences ATT and ATCTTC. Compare your results with the BioMOO server, i.e. type "opt_align ATT ATCTTC matrix zero with #90". (You can also use the WWW interface, see [...].)

(2) Justify intuitively the treatment of the border cases in the local similarity problem.

(3) What happens if you formulate "local alignment" in a symmetric fashion, not charging for terminal gaps in either of the two sequences?

(4) If you use "2" as the cut-off value for local similarity calculations (see text), can you expect meaningful results?

5.2 Heuristic Methods

Let us now consider the computational resources needed for calculating edit distances: The dynamic programming scheme calculates (for input sequences $0:s:m$ and $0:t:n$) $m \cdot n$ matrix entries. Each matrix element can be calculated from the three adjacent elements with a fixed number of operations. Then the execution time of this program is proportional to $m \cdot n$. In the jargon of computer science, we say that its time complexity is $O(m \cdot n)$. If we are only interested in the edit distance of $s$ and $t$, we only need to store one column (or row) of the matrix at a time in order to calculate the next one. In this case, the space complexity is $O(m)$ or $O(n)$. If we need to retrace the path that leads to the optimal alignment, the whole matrix must be stored and the space complexity moves up to $O(m \cdot n)$, too. This is quite feasible on today's computers, even when $s$ and $t$ are [...] or [...] characters long.
However, for comparing a sequence $s$ with a whole database of tens or hundreds of thousands of sequences, we must seek a more efficient solution.
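The remark that a single row (or column) suffices when only the distance is needed can be sketched as follows. The unit cost model is assumed, and the function name `edit_distance_linear_space` is hypothetical. The two sequences are swapped when necessary so that the stored row has the length of the shorter one, which is legitimate here because the unit cost model is symmetric.

```python
def edit_distance_linear_space(s: str, t: str) -> int:
    """Unit-cost edit distance that keeps only one matrix row in memory."""
    if len(t) > len(s):
        s, t = t, s                       # the stored row gets len(shorter) + 1 entries
    prev = list(range(len(t) + 1))        # row 0: distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]                        # first column: i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j - 1] + (a != b),   # match or replacement
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
            ))
        prev = curr                       # the older row is no longer needed
    return prev[-1]


print(edit_distance_linear_space("AGCACACA", "ACACACTA"))
```

The running time is still proportional to $m \cdot n$; only the memory requirement shrinks, and the path of the optimal alignment can no longer be retraced from what is stored.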

The answer is to use heuristic approaches, which only approximate the proper edit distance and the optimal alignment, but do this with a time complexity close to $O(n + m)$ for a given pair $0:s:m$ and $0:t:n$. We will discuss the ideas of two such heuristics, and you will learn how to use them in the next chapter.

FASTA was developed by Lipman and Pearson in [...]. FASTA considers exact matches between short substrings of $s$ and $t$, i.e. all pairs $(i, j)$ such that $i:s:(i+k) = j:t:(j+k)$, for a given parameter $k$. If a significant number of such exact matches is found, FASTA uses the dynamic programming algorithm to compute optimal alignments. This approach allows us to trade precision for speed: The larger we choose the parameter $k$, the smaller is the number of exact matches. This makes the program faster, but loses precision: It becomes less likely that the optimal alignment contains enough exact matches of length $k$, and the procedure may find nothing. Nevertheless, experience shows that with sensibly chosen parameters, FASTA misses very few cases of significant homology.

BLAST, developed by Altschul et al. in 1990, is another heuristic based on a similar idea. BLAST focuses on no-gap alignments of (again) a certain, fixed length $k$. Rather than requiring exact matches, BLAST uses a scoring function to measure similarity (rather than distance). In particular for proteins, one can argue that segment pairs with no gaps and a high similarity score indicate regions of functional similarity. For a given threshold parameter $S$, BLAST reports to the user all database entries which have a segment pair with the query sequence that scores higher than $S$. If the scoring function used has a probabilistic interpretation, BLAST can also give an assessment of the statistical significance of the matches it reports.

Some Exercises involving Heuristic Methods

These exercises should be done after going through Chapter 2, which tells you where to find the FASTA and BLAST servers. More exercises are available at [...], thanks to Francisco M. De La Vega. These also include a solution sheet.

(1) Retrieve an arbitrary sequence from GenBank/EMBL or SwissProt. (You may choose a sequence from your research context.) Slightly modify it and call the modified sequence $s$.

(2) Search for homologues using FASTA. Is the original sequence among the reported matches? Save the results.

(3) Repeat (2) using BLAST.

(4) Compare the matches found in (2) and (3) with each other. Can you explain the differences?

(5) Redo (2) or (3) and try to use more restrictive parameters, such that the original sequence is reported as the only match. If this is not possible, explain why!
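The exact-match step described for FASTA in Section 5.2, i.e. collecting all pairs $(i, j)$ with $i:s:(i+k) = j:t:(j+k)$, can be sketched with a simple lookup table. This illustrates only that one idea and is not the FASTA program itself; the function name `exact_ktuple_matches` and the use of a dictionary index are assumptions.

```python
from collections import defaultdict


def exact_ktuple_matches(s: str, t: str, k: int) -> list:
    """Return all pairs (i, j) such that s[i:i+k] == t[j:j+k]."""
    # Index every length-k substring of t by its start position.
    index = defaultdict(list)
    for j in range(len(t) - k + 1):
        index[t[j:j + k]].append(j)
    # Look up every length-k substring of s in that index.
    return [(i, j)
            for i in range(len(s) - k + 1)
            for j in index.get(s[i:i + k], [])]


s, t = "AGCACACA", "ACACACTA"
for k in (4, 5, 6):
    print(k, exact_ktuple_matches(s, t, k))
```

On this small example the number of exact matches shrinks as $k$ grows and vanishes at $k = 6$, which is the loss of sensitivity that the text warns about.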

6 Appendix

6.1 The DNA alphabet

All DNA is built from nucleotides, each containing a sugar, a phosphate group, and one of the four bases Adenine, Cytosine, Guanine, and Thymine. The DNA alphabet consists of their initials, {A, C, G, T}. In RNA, Thymine is replaced by Uracil, so U takes the place of T. However, to determine the nucleotide sequence of RNA, it is usually retranscribed into DNA (for technical reasons). So many RNA sequences are represented by their DNA equivalents. It is not always the case that all positions in a sequence are precisely known. In those cases, an extended alphabet is used.

6.2 The extended genetic alphabet

In sequencing experiments, gel readings are not always unique. Some positions may be unknown or allow a choice of one of several nucleotides. Similarly, when a sequence actually represents a consensus of several related sequences, there may still be different degrees of variation at certain positions. The IUPAC extended genetic alphabet allows us to express such ambiguities. Some tools like automatic fluorescent sequencers use this code where gel readings do not allow a precise interpretation.

Summary of single-letter code recommendations

    Symbol   Meaning             Origin of designation
    G        G                   Guanine
    A        A                   Adenine
    T        T                   Thymine
    C        C                   Cytosine
    R        G or A              puRine
    Y        T or C              pYrimidine
    M        A or C              aMino
    K        G or T              Keto
    S        G or C              Strong interaction (3 H bonds)
    W        A or T              Weak interaction (2 H bonds)
    H        A or C or T         not-G, H follows G in the alphabet
    B        G or T or C         not-A, B follows A
    V        G or C or A         not-T (not-U), V follows U
    D        G or A or T         not-C, D follows C
    N        G or A or T or C    aNy
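The table above can be held in a small dictionary that maps each symbol to the set of nucleotides it stands for. The sketch below uses it to test whether two symbols can denote the same nucleotide; the names `IUPAC`, `compatible` and `matches`, and the example pattern TATAWT, are assumptions made for illustration.

```python
# Each IUPAC symbol mapped to the nucleotides it may stand for (see table above).
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "GATC",
}


def compatible(a: str, b: str) -> bool:
    """True if the two symbols share at least one possible nucleotide."""
    return bool(set(IUPAC[a]) & set(IUPAC[b]))


def matches(pattern: str, sequence: str) -> bool:
    """Position-by-position comparison of two equal-length IUPAC strings."""
    return len(pattern) == len(sequence) and all(
        compatible(p, c) for p, c in zip(pattern, sequence))


print(matches("TATAWT", "TATAAT"))   # W stands for A or T
print(matches("TATAWT", "TATACT"))   # C is not covered by W
```

A compatibility test of this kind could serve as the basis of a weight function $w$ over the extended alphabet, although how to weight partial ambiguity is a modelling decision that this sketch does not address.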

6.3 The single-letter amino-acid code

There are 20 amino acids from which all proteins are built. A three-letter and a single-letter code have been designed.

    single-letter code   abbreviation   full name
    A                    Ala            Alanine
    R                    Arg            Arginine
    N                    Asn            Asparagine
    D                    Asp            Aspartic acid
    C                    Cys            Cysteine
    Q                    Gln            Glutamine
    E                    Glu            Glutamic acid
    G                    Gly            Glycine
    H                    His            Histidine
    I                    Ile            Isoleucine
    L                    Leu            Leucine
    K                    Lys            Lysine
    M                    Met            Methionine
    F                    Phe            Phenylalanine
    P                    Pro            Proline
    S                    Ser            Serine
    T                    Thr            Threonine
    W                    Trp            Tryptophan
    Y                    Tyr            Tyrosine
    V                    Val            Valine

6.4 Abstract alphabets

Sometimes we want to know whether a particular property of amino acids shows a characteristic pattern. For example, amino acids are hydrophobic, neutral, or hydrophilic. To express this, we map the single-letter code into the alphabet {+, 0, -}, each according to its behaviour. Characteristic patterns that would go undetected in the original encoding may become visible by such a translation.

For DNA, we may only care whether a pyrimidine (C or T) or a purine (A or G) resides at a given point. Thus, we map A and G onto R, C and T onto Y, and analyse strings over the binary alphabet {R, Y} instead.

6.5 Some examples of taking subsequences

Let $u$ = ACCACGTAC, a sequence over the DNA alphabet.

- $0:u:1$ = A, $0:u:2$ = AC, and so forth,
- $0:u:2 = 3:u:5 = 7:u:9$ = AC,
- $u \mathbin{+\!\!+} u$ = ACCACGTACACCACGTAC (++ denotes sequence concatenation),
- $7:(u \mathbin{+\!\!+} u):11$ = ACAC.

6.6 Some simple properties of subsequences

Let $s$ and $t$ be sequences, with $|s| = m$ and $|t| = n$:

- $s = 0:s:m$, $t = 0:t:n$,
- $|i:s:j| = j - i$, and $|s \mathbin{+\!\!+} t| = m + n$,
- $(i:s:j) \mathbin{+\!\!+} (j:s:l) = i:s:l$,
- If we delete subsequence $i:s:j$ from $s$, the result is $(0:s:i) \mathbin{+\!\!+} (j:s:m)$.

6.7 Metric axioms

A distance measure $d$ is a metric on sequences if it has the following properties for all sequences $s, t, u$:

(1) $d(s, t) \ge 0$
(2) $d(s, t) = 0$ if and only if $s = t$
(3) $d(s, t) = d(t, s)$
(4) $d(s, t) \le d(s, u) + d(u, t)$ (the triangle inequality)

(Note that (1) actually is a consequence of (2) - (4).)
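As an illustration of the four axioms, the sketch below checks them by brute force for the unit-cost edit distance on all sequences of length at most 4 over the two-letter alphabet {A, C}. Such a check on a finite sample is of course no proof; the sample, the alphabet and the function names are assumptions made for illustration.

```python
from itertools import product


def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j - 1] + (a != b), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def satisfies_metric_axioms(words) -> bool:
    """Check axioms (1) to (4) for every pair and triple drawn from `words`."""
    d = {(s, t): edit_distance(s, t) for s in words for t in words}
    for s in words:
        for t in words:
            if d[s, t] < 0:                       # (1) non-negativity
                return False
            if (d[s, t] == 0) != (s == t):        # (2) zero exactly for equal sequences
                return False
            if d[s, t] != d[t, s]:                # (3) symmetry
                return False
            for u in words:
                if d[s, t] > d[s, u] + d[u, t]:   # (4) triangle inequality
                    return False
    return True


# All 31 sequences of length 0..4 over the alphabet {A, C}.
words = ["".join(p) for n in range(5) for p in product("AC", repeat=n)]
print(len(words), "sequences checked:", satisfies_metric_axioms(words))
```

Replacing `edit_distance` by a cost function with, say, different costs for insertion and deletion is a quick way to see which of the axioms can fail under other weight functions.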

6.8 Ambiguity of optimal alignment

Here is an example where the optimal alignment is not unique. Let $s$ = TTCC, $t$ = AATT.

The distance matrix (rows are indexed by $s$, columns by $t$):

               A  A  T  T
            0  1  2  3  4
         T  1  1  2  2  3
         T  2  2  2  2  2
         C  3  3  3  3  3
         C  4  4  4  4  4

[Center diagram: the minimal choices, i.e. for each $d_{i,j}$ the arrows showing which choice leads to the minimum.]

[Right-hand diagram: the optimal paths.]

In the center matrix, we have recorded which choice leads to the minimum for each $d_{i,j}$. Optimal paths follow these arrows and must reach the "target" $d_{m,n}$. The right-hand matrix shows the network of all optimal paths. They represent alignments as diverse as

    s =  T T C C          s =  - - T T C C
    t =  A A T T          t =  A A T T - -
    (with four proper Replaces)     (with two Inserts and two Deletes)

And of course there are some others, too.

7 Some Recommended Reading

A classical, but still interesting reference is [KS83].

The following book gives a detailed treatment of pairwise alignment, as well as many related problems in sequence analysis: [Wat89]

A new book by Waterman has just come out, [Wat95].

Classical references on BLAST and FASTA are: [AGM+89], [WL83]

A paper by Pearson on the new version of FASTA appeared in Protein Science 1995: [Pea95]

Before you start developing faster alignment algorithms for special cases of pairwise alignment, take a look at the following overview: [Mye91]

References

[AGM+89] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. A Basic Local Alignment Search Tool. J. Mol. Biol., 215:403-410, 1990.

[KS83] J.B. Kruskal and D. Sankoff. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, 1983.

[Mye91] E.W. Myers. An Overview of Sequence Comparison Algorithms in Molecular Biology. Technical Report TR 91-29, University of Arizona, Tucson, Department of Computer Science, 1991.

[Pea95] W.R. Pearson. Comparison of methods for searching protein sequence databases. Prot. Sci., 4:[...], 1995.

[Wat89] M.S. Waterman. Sequence Alignments. In M.S. Waterman, editor, Mathematical Methods for DNA Sequences, pages 53-92. CRC Press, Boca Raton, FL, 1989.

[Wat95] M.S. Waterman. Introduction to Computational Biology. Chapman and Hall, London, 1995.

[WL83] W.J. Wilbur and D.J. Lipman. Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proc. Natl. Acad. Sci. USA, 80:726-730, 1983.
