Figure 3: A cartoon plot of how a database of annotated proteins is used to identify a novel sequence

(2) Exact sequence comparison

2.1 Introduction

Sequence alignment (to be defined below) is a tool to test for a potential similarity between the sequence of an unknown target protein and a single (or a family of) well-characterized protein(s).

(Figure 3 labels: annotated sequences $B_1 \ldots B_4$, for example $B_2$ a globin and $B_3$ an α/β protein, are compared with the unidentified sequence $A = a_1 \ldots a_n$.)

We denote the sequence of the unknown protein by $A$. It is a shortcut for the explicit sequence $A = a_1, a_2, \ldots, a_n$, where $a_i$ is the amino acid placed at the $i$-th position along the sequence. The sequences of the known proteins, which make up the comparison database, are called $\{B_q\}_{q=1}^{Q}$, where $B_q = b_1, b_2, \ldots, b_m$ and $Q$ is the total number of proteins in the database. For sequences only, the database may include hundreds of thousands of entries [http://www.expasy.ch/srs5/]. Databases that include both sequences and structures are limited to about ten thousand entries [http://www.rcsb.org/pdb/]. For clarity we omit the index $q$ from $B_q$ and use $B$ to represent a member of the comparison set.

Table 1: A sample of protein sequences. Typical lengths of protein sequences vary from a few tens to a few hundreds of amino acids. The single-letter notation (one character for an amino acid) is used.

Serine Protease inhibitor
EDICSLPPEV GPCRAGFLKF AYYSELNKCK LFTYGGCQGN ENNFETLQAC XQA

Amicyanin alpha
MRALAFAAALAAFSATAALAAGALEAVQEAPAGSTEVKIAKMKFQTPEVR IKAGSAVTWTNTEALPHNVHFKSGPGVEKDVEGPMLRSNQTYSVKFNAPG TYDYICTPHPFMKGKVVVE

Major Histocompatibility Complex (class I)
MAVMAPRTLV LLLSGALALT QTWAGSHSMR YFSTSVSRPG RGEPRFIAVG YVDDTQFVRF DSDAASQRME PRAPWIEQEG PEYWDRNTRN VKAHSQTDRV DLGTLRGYYN QSEDGSHTIQ RMYGCDVGSD GRFLRGYQQD AYDGKDYIAL NEDLRSWTAA DMAAEITKRK WEAAHFAEQL RAYLEGTCVE WLRRHLENGK ETLQRTDAPK THMTHHAVSD HEAILRCWAL SFYPAEITLT WQRDGEDQTQ DTELVETRPA GDGTFQKWAA VVVPSGQEQR YTCHVQHEGL PEPLTLRWEP SSQPTIPIVG IIAGLVLFGA VIAGAVVAAV RWRRKSSDRK GGSYSQAASS DSAQGSDVSL TACKV

For any two sequences $A$ and $B$ we wish to determine how similar they are. This is a nontrivial task that requires a pause and a clear definition of what we mean by similar. The simplest definition is the fraction of amino acids in $B$ that are identical to those in $A$. In proteins, high sequence identity (roughly above 35 percent) immediately implies a close relationship. A more subtle definition uses amino acids with similar physical or biochemical properties. For example, it is acceptable to replace a hydrophobic amino acid like leucine (L) by valine (V). The effect on the hydrophobic core of the protein is expected to be small. However, changing L to a charged residue like arginine (R) is not acceptable and may cause a major reduction in protein stability. Hence, there is a gray area between an exact match and a mismatch. This gray area is better described by a continuous scoring function that gives the highest score for an identity and the lowest possible score for substitutions that do not make chemical, physical or biological sense. For that purpose a score function needs to be established to measure the similarity between the sequences.

We now assume that when comparing the sequences $A$ and $B$, the total score $S(A,B)$ is given by a sum of terms, each term comparing two elements. The choice of the pairs of elements for the similarity test is called an alignment and is the focus of the present section. An element is not necessarily an amino acid. It can also be a space (gap). For example, consider the case in which $A$ is longer than $B$.
In that case it is obvious that some amino acids of $A$ will be aligned against a space (gap). Before asking questions about the alignment process, let us start with the score: how do we measure the similarity of a pair of elements? The scores of pairs of elements, which measure the similarity of the $a$-s and the $b$-s, form the so-called substitution matrix $S(a,b)$. The matrix is symmetric, and of size $20 \times 20$ (the number of amino acid types). Note that the term substitution may be misleading since it suggests a direction (substituting $a$ by $b$ implies that $a$ was there first). A symmetric matrix does not have a sense of time, and the order of substitutions does not change the score, or the degree of similarity. Nevertheless, a non-symmetric variation of the sequence as a function of time is of considerable interest. A direction in sequence-similarity scores is potentially informative, providing molecular fingerprints for evolutionary time scales. However, it is hard to estimate it without a specific model in mind. At present, for the purpose of finding protein relatives, we shall use the simplest way out and ignore the time arrow of evolution.

A widely used (symmetric) substitution matrix is given below. It is extracted from probabilities. Consider two sequences $A$ and $B$ of evolutionarily related proteins that are aligned against each other. The choice of the proteins and the alignment procedure will be discussed later. Let $P(a,b)$ be the probability that amino acids $a$ and $b$ are aligned against each other. Let $P(a)$ and $P(b)$ be the probabilities of finding the amino acids $a$ and $b$ in the proteins that we considered. The score is given by

$$S(a,b) = \lambda \log \frac{P(a,b)}{P(a)\,P(b)}$$

The BLOSUM 50 matrix

     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    5  -2  -1  -2  -1  -1  -1   0  -2  -1  -2  -1  -1  -3  -1   1   0  -3  -2   0
R   -2   7  -1  -2  -4   1   0  -3   0  -4  -3   3  -2  -3  -3  -1  -1  -3  -1  -3
N   -1  -1   7   2  -2   0   0   0   1  -3  -4   0  -2  -4  -2   1   0  -4  -2  -3
D   -2  -2   2   8  -4   0   2  -1  -1  -4  -4  -1  -4  -5  -1   0  -1  -5  -3  -4
C   -1  -4  -2  -4  13  -3  -3  -3  -3  -2  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
Q   -1   1   0   0  -3   7   2  -2   1  -3  -2   2   0  -4  -1   0  -1  -1  -1  -3
E   -1   0   0   2  -3   2   6  -3   0  -4  -3   1  -2  -3  -1  -1  -1  -3  -2  -3
G    0  -3   0  -1  -3  -2  -3   8  -2  -4  -4  -2  -3  -4  -2   0  -2  -3  -3  -4
H   -2   0   1  -1  -3   1   0  -2  10  -4  -3   0  -1  -1  -2  -1  -2  -3   2  -4
I   -1  -4  -3  -4  -2  -3  -4  -4  -4   5   2  -3   2   0  -3  -3  -1  -3  -1   4
L   -2  -3  -4  -4  -2  -2  -3  -4  -3   2   5  -3   3   1  -4  -3  -1  -2  -1   1
K   -1   3   0  -1  -3   2   1  -2   0  -3  -3   6  -2  -4  -1   0  -1  -3  -2  -3
M   -1  -2  -2  -4  -2   0  -2  -3  -1   2   3  -2   7   0  -3  -2  -1  -1   0   1
F   -3  -3  -4  -5  -2  -4  -3  -4  -1   0   1  -4   0   8  -4  -3  -2   1   4  -1
P   -1  -3  -2  -1  -4  -1  -1  -2  -2  -3  -4  -1  -3  -4  10  -1  -1  -4  -3  -3
S    1  -1   1   0  -1   0  -1   0  -1  -3  -3   0  -2  -3  -1   5   2  -4  -2  -2
T    0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   2   5  -3  -2   0
W   -3  -3  -4  -5  -5  -1  -3  -3  -3  -3  -2  -3  -1   1  -4  -4  -3  15   2  -3
Y   -2  -1  -2  -3  -3  -1  -2  -3   2  -1  -1  -2   0   4  -3  -2  -2   2   8  -1
V    0  -3  -3  -4  -1  -3  -3  -4  -4   4   1  -3   1  -1  -3  -2   0  -3  -1   5

Note the high scores for a substitution of an amino acid to itself (the diagonal elements of the matrix). Note also that certain amino acids have a higher tendency for self-preservation (e.g. cysteine) while a small and polar amino acid like threonine can be substituted by other amino acids more easily. The detailed construction of the substitution matrix is of significant interest and is discussed later in the section: Learning scoring parameters. Concerning the goals of the present section we still need to choose the pairs to calculate the similarity score.
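Returning to the log-odds definition of the substitution score, a small numerical sketch may help to see where the sign of $S(a,b)$ comes from. The function name `log_odds_score` and all the probabilities below are hypothetical, chosen only for illustration; they are not taken from the BLOSUM construction.

```python
import math

def log_odds_score(p_ab, p_a, p_b, lam=1.0):
    """Return S(a,b) = lam * log( P(a,b) / (P(a) * P(b)) )."""
    return lam * math.log(p_ab / (p_a * p_b))

# Hypothetical probabilities (not from any real data set).
p_a = p_b = 0.05
favored = log_odds_score(0.02, p_a, p_b)        # pair aligned more often than chance
disfavored = log_odds_score(0.001, p_a, p_b)    # pair aligned less often than chance
neutral = log_odds_score(p_a * p_b, p_a, p_b)   # pair aligned exactly as often as chance

print(favored, disfavored, neutral)
```

A pair that is aligned more often than expected from the individual frequencies gets a positive score, a pair that is aligned less often gets a negative score, and a pair at the chance level scores zero.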
Can we just pick pairs of amino acids from the pools of residues that each protein makes? Of course, the choice of pairs of amino acids from $A$ and $B$ is not completely free. The order of the amino acids in the sequence (the primary structure of the protein) counts, and our comparison should maintain that order. This is a significant restriction on the comparisons that we may make and limits our choices considerably. Adding further to the complexity of the problem is the observation that not all proteins are of the same length; we need to account for variation in the lengths of the proteins and for the fact that some amino acids will not have corresponding amino acids to match. Two closely related proteins may differ at a specific site at which an amino acid was added to one protein but not to the second. To describe such cases we introduce a gap residue. A gap residue is denoted by "-" and is used to indicate an empty space along the sequence. For example, the comparison of the two proteins below (PDB codes 1rsy and 1a25_A) yields the following optimal arrangement of the two sequences with gaps:

TABLE: Optimal alignment of two sequences (top is 1rsy, Synaptotagmin I (First C2 Domain); bottom is 1a25, Protein Kinase C (β); Chain: A). The sequence identity is 33 percent.

1rsy  EKLGKLQYSLDYDFQNNQLLVGIIQ-AAEL-PALDMGGTSDPYVKVFLLPD-K-KKKFE
1a25  ERRGRIYIQAHID-R--EVLIVVVRDAKNLVP-MDPNGLSDPYVKLKLIPDPKSESKQK

1rsy  TKVHRKTLNPVFNEQFTFKVPYSELGGKTLVMAVYDFDRFSKHDIIGEFKVPMNTVD-F
1a25  TKTIKCSLNPEWNETFRFQLKESDKD-RRLSVEIWDWDLTSRNDFMGSLSFGISELQKA

1rsy  GHVTEEWRDLQS
1a25  G-V-DGWFKLLS

A gap is placed in one sequence (say $A$) to match an amino acid in the second sequence (say $B$). It indicates a missing amino acid in the $A$ sequence when comparing it to the $B$ sequence. Note that it is not possible for us to decide if an amino acid was deleted from $A$ or added to $B$ without a detailed evolutionary mechanism linking the two proteins to a common ancestor. It is therefore common to use the term indel to describe a gap (indel = an INsertion or a DELetion), a name that remains undecided about the mechanism of gap formation.

Our goal is to find the best match between two sequences, including the possibility of gaps. In order to score a given alignment and decide if it is good or bad we need a score for an indel. What is the score of the gap, i.e. the alignment of an indel against another amino acid? It is philosophically useful to think of a gap as an ordinary amino acid and ask what the score is for substituting an amino acid of type $a$ with an amino acid of type "-".
Such an approach is good in theory and there are some studies following this line of investigation. Unfortunately, determining gap scores (also called gap penalties) is so far still an open question. This difficulty is in contrast with the substitution matrices of amino acids, which are pretty well established. A common practice in alignment programs is to leave the gap score as a parameter to be determined by the user, setting a default value of (for example) zero. The score of a gap influences the alignment. If it is set too low, it is difficult to match related proteins of considerable variation in length (remote homologous proteins). If it is set too high, two vastly different proteins may get fragmented into many small overlapping pieces and a large number of gaps. The fragmented alignment (with high gap scoring) may still maintain a good score. We expect that the two proteins are related if a substantial fraction of the two sequences is similar.

High-scoring short sequence segments can be analyzed statistically to determine their significance. In fact such an analysis is behind the popular BLAST algorithm [x]. To first order BLAST does not consider gaps at all. A statistical test determines if short compatible segments of aligned sequences are significant. In some sense, BLAST solves the gap problem by avoiding it altogether. The BLAST algorithm will be discussed in the section: Statistical Analysis of Sequences.

A typical practical solution for the indel/gap penalty is to give it a single negative value, say $g$, regardless of the amino acid it is paired with. The gap penalty is set to be independent of the type of the amino acid it is aligned to (e.g. the pair W/- scores the same as the pair G/-). This choice is not intuitive since W (tryptophan) is much larger than G (glycine). The space left for the "-" residue by the removal of G or W will therefore be different and the cost is likely to be dissimilar. The less detailed treatment of gaps (compared to usual amino acids) can lead to poor alignments. Therefore a few extensions of the simple model of gap penalty were proposed.

One popular model is to differentiate between opening and extension of gaps. It is based on the argument that gaps should aggregate together. Claims were made that insertions and deletions are likely to appear at certain structural domains of proteins (mostly loops). The concentration at a few structural sites makes the indels appear together, and aggregate. To maximize the size of groups of gaps (and minimize the number of gap clusters), two gap penalties are assigned. One penalty is for initiating a gap group. A second (lower) penalty is for growing an existing gap cluster. For example (we construct our alignment from left to right), extending an alignment from X/X to XX/X- is less favorable than extending the pair X/- to XX/--. This is regardless of the fact that in both cases the same pair (X/-) was added to an existing alignment.
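To make the difference between the single-value gap model and the opening/extension model concrete, the sketch below scores a given gapped alignment column by column. It is only an illustration: the helper names `score_alignment` and `toy_subst`, the match/mismatch values (+2, -1) and the gap values are assumptions made here, not parameters recommended in the text.

```python
def score_alignment(top, bottom, subst, gap_open, gap_extend=None):
    """Sum the pair scores over the columns of a gapped alignment.

    If gap_extend is None, every indel costs gap_open (single-value model).
    Otherwise an indel that directly follows an indel in the same sequence
    costs gap_extend, and any other indel costs gap_open.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    if gap_extend is None:
        gap_extend = gap_open
    total = 0
    prev_gap_row = None  # which sequence carried the gap in the previous column
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("an indel aligned against an indel is not allowed")
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            total += gap_extend if row == prev_gap_row else gap_open
            prev_gap_row = row
        else:
            total += subst(x, y)
            prev_gap_row = None
    return total

def toy_subst(x, y):
    """Hypothetical substitution score: +2 for identity, -1 otherwise."""
    return 2 if x == y else -1

clustered = ("ACDEF", "A--EF")   # the two indels are adjacent
scattered = ("ACDEF", "A-D-F")   # the two indels are separated

print(score_alignment(*clustered, toy_subst, gap_open=-3))                  # 0
print(score_alignment(*scattered, toy_subst, gap_open=-3))                  # 0
print(score_alignment(*clustered, toy_subst, gap_open=-3, gap_extend=-1))   # 2
print(score_alignment(*scattered, toy_subst, gap_open=-3, gap_extend=-1))   # 0
```

With a single gap value the clustered and the scattered arrangements score the same; with a lower extension penalty the clustered arrangement is preferred, which is the aggregation effect described above.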
This model improves the overall appearance of the alignments. However, the present author is not enthusiastic about it. It is highly asymmetric with respect to other amino acids (remember, we would like to consider a gap to be an amino acid), and it is not obvious that the asymmetry is indeed required. It also makes the identification of the optimal alignment messier. Alternative models of gaps (so far less popular) will be discussed later and include structure-dependent gaps. For the moment we shall concentrate on the simplest model of gaps (one value does it all), and after solving that problem the effect of the extensions will be examined.

We wish to place the gaps in such a way that the alignment, or the comparison of the two sequences, will be optimal. Early studies of sequence comparisons found optimal alignments manually. Gaps were inserted by hand into positions that made biochemical sense and increased the number of good-looking pairs [x]. For an implementation on the computer we consider an optimal alignment to be an arrangement of the two sequences with respect to each other in such a way that the total score of the alignment is as high as possible.

Let us examine in more detail the problem of optimal alignments and the possible arrangements of sequences. There are many ways of aligning a sequence $A$ against a sequence $B$ once gaps are introduced (even if the original order of the amino acids is maintained). Gaps can enter anywhere and in a wide range of numbers in the alignment, creating many alternative arrangements. We denote the extended sequences of $A$ and $B$ (with gaps) as $\bar{A}$ and $\bar{B}$. For example, $\bar{A}$ might be $\bar{A} = a_1\, -\, a_2\, a_3\, a_4 \ldots a_n$.

The introduction of the extended sequences raises another intriguing question: what is the length of the sequences $\bar{A}$ (or $\bar{B}$)? Of course, the lengths may vary, but they still have upper bounds. Consider two sequences $A$ and $B$ of the same length, $n$. The maximum length of $\bar{A}$ and $\bar{B}$ should be $2n$. In the alignment of extended sequences that are maximally long, every amino acid of $A$ or $B$ is aligned against a gap: $(a/-)$ and $(-/b)$. An increase of the length beyond the $2n$ limit will necessarily include an alignment of an indel with respect to another indel, i.e. $(-/-)$. Such an alignment does not make sense from a scientific point of view. We have no way to determine the number of double gaps or their locations. Moreover, there is also a technical problem if the $(-/-)$ pair generates a favorable score. Consider the alignment

a_1 ... a_{n-1} a_n
b_1 ... b_{n-1} b_n

Let the total score be $T_{AB}$. Extending the above alignment by one more pair of indels, we have

a_1 ... a_{n-1} a_n -
b_1 ... b_{n-1} b_n -

The new score is $T_{AB} + S(-/-)$, where $S(-/-)$ is the element of the substitution matrix replacing an indel by an indel. If the new score is better than $T_{AB}$, it is trivial to construct an even better alignment (and score) by adding yet another pair of indels, with a yet better score of $T_{AB} + 2S(-/-)$. The favorable extension with indels can proceed to infinity and is unbounded. We therefore eliminate from our sequence-to-sequence alignments the possibility of $(-/-)$.

It is settled then: the maximum length of $\bar{A}$ is $2n$. How many possible alignments (with gaps) do we have? Or, how many do we need to examine before deciding on an optimal alignment?

1.2 Counting alignments

To count the number of possible alignments, and as a starting point of the discussion on optimal alignments, we consider the dynamic matrix. The dynamic matrix is a table used for the alignment of two sequences. Below we provide one example for two sequences with the same length ($n = 5$). The $A$ sequence runs along the top of the table and the $B$ sequence down its side. The numbers at the different matrix entries will be explained below.

        -     a_1   a_2   a_3   a_4   a_5
-       1     1     1     1     1     1
b_1     1     3     5     7     9     11
b_2     1     5     13    25    41    61
b_3     1     7     25    63    129   231
b_4     1     9     41    129   321   681
b_5     1     11    61    231   681   1683

The hairy picture with the numerous arrows is actually telling. Paths in this table, which start at the upper left corner and end at the lower right corner, represent all the possible alignments of the two sequences. From each entry in the dynamic matrix, there are three alternative moves: following the diagonal, going down, or moving to the right. A step in the matrix that is part of a legitimate alignment never proceeds from right to left or up. For example, the thick line in the matrix figure corresponds to one particular alignment of $a_1 \ldots a_5$ against $b_1 \ldots b_5$. A move along a diagonal aligns an amino acid against another amino acid. A vertical step in the matrix aligns a $b$ amino acid against an indel and a horizontal step puts a gap against an $a$ amino acid. Note that we use "-" for the gap residue (or an indel).

The numbers at the different entries of the table denote the number of paths (alignments) that can reach this point. For example, there are 5 possible alignments of $a_1 a_2$ against $b_1$ (check the table element at the cross between $a_2$ and $b_1$). They are:

(1) a_1 a_2    (2) a_1 a_2    (3) a_1 a_2 -     (4) a_1 -   a_2    (5) -   a_1 a_2
    b_1 -          -   b_1        -   -   b_1       -   b_1 -          b_1 -   -

The last three alignments are essentially the same and are degenerate. At present, we consider all paths, including the degenerate paths. In the Appendix a clever counting protocol (by Dr. Jaroslaw Meller) is outlined that gives the number of non-degenerate paths. Interestingly, the exact number of non-degenerate paths is $\binom{n+m}{m}$. It has the same asymptotic behavior as the approximate lower-bound expression we sketch below for the number of all paths.

Another interesting property of this table is a summation rule and the possibility of constructing a recursion formula for the number of alignments. There are three ways ("sources") to extend a shorter alignment to obtain one of the five longer alignments listed above. (a) Extend an earlier alignment by the pair $(a_2/b_1)$, a diagonal move in the above table. Alternatively, (b) the pair $(a_2/-)$ or (c) the pair $(-/b_1)$ (horizontal or vertical moves in the above table) can be used to extend a shorter alignment and to obtain a member of the above group. Each of the three sources (a)-(c) has its own position in the matrix with a corresponding number of alignments. For example, before adding the pair $(a_2/b_1)$ we were at the table entry aligning $a_1$ against "-". There is only one way of aligning $a_1$ against a gap, and 1 is indeed the corresponding entry. Another example of extending the alignment is to add a gap against $a_2$, which means that our earlier position in the table was the alignment of $a_1$ and $b_1$. This last alignment can be done in three different ways, and therefore the table entry is 3:

a_1      a_1 -        -   a_1
b_1      -   b_1      b_1 -

If we add the number of paths starting at the previous three sources and leading to the alignment of $a_1 a_2$ with respect to $b_1$ we obtain $1 + 1 + 3 = 5$, exactly the number of alignments of our target.

To summarize the above empirical observations more precisely: the number of possible alignments of $n$ $a$-s against $m$ $b$-s is defined as $N(n,m)$. This number can be determined using the recursive formula

$$N(n,m) = N(n-1,m-1) + N(n-1,m) + N(n,m-1),$$

and the initial conditions $N(0,0) = N(0,m) = N(n,0) = 1$. Note that the definition of $N(0,0)$ was made for computational convenience and does not imply that nothing against nothing can be aligned in exactly one way. While this formula can be used directly, it is useful to have an order-of-magnitude estimate of the number of alignments. This estimate is especially useful if we are planning a computation that will enumerate all of the alignments.
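The recursion for $N(n,m)$ can be evaluated directly, which is a convenient way to check the entries of the counting table. The following is a minimal sketch; the function name `count_alignments` is ours and the table is stored as a list of lists indexed as `N[n][m]`.

```python
def count_alignments(n, m):
    """Fill the table N(i, j) for 0 <= i <= n and 0 <= j <= m using
    N(i, j) = N(i-1, j-1) + N(i-1, j) + N(i, j-1),
    with N(0, 0) = N(0, j) = N(i, 0) = 1."""
    N = [[1] * (m + 1) for _ in range(n + 1)]  # first row and column are all 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            N[i][j] = N[i - 1][j - 1] + N[i - 1][j] + N[i][j - 1]
    return N

table = count_alignments(5, 5)
for row in table:
    print(row)
```

The entry `table[2][1]` reproduces the five alignments of $a_1 a_2$ against $b_1$ counted by hand, and `table[5][5]` reproduces the value 1683 at the lower right corner of the table.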
If a calculation is not feasible with existing computer resources it is better to know that in advance, and not after a few weeks of futile attempts to execute the desired computation. A simple lower bound for $N(n,m)$ can be obtained quickly. $N(n,m)$ is a positive number, so we can write $N(n,m) > N(n-1,m) + N(n,m-1)$ (we dropped for convenience the term $N(n-1,m-1)$ of the original equality). So the alternative recursion formula $N'(n,m) = N'(n-1,m) + N'(n,m-1)$ (with the same initial conditions as for $N(n,m)$) always yields lower numbers than $N(n,m)$. For $N'(n,m)$ we have a closed expression:

$$N'(n,m) = \binom{n+m}{m} = \frac{(n+m)!}{n!\,m!}.$$

This formula is easy to verify by direct substitution of the closed expression into the recursion. Note that this is the same as the (exact) result derived by Meller for non-degenerate paths (Appendix). We now make use of the Stirling formula, $\log n! \approx n \log n - n$ [x], valid for large $n$. The logarithm of the number of alignments $N'(n,n)$ is estimated as

$$\log N'(n,n) = \log\frac{(2n)!}{(n!)^2} = \log (2n)! - 2\log n! \approx 2n\log(2n) - 2n - 2\left[n\log n - n\right] = 2n\log 2 = \log\left(2^{2n}\right),$$

and the lower bound for the number of alignments is $N'(n,n) \approx 2^{2n}$. For a short protein ($n \approx 50$; $N' \approx 1.27 \times 10^{30}$) the number is substantial and is beyond what we can do today in a systematic search. Even if the computation of a single alignment requires a nanosecond ($10^{-9}$ second), which is unrealistically fast, it is still necessary to use $10^{21}$ sec $\approx 10^{13}$ years to examine all possible alignments. If this is not impressive enough, remember that this is a lower bound and the precise counting of all paths will yield a number significantly larger than this one. For example, for $n = 5$ we have $2^{10} = 1024$ (the closed binomial expression itself gives 252), already a number significantly lower than the exact number in the table (1683). This underlines the statement that it is impossible to examine all alignments one by one.

Readers with a background in structural biology may recall the Levinthal paradox in protein folding. The paradox contrasts the huge number of plausible protein conformations with the efficiency with which proteins fold. However, as we see below, a large space to search does not necessarily mean that optimization in that space is difficult. It is not obvious that the optimization must be performed at a cost proportional to the volume of that space. In fact it can be profoundly cheaper.

1.3 Dynamic programming and optimal alignments

After this long detour, it is about time to return to the basic question: what is the optimal alignment of $A$ and $B$? The score matrix and the gap penalty are provided, and all we need is to fish out the alignment(s) with the highest overall score(s).
This is a point at which algorithms developed by computer scientists can be extremely useful. The fact that the number of possible alignments is exponentially large in the sequence length does not mean that the search for the optimal alignment needs to be done in exponential time. Sequence alignments can be done with dynamic programming, an algorithm that requires only about $n^2$ operations to find the alignment with the best score, a remarkable saving compared to $2^{2n}$ operations.

The efficient search for the optimal alignment consists of two steps. In the first step a dynamic matrix is constructed, and in the second step an optimal path is found in the table just constructed. The first step is similar to the counting of possible alignments and the recursive expression we wrote to compute the total number of alignments. We denote the optimal score of the alignment of a sequence of length $n$ against a sequence of length $m$ by $T(n,m)$, and consider the following question: assuming that a very kind fellow gave us the optimal scores $T(n-1,m-1)$, $T(n-1,m)$ and $T(n,m-1)$, can we construct the score $T(n,m)$?

The answer is yes. Since the total score is given by a sum of the scores of the individual aligned pairs, we construct the score $T(n,m)$ from the three alignments leading to it. We consider three possibilities to obtain an alignment of $n$ against $m$ amino acids.

Option a: Align $n-1$ against $m-1$ amino acids (this alignment has the known optimal score of $T(n-1,m-1)$). Extend it by the alignment of $a_n/b_m$ with a score of $S(a_n,b_m)$, which is the substitution score according to the types of the amino acids $a_n$ and $b_m$ (e.g., from the BLOSUM matrix). Hence the first suggestion for an optimal score is $T(n-1,m-1) + S(a_n,b_m)$.

Option b: Align $n$ amino acids against $m-1$ amino acids. This alignment also has a known score, which is $T(n,m-1)$. Extend this alignment by $-/b_m$ with a corresponding score $g$ for a gap. The second possibility for an optimal alignment is therefore $T(n,m-1) + g$.

Option c: Align $n-1$ amino acids against $m$ amino acids with the known (optimal) score of $T(n-1,m)$. Extend the alignment by adding $a_n/-$ with a corresponding score of $g$. The third suggestion is therefore $T(n-1,m) + g$.

Note that we use the simple model of gaps in which only a single score is associated with an indel, regardless of the amino acid it is aligned against. Our final task is to select the highest score from the three alternatives $T(n-1,m-1) + S(a_n,b_m)$, $T(n,m-1) + g$ and $T(n-1,m) + g$, which is the optimal score $T(n,m)$. More compactly, we write:

$$T(n,m) = \max\left\{\begin{array}{l} T(n-1,m-1) + S(a_n,b_m) \\ T(n,m-1) + g \\ T(n-1,m) + g \end{array}\right.$$

The above recursion can be used to fill the complete dynamic matrix with optimal scores.
We start with the condition $T(a_1,-) = T(-,b_1) = g$ and use the initial values to grow the matrix. For example,

$$T(a_1,b_1) = \max\left\{\begin{array}{l} T(a_1,-) + g \\ T(-,b_1) + g \\ T(0,0) + S(a_1,b_1) \end{array}\right. = \max\left\{\begin{array}{l} 2g \\ 2g \\ S(a_1,b_1) \end{array}\right.$$

where $T(0,0)$ denotes the alignment of nothing, which (not surprisingly) scores zero. Another simple example is $T(n, n(-)) = n \cdot g$: there is only one arrangement in which $n$ amino acids are aligned against all gaps, which provides us with immediate optimal scores for the first row and column of the dynamic matrix. It is also clear that, similarly to the direct counting of the number of paths, we can construct all the optimal scores.

        -     a_1   a_2   a_3   a_4   a_5
-       0     g     2g    3g    4g    5g
b_1     g
b_2     2g
b_3     3g
b_4     4g
b_5     5g

A simple pseudo code to create the dynamic matrix is given below.

    /* fill the first (zero) column and the first (zero) row */
    T(0,0) = 0
    Do I = 1:n
        T(I,0) = I*g
    End do
    Do J = 1:m
        T(0,J) = J*g
    End do
    /* now fill the rest of the matrix, picking the maximum value
       from the three possibilities */
    Do I = 1:n
        Do J = 1:m
            T(I,J) = max[ T(I-1,J)+g, T(I,J-1)+g, T(I-1,J-1)+S(a(I),b(J)) ]
        End do
    End do
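The pseudo code above translates almost line by line into a runnable program. The version below is a minimal sketch under our own assumptions: sequences are Python strings, the substitution score is passed as a function `subst(x, y)`, and the toy scores (+2 for identity, -1 otherwise, $g = -2$) are hypothetical values used only to have something small enough to check by hand.

```python
def fill_dynamic_matrix(a, b, subst, g):
    """Return the dynamic matrix T, where T[i][j] is the optimal score of
    aligning the first i residues of a against the first j residues of b,
    using a single gap score g for every indel."""
    n, m = len(a), len(b)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    # First (zero) column and first (zero) row: everything aligned against gaps.
    for i in range(1, n + 1):
        T[i][0] = i * g
    for j in range(1, m + 1):
        T[0][j] = j * g
    # Remaining entries: maximum over the three ways of reaching (i, j).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = max(
                T[i - 1][j] + g,                               # a_i against a gap
                T[i][j - 1] + g,                               # gap against b_j
                T[i - 1][j - 1] + subst(a[i - 1], b[j - 1]),   # a_i against b_j
            )
    return T

def toy_subst(x, y):
    """Hypothetical substitution score: +2 for identity, -1 otherwise."""
    return 2 if x == y else -1

T = fill_dynamic_matrix("ACD", "AD", toy_subst, g=-2)
for row in T:
    print(row)
```

For this toy case the lower right entry is 2, the score of the alignment ACD over A-D (two identities and one gap).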

Actually, if we are primarily interested in the score of the alignment of $A = a_1 \ldots a_n$ with respect to $B = b_1 \ldots b_m$, only a single element of the dynamic matrix, $T(n,m)$, is of interest. Of course, in order to get at it we first need to compute the whole matrix. However, this score, which is recorded at the lower right corner of the dynamic matrix, is only part of the story. Besides the score, in many cases it is important to know the alignment itself. The path in the dynamic matrix that corresponds to the optimal alignment can be found by a trace-back procedure. Starting from $T(n,m)$ we ask which of the three possible steps could have generated the final optimal score. That is, we examine the three possibilities

$$T(n,m) \overset{?}{=} T(n-1,m-1) + S(a_n,b_m)$$
$$T(n,m) \overset{?}{=} T(n-1,m) + g$$
$$T(n,m) \overset{?}{=} T(n,m-1) + g$$

For at least one of the tests above the equality will hold. It is possible that more than one test is correct, and in that case the optimal alignment is degenerate: there is more than one alignment that provides the optimal score. Usually we consider only one path. For example, if the first test is correct the alignment ends with the pair $(a_n/b_m)$, and we repeat the three tests to find the next path segment, this time starting from $T(n-1,m-1)$. The process is repeated until it reaches the upper left corner and provides the desired optimal alignment. The maximum number of times that the process is repeated is $n+m$ (all the steps are horizontal or vertical) and the smallest number of steps is either $n$ or $m$, the larger of the two.

We are also ready for a rough estimate of the computational effort. As we discussed earlier, a lower bound on the number of possible alignments is about $2^{n+m}$. Examining all possible alignments will require a computational effort that grows exponentially with $n+m$. The computations that we just described, using dynamic programming, are much cheaper. Let us consider the two steps separately. In the first step we create the dynamic matrix $T(n,m)$.
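The trace-back procedure can be sketched in a few lines once the dynamic matrix is available. The code below is an illustration under stated assumptions, not a reference implementation: the function name `global_align` is ours, scores are assumed to be integers so that the equality tests are exact, ties are resolved by trying the diagonal step first (so only one of possibly several degenerate optimal alignments is returned), and the toy scores are hypothetical.

```python
def global_align(a, b, subst, g):
    """Fill the dynamic matrix, then trace back one optimal alignment.
    Returns (optimal score, aligned a, aligned b), with '-' marking indels."""
    n, m = len(a), len(b)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][0] = i * g
    for j in range(1, m + 1):
        T[0][j] = j * g
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = max(T[i - 1][j - 1] + subst(a[i - 1], b[j - 1]),
                          T[i - 1][j] + g,
                          T[i][j - 1] + g)

    # Trace back from the lower right corner to the upper left corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + subst(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1])   # pair a_i / b_j
            i -= 1; j -= 1
        elif i > 0 and T[i][j] == T[i - 1][j] + g:
            top.append(a[i - 1]); bottom.append("-")        # a_i against a gap
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])        # gap against b_j
            j -= 1
    # The pairs were collected from right to left, so reverse them.
    return T[n][m], "".join(reversed(top)), "".join(reversed(bottom))

def toy_subst(x, y):
    """Hypothetical substitution score: +2 for identity, -1 otherwise."""
    return 2 if x == y else -1

score, aligned_a, aligned_b = global_align("ACD", "AD", toy_subst, g=-2)
print(score)
print(aligned_a)
print(aligned_b)
```

The while loop performs at most $n+m$ steps, in line with the step count given above.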
To generate a single element we need (about) four operations: evaluating three expressions, and deciding which of the three is the largest. The cost of calculating a single element is therefore independent of $n$ or $m$, and the cost associated with the computation of the matrix is proportional to $n \times m$ (the number of matrix elements). To trace the path of the alignment (once the matrix is known) takes a maximum of $n+m$ operations.

Something to think about (*2.1)

We wish to align simultaneously the three sequences $A = a_1 \ldots a_n$, $B = b_1 \ldots b_m$ and $C = c_1 \ldots c_l$. A sequence of ordered triplets defines the alignment. Each triplet is of the form $(a/b/c)$, where $a$ (for example) is either an amino acid or an indel. A maximum of two gaps is allowed in a triplet. What are the minimum and maximum lengths of an alignment? Write a recursion relation for the number of alignments. Compute the number of alignments for the special case in which $n = m = l = 5$.

Gap opening and extension

An empirical expectation about gaps is that they should aggregate. Structurally, proteins are characterized by secondary structure elements, such as helices or sheets, linked by loops. The loops are less defined structurally and are more exposed to solvent. We expect gaps to appear in loops since changes in loops have a smaller effect on the global shape. One way to induce this tendency is to make the gap score structure dependent. A structure-dependent gap penalty requires a reasonably reliable structural model and is therefore not available for all sequences. The alternative for modeling gaps, which is quite popular, is to introduce different penalties for gap opening, $g_o$, and gap extension, $g_e$. The gap extension is set to be less costly than gap opening. For a fixed number of gaps the preferred arrangement of the gaps is in one or a few big clusters.

A schematic drawing of a protein shape is shown that consists of a two-helix bundle and a loop. It is expected that the indels will concentrate at the loop region.

Unfortunately the two types of gaps complicate our simple and elegant dynamic programming protocol. The problem is that extending the alignment with different types of gaps has a memory. If we consider adding a gap to an existing alignment we need to check whether the previous step already had an indel. If it did, the gap penalty of the extra indel will be $g_e$; if it did not, the gap penalty will be $g_o$. Hence, in contrast to the earlier dynamic programming protocol, the penalty depends on more than just the current pair of aligned characters. For example, extending

... a_{n-1}                ... a_{n-1} a_n
... -            to        ... -       -

adds a gap penalty of $g_e$, whereas extending

... a_{n-1}                ... a_{n-1} a_n
... b_m          to        ... b_m     -

adds a gap penalty of $g_o$.

We cannot just extend our state to include two pairs, since we will not know if a pair of gaps was just opened or is an extension of yet a previous gap. We need to think of three(!) dynamic matrices, or three score functions. We ask how it is possible to obtain the optimal score (and the alignment) $T(n,m)$. Consider what we are facing now. As before, in order to complete the alignment we need to consider the addition of three possible pairs, $(a_n/b_m)$, $(a_n/-)$ or $(-/b_m)$, to complete an alignment to an $(n,m)$ alignment. Unfortunately, and in contrast to the simple dynamic matrix we had before, we cannot forget what the previous alignments were. The tails of the prior alignments (with $(n-1,m-1)$) have three options of their own:

(1) ... a_{n-1}      (2) ... a_{n-1}      (3) ... -
    ... b_{m-1}          ... -                ... b_{m-1}

Matching three against three creates nine possibilities (in principle). Let us consider a few of them. Adding the pair $(a_n/b_m)$ is actually quite straightforward: we do not have to think about gap extension or opening, and simply add the score $S(a_n,b_m)$ to the current optimal scores of (1)-(3). More interesting are the additions of the pairs $(a_n/-)$ or $(-/b_m)$ to (1)-(3). Again we have no problem with extending the optimal alignment (1), since it includes no gaps. However, the pairs that include gaps are more complicated. For example, adding $(a_n/-)$ to (2) will have a different score than adding the same pair to (3). This observation means that the optimal score depends on what we attach to it. Luckily for us, the extent of the memory is just one pair. We therefore define three optimal scores depending on the presence or absence (and the up/down location) of a gap at the right edge of the alignment:

$T^{(1)}(n,m)$: the alignment ends with the pair $(a_n/b_m)$
$T^{(2)}(n,m)$: the alignment ends with the pair $(a_n/-)$
$T^{(3)}(n,m)$: the alignment ends with the pair $(-/b_m)$

As the names suggest, the first optimal score is for an alignment that ends with a pair of amino acids, the second for an alignment that ends with a gap aligned against $a_{n+1}$, and the third for an alignment that ends with a gap aligned against $b_{m+1}$.

$$T^{(1)}(n+1, m+1) = \max \begin{cases} T^{(1)}(n, m) + S(a_{n+1}, b_{m+1}) \\ T^{(2)}(n, m) + S(a_{n+1}, b_{m+1}) \\ T^{(3)}(n, m) + S(a_{n+1}, b_{m+1}) \end{cases}$$

The addition of the pair $\binom{a_{n+1}}{-}$ helps to define the recursive relationship for $T^{(2)}(n+1, m+1)$:

$$T^{(2)}(n+1, m+1) = \max \begin{cases} T^{(1)}(n, m+1) + g_o \\ T^{(2)}(n, m+1) + g_e \\ T^{(3)}(n, m+1) + g_o \end{cases}$$

And the final pair, $\binom{-}{b_{m+1}}$, gives a similar recursive relationship:

$$T^{(3)}(n+1, m+1) = \max \begin{cases} T^{(1)}(n+1, m) + g_o \\ T^{(2)}(n+1, m) + g_o \\ T^{(3)}(n+1, m) + g_e \end{cases}$$

Local alignments

So far we considered alignments in which the complete sequence A was aligned against the complete sequence B. This procedure is also called global alignment. In a global alignment we account for all the amino acids; we may have added indels, but we did not take away any of the amino acids in A or B. In other words, our paths in the dynamic matrix always started at the upper left corner and ended at the lower right corner of the dynamic matrix.

We can imagine asking a different question, for which global alignments are not appropriate. For example, we may have a DNA sequence (a gene, say about 1000 base pairs) and search for a relative in a whole genome (say, a million base pairs). Aligning a single sequence against a large continuous string (sometimes only partially annotated) cannot be done with the procedure we just discussed. Another exercise that cannot be done with global alignment is the detection of similar domains. In numerous cases proteins share similar fragments (domains) while the rest of the structure (or sequence) is not similar. It is useful to identify these domains since they can serve as building blocks for a complete model of the protein, provide hints to the protein family, or suggest plausible biological activity (if the domain has characteristic activity).
Alignments of fragments do not exclude the possibility that the whole protein sequences will be aligned, but we do not enforce it.
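Before turning to the local-alignment recursion, the three-score scheme for gap opening and extension described earlier in this section can be sketched in Python. This is only an illustrative sketch under stated assumptions: the names `affine_global_score` and `match_score`, the toy scoring values, and the boundary values along the first row and column are not part of the text. The penalties `g_open` and `g_ext` stand for $g_o$ and $g_e$ and are taken to be negative numbers that are added to the score.

```python
NEG_INF = float("-inf")


def match_score(x, y):
    """Toy substitution score (an assumption): +2 for identity, -1 otherwise."""
    return 2 if x == y else -1


def affine_global_score(a, b, score, g_open, g_ext):
    """Optimal global alignment score with separate gap opening/extension.

    T1[i][j]: best score of a[:i] vs b[:j] ending with the pair (a_i, b_j)
    T2[i][j]: best score ending with (a_i, -), i.e. a gap in b
    T3[i][j]: best score ending with (-, b_j), i.e. a gap in a
    """
    n, m = len(a), len(b)
    T1 = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    T2 = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    T3 = [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    # Assumed boundary conditions: an empty alignment scores zero, and a
    # leading run of indels pays one opening plus extensions.
    T1[0][0] = 0.0
    for i in range(1, n + 1):
        T2[i][0] = g_open + (i - 1) * g_ext
    for j in range(1, m + 1):
        T3[0][j] = g_open + (j - 1) * g_ext

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(a[i - 1], b[j - 1])
            # Adding an amino acid pair never involves a gap penalty.
            T1[i][j] = max(T1[i - 1][j - 1], T2[i - 1][j - 1], T3[i - 1][j - 1]) + s
            # Adding (a_i, -): extension only if the previous tail was (a, -).
            T2[i][j] = max(T1[i - 1][j] + g_open,
                           T2[i - 1][j] + g_ext,
                           T3[i - 1][j] + g_open)
            # Adding (-, b_j): extension only if the previous tail was (-, b).
            T3[i][j] = max(T1[i][j - 1] + g_open,
                           T2[i][j - 1] + g_open,
                           T3[i][j - 1] + g_ext)

    return max(T1[n][m], T2[n][m], T3[n][m])
```

With these toy scores, $g_o = -3$ and $g_e = -1$, aligning `ACGGT` against `ACT` gives a score of 2, with the two indels placed next to each other. Setting $g_e = g_o = -3$ removes the incentive to cluster and lowers the best score to 0.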

A schematic drawing of the backbone traces of the two proteins shows a similar fragment (a helix and a β strand). Clearly, the fragment similarity will not be detected by a global alignment.

To obtain a local alignment, and to enable the identification of similar fragments, we modify the creation of the dynamic matrix. For local alignment we have:

$$T(n, m) = \max \begin{cases} 0 \\ T(n-1, m-1) + S(a_n, b_m) \\ T(n-1, m) + g \\ T(n, m-1) + g \end{cases}$$

The newly added value of zero is used to terminate an alignment. If extending an alignment would bring its score below zero, then it is better to terminate it here. Note that there is a specific assumption on the scoring matrix: the random hits obtained by extending a sequence should not be positive on average. Otherwise, the local alignment will grow without bound.

The search for the optimal alignment is also modified. In global alignments we started our trace back procedure from the lower right corner of the matrix and searched for the optimal path until we hit the upper left corner. Local alignments do not necessarily start or end at the corners. The first step in determining the local alignment is the identification of the start. The start is a matrix entry with a high score (the highest one, for the optimal local alignment). Then a trace back procedure, similar to that of global alignments, is used until we hit a score of zero. This is a sign that we need to terminate this alignment and that we have found the highest scoring fragment for A and B.
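The local-alignment recursion and trace back described above can be sketched as follows. This recursion with a single gap penalty $g$ is commonly known as the Smith-Waterman algorithm. The function names, the toy scoring values, and the tie-breaking order in the trace back are assumptions made for illustration; integer scores are assumed so that the equality tests in the trace back are exact.

```python
def match_score(x, y):
    """Toy substitution score (an assumption): +2 for identity, -1 otherwise."""
    return 2 if x == y else -1


def local_align(a, b, score, gap):
    """Return (best score, aligned fragment of a, aligned fragment of b)."""
    n, m = len(a), len(b)
    # The first row and column stay at zero: an alignment may start anywhere.
    T = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = max(0,
                          T[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          T[i - 1][j] + gap,
                          T[i][j - 1] + gap)
            # Store the highest entry while the matrix is being built.
            if T[i][j] > best:
                best, best_pos = T[i][j], (i, j)

    # Trace back from the highest entry until a zero entry is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and T[i][j] > 0:
        if T[i][j] == T[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif T[i][j] == T[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

For instance, `local_align("GGGACGTCCC", "TTACGTAA", match_score, -2)` picks out the shared fragment `ACGT` with a score of 8 and ignores the dissimilar flanks, which a global alignment would be forced to include.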

[Figure: a dynamic matrix for $a_1, \ldots, a_5$ against $b_1, \ldots, b_5$, in which most entries are zero and a short path of positive entries leads to the maximum score.]

A short path charts a local alignment in a dynamic matrix. The maximum score is 10, and a trace back procedure provides us with an alignment of $a_1 a_2 a_3 a_4$ against the first residues of B. The above example of a local alignment ends at the upper left corner. However, as with the start of the alignment, there is no condition on its termination point. In contrast to global alignment, we have now added one more operation (the search for the highest scoring element), which also scales as $n \times m$. Of course, the highest scoring element can be stored during the construction of the matrix, saving us the additional computation afterward.

Something to think about

Can you suggest a procedure that uses the generation of the dynamic matrix to speed up the computation of the alignment path? The idea is to avoid the if statements that test the sources of the optimal alignment.

Suppose that A represents one protein while B is a whole genome. In that case we may wish to align the whole sequence A against a fragment of B. How would you design a rule to generate the appropriate dynamic matrix? Qualitatively, how will the matrix differ from the global and local alignment matrices?

Note that the optimal local alignment, which is defined as the highest scoring local alignment, is not necessarily the longest. It is probably worth checking a few other high scores as well, to detect potentially longer alignments. In general, the longer the alignment, the more significant it is likely to be under the statistical tests described later, even if it does not have the highest score.

Rough estimate of practical computational cost using dynamic programming

The complexity of the dynamic programming algorithm, or the computational effort of applying it, is proportional to $n^2$ for large $n$. This is clearly much better than the brute-force counting of all the alignments that we discussed previously (for which the computational effort grows exponentially with $n$). Nevertheless, dynamic programming can still be expensive, depending on the size of the problem at hand and on our resources, as discussed below.

Powerful and low-cost personal computers (PCs) make one type of high performance computing, which is based on clusters of PCs, more accessible. We therefore estimate the calculation time of typical alignments using PC units. An alignment of two proteins of average length ($n = 200$) requires a tenth of a second on a 600 MHz PC. Comparing a single sequence to a database of 100,000 proteins therefore takes $100{,}000 \times 0.1\ \mathrm{s} = 10{,}000$ seconds, or about 2.8 hours. Of course, this time grows if we wish to examine more than one sequence. The non-trivial computational effort promoted the use of approximate algorithms [x] with a computational cost linear in the protein length. The two leading approximations are FASTA and BLAST. The approximate procedures will be discussed later in the section "Approximate alignments". The computational cost further promoted the use of special purpose hardware for rapid parallel alignments [x]. There is an ongoing competition here between the rapid growth of computer resources and approximate algorithms. As the speed and capacity of computers increase, the option of finding the optimal alignment becomes more feasible.

Statistical verification of the results

Once we have finished comparing our sequence to all other sequences in the database, we have a winner.
A sequence B (or perhaps a few sequences) scores better than the other sequences in the database when aligned to A. The winner is our best guess for a relative (homologous protein) of the unknown target. We should keep in mind that our database is limited. It does not contain all sequences, just our current sample of the annotated sequence space. It is possible that none of the current reference proteins is homologous to the unknown sequence. Of course, we always get best scores. Every distribution has tails, and the distribution of scores is no exception. It has a tail with highly scoring winning sequences. However, are these best scores of biological significance, or just an arbitrary choice among other irrelevant possibilities? We need to measure the significance of the score of the winning alignment to differentiate a true hit from a random hit. If we find that the winner is not significant, then either the database lacks relevant proteins, or our detection scheme is not sensitive enough.

We call a winning match a positive (signal). Positives, as discussed above, can be either true or false. The sequence B with a high-scoring alignment to A may be a true relative of the unknown protein, or may be a false positive. How can we verify (to the extent possible) the significance of the score? To differentiate between true and false hits we generate random sequences, score them, and compare their scores to the score of our winner. If our match is truly significant, its score should be higher than the score of a random sequence aligned to the same probe, B. So the plan is straightforward: generate a set of random sequences (called $\{R_s\}_{s=1}^{S}$) and examine their alignment to the B match. The optimal alignment of each of the random sequences against B gives a score. The random scores should be significantly lower than the score of our match of A to B, if the match is true.

How should we generate random sequences? That is, what is the sample space that we should use? We can generate novel random sequences by sampling from the general pool of amino acids. We can even try to be reasonably clever and consider the natural frequencies of amino acids, $P(a_i)$. This is the probability of observing an amino acid $a_i$ anywhere in the known databases of proteins. We may generate the random sequences in such a way that the frequencies of the amino acids in those random sequences reproduce the natural frequencies. However, changing the chemical composition of the target protein (the distribution of types of amino acids within A) is problematic. Clearly, sequences and alignments to B can be found that will have higher scores than aligning A against B. The self-alignment of B is an example. However, we are interested in a significance test for A. For that purpose we keep the composition of amino acid types in $\{R_s\}_{s=1}^{S}$ the same as in A. The set of random sequences includes all the sequences of the same length and the same amino acid composition as A.
For example, if $A = a_1 a_2 a_3 a_4$, then a random sequence satisfying our criterion is $R = a_2 a_1 a_4 a_3$. The process of generating random sequences with the same length and chemical composition is called sequence shuffling. These random sequences avoid the problem of the weight of different amino acid types, since they have the same composition as the sequence A. In the next step each of the random sequences is optimally aligned against B and its (optimal) score is computed. Note that the optimal alignment of an $R_s$ (which may include indels) does not necessarily have the same length as the alignment of A. The ensemble of random sequences (or, more precisely, their scores) is used in a convenient estimate of the significance of the A alignment, the so-called Z score, defined below:

$$Z = \frac{T_{AB} - \langle T \rangle}{\sqrt{\langle T^2 \rangle - \langle T \rangle^2}}$$

$T_{AB}$ is the optimal score for the alignment of A against B, $\langle T \rangle$ is the optimal score averaged over the set of random sequences, and $\langle T^2 \rangle$ is the second moment of the distribution of optimal scores of random sequences.

The Z score is a cleverly constructed function. First, $T_{AB}$ is tested for being as far as possible from the average, or typical, score of a random sequence. This difference, $T_{AB} - \langle T \rangle$, has units. Changing those units makes the difference larger or smaller with no real modification of the underlying data. To obtain a dimensionless measure of the distance between $T_{AB}$ and $\langle T \rangle$, we divide the difference by the width of the distribution. Hence, we use the width of the distribution as a yardstick to determine appropriate units for the present distribution. The result is not only dimensionless; it is also a better measure of the significance of our alignment. Consider, for example, a fixed difference $T_{AB} - \langle T \rangle$. If the width of the distribution greatly exceeds the difference, many random sequences will be found with better scores than $T_{AB}$. On the other hand, if the distribution is very narrow, even a small difference can bring us to the right (higher) edge of the score distribution function and make the prediction significant.

[Figure: $P(T)$ plotted against $T$ for two distributions of random scores.]

A schematic drawing of two distributions of random scores: $T$ is the value of the score, and $P(T)$ is the probability of observing a given score. To keep the plot clear, the distributions are not normalized. Each of the distributions has about the same average value $\langle T \rangle$, but the widths are significantly different. The dashed-line distribution is broader than the continuous thin-line distribution. The winning score, $T_{AB}$, is denoted by an ellipse and a vertical line. Though the distance of the winning score from the average is the same in the two distributions, it is clearly more significant when compared to the thin-line distribution.
By more significant we mean that it is less likely to obtain the same score by chance, using a random sequence with the same amino acid composition.
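The shuffling test and the Z score can be put together in a short Python sketch. It is meant only as an illustration under stated assumptions: the simple global scoring function, its parameter values, the number of shuffles, and the names `global_score` and `z_score` are not taken from the text, and any optimal alignment score could be supplied through `align_score` instead.

```python
import math
import random


def global_score(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal global alignment score with a single gap penalty (toy values)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def z_score(a, b, align_score=global_score, n_shuffles=200, seed=0):
    """Z score of the alignment of a against b, using shuffled copies of a."""
    rng = random.Random(seed)  # fixed seed so that the estimate is reproducible
    t_ab = align_score(a, b)

    letters = list(a)
    scores = []
    for _ in range(n_shuffles):
        # Shuffling keeps the length and amino acid composition of a.
        rng.shuffle(letters)
        scores.append(align_score("".join(letters), b))

    mean = sum(scores) / len(scores)                  # <T>
    second = sum(t * t for t in scores) / len(scores)  # <T^2>
    variance = second - mean * mean
    if variance <= 0:
        raise ValueError("random scores have zero width; Z is undefined")
    return (t_ab - mean) / math.sqrt(variance)
```

A call such as `z_score(A, B)` then returns the distance of the winning score from the average random score, measured in units of the width of the random-score distribution. The number of shuffles controls how well $\langle T \rangle$ and $\langle T^2 \rangle$ are estimated, at the price of one optimal alignment per shuffle.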

The Z score measure also has a weakness. It relies on only two moments of the distribution. It is possible that the distribution is highly asymmetric and/or far from Gaussian. If this is indeed the case, the calculated position of $T_{AB}$ relative to the random sequences may be inaccurate.