Figure 3: A cartoon plot of how a database of annotated proteins is used to identify a novel sequence

Size: px
Start display at page:

Download "Figure 3: A cartoon plot of how a database of annotated proteins is used to identify a novel sequence"

Transcription

1 () Exact seuence comparison..1 Introduction Seuence alignment (to e defined elow) is a tool to test for a potential similarity etween a seuence of an unknown target protein and a single (or a family of) well-characterized protein(s). Figure 3: A cartoon plot of how a dataase of annotated proteins is used to identify a novel seuence Annotated seuences B 1 = 1 1 n 1 Unidentified seuence B (gloin) A=a 1 a n B 3 (α/β protein) B 4 = 1 4. n 4 We denote the seuence of the unknown protein y A. It is a short cut for the explicit seuence A= a1, a,..., an, where a i is an amino acid placed at the i position along the seuence. The seuences of the known proteins, that make the comparison dataase, are Q called { B } = 1 where B = 1,,..., m, and Q is the total numer of proteins in the dataase. For seuences only, the dataase may include hundreds of thousands of entries [ Dataases that include oth seuences and structures are limited to aout ten thousands [ For clarity we omit the index from B and use B to represent a memer of the comparison set. Tale 1: A sample of protein seuences. Typical lengths of protein seuences vary from a few tens to a few hundreds. The single letter notation (one character for an amino acid) is used. Serine Protease inhiitor EDICSLPPEV GPCRAGFLKF AYYSELNKCK LFTYGGCQGN ENNFETLQAC XQA Amicyanin alpha MRALAFAAALAAFSATAALAAGALEAVQEAPAGSTEVKIAKMKFQTPEVR IKAGSAVTWTNTEALPHNVHFKSGPGVEKDVEGPMLRSNQTYSVKFNAPG TYDYICTPHPFMKGKVVVE 1

2 Major Histocompataility Complex (class I) MAVMAPRTLV LLLSGALALT QTWAGSHSMR YFSTSVSRPG RGEPRFIAVG YVDDTQFVRF DSDAASQRME PRAPWIEQEG PEYWDRNTRN VKAHSQTDRV DLGTLRGYYN QSEDGSHTIQ RMYGCDVGSD GRFLRGYQQD AYDGKDYIAL NEDLRSWTAA DMAAEITKRK WEAAHFAEQL RAYLEGTCVE WLRRHLENGK ETLQRTDAPK THMTHHAVSD HEAILRCWAL SFYPAEITLT WQRDGEDQTQ DTELVETRPA GDGTFQKWAA VVVPSGQEQR YTCHVQHEGL PEPLTLRWEP SSQPTIPIVG IIAGLVLFGA VIAGAVVAAV RWRRKSSDRK GGSYSQAASS DSAQGSDVSL TACKV For any two seuences A and B we wish to determine how similar they are. This is a nontrivial task that reuires a pause and a clear definition of what we mean y similar. The simplest definition is the fraction of amino acids in B that are identical to A. In proteins high seuence identity (roughly aove 35 percent) immediately implies close relationship. A more sutle definition use amino acids with similar physical or iochemical properties. For example, it is ok to replace a hydrophoic amino acid like leucine (L) y valine (V). The effect on the hydrophoic core of the protein is expected to e small. However, changing (L) to a charged residue like arginine (R) it is not ok and may cause major reduction in protein staility. Hence, there is a gray area etween an exact matching a mismatch. This gray area is etter descried y a continuous scoring function that gives the highest score for an identity and the lowest possile score for sustitution that do not make chemical, physical or iological sense. For that purpose a score function needs to e estalished to measure the similarity etween the seuences. We now assume that when comparing the seuences A and B, the total score S( A, B ) will e given y a sum of terms, each term comparing two elements. The choice of the pairs of elements for the similarity test is called an alignment and is the focus of the present section. An element is not necessarily an amino acid. It can also e a space (gap). For example, consider the case in which A is longer than B. In that case it is ovious that some amino acids of A will e aligned against a space (gap). Before asking uestions aout the alignment process, let us start with the score, how to measure the similarity of a pair of elements. The scores of pairs of elements, which measure the similarity of the a -s and the -s, form the so-called sustitution matrix S( a, ). The matrix is symmetric, and of size 0 0 (the numer of amino acid types). Note that the term sustitution may e misleading since it suggests a direction (sustituting a y implies that a was there first). A symmetric matrix does not have a sense of time, and the order of sustitutions does not change the score, or the degree of similarity. Nevertheless, a non-symmetric variation of the seuence as a function of time is of considerale interest. A direction in seuence-similarity-scores is potentially informative, providing molecular fingerprints for evolutionary time scales. However, it is hard to estimate it without a specific model in mind. At present, for the purpose of finding protein relatives, we shall use the simplest way out and ignore the time arrow of evolution. A widely used (symmetric) sustitution matrix is given elow. It is extracted from proailities. Consider two seuences A and B of evolutionary related

3 proteins that are aligned against each other. The choice of the proteins and the alignment P a, e the proaility that amino acids a and procedure will e discussed later. Let ( ) are aligned against each other. Let P( a ) and ( ) P e the proailities of finding the amino acids a and in the protein that we considered. The score is given y S( a, ) = λ log P( a, ) P( a) P( ) The BLOSUM 50 matrix A R N D C Q E G H I L K M F P S T W Y V :A :A :R :R :N :N :D :D :C :C :Q :Q :E :E :G :G :H :H :I :I :L :L :K :K :M :M :F :F :P :P :S :S :T :T :W :W :Y :Y :V :V * * A R N D C Q E G H I L K M F P S T W Y V * Note the high scores for a sustitution of an amino acid to self (the diagonal elements of the matrix). Note also that certain amino acids have higher tendency for self-preservation (e.g. Cysteine) while a small and polar amino acid like threonine can e sustituted to other amino acids more easily. The detailed construction of the sustitution matrix is of significant interest and is discussed later in the section: Learning scoring parameters. Concerning the goals of the present section we still need to choose the pairs to calculate the similarity score. Can we just pick pairs of amino acids from the pools of residues that each protein makes? Of course, the choice of pairs of amino acids from A and B is not completely free. The order of the amino acids in the seuence (the primary structure of the protein) counts, and our comparison should maintain that order. This is a significant restriction on the comparisons that we may make and limits our choices consideraly. Adding further to the complexity of the prolem is the oservation that not all proteins are of the same length; we 3

4 need to account for variation in the lengths of the proteins and for the fact that some amino acids will not have corresponding amino acids to match. Two closely related proteins may e different at a specific site in which an amino acid was added to one protein ut not to the second. To descrie such cases we introduce a gap residue. A gap residue is denoted y - and is used to indicate an empty space along the seuence. For example, the comparison of two proteins elow (PDB codes 1rsy and 1a5_A) yields the following optimal arrangement of the two seuences with gaps: TABLE: Optimal alignment of two seuences (top is 1rsy, Synaptotagmin I (First C Domain), lower 1a5, Protein Kinase C ( ); Chain: A). The percentage of seuence identity is 33 percent. EKLGKLQYSLDYDFQNNQLLVGIIQ-AAEL-PALDMGGTSDPYVKVFLLPD-K-KKKFE ERRGRIYIQAHID-R--EVLIVVVRDAKNLVP-MDPNGLSDPYVKLKLIPDPKSESKQK TKVHRKTLNPVFNEQFTFKVPYSELGGKTLVMAVYDFDRFSKHDIIGEFKVPMNTVD-F TKTIKCSLNPEWNETFRFQLKESDKD-RRLSVEIWDWDLTSRNDFMGSLSFGISELQKA GHVTEEWRDLQS G-V-DGWFKLLS A gap is placed in one seuence (say A ) to match an amino acid in the second seuence (say B ). It indicates a missing amino acid in the A seuence when comparing it to the B seuence. Note that it is not possile for us to decide if an amino acid was deleted from A or added to B, without a detailed evolutionary mechanism linking the two proteins to a common ancestor. It is therefore common to use the term indel to descrie a gap (indel = an INsertion or a DELetion), a name that remains undecided aout the mechanism of gap formation. Our goal is to find the est match etween two seuences including the possiilities of gaps. In order to score a given alignment and decide if it is good or ad we need a score for an indel. What is the score of the gap, i.e. the alignment of an indel against another amino acid? It is philosophically useful to think on a gap as an ordinary amino acid and ask what is the score for sustituting an amino acid type a with an amino acid type. Such an approach is good in theory and there are some studies following this line of investigation. Unfortunately so far, determining gap scores (also called gap penalties) is still an open uestion. This difficulty is in contrast with the sustitution matrices of amino acids that are pretty well estalished. A common practice in alignment programs is to leave the gap energy as a parameter to e determined y the user, setting a default value of (for example) zero. The score of a gap influences the alignment. If it is set too low, it is difficult to match related proteins of considerale variation in lengths (remote homologous proteins). If it is set too high, two vastly different proteins may get fragmented to many small overlapping pieces and a large numer of gaps. The fragmented alignment (with high gap scoring) may still maintain a good score. We expect that the two proteins are related if a sustantial fraction of the two seuences is similar. 4

5 High-scoring short-seuence segments can e analyzed statistically to determine their significance. In fact such an analysis is ehind the popular BLAST algorithm [x]. To the first order BLAST does not consider gaps at all. A statistical test determines if short compatile segments of aligned seuences are significant. In some sense, BLAST solves the gap prolem y avoiding it all together. The BLAST algorithm will e discussed in the section: Statistical Analysis of Seuences. A typical practical solution for the indel/gap penalty is to give it a single negative value, say g, regardless of the pairing amino. The gap penalty is set to e independent of the type of the amino acid it is aligned to (e.g. the pair W / score the same as the pair G / ). This choice is non intuitive since W (tryptophan) is much larger than G (glycine). The space left for the - residue y the removal of G or W will e therefore different and the cost likely to e dissimilar. The less detailed treatment of gaps, (compared to usual amino acids), can lead to poor alignments. Therefore a few suggestions of extending the simple model of gap penalty were proposed. One popular model is to differentiate etween opening and extension of gaps. It is ased on the argument that gaps should aggregate together. Claims were made that insertions and deletions are likely to appear at certain structural domains of proteins (mostly loops). The concentration at few structural sites makes the indels appear together, and aggregate. To maximize the size of groups of gap (and minimize the numer of gap clusters), two gap penalties are assigned. One penalty is for initiating a gap group. A second (lower penalty) is for growing an existing gap cluster. For example, (we construct our alignment from left to right), extending an alignment from X / X to XX / X is less favorale than the extension of the pair X / to XX /. This is regardless of the fact that in oth cases the same pair ( X / ) was added to an existing alignment. This model improves the overall appearance of the alignments. However, the present author is not enthusiastic aout it. It is highly asymmetric with respect to other amino acids (rememer, we would like to consider a gap to e an amino acid), and it is not ovious that the asymmetry is indeed reuired. It is also making the identification of optimal alignment messier. Alternative modeling of gaps (so far less popular) will e discussed later and include structural dependent gap. For the moment we shall concentrate on the simplest model of gaps (one value does it all), and after solving that prolem the effect of the extensions will e examined. We wish to place the gaps in such a way that the alignment or the comparison of the two seuences will e optimal. Early studies of seuence comparisons found optimal alignments manually. Gaps were inserted y hand into positions that made iochemical sense and increased the numer of good-looking pairs [x]. For an implementation on the computer we consider an optimal alignment to e an arrangement of the two seuences with respect to each other in such a way that the total score of the alignment is as high as possile. Let us examine in more details the prolem of optimal alignments and possile arrangements of seuences. There are many ways of aligning a seuence A against a 5

6 seuence B once gaps are introduced (even if the original order of the amino acids is maintained). Gaps can enter anywhere and in wide range of numers in the alignment, creating many alternative arrangements. We denote the extended seuences of A and B (with gaps) as A and B. For example, A might e A= aa 1 a3a4... a n. The introduction of the extended seuences raises another intriguing uestion, what is the length of the seuences A (or B )? Of course, the lengths may vary, ut they still have upper ounds. Consider two seuences A and B of the same length -- n. The maximum length of A and B should e n. In the alignment of the extended-seuences that are also maximally long every amino acid of A or B is aligned against a gap -- ( a / ) and ( / ). An increase of the length eyond the n limit will necessarily include an alignment of an indel with respect to another indel, i.e. ( / ). Such an alignment does not make sense from a scientific point of view. We have no way to determine the numer of doule gaps or their locations. Moreover, also from a technical view point there is a prolem if the ( / ) pair generates a favorale plausile score. Consider the alignment a1... an 1 an 1... n 1 n Let the total score e T AB. Extending the aove alignment y one more pair of indels, we have a1... an 1 an 1... n 1 n The new score is TAB + S / where S / is the element of the sustitution matrix replacing an indel y an indel. If the new score is etter than T AB, it is trivial to construct even a etter alignment (and score) y adding yet another pair of indels with yet a etter score of T +. The favorale extension with indels can proceed to infinite and is unound. AB S / We eliminate in our seuence-to-seuence alignments the possiility of (. / ) Ok. It is settled then, the maximum length of A is n. How many possile alignments (with gaps) do we have? Or how many we need to examine efore deciding on an optimal alignment? 1. Counting alignments To count the numer of possile alignments, and as a starting point of the discussion on optimal alignments, we consider the dynamic matrix. The dynamic matrix is a tale used for the alignment of two seuences. Below we provide one example for two seuences with the same length ( n = 5). The rows are associated with the A seuence and the columns with the B. The numers at different matrix entries will e explained elow. 6

7 a a a a a The hairy picture with the numerous arrows is actually telling. Paths in this tale, which start at the upper left corner and end at the right lower corner, present all the possile alignments of the two seuences. From each entry in the dynamic matrix, there are three alternative moves: following the diagonal, going down or moving to the right. A step in the matrix, which is a part of a legitimate alignment, never proceeds from left to right or up. For example, the thick line in the aove matrix corresponds to the alignment: a1 a a3 a4 a A move along a diagonal aligns an amino acid against another amino acid. A vertical step in the matrix aligns a amino acid against an indel and a horizontal step puts a gap against an a amino acid. Note that we use for the gap residue (or an indel). The numers at the different entries of the tale denote the numer of paths (alignments) that can reach this point. For example, there are 5 possile alignments of aa 1 against 1 (check the tale element at the cross etween a and 1). They are: a1 a a1 a a1 a a1 a a1 a (1) () (3) (4) (5) The last three alignments are essentially the same and are degenerate. At present, we consider all paths, including the degenerate paths. In the Appendix a clever counting protocol (y Dr. Jaroslaw Meller) is outlined that estimates the numer of non-degenerate n+ m paths. Interestingly, the exact numer of non-degenerate paths is. It has the same m asymptotic ehavior as the approximate lower-ound expression we sketched elow for the numer of all paths. 7

8 Another interesting property of this tale is a summation rule and the possiility of constructing a recursion formula for the numer of alignments. There are three ways ( sources ) to extend a shorter alignment to otain one of the five longer alignments listed aove. (a) Extend an earlier alignment y the pair ( a / 1), a diagonal move in the aove tale. Alternatively, () the pairs ( a / ) or (c) ( / 1 ) (horizontal or perpendicular moves in the aove tale) can e used to extend a shorter alignment and to otain a memer of the aove group. Each of the three sources ( a) ( c) corresponding numer of alignments. For example, efore adding the pair ( / ) has its own position in the matrix with a a 1 we were at a tale entry aligning a 1 against " ". We find that there is only one-way of aligning a 1 against a gap and 1 is indeed the corresponding entry. Another example of extending the alignment is to add a gap against a, which means that our earlier position in the tale was the alignment of a 1 and 1. The last alignment can e done in three different ways and therefore the tale entry is 3 a a a ( ) If we add the numer of paths starting at the previous three sources and leading to the alignment of aa 1 with respect to 1 we otain = 5, exactly the numer of alignments of our target. To summarize the aove empirical oservations more precisely: The numer of possile alignments of n a -s against m -s is defined as Nnm, (, ) this numer can e determined using the recursive formula Nnm (, ) = Nn ( 1, m 1) + Nn ( 1, m) + Nnm (, 1), and the initial conditions N(0,0) = N(0, m) = N( m,0) = 1. Note that the definition of N (0,0) was done for computational convenience and it does not imply that nothing against nothing can e aligned in exactly one way. While this formula can e used directly, it is useful to have an order-of-magnitude estimate of the numer of alignments. This estimate is especially useful if we are planning a computation that will enumerate all of the alignments. If a calculation is not feasile with existing computer resources it is etter knowing that it is not feasile in advance, and not after a few weeks of futile attempts to execute the desired computation. A simple lower ound for Nnm (, ) can e otained uickly. Nnm (, ) is a positive numer, so we can write Nnm (, ) > Nn ( 1, m) + Nnm (, 1) (we forgot for convenience the term Nn ( 1, m 1) in the original euality). So, the alternative recursion formula ( ) N'( n, m) = N'( n 1, m) + N' n, m 1 (with the same initial conditions as for Nnm) (, ) always yields lower numers than Nnm. (, ) For N'( n, m ) we have a close expression: 8

9 n+ m ( n+ m)! N'( n, m) = =. This formula is easy to verify y direct sustitution of the m nm!! closed expression to the recursion. Note that this is the same as the (exact) result derived y Meller for non-degenerate paths (Appendix). We now make use of Stirling formula: log [!] log ( ) n n n n [x], valid for large n -s. The logarithm of the numer of alignments N'( n, n ) is estimated as ( n)! log = log! log! log log log + = ( n!) ( n) [ n ] n ( n) n n ( n) n n ( ) And the lower ound for the numer of alignments is N' ( n, n) n. For a short protein 30 ( n 50 ; N ) the numer is sustantial and is eyond what we can do today in a systematic search. Even if the computation of a single alignment reuires a nanosecond 9 ( second), which is unrealistically fast, it still necessary to use 10 sec 10 years to examine all possile alignments. If this is not impressive enough, rememer that this is a lower ound and the precise counting of all paths will yield a numer significantly larger than this one. For example, for 10 n = 5 we have = 104 already a numer significantly lower than the exact numer in the tale (1683). Hence, this underlines the statement that it is impossile to examine all alignments one y one. Readers with a ackground in structural iology may recall the Levinthal paradox in protein folding. The paradox contrasts the huge numer of plausile protein conformations and the efficiency in which proteins fold. However, as we see elow a large space to search does not necessarily mean that optimization in that space is difficult. It is not ovious that the optimization must e performed at a cost proportional to the volume of that space. In fact it can e profoundly cheaper. 1.3 Dynamic programming and optimal alignments After this long detour, it is aout the time to return to the asic uestion: What is the optimal alignment of A and B? The score matrix and the gap penalty are provided and we need is to fish out the alignment(s) with the highest overall score(s). This is a point in which algorithms developed y computer scientists can e extremely useful. The fact that the numer of possile alignments is exponentially large in the seuence length does not mean that the search for the optimal alignment needs to e done in exponential time. Seuence alignments can e done with dynamic programming, an algorithm that reuires only n operations to find the alignment with the est score, a remarkale saving compared to n operations. The efficient search for the optimal alignments consists of two steps. In the first step a dynamic matrix is constructed and in the second step an optimal path is found in the tale just constructed. The first step is similar to the counting of possile alignments and the recursive expression we wrote to compute the total numer of alignments. We denote the 9

10 optimal score of the alignment of a seuence length n against a seuence with length m y T n, m, and consider the following uestion: ( ) Assuming that a very kind fellow gave us the optimal scores for the following alignment: T n 1, m 1) T n 1, m, 1 T n, m? ( ), ( ), T( n m ), can we construct the score ( ) The answer is yes. Since the total score is given y a sum of the scores of the individual aligned pairs, we construct the score T( n, m ) from the three alignments leading to it. We consider three possiilities to otain an alignment of n against m amino acids. Option a: Align n 1 against m 1 amino acids (this alignment has the known optimal score of T( n1, m 1) ). Extend it y the alignment of an / m with a score of Sa ( n, m) which is the sustitution score according to the types of the amino acids at a n and m (e.g., the BLOSUM matrix). Hence the first suggestion for an optimal score is T( n1, m 1 ) + S( an, n). Option : Align n amino acids against m 1 amino acids. This alignment also has a known score, which is Tnm (, 1). Extend this alignment y / m with a corresponding score g for a gap. The second possiility for an optimal alignment is therefore T( n, m 1) + g. Option c: Align n 1 amino acids against m amino acids with the known (optimal) score of T( n 1, m). Extend the alignment, y adding an / with a corresponding score of g. The T n 1, m + g. third suggestion is therefore ( ) Note that we use the simple model of gaps in which only a single score is associated with an indel, regardless of the amino acid it is aligned against. Our final task is to select the highest score from one of the three alternatives: T n 1, m 1 S a, T n, m 1 g T n 1, m + g, which is the optimal score of ( ) + ( n n), ( ) +, ( ) T( n, m ). ( ) Tn ( 1, m 1) + San, m More compactly, we write: T( n, m) = max T( n, m 1) + g Tn ( 1, m) + g The aove recursion can e used to fill with optimal scores the complete dynamic matrix. We start with the condition T( 1, ) = T(,1) = g and use the initial values to grow the T( a1, ) + g g matrix. For example, T( a1, 1) = max T(, 1) + g = max g T( 0,0 ) + S( a1, 1) S( a1, 1) 10

11 Where T ( 0,0) denotes alignment of nothing that (not surprisingly) scores zero. Another simple example is of T( n, n( )) = n g. Hence, there is only one alignment against all gaps arrangement, providing us with immediate optimal scores for the first row and column of the dynamic matrix. It is also clear that similarly to the direct counting of the numer of paths we can also construct all the optimal scores. 5 a a a a a g g 3g 4g 5g 1 g g 3 3g 4 4g 5g A simple pseudo code to create the dynamic matrix is given elow /* fill the first (zero) column and the first (zero) row */ T(0,0) = 0 Do I=1:n T(I,0) = I*g End do Do I=1:m T(0,I) = I*g End do /* Now fill the rest of the matrix picking the maximum value from the three possiilities */ Do I = 1:n Do J = 1:m T(I,J) = max[ T(I-1,J)+g, T(I,J-1)+g, T(I-1,J-1)+S(a(i),(j))] End do End do 11

12 Actually, if we are primarily interested in the score of the alignment of A= a1... a n with respect to B = 1... m only a single element of the dynamic matrix, Tnm, (, ) is of interest. Of course, in order to get at it we need to compute first the whole matrix. However this score, which is recorded at the lower right side of the dynamic matrix, is only part of the story. Besides the score, in many cases, it is important to know the alignment itself. The path in the dynamic matrix that corresponds to the optimal alignment can e found y a trace-ack procedure. Starting from T( n, m ) we ask which of the three possile steps could have generated the final optimal score? That is, we examine the three possiilities? T( n, m) = T( n1, m 1) + S( an, m)? T( n, m) = T( n 1, m) + g? Tnm (, ) = Tnm (, 1) + g For at least one of the tests aove the euality will hold. It is possile that more than one test is correct and in that case the optimal alignment is degenerate. Hence, there is more than one alignment that provides an optimal score. Usually we consider only one path. For example, an if the first test is correct the alignment ends with the pair: and we repeat the three tests to m find the next path segment, this time starting from T( n1, m 1). The process is repeated until it reaches the upper left corner and provides the desired optimal alignment. The maximum numer of times that the process is repeated is n+ m (all the steps are horizontal or vertical) and the smallest numer of steps is either n or m, the larger numer of the two. We are also ready for a rough estimate of the computational effort. As we discussed earlier a lower ound on the numer of possile alignments is n+ m. Examining all possile alignments will reuire computational effort that is growing exponentially with n+ m. The computations that we just descried, using dynamics programming, are much cheaper. Let us consider the two steps separately. In the first step we create the dynamic matrix T( n, m ). To generate a single element we need (aout) four operations: Evaluating three expressions, and a decision which of the three is the largest. The calculation of a single element is therefore independent of n or m, and the cost associated with the computation of the matrix is proportional to n m (the numer of matrix elements). To trace the path of alignment (once the matrix is known) takes a maximum of n+ m operations. Something to think aout (*.1) We wish to align simultaneously the three seuences A= a1... a n ; B = 1... m and C = c1... c l. A seuence of ordered triplets defines the alignment. Each triplet is of the form 1

13 a where a (for example) is either an amino acid or an indel. A maximum of two gaps is c allowed in a triplet. What is the minimum and maximum lengths of an alignment? Write a recursion relation for the numer of alignments. Compute the numer of alignment for the special case in which n= m= l = 5. Gap opening and extension An empirical expectation from gaps is that they should aggregate. Structurally, proteins are characterized y secondary structure elements, as helices or sheets, linked y loops. The loops are less defined structurally and are more exposed to solvent. We expect gaps to appear in loops since changes in loops have a smaller effect on the gloal shape. One way to induce this tendency is to make the gap score structure dependent. Structural dependent gap penalty will reuire a reasonaly reliale structural model and is therefore not availale for all seuences. The alternative for modeling gaps, which is uite popular, is to introduce different penalties for gap opening g o and gap extension g e. The gap extension is set less costly than gap opening. For a fixed numer of gaps the preferred arrangements of the gaps is in one or a few ig clusters. A schematic drawing of a protein shape is shown that consists of a two-helix undle and a loop. It is expected that the indels will concentrate at the loop region. Unfortunately the two types of gaps complicated our simple and elegant dynamic programming protocol. The prolem is that extending the alignment with different types of gaps has a memory. If we consider adding a gap to an existing alignment we need to check that in the previous step we did not have an indel already. If we did, the gap penalty of the extra indel will e g e, if we did not the gap penalty will e g o. Hence in contrast to the earlier dynamic programming protocol the penalty depends on more than just the current pair of aligned characters. For example 13

14 a a n n with a gap penalty of e m m1 m m1 m g. Or an 1 an an 1 an with a gap penalty of g o m m1 m m1 m We cannot just extend our state to include two pairs since we will not know if the pair of gaps was just opened or an extension of yet a previous gap. We need to think on m m + 1 three(!) dynamic matrices, or three score functions. We ask how is it possile to otain the optimal score (and the alignment) T( n, m ). Consider what we are facing now. As efore in order to complete the alignment we need to consider an the addition of three possile pairs: the pair, an m, or to complete an alignment to an m ( nm, ) alignment. Unfortunately, and in contrast to the simple dynamic matrix we had efore, we cannot forget what the previous alignments were. The tails of the prior n1, m 1 ) have three options of their own: alignments (with ( )... a... a... n1 n1 (1) () (3) m1 m1 Matching three against three creates nine possiilities (in principle). Let us consider a few of an them. Adding the pair is actually uite straightforward, we do not have to think on gap m extension or opening, simply adding to the current optimal scores of (1)-(3) the score a S( an, n). More interesting are the additions of the pairs n or to (1)-(3). Again we m have no prolem with extending the optimal alignment (1), since it includes no gaps. However, the pairs that include gaps are more complicated. For example, adding to () will have a different score than adding the same pair to (3). This oservation means that the optimal score depends on what we attach to it. Lucky for us the extent of the memory is for just one pair. We therefore define three optimal scores depending on the presence/asence (and the up/down location) of a gap at the right edge of the alignment: 1 T n, m () T n, m 3 T n, m ( ) ( ) ( ) ( ) ( ) a n 14

15 As the names suggest the first optimal score is a function than ends with a pair of amino acids, the second option with a gap aligned against a n + 1, and the third option with a gap aligned against m + 1. T( n, m) + S( an+ 1, m+ 1) T( n+ 1, m+ 1) = max T ( n, m) + S( an+ 1, m+ 1) T ( n, m) + S( an+ 1, m+ 1) a n + 1 The addition of the pair T( n, m+ 1) + go T( n+ 1, m+ 1) = max T( n, m+ 1) + ge T ( n, m 1) g + + o And the final pair gives a similar recursive relationship m + 1 helps to define the recursive relationship for T ( n 1, m 1 ) ( 1, ) ( ) ( 1, ) T n+ m + go T( n+ 1, m+ 1) = max T n+ 1, m + go T n m g + + e + + Local alignments So far we considered alignment in which the complete A seuence was aligned against the complete seuence of B. This procedure is also called gloal alignment. In a gloal alignment we account for all the amino acids, we may have added indels ut we did not took away any of the different amino acids in A or B. In other words, our paths in the dynamic matrix always started at the upper left corner and end at the lower right corner of the dynamic matrix. We can imagine asking a different uestion for which gloal alignments are not important. For example, having a DNA seuence (a gene, say aout 1000 asepairs) and searching for a relative in a whole genome (say, a million of asepairs). Aligning a single seuence against a large continuous string (sometimes only partially annotated) cannot e done with the procedure we just discussed. Another exercise that cannot e done with gloal alignment is the detection of similar domains. In numerous cases proteins share similar fragments (domains) while the rest of the structure (or seuence) is not similar. It is useful to identify these domains since they can serve as uilding locks for a complete model of the protein, provide hints to the protein family, or suggest plausile iological activity (if the domain has characteristic activity). Alignments of fragments do not exclude the possiility that the whole protein seuences will e aligned, ut we do not enforce it. 15

16 A schematic drawing of the ackone traces of the two proteins show a similar fragment (a helix and a β strand). Clearly, the fragment similarity will not e detected y a gloal alignment. To otain a local alignment, and to enale the identification of similar fragment, we modify the creation of the dynamic matrix. For local alignment we have: 0 Tn ( 1, m 1) Sa ( n, + m) T( n, m) = max T( n 1, m) + g T( n, m 1) + g The newly added value of zero is used to terminate an alignment. If the extension of an alignment is not increasing the overall score, then it is etter terminated here. Note that there is a specific assumption on the scoring matrix. The random hits of extending a seuence should not e positive on the average. Otherwise, the local alignment will increase without ound. Also the search for the optimal alignment is modified. In gloal alignments we started our trace ack procedure from the lower right corner of the matrix and searched for the optimal path until we hit the upper left corner. Local alignments do not necessarily start or end at the corners. The first step in determining the local alignment is the identification of the start. The start is a matrix entry with a high score. Then a trace ack procedure, similar to gloal alignments, is used until we hit a gloal score of zero. This is a sign that we need to terminate this alignment and that we found the highest scoring fragment for A and B 16

17 a a a a a A short path is charting a local alignment in a dynamic matrix. The maximum score is 10 and a trace ack a1 a a3 a4 procedure provides us with the alignment. The aove example of a local alignment ends 1 3 at the upper left corner. However, similarly to the start of the alignment, there is no condition at its termination point. In contrast to gloal alignment we added now one more operation (the search for the highest scoring element) that scales also as n m. Of course, the highest scoring element can e stored during the construction of the matrix, saving us the additional computation afterward. Something to think aout Can you suggest a procedure that will use the generation of the dynamic matrix to speed-up the computation of the alignment path? The idea is to avoid the if statement, testing the sources of the optimal alignment. Suppose that A represents one protein while B is a whole genome. In that case we may wish to align the whole seuence A against a fragment of B. How to design a rule to generate the appropriate dynamic matrix? Qualitatively, how the matrix will e different from gloal alignment, local alignment matrices? Note that the optimal local alignment, which is defined as the highest scoring local alignment, is not necessarily the longest. It is proaly worth checking also a few high scores to detect potentially longer alignment. In general the longer is the alignment the more 17

18 significant it is likely to e using the statistical tests to e descried later, even if it is not necessarily with the highest score. Rough estimate of practical computational cost using dynamic programming The complexity of the dynamic programming algorithm or the computational effort of applying it, is proportional to n for large n -s. This is clearly much etter that rute force counting of all the alignments that we discuss previously (the computational effort grows exponentially with n ). Nevertheless dynamic programming can e still expensive, depending on the size of the prolem at hand, and our resources, as discussed elow. Powerful and low cost Personal Computers (PCs) make one type of high performance computing, which is ased on clusters of PC-s, more accessile. We therefore estimate the calculation time of typical alignments using PC units. An alignment of two proteins of average length ( n = 00 ) reuires a tenth of a second on a 600MHz PC. Comparing a single seuence to a dataase of 100,000 proteins, takes 10,000 seconds or.8 hours. Of course, this time is getting longer if we wish to examine more than one seuence. The non-trivial computational efforts promoted the use of approximate algorithms [x] with a computational cost linear with the protein length. The two leading approximations are FASTA and BLAST. The approximate procedures will e discussed later in the section: Approximate alignments. The computational cost further promoted the use of special purpose hardware for rapid parallel alignments [x]. There is an on-going competition here etween the rapid growth of computer resources and approximate algorithms. As the speed and capacity of computer increases, the option of finding the optimal alignment ecomes more feasile. Statistical verification of the results Once we finished comparing our seuences to all other seuences in the dataase we have a winner. A seuence B (or perhaps a few seuences) scores etter than the other seuences in the dataase when aligned to A. The winner is our est guess for a relative (homologous protein) of the unknown target. We should keep in mind that our dataase is limited. It does not contain all seuences, just our current sample in annotated seuence space. It is possile that none of the current reference proteins is homologous to the unknown seuence. Of course, we always get est scores. Every distriution has tails and the distriution of scores is no exception. It has a tail with highly scoring winning seuences. However, are these est scores of iological significance or just an aritrary choice among other irrelevant possiilities? We need to measure the significance of the score of the winning alignment to differentiate a true hit from a random hit. If we find that the winner is not significant, then either the dataase lacks relevant proteins, or our detection scheme is not sensitive enough. 18

19 We call a winning match -- a positive (signal). Positives, as discussed aove, can e either true or false. The seuence B with a high-score-alignment to A may e a true relative of the unknown protein, or may e a false positive. How can we verify (to the extent possile) the significance of the score? To differentiate etween true and false hits we generate random seuences, score them, and compare them to the score of our winner. If our match is truly significant its score should e higher than the score of a random seuence aligned to the same proe, B. So the plan is straightforward: generate a set of random seuences (called { R s} s 1 = ) and examine their alignment to the B match. The optimal alignment of each of the random seuences against B gives a score. The random score should e significantly lower than the score of our match of A to B, if the match is true. How should we generate random seuences? I.e., what is the sample space that we should use? We can generate novel random seuences y sampling from the general pool of amino acids. We can even try to e reasonaly clever and consider the natural freuencies of amino acids, ( ) i Pa. It is the proaility of oserving an amino acid i anywhere in the known dataases of proteins. We may generate the random seuences in such a way that the freuencies of the amino acids in those random seuences will reproduce the natural freuencies. However, changing the chemical composition of the target protein (the distriution of types of amino acids within A ) is prolematic. Clearly, seuences and alignments to B can e found that will have higher scores than aligning A against B. The self-alignment of B is an example. However, we are interested in significance test for A. For that purpose we maintain the composition of the amino acid S types in { R s} s = 1 to e the same as in A. The set of random seuences include all the proteins of the same length and the same amino acid composition as in A. For example if A= aa 1 a3a4, then a random seuence satisfying our criterion is R = aaaa The process of generating random seuences with the same length and chemical composition is called seuence shuffling. These random seuences avoid the prolem of the weight of different amino acid types since they have the same composition as the A seuence. In the next step each of the random seuences is optimally aligned against B and their (optimal) score is computed. Note that the optimally aligned R s is not necessarily the same length as A. The ensemle of random seuences (or more precisely their scores) is used in a convenient estimate of the significance of the A alignment: The so-called Z score, defined elow S 19

20 Z = TAB T T T T AB is the optimal score for the alignment of A against B. T is the optimal score averaged over the set of random seuences as descried elow, and T is the second moment of the distriution of optimal scores of random seuences. The Z score is a cleverly constructed function. First T AB is tested for eing as far as possile from the average, or a typical score of a random seuence. This difference, T AB T, has units. Changing those units will make the difference larger or smaller with no real modification of the underline data. To determine a dimensionless measure of the distance etween T AB and T, we divide the difference y the width of the distriution. Hence, we are using the width of the distriution as a yardstick to determine appropriate units for the present distriution. The result is dimensionless ut it is also a etter measure of the significance of our alignment. Consider for example a fixed difference T AB T. If the width of the distriution greatly exceeds the difference, many random seuences will e found with etter scores than T AB. On the other hand, if the distriution is very narrow, even a small difference can ring us to the right (higher) edge of the score distriution function and make the prediction significant. PT ( ) T A schematic drawing of two distriutions of random scores: T is the value of the score, and PT ( ) is the proaility of oserving a given score. To keep the plot clear the distriutions are not normalized. Each of the distriutions has aout the same average value T ut the width is significantly different. The dashed-line distriution is roader compared to the continuous thin line distriution. The winning score, T AB, is denoted y an ellipse and a vertical line. Though the distance of the winning score from the average is the same in the two distriutions, it is clearly more significant when compared to the thin line distriution. By more significant we mean that it is less likely to otain the same score y chance, using a random seuence with the same amino acid composition. 0

21 The Z score measure has also a weakness. It relies only two moments of the distriution. It is possile that the distriution will e highly asymmetric, and (or) far from a Gaussian. If this is indeed the case, the calculated position of T AB relative to the random seuences, may e inaccurate. 1

3.5 Solving Quadratic Equations by the

3.5 Solving Quadratic Equations by the www.ck1.org Chapter 3. Quadratic Equations and Quadratic Functions 3.5 Solving Quadratic Equations y the Quadratic Formula Learning ojectives Solve quadratic equations using the quadratic formula. Identify

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Representation theory of SU(2), density operators, purification Michael Walter, University of Amsterdam

Representation theory of SU(2), density operators, purification Michael Walter, University of Amsterdam Symmetry and Quantum Information Feruary 6, 018 Representation theory of S(), density operators, purification Lecture 7 Michael Walter, niversity of Amsterdam Last week, we learned the asic concepts of

More information

Lecture 12: Grover s Algorithm

Lecture 12: Grover s Algorithm CPSC 519/619: Quantum Computation John Watrous, University of Calgary Lecture 12: Grover s Algorithm March 7, 2006 We have completed our study of Shor s factoring algorithm. The asic technique ehind Shor

More information

Polynomial Degree and Finite Differences

Polynomial Degree and Finite Differences CONDENSED LESSON 7.1 Polynomial Degree and Finite Differences In this lesson, you Learn the terminology associated with polynomials Use the finite differences method to determine the degree of a polynomial

More information

Upper Bounds for Stern s Diatomic Sequence and Related Sequences

Upper Bounds for Stern s Diatomic Sequence and Related Sequences Upper Bounds for Stern s Diatomic Sequence and Related Sequences Colin Defant Department of Mathematics University of Florida, U.S.A. cdefant@ufl.edu Sumitted: Jun 18, 01; Accepted: Oct, 016; Pulished:

More information

Expansion formula using properties of dot product (analogous to FOIL in algebra): u v 2 u v u v u u 2u v v v u 2 2u v v 2

Expansion formula using properties of dot product (analogous to FOIL in algebra): u v 2 u v u v u u 2u v v v u 2 2u v v 2 Least squares: Mathematical theory Below we provide the "vector space" formulation, and solution, of the least squares prolem. While not strictly necessary until we ring in the machinery of matrix algera,

More information

Module 9: Further Numbers and Equations. Numbers and Indices. The aim of this lesson is to enable you to: work with rational and irrational numbers

Module 9: Further Numbers and Equations. Numbers and Indices. The aim of this lesson is to enable you to: work with rational and irrational numbers Module 9: Further Numers and Equations Lesson Aims The aim of this lesson is to enale you to: wor with rational and irrational numers wor with surds to rationalise the denominator when calculating interest,

More information

Genetic Algorithms applied to Problems of Forbidden Configurations

Genetic Algorithms applied to Problems of Forbidden Configurations Genetic Algorithms applied to Prolems of Foridden Configurations R.P. Anstee Miguel Raggi Department of Mathematics University of British Columia Vancouver, B.C. Canada V6T Z2 anstee@math.uc.ca mraggi@gmail.com

More information

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Introduction to Comparative Protein Modeling. Chapter 4 Part I Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature

More information

The Mean Version One way to write the One True Regression Line is: Equation 1 - The One True Line

The Mean Version One way to write the One True Regression Line is: Equation 1 - The One True Line Chapter 27: Inferences for Regression And so, there is one more thing which might vary one more thing aout which we might want to make some inference: the slope of the least squares regression line. The

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Pairwise & Multiple sequence alignments

Pairwise & Multiple sequence alignments Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived

More information

At first numbers were used only for counting, and 1, 2, 3,... were all that was needed. These are called positive integers.

At first numbers were used only for counting, and 1, 2, 3,... were all that was needed. These are called positive integers. 1 Numers One thread in the history of mathematics has een the extension of what is meant y a numer. This has led to the invention of new symols and techniques of calculation. When you have completed this

More information

Travel Grouping of Evaporating Polydisperse Droplets in Oscillating Flow- Theoretical Analysis

Travel Grouping of Evaporating Polydisperse Droplets in Oscillating Flow- Theoretical Analysis Travel Grouping of Evaporating Polydisperse Droplets in Oscillating Flow- Theoretical Analysis DAVID KATOSHEVSKI Department of Biotechnology and Environmental Engineering Ben-Gurion niversity of the Negev

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

PHY451, Spring /5

PHY451, Spring /5 PHY451, Spring 2011 Notes on Optical Pumping Procedure & Theory Procedure 1. Turn on the electronics and wait for the cell to warm up: ~ ½ hour. The oven should already e set to 50 C don t change this

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

bioinformatics 1 -- lecture 7

bioinformatics 1 -- lecture 7 bioinformatics 1 -- lecture 7 Probability and conditional probability Random sequences and significance (real sequences are not random) Erdos & Renyi: theoretical basis for the significance of an alignment

More information

1 Caveats of Parallel Algorithms

1 Caveats of Parallel Algorithms CME 323: Distriuted Algorithms and Optimization, Spring 2015 http://stanford.edu/ reza/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 1, 9/26/2015. Scried y Suhas Suresha, Pin Pin, Andreas

More information

Dynamical Systems Solutions to Exercises

Dynamical Systems Solutions to Exercises Dynamical Systems Part 5-6 Dr G Bowtell Dynamical Systems Solutions to Exercises. Figure : Phase diagrams for i, ii and iii respectively. Only fixed point is at the origin since the equations are linear

More information

Sequence Analysis '17 -- lecture 7

Sequence Analysis '17 -- lecture 7 Sequence Analysis '17 -- lecture 7 Significance E-values How significant is that? Please give me a number for......how likely the data would not have been the result of chance,......as opposed to......a

More information

Solving Systems of Linear Equations Symbolically

Solving Systems of Linear Equations Symbolically " Solving Systems of Linear Equations Symolically Every day of the year, thousands of airline flights crisscross the United States to connect large and small cities. Each flight follows a plan filed with

More information

Large-Scale Genomic Surveys

Large-Scale Genomic Surveys Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

More information

Chaos and Dynamical Systems

Chaos and Dynamical Systems Chaos and Dynamical Systems y Megan Richards Astract: In this paper, we will discuss the notion of chaos. We will start y introducing certain mathematical concepts needed in the understanding of chaos,

More information

1Number ONLINE PAGE PROOFS. systems: real and complex. 1.1 Kick off with CAS

1Number ONLINE PAGE PROOFS. systems: real and complex. 1.1 Kick off with CAS 1Numer systems: real and complex 1.1 Kick off with CAS 1. Review of set notation 1.3 Properties of surds 1. The set of complex numers 1.5 Multiplication and division of complex numers 1.6 Representing

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Structuring Unreliable Radio Networks

Structuring Unreliable Radio Networks Structuring Unreliale Radio Networks Keren Censor-Hillel Seth Gilert Faian Kuhn Nancy Lynch Calvin Newport March 29, 2011 Astract In this paper we study the prolem of uilding a connected dominating set

More information

Midterm 2 V1. Introduction to Artificial Intelligence. CS 188 Spring 2015

Midterm 2 V1. Introduction to Artificial Intelligence. CS 188 Spring 2015 S 88 Spring 205 Introduction to rtificial Intelligence Midterm 2 V ˆ You have approximately 2 hours and 50 minutes. ˆ The exam is closed book, closed calculator, and closed notes except your one-page crib

More information

Mathematical Ideas Modelling data, power variation, straightening data with logarithms, residual plots

Mathematical Ideas Modelling data, power variation, straightening data with logarithms, residual plots Kepler s Law Level Upper secondary Mathematical Ideas Modelling data, power variation, straightening data with logarithms, residual plots Description and Rationale Many traditional mathematics prolems

More information

Simple Examples. Let s look at a few simple examples of OI analysis.

Simple Examples. Let s look at a few simple examples of OI analysis. Simple Examples Let s look at a few simple examples of OI analysis. Example 1: Consider a scalar prolem. We have one oservation y which is located at the analysis point. We also have a ackground estimate

More information

Molecular Modeling Lecture 7. Homology modeling insertions/deletions manual realignment

Molecular Modeling Lecture 7. Homology modeling insertions/deletions manual realignment Molecular Modeling 2018-- Lecture 7 Homology modeling insertions/deletions manual realignment Homology modeling also called comparative modeling Sequences that have similar sequence have similar structure.

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Determinants of generalized binary band matrices

Determinants of generalized binary band matrices Determinants of generalized inary and matrices Dmitry Efimov arxiv:17005655v1 [mathra] 18 Fe 017 Department of Mathematics, Komi Science Centre UrD RAS, Syktyvkar, Russia Astract Under inary matrices we

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

ChemActivity L4: Proton ( 1 H) NMR

ChemActivity L4: Proton ( 1 H) NMR hemactivity L: Proton ( ) NMR (What can a NMR spectrum tell you aout the structure of a molecule?) This activity is designed to e completed in a ½ hour laoratory session or two classroom sessions. Model

More information

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas Informal inductive proof of best alignment path onsider the last step in the best

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

1.5 Sequence alignment

1.5 Sequence alignment 1.5 Sequence alignment The dramatic increase in the number of sequenced genomes and proteomes has lead to development of various bioinformatic methods and algorithms for extracting information (data mining)

More information

Non-Linear Regression Samuel L. Baker

Non-Linear Regression Samuel L. Baker NON-LINEAR REGRESSION 1 Non-Linear Regression 2006-2008 Samuel L. Baker The linear least squares method that you have een using fits a straight line or a flat plane to a unch of data points. Sometimes

More information

#A50 INTEGERS 14 (2014) ON RATS SEQUENCES IN GENERAL BASES

#A50 INTEGERS 14 (2014) ON RATS SEQUENCES IN GENERAL BASES #A50 INTEGERS 14 (014) ON RATS SEQUENCES IN GENERAL BASES Johann Thiel Dept. of Mathematics, New York City College of Technology, Brooklyn, New York jthiel@citytech.cuny.edu Received: 6/11/13, Revised:

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information

Each element of this set is assigned a probability. There are three basic rules for probabilities:

Each element of this set is assigned a probability. There are three basic rules for probabilities: XIV. BASICS OF ROBABILITY Somewhere out there is a set of all possile event (or all possile sequences of events which I call Ω. This is called a sample space. Out of this we consider susets of events which

More information

Sequence comparison: Score matrices

Sequence comparison: Score matrices Sequence comparison: Score matrices http://facultywashingtonedu/jht/gs559_2013/ Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Building 3D models of proteins

Building 3D models of proteins Building 3D models of proteins Why make a structural model for your protein? The structure can provide clues to the function through structural similarity with other proteins With a structure it is easier

More information

Lecture 2: Connecting the Three Models

Lecture 2: Connecting the Three Models IAS/PCMI Summer Session 2000 Clay Mathematics Undergraduate Program Advanced Course on Computational Complexity Lecture 2: Connecting the Three Models David Mix Barrington and Alexis Maciel July 18, 2000

More information

SVETLANA KATOK AND ILIE UGARCOVICI (Communicated by Jens Marklof)

SVETLANA KATOK AND ILIE UGARCOVICI (Communicated by Jens Marklof) JOURNAL OF MODERN DYNAMICS VOLUME 4, NO. 4, 010, 637 691 doi: 10.3934/jmd.010.4.637 STRUCTURE OF ATTRACTORS FOR (a, )-CONTINUED FRACTION TRANSFORMATIONS SVETLANA KATOK AND ILIE UGARCOVICI (Communicated

More information

Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution

Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution Background How does an evolutionary biologist decide how closely related two different species are? The simplest way is to compare

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

Comparing whole genomes

Comparing whole genomes BioNumerics Tutorial: Comparing whole genomes 1 Aim The Chromosome Comparison window in BioNumerics has been designed for large-scale comparison of sequences of unlimited length. In this tutorial you will

More information

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror Protein structure prediction CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror 1 Outline Why predict protein structure? Can we use (pure) physics-based methods? Knowledge-based methods Two major

More information

2. FUNCTIONS AND ALGEBRA

2. FUNCTIONS AND ALGEBRA 2. FUNCTIONS AND ALGEBRA You might think of this chapter as an icebreaker. Functions are the primary participants in the game of calculus, so before we play the game we ought to get to know a few functions.

More information

8.04 Spring 2013 March 12, 2013 Problem 1. (10 points) The Probability Current

8.04 Spring 2013 March 12, 2013 Problem 1. (10 points) The Probability Current Prolem Set 5 Solutions 8.04 Spring 03 March, 03 Prolem. (0 points) The Proaility Current We wish to prove that dp a = J(a, t) J(, t). () dt Since P a (t) is the proaility of finding the particle in the

More information

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best alignment path onsider the last step in

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

Robot Position from Wheel Odometry

Robot Position from Wheel Odometry Root Position from Wheel Odometry Christopher Marshall 26 Fe 2008 Astract This document develops equations of motion for root position as a function of the distance traveled y each wheel as a function

More information

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

More information

Weak bidders prefer first-price (sealed-bid) auctions. (This holds both ex-ante, and once the bidders have learned their types)

Weak bidders prefer first-price (sealed-bid) auctions. (This holds both ex-ante, and once the bidders have learned their types) Econ 805 Advanced Micro Theory I Dan Quint Fall 2007 Lecture 9 Oct 4 2007 Last week, we egan relaxing the assumptions of the symmetric independent private values model. We examined private-value auctions

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

More information

CMSC 451: Lecture 7 Greedy Algorithms for Scheduling Tuesday, Sep 19, 2017

CMSC 451: Lecture 7 Greedy Algorithms for Scheduling Tuesday, Sep 19, 2017 CMSC CMSC : Lecture Greedy Algorithms for Scheduling Tuesday, Sep 9, 0 Reading: Sects.. and. of KT. (Not covered in DPV.) Interval Scheduling: We continue our discussion of greedy algorithms with a number

More information

Protein Secondary Structure Prediction

Protein Secondary Structure Prediction part of Bioinformatik von RNA- und Proteinstrukturen Computational EvoDevo University Leipzig Leipzig, SS 2011 the goal is the prediction of the secondary structure conformation which is local each amino

More information

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms Computer Science 385 Analysis of Algorithms Siena College Spring 2011 Topic Notes: Limitations of Algorithms We conclude with a discussion of the limitations of the power of algorithms. That is, what kinds

More information

Advanced Undecidability Proofs

Advanced Undecidability Proofs 17 Advanced Undecidability Proofs In this chapter, we will discuss Rice s Theorem in Section 17.1, and the computational history method in Section 17.3. As discussed in Chapter 16, these are two additional

More information

CURRICULUM TIDBITS FOR THE MATHEMATICS CLASSROOM

CURRICULUM TIDBITS FOR THE MATHEMATICS CLASSROOM TANTON S TAKE ON LOGARITHMS CURRICULUM TIDBITS FOR THE MATHEMATICS CLASSROOM APRIL 01 Try walking into an algera II class and writing on the oard, perhaps even in silence, the following: = Your turn! (

More information

1 Hoeffding s Inequality

1 Hoeffding s Inequality Proailistic Method: Hoeffding s Inequality and Differential Privacy Lecturer: Huert Chan Date: 27 May 22 Hoeffding s Inequality. Approximate Counting y Random Sampling Suppose there is a ag containing

More information

Inequalities. Inequalities. Curriculum Ready.

Inequalities. Inequalities. Curriculum Ready. Curriculum Ready www.mathletics.com Copyright 009 3P Learning. All rights reserved. First edition printed 009 in Australia. A catalogue record for this ook is availale from 3P Learning Ltd. ISBN 978--986-60-4

More information

CHAPTER 5. Linear Operators, Span, Linear Independence, Basis Sets, and Dimension

CHAPTER 5. Linear Operators, Span, Linear Independence, Basis Sets, and Dimension A SERIES OF CLASS NOTES TO INTRODUCE LINEAR AND NONLINEAR PROBLEMS TO ENGINEERS, SCIENTISTS, AND APPLIED MATHEMATICIANS LINEAR CLASS NOTES: A COLLECTION OF HANDOUTS FOR REVIEW AND PREVIEW OF LINEAR THEORY

More information

Exercise 5. Sequence Profiles & BLAST

Exercise 5. Sequence Profiles & BLAST Exercise 5 Sequence Profiles & BLAST 1 Substitution Matrix (BLOSUM62) Likelihood to substitute one amino acid with another Figure taken from https://en.wikipedia.org/wiki/blosum 2 Substitution Matrix (BLOSUM62)

More information

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético Paulo Mologni 1, Ailton Akira Shinoda 2, Carlos Dias Maciel

More information

Evolutionary Models. Evolutionary Models

Evolutionary Models. Evolutionary Models Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

RMT 2014 Power Round Solutions February 15, 2014

RMT 2014 Power Round Solutions February 15, 2014 Introduction This Power Round develops the many and varied properties of the Thue-Morse sequence, an infinite sequence of 0s and 1s which starts 0, 1, 1, 0, 1, 0, 0, 1,... and appears in a remarkable number

More information

Ch. 9 Multiple Sequence Alignment (MSA)

Ch. 9 Multiple Sequence Alignment (MSA) Ch. 9 Multiple Sequence Alignment (MSA) - gather seqs. to make MSA - doing MSA with ClustalW - doing MSA with Tcoffee - comparing seqs. that cannot align Introduction - from pairwise alignment to MSA -

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Algorithms Exam TIN093 /DIT602

Algorithms Exam TIN093 /DIT602 Algorithms Exam TIN093 /DIT602 Course: Algorithms Course code: TIN 093, TIN 092 (CTH), DIT 602 (GU) Date, time: 21st October 2017, 14:00 18:00 Building: SBM Responsible teacher: Peter Damaschke, Tel. 5405

More information

Local Alignment Statistics

Local Alignment Statistics Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison

More information

Math 216 Second Midterm 28 March, 2013

Math 216 Second Midterm 28 March, 2013 Math 26 Second Midterm 28 March, 23 This sample exam is provided to serve as one component of your studying for this exam in this course. Please note that it is not guaranteed to cover the material that

More information

19 Tree Construction using Singular Value Decomposition

19 Tree Construction using Singular Value Decomposition 19 Tree Construction using Singular Value Decomposition Nicholas Eriksson We present a new, statistically consistent algorithm for phylogenetic tree construction that uses the algeraic theory of statistical

More information

Getting Started with Communications Engineering

Getting Started with Communications Engineering 1 Linear algebra is the algebra of linear equations: the term linear being used in the same sense as in linear functions, such as: which is the equation of a straight line. y ax c (0.1) Of course, if we

More information

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Institute of Bioinformatics Johannes Kepler University, Linz, Austria Sequence Alignment 2. Sequence Alignment Sequence Alignment 2.1

More information

Divide-and-Conquer. Reading: CLRS Sections 2.3, 4.1, 4.2, 4.3, 28.2, CSE 6331 Algorithms Steve Lai

Divide-and-Conquer. Reading: CLRS Sections 2.3, 4.1, 4.2, 4.3, 28.2, CSE 6331 Algorithms Steve Lai Divide-and-Conquer Reading: CLRS Sections 2.3, 4.1, 4.2, 4.3, 28.2, 33.4. CSE 6331 Algorithms Steve Lai Divide and Conquer Given an instance x of a prolem, the method works as follows: divide-and-conquer

More information

Scoring Matrices. Shifra Ben-Dor Irit Orr

Scoring Matrices. Shifra Ben-Dor Irit Orr Scoring Matrices Shifra Ben-Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018 CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

DIMACS Technical Report March Game Seki 1

DIMACS Technical Report March Game Seki 1 DIMACS Technical Report 2007-05 March 2007 Game Seki 1 by Diogo V. Andrade RUTCOR, Rutgers University 640 Bartholomew Road Piscataway, NJ 08854-8003 dandrade@rutcor.rutgers.edu Vladimir A. Gurvich RUTCOR,

More information

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University Sequence Alignment: Scoring Schemes COMP 571 Luay Nakhleh, Rice University Scoring Schemes Recall that an alignment score is aimed at providing a scale to measure the degree of similarity (or difference)

More information

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler STATC141 Spring 2005 The materials are from Pairise Sequence Alignment by Robert Giegerich and David Wheeler Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (I) Sequence similarity

More information

CS 4120 Lecture 3 Automating lexical analysis 29 August 2011 Lecturer: Andrew Myers. 1 DFAs

CS 4120 Lecture 3 Automating lexical analysis 29 August 2011 Lecturer: Andrew Myers. 1 DFAs CS 42 Lecture 3 Automating lexical analysis 29 August 2 Lecturer: Andrew Myers A lexer generator converts a lexical specification consisting of a list of regular expressions and corresponding actions into

More information