Figure 3: A cartoon plot of how a database of annotated proteins is used to identify a novel sequence
|
|
- Jasmine Rice
- 5 years ago
- Views:
Transcription
1 () Exact seuence comparison..1 Introduction Seuence alignment (to e defined elow) is a tool to test for a potential similarity etween a seuence of an unknown target protein and a single (or a family of) well-characterized protein(s). Figure 3: A cartoon plot of how a dataase of annotated proteins is used to identify a novel seuence Annotated seuences B 1 = 1 1 n 1 Unidentified seuence B (gloin) A=a 1 a n B 3 (α/β protein) B 4 = 1 4. n 4 We denote the seuence of the unknown protein y A. It is a short cut for the explicit seuence A= a1, a,..., an, where a i is an amino acid placed at the i position along the seuence. The seuences of the known proteins, that make the comparison dataase, are Q called { B } = 1 where B = 1,,..., m, and Q is the total numer of proteins in the dataase. For seuences only, the dataase may include hundreds of thousands of entries [ Dataases that include oth seuences and structures are limited to aout ten thousands [ For clarity we omit the index from B and use B to represent a memer of the comparison set. Tale 1: A sample of protein seuences. Typical lengths of protein seuences vary from a few tens to a few hundreds. The single letter notation (one character for an amino acid) is used. Serine Protease inhiitor EDICSLPPEV GPCRAGFLKF AYYSELNKCK LFTYGGCQGN ENNFETLQAC XQA Amicyanin alpha MRALAFAAALAAFSATAALAAGALEAVQEAPAGSTEVKIAKMKFQTPEVR IKAGSAVTWTNTEALPHNVHFKSGPGVEKDVEGPMLRSNQTYSVKFNAPG TYDYICTPHPFMKGKVVVE 1
2 Major Histocompataility Complex (class I) MAVMAPRTLV LLLSGALALT QTWAGSHSMR YFSTSVSRPG RGEPRFIAVG YVDDTQFVRF DSDAASQRME PRAPWIEQEG PEYWDRNTRN VKAHSQTDRV DLGTLRGYYN QSEDGSHTIQ RMYGCDVGSD GRFLRGYQQD AYDGKDYIAL NEDLRSWTAA DMAAEITKRK WEAAHFAEQL RAYLEGTCVE WLRRHLENGK ETLQRTDAPK THMTHHAVSD HEAILRCWAL SFYPAEITLT WQRDGEDQTQ DTELVETRPA GDGTFQKWAA VVVPSGQEQR YTCHVQHEGL PEPLTLRWEP SSQPTIPIVG IIAGLVLFGA VIAGAVVAAV RWRRKSSDRK GGSYSQAASS DSAQGSDVSL TACKV For any two seuences A and B we wish to determine how similar they are. This is a nontrivial task that reuires a pause and a clear definition of what we mean y similar. The simplest definition is the fraction of amino acids in B that are identical to A. In proteins high seuence identity (roughly aove 35 percent) immediately implies close relationship. A more sutle definition use amino acids with similar physical or iochemical properties. For example, it is ok to replace a hydrophoic amino acid like leucine (L) y valine (V). The effect on the hydrophoic core of the protein is expected to e small. However, changing (L) to a charged residue like arginine (R) it is not ok and may cause major reduction in protein staility. Hence, there is a gray area etween an exact matching a mismatch. This gray area is etter descried y a continuous scoring function that gives the highest score for an identity and the lowest possile score for sustitution that do not make chemical, physical or iological sense. For that purpose a score function needs to e estalished to measure the similarity etween the seuences. We now assume that when comparing the seuences A and B, the total score S( A, B ) will e given y a sum of terms, each term comparing two elements. The choice of the pairs of elements for the similarity test is called an alignment and is the focus of the present section. An element is not necessarily an amino acid. It can also e a space (gap). For example, consider the case in which A is longer than B. In that case it is ovious that some amino acids of A will e aligned against a space (gap). Before asking uestions aout the alignment process, let us start with the score, how to measure the similarity of a pair of elements. The scores of pairs of elements, which measure the similarity of the a -s and the -s, form the so-called sustitution matrix S( a, ). The matrix is symmetric, and of size 0 0 (the numer of amino acid types). Note that the term sustitution may e misleading since it suggests a direction (sustituting a y implies that a was there first). A symmetric matrix does not have a sense of time, and the order of sustitutions does not change the score, or the degree of similarity. Nevertheless, a non-symmetric variation of the seuence as a function of time is of considerale interest. A direction in seuence-similarity-scores is potentially informative, providing molecular fingerprints for evolutionary time scales. However, it is hard to estimate it without a specific model in mind. At present, for the purpose of finding protein relatives, we shall use the simplest way out and ignore the time arrow of evolution. A widely used (symmetric) sustitution matrix is given elow. It is extracted from proailities. Consider two seuences A and B of evolutionary related
3 proteins that are aligned against each other. The choice of the proteins and the alignment P a, e the proaility that amino acids a and procedure will e discussed later. Let ( ) are aligned against each other. Let P( a ) and ( ) P e the proailities of finding the amino acids a and in the protein that we considered. The score is given y S( a, ) = λ log P( a, ) P( a) P( ) The BLOSUM 50 matrix A R N D C Q E G H I L K M F P S T W Y V :A :A :R :R :N :N :D :D :C :C :Q :Q :E :E :G :G :H :H :I :I :L :L :K :K :M :M :F :F :P :P :S :S :T :T :W :W :Y :Y :V :V * * A R N D C Q E G H I L K M F P S T W Y V * Note the high scores for a sustitution of an amino acid to self (the diagonal elements of the matrix). Note also that certain amino acids have higher tendency for self-preservation (e.g. Cysteine) while a small and polar amino acid like threonine can e sustituted to other amino acids more easily. The detailed construction of the sustitution matrix is of significant interest and is discussed later in the section: Learning scoring parameters. Concerning the goals of the present section we still need to choose the pairs to calculate the similarity score. Can we just pick pairs of amino acids from the pools of residues that each protein makes? Of course, the choice of pairs of amino acids from A and B is not completely free. The order of the amino acids in the seuence (the primary structure of the protein) counts, and our comparison should maintain that order. This is a significant restriction on the comparisons that we may make and limits our choices consideraly. Adding further to the complexity of the prolem is the oservation that not all proteins are of the same length; we 3
4 need to account for variation in the lengths of the proteins and for the fact that some amino acids will not have corresponding amino acids to match. Two closely related proteins may e different at a specific site in which an amino acid was added to one protein ut not to the second. To descrie such cases we introduce a gap residue. A gap residue is denoted y - and is used to indicate an empty space along the seuence. For example, the comparison of two proteins elow (PDB codes 1rsy and 1a5_A) yields the following optimal arrangement of the two seuences with gaps: TABLE: Optimal alignment of two seuences (top is 1rsy, Synaptotagmin I (First C Domain), lower 1a5, Protein Kinase C ( ); Chain: A). The percentage of seuence identity is 33 percent. EKLGKLQYSLDYDFQNNQLLVGIIQ-AAEL-PALDMGGTSDPYVKVFLLPD-K-KKKFE ERRGRIYIQAHID-R--EVLIVVVRDAKNLVP-MDPNGLSDPYVKLKLIPDPKSESKQK TKVHRKTLNPVFNEQFTFKVPYSELGGKTLVMAVYDFDRFSKHDIIGEFKVPMNTVD-F TKTIKCSLNPEWNETFRFQLKESDKD-RRLSVEIWDWDLTSRNDFMGSLSFGISELQKA GHVTEEWRDLQS G-V-DGWFKLLS A gap is placed in one seuence (say A ) to match an amino acid in the second seuence (say B ). It indicates a missing amino acid in the A seuence when comparing it to the B seuence. Note that it is not possile for us to decide if an amino acid was deleted from A or added to B, without a detailed evolutionary mechanism linking the two proteins to a common ancestor. It is therefore common to use the term indel to descrie a gap (indel = an INsertion or a DELetion), a name that remains undecided aout the mechanism of gap formation. Our goal is to find the est match etween two seuences including the possiilities of gaps. In order to score a given alignment and decide if it is good or ad we need a score for an indel. What is the score of the gap, i.e. the alignment of an indel against another amino acid? It is philosophically useful to think on a gap as an ordinary amino acid and ask what is the score for sustituting an amino acid type a with an amino acid type. Such an approach is good in theory and there are some studies following this line of investigation. Unfortunately so far, determining gap scores (also called gap penalties) is still an open uestion. This difficulty is in contrast with the sustitution matrices of amino acids that are pretty well estalished. A common practice in alignment programs is to leave the gap energy as a parameter to e determined y the user, setting a default value of (for example) zero. The score of a gap influences the alignment. If it is set too low, it is difficult to match related proteins of considerale variation in lengths (remote homologous proteins). If it is set too high, two vastly different proteins may get fragmented to many small overlapping pieces and a large numer of gaps. The fragmented alignment (with high gap scoring) may still maintain a good score. We expect that the two proteins are related if a sustantial fraction of the two seuences is similar. 4
5 High-scoring short-seuence segments can e analyzed statistically to determine their significance. In fact such an analysis is ehind the popular BLAST algorithm [x]. To the first order BLAST does not consider gaps at all. A statistical test determines if short compatile segments of aligned seuences are significant. In some sense, BLAST solves the gap prolem y avoiding it all together. The BLAST algorithm will e discussed in the section: Statistical Analysis of Seuences. A typical practical solution for the indel/gap penalty is to give it a single negative value, say g, regardless of the pairing amino. The gap penalty is set to e independent of the type of the amino acid it is aligned to (e.g. the pair W / score the same as the pair G / ). This choice is non intuitive since W (tryptophan) is much larger than G (glycine). The space left for the - residue y the removal of G or W will e therefore different and the cost likely to e dissimilar. The less detailed treatment of gaps, (compared to usual amino acids), can lead to poor alignments. Therefore a few suggestions of extending the simple model of gap penalty were proposed. One popular model is to differentiate etween opening and extension of gaps. It is ased on the argument that gaps should aggregate together. Claims were made that insertions and deletions are likely to appear at certain structural domains of proteins (mostly loops). The concentration at few structural sites makes the indels appear together, and aggregate. To maximize the size of groups of gap (and minimize the numer of gap clusters), two gap penalties are assigned. One penalty is for initiating a gap group. A second (lower penalty) is for growing an existing gap cluster. For example, (we construct our alignment from left to right), extending an alignment from X / X to XX / X is less favorale than the extension of the pair X / to XX /. This is regardless of the fact that in oth cases the same pair ( X / ) was added to an existing alignment. This model improves the overall appearance of the alignments. However, the present author is not enthusiastic aout it. It is highly asymmetric with respect to other amino acids (rememer, we would like to consider a gap to e an amino acid), and it is not ovious that the asymmetry is indeed reuired. It is also making the identification of optimal alignment messier. Alternative modeling of gaps (so far less popular) will e discussed later and include structural dependent gap. For the moment we shall concentrate on the simplest model of gaps (one value does it all), and after solving that prolem the effect of the extensions will e examined. We wish to place the gaps in such a way that the alignment or the comparison of the two seuences will e optimal. Early studies of seuence comparisons found optimal alignments manually. Gaps were inserted y hand into positions that made iochemical sense and increased the numer of good-looking pairs [x]. For an implementation on the computer we consider an optimal alignment to e an arrangement of the two seuences with respect to each other in such a way that the total score of the alignment is as high as possile. Let us examine in more details the prolem of optimal alignments and possile arrangements of seuences. There are many ways of aligning a seuence A against a 5
6 seuence B once gaps are introduced (even if the original order of the amino acids is maintained). Gaps can enter anywhere and in wide range of numers in the alignment, creating many alternative arrangements. We denote the extended seuences of A and B (with gaps) as A and B. For example, A might e A= aa 1 a3a4... a n. The introduction of the extended seuences raises another intriguing uestion, what is the length of the seuences A (or B )? Of course, the lengths may vary, ut they still have upper ounds. Consider two seuences A and B of the same length -- n. The maximum length of A and B should e n. In the alignment of the extended-seuences that are also maximally long every amino acid of A or B is aligned against a gap -- ( a / ) and ( / ). An increase of the length eyond the n limit will necessarily include an alignment of an indel with respect to another indel, i.e. ( / ). Such an alignment does not make sense from a scientific point of view. We have no way to determine the numer of doule gaps or their locations. Moreover, also from a technical view point there is a prolem if the ( / ) pair generates a favorale plausile score. Consider the alignment a1... an 1 an 1... n 1 n Let the total score e T AB. Extending the aove alignment y one more pair of indels, we have a1... an 1 an 1... n 1 n The new score is TAB + S / where S / is the element of the sustitution matrix replacing an indel y an indel. If the new score is etter than T AB, it is trivial to construct even a etter alignment (and score) y adding yet another pair of indels with yet a etter score of T +. The favorale extension with indels can proceed to infinite and is unound. AB S / We eliminate in our seuence-to-seuence alignments the possiility of (. / ) Ok. It is settled then, the maximum length of A is n. How many possile alignments (with gaps) do we have? Or how many we need to examine efore deciding on an optimal alignment? 1. Counting alignments To count the numer of possile alignments, and as a starting point of the discussion on optimal alignments, we consider the dynamic matrix. The dynamic matrix is a tale used for the alignment of two seuences. Below we provide one example for two seuences with the same length ( n = 5). The rows are associated with the A seuence and the columns with the B. The numers at different matrix entries will e explained elow. 6
7 a a a a a The hairy picture with the numerous arrows is actually telling. Paths in this tale, which start at the upper left corner and end at the right lower corner, present all the possile alignments of the two seuences. From each entry in the dynamic matrix, there are three alternative moves: following the diagonal, going down or moving to the right. A step in the matrix, which is a part of a legitimate alignment, never proceeds from left to right or up. For example, the thick line in the aove matrix corresponds to the alignment: a1 a a3 a4 a A move along a diagonal aligns an amino acid against another amino acid. A vertical step in the matrix aligns a amino acid against an indel and a horizontal step puts a gap against an a amino acid. Note that we use for the gap residue (or an indel). The numers at the different entries of the tale denote the numer of paths (alignments) that can reach this point. For example, there are 5 possile alignments of aa 1 against 1 (check the tale element at the cross etween a and 1). They are: a1 a a1 a a1 a a1 a a1 a (1) () (3) (4) (5) The last three alignments are essentially the same and are degenerate. At present, we consider all paths, including the degenerate paths. In the Appendix a clever counting protocol (y Dr. Jaroslaw Meller) is outlined that estimates the numer of non-degenerate n+ m paths. Interestingly, the exact numer of non-degenerate paths is. It has the same m asymptotic ehavior as the approximate lower-ound expression we sketched elow for the numer of all paths. 7
8 Another interesting property of this tale is a summation rule and the possiility of constructing a recursion formula for the numer of alignments. There are three ways ( sources ) to extend a shorter alignment to otain one of the five longer alignments listed aove. (a) Extend an earlier alignment y the pair ( a / 1), a diagonal move in the aove tale. Alternatively, () the pairs ( a / ) or (c) ( / 1 ) (horizontal or perpendicular moves in the aove tale) can e used to extend a shorter alignment and to otain a memer of the aove group. Each of the three sources ( a) ( c) corresponding numer of alignments. For example, efore adding the pair ( / ) has its own position in the matrix with a a 1 we were at a tale entry aligning a 1 against " ". We find that there is only one-way of aligning a 1 against a gap and 1 is indeed the corresponding entry. Another example of extending the alignment is to add a gap against a, which means that our earlier position in the tale was the alignment of a 1 and 1. The last alignment can e done in three different ways and therefore the tale entry is 3 a a a ( ) If we add the numer of paths starting at the previous three sources and leading to the alignment of aa 1 with respect to 1 we otain = 5, exactly the numer of alignments of our target. To summarize the aove empirical oservations more precisely: The numer of possile alignments of n a -s against m -s is defined as Nnm, (, ) this numer can e determined using the recursive formula Nnm (, ) = Nn ( 1, m 1) + Nn ( 1, m) + Nnm (, 1), and the initial conditions N(0,0) = N(0, m) = N( m,0) = 1. Note that the definition of N (0,0) was done for computational convenience and it does not imply that nothing against nothing can e aligned in exactly one way. While this formula can e used directly, it is useful to have an order-of-magnitude estimate of the numer of alignments. This estimate is especially useful if we are planning a computation that will enumerate all of the alignments. If a calculation is not feasile with existing computer resources it is etter knowing that it is not feasile in advance, and not after a few weeks of futile attempts to execute the desired computation. A simple lower ound for Nnm (, ) can e otained uickly. Nnm (, ) is a positive numer, so we can write Nnm (, ) > Nn ( 1, m) + Nnm (, 1) (we forgot for convenience the term Nn ( 1, m 1) in the original euality). So, the alternative recursion formula ( ) N'( n, m) = N'( n 1, m) + N' n, m 1 (with the same initial conditions as for Nnm) (, ) always yields lower numers than Nnm. (, ) For N'( n, m ) we have a close expression: 8
9 n+ m ( n+ m)! N'( n, m) = =. This formula is easy to verify y direct sustitution of the m nm!! closed expression to the recursion. Note that this is the same as the (exact) result derived y Meller for non-degenerate paths (Appendix). We now make use of Stirling formula: log [!] log ( ) n n n n [x], valid for large n -s. The logarithm of the numer of alignments N'( n, n ) is estimated as ( n)! log = log! log! log log log + = ( n!) ( n) [ n ] n ( n) n n ( n) n n ( ) And the lower ound for the numer of alignments is N' ( n, n) n. For a short protein 30 ( n 50 ; N ) the numer is sustantial and is eyond what we can do today in a systematic search. Even if the computation of a single alignment reuires a nanosecond 9 ( second), which is unrealistically fast, it still necessary to use 10 sec 10 years to examine all possile alignments. If this is not impressive enough, rememer that this is a lower ound and the precise counting of all paths will yield a numer significantly larger than this one. For example, for 10 n = 5 we have = 104 already a numer significantly lower than the exact numer in the tale (1683). Hence, this underlines the statement that it is impossile to examine all alignments one y one. Readers with a ackground in structural iology may recall the Levinthal paradox in protein folding. The paradox contrasts the huge numer of plausile protein conformations and the efficiency in which proteins fold. However, as we see elow a large space to search does not necessarily mean that optimization in that space is difficult. It is not ovious that the optimization must e performed at a cost proportional to the volume of that space. In fact it can e profoundly cheaper. 1.3 Dynamic programming and optimal alignments After this long detour, it is aout the time to return to the asic uestion: What is the optimal alignment of A and B? The score matrix and the gap penalty are provided and we need is to fish out the alignment(s) with the highest overall score(s). This is a point in which algorithms developed y computer scientists can e extremely useful. The fact that the numer of possile alignments is exponentially large in the seuence length does not mean that the search for the optimal alignment needs to e done in exponential time. Seuence alignments can e done with dynamic programming, an algorithm that reuires only n operations to find the alignment with the est score, a remarkale saving compared to n operations. The efficient search for the optimal alignments consists of two steps. In the first step a dynamic matrix is constructed and in the second step an optimal path is found in the tale just constructed. The first step is similar to the counting of possile alignments and the recursive expression we wrote to compute the total numer of alignments. We denote the 9
10 optimal score of the alignment of a seuence length n against a seuence with length m y T n, m, and consider the following uestion: ( ) Assuming that a very kind fellow gave us the optimal scores for the following alignment: T n 1, m 1) T n 1, m, 1 T n, m? ( ), ( ), T( n m ), can we construct the score ( ) The answer is yes. Since the total score is given y a sum of the scores of the individual aligned pairs, we construct the score T( n, m ) from the three alignments leading to it. We consider three possiilities to otain an alignment of n against m amino acids. Option a: Align n 1 against m 1 amino acids (this alignment has the known optimal score of T( n1, m 1) ). Extend it y the alignment of an / m with a score of Sa ( n, m) which is the sustitution score according to the types of the amino acids at a n and m (e.g., the BLOSUM matrix). Hence the first suggestion for an optimal score is T( n1, m 1 ) + S( an, n). Option : Align n amino acids against m 1 amino acids. This alignment also has a known score, which is Tnm (, 1). Extend this alignment y / m with a corresponding score g for a gap. The second possiility for an optimal alignment is therefore T( n, m 1) + g. Option c: Align n 1 amino acids against m amino acids with the known (optimal) score of T( n 1, m). Extend the alignment, y adding an / with a corresponding score of g. The T n 1, m + g. third suggestion is therefore ( ) Note that we use the simple model of gaps in which only a single score is associated with an indel, regardless of the amino acid it is aligned against. Our final task is to select the highest score from one of the three alternatives: T n 1, m 1 S a, T n, m 1 g T n 1, m + g, which is the optimal score of ( ) + ( n n), ( ) +, ( ) T( n, m ). ( ) Tn ( 1, m 1) + San, m More compactly, we write: T( n, m) = max T( n, m 1) + g Tn ( 1, m) + g The aove recursion can e used to fill with optimal scores the complete dynamic matrix. We start with the condition T( 1, ) = T(,1) = g and use the initial values to grow the T( a1, ) + g g matrix. For example, T( a1, 1) = max T(, 1) + g = max g T( 0,0 ) + S( a1, 1) S( a1, 1) 10
11 Where T ( 0,0) denotes alignment of nothing that (not surprisingly) scores zero. Another simple example is of T( n, n( )) = n g. Hence, there is only one alignment against all gaps arrangement, providing us with immediate optimal scores for the first row and column of the dynamic matrix. It is also clear that similarly to the direct counting of the numer of paths we can also construct all the optimal scores. 5 a a a a a g g 3g 4g 5g 1 g g 3 3g 4 4g 5g A simple pseudo code to create the dynamic matrix is given elow /* fill the first (zero) column and the first (zero) row */ T(0,0) = 0 Do I=1:n T(I,0) = I*g End do Do I=1:m T(0,I) = I*g End do /* Now fill the rest of the matrix picking the maximum value from the three possiilities */ Do I = 1:n Do J = 1:m T(I,J) = max[ T(I-1,J)+g, T(I,J-1)+g, T(I-1,J-1)+S(a(i),(j))] End do End do 11
12 Actually, if we are primarily interested in the score of the alignment of A= a1... a n with respect to B = 1... m only a single element of the dynamic matrix, Tnm, (, ) is of interest. Of course, in order to get at it we need to compute first the whole matrix. However this score, which is recorded at the lower right side of the dynamic matrix, is only part of the story. Besides the score, in many cases, it is important to know the alignment itself. The path in the dynamic matrix that corresponds to the optimal alignment can e found y a trace-ack procedure. Starting from T( n, m ) we ask which of the three possile steps could have generated the final optimal score? That is, we examine the three possiilities? T( n, m) = T( n1, m 1) + S( an, m)? T( n, m) = T( n 1, m) + g? Tnm (, ) = Tnm (, 1) + g For at least one of the tests aove the euality will hold. It is possile that more than one test is correct and in that case the optimal alignment is degenerate. Hence, there is more than one alignment that provides an optimal score. Usually we consider only one path. For example, an if the first test is correct the alignment ends with the pair: and we repeat the three tests to m find the next path segment, this time starting from T( n1, m 1). The process is repeated until it reaches the upper left corner and provides the desired optimal alignment. The maximum numer of times that the process is repeated is n+ m (all the steps are horizontal or vertical) and the smallest numer of steps is either n or m, the larger numer of the two. We are also ready for a rough estimate of the computational effort. As we discussed earlier a lower ound on the numer of possile alignments is n+ m. Examining all possile alignments will reuire computational effort that is growing exponentially with n+ m. The computations that we just descried, using dynamics programming, are much cheaper. Let us consider the two steps separately. In the first step we create the dynamic matrix T( n, m ). To generate a single element we need (aout) four operations: Evaluating three expressions, and a decision which of the three is the largest. The calculation of a single element is therefore independent of n or m, and the cost associated with the computation of the matrix is proportional to n m (the numer of matrix elements). To trace the path of alignment (once the matrix is known) takes a maximum of n+ m operations. Something to think aout (*.1) We wish to align simultaneously the three seuences A= a1... a n ; B = 1... m and C = c1... c l. A seuence of ordered triplets defines the alignment. Each triplet is of the form 1
13 a where a (for example) is either an amino acid or an indel. A maximum of two gaps is c allowed in a triplet. What is the minimum and maximum lengths of an alignment? Write a recursion relation for the numer of alignments. Compute the numer of alignment for the special case in which n= m= l = 5. Gap opening and extension An empirical expectation from gaps is that they should aggregate. Structurally, proteins are characterized y secondary structure elements, as helices or sheets, linked y loops. The loops are less defined structurally and are more exposed to solvent. We expect gaps to appear in loops since changes in loops have a smaller effect on the gloal shape. One way to induce this tendency is to make the gap score structure dependent. Structural dependent gap penalty will reuire a reasonaly reliale structural model and is therefore not availale for all seuences. The alternative for modeling gaps, which is uite popular, is to introduce different penalties for gap opening g o and gap extension g e. The gap extension is set less costly than gap opening. For a fixed numer of gaps the preferred arrangements of the gaps is in one or a few ig clusters. A schematic drawing of a protein shape is shown that consists of a two-helix undle and a loop. It is expected that the indels will concentrate at the loop region. Unfortunately the two types of gaps complicated our simple and elegant dynamic programming protocol. The prolem is that extending the alignment with different types of gaps has a memory. If we consider adding a gap to an existing alignment we need to check that in the previous step we did not have an indel already. If we did, the gap penalty of the extra indel will e g e, if we did not the gap penalty will e g o. Hence in contrast to the earlier dynamic programming protocol the penalty depends on more than just the current pair of aligned characters. For example 13
14 a a n n with a gap penalty of e m m1 m m1 m g. Or an 1 an an 1 an with a gap penalty of g o m m1 m m1 m We cannot just extend our state to include two pairs since we will not know if the pair of gaps was just opened or an extension of yet a previous gap. We need to think on m m + 1 three(!) dynamic matrices, or three score functions. We ask how is it possile to otain the optimal score (and the alignment) T( n, m ). Consider what we are facing now. As efore in order to complete the alignment we need to consider an the addition of three possile pairs: the pair, an m, or to complete an alignment to an m ( nm, ) alignment. Unfortunately, and in contrast to the simple dynamic matrix we had efore, we cannot forget what the previous alignments were. The tails of the prior n1, m 1 ) have three options of their own: alignments (with ( )... a... a... n1 n1 (1) () (3) m1 m1 Matching three against three creates nine possiilities (in principle). Let us consider a few of an them. Adding the pair is actually uite straightforward, we do not have to think on gap m extension or opening, simply adding to the current optimal scores of (1)-(3) the score a S( an, n). More interesting are the additions of the pairs n or to (1)-(3). Again we m have no prolem with extending the optimal alignment (1), since it includes no gaps. However, the pairs that include gaps are more complicated. For example, adding to () will have a different score than adding the same pair to (3). This oservation means that the optimal score depends on what we attach to it. Lucky for us the extent of the memory is for just one pair. We therefore define three optimal scores depending on the presence/asence (and the up/down location) of a gap at the right edge of the alignment: 1 T n, m () T n, m 3 T n, m ( ) ( ) ( ) ( ) ( ) a n 14
15 As the names suggest the first optimal score is a function than ends with a pair of amino acids, the second option with a gap aligned against a n + 1, and the third option with a gap aligned against m + 1. T( n, m) + S( an+ 1, m+ 1) T( n+ 1, m+ 1) = max T ( n, m) + S( an+ 1, m+ 1) T ( n, m) + S( an+ 1, m+ 1) a n + 1 The addition of the pair T( n, m+ 1) + go T( n+ 1, m+ 1) = max T( n, m+ 1) + ge T ( n, m 1) g + + o And the final pair gives a similar recursive relationship m + 1 helps to define the recursive relationship for T ( n 1, m 1 ) ( 1, ) ( ) ( 1, ) T n+ m + go T( n+ 1, m+ 1) = max T n+ 1, m + go T n m g + + e + + Local alignments So far we considered alignment in which the complete A seuence was aligned against the complete seuence of B. This procedure is also called gloal alignment. In a gloal alignment we account for all the amino acids, we may have added indels ut we did not took away any of the different amino acids in A or B. In other words, our paths in the dynamic matrix always started at the upper left corner and end at the lower right corner of the dynamic matrix. We can imagine asking a different uestion for which gloal alignments are not important. For example, having a DNA seuence (a gene, say aout 1000 asepairs) and searching for a relative in a whole genome (say, a million of asepairs). Aligning a single seuence against a large continuous string (sometimes only partially annotated) cannot e done with the procedure we just discussed. Another exercise that cannot e done with gloal alignment is the detection of similar domains. In numerous cases proteins share similar fragments (domains) while the rest of the structure (or seuence) is not similar. It is useful to identify these domains since they can serve as uilding locks for a complete model of the protein, provide hints to the protein family, or suggest plausile iological activity (if the domain has characteristic activity). Alignments of fragments do not exclude the possiility that the whole protein seuences will e aligned, ut we do not enforce it. 15
16 A schematic drawing of the ackone traces of the two proteins show a similar fragment (a helix and a β strand). Clearly, the fragment similarity will not e detected y a gloal alignment. To otain a local alignment, and to enale the identification of similar fragment, we modify the creation of the dynamic matrix. For local alignment we have: 0 Tn ( 1, m 1) Sa ( n, + m) T( n, m) = max T( n 1, m) + g T( n, m 1) + g The newly added value of zero is used to terminate an alignment. If the extension of an alignment is not increasing the overall score, then it is etter terminated here. Note that there is a specific assumption on the scoring matrix. The random hits of extending a seuence should not e positive on the average. Otherwise, the local alignment will increase without ound. Also the search for the optimal alignment is modified. In gloal alignments we started our trace ack procedure from the lower right corner of the matrix and searched for the optimal path until we hit the upper left corner. Local alignments do not necessarily start or end at the corners. The first step in determining the local alignment is the identification of the start. The start is a matrix entry with a high score. Then a trace ack procedure, similar to gloal alignments, is used until we hit a gloal score of zero. This is a sign that we need to terminate this alignment and that we found the highest scoring fragment for A and B 16
17 a a a a a A short path is charting a local alignment in a dynamic matrix. The maximum score is 10 and a trace ack a1 a a3 a4 procedure provides us with the alignment. The aove example of a local alignment ends 1 3 at the upper left corner. However, similarly to the start of the alignment, there is no condition at its termination point. In contrast to gloal alignment we added now one more operation (the search for the highest scoring element) that scales also as n m. Of course, the highest scoring element can e stored during the construction of the matrix, saving us the additional computation afterward. Something to think aout Can you suggest a procedure that will use the generation of the dynamic matrix to speed-up the computation of the alignment path? The idea is to avoid the if statement, testing the sources of the optimal alignment. Suppose that A represents one protein while B is a whole genome. In that case we may wish to align the whole seuence A against a fragment of B. How to design a rule to generate the appropriate dynamic matrix? Qualitatively, how the matrix will e different from gloal alignment, local alignment matrices? Note that the optimal local alignment, which is defined as the highest scoring local alignment, is not necessarily the longest. It is proaly worth checking also a few high scores to detect potentially longer alignment. In general the longer is the alignment the more 17
18 significant it is likely to e using the statistical tests to e descried later, even if it is not necessarily with the highest score. Rough estimate of practical computational cost using dynamic programming The complexity of the dynamic programming algorithm or the computational effort of applying it, is proportional to n for large n -s. This is clearly much etter that rute force counting of all the alignments that we discuss previously (the computational effort grows exponentially with n ). Nevertheless dynamic programming can e still expensive, depending on the size of the prolem at hand, and our resources, as discussed elow. Powerful and low cost Personal Computers (PCs) make one type of high performance computing, which is ased on clusters of PC-s, more accessile. We therefore estimate the calculation time of typical alignments using PC units. An alignment of two proteins of average length ( n = 00 ) reuires a tenth of a second on a 600MHz PC. Comparing a single seuence to a dataase of 100,000 proteins, takes 10,000 seconds or.8 hours. Of course, this time is getting longer if we wish to examine more than one seuence. The non-trivial computational efforts promoted the use of approximate algorithms [x] with a computational cost linear with the protein length. The two leading approximations are FASTA and BLAST. The approximate procedures will e discussed later in the section: Approximate alignments. The computational cost further promoted the use of special purpose hardware for rapid parallel alignments [x]. There is an on-going competition here etween the rapid growth of computer resources and approximate algorithms. As the speed and capacity of computer increases, the option of finding the optimal alignment ecomes more feasile. Statistical verification of the results Once we finished comparing our seuences to all other seuences in the dataase we have a winner. A seuence B (or perhaps a few seuences) scores etter than the other seuences in the dataase when aligned to A. The winner is our est guess for a relative (homologous protein) of the unknown target. We should keep in mind that our dataase is limited. It does not contain all seuences, just our current sample in annotated seuence space. It is possile that none of the current reference proteins is homologous to the unknown seuence. Of course, we always get est scores. Every distriution has tails and the distriution of scores is no exception. It has a tail with highly scoring winning seuences. However, are these est scores of iological significance or just an aritrary choice among other irrelevant possiilities? We need to measure the significance of the score of the winning alignment to differentiate a true hit from a random hit. If we find that the winner is not significant, then either the dataase lacks relevant proteins, or our detection scheme is not sensitive enough. 18
19 We call a winning match -- a positive (signal). Positives, as discussed aove, can e either true or false. The seuence B with a high-score-alignment to A may e a true relative of the unknown protein, or may e a false positive. How can we verify (to the extent possile) the significance of the score? To differentiate etween true and false hits we generate random seuences, score them, and compare them to the score of our winner. If our match is truly significant its score should e higher than the score of a random seuence aligned to the same proe, B. So the plan is straightforward: generate a set of random seuences (called { R s} s 1 = ) and examine their alignment to the B match. The optimal alignment of each of the random seuences against B gives a score. The random score should e significantly lower than the score of our match of A to B, if the match is true. How should we generate random seuences? I.e., what is the sample space that we should use? We can generate novel random seuences y sampling from the general pool of amino acids. We can even try to e reasonaly clever and consider the natural freuencies of amino acids, ( ) i Pa. It is the proaility of oserving an amino acid i anywhere in the known dataases of proteins. We may generate the random seuences in such a way that the freuencies of the amino acids in those random seuences will reproduce the natural freuencies. However, changing the chemical composition of the target protein (the distriution of types of amino acids within A ) is prolematic. Clearly, seuences and alignments to B can e found that will have higher scores than aligning A against B. The self-alignment of B is an example. However, we are interested in significance test for A. For that purpose we maintain the composition of the amino acid S types in { R s} s = 1 to e the same as in A. The set of random seuences include all the proteins of the same length and the same amino acid composition as in A. For example if A= aa 1 a3a4, then a random seuence satisfying our criterion is R = aaaa The process of generating random seuences with the same length and chemical composition is called seuence shuffling. These random seuences avoid the prolem of the weight of different amino acid types since they have the same composition as the A seuence. In the next step each of the random seuences is optimally aligned against B and their (optimal) score is computed. Note that the optimally aligned R s is not necessarily the same length as A. The ensemle of random seuences (or more precisely their scores) is used in a convenient estimate of the significance of the A alignment: The so-called Z score, defined elow S 19
20 Z = TAB T T T T AB is the optimal score for the alignment of A against B. T is the optimal score averaged over the set of random seuences as descried elow, and T is the second moment of the distriution of optimal scores of random seuences. The Z score is a cleverly constructed function. First T AB is tested for eing as far as possile from the average, or a typical score of a random seuence. This difference, T AB T, has units. Changing those units will make the difference larger or smaller with no real modification of the underline data. To determine a dimensionless measure of the distance etween T AB and T, we divide the difference y the width of the distriution. Hence, we are using the width of the distriution as a yardstick to determine appropriate units for the present distriution. The result is dimensionless ut it is also a etter measure of the significance of our alignment. Consider for example a fixed difference T AB T. If the width of the distriution greatly exceeds the difference, many random seuences will e found with etter scores than T AB. On the other hand, if the distriution is very narrow, even a small difference can ring us to the right (higher) edge of the score distriution function and make the prediction significant. PT ( ) T A schematic drawing of two distriutions of random scores: T is the value of the score, and PT ( ) is the proaility of oserving a given score. To keep the plot clear the distriutions are not normalized. Each of the distriutions has aout the same average value T ut the width is significantly different. The dashed-line distriution is roader compared to the continuous thin line distriution. The winning score, T AB, is denoted y an ellipse and a vertical line. Though the distance of the winning score from the average is the same in the two distriutions, it is clearly more significant when compared to the thin line distriution. By more significant we mean that it is less likely to otain the same score y chance, using a random seuence with the same amino acid composition. 0
21 The Z score measure has also a weakness. It relies only two moments of the distriution. It is possile that the distriution will e highly asymmetric, and (or) far from a Gaussian. If this is indeed the case, the calculated position of T AB relative to the random seuences, may e inaccurate. 1
3.5 Solving Quadratic Equations by the
www.ck1.org Chapter 3. Quadratic Equations and Quadratic Functions 3.5 Solving Quadratic Equations y the Quadratic Formula Learning ojectives Solve quadratic equations using the quadratic formula. Identify
More informationAlgorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment
Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot
More informationSequence analysis and Genomics
Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute
More informationSequence analysis and comparison
The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species
More informationRepresentation theory of SU(2), density operators, purification Michael Walter, University of Amsterdam
Symmetry and Quantum Information Feruary 6, 018 Representation theory of S(), density operators, purification Lecture 7 Michael Walter, niversity of Amsterdam Last week, we learned the asic concepts of
More informationLecture 12: Grover s Algorithm
CPSC 519/619: Quantum Computation John Watrous, University of Calgary Lecture 12: Grover s Algorithm March 7, 2006 We have completed our study of Shor s factoring algorithm. The asic technique ehind Shor
More informationPolynomial Degree and Finite Differences
CONDENSED LESSON 7.1 Polynomial Degree and Finite Differences In this lesson, you Learn the terminology associated with polynomials Use the finite differences method to determine the degree of a polynomial
More informationUpper Bounds for Stern s Diatomic Sequence and Related Sequences
Upper Bounds for Stern s Diatomic Sequence and Related Sequences Colin Defant Department of Mathematics University of Florida, U.S.A. cdefant@ufl.edu Sumitted: Jun 18, 01; Accepted: Oct, 016; Pulished:
More informationExpansion formula using properties of dot product (analogous to FOIL in algebra): u v 2 u v u v u u 2u v v v u 2 2u v v 2
Least squares: Mathematical theory Below we provide the "vector space" formulation, and solution, of the least squares prolem. While not strictly necessary until we ring in the machinery of matrix algera,
More informationModule 9: Further Numbers and Equations. Numbers and Indices. The aim of this lesson is to enable you to: work with rational and irrational numbers
Module 9: Further Numers and Equations Lesson Aims The aim of this lesson is to enale you to: wor with rational and irrational numers wor with surds to rationalise the denominator when calculating interest,
More informationGenetic Algorithms applied to Problems of Forbidden Configurations
Genetic Algorithms applied to Prolems of Foridden Configurations R.P. Anstee Miguel Raggi Department of Mathematics University of British Columia Vancouver, B.C. Canada V6T Z2 anstee@math.uc.ca mraggi@gmail.com
More informationIntroduction to Comparative Protein Modeling. Chapter 4 Part I
Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature
More informationThe Mean Version One way to write the One True Regression Line is: Equation 1 - The One True Line
Chapter 27: Inferences for Regression And so, there is one more thing which might vary one more thing aout which we might want to make some inference: the slope of the least squares regression line. The
More informationComputational Biology
Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,
More informationAlgorithms in Bioinformatics
Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology
More informationPairwise & Multiple sequence alignments
Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived
More informationAt first numbers were used only for counting, and 1, 2, 3,... were all that was needed. These are called positive integers.
1 Numers One thread in the history of mathematics has een the extension of what is meant y a numer. This has led to the invention of new symols and techniques of calculation. When you have completed this
More informationTravel Grouping of Evaporating Polydisperse Droplets in Oscillating Flow- Theoretical Analysis
Travel Grouping of Evaporating Polydisperse Droplets in Oscillating Flow- Theoretical Analysis DAVID KATOSHEVSKI Department of Biotechnology and Environmental Engineering Ben-Gurion niversity of the Negev
More informationBLAST: Target frequencies and information content Dannie Durand
Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences
More informationPHY451, Spring /5
PHY451, Spring 2011 Notes on Optical Pumping Procedure & Theory Procedure 1. Turn on the electronics and wait for the cell to warm up: ~ ½ hour. The oven should already e set to 50 C don t change this
More informationBioinformatics and BLAST
Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists
More informationbioinformatics 1 -- lecture 7
bioinformatics 1 -- lecture 7 Probability and conditional probability Random sequences and significance (real sequences are not random) Erdos & Renyi: theoretical basis for the significance of an alignment
More information1 Caveats of Parallel Algorithms
CME 323: Distriuted Algorithms and Optimization, Spring 2015 http://stanford.edu/ reza/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 1, 9/26/2015. Scried y Suhas Suresha, Pin Pin, Andreas
More informationDynamical Systems Solutions to Exercises
Dynamical Systems Part 5-6 Dr G Bowtell Dynamical Systems Solutions to Exercises. Figure : Phase diagrams for i, ii and iii respectively. Only fixed point is at the origin since the equations are linear
More informationSequence Analysis '17 -- lecture 7
Sequence Analysis '17 -- lecture 7 Significance E-values How significant is that? Please give me a number for......how likely the data would not have been the result of chance,......as opposed to......a
More informationSolving Systems of Linear Equations Symbolically
" Solving Systems of Linear Equations Symolically Every day of the year, thousands of airline flights crisscross the United States to connect large and small cities. Each flight follows a plan filed with
More informationLarge-Scale Genomic Surveys
Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction
More informationChaos and Dynamical Systems
Chaos and Dynamical Systems y Megan Richards Astract: In this paper, we will discuss the notion of chaos. We will start y introducing certain mathematical concepts needed in the understanding of chaos,
More information1Number ONLINE PAGE PROOFS. systems: real and complex. 1.1 Kick off with CAS
1Numer systems: real and complex 1.1 Kick off with CAS 1. Review of set notation 1.3 Properties of surds 1. The set of complex numers 1.5 Multiplication and division of complex numers 1.6 Representing
More informationSara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)
Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline
More informationStructuring Unreliable Radio Networks
Structuring Unreliale Radio Networks Keren Censor-Hillel Seth Gilert Faian Kuhn Nancy Lynch Calvin Newport March 29, 2011 Astract In this paper we study the prolem of uilding a connected dominating set
More informationMidterm 2 V1. Introduction to Artificial Intelligence. CS 188 Spring 2015
S 88 Spring 205 Introduction to rtificial Intelligence Midterm 2 V ˆ You have approximately 2 hours and 50 minutes. ˆ The exam is closed book, closed calculator, and closed notes except your one-page crib
More informationMathematical Ideas Modelling data, power variation, straightening data with logarithms, residual plots
Kepler s Law Level Upper secondary Mathematical Ideas Modelling data, power variation, straightening data with logarithms, residual plots Description and Rationale Many traditional mathematics prolems
More informationSimple Examples. Let s look at a few simple examples of OI analysis.
Simple Examples Let s look at a few simple examples of OI analysis. Example 1: Consider a scalar prolem. We have one oservation y which is located at the analysis point. We also have a ackground estimate
More informationMolecular Modeling Lecture 7. Homology modeling insertions/deletions manual realignment
Molecular Modeling 2018-- Lecture 7 Homology modeling insertions/deletions manual realignment Homology modeling also called comparative modeling Sequences that have similar sequence have similar structure.
More informationIn-Depth Assessment of Local Sequence Alignment
2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.
More informationDeterminants of generalized binary band matrices
Determinants of generalized inary and matrices Dmitry Efimov arxiv:17005655v1 [mathra] 18 Fe 017 Department of Mathematics, Komi Science Centre UrD RAS, Syktyvkar, Russia Astract Under inary matrices we
More informationLecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)
Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from
More informationChemActivity L4: Proton ( 1 H) NMR
hemactivity L: Proton ( ) NMR (What can a NMR spectrum tell you aout the structure of a molecule?) This activity is designed to e completed in a ½ hour laoratory session or two classroom sessions. Model
More informationSequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas Informal inductive proof of best alignment path onsider the last step in the best
More informationAn Introduction to Bioinformatics Algorithms Hidden Markov Models
Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training
More informationHomology Modeling. Roberto Lins EPFL - summer semester 2005
Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,
More information3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT
3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode
More information1.5 Sequence alignment
1.5 Sequence alignment The dramatic increase in the number of sequenced genomes and proteomes has lead to development of various bioinformatic methods and algorithms for extracting information (data mining)
More informationNon-Linear Regression Samuel L. Baker
NON-LINEAR REGRESSION 1 Non-Linear Regression 2006-2008 Samuel L. Baker The linear least squares method that you have een using fits a straight line or a flat plane to a unch of data points. Sometimes
More information#A50 INTEGERS 14 (2014) ON RATS SEQUENCES IN GENERAL BASES
#A50 INTEGERS 14 (014) ON RATS SEQUENCES IN GENERAL BASES Johann Thiel Dept. of Mathematics, New York City College of Technology, Brooklyn, New York jthiel@citytech.cuny.edu Received: 6/11/13, Revised:
More informationChapter 5. Proteomics and the analysis of protein sequence Ⅱ
Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and
More informationCMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison
CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture
More informationEach element of this set is assigned a probability. There are three basic rules for probabilities:
XIV. BASICS OF ROBABILITY Somewhere out there is a set of all possile event (or all possile sequences of events which I call Ω. This is called a sample space. Out of this we consider susets of events which
More informationSequence comparison: Score matrices
Sequence comparison: Score matrices http://facultywashingtonedu/jht/gs559_2013/ Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best
More informationSequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University
Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of
More informationBuilding 3D models of proteins
Building 3D models of proteins Why make a structural model for your protein? The structure can provide clues to the function through structural similarity with other proteins With a structure it is easier
More informationLecture 2: Connecting the Three Models
IAS/PCMI Summer Session 2000 Clay Mathematics Undergraduate Program Advanced Course on Computational Complexity Lecture 2: Connecting the Three Models David Mix Barrington and Alexis Maciel July 18, 2000
More informationSVETLANA KATOK AND ILIE UGARCOVICI (Communicated by Jens Marklof)
JOURNAL OF MODERN DYNAMICS VOLUME 4, NO. 4, 010, 637 691 doi: 10.3934/jmd.010.4.637 STRUCTURE OF ATTRACTORS FOR (a, )-CONTINUED FRACTION TRANSFORMATIONS SVETLANA KATOK AND ILIE UGARCOVICI (Communicated
More informationBioinformatics: Investigating Molecular/Biochemical Evidence for Evolution
Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution Background How does an evolutionary biologist decide how closely related two different species are? The simplest way is to compare
More informationQuantifying sequence similarity
Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity
More informationComparing whole genomes
BioNumerics Tutorial: Comparing whole genomes 1 Aim The Chromosome Comparison window in BioNumerics has been designed for large-scale comparison of sequences of unlimited length. In this tutorial you will
More informationProtein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror
Protein structure prediction CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror 1 Outline Why predict protein structure? Can we use (pure) physics-based methods? Knowledge-based methods Two major
More information2. FUNCTIONS AND ALGEBRA
2. FUNCTIONS AND ALGEBRA You might think of this chapter as an icebreaker. Functions are the primary participants in the game of calculus, so before we play the game we ought to get to know a few functions.
More information8.04 Spring 2013 March 12, 2013 Problem 1. (10 points) The Probability Current
Prolem Set 5 Solutions 8.04 Spring 03 March, 03 Prolem. (0 points) The Proaility Current We wish to prove that dp a = J(a, t) J(, t). () dt Since P a (t) is the proaility of finding the particle in the
More informationSequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best alignment path onsider the last step in
More informationBioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment
Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value
More informationRobot Position from Wheel Odometry
Root Position from Wheel Odometry Christopher Marshall 26 Fe 2008 Astract This document develops equations of motion for root position as a function of the distance traveled y each wheel as a function
More informationCISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)
CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST
More informationWeak bidders prefer first-price (sealed-bid) auctions. (This holds both ex-ante, and once the bidders have learned their types)
Econ 805 Advanced Micro Theory I Dan Quint Fall 2007 Lecture 9 Oct 4 2007 Last week, we egan relaxing the assumptions of the symmetric independent private values model. We examined private-value auctions
More informationSingle alignment: Substitution Matrix. 16 march 2017
Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block
More informationModule: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment
Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand
More informationCMSC 451: Lecture 7 Greedy Algorithms for Scheduling Tuesday, Sep 19, 2017
CMSC CMSC : Lecture Greedy Algorithms for Scheduling Tuesday, Sep 9, 0 Reading: Sects.. and. of KT. (Not covered in DPV.) Interval Scheduling: We continue our discussion of greedy algorithms with a number
More informationProtein Secondary Structure Prediction
part of Bioinformatik von RNA- und Proteinstrukturen Computational EvoDevo University Leipzig Leipzig, SS 2011 the goal is the prediction of the secondary structure conformation which is local each amino
More informationComputer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms
Computer Science 385 Analysis of Algorithms Siena College Spring 2011 Topic Notes: Limitations of Algorithms We conclude with a discussion of the limitations of the power of algorithms. That is, what kinds
More informationAdvanced Undecidability Proofs
17 Advanced Undecidability Proofs In this chapter, we will discuss Rice s Theorem in Section 17.1, and the computational history method in Section 17.3. As discussed in Chapter 16, these are two additional
More informationCURRICULUM TIDBITS FOR THE MATHEMATICS CLASSROOM
TANTON S TAKE ON LOGARITHMS CURRICULUM TIDBITS FOR THE MATHEMATICS CLASSROOM APRIL 01 Try walking into an algera II class and writing on the oard, perhaps even in silence, the following: = Your turn! (
More information1 Hoeffding s Inequality
Proailistic Method: Hoeffding s Inequality and Differential Privacy Lecturer: Huert Chan Date: 27 May 22 Hoeffding s Inequality. Approximate Counting y Random Sampling Suppose there is a ag containing
More informationInequalities. Inequalities. Curriculum Ready.
Curriculum Ready www.mathletics.com Copyright 009 3P Learning. All rights reserved. First edition printed 009 in Australia. A catalogue record for this ook is availale from 3P Learning Ltd. ISBN 978--986-60-4
More informationCHAPTER 5. Linear Operators, Span, Linear Independence, Basis Sets, and Dimension
A SERIES OF CLASS NOTES TO INTRODUCE LINEAR AND NONLINEAR PROBLEMS TO ENGINEERS, SCIENTISTS, AND APPLIED MATHEMATICIANS LINEAR CLASS NOTES: A COLLECTION OF HANDOUTS FOR REVIEW AND PREVIEW OF LINEAR THEORY
More informationExercise 5. Sequence Profiles & BLAST
Exercise 5 Sequence Profiles & BLAST 1 Substitution Matrix (BLOSUM62) Likelihood to substitute one amino acid with another Figure taken from https://en.wikipedia.org/wiki/blosum 2 Substitution Matrix (BLOSUM62)
More informationFinding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético
Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético Paulo Mologni 1, Ailton Akira Shinoda 2, Carlos Dias Maciel
More informationEvolutionary Models. Evolutionary Models
Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment
More informationMotivating the need for optimal sequence alignments...
1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use
More informationWeek 10: Homology Modelling (II) - HHpred
Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative
More informationHidden Markov Models
Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training
More informationO 3 O 4 O 5. q 3. q 4. Transition
Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in
More informationRMT 2014 Power Round Solutions February 15, 2014
Introduction This Power Round develops the many and varied properties of the Thue-Morse sequence, an infinite sequence of 0s and 1s which starts 0, 1, 1, 0, 1, 0, 0, 1,... and appears in a remarkable number
More informationCh. 9 Multiple Sequence Alignment (MSA)
Ch. 9 Multiple Sequence Alignment (MSA) - gather seqs. to make MSA - doing MSA with ClustalW - doing MSA with Tcoffee - comparing seqs. that cannot align Introduction - from pairwise alignment to MSA -
More informationAn Introduction to Sequence Similarity ( Homology ) Searching
An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,
More informationAlgorithms Exam TIN093 /DIT602
Algorithms Exam TIN093 /DIT602 Course: Algorithms Course code: TIN 093, TIN 092 (CTH), DIT 602 (GU) Date, time: 21st October 2017, 14:00 18:00 Building: SBM Responsible teacher: Peter Damaschke, Tel. 5405
More informationLocal Alignment Statistics
Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison
More informationMath 216 Second Midterm 28 March, 2013
Math 26 Second Midterm 28 March, 23 This sample exam is provided to serve as one component of your studying for this exam in this course. Please note that it is not guaranteed to cover the material that
More information19 Tree Construction using Singular Value Decomposition
19 Tree Construction using Singular Value Decomposition Nicholas Eriksson We present a new, statistically consistent algorithm for phylogenetic tree construction that uses the algeraic theory of statistical
More informationGetting Started with Communications Engineering
1 Linear algebra is the algebra of linear equations: the term linear being used in the same sense as in linear functions, such as: which is the equation of a straight line. y ax c (0.1) Of course, if we
More informationBioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter
Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Institute of Bioinformatics Johannes Kepler University, Linz, Austria Sequence Alignment 2. Sequence Alignment Sequence Alignment 2.1
More informationDivide-and-Conquer. Reading: CLRS Sections 2.3, 4.1, 4.2, 4.3, 28.2, CSE 6331 Algorithms Steve Lai
Divide-and-Conquer Reading: CLRS Sections 2.3, 4.1, 4.2, 4.3, 28.2, 33.4. CSE 6331 Algorithms Steve Lai Divide and Conquer Given an instance x of a prolem, the method works as follows: divide-and-conquer
More informationScoring Matrices. Shifra Ben-Dor Irit Orr
Scoring Matrices Shifra Ben-Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison
More information, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1
Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression
More informationCONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018
CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of
More information2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51
2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each
More informationDIMACS Technical Report March Game Seki 1
DIMACS Technical Report 2007-05 March 2007 Game Seki 1 by Diogo V. Andrade RUTCOR, Rutgers University 640 Bartholomew Road Piscataway, NJ 08854-8003 dandrade@rutcor.rutgers.edu Vladimir A. Gurvich RUTCOR,
More informationSequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University
Sequence Alignment: Scoring Schemes COMP 571 Luay Nakhleh, Rice University Scoring Schemes Recall that an alignment score is aimed at providing a scale to measure the degree of similarity (or difference)
More informationSTATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler
STATC141 Spring 2005 The materials are from Pairise Sequence Alignment by Robert Giegerich and David Wheeler Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (I) Sequence similarity
More informationCS 4120 Lecture 3 Automating lexical analysis 29 August 2011 Lecturer: Andrew Myers. 1 DFAs
CS 42 Lecture 3 Automating lexical analysis 29 August 2 Lecturer: Andrew Myers A lexer generator converts a lexical specification consisting of a list of regular expressions and corresponding actions into
More information