Algorithms for Biological Sequence Comparison and Alignment
Sara Brunetti, Dipartimento di Ingegneria dell'Informazione e Scienze Matematiche, University of Siena, Italy, sara.brunetti@unisi.it
A piece of history
1953: DNA structure, Watson and Crick
1975: development of the sequencing technique, Sanger, Maxam and Gilbert
1990: beginning of the Genome Project. Goals:
  1. sequence the entire human genome, producing the complete DNA transcript
  2. produce maps of the genome showing the locations of expressed sites
2000: Tony Blair and Bill Clinton announce the completion of the human genome sequencing. Cost: 3 × 10^9 euros
2002: High-throughput sequencing (HTS)
2008: 1000 Genomes pilot project
2009: 1000 Genomes phase 1
2011: 1000 Genomes phase 2
Amounts of data
Human genome: 3 × 10^9 bp, contained in 10^15 cells
Macromolecular structures: 35,460 entries (PDB, 14 March 2006)
Bioinformatics: the study of problems of storage, organization and distribution of large amounts of genomic data
Computational biology
The study of mathematical and combinatorial problems of modeling biological processes in the cell, interpreting the data and providing theories about their biological relations.
1. Data representation
2. Problem formulation
3. (Efficient) algorithm design
Data representation: alphabet
Italian: A B C D E F G H I L M N O P Q R S T U V Z
English: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
DNA: A C G T (adenine, cytosine, guanine, thymine)
Protein: A Q W E R T Y I P L K H G F D S C V N M
Binary: 0 1
Data representation: strings
String (DNA): ACCGTATATAAAAGGCCGGGTT
Length: 22
A prefix is a substring that starts at the first character; a suffix is a substring that ends at the last character.
DNA information 7
Biological motivation
Learning about the function or structure of a protein without performing any experiments.
Basic idea: in biomolecular sequences (DNA, RNA, amino acid sequences), high similarity usually implies significant functional or structural similarity.
Usually 25% sequence identity suffices for two proteins to have the same 3D structure and almost identical function.
WARNING
Sequence similarity implies functional similarity, but the reverse is not necessarily true!
Besides sequences, there are other levels to investigate (3D protein structure, cellular biochemistry, morphology, etc.), but sequences are easier to study.
DNA Sequence Comparison: First Success Story
Finding sequence similarities with genes of known function is a common approach to inferring a newly sequenced gene's function.
In 1984 Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene.
Compare sequences. Why?
The resemblance of two DNA sequences taken from different organisms can be explained by the theory that all contemporary genetic material derives from a single ancestral DNA. According to this theory, mutations occurred during the course of evolution, creating differences between families of contemporary species. Most of these changes are due to local mutations, each modifying the DNA sequence in a specific manner.
Similarities and differences?
- Differences between the human genome and the chimpanzee genome: 2%
- Differences between human and worm: 50%
- Similarity between two humans: 99.9%
But: with a genome length of 3 × 10^9 bp, two humans can still differ in 3 × 10^6 positions.
Sequence Comparison Problems
Informally: find which parts of the sequences are alike and which parts are different.
1) Given two sequences over the same alphabet, of about the same length (around 10,000 characters) and almost equal, find the places where differences occur.
   Motivation: the same gene is sequenced by two laboratories and they want to compare the results.
2) Given two sequences of a few hundred characters, find two similar substrings (one from each sequence).
3) Same as Problem 2), but one sequence is compared with thousands of others.
   Motivation for 2) and 3): searching for local similarities in large databases of bio-sequences.
4) Given two sequences of a few hundred characters, find a prefix of one similar to a suffix of the other.
   Motivation: the fragment assembly procedure in large-scale DNA sequencing.
We introduce a single basic algorithmic idea to solve all the above problems.
Pairwise alignment How to compare two sequences? Alignment Similarity 15
Sequence alignment: an example s: ATGCAGCTGAGCATCG? t: ATACAGCGAGTATCG 16
Sequence alignment: an example
s: ATGCAGCTGAGCATCG
t: ATACAGC-GAGTATCG
Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter of v with the i-th letter of w.
v = ATATATAT
w = TATATATA
Hamming distance: d(v, w) = 8. Computing the Hamming distance is a trivial task.
Edit distance may compare the i-th letter of v with the j-th letter of w. Just one shift makes it all line up:
v = -ATATATAT
w = TATATATA-
Edit distance: d(v, w) = 2. Computing the edit distance is a non-trivial task.
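To make the "trivial task" concrete, here is a minimal Python sketch of the Hamming distance. The function name `hamming_distance` is chosen for illustration and does not come from the slides.

```python
def hamming_distance(v: str, w: str) -> int:
    """Count the positions where v and w differ (equal lengths required)."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    # Position i of v is only ever compared with position i of w.
    return sum(1 for a, b in zip(v, w) if a != b)


print(hamming_distance("ATATATAT", "TATATATA"))  # the slide's example pair
```

A single linear pass is enough because no shifting of positions is ever considered, which is exactly what the edit distance adds.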
Edit Distance: Example
TGCATAT → ATCCGAT in 5 steps
TGCATAT   (delete last T)
TGCATA    (delete last A)
TGCAT     (insert A at front)
ATGCAT    (substitute C for 3rd G)
ATCCAT    (insert G before last A)
ATCCGAT   (done)
What is the edit distance? 5?
Edit Distance: Example (cont'd)
TGCATAT → ATCCGAT in 4 steps
TGCATAT   (insert A at front)
ATGCATAT  (delete 6th T)
ATGCAAT   (substitute G for 5th A)
ATGCGAT   (substitute C for 3rd G)
ATCCGAT   (done)
Can it be done in 3 steps?
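The question above can be checked mechanically. The following is a minimal sketch of the standard dynamic-programming computation of the edit distance with unit costs for substitution, insertion and deletion; the function name `edit_distance` is an assumption made for this example.

```python
def edit_distance(v: str, w: str) -> int:
    """Unit-cost edit distance, keeping only two rows of the table."""
    n = len(w)
    prev = list(range(n + 1))          # distance from the empty prefix of v
    for i in range(1, len(v) + 1):
        curr = [i] + [0] * n           # i deletions turn v[:i] into ""
        for j in range(1, n + 1):
            substitute = prev[j - 1] + (v[i - 1] != w[j - 1])
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr[j] = min(substitute, delete, insert)
        prev = curr
    return prev[n]


print(edit_distance("TGCATAT", "ATCCGAT"))
```

For the pair TGCATAT and ATCCGAT this computation yields 4, so under unit costs the 4-step transformation is optimal and 3 steps are not enough.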
Sequence alignment
Sequence 1: $s = (s_1, \dots, s_m)$ of size m
Sequence 2: $t = (t_1, \dots, t_n)$ of size n
An alignment (s', t') between s and t is obtained by inserting spaces in arbitrary positions along the sequences so that they end up with the same size l:
s'_1 s'_2 ... s'_l
t'_1 t'_2 ... t'_l
Each column (s'_i, t'_i) is a pair of characters of s and t, or a character and a space "-".
The column (-, -) is not allowed.
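Number of alignments
In how many ways can s be aligned with t?
The alignment length l satisfies $\max(n,m) \le l \le n+m$; the upper bound is reached when every character faces a space, e.g. s_1 ... s_m followed by n spaces, over m spaces followed by t_1 ... t_n.
f(i,j) = number of alignments of one sequence of i letters with another of j letters:
$f(n,m) = f(n-1,m) + f(n-1,m-1) + f(n,m-1)$
$f(n,n) \sim (1+\sqrt{2})^{2n+1}\, n^{-1/2}$ as $n \to \infty$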
Example: two sequences of length 1000 have the following number of possible alignments:
$f(1000,1000) \approx (1+\sqrt{2})^{2001} \cdot 1000^{-1/2} \approx 10^{764}$
(there are about $10^{80}$ elementary particles in the universe)
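The recurrence for f can be evaluated exactly with arbitrary-precision integers. The sketch below assumes the boundary values f(i,0) = f(0,j) = 1 (a sequence can be aligned with the empty sequence in exactly one way), which the slides do not state explicitly; the name `count_alignments` is illustrative.

```python
def count_alignments(n: int, m: int) -> int:
    """Evaluate f(n, m) = f(n-1, m) + f(n-1, m-1) + f(n, m-1) row by row."""
    row = [1] * (m + 1)                # assumed boundary: f(0, j) = 1
    for _ in range(n):
        new = [1] * (m + 1)            # assumed boundary: f(i, 0) = 1
        for j in range(1, m + 1):
            new[j] = row[j] + row[j - 1] + new[j - 1]
        row = new
    return row[m]


for k in (1, 2, 3):
    print(k, count_alignments(k, k))
# Number of decimal digits of f(1000, 1000), to compare with the estimate
print(len(str(count_alignments(1000, 1000))))
```

Even for short sequences the count grows quickly, which is why enumerating alignments is hopeless and dynamic programming is needed.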
Global alignment
Given two sequences s and t of roughly the same length, determine the alignment of s and t with maximal (or minimal) score (Needleman & Wunsch algorithm).
AC-GCTTTG
-CATGTAT-
Motivation: the same gene is sequenced by two laboratories and they want to compare the results.
More about similarity and distance 25
Similarity and distance
Two approaches to comparing strings.
Similarity: measures how much the strings are alike. Its definition derives from the concept of a single ancestral DNA.
An alignment (s', t') of the strings s and t is obtained by inserting space characters "-" in them in such a way that:
1. $|s'| = |t'|$
2. Removal of "-" from s' gives s
3. Removal of "-" from t' gives t
4. For every i, either s'[i] or t'[i] is not "-"
A scoring system (p, g) has members $p: A \times A \to \mathbb{R}$ (score of a pair of letters) and $g < 0$ (score of a space). With additive scoring:
$\mathrm{sim}(s,t) = \max_{(s',t')} \mathrm{score}(s',t')$
Similarity and distance
Distance: measures how much the strings differ. Its definition derives from the concept of mutations.
A distance d on a set E is a function $d: E \times E \to \mathbb{R}$ such that:
1. $d(x,x) = 0$ for all x in E and $d(x,y) > 0$ for $x \ne y$
2. $d(x,y) = d(y,x)$ for all x, y in E
3. $d(x,y) \le d(x,z) + d(y,z)$ for all x, y, z in E
An alignment is obtained by successive applications of a number of admissible operations transforming s into t:
1. substitution a → b
2. insertion or deletion of any character (indel)
A cost measure (c, h) has members $c: A \times A \to \mathbb{R}$ (cost of a substitution) and $h > 0$ (cost of an indel).
When are similarity and distance algorithms equivalent?
When sequences are aligned by distance in global alignment, there is a similarity algorithm that gives the same set of optimal alignments, and vice versa.
The measures are related by the formulas:
$p(a,b) = M - c(a,b)$
$g = -h + M/2$
$\mathrm{dist}(s,t) + \mathrm{sim}(s,t) = \frac{M}{2}(|s| + |t|)$
Example (edit distance):
M = 0 ⇒ p(a,a) = 0, p(a,b) = -1, g = -1
M = 2 ⇒ p(a,a) = 2, p(a,b) = 1, g = 0
Same set of optimal solutions, different scores. Usually $0 \le M \le \max c(a,b)$.
Derivation
Aligned letters correspond to substitutions; let l be their number. Spaces correspond to indel operations; let r be their number.
$\mathrm{score}(s',t') = \sum_{i=1}^{l} p(a_i,b_i) + r\,g$
$\mathrm{cost}(s \to t) = \sum_{i=1}^{l} c(a_i,b_i) + r\,h$
$\mathrm{score}(s',t') + \mathrm{cost}(s \to t) = lM + r\,\frac{M}{2}$
In a global alignment $|s| + |t| = 2l + r$, hence
$\mathrm{score}(s',t') + \mathrm{cost}(s \to t) = \frac{M}{2}(|s| + |t|)$
$\mathrm{dist}(s,t) = \min\left(\frac{M}{2}(|s| + |t|) - \mathrm{score}(s',t')\right) = \frac{M}{2}(|s| + |t|) - \mathrm{sim}(s,t)$
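The relation dist(s,t) + sim(s,t) = M/2 (|s| + |t|) can be checked numerically. The sketch below uses one generic global-alignment table for both measures and the edit-distance costs c(a,a) = 0, c(a,b) = 1, h = 1; the helper names `optimum` and `dist_and_sim` are assumptions made for this illustration.

```python
def optimum(s, t, pair, space, best):
    """Global-alignment table; best=max gives similarity, best=min distance."""
    m, n = len(s), len(t)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        table[i][0] = i * space
    for j in range(1, n + 1):
        table[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = best(
                table[i - 1][j - 1] + pair(s[i - 1], t[j - 1]),
                table[i - 1][j] + space,
                table[i][j - 1] + space,
            )
    return table[m][n]


def dist_and_sim(s, t, big_m):
    """Edit distance and the similarity obtained from it for a given M."""
    h = 1                                       # indel cost

    def c(a, b):                                # substitution cost
        return 0 if a == b else 1

    def p(a, b):                                # p(a,b) = M - c(a,b)
        return big_m - c(a, b)

    g = -h + big_m / 2                          # g = -h + M/2
    return optimum(s, t, c, h, min), optimum(s, t, p, g, max)


for big_m in (0, 2):
    dist, sim = dist_and_sim("TGCATAT", "ATCCGAT", big_m)
    print(big_m, dist, sim, dist + sim == big_m / 2 * (7 + 7))
```

Changing M shifts every alignment's score by the same amount, which is why the set of optimal alignments does not change.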
Dynamic Programming Algorithm
Basic idea of dynamic programming: a problem is solved by taking advantage of already solved sub-problems.
Each optimal alignment contains optimal alignments of the subproblems, thanks to the additivity of the penalty function. Example:
-GCTGATATAGCT
GGGTGAT-TAGCT
Three essential components:
- Recurrence relation
- Tabular computation
- Traceback
Dynamic Programming Algorithm: recurrence relation
Sequence 1: s of size m. Sequence 2: t of size n.
s[i..j] is the substring of s from character i to character j.
M(i,j) is the score of the best alignment between s[1..i] and t[1..j].
M(m,n) is computed by solving the more general problem of computing M(i,j) for all i, j:
$M(i,0) = -2i, \qquad M(0,j) = -2j$
$M(i,j) = \max\{\, M(i,j-1) - 2,\; M(i-1,j-1) + p(i,j),\; M(i-1,j) - 2 \,\}$
$p(i,j) = +1$ if $s_i = t_j$, and $p(i,j) = -1$ if $s_i \ne t_j$
The approach is not top-down but bottom-up. The computation is arranged in an $(m+1) \times (n+1)$ array M.
Dynamic Programming Algorithm: tabular computation
s = AAAC (rows), t = AGC (columns)

 M           A    G    C
        0   -2   -4   -6
 A     -2    1   -1   -3
 A     -4   -1    0   -2
 A     -6   -3   -2   -1
 C     -8   -5   -4   -1

Row 0: comparison between t and an empty sequence. Column 0: comparison between s and an empty sequence.
M[i,j] is computed by observing the 3 previous entries M[i-1,j-1], M[i,j-1] and M[i-1,j].
M[i-1,j-1]: a new character of s and a new character of t are considered; +1 is added in case of a match and -1 in case of a mismatch. Align s[1..i-1] with t[1..j-1].
M[i,j-1]: a new character of the sequence t is considered, corresponding to a space in s (-2). Align s[1..i] with t[1..j-1] and match a space with t_j.
M[i-1,j]: a new character of the sequence s is considered, corresponding to a space in t (-2). Align s[1..i-1] with t[1..j] and match a space with s_i.
Dynamic Programming Algorithm: traceback
Trace back from M[m,n] to find the best alignment(s).

 M           A    G    C
        0   -2   -4   -6
 A     -2    1   -1   -3
 A     -4   -1    0   -2
 A     -6   -3   -2   -1
 C     -8   -5   -4   -1    (best score: M[4,3] = -1)

solution1    solution2    solution3
s: AAAC      s: AAAC      s: AAAC
t: AG-C      t: A-GC      t: -AGC
Algorithm Similarity
input: S, T, m, n
output: M
for i ← 1 to m do M(i,0) ← i·g
for j ← 0 to n do M(0,j) ← j·g
for i ← 1 to m do
  for j ← 1 to n do
    M(i,j) ← max( M(i-1,j) + g, M(i-1,j-1) + p(i,j), M(i,j-1) + g )
return M
Complexity: O(nm)
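Algorithm Similarity translates almost line by line into Python. The following is a sketch under the scoring used in these slides (+1 match, -1 mismatch, g = -2); the function name `similarity_matrix` and the keyword parameters are assumptions made for illustration.

```python
def similarity_matrix(s, t, match=1, mismatch=-1, g=-2):
    """Fill the (m+1) x (n+1) table M of best global-alignment scores."""
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * g                 # s[1..i] against the empty string
    for j in range(n + 1):
        M[0][j] = j * g                 # t[1..j] against the empty string
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j] + g,        # s[i] against a space
                          M[i - 1][j - 1] + p,    # s[i] against t[j]
                          M[i][j - 1] + g)        # t[j] against a space
    return M


for row in similarity_matrix("AAAC", "AGC"):
    print(row)
```

Python strings are 0-indexed, so s[i - 1] plays the role of s_i in the pseudocode.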
Align(i, j, len)
input: i, j, array M obtained by Algorithm Similarity
output: alignment in align-s, align-t, vectors of length len
if i = 0 and j = 0 then
  len ← 0
else if i > 0 and M[i,j] = M[i-1,j] + g then
  Align(i-1, j, len); len ← len + 1; align-s[len] ← s[i]; align-t[len] ← "-" (space)
else if i > 0 and j > 0 and M[i,j] = M[i-1,j-1] + p(i,j) then
  Align(i-1, j-1, len); len ← len + 1; align-s[len] ← s[i]; align-t[len] ← t[j]
else
  Align(i, j-1, len); len ← len + 1; align-s[len] ← "-" (space); align-t[len] ← t[j]
First call: Align(m, n, len). On return, max(|s|, |t|) ≤ len ≤ m + n.
Algorithm Align finds solution1. By changing the order of the if statements it is possible to find the other solutions.
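The traceback can also be written iteratively. The sketch below fills the table and then walks back from M[m][n], testing the three cases in the same order as Align (space in t first, then the diagonal, then space in s); the name `global_align` and the iterative formulation are choices made for this example, not the slides' recursive procedure.

```python
def global_align(s, t, match=1, mismatch=-1, g=-2):
    """Return (aligned s, aligned t, score) for one optimal global alignment."""
    m, n = len(s), len(t)

    def p(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * g
    for j in range(1, n + 1):
        M[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j] + g, M[i - 1][j - 1] + p(i, j),
                          M[i][j - 1] + g)

    # Traceback: columns are collected from the end and reversed afterwards.
    a_s, a_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and M[i][j] == M[i - 1][j] + g:
            a_s.append(s[i - 1]); a_t.append("-"); i -= 1
        elif i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + p(i, j):
            a_s.append(s[i - 1]); a_t.append(t[j - 1]); i -= 1; j -= 1
        else:
            a_s.append("-"); a_t.append(t[j - 1]); j -= 1
    return "".join(reversed(a_s)), "".join(reversed(a_t)), M[m][n]


print(global_align("AAAC", "AGC"))
```

Reordering the three tests in the while loop selects a different optimal alignment when ties exist, which mirrors the remark about inverting the if statements.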
Complexity
Algorithm Similarity takes O(mn) time and space.
Algorithm Align takes O(m+n) time. Let h = m + n:
T(h) = k for h ≤ 2
T(h) = T(h-1) + k for h > 2
T(h) = O(h) = O(m+n) (k constant)
Algorithm Similarity can be refined to run in O(m+n) space: in a row-by-row computation, store the last and the current row only.
Algorithm Align can be designed to run in O(m+n) space with a divide-and-conquer strategy. It is not a trivial task!
The basic algorithm Similarity can be modified to solve a variety of different problems!
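The two-row refinement can be sketched as follows. It returns only the score M[m][n], not the alignment; the function name `similarity_score` is illustrative, and swapping the arguments so that the shorter string indexes the rows relies on the scoring being symmetric, which holds for the +1/-1/-2 scheme assumed here.

```python
def similarity_score(s, t, match=1, mismatch=-1, g=-2):
    """Best global-alignment score using only two rows of the table."""
    if len(t) > len(s):
        s, t = t, s                     # rows of length min(m, n) + 1
    n = len(t)
    prev = [j * g for j in range(n + 1)]            # row 0
    for i in range(1, len(s) + 1):
        curr = [i * g] + [0] * n                    # column 0 of row i
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j] + g, prev[j - 1] + p, curr[j - 1] + g)
        prev = curr                                 # drop the older row
    return prev[n]


print(similarity_score("AAAC", "AGC"))
```

Recovering the alignment itself in linear space needs more than this, since the discarded rows are exactly what the traceback would read.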
Semi-global alignment 37
Semi-global Comparison
Find the best fit of a short sequence t of size n into a larger sequence s of size m, i.e. align t with some substring s[k..l] of s = s_1 ... s_k ... s_l ... s_m.
Solving the problem as formulated above, by aligning t with every substring s[k..l] separately, takes time proportional to
$\sum_{k=1}^{m} \sum_{l=k}^{m} n(l-k) = O(nm^3)$
(Exact matching)
Problem: given a pattern p and a larger string s, find all the occurrences of the pattern p in s.
Is there an occurrence? How many times does p occur in s?
Methods: naive method, Boyer-Moore algorithm, Knuth-Morris-Pratt algorithm.
Semi-global Comparison
Ignore the spaces at the beginning and at the end of a sequence.
Problem: find the highest-scoring semi-global alignment between t and a substring (a prefix of a suffix) of s.
s: CAGCA-CTTGGATTCTCGG
t: ---CAGCTTGG--------
1. Ignore final spaces. Find the best score between t and a prefix of s. M[i,j] of Problem 1 contains the best score between s[1..i] and t[1..j], hence take the maximum value M[i,n] in the last column n. There is no need to reach the last row.
2. Ignore initial spaces. Find the best alignment between t and a suffix of s. M[i,j] now contains the best score between t[1..j] and a suffix of s[1..i], hence the first column contains all zeroes.
(Figure: matrix with the characters of t, C A G C ..., as columns and the characters of s, C A G C A ..., as rows; the first column is all 0.)
Join solutions 1 and 2 to solve semi-global comparison!
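Joining the two modifications gives a small change to the global algorithm: the first column is initialized to zero and the answer is read as the maximum of the last column. The sketch below assumes the +1/-1/-2 scoring of the earlier slides and returns only the score; the name `semi_global_score` is chosen for illustration.

```python
def semi_global_score(s, t, match=1, mismatch=-1, g=-2):
    """Best score of t (short) against any substring of s (long)."""
    n = len(t)
    prev = [j * g for j in range(n + 1)]    # row 0: spaces in s are charged
    best = prev[n]
    for i in range(1, len(s) + 1):
        curr = [0] * (n + 1)                # column 0 is 0: initial spaces free
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j] + g, prev[j - 1] + p, curr[j - 1] + g)
        best = max(best, curr[n])           # last column: final spaces free
        prev = curr
    return best


print(semi_global_score("CAGCACTTGGATTCTCGG", "CAGCTTGG"))
```

The running time stays O(nm), compared with the O(nm^3) of aligning t against every substring of s separately.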
Local Alignment 42
Local Alignment
Find the best fit between a substring of s and a substring of t: s[k..i] within s_1 ... s_m and t[h..j] within t_1 ... t_n.
Motivation: ignore stretches of non-coding DNA.
Global Alignment vs Local Alignment (Smith & Waterman algorithm)
Global alignment:
--T -CC-C-AGT -TATGT-CAGGGGACACG A-GCATGCAGA-GAC
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG T-CAGAT--C
Local alignment (a better alignment to find a conserved segment):
tcccagttatgtcaggggacacgagcatgcagagac
aattgccgccgtcgttttcagcagttatgtcagatc
Local Alignment: Example
Compute a "mini" global alignment to get the local alignment.
(Figure labels: local alignment, global alignment.)
Local Alignment Algorithm Smith&Waterman The LA problem is still solved computing M. M[i,j] holds the value of the best alignment between a suffix of s[1..i] and a suffix of t[1..j]. The first row and the first column are initialized with zeros. 46
Local Alignment
$M[i,j] = \max\{\, M[i,j-1] - 2,\; M[i-1,j-1] + p(i,j),\; M[i-1,j] - 2,\; 0 \,\}$
$p(i,j) = +1$ if $s_i = t_j$, and $p(i,j) = -1$ if $s_i \ne t_j$
For any entry M[i,j] there always exists the alignment between the empty suffixes of s[1..i] and of t[1..j], with score 0.
At the end, choose the entry M[i,j] with maximal score in any position. Start tracing back from there, as before, until you find a value 0.
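The local recurrence and its traceback can be sketched as below, again with +1/-1/-2 scoring. The function name `local_align`, the tie-breaking rule (first maximal entry in row order) and the order of the traceback tests are assumptions of this example.

```python
def local_align(s, t, match=1, mismatch=-1, g=-2):
    """Return (score, substring alignment of s, substring alignment of t)."""
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(0, M[i - 1][j - 1] + p,
                          M[i - 1][j] + g, M[i][j - 1] + g)
            if M[i][j] > best:
                best, bi, bj = M[i][j], i, j

    # Trace back from the maximal entry until a 0 entry is reached.
    a_s, a_t = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and M[i][j] > 0:
        p = match if s[i - 1] == t[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + p:
            a_s.append(s[i - 1]); a_t.append(t[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + g:
            a_s.append(s[i - 1]); a_t.append("-"); i -= 1
        else:
            a_s.append("-"); a_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a_s)), "".join(reversed(a_t))


print(local_align("GGGACGTACCC", "TTACGTATT"))
```

The extra 0 in the maximum is what lets an alignment start anywhere: a prefix that would contribute a negative score is simply dropped.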
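Example
Local alignment of HEAGAWGHEE (columns) and PAWHEAE (rows). The entries use substitution-matrix scores and a gap penalty of -8 (visible in the table, e.g. 20 → 12 → 4).

         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
 P   0   0   0   0   0   0   0   0   0   0   0
 A   0   0   0   5   0   5   0   0   0   0   0
 W   0   0   0   0   2   0  20  12   4   0   0
 H   0  10   2   0   0   0  12  18  22  14   6
 E   0   2  16   8   0   0   4  10  18  28  20
 A   0   0   8  21  13   5   0   4  10  20  27
 E   0   0   6  13  18  12   4   0   4  16  26

The maximal entry is 28; tracing back from it until a 0 is reached gives the local alignment
AWGHE
AW-HE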
End free-space alignment: motivation
Find the best fit of substrings of s and t, where at least one of these substrings must be a prefix of the original string and one must be a suffix.
Motivation: in the shotgun sequence assembly procedure, one has a large set of partially overlapping substrings that come from many copies of one original but unknown DNA sequence. The problem is to use comparisons of pairs of substrings to infer the correct original string.
End free-space alignment
$V(i,0) = 0, \qquad V(0,j) = 0$
$V(i,j) = \max\{\, V(i-1,j-1) + p(S_i, T_j),\; V(i-1,j) - 2,\; V(i,j-1) - 2 \,\}$
$V(i^*, n) = \max_{1 \le i \le m} V(i,n), \qquad V(m, j^*) = \max_{1 \le j \le n} V(m,j)$
$V(S,T) = \max\{\, V(i^*, n),\; V(m, j^*) \,\}$
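Example
End free-space alignment of HEAGAWGHEE (columns) and PAWHEAE (rows). The entries use substitution-matrix scores and a gap penalty of -8 (visible in the table, e.g. 18 → 10 → 2).

         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
 P   0  -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
 A   0  -2  -2   4  -1   3  -4  -4  -4  -3  -2
 W   0  -3  -5  -4   1  -4  18  10   2  -6  -6
 H   0  10   2  -6  -6  -1  10  16  20  12   4
 E   0   2  16   8   0  -7   2   8  16  26  18
 A   0  -2   8  21  13   5  -3   2   8  18  25
 E   0   0   4  13  18  12   4  -4   2  14  24

The maximum over the last row and the last column is 25; tracing back from it to the first row or column gives the alignment
GAWGHEE
PAW-HEA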
Kinds of Alignment
Global alignment. INPUT: two strings S and T of roughly the same length. QUESTION: what is the similarity between the two?
Semi-global alignment. INPUT: two strings S and T. QUESTION: what is the similarity between a substring of S and T?
Local alignment. INPUT: two strings S and T. QUESTION: what is the similarity (difference) between a substring of S and a substring of T? What are these most similar substrings?
Ends free-space alignment. INPUT: two strings S and T of different length. QUESTION: what is the similarity between substrings of S and T, respectively, where at least one of these substrings must be a prefix of the original string and one (not necessarily the other) must be a suffix?
Complexity of Alignments

Problem                      Time complexity   Space complexity
Global alignment             O(nm)             O(n+m) (O(nm) to trace back)
Semi-global alignment        O(nm)             O(nm)
Local alignment              O(nm)             O(nm)
Ends free-space alignment    O(nm)             O(nm)

The space complexity could be a critical bottleneck. How can we improve it?
Linear-space alignment: Hirschberg's algorithm, Miller and Myers algorithm.
Extensions to the basic algorithm
Hirschberg's linear-space method for alignment uses a divide-and-conquer strategy.
Gap penalty 55
Gap penalty function
Gap: a run of k ≥ 1 consecutive spaces.
From biology we know that, when mutations are involved, a gap of k spaces is more probable than k isolated spaces. One concrete example is given by cDNA matching.
In the previous problems the cost w(k) of k internal consecutive spaces was proportional to k: w(k) = kg.
Now w(k) = h + kg, where h + g is the cost of the first space of a gap and g is the cost of each of the following ones.
Example: in CA-----CTTGG the gap of 5 spaces costs (h+g) + g + g + g + g, i.e. w(5) = h + 5g.
Attention!
The scoring system is no longer additive, i.e. we cannot break an alignment into two parts and expect the total score to be the sum of the partial scores.
AAC---AATTCCGACTAC
ACTACCT------CGC--
The scoring of an alignment is done at the block level.
Similarities with gap
We need three matrices a, b, c, with the following meaning:
a[i,j] = maximum score of an alignment between s[1..i] and t[1..j] where s[i] is matched with t[j].
b[i,j] = maximum score of an alignment between s[1..i] and t[1..j] that ends in a "-" aligned with t[j].
c[i,j] = maximum score of an alignment between s[1..i] and t[1..j] that ends in s[i] aligned with a "-".
Recurrences (the terms with -(h+g) open a gap, i.e. pay for its first space):
$a[i,j] = p(i,j) + \max\{\, a[i-1,j-1],\; b[i-1,j-1],\; c[i-1,j-1] \,\}$
$b[i,j] = \max\{\, a[i,j-1] - (h+g),\; b[i,j-1] - g,\; c[i,j-1] - (h+g) \,\}$
$c[i,j] = \max\{\, a[i-1,j] - (h+g),\; b[i-1,j] - (h+g),\; c[i-1,j] - g \,\}$
Initialization:
$a[0,0] = 0$, $a[i,0] = -\infty$ for $0 < i \le m$, $a[0,j] = -\infty$ for $0 < j \le n$
$b[i,0] = -\infty$ for $0 \le i \le m$, $b[0,j] = -(h + gj)$ for $0 < j \le n$
$c[i,0] = -(h + gi)$ for $0 < i \le m$, $c[0,j] = -\infty$ for $0 \le j \le n$
Final result: the maximum among a[m,n], b[m,n] and c[m,n].
Trace back to obtain the optimal alignment, remembering the current position and which array it belongs to.
Time: O(mn). Space: 3(mn) = O(mn).
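The three-matrix scheme can be sketched as follows, with h and g given as positive costs so that a gap of k spaces scores -(h + kg). The function name `affine_score`, the +1/-1 letter scores and the use of float("-inf") for the forbidden entries are assumptions of this example; only the score is returned, not the alignment.

```python
NEG_INF = float("-inf")


def affine_score(s, t, h, g, match=1, mismatch=-1):
    """Best global score when a gap of k spaces costs h + k*g."""
    m, n = len(s), len(t)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with s[i] / t[j]
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with "-" / t[j]
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with s[i] / "-"
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(h + g * j)
    for i in range(1, m + 1):
        c[i][0] = -(h + g * i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = p + max(a[i - 1][j - 1], b[i - 1][j - 1],
                              c[i - 1][j - 1])
            b[i][j] = max(a[i][j - 1] - (h + g),    # open a gap in s
                          b[i][j - 1] - g,          # extend the gap in s
                          c[i][j - 1] - (h + g))    # open a gap in s
            c[i][j] = max(a[i - 1][j] - (h + g),    # open a gap in t
                          b[i - 1][j] - (h + g),    # open a gap in t
                          c[i - 1][j] - g)          # extend the gap in t
    return max(a[m][n], b[m][n], c[m][n])


print(affine_score("ACGTACGT", "ACGT", h=2, g=1))
```

With h = 0 the scheme reduces to the linear gap cost w(k) = kg, which gives a simple sanity check against the basic algorithm.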
Other gap penalty models
Constant.
Affine.
Convex: each additional space in a gap contributes less to the gap weight than the previous space (e.g. log(q)); the problem is solvable in O(nm log(m)) time.
Arbitrary: any gap weight function is acceptable; the problem is solvable in O(nm(m+n)) time.