Introduction to sequence alignment. Local alignment: the Smith-Waterman algorithm

Lecture 2, 12/3/2003:
Introduction to sequence alignment
The Needleman-Wunsch algorithm for global sequence alignment: description and properties
Local alignment: the Smith-Waterman algorithm

Computational sequence analysis
The major goal of computational sequence analysis is to predict the function and structure of genes and proteins from their sequence. This is possible because organisms evolve by mutation, duplication and selection of their genes, so sequence similarity often indicates functional and structural similarity.

Sequence alignment
5' ATCAGAGTC 3'
5' TTCAGTC 3'
ATC CTA AG GA etc.

Sequence alignment
ATCAGAGTC
TTCA--GTC
 +++^^+++
[Figure: TTCAGTC shown at several shifts along ATCAGAGTC, with matches marked "+" and gapped positions marked "^".]
We wish to identify which regions of the two sequences are most similar to each other. The sequences are shifted relative to one another and gaps are introduced, so that all possible alignments are covered. The shifts and gaps provide the steps by which one sequence can be converted into the other.

Sequence alignment: dot-plot
[Figure: dot-plot comparing ATCAGAGTC with TTCAGTC.]
ATCAGAGTC
TTCA--GTC
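A dot-plot can be sketched in a few lines of Python. This is only an illustration of the idea, not the plot from the slide: the function name `dot_plot` and the choice of "*" for identical residues and "." otherwise are assumptions made here.

```python
def dot_plot(rows: str, cols: str) -> list[str]:
    """Return text lines of a dot-plot; '*' marks identical residues."""
    # Header line: the column sequence, spaced to line up with the marks.
    lines = ["  " + " ".join(cols)]
    for r in rows:
        # One mark per column residue: '*' for identity, '.' otherwise.
        marks = ["*" if r == c else "." for c in cols]
        lines.append(r + " " + " ".join(marks))
    return lines


if __name__ == "__main__":
    print("\n".join(dot_plot("TTCAGTC", "ATCAGAGTC")))
```

Diagonal runs of "*" correspond to ungapped matching segments; a jump from one diagonal to another corresponds to a gap.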

Sequence alignment scoring
Substitution matrix: the similarity value between each pair of residues
    A   C   G   T
A   2   0   0   0
C   0   2   0   0
G   0   0   2   0
T   0   0   0   2
Gap penalty: the cost of introducing gaps
Gap penalty: -2
ATCAGAGTC
TTCA--GTC
 +++^^+++ : 0+2+2+2-2-2+2+2+2 = 8
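The scoring of a given alignment can be checked with a short Python sketch. It assumes the scheme on this slide (2 for a match, 0 for a mismatch, -2 for every gapped position); the function name `score_alignment` and its parameters are hypothetical.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 2, mismatch: int = 0, gap: int = -2) -> int:
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # every gapped position pays the gap penalty
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


if __name__ == "__main__":
    print(score_alignment("ATCAGAGTC", "TTCA--GTC"))  # prints 8
```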

Sequence alignment: Needleman-Wunsch global alignment
Substitution matrix (A, C, G, T): 2 for a match, 0 for a mismatch
Gap penalty: -2
Match scores for each residue pair:
    A   T   C   A   G   A   G   T   C
T   0   2   0   0   0   0   0   2   0
T   0   2   0   0   0   0   0   2   0
C   0   0   2   0   0   0   0   0   2
A   2   0   0   2   0   2   0   0   0
G   0   0   0   0   2   0   2   0   0
T   0   2   0   0   0   0   0   2   0
C   0   0   2   0   0   0   0   0   2
Position 3,2 can be reached from:
[T2 T1]: ATC over -TT
[C3 T1]: ATC- over --TT
[T2 T2]: ATC over TT-
Initialization
[ a b ] [ a - ] [ - b ]

Sequence alignment: Needleman-Wunsch global alignment
Substitution matrix (A, C, G, T): 2 for a match, 0 for a mismatch
Gap penalty: -2
        A   T   C   A   G   A   G   T   C
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
T  -2   0   2   0   0   0   0   0   2   0
T  -4   0   2   0   0   0   0   0   2   0
C  -6   0   0   2   0   0   0   0   0   2
A  -8   2   0   0   2   0   2   0   0   0
G -10   0   0   0   0   2   0   2   0   0
T -12   0   2   0   0   0   0   0   2   0
C -14   0   0   2   0   0   0   0   0   2
Initialization
Directionality of score calculation: [ a b ] [ a - ] [ - b ]

Sequence alignment: Needleman-Wunsch global alignment
Substitution matrix (A, C, G, T): 2 for a match, 0 for a mismatch
Gap penalty: -2
        A   T   C   A   G   A   G   T   C
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
T  -2   0   0  -2  -4  -6  -8 -10 -12 -14
T  -4  -2   2   0  -2  -4  -6  -8  -8 -10
C  -6   0   0   2   0   0   0   0   0   2
A  -8   2   0   0   2   0   2   0   0   0
G -10   0   0   0   0   2   0   2   0   0
T -12   0   2   0   0   0   0   0   2   0
C -14   0   0   2   0   0   0   0   0   2
(Rows not yet computed still show the match scores.)

Sequence alignment: Needleman-Wunsch global alignment
Substitution matrix (A, C, G, T): 2 for a match, 0 for a mismatch
Gap penalty: -2
        A   T   C   A   G   A   G   T   C
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
T  -2   0   0  -2  -4  -6  -8 -10 -12 -14
T  -4  -2   2   0  -2  -4  -6  -8  -8 -10
C  -6  -4   0   2   0   0   0   0   0   2
A  -8   2   0   0   2   0   2   0   0   0
G -10   0   0   0   0   2   0   2   0   0
T -12   0   2   0   0   0   0   0   2   0
C -14   0   0   2   0   0   0   0   0   2
(Cells not yet computed still show the match scores.)

Sequence alignment: Needleman-Wunsch global alignment
Substitution matrix (A, C, G, T): 2 for a match, 0 for a mismatch
Gap penalty: -2
        A   T   C   A   G   A   G   T   C
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
T  -2   0   0  -2  -4  -6  -8 -10 -12 -14
T  -4  -2   2   0  -2  -4  -6  -8  -8 -10
C  -6  -4   0   4   0   0   0   0   0   2
A  -8   2   0   0   2   0   2   0   0   0
G -10   0   0   0   0   2   0   2   0   0
T -12   0   2   0   0   0   0   0   2   0
C -14   0   0   2   0   0   0   0   0   2
(Cells not yet computed still show the match scores.)

Sequence alignment: Needleman-Wunsch algorithm
σ(a, b): score of aligning a pair of residues a and b
σ(a, -): score of aligning residue a with a gap (gap penalty: -q)
S: score matrix
S(i,j): optimal score of aligning residues 1 to i of one sequence with residues 1 to j of the other sequence

Sequence alignment: Needleman-Wunsch algorithm
S(0,0) ← 0
for j ← 1 to N do
    S(0,j) ← S(0,j-1) + σ(-, b_j)
for i ← 1 to M do {
    S(i,0) ← S(i-1,0) + σ(a_i, -)
    for j ← 1 to N do
        S(i,j) ← max( S(i-1,j-1) + σ(a_i, b_j),
                      S(i-1,j) + σ(a_i, -),
                      S(i,j-1) + σ(-, b_j) )
}
Pearson & Miller, Meth Enz 210:575 (1992)
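The recurrence above can be sketched in Python as follows. This is a minimal illustration, assuming a two-valued substitution score (match or mismatch) and a fixed penalty per gapped residue; the function name `nw_matrix` and its parameter names are hypothetical, not taken from Pearson & Miller.

```python
def nw_matrix(a: str, b: str,
              match: int = 2, mismatch: int = 0, gap: int = -2) -> list[list[int]]:
    """Fill the Needleman-Wunsch score matrix S; a indexes rows, b indexes columns."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # First row: b_1..b_j aligned against gaps only.
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        # First column: a_1..a_i aligned against gaps only.
        S[i][0] = S[i - 1][0] + gap
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # a_i aligned with b_j
                          S[i - 1][j] + gap,       # a_i aligned with a gap
                          S[i][j - 1] + gap)       # b_j aligned with a gap
    return S


if __name__ == "__main__":
    S = nw_matrix("TTCAGTC", "ATCAGAGTC")
    print(S[-1][-1])  # optimal global score, 8 for the lecture example
```

With TTCAGTC as rows and ATCAGAGTC as columns, this reproduces the matrix layout used on the slides; the optimal global score is the bottom-right cell.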

Sequence alignment: Needleman-Wunsch global alignment
The optimal score is found in a single pass through the alignment matrix. Additional steps are needed to recover the corresponding alignment(s). Because the score alone is often sufficient, this property saves time in database searches and other applications.

Needleman-Wunsch global alignment: the traceback
Substitution matrix (A, C, G, T): 2 for a match, 0 for a mismatch
Gap penalty: -2
        A   T   C   A   G   A   G   T   C
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
T  -2   0   0  -2  -4  -6  -8 -10 -12 -14
T  -4  -2   2   0  -2  -4  -6  -8  -8 -10
C  -6  -4   0   4   2   0  -2  -4  -6  -6
A  -8  -4  -2   2   6   4   2   0  -2  -4
G -10  -6  -4   0   4   8   6   4   2   0
T -12  -8  -4  -2   2   6   8   6   6   4
C -14 -10  -6  -2   0   4   6   8   6   8
ATCAGAGTC
TTC--AGTC
Score: 2x6 - 2x2 = 8
ATCAGAGTC
TTCAG--TC
Score: 2x6 - 2x2 = 8
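A traceback can be sketched in Python as below. This is an illustration under stated assumptions: it fills the matrix with the same match/mismatch/gap scheme as the slide and returns only one of the co-optimal alignments, with ties broken in favour of the diagonal step (an arbitrary choice made here). The function name `nw_align` is hypothetical.

```python
def nw_align(a: str, b: str,
             match: int = 2, mismatch: int = 0, gap: int = -2) -> tuple[int, str, str]:
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub, S[i - 1][j] + gap, S[i][j - 1] + gap)

    # Walk back from the bottom-right cell, re-deriving which move produced each score.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")   # residue of a against a gap
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])   # residue of b against a gap
            j -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, top, bottom = nw_align("ATCAGAGTC", "TTCAGTC")
    print(score)
    print(top)
    print(bottom)
```

Enumerating all co-optimal alignments, such as the two shown on the slide, would require following every predecessor that reproduces the cell score rather than only the first one found.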

Sequence alignment: Needleman-Wunsch global alignment
The algorithm calculates the score(s) of optimal global sequence alignments, penalizes end gaps, and penalizes each residue in a gap equally. (Alignments below are written as top / bottom.)
ATCAGAGTC / --TTCAGTC has a lower score than CAGAGTC / TTCAGTC
ATCACAGTC / T-C--AGTC has the same score as ATCACAGTC / T---CAGTC
ATCACAGTC / T---CAGTC has a lower score than ACACAGTC / T--CAGTC

Sequence alignment: Needleman-Wunsch global alignment
In order to score a gap with a penalty q that is independent of the gap length, i.e. so that the following alignments (written as top / bottom) all have the same score:
ACACAGTC / T--CAGTC
ATCACAGTC / T---CAGTC
AGCTTTCACAGTC / T-------CAGTC
the algorithm we presented is modified to extend alignments in more than the three ways we considered.

Sequence alignment: Needleman-Wunsch global alignment
    A   T   C   A   G   A   G   T   C
T   0   2   0   0   0   0   0   2   0
T   0   2   0   0   0   0   0   2   0
C   0   0   2   0   0   0   0   0   2
A   2   0   0   2   0   2   0   0   0
G   0   0   0   0   2   0   2   0   0
T   0   2   0   0   0   0   0   2   0
C   0   0   2   0   0   0   0   0   2
[Figure: arrows on the matrix showing the ways a cell can be reached, labelled [ a b ], [ a - ] and [ - b ].]

Sequence alignment: Needleman-Wunsch algorithm
S(0,0) ← 0
for j ← 1 to N do
    S(0,j) ← -q
for i ← 1 to M do {
    S(i,0) ← -q
    for j ← 1 to N do
        S(i,j) ← max( S(i-1,j-1) + σ(a_i, b_j),
                      max {S(0,j) ... S(i-1,j)} - q,
                      max {S(i,0) ... S(i,j-1)} - q )
}
Pearson & Miller, Meth Enz 210:575 (1992)
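The length-independent gap variant can be sketched directly from the pseudocode. This is a straightforward, unoptimized illustration: the column and row maxima are recomputed for every cell, so the running time grows as M·N·(M+N). The function name `nw_constant_gap` and the default scores are assumptions made for the example.

```python
def nw_constant_gap(a: str, b: str,
                    match: int = 2, mismatch: int = 0, q: int = 2) -> int:
    """Optimal global score when every gap costs q regardless of its length."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        S[0][j] = -q            # a leading gap of any length costs q
    for i in range(1, m + 1):
        S[i][0] = -q
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            best_up = max(S[k][j] for k in range(i))      # S(0,j) .. S(i-1,j)
            best_left = max(S[i][k] for k in range(j))    # S(i,0) .. S(i,j-1)
            S[i][j] = max(S[i - 1][j - 1] + sub, best_up - q, best_left - q)
    return S[m][n]


if __name__ == "__main__":
    for seq in ("ACACAGTC", "ATCACAGTC", "AGCTTTCACAGTC"):
        print(seq, nw_constant_gap(seq, "TCAGTC"))
```

With these scores the three sequences from the previous slide obtain the same optimal score against TCAGTC, as the slide states.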

Sequence alignment: Needleman-Wunsch global alignment caveats
Every algorithm is limited by the model it is built upon. For example, the NW dynamic programming algorithm guarantees optimal global alignments for the parameters we supply (substitution matrix, gap penalty and gap scoring model). However:
Different parameters can give different alignments.
The correct alignment might not be the optimal one.
The correct alignment might correspond to only part of the global alignment.

More details, sources and things to do for next class
Source: Pearson WR & Miller W, "Dynamic programming algorithms for biological sequence comparison." Methods in Enzymology 210:575-601 (1992).
Assignment: Calculate NW alignments with a constant gap penalty and observe the effect of different gap penalties and match/mismatch scores. In all cases use substitution matrices with only two types of scores: one value for an exact match and a lower value for mismatches. Try the nucleotide sequences used in class and the following amino acid sequences: ACDGSMF and AMDFR.

Local sequence alignments
Local sequence alignments are necessary for cases of:
Modular organization of genes and proteins (exons, domains, etc.)
Repeats
Sequences that diverged so that similarity was retained, or can be detected, only in some sub-regions

Modular organization of genes
[Figure: genes labelled gene A, gene B, gene C, gene W, gene X, gene Y and gene Z.]
Advanced Topics in Bioinformatics, Weizmann Institute of Science, spring 2003

Modular protein organization
Adapted from Henikoff et al., Science 278:609 (1997)
[Figure: domain architecture of the TLK and TEK receptor tyrosine-kinases; labelled domains: Kringle, IG, protein-kinase, FN3 and EGF.]

Modular protein organization
1KAP: secreted calcium-binding alkaline protease
[Figure: structure with the calcium-binding repeats and the protease domain labelled.]

Local sequence alignment
For local sequence alignment we wish to find which regions (sub-sequences) in the compared pair of sequences give the best alignment scores with the parameters we supply (substitution matrix, gap penalty and gap scoring model). The aligned regions may be anywhere along the sequences. More than one region might be aligned with a score above the threshold.

Sequence alignment: Needleman-Wunsch algorithm
S(0,0) ← 0
for j ← 1 to N do
    S(0,j) ← -q
for i ← 1 to M do {
    S(i,0) ← -q
    for j ← 1 to N do
        S(i,j) ← max( S(i-1,j-1) + σ(a_i, b_j),           [ a b ]
                      max {S(0,j) ... S(i-1,j)} - q,      [ a - ]
                      max {S(i,0) ... S(i,j-1)} - q )     [ - b ]
}

Local sequence alignment: Smith-Waterman algorithm
σ(a, b): score of aligning a pair of residues a and b
-q: gap penalty
S(i,j): optimal score of an alignment ending at residues i,j
best: highest score in the score matrix (S)

Local sequence alignment: Smith-Waterman algorithm
best ← 0
for j ← 1 to N do
    S(0,j) ← 0
for i ← 1 to M do {
    S(i,0) ← 0
    for j ← 1 to N do
        S(i,j) ← max( S(i-1,j-1) + σ(a_i, b_j),
                      max {S(0,j) ... S(i-1,j)} - q,
                      max {S(i,0) ... S(i,j-1)} - q,
                      0 )
        best ← max( S(i,j), best )
}
Pearson & Miller, Meth Enz 210:575 (1992)
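The Smith-Waterman pseudocode translates almost line by line into Python. The sketch below assumes the unitary matrix (1 for a match, -1 for a mismatch) and a length-independent gap penalty q = 2; the function name `smith_waterman` is hypothetical, and no traceback is performed.

```python
def smith_waterman(a: str, b: str,
                   match: int = 1, mismatch: int = -1,
                   q: int = 2) -> tuple[list[list[int]], int]:
    """Return the local-alignment score matrix S and its highest score."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            best_up = max(S[k][j] for k in range(i))      # S(0,j) .. S(i-1,j)
            best_left = max(S[i][k] for k in range(j))    # S(i,0) .. S(i,j-1)
            # The extra 0 lets an alignment start afresh at any cell.
            S[i][j] = max(S[i - 1][j - 1] + sub, best_up - q, best_left - q, 0)
            best = max(best, S[i][j])
    return S, best


if __name__ == "__main__":
    S, best = smith_waterman("GTCAGTCA", "ATCAGAGTC")
    print(best)  # 4 for the lecture example
```

The only differences from the global version are the zero initialization, the extra 0 in the maximum, and tracking the best cell anywhere in the matrix rather than reading the bottom-right corner.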

Local sequence alignment: Smith-Waterman algorithm
Finding the optimal alignment
Substitution matrix:
    A   C   G   T
A   1  -1  -1  -1
C  -1   1  -1  -1
G  -1  -1   1  -1
T  -1  -1  -1   1
Gap penalty: -2
        A   T   C   A   G   A   G   T   C
    0   0   0   0   0   0   0   0   0   0
G   0   0   0   0   0   1   0   1   0   0
T   0   0   1   0   0   0   0   0   2   0
C   0   0   0   2   0   0   0   0   0   3
A   0   1   0   0   3   1   1   1   1   1
G   0   0   0   0   1   4   2   2   2   2
T   0   0   1   0   1   2   3   1   3   1
C   0   0   0   2   1   2   1   2   1   4
A   0   1   0   0   3   2   3   1   1   2
The optimal local alignment is:
ATCAGAGTC
GTCAG--TCA
 ++++^^++ : 1+1+1+1-2+1+1 = 4

Local sequence alignment: Smith-Waterman algorithm
Finding the optimal alignment
Substitution matrix (A, C, G, T): 1 for a match, -1 for a mismatch
Gap penalty: -2
Score threshold: 3
        A   T   C   A   G   A   G   T   C
    0   0   0   0   0   0   0   0   0   0
G   0   0   0   0   0   1   0   1   0   0
T   0   0   1   0   0   0   0   0   2   0
C   0   0   0   2   0   0   0   0   0   3
A   0   1   0   0   3   1   1   1   1   1
G   0   0   0   0   1   4   2   2   2   2
T   0   0   1   0   1   2   3   1   3   1
C   0   0   0   2   1   2   1   2   1   4
A   0   1   0   0   3   2   3   1   1   2

Local sequence alignment: Smith-Waterman algorithm
Finding the optimal alignment
Substitution matrix (A, C, G, T): 1 for a match, -1 for a mismatch
Gap penalty: -2
Remove the scores of the current optimal alignment and then recalculate the matrix to find the next best alignment(s):
ATCAGAGTC
GTCAG--TCA
        A   T   C   A   G   A   G   T   C
    0   0   0   0   0   0   0   0   0   0
G   0  -1  -1  -1  -1   1  -1   1  -1  -1
T   0  -1   0  -1  -1  -1  -1  -1   1  -1
C   0  -1  -1   0  -1  -1  -1  -1  -1   1
A   0   1  -1  -1   0  -1   1  -1  -1  -1
G   0  -1  -1  -1  -1   0  -1   1  -1  -1
T   0  -1   1  -1  -1  -1  -1  -1   0  -1
C   0  -1  -1   1  -1  -1  -1  -1  -1   0
A   0   1  -1  -1   1  -1   1  -1  -1  -1

Local sequence alignment: Smith-Waterman algorithm
Finding the sub-optimal alignment
Substitution matrix (A, C, G, T): 1 for a match, -1 for a mismatch
Gap penalty: -2
Score threshold: 3
        A   T   C   A   G   A   G   T   C
    0   0   0   0   0   0   0   0   0   0
G   0   0   0   0   0   1   0   1   0   0
T   0   0   0   0   0   0   0   0   2   0
C   0   0   0   0   0   0   0   0   0   3
A   0   1   0   0   0   0   1   0   0   0
G   0   0   0   0   0   0   0   2   0   0
T   0   0   1   0   0   0   0   0   0   0
C   0   0   0   2   0   0   0   0   0   0
A   0   1   0   0   3   1   1   1   1   1
Sub-optimal local alignment between ATCAGAGTC and GTCAGTCA:
+++ : 1+1+1 = 3

Local sequence alignment: Smith-Waterman algorithm
In order for the algorithm to identify local alignments, the score for aligning unrelated sequence segments should typically be negative. Otherwise, true optimal local alignments will be extended beyond their correct ends, or will have lower scores than longer alignments between unrelated regions.
Alignment scores are determined by the substitution matrix, the gap penalties and the gap scoring model.

Alignment scoring schemes: gap models
(Examples are written as top / bottom.)
Gap scoring proportional to the gap length: $\sigma = -q\,g$ (g is the number of gapped residues)
ATCACA / T---CA: $\sigma = -3q$
Gap scoring by a constant, independent of the gap length: $\sigma = -q$
ATCACA / T---CA: $\sigma = -q$
Affine gap scoring (gap opening penalty d and gap extension penalty e): $\sigma = -(d + e\,(g-1))$
ATCACA / T---CA: $\sigma = -(d + 2e)$
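The three gap models can be written as small Python functions for comparison. The function names and the numeric values in the usage lines are illustrative assumptions; only the formulas come from the slide.

```python
def linear_gap(g: int, q: int) -> int:
    """Penalty proportional to the gap length: -q * g."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -q * g


def constant_gap(g: int, q: int) -> int:
    """Penalty independent of the gap length: -q."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -q


def affine_gap(g: int, d: int, e: int) -> int:
    """Opening penalty d for the first gapped residue, e for each further one."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -(d + e * (g - 1))


if __name__ == "__main__":
    g = 3  # the three-residue gap in ATCACA / T---CA
    print(linear_gap(g, q=2), constant_gap(g, q=2), affine_gap(g, d=4, e=1))
```

For the three-residue gap this gives -3q, -q and -(d + 2e) respectively, matching the examples on the slide.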

Local sequence alignment: Smith-Waterman algorithm
If alignment scores of unrelated sequences are mainly or solely determined by the substitution scores, then such alignments have negative scores if the expected substitution score is negative:
$\sum_{i,j} p_i \, p_j \, s_{ij} < 0$
i, j: residues
$p_i$: frequency of residue i
$s_{ij}$: score of aligning residues i and j
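The condition can be checked numerically for a given matrix. The sketch below assumes uniform nucleotide frequencies (0.25 each), which is an assumption made only for the example; the function name `expected_score` is hypothetical.

```python
def expected_score(freqs: dict[str, float], score) -> float:
    """Sum of p_i * p_j * s_ij over all ordered residue pairs (i, j)."""
    return sum(freqs[i] * freqs[j] * score(i, j) for i in freqs for j in freqs)


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    # Unitary matrix used in the Smith-Waterman example: 1 / -1.
    print(expected_score(uniform, lambda i, j: 1 if i == j else -1))
    # Matrix used in the Needleman-Wunsch example: 2 / 0.
    print(expected_score(uniform, lambda i, j: 2 if i == j else 0))
```

Under this assumption the unitary matrix has a negative expected score, so it satisfies the condition, while the 2/0 matrix used for the global examples has a positive expected score and does not.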

Local sequence alignment: Smith-Waterman algorithm
We can easily identify substitution matrices that will not give positive scores to random alignments. However, we have no analytical way of finding which gap scores will satisfy the requirement that random alignment scores be less than or equal to zero and so produce local sequence alignments. Nevertheless, certain scoring schemes (substitution matrix and gap scores) were found to give satisfactory local alignments.

More details, sources and things to do for next lecture
Sources:
Pearson & Miller, "Dynamic programming algorithms for biological sequence comparison." Methods in Enzymology 210:575-601 (1992).
Altschul, "Amino acid substitution matrices from an information theoretic perspective." J Mol Biol 219:555-565 (1991).
Henikoff, "Scores for sequence searches and alignments." Curr Opin Struct Biol 6:353-360 (1996).
Assignment: Read the source articles for this lecture. They have more details on the material we covered and introduce topics for the next lectures. Calculate S for the sequences presented in class, using the unitary matrix (1 for match, -1 for mismatch) and the constant gap penalty model with gap penalties of -1, -2 or -4 (q = 1, 2 or 4).

More details, sources and things to do for next lecture
For those who are not acquainted with information theory, or who want to be certain they know its basics:
An information theory primer for molecular biologists: http://www.lecb.ncifcrf.gov/~toms/paper/primer

Next lecture, 12/12/2001:
Substitution matrices: amino-acid features and empirical matrices
BLAST and FASTA: algorithms and statistics; assumptions and associated artifacts