Lecture 1, 31/10/2001: Introduction to sequence alignment
The Needleman-Wunsch algorithm for global sequence alignment: description and properties
Computational sequence analysis

The major goal of computational sequence analysis is to predict the function and structure of genes and proteins from their sequence. This is possible because organisms evolve by mutation, duplication and selection of their genes. Thus, sequence similarity often indicates functional and structural similarity.
5' ATCAGAGTC 3'
5' TTCAGTC 3'

ATC CTA AG GA etc.
ATCAGAGTC
TTCA--GTC
 +++^^+++

We wish to identify which regions of the two sequences are most similar to each other. The sequences are shifted relative to each other and gaps are introduced, to cover all possible alignments. The shifts and gaps provide the steps by which one sequence can be converted into the other.
Dot-plot

[Figure: dot-plot of the sequences ATCAGAGTC and TTCAGTC, shown together with the alignment below.]

ATCAGAGTC
TTCA--GTC
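A dot-plot can be sketched in a few lines: every cell where the two residues are identical is marked. The function name `dot_plot`, the text rendering, and the choice of which sequence runs horizontally are assumptions made for illustration, not taken from the slide.

```python
def dot_plot(horizontal, vertical):
    """Return a text dot-plot: '*' where residues are identical, '.' elsewhere."""
    lines = ["  " + " ".join(horizontal)]  # header row with the horizontal sequence
    for v in vertical:
        row = " ".join("*" if v == h else "." for h in horizontal)
        lines.append(v + " " + row)
    return "\n".join(lines)


print(dot_plot("ATCAGAGTC", "TTCAGTC"))
```

Runs of '*' along a diagonal correspond to stretches that align without gaps; a jump from one diagonal to another corresponds to a gap.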
Scoring

Substitution matrix - the similarity value between each pair of residues

   A  C  G  T
A  2  0  0  0
C  0  2  0  0
G  0  0  2  0
T  0  0  0  2

Gap penalty - the cost of introducing gaps

Gap penalty: -2

ATCAGAGTC
TTCA--GTC
 +++^^+++   : 0+2+2+2-2-2+2+2+2 = 8
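The column-by-column sum above can be reproduced with a small sketch. The names `SUBST`, `GAP` and `column_scores` are hypothetical; the values are the match/mismatch scores and gap penalty from this slide.

```python
NUCLEOTIDES = "ACGT"
# Substitution matrix from the slide: 2 for identical residues, 0 otherwise.
SUBST = {(x, y): (2 if x == y else 0) for x in NUCLEOTIDES for y in NUCLEOTIDES}
GAP = -2  # gap penalty from the slide


def column_scores(top, bottom):
    """Score each column of an alignment given as two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return [GAP if "-" in (x, y) else SUBST[(x, y)] for x, y in zip(top, bottom)]


scores = column_scores("ATCAGAGTC", "TTCA--GTC")
print(scores, sum(scores))
```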
Substitution matrix: match 2, mismatch 0. Gap penalty: -2.

Match scores for each pair of residues:

   A  T  C  A  G  A  G  T  C
T  0  2  0  0  0  0  0  2  0
T  0  2  0  0  0  0  0  2  0
C  0  0  2  0  0  0  0  0  2
A  2  0  0  2  0  2  0  0  0
G  0  0  0  0  2  0  2  0  0
T  0  2  0  0  0  0  0  2  0
C  0  0  2  0  0  0  0  0  2

Initialization

Position 3,2 can be reached in three ways:

[ a b ]  from [T2 T1]:   ATC
                         -TT

[ - b ]  from [C3 T1]:   ATC-
                         --TT

[ a - ]  from [T2 T2]:   ATC
                         TT-
Substitution matrix: match 2, mismatch 0. Gap penalty: -2.

Initialization

The first row and column hold the cumulative gap penalties; the remaining cells still show the match scores.

        A   T   C   A   G   A   G   T   C
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
T  -2   0   2   0   0   0   0   0   2   0
T  -4   0   2   0   0   0   0   0   2   0
C  -6   0   0   2   0   0   0   0   0   2
A  -8   2   0   0   2   0   2   0   0   0
G -10   0   0   0   0   2   0   2   0   0
T -12   0   2   0   0   0   0   0   2   0
C -14   0   0   2   0   0   0   0   0   2

Directionality of score calculation: [ a b ], [ a - ], [ - b ]
Substitution matrix: match 2, mismatch 0. Gap penalty: -2.

Filled score matrix:

        A   T   C   A   G   A   G   T   C
    0  -2  -4  -6  -8 -10 -12 -14 -16 -18
T  -2   0   0  -2  -4  -6  -8 -10 -12 -14
T  -4  -2   2   0  -2  -4  -6  -8  -8 -10
C  -6  -4   0   4   2   0  -2  -4  -6  -6
A  -8  -4  -2   2   6   4   2   0  -2  -4
G -10  -6  -4   0   4   8   6   4   2   0
T -12  -8  -4  -2   2   6   8   6   6   4
C -14 -10  -6  -2   0   4   6   8   6   8
Needleman-Wunsch algorithm

σ[a b] : score of aligning a pair of residues a and b
σ[a -] : score of aligning residue a with a gap (gap penalty: -q)
S : score matrix
S(i,j) : optimal score of aligning the residues at positions 1 to i of one sequence with the residues at positions 1 to j of the other sequence
Needleman-Wunsch algorithm

S(0,0) ← 0
for j ← 1 to N do
    S(0,j) ← S(0,j-1) + σ[- b_j]
for i ← 1 to M do {
    S(i,0) ← S(i-1,0) + σ[a_i -]
    for j ← 1 to N do
        S(i,j) ← max( S(i-1,j-1) + σ[a_i b_j],
                      S(i-1,j) + σ[a_i -],
                      S(i,j-1) + σ[- b_j] )
}

Pearson & Miller, Meth Enz 210:575, 92
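The pseudocode translates almost line for line into Python. This is a minimal sketch, assuming a two-valued substitution score (match/mismatch) and a single gap score for both σ[a -] and σ[- b]; the function name `nw_matrix` and its parameters are hypothetical. Sequence `a` indexes the rows, so the 7-residue sequence is passed first to reproduce the matrix drawn in the lecture.

```python
def nw_matrix(a, b, match=2, mismatch=0, gap=-2):
    """Fill the Needleman-Wunsch score matrix S, with S[i][j] as in the pseudocode."""
    M, N = len(a), len(b)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for j in range(1, N + 1):            # first row: b residues against gaps
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, M + 1):
        S[i][0] = S[i - 1][0] + gap      # first column: a residues against gaps
        for j in range(1, N + 1):
            sigma = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,  # align a_i with b_j
                          S[i - 1][j] + gap,        # align a_i with a gap
                          S[i][j - 1] + gap)        # align b_j with a gap
    return S


S = nw_matrix("TTCAGTC", "ATCAGAGTC")
for row in S:
    print(" ".join(f"{v:3d}" for v in row))
```

The bottom-right cell, `S[-1][-1]`, holds the optimal global alignment score.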
The optimal score is found in a single pass through the alignment matrix; further steps are needed to find the corresponding alignment(s). This is a time-saving property in database searches and other applications where only the score is needed.
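When only the score is wanted, the whole matrix need not be stored: each row depends only on the row above it and on the cells already filled in the current row. The sketch below keeps two rows at a time. This space-saving variant is an assumption added for illustration, not something stated in the lecture, and the name `nw_score` is hypothetical.

```python
def nw_score(a, b, match=2, mismatch=0, gap=-2):
    """Optimal global alignment score in one pass, keeping only two rows."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0 of the score matrix
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # column 0 of row i
        for j in range(1, len(b) + 1):
            sigma = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sigma,  # diagonal
                            prev[j] + gap,        # from the row above
                            curr[j - 1] + gap))   # from the left
        prev = curr
    return prev[-1]


print(nw_score("TTCAGTC", "ATCAGAGTC"))
```

No traceback is possible from this version, which is the trade-off the slide describes: the score comes cheaply, the alignment itself needs the full matrix or extra work.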
The traceback

[Figure: the filled score matrix shown above, with the traceback paths marked.]

The traceback yields three alignments with the optimal score of 8:

ATCAGAGTC
TTCAG--TC
 ++++^^++   : 0+2+2+2+2-2-2+2+2 = 8

ATCAGAGTC
TTC--AGTC
 ++^^++++   : 0+2+2-2-2+2+2+2+2 = 8

ATCAGAGTC
TTCA--GTC
 +++^^+++   : 0+2+2+2-2-2+2+2+2 = 8
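The traceback can be sketched as a walk from the bottom-right cell back to S(0,0), following every predecessor that could have produced the current score. The code below refills the matrix so that it is self-contained, and enumerates all optimal alignments recursively. The name `nw_all_alignments` and the recursive design are illustrative assumptions.

```python
def nw_all_alignments(a, b, match=2, mismatch=0, gap=-2):
    """Return the optimal score and every alignment that achieves it."""
    M, N = len(a), len(b)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for j in range(1, N + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, M + 1):
        S[i][0] = S[i - 1][0] + gap
        for j in range(1, N + 1):
            sigma = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma, S[i - 1][j] + gap, S[i][j - 1] + gap)

    results = []

    def walk(i, j, top, bottom):
        # top/bottom are built back to front and reversed at the origin
        if i == 0 and j == 0:
            results.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0:
            sigma = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sigma:
                walk(i - 1, j - 1, top + a[i - 1], bottom + b[j - 1])
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            walk(i - 1, j, top + a[i - 1], bottom + "-")
        if j > 0 and S[i][j] == S[i][j - 1] + gap:
            walk(i, j - 1, top + "-", bottom + b[j - 1])

    walk(M, N, "", "")
    return S[M][N], results


score, alignments = nw_all_alignments("ATCAGAGTC", "TTCAGTC")
for top, bottom in alignments:
    print(top, bottom, score, sep="\n", end="\n\n")
```

For the class example this returns the three alignments listed above. For long or repetitive sequences the number of co-optimal alignments can grow very quickly, so practical tools usually report a single path.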
The algorithm calculates the score(s) of optimal global sequence alignments, penalizes end gaps, and penalizes each residue in a gap equally.

ATCAGAGTC   has a lower score than   CAGAGTC
--TTCAGTC                            TTCAGTC

ATCACAGTC   has the same score as    ATCACAGTC
T-C--AGTC                            T---CAGTC

ATCACAGTC   has a lower score than   ACACAGTC
T---CAGTC                            T--CAGTC
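These three comparisons can be checked directly by scoring each alignment with the class parameters (match 2, mismatch 0, gap -2 per gapped residue). The helper name `linear_gap_score` is hypothetical.

```python
def linear_gap_score(top, bottom, match=2, mismatch=0, gap=-2):
    """Score an alignment in which every gapped residue costs the same."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# End gaps are penalized like any other gap.
end_gaps = linear_gap_score("ATCAGAGTC", "--TTCAGTC")
no_end_gaps = linear_gap_score("CAGAGTC", "TTCAGTC")

# Three gapped residues cost the same whether split or contiguous.
split_gap = linear_gap_score("ATCACAGTC", "T-C--AGTC")
single_gap = linear_gap_score("ATCACAGTC", "T---CAGTC")

# A longer gap costs more than a shorter one.
shorter_gap = linear_gap_score("ACACAGTC", "T--CAGTC")

print(end_gaps, no_end_gaps, split_gap, single_gap, shorter_gap)
```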
To score each gap with a penalty q that is independent of the gap length, i.e. so that

ACACAGTC   ATCACAGTC   AGCTTTCACAGTC
T--CAGTC   T---CAGTC   T-------CAGTC

all have the same score, the algorithm we presented is modified to extend alignments in more than the three ways we considered.
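Match scores for each pair of residues:

   A  T  C  A  G  A  G  T  C
T  0  2  0  0  0  0  0  2  0
T  0  2  0  0  0  0  0  2  0
C  0  0  2  0  0  0  0  0  2
A  2  0  0  2  0  2  0  0  0
G  0  0  0  0  2  0  2  0  0
T  0  2  0  0  0  0  0  2  0
C  0  0  2  0  0  0  0  0  2

[Figure: arrows on the match-score matrix, labelled [ a b ], [ a - ] and [ - b ], showing the ways in which an alignment can be extended.]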
Needleman-Wunsch algorithm

S(0,0) ← 0
for j ← 1 to N do
    S(0,j) ← -q
for i ← 1 to M do {
    S(i,0) ← -q
    for j ← 1 to N do
        S(i,j) ← max( S(i-1,j-1) + σ[a_i b_j],
                      max {S(0,j) ... S(i-1,j)} - q,
                      max {S(i,0) ... S(i,j-1)} - q )
}

Pearson & Miller, Meth Enz 210:575, 92
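A direct, unoptimized transcription of this constant-gap-penalty recurrence is sketched below. The maxima over the current column and row are recomputed for every cell, which keeps the code close to the pseudocode at the cost of extra work. The name `nw_constant_gap` and the default parameters (match 2, mismatch 0, q = 2) are assumptions for illustration.

```python
def nw_constant_gap(a, b, match=2, mismatch=0, q=2):
    """Optimal global score when each gap costs q regardless of its length."""
    M, N = len(a), len(b)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for j in range(1, N + 1):
        S[0][j] = -q                     # one leading gap of any length
    for i in range(1, M + 1):
        S[i][0] = -q
        for j in range(1, N + 1):
            sigma = match if a[i - 1] == b[j - 1] else mismatch
            diag = S[i - 1][j - 1] + sigma
            # a gap of any length ending here, entered from the same column...
            up = max(S[k][j] for k in range(i)) - q
            # ...or from the same row
            left = max(S[i][k] for k in range(j)) - q
            S[i][j] = max(diag, up, left)
    return S[M][N]


for seq in ("ACACAGTC", "ATCACAGTC", "AGCTTTCACAGTC"):
    print(seq, nw_constant_gap(seq, "TCAGTC"))
```

With q = 2, each of the three alignments shown earlier scores 0 - 2 + 10 = 8, whatever the length of its single gap.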
Caveats

Every algorithm is limited by the model it is built upon. For example, the NW dynamic programming algorithm guarantees optimal global alignments for the parameters we supply (substitution matrix, gap penalty and gap scoring). However:

- Different parameters can give different alignments.
- The correct alignment might not be the optimal one.
- The correct alignment might correspond to only part of the global alignment.
More details, sources and things to do for next class

Source: Pearson WR & Miller W, "Dynamic programming algorithms for biological sequence comparison." Methods in Enzymology, 210:575-601 (1992).

Assignment: Calculate NW alignments with a constant gap penalty and observe the effect of different gap penalties and match/mismatch scores. In all cases use substitution matrices that have only two types of score: a value for an exact match and a lower value for mismatches. Try the nucleotide sequences used in class and the following amino acid sequences: ACDGSMF & AMDFR.