Sequence Comparison: Local Alignment Genome 373 Genomic Informatics Elhanan Borenstein
A quick review: Global Alignment. Mission: Find the best global alignment between two sequences. This needs two things: a method for scoring alignments, and an algorithm for finding the alignment with the best score.
Review: Global Alignment. Three possible moves: A diagonal move aligns a character from each sequence. A horizontal move aligns a gap in the sequence along the left edge. A vertical move aligns a gap in the sequence along the top edge. The move you keep is the best scoring of the three.
Review: Global Alignment. Fill the DP matrix from upper left to lower right. Trace back the alignment from the lower right corner.

             G    A    A    T    C
        0   -4   -8  -12  -16  -20
   C   -4   -5   -9  -13  -12   -6
   A   -8   -4    5    1   -3   -7
   T  -12   -8    1    0   11    7
   A  -16  -12    2   11    7    6
   C  -20  -16   -2    7   11   17
DP in equation form. Align sequences x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) + d \\
F(i,\,j-1) + d
\end{cases}
\]
DP equation graphically. [Figure: arrows into cell F(i, j) from F(i-1, j-1) (labeled s(x_i, y_j)), from F(i, j-1), and from F(i-1, j).] Take the max of these three.
Local alignment. Mission: Find the best partial alignment between two sequences. Why?
Local alignment. A single-domain protein may be similar only to one region within a multi-domain protein. A DNA/RNA query may align to a small part of a genome, several genomes, or metagenomes. An alignment that spans the complete length of both sequences may be undesirable.
BLAST does local alignments. A typical search has a short query against long targets. The alignments returned show only the well-aligned match region of both query and target. [Figure: Query, with the matched regions returned in the alignment; Targets (e.g. genome contigs, full genomes, metagenomes).]
How can we modify the Needleman-Wunsch DP algorithm (for finding the global alignment) so that it instead finds the best local alignment?
Global alignment of GAATC and CATAC, with the score of each column and the running total:

      G    A    A    T    -    C
      -    C    A    T    A    C
     -4   -5   10   10   -4   10    = 17
     -4   -9    1   11    7   17    (running total)
Remember: Global alignment DP. Align sequences x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) + d \\
F(i,\,j-1) + d
\end{cases}
\]
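Local alignment DP. Align sequences x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

\[
F(i,j) = \max
\begin{cases}
0 & \text{(corresponds to start of alignment)} \\
F(i-1,\,j-1) + s(x_i, y_j) \\
F(i-1,\,j) + d \\
F(i,\,j-1) + d
\end{cases}
\]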
A simple example. Find the best local alignment of AAG (along the top edge) and AGC (along the left edge), with gap penalty d = -5 and this substitution matrix:

        A    C    G    T
   A    2   -7   -5   -7
   C   -7    2   -7   -5
   G   -5   -7    2   -7
   T   -7   -5   -7    2

Each cell F(i, j) takes the best of 0 and the three moves: F(i-1, j-1) + s(x_i, y_j), F(i, j-1) + d, and F(i-1, j) + d.

Initialize the corner the same way as for global alignment, F(0, 0) = 0. The other cells of the first row and column are also 0 here, because a negative score is replaced with 0.

First column, A against A: the three moves give 2 (diagonal), -5 and -5 (gaps), so the score is 2.

First column, A against G: the three moves give -5, -5 and -3. All are negative, so the score is 0 (signify no preceding alignment with no arrow). The cell below it, A against C, is 0 for the same reason.

Second column, A against A: the diagonal move gives 0 + 2 = 2, so the score is 2. The rest of the column is 0.

Third column, G against G: the diagonal move gives 2 + 2 = 4. The other two cells are 0.

Completed matrix:

             A    A    G
        0    0    0    0
   A    0    2    2    0
   G    0    0    0    4
   C    0    0    0    0
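The fill step of this example can be sketched in a few lines of Python. This is a minimal sketch, not the course's reference code: the function name `local_dp_matrix` and the nested-dictionary layout of the substitution matrix are illustrative choices, while the scores and d = -5 are taken from the slide.

```python
# Substitution scores from the slide: match 2, A/G and C/T -5, other mismatches -7.
S = {
    "A": {"A": 2, "C": -7, "G": -5, "T": -7},
    "C": {"A": -7, "C": 2, "G": -7, "T": -5},
    "G": {"A": -5, "C": -7, "G": 2, "T": -7},
    "T": {"A": -7, "C": -5, "G": -7, "T": 2},
}
D = -5  # linear gap penalty d


def local_dp_matrix(top, left, s=S, d=D):
    """Fill the local-alignment DP matrix (hypothetical helper).

    Rows follow `left` (sequence on the left edge), columns follow `top`.
    Row 0 and column 0 are the all-zero border.
    """
    F = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]
    for r in range(1, len(left) + 1):
        for c in range(1, len(top) + 1):
            F[r][c] = max(
                0,                                             # start a new alignment
                F[r - 1][c - 1] + s[top[c - 1]][left[r - 1]],  # diagonal move
                F[r][c - 1] + d,                               # horizontal move
                F[r - 1][c] + d,                               # vertical move
            )
    return F


matrix = local_dp_matrix("AAG", "AGC")
for row in matrix:
    print(row)
```

The only change from a global fill is the extra 0 inside `max` and the all-zero border; everything else is the same recurrence.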
What's different about the DP matrix? [Figure: a global alignment DP matrix next to a local alignment DP matrix.]
A simple example (AAG against AGC, d = -5; the highest score in the filled matrix is 4, in the G/G cell). But how do we trace back?
Traceback. Start the traceback at the highest score anywhere in the matrix and follow the arrows back until you reach 0. In the simple example the traceback starts at the 4 in the G/G cell, moves diagonally to the 2 in the A/A cell, and stops at the 0 before it, giving the local alignment:

   AG
   AG
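The traceback rule can be sketched as follows. This is an illustrative implementation under stated assumptions: instead of storing arrows, it re-derives each move by checking which predecessor produced the cell's score; the names `smith_waterman`, `s`, and `TRANSITIONS` are hypothetical, and the scores are the ones used in the simple example (match 2, A/G and C/T -5, other mismatches -7, d = -5).

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def s(a, b):
    """Substitution score: 2 for a match, -5 for A/G or C/T, -7 otherwise."""
    if a == b:
        return 2
    return -5 if frozenset((a, b)) in TRANSITIONS else -7


def smith_waterman(top, left, d=-5):
    """Return (best score, aligned part of `top`, aligned part of `left`)."""
    rows, cols = len(left) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for r in range(1, rows):
        for c in range(1, cols):
            F[r][c] = max(
                0,
                F[r - 1][c - 1] + s(top[c - 1], left[r - 1]),
                F[r][c - 1] + d,
                F[r - 1][c] + d,
            )
            if F[r][c] > best:  # remember the highest score anywhere in the matrix
                best, best_pos = F[r][c], (r, c)

    # Trace back from the highest score until a 0 is reached.
    r, c = best_pos
    out_top, out_left = [], []
    while F[r][c] > 0:
        if F[r][c] == F[r - 1][c - 1] + s(top[c - 1], left[r - 1]):
            out_top.append(top[c - 1])   # diagonal: align two characters
            out_left.append(left[r - 1])
            r, c = r - 1, c - 1
        elif F[r][c] == F[r][c - 1] + d:
            out_top.append(top[c - 1])   # horizontal: gap in the left-edge sequence
            out_left.append("-")
            c -= 1
        else:
            out_top.append("-")          # vertical: gap in the top-edge sequence
            out_left.append(left[r - 1])
            r -= 1
    return best, "".join(reversed(out_top)), "".join(reversed(out_left))


print(smith_waterman("AAG", "AGC"))  # (4, 'AG', 'AG')
```

When several cells tie for the highest score, this sketch keeps the first one found; which of several equally good moves to follow during traceback is likewise an arbitrary choice here.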
Multiple local alignments. Trace back from the highest score, marking each DP matrix score along the traceback. Then trace back from the highest remaining score, and so on. The alignments may or may not include the same parts of the two sequences. [Figure: two tracebacks in one DP matrix, labeled 1 and 2.]
Local alignment. Two differences from global alignment: If a DP score is negative, replace it with 0. Trace back from the highest score in the matrix and continue until you reach 0. Global alignment algorithm: Needleman-Wunsch. Local alignment algorithm: Smith-Waterman.
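Another example. Find the optimal local alignment of AAG and GAAGGC. Use a gap penalty of d = -5 and the same substitution matrix as in the simple example (match 2; A/G and C/T -5; other mismatches -7).

             A    A    G
        0    0    0    0
   G    0    0    0    2
   A    0    2    2    0
   A    0    2    4    0
   G    0    0    0    6
   G    0    0    0    2
   C    0    0    0    0

The highest score is 6, and the traceback from it aligns AAG with the AAG inside GAAGGC.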
Compare with the best GLOBAL alignment (contrast with the best local alignment). Find the optimal global alignment of AAG and GAAGGC. Use a gap penalty of d = -5 and the same substitution matrix (match 2; A/G and C/T -5; other mismatches -7). The first row and column are initialized with multiples of d; the remaining cells are left to fill in:

             A    A    G
        0   -5  -10  -15
   G   -5
   A  -10
   A  -15
   G  -20
   G  -25
   C  -30
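For comparison, here is a minimal sketch of the global (Needleman-Wunsch) version on the same pair of sequences. The function names are hypothetical and the traceback re-derives moves rather than storing arrows; the scores are the ones from the slide. Note that the border is initialized with multiples of d, there is no 0 in the `max`, and the traceback runs from the lower right corner to the upper left corner.

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def s(a, b):
    """Substitution score: 2 for a match, -5 for A/G or C/T, -7 otherwise."""
    if a == b:
        return 2
    return -5 if frozenset((a, b)) in TRANSITIONS else -7


def needleman_wunsch(top, left, d=-5):
    """Return (score, aligned `top`, aligned `left`) for one optimal global alignment."""
    rows, cols = len(left) + 1, len(top) + 1
    F = [[0] * cols for _ in range(rows)]
    for c in range(1, cols):
        F[0][c] = c * d  # first row: gaps only
    for r in range(1, rows):
        F[r][0] = r * d  # first column: gaps only
    for r in range(1, rows):
        for c in range(1, cols):
            F[r][c] = max(
                F[r - 1][c - 1] + s(top[c - 1], left[r - 1]),
                F[r][c - 1] + d,
                F[r - 1][c] + d,
            )

    # Trace back from the lower right corner to the upper left corner.
    r, c = rows - 1, cols - 1
    out_top, out_left = [], []
    while r > 0 or c > 0:
        if r > 0 and c > 0 and F[r][c] == F[r - 1][c - 1] + s(top[c - 1], left[r - 1]):
            out_top.append(top[c - 1])
            out_left.append(left[r - 1])
            r, c = r - 1, c - 1
        elif c > 0 and F[r][c] == F[r][c - 1] + d:
            out_top.append(top[c - 1])
            out_left.append("-")
            c -= 1
        else:
            out_top.append("-")
            out_left.append(left[r - 1])
            r -= 1
    return F[rows - 1][cols - 1], "".join(reversed(out_top)), "".join(reversed(out_left))


score, aligned_top, aligned_left = needleman_wunsch("AAG", "GAAGGC")
print(score)  # -9
print(aligned_top)
print(aligned_left)
```

The global score is -9: three gaps are unavoidable (3 × -5) and at most three matches are possible (3 × 2). More than one alignment reaches that score, so the one printed depends on the tie-breaking order in the traceback. The best local alignment of the same sequences scores 6.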
Summary. Global alignment algorithm: Needleman-Wunsch. Local alignment algorithm: Smith-Waterman.
Using sequence alignment to study evolution
Are these proteins related?

   SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
              L P      L   Y N    Y C     L
   SEQ 2: QFFPLMPPAPYFILATDYENLPLVYSCTTFFWLF
   score = -1. The intuitive answer: NO?

   SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
              L P    W LDATYKN A  Y C     L
   SEQ 2: QFFPLMPPAPYWILDATYKNLALVYSCTTFFWLF
   score = 15. The intuitive answer: PROBABLY?

   SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
          RVV L PS   W LDATYKNYA  Y CDVTYKL
   SEQ 2: RVVPLMPSAPYWILDATYKNYALVYSCDVTYKLF
   score = 37. The intuitive answer: YES?
Significance of scores. [Figure: the sequences HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT and LENENQGKCTIAEYKYDGKKASVYNSFVSNGVKE go into an alignment algorithm, which returns a score of 45.] Low score = unrelated. High score = related. But how high is high enough? The answer is subjective, problem specific, and parameter specific.
The null hypothesis. We want to know how surprising a given score is, assuming that the two sequences are not related. This assumption is called the null hypothesis. The purpose of most statistical tests is to determine whether the observed result provides a reason to reject the null hypothesis. We want to characterize the distribution of scores from pairwise sequence alignments.
Sequence similarity score distribution. [Figure: histogram with x-axis "Sequence comparison score" and y-axis "Frequency".] Search a database of unrelated sequences using a given query sequence. What will be the form of the resulting distribution of pairwise alignment scores?
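One way to explore this question is to simulate it. The sketch below is only an illustration under loud assumptions: the "database of unrelated sequences" is stood in for by uniformly random DNA strings, the query `GAAGGCATTA` and the sizes (200 targets of length 30) are made up, and the scoring is the nucleotide matrix and d = -5 from the earlier examples. It computes the best local alignment score of the query against each target and prints a text histogram of the scores.

```python
import random
from collections import Counter

TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def s(a, b):
    """Substitution score: 2 for a match, -5 for A/G or C/T, -7 otherwise."""
    if a == b:
        return 2
    return -5 if frozenset((a, b)) in TRANSITIONS else -7


def local_score(x, y, d=-5):
    """Best local alignment score of x and y, keeping only two DP rows."""
    prev = [0] * (len(x) + 1)
    best = 0
    for b in y:
        cur = [0]
        for c, a in enumerate(x, start=1):
            cell = max(0, prev[c - 1] + s(a, b), cur[c - 1] + d, prev[c] + d)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


rng = random.Random(0)   # fixed seed so the run is repeatable
query = "GAAGGCATTA"     # hypothetical query sequence
# Assumption: random sequences play the role of unrelated database entries.
database = ["".join(rng.choice("ACGT") for _ in range(30)) for _ in range(200)]

scores = [local_score(query, target) for target in database]
histogram = Counter(scores)
for value in sorted(histogram):
    print(f"{value:3d} {'#' * histogram[value]}")
```

Real unrelated sequences are not uniformly random, so shuffling actual database entries or the query itself would be a more faithful null model than this one.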