Alignment Strategies for Large Scale Genome Alignments

CSHL Computational Genomics, 9 November 2003

Algorithms for Biological Sequence Comparison

algorithm         value calculated          scoring matrix  gap penalty            time required                 reference
Needleman-Wunsch  global similarity         arbitrary       penalty/gap, q         O(n^2)                        Needleman and Wunsch, 1970
Sellers           (global) distance         unity           penalty/residue, r k   O(n^2)                        Sellers, 1974
Smith-Waterman    local similarity          S_ij < 0.0      affine, q + r k        O(n^2), optimal               Smith and Waterman, 1981; Gotoh, 1982
SRCHN             approx. local similarity  S_ij < 0.0      penalty/gap            O(n)-O(n^2), lookup-diagonal  Wilbur and Lipman, 1983
FASTA             approx. local similarity  S_ij < 0.0      limited size, q + r k  O(n^2)/K, lookup-rescan       Lipman and Pearson, 1985; Pearson and Lipman, 1988
BLASTP            maximum segment score     S_ij < 0.0      multiple segments      O(n^2)/K, DFA-extend          Altschul et al., 1990
BLAST2.0          approx. local similarity  S_ij < 0.0      q + r k                O(n^2)/K, lookup-extend       Altschul et al., 1997

The sequence alignment problem:

[Figure: three candidate alignments of PMILGYWNVRGL with PPYTIVYFPVRG, two ungapped and one gapped (PM-ILGYWNVRGL / PPYTIV-YFPVRG), marked first with identities (:) only and then with identities plus conservative replacements (.)]

[Figure: dot matrix of PMILGYWNVRGL (columns) against PPYTIVYFPVRG (rows); X marks an identity, x a similar residue pair]

Global:
    -PMILGYWNVRGL
     :..  :. :::
    PPYTIVYFPVRG-

Local:
    AAAAAAAPMILGYWNVRGLBBBBB
           :..  :. :::
    XXXXXXPPYTIVYFPVRGYYYYYY

Algorithms for Biological Sequence Comparison

                                               Global   Local   Distance
HBHU vs
  HBHU     Hemoglobin beta-chain - human          725     725          0
  HAHU     Hemoglobin alpha-chain - human         314     322        152
  MYHU     Myoglobin - Human                      121     166        212
  GPYL     Leghemoglobin - Yellow lupin             8      43        239
  LZCH     Lysozyme precursor - Chicken          -107      32        220
  NRBO     Pancreatic ribonuclease - Bovine      -124      31        280
  CCHU     Cytochrome c - Human                  -160      26        321
MCHU vs
  MCHU     Calmodulin - Human                     671     671          0
  TPHUCS   Troponin C, skeletal muscle            395     438        161
  PVPK2    Parvalbumin beta - Pike                -57     115        313
  CIHUH    Calpain heavy chain - Human          -2085     100       2463
  AQJFNV   Aequorin precursor - Jelly fish        -65      76        391
  KLSWM    Calcium binding protein - Scallop      -89      52        323
QRHULD vs
  EGMSMG   EGF precursor                         -591     655       2549

Genomic Alignments

nw/sw/lalign            dynamic programming           O(n^2)
searchn                 lookup on diagonals           O(n)-O(n^2)
fasta                   lookup on diagonals/rescan    O(n^2)/K
blast                   DFA, extend                   O(n^2)/K
blastz                  lookup/extend
ssaha, blat, waba       lookup
mummer, avid            Suffix tree alignment
dialign, glass, lagan

Global and Local Alignment Paths

Global
         A    B    D    D    E    F    G    H    I
    A    1   -1   -1   -1   -1   -1   -1   -1   -1
    B   -1    2    0   -2   -2   -2   -2   -2   -2
    D   -1    0    3    1   -1   -3   -3   -3   -3
    E   -1   -2    1    2    2    0   -2   -4   -4
    G   -1   -2   -1    0    1    1    1   -1   -3
    K   -1   -2   -3   -2   -1    0    0    0   -2
    H   -1   -2   -3   -4   -3   -2   -1    1   -1
    I   -1   -2   -3   -4   -5   -4   -3   -1    2

Optimum global alignment (score: 2):
    A B D D E F G H I   (top)
    A B D - E G K H I   (side)
 or A B - D E G K H I

Local
         A    B    D    D    E    F    G    H    I
    A    1    0    0    0    0    0    0    0    0
    B    0    2    0    0    0    0    0    0    0
    D    0    0    3    1    0    0    0    0    0
    E    0    0    1    2    2    0    0    0    0
    G    0    0    0    0    1    1    1    0    0
    K    0    0    0    0    0    0    0    0    0
    H    0    0    0    0    0    0    0    1    0
    I    0    0    0    0    0    0    0    0    2

Optimal local alignment (score 3):
    A B D   (top)
    A B D   (side)
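The values in both matrices are consistent with a score of +1 for a match, -1 for a mismatch, -2 per gap residue, and a first row and column of zeros. Those parameters are inferred from the matrices rather than stated on the slide, so the following is only a minimal sketch under that assumption; the function name `fill_matrix` is hypothetical.

```python
def fill_matrix(top, side, local=False, match=1, mismatch=-1, gap=-2):
    """Fill a similarity matrix with `side` on the rows and `top` on the columns.

    Assumed scoring (inferred from the slide's numbers): +1 match, -1 mismatch,
    -2 per gap residue, boundary row and column fixed at 0.
    With local=True every cell is floored at 0 (Smith-Waterman style).
    """
    rows, cols = len(side), len(top)
    H = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s = match if side[i - 1] == top[j - 1] else mismatch
            best = max(H[i - 1][j - 1] + s,   # diagonal: align two residues
                       H[i - 1][j] + gap,     # vertical: gap in top
                       H[i][j - 1] + gap)     # horizontal: gap in side
            H[i][j] = max(0, best) if local else best
    return H


if __name__ == "__main__":
    top, side = "ABDDEFGHI", "ABDEGKHI"
    for label, local in (("Global", False), ("Local", True)):
        print(label)
        for residue, row in zip(side, fill_matrix(top, side, local)[1:]):
            print(residue, " ".join(f"{v:3d}" for v in row[1:]))
```

In this sketch the global score is read from the bottom-right cell, while the local score is the largest value anywhere in the matrix.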

Algorithms for Global and Local Similarity Scores

Global: [equation]
Local: [equation]

Smith-Waterman Space, Time Requirements

score:      space: O(n); time: O(n^2)
alignment:  space: O(n); time: O(n^2)
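The O(n) space bound for the score follows from the fact that each row of the matrix depends only on the row before it. A minimal sketch, assuming a simple +1/-1 match/mismatch score and a linear gap penalty of -2 per residue (illustrative values, not taken from the slide); `local_score` is a hypothetical name. Recovering the alignment itself in O(n) space needs a divide-and-conquer scheme that is not shown here.

```python
def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local similarity score, keeping only two rows in memory."""
    prev = [0] * (len(b) + 1)       # row i-1 of the matrix
    best = 0
    for x in a:
        curr = [0]                  # column 0 is always 0 for local alignment
        for j, y in enumerate(b, start=1):
            s = match if x == y else mismatch
            curr.append(max(0,
                            prev[j - 1] + s,     # diagonal
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        best = max(best, max(curr))
        prev = curr                 # discard the older row
    return best


if __name__ == "__main__":
    print(local_score("ABDDEFGHI", "ABDEGKHI"))
```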

FAST alignment by lookup

              1       9
    GT8.7     KITQSNATQ
               .::. :::
    XURT8C    LLTQTRATQ
              1       9

1. Scan query, build 2 tables:

    n-1 entries:
        1 -1    2 -1    3 -1    4 -1    5 -1    6 -1    7 -1    8 3

    400 entries:
        AT 7    IT 2    KI 1    LL -1   LT -1   NA 6    QS 4    QT -1   SN 5    TR -1   TQ 8

[Figure: the words LL LT TQ QT TR RA AT TQ of LLTQTRATQ looked up against KITQSNATQ, with matches plotted in a dot matrix and tallied on diagonals -8 to 8]

O(n) space, O(n+m) time (if few repeat hits)

BLASTZ in a nutshell

A gap of length k is penalized by subtracting 400 + 30k from the score.
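The two-table lookup on the FAST alignment slide can be sketched as follows. This is an illustration only: it uses Python dictionaries where the slide shows fixed-size arrays (400 entries for all dipeptides, n-1 entries for the positions), and the names `build_tables` and `diagonal_hits` are hypothetical.

```python
from collections import Counter


def build_tables(query, k=2):
    """Return (last, link) for a query, using 1-based word start positions.

    last[word] is the last position where `word` starts in the query.
    link[pos]  is the previous position of the same word, or -1 if none.
    """
    last, link = {}, {}
    for pos in range(1, len(query) - k + 2):
        word = query[pos - 1:pos - 1 + k]
        link[pos] = last.get(word, -1)
        last[word] = pos
    return last, link


def diagonal_hits(query, library, k=2):
    """Count word hits per diagonal (library position minus query position)."""
    last, link = build_tables(query, k)
    hits = Counter()
    for lpos in range(1, len(library) - k + 2):
        qpos = last.get(library[lpos - 1:lpos - 1 + k], -1)
        while qpos != -1:           # follow the chain of repeated words
            hits[lpos - qpos] += 1
            qpos = link[qpos]
    return hits


if __name__ == "__main__":
    last, link = build_tables("KITQSNATQ")
    print(last["TQ"], link[8])      # TQ starts at 8, and before that at 3
    print(sorted(diagonal_hits("KITQSNATQ", "LLTQTRATQ").items()))
```

Each library word costs one table lookup plus one step per repeated occurrence in the query, which is where the "if few repeat hits" condition on the time bound comes from.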

Other improvements

Formerly, BLASTZ looked for identical runs of eight consecutive nucleotides in each sequence. Ma et al. (2002) propose looking for runs of 19 consecutive nucleotides in each sequence, within which the 12 positions indicated by a 1 in the string 1110100110010101111 are identical. To increase sensitivity, we allow a transition (A-G, G-A, C-T or T-C) in any one of the 12 positions.

If the separation between two adjacent alignments is <50 kb in both sequences, then BLASTZ recursively searches the intervening regions for 7-mer exact matches and requires a threshold of 2200 for initiating dynamic programming (without the adjustment for sequence complexity). If the separation is <10 kb, the threshold is lowered to 2000. In either case, the higher-sensitivity matches are required to occur with an order and orientation consistent with the stronger flanking matches.

blastz local alignments
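The spaced-seed test described under "Other improvements" can be sketched as a direct comparison of two 19-base windows. This is a naive illustration of the matching rule only, not how BLASTZ indexes seeds internally; the function name `seed_hit` is hypothetical.

```python
SEED = "1110100110010101111"          # 12 of 19 positions must agree
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_hit(a, b, seed=SEED, max_transitions=1):
    """True if windows a and b match at every '1' position of the seed,
    allowing at most `max_transitions` transition substitutions there."""
    if len(a) != len(seed) or len(b) != len(seed):
        raise ValueError("windows must be as long as the seed")
    transitions = 0
    for x, y, s in zip(a, b, seed):
        if s != "1" or x == y:
            continue                  # don't-care position, or identical
        if (x, y) not in TRANSITIONS:
            return False              # transversion at a required position
        transitions += 1
        if transitions > max_transitions:
            return False
    return True


if __name__ == "__main__":
    window = "ACGT" * 4 + "ACG"
    print(seed_hit(window, window))
    print(seed_hit(window, "G" + window[1:]))   # one transition at position 0
    print(seed_hit(window, "C" + window[1:]))   # transversion at position 0
```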

PipMaker

PipMaker: GSTM1 vs Cluster

Suffix trees - Mummer, AVID

Figure 2. Finding maximal matches using a suffix tree. The suffixes of the word at the root are represented by the characters along the paths from the root to the leaves. Branchings in the tree correspond to locations where different suffixes share the same prefix, and are therefore matches. Every internal node in the tree is therefore a match (with the matching sequence given by the characters along the path from the root). Maximal matches can be detected efficiently by considering some additional criteria.

Figure 3. Selecting anchors from the set of matches. Every maximal match is shown in blue. A set of good anchors is shown in red.
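To make "maximal match" concrete, the sketch below reports every exact match between two sequences that cannot be extended to the left or to the right. It deliberately swaps the suffix tree for a simple quadratic-time table of common-suffix lengths, so it shows what is being computed rather than how Mummer or AVID compute it; `maximal_matches` and `min_len` are hypothetical names.

```python
def maximal_matches(a, b, min_len=3):
    """Return (start_a, start_b, length) for each maximal exact match.

    curr[j] holds the length of the longest common suffix of a[:i] and b[:j],
    so a run is left-maximal by construction; it is right-maximal when the
    next characters differ or either sequence ends.
    """
    matches = []
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] != b[j - 1]:
                continue
            curr[j] = prev[j - 1] + 1
            at_end = i == len(a) or j == len(b)
            if (at_end or a[i] != b[j]) and curr[j] >= min_len:
                matches.append((i - curr[j], j - curr[j], curr[j]))
        prev = curr
    return matches


if __name__ == "__main__":
    for start_a, start_b, length in maximal_matches("ACGTTGCA", "TACGTAGCA"):
        print(start_a, start_b, "ACGTTGCA"[start_a:start_a + length])
```

A suffix tree finds the same matches without filling a table whose size grows with the product of the sequence lengths, which is why it is the structure of choice at genome scale.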

Table 1. Coverage Results for the Different Programs on Human Sequence Alignments With Cat, Chicken, Cow, Dog, Pig, and Rat

                         Coverage (bp)    RefSeq      UTR    Time (s)
Cat
  AVID                        1200310     21966    13742         78
  BLASTZ                      1270696     22383    13632        101
  BLASTZ (chaining)           1170350     21993    13632         52
  CHAOS                        360436     19022     7604        128
  GLASS                       1179833     21942    13426       9215
Chicken
  AVID                          77114     17886     1742         35
  BLASTZ                        56370     19204     1357         11
  BLASTZ (chaining)             56370     19204     1357         11
  CHAOS                          5635      3060       70         31
  GLASS                         56445     17437      984       2866
Chimp
  AVID                         852611         0        0         36
  BLASTZ                       881206         0        0      11330
  BLASTZ (chaining)            854023         0        0         45
  CHAOS                        680491         0        0       4471
  GLASS                             *         *        *          *
Cow
  AVID                        1195913     26359    10777         89
  BLASTZ                      1288895     27911    13505         79
  BLASTZ (chaining)           1150323     25814    10434         57
  CHAOS                        317660     22551     6633        198
  GLASS                       1131693     25187    12860      10115

Coverage of the human genome using the mouse genome is described in O. Couronne (2003). An asterisk indicates that the program was not able to successfully align the sequences. A minus sign indicates that the program crashed on one sequence pair. Multiple minus signs are used for multiple crashes. RefSeq annotations are based on the human December 2001 hg10 freeze.

LAGAN

    "anchor" local alignments (CHAOS)
    rough global map ("band")
    recursive anchoring
    translated anchoring

Genome Research (2003) 13:721
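Building a rough global map from local alignments amounts to picking a highest-scoring chain of anchors that appear in the same order in both sequences. The sketch below uses a plain quadratic dynamic program for that step; it is a simplified illustration and not LAGAN's own implementation. The anchor tuple layout `(start_a, start_b, length, score)` and the name `best_chain` are assumptions made for the example.

```python
def best_chain(anchors):
    """Return the highest-scoring list of non-overlapping, collinear anchors.

    Each anchor is (start_a, start_b, length, score). Anchor j may precede
    anchor i only if j ends at or before the start of i in both sequences.
    """
    if not anchors:
        return []
    order = sorted(range(len(anchors)), key=lambda i: anchors[i][:2])
    best = {}   # best chain score ending at each anchor
    prev = {}   # predecessor of each anchor in that chain
    for rank, i in enumerate(order):
        a_i, b_i, _, score_i = anchors[i]
        best[i], prev[i] = score_i, None
        for j in order[:rank]:
            a_j, b_j, len_j, _ = anchors[j]
            if a_j + len_j <= a_i and b_j + len_j <= b_i:
                if best[j] + score_i > best[i]:
                    best[i], prev[i] = best[j] + score_i, j
    end = max(best, key=best.get)
    chain = []
    while end is not None:          # walk back from the best end point
        chain.append(anchors[end])
        end = prev[end]
    return chain[::-1]


if __name__ == "__main__":
    anchors = [(0, 0, 10, 10), (15, 40, 5, 5), (12, 12, 10, 10), (30, 30, 6, 6)]
    for anchor in best_chain(anchors):
        print(anchor)
```

In the example the off-diagonal anchor at (15, 40) is dropped because keeping it would rule out the later anchor at (30, 30).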

parameters:
    k: word length
    c: degeneracy
    t: score threshold

MLAGAN

    Tree-guided, progressive with anchors
    affine gaps, open/continue/end penalties
    iterative refinement with anchors

Figure 2. Visualization of a multiple alignment using VISTA.

(A) MLAGAN alignments can be visualized using VISTA, if they are projected to pairwise alignments with respect to one reference sequence. This plot shows the conservation between human and chimpanzee, cow, mouse, and fugu around the first intron of the cmet gene. The human/chimpanzee conservation is uniformly very high; human/cow and human/mouse show varying levels of conservation. The human/chicken alignment also shows some conservation in the non-coding areas. The human/fugu alignment shows conservation only within the first coding exon, and to a lesser degree within the regions upstream and downstream of that exon.

(B) First introns of cmet, comparison of CLUSTALW and MLAGAN alignments. We compared the alignments generated by MLAGAN and CLUSTALW for the first intron of the cmet gene in eight mammalian sequences (human, baboon, cat, dog, cow, pig, mouse, and rat). The alignments between all of the species except rodents were similar. VISTA plots of the projections to human and mouse are shown. CLUSTALW (top) misaligned the mouse sequence around 4 kb and 10 kb, whereas MLAGAN (bottom) found significant conservation in these regions.

Genome Alignment Strategies

    Alignment based or Lookup based? (BLASTZ vs BLAT)
    Identities vs Similarities (EST:genome vs Human:Cat:Mouse)
    Scoring Matrix?
    Statistical Estimates (BLASTN)
    Memory vs Speed
    Iterative alignment?