Alignment Strategies for Large Scale Genome Alignments

CSHL Computational Genomics, 9 November 2003

Algorithms for Biological Sequence Comparison

algorithm         value calculated          scoring matrix  gap penalty            time required                 reference
Needleman-Wunsch  global similarity         arbitrary       penalty/gap, q         O(n^2)                        Needleman and Wunsch, 1970
Sellers           (global) distance         unity           penalty/residue, r k   O(n^2)                        Sellers, 1974
Smith-Waterman    local similarity          S_ij < 0.0      affine, q + r k        O(n^2), optimal               Smith and Waterman, 1981; Gotoh, 1982
SRCHN             approx. local similarity  S_ij < 0.0      penalty/gap            O(n)-O(n^2), lookup-diagonal  Wilbur and Lipman, 1983
FASTA             approx. local similarity  S_ij < 0.0      limited size, q + r k  O(n^2)/K, lookup-rescan       Lipman and Pearson, 1985; Pearson and Lipman, 1988
BLASTP            maximum segment score     S_ij < 0.0      multiple segments      O(n^2)/K, DFA-extend          Altschul et al., 1990
BLAST2.0          approx. local similarity  S_ij < 0.0      q + r k                O(n^2)/K, lookup-extend       Altschul et al., 1997

The sequence alignment problem:

[Figure: three candidate alignments of PMILGYWNVRGL with PPYTIVYFPVRG, two ungapped and one gapped (PM-ILGYWNVRGL over PPYTIV-YFPVRG), shown first with identities marked (:) and then with identities and similarities marked (: and .). Column offsets and mark positions are not recoverable.]

[Figure: dot matrix of PMILGYWNVRGL (horizontal) against PPYTIVYFPVRG (vertical); X marks identical residues, x marks similar residues.]

Global:
  -PMILGYWNVRGL
  PPYTIVYFPVRG-

Local:
  AAAAAAAPMILGYWNVRGLBBBBB
  XXXXXXPPYTIVYFPVRGYYYYYY

Algorithms for Biological Sequence Comparison

Query / library entry                         Global   Local   Distance
HBHU vs
  HBHU    Hemoglobin beta-chain - human          725     725          0
  HAHU    Hemoglobin alpha-chain - human         314     322        152
  MYHU    Myoglobin - Human                      121     166        212
  GPYL    Leghemoglobin - Yellow lupin             8      43        239
  LZCH    Lysozyme precursor - Chicken          -107      32        220
  NRBO    Pancreatic ribonuclease - Bovine      -124      31        280
  CCHU    Cytochrome c - Human                  -160      26        321
MCHU vs
  MCHU    Calmodulin - Human                     671     671          0
  TPHUCS  Troponin C, skeletal muscle            395     438        161
  PVPK2   Parvalbumin beta - Pike                -57     115        313
  CIHUH   Calpain heavy chain - Human          -2085     100       2463
  AQJFNV  Aequorin precursor - Jelly fish        -65      76        391
  KLSWM   Calcium binding protein - Scallop      -89      52        323
QRHULD vs
  EGMSMG  EGF precursor                         -591     655       2549

Genomic Alignments

nw/sw/lalign - dynamic programming - O(n^2)
searchn - lookup on diagonals - O(n)-O(n^2)
fasta - lookup on diagonals/rescan - O(n^2)/K
blast - DFA, extend - O(n^2)/K
blastz - lookup/extend
ssaha, blat, waba - lookup
mummer, avid - suffix tree alignment
dialign, glass, lagan

Global and Local Alignment Paths

Global (traceback arrows omitted):

        A    B    D    D    E    F    G    H    I
   A    1   -1   -1   -1   -1   -1   -1   -1   -1
   B   -1    2    0   -2   -2   -2   -2   -2   -2
   D   -1    0    3    1   -1   -3   -3   -3   -3
   E   -1   -2    1    2    2    0   -2   -4   -4
   G   -1   -2   -1    0    1    1    1   -1   -3
   K   -1   -2   -3   -2   -1    0    0    0   -2
   H   -1   -2   -3   -4   -3   -2   -1    1   -1
   I   -1   -2   -3   -4   -5   -4   -3   -1    2

Optimum global alignment (score: 2):

   A B D D E F G H I   (top)
   A B D - E G K H I   (side)
or
   A B - D E G K H I

Local (traceback arrows omitted):

        A    B    D    D    E    F    G    H    I
   A    1    0    0    0    0    0    0    0    0
   B    0    2    0    0    0    0    0    0    0
   D    0    0    3    1    0    0    0    0    0
   E    0    0    1    2    2    0    0    0    0
   G    0    0    0    0    1    1    1    0    0
   K    0    0    0    0    0    0    0    0    0
   H    0    0    0    0    0    0    0    1    0
   I    0    0    0    0    0    0    0    0    2

Optimal local alignment (score: 3):

   A B D   (top)
   A B D   (side)
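The two matrices above can be reproduced with a short dynamic-programming sketch. The scoring values used here (match +1, mismatch -1, gap -2) and the zero-valued boundary row and column are inferred from the numbers in the matrices, not stated on the slide, and the function name `fill_matrix` is made up for illustration.

```python
def fill_matrix(top, side, match=1, mismatch=-1, gap=-2, local=False):
    """Fill a similarity matrix; rows follow `side`, columns follow `top`.

    Scoring values are assumptions inferred from the slide's matrices.
    """
    rows, cols = len(side), len(top)
    # Boundary row and column are zero, which is what the slide's
    # global matrix implies (its first row and column hold only 1 or -1).
    H = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s = match if side[i - 1] == top[j - 1] else mismatch
            best = max(H[i - 1][j - 1] + s,   # align the two residues
                       H[i - 1][j] + gap,     # gap in the top sequence
                       H[i][j - 1] + gap)     # gap in the side sequence
            # Local (Smith-Waterman style) scores are never allowed below zero.
            H[i][j] = max(best, 0) if local else best
    # Drop the boundary row and column so the result matches the printed matrix.
    return [row[1:] for row in H[1:]]


global_matrix = fill_matrix("ABDDEFGHI", "ABDEGKHI")
local_matrix = fill_matrix("ABDDEFGHI", "ABDEGKHI", local=True)
print(global_matrix[-1][-1])                 # global score, bottom-right cell
print(max(max(r) for r in local_matrix))     # local score, largest cell
```

The global score is read from the bottom-right cell, while the local score is the largest value anywhere in the matrix.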

Algorithms for Global and Local Similarity Scores

Global: [recurrence not preserved]
Local (Smith-Waterman): [recurrence not preserved]

Space, Time Requirements

score: space O(n); time O(n^2)
alignment: space O(n); time O(n^2)
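The O(n) space bound for the score follows from the fact that each row of the matrix depends only on the row before it. The sketch below keeps two rows at a time; it assumes a linear gap penalty and the illustrative scores match +1, mismatch -1, gap -2, and unlike the matrix on the earlier slide it charges for leading gaps in the global case (the textbook Needleman-Wunsch boundary). The name `similarity_score` is hypothetical.

```python
def similarity_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the global or local similarity score using O(len(b)) memory."""
    # Row 0: zeros for local alignment, accumulated gap penalties for global.
    prev = [0] * (len(b) + 1) if local else [gap * j for j in range(len(b) + 1)]
    best = 0  # best local score seen so far
    for i, x in enumerate(a, start=1):
        cur = [0 if local else gap * i]
        for j, y in enumerate(b, start=1):
            s = match if x == y else mismatch
            v = max(prev[j - 1] + s,  # diagonal: align x with y
                    prev[j] + gap,    # vertical: gap in b
                    cur[j - 1] + gap) # horizontal: gap in a
            if local:
                v = max(v, 0)
                best = max(best, v)
            cur.append(v)
        prev = cur  # the older row is no longer needed
    return best if local else prev[-1]


print(similarity_score("ABDDEFGHI", "ABDEGKHI"))              # global
print(similarity_score("ABDDEFGHI", "ABDEGKHI", local=True))  # local
```

Recovering the alignment itself in O(n) space needs more than this, typically a divide-and-conquer scheme in the style of Hirschberg or Myers and Miller, which is not shown here.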

FAST alignment by lookup

          1       9
GT8.7     KITQSNATQ
           .::. :::
XURT8C    LLTQTRATQ
          1       9

1. Scan query, build 2 tables:

Table 1, position links (n-1 entries):
  1: -1   2: -1   3: -1   4: -1   5: -1   6: -1   7: -1   8: 3

Table 2, word lookup (400 entries):
  AT 7   IT 2   KI 1   LL -1   LT -1   NA 6   QS 4   QT -1   SN 5   TR -1   TQ 8

Library words scanned: LL LT TQ QT TR RA AT TQ

[Figure: dot matrix of KITQSNATQ against LLTQTRATQ with diagonal offsets labelled 8 7 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8; the values 2, 2, 4, 2, 5 appear beneath the diagonal axis.]

O(n) space
O(n+m) time (if few repeat hits)

BLASTZ in a nutshell

A gap of length k is penalized by subtracting 400 + 30 k from the score.
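The lookup step can be sketched as follows, using the two peptides from the slide and a word length of 2. A dictionary of word positions stands in for the slide's two fixed-size tables (the 400-entry word table plus the position links), which is a simplification; the function name `diagonal_hits` is made up for this example.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, library, ktup=2):
    """Count shared ktup-words on each diagonal (query position - library position)."""
    # Scan the query once and remember every position of every word.
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)

    # Scan the library once; every lookup hit falls on one diagonal.
    hits = Counter()
    for j in range(len(library) - ktup + 1):
        for i in table.get(library[j:j + ktup], ()):
            hits[i - j] += 1
    return hits


hits = diagonal_hits("KITQSNATQ", "LLTQTRATQ")
print(hits.most_common())
```

Here the words TQ, AT and TQ line up on diagonal 0, while the repeated TQ also produces one hit each on diagonals +5 and -5. Repeat hits like these are why the O(n+m) time bound holds only when repeats are few.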

Other improvements

Formerly, BLASTZ looked for identical runs of eight consecutive nucleotides in each sequence. Ma et al. (2002) propose looking for runs of 19 consecutive nucleotides in each sequence, within which the 12 positions indicated by a 1 in the string 1110100110010101111 are identical. To increase sensitivity, we allow a transition (A-G, G-A, C-T or T-C) in any one of the 12 positions.

If the separation between the two alignments is <50 kb in both sequences, then BLASTZ recursively searches the intervening regions for 7-mer exact matches and requires a threshold of 2200 for initiating dynamic programming (without the adjustment for sequence complexity). If the separation is <10 kb, the threshold is lowered to 2000. In either case, the higher-sensitivity matches are required to occur with an order and orientation consistent with the stronger flanking matches.

[Figure: blastz local alignments]
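A minimal sketch of the seed test described above: two 19-mers form a hit when the 12 positions marked 1 are identical, except that a single transition is tolerated. The function name `seed_match` and its `max_transitions` parameter are illustrative and are not taken from the BLASTZ source.

```python
SEED = "1110100110010101111"  # 19 positions, 12 of them marked 1
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_match(a, b, seed=SEED, max_transitions=1):
    """Return True if words a and b match under the spaced seed."""
    if len(a) != len(seed) or len(b) != len(seed):
        raise ValueError("both words must be as long as the seed")
    transitions = 0
    for x, y, s in zip(a, b, seed):
        if s != "1" or x == y:
            continue  # position not inspected, or identical
        if (x, y) in TRANSITIONS:
            transitions += 1
            if transitions > max_transitions:
                return False
        else:
            return False  # a transversion at an inspected position
    return True


word = "ACGTACGTACGTACGTACG"
print(seed_match(word, "G" + word[1:]))           # one transition at a 1 position
print(seed_match(word, "C" + word[1:]))           # transversion at a 1 position
print(seed_match(word, word[:3] + "A" + word[4:]))  # mismatch at a 0 position
```

Mismatches at the seven positions marked 0 are ignored entirely, which is where a spaced seed gains sensitivity over a contiguous 12-mer.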

PipMaker

[Figure: PipMaker plot, GSTM1 vs Cluster]

Suffix trees - MUMmer, AVID

Figure 2. Finding maximal matches using a suffix tree: The suffixes of the word at the root are represented by the characters along the paths from the root to the leaves. Branchings in the tree correspond to locations where different suffixes share the same prefix, and are therefore matches. Every internal node in the tree is therefore a match (with the matching sequence given by the characters along the path from the root). Maximal matches can be efficiently detected by considering some additional criteria.

Figure 3. Selecting anchors from the set of matches. Every maximal match is shown in blue. A set of good anchors is shown in red.
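Anchor selection can be illustrated with a generic chaining sketch: given matches as (position in sequence 1, position in sequence 2, length), keep the non-overlapping, collinear subset with the greatest total length. This is a simple quadratic-time dynamic program written for illustration; it is not the selection procedure of MUMmer or AVID, whose criteria are not given on the slide, and the name `select_anchors` is hypothetical.

```python
def select_anchors(matches):
    """Pick the collinear chain of matches with the largest total length.

    matches: list of (pos_a, pos_b, length) tuples, lengths >= 1.
    """
    ms = sorted(matches)  # by position in the first sequence
    if not ms:
        return []
    best = [m[2] for m in ms]  # best chain length ending at each match
    prev = [-1] * len(ms)      # back-pointers for chain recovery
    for i, (ai, bi, li) in enumerate(ms):
        for j in range(i):
            aj, bj, lj = ms[j]
            # Match j may precede match i only if it ends before i starts
            # in both sequences (same order, no overlap).
            if aj + lj <= ai and bj + lj <= bi and best[j] + li > best[i]:
                best[i] = best[j] + li
                prev[i] = j
    # Walk back from the best chain end.
    i = max(range(len(ms)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ms[i])
        i = prev[i]
    return chain[::-1]


matches = [(0, 0, 10), (12, 50, 15), (15, 14, 8), (30, 28, 12), (45, 5, 20)]
anchors = select_anchors(matches)
print(anchors)
```

In this toy input the matches at (12, 50) and (45, 5) are long but lie off the main diagonal, so they are left out in favor of three shorter matches that are mutually consistent in order.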

Table 1. Coverage Results for the Different Programs on Human Sequence Alignments With Cat, Chicken, Cow, Dog, Pig, and Rat

Species   Program             Coverage (bp)   RefSeq   UTR     Time (s)
Cat       AVID                1200310         21966    13742   78
          BLASTZ              1270696         22383    13632   101
          BLASTZ (chaining)   1170350         21993    13632   52
          CHAOS               360436          19022    7604    128
          GLASS               1179833         21942    13426   9215
Chicken   AVID                77114           17886    1742    35
          BLASTZ              56370           19204    1357    11
          BLASTZ (chaining)   56370           19204    1357    11
          CHAOS               5635            3060     70      31
          GLASS               56445           17437    984     2866
Chimp     AVID                852611          0        0       36
          BLASTZ              881206          0        0       11330
          BLASTZ (chaining)   854023          0        0       45
          CHAOS               680491          0        0       4471
          GLASS               *               *        *       *
Cow       AVID                1195913         26359    10777   89
          BLASTZ              1288895         27911    13505   79
          BLASTZ (chaining)   1150323         25814    10434   57
          CHAOS               317660          22551    6633    198
          GLASS               1131693         25187    12860   10115

Coverage of the human genome using the mouse genome is described in O. Couronne (2003). An asterisk indicates that the program was not able to successfully align the sequences. A minus sign indicates that the program crashed on one sequence pair. Multiple minus signs are used for multiple crashes. RefSeq annotations are based on the human December 2001 hg10 freeze.

LAGAN

"anchor" local alignments (CHAOS)
rough global map ("band")
recursive anchoring
translated anchoring

Genome Research (2003) 13:721

parameters:
k: word length
c: degeneracy
t: score threshold

MLAGAN

Tree-guided, progressive with anchors
affine gaps, open/continue/end penalties
iterative refinement with anchors

Figure 2. Visualization of a multiple alignment using VISTA.

(A) MLAGAN alignments can be visualized using VISTA if they are projected to pairwise alignments with respect to one reference sequence. This plot shows the conservation between human and chimpanzee, cow, mouse, chicken, and fugu around the first intron of the cmet gene. The human/chimpanzee conservation is uniformly very high; human/cow and human/mouse show varying levels of conservation. The human/chicken alignment also shows some conservation in the non-coding areas. The human/fugu alignment shows conservation only within the first coding exon, and to a lesser degree within the regions upstream and downstream of that exon.

(B) First introns of cmet, comparison of CLUSTALW and MLAGAN alignments. We compared the alignments generated by MLAGAN and CLUSTALW for the first intron of the cmet gene in eight mammalian sequences (human, baboon, cat, dog, cow, pig, mouse, and rat). The alignments between all of the species except rodents were similar. VISTA plots of the projections to human and mouse are shown. CLUSTALW (top) misaligned the mouse sequence around 4 kb and 10 kb, whereas MLAGAN (bottom) found significant conservation in these regions.

Genome Alignment Strategies

Alignment based or lookup based? (BLASTZ vs BLAT)
Identities vs similarities (EST:genome vs Human:Cat:Mouse)
Scoring matrix?
Statistical estimates (BLASTN)
Memory vs speed
Iterative alignment?