CoCoGen meeting. Accuracy of the anchor-based strategy for genome alignment.
Outline: General context; Anchor-based method; Evaluation; Discussion.

CoCoGen meeting
Accuracy of the anchor-based strategy for genome alignment
Raluca Uricaru
LIRMM, CNRS, Université de Montpellier 2
3 October 2008
1 / 31

Summary
1. General context
2. Global alignment: anchor-based method
3. Benchmark, evaluation criteria, evaluation of performance
4. Discussion

Motivation
Do I really have to explain this to YOU?
- We are working together with the people from the MOSAIC project
- Segmentation of bacterial genomes: backbone & variable segments

Local alignments vs exact matches
(figure: two sequences sharing the flanking regions accagtcg ... ttgcccgcca)
Local alignment:  G A A T T C A G
                  G G A _ T C _ G
Exact match:      G A A T T C A G
                  G A A T T C A G
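To make the notion of an exact match concrete, here is a minimal sketch that enumerates maximal exact matches (MEMs) between two short strings by brute force. The function name and the `min_len` parameter are illustrative assumptions. Tools such as VMatch rely on index structures rather than this quadratic scan, so this shows only the definition, not how such tools work.

```python
def maximal_exact_matches(a, b, min_len=3):
    """Return (start_a, start_b, length) for each maximal exact match.

    A match is maximal when it cannot be extended to the left or to the
    right without hitting a mismatch or the end of a sequence.
    Brute-force illustration only (hypothetical helper, not VMatch).
    """
    matches = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Not left-maximal: the match started earlier on this diagonal.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k >= min_len:
                matches.append((i, j, k))
    return matches
```

Called on `"ccGAATTCAGtt"` and `"aaGAATTCAGcc"`, the only match of length 3 or more is the shared block `GAATTCAG`. A local alignment would additionally tolerate mismatches and gaps inside the block, which this sketch deliberately does not.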

Genome alignment approaches
Local alignment
- in case of genomic rearrangements
- align common regions instead of the entire genomes
- for big, distant, rearranged genomes (e.g. eukaryotes)
Global alignment
- in case of collinear genomes
- technique: anchor-based, 3-phase method
- for closely related, non-rearranged genomes (e.g. bacteria)

Anchor-based, 3-phase method

Anchor-based, 3-phase method (2)
1. Computation of fragments
   - pairwise similarity regions
2. Anchors computation (chaining phase)
   - basis of the alignment
   - subset of non-overlapping fragments forming an optimal-scoring global chain
3. Recursive anchoring
   - for each pair of yet unaligned regions between the anchors
   - last-chance alignment with ClustalW [J.D. Thompson et al., 1994]

Global alignment (GA) programs
Present global alignment programs are heuristic variations on the anchor-based strategy.
MGA [M. Hohl et al., 2002] (does not deal with rearrangements)
1. Compute Maximal Exact Matches (MEMs) (VMatch [S. Kurtz, 2003])
2. Select the set of anchors (Chainer [M.I. Abouelhoda et al., 2005])
   - subset of non-overlapping MEMs forming the maximal-scoring consistent chain
   - consistent chain = increasing on both genomes
3. Recursive anchoring
   - last-chance alignment
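The chaining phase can be sketched as a small dynamic program: among non-overlapping fragments, pick the subset with maximal total score whose coordinates increase on both genomes. The `Fragment` type and `best_consistent_chain` function below are hypothetical names, fragments are assumed non-empty with half-open coordinates, and the quadratic loop is a simple stand-in rather than the algorithm implemented in Chainer.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    start_a: int   # start on genome A (inclusive)
    end_a: int     # end on genome A (exclusive)
    start_b: int   # start on genome B (inclusive)
    end_b: int     # end on genome B (exclusive)
    score: float


def best_consistent_chain(fragments: List[Fragment]) -> List[Fragment]:
    """Maximal-scoring chain that is increasing on both genomes (O(n^2) sketch)."""
    frags = sorted(fragments, key=lambda f: (f.start_a, f.start_b))
    n = len(frags)
    if n == 0:
        return []
    best = [f.score for f in frags]   # best chain score ending at each fragment
    prev = [-1] * n                   # predecessor index used for backtracking
    for j in range(n):
        for i in range(j):
            # Fragment i may precede j only if it ends before j starts on both genomes.
            precedes = (frags[i].end_a <= frags[j].start_a
                        and frags[i].end_b <= frags[j].start_b)
            if precedes and best[i] + frags[j].score > best[j]:
                best[j] = best[i] + frags[j].score
                prev[j] = i
    # Backtrack from the fragment that ends the highest-scoring chain.
    j = max(range(n), key=lambda k: best[k])
    chain = []
    while j != -1:
        chain.append(frags[j])
        j = prev[j]
    return chain[::-1]
```

A fragment that overlaps a selected anchor, or that lies out of order on the second genome, is excluded from the chain. That is exactly why a consistent chain cannot represent rearrangements.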

Global alignment programs (2)
MAUVE [A.C.E. Darling et al., 2004] (handles rearrangements)
1. Search for approximate matches using spaced seeds
2. Select a subset of non-overlapping anchors
   - without imposing the consistency property
   - use a heuristic that finds a minimum partition into Longest Collinear Blocks
3. Recursive anchoring
   - last-chance alignment
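The idea behind spaced seeds can be illustrated with a short sketch. A seed is a pattern of "care" (`1`) and "don't care" (`0`) positions, and two windows form a hit when they agree on every care position. The function name and the seed pattern used here are illustrative assumptions. They are not the seed patterns used by MAUVE or Yass.

```python
from collections import defaultdict


def spaced_seed_hits(a, b, seed="11011"):
    """Return sorted (pos_a, pos_b) pairs whose windows agree on all '1' positions.

    Hypothetical helper for illustration: '1' = position must match,
    '0' = position is ignored.
    """
    care = [k for k, c in enumerate(seed) if c == "1"]
    span = len(seed)
    # Index every window of sequence a by the characters at the care positions.
    index = defaultdict(list)
    for i in range(len(a) - span + 1):
        key = "".join(a[i + k] for k in care)
        index[key].append(i)
    # Look up every window of sequence b in that index.
    hits = []
    for j in range(len(b) - span + 1):
        key = "".join(b[j + k] for k in care)
        for i in index.get(key, []):
            hits.append((i, j))
    return sorted(hits)
```

For `"GATTACA"` and `"GATCACA"`, which differ by one substitution, the contiguous seed `"1111"` finds no hit, whereas the spaced seed `"11011"` of the same weight finds one because the mismatch falls on its don't-care position.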

Global alignment programs (3)
Uninvestigated questions on the anchor-based strategy
Lack of evaluation for:
- global accuracy: benchmark, evaluation criteria & significance measure
- the influence of exact & approximate matches
- the impact of the types of matches on chaining
1. Does the anchor-based strategy produce satisfactory results on non-rearranged genomes?
2. How can one deal with rearrangements?

2 initial prototypes
To address the first of the previous questions, we designed 2 prototypes for pairwise genome alignment without rearrangements.
1. VMatch + Chainer (VC)
   - simulates the first 2 phases of MGA
2. Yass + Chainer (YC)
   - same outline as MGA
   - replaces VMatch by Yass [L. Noe et al., 2005]
   - Yass = fast similarity search tool based on spaced seeds
   - uses local alignments instead of MEMs

Comments on the 2 prototypes
- frequent overlaps due to the use of local alignments (2nd prototype)
  - overlaps should be taken into account in the chaining
  - proposed solution: hierarchical chaining method (3rd prototype)
- not dealing with rearrangements may generate poor results:
  - single inversions, complementary strands (all 3 prototypes)
  - circularity (3rd prototype)
  - other rearrangement events: successive inversions, translocations, duplications, ...
  - need chaining without the consistency property: NP-complete problem?

The 3rd prototype
3. Yass + Hierarchical (YH)
   - replaces the first 2 phases of MGA:
     - uses local alignments instead of MEMs
     - replaces Chainer by a greedy, hierarchical chaining method
Hierarchical method
- keeps stronger similarities regardless of their conflicts with weaker ones
- multiple-level chaining heuristic, dealing with overlaps
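The "stronger similarities first" principle can be sketched as a greedy selection: process fragments by decreasing score and keep each one that is collinear with everything already kept. This is only an assumed reading of that single principle. The slides do not give the details of the hierarchical method, and this sketch simply rejects overlapping fragments instead of handling overlaps or multiple levels. Tuple layout and function names are illustrative.

```python
def compatible(f, g):
    """True if fragments f and g are disjoint and in the same order on both genomes.

    Fragments are (start_a, end_a, start_b, end_b, score) with exclusive ends.
    """
    f_before_g = f[1] <= g[0] and f[3] <= g[2]
    g_before_f = g[1] <= f[0] and g[3] <= f[2]
    return f_before_g or g_before_f


def greedy_chain(fragments):
    """Keep the strongest fragments first; drop weaker ones that conflict with them."""
    kept = []
    for f in sorted(fragments, key=lambda frag: frag[4], reverse=True):
        if all(compatible(f, g) for g in kept):
            kept.append(f)
    # Report the chain in genome order.
    return sorted(kept)
```

Unlike an optimal chaining, this greedy choice never sacrifices a strong similarity for several weaker ones, so its total score can be lower than that of the best consistent chain.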

Benchmark
MOSAIC [H. Chiapello et al., 2005]
- public curated database of backbone alignments for intra-species bacterial genomes
- 42 species from Genome Reviews [P. Sterk, 2006], except 5:
  - Buchnera aphidicola, Prochlorococcus marinus, Synechococcus sp., Pseudomonas fluorescens, Rhodopseudomonas palustris
  - excluded due to unsatisfactory alignments
- alignments are made with MAUVE (rearranged) & MGA (non-rearranged)
  - parameters calibrated on E. coli
- backbone alignments were post-processed
  - keep aligned gaps with ≥ 76% id%

Global evaluation criteria
- complexity of the anchor chain
  - number of anchors in the chain
- coverage
  - total length of regions common to both sequences
  - filtered coverage = coverage without aligned gaps of < 76% id%
- identity percentage (id%)
  - ratio of identical bp in the segments from the coverage, over genome length
  - filtered id% = id% without aligned gaps of < 76% id%
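As a rough illustration of how the coverage and id% criteria could be computed, the sketch below represents an alignment as a list of blocks. The `Block` fields, the function name, the expression of both values as percentages of one genome's length, and the reading of "aligned gap" as a flagged block are all assumptions made for the example. Only the 76% threshold comes from the slides.

```python
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    length: int           # aligned length in bp (assumed > 0)
    identical: int        # number of identical bp in the block
    is_aligned_gap: bool  # True for a region aligned between two anchors


def coverage_and_identity(
    blocks: List[Block],
    genome_len: int,
    min_gap_identity: Optional[float] = None,
) -> Tuple[float, float]:
    """Return (coverage %, id%) relative to genome_len.

    With min_gap_identity set (e.g. 0.76), aligned gaps below that identity
    are dropped first, giving the filtered variants of both criteria.
    """
    kept = [
        b for b in blocks
        if min_gap_identity is None
        or not b.is_aligned_gap
        or b.identical / b.length >= min_gap_identity
    ]
    coverage = 100 * sum(b.length for b in kept) / genome_len
    identity = 100 * sum(b.identical for b in kept) / genome_len
    return coverage, identity
```

For a 1000 bp genome with an anchor of 600 bp (590 identical) and two aligned gaps of 200 bp (180 identical) and 100 bp (50 identical), the unfiltered values are 90% coverage and 82% id%. Filtering at 0.76 removes only the 100 bp gap and gives 80% and 77%.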

Comparison protocol
- 42 bacterial species (236 intra-species genome pairs)
  - close, moderately divergent, rearranged genomes
  - for exceptions, see the 5 species excluded from MOSAIC
  - 94 collinear pairs, 142 rearranged pairs
  - existence of a benchmark (MOSAIC)
- compare alignments obtained with MGA, MAUVE, VC, YC & YH
  - 3 evaluation criteria (see previous slide)
  - tested several parameters and options, kept the best ones

Present achievements (MOSAIC)
MGA and MAUVE
- species with good average coverage
  - Staphylococcus aureus: > 90% for both MGA and MAUVE
- coverage differences between species
  - Streptococcus thermophilus: 97% for both MGA and MAUVE
  - Synechococcus sp.: < 10% for both MGA and MAUVE
- coverage differences within a species
  - Prochlorococcus marinus: [0, 78]% MGA, [6, 96]% MAUVE
Programs succeed in aligning some pairs and fail on others.
Causes: high level of divergence, rearrangements

YC vs VC
Local similarities vs MEMs: YC/YH vs VC
- compute the difference in coverage YC-VC
  - [-69, 81]%, positive in 89 (out of 236) cases
  - unexpectedly bad results for close genomes
  - Cause: large overlaps of local alignments
  - Solution: YH
YH vs VC
- YH improves the coverage over VC: 11% on average
- YH improves the coverage over YC: 19% on average
Example: CP000046 vs BA000018 (Staphylococcus aureus)
- YC 64%, VC 88%, YH 99%

Hierarchical method vs MOSAIC
YH vs MGA/MAUVE
YH vs MGA on the 94 collinear pairs
- compute the difference in cov/id% YH-MGA
  - cov: [-4, 45]%, 3% on average; positive in 52 cases, < -2% in 3 cases
  - id%: [1, 45]%, 9% on average
YH vs MAUVE on the 142 rearranged pairs
- MAUVE covers 21% more nucleotides than YH
- MAUVE has 18% more identities than YH

Discussion: evaluation of the anchor-based strategy
- we do not solve rearrangements and we do not perform recursion steps
- however, we obtain a gain in coverage, id% and simplicity of the chain
  - chain complexity of YH compared with MGA and VC
  - number of anchors: < 150 YH, > 3000 MAUVE, > 13000 MGA

Discussion: evaluation of the anchor-based strategy (2)
- in terms of filtered coverage and id%, YH improves on MGA...
- ...and on MAUVE, but only in the case of divergent genomes (thanks to spaced seeds?)
- what is happening with the aligned gaps?
Example (Buchnera aphidicola, 3 genome pairs)
- filtered coverage: YH improves on MGA/MAUVE by 69%
- filtered id%: YH improves on MGA/MAUVE by 47%

Conclusion
- pairwise alignment is still incompletely solved
- heuristic anchor methods can be improved:
  - quantitatively: local similarities, chaining with overlaps
  - qualitatively: E-values for local similarities
- improved backbone alignment
  - avoids alignment post-processing
- in terms of chain complexity
  - computationally less expensive
  - easy to interpret biologically

Short-term perspectives
- recursive anchoring
  - successively search for similarities in non-aligned regions
  - "chained spaced seeds": knowing that the spaced seed (set) A is not found at step i, compute the best spaced seed (set) B for step i + 1
- dealing with rearrangements: method for non-consistent global alignment allowing overlaps

Short-term perspectives (2)
- biological validation
  - global, average values are not entirely relevant
- reinforce (test) the gain from using spaced seeds
  - compare to BLAST
- use negative filters
  - detect unique regions
  - validation method on the non-aligned regions
  - pre-processing step

Bibliography
- J.D. Thompson, D.G. Higgins, T.J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 1994.
- L. Noe and G. Kucherov. YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Research, 2005.
- H. Chiapello, I. Bourgait, F. Sourivong, G. Heuclin, A. Gendrault-Jacquemard, M-A Petit and M. El Karoui. Systematic determination of the mosaic structure of bacterial genomes: species backbone versus strain-specific loops. BMC Bioinformatics, 2005.

Bibliography (2)
- M. Hohl, S. Kurtz and E. Ohlebusch. Efficient multiple genome alignment. Bioinformatics, 2002.
- A.C.E. Darling, B. Mau, F.R. Blattner and N.T. Perna. Mauve: Multiple Alignment of Conserved Genomic Sequence With Rearrangements. Genome Research, 2004.
- M.I. Abouelhoda and E. Ohlebusch. Chaining Algorithms for Multiple Genome Comparison. Journal of Discrete Algorithms, 2005.
- S. Kurtz. VMatch, technical report, 2003.
- P. Sterk, P.J. Kersey and R. Apweiler. Genome Reviews: Standardizing Content and Representation of Information about Complete Genomes. OMICS, 2006.

Appendix
Support:
- CocoGEN: http://www.lirmm.fr/~rivals/
- PlasmoExplore: http://www.lirmm.fr/~brehelin/plasmoexplore/
Supplementary material available at: www.lirmm.fr/~uricaru/appendix_recombcg.html
Thank you for your attention! Questions?