Inferring positional homologs with common intervals of sequences


Outline Introduction Our approach Results Conclusion

Inferring positional homologs with common intervals of sequences
Guillaume Blin, Annie Chateau, Cedric Chauve, Yannick Gingras
CGL - Université du Québec à Montréal
Université de Marne la Vallée
Séminaire de Bioinformatique BIF7002, 2007-01-17

Outline
1 Ortholog assignment
  - Definitions
  - Importance
  - Automated Inference of Orthologs
2 Our approach
  - Common Intervals
  - Matching Extraction
  - Demo
3 Results
  - Human and Mouse
  - γ-proteobacteria
4 Conclusion
  - Discussion


Definitions: orthologs, paralogs and homologs
Source: http://www.ncbi.nlm.nih.gov/education/blastinfo/orthology.html

Why infer orthology?
- Inference of gene functions
- Phylogenomics: which copy of a gene do we compare?
- Gene order: nice to have a permutation
- etc.
Source: Wikipedia

Mechanisms of evolution
- Duplications
- Mutations
- Losses
- Rearrangements
= Variable gene content

Assignment of orthologs: the easy part
- Make pairs with putative orthologs
- Single copy genes are paired together
- What about gene number 4?
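The "easy part" can be illustrated with a minimal sketch. It assumes each genome is encoded as a list of gene-family labels (a hypothetical encoding, not taken from the slides) and pairs only the families that occur exactly once in both genomes. The toy genomes are invented for illustration; family 4 is duplicated in the first one and is therefore left unassigned.

```python
from collections import Counter

def pair_single_copy_genes(genome_a, genome_b):
    """Pair gene families that occur exactly once in each genome.

    Genomes are lists of gene-family labels (hypothetical encoding).
    Returns {family: (position_in_a, position_in_b)}.
    """
    count_a, count_b = Counter(genome_a), Counter(genome_b)
    pairs = {}
    for family in count_a:
        if count_a[family] == 1 and count_b.get(family, 0) == 1:
            pairs[family] = (genome_a.index(family), genome_b.index(family))
    return pairs

# Invented toy data: family 4 has two copies in genome_a.
genome_a = [1, 2, 3, 4, 5, 4]
genome_b = [3, 1, 4, 2, 5]
easy_pairs = pair_single_copy_genes(genome_a, genome_b)
```

Families with several copies need one of the strategies discussed on the following slides.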

Harder: multiple copies
- Which solution is the best?

Mu
A monk once asked master Zhao Zhou, "Does a dog have Buddha-nature or not?"
Zhao Zhou said, "Mu."

Exemplar Method
- Keep only one copy of each gene
- Choose the copy that optimizes a metric or a criterion
- Sankoff (1999): Minimize the breakpoint/reversal distance
- Bourque, Yacef, El-Mabrouk (2005): Maximize common/conserved intervals
- This is the Right Thing
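As a rough illustration of the exemplar idea, the sketch below enumerates every way of keeping one copy per gene family and returns the smallest breakpoint count. It is a brute-force toy, exponential in the number of duplicated genes, and makes several simplifying assumptions that are not in the slides: genes are unsigned, both genomes contain the same families, and the function names are hypothetical.

```python
from itertools import product

def exemplars(genome):
    """Yield every exemplar version of a genome: one kept copy per family."""
    families = sorted(set(genome))
    positions = [[i for i, g in enumerate(genome) if g == f] for f in families]
    for choice in product(*positions):
        yield [genome[i] for i in sorted(choice)]

def breakpoints(perm_a, perm_b):
    """Count adjacencies of perm_a that are not adjacencies of perm_b.

    Assumption: unsigned genes, so (x, y) and (y, x) are the same adjacency.
    """
    adjacencies_b = {frozenset(pair) for pair in zip(perm_b, perm_b[1:])}
    return sum(frozenset(pair) not in adjacencies_b
               for pair in zip(perm_a, perm_a[1:]))

def exemplar_breakpoint_distance(genome_a, genome_b):
    """Minimum breakpoint count over all pairs of exemplar versions."""
    return min(breakpoints(a, b)
               for a in exemplars(genome_a)
               for b in exemplars(genome_b))

# Invented toy data: family 1 has two copies in genome_a.
genome_a = [1, 2, 3, 1, 4]
genome_b = [1, 2, 3, 4]
distance = exemplar_breakpoint_distance(genome_a, genome_b)
```

Here keeping the first copy of family 1 gives an exemplar identical to `genome_b`, while keeping the second copy gives `[2, 3, 1, 4]` with two breakpoints.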

Exemplar Method
The Right Thing is often hard to do:
- Minimize the breakpoint/reversal distance: NP-Hard (Bryant 2000)
- Maximize common/conserved intervals: NP-Hard (Blin et al 2005, 2006)

General Matching
- Keep one or more copies of each gene
- New pairs are separated from their family
- Minimize the number of rearrangements
- Uses breakpoint graph analysis
- Chen et al (2005): minimize reversal distance
- Fu et al (2006): extension to translocations
- Swenson et al (2005): maximize the number of cycles
- General problem is NP-Hard

Longest Common Substrings
- Find long conserved words (colinear segments, conserved segments)
- Greedy version: always make a complete assignment in the longest unmatched conserved word
- Swenson et al (2005)
- Blin et al (2005)
- Used in Chen et al (2005)
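The greedy version can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the cited papers: genes are unsigned, words are matched in forward orientation only, and the longest word is found with a plain quadratic dynamic program restricted to unmatched positions. All function names are hypothetical.

```python
def longest_unmatched_common_word(a, b, used_a, used_b):
    """Return (length, start_a, start_b) of the longest common run of
    still-unmatched genes, using the classic longest-common-substring DP."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1] and not used_a[i - 1] and not used_b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best

def greedy_lcs_matching(a, b):
    """Repeatedly match every gene of the longest unmatched common word."""
    used_a, used_b = [False] * len(a), [False] * len(b)
    matching = []
    while True:
        length, start_a, start_b = longest_unmatched_common_word(a, b, used_a, used_b)
        if length == 0:
            break
        for k in range(length):
            matching.append((start_a + k, start_b + k))
            used_a[start_a + k] = used_b[start_b + k] = True
    return sorted(matching)

# Invented toy data: family 2 has two copies in each genome.
genome_a = [1, 2, 3, 4, 2]
genome_b = [2, 3, 4, 1, 2]
matching = greedy_lcs_matching(genome_a, genome_b)
```

In this toy case the word `2 3 4` is matched first, which breaks up the shorter word `1 2`; the leftover genes are then matched one at a time.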

Longest Common Substrings
- Generates lots of false positives
- Misses local rearrangements


Our Approach: The Idea
"Orthologous gene copies are more likely to share the same genome positions and share the same gene neighbors."
Burgetz et al, Positional homology in bacterial genomes

Common Intervals
- Sets of genes with a contiguous occurrence on each genome
- {1, 2, 3, 4} is a common interval
- {1, 3} is not
- Easy to detect: O(n^2) (Schmidt and Stoye 2004)
- Capture local rearrangements
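To make the definition concrete, here is a minimal sketch that lists the common intervals of two permutations in quadratic time. It is not the Schmidt and Stoye algorithm, which handles sequences with duplicated genes; it assumes each gene occurs exactly once per genome, and the two genomes below are invented so that {1, 2, 3, 4} is a common interval and {1, 3} is not.

```python
def common_intervals(perm_a, perm_b):
    """List the gene sets (size >= 2) that are contiguous in both permutations.

    Assumption: every gene occurs exactly once in each permutation.
    """
    pos_b = {gene: i for i, gene in enumerate(perm_b)}
    found = []
    for i in range(len(perm_a)):
        lo = hi = pos_b[perm_a[i]]
        for j in range(i + 1, len(perm_a)):
            p = pos_b[perm_a[j]]
            lo, hi = min(lo, p), max(hi, p)
            # perm_a[i..j] is contiguous in perm_b exactly when its genes
            # span j - i + 1 consecutive positions there.
            if hi - lo == j - i:
                found.append(frozenset(perm_a[i:j + 1]))
    return found

# Invented toy data.
genome_a = [1, 2, 3, 4, 5]
genome_b = [5, 3, 1, 2, 4]
intervals = common_intervals(genome_a, genome_b)
```

Note that the gene order inside {1, 2, 3, 4} differs between the two genomes, which is the kind of local rearrangement a common substring would miss.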

Our Approach: Overview
- A common interval is a group of genes that stick together
- That's a good place to start a matching

Common Intervals: Box Representation

Common Intervals: Incompatibility
- A common interval occurrence is incompatible with another one if we can't assign all the genes in both at the same time without conflicts

Common Intervals: Incompatible Boxes
- Two boxes are incompatible if they can hit each other by a vertical or horizontal translation
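One way to read the translation rule is as an overlap test on projections: two boxes can hit each other by a vertical or horizontal slide exactly when their ranges overlap on at least one axis. The sketch below encodes that reading. The `Box` type, with inclusive start and end positions on each genome, is a hypothetical representation and not taken from the slides.

```python
from typing import NamedTuple

class Box(NamedTuple):
    """A common interval occurrence: a position range on each genome (inclusive)."""
    a_start: int
    a_end: int
    b_start: int
    b_end: int

def ranges_overlap(start_1, end_1, start_2, end_2):
    """True when two inclusive integer ranges share at least one position."""
    return start_1 <= end_2 and start_2 <= end_1

def incompatible(x, y):
    """Assumed reading of the slide: boxes clash when their projections
    overlap on genome A or on genome B."""
    return (ranges_overlap(x.a_start, x.a_end, y.a_start, y.a_end)
            or ranges_overlap(x.b_start, x.b_end, y.b_start, y.b_end))

reference = Box(0, 2, 0, 2)
disjoint = Box(3, 5, 3, 5)        # separate on both axes
shares_a = Box(1, 3, 5, 7)        # overlaps the reference on genome A
shares_b = Box(4, 6, 1, 3)        # overlaps the reference on genome B
```

Under this reading a box nested inside another is also incompatible with it, so nested boxes have to be handled separately when refining a selected box.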

Our approach
Repeat as long as there is an unmatched common interval:
- Pick L, the largest box
- Filter out boxes incompatible with L
- Recurse on L
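The loop above can be sketched as a greedy selection over boxes. Several points are assumptions, since the slides do not spell them out: boxes are position ranges on the two genomes, "largest" means largest area, incompatibility means overlapping projections on either genome, and "recurse on L" is interpreted as repeating the selection among the boxes nested inside L. How genes are finally paired inside a selected box is not shown here.

```python
from typing import NamedTuple

class Box(NamedTuple):
    """Hypothetical box: inclusive position ranges on genomes A and B."""
    a_start: int
    a_end: int
    b_start: int
    b_end: int

def area(box):
    return (box.a_end - box.a_start + 1) * (box.b_end - box.b_start + 1)

def incompatible(x, y):
    """Assumed rule: projections overlap on genome A or on genome B."""
    return ((x.a_start <= y.a_end and y.a_start <= x.a_end)
            or (x.b_start <= y.b_end and y.b_start <= x.b_end))

def contains(outer, inner):
    """True when inner lies within outer on both genomes and differs from it."""
    return (inner != outer
            and outer.a_start <= inner.a_start and inner.a_end <= outer.a_end
            and outer.b_start <= inner.b_start and inner.b_end <= outer.b_end)

def greedy_boxes(boxes):
    """Return a list of (box, nested_selection) pairs chosen greedily."""
    selection = []
    remaining = list(boxes)
    while remaining:
        largest = max(remaining, key=area)                 # pick L
        nested = [b for b in remaining if contains(largest, b)]
        remaining = [b for b in remaining                  # filter
                     if b != largest and not incompatible(b, largest)]
        selection.append((largest, greedy_boxes(nested)))  # recurse on L
    return selection

# Invented toy data.
boxes = [Box(0, 3, 0, 3), Box(0, 1, 2, 3), Box(2, 3, 0, 1),
         Box(3, 5, 6, 8), Box(5, 7, 4, 6)]
selected = greedy_boxes(boxes)
```

In this toy run the 4 by 4 box is picked first, `Box(3, 5, 6, 8)` is discarded because it overlaps that box on genome A, and the two small boxes nested inside the first pick are kept as its refinement.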

Demo
Let's try it!


Results: Human and mouse orthologs assignment
- Assignment on whole genomes
- Homologs identified with MSOAR hit graph and clustering

                     MSOAR    LCS      Common Intervals
Matched pairs        13218    13380    13301
True positives       9214     9227     9186
False positives      2240     2357     2327
% True positives     70%      69%      69%
% False positives    17%      18%      17%

- MSOAR: General Matching (Z. Fu, X. Chen, V. Vacic, P. Nan, Y. Zhong, T. Jiang, 2006)
- LCS: Longest Common Substrings (G. Blin, C. Chauve, G. Fertin, 2005)
- True positives: genes with the same UniProt name
- MSOAR is the most accurate
- Results are comparable
- Looking for large conserved structures is a valid approach
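The percentages in this table appear to be taken relative to the number of matched pairs; this is inferred from the numbers rather than stated on the slide. The short check below recomputes them from the counts given above. Note that true and false positives do not add up to the matched pairs, so some pairs are counted as neither.

```python
# Counts copied from the human and mouse results table.
results = {
    "MSOAR": {"matched": 13218, "tp": 9214, "fp": 2240},
    "LCS": {"matched": 13380, "tp": 9227, "fp": 2357},
    "Common Intervals": {"matched": 13301, "tp": 9186, "fp": 2327},
}

def rates(row):
    """Return (% true positives, % false positives) over matched pairs,
    rounded to the nearest integer (assumed rounding rule)."""
    return (round(100 * row["tp"] / row["matched"]),
            round(100 * row["fp"] / row["matched"]))

percentages = {method: rates(row) for method, row in results.items()}
```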

Results: assignment on bacterial genomes
Assignment of orthologs on 8 γ-proteobacteria:
- Buchnera aphidicola APS
- Escherichia coli K12
- Haemophilus influenzae Rd
- Pasteurella multocida Pm70
- Pseudomonas aeruginosa PAO1
- Salmonella typhimurium LT2
- Xylella fastidiosa 9a5c
- Yersinia pestis CO_92
Homologs identified by BLAST and clustering
All 28 pairwise matchings

Consistent Components
- Gene matching defines a graph on the 8 genomes
- A connected component is consistent if it contains at most one gene in each genome
- A perfect component is a consistent component that contains only true positives
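The consistency test can be sketched with a small union-find over the matching graph. The encoding is an assumption made for illustration: each gene is a `(genome, gene_id)` tuple and each matched pair is an edge. The gene identifiers below are invented, and the check for perfect components is omitted because it needs true-positive labels.

```python
from collections import defaultdict

def connected_components(edges):
    """Group genes into connected components with a simple union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())

def is_consistent(component):
    """A component is consistent if no genome contributes more than one gene."""
    genomes = [genome for genome, _ in component]
    return len(genomes) == len(set(genomes))

# Invented toy matching between three genomes.
edges = [
    (("E. coli", "g1"), ("Y. pestis", "g7")),
    (("Y. pestis", "g7"), ("H. influenzae", "g3")),
    (("E. coli", "g2"), ("Y. pestis", "g9")),
    (("E. coli", "g5"), ("Y. pestis", "g9")),
]
components = connected_components(edges)
consistency = sorted(is_consistent(c) for c in components)
```

The first component holds one gene per genome and is consistent; the second contains two E. coli genes linked through the same Y. pestis gene and is not.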

Results: assignment on bacterial genomes

                     LCS      Common Intervals
True positives       19142    18875
False positives      3045     3324
% True positives     86%      85%
Components           3439     3539
Consistent           2907     3117
% Consistent         85%      88%
TP in a CC           14954    10729
Perfect Comp.        1531     1628
% Perfect Comp.      53%      52%

- LCS: Longest Common Substrings (G. Blin, C. Chauve, G. Fertin, 2005)
- True positives: genes with the same UniProt name
- LCS is more accurate but less consistent

Note on low complexity
- Gene content vs area

Results: assignment on bacterial genomes

                     LCS      CI       Filtered LCS   Filtered CI
True positives       19142    18875    10553          10420
False positives      3045     3324     789            1054
% True positives     86%      85%      93%            90%
Components           3439     3539     3606           3480
Consistent           2907     3117     3537           3382
% Consistent         85%      88%      98%            97%
TP in a CC           14954    10729    14954          17180
Perfect Comp.        1531     1628     1538           1962
% Perfect Comp.      53%      52%      43%            58%

- Filter with side 3
- Consistency increases with filtering
- LCS has a lower perfect component ratio with filtering
- Common intervals have a higher perfect component ratio with filtering
- There are 263 perfect components of size 8 with filtering


Conclusion
- The Right Thing is hard
- Simple heuristics can't handle natural shuffling
- Segments with similar gene content are likely to be related
- Common intervals are an efficient technique to locate them
- Future work: handle gaps, smart filtering, etc.

Discussion
Note to self: stop here unless you have extra time!

Cigal: The program 1/3

Cigal: The program 2/3

Cigal: The program 3/3

Human and mouse: with filtering
- Filtered vs raw results
- Minimum side of length 3

                     Common Intervals   Filtered Common Intervals
Matched pairs        13301              12394
True positives       9186               8792
% True positives     69%                71%
False positives      2327               1996
% False positives    17%                16%

                     LCS                Filtered LCS
Matched pairs        13380              11764
True positives       9227               8491
% True positives     69%                72%
False positives      2357               1749
% False positives    18%                15%