Inferring positional homologs with common intervals of sequences

Guillaume Blin, Annie Chateau, Cedric Chauve, Yannick Gingras
CGL - Université du Québec à Montréal
Université de Marne la Vallée
Séminaire de Bioinformatique BIF7002
2007-01-17

Outline
1 Ortholog assignment
  - Definitions
  - Importance
  - Automated Inference of Orthologs
2 Our approach
  - Common Intervals
  - Matching Extraction
  - Demo
3 Results
  - Human and Mouse
  - γ-proteobacteria
4 Conclusion
  - Discussion

1 Ortholog assignment

Definitions: orthologs, paralogs and homologs
Source: http://www.ncbi.nlm.nih.gov/education/blastinfo/orthology.html

Why infer orthology?
- Inference of gene functions
- Phylogenomics: which copy of a gene do we compare?
- Gene order: nice to have a permutation
- etc.
Source: Wikipedia

Mechanisms of evolution
- Duplications
- Mutations
- Losses
- Rearrangements
= Variable gene content

Assignment of orthologs: the easy part
- Make pairs with putative orthologs
- Single copy genes are paired together
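The easy part can be sketched in a few lines. This is only an illustration, assuming each genome is given as a list of gene family labels; the function name `pair_single_copy` and the toy genomes are hypothetical and do not come from the talk.

```python
from collections import Counter

def pair_single_copy(genome_a, genome_b):
    """Pair the positions of families that occur exactly once in each genome."""
    count_a, count_b = Counter(genome_a), Counter(genome_b)
    pairs = []
    for pos_a, family in enumerate(genome_a):
        if count_a[family] == 1 and count_b[family] == 1:
            # Only one candidate on each side, so the pairing is unambiguous.
            pairs.append((pos_a, genome_b.index(family)))
    return pairs

# Toy genomes (assumed data): family "b" is duplicated in genome_a.
genome_a = ["a", "b", "c", "b", "d"]
genome_b = ["d", "a", "b", "c", "e"]
pairs = pair_single_copy(genome_a, genome_b)
print(pairs)  # [(0, 1), (2, 3), (4, 0)]
```

Family "b" is left unpaired here because genome_a holds two copies of it, which is exactly the harder case.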

Harder: multiple copies
- Which solution is the best?

Mu
A monk once asked master Zhao Zhou, "Does a dog have Buddha-nature or not?"
Zhao Zhou said, "Mu."

Exemplar Method
- Keep only one copy of each gene
- Choose the copy that optimizes a metric or a criterion
- Sankoff (1999): minimize the breakpoint/reversal distance
- Bourque, Yacef, El-Mabrouk (2005): maximize common/conserved intervals

Exemplar Method
The Right Thing is often hard to do:
- Minimize the breakpoint/reversal distance: NP-hard (Bryant 2000)
- Maximize common/conserved intervals: NP-hard (Blin et al. 2005, 2006)
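To make the exemplar idea concrete, here is a brute-force sketch. It assumes unsigned genes, both genomes containing the same families, and a simplified breakpoint count with no end markers; all names are hypothetical. It enumerates every choice of exemplars, so its running time grows exponentially with the number of duplicated families, in line with the hardness results above.

```python
from itertools import product

def exemplar_strings(genome):
    """Yield every way of keeping exactly one copy of each gene family."""
    positions = {}
    for i, family in enumerate(genome):
        positions.setdefault(family, []).append(i)
    for choice in product(*positions.values()):
        yield [genome[i] for i in sorted(choice)]

def breakpoints(p, q):
    """Count adjacencies of p that are not adjacencies of q (unsigned, simplified)."""
    adjacencies_q = {frozenset(pair) for pair in zip(q, q[1:])}
    return sum(frozenset(pair) not in adjacencies_q for pair in zip(p, p[1:]))

def best_exemplars(genome_a, genome_b):
    """Exhaustive search for the exemplar pair with the fewest breakpoints."""
    candidates = ((breakpoints(x, y), x, y)
                  for x in exemplar_strings(genome_a)
                  for y in exemplar_strings(genome_b))
    return min(candidates, key=lambda t: t[0])

# Toy data (assumed): gene 1 is duplicated in the first genome.
best = best_exemplars([1, 2, 3, 1, 4], [1, 2, 3, 4])
print(best)  # (0, [1, 2, 3, 4], [1, 2, 3, 4])
```

Keeping the first copy of gene 1 gives zero breakpoints, while keeping the second copy gives two.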

General Matching
- Keep one or more copies of each gene
- New pairs are separated from their family
- Minimize the number of rearrangements
- Uses breakpoint graph analysis
  - Chen et al. (2005): minimize reversal distance
  - Fu et al. (2006): extension to translocations
  - Swenson et al. (2005): maximize the number of cycles
- General problem is NP-hard

Longest Common Substrings
- Find long conserved words (colinear segments, conserved segments)
- Greedy version: always make a complete assignment in the longest unmatched conserved word
  - Swenson et al. (2005)
  - Blin et al. (2005)
- Used in Chen et al. (2005)
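A minimal sketch of the greedy version, under simplifying assumptions that are not from the cited papers: genes are unsigned, reversed words are ignored, and already matched positions act as barriers rather than being removed from the genomes. Function names are hypothetical.

```python
def longest_unmatched_word(a, b, used_a, used_b):
    """Return (length, start_a, start_b) of the longest common word made of
    unmatched positions only, using the classic O(len(a) * len(b)) table."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1] and not used_a[i - 1] and not used_b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best

def greedy_lcs_matching(a, b):
    """Repeatedly match all genes of the longest unmatched conserved word."""
    used_a, used_b = [False] * len(a), [False] * len(b)
    matching = []
    while True:
        length, start_a, start_b = longest_unmatched_word(a, b, used_a, used_b)
        if length == 0:
            return sorted(matching)
        for k in range(length):
            used_a[start_a + k] = used_b[start_b + k] = True
            matching.append((start_a + k, start_b + k))

# Toy genomes (assumed data) with duplicated genes 1 and 2.
matching = greedy_lcs_matching([1, 2, 3, 1, 2], [1, 2, 1, 2, 3])
print(matching)  # [(0, 2), (1, 3), (2, 4), (3, 0), (4, 1)]
```

The word "1 2 3" is matched first, which forces the remaining copies of 1 and 2 to pair with each other.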

Longest Common Substrings
- Generates lots of false positives
- Misses local rearrangements

2 Our approach

Our Approach: The Idea
Orthologous gene copies are more likely to share the same genome positions and the same gene neighbors.
Burgetz et al., Positional homology in bacterial genomes

Common Intervals
- Sets of genes with a contiguous occurrence on each genome
- {1, 2, 3, 4} is a common interval
- {1, 3} is not
- Easy to detect: O(n^2) (Schmidt and Stoye 2004)
- Capture local rearrangements
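The definition can be checked with a naive enumeration. This is not the Schmidt and Stoye algorithm, only a sketch that lists the gene content of every substring of each genome and intersects the two collections. The toy genomes and function names are assumptions chosen so that the two sets from the slide behave as stated.

```python
def content_sets(genome):
    """Map each gene set that occurs contiguously to its occurrences,
    given as (start, end) positions with end inclusive."""
    occurrences = {}
    for start in range(len(genome)):
        seen = set()
        for end in range(start, len(genome)):
            seen.add(genome[end])
            occurrences.setdefault(frozenset(seen), []).append((start, end))
    return occurrences

def common_intervals(genome_a, genome_b, min_size=2):
    """Gene sets with a contiguous occurrence in both genomes (not only maximal ones)."""
    occ_a, occ_b = content_sets(genome_a), content_sets(genome_b)
    return {genes: (occ_a[genes], occ_b[genes])
            for genes in occ_a.keys() & occ_b.keys()
            if len(genes) >= min_size}

# Toy genomes (assumed data); gene 1 is duplicated in genome_b.
genome_a = [1, 2, 3, 4, 5]
genome_b = [3, 1, 2, 4, 1, 5]
intervals = common_intervals(genome_a, genome_b)
print(frozenset({1, 2, 3, 4}) in intervals)  # True
print(frozenset({1, 3}) in intervals)        # False: 1 and 3 are never adjacent in genome_a
```

Each occurrence pair returned here corresponds to one candidate box for the matching step.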

Our Approach: Overview
- A common interval is a bunch of genes that stick together
- That's a good place to start a matching

Common Intervals: Box Representation

Common Intervals: Incompatibility
A common interval occurrence is incompatible with another one if we can't assign all the genes in both at the same time without conflicts.

Common Intervals: Incompatible Boxes
Two boxes are incompatible if they can hit each other by a vertical or horizontal translation.
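One way to read the translation rule: two boxes can hit each other exactly when their projections overlap on one of the two genomes. The sketch below assumes a box is stored as a tuple (a_start, a_end, b_start, b_end) with inclusive ends, where the first pair is the occurrence on one genome and the second pair the occurrence on the other. This encoding is an assumption, not something given in the talk.

```python
def intervals_overlap(lo1, hi1, lo2, hi2):
    """True if the closed intervals [lo1, hi1] and [lo2, hi2] share a position."""
    return lo1 <= hi2 and lo2 <= hi1

def incompatible(box_p, box_q):
    """True if the boxes share positions on genome A or on genome B,
    that is, if sliding one box along one axis would make it hit the other."""
    pa0, pa1, pb0, pb1 = box_p
    qa0, qa1, qb0, qb1 = box_q
    return (intervals_overlap(pa0, pa1, qa0, qa1)
            or intervals_overlap(pb0, pb1, qb0, qb1))

# Hypothetical boxes for illustration.
box_l = (0, 3, 0, 3)
same_columns = (2, 5, 6, 9)  # shares positions 2..3 on genome A with box_l
far_away = (5, 6, 5, 6)      # disjoint from box_l on both genomes
results = (incompatible(box_l, same_columns), incompatible(box_l, far_away))
print(results)  # (True, False)
```

Under this reading a box nested inside another one also counts as overlapping, so nested boxes need separate treatment when the larger box is processed.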

Our approach
Repeat as long as there is an unmatched common interval:
1. Pick L, the largest box
2. Filter out boxes incompatible with L
3. Recurse on L
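The loop above can be sketched as follows. Several points are assumptions rather than details from the talk: boxes are tuples (a_start, a_end, b_start, b_end) with inclusive ends, "largest" means largest area, and recursing on L means applying the same procedure to the boxes nested inside L. How individual genes are finally paired inside a box with no nested boxes is not covered here.

```python
def area(box):
    a0, a1, b0, b1 = box
    return (a1 - a0 + 1) * (b1 - b0 + 1)

def overlap(lo1, hi1, lo2, hi2):
    return lo1 <= hi2 and lo2 <= hi1

def incompatible(p, q):
    """Projections overlap on genome A or on genome B."""
    return overlap(p[0], p[1], q[0], q[1]) or overlap(p[2], p[3], q[2], q[3])

def inside(inner, outer):
    """True if inner is a different box fully contained in outer."""
    return (inner != outer
            and outer[0] <= inner[0] and inner[1] <= outer[1]
            and outer[2] <= inner[2] and inner[3] <= outer[3])

def select_boxes(boxes):
    """Return a list of (box, selection made inside that box)."""
    remaining = sorted(set(boxes), key=lambda b: (-area(b), b))
    chosen = []
    while remaining:
        largest = remaining[0]                                   # pick L
        nested = [b for b in remaining if inside(b, largest)]
        remaining = [b for b in remaining[1:]
                     if not incompatible(b, largest)]            # filter
        chosen.append((largest, select_boxes(nested)))           # recurse on L
    return chosen

# Hypothetical boxes: two nested in the largest one, one clashing, one independent.
boxes = [(0, 3, 0, 3), (0, 1, 0, 1), (2, 3, 2, 3), (5, 6, 5, 6), (3, 5, 7, 9)]
selection = select_boxes(boxes)
print(selection)
```

In this example the box (3, 5, 7, 9) is discarded because it shares position 3 on the first genome with the largest box, while the two nested boxes are kept as children of that box.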

Demo
Let's try it!

3 Results

Results: Human and mouse orthologs assignment
- Assignment on whole genomes
- Homologs identified with MSOAR hit graph and clustering

                     MSOAR    LCS      Common Intervals
Matched pairs        13218    13380    13301
True positives       9214     9227     9186
False positives      2240     2357     2327
% True positives     70%      69%      69%
% False positives    17%      18%      17%

- MSOAR: General Matching (Z. Fu, X. Chen, V. Vacic, P. Nan, Y. Zhong, T. Jiang, 2006)
- LCS: Longest Common Substrings (G. Blin, C. Chauve, G. Fertin, 2005)
- True positives: genes with the same Uniprot name

- MSOAR is the most accurate
- Results are comparable
- Looking for large conserved structures is a valid approach
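The percentage rows in this table appear to be computed relative to the number of matched pairs. That denominator is inferred from the numbers rather than stated on the slide. A quick check, with the counts copied from the table:

```python
# Counts copied from the human and mouse table.
table = {
    "MSOAR": {"matched": 13218, "tp": 9214, "fp": 2240},
    "LCS": {"matched": 13380, "tp": 9227, "fp": 2357},
    "Common Intervals": {"matched": 13301, "tp": 9186, "fp": 2327},
}

def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)

# Assumption: both rates use the matched pairs as denominator.
rates = {method: (percent(row["tp"], row["matched"]),
                  percent(row["fp"], row["matched"]))
         for method, row in table.items()}
print(rates)
# {'MSOAR': (70, 17), 'LCS': (69, 18), 'Common Intervals': (69, 17)}
```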

Results: assignment on bacterial genomes
Assignment of orthologs on 8 γ-proteobacteria:
- Buchnera aphidicola APS
- Escherichia coli K12
- Haemophilus influenzae Rd
- Pasteurella multocida Pm70
- Pseudomonas aeruginosa PA01
- Salmonella typhimurium LT2
- Xylella fastidiosa 9a5c
- Yersinia pestis CO_92
Homologs identified by BLAST and clustering
All 28 pairwise matchings
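The count of 28 is the number of unordered pairs among 8 genomes. A short enumeration, using the genome names from the list above:

```python
from itertools import combinations

genomes = [
    "Buchnera aphidicola APS",
    "Escherichia coli K12",
    "Haemophilus influenzae Rd",
    "Pasteurella multocida Pm70",
    "Pseudomonas aeruginosa PA01",
    "Salmonella typhimurium LT2",
    "Xylella fastidiosa 9a5c",
    "Yersinia pestis CO_92",
]

# Every unordered pair of distinct genomes gets one pairwise matching.
genome_pairs = list(combinations(genomes, 2))
print(len(genome_pairs))  # 28
```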

Consistent Components
- Gene matching defines a graph on the 8 genomes
- A connected component is consistent if it contains at most one gene in each genome
- A perfect component is a consistent component that contains only true positives
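These definitions can be sketched on a small graph. The encoding is an assumption: a gene is a (genome, gene id) tuple, an edge is a matched pair, and a pair counts as a true positive when both genes carry the same name in a hypothetical `names` dictionary.

```python
from collections import defaultdict

def connected_components(edges):
    """Connected components of the matching graph, found by depth-first search."""
    adjacent = defaultdict(set)
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    seen, components = set(), []
    for start in sorted(adjacent):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacent[node] - seen:
                seen.add(neighbor)
                stack.append(neighbor)
        components.append(component)
    return components

def is_consistent(component):
    """At most one gene per genome."""
    genomes = [genome for genome, _ in component]
    return len(genomes) == len(set(genomes))

def is_perfect(component, edges, names):
    """Consistent, and every matched pair inside it is a true positive."""
    return is_consistent(component) and all(
        names[u] == names[v] for u, v in edges if u in component)

# Toy matching over three genomes A, B, C (assumed data).
edges = [(("A", 1), ("B", 1)), (("B", 1), ("C", 1)),
         (("A", 2), ("B", 2)), (("B", 2), ("A", 3))]
names = {("A", 1): "recA", ("B", 1): "recA", ("C", 1): "recA",
         ("A", 2): "dnaK", ("B", 2): "dnaK", ("A", 3): "dnaK"}
components = connected_components(edges)
flags = [(is_consistent(c), is_perfect(c, edges, names)) for c in components]
print(flags)  # [(True, True), (False, False)]
```

The second component is inconsistent because it contains two genes from genome A, even though all its pairs share a name.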

Results: assignment on bacterial genomes

                     LCS      Common Intervals
True positives       19142    18875
False positives      3045     3324
% True positives     86%      85%
Components           3439     3539
Consistent           2907     3117
% Consistent         85%      88%
TP in a CC           14954    10729
Perfect Comp.        1531     1628
% Perfect Comp.      53%      52%

- LCS: Longest Common Substrings (G. Blin, C. Chauve, G. Fertin, 2005)
- True positives: genes with the same Uniprot name
- LCS is more accurate but less consistent

Note on low complexity
Gene content vs area

Results: assignment on bacterial genomes

                     LCS      CI       Filtered LCS    Filtered CI
True positives       19142    18875    10553           10420
False positives      3045     3324     789             1054
% True positives     86%      85%      93%             90%
Components           3439     3539     3606            3480
Consistent           2907     3117     3537            3382
% Consistent         85%      88%      98%             97%
TP in a CC           14954    10729    14954           17180
Perfect Comp.        1531     1628     1538            1962
% Perfect Comp.      53%      52%      43%             58%

- Filter with side 3
- Consistency increases with filtering
- LCS has a lower perfect component ratio with filtering
- Common intervals has a higher perfect component ratio with filtering
- There are 263 perfect components of size 8 with filtering

4 Conclusion

Conclusion
- The Right Thing is hard
- Simple heuristics can't handle natural shuffling
- Segments with similar gene content are likely to be related
- Common intervals are an efficient technique to locate them
- Future work: handle gaps, smart filtering, etc.

Discussion
Note to self: stop here unless you have extra time!

Cigal: The program 1/3

Cigal: The program 2/3

Cigal: The program 3/3

Human and mouse: with filtering
- Filtered vs raw results
- Minimum side of length 3

                     Common Intervals    Filtered Common Intervals
Matched pairs        13301               12394
True positives       9186                8792
% True positives     69%                 71%
False positives      2327                1996
% False positives    17%                 16%

                     LCS      Filtered LCS
Matched pairs        13380    11764
True positives       9227     8491
% True positives     69%      72%
False positives      2357     1749
% False positives    18%      15%