6.095/6.895 - Computational Biology: Genomes, Networks, Evolution Lecture 18 Nov 10, 2005 Evolution by duplication Somewhere, something went wrong
Challenges in Computational Biology
[Figure: course overview. Recoverable topic labels: 4 Genome assembly; Regulatory motif discovery; Gene finding; DNA sequence alignment; Database lookup; 7 Evolutionary theory; 8 Comparative genomics; RNA folding; 9 Gene expression analysis; 10 Cluster discovery; Gibbs sampling; 12 Protein network analysis; 13 Regulatory network inference; 14 Emerging network properties]
Open questions (?)
[Images removed due to copyright restrictions.]
 - Panda: bear or raccoon?
 - Out of Africa: mitochondrial evolution story?
 - Human evolution: Did we ever meet Neanderthals?
 - Primate evolution: Are we chimp-like or gorilla-like?
 - Vertebrate evolution: How did complex body plans arise?
 - Recent evolution: What genes are under selection?
What we have learned
 - Phylogenetic trees
    - Distance-based methods: UPGMA, Neighbor-Joining
    - Alignment-based methods: Parsimony (set-based, dynamic programming)
 - Evolution by nucleotide mutation
    - Probability of back-mutation
    - Markov chain
    - Models of evolution: Jukes-Cantor, Kimura 2-parameter model
 - Evolution by rearrangements
    - Sorting by reversals
    - Signed / unsigned version & approximation algorithms
Today's goals: Evolution by Duplication
 - Detecting gene duplication
    - Orthologs and paralogs
    - Gene trees and species trees
    - Reconciliation
 - Detecting genome duplication
    - Evidence across species
    - Evidence in a single species
 - Duplicate gene evolution
    - Detect accelerated divergence
    - Measuring positive selection
    - Gene conversion
Determining orthologs and paralogs
Orthologs and paralogs
[Figure: gene tree across human, mouse, rat, dog, and rabbit, with orthologs and paralogs marked]
 - Orthologs arise by speciation and typically keep the same function
 - Paralogs arise by duplication and typically take on new functions
 - Ortholog identification is a prerequisite to genomic studies
Why are orthologs & paralogs important?
 - Comparative genomics relies on correct orthology
    - Signal discovery by orthologous conservation
 - Evolutionary genomics relies on complete mapping
    - Duplicated regions are also the most interesting ones
Image removed due to copyright restrictions. Please see: Kellis, Manolis, Bruce W. Birren, and Eric S. Lander. "Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae." Nature 428 (April 2004): 617-624.
Whole-genome duplication in yeast, fish, and vertebrates
Challenges in genome-wide orthology
 - Tens of thousands of genes
    - Many paralogous families precede species divergence
    - Single phylogeny is impossible: not enough traits
 - Abundant duplication and loss
    - Protein family expansions
    - Gene conversion, loss, inactivation
 - Spurious matches
    - Common domains in unrelated proteins
    - Similarity not always due to common ancestry
 - Noisy data
    - Varying rates of mutation (gene & species)
    - Pseudogenes, incorrect/incomplete gene models
Goal: Systematic ortholog identification across multiple, complete, mammalian genomes
Current methods for ortholog finding
 - Pair-wise sequence comparison
    - Best bi-directional BLAST hits
    - Focuses on one-to-one orthologs (no duplications)
 - Hit clustering methods
    - Detect clusters in graph of pair-wise hits
    - Difficult to separate large connected components
 - Synteny methods
    - Detect conserved regions, stretches of nearby hits
    - Genome alignment methods focus on best hits
 - Phylogenetic methods
    - Phylogeny of family clusters orthologs near each other
    - Traditionally applied to specific families (not genome-wide)
 - Current methods successful in limited datasets
 - Complete mammalian genomes present new challenges
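The best bi-directional hit (BBH) criterion from the slide above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the score dictionaries, the gene names (`h1`, `m1`, ...) and the score values are hypothetical stand-ins for parsed BLAST output.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target gene.

    scores: dict {(query, target): score}; assumed to come from parsed
    pair-wise hits (hypothetical input format).
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical toy scores: h2 and h3 both hit m2 best, but m2 prefers h2.
human_vs_mouse = {("h1", "m1"): 950, ("h1", "m2"): 400,
                  ("h2", "m2"): 700, ("h3", "m2"): 650}
mouse_vs_human = {("m1", "h1"): 940, ("m2", "h2"): 710, ("m2", "h3"): 640}

bbh_pairs = bidirectional_best_hits(human_vs_mouse, mouse_vs_human)
```

In this toy input `h3` is left unpaired even though it is a strong hit to `m2`, which illustrates why BBH focuses on one-to-one orthologs and misses duplications.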
Algorithm: SynPhyl
Images removed due to copyright restrictions.
Combine synteny and phylogeny to find orthologs
 - Initial gene family construction
 - Build phylogenetic trees within families
 - Reconcile gene trees to determine orthology
Building Meaningful Gene Families
Step 1. Initial gene family construction
Challenge: How to keep cluster sizes balanced
 - Limitations of traditional clustering methods
    - UPGMA, k-means, graph-partitioning lead to imbalance
    - Bi-partitioning methods lead to arbitrary midway splitting
 - SynPhyl approach:
    a. Seed clusters with unambiguous hits
    b. Extend clusters in gene pulling step
    c. Refine clusters in phylogeny step
 - Result: balanced clusters
Step 1. Initial gene family construction
(1) Initial cluster seeds from unambiguous matches
 - Syntenic orthologs
 - Multi-species significant BBH
[Figure: BBH component linking human, dog, mouse, and rat genes into initial gene clusters]
Step 2. Cluster extension
(1) Initial cluster seeds from unambiguous matches
(2) Cluster extension
 - Pull unassigned genes to existing clusters
 - Ensure distance of new gene is within cluster distribution
[Figure: unassigned genes pulled into initial gene clusters]
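One way to read the "gene pulling" step is sketched below. The slide only says that the new gene's distance must fall within the cluster's distance distribution; the specific rule used here (mean distance to the nearest cluster must not exceed the within-cluster mean plus two standard deviations) is an assumption made for illustration, as are the function names and the toy one-dimensional "distances". Clusters are assumed to have at least two members.

```python
from itertools import combinations
from statistics import mean, pstdev


def within_cluster_limit(members, dist, n_sd=2.0):
    """Upper limit on acceptable distance: mean + n_sd * sd of pairwise distances."""
    pair_d = [dist(a, b) for a, b in combinations(members, 2)]
    return mean(pair_d) + n_sd * pstdev(pair_d)


def pull_gene(gene, clusters, dist, n_sd=2.0):
    """Return the name of the cluster that accepts `gene`, or None."""
    best_name, best_d = None, None
    for name, members in clusters.items():
        d = mean(dist(gene, m) for m in members)  # average distance to cluster
        if best_d is None or d < best_d:
            best_name, best_d = name, d
    if best_name is None:
        return None
    limit = within_cluster_limit(clusters[best_name], dist, n_sd)
    return best_name if best_d <= limit else None


# Hypothetical toy data: each gene is a point on a line, distance = |x - y|.
coords = {"d1": 0.0, "h1": 0.1, "m1": 0.3, "r1": 0.35,
          "d2": 5.0, "h2": 5.2, "m2": 5.3,
          "x": 0.2, "y": 2.6}


def toy_dist(a, b):
    return abs(coords[a] - coords[b])


clusters = {"A": ["d1", "h1", "m1", "r1"], "B": ["d2", "h2", "m2"]}
assigned_x = pull_gene("x", clusters, toy_dist)  # sits inside cluster A
assigned_y = pull_gene("y", clusters, toy_dist)  # far from both clusters
```

Genes that are rejected stay unassigned, which keeps a distant gene from inflating an existing cluster.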
Step 3. Phylogenetic reconstruction
(1) Initial cluster seeds from unambiguous matches
(2) Cluster extension
(3) Phylogenetic reconstruction
 - Phylogeny for each cluster
    - Align each cluster (MUSCLE protein alignment)
    - Neighbor-Joining: fast, distance-based (JTT model)
    - Bootstrapping used as confidence measure; bootstrap values propagate to the orthology assignments
 - Use phylogeny to further separate clusters
    - Reconciliation
 - Four mammals: 78,744 genes, 17,586 trees, largest: 103 genes
 - Ten fungi: 54,890 genes, 5,537 trees, largest: 164 genes
[Figure: extended gene clusters; example tree with bootstrap values of 80%, 60%, 90%, 90%]
Bootstrap confidence scores
 - Bootstrapping:
    - Sample columns from the alignment randomly, with replacement
    - Build trees based on these columns (NJ, ML, MP)
    - Repeat 100 times
 - For every internal branch
    - Count how many topologies agree with the inferred split
    - Percentage is the bootstrap confidence score
 - Building a final tree
    - Full tree, using all the data
    - Consensus tree
[Figure: gene cluster → alignment → sample with replacement → tree, repeated 100 times]
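The bootstrap procedure can be sketched for the smallest interesting case, four sequences, where the unrooted tree has a single internal branch. To keep the example short, the tree builder is replaced by a simple stand-in (pick the pairing of taxa with the smallest summed Hamming distance); this is an assumption for illustration and is not NJ, ML, or MP. The alignment is made up.

```python
import random


def hamming(s, t):
    """Number of mismatching positions between two aligned sequences."""
    return sum(1 for x, y in zip(s, t) if x != y)


def quartet_split(aln):
    """Choose one of the three splits of four taxa (stand-in tree builder)."""
    a, b, c, d = sorted(aln)
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

    def cost(pairing):
        (w, x), (y, z) = pairing
        return hamming(aln[w], aln[x]) + hamming(aln[y], aln[z])

    best = min(pairings, key=cost)
    return frozenset(frozenset(side) for side in best)


def resample_columns(aln, rng):
    """Draw alignment columns with replacement, keeping the original length."""
    names = list(aln)
    length = len(aln[names[0]])
    cols = [rng.randrange(length) for _ in range(length)]
    return {n: "".join(aln[n][i] for i in cols) for n in names}


def bootstrap_support(aln, n_reps=100, seed=0):
    """Return the full-data split and the percentage of replicates that agree."""
    rng = random.Random(seed)
    full = quartet_split(aln)
    agree = sum(quartet_split(resample_columns(aln, rng)) == full
                for _ in range(n_reps))
    return full, 100.0 * agree / n_reps


# Hypothetical alignment: five columns group dog+human vs mouse+rat,
# the last column conflicts with that grouping.
alignment = {"dog":   "AAAAAAAAAA",
             "human": "AAAAAAAAAT",
             "mouse": "CCCCCAAAAA",
             "rat":   "CCCCCAAAAT"}

split, support = bootstrap_support(alignment)
```

With more than four taxa the same loop applies, but agreement is counted separately for every internal branch of the full-data tree.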
Phylogenetic Tree Reconciliation: Gene Tree ↔ Species Tree
Gene Tree / Species Tree reconciliation
[Figure: known species tree with two gene trees, G1 and G2]
 - G1: Each species contains each subfamily
    - Easy to infer duplication events
 - G2: Loss events in each family hide complex ancestry
    - Reconciliation with species tree recovers the events
Reconciliation to determine orthology
 - Reconcile each gene tree to the species tree
    - Each node in gene tree maps to node in species tree
 - Read off orthology and paralogy
    - Infer gene duplication and loss events
[Figure: gene tree with leaves d1, h1, m1, r1, m2, r2 mapped onto the species tree of dog, human, chimp, mouse, rat; annotations mark a gene loss in chimp and a gene duplication in the rodent ancestor]
Reconciliation algorithm
 - For every node g with children a and b, decide duplication or speciation
    - Map left child to species tree → M(a). Map right child to species tree → M(b)
    - M(g) is the least common ancestor of M(a) and M(b)
 - After mapping:
    - g is a duplication node if M(g) = M(a) or M(g) = M(b)
    - g is a speciation node if M(g) is distinct from the mappings of both children
 - Post-processing: count loss edges
 - Limitation: Reconciliation assumes correct species tree
    - Generally NOT the case
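The mapping rule above can be sketched in a few lines. The data layout is an assumption chosen for brevity: the species tree is a child-to-parent dictionary, gene-tree leaves are strings of the form `species_copy`, and internal gene-tree nodes are 2-tuples. The species tree topology and clade names used here (dog as outgroup to primates and rodents) are also assumptions for the example. Loss counting is not implemented.

```python
def ancestors(node, parent):
    """Path from a species-tree node up to the root (inclusive)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(a, b, parent):
    """Least common ancestor of two species-tree nodes."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def reconcile(gene_tree, parent, events):
    """Post-order LCA mapping; appends (node, M(node), kind) for internal nodes."""
    if isinstance(gene_tree, str):           # leaf such as "mouse_2"
        return gene_tree.split("_")[0]
    left, right = gene_tree
    m_a = reconcile(left, parent, events)
    m_b = reconcile(right, parent, events)
    m_g = lca(m_a, m_b, parent)
    kind = "duplication" if m_g in (m_a, m_b) else "speciation"
    events.append((gene_tree, m_g, kind))
    return m_g


# Assumed species tree: (dog, ((human, chimp), (mouse, rat))).
SPECIES_PARENT = {"human": "primates", "chimp": "primates",
                  "mouse": "rodents", "rat": "rodents",
                  "primates": "primates+rodents", "rodents": "primates+rodents",
                  "primates+rodents": "mammals", "dog": "mammals"}

# Gene tree with leaves d1, h1, m1, r1, m2, r2 (no chimp gene present).
gene_tree = ("dog_1", ("human_1", (("mouse_1", "rat_1"), ("mouse_2", "rat_2"))))

events = []
root_species = reconcile(gene_tree, SPECIES_PARENT, events)
duplications = [(node, sp) for node, sp, kind in events if kind == "duplication"]
```

On this input the only duplication node maps to the rodent ancestor; inferring the chimp loss would require the loss-edge post-processing step, which this sketch leaves out.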
Mammalian tree: Abundance of alternate tree topologies
 - Most trees are incorrect
    - Count most frequent subtrees of size four
    - Correct species tree is a minority (<20%)
 - Reason: Long branch attraction
    - Due to rapidly evolving rodent lineage
    - Common phylogenetic reconstruction problem
 - What happens to reconciliation?
Reconciliation with erroneous trees
[Figure: gene trees and species tree over D, H, M, R; reconciling the erroneous gene tree implies a duplication]
 - With erroneous trees: direct reconciliation leads to spurious duplications & losses
 - Solution: Use species tree to constrain gene tree
Towards better reconciliation methods
 - Full solution: Maximize joint likelihood
    - Incorporate cost of reconciliation in tree building
    - Tradeoff: nucleotide mutations & gene duplication/loss
 - One solution: Partitioning by Reconciliation
    - Key insight: most errors are on older branches, irrelevant to orthology
    - Use species tree to partition gene tree
    - Allow re-rooting of each partition based on species tree
    → Apply reconciliation algorithm to each partition
[Figure: gene tree with leaves d1, h1, m1, r1, d2, h2, m2, r2, m3, r3 and a new root; two candidate topologies (Topology 1, Topology 2); species tree of dog, human, mouse, rat]
Step 4: Partitioning by reconciliation
(1) Initial cluster seeds
(2) Cluster extension
(3) Phylogenetic reconstruction
(4) Partitioning by reconciliation
[Flowchart labels: Gene clusters; Phylogeny; Unrooted trees; Partition; Partitioned trees; Select root; Rooted trees; Reconciliation; Bootstrapping loop (repeat 100 times); Ortholog assignments with confidence score]
Putting it all together: SynPhyl
[Flowchart labels: Gene annotations; Genome synteny; Initial clustering; Gene family clusters; Phylogeny; Unrooted trees; Partition; Partitioned trees; Select root; Rooted trees; Reconciliation; Bootstrapping loop (repeat 100 times); Assign orthology with confidence scores; Ortholog and paralog database]
Benchmarks and Results
Results: Mammalian comparisons
 - Compare human, mouse, rat, dog complete genomes
    - Coverage: 75,753 genes
    - Number of groups: 18,446 (of which 13,741 have all four species)
    - One-to-one orthologs in four species: 12,359
 - Contribution of phylogenetic reconstruction
    - More one-to-one orthologs: 11,619 → 12,359
    - Large families split into small groups: 17,586 → 18,446

Count of ortholog groups by species (Figure by MIT OCW)

Species present              # Groups
Dog   Human   Mouse   Rat      13,741
-     Human   Mouse   Rat         752
Dog   -       Mouse   Rat         457
Dog   Human   -       Rat         270
Dog   Human   Mouse   -         1,073
-     -       Mouse   Rat         502
Dog   Human   -       -           361
Dog   -       Mouse   -           101
Dog   -       -       Rat          97
-     Human   Mouse   -            75
-     Human   -       Rat          41
Higher resolution: resolving fine-grain correspondence
Higher sensitivity: recognize subtle duplication events
 - Additional duplicates found for ENSEMBL 1-to-1 orthologs
 - Hundreds of additional duplicates detected
 - Confirmed by branch lengths and topology

Species compositions: gene copies per species (Figure by MIT OCW)

Dog   Human   Mouse   Rat    Count
1     1       1       1      10,444
1     2       1       1         225
1     1       2       1         214
1     1       1       2         205
2     1       1       1         106
2     2       2       2          59
1     1       2       2          39
1     2       2       2          37
Other                           340
SynPhyl comparison to direct reconciliation
 - Fewer gene losses
 - Fewer gene duplications

                               Direct reconciliation    SynPhyl reconciliation
Total count of losses:         18,352                   11,750
Total count of duplications:   10,114                   8,942

 - More gene trees reconcile to species tree
 - Gene duplications and losses dramatically decreased
Result: Genome-wide correspondence of multiple species Image removed due to copyright restrictions.
Summary / Contributions
 - SynPhyl: new tool for genome-wide orthology
    - Uses synteny, phylogeny, and known species tree
    - Automatically determines orthologs and paralogs
    - Returns ortholog assignments, trees for each family
 - Algorithmic highlights
    - Initial clustering constrained by synteny
    - Fine-grain correspondence uses phylogeny
    - Partition by reconciliation constrained by species trees
 - Advantages of the algorithm
    - Practical, fast (< ½ day on a PC)
    - Uses information available: phylogeny, synteny
    - Confidence metric: bootstrap values propagate to orthology
    - Phylogeny ensures consistent orthologs (no over-collapsing)
 - Performance
    - Successfully applied to mammals, fungi
    - Fine-grain resolution: phylogeny disambiguates large families
    - High sensitivity: captures all duplication events
Genome Duplication
A range of evolutionary distances
[Figure: species tree of S. cerevisiae, S. paradoxus, S. mikatae, S. bayanus, and K. waltii, with divergence-time labels of 5 Myr, 20 Myr, and 100 Myr]
Ability to ask a different set of questions
Gene correspondence Image removed due to copyright restrictions. Please see: Kellis, Manolis, Bruce W. Birren, and Eric S. Lander. "Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae." Nature 428 (April 8, 2004): 617-624.
Signatures of evolutionary events
Image removed due to copyright restrictions. Please see: Kellis, Manolis, Bruce W. Birren, and Eric S. Lander. "Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae." Nature 428 (April 8, 2004): 617-624.
 - Few genes remain in 2 copies
 - Gene interleaving is evidence of complete duplication
Duplicate mapping tiles K. waltii Image removed due to copyright restrictions. Please see: Figure 3 in Kellis, Manolis, Bruce W. Birren, and Eric S. Lander. "Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae." Nature 428 (April 8, 2004): 617-624.
Duplicate mapping of centromeres Image removed due to copyright restrictions. Please see: Figure 2 in Kellis, Manolis, Bruce W. Birren, and Eric S. Lander. "Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae." Nature 428 (April 8, 2004): 617-624. Recognize sister regions solely based on gene order
Conclusion: Whole Genome Duplication has happened Image removed due to copyright restrictions. Please see: Figure 1 in Kellis, Manolis, Bruce W. Birren, and Eric S. Lander. "Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae." Nature 428 (April 8, 2004): 617-624.
Whole Genome Duplications are everywhere!
[Images removed due to copyright restrictions.]
 - Yeast Duplication
    - Most genes 1-to-1 mapping
    - Gene interleaving evidence of duplication
    - Complete tiling of the genome
 - Vertebrate Duplication in Fish
    - Fish: Gene order not conserved, only chromosomes
    - Mammals: Gene order conserved, not chromosomes
 - Two rounds of WGD in base of vertebrate lineage
    - Build clusters of related genes (use Ciona as outgroup)
    - Count duplications by reconciliation
    - Find regions of duplicate overlap → 4-way synteny
Genome duplication evidence in a single species
Evidence of duplication using a single genome?
Image removed due to copyright restrictions. Please see: Figure 1 in Kellis, Manolis, Bruce W. Birren, and Eric S. Lander. "Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae." Nature 428 (April 8, 2004): 617-624.
 - Genomic evidence
    - Conserved order of paralogous genes
    - Same transcriptional orientation
 - However
    - Interspersed with single-copy genes
 - Interpretation: Genome duplication followed by gene loss
Whole genome duplication is controversial
Image removed due to copyright restrictions. Please see: Figure 1 in Kellis, Manolis, Bruce W. Birren, and Eric S. Lander. "Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae." Nature 428 (April 8, 2004): 617-624.
 - Insufficient evidence
    - Only 50% of genome in duplicate regions
    - Only 8% of genes present in two copies
    - Extensive redundancy outside duplicate regions
 - Evidence against WGD
    - Divergence-based dating shows multiple times
    - Other species have similar level of redundancy
 - Alternative evolutionary scenario proposed
    - Independent segmental duplications
    - Also consistent with the evidence
 - Published positions
    - There was a whole-genome duplication. Wolfe, Nature 97
    - There was no whole-genome duplication. Dujon, FEBS 2000
    - At least some chrom dup. occurred independently. Langkjaer, JMB, 2000
    - Dynamic equilibrium of duplications and loss. Llorente, FEBS, 2000
    - Recent evidence supports single event. Wong, PNAS 02
    - Continuous block duplications and deletions. Dujon, Yeast 2003
    - Dup. precedes divergence from Kluyveromyces. Piskur, Nature, 2003
    - Telomere-mediated duplication events. Coissac, Mol Bio Evo 1997
    - Multiple closely spaced events. Friedman, Genome Res, 2003
    - Spontaneous duplication of large chromosomal segments. Koszul, EMBO 04
 - Evidence remains inconclusive
Conclusion: Whole Genome Duplication has happened Image removed due to copyright restrictions. Please see: Figure 1 in Kellis, Manolis, Bruce W. Birren, and Eric S. Lander. "Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae." Nature 428 (April 8, 2004): 617-624.
Post-duplication evolution
Whole-genome duplication results in 500 new genes
[Figure: number of genes vs. time, from the WGD (100 Myrs ago) to today. Labels: 5,000; 10,000; gene loss; 5,500; ~500 gained]
Evidence of accelerated gene evolution
Fate of duplicated genes
 - 457 genes kept in two copies, result of selection
 - Involved in sugar metabolism and fermentation
[Figure: tree with a WGD node leading to S. cerevisiae copy 1 and S. cerevisiae copy 2, with K. waltii as outgroup]
Evidence of accelerated protein divergence?
Measuring accelerated divergence
 - Protein divergence
    - Count amino-acid changes
    - Use BLOSUM substitution matrix
 - Nucleotide divergence
    - Count nucleotide substitutions
    - Correct for back-mutations
    - Use transition/transversion evolutionary model
 - dN / dS
    - Two types of nucleotide substitutions
       - S = synonymous: preserve amino-acid translation
       - N = non-synonymous: change amino-acid
    - Count synonymous / non-synonymous sites
    - Depends on path taken between two codons
[Figure: two shortest paths possible from GTT (V: Val) to TTA (L: Leu). Path 1 goes through TTT (F: Phe); path 2 goes through GTA (V: Val)]
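The path dependence in the GTT to TTA example can be checked with a short sketch that enumerates every shortest path between two codons and classifies each single-nucleotide step as synonymous or non-synonymous under the standard genetic code. Averaging over paths with equal weight is an assumption made here for illustration. The sketch counts substitutions only: it does not count synonymous and non-synonymous sites, exclude paths through stop codons, or correct for multiple hits, all of which a full dN/dS estimate needs.

```python
from itertools import permutations, product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_paths(codon1, codon2):
    """List (path, synonymous_steps, nonsynonymous_steps) for each shortest path."""
    diff = [i for i in range(3) if codon1[i] != codon2[i]]
    results = []
    for order in permutations(diff):          # order in which positions change
        current, syn, nonsyn, path = codon1, 0, 0, [codon1]
        for pos in order:
            nxt = current[:pos] + codon2[pos] + current[pos + 1:]
            if CODON_TABLE[nxt] == CODON_TABLE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = nxt
            path.append(nxt)
        results.append((path, syn, nonsyn))
    return results


def average_counts(codon1, codon2):
    """Average synonymous and non-synonymous step counts over all shortest paths."""
    paths = count_paths(codon1, codon2)
    n = len(paths)
    return (sum(p[1] for p in paths) / n, sum(p[2] for p in paths) / n)


paths_gtt_tta = count_paths("GTT", "TTA")
avg_syn, avg_nonsyn = average_counts("GTT", "TTA")
```

For GTT to TTA, the path through TTT has two non-synonymous steps, while the path through GTA has one synonymous and one non-synonymous step, so the equal-weight average is 0.5 synonymous and 1.5 non-synonymous substitutions.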
Scenarios for rapid gene evolution
 - One copy faster (Ohno, 1970)
   [Tree: Scer copy2, Scer copy1, Kwal]
 - Both copies faster (Lynch, 2000)
   [Tree: Scer copy1, Kwal, Scer copy2]
 - 20% of duplicated genes show acceleration
 - 95% of cases: Only one copy faster
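A simple way to see which copy is faster is to convert the three pairwise distances among copy 1, copy 2, and the outgroup into branch lengths on the three-taxon tree. The sketch below uses the standard three-point formulas; it is an illustration under the assumption that the distances are additive, not necessarily the test used in the lecture, and the distance values are made up.

```python
def three_taxon_branches(d12, d1o, d2o):
    """Branch lengths for copy 1, copy 2, and the outgroup on a star-shaped tree.

    d12: distance between the two copies
    d1o, d2o: distance from each copy to the outgroup (e.g. K. waltii)
    """
    b1 = (d12 + d1o - d2o) / 2.0
    b2 = (d12 + d2o - d1o) / 2.0
    bo = (d1o + d2o - d12) / 2.0
    return b1, b2, bo


def acceleration_ratio(d12, d1o, d2o):
    """Ratio of the longer copy branch to the shorter one."""
    b1, b2, _ = three_taxon_branches(d12, d1o, d2o)
    slow, fast = sorted((b1, b2))
    return fast / slow if slow > 0 else float("inf")


# Hypothetical distances: copy 1 is much further from the outgroup than copy 2.
b1, b2, bo = three_taxon_branches(d12=0.5, d1o=0.6, d2o=0.3)
ratio = acceleration_ratio(d12=0.5, d1o=0.6, d2o=0.3)
```

A ratio near 1 means both copies diverged at similar rates since the duplication; a large ratio matches the "one copy faster" scenario. Detecting the "both copies faster" scenario needs an additional reference rate, which this sketch does not model.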
Emerging gene functions after duplication
 - Origin of replication → silencing (4-fold acceleration)
    - Scer Orc1 (origin of replication)
    - Kwal Orc1
    - Scer Sir3 (silencing)
 - Translation initiation → anti-viral defense (3-fold acceleration)
    - Scer Hbs1 (translation initiation)
    - Kwal Hbs1
    - Scer Ski7 (anti-viral defense)
 - Asymmetric divergence → recognize ancestral / derived
Distinct functional properties

                 Ancestral function    Derived function
Gene deletion    Lethal (20%)          Never lethal
Expression       Abundant              Specific (stress, starvation)
Localization     General               Specific (mitochondrion, spores)

Gain new function and lose ancestral function
Gene conversion
Decelerated evolution
[Tree: Scer copy1, Scer copy2, Kwal]
 - 60 gene pairs (13% of 457 pairs)
    - 98% protein identity (all pairs: 55%)
    - 90% identity in 4-fold degenerate sites (all pairs: 41%)
 - Not recent duplication
    - Gene order argues for ancestral WGD pairs
 - Gene conversion?
Evidence of gene conversion
[Tree: WGD at the root; leaves YBL072C S. cerevisiae, YER102W S. cerevisiae, YBL072C S. bayanus, YER102W S. bayanus, K. waltii, A. gossypii]
 - Tree root reveals time of duplication
    - No acceleration in the K. waltii branch
    - The two genes have recently replaced each other
 - Branching order reveals gene conversion
    - Paralogs are closer to each other than to their ortholog
    - Both S. cerevisiae and S. bayanus show gene conversion
 - Periodic gene conversion
Summary
 - Detecting gene duplication
    - Orthologs and paralogs
    - Gene trees and species trees
    - Reconciliation
 - Detecting genome duplication
    - Evidence across species
    - Evidence in a single species
 - Duplicate gene evolution
    - Detect accelerated divergence
    - Measuring positive selection
    - Gene conversion