Comparative Genomics II
Advances in Bioinformatics and Genomics, GEN 240B
Jason Stajich
May 19
Outline
- Introduction
- Gene Families
- Pairwise Methods
- Phylogenetic Methods
- Gene tree reconciliation
- Databases of Orthologs
- References
Gene Families
Orthology inference and gene family identification
How to cluster genes by similarity? We want to uncover paralogy and orthology relationships.
Approaches:
- Single-linkage
- Markov clustering
- Phylogenetic approaches
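As a rough illustration of the single-linkage approach listed above, the sketch below joins any two genes whose similarity score meets a cutoff and reports the connected groups. The gene names, scores, and cutoff are invented for illustration, and `single_linkage` is a hypothetical helper, not part of any tool named in these slides.

```python
def single_linkage(genes, scores, cutoff):
    """Group genes into connected components of the 'score >= cutoff' graph."""
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent pointers to the representative, shortening the path as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), score in scores.items():
        if score >= cutoff:
            parent[find(a)] = find(b)  # one hit above the cutoff is enough to link

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical pairwise similarity scores (for example BLAST bit scores).
genes = ["g1", "g2", "g3", "g4", "g5"]
scores = {
    ("g1", "g2"): 200.0,
    ("g2", "g3"): 150.0,
    ("g3", "g4"): 40.0,   # below the cutoff, so the chain breaks here
    ("g4", "g5"): 300.0,
}
families = single_linkage(genes, scores, cutoff=100.0)
print(families)
```

Because a single link is enough to merge two groups, g1 and g3 end up together even though they were never compared directly. That same property is what lets one shared domain chain unrelated families together.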
Orthology prediction methods
Pairwise Methods
Pairwise methods
- Best Bidirectional Hits (A ↔ B)
- Single linkage
- COGs
- InParanoid & OrthoMCL
Best Bidirectional Hits (BBH)
All pairs of proteins with reciprocal best hits are considered orthologs. Note that this method is unable to predict the orthology with the yellow protein.
Pro: Intuitive and fast
Con: Promiscuous domains can lead to over-connecting
Con: Requires a single cutoff for establishing linkages
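The reciprocal-best-hit rule can be sketched in a few lines. This is a minimal illustration, assuming the search results are already available as score dictionaries. The protein names and scores are made up, and `best_hits` and `bidirectional_best_hits` are hypothetical helpers.

```python
def best_hits(scores):
    """Map each query to its single top-scoring subject.

    scores: {(query, subject): similarity score}. Ties keep the first hit seen.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: A2 is a paralog of A1, so its best hit B1 prefers A1.
a_vs_b = {("A1", "B1"): 250.0, ("A1", "B2"): 90.0,
          ("A2", "B1"): 180.0, ("A2", "B2"): 60.0}
b_vs_a = {("B1", "A1"): 245.0, ("B1", "A2"): 175.0,
          ("B2", "A1"): 88.0, ("B2", "A2"): 61.0}
pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Only A1 and B1 are paired. A2 hits B1 best, but B1 does not return the hit, so A2 is left out, much like the yellow protein on the slide.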
Clusters of Orthologous Genes (COG)
Proteins in the nodes of triangular networks of BBHs are considered orthologs (green, red and yellow protein 1). New proteins are added to the orthologous group if they are present in BBH triangles that share an edge with a given cluster. The COG-like approach can add additional proteins from the same genome if they are more similar to each other than to proteins in other genomes, or if they form BBH triangles with members of the cluster. This is not the case for yellow protein 2, which is, again, misclassified.
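The triangle rule can be sketched as follows: find all triangles in the BBH graph, then merge triangles that share an edge (two proteins in common). This is a simplified illustration only. It leaves out the same-genome paralog step described above, the edge list is invented, and the function names are hypothetical.

```python
from itertools import combinations


def bbh_triangles(edges):
    """Return every set of three proteins that are pairwise connected by a BBH."""
    edge_set = {frozenset(e) for e in edges}
    nodes = sorted({n for e in edge_set for n in e})
    return [frozenset(trio) for trio in combinations(nodes, 3)
            if all(frozenset(pair) in edge_set for pair in combinations(trio, 2))]


def cog_like_clusters(edges):
    """Merge BBH triangles that share an edge into orthologous groups."""
    triangles = bbh_triangles(edges)
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) == 2:  # two shared proteins = shared edge
            parent[find(i)] = find(j)

    groups = {}
    for i, tri in enumerate(triangles):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical BBH edges; the letter stands for the genome each protein comes from.
bbh_edges = [("A1", "B1"), ("B1", "C1"), ("A1", "C1"),   # triangle A1-B1-C1
             ("A1", "D1"), ("B1", "D1"),                 # triangle A1-B1-D1
             ("C1", "E1")]                               # lone BBH, no triangle
cogs = cog_like_clusters(bbh_edges)
print(cogs)
```

E1 has a BBH to C1 but takes part in no triangle, so it stays outside the group.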
InParanoid approach - correct for paralogy
This is similar to BBH but other proteins within a proteome (yellow protein 2 in this example) are included as in-paralogs if they are more similar to each other than to their corresponding hits in the other species.
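The in-paralog rule above comes down to one comparison: a protein from the same proteome joins the group when it scores higher against the seed ortholog than the seed scores against its partner in the other species. The sketch below shows only that comparison, under the assumption that scores are directly comparable. It is not the InParanoid program itself, and all names and numbers are invented.

```python
def in_paralogs(seed, ortholog_score, within_species_scores):
    """Return same-proteome proteins closer to the seed than its cross-species ortholog.

    seed: the protein that forms a reciprocal best hit with the other species.
    ortholog_score: similarity between the seed and that cross-species ortholog.
    within_species_scores: {(seed, other_protein): score} inside one proteome.
    """
    return sorted(
        other
        for (query, other), score in within_species_scores.items()
        if query == seed and other != seed and score > ortholog_score
    )


# Hypothetical example: A1 and B1 are the seed orthologs with score 250.
within_a = {("A1", "A2"): 400.0,   # recent duplicate, closer than B1
            ("A1", "A3"): 120.0}   # older paralog, more distant than B1
added = in_paralogs("A1", ortholog_score=250.0, within_species_scores=within_a)
print(added)
```

A2 is added as an in-paralog. A3 is not, because it is less similar to A1 than the ortholog B1 is, which suggests it duplicated before the two species split.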
OrthoMCL approach - Markov Cluster
http://www.micans.org/mcl/ani/mcl-animation.html
This is similar to BBH but other proteins within a proteome (yellow protein 2 in this example) are included as in-paralogs if they are more similar to each other than to their corresponding hits in the other species. The resulting similarity graph is then partitioned with the Markov Cluster algorithm (MCL).
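The Markov Cluster step named in the slide title alternates two operations on a column-stochastic matrix: expansion (squaring the matrix, which spreads flow along paths) and inflation (raising entries to a power and renormalizing, which strengthens strong links and weakens weak ones). The sketch below is a small pure-Python version for intuition only, not the `mcl` program. The graph, the inflation value of 2.0, the iteration count, and the 1e-6 threshold are all assumptions chosen for the example.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: multiply the matrix by itself."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise each entry to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(adjacency, inflation=2.0, iterations=50):
    """Cluster a weighted undirected graph given as a square adjacency matrix."""
    n = len(adjacency)
    # Add self-loops before normalizing, as is commonly done for MCL.
    m = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Rows that keep non-negligible mass are attractors; the columns they
    # cover form one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical graph: two tight triangles (0,1,2) and (3,4,5) joined by one weak edge.
w = [[0.0] * 6 for _ in range(6)]
for a, b, weight in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                     (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                     (2, 3, 0.1)]:
    w[a][b] = w[b][a] = weight
clusters = mcl(w)
print(clusters)
```

Single-linkage would merge all six nodes through the weak 2-3 edge. Here inflation drives the flow across that edge toward zero, so the two triangles come out as separate clusters.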
OrthoMCL workflow
OrthoMCL distance correction for paralog method
OrthoMCL is able to connect families left unlinked by Single-Linkage or COG
Phylogenetic Methods
Tree Reconciliation
Duplication nodes (marked with a D) are defined by comparing the gene tree (small tree at the top) with the species tree (small tree at the bottom) to derive a reconciled tree (big tree on the right) in which the minimal number of duplication and gene loss (dashed lines) events necessary to explain the gene tree are included. In this case, both the yellow proteins are included in the orthologous group but the red and gray proteins are excluded.
Species overlap phylogenetic approach
All proteins that derive from a common ancestor by speciation are considered members of the same orthologous group. Duplication nodes are detected when they define partitions with at least one shared species. A one-to-many orthology relationship emerges because of a recent duplication in the lineage leading to the yellow proteome.
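The species-overlap rule needs no species tree: a node is called a duplication when the species sets of its two child partitions share at least one species, and a speciation otherwise. The sketch below assumes a rooted binary gene tree written as nested tuples with leaf names of the form `species_gene`. The tree and the naming scheme are invented for illustration.

```python
from itertools import product


def species_of(leaf):
    """Extract the species prefix from a leaf named 'species_gene' (assumed format)."""
    return leaf.split("_")[0]


def analyze(node):
    """Return (leaves, ortholog_pairs, duplication_count) for a nested-tuple gene tree."""
    if isinstance(node, str):
        return [node], set(), 0
    left, right = node
    l_leaves, l_pairs, l_dups = analyze(left)
    r_leaves, r_pairs, r_dups = analyze(right)
    pairs = l_pairs | r_pairs
    dups = l_dups + r_dups
    l_species = {species_of(x) for x in l_leaves}
    r_species = {species_of(x) for x in r_leaves}
    if l_species & r_species:
        dups += 1  # shared species on both sides: duplication node
    else:
        # Speciation node: every leaf on one side is orthologous to every leaf on the other.
        pairs |= {tuple(sorted(p)) for p in product(l_leaves, r_leaves)}
    return l_leaves + r_leaves, pairs, dups


# Hypothetical gene tree with a recent duplication in the yellow lineage.
gene_tree = ((("yellow_1", "yellow_2"), "red_1"), "green_1")
leaves, orthologs, n_dups = analyze(gene_tree)
print(sorted(orthologs), n_dups)
```

Both yellow genes are reported as orthologs of red_1 and green_1 (one-to-many), while the yellow pair itself is separated by a duplication and so is treated as paralogous.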
SYNERGY [Wapinski]
Clusters of similar genes are found and trees are inferred at once. A phylogenetic approach that builds up a tree and breaks groups when an ancestral duplication is found that is older than the species group. Can take into account a scoring scheme that uses synteny.
[Figure panels: SYNERGY, InParanoid]
SYNERGY
a, SYNERGY starts (top) with a collection of genes (A1, B1, C1 and so on), their chromosomal order (grey lines) and sequence distances (blue arrows; arrows of the same thickness have similar sequence distances). It then builds orthogroups as it climbs the species tree. First, it collects the genes in species A and B that share a common ancestor in species X (second panel, orange ovals). Then, it merges orthogroups formed in the previous stage with the genes in C, resulting in new orthogroups representing ancestral genes in species Y (third panel, yellow ovals). The orthogroups assembled at each stage are associated with gene trees reflecting divergence, duplication and loss events (bottom).
b, Gene tree reconstruction and refining orthogroup assignments. An unrooted phylogeny is reconstructed for the genes and sub-orthogroups in each putative orthogroup (dashed oval). Some rootings (purple arrow) indicate that all the genes descended from a common ancestor (for example, X3, bottom left). Others (green arrow) show that a duplication occurred at the root of the gene tree (for example, X2 and X3, bottom right). In the latter case, the orthogroup is partitioned before proceeding.
Gene tree reconciliation
Gene tree reconciliation
- Resolve duplication and speciation events on a gene tree
- Use the known species phylogeny: walk up the gene tree and assign each node to an event
- Some methods impute missing data (gene losses that are unobserved)
Speciation-Duplication Inference [Zmasek and Eddy 2001]
Very simple recursion to reconcile gene tree and species tree. Each node is labeled. Doesn't try to infer that there is missing data. Improved upon with Resampling Inference of Orthology (RIO) by the same authors.
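The kind of recursion described here can be illustrated with the standard least-common-ancestor mapping rule. Each gene-tree node is mapped to the most recent species-tree node that contains all of its descendants' species, and a node is labeled a duplication when it maps to the same species-tree node as one of its children. This is a sketch of that general rule under assumed data structures, not the published SDI implementation. The species tree, gene names, and helper functions are hypothetical.

```python
def ancestors(species, parent):
    """Return the path from a species-tree node up to the root, inclusive."""
    path = [species]
    while species in parent:
        species = parent[species]
        path.append(species)
    return path


def lca(a, b, parent):
    """Least common ancestor of two species-tree nodes."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same species tree")


def reconcile(node, gene_species, parent, events):
    """Map a gene-tree node onto the species tree and label it; returns the mapping."""
    if isinstance(node, str):
        return gene_species[node]
    left, right = node
    m_left = reconcile(left, gene_species, parent, events)
    m_right = reconcile(right, gene_species, parent, events)
    mapped = lca(m_left, m_right, parent)
    # Same mapping as a child means the split did not follow a speciation.
    label = "duplication" if mapped in (m_left, m_right) else "speciation"
    events.append((node, mapped, label))
    return mapped


# Hypothetical species tree ((A,B)X,C)Y stored as child -> parent.
species_parent = {"A": "X", "B": "X", "X": "Y", "C": "Y"}
gene_species = {"a1": "A", "b1": "B", "a2": "A", "b2": "B", "c1": "C"}
# Hypothetical gene tree: a duplication before the A/B split gave two copies in each.
gene_tree = ((("a1", "b1"), ("a2", "b2")), "c1")

events = []
root_mapping = reconcile(gene_tree, gene_species, species_parent, events)
for node, mapped, label in events:
    print(mapped, label)
```

Like the method on the slide, this only labels nodes. It makes no attempt to place unobserved gene losses on the tree.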
Notung
Reconciles a gene tree with a species tree and labels each node.
Databases of Orthologs
COGs and KOGs
Don't use this as a way to classify your orthologs. Many other more accurate methods exist.
OrthoMCL database
OrthoMCL assigns gene families by MCL-based clustering.
PhylomeDB
[Figure: PhylomeDB strategy]
TreeFam
Curated gene trees and gene families starting with automated clusters.
Other tools
- Prime-GSR: Bayesian gene tree inference with species tree knowledge
- OrthoStrapper: for orthology
- TreeBEST: Likelihood gene tree inference which is species tree aware
References
Frech C and Chen N. (2010) Genome-Wide Comparative Gene Family Classification. PLoS One 5(10):e13409. URL http://dx.doi.org/10.1371/journal.pone.0013409
Gabaldon T. (2008) Large-scale assignment of orthology: back to phylogenetics? Genome Biol 9:235. URL http://dx.doi.org/10.1186/gb-2008-9-10-235
Zmasek CM and Eddy SR. (2001) A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics 17(9):821-8. URL http://www.hubmed.org/display.cgi?uids=11590098