Orthologs Detection and Applications
Marcus Lechner, Bioinformatics Leipzig
2009-10-23

Table of contents
  1 Background on homology
  2 Proteinortho
  3 Domain wide commons
  4 Annotation pipeline
  5 References

Definitions
  Homologous genes: derived from a common ancestor
  Orthology: homologous genes that evolved by speciation; thought to have a similar function
  Paralogy: homologous genes within the same species; thought to have a related function (neo-/subfunctionalization)
    out-paralogs: arose from a duplication preceding a speciation
    in-paralogs: arose from a duplication subsequent to a speciation

Example
  Figure: Illustration of relationships: three species with orthologs, xeno-, in- and out-paralogs. Adapted from [1].

Problems
  Interpretation
    original definition of homology (1843): the same organ under every variety of form and function [2]
    still a very good quantitative indication, but neither necessary nor sufficient
    Homology of two proteins is not equivalent to a common function, sequence or structure!

Problems
  Relative definition
    the in-/out-paralog distinction is defined only with respect to a particular species
    greatly dependent on available data
    no absolute view

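The relative nature of the in-/out-paralog distinction can be illustrated with a minimal sketch. It assumes that the ages of the duplication and of the reference speciation are known, which is rarely the case in practice; the function name `classify_paralogs` and all ages are hypothetical and chosen only for illustration.

```python
def classify_paralogs(duplication_mya: float, speciation_mya: float) -> str:
    """Classify a paralog pair relative to ONE speciation event.

    Both arguments are ages in million years ago (larger means older).
    Hypothetical helper for illustration, not part of Proteinortho.
    """
    if duplication_mya < speciation_mya:
        # The duplication is younger than the speciation.
        return "in-paralogs"
    # The duplication predates (or coincides with) the speciation.
    return "out-paralogs"


# The same duplication event (made-up age) seen from two reference speciations.
duplication = 80.0
relative_to_old_split = classify_paralogs(duplication, speciation_mya=100.0)
relative_to_young_split = classify_paralogs(duplication, speciation_mya=50.0)
print(relative_to_old_split)    # in-paralogs
print(relative_to_young_split)  # out-paralogs
```

The same pair of genes changes its label when the reference speciation changes, which is why no absolute view exists.
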
Problems
  Figure: Illustration of relationships: complete view needed

Problems
  Information benefit
    duplications are known to be a major source of innovation in evolution
    proteins are homologs by definition if they share a common ancestor, irrespective of their actual similarity or function
    most proteins are anciently related but have diverged far

Problems
  Figure: Multiple gene duplications: all are homologs by definition, but smaller groups may be more useful

Problems
  Up to which point is the homology information useful?

Conclusion
  Proteinortho approach
    arose from the same ancestor + similar function: similar sequence expected
    should return a useful subset of homologs (aiming at isofunctional groups)
    reciprocal best blast(s)

Reciprocal best blast(s) for homolog detection
  Figure: Homology detection using blast

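The reciprocal best hit idea can be sketched in a few lines. This is a minimal illustration, not Proteinortho's code: the protein names and bit scores are made up, the helper names are hypothetical, and only the single best hit per query is kept, whereas the "(s)" on the slide suggests that several near-best hits may be accepted.

```python
from typing import Dict, Set, Tuple

Scores = Dict[Tuple[str, str], float]  # (query, subject) -> bit score

# Hypothetical blast results between species A and species B.
a_vs_b: Scores = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
                  ("a2", "b2"): 180.0, ("a3", "b1"): 120.0}
b_vs_a: Scores = {("b1", "a1"): 245.0, ("b2", "a2"): 175.0,
                  ("b2", "a1"): 85.0}


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query protein to its highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward: Scores, backward: Scores) -> Set[Tuple[str, str]]:
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    fwd = best_hits(forward)
    bwd = best_hits(backward)
    return {(a, b) for a, b in fwd.items() if bwd.get(b) == a}


rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(sorted(rbh))  # [('a1', 'b1'), ('a2', 'b2')]
```

Protein a3 has b1 as its best hit, but b1 prefers a1, so the pair (a3, b1) is not reciprocal and is dropped.
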
Proteinortho
  Features
    ortholog and paralog assignment for proteins/protein-coding genes
    designed for large-scale application
    modest memory consumption
    capable of distributed computing

Workflow
  Figure: Proteinortho workflow
    1) Reciprocal blasts
    2) Transformation into graph representation
    3) Coloring and decomposition
    4) Reconversion and mapping to species with encoded proteins

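Steps 2 and 4 of the workflow can be sketched as follows, under simplifying assumptions: reciprocal hits are given as edges between (species, protein) nodes, groups are plain connected components, and the coloring step 3 is not implemented here. All names and data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Node = Tuple[str, str]  # (species, protein)


def connected_components(edges: List[Tuple[Node, Node]]) -> List[Set[Node]]:
    """Build an undirected graph from the edges and return its components."""
    adjacency: Dict[Node, Set[Node]] = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen: Set[Node] = set()
    components: List[Set[Node]] = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack = [start]
        seen.add(start)
        component: Set[Node] = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def species_table(component: Set[Node]) -> Dict[str, List[str]]:
    """Map each species to its proteins within one component (step 4)."""
    table: Dict[str, List[str]] = defaultdict(list)
    for species, protein in sorted(component):
        table[species].append(protein)
    return dict(table)


# Made-up reciprocal hits between three species A, B and C.
edges = [(("A", "a1"), ("B", "b1")), (("B", "b1"), ("C", "c1")),
         (("A", "a1"), ("C", "c1")), (("A", "a2"), ("B", "b2"))]
groups = [species_table(c) for c in connected_components(edges)]
print(groups)
```

Each resulting table lists, per species, the proteins assigned to one group; a species with more than one protein in a group would indicate candidate paralogs.
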
Distributed computing
  Figure: a) Multiple PCs running Proteinortho, cooperating dynamically using an N-way technique b) Workflow of synchronization

Challenge
  Application to all bacteria available on NCBI
    710 species, 15 million proteins
    took about two weeks on 50 CPU cores (Intel Xeon, 2.33 GHz)
    peak of only 25 GB RAM, but 300 GB hard disk

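To put these numbers in perspective, a rough back-of-the-envelope calculation follows. It assumes that every pair of species is compared and that "about two weeks" means 14 full days on all 50 cores; both are assumptions made here, not figures from the talk.

```python
from math import comb

n_species = 710
n_proteins = 15_000_000

# Unordered species pairs, assuming an all-against-all comparison.
species_pairs = comb(n_species, 2)

# Compute budget, assuming 14 days of continuous use of 50 cores.
core_hours = 14 * 24 * 50

# Average proteome size implied by the totals on the slide.
proteins_per_species = n_proteins / n_species

print(species_pairs)                # 251695
print(core_hours)                   # 16800
print(round(proteins_per_species))  # 21127
```

Under these assumptions, each species pair gets roughly four core-minutes on average, which shows why memory-friendly and distributable processing matters at this scale.
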
Results
  Figure: Coverage overview: number of common proteins. x-axis: # of species covered (400 to 700); y-axis: cumulative # of connected components (0 to 300); legend: original blasted blasted filtered. Sets with over 5% paralogs were filtered.

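The paralog filter mentioned in the caption could look like the sketch below. The talk does not define how the 5% is measured, so this assumes it is the share of proteins in a set that are additional copies beyond one per species; the function names and example sets are hypothetical.

```python
from typing import Dict, List

Component = Dict[str, List[str]]  # species -> proteins in this set


def paralog_fraction(component: Component) -> float:
    """Share of proteins that are extra copies beyond one per species."""
    total = sum(len(proteins) for proteins in component.values())
    extra = sum(len(proteins) - 1 for proteins in component.values()
                if len(proteins) > 1)
    return extra / total if total else 0.0


def filter_components(components: List[Component],
                      max_fraction: float = 0.05) -> List[Component]:
    """Keep only sets whose paralog fraction does not exceed the threshold."""
    return [c for c in components if paralog_fraction(c) <= max_fraction]


# Made-up sets: 40 species with one extra copy, and 10 species with two.
mostly_single = {f"sp{i}": [f"p{i}"] for i in range(40)}
mostly_single["sp0"] = ["p0", "p0_copy"]
many_paralogs = {f"sp{i}": [f"q{i}"] for i in range(10)}
many_paralogs["sp0"] = ["q0", "q0_copy"]
many_paralogs["sp1"] = ["q1", "q1_copy"]

kept = filter_components([mostly_single, many_paralogs])
print(len(kept))  # 1
```

The first set has 1 extra copy among 41 proteins (about 2.4%) and is kept; the second has 2 among 12 (about 16.7%) and is removed.
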
Results
  Common proteins
    30S ribosomal proteins S2-5, S7, S8, S10-13, S17, S19
    50S ribosomal proteins L1-3, L5, L6, L11, L14, L22, L23
    tRNA synthetases: seryl, arginyl, phenylalanyl (alpha chain)
    preprotein translocase, SecY subunit
    peptidase M22, O-sialoglycoprotein endopeptidase
    transcription elongation/termination factor NusA

Annotation pipeline
  Application for annotation
    in: newly sequenced bacterial genome
    out: annotation of protein-coding genes, candidates for non-coding genes
    no previous knowledge required
    runs in 10 to 90 minutes

Relatives discovery
  Figure: Relatives detection using reference proteins and tree

Relatives discovery with colors
  Figure: Advanced relatives detection using colors

Seeding
  Figure: Pipeline seeding with proteins

Pipeline overview
  Figure: Pipeline seeding with proteins

The end
  Thank you for listening!

References
  [1] W. M. Fitch. Homology: a personal view on some of the problems. Trends Genet, 16(5):227-31, May 2000.
  [2] Richard Owen and William White Cooper. Lectures on the comparative anatomy and physiology of the invertebrate animals. London: Longman, Brown, Green, and Longmans, 1843. http://www.biodiversitylibrary.org/bibliography/6788