Bioinformatics: Phylogeny

Jacques.van.Helden@ulb.ac.be
Université Libre de Bruxelles, Belgium
Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)
http://www.bigre.ulb.ac.be/

Species trees versus molecule trees

- A species tree aims at representing the evolutionary relationships between species.
- A molecule tree represents the evolutionary history of a family of related molecules (genes, proteins).
- Species trees and gene trees are generally related...
  - A species tree can be inferred from various criteria, including the history of carefully chosen molecules.
- ... but not identical.
  - A molecular family can contain several copies in the same species (in-paralogs), due to gene duplications.
  - Some molecules can be transferred horizontally between species.
  - Due to combinations of duplications and divergences, the tree of a given gene may be inconsistent with the species tree.
- Illustration: Figure 7.3 from Zvelebil and Baum.

Source: Zvelebil, M.J. and Baum, J.O. (2008) Understanding Bioinformatics. Garland Science, New York and London.

Tree reconciliation

[Figure: tree reconciliation.] Source: Zvelebil, M.J. and Baum, J.O. (2008) Understanding Bioinformatics. Garland Science, New York and London.

Concept definitions from Fitch (2000)

- Discussion of the definitions given in the paper
  - Fitch, W. M. (2000). Homology: a personal view on some of the problems. Trends Genet 16, 227-31.
- Homology
  - Owen (1843): «the same organ under every variety of form and function».
  - Fitch (2000): homology is the relationship of any two characters that have descended, usually with divergence, from a common ancestral character. Note: a character can be a phenotypic trait, a site at a given position of a protein, a whole gene, ...
  - Molecular application: two genes are homologous if they diverged from a common ancestral gene.
- Analogy: relationship of two characters that have developed convergently from unrelated ancestors.
- Cenancestor: the most recent common ancestor of the taxa under consideration.
- Orthology: relationship of any two homologous characters whose common ancestor lies in the cenancestor of the taxa from which the two were obtained.
- Paralogy: relationship of two characters arising from a duplication of the gene for that character.
- Xenology: relationship of any two characters whose history, since their common ancestor, involves interspecies (horizontal) transfer of the genetic material for at least one of those characters.

[Figure: classification of character relationships. Labels: Analogy; Homology; Paralogy, xenology or not (xenologs from paralogs); Orthology, xenology or not.]

Exercise

- On the basis of Fitch's definitions (previous slide), qualify the relationships between each pair of genes in the illustrative schema.
  - P: paralog
  - O: ortholog
  - X: xenolog
  - A: analog

[Figure: illustrative schema with the genes A1, AB1, B1, B2, C1, C2, C3, and an empty pairwise table (same genes as rows and columns) to fill in.]

- Orthologs can formally be defined as genes that diverged through a speciation event (e.g. a1 and a2).
- Paralogs can formally be defined as genes that diverged through a gene duplication event (e.g. b2 and b2').

Source: Zvelebil & Baum, 2008

Exercise

- Example: B1 versus C1
  - The two genes (B1 and C1) were obtained from taxa B and C, respectively.
  - The cenancestor (blue arrow) is the taxon that preceded the second speciation event (Sp2).
  - The common ancestor gene (green dot) coincides with the cenancestor.
  - -> B1 and C1 are orthologs.

[Figure: same schema; the B1/C1 cell of the pairwise table is filled with O.]

Source: Zvelebil & Baum, 2008
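The reasoning used in the B1 versus C1 example (find the last common ancestor of the two genes, then look at the event that occurred there) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the small gene tree below is hypothetical (one duplication Dp1 followed by a speciation Sp2 in each copy), it is not a transcription of the figure, and the names `Node`, `path_to` and `relationship` are made up for this sketch. Xenology and analogy are not handled.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    label: str                    # gene name (leaf) or event name (internal node)
    event: str = "leaf"           # "speciation", "duplication" or "leaf"
    children: list[Node] = field(default_factory=list)


def path_to(node, gene, path=()):
    """Return the nodes from `node` down to the leaf named `gene`, or None."""
    path = path + (node,)
    if node.event == "leaf":
        return path if node.label == gene else None
    for child in node.children:
        found = path_to(child, gene, path)
        if found is not None:
            return found
    return None


def relationship(root, gene_a, gene_b):
    """Return "O" (ortholog), "P" (paralog) or "I" (same gene)."""
    if gene_a == gene_b:
        return "I"
    path_a, path_b = path_to(root, gene_a), path_to(root, gene_b)
    # The last common ancestor is the deepest node shared by both paths.
    lca = [a for a, b in zip(path_a, path_b) if a is b][-1]
    return {"speciation": "O", "duplication": "P"}[lca.event]


# Hypothetical gene tree (an assumption for this sketch, not the figure itself):
# a duplication Dp1 creates two copies; each copy is then split by speciation Sp2.
tree = Node("Dp1", "duplication", [
    Node("Sp2 (copy 1)", "speciation", [Node("B1"), Node("C1")]),
    Node("Sp2 (copy 2)", "speciation", [Node("B2"), Node("C2")]),
])

for pair in [("B1", "C1"), ("B1", "B2"), ("B1", "C2"), ("B2", "C2")]:
    print(pair, relationship(tree, *pair))
```

Labelling each internal node with its event type keeps the classification rule to a single dictionary lookup; a real reconciliation program would first have to infer those event labels from the species tree.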
Exercise

- Example: B1 versus C2
  - The two genes (B1 and C2) were obtained from taxa B and C, respectively.
  - The common ancestor gene (green dot) is the gene that just preceded the duplication Dp1.
  - This common ancestor is much older than the cenancestor (blue arrow).
  - -> B1 and C2 are paralogs.

[Figure: same schema; the B1/C1 cell of the pairwise table is filled with O and the B1/C2 cell with P.]

Source: Zvelebil & Baum, 2008

Solution to the exercise

O: ortholog, P: paralog, X: xenolog, I: diagonal (a gene compared with itself).

      A1   AB1  B1   B2   C1   C2   C3
A1    I
AB1   X    I
B1    O    X    I
B2    O    X    P    I
C1    O    X    O    P    I
C2    O    X    P    O    P    I
C3    O    X    P    O    P    P    I

Cladistics, cladograms and clades

- Cladistics (Greek: klados = branch) is a branch of biology that determines the evolutionary relationships between organisms based on derived similarities (source: Wikipedia).
- Cladogram: tree-like drawing, usually with binary bifurcations, representing one evolutionary scenario about divergences between species or sequences.
- Clade: any sub-tree of a cladogram.
- Note: branch lengths do not reflect evolutionary time.

[Figure: cladogram of E. coli proteins (leaf labels such as YBIH_ECOLI, BETI_ECOLI, TETC_ECOLI, UIDR_ECOLI).]

Phylogram

- Phylogram: tree-like structure representing an evolutionary scenario, and including
  - the events of divergence between species or sequences;
  - the evolutionary time between each species and the divergence events.

[Figure: phylogram of the same E. coli proteins.]

Molecular clock
- The "molecular clock" hypothesis (left tree) assumes that rates of evolution do not vary between branches. All leaf nodes are thus aligned vertically.
- This hypothesis is not always valid.
  - In some cases, two genes diverge from a common ancestor, but one of them diverges faster than the other. This is a rather classical mechanism of evolution: a duplication creates some redundancy, and one copy of the gene evolves whereas the other one retains the initial function.

[Figure, left: ultrametric tree (with clock), e.g. UPGMA. Right: tree without clock, e.g. neighbour-joining. Leaves of both trees: META_BRUME, META_RHIME, Q8UBY0, META_CAMJE, META_VIBCH, META_YERPE, META_ECOLI, META_ECO57, META_SALTI, META_SALTY, META_LACLA, META_STRPN, AAL00238, META_BACSU, META_THEMA, META_BACHD, META_CLOAB.]

Phylogenetic inference from sequence comparison

- Alternative approaches
  - Maximum parsimony
  - Distance
  - Maximum likelihood

[Figure: flowchart for choosing a method. Labels: unaligned sequences; sequence alignment; aligned sequences; strong similarity?; many (> 20) sequences?; maximum parsimony.]

Source: Mount (2000)
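The consequence of the molecular clock hypothesis stated above (all leaf nodes aligned, i.e. every leaf at the same distance from the root) is easy to test on a rooted tree with branch lengths. The following is a minimal sketch, assuming a tree encoded as nested tuples `(name, branch_length, children)`; this encoding and the function names are choices made for this illustration, not a format used by the programs cited in these slides.

```python
def leaf_depths(tree, depth=0.0):
    """Return {leaf name: distance from the root} for a (name, length, children) tree."""
    name, length, children = tree
    depth += length
    if not children:
        return {name: depth}
    depths = {}
    for child in children:
        depths.update(leaf_depths(child, depth))
    return depths


def is_clocklike(tree, tol=1e-9):
    """True if all leaves lie at the same distance from the root (within tol)."""
    depths = leaf_depths(tree).values()
    return max(depths) - min(depths) <= tol


# Toy tree compatible with a clock: a, b and c are all at distance 2 from the root.
with_clock = ("root", 0.0, [
    ("x", 1.0, [("a", 1.0, []), ("b", 1.0, [])]),
    ("c", 2.0, []),
])

# Same topology, but the branch leading to b is three times longer.
without_clock = ("root", 0.0, [
    ("x", 1.0, [("a", 1.0, []), ("b", 3.0, [])]),
    ("c", 2.0, []),
])

print(is_clocklike(with_clock), leaf_depths(with_clock))
print(is_clocklike(without_clock), leaf_depths(without_clock))
```

The tolerance `tol` is there because branch lengths estimated from real data are floating-point numbers; its default value is arbitrary.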
Maximum parsimony

- For each column of the alignment, all possible trees are evaluated, and the tree with the smallest number of mutations is retained.
- The trees which fit with the highest number of columns are retained.
- The program can return several trees.

position  1 2 3 4 5 6 7 8 9
seq1      A A G A G T G C A
seq2      A G C C G T G C G
seq3      A G A T A T C C A
seq4      A G A G A T C C G

[Figure: alternative trees for column 5 (seq1 = G, seq2 = G, seq3 = A, seq4 = A), with the G -> A mutations marked on the branches.]

Adapted from Mount (2000)

Maximum parsimony example

- Parsimony tree calculated from a multiple alignment of the E. coli proteins containing a LacI-type HTH domain
  - Left: text representation (protpars output)
  - Bottom right: visualized with njplot (in the ClustalX distribution)

[Figure: text representation of the protpars tree. Leaves: CYTR_ECOLI, EBGR_ECOLI, CSCR_ECOLI, IDNR_ECOLI, GNTR_ECOLI, MALI_ECOLI, TRER_ECOLI, YCJW_ECOLI, LACI_ECOLI, FRUR_ECOLI, RAFR_ECOLI, ASCG_ECOLI, GALS_ECOLI, GALR_ECOLI, RBSR_ECOLI, PURR_ECOLI. Messages in the protpars output: "remember: this is an unrooted tree!" and "requires a total of 4095.000".]

Maximum parsimony - drawbacks

- The number of trees to evaluate increases exponentially with the number of sequences.
- Assumes that all sequences evolved at the same rate (molecular clock hypothesis).
- Only works for well conserved sequence families.

Phylogenetic inference from sequence comparison

- Alternative approaches
  - Maximum parsimony
  - Distance
  - Maximum likelihood

[Figure: flowchart for choosing a method. Labels: unaligned sequences; sequence alignment; aligned sequences; strong similarity?; many (> 20) sequences?; maximum parsimony; clear similarity?; distance.]

Source: Mount (2000)

Distance method

- Starting from a multiple alignment, calculate the distance between each pair of sequences.
- Calculate a tree which fits as well as possible with the distance matrix
  - branch lengths should correspond to distances
  - rooted or unrooted
- Several methods can be used for calculating a tree from the distance matrix.
  - Fitch-Margoliash
  - Neighbour-Joining
  - UPGMA

[Figure: workflow. Aligned sequences -> distance calculation -> distance matrix -> tree calculation -> tree.]

Distance matrix

- The distance matrix indicates the distance between each pair of sequences.
- The matrix is symmetrical, and the diagonal only contains 0s.

Columns are numbered in the same order as the rows.

                   1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17
 1 META_BACHD   0.00  0.51  0.50  0.65  0.64  0.82  0.74  0.73  0.76  0.76  0.77  0.68  0.95  0.58  0.76  0.76  0.91
 2 META_BACSU   0.51  0.00  0.66  0.81  0.80  0.90  0.86  0.85  0.88  0.88  0.85  0.80  0.87  0.65  0.99  0.98  1.05
 3 META_CLOAB   0.50  0.66  0.00  0.75  0.74  0.79  0.81  0.82  0.83  0.83  0.85  0.82  0.82  0.60  0.79  0.80  0.96
 4 META_STRPN   0.65  0.81  0.75  0.00  0.00  0.74  0.87  0.88  0.89  0.90  0.90  0.88  1.04  0.74  1.07  1.07  1.02
 5 AAL00238     0.64  0.80  0.74  0.00  0.00  0.74  0.87  0.87  0.89  0.89  0.89  0.87  1.03  0.73  1.06  1.06  1.01
 6 META_LACLA   0.82  0.90  0.79  0.74  0.74  0.00  0.93  0.93  0.96  0.95  0.94  0.95  0.99  0.78  1.11  1.15  1.07
 7 META_ECOLI   0.74  0.86  0.81  0.87  0.87  0.93  0.00  0.02  0.06  0.05  0.24  0.46  1.04  0.81  1.03  0.97  1.08
 8 META_ECO57   0.73  0.85  0.82  0.88  0.87  0.93  0.02  0.00  0.06  0.05  0.24  0.46  1.03  0.82  1.03  0.97  1.08
 9 META_SALTI   0.76  0.88  0.83  0.89  0.89  0.96  0.06  0.06  0.00  0.01  0.26  0.46  1.08  0.82  1.08  1.00  1.11
10 META_SALTY   0.76  0.88  0.83  0.90  0.89  0.95  0.05  0.05  0.01  0.00  0.25  0.46  1.09  0.82  1.08  1.01  1.11
11 META_YERPE   0.77  0.85  0.85  0.90  0.89  0.94  0.24  0.24  0.26  0.25  0.00  0.43  0.94  0.84  1.06  1.04  1.08
12 META_VIBCH   0.68  0.80  0.82  0.88  0.87  0.95  0.46  0.46  0.46  0.46  0.43  0.00  0.96  0.72  1.07  1.00  1.10
13 META_CAMJE   0.95  0.87  0.82  1.04  1.03  0.99  1.04  1.03  1.08  1.09  0.94  0.96  0.00  0.97  1.15  1.12  1.31
14 META_THEMA   0.58  0.65  0.60  0.74  0.73  0.78  0.81  0.82  0.82  0.82  0.84  0.72  0.97  0.00  0.78  0.75  0.89
15 META_RHIME   0.76  0.99  0.79  1.07  1.06  1.11  1.03  1.03  1.08  1.08  1.06  1.07  1.15  0.78  0.00     9  0.55
16 Q8UBY0       0.76  0.98  0.80  1.07  1.06  1.15  0.97  0.97  1.00  1.01  1.04  1.00  1.12  0.75     9  0.00  0.54
17 META_BRUME   0.91  1.05  0.96  1.02  1.01  1.07  1.08  1.08  1.11  1.11  1.08  1.10  1.31  0.89  0.55  0.54  0.00
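How such a matrix is obtained from a multiple alignment can be illustrated with the simplest possible measure, the p-distance (fraction of aligned positions at which two sequences differ). This is an assumption made for the sketch: the slides do not state which distance was used for the matrix above, and programs such as protdist or dnadist rely on substitution models rather than raw mismatch counts. The example reuses the four toy sequences of the maximum parsimony slide; gaps are not handled.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned (same length)")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def distance_matrix(alignment):
    """alignment: dict name -> aligned sequence. Returns a dict of dicts."""
    names = list(alignment)
    matrix = {name: {name: 0.0} for name in names}   # diagonal contains 0s
    for a, b in combinations(names, 2):
        d = p_distance(alignment[a], alignment[b])
        matrix[a][b] = d                              # fill both halves:
        matrix[b][a] = d                              # the matrix is symmetrical
    return matrix


alignment = {
    "seq1": "AAGAGTGCA",
    "seq2": "AGCCGTGCG",
    "seq3": "AGATATCCA",
    "seq4": "AGAGATCCG",
}

matrix = distance_matrix(alignment)
for name, row in matrix.items():
    print(name, " ".join(f"{row[other]:.2f}" for other in alignment))
```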
Trees

[Figure: a rooted tree and unrooted trees for the sequences seq1 to seq5, annotated with the terms branch, node, root and leaf nodes, and with branch labels b1 to b4.]

- The distance between two nodes is the sum of the lengths of the branches between them.

Methods for calculating trees from a distance matrix

- It is usually not possible to find a tree whose branch lengths fit all the values of the distance matrix.
- Several approaches exist to calculate a tree which approximates the distances.
  - The Fitch-Margoliash method minimizes the sum of squared differences between the distances in the matrix and the distances in the tree.
  - The Neighbour-Joining (NJ) method minimizes the sum of branch lengths of the resulting tree. This method does not assume a molecular clock: it is thus appropriate when some proteins have evolved faster than others. It returns an unrooted tree.
  - The Unweighted Pair-Group Method by Arithmetic Averaging (UPGMA) clusters the sequences by order of distance in the distance matrix. This method relies on the assumption of an evolutionary clock, and it produces a rooted tree.

Example of phylogenetic tree

- This tree was obtained with the Neighbour-Joining method (implemented in ClustalX).
- The drawing was obtained with njplot (part of the ClustalX package).
- Each branch of the tree is labelled with the distance.
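Of the three methods described above, UPGMA is the simplest to write down: repeatedly merge the two closest clusters, and define the distance from the merged cluster to any other cluster as the size-weighted average of the two previous distances. The sketch below is a toy implementation for illustration only, not the code of PHYLIP's neighbor program; the input format (a dict keyed by `frozenset` pairs) and the output format (nested tuples `(left, right, height)`) are assumptions made for this example.

```python
from itertools import combinations


def upgma(dist):
    """dist: {frozenset({a, b}): distance} for every pair of leaves.
    Returns a rooted tree as nested tuples (left, right, node_height)."""
    leaves = sorted({leaf for pair in dist for leaf in pair})
    size = {leaf: 1 for leaf in leaves}        # number of leaves per cluster
    d = dict(dist)                             # distances between clusters
    clusters = list(leaves)
    while len(clusters) > 1:
        # Pick the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        # The new node sits halfway between them (molecular clock assumption).
        merged = (a, b, d[frozenset((a, b))] / 2)
        size[merged] = size[a] + size[b]
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            # Average distance, weighted by the size of the merged clusters.
            d[frozenset((merged, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / size[merged]
        clusters.append(merged)
    return clusters[0]


# Toy ultrametric distances (made up for this example).
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(pair): value for pair, value in pairs.items()}

tree = upgma(dist)
print(tree)
```

On these toy distances, A and B are joined first (node height 1), then C (height 3), then D (height 5). With distances that violate the clock assumption, UPGMA still returns a rooted ultrametric tree, which is why it can be misleading in that case.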
[Figure: neighbour-joining tree of 36 Swiss-Prot sequences (first leaf: sw P08497 LPA2_BACSU; last leaf: sw Q9PK32 AK_CHLMU), with the distance written on each branch.]

Distance-based methods for calculating trees in the package PHYLIP

- Summary of the methods for calculating a tree from a distance matrix.

PHYLIP program  Method             Rooted tree  Time    Accuracy  Remarks
fitch           Fitch-Margoliash   no           O(n^4)  higher    loss of accuracy when the tree contains long branches
kitsch          Fitch-Margoliash   yes          O(n^4)  higher
neighbor        neighbour-joining  no           O(n^2)  lower     suitable when the rate of evolution varies among branches
neighbor        UPGMA              yes          O(n^2)  lower     assumes a constant rate of evolution along the branches

Bootstrapping

- In some cases, the data do not allow a reliable phylogeny to be inferred.
- To assess the reliability of the inference, one can apply the bootstrap method.
  - Given an alignment of n sequences and p columns, one performs a random selection of p columns, with replacement. Some columns can thus be selected multiple times, whilst some others are not selected at all.
  - Calculate a tree with the sampled columns.
  - Repeat many times, and check whether the same branches occur frequently (e.g. > 70%).

[Figure: tree of the same 36 Swiss-Prot sequences, with bootstrap counts on the internal branches (e.g. 788, 677, 509, 992, 996).]

Phylogenetic inference from sequence comparison

- Alternative approaches
  - Maximum parsimony
  - Distance
  - Maximum likelihood

[Figure: flowchart for choosing a method. Labels: unaligned sequences; sequence alignment; aligned sequences; strong similarity?; many (> 20) sequences?; maximum parsimony; clear similarity?; distance; maximum likelihood.]

Source: Mount (2000)
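The bootstrap procedure (resample the p alignment columns with replacement, rebuild a tree, count how often each group reappears) can be sketched as follows. This is an illustration under stated assumptions: `build_clades` stands for any tree-building step that returns the groups of sequences found in its tree, and the toy `closest_pair` builder used here (which only reports the two most similar sequences) is a placeholder invented for the example, not a phylogenetic method. The alignment is the toy alignment of the maximum parsimony slide.

```python
import random
from collections import Counter
from itertools import combinations


def resample_columns(alignment, rng):
    """Draw p columns with replacement from an alignment of p columns."""
    p = len(next(iter(alignment.values())))
    columns = [rng.randrange(p) for _ in range(p)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_support(alignment, build_clades, n_replicates, seed=0):
    """Fraction of bootstrap replicates in which each clade is recovered."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        counts.update(build_clades(resample_columns(alignment, rng)))
    return {clade: n / n_replicates for clade, n in counts.items()}


def closest_pair(alignment):
    """Placeholder tree builder: report only the pair with fewest mismatches."""
    def mismatches(pair):
        return sum(a != b for a, b in zip(alignment[pair[0]], alignment[pair[1]]))
    return [frozenset(min(combinations(sorted(alignment), 2), key=mismatches))]


alignment = {
    "seq1": "AAGAGTGCA",
    "seq2": "AGCCGTGCG",
    "seq3": "AGATATCCA",
    "seq4": "AGAGATCCG",
}

support = bootstrap_support(alignment, closest_pair, n_replicates=100)
for clade, fraction in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(clade), f"{fraction:.0%}")
```

Passing the tree builder as a function keeps the resampling logic independent of the inference method, so the same loop could wrap a parsimony, distance or likelihood program.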
Practicals with phylogeny.fr

Phylogeny.fr

- http://www.phylogeny.fr
- Offers a user-friendly interface to run all the steps for inferring a phylogeny from a set of unaligned sequences.
  - Completely automated workflow or user-specified parameters.
  - Alternative methods for each step of the workflow.
  - Results are exported in multiple formats (convenient for using them with other programs).
  - Results can be displayed immediately (for fast programs) or sent by email (slow programs).

Phylogeny.fr: sequence input

- The "one click" option only requires you to enter a set of sequences and click on the submit button.

Phylogeny.fr: work flow

- At each step of the workflow, you can
  - check the parameters used for the analysis;
  - choose alternative parameters (advanced use);
  - export the intermediate and final results in a variety of formats, which can then be opened in other programs.

Phylogeny.fr - alignment result

[Screenshot.]

Phylogeny.fr - phylogenetic tree in text format

[Screenshot.]
Phylogeny.fr - Phylogram (various output formats are supported)

[Screenshot.]

Phylogeny.fr - display options

[Figure panels: phylogram with an outgroup added (Bacillus) but not correctly rooted (midpoint grouping); cladogram incorrectly rooted (midpoint); phylogram rooted with an outgroup.]
Further reading

- Textbooks
  - Zvelebil, M.J. and Baum, J.O. (2008) Understanding Bioinformatics. Garland Science, New York and London.
  - Mount, D.W. (2001) Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, New York.
  - Pevsner, J. (2003) Bioinformatics and Functional Genomics. Wiley.
    + all his teaching material on http://pevsnerlab.kennedykrieger.org/bioinfo_course.htm

Supplementary material

PHYLIP flowchart

[Figure: PHYLIP flowchart. Programs by step:]
- Bootstrapping: seqboot
- Input: aligned sequences
- Distance calculation: protdist, dnadist -> distance matrix
  - Neighbor-joining: neighbor
  - UPGMA: neighbor (rooted)
  - Fitch-Margoliash: fitch (unrooted), kitsch (rooted)
- Parsimony: protpars, dnapars
- Branch-and-bound: dnapenny
- Maximum likelihood: dnaml, protml
- Tree: retree, consense
- Tree drawing: drawtree (drawing of unrooted tree), drawgram (drawing of rooted tree)

Taxonomy of bacteria having a gene metA (August 2004)

[Figure: taxonomic tree. Labels, grouped by phylum:]
- Bacteria
  - Firmicutes
    - Bacillales > Bacillaceae > Bacillus
    - Clostridia > Clostridiales > Clostridium
    - Lactobacillales > Streptococcaceae > Lactococcus, Streptococcus
  - Proteobacteria
    - Alpha subdivision > Rhizobiaceae group > Rhizobium, Sinorhizobium, Brucella
    - Epsilon subdivision > Campylobacter group > Campylobacter
    - Gamma subdivision > Enterobacteriaceae > Escherichia, Salmonella, Yersinia
    - Gamma subdivision > Vibrionaceae > Vibrio
  - Thermotogae > Thermotogae (class) > Thermotogales > Thermotoga

Tree nomenclature

- Node
- Leaf
- Internal branch
- External branch

Alignment methods

[Figure.] Source: Zvelebil, M.J. and Baum, J.O. (2008) Understanding Bioinformatics. Garland Science, New York and London.
Evolutionary model

[Figure.] Source: Zvelebil, M.J. and Baum, J.O. (2008) Understanding Bioinformatics. Garland Science, New York and London.