Introduction to Molecular Phylogeny

Similar documents
Practical Bioinformatics

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

Advanced topics in bioinformatics

3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies

Supplemental data. Pommerrenig et al. (2011). Plant Cell /tpc

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

SUPPORTING INFORMATION FOR. SEquence-Enabled Reassembly of β-lactamase (SEER-LAC): a Sensitive Method for the Detection of Double-Stranded DNA

Supplementary Information for

Characterization of Pathogenic Genes through Condensed Matrix Method, Case Study through Bacterial Zeta Toxin

Crick s early Hypothesis Revisited

High throughput near infrared screening discovers DNA-templated silver clusters with peak fluorescence beyond 950 nm

Evolutionary Analysis of Viral Genomes

Nature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1

Protein Threading. Combinatorial optimization approach. Stefan Balev.

Clay Carter. Department of Biology. QuickTime and a TIFF (Uncompressed) decompressor are needed to see this picture.

SSR ( ) Vol. 48 No ( Microsatellite marker) ( Simple sequence repeat,ssr),

Why do more divergent sequences produce smaller nonsynonymous/synonymous

The Trigram and other Fundamental Philosophies

Using algebraic geometry for phylogenetic reconstruction

NSCI Basic Properties of Life and The Biochemistry of Life on Earth

Aoife McLysaght Dept. of Genetics Trinity College Dublin

SUPPLEMENTARY DATA - 1 -

Constructing Evolutionary/Phylogenetic Trees

Number-controlled spatial arrangement of gold nanoparticles with

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Codon Distribution in Error-Detecting Circular Codes

Objective: You will be able to justify the claim that organisms share many conserved core processes and features.

part 4: phenomenological load and biological inference. phenomenological load review types of models. Gαβ = 8π Tαβ. Newton.

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

Sequence Divergence & The Molecular Clock. Sequence Divergence

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)

Electronic supplementary material

Phylogenetic Tree Reconstruction

Supplemental Table 1. Primers used for cloning and PCR amplification in this study

Table S1. Primers and PCR conditions used in this paper Primers Sequence (5 3 ) Thermal conditions Reference Rhizobacteria 27F 1492R

part 3: analysis of natural selection pressure

Supporting Information for. Initial Biochemical and Functional Evaluation of Murine Calprotectin Reveals Ca(II)-

Supporting Information

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics

Building a Multifunctional Aptamer-Based DNA Nanoassembly for Targeted Cancer Therapy

Evolvable Neural Networks for Time Series Prediction with Adaptive Learning Interval

Supplementary Information

Dr. Amira A. AL-Hosary

TM1 TM2 TM3 TM4 TM5 TM6 TM bp

Probabilistic modeling and molecular phylogeny

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

Supplemental Figure 1.

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Phylogenetic inference

THE MATHEMATICAL STRUCTURE OF THE GENETIC CODE: A TOOL FOR INQUIRING ON THE ORIGIN OF LIFE

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

SUPPLEMENTARY INFORMATION

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Evolutionary dynamics of abundant stop codon readthrough in Anopheles and Drosophila

Constructing Evolutionary/Phylogenetic Trees

How Molecules Evolve. Advantages of Molecular Data for Tree Building. Advantages of Molecular Data for Tree Building

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

evoglow - express N kit distributed by Cat.#: FP product information broad host range vectors - gram negative bacteria

Molecular phylogeny How to infer phylogenetic trees using molecular sequences


Theory of Evolution Charles Darwin

Biosynthesis of Bacterial Glycogen: Primary Structure of Salmonella typhimurium ADPglucose Synthetase as Deduced from the

SUPPLEMENTARY INFORMATION

evoglow - express N kit Cat. No.: product information broad host range vectors - gram negative bacteria

Sex-Linked Inheritance in Macaque Monkeys: Implications for Effective Population Size and Dispersal to Sulawesi

The 3 Genomic Numbers Discovery: How Our Genome Single-Stranded DNA Sequence Is Self-Designed as a Numerical Whole

The Mathematics of Phylogenomics

Evolutionary Change in Nucleotide Sequences. Lecture 3

The role of the FliD C-terminal domain in pentamer formation and

Codon-model based inference of selection pressure. (a very brief review prior to the PAML lab)

Re- engineering cellular physiology by rewiring high- level global regulatory genes

EVOLUTIONARY DISTANCES

It is the author's version of the article accepted for publication in the journal "Biosystems" on 03/10/2015.

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

Quantifying sequence similarity

BINF6201/8201. Molecular phylogenetic methods

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

The Generalized Neighbor Joining method

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

An Analytical Model of Gene Evolution with 9 Mutation Parameters: An Application to the Amino Acids Coded by the Common Circular Code

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Phylogeny: traditional and Bayesian approaches

AtTIL-P91V. AtTIL-P92V. AtTIL-P95V. AtTIL-P98V YFP-HPR

7. Tests for selection

ChemiScreen CaS Calcium Sensor Receptor Stable Cell Line

Theory of Evolution. Charles Darwin

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

Diversity of Chlamydia trachomatis Major Outer Membrane

Near-instant surface-selective fluorogenic protein quantification using sulfonated

Introduction to protein alignments

Evidence for Evolution: Change Over Time (Make Up Assignment)

Transcription:

Introduction to Molecular Phylogeny Starting point: a set of homologous, aligned DNA or protein sequences Result of the process: a tree describing evolutionary relationships between studied sequences = a genealogy of sequences = a phylogenetic tree CLUSTAL W (1.74) multiple sequence alignment Xenopus Gallus Bos Homo Mus Rattus ATGCATGGGCCAACATGACCAGGAGTTGGTGTCGGTCCAAACAGCGTT---GGCTCTCTA ATGCATGGGCCAGCATGACCAGCAGGAGGTAGC---CAAAATAACACCAACATGCAAATG ATGCATCCGCCACCATGACCAGCAGGAGGTAGCACCCAAAACAGCACCAACGTGCAAATG ATGCATCCGCCACCATGACCAGCAGGAGGTAGCACTCAAAACAGCACCAACGTGCAAATG ATGCATCCGCCACCATGACCAGCAGGAGGTAGCACTCAAAACAGCACCAACGTGCAAATG ATGCATCCGCCACCATGACCAGCGGGAGGTAGCTCTCAAAACAGCACCAACGTGCAAATG ****** **** ********* * *** * * *** * * * Phylogenetic Tree Internal branch : between 2 nodes. External branch : between a node and a leaf Horizontal branch length is proportional to evolutionary distances between sequences and their ancestors (unit = substitution / site). Tree Topology = shape of tree = branching order between nodes Branch 0.066 Gallus 0.01 Root 0.012 0.011 Rattus Mus 0.038 0.025 Bos Node Homo 0.011 Leaf

Alignment and Gaps The quality of the alignment is essential : each column of the alignment (site) is supposed to contain homologous residues (nucleotides, amino acids) that derive from a common ancestor. ==> Unreliable parts of the alignment must be omitted from further phylogenetic analysis. Most methods take into account only substitutions ; gaps (insertion/deletion events) are not used. ==> gaps-containing sites are ignored. Xenopus Gallus Bos Homo Mus Rattus ATGCATGGGCCAACATGACCAGGAGTTGGTGTCggtCCAAACAGCGTT---GGCTCTCTA ATGCATGGGCCAGCATGACCAGCAGGAGGTAGC---CAAAATAACACCaacATGCAAATG ATGCATCCGCCACCATGACCAGCAGGAGGTAGCagtCAAAACAGCACCaacGTGCAAATG ATGCATCCGCCACCATGACCAGCAGGAGGTAGCagtCAAAACAGCACCaacGTGCAAATG ATGCATCCGCCACCATGACCAGCAGGAGGTAGCactCAAAACAGCACCaacGTGCAAATG ATGCATCCGCCACCATGACCAGCGGGAGGTAGCtctCAAAACAGCACCaacGTGCAAATG Rooted and Unrooted Trees Most phylogenetic methods produce unrooted trees. This is because they detect differences between sequences, but have no means to orient residue changes relatively to time. Two means to root an unrooted tree : The outgroup method : include in the analysis a group of sequences known a priori to be external to the group under study; the root is by necessity on the branch joining the outgroup to other sequences. Make the molecular clock hypothesis : all lineages are supposed to have evolved with the same speed since divergence from their common ancestor. The root is at the equidistant point from all tree leaves.

Unrooted Tree Rattus Mus Gallus Bos Homo 0.02 Rooted Tree 0.02 Gallus Rattus Mus Bos Homo Xenopus

Eucarya Universal phylogeny (1) Bacteria Archaea deduced from comparison of SSU and LSU rrna sequences (2508 homologous sites) using Kimura s 2-parameter distance and the NJ method. The absence of root in this tree is expressed using a circular design. Universal phylogeny (2) Schematic drawing of a universal rrna tree. The location of the root corresponds to that proposed by reciprocally rooted gene phylogenies. Brown & Doolittle (1997) Microbiol.Mol.Biol.Rev. 61:456-502

Number of possible tree topologies for n taxa N trees =3.5.7...(2n 5) = (2n 5)! 2n 3(n 3)! n N trees 4 3 5 15 6 105 7 945...... 10 2,027,025...... 20 ~ 2 x10 20 Methods for Phylogenetic reconstruction Three main families of methods : Parsimony Distance methods Maximum likelihood methods

Parsimony (1) Step 1: for a given tree topology (shape), and for a given alignment site, determine what ancestral residues (at tree nodes) require the smallest total number of changes in the whole tree. Let d be this total number of changes. A G G A G G 1 2 3 1 2 3 4 G 4 G A A A G G G G G C 6 X : ancestral nucleotide 5 A : substitution event Example: At this site and for this tree shape, at least 3 substitution events are needed to explain the nucleotide pattern at tree leaves. Several distinct scenarios with 3 changes are possible. 6 C 5 A Step 2: Parsimony (2) Compute d (step 1) for each alignment site. Add d values for all alignment sites. This gives the length L of tree. Step 3: Compute L value (step 2) for each possible tree shape. Retain the shortest tree(s) = the tree(s) that require the smallest number of changes = the most parsimonious tree(s).

Some properties of Parsimony Several trees can be equally parsimonious (same length, the shortest of all possible lengths). The position of changes on each branch is not uniquely defined => parsimony does not allow to define tree branch lengths in a unique way. The number of trees to evaluate grows extremely fast with the number of processed sequences : Parsimony can be very computation - intensive. The search for the shortest tree must often be restricted to a fraction of the set of all possible tree shapes (heuristic search) => there is no mathematical certainty of finding the shortest (most parsimonious) tree. Building phylogenetic trees by distance methods General principle : Sequence alignment (1) Matrix of evolutionary distances between sequence pairs (2) (unrooted) tree (1) Measuring evolutionary distances. (2) Tree computation from a matrix of distance values.

Correspondence between trees and distance matrices Any phylogenetic tree induces a matrix of distances between sequence pairs Perfect distance matrices correspond to a single phylogenetic tree A B C A 0 A B C tree B 1 0 C 4 3 0 Distance matrix Evolutionary Distances They measure the total number of substitutions that occurred on both lineages since divergence from last common ancestor. Divided by sequence length. Expressed in substitutions / site ancestor sequence 1 sequence 2

Quantification of evolutionary distances (1): The problem of hidden or multiple changes D (true evolutionary distance) fraction of observed differences (p) A C G A A G D = p + hidden changes A Through hypotheses about the nature of the residue substitution process, it becomes possible to estimate D from observed differences between sequences. Estimated D : d G A A A C G Quantification of evolutionary distances(2): Jukes and Cantor s distance (DNA) Hypotheses of the model (Jukes & Cantor, 1969) : (a) All sites evolve independently and following the same process. (b) All substitutions have the same probability. (c) The base substitution process is constant in time. Quantification of evolutionary distance (d) as a function of the fraction of observed differences (p): d = 3 4 ln(1 4 3 p) V(d) = 9p(1 p) (3 4 p) 2 N N = number of compared sites p d 0,10 0,11 0,20 0,23 0,40 0,57 0,60 1,21 0,75 +

Quantification of evolutionary distances (3): Poisson distances (proteins) Hypotheses of the model : (a) All sites evolve independently and following the same process. (b) All substitutions have the same probability. (c) The amino acid substitution process is constant in time. Quantification of evolutionary distance (d) as a function of the fraction of observed differences (p) : d = - ln(1 - p)!! The hypotheses of the Jukes-Cantor and the Poisson models are very simplistic!! Quantification of evolutionary distances (3bis): PAM and Kimura s distances (proteins) Hypotheses of the model (Dayhoff, 1979) : (a) All sites evolve independently and following the same process. (b) Each type of amino acid replacement has a given, empirical probability : Large numbers of highly similar protein sequences have been collected; probabilities of replacement of any a.a. by any other have been tabulated. (c) The amino acid substitution process is constant in time. Quantification of evolutionary distance (d) : the number of replacements most compatible with the observed pattern of amino acid changes and individual replacement probabilities. Kimura s empirical approximation : d = - ln( 1 - p - 0.2 p 2 ) (Kimura, 1983) where p = fraction of observed differences

Quantification of evolutionary distances (4): Kimura s two parameter distance (DNA) Hypotheses of the model : (a) All sites evolve independently and following the same process. (b) Substitutions occur according to two probabilities : One for transitions, one for transversions. Transitions : G < >A or C < >T Transversions : other changes (c) The base substitution process is constant in time. Quantification of evolutionary distance (d) as a function of the fraction of observed differences (p: transitions, q: transversions): d = 1 ln[(1 2 p q) 1 2q ] 2 Kimura (1980) J. Mol. Evol. 16:111 Quantification of evolutionary distances (5): Synonymous and non-synonymous distances (coding DNA): Ka, Ks Hypothesis of previous models : (a) All sites evolve independently and following the same process. Problem: in protein-coding genes, there are two classes of sites with very different evolutionary rates. non-synonymous substitutions (change the a.a.): slow synonymous substitutions (do not change the a.a.): fast Solution: compute two evolutionary distances Ka = non-synonymous distance Ka = nbr. non-synonymous substitutions / nbr. non-synonymous sites Ks = synonymous distance Ks = nbr. synonymous substitutions / nbr. synonymous sites

The genetic code TTT Phe TCT Ser TAT Tyr TGT Cys TTC Phe TCC Ser TAC Tyr TGC Cys TTA Leu TCA Ser TAA stop TGA stop TTG Leu TCG Ser TAG stop TGG Trp CTT Leu CCT Pro CAT His CGT Arg CTC Leu CCC Pro CAC His CGC Arg CTA Leu CCA Pro CAA Gln CGA Arg CTG Leu CCG Pro CAG Gln CGG Arg ATT Ile ACT Thr AAT Asn AGT Ser ATC Ile ACC Thr AAC Asn AGC Ser ATA Ile ACA Thr AAA Lys AGA Arg ATG Met ACG Thr AAG Lys AGG Arg GTT Val GCT Ala GAT Asp GGT Gly GTC Val GCC Ala GAC Asp GGC Gly GTA Val GCA Ala GAA Glu GGA Gly GTG Val GCG Ala GAG Glu GGG Gly Substitution rate = f (mutation, selection) Mutations Selection Substitutions NB: the vast majority of mutations are either neutral (i.e. have no phenotypic effect), or deleterious. Advantageous mutations are very rare.

Quantification of evolutionary distances (6): Calculation of Ka and Ks The details of the method are quite complex. Roughly : Split all sites of the 2 compared genes in 3 categories : I: non degenerate, II: partially degenerate, III: totally degenerate Compute the number of non-synonymous sites = I + 2/3 II Compute the number of synonymous sites = III + 1/3 II Compute the numbers of synonymous and non-synonymous changes Compute, with Kimura s 2-parameter method, Ka and Ks Frequently, one of these two situations occur : Evolutionarily close sequences : Ks is informative, Ka is not. Evolutionarily distant sequences : Ks is saturated, Ka is informative. Li, Wu & Luo (1985) Mol.Biol.Evol. 2:150 Ka and Ks : example # sites observed diffs. J & C K2P K A K S 10254 0.077 0.082 0.082 0.035 0.228 Urotrophin gene of rat (AJ002967) and mouse (Y12229)

Saturation: loss of phylogenetic signal When compared homologous sequences have experienced too many residue substitutions since divergence, it is impossible to determine the phylogenetic tree, whatever the tree-building method used. ancestor seq. 1 seq. 2 seq. 3 NB: with distance methods, the saturation phenomenon may express itself through mathematical impossibility to compute d. Example: Jukes-Cantor: p 0.75 => d -- and V(d) -- NB: often saturation may not be detectable Quantification of evolutionary distances (7): Other distance measures Several other, more realistic models of the evolutionary process at the molecular level have been used : Accounting for biased base compositions (Tajima & Nei). Accounting for variation of the evolutionary rate across sequence sites. etc...

Building phylogenetic trees by distance methods General principle : Sequence alignment (1) Matrix of evolutionary distances between sequence pairs (2) (unrooted) tree (1) Measuring evolutionary distances. (2) Tree computation from a matrix of distance values. A (bad) method : UPGMA Human Chimpanzee Gorilla Orang-utan Gibbon Human - 0.088 0.103 0.160 0.181 Chimpanzee 0.094-0.106 0.170 0.189 Gorilla 0.111 0.115-0.166 0.189 Orang-utan 0.180 0.194 0.188-0.188 Gibbon 0.207 0.218 0.218 0.216-0.014 0.037 0.009 0.093 0.047 0.047 0.056 Human Chimpanzee Gorilla Orang-utan Proportion of differences (p) (above diagonal) and Kimura s 2- parameter distances (d) (below) for mitochondrial DNA sequences (895 bp). Resulting UPGMA tree 0.107 Gibbon d(gibbon,[human+chimp]) = 1/2 [ d(gibbon,human) + d(gibbon,chimp) ]

Example of extremely unequal evolutionary rates Distance-based analysis of 42 LSU rrna sequences from microsporidia and other eukaryotes. Distances were corrected for among-site rate variation. Van de peer et al. (2000) Gene 246:1 UPGMA : properties UPGMA produces a rooted tree with branch length. It is a very fast method. But UPGMA fails if evolutionary rate varies among lineages. UPGMA would not have recovered the fungal evolutionary origin of microsporidia. ==> need methods insensitive to rate variations.

Distance matrix -> tree (1): preliminary Let us consider the following tree : lj j Let us consider two sets of distances between sequence pairs : d = distance as measured on sequences δ = distance induced by the above tree : δ i,j = l i + l j δ i,k = l i + l c + l k It is possible (with a computer) to compute branch lengths (l i, l j, l c, etc.) so that distances δ correspond best to distances d. Best" means that the divergence between d and δ values is minimal : = (d x, y x, y ) 2 1 x< y n It is then possible to compute the total tree length, S : S = l i + l j + l c + + l k +... i li lc lk k Distance matrix -> tree (2): The Minimum Evolution Method Step 1: for a given tree topology (shape), compute branch lengths that minimise ; compute tree length S. Step 2: repeat step 1 for all possible topologies. Keep the tree with smallest S value. Problem: this method is very computation intensive. It is practically not usable with more than 25 sequences. => approximate (heuristic) methods are used. Example: Neighbor-Joining.

Distance matrix -> tree (3): The Neighbor-Joining Method: algorithm Start from a star - topology and progressively construct a tree as : Step 1: Use d distances measured between the N sequences Step 2: For all pairs i et j: consider the following tree topology, and compute S i,j, the sum of all best branch lengths. (Saitou and Nei have found a simple way to compute S i,j ). j i lj li lc Step 3: Retain the pair (i,j) with smallest S i,j value. Group i and j in the tree. lk k Saitou & Nei (1987) Mol.Biol.Evol. 4:406 Distance matrix -> tree (4): The Neighbor-Joining Method: algorithm (2) Step 4: Compute new distances d between N-1 objects: pair (i,j) and the N-2 remaining sequences. d (i,j),k = (d i,k + d j,k ) / 2 Step 5: Return to step 1 as long as N 4. When N = 3, an (unrooted) tree is obtained Example 1 2 6 3 5 4 3 1 6 5 2 4 3 1 6 4 2 5 3 1 6 4 5 2

Distance matrix -> tree (5): The Neighbor-Joining Method (NJ): properties NJ is a fast method, even for hundreds of sequences. The NJ tree is an approximation of the minimum evolution tree (that whose total branch length is minimum). In that sense, the NJ method is very similar to parsimony because branch lengths represent substitutions. NJ produces always unrooted trees, that need to be rooted by the outgroup method. NJ always finds the correct tree if distances are tree-like. NJ performs well when substitution rates vary among lineages. Thus NJ should find the correct tree if distances are well estimated. Maximum likelihood methods (program fastdnaml, Olsen & Felsenstein) Hypotheses The substitution process follows a probabilistic model whose mathematical expression, but not parameter values, is known a priori. Sites evolve independently from each other. All sites follow the same substitution process (some methods use a more realistic hypothesis). Substitution probabilities do not change with time on any tree branch. They may vary between branches.

Maximum likelihood methods (1) Simple example : one - parameter substitution model : v = probability that a base changes per unit time (fastdnaml uses a more elaborate model) Maximum likelihood methods (2) Let us consider evolution along a tree branch : base x ancestor t time units base y descendant Our probabilistic model allows to compute the probability of substitution x y along this branch : P l (x,y) = 3 e 4 1 4e 4 l 3 + 1 4 4 l 3 + 1 4 if x = y if x y with l =3vt Quantity l = 3vt is the average number of substitutions / site along this branch, i.e. the branch length.

Maximum likelihood algorithm (1) Step 1: Let us consider a given rooted tree, a given site, and a given set of branch lengths. Let us compute the probability that the observed pattern of nucleotides at that site has evolved along this tree. S1, S2, S3, S4: observed bases at site in seq. 1, 2, 3, 4 S5, S6, S7: unknown and variable ancestral bases l1, l2,, l6: given branch lengths S1 l1 S5 l5 S2 l2 S7 S3 l3 S6 l6 S4 l4 P(S1, S2, S3, S4)= Σ S7 Σ S5 Σ S6 P(S7) P l5 (S7,S5) P l6 (S7,S6) P l1 (S5,S1) P l2 (S5,S2) P l3 (S6,S3) P l4 (S6,S4) where P(S7) is estimated by the average base frequencies in studied sequences. Maximum likelihood algorithm (2) Step 2: Let us compute the probability that entire sequences have evolved : P(Sq1, Sq2, Sq3, Sq4) = Π all sites P(S1, S2, S3, S4) Step 2: Let us compute branch lengths l1, l2,, l6 that give the highest P(Sq1, Sq2, Sq3, Sq4) value. This is the likelihood of the tree. Step 3: Let us compute the likelihood of all possible trees. The tree predicted by the method is that having the highest likelihood.

Maximum likelihood : properties This is the best justified method from a theoretical viewpoint. Sequence simulation experiments have shown that this method works better than all others in most cases. But it is a very computer-intensive method. It is nearly always impossible to evaluate all possible trees because there are too many. A partial exploration of the space of possible trees is done. The mathematical certainty of obtaining the most likely tree is lost. Reliability of phylogenetic trees: the bootstrap The phylogenetic information expressed by an unrooted tree resides entirely in its internal branches. 4 6 1 2 internal branch separating group 1+4 from group 6+2+5+3 5 3 The tree shape can be deduced from the list of its internal branches. Testing the reliability of a tree = testing the reliability of each internal branch.

Bootstrap procedure real alignment 1 N acgtacatagtatagcgtctagtggtaccgtatg aggtacatagtatgg-gtatactggtaccgtatg acgtaaat-gtatagagtctaatggtac-gtatg acgtacatggtatagcgactactggtaccgtatg random sampling, with replacement, of N sites artificial alignments 1 N gatcagtcatgtataggtctagtggtacgtatat tgagagtcatgtatggtgtatactggtacgtaat tgac-gtaatgtataggtctaatggtactgtaat tgacggtcatgtataggactactggtacgtatat } tree-building method 1000 times same tree-building method tree = series of internal branches for each internal branch, compute fraction of artificial trees containing this internal branch artificial trees The support of each internal branch is expressed as percent of replicates. "bootstrapped tree 0.02 Gallus 91 46 Rattus Mus 97 Bos Homo Xenopus

Bootstrap procedure : properties Internal branches supported by 90% of replicates are considered as statistically significant. The bootstrap procedure only detects if sequence length is enough to support a particular node. The bootstrap procedure does not help determining if the tree-building method is good. A wrong tree can have 100 % bootstrap support for all its branches! Gene tree vs. Species tree The evolutionary history of genes reflects that of species that carry them, except if : horizontal transfer = gene transfer between species (e.g. bacteria, mitochondria) Gene duplication : orthology/ paralogy

Orthology / Paralogy speciation duplication ancestral GNS gene Homology : two genes are homologous iff they have a common ancestor. Orthology : two genes are orthologous iff they diverged following a speciation event. Primates GNS1 Rodents GNS2! Paralogy : two genes are paralogous iff they diverged following a duplication event. Orthology functional equivalence GNS Human GNS1 GNS1 Rat Mouse GNS2 Rat GNS2 Mouse Reconstruction of species phylogeny: artefacts due to paralogy speciation duplication GNS1 GNS2 GNS1 Hamster GNS1 GNS1 GNS2 GNS2 Rat Mouse Rat Mouse true tree GNS2 Hamster GNS Hamster GNS Mouse GNS Rat tree obtained with a partial sampling of homologous genes!! Gene loss can occur during evolution : even with complete genome sequences it may be difficult to detect paralogy!!

Exploring the Bcl-2 family of inhibitors of apoptosis Aouacheria et al. (20001) Oncogene 20:5846 Phylogenetic tree of the Bcl-2 family derived from the NJ method applied to PAM evolutionary distances (94 homologous sites). The tree suggests human NRH, mouse Diva, chicken Nr-13, and Danio Nr-13 to be orthologous genes. The tree also suggests the 2 mammalian genes have evolved much faster than other family members. WWW resources for molecular phylogeny (1) Compilations A list of sites and resources: http://www.ucmp.berkeley.edu/subway/phylogen.html An extensive list of phylogeny programs http://evolution.genetics.washington.edu/ phylip/software.html Databases of rrna sequences and associated software The rrna WWW Server - Antwerp, Belgium. http://rrna.uia.ac.be The Ribosomal Database Project - Michigan State University http://rdp.cme.msu.edu/html/

WWW resources for molecular phylogeny (2) Database similarity searches (Blast) : http://www.ncbi.nlm.nih.gov/blast/ http://www.infobiogen.fr/services/menuserv.html http://bioweb.pasteur.fr/seqanal/blast/intro-fr.html http://pbil.univ-lyon1.fr/blast/blast.html Multiple sequence alignment ClustalX : multiple sequence alignment with a graphical interface (for all types of computers). http://www.ebi.ac.uk/ftp/index.html and go to software Web interface to ClustalW algorithm for proteins: http://pbil.univ-lyon1.fr/ and press clustal WWW resources for molecular phylogeny (3) Sequence alignment editor SEAVIEW : for windows and unix http://pbil.univ-lyon1.fr/software/seaview.html Programs for molecular phylogeny PHYLIP : an extensive package of programs for all platforms http://evolution.genetics.washington.edu/phylip.html CLUSTALX : beyond alignment, it also performs NJ PAUP* : a very performing commercial package http://paup.csit.fsu.edu/index.html PHYLO_WIN : a graphical interface, for unix only http://pbil.univ-lyon1.fr/software/phylowin.html WWW-interface at Institut Pasteur, Paris http://bioweb.pasteur.fr/seqanal/phylogeny

WWW resources for molecular phylogeny (4) Tree drawing NJPLOT (for all platforms) http://pbil.univ-lyon1.fr/software/njplot.html Lecture notes of molecular systematics http://www.bioinf.org/molsys/lectures.html WWW resources for molecular phylogeny (5) Books Laboratory techniques Molecular Systematics (2nd edition), Hillis, Moritz & Mable eds.; Sinauer, 1996. Molecular evolution Fundamentals of molecular evolution (2nd edition); Graur & Li; Sinauer, 2000. Evolution in general Evolution (2nd edition); M. Ridley; Blackwell, 1996.