A (short) introduction to phylogenetics
Thibaut Jombart, Marie-Pauline Beugin
MRC Centre for Outbreak Analysis and Modelling, Imperial College London
Genetic data analysis with PR Statistics, Millport Field Station
16 Aug 2016

Outline
- Context
- Phylogenies...
- Distance trees
- Parsimony
- Likelihood/Bayesian
- Uncertainty
- Pitfalls & best practices
- And more...

Context

Phylogenetics: from the origins...
"From the first growth of the tree, many a limb and branch has decayed and dropped off; and these fallen branches of various sizes may represent those whole orders, families, and genera which have now no living representatives, and which are known to us only in a fossil state."
C. Darwin, Notebook, 1837.

Phylogenetics: ...to the present
Phylogenetic trees:
- are part of the standard toolbox of genetic data analysis
- represent the evolutionary history of a set of (sampled) taxa
[Figure credit: Bininda-Emonds et al., 2007, Nature.]

And the main difference is...
Current trees look better! (and some other minor differences)

About the minor differences...
DNA sequencing revolution
- huge data banks freely available (e.g. GenBank)
- easier, cheaper, faster to obtain DNA sequences
- increasing number of full genomes available
Different ways to exploit this information.

Phylogenies...

Genetic changes (substitutions) accumulate over time
Substitution: replacement of a nucleotide (e.g. a → t)
[Figure: sequences diverging along a "Time" axis]
- ...attgtacgtaggatctagg...
- ...attgaacgtaggatctagg... and ...attgtacgtaccatctagg...
- ...attgaacgtagttgctagg... and ...atcgtacgtaccatctggg...
Substitution patterns reflect the evolutionary history

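As a small illustration of the idea on this slide, the sketch below counts the substitutions separating three of the sequences shown. The function name `count_substitutions` and the labels `seq_1` to `seq_3` are introduced here for illustration only; the sequences are assumed to be aligned and of equal length.

```python
def count_substitutions(seq_a: str, seq_b: str) -> int:
    """Count aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Three of the sequences shown on the slide (labels are hypothetical).
seq_1 = "attgtacgtaggatctagg"
seq_2 = "attgaacgtaggatctagg"
seq_3 = "attgtacgtaccatctagg"

print(count_substitutions(seq_1, seq_2))  # 1 (position 5)
print(count_substitutions(seq_1, seq_3))  # 2 (positions 11 and 12)
print(count_substitutions(seq_2, seq_3))  # 3
```

The two sequences that each differ from the first by few substitutions differ from each other by more, which is the kind of pattern a tree tries to explain.
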
Using substitution patterns to reconstruct the evolutionary history
[Figure: six sequences at the tips of the reconstructed phylogeny]
- ...attaaacgtaggatctagg...
- ...attaaacgtaggatctagg...
- ...attcatacgtaggatcagg...
- ...attgtacgtaggatctttt...
- ...attgtacgtaggatctttt...
- ...attgcatgtaggatctttt...
Phylogenetics aims to reconstruct evolutionary trees (phylogenies) from genetic sequence data.

Using trees to represent the evolutionary history
Tips
- "taxa"
- "tips"
- "leaves"
- "Operational Taxonomic Units (OTUs)"
Most Recent Common Ancestors (MRCA)
- "nodes"
- "Hypothetical Taxonomic Units (HTUs)"
Branch
- "edge"
- length = amount of evolution (not time, as a rule)
- length is optional
Distances between tips
- "patristic" distance: sum of branch lengths
- other measures of distance/dissimilarity
- vertical axis meaningless
Root
- oldest part of the tree
- defined using an outgroup
- optional
How do we build them?

Workflow
Prepare data
- align sequences: alignment software + manual refinement
Build the tree
- distance-based methods
- maximum parsimony
- likelihood-based methods (ML, Bayesian)
Analyse the tree
- assess uncertainty
- test phylogenetic signal
- model trait evolution
- ...

Distance trees

Distance-based phylogenetic reconstruction
Approaches relying on agglomerative clustering algorithms (e.g. single linkage, UPGMA, Neighbor-Joining)
Rationale
1. compute pairwise genetic distances D
2. group closest sequences
3. update D
4. go back to 2) until all sequences are grouped

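The four steps above can be sketched as follows. This is a minimal illustration, not the implementation used in the practical: it assumes a simple proportion-of-differences distance, uses a size-weighted average (UPGMA-style) for the update step, returns only the topology as nested tuples, and runs on a made-up alignment (`toy_alignment`). All function and variable names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites that differ between two sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster(seqs):
    """Agglomerative clustering; returns the topology as nested tuples."""
    groups = list(seqs)                    # current groups (tips to start with)
    size = {name: 1 for name in seqs}      # number of tips in each group
    # Step 1: pairwise genetic distances D between all tips.
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    while len(groups) > 1:
        # Step 2: group the two closest sequences/groups.
        i, j = min(combinations(groups, 2), key=lambda p: dist[frozenset(p)])
        g = (i, j)
        # Step 3: update D with the distance from every other group k to g
        # (average weighted by group sizes, an assumption of this sketch).
        for k in groups:
            if k not in (i, j):
                dist[frozenset((k, g))] = (
                    size[i] * dist[frozenset((k, i))]
                    + size[j] * dist[frozenset((k, j))]
                ) / (size[i] + size[j])
        size[g] = size[i] + size[j]
        # Step 4: repeat until everything is grouped.
        groups = [k for k in groups if k not in (i, j)] + [g]
    return groups[0]


# Made-up alignment, for illustration only.
toy_alignment = {
    "A": "aaaaaaaaaa",
    "B": "aaaaaaaaat",
    "C": "ttttaaaaaa",
    "D": "ttttccaaaa",
}
tree = cluster(toy_alignment)
print(tree)
```
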
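What is the distance between a node and tips?
[Figure: groups i and j, separated by $D_{i,j}$, are joined by a new node g; k is another tip]
Hierarchical clustering: $D_{k,g} = \dots$
- single linkage: $D_{k,g} = \min(D_{k,i}, D_{k,j})$
- complete linkage: $D_{k,g} = \max(D_{k,i}, D_{k,j})$
- UPGMA: $D_{k,g} = \frac{D_{k,i} + D_{k,j}}{2}$ (when i and j contain the same number of tips; in general the average is weighted by group sizes)
- Neighbor joining: transforms original distances to account for heterogeneous rates of evolution
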
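The update rules on this slide differ only in how the two distances from k are combined. The sketch below makes that explicit; the function name `update_distance` and the parameters `n_i` and `n_j` (numbers of tips in groups i and j) are introduced here for illustration. Neighbor joining is not covered, since it transforms the whole distance matrix rather than applying a simple update.

```python
def update_distance(d_ki, d_kj, method, n_i=1, n_j=1):
    """Distance from k to the new group g formed by merging i and j."""
    if method == "single":
        return min(d_ki, d_kj)
    if method == "complete":
        return max(d_ki, d_kj)
    if method == "upgma":
        # Average weighted by group sizes; equals (d_ki + d_kj) / 2
        # when i and j contain the same number of tips.
        return (n_i * d_ki + n_j * d_kj) / (n_i + n_j)
    raise ValueError(f"unknown method: {method}")


# Made-up distances from k to i and from k to j.
d_ki, d_kj = 0.25, 0.75
for method in ("single", "complete", "upgma"):
    print(method, update_distance(d_ki, d_kj, method))
```

With unequal group sizes, for example `update_distance(0.25, 0.75, "upgma", n_i=3, n_j=1)`, the UPGMA value moves towards the distance to the larger group.
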
Distance-based phylogenetic reconstruction
Advantages
- simple
- flexible (many distances and clustering algorithms)
- fast and scalable (applicable to large datasets)
Limitations
- sensitive to distance/clustering chosen
- evolutionary rates are not estimated
- no measure of uncertainty for the tree obtained

Parsimony

Maximum parsimony phylogenies
Approaches relying on finding the tree with the smallest number of character changes (substitutions)
Rationale
1. start from a pre-defined tree
2. compute initial parsimony score
3. permute branches and compute parsimony score
4. accept new tree if the parsimony score is improved
5. go back to 3) until convergence

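To make the parsimony score concrete, here is a minimal sketch that scores a tree with Fitch's algorithm (one standard way of counting the minimum number of changes; the slides do not say which algorithm is used). Trees are written as nested tuples of tip names, the alignment is made up, and instead of permuting branches the sketch simply compares the three possible groupings of four tips.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, number of changes) for one site."""
    if isinstance(tree, str):              # a tip: its observed state, no change
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:                             # children agree: no extra change
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum of the minimum number of changes over all sites."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[s] for name, seq in alignment.items()})[1]
        for s in range(n_sites)
    )


# Made-up alignment and the three candidate groupings of four tips.
toy_alignment = {"A": "aac", "B": "aag", "C": "ttc", "D": "ttg"}
candidates = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
scores = [parsimony_score(tree, toy_alignment) for tree in candidates]
best = candidates[scores.index(min(scores))]
print(scores, best)
```

For larger datasets the candidates cannot all be listed, which is why the rationale above relies on iteratively rearranging a starting tree.
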
Maximum parsimony phylogenies
[Figure: an initial tree and rearranged trees, with parsimony scores of 8, 6 and 5]

Maximum parsimony phylogenies
Advantages
- applicable to any discontinuous characters (not just DNA)
- intuitive explanation: simplest evolutionary scenario
Limitations
- evolutionary rates are not estimated
- no measure of uncertainty for the tree obtained
- computer-intensive
- different types of substitutions ignored
- evolution not necessarily parsimonious
- sensitive to heterogeneous rates of evolution (long branch attraction)

Likelihood/Bayesian

Likelihood-based phylogenies (ML / Bayesian)
Approaches relying on a model of sequence evolution:
- ML: find the tree and evolutionary rates with the highest likelihood
- Bayesian: sample trees and evolutionary rates according to their posterior probability
Rationale
1. start from a pre-defined tree
2. compute initial likelihood/posterior
3. permute branches, sample new parameters and compute likelihood/posterior
4. accept new tree and parameters based on likelihood/posterior improvement
5. go back to 3) until convergence

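A full tree search is beyond a short example, but the "propose, evaluate, accept if better" loop can be sketched on the simplest possible case: estimating a single evolutionary distance between two sequences. The sketch assumes the Jukes-Cantor (JC69) substitution model, which the slides do not name, and made-up counts (10 differences over 100 sites); all function names are hypothetical. The result is checked against the known closed-form JC69 estimate.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of a distance d (substitutions per site) under JC69."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay     # P(site unchanged | starting nucleotide)
    p_other = 0.25 - 0.25 * decay    # P(one specific different nucleotide)
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_other))


def maximise(n_sites, n_diff, start=1.0, step=0.5, tol=1e-7):
    """Hill climbing: accept a proposed distance only if likelihood improves."""
    d = start
    best = jc69_log_likelihood(d, n_sites, n_diff)
    while step > tol:
        improved = False
        for candidate in (d - step, d + step):
            if candidate <= 0:
                continue
            value = jc69_log_likelihood(candidate, n_sites, n_diff)
            if value > best:
                d, best, improved = candidate, value, True
                break
        if not improved:
            step /= 2                # refine the proposals near the optimum
    return d, best


# Made-up data: 10 differing sites out of 100.
d_hat, log_lik = maximise(n_sites=100, n_diff=10)
d_closed_form = -0.75 * math.log(1 - (4.0 / 3.0) * (10 / 100))
print(round(d_hat, 4), round(d_closed_form, 4))
```

In a real analysis the parameters include the tree topology, all branch lengths and the model parameters, and Bayesian methods sample from the posterior rather than climbing to a single optimum.
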
Likelihood-based phylogenies (ML / Bayesian)
[Figure: likelihood / posterior density]

Likelihood-based phylogenies (ML / Bayesian)
Advantages
- very flexible
- consistent with a model of evolution
- statistically consistent (model comparison)
- parameter estimation
- (Bayesian) several trees: measure of uncertainty
Limitations
- computer-intensive
- choice of the model of evolution
- (ML) no measure of uncertainty for the tree obtained
- (Bayesian) need to find a consensus tree

Uncertainty

How do we know the tree is robust?
Main issue: assess the uncertainty of the tree topology / individual nodes
Approaches
- ML: model selection to compare trees (whole tree)
- Bayesian methods: between-samples variability (individual nodes)
- any method: bootstrap (individual nodes)

Bootstrapping phylogenies
- assess variability due to sampling the genome and conflicting signals
- relies on analysing resampled datasets
Rationale
1. obtain a reference tree
2. resample the sites with replacement
3. obtain a tree for the resampled dataset
4. go back to 2) until the desired number of bootstrapped trees is attained
5. compute the frequency of each bifurcation of the reference tree occurring in bootstrapped trees

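The five steps above can be sketched as follows. This is an illustration under several assumptions: trees are built by a simple average-linkage clustering on the proportion of differing sites, each tree is summarised by its set of clades (groups of tips), the alignment is made up, and the random seed is fixed only to make the run repeatable. All names are hypothetical.

```python
import random
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites that differ between two sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def clades(alignment):
    """Average-linkage clustering; returns the non-trivial clades found."""
    groups = [frozenset([name]) for name in alignment]

    def dist(g, h):  # mean pairwise distance between the tips of g and h
        return sum(p_distance(alignment[x], alignment[y])
                   for x in g for y in h) / (len(g) * len(h))

    found = set()
    while len(groups) > 2:
        g, h = min(combinations(groups, 2), key=lambda pair: dist(*pair))
        found.add(g | h)
        groups = [x for x in groups if x not in (g, h)] + [g | h]
    return found


def resample_sites(alignment, rng):
    """Step 2: draw alignment columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_support(alignment, n_boot=200, seed=1):
    rng = random.Random(seed)
    reference = clades(alignment)                    # step 1
    counts = {clade: 0 for clade in reference}
    for _ in range(n_boot):                          # steps 2 to 4
        replicate = clades(resample_sites(alignment, rng))
        for clade in reference:
            counts[clade] += clade in replicate
    return {clade: counts[clade] / n_boot for clade in reference}  # step 5


# Made-up alignment, for illustration only.
toy_alignment = {"A": "aaaaaaaaaa", "B": "aaaaaaaaat",
                 "C": "ttttaaaaaa", "D": "ttttccaaaa"}
support = bootstrap_support(toy_alignment)
for clade, value in support.items():
    print(sorted(clade), value)
```
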
Bootstrapping phylogenies
[Figure: sites of the original alignment are sampled with replacement to give several resampled alignments; a phylogeny is reconstructed from each, and the topologies are compared with the phylogeny reconstructed from the original alignment]

Bootstrapping phylogenies
Advantages
- standard
- simple to implement
Limitations
- possibly computer-intensive
- assumes that the genome has been sampled randomly (often wrong)

Pitfalls & best practices

Plotting trees as rooted
[Figure: several plots of trees with tips numbered 1 to 15]
Never plot an unrooted tree as rooted.

Interpreting distances
[Figure: a tree with two annotated distances, one labelled "meaningless" and one labelled "meaningful"]
distance = sum of branch lengths

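The meaningful distance between two tips is the sum of the branch lengths along the path joining them, through their most recent common ancestor. A minimal sketch, assuming the tree is stored as a mapping from each node to its parent and the length of the branch leading to it (this representation, the node names and the branch lengths are all made up for illustration):

```python
def path_to_root(tip, parent):
    """List of (node, distance from tip) from a tip up to the root."""
    path, node, total = [(tip, 0.0)], tip, 0.0
    while node in parent:
        node, length = parent[node]
        total += length
        path.append((node, total))
    return path


def patristic(tip_a, tip_b, parent):
    """Sum of branch lengths on the path between two tips."""
    dist_a = dict(path_to_root(tip_a, parent))
    for node, dist_b in path_to_root(tip_b, parent):
        if node in dist_a:               # first shared ancestor: the MRCA
            return dist_a[node] + dist_b
    raise ValueError("tips are not in the same tree")


# Made-up tree: root -> n1 (0.5), root -> C (2.0), n1 -> A (1.0), n1 -> B (0.25)
parent = {
    "A": ("n1", 1.0),
    "B": ("n1", 0.25),
    "n1": ("root", 0.5),
    "C": ("root", 2.0),
}
print(patristic("A", "B", parent))  # 1.25
print(patristic("A", "C", parent))  # 3.5
```

How far apart two tips are drawn on the vertical axis of a plot plays no part in this calculation.
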
The paradox of divergent clusters
[Figure: tree with red and blue tips, where D(red, red) > D(red, blue)]
MRCA and genetic distances may give different information.

Taking uncertainty into account
[Figure: tree of 193 HIV-1 sequences from DRC (Strimmer & Pybus 2001), shown in full and as a collapsed tree (threshold length 0.01)]
At best, the tree is an estimate of the likely evolutionary history of the taxa studied.

(Over, Mis)Interpreting temporal trends
[Figure: time tree with 100 tips (t1 to t100)]
[Figure: scatter plot of "Mutations to the root" against "Time to the root"]

    > anova(lm(d.root ~ -1 + t.root))
    Analysis of Variance Table

    Response: d.root
              Df Sum Sq Mean Sq F value    Pr(>F)
    t.root     1 5434.7  5434.7  214.66 < 2.2e-16 ***
    Residuals 99 2506.4    25.3

Time trees only make sense under a near-perfect molecular clock.

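The R call on this slide fits a regression through the origin of mutations to the root on time to the root. The sketch below shows the same calculation in plain Python on made-up data (`t_root`, `d_root` and `fit_through_origin` are hypothetical names), and then reads the proportion of variation explained off the sums of squares printed on the slide, under the assumption that the uncentred R-squared used for no-intercept models is the relevant summary.

```python
def fit_through_origin(t, d):
    """Least-squares slope of d on t with no intercept, and uncentred R^2."""
    slope = sum(x * y for x, y in zip(t, d)) / sum(x * x for x in t)
    ss_residual = sum((y - slope * x) ** 2 for x, y in zip(t, d))
    ss_total = sum(y * y for y in d)     # uncentred: no intercept in the model
    return slope, 1 - ss_residual / ss_total


# Made-up data: a perfectly clock-like case and a noisier one.
t_root = [10, 20, 30, 40]
slope_clock, r2_clock = fit_through_origin(t_root, [1, 2, 3, 4])
slope_noisy, r2_noisy = fit_through_origin(t_root, [3, 1, 5, 2])
print(slope_clock, r2_clock)
print(slope_noisy, r2_noisy)

# Sums of squares taken from the ANOVA table on the slide.
ss_model, ss_residual = 5434.7, 2506.4
r2_slide = ss_model / (ss_model + ss_residual)
print(round(r2_slide, 2))
```

A very small p-value therefore does not by itself indicate a near-perfect clock: with the slide's numbers, roughly a third of the sum of squares is left unexplained by time.
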
And more...

This is only the beginning
Many things can be done with trees
- estimate divergence time
- model trait evolution (phylogenetic comparative method)
- reconstruct ancestral states
- measure diversity
- infer past demographics/effective population size (coalescence)
- ...
and also, other approaches than phylogenetics to analyse genetic data

Time to get your hands dirty!
The pdf of the practical is online: http://adegenet.r-forge.r-project.org/
or Google adegenet documents GDAR August 2016