Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996

Similar documents
Dr. Amira A. AL-Hosary

Constructing Evolutionary/Phylogenetic Trees

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Phylogenetic Tree Reconstruction

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Bootstrap tests. Patrick Breheny. October 11. Bootstrap vs. permutation tests Testing for equality of location

Constructing Evolutionary/Phylogenetic Trees

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

A (short) introduction to phylogenetics

Lecture 11 Friday, October 21, 2011

Consistency Index (CI)

Hypothesis Testing with the Bootstrap. Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

EVOLUTIONARY DISTANCES

8/23/2014. Phylogeny and the Tree of Life

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)

Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

Hypothesis testing and phylogenetics

The Nonparametric Bootstrap

One-Sample Numerical Data

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

STAT Section 2.1: Basic Inference. Basic Definitions

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Bootstrap Confidence Intervals

Algorithms in Bioinformatics

How to read and make phylogenetic trees Zuzana Starostová

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Political Science 236 Hypothesis Testing: Review and Bootstrapping

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

Inference for Single Proportions and Means T.Scofield

Biology 211 (2) Week 1 KEY!

Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2008

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley


Classification and Phylogeny

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Classification and Phylogeny

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

The bootstrap. Patrick Breheny. December 6. The empirical distribution function The bootstrap

Phylogenetic inference

Lecture 6 Phylogenetic Inference

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Evaluating phylogenetic hypotheses

Chapter 26 Phylogeny and the Tree of Life

Nonparametric Methods II

Theory of Evolution Charles Darwin

Model Selection, Estimation, and Bootstrap Smoothing. Bradley Efron Stanford University

Intraspecific gene genealogies: trees grafting into networks

Molecular Evolution & Phylogenetics

What to do today (Nov 22, 2018)?

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Phylogeny is the evolutionary history of a group of organisms. Based on the idea that organisms are related by evolution

Name: Class: Date: ID: A

Phylogeny and Molecular Evolution. Introduction

3 Joint Distributions 71

One-minute responses. Nice class{no complaints. Your explanations of ML were very clear. The phylogenetics portion made more sense to me today.

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003.

11. Bootstrap Methods

Random Numbers and Simulation

GENETICS - CLUTCH CH.22 EVOLUTIONARY GENETICS.

What is Phylogenetics

Phylogenetics. BIOL 7711 Computational Bioscience

Bayesian learning Probably Approximately Correct Learning

Systematics - Bio 615

Questions we can ask. Recall. Accuracy and Precision. Systematics - Bio 615. Outline

Estimating Evolutionary Trees. Phylogenetic Methods

BINF6201/8201. Molecular phylogenetic methods

Statistical Applications in Genetics and Molecular Biology

An Introduction to Mplus and Path Analysis

Business Statistics. Lecture 10: Course Review

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Resampling and the Bootstrap

Probability and Statistics

Bootstrapping Australian inbound tourism

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY

Fig. 26.7a. Biodiversity. 1. Course Outline Outcomes Instructors Text Grading. 2. Course Syllabus. Fig. 26.7b Table

4 Resampling Methods: The Bootstrap

PHYLOGENY & THE TREE OF LIFE

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

Cladistics and Bioinformatics Questions 2013

Frequentist Properties of Bayesian Posterior Probabilities of Phylogenetic Trees Under Simple and Complex Substitution Models

The Life System and Environmental & Evolutionary Biology II

What Is Conservation?

Phylogenetic analyses. Kirsi Kostamo

An Introduction to Path Analysis

Resampling techniques for statistical modeling

Inverse Sampling for McNemar s Test

Reading for Lecture 13 Release v10

CONTENTS OF DAY 2. II. Why Random Sampling is Important 10 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

Phylogenetic Analysis

Transcription:

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Following Confidence limits on phylogenies: an approach using the bootstrap, J. Felsenstein, 1985 1

I. Short review of the bootstrap method [Efron, B. & Tibshirani, R. (1993) An Introduction to the Bootstrap; Saharon s course Bootstrap and Resampling Methods ] II. III. Short review of phylogenetic trees [Saharon s course Topics in Statistical Genetics ] Felsenstein s method for computing confidence levels on monophyly of groups in phylogenetic trees using the bootstrap, and critiques of it [Felsenstein, J. (1985) Confidence Limits on Phylogenies: An Approach Using the Bootstrap; Saharon s course Bootstrap and Resampling Methods ] IV. Efron s justification of Felsenstein s method and corrections of it [Efron, B., Halloran, E. & Holmes, S. (1996) Bootstrap confidence levels for phylogenetic trees; Efron, B. (1982) The jackknife, the bootstrap and other resampling plans; Efron, B. (1987) Better Bootstrap Confidence Intervals] 2

I. Short review of the bootstrap method Population distribution function F, some parameter or feature uf A random sample X x,, 1 xn from this population Estimator ˆ s X What is the estimator s quality? We want to estimate some parameter that characterizes it, t F, which may be a function of the unknown parameter. 3

F u F X ˆ s X ˆF plug-in u F * ˆ * * X,, 1 XB ˆ * * ˆ ˆ 1,, B t F ˆB 4

II. Short review of phylogenetic trees Every organism carries sequences of DNA, and each species is characterized by its typical sequences. The history of all species begins with one ancestor, whose offspring developed various mutations. Throughout history species were formed through accumulation of mutations. We want to build a tree to map the history and relationship between species. All species existing today are leaves in this tree. The real tree is rooted, since there was only one species in the beginning. 5

6 Bacteria - a large domain of prokariotic microorganisms (Prokariots - organisms whose cells lack a cell nucleus (karyon)). Archaea - another domain of prokariots, all single-celled. Eucaryota - organisms whose cells do have a nucleus.

If we take sequences from very different species, with high probability all nucleotides can be described by the same tree. It means that for every site, the convergence of (human, cow) is closer than of (human, bird) and of (cow, bird). This is probable given the history. 7

Table of sequences: 8

The simplest metrics between species: Hamming distance - the number of sites at which the two species differ. Tree building methods: 1. Neighbour joining: Connect the two closest species, and so on. 2. Maximum parsimony: Minimize the total number of mutations in the tree. Good for many thousands of species. 9

3. Maximum likelihood: Parameters: - Tree topology - Mutation model - Edge lengths Usually assume that the sites are independent. Good for a few tens of species. None of these models gives information about the root s location. 10

11

Methods for placing the root: 1. Molecular clock: Mutations occur in a constant rate over history. Choose the most distant point from the leaves, according to some metrics (e.g. the point with biggest minimal distance from a leaf). 2. Outgroup: Add a species that is surely separated from all, and hang the tree from the edge leading to it. 12

III. Felsenstein s method a monophyletic group is a group of species which consists of all the species that are descended from some edge in the rooted tree, and only of them. A rooted tree is a series of arguments about the monophyly of groups in the tree. Use bootstrap to build phylogenetic trees and test hypotheses about the monophyly of groups The researcher s main interest is to decide whether or not a group of species is monophyletic. 13

Data table: Bootstrap across the characters: sample columns with replacement. n 14

We get a bootstrap sample that is a multinomial r.v. : each of the possible columns has a probability that is its frequency in the original sample of columns. Choose some optimality criterion for tree building (e.g. maximum parsimony). Build a tree based on each of the bootstrap samples. The estimated tree(s) is built according to the majority rule : the groups that are monophyletic in it are the ones that where monophyletic in most of the bootstrap trees. 15

S - some group of species. H 0 : S is not monophyletic Reject H 0 If S is monophyletic in at least 95% of the bootstrap trees, i.e. if it is not monophyletic in less than 5% of the bootstrap trees. This is equivalent to building a one-sided percentile confidence interval for the indicator 1, S is monophyletic 0, otherwise 16

For each bootstrap sample ( b1,, B ): * 1, S is monophyletic according to sample ˆ b 0, otherwise b * And CI ˆ,1. pct, 0.95 0.05 ˆ * B 1 If the 5 th percentile of b b is 1 (i.e., the lower bound of the bootstrap percentile confidence interval is 1), Felsenstein rejects H 0 and says that the group is monophyletic. The proportion of bootstrap trees where S is monophyletic is Felsenstein s confidence in the monophyly of the group. Efron denotes it by 1Felsenstein's p-value. 17

Data set used by Efron to demonstrate the use of Felsenstein s method: 11 malaria species, 221 sites. The first 20 columns of the data matrix X : 18

Tree building algorithm: distance matrix between species (221 x 221) ˆD X Dˆ ˆ tree Proceeding with Felsenstein s method: B 200 bootstrap samples were generated, and bootstrap trees were built: X Dˆ tree ˆ * * * 19

For every monophyletic group in the original tree, the proportion of bootstrap trees where this group is monophyletic (Felsenstein s confidence) was calculated. For example: the 9-10 clade appeared (as monophyletic) in 193 of the 200 bootstrap trees, so its confidence value is 0.965. 20

21

What are the main problems with Felsenstein s method: 1) With respect to this course? 2) As put by Efron? 3) According to other critiques, such as Hillis and Bull? 22

1) With respect to this course: This is not correct hypothesis testing: 1, S is monophyletic 0, otherwise H 0 : S is not monophyletic, i.e. 0. H 1 : is monophyletic, i.e.. S 1 Test statistic: ˆ s X We need to create a null bootstrap world, where S is not monophyletic, i.e. 0, and use it to estimate by calculating H 0 P s X s X obs 23

ˆP F 0 s X s X ASL ˆ obs boot, B * b s X This is the proportion of bootstrap samples that give a tree in which S is monophyletic, when F 0, the * distribution function of bootstrap samples X, is in H 0 and is as similar as possible to F (the distribution of in the real world). Instead, Felsenstein resamples from a specific distribution in H 1 the empirical distribution of X, that gave us a tree in which S is monophyletic. This Is not valid hypothesis testing. # s X B obs X 24

Bootstrap percentile confidence interval cannot be justified, for example because the monophyly indicators and ˆ can only get 0 or 1, so there cannot exist a monotone s.t. 2 ˆ g g ~ N 0,. g 25

2) As put by Efron: The use of the (nonparametric) bootstrap to estimate the parameter itself (, the indicator for the monophyly of S, by estimating the tree), instead of the quality of the estimator (bias, variance etc.). We want to infer our confidence that S is truly monophyletic given that we estimated it as monophyletic. But we are actually estimating the probability of inferring monophyly, assuming that our original sample is a true description of the real tree (in the sense that S is indeed monophyletic). 26

In other words: We want to decide how confident (confidence = 1 - p-value) we are that 1 given that our estimate is ˆ 1. This is equivalent to building a valid percentile plugin CI: ˆ ˆ ˆ ˆ ˆ * * CI pct, plug-in, 0.95,1 2,1 0.95 0.95 But we are actually estimating the probability to get ˆ 1 when 1. This is equivalent to building the less valid percentile CI: * CI pct, 0.95 ˆ 0.05,1 27

3) According to other critiques, such as Hillis and Bull: Felsenstein s confidence values are consistently too conservative (i.e., biased downward) as an assessment of the tree accuracy. 28

IV. Efron s justification and corrections Efron, Halloran & Holmes: a) Felsenstein s method provides a reasonable first approximation to the actual confidence levels of the observed clades. This is an answer to problem (2). b) Felsenstein s confidence assessment is not consistently biased downward. This is an answer to problem (3). c) Suggest a more complex method to give better assessments of confidence. 29

a) Efron s answer to problem (2): Why is Felsenstein s confidence justified? The rationale: The bootstrap samples come from a multinomial model: In the malaria example, there are 11 species, so 11 there are K 4 4 possible column vectors for x. Denote them y,, 1 yk. Suppose each observed column of X is independently chosen from these vectors, with K some probability to choose y ( ). These k k k 1 k 1 probabilities depend on the population of sites (the true tree). Denote. 1,, K 30

X can be characterized by the proportion of its n 221 columns equaling each possible y : ˆ k # columns of X equaling ˆ ˆ,, ˆ 1 K ˆD is a function of the original observed proportions ˆ, so the tree-building algorithm can be described as ˆ n Dˆ tree ˆ y k k 31

We can add the group of interest s monophyly indicator to the process: In a similar way (although possibly in an opposite causal order the tree comes first in reality), the vector of true probabilities gives the true distance matrix and the true tree (that we assume we would get by applying our algorithm to the true distance matrix): K (where D 2 ij k yki ykj ). k1 π D tree ψ π D tree ψ 32

The proportions of columns in a bootstrap sample are: ˆ * k And we get a bootstrap tree: * # columns of X equaling ˆ ˆ,, ˆ * * * 1 K n y k (and ψ ). ˆ Dˆ tree ˆ * * * 33

A schematic picture of the space of possible s: We hope to have ˆ tree = tree. 34

It is not clear why we can use the distribution * of tree ˆ tree ˆ to infer our confidence in tree tree ˆ. Or in terms of the monophyly indicator, why we can decide how confident we are that given that our estimate is ˆ 1 by estimating the probability that ˆ 1 given that 1. 1 35

The answer is in using a Bayesian approach: * The bootstrap distribution of tree ˆ tree ˆ is almost the same as the posterior distribution of tree tree ˆ if we begin with an uninformative (i.e., uniform) prior density. An explanation can be found in Efron s booklet The Jackknife, the Bootstrap and other resampling plans (1982). 36

Assume a discrete sample space. In our phylogenetic case, this is the space of all 11 possible columns ( K 4 4 ). k - the probability that a column in a sample will be identical to the possible column y k. The observed frequency: # xi equaling y k ˆ k n The observed frequencies: The real probabilities: ˆ ˆ ˆ 1,, K,, 1 K 37

Choose as the prior distribution of (the real probabilities the parameter we want to estimate) a K dimentional Dirichlet distribution with parameter, which has to be a vector of positive numbers:,, 1 K We choose,, symmetric distribution.. K a a a1 to get a 38

From Wikipedia: Several images of probability densities of the Dirichlet distribution with K=3. Clockwise from top left: α = (6,2,2), (3,7,5), (6,2,6), (2,3,4). K 39

Left plot: Center plot: Right plot: 1 2 3 0.1 1 2 3 1 1 2 3 10 40

So apriori density function,, ~ Dir 1 K K with the f,, B 1 K 1 K a K 1 a1, 1 k k1 k1 k k 1 K1, 1,, K0 0, otherwise 41

From the sample we have the observed frequencies ˆ. The likelihood function P ˆ nˆ ~ Mult n, ˆ ~ K k 1 ˆ Mult n, n ˆ! n! k K k1 n is multinomial: n ˆ K k k, n ˆ n, n ˆ k k Z k 1 0, otherwise 42

The Dirichlet distribution is self conjugate with regard to multinomial likelihood, and the posterior distribution of is ˆ ~ Dir a1 + n ˆ which means that we add to each parameter the number of times that the corresponding column appears in the sample. If we minimize the effect of the prior knowledge by taking a 0, we will get K ˆ ~ Dir n ˆ K X 43

The distribution of the observed frequencies in a bootstrap sample, ˆ * Mult n, ˆ ˆ ~ is a good approximation for the limit of the posterior distribution ˆ ~ Dir n ˆ. Therefore we have a basis to think that Felsenstein s confidence level is a good approximation for the confidence level we try to estimate. n K 44

b) Efron s answer to problem (3): Why is Felsenstein s confidence assessment not consistently biased downward? To better explain the problem in Felsenstein s method which causes the apparent bias, Efron uses a simpler example: A normal model (instead of multinomial) and parametric bootstrap (instead of nonparametric). 45

A simpler model: x~n 2 μ, I μ = μ 1, μ 2 ˆ x The - plane is divided into regions, 1. ˆ lies in 1, and we wish to assign a confidence value to the event that itself lies in 1. In our terms: H 0 : μ ϵ 2 H 1 : μ ϵ 1 confidence = 1 - p-value 46

Parametric bootstrap: x ~N 2 μ, I Felsenstein s confidence value: P ˆ * ˆ 1 As in the multinomial case, it can be justified by using the Bayesian argument. 47

Why is a reasonable assessment of the confidence that? As in the multinomial model, is the posterior probability that 1 given that when we assume apriori that can be anywhere in the plane with equal probability. The posterior distribution:, exactly the distribution of ˆ ˆ. μ μ ~ N 2 μ, I ˆ 1 48

μ μ, μ μ ~ N 2 0, I, but μ μ~ N 2 0, 2I. According to Efron, this generates the bias for which the method was criticized. In the phylogeny case, the fact is that the * probability that tree ˆ = tree is usually less than the probability that tree ˆ = tree. But Efron also gives a perhaps more convincing explanation for the alleged bias. 49

Two possible examples: 50

Felsenstein s confidence value: * P ˆ Calculated theoretically: I 0.933, II 0.949 How would we assign a confidence level to μ ϵ 1? In our terms: H 0 : μ ϵ 2 H 1 : μ ϵ 1 confidence = 1 - p-value ˆ 1 51

A more customary way of assigning a confidence level to : μ ~ F = N 2 μ, I 1 μ ~ F = N 2 μ 0, I 52

p-value = 1 ˆ = # μ further than μ from the boundary of R 2 B The confidence level for the two cases (computed numerically): I II ˆ 0.933, Felsenstein's confidence level I ˆ 0.914, Felsenstein's confidence level II 53

Why are the answers different? The boundary curves away from ˆ The probabilistic distance from ˆ to ( 2 ) > the probabilistic distance from to ˆ ( ˆ ). ˆ 54

is not systematically biased downward. The bias depends on the geometrics of the problem. In this example, Felsenstein s confidence is biased upward. 55

c) Efron s method for better assessment of confidence Efron suggests an approximation formula for converting Felsenstein s confidence level to a hypothesis-testing confidence level ˆ. Denote: z = Φ 1 α, z = Φ 1 α z 0 = Φ 1 P μ0 μ ε R 1 Then in the normal case: zˆ z 2z 0 z 0 1 is of order n, and the error in estimating ẑ is of 1 order. This is called second order accuracy. n 56

Estimating ẑ μ ~ N 2 μ, I in the normal example: μ 1,, μ B z # * ˆ vectors in 1 1, B~ 100 B μ ~ N 2 μ 0, I ˆ 2 # ** ˆ vectors in 1 1 0 B 2 B2 zˆ, ~ 2000 z z z 0 μ 1,, μ B 2 57

In the multinomial model (trees) z z0 zˆ 1 a z z a - acceleration factor from BC a Example with malaria data: Efron got for monophyly of 9-10 clade with B 200. With B 2000 he got. He wants to compute ˆ. 0 z 0 58

Denote: P cent = 1 n,, 1 n P = (P 1,, P n ) Mult P cent * P - vector of proportions of the original matrix columns in a bootstrap sample * A bootstrap sample ( matrix ): The original data ( matrix ): X X P P Dˆ tree ˆ * * * Dˆ ˆ tree 59

Computing : ˆ Step 1: P 1,, P B 2 ~ Mult P cent, B = 2000 9-10 clade was monophyletic in 1923 out of 2000 samples. 0.962 60

Step 2: Out of the first 200 bootstrap vectors, 9-10 was not monophyletic in 7: 1 7,, For each, choose 0w 1 s.t. the vector p j wp j 1 w P cent is on the boundary of 9-10 monophyly. Use binary search to do it. j The vectors p play the role of. P ˆ P 61

Step 3: j For each boundary vector p, create B2 400 bootstrap samples: P 1,, P B 2 ~ Mult p j ** Each P gives a tree. The trees where 9-10 is monophyletic were counted: z 0 = Φ 1 1511 2800 = 0.0995 1511 62

Step 4: j cent Given a direction vector U p P from the center ( ˆ, cent ) to the boundary (, j ), P a values they got: au 1 6 n k 1 n k 1 U U 2 k 3 k 32 ˆ p 63

The average: a 0.0129 Step 5: z = Φ 1 α = Φ 1 0.962 = 1.77 z 0 = 0.0995 a = 0.0129 z = z z 0 1 + a z z 0 z 0 = 1.5357 α = Φ z = 0.938 < 0.962 = α 64

In this example Felsenstein s was biased upward. This happens when z. 0 0 The opposite can also happen, for example with the 7-8 clade. 65

Summary In a Bayesian sense, is a reasonable assessment of confidence. is not systematically biased downward. ˆ has a more familiar interpretation as hypothesis-testing confidence level. ˆ can be estimated by a two-level bootstrap algorithm. For, 50-100 bootstrap samples at the first level are enough. For ˆ, ~2000 bootstrap samples are needed at the second level. 66