Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996

Size: px
Start display at page:

Download "Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996"

Transcription

1 Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Following Confidence limits on phylogenies: an approach using the bootstrap, J. Felsenstein,

2 I. Short review of the bootstrap method [Efron, B. & Tibshirani, R. (1993) An Introduction to the Bootstrap; Saharon s course Bootstrap and Resampling Methods ] II. III. Short review of phylogenetic trees [Saharon s course Topics in Statistical Genetics ] Felsenstein s method for computing confidence levels on monophyly of groups in phylogenetic trees using the bootstrap, and critiques of it [Felsenstein, J. (1985) Confidence Limits on Phylogenies: An Approach Using the Bootstrap; Saharon s course Bootstrap and Resampling Methods ] IV. Efron s justification of Felsenstein s method and corrections of it [Efron, B., Halloran, E. & Holmes, S. (1996) Bootstrap confidence levels for phylogenetic trees; Efron, B. (1982) The jackknife, the bootstrap and other resampling plans; Efron, B. (1987) Better Bootstrap Confidence Intervals] 2

3 I. Short review of the bootstrap method Population distribution function F, some parameter or feature uf A random sample X x,, 1 xn from this population Estimator ˆ s X What is the estimator s quality? We want to estimate some parameter that characterizes it, t F, which may be a function of the unknown parameter. 3

4 F u F X ˆ s X ˆF plug-in u F * ˆ * * X,, 1 XB ˆ * * ˆ ˆ 1,, B t F ˆB 4

5 II. Short review of phylogenetic trees Every organism carries sequences of DNA, and each species is characterized by its typical sequences. The history of all species begins with one ancestor, whose offspring developed various mutations. Throughout history species were formed through accumulation of mutations. We want to build a tree to map the history and relationship between species. All species existing today are leaves in this tree. The real tree is rooted, since there was only one species in the beginning. 5

6 6 Bacteria - a large domain of prokariotic microorganisms (Prokariots - organisms whose cells lack a cell nucleus (karyon)). Archaea - another domain of prokariots, all single-celled. Eucaryota - organisms whose cells do have a nucleus.

7 If we take sequences from very different species, with high probability all nucleotides can be described by the same tree. It means that for every site, the convergence of (human, cow) is closer than of (human, bird) and of (cow, bird). This is probable given the history. 7

8 Table of sequences: 8

9 The simplest metrics between species: Hamming distance - the number of sites at which the two species differ. Tree building methods: 1. Neighbour joining: Connect the two closest species, and so on. 2. Maximum parsimony: Minimize the total number of mutations in the tree. Good for many thousands of species. 9

10 3. Maximum likelihood: Parameters: - Tree topology - Mutation model - Edge lengths Usually assume that the sites are independent. Good for a few tens of species. None of these models gives information about the root s location. 10

11 11

12 Methods for placing the root: 1. Molecular clock: Mutations occur in a constant rate over history. Choose the most distant point from the leaves, according to some metrics (e.g. the point with biggest minimal distance from a leaf). 2. Outgroup: Add a species that is surely separated from all, and hang the tree from the edge leading to it. 12

13 III. Felsenstein s method a monophyletic group is a group of species which consists of all the species that are descended from some edge in the rooted tree, and only of them. A rooted tree is a series of arguments about the monophyly of groups in the tree. Use bootstrap to build phylogenetic trees and test hypotheses about the monophyly of groups The researcher s main interest is to decide whether or not a group of species is monophyletic. 13

14 Data table: Bootstrap across the characters: sample columns with replacement. n 14

15 We get a bootstrap sample that is a multinomial r.v. : each of the possible columns has a probability that is its frequency in the original sample of columns. Choose some optimality criterion for tree building (e.g. maximum parsimony). Build a tree based on each of the bootstrap samples. The estimated tree(s) is built according to the majority rule : the groups that are monophyletic in it are the ones that where monophyletic in most of the bootstrap trees. 15

16 S - some group of species. H 0 : S is not monophyletic Reject H 0 If S is monophyletic in at least 95% of the bootstrap trees, i.e. if it is not monophyletic in less than 5% of the bootstrap trees. This is equivalent to building a one-sided percentile confidence interval for the indicator 1, S is monophyletic 0, otherwise 16

17 For each bootstrap sample ( b1,, B ): * 1, S is monophyletic according to sample ˆ b 0, otherwise b * And CI ˆ,1. pct, ˆ * B 1 If the 5 th percentile of b b is 1 (i.e., the lower bound of the bootstrap percentile confidence interval is 1), Felsenstein rejects H 0 and says that the group is monophyletic. The proportion of bootstrap trees where S is monophyletic is Felsenstein s confidence in the monophyly of the group. Efron denotes it by 1Felsenstein's p-value. 17

18 Data set used by Efron to demonstrate the use of Felsenstein s method: 11 malaria species, 221 sites. The first 20 columns of the data matrix X : 18

19 Tree building algorithm: distance matrix between species (221 x 221) ˆD X Dˆ ˆ tree Proceeding with Felsenstein s method: B 200 bootstrap samples were generated, and bootstrap trees were built: X Dˆ tree ˆ * * * 19

20 For every monophyletic group in the original tree, the proportion of bootstrap trees where this group is monophyletic (Felsenstein s confidence) was calculated. For example: the 9-10 clade appeared (as monophyletic) in 193 of the 200 bootstrap trees, so its confidence value is

21 21

22 What are the main problems with Felsenstein s method: 1) With respect to this course? 2) As put by Efron? 3) According to other critiques, such as Hillis and Bull? 22

23 1) With respect to this course: This is not correct hypothesis testing: 1, S is monophyletic 0, otherwise H 0 : S is not monophyletic, i.e. 0. H 1 : is monophyletic, i.e.. S 1 Test statistic: ˆ s X We need to create a null bootstrap world, where S is not monophyletic, i.e. 0, and use it to estimate by calculating H 0 P s X s X obs 23

24 ˆP F 0 s X s X ASL ˆ obs boot, B * b s X This is the proportion of bootstrap samples that give a tree in which S is monophyletic, when F 0, the * distribution function of bootstrap samples X, is in H 0 and is as similar as possible to F (the distribution of in the real world). Instead, Felsenstein resamples from a specific distribution in H 1 the empirical distribution of X, that gave us a tree in which S is monophyletic. This Is not valid hypothesis testing. # s X B obs X 24

25 Bootstrap percentile confidence interval cannot be justified, for example because the monophyly indicators and ˆ can only get 0 or 1, so there cannot exist a monotone s.t. 2 ˆ g g ~ N 0,. g 25

26 2) As put by Efron: The use of the (nonparametric) bootstrap to estimate the parameter itself (, the indicator for the monophyly of S, by estimating the tree), instead of the quality of the estimator (bias, variance etc.). We want to infer our confidence that S is truly monophyletic given that we estimated it as monophyletic. But we are actually estimating the probability of inferring monophyly, assuming that our original sample is a true description of the real tree (in the sense that S is indeed monophyletic). 26

27 In other words: We want to decide how confident (confidence = 1 - p-value) we are that 1 given that our estimate is ˆ 1. This is equivalent to building a valid percentile plugin CI: ˆ ˆ ˆ ˆ ˆ * * CI pct, plug-in, 0.95,1 2, But we are actually estimating the probability to get ˆ 1 when 1. This is equivalent to building the less valid percentile CI: * CI pct, 0.95 ˆ 0.05,1 27

28 3) According to other critiques, such as Hillis and Bull: Felsenstein s confidence values are consistently too conservative (i.e., biased downward) as an assessment of the tree accuracy. 28

29 IV. Efron s justification and corrections Efron, Halloran & Holmes: a) Felsenstein s method provides a reasonable first approximation to the actual confidence levels of the observed clades. This is an answer to problem (2). b) Felsenstein s confidence assessment is not consistently biased downward. This is an answer to problem (3). c) Suggest a more complex method to give better assessments of confidence. 29

30 a) Efron s answer to problem (2): Why is Felsenstein s confidence justified? The rationale: The bootstrap samples come from a multinomial model: In the malaria example, there are 11 species, so 11 there are K 4 4 possible column vectors for x. Denote them y,, 1 yk. Suppose each observed column of X is independently chosen from these vectors, with K some probability to choose y ( ). These k k k 1 k 1 probabilities depend on the population of sites (the true tree). Denote. 1,, K 30

31 X can be characterized by the proportion of its n 221 columns equaling each possible y : ˆ k # columns of X equaling ˆ ˆ,, ˆ 1 K ˆD is a function of the original observed proportions ˆ, so the tree-building algorithm can be described as ˆ n Dˆ tree ˆ y k k 31

32 We can add the group of interest s monophyly indicator to the process: In a similar way (although possibly in an opposite causal order the tree comes first in reality), the vector of true probabilities gives the true distance matrix and the true tree (that we assume we would get by applying our algorithm to the true distance matrix): K (where D 2 ij k yki ykj ). k1 π D tree ψ π D tree ψ 32

33 The proportions of columns in a bootstrap sample are: ˆ * k And we get a bootstrap tree: * # columns of X equaling ˆ ˆ,, ˆ * * * 1 K n y k (and ψ ). ˆ Dˆ tree ˆ * * * 33

34 A schematic picture of the space of possible s: We hope to have ˆ tree = tree. 34

35 It is not clear why we can use the distribution * of tree ˆ tree ˆ to infer our confidence in tree tree ˆ. Or in terms of the monophyly indicator, why we can decide how confident we are that given that our estimate is ˆ 1 by estimating the probability that ˆ 1 given that

36 The answer is in using a Bayesian approach: * The bootstrap distribution of tree ˆ tree ˆ is almost the same as the posterior distribution of tree tree ˆ if we begin with an uninformative (i.e., uniform) prior density. An explanation can be found in Efron s booklet The Jackknife, the Bootstrap and other resampling plans (1982). 36

37 Assume a discrete sample space. In our phylogenetic case, this is the space of all 11 possible columns ( K 4 4 ). k - the probability that a column in a sample will be identical to the possible column y k. The observed frequency: # xi equaling y k ˆ k n The observed frequencies: The real probabilities: ˆ ˆ ˆ 1,, K,, 1 K 37

38 Choose as the prior distribution of (the real probabilities the parameter we want to estimate) a K dimentional Dirichlet distribution with parameter, which has to be a vector of positive numbers:,, 1 K We choose,, symmetric distribution.. K a a a1 to get a 38

39 From Wikipedia: Several images of probability densities of the Dirichlet distribution with K=3. Clockwise from top left: α = (6,2,2), (3,7,5), (6,2,6), (2,3,4). K 39

40 Left plot: Center plot: Right plot:

41 So apriori density function,, ~ Dir 1 K K with the f,, B 1 K 1 K a K 1 a1, 1 k k1 k1 k k 1 K1, 1,, K0 0, otherwise 41

42 From the sample we have the observed frequencies ˆ. The likelihood function P ˆ nˆ ~ Mult n, ˆ ~ K k 1 ˆ Mult n, n ˆ! n! k K k1 n is multinomial: n ˆ K k k, n ˆ n, n ˆ k k Z k 1 0, otherwise 42

43 The Dirichlet distribution is self conjugate with regard to multinomial likelihood, and the posterior distribution of is ˆ ~ Dir a1 + n ˆ which means that we add to each parameter the number of times that the corresponding column appears in the sample. If we minimize the effect of the prior knowledge by taking a 0, we will get K ˆ ~ Dir n ˆ K X 43

44 The distribution of the observed frequencies in a bootstrap sample, ˆ * Mult n, ˆ ˆ ~ is a good approximation for the limit of the posterior distribution ˆ ~ Dir n ˆ. Therefore we have a basis to think that Felsenstein s confidence level is a good approximation for the confidence level we try to estimate. n K 44

45 b) Efron s answer to problem (3): Why is Felsenstein s confidence assessment not consistently biased downward? To better explain the problem in Felsenstein s method which causes the apparent bias, Efron uses a simpler example: A normal model (instead of multinomial) and parametric bootstrap (instead of nonparametric). 45

46 A simpler model: x~n 2 μ, I μ = μ 1, μ 2 ˆ x The - plane is divided into regions, 1. ˆ lies in 1, and we wish to assign a confidence value to the event that itself lies in 1. In our terms: H 0 : μ ϵ 2 H 1 : μ ϵ 1 confidence = 1 - p-value 46

47 Parametric bootstrap: x ~N 2 μ, I Felsenstein s confidence value: P ˆ * ˆ 1 As in the multinomial case, it can be justified by using the Bayesian argument. 47

48 Why is a reasonable assessment of the confidence that? As in the multinomial model, is the posterior probability that 1 given that when we assume apriori that can be anywhere in the plane with equal probability. The posterior distribution:, exactly the distribution of ˆ ˆ. μ μ ~ N 2 μ, I ˆ 1 48

49 μ μ, μ μ ~ N 2 0, I, but μ μ~ N 2 0, 2I. According to Efron, this generates the bias for which the method was criticized. In the phylogeny case, the fact is that the * probability that tree ˆ = tree is usually less than the probability that tree ˆ = tree. But Efron also gives a perhaps more convincing explanation for the alleged bias. 49

50 Two possible examples: 50

51 Felsenstein s confidence value: * P ˆ Calculated theoretically: I 0.933, II How would we assign a confidence level to μ ϵ 1? In our terms: H 0 : μ ϵ 2 H 1 : μ ϵ 1 confidence = 1 - p-value ˆ 1 51

52 A more customary way of assigning a confidence level to : μ ~ F = N 2 μ, I 1 μ ~ F = N 2 μ 0, I 52

53 p-value = 1 ˆ = # μ further than μ from the boundary of R 2 B The confidence level for the two cases (computed numerically): I II ˆ 0.933, Felsenstein's confidence level I ˆ 0.914, Felsenstein's confidence level II 53

54 Why are the answers different? The boundary curves away from ˆ The probabilistic distance from ˆ to ( 2 ) > the probabilistic distance from to ˆ ( ˆ ). ˆ 54

55 is not systematically biased downward. The bias depends on the geometrics of the problem. In this example, Felsenstein s confidence is biased upward. 55

56 c) Efron s method for better assessment of confidence Efron suggests an approximation formula for converting Felsenstein s confidence level to a hypothesis-testing confidence level ˆ. Denote: z = Φ 1 α, z = Φ 1 α z 0 = Φ 1 P μ0 μ ε R 1 Then in the normal case: zˆ z 2z 0 z 0 1 is of order n, and the error in estimating ẑ is of 1 order. This is called second order accuracy. n 56

57 Estimating ẑ μ ~ N 2 μ, I in the normal example: μ 1,, μ B z # * ˆ vectors in 1 1, B~ 100 B μ ~ N 2 μ 0, I ˆ 2 # ** ˆ vectors in B 2 B2 zˆ, ~ 2000 z z z 0 μ 1,, μ B 2 57

58 In the multinomial model (trees) z z0 zˆ 1 a z z a - acceleration factor from BC a Example with malaria data: Efron got for monophyly of 9-10 clade with B 200. With B 2000 he got. He wants to compute ˆ. 0 z 0 58

59 Denote: P cent = 1 n,, 1 n P = (P 1,, P n ) Mult P cent * P - vector of proportions of the original matrix columns in a bootstrap sample * A bootstrap sample ( matrix ): The original data ( matrix ): X X P P Dˆ tree ˆ * * * Dˆ ˆ tree 59

60 Computing : ˆ Step 1: P 1,, P B 2 ~ Mult P cent, B = clade was monophyletic in 1923 out of 2000 samples

61 Step 2: Out of the first 200 bootstrap vectors, 9-10 was not monophyletic in 7: 1 7,, For each, choose 0w 1 s.t. the vector p j wp j 1 w P cent is on the boundary of 9-10 monophyly. Use binary search to do it. j The vectors p play the role of. P ˆ P 61

62 Step 3: j For each boundary vector p, create B2 400 bootstrap samples: P 1,, P B 2 ~ Mult p j ** Each P gives a tree. The trees where 9-10 is monophyletic were counted: z 0 = Φ =

63 Step 4: j cent Given a direction vector U p P from the center ( ˆ, cent ) to the boundary (, j ), P a values they got: au 1 6 n k 1 n k 1 U U 2 k 3 k 32 ˆ p 63

64 The average: a Step 5: z = Φ 1 α = Φ = 1.77 z 0 = a = z = z z a z z 0 z 0 = α = Φ z = < = α 64

65 In this example Felsenstein s was biased upward. This happens when z. 0 0 The opposite can also happen, for example with the 7-8 clade. 65

66 Summary In a Bayesian sense, is a reasonable assessment of confidence. is not systematically biased downward. ˆ has a more familiar interpretation as hypothesis-testing confidence level. ˆ can be estimated by a two-level bootstrap algorithm. For, bootstrap samples at the first level are enough. For ˆ, ~2000 bootstrap samples are needed at the second level. 66

Dr. Amira A. AL-Hosary

Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

Phylogenetic Tree Reconstruction

I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley B.D. Mishler Jan. 22, 2009. Trees I. Summary of previous lecture: Hennigian

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30 A non-phylogeny

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

Bootstrap tests. Patrick Breheny. October 11. Bootstrap vs. permutation tests Testing for equality of location

Bootstrap tests Patrick Breheny October 11 Patrick Breheny STA 621: Nonparametric Statistics 1/14 Introduction Conditioning on the observed data to obtain permutation tests is certainly an important idea

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

Lecture 11 Friday, October 21, 2011

Lecture 11 Friday, October 21, 2011 Phylogenetic tree (phylogeny) Darwin and classification: In the Origin, Darwin said that descent from a common ancestral species could explain why the Linnaean system

Consistency Index (CI)

Consistency Index (CI) minimum number of changes divided by the number required on the tree. CI=1 if there is no homoplasy negatively correlated with the number of species sampled Retention Index (RI)

Hypothesis Testing with the Bootstrap. Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods

Hypothesis Testing with the Bootstrap Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods Bootstrap Hypothesis Testing A bootstrap hypothesis test starts with a test statistic

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies 1 What is phylogeny? Essay written for the course in Markov Chains 2004 Torbjörn Karfunkel Phylogeny is the evolutionary development

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

8/23/2014. Phylogeny and the Tree of Life

Phylogeny and the Tree of Life Chapter 26 Objectives Explain the following characteristics of the Linnaean system of classification: a. binomial nomenclature b. hierarchical classification List the major

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)

- Phylogeny? - Systematics? The Phylogenetic Systematics (Phylogeny and Systematics) - Phylogenetic systematics? Connection between phylogeny and classification. - Phylogenetic systematics informs the

Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Why uncertainty? Why should data mining care about uncertainty? We

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

Bot 421/521 PHYLOGENETIC ANALYSIS I. Origins A. Hennig 1950 (German edition) Phylogenetic Systematics 1966 B. Zimmerman (Germany, 1930 s) C. Wagner (Michigan, 1920-2000) II. Characters and character states

Hypothesis testing and phylogenetics

Hypothesis testing and phylogenetics Woods Hole Workshop on Molecular Evolution, 2017 Mark T. Holder University of Kansas Thanks to Paul Lewis, Joe Felsenstein, and Peter Beerli for slides. Motivation

The Nonparametric Bootstrap

The Nonparametric Bootstrap The nonparametric bootstrap may involve inferences about a parameter, but we use a nonparametric procedure in approximating the parametric distribution using the ECDF. We use

One-Sample Numerical Data

One-Sample Numerical Data quantiles, boxplot, histogram, bootstrap confidence intervals, goodness-of-fit tests University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Five Sami Khuri Department of Computer Science San José State University San José, California, USA sami.khuri@sjsu.edu v Distance Methods v Character Methods v Molecular Clock v UPGMA v Maximum Parsimony

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

STAT Section 2.1: Basic Inference. Basic Definitions

STAT 518 --- Section 2.1: Basic Inference Basic Definitions Population: The collection of all the individuals of interest. This collection may be or even. Sample: A collection of elements of the population.

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

CHAPTERS 24-25: Evidence for Evolution and Phylogeny 1. For each of the following, indicate how it is used as evidence of evolution by natural selection or shown as an evolutionary trend: a. Paleontology

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

Bootstrap Confidence Intervals

Bootstrap Confidence Intervals Patrick Breheny September 18 Patrick Breheny STA 621: Nonparametric Statistics 1/22 Introduction Bootstrap confidence intervals So far, we have discussed the idea behind

Algorithms in Bioinformatics

Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Distance Methods Character Methods

How to read and make phylogenetic trees Zuzana Starostová

How to read and make phylogenetic trees Zuzana Starostová How to make phylogenetic trees? Workflow: obtain DNA sequence quality check sequence alignment calculating genetic distances phylogeny estimation

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Homework Assignment, Evolutionary Systems Biology, Spring 2009. Homework Part I: Phylogenetics: Introduction. The objective of this assignment is to understand the basics of phylogenetic relationships

Political Science 236 Hypothesis Testing: Review and Bootstrapping

Political Science 236 Hypothesis Testing: Review and Bootstrapping Rocío Titiunik Fall 2007 1 Hypothesis Testing Definition 1.1 Hypothesis. A hypothesis is a statement about a population parameter The

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

Inference for Single Proportions and Means T.Scofield

Inference for Single Proportions and Means TScofield Confidence Intervals for Single Proportions and Means A CI gives upper and lower bounds between which we hope to capture the (fixed) population parameter

Biology 211 (2) Week 1 KEY!

Biology 211 (2) Week 1 KEY Chapter 1 KEY FIGURES: 1.2, 1.3, 1.4, 1.5, 1.6, 1.7 VOCABULARY: Adaptation: a trait that increases the fitness Cells: a developed, system bound with a thin outer layer made of

Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2008

Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2008 University of California, Berkeley B.D. Mishler March 18, 2008. Phylogenetic Trees I: Reconstruction; Models, Algorithms & Assumptions

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley B.D. Mishler Feb. 14, 2018. Phylogenetic trees VI: Dating in the 21st century: clocks, & calibrations;

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

Classification and Phylogeny

Classification and Phylogeny The diversity of life is great. To communicate about it, there must be a scheme for organization. There are many species that would be difficult to organize without a scheme

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Today s topics Inferring phylogeny Introduction! Distance methods! Parsimony method!"#\$%&'(!)* +,-.'/01!23454(6!7!2845*0&4'9#6!:&454(6 ;?@AB=C?DEF Overview of phylogenetic inferences Methodology Methods

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic Analysis Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Outline Basic Concepts Tree Construction Methods Distance-based methods

Classification and Phylogeny

Classification and Phylogeny The diversity it of life is great. To communicate about it, there must be a scheme for organization. There are many species that would be difficult to organize without a scheme

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Phylogenetics Todd Vision iology 522 March 26, 2007 pplications of phylogenetics Studying organismal or biogeographic history Systematics ating events in the fossil record onservation biology Studying

The bootstrap. Patrick Breheny. December 6. The empirical distribution function The bootstrap

Patrick Breheny December 6 Patrick Breheny BST 764: Applied Statistical Modeling 1/21 The empirical distribution function Suppose X F, where F (x) = Pr(X x) is a distribution function, and we wish to estimate

Phylogenetic inference

Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

Lecture 6 Phylogenetic Inference

Lecture 6 Phylogenetic Inference From Darwin s notebook in 1837 Charles Darwin Willi Hennig From The Origin in 1859 Cladistics Phylogenetic inference Willi Hennig, Cladistics 1. Clade, Monophyletic group,

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

Evaluating phylogenetic hypotheses

Evaluating phylogenetic hypotheses Methods for evaluating topologies Topological comparisons: e.g., parametric bootstrapping, constrained searches Methods for evaluating nodes Resampling techniques: bootstrapping,

Chapter 26 Phylogeny and the Tree of Life

Chapter 26 Phylogeny and the Tree of Life Biologists estimate that there are about 5 to 100 million species of organisms living on Earth today. Evidence from morphological, biochemical, and gene sequence

Nonparametric Methods II

Nonparametric Methods II Henry Horng-Shing Lu Institute of Statistics National Chiao Tung University hslu@stat.nctu.edu.tw http://tigpbp.iis.sinica.edu.tw/courses.htm 1 PART 3: Statistical Inference by

Theory of Evolution Charles Darwin

Theory of Evolution Charles arwin 858-59: Origin of Species 5 year voyage of H.M.S. eagle (83-36) Populations have variations. Natural Selection & Survival of the fittest: nature selects best adapted varieties

Model Selection, Estimation, and Bootstrap Smoothing. Bradley Efron Stanford University

Model Selection, Estimation, and Bootstrap Smoothing Bradley Efron Stanford University Estimation After Model Selection Usually: (a) look at data (b) choose model (linear, quad, cubic...?) (c) fit estimates

Intraspecific gene genealogies: trees grafting into networks

Intraspecific gene genealogies: trees grafting into networks by David Posada & Keith A. Crandall Kessy Abarenkov Tartu, 2004 Article describes: Population genetics principles Intraspecific genetic variation

Molecular Evolution & Phylogenetics

Molecular Evolution & Phylogenetics Heuristics based on tree alterations, maximum likelihood, Bayesian methods, statistical confidence measures Jean-Baka Domelevo Entfellner Learning Objectives know basic

What to do today (Nov 22, 2018)?

What to do today (Nov 22, 2018)? Part 1. Introduction and Review (Chp 1-5) Part 2. Basic Statistical Inference (Chp 6-9) Part 3. Important Topics in Statistics (Chp 10-13) Part 4. Further Topics (Selected

Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods

Chapter 4 Confidence Intervals in Ridge Regression using Jackknife and Bootstrap Methods 4.1 Introduction It is now explicable that ridge regression estimator (here we take ordinary ridge estimator (ORE)

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

Phylogeny is the evolutionary history of a group of organisms. Based on the idea that organisms are related by evolution

Bio 1M: Phylogeny and the history of life 1 Phylogeny S25.1; Bioskill 11 (2ndEd S27.1; Bioskills 3) Bioskills are in the back of your book Phylogeny is the evolutionary history of a group of organisms

Name: Class: Date: ID: A

Class: _ Date: _ Ch 17 Practice test 1. A segment of DNA that stores genetic information is called a(n) a. amino acid. b. gene. c. protein. d. intron. 2. In which of the following processes does change

Phylogeny and Molecular Evolution. Introduction

Phylogeny and Molecular Evolution Introduction 1 2/62 3/62 Credit Serafim Batzoglou (UPGMA slides) http://www.stanford.edu/class/cs262/slides Notes by Nir Friedman, Dan Geiger, Shlomo Moran, Ron Shamir,

3 Joint Distributions 71

2.2.3 The Normal Distribution 54 2.2.4 The Beta Density 58 2.3 Functions of a Random Variable 58 2.4 Concluding Remarks 64 2.5 Problems 64 3 Joint Distributions 71 3.1 Introduction 71 3.2 Discrete Random

One-minute responses. Nice class{no complaints. Your explanations of ML were very clear. The phylogenetics portion made more sense to me today.

One-minute responses Nice class{no complaints. Your explanations of ML were very clear. The phylogenetics portion made more sense to me today. The pace/material covered for likelihoods was more dicult

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003.

A Note on Bootstraps and Robustness Tony Lancaster, Brown University, December 2003. In this note we consider several versions of the bootstrap and argue that it is helpful in explaining and thinking about

11. Bootstrap Methods

11. Bootstrap Methods c A. Colin Cameron & Pravin K. Trivedi 2006 These transparencies were prepared in 20043. They can be used as an adjunct to Chapter 11 of our subsequent book Microeconometrics: Methods

Random Numbers and Simulation

Random Numbers and Simulation Generating random numbers: Typically impossible/unfeasible to obtain truly random numbers Programs have been developed to generate pseudo-random numbers: Values generated

GENETICS - CLUTCH CH.22 EVOLUTIONARY GENETICS.

!! www.clutchprep.com CONCEPT: OVERVIEW OF EVOLUTION Evolution is a process through which variation in individuals makes it more likely for them to survive and reproduce There are principles to the theory

What is Phylogenetics

What is Phylogenetics Phylogenetics is the area of research concerned with finding the genetic connections and relationships between species. The basic idea is to compare specific characters (features)

Phylogenetics. BIOL 7711 Computational Bioscience

Consortium for Comparative Genomics! University of Colorado School of Medicine Phylogenetics BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience Program Consortium

Bayesian learning Probably Approximately Correct Learning

Bayesian learning Probably Approximately Correct Learning Peter Antal antal@mit.bme.hu A.I. December 1, 2017 1 Learning paradigms Bayesian learning Falsification hypothesis testing approach Probably Approximately

Systematics - Bio 615

Bayesian Phylogenetic Inference 1. Introduction, history 2. Advantages over ML 3. Bayes Rule 4. The Priors 5. Marginal vs Joint estimation 6. MCMC Derek S. Sikes University of Alaska 7. Posteriors vs Bootstrap

Questions we can ask. Recall. Accuracy and Precision. Systematics - Bio 615. Outline

Outline 1. Mechanistic comparison with Parsimony - branch lengths & parameters 2. Performance comparison with Parsimony - Desirable attributes of a method - The Felsenstein and Farris zones - Heterotachous

Estimating Evolutionary Trees. Phylogenetic Methods

Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent

BINF6201/8201. Molecular phylogenetic methods

BINF60/80 Molecular phylogenetic methods 0-7-06 Phylogenetics Ø According to the evolutionary theory, all life forms on this planet are related to one another by descent. Ø Traditionally, phylogenetics

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Volume 5, Issue 1 2006 Article 28 A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments Hongmei Jiang Rebecca

An Introduction to Mplus and Path Analysis

An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression

Business Statistics. Lecture 10: Course Review

Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Phylogeny and systematics Why are these disciplines important in evolutionary biology and how are they related to each other? Phylogeny and systematics Phylogeny: the evolutionary history of a species

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

Resampling and the Bootstrap

Resampling and the Bootstrap Axel Benner Biostatistics, German Cancer Research Center INF 280, D-69120 Heidelberg benner@dkfz.de Resampling and the Bootstrap 2 Topics Estimation and Statistical Testing

Probability and Statistics

CHAPTER 5: PARAMETER ESTIMATION 5-0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 5: PARAMETER

Bootstrapping Australian inbound tourism

19th International Congress on Modelling and Simulation, Perth, Australia, 12 16 December 2011 http://mssanz.org.au/modsim2011 Bootstrapping Australian inbound tourism Y.H. Cheunga and G. Yapa a School

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA, 00 MODULE : Statistical Inference Time Allowed: Three Hours Candidates should answer FIVE questions. All questions carry equal marks. The

Fig. 26.7a. Biodiversity. 1. Course Outline Outcomes Instructors Text Grading. 2. Course Syllabus. Fig. 26.7b Table

Fig. 26.7a Biodiversity 1. Course Outline Outcomes Instructors Text Grading 2. Course Syllabus Fig. 26.7b Table 26.2-1 1 Table 26.2-2 Outline: Systematics and the Phylogenetic Revolution I. Naming and

4 Resampling Methods: The Bootstrap

4 Resampling Methods: The Bootstrap Situation: Let x 1, x 2,..., x n be a SRS of size n taken from a distribution that is unknown. Let θ be a parameter of interest associated with this distribution and

PHYLOGENY & THE TREE OF LIFE

PHYLOGENY & THE TREE OF LIFE PREFACE In this powerpoint we learn how biologists distinguish and categorize the millions of species on earth. Early we looked at the process of evolution here we look at

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

Tracing the Evolution of Numerical Phylogenetics: History, Philosophy, and Significance Adam W. Ferguson Phylogenetic Systematics 26 January 2009 Inferring Phylogenies Historical endeavor Darwin- 1837

AP Biology Name Cladistics and Bioinformatics Questions 2013 1. The following table shows the percentage similarity in sequences of nucleotides from a homologous gene derived from five different species

Frequentist Properties of Bayesian Posterior Probabilities of Phylogenetic Trees Under Simple and Complex Substitution Models

Syst. Biol. 53(6):904 913, 2004 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150490522629 Frequentist Properties of Bayesian Posterior Probabilities

The Life System and Environmental & Evolutionary Biology II

The Life System and Environmental & Evolutionary Biology II EESC V2300y / ENVB W2002y Laboratory 1 (01/28/03) Systematics and Taxonomy 1 SYNOPSIS In this lab we will give an overview of the methodology

What Is Conservation?

What Is Conservation? Lee A. Newberg February 22, 2005 A Central Dogma Junk DNA mutates at a background rate, but functional DNA exhibits conservation. Today s Question What is this conservation? Lee A.

Phylogenetic analyses. Kirsi Kostamo

Phylogenetic analyses Kirsi Kostamo The aim: To construct a visual representation (a tree) to describe the assumed evolution occurring between and among different groups (individuals, populations, species,

An Introduction to Path Analysis

An Introduction to Path Analysis PRE 905: Multivariate Analysis Lecture 10: April 15, 2014 PRE 905: Lecture 10 Path Analysis Today s Lecture Path analysis starting with multivariate regression then arriving

Resampling techniques for statistical modeling

Resampling techniques for statistical modeling Gianluca Bontempi Département d Informatique Boulevard de Triomphe - CP 212 http://www.ulb.ac.be/di Resampling techniques p.1/33 Beyond the empirical error

Inverse Sampling for McNemar s Test

International Journal of Statistics and Probability; Vol. 6, No. 1; January 27 ISSN 1927-7032 E-ISSN 1927-7040 Published by Canadian Center of Science and Education Inverse Sampling for McNemar s Test

Reading for Lecture 13 Release v10

Reading for Lecture 13 Release v10 Christopher Lee November 15, 2011 Contents 1 Evolutionary Trees i 1.1 Evolution as a Markov Process...................................... ii 1.2 Rooted vs. Unrooted Trees........................................

CONTENTS OF DAY 2. II. Why Random Sampling is Important 10 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 4 Problems with small populations 9 II. Why Random Sampling is Important 10 A myth,