Week 8: Testing trees, Bootstraps, jackknifes, gene frequencies

Similar documents
Week 7: Bayesian inference, Testing trees, Bootstraps

Bootstraps and testing trees. Alog-likelihoodcurveanditsconfidenceinterval

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

= 1 = 4 3. Odds ratio justification for maximum likelihood. Likelihoods, Bootstraps and Testing Trees. Prob (H 2 D) Prob (H 1 D) Prob (D H 2 )

Week 5: Distance methods, DNA and protein models

Phylogenetic inference

Inference of phylogenies, with some thoughts on statistics and geometry p.1/31

Concepts and Methods in Molecular Divergence Time Estimation

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Dr. Amira A. AL-Hosary

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Maximum Likelihood Estimation; Robust Maximum Likelihood; Missing Data with Maximum Likelihood

Inferring Molecular Phylogeny

Statistical nonmolecular phylogenetics: can molecular phylogenies illuminate morphological evolution?

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing

Interval Estimation III: Fisher's Information & Bootstrapping

Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny

Mathematical statistics

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200 Spring 2018 University of California, Berkeley

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics)

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Constructing Evolutionary/Phylogenetic Trees

Political Science 236 Hypothesis Testing: Review and Bootstrapping

1/24/2008. Review of Statistical Inference. C.1 A Sample of Data. C.2 An Econometric Model. C.4 Estimating the Population Variance and Other Moments

EVOLUTIONARY DISTANCES

Topic 19 Extensions on the Likelihood Ratio

Parameter Estimation and Fitting to Data

Mathematical statistics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Institute of Actuaries of India

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Statistical Data Analysis Stat 3: p-values, parameter estimation

Lab 9: Maximum Likelihood and Modeltest

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Goodness of Fit Goodness of fit - 2 classes

Hypothesis Testing with the Bootstrap. Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods

This does not cover everything on the final. Look at the posted practice problems for other topics.

Master s Written Examination

Generalized Linear Models (1/29/13)

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Mathematical statistics

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Evolutionary Models. Evolutionary Models

First Year Examination Department of Statistics, University of Florida

Lecture 4. Models of DNA and protein change. Likelihood methods

ST495: Survival Analysis: Hypothesis testing and confidence intervals


"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2011 University of California, Berkeley

Lecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis

BTRY 4830/6830: Quantitative Genomics and Genetics

Testing Hypothesis. Maura Mezzetti. Department of Economics and Finance Università Tor Vergata

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Recall the Basics of Hypothesis Testing

Hypothesis Testing - Frequentist

POLI 8501 Introduction to Maximum Likelihood Estimation

Chapter 6. Estimation of Confidence Intervals for Nodal Maximum Power Consumption per Customer

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

Statistical Distribution Assumptions of General Linear Models

Practice Problems Section Problems

Some New Aspects of Dose-Response Models with Applications to Multistage Models Having Parameters on the Boundary

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Chapter 4: Factor Analysis

User s Manual for. Continuous. (copyright M. Pagel) Mark Pagel School of Animal and Microbial Sciences University of Reading Reading RG6 6AJ UK

Statistics and Data Analysis

The Surprising Conditional Adventures of the Bootstrap

Advanced Quantitative Methods: maximum likelihood

INTERVAL ESTIMATION AND HYPOTHESES TESTING

SPRING 2007 EXAM C SOLUTIONS

Marginal Screening and Post-Selection Inference

Gov 2001: Section 4. February 20, Gov 2001: Section 4 February 20, / 39

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Understanding relationship between homologous sequences

Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

PHYLOGENY ESTIMATION AND HYPOTHESIS TESTING USING MAXIMUM LIKELIHOOD

arxiv: v1 [math.st] 22 Jun 2018

Testing Restrictions and Comparing Models

Parameter Estimation

BINF6201/8201. Molecular phylogenetic methods

Chapter 10: Inferences based on two samples

STT 843 Key to Homework 1 Spring 2018

Review. December 4 th, Review

Phylogenetic Tree Reconstruction

Statistical Methods for Particle Physics Lecture 1: parameter estimation, statistical tests

Maximum Likelihood (ML) Estimation

1 Mixed effect models and longitudinal data analysis

Probabilistic modeling and molecular phylogeny

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Notes on Machine Learning for and

Phylogenetics: Building Phylogenetic Trees

STAT 135 Lab 6 Duality of Hypothesis Testing and Confidence Intervals, GLRT, Pearson χ 2 Tests and Q-Q plots. March 8, 2015

Unsupervised machine learning

Transcription:

Week 8: Testing trees, ootstraps, jackknifes, gene frequencies Genome 570 ebruary, 2016 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.1/69

density e log (density) Normal distribution: curvature of log of height x x 1 σ 2π e 1 2 (x µ) 2 σ 2 (constant stuff) 1 2 (x µ) 2 σ 2 Taking the logarithm of the height of the density curve of a normal distribution whose variance is σ 2, we see that it is a quadratic curve whose curvature is 1/σ 2 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.2/69

The likelihood curve is nearly a normal distribution for large amounts of data θ θ the value for our data set θ from t(x), the "sufficient statistic" If we have large amounts of data, the values of parameters we need to try are all very similar, and the shape of the distribution (which is nearly normal) will not be too different for these values. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.3/69

The likelihood curve is nearly a normal distribution for large amounts of data θ θ the value for our data set θ from t(x), the "sufficient statistic" If we have large amounts of data, the values of parameters we need to try are all very similar, and the shape of the distribution (which is nearly normal) will not be too different for these values. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.4/69

The likelihood curve is nearly a normal distribution for large amounts of data θ θ the value for our data set θ from t(x), the "sufficient statistic" If we have large amounts of data, the values of parameters we need to try are all very similar, and the shape of the distribution (which is nearly normal) will not be too different for these values. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.5/69

The likelihood curve is nearly a normal distribution for large amounts of data θ θ the value for our data set θ from t(x), the "sufficient statistic" If we have large amounts of data, the values of parameters we need to try are all very similar, and the shape of the distribution (which is nearly normal) will not be too different for these values. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.6/69

The likelihood curve is nearly a normal distribution for large amounts of data θ θ the value for our data set θ from t(x), the "sufficient statistic" If we have large amounts of data, the values of parameters we need to try are all very similar, and the shape of the distribution (which is nearly normal) will not be too different for these values. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.7/69

The likelihood curve is nearly a normal distribution for large amounts of data θ θ the value for our data set θ from t(x), the "sufficient statistic" If we have large amounts of data, the values of parameters we need to try are all very similar, and the shape of the distribution (which is nearly normal) will not be too different for these values. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.8/69

The likelihood curve is nearly a normal distribution for large amounts of data θ θ the value for our data set θ from t(x), the "sufficient statistic" If we have large amounts of data, the values of parameters we need to try are all very similar, and the shape of the distribution (which is nearly normal) will not be too different for these values. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.9/69

The likelihood curve is nearly a normal distribution for large amounts of data θ θ the value for our data set θ from t(x), the "sufficient statistic" If we have large amounts of data, the values of parameters we need to try are all very similar, and the shape of the distribution (which is nearly normal) will not be too different for these values. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.10/69

urvatures and covariances of ML estimates ML estimates have covariances computable from curvatures of the expected log-likelihood: ] Var[ θ 1 / ( d 2 (log(l)) dθ 2 The same is true when there are multiple parameters: ] Var[ θ V 1 ) where ( 2 ) log(l) ij = θ i θ j Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.11/69

With large amounts of data, asymptotically When the true value of θ is θ 0, ˆθ θ 0 v N(0, 1) Since 1/v is the negative of the curvature of the log-likelihood: lnl(θ 0 ) = lnl(ˆθ) 1 2 (θ 0 ˆθ) 2 v so that twice the difference of log-likelihoods is the square of a normal: 2 ( ) ln L(ˆθ) lnl(θ 0 ) χ 2 1 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.12/69

orresponding results for multiple parameters lnl(θ 0 ) lnl(θ 0 ) 1 2 (θ 0 θ) T (θ 0 θ) (θ θ 0 ) T (θ θ 0 ) χ 2 p so that the log-likelihood difference is: 2 ( ) ln L(ˆθ) lnl(θ 0 ) χ 2 p When in the (true) null hypothesis θ 0 we have q of the p parameters constrained: ( ) 2 ln L(ˆθ) lnl(θ 0 ) χ 2 q Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.13/69

log-likelihood curve Likelihood curve in one parameter Ln (Likelihood) length of a branch in the tree Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.14/69

Its maximum likelihood estimate Likelihood curve in one parameter and the maximum likelihood estimate Ln (Likelihood) length of a branch in the tree maximum likelihood estimate (ML) Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.15/69

The(approximate, asymptotic) confidence interval Likelihood curve in one parameter and the maximum likelihood estimate and confidence interval derived from it Ln (Likelihood) 1/2 the value of a chi square with 1 d.f. significant at 95% 95% confidence interval length of a branch in the tree maximum likelihood estimate (ML) Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.16/69

ontours of a log-likelihood surface in two dimensions length of branch 2 length of branch 1 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.17/69

ontours of a log-likelihood surface in two dimensions length of branch 2 ML length of branch 1 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.18/69

Log-likelihood-based confidence set for two variables shaded area is the joint confidence interval length of branch 2 height of this contour is less than at the peak by an amount equal to 1/2 the chi square value with two degrees of freedom which is significant at 95% level length of branch 1 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.19/69

onfidence interval for one variable length of branch 2 height of this contour is less than at the peak by an amount equal to 1/2 the chi square value with one degree of freedom which is significant at 95% level length of branch 1 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.20/69

onfidence interval for the other variable length of branch 2 height of this contour is less than at the peak by an amount equal to 1/2 the chi square value with one degree of freedom which is significant at 95% level length of branch 1 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.21/69

ln L Likelihood ratio interval for a parameter 2620 2625 2630 2635 2640 5 10 20 50 100 200 Transition / transversion ratio Inferring the transition/transversion ratio for an 84 model with the 14-species primate mitochondria data set. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.22/69

LRTofamolecularclock howmanyparameters? onstraints for a clock v 1 v 2 v 4 v 5 v 1 = v 2 v 6 v 3 v v 4 = 5 v 8 v 1 + v v 6 = 3 v 7 v v 3 + 7 = v 4 + v 8 How does each equation constrain the branch lengths in the unrooted tree? What about the red equation? Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.23/69

Likelihood Ratio Test for a molecular clock Using the 7-species mitochondrial N data set (the great apes plus ovine and Mouse), we get with Ts/Tn = 30 and an 84 model: Tree ln L No clock 1372.77620 lock 1414.45053 ifference 41.67473 hi-square statistic: 2 41.675 = 83.35, with n 2 = 5 degrees of freedom highly significant. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.24/69

Model selection using the LRT Parameters 29 84, T estimated 28 81 84, T=2 27 26 K2P, T estimated 25 Jukes antor K2P, T=2 The problem with using likelihood ratio tests is the multiplicity of tests and the multiple routes to the same hypotheses. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.25/69

The kaike Information riterion ompare between hypotheses 2 lnl + 2p (the same as reducing the log-likelihood by the number of parameters) Number of Model ln L parameters I Jukes-antor 3068.29186 25 6186.58 K2P, R = 2.0 2953.15830 25 5956.32 K2P, R = 1.889 2952.94264 26 5957.89 81 2935.25430 28 5926.51 84, R = 2.0 2680.32982 28 5416.66 84, R = 28.95 2616.3981 29 5290.80 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.26/69

ln Likelihood anwetesttreesusingthelrt? 192 194 t 196 t 198 t 200 0.2 0.1 t 0.0 0.1 0.2 If so, how many degrees of freedom for the comparison of the two peaks? These are three-species clocklike trees (shown here plotted in a profile log-likelihood plot plotting the highest likelihood for each value of the interior branch length). Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.27/69 t

The bootstrap estimate of θ (unknown) true value of θ 150 data points empirical distribution of sample (unknown) true distribution ootstrap replicates (each 150 draws) istribution of estimates of parameters n example with mixed normal distributions. raw from the empirical distribution 150 times if there are 150 data points. With replacement! Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.28/69

The bootstrap for phylogenies Original ata sites sequences ootstrap sample #1 sites stimate of the tree sequences sample same number of sites, with replacement ootstrap sample #2 sequences sites sample same number of sites, with replacement ootstrap estimate of the tree, #1 (and so on) ootstrap estimate of the tree, #2 rawing columns of the data matrix, with replacement. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.29/69

partitiondefinedbyabranchinthefirsttree Trees: How many times each partition of species is found: 1 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.30/69

nother partition from the first tree Trees: How many times each partition of species is found: 1 1 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.31/69

The third partition from that tree Trees: How many times each partition of species is found: 1 1 1 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.32/69

Partitions from the second tree Trees: How many times each partition of species is found: 1 2 1 1 1 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.33/69

Partitions from the third tree Trees: How many times each partition of species is found: 2 2 1 1 1 1 1 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.34/69

Partitions from the fourth tree Trees: How many times each partition of species is found: 3 2 1 1 1 2 2 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.35/69

Partitions from the fifth tree Trees: How many times each partition of species is found: 3 3 1 1 1 2 1 3 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.36/69

The table of partitions from all trees Trees: How many times each partition of species is found: 3 3 1 1 1 2 1 3 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.37/69

The majority-rule consensus tree Trees: How many times each partition of species is found: 3 3 1 1 1 2 1 3 60 60 60 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.38/69

WhywilltheMRconsensusgiveatree? Suppose that for each partition in a tree we construct a (fake) morphological character with 0 for one set in the partition, 1 for the other. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.39/69

WhywilltheMRconsensusgiveatree? Suppose that for each partition in a tree we construct a (fake) morphological character with 0 for one set in the partition, 1 for the other. Such a character is compatible with a tree if (and only if) the tree contains that partition. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.39/69

WhywilltheMRconsensusgiveatree? Suppose that for each partition in a tree we construct a (fake) morphological character with 0 for one set in the partition, 1 for the other. Such a character is compatible with a tree if (and only if) the tree contains that partition. If two of these characters both occur in more than 50% of the trees, they must co-occur in at least one tree. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.39/69

WhywilltheMRconsensusgiveatree? Suppose that for each partition in a tree we construct a (fake) morphological character with 0 for one set in the partition, 1 for the other. Such a character is compatible with a tree if (and only if) the tree contains that partition. If two of these characters both occur in more than 50% of the trees, they must co-occur in at least one tree. Thus the set of these characters that occur in more then 50% of the trees are all pairwise compatible. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.39/69

WhywilltheMRconsensusgiveatree? Suppose that for each partition in a tree we construct a (fake) morphological character with 0 for one set in the partition, 1 for the other. Such a character is compatible with a tree if (and only if) the tree contains that partition. If two of these characters both occur in more than 50% of the trees, they must co-occur in at least one tree. Thus the set of these characters that occur in more then 50% of the trees are all pairwise compatible. y the Pairwise ompatibility Theorem (remember that?) they must then be jointly compatible Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.39/69

WhywilltheMRconsensusgiveatree? Suppose that for each partition in a tree we construct a (fake) morphological character with 0 for one set in the partition, 1 for the other. Such a character is compatible with a tree if (and only if) the tree contains that partition. If two of these characters both occur in more than 50% of the trees, they must co-occur in at least one tree. Thus the set of these characters that occur in more then 50% of the trees are all pairwise compatible. y the Pairwise ompatibility Theorem (remember that?) they must then be jointly compatible So there must be a tree that contains them all. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.39/69

The MR tree with 14-species primate mtn data ovine Mouse Squir Monk 35 42 72 74 84 80 himp Human Gorilla Orang 77 49 100 99 99 Gibbon Rhesus Mac Jpn Macaq rab.mac arbmacaq Tarsier Lemur Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.40/69

Potential problems with the bootstrap 1. Sites may not evolve independently 2. Sites may not come from a common distribution (but can consider them sampled from a mixture of possible distributions) 3. If do not know which branch is of interest at the outset, a multiple-tests" problem means P values are overstated 4. P values are biased (too conservative) 5. ootstrapping does not correct biases in phylogeny methods Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.41/69

Other resampling methods elete-half jackknife. Sample a random 50% of the sites, without replacement. elete-1/e jackknife (arris et. al. 1996) (too little deletion from a statistical viewpoint). Reweighting characters by choosing weights from an exponential distribution. In fact, reweighting them by any exchangeable weights having coefficient of variation of 1 Parametric bootstrap simulate data sets of this size assuming the estimate of the tree is the truth (to correct for correlation among adjacent sites) (Künsch, 1989) lock-bootstrapping sample n/b blocks of b adjacent sites. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.42/69

With the delete-half jackknife 50 69 72 84 80 ovine Mouse Squir Monk himp Human Gorilla Orang 32 80 100 98 99 Gibbon Rhesus Mac Jpn Macaq rab.mac arbmacaq 59 Tarsier Lemur Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.43/69

ootstrap versus jackknife in a simple case xact computation of the effects of deletion fraction for the jackknife n 1 n 2 n characters (suppose 1 and 2 are conflicting groups) m 1 m 2 n(1 δ) characters We can compute for various n s the probabilities of getting more evidence for group 1 than for group 2 typical result is for n 1 = 10, n 2 = 8, n = 100 : ootstrap Jackknife δ = 1/2 δ = 1/e Prob( m >m ) 1 2 0.6384 0.5923 0.6441 Prob( m >m ) 1 2 0.7230 0.7587 0.8040 Prob( m >m ) 1 2 + 1 Prob( m =m ) 2 1 2 0.6807 0.6755 0.7240 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.44/69

Probability of a character being omitted from a bootstrap N (1 1/N) N 1 0 11 0.35049 25 0.36040 100 0.36603 2 0.25 12 0.35200 30 0.36166 150 0.36665 3 0.29630 13 0.35326 35 0.36256 200 0.36696 4 0.31641 14 0.35434 40 0.36323 250 0.36714 5 0.32768 15 0.35526 45 0.36375 300 0.36727 6 0.33490 16 0.35607 50 0.36417 500 0.36751 7 0.33992 17 0.35679 60 0.36479 1000 0.36770 8 0.34361 18 0.35742 70 0.36524 0.36788 9 0.34644 19 0.35798 80 0.36557 10 0.34868 20 0.35849 90 0.36583 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.45/69

toyexampletoexaminebiasof Pvalues True value of mean istribution of individual values of x True distribution of sample means stimated distributions of sample means "Topology" II 0 "Topology" I ssuming a normal distribution, trying to infer whether the mean is above 0, when the mean is unknown and the variance known to be 1 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.46/69

iasinthepvalues note that the true P is more extreme than the average of the P s P estimate of the "phylogeny" topology II 0 topology I the true mean Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.47/69

HowmuchbiasinthePvalues? 1.0 0.8 verage P 0.6 0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 True P Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.48/69

iasinthepvalueswithdifferentpriors 1.00 Probability of correct topology 0.80 0.60 0.40 0.20 n σ 2 = 2.0 1.0 0.1 0.00 0.00 0.50 1.00 P for expectation of µ Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.49/69

The parametric bootstrap computer simulation estimation of tree data set #1 T 1 estimate of tree data set #2 T 2 original data data set #3 T 3 data set #100 T 100 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.50/69

The parametric bootstrap with the primates data ovine 83 98 78 81 93 98 96 Lemur Tarsier Squirrel Monkey Mouse Jp Macacque 98 arbary Mac 82 rab ating Mac Rhesus Mac Gorilla 95 himp 96 Human Orang Gibbon Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.51/69

Goldman s test using simulation (related to the "parametric bootstrap") no clock tree T log likelihood l data 2 ( l l ) c clock T c l c simulating data sets... data data data... data estimating clocklike and nonclocklike trees from each data set...... 2 ( l l ) 2 ( l l ) 2 ( l l ) 2 ( l l c c c c ) 2 ( l l c ) Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.52/69

n outcome of rownian motion on a 5-species tree Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.53/69

n outcome of rownian motion on a 5-species tree Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.54/69

n outcome of rownian motion on a 5-species tree Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.55/69

n outcome of rownian motion on a 5-species tree Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.56/69

rownian motion along a tree x 1 x x x 1 8 2 v x x 1 2 8 v 2 x 8 x x 8 9 v 8 x 3 x x 1 3 8 9 v 3 x 9 x x 9 0 v 9 x 6 x x 5 7 x x 6 10 x x v 7 10 x x x 6 v7 4 5 11 x 10 v x x 5 x x 10 11 4 12 v 10 v x 4 11 x x 11 12 v x 11 12 x x 12 0 v 12 x 0 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.57/69

istribution of tips on a tree under rownian Motion root v 3 0 3 v 1 v 2 1 2 Tip 1 is the sum of two independent changes each of which is drawn from a normal distribution (with mean 0 and variances v 3 and v 1 ) so it is normally distributed with mean 0 and variance v 3 + v 1. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.58/69

istribution of tips on a tree under rownian Motion root v 3 0 3 v 1 v 2 1 2 Tip 1 is the sum of two independent changes each of which is drawn from a normal distribution (with mean 0 and variances v 3 and v 1 ) so it is normally distributed with mean 0 and variance v 3 + v 1. Similarly for tip 2 (variance is v 3 + v 2 ). Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.58/69

istribution of tips on a tree under rownian Motion root v 3 0 3 v 1 v 2 1 2 Tip 1 is the sum of two independent changes each of which is drawn from a normal distribution (with mean 0 and variances v 3 and v 1 ) so it is normally distributed with mean 0 and variance v 3 + v 1. Similarly for tip 2 (variance is v 3 + v 2 ). They share branch 3, and the change there affects both random variables. So they are not independent or uncorrelated. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.58/69

istribution of tips on a tree under rownian Motion root v 3 0 3 v 1 v 2 1 2 Tip 1 is the sum of two independent changes each of which is drawn from a normal distribution (with mean 0 and variances v 3 and v 1 ) so it is normally distributed with mean 0 and variance v 3 + v 1. Similarly for tip 2 (variance is v 3 + v 2 ). They share branch 3, and the change there affects both random variables. So they are not independent or uncorrelated. Variance is the expectation of the square (of deviation from the mean), and covariance is the expectation of the product of those deviations, for the two variables. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.58/69

istribution of tips on a tree under rownian Motion root v 3 0 3 v 1 v 2 1 2 Tip 1 is the sum of two independent changes each of which is drawn from a normal distribution (with mean 0 and variances v 3 and v 1 ) so it is normally distributed with mean 0 and variance v 3 + v 1. Similarly for tip 2 (variance is v 3 + v 2 ). They share branch 3, and the change there affects both random variables. So they are not independent or uncorrelated. Variance is the expectation of the square (of deviation from the mean), and covariance is the expectation of the product of those deviations, for the two variables. In fact the covariance of the values at tip 1 and tip 2 is the variance of the shared term that is the same in both of them, so it is v 3. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.58/69

ovariances of species on the tree 2 v 1 + v 8 + v 9 v 8 + v 9 v 9 0 0 0 0 v 8 + v 9 v 2 + v 8 + v 9 v 9 0 0 0 0 v 9 v 9 v 3 + v 9 0 0 0 0 0 0 0 v 4 + v 12 v 12 v 12 v 12 0 0 0 v 12 v 5 + v 11 + v 12 v 11 + v 12 v 11 + v 12 6 0 0 0 v 4 12 v 11 + v 12 v 6 + v 10 + v 11 + v 12 v 10 + v 11 + v 12 3 7 5 0 0 0 v 12 v 11 + v 12 v 10 + v 11 + v 12 v 7 + v 10 + v 11 + v 12 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.59/69

ovariances are of form a b c 0 0 0 0 b d c 0 0 0 0 c c e 0 0 0 0 0 0 0 f g g g 0 0 0 g h i i 0 0 0 g i j k 0 0 0 g i k l Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.60/69

Likelihood under rownian motion with two species x 1 x 2 v 1 v 2 x 0 f ( x; µ, σ 2) = ) 1 ( σ 2π exp (x µ)2 2σ 2 L = p i=1 ( 1 (2π) exp 1 v 1 v 2 2 [ (x 1i x 0i ) 2 + (x 2i x 0i ) 2 v 1 v 2 ]) Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.61/69

What happens if we estimate means and branch lengths? o we get the right answer if we estimate for each coordinate (each character) the value at the root and the branch lengths v 1 and v 2? ctually no. elow, we will do this by finding values of these that maximize the likelihood, and show that the likelihood becomes infinite if either v 1 or v 2 approaches zero. ven if we constrain there to be a clock, so v 1 = v 2 and look only at their sum v 1 + v 2 this turns out to be half as big as the truth, even with an infinite number of characters. Why? The problem seems to be that we are estimating too many parameters. There is one parameter (the root value) for each character. So the ratio of data to parameters does not rise to infinity as we increase the number of parameters. In circumstances like this, likelihood methods can misbehave. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.62/69

The solution: don t infer ancestors; use RML We can eliminate these problems by: 1. o not infer the states of the interior nodes. 2. Use only the relative positions of the tips. This eliminates the starting state at the root. It is RML, a variant of ML that loses almost no statistical power. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.63/69

Minimizing for each character i so: and then: Q = (x 1i x 0i ) 2 v 1 + (x 2i x 0i ) 2 v 2 dq dx 0i = 2 (x 1i x 0i ) v 1 2 (x 2i x 0i ) v 2 = 0 x 0i = 1 v 1 x 1i + 1 v 2 x 2i 1 v 1 + 1 v 2 So that we have a maximum likelihood estimate of the starting value x 0i for each character. The result is that Q = (x 1i x 2i ) 2 v 1 + v 2 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.64/69

Likelihood after estimating initial coordinates Substituting in our estimates of x 0i, we end up with L = ( 1 (2π) p (v 1 v 2 ) exp 1 1 2 p 2 p i=1 ) (x 1i x 2i ) 2 v 1 + v 2 and this finally turns into: lnl = p ln(2π) 1 2 p ln(v 1v 2 ) 1 2 p i=1 (x 1i x 2i ) 2 v 1 + v 2 This actually goes to infinity as either v 1 or v 2 goes to zero! This is related to the problem that dwards and avalli-sforza had with their maximum likelihood method in 1964. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.65/69

Ifthereisaclock... If instead we constrain v 1 = v 2 because assume a clock: ln L = K p ln(v 1 + v 2 ) 1 2 2 (v 1 + v 2 ) which leads to v 1 = v 2 = 2 /(4p) (which is half as big as it should be!) The number of parameters being estimated is p + 1, which rises as we consider more characters. The fact that the ratio of data to parameters does not rise without limit is the reason why likelihood misbehaves in this case. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.66/69

The difference between ML and RML Information we use for ML inference: species 2 species 1 species 4 species 3 1.0 2.0 3.0 4.0 Information we use for RML inference: species 2 species 1 species 4 species 3 1.0+x 2.0+x 3.0+x 4.0+x oes it matter that we don t know x? It makes it unnecessary to estimate the starting value x 0, and that eliminates p parameters. It means that the ratio of data to parameters does then rise as we add characters. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.67/69

Using only differences between populations(rml) We assume that we have observed only the differences x 1i x 2i, and not the actual locations on the phenotype scale. Then ( ) p 1 L = exp 1 (x 1i x 2i ) 2 2π v1 + v 2 2 v 1 + v 2 i=1 lnl = K p 2 ln(v 1 + v 2 ) + 1 2 (v 1 + v 2 ) n (x i1 x i2 ) 2 i=1 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.68/69

Likelihood with two species using RML lnl = K p 2 ln (v 1 + v 2 ) + 2 2 (v 1 + v 2 ) lnl = K p 2 ln (v T) + 2 2 v T v T = 2 /p The number of parameters being estimated is 1 (it is the sum v 1 + v 2 ). The number of parameters does not rise as we consider more characters. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.69/69

Pruning a tree in the rownian motion case x 1 x 2 v 1 v 2 x 1 x 2 x 3 x 4 + δ = v v 1 2 v + v 1 2 v 1 v 2 v 5 v 3 v 4 x 12 x 3 x 4 v x + v x 2 1 1 2 x = 12 v + v 1 2 v 6 δ v 3 v 4 v 5 v 6 The likelihood for the tree is the product of the linkelihoods for these two trees. y repeatedly applying this we can decompose the tree into n 1 independent two-species trees. Getting their likelihoods is easy. Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.70/69