Week 7: Bayesian inference, Testing trees, Bootstraps

Genome 570, May 2008

Bayes' Theorem

The conditional probability of a hypothesis given the data is

  Prob(H | D) = Prob(H & D) / Prob(D)

Since Prob(H & D) = Prob(H) Prob(D | H), substituting this in:

  Prob(H | D) = Prob(H) Prob(D | H) / Prob(D)

The denominator Prob(D) is the sum of the numerators over all possible hypotheses H:

  Prob(H | D) = Prob(H) Prob(D | H) / Σ_H Prob(H) Prob(D | H)

A dramatic, if not real, example

Example: space probe photos show no Little Green Men on Mars!

[Figure: tables of priors, likelihoods, and posteriors for the hypotheses "LGMs exist" (yes) and "LGMs do not exist" (no).]

Calculations for that example

Using the odds-ratio form of Bayes' Theorem:

  Prob(H1 | D) / Prob(H2 | D)  =  [ Prob(D | H1) / Prob(D | H2) ]  ×  [ Prob(H1) / Prob(H2) ]
     (posterior odds ratio)            (likelihood ratio)              (prior odds ratio)

For the odds favoring their existence, the calculation is, for the optimist about Little Green Men:

  (4 / 1) × ((1/3) / 1) = 4/3, i.e. 4 : 3

while for the pessimist it is

  (1 / 4) × ((1/3) / 1) = 1/12, i.e. 1 : 12

With repeated observation the prior matters less

If we send 5 space probes, and all fail to see LGMs, then since the probability of this observation is (1/3)^5 if there are LGMs, and 1 if there aren't, we get for the optimist about Little Green Men:

  (4 / 1) × (1/3)^5 = 4/243, i.e. 4 : 243

while for the pessimist about Little Green Men:

  (1 / 4) × (1/3)^5 = 1/972, i.e. 1 : 972
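
These updates are easy to check. A minimal sketch (the function name is mine; the numbers are the slides'):

```python
# Odds-ratio form of Bayes' Theorem: posterior odds = prior odds times the
# likelihood ratio, applied once per independent observation.
def posterior_odds(prior_odds, likelihood_ratio, n_observations=1):
    return prior_odds * likelihood_ratio ** n_observations

print(posterior_odds(4, 1 / 3))         # optimist, one probe:    4/3  (4 : 3)
print(posterior_odds(1 / 4, 1 / 3))     # pessimist, one probe:   1/12 (1 : 12)
print(posterior_odds(4, 1 / 3, 5))      # optimist, five probes:  4/243
print(posterior_odds(1 / 4, 1 / 3, 5))  # pessimist, five probes: 1/972
```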

A coin tossing example

[Figure: posterior densities for the heads probability p, on the interval 0 to 1, after 11 tosses with 5 heads and after 44 tosses with 20 heads.]

Markov chain Monte Carlo sampling

To draw trees from a distribution whose probabilities are proportional to f(T), we can use the Metropolis algorithm:

1. Start at some tree. Call this T_i.
2. Pick a tree that is a neighbor of this tree in the graph of trees. Call this the proposal, T_j.
3. Compute the ratio of the probabilities (or probability density functions) of the proposed new tree and the old tree: R = f(T_j) / f(T_i).
4. If R ≥ 1, accept the new tree as the current tree.
5. If R < 1, draw a uniform random number (a random fraction between 0 and 1). If it is less than R, accept the new tree as the current tree.
6. Otherwise, reject the new tree and continue with tree T_i as the current tree.
7. Return to step 2.
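
The steps above map directly onto code. A minimal sketch, assuming a symmetric proposal; `propose` and `f` are placeholders for a real tree-rearrangement move and an unnormalized target, not anything from the slides:

```python
import random

def metropolis(current, propose, f, n_steps):
    samples = []
    for _ in range(n_steps):
        candidate = propose(current)         # step 2: pick a neighbor
        R = f(candidate) / f(current)        # step 3: probability ratio
        if R >= 1 or random.random() < R:    # steps 4-5: accept
            current = candidate
        samples.append(current)              # step 6: else keep the old state
    return samples                           # step 7: the loop continues

# Toy usage: five states on a path, sampled in proportion to their weights.
weights = {0: 1.0, 1: 2.0, 2: 4.0, 3: 2.0, 4: 1.0}
propose = lambda i: min(4, max(0, i + random.choice([-1, 1])))
draws = metropolis(0, propose, weights.__getitem__, 100_000)
```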

Reversibility (again)

[Figure: two states i and j with transition probabilities P_ij and P_ji between them; reversibility (detailed balance) requires π_i P_ij = π_j P_ji.]

Does it achieve the desired equilibrium distribution?

If f(T_i) > f(T_j), then Prob(T_i | T_j) = 1 and Prob(T_j | T_i) = f(T_j) / f(T_i), so that

  Prob(T_j | T_i) / Prob(T_i | T_j) = f(T_j) / f(T_i)

(the same formula can be shown to hold when f(T_i) < f(T_j)). Then in both cases

  f(T_i) Prob(T_j | T_i) = Prob(T_i | T_j) f(T_j)

Summing over all T_i, the right-hand side sums up to f(T_j), so

  Σ_{T_i} f(T_i) Prob(T_j | T_i) = f(T_j)

This shows that, applying this algorithm, if we start in the desired distribution we stay in it. It is also true under most conditions that if you start in any other distribution, you converge to this one.
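
The stationarity argument can be checked numerically. A small sketch on an assumed toy chain (five states on a path, not trees):

```python
import numpy as np

f = np.array([1.0, 2.0, 4.0, 2.0, 1.0])   # unnormalized target distribution
n = len(f)
P = np.zeros((n, n))                       # Metropolis transition matrix
for i in range(n):
    for j in (i - 1, i + 1):                       # neighbors in the graph
        if 0 <= j < n:
            P[i, j] = 0.5 * min(1.0, f[j] / f[i])  # propose w.p. 1/2, then accept
    P[i, i] = 1.0 - P[i].sum()                     # rejected proposals stay put
assert np.allclose(f @ P, f)   # sum_i f(i) P[i, j] = f(j): f is stationary
```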

Bayesian MCMC

We try to achieve the posterior

  Prob(T | D) = Prob(T) Prob(D | T) / (denominator)

and this turns out to need the acceptance ratio

  R = [ Prob(T_new) Prob(D | T_new) ] / [ Prob(T_old) Prob(D | T_old) ]

(the denominators are the same and cancel out; this is a great convenience, as we often cannot evaluate the denominator, but we can usually evaluate the numerators). Note that we could also have a prior on model parameters too.

Mau and Newton's proposal mechanism

[Figure only: the tree-rearrangement proposal mechanism of Mau and Newton.]

Using MrBayes on the primates data

[Figure: tree for Bovine, Lemur, Tarsier, Crab-E. Mac, Rhesus Mac, Jpn Macaq, BarbMacaq, Orang, Chimp, Human, Gorilla, Gibbon, Squir Monk, and Mouse, with frequencies of partitions (posterior clade probabilities) on its branches: 1.00, 1.00, 0.999, 0.997, 0.986, 0.949, 0.938, 0.905, 0.899, 0.856, 0.416.]

An example

Suppose we have two species with a Jukes-Cantor model, so that the estimation of the (unrooted) tree is simply the estimation of the branch length between the two species. We can express the result either as the branch length t, or as the net probability of base change

  p = (3/4) (1 − e^(−4t/3))
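
This relation and its inverse in code (a minimal sketch; the function names are mine):

```python
import math

def jc_p_from_t(t):
    """Net probability of base change after branch length t (Jukes-Cantor)."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))

def jc_t_from_p(p):
    """Inverse transform; only defined for p < 3/4."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

print(jc_p_from_t(0.383112))   # ~0.3
print(jc_t_from_p(0.3))        # ~0.383112, the MLE quoted on a later slide
```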

A flat prior on p

[Figure: a uniform prior density on p over (0, 0.75).]

The corresponding prior on t

[Figure: the prior density induced on t (shown for 0 to 5) by the flat prior on p.]

A flat prior for t between 0 and 5

[Figure: a uniform prior density on t over (0, 5).]

The corresponding prior on p

[Figure: the prior density induced on p (over 0 to 0.75) by the flat prior on t.]

Likelihood curve for t

[Figure: ln L plotted against t from 0 to 5, when 3 sites differ out of 10; the maximum is at t̂ = 0.383112 (p̂ = 0.3).]

Likelihood curve for p

[Figure: ln L plotted against p, when 3 sites differ out of 10; the maximum is at p̂ = 0.3 (t̂ = 0.383112).]
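
Behind both curves is the same binomial log-likelihood: under Jukes-Cantor each site differs with probability p, so with 3 differences out of 10 sites L(p) is proportional to p^3 (1 − p)^7. A minimal sketch:

```python
import math

def log_lik_p(p):
    # log of p^3 (1-p)^7, dropping constants that do not depend on p
    return 3 * math.log(p) + 7 * math.log(1 - p)

def log_lik_t(t):
    # substitute p = (3/4)(1 - e^(-4t/3)) to get the curve over t
    return log_lik_p(0.75 * (1.0 - math.exp(-4.0 * t / 3.0)))

print(log_lik_p(0.3))        # the maximum over p, at p_hat = 3/10
print(log_lik_t(0.383112))   # the same value at the corresponding t_hat
```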

When t has a wide flat prior

[Figure: the 95% two-tailed credible interval for t, with various truncation points T (from 1 to 1000, log scale) on a flat prior for t, compared with the ML estimate.]

What is going on in that case is...

[Figure: the posterior density of t from 0 out to the truncation point T; T is so large that the region near the origin holds less than 2.5% of the area, so the credible interval slides away from it.]

The likelihood curve is nearly a normal distribution for large amounts of data

[Figure: the likelihood curve for θ plotted against t(x), the "sufficient statistic", approaching a normal curve.]

Curvatures and covariances of ML estimates

ML estimates have covariances computable from curvatures of the expected log-likelihood:

  Var[θ̂] ≈ 1 / ( −d²(log L)/dθ² )

The same is true when there are multiple parameters:

  Var[θ̂] ≈ V⁻¹   where   V_ij = −∂²(log L) / ∂θ_i ∂θ_j
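
For the 3-of-10 example this can be checked with a numerical second derivative (a minimal sketch; the binomial variance p(1 − p)/n is the known answer it should reproduce):

```python
import math

def log_lik(p):
    return 3 * math.log(p) + 7 * math.log(1 - p)   # log of p^3 (1-p)^7

def second_derivative(f, x, h=1e-5):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

var_p = 1.0 / -second_derivative(log_lik, 0.3)     # 1 / (-d2 lnL / dp2)
print(var_p)   # ~0.021 = 0.3 * 0.7 / 10, the binomial p(1-p)/n
```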

With large amounts of data, asymptotically

When the true value of θ is θ₀,

  (θ̂ − θ₀) / √v  ~  N(0, 1)

Since 1/v is the negative of the curvature of the log-likelihood,

  ln L(θ₀) ≈ ln L(θ̂) − (1/2) (θ₀ − θ̂)² / v

so that twice the difference of log-likelihoods is the square of a standard normal:

  2 ( ln L(θ̂) − ln L(θ₀) )  ~  χ²_1

Corresponding results for multiple parameters

  ln L(θ₀) ≈ ln L(θ̂) − (1/2) (θ₀ − θ̂)ᵀ V (θ₀ − θ̂)

and since (θ̂ − θ₀)ᵀ V (θ̂ − θ₀) ~ χ²_p, the log-likelihood difference is

  2 ( ln L(θ̂) − ln L(θ₀) )  ~  χ²_p

When q of the p parameters are constrained:

  2 ( ln L(θ̂) − ln L(θ₀) )  ~  χ²_q

Likelihood ratio interval for a parameter

[Figure: ln L (roughly −2640 to −2620) plotted against the transition/transversion ratio, from 5 to 200 on a log scale, with the likelihood-ratio interval marked.]

Model selection using the LRT

[Figure: a lattice of nested models arranged by number of parameters: Jukes-Cantor and K2P with T = 2 (25 parameters), K2P with T estimated (26), F81 and F84 with T = 2 (28), and F84 with T estimated (29); edges connect each model to the more general models containing it.]

The Akaike information criterion

Compare between hypotheses using AIC = −2 ln L + 2p:

  Model             ln L           Number of parameters   AIC
  Jukes-Cantor      −3068.29186    25                     6186.58
  K2P, R = 2.0      −2953.15830    25                     5956.32
  K2P, R = 1.889    −2952.94264    26                     5957.89
  F81               −2935.25430    28                     5926.51
  F84, R = 2.0      −2680.32982    28                     5416.66
  F84, R = 28.95    −2616.39810    29                     5290.80
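
The last column is easy to reproduce (a minimal sketch with the table's own numbers):

```python
# AIC = -2 ln L + 2p for each model in the table above.
models = [
    ("Jukes-Cantor",   -3068.29186, 25),
    ("K2P, R = 2.0",   -2953.15830, 25),
    ("K2P, R = 1.889", -2952.94264, 26),
    ("F81",            -2935.25430, 28),
    ("F84, R = 2.0",   -2680.32982, 28),
    ("F84, R = 28.95", -2616.39810, 29),
]
for name, lnL, p in models:
    print(f"{name:15s}  AIC = {-2 * lnL + 2 * p:8.2f}")
# F84 with R estimated (28.95) has the smallest AIC, 5290.80.
```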

Can we test trees using the LRT?

[Figure: ln Likelihood (roughly −200 to −192) plotted against an interior branch length t running from −0.2 to 0.2, for competing tree topologies.]

LRT of a molecular clock: how many parameters?

[Figure: a rooted five-species tree with branch lengths v1 ... v8.]

Constraints for a clock:

  v1 = v2
  v4 = v5
  v1 + v6 = v3
  v3 + v7 = v4 + v8

Each independent constraint removes one parameter. In general, an unrooted tree for n species has 2n − 3 branch lengths while a clock tree has only n − 1 node depths, so the clock test has n − 2 degrees of freedom.

Likelihood Ratio Test for a molecular clock

Using the 7-species mitochondrial DNA data set (the great apes plus Bovine and Mouse), with Ts/Tv = 30 and an F84 model, we get:

  Tree          ln L
  No clock      −1372.77620
  Clock         −1414.45053
  Difference       41.67433

Chi-square statistic: 2 × 41.674 = 83.35, with n − 2 = 5 degrees of freedom: highly significant.
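
The test as a computation (a minimal sketch using SciPy's chi-square survival function):

```python
from scipy.stats import chi2

lnL_noclock = -1372.77620
lnL_clock = -1414.45053
stat = 2 * (lnL_noclock - lnL_clock)   # 83.35
df = 5                                 # n - 2 constraints for n = 7 species
print(stat, chi2.sf(stat, df))         # p ~ 1.6e-16: highly significant
```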

Goldman's test using simulation (related to the "parametric bootstrap")

[Figure: workflow. From the data, estimate trees with and without the clock constraint, giving log-likelihoods l and l_c and the statistic 2(l − l_c). Then simulate many data sets under the estimated clock tree, estimate clocklike and nonclocklike trees from each, and compute 2(l − l_c) for each simulated data set; the observed statistic is compared with this simulated distribution.]

The bootstrap

[Figure: the (unknown) true distribution, with its (unknown) true value of θ, yields the sample; the empirical distribution of the sample is resampled to give bootstrap replicates; the resulting distribution of estimates of parameters mimics the distribution of the estimate of θ around the truth.]

The bootstrap for phylogenies

[Figure: from the original data, a sequences × sites matrix, we get an estimate of the tree. Bootstrap sample #1 samples the same number of sites, with replacement, and yields bootstrap estimate of the tree #1; bootstrap sample #2 likewise yields bootstrap estimate of the tree #2; and so on.]
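
The resampling step is just one line of indexing. A minimal sketch (`estimate_tree` is a placeholder for any tree-building method; the names are mine):

```python
import numpy as np

def bootstrap_trees(alignment, estimate_tree, n_replicates=100, seed=None):
    """Resample site columns with replacement, re-estimating the tree each time.

    `alignment` is a 2-D array: one row per sequence, one column per site.
    """
    rng = np.random.default_rng(seed)
    n_sites = alignment.shape[1]
    trees = []
    for _ in range(n_replicates):
        cols = rng.integers(0, n_sites, size=n_sites)  # same number of sites
        trees.append(estimate_tree(alignment[:, cols]))
    return trees
```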

Partitions from the bootstrap sample of trees

[Figures, one per slide: "A partition defined by a branch in the first tree", "Another partition from the first tree", "The third partition from that tree", "Partitions from the second tree", and likewise for the third, fourth, and fifth trees, ending with "The table of partitions from all trees". Each interior branch of each of the five trees defines a partition of the species, and a running table records how many times each partition is found; the completed table has eight distinct partitions, with counts 3, 3, 1, 1, 1, 2, 1, 3.]

The majority-rule consensus tree

[Figure: the consensus tree built from exactly those partitions that are found in a majority of the five trees.]
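
Tallying partitions and keeping the majority is simple to express. A minimal sketch (trees are represented just as lists of bipartitions, each a frozenset of the species on one side; building those from real tree structures is omitted):

```python
from collections import Counter

def majority_rule(tree_partitions, n_trees):
    """Return the partitions found in more than half of the trees, with counts."""
    counts = Counter()
    for partitions in tree_partitions:
        counts.update(partitions)
    return {part: k for part, k in counts.items() if k > n_trees / 2}

# Toy usage with three trees on species A-E:
trees = [
    [frozenset("AB"), frozenset("ABC")],
    [frozenset("AB"), frozenset("ABD")],
    [frozenset("AB"), frozenset("ABC")],
]
print(majority_rule(trees, 3))   # {A,B} (3 times) and {A,B,C} (2) survive
```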

The majority-rule tree with the 14-species primate mtDNA data

[Figure: the majority-rule consensus tree for Bovine, Mouse, Squir Monk, Chimp, Human, Gorilla, Orang, Gibbon, Rhesus Mac, Jpn Macaq, Crab-E. Mac, BarbMacaq, Tarsier, and Lemur, with bootstrap percentages 100, 99, 99, 84, 80, 77, 74, 72, 49, 42, and 35 on its branches.]

With the delete-half jackknife

[Figure: the corresponding consensus tree from delete-half jackknife samples; the support values are 100, 99, 98, 84, 80, 80, 72, 69, 59, 50, and 32.]

Bootstrap versus jackknife in a simple case

Exact computation of the effects of the deletion fraction for the jackknife. Suppose groups 1 and 2 are conflicting groups, supported by n1 and n2 of the n characters, and by m1 and m2 of the n(1 − δ) resampled characters. We can compute, for various n's, the probability of getting more evidence for group 1 than for group 2. A typical result, for n1 = 10, n2 = 8, n = 100:

                                         Bootstrap   Jackknife δ = 1/2   Jackknife δ = 1/e
  Prob(m1 > m2)                          0.6384      0.5923              0.6441
  Prob(m1 ≥ m2)                          0.7230      0.7587              0.8040
  Prob(m1 > m2) + (1/2) Prob(m1 = m2)    0.6807      0.6755              0.7240
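
The bootstrap column can be checked by simulation (an assumed Monte Carlo sketch, not the slides' exact computation; resampling n characters with replacement makes the support counts multinomial):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n1, n2, reps = 100, 10, 8, 1_000_000
# counts of group-1 characters, group-2 characters, and the rest
m = rng.multinomial(n, [n1 / n, n2 / n, (n - n1 - n2) / n], size=reps)
m1, m2 = m[:, 0], m[:, 1]
print((m1 > m2).mean())                       # ~0.638
print((m1 >= m2).mean())                      # ~0.723
print(((m1 > m2) + 0.5 * (m1 == m2)).mean())  # ~0.681
```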

Probability of a character being omitted from a bootstrap

  N    (1 − 1/N)^N    N    (1 − 1/N)^N    N    (1 − 1/N)^N    N     (1 − 1/N)^N
  1    0              11   0.35049        25   0.36040        100   0.36603
  2    0.25           12   0.35200        30   0.36166        150   0.36665
  3    0.29630        13   0.35326        35   0.36256        200   0.36696
  4    0.31641        14   0.35434        40   0.36323        250   0.36714
  5    0.32768        15   0.35526        45   0.36375        300   0.36727
  6    0.33490        16   0.35607        50   0.36417        500   0.36751
  7    0.33992        17   0.35679        60   0.36479        1000  0.36770
  8    0.34361        18   0.35742        70   0.36524        ∞     0.36788
  9    0.34644        19   0.35798        80   0.36557
  10   0.34868        20   0.35849        90   0.36583
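
The tabulated quantity is (1 − 1/N)^N, which converges to e⁻¹ (a minimal check):

```python
import math

# Chance that a given character is never drawn in a bootstrap of N characters.
for N in (2, 10, 100, 1000):
    print(N, (1 - 1 / N) ** N)   # 0.25, 0.34868, 0.36603, 0.36770
print("limit:", math.exp(-1))    # 0.36788
```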

A toy example to examine bias of P values

[Figure: the distribution of individual values of x, the true distribution of sample means around the true value of the mean, and estimated distributions of sample means; the region above 0 is labeled "Topology" I and the region below 0 is "Topology" II.]

Assuming a normal distribution, we try to infer whether the mean is above 0, when the mean is unknown and the variance is known to be 1.

Bias in the P values

[Figure: the P value computed from each estimate of the "phylogeny", with topology I above 0 and topology II below, relative to the true mean; note that the true P is more extreme than the average of the P's.]

How much bias in the P values?

[Figure: average P plotted against true P, both running from 0 to 1; the average P is less extreme than the true P.]

Bias in the P values with different priors

[Figure: probability of the correct topology plotted against the P for the expectation of µ, with curves for nσ² = 2.0, 1.0, and 0.1.]

The parametric bootstrap

[Figure: workflow. From the original data we get an estimate of the tree; computer simulation on that estimated tree produces data sets #1, #2, #3, ..., #100, and estimating a tree from each gives T_1, T_2, T_3, ..., T_100.]

The parametric bootstrap with the primates data

[Figure: tree for Bovine, Lemur, Tarsier, Squirrel Monkey, Mouse, Jp Macaque, Barbary Mac, Crab-Eating Mac, Rhesus Mac, Gorilla, Chimp, Human, Orang, and Gibbon, with parametric-bootstrap support values (98, 98, 96, 96, 95, 93, 83, 82, 81, 78) on its branches.]

How it was done

This freeware-friendly presentation was prepared with:

Linux (operating system)
LaTeX (mathematical typesetting)
ps2pdf (turning PostScript into PDF)
Idraw (drawing program to modify plots and draw figures)
Adobe Acrobat Reader (to display the PDF in full-screen mode)