The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection

Similar documents
1. Can we use the CFN model for morphological traits?

Systematics - Bio 615

Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution

Reconstruire le passé biologique modèles, méthodes, performances, limites

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees

Distances that Perfectly Mislead

Woods Hole brief primer on Multiple Sequence Alignment presented by Mark Holder

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition

Dr. Amira A. AL-Hosary

Consistency Index (CI)

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Constructing Evolutionary/Phylogenetic Trees

Phylogenetic Tree Reconstruction

Thanks to Paul Lewis and Joe Felsenstein for the use of slides

Algebraic Statistics Tutorial I

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Concepts and Methods in Molecular Divergence Time Estimation

Constructing Evolutionary/Phylogenetic Trees

X X (2) X Pr(X = x θ) (3)

Discrete & continuous characters: The threshold model

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Letter to the Editor. The Effect of Taxonomic Sampling on Accuracy of Phylogeny Estimation: Test Case of a Known Phylogeny Steven Poe 1

TheDisk-Covering MethodforTree Reconstruction

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Estimating Evolutionary Trees. Phylogenetic Methods

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Phylogenetic Algebraic Geometry

One-minute responses. Nice class{no complaints. Your explanations of ML were very clear. The phylogenetics portion made more sense to me today.

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Phylogenetics: Building Phylogenetic Trees

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Analytic Solutions for Three Taxon ML MC Trees with Variable Rates Across Sites

EVOLUTIONARY DISTANCES

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Quartet Inference from SNP Data Under the Coalescent Model

Evaluating phylogenetic hypotheses

Inferring Molecular Phylogeny

Consensus methods. Strict consensus methods

Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2008

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Workshop III: Evolutionary Genomics

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

arxiv: v1 [q-bio.pe] 1 Jun 2014

Phylogenetic Geometry

Evolutionary Models. Evolutionary Models

Preliminaries. Download PAUP* from: Tuesday, July 19, 16

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

Introduction to Phylogenetics Workshop on Molecular Evolution 2018 Marine Biological Lab, Woods Hole, MA. USA. Mark T. Holder University of Kansas

Counting phylogenetic invariants in some simple cases. Joseph Felsenstein. Department of Genetics SK-50. University of Washington

THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200 Spring 2018 University of California, Berkeley

Parsimony via Consensus

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

The Generalized Neighbor Joining method

Phylogenetic inference

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Comparison of Cost Functions in Sequence Alignment. Ryan Healey

arxiv: v1 [q-bio.pe] 27 Oct 2011

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution


Who was Bayes? Bayesian Phylogenetics. What is Bayes Theorem?

Bayesian Phylogenetics

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Phylogenetic Inference using RevBayes

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction

Molecular Evolution, course # Final Exam, May 3, 2006

Phylogenetic methods in molecular systematics

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Rapid evolution of the cerebellum in humans and other great apes

Non-independence in Statistical Tests for Discrete Cross-species Data

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

ELIZABETH S. ALLMAN and JOHN A. RHODES ABSTRACT 1. INTRODUCTION

Effects of Gap Open and Gap Extension Penalties

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

The Effect of Ambiguous Data on Phylogenetic Estimates Obtained by Maximum Likelihood and Bayesian Inference

Lecture Notes: Markov chains

Isolating - A New Resampling Method for Gene Order Data

Phylogenetic analyses. Kirsi Kostamo

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2011 University of California, Berkeley

Letter to the Editor. Department of Biology, Arizona State University

Statistical nonmolecular phylogenetics: can molecular phylogenies illuminate morphological evolution?

EMPIRICAL REALISM OF SIMULATED DATA IS MORE IMPORTANT THAN THE MODEL USED TO GENERATE IT: A REPLY TO GOLOBOFF ET AL.

AUTHOR COPY ONLY. Hetero: a program to simulate the evolution of DNA on a four-taxon tree

BMI/CS 776 Lecture 4. Colin Dewey

arxiv: v3 [q-bio.pe] 3 Dec 2011

AdvAlg9.7LogarithmsToBasesOtherThan10.notebook. March 08, 2018

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

Is the equal branch length model a parsimony model?

Pinvar approach. Remarks: invariable sites (evolve at relative rate 0) variable sites (evolves at relative rate r)

Minimum evolution using ordinary least-squares is less robust than neighbor-joining

Transcription:

The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection Mark T. Holder and Jordan M. Koch Department of Ecology and Evolutionary Biology, University of Kansas. David Swoord, Tracy Heath, David Bryant and Paul Lewis are collaborators on the second part of the talk. Thanks to NSF and KU's IMSD for funding. ievobio - 2013

Ascertainment bias a bias in parameter estimation or testing caused by non-random sampling of the data. This talk will focus on analyses of ltered data. Filtered data Some types of data will never be sampled.

Correcting for ltered data in tree estimation Use: P(Data Tree, not excluded) as the likelihood instead of: P(Data Tree) P(Data Tree, not excluded) = P(Data Tree) P(not excluded Tree) Felsenstein (1992) and Lewis (2001)

Conclusions Analyzing variable-only data with Lewis' Mk v model is consistent Inferring trees from parsimony-informative-only data: can be consistent if the tree is not tiny. can be feasible for multi-state data using new algorithms Treating gaps as missing data: does not lead to inconsistency if indel process is independent of the substitution process. can be positively misleading under mild violations of this independence assumption. rules for character encoding need to be linked to the data.? in molecular data.

Filtering data: retain variable patterns Character Character Taxon 1 2 3 4 5 6 Taxon 1 2 3 4 5 t1 0 1 1 0 0 0 t1 1 1 0 0 0 t2 0 1 0 0 1 0 t2 1 0 0 1 0 t3 0 1 0 1 1 1 t3 1 0 1 1 1 t4 0 0 0 1 0 1 t4 0 0 1 0 1

Filtering data: retain parsimony-informative patterns Character Char. Taxon 1 2 3 4 5 6 Taxon 1 2 3 t1 0 1 1 0 0 0 t1 0 0 0 t2 0 1 0 0 1 0 t2 0 1 0 t3 0 1 0 1 1 1 t3 1 1 1 t4 0 0 0 1 0 1 t4 1 0 1

Identiability of the tree A B C D 3 ABCD 1001 1010 1100 Pattern probability A C B D

Partial identiability of the tree A B C 3 D ABCD 1001 1010 1100 Pattern probability A C B D 7

A Markov model for character evolution 0 q 01 q 10 1 Mk: q 10 = q 01 GMk: q 10 q 01

P(1100 T ) A T = B C D P(1010 T ) P(1001 T )

Extending identiability results Filtering Model Identiable? None GMk Yes. (Steel, 1994) Variable GMk v Yes Mk p i Part. (Steel et al., 1993) Pars-inf GMk p i N 8 Yes N = 4 No 5 < N 7? results in red: (Allman et al., 2010), extending (Allman and Rhodes, 2008) for GMk v

Analyzing ltered data Filtering Analysis Variable Pars-inf. Mk Pos. Misleading Pos. Misleading Mk v Consistent Pos. Misleading Mk p i - Consistent N 8

Calculating the probability of not being excluded P(not excluded Tree) = 1 P(excluded Tree) For variable-only, binary data: P(Var. pat Tree) = 1 P(all 0 pat. Tree) P(all 1 pat. Tree)

Calculating the probability of a parsimony-uninformative pattern For multi-state character (k > 2), and parsimony-informative-only data, there are lots of patterns: ( ( )) N O 2 k 1 k 1 JMK and MTH have implemented parsimony-uninformative specializations of an algorithm for calculating the prob. of classes of patterns. Koch and Holder (2012) https://github.com/mtholder/phypatclassprob

Informatics implications To correct for ascertainment bias we need to know what form of data ltering was used. If a character was chosen because it was variable in a related group, it is dicult to correct of the ascertainment bias.

A PEEKSAVTALWGKVNVDEVGG B GEEKAAVLALWDKVNEEEVGG C PADKTNVKAAWGKVGAHAGEYGA D AADKTNVKAAWSKVGGHAGEYGA E EHEWQLVLHVWAKVEADVAGHGQ pairwise alignment A - B.17 - C.59.60 - D.59.59.13 - E.77.77.75.75 - tree inference A PEEKSAVTALWGKVNVDEVGG B GEEKAAVLALWDKVNEEEVGG C PADKTNVKAAWGKVGAHAGEYGA + D AADKTNVKAAWSKVGGHAGEYGA E EHEWQLVLHVWAKVEADVAGHGQ alignment stage A B C D E A PEEKSAVTALWGKVN--VDEVGG B GEEKAAVLALWDKVN--EEEVGG C PADKTNVKAAWGKVGAHAGEYGA D AADKTNVKAAWSKVGGHAGEYGA E EHEWQLVLHVWAKVEADVAGHGQ gaps to missing A PEEKSAVTALWGKVN??VDEVGG B GEEKAAVLALWDKVN??EEEVGG C PADKTNVKAAWGKVGAHAGEYGA D AADKTNVKAAWSKVGGHAGEYGA E EHEWQLVLHVWAKVEADVAGHGQ

Gaps-as-missing-data ML (SML) on the correct alignment Clearly ignores information from indels, Warnow (2012) argues that the method is inconsistent - but her proof only works when there are no substitutions on the tree. Holder, Heath, Lewis, Swoord, and Bryant (in prep): 1. Proof of consistency if indel process is independent of substitution process, 2. Example of the method being positively misleading under a +I model for indels and substitutions.

Theorem 1 The tree and parameter pair, ˆTM, ˆθ M, estimated via SML will yield a consistent estimator of the tree, T, if: (a) the time-reversible substitution model, θ, results in consistent estimation of the T in the absence of indels; (b) the indel process, φ, acts independently of substitution process and the sequence states; (c) the probability distribution for newly inserted states is identical to the the equilibrium state frequency of the substitution process; and (d) there is non-zero probability of generating a site without gaps under φ.

What if rates of indels are correlated with rates of substitutions? Substitution: Jukes-Cantor + a proportion of invariant sites Indel: invariant sites will not experience indels. All sites that are free to have substitutions are also free to experience indels. The result: If we calculate the expected pattern frequency spectra for extreme Felsenstein-zone trees, and mimic innite character sampling the software using gaps-as-missing-data approach (ML and Bayesian) prefers the wrong tree. We still need to verify that this is not an artifact of local optima being found in software.

Positively misleading behavior from treating gaps as missing data Long-branch attraction if the rate of the indel process is correlated with the rate of the substitution process. Non-random ltering of data long branches underestimated. The result could have implications for the (long-standing) debates in systematics about the eect of missing data and inapplicable character states.

Informatics implications Gaps aren't missing data, we really should be using models of the indel process. Terminal gaps in alignments often are the result of missing data. Software should not use the same symbol for gaps caused by indels and gaps caused by incomplete sequencing.

References I E Allman and J Rhodes. Identifying evolutionary trees and substitution parameters for the general markov model with invariable sites. Mathematical biosciences, 211(1):1833, Jan 2008. doi: 10.1016/j.mbs.2007.09.001. URL http://linkinghub.elsevier.com/retrieve/pii/ S0025556407001897. Elizabeth S. Allman, Mark T. Holder, and John A. Rhodes. Estimating trees from ltered data: Identiability of models for morphological phylogenetics. Journal of Theoretical Biology, 263(1):108119, 2010. ISSN 0022-5193. doi: DOI:10.1016/j.jtbi.2009.12.001. URL http://www.sciencedirect.com/science/article/ B6WMD-4XX160T-2/2/5adf8b8af77dd551890d7cb5b0e62dba.

References II Joseph Felsenstein. Phylogenies from restriction sites: a maximum-likelihood approach. Evolution, 46:159173, Jan 1992. Jordan M. Koch and Mark T. Holder. An algorithm for calculating the probability of classes of data patterns on a genealogy. PLOS Currents Tree of Life, Dec 14 [last modied: 2012 Dec 14](1), 2012. doi: 10.1371/4fd1286980c08. URL http://currents.plos.org/treeoflife/article/ an-algorithm-for-calculating-the-probability-of-classes-o P. O. Lewis. A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology, 50(6):913925, 2001. Mike Steel. Recovering a tree from the leaf colourations it generates under a markov model. Appl. Math. Letters, 7(2): 1923, Jan 1994. URL http://www.math.canterbury.ac. nz/~mathmas/research/markov3.pdf.

References III Mike Steel, Michael D. Hendy, and David Penny. Parsimony can be consistent! Syst Biol, 42(4):581587, 1993. Tandy J. Warnow. Standard maximum likelihood analyses of alignments with gaps can be statistically inconsistent. PLOS Currents Tree of Life, Mar 12:[last modied: 2012 Apr 3] Edition 1, 2012. doi: 10.1371/currents.RRN1308. URL http://currents.plos.org/treeoflife/article/ standard-maximum-likelihood-analyses-of-alignments-with-g