Fast coalescent-based branch support using local quartet frequencies

Molecular Biology and Evolution (2016) 33(7): 1654-1668
Erfan Sayyari, Siavash Mirarab
University of California, San Diego (ECE)

Phylogenomics
[Figure: sequence alignments of the same species (including Orangutan and Chimpanzee) across many genes]
gene 1: ACTGCACACCG, ACTGC-CCCCG, AATGC-CCCCG, -CTGCACACGG
gene 2: CTGAGCATCG, CTGAGC-TCG, ATGAGC-TC-, CTGA-CAC-G
...
gene 999: AGCAGCATCGTG, AGCAGC-TCGTG, AGCAGC-TC-TG, C-TA-CACGGTG
gene 1000: CAGGCACGCACGAA, AGC-CACGC-CATA, ATGGCACGC-C-TA, AGCTAC-CACGGAT
Here "gene" refers to a portion of the genome (not a functional gene).

Gene tree discordance
[Figure: the species tree and gene trees for gene 1 through gene 1000]
Causes of gene tree discordance include:
  Incomplete Lineage Sorting (ILS)
  Duplication and loss
  Horizontal Gene Transfer (HGT)

Incomplete Lineage Sorting (ILS)
A random process related to the coalescence of alleles across various populations
[Figure: tracing alleles through generations]
Omnipresent: possible for every tree
Likely for short branches or large population sizes

MSC and Identifiability
A statistical model called the multi-species coalescent (MSC) can generate ILS.
Any species tree defines a unique distribution on the set of all possible gene trees.
In principle, the species tree can be identified from the gene tree distribution despite high discordance.
Likelihood calculation is not feasible.

Unrooted quartets under MSC model
For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010)
$\theta_1 = 1 - \frac{2}{3} e^{-d}$
Example: d = 0.8 gives $\theta_1 = 70\%$, $\theta_2 = 15\%$, $\theta_3 = 15\%$
The most frequent gene tree = The most likely species tree
[Figure: species topology probability (0.00 to 1.00) vs. branch length d (0 to 3); the curve starts at 1/3 for d = 0]
Shorter branches → more discordance → a harder species tree reconstruction problem
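The discordance formula on this slide can be evaluated directly. The following is a minimal sketch; the function name `quartet_probabilities` is a hypothetical helper introduced for illustration, and it assumes the two discordant topologies split the remaining probability equally, as the 15%/15% example suggests.

```python
import math

def quartet_probabilities(d):
    """Return (theta1, theta2, theta3) for an internal branch of length d
    (coalescent units), using theta1 = 1 - (2/3) * exp(-d).
    Assumption: the two discordant topologies are equally likely."""
    discordant = math.exp(-d) / 3.0
    return (1.0 - 2.0 * discordant, discordant, discordant)

if __name__ == "__main__":
    for d in (0.0, 0.8, 3.0):
        t1, t2, t3 = quartet_probabilities(d)
        print(f"d={d:.1f}: theta1={t1:.3f}, theta2={t2:.3f}, theta3={t3:.3f}")
```

For d = 0.8 this gives roughly 0.70, 0.15, 0.15, matching the example values; for d = 0 all three topologies approach 1/3.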

Species tree inference for >4 species
For >4 species, the species tree topology can be different from the most likely gene tree (called the anomaly zone) (Degnan, 2013)
1. Break gene trees into $\binom{n}{4}$ quartets of species
2. Find the dominant tree for all quartets of taxa
3. Combine quartet trees
Some tools (e.g., BUCKy-p [Larget, et al., 2010])
ASTRAL: weight all $3\binom{n}{4}$ quartet topologies by their frequency in gene trees and find the optimal species tree using dynamic programming
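The weighting step can be illustrated with a naive counter. This sketch is not ASTRAL's dynamic programming; it enumerates all quartets by brute force, and it assumes gene trees are given as nested Python tuples of leaf names. All function names (`clusters`, `quartet_topology`, `quartet_weights`) are hypothetical.

```python
from collections import Counter
from itertools import combinations

def clusters(tree):
    """Return (leaf set, list of leaf sets below every node) for a nested-tuple tree."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found.extend(child_clusters)
    found.append(leaves)
    return leaves, found

def quartet_topology(tree_clusters, quartet):
    """Unrooted topology induced on four taxa, as a frozenset of two pairs (None if unresolved)."""
    q = frozenset(quartet)
    for cluster in tree_clusters:
        side = cluster & q
        if len(side) == 2:  # this edge separates two taxa from the other two
            return frozenset([side, q - side])
    return None

def quartet_weights(gene_trees):
    """Count how many gene trees display each quartet topology."""
    weights = Counter()
    for tree in gene_trees:
        leaves, tree_clusters = clusters(tree)
        for quartet in combinations(sorted(leaves), 4):
            topology = quartet_topology(tree_clusters, quartet)
            if topology is not None:
                weights[topology] += 1
    return weights

if __name__ == "__main__":
    trees = [((("A", "B"), "C"), "D"), ((("A", "B"), "C"), "D"), ((("A", "C"), "B"), "D")]
    for topology, count in quartet_weights(trees).items():
        print(sorted(sorted(pair) for pair in topology), count)
```

The brute-force loop visits every quartet of every gene tree, so it only serves to make the quartet weights concrete on small inputs.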

ASTRAL used by biologists
ASTRAL-I: [Mirarab et al., 2014, Bioinformatics]
ASTRAL-II: [Mirarab and Warnow, 2015, Bioinformatics]
Plants: Wickett et al., 2014, PNAS
Birds: Prum et al., 2015, Nature
Xenoturbella: Cannon et al., 2016, Nature
Xenoturbella: Rouse et al., 2016, Nature
Flatworms: Laumer et al., 2015, eLife
Shrews: Giarla et al., 2015, Syst. Bio.
Frogs: Yuan et al., 2016, Syst. Bio.
Tomatoes: Pease et al., 2016, PLoS Bio.
Angiosperms: Huang et al., 2016, MBE
Worms: Andrade et al., 2015, MBE

Going beyond the topology [Sayyari and Mirarab, Molecular Biology & Evolution, 2016]
Branch length (BL):
  ASTRAL did not estimate branch length
  We added branch length estimation in coalescent units (#generations/population size), only for internal branches
Branch support: how reliable is a branch?
  ASTRAL relied on bootstrapping
  We added a native Bayesian support

Branch Length [Sayyari and Mirarab, MBE, 2016]
Simply a function of the level of discordance: $\theta_1 = 1 - \frac{2}{3} e^{-d}$ (e.g., d = 0.8 gives $\theta_1 = 70\%$, $\theta_2 = 15\%$, $\theta_3 = 15\%$)
A single quartet (n=4): reverse the discordance formula to get the ML estimate
$\hat{d} = -\ln\left(\frac{3}{2}\,(1 - \hat{\theta}_1)\right)$
Example: $m_1 = 132$, $m_2 = 32$, $m_3 = 36$ give $\hat{\theta}_1 = 66\%$, $\hat{\theta}_2 = 16\%$, $\hat{\theta}_3 = 18\%$ and $\hat{d} = 0.67$
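The single-quartet estimate can be reproduced in a few lines. This is a minimal sketch; the function name `ml_branch_length` is hypothetical, and returning 0 when the first topology is not in the majority (and infinity when there is no discordance) is an assumption made here to keep the logarithm defined, not something stated on the slide.

```python
import math

def ml_branch_length(m1, m2, m3):
    """Estimate the internal branch length (coalescent units) from topology counts,
    where m1 counts gene trees matching the species tree topology.
    Uses d = -ln(3/2 * (1 - theta1)) with theta1 = m1 / (m1 + m2 + m3)."""
    theta1 = m1 / (m1 + m2 + m3)
    if theta1 <= 1.0 / 3.0:
        return 0.0        # assumption: clip to zero when there is no signal
    if theta1 >= 1.0:
        return math.inf   # assumption: no discordance observed
    return -math.log(1.5 * (1.0 - theta1))

if __name__ == "__main__":
    print(round(ml_branch_length(132, 32, 36), 2))
```

With the counts from the slide (132, 32, 36), the estimate is about 0.67.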

Branch length for n>4
Simply average all quartet frequencies around that branch
Justified given some assumptions
Can be done efficiently in $\Theta(n^2 m)$ for all branches for n species and m genes
[Figure: an internal branch of a species tree with leaves a through h, annotated with $\theta_1 = 1 - \frac{2}{3} e^{-d}$]
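To make the averaging concrete, here is a naive sketch. It assumes the branch separates four groups of taxa (called `c1` to `c4` here) and that a quartet "around the branch" takes one taxon from each group; the function names and the `counts` dictionary layout are hypothetical. It enumerates every quartet explicitly, so it does not achieve the $\Theta(n^2 m)$ running time mentioned above.

```python
from itertools import product

def quartets_around_branch(c1, c2, c3, c4):
    """All quartets with one taxon from each of the four groups around a branch."""
    return list(product(c1, c2, c3, c4))

def average_quartet_frequencies(quartets, counts):
    """Average the three topology frequencies over the given quartets.
    counts maps a quartet to (m1, m2, m3), with m1 the count of the
    topology that agrees with the branch (an assumed convention)."""
    totals = [0.0, 0.0, 0.0]
    for quartet in quartets:
        m1, m2, m3 = counts[quartet]
        m = m1 + m2 + m3
        for i, c in enumerate((m1, m2, m3)):
            totals[i] += c / m
    return [t / len(quartets) for t in totals]

if __name__ == "__main__":
    quartets = quartets_around_branch(["a"], ["b", "c"], ["d"], ["e"])
    counts = {("a", "b", "d", "e"): (6, 2, 2), ("a", "c", "d", "e"): (8, 1, 1)}
    print(average_quartet_frequencies(quartets, counts))
```

The averaged first frequency can then be plugged into the single-quartet branch length formula.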

Branch length accuracy
[Figure: estimated branch length (log scale) vs. true branch length (log scale); panels: true gene trees, low gene tree error, moderate gene tree error, medium gene tree error, high gene tree error]
With true gene trees, ASTRAL correctly estimates BL
With error-prone estimated gene trees, ASTRAL underestimates BL

Branch support (common practice)
Multi-locus bootstrapping (MLBS)
Slow: requires bootstrapping all genes (e.g., 100m ML trees)
Inaccurate and hard to interpret [Mirarab et al., Syst. Bio., 2014; Bayzid et al., PLoS One, 2015]
[Figure: correct branches (percentage); Mirarab et al., Syst. Bio., 2014]

Branch support idea: n=4
Recall quartet frequencies follow a multinomial distribution
Example: m = 200 gene trees with topology counts $m_1 = 80$, $m_2 = 63$, $m_3 = 57$ and true frequencies $\theta_1$, $\theta_2$, $\theta_3$
P(topology seen in $m_1/m$ gene trees is the species tree) = P($\theta_1 > 1/3$) = P(a 3-sided coin tossed m times is biased towards the side that shows up $m_1$ times)
Can be analytically solved

Posterior
Prior: Yule process; it turns out to be conjugate
Fast to calculate
Depends on the frequency of not just the first topology, but also the second and third topologies

Conjugate prior
All three topologies have equal prior probability: $\Pr(\theta_1 > \frac{1}{3}) = \Pr(\theta_2 > \frac{1}{3}) = \Pr(\theta_3 > \frac{1}{3}) = \frac{1}{3}$
The species tree is generated through a birth-only (Yule) process with rate $\lambda$
Turns out to be the conjugate prior
Default: $\lambda = 0.5$ (uniformly distributed branch lengths)
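The posterior for a single quartet can be approximated numerically. The sketch below is not ASTRAL's implementation; it uses midpoint integration instead of the closed-form solution and rests on assumptions stated here: the counts are multinomial, the two non-species topologies share frequency $(1-\theta)/2$, each topology has prior probability 1/3, and the prior density of the species-tree frequency on (1/3, 1) is proportional to $(1-\theta)^{2\lambda-1}$. The function names are hypothetical.

```python
import math

def log_topology_weight(z, m, lam=0.5, steps=20000):
    """log of 2^(z-m) * integral over (1/3, 1) of t^z * (1-t)^(m-z+2*lam-1) dt,
    computed with the midpoint rule in log space to avoid underflow."""
    lo, hi = 1.0 / 3.0, 1.0
    h = (hi - lo) / steps
    logs = []
    for i in range(steps):
        t = lo + (i + 0.5) * h
        logs.append(z * math.log(t) + (m - z + 2 * lam - 1) * math.log(1.0 - t))
    peak = max(logs)
    integral = sum(math.exp(v - peak) for v in logs) * h
    return (z - m) * math.log(2.0) + peak + math.log(integral)

def local_posterior(m1, m2, m3, lam=0.5):
    """Posterior probability that each of the three quartet topologies
    is the species tree topology, given their counts in m gene trees."""
    m = m1 + m2 + m3
    log_w = [log_topology_weight(z, m, lam) for z in (m1, m2, m3)]
    peak = max(log_w)
    w = [math.exp(v - peak) for v in log_w]
    total = sum(w)
    return [x / total for x in w]

if __name__ == "__main__":
    print(local_posterior(80, 63, 57))
    print(local_posterior(132, 32, 36))
```

The result depends on all three counts, so two data sets with the same first frequency but different second and third frequencies can receive different support.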

Quartet support vs. posterior
[Figure: support as a function of quartet frequency ($\theta_1$)]
Increased number of genes (m) → increased support
Decreased discordance → increased support

How about n>4?
Locality Assumption: All four clusters around a branch are correct
Treat branches independently
[Figure: an internal branch surrounded by four clusters of sizes $|C_1| = n_1$, $|C_2| = n_2$, $|C_3| = n_3$, $|C_4| = n_4$; leaves a through h]
k quartets around a branch? ($k = n_1 n_2 n_3 n_4$)
Independence assumption is too liberal ($m \times k$ tosses of the coin)
Fully dependent assumption: all quartets give noisy estimates of a single hidden true frequency
Simply average their frequencies

Simulation studies
Our simulations violate our assumptions:
  Estimated gene trees instead of true gene trees
  Estimated species trees: the locality assumption can be violated
Measuring the support accuracy: the number of false positives and false negatives above various thresholds of support
[Figure: simulation pipeline: true (model) species tree → true gene trees → sequence data → estimated gene trees → estimated species tree; example taxa: Finch, Falcon, Owl, Eagle, Pigeon]

Local PP is more accurate than bootstrapping
[Figure: ROC curve (recall vs. false positive rate) for MLBS and local PP]
100X faster
Avian simulated dataset (48 taxa, 1000 genes) [Sayyari and Mirarab, MBE, 2016]

High precision and recall at high support
201-taxon datasets (SimPhy)
FIG. 3. Evaluation of local PP on the A-200 dataset with ASTRAL species trees. See supplementary figures S2-S4, Supplementary Material online for other species trees. (A) Precision and recall of branches with local PP above a threshold ranging from 0.9 to 1.0 using estimated gene trees (solid) or true gene trees (dotted). (B) ROC curve (recall vs. FPR) for varying thresholds (figure trimmed at 0.4 FPR). Columns show different levels of ILS.
Precision is at least 99.8% for the 0.95 threshold, and the recall is between 71.5% and 84.7%, depending on ...

Summary
Both branch length and support can be computed quickly
  Both are a function of the observed amount of gene tree discordance
  Support is also a function of the number of genes
Local posterior probability outperforms bootstrapping
Requires strong assumptions (to be relaxed in future)
Branch length accuracy depends on the gene tree accuracy
All available at https://github.com/smirarab/astral

Tandy Warnow Erfan Sayyari

Results (A200)
Recall: the percentage of all true branches that have support ≥ s
False positive rate (FPR): the percentage of all false branches that have support ≥ s
[Figure: recall above threshold vs. false positive rate (0.0 to 0.4); panels: Low ILS, Med ILS, High ILS; # genes: 1000, 200, 50; true gene trees vs. estimated gene trees]
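The two metrics can be computed from a list of branches with known correctness. This is a minimal sketch; the function name and the `(support, is_true)` input layout are hypothetical, and it assumes a branch counts as "above threshold" when its support is at least s.

```python
def recall_and_fpr(branches, s):
    """branches: list of (support, is_true) pairs for estimated branches.
    Returns (recall, false positive rate) at support threshold s."""
    true_support = [sup for sup, is_true in branches if is_true]
    false_support = [sup for sup, is_true in branches if not is_true]
    recall = sum(sup >= s for sup in true_support) / len(true_support) if true_support else 0.0
    fpr = sum(sup >= s for sup in false_support) / len(false_support) if false_support else 0.0
    return recall, fpr

if __name__ == "__main__":
    example = [(0.99, True), (0.80, True), (0.96, False), (0.50, False)]
    for threshold in (0.5, 0.95, 1.0):
        print(threshold, recall_and_fpr(example, threshold))
```

Sweeping the threshold s from 0 to 1 and plotting recall against FPR yields an ROC curve like the ones shown in the figure.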

MLBS Procedure
First bootstrap each gene: the alignments (gene 1, gene 2, ..., gene k) are resampled into Replicate 1 through Replicate M, each containing gene 1, gene 2, ..., gene k
Gene tree estimation: run on every gene of every replicate
Count how many times Q appeared
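The resampling and counting steps can be sketched as follows. This is an illustration under assumptions, not the pipeline used in the talk: alignments are dictionaries from taxon name to sequence, Q is taken to be a branch represented as a set of taxa, and each replicate's tree is represented as a set of such branches produced by external tree estimation tools that are not shown. All function names are hypothetical.

```python
import random

def bootstrap_alignment(alignment, rng):
    """One site-bootstrap replicate: resample alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}

def mlbs_replicates(alignments, n_replicates, seed=0):
    """Build n_replicates replicate data sets, each with one bootstrapped copy of every gene."""
    rng = random.Random(seed)
    return [[bootstrap_alignment(a, rng) for a in alignments] for _ in range(n_replicates)]

def branch_support(branch, replicate_trees):
    """Fraction of replicate trees (each a set of branches) that contain the branch."""
    return sum(branch in tree for tree in replicate_trees) / len(replicate_trees)

if __name__ == "__main__":
    genes = [{"A": "ACGT", "B": "ACGA", "C": "TCGA"}, {"A": "GGCA", "B": "GGCT", "C": "GACT"}]
    replicates = mlbs_replicates(genes, n_replicates=3)
    print(replicates[0])
    # Trees for each replicate would come from external gene tree and species tree tools.
    trees = [{frozenset("AB")}, {frozenset("AB")}, {frozenset("AC")}]
    print(branch_support(frozenset("AB"), trees))
```

Every replicate requires re-estimating all gene trees, which is why the procedure is slow compared with computing support directly from quartet frequencies.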