Fast coalescent-based branch support using local quartet frequencies
Molecular Biology and Evolution (2016) 33(7): 1654-68
Erfan Sayyari, Siavash Mirarab
University of California, San Diego (ECE)
Phylogenomics
[Figure: sequence alignments for gene 1, gene 2, ..., gene 999, gene 1000, one row per species (e.g., Orangutan)]
- "gene" here refers to a portion of the genome (not a functional gene)
Gene tree discordance
[Figure: the species tree next to the gene trees of gene 1 through gene 1000]
Causes of gene tree discordance include:
- Incomplete Lineage Sorting (ILS)
- Duplication and loss
- Horizontal Gene Transfer (HGT)
Incomplete Lineage Sorting (ILS)
- A random process related to the coalescence of alleles across various populations
- Omnipresent: possible for every tree
- Likely for short branches or large population sizes
[Figure: tracing alleles through generations]
MSC and Identifiability
- A statistical model called the multi-species coalescent (MSC) can generate ILS.
- Any species tree defines a unique distribution on the set of all possible gene trees.
- In principle, the species tree can be identified from the gene tree distribution despite high discordance.
- Likelihood calculation is not feasible.
Unrooted quartets under MSC model
- For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010)
- $\theta_1 = 1 - \frac{2}{3} e^{-d}$; for example, d = 0.8 gives θ1 = 70%, θ2 = 15%, θ3 = 15%
- The most frequent gene tree = The most likely species tree
- Shorter branches: more discordance, and a harder species tree reconstruction problem
[Figure: species topology probability θ1 (from 1/3 up to 1) versus branch length d (0 to 3)]
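The relation between branch length and quartet probabilities on this slide can be checked numerically. The following is a minimal sketch, assuming d is the internal branch length in coalescent units and that the two alternative topologies share the remaining probability equally; the function name `quartet_probabilities` is only illustrative.

```python
import math

def quartet_probabilities(d):
    """Return (theta1, theta2, theta3) for an internal branch of length d.

    theta1 is the probability of the gene tree quartet that matches the
    species tree; the two alternatives split the rest equally.
    """
    if d < 0:
        raise ValueError("branch length must be non-negative")
    theta1 = 1.0 - (2.0 / 3.0) * math.exp(-d)
    other = (1.0 - theta1) / 2.0
    return theta1, other, other

# Short branches push all three probabilities towards 1/3.
for d in (0.0, 0.8, 3.0):
    t1, t2, t3 = quartet_probabilities(d)
    print(f"d={d:.1f}  theta1={t1:.3f}  theta2={t2:.3f}  theta3={t3:.3f}")
```

With d = 0.8 this reproduces the 70% / 15% / 15% split quoted on the slide, and d = 0 gives exactly 1/3 for each topology.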
Species tree inference for >4 species
- For >4 species, the species tree topology can be different from the most likely gene tree (called the anomaly zone) (Degnan, 2013)
- Quartet-based inference:
  1. Break gene trees into $\binom{n}{4}$ quartets of species
  2. Find the dominant tree for all quartets of taxa
  3. Combine quartet trees
  Used by some tools (e.g., BUCKy-p [Larget, et al., 2010])
- ASTRAL: weight all $3\binom{n}{4}$ quartet topologies by their frequency in gene trees and find the optimal species tree using dynamic programming
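To give a sense of scale for the quartet counts above, here is a small sketch that computes the number of species quartets and quartet topologies for n species, and picks the dominant topology of one quartet from its gene tree counts. The function names and the topology labels such as "ab|cd" are hypothetical, and this is not how ASTRAL itself is implemented.

```python
from math import comb

def quartet_counts(n):
    """Return (number of 4-species subsets, number of quartet topologies)."""
    subsets = comb(n, 4)
    return subsets, 3 * subsets  # each subset has 3 unrooted topologies

def dominant_topology(counts):
    """counts maps a topology label to the number of gene trees showing it."""
    return max(counts, key=counts.get)

print(quartet_counts(48))
print(dominant_topology({"ab|cd": 132, "ac|bd": 32, "ad|bc": 36}))
```

For 48 taxa there are already 194,580 quartets, which is why scoring all of them by brute force becomes impractical for large n.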
ASTRAL used by biologists
- ASTRAL-I: [Mirarab et al., 2014, Bioinformatics]
- ASTRAL-II: [Mirarab and Warnow, 2015, Bioinformatics]
- Plants: Wickett et al., 2014, PNAS
- Birds: Prum et al., 2015, Nature
- Xenoturbella: Cannon et al., 2016, Nature
- Xenoturbella: Rouse et al., 2016, Nature
- Flatworms: Laumer et al., 2015, eLife
- Shrews: Giarla et al., 2015, Syst. Bio.
- Frogs: Yuan et al., 2016, Syst. Bio.
- Tomatoes: Pease et al., 2016, PLoS Bio.
- Angiosperms: Huang et al., 2016, MBE
- Worms: Andrade et al., 2015, MBE
Going beyond the topology [Sayyari and Mirarab, Molecular Biology & Evolution, 2016]
[Photo: Erfan Sayyari]
Branch length (BL):
- ASTRAL did not estimate branch length
- We added branch length estimation in coalescent units (#generations/population size), only for internal branches
Branch support: how reliable is a branch?
- ASTRAL relied on bootstrapping
- We added a native Bayesian support
Branch Length [Sayyari and Mirarab, MBE, 2016]
- Simply a function of the level of discordance: $\theta_1 = 1 - \frac{2}{3} e^{-d}$ (d = 0.8 gives θ1 = 70%, θ2 = 15%, θ3 = 15%)
- A single quartet (n=4): reverse the discordance formula to get the ML estimate $\hat{d} = -\ln\left(\frac{3}{2}\,(1 - \hat{\theta}_1)\right)$
- Example: m1 = 132, m2 = 32, m3 = 36 give θ1 = 66%, θ2 = 16%, θ3 = 18%, and d = 0.67
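The inversion on this slide is easy to reproduce. Below is a minimal sketch, assuming m1 counts the gene trees that agree with the species tree quartet; returning 0 when the observed frequency is at or below 1/3 (and infinity when there is no discordance) is a choice made for this sketch, and the function name is hypothetical.

```python
import math

def ml_branch_length(m1, m2, m3):
    """Estimate the internal branch length (coalescent units) of a quartet.

    m1: gene trees matching the species tree topology;
    m2, m3: gene trees showing the two alternative topologies.
    """
    m = m1 + m2 + m3
    theta1 = m1 / m
    if theta1 <= 1.0 / 3.0:
        return 0.0        # assumption: clamp at zero length
    if theta1 >= 1.0:
        return math.inf   # no discordance observed
    return -math.log(1.5 * (1.0 - theta1))

print(round(ml_branch_length(132, 32, 36), 2))
```

With the counts from the slide (132, 32, 36), the estimate rounds to the 0.67 shown there.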
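Branch length for n>4
- Simply average all quartet frequencies around that branch
- Justified given some assumptions
- Can be done efficiently in $\Theta(n^2 m)$ for all branches for n species and m genes
[Figure: an internal branch of length d with leaves a, b, c, d, e, f, g, h around it, annotated with $\theta_1 = 1 - \frac{2}{3} e^{-d}$]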
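The averaging step can be sketched as follows. This is a naive illustration that loops over an explicit list of quartets, not the $\Theta(n^2 m)$ algorithm mentioned on the slide; the input layout (one `(m1, m2, m3)` tuple per quartet, with m1 counting the topology that agrees with the branch) and the function names are assumptions of this sketch.

```python
import math

def average_frequency(per_quartet_counts):
    """Average, over quartets, the frequency of the topology matching the branch."""
    freqs = [m1 / (m1 + m2 + m3) for m1, m2, m3 in per_quartet_counts]
    return sum(freqs) / len(freqs)

def branch_length_from_quartets(per_quartet_counts):
    """Turn the averaged frequency into a branch length in coalescent units."""
    theta = average_frequency(per_quartet_counts)
    if theta <= 1.0 / 3.0:
        return 0.0        # assumption: clamp at zero length
    if theta >= 1.0:
        return math.inf
    return -math.log(1.5 * (1.0 - theta))

# Three hypothetical quartets around one branch, 200 gene trees each.
counts = [(132, 32, 36), (140, 30, 30), (128, 40, 32)]
print(average_frequency(counts), branch_length_from_quartets(counts))
```

Averaging frequencies rather than raw counts keeps quartets with missing taxa in some genes from being over- or under-weighted, although how missing data should really be handled is outside this sketch.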
Branch length accuracy
[Figure: estimated branch length (log scale) versus true branch length (log scale); panels: True gene trees, Low gene tree error, Moderate gene tree error, Medium gene tree error, High gene tree error]
- With true gene trees, ASTRAL correctly estimates BL
- With error-prone estimated gene trees, ASTRAL underestimates BL
Branch support (common practice)
Multi-locus bootstrapping (MLBS)
- Slow: requires bootstrapping all genes (e.g., 100m ML trees for m genes)
- Inaccurate and hard to interpret [Mirarab et al., Sys bio, 2014; Bayzid et al., PLoS One, 2015]
[Figure: correct branches (percentage) [Mirarab et al., Sys bio, 2014]]
Branch support idea: n=4
- Recall quartet frequencies follow a multinomial distribution
- Example: m = 200 gene trees with counts m1 = 80, m2 = 63, m3 = 57 and unknown frequencies θ1, θ2, θ3
- P(topology seen in m1/m gene trees is the species tree) = P(θ1 > 1/3) = P(a 3-sided coin tossed m times is biased towards the side that shows up m1 times)
- Can be analytically solved
Posterior
- Prior: a Yule process, which turns out to be conjugate
- Fast to calculate
- Depends on the frequency of not just the first topology, but also the second and third topologies
Conjugate prior
- All three topologies are equally likely a priori: $\Pr(\theta_1 > \frac{1}{3}) = \Pr(\theta_2 > \frac{1}{3}) = \Pr(\theta_3 > \frac{1}{3}) = \frac{1}{3}$
- The species tree is generated through a birth-only (Yule) process with rate λ
- Turns out to be the conjugate prior
- Default: λ = 0.5 (uniformly distributed branch lengths)
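The posterior for a single quartet can be approximated numerically to see how it behaves. The sketch below is an illustration under stated assumptions, not the closed-form solution referred to in the talk: it assumes each topology is the species tree with prior probability 1/3, that the frequency θ of the species tree topology has prior density proportional to $(1-\theta)^{2\lambda-1}$ on (1/3, 1), and that the other two topologies share $(1-\theta)/2$ each. The integral is evaluated with a plain midpoint rule in log space; `log_h` and `local_pp` are hypothetical names.

```python
import math

def log_h(z, m, lam=0.5, steps=4000):
    """log of the integral over (1/3, 1) of t^z (1-t)^(m-z+2*lam-1) dt."""
    a, b = 1.0 / 3.0, 1.0
    width = (b - a) / steps
    logs = []
    for i in range(steps):
        t = a + (i + 0.5) * width  # midpoints never touch t = 1
        logs.append(z * math.log(t) + (m - z + 2 * lam - 1) * math.log(1.0 - t))
    top = max(logs)  # log-sum-exp to avoid underflow
    return top + math.log(sum(math.exp(v - top) for v in logs) * width)

def local_pp(z1, z2, z3, lam=0.5):
    """Posterior probability that the topology counted z1 times is the species tree."""
    m = z1 + z2 + z3
    # Unnormalised log weight of topology i: z_i * log 2 + log h(z_i)
    log_w = [z * math.log(2.0) + log_h(z, m, lam) for z in (z1, z2, z3)]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    return w[0] / sum(w)

print(local_pp(80, 63, 57))      # counts from the n=4 example, m = 200
print(local_pp(160, 126, 114))   # same frequencies, twice as many genes
```

The second call keeps the frequencies fixed and doubles the number of genes, which raises the support, and equal counts for all three topologies give exactly 1/3. Note that the result depends on all three counts, not only on the first one.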
Quartet support vs. posterior
[Figure: posterior versus quartet frequency (θ1)]
- Increased number of genes (m): increased support
- Decreased discordance: increased support
How about n>4?
- Locality Assumption: All four clusters around a branch are correct
- Treat branches independently
- k quartets around a branch? With cluster sizes $|C_1| = n_1$, $|C_2| = n_2$, $|C_3| = n_3$, $|C_4| = n_4$, there are $k = n_1 n_2 n_3 n_4$
- Independence assumption is too liberal ($m \times k$ tosses of the coin)
- Fully dependent assumption: all quartets give noisy estimates of a single hidden true frequency
- Simply average their frequencies
[Figure: an internal branch with four clusters C1, C2, C3, C4 containing the leaves a, b, c, d, e, f, g, h]
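The quartets around a branch can be enumerated by picking one leaf from each of the four clusters, which is where k = n1 n2 n3 n4 comes from. The sketch below uses a hypothetical assignment of the leaves a to h to four clusters (the actual clusters in the figure are not recoverable from the text) and shows the pooling implied by the fully dependent assumption: the k quartets are averaged into one set of pseudo-counts instead of being treated as m × k independent tosses. Averaging counts equals averaging frequencies only when every quartet is present in all m gene trees, which is assumed here.

```python
from itertools import product

def quartets_around_branch(c1, c2, c3, c4):
    """One leaf from each of the four clusters defines a quartet."""
    return list(product(c1, c2, c3, c4))

def pooled_counts(per_quartet_counts):
    """Average the three topology counts over all quartets around a branch."""
    k = len(per_quartet_counts)
    return tuple(sum(c[i] for c in per_quartet_counts) / k for i in range(3))

# Hypothetical clusters around one internal branch.
clusters = (["a", "b"], ["c"], ["d", "e"], ["f", "g", "h"])
quartets = quartets_around_branch(*clusters)
print(len(quartets))  # 2 * 1 * 2 * 3

# Two hypothetical quartets, each scored on m = 200 gene trees.
print(pooled_counts([(80, 63, 57), (90, 60, 50)]))
```

The pooled counts still sum to m, so they can be plugged into the single-quartet support calculation as if they came from one quartet.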
Simulation studies
- Our simulations violate our assumptions
  - Estimated gene trees instead of true gene trees
  - Estimated species trees: the locality assumption can be violated
- Measuring the support accuracy: the number of false positives and false negatives above various thresholds of support
[Figure: simulation pipeline: true (model) species tree, true gene trees, sequence data, estimated gene trees, estimated species tree; example taxa: Finch, Falcon, Owl, Eagle, Pigeon]
Local PP is more accurate than bootstrapping
[Figure: ROC curve, recall versus false positive rate, for MLBS and local PP]
- 100X faster
- Avian simulated dataset (48 taxa, 1000 genes) [Sayyari and Mirarab, MBE, 2016]
High precision and recall at high support
201-taxon datasets (simphy)
[Figure: panels A and B. FIG. 3. Evaluation of local PP on the A-200 dataset with ASTRAL species trees. See supplementary figures S2-S4, Supplementary Material online for other species trees. (A) Precision and recall of branches with local PP above a threshold ranging from 0.9 to 1.0 using estimated gene trees (solid) or true gene trees (dotted). (B) ROC curve (recall vs. FPR) for varying thresholds (figure trimmed at 0.4 FPR). Columns show different levels of ILS.]
- Precision is at least 99.8% for the 0.95 threshold, and the recall is between 71.5% and 84.7%, depending on ...
Summary
- Both branch length and support can be computed quickly
  - a function of the observed amount of gene tree discordance
  - support is also a function of the number of genes
- Local posterior probability outperforms bootstrapping
- Requires strong assumptions (to be relaxed in future)
- Branch length accuracy depends on the gene tree accuracy
- All available at https://github.com/smirarab/astral
Tandy Warnow Erfan Sayyari
Results (A200)
- recall: the percentage of all true branches that have support ≥ s
- false positive rate (FPR): the percentage of all false branches that have support ≥ s
[Figure: recall above threshold versus false positive rate (0.0 to 0.4); panels: Low ILS, Med ILS, High ILS; # genes: 1000, 200, 50; true gene tree versus estimated gene tree]
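The two measures defined above can be computed from a list of branches with their support values. This is a minimal sketch, assuming "support s" means support at or above the threshold s, and assuming that true branches missing from the estimated tree are entered with support 0; the input layout and the function name are hypothetical.

```python
def recall_and_fpr(branches, s):
    """branches: list of (support, is_true) pairs; s: support threshold.

    Returns (recall, false positive rate) at threshold s.
    """
    true_supports = [sup for sup, is_true in branches if is_true]
    false_supports = [sup for sup, is_true in branches if not is_true]
    recall = sum(sup >= s for sup in true_supports) / len(true_supports)
    fpr = sum(sup >= s for sup in false_supports) / len(false_supports)
    return recall, fpr

# Hypothetical branches: three true, two false.
example = [(1.0, True), (0.97, True), (0.6, True), (0.99, False), (0.4, False)]
for threshold in (0.5, 0.95):
    print(threshold, recall_and_fpr(example, threshold))
```

Sweeping the threshold from 0 to 1 and plotting recall against FPR gives the kind of ROC curve shown in the figures.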
MLBS Procedure
- First bootstrap each gene
- Gene tree estimation
- Count how many times Q appeared
[Figure: the alignments of gene 1, gene 2, ..., gene k are resampled into Replicate 1 through Replicate M; gene trees are estimated for every gene in every replicate; the occurrences of Q are counted]
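The first and last steps of this procedure can be sketched in a few lines. The code below assumes that bootstrapping a gene means resampling alignment columns with replacement, and that Q is some item (for example a branch) whose presence is recorded per replicate. Gene tree and species tree estimation are external tools and are not shown. The taxon names, the labels such as "ab|cd", and the function names are hypothetical; the sequences are the gene 1 rows from the Phylogenomics slide.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (same columns for all taxa)."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def support(q, replicate_results):
    """Fraction of replicates whose result contains q."""
    return sum(q in result for result in replicate_results) / len(replicate_results)

gene1 = {
    "taxon1": "ACTGCACACCG",
    "taxon2": "ACTGC-CCCCG",
    "taxon3": "AATGC-CCCCG",
    "taxon4": "-CTGCACACGG",
}
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
replicates = [bootstrap_alignment(gene1, rng) for _ in range(3)]
print(replicates[0])

# Counting how often Q appears across hypothetical replicate results.
print(support("ab|cd", [{"ab|cd"}, {"ac|bd"}, {"ab|cd"}, {"ab|cd"}]))
```

Each replicate then needs a full gene tree estimation for every gene, which is the cost that makes MLBS slow.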