TreeShrink: efficient detection of outlier tree leaves


Uyen Mai, Siavash Mirarab
University of California at San Diego

Long branches are suspect
Idea: find errors in the data by building a phylogeny and detecting long branches.
Figure: two example gene trees. One is a gene tree from the 1KP plant dataset (Wickett et al., PNAS, 2014; scale bar 0.6); the other is a mammalian gene tree (scale bar 0.05). Taxa called out on the slide: Lesser Hedgehog Tenrec, Sphagnum lesc.

For unrooted trees?
Diameter: the longest path between any two species.
Figure: a gene tree from the 1KP plant dataset (Wickett et al., PNAS, 2014); scale bar 0.2.
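As a minimal sketch of the diameter definition above (not code from the TreeShrink tool), the longest leaf-to-leaf path of a tree with positive branch lengths can be found with two farthest-leaf passes. The edge list, the node names, and the function names `farthest_leaf` and `tree_diameter` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def farthest_leaf(adj, start):
    """Return (leaf, distance) for the leaf farthest from `start`."""
    best = (start, 0.0)
    stack, seen = [(start, 0.0)], {start}
    while stack:
        node, dist = stack.pop()
        if len(adj[node]) == 1 and dist > best[1]:  # degree 1 means leaf
            best = (node, dist)
        for nbr, w in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                stack.append((nbr, dist + w))
    return best

def tree_diameter(edges):
    """Return (leaf_a, leaf_b, length) of the longest path in an unrooted tree."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    start = next(iter(adj))
    a, _ = farthest_leaf(adj, start)  # first pass: one end of a longest path
    b, d = farthest_leaf(adj, a)      # second pass: the other end and the length
    return a, b, d

# Hypothetical toy tree: leaves a, b, c, d; internal nodes u, v.
edges = [("a", "u", 1.0), ("b", "u", 2.0), ("u", "v", 0.5),
         ("c", "v", 3.0), ("d", "v", 0.2)]
x, y, d = tree_diameter(edges)
print(x, y, d)
```

In this toy tree the longest path joins leaves b and c, so a single long branch (here the one leading to c) dominates the diameter.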

An optimization problem
The k-shrink problem:
Given: a tree with n leaves and branch lengths, and some 1 ≤ k ≤ n
Find: for every 1 ≤ i ≤ k, the set of i leaves that should be removed to reduce the tree diameter maximally
We have a polynomial time solution.
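To make the problem statement concrete, here is a brute-force reference sketch for very small trees. It is exponential and is not the polynomial-time algorithm mentioned on the slide. It relies on the fact that removing leaves does not change the path length between any two remaining leaves, so a table of pairwise leaf distances is enough. The distance table and the function names are hypothetical.

```python
from itertools import combinations

def diameter_of(leaves, dist):
    """Largest pairwise distance among `leaves` (0.0 if fewer than two remain)."""
    return max((dist[frozenset(p)] for p in combinations(leaves, 2)), default=0.0)

def k_shrink_bruteforce(dist, leaves, k):
    """For each i in 1..k return (removal set, resulting diameter), by enumeration."""
    result = {}
    for i in range(1, k + 1):
        candidates = []
        for removed in combinations(sorted(leaves), i):
            kept = [x for x in leaves if x not in removed]
            candidates.append((diameter_of(kept, dist), removed))
        best_diameter, best_removed = min(candidates)
        result[i] = (set(best_removed), best_diameter)
    return result

# Hypothetical pairwise leaf-to-leaf path lengths for a 4-leaf tree.
dist = {
    frozenset("ab"): 3.0, frozenset("ac"): 4.5, frozenset("ad"): 1.7,
    frozenset("bc"): 5.5, frozenset("bd"): 2.7, frozenset("cd"): 3.2,
}
solution = k_shrink_bruteforce(dist, ["a", "b", "c", "d"], k=2)
print(solution)
```

Note that the optimal set for i = 2 is not required to contain the optimal set for i = 1 in general, which is why the problem asks for a separate answer for every i.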

Running Time
k-shrink can be solved in O(k^2 h + n), where h is the tree height.
By default, we set k = O(n^0.5).
Fast enough: processes a tree of n = 203,452 leaves with k = 2255 in 28 mins.
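The slide does not state the constant hidden in the default choice of k. As a quick arithmetic check, the reported run (n = 203,452, k = 2255) is consistent with k being close to 5 times the square root of n. The factor 5 is inferred here from those two numbers only and should not be read as a documented default.

```python
import math

n = 203_452          # leaves in the reported run
k_reported = 2255    # k used in the reported run

# Ratio of the reported k to sqrt(n); a value near 5 suggests k ~ 5 * sqrt(n).
ratio = k_reported / math.sqrt(n)
print(round(ratio, 3))
```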

How many do we remove?
How do we decide how many things to remove?
We have the optimal removals for 1 ≤ i ≤ k. What i should we use?
Find an i where the corresponding reduction in the diameter is unexpectedly high; this needs statistical tests to find outliers.

What to remove?
Let ν_i = (the diameter after i-1 removals) / (the diameter after i removals).
Figure: the ratio ν_i (y-axis) plotted against the removal index i (x-axis).
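A short sketch of the ratio ν_i defined above, assuming the diameters after 0, 1, 2, ... optimal removals are already known. The diameter values are made up for illustration and the function name `removal_ratios` is hypothetical.

```python
def removal_ratios(diameters):
    """diameters[i] is the diameter after i removals (diameters[0] is the original tree).

    Returns [nu_1, nu_2, ...] where nu_i = diameters[i-1] / diameters[i].
    """
    return [diameters[i - 1] / diameters[i] for i in range(1, len(diameters))]

# Hypothetical diameters after 0, 1, 2 and 3 removals.
diameters = [12.0, 8.0, 2.0, 1.0]
nu = removal_ratios(diameters)

# The removal index with the largest ratio marks the sharpest drop in diameter.
sharpest = max(range(1, len(nu) + 1), key=lambda i: nu[i - 1])
print(nu, sharpest)
```

A ratio close to 1 means the extra removal barely shrinks the tree, while a large ratio means that removal eliminated an unusually long path.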

Signature of each species
Signature of x = max log(ν_i) among all i that remove x

Optimal removing sets (i, ν_i, removed species):
  i=1  1.12  Anomodon attenuatus
  i=2  1.01  Equisetum diffusum, Anomodon attenuatus
  i=3  3.43  Equisetum diffusum, Smilax bona-nox, Klebsormidium subtile
  i=4  1.80  Equisetum diffusum, Smilax bona-nox, Klebsormidium subtile, Anomodon attenuatus
  i=5  1.08  Equisetum diffusum, Smilax bona-nox, Klebsormidium subtile, Nephroselmis pyriformis, Anomodon attenuatus
  i=6  1.12  Equisetum diffusum, Smilax bona-nox, Klebsormidium subtile, Nephroselmis pyriformis, Pyramimonas parkeae, Anomodon attenuatus
  ...

Species signatures:
  Smilax bona-nox          log(3.43)
  Pyramimonas parkeae      log(1.12)
  Equisetum diffusum       log(3.43)
  Anomodon attenuatus      log(1.80)
  Klebsormidium subtile    log(3.43)
  Nephroselmis pyriformis  log(1.12)
  ...

Figure: the ratio ν (y-axis) plotted against the removal index (x-axis), with the peaks 3.43 and 1.80 labelled.
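The signature rule can be checked directly against the six removal sets listed on this slide. The sketch below copies those ratios and sets and takes, for each species, the largest log(ν_i) over the sets that contain it. The function name `signatures` is hypothetical, and the slide's list continues beyond i = 6, so this covers only the rows shown.

```python
import math

# (nu_i, optimal removing set) for i = 1..6, copied from the slide.
removals = [
    (1.12, {"Anomodon attenuatus"}),
    (1.01, {"Equisetum diffusum", "Anomodon attenuatus"}),
    (3.43, {"Equisetum diffusum", "Smilax bona-nox", "Klebsormidium subtile"}),
    (1.80, {"Equisetum diffusum", "Smilax bona-nox", "Klebsormidium subtile",
            "Anomodon attenuatus"}),
    (1.08, {"Equisetum diffusum", "Smilax bona-nox", "Klebsormidium subtile",
            "Nephroselmis pyriformis", "Anomodon attenuatus"}),
    (1.12, {"Equisetum diffusum", "Smilax bona-nox", "Klebsormidium subtile",
            "Nephroselmis pyriformis", "Pyramimonas parkeae", "Anomodon attenuatus"}),
]

def signatures(removals):
    """Map each species to max log(nu_i) over all i whose removal set contains it."""
    sig = {}
    for nu, removed in removals:
        for species in removed:
            sig[species] = max(sig.get(species, float("-inf")), math.log(nu))
    return sig

sig = signatures(removals)
for species, value in sorted(sig.items()):
    print(f"{species}: log({math.exp(value):.2f})")
```

Anomodon attenuatus is a useful case: it is removed first (ratio 1.12) but its signature comes from i = 4 (ratio 1.80), the largest ratio among the sets that include it.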

Three statistical tests of TreeShrink
  The per-gene test: requires only a single tree
  The all-gene test: requires a collection of gene trees
  The per-species test: requires a collection of gene trees

Statistical tests
The per-gene test (input: a single tree)
  Fit a log-normal distribution to the signatures
  Remove taxa with outlier signatures
  Outlier: CDF above 1 - α for a given α (false positive tolerance)
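A minimal sketch of the per-gene test as described on the slide, using only the Python standard library. It assumes the log-normal is fitted by taking the mean and sample standard deviation of the log-signatures, and it skips signatures that are not strictly positive because their logarithm is undefined. Both choices are assumptions made here, not details given on the slide. The function name and the toy data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def per_gene_outliers(signatures, alpha=0.05):
    """Return the taxa whose signature lies above the 1 - alpha quantile of a fitted log-normal.

    signatures: {taxon: signature} for one gene tree.
    Needs at least two distinct positive signatures.
    """
    positive = {x: s for x, s in signatures.items() if s > 0}
    logs = [math.log(s) for s in positive.values()]
    # If S is log-normal then log(S) is normal, so fit a normal to the logs.
    fitted = NormalDist(mean(logs), stdev(logs))
    return {x for x, s in positive.items() if fitted.cdf(math.log(s)) > 1 - alpha}

# Hypothetical signatures: nine ordinary taxa and one on a very long branch.
toy = {"t1": 0.10, "t2": 0.11, "t3": 0.09, "t4": 0.10, "t5": 0.12,
       "t6": 0.08, "t7": 0.11, "t8": 0.10, "t9": 0.09, "long": 1.23}
flagged = per_gene_outliers(toy, alpha=0.05)
print(flagged)
```

Lowering alpha makes the test more conservative, so fewer taxa are removed.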

The all-gene test (input: a collection of gene trees)
  Combine all signature values across all genes
  Compute a kernel density over the empirical distribution
  Remove the taxa of the outlier signatures
  Outlier: CDF above 1 - α for a given α
The per-species test (input: a collection of gene trees)
  Compute a kernel density function for each species over its signatures across genes
  Remove the taxa of the outlier signatures
  Outlier: CDF above 1 - α for a given α
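The all-gene and per-species tests differ only in which signatures are pooled before the kernel density is built. The sketch below assumes a Gaussian kernel with a normal-reference rule-of-thumb bandwidth (1.06 · σ · n^(-1/5)); the slides name neither the kernel nor the bandwidth, so both are assumptions. Function names and data layout are hypothetical.

```python
from statistics import NormalDist, stdev

def kde_cdf(samples, x):
    """CDF at x of a Gaussian kernel density estimate built from `samples`."""
    # Rule-of-thumb bandwidth; an assumption, not taken from the slides.
    bandwidth = 1.06 * stdev(samples) * len(samples) ** (-1 / 5)
    return sum(NormalDist(s, bandwidth).cdf(x) for s in samples) / len(samples)

def all_gene_outliers(sig_by_gene, alpha=0.05):
    """Pool signatures over all genes and species; return flagged (gene, species) pairs."""
    pooled = [s for sigs in sig_by_gene.values() for s in sigs.values()]
    return {(g, x) for g, sigs in sig_by_gene.items() for x, s in sigs.items()
            if kde_cdf(pooled, s) > 1 - alpha}

def per_species_outliers(sig_by_gene, alpha=0.05):
    """Build one density per species from its signatures across genes."""
    by_species = {}
    for g, sigs in sig_by_gene.items():
        for x, s in sigs.items():
            by_species.setdefault(x, []).append((g, s))
    flagged = set()
    for x, pairs in by_species.items():
        values = [s for _, s in pairs]
        if len(values) < 2 or stdev(values) == 0:
            continue  # no usable density for this species
        flagged |= {(g, x) for g, s in pairs if kde_cdf(values, s) > 1 - alpha}
    return flagged

# Hypothetical data: 12 genes, 2 species, one abnormally large signature.
sig_by_gene = {f"g{j}": {"A": 0.10 + 0.001 * j, "B": 0.10 + 0.002 * j} for j in range(12)}
sig_by_gene["g0"]["A"] = 2.0
print(all_gene_outliers(sig_by_gene), per_species_outliers(sig_by_gene))
```

Because each sample contributes its own kernel, a lone extreme value can only exceed the 1 - α threshold when enough other signatures are available, which is one reason these two tests need a collection of gene trees.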

Methods
The three tests of TreeShrink
Alternative filtering methods:
  RootedFiltering: root the gene trees and remove taxa that are X standard deviations farther from the root than average
  RogueNarok: rogue taxon removal; finds unstable nodes based on bootstrap replicates
  RandomFiltering: randomly choose what to remove

Measurements
Effects of filtering on taxon occupancy
  Proportion of data retained for each species
Effects of filtering on gene tree discordance
  Reduction in pairwise MS distance of gene trees for a controlled amount of filtering

Datasets
6 phylogenomic datasets
Gene number: 95-1478
Species number: 26-164

  Dataset           Genes   Species
  Plants              852       104
  Insects            1478       144
  Mammals             424        37
  Frogs                95       164
  Metazoa-Cannon      213        78
  Metazoa-Rouse       393        26

Results: outgroup removal
Figure: percent of the data removed at α = 0.05, shown for all species and for outgroups, under the all-gene, per-gene, and per-species tests. Panels: Cannon, Frogs, Insects, Mammals, Plants, Rouse.

Impact of filtering on discordance
Figure (plant dataset): Delta MS (y-axis) against the proportion of taxa retained (x-axis) for Random_pruning, TreeShrink_all_gene, TreeShrink_per_gene, and TreeShrink_per_species.

TreeShrink versus alternative methods (discordance)
Figure: Delta MS (y-axis) against the proportion of taxa retained (x-axis) for Random_pruning, RogueNarok, Rooted_pruning, and TreeShrink. Panels: (a) Insects, (b) Plants.

Results: TreeShrink versus Alternative Methods
Figure: Delta MS (y-axis) against the proportion of taxa retained (x-axis) for Random_pruning, RogueNarok, Rooted_pruning, and TreeShrink. Panels: (e) Mammals, (f) Frogs.

TreeShrink versus alternative methods (occupancy)
Figure: occupancy (number of genes) for each species of the plant dataset, compared across Original, RogueNarok, Rooted_pruning, and TreeShrink.

Solution space (k = 3)
Each state lists the set of removed leaves followed by the two leaves on the current diameter. Each move removes one of those two leaves.

  i = 0:  {} ab
  i = 1:  {b} ac;  {a} db
  i = 2:  {b,c} ae;  {a,b} dc;  {a,d} fb
  i = 3:  {b,c,e} ag;  {b,c,a} de;  {a,b,d} fc;  {a,d,f} hb

Moves:
  {} ab: remove b gives {b} ac; remove a gives {a} db
  {b} ac: remove c gives {b,c} ae; remove a gives {a,b} dc
  {a} db: remove b gives {a,b} dc; remove d gives {a,d} fb
  {b,c} ae: remove e gives {b,c,e} ag; remove a gives {b,c,a} de
  {a,b} dc: remove c gives {b,c,a} de; remove d gives {a,b,d} fc
  {a,d} fb: remove b gives {a,b,d} fc; remove f gives {a,d,f} hb

Figure: example tree with leaves a through h; legend: on-diameter, removed.
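The search pattern on this slide can be sketched as a level-by-level exploration: from every state, remove one of the two leaves on the current diameter, and merge states that reach the same removal set. Only those two removals need exploring, because a removal set that keeps both endpoints of the current longest path leaves the diameter unchanged. This is a simplified illustration working on a table of pairwise leaf distances, not the authors' O(k^2 h + n) implementation; the function names and the distance table are hypothetical, and k must leave at least two leaves in the tree.

```python
from itertools import combinations

def diameter_pair(leaves, dist):
    """Return (length, x, y) for the farthest pair among `leaves`."""
    return max((dist[frozenset((x, y))], x, y)
               for x, y in combinations(sorted(leaves), 2))

def shrink_search(dist, leaves, k):
    """Return ({i: (removal set, diameter)}, [number of states per level])."""
    all_leaves = set(leaves)
    level = {frozenset()}          # i = 0: nothing removed
    best, sizes = {}, []
    for i in range(1, k + 1):
        nxt = set()
        for removed in level:
            _, x, y = diameter_pair(all_leaves - removed, dist)
            nxt.add(removed | {x})   # drop one endpoint of the diameter
            nxt.add(removed | {y})   # or the other one
        level = nxt
        sizes.append(len(level))
        scored = [(diameter_pair(all_leaves - r, dist)[0], sorted(r)) for r in level]
        d, r = min(scored)
        best[i] = (set(r), d)
    return best, sizes

# Hypothetical pairwise leaf-to-leaf path lengths for a 4-leaf tree.
dist = {
    frozenset("ab"): 3.0, frozenset("ac"): 4.5, frozenset("ad"): 1.7,
    frozenset("bc"): 5.5, frozenset("bd"): 2.7, frozenset("cd"): 3.2,
}
best, sizes = shrink_search(dist, "abcd", k=2)
print(best, sizes)
```

On this toy input the levels hold 2 and then 3 distinct states, the same growth pattern as the slide's diagram (2, 3, 4 states for i = 1, 2, 3).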

The TreeShrink tool is publicly available: https://github.com/uym2/treeshrink
Uyen Mai

A single HIV tree
648 HIV-1 partial pol sequences: 639 subtype B, 7 non-subtype B, 2 unassigned
Figure legend: TreeShrink; RogueNarok; TreeShrink and RogueNarok; Unassigned; Subtype.

Results: TreeShrink versus Alternative Methods
Figure: Delta MS (y-axis) against the proportion of taxa retained (x-axis) for Random_pruning, RogueNarok, Rooted_pruning, and TreeShrink. Panels: (c) Metazoa - Cannon, (d) Metazoa - Rouse.

Results: The 3 Tests of TreeShrink
Figure: Delta MS (y-axis) against the proportion of taxa retained (x-axis). Panels: (c) Metazoa - Cannon, (d) Metazoa - Rouse, (e) Mammals, (f) Frogs.

Can be done in other ways too (e.g., O(nk + k^2 log k)), but harder to implement.

Can be just outgroups
Figure: b) Hard case: a gene tree in the mammalian dataset (scale bar 0.05).