Species Tree Inference using SVDquartets


Laura Kubatko and Dave Swofford
May 19, 2015

SVDquartets

In this tutorial, we'll discuss several different data types:

- Multi-locus data: aligned DNA sequence data for many genes
- SNP data: a large number of SNPs sampled throughout the genome
- Single-locus data: aligned DNA sequence data for a single gene

In the first two cases, we'll assume that incongruence between gene trees and the species tree arises solely from the coalescent process. In the third case, we assume that the locus under consideration is a single non-recombining unit.

Goal: Estimate the underlying phylogenetic tree (species tree or gene tree).

Definition: splits

Definition: A split of a set of taxa L is a bipartition of L into two non-overlapping subsets A and B, denoted A | B. A split A | B is valid for tree T if the subtrees containing the taxa in A and in B do not intersect.

[Figure: a four-taxon tree with leaves 1, 2, 3, 4]

- Valid: 12 | 34
- Not valid: 13 | 24, 14 | 23
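As a small illustration of the definition, the sketch below lists the three 2+2 splits of four taxa and checks each against a resolved quartet tree. The function names `quartet_splits` and `is_valid_split` are hypothetical, and the check assumes the tree is summarized by the split induced by its single internal edge.

```python
def quartet_splits(taxa):
    """Return the three 2+2 splits of four taxa as pairs of frozensets."""
    a, b, c, d = taxa
    return [
        (frozenset({a, b}), frozenset({c, d})),
        (frozenset({a, c}), frozenset({b, d})),
        (frozenset({a, d}), frozenset({b, c})),
    ]


def is_valid_split(split, tree_split):
    """A 2+2 split is valid for a resolved quartet tree exactly when it
    equals the split induced by the tree's internal edge (assumption:
    the tree is given by that split)."""
    return set(split) == set(tree_split)


# The tree from the slide: taxa 1 and 2 on one side, 3 and 4 on the other.
tree_split = (frozenset({1, 2}), frozenset({3, 4}))
validity = [is_valid_split(s, tree_split) for s in quartet_splits([1, 2, 3, 4])]
print(validity)
```

For the tree on the slide, only the first split (12 | 34) is reported as valid.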

Definition: flattenings

$$p_{ijkl} = P(X_1 = i, X_2 = j, X_3 = k, X_4 = l)$$

$$
\mathrm{Flat}_{12|34}(P) =
\begin{array}{c|cccccc}
 & [AA] & [AC] & [AG] & [AT] & [CA] & \cdots \\ \hline
[AA] & p_{AAAA} & p_{AAAC} & p_{AAAG} & p_{AAAT} & p_{AACA} & \cdots \\
[AC] & p_{ACAA} & p_{ACAC} & p_{ACAG} & p_{ACAT} & p_{ACCA} & \cdots \\
[AG] & p_{AGAA} & p_{AGAC} & p_{AGAG} & p_{AGAT} & p_{AGCA} & \cdots \\
[AT] & p_{ATAA} & p_{ATAC} & p_{ATAG} & p_{ATAT} & p_{ATCA} & \cdots \\
[CA] & p_{CAAA} & p_{CAAC} & p_{CAAG} & p_{CAAT} & p_{CACA} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots
\end{array}
$$

Theorem (Chifman and Kubatko 2015): Under the coalescent model and the GTR+I+Γ model and its submodels, we have the following:

- If A | B is a valid split for a tree T, then $\mathrm{rank}(\mathrm{Flat}_{A|B}(P)) \le 10$.
- If C | D is not a valid split for a tree T, then $\mathrm{rank}(\mathrm{Flat}_{C|D}(P)) > 10$.

The species tree is completely determined by knowledge of valid splits on all quartets.
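To make the flattening concrete, here is a minimal sketch that estimates the 16 x 16 matrix from observed site patterns for four taxa. The function name `flattening`, the input format (one 4-character string per site), and the choice to skip sites with gaps or ambiguity codes are all assumptions made for illustration; they are not taken from PAUP*.

```python
from collections import Counter
from itertools import product

STATES = "ACGT"
# Row and column labels in the order AA, AC, AG, AT, CA, ..., TT.
PAIRS = ["".join(p) for p in product(STATES, repeat=2)]
INDEX = {pair: i for i, pair in enumerate(PAIRS)}


def flattening(columns, split=((0, 1), (2, 3))):
    """Estimate the flattening for `split` from site patterns.

    columns: iterable of 4-character strings, one per site, giving the
             states of taxa 0..3 at that site.
    split:   the two taxon pairs indexing rows and columns.
    Returns a 16 x 16 list of lists of relative frequencies.
    """
    row_taxa, col_taxa = split
    counts = Counter()
    total = 0
    for col in columns:
        if any(ch not in STATES for ch in col):
            continue  # simplifying assumption: ignore gaps and ambiguities
        r = INDEX[col[row_taxa[0]] + col[row_taxa[1]]]
        c = INDEX[col[col_taxa[0]] + col[col_taxa[1]]]
        counts[(r, c)] += 1
        total += 1
    flat = [[0.0] * 16 for _ in range(16)]
    for (r, c), n in counts.items():
        flat[r][c] = n / total
    return flat


sites = ["AAAA", "AAAC", "ACAC", "AAAA"]  # toy data, four sites
F = flattening(sites)
print(F[0][0], F[0][1], F[1][1])
```

Passing `split=((0, 2), (1, 3))` or `split=((0, 3), (1, 2))` gives the flattenings for the other two splits of the same quartet.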

Extensions of the Main Result

Arbitrary number of states, κ, under the coalescent model:

- If A | B is a valid split for a tree T, then $\mathrm{rank}(\mathrm{Flat}_{A|B}(P)) \le \binom{\kappa+1}{2}$.
- If C | D is not a valid split for a tree T, then $\mathrm{rank}(\mathrm{Flat}_{C|D}(P)) > \binom{\kappa+1}{2}$.
- The species tree is completely determined by knowledge of valid splits on all quartets.

Single underlying gene tree (no coalescent assumption):

- If A | B is a valid split for a tree T, then $\mathrm{rank}(\mathrm{Flat}_{A|B}(P)) \le 4$.
- If C | D is not a valid split for a tree T, then $\mathrm{rank}(\mathrm{Flat}_{C|D}(P)) = 16$.
- The species tree is completely determined by knowledge of valid splits on all quartets.
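As a quick check of the general bound, the worked values below evaluate the binomial coefficient for two alphabet sizes. The κ = 4 case recovers the rank bound of 10 for DNA; the κ = 2 case is included only as an illustration, for which the flattening is a 4 x 4 matrix.

```latex
\binom{\kappa+1}{2} = \frac{(\kappa+1)\,\kappa}{2}
\qquad\Longrightarrow\qquad
\begin{aligned}
\kappa = 4:&\quad \binom{5}{2} = 10 \quad (\text{flattening is } 16 \times 16),\\
\kappa = 2:&\quad \binom{3}{2} = 3 \quad (\text{flattening is } 4 \times 4).
\end{aligned}
```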

Species tree estimation using SVDquartets

Main idea: use the observed site pattern distribution to provide information about which of the three possible splits for a set of four taxa is the true split.

[Figure: the three possible quartet trees on taxa A, B, C, D, corresponding to the splits AB | CD, AC | BD, and AD | BC]

The program SVDquartets computes a score for each split in a given quartet of taxa and chooses the split with the best (lowest) score.

We use the following score:

$$\mathrm{SVDScore} = \sqrt{\sum_{i=11}^{16} \hat{\sigma}_i^2}$$

where $\hat{\sigma}_i$ is the $i$th singular value computed from the observed flattening matrix.
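The scoring step can be sketched with NumPy as follows. This is an illustrative sketch, not the PAUP* implementation: `svd_score` and `best_split` are hypothetical names, and the score is taken to be the square root of the sum of the squared singular values 11 through 16, which is the Frobenius distance to the nearest rank-10 matrix (dropping the square root would not change which split scores lowest). The two input matrices are random stand-ins, not real flattenings.

```python
import numpy as np


def svd_score(flat, rank=10):
    """Square root of the sum of squared singular values beyond `rank`."""
    s = np.linalg.svd(np.asarray(flat, dtype=float), compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2)))


def best_split(flattenings):
    """flattenings: dict mapping a split label to its 16 x 16 matrix.
    Returns the label with the lowest score and all scores."""
    scores = {label: svd_score(m) for label, m in flattenings.items()}
    return min(scores, key=scores.get), scores


# Toy stand-ins: one matrix of rank at most 10 by construction,
# one generic random matrix (full rank with probability 1).
rng = np.random.default_rng(0)
low = rng.random((16, 10)) @ rng.random((10, 16))
full = rng.random((16, 16))
low /= low.sum()
full /= full.sum()

choice, scores = best_split({"AB|CD": low, "AC|BD": full})
print(choice, scores)
```

With real data the flattening for the valid split is only approximately of rank 10, so its score is small rather than zero; the split with the smallest score is the one chosen.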

Species tree estimation using SVDquartets

Algorithm:

1. Generate all quartets (small problems) or sample quartets (large problems).
2. Estimate the correct quartet relationship for each sampled quartet.
3. Use a quartet assembly method to build the tree.

PAUP* uses the method of Reaz-Bayzid-Rahman (2014), called QFM, to build the tree.
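Step 1 of the algorithm can be sketched as below. The function name `choose_quartets`, the `max_quartets` threshold, and the rejection-sampling strategy are assumptions for illustration; the slides do not say how PAUP* draws its sample. Steps 2 and 3 (scoring each quartet and assembling the tree with QFM) are not implemented here.

```python
import random
from itertools import combinations
from math import comb


def choose_quartets(taxa, max_quartets=None, seed=None):
    """Return all quartets if there are few enough, else a random sample.

    taxa:         iterable of taxon labels
    max_quartets: upper limit on the number of quartets (None = all)
    """
    taxa = sorted(taxa)
    total = comb(len(taxa), 4)
    if max_quartets is None or max_quartets >= total:
        return list(combinations(taxa, 4))  # small problem: enumerate
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < max_quartets:       # large problem: sample
        chosen.add(tuple(sorted(rng.sample(taxa, 4))))
    return sorted(chosen)


all_quartets = choose_quartets("ABCDE")
sampled = choose_quartets(range(20), max_quartets=100, seed=1)
print(len(all_quartets), len(sampled))
```

Sampling without replacement through a set keeps each quartet distinct, which is a design choice of this sketch rather than a documented property of the program.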

Species tree estimation using SVDquartets

Variability in the estimated tree is assessed using nonparametric bootstrapping.

Multiple lineages are handled as follows:

1. Sample four species.
2. Select one lineage at random from each species.
3. Estimate the quartet relationships among the four sampled lineages.
4. Restore the species labels (but lineage quartets are saved, too).
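The lineage-sampling steps and the bootstrap resampling can be sketched as below. All names (`sample_lineage_quartet`, `bootstrap_sites`, the species and lineage labels) are hypothetical, and resampling alignment columns with replacement is the usual form of the nonparametric bootstrap, assumed here rather than taken from the slides.

```python
import random


def sample_lineage_quartet(lineages_by_species, rng):
    """Steps 1 and 2: pick four species, then one lineage per species."""
    species = rng.sample(sorted(lineages_by_species), 4)
    lineages = [rng.choice(lineages_by_species[sp]) for sp in species]
    return species, lineages


def bootstrap_sites(columns, rng):
    """One bootstrap replicate: resample sites with replacement."""
    return rng.choices(columns, k=len(columns))


# Hypothetical data set: five species with one to three lineages each.
lineages_by_species = {
    "sp1": ["sp1_a", "sp1_b"],
    "sp2": ["sp2_a"],
    "sp3": ["sp3_a", "sp3_b", "sp3_c"],
    "sp4": ["sp4_a", "sp4_b"],
    "sp5": ["sp5_a"],
}
rng = random.Random(42)
species, lineages = sample_lineage_quartet(lineages_by_species, rng)

# Step 4: this mapping restores species labels once the lineage
# quartet has been resolved (step 3, not shown).
lineage_to_species = dict(zip(lineages, species))
replicate = bootstrap_sites(["AAAA", "ACGT", "AAAC"], rng)
print(lineage_to_species, replicate)
```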

Multi-locus vs. SNP data

The theory is developed for the SNP setting. Why do we think this might be OK for multilocus data?

Consider the case of three possible gene trees with the probabilities below under the coalescent model:

- Gene tree 1: p_1 = 0.4
- Gene tree 2: p_2 = 0.3
- Gene tree 3: p_3 = 0.3

Now suppose we observe multilocus data for 1,000 genes as follows:

- Gene tree 1: 380 genes
- Gene tree 2: 300 genes
- Gene tree 3: 320 genes

Then, if the genes are equal in length, the proportion of sites coming from each tree is approximately what is predicted under the SNP model.
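The arithmetic behind this example is short enough to check directly. The sketch below uses only the numbers on the slide and assumes, as the slide does, that all genes have equal length, so the share of sites from each gene tree equals the share of genes.

```python
# Gene tree probabilities under the coalescent model (from the slide).
expected = {"Gene tree 1": 0.4, "Gene tree 2": 0.3, "Gene tree 3": 0.3}

# Observed number of genes per gene tree among 1,000 genes (from the slide).
observed_genes = {"Gene tree 1": 380, "Gene tree 2": 300, "Gene tree 3": 320}

total = sum(observed_genes.values())
observed = {tree: n / total for tree, n in observed_genes.items()}

# Largest gap between observed site proportion and model probability.
max_gap = max(abs(observed[t] - expected[t]) for t in expected)

for tree in expected:
    print(tree, expected[tree], observed[tree])
print("largest difference:", round(max_gap, 3))
```

The observed proportions are 0.38, 0.30, and 0.32, each within 0.02 of the model probabilities.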

Species tree estimation using SVDquartets

Advantages:

- Fast! How fast?
  - Rattlesnakes: < 1 hour (~8500 bp, 52 tips)
  - Soybeans: < 1 day (6 million SNPs, 62 tips)
- Scales well:
  - Number of quartets needed increases as number of species increases (but can be done in parallel)
  - Linear in number of sites (but this is just counting)
- Potential for application to other data types
- Natural way to handle missing data

Disadvantages:

- Only the (unrooted) topology is estimated; no parameters
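To give a sense of how the number of quartets grows with the number of tips, the sketch below counts all possible quartets for the two data set sizes mentioned above. It only illustrates the upper limit; the slides do not say whether every quartet was analysed for these data sets.

```python
from math import comb


def n_quartets(n_tips):
    """Number of distinct sets of four tips among n_tips."""
    return comb(n_tips, 4)


rattlesnakes = n_quartets(52)
soybeans = n_quartets(62)
print(rattlesnakes, soybeans)
```

The count grows on the order of the fourth power of the number of tips, which is why sampling quartets and scoring them in parallel matters for large problems.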

SVDquartets

Described in the papers:

- Chifman, J. and L. Kubatko. 2014. Quartet inference from SNP data under the coalescent model. Bioinformatics 30(23): 3317-3324.
- Chifman, J. and L. Kubatko. 2015. Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. Journal of Theoretical Biology 374: 35-47.

Implemented in PAUP* (thanks, Dave!).

Now on to the tutorial!