Species Tree Inference using SVDquartets

Laura Kubatko and Dave Swofford
May 19, 2015

SVDquartets

In this tutorial, we'll discuss several different data types:

- Multi-locus data: aligned DNA sequence data for many genes
- SNP data: a large number of SNPs sampled throughout the genome
- Single-locus data: aligned DNA sequence data for a single gene

In the first two cases, we'll assume that incongruence between gene trees and the species tree arises solely from the coalescent process. In the third case, we assume that the locus under consideration is a single non-recombining unit.

Goal: Estimate the underlying phylogenetic tree (species tree or gene tree).

Definition: splits

A split of a set of taxa L is a bipartition of L into two non-overlapping subsets A and B, denoted A|B. A split A|B is valid for tree T if the subtrees containing the taxa in A and in B do not intersect.

[Figure: four-taxon tree with taxa 1 and 2 on one side of the internal branch and taxa 3 and 4 on the other]

Valid: 12|34
Not valid: 13|24, 14|23
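The definition can be checked mechanically. The sketch below is only an illustration, not part of SVDquartets or PAUP*: it assumes the tree is given as an adjacency dictionary, and the internal node names `u` and `v` are hypothetical. It builds the minimal subtree spanning each side of the split and tests whether the two subtrees share a node.

```python
from collections import deque

def path(adj, start, goal):
    """Return the unique path between two nodes of a tree as a list of nodes."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    nodes = []
    node = goal
    while node is not None:
        nodes.append(node)
        node = parent[node]
    return nodes

def spanning_subtree(adj, taxa):
    """Nodes of the minimal subtree connecting the given taxa."""
    taxa = list(taxa)
    nodes = set(taxa)
    for other in taxa[1:]:
        nodes.update(path(adj, taxa[0], other))
    return nodes

def is_valid_split(adj, a, b):
    """A|B is valid if the subtrees spanning A and B do not intersect."""
    return spanning_subtree(adj, a).isdisjoint(spanning_subtree(adj, b))

# The four-taxon tree ((1,2),(3,4)); u and v are hypothetical internal nodes.
tree = {
    "1": ["u"], "2": ["u"], "3": ["v"], "4": ["v"],
    "u": ["1", "2", "v"], "v": ["3", "4", "u"],
}

for a, b in [("12", "34"), ("13", "24"), ("14", "23")]:
    print(f"{a}|{b}", is_valid_split(tree, list(a), list(b)))
```

On this tree only 12|34 is reported as valid, matching the example on the slide.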

Definition: flattenings

\[
p_{ijkl} = P(X_1 = i,\ X_2 = j,\ X_3 = k,\ X_4 = l)
\]

\[
\mathrm{Flat}_{12|34}(P) =
\begin{array}{c|cccccc}
     & [AA]     & [AC]     & [AG]     & [AT]     & [CA]     & \cdots \\ \hline
[AA] & p_{AAAA} & p_{AAAC} & p_{AAAG} & p_{AAAT} & p_{AACA} & \cdots \\
[AC] & p_{ACAA} & p_{ACAC} & p_{ACAG} & p_{ACAT} & p_{ACCA} & \cdots \\
[AG] & p_{AGAA} & p_{AGAC} & p_{AGAG} & p_{AGAT} & p_{AGCA} & \cdots \\
[AT] & p_{ATAA} & p_{ATAC} & p_{ATAG} & p_{ATAT} & p_{ATCA} & \cdots \\
[CA] & p_{CAAA} & p_{CAAC} & p_{CAAG} & p_{CAAT} & p_{CACA} & \cdots \\
\vdots & \vdots & \vdots   & \vdots   & \vdots   & \vdots   & \ddots
\end{array}
\]

Theorem (Chifman and Kubatko 2015): Under the coalescent model and the GTR+I+Γ model and its submodels, we have the following:

- If A|B is a valid split for a tree T, then $\mathrm{rank}(\mathrm{Flat}_{A|B}(P)) \le 10$.
- If C|D is not a valid split for a tree T, then $\mathrm{rank}(\mathrm{Flat}_{C|D}(P)) > 10$.

The species tree is completely determined by knowledge of valid splits on all quartets.
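To make the indexing concrete, here is a minimal sketch of how an observed flattening could be tabulated from four aligned sequences. The function names and the toy alignment are hypothetical, and skipping sites with gaps or ambiguity codes is an assumption made for simplicity, not a statement about how PAUP* treats them. Rows are indexed by the states of the taxa on one side of the split and columns by the states of the taxa on the other side.

```python
from collections import Counter
from itertools import product

STATES = "ACGT"
PAIRS = ["".join(pair) for pair in product(STATES, repeat=2)]  # AA, AC, ..., TT

def pattern_frequencies(seqs):
    """Relative frequency of each site pattern across four aligned sequences."""
    counts = Counter(
        column for column in zip(*seqs) if all(base in STATES for base in column)
    )
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}

def flattening(freqs, row_taxa, col_taxa):
    """16 x 16 flattening: rows index the states of row_taxa, columns those of col_taxa."""
    matrix = []
    for row_states in PAIRS:
        row = []
        for col_states in PAIRS:
            pattern = [None] * 4
            pattern[row_taxa[0]], pattern[row_taxa[1]] = row_states
            pattern[col_taxa[0]], pattern[col_taxa[1]] = col_states
            row.append(freqs.get(tuple(pattern), 0.0))
        matrix.append(row)
    return matrix

# Toy alignment (hypothetical): one sequence per taxon, three sites.
seqs = ["AAC", "ACC", "AGG", "ATG"]
freqs = pattern_frequencies(seqs)
flat_12_34 = flattening(freqs, (0, 1), (2, 3))  # rows (X1, X2), columns (X3, X4)
flat_13_24 = flattening(freqs, (0, 2), (1, 3))  # rows (X1, X3), columns (X2, X4)
```

The same pattern frequencies feed all three flattenings; only the assignment of taxa to rows and columns changes.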

Extensions of the Main Result

Arbitrary number of states, κ, under the coalescent model:

- If A|B is a valid split for a tree T, then $\mathrm{rank}(\mathrm{Flat}_{A|B}(P)) \le \binom{\kappa+1}{2}$.
- If C|D is not a valid split for a tree T, then $\mathrm{rank}(\mathrm{Flat}_{C|D}(P)) > \binom{\kappa+1}{2}$.
- The species tree is completely determined by knowledge of valid splits on all quartets.

Single underlying gene tree (no coalescent assumption):

- If A|B is a valid split for a tree T, then $\mathrm{rank}(\mathrm{Flat}_{A|B}(P)) \le 4$.
- If C|D is not a valid split for a tree T, then $\mathrm{rank}(\mathrm{Flat}_{C|D}(P)) = 16$.
- The species tree is completely determined by knowledge of valid splits on all quartets.
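The single-gene-tree case is small enough to verify exactly. The sketch below is an illustration under stated assumptions: a Jukes-Cantor-style substitution matrix on each branch of the tree ((1,2),(3,4)), a uniform distribution at the internal node, and arbitrary branch parameters chosen for the example. It computes the site pattern probabilities with exact rational arithmetic and then the rank of each of the three flattenings by Gaussian elimination.

```python
from fractions import Fraction
from itertools import product

STATES = range(4)  # 0..3 stand for A, C, G, T

def jc_matrix(change):
    """Jukes-Cantor-style matrix: `change` is the probability of each specific substitution."""
    stay = 1 - 3 * change
    return [[stay if i == j else change for j in STATES] for i in STATES]

def pattern_probs(m1, m2, m3, m4, m_int):
    """Site pattern probabilities p[i, j, k, l] on the gene tree ((1,2),(3,4))."""
    root = Fraction(1, 4)  # uniform distribution at the internal node joining taxa 1 and 2
    p = {}
    for i, j, k, l in product(STATES, repeat=4):
        p[i, j, k, l] = sum(
            root * m1[x][i] * m2[x][j] * m_int[x][y] * m3[y][k] * m4[y][l]
            for x in STATES for y in STATES
        )
    return p

def flattening(p, rows, cols):
    """16 x 16 flattening; rows and cols are 0-based taxon positions."""
    matrix = []
    for r in product(STATES, repeat=2):
        row = []
        for c in product(STATES, repeat=2):
            pattern = [None] * 4
            pattern[rows[0]], pattern[rows[1]] = r
            pattern[cols[0]], pattern[cols[1]] = c
            row.append(p[tuple(pattern)])
        matrix.append(row)
    return matrix

def rank(matrix):
    """Exact rank by Gaussian elimination over the rationals."""
    m = [row[:] for row in matrix]
    found = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(found, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[found], m[pivot] = m[pivot], m[found]
        for r in range(found + 1, len(m)):
            factor = m[r][col] / m[found][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[found])]
        found += 1
    return found

# Arbitrary (hypothetical) branch parameters: four pendant branches, then the internal branch.
p = pattern_probs(*(jc_matrix(Fraction(1, d)) for d in (10, 20, 8, 12, 16)))
splits = {"12|34": ((0, 1), (2, 3)), "13|24": ((0, 2), (1, 3)), "14|23": ((0, 3), (1, 2))}
ranks = {name: rank(flattening(p, rows, cols)) for name, (rows, cols) in splits.items()}
print(ranks)
```

With these parameters the valid split 12|34 gives rank 4 and the two invalid splits give rank 16, as the single-gene-tree result states.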

Species tree estimation using SVDquartets

Main idea: use the observed site pattern distribution to provide information about which of the three possible splits for a set of four taxa is the true split.

[Figure: the three possible unrooted trees on taxa A, B, C, D, corresponding to the splits AB|CD, AC|BD, and AD|BC]

The program SVDquartets computes a score for each split in a given quartet of taxa and chooses the split with the best (lowest) score.

We use the following score:

\[
\mathrm{SVDScore} = \sqrt{\sum_{i=11}^{16} \hat{\sigma}_i^2}
\]

where $\hat{\sigma}_i$ is the $i$-th singular value computed from the observed flattening matrix.
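A minimal sketch of the scoring step, assuming NumPy is available and taking the score to be the square root of the sum of the squared singular values numbered 11 through 16 (the distance from the observed flattening to the nearest matrix of rank 10). The function names, the split labels, and the handling of gaps are hypothetical choices for illustration; this is not the PAUP* implementation.

```python
import numpy as np

STATE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

# Axis orders that place the row taxa first and the column taxa last.
SPLITS = {
    "12|34": (0, 1, 2, 3),
    "13|24": (0, 2, 1, 3),
    "14|23": (0, 3, 1, 2),
}

def pattern_frequencies(seqs):
    """Observed site pattern frequencies as a 4x4x4x4 array (sites with other symbols are skipped)."""
    counts = np.zeros((4, 4, 4, 4))
    for column in zip(*seqs):
        if all(base in STATE_INDEX for base in column):
            counts[tuple(STATE_INDEX[base] for base in column)] += 1
    return counts / counts.sum()

def svd_score(freqs, order):
    """Score of one split: singular values 11..16 of its 16 x 16 flattening."""
    flat = freqs.transpose(order).reshape(16, 16)
    sigma = np.linalg.svd(flat, compute_uv=False)  # sorted in descending order
    return float(np.sqrt(np.sum(sigma[10:] ** 2)))

def best_split(freqs):
    """Return the lowest-scoring split and the scores of all three splits."""
    scores = {name: svd_score(freqs, order) for name, order in SPLITS.items()}
    return min(scores, key=scores.get), scores
```

Given four aligned sequences `seqs`, `best_split(pattern_frequencies(seqs))` returns the inferred split for that quartet together with all three scores, which is the per-quartet decision described above.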

Species tree estimation using SVDquartets

Algorithm:

1. Generate all quartets (small problems) or sample quartets (large problems)
2. Estimate the correct quartet relationship for each sampled quartet
3. Use a quartet assembly method to build the tree

PAUP* uses the method of Reaz, Bayzid, and Rahman (2014), called QFM, to build the tree.
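Step 1 can be sketched as follows. This is an illustration only: the function name, the `max_quartets` threshold, and the rejection-sampling scheme are assumptions, and the slides do not say how PAUP* draws its sample.

```python
import itertools
import math
import random

def choose_quartets(taxa, max_quartets, seed=0):
    """All quartets if there are few enough, otherwise a random sample of distinct quartets."""
    total = math.comb(len(taxa), 4)
    if total <= max_quartets:
        return list(itertools.combinations(taxa, 4))
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < max_quartets:
        # Sorting makes the same four taxa always map to the same tuple.
        chosen.add(tuple(sorted(rng.sample(taxa, 4))))
    return sorted(chosen)

small = choose_quartets(list("ABCDEF"), max_quartets=100)                 # exhaustive
large = choose_quartets([f"t{i}" for i in range(52)], max_quartets=1000)  # sampled
print(len(small), len(large), math.comb(52, 4))
```

The number of possible quartets grows as n choose 4, which is why sampling becomes necessary for large problems: 6 taxa give 15 quartets, while 52 taxa give 270,725.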

Species tree estimation using SVDquartets

Variability in the estimated tree is assessed using nonparametric bootstrapping.

Multiple lineages are handled as follows:

1. Sample four species
2. Select one lineage at random from each species
3. Estimate the quartet relationships among the four sampled lineages
4. Restore the species labels (but lineage quartets are saved, too)
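The sampling steps above, together with a site-level bootstrap replicate, can be sketched like this. The species and lineage names are hypothetical, and resampling alignment columns with replacement is the standard nonparametric bootstrap, assumed here rather than taken from the slides.

```python
import random

def sample_lineage_quartet(lineages_by_species, rng):
    """Pick four species, then one lineage at random from each."""
    species = rng.sample(sorted(lineages_by_species), 4)
    lineages = [rng.choice(lineages_by_species[sp]) for sp in species]
    return species, lineages

def bootstrap_sites(seqs, rng):
    """Resample alignment columns with replacement to form one bootstrap replicate."""
    n_sites = len(seqs[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in seqs]

# Hypothetical species-to-lineage map.
lineages_by_species = {
    "sp1": ["sp1_a", "sp1_b"],
    "sp2": ["sp2_a"],
    "sp3": ["sp3_a", "sp3_b", "sp3_c"],
    "sp4": ["sp4_a"],
    "sp5": ["sp5_a", "sp5_b"],
}

rng = random.Random(1)
species, lineages = sample_lineage_quartet(lineages_by_species, rng)
replicate = bootstrap_sites(["ACGT", "ACGA", "ACCT", "AGGT"], rng)
```

The quartet relationship is estimated on `lineages`, and the result is then reported under the labels in `species`.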

Multi-locus vs. SNP data

The theory is developed for the SNP setting. Why do we think this might be OK for multilocus data?

Consider the case of three possible gene trees with the probabilities below under the coalescent model:

Gene tree 1    p1 = 0.4
Gene tree 2    p2 = 0.3
Gene tree 3    p3 = 0.3

Now suppose we observe multilocus data for 1,000 genes as follows:

Gene tree 1    380 genes
Gene tree 2    300 genes
Gene tree 3    320 genes

Then, if the genes are equal in length, the proportion of sites coming from each tree is approximately what is predicted under the SNP model.
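The arithmetic behind this argument is short enough to write out. In the sketch below the gene length of 500 sites is a hypothetical value; with equal-length genes it cancels, so any length gives the same proportions.

```python
expected = {"Gene tree 1": 0.4, "Gene tree 2": 0.3, "Gene tree 3": 0.3}
observed_genes = {"Gene tree 1": 380, "Gene tree 2": 300, "Gene tree 3": 320}

def site_proportions(genes_per_tree, sites_per_gene):
    """Proportion of all sites that come from each gene tree, assuming equal-length genes."""
    sites = {tree: n * sites_per_gene for tree, n in genes_per_tree.items()}
    total = sum(sites.values())
    return {tree: n / total for tree, n in sites.items()}

proportions = site_proportions(observed_genes, sites_per_gene=500)
for tree in expected:
    print(tree, expected[tree], round(proportions[tree], 2))
```

The observed proportions of sites are 0.38, 0.30, and 0.32, close to the coalescent probabilities 0.4, 0.3, and 0.3.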

Species tree estimation using SVDquartets

Advantages:

- Fast! How fast?
  - Rattlesnakes: < 1 hour (8500 bp, 52 tips)
  - Soybeans: < 1 day (6 million SNPs, 62 tips)
- Scales well:
  - Number of quartets needed increases as number of species increases (but can be done in parallel)
  - Linear in number of sites (but this is just counting)
- Potential for application to other data types
- Natural way to handle missing data

Disadvantages:

- Only the (unrooted) topology is estimated; no parameters

SVDquartets

Described in the papers:

- Chifman, J. and L. Kubatko. 2014. Quartet inference from SNP data under the coalescent model. Bioinformatics 30(23): 3317-3324.
- Chifman, J. and L. Kubatko. 2015. Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. Journal of Theoretical Biology 374: 35-47.

Implemented in PAUP* (thanks, Dave!)

Now on to the tutorial!