Maximum Likelihood Inference of Reticulate Evolutionary Histories
Luay Nakhleh, Department of Computer Science, Rice University
The 2015 Phylogenomics Symposium and Software School, The University of Michigan, Ann Arbor
18 May 2015
Species Phylogeny Inference in the Pre-Genomic Era
Data: a single gene family (A1, B1, C1, D1).
Model: account for nucleotide substitutions and indels.
Inference: NJ, MP (PAUP), ML (RAxML), Bayesian (MrBayes), ...
Result: species phylogeny = gene tree on taxa A, B, C, D.
Species Phylogeny Inference in the Post-Genomic Era
Data: multiple gene families: gene family 1 (A1, B1, C1, D1), gene family 2 (A1, D1, B1, C1), gene family 3 (A1, B1, D1, D2).
Model: account for nucleotide substitutions and indels + other processes.
Inference: ?
Result: species phylogeny on taxa A, B, C, D.
Some of the Other Processes
Incomplete lineage sorting (ILS), hybridization, and gene duplication/loss.
[Figure: for each process, a gene tree drawn within a species phylogeny on taxa A, B, C.]
This Talk
Incomplete lineage sorting (ILS) + hybridization.
[Figure: gene trees drawn within species phylogenies on taxa A, B, C, one panel for ILS and one for hybridization.]
ILS and Maximum Likelihood Inference of Species Trees
Incomplete Lineage Sorting (ILS)
[Figure: illustration on taxa A, B, C.]
Inference in the Presence of ILS
Input: sequence alignments for m (independent) loci, $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$.
Output: species tree $\psi$.
ML Inference in the Presence of ILS
A maximum likelihood approach: seek the tree $\psi$ that maximizes
$$L(\psi \mid \mathcal{S}) = \prod_{i=1}^{m} \int_{g} P(S_i \mid g)\, p(g \mid \psi)\, dg$$
Contrast to gene tree inference under ML:
$$L(g \mid S_i) = P(S_i \mid g)$$
where $P(S_i \mid g)$ is computed using Felsenstein's algorithm.
ML Inference in the Presence of ILS: The Gene Tree Shortcut
Suppose that for every locus i, a gene tree* $G_i$ has been inferred, so that the input is $\mathcal{G} = \{G_1, G_2, \ldots, G_m\}$.
The inference problem (under ML) becomes: seek the tree $\psi$ that maximizes
$$L(\psi \mid \mathcal{G}) = \prod_{i=1}^{m} p(G_i \mid \psi)$$
* gene tree = genealogy of a recombination-free genomic region
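The product over loci is usually evaluated as a sum of logarithms. The following is a minimal sketch of that step only, assuming gene trees are reduced to topology strings and that the probabilities p(G | ψ) for one candidate species tree are already available in a lookup table. The function name `log_likelihood` and the numbers in `probs` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def log_likelihood(gene_trees, topology_prob):
    """Return log L(psi | G) = sum over loci of log p(G_i | psi).

    gene_trees    -- list of gene tree topologies, one per locus (strings)
    topology_prob -- dict mapping topology -> p(topology | psi)
    """
    counts = Counter(gene_trees)
    total = 0.0
    for topology, n in counts.items():
        p = topology_prob[topology]
        if p <= 0.0:
            # A single impossible gene tree makes the whole product zero.
            return float("-inf")
        total += n * math.log(p)
    return total


# Hypothetical probabilities for one candidate species tree (not from the talk).
probs = {"((A,B),C)": 0.6, "((A,C),B)": 0.2, "((B,C),A)": 0.2}
# Ten loci: seven support ((A,B),C), two ((A,C),B), one ((B,C),A).
trees = ["((A,B),C)"] * 7 + ["((A,C),B)"] * 2 + ["((B,C),A)"]
print(log_likelihood(trees, probs))
```

Comparing this value across candidate species trees gives the ML choice among them; computing the table `topology_prob` is the part supplied by the multispecies coalescent.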
ML Inference in the Presence of ILS
So, what is this?
$$L(\psi \mid \mathcal{G}) = \prod_{i=1}^{m} p(G_i \mid \psi)$$
The Multi-Species Coalescent
[Figure: a gene tree within a species tree on taxa H, C, G; labels: MRCA(H,C,G), MRCA(C,G), T_2, T_1.]
The Multi-Species Coalescent
Rannala and Yang (Genetics, 2003) derived the density function of gene trees (topology + branch lengths) given a species tree.
Degnan and Salter (Evolution, 2005) gave the probability mass function of gene trees (topology alone) given a species tree.
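For three taxa with one sampled allele per species, the gene tree topology distribution has a well-known closed form, which makes a convenient small example of a probability mass function of this kind. The sketch below assumes a species tree ((A,B):t,C) with internal branch length t in coalescent units; the function name is hypothetical, and the formula is the standard three-taxon result rather than something taken from the slides.

```python
import math


def three_taxon_topology_probs(t):
    """Gene tree topology probabilities under species tree ((A,B):t,C).

    t is the internal branch length in coalescent units. The A and B
    lineages fail to coalesce on that branch with probability exp(-t);
    in that case all three topologies are equally likely at the root.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


for t in (0.0, 0.5, 2.0):
    print(t, three_taxon_topology_probs(t))
```

Short internal branches push all three probabilities toward 1/3, which is why ILS is most pronounced for rapid successive divergences.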
Maximum Likelihood Inference of Species Trees
ML inference from gene trees with branch lengths under the Rannala & Yang model:
Laura S. Kubatko, Bryan C. Carstens, and L. Lacey Knowles. STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics (Applications Note, Phylogenetics), Vol. 25 no. 7, 2009, pages 971-973. doi:10.1093/bioinformatics/btp079. Affiliations: The Ohio State University (Kubatko), Louisiana State University (Carstens), University of Michigan (Knowles). Received November 28, 2008; revised and accepted February 4, 2009; Advance Access publication February 10, 2009. Associate Editor: Martin Bishop.
ML inference from gene tree topologies alone under the Degnan & Salter model:
Yufeng Wu. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut 06269. E-mail: ywu@engr.uconn.edu
Hybridization, ILS, and Species Network Inference
Hybridization
[Figure: illustration on taxa A, B, C.]
ILS + Hybridization
[Figure: illustration on taxa A, B, C.]
Phylogenetic Networks
A leaf-labeled, rooted, directed, acyclic graph (rDAG).
[Figure: a network on taxa A, B, C, with labels marking a tree node, a reticulation node, and branch lengths (in coalescent units).]
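As a rough illustration of this definition, the sketch below stores a network as a list of (parent, child) edges, checks that it has a single root and no directed cycle, and separates tree nodes from reticulation nodes by in-degree. The function name, the edge-list representation, and the example node names (`root`, `x`, `y`, `h`) are assumptions made for this example and are not PhyloNet's data structures.

```python
from collections import defaultdict, deque


def classify_network(edges):
    """Classify the nodes of a rooted DAG given as (parent, child) pairs.

    Raises ValueError if there is not exactly one root or if the graph
    contains a directed cycle.
    """
    children = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indeg[child] += 1
        nodes.update((parent, child))

    roots = [v for v in nodes if indeg[v] == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")

    # Kahn's algorithm: every node is reached iff the graph is acyclic.
    remaining = {v: indeg[v] for v in nodes}
    queue = deque(roots)
    visited = 0
    while queue:
        v = queue.popleft()
        visited += 1
        for c in children[v]:
            remaining[c] -= 1
            if remaining[c] == 0:
                queue.append(c)
    if visited != len(nodes):
        raise ValueError("graph contains a directed cycle")

    return {
        "root": roots[0],
        "leaves": sorted(v for v in nodes if not children[v]),
        "tree_nodes": sorted(v for v in nodes if indeg[v] <= 1 and children[v]),
        "reticulation_nodes": sorted(v for v in nodes if indeg[v] >= 2),
    }


# Hypothetical network in which B descends from a reticulation node h
# whose two parents lie on the lineages leading to A and to C.
edges = [("root", "x"), ("root", "y"), ("x", "A"), ("x", "h"),
         ("y", "C"), ("y", "h"), ("h", "B")]
print(classify_network(edges))
```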
The Model
A phylogenetic network $\psi$ with branch lengths (in coalescent units).
Inheritance probabilities $\gamma$, one per locus per reticulation node.
[Figure: a network on taxa A, B, C.]
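To make the role of an inheritance probability concrete, consider a simplified special case: three taxa, a single reticulation node above B, and one sampled allele per species. Under those assumptions the single B lineage follows one parent with probability γ and the other with probability 1 - γ, so the gene tree topology distribution becomes a γ-weighted mixture of two three-taxon tree distributions. The sketch below encodes only that special case; the function and parameter names are hypothetical, and with several alleles per species this shortcut no longer applies and the coalescent-history computation is needed.

```python
import math

TOPOLOGIES = ("((A,B),C)", "((A,C),B)", "((B,C),A)")


def three_taxon_probs(concordant, t):
    """Topology probabilities when the `concordant` pair shares a branch of length t."""
    discordant = math.exp(-t) / 3.0
    return {g: (1.0 - 2.0 * discordant if g == concordant else discordant)
            for g in TOPOLOGIES}


def network_topology_probs(gamma, t_ab, t_bc):
    """Mixture over the two parental trees of a one-reticulation network.

    gamma -- probability that the B lineage is inherited via the A side
    t_ab  -- length (coalescent units) of the branch shared by A and B
    t_bc  -- length (coalescent units) of the branch shared by B and C
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    via_a = three_taxon_probs("((A,B),C)", t_ab)
    via_c = three_taxon_probs("((B,C),A)", t_bc)
    return {g: gamma * via_a[g] + (1.0 - gamma) * via_c[g] for g in TOPOLOGIES}


print(network_topology_probs(gamma=0.3, t_ab=1.0, t_bc=1.0))
```

With γ = 0 or γ = 1 the network collapses to one of its parental trees, which is the tree-only ILS model.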
The Data
Sequence alignments for m (independent) loci, $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$.
ML Inference of Phylogenetic Networks in the Presence of ILS
A maximum likelihood approach: seek $(\psi, \gamma)$ that maximizes
Hybridization + ILS (network):
$$L(\psi, \gamma \mid \mathcal{S}) = \prod_{i=1}^{m} \int_{g} P(S_i \mid g)\, p(g \mid \psi, \gamma)\, dg$$
Contrast to ILS (tree):
$$L(\psi \mid \mathcal{S}) = \prod_{i=1}^{m} \int_{g} P(S_i \mid g)\, p(g \mid \psi)\, dg$$
ML Inference of Phylogenetic Networks in the Presence of ILS: The Gene Tree Shortcut
Suppose that for every locus i, a gene tree $G_i$ has been inferred, so that the input is $\mathcal{G} = \{G_1, G_2, \ldots, G_m\}$.
The inference problem (under ML) becomes: seek the pair $(\psi, \gamma)$ that maximizes
$$L(\psi, \gamma \mid \mathcal{G}) = \prod_{i=1}^{m} p(G_i \mid \psi, \gamma)$$
ML Inference of Phylogenetic Networks in the Presence of ILS: The Gene Tree Shortcut
To enable such inference, we need to figure out what $p(G_i \mid \psi, \gamma)$ is, and how to search the space of phylogenetic networks.
Computing $p(G_i \mid \psi, \gamma)$: Coalescent Histories
Given a phylogenetic network $\psi$ and a gene tree $G_i$, an element that is central to computing gene tree distributions is the set of coalescent histories, denoted by $H_\psi(G_i)$.
[Figure: a phylogenetic network $\psi$ on taxa A, B, C, a gene tree $G_i$ on alleles a, b1, b2, c, and the ten coalescent histories h1 through h10 of $G_i$ within $\psi$.]
Computing $p(G_i \mid \psi, \gamma)$: $G_i$ is given by its topology alone
When gene tree $G_i$ is given by its topology alone:
$$P(G_i \mid \psi, \gamma) = \sum_{h \in H_\psi(G_i)} P(h \mid \psi, \gamma)$$
$$P(h \mid \psi, \gamma) = \frac{w(h)}{d(h)} \prod_{b \in E(\psi)} \frac{w_b(h)}{d_b(h)}\, \gamma[b, j]^{u_b(h)}\, p_{u_b(h) v_b(h)}(\lambda_b)$$
[Yu, Degnan, Nakhleh, PLoS Genetics, 2012.] [Yu, Ristic, Nakhleh, BMC Bioinformatics, 2013.]
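The factor p with subscripts u_b(h) and v_b(h) on this slide is, assuming it follows the usual coalescent notation, the probability that u lineages entering a branch coalesce into v lineages by the time they leave it, for a branch of length λ in coalescent units. A minimal sketch of the standard closed form for that probability is given below; the function name `coalescence_prob` is hypothetical.

```python
import math


def coalescence_prob(u, v, length):
    """Probability that u lineages coalesce into v lineages within `length`
    coalescent units (standard closed form from coalescent theory)."""
    if not 1 <= v <= u:
        raise ValueError("need 1 <= v <= u")
    if length < 0:
        raise ValueError("branch length must be non-negative")
    total = 0.0
    for k in range(v, u + 1):
        coeff = ((2 * k - 1) * (-1) ** (k - v)
                 / (math.factorial(v) * math.factorial(k - v) * (v + k - 1)))
        prod = 1.0
        for y in range(k):
            prod *= (v + y) * (u - y) / (u + y)
        total += math.exp(-k * (k - 1) * length / 2.0) * coeff * prod
    return total


# Two lineages on a branch of length 1: they coalesce with probability 1 - e^{-1}.
print(coalescence_prob(2, 1, 1.0))
# The outcomes for u = 4 lineages form a probability distribution over v.
print(sum(coalescence_prob(4, v, 0.7) for v in range(1, 5)))
```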
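Computing $p(G_i \mid \psi, \gamma)$: $G_i$ is given by its topology and branch lengths
We also derived the density function of gene trees (with branch lengths) given a species network:
$$p(G_i \mid \psi, \gamma) = \sum_{h \in H_\psi(G_i)} p(h \mid \psi, \gamma)$$
$$p(h \mid \psi, \gamma) = \prod_{b \in E(\psi)} \left[ \prod_{i=1}^{|T_b(h)|-1} e^{-\binom{u_b(h)-i+1}{2}\left(T_b(h)_{i+1} - T_b(h)_i\right)} \cdot e^{-\binom{v_b(h)}{2}\left(\tau_\psi(b) - T_b(h)_{|T_b(h)|}\right)} \cdot \gamma[b, j]^{u_b(h)} \right]$$
[Yu, Dong, Liu, Nakhleh, PNAS, 2014.]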
Accounting For Gene Tree Uncertainty
Use a set of trees $\mathcal{G}_i$ for each locus i:
$$p(\mathcal{G}_i \mid \psi, \gamma) = \left( \sum_{g \in \mathcal{G}_i} p(g \mid \psi, \gamma) \right) / |\mathcal{G}_i|$$
[Yu, Dong, Liu, Nakhleh, PNAS, 2014.]
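Read as an average over the candidate trees of a locus, this quantity is simple to compute once per-tree probabilities are available. The sketch below assumes each locus comes with a list of candidate topologies (for instance from bootstrap replicates, which is an assumption of this example) and a hypothetical lookup table standing in for p(g | ψ, γ).

```python
import math


def locus_probability(candidate_trees, tree_prob):
    """Average p(g | psi, gamma) over the candidate trees of one locus."""
    if not candidate_trees:
        raise ValueError("each locus needs at least one candidate tree")
    return sum(tree_prob(g) for g in candidate_trees) / len(candidate_trees)


def log_likelihood(loci, tree_prob):
    """Sum of log locus probabilities over all loci."""
    return sum(math.log(locus_probability(trees, tree_prob)) for trees in loci)


# Hypothetical probabilities under one candidate network (illustration only).
table = {"((A,B),C)": 0.5, "((A,C),B)": 0.2, "((B,C),A)": 0.3}
loci = [
    ["((A,B),C)", "((A,B),C)", "((B,C),A)"],  # locus 1: three candidate trees
    ["((A,C),B)"],                            # locus 2: a single tree
]
print(log_likelihood(loci, table.get))
```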
Searching the Network Space and Accounting for Model Complexity
We need a set of operations that transform networks (similar to tree operations such as SPR, TBR, NNI, ...), and mechanisms to control for network complexity (networks with more hybridizations would naturally fit the data better!).
Searching the Network Space and Accounting for Model Complexity
[Figure: the search space arranged in layers Ω(n,0), Ω(n,1), ..., Ω(n,k); the search uses multiple starting points, moves within a layer, ascends a layer, descends a layer, and ends at (potentially convergent) endpoints.]
[Yu, Dong, Liu, Nakhleh, PNAS, 2014.]
Searching the Network Space and Accounting for Model Complexity We use standard information criteria (AIC and BIC) and have introduced the use of cross-validation to control for model complexity. [Yu, Dong, Liu, Nakhleh, PNAS, 2014.]
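For reference, the two information criteria have the standard forms AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters and n the number of data points. The sketch below applies them to hypothetical candidate networks; the log-likelihood values, the parameter counts, and the choice of n as the number of loci are assumptions for illustration and are not results from the talk.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_lik


# (label, maximized log-likelihood, number of free parameters): hypothetical.
candidates = [
    ("0 reticulations", -170.0, 3),
    ("1 reticulation", -151.6, 6),
    ("2 reticulations", -150.9, 9),
]
n_loci = 106  # hypothetical sample size

best_aic = min(candidates, key=lambda c: aic(c[1], c[2]))
best_bic = min(candidates, key=lambda c: bic(c[1], c[2], n_loci))
print("AIC prefers:", best_aic[0])
print("BIC prefers:", best_bic[0])
```

In this toy example the second reticulation improves the likelihood only slightly, so both criteria penalize it and prefer the one-reticulation network.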
Results: Simulated Data
[Figure: (A) three networks Ψ1, Ψ2, Ψ3 on taxa A, B, C, D. (B) Counts of inferred networks (0 to 30) versus number of loci sampled (10, 20, 40, 80, 160), using gene tree topologies. (C) Counts of inferred networks (0 to 30) versus number of loci sampled (10, 20, 40, 80, 160), using gene tree topologies + branch lengths. Legend: sequence lengths = 250, 500, 1000; true gene trees.]
[Yu, Dong, Liu, Nakhleh, PNAS, 2014.]
Results: A Mouse Data Set
[Figure: inferred network for mouse samples DF, DG, MZ, MK, MC, with localities Russia, Germany, Poland, Ukraine, Kazakhstan, France, Czech Republic, and China; numeric annotations on the network: ~1.96, ~0.001, ~0.08, ~1.05, 0.045~0.054, 0.063~0.068.]
[Yu, Dong, Liu, Nakhleh, PNAS, 2014.]
PhyloNet
All the methods are implemented in PhyloNet (open source, Java): http://bioinfo.cs.rice.edu/phylonet
I/O format: (extended) Nexus.
Networks are produced in the Rich Newick format, which can be read by Dendroscope for visualization and figure generation.
PhyloNet
Parsimony:
Yun Yu, R. Matthew Barnett, and Luay Nakhleh. Parsimonious Inference of Hybridization in the Presence of Incomplete Lineage Sorting. Syst. Biol. 62(5):738-751, 2013. DOI:10.1093/sysbio/syt037. Department of Computer Science, Rice University, 6100 Main Street, Houston, TX 77005, USA. Correspondence to be sent to: Luay Nakhleh; E-mail: yy9@rice.edu; nakhleh@rice.edu. Received 25 September 2012; reviews returned 26 November 2012; accepted 28 May 2013; Advance Access publication June 4, 2013. Associate Editor: Laura Kubatko.
Abstract. Hybridization plays an important evolutionary role in several groups of organisms. A phylogenetic approach to detect hybridization entails sequencing multiple loci across the genomes of a group of species of interest, reconstructing their gene trees, and taking their differences as indicators of hybridization. However, methods that follow this approach mostly ignore population effects, such as incomplete lineage sorting (ILS). Given that hybridization occurs between closely related organisms, ILS may very well be at play and, hence, must be accounted for in the analysis framework. To address this issue, we present a parsimony criterion for reconciling gene trees within the branches of a phylogenetic network, and a local search heuristic for inferring phylogenetic networks from collections of gene-tree topologies under this criterion. This framework enables phylogenetic analyses while accounting for both hybridization and ILS. Further, we propose two techniques for incorporating information about uncertainty in gene-tree estimates. Our simulation studies demonstrate the good performance of our framework in terms of identifying the location of hybridization events, as well as estimating the proportions of genes that underwent hybridization. Also, our framework shows good performance in terms of efficiency on handling large data sets in our experiments. Further, in analysing a yeast data set, we demonstrate issues that arise when analysing real data sets. Although a probabilistic approach was recently introduced for this problem, and although parsimonious reconciliations have accuracy issues under certain settings, our parsimony framework provides a much more computationally efficient technique for this type of analysis. Our framework now allows for genome-wide scans for hybridization, while accounting for ILS. [Phylogenetic networks; hybridization; incomplete lineage sorting; coalescent; multi-labeled trees.]
Phylogenetic networks + HMMs:
Kevin J. Liu, Jingxuan Dai, Kathy Truong, Ying Song, Michael H. Kohn, and Luay Nakhleh. An HMM-Based Comparative Genomic Framework for Detecting Introgression in Eukaryotes. Affiliations: Department of Computer Science, Rice University, Houston, Texas, United States of America; Department of Ecology and Evolutionary Biology, Rice University, Houston, Texas, United States of America; The State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China.
Abstract. One outcome of interspecific hybridization and subsequent effects of evolutionary forces is introgression, which is the integration of genetic material from one species into the genome of an individual in another species. The evolution of several groups of eukaryotic species has involved hybridization, and cases of adaptation through introgression have been already established. In this work, we report on PhyloNet-HMM, a new comparative genomic framework for detecting introgression in genomes. PhyloNet-HMM combines phylogenetic networks with hidden Markov models (HMMs) to simultaneously capture the (potentially reticulate) evolutionary history of the genomes and dependencies within genomes. A novel aspect of our work is that it also accounts for incomplete lineage sorting and dependence across loci. Application of our model to variation data from chromosome 7 in the mouse (Mus musculus domesticus) genome detected a recently reported adaptive introgression event involving the rodent poison resistance gene Vkorc1, in addition to other newly detected introgressed genomic regions. Based on our analysis, it is estimated that about 9% of all sites within chromosome ...
Kevin J. Liu, Ethan Steinberg, Alexander Yozzo, Ying Song, Michael H. Kohn, and Luay Nakhleh. Interspecific introgressive origin of genomic diversity in the house mouse. Department of Computer Science and BioSciences, Rice University, Houston, TX 77005. Edited by John C. Avise, University of California, Irvine, CA, and approved November 12, 2014 (received for review April 4, 2014).
Abstract (excerpt). We report on a genome-wide scan for introgression between the house mouse (Mus musculus domesticus) and the Algerian mouse (Mus spretus), using samples from ranges of sympatry and allopatry in Africa and Europe. Our analysis reveals wide variability in introgression signatures along the genomes, as well as across the samples. We find that fewer than half of the autosomes ...
Text (excerpt). ... introgression from M. spretus into some M. m. domesticus populations in the wild, involving the vitamin K epoxide reductase subcomponent 1 (Vkorc1) gene, which was later shown to be more widespread in Europe, albeit geographically restricted to parts of southwestern and central Europe (11). Major, unanswered questions arise from these studies. First, is ...
PhyloNet
Input NEXUS file:
    #NEXUS
    BEGIN TREES;
    Tree gt0 = ((((Scer,Spar),Smik),Skud),Sbay);
    ...
    Tree gt105 = ((Scer,Spar),Smik,Skud,Sbay);
    END;
    BEGIN PHYLONET;
    InferNetwork_ML (gt0,...,gt105) 1;
    END;
Output:
    Inferred Network #1: ((Sbay:1.0)#H1:1.0::0.6130,((Smik:1.0,(Scer:1.0,Spar:1.0):3.5436):1.0585,(#H1:1.0::0.3869,Skud:1.0):2.1717):5.9272);
    Total log probability: -151.57753843275103
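The inferred network string carries the inheritance probabilities on the two edges entering the reticulation node #H1. The following is a small sketch for pulling them out with a regular expression, assuming the fields after a hybrid label follow the pattern branch length : support : inheritance probability, as the output above suggests. The helper name `inheritance_probabilities` is hypothetical and this is not part of PhyloNet.

```python
import re
from collections import defaultdict

# Pattern for "#H1:<branch length>:<support>:<probability>"; the branch
# length and support fields may be empty, as in "#H1:1.0::0.6130".
HYBRID_FIELD = re.compile(r"#(H\d+):([^:,()]*):([^:,()]*):([0-9.eE+-]+)")


def inheritance_probabilities(rich_newick):
    """Map each hybrid node label to the probabilities found on its edges."""
    probs = defaultdict(list)
    for label, _length, _support, prob in HYBRID_FIELD.findall(rich_newick):
        probs[label].append(float(prob))
    return dict(probs)


network = ("((Sbay:1.0)#H1:1.0::0.6130,((Smik:1.0,(Scer:1.0,Spar:1.0):3.5436)"
           ":1.0585,(#H1:1.0::0.3869,Skud:1.0):2.1717):5.9272);")
result = inheritance_probabilities(network)
print(result)
print(sum(result["H1"]))
```

The two values for H1 sum to approximately 1, as expected for the two parental edges of a single reticulation node.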
Summary
We extended the multi-species coalescent to phylogenetic networks to account for both ILS and hybridization.
We account for gene tree uncertainty.
We account for model complexity using information criteria and cross-validation.
We also have a parsimony formulation and solution.
We also have a phylogenetic network + HMM framework.
All methods are implemented in PhyloNet.
THANK YOU
http://bioinfo.cs.rice.edu/phylonet
Collaborators: R.M. Barnett (Rice), J. Dai (Rice), J.H. Degnan (UNM), J. Dong (Rice), M.H. Kohn (Rice), K. Liu (Michigan State University), Y. Song (Rice), E. Steinberg (Rice), K. Truong (Rice), A. Yozzo (Rice), Y. Yu (Rice)
Funding: NSF, NIH, Sloan Foundation, Guggenheim Foundation