Maximum Likelihood Inference of Reticulate Evolutionary Histories

Maximum Likelihood Inference of Reticulate Evolutionary Histories
Luay Nakhleh
Department of Computer Science, Rice University
The 2015 Phylogenomics Symposium and Software School
The University of Michigan, Ann Arbor
18 May 2015

Species Phylogeny Inference in the Pre-Genomic Era
Data: a single gene family with one copy per species (A1, B1, C1, D1).
Model: account for nucleotide substitutions and indels.
Inference: NJ, MP (PAUP), ML (RAxML), Bayesian (MrBayes), ...
Result: the gene tree on A, B, C, D is taken as the species phylogeny (Species Phylogeny = Gene Tree).

Species Phylogeny Inference in the Post-Genomic Era
Data: several gene families, drawn as gene family 1 (A1, B1, C1, D1), gene family 2 (A1, D1, B1, C1), and gene family 3 (A1, B1, D1, D2).
Model: account for nucleotide substitutions and indels, plus other processes.
Inference: ?
Result: a species phylogeny on A, B, C, D.

Some of the Other Processes
Incomplete lineage sorting (ILS)
Hybridization
Gene duplication/loss
[Figure: one panel per process, each showing a gene tree drawn inside a species phylogeny on A, B, C.]

This Talk
Incomplete lineage sorting (ILS) + Hybridization
[Figure: the ILS and hybridization panels from the previous slide, each showing a gene tree inside a species phylogeny on A, B, C.]

ILS and Maximum Likelihood Inference of Species Trees

Incomplete Lineage Sorting (ILS)
[Figure: species tree on A, B, C.]

Inference in the Presence of ILS
Input: sequence alignments for $m$ (independent) loci, $S = \{S_1, S_2, \ldots, S_m\}$
Output: species tree

ML Inference in the Presence of ILS
A maximum likelihood approach: seek the species tree $\psi$ that maximizes
$$L(\psi \mid S) = \prod_{i=1}^{m} \int_{g} P(S_i \mid g)\, p(g \mid \psi)\, dg$$
Contrast with gene tree inference under ML, where the likelihood of a gene tree $g$ for locus $i$ is
$$L(g \mid S_i) = P(S_i \mid g)$$
and is computed using Felsenstein's algorithm.
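As a rough illustration of the term $P(S_i \mid g)$, the sketch below applies Felsenstein's pruning recursion to one gene tree with branch lengths. It assumes the Jukes-Cantor (JC69) substitution model with uniform base frequencies, which the slides do not specify, and the nested-list tree encoding and all function names are hypothetical choices made only for this example.

```python
import math

STATES = "ACGT"

def jc69(t):
    """JC69 transition matrix for a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def partial(node, site):
    """Conditional likelihoods: P(observed leaves below node | state at node)."""
    if isinstance(node, str):                 # leaf, given by its taxon name
        return [1.0 if s == site[node] else 0.0 for s in STATES]
    result = [1.0] * 4
    for child, t in node:                     # internal node: list of (child, branch length)
        P = jc69(t)
        below = partial(child, site)
        for s in range(4):
            result[s] *= sum(P[s][x] * below[x] for x in range(4))
    return result

def site_likelihood(tree, site):
    """Likelihood of one alignment column, with uniform root frequencies (assumption)."""
    return sum(0.25 * v for v in partial(tree, site))

def log_likelihood(tree, alignment):
    """log P(S_i | g), treating alignment columns as independent."""
    length = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(length):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += math.log(site_likelihood(tree, column))
    return total

# Hypothetical gene tree ((A:0.1,B:0.1):0.2,C:0.3) and a toy alignment.
gene_tree = [([("A", 0.1), ("B", 0.1)], 0.2), ("C", 0.3)]
toy_alignment = {"A": "ACGT", "B": "ACGA", "C": "ACTT"}
toy_loglik = log_likelihood(gene_tree, toy_alignment)
```

The recursion visits each node once per site, so the cost is linear in the number of taxa for a fixed alphabet; indels are ignored in this sketch.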

ML Inference in the Presence of ILS: The Gene Tree Shortcut
Suppose that for every locus $i$, a gene tree* $G_i$ has been inferred, so that the input is $G = \{G_1, G_2, \ldots, G_m\}$.
The inference problem (under ML) becomes: seek the species tree $\psi$ that maximizes
$$L(\psi \mid G) = \prod_{i=1}^{m} p(G_i \mid \psi)$$
* gene tree = genealogy of a recombination-free genomic region
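The product over loci is usually evaluated on the log scale to avoid underflow. The sketch below does this for gene tree topologies written as strings; the probability table `toy_pmf` is a made-up stand-in for $p(G_i \mid \psi)$, not a value from the talk, and the function name is hypothetical.

```python
import math

def gene_tree_log_likelihood(gene_trees, tree_prob):
    """log L(psi | G) = sum over loci of log p(G_i | psi)."""
    total = 0.0
    for g in gene_trees:
        p = tree_prob(g)
        if p <= 0.0:
            return float("-inf")   # one impossible gene tree rules out the candidate
        total += math.log(p)
    return total

# Hypothetical topology probabilities under one candidate species tree.
toy_pmf = {"((A,B),C)": 0.6, "((A,C),B)": 0.2, "((B,C),A)": 0.2}
loci = ["((A,B),C)"] * 7 + ["((A,C),B)"] * 2 + ["((B,C),A)"]
toy_ll = gene_tree_log_likelihood(loci, lambda g: toy_pmf.get(g, 0.0))
```

Comparing this quantity across candidate species trees is what the ML search does; everything specific to the coalescent lives inside `tree_prob`.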

ML Inference in the Presence of ILS
So, what is $p(G_i \mid \psi)$ in
$$L(\psi \mid G) = \prod_{i=1}^{m} p(G_i \mid \psi)\,?$$

The Multi-Species Coalescent
[Figure: species tree with the labels MRCA(H,C,G), MRCA(C,G), T_2, and T_1.]

The Multi-Species Coalescent
Rannala and Yang (Genetics, 2003) derived the density function of gene trees (topology + branch lengths) given a species tree.
Degnan and Salter (Evolution, 2005) gave the probability mass function of gene trees (topology alone) given a species tree.
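For three species with one sampled allele each, the topology probabilities have a well-known closed form, which makes a convenient sanity check. The sketch below assumes the species tree ((A,B),C) with internal branch length $t$ in coalescent units; the function names are hypothetical, and the closed-form estimate of $t$ holds only for this fixed three-taxon topology.

```python
import math

def three_taxon_probs(t):
    """Gene tree topology probabilities under species tree ((A,B):t,C)."""
    discordant = math.exp(-t) / 3.0       # no coalescence on the internal branch, then 1/3 each
    return {"((A,B),C)": 1.0 - 2.0 * discordant,
            "((A,C),B)": discordant,
            "((B,C),A)": discordant}

def mle_internal_branch(n_match, n_total):
    """ML estimate of t from the fraction of gene trees matching the species tree."""
    frac = n_match / n_total
    if frac <= 1.0 / 3.0:
        return 0.0                        # boundary: no signal for an internal branch
    if frac >= 1.0:
        return float("inf")               # boundary: no discordance observed
    return -math.log(1.5 * (1.0 - frac))

t_hat = mle_internal_branch(7, 10)
probs_at_t_hat = three_taxon_probs(t_hat)
```

At the estimate, the predicted probability of the matching topology equals the observed matching fraction, which is what setting the derivative of the multinomial log-likelihood to zero gives.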

Maximum Likelihood Inference of Species Trees
ML inference from gene trees with branch lengths under the Rannala & Yang model:
Laura S. Kubatko, Bryan C. Carstens, and L. Lacey Knowles. STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics (Applications Note, Phylogenetics), Vol. 25 no. 7, 2009, pages 971-973. doi:10.1093/bioinformatics/btp079. Affiliations: Departments of Statistics and Evolution, Ecology, and Organismal Biology, The Ohio State University, Columbus, OH 43210; Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803; Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA. Received on November 28, 2008; revised and accepted on February 4, 2009; Advance Access publication February 10, 2009. Associate Editor: Martin Bishop.
ML inference from gene tree topologies alone under the Degnan & Salter model:
Yufeng Wu. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Department of Computer Science and Engineering, University of Connecticut, Storrs, Connecticut 06269. E-mail: ywu@engr.uconn.edu

Hybridization, ILS, and Species Network Inference

Hybridization
[Figure: taxa A, B, C.]

ILS + Hybridization
[Figure: taxa A, B, C.]

Phylogenetic Networks
A leaf-labeled, rooted, directed, acyclic graph (rDAG).
[Figure: network on leaves A, B, C, with labels for a tree node, a reticulation node, and branch lengths (in coalescent units).]

The Model
Phylogenetic network $\psi$ with branch lengths (in coalescent units)
Inheritance probabilities $\gamma$, one per locus per reticulation node
[Figure: network on leaves A, B, C.]
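To see how an inheritance probability enters gene tree probabilities, consider the smallest case: a network on A, B, C in which leaf B is a hybrid whose single sampled lineage traces to the A side with probability $\gamma$ and to the C side with probability $1-\gamma$. Under these assumptions (one allele per species, a single shared $\gamma$, hypothetical branch lengths `t_ab` and `t_bc` above the two parents) the gene tree distribution is a mixture of two three-taxon tree distributions. This is only a sketch of that special case, not the general computation described in the talk.

```python
import math

def tree_case(t):
    """(concordant, each discordant) topology probability for internal branch length t."""
    d = math.exp(-t) / 3.0
    return 1.0 - 2.0 * d, d

def hybrid_leaf_probs(gamma, t_ab, t_bc):
    """Gene tree topology probabilities when B is a hybrid leaf with one sampled lineage.

    With probability gamma the B lineage follows the parent shared with A
    (parental tree ((A,B):t_ab,C)); otherwise it follows the parent shared
    with C (parental tree (A,(B,C):t_bc)).
    """
    conc_ab, disc_ab = tree_case(t_ab)
    conc_bc, disc_bc = tree_case(t_bc)
    return {"((A,B),C)": gamma * conc_ab + (1.0 - gamma) * disc_bc,
            "((B,C),A)": gamma * disc_ab + (1.0 - gamma) * conc_bc,
            "((A,C),B)": gamma * disc_ab + (1.0 - gamma) * disc_bc}

mix = hybrid_leaf_probs(gamma=0.6, t_ab=1.0, t_bc=2.0)
```

Setting `gamma` to 1 or 0 recovers the corresponding species tree, which is one way to read a tree as a network without reticulations.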

The Data
Sequence alignments for $m$ (independent) loci, $S = \{S_1, S_2, \ldots, S_m\}$

ML Inference of Phylogenetic Networks in the Presence of ILS
A maximum likelihood approach: seek $(\psi,\gamma)$ that maximizes
Hybridization + ILS (network):
$$L(\psi,\gamma \mid S) = \prod_{i=1}^{m} \int_{g} P(S_i \mid g)\, p(g \mid \psi,\gamma)\, dg$$
Contrast with ILS only (tree):
$$L(\psi \mid S) = \prod_{i=1}^{m} \int_{g} P(S_i \mid g)\, p(g \mid \psi)\, dg$$

ML Inference of Phylogenetic Networks in the Presence of ILS: The Gene Tree Shortcut
Suppose that for every locus $i$, a gene tree $G_i$ has been inferred, so that the input is $G = \{G_1, G_2, \ldots, G_m\}$.
The inference problem (under ML) becomes: seek the pair $(\psi,\gamma)$ that maximizes
$$L(\psi,\gamma \mid G) = \prod_{i=1}^{m} p(G_i \mid \psi,\gamma)$$

ML Inference of Phylogenetic Networks in the Presence of ILS: The Gene Tree Shortcut
To enable such inference, we need to figure out what $p(G_i \mid \psi,\gamma)$ is, and how to search the space of phylogenetic networks.

Computing $p(G_i \mid \psi,\gamma)$: Coalescent Histories
Given a phylogenetic network $\psi$ and a gene tree $G_i$, an element that is central to computing gene tree distributions is the set of coalescent histories, denoted by $H_\psi(G_i)$.

[Figure: a network on leaves A, B, C, a gene tree $G_i$ on alleles a, b1, b2, c, and its ten coalescent histories h1 to h10 drawn within the network.]

Computing $p(G_i \mid \psi,\gamma)$: $G_i$ is given by its topology alone
When gene tree $G_i$ is given by its topology alone:
$$P(G_i \mid \psi,\gamma) = \sum_{h \in H_\psi(G_i)} P(h \mid \psi,\gamma)$$
$$P(h \mid \psi,\gamma) = \frac{w(h)}{d(h)} \prod_{b \in E(\psi)} \frac{w_b(h)}{d_b(h)}\, \gamma[b,j]^{u_b(h)}\, p_{u_b(h)\,v_b(h)}(\lambda_b)$$
[Yu, Degnan, Nakhleh, PLoS Genetics, 2012.] [Yu, Ristic, Nakhleh, BMC Bioinformatics, 2013.]

Computing $p(G_i \mid \psi,\gamma)$: $G_i$ is given by its topology and branch lengths
We also derived the density function of gene trees (with branch lengths) given a species network:
$$p(G_i \mid \psi,\gamma) = \sum_{h \in H_\psi(G_i)} p(h \mid \psi,\gamma)$$
$$p(h \mid \psi,\gamma) = \prod_{b \in E(\psi)} \left[ \prod_{i=1}^{|T_b(h)|-1} e^{-\binom{u_b(h)-i+1}{2}\left(T_b(h)_{i+1} - T_b(h)_i\right)} \times e^{-\binom{v_b(h)}{2}\left(\tau(b) - T_b(h)_{|T_b(h)|}\right)} \times \gamma[b,j]^{u_b(h)} \right]$$
[Yu, Dong, Liu, Nakhleh, PNAS, 2014.]

Accounting For Gene Tree Uncertainty
$$p(\mathcal{G}_i \mid \psi,\gamma) = \left( \sum_{g \in \mathcal{G}_i} p(g \mid \psi,\gamma) \right) \Big/ \, |\mathcal{G}_i|$$
where $\mathcal{G}_i$ is a set of trees for each locus $i$.
[Yu, Dong, Liu, Nakhleh, PNAS, 2014.]
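The averaging above is simple to express in code. In the sketch below each locus contributes a list of candidate gene trees (for instance bootstrap replicates, which is an assumption here), and `toy_pmf` is a made-up stand-in for $p(g \mid \psi,\gamma)$; the function names are hypothetical.

```python
import math

def locus_probability(candidate_trees, tree_prob):
    """Average of p(g | psi, gamma) over the set of candidate trees for one locus."""
    if not candidate_trees:
        raise ValueError("each locus needs at least one candidate gene tree")
    return sum(tree_prob(g) for g in candidate_trees) / len(candidate_trees)

def log_likelihood_with_uncertainty(loci, tree_prob):
    """Sum over loci of the log of the averaged probability."""
    return sum(math.log(locus_probability(trees, tree_prob)) for trees in loci)

# Hypothetical topology probabilities under one candidate network.
toy_pmf = {"((A,B),C)": 0.6, "((A,C),B)": 0.2, "((B,C),A)": 0.2}
loci_sets = [["((A,B),C)", "((A,C),B)"],      # locus with two candidate trees
             ["((A,B),C)"]]                   # locus with a single tree
ll_uncertain = log_likelihood_with_uncertainty(loci_sets, lambda g: toy_pmf.get(g, 0.0))
```

A locus whose candidate set contains a single tree reduces to the plain gene tree shortcut.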

Searching the Network Space and Accounting for Model Complexity
We need a set of operations that transform networks (similar to tree operations such as SPR, TBR, and NNI), and mechanisms to control for network complexity (networks with more hybridizations would naturally fit the data better!).

Searching the Network Space and Accounting for Model Complexity
[Figure: the network space drawn as layers Ω(n,0), Ω(n,1), ..., Ω(n,k), with multiple starting points, search within a layer, moves ascending or descending a layer, and (potentially convergent) endpoints.]
[Yu, Dong, Liu, Nakhleh, PNAS, 2014.]

Searching the Network Space and Accounting for Model Complexity We use standard information criteria (AIC and BIC) and have introduced the use of cross-validation to control for model complexity. [Yu, Dong, Liu, Nakhleh, PNAS, 2014.]
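As a reminder of how the two criteria trade fit against complexity, the sketch below scores three candidate models with the textbook definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The log-likelihoods and parameter counts are invented for illustration, and how parameters and the sample size are counted for a network (here, n is taken to be the number of loci) is an assumption rather than something the slides state.

```python
import math

def aic(log_lik, num_params):
    """Akaike information criterion; lower is better."""
    return 2.0 * num_params - 2.0 * log_lik

def bic(log_lik, num_params, sample_size):
    """Bayesian information criterion; lower is better."""
    return num_params * math.log(sample_size) - 2.0 * log_lik

# Hypothetical candidates: reticulation count -> (log-likelihood, number of parameters).
candidates = {0: (-160.0, 3), 1: (-151.6, 6), 2: (-151.0, 9)}
num_loci = 106  # assumed sample size

best_by_aic = min(candidates, key=lambda r: aic(*candidates[r]))
best_by_bic = min(candidates, key=lambda r: bic(*candidates[r], num_loci))
```

In this toy setting the second reticulation improves the likelihood too little to pay for its extra parameters, which is the behaviour the criteria are meant to enforce.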

Results: Simulated Data
[Figure, panel A: three networks Ψ1, Ψ2, Ψ3 on taxa A, B, C, D. Panel B: counts of inferred networks (0 to 30) against number of loci sampled (10, 20, 40, 80, 160), using gene tree topologies. Panel C: the same axes, using gene tree topologies + branch lengths. Legend: sequence lengths = 250, 500, 1000, and true gene trees.]
[Yu, Dong, Liu, Nakhleh, PNAS, 2014.]

Results: A Mouse Data Set
[Figure: inferred network with leaf labels DF, DG, MZ, MK, MC and annotated values ~1.96, ~0.001, ~0.08, ~1.05, 0.045~0.054, 0.063~0.068, shown with a map labeled Russia, Germany, Poland, Ukraine, Kazakhstan, France, Czech Republic, China.]
[Yu, Dong, Liu, Nakhleh, PNAS, 2014.]

PhyloNet
All the methods are implemented in PhyloNet (open source, Java): http://bioinfo.cs.rice.edu/phylonet
Input and output use an (extended) Nexus format.
Networks are produced in the Rich Newick format, which can be read by Dendroscope for visualization and figure generation.

PhyloNet
[Slide showing the first pages of three papers, labeled "Parsimony" and "Phylogenetic networks + HMMs".]

Parsimony
Yun Yu, R. Matthew Barnett, and Luay Nakhleh. Parsimonious Inference of Hybridization in the Presence of Incomplete Lineage Sorting. Syst. Biol. 62(5):738-751, 2013. DOI:10.1093/sysbio/syt037. Advance Access publication June 4, 2013. Department of Computer Science, Rice University, 6100 Main Street, Houston, TX 77005, USA. Correspondence to be sent to: Luay Nakhleh, Department of Computer Science, Rice University, 6100 Main Street, Houston, TX 77005, USA; E-mail: yy9@rice.edu; nakhleh@rice.edu. Received 25 September 2012; reviews returned 26 November 2012; accepted 28 May 2013. Associate Editor: Laura Kubatko.
Abstract. Hybridization plays an important evolutionary role in several groups of organisms. A phylogenetic approach to detect hybridization entails sequencing multiple loci across the genomes of a group of species of interest, reconstructing their gene trees, and taking their differences as indicators of hybridization. However, methods that follow this approach mostly ignore population effects, such as incomplete lineage sorting (ILS). Given that hybridization occurs between closely related organisms, ILS may very well be at play and, hence, must be accounted for in the analysis framework. To address this issue, we present a parsimony criterion for reconciling gene trees within the branches of a phylogenetic network, and a local search heuristic for inferring phylogenetic networks from collections of gene-tree topologies under this criterion. This framework enables phylogenetic analyses while accounting for both hybridization and ILS. Further, we propose two techniques for incorporating information about uncertainty in gene-tree estimates. Our simulation studies demonstrate the good performance of our framework in terms of identifying the location of hybridization events, as well as estimating the proportions of genes that underwent hybridization. Also, our framework shows good performance in terms of efficiency on handling large data sets in our experiments. Further, in analysing a yeast data set, we demonstrate issues that arise when analysing real data sets. Although a probabilistic approach was recently introduced for this problem, and although parsimonious reconciliations have accuracy issues under certain settings, our parsimony framework provides a much more computationally efficient technique for this type of analysis. Our framework now allows for genome-wide scans for hybridization, while accounting for ILS. [Phylogenetic networks; hybridization; incomplete lineage sorting; coalescent; multi-labeled trees.]

Phylogenetic networks + HMMs
Kevin J. Liu, Jingxuan Dai, Kathy Truong, Ying Song, Michael H. Kohn, and Luay Nakhleh. An HMM-Based Comparative Genomic Framework for Detecting Introgression in Eukaryotes. Affiliations: Department of Computer Science, Rice University, Houston, Texas, United States of America; Department of Ecology and Evolutionary Biology, Rice University, Houston, Texas, United States of America; The State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China.
Abstract. One outcome of interspecific hybridization and subsequent effects of evolutionary forces is introgression, which is the integration of genetic material from one species into the genome of an individual in another species. The evolution of several groups of eukaryotic species has involved hybridization, and cases of adaptation through introgression have been already established. In this work, we report on PhyloNet-HMM, a new comparative genomic framework for detecting introgression in genomes. PhyloNet-HMM combines phylogenetic networks with hidden Markov models (HMMs) to simultaneously capture the (potentially reticulate) evolutionary history of the genomes and dependencies within genomes. A novel aspect of our work is that it also accounts for incomplete lineage sorting and dependence across loci. Application of our model to variation data from chromosome 7 in the mouse (Mus musculus domesticus) genome detected a recently reported adaptive introgression event involving the rodent poison resistance gene Vkorc1, in addition to other newly detected introgressed genomic regions. Based on our analysis, it is estimated that about 9% of all sites within chromosome ...

Kevin J. Liu, Ethan Steinberg, Alexander Yozzo, Ying Song, Michael H. Kohn, and Luay Nakhleh. Interspecific introgressive origin of genomic diversity in the house mouse. Departments of Computer Science and BioSciences, Rice University, Houston, TX 77005. Edited by John C. Avise, University of California, Irvine, CA, and approved November 12, 2014 (received for review April 4, 2014).
Abstract. We report on a genome-wide scan for introgression between the house mouse (Mus musculus domesticus) and the Algerian mouse (Mus spretus), using samples from ranges of sympatry and allopatry in Africa and Europe. Our analysis reveals wide variability in introgression signatures along the genomes, as well as across the samples. We find that fewer than half of the autosomes ...
Introduction (fragment). ... introgression from M. spretus into some M. m. domesticus populations in the wild, involving the vitamin K epoxide reductase subcomponent 1 (Vkorc1) gene, which was later shown to be more widespread in Europe, albeit geographically restricted to parts of southwestern and central Europe (11). Major, unanswered questions arise from these studies. First, is ...

PhyloNet
Input NEXUS file:
    #NEXUS
    BEGIN TREES;
    Tree gt0 = ((((Scer,Spar),Smik),Skud),Sbay);
    ...
    Tree gt105 = ((Scer,Spar),Smik,Skud,Sbay);
    END;
    BEGIN PHYLONET;
    InferNetwork_ML (gt0,...,gt105) 1;
    END;
Output:
    Inferred Network #1: ((Sbay:1.0)#H1:1.0::0.6130,((Smik:1.0,(Scer:1.0,Spar:1.0):3.5436):1.0585,(#H1:1.0::0.3869,Skud:1.0):2.1717):5.9272);
    Total log probability: -151.57753843275103
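Input files of this shape are easy to generate, and the inheritance probabilities can be read back from the Rich Newick output. The sketch below mirrors only the layout shown on the slide; it treats the trailing integer of the `InferNetwork_ML` line as the maximum number of reticulations, which is an assumption here, and the helper names `build_nexus` and `inheritance_probabilities` are hypothetical, not part of PhyloNet.

```python
import re

def build_nexus(gene_trees, max_reticulations):
    """Assemble a NEXUS input with a TREES block and a PHYLONET block."""
    names = [f"gt{i}" for i in range(len(gene_trees))]
    lines = ["#NEXUS", "BEGIN TREES;"]
    for name, newick in zip(names, gene_trees):
        lines.append(f"Tree {name} = {newick.strip().rstrip(';')};")
    lines += ["END;", "BEGIN PHYLONET;",
              f"InferNetwork_ML ({','.join(names)}) {max_reticulations};",
              "END;"]
    return "\n".join(lines) + "\n"

def inheritance_probabilities(rich_newick):
    """Collect the values that follow '::' in a Rich Newick string."""
    return [float(x) for x in re.findall(r"::([0-9]*\.?[0-9]+)", rich_newick)]

nexus_text = build_nexus(["((((Scer,Spar),Smik),Skud),Sbay);",
                          "((Scer,Spar),Smik,Skud,Sbay);"], 1)

inferred = ("((Sbay:1.0)#H1:1.0::0.6130,((Smik:1.0,(Scer:1.0,Spar:1.0):3.5436):1.0585,"
            "(#H1:1.0::0.3869,Skud:1.0):2.1717):5.9272);")
gammas = inheritance_probabilities(inferred)
```

For the network on the slide, the two values attached to #H1 are the inheritance probabilities of the hybrid Sbay from its two parents, and they sum to approximately 1.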

Summary
We extended the multi-species coalescent to phylogenetic networks to account for both ILS and hybridization.
We account for gene tree uncertainty.
We account for model complexity using information criteria and cross-validation.
We also have a parsimony formulation and solution.
We also have a phylogenetic network + HMM framework.
All methods are implemented in PhyloNet.

THANK YOU
http://bioinfo.cs.rice.edu/phylonet
Collaborators: R.M. Barnett (Rice), J. Dai (Rice), J.H. Degnan (UNM), J. Dong (Rice), M.H. Kohn (Rice), K. Liu (Michigan State University), Y. Song (Rice), E. Steinberg (Rice), K. Truong (Rice), A. Yozzo (Rice), Y. Yu (Rice)
Funding: NSF, NIH, Sloan Foundation, Guggenheim Foundation