STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization)


Laura Salter Kubatko
Departments of Statistics and Evolution, Ecology, and Organismal Biology
The Ohio State University
kubatko.2@osu.edu
June 7, 2013

Outline
- What is STEM-hy?
- Assumptions and Methods
- Data Preparation
- Example 2: Small Example with Missing Data
- Example 3: Bootstrap Analysis
- Background: STEM's Hybrid Species Models

What is STEM-hy?
STEM-hy is a program that performs maximum likelihood estimation of the species tree from multilocus data under the coalescent process. It can also evaluate hybrid taxa.
Basic functions:
- Return the ML species tree.
- Search the space of all species trees and return the k trees with the highest likelihoods found.
- Compute the likelihood of a user-specified tree with branch lengths.
- Find optimal branch lengths on a user-specified tree.
- Carry out a bootstrap analysis to obtain bootstrap support values for nodes in the species tree.
- Evaluate hypotheses of hybridization in a model selection framework.

Assumptions
- No recombination within loci
- Free recombination between loci
- No gene flow following speciation
- The only source of variability in single-gene histories is the coalescent process
- There is a single θ for the entire tree, for each locus
- Evolutionary rates may vary across loci

Methods: ML Estimate of the Species Tree
Liu et al. (2009) showed that the ML estimate of the species tree can be computed by sequentially clustering the minimum observed divergence times between pairs of species across genes. They also showed that when gene trees are known without error, the ML species tree is a consistent estimator. A similar result was obtained by Mossel & Roch (2010); they call their estimator the GLASS tree (an acronym for Global LAteSt Split, based on the algorithm they developed to compute it). STEM computes the ML estimate of the species tree this way.
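The clustering step described above can be sketched in a few lines of Python. This is only an illustration of sequential clustering on minimum divergence times (single-linkage clustering), not STEM's actual code. The function name `cluster_min_divergence` and the dictionary layout are assumptions made for the sketch; the numbers are the minimum species-level divergence times from the small four-species example worked by hand later in this talk.

```python
from itertools import combinations

def cluster_min_divergence(species, min_dist):
    """Sequentially join the two clusters with the smallest minimum divergence time.

    min_dist maps frozenset({a, b}) to the minimum divergence time between
    species a and b across all loci. Returns a Newick string.
    """
    # Each cluster is (newick_text, node_height, member_species).
    clusters = [(name, 0.0, frozenset([name])) for name in species]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(min_dist[frozenset({a, b})]
                    for a in clusters[i][2] for b in clusters[j][2])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        (text_i, height_i, members_i) = clusters[i]
        (text_j, height_j, members_j) = clusters[j]
        # Branch length is the new node height minus the child's height.
        merged = (f"({text_i}:{d - height_i:.5f},{text_j}:{d - height_j:.5f})",
                  d, members_i | members_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0] + ";"

species = ["Species1", "Species2", "Species3", "Species4"]
pairs = {("Species1", "Species2"): 1.2, ("Species1", "Species3"): 2.86,
         ("Species1", "Species4"): 3.46, ("Species2", "Species3"): 1.0,
         ("Species2", "Species4"): 2.4, ("Species3", "Species4"): 1.1}
min_dist = {frozenset(k): v for k, v in pairs.items()}
tree = cluster_min_divergence(species, min_dist)
print(tree)
```

For these inputs the sketch joins Species2 and Species3 first, then Species4, then Species1, which is the same topology and the same node heights that STEM reports for the small example.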

Methods: Estimation of ML Times for an Arbitrary Species Tree
The results of Liu et al. (2009) can be extended to derive the ML estimates of the speciation times for an arbitrary species tree. The likelihood of any species tree can therefore be computed readily by using this result to obtain ML branch lengths. This is important because it allows us to compare alternative phylogenetic hypotheses.

Methods: Searching Species Tree Space for Trees of High Likelihood
- A simulated annealing algorithm is used to search the space of all species trees for trees that have high likelihoods.
- The k best trees found during the search are saved and printed to a file (k is set by the user).
- Exploration of the likelihood surface is particularly important for many of these problems.
- The details of the simulated annealing algorithm are similar to those given in Salter & Pearl (2001).

Features of STEM-hy
- No limits (that I know of) on the number of taxa or the number of loci.
- Can handle intraspecific sampling.
- Allows information concerning the mutation rate for each locus to be used in the analysis.
- Can handle different taxon samples across genes.
- Version 1.1 is written in Java (using Clojure).

Data Preparation - Gene Trees
STEM-hy takes as its input one gene tree for each locus. Thus, a first step in an analysis using STEM-hy is to estimate gene trees with branch lengths for each locus. Any method can be used to do this, but note a couple of requirements:
- Branch lengths are assumed to be in units of expected number of substitutions per site per unit time.
- Branch lengths must be estimated subject to a molecular clock. This is not checked by the program.
- Gene trees must be fully resolved; however, polytomies can be included by setting branch lengths to 0 for an arbitrary resolution of the polytomy.

Data Preparation - Population Genetics Parameters
A value of the parameter θ = 4Nµ must be provided. Note that this is the per-site θ, not a per-locus value as used by other population genetics programs. It is used to convert gene tree branch lengths to coalescent units (number of 2N generations) by dividing all gene tree branch lengths by θ. Estimates of θ can be obtained by standard methods. Typical values of θ will be between 0.001 and 0.1.
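As a small illustration of the conversion described above, the following Python sketch divides a few gene tree node heights by θ. The heights are taken from the first gene tree of the small example and θ = 0.001 matches the example settings file; the variable names are assumptions made for the sketch, and rate multipliers are not applied here.

```python
theta = 0.001  # per-site theta, as in the example settings file

# Node heights (substitutions per site) read off the first example gene tree.
gene_tree_heights = [0.00123, 0.00246, 0.00346]

# Dividing by theta rescales the heights as described in the text.
scaled_heights = [round(height / theta, 5) for height in gene_tree_heights]
print(scaled_heights)
```

The rescaled values (1.23, 2.46, 3.46) are the entries that appear in the hand-computed distance matrices for locus 1.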

Data Preparation - Population Genetics and Mutation Parameters
Each locus can also be given a rate multiplier. These can adjust for:
- Variation in mutation rate across loci.
- Ploidy (e.g., haploid loci such as mtDNA should be given a rate of 0.5).
At the least, one should estimate rate variation from the data by something like the following:
1. Compute the average pairwise sequence divergence of each sequence to the outgroup.
2. Divide all of these values by their overall mean, and assign that number as the rate multiplier for each gene.
3. Adjust specific genes for ploidy, if necessary.
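The three steps above can be sketched as follows. This is a minimal illustration, not part of STEM-hy: the function name `rate_multipliers`, the locus names, and the divergence values are all hypothetical, and the sketch assumes the average divergence to the outgroup has already been computed for each gene.

```python
def rate_multipliers(mean_divergence, ploidy_factor=None):
    """Scale per-gene mean divergence to the outgroup so the rates average to 1,
    then apply optional per-gene ploidy factors (e.g. 0.5 for a haploid locus)."""
    overall_mean = sum(mean_divergence.values()) / len(mean_divergence)
    rates = {gene: d / overall_mean for gene, d in mean_divergence.items()}
    for gene, factor in (ploidy_factor or {}).items():
        rates[gene] *= factor
    return rates

# Hypothetical average divergences to the outgroup for three loci.
divergence = {"locusA": 0.02, "locusB": 0.04, "mtDNA": 0.06}
unadjusted = rate_multipliers(divergence)
adjusted = rate_multipliers(divergence, ploidy_factor={"mtDNA": 0.5})
print(unadjusted)
print(adjusted)
```

Before the ploidy adjustment the multipliers average to 1 by construction, so they only redistribute rate among loci.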

Start with a small example where we can work things out by hand:
- Four species, eight lineages, and two loci (N = 2)
- Suppose that the gene trees for the two loci are as shown.
[Figure: two gene trees on lineages 1-8 (tips drawn in the order 8 7 6 5 4 3 2 1). Node labels for locus 1: 3.46, 2.40, 2.46, 1.20, 1.00, 1.20, 1.23. Node labels for locus 2: 3.50, 2.86, 2.54, 1.10, 1.00, 1.20, 1.23.]

STEM
Now we can run STEM and look at the output. First, let's compute the relevant distances by hand.

Lineage-level distances {D_ab^1} for locus 1:
       2     3     4     5     6     7     8
1   1.23  2.46  2.46  3.46  3.46  3.46  3.46
2      -  2.46  2.46  3.46  3.46  3.46  3.46
3      -     -   1.2  3.46  3.46  3.46  3.46
4      -     -     -  3.46  3.46  3.46  3.46
5      -     -     -     -   1.0   2.4   2.4
6      -     -     -     -     -   2.4   2.4
7      -     -     -     -     -     -   1.2

Lineage-level distances {D_ab^2} for locus 2:
       2     3     4     5     6     7     8
1   1.23  2.56  2.56  2.86  2.86   3.5   3.5
2      -  2.56  2.56  2.86  2.86   3.5   3.5
3      -     -   1.2  2.86  2.86   3.5   3.5
4      -     -     -  2.86  2.86   3.5   3.5
5      -     -     -     -   1.0   3.5   3.5
6      -     -     -     -     -   3.5   3.5
7      -     -     -     -     -     -   1.1

Species-level minimum distances for locus 1:
      S1    S2    S3    S4
S1     -   1.2  3.46  3.46
S2           -   1.0   2.4
S3                 -   1.2
S4                       -

Species-level minimum distances for locus 2:
      S1    S2    S3    S4
S1     -   1.2  2.86   3.5
S2           -   1.0   3.5
S3                 -   1.1
S4                       -

Minimum over both loci:
      S1    S2    S3    S4
S1     -   1.2  2.86  3.46
S2           -   1.0   2.4
S3                 -   1.1
S4                       -

[Figure: the resulting species tree, with tips drawn in the order 3, 2, 4, 1 and node heights 1.0, 1.1, and 1.2.]
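The hand calculation above can be reproduced with a short Python sketch. It assumes the lineage-to-species assignment from the example settings file (lineages 1-3 in S1, 4-5 in S2, 6-7 in S3, 8 in S4); the function names `species_minimum` and `combine_loci` and the row-based input layout are choices made for this illustration, not anything taken from STEM.

```python
species_of = {1: "S1", 2: "S1", 3: "S1", 4: "S2", 5: "S2", 6: "S3", 7: "S3", 8: "S4"}

# Upper-triangular rows: row i lists the distances from lineage i to lineages i+1..8.
locus1 = {1: [1.23, 2.46, 2.46, 3.46, 3.46, 3.46, 3.46],
          2: [2.46, 2.46, 3.46, 3.46, 3.46, 3.46],
          3: [1.2, 3.46, 3.46, 3.46, 3.46],
          4: [3.46, 3.46, 3.46, 3.46],
          5: [1.0, 2.4, 2.4],
          6: [2.4, 2.4],
          7: [1.2]}
locus2 = {1: [1.23, 2.56, 2.56, 2.86, 2.86, 3.5, 3.5],
          2: [2.56, 2.56, 2.86, 2.86, 3.5, 3.5],
          3: [1.2, 2.86, 2.86, 3.5, 3.5],
          4: [2.86, 2.86, 3.5, 3.5],
          5: [1.0, 3.5, 3.5],
          6: [3.5, 3.5],
          7: [1.1]}

def species_minimum(rows, species_of):
    """Minimum lineage-level distance for every pair of distinct species."""
    best = {}
    for i, row in rows.items():
        for offset, distance in enumerate(row):
            j = i + 1 + offset
            a, b = species_of[i], species_of[j]
            if a == b:
                continue  # lineages from the same species are not compared
            key = tuple(sorted((a, b)))
            best[key] = min(distance, best.get(key, float("inf")))
    return best

def combine_loci(*per_locus):
    """Element-wise minimum over loci; a pair missing from a locus is skipped."""
    keys = set().union(*per_locus)
    return {k: min(m[k] for m in per_locus if k in m) for k in keys}

per_locus = [species_minimum(locus1, species_of), species_minimum(locus2, species_of)]
combined = combine_loci(*per_locus)
for pair in sorted(combined):
    print(pair, combined[pair])
```

Skipping pairs that are absent from a locus is one simple way to treat a locus that lacks some lineages; whether STEM handles missing lineages exactly this way is not stated in the slides.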

Step 1: Prepare the gene trees
Option 1: Place all gene trees in a single file called genetrees.tre:
- Newick format required
- One gene tree per line
- Rate multipliers must be given in brackets in front of each gene tree
[1.0](((Name1:0.00123,Name2:0.00123):0.00123,(Name3:0.00121,Name4:0.00121):0.00125):0.0010,((Name5:0.0010,Name6:0.0010):0.0014,(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);
[1.0]((((Name1:0.00123,Name2:0.00123):0.00133,(Name3:0.0012,Name4:0.0012):0.00134):0.0003,(Name5:0.0010,Name6:0.0010):0.00186):0.00064,(MyName7:0.0011,Name8:0.0011):0.0024);
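Before running the program it can help to sanity-check each line of genetrees.tre. The sketch below is a hypothetical helper, not part of STEM-hy: it splits a line into its bracketed rate multiplier and Newick string, checks that parentheses balance, and lists the tip names with a simple regular expression (which assumes every tip has a branch length and that names contain no parentheses, colons, commas, or spaces).

```python
import re

# One line of genetrees.tre: "[rate]" followed by a Newick string ending in ";".
LINE_PATTERN = re.compile(r"^\[(?P<rate>[^\]]+)\]\s*(?P<newick>\(.*\);)$")
# A tip name follows "(" or "," and is followed by ":" and a branch length.
TIP_PATTERN = re.compile(r"[(,]\s*([^():,;\s]+):")

def parse_gene_tree_line(line):
    """Return (rate multiplier, Newick string, tip names) for one input line."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("expected a line of the form [rate](newick);")
    newick = match.group("newick")
    if newick.count("(") != newick.count(")"):
        raise ValueError("unbalanced parentheses in Newick string")
    return float(match.group("rate")), newick, TIP_PATTERN.findall(newick)

example = ("[1.0](((Name1:0.00123,Name2:0.00123):0.00123,"
           "(Name3:0.00121,Name4:0.00121):0.00125):0.0010,"
           "((Name5:0.0010,Name6:0.0010):0.0014,"
           "(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);")
rate, newick, tips = parse_gene_tree_line(example)
print(rate, tips)
```

The tip list can then be compared against the lineage names in the settings file to catch typos such as a missing or misspelled lineage.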

Step 1: Prepare the gene trees
Option 2: Place sets of gene trees in separate files
- File names will be supplied to STEM-hy in the settings file
- Rate multipliers will also be supplied in the settings file
- All genes in a single file are assumed to have the same rate
genetrees1.tre:
(((Name1:0.00123,Name2:0.00123):0.00123,(Name3:0.00121,Name4:0.00121):0.00125):0.0010,((Name5:0.0010,Name6:0.0010):0.0014,(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);
genetrees2.tre:
((((Name1:0.00123,Name2:0.00123):0.00133,(Name3:0.0012,Name4:0.0012):0.00134):0.0003,(Name5:0.0010,Name6:0.0010):0.00186):0.00064,(MyName7:0.0011,Name8:0.0011):0.0024);

Step 2: Prepare the settings file - input option 1
yaml format: headings with indented parameters defined below
properties:
  run: 1            #0=user-tree, 1=MLE, 2=search, 3=hybridization, 4=bootstrap
  theta: 0.001
  num saved trees: 15
  beta: 0.0005
  seed: 3435893
species:
  Species1: Name1, Name2, Name3
  Species2: Name4, Name5
  Species3: Name6, MyName7
  Species4: Name8

Step 2: Prepare the settings file - input option 2
yaml format: headings with indented parameters defined below
properties:
  run: 1            #0=user-tree, 1=MLE, 2=search, 3=hybridization, 4=bootstrap
  theta: 0.001
  num saved trees: 15
  beta: 0.0005
  seed: 3435893
species:
  Species1: Name1, Name2, Name3
  Species2: Name4, Name5
  Species3: Name6, MyName7
  Species4: Name8
files:
  genetrees1.tre: 1.0   # notice the space after each :
  genetrees2.tre: 1.0

Some parameters are only used for certain run settings. They are ignored otherwise, and can be omitted from the settings file.

Results
Analysis 1: Find the ML species tree (run with run: 1)
Run at the command line with:
java -jar stem-hy.jar
***************************************
** Welcome to STEM 2.0 **
***************************************
The settings file was successfully parsed...
Using theta = 0.0010
The settings file contained 4 species and 8 lineages.
The species-to-lineage mappings are:
Species4: Name8
Species3: MyName7, Name6
Species2: Name4, Name5
Species1: Name1, Name2, Name3

Results
Analysis 1: Find the ML species tree (run with run: 1)
Run at the command line with:
java -jar stem.jar
Results are written to the file mle.tre
****************Results*****************
D AB Matrix:
[ 0.00000 1.20000 2.86000 3.46000]
[ 0.00000 0.00000 1.00000 2.40000]
[ 0.00000 0.00000 0.00000 1.10000]
[ 0.00000 0.00000 0.00000 0.00000]
Maximum Likelihood Species Tree (Newick format):
(Species1:1.20000,(Species4:1.10000,(Species2:1.00000,Species3:1.00000):0.10000):0.10000);
Log likelihood for tree: -52.43701947216076
****************** Done ****************

Results
Analysis 2: Find likelihood of all 15 trees (run with run: 2)
Output:
***************************************
** Welcome to STEM 2.0 **
***************************************
The settings file was successfully parsed...
...
Beginning search now (this could take a while)...
Search completed. Here are the results (also written to file search.tre):
[-52.43702] (Species1:1.20000,(Species4:1.10000,(Species2:1.00000,Species3:1.00000):0.10000):0.10000);
[-53.63718] (Species1:1.20000,(Species3:1.00000,(Species2:1.00000,Species4:1.00000):0.00000):0.20000);
[-56.63684] ((Species4:1.10000,Species1:1.10000):0.00000,(Species2:1.00000,Species3:1.00000):0.10000);
[-56.63720] (Species4:1.10000,(Species1:1.10000,(Species2:1.00000,Species3:1.00000):0.10000):0.00000);
[-60.23760] (Species4:1.10000,(Species2:1.00000,(Species1:1.00000,Species3:1.00000):0.00000):0.10000);
[-62.63758] ((Species1:1.00000,Species3:1.00000):0.00000,(Species2:1.00000,Species4:1.00000):0.00000);
[-62.63778] (Species3:1.00000,(Species1:1.00000,(Species2:1.00000,Species4:1.00000):0.00000):0.00000);
[-62.63790] (Species2:1.00000,((Species1:1.00000,Species4:1.00000):0.00000,Species3:1.00000):0.00000);
[-62.63806] (Species2:1.00000,((Species1:1.00000,Species3:1.00000):0.00000,Species4:1.00000):0.00000);

Results
Analysis 3: Find the likelihood of a particular species tree
Place the tree(s) of interest in the file user.tre in the same directory as STEM-hy:
((Species1:0.000222,Species3:0.000222):0.016444,(Species2:0.000222,Species4:0.000222):0.016444);
Branch lengths must be included. STEM-hy gives the likelihood of the tree with the user-specified branch lengths, as well as the ML branch lengths along the user tree.

Results
***************************************
** Welcome to STEM 2.0 **
***************************************
The settings file was successfully parsed...
...
Read 1 species tree[s] from user.tre
****************Results*****************
User tree:
((Species1:0.000222,Species3:0.000222):0.016444,(Species2:0.000222,Species4:0.000222):0.016444)
Log likelihood for tree: -153.62929947216077
**************Optimized Trees************
Optimized user tree:
((Species1:0.99995,Species3:0.99995):0.00005,(Species2:0.99995,Species4:0.99995):0.00005);
Log likelihood: -62.63865947216076
****************** Done ****************

Example 2: Missing Data Example
genetrees.tre:
[1.0](((Name1:0.00123,Name2:0.00123):0.00123,(Name3:0.00121,Name4:0.00121):0.00125):0.0010,((Name5:0.0010,Name6:0.0010):0.0014,(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);
[1.0](((Name1:0.00123,Name2:0.00123):0.00133,(Name3:0.0012,Name4:0.0012):0.00134):0.00032,(Name5:0.0010,Name6:0.0010):0.00186);
[1.0](((Name1:0.00123,Name2:0.00123):0.00123,(Name3:0.00121,Name4:0.00121):0.00125):0.0010,((Name5:0.0010,Name6:0.0010):0.0014,(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);
[1.0]((((Name1:0.00123,Name2:0.00123):0.00133,(Name3:0.0012,Name4:0.0012):0.00134):0.0003,(Name5:0.0010,Name6:0.0010):0.00186):0.00064,(MyName7:0.0011,Name8:0.0011):0.0024);
[1.0](((Name1:0.00123,Name2:0.00123):0.00123,(Name3:0.00121,Name4:0.00121):0.00125):0.0010,((Name5:0.0010,Name6:0.0010):0.0014,(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);
[1.0]((((Name1:0.00123,Name2:0.00123):0.00133,(Name3:0.0012,Name4:0.0012):0.00134):0.0003,(Name5:0.0010,Name6:0.0010):0.00186):0.00064,(MyName7:0.0011,Name8:0.0011):0.0024);
[1.0](((Name1:0.00123,Name2:0.00123):0.00123,(Name3:0.00121,Name4:0.00121):0.00125):0.0010,((Name5:0.0010,Name6:0.0010):0.0014,(MyName7:0.0012,Name8:0.0012):0.0012):0.00106);
[1.0]((((Name1:0.00123,Name2:0.00123):0.00133,(Name3:0.0012,Name4:0.0012):0.00134):0.0003,(Name5:0.0010,Name6:0.0010):0.00186):0.00064,(MyName7:0.0011,Name8:0.0011):0.0024);

Example 2: Missing Data Example
Look at the gene trees:
[Figure: three gene tree shapes. The first, on all eight lineages (Name1 to Name8), occurs at 4 loci; the second, on Name1 to Name6 only, occurs at 1 locus; the third, on all eight lineages, occurs at 3 loci.]

Example 2: Missing Data Example
Note: The settings file remains unchanged. Below is the output.
****************Results*****************
D AB Matrix:
[ 0.00000 1.20000 2.86000 3.46000]
[ 0.00000 0.00000 1.00000 2.40000]
[ 0.00000 0.00000 0.00000 1.10000]
[ 0.00000 0.00000 0.00000 0.00000]
Maximum Likelihood Species Tree (Newick format):
(Species1:1.20000,(Species4:1.10000,(Species2:1.00000,Species3:1.00000):0.10000):0.10000);
log likelihood for tree: -967.874444171144
****************** Done ****************

Example Data: Heliconius Butterflies
[Figure: tree relating H. hecale, H. melpomene, H. heurippa, and H. cydno, with nodes labeled ABCD, BCD, BD, and CD and numbered 1 to 3.]

Example 3: Bootstrap Analysis
The current version of STEM-hy can be used to estimate bootstrap proportions on the ML tree, as well as to construct a bootstrap consensus tree.
- Sequence data must be provided in PHYLIP format (separate files need to be used for each gene).
- Each gene is bootstrapped a user-specified number of times, B, to produce B bootstrap samples (alignments) for each gene.
- Gene trees are estimated for each bootstrap sample using the program SSA. This program uses a simulated annealing method to estimate gene trees under the assumption of a molecular clock.
- B species trees are reconstructed using STEM-hy and printed to both the screen and the file bootstrap.results.
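To make the resampling step concrete, here is a minimal Python sketch of a nonparametric bootstrap of one alignment. It assumes the usual scheme of sampling alignment columns with replacement, which the slides do not spell out. The function names and the toy sequences are hypothetical; only the lineage names (M95, Hh, M187, Strib40) come from the example settings. STEM-hy and SSA carry out the real analysis; this only illustrates what a bootstrap sample of an alignment is.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping rows aligned."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    if any(len(alignment[name]) != n_sites for name in names):
        raise ValueError("all sequences must have the same length")
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(alignment[name][c] for c in columns) for name in names}

def bootstrap_replicates(alignment, n_replicates, seed):
    """Produce n_replicates bootstrap alignments from one seeded generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]

# Toy sequences, invented for illustration only.
toy = {"M95": "ACGTACGTAC", "Hh": "ACGTTCGTAC",
       "M187": "ACGAACGTAC", "Strib40": "ACGAACGTAA"}
replicates = bootstrap_replicates(toy, n_replicates=100, seed=3435893)
print(replicates[0])
```

Each replicate has the same taxa and the same length as the original alignment, and every column of a replicate is a whole column of the original, so the association between taxa at a site is preserved.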

Example 3: Bootstrap Analysis
For this example, we'll consider four taxa and six genes in Heliconius butterflies. The settings file is shown below (changes were highlighted in blue on the original slide).
properties:
  run: 4            #0=user-tree, 1=MLE, 2=search, 3=hybridization, 4=bootstrap
  bootstrap samples: 100
  phylip files: co 4tax.phy,dll 4tax.phy,inv 4tax.phy,sd 4tax.phy,tpi 4tax.phy,white 4tax.phy
  theta: 0.01
  num saved trees: 15
  beta: 0.0005
  seed: 3435893
species:
  H. melpomene: M95
  H. hecale: Hh
  H. cordula: M187
  H. heurippa: Strib40

Example 3: Bootstrap Analysis
Below is the output. All bootstrap trees are written to a file called bootstrap.results and can be read into another program and summarized.
...
The species-to-lineage mappings are:
H. heurippa: Strib40
H. cordula: M187
H. hecale: Hh
H. melpomene: M95
Bootstrapping trees (this might take a while)...
****************Results*****************
The maximum likelihood species tree estimate is:
(H. hecale:6.82133,(h. melpomene:0.74608,(h. heurippa:0.07658,h. cordula:0.07658):0.66950):6.07525);
The 100 bootstrapped species trees:
(H. heurippa:0.29907,(h. hecale:0.17664,(h. melpomene:0.12424,h. cydno:0.12424):0.05240):0.12243);
(H. hecale:1.52825,(h. melpomene:0.35022,(h. heurippa:0.31089,h. cydno:0.31089):0.03933):1.17803);

Some Notes on Program Versions
There are some important differences between STEM v1.1a and STEM v2.0 / STEM-hy v1.0.
Multifurcations are handled differently:
- STEM v1.1a and lower: Zero-length branches are set to 0.00001.
- STEM v2.0 / STEM-hy v1.0: Zero-length branches are treated as missing data.
Other big differences are improvements to the input format and increased functionality in later versions.

Background: STEM's Hybrid Species Models
STEM's Hybrid Species Model
[Figure: a three-taxon species tree on A, B, C subject to hybridization, shown with its two parental species trees. Each parental tree has internal branch length τ, and sequences evolve along the gene trees by a mutation process.]
- The species tree is subject to hybridization.
- A hybridization parameter γ models the extent of the contribution from each parent.
- There are two possible parental species trees, chosen with probabilities γ and 1 - γ.
- Under the coalescent model, each parental tree assigns a probability to each gene tree topology:
  Parental tree 1 (probability γ): $P(C(AB)) = 1 - \tfrac{2}{3}e^{-\tau}$, $P(A(BC)) = \tfrac{1}{3}e^{-\tau}$, $P(B(AC)) = \tfrac{1}{3}e^{-\tau}$
  Parental tree 2 (probability 1 - γ): $P(C(AB)) = \tfrac{1}{3}e^{-\tau}$, $P(A(BC)) = 1 - \tfrac{2}{3}e^{-\tau}$, $P(B(AC)) = \tfrac{1}{3}e^{-\tau}$
- Sequence evolution proceeds along gene trees.
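The topology probabilities on this slide can be evaluated numerically with a short Python sketch. The function names are hypothetical, and the second function simply mixes the two parental trees with weights γ and 1 - γ, which is the topology-level consequence of the model as described here rather than code from STEM-hy.

```python
import math

TOPOLOGIES = ("C(AB)", "A(BC)", "B(AC)")

def topology_probs(tau, matching):
    """Gene tree topology probabilities for a three-taxon species tree whose
    internal branch has length tau (coalescent units); 'matching' is the
    topology that agrees with the species tree."""
    discordant = math.exp(-tau) / 3.0
    return {t: (1.0 - 2.0 * discordant) if t == matching else discordant
            for t in TOPOLOGIES}

def hybrid_topology_probs(tau, gamma):
    """Mixture over the two parental trees with weights gamma and 1 - gamma."""
    tree1 = topology_probs(tau, "C(AB)")  # parental tree with probability gamma
    tree2 = topology_probs(tau, "A(BC)")  # parental tree with probability 1 - gamma
    return {t: gamma * tree1[t] + (1.0 - gamma) * tree2[t] for t in TOPOLOGIES}

probs = hybrid_topology_probs(tau=1.0, gamma=0.7)
for topology, p in probs.items():
    print(topology, round(p, 4))
```

With τ = 0 all three topologies are equally likely regardless of γ, and with γ = 1 the mixture reduces to the first parental tree.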

Inference of Trees Subject to Hybridization
Assumptions:
- Hybridization results in a mosaic genome, so that a sampled gene has a probability distribution over the several parental species trees from which its history may have originated
- Genes in the sample are independent given the species tree
- Hybridization events happen only between sister taxa
- No factors other than coalescence and hybridization lead to incongruence between gene trees and the species tree

Likelihood Calculation for the Three-taxon Case
Let $f(g_i \mid S)$ be the probability density of gene tree $g_i$ given species tree $S$ under the coalescent model (Rannala and Yang, 2003).
The likelihood function for the three-taxon case is
$$\prod_{i=1}^{N} \left\{ \gamma f(g_i \mid S_1) + (1-\gamma) f(g_i \mid S_2) \right\}$$
where $S_1$ and $S_2$ are the two possible parental species trees and $\gamma \in [0, 1]$.
[Figure: the two parental trees on A, B, C, each with internal branch length τ; the first contributes $f(g \mid S_1)$ with weight γ and the second contributes $f(g \mid S_2)$ with weight 1 - γ.]
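The three-taxon likelihood can be evaluated on the log scale as in the following Python sketch. The per-gene densities are placeholders invented for illustration (computing the coalescent densities themselves is outside the scope of the sketch), and the grid search over γ is just one simple way to locate the maximum, not necessarily how STEM-hy optimizes it.

```python
import math

def hybrid_log_likelihood(gamma, f_s1, f_s2):
    """Sum over genes of log{gamma * f(g_i|S1) + (1 - gamma) * f(g_i|S2)}."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return sum(math.log(gamma * a + (1.0 - gamma) * b) for a, b in zip(f_s1, f_s2))

def best_gamma(f_s1, f_s2, steps=1000):
    """Grid search for the gamma in [0, 1] that maximizes the log-likelihood."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda g: hybrid_log_likelihood(g, f_s1, f_s2))

# Placeholder densities for two genes under each parental tree (hypothetical).
f_s1 = [2.0, 1.0]
f_s2 = [1.0, 2.0]
gamma_hat = best_gamma(f_s1, f_s2)
print(gamma_hat, hybrid_log_likelihood(gamma_hat, f_s1, f_s2))
```

In this symmetric toy case each parental tree is favored by one gene, so the likelihood is maximized at γ = 0.5.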

Beyond Three Taxa...
- Propose a method which incorporates any number of hybridization events, provided they occur between sister taxa
- Each putative hybridization event is assigned a parameter, $\gamma_1, \gamma_2, \ldots$
- The likelihood is computed by looking at all combinations of possible parental species trees, weighted appropriately by the $\gamma_j$ parameters

A Bigger Example
Motivating example: consider the hybrid species tree on the six taxa A, B, C, D, E, F.
[Figure: the hybrid species tree on A-F together with its four possible parental species trees.]

The Likelihood Function
[Figure: the four parental species trees $S_1, \ldots, S_4$ on taxa A-F, each labeled with its weight.]
$$\prod_{i=1}^{N} \left\{ \gamma_1 \gamma_2 f(g_i \mid S_1) + \gamma_1 (1-\gamma_2) f(g_i \mid S_2) + (1-\gamma_1) \gamma_2 f(g_i \mid S_3) + (1-\gamma_1)(1-\gamma_2) f(g_i \mid S_4) \right\}$$
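The weighting over combinations of parental trees generalizes to any number of hybridization parameters. The Python sketch below enumerates the 2^m combinations for m events and forms the mixture log-likelihood; the function names, the True/False encoding of which parent is chosen at each event, and the placeholder densities are assumptions made for illustration.

```python
import math
from itertools import product

def parental_tree_weights(gammas):
    """Weight of each combination of parental choices. A choice tuple has one
    entry per hybridization event: True takes the factor gamma_j, False takes
    1 - gamma_j."""
    weights = {}
    for choice in product((True, False), repeat=len(gammas)):
        weight = 1.0
        for gamma, first_parent in zip(gammas, choice):
            weight *= gamma if first_parent else 1.0 - gamma
        weights[choice] = weight
    return weights

def mixture_log_likelihood(gammas, densities):
    """densities[i][choice] is f(g_i | parental tree for that choice)."""
    weights = parental_tree_weights(gammas)
    return sum(math.log(sum(weights[c] * gene[c] for c in weights))
               for gene in densities)

weights = parental_tree_weights((0.3, 0.8))
for choice, weight in weights.items():
    print(choice, round(weight, 4))

# Placeholder densities for two genes under the four parental trees (hypothetical).
densities = [{c: 1.0 for c in weights}, {c: 2.0 for c in weights}]
print(mixture_log_likelihood((0.3, 0.8), densities))
```

With two events the four choices, in the order produced here, carry the weights γ1γ2, γ1(1-γ2), (1-γ1)γ2, and (1-γ1)(1-γ2), matching the four terms of the likelihood above.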

Comments on Computation
- Parameters in the likelihood function: $\gamma_1$, $\gamma_2$, and the branch lengths
- For a given hybrid species tree and sample of gene trees with divergence times, maximum likelihood branch lengths can be determined analytically
- Fitting the likelihood model for a hypothesized hybrid species tree therefore only requires optimization of the γ parameters
- Implemented in a modified version of the program STEM, called STEM-hy

Selecting the Best Hybrid Species Tree
For the example hybrid species tree, pick the best hybrid model from among the possible models using the AIC. Each model corresponds to a tree on taxa A, B, C, D, E, F (the tree drawings are not reproduced here).

Model   $\gamma_1$   $\gamma_2$   Number of Parameters
1       0            0            5
2       0            1            5
3       1            0            5
4       1            1            5
5       0            (0,1)        6
6       1            (0,1)        6
7       (0,1)        0            6
8       (0,1)        1            6
9       (0,1)        (0,1)        7
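The nine candidate models follow a simple pattern: each γ is fixed at 0, fixed at 1, or left free in (0,1), and every free γ adds one parameter to the base count of 5. A minimal sketch of that bookkeeping follows; the function name and the string labels are hypothetical.

```python
from itertools import product

BASE_PARAMS = 5  # parameter count of the models in which every gamma is fixed

def enumerate_models(n_gammas=2, base_params=BASE_PARAMS):
    """List (gamma settings, parameter count) for every candidate model."""
    states = ("0", "1", "(0,1)")  # fixed at 0, fixed at 1, or estimated
    models = []
    for combo in product(states, repeat=n_gammas):
        free = sum(s == "(0,1)" for s in combo)  # each free gamma costs one parameter
        models.append((combo, base_params + free))
    # A stable sort by parameter count reproduces the model numbering 1 to 9.
    return sorted(models, key=lambda m: m[1])

for number, (combo, k) in enumerate(enumerate_models(), start=1):
    print(number, combo, k)
```

With m putative hybridization events this yields 3^m models, so the candidate set grows quickly as events are added.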

STEM-hy: Assumptions
In practice, the $\gamma_i$ are not given (neither are the times of speciation or hybridization events). The algorithm finds MLEs for these parameters.
STEM-hy inherits all of STEM's other assumptions (e.g., no gene flow after speciation if there is no hybridization, gene tree variability is not taken into consideration, etc.).

STEM-hy: Assumptions
One important point: STEM-hy looks for evidence of hybridization in the presence of incomplete lineage sorting. Because the model in STEM-hy is used to compute the likelihoods, the coalescent process is incorporated.
The AIC is used to compare models:
$$\mathrm{AIC} = -2 \ln L(M \mid D) + 2k$$
where $M$ is the model, $D$ is the data, and $k$ is the number of parameters. $\ln L(M \mid D)$ is the log-likelihood from STEM-hy for the hybridization model under consideration.

Input data format is the same as for previous analyses:
Gene trees are placed in the file called genetrees.tre (option 1), or the files containing the gene trees are listed in the settings file (option 2).
The settings file (in YAML format) is used to give user settings (e.g., θ).
The run option is set to 3.

The user must additionally provide information about hybridization:
The only option at present is to use a user-specified tree; the present version of the program assumes that the overall species phylogeny is known.
The user-specified tree is one of the possible parental trees; it doesn't matter which one.
The putative hybrid species are identified in the settings.yaml file.

STEM-hy Example: Heliconius Butterflies
[Figure: species tree with tips H. hecale, H. melpomene, H. heurippa, and H. cydno; internal nodes labeled ABCD, BCD, BD, and CD, with numbers 1 to 3 marked on the tree.]

STEM-hy Example: Heliconius Butterflies
Example genetrees.tre file:
[0.37137]((Hheurippa:0.005989,(Hcydno:0.001322,Hmelpomene:0.001322):0.004667):0.022778,Hhecale:0.028767);
[1.17059]((Hmelpomene:0.049843,(Hcydno:0.000001,Hheurippa:0.000001):0.049843):0.001,Hhecale:0.049943);
[0.11434](((Hcydno:0.021024,Hheurippa:0.021024):0.020051,Hmelpomene:0.041076):0.002610,Hhecale:0.043685);
[1.35454](((Hheurippa:0.010740,Hcydno:0.010740):0.003498,Hmelpomene:0.014238):0.037654,Hhecale:0.051892);
[0.39096](((Hheurippa:0.008764,Hmelpomene:0.008764):0.001686,Hcydno:0.010450):0.003969,Hhecale:0.014419);
[1.22683](((Hheurippa:0.002431,Hcydno:0.002431):0.062919,Hmelpomene:0.065350):0.0000001,Hhecale:0.065351);
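Each entry in the example file pairs a bracketed number with a Newick tree. The sketch below splits one such entry into its parts and lists the tip labels. The slides do not define the bracketed number; treating it as a per-locus numeric value (in STEM it is generally understood to be a rate multiplier) is an assumption here, and the function and field names are hypothetical.

```python
import re

# "[number]" followed by a Newick string ending in ";"
ENTRY_RE = re.compile(r"^\[(?P<value>[0-9.eE+-]+)\](?P<newick>.+;)$")
# A tip label directly follows "(" or "," and is followed by ":"
TIP_RE = re.compile(r"[(,]([A-Za-z][A-Za-z0-9_.]*):")

def parse_gene_tree_entry(entry):
    """Return (bracket_value, newick_string, tip_labels) for one entry."""
    match = ENTRY_RE.match(entry.strip())
    if match is None:
        raise ValueError("expected '[number](...);' but got: " + entry[:40])
    newick = match.group("newick")
    return float(match.group("value")), newick, TIP_RE.findall(newick)

example = (
    "[0.37137]((Hheurippa:0.005989,(Hcydno:0.001322,Hmelpomene:0.001322)"
    ":0.004667):0.022778,Hhecale:0.028767);"
)
bracket_value, newick, tips = parse_gene_tree_entry(example)
print(bracket_value, tips)
```

This only checks the overall shape of an entry. It does not validate the Newick structure or the branch lengths, for which a dedicated phylogenetics library would be a better fit.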

STEM-hy Example: Heliconius Butterflies
Example settings file:
properties:
  run: 3
  theta: 0.001
  beta: 0.0005
  burnin: 100
  seed: 3435893
  bound_total_iter: 20
  num_saved_trees: 10
  hybrid_species: H. heurippa
  hybrid_tree: user-heliconius.tre
species:
  H. melpomene: M95
  H. hecale: Hh
  H. cordula: M187
  H. heurippa: Strib40

Example user-heliconius.tre:
(((H. heurippa:0.000085,H. cydno:0.000085):0.347479,H. melpomene:0.355979):3.332091,H. hecale:3.68807);

****************Results*****************
...
Parental trees:
gamma(H. heurippa) = 1
((H. cydno:0.00009,(H. heurippa:0.00009,H. melpomene:0.00009):0.00000):3.68801,H. hecale:3.68810);
Lik: -357.4325907499209  AIC: 720.8651814998418  k: 3
gamma(H. heurippa) = 0
(((H. heurippa:0.00009,H. cydno:0.00009):0.35589,H. melpomene:0.35598):3.33212,H. hecale:3.68810);
Lik: -349.9185707499209  AIC: 705.8371414998418  k: 3
Hybrid trees:
(((H. heurippa:0.00009,H. cydno:0.00009):0.35589,H. melpomene:0.35598):3.33212,H. hecale:3.68810);
Lik: -349.5409832924012  gamma(H. heurippa): 0.6600000000000004  AIC: 707.0819665848024  k: 4
****************** Done ****************
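The reported AIC values can be checked by hand from the Lik and k fields, using AIC = -2 ln L + 2k. The short sketch below does that for the three models in the output and picks the one with the smallest AIC; the dictionary labels are hypothetical names for the three reported models.

```python
def aic(log_lik, k):
    """Akaike information criterion from a log-likelihood and parameter count."""
    return -2.0 * log_lik + 2 * k

# (log-likelihood, k) exactly as printed in the results above.
reported = {
    "gamma = 1": (-357.4325907499209, 3),
    "gamma = 0": (-349.9185707499209, 3),
    "gamma estimated": (-349.5409832924012, 4),
}

aic_values = {name: aic(ll, k) for name, (ll, k) in reported.items()}
best_model = min(aic_values, key=aic_values.get)

for name, value in aic_values.items():
    print(f"{name}: AIC = {value:.4f}")
print("lowest AIC:", best_model)
```

For these reported values, estimating γ raises the log-likelihood only slightly, so the extra parameter is not repaid and the γ = 0 parental tree has the lowest AIC.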

What hybrid species can be considered?
Care must be taken in selecting hybrid species:
The two members of a sister group cannot both be selected as hybrid taxa in a single analysis. However, two analyses can be run (one with each member of the sister group identified as the hybrid), and the results will be comparable across runs.
The outgroup cannot be selected as a hybrid.
Both of these restrictions result from the fact that, for now, hybridization is only considered between sister taxa.
More general hybridization relationships can be considered by hand using the user-specified tree feature of STEM-hy.
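The two restrictions can be checked before a run. The sketch below is one possible pre-flight check, not part of STEM-hy: it represents the species tree as nested Python tuples, reads "sister group" as a pair of tips that are each other's closest relatives, and reads "outgroup" as a tip attached directly to the root. Both readings, and all function names, are assumptions made for illustration.

```python
def sister_pairs(node):
    """Yield each pair of tips that are each other's closest relatives."""
    if isinstance(node, str):
        return
    if len(node) == 2 and all(isinstance(child, str) for child in node):
        yield frozenset(node)
    for child in node:
        yield from sister_pairs(child)

def check_hybrids(tree, hybrids):
    """Return a list of problems with the chosen hybrid taxa (empty if none)."""
    chosen = set(hybrids)
    problems = []
    # A tip hanging directly off the root is treated as the outgroup.
    for child in tree:
        if isinstance(child, str) and child in chosen:
            problems.append(f"{child} is the outgroup")
    # Two sister tips may not both be marked as hybrids in one analysis.
    for pair in sister_pairs(tree):
        if pair <= chosen:
            problems.append("both sisters selected: " + ", ".join(sorted(pair)))
    return problems

# Topology of the user-specified Heliconius tree, without branch lengths.
tree = ((("H. heurippa", "H. cydno"), "H. melpomene"), "H. hecale")

print(check_hybrids(tree, ["H. heurippa"]))
print(check_hybrids(tree, ["H. heurippa", "H. cydno"]))
print(check_hybrids(tree, ["H. hecale"]))
```

Selecting H. heurippa alone, as in the example settings file, passes both checks under these assumptions.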


STEM-hy: Strengths and Weaknesses
STEM-hy makes some fairly strong assumptions:
Error in estimating gene trees and branch lengths is not incorporated! But the possibility of carrying out bootstrap analysis helps.
Information in the sequence data is not used directly; it is only used as summarized by estimated gene divergence times.
There is a single value of θ for the entire tree.
There are trade-offs involved, and STEM-hy does some things well:
It is quick (even the tree search does not take long).
It can handle missing data easily and intuitively.
Simulations demonstrate reasonable performance (unlikely to be misleading; may be uninformative).

Challenge Datasets
I've created four datasets under varying conditions:
M1: No hybridization, long intervals between speciation events.
M2: No hybridization, short intervals between speciation events.
M3: Low levels of hybridization; B is a hybrid of A and C (species tree as in M1 and M2).
M4: Extensive hybridization; B is a hybrid of A and C (species tree as in M1 and M2).
All data sets have 6 species, 2 individuals per species, and 10 loci.
GOAL: match each data set to one of the conditions listed above.
Solutions are at www.stat.osu.edu/~lkubatko/solutions.html

STEM-hy Information, References, etc.
Recommended citations - species tree estimation:
Kubatko, L. S., B. C. Carstens, and L. L. Knowles. 2009. STEM: Species Tree Estimation using Maximum likelihood under coalescence. Bioinformatics 25(7): 971-973.
Liu, L., L. Yu, and D. K. Pearl. 2009. Maximum tree: a consistent estimator of the species tree. Journal of Mathematical Biology 60(1): 95-106.
Mossel, E. and S. Roch. 2010. Incomplete lineage sorting: Consistent phylogeny estimation from multiple loci. IEEE/ACM Transactions on Computational Biology and Bioinformatics 7(1): 166-171.
Recommended citations - hybridization:
Kubatko, L. S. 2009. Identifying Hybridization Events in the Presence of Coalescence via Model Selection. Systematic Biology 58(5): 478-488.
Thank you!
STEM-hy is available at http://www.stat.osu.edu/~lkubatko/software/stem/
Questions concerning the programs can be sent to kubatko.2@osu.edu.