394C, October 2, Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees

Similar documents
From Gene Trees to Species Trees. Tandy Warnow The University of Texas at Aus<n

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

Construc)ng the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign

CS 581 / BIOE 540: Algorithmic Computa<onal Genomics. Tandy Warnow Departments of Bioengineering and Computer Science hhp://tandy.cs.illinois.

CS 394C Algorithms for Computational Biology. Tandy Warnow Spring 2012

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

SEPP and TIPP for metagenomic analysis. Tandy Warnow Department of Computer Science University of Texas

Ultra- large Mul,ple Sequence Alignment

The Mathema)cs of Es)ma)ng the Tree of Life. Tandy Warnow The University of Illinois

Phylogenomics, Multiple Sequence Alignment, and Metagenomics. Tandy Warnow University of Illinois at Urbana-Champaign

From Genes to Genomes and Beyond: a Computational Approach to Evolutionary Analysis. Kevin J. Liu, Ph.D. Rice University Dept. of Computer Science

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

Constrained Exact Op1miza1on in Phylogene1cs. Tandy Warnow The University of Illinois at Urbana-Champaign

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

New methods for es-ma-ng species trees from genome-scale data. Tandy Warnow The University of Illinois

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

Phylogenetic inference

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois

Genome-scale Es-ma-on of the Tree of Life. Tandy Warnow The University of Illinois

Phylogenetic Geometry

Fast coalescent-based branch support using local quartet frequencies

Quantifying sequence similarity

Today's project. Test input data Six alignments (from six independent markers) of Curcuma species

CSCI1950 Z Computa3onal Methods for Biology* (*Working Title) Lecture 1. Ben Raphael January 21, Course Par3culars

Evolu&on, Popula&on Gene&cs, and Natural Selec&on Computa.onal Genomics Seyoung Kim

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Reconstruction of species trees from gene trees using ASTRAL. Siavash Mirarab University of California, San Diego (ECE)

The impact of missing data on species tree estimation

Computa(onal Challenges in Construc(ng the Tree of Life

Phylogenetics in the Age of Genomics: Prospects and Challenges

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Phylogenetic Tree Reconstruction

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

Dr. Amira A. AL-Hosary

Anatomy of a species tree

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

A Phylogenetic Network Construction due to Constrained Recombination

How to read and make phylogenetic trees Zuzana Starostová

Phylogenetics: Building Phylogenetic Trees

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Basics on bioinforma-cs Lecture 7. Nunzio D Agostino

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics. BIOL 7711 Computational Bioscience

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Methods to reconstruct phylogene1c networks accoun1ng for ILS

Taming the Beast Workshop

Upcoming challenges in phylogenomics. Siavash Mirarab University of California, San Diego

Phylogeny Fig Overview: Inves8ga8ng the Tree of Life Phylogeny Systema8cs

CREATING PHYLOGENETIC TREES FROM DNA SEQUENCES

C3020 Molecular Evolution. Exercises #3: Phylogenetics

CS 6140: Machine Learning Spring What We Learned Last Week 2/26/16

Minimum Edit Distance. Defini'on of Minimum Edit Distance

Estimating Evolutionary Trees. Phylogenetic Methods

Quartet Inference from SNP Data Under the Coalescent Model

Multiple Sequence Alignment. Sequences

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

WenEtAl-biorxiv 2017/12/21 10:55 page 2 #2

TheDisk-Covering MethodforTree Reconstruction

Streaming - 2. Bloom Filters, Distinct Item counting, Computing moments. credits:

Bias/variance tradeoff, Model assessment and selec+on

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method

In comparisons of genomic sequences from multiple species, Challenges in Species Tree Estimation Under the Multispecies Coalescent Model REVIEW

Effects of Gap Open and Gap Extension Penalties

EVOLUTIONARY DISTANCES

Sta$s$cal Significance Tes$ng In Theory and In Prac$ce

Phylogenetic analyses. Kirsi Kostamo

Ch. 26 Phylogeny BIOL 221

7. Tests for selection

Workshop III: Evolutionary Genomics

Woods Hole brief primer on Multiple Sequence Alignment presented by Mark Holder

A phylogenomic toolbox for assembling the tree of life

Phylogeny: building the tree of life

Sequence Analysis '17- lecture 8. Multiple sequence alignment

1 ATGGGTCTC 2 ATGAGTCTC

Least Squares Parameter Es.ma.on

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Recent Advances in Phylogeny Reconstruction

Gene Regulatory Networks II Computa.onal Genomics Seyoung Kim

Multiple Alignment. Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis

Molecular Evolution & Phylogenetics

Priors in Dependency network learning

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

Session 5: Phylogenomics

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Moreover, the circular logic

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Transcription:

394C, October 2, 2013 Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees

Mul9ple Sequence Alignment Mul9ple Sequence Alignments and Evolu9onary Histories (the meaning of homologous ) How to define error rates in mul9ple sequence alignments Minimum edit transforma9ons and pairwise alignments Dynamic Programming for calcula9ng a pairwise alignment (or minimum edit transforma9on) Co- es9ma9ng alignments and trees

DNA Sequence Evolu9on AAGACTT AAGGCCT TGGACTT AGGGCAT TAGCCCT AGCACTT AGGGCAT TAGCCCA TAGACTT AGCACAA AGCGCTT

U V W X Y AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT U X Y V W

Deletion Mutation ACGGTGCAGTTACCA ACCAGTCACCA

Deletion Mutation ACGGTGCAGTTACCA ACCAGTCACCA ACGGTGCAGTTACCA AC----CAGTCACCA The true mul9ple alignment Reflects historical substitution, insertion, and deletion events in the true phylogeny

Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 1: Mul9ple Sequence Alignment S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

Phase 2: Construct tree S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3

Many methods Alignment methods Clustal POY (and POY*) Probcons (and Probtree) MAFFT Prank Muscle Di- align T- Coffee Opal Etc. Phylogeny methods Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc.

Deletion Mutation ACGGTGCAGTTACCA ACCAGTCACCA ACGGTGCAGTTACCA AC----CAGTCACCA The true mul9ple alignment Reflects historical substitution, insertion, and deletion events in the true phylogeny But how do we try to estimate this?

Pairwise alignments and edit transforma9ons Each pairwise alignment implies one or more edit transforma9ons Each edit transforma9on implies one or more pairwise alignments So calcula9ng the edit distance (and hence minimum cost edit transforma9on) is the same as calcula9ng the op9mal pairwise alignment

Edit distances Subs9tu9on costs may depend upon which nucleo9des are involved (e.g, transi9on/ transversion differences) Gap costs Linear (aka simple ): gapcost(l) = cl Affine: gapcost(l) = c+c L Other: gapcost(l) = c+c log(l)

Compu9ng op9mal pairwise alignments The cost of a pairwise alignment (under a simple gap model) is just the sum of the costs of the columns Under affine gap models, it s a bit more complicated (but not much)

Compu9ng edit distance Given two sequences and the edit distance func9on F(.,.), how do we compute the edit distance between two sequences? Simple algorithm for standard gap cost func9ons (e.g., affine) based upon dynamic programming

DP alg for simple gap costs Given two sequences A[1 n] and B[1 m], and an edit distance func9on F(.,.) with unit subs9tu9on costs and gap cost C, Let A = A 1,A 2,,A n B = B 1,B 2,,B m Let M(i,j)=F(A[1 i],b[1 j]) (i.e., the edit distance between these two prefixes )

Dynamic programming algorithm Let M(i,j)=F(A[1 i],b[1 j]) M(0,0)=0 M(n,m) stores our answer How do we compute M(i,j) from other entries of the matrix?

Calcula9ng M(i,j) Examine final column in some op9mal pairwise alignment of A[1 i] to B[1 j] Possibili9es: Nucleo9de over nucleo9de: previous columns align A[1 i- 1] to B[1 j- 1]: M(i,j)=M(i- 1,j- 1)+subcost(A i,b j ) Indel (- ) over nucleo9de: previous columns align A[1 i] to B[1 j- 1]: M(i,j)=M(i,j- 1)+indelcost Nucleo9de over indel: previous columns align A[1 i- 1] to B[1 j]: M(i,j)=M(i- 1,j)+indelcost

Calcula9ng M(i,j) Examine final column in some op9mal pairwise alignment of A[1 i] to B[1 j] Possibili9es: Nucleo9de over nucleo9de: previous columns align A[1 i- 1] to B[1 j- 1]: M(i,j)=M(i- 1,j- 1)+subcost(A i,b j ) Indel (- ) over nucleo9de: previous columns align A[1 i] to B[1 j- 1]: M(i,j)=M(i,j- 1)+indelcost Nucleo9de over indel: previous columns align A[1 i- 1] to B[1 j]: M(i,j)=M(i- 1,j)+indelcost

Calcula9ng M(i,j) M(i,j) = min { M(i- 1,j- 1)+subcost(A i,b j ), M(i,j- 1)+indelcost, M(i- 1,j)+indelcost }

O(nm) DP algorithm for pairwise alignment using simple gap costs Ini9alize M(0,j) = M(j,0) = j*indelcost For i=1 n For j = 1 m M(i,j) = min { M(i- 1,j- 1)+subcost(A i,b j ), M(i,j- 1)+indelcost, M(i- 1,j)+indelcost } Return M(n,m) Add arrows for backtracking (to construct an op9mal alignment and edit transforma9on rather than just the cost) Modifica9on for other gap cost func9ons is straighlorward but leads to an increase in running 9me

Sum- of- pairs op9mal mul9ple alignment Given set S of sequences and edit cost func9on F(.,.), Find mul9ple alignment that minimizes the sum of the implied pairwise alignments (Sum- of- Pairs criterion) NP- hard, but can be approximated Is this useful?

Other approaches to MSA Many of the methods used in prac9ce do not try to op9mize the sum- of- pairs Instead they use probabilis9c models (HMMs) Omen they do a progressive alignment on an es9mated tree (aligning alignments) Performance of these methods can be assessed using real and simulated data

Many methods Alignment methods Clustal POY (and POY*) Probcons (and Probtree) MAFFT Prank Muscle Di- align T- Coffee Opal Etc. Phylogeny methods Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc.

Simula9on study ROSE simula9on: 1000, 500, and 100 sequences Evolu9on with subs9tu9ons and indels Varied gap lengths, rates of evolu9on Computed alignments Used RAxML to compute trees Recorded tree error (missing branch rate) Recorded alignment error (SP- FN)

Alignment Error Given a mul9ple sequence alignment, we represent it as a set of pairwise homologies. To compare two alignments, we compare their sets of pairwise homologies. The SP- FN (sum- of- pairs false nega9ve rate) is the percentage of the true homologies (those present in the true alignment) that are missing in the es9mated alignment. The SP- FP (sum- of- pairs false posi9ve rate) is the percentage of the homologies in the es9mated alignment that are not in the true alignment.

1000 taxon models ranked by difficulty

Problems with the two phase approach Manual alignment can have a high level of subjec9vity (and can take a long 9me). Current alignment methods fail to return reasonable alignments on markers that evolve with high rates of indels and subs9tu9ons, especially if these are large datasets. We discard poten9ally useful markers if they are difficult to align.

S1 S2 S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S4 and S3 S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA Simultaneous es9ma9on of trees and alignments

Simultaneous Es9ma9on Methods Likelihood- based (under model of evolu9on including ( events inser9on/dele9on ALIFRITZ, BAli- Phy, BEAST, StatAlign, others Computa9onally intensive Most are limited to small datasets (< 30 sequences)

Treelength- based Input: Set S of unaligned sequences over an alphabet, and an edit distance func9on F(.,.) (must account for gaps and subs9tu9ons) Output: Tree T with sequences S at the leaves and other sequences at the internal nodes so as to minimize e F(s v,s w ), where the sum is taken over all edges e=(s v,s w ) in the tree

Minimizing treelength Given set S of sequences and edit distance func9on F(.,.), Find tree T with S at the leaves and sequences at the internal nodes so as to minimize the treelength (sum of edit distances) NP- hard but can be approximated NP- hard even if the tree is known!

Minimizing treelength The problem of finding sequences at the internal nodes of a fixed tree was introduced by Sankoff. Several algorithmic results related to this problem, with prevy theory Most popular somware is POY, which tries to op9mize tree length. The accuracy of any tree or alignment depends upon the edit distance func9on F(.,.), but so far even good affine distances don t produce very good trees or alignments.

More SATé: a heuris9c method for simultaneous es9ma9on and tree alignment POY, POY*, and BeeTLe: results of how changing the gap penalty from simple to affine impacts the alignment and tree Impact of guide tree on MSA Sta9s9cal co- es9ma9on using models that include indel events (Statalign, Alifritz, BAliPhy) UPP (ultra- large alignments using SEPP) Alignment es9ma9on in the presence of duplica9ons and rearrangements Visualizing large alignments The differences between amino- acid alignments and nucleo9de alignments (especially for non- coding data)

Research Projects How to use indel informa9on in an alignment? Do the sta9s9cal es9ma9on methods (Bali- Phy, StatAlign, etc.) produce more accurate alignments than standard methods (e.g., MAFFT)? Do they result in bever trees? What benefit do we get from an improved alignment? (What biological problem does the alignment method help us solve, besides tree es9ma9on?)

(Phylogenetic estimation from whole genomes)

Gene Trees to Species Trees Gene trees are inside species trees Causes of gene tree discord Incomplete lineage sor9ng Methods for es9ma9ng species trees from gene trees

Sampling mul9ple genes from mul9ple species Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

Using mul9ple genes S 1 S 2 S 3 S 4 gene 1" TCTAATGGAA" GCTAAGGGAA" TCTAAGGGAA" TCTAACGGAA" S 4 gene 2! GGTAACCCTC! S 1 S 3 S 4 gene 3! TATTGATACA" TCTTGATACC" TAGTGATGCA" S 7 S 8 TCTAATGGAC" TATAACGGAA" S 5 S 6 S 7 GCTAAACCTC! GGTGACCATC! GCTAAACCTC! S 7 S 8 TAGTGATGCA" CATTCATACC"

Two compe9ng approaches Species gene 1 gene 2... gene k"..." Concatenation" Analyze" separately"..." Summary Method"

1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen, Md. S.Bayzid UT-Austin UT-Austin UT-Austin UT-Austin l l l Plus many many other people Plant Tree of Life based on transcriptomes of ~1200 species More than 13,000 gene families (most not single copy) Gene sequence alignments and trees computed using SATé (Liu et al., Science 2009 and Systema9c Biology 2012) Gene Tree Incongruence Challenges: Mul3ple sequence alignments of > 100,000 sequences Gene tree incongruence

Avian Phylogenomics Project Erich Jarvis, HHMI MTP Gilbert, Copenhagen G Zhang, BGI T. Warnow UT- Aus9n S. Mirarab Md. S. Bayzid, UT- Aus9n UT- Aus9n Plus many many other people Approx. 50 species, whole genomes 8000+ genes, UCEs Gene sequence alignments and trees computed using SATé (Liu et al., Science 2009 and Systema9c Biology 2012) Challenges: Maximum likelihood on mul3- million- site sequence alignments Massive gene tree incongruence

Ques9ons Is the model tree iden9fiable? Which es9ma9on methods are sta9s9cally consistent under this model? What is the computa9onal complexity of an es9ma9on problem?

Sta9s9cal Consistency error Data

Sta9s9cal Consistency error Data Data are sites in an alignment

Neighbor Joining (and many other distance- based methods) are sta9s9cally consistent under Jukes- Cantor

Ques9ons Is the model tree iden9fiable? Which es9ma9on methods are sta9s9cally consistent under this model? What is the computa9onal complexity of an es9ma9on problem?

Answers? We know a lot about which site evolu9on models are iden9fiable, and which methods are sta9s9cally consistent. Some polynomial 9me afc methods have been developed, and we know a livle bit about the sequence length requirements for standard methods. Just about everything is NP- hard, and the datasets are big. Extensive studies show that even the best methods produce gene trees with some error.

Answers? We know a lot about which site evolu9on models are iden9fiable, and which methods are sta9s9cally consistent. Just about everything is NP- hard, and the datasets are big. Extensive studies show that even the best methods produce gene trees with some error.

Answers? We know a lot about which site evolu9on models are iden9fiable, and which methods are sta9s9cally consistent. Just about everything is NP- hard, and the datasets are big. Extensive studies show that even the best methods produce gene trees with some error.

In other words error Data Sta9s9cal consistency doesn t guarantee accuracy w.h.p. unless the sequences are long enough.

Species Tree Es9ma9on from Gene Trees error Data Data are gene trees, presumed to be randomly sampled true gene trees.

(Phylogenetic estimation from whole genomes) 1. Why do we need whole genomes? 2. Will whole genomes make phylogeny estimation easy? 3. How hard are the computational problems? 4. Do we have sufficient methods for this?

Using mul9ple genes S 1 S 2 S 3 S 4 gene 1" TCTAATGGAA" GCTAAGGGAA" TCTAAGGGAA" TCTAACGGAA" S 4 gene 2! GGTAACCCTC! S 1 S 3 S 4 gene 3! TATTGATACA" TCTTGATACC" TAGTGATGCA" S 7 S 8 TCTAATGGAC" TATAACGGAA" S 5 S 6 S 7 GCTAAACCTC! GGTGACCATC! GCTAAACCTC! S 7 S 8 TAGTGATGCA" CATTCATACC"

Concatena9on gene 1" gene 2! gene 3! S 1 S 2 S 3 S 4 S 5 TCTAATGGAA" GCTAAGGGAA" TCTAAGGGAA" TCTAACGGAA" GGTAACCCTC! TAGTGATGCA"?????????? "?????????? " GCTAAACCTC! TATTGATACA"?????????? "??????????"?????????? " TCTTGATACC"??????????" S 6??????????" GGTGACCATC!??????????" S 7 TCTAATGGAC" GCTAAACCTC! TAGTGATGCA" S 8 TATAACGGAA"?????????? " CATTCATACC"

Red gene tree species tree (green gene tree okay)

1KP: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iplant T. Warnow, S. Mirarab, N. Nguyen, Md. S.Bayzid UT-Austin UT-Austin UT-Austin UT-Austin l l l l l 1200 plant transcriptomes More than 13,000 gene families (most not single copy) Mul9- ins9tu9onal project (10+ universi9es) iplant (NSF- funded coopera9ve) Gene sequence alignments and trees computed using SATe (Liu et al., Science 2009 and Systema9c Biology 2012)

Avian Phylogenomics Project E Jarvis, HHMI MTP Gilbert, Copenhagen G Zhang, BGI T. Warnow UT- Aus9n S. Mirarab Md. S. Bayzid, UT- Aus9n UT- Aus9n Plus many many other people Approx. 50 species, whole genomes 8000+ genes, UCEs Gene sequence alignments computed using SATé (Liu et al., Science 2009 and Systema9c Biology 2012)

Gene Tree Incongruence Gene trees can differ from the species tree due to: Duplica9on and loss Horizontal gene transfer Incomplete lineage sor9ng (ILS)

Species Tree Es9ma9on in the presence of ILS Mathema9cal model: Kingman s coalescent Coalescent- based species tree es9ma9on methods Simula9on studies evalua9ng methods New techniques to improve methods Applica9on to the Avian Tree of Life

Species tree es9ma9on: difficult, even for small datasets! Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

The Coalescent Past Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree. Present Courtesy James Degnan

Gene tree in a species tree Courtesy James Degnan

Lineage Sor9ng Lineage sor9ng is a Popula9on- level process, also called the Mul9- species coalescent (Kingman, 1982). The probability that a gene tree will differ from species trees increases for short 9mes between specia9on events or large popula9on size. When a gene tree differs from the species tree, this is called Incomplete Lineage Sor9ng or Deep Coalescence.

Key observa9on: Under the mul9- species coalescent model, the species tree defines a probability distribu3on on the gene trees Courtesy James Degnan

Incomplete Lineage Sor9ng (ILS) 2000+ papers in 2013 alone Confounds phylogene9c analysis for many groups: Hominids Birds Yeast Animals Toads Fish Fungi There is substan9al debate about how to analyze phylogenomic datasets in the presence of ILS.

Two compe9ng approaches Species gene 1 gene 2... gene k"..." Concatenation" Analyze" separately"..." Summary Method"

How to compute a species tree?..."

MDC Problem (Maddison 1997) XL(T,t) = the number of extra lineages on the species tree T with respect to the gene tree t. In this example, XL(T,t) = 1. Courtesy James Degnan MDC (minimize deep coalescence) problem: Given set X = {t 1,t 2,,t k } of gene trees find the species tree T that implies the fewest extra lineages (deep coalescences) with respect to X, i.e., minimize MDC(T, X) = Σ i XL(T,t i )

MDC Problem MDC is NP- hard Exact solu9on to MDC that runs in exponen9al 9me (Than and Nakhleh, PLoS Comp Biol 2009). Popular technique, omen gives good accuracy. However, not sta9s9cally consistent under ILS, even if solved exactly!

Sta9s9cally consistent under ILS? MDC NO Greedy NO Most frequent gene tree - NO Concatena9on under maximum likelihood open MRP (supertree method) open MP- EST (Liu et al. 2010): maximum likelihood es9ma9on of rooted species tree YES BUCKy- pop (Ané and Larget 2010): quartet- based Bayesian species tree es9ma9on YES

Under the mul9- species coalescent model, the species tree defines a probability distribu9on on the gene trees Courtesy James Degnan Theorem (Degnan et al., 2006, 2009): Under the mul9- species coalescent model, for any three taxa A, B, and C, the most probable rooted gene tree on {A,B,C} is iden9cal to the rooted species tree induced on {A,B,C}.

How to compute a species tree?..." Techniques: MDC? Most frequent gene tree? Consensus of gene trees? Other?

How to compute a species tree?..." Theorem (Degnan et al., 2006, 2009): Under the mul9- species coalescent model, for any three taxa A, B, and C, the most probable rooted gene tree on {A,B,C} is iden9cal to the rooted species tree induced on {A,B,C}.

How to compute a species tree?..." Es9mate species tree for every 3 species Theorem (Degnan et al., 2006, 2009): Under the mul9- species coalescent model, for any three taxa A, B, and C, the most probable rooted gene tree on {A,B,C} is iden9cal to the rooted species tree induced on {A,B,C}...."

How to compute a species tree?..." Es9mate species tree for every 3 species Theorem (Aho et al.): The rooted tree on n species can be computed from its set of 3- taxon rooted subtrees in polynomial 9me...."

How to compute a species tree?..." Es9mate species tree for every 3 species Theorem (Aho et al.): The rooted tree on n species can be computed from its set of 3- taxon rooted subtrees in polynomial 9me...." Combine rooted 3- taxon trees

How to compute a species tree?..." Es9mate species tree for every 3 species Theorem (Degnan et al., 2009): Under the mul9- species coalescent, the rooted species tree can be es9mated correctly (with high probability) given a large enough number of true rooted gene trees...." Combine rooted 3- taxon trees Theorem (Allman et al., 2011): the unrooted species tree can be es9mated from a large enough number of true unrooted gene trees.

How to compute a species tree?..." Es9mate species tree for every 3 species Theorem (Degnan et al., 2009): Under the mul9- species coalescent, the rooted species tree can be es9mated correctly (with high probability) given a large enough number of true rooted gene trees...." Combine rooted 3- taxon trees Theorem (Allman et al., 2011): the unrooted species tree can be es9mated from a large enough number of true unrooted gene trees.

How to compute a species tree?..." Es9mate species tree for every 3 species Theorem (Degnan et al., 2009): Under the mul9- species coalescent, the rooted species tree can be es9mated correctly (with high probability) given a large enough number of true rooted gene trees...." Combine rooted 3- taxon trees Theorem (Allman et al., 2011): the unrooted species tree can be es9mated from a large enough number of true unrooted gene trees.

Sta9s9cal Consistency error Data Data are gene trees, presumed to be randomly sampled true gene trees.

Sta9s9cally consistent methods under ILS Quartet- based methods (e.g., BUCKy- pop (Ané and Larget 2010)) for unrooted species trees MP- EST (Liu et al. 2010): maximum likelihood es9ma9on of rooted species tree for rooted species trees *BEAST (Heled and Drummond, 2011), co- es9mates gene trees and species trees (and some others)

Ques9ons Is the model tree iden9fiable? Which es9ma9on methods are sta9s9cally consistent under this model? What is the computa9onal complexity of an es9ma9on problem? What is the performance in prac9ce?

Results on 11- taxon weakils 0.25 0.2 Average FN rate 0.15 0.1 0.05 *BEAST CA ML BUCKy con BUCKy pop MP EST Phylo exact MRP GC 0 5 genes 10 genes 25 genes 50 genes 20 replicates studied, due to computa9onal challenge of running *BEAST and BUCKy

Results on 11- taxon strongils 0.5 0.4 Average FN rate 0.3 0.2 0.1 *BEAST CA ML BUCKy con BUCKy pop MP EST Phylo exact MRP GC 0 5 genes 10 genes 25 genes 50 genes 20 replicates studied, due to computa9onal challenge of running *BEAST and BUCKy

*BEAST is bemer than ML at es3ma3ng gene trees 0.5 0.5 0.4 0.4 Average FN rate 0.3 0.2 5 genes 10 genes 25 genes 50 genes Average FN rate 0.3 0.2 8 genes 32 genes 0.1 0.1 0 *BEAST FastTree RAxML 0 *BEAST FastTree RAxML 11- taxon weakils datasets 17- taxon (very high ILS) datasets FastTree- 2 and RAxML very close in accuracy *BEAST much more accurate than both ML methods *BEAST gives biggest improvement under low- ILS condi9ons

Impact of Gene Tree Es9ma9on Error on MP- EST 0.25 0.2 Average FN rate 0.15 0.1 true estimated 0.05 0 MP EST MP- EST has no error on true gene trees, but MP- EST has 9% error on es9mated gene trees Similar results for other summary methods (e.g., MDC) Datasets: 11- taxon 50- gene datasets with high ILS (Chung and Ané 2010).

Problem: poor phylogene3c signal Summary methods combine es9mated gene trees, not true gene trees. The individual genes in the 11- taxon datasets have poor phylogene9c signal. Species trees obtained by combining poorly es9mated gene trees have poor accuracy.

Controversies/Open Problems Concatena9on may (or may not be) sta9s9cally consistent under ILS but some simula9on studies suggest it can be posi9vely misleading. Coalescent- based methods have not in general given strong results on biological data can give poor bootstrap support, or produce strange trees, compared to concatena9on.

Problem: poor gene trees Summary methods combine es9mated gene trees, not true gene trees. The individual gene sequence alignments in the 11- taxon datasets have poor phylogene9c signal, and result in poorly es9mated gene trees. Species trees obtained by combining poorly es9mated gene trees have poor accuracy.

Problem: poor gene trees Summary methods combine es9mated gene trees, not true gene trees. The individual gene sequence alignments in the 11- taxon datasets have poor phylogene9c signal, and result in poorly es9mated gene trees. Species trees obtained by combining poorly es9mated gene trees have poor accuracy.

Problem: poor gene trees Summary methods combine es9mated gene trees, not true gene trees. The individual gene sequence alignments in the 11- taxon datasets have poor phylogene9c signal, and result in poorly es9mated gene trees. Species trees obtained by combining poorly es9mated gene trees have poor accuracy.

TYPICAL PHYLOGENOMICS PROBLEM: many poor gene trees Summary methods combine es9mated gene trees, not true gene trees. The individual gene sequence alignments in the 11- taxon datasets have poor phylogene9c signal, and result in poorly es9mated gene trees. Species trees obtained by combining poorly es9mated gene trees have poor accuracy.

Research Projects Coalescent- based methods: analyze a biological dataset using different coalescent- based methods and compare to concatena9on Evalua9on impact of choice of gene trees (e.g., removing gene trees with low support) Evaluate impact of missing taxa in gene trees Develop new coalescent- based method (e.g., combine quartet trees) Evaluate scalability of coalescent- based methods