Workshop III: Evolutionary Genomics


Identifying Species Trees from Gene Trees
Elizabeth S. Allman, University of Alaska
IPAM, Los Angeles, CA, November 17, 2011
Workshop III: Evolutionary Genomics

Collaborators
The work in today's talk is joint with
John A. Rhodes, University of Alaska Fairbanks
James H. Degnan, University of Canterbury, New Zealand
Identifying species trees from gene trees 2/38

Gene trees vs. species trees
From genetic data like DNA sequences,
Taxa: Gorilla, Orangutan, Human, Chimpanzee, Gibbon
AAGCTTCACCGGCGCAGTTGTTCTTATAATTGCCCACGGACTTACATCAT...
AAGCTTCACCGGCGCAACCACCCTCATGATTGCCCATGGACTCACATCCT...
AAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGGCTTACATCCT...
AAGCTTCACCGGCGCAATTATCCTCATAATCGCCCACGGACTTACATCCT...
AAGCTTTACAGGTGCAACCGTCCTCATAATCGCCCACGGACTAACCTCTT...
phylogenetic methods construct gene trees.
[Figure: gene tree with leaves h, c, g, o, gb]
It is well known that gene trees may differ from the underlying species tree.

Sources of conflict
There are many reasons gene trees may differ from species trees:
lateral gene transfer (subtree-prune-and-regraft);
hybridization (network);
effects from population genetics (incomplete lineage sorting).

Gene tree discordance
Different gene trees for the same set of taxa often disagree. One cause is incomplete lineage sorting.
[Figure: two gene trees drawn inside a species tree on taxa a, b, c, d, e]
Gene trees (((C,D),A),(B,E)) and ((((C,D),A),B),E) in species tree ((((a,b),c),d),e).

Multispecies coalescent model
Incomplete lineage sorting is modeled by the multispecies coalescent model.
Viewing time backwards (present → past), lineages within a population coalesce, one pair at a time.
The species tree constrains which lineages may coalesce at any given time.
[Figure: species tree on taxa a, b, c, d, e]

Multispecies coalescent model
Coalescent events occur in populations, i.e., the internal branches of the species tree σ.
[Figures: species tree on a, b, c, d with internal branch lengths x, y; species tree on a, b, c with sampled lineages A1, A2, B1, C1, C2-C3]
Branch lengths x, y are in coalescent units (x ~ t/N(t)).
If k lineages enter a population from below, then coalescent events follow a Poisson process with rate $\binom{k}{2}$. Thus, the waiting time $T_k$ for a coalescent event in any population is modeled by an exponential random variable, $T_k \sim \exp\left(\binom{k}{2}\right)$.
When a coalescent event occurs, any two lineages are equally likely to coalesce, with probability $\binom{k}{2}^{-1}$.
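The process within a single population can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `coalesce_in_population`, the nested-tuple representation of merged lineages, and the use of `float("inf")` for the root population are choices made here, not part of the talk.

```python
import random
from math import comb


def coalesce_in_population(lineages, branch_length, rng):
    """Run the coalescent backwards in one population for `branch_length`
    coalescent units and return the lineages that leave it at the top."""
    lineages = list(lineages)
    remaining = branch_length
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time T_k is exponential with rate C(k, 2).
        wait = rng.expovariate(comb(k, 2))
        if wait > remaining:
            break  # the population ends before the next coalescence
        remaining -= wait
        # Each of the C(k, 2) pairs is equally likely to coalesce.
        i, j = sorted(rng.sample(range(k), 2))
        merged = (lineages[i], lineages[j])
        lineages = [l for idx, l in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return lineages


rng = random.Random(0)
survivors = coalesce_in_population(["A", "B", "C"], 1.0, rng)
print(survivors)
```

Passing `float("inf")` as the branch length models the population above the root, where all lineages eventually coalesce into one.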

Gene tree distributions
Given a species tree σ on X, one can compute the (rooted) gene tree distribution. This distribution is parameterized by the species tree topology ψ and internal branch lengths λ; σ = (ψ,λ).
[Figure: the 15 rooted gene tree topologies on A, B, C, D, and a species tree on a, b, c, d with internal branch lengths x, y]
Nei 1987: probabilities for rooted gene trees with 3 species derived.
Pamilo and Nei 1988: probabilities that the rooted gene tree matches the species tree topology for 4 and 5 species derived.
Rosenberg 2002: probabilities for all 30 species tree/gene tree combinations for 4 species.
Degnan and Salter 2005: general case solved and implemented in COAL.
Discordance in gene trees can indicate the species tree.

Ex: Gene tree distribution
3 taxa, species tree σ = ((a,b):x,c)
[Figure: species tree on a, b, c with internal branch length x]
The probability that two specific entering lineages have not coalesced before x units is
$X = e^{-x} = 1 - \int_0^x e^{-\binom{2}{2} t}\, dt$.
So topological gene trees have probabilities
$\Pr((A,B),C) = 1 - 2X(1/3) \ge 1/3$,
$\Pr((A,C),B) = X(1/3) \le 1/3$,
$\Pr((B,C),A) = X(1/3) \le 1/3$.
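As a rough check of these three probabilities, the closed forms can be compared against a direct simulation of the 3-taxon coalescent. The function names and the string labels for the topologies below are illustrative assumptions.

```python
import math
import random

TOPOLOGIES = ["((A,B),C)", "((A,C),B)", "((B,C),A)"]


def triple_probabilities(x):
    """Closed-form rooted gene tree probabilities for species tree ((a,b):x,c)."""
    X = math.exp(-x)
    return {"((A,B),C)": 1 - 2 * X / 3, "((A,C),B)": X / 3, "((B,C),A)": X / 3}


def simulate_triples(x, n_genes, rng):
    """Simulate gene tree topologies under the coalescent on ((a,b):x,c)."""
    counts = dict.fromkeys(TOPOLOGIES, 0)
    for _ in range(n_genes):
        if rng.expovariate(1.0) < x:
            # A and B coalesce inside the internal branch (rate C(2,2) = 1).
            topology = "((A,B),C)"
        else:
            # Three lineages reach the root; every pair is equally likely.
            topology = rng.choice(TOPOLOGIES)
        counts[topology] += 1
    return counts


probs = triple_probabilities(1.0)
counts = simulate_triples(1.0, 20000, random.Random(0))
print(probs)
print({t: c / 20000 for t, c in counts.items()})
```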

Ex: Gene tree distribution
In particular, for species tree σ = ((a,b):x,c),
Pr((A,B),C) > Pr((A,C),B) = Pr((B,C),A),
and the gap in the inequality identifies the branch length x.
That is, the 3-taxon species tree σ = (ψ,λ) is identifiable from gene tree probabilities:
the most probable gene tree matches the species tree;
if m is the value of the smallest gene tree probability, solve $m = \frac{1}{3}X$ for $X = \exp(-x)$ to obtain the branch length x.
By marginalization to 3-taxon subsets, this shows that for n ≥ 3 taxa, metric rooted species trees can be identified from n-taxon rooted topological gene tree probabilities.
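The recipe on this slide can be sketched as follows. The talk states an identifiability argument for exact probabilities, so applying it to observed counts is an assumption made here for illustration; the counts and the function name `estimate_rooted_triple` are hypothetical.

```python
import math


def estimate_rooted_triple(counts):
    """Pick the most frequent rooted triple as the species tree topology and
    solve m = X/3, X = exp(-x), using the smallest relative frequency m."""
    total = sum(counts.values())
    topology = max(counts, key=counts.get)
    m = min(counts.values()) / total
    # m = 0 means no discordance was observed, so x is unbounded.
    x_hat = math.inf if m == 0 else -math.log(3 * m)
    return topology, x_hat


# Hypothetical gene tree counts, for illustration only.
counts = {"((A,B),C)": 755, "((A,C),B)": 120, "((B,C),A)": 125}
topology, x_hat = estimate_rooted_triple(counts)
print(topology, x_hat)
```

Because the smallest of three relative frequencies is at most 1/3, the estimate of x is never negative.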

Empirical examples
Ebersberger et al., 2007: For multiple loci on {H,C,G}, applying a rooted triple method gives further evidence for the ((human,chimp),gorilla) tree, as well as dating divergences in it.
Jennings and Edwards, 2005; Wakeley, 2008: A rooted triple method can give evidence for evolutionary relationships among a triple of Australian grass finches.

Motivating Questions
How can species trees be inferred?
How can we summarize the gene tree distribution and still retain enough information to recover the species tree σ?
[Diagram: Coalescent model; Species Tree, Gene Trees, sequences; Inference Scheme; Identifiability from topological gene tree distribution]
In today's talk, focus on identifying species trees from summary statistics for topological gene trees (not practical inference).

Main Results
In (ADRa, ADRb, 2011), we prove that
1. Species tree topologies ψ and branch lengths λ are identifiable from the distribution of unrooted topological gene tree probabilities.
2. Species tree topologies ψ are identifiable from clade probabilities.

Unrooted gene tree distribution
The ugt distribution is obtained by summing the probabilities of all rooted gene trees with the same unrooted topology.
P[AB|CD] = P[(((AB)C)D)] + P[(((AB)D)C)] + P[(((CD)A)B)] + P[(((CD)B)A)] + P[((AB)(CD))]
[Figure: the unrooted tree AB|CD shown as the sum of the five rooted trees above]
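For 4 taxa, this summation can be sketched by mapping each rooted gene tree to its unrooted split. The nested-tuple tree encoding and the function names below are assumptions made for illustration; the sketch relies on the fact that every rooted binary 4-taxon tree contains a cherry, and that the cherry together with the remaining two taxa gives the unrooted split.

```python
from collections import defaultdict


def leaves(tree):
    """Set of leaf labels of a rooted binary tree given as nested 2-tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaves(left) | leaves(right)


def cherries(tree):
    """All pairs of sibling leaves in the tree."""
    if isinstance(tree, str):
        return []
    left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        return [frozenset([left, right])]
    return cherries(left) + cherries(right)


def quartet_split(tree):
    """Unrooted topology of a rooted binary 4-taxon tree, e.g. 'AB|CD'."""
    taxa = leaves(tree)
    cherry = cherries(tree)[0]
    sides = sorted("".join(sorted(side)) for side in (cherry, taxa - cherry))
    return "|".join(sides)


def unrooted_distribution(rooted_probs):
    """Sum rooted gene tree probabilities over each unrooted topology."""
    result = defaultdict(float)
    for tree, prob in rooted_probs.items():
        result[quartet_split(tree)] += prob
    return dict(result)


five = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "D"), "C"),
    ((("C", "D"), "A"), "B"),
    ((("C", "D"), "B"), "A"),
    (("A", "B"), ("C", "D")),
]
print({quartet_split(t) for t in five})
```

The five rooted trees listed on the slide all map to the same split, which is why their probabilities are added.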

Motivation
Why look at unrooted topological gene trees?
Many species tree methods using the coalescent currently assume or co-estimate rooted metric gene trees (BEST, *BEAST, STEM, ...).
To infer roots of gene trees, you need a molecular clock hypothesis (unrealistic?) or an outgroup (non-statistical, perhaps biased placement).
[Figure: trees on H, C, G, O. MC: each tree has constant root-to-tip distance. Outgroup: O is used to root the tree, then deleted.]
Inferred branch lengths in metric gene trees can be sensitive to the model used in inference; gene tree topology is more robust to model choice.

Results (ugt)
Theorem (ADR). Unrooted topological gene tree probabilities determine the metric species tree.
For 4 taxa, the unrooted gene tree distribution determines the unrooted species tree, but not the rooted species tree.
The unrooted gene tree distribution determines the rooted species tree and branch lengths when there are 5 or more taxa.
These results remain valid for species trees with polytomies (non-binary trees).
This theorem justifies the use of ML or Bayesian methods to infer species trees from unrooted topological gene tree probabilities.

Method of proof (Part I)
Let u_i denote the probability of unrooted gene tree T_i.

Tree  Splits            Prob.
T1    AB|CDE, ABC|DE    u1
T2    AB|CDE, ABD|CE    u2
T3    AB|CDE, ABE|CD    u3
T4    AC|BDE, ABC|DE    u4
T5    AC|BDE, ACD|BE    u5
T6    AC|BDE, ACE|BD    u6
T7    AD|BCE, ABD|CE    u7
T8    AD|BCE, ACD|BE    u8
T9    AD|BCE, ADE|BC    u9
T10   AE|BCD, ABE|CD    u10
T11   AE|BCD, ACE|BD    u11
T12   AE|BCD, ADE|BC    u12
T13   BC|ADE, ABC|DE    u13
T14   BD|ACE, ABD|CE    u14
T15   BE|ACD, ABE|CD    u15

[Figure: 5-taxon species trees with internal branch lengths x, y, z]
For 5-taxon species trees σ, we show that different topologies satisfy different equations and inequalities in the u_i.
Once the topology is known, internal branch lengths can be computed.
Generalize from 5 to n.
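The 15 unrooted trees in the table can be enumerated mechanically. The sketch below assumes the standard fact that an unrooted binary 5-taxon tree is determined by its two disjoint cherries, and prints each tree in the table's two-split notation; the function name is hypothetical.

```python
from itertools import combinations


def five_taxon_unrooted_trees(taxa="ABCDE"):
    """List the unrooted binary 5-taxon trees as pairs of split strings."""
    trees = []
    for c1, c2 in combinations(combinations(taxa, 2), 2):
        if set(c1).isdisjoint(c2):
            # First split: cherry c1 against the other three taxa.
            rest1 = "".join(t for t in taxa if t not in c1)
            # Second split: the other three taxa against cherry c2.
            rest2 = "".join(t for t in taxa if t not in c2)
            trees.append(("".join(c1) + "|" + rest1, rest2 + "|" + "".join(c2)))
    return trees


trees = five_taxon_unrooted_trees()
for index, splits in enumerate(trees, start=1):
    print(f"T{index}", *splits)
```

The enumeration order produced here need not match the numbering T1 to T15 used on the slide.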

Note: The ADR theorem is an identifiability result. The proof method is not intended as a practical inference method for rooted, metric species trees.
Liu and Yu, "Estimating species trees from unrooted gene trees" (Sys. Biol. 2011), give a consistent algorithm to estimate unrooted topological species trees from unrooted topological gene trees.
The method is closely related to the STAR algorithm of Liu, Yu, Pearl, and Edwards. More to come on this...
Kreidl 2011, arXiv: rigorous proof of consistency of the algorithm.

Identifiability from clade probabilities
Defn: A clade C on a species tree σ is the set of taxa all descended from a node of σ. A clade on a gene tree T is defined similarly.
[Figure: species tree on a, b, c, d, e, f and gene tree on A, B, C, D, E, F]
Clades C_1 = {a,b,c} and C_2 = {e,f} depicted at left.
Question: Do clade probabilities on gene trees determine the species tree?

Why clades?
Clades on gene trees are natural to consider since they
are inherent to the coalescent;
summarize a collection of gene trees;
don't depend on metric information on gene trees (addresses lack of robustness);
are already being used to deal with gene tree/species tree differences (BUCKy).

Eg. Clade probabilities on 4-taxon trees
[Figure: caterpillar tree and balanced tree on a, b, c, d, each with internal branch lengths x, y]
Clade probabilities are functions of parameters σ = (ψ,λ). They do not form a distribution.

Eg. Clade probabilities on 4-taxon trees
Probabilities of clades for 4-taxon species trees under the coalescent. X = exp(−x), Y = exp(−y).

clade            probability under (((a,b):x,c):y,d)                                   probability under ((a,b):x,(c,d):y)
c1 = Pr(AB)      $1 - \frac{2}{3}X - \frac{1}{9}XY^3$                                  $1 - \frac{2}{3}X - \frac{1}{9}XY$
c2 = Pr(AC)      $\frac{1}{3}X - \frac{1}{9}XY^3$                                      $\frac{2}{9}XY$
c3 = Pr(AD)      $\frac{1}{6}XY + \frac{1}{18}XY^3$                                    $\frac{2}{9}XY$
c4 = Pr(BC)      $\frac{1}{3}X - \frac{1}{9}XY^3$                                      $\frac{2}{9}XY$
c5 = Pr(BD)      $\frac{1}{6}XY + \frac{1}{18}XY^3$                                    $\frac{2}{9}XY$
c6 = Pr(CD)      $\frac{1}{3}Y - \frac{1}{6}XY + \frac{1}{18}XY^3$                     $1 - \frac{2}{3}Y - \frac{1}{9}XY$
c7 = Pr(ABC)     $1 - \frac{2}{3}Y - \frac{1}{3}XY + \frac{1}{6}XY^3$                  $\frac{1}{3}Y - \frac{1}{6}XY$
c8 = Pr(ABD)     $\frac{1}{3}Y - \frac{1}{6}XY$                                        $\frac{1}{3}Y - \frac{1}{6}XY$
c9 = Pr(ACD)     $\frac{1}{6}XY$                                                       $\frac{1}{3}X - \frac{1}{6}XY$
c10 = Pr(BCD)    $\frac{1}{6}XY$                                                       $\frac{1}{3}X - \frac{1}{6}XY$

These parameterizations give rise to algebraic varieties V_σ.
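The two parameterizations can be evaluated exactly with rational arithmetic. In this sketch X and Y are passed in directly as rationals (rather than computed from x and y) so that the arithmetic is exact; that choice, the function names, and the reading of the garbled fractions in the table are assumptions made here. The output also illustrates why clade probabilities are not a distribution: every rooted binary 4-taxon gene tree has exactly two non-trivial clades, so the ten values sum to 2.

```python
from fractions import Fraction as F


def caterpillar(X, Y):
    """Clade probabilities for species tree (((a,b):x,c):y,d)."""
    XY, XY3 = X * Y, X * Y**3
    return {
        "AB": 1 - F(2, 3) * X - F(1, 9) * XY3,
        "AC": F(1, 3) * X - F(1, 9) * XY3,
        "AD": F(1, 6) * XY + F(1, 18) * XY3,
        "BC": F(1, 3) * X - F(1, 9) * XY3,
        "BD": F(1, 6) * XY + F(1, 18) * XY3,
        "CD": F(1, 3) * Y - F(1, 6) * XY + F(1, 18) * XY3,
        "ABC": 1 - F(2, 3) * Y - F(1, 3) * XY + F(1, 6) * XY3,
        "ABD": F(1, 3) * Y - F(1, 6) * XY,
        "ACD": F(1, 6) * XY,
        "BCD": F(1, 6) * XY,
    }


def balanced(X, Y):
    """Clade probabilities for species tree ((a,b):x,(c,d):y)."""
    XY = X * Y
    return {
        "AB": 1 - F(2, 3) * X - F(1, 9) * XY,
        "AC": F(2, 9) * XY,
        "AD": F(2, 9) * XY,
        "BC": F(2, 9) * XY,
        "BD": F(2, 9) * XY,
        "CD": 1 - F(2, 3) * Y - F(1, 9) * XY,
        "ABC": F(1, 3) * Y - F(1, 6) * XY,
        "ABD": F(1, 3) * Y - F(1, 6) * XY,
        "ACD": F(1, 3) * X - F(1, 6) * XY,
        "BCD": F(1, 3) * X - F(1, 6) * XY,
    }


X, Y = F(1, 2), F(1, 3)
cat, bal = caterpillar(X, Y), balanced(X, Y)
print(sum(cat.values()), sum(bal.values()))
```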

High probability clades
Theorem. Suppose σ = (ψ,λ) is a species tree (not necessarily binary), and that C is a gene tree clade with $\Pr_\sigma(C) > 1/3$. Then C is a clade on σ.
Partial analog for rooted triple gene tree probability.
But a clade C can be on σ with Pr(C) ≤ 1/3.

Marginalization fails
For a proof of identifiability of the species tree topology from clade probabilities, it was not possible to argue from small to large trees.
For example, suppose σ is a species tree on X = {a,b,c,d}, and υ is the induced species tree on the 3-taxon subset {a,b,c}.
Then there does NOT exist any linear combination of the clade probabilities on σ that gives the clade probabilities on υ and is valid for all species tree topologies.
I.e., we can't relate clade probabilities on trees and their subtrees.

Algebraic proofs of identifiability
Clade probabilities for a particular species tree topology satisfy equations specific to that topology.
Eg. Identifiability of 4-taxon σ:
[Figure: caterpillar tree and balanced tree on a, b, c, d, each with internal branch lengths x, y]
Probabilities of clades for 4-taxon species trees under the coalescent. X = exp(−x), Y = exp(−y).

clade            probability under (((a,b):x,c):y,d)          probability under ((a,b):x,(c,d):y)
c2 = Pr(AC)      $\frac{1}{3}X - \frac{1}{9}XY^3$             $\frac{2}{9}XY$
c3 = Pr(AD)      $\frac{1}{6}XY + \frac{1}{18}XY^3$           $\frac{2}{9}XY$

For example, c2 = c3 for clade probabilities arising from the balanced tree, but c2 ≠ c3 for the caterpillar tree.
Conclusion: 4-taxon species tree topologies are identifiable from {c_i}.

Identifiability of small σ
For small species trees, we were able to find polynomial equations that exactly determined the species tree topology (using the computational algebra package Singular...).
This proved identifiability for small species tree topologies.
But 4-, 5-taxon species tree topology identifiability ⇏ n-taxon tree identifiability result.
For a general theorem, an understanding of the relationships between clade probabilities, without computation, was needed.

Polynomial relations
Cherry swapping invariants
Proposition. If (a,b) is a 2-clade on the species tree, then for any non-empty set D of taxa excluding A, B,
Pr(D ∪ {A}) = Pr(D ∪ {B}).
For example, Pr({AC}) = Pr({BC}). (Equiv. Pr({AC}) − Pr({BC}) is an invariant for σ.)
[Figure: species tree on a, b, c with internal branch length x]
Proof: Under the coalescent model, lineages for A, B are exchangeable (i.e., behave identically).

Exchangeability of gene lineages
If two lineages are present in the same population at a particular point in time on σ, then above that point both lineages behave exactly the same way.
[Figure: species tree on a, b, c, d, e with times T_0, T_1, T_2 marked]
Lineages A, B behave exactly the same into the past, for T < T_0. Similarly, lineages A, B-C behave the same for T < T_1, and lineages A, B-C, D-E for T < T_2.

More polynomial relations
Since no other polynomial relationships for clade probabilities were obvious, we obtained polynomial invariants for small trees from Singular.
Ex: Species tree (((a,b):x,c):y,d) (c_i = probability of clade i):
stdjhcat[1] = c9 − c10
stdjhcat[2] = c5 − c6 + c8 − c10
stdjhcat[3] = c3 − c6 + c8 − c10
stdjhcat[4] = c2 − c4
stdjhcat[5] = c1 + 2 c4 + 9 c6 − c7 − 11 c8 − 4 c10
stdjhcat[6] = 3 c4 c8 + 6 c6 c8 − 6 c8^2 + 3 c4 c10 + 12 c6 c10 − 2 c7 c10 ...
stdjhcat[7] = 9 c6^3 − 6 c6^2 c7 + c6 c7^2 − 39 c6^2 c8 + 16 c6 c7 c8 ...
stdjhcat[8] = 9 c4 c6^2 − 3 c4 c6 c7 + 6 c6^2 c7 − 2 c6 c7^2 + 60 c6^2 c8 ...
stdjhcat[9] = 9 c4^2 c6 + 12 c4 c6 c7 + 4 c6 c7^2 − 84 c6^2 c8 + ...
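The five linear invariants can be checked against the caterpillar clade probabilities with exact arithmetic. Two assumptions are made in this sketch: each missing operator in stdjhcat[1] through stdjhcat[5] is read as a minus sign, and the caterpillar formulas are the reading of the 4-taxon table used here. The truncated higher-degree polynomials are not checked.

```python
from fractions import Fraction as F


def caterpillar_clades(X, Y):
    """c1..c10 for species tree (((a,b):x,c):y,d), X = exp(-x), Y = exp(-y)."""
    XY, XY3 = X * Y, X * Y**3
    return {
        1: 1 - F(2, 3) * X - F(1, 9) * XY3,                 # Pr(AB)
        2: F(1, 3) * X - F(1, 9) * XY3,                     # Pr(AC)
        3: F(1, 6) * XY + F(1, 18) * XY3,                   # Pr(AD)
        4: F(1, 3) * X - F(1, 9) * XY3,                     # Pr(BC)
        5: F(1, 6) * XY + F(1, 18) * XY3,                   # Pr(BD)
        6: F(1, 3) * Y - F(1, 6) * XY + F(1, 18) * XY3,     # Pr(CD)
        7: 1 - F(2, 3) * Y - F(1, 3) * XY + F(1, 6) * XY3,  # Pr(ABC)
        8: F(1, 3) * Y - F(1, 6) * XY,                      # Pr(ABD)
        9: F(1, 6) * XY,                                    # Pr(ACD)
        10: F(1, 6) * XY,                                   # Pr(BCD)
    }


def linear_invariants(c):
    """Values of stdjhcat[1..5], with missing operators read as minus signs."""
    return [
        c[9] - c[10],
        c[5] - c[6] + c[8] - c[10],
        c[3] - c[6] + c[8] - c[10],
        c[2] - c[4],
        c[1] + 2 * c[4] + 9 * c[6] - c[7] - 11 * c[8] - 4 * c[10],
    ]


for X, Y in [(F(1, 2), F(1, 3)), (F(2, 7), F(5, 11)), (F(9, 10), F(1, 10))]:
    print(linear_invariants(caterpillar_clades(X, Y)))
```

All five expressions evaluate to zero at every tested parameter choice, which is consistent with the sign reading.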

Insight from computations
For some 5-taxon species trees, the elimination calculation did not terminate, but we do obtain some polynomials of low degree.
Such calculations... and much staring to discern a pattern... led to formulating the following theorem.

Polynomial relations determine σ
Theorem. Let C be a subset of taxa with at least two elements, and D a non-empty set of taxa disjoint from C. For distinct a, b ∈ C, let C' = C \ {a,b}. Then if C is a clade on the species tree σ, the polynomial
$f = \sum_{S \subseteq C'} \Pr(S \cup \{a\} \cup D) - \sum_{S \subseteq C'} \Pr(S \cup \{b\} \cup D) = 0$.
Moreover, if C is not a clade on the species tree, then for generic edge lengths the above equality does not hold.

Example
Caterpillar species tree σ = (((a,b):x,c):y,d)
[Figure: caterpillar tree on a, b, c, d with internal branch lengths x, y]
Clade C = {a,b,c}; C' = C \ {a,c} = {b}; D = {d}
$\sum_{S \subseteq C'} \Pr(S \cup \{a\} \cup D) - \sum_{S \subseteq C'} \Pr(S \cup \{c\} \cup D) = 0$,
or
$\left(\Pr\{a,d\} + \Pr\{a,b,d\}\right) - \left(\Pr\{c,d\} + \Pr\{b,c,d\}\right) = 0$.
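As a worked check, the four probabilities can be substituted from the caterpillar column of the 4-taxon clade probability table, with $X = e^{-x}$ and $Y = e^{-y}$. The fractions below are a reading of that table and should be treated as an assumption.

```latex
\begin{align*}
\Pr\{a,d\} + \Pr\{a,b,d\}
  &= \left(\tfrac{1}{6}XY + \tfrac{1}{18}XY^3\right)
   + \left(\tfrac{1}{3}Y - \tfrac{1}{6}XY\right)
   = \tfrac{1}{3}Y + \tfrac{1}{18}XY^3, \\
\Pr\{c,d\} + \Pr\{b,c,d\}
  &= \left(\tfrac{1}{3}Y - \tfrac{1}{6}XY + \tfrac{1}{18}XY^3\right)
   + \tfrac{1}{6}XY
   = \tfrac{1}{3}Y + \tfrac{1}{18}XY^3.
\end{align*}
```

The two sums agree for all x and y, so the difference vanishes identically, as the theorem requires when C = {a,b,c} is a clade on σ.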

Results (clade probs)
This leads to
Theorem (ADR). Clade probabilities determine the species tree topology for generic edge lengths.
For small trees, it is also possible to recover branch lengths, but no general method is known.

STAR and clades
Liu, Yu, Pearl, and Edwards (Sys. Biol. 2009) introduce the STAR algorithm for estimating species tree topologies from gene trees.
(2012? ADR give a rigorous proof that STAR consistently estimates ψ.)

STAR and clades
Outline: Given a collection of rooted (possibly metric) gene trees on n taxa,
1. For each gene tree, make all internal edges length 1, and set terminal edge lengths as needed so that the root-to-leaf distance is n.
2. Compute pairwise distances on each modified gene tree.
3. To find the distance d_STAR(a,b) between species a and b, average distances over all gene trees.
4. Use a distance method to reconstruct the species tree topology, ψ.
... extends to multiple individuals sampled per species.

STAR distances
For a rooted n-taxon gene tree, assign 1 to each internal edge and assign weights to terminal edges so that the root-to-tip distance is n.
[Figure: original gene tree on A, B, C, D with edge lengths .12, .09, .5, .2, .18, .38; modified gene tree on A, B, C, D with internal edges 1, 1 and terminal edges 4, 3, 2, 2]
On the modified gene tree, d(a,c) = 6, for example.
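Steps 1 to 3 of the outline can be sketched without building the modified tree explicitly: if the most recent common ancestor of two leaves sits `depth` internal edges below the root, their distance on the modified tree is 2(n − depth). The nested-tuple tree encoding and the function names are assumptions made for this illustration, and step 4 (the distance method) is left out.

```python
from itertools import combinations


def star_distances(tree, n):
    """Pairwise leaf distances on the STAR-modified version of a rooted tree."""
    dist = {}

    def visit(node, depth):
        if isinstance(node, str):
            return [node]
        groups = [visit(child, depth + 1) for child in node]
        # Leaves from different children have this node as their MRCA.
        for g1, g2 in combinations(groups, 2):
            for a in g1:
                for b in g2:
                    dist[frozenset((a, b))] = 2 * (n - depth)
        return [leaf for group in groups for leaf in group]

    visit(tree, 0)
    return dist


def average_star_distances(gene_trees, n):
    """Average the per-tree distances over a collection of gene trees."""
    totals = {}
    for tree in gene_trees:
        for pair, d in star_distances(tree, n).items():
            totals[pair] = totals.get(pair, 0) + d
    return {pair: total / len(gene_trees) for pair, total in totals.items()}


tree1 = ((("A", "B"), "C"), "D")
tree2 = ((("A", "C"), "B"), "D")
print(star_distances(tree1, 4)[frozenset("AC")])
print(average_star_distances([tree1, tree2], 4)[frozenset("AB")])
```

For the caterpillar gene tree (((A,B),C),D) this reproduces d(a,c) = 6 from the slide.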

Connection: STAR and {Pr(c_i)}
Theorem (ADR). Suppose σ = (ψ,λ) is an n-taxon rooted binary metric species tree. Then the clade probabilities {Pr(C)} determine the expected STAR distances. (Empirical version too.) (And both determine ψ.)
Idea: [Figure: modified 5-taxon gene tree on A, B, C, D, E with internal edges of length 1 and terminal edges of lengths 3 and 4]
On this gene tree, the number of non-trivial clades containing A, B is 2; that is, the distance from the root to MRCA(A,B) is 2.
d_STAR(A,B) = 2n − 2(number of clades containing {A,B}).
This idea underlies
$d_{STAR}(a,b) = 2n - 2 \sum_{\text{non-trivial clades } C} \Pr(C)\, I_{a,b}(C)$,
where $I_{a,b}(C)$ indicates whether C contains both a and b.
Remark: This gives an algorithm for computing the species tree topology from {Pr(c_i)}, and explains the underpinnings of STAR.
The first taxa joined in STAR are those which appear in the most observed clades...
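The empirical version of this identity can be sketched by replacing Pr(C) with the observed frequency of each non-trivial clade in a collection of gene trees. The nested-tuple tree encoding, the function names, and the two example gene trees are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def nontrivial_clades(tree):
    """Return (taxa, clades): leaf sets below internal non-root nodes."""
    found = []

    def leaves(node, is_root):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child, False) for child in node))
        if not is_root:
            found.append(below)
        return below

    taxa = leaves(tree, True)
    return taxa, found


def star_from_clade_frequencies(gene_trees):
    """d(a,b) = 2n - 2 * sum of frequencies of clades containing a and b."""
    counts = Counter()
    for tree in gene_trees:
        taxa, found = nontrivial_clades(tree)
        counts.update(found)
    n, m = len(taxa), len(gene_trees)
    dist = {}
    for a, b in combinations(sorted(taxa), 2):
        support = sum(c for clade, c in counts.items() if a in clade and b in clade)
        dist[(a, b)] = 2 * n - 2 * support / m
    return dist


trees = [((("A", "B"), "C"), "D"), ((("A", "C"), "B"), "D")]
print(star_from_clade_frequencies(trees))
```

Pairs that share many observed clades get small distances, which matches the remark that the first taxa joined are those appearing together in the most clades.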

Thank you.