Non-binary Tree Reconciliation. Louxin Zhang Department of Mathematics National University of Singapore

Non-binary Tree Reconciliation Louxin Zhang Department of Mathematics National University of Singapore matzlx@nus.edu.sg

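Introduction: Gene Duplication Inference
Consider a gene family G that evolved by duplication, with members in seven species:
-- Species A: genes A_g1, A_g2, A_g3, A_g4
-- Species B: gene B_g1
-- Species C: gene C_g1
-- Species D: genes D_g1, D_g2
-- Species E: genes E_g1, E_g2
-- Species F: gene F_g1
-- Species H: gene H_g1
[Figure: species tree over A, B, C, D, E, F, H annotated with duplication and gene loss events]
Question: How can the duplication history of the gene family G be reconstructed?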

Introduction: Tree Reconciliation Approach
Step 1. Build the gene tree G for the gene family from gene sequences, and build the species tree S if it is not available.

Gene Tree and Species Tree
A species tree S represents the evolutionary pathways of a group of species.
A gene tree G is reconstructed from gene sequences. It represents the evolutionary relationships of the genes, but it is not the duplication history of the gene family.
[Figure: species tree S over S1, S2, S3, S4 and gene tree G over the genes S1g1, S2g1, S3g1, S4g1]
G can differ from the corresponding S in two respects:
-- The divergence of two genes may predate the divergence of the corresponding species.
-- Their topologies are different.

Introduction: Tree Reconciliation Approach
Step 1. Build the gene tree G for the gene family from gene sequences, and build the species tree S if it is not available.
Step 2. Reconcile G and S to infer gene duplication and loss events, forming a duplication history of the gene family.
[Figure: gene tree G with leaves a, b, a, d, e, a, h, d, e, f, h, a, c mapped onto the species tree S over A, B, C, D, E, F, H by the node-to-node map λ]

LCA reconciliation λ: Binary trees
In G, the leaves are labeled with the corresponding species. Notation:
-- $l_G(x)$: the label of a leaf x of G;
-- $l_S^{-1}(y)$: the leaf of S that has the label y;
-- lca: the lowest common ancestor of two nodes;
-- $v_1, v_2$: the children of v.
The map $\lambda: V(G) \to V(S)$ is defined as:
$\lambda(v) = l_S^{-1}(l_G(v))$ if v is a leaf of G, and $\lambda(v) = \mathrm{lca}(\lambda(v_1), \lambda(v_2))$ otherwise.
[Figure: nodes w, v, x of the gene tree G and their images λ(w), λ(v), λ(x) in the species tree S over A, B, C, D, E, F, H]
(Goodman et al., 1979)

LCA reconciliation λ: Binary trees (cont.)
A node $v \in V(G)$ is a duplication node if $\lambda(v) = \lambda(v_1)$ or $\lambda(v) = \lambda(v_2)$.
[Figure: gene tree nodes r, z, w, u, v, y with λ(u) = λ(v) and λ(r) = λ(w) = λ(z) = λ(y) in the species tree S over A, B, C, D, E, F, H]
For each duplication node v, a duplication is assumed in the branch entering λ(v). It produces two gene copies, which are the ancestors of the modern genes in the left subtree and in the right subtree of v, respectively.
[Figure: species tree over A, B, C, D, E, F, H annotated with the inferred duplication and gene loss events]
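The definition of λ and the duplication rule can be sketched in a few lines of Python. This is a minimal illustration, not the linear-time algorithm cited in the talk: trees are assumed to be nested tuples whose leaves are species names, each node of S is represented by its leaf set (cluster), and the lca is found by a simple scan. The trees `S` and `G` and all function names are hypothetical.

```python
def clusters(tree, out):
    """Collect the leaf set (cluster) of every node of a nested-tuple tree."""
    if isinstance(tree, str):
        c = frozenset([tree])
    else:
        c = frozenset().union(*(clusters(child, out) for child in tree))
    out.append(c)
    return c

def lca(species_clusters, species):
    """Smallest cluster of S containing `species`; this is the lca node."""
    return min((c for c in species_clusters if species <= c), key=len)

def reconcile(gene, species_clusters, dups):
    """Return lambda(gene) and append every duplication node to `dups`."""
    if isinstance(gene, str):
        return lca(species_clusters, frozenset([gene]))
    left, right = (reconcile(child, species_clusters, dups) for child in gene)
    image = lca(species_clusters, left | right)
    if image in (left, right):      # lambda(v) equals the image of a child
        dups.append(gene)
    return image

# Hypothetical example: species tree ((A,B),C), gene tree ((A,B),(A,C)).
S = (("A", "B"), "C")
G = (("A", "B"), ("A", "C"))
s_clusters = []
clusters(S, s_clusters)
dup_nodes = []
root_image = reconcile(G, s_clusters, dup_nodes)
print(sorted(root_image), dup_nodes)
```

In this example only the root of G is a duplication node, because its right child already maps to the root of S.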

LCA reconciliation λ: Binary trees (cont.)
(The gene duplication cost of λ) = (no. of duplication nodes)
(The gene loss cost of λ) = (no. of gene loss events)
[Figure: a gene tree node u with children $u_1$, $u_2$ and the images λ(u), λ($u_1$), λ($u_2$) in the species tree S over A, B, C, D, E, F, H]
The gene loss cost can be computed from the number of lineages branching off the paths from λ(u) to λ($u_1$) and to λ($u_2$).
The gene duplication cost and the gene loss cost are two dissimilarity measures for gene and species trees.
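A hedged sketch of how both costs might be computed together for binary trees. It assumes the commonly used path-based counting rule, which the slide only alludes to: with d the number of edges from λ(u) down to the image of a child, a non-duplication node contributes d - 1 losses per child and a duplication node contributes d losses per child. Nodes of S are again represented by their leaf sets; the trees and names are hypothetical.

```python
def clusters(tree, depth, d=0):
    """Map the cluster of every node of the species tree to its depth."""
    if isinstance(tree, str):
        c = frozenset([tree])
    else:
        c = frozenset().union(*(clusters(ch, depth, d + 1) for ch in tree))
    depth[c] = d
    return c

def lca(depth, species):
    """Smallest cluster of S that contains `species`."""
    return min((c for c in depth if species <= c), key=len)

def costs(gene, depth):
    """Return (lambda(gene), duplication cost, loss cost) for binary trees."""
    if isinstance(gene, str):
        return lca(depth, frozenset([gene])), 0, 0
    (il, dl, ll), (ir, dr, lr) = (costs(ch, depth) for ch in gene)
    image = lca(depth, il | ir)
    # Edge counts from lambda(u) down to the image of each child.
    gaps = [depth[c] - depth[image] for c in (il, ir)]
    if image in (il, ir):                       # duplication node
        dup, loss = 1, sum(gaps)
    else:                                       # speciation node
        dup, loss = 0, sum(g - 1 for g in gaps)
    return image, dl + dr + dup, ll + lr + loss

S = (("A", "B"), "C")
G = (("A", "B"), ("A", "C"))
depth = {}
clusters(S, depth)
_, dup_cost, loss_cost = costs(G, depth)
print(dup_cost, loss_cost)
```

For this input the sketch reports one duplication and two losses: one copy is lost in C and the other in B.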

Theorem. Let G and S be binary.
1) λ gives a duplication history of the gene family with the fewest gene duplication events (Gorecki & Tiuryn, 2006).
2) λ gives a duplication history of the gene family with the fewest gene loss events (Chauve & El-Mabrouk, 2009).
3) λ gives a duplication history of the gene family with the least deep coalescence cost (Wu & Zhang).
4) λ is linear-time computable (Zhang, 1997; Chen, Durand & Farach, 2000).
Hence λ is the parsimonious reconciliation for binary trees.

Introduction: Species Tree Reconstruction
Species Tree (ST) Problem
Instance: A set of gene trees $G_i$ ($1 \le i \le n$) and a cost function c(·, ·).
Solution: A binary species tree S that minimizes $\sum_{1 \le i \le n} c(G_i, S)$.
The following cost functions have been used:
-- Gene duplication cost W
-- Gene loss cost L
-- Deep coalescence cost DC
-- Mutation cost (W + L), or a weighted sum of W and L
-- Robinson-Foulds distance
The ST problem is NP-hard for each of the above cost functions.
(Hallett & Lagergren; Yu, Warnow & Nakhleh; Than & Nakhleh, 2009; Liu, Yu, Kubatko, Pearl & Edwards, 2009; McMorris & Steel, 1993; Ma, Li & Zhang; Bansal & Shamir; Zhang)

Introduction: Unify Two Problems
General Reconciliation (GR) Problem
Instance: A gene tree G, a species tree S, and a reconciliation cost c(·, ·).
Solution: A binary refinement Ĝ of G and a binary refinement Ŝ of S such that the lca reconciliation of Ĝ and Ŝ minimizes the reconciliation cost c(Ĝ, Ŝ).
[Figure: a non-binary node and its binary refinement; refinement and contraction are inverse operations]
(Eulenstein, Huzurbazar & Liberles)

Two remarks
1. The GR problem is a generalization of binary tree reconciliation.
2. The species tree inference problem is a special case of the GR problem, and hence the GR problem is NP-hard. In the reduction from the Species Tree problem to the GR problem, set S to be the star tree over the species.
Species Tree Inference
Instance: A set of gene trees $G_i$ ($1 \le i \le n$).
Solution: A binary species tree S that minimizes $\sum_{1 \le i \le n} c(G_i, S)$.

Outline of Today's Talk
-- Relationship between tree similarity measures
-- Algorithms for the General Reconciliation problem
   -- Extensions of the reconciliation of binary trees to non-binary gene trees
   -- Exact algorithm for reconciling two non-binary trees
-- Computer program TxT
-- Conclusion
(Zheng, Wu & Zhang; Zheng & Zhang, 2013)

Part I: Relationship between Cost Functions
Theorem. Let S be a species tree and G the gene tree of a gene family. If one family member is found in each of the species, then
$C_{loss}(G, S) = 2\,C_{dup}(G, S) + C_{dc}(G, S)$,
where $C_{dc}(G, S)$ (the deep coalescence cost) is defined as the sum of extra lineages in all branches when G is mapped onto S.
(Maddison, 1997; Zhang)
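One way to see where such a relation between the three costs can come from is a per-node bookkeeping argument. The sketch below is not taken from the talk. It assumes that G and S are both binary with n leaves, that every species holds exactly one gene, that losses are counted with the usual path-based rule, and that the extra lineages on a branch of S are the gene lineages crossing it minus one. Here u ranges over the internal nodes of G, $u_1, u_2$ are its children, and d(x, y) is the number of edges on the path in S from x down to y.

```latex
% Assumptions (not from the slides): binary G and S, one gene per species,
% d_i = d(\lambda(u), \lambda(u_i)) counted in edges of S.
\begin{align*}
\mathrm{loss}(u) &=
  \begin{cases}
    (d_1 - 1) + (d_2 - 1), & u \text{ is not a duplication node},\\
    d_1 + d_2,             & u \text{ is a duplication node},
  \end{cases}\\[4pt]
C_{dc}(G,S) &= \sum_{u} (d_1 + d_2) - |E(S)|
             = \sum_{u} (d_1 + d_2 - 2),\\[4pt]
C_{loss}(G,S) - C_{dc}(G,S) &= \sum_{u \text{ duplication}} 2
             = 2\,C_{dup}(G,S).
\end{align*}
```

The second line uses $|E(S)| = 2n - 2$, which is twice the number of internal nodes of G, so each internal node of G absorbs two branches of S. Under these assumptions the loss cost exceeds the deep coalescence cost by exactly two per duplication node.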

Consider two singly-labeled trees G and S over a set X of n taxa (that is, each leaf is uniquely labeled with an element of X). The Robinson-Foulds distance $C_{RF}(G, S)$ is defined to be the number of leaf clusters appearing in G but not in S.
[Figure: two trees, one with leaf order a, b, c, d, e, f, g, h and one with leaf order a, c, b, d, g, e, f, h; the clusters {a, b} and {e, f, g, h} are marked]
Proposition
(i) For G and S defined above, $C_{dup}(G, S) \le C_{RF}(G, S) \le C_{DC}(G, S) \le C_{loss}(G, S)$.
(ii) $\max_{G,S} C_{dup}(G, S) = \max_{G,S} C_{RF}(G, S) = n - 2$.
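The cluster-based definition above translates directly into a set difference. The following is a minimal sketch assuming rooted trees given as nested tuples with unique leaf labels; the function names and the three example trees are hypothetical and are not the trees drawn on the slide.

```python
def clusters(tree, out):
    """Add the leaf set of every internal node of `tree` to the set `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    c = frozenset().union(*(clusters(child, out) for child in tree))
    out.add(c)
    return c

def rf_cost(g, s):
    """Number of clusters that appear in g but not in s."""
    cg, cs = set(), set()
    clusters(g, cg)
    clusters(s, cs)
    return len(cg - cs)

G = ((("a", "b"), "c"), "d")
S = ((("a", "c"), "b"), "d")
REVERSED = ((("d", "c"), "b"), "a")

print(rf_cost(G, S))         # only {a, b} is missing from S
print(rf_cost(G, REVERSED))  # {a, b} and {a, b, c} are missing
```

With four taxa, the two caterpillar trees in opposite order differ in both of their non-trivial proper clusters.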

Theorem
(i) There exist G and S with n leaves such that $C_{dup}(G, S) = 1$ but $C_{RF}(G, S) = n - 2$.
(ii) For any G and S defined above, $\max\{C_{dup}(G, S), C_{dup}(S, G)\} \le C_{RF}(G, S)$.
[Figure: duplication cost distribution and Robinson-Foulds distribution; vertical axis: #(gene trees) (%); horizontal axis: species tree topologies (listed in Furnas rank)]

Part II: Reconciling Non-binary G and Binary S
Instance: A gene tree G, a binary species tree S, and a cost c(·, ·).
Solution: The binary refinement Ĝ of G such that the lca reconciliation of Ĝ and S minimizes c(Ĝ, S).
The following duplication inference rule does not work for non-binary nodes: a duplication is associated with a node u having children $u_1$ and $u_2$ iff $\lambda(u) = \lambda(u_1)$ or $\lambda(u) = \lambda(u_2)$.
Durand et al. (2006) presented the first dynamic programming algorithm for reconciling a non-binary gene tree and a binary species tree.
Generalize the reconciliation to non-binary gene trees. The whole process takes $O(|G| + |S|)$ time for the duplication and loss costs.

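λ: The lca reconciliation of G and S
[Figure: a non-binary node v of G whose children cover the species sets ac, a, de, ag, ab, de, fg, and the species tree S over a, b, c, d, e, f, g]
Under λ, the node v and its children are mapped to a subtree of S (blue), which is expanded into a binary subtree by adding purple edges.
The image subtree is denoted I(v), and $I^+(v)$ after the extension.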

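[Figure: the node v of G with children ac, a, de, ag, ab, de, fg, and the species tree S over a, b, c, d, e, f, g annotated with m(u) values]
Step 1. Compute m(u), the maximum number of child images on a path from u to some leaf descendant in $I^+(v)$.
Algorithm: ω(u) is the number of children mapped to u, and
$m(u) = \max\{m(u_1), m(u_2)\} + \omega(u)$.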
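The recurrence for m(u) is a single bottom-up pass. The sketch below assumes a species tree given as nested tuples and identifies each node with its leaf set; the topology of `S` is a hypothetical choice (the transcript does not preserve the slide's tree), while the child species sets are the ones listed on the slide. For a leaf u, m(u) is taken to be ω(u).

```python
from collections import Counter

def leaves(t):
    """Leaf set (cluster) of a nested-tuple binary tree."""
    return frozenset([t]) if isinstance(t, str) else leaves(t[0]) | leaves(t[1])

def lca(s, species):
    """Cluster of the lowest node of s whose leaf set contains `species`."""
    while not isinstance(s, str):
        below = [child for child in s if species <= leaves(child)]
        if not below:
            break
        s = below[0]
    return leaves(s)

def max_chain(s, omega, m):
    """Fill m[u] = max(m(u1), m(u2)) + omega(u), bottom-up."""
    deepest = 0 if isinstance(s, str) else max(max_chain(c, omega, m) for c in s)
    m[leaves(s)] = deepest + omega[leaves(s)]
    return m[leaves(s)]

# Hypothetical species tree; child species sets as listed on the slide.
S = ((("a", "b"), "c"), (("d", "e"), ("f", "g")))
children = ["ac", "a", "de", "ag", "ab", "de", "fg"]

omega = Counter(lca(S, frozenset(c)) for c in children)  # omega(u)
m = {}
max_chain(S, omega, m)
print(m[leaves(S)], m[frozenset("abc")])
```

For this assumed tree the longest path of child images runs a, ab, abc, root, so m is 4 at the root and 3 at the node above a, b, c.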

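Theorem
(i) The minimum duplication cost for refining the non-binary node v is $m(\lambda(v)) - 1$.
(ii) The minimum loss cost for refining v is equal to the number of purple edges.
Idea of proof. (i) Let $\mathcal{P} = \{\lambda(v_1), \lambda(v_2), \ldots, \lambda(v_k)\}$ be the images of the children of v, ordered by the ancestor relation in S. Let L be the size of the longest chain in $\mathcal{P}$, which is $m(\lambda(v))$ in our case, and let P be the minimum number of antichains into which $\mathcal{P}$ may be partitioned. By the dual of Dilworth's theorem (Mirsky, 1971), L = P.
(ii) It is obvious.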
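The chain and antichain statement in the proof idea can be checked on a small example. The sketch below is an illustration of Mirsky's theorem, not the refinement algorithm of the talk: it assigns to every child image the length of the longest chain ending at it, so that images with the same level form an antichain. Images are given directly as leaf sets of a hypothetical species tree ((a,b),c),((d,e),(f,g)); two children sharing an image are treated as comparable.

```python
def antichain_levels(images):
    """Level of image i = length of the longest chain of images ending at i."""
    # Process smaller clusters (descendants) before larger ones (ancestors).
    order = sorted(range(len(images)), key=lambda i: len(images[i]))
    level = {}
    for i in order:
        below = [level[j] for j in level if images[j] <= images[i]]
        level[i] = 1 + max(below, default=0)
    return level

# Images of the seven children of v under the assumed species tree.
images = [frozenset(x) for x in ["abc", "a", "de", "abcdefg", "ab", "de", "fg"]]
level = antichain_levels(images)

groups = {}
for i, k in level.items():
    groups.setdefault(k, []).append("".join(sorted(images[i])))
for k in sorted(groups):
    print(k, sorted(groups[k]))
```

The number of levels equals the length of the longest chain (a, ab, abc, root), and each level is a set of pairwise incomparable images.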

1. A Simple Refinement with the Optimal Dup. Cost
[Figure: the extended image subtree with each branch annotated α(u) / β(u)]
Step 2. Compute α(u) / β(u) using m(u).
Algorithm:
$\alpha(r) = 1$, $\beta(r) = m(r)$;
$\alpha(u) = \beta(p(u)) - \omega(p(u))$, $\beta(u) = m(u)$.
α(u): the number of genes flowing into the branch (p(u), u).
β(u): the number of genes leaving the branch (p(u), u).
ω(u): the number of children mapped to u.

[Figure: the extended image subtree with each branch annotated α(u) / β(u) and the inferred duplication and loss events marked]
Step 3. Infer duplications and losses:
If α(u) < β(u), duplications are postulated.
If α(u) > β(u), losses are postulated.
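Steps 1 to 3 of the dup-optimal refinement can be put together as follows. This is a sketch under stated assumptions: the species tree is a hypothetical nested-tuple tree, the root is assumed to receive a single gene copy (α(r) = 1), a non-root branch receives β(p(u)) - ω(p(u)) copies, and the number of events on a branch is taken to be |α(u) - β(u)|. The ω values are those of the child sets ac, a, de, ag, ab, de, fg under this assumed tree.

```python
from collections import Counter

def leaves(t):
    """Leaf set (cluster) of a nested-tuple binary tree."""
    return frozenset([t]) if isinstance(t, str) else leaves(t[0]) | leaves(t[1])

def max_chain(s, omega, m):
    """Step 1: m(u) = max(m(u1), m(u2)) + omega(u)."""
    deepest = 0 if isinstance(s, str) else max(max_chain(c, omega, m) for c in s)
    m[leaves(s)] = deepest + omega[leaves(s)]
    return m[leaves(s)]

def flows(s, omega, m, alpha, out):
    """Step 2: record (alpha(u), beta(u)) for every node, top-down."""
    c = leaves(s)
    beta = m[c]
    out[c] = (alpha, beta)
    if not isinstance(s, str):
        for child in s:
            # Genes entering a child branch: those leaving u minus those mapped to u.
            flows(child, omega, m, beta - omega[c], out)

S = ((("a", "b"), "c"), (("d", "e"), ("f", "g")))   # hypothetical species tree
omega = Counter({frozenset("abcdefg"): 1, frozenset("abc"): 1, frozenset("ab"): 1,
                 frozenset("a"): 1, frozenset("de"): 2, frozenset("fg"): 1})
m, ab = {}, {}
max_chain(S, omega, m)
flows(S, omega, m, 1, ab)           # assumption: alpha(root) = 1

# Step 3: duplications where alpha < beta, losses where alpha > beta.
dups = sum(max(0, b - a) for a, b in ab.values())
losses = sum(max(0, a - b) for a, b in ab.values())
print(ab[leaves(S)], dups, losses)
```

In this example all three duplications sit on the root branch, which is annotated 1/4, and five losses are spread over the branches that receive more copies than they pass on.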

2. A Simple Refinement with the Optimal Loss Cost
[Figure: the extended image subtree with each branch annotated α(u) / β(u)]
Step 2. Compute α(u) / β(u).
Algorithm:
$\alpha(u) = 1$;
$\beta(u) = \omega(u) + 1$ if u is an internal node, and $\beta(u) = \omega(u)$ if u is a leaf.
α(u): the number of genes flowing into the branch (p(u), u).
β(u): the number of genes leaving the branch (p(u), u).
ω(u): the number of children mapped to u.

[Figure: the extended image subtree with each branch annotated α(u) / β(u) and the inferred duplication and loss events marked]
Step 3. Infer duplications and losses:
If α(u) < β(u), duplications are postulated.
If α(u) > β(u), losses are postulated.
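The loss-optimal variant needs no m(u) at all. The sketch below walks the extended image subtree, assuming α(u) = 1 on every branch, β(u) = ω(u) + 1 at internal nodes and β(u) = ω(u) at leaves. A node is treated as a leaf of the extended subtree when no child image lies strictly below it. The species tree and the ω values are the same hypothetical example as for the dup-optimal sketch.

```python
from collections import Counter

def leaves(t):
    """Leaf set (cluster) of a nested-tuple binary tree."""
    return frozenset([t]) if isinstance(t, str) else leaves(t[0]) | leaves(t[1])

def loss_optimal(s, omega):
    """Return (duplications, losses) over the extended image subtree rooted at s."""
    c = leaves(s)
    # u is a leaf of the extended subtree if no image is a proper subset of it.
    is_leaf = not any(image < c for image in omega)
    alpha = 1
    beta = omega[c] + (0 if is_leaf else 1)
    dups, losses = max(0, beta - alpha), max(0, alpha - beta)
    if not is_leaf:
        for child in s:
            d, l = loss_optimal(child, omega)
            dups, losses = dups + d, losses + l
    return dups, losses

S = ((("a", "b"), "c"), (("d", "e"), ("f", "g")))   # hypothetical species tree
omega = Counter({frozenset("abcdefg"): 1, frozenset("abc"): 1, frozenset("ab"): 1,
                 frozenset("a"): 1, frozenset("de"): 2, frozenset("fg"): 1})
print(loss_optimal(S, omega))
```

Here the two losses correspond to the two added edges leading to b and c, at the price of four duplications instead of three.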

3. A Refinement Minimizing the Loss Cost with the Constraint of Optimal Dup. Cost
[Figure: the algorithm and the extended image subtree with each branch annotated α(u) / β(u)]
Step 2. Compute α(u) / β(u) using m(u).
Step 3. Infer duplications and losses:
If α(u) < β(u), duplications are postulated.
If α(u) > β(u), losses are postulated.

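[Figure: the node v of G with children ac, a, de, ag, ab, de, fg and the species tree S over a, b, c, d, e, f, g, comparing three refinements: the dup-optimal solution, the loss-optimal solution, and the solution that minimizes duplications and then losses]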

4. Exact Algorithm for Reconciling Non-binary Trees
[Figure: a non-binary species tree S over a, b, c, d, e, f, h with its refinement Ŝ, and a non-binary gene tree G with children ac, a, de, ah, ab, de, fh with its refinement Ĝ; the resulting reconciliation has 8 losses and 3 duplications]
Step 1. Obtain the optimal refinement Ŝ of S using the union network.
Step 2. Refine G based on the refinement Ŝ of S, obtaining Ĝ.
Step 3. Reconcile Ĝ and Ŝ to infer the evolution of the gene family.

http://phylotoo.appspot.com

Conclusion
Modeling gene duplication, losses, horizontal gene transfer, and incomplete lineage sorting simultaneously
-- Hallett, Lagergren & Tofigh, 2004
-- Stolzer et al.
-- Bansal, Alm & Kellis
Likelihood methods for tree reconciliation
-- Arvestad, Lagergren & Sennblad, 2009
-- Boussau et al., 2013
-- Liu, Yu, Kubatko, Pearl & Edwards, 2009