A Phylogenetic Network Construction due to Constrained Recombination

Similar documents
Phylogenetic Networks with Recombination

Introduction to characters and parsimony analysis

Chapter 26 Phylogeny and the Tree of Life

Properties of normal phylogenetic networks

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Dr. Amira A. AL-Hosary

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Integer Programming for Phylogenetic Network Problems

Phylogenetic Networks, Trees, and Clusters

GENETICS - CLUTCH CH.22 EVOLUTIONARY GENETICS.

How should we organize the diversity of animal life?

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenies & Classifying species (AKA Cladistics & Taxonomy) What are phylogenies & cladograms? How do we read them? How do we estimate them?

Chapter 16: Reconstructing and Using Phylogenies

Integer Programming in Computational Biology. D. Gusfield University of California, Davis Presented December 12, 2016.!

8/23/2014. Phylogeny and the Tree of Life

Phylogenetic Tree Reconstruction

Aphylogenetic network is a generalization of a phylogenetic tree, allowing properties that are not tree-like.

What is Phylogenetics

Beyond Galled Trees Decomposition and Computation of Galled Networks

SPECIATION. REPRODUCTIVE BARRIERS PREZYGOTIC: Barriers that prevent fertilization. Habitat isolation Populations can t get together

Organizing Life s Diversity

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Phylogeny: building the tree of life

Outline. Classification of Living Things

Evolutionary Tree Analysis. Overview

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Constructing Evolutionary/Phylogenetic Trees

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Consensus methods. Strict consensus methods

Improved maximum parsimony models for phylogenetic networks

1.1 The (rooted, binary-character) Perfect-Phylogeny Problem

Reproduction- passing genetic information to the next generation

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

1/27/2010. Systematics and Phylogenetics of the. An Introduction. Taxonomy and Systematics

Biology 211 (2) Week 1 KEY!

Lecture 11 Friday, October 21, 2011

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Processes of Evolution

NOTE ON THE HYBRIDIZATION NUMBER AND SUBTREE DISTANCE IN PHYLOGENETICS

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Theory of Evolution Charles Darwin

PHYLOGENY AND SYSTEMATICS

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Reconstruction of certain phylogenetic networks from their tree-average distances

Phylogenetic analysis. Characters

How to read and make phylogenetic trees Zuzana Starostová

Cladistics and Bioinformatics Questions 2013

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Classification and Phylogeny

BINF6201/8201. Molecular phylogenetic methods

9/19/2012. Chapter 17 Organizing Life s Diversity. Early Systems of Classification

Phylogeny: traditional and Bayesian approaches

Classification and Phylogeny

ESS 345 Ichthyology. Systematic Ichthyology Part II Not in Book

Intraspecific gene genealogies: trees grafting into networks

An introduction to phylogenetic networks

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

C.DARWIN ( )

Charles Semple, Philip Daniel, Wim Hordijk, Roderic D M Page, and Mike Steel

Phylogenetic inference

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Evolution Problem Drill 10: Human Evolution

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Phylogenetic Analysis

Tree-average distances on certain phylogenetic networks have their weights uniquely determined

Phylogenetic Analysis

Phylogenetic Analysis

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)


A (short) introduction to phylogenetics

Supplementary Materials for

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Anatomy of a tree. clade is group of organisms with a shared ancestor. a monophyletic group shares a single common ancestor = tapirs-rhinos-horses

Phylogenetics. BIOL 7711 Computational Bioscience

Chapter 27: Evolutionary Genetics

Estimating Evolutionary Trees. Phylogenetic Methods

Constructing Evolutionary/Phylogenetic Trees

Consensus Methods. * You are only responsible for the first two

Microbial Taxonomy and the Evolution of Diversity

Phylogeny 9/8/2014. Evolutionary Relationships. Data Supporting Phylogeny. Chapter 26

UON, CAS, DBSC, General Biology II (BIOL102) Dr. Mustafa. A. Mansi. The Origin of Species

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Phylogenetic trees 07/10/13

Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2012 University of California, Berkeley

Algorithms in Bioinformatics

Weighted Quartets Phylogenetics

Chapter 17A. Table of Contents. Section 1 Categories of Biological Classification. Section 2 How Biologists Classify Organisms

arxiv: v1 [cs.cc] 9 Oct 2014

PHYLOGENY & THE TREE OF LIFE

Phylogenetics: Parsimony

Supertree Algorithms for Ancestral Divergence Dates and Nested Taxa

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2011 University of California, Berkeley

Transcription:

A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer Engineering, Indian Institute of technology, Roorkee December, 27, 2006 1

Outline Phylogenetics and its applications Phylogenetic networks Non-tree like events Existing network reconstruction methods Proposed Algorithm Phylogenetic supertrees Need for supertree methods desirable properties existing methods Proposed Algorithm 2

phylogenetics is the study of the evolutionary relationships among different species or genes. Phylogenetics Morphological characters were used to classify the species for years. The increasing availability of DNA and amino acid data has lead to the wider interest in computational methods for the tree construction. The advantage of using genetic data is unambiguous characters (A,C,G,T) molecular data can be converted to numerical form, which can be used for the computational and mathematical analysis. Orangutan Gorilla Chimpanzee Human (From university of Arizona) 3

Big genome sequencing projects just produce the data, which is meaningless until analyzed and classified properly. Evolutionary history relates all organisms and genes, and helps to understand and predict: Interaction between genes Drug design Predicting functions of genes Origins and spread of disease Origins and migrations of humans Applications 4

Phylogenetic networks Phylogenetic tree construction methods failed to find true relationship between the species not because of wrong genes were selected or methods are not adequate to do so. It is because of the events Horizontal gene transfer Hybridization Homoplasy Genetic recombination SNEATH, P. H. A.. Cladistic representation of reticulate evolution. Syst. Zool. 24:360 368. 1975 5

Horizontal gene transfer Non-tree like events Between similar species, through asexual process A B C D Phylogenetic network A B C D Tree for genes acquired by HGT A B C D Tree for inherited genes Hybridization Between different species, through sexual process A B C D A B C D A B C D Phylogenetic network Tree for hybrid genes Tree for inherited genes 6

Non-tree like events cont Homoplasy Similarity despite of separate ancestry (evolutionary noise) P. A C A T A C G Q. G T A T A C G R. G G A C A T G S. G C A C A C A Homoplasy is change of state twice or more Ex. Site 2. Genetic recombination Within the same lineages Homologous chromosomes exchanging genetic martial 7

Phylogenetic Networks under Constrained Recombination Binary sequences (SNP) A Single Nucleotide Polymorphism (SNP) (pronounced "snip"), is a small genetic change, or variation, that can occur within a specie s DNA sequence. Example AAGGTTA ---ATGGTTA. Each site in the sequence represents a SNP (single nucleotide polymorphism), a site where two of the four possible nucleotides appear in the population with the frequency above some threshold The order of the sites is fixed (on a chromosome) in SNPs. In contrast to a set of taxonomic characters, where the given order is arbitrary. 8

The Perfect Phylogeny Model for binary sequences sites Ancestral sequence 12345 00000 Site mutations on edges The tree derives the set M: 10100 10000 01011 01010 00010 10100 3 10000 1 4 5 01011 2 01010 Extant sequences at the leaves 00010 9

The Perfect Phylogeny Problem Given a set of sequences M we want to find, if possible, a perfect phylogeny that derives M. Remember that each site can change state from 0 to 1 only once. n will denote the number of sequences in M, and m will denote the length of each sequence in M. n m 01101001 11100101 10101011 The sequence matrix M 10

The 4-Gamete Test This will test whether the given sequences derive a perfect phylogenetic tree or not. Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs: 0,0 and 0,1 and 1,0 and 1,1 11

An Example of Gamete test The tree derives the set M: 10100 10000 01011 01010 00010 10101 (new) 10100 sites 3 1 12345 00000 4 5 2 00010 Columns 4 and 5 fails the Gamete test. This is called a Conflict. 10000 01011 Conflicts are due to RECOMBINATION and is often encountered in real sequences 01010 12

Recombination in SNP P 10100 S 01011 10101 The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix). 13

Recombination Network with now node The tree derives the set M: 10100 10000 01011 01010 00010 10101 (new) 10000 sites 12345 00000 1 3 4 5 2 Phylogenetic Network for M with single recombination event. 10100 01011 01010 10101 14

Biologically Significant Network Given M there may exist many Network derivations for the sequence matrix M. A biologically significant network is Perfect Phylogeny Otherwise a little deviation from tree (Network) with minimum recombination events. 15

Minimizing number of recombination events The problem of finding a phylogenetic network that creates a given set of sequences M, and minimizes the number of recombinations, is NP-hard. (Wang et al 2000) (Semple 2004). Wang et al. explored the problem of finding a phylogenetic network where the recombination cycles are required to be node disjoint, if possible. They gave a sufficient but not a necessary condition to recognize cases when this is possible. O(nm + n^4) time. 16

Recombination Cycles In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet. The cycle specified by those two paths is called a ``recombination cycle. 17

Gall trees A recombination cycle in a phylogenetic network is called a gall if it shares no node with any other recombination cycle. A phylogenetic network is called a galled-tree if every recombination cycle is a gall. 18

Example gall tree Rooted Gall tree 4 3 a: 00010 1 3 b: 10010 d: 10100 c: 00100 2 5 4 e: 01100 g: 00101 f: 01101 19

The Gall Tree Construction Algo. 1. Input binary sequences 2. Sort the them in ascending order 3. Find similarity and dissimilarity between the sequences corresponding to 1. 4. Classify the nodes into predefined classes 5. Make Parent and Child Tables 6. Use the information available from step 3, 4 and 5 to construct the gall tree 20

Classification of nodes We follow the rule based classification approach The nodes are classified into three classes Null, Mutation, and Recombination The nodes with least binary sequence values and no similarity between them, belongs to NULL class The nodes with similarity to a single node are classified as MUTATION class The nodes with similarity to a more than one node are classified as RECOMBINATION class 21

Mathematical Basis for rule formation Lemma 1 If a node v' is the result of mutation from its parent v then v < v', when the sequences are considered as the binary numbers. sites 12345 00000 1 4 10100 3 5 2 00010 10000 01011 01010 22

Mathematical Basis for rule formation Lemma 2 Let S and S' be the sequences of the children of node V. if S' is not the result of the mutation or recombination in S then the similarity between S and S' is due to common ancestry. sites 12345 00000 1 4 10100 3 5 2 00010 10000 01011 01010 23

Mathematical Basis for rule formation Lemma 3 Let v be a recombination node with sequence S. if P' and P" are two parent nodes of v, with sequences S' and S" respectively, then any of the following should hold. S'>S and S">S S'>S and S"<S S'<S and S">S 24

An Example Input sequences: A 0 0 0 1 0 B 1 0 0 1 0 C 0 0 1 0 0 D 1 0 1 0 0 E 0 1 1 0 0 F 0 1 1 0 1 G 0 0 1 0 1 Similarity and dissimilarity matrix: (sim,dis) A B C D E F G A 1,0 1,1 0,2 0,3 0,3 0,4 0,3 B 1,1 2,0 0,3 1,2 0,4 0,5 0,4 C 0,2 0,3 1,0 0,1 0,1 0,2 0,1 D 0,3 1,2 0,1 2,0 1,2 1,3 1,2 E 0,3 0,4 0,1 1,2 2,0 2,1 1,2 F 0,4 0,5 0,2 1,3 2,1 3,0 2,1 G 0,3 0,4 0,1 1,2 1,2 2,1 2,0 25

Example Contd Classification of Nodes Child Table Parent Table Node Label Type A Null 0 B Mutation 1 C Null 0 D Recombi 2 nation E Mutation 1 F Recombi 3 nation G Mutation 1 Count A B C D E F Node Label G B D Child List D,E,F,G Null F Null F A B C D E F Node Label G Parent List Null A Root B,C C E,G C 26

Research problems cont 4. Phylogenetic networks :recombination an example Node l Type Count A Null 0 B Mutation 1 C Null 0 D Recombi nation 2 E Mutation 1 F Recombi nation 3 G Mutation 1 Node Label A B C D E F G Child List B D D,E,F, G Null F Null F R:00000 A:00010 C:00100 B:10010 C:00100 D:00010 G:00101 E:01100 F:01101 27

Contribution of the work 1. Used similarity and dissimilarity to avoid information loss 2. Taken a row based search instead of column based search for detecting recombination. This reduces the time complexity of the algorithm. 3. Given better network visualization that is close real visualization and biologically significant. 28

Outline Phylogenetics and its applications Phylogenetic networks Non-tree like events Existing network reconstruction methods Proposed Algorithm Phylogenetic supertrees Need for supertree methods desirable properties existing methods Proposed Algorithm 29

Phylogenetic supertrees There are nearly 1.7 million described species. Not possible for a single researcher or a small group to construct the Tree of Life. No algorithm exist for the construction of the Tree of Life. Phylogenetic supertree methods combine smaller phylogenetic trees into a single tree in such a way that no branching information carried small trees is lost. 30

Phylogenetic tree construction methods Distance Based Method Character Based Method Computationally Efficient Can we Combine them? Highly Accurate Loss of Information Computationally Expensive 31

Desirable properties There exist very few algorithms which satisfy the following properties: The supertree can be computed in polynomial time. The algorithm preserves the branch information shared by all the input trees. Changing the order of the input tree should not change the resulting supertree. Renaming or relabeling leads to the change in the corresponding label in the resulting tree and should not effect the supertree topology. In case all the input trees are compatible the algorithm should return a tree which displays all the input trees. 32

Existing supertree methods If the input trees classifies the same set of leaf nodes, the result of amalgamating them is called Consensus tree (constrained supertree). Many supertree methods use the consensus methods as the basis for supertree construction. Popular consensus methods: Strict, majority consensus method, loose consensus, Adam s consensus methods, and BUILD by Aho et. al. Popular supertree methods: Matrix representation of parsimony (MRP), Mincut, Modified Mincut, RankedTree, MinFlip supertree. BUILD is used as the basis for the supertree construction in Mincut, Modified Mincut and RankedTree. 33

Consensus methods: Strict consensus methods: Existing supertree methods cont Given a collection of phylogenetic trees it return a tree with splits or clusters common to all the trees. Majority consensus methods: Suffers from the problem of incompatibility and loss of information Given a collection of phylogenetic trees it return a tree with splits or clusters found in more than half of the trees. + = A B C D A B D C A B D Strict consensus tree C 34

Supertree methods: Existing supertree methods cont Matrix representation of parsimony (MRP): As a first step, the input phylogenetic trees are converted to the binary matrix. In the second step parsimony tree building method is used to construct the supertree. Exponential time complexity. 1 2 A B C D Taxa Computationally expensive A B C D 1 2 0 0 1 0 1 1 1 1 A phylogenetic tree and its binary character matrix. 35

Supertree methods cont : Mincut: Existing supertree methods cont Satisfy all the desirable properties. Extension of BUILD, BUILD reports an error on incompatible input trees. Mincut removes the minimum information edges from the modified graph to construct supertree, when input trees are incompatible. A B C D E A B C D Input trees E E C Result in highly unresolved tree if the input trees carry polytomy. C 1 1 2 A 1 D D A B A,B 1 2 1 1 Modified graphs A B C D E Supertree 36

Outline Phylogenetics and its applications Phylogenetic networks Non-tree like events Existing network reconstruction methods Phylogenetic supertrees Need for supertree methods desirable properties existing methods Supertree with ancestral divergence time Supertree for semi-labeled taxa Phylogenetic Question Answering System Conclusion and future work 37

Research problems and our approach Research problems: 1. Phylogenetic supertrees. 2. Phylogenetic supertree methods for ancestral time divergence data. 3. Phylogenetic supertree methods for semi-labeled trees. 4. Phylogenetic networks due to recombination. 5. Phylogenetic Question Answering System 38

1. Phylogenetic supertrees Research problems cont Due to the drawbacks of character and distance based methods new techniques to be developed for the construction of the supertrees. The idea is to combine small phylogenetic trees into a single tree in such a way that the branching information carried by each tree is persevered. (branching information includes clusters, triplets, quartets, and splits). To construct the tree of life all the desirable properties should be satisfied. If the input trees shows incompatibility, minimum information can be removed from the input trees to result in a supertree. 39

Phylogenetic supertrees: our approach Research problems cont We took the recoding based approach. Used UPGMA variant for the consensus tree construction. Minimize the least square error to deal with incompatibility. Steps in our method: 1. Find the restricted trees for common nodes 2. recode the restricted trees by considering path length between the taxa as distance between them. 3. Use UPGMA variant to construct the consensus tree. 4. Add the distinct taxa to the consensus tree in way to minimize the least square error. 40

DISTSUPERTREE: An example Research problems cont A B C D E F B C D EF A Average distance matrix for the resulting trees Species A B C D E B 4 C 4.5 2.5 D 4.5 3.5 3 E 4.5 4.5 4 3 F 4 6 5.5 4.5 3.5 41

Research problems cont DISTSUPERTREE: An example cont Modified matrix after grouping the taxa B and C Species A BC D E BC 4.25 D 4.5 3.25 E 4.5 4.25 3 F 4 5.75 4.5 3.5 Leads to wrong conclusion Ancestor of BC is D not E 42

Research problems cont DISTSUPERTREE: An example cont The problem of distance matrix modification can be solved using the distances between the ancestors of the groups and taxa. Distance Between (AB) and (E) ABC AB ABCDEF DEF EF 1. Find all the clusters with X=(AB) and Y=(E) X= {(AB), (ABC), (ABCDEF)} A B CDE F Y= {(E), (EF), (DEF), (ABCDEF)} 43

DISTSUPERTREE: The final result Research problems cont + A B C D E F B C D E F A = B C D E F A 44

Questions and Answers Thank You 45