Phylogeny Tree Algorithms

Similar documents
9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)


Constructing Evolutionary/Phylogenetic Trees

Theory of Evolution. Charles Darwin

Phylogenetic Tree Reconstruction

Algorithms in Bioinformatics

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Theory of Evolution Charles Darwin

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Evolutionary Tree Analysis. Overview

Constructing Evolutionary/Phylogenetic Trees

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Dr. Amira A. AL-Hosary

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

C3020 Molecular Evolution. Exercises #3: Phylogenetics

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Phylogenetic inference

BINF6201/8201. Molecular phylogenetic methods

Phylogeny. November 7, 2017

EVOLUTIONARY DISTANCES

Seuqence Analysis '17--lecture 10. Trees types of trees Newick notation UPGMA Fitch Margoliash Distance vs Parsimony

Phylogenetic trees 07/10/13

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

What is Phylogenetics

Phylogeny: building the tree of life

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Multiple Sequence Alignment. Sequences

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Phylogeny: traditional and Bayesian approaches

CS5263 Bioinformatics. Guest Lecture Part II Phylogenetics

Computational Genomics

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

A (short) introduction to phylogenetics

Phylogenetics. BIOL 7711 Computational Bioscience

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Consistency Index (CI)

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

Molecular Evolution and Phylogenetic Tree Reconstruction

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

How to read and make phylogenetic trees Zuzana Starostová

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

X X (2) X Pr(X = x θ) (3)

Bioinformatics course

Gel Electrophoresis. 10/28/0310/21/2003 CAP/CGS 5991 Lecture 10Lecture 9 1

Phylogenetics: Building Phylogenetic Trees

Phylogenetic analyses. Kirsi Kostamo

Phylogeny. Information. ARB-Workshop 14/ CEH Oxford. Molecular Markers. Phylogeny The Backbone of Biology. Why? Zuckerkandl and Pauling 1965

A Phylogenetic Network Construction due to Constrained Recombination

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Inferring Molecular Phylogeny

Math 239: Discrete Mathematics for the Life Sciences Spring Lecture 14 March 11. Scribe/ Editor: Maria Angelica Cueto/ C.E.

Phylogeny. Properties of Trees. Properties of Trees. Trees represent the order of branching only. Phylogeny: Taxon: a unit of classification

The Generalized Neighbor Joining method

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Introduction to characters and parsimony analysis

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

C.DARWIN ( )

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

8/23/2014. Phylogeny and the Tree of Life

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Copyright 2000 N. AYDIN. All rights reserved. 1

Phylogenetics: Parsimony

Supplementary Information

Organizing Life s Diversity

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas

Chapter 3: Phylogenetics

Zhongyi Xiao. Correlation. In probability theory and statistics, correlation indicates the

Introduction to Bioinformatics Introduction to Bioinformatics

STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization)

Quantifying sequence similarity

Reconstructing Trees from Subtree Weights

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

Phylogenetic Networks, Trees, and Clusters

SCIENTIFIC EVIDENCE TO SUPPORT THE THEORY OF EVOLUTION. Using Anatomy, Embryology, Biochemistry, and Paleontology

BIOL 1010 Introduction to Biology: The Evolution and Diversity of Life. Spring 2011 Sections A & B

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Biology 211 (2) Week 1 KEY!

Computational Biology

Multiple Alignment. Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis

MOLECULAR EVOLUTION AND PHYLOGENETICS SERGEI L KOSAKOVSKY POND CSE/BIMM/BENG 181 MAY 27, 2011

TheDisk-Covering MethodforTree Reconstruction

7. Tests for selection

Reading for Lecture 13 Release v10

Transcription:

Phylogeny Tree lgorithms Jianlin heng, PhD School of Electrical Engineering and omputer Science University of entral Florida 2006 Free for academic use. opyright @ Jianlin heng & original sources for some materials

Evolution Evolution of new organisms is driven by Diversity Different individuals carry different variants of the same basic blue print Mutations The DN sequence can be changed due to single base changes, deletion/insertion of DN segments, etc. Selection bias S. Moran and I. Wexler,2005

Motivation To understand lineage of various species (evolutionary history) To understand how various functions evolve To inform multiple alignments To map virus strains (vaccine construction) To identify what is most conserved / important in some class of sequences

Frank Olken, 2002

Historical Note Until mid 1950 s phylogenies were constructed by experts based on their opinion (subjective criteria) Since then, focus on objective criteria for constructing phylogenetic trees Thousands of articles in the last decades Important for many aspects of biology lassification Understanding biological mechanisms S. Moran and I. Wexler

Morphological vs. Molecular lassical phylogenetic analysis: morphological features: number of legs, lengths of legs, etc. Modern biological methods allow to use molecular features Gene sequences Protein sequences nalysis based on homologous sequences (e.g., globins) in different species S. Moran and I. Wexler

Phylogeny Tree Basics Leaves represent things (genes, individuals, strains, species) being compared. Term taxon (taxa plural) is used to refer to this. Internal nodes are hypothetical ancestral units In a rooted tree, path from root represents an evolutionary path (root represents the common ancestor) n unrooted tree specifies relationships among things, but no evolutionary path.

Tree of Life Source: lberts et al

Primate Evolution S. Moran and I. Wexler

T. Warnow, 2004

Frank Olken, 2002

Frank Olken, 2002

Frank Olken, 2002

Phylogeny Tree Space The space of phylogeny tree is exponential. For n sequences, the number of unrooted tree is (2n5)!! For n sequences, the number of rooted tree is (2n3)!!

Root Rooted Tree Unrooted Tree

Definition of a tree: Edge num: E Internal node num: I Leaf node num: L Rule 1: E = I + L 1 (why?)

Rooted Phylogeny Tree E = 2 * I (degree) E = I + L 1 2I = I + L 1 I = L 1 (# internal node = #leaf 1) E = 2L 2

UnRooted Tree Total degree = (L+3*I) = 2E E = L + I 1 I = L 2 E = 2L 3

Total number of unrooted tree Given n species (n >= 3), there are (2i5)!! Unrooted bifurcating trees. #leaf node #edge #tree 3 3 1 4 5 1*3 5 7 1*3*5 n 2n3 1*3*5* *2n5 a a d c c b b.

Total number of rooted tree Root a b onvert an unrooted tree into a rooted tree by adding one node on any edge a b c c Total number of rooted tree for n leaf nodes: Total number of unrooted tree * total number of edge = (2n5)!! * (2n3) = (2n3)!!

Number of Rooted and Unrooted Trees 3 4 5 6 7 8 9 10 Unrooted 1 3 15 105 945 10395 135135 2027025 rooted 3 15 105 945 10395 135135 2027025 34459425

Phylogeny Tree lgorithms DistanceBased (UPGM, Neighbor Join) Maximum Parsimony (characterbased) Maximum Likelihood (characterbased)

Frank Olken, 2002

How to choose methods Very similar sequences: Maximum Parsimony (time intensive) Medium similar sequences: distance based method (fast, O(n 2 )) Very dissimilar sequences: Maximum Likelihood method (very time intensive)

Maximum Parsimony Method Predict the evolutionary tree that minimizes the number of steps required to generate the observed variation in the sequences. Find a tree that explains data with a minimal number of changes. ppropriate for very similar sequences and a small number of sequences Time onsuming (try to examine all possible trees) PHYLIP and PUP offer maximum parsimony method

Select Informative Sites Sites 1,6,8: not informative Site 2: not informative (doesn t favor any tree) Site 3: not informative (doesn t favor any tree) Site 4: not informative (doesn t favor any tree) Sites 5, 7: informative Rule of thumb: to be informative, a character must appear in at least two taxa and there are at least two characters.

Example dapted from Li and Graur 1991 Tree 1 Tree 2 Tree 3 Taxa 1 G Taxa 3 Taxa 1 G Taxa 2 G Taxa 1 G Taxa 2 G G G Taxa 2 Taxa 4 Taxa 3 Taxa 4 Taxa 4 Length = 1 Length = 2 Length = 2 Taxa 3

Example dapted from Li and Graur 1991 Tree 1 Tree 2 Tree 3 Taxa 1 G Taxa 3 Taxa 1 G Taxa 2 G Taxa 1 G Taxa 2 G G G Taxa 2 Taxa 4 Taxa 3 Taxa 4 Taxa 4 Length = 1 Length = 2 Length = 2 Taxa 3

omments For a small number of sequences, exhaustive search is ok. (n=10, about 2 million trees) For many sequences, exhaustive search is very time consuming (NPomplete). BranchBound method or even heuristic methods must be used. (PUP provides both options) Maximum parsimony tree provides an explicit evolutionary model The rates of changes along all branches of the tree are assumed to be equal PHYLIP: DNPENNY (branchbound to analyze up to 1112 sequences), DNOMP performs phylogenetic analysis using the compatibility criterion (find a tree that supports the largest number of sites). PROTPRS (for protein parsimony tree)

DistanceBased Method Goal is to generate a tree in which similar sequences with short distance are closer and the sum of branch lengths of two nodes is equal to their distance. lustalw (neighborjoin method) PUP also has distance method PHYLIP: DNDIST, PROTDIST (PM) to generate distance matrix

UPGM UPGM: unweighted pair group method with arithmetic mean. ssume a molecular clock (constant evolution rate) Produce a rooted tree Ultrametric condition: for any three taxa (a,b,c), d ac <= max(d ab, d bc ).

UPGM condition db <= max(d,db) d <= max(db,db) db <= max(db, d) B In another words: two greatest distance must be equal. Or: constant evolutionary rate for all branches.

UPGM lgorithm Initialization: Define T to be the set of leaf nodes, one for each sequence. Height of each node is 0. Let L = T Repeat Select closest two nodes (,B) and create parent node K for them. Join, B and K respectively. Set the height of node K to d B / 2. Set branch length between K and = height K height, set branch length between K and B = height K height B. Remove, B from L and add K into L. Recompute the distance between K and other nodes in L. Distance between K and other nodes is average distance of leaf sequences below K and the other node. Until there is only one node

Example of UPGM (perfect) B D E Step 1: Select D and E 20 26 26 26 B 26 26 26 16 16 D 10 E B 20 26 DE 26 5 5 B 26 26 16 D E DE

B 20 26 DE 26 Step 2: Select (DE) and B 26 26 16 DE B 20 DE 26 8 5 3 5 B DE 26 D E dist(de,) = (d D +d E +d )/3 = 26 dist(de,b) = (d DB +d EB +d B )/3 = 26

B DE 20 26 Step 3: select, B B 26 DE B DE B DE 26 10 B 8 5 D 3 E 5 dist(de,b)= (d D +d DB +d E +d EB +d +d B )/6 = 26

B DE Step 4: select (,B), (D,E,) B 26 DE 3 Root 5 10 10 8 5 3 5 B D E

Example of UPGM (imperfect) B D E Step 1: Select D and E 22 39 39 41 B 41 41 43 18 20 D E 10 5 D B DE 5 E 22 39 40 B 41 42 19 DE

B 22 39 DE 40 Step 2: Select (DE) and B 41 42 DE 19 4.5 5 D 5 E B DE 9.5 22 39.7 B 41.7 DE dist(de,) = (d D +d E +d )/3 = 39.7 dist(de,b) = (d DB +d EB +d B )/3 = 41.7

B DE 22 39.7 Step 3: select, B B 41.7 DE 4.5 5 G D 5 E B B DE 40.7 9.5 DE 11 dist(de,b)= (d D +d DB +d E +d EB +d +d B )/6 = B

B DE Step 4: select (,B), (D,E,) B DE 40.7 4.5 G 5 D 10.85 5 E 20.35 9.5 = 10.85 Root 9.5 20.35 11 = 9.35 9.35 11 B

NeighborJoin Method Do not assume molecular clock ssume additivity of distance matrix (ideal) Work for nonadditivity of distance matrix (nonideal situation) Most reliable when the branch lengths of trees are allowed to vary Goal is to find a tree that minimize the square errors of pairwise distances.

dditive ondition Given a L*L distance matrix M, d(i,i) = 0, d(i,j) > 0 for i j d(i,j) = d(j,i) For all i,j,k it holds that d(i,k) <= d(i,j) + d(j,k) i j k

dditive ondition (cont) We say that the distance matrix M with L objects is additive if there is a tree T, L of its nodes correspond to the L objects, with positive weights on the edges, such that for all i,j, d(i,j) = d T (i,j), the length of the path from i to j in T.

Three objects sets always additive: k For L=3: There is always a tree with one internal node. a c m b j d(, i j) = a+ b d(, ik) = a+ c d( jk, ) = b+ c i Thus c 1 = d( k, m) = [ d( i, k) + d( j, k) d( i, j)] 2 0

dditive Tree i k For four point condition: d(i,k)+d(j,l) = d(i,l) + d(k,j) >= d(i,j)+d(k,l) j l Theorem: distance matrix M of L objects is additive iff any subset of four objects can be labeled i,j,k,l so that: d(i,k) + d(j,l) = d(i,l) +d(k,j) d(i,j) + d(k,l) We call {{i,j},{k,l}} the split of {i,j,k,l}.

Neighbor Join lgorithm Initialization: Define T to be the set of leaf nodes, one for each sequence. nd let L = T Iteration: Pick a pair i,j for which D ij is minimal. D ij = d ij (r i +r j ), r i = d ik / ( L 2), r i : average distance from i to all other sequences (k) except j. Remove i, j from L. Define a new node k, for other node m in L, d km =1/2*(d im +d jm d ij ) = (d im +d jm )/2 d ij /2 dd k to T with edge of length d ik = ½(d ij +r i r j ), d jk =d ij d jk Join k to i and j respectively, remove i,j from L and add k into L Termination When L consists of only two nodes (i,j). dd one edge between i and j with length d ij.

Example of Neighbor Join B D B 3 Satisfy additive condition D B = 3 ( (7+8)/2 + (6+7)/2) =11 D D = 8 ( (3+7)/2 + (3+7)/2) = 2 D = 7 ( (3+8)/2 + (6+3)/2) = 3 D B = 6 ( (3+7)/2 + (7+3)/2) = 4 D BD = 7 ( (3+6)/2 + (8+3)/2) = 3 D D = 3 ( (7+6)/2 + (8+7)/2) = 11 7 6 D 8 7 3 Step 1: Select and B B 2 1 E d E = (d +d B d B )/2 = (7+63)/2=5 d ED = (d D +d BD d B )/2 = (8+73)/2=6 d E = (d B +r r B )/2=(3+1)/2 = 2 d BE = d B d E = 3 2 = 1

E D E D 5 6 3 D D = d D (r +r D )=3(5+6)=8 D E = d E (r +r E )=5(3+6)=4 D DE = d DE (r D +r E ) = 6(3+5) = 2 F D d FE = (d E +d DE d D )/2=(5+63)/2=4 d F = (d D +r r D )/2=(3+56)/2=1 d FD = d D d F = 2 1 2

Join E and F with length 4 B 2 1 E 4 F 1 2 D Now, verify if the sum of branch lengths matches with sequence distances.

S. Moran and I. Wexler

omments Given a distance matrix constituting an additive metric, the topology of the corresponding additive tree is unique. an run NJ algorithm on nonadditive matrix. In that case, tree may not be unique.

How to compute distance from alignment ount mismatch ount both mismatch and gaps Use substitution matrix to generate similarity scores, then convert it to distance Normalize score into the range [0,1]: S = (S real S rand ) / (S ident S rand ). Then 1 normalized score is distance. S real is the alignment score of sequence and B. S rand is the alignment score of two random sequences generated from and B. S ident is average alignment score of aligning with, B with B.

Maximum Likelihood Method Use probability calculations to find a tree that best accounts for variation in a set of sequences. nalysis is performed on each column of a multiple sequence alignment. ll possible trees are considered. Only can be used for a small number of sequences (at most 10?) PUP version 4 and up has maximum likelihood function PHYLIP (DNML and DNMLK (molecular clock))

Frank Olken, 2002

Sequence : GGTTGGG Sequence B: GGTTGGG Sequence : GTG Sequence D: GGG Example of ML Look at one column (5 possible columns), one tree (15 Possible trees), one assignment of nucleotides (64 possible combinations) a b c d T T G ssign nucleotides To the internal T nodes T G a b c d T T G Find a best tree that with maximum probability. Prob = prob(t) *Prob(T>G)* Prob(G>) =0.25*10 6 *10 6

dvantages and Disadvantages of ML Method Explicit Statistical Model Likelihood Efficient use of data Very expensive to compute (use heuristics)

Frank Olken, 2002

Reliability of Tree onstruction Boostraping Given an multiple sequence alignment, randomly sample ncolumns with replacement and construct a tree onstruct a lot of trees as above heck the relation between two sequences. If their relationship (split, branch) is stable (appearing in the most trees), then the tree more likely close to the true tree.

Frank Olken, 2002

Ten Topics 1. Introduction to Molecular Biology and Bioinformatics 2. Pairwise Sequence lignment Using Dynamic Programming 3. Practical Sequence/Profile lignment Using Fast Heuristic Methods (BLST and PSIBLST) 4. Multiple Sequence lignment 5. Gene Identification 6. Phylogenetic nalysis 7. Protein Structure nalysis and Prediction 8. RN Secondary Structure Prediction 9. lustering and lassification of Gene Expression Data 10. Search and Mining of Biological Databases, Databanks, and Literature