Phylogeny: traditional and Bayesian approaches

Similar documents
Phylogenetic Tree Reconstruction

Molecular Evolution and Phylogenetic Tree Reconstruction

Evolutionary Tree Analysis. Overview

Phylogeny: building the tree of life

Algorithms in Bioinformatics

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

Week 5: Distance methods, DNA and protein models

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

EVOLUTIONARY DISTANCES

Constructing Evolutionary/Phylogenetic Trees

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Consistency Index (CI)

Phylogenetic trees 07/10/13

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

Multiple Sequence Alignment. Sequences

Bioinformatics tools for phylogeny and visualization. Yanbin Yin


Multiple Alignment. Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

BINF6201/8201. Molecular phylogenetic methods

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method

Constructing Evolutionary/Phylogenetic Trees

Phylogeny. November 7, 2017

A (short) introduction to phylogenetics

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Phylogenetic analyses. Kirsi Kostamo

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

Phylogenetic inference

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

What is Phylogenetics

A Phylogenetic Network Construction due to Constrained Recombination

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Using algebraic geometry for phylogenetic reconstruction

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

CS5263 Bioinformatics. Guest Lecture Part II Phylogenetics

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Phylogeny Tree Algorithms

Theory of Evolution Charles Darwin

Dr. Amira A. AL-Hosary

How to read and make phylogenetic trees Zuzana Starostová

Phylogenetic inference: from sequences to trees

C3020 Molecular Evolution. Exercises #3: Phylogenetics

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

Copyright 2000 N. AYDIN. All rights reserved. 1

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

MOLECULAR EVOLUTION AND PHYLOGENETICS SERGEI L KOSAKOVSKY POND CSE/BIMM/BENG 181 MAY 27, 2011

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely

C.DARWIN ( )

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Workshop III: Evolutionary Genomics

BIOINFORMATICS GABRIEL VALIENTE ALGORITHMS, BIOINFORMATICS, COMPLEXITY AND FORMAL METHODS RESEARCH GROUP, TECHNICAL UNIVERSITY OF CATALONIA

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Module 12: Molecular Phylogenetics

Phylogeny and Molecular Evolution. Introduction

Phylogenetics: Building Phylogenetic Trees

Theory of Evolution. Charles Darwin

Generating phylogenetic trees with Phylomatic and dendrograms of functional traits in R

Phylogenetics. BIOL 7711 Computational Bioscience

Math 239: Discrete Mathematics for the Life Sciences Spring Lecture 14 March 11. Scribe/ Editor: Maria Angelica Cueto/ C.E.

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Is the equal branch length model a parsimony model?

Module 13: Molecular Phylogenetics

Lecture 18 Molecular Evolution and Phylogenetics

Gel Electrophoresis. 10/28/0310/21/2003 CAP/CGS 5991 Lecture 10Lecture 9 1

Phylogenetics: Parsimony

Seuqence Analysis '17--lecture 10. Trees types of trees Newick notation UPGMA Fitch Margoliash Distance vs Parsimony

Lecture 6 Phylogenetic Inference

Building Phylogenetic Trees UPGMA & NJ

Bioinformatics course

Hierarchical Clustering

Phylogeny and Molecular Evolution. Introduction

Molecular Evolution & Phylogenetics

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Module 13: Molecular Phylogenetics

molecular evolution and phylogenetics

Perfect Phylogenetic Networks with Recombination Λ

Evolutionary Trees. Evolutionary tree. To describe the evolutionary relationship among species A 3 A 2 A 4. R.C.T. Lee and Chin Lung Lu

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

Molecular Phylogenetics (Hannes Luz)

Reconstruire le passé biologique modèles, méthodes, performances, limites

The Generalized Neighbor Joining method

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

Molecular Phylogenetics (part 1 of 2) Computational Biology Course João André Carriço

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University

Phylogenetic Networks, Trees, and Clusters

Transcription:

Phylogeny: traditional and Bayesian approaches 5-Feb-2014 DEKM book Notes from Dr. B. John Holder and Lewis, Nature Reviews Genetics 4, 275-284, 2003 1

Phylogeny A graph depicting the ancestor-descendent relationships between organisms or gene sequences. Sequences are tips of the tree Branches connect the tips to their unobservable ancestral sequence 2

3

4

5

6

What is phylogeny? The inference of evolutionary relationships The inference of putative common ancestors Trees (with branches and leaves!) Example: mouse: rat: marmoset Chimp: Human: ACAGTGACGCCCCAAACGT ACAGTGACGCTACAAACGT CCTGTGACGTAACAAACGA CCTGTGACGTAGCAAACGA CCTGTGACGTAGCAAACGA

Rooted vs Unrooted tree (dendograms) Rooted tree Unrooted tree C D B A B C D A OTUs Operational taxonomic units (leaves) HTUs Hypothetical taxonomic units (internal nodes)

Calculating the number of unrooted trees 1 2 1 2 3 How many possibilities are there for adding leaf 3? 1 3 4 How many possibilities are there for adding leaf 4? 3

Calculating the number of unrooted trees 1 2 4 5 3 How many possibilities are there for leaf 5? For the 5 th leaf, there are 5 possibilities ie to add the n th leaf, we can add at the immediate leaf branches+ the number of internal branches ie (n-1) terminal-branches + (n-4) internal branches =2n-5

Total number of trees? 1 2 N = 10 #unrooted: 2,027,025 #rooted: 34,459,425 5 4 N = 30 #unrooted: 8.7 x 10 36 #rooted: 4.95 x 10 38 3 #unrooted trees for n taxa: (2n-5)*(2n-7)*...*3*1 = (2n-5)! / [2 n-3 *(n-3)!] #rooted trees for n taxa: (2n-3)*(2n-5)*(2n-7)*...*3 = (2n-3)! / [2 n-2 *(n-2)!] Commonly represented as 2n-5!! or 2n-3!!

Two main problems in Tree Construction Evaluate tree and/or assign branch lengths Construction of the best tree among the many possibilities

Parsimony Approach to Evolutionary Tree Evaluation Assumes observed character differences resulted from the fewest possible mutations Seeks the tree that yields lowest possible parsimony score - sum of cost of all mutations found in the tree

Example for calculating parsimony score AAG AAA AGA GGA 1 AAG AAA AAA AAA 1 1 GGA AGA AGA AAG AAA AGA AAA AAA AAA 1 1 2 GGA Total #substitutions = 3 Total #substitutions = 4 The left tree is preferred over the right tree. The total number of changes is called the parsimony score.

Sankoff algorithm to count evolutionary changes in a given tree Step-1: Define a cost matrix [c ij ], representing changes from character state i to state j A C G T A 0 2.5 1 2.5 C 2.5 0 2.5 1 G 1 2.5 0 2.5 T 2.5 1 2.5 0 Step-2: Starting from the leaves, we work down at each node, k to calculate a score S k (i) for each state, i (i= 1.. 4 for DNA) S k is the minimal cost of events for all subtrees above that node, k

Sankoff algo cont d Remember for each internal node ( parent ) we have two sub-nodes ( children ) So the scores need to be added up on both branches First, consider say the left branch, with left child node at state i and we want to evaluate the score for parent node at state A. If S(left child node) is known for each states, the S(parent node@j) will be S p caa SC1( A) ca T SC1( T) ( A) LeftArm min cag SC1( G) cac SC1( C) A Left A/T/G/C Child node (C1) A Parent node C Left A/T/G/C Child node (C1) Right Child node A Parent node

Sankoff algo cont d S p caa SC1( A) ca T SC1( T) ( A) LeftBranch min cag SC1( G) cac SC1( C) min c S ( j) A j C1 j A/ T/ G/ C LeftBranch Left Child node (C1) A A/T/G/C Parent node Right Child node (C2) S () i S '() i S '() i p p left p right c S j c S k min i j C1( ) min / / / i k C2( ) j A T G C k A/ T/ G/ C LeftBranch RightBranch

Sankoff algorithm Example Sp() i min cij SC1( j) min cik SC2( k) ja/ T/ G/ C ka/ T/ G/ C LeftArm RightArm ACGT ACGT ACGT ACGT ACGT 0 0 0 0 0 {C} {A} {C} {A} {G} Cost matrix A C G T A 0 2.5 1 2.5 C 2.5 0 2.5 1 G 1 2.5 0 2.5 T 2.5 1 2.5 0 A C G T 2.5 2.5 3.5 3.5 A C G T 6 6 7 8 A C G T 1 5 1 5 A C G T 3.5 3.5 3.5 4.5

Problem 1: Assigning branch lengths using Least squares We have an observed matrix of Distances between sequences from all possible pair-wise comparisons Remember D and K in JC model? A B C D E A 0 0.23 0.16 0.20 0.17 B 0.23 0 0.23 0.17 0.24 C 0.16 0.23 0 0.15 0.11 D 0.20 0.17 0.15 0 0.21 E 0.17 0.24 0.11 0.21 0 E A 0.06 0.08 0.03 0.05 C 0.05 0.10 0.07 D B Observed distances denoted by D ij and the real branch lengths to be predicted as d ij n n i1 j1; ji QT ( ) ( D d) ij ij 2

Motivation: From a distance table to a tree Each tree has branch lengths from which predicted set of distances can be computed: d(i,j) (small d, denotes the distance of the branches, unlike the observed pairwise distances D). Human 0.3 0.41 Gorilla D(Human,Chimp) = 0.55 D(Human,Gorilla) = 0.71 0.25 Chimp D(Chimp, Gorilla) = 0.66

Motivation: From a distance table to a tree The question is can we find branch lengths, so that the d s are equal to the D s? Human X Y Gorilla D(Human,Chimp) = 0.3 =X + Z D(Human,Gorilla) = 0.4 = X + Y Chimp Z D(Chimp, Gorilla) = 0.5 = Y + Z X+Z = 0.3 X+Y = 0.4 Y+Z = 0.5 Y-Z = 0.1 Y+Z = 0.5 Y = 0.3 Z = 0.2 X = 0.1

Is there always a solution?? A X Y W D 5 Variables, 6 Equations, B Z V C It might be that there s no single solution NC 2 2 N 3 for N > 3 Number Of Equations Number of variables

Least squares Cont d d x v ij ij, k k k x ij k introduce an indicator variable,, which is 1 if branch v k lies in the path from species i to species j and 0 otherwise 5 1 2 v2 v1 v7 v5 v6 v4 v3 4 3 d 1v 1v 0v 0v 0v 0v 1v 12 1 2 3 4 5 6 7 d 1v 0v 1v 0v 0v 1v 0v... 13 1 2 3 4 5 6 7 d 0v 0v 0v 1v 1v 1v 1v 45 1 2 3 4 5 6 7

LS- Cont d when the weights are 1.0 QT ( ) ( D d ) n n i1 j1; ji dq n n n n 2 ( Dij xijkvk ) xij, k D ij xij, kvk i1 j1; ji k dvk i1 j1;; ji k T X D T ( X X) v 1 ( T T v X X) X D X ij ij 2 2 ( ) 0 No of rows~n 2 (nc 2 ) No of columns=2n-3 (eq to k) D No of rows nc 2 Columns=1 A B d x v ij ij, k k k E v5 v1 v6 v3 v7 v2 v4 D

Problem 2: Construct the best tree

P2: Fast clustering algorithm for tree construction UPGMA (unweighted pair group method using arithmetic averages) Given two disjoint clusters C i, C j of sequences, 1 d ij = {p Ci, q Cj} d pq C i C j Note that if C k = C i C j, then distance to another cluster C l is: d il C i + d jl C j d kl = C i + C j = UPGMA distance ni () n( j) D(( ij), l) ( ) D( i, l) ( ) D( j, l) ni () n( j) ni () n( j)

UPGMA algorithm Find i and j with smallest D ij Create new group X by joining nodes i & j Compute distances between X (new member) and others (old) Place node X at D ij /2 Delete i and j; replace with X in D-matrix

The distance table dog bear raccoon weasel seal sea lion cat chimp dog 0 32 48 51 50 48 98 148 bear 0 26 34 29 33 84 136 raccoon 0 42 44 44 92 152 weasel 0 44 38 86 142 seal 0 24 89 142 sea lion 0 90 142 cat 0 148 chimp 0

Distance between these two taxa was 24, so each branch has a length of 12. ss 12 12 seal sea lion We call the parent node of seal and sea lion ss.

Removing the seal and sea-lion rows and columns, and adding the ss row and columns dog bear raccoon weasel ss cat chimp dog 0 32 48 51? 98 148 bear 0 26 34? 84 136 raccoon 0 42? 92 152 weasel 0? 86 142 ss 0 89 142 cat 0 148 chimp 0

Computing dog-ss distance dog bear raccoon weasel seal sea lion cat chimp dog 0 32 48 51 50 48 98 148 n( i) n( j) D(( ij), k) ( ) D( i, k) ( ) D( j, k) n( i) n( j) n( i) n( j) Here, i=seal, j=sea lion, k = dog. n(i)=n(j)=1. D(ss,dog) = 0.5D(sea lion,dog) + 0.5D(seal,dog) = =0.5*48+0.5*50=49

The new table. Starting second iteration dog bear raccoon weasel ss cat chimp dog 0 32 48 51 49 98 148 bear 0 26 34 31 84 136 raccoon 0 42 44 92 152 weasel 0 41 86 142 ss 0 89 142 cat 0 148 chimp 0

Inferring tree Distance between bear and raccoon was 26, so each branch has a length of 13. ss 12 12 br 13 13 seal sea lion bear raccoon We call the parent node of bear and raccoon br.

Computing br-ss distance dog bear raccoon weasel ss cat chimp ss 49 31 44 41 0 89.5 142 Here, i=raccoon, j=bear, k = ss. n(i)=n(j)=1. D(br,ss) = 0.5D(bear,ss)+0.5D(raccoon,ss)=37.5. ), ( ) ) ( ) ( ) ( ( ), ( ) ) ( ) ( ) ( ( ) ), (( k j D j n i n j n k D i j n i n i n k ij D

The new table. Starting next iteration dog br weasel ss cat chimp dog 0 40 51 49 98 148 br 0 38 37.5 88 144 weasel 0 41 86 142 ss 0 89 142 cat 0 148 chimp 0

Inferring tree Distance between br and ss was 37.5, so each branch has a length of 18.75. But this is the distance from brss to the leaves. The distance brss to ss is 18.75-12=6.75. The distance between brss to br is 18.75-13=5.75 brss 6.75 5.75 ss br 12 12 13 13 seal sea lion bear raccoon And so on..

UPGMA s Weakness: Example Correct tree UPGMA 2 3 4 1 1 4 2 3 6.75 brss 5.75 ss br 12 12 13 13 seal sea lion bear raccoon

Neighbor joining: Star topology B b c C A a e d D Sum of all branches is S*=a+b+c+d+e. Summing all distances in the matrix counts each edge four times (e.g., d AB, d AC, d AD and d AE ). Hence, the sum of all distances in the matrix is 4S*. E

Add one branch (the first potential tree ) A B a b e C d AB c = a + b d AC = a + f + c d AD = a + f + d d AE d= a + f + e D A B a b f c C d e D E E Sum of branches with the new branch is S = a + b + c + d + e + f = (d AC + d AD + d AE + d BC + d BD + d BE )/6 + d AB /2 + (d CD + d CE + d DE )/3

Neighbor joining (general idea) 1. Add one branch to the star topology and compute the difference between S* and S. 2. Repeat for each pair of leaves in the tree. 3. Choose the pair that yields the largest difference (the closest neighbors). 4. Join that pair. 5. Repeat until all pairs are joined.

NJ algorithm For each tip compute n u D /( n 2) i jj : i ij Choose i and j for which, D u u ij i j is smallest Join nodes i and j to X. Compute branch length from i to X and j to X v ( D u u ) 2 ix ij i j v ( D u u ) 2 jx ij j i Compute the distance between X and remaining nodes D ( D D D ) 2 kx ik jk ij

NJ algorithm Cont d New node X is treated as a new tip and old nodes I, j are deleted If more than two nodes remain go back to step-1, else connect the two nodes (l,m) by D l,m

Bootstrapping:Confidence in trees

Trees: representation in computer files Newick format

Newick format: named for seafood restaurant where standard was decided upon

Trees: representation in computer files A B C D ( (A, B), (C, D) ); Newick format: Leafs: represented by taxon name Internal nodes: represented by pair of matching parentheses Descendants of internal node given as comma-delimited list. Tree string terminated by semicolon

Trees: representation in computer files A B C D 1 0.5 0.7 0.3 2 2.5 ( (A:1, B:0.5) :2, (C:0.7, D:0.3) :2.5); ( (A:1, B:0.5)L:2, (C:0.7, D:0.3)M:2.5)N; Newick format: Leafs: represented by taxon name Internal nodes: represented by pair of matching parentheses Descendants of internal node given as comma-delimited list. Tree string terminated by semicolon

How to make a Tree? 1. Know the distances between each pair e.g, multiple sequence alignment, or pairwise alignments 2. Apply UPGMA or NJ 3. Calculate statistics: Bootstrap 4. Visualize your tree (e.g., Treeview) 5. Or use an all-in-one program :-) e.g. CLUSTALW

Phylogenetic software Software packages Freely available Phylip (widely used) BioNJ PhyML Tree Puzzle MrBayes Commercial PAUP (widely used) MEGA

www.ebi.ac.uk

Phylogenetic servers http://www.phylogeny.fr/ http://bioweb.pasteur.fr/seqanal/phylogeny/intro-uk.html http://atgc.lirmm.fr/phyml/ http://phylobench.vital-it.ch/raxml-bb/ http://www.fbsc.ncifcrf.gov/app/htdocs/appdb/drawpage.php?ap pname=paup http://power.nhri.org.tw/power/home.htm