Chapter 3: Phylogenetics

3. Computing Phylogeny
Prof. Yechiam Yemini (YY)
Computer Science Department, Columbia University

Overview
  Computing trees
  Distance-based techniques
  Maximal Parsimony (MP) techniques
  Maximum likelihood techniques
  This chapter is based on Durbin, Chapter 7.
  Also recommended: The Phylogenetic Handbook, Salemi and Vandamme.

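Can We Tell Evolution From Homology?
  [Figure: trees illustrating duplication, speciation, and a partial sample.]
  Phylogeny: how do we tell the right tree?

Phylogeny: Computing Trees
  INPUT: DNA sequences for a set of taxa.
  OUTPUT: a phylogenetic tree over those taxa.
  [Figure: example input sequences and the resulting tree.]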

Brute Force Approach
  Brute force:
    Enumerate all trees.
    Compute some measure of evolutionary likelihood.
    Select the best tree.
  How many rooted trees are there with n leaves?
    n = 2 leaves => 1 tree.
    n = 3 leaves => attach the 3rd leaf to one of 3 edges => 3 trees.
    Let T(n) = # rooted trees with n leaves; E(n) = # edges (counting the edge above the root).
    T(2) = 1, E(2) = 3; T(3) = 3, E(3) = 5.
    Adding a leaf creates two new edges => $E(n) = E(n-1) + 2$ => $E(n) = 2n - 1$.
    $T(n) = T(n-1) \cdot E(n-1) = T(n-1) \cdot (2n-3)$ => $T(n) = 1 \cdot 3 \cdot 5 \cdots (2n-3)$.
    For n = 20 leaves this is already about $8 \times 10^{21}$ trees.

Approaches
  Distance based:
    The tree should best model an evolutionary distance metric among taxa.
  Character-based [Maximal Parsimony (MP)]:
    The tree should minimize changes.
  Maximum likelihood (ML):
    The tree should maximize the likelihood of changes.
  [Figure: INPUT sequences and OUTPUT phylogeny, as before.]
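The growth of the brute-force search space can be checked numerically. The following is a minimal sketch of the recurrence T(n) = T(n-1) * (2n-3); the function name `count_rooted_trees` is a hypothetical helper, not something from the slides.

```python
def count_rooted_trees(n: int) -> int:
    """Number of rooted binary trees on n labeled leaves: 1*3*5*...*(2n-3)."""
    if n < 2:
        raise ValueError("need at least 2 leaves")
    count = 1                        # T(2) = 1
    for leaves in range(3, n + 1):
        count *= 2 * leaves - 3      # T(n) = T(n-1) * (2n-3)
    return count


if __name__ == "__main__":
    for n in (2, 3, 4, 10, 20):
        print(n, count_rooted_trees(n))
```

Python integers are arbitrary precision, so the count stays exact even when it exceeds 64 bits.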

Distance Based Techniques

I. Distance Based Techniques
  Key idea:
    Compute an evolutionary distance metric D among the taxa in S.
    Compute a tree on S that best fits the distances.
  Formally:
    Given: an n x n distance matrix D.
    Compute: a weighted tree T on n leaves that best fits D.
  How to establish evolutionary distance measures?
    Distance ~ changes.
    Next chapter: evaluating distance using Markovian evolution models.

Is There a Tree That Perfectly Fits D?
  Not every distance metric can be modeled by a tree.
  How can we tell which distance metrics model a tree?

The Four-Point Condition
  A distance matrix D corresponding to a tree is called additive.
  THEOREM: D is additive if and only if, for every four indices i, j, k, l, the maximum and the median of the three pairwise sums are identical. With a suitable labeling of the four indices:
    $D_{ij} + D_{kl} \le D_{ik} + D_{jl} = D_{il} + D_{jk}$
  This suggests how to connect points into a tree that fits D: i and j are neighbors on one side of an internal edge, k and l on the other.
  [Figure: quartet tree on i, j, k, l showing the three pairwise sums.]
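The four-point condition translates directly into a test over all quartets. The sketch below assumes D is a symmetric list-of-lists; the function name `is_additive`, the tolerance parameter, and the two small matrices are illustrative assumptions rather than material from the slides.

```python
from itertools import combinations


def is_additive(D, tol=1e-9):
    """Check the four-point condition on every quartet of indices."""
    n = len(D)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]])
        # The two largest sums (median and maximum) must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Hypothetical example: distances read off a 4-leaf tree ((a,b),(c,d))
# with leaf edges 1, 2, 4, 5 and an internal edge of length 3.
tree_like = [[0, 3, 8, 9],
             [3, 0, 9, 10],
             [8, 9, 0, 9],
             [9, 10, 9, 0]]

# Same matrix with one distance changed, which breaks additivity.
perturbed = [[0, 3, 8, 9],
             [3, 0, 9, 12],
             [8, 9, 0, 9],
             [9, 12, 9, 0]]

if __name__ == "__main__":
    print(is_additive(tree_like), is_additive(perturbed))
```

The check visits O(n^4) quartets, which is fine for small matrices but not meant as an efficient test.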

How Do We Handle Non-Additive D?
  Additive metrics are very useful:
    They provide a perfect fit with a tree model; the tree is easily computed from D.
  But evolutionary distance metrics are often non-additive.
  How do we handle a non-additive metric?
    Fitch & Margoliash: find a tree T that minimizes the least-squares fit
      $E(T) = \sum_{i,j} (d_{ij}(T) - D_{ij})^2$
    where $d_{ij}(T)$ is the path length between leaves i and j in T.
    This problem is NP-hard, so heuristics are needed.
    Fitch & Margoliash (1967): exhaustive search.

Closest-Pair Clustering
  Idea: use D to guide closest-pair clustering.
  Extend D to clusters by UPGMA/WPGMA averaging.

UPGMA Algorithm
  Initialization:
    Initialize n clusters $C_i = \{S_i\}$.
    Initialize T with a leaf for each cluster $C_i$.
  Iteration:
    Find $C_i$, $C_j$ with the smallest distance $D_{ij}$.
    Create a new cluster $C_k = C_i \cup C_j$.
    Add a new node to T for $C_k$, and connect it to $C_i$, $C_j$.
    If all nodes are connected into a tree, exit; otherwise, place node k at height $D_{ij}/2$ (so $D_{ki} = D_{kj} = D_{ij}/2$ when i and j are leaves) and compute the distances $D_{kl}$ to all other clusters $C_l$:
      $D_{kl} = \dfrac{D_{il}\,|C_i| + D_{jl}\,|C_j|}{|C_i| + |C_j|}$
    Repeat the iteration.

UPGMA: Molecular Clock Property
  Uniform distance from the root to the leaves.
  Distance to the root ~ evolutionary clock.
  Species are assumed to take identical time to evolve.
  [Figure: UPGMA tree with all leaves at the same distance from the root.]
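The UPGMA iteration above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the tree is returned as nested tuples of taxon names, ties are broken by dictionary order, and the function name `upgma` and the example matrix are hypothetical.

```python
def upgma(D, names):
    """Return (tree, root_height); tree is a nested tuple of taxon names."""
    n = len(names)
    # clusters: id -> (subtree, size)
    clusters = {i: (names[i], 1) for i in range(n)}
    dist = {(i, j): D[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id, height = n, 0.0
    while len(clusters) > 1:
        # Closest pair of clusters.
        (i, j), dij = min(dist.items(), key=lambda kv: kv[1])
        (ti, si), (tj, sj) = clusters.pop(i), clusters.pop(j)
        # Size-weighted average distance from the merged cluster to the others.
        new_dist = {}
        for l in clusters:
            dil = dist[(min(i, l), max(i, l))]
            djl = dist[(min(j, l), max(j, l))]
            new_dist[(l, next_id)] = (dil * si + djl * sj) / (si + sj)
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        clusters[next_id] = ((ti, tj), si + sj)
        height = dij / 2             # the new node sits at half the merge distance
        next_id += 1
    (tree, _), = clusters.values()
    return tree, height


# Hypothetical ultrametric example: (a,b) merge at height 1, (c,d) at 2, root at 3.
example = [[0, 2, 6, 6],
           [2, 0, 6, 6],
           [6, 6, 0, 4],
           [6, 6, 4, 0]]

if __name__ == "__main__":
    print(upgma(example, ["a", "b", "c", "d"]))
```

Scanning the whole distance dictionary for the minimum at each step keeps the sketch short; it is not the most efficient way to organize the search.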

Notes
  Complexity is $O(n^2)$.
  Averaging redistributes distances to overcome non-additivity.
  Clustering can lead to substantial errors and is very sensitive.
  This limits the applications of clustering.
  How do we overcome the sensitivity of UPGMA?
  [Figure: a real tree next to the different tree recovered by UPGMA.]

Improvements Through Bootstrapping
  Bootstrapping: a statistical technique to increase robustness.
  Scenario: given a sample S(ω) and a result R(S) computed from S.
  Bootstrapping:
    o Resample S to get S'(ω);
    o Evaluate R(S'(ω));
    o Evaluate the match of R(S) with the values R(S'(ω)).
  Here S = the columns of the sequences, of size n; R(S) = tree.
  S'(ω) = a sample of n random columns of S, with possible repetitions.
  Compute the phylogenetic tree R(S'(ω)).
  Use {R(S'(ω))} to compute the consensus/likelihood of the branches of R(S).
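The resampling step can be illustrated with a small sketch. To stay self-contained, the result R(S) here is not a full tree but a stand-in: the closest pair of sequences under Hamming distance. All function names and the toy alignment are assumptions made for illustration.

```python
import random
from itertools import combinations


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def closest_pair(alignment):
    """Stand-in for R(S): indices of the two closest sequences."""
    return min(combinations(range(len(alignment)), 2),
               key=lambda p: hamming(alignment[p[0]], alignment[p[1]]))


def resample_columns(alignment, rng):
    """S'(w): draw n columns of S at random, with possible repetitions."""
    n = len(alignment[0])
    cols = [rng.randrange(n) for _ in range(n)]
    return ["".join(seq[c] for c in cols) for seq in alignment]


def bootstrap_support(alignment, result=closest_pair, replicates=200, seed=0):
    """Fraction of bootstrap replicates whose result matches R(S)."""
    rng = random.Random(seed)
    reference = result(alignment)
    hits = sum(result(resample_columns(alignment, rng)) == reference
               for _ in range(replicates))
    return reference, hits / replicates


# Hypothetical toy alignment: the first two sequences are identical.
toy = ["ACGTACGT", "ACGTACGT", "TGCATGCA", "GTACGTAC"]

if __name__ == "__main__":
    print(bootstrap_support(toy))
```

In a real analysis `result` would build a tree (for example with UPGMA or neighbor joining) and the comparison would be made branch by branch rather than on the whole result.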

Bootstrapping Example
  [Figure: bootstrapping example.]

Closest Pair vs. Evolutionary Neighbors
  Additivity: $D_{ij} + D_{kl} \le D_{ik} + D_{jl} = D_{il} + D_{jk}$
  [Figure: quartet tree on i, j, k, l showing the three pairwise sums.]
  UPGMA overcomes non-additivity by averaging distances.
  But the closest pair may not be evolutionary neighbors.
  The evolutionary tree distances may diverge greatly; averaging distorts the neighborhood.

Neighbor Joining [Saitou & Nei 87; Studier & Keppler 88]
  Neighbor joining heuristic: join the closest clusters that are far from the rest.
  Define $R_k = \sum_{i \ne k} D_{ik}$, the divergence of k.
  Cluster the nodes k, m that minimize $N_{km} = D_{km} - (R_k + R_m)/(n-2)$.
  [Define $r_k = R_k/(n-2)$ and consider $D_{km} - r_k - r_m$.]
  [Figure: small example tabulating $D_{km}$, $r_k$, $r_m$ and $D_{km} - r_k - r_m$.]

Neighbor Joining Algorithm
  Initialization (same as UPGMA):
    Initialize n clusters $C_i = \{S_i\}$.
  Iteration:
    1. Compute $r_k = \sum_{i \ne k} D_{ik}/(n-2)$ for each cluster k.
    2. Find (k, m) minimizing $D_{km} - r_k - r_m$.
    3. Define a new node i and set $D_{is} = 0.5\,(D_{ks} + D_{ms} - D_{km})$ for all s.
    4. Join node i to k and m with edges of respective lengths:
       $D_{ki} = 0.5\,(D_{km} + r_k - r_m)$
       $D_{mi} = 0.5\,(D_{km} + r_m - r_k)$
    5. Repeat until all nodes are connected.

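Example: Step 1, Compute Divergences
  From The Phylogenetic Handbook, Salemi and Vandamme.
  Distance matrix D for six taxa A to F, with column sums R and divergences r:

         A     B     C     D     E     F
    A    -     5     4     7     6     8
    B    5     -     7    10     9    11
    C    4     7     -     7     6     8
    D    7    10     7     -     5     9
    E    6     9     6     5     -     8
    F    8    11     8     9     8     -
    R   30    42    32    38    34    44
    r  7.5  10.5     8   9.5   8.5    11

  Step 1: compute $r_k = \sum_{i \ne k} D_{ik}/(n-2)$: sum the columns, then divide by 6 - 2 = 4.

Step 2: Find the Neighboring Pair
  Evaluate the neighboring distance matrix $N_{km} = D_{km} - (r_k + r_m)$ [subtract the r column and row].
  Find (k, m) minimizing $N_{km}$; create a new node and attach it to k and m.

         B      C      D      E      F
    A  -13  -11.5    -10    -10  -10.5
    B       -11.5    -10    -10  -10.5
    C              -10.5  -10.5    -11
    D                       -13  -11.5
    E                            -11.5

  Min{$N_{km}$} = -13, attained by (A, B) and by (D, E); the example joins A and B through a new node U.
  UPGMA would instead connect the closest pair (A and C, at distance 4).

Steps 3, 4: Join Neighbors, Update Distances
  Step 3: compute the branch lengths:
    $D_{AU} = 0.5\,(D_{AB} + r_A - r_B) = 0.5\,(5 - 3) = 1$
    $D_{BU} = 0.5\,(D_{AB} + r_B - r_A) = 0.5\,(5 + 3) = 4$
  Step 4: update the distance matrix with $D_{UX} = 0.5\,(D_{AX} + D_{BX} - D_{AB})$:
    $D_{UC} = 0.5\,(4 + 7 - 5) = 3$; $D_{UD} = 0.5\,(7 + 10 - 5) = 6$
    $D_{UE} = 0.5\,(6 + 9 - 5) = 5$; $D_{UF} = 0.5\,(8 + 11 - 5) = 7$

Repeat Steps 1/2/3/4

         U     C     D     E     F
    U    -     3     6     5     7
    C    3     -     7     6     8
    D    6     7     -     5     9
    E    5     6     5     -     8
    F    7     8     9     8     -
    r    7     8     9     8  10.7

  Step 1: compute $r_k = \sum_{i \ne k} D_{ik}/(n-2)$, now with n - 2 = 3.
  Step 2: compute the neighboring pair, Min{$N_{XY} = D_{XY} - r_X - r_Y$} => (U, C) or (D, E), both with N = -12:

         C      D      E      F
    U  -12    -10    -10  -10.7
    C         -10    -10  -10.7
    D                -12  -10.7
    E                     -10.7

  Step 3: join the neighbors U and C through a new node V; compute the branch lengths:
    $D_{UV} = 0.5\,(D_{UC} + r_U - r_C) = 0.5\,(3 + 7 - 8) = 1$; $D_{CV} = 2$
  Step 4: re-compute the distances with $D_{VX} = 0.5\,(D_{UX} + D_{CX} - D_{UC})$:
    $D_{VD} = 5$; $D_{VE} = 4$; $D_{VF} = 6$

Repeat

         V     D     E     F
    V    -     5     4     6
    D    5     -     5     9
    E    4     5     -     8
    F    6     9     8     -
    r  7.5   9.5   8.5  11.5

  Step 1: compute $r_k = \sum_{i \ne k} D_{ik}/(n-2)$, now with n - 2 = 2.
  Step 2: compute the neighboring pair, Min{$N_{XY} = D_{XY} - r_X - r_Y$} => (D, E):

         D     E     F
    V  -12   -12   -13
    D        -13   -12
    E              -12

  Step 3: join the neighbors D and E through a new node W; compute the branch lengths:
    $D_{DW} = 0.5\,(D_{DE} + r_D - r_E) = 0.5\,(5 + 9.5 - 8.5) = 3$; $D_{EW} = 2$
  Step 4: re-compute the distances with $D_{WX} = 0.5\,(D_{DX} + D_{EX} - D_{DE})$:
    $D_{WV} = 2$; $D_{WF} = 6$

Repeat

         V     W     F
    V    -     2     6
    W    2     -     6
    F    6     6     -
    r    8     8    12

  Step 1: compute $r_k = \sum_{i \ne k} D_{ik}/(n-2)$, now with n - 2 = 1.
  Step 2: compute the neighboring pair; all three pairs have N = -14.
  Step 3: join V and W through a new node Z; compute the branch lengths:
    $D_{VZ} = 0.5\,(D_{VW} + r_V - r_W) = 0.5\,(2 + 8 - 8) = 1$; $D_{WZ} = 1$
  Step 4: re-compute the distances: $D_{ZF} = 0.5\,(D_{VF} + D_{WF} - D_{VW}) = 0.5\,(6 + 6 - 2) = 5$

Complete
  The resulting unrooted tree has the edges:
    A-U = 1, B-U = 4, U-V = 1, C-V = 2, V-Z = 1, D-W = 3, E-W = 2, W-Z = 1, F-Z = 5.
  [Figure: the complete neighbor-joining tree.]

Notes On Neighbor Joining
  Complexity is $O(n^3)$ for the standard algorithm.
  Does not depend on the molecular clock assumption.
  Heavily used in practice [e.g., ClustalW].
  But can be sensitive to non-additivity.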
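The neighbor-joining iteration can be sketched compactly and run on the six-taxon matrix of the worked example. This is an illustrative implementation, not the slides' own code: internal nodes are named N0, N1, ... (hypothetical labels), ties are broken by iteration order, and the tree is returned as a list of weighted edges.

```python
def neighbor_joining(D, names):
    """Return a list of (node, neighbor, branch_length) edges of the NJ tree."""
    dist = {a: {b: D[i][j] for j, b in enumerate(names) if i != j}
            for i, a in enumerate(names)}
    edges, new_id = [], 0
    while len(dist) > 2:
        n = len(dist)
        # Step 1: divergences r_k = sum_i D_ik / (n - 2).
        r = {k: sum(dist[k].values()) / (n - 2) for k in dist}
        # Step 2: pair minimizing D_km - r_k - r_m.
        nodes = list(dist)
        k, m = min(((a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:]),
                   key=lambda p: dist[p[0]][p[1]] - r[p[0]] - r[p[1]])
        dkm = dist[k][m]
        u = f"N{new_id}"
        new_id += 1
        # Step 4 of the slides: branch lengths from k and m to the new node.
        edges.append((k, u, 0.5 * (dkm + r[k] - r[m])))
        edges.append((m, u, 0.5 * (dkm + r[m] - r[k])))
        # Step 3 of the slides: distances from the new node to every other node.
        new_row = {s: 0.5 * (dist[k][s] + dist[m][s] - dkm)
                   for s in dist if s not in (k, m)}
        del dist[k], dist[m]
        for s in dist:
            del dist[s][k], dist[s][m]
            dist[s][u] = new_row[s]
        dist[u] = new_row
    a, b = dist                      # two nodes remain: connect them directly
    edges.append((a, b, dist[a][b]))
    return edges


# Six-taxon distance matrix of the worked example (taxa A to F).
handbook = [[0, 5, 4, 7, 6, 8],
            [5, 0, 7, 10, 9, 11],
            [4, 7, 0, 7, 6, 8],
            [7, 10, 7, 0, 5, 9],
            [6, 9, 6, 5, 0, 8],
            [8, 11, 8, 9, 8, 0]]

if __name__ == "__main__":
    for edge in neighbor_joining(handbook, list("ABCDEF")):
        print(edge)
```

When several pairs tie for the minimum, different tie-breaking orders can join pairs in a different sequence; on an additive matrix such as this one the resulting unrooted tree is the same.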

Maximal Parsimony (character based phylogeny)

Key Idea: Minimize Changes
  Reconsider the problem:
    Find the best tree to explain the evolution of sequences.
  Motivation: focus on the evolution of positions.
    Distance loses information on evolutionary changes.
  Key idea: find the tree with minimal changes to explain the data.
  [Figure: two candidate trees over the same leaf sequences, annotated with their change counts C.]

More Generally
  Taxa are considered as sets of attributes: characters.
    character = DNA position, gene order, morphological feature
    character state = a value assumed by a character
  Characters evolve through state changes.
  An evolutionary tree represents changes in character states.
  The MP tree seeks to minimize state changes.

MP Example
  http://evolution.berkeley.edu/evosite/evo0/iicasingparsimony.shtml
  [Figure: table of taxa against characters with binary states, and a tree marking each state change.]

MP Example
  [Figure: two candidate trees, one requiring 7 state changes and one requiring 6 state changes.]

Example: Evolution of Gene
  www.life.uiuc.edu/ib/33/molsyst.html
  Character = position; State = nucleotide.
  [Figure: taxa and their aligned sequences.]

Example: Evolution of Gene
  http://home.cc.umanitoba.ca/~psgendb/g/phylogeny/parsimony/phylip.parsimony.html
  Character = position; State = nucleotide.
  [Figure: taxa and their aligned sequences.]

Example: MP for rearrangements of chromosome
  Pevzner 2003, Genome Research.

The Max Parsimony (MP) Problem
  Big MP:
    Input: a set of n aligned sequences of length k.
    Output: a phylogenetic tree T such that
      o T has n leaves labeled with the input sequences (taxa);
      o T has internal nodes labeled with sequences of length k (states);
      o T minimizes the total Hamming distance between the labels of adjacent nodes.
    [Figure: labeled tree with total Hamming cost H = 3.]
    This is a Steiner-tree type problem.
    It can be shown to be NP-hard [Gusfield, Foulds].
    But often the number of sequences considered is small.
  Small MP:
    Input: a tree with sequence-labeled leaves.
    Output: a labeling of the internal nodes with states that maximizes parsimony.

MP Basics
  Consider a set of five aligned three-letter sequences.
  The first column admits alternative arrangements and identifies the likely mutation.
  [Figure: the MP arrangement of the first column next to an arrangement that needs more mutations.]
  The second column does not provide clues on likely mutations.
  [Figure: arrangements of the second column.]
  It is a non-informative position: an informative position needs at least two character states, each occurring in at least two sequences.

MP Basics
  [Figure: MP trees for individual columns.]
  Merge the MP trees of columns 1 & 3.
  [Figure: the merge yields two MP trees.]

Example (N. Friedman)
  Aardvark: CAGGTA
  Bison:    CAGACA
  Chimp:    CGGGTA
  Dog:      TGCACT
  Elephant: TGCGTA
  [Figure: tree over Aardvark, Bison, Chimp, Dog and Elephant with sequence-labeled internal nodes.]

Example: Evolution of Protein Domains
  http://ai.stanford.edu/~serafim/cs37_006/
  [Figure: parsimony tree over protein domains with per-edge costs and a total cost.]
  C. Chothia et al., Evolution of the Protein Repertoire, Science vol. 300, June 2003.
  T. Przytycka et al., Graph Theoretical Insights..., RECOMB 2005, LNBI 3500.

Single Site MP: The Fitch Algorithm
  Problem:
    Input: a tree T with labeled leaves.
    Output: labels of the internal nodes of an MP tree + cost C.
  Step 1: assign to each node x a set of labels S(x) such that:
    If x is a leaf, then S(x) = {label of x}; initially C ← 0.
    If x has children y, z:
      if $S(y) \cap S(z) \ne \emptyset$ then $S(x) = S(y) \cap S(z)$
      else $S(x) = S(y) \cup S(z)$ and C ← C + 1.
    Traverse T in postorder (leaves to root).
  Step 2: assign to each node x a character value v(x):
    Traverse T in preorder (root to leaves).
    If y is the parent of x and $v(y) \in S(x)$ then v(x) ← v(y);
    else v(x) = any label from S(x).
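The two passes of the Fitch algorithm can be sketched as follows. The representation is an assumption made for illustration: a tree is either a leaf state (a string) or a pair of subtrees, and where the algorithm allows "any label from S(x)" this sketch picks the alphabetically smallest one.

```python
def fitch(tree):
    """tree: a leaf state (str) or a pair (left, right).

    Returns (labeled_tree, cost); an internal node of the labeled tree is
    a triple (value, left_subtree, right_subtree), a leaf is its state.
    """
    cost = 0

    def candidates(node):
        # Step 1 (postorder): candidate label sets S(x) and the change count C.
        nonlocal cost
        if isinstance(node, str):
            return {"set": {node}, "children": None}
        left, right = candidates(node[0]), candidates(node[1])
        common = left["set"] & right["set"]
        if common:
            labels = common
        else:
            labels = left["set"] | right["set"]
            cost += 1
        return {"set": labels, "children": (left, right)}

    def assign(info, parent_value):
        # Step 2 (preorder): keep the parent's value when it is a candidate.
        labels = info["set"]
        value = parent_value if parent_value in labels else min(labels)
        if info["children"] is None:
            return value
        left, right = info["children"]
        return (value, assign(left, value), assign(right, value))

    root = candidates(tree)
    return assign(root, None), cost


if __name__ == "__main__":
    print(fitch((("A", "G"), ("A", "G"))))
    print(fitch((("A", "A"), ("G", "G"))))
```

For a multi-column alignment the function would be called once per column and the costs summed.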

Step 1: Computing Candidate Labels
  [Figure: postorder pass on a tree with leaves labeled A and G; each internal node receives a candidate set {A}, {G} or {A, G}, and C is incremented whenever the children's sets are disjoint.]

Step 2: Selecting MP Labels
  [Figure: preorder pass on the same tree; each node takes its parent's value when that value is in its candidate set.]

Notes
  The algorithm is fast: O(nk), where n = # nodes and k = # character values.
  It selects a particular MP labeling (there may be others).
  [Figure: alternative MP labelings of the same tree.]
  Run it separately for each character, then merge the results.
  It may be generalized to weighted parsimony:
    Sankoff's generalization: different costs for different changes.

Heuristic MP Algorithms
  Use Steiner-tree heuristic algorithms.
  Branch-and-bound search:
    Represent the search space as a tree (nodes at the k-th level represent phylogenetic trees for the first k species).
    Find the best scoring search-node and use it as a bound.
    Branch to the children of this search-node.
  Nearest neighbor interchange (NNI): switch subtrees.
  Simulated annealing.
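Sankoff's weighted generalization mentioned in the notes can be sketched with a small dynamic program over one character. The tree representation (a leaf state or a pair of subtrees), the function name `sankoff`, and the transition/transversion cost values are assumptions chosen for illustration only.

```python
import math


def sankoff(tree, states, cost):
    """Minimum weighted parsimony cost of a binary tree for one character.

    tree: a leaf state (str) or a pair (left, right);
    cost[a][b]: cost of changing state a into state b (cost[a][a] == 0).
    """
    def table(node):
        # table(node)[a] = cheapest cost of the subtree if node has state a.
        if isinstance(node, str):
            return {a: (0 if a == node else math.inf) for a in states}
        left, right = table(node[0]), table(node[1])
        return {a: min(cost[a][b] + left[b] for b in states)
                   + min(cost[a][c] + right[c] for c in states)
                for a in states}

    return min(table(tree).values())


STATES = "ACGT"
# Unit costs reproduce plain (Fitch) parsimony.
UNIT = {a: {b: 0 if a == b else 1 for b in STATES} for a in STATES}
# Hypothetical weights: transitions (A<->G, C<->T) cost 1, transversions cost 2.
PURINES = {"A", "G"}
WEIGHTED = {a: {b: 0 if a == b else (1 if (a in PURINES) == (b in PURINES) else 2)
                for b in STATES} for a in STATES}

if __name__ == "__main__":
    print(sankoff((("A", "G"), ("A", "G")), STATES, UNIT))
    print(sankoff((("A", "C"), ("A", "C")), STATES, WEIGHTED))
```

With unit costs the result matches the Fitch cost; with unequal costs the optimal labeling can differ from the one Fitch would select.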

Maximal Likelihood Approach

(III) Max Likelihood Approaches
  (Based on N. Friedman slides)
  Key idea: compute the maximum likelihood tree.
    Many models of changes (trees) can yield the observed data.
    Compute the tree that maximizes the likelihood.
  Problem 1: given T, compute the probability $P(S \mid T)$.
    $S = \{x_1, \ldots, x_n\}$ are the observed sequences.
    Need a probability model of the changes generated by T:
      o Background probabilities: $q(a)$
      o Mutation probabilities: $P(a \mid b, t)$
  Problem 2: compute the T that maximizes $P(S \mid T)$.
    This is the complex part.
  [Figure: tree with leaves $x_1, x_2, x_3$, internal nodes $x_4, x_5$ and branch lengths $t_1, \ldots, t_4$.]

Tree Likelihood Computation
  Define $P(L_k \mid a)$ = probability of the subtree below node k given $x_k = a$.
  Init: for all leaves k, $P(L_k \mid a) = 1$ if $x_k = a$; 0 otherwise.
  Iteration: if k is a node with children i and j, then
    $P(L_k \mid a) = \sum_{b,c} P(b \mid a, t_i)\, P(L_i \mid b)\, P(c \mid a, t_j)\, P(L_j \mid c)$
  Termination: the likelihood is
    $P(x_1, \ldots, x_n \mid T, t) = \sum_a P(L_{root} \mid a)\, q(a)$
  [Figure: tree with leaves $x_1, x_2, x_3$, internal nodes $x_4, x_5$ and branch lengths $t_1, \ldots, t_4$.]

Maximum Likelihood (ML)
  Score each tree by
    $P(x_1, \ldots, x_n \mid T, t) = \prod_m P(x_1[m], \ldots, x_n[m] \mid T, t)$
  This assumes independent positions.
  Find the highest scoring tree:
    Exhaustive search
    Sampling methods (Metropolis)
    Approximation (consider only a subset of trees)
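The recursion for $P(L_k \mid a)$ and the product over positions can be sketched as below. The slides leave the mutation model open; this sketch assumes the Jukes-Cantor model for $P(b \mid a, t)$ and uniform background probabilities $q(a) = 1/4$. The tree representation (a taxon name, or a pair of (subtree, branch length) entries) and all function names are illustrative.

```python
import math

STATES = "ACGT"
Q = {a: 0.25 for a in STATES}        # assumed uniform background probabilities


def jc69(b, a, t):
    """Assumed Jukes-Cantor probability of observing b after time t, starting from a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def subtree_likelihood(node, column):
    """Return {a: P(L_k | a)} for one alignment column.

    node: a taxon name (leaf) or ((left, t_left), (right, t_right)).
    column: maps each taxon name to its state at this position.
    """
    if isinstance(node, str):
        return {a: 1.0 if a == column[node] else 0.0 for a in STATES}
    (left, t_i), (right, t_j) = node
    L_i = subtree_likelihood(left, column)
    L_j = subtree_likelihood(right, column)
    # The double sum over (b, c) factorizes into a product of two single sums.
    return {a: sum(jc69(b, a, t_i) * L_i[b] for b in STATES)
               * sum(jc69(c, a, t_j) * L_j[c] for c in STATES)
            for a in STATES}


def likelihood(tree, seqs):
    """P(x_1..x_n | T, t): product over independent positions m."""
    length = len(next(iter(seqs.values())))
    total = 1.0
    for m in range(length):
        column = {name: s[m] for name, s in seqs.items()}
        L_root = subtree_likelihood(tree, column)
        total *= sum(L_root[a] * Q[a] for a in STATES)
    return total


if __name__ == "__main__":
    tree = ((("x", 0.1), ("y", 0.2)), 0.05), ("z", 0.3)
    print(likelihood(tree, {"x": "ACG", "y": "ACT", "z": "GCT"}))
```

For long alignments the product underflows quickly, so practical implementations sum log-likelihoods instead of multiplying probabilities.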

Comparison
  Tony Weisstein, http://bioquest.org:6080/bedrock/terre_haute_03_0/phylogenetics_.0.ppt

  Neighbor-joining:
    Uses only pairwise distances.
    Minimizes distance between nearest neighbors.
    Very fast.
    Easily trapped in local optima.
    Good for generating a tentative tree, or choosing among multiple trees.
  Maximum parsimony:
    Uses only shared derived characters.
    Minimizes total distance.
    Slow.
    Assumptions fail when evolution is rapid.
    Best option when tractable (<30 taxa).
  Maximum likelihood:
    Uses all data.
    Maximizes tree likelihood given specific parameter values.
    Very slow.
    Highly dependent on the assumed evolution model.
    Good for very small data sets and for testing trees built using other methods.

Conclusions
  Computing phylogeny is an area of active research.
    Hundreds of algorithms.
  New models: phylogenetic networks (generalize trees).
  New challenges: whole genome phylogeny.
    Account for multi-site changes: replication, transpositions.
    New algorithms.
  Applications:
    Epidemiology
    Cancer diagnosis.