
Phylogenetics: Parsimony and Likelihood COMP 571 - Spring 2016 Luay Nakhleh, Rice University

The Problem Input: Multiple alignment of a set S of sequences Output: Tree T leaf-labeled with S

Assumptions Characters are mutually independent Following a speciation event, characters continue to evolve independently

In parsimony-based methods, the inferred tree is fully labeled.

[Example: input sequences ACCT, GGAT, ACGT, GAAT at the leaves; the same tree shown fully labeled, with internal nodes labeled ACCT and GAAT]

A Simple Solution: Try All Trees Problem: for n taxa, there are (2n-3)!! rooted trees and (2n-5)!! unrooted trees

A Simple Solution: Try All Trees

Number of taxa   Number of unrooted trees   Number of rooted trees
      3                      1                         3
      4                      3                        15
      5                     15                       105
      6                    105                       945
      7                    945                     10395
      8                  10395                    135135
      9                 135135                   2027025
     10                2027025                  34459425
     20               2.22E+20                  8.20E+21
     30               8.69E+36                  4.95E+38
     40               1.31E+55                  1.01E+57
     50               2.84E+74                  2.75E+76
     60               5.01E+94                  5.86E+96
     70              5.00E+115                 6.85E+117
     80              2.18E+137                 3.43E+139
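These counts grow as double factorials, so the table is easy to reproduce. A minimal Python sketch (the function names are illustrative, not from the slides):

```python
# For n taxa there are (2n-5)!! unrooted and (2n-3)!! rooted binary trees.

def double_factorial(k):
    """k * (k-2) * (k-4) * ... * 1 for odd k (and 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def num_unrooted(n):
    return double_factorial(2 * n - 5)

def num_rooted(n):
    return double_factorial(2 * n - 3)

for n in (3, 4, 5, 10, 20, 50):
    print(n, num_unrooted(n), num_rooted(n))
```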

Solution Define an optimization criterion. Find the tree (or set of trees) that optimizes the criterion. Two common criteria: parsimony and likelihood

Parsimony

The parsimony score PS(T) of a fully-labeled unrooted tree T is the sum of the lengths of all the edges in T. The length of an edge is the Hamming distance between the sequences at its two endpoints.

[Example: the fully-labeled tree above on leaves ACCT, GGAT, ACGT, GAAT with internal nodes ACCT and GAAT; the Hamming distances along its five edges are 0, 1, 3, 1, 0] Parsimony score = 5
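To make the definition concrete, here is a minimal Python sketch of PS(T) for a fully-labeled tree; the edge-list encoding and node names are illustrative assumptions:

```python
# PS(T): sum of Hamming distances over the edges of a fully-labeled tree.

def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(s, t))

def parsimony_score(labels, edges):
    return sum(hamming(labels[u], labels[v]) for u, v in edges)

# The slide's example: internal nodes x = ACCT and y = GAAT.
labels = {"a": "ACCT", "b": "ACGT", "c": "GGAT", "d": "GAAT",
          "x": "ACCT", "y": "GAAT"}
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
print(parsimony_score(labels, edges))  # 5
```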

Maximum Parsimony (MP) Input: a multiple alignment S of n sequences Output: tree T with n leaves, each leaf labeled by a unique sequence from S and internal nodes labeled by sequences, such that PS(T) is minimized

[Example: for the four sequences AAC, AGC, TTC, ATC, each of the three unrooted topologies can be internally labeled (with ATC) to achieve parsimony score 3; the three trees are equally good MP trees]

[Example: for the four sequences ACT, GTT, GTA, ACA, the three unrooted topologies have parsimony scores 5, 6, and 4; the third tree (score 4) is the MP tree]

Weighted Parsimony Each transition from one character state to another is given a weight Each character is given a weight Seek a tree that minimizes the weighted parsimony

Both the MP and weighted MP problems are NP-hard

A Heuristic For Solving the MP Problem Starting with a random tree T, move through the tree space, computing the parsimony of trees and keeping those with the optimal score among the ones encountered. Usually, the search time is the stopping criterion

Two Issues How do we move through the tree search space? Can we compute the parsimony of a given leaf-labeled tree efficiently?

Searching Through the Tree Space Use tree transformation operations (NNI, TBR, and SPR) [Figure: tree-space search landscape, with a global maximum and local maxima]

Computing the Parsimony Length of a Given Tree Fitch's algorithm computes the parsimony score of a given leaf-labeled rooted tree in polynomial time

Fitch's Algorithm Alphabet Σ; character c takes states from Σ; v_c denotes the state of character c at node v

Fitch's Algorithm Bottom-up phase: For each node v and each character c, compute the set S_{c,v} as follows: If v is a leaf, then S_{c,v} = {v_c}. If v is an internal node whose two children are x and y, then

S_{c,v} = S_{c,x} ∩ S_{c,y}  if S_{c,x} ∩ S_{c,y} ≠ ∅
S_{c,v} = S_{c,x} ∪ S_{c,y}  otherwise

Fitch's Algorithm Top-down phase: For the root r, let r_c = a for some arbitrary a in the set S_{c,r}. For an internal node v whose parent is u,

v_c = u_c  if u_c ∈ S_{c,v}
v_c = an arbitrary α ∈ S_{c,v}  otherwise
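The two phases translate directly into code. Below is a minimal Python sketch of Fitch's algorithm for a single character; the tree encoding (a children dict plus a leaf-state dict) and all names are illustrative assumptions, not from the slides:

```python
def fitch(children, leaf_state, root):
    """Return (number of mutations, state assignment) for one character."""
    sets, assignment, score = {}, {}, 0

    def bottom_up(v):
        nonlocal score
        if v not in children:            # leaf: S_{c,v} = {v_c}
            sets[v] = {leaf_state[v]}
            return
        x, y = children[v]
        bottom_up(x)
        bottom_up(y)
        common = sets[x] & sets[y]
        if common:
            sets[v] = common             # intersection when non-empty
        else:
            sets[v] = sets[x] | sets[y]  # union otherwise: one mutation
            score += 1

    def top_down(v, parent_state):
        # Keep the parent's state when possible; otherwise pick arbitrarily.
        assignment[v] = parent_state if parent_state in sets[v] \
            else next(iter(sets[v]))
        for w in children.get(v, ()):
            top_down(w, assignment[v])

    bottom_up(root)
    top_down(root, next(iter(sets[root])))  # arbitrary state for the root
    return score, assignment

# Hypothetical example: tree ((A,B),(C,D)) with leaf states T, C, T, G.
children = {"r": ("u", "v"), "u": ("A", "B"), "v": ("C", "D")}
print(fitch(children, {"A": "T", "B": "C", "C": "T", "D": "G"}, "r")[0])  # 2
```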

[Worked example: Fitch's algorithm applied to a small example tree, one site at a time; the optimal labeling requires 3 mutations]

Fitch's Algorithm takes time O(nkm), where n is the number of leaves in the tree, m is the number of sites, and k is the maximum number of states per site (for DNA, k = 4)

Informative Sites and Homoplasy Invariable sites: in the search for MP trees, sites that exhibit exactly one state across all taxa are eliminated from the analysis; only variable sites are used

Informative Sites and Homoplasy However, not all variable sites are useful for finding an MP tree topology Singleton sites: any nucleotide site at which only unique nucleotides (singletons) exist is not informative, because the nucleotide variation at the site can always be explained by the same number of substitutions in all topologies

[Example: at a site where C, T, and G each appear in only one taxon, the three singleton substitutions make it a non-informative site; all trees have parsimony score 3]

Informative Sites and Homoplasy For a site to be informative for constructing an MP tree, it must exhibit at least two different states, each represented in at least two taxa These sites are called informative sites For constructing MP trees, it is sufficient to consider only informative sites

Informative Sites and Homoplasy Because only informative sites contribute to finding MP trees, it is important to have many informative sites to obtain reliable MP trees However, when the extent of homoplasy (backward and parallel substitutions) is high, MP trees would not be reliable even if there are many informative sites available

Measuring the Extent of Homoplasy The consistency index (Kluge and Farris, 1969) for a single nucleotide site (the i-th site) is given by c_i = m_i / s_i, where m_i is the minimum possible number of substitutions at the site for any conceivable topology (= one fewer than the number of different nucleotides at that site, assuming that one of the observed nucleotides is ancestral), and s_i is the minimum number of substitutions required for the topology under consideration

Measuring the Extent of Homoplasy The lower bound of the consistency index is not 0, and the consistency index varies with the topology. Therefore, Farris (1989) proposed two more quantities: the retention index and the rescaled consistency index

The Retention Index The retention index is given by r_i = (g_i − s_i) / (g_i − m_i), where g_i is the maximum possible number of substitutions at the i-th site for any conceivable tree under the parsimony criterion; it equals the number of substitutions required for a star topology when the most frequent nucleotide is placed at the central node

The Retention Index The retention index becomes 0 when the site is least informative for MP tree construction, that is, when s_i = g_i

The Rescaled Consistency Index rc_i = [(g_i − s_i) / (g_i − m_i)] × (m_i / s_i) = r_i × c_i

Ensemble Indices The three values are often computed for all informative sites, and the ensemble or overall consistency index (CI), overall retention index (RI), and overall rescaled consistency index (RC) for all sites are considered

Ensemble Indices

CI = Σ_i m_i / Σ_i s_i
RI = (Σ_i g_i − Σ_i s_i) / (Σ_i g_i − Σ_i m_i)
RC = CI × RI

These indices should be computed only for informative sites, because for uninformative sites they are undefined
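A minimal Python sketch of the ensemble indices, assuming the per-site counts m_i, s_i, and g_i for the informative sites have already been computed (the example counts are hypothetical):

```python
# Ensemble CI, RI, RC from per-site minimum (m), observed (s), and
# maximum (g) substitution counts over informative sites only.

def ensemble_indices(m, s, g):
    M, S, G = sum(m), sum(s), sum(g)
    CI = M / S
    RI = (G - S) / (G - M)
    RC = CI * RI
    return CI, RI, RC

# Hypothetical per-site counts for three informative sites:
CI, RI, RC = ensemble_indices(m=[1, 1, 2], s=[2, 1, 3], g=[3, 2, 4])
print(f"CI={CI:.3f} RI={RI:.3f} RC={RC:.3f}")
```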

Homoplasy Index The homoplasy index is HI = 1 − CI. When there are no backward or parallel substitutions, HI = 0; in this case, the topology is uniquely determined

A Major Caveat Maximum parsimony is not statistically consistent!

Likelihood

The likelihood of model M given data D, denoted L(M|D), is p(D|M). For example, consider the following data D that result from tossing a coin 10 times: HTTTTHTTTT

Model M1: a fair coin (p(H) = p(T) = 0.5). L(M1|D) = p(D|M1) = 0.5^10

Model M2: a biased coin (p(H) = 0.8, p(T) = 0.2). L(M2|D) = p(D|M2) = 0.8^2 × 0.2^8

Model M3: a biased coin (p(H) = 0.1, p(T) = 0.9). L(M3|D) = p(D|M3) = 0.1^2 × 0.9^8

The problem of interest is to infer the model M from the (observed) data D.

The maximum likelihood estimate, or MLE, is: M̂ = argmax_M p(D|M)

D = HTTTTHTTTT
M1: p(H) = p(T) = 0.5
M2: p(H) = 0.8, p(T) = 0.2
M3: p(H) = 0.1, p(T) = 0.9
The MLE (among the three models) is M3.
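A quick numerical check of the three likelihoods, as a Python snippet:

```python
# Likelihood of each model for D = HTTTTHTTTT (2 heads, 8 tails).
models = {"M1": 0.5, "M2": 0.8, "M3": 0.1}  # p(H) under each model
for name, p_heads in models.items():
    likelihood = p_heads**2 * (1 - p_heads)**8
    print(name, likelihood)
# M1 ~ 9.8e-04, M2 ~ 1.6e-06, M3 ~ 4.3e-03, so M3 is the MLE
```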

A more complex example: The model M is an HMM The data D is a sequence of observations Baum-Welch is an EM algorithm for estimating the MLE M from the data D (in practice, it converges to a local maximum of the likelihood)

The model parameters that we seek to learn can vary for the same data and model. For example, in the case of HMMs: The parameters are the states, the transition and emission probabilities (no parameter values in the model are known) The parameters are the transition and emission probabilities (the states are known) The parameters are the transition probabilities (the states and emission probabilities are known)

Back to Phylogenetic Trees What are the data D? A multiple sequence alignment (or a matrix of taxa/characters)

Back to Phylogenetic Trees What is the (generative) model M? The tree topology, T The branch lengths, λ The model of evolution (JC, ...), E

Back to Phylogenetic Trees The likelihood is p(D | T, λ, E). The MLE is (T̂, λ̂, Ê) = argmax_{(T,λ,E)} p(D | T, λ, E)

Back to Phylogenetic Trees In practice, the model of evolution is estimated from the data first and is assumed to be known during the phylogenetic inference. In this case, given D and E, the MLE is (T̂, λ̂) = argmax_{(T,λ)} p(D | T, λ)

Assumptions Characters are independent Markov process: the probability of a node having a given label depends only on the label of its parent node and the branch length t between them

Maximum Likelihood Input: a matrix D of taxa/characters Output: tree T leaf-labeled by the set of taxa, with branch lengths λ, so as to maximize the likelihood P(D | T, λ)

P(D | T, λ)

P(D | T, λ) = ∏_{site j} p(D_j | T, λ)
            = ∏_{site j} Σ_R p(D_j, R | T, λ)
            = ∏_{site j} Σ_R [ p(root) · ∏_{edge u→v} p_{u→v}(t_{uv}) ]

where R ranges over all assignments of states to the internal nodes of T

What is p_{i→j}(t_{uv}) for a branch u→v in the tree, where i and j are the states of the site at nodes u and v, respectively?

For the Jukes-Cantor model with parameter μ (the overall substitution rate), we have p_{i→j}(t) = (1/4)(1 + 3e^{−μt}) if i = j, and p_{i→j}(t) = (1/4)(1 − e^{−μt}) if i ≠ j

If branch lengths are measured in the expected number of mutations per site, ν (for JC: ν = (μ/4 + μ/4 + μ/4)t = (3/4)μt), then p_{i→j}(ν) = (1/4)(1 + 3e^{−4ν/3}) if i = j, and p_{i→j}(ν) = (1/4)(1 − e^{−4ν/3}) if i ≠ j
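A minimal Python sketch of this transition probability in the ν parameterization (the function name is an illustrative choice, not from the slides):

```python
import math

def jc_transition_prob(i, j, nu):
    """Jukes-Cantor p_{i->j}(nu), nu in expected mutations per site."""
    if i == j:
        return 0.25 * (1 + 3 * math.exp(-4 * nu / 3))
    return 0.25 * (1 - math.exp(-4 * nu / 3))

# Sanity check: each row of the transition matrix sums to 1.
nu = 0.1
row = [jc_transition_prob("A", j, nu) for j in "ACGT"]
print(row, sum(row))
```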

The ML problem is NP-hard (that is, finding the MLE (T,λ) is very hard computationally) Heuristics involve searching the tree space, while computing the likelihood of trees Computing the likelihood of a leaf-labeled tree T with branch lengths can be done efficiently using dynamic programming

P(D | T, λ)

Let C_j(x, v) = P(subtree whose root is v | v_j = x)

Initialization: for a leaf v and state x:
C_j(x, v) = 1 if v_j = x, and 0 otherwise

Recursion: for a node v with children u and w:
C_j(x, v) = [ Σ_y C_j(y, u) p_{x→y}(t_{vu}) ] · [ Σ_y C_j(y, w) p_{x→y}(t_{vw}) ]

Termination: L = ∏_{j=1}^{m} Σ_x C_j(x, root) p(x)
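A minimal Python sketch of this dynamic program (Felsenstein's pruning algorithm) for a single site, assuming the Jukes-Cantor model above and uniform root frequencies; the tree encoding (a children dict with branch lengths) and all names are illustrative:

```python
import math

STATES = "ACGT"

def p_trans(x, y, t):
    # Jukes-Cantor p_{x->y}(t), as on the previous slides.
    if x == y:
        return 0.25 * (1 + 3 * math.exp(-4 * t / 3))
    return 0.25 * (1 - math.exp(-4 * t / 3))

def cond_lik(v, children, leaf_state):
    """C(x, v) for every state x at node v, computed bottom-up."""
    if v not in children:  # leaf: indicator on the observed state
        return {x: 1.0 if x == leaf_state[v] else 0.0 for x in STATES}
    C = {x: 1.0 for x in STATES}
    for child, t in children[v]:
        Cc = cond_lik(child, children, leaf_state)
        for x in STATES:
            C[x] *= sum(p_trans(x, y, t) * Cc[y] for y in STATES)
    return C

def site_likelihood(root, children, leaf_state):
    C = cond_lik(root, children, leaf_state)
    return sum(0.25 * C[x] for x in STATES)  # uniform root frequencies

# Hypothetical four-taxon tree ((a,b),(c,d)) with branch lengths:
children = {"r": [("u", 0.1), ("v", 0.1)],
            "u": [("a", 0.2), ("b", 0.2)],
            "v": [("c", 0.3), ("d", 0.3)]}
print(site_likelihood("r", children, {"a": "A", "b": "A", "c": "C", "d": "C"}))
```

Over m independent sites, the full likelihood is the product of the per-site values, matching the termination step above.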

Running Time Takes time O(nk 2 m), where n is the number of leaves in the tree, m is the number of sites, and k is the maximum number of states per site (for DNA, k=4)

Unidentifiability of the Root If the base substitution model is reversible (most of them are!), then rooting the same tree differently doesn't change the likelihood.

Questions?