# Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University

Size: px
Start display at page:

Download "Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University"

Transcription

1 Phylogenetics: Parsimony and Likelihood COMP Spring 2016 Luay Nakhleh, Rice University

2 The Problem Input: Multiple alignment of a set S of sequences Output: Tree T leaf-labeled with S

3 Assumptions Characters are mutually independent Following a speciation event, characters continue to evolve independently

4 In parsimony-based methods, the inferred tree is fully labeled.

5 ACCT GGAT ACGT GAAT

6 ACCT GGAT ACCT GAAT ACGT GAAT

7 A Simple Solution: Try All Trees Problem: (2n-3)!! rooted trees (2m-5)!! unrooted trees

8 A Simple Solution: Try All Trees Number of Taxa Number of unrooted trees Number of rooted trees E E E E E E E E E E E E E E+139

9 Solution Define an optimization criterion Find the tree (or, set of trees) that optimizes the criterion Two common criteria: parsimony and likelihood

10 Parsimony

11 The parsimony of a fully-labeled unrooted tree T, is the sum of lengths of all the edges in T Length of an edge is the Hamming distance between the sequences at its two endpoints PS(T)

12 ACCT GGAT ACCT GAAT ACGT GAAT

13 ACCT 0 1 GGAT 1 ACCT 3 GAAT 0 ACGT GAAT

14 ACCT 0 1 GGAT 1 ACCT 3 GAAT 0 ACGT GAAT Parsimony score = 5

15 Maximum Parsimony (MP) Input: a multiple alignment S of n sequences Output: tree T with n leaves, each leaf labeled by a unique sequence from S, internal nodes labeled by sequences, and PS(T) is minimized

16 AAC AGC TTC ATC

17 AAC TTC AGC ATC AAC AGC AAC TTC TTC ATC ATC AGC

18 AAC TTC AAC ATC AGC 3 ATC AAC AGC AAC TTC TTC ATC ATC AGC

19 AAC TTC AAC ATC AGC 3 ATC AAC AGC AAC TTC TTC ATC ATC ATC ATC 3 AGC

20 AAC TTC AAC ATC AGC 3 ATC AAC AGC AAC TTC TTC ATC 3 ATC ATC ATC ATC ATC 3 AGC

21 AAC AAC ATC TTC The three trees are equally good MP trees AGC 3 ATC AAC AGC AAC TTC TTC ATC 3 ATC ATC ATC ATC ATC 3 AGC

22 ACT GTT GTA ACA

23 ACT GTA GTT ACA ACT GTT ACT GTA GTA ACA ACA GTT

24 ACT GTA GTT GTA GTT 5 ACA ACT GTT ACT GTA GTA ACA ACA GTT

25 ACT GTA GTT GTA GTT 5 ACA ACT GTT ACT GTA GTA ACT 6 ACT ACA ACA GTT

26 ACT GTA GTT GTA GTT 5 ACA ACT GTT ACT GTA GTA ACT 6 ACT ACA ACA GTA ACA 4 GTT

27 ACT GTA GTT GTA GTT 5 ACA ACT GTT ACT GTA GTA ACT 6 ACT ACA ACA ACA 4 GTA GTT MP tree

28 Weighted Parsimony Each transition from one character state to another is given a weight Each character is given a weight See a tree that minimizes the weighted parsimony

29 Both the MP and weighted MP problems are NP-hard

30 A Heuristic For Solving the MP Problem Starting with a random tree T, move through the tree space while computing the parsimony of trees, and keeping those with optimal score (among the ones encountered) Usually, the search time is the stopping factor

31 Two Issues How do we move through the tree search space? Can we compute the parsimony of a given leaf-labeled tree efficiently?

32 Searching Through the Tree Space Use tree transformation operations (NNI, TBR, and SPR)

33 Searching Through the Tree Space Use tree transformation operations (NNI, TBR, and SPR) global maximum local maximum

34 Computing the Parsimony Length of a Given Tree Fitch s algorithm Computes the parsimony score of a given leaf-labeled rooted tree Polynomial time

35 Fitch s Algorithm Alphabet Σ Character c takes states from Σ vc denotes the state of character c at node v

36 Fitch s Algorithm Bottom-up phase: For each node v and each character c, compute the set Sc,v as follows: If v is a leaf, then Sc,v={vc} If v is an internal node whose two children are x and y, then S c,v = { Sc,x S c,y S c,x S c,y S c,x S c,y otherwise

37

38 Fitch s Algorithm Top-down phase: For the root r, let rc=a for some arbitrary a in the set Sc,r For internal node v whose parent is u, v c = { uc u c S c,v arbitrary α S c,v otherwise

39

40 T

41 T T

42 T T T T

43 T T T T T

44 T T T T T 3 mutations

45 Fitch s Algorithm Takes time O(nkm), where n is the number of leaves in the tree, m is the number of sites, and k is the maximum number of states per site (for DNA, k=4)

46 Informative Sites and Homoplasy Invariable sites: In the search for MP trees, sites that exhibit exactly one state for all taxa are eliminated from the analysis Only variable sites are used

47 Informative Sites and Homoplasy However, not all variable sites are useful for finding an MP tree topology Singleton sites: any nucleotide site at which only unique nucleotides (singletons) exist is not informative, because the nucleotide variation at the site can always be explained by the same number of substitutions in all topologies

48 C,T,G are three singleton substitutions non-informative site All trees have parsimony score 3

49 Informative Sites and Homoplasy For a site to be informative for constructing an MP tree, it must exhibit at least two different states, each represented in at least two taxa These sites are called informative sites For constructing MP trees, it is sufficient to consider only informative sites

50 Informative Sites and Homoplasy Because only informative sites contribute to finding MP trees, it is important to have many informative sites to obtain reliable MP trees However, when the extent of homoplasy (backward and parallel substitutions) is high, MP trees would not be reliable even if there are many informative sites available

51 Measuring the Extent of Homoplasy The consistency index (Kluge and Farris, 1969) for a single nucleotide site (i-th site) is given by c i =m i /s i, where m i is the minimum possible number of substitutions at the site for any conceivable topology (= one fewer than the number of different kinds of nucleotides at that site, assuming that one of the observed nucleotides is ancestral) s i is the minimum number of substitutions required for the topology under consideration

52 Measuring the Extent of Homoplasy The lower bound of the consistency index is not 0 The consistency index varies with the topology Therefore, Farris (1989) proposed two more quantities: the retention index and the rescaled consistency index

53 The Retention Index The retention index, ri, is given by (gi-si)/(gi-mi), where gi is the maximum possible number of substitutions at the i-th site for any conceivable tree under the parsimony criterion and is equal to the number of substitutions required for a star topology when the most frequent nucleotide is placed at the central node

54 The Retention Index The retention index becomes 0 when the site is least informative for MP tree construction, that is, si=gi

55 The Rescaled Consistency Index rc i = g i s i g i m i m i s i

56 Ensemble Indices The three values are often computed for all informative sites, and the ensemble or overall consistency index (CI), overall retention index (RI), and overall rescaled index (RC) for all sites are considered

57 Ensemble Indices RI = CI = i m i i s i i g i i s i i g i i m i RC = CI RI These indices should be computed only for informative sites, because for uninformative sites they are undefined

58 Homoplasy Index The homoplasy index is HI =1 CI When there are no backward or parallel substitutions, we have HI =0. In this case, the topology is uniquely determined

59 A Major Caveat Maximum parsimony is not statistically consistent!

60 Likelihood

61 The likelihood of model M given data D, denoted by L(M D), is p(d M). For example, consider the following data D that result from tossing a coin 10 times: HTTTTHTTTT

62 Model M1: A fair coin (p(h)=p(t)=0.5) L(M1 D)=p(D M1)=0.5 10

63 Model M2: A biased coin (p(h)=0.8,p(t)=0.2) L(M2 D)=p(D M2)=

64 Model M3: A biased coin (p(h)=0.1,p(t)=0.9) L(M3 D)=p(D M3)=

65 The problem of interest is to infer the model M from the (observed) data D.

66 The maximum likelihood estimate, or MLE, is: ˆM argmax M p(d M)

67 D=HTTTTHTTTT M1: p(h)=p(t)=0.5 M2: p(h)=0.8, p(t)=0.2 M3: p(h)=0.1, p(t)=0.9 MLE (among the three models) is M3.

68 A more complex example: The model M is an HMM The data D is a sequence of observations Baum-Welch is an algorithm for obtaining the MLE M from the data D

69 The model parameters that we seek to learn can vary for the same data and model. For example, in the case of HMMs: The parameters are the states, the transition and emission probabilities (no parameter values in the model are known) The parameters are the transition and emission probabilities (the states are known) The parameters are the transition probabilities (the states and emission probabilities are known)

70 Back to Phylogenetic Trees What are the data D? A multiple sequence alignment (or, a matrix of taxa/characters)

71 Back to Phylogenetic Trees What is the (generative) model M? The tree topology The branch lengths The model of evolution (JC,..)

72 Back to Phylogenetic Trees What is the (generative) model M? The tree topology, T The branch lengths, λ The model of evolution (JC,..), Ε

73 Back to Phylogenetic Trees The likelihood is p(d T,λ,Ε). The MLE is ( ˆT,ˆ, Ê) argmax (T,,E)p(D T,,E)

74 Back to Phylogenetic Trees In practice, the model of evolution is estimated from the data first, and in the phylogenetic inference it is assumed to be known. In this case, given D and E, the MLE is ( ˆT,ˆ) argmax (T, ) p(d T, )

75 Assumptions Characters are independent Markov process: probability of a node having a given label depends only on the label of the parent node and branch length between them t

76 Maximum Likelihood Input: a matrix D of taxa-characters Output: tree T leaf-labeled by the set of taxa, and with branch lengths λ so as to maximize the likelihood P(D T,λ)

77 P(D T,λ) P (D T, ) = Q site j p(d j T, ) = Q site j (P R p(d j,r T, )) = Q P site j R hp(root) Q i edge u!v p u!v(t uv )

78 What is pi j(tuv) for a branch u v in the tree, where i and j are the states of the site at nodes u and v, respectively?

79 For the Jukes-Cantor model with the parameter μ (the overall substitution rate), we have p i!j (t) = 1 4 (1 + 3e tµ ) i = j 1 4 (1 e tµ ) i 6= j

80 If branch lengths are measured in expected number of mutations per site, ν (for JC: ν=(μ/4+μ/4+μ/4)t=(3/4)μt) p i!j ( ) = 1 4 (1 + 3e 4 /3 ) i = j 1 4 (1 e 4 /3 ) i 6= j

81 The ML problem is NP-hard (that is, finding the MLE (T,λ) is very hard computationally) Heuristics involve searching the tree space, while computing the likelihood of trees Computing the likelihood of a leaf-labeled tree T with branch lengths can be done efficiently using dynamic programming

82 P(D T,λ) Let Cj(x,v) = P(subtree whose root is v vj=x) Initialization: leaf v and state x C j (x, v) = Recursion: node v with children u,w C j (x, v) = { 1 vj = x 0 otherwise [ ] [ ] C j (y, u) P x y (t vu ) C j (y, w) P x y (t vw ) y y Te rmin at ion: L = [ ] m C j (x, root) P(x) j=1 x

83 Running Time Takes time O(nk 2 m), where n is the number of leaves in the tree, m is the number of sites, and k is the maximum number of states per site (for DNA, k=4)

84 Unidentifiability of the Root If the base substitution model is reversible (most of them are!), then rooting the same tree differently doesn t change the likelihood.

85 Questions?

### Phylogenetics: Likelihood

1 Phylogenetics: Likelihood COMP 571 Luay Nakhleh, Rice University The Problem 2 Input: Multiple alignment of a set S of sequences Output: Tree T leaf-labeled with S Assumptions 3 Characters are mutually

### Phylogenetics: Parsimony

1 Phylogenetics: Parsimony COMP 571 Luay Nakhleh, Rice University he Problem 2 Input: Multiple alignment of a set S of sequences Output: ree leaf-labeled with S Assumptions Characters are mutually independent

### POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

### CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

CISC 889 Bioinformatics (Spring 24) Hidden Markov Models (II) a. Likelihood: forward algorithm b. Decoding: Viterbi algorithm c. Model building: Baum-Welch algorithm Viterbi training Hidden Markov models

### Reconstruire le passé biologique modèles, méthodes, performances, limites

Reconstruire le passé biologique modèles, méthodes, performances, limites Olivier Gascuel Centre de Bioinformatique, Biostatistique et Biologie Intégrative C3BI USR 3756 Institut Pasteur & CNRS Reconstruire

### Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

### Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Estimating Phylogenies (Evolutionary Trees) II Biol4230 Thurs, March 2, 2017 Bill Pearson wrp@virginia.edu 4-2818 Jordan 6-057 Tree estimation strategies: Parsimony?no model, simply count minimum number

### Phylogenetics: Building Phylogenetic Trees

1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

### Using algebraic geometry for phylogenetic reconstruction

Using algebraic geometry for phylogenetic reconstruction Marta Casanellas i Rius (joint work with Jesús Fernández-Sánchez) Departament de Matemàtica Aplicada I Universitat Politècnica de Catalunya IMA

### Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

### Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Distance Methods COMP 571 - Spring 2015 Luay Nakhleh, Rice University Outline Evolutionary models and distance corrections Distance-based methods Evolutionary Models and Distance Correction

### Markov Chains and Hidden Markov Models. COMP 571 Luay Nakhleh, Rice University

Markov Chains and Hidden Markov Models COMP 571 Luay Nakhleh, Rice University Markov Chains and Hidden Markov Models Modeling the statistical properties of biological sequences and distinguishing regions

### Lecture 9. Intro to Hidden Markov Models (finish up)

Lecture 9 Intro to Hidden Markov Models (finish up) Review Structure Number of states Q 1.. Q N M output symbols Parameters: Transition probability matrix a ij Emission probabilities b i (a), which is

### Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

### An Introduction to Bioinformatics Algorithms Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training

### Evolutionary Tree Analysis. Overview

CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based

### Molecular Evolution & Phylogenetics

Molecular Evolution & Phylogenetics Heuristics based on tree alterations, maximum likelihood, Bayesian methods, statistical confidence measures Jean-Baka Domelevo Entfellner Learning Objectives know basic

### Dr. Amira A. AL-Hosary

Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

### Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

### Phylogenetic Tree Reconstruction

I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

### An Introduction to Bioinformatics Algorithms Hidden Markov Models

Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

### Evolutionary Models. Evolutionary Models

Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

### Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

### Hidden Markov Models. Three classic HMM problems

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Hidden Markov Models Slides revised and adapted to Computational Biology IST 2015/2016 Ana Teresa Freitas Three classic HMM problems

### Hidden Markov Models

Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

### Hidden Markov Models

Hidden Markov Models Slides revised and adapted to Bioinformática 55 Engª Biomédica/IST 2005 Ana Teresa Freitas Forward Algorithm For Markov chains we calculate the probability of a sequence, P(x) How

### HIDDEN MARKOV MODELS

HIDDEN MARKOV MODELS Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training Baum-Welch algorithm

### (Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

Bot 421/521 PHYLOGENETIC ANALYSIS I. Origins A. Hennig 1950 (German edition) Phylogenetic Systematics 1966 B. Zimmerman (Germany, 1930 s) C. Wagner (Michigan, 1920-2000) II. Characters and character states

### Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Maximum Likelihood Tree Estimation Carrie Tribble IB 200 9 Feb 2018 Outline 1. Tree building process under maximum likelihood 2. Key differences between maximum likelihood and parsimony 3. Some fancy extras

### Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

Hidden Lecture 4: Hidden : An Introduction to Dynamic Decision Making November 11, 2010 Special Meeting 1/26 Markov Model Hidden When a dynamical system is probabilistic it may be determined by the transition

### Consistency Index (CI)

Consistency Index (CI) minimum number of changes divided by the number required on the tree. CI=1 if there is no homoplasy negatively correlated with the number of species sampled Retention Index (RI)

### Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from:

Hidden Markov Models Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from: www.ioalgorithms.info Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm

### Pinvar approach. Remarks: invariable sites (evolve at relative rate 0) variable sites (evolves at relative rate r)

Pinvar approach Unlike the site-specific rates approach, this approach does not require you to assign sites to rate categories Assumes there are only two classes of sites: invariable sites (evolve at relative

### Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

7.91 Lecture #5 Database Searching & Molecular Phylogenetics Michael Yaffe B C D B C D (((,B)C)D) Outline Distance Matrix Methods Neighbor-Joining Method and Related Neighbor Methods Maximum Likelihood

### BMI/CS 776 Lecture 4. Colin Dewey

BMI/CS 776 Lecture 4 Colin Dewey 2007.02.01 Outline Common nucleotide substitution models Directed graphical models Ancestral sequence inference Poisson process continuous Markov process X t0 X t1 X t2

### Probabilistic modeling and molecular phylogeny

Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical University of Denmark (DTU) What is a model? Mathematical

### Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

### Questions we can ask. Recall. Accuracy and Precision. Systematics - Bio 615. Outline

Outline 1. Mechanistic comparison with Parsimony - branch lengths & parameters 2. Performance comparison with Parsimony - Desirable attributes of a method - The Felsenstein and Farris zones - Heterotachous

### TheDisk-Covering MethodforTree Reconstruction

TheDisk-Covering MethodforTree Reconstruction Daniel Huson PACM, Princeton University Bonn, 1998 1 Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document

### EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

### Hidden Markov Models (I)

GLOBEX Bioinformatics (Summer 2015) Hidden Markov Models (I) a. The model b. The decoding: Viterbi algorithm Hidden Markov models A Markov chain of states At each state, there are a set of possible observables

### CSCE 471/871 Lecture 3: Markov Chains and

and and 1 / 26 sscott@cse.unl.edu 2 / 26 Outline and chains models (s) Formal definition Finding most probable state path (Viterbi algorithm) Forward and backward algorithms State sequence known State

### Is the equal branch length model a parsimony model?

Table 1: n approximation of the probability of data patterns on the tree shown in figure?? made by dropping terms that do not have the minimal exponent for p. Terms that were dropped are shown in red;

### Multiple Sequence Alignment using Profile HMM

Multiple Sequence Alignment using Profile HMM. based on Chapter 5 and Section 6.5 from Biological Sequence Analysis by R. Durbin et al., 1998 Acknowledgements: M.Sc. students Beatrice Miron, Oana Răţoi,

### O 3 O 4 O 5. q 3. q 4. Transition

Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

### arxiv: v5 [q-bio.pe] 24 Oct 2016

On the Quirks of Maximum Parsimony and Likelihood on Phylogenetic Networks Christopher Bryant a, Mareike Fischer b, Simone Linz c, Charles Semple d arxiv:1505.06898v5 [q-bio.pe] 24 Oct 2016 a Statistics

### Stephen Scott.

1 / 27 sscott@cse.unl.edu 2 / 27 Useful for modeling/making predictions on sequential data E.g., biological sequences, text, series of sounds/spoken words Will return to graphical models that are generative

### CSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9:

Useful for modeling/making predictions on sequential data E.g., biological sequences, text, series of sounds/spoken words Will return to graphical models that are generative sscott@cse.unl.edu 1 / 27 2

### Hidden Markov Models

Hidden Markov Models Slides revised and adapted to Bioinformática 55 Engª Biomédica/IST 2005 Ana Teresa Freitas CG-Islands Given 4 nucleotides: probability of occurrence is ~ 1/4. Thus, probability of

### Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing

Hidden Markov Models By Parisa Abedi Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed data Sequential (non i.i.d.) data Time-series data E.g. Speech

### Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

ioinformatics -- lecture 9 Phylogenetic trees istance-based tree building Parsimony (,(,(,))) rees can be represented in "parenthesis notation". Each set of parentheses represents a branch-point (bifurcation),

### HMM: Parameter Estimation

I529: Machine Learning in Bioinformatics (Spring 2017) HMM: Parameter Estimation Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2017 Content Review HMM: three problems

### A Phylogenetic Network Construction due to Constrained Recombination

A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer

### The Generalized Neighbor Joining method

The Generalized Neighbor Joining method Ruriko Yoshida Dept. of Mathematics Duke University Joint work with Dan Levy and Lior Pachter www.math.duke.edu/ ruriko data mining 1 Challenge We would like to

### Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Today s topics Inferring phylogeny Introduction! Distance methods! Parsimony method!"#\$%&'(!)* +,-.'/01!23454(6!7!2845*0&4'9#6!:&454(6 ;?@AB=C?DEF Overview of phylogenetic inferences Methodology Methods

### arxiv: v1 [q-bio.pe] 1 Jun 2014

THE MOST PARSIMONIOUS TREE FOR RANDOM DATA MAREIKE FISCHER, MICHELLE GALLA, LINA HERBST AND MIKE STEEL arxiv:46.27v [q-bio.pe] Jun 24 Abstract. Applying a method to reconstruct a phylogenetic tree from

### Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic Analysis Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Outline Basic Concepts Tree Construction Methods Distance-based methods

### Hidden Markov Models 1

Hidden Markov Models Dinucleotide Frequency Consider all 2-mers in a sequence {AA,AC,AG,AT,CA,CC,CG,CT,GA,GC,GG,GT,TA,TC,TG,TT} Given 4 nucleotides: each with a probability of occurrence of. 4 Thus, one

### Phylogeny of Mixture Models

Phylogeny of Mixture Models Daniel Štefankovič Department of Computer Science University of Rochester joint work with Eric Vigoda College of Computing Georgia Institute of Technology Outline Introduction

### Hidden Markov Models

Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training Baum-Welch algorithm

### Consensus methods. Strict consensus methods

Consensus methods A consensus tree is a summary of the agreement among a set of fundamental trees There are many consensus methods that differ in: 1. the kind of agreement 2. the level of agreement Consensus

### Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Hidden Markov Models Barnabás Póczos & Aarti Singh Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed

### Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

Inferring Phylogenetic Trees Distance Approaches Representing distances in rooted and unrooted trees The distance approach to phylogenies given: an n n matrix M where M ij is the distance between taxa

### 9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

### Properties of normal phylogenetic networks

Properties of normal phylogenetic networks Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA swillson@iastate.edu August 13, 2009 Abstract. A phylogenetic network is

### Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

### 66 Bioinformatics I, WS 09-10, D. Huson, December 1, Evolutionary tree of organisms, Ernst Haeckel, 1866

66 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 5 Phylogeny Evolutionary tree of organisms, Ernst Haeckel, 1866 5.1 References J. Felsenstein, Inferring Phylogenies, Sinauer, 2004. C. Semple

### Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1 Biology, Sequences, Phylogenetics Part 4 Sepp Hochreiter Klausur Mo. 30.01.2011 Zeit: 15:30 17:00 Raum: HS14 Anmeldung Kusss Contents Methods and Bootstrapping of Maximum Methods Methods

### Hidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208

Hidden Markov Model Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/19 Outline Example: Hidden Coin Tossing Hidden

### Theory of Evolution Charles Darwin

Theory of Evolution Charles arwin 858-59: Origin of Species 5 year voyage of H.M.S. eagle (83-36) Populations have variations. Natural Selection & Survival of the fittest: nature selects best adapted varieties

### Hidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010

Hidden Markov Models Aarti Singh Slides courtesy: Eric Xing Machine Learning 10-701/15-781 Nov 8, 2010 i.i.d to sequential data So far we assumed independent, identically distributed data Sequential data

### 6.864: Lecture 5 (September 22nd, 2005) The EM Algorithm

6.864: Lecture 5 (September 22nd, 2005) The EM Algorithm Overview The EM algorithm in general form The EM algorithm for hidden markov models (brute force) The EM algorithm for hidden markov models (dynamic

### Hidden Markov Models. x 1 x 2 x 3 x K

Hidden Markov Models 1 1 1 1 2 2 2 2 K K K K x 1 x 2 x 3 x K Viterbi, Forward, Backward VITERBI FORWARD BACKWARD Initialization: V 0 (0) = 1 V k (0) = 0, for all k > 0 Initialization: f 0 (0) = 1 f k (0)

### Phylogenetics. BIOL 7711 Computational Bioscience

Consortium for Comparative Genomics! University of Colorado School of Medicine Phylogenetics BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience Program Consortium

### Intraspecific gene genealogies: trees grafting into networks

Intraspecific gene genealogies: trees grafting into networks by David Posada & Keith A. Crandall Kessy Abarenkov Tartu, 2004 Article describes: Population genetics principles Intraspecific genetic variation

### Algorithms in Bioinformatics

Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Distance Methods Character Methods

### Phylogenetic invariants versus classical phylogenetics

Phylogenetic invariants versus classical phylogenetics Marta Casanellas Rius (joint work with Jesús Fernández-Sánchez) Departament de Matemàtica Aplicada I Universitat Politècnica de Catalunya Algebraic

### C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

### Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley K.W. Will Parsimony & Likelihood [draft] 1. Hennig and Parsimony: Hennig was not concerned with parsimony

### Molecular Evolution, course # Final Exam, May 3, 2006

Molecular Evolution, course #27615 Final Exam, May 3, 2006 This exam includes a total of 12 problems on 7 pages (including this cover page). The maximum number of points obtainable is 150, and at least

### What is Phylogenetics

What is Phylogenetics Phylogenetics is the area of research concerned with finding the genetic connections and relationships between species. The basic idea is to compare specific characters (features)

### Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

### Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Page Evolutionary Trees Russ. ltman MI S 7 Outline. Why build evolutionary trees?. istance-based vs. character-based methods. istance-based: Ultrametric Trees dditive Trees. haracter-based: Perfect phylogeny

### A Minimum Spanning Tree Framework for Inferring Phylogenies

A Minimum Spanning Tree Framework for Inferring Phylogenies Daniel Giannico Adkins Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-157

### Inference in Graphical Models Variable Elimination and Message Passing Algorithm

Inference in Graphical Models Variable Elimination and Message Passing lgorithm Le Song Machine Learning II: dvanced Topics SE 8803ML, Spring 2012 onditional Independence ssumptions Local Markov ssumption

### Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of

### A Fitness Distance Correlation Measure for Evolutionary Trees

A Fitness Distance Correlation Measure for Evolutionary Trees Hyun Jung Park 1, and Tiffani L. Williams 2 1 Department of Computer Science, Rice University hp6@cs.rice.edu 2 Department of Computer Science

### 11.3 Decoding Algorithm

11.3 Decoding Algorithm 393 For convenience, we have introduced π 0 and π n+1 as the fictitious initial and terminal states begin and end. This model defines the probability P(x π) for a given sequence

### Thanks to Paul Lewis and Joe Felsenstein for the use of slides

Thanks to Paul Lewis and Joe Felsenstein for the use of slides Review Hennigian logic reconstructs the tree if we know polarity of characters and there is no homoplasy UPGMA infers a tree from a distance

### HMMs and biological sequence analysis

HMMs and biological sequence analysis Hidden Markov Model A Markov chain is a sequence of random variables X 1, X 2, X 3,... That has the property that the value of the current state depends only on the

### CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

CSCI1950 Z Computa4onal Methods for Biology Lecture 4 Ben Raphael February 2, 2009 hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary Parsimony Probabilis4c Method Input Output Sankoff s & Fitch

### Phylogeny: building the tree of life

Phylogeny: building the tree of life Dr. Fayyaz ul Amir Afsar Minhas Department of Computer and Information Sciences Pakistan Institute of Engineering & Applied Sciences PO Nilore, Islamabad, Pakistan

### 11 The Max-Product Algorithm

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms for Inference Fall 2014 11 The Max-Product Algorithm In the previous lecture, we introduced

### Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Maximum Likelihood This presentation is based almost entirely on Peter G. Fosters - "The Idiot s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed. http://www.bioinf.org/molsys/data/idiots.pdf

### Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetic Trees What They Are Why We Do It & How To Do It Presented by Amy Harris Dr Brad Morantz Overview What is a phylogenetic tree Why do we do it How do we do it Methods and programs Parallels

### 1. Can we use the CFN model for morphological traits?

1. Can we use the CFN model for morphological traits? 2. Can we use something like the GTR model for morphological traits? 3. Stochastic Dollo. 4. Continuous characters. Mk models k-state variants of the