Pairwise sequence alignment and pair hidden Markov models


Martin C. Frith
April 13, 2012

Introduction

Pairwise alignment and pair hidden Markov models (pHMMs) are basic textbook fare [2]. However, there are various slightly different algorithms and models that could be used. This document presents some variants that are different from, and maybe better than, the ones described by Durbin et al.

Definitions

We wish to align two sequences:

$R_1, \ldots, R_m$: 1st sequence (e.g. "reference"), of length $m$.
$Q_1, \ldots, Q_n$: 2nd sequence (e.g. "query"), of length $n$.

The classic approach is to define a scoring scheme, which assigns scores to aligned letters and gaps, and then find alignments with maximal score. This document considers the standard affine-gap scheme only.

$S(x, y)$: score for aligning reference base $x$ to query base $y$.
$a$: gap existence score.
$b$: gap extension score. (A gap of length $k$ scores $a + bk$.)

Note that $a$ and $b$ are negative.

Alternative dynamic programming algorithms

The standard way of finding the maximal alignment score is dynamic programming, which finds the optimal score for sequences of length $i$ and $j$ in terms of the optimal scores for shorter sequences ($i - 1$ and $j - 1$). This variant seems to be popular:

Algorithm A

$$ X_{i,j} = \max(X_{i-1,j-1},\ Y_{i-1,j-1},\ Z_{i-1,j-1}) + S(R_i, Q_j) \tag{1} $$
$$ Y_{i,j} = \max(X_{i-1,j} + a,\ Y_{i-1,j},\ Z_{i-1,j} + a) + b \tag{2} $$
$$ Z_{i,j} = \max(X_{i,j-1} + a,\ Y_{i,j-1} + a,\ Z_{i,j-1}) + b \tag{3} $$

Here, $X_{i,j}$ is the optimal alignment score up to $R_i$ and $Q_j$ ending with a match, $Y_{i,j}$ is the optimal score ending with a deletion, and $Z_{i,j}$ is the optimal score ending with an insertion. This algorithm is equivalent but more efficient (fewer CPU instructions):

Algorithm B

$$ Y_{i,j} = \max(W_{i-1,j} + a,\ Y_{i-1,j}) + b \tag{4} $$
$$ Z_{i,j} = \max(W_{i,j-1} + a,\ Z_{i,j-1}) + b \tag{5} $$
$$ W_{i,j} = \max(W_{i-1,j-1} + S(R_i, Q_j),\ Y_{i,j},\ Z_{i,j}) \tag{6} $$

Here, $W_{i,j}$ is the optimal alignment score ending with anything. Interestingly, this is the original algorithm described by Gotoh [3]. It can be made even more efficient by some reorganization [1].

Pair hidden Markov models

The Durbin et al. textbook describes some pHMMs, and demonstrates that finding the most probable path is equivalent to classic maximum-score alignment [2]. Figure 1 shows pHMMs that differ from those of Durbin et al. in some interesting ways:

- They allow insertions next to deletions.
- They allow insertions next to insertions, and deletions next to deletions. For example, a length-2 deletion next to a length-3 deletion.

This makes no difference to the Viterbi (maximum score) algorithm, because (e.g.) a length-5 deletion has a better score than a length-2 plus a length-3 deletion: it incurs the gap existence score $a$ only once. It does make a difference, however, to the Forward algorithm, which sums over all paths. The paths through these pHMMs are reflected in Algorithm B, rather than Algorithm A.

Score parameters in terms of model parameters

The Viterbi (maximum likelihood) algorithms for these pHMMs can be cast in the same form as maximum-score alignment, by using these formulas:

$$ S(x, y) = t \ln\left( \frac{\pi_{xy}}{\phi_x \psi_y} \cdot \frac{1 - 2\,[\ldots] - \tau}{(1 - [\ldots])^2} \right) \tag{7} $$
$$ a = t \ln\left( [\ldots] \right) \tag{8} $$
$$ b = t \ln\left( [\ldots] \right) \tag{9} $$
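The equivalence of Algorithms A and B stated earlier can be checked numerically. The sketch below is a minimal Python version of both recurrences for global alignment. The boundary condition (score 0 at the origin, every other uninitialized cell set to negative infinity) is an assumption, since the text gives initializations only for the local and semi-global cases. The function names and the toy +1/-1 substitution score are also illustrative.

```python
NEG_INF = float("-inf")

def simple_score(x, y, match=1, mismatch=-1):
    """Toy substitution score S(x, y); an illustrative assumption."""
    return match if x == y else mismatch

def new_table(m, n):
    """(m+1) x (n+1) table filled with negative infinity."""
    return [[NEG_INF] * (n + 1) for _ in range(m + 1)]

def algorithm_a(R, Q, a, b, S=simple_score):
    """Global alignment score via equations (1)-(3): tables X, Y, Z."""
    m, n = len(R), len(Q)
    X, Y, Z = new_table(m, n), new_table(m, n), new_table(m, n)
    X[0][0] = 0  # assumed origin: the empty alignment scores 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                best_prev = max(X[i-1][j-1], Y[i-1][j-1], Z[i-1][j-1])
                X[i][j] = best_prev + S(R[i-1], Q[j-1])
            if i > 0:  # deletion: consumes a reference base
                Y[i][j] = max(X[i-1][j] + a, Y[i-1][j], Z[i-1][j] + a) + b
            if j > 0:  # insertion: consumes a query base
                Z[i][j] = max(X[i][j-1] + a, Y[i][j-1] + a, Z[i][j-1]) + b
    return max(X[m][n], Y[m][n], Z[m][n])

def algorithm_b(R, Q, a, b, S=simple_score):
    """Global alignment score via equations (4)-(6): tables W, Y, Z."""
    m, n = len(R), len(Q)
    W, Y, Z = new_table(m, n), new_table(m, n), new_table(m, n)
    W[0][0] = 0  # assumed origin, as in algorithm_a
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            if i > 0:
                Y[i][j] = max(W[i-1][j] + a, Y[i-1][j]) + b
            if j > 0:
                Z[i][j] = max(W[i][j-1] + a, Z[i][j-1]) + b
            diag = NEG_INF
            if i > 0 and j > 0:
                diag = W[i-1][j-1] + S(R[i-1], Q[j-1])
            W[i][j] = max(diag, Y[i][j], Z[i][j])
    return W[m][n]
```

With a = -2 and b = -1, both functions return 0 for R = "ACGT" and Q = "AGT": three matches (+3) plus one length-1 gap (a + b = -3). Algorithm B does less work per cell because W already holds the maximum over the three ways an alignment can end, so the gap recurrences need two comparisons instead of three.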

In equations (7)-(9), $t$ is an arbitrary scale factor. (If we multiply all the score parameters by a constant factor, it makes no difference to the alignment.)

Local alignment

Initialization

$$ W_{0,0} = 0 \tag{10} $$
$$ W_{i,0} = 0, \quad Y_{i,0} = Z_{i,0} = -\infty \tag{11} $$
$$ W_{0,j} = 0, \quad Y_{0,j} = Z_{0,j} = -\infty \tag{12} $$

Recurrence

$$ X_{i,j} = W_{i-1,j-1} + S(R_i, Q_j) \tag{13} $$
$$ Y_{i,j} = \max(W_{i-1,j} + a,\ Y_{i-1,j}) + b \tag{14} $$
$$ Z_{i,j} = \max(W_{i,j-1} + a,\ Z_{i,j-1}) + b \tag{15} $$
$$ W_{i,j} = \max(X_{i,j},\ Y_{i,j},\ Z_{i,j},\ 0) \tag{16} $$

Termination

$$ \text{Optimal alignment score} = \max_{i,j}(W_{i,j}) \tag{17} $$

Semi-global (short-in-long) alignment

Initialization

$$ W_{0,0} = 0 \tag{18} $$
$$ W_{i,0} = 0, \quad Y_{i,0} = Z_{i,0} = -\infty \tag{19} $$
$$ W_{0,j} = Y_{0,j} = Z_{0,j} = -\infty \tag{20} $$

Recurrence

$$ X_{i,j} = W_{i-1,j-1} + S(R_i, Q_j) \tag{21} $$
$$ Y_{i,j} = \max(W_{i-1,j} + a,\ Y_{i-1,j}) + b \tag{22} $$
$$ Z_{i,j} = \max(W_{i,j-1} + a,\ Z_{i,j-1}) + b \tag{23} $$
$$ W_{i,j} = \max(X_{i,j},\ Y_{i,j},\ Z_{i,j}) \tag{24} $$

Termination

$$ \text{Optimal alignment score} = \max_{i}(W_{i,n}) \tag{25} $$

Figure 1: Pair hidden Markov models. A: Semi-global (short-in-long) model. B: Local model. C: Null model. States labeled M emit aligned bases $x : y$ with probability $\pi_{xy}$. States labeled D emit reference bases $x$ with probability $\phi_x$. States labeled I emit query bases $y$ with probability $\psi_y$.

References

[1] M. Cameron, H. E. Williams, and A. Cannane. Improved gapped alignment in BLAST. IEEE/ACM Trans Comput Biol Bioinform, 1(3):116-129, 2004.
[2] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
[3] O. Gotoh. An improved algorithm for matching biological sequences. J. Mol. Biol., 162(3):705-708, Dec 1982.
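The local and semi-global recurrences (equations 10-25) share the same inner loop and differ only in their boundary values, the extra 0 option in (16), and the termination rule. The following is a minimal Python sketch of both, assuming that every boundary value that is not 0 is negative infinity and using a toy +1/-1 substitution score. The function names are illustrative and do not come from the original document.

```python
NEG_INF = float("-inf")

def simple_score(x, y, match=1, mismatch=-1):
    """Toy substitution score S(x, y); an illustrative assumption."""
    return match if x == y else mismatch

def fill_w(R, Q, a, b, S, local):
    """Fill the W table; local=True uses (10)-(16), else (18)-(24)."""
    m, n = len(R), len(Q)
    W = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # boundaries stay -inf
    Z = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # boundaries stay -inf
    for i in range(m + 1):
        W[i][0] = 0  # an alignment may start anywhere in the reference
    if local:
        for j in range(n + 1):
            W[0][j] = 0  # local only: it may also start anywhere in the query
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            x = W[i-1][j-1] + S(R[i-1], Q[j-1])
            Y[i][j] = max(W[i-1][j] + a, Y[i-1][j]) + b
            Z[i][j] = max(W[i][j-1] + a, Z[i][j-1]) + b
            W[i][j] = max(x, Y[i][j], Z[i][j])
            if local:
                W[i][j] = max(W[i][j], 0)  # the extra 0 in equation (16)
    return W

def local_score(R, Q, a, b, S=simple_score):
    """Equation (17): best score over all cells."""
    W = fill_w(R, Q, a, b, S, local=True)
    return max(max(row) for row in W)

def semi_global_score(R, Q, a, b, S=simple_score):
    """Equation (25): best score over the last column (whole query used)."""
    W = fill_w(R, Q, a, b, S, local=False)
    return max(row[len(Q)] for row in W)
```

For example, with a = -2 and b = -1, aligning the query "GGGG" against the reference "AAAA" gives a local score of 0 (the empty alignment) but a semi-global score of -4, because the semi-global variant must account for every query base.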