Pairwise alignment using HMMs


Pairwise alignment using HMMs

The states of an HMM fulfill the Markov property: the probability of a transition depends only on the current state. As in the CpG-island and casino examples, an HMM emits a sequence of symbols (nucleotides or die rolls). We only observe the emitted sequence; the generating state path is hidden. This leads to inference problems, e.g. estimating the most probable generating path (Viterbi algorithm). Knowing the path allows us to analyze the internal structure of the string (localizing CpG islands, deciding whether the die was fair, ...).

Pair HMMs for string alignment

HMMs can be used for sequence alignment if the emission is not a single string but a pair of aligned strings: pair HMMs. From an FSA to a pair HMM:
Define emission probabilities for the states: the match state emits an aligned pair of symbols $x_i : y_j$ with probability $p_{x_i y_j}$; the insert/delete state X emits a symbol $x_i$ from string x against a gap with probability $q_{x_i}$ (and symmetrically for state Y).
Define transition probabilities between the states; the probabilities of all transitions leaving a state must sum to one.
Define Begin and End states to meet the initialization and termination conditions of the dynamic programming algorithms.
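
To make these ingredients concrete, here is a minimal sketch in Python of a parameter container that also checks the constraints just listed. The class name PairHMM and the field names delta, epsilon, tau are illustrative assumptions about how one might organize the parameters, not code from the slides.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class PairHMM:
    """Illustrative parameter set for a global-alignment pair HMM.

    p[(a, b)]: emission probability of the aligned pair a:b in the match state M
    q[a]:      emission probability of symbol a against a gap in state X or Y
    delta:     gap-open transition M -> X and M -> Y
    epsilon:   gap-extend self-transition X -> X and Y -> Y
    tau:       transition into the End state
    """
    p: Dict[Tuple[str, str], float]
    q: Dict[str, float]
    delta: float
    epsilon: float
    tau: float

    def validate(self, tol: float = 1e-9) -> None:
        # The remaining transition mass 1 - 2*delta - tau (M -> M) and
        # 1 - epsilon - tau (X/Y -> M) must be nonnegative, so that the
        # probabilities leaving each state sum to one.
        assert 1 - 2 * self.delta - self.tau >= 0
        assert 1 - self.epsilon - self.tau >= 0
        # The emission distributions must each sum to one.
        assert abs(sum(self.p.values()) - 1) < tol
        assert abs(sum(self.q.values()) - 1) < tol
```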

FSAs and Pair HMMs [figure slide]

A complete Pair HMM [figure slide]

Pair HMMs: Viterbi algorithm

Initialization: $v^M(0,0) = 1$, $v^M(i,0) = v^M(0,j) = 0$; initialize the boundary values $v^{X/Y}(0,j)$ and $v^{X/Y}(i,0)$ accordingly.

Recurrence, for $i = 1,\dots,n$ and $j = 1,\dots,m$:
$$
v^M(i,j) = p_{x_i y_j}\,\max\begin{cases}(1-2\delta-\tau)\,v^M(i-1,j-1),\\(1-\epsilon-\tau)\,v^X(i-1,j-1),\\(1-\epsilon-\tau)\,v^Y(i-1,j-1);\end{cases}
$$
$$
v^X(i,j) = q_{x_i}\,\max\begin{cases}\delta\,v^M(i-1,j),\\\epsilon\,v^X(i-1,j);\end{cases}\qquad
v^Y(i,j) = q_{y_j}\,\max\begin{cases}\delta\,v^M(i,j-1),\\\epsilon\,v^Y(i,j-1).\end{cases}
$$

[Diagram: transitions into M with probabilities $1-2\delta-\tau$ (from M) and $1-\epsilon-\tau$ (from X, Y); transitions into X with probabilities $\delta$ (from M) and $\epsilon$ (from X), and symmetrically for Y.]

Pair HMMs: Viterbi algorithm (cont'd)

Termination:
$$v^E = \tau\,\max\left[\,v^M(n,m),\; v^X(n,m),\; v^Y(n,m)\,\right].$$

[Diagram: transitions from M, X and Y into the End state E, each with probability $\tau$.]

Traceback: we keep traceback pointers as usual and reconstruct the whole alignment from the pointers.
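
The recursion, termination and traceback above translate directly into a dynamic-programming routine. Below is a sketch in Python, working in log space to avoid underflow; the function name viterbi_pair_hmm, the log-space formulation, and the treatment of the boundary cells (only $v^M(0,0)$ is nonzero, X and Y boundary cells are filled by the recursion so that leading gaps are possible) are assumptions in the spirit of Durbin et al., not a prescribed implementation.

```python
import math
from itertools import product

NEG_INF = float("-inf")

def safe_log(x):
    """log(x), with log(0) mapped to -infinity."""
    return math.log(x) if x > 0 else NEG_INF

def viterbi_pair_hmm(x, y, p, q, delta, epsilon, tau):
    """Most probable state path of the global pair HMM.

    x, y : the two sequences (strings)
    p, q : emission probabilities, p[(a, b)] for M and q[a] for X/Y
    Returns (log probability of the best path, list of states 'M'/'X'/'Y').
    """
    n, m = len(x), len(y)
    # v[s][i][j] = best log probability of emitting x[:i], y[:j] and ending in state s
    v = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in "MXY"}
    ptr = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in "MXY"}
    v["M"][0][0] = 0.0                       # v^M(0,0) = 1

    for i, j in product(range(n + 1), range(m + 1)):
        if i == 0 and j == 0:
            continue
        if i > 0 and j > 0:                  # M consumes one symbol from each sequence
            cands = [(safe_log(1 - 2 * delta - tau) + v["M"][i - 1][j - 1], "M"),
                     (safe_log(1 - epsilon - tau) + v["X"][i - 1][j - 1], "X"),
                     (safe_log(1 - epsilon - tau) + v["Y"][i - 1][j - 1], "Y")]
            best, ptr["M"][i][j] = max(cands)
            v["M"][i][j] = safe_log(p[x[i - 1], y[j - 1]]) + best
        if i > 0:                            # X emits x_i against a gap
            cands = [(safe_log(delta) + v["M"][i - 1][j], "M"),
                     (safe_log(epsilon) + v["X"][i - 1][j], "X")]
            best, ptr["X"][i][j] = max(cands)
            v["X"][i][j] = safe_log(q[x[i - 1]]) + best
        if j > 0:                            # Y emits y_j against a gap
            cands = [(safe_log(delta) + v["M"][i][j - 1], "M"),
                     (safe_log(epsilon) + v["Y"][i][j - 1], "Y")]
            best, ptr["Y"][i][j] = max(cands)
            v["Y"][i][j] = safe_log(q[y[j - 1]]) + best

    # Termination: v^E = tau * max(v^M(n,m), v^X(n,m), v^Y(n,m))
    best_end, state = max((v[s][n][m], s) for s in "MXY")
    log_p = safe_log(tau) + best_end

    # Traceback: follow the stored pointers to recover the state path
    path, i, j = [], n, m
    while not (i == 0 and j == 0):
        path.append(state)
        prev = ptr[state][i][j]
        if state == "M":
            i, j = i - 1, j - 1
        elif state == "X":
            i -= 1
        else:
            j -= 1
        state = prev
    return log_p, path[::-1]
```

Exponentiating the returned value gives the path probability $p(x,y,\Pi^*\mid M)$, the quantity that appears in the match-hypothesis example later in this section.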

Pair HMMs and FSA alignment (cont'd)

Theorem 1. The most probable path through the pair HMM for global alignment gives the optimal alignment associated with the substitution matrix
$$s(x_i, y_j) = \log\frac{p_{x_i y_j}}{q_{x_i} q_{y_j}} + \log\frac{1-2\delta-\tau}{(1-\eta)^2}$$
and the affine gap penalty $\gamma(g) = -d - (g-1)e$, with
$$d = -\log\frac{\delta\,(1-\epsilon-\tau)}{(1-\eta)(1-2\delta-\tau)}, \qquad e = -\log\frac{\epsilon}{1-\eta}.$$

Proof: exercises.
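
As a sketch of how this correspondence might be used in practice, the following Python function (the name fsa_scores is illustrative) computes the substitution scores and affine gap parameters implied by a given set of pair-HMM parameters, directly from the formulas of Theorem 1. Natural logarithms are assumed, so the scores come out in nats rather than bits.

```python
import math

def fsa_scores(p, q, delta, epsilon, tau, eta):
    """Substitution scores and affine gap penalties implied by a pair HMM.

    Follows the relations of Theorem 1:
      s(a, b) = log( p_ab / (q_a * q_b) ) + log( (1 - 2*delta - tau) / (1 - eta)^2 )
      d       = -log( delta * (1 - epsilon - tau) / ((1 - eta) * (1 - 2*delta - tau)) )
      e       = -log( epsilon / (1 - eta) )
    """
    const = math.log((1 - 2 * delta - tau) / (1 - eta) ** 2)
    s = {(a, b): math.log(p_ab / (q[a] * q[b])) + const
         for (a, b), p_ab in p.items()}
    d = -math.log(delta * (1 - epsilon - tau) / ((1 - eta) * (1 - 2 * delta - tau)))
    e = -math.log(epsilon / (1 - eta))
    return s, d, e
```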

Example: the match hypothesis

Example alignment:

X: x1 x2 -  -  x3 x4 x5 x6 x7
Y: y1 y2 y3 y4 y5 y6 -  -  -

The FSA model:
$$\mathrm{score}(x,y) = \log\frac{p(x,y\mid M)}{p(x,y\mid R)} = s(x_1,y_1) + s(x_2,y_2) - d - e + s(x_3,y_5) + s(x_4,y_6) - d - e - e.$$

Example (cont'd)

The pair HMM model. Define $a := 1-2\delta-\tau$ and $b := 1-\epsilon-\tau$. State path and alignment:

Π: B  M  M  Y  Y  M  M  X  X  X  E
X:    x1 x2 -  -  x3 x4 x5 x6 x7
Y:    y1 y2 y3 y4 y5 y6 -  -  -

$$P = a\,p_{x_1 y_1}\cdot a\,p_{x_2 y_2}\cdot \delta q_{y_3}\cdot \epsilon q_{y_4}\cdot b\,p_{x_3 y_5}\cdot a\,p_{x_4 y_6}\cdot \delta q_{x_5}\cdot \epsilon q_{x_6}\cdot \epsilon q_{x_7}\cdot \tau$$

Given the path Π, the probability of the pair of sequences (x, y) under the match hypothesis is
$$p(x,y,\Pi\mid M) = (1-2\delta-\tau)\,p_{x_1 y_1}\,(1-2\delta-\tau)\,p_{x_2 y_2}\,\delta q_{y_3}\,\epsilon q_{y_4}\,(1-\epsilon-\tau)\,p_{x_3 y_5}\,(1-2\delta-\tau)\,p_{x_4 y_6}\,\delta q_{x_5}\,\epsilon q_{x_6}\,\epsilon q_{x_7}\,\tau.$$
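
The product on this slide is an instance of a general rule: multiply one transition term and one emission term per step along the path. A small sketch (function name and argument layout assumed) that evaluates $p(x,y,\Pi\mid M)$ for an arbitrary state path:

```python
def path_probability(path, x, y, p, q, delta, epsilon, tau):
    """Joint probability p(x, y, path | M) of two sequences and a state path.

    path is a sequence of states from {'M', 'X', 'Y'}; Begin and End are implicit.
    The Begin state uses the same outgoing transition probabilities as M.
    """
    prob, prev, i, j = 1.0, "M", 0, 0
    for state in path:
        if state == "M":
            trans = (1 - 2 * delta - tau) if prev == "M" else (1 - epsilon - tau)
            prob *= trans * p[x[i], y[j]]        # emit the aligned pair x_{i+1} : y_{j+1}
            i, j = i + 1, j + 1
        elif state == "X":
            assert prev in "MX", "direct Y -> X transitions are not part of this model"
            prob *= (delta if prev == "M" else epsilon) * q[x[i]]   # emit x_{i+1} against a gap
            i += 1
        else:  # state == "Y"
            assert prev in "MY", "direct X -> Y transitions are not part of this model"
            prob *= (delta if prev == "M" else epsilon) * q[y[j]]   # emit y_{j+1} against a gap
            j += 1
        prev = state
    return prob * tau                             # final transition into the End state
```

Applied to the path above (passed as the list ['M', 'M', 'Y', 'Y', 'M', 'M', 'X', 'X', 'X']), this reproduces the product shown on the slide.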

A random length-independent site model

...written as a pair HMM: there is no match state; the states X and Y emit the two sequences in turn, independently of each other.

[Diagram: Begin state B, emitting states X and Y (emitting symbols $x_i$ and $y_j$) connected by a silent transitional state, and End state E; each emitting state loops with probability $1-\eta$ and is left with probability $\eta$.]

Example (cont'd)

Π: B X X X X X X X Y Y Y Y Y Y E
X:   x1 x2, ..., x7
Y:                  y1 y2, ..., y6

$$P = (1-\eta)\,q_{x_1}\prod_{i=2}^{7}(1-\eta)\,q_{x_i}\cdot \eta\,(1-\eta)\,q_{y_1}\prod_{j=2}^{6}(1-\eta)\,q_{y_j}\cdot \eta$$

The probability of the pair of sequences (x, y) under the random hypothesis is
$$p(x,y\mid R) = \eta^2 \prod_{i=1}^{7}(1-\eta)\,q_{x_i}\prod_{j=1}^{6}(1-\eta)\,q_{y_j}.$$
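
The same random-model probability can be computed directly; a one-function sketch (the name random_model_probability is assumed):

```python
def random_model_probability(x, y, q, eta):
    """p(x, y | R): both sequences emitted independently under the random model.

    Each emitted symbol contributes (1 - eta) * q[symbol]; the two factors of eta
    are the transitions that end each of the two emission runs.
    """
    prob = eta ** 2
    for a in x:
        prob *= (1 - eta) * q[a]
    for b in y:
        prob *= (1 - eta) * q[b]
    return prob
```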

A pair HMM for local alignment

The global model (states M, X, Y) is flanked by two copies of the random model, which allows an arbitrary start and stop of the alignment. Note that the sequences in the flanking regions are unaligned, i.e. generated by the random model.

The full probability of two aligned sequences

If the similarity of two sequences is weak, it is hard to find the correct alignment. HMMs allow us to calculate the probability that the two sequences are related by any alignment:
$$P(x,y) = \sum_{\text{alignments }\Pi} P(x,y,\Pi).$$
$P(x,y)$ is always at least as large as the Viterbi probability $P(x,y,\Pi^*)$, and can be significantly different when there are many comparable alternative alignments.

The full probability (cont'd)

A more realistic score is the likelihood that the two sequences are related by some unspecified alignment, as opposed to being unrelated:
$$\mathrm{score}(x,y) = \frac{P(x,y\mid \text{match hypothesis})}{P(x,y\mid \text{random hypothesis})} = \frac{\sum_\Pi P(x,y,\Pi)}{q_x\,q_y},$$
where $q_x\,q_y$ is shorthand for the random-model probability of the two sequences.

The full probability: forward algorithm

$$f^M(i,j) = p_{x_i y_j}\left[(1-2\delta-\tau)\,f^M(i-1,j-1) + (1-\epsilon-\tau)\bigl(f^X(i-1,j-1) + f^Y(i-1,j-1)\bigr)\right];$$
$$f^X(i,j) = q_{x_i}\left[\delta\, f^M(i-1,j) + \epsilon\, f^X(i-1,j)\right];$$
$$f^Y(i,j) = q_{y_j}\left[\delta\, f^M(i,j-1) + \epsilon\, f^Y(i,j-1)\right].$$

[Diagram: the transition probabilities into M and into X, as in the Viterbi recursion.]
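
A sketch of the forward recursion in Python, kept in plain probability space for readability (a practical implementation would work in log space or rescale). The boundary handling ($f^M(0,0)=1$, all other cells filled by the recursion) and the termination $f^E(n,m) = \tau\,[f^M(n,m)+f^X(n,m)+f^Y(n,m)]$ are not spelled out on this slide; they are assumptions mirroring the Viterbi algorithm above.

```python
def forward_pair_hmm(x, y, p, q, delta, epsilon, tau):
    """Full probability P(x, y), summed over all alignments, via the forward algorithm."""
    n, m = len(x), len(y)
    f = {s: [[0.0] * (m + 1) for _ in range(n + 1)] for s in "MXY"}
    f["M"][0][0] = 1.0                      # f^M(0,0) = 1; everything else starts at 0

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                f["M"][i][j] = p[x[i - 1], y[j - 1]] * (
                    (1 - 2 * delta - tau) * f["M"][i - 1][j - 1]
                    + (1 - epsilon - tau) * (f["X"][i - 1][j - 1] + f["Y"][i - 1][j - 1]))
            if i > 0:
                f["X"][i][j] = q[x[i - 1]] * (delta * f["M"][i - 1][j]
                                              + epsilon * f["X"][i - 1][j])
            if j > 0:
                f["Y"][i][j] = q[y[j - 1]] * (delta * f["M"][i][j - 1]
                                              + epsilon * f["Y"][i][j - 1])

    # Termination (assumed, mirroring the Viterbi case): f^E(n,m) = tau * (f^M + f^X + f^Y)(n,m)
    return tau * (f["M"][n][m] + f["X"][n][m] + f["Y"][n][m])
```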

The full probability (cont'd)

An important use of $P(x,y)$ is the posterior distribution over alignments Π given the two sequences x, y:
$$P(\Pi\mid x,y) = \frac{P(x,y,\Pi)}{P(x,y)}.$$
Example: set $\Pi = \Pi^*$, the Viterbi path. Then $P(\Pi^*\mid x,y)$ is the posterior probability of the Viterbi path, i.e. the probability that the optimal-scoring alignment is the correct one.

Globin example: the full probability (cont'd)

$P(\Pi^*\mid x,y) = 4.6\times 10^{-6}$. This is an alarming observation if one was hoping that standard alignment algorithms would find the correct alignment! The explanation: there are many small variants of the alignment with nearly the same score.

One single alignment is not accurate for determining similarity!

1st alignment: score 3 (BLOSUM50, d = 12, e = 2). 2nd alignment: also score 3, but with the gap in a different position. 3rd alignment: score 6, an increase in relative likelihood by a factor of 2 (BLOSUM50 is scaled in 1/3 bits, so a score difference of 3 corresponds to one bit).

The posterior probability

The degree of conservation along the sequence may vary, depending on functional or structural constraints: some parts of the alignment will be clear, while other regions may be less certain. This suggests a local view: what about the local accuracy of an alignment? We are interested in a reliability measure for each part of an alignment, namely the probability that two residues are aligned, given the complete sequences: $P(x_i \diamond y_j \mid x, y)$, where $x_i \diamond y_j$ denotes that $x_i$ is aligned to $y_j$. This is computed with the backward algorithm.

The backward algorithm

The quantity we are interested in:
$$P(x_i \diamond y_j \mid x, y) = \frac{P(x_i \diamond y_j,\, x, y)}{P(x,y)}.$$
The denominator is the final result of the forward algorithm: $P(x,y) = f^E(n,m)$.
The numerator:
$$P(x, y, x_i \diamond y_j) = P(x_{1\ldots i}, y_{1\ldots j}, x_i \diamond y_j)\; P(x_{i+1\ldots n}, y_{j+1\ldots m} \mid x_{1\ldots i}, y_{1\ldots j}, x_i \diamond y_j),$$
which by the Markov property equals
$$P(x_{1\ldots i}, y_{1\ldots j}, x_i \diamond y_j)\; P(x_{i+1\ldots n}, y_{j+1\ldots m} \mid x_i \diamond y_j) = f^M(i,j)\; b^M(i,j).$$

The backward algorithm: recursion

$$b^M(i,j) = (1-2\delta-\tau)\,p_{x_{i+1},y_{j+1}}\,b^M(i+1,j+1) + \delta\left[q_{x_{i+1}}\,b^X(i+1,j) + q_{y_{j+1}}\,b^Y(i,j+1)\right];$$
$$b^X(i,j) = (1-\epsilon-\tau)\,p_{x_{i+1},y_{j+1}}\,b^M(i+1,j+1) + \epsilon\,q_{x_{i+1}}\,b^X(i+1,j);$$
$$b^Y(i,j) = (1-\epsilon-\tau)\,p_{x_{i+1},y_{j+1}}\,b^M(i+1,j+1) + \epsilon\,q_{y_{j+1}}\,b^Y(i,j+1).$$

[Diagram: the corresponding transitions out of M and out of X, with probabilities $1-2\delta-\tau$, $\delta$, $1-\epsilon-\tau$ and $\epsilon$.]
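
A sketch of the backward recursion in Python, together with the posterior match probability it is used for. The initialization $b^M(n,m) = b^X(n,m) = b^Y(n,m) = \tau$ (with zero outside the table) is not shown on the slide; it is an assumption consistent with the termination of the forward algorithm. The function names are illustrative, and the forward table $f^M$ and total probability $P(x,y)$ are expected to come from a forward implementation such as the sketch above.

```python
def backward_pair_hmm(x, y, p, q, delta, epsilon, tau):
    """Backward tables b^M, b^X, b^Y for the global pair HMM."""
    n, m = len(x), len(y)
    # One extra row and column of zeros so that cells outside the table contribute 0.
    b = {s: [[0.0] * (m + 2) for _ in range(n + 2)] for s in "MXY"}
    for s in "MXY":
        b[s][n][m] = tau                     # assumed initialization at (n, m)

    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            # Emission terms for the next step; 0 when the corresponding symbol does not exist.
            p_next = p[x[i], y[j]] if i < n and j < m else 0.0
            qx_next = q[x[i]] if i < n else 0.0
            qy_next = q[y[j]] if j < m else 0.0
            b["M"][i][j] = ((1 - 2 * delta - tau) * p_next * b["M"][i + 1][j + 1]
                            + delta * (qx_next * b["X"][i + 1][j] + qy_next * b["Y"][i][j + 1]))
            b["X"][i][j] = ((1 - epsilon - tau) * p_next * b["M"][i + 1][j + 1]
                            + epsilon * qx_next * b["X"][i + 1][j])
            b["Y"][i][j] = ((1 - epsilon - tau) * p_next * b["M"][i + 1][j + 1]
                            + epsilon * qy_next * b["Y"][i][j + 1])
    return b

def posterior_match_probability(i, j, f_M, b_M, p_xy):
    """P(x_i diamond y_j | x, y) = f^M(i, j) * b^M(i, j) / P(x, y)."""
    return f_M[i][j] * b_M[i][j] / p_xy
```

These posteriors can then be used to annotate each aligned residue pair of, say, the Viterbi alignment with its reliability, flagging the regions of the alignment that are less certain.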