Pairwise alignment using HMMs


Brian Pierce

The states of an HMM fulfill the Markov property: the probability of a transition depends only on the previous state.

CpG islands and casino example: HMMs emit a sequence of symbols (nucleotides or die rolls).

We observe only the emitted sequence; the generating state path is unknown. This leads to inference problems, e.g. estimating the most probable generating path (Viterbi algorithm).

Knowing the path allows us to analyze the internal structure of the string (localizing CpG islands, deciding whether the die was fair, ...).

Pair HMMs for string alignment

HMMs can be used for sequence alignment: the emission is not a single string but a pair of aligned strings, hence the name pair HMM.

From an FSA to a pair HMM:

Define emission probabilities for the states. The match state M emits an aligned pair of symbols $x_i, y_j$ with probability $p_{x_i y_j}$. The insert/delete state X emits a symbol $x_i$ from string $x$ against a gap with probability $q_{x_i}$; the state Y likewise emits a symbol $y_j$ from string $y$ against a gap with probability $q_{y_j}$.

Define transition probabilities between the states. Requirement: the probabilities of all transitions leaving a state must sum to one.

Define begin and end states to meet the initialization and termination conditions of the dynamic programming algorithms.
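The sum-to-one requirement can be checked mechanically. The sketch below assumes the transition parameters used on the later slides ($\delta$ for opening a gap, $\varepsilon$ for extending it, $\tau$ for moving to the end state) and assumes that the begin state B behaves like M. The function names are hypothetical and chosen only for this illustration.

```python
def pair_hmm_transitions(delta, epsilon, tau):
    """Transition table of the global-alignment pair HMM.

    Assumption: the begin state B has the same outgoing transitions as M.
    """
    stay_match = 1.0 - 2.0 * delta - tau   # M -> M
    close_gap = 1.0 - epsilon - tau        # X -> M and Y -> M
    if stay_match < 0.0 or close_gap < 0.0:
        raise ValueError("parameters do not define valid probabilities")
    trans = {
        "M": {"M": stay_match, "X": delta, "Y": delta, "E": tau},
        "X": {"M": close_gap, "X": epsilon, "E": tau},
        "Y": {"M": close_gap, "Y": epsilon, "E": tau},
    }
    trans["B"] = dict(trans["M"])
    return trans


def rows_sum_to_one(trans, tol=1e-12):
    """True if the outgoing probabilities of every state sum to one."""
    return all(abs(sum(row.values()) - 1.0) <= tol for row in trans.values())


table = pair_hmm_transitions(delta=0.2, epsilon=0.1, tau=0.1)
print(rows_sum_to_one(table))
```

Note that there is no direct transition between X and Y in this model, so a gap in one sequence cannot be followed immediately by a gap in the other.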

FSAs and Pair HMMs

[Figure not preserved.]

A complete Pair HMM

[Figure not preserved.]

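Pair HMMs: Viterbi algorithm (cont'd)

Initialization: $v^M(0,0) = 1$ and $v^M(0,j) = v^M(i,0) = 0$ for $i, j > 0$; $v^{X/Y}(0,j)$ and $v^{X/Y}(i,0)$ have to be initialized as well.

Recurrence, for $i = 1,\dots,n$, $j = 1,\dots,m$:

$$v^M(i,j) = p_{x_i y_j} \max \begin{cases} (1-2\delta-\tau)\, v^M(i-1,j-1), \\ (1-\varepsilon-\tau)\, v^X(i-1,j-1), \\ (1-\varepsilon-\tau)\, v^Y(i-1,j-1). \end{cases}$$

[Figure: transitions into M. M -> M has probability $1-2\delta-\tau$; X -> M and Y -> M each have probability $1-\varepsilon-\tau$.]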

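Pair HMMs: Viterbi algorithm (cont'd)

Initialization: $v^M(0,0) = 1$ and $v^M(0,j) = v^M(i,0) = 0$ for $i, j > 0$; $v^{X/Y}(0,j)$ and $v^{X/Y}(i,0)$ have to be initialized as well.

Recurrence, for $i = 1,\dots,n$, $j = 1,\dots,m$:

$$v^M(i,j) = p_{x_i y_j} \max \begin{cases} (1-2\delta-\tau)\, v^M(i-1,j-1), \\ (1-\varepsilon-\tau)\, v^X(i-1,j-1), \\ (1-\varepsilon-\tau)\, v^Y(i-1,j-1); \end{cases}$$

$$v^X(i,j) = q_{x_i} \max \begin{cases} \delta\, v^M(i-1,j), \\ \varepsilon\, v^X(i-1,j); \end{cases}$$

$$v^Y(i,j) = q_{y_j} \max \begin{cases} \delta\, v^M(i,j-1), \\ \varepsilon\, v^Y(i,j-1). \end{cases}$$

[Figure: transitions into X. M -> X has probability $\delta$; X -> X has probability $\varepsilon$.]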

Pair HMMs: Viterbi algorithm (cont'd)

Termination:

$$v^E = \tau \max \left[ v^M(n,m),\; v^X(n,m),\; v^Y(n,m) \right].$$

[Figure: transitions into the end state. M -> E, X -> E and Y -> E each have probability $\tau$.]

Traceback: we keep traceback pointers as usual and reconstruct the whole alignment from the pointers.
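The recurrence, termination and traceback can be put together in a few lines. The following is a minimal sketch, not a reference implementation: it works in probability space exactly as the slides do (a practical implementation would usually switch to logarithms to avoid underflow), it assumes the begin state behaves like M, and it fills the boundary cells of $v^X$ and $v^Y$ with the same recurrence. The function name and the toy DNA emission model are hypothetical.

```python
def viterbi_pair_hmm(x, y, p, q, delta, epsilon, tau):
    """Most probable path of the global pair HMM.

    p[(a, b)] is the match emission, q[a] the gap-state emission.
    Returns (v_E, aligned_x, aligned_y).
    """
    n, m = len(x), len(y)
    a = 1.0 - 2.0 * delta - tau   # M -> M
    b = 1.0 - epsilon - tau       # X -> M and Y -> M
    states = ("M", "X", "Y")
    v = {s: [[0.0] * (m + 1) for _ in range(n + 1)] for s in states}
    ptr = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in states}
    v["M"][0][0] = 1.0  # assumption: the begin state behaves like M

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                best, prev = max((a * v["M"][i - 1][j - 1], "M"),
                                 (b * v["X"][i - 1][j - 1], "X"),
                                 (b * v["Y"][i - 1][j - 1], "Y"))
                v["M"][i][j] = p[x[i - 1], y[j - 1]] * best
                ptr["M"][i][j] = prev
            if i > 0:
                best, prev = max((delta * v["M"][i - 1][j], "M"),
                                 (epsilon * v["X"][i - 1][j], "X"))
                v["X"][i][j] = q[x[i - 1]] * best
                ptr["X"][i][j] = prev
            if j > 0:
                best, prev = max((delta * v["M"][i][j - 1], "M"),
                                 (epsilon * v["Y"][i][j - 1], "Y"))
                v["Y"][i][j] = q[y[j - 1]] * best
                ptr["Y"][i][j] = prev

    # Termination: v_E = tau * max over the three states at (n, m).
    best, state = max((v[s][n][m], s) for s in states)
    v_end = tau * best

    # Traceback: follow the pointers back to the begin cell (0, 0).
    i, j, top, bottom = n, m, [], []
    while i > 0 or j > 0:
        prev = ptr[state][i][j]
        if state == "M":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif state == "X":
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
        state = prev
    return v_end, "".join(reversed(top)), "".join(reversed(bottom))


# Toy DNA model; the numbers are hypothetical and only serve the illustration.
P_TOY = {(s, t): (0.2 if s == t else 0.2 / 12) for s in "ACGT" for t in "ACGT"}
Q_TOY = {s: 0.25 for s in "ACGT"}

prob, row_x, row_y = viterbi_pair_hmm("ACGT", "AGT", P_TOY, Q_TOY,
                                      delta=0.1, epsilon=0.1, tau=0.1)
print(row_x)
print(row_y)
```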

Pair HMMs and FSA alignment (cont'd)

Theorem 1. The most probable path through the pair HMM for global alignment gives the optimal alignment associated with the substitution matrix

$$s(x_i, y_j) = \log \frac{p(x_i, y_j)}{q_{x_i}\, q_{y_j}} + \log \frac{1-2\delta-\tau}{(1-\eta)^2}$$

with affine gap penalty $\gamma(g) = -d - (g-1)e$, where

$$d = -\log \frac{\delta\,(1-\varepsilon-\tau)}{(1-\eta)(1-2\delta-\tau)}, \qquad e = -\log \frac{\varepsilon}{1-\eta}.$$

Proof: exercises.
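To get a feeling for the theorem, the three formulas can be evaluated directly. The sketch below simply transcribes them, using natural logarithms; the function names and the emission dictionaries `p` and `q` are hypothetical conventions for this example, not part of the slides.

```python
import math


def score_parameters(p, q, delta, epsilon, tau, eta):
    """Substitution scores s, gap-open d and gap-extend e from Theorem 1.

    p[(a, b)]: match emission probabilities, q[a]: background probabilities.
    """
    a = 1.0 - 2.0 * delta - tau   # M -> M
    b = 1.0 - epsilon - tau       # X -> M and Y -> M
    shift = math.log(a / (1.0 - eta) ** 2)
    s = {(u, w): math.log(p[u, w] / (q[u] * q[w])) + shift for (u, w) in p}
    d = -math.log(delta * b / ((1.0 - eta) * a))
    e = -math.log(epsilon / (1.0 - eta))
    return s, d, e


def gap_penalty(g, d, e):
    """Affine gap score gamma(g) = -d - (g - 1) e for a gap of length g >= 1."""
    return -d - (g - 1) * e
```

With small $\delta$ and $\varepsilon$ both $d$ and $e$ come out positive, so $\gamma(g)$ is a penalty, as expected for an affine gap model.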

Example: the match hypothesis

Example alignment:

X:  x1 x2 -  -  x3 x4 x5 x6 x7
Y:  y1 y2 y3 y4 y5 y6 -  -  -

The FSA model:

$$\mathrm{score}(x,y) = \log \frac{p(x,y \mid M)}{p(x,y \mid R)} = s(x_1,y_1) + s(x_2,y_2) - d - e + s(x_3,y_5) + s(x_4,y_6) - d - e - e.$$

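Example (cont'd)

The pair HMM model: define $a := 1-2\delta-\tau$ and $b := 1-\varepsilon-\tau$.

Π:  B  M  M  Y  Y  M  M  X  X  X  E
X:     x1 x2 -  -  x3 x4 x5 x6 x7
Y:     y1 y2 y3 y4 y5 y6 -  -  -

$$P = a\,p_{x_1 y_1} \cdot a\,p_{x_2 y_2} \cdot \delta\,q_{y_3} \cdot \varepsilon\,q_{y_4} \cdot b\,p_{x_3 y_5} \cdot a\,p_{x_4 y_6} \cdot \delta\,q_{x_5} \cdot \varepsilon\,q_{x_6} \cdot \varepsilon\,q_{x_7} \cdot \tau$$

Given the path $\Pi$, the probability of the pair of sequences $(x, y)$ under the match hypothesis is

$$p(x,y,\Pi \mid M) = (1-2\delta-\tau)\,p_{x_1 y_1}\,(1-2\delta-\tau)\,p_{x_2 y_2}\,\delta q_{y_3}\,\varepsilon q_{y_4}\,(1-\varepsilon-\tau)\,p_{x_3 y_5}\,(1-2\delta-\tau)\,p_{x_4 y_6}\,\delta q_{x_5}\,\varepsilon q_{x_6}\,\varepsilon q_{x_7}\,\tau.$$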
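The product above is just a walk along the state path, multiplying one transition and one emission per step. A small sketch of that bookkeeping follows; the function name and the string encoding of the path (one letter per state, without B and E) are assumptions made for this example.

```python
def path_probability(path, x, y, p, q, delta, epsilon, tau):
    """P(x, y, path | M) for a state path such as "MMYYMMXXX".

    The path lists only the emitting states; the begin state is assumed to
    behave like M, and the final factor tau is the transition into E.
    """
    a = 1.0 - 2.0 * delta - tau
    b = 1.0 - epsilon - tau
    trans = {("M", "M"): a, ("M", "X"): delta, ("M", "Y"): delta,
             ("X", "M"): b, ("X", "X"): epsilon,
             ("Y", "M"): b, ("Y", "Y"): epsilon}
    prob, prev, i, j = 1.0, "M", 0, 0
    for state in path:
        # Transitions missing from the table (X <-> Y) have probability 0.
        prob *= trans.get((prev, state), 0.0)
        if state == "M":
            prob *= p[x[i], y[j]]
            i, j = i + 1, j + 1
        elif state == "X":
            prob *= q[x[i]]
            i += 1
        elif state == "Y":
            prob *= q[y[j]]
            j += 1
        else:
            raise ValueError("unknown state: " + state)
        prev = state
    if (i, j) != (len(x), len(y)):
        raise ValueError("path does not emit the complete sequences")
    return prob * tau
```

Calling it with the path `"MMYYMMXXX"`, a sequence `x` of length 7 and a sequence `y` of length 6 reproduces the ten factors of the example, $a^3\, b\, \delta^2 \varepsilon^3 \tau$ times the emission probabilities.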

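A random length independent site model...

...written as a pair HMM: there is no match state; the states X and Y emit the two sequences in turn, independently of each other.

[Figure: the random model as a pair HMM. The states X and Y emit the symbols $x_i$ and $y_j$; a silent transitional state sits between them. The transitions B -> X, X -> X, silent -> Y and Y -> Y have probability $1-\eta$; the transitions B -> silent, X -> silent, silent -> E and Y -> E have probability $\eta$.]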

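Example (cont'd)

Π:  B  X  X  X  X  X  X  X  Y  Y  Y  Y  Y  Y  E
X:     x1 x2 x3 x4 x5 x6 x7 -  -  -  -  -  -
Y:     -  -  -  -  -  -  -  y1 y2 y3 y4 y5 y6

$$P = (1-\eta)\,q_{x_1} \prod_{i=2}^{7} (1-\eta)\,q_{x_i} \cdot \eta\,(1-\eta)\,q_{y_1} \prod_{j=2}^{6} (1-\eta)\,q_{y_j} \cdot \eta$$

The probability of the pair of sequences $(x, y)$ under the random hypothesis is

$$p(x,y \mid R) = \eta^2 \prod_{i=1}^{7} (1-\eta)\,q_{x_i} \prod_{j=1}^{6} (1-\eta)\,q_{y_j}.$$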
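For arbitrary sequence lengths the same product reads $\eta^2 \prod_i (1-\eta) q_{x_i} \prod_j (1-\eta) q_{y_j}$. A minimal sketch, with a hypothetical function name and a background distribution `q` passed as a dictionary:

```python
def random_model_probability(x, y, q, eta):
    """p(x, y | R): both sequences are emitted independently of each other.

    Each emitted symbol contributes (1 - eta) * q[symbol]; leaving X and
    leaving Y each contribute one factor eta.
    """
    prob = eta * eta
    for symbol in x:
        prob *= (1.0 - eta) * q[symbol]
    for symbol in y:
        prob *= (1.0 - eta) * q[symbol]
    return prob
```

Because the model has no match state, the result depends only on the composition and the lengths of the two sequences, not on any alignment between them.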

A pair HMM for local alignment

The global model (states M, X, Y) is flanked by two copies of the random model, which allows the alignment to start and stop at arbitrary positions.

Note that the sequences in the flanking regions are unaligned and are therefore emitted by the random model.

The full probability of two aligned sequences

If the similarity of two sequences is weak, it is hard to find the correct alignment.

HMMs allow us to calculate the probability that two sequences are related by any alignment:

$$P(x,y) = \sum_{\text{alignments } \Pi} P(x,y,\Pi).$$

$P(x,y)$ is never lower than the Viterbi probability $P(x,y,\Pi^*)$, and it is strictly higher as soon as a second alignment has nonzero probability. The two can differ significantly when there are many comparable alternative alignments.

The full probability (cont'd)

A more realistic score: the likelihood that two sequences are related by some unspecified alignment, as opposed to being unrelated:

$$\mathrm{score}(x,y) = \frac{P(x,y \mid \text{match hypothesis})}{P(x,y \mid \text{random hypothesis})} = \frac{\sum_{\Pi} P(x,y,\Pi)}{q_x\, q_y}.$$

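The full probability: forward algorithm

$$f^M(i,j) = p_{x_i y_j} \left[ (1-2\delta-\tau)\, f^M(i-1,j-1) + (1-\varepsilon-\tau)\, f^X(i-1,j-1) + (1-\varepsilon-\tau)\, f^Y(i-1,j-1) \right];$$

$$f^X(i,j) = q_{x_i} \left[ \delta\, f^M(i-1,j) + \varepsilon\, f^X(i-1,j) \right];$$

$$f^Y(i,j) = q_{y_j} \left[ \delta\, f^M(i,j-1) + \varepsilon\, f^Y(i,j-1) \right].$$

[Figure: transitions used in the recurrences. M -> M has probability $1-2\delta-\tau$, X -> M has probability $1-\varepsilon-\tau$, M -> X has probability $\delta$, X -> X has probability $\varepsilon$.]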
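The forward recurrences have the same shape as the Viterbi ones, with each maximum replaced by a sum. The sketch below assumes the same initialization as for Viterbi ($f^M(0,0) = 1$, the begin state behaving like M) and a termination step analogous to the Viterbi one, $f^E(n,m) = \tau\,[f^M(n,m) + f^X(n,m) + f^Y(n,m)]$. The function name and the toy model are hypothetical.

```python
def forward_pair_hmm(x, y, p, q, delta, epsilon, tau):
    """Full probability P(x, y), summed over all alignments."""
    n, m = len(x), len(y)
    a = 1.0 - 2.0 * delta - tau   # M -> M
    b = 1.0 - epsilon - tau       # X -> M and Y -> M
    f_m = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_x = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_y = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_m[0][0] = 1.0  # assumption: the begin state behaves like M

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                f_m[i][j] = p[x[i - 1], y[j - 1]] * (
                    a * f_m[i - 1][j - 1]
                    + b * f_x[i - 1][j - 1]
                    + b * f_y[i - 1][j - 1])
            if i > 0:
                f_x[i][j] = q[x[i - 1]] * (
                    delta * f_m[i - 1][j] + epsilon * f_x[i - 1][j])
            if j > 0:
                f_y[i][j] = q[y[j - 1]] * (
                    delta * f_m[i][j - 1] + epsilon * f_y[i][j - 1])

    # Assumed termination: every state moves to E with probability tau.
    return tau * (f_m[n][m] + f_x[n][m] + f_y[n][m])


# Toy DNA model; the numbers are hypothetical and only serve the illustration.
P_TOY = {(s, t): (0.2 if s == t else 0.2 / 12) for s in "ACGT" for t in "ACGT"}
Q_TOY = {s: 0.25 for s in "ACGT"}

print(forward_pair_hmm("ACGT", "AGT", P_TOY, Q_TOY,
                       delta=0.1, epsilon=0.1, tau=0.1))
```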

The full probability (cont'd)

An important use of $P(x,y)$: the posterior distribution over alignments $\Pi$ given two sequences $x, y$:

$$P(\Pi \mid x,y) = \frac{P(x,y,\Pi)}{P(x,y)}.$$

Example: set $\Pi = \Pi^*$, the Viterbi path. Then $P(\Pi^* \mid x,y)$ is the posterior probability of the Viterbi path, i.e. the probability that the optimal scoring alignment is correct.

The full probability (cont'd)

Globin example: $P(\Pi^* \mid x,y) =$ [value not preserved].

An alarming observation if one was hoping that standard alignment algorithms would find the correct alignment!

Explanation: there are many small variants of the alignment with nearly the same score.

One single alignment is not accurate for determining similarity!

[Figure: three alternative alignments.]

1st alignment: score 3 (BLOSUM 50, d = 12, e = 2).

2nd alignment: also score 3, but with a different gap position.

3rd alignment: score 6, an increase in relative likelihood by a factor of 2 (BLOSUM 50 is scaled in 1/3 bits, so a difference of 3 score units corresponds to 1 bit).
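The conversion from a score difference to a likelihood ratio is a one-liner. The sketch below assumes scores in units of 1/3 bit, as stated for BLOSUM 50; the function name is hypothetical.

```python
def relative_likelihood(score_a, score_b, units_per_bit=3):
    """Likelihood ratio implied by two log-odds scores.

    With scores in 1/units_per_bit bits, a difference of units_per_bit
    score units is one bit, i.e. a factor of 2.
    """
    return 2.0 ** ((score_a - score_b) / units_per_bit)


print(relative_likelihood(6, 3))
print(relative_likelihood(3, 3))
```

Two alignments with equal scores are equally likely under the model, which is why picking a single one of them says little about which alignment is correct.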

The posterior probability

The degree of conservation may vary along the sequence, depending on functional and structural constraints: some parts of the alignment will be clear, other regions may be less certain.

Local view: what about the local accuracy of an alignment?

We are interested in a reliability measure for each part of an alignment: the probability that two residues $x_i$ and $y_j$ are aligned (written $x_i \diamond y_j$), given the complete sequences: $P(x_i \diamond y_j \mid x, y)$. Computing it requires the backward algorithm.

The backward algorithm

The quantity we are interested in (with $x_i \diamond y_j$ denoting that $x_i$ is aligned to $y_j$):

$$P(x_i \diamond y_j \mid x, y) = \frac{P(x_i \diamond y_j, x, y)}{P(x,y)}.$$

The denominator is the final result of the forward algorithm: $P(x,y) = f^E(n,m)$.

Numerator:

$$P(x, y, x_i \diamond y_j) = P(x_{1\dots i}, y_{1\dots j}, x_i \diamond y_j) \cdot P(x_{i+1\dots n}, y_{j+1\dots m} \mid x_{1\dots i}, y_{1\dots j}, x_i \diamond y_j)$$

and, by the Markov property,

$$= P(x_{1\dots i}, y_{1\dots j}, x_i \diamond y_j) \cdot P(x_{i+1\dots n}, y_{j+1\dots m} \mid x_i \diamond y_j) = f^M(i,j) \cdot b^M(i,j).$$

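The backward algorithm: recursion

$$b^M(i,j) = (1-2\delta-\tau)\, p_{x_{i+1} y_{j+1}}\, b^M(i+1,j+1) + \delta \left[ q_{x_{i+1}}\, b^X(i+1,j) + q_{y_{j+1}}\, b^Y(i,j+1) \right];$$

$$b^X(i,j) = (1-\varepsilon-\tau)\, p_{x_{i+1} y_{j+1}}\, b^M(i+1,j+1) + \varepsilon\, q_{x_{i+1}}\, b^X(i+1,j);$$

$$b^Y(i,j) = (1-\varepsilon-\tau)\, p_{x_{i+1} y_{j+1}}\, b^M(i+1,j+1) + \varepsilon\, q_{y_{j+1}}\, b^Y(i,j+1).$$

[Figure: transitions used in the recursion. M -> M has probability $1-2\delta-\tau$, M -> X has probability $\delta$, X -> M has probability $1-\varepsilon-\tau$, X -> X has probability $\varepsilon$.]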
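Combining the forward values with the backward recursion yields the posterior match probabilities $P(x_i \diamond y_j \mid x, y) = f^M(i,j)\, b^M(i,j) / P(x,y)$. The sketch below makes several assumptions that the slides do not spell out: the begin state behaves like M, the backward values are initialized as $b^M(n,m) = b^X(n,m) = b^Y(n,m) = \tau$, and terms that would reach beyond the end of a sequence are taken as zero. The function name and the toy model are hypothetical.

```python
def posterior_match_probabilities(x, y, p, q, delta, epsilon, tau):
    """Return (P(x, y), b_M(0, 0), post) with post[i-1][j-1] = P(x_i <> y_j | x, y)."""
    n, m = len(x), len(y)
    a = 1.0 - 2.0 * delta - tau   # M -> M
    b = 1.0 - epsilon - tau       # X -> M and Y -> M

    def table():
        return [[0.0] * (m + 1) for _ in range(n + 1)]

    # Forward pass.
    f_m, f_x, f_y = table(), table(), table()
    f_m[0][0] = 1.0  # assumption: the begin state behaves like M
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                f_m[i][j] = p[x[i - 1], y[j - 1]] * (
                    a * f_m[i - 1][j - 1] + b * f_x[i - 1][j - 1]
                    + b * f_y[i - 1][j - 1])
            if i > 0:
                f_x[i][j] = q[x[i - 1]] * (
                    delta * f_m[i - 1][j] + epsilon * f_x[i - 1][j])
            if j > 0:
                f_y[i][j] = q[y[j - 1]] * (
                    delta * f_m[i][j - 1] + epsilon * f_y[i][j - 1])
    total = tau * (f_m[n][m] + f_x[n][m] + f_y[n][m])

    # Backward pass; x[i] and y[j] are the symbols x_{i+1} and y_{j+1}.
    b_m, b_x, b_y = table(), table(), table()
    b_m[n][m] = b_x[n][m] = b_y[n][m] = tau  # assumed initialization
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            match = p[x[i], y[j]] * b_m[i + 1][j + 1] if i < n and j < m else 0.0
            gap_x = q[x[i]] * b_x[i + 1][j] if i < n else 0.0
            gap_y = q[y[j]] * b_y[i][j + 1] if j < m else 0.0
            b_m[i][j] = a * match + delta * (gap_x + gap_y)
            b_x[i][j] = b * match + epsilon * gap_x
            b_y[i][j] = b * match + epsilon * gap_y

    post = [[f_m[i][j] * b_m[i][j] / total for j in range(1, m + 1)]
            for i in range(1, n + 1)]
    return total, b_m[0][0], post


# Toy DNA model; the numbers are hypothetical and only serve the illustration.
P_TOY = {(s, t): (0.2 if s == t else 0.2 / 12) for s in "ACGT" for t in "ACGT"}
Q_TOY = {s: 0.25 for s in "ACGT"}

total, check, post = posterior_match_probabilities(
    "ACGT", "AGT", P_TOY, Q_TOY, delta=0.1, epsilon=0.1, tau=0.1)
for row in post:
    print(["%.3f" % value for value in row])
```

As a sanity check, $b^M(0,0)$ should agree with the forward result $P(x,y)$, since both sum over all complete paths; the function returns both values so that this can be verified.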
