Pairwise alignment using HMMs

The states of an HMM fulfill the Markov property: the probability of a transition depends only on the current state. As in the CpG-island and casino examples, an HMM emits a sequence of symbols (nucleotides or die rolls). We only observe the emitted sequence; the generating state path is hidden. This leads to inference problems, e.g. estimating the most probable generating path (Viterbi algorithm). Knowing the path lets us analyze the internal structure of the string (localizing CpG islands, deciding whether the die was fair, ...).
Pair HMMs for string alignment

HMMs can be used for sequence alignment: the emission is not a single string but a pair of aligned strings; such models are called pair HMMs. From an FSA to a pair HMM:
- Define emission probabilities for the states. The match state emits an aligned pair of symbols (x_i, y_j) with probability p_{x_i y_j}. The insert/delete state X emits a symbol x_i from string x against a gap with probability q_{x_i} (and analogously for Y).
- Define transition probabilities between the states. Requirement: the probabilities of all transitions leaving a state must sum to one.
- Define begin and end states to meet the initialization and termination conditions of the dynamic programming algorithms.
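The transition structure described above can be made concrete with a small sketch. The parameter values δ, ε, τ below are hypothetical examples, not taken from the slides; the point is the sum-to-one requirement on each state's outgoing transitions.

```python
# Sketch of the pair-HMM transition structure; delta, epsilon, tau are
# hypothetical example values, not taken from the slides.
delta, epsilon, tau = 0.2, 0.05, 0.1

# outgoing transition probabilities of each state
out_M = {"M": 1 - 2 * delta - tau, "X": delta, "Y": delta, "E": tau}
out_X = {"M": 1 - epsilon - tau, "X": epsilon, "E": tau}
out_Y = {"M": 1 - epsilon - tau, "Y": epsilon, "E": tau}

# requirement from the slides: transitions leaving a state sum to one
for name, out in [("M", out_M), ("X", out_X), ("Y", out_Y)]:
    assert abs(sum(out.values()) - 1.0) < 1e-12, name
```

Note that in this model X and Y are not directly connected: a gap in one string cannot be immediately followed by a gap in the other.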
FSAs and Pair HMMs

[figure]
A complete Pair HMM

[figure]
Pair HMMs: Viterbi algorithm (cont'd)

Initialization: v^M(0,0) = 1; v^M(i,0) = v^M(0,j) = 0 for i,j > 0; initialize v^X(i,0) and v^Y(0,j) accordingly.

Recurrence, for i = 1,...,n and j = 1,...,m:

v^M(i,j) = p_{x_i y_j} max { (1-2δ-τ) v^M(i-1,j-1), (1-ε-τ) v^X(i-1,j-1), (1-ε-τ) v^Y(i-1,j-1) };

v^X(i,j) = q_{x_i} max { δ v^M(i-1,j), ε v^X(i-1,j) };

v^Y(i,j) = q_{y_j} max { δ v^M(i,j-1), ε v^Y(i,j-1) }.
Pair HMMs: Viterbi algorithm (cont'd)

Termination:

v^E = τ max [ v^M(n,m), v^X(n,m), v^Y(n,m) ].

Traceback: we keep traceback pointers as usual and reconstruct the whole alignment from the pointers.
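The recursion above can be sketched in log space as follows. This is a minimal illustration, not the slides' implementation: the emission tables and parameter values in the usage line are hypothetical, and the begin state is treated like M.

```python
import math

def pair_viterbi(x, y, p, q, delta, epsilon, tau):
    """Log-space Viterbi for the global-alignment pair HMM.

    p[(a, b)] is the match emission probability p_ab, q[a] the gap
    emission probability q_a; the begin state is treated like M.
    """
    n, m = len(x), len(y)
    NEG = float("-inf")
    vM = [[NEG] * (m + 1) for _ in range(n + 1)]
    vX = [[NEG] * (m + 1) for _ in range(n + 1)]
    vY = [[NEG] * (m + 1) for _ in range(n + 1)]
    vM[0][0] = 0.0                      # log 1
    a = math.log(1 - 2 * delta - tau)   # M -> M
    b = math.log(1 - epsilon - tau)     # X/Y -> M
    d = math.log(delta)                 # M -> X/Y (gap open)
    e = math.log(epsilon)               # X -> X, Y -> Y (gap extend)
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                vM[i][j] = math.log(p[(x[i-1], y[j-1])]) + max(
                    a + vM[i-1][j-1], b + vX[i-1][j-1], b + vY[i-1][j-1])
            if i > 0:
                vX[i][j] = math.log(q[x[i-1]]) + max(d + vM[i-1][j], e + vX[i-1][j])
            if j > 0:
                vY[i][j] = math.log(q[y[j-1]]) + max(d + vM[i][j-1], e + vY[i][j-1])
    # termination: v_E = tau * max(vM, vX, vY) at (n, m)
    return math.log(tau) + max(vM[n][m], vX[n][m], vY[n][m])

# usage with hypothetical emission tables: a single aligned pair A-A,
# where exp(score) = tau * p_AA * (1 - 2*delta - tau)
score = pair_viterbi("A", "A", {("A", "A"): 0.3}, {"A": 0.2}, 0.2, 0.1, 0.1)
```

A full implementation would also record traceback pointers at each max to recover the alignment, as the slide notes.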
Pair HMMs and FSA alignment (cont'd)

Theorem 1. The most probable path through the pair HMM for global alignment gives the optimal alignment associated with the substitution matrix

s(x_i, y_j) = log [ p(x_i, y_j) / (q_{x_i} q_{y_j}) ] + log [ (1-2δ-τ) / (1-η)^2 ]

with affine gap penalty γ(g) = -d - (g-1)e, where

d = -log [ δ(1-ε-τ) / ((1-η)(1-2δ-τ)) ],   e = -log [ ε / (1-η) ].

Proof: exercises.
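Numerically, the conversion from HMM parameters to FSA-style scores in Theorem 1 looks like this. All parameter and emission values here are hypothetical examples chosen for illustration.

```python
import math

# hypothetical parameter values, for illustration only
delta, epsilon, tau, eta = 0.2, 0.1, 0.1, 0.1
p_ab, q_a, q_b = 0.15, 0.25, 0.25   # example emission probabilities

# substitution score from Theorem 1
s_ab = (math.log(p_ab / (q_a * q_b))
        + math.log((1 - 2 * delta - tau) / (1 - eta) ** 2))

# affine gap parameters
d = -math.log(delta * (1 - epsilon - tau) / ((1 - eta) * (1 - 2 * delta - tau)))
e = -math.log(epsilon / (1 - eta))

def gap_penalty(g):
    """Affine gap penalty gamma(g) = -d - (g - 1) * e for a gap of length g."""
    return -d - (g - 1) * e
```

With these values both d and e come out positive, so longer gaps are penalized more, with the gap-open cost d exceeding the extension cost e only for suitable δ, ε.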
Example: the match hypothesis

Example alignment:

X   x_1 x_2 -   -   x_3 x_4 x_5 x_6 x_7
Y   y_1 y_2 y_3 y_4 y_5 y_6 -   -   -

The FSA model:

score(x, y) = log [ p(x, y | M) / p(x, y | R) ]
            = s(x_1, y_1) + s(x_2, y_2) - d - e + s(x_3, y_5) + s(x_4, y_6) - d - e - e.
Example (cont'd): the pair HMM model

Define a := (1-2δ-τ), b := (1-ε-τ). State path:

Π   B  M   M   Y   Y   M   M   X   X   X   E
X      x_1 x_2 -   -   x_3 x_4 x_5 x_6 x_7
Y      y_1 y_2 y_3 y_4 y_5 y_6 -   -   -

P = a p_{x_1 y_1} · a p_{x_2 y_2} · δ q_{y_3} · ε q_{y_4} · b p_{x_3 y_5} · a p_{x_4 y_6} · δ q_{x_5} · ε q_{x_6} · ε q_{x_7} · τ

Given the path Π, the probability of the pair of sequences (x, y) under the match hypothesis is

p(x, y, Π | M) = (1-2δ-τ) p_{x_1 y_1} (1-2δ-τ) p_{x_2 y_2} δ q_{y_3} ε q_{y_4} (1-ε-τ) p_{x_3 y_5} (1-2δ-τ) p_{x_4 y_6} δ q_{x_5} ε q_{x_6} ε q_{x_7} τ.
A random, length-independent site model

...written as a pair HMM: there is no match state; the states X and Y emit the two sequences in turn, independently of each other. Each of X and Y has a self-loop with probability 1-η and is left with probability η; a silent transitional state connects X to Y.

[figure: B, X (self-loop 1-η), silent state, Y (self-loop 1-η), E; transitions leaving X and Y have probability η]
Example (cont'd)

Π   B  X   X   ...  X    Y   Y   ...  Y    E
X      x_1 x_2 ...  x_7
Y                        y_1 y_2 ...  y_6

P = (1-η) q_{x_1} · ∏_{i=2}^{7} (1-η) q_{x_i} · η · (1-η) q_{y_1} · ∏_{j=2}^{6} (1-η) q_{y_j} · η

The probability of the pair of sequences (x, y) under the random hypothesis is

p(x, y | R) = η^2 ∏_{i=1}^{7} (1-η) q_{x_i} ∏_{j=1}^{6} (1-η) q_{y_j}.
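A quick numeric check, with made-up emission probabilities q, that the probability read off the state path factorizes into the η² closed form above:

```python
eta = 0.1
qx = [0.3, 0.2, 0.25, 0.25, 0.3, 0.2, 0.25]   # hypothetical q_{x_i}, i = 1..7
qy = [0.25, 0.3, 0.2, 0.25, 0.3, 0.2]          # hypothetical q_{y_j}, j = 1..6

# probability read off the path B X X X X X X X Y Y Y Y Y Y E
P_path = (1 - eta) * qx[0]
for q in qx[1:]:
    P_path *= (1 - eta) * q
P_path *= eta                      # leave X via the silent state
P_path *= (1 - eta) * qy[0]
for q in qy[1:]:
    P_path *= (1 - eta) * q
P_path *= eta                      # leave Y into the end state

# closed form: eta^2 * prod_i (1-eta) q_{x_i} * prod_j (1-eta) q_{y_j}
P_closed = eta ** 2
for q in qx + qy:
    P_closed *= (1 - eta) * q

assert abs(P_path - P_closed) < 1e-12 * P_closed
```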
A pair HMM for local alignment

The global model (states M, X, Y) is flanked by two copies of the random model, allowing an arbitrary start and stop of the alignment. Note that the sequences in the flanking regions are unaligned (random model).
The full probability of two aligned sequences

If the similarity of two sequences is weak, it is hard to find the correct alignment. HMMs allow us to calculate the probability that the two sequences are related by any alignment:

P(x, y) = Σ_{alignments Π} P(x, y, Π).

P(x, y) is always at least as high as the Viterbi probability P(x, y, Π*), and can be significantly different when there are many comparable alternative alignments.
The full probability (cont'd)

A more realistic score: the likelihood that the two sequences are related by some unspecified alignment, as opposed to being unrelated:

score(x, y) = log [ P(x, y | match hypothesis) / P(x, y | random hypothesis) ] = log [ Σ_Π P(x, y, Π) / ( ∏_i q_{x_i} ∏_j q_{y_j} ) ].
The full probability: forward algorithm

f^M(i,j) = p_{x_i y_j} [ (1-2δ-τ) f^M(i-1,j-1) + (1-ε-τ) f^X(i-1,j-1) + (1-ε-τ) f^Y(i-1,j-1) ];

f^X(i,j) = q_{x_i} [ δ f^M(i-1,j) + ε f^X(i-1,j) ];

f^Y(i,j) = q_{y_j} [ δ f^M(i,j-1) + ε f^Y(i,j-1) ].
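The forward recursion replaces Viterbi's max with a sum. A probability-space sketch follows (log space or scaling would be needed for realistic sequence lengths; the emission tables and parameters in the usage line are hypothetical):

```python
def pair_forward(x, y, p, q, delta, epsilon, tau):
    """Forward algorithm for the pair HMM: P(x, y) summed over all alignments.

    Probability space for brevity; log space or scaling is needed in practice.
    """
    n, m = len(x), len(y)
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]
    fX = [[0.0] * (m + 1) for _ in range(n + 1)]
    fY = [[0.0] * (m + 1) for _ in range(n + 1)]
    fM[0][0] = 1.0  # begin state behaves like M
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                fM[i][j] = p[(x[i-1], y[j-1])] * (
                    (1 - 2 * delta - tau) * fM[i-1][j-1]
                    + (1 - epsilon - tau) * (fX[i-1][j-1] + fY[i-1][j-1]))
            if i > 0:
                fX[i][j] = q[x[i-1]] * (delta * fM[i-1][j] + epsilon * fX[i-1][j])
            if j > 0:
                fY[i][j] = q[y[j-1]] * (delta * fM[i][j-1] + epsilon * fY[i][j-1])
    # termination: sum over the three states, times the end transition
    return tau * (fM[n][m] + fX[n][m] + fY[n][m])

# usage with hypothetical tables: "AA" vs "A" has two alignments
# (match-then-gap and gap-then-match), and both are summed
total = pair_forward("AA", "A", {("A", "A"): 0.3}, {"A": 0.2}, 0.2, 0.1, 0.1)
```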
The full probability (cont'd)

An important use of P(x, y) is the posterior distribution over alignments Π given the two sequences x, y:

P(Π | x, y) = P(x, y, Π) / P(x, y).

Example: set Π = Π*, the Viterbi path. Then P(Π* | x, y) is the posterior probability of the Viterbi path, i.e. the probability that the optimal-scoring alignment is correct.
The full probability (cont'd): globin example

P(Π* | x, y) = 4.6 · 10^{-6}. This is an alarming observation if one was hoping that standard alignment algorithms would find the correct alignment! Explanation: there are many small variants of the alignment with nearly the same score.
1st alignment: score 3 (BLOSUM50, d = 12, e = 2). 2nd alignment: also score 3, but with a different gap position. 3rd alignment: score 6, an increase in relative likelihood by a factor of 2 (BLOSUM50 is scaled in 1/3 bits). Conclusion: one single alignment is not accurate enough for determining similarity!
The posterior probability

The degree of conservation along the sequence may vary depending on functional/structural constraints: some parts of the alignment will be clear, other regions less certain. Local view: what about the local accuracy of an alignment? We are interested in a reliability measure for each part of an alignment: the probability that the two residues x_i and y_j are aligned, given the complete sequences,

P(x_i ◇ y_j | x, y)   →   backward algorithm.
The backward algorithm

The quantity we are interested in:

P(x_i ◇ y_j | x, y) = P(x_i ◇ y_j, x, y) / P(x, y).

The denominator is the final result of the forward algorithm: P(x, y) = f^E(n, m).

The numerator:

P(x, y, x_i ◇ y_j) = P(x_{1..i}, y_{1..j}, x_i ◇ y_j) · P(x_{i+1..n}, y_{j+1..m} | x_{1..i}, y_{1..j}, x_i ◇ y_j)

(Markov property) = P(x_{1..i}, y_{1..j}, x_i ◇ y_j) · P(x_{i+1..n}, y_{j+1..m} | x_i ◇ y_j)

= f^M(i, j) · b^M(i, j).
The backward algorithm: recursion

b^M(i,j) = (1-2δ-τ) p_{x_{i+1} y_{j+1}} b^M(i+1,j+1) + δ [ q_{x_{i+1}} b^X(i+1,j) + q_{y_{j+1}} b^Y(i,j+1) ];

b^X(i,j) = (1-ε-τ) p_{x_{i+1} y_{j+1}} b^M(i+1,j+1) + ε q_{x_{i+1}} b^X(i+1,j);

b^Y(i,j) = (1-ε-τ) p_{x_{i+1} y_{j+1}} b^M(i+1,j+1) + ε q_{y_{j+1}} b^Y(i,j+1).
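Combined with a forward pass, these recursions give the posterior match probabilities. The sketch below is a probability-space illustration under hypothetical parameters (log space or scaling would be needed for realistic lengths); out-of-range transitions are simply treated as contributing zero.

```python
def pair_forward(x, y, p, q, delta, epsilon, tau):
    """Forward pass: returns the fM table and the total probability P(x, y)."""
    n, m = len(x), len(y)
    fM = [[0.0] * (m + 1) for _ in range(n + 1)]
    fX = [[0.0] * (m + 1) for _ in range(n + 1)]
    fY = [[0.0] * (m + 1) for _ in range(n + 1)]
    fM[0][0] = 1.0  # begin state behaves like M
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                fM[i][j] = p[(x[i-1], y[j-1])] * (
                    (1 - 2 * delta - tau) * fM[i-1][j-1]
                    + (1 - epsilon - tau) * (fX[i-1][j-1] + fY[i-1][j-1]))
            if i > 0:
                fX[i][j] = q[x[i-1]] * (delta * fM[i-1][j] + epsilon * fX[i-1][j])
            if j > 0:
                fY[i][j] = q[y[j-1]] * (delta * fM[i][j-1] + epsilon * fY[i][j-1])
    return fM, tau * (fM[n][m] + fX[n][m] + fY[n][m])

def pair_backward(x, y, p, q, delta, epsilon, tau):
    """Backward pass implementing the recursion above; returns the bM table."""
    n, m = len(x), len(y)
    bM = [[0.0] * (m + 1) for _ in range(n + 1)]
    bX = [[0.0] * (m + 1) for _ in range(n + 1)]
    bY = [[0.0] * (m + 1) for _ in range(n + 1)]
    bM[n][m] = bX[n][m] = bY[n][m] = tau  # termination via the end state
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            match = p[(x[i], y[j])] * bM[i+1][j+1] if i < n and j < m else 0.0
            gap_x = q[x[i]] * bX[i+1][j] if i < n else 0.0
            gap_y = q[y[j]] * bY[i][j+1] if j < m else 0.0
            bM[i][j] = (1 - 2 * delta - tau) * match + delta * (gap_x + gap_y)
            bX[i][j] = (1 - epsilon - tau) * match + epsilon * gap_x
            bY[i][j] = (1 - epsilon - tau) * match + epsilon * gap_y
    return bM

def posterior_match(x, y, p, q, delta, epsilon, tau, i, j):
    """P(x_i aligned to y_j | x, y) = fM(i,j) * bM(i,j) / P(x, y)."""
    fM, total = pair_forward(x, y, p, q, delta, epsilon, tau)
    bM = pair_backward(x, y, p, q, delta, epsilon, tau)
    return fM[i][j] * bM[i][j] / total
```

A useful sanity check: since the begin state behaves like M, b^M(0,0) must equal the total forward probability P(x, y).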