Pairwise sequence alignment and pair hidden Markov models

Martin C. Frith

April 13, 2012

Introduction

Pairwise alignment and pair hidden Markov models (pHMMs) are basic textbook fare [2]. However, there are various slightly different algorithms and models that could be used. This document presents some variants that are different from, and maybe better than, the ones described by Durbin et al.

Definitions

We wish to align two sequences:

R_1, ..., R_m: 1st sequence (e.g. "reference"), of length m.
Q_1, ..., Q_n: 2nd sequence (e.g. "query"), of length n.

The classic approach is to define a scoring scheme, which assigns scores to aligned letters and gaps, and then find alignments with maximal score. This document considers the standard affine-gap scheme only.

S(x, y): score for aligning reference base x to query base y.
a: gap existence score.
b: gap extension score. (A gap of length k scores a + b·k.)

Note that a and b are negative.

Alternative dynamic programming algorithms

The standard way of finding the maximal alignment score is dynamic programming, which finds the optimal score for sequences of length i and j in terms of the optimal scores for shorter sequences (i − 1 and j − 1). This variant seems to be popular:
Algorithm A

  X_{i,j} = max(X_{i-1,j-1}, Y_{i-1,j-1}, Z_{i-1,j-1}) + S(R_i, Q_j)    (1)
  Y_{i,j} = max(X_{i-1,j} + a, Y_{i-1,j}, Z_{i-1,j} + a) + b            (2)
  Z_{i,j} = max(X_{i,j-1} + a, Y_{i,j-1} + a, Z_{i,j-1}) + b            (3)

Here, X_{i,j} is the optimal alignment score up to R_i and Q_j ending with a match, Y_{i,j} is the optimal score ending with a deletion, and Z_{i,j} is the optimal score ending with an insertion.

This algorithm is equivalent but more efficient (fewer CPU instructions):

Algorithm B

  Y_{i,j} = max(W_{i-1,j} + a, Y_{i-1,j}) + b                           (4)
  Z_{i,j} = max(W_{i,j-1} + a, Z_{i,j-1}) + b                           (5)
  W_{i,j} = max(W_{i-1,j-1} + S(R_i, Q_j), Y_{i,j}, Z_{i,j})            (6)

Here, W_{i,j} is the optimal alignment score ending with anything. Interestingly, this is the original algorithm described by Gotoh [3]. It can be made even more efficient by some reorganization [1].

Pair hidden Markov models

The Durbin et al. textbook describes some pHMMs, and demonstrates that finding the most probable path is equivalent to classic maximum-score alignment [2]. Figure 1 shows pHMMs that differ from those of Durbin et al. in some interesting ways:

- They allow insertions next to deletions.
- They allow insertions next to insertions, and deletions next to deletions. For example, a length-2 deletion next to a length-3 deletion.

This makes no difference to the Viterbi (maximum score) algorithm, because (e.g.) a length-5 deletion has a better score than a length-2 plus a length-3 deletion. It does make a difference, however, to the Forward algorithm. The paths through these pHMMs are reflected in Algorithm B, rather than Algorithm A.

Score parameters in terms of model parameters

The Viterbi (maximum likelihood) algorithms for these pHMMs can be cast in the same form as maximum-score alignment, by using these formulas:

  S(x, y) = t ln[ π_xy (1 − 2δ − τ) / (φ_x ψ_y (1 − ε)^2) ]             (7)
  a = t ln[ δ (1 − ε) / ε ]                                             (8)
  b = t ln[ ε / (1 − ε) ]                                               (9)
Here, t is an arbitrary scale factor. (If we multiply all the score parameters by a constant factor, it makes no difference to the alignment.)

Local alignment

Initialization

  W_{0,0} = 0                                                           (10)
  W_{i,0} = 0,  Y_{i,0} = Z_{i,0} = −∞                                  (11)
  W_{0,j} = 0,  Y_{0,j} = Z_{0,j} = −∞                                  (12)

Recurrence

  X_{i,j} = W_{i-1,j-1} + S(R_i, Q_j)                                   (13)
  Y_{i,j} = max(W_{i-1,j} + a, Y_{i-1,j}) + b                           (14)
  Z_{i,j} = max(W_{i,j-1} + a, Z_{i,j-1}) + b                           (15)
  W_{i,j} = max(X_{i,j}, Y_{i,j}, Z_{i,j}, 0)                           (16)

Termination

  Optimal alignment score = max_{i,j} W_{i,j}                           (17)

Semi-global (short-in-long) alignment

Initialization

  W_{0,0} = 0                                                           (18)
  W_{i,0} = 0,  Y_{i,0} = Z_{i,0} = −∞                                  (19)
  W_{0,j} = Y_{0,j} = Z_{0,j} = −∞                                      (20)

Recurrence

  X_{i,j} = W_{i-1,j-1} + S(R_i, Q_j)                                   (21)
  Y_{i,j} = max(W_{i-1,j} + a, Y_{i-1,j}) + b                           (22)
  Z_{i,j} = max(W_{i,j-1} + a, Z_{i,j-1}) + b                           (23)
  W_{i,j} = max(X_{i,j}, Y_{i,j}, Z_{i,j})                              (24)

Termination

  Optimal alignment score = max_i W_{i,n}                               (25)

References

[1] M. Cameron, H. E. Williams, and A. Cannane. Improved gapped alignment in BLAST. IEEE/ACM Trans Comput Biol Bioinform, 1(3):116–129, 2004.
Figure 1: Pair hidden Markov models. A: Semi-global (short-in-long) model. B: Local model. C: Null model. States labeled M emit aligned bases x:y with probability π_xy. States labeled D emit reference bases x with probability φ_x. States labeled I emit query bases y with probability ψ_y.
[2] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.

[3] O. Gotoh. An improved algorithm for matching biological sequences. J. Mol. Biol., 162(3):705–708, Dec 1982.
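Example implementation

To make the formulas above concrete, here is a small Python sketch (not part of the original note): scores_from_model derives the score parameters (S, a, b) from pHMM parameters via equations (7)–(9), and local_alignment_score runs Algorithm B with the local-alignment initialization and termination, equations (10)–(17). The function names, and the made-up parameter values mentioned in the comments, are illustrative assumptions, not definitions from this note.

```python
import math

# Illustrative sketch: scores from pHMM parameters (equations 7-9) plus
# Algorithm B's local-alignment recurrences (equations 10-17).
# Function names and any example values are assumptions for illustration.

NEG_INF = float("-inf")


def scores_from_model(pi, phi, psi, delta, epsilon, tau, t=1.0):
    """Score parameters from pHMM parameters, equations (7)-(9).

    pi[x, y]: aligned-pair emission probabilities; phi[x], psi[y]:
    reference/query base probabilities; delta, epsilon, tau: gap open,
    gap extend, and end probabilities; t: arbitrary scale factor.
    """
    S = {(x, y): t * math.log(pi[x, y] * (1 - 2 * delta - tau)
                              / (phi[x] * psi[y] * (1 - epsilon) ** 2))
         for x in phi for y in psi}                     # eq (7)
    a = t * math.log(delta * (1 - epsilon) / epsilon)   # eq (8), gap existence
    b = t * math.log(epsilon / (1 - epsilon))           # eq (9), gap extension
    return S, a, b


def local_alignment_score(R, Q, S, a, b):
    """Optimal local alignment score: Algorithm B, equations (10)-(17)."""
    m, n = len(R), len(Q)
    W = [[0.0] * (n + 1) for _ in range(m + 1)]       # eqs (10)-(12):
    Y = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # W borders 0,
    Z = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # Y, Z borders -inf
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            X = W[i - 1][j - 1] + S[R[i - 1], Q[j - 1]]          # eq (13)
            Y[i][j] = max(W[i - 1][j] + a, Y[i - 1][j]) + b      # eq (14)
            Z[i][j] = max(W[i][j - 1] + a, Z[i][j - 1]) + b      # eq (15)
            W[i][j] = max(X, Y[i][j], Z[i][j], 0.0)              # eq (16)
            best = max(best, W[i][j])                            # eq (17)
    return best
```

The DP function also works with a hand-picked scoring scheme instead of one derived from a pHMM; for instance, with match +1, mismatch −1, a = −3, b = −1, aligning "TTACGT" against "ACGT" gives a best local score of 4 (the exact "ACGT" substring match). The semi-global variant, equations (18)–(25), differs only in the border initialization, the omission of the 0 in the W maximum, and taking the maximum over the last column.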