E-value Estimation for Non-Local Alignment Scores

1 Wadsworth Center, New York State Department of Health
2 Department of Computer Science, Rensselaer Polytechnic Institute

April 13, 2011
Janelia Farm Research Campus, Howard Hughes Medical Institute

The Problem

Local alignment scores are easy enough
  A Gumbel distribution (Karlin & Altschul (1990) statistics) applies well enough for local alignment scores, even with forward scores instead of Viterbi scores (Eddy, 2008).

but

Non-local alignment scores are harder
  Kann et al. (2007), Eddy (2008), and others show that something else is needed for global and glocal alignment scores.
  Newberg (2009) shows that something else is needed for true positive rates, even for local alignment scores.
  Unihit vs. multihit?

Pictures

$p \approx 1 - \exp(-K e^{-\lambda s}) \approx K e^{-\lambda s}$, so $\log_{10}(p)$ vs. $s$ is straight.

[Figure: plots of $\log_{10}(p)$ vs. $s$ for Viterbi protein alignment (BLOSUM62, -12, -1); three panels, each shown over two score ranges.]

Looking for BLOSUM62 slope $= -0.347 \log_{10}(e) = -0.151$.
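The straight-line behaviour can be checked numerically. The sketch below evaluates the Gumbel-form tail, its small-p approximation, and the implied slope of log10(p) against s. The function names are illustrative, and any value of K passed in is an assumed placeholder; only λ = 0.347 comes from the slide.

```python
import math

def gumbel_tail(s, K, lam):
    """p = 1 - exp(-K * exp(-lam * s)), computed stably for tiny p."""
    return -math.expm1(-K * math.exp(-lam * s))

def approx_tail(s, K, lam):
    """Small-p approximation p ~ K * exp(-lam * s)."""
    return K * math.exp(-lam * s)

def log10_slope(lam):
    """Slope of log10(p) against s in the small-p regime."""
    return -lam * math.log10(math.e)

def demo(K=0.1, lam=0.347):
    """K = 0.1 is an assumed illustrative value, not taken from the slides."""
    rows = [(s, gumbel_tail(s, K, lam), approx_tail(s, K, lam)) for s in (20, 40, 80)]
    return rows, round(log10_slope(lam), 3)
```

Calling `demo()` returns the exact and approximate tails at a few scores together with the rounded slope, which should agree with the value the slide is "looking for".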

Overview of Technique

Q: Where did those pretty pictures come from?
A: Simulations using importance sampling: instead of naïve sampling, draw samples from a distribution biased towards higher scores, and correct for the bias.

The technique is applicable to hidden Markov models and their non-normalized generalization, hidden Boltzmann models (e.g., thermodynamic partition functions).

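Choice of Distribution

[Figure: an HMM for flipping a biased coin, with a Start state, emitting states H and T, a Terminal state, and edge labels a, b, c, d.]

[Figure: a Plan7 profile-HMM (Eddy, 2003), with states S, N, B, M1-M4, I1-I3, D1-D4, E, J, C, T.]

Also: Viterbi vs. Forward, and Smith & Johnson (2007).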

Let $D$ represent a sequence of $L$ emissions from a hidden Markov model.

Naïve Sampling
For the statistical significance of a score $s_0$:

$$p(s_0) = \sum_{\text{all } D} \Pr_{\text{null}}(D)\,\Theta(s(D) \ge s_0) \approx \frac{1}{N} \sum_{D \sim \Pr_{\text{null}}} \Theta(s(D) \ge s_0),$$

where $\Theta(\text{true}) = 1$ and $\Theta(\text{false}) = 0$. Need $O(1/p)$ samples for a small p-value.
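A minimal sketch of the naïve estimator, using the biased-coin setting as a stand-in for an HMM. Here a "sequence" is a list of coin flips and the score is the number of heads; both are assumptions made for illustration, chosen so that the exact tail is available for comparison.

```python
import math
import random

def naive_pvalue(score, draw_null, s0, n, rng):
    """Fraction of n null samples whose score reaches s0."""
    hits = sum(1 for _ in range(n) if score(draw_null(rng)) >= s0)
    return hits / n

def exact_binomial_tail(length, q, s0):
    """Exact P(#heads >= s0) for `length` flips with head probability q."""
    return sum(math.comb(length, k) * q**k * (1 - q)**(length - k)
               for k in range(s0, length + 1))

def demo(length=20, q=0.5, s0=15, n=20000, seed=0):
    """Compare the naive estimate with the exact tail for a toy coin model."""
    rng = random.Random(seed)

    def draw(r):
        # One null sample: `length` flips, True meaning heads.
        return [r.random() < q for _ in range(length)]

    return naive_pvalue(sum, draw, s0, n, rng), exact_binomial_tail(length, q, s0)
```

With p near 0.02, a few tens of thousands of samples already give only a few percent relative accuracy. Pushing `s0` higher makes most runs return zero hits, which is the O(1/p) cost in practice.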

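$$p(s_0) = \sum_{\text{all } D} \Pr_{\text{null}}(D)\,\Theta(s(D) \ge s_0) \approx \frac{1}{N} \sum_{D \sim \Pr_{\text{null}}} \Theta(s(D) \ge s_0).$$

$$p(s_0) = \sum_{\text{all } D} \Pr_T(D)\,\frac{\Pr_{\text{null}}(D)}{\Pr_T(D)}\,\Theta(s(D) \ge s_0) \approx \frac{1}{N} \sum_{D \sim \Pr_T} \frac{\Pr_{\text{null}}(D)}{\Pr_T(D)}\,\Theta(s(D) \ge s_0).$$

Importance sampling is the more efficient estimator when $\Pr_T$ is chosen well; we need 100 samples, even for p = 1 4.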
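The reweighting can be illustrated on the same kind of toy coin model, where the exact answer is known. In this sketch the null is a coin with head probability `q`, the sampling distribution is a coin with head probability `r`, and the score is the head count; these are assumptions made for illustration and are not the HMM-based distribution the slides go on to construct.

```python
import math
import random

def importance_pvalue(length, q, r, s0, n, rng):
    """Estimate P(#heads >= s0) under head probability q by sampling with r."""
    total = 0.0
    for _ in range(n):
        k = sum(rng.random() < r for _ in range(length))  # heads drawn under Pr_T
        if k >= s0:
            # Importance weight Pr_null(D) / Pr_T(D) for the sampled sequence.
            total += (q / r) ** k * ((1 - q) / (1 - r)) ** (length - k)
    return total / n

def exact_tail(length, q, s0):
    """Exact binomial tail, used only to check the estimate."""
    return sum(math.comb(length, k) * q**k * (1 - q)**(length - k)
               for k in range(s0, length + 1))
```

For example, `importance_pvalue(50, 0.5, 0.9, 45, 5000, random.Random(1))` targets a tail probability of roughly 2e-9, which naïve sampling would need on the order of a billion draws to see even once.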

Q: What's the best $\Pr_T$ for use with

$$p(s_0) \approx \frac{1}{N} \sum_{D \sim \Pr_T} \frac{\Pr_{\text{null}}(D)}{\Pr_T(D)}\,\Theta(s(D) \ge s_0)?$$

A: Want to minimize variance so, ideally, $\Pr_T(D) \propto \Pr_{\text{null}}(D)\,\Theta(s(D) \ge s_0)$. Settle for $\Pr_T$ giving most scores near $s_0$.

We need a way to make high scores ($\ge s_0$) more probable than under the null model.
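To see why that ideal choice minimizes the variance, normalize it and substitute it into the summand. The short derivation below is a sketch using only the definitions already given; the superscript "ideal" is a label introduced here.

```latex
\[
\Pr\nolimits_T^{\text{ideal}}(D)
  = \frac{\Pr_{\text{null}}(D)\,\Theta(s(D) \ge s_0)}{p(s_0)}
\quad\Longrightarrow\quad
\frac{\Pr_{\text{null}}(D)}{\Pr\nolimits_T^{\text{ideal}}(D)}\,\Theta(s(D) \ge s_0)
  = p(s_0)
\quad \text{for every } D \sim \Pr\nolimits_T^{\text{ideal}} .
\]
```

Every sampled term then equals $p(s_0)$ exactly, so the estimator has zero variance. The catch is that the normalizer is the unknown $p(s_0)$ itself, which is why a tractable distribution concentrated near $s_0$ is used instead.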

We define

$$\Pr_T(D) = \frac{Z(D)}{Z},$$

where $Z$ is a normalizing constant, $Z = \sum_D Z(D)$. We define $Z(D)$, for some temperature $T$, with

$$Z(D) = \Pr_{\text{null}}(D) \sum_{\pi} \left( \frac{\Pr_{\text{HMM}}(\pi, D)}{\Pr_{\text{null}}(D)} \right)^{1/T},$$

where $\pi$ is summed over paths through the HMM.

$T = \infty$: drawing from the null distribution.
$T > 1$: interpolating between null and alternative.
$T = 1$: drawing from the alternative distribution.
$T < 1$: extrapolating beyond the alternative distribution.
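The effect of the temperature is easiest to see in a degenerate model with a single state, so that there is only one path and the sum over π has one term. The sketch below makes that simplifying assumption; the function name and the letter probabilities in the usage note are hypothetical.

```python
import math

def z_of_d(seq, null, emit, temperature):
    """Z(D) for a one-state model: Pr_null(D) * (Pr_model(D) / Pr_null(D)) ** (1/T)."""
    log_null = sum(math.log(null[d]) for d in seq)
    log_model = sum(math.log(emit[d]) for d in seq)
    # Work in log space so long sequences do not underflow prematurely.
    return math.exp(log_null + (log_model - log_null) / temperature)
```

With `null = {"a": 0.5, "b": 0.5}` and `emit = {"a": 0.9, "b": 0.1}`, calling `z_of_d("aab", null, emit, 1.0)` gives the model probability of the sequence, a very large temperature approaches its null probability, and intermediate temperatures fall in between.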

Why this distribution? It gives scores near $s_0$, and we can exactly sample $D \sim \Pr_T$ using an HMM forward-backward algorithm.

0. Forward: Calculate normalizing constant $Z$, once.
1. Backward: Sample sequences, $D \sim \Pr_T$.
2. Forward: Calculate $s(D)$ for each sampled $D$.
3. Forward: Calculate $Z(D)$ for each sampled $D$.
4. Use $\Pr_T(D) = Z(D)/Z$ in

$$p(s_0) \approx \frac{1}{N} \sum_{D \sim \Pr_T} \frac{\Pr_{\text{null}}(D)}{\Pr_T(D)}\,\Theta(s(D) \ge s_0).$$

Before, slower: Wolfsheimer et al. (2007) used importance sampling, but needed Metropolis-coupled Markov chain Monte Carlo (MCMCMC) for the actual sampling.

Zeroth forward algorithm

We compute $Z$ in the zeroth forward algorithm, once.

1. For each emitter $E$ in the HMM and each letter $d$, replace the emission probability $E_d$ with the unnormalized

$$E'_d = \Pr_{\text{null}}(d) \left( \frac{E_d}{\Pr_{\text{null}}(d)} \right)^{1/T}.$$

   (Note: $E_d = 0$ should be treated as $E_d = \epsilon > 0$.)
2. To effect the sum over all sequences $D$, in lieu of choosing each emission in the forward calculation, use $E' = \sum_d E'_d$.
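The two steps can be sketched for a small fixed-length HMM. Several things here are assumptions made for illustration rather than details from the slides: the HMM is represented as an initial distribution, a transition matrix, and one emission dictionary per state; every state emits one letter per step; there is no explicit end state; and transition probabilities are left unmodified. All function names are hypothetical.

```python
import itertools

def tilt(emit, null, temperature, eps=1e-12):
    """E'_d = Pr_null(d) * (E_d / Pr_null(d)) ** (1/T), with zeros replaced by eps."""
    return [{d: null[d] * (max(e[d], eps) / null[d]) ** (1.0 / temperature)
             for d in null} for e in emit]

def forward_z(init, trans, tilted, length):
    """Zeroth forward pass: sum over all sequences by using E' = sum_d E'_d."""
    esum = [sum(t.values()) for t in tilted]
    f = [init[k] * esum[k] for k in range(len(init))]
    for _ in range(length - 1):
        f = [sum(f[j] * trans[j][k] for j in range(len(f))) * esum[k]
             for k in range(len(f))]
    return sum(f)

def forward_zd(init, trans, tilted, seq):
    """Z(D) for one sequence: the same recursion with a fixed letter per position."""
    f = [init[k] * tilted[k][seq[0]] for k in range(len(init))]
    for d in seq[1:]:
        f = [sum(f[j] * trans[j][k] for j in range(len(f))) * tilted[k][d]
             for k in range(len(f))]
    return sum(f)

def brute_force_z(init, trans, tilted, letters, length):
    """Check value: add Z(D) over every sequence explicitly (tiny models only)."""
    return sum(forward_zd(init, trans, tilted, seq)
               for seq in itertools.product(letters, repeat=length))
```

The single pass in `forward_z` costs the same as one ordinary forward calculation, whereas `brute_force_z` grows exponentially with the sequence length and is included only to confirm that the two agree.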

The backward algorithm

Sample a sequence $D \sim \Pr_T$ by
1. backsampling a path $\pi$ through the forward $Z$ calculation in the usual unnormalized way; and
2. as each emitter is encountered, also choosing the emitted letter $d$ with probability proportional to $E'_d$.

Repeat 100 times.
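A sketch of the backsampling step for a small fixed-length HMM follows. As assumptions for illustration, the model is given as an initial distribution, a transition matrix, and one dictionary of unnormalized emission weights per state, and there is no explicit end state. The function names are hypothetical.

```python
import random

def forward_table(init, trans, tilted, length):
    """Forward columns for the sum over all sequences (each emitter uses sum_d E'_d)."""
    esum = [sum(t.values()) for t in tilted]
    n = len(init)
    table = [[init[k] * esum[k] for k in range(n)]]
    for _ in range(length - 1):
        prev = table[-1]
        table.append([sum(prev[j] * trans[j][k] for j in range(n)) * esum[k]
                      for k in range(n)])
    return table

def pick(weights, rng):
    """Draw an index with probability proportional to the given weights."""
    return rng.choices(range(len(weights)), weights=weights)[0]

def backsample(table, trans, tilted, rng):
    """Walk the forward table from the last column to the first, sampling states and letters."""
    letters = sorted(tilted[0])
    k = pick(table[-1], rng)                    # final state, weighted by forward mass
    seq = []
    for t in range(len(table) - 1, -1, -1):
        # Emit a letter from the current state, proportional to E'_d.
        seq.append(letters[pick([tilted[k][d] for d in letters], rng)])
        if t > 0:
            # Choose the predecessor state, proportional to its forward mass times the transition.
            k = pick([table[t - 1][j] * trans[j][k] for j in range(len(trans))], rng)
    seq.reverse()
    return "".join(seq)
```

Because each choice is proportional to the mass the forward pass accumulated, a sequence D is returned with probability Z(D)/Z, with no rejection step and no Markov chain burn-in.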

The first and second forward algorithms

1. For each of the sampled sequences $D$, use the unmodified HMM to compute $s(D)$.
2. For each of the sampled sequences $D$, evaluate its unnormalized probability $Z(D)$ using a forward calculation with the unnormalized emission probabilities

$$E'_d = \Pr_{\text{null}}(d) \left( \frac{E_d}{\Pr_{\text{null}}(d)} \right)^{1/T}.$$

Putting it all together

The importance sampling sum is

$$p(s_0) \approx \frac{1}{N} \sum_{D \sim \Pr_T} \frac{Z\,\Pr_{\text{null}}(D)}{Z(D)}\,\Theta(s(D) \ge s_0).$$
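The whole procedure can be sketched end to end on a toy model small enough to check by enumeration. Several choices here are assumptions made for illustration: the HMM has a fixed length with no end state, only emission probabilities are tilted, the score is taken to be the forward log-odds log(Pr_HMM(D)/Pr_null(D)), and all function names are hypothetical.

```python
import itertools
import math
import random

def forward(init, trans, emit, steps):
    """Forward table; a step of None sums the emitter over all letters."""
    n = len(init)
    def e(k, d):
        return sum(emit[k].values()) if d is None else emit[k][d]
    table = [[init[k] * e(k, steps[0]) for k in range(n)]]
    for d in steps[1:]:
        prev = table[-1]
        table.append([sum(prev[j] * trans[j][k] for j in range(n)) * e(k, d)
                      for k in range(n)])
    return table

def pick(weights, rng):
    """Draw an index with probability proportional to the given weights."""
    return rng.choices(range(len(weights)), weights=weights)[0]

def backsample(table, trans, tilted, rng):
    """Sample a sequence D with probability Z(D)/Z from the zeroth forward table."""
    letters = sorted(tilted[0])
    k = pick(table[-1], rng)
    seq = []
    for t in range(len(table) - 1, -1, -1):
        seq.append(letters[pick([tilted[k][d] for d in letters], rng)])
        if t > 0:
            k = pick([table[t - 1][j] * trans[j][k] for j in range(len(trans))], rng)
    return seq[::-1]

def score(init, trans, emit, null, seq):
    """Assumed score: forward log-odds of the unmodified HMM against the null."""
    p_hmm = sum(forward(init, trans, emit, seq)[-1])
    return math.log(p_hmm) - sum(math.log(null[d]) for d in seq)

def estimate_pvalue(init, trans, emit, null, length, s0, temperature, n, rng):
    tilted = [{d: null[d] * (e[d] / null[d]) ** (1.0 / temperature) for d in null}
              for e in emit]
    table = forward(init, trans, tilted, [None] * length)      # zeroth forward, once
    z = sum(table[-1])
    total = 0.0
    for _ in range(n):
        seq = backsample(table, trans, tilted, rng)             # backward
        if score(init, trans, emit, null, seq) >= s0:           # first forward
            zd = sum(forward(init, trans, tilted, seq)[-1])     # second forward
            total += z * math.prod(null[d] for d in seq) / zd
    return total / n

def exact_pvalue(init, trans, emit, null, length, s0):
    """Enumerate every sequence; feasible only for tiny alphabets and lengths."""
    return sum(math.prod(null[d] for d in seq)
               for seq in itertools.product(sorted(null), repeat=length)
               if score(init, trans, emit, null, seq) >= s0)
```

The second forward pass is skipped for samples below the threshold, since they contribute zero to the sum. The toy emissions contain no zeros, so the ε guard is omitted here.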

Temperature, Calibrations, Interpolations

Temperature
Temperature is chosen in an ad hoc way. Heuristic: want 2 6% of samples to have $s(D) \ge s_0$.
Calibration curves are specific to $L$. Current research: generalizing across values of $L$.

Run-time
201-plus forward calculations for each time we want a p-value is still too slow.
Current research: pre-compute points on $p(s, L)$ surfaces. Use interpolation and extrapolation.
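One way such a heuristic could be automated is a one-dimensional search over T for a target fraction of samples above the threshold. The sketch below does this for a one-state coin model, where that fraction is available in closed form. The model, the head-count threshold `k0`, the `target` value, and the bisection bounds are all assumptions made for illustration, not the calibration actually used.

```python
import math

def tilted_head_prob(q, e, temperature):
    """Head probability after tilting: proportional to q * (e / q) ** (1/T)."""
    h = q * (e / q) ** (1.0 / temperature)
    t = (1 - q) * ((1 - e) / (1 - q)) ** (1.0 / temperature)
    return h / (h + t)

def fraction_at_least(length, k0, head_prob):
    """Probability that at least k0 of `length` flips are heads."""
    return sum(math.comb(length, k) * head_prob**k * (1 - head_prob)**(length - k)
               for k in range(k0, length + 1))

def choose_temperature(length, k0, q, e, target, lo=0.05, hi=50.0, iters=60):
    """Bisect on log T; with e > q the fraction above threshold falls as T rises."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        frac = fraction_at_least(length, k0, tilted_head_prob(q, e, mid))
        if frac > target:
            lo = mid   # too many samples above threshold: move toward the null
        else:
            hi = mid   # too few: move toward, or beyond, the alternative
    return math.sqrt(lo * hi)
```

For instance, `choose_temperature(50, 40, 0.5, 0.7, 0.4)` returns a temperature below 1, meaning that for this threshold the sampler has to extrapolate beyond the alternative distribution. In a real HMM the fraction would have to be estimated from pilot samples rather than computed exactly.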

Conclusions

General applicability
A few hundred forward calculations provide a precise p-value estimate for any sort of alignment.
Current research: reduce that to an average of ten forward calculations.
Additional savings: only the best results need a precise p-value.

Reading
Newberg (2008): Smith-Waterman sequence alignments
Newberg (2009): Hidden Markov / Boltzmann models
Newberg & Lawrence (2009): Integer/Score distributions
See http://www.rpi.edu/~newbel/publications/.

Acknowledgments: Chip Lawrence; Sean Eddy; NIH; Health Research, Inc.; NSF.

References

Eddy, S. R. (2003) HMMER User's Guide: Biological sequence analysis using profile hidden Markov models. Howard Hughes Medical Institute and Dept. of Genetics, Washington University School of Medicine, Saint Louis, MO, 2.3.2 edition.

Eddy, S. R. (2008) A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol, 4 (5), e1000069. pmid: 18516236, doi: 10.1371/journal.pcbi.1000069.

Kann, M. G., Sheetlin, S. L., Park, Y., Bryant, S. H. & Spouge, J. L. (2007) The identification of complete domains within protein sequences using accurate E-values for semi-global alignment. Nucleic Acids Res, 35 (14), 4678–4685. pmid: 17596268, doi: 10.1093/nar/gkm414.

Karlin, S. & Altschul, S. F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A, 87, 2264–2268. pmid: 2315319, doi: 10.1073/pnas.87.6.2264.

Newberg, L. A. (2008) Significance of gapped sequence alignments. J Comput Biol, 15 (9), 1187–1194. pmid: 18973434, pmcid: PMC273773, doi: 10.1089/cmb.2008.0125.

Newberg, L. A. (2009) Error statistics of hidden Markov model and hidden Boltzmann model results. BMC Bioinf, 10, article 212. pmid: 19589158, pmcid: PMC2722652, doi: 10.1186/1471-2105-10-212.

Newberg, L. A. & Lawrence, C. E. (2009) Exact calculation of distributions on integers, with application to sequence alignment. J Comput Biol, 16 (1), 1–18. pmid: 19119992, pmcid: PMC2858568, doi: 10.1089/cmb.2008.0137.

Smith, N. A. & Johnson, M. (2007) Weighted and probabilistic context-free grammars are equally expressive. Comput Linguistics, 33 (4), 477–491. doi: 10.1162/coli.2007.33.4.477.

Wolfsheimer, S., Burghardt, B. & Hartmann, A. K. (2007) Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail. Algorithms Mol Biol, 2, article 9. pmid: 1762518, doi: 10.1186/1748-7188-2-9.