E-value Estimation for Non-Local Alignment Scores


Affiliations: (1) Wadsworth Center, New York State Department of Health; (2) Department of Computer Science, Rensselaer Polytechnic Institute

April 13, 2011
Janelia Farm Research Campus, Howard Hughes Medical Institute

The Problem

Local alignment scores are easy enough. A Gumbel distribution (Karlin & Altschul (1990) statistics) applies well enough for local alignment scores, even with Forward scores instead of Viterbi scores (Eddy, 2008).

But non-local alignment scores are harder. Kann et al. (2007), Eddy (2008), and others show that something else is needed for global and glocal alignment scores. Newberg (2009) shows that something else is needed for true positive rates, even for local alignment scores. Unihit vs. multihit?

Pictures

$$p \approx 1 - \exp\left(-K e^{-\lambda s}\right) \approx K e^{-\lambda s},$$

so $\log_{10}(p)$ vs. $s$ is straight.

[Figure: plots of $\log_{10}(p)$ against score $s$ for Viterbi protein alignment (BLOSUM62, -12, -1); two sets of three panels.]

Looking for BLOSUM62 slope $= -0.347 \log_{10}(e) = -0.151$.
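As a rough numerical check of the straight-line claim, the sketch below compares the Gumbel p-value with its tail approximation and computes the expected slope. Only $\lambda = 0.347$ comes from the slide; the value `K = 0.1` and the function names are assumptions made for illustration.

```python
import math

LAMBDA = 0.347   # lambda quoted on the slide for BLOSUM62
K = 0.1          # hypothetical value, chosen only for illustration


def gumbel_pvalue(s, k=K, lam=LAMBDA):
    """p = 1 - exp(-K e^{-lambda s}), computed stably with expm1."""
    return -math.expm1(-k * math.exp(-lam * s))


def tail_log10_pvalue(s, k=K, lam=LAMBDA):
    """log10 of the tail approximation p ~ K e^{-lambda s}."""
    return math.log10(k) - lam * s * math.log10(math.e)


# Slope of log10(p) against s in the tail.
slope = -LAMBDA * math.log10(math.e)

if __name__ == "__main__":
    print(f"slope = {slope:.3f}")
    for s in (20, 40, 60):
        print(s, math.log10(gumbel_pvalue(s)), tail_log10_pvalue(s))
```

For large scores the two columns should agree closely, which is why a departure from the straight line signals that the Gumbel form no longer applies.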

Overview of Technique

Q: Where did those pretty pictures come from?

A: Simulations using importance sampling: instead of naïve sampling, draw samples from a distribution biased towards higher scores, and correct for the bias.

The technique is applicable to hidden Markov models and their non-normalized generalization, hidden Boltzmann models (e.g., thermodynamic partition functions).

Choice of Distribution

[Figure: "Flipping a biased coin", an HMM with states Start, H, T, and Terminal.]

[Figure: A Plan7 profile-HMM (Eddy, 2003), with states S, N, B, M1 to M4, I1 to I3, D1 to D4, E, C, T, and J.]

Also: Viterbi vs. Forward, and Smith & Johnson (2007).

Let $D$ represent a sequence of $L$ emissions from a hidden Markov model.

Naïve Sampling

For the statistical significance of a score $s_0$:

$$p(s_0) = \sum_{\text{all } D} \Pr\nolimits_{\text{null}}(D)\, \Theta(s(D) \ge s_0) \approx \frac{1}{N} \sum_{D \sim \Pr_{\text{null}}} \Theta(s(D) \ge s_0),$$

where $\Theta(\text{true}) = 1$ and $\Theta(\text{false}) = 0$.

Need $O(1/p)$ samples for a small p-value.
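A minimal sketch of the naïve estimator follows. The null model here is a toy chosen so that the exact answer is known (fair coin flips scored by the number of heads); it is an assumption for illustration and not the HMM setting of the talk.

```python
import math
import random


def naive_pvalue(score, sample_null, s0, n_samples, rng):
    """Fraction of null samples whose score reaches s0."""
    hits = sum(score(sample_null(rng)) >= s0 for _ in range(n_samples))
    return hits / n_samples


# Toy null model (an assumption for illustration): L fair coin flips,
# with the score defined as the number of heads.
L = 20


def sample_null(rng):
    return [rng.random() < 0.5 for _ in range(L)]


def score(D):
    return sum(D)


def exact_pvalue(s0):
    """Exact binomial tail, available only because the toy is so simple."""
    return sum(math.comb(L, k) for k in range(s0, L + 1)) / 2 ** L


if __name__ == "__main__":
    rng = random.Random(0)
    print(naive_pvalue(score, sample_null, 15, 20000, rng), exact_pvalue(15))
```

With `s0 = 20` the exact p-value is $2^{-20}$, about $10^{-6}$, so 20,000 naïve samples would rarely contain even one hit. That is the $O(1/p)$ cost in practice.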

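Naïve sampling:

$$p(s_0) = \sum_{\text{all } D} \Pr\nolimits_{\text{null}}(D)\, \Theta(s(D) \ge s_0) \approx \frac{1}{N} \sum_{D \sim \Pr_{\text{null}}} \Theta(s(D) \ge s_0).$$

Importance sampling:

$$p(s_0) = \sum_{\text{all } D} \Pr\nolimits_T(D)\, \frac{\Pr_{\text{null}}(D)}{\Pr_T(D)}\, \Theta(s(D) \ge s_0) \approx \frac{1}{N} \sum_{D \sim \Pr_T} \frac{\Pr_{\text{null}}(D)}{\Pr_T(D)}\, \Theta(s(D) \ge s_0).$$

Importance sampling is the more efficient estimator when $\Pr_T$ is chosen well; we need 100 samples, even for $p = 10^{-40}$.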
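The importance sampling estimator can be sketched on the same kind of toy problem. Here the null model is 100 fair coin flips, the score is the number of heads, and the biased distribution is a coin with heads probability `Q`; all of these are assumptions made so that the estimate can be compared with an exact binomial tail.

```python
import math
import random

L = 100     # number of coin flips (toy model, an assumption)
S0 = 80     # score threshold: at least 80 heads
Q = 0.8     # heads probability of the biased sampling distribution


def sample_biased(rng):
    """Draw the number of heads in L flips of a coin with Pr(heads) = Q."""
    return sum(rng.random() < Q for _ in range(L))


def weight(k):
    """Pr_null(D) / Pr_T(D) for a sequence D with k heads."""
    log_null = L * math.log(0.5)
    log_biased = k * math.log(Q) + (L - k) * math.log(1.0 - Q)
    return math.exp(log_null - log_biased)


def importance_pvalue(n_samples, rng):
    total = 0.0
    for _ in range(n_samples):
        k = sample_biased(rng)
        if k >= S0:                 # Theta(s(D) >= s0)
            total += weight(k)      # correct for the sampling bias
    return total / n_samples


def exact_pvalue():
    return sum(math.comb(L, k) for k in range(S0, L + 1)) / 2 ** L


if __name__ == "__main__":
    print(importance_pvalue(5000, random.Random(0)), exact_pvalue())
```

The exact p-value in this toy is below $10^{-8}$, so naïve sampling would need far more than a few thousand draws to see a single hit, while the biased coin puts roughly half of its samples above the threshold.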

Q: What's the best $\Pr_T$ for use with

$$p(s_0) \approx \frac{1}{N} \sum_{D \sim \Pr_T} \frac{\Pr_{\text{null}}(D)}{\Pr_T(D)}\, \Theta(s(D) \ge s_0)?$$

A: Want to minimize variance so, ideally, $\Pr_T(D) \propto \Pr_{\text{null}}(D)\, \Theta(s(D) \ge s_0)$. Settle for $\Pr_T$ giving most scores near $s_0$.

We need a way to make high scores ($\ge s_0$) more probable than under the null model.

We define

$$\Pr\nolimits_T(D) = \frac{Z(D)}{Z},$$

where $Z$ is a normalizing constant, $Z = \sum_D Z(D)$. We define $Z(D)$, for some temperature $T$, with

$$Z(D) = \Pr\nolimits_{\text{null}}(D) \sum_{\pi} \left( \frac{\Pr_{\text{HMM}}(\pi, D)}{\Pr_{\text{null}}(D)} \right)^{1/T},$$

where $\pi$ is summed over paths through the HMM.

$T \to \infty$: drawing from the null distribution.
$T > 1$: interpolating between null and alternative.
$T = 1$: drawing from the alternative distribution.
$T < 1$: extrapolating beyond the alternative distribution.
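The two limiting cases can be checked directly from the definition of $Z(D)$. The short derivation below assumes that the HMM's probabilities over length-$L$ sequences sum to one, and it introduces $n_\pi(D)$ (a symbol not used on the slides) for the number of paths that have nonzero probability for $D$.

```latex
\begin{align*}
T = 1:\quad
  & Z(D) = \Pr\nolimits_{\text{null}}(D) \sum_{\pi}
      \frac{\Pr_{\text{HMM}}(\pi, D)}{\Pr_{\text{null}}(D)}
    = \sum_{\pi} \Pr\nolimits_{\text{HMM}}(\pi, D)
    = \Pr\nolimits_{\text{HMM}}(D),
  \qquad Z = \sum_{D} \Pr\nolimits_{\text{HMM}}(D) = 1, \\
T \to \infty:\quad
  & \left( \frac{\Pr_{\text{HMM}}(\pi, D)}{\Pr_{\text{null}}(D)} \right)^{1/T} \to 1
    \quad\Longrightarrow\quad
    Z(D) \to n_\pi(D)\, \Pr\nolimits_{\text{null}}(D).
\end{align*}
```

So $\Pr_T$ equals the alternative distribution at $T = 1$, and it approaches the null distribution as $T \to \infty$ whenever $n_\pi(D)$ is the same for every $D$.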

Why this distribution?

Gives scores near $s_0$, and we can exactly sample $D \sim \Pr_T$ using an HMM forward-backward algorithm.

0. Forward: Calculate normalizing constant $Z$, once.
1. Backward: Sample sequences, $D \sim \Pr_T$.
2. Forward: Calculate $s(D)$ for each sampled $D$.
3. Forward: Calculate $Z(D)$ for each sampled $D$.
4. Use $\Pr_T(D) = Z(D)/Z$ in

$$p(s_0) \approx \frac{1}{N} \sum_{D \sim \Pr_T} \frac{\Pr_{\text{null}}(D)}{\Pr_T(D)}\, \Theta(s(D) \ge s_0).$$

Before, slower: Wolfsheimer et al. (2007) used importance sampling, but needed Metropolis-coupled Markov chain Monte Carlo (MCMCMC) for the actual sampling.

Zeroth forward algorithm

We compute $Z$ in the zeroth forward algorithm, once.

1. For each emitter $E$ in the HMM and each letter $d$, replace the emission probability $E_d$ with the unnormalized

$$E'_d = \Pr\nolimits_{\text{null}}(d) \left( \frac{E_d}{\Pr_{\text{null}}(d)} \right)^{1/T}.$$

(Note: $E_d = 0$ should be treated as $E_d = \epsilon > 0$.)

2. To effect the sum over all sequences $D$, in lieu of choosing each emission in the forward calculation, use $E' = \sum_d E'_d$.
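The two steps above can be sketched for a small HMM that emits exactly one letter per state visit for $L$ steps. This simplified architecture, the parameter values, and all function names are assumptions for illustration; a Plan7 model with silent states would need a more general forward pass.

```python
import itertools


def tempered_emissions(emit, null, T, eps=1e-300):
    """E'_d = Pr_null(d) * (E_d / Pr_null(d)) ** (1/T), with E_d = 0 -> eps."""
    return [[null[d] * (max(e, eps) / null[d]) ** (1.0 / T)
             for d, e in enumerate(row)] for row in emit]


def forward_total(init, trans, emit_t, L):
    """Z: forward pass using E' = sum_d E'_d at every emitting state."""
    n = len(init)
    tot = [sum(row) for row in emit_t]
    f = [init[k] * tot[k] for k in range(n)]
    for _ in range(L - 1):
        f = [sum(f[j] * trans[j][k] for j in range(n)) * tot[k]
             for k in range(n)]
    return sum(f)


def forward_sequence(init, trans, emit_t, D):
    """Z(D): the same forward pass, but with the letters of D fixed."""
    n = len(init)
    f = [init[k] * emit_t[k][D[0]] for k in range(n)]
    for d in D[1:]:
        f = [sum(f[j] * trans[j][k] for j in range(n)) * emit_t[k][d]
             for k in range(n)]
    return sum(f)


# Hypothetical two-state, two-letter HMM used only for this check.
INIT = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.2, 0.8]]
EMIT = [[0.9, 0.1], [0.3, 0.7]]
NULL = [0.5, 0.5]

if __name__ == "__main__":
    L, T = 4, 2.0
    emit_t = tempered_emissions(EMIT, NULL, T)
    Z = forward_total(INIT, TRANS, emit_t, L)
    brute = sum(forward_sequence(INIT, TRANS, emit_t, D)
                for D in itertools.product(range(2), repeat=L))
    print(Z, brute)
```

Summing $E'_d$ over letters inside the recursion gives the same $Z$ as adding up $Z(D)$ over every sequence, but in one forward pass rather than an exponential enumeration.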

The backward algorithm

Sample a sequence $D \sim \Pr_T$ by

1. backsampling a path $\pi$ through the forward $Z$ calculation in the usual unnormalized way; and
2. as each emitter is encountered, also choosing the emitted letter $d$ with probability proportional to $E'_d$.

Repeat 100 times.
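A minimal sketch of the backsampling step follows, again assuming a simple HMM that emits one letter per state visit for a fixed number of steps. The table `EMIT_T` stands in for already-tempered, unnormalized emission values $E'_d$; its numbers and all names are hypothetical.

```python
import random


def forward_table(init, trans, emit_t, L):
    """Forward values f[t][k] with every emission summed out (E' = sum_d E'_d)."""
    n = len(init)
    tot = [sum(row) for row in emit_t]
    table = [[init[k] * tot[k] for k in range(n)]]
    for _ in range(L - 1):
        prev = table[-1]
        table.append([sum(prev[j] * trans[j][k] for j in range(n)) * tot[k]
                      for k in range(n)])
    return table


def sample_sequence(table, trans, emit_t, rng):
    """Backsample a path through the forward table, emitting a letter per state."""
    n = len(trans)
    L = len(table)
    # Last state: probability proportional to its forward value.
    state = rng.choices(range(n), weights=table[-1])[0]
    letters = []
    for t in range(L - 1, -1, -1):
        # Letter: probability proportional to E'_d at the current state.
        letters.append(rng.choices(range(len(emit_t[state])),
                                   weights=emit_t[state])[0])
        if t > 0:
            # Predecessor: proportional to f[t-1][j] * trans[j][state].
            w = [table[t - 1][j] * trans[j][state] for j in range(n)]
            state = rng.choices(range(n), weights=w)[0]
    letters.reverse()
    return tuple(letters)


# Hypothetical HMM and tempered emission table, for illustration only.
INIT = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.2, 0.8]]
EMIT_T = [[0.67, 0.22], [0.39, 0.59]]

if __name__ == "__main__":
    rng = random.Random(0)
    table = forward_table(INIT, TRANS, EMIT_T, 5)
    for _ in range(3):
        print(sample_sequence(table, TRANS, EMIT_T, rng))
```

Because each backward choice is weighted by the stored forward values, the normalizations telescope and every sequence $D$ is drawn with probability $Z(D)/Z$, with no Markov chain burn-in required.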

The first and second forward algorithms

1. For each of the sampled sequences $D$, use the unmodified HMM to compute $s(D)$.
2. For each of the sampled sequences $D$, evaluate its unnormalized probability $Z(D)$ using a forward calculation with the unnormalized emission probabilities

$$E'_d = \Pr\nolimits_{\text{null}}(d) \left( \frac{E_d}{\Pr_{\text{null}}(d)} \right)^{1/T}.$$

Putting it all together

The importance sampling sum is

$$p(s_0) \approx \frac{1}{N} \sum_{D \sim \Pr_T} \frac{Z\, \Pr_{\text{null}}(D)}{Z(D)}\, \Theta(s(D) \ge s_0).$$
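Once $Z$, $s(D)$, $\Pr_{\text{null}}(D)$, and $Z(D)$ are available for each sample, the sum above is a few lines of arithmetic. The sketch below works in log space so that very small p-values do not underflow; the function name and the tuple layout of `samples` are assumptions for illustration.

```python
import math


def log10_pvalue(log_Z, samples, s0):
    """log10 of (1/N) * sum of Z * Pr_null(D) / Z(D) over samples with s(D) >= s0.

    samples is a list of (score, log Pr_null(D), log Z(D)) tuples, one per
    sequence D drawn from Pr_T; all logarithms are natural logs.
    """
    logs = [log_Z + log_null - log_zd
            for score, log_null, log_zd in samples if score >= s0]
    if not logs:
        return float("-inf")        # no sample reached the threshold
    m = max(logs)
    # log-sum-exp keeps the sum stable when the terms are tiny.
    log_sum = m + math.log(sum(math.exp(x - m) for x in logs))
    return (log_sum - math.log(len(samples))) / math.log(10.0)


if __name__ == "__main__":
    # Two made-up samples: only the first one reaches s0 = 3.
    demo = [(5.0, math.log(0.01), math.log(0.2)),
            (1.0, math.log(0.1), math.log(0.1))]
    print(log10_pvalue(math.log(2.0), demo, 3.0))
```

Samples below the threshold still count in $N$; they contribute zero to the sum rather than being discarded.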

Temperature, Calibrations, Interpolations

Temperature

Temperature is chosen in an ad hoc way. Heuristic: want 20 to 60% of samples to have $s(D) \ge s_0$. Calibration curves are specific to $L$. Current research: generalizing across values of $L$.

Run-time

201-plus forward calculations each time we want a p-value is still too slow. Current research: pre-compute points on $p(s, L)$ surfaces. Use interpolation and extrapolation.
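One simple way to use pre-computed points is piecewise-linear interpolation of $\log_{10}(p)$ in $s$ at a fixed $L$, with the end segments extended as straight lines. This is only a sketch of the idea; the calibration points below are made up, and the talk does not specify which interpolation scheme is used.

```python
import bisect


def interp_log10_p(points, s):
    """Piecewise-linear log10(p) in s; the end segments extend linearly.

    points is a list of (s, log10 p) pairs sorted by s, with at least two entries.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    i = bisect.bisect_right(xs, s)
    i = min(max(i, 1), len(xs) - 1)     # clamp to a valid segment
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (s - x0) / (x1 - x0)


# Hypothetical calibration points (s, log10 p) for one fixed length L.
CALIBRATION = [(20.0, -2.0), (40.0, -5.0), (60.0, -8.0)]

if __name__ == "__main__":
    for s in (30.0, 60.0, 80.0):
        print(s, interp_log10_p(CALIBRATION, s))
```

Working with $\log_{10}(p)$ rather than $p$ matches the plots, where the curves are close to straight lines in the tail, so linear extrapolation is at least a plausible first guess.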

Conclusions

General applicability

A few hundred forward calculations provide a precise p-value estimate for any sort of alignment. Current research: reduce that to an average of ten forward calculations. Additional savings: only the best results need a precise p-value.

Reading

Newberg (2008): Smith-Waterman sequence alignments
Newberg (2009): Hidden Markov / Boltzmann models
Newberg & Lawrence (2009): Integer/Score distributions
See http://www.rpi.edu/~newbel/publications/.

Acknowledgments: Chip Lawrence; Sean Eddy; NIH; Health Research, Inc.; NSF.

References

Eddy, S. R. (2003) HMMER User's Guide: Biological sequence analysis using profile hidden Markov models. Howard Hughes Medical Institute and Dept. of Genetics, Washington University School of Medicine, Saint Louis, MO, 2.3.2 edition.

Eddy, S. R. (2008) A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol, 4 (5), e1000069. pmid: 18516236, doi: 10.1371/journal.pcbi.1000069.

Kann, M. G., Sheetlin, S. L., Park, Y., Bryant, S. H. & Spouge, J. L. (2007) The identification of complete domains within protein sequences using accurate E-values for semi-global alignment. Nucleic Acids Res, 35 (14), 4678-4685. pmid: 17596268, doi: 10.1093/nar/gkm414.

Karlin, S. & Altschul, S. F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A, 87, 2264-2268. pmid: 2315319, doi: 10.1073/pnas.87.6.2264.

Newberg, L. A. (2008) Significance of gapped sequence alignments. J Comput Biol, 15 (9), 1187-1194. pmid: 18973434, pmcid: PMC2737730, doi: 10.1089/cmb.2008.0125.

Newberg, L. A. (2009) Error statistics of hidden Markov model and hidden Boltzmann model results. BMC Bioinf, 10, article 212. pmid: 19589158, pmcid: PMC2722652, doi: 10.1186/1471-2105-10-212.

Newberg, L. A. & Lawrence, C. E. (2009) Exact calculation of distributions on integers, with application to sequence alignment. J Comput Biol, 16 (1), 1-18. pmid: 19119992, pmcid: PMC2858568, doi: 10.1089/cmb.2008.0137.

Smith, N. A. & Johnson, M. (2007) Weighted and probabilistic context-free grammars are equally expressive. Comput Linguistics, 33 (4), 477-491. doi: 10.1162/coli.2007.33.4.477.

Wolfsheimer, S., Burghardt, B. & Hartmann, A. K. (2007) Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail. Algorithms Mol Biol, 2, article 9. pmid: 17625018, doi: 10.1186/1748-7188-2-9.