Bioinformatics: Introduction to Hidden Markov Models
Hidden Markov Models and Multiple Sequence Alignment
Slides borrowed from Scott C. Schmidler (graduate student)
Outline
- Probability Review
- Markov Chains
- Hidden Markov Chains
- Examples of HMMs for Protein Sequences
- Algorithm Review for HMMs

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI)

Motivation: Composing a Drama by Mimicking Shakespeare
- Assume we want to write a drama in the style of Shakespeare
- We collect a large set of Shakespeare's works
- Define a vocabulary V = {X_1, X_2, ..., X_N}
- Build a model P(X_i | X_j) for i, j = 1, ..., N
- To compose a drama, generate words from the model P(X_i | X_j)
- Though this is too simplistic to be useful, this naive model can be extended and refined to mimic Shakespeare's writing style
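The bigram model P(X_i | X_j) above can be sketched in a few lines of Python. The toy corpus and the names `train_bigram`/`generate` are illustrative stand-ins, not part of the original lecture:

```python
import random

def train_bigram(words):
    """Estimate P(next | current) by counting adjacent word pairs."""
    counts = {}
    for cur, nxt in zip(words, words[1:]):
        counts.setdefault(cur, {})
        counts[cur][nxt] = counts[cur].get(nxt, 0) + 1
    # Normalize counts into conditional probabilities
    model = {}
    for cur, succ in counts.items():
        total = sum(succ.values())
        model[cur] = {w: c / total for w, c in succ.items()}
    return model

def generate(model, start, length, seed=0):
    """Sample a word sequence from the bigram model."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        succ = model.get(out[-1])
        if not succ:
            break  # dead end: no observed successor
        words, probs = zip(*succ.items())
        out.append(rng.choices(words, weights=probs)[0])
    return out

corpus = "to be or not to be that is the question".split()
model = train_bigram(corpus)
print(" ".join(generate(model, "to", 6)))
```

With a real corpus the same two functions would produce first-order word approximations of the kind Shannon exhibits below.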
Markov Approximations to English
From Shannon's original paper:
1. Zero-order approximation:
   XFOML RXKXRJFFUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD
2. First-order approximation:
   OCRO HLI RGWR NWIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH RBL
3. Second-order approximation:
   ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CITSBE

Markov Approximations (cont.)
From Shannon's paper:
4. Third-order approximation:
   IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTABIN IS REGOACTIONA OF CRE
Markov random field with 1,000 features, no underlying machine (Della Pietra et al., 1997):
   WAS REASER IN THERE TO WILL WAS BY HOMES THING BE RELOVERATED THER WHICH CONISTS AT RORES ANDITING WITH PROVERAL THE CHESTRAING FOR HAVE TO INTRALLY OF QUT DIVERAL THIS OFFECT INATEVER THIFER CONSTRANDED STATER VILL MENTTERING AND OF IN VERATE OF TO
Word-Based Approximations
1. First-order approximation:
   REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE T
2. Second-order approximation:
   THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
Shannon's comment: "It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage."

Motivation: Composing a Symphony in Beethoven's Style
- We want to compose a symphony in the style of Beethoven
- We collect a large set of Beethoven's works
- Define a vocabulary V = {X_1, X_2, ..., X_N} of musical notes
- Build a model P(X_i | X_j) for i, j = 1, ..., N
- To compose a symphony, generate note symbols from the model P(X_i | X_j)
Modeling Biological Sequences
- Collect a set of sequences of interest
- Define a vocabulary V = {X_1, X_2, ..., X_N}
  - For DNA sequences: N = 4 and V = {A, T, G, C}
  - For protein sequences: N = 20 and V = {amino acids}
- Build (learn) a model P(X_i | X_j) for i, j = 1, ..., N, or more generally P(X | w) with X = X_1, X_2, ..., X_M and model parameter vector w
- The model can be used
  - to generate typical sequences from the class of training sequences, e.g. a protein family
  - to compute the probability of an observed sequence O being generated from the model class
  - and for other tasks
- Hidden Markov models (HMMs) are a class of stochastic generative models effective for building such probabilistic models

Probability Review
- Probability notation:
  - Probability
  - Joint probability
  - Conditional probability
  - Marginal probability
  - Independence
  - Bayes rule
Markov Chains
- Markov property: P(S_t | S_0, S_1, ..., S_{t-1}) = P(S_t | S_{t-1})
- Formally:
  - State space
  - Transition matrix
  - Initial distribution
- CS intuition: a stochastic finite automaton

Markovian Sequence
- States through which the chain passes form a sequence, e.g. S_0, S_1, S_1, S_2, S_0, S_1, ...
- By the Markov property:
  P(Sequence) = P(S_0, S_1, S_1, S_2, S_0, S_1, ...) = π(S_0) P(S_1 | S_0) P(S_1 | S_1) P(S_2 | S_1) ...
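The factorization above translates directly into code. A minimal sketch; the initial distribution and transition matrix below are made-up toy numbers, not real dinucleotide frequencies:

```python
def chain_probability(seq, pi, P):
    """P(S_0, S_1, ..., S_T) = pi(S_0) * prod over t of P(S_t | S_{t-1})."""
    prob = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= P[prev][cur]
    return prob

# Toy initial distribution and transition matrix (hypothetical numbers)
pi = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
P = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "C": {"A": 0.2, "C": 0.3, "G": 0.2, "T": 0.3},
    "G": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "T": {"A": 0.2, "C": 0.3, "G": 0.2, "T": 0.3},
}
print(chain_probability("AGAT", pi, P))  # 0.25 * 0.3 * 0.3 * 0.2 ≈ 0.0045
```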
Example
- Markov chain for generating a DNA sequence
- Sequence probability:
  P(AGATCG) = π(A) P(G | A) P(A | G) P(T | A) P(C | T) P(G | C)
- Transition probabilities capture dinucleotide frequency (e.g. base-stacking)

Hidden Markov Chains
- Observed sequence is a probabilistic function of an underlying Markov chain
  - Example: HMM for a (noisy) DNA sequence (see e.g. Churchill 1989)
- True state sequence unknown, but the observation sequence gives us a clue
[Figure: unobserved true sequence above, observed noisy sequence data below]
Example: Hidden Markov Chain for Protein Sequence
Figure from (Krogh et al., 1994)
- State space is backbone secondary structure
  - Used for prediction (Asai et al., Stultz et al.)
- State space is side-chain environment
  - Used for fold recognition (Hubbard et al.)

An HMM for Multiple Protein Sequences (Krogh et al.)
- Match states are model (consensus) positions
- Position-specific deletion penalties
- Position-specific insertion frequencies
- A path through the states aligns a sequence to the model
Example: Multiple Alignment of Globin Sequences
Figure from (Krogh et al., 1994)

HMM-based Multiple Sequence Alignment
- Exact multiple alignment of k sequences is O(n^k), so instead:
  1. Estimate a statistical model for the sequences
     - Use a head start: PROFILE alignment
     - Or start from scratch with unaligned sequences (harder)
  2. Align each remaining sequence to the model
  3. The alignment yields assignments of equivalent sequence elements within the multiple alignment
Example: Aligning Sequence to Model
- Given an HMM model for a protein family, align a new sequence to the model (d states are gaps, i states are insertions)

Computing with HMMs
Three tasks:
1. Probability of an observed sequence: given O_1, O_2, ..., O_T, find P(O_1, O_2, ..., O_T) (nontrivial since the state sequence is unobserved)
2. Most likely hidden state sequence: given the observed sequence {O_1, ..., O_T}, find
   argmax over S_1, ..., S_T of P(S_1, ..., S_T | O_1, ..., O_T)
3. Model parameters: find
   argmax over θ of P(O_1, ..., O_T | θ)
Computing Likelihood of Observed Sequence
- Compute P(O_1, O_2, ..., O_T)
  - True state sequence unknown
  - Must sum over all possible paths
  - Number of paths is O(N^T)
  - Markovian structure permits a recursive definition, and hence efficient calculation by dynamic programming
  P(O_1, ..., O_T) = Σ over S_0, S_1, ..., S_T of P(O_1, O_2, ..., O_T | S_0, S_1, ..., S_T) P(S_0, S_1, ..., S_T)
- Key observation: any path must be in exactly one state at time t

Key Idea for HMM Computations
[Figure: trellis of the N possible states (amino acids) at times t, t+1, ..., T]
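The sum over all N^T paths can be written down literally for toy sizes, which makes the case for dynamic programming concrete. A sketch on a made-up two-state HMM (all names and numbers are illustrative):

```python
from itertools import product

def likelihood_bruteforce(obs, states, pi, A, B):
    """P(O) = sum over all hidden paths S of P(O, S) -- O(N^T), toy sizes only."""
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

# Toy two-state HMM (hypothetical numbers)
states = ["H", "L"]
pi = {"H": 0.6, "L": 0.4}
A = {"H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.4, "L": 0.6}}
B = {"H": {"x": 0.8, "y": 0.2}, "L": {"x": 0.3, "y": 0.7}}
print(likelihood_bruteforce(["x", "y"], states, pi, A, B))  # ≈ 0.228
```

Even at 20 states and sequence length 100 this enumeration is hopeless, which is exactly why the forward recursion below matters.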
Example: Searching a Protein Database with an HMM Profile
- For each sequence in the database:
  - Does the sequence fit the model?
  - Score by P(O_1, O_2, ..., O_T); compute a Z-score adjusted for length
- Globins; Protein Kinases
Figure from (Krogh et al., 1994)

Estimate Alignment and Model Parameters Simultaneously
- Key idea: missing data
  - What if we knew the alignment? The parameters are easy to estimate:
    - Calculate the (expected) number of transitions
    - Calculate the (expected) frequency of amino acids
  - What if we knew the parameters? The alignment is easy to find:
    - Align each sequence to the model using the Viterbi algorithm
    - Align residues in match states
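The "alignment known" half of the alternation amounts to counting transitions and emissions. A minimal sketch, with a hypothetical two-state (match/insert) path standing in for a real alignment:

```python
def estimate_parameters(paths, observations, states, symbols):
    """Given known state paths, estimate transition and emission
    probabilities as normalized counts (the 'alignment known' case)."""
    A = {i: {j: 0.0 for j in states} for i in states}
    B = {i: {s: 0.0 for s in symbols} for i in states}
    for path, obs in zip(paths, observations):
        for i, j in zip(path, path[1:]):
            A[i][j] += 1  # count each transition i -> j
        for i, s in zip(path, obs):
            B[i][s] += 1  # count symbol s emitted while in state i
    for i in states:
        ta = sum(A[i].values())
        if ta:
            A[i] = {j: c / ta for j, c in A[i].items()}
        tb = sum(B[i].values())
        if tb:
            B[i] = {s: c / tb for s, c in B[i].items()}
    return A, B

# One toy "aligned" sequence: a state path and its emitted symbols (hypothetical)
A, B = estimate_parameters(
    [["M", "M", "I", "M"]], [["A", "C", "A", "G"]],
    states=["M", "I"], symbols=["A", "C", "G", "T"])
print(A["M"])  # transitions out of M split evenly between M and I
```

In Baum-Welch the hard counts above are replaced by expected counts computed from the forward-backward probabilities.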
Other Details
- How many states in the model?
- How to initialize parameters?
- How to avoid local modes?
See (Krogh et al., 1994) for some suggestions

Multiple Protein Sequence Alignment
- Given a set of sequences:
  - Estimate the HMM model using optimization for parameter search (Baum-Welch, EM)
  - Align each sequence to the model (Viterbi)
  - Match states of the model provide the columns of the resulting multiple alignment
Extensions
- Clustering subfamilies
- Modeling domains
Figure from (Krogh et al., 1994)

Tradeoffs
- Advantages:
  - Explicit probabilistic model for the family
  - Position-specific residue distributions, gap penalties, insertion frequencies
- Disadvantages:
  - Many parameters; requires more data or care
  - Traded one hard optimization problem for another
HMM Summary
- Powerful tool for modeling protein families
- Generalization of existing profile methods
- Data-intensive
- Widely applicable to problems in bioinformatics

References
- Bioinformatics classic: Krogh et al. (1994) Hidden Markov models in computational biology: applications to protein modeling, J. Mol. Biol. 235:1501-1531
- Book: Eddy & Durbin, 1999. See web site.
- Tutorial: Rabiner, L. (1989) A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, 77(2), 257-286
Forward-Backward Algorithm
- Forward pass:
  - Define α_t(j) = [Σ from i=1 to N of α_{t-1}(i) P(S_j | S_i)] P(O_t | S_j)
  - α_t(j) is the probability of the subsequence O_1, O_2, ..., O_t when in S_j at time t
- Key observation: any path must be in one of the N states at time t
[Figure: trellis at times t-1, t, ..., T]

Forward-Backward Algorithm (cont.)
- Notice P(O_1, O_2, ..., O_T) = Σ from j=1 to N of α_T(j)
- Define an analogous backward pass so that:
  β_t(j) = Σ from i=1 to N of β_{t+1}(i) P(S_i | S_j) P(O_{t+1} | S_i)
  and
  P(O_t came from S_i) = α_t(i) β_t(i) / Σ from i=1 to N of α_t(i) β_t(i)
[Figure: trellis at times t-1, t, ..., T+1]
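Both passes fit in a few lines each. A sketch on the same kind of made-up two-state HMM as before (all names and numbers are illustrative); note that α and β agree on P(O) at every time step:

```python
def forward(obs, states, pi, A, B):
    """alpha[t][j] = P(O_1 .. O_t, S_t = j)."""
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    for t in range(1, len(obs)):
        alpha.append({
            j: sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]
            for j in states})
    return alpha

def backward(obs, states, A, B):
    """beta[t][i] = P(O_{t+1} .. O_T | S_t = i)."""
    T = len(obs)
    beta = [None] * T
    beta[T - 1] = {i: 1.0 for i in states}
    for t in range(T - 2, -1, -1):
        beta[t] = {
            i: sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in states)
            for i in states}
    return beta

# Toy two-state HMM (hypothetical numbers)
states = ["H", "L"]
pi = {"H": 0.6, "L": 0.4}
A = {"H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.4, "L": 0.6}}
B = {"H": {"x": 0.8, "y": 0.2}, "L": {"x": 0.3, "y": 0.7}}
obs = ["x", "y"]
alpha = forward(obs, states, pi, A, B)
beta = backward(obs, states, A, B)
print(sum(alpha[-1].values()))  # P(O) ≈ 0.228
# sum over i of alpha[t][i] * beta[t][i] equals P(O) at every t
print(sum(alpha[0][i] * beta[0][i] for i in states))  # ≈ 0.228
```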
Finding the Most Likely Path
- Forward pass:
  - Replace summation with maximization
  - α*_t(j) is now the max probability of the subsequence O_1, O_2, ..., O_t when in S_j at time t
  - Again: max over paths S of P(O_1, O_2, ..., O_T, S) = max over 1 ≤ j ≤ N of α*_T(j), then trace back

Baum-Welch Algorithm (Expectation-Maximization)
- Set parameters to their expected values given the observed sequences:
  - State transition probabilities
  - Observation probabilities
  - Recalculate the expectations with the new probabilities
  - Iterate to convergence
- Guaranteed that P(O_1, ..., O_T | θ) is non-decreasing; converges to a local mode (see Rabiner, 1989 for details)
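Replacing the sum in the forward pass with a max, and keeping back-pointers, gives the Viterbi decoder. A sketch on a made-up two-state HMM (names and numbers are illustrative only):

```python
def viterbi(obs, states, pi, A, B):
    """Most likely hidden state path: the forward pass with max in place of sum."""
    delta = [{j: pi[j] * B[j][obs[0]] for j in states}]
    back = []
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for j in states:
            best = max(states, key=lambda i: delta[t - 1][i] * A[i][j])
            back[t - 1][j] = best  # remember the best predecessor of j
            delta[t][j] = delta[t - 1][best] * A[best][j] * B[j][obs[t]]
    # Trace back from the best final state
    path = [max(states, key=lambda j: delta[-1][j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Toy two-state HMM (hypothetical numbers)
states = ["H", "L"]
pi = {"H": 0.6, "L": 0.4}
A = {"H": {"H": 0.7, "L": 0.3}, "L": {"H": 0.4, "L": 0.6}}
B = {"H": {"x": 0.8, "y": 0.2}, "L": {"x": 0.3, "y": 0.7}}
print(viterbi(["x", "x", "y"], states, pi, A, B))  # ['H', 'H', 'L']
```

In the multiple-alignment setting this is the routine that aligns each sequence to the profile model; in practice the products are computed in log space to avoid underflow on long sequences.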