Hidden Markov Models

Useful for modeling and making predictions on sequential data, e.g., biological sequences, text, and series of sounds/spoken words. We will return to graphical models that are generative.

sscott@cse.unl.edu

Outline:
- Bioinformatics example: CpG islands
- Markov chains and hidden Markov models (HMMs)
- Formal definition
- Finding the most probable state path (Viterbi algorithm)
- Forward and backward algorithms
- Specifying an HMM

Bioinformatics: CpG Islands

Focus on nucleotide sequences: sequences of symbols from the alphabet {A, C, G, T}. The sequence CG (written CpG) tends to appear more frequently in some places than in others. Such CpG islands are usually 10^2 to 10^3 bases long.

Questions:
1. Given a short segment, is it from a CpG island?
2. Given a long segment, where are its islands?

Modeling CpG Islands

The model will be a CpG island generator. We want the probability of the next symbol to depend on the current symbol, so we will use a standard (non-hidden) Markov model: a probabilistic state machine in which each state emits a symbol. [Figure: states A, C, G, T, fully connected, with transition probabilities such as P(A | T).]
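A standard Markov model of this kind can be sampled directly: from the current symbol's state, draw the next symbol according to that state's transition row. This is a minimal sketch; the transition probabilities below are illustrative placeholders, not estimates from real CpG data.

```python
import random

# Hypothetical first-order Markov chain over {A, C, G, T}.
# These numbers are made up for illustration only.
TRANS = {
    "A": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "C": {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
}

def sample_sequence(length, start="A", rng=None):
    """Generate a sequence by repeatedly sampling a successor of the last symbol."""
    rng = rng or random.Random(0)
    seq = [start]
    for _ in range(length - 1):
        probs = TRANS[seq[-1]]
        seq.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return "".join(seq)
```

Because each state emits exactly its own symbol, sampling states and sampling symbols are the same thing here; that 1-1 correspondence is what will disappear once the model becomes hidden.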
The Markov Property

A first-order Markov model (what we study) has the property that observing symbol x_i while in state π_i depends only on the previous state π_{i-1} (which generated x_{i-1}). The standard model has a 1-1 correspondence between symbols and states, thus

  P(x_i | x_{i-1}, ..., x_1) = P(x_i | x_{i-1})

and

  P(x_1, ..., x_L) = P(x_1) ∏_{i=2}^{L} P(x_i | x_{i-1})

For convenience, we can add special begin (B) and end (E) states to clarify the equations and define a distribution over sequence lengths. They emit empty (null) symbols x_0 and x_{L+1} to mark the ends of the sequence, so that

  P(x_1, ..., x_L) = ∏_{i=1}^{L+1} P(x_i | x_{i-1})

[Figure: B → {A, C, G, T, fully connected} → E]

Using Markov Chains to Discriminate

How do we use this to differentiate islands from non-islands? Define two Markov models: islands (+) and non-islands (−). Each model gets four states (A, C, G, T). Take a training set of known islands and non-islands, and let c+_{st} be the number of times symbol t followed symbol s in an island:

  P̂+(t | s) = c+_{st} / Σ_{t'} c+_{st'}

and similarly for P̂−. Now score a sequence X = ⟨x_1, ..., x_L⟩ by summing the log-odds ratios:

  log (P̂(X | +) / P̂(X | −)) = Σ_{i=1}^{L+1} log (P̂+(x_i | x_{i-1}) / P̂−(x_i | x_{i-1}))

Second CpG question: given a long sequence, where are its islands? We could use the tools just presented by passing a fixed-width window over the sequence and computing scores, but this runs into trouble if island lengths vary. We would prefer a single, unified model for islands vs. non-islands.

[Figure: eight states A+, C+, G+, T+ and A−, C−, G−, T−, with complete connectivity between all pairs.] Within the + group, transition probabilities are similar to those of the separate + model, but there is a small chance of switching to a state in the − group.

What's Hidden?

We no longer have a one-to-one correspondence between states and emitted characters. E.g., was C emitted by C+ or C−? We must differentiate the symbol sequence X from the state sequence π = ⟨π_1, ..., π_L⟩. State transition probabilities are the same as before: P(π_i = ℓ | π_{i-1} = j) (abbreviated P(ℓ | j)). Now each state also has a probability of emitting any value: P(x_i = x | π_i = j) (abbreviated P(x | j)). [In the CpG HMM, emission probabilities are discrete and equal to 0 or 1.]
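The log-odds scoring of the two Markov chains can be sketched as follows. `train` builds the maximum-likelihood estimates P̂(t | s) from transition counts; for simplicity this sketch omits the begin/end terms, so the sum runs over observed adjacent pairs only.

```python
from collections import defaultdict
from math import log

def train(seqs):
    """Estimate P_hat(t|s) = c_st / sum_t' c_st' from training sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    return {s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in counts.items()}

def log_odds(x, p_plus, p_minus):
    """Score X by summing log(P+(x_i|x_{i-1}) / P-(x_i|x_{i-1}))."""
    return sum(log(p_plus[s][t] / p_minus[s][t]) for s, t in zip(x, x[1:]))
```

A positive score favors the island model, a negative score the non-island model. In practice the counts would need pseudocounts so that transitions unseen in one training set do not produce zero probabilities.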
Example: The Occasionally Dishonest Casino

Assume the casino is typically fair, but with probability 0.05 it switches to a loaded die, and switches back with probability 0.1.

  Fair:   1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6   (stay fair: 0.95, switch: 0.05)
  Loaded: 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2   (stay loaded: 0.9, switch: 0.1)

Given a sequence of rolls, what's hidden? The state sequence: which rolls came from the fair die and which from the loaded one.

The Viterbi Algorithm

The probability of seeing symbol sequence X and state sequence π is

  P(X, π) = P(π_1 | 0) ∏_{i=1}^{L} P(x_i | π_i) P(π_{i+1} | π_i)

We can use this to find the most likely path

  π* = argmax_π P(X, π)

and trace it to identify islands (paths through + states). But there are exponentially many paths through the chain, so how do we find the most likely one?

Assume that we know, for all k, v_k(i) = the probability of the most likely path ending in state k with observation x_i. Then

  v_ℓ(i+1) = P(x_{i+1} | ℓ) max_k {v_k(i) P(ℓ | k)}

where the max ranges over all states k at position i and ℓ is the state at position i+1. Given this recurrence, we can fill in the table with dynamic programming:

  1. Initialize: v_0(0) = 1, and v_k(0) = 0 for k > 0
  2. For i = 1 to L, for ℓ = 1 to M (the number of states):
       v_ℓ(i) = P(x_i | ℓ) max_k {v_k(i-1) P(ℓ | k)}
       ptr_i(ℓ) = argmax_k {v_k(i-1) P(ℓ | k)}
  3. Terminate: P(X, π*) = max_k {v_k(L) P(0 | k)}, with π*_L = argmax_k {v_k(L) P(0 | k)}
  4. Trace back: for i = L to 1, π*_{i-1} = ptr_i(π*_i)

To avoid underflow, use log(v_ℓ(i)) and add instead of multiplying.
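The Viterbi recurrence and traceback above can be sketched for the casino model. This is a log-space sketch; the uniform start distribution is an assumption (the slides leave P(π_1 | 0) unspecified), and the end-state term P(0 | k) is dropped for simplicity.

```python
from math import log

# Occasionally dishonest casino: states F (fair) and L (loaded).
STATES = ["F", "L"]
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {"F": {r: 1/6 for r in range(1, 7)},
        "L": {**{r: 1/10 for r in range(1, 6)}, 6: 1/2}}
START = {"F": 0.5, "L": 0.5}   # assumed uniform start

def viterbi(rolls):
    """v_l(i) = P(x_i|l) * max_k v_k(i-1) P(l|k), computed in log space."""
    v = [{k: log(START[k]) + log(EMIT[k][rolls[0]]) for k in STATES}]
    ptr = []
    for x in rolls[1:]:
        row, back = {}, {}
        for l in STATES:
            k_best = max(STATES, key=lambda k: v[-1][k] + log(TRANS[k][l]))
            back[l] = k_best
            row[l] = log(EMIT[l][x]) + v[-1][k_best] + log(TRANS[k_best][l])
        v.append(row)
        ptr.append(back)
    # Trace back from the best final state.
    state = max(STATES, key=lambda k: v[-1][k])
    path = [state]
    for back in reversed(ptr):
        state = back[state]
        path.append(state)
    return "".join(reversed(path))
```

A run of sixes pulls the most likely path into the loaded state, while a mixed run of low rolls keeps it fair, which is exactly the segmentation behavior we want for the CpG model's + and − groups.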
The Forward Algorithm

Given a sequence X, find P(X) = Σ_π P(X, π). Use dynamic programming as in Viterbi, replacing max with sum and v_k(i) with f_k(i) = P(x_1, ..., x_i, π_i = k), the probability of the observed sequence through x_i, stopping in state k:

  1. Initialize: f_0(0) = 1, and f_k(0) = 0 for k > 0
  2. For i = 1 to L, for ℓ = 1 to M (the number of states):
       f_ℓ(i) = P(x_i | ℓ) Σ_k f_k(i-1) P(ℓ | k)
  3. Terminate: P(X) = Σ_k f_k(L) P(0 | k)

To avoid underflow, we can again use logs, though the exactness of the results is compromised.

The Backward Algorithm

Given a sequence X, find the probability that x_i was emitted by state k, i.e.,

  P(π_i = k | X) = P(π_i = k, X) / P(X)

The numerator factors as

  P(π_i = k, X) = P(x_1, ..., x_i, π_i = k) · P(x_{i+1}, ..., x_L | π_i = k) = f_k(i) · b_k(i)

where f_k(i) is computed by the forward algorithm and b_k(i) by the backward algorithm:

  1. Initialize: b_k(L) = P(0 | k) for all k
  2. For i = L-1 to 1, for k = 1 to M (the number of states):
       b_k(i) = Σ_ℓ P(ℓ | k) P(x_{i+1} | ℓ) b_ℓ(i+1)
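The forward and backward recurrences and the posterior P(π_i = k | X) can be sketched as below. The model parameters reuse the casino numbers as stand-ins, and the end distribution P(0 | k) is taken as 1 for all k (an assumption: no length modeling).

```python
# Tiny two-state HMM; numbers reused from the casino example as stand-ins.
STATES = ["F", "L"]
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {"F": {r: 1/6 for r in range(1, 7)},
        "L": {**{r: 1/10 for r in range(1, 6)}, 6: 1/2}}
END = {"F": 1.0, "L": 1.0}   # P(0|k); assumed 1 (no length modeling)

def forward(x):
    """Return the table f_k(i) = P(x_1..x_i, pi_i = k) and P(X)."""
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for obs in x[1:]:
        f.append({l: EMIT[l][obs] * sum(f[-1][k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    return f, sum(f[-1][k] * END[k] for k in STATES)

def backward(x):
    """Return the table b_k(i) = P(x_{i+1}..x_L | pi_i = k)."""
    b = [{k: END[k] for k in STATES}]
    for obs in reversed(x[1:]):
        b.insert(0, {k: sum(TRANS[k][l] * EMIT[l][obs] * b[0][l] for l in STATES)
                     for k in STATES})
    return b

def posterior(x, i):
    """P(pi_i = k | X) = f_k(i) * b_k(i) / P(X)."""
    f, px = forward(x)
    b = backward(x)
    return {k: f[i][k] * b[i][k] / px for k in STATES}
```

A useful sanity check: Σ_k f_k(i) b_k(i) equals P(X) at every position i, so the posterior at each position sums to 1.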
Use of Forward/Backward: Posterior Decoding

Define g(k) = 1 if k ∈ {A+, C+, G+, T+} and 0 otherwise. Then

  G(i | X) = Σ_k P(π_i = k | X) g(k) = the probability that x_i is in an island

For each state k, compute P(π_i = k | X) with the forward/backward algorithm. This technique is applicable to any HMM whose set of states is partitioned into classes, and can be used to label individual parts of a sequence.

Specifying an HMM

Two problems: defining the structure (the set of states) and the parameters (transition and emission probabilities). Start with the latter problem: given a training set X^1, ..., X^N of independently generated sequences, learn a good set of parameters θ. The goal is to maximize the (log) likelihood of seeing the training set given that θ is the set of parameters for the HMM generating them:

  Σ_{j=1}^{N} log P(X^j ; θ)

Specifying an HMM: State Sequence Known

Estimating parameters is straightforward when the state sequences are known, e.g., when the islands are already identified in the training set. Let A_kℓ = the number of k → ℓ transitions and E_k(b) = the number of emissions of b in state k. Then

  P(ℓ | k) = A_kℓ / Σ_{ℓ'} A_kℓ'
  P(b | k) = E_k(b) / Σ_{b'} E_k(b')

Be careful if little training data is available: e.g., an unused state k will have undefined parameters. Workaround: add pseudocounts r_kℓ to A_kℓ and r_k(b) to E_k(b) that reflect prior biases about the probabilities. Increased training data decreases the prior's influence.
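The known-state-sequence estimates with pseudocounts can be sketched as follows. The function name `estimate` and its argument layout are chosen here for illustration; `r` is a single uniform pseudocount rather than the per-pair r_kℓ and r_k(b) of the slides.

```python
from collections import Counter

def estimate(paths, seqs, states, symbols, r=1.0):
    """P(l|k) = (A_kl + r) / sum_l' (A_kl' + r), and similarly for emissions."""
    A = Counter()   # transition counts A_kl
    E = Counter()   # emission counts E_k(b)
    for pi, x in zip(paths, seqs):
        A.update(zip(pi, pi[1:]))   # count k -> l transitions along the path
        E.update(zip(pi, x))        # count symbol b emitted in state k
    trans = {k: {l: (A[(k, l)] + r) / sum(A[(k, m)] + r for m in states)
                 for l in states} for k in states}
    emit = {k: {b: (E[(k, b)] + r) / sum(E[(k, c)] + r for c in symbols)
                for b in symbols} for k in states}
    return trans, emit
```

With `r > 0` every row is well defined even for states never visited in training, and as counts grow the pseudocounts' influence shrinks, matching the remark above.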
Specifying an HMM: The Baum-Welch Algorithm

Used for estimating parameters when the state sequences are unknown. It is a special case of expectation maximization (EM). Start with arbitrary P(ℓ | k) and P(b | k), and use them to estimate A_kℓ and E_k(b) as expected numbers of occurrences given the training set¹:

  A_kℓ = Σ_j (1 / P(X^j)) Σ_{i=1}^{L} f^j_k(i) P(ℓ | k) P(x^j_{i+1} | ℓ) b^j_ℓ(i+1)

(the probability of a transition from k to ℓ at position i of sequence j, summed over all positions of all sequences), and

  E_k(b) = Σ_j Σ_{i : x^j_i = b} P(π_i = k | X^j) = Σ_j (1 / P(X^j)) Σ_{i : x^j_i = b} f^j_k(i) b^j_k(i)

Use these (and pseudocounts) to recompute P(ℓ | k) and P(b | k). After each iteration, compute the log likelihood and halt if there is no improvement.

¹ The superscript j corresponds to the jth training example.
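The expected-count formulas (the E-step) can be sketched for a single training sequence. To keep the block self-contained, compact forward and backward passes repeating the earlier recurrences are included, the casino numbers serve as the current parameter guess, and the M-step (renormalizing with pseudocounts) is omitted.

```python
# Current parameter guess; casino numbers used as illustrative stand-ins.
STATES = ["F", "L"]
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {"F": {r: 1/6 for r in range(1, 7)},
        "L": {**{r: 1/10 for r in range(1, 6)}, 6: 1/2}}

def forward(x):
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for obs in x[1:]:
        f.append({l: EMIT[l][obs] * sum(f[-1][k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    return f, sum(f[-1].values())

def backward(x):
    b = [{k: 1.0 for k in STATES}]
    for obs in reversed(x[1:]):
        b.insert(0, {k: sum(TRANS[k][l] * EMIT[l][obs] * b[0][l] for l in STATES)
                     for k in STATES})
    return b

def expected_counts(x):
    """E-step: A_kl = (1/P(X)) sum_i f_k(i) P(l|k) P(x_{i+1}|l) b_l(i+1), etc."""
    f, px = forward(x)
    b = backward(x)
    A = {k: {l: sum(f[i][k] * TRANS[k][l] * EMIT[l][x[i + 1]] * b[i + 1][l]
                    for i in range(len(x) - 1)) / px
             for l in STATES} for k in STATES}
    E = {k: {v: sum(f[i][k] * b[i][k] for i in range(len(x)) if x[i] == v) / px
             for v in range(1, 7)} for k in STATES}
    return A, E
```

Sanity checks follow directly from the formulas: the expected transition counts sum to L-1 and the expected emission counts sum to L, since each position contributes total posterior mass 1.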
Specifying an HMM: Structure

How do we specify the HMM's states and connections? States come from background knowledge of the problem, e.g., a size-4 alphabet and +/− labels imply 8 states. Connections: it is tempting to specify complete connectivity and let Baum-Welch sort it out. Problem: the huge number of parameters could lead to a local maximum. It is better to use background knowledge to invalidate some connections by initializing P(ℓ | k) = 0; Baum-Welch will respect this.

Specifying an HMM: Silent States

We may want to allow the model to generate sequences with certain parts deleted. E.g., when aligning DNA or protein sequences against a fixed model, or matching a sequence of spoken words against a fixed model, some parts of the input might be omitted. Problem: a huge number of connections, slow training, and local maxima.

Silent states (like the begin and end states) do not emit symbols, so they can bypass a regular state. If there are no purely silent loops, the Viterbi, forward, and backward algorithms can be updated to work with silent states. Silent states are used extensively in profile HMMs for modeling sequences of protein families (aka multiple alignments).