Unit 16: Hidden Markov Models


1 Computational Statistics with Application to Bioinformatics
Prof. William H. Press, Spring Term, 2008, The University of Texas at Austin
Unit 16: Hidden Markov Models

2 Unit 16: Hidden Markov Models (Summary)
Markov models: discrete states, discrete time steps; transitions between states at each time step with specified probabilities. How to find periodicity or reducibility: take a high power of the matrix by the successive-squares method. Irreducibility and aperiodicity imply ergodicity.
Hidden Markov Models: we don't observe the states, but instead symbols emitted from them, with a probabilistic distribution for each state. The forward-backward algorithm estimates the states from the data; the forward (backward) pass incorporates past (future) data at each time. Example: gene finding in the galaxy Zyzyx (fewer irrelevant complications than in our galaxy), using NR3's HMM class.
Baum-Welch re-estimation uses the data to improve the estimates of the transition and symbol probabilities. It's an EM method! Pure unsupervised learning. We try it on the Zyzyx data: it can find the right answers starting from amazingly crude guesses!
Hidden Semi-Markov Models [aka Generalized HMMs (GHMMs)] give each state a specified residence-time distribution; they can be implemented by expanding the number of states in an HMM. We try it on Zyzyx and get somewhat better gene finding.

3 Markov Models
A Markov model is a directed graph that may (and usually does) have loops, evolving in discrete time steps. At each time step, the state advances with the probabilities labeled on the outgoing edges; self-loops are also OK. "Markov" because there is no memory: the model knows only what state it is in now. Markov models are especially important because there exists a fast algorithm for parsing their state from observed data, in the so-called Hidden Markov Models (HMMs).

4 A (right) stochastic matrix A has non-negative entries with rows summing to 1, where A_ij is the transition probability from state i to state j. The population (probability) vector p advances by one time step as p_{t+1} = A^T p_t; the transpose appears because of the way from/to are defined.
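
As a small illustration of the population-vector update (my own sketch, not from the slides; plain C++ rather than NR3, with a made-up 2-state matrix), one step is p_{t+1}(j) = sum_i A_ij p_t(i):

#include <cstdio>
#include <vector>
using Mat = std::vector<std::vector<double>>;
using Vec = std::vector<double>;

// One Markov time step: q(j) = sum_i A(i,j) * p(i), i.e. q = A^T p.
Vec step(const Mat &A, const Vec &p) {
    int n = p.size();
    Vec q(n, 0.);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            q[j] += A[i][j] * p[i];
    return q;
}

int main() {
    // toy 2-state right stochastic matrix (each row sums to 1)
    Mat A = {{0.9, 0.1},
             {0.3, 0.7}};
    Vec p = {1.0, 0.0};                                // start with all population in state 0
    for (int t = 0; t < 50; t++) p = step(A, p);
    std::printf("p = (%.4f, %.4f)\n", p[0], p[1]);     // converges to (0.75, 0.25)
    return 0;
}

Iterating the step drives any starting vector toward the stationary distribution of this (ergodic) toy chain, which is the theme of the next few slides.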

5 Note the two different ways of drawing the same Markov model: as a directed graph with loops, or, adding the time dimension, as a directed graph with no loops (you can't go backward in time).

6 Every Markov model has at least one stationary (equilibrium) state.
A^T s = s, i.e., (A^T - 1) s = 0. Is there a nullspace? Yes, because the columns of (A^T - 1) all sum to zero, hence the matrix is singular.
But how do we know that this s has nonnegative components? Define a vector s^+ with components s^+_i = |s_i|. Then, for every j,
    sum_i A_ij s^+_i = sum_i A_ij |s_i| >= |sum_i A_ij s_i| = |s_j| = s^+_j
Summing over j,
    sum_j sum_i A_ij s^+_i = sum_i (sum_j A_ij) s^+_i = sum_i s^+_i >= sum_j s^+_j
and the two outer sums are the same number, so there must be an = in the sum over j, and hence also for each term, implying
    sum_i A_ij s^+_i = s^+_j
Therefore s^+ is the desired (nonnegative) stationary state, qed. (This is actually a simple special case of the Perron-Frobenius theorem.)

7 Does every Markov model eventually reach a unique equilibrium from all starting distributions (ergodicity)? Not necessarily. Two things can go wrong:
1. More than one equilibrium distribution. (Fails the test of irreducibility.)
2. Limit cycles. (Fails the test of aperiodicity.)
The theorem is: irreducibility and aperiodicity imply ergodicity.

8 It is easy to diagnose a particular Markov model numerically by taking a high power of its transition matrix (by successive squaring). Say, take A^T to the power 2^32, which requires 2 x 32 M^3 operations. If the columns of the result are all identical, then it's ergodic. (Done.) Otherwise:
1. Zero rows are states that become unpopulated. Ignore their corresponding columns.
2. See if the remaining columns are self-reproducing under A^T (eigenvectors of unit eigenvalue). When yes, they are equilibria.
3. When no, they are part of limit cycles.
So, this example has an unpopulated state and no equilibria.
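
A minimal plain-C++ sketch of the successive-squaring diagnostic (not the course's NR3 code; the 3-state matrix below is a made-up ergodic example):

#include <cstdio>
#include <vector>
using Mat = std::vector<std::vector<double>>;

// C = A * B for square matrices
Mat matmul(const Mat &A, const Mat &B) {
    int n = A.size();
    Mat C(n, std::vector<double>(n, 0.));
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

int main() {
    // M = A^T for a toy 3-state chain: each column is a state's outgoing distribution
    Mat M = {{0.8, 0.1, 0.2},
             {0.1, 0.8, 0.2},
             {0.1, 0.1, 0.6}};
    for (int k = 0; k < 32; k++) M = matmul(M, M);     // M becomes (A^T)^(2^32)
    // ergodic chain: every column has converged to the same stationary vector
    for (auto &row : M) {
        for (double x : row) std::printf("%8.4f", x);
        std::printf("\n");
    }
    return 0;
}

For a non-ergodic chain one would instead see zero rows (unpopulated states), columns that reproduce under A^T (distinct equilibria), or columns that do not (limit cycles), exactly as in the checklist above.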

9 In a Hidden Markov Model, we don't get to observe the states; instead we see a symbol that each state probabilistically emits when it is entered. There is thus a sequence of (hidden) states and a sequence of (observed) symbols. What can we say about the sequence of states, given a sequence of observations?

10 The forward-backward algorithm. Let's try to estimate the probability of being in a certain state at a certain time. Define the forward estimate α_t(i) as the probability of state i at time t given (only) the data up to and including t. Written out directly, it is a huge sum over all possible paths, each weighted by the likelihood (or Bayes probability with uniform prior) of that exact path and the exact observed data. As written, this is computationally infeasible. But it satisfies an easy recurrence!

11 Define the backward estimate β_t(i) as the probability of state i at time t given (only) the data to the future of t. Now there is a backward recurrence, started with a uniform prior at the last step (there is no data to the future of N-1). The grand estimate using all the data (the "forward-backward algorithm") combines the two, P_t(i) proportional to α_t(i) β_t(i), normalized by L, the likelihood (or Bayes probability) of the data. Actually, L is independent of t! You could use its numerical value to compare different models. Worried about multiplying the α's and β's as if they were independent probabilities? The Markov property guarantees that the past and future data are conditionally independent given state i, so P_t(i) ∝ P_t(data | i).
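
The recurrences are simple enough to sketch directly. Below is a plain-C++ version (my own illustration, not the NR3 HMM class; the names and the per-step renormalization convention are assumptions) of the forward pass, the backward pass, and the combined estimate P_t(i) ∝ α_t(i) β_t(i):

#include <cstdio>
#include <vector>
using Mat = std::vector<std::vector<double>>;
using Vec = std::vector<double>;

// Forward-backward state estimation.  a[i][j] = P(state j at t+1 | state i at t),
// b[i][k] = P(symbol k | state i), obs[t] = observed symbol at time t.
// Returns pstate[t][i] = P(state i at time t | all the data).
// alpha and beta are renormalized at each t, which leaves the final ratio unchanged.
Mat forwardbackward(const Mat &a, const Mat &b, const std::vector<int> &obs) {
    int N = obs.size(), m = a.size();
    Mat alpha(N, Vec(m)), beta(N, Vec(m)), pstate(N, Vec(m));
    auto normalize = [](Vec &v) {
        double s = 0.; for (double x : v) s += x;
        for (double &x : v) x /= s;
    };
    for (int j = 0; j < m; j++) alpha[0][j] = b[j][obs[0]] / m;   // uniform prior at t = 0
    normalize(alpha[0]);
    for (int t = 1; t < N; t++) {                                  // forward pass
        for (int j = 0; j < m; j++) {
            double s = 0.;
            for (int i = 0; i < m; i++) s += alpha[t-1][i] * a[i][j];
            alpha[t][j] = s * b[j][obs[t]];
        }
        normalize(alpha[t]);
    }
    for (int i = 0; i < m; i++) beta[N-1][i] = 1.;                 // no data to the future of N-1
    for (int t = N-2; t >= 0; t--) {                               // backward pass
        for (int i = 0; i < m; i++) {
            double s = 0.;
            for (int j = 0; j < m; j++) s += a[i][j] * b[j][obs[t+1]] * beta[t+1][j];
            beta[t][i] = s;
        }
        normalize(beta[t]);
    }
    for (int t = 0; t < N; t++) {                                  // combine past and future
        for (int i = 0; i < m; i++) pstate[t][i] = alpha[t][i] * beta[t][i];
        normalize(pstate[t]);
    }
    return pstate;
}

int main() {
    // toy 2-state, 2-symbol model: state 0 emits 0/1 equally, state 1 mostly emits 0
    Mat a = {{0.95, 0.05}, {0.10, 0.90}};
    Mat b = {{0.5, 0.5}, {0.9, 0.1}};
    std::vector<int> obs = {0,0,0,0,0,0,1,0,1,1,0,1,1,1,0,1};
    Mat p = forwardbackward(a, b, obs);
    for (size_t t = 0; t < obs.size(); t++)
        std::printf("t=%2zu obs=%d P(state 1)=%.3f\n", t, obs[t], p[t][1]);
    return 0;
}

This is only a sketch; the course example below instead uses NR3's HMM class.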

12 Let's work a biologically motivated example. In the galaxy Zyzyx, the Qiqiqi lifeform has a genome consisting of a linear sequence of amino acids, each chosen from 26 chemical possibilities, denoted A-Z. Genes alternate with intergenic regions. In intergenic regions, all of A-Z are equiprobable. In genes, the vowels AEIOU are more frequent. Genes always end with Z. The length distribution of genes and intergenic regions is known (has been measured). Can we find the genes? On Earth, it's 20 amino acids, with the additional complication of a genetic code mapping three-letter base-4 codons (ACGT) into one a.a. Our example thus simplifies by having no ambiguity of reading frame, and also no ambiguity of strand. A sample of the genome (genes and intergenes were distinguished by color on the slide):
qnkekxkdlscovjfehvesmdeelnzlzjeknvjgetyuhgvxlvjnvqlmcermojkrtuczg
rbmpwrjtynonxveblrjuqiydehpzujdogaensduoermiadaaustihpialkxicilgk
tottxxwawjvenowzsuacnppiharwpqviuammkpzwwjboofvmrjwrtmzmcxdkclvky
vkizmckmpvwfoorbvvrnvuzfwszqithlkubjruoyyxgwvfgxzlzbkuwmkmzgmnsyb
(pvowel = 0.45)

13 The model has three states: intergene, gene, and the gene-ending Z. The code looks like this (NR3 C++ fragment; the symbols are ordered AEIOU then the consonants, so Z is index 25):

Int i,j,mstat=3;
MatDoub b(mstat,26,0.), a(mstat,mstat,0.);
a[0][0] = 1.-1./250.;  a[0][1] = 1.-a[0][0];      // intergene: mean length 250
a[1][1] = 1.-1./50.;   a[1][2] = 1.-a[1][1];      // gene: mean length 50
a[2][0] = 1.;                                     // the Z state returns to intergene
for (i=0;i<26;i++) b[0][i] = 1./26.;              // intergene: all symbols equiprobable
for (i=0;i<5;i++) b[1][i] = pvowel/5.;            // gene: vowels AEIOU enhanced
for (i=5;i<26;i++) b[1][i] = (1.-pvowel)/21.;
b[2][25] = 1.;                                    // the Z state always emits Z
HMM hmm(a,b,seq);
hmm.forwardbackward();

14 Embed in a mex-file and return hmm.pstate to Matlab. Matlab has its own HMM functions, but I haven't mastered them. (Could someone do this and show this example?)

[pstate pcorrect loglike merit] = hmmmex(0.45,1);
plot(1:260,pstate(1:260,2),'b')
hold on
plot(1:260,pstate(1:260,3),'r')

The forward-backward results on the previous data are shown as a plot of P(state G) (blue) and P(state Z) (red) versus position: the called gene begins at the actual start and ends at the right Z; elsewhere, enough chance excess vowels occur to make the call not completely sure, and another Z shows up in state Z. All the C++ code for this example is on the course website as hmmmex.cpp, but beware, it's not cleaned up or made pretty!

15 Bayesian re-estimation of the transition and output matrices (Baum-Welch re-estimation). Given the data, we can re-estimate A as follows: estimating as an average over the data, the re-estimated A_ij is the (expected) number of i to j transitions divided by the (expected) number of occurrences of state i; note that the overall likelihood L cancels in this ratio. (The backward recurrence guarantees that summing the expected transition counts out of i over all j reproduces the expected count of state i, so the re-estimated rows still sum to 1.)

16 Similarly, re-estimate b: the re-estimated b_ik is the (expected) number of occurrences of state i emitting symbol k divided by the (expected) number of occurrences of state i. The hatted A and b are improved estimates of the hidden parameters. With them, you can go back and re-estimate α and β. And so forth. Does this remind you of the EM method? It should! It is another special case. One can prove, by the same kind of convexity argument as previously, that Baum-Welch re-estimation always increases the overall likelihood L, iteratively, to an (as usual, possibly only local) maximum. Notice that re-estimation doesn't require any additional information, or any training data. It is pure unsupervised learning.
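
A sketch of what one Baum-Welch pass computes, in plain C++ with unscaled α's and β's (fine for short sequences; a production code would rescale). This is my own illustration, not the NR3 implementation; the key point is that both updates are ratios of expected counts, so the overall likelihood L cancels:

#include <vector>
using Mat = std::vector<std::vector<double>>;
using Vec = std::vector<double>;

// One Baum-Welch re-estimation pass.  a[i][j]: transition probabilities,
// b[i][k]: symbol probabilities, obs[t]: observed symbols.  On return, a and b
// are replaced by the re-estimated (hatted) values.
void baumwelch_step(Mat &a, Mat &b, const std::vector<int> &obs) {
    int N = obs.size(), m = a.size(), K = b[0].size();
    Mat alpha(N, Vec(m)), beta(N, Vec(m));
    for (int j = 0; j < m; j++) alpha[0][j] = b[j][obs[0]] / m;   // unscaled forward pass
    for (int t = 1; t < N; t++)
        for (int j = 0; j < m; j++) {
            double s = 0.;
            for (int i = 0; i < m; i++) s += alpha[t-1][i] * a[i][j];
            alpha[t][j] = s * b[j][obs[t]];
        }
    for (int i = 0; i < m; i++) beta[N-1][i] = 1.;                // unscaled backward pass
    for (int t = N-2; t >= 0; t--)
        for (int i = 0; i < m; i++) {
            double s = 0.;
            for (int j = 0; j < m; j++) s += a[i][j] * b[j][obs[t+1]] * beta[t+1][j];
            beta[t][i] = s;
        }
    // accumulate expected counts; the common factor 1/L would multiply every
    // numerator and denominator below, so it is simply omitted (L cancels)
    Mat anum(m, Vec(m, 0.)), bnum(m, Vec(K, 0.));
    Vec aden(m, 0.), bden(m, 0.);
    for (int t = 0; t < N; t++)
        for (int i = 0; i < m; i++) {
            double g = alpha[t][i] * beta[t][i];      // expected count of state i at time t
            bnum[i][obs[t]] += g;
            bden[i] += g;
            if (t < N-1) {
                aden[i] += g;
                for (int j = 0; j < m; j++)           // expected count of i -> j transitions
                    anum[i][j] += alpha[t][i] * a[i][j] * b[j][obs[t+1]] * beta[t+1][j];
            }
        }
    for (int i = 0; i < m; i++) {                     // hatted estimates
        for (int j = 0; j < m; j++) a[i][j] = anum[i][j] / aden[i];
        for (int k = 0; k < K; k++) b[i][k] = bnum[i][k] / bden[i];
    }
}

Each call uses the current a and b to compute the α's and β's (the E step) and then replaces a and b with the expected-count ratios (the M step); iterating this is the Baum-Welch hill-climb.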

17 Plots compare the parse before (the previous result) and after re-estimation (data size N = 10^5): parsing (forward-backward) can work on even small fragments, but re-estimation takes a lot of data. A further plot shows how the log-likelihood increases with iteration number.

18 On many problems, re-estimation can hill-climb to the right answer from amazingly crude initial guesses:

a[0][0] = 1.-1./100.;  a[0][1] = 1.-a[0][0];
a[1][1] = 1.-1./100.;  a[1][2] = 1.-a[1][1];
a[2][0] = 1.;
for (i=0;i<26;i++) { b[0][i] = 1./26.; b[1][i] = 1./26.; b[2][i] = 1./26.; }

That is, the guess says only that genes and intergenes alternate and are each about 100 long, and that there is a one-symbol gene-end marker; we don't claim to know anything about which symbols are preferred in genes, end-genes, or intergenes. In the resulting iteration plots, the log-likelihood increases monotonically, as does the accuracy (in this example we know the right answers!). A period of stagnation is not unusual. The 1st step's accuracy is an artifact: it is calling all states as 0, which happens to be true ~90% of the time.

19 Final estimates of the transition and symbol probability matrices: [tables of the estimated A (rows/columns labeled state I, state G, state Z, annotated with implied mean residence times of about 166 and 33.2; why these values?) and of the estimated b (columns A E I O U B C D F G H J K L M N P Q R S T V W X Y Z); numerical entries not reproduced here.] Yes, it discovered all the vowels. It discovered Z, but didn't quite figure out that state 3 always emits Z.

20 An obvious flaw in the model: self-loops in Markov models must always give (a discrete approximation of) exponentially distributed residence times, because the waiting time to an exit event, as in a Poisson process, is exponentially distributed. But Qiqiqi genes and intergenes are roughly gamma-law distributed in length. (In fact, they're exactly gamma-law, because that's how I constructed the genome, not from a Markov model!)

mug = 50.; sigg = 10.; mui = 250.;
ag = SQR(mug/sigg);  bg = mug/SQR(sigg);        // gamma shape and rate for gene lengths
Gammadev rani(2.,2./mui,ran.int32());           // intergene lengths ~ Gamma(2, 2/mui)
Gammadev rang(ag,bg,ran.int32());               // gene lengths ~ Gamma(ag, bg)

Can we make the results more accurate by somehow incorporating length information?
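
A quick numerical check of the mismatch (my own illustration, not from the slides): a single self-loop state tuned to the right mean gene length necessarily has a residence-time spread comparable to that mean, whereas the true gene lengths have mean 50 and standard deviation 10:

#include <cmath>
#include <cstdio>

// Residence time of a single state with self-loop probability a_ii is geometric:
// P(tau = k) = a_ii^(k-1) * (1 - a_ii), the discrete analog of an exponential,
// with mean 1/(1 - a_ii) and standard deviation sqrt(a_ii)/(1 - a_ii).
int main() {
    double aii = 1. - 1./50.;        // the "gene" state of the HMM above
    double q = 1. - aii;
    double mean = 1./q;
    double sd = std::sqrt(aii)/q;
    std::printf("HMM gene state: mean residence = %.1f, sd = %.1f\n", mean, sd);
    std::printf("true Qiqiqi genes: mean = 50, sd = 10 (gamma-law)\n");
    return 0;
}

So the single-state HMM gets the mean right but badly overestimates the spread of gene lengths, which is what the generalized model of the next slides fixes.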

21 Generalized Hidden Markov Model (GHMM), also called Hidden Semi-Markov Model (HSMM): the idea is to impose (or learn by re-estimation) an arbitrary probability distribution for the residency time τ in each state. It can be thought of as an ordinary HMM where every state gets expanded into a timer cluster, with output symbol probabilities identical for all states in a timer (equal to those of the state before it was expanded). For an arbitrary distribution with τ ≤ n, chaining n states with exit probabilities p_1, ..., p_n gives τ ~ [p_1, (1-p_1)p_2, (1-p_1)(1-p_2)p_3, ...]; for a Gamma-law distribution, chaining identical states gives τ ~ Gamma(α, p) with ⟨τ⟩ = α/p.
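
A tiny sketch of the first (arbitrary-distribution) timer construction, with made-up exit probabilities (not from the slides), showing that passing through states 1..n and exiting state k with probability p_k reproduces the bracketed distribution above:

#include <cstdio>
#include <vector>

// Residence-time distribution of a timer cluster: from timer state k the model
// exits the cluster with probability p_k, otherwise advances to state k+1.
// With p_n = 1, tau <= n and P(tau = k) = (1-p_1)...(1-p_{k-1}) p_k.
int main() {
    std::vector<double> p = {0.1, 0.2, 0.4, 0.7, 1.0};   // made-up exit probabilities
    double survive = 1.0;                                // P(not yet exited)
    double mean = 0.;
    for (size_t k = 0; k < p.size(); k++) {
        double prob = survive * p[k];                    // P(tau = k+1)
        std::printf("P(tau = %zu) = %.4f\n", k + 1, prob);
        mean += (k + 1) * prob;
        survive *= (1. - p[k]);
    }
    std::printf("<tau> = %.3f\n", mean);
    return 0;
}

The Gamma-law variant instead chains identical timer states, each with the same exit probability, so τ becomes a sum of geometrics with mean α/p; that is the construction used in the code on the next slide.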

22 So, our intergene-gene-Z example becomes a GHMM. Initialize the model like this (the slide's annotations are shown here as comments):

// n0 and n1 are input values (we'll try various choices)
Int i,j,n01=n0+n1,mstat=n01+1,niter=40;
// initial guess for lengths (need not be this perfect)
Doub len0=250.,len1=50.;
MatDoub b(mstat,26,0.), a(mstat,mstat,0.);
// Gamma-law timers for the intergene and gene clusters
for (i=0;i<n0;i++) { a[i][i] = 1.-n0/len0; a[i][i+1] = 1.-a[i][i]; }
for (i=n0;i<n01;i++) { a[i][i] = 1.-n1/len1; a[i][i+1] = 1.-a[i][i]; }
a[n01][0] = 1.;
for (j=0;j<n01;j++) for (i=0;i<26;i++) b[j][i] = 1./26.;
// tell it about Z, but not about vowels
b[n01][25] = 1.;

23 We can use NR3's HMM class for GHMMs by the kludge of averaging the output probabilities over each timer cluster after each Baum-Welch re-estimation:

HMM hmm(a,b,seq);
hmm.forwardbackward();
for (i=1;i<niter;i++) {
    hmm.baumwelch();
    collapse_to_ghmm(hmm,n0,n1);
    hmm.forwardbackward();
}

with

void collapse_to_ghmm(HMM &hmm, Int n0, Int n1) {
    Int i,j,n01=n0+n1;
    Doub sum;
    for (j=0;j<26;j++) {
        // average the symbol-j probability over the intergene timer states...
        for (sum=0.,i=0;i<n0;i++) sum += hmm.b[i][j];
        sum /= n0;
        for (i=0;i<n0;i++) hmm.b[i][j] = sum;
        // ...and separately over the gene timer states
        for (sum=0.,i=n0;i<n01;i++) sum += hmm.b[i][j];
        sum /= n1;
        for (i=n0;i<n01;i++) hmm.b[i][j] = sum;
    }
}

See hmmmex.cpp. Actually this is not quite right, because it should be a weighted average, weighted by the number of times each state is occupied. The right way to do this would be to overload hmm.baumwelch with a slightly modified version that does the average properly. In this example the effect would be negligible.

24 So how well do we do? For n0 = n1 = 1 (the previous HMM), n0 = 2, n1 = 5, and n0 = 3, n1 = 8, the slide tabulates the accuracy and the confusion-matrix table for each case [numerical values not reproduced here]. Accuracy with respect to genes is shown as a table of TP, FP, FN, TN; typically, for this example, it's starting a gene ~5 too late or ~3 too early. Here sensitivity = TP/(TP+FN) and specificity = TN/(FP+TN). For whole genes (length ~50), the sensitivity and specificity are basically 1, because, with pvowel = 0.45, the gene is highly statistically significant. What the HMM or GHMM does well is to call the boundaries as exactly as possible. Obviously there's a theoretical bound on the achievable accuracy, given that the exact sequence of an FP or FN might also occur as a TP or TN. Can you calculate or estimate the bound?
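
For reference, the accuracy bookkeeping behind those tables is just a per-symbol confusion matrix; a sketch with made-up truth and call arrays (not the slide's numbers):

#include <cstdio>

// Per-symbol gene-calling accuracy: compare the called state (gene / not gene)
// against the known truth and tabulate TP, FP, FN, TN.
int main() {
    int truth[] = {0,0,1,1,1,1,0,0,1,1,1,0};   // 1 = position is really in a gene
    int call[]  = {0,0,0,1,1,1,0,0,1,1,1,1};   // 1 = position is called as gene
    long TP = 0, FP = 0, FN = 0, TN = 0;
    int n = sizeof(truth)/sizeof(truth[0]);
    for (int t = 0; t < n; t++) {
        if (truth[t] && call[t]) TP++;
        else if (!truth[t] && call[t]) FP++;
        else if (truth[t] && !call[t]) FN++;
        else TN++;
    }
    std::printf("TP=%ld FP=%ld FN=%ld TN=%ld\n", TP, FP, FN, TN);
    std::printf("sensitivity = TP/(TP+FN) = %.3f\n", double(TP)/(TP+FN));
    std::printf("specificity = TN/(FP+TN) = %.3f\n", double(TN)/(FP+TN));
    std::printf("accuracy    = (TP+TN)/all = %.3f\n", double(TP+TN)/n);
    return 0;
}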
