Hidden Markov Models
1 Hidden Markov Models [Trellis figure: hidden states 1..K at each position, emitting observations x_1, x_2, x_3, ..., x_N]
2 Example: The dishonest casino. A casino has two dice: Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6. Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2. The casino player switches between fair and loaded die with probability 1/20 at each turn. Game: 1. You bet $1 2. You roll (always with a fair die) 3. Casino player rolls (maybe with fair die, maybe with loaded die) 4. Highest number wins $2
3 Question # 1 Evaluation. GIVEN: a sequence of rolls by the casino player. QUESTION: how likely is this sequence, given our model of how the casino works? This is the EVALUATION problem in HMMs.
4 Question # 2 Decoding. GIVEN: a sequence of rolls by the casino player. QUESTION: what portion of the sequence was generated with the fair die, and what portion with the loaded die (FAIR / LOADED / FAIR)? This is the DECODING question in HMMs.
5 Question # 3 Learning. GIVEN: a sequence of rolls by the casino player. QUESTION: how loaded is the loaded die (e.g., Prob(6) = 64%)? How fair is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs.
6 The dishonest casino model. A Hidden Markov Model: we never observe the state, we only observe an output that depends (probabilistically) on the state. Emissions in the FAIR state: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6. Emissions in the LOADED state: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2. Transitions: FAIR <-> LOADED with probability 0.05 (stay in the same state with probability 0.95).
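To make the model concrete, here is one way to write these numbers down in code (a minimal Python sketch; the dictionary layout and variable names are choices made here, not something from the slides, and the 1/2, 1/2 start probabilities follow the worked example a few slides below).

states = ['F', 'L']            # F = fair die, L = loaded die
symbols = [1, 2, 3, 4, 5, 6]   # die faces

# Start probabilities a_0i (the 1/2, 1/2 choice used in the worked example below).
start = {'F': 0.5, 'L': 0.5}

# Transition probabilities a_ij: switch with probability 0.05, stay with 0.95.
trans = {'F': {'F': 0.95, 'L': 0.05},
         'L': {'F': 0.05, 'L': 0.95}}

# Emission probabilities e_k(b).
emit = {'F': {b: 1 / 6 for b in symbols},
        'L': {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}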
7 A parse of a sequence. Observation sequence x = x_1 ... x_N, state sequence π = π_1, ..., π_N. [Trellis figure: states 1..K at each of the N positions, emitting x_1, x_2, x_3, ..., x_N]
8 An HMM is memoryless. At each time step t, the only thing that affects future states is the current state π_t: P(π_{t+1} = k | whatever happened so far) = P(π_{t+1} = k | π_1, ..., π_t, x_1, x_2, ..., x_t) = P(π_{t+1} = k | π_t)
9 An HMM is memoryless. At each time step t, the only thing that affects x_t is the current state π_t: P(x_t = b | whatever happened so far) = P(x_t = b | π_1, ..., π_t, x_1, x_2, ..., x_{t-1}) = P(x_t = b | π_t)
10 Definition of a hidden Markov model. Alphabet Σ = { b_1, b_2, ..., b_M } = Val(x_t). Set of states Q = { 1, ..., K } = Val(π_t). Transition probabilities between any two states: a_ij = transition probability from state i to state j, with a_i1 + ... + a_iK = 1 for all states i = 1..K. Start probabilities a_0i, with a_01 + ... + a_0K = 1. Emission probabilities within each state: e_k(b) = P(x_t = b | π_t = k), with e_k(b_1) + ... + e_k(b_M) = 1 for all states k = 1..K.
11 Hidden Markov Models as Bayes Nets. [Figure: chain of unobserved states π_i → π_{i+1} → π_{i+2} → π_{i+3}, each emitting an observation x_i, x_{i+1}, x_{i+2}, x_{i+3}.] Transition probabilities between any two states: a_jk = transition probability from state j to state k = P(π_{i+1} = k | π_i = j). Emission probabilities within each state: e_k(b) = P(x_i = b | π_i = k).
12 How special a case are HMM BNs? Template-based representation: all hidden nodes share the same CPTs, as do all output nodes. Limited connectivity: the junction tree has clique nodes of size <= 2. As a result, special-purpose algorithms are simple and efficient.
13 HMMs are good for: speech recognition, gene sequence matching, text processing (part-of-speech tagging, information extraction), handwriting recognition.
14 Generating a sequence by the model. Given an HMM, we can generate a sequence of length n as follows: 1. Start at state π_1 according to probability a_{0 π_1} 2. Emit letter x_1 according to probability e_{π_1}(x_1) 3. Go to state π_2 according to probability a_{π_1 π_2} 4. ... until emitting x_n. What did we call this earlier, for general Bayes Nets?
15 A parse of a sequence. Given a sequence x = x_1 ... x_N, a parse of x is a sequence of states π = π_1, ..., π_N. [Trellis figure: states 1..K at each of the N positions, emitting x_1, x_2, x_3, ..., x_N]
16 Likelihood of a parse. Given a sequence x = x_1 ... x_N and a parse π = π_1, ..., π_N, how likely is this scenario (given our HMM)? P(x, π) = P(x_1, ..., x_N, π_1, ..., π_N) = P(x_N | π_N) P(π_N | π_{N-1}) ··· P(x_2 | π_2) P(π_2 | π_1) P(x_1 | π_1) P(π_1) = a_{0 π_1} a_{π_1 π_2} ··· a_{π_{N-1} π_N} e_{π_1}(x_1) ··· e_{π_N}(x_N)
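Translated directly into code, the product above might look like this sketch (Python; it assumes the start/trans/emit dictionaries from the casino sketch earlier, and the function name is made up for illustration).

def parse_likelihood(x, path, start, trans, emit):
    """P(x, pi) = a_{0,pi_1} e_{pi_1}(x_1) * prod_{i>1} a_{pi_{i-1},pi_i} e_{pi_i}(x_i)."""
    p = start[path[0]] * emit[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[path[i - 1]][path[i]] * emit[path[i]][x[i]]
    return p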
17 Example: the dishonest casino. Let the sequence of rolls be: x = 1, 2, 1, 5, 6, 2, 1, 5, 2, 4. Then, what is the likelihood of π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair? (say initial probs a_0Fair = ½, a_0Loaded = ½) ½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) ··· P(4 | Fair) = ½ × (1/6)^10 × (0.95)^9 ≈ 5.21 × 10^-9
18 Example: the dishonest casino. So, the likelihood that the die is fair throughout this run is just 5.21 × 10^-9. What is the likelihood of π = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded? ½ × P(1 | Loaded) P(Loaded | Loaded) ··· P(4 | Loaded) = ½ × (1/10)^9 × (1/2)^1 × (0.95)^9 ≈ 1.58 × 10^-10. Therefore, it is somewhat more likely (by a factor of about 30) that all the rolls are done with the fair die than that they are all done with the loaded die.
19 Example: the dishonest casino. Let the sequence of rolls be: x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6. Now, what is the likelihood of π = F, F, ..., F? ½ × (1/6)^10 × (0.95)^9 ≈ 5.21 × 10^-9, same as before. What is the likelihood of π = L, L, ..., L? ½ × (1/10)^4 × (1/2)^6 × (0.95)^9 ≈ 4.9 × 10^-7. So, it is about 100 times more likely that the die is loaded.
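These numbers are easy to check directly; the following lines just redo the arithmetic from the two examples (Python, approximate values in the comments).

# First sequence (1,2,1,5,6,2,1,5,2,4): one 6 among ten rolls.
p_fair_1   = 0.5 * (1 / 6) ** 10 * 0.95 ** 9                   # ~5.2e-09
p_loaded_1 = 0.5 * (1 / 10) ** 9 * (1 / 2) ** 1 * 0.95 ** 9    # ~1.6e-10

# Second sequence (1,6,6,5,6,2,6,6,3,6): six 6s among ten rolls.
p_fair_2   = 0.5 * (1 / 6) ** 10 * 0.95 ** 9                   # same as before
p_loaded_2 = 0.5 * (1 / 10) ** 4 * (1 / 2) ** 6 * 0.95 ** 9    # ~4.9e-07

print(p_loaded_1 / p_fair_1, p_loaded_2 / p_fair_2)   # ~0.03 vs. ~95: loaded wins the second time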
20 The three main questions for HMMs. 1. Evaluation: GIVEN an HMM M and a sequence x, FIND Prob[x | M]. 2. Decoding: GIVEN an HMM M and a sequence x, FIND the sequence π of states that maximizes P[x, π | M]. 3. Learning: GIVEN an HMM M with unspecified transition/emission probabilities and a sequence x, FIND parameters θ = (e_i(·), a_ij) that maximize P[x | θ].
21 Problem 1: Evaluation Compute the likelihood that a sequence is generated by the model
22 Generating a sequence by the model. Given an HMM, we can generate a sequence of length n as follows: 1. Start at state π_1 according to probability a_{0 π_1} 2. Emit letter x_1 according to probability e_{π_1}(x_1) 3. Go to state π_2 according to probability a_{π_1 π_2} 4. ... until emitting x_n.
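This generation procedure is ordinary forward (ancestral) sampling in the Bayes-net view; a minimal sketch, assuming the start/trans/emit dictionaries from the casino sketch above.

import random

def sample_sequence(n, start, trans, emit):
    """Sample observations x_1..x_n and states pi_1..pi_n from the HMM."""
    def draw(dist):
        # Draw one key from a {value: probability} dictionary.
        r, acc = random.random(), 0.0
        for value, p in dist.items():
            acc += p
            if r <= acc:
                return value
        return value  # guard against floating-point rounding

    xs, pis = [], []
    state = draw(start)                      # 1. start at pi_1 ~ a_{0, pi_1}
    for _ in range(n):
        pis.append(state)
        xs.append(draw(emit[state]))         # 2. emit x_i ~ e_{pi_i}(.)
        state = draw(trans[state])           # 3. move to pi_{i+1} ~ a_{pi_i, .}
    return xs, pis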
23 Evaluation. We will develop algorithms that allow us to compute: P(x), the probability of x given the model; P(x_i ... x_j), the probability of a substring of x given the model; P(π_i = k | x), the posterior probability that the i-th state is k, given x.
24 The Forward Algorithm. We want to calculate P(x) = probability of x, given the HMM. Sum over all possible ways of generating x: P(x) = Σ_π P(x, π) = Σ_π P(x | π) P(π). To avoid summing over exponentially many paths π, use variable elimination; in HMMs, VE has the same form at each position i. Define f_k(i) = P(x_1 ... x_i, π_i = k) (the forward probability): generate the first i observations and end up in state k.
25 The Forward Algorithm (derivation). Define the forward probability:
f_k(i) = P(x_1 ... x_i, π_i = k)
= Σ_{π_1 ... π_{i-1}} P(x_1 ... x_{i-1}, π_1, ..., π_{i-1}, π_i = k) e_k(x_i)
= Σ_j Σ_{π_1 ... π_{i-2}} P(x_1 ... x_{i-1}, π_1, ..., π_{i-2}, π_{i-1} = j) a_jk e_k(x_i)
= Σ_j P(x_1 ... x_{i-1}, π_{i-1} = j) a_jk e_k(x_i)
= e_k(x_i) Σ_j f_j(i-1) a_jk
26 The Forward Algorithm. We can compute f_k(i) for all k, i using dynamic programming! Initialization: f_0(0) = 1; f_k(0) = 0 for all k > 0. Iteration: f_k(i) = e_k(x_i) Σ_j f_j(i-1) a_jk. Termination: P(x) = Σ_k f_k(N).
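The same three steps in code (a sketch using numpy; here a0 is the vector of start probabilities, a[j, k] = a_jk, e[k, b] = e_k(b), and x is a list of integer symbol indices; all of these names are choices made for this sketch, not notation from the slides).

import numpy as np

def forward(x, a0, a, e):
    """Return (f, P(x)) where f[i, k] = P(x_1..x_{i+1}, pi_{i+1} = k)."""
    N, K = len(x), len(a0)
    f = np.zeros((N, K))
    f[0] = a0 * e[:, x[0]]                     # initialization folded one step: a_{0k} e_k(x_1)
    for i in range(1, N):
        f[i] = e[:, x[i]] * (f[i - 1] @ a)     # f_k(i) = e_k(x_i) sum_j f_j(i-1) a_jk
    return f, f[-1].sum()                      # termination: P(x) = sum_k f_k(N)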
27 Backward Algorithm. The forward algorithm can compute P(x). But we would also like P(π_i = k | x), the probability distribution of the i-th state, given x. Again, we use variable elimination. We start by computing:
P(π_i = k, x) = P(x_1 ... x_i, π_i = k, x_{i+1} ... x_N)
= P(x_1 ... x_i, π_i = k) P(x_{i+1} ... x_N | x_1 ... x_i, π_i = k)
= P(x_1 ... x_i, π_i = k) P(x_{i+1} ... x_N | π_i = k)
= [Forward, f_k(i)] × [Backward, b_k(i)]
Then, P(π_i = k | x) = P(π_i = k, x) / P(x).
28 The Backward Algorithm (derivation). Define the backward probability:
b_k(i) = P(x_{i+1} ... x_N | π_i = k)   (starting from the i-th state π_i = k, generate the rest of x)
= Σ_{π_{i+1} ... π_N} P(x_{i+1}, x_{i+2}, ..., x_N, π_{i+1}, ..., π_N | π_i = k)
= Σ_j Σ_{π_{i+2} ... π_N} P(x_{i+1}, x_{i+2}, ..., x_N, π_{i+1} = j, π_{i+2}, ..., π_N | π_i = k)
= Σ_j e_j(x_{i+1}) a_kj Σ_{π_{i+2} ... π_N} P(x_{i+2}, ..., x_N, π_{i+2}, ..., π_N | π_{i+1} = j)
= Σ_j e_j(x_{i+1}) a_kj b_j(i+1)
29 The Backward Algorithm. We can compute b_k(i) for all k, i using dynamic programming. Initialization: b_k(N) = 1 for all k. Iteration: b_k(i) = Σ_l e_l(x_{i+1}) a_kl b_l(i+1). Termination: P(x) = Σ_l a_0l e_l(x_1) b_l(1).
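And the mirror image for the backward pass (a sketch with the same array conventions as the forward sketch above).

import numpy as np

def backward(x, a0, a, e):
    """Return (b, P(x)) where b[i, k] = P(x_{i+2}..x_N | pi_{i+1} = k)."""
    N, K = len(x), len(a0)
    b = np.zeros((N, K))
    b[N - 1] = 1.0                                    # initialization: b_k(N) = 1
    for i in range(N - 2, -1, -1):
        b[i] = a @ (e[:, x[i + 1]] * b[i + 1])        # b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1)
    return b, np.sum(a0 * e[:, x[0]] * b[0])          # termination: P(x) = sum_l a_0l e_l(x_1) b_l(1)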
30 Computational Complexity. What is the running time, and space required, for Forward and Backward? Time: O(K²N). Space: O(KN). Useful implementation technique to avoid underflows: rescaling every few positions by multiplying by a constant. (Space is O(KN) assuming we want to save all forward and backward messages; otherwise space can be O(K).)
31 Posterior Decoding. We can now calculate P(π_i = k | x) = f_k(i) b_k(i) / P(x):
P(π_i = k | x) = P(π_i = k, x) / P(x)
= P(x_1, ..., x_i, π_i = k, x_{i+1}, ..., x_N) / P(x)
= P(x_1, ..., x_i, π_i = k) P(x_{i+1}, ..., x_N | π_i = k) / P(x)
= f_k(i) b_k(i) / P(x)
What is the most likely state at position i of sequence x? Define π̂ by posterior decoding: π̂_i = argmax_k P(π_i = k | x).
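Putting the two passes together (a sketch that reuses the forward and backward functions sketched above).

import numpy as np

def posterior_decode(x, a0, a, e):
    """pi_hat_i = argmax_k P(pi_i = k | x) = argmax_k f_k(i) b_k(i) / P(x)."""
    f, px = forward(x, a0, a, e)       # forward sketch above
    b, _ = backward(x, a0, a, e)       # backward sketch above
    posterior = f * b / px             # posterior[i, k] = P(pi_i = k | x)
    return posterior.argmax(axis=1), posterior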
32 Posterior Decoding. [Figure: matrix of posterior probabilities P(π_i = j | x) for each state j = 1..K (rows) and each position i = 1..N of x_1 x_2 x_3 ... x_N (columns).]
33 Problem 2: Decoding. Find the most likely parse of a sequence.
34 Most likely sequence vs. individual states. Given a sequence x, the most likely sequence of states can be very different from the sequence of most likely states. Example: the dishonest casino. Consider a marked stretch of eleven rolls containing six 6s.
P(stretch parsed as FFFFFFFFFFF) = (1/6)^11 × (0.95)^12 = 2.76 × 10^-9 × 0.54 = 1.49 × 10^-9
P(stretch parsed as LLLLLLLLLLL) = [(1/2)^6 × (1/10)^5] × (0.95)^10 × (0.05)^2 = 1.56 × 10^-7 × 1.5 × 10^-3 = 0.23 × 10^-9
Most likely path: π = FFF...F (it is too unlikely to transition F → L → F). However: the marked letters are individually more likely to be L than the unmarked letters.
35 Decoding. GIVEN x = x_1 x_2 ... x_N, find π = π_1, ..., π_N to maximize P[x, π]: π* = argmax_π P[x, π]. This maximizes a_{0 π_1} e_{π_1}(x_1) a_{π_1 π_2} ··· a_{π_{N-1} π_N} e_{π_N}(x_N). Like the forward algorithm, but with max instead of sum. Define V_k(i) = max_{π_1 ... π_{i-1}} P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k] = probability of the most likely sequence of states ending at state π_i = k.
36 Decoding, main idea. Induction: given that for all states k, and for a fixed position i, V_k(i) = max_{π_1 ... π_{i-1}} P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k], what is V_j(i+1)? From the definition,
V_j(i+1) = max_{π_1 ... π_i} P[x_1 ... x_i, π_1, ..., π_i, x_{i+1}, π_{i+1} = j]
= max_{π_1 ... π_i} P(x_{i+1}, π_{i+1} = j | x_1 ... x_i, π_1, ..., π_i) P[x_1 ... x_i, π_1, ..., π_i]
= max_{π_1 ... π_i} P(x_{i+1}, π_{i+1} = j | π_i) P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i]
= max_k [ P(x_{i+1}, π_{i+1} = j | π_i = k) max_{π_1 ... π_{i-1}} P[x_1 ... x_{i-1}, π_1, ..., π_{i-1}, x_i, π_i = k] ]
= max_k [ P(x_{i+1} | π_{i+1} = j) P(π_{i+1} = j | π_i = k) V_k(i) ]
= e_j(x_{i+1}) max_k a_kj V_k(i)
37 The Viterbi Algorithm. Input: x = x_1 ... x_N. Initialization: V_0(0) = 1; V_k(0) = 0 for all k > 0 (0 is the imaginary first position). Iteration: V_j(i) = e_j(x_i) max_k a_kj V_k(i-1); Ptr_j(i) = argmax_k a_kj V_k(i-1). Termination: P(x, π*) = max_k V_k(N). Traceback: π*_N = argmax_k V_k(N); π*_{i-1} = Ptr_{π*_i}(i).
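The full recurrence with pointers and traceback (a sketch with the same array conventions as the forward sketch earlier; it works in probability space, so it will underflow on long sequences, which the practical-detail slide below addresses).

import numpy as np

def viterbi(x, a0, a, e):
    """Return the most likely state path pi* and P(x, pi*)."""
    N, K = len(x), len(a0)
    V = np.zeros((N, K))
    ptr = np.zeros((N, K), dtype=int)
    V[0] = a0 * e[:, x[0]]
    for i in range(1, N):
        scores = V[i - 1][:, None] * a             # scores[k, j] = V_k(i-1) a_kj
        ptr[i] = scores.argmax(axis=0)             # Ptr_j(i) = argmax_k a_kj V_k(i-1)
        V[i] = e[:, x[i]] * scores.max(axis=0)     # V_j(i) = e_j(x_i) max_k a_kj V_k(i-1)
    path = [int(V[-1].argmax())]                   # pi*_N
    for i in range(N - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))         # pi*_{i-1} = Ptr_{pi*_i}(i)
    path.reverse()
    return path, float(V[-1].max())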
38 The Viterbi Algorithm. [Figure: K × N dynamic-programming matrix of values V_j(i), filled left to right over x_1 x_2 x_3 ... x_N.] Time: O(K²N). Space: O(KN).
39 Viterbi Algorithm: a practical detail. Underflows are a significant problem (as with forward and backward): P[x_1, ..., x_i, π_1, ..., π_i] = a_{0 π_1} a_{π_1 π_2} ··· a_{π_{i-1} π_i} e_{π_1}(x_1) ··· e_{π_i}(x_i). These numbers become extremely small, causing underflow. Solution: take the logs of all values: V_l(i) = log e_l(x_i) + max_k [ V_k(i-1) + log a_kl ].
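In code the change is confined to the iteration; a sketch, assuming the a0, a, e, x arrays used in the Viterbi sketch above (log(0) becomes -inf, which max handles correctly).

import numpy as np

with np.errstate(divide='ignore'):                 # allow log(0) -> -inf for zero probabilities
    log_a0, log_a, log_e = np.log(a0), np.log(a), np.log(e)

logV = np.zeros((len(x), len(a0)))
logV[0] = log_a0 + log_e[:, x[0]]
for i in range(1, len(x)):
    # logV_l(i) = log e_l(x_i) + max_k [ logV_k(i-1) + log a_kl ]
    logV[i] = log_e[:, x[i]] + (logV[i - 1][:, None] + log_a).max(axis=0)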
40 Example. Let x be a long sequence with a portion of ~1/6 6s, followed by a portion of ~1/2 6s. Then it is not hard to show that the optimal parse is FFF...F LLL...L. A stretch of 6 characters containing one 6, parsed as F, contributes (0.95)^6 (1/6)^6 ≈ 1.6 × 10^-5; parsed as L, it contributes (0.95)^6 (1/2)^1 (1/10)^5 ≈ 0.4 × 10^-5. A stretch of 6 characters containing three 6s, parsed as F, contributes (0.95)^6 (1/6)^6 ≈ 1.6 × 10^-5; parsed as L, it contributes (0.95)^6 (1/2)^3 (1/10)^3 ≈ 9.0 × 10^-5.
41 Viterbi, Forward, Backward.
VITERBI. Initialization: V_0(0) = 1; V_k(0) = 0 for all k > 0. Iteration: V_l(i) = e_l(x_i) max_k V_k(i-1) a_kl. Termination: P(x, π*) = max_k V_k(N).
FORWARD. Initialization: f_0(0) = 1; f_k(0) = 0 for all k > 0. Iteration: f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl. Termination: P(x) = Σ_k f_k(N).
BACKWARD. Initialization: b_k(N) = 1 for all k. Iteration: b_k(i) = Σ_l e_l(x_{i+1}) a_kl b_l(i+1). Termination: P(x) = Σ_k a_0k e_k(x_1) b_k(1).
42 Problem 3: Learning Find the parameters that maximize the likelihood of the observed sequence
43 Estimating HMM parameters. Easy if we know the sequence of hidden states: count the number of times each transition occurs, and count the number of times each observation occurs in each state. Given an HMM and an observed sequence, we can compute the distribution over paths, and therefore the expected counts. A chicken-and-egg problem.
44 Solution: Use the EM algorithm Guess initial HMM parameters E step: Compute distribution over paths M step: Compute max likelihood parameters But how do we do this efficiently?
45 The forward-backward algorithm. Also known as the Baum-Welch algorithm. Compute the probability of each state at each position, using the forward and backward probabilities: (expected) observation counts. Compute the probability of each pair of states at each pair of consecutive positions i and i+1, using forward(i) and backward(i+1): (expected) transition counts, Count(j → k) ∝ Σ_i f_j(i) a_jk e_k(x_{i+1}) b_k(i+1).
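A sketch of the E-step counts for a single sequence, reusing the forward and backward functions sketched earlier (the division by P(x) turns each product into a posterior probability; variable names here are illustrative choices).

import numpy as np

def expected_counts(x, a0, a, e):
    """E-step of Baum-Welch: expected state-occupancy and transition counts for one sequence."""
    N, K = len(x), len(a0)
    f, px = forward(x, a0, a, e)
    b, _ = backward(x, a0, a, e)

    gamma = f * b / px                        # gamma[i, k] = P(pi_i = k | x): expected observation counts
    xi = np.zeros((K, K))                     # xi[j, k] = expected number of j -> k transitions
    for i in range(N - 1):
        # f_j(i) * a_jk * e_k(x_{i+1}) * b_k(i+1) / P(x), summed over positions i
        xi += np.outer(f[i], e[:, x[i + 1]] * b[i + 1]) * a / px

    emit_counts = np.zeros_like(e)            # expected count of each symbol emitted from each state
    for i, sym in enumerate(x):
        emit_counts[:, sym] += gamma[i]
    return gamma, xi, emit_counts             # the M-step renormalizes xi and emit_counts into new a, e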
46 Application: HMMs for Information Extraction (IE). IE: text → machine-understandable data. "Paris, the capital of France, ..." → (Paris, France) ∈ CapitalOf, p = 0.85. Applied to the Web: better search engines, the semantic Web, a step toward human-level AI.
47 IE Automatically? It is intractable to get human labels for every concept expressed on the Web. Idea: extract from semantically tractable sentences. "Edison invented the light bulb" → (Edison, light bulb) ∈ Invented, via the pattern x V y => (x, y) ∈ V. "Bloomberg, mayor of New York City" → (Bloomberg, New York City) ∈ Mayor, via the pattern x, C of y => (x, y) ∈ C.
48 But... extraction patterns make errors: "Erik Jonsson, CEO of Texas Instruments, mayor of Dallas from ..., and ...". Empirical fact: extractions you see over and over tend to be correct. The problem is the long tail.
49 Challenge: the long tail. [Figure: number of times an extraction appears in a pattern vs. frequency rank of the extraction. Frequent extractions, e.g., (Bloomberg, New York City), tend to be correct; the long tail, e.g., (Dave Shaver, Pickerington) and (Ronald McDonald, McDonaldland), is a mixture of correct and incorrect.]
50 Mayor McCheese. [Image]
51 Assessing Sparse Extractions. Strategy: 1) Model how common extractions occur in text. 2) Rank sparse extractions by fit to the model.
52 The Distributional Hypothesis. Terms in the same class tend to appear in similar contexts.
Context | Hits with Chicago | Hits with Twisp
cities including | 4,000 | 1
and other cities | 37,900 | 0
hotels | 2,000,000 | 1,670
mayor of | 657,... | ...
53 HMM Language Models. Precomputed → scalable. Handle sparsity.
54 Baseline: context vectors. Example contexts for a term: "cities such as Chicago, Boston, ...", "But Chicago isn't the best ...", "cities such as Chicago, Boston, Los Angeles and Chicago." Each context occurrence becomes a count in the term's vector. Compute dot products between the vectors of common and sparse extractions [cf. Ravichandran et al. 2005].
55 Hidden Markov Model (HMM). [Figure: hidden states t_i, t_{i+1}, t_{i+2}, t_{i+3} (unobserved) emitting words w_i, w_{i+1}, w_{i+2}, w_{i+3} (observed), e.g., "cities such as Seattle".] Hidden states t_i ∈ {1, ..., N} (N fairly small). Train on unlabeled data. P(t_i | w_i = w) is an N-dimensional distributional summary of w. Compare extractions using KL divergence.
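Comparing two terms then reduces to comparing their state distributions; a small sketch using scipy's entropy function for KL divergence (the two distributions below are made-up illustrative values, not numbers from the slides).

import numpy as np
from scipy.stats import entropy

# Hypothetical distributional summaries P(t | w) over N = 4 hidden states.
p_chicago      = np.array([0.70, 0.10, 0.15, 0.05])
p_pickerington = np.array([0.65, 0.05, 0.20, 0.10])

kl = entropy(p_chicago, p_pickerington)   # KL(P || Q); a small value means distributionally similar terms
print(kl)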
56 HMM Compresses Context Vectors. [Figure: Twisp's sparse context vector vs. its distributional summary P(t | Twisp) for t = 1..N.] The distributional summary P(t | w) is compact (efficient: 10-50x less data retrieved) and dense (accurate: 3-46% error reduction).
57 Example. Is Pickerington of the same type as Chicago? "Chicago, Illinois" vs. "Pickerington, Ohio": their context vectors share no contexts, so context vectors say no: the dot product is 0!
58 Example: the HMM generalizes. [Figure: "Chicago, Illinois" and "Pickerington, Ohio".]
59 Experimental Results. Task: ranking sparse TextRunner extractions. Metric: area under the precision-recall curve. [Chart: results for Frequency, PL, and LM rankers on Headquartered, Merged, and Average.] Language models reduce missing area by 39% over the nearest competitor.
60 Example word distributions (1 of 2). P(word | state 3): top words include unk, new, more, few, small, good, large, great, important, other, major, little (probabilities omitted). P(word | state 4): punctuation such as , ; ! :
61 Example word distributions (2 of 2). P(word | state 1): top words include unk, United+States, world, U.S., University, Internet, time, end, war, country, way, city, US, Sun, Earth. P(word | state 3): the, a, an, its, our, this, unk, your.
62 Correlation between LM and IE accuracy. [Table of correlation coefficients omitted.] As LM error decreases, IE accuracy increases.
63 Correlation between LM and IE accuracy
64 Correlation between LM and IE accuracy
65 What this suggests. Better HMM language models => better information extraction. Better HMM language models => ... => human-level AI? Consider: a good enough LM could do question answering, pass the Turing Test, etc. There are lots of paths to human-level AI, but LMs have: well-defined progress, and ridiculous amounts of training data.
66 Also: active learning. Today, people train language models by taking what comes. Larger corpora => better language models, but corpus size is limited by the number of humans typing. What if we asked for the most informative sentences? (active learning)
67 What have we learned? In HMMs, general Bayes Net algorithms have a simple and efficient form. 1. Evaluation: GIVEN an HMM M and a sequence x, FIND Prob[x | M]. Forward Algorithm and Backward Algorithm (variable elimination). 2. Decoding: GIVEN an HMM M and a sequence x, FIND the sequence π of states that maximizes P[x, π | M]. Viterbi Algorithm (MAP query). 3. Learning: GIVEN a sequence x, FIND HMM parameters θ = (e_i(·), a_ij) that maximize P[x | θ]. Baum-Welch/Forward-Backward algorithm (EM).
68 What have we learned? Unsupervised Learning of HMMs can power more scalable, accurate unsupervised IE Today, unsupervised learning of neural network language models is much more popular Deep networks to be discussed in future weeks