Hidden Markov Models: Maxing and Summing

1 Hidden Markov Models: Maxing and Summing. Introduction to Natural Language Processing, Computer Science 585, Fall 2009, University of Massachusetts Amherst. David Smith

2 Markov vs. Hidden Markov Models
[Figure: a Markov model whose states are the words themselves (Fed, raises, rates, interest), next to a hidden Markov model whose states are hidden tags (surviving tag labels: S, Z, P), each emitting words such as Fed, raises, rates, interest.]

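9 Unrolled into a Trellis
[Figure: the HMM unrolled over time into a trellis, with one column of tag states per word; surviving state labels: S, P, Z.]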

10 HMM Inference Problems
- Tagging: given an observation sequence, find the most likely state sequence.
- Language modeling: compute the probability of the observations when the state sequence is hidden.
- Parameter estimation: given observations and (optionally) their corresponding states, find the parameters that maximize the probability of the observations.

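11 Tagging
Given an observation sequence, find the most likely state sequence. The denominator $P(O \mid \mu)$ does not depend on $X$, so it can be dropped from the arg max:
$$\arg\max_X P(X \mid O, \mu) = \arg\max_X \frac{P(X, O \mid \mu)}{P(O \mid \mu)} = \arg\max_X P(X, O \mid \mu) = \arg\max_{x_1, x_2, \dots, x_T} P(x_1, x_2, \dots, x_T, O \mid \mu)$$
Last time: use dynamic programming to find the highest-probability sequence (i.e., the best path, like Dijkstra's algorithm).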

12 Language Modeling
Compute the probability of the observations when the state sequence is hidden.
$$P(X, O \mid \mu) = P(O \mid X, \mu)\, P(X \mid \mu)$$
Therefore
$$P(O \mid \mu) = \sum_X P(O \mid X, \mu)\, P(X \mid \mu) = \sum_{x_1, x_2, \dots, x_T} P(x_1, x_2, \dots, x_T, O \mid \mu)$$
Suspiciously similar to
$$\max_{x_1, x_2, \dots, x_T} P(x_1, x_2, \dots, x_T, O \mid \mu)$$
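A brute-force sketch of the max/sum parallel, assuming a hypothetical two-tag model (tags N and V, start symbol `<s>`) with invented probabilities; none of these names or numbers come from the slides. It enumerates every tag sequence and applies max (tagging) or sum (language modeling) to the same joint probability.

```python
from itertools import product

# Hypothetical toy model; every number here is invented for illustration.
TRANS = {"<s>": {"N": 0.8, "V": 0.2},   # TRANS[prev][tag] plays the role of a[tag | prev]
         "N": {"N": 0.3, "V": 0.7},
         "V": {"N": 0.6, "V": 0.4}}
EMIT = {"N": {"Fed": 0.5, "raises": 0.2, "rates": 0.3},   # EMIT[tag][word] = b[word | tag]
        "V": {"Fed": 0.1, "raises": 0.6, "rates": 0.3}}


def joint(tags, words):
    """P(X, O | mu): product of transition and emission probabilities."""
    p, prev = 1.0, "<s>"
    for tag, word in zip(tags, words):
        p *= TRANS[prev][tag] * EMIT[tag][word]
        prev = tag
    return p


words = ["Fed", "raises", "rates"]
paths = list(product(EMIT, repeat=len(words)))   # all 2**3 tag sequences

best = max(paths, key=lambda tags: joint(tags, words))   # tagging: arg max
total = sum(joint(tags, words) for tags in paths)        # language modeling: sum
```

Enumeration needs K^T joint evaluations for K tags and T words, which is why both problems are solved with dynamic programming over the trellis in practice.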

13 Viterbi Algorithm (Tagging)
[Figure: trellis with one node per tag at each time step (surviving tag labels: S, P, Z). Each tag j at time 2 carries a score δ_j(2), and its arc into one tag at time 3 is weighted by a[· | j] b[interest | ·], where · is that time-3 tag. The max over the incoming products δ_j(2) a[· | j] b[interest | ·] is that tag's δ(3).]
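The max step in the trellis can be sketched as follows. This is a minimal illustration, not the course's implementation: the tags N and V, the start symbol `<s>`, and all probabilities are hypothetical, and the lists are indexed by word position rather than by the slide's time index.

```python
# Hypothetical toy model; every number here is invented for illustration.
TRANS = {"<s>": {"N": 0.8, "V": 0.2},   # TRANS[i][j] plays the role of a[j | i]
         "N": {"N": 0.3, "V": 0.7},
         "V": {"N": 0.6, "V": 0.4}}
EMIT = {"N": {"Fed": 0.5, "raises": 0.2, "rates": 0.3},   # EMIT[j][w] = b[w | j]
        "V": {"Fed": 0.1, "raises": 0.6, "rates": 0.3}}


def viterbi(words, trans, emit, start="<s>"):
    """Return (best tag sequence, its joint probability with the words)."""
    tags = list(emit)
    # delta[k][j]: probability of the best path ending in tag j at word k
    delta = [{j: trans[start][j] * emit[j][words[0]] for j in tags}]
    back = []   # back[k][j]: best predecessor of tag j at word k + 1
    for word in words[1:]:
        prev = delta[-1]
        scores, pointers = {}, {}
        for j in tags:
            i_best = max(tags, key=lambda i: prev[i] * trans[i][j])
            scores[j] = prev[i_best] * trans[i_best][j] * emit[j][word]
            pointers[j] = i_best
        delta.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda j: delta[-1][j])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[-1][last]


path, prob = viterbi(["Fed", "raises", "rates"], TRANS, EMIT)
```

Each column costs K^2 multiplications for K tags, so the whole sentence costs on the order of T K^2 instead of K^T.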

21 Forward Algorithm (LM)
[Figure: the same trellis as for Viterbi (surviving tag labels: S, P, Z). Each tag j at time 2 carries a score α_j(2), and its arc into one tag at time 3 is weighted by a[· | j] b[interest | ·], where · is that time-3 tag. The sum over the incoming products α_j(2) a[· | j] b[interest | ·] is that tag's α(3).]
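Replacing max with sum gives the forward recursion. The sketch below assumes the same kind of hypothetical toy model (tags N and V, start symbol `<s>`, invented probabilities) and indexes the α values by word position rather than by the slide's time index.

```python
# Hypothetical toy model; every number here is invented for illustration.
TRANS = {"<s>": {"N": 0.8, "V": 0.2},   # TRANS[i][j] plays the role of a[j | i]
         "N": {"N": 0.3, "V": 0.7},
         "V": {"N": 0.6, "V": 0.4}}
EMIT = {"N": {"Fed": 0.5, "raises": 0.2, "rates": 0.3},   # EMIT[j][w] = b[w | j]
        "V": {"Fed": 0.1, "raises": 0.6, "rates": 0.3}}


def forward(words, trans, emit, start="<s>"):
    """Return the list of alpha columns, one dict per word."""
    tags = list(emit)
    # alpha[k][j]: total probability of all paths that emit the first k + 1
    # words and end in tag j
    alpha = [{j: trans[start][j] * emit[j][words[0]] for j in tags}]
    for word in words[1:]:
        prev = alpha[-1]
        alpha.append({
            j: sum(prev[i] * trans[i][j] for i in tags) * emit[j][word]
            for j in tags
        })
    return alpha


alpha = forward(["Fed", "raises", "rates"], TRANS, EMIT)
sentence_prob = sum(alpha[-1].values())   # P(O | mu): sum over final tags
```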

29 What Do These Greek Letters Mean?
$$\delta_j(t) = \max_{x_1 \dots x_{t-1}} P(x_1 \dots x_{t-1}, o_1 \dots o_{t-1}, x_t = j \mid \mu)$$
Probability of the best path from the beginning to word t such that word t has tag j.
$$\alpha_j(t) = \sum_{x_1 \dots x_{t-1}} P(x_1 \dots x_{t-1}, o_1 \dots o_{t-1}, x_t = j \mid \mu) = P(o_1 \dots o_{t-1}, x_t = j \mid \mu)$$
Probability of all paths from the beginning to word t such that word t has tag j. This is NOT the probability of tag j at time t.

33 HMM Language Modeling
Probability of the observations, summed over all possible ways of tagging them: $P(O \mid \mu) = \sum_i \alpha_i(T)$. This is the sum of all path probabilities in the trellis.

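34 HMM Parameter Estimation
Supervised: train on tagged text, test on plain text. The maximum likelihood estimates (which can be smoothed) are ratios of counts: for tag Z following a previous tag t, $a[Z \mid t] = C(t, Z) / C(t)$, and for the word rates, $b[\text{rates} \mid Z] = C(Z, \text{rates}) / C(Z)$.
Unsupervised: train and test on plain text. What can we do?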
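The supervised count-and-normalize estimate can be sketched in a few lines. The two-sentence corpus, the tags N and V, and the start symbol `<s>` are hypothetical, and no smoothing is applied. Here C(t) for transitions counts how often tag t appears as a left context, so each row of a sums to one.

```python
from collections import Counter


def estimate(tagged_sentences, start="<s>"):
    """Unsmoothed maximum likelihood estimates of a[tag | prev] and b[word | tag]."""
    bigrams, contexts = Counter(), Counter()    # C(prev, tag), C(prev)
    pairs, tag_counts = Counter(), Counter()    # C(tag, word), C(tag)
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            bigrams[prev, tag] += 1
            contexts[prev] += 1
            pairs[tag, word] += 1
            tag_counts[tag] += 1
            prev = tag
    a = {(tag, prev): c / contexts[prev] for (prev, tag), c in bigrams.items()}
    b = {(word, tag): c / tag_counts[tag] for (tag, word), c in pairs.items()}
    return a, b


# Hypothetical tagged corpus, invented for illustration.
corpus = [
    [("Fed", "N"), ("raises", "V"), ("rates", "N")],
    [("rates", "N"), ("rise", "V")],
]
a, b = estimate(corpus)   # keys: a[(tag, prev)], b[(word, tag)]
```

With so little data most events get probability zero, which is the situation smoothing is meant to address.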

35 Forward-Backward Algorithm
[Figure: trellis over times 2, 3 and 4 (surviving tag labels: S, P, Z). Each tag i at time 2 carries a forward score α_i(2) and connects to one tag at time 3 through an arc weighted a[· | i] b[interest | ·], where · is that time-3 tag. The time-3 tag connects to each tag j at time 4 through an arc weighted a[j | ·] b[rates | j], and each tag j at time 4 carries a backward score β_j(4).]

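46 Forward-Backward Algorithm
$$P(o_1 \dots o_{t-1}, x_t = j \mid \mu) = \alpha_j(t)$$
$$P(o_t \dots o_T \mid x_t = j, \mu) = \beta_j(t)$$
$$P(o_1 \dots o_T, x_t = j \mid \mu) = \alpha_j(t)\, \beta_j(t)$$
$$P(x_t = j \mid O, \mu) = \frac{P(x_t = j, O \mid \mu)}{P(O \mid \mu)} = \frac{\alpha_j(t)\, \beta_j(t)}{\sum_k \alpha_k(T)}$$
$$P(x_t = i, x_{t+1} = j \mid O, \mu) = \frac{P(x_t = i, x_{t+1} = j, O \mid \mu)}{P(O \mid \mu)} = \frac{\alpha_i(t)\, a[j \mid i]\, b[o_t \mid j]\, \beta_j(t+1)}{\sum_k \alpha_k(T)}$$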
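A minimal sketch of these posteriors, assuming a hypothetical toy model (tags N and V, start symbol `<s>`, invented probabilities). The lists are indexed by word position, so indices are shifted relative to the slide's t, but each quantity is the same product of forward score, arc weight, and backward score.

```python
# Hypothetical toy model; every number here is invented for illustration.
TRANS = {"<s>": {"N": 0.8, "V": 0.2},   # TRANS[i][j] plays the role of a[j | i]
         "N": {"N": 0.3, "V": 0.7},
         "V": {"N": 0.6, "V": 0.4}}
EMIT = {"N": {"Fed": 0.5, "raises": 0.2, "rates": 0.3},   # EMIT[j][w] = b[w | j]
        "V": {"Fed": 0.1, "raises": 0.6, "rates": 0.3}}


def forward_backward(words, trans, emit, start="<s>"):
    """Return (tag posteriors, tag-pair posteriors, P(O | mu))."""
    tags, n = list(emit), len(words)
    # Forward pass: alpha[k][j] covers the first k + 1 words, ending in tag j.
    alpha = [{j: trans[start][j] * emit[j][words[0]] for j in tags}]
    for word in words[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * trans[i][j] for i in tags) * emit[j][word]
                      for j in tags})
    # Backward pass: beta[k][i] covers the words after position k, given tag i.
    beta = [None] * n
    beta[n - 1] = {i: 1.0 for i in tags}
    for k in range(n - 2, -1, -1):
        beta[k] = {i: sum(trans[i][j] * emit[j][words[k + 1]] * beta[k + 1][j]
                          for j in tags)
                   for i in tags}
    total = sum(alpha[-1].values())
    # gamma[k][j]: posterior probability that word k has tag j.
    gamma = [{j: alpha[k][j] * beta[k][j] / total for j in tags} for k in range(n)]
    # xi[k][(i, j)]: posterior probability of tag i at word k and tag j at word k + 1.
    xi = [{(i, j): alpha[k][i] * trans[i][j] * emit[j][words[k + 1]]
                   * beta[k + 1][j] / total
           for i in tags for j in tags}
          for k in range(n - 1)]
    return gamma, xi, total


gamma, xi, total = forward_backward(["Fed", "raises", "rates"], TRANS, EMIT)
```

Each gamma column and each xi table sums to one, and summing xi over the second tag recovers gamma, which is a convenient sanity check.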

47 Expectation Maximization (EM)
Iterative algorithm to maximize the likelihood of observed data in the presence of hidden data (e.g., tags):
- Choose an initial model μ.
- Expectation step: find the expected value of the hidden variables given the current μ.
- Maximization step: choose a new μ to maximize the probability of the hidden and observed data.
Each iteration is guaranteed not to decrease the likelihood, but EM is not guaranteed to find the global maximum.
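One EM iteration for an HMM (the Baum-Welch update) can be sketched as below. This is an illustration under several assumptions that are not in the slides: states and symbols are integer-coded, the model uses an initial distribution `pi` instead of a start state, training uses a single short sequence, and there is no smoothing or numerical scaling. The starting parameters and the observation sequence are invented.

```python
def forward_backward(obs, pi, A, B):
    """alpha, beta and P(O) for a state-emission HMM."""
    n, K = len(obs), len(pi)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(K)]]
    for t in range(1, n):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(K)) * B[j][obs[t]]
                      for j in range(K)])
    beta = [[1.0] * K for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(K))
                   for i in range(K)]
    return alpha, beta, sum(alpha[-1])


def em_step(obs, pi, A, B):
    """One EM iteration; returns new parameters and the likelihood under the old ones."""
    n, K, V = len(obs), len(pi), len(B[0])
    alpha, beta, total = forward_backward(obs, pi, A, B)
    # E-step: expected state occupancies (gamma) and transitions (xi).
    gamma = [[alpha[t][j] * beta[t][j] / total for j in range(K)] for t in range(n)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / total
            for j in range(K)] for i in range(K)] for t in range(n - 1)]
    # M-step: re-estimate parameters by normalizing the expected counts.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(n - 1))
              / sum(gamma[t][i] for t in range(n - 1))
              for j in range(K)] for i in range(K)]
    new_B = [[sum(gamma[t][j] for t in range(n) if obs[t] == v)
              / sum(gamma[t][j] for t in range(n))
              for v in range(V)] for j in range(K)]
    return new_pi, new_A, new_B, total


# Invented starting point and data, for illustration only.
pi, A, B = [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.8, 0.2], [0.3, 0.7]]
obs = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
history = []   # likelihood of obs before each update
for _ in range(5):
    pi, A, B, likelihood = em_step(obs, pi, A, B)
    history.append(likelihood)
```

The recorded likelihoods should never go down from one iteration to the next, while a different starting point may converge to a different local maximum.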

48 Supervised vs. Unsupervised
Supervised                  Unsupervised
Annotated training text     Plain text
Simple count/normalize      EM
Fixed tag set               Set during training
Training reads data once    Training needs multiple passes

49 Logarithms for Precision
$$P(Y) = p(y_1)\, p(y_2) \cdots p(y_T)$$
$$\log P(Y) = \log p(y_1) + \log p(y_2) + \dots + \log p(y_T)$$
Increases the dynamic range from $[0, 1]$ to $[-\infty, 0]$.
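A small sketch of why this matters in floating point. The sequence of 2000 probabilities of 0.01 is invented for illustration, and `log_add` is a hypothetical helper name for the operation needed when probabilities must be summed (as in the forward algorithm) without leaving log space.

```python
import math

probs = [0.01] * 2000   # invented example: 2000 events of probability 0.01

# Multiplying directly underflows: the true value 1e-4000 is far below the
# smallest positive double, so the running product collapses to 0.0.
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the value representable.
log_prob = sum(math.log(p) for p in probs)


def log_add(a, b):
    """Return log(exp(a) + exp(b)) without exponentiating large magnitudes."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


log_half = log_add(math.log(0.2), math.log(0.3))   # log(0.2 + 0.3)
```

Products become sums for free in log space; sums of probabilities need a helper such as `log_add`, whereas max carries over unchanged because log is monotonic.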

50 Semirings
Semiring        ⊕         ⊗    0     1
Real            +         ×    0     1
Max             max       ×    0     1
Log             log-add   +    −∞    0
Tropical        max       +    −∞    0
Shortest path   min       +    +∞    0
Boolean         ∨         ∧    F     T
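The semiring view can be sketched by writing the trellis recursion once and passing in the two operations. The `Semiring` tuple, the function name `trellis_value`, and the toy model (tags N and V, start symbol `<s>`, invented probabilities) are all hypothetical; only the Real and Max rows are instantiated here.

```python
from collections import namedtuple

# plus/times are the semiring operations; zero is the identity for plus.
Semiring = namedtuple("Semiring", "plus times zero one")
REAL = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)   # forward
MAX = Semiring(max, lambda a, b: a * b, 0.0, 1.0)                   # Viterbi score

# Hypothetical toy model; every number here is invented for illustration.
TRANS = {"<s>": {"N": 0.8, "V": 0.2},
         "N": {"N": 0.3, "V": 0.7},
         "V": {"N": 0.6, "V": 0.4}}
EMIT = {"N": {"Fed": 0.5, "raises": 0.2, "rates": 0.3},
        "V": {"Fed": 0.1, "raises": 0.6, "rates": 0.3}}


def trellis_value(words, trans, emit, sr, start="<s>"):
    """Combine all path weights in the trellis using the semiring sr."""
    tags = list(emit)
    layer = {j: sr.times(trans[start][j], emit[j][words[0]]) for j in tags}
    for word in words[1:]:
        nxt = {}
        for j in tags:
            acc = sr.zero
            for i in tags:
                acc = sr.plus(acc, sr.times(layer[i], trans[i][j]))
            nxt[j] = sr.times(acc, emit[j][word])
        layer = nxt
    total = sr.zero
    for j in tags:
        total = sr.plus(total, layer[j])
    return total


words = ["Fed", "raises", "rates"]
sentence_prob = trellis_value(words, TRANS, EMIT, REAL)    # sum over all paths
best_path_prob = trellis_value(words, TRANS, EMIT, MAX)    # best single path
```

Only the operations change between the two calls; the recursion itself is shared, which is the point of the table.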

Machine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017

1 Introduction Let $x = (x_1, \dots, x_M)$ denote a sequence (e.g., a sequence of words), and let $y = (y_1, \dots, y_M)$ denote a corresponding hidden sequence that we believe explains or influences $x$ somehow