Hidden Markov Models. Algorithms for NLP September 25, 2014


1 Hidden Markov Models Algorithms for NLP September 25, 2014

2 Quick Review. WFSAs. Deterministic (for scoring strings). Best path algorithm for WFSAs.

3 General Case: Ambiguous WFSAs

4 Ambiguous WFSAs A string may have more than one state sequence. Useful for representing ambiguity! Weights define relative goodness of different paths. Recall from last time: the path can be used to encode an analysis. E.g., if symbols are words, the state before or after might correspond to the word's part of speech.

5 Toward an Algorithm We can use weights to encourage or discourage paths. In an ambiguous FSA, there might be multiple paths for the same string. Important algorithm: find the best path for a string. Related algorithm: find the sum of all paths.

6 A Tempting, Incorrect Algorithm

7 Tempting but Incorrect. Pick the best start state: q_0 := argmax_{q ∈ I} λ(q). For t = 1 to |s|: q_t := argmax_{q ∈ Q} w(q_{t-1}, s_t, q)

8 Why It's Wrong in General (figure: a counterexample automaton, with the example string "baz!")

9 Problems What if there is no transition out of q_{t-1} that matches s_t? If there are any epsilons in the sequence, we can't see them! This algorithm will never take an ε-edge. Something fishy about leaving out the stopping weights ρ! In general: a seemingly good decision early on could turn out to be very bad later on, since rewards and punishments can always be delayed until the very end.

10 ε-Transitions Let's continue to ignore them. There is a weighted remove-epsilons operation that preserves best paths. See Mohri's (2002) article in the International Journal of Foundations of Computer Science.

11 An Easier Problem If I knew the best path ê_1 … ê_{l-1} for the first l-1 symbols s_1 … s_{l-1}, then the decision for the final part would be easy: q_l := argmax_{q ∈ F} w(n(ê_{l-1}), s_l, q) · ρ(q). Note that this choice only depends on q_{l-1}, not the whole path prefix q_1 … q_{l-1}. Manageable: best path prefix for s_1 … s_{l-1} ending in each q.

12 Toward a Recurrence Let α(q, t) be the score of the best path for the length-t prefix (s_1 … s_t) ending in state q. For all q ∈ Q and t ∈ [1, l]:
α(q, t) = max_{e_1 e_2 … e_t : n(e_t) = q} λ(p(e_1)) · ∏_{i=1}^{t} w(e_i)
        = max_{e_1 e_2 … e_t : n(e_t) = q} [ λ(p(e_1)) · ∏_{i=1}^{t-1} w(e_i) ] · w(⟨n(e_{t-1}), s_t, q⟩)
        = max_{q_{t-1} ∈ Q} α(q_{t-1}, t-1) · w(⟨q_{t-1}, s_t, q⟩)

13 Recovering the Best Path We are given the prefix best-path scores α : Q × {0, 1, …, l} → R≥0. q_l := argmax_q α(q, l) · ρ(q). For t = l-1 down to 0: q_t := argmax_q α(q, t) · w(q, s_{t+1}, q_{t+1})

14 α(q, t) (figure: the empty chart, with rows q_0 … q_6 and columns 0 … l)

15 α(q, t) (figure: the chart with column 0 filled in with λ(q_0) … λ(q_6)) ∀q ∈ I, α(q, 0) := λ(q)

16 α(q, t) (figure: the chart, filling in column 1) ∀q ∈ Q, α(q, 1) = max_{e ∈ (Q × {s_1} × {q}) ∩ E} α(p(e), 0) · w(e)

17 α(q, t) (figure: the chart, filling in column 2) ∀q ∈ Q, α(q, 2) = max_{e ∈ (Q × {s_2} × {q}) ∩ E} α(p(e), 1) · w(e)

18 α(q, t) (figure: the chart, filling in an arbitrary column t) ∀q ∈ Q, ∀t ∈ [1, l], α(q, t) = max_{e ∈ (Q × {s_t} × {q}) ∩ E} α(p(e), t-1) · w(e)

19 Constructing Prefix Best-Path Scores Input string is (s_1 s_2 … s_l). Goal: construct α : Q × {0, 1, …, l} → R≥0. Set all α(q, t) := 0. For each q ∈ I: α(q, 0) := λ(q). For t = 1 to l: for each q ∈ Q: for each q′ ∈ Q: α(q, t) := max{ α(q, t), α(q′, t-1) · w(q′, s_t, q) }

20 High-Level Best WFSA Path Algorithm First, construct the prefix best-path scores α. Then recover the path. Alternative: maintain pointers to each argmax when constructing α, then follow the pointers to recover the path. Runtime and space requirements for this algorithm, in |Q| and in l?
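
A minimal Python sketch of slides 19-20, combining the chart construction with the backpointer alternative. The data-structure choices (weights in plain dictionaries, a missing entry meaning weight zero) and the function name are illustrative assumptions, not part of the lecture.

```python
def best_wfsa_path(states, init_w, final_w, edge_w, symbols):
    """Best path through a WFSA for the string `symbols`.

    init_w[q]           -- lambda(q); missing key means q is not an initial state
    final_w[q]          -- rho(q);    missing key means q is not a final state
    edge_w[(q1, s, q2)] -- w(q1, s, q2); missing key means there is no such edge
    Returns (score, [q_0, ..., q_l]), or (0.0, None) if no accepting path exists.
    """
    l = len(symbols)
    # alpha[(q, t)]: score of the best path for the prefix s_1..s_t ending in q
    alpha = {(q, t): 0.0 for q in states for t in range(l + 1)}
    back = {}  # back[(q, t)] = best predecessor state at time t-1
    for q in states:
        alpha[(q, 0)] = init_w.get(q, 0.0)
    for t in range(1, l + 1):
        for q in states:
            for q_prev in states:
                cand = alpha[(q_prev, t - 1)] * edge_w.get((q_prev, symbols[t - 1], q), 0.0)
                if cand > alpha[(q, t)]:
                    alpha[(q, t)] = cand
                    back[(q, t)] = q_prev
    # best final state, taking the stopping weight rho into account (slide 13)
    q_last = max(states, key=lambda q: alpha[(q, l)] * final_w.get(q, 0.0))
    score = alpha[(q_last, l)] * final_w.get(q_last, 0.0)
    if score == 0.0:
        return 0.0, None
    path = [q_last]
    for t in range(l, 0, -1):
        path.append(back[(path[-1], t)])
    return score, list(reversed(path))
```

As written, the nested loops take O(|Q|^2 · l) time and the chart takes O(|Q| · l) space, which is one way to answer the question posed on this slide.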

21 Handling ε Transitions Initialize all α(q, i) to 0. For each q: α(q, 0) := π(q). For i = 1 to n: for each q: for each q′: α(q, i) := max{ α(q, i), α(q′, i-1) · δ(q′, s_i, q) }; then repeat until the α(·, i) converge: for each q: for each q′: α(q, i) := max{ α(q, i), α(q′, i) · δ(q′, ε, q) }
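
A sketch of this pseudocode in Python, assuming ε is encoded as None in the transition dictionary and that any ε cycles are dampening, so the per-column relaxation converges; the iteration cap is a defensive addition of mine, not part of the slide.

```python
EPSILON = None  # how this sketch encodes the empty symbol

def prefix_scores_with_epsilon(states, pi, delta, symbols, max_rounds=1000):
    """Prefix best-path scores alpha(q, i) in the presence of epsilon edges.

    pi[q]              -- initial weight of q (missing key means 0)
    delta[(q1, s, q2)] -- weight of the edge q1 --s--> q2 (s may be EPSILON)
    """
    n = len(symbols)
    alpha = {(q, i): 0.0 for q in states for i in range(n + 1)}
    for q in states:
        alpha[(q, 0)] = pi.get(q, 0.0)
    for i in range(1, n + 1):
        # relax ordinary edges labeled with the i-th symbol
        for q in states:
            for q_prev in states:
                cand = alpha[(q_prev, i - 1)] * delta.get((q_prev, symbols[i - 1], q), 0.0)
                alpha[(q, i)] = max(alpha[(q, i)], cand)
        # repeatedly relax epsilon edges within column i until nothing changes
        for _ in range(max_rounds):
            changed = False
            for q in states:
                for q_prev in states:
                    cand = alpha[(q_prev, i)] * delta.get((q_prev, EPSILON, q), 0.0)
                    if cand > alpha[(q, i)]:
                        alpha[(q, i)] = cand
                        changed = True
            if not changed:
                break
    return alpha
```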

22 Problems with ε Runtime is much harder to analyze, because of ε cycles. Two cases: dampening cycles, where hypothesizing lots of ε transitions makes the path's score worse (example: δ(q, ε, q) < 1), and amplifying cycles, e.g., δ(q, ε, q) > 1.

23 (diagram) WFSA: weighted languages, disambiguation. FSA: regular languages. FST: regular relations (string-to-string mappings).

24 (diagram) WFSA: weighted languages, disambiguation. WFST: weighted string-to-string mappings. FSA: regular languages. FST: regular relations (string-to-string mappings).

25 Beyond WFSAs: WFSTs Weighted finite-state transducers: combine the two-tape idea from last time with the weighted idea from today. Very general framework; the key operation is weighted composition. Best-path algorithm variations: best path and output given input, best path given input and output, … WFSAs then become just WFSTs encoding the identity relation.

26 Weights Beyond Numbers Boolean weights = traditional FSAs/FSTs. Real weights = the case we explored so far. Many other semirings (sets of possible weight values and operations for combining weights). A semiring is a set of values (real numbers, natural numbers, strings, points in R^2, …) together with: an addition operation (⊕), a multiplication operation (⊗), and a zero element.
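
To make this concrete, here is a small illustrative sketch (my own, not from the slides) of a semiring as a bundle of addition, multiplication, zero, and one, with three familiar instances; the generic chart update works unchanged for any of them.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # combines alternatives (different paths)
    times: Callable[[Any, Any], Any]  # combines weights along one path
    zero: Any                         # identity for plus:  "no path"
    one: Any                          # identity for times: "empty path"

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)  # plain FSAs
REAL    = Semiring(lambda a, b: a + b,  lambda a, b: a * b,   0.0,   1.0)   # sum over paths
VITERBI = Semiring(max,                 lambda a, b: a * b,   0.0,   1.0)   # best path, as in this lecture

def chart_update(sr, old, pred, edge):
    """alpha(q, t) := alpha(q, t) (+) [ alpha(q', t-1) (x) w(q', s_t, q) ], generically."""
    return sr.plus(old, sr.times(pred, edge))
```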

27 WFSAs and WFSTs: Algorithms The OpenFst documentation contains a high-level view of most algorithms you think you want, and pointers. The set of people developing general WFST algorithms largely overlaps with the OpenFst team: Mehryar Mohri, Cyril Allauzen, Michael Riley. This is an active area of research!

28 From WFSAs to HMMs

29 Weighted Finite-State Automaton
Element / Definition:
Q: finite set of states
Σ: finite vocabulary
I ⊆ Q: set of initial states
F ⊆ Q: set of final states
E ⊆ Q × (Σ ∪ {ε}) × Q: set of transitions (edges)
λ : I → R≥0: initial weights
ρ : F → R≥0: final weights
w : E → R≥0: transition weights

30 First Change: Probability Probabilities are a special kind of weight. Informally, there are three rules: all weights are between zero and one; at each local position, the weights of all competing possibilities have to sum to one; over all possible outcomes (beginning to end), the total sum of weights has to be one. The main operations when you deal in probabilities: multiplying together (conditional) probabilities of events; adding probabilities to calculate marginals; using data to estimate probabilities (e.g., counting).

31 Probabilistic Finite-State Automaton
Element / Definition / Constraint:
Q: finite set of states
Σ: finite vocabulary
I = Q: set of initial states
F = {q_f}: final state, with F ∩ Q = ∅
E = Q × Σ × (Q ∪ F): set of transitions
λ : I → [0, 1]: initial weights, with Σ_q λ(q) = 1
ρ : F → {1}: final weights
w : E → [0, 1]: transition weights, with Σ_{r,s} w(q, s, r) = 1 for every q

32 Differences? All paths have scores between 0 and 1. These conditions are sufficient to show that the sum of all start-to-finish paths is one. Some of those paths might be infinitely long, though! No transitions out of q_f.

33 Second Change: Factoring Transitions Define w : Q × Σ × Q → [0, 1] as a product of two parts: w(q, s, q′) = η(s | q) · γ(q′ | q). Instead of transitioning in one move, the model first emits a symbol (η) and then transitions (γ). A PFSA with this property is called an HMM.

34 Effects? It's easy to see that every HMM is a PFSA. Can we go in the other direction? Not in general.

35 Silent States Some definitions of HMMs allow states to be silent (always, or with some probability). Related to ε transitions. In NLP, we do not usually use silent states, so I will remain silent on the matter. This simplifies things.

36 The Conventional View of HMMs

37 Probabilistic Generative Stories This view focuses on the HMM as a way of generating strings. Probabilistic generative models are a huge class of techniques. In machine learning: naïve Bayes, mixtures of Gaussians, latent Dirichlet allocation (LDA), and many more; the emphasis there is on recovering the parameters of a model from the data.

38 The n-Gram Story Fix each of s_{-n+2}, s_{-n+3}, …, s_0 to the START symbol. For i = 1, 2, …: pick s_i according to the distribution p(· | s_{i-n+1}, s_{i-n+2}, …, s_{i-1}); exit the loop if s_i is the STOP symbol. The Markov assumption: the probability of a word given its history depends only on its (n-1) most recent predecessors.
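
A sketch of this story as Python; the callback cond_prob, the START/STOP names, and the max_len safety cap are assumptions of mine for illustration.

```python
import random

def sample_ngram(n, cond_prob, start="<START>", stop="<STOP>", max_len=100):
    """Generate a word sequence following the n-gram story on this slide.

    cond_prob(history) -> dict mapping next symbols to probabilities, where
    history is a tuple of the (n-1) previous symbols.
    """
    history = (start,) * (n - 1)   # s_{-n+2} .. s_0 fixed to START
    out = []
    for _ in range(max_len):
        dist = cond_prob(history)
        r, total, s = random.random(), 0.0, stop
        for sym, p in dist.items():  # draw from the categorical distribution
            total += p
            if r < total:
                s = sym
                break
        if s == stop:
            break
        out.append(s)
        history = history[1:] + (s,)  # keep only the (n-1) most recent symbols
    return out
```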

39 The HMM Story Pick a start state q according to π(q). If state q is the stop state, then stop; otherwise: emit symbol s with probability η(s | q); transition to q′ with probability γ(q′ | q). The Markov assumption: the probability of a word given its history only depends on the current state.
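
Here is the HMM story rendered directly as Python, a sketch rather than the lecture's own code; pi, eta, and gamma mirror π, η, γ as nested dictionaries, and stopping is modeled by transitioning into a designated stop state (one way to read the slide).

```python
import random

def sample_from_hmm(pi, eta, gamma, stop_state="STOP"):
    """Generate (states, symbols) following the HMM story on this slide.

    pi[q]        -- probability of starting in state q
    eta[q][s]    -- probability that state q emits symbol s
    gamma[q][q2] -- probability of moving from q to q2 (q2 may be stop_state)
    Returns the emitting states and the symbols they produced.
    """
    def draw(dist):
        # dist: dict mapping outcomes to probabilities summing to 1
        r, total = random.random(), 0.0
        for outcome, p in dist.items():
            total += p
            if r < total:
                return outcome
        return outcome  # guard against floating-point rounding

    states, symbols = [], []
    q = draw(pi)
    while q != stop_state:
        states.append(q)
        symbols.append(draw(eta[q]))
        q = draw(gamma[q])
    return states, symbols
```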

40 The HMM Story, in Math p(q_0, s_1, q_1, …, s_n, q_n) = π(q_0) · [ ∏_{i=0}^{n-1} η(s_{i+1} | q_i) · γ(q_{i+1} | q_i) ] · ρ(q_n)
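
And the corresponding scorer for the displayed formula, again as a sketch: the same nested dictionaries, plus rho for the final factor (in the stop-state reading of the previous slide, rho[q] is simply the probability of moving from q to the stop state).

```python
def hmm_joint_prob(pi, eta, gamma, rho, states, symbols):
    """p(q_0, s_1, q_1, ..., s_n, q_n) as on this slide.

    states = [q_0, ..., q_n], symbols = [s_1, ..., s_n]; in practice you would
    work with log probabilities to avoid underflow.
    """
    assert len(states) == len(symbols) + 1
    p = pi[states[0]]
    for i in range(len(symbols)):
        p *= eta[states[i]][symbols[i]]       # eta(s_{i+1} | q_i)
        p *= gamma[states[i]][states[i + 1]]  # gamma(q_{i+1} | q_i)
    return p * rho[states[-1]]
```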

41 The Two Views, Again Declarative: HMMs (like all WFSAs) assign scores to paths by multiplying together local weights. Procedural: HMMs imply an algorithm for generating data. In general, WFSAs do not. (Why?) What we really want is an algorithm for recovering the states, given the symbols.

42-53 (animation: generating "I want a flight" word by word; at each step the captions point out that each state has one symbol die, one next-state die, and one stop-or-continue coin)

54 Note: N-Gram Models vs. HMMs N-gram models are a special case of Markov models, in which the state is fully determined by the words. Markov, as opposed to hidden Markov, because the states are known (not hidden) given the sequence of symbols (recall, states = histories). Markov models correspond to deterministic WFSAs; HMMs correspond to ambiguous WFSAs.

55 (diagram: WFSAs ⊇ PFSAs ⊇ HMMs)

56 Where Do Weights Come From? In the general case of WFSAs, it is not clear where weights should come from. There are lots of ways you could imagine doing it. In the HMM case, the probabilistic view suggests that we use data and statistics. General framework for extracting probabilistic model parameters from data: maximum likelihood estimation. Given symbols and their states, we can use relative frequency estimates.

57 Relative Frequency Estimation Count and normalize. For conditional probabilities, the ratio of the number of times the event did happen to the number of times it might have happened.
π(q) = count(q_initial = q) / #sequences
η(s | q) = count(q emits s) / count(q_nonfinal = q)
γ(q′ | q) = count(q transitions to q′) / count(q_nonfinal = q)
ρ(q) = count(q_final = q) / count(q)
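
A counting sketch of these four estimates; the data format (a list of (states, symbols) pairs with one more state than symbols, the last state being where the sequence stops) and the variable names are assumptions of mine.

```python
from collections import Counter, defaultdict

def relative_frequency_estimates(data):
    """Count-and-normalize estimates of pi, eta, gamma, rho.

    data: list of (states, symbols) pairs with len(states) == len(symbols) + 1,
    so states[i] emits symbols[i] and states[-1] is where the sequence stops.
    """
    start, final, occur, nonfinal = Counter(), Counter(), Counter(), Counter()
    emit, trans = defaultdict(Counter), defaultdict(Counter)
    for states, symbols in data:
        start[states[0]] += 1
        final[states[-1]] += 1
        occur.update(states)
        nonfinal.update(states[:-1])
        for q, s in zip(states, symbols):          # q_i emits s_{i+1}
            emit[q][s] += 1
        for q, q_next in zip(states, states[1:]):  # q_i transitions to q_{i+1}
            trans[q][q_next] += 1
    pi    = {q: c / len(data) for q, c in start.items()}
    eta   = {q: {s: c / nonfinal[q] for s, c in emit[q].items()} for q in emit}
    gamma = {q: {r: c / nonfinal[q] for r, c in trans[q].items()} for q in trans}
    rho   = {q: final[q] / occur[q] for q in occur}
    return pi, eta, gamma, rho
```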

58 The Viterbi Algorithm

59 Most Probable State Sequence For WFSAs, we talked about the best path. Probabilistic models allow us to say most likely or most probable path.
argmax_{q_0, …, q_n} p(q_0, s_1, q_1, …, s_n, q_n) = argmax_{q_0, …, q_n} π(q_0) · [ ∏_{i=0}^{n-1} η(s_{i+1} | q_i) · γ(q_{i+1} | q_i) ] · ρ(q_n)
For a PFSA, the algorithm is identical to what it was before. HMMs factor δ into two parts (η and γ); this changes the algorithm a little bit.

60 Conditional vs. Joint HMMs define the joint: p(states, symbols). We want the conditional: p(states | symbols). It turns out that: argmax_q p(q | s) = argmax_q p(q, s) / p(s) = argmax_q p(q, s). So we can focus on maximizing the joint.

61 Example: POS Tagging "I suspect the present forecast is pessimistic." (table: several candidate tags for each word, drawn from CD, DT, JJ, NN, NNP, NNS, PRP, RB, VB, VBD, VBN, VBP, VBZ, and .) Thousands of possible state sequences!

62 Naïve Solutions List all the possibilities in Q^n. Correct. Inefficient. Work left to right and greedily pick the best q_i at each point, based on q_{i-1} and s_{i+1}. Not correct; the solution may not be the most probable path.

63 Interactions Each word's label depends on the word, and nearby labels. But given adjacent labels, others do not matter. "I suspect the present forecast is pessimistic." (figure: the candidate-tag table again; arrows show the most preferred label by each neighbor)

64 High-Level Viterbi Algorithm First, construct the prefix best-path scores α, with backpointers. Then recover the path.

65 Constructing Prefix Best-Path Scores Input string is (s_1 s_2 … s_n). Goal: construct α : Q × {0, 1, …, n} → R≥0. Initialize all α(q, i) to 0. For each q: α(q, 0) := π(q). For i = 1 to n: for each q: for each q′: α(q, i) := max{ α(q, i), α(q′, i-1) · η(s_i | q′) · γ(q | q′) }

66 Constructing Prefix Best-Path Scores Input string is (s_1 s_2 … s_n). Goal: construct α : Q × {0, 1, …, n} → R≥0. Initialize all α(q, i) to 0. For each q: α(q, 0) := π(q). For i = 1 to n: for each q: for each q′: α(q, i) := max{ α(q, i), α(q′, i-1) · δ(q′, s_i, q) } (the WFSA version from last lecture, for comparison)
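
Finally, a sketch of the HMM Viterbi algorithm with backpointers; it differs from the WFSA sketch earlier only in using the factored weights η and γ (and ρ as the final factor), with missing dictionary entries treated as probability zero.

```python
def viterbi(states, pi, eta, gamma, rho, symbols):
    """Most probable state sequence q_0 ... q_n for the observed symbols s_1 ... s_n."""
    n = len(symbols)
    # alpha[(q, i)]: probability of the best way to emit s_1..s_i and end up in q
    alpha = {(q, i): 0.0 for q in states for i in range(n + 1)}
    back = {}
    for q in states:
        alpha[(q, 0)] = pi.get(q, 0.0)
    for i in range(1, n + 1):
        for q in states:
            for q_prev in states:
                cand = (alpha[(q_prev, i - 1)]
                        * eta.get(q_prev, {}).get(symbols[i - 1], 0.0)  # eta(s_i | q')
                        * gamma.get(q_prev, {}).get(q, 0.0))            # gamma(q | q')
                if cand > alpha[(q, i)]:
                    alpha[(q, i)] = cand
                    back[(q, i)] = q_prev
    q_last = max(states, key=lambda q: alpha[(q, n)] * rho.get(q, 0.0))
    best = alpha[(q_last, n)] * rho.get(q_last, 0.0)
    if best == 0.0:
        return 0.0, None
    path = [q_last]
    for i in range(n, 0, -1):
        path.append(back[(path[-1], i)])
    return best, list(reversed(path))
```

Plugging in the relative-frequency estimates from slide 57 and a word sequence gives a complete, if unsmoothed, POS tagger.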

67-69 (figures, three steps: the Viterbi chart for "I suspect the present forecast is pessimistic."; rows are the tags CD, DT, JJ, NN, NNP, NNS, PRP, RB, VB, VBD, VBN, VBP, VBZ, columns are the words, and cells hold prefix best-path scores such as 4E-3, 3E-7, …, 7E-23)

70-71 Another Example: "Time flies like an arrow." (figures, two steps: the Viterbi chart with rows dt, in, jj, nn, nnp, vb, vbp, vbz, and ., columns the words, and cells holding prefix best-path scores such as 2E-4, 2E-9, …, 6E-21)

72 Formalisms vs. Tasks HMMs are a formal concept that applies to a wide range of NLP problems. This happens a lot: one tool has many uses. And of course, every problem has a wide range of solutions. Three modes in NLP research: advancing formal models and formalizing interesting ideas (risk: may not be relevant to real problems); identifying problems with existing approaches and crafting solutions (risk: solutions may not generalize); identifying new NLP problems to work on and ways to evaluate success.
