INF4820: Algorithms for Artificial Intelligence and Natural Language Processing. Language Models & Hidden Markov Models
University of Oslo: Department of Informatics

INF4820: Algorithms for Artificial Intelligence and Natural Language Processing
Language Models & Hidden Markov Models

Stephan Oepen & Erik Velldal
Language Technology Group (LTG)
October 14, 2015
Recall: By the End of the Semester

You should be able to determine

which string is most likely:
How to recognize speech  vs.  How to wreck a nice beach

which category sequence is most likely for "flies like an arrow":
N V D N  vs.  V P D N

which syntactic analysis is most likely:
(S I (VP (VBD ate) (NP (N sushi) (PP with tuna))))  vs.  (S I (VP (VBD ate) (NP sushi) (PP with tuna)))
Language Models: N-Grams

A probabilistic (or stochastic) language model M assigns a probability P_M(x) to every string x in the language L.

We simplify using the Markov assumption (limited history): the last n-1 elements approximate the effect of the full sequence. That is, instead of P(w_i | w_1, ..., w_{i-1}), selecting an n of 3, we use P(w_i | w_{i-1}, w_{i-2}).

We call these short sequences of words n-grams:
bigrams: I want, want to, to go, go to, to the, the beach
trigrams: I want to, want to go, to go to, go to the
4-grams: I want to go, want to go to, to go to the
N-Gram Models

A generative model models a joint probability in terms of conditional probabilities. We talk about the generative story: starting from <s>, each word is generated conditioned on the previous one (e.g. cat, and, or eat after the, with probabilities P(cat | the), P(and | the), P(eat | the)), until </s> is generated:

<s>   the   cat   eats   mice   </s>

P(S) = P(the | <s>) · P(cat | the) · P(eats | cat) · P(mice | eats) · P(</s> | mice)
N-Gram Models

An n-gram language model records the n-gram conditional probabilities:
P(I | <s>), P(want | I), P(to | want), P(go | to), P(to | go), P(the | to), P(beach | the), ...

We calculate the probability of a sentence as (assuming bigrams):

P(w_1^n) ≈ ∏_{i=1}^{n} P(w_i | w_{i-1})
= P(I | <s>) · P(want | I) · P(to | want) · P(go | to) · P(to | go) · P(the | to) · P(beach | the)
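To make the product over bigrams concrete, here is a minimal Python sketch (not from the lecture; the probability table and its values are invented purely for illustration) that scores a sentence against a given table of bigram probabilities:

```python
# Toy bigram probabilities P(w2 | w1); the values are made up for illustration.
bigram_prob = {
    ("<s>", "I"): 0.25, ("I", "want"): 0.33, ("want", "to"): 0.66,
    ("to", "go"): 0.28, ("go", "to"): 0.50, ("to", "the"): 0.30,
    ("the", "beach"): 0.05,
}

def sentence_probability(words, probs):
    """Multiply P(w_i | w_{i-1}) over the sentence, starting from <s>."""
    p = 1.0
    prev = "<s>"
    for w in words:
        p *= probs.get((prev, w), 0.0)  # unseen bigrams get probability 0 under plain MLE
        prev = w
    return p

print(sentence_probability("I want to go to the beach".split(), bigram_prob))
```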
Training an N-Gram Model

How do we estimate the probabilities of n-grams? By counting (e.g. for trigrams):

P(bananas | i like) = C(i like bananas) / C(i like)

The probabilities are estimated using the relative frequencies of observed outcomes. This process is called Maximum Likelihood Estimation (MLE).
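As an illustration of the counting itself, here is a small sketch (my own, not the course code) that derives MLE bigram estimates from a toy one-sentence corpus:

```python
from collections import Counter

corpus = [["I", "want", "to", "go", "to", "the", "beach"]]  # toy training data

bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence
    for w1, w2 in zip(tokens, tokens[1:]):
        bigram_counts[(w1, w2)] += 1
        unigram_counts[w1] += 1

def mle_bigram(w1, w2):
    """P(w2 | w1) = C(w1 w2) / C(w1): the relative-frequency (MLE) estimate."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(mle_bigram("to", "go"))   # 0.5: "to" occurs twice, once followed by "go"
```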
Bigram MLE Example

I want to go to the beach

w1      w2       C(w1 w2)   C(w1)   P(w2 | w1)
<s>     I
I       want
want    to
to      go
go      to
to      the
the     beach

What's the probability of "Others want to go to the beach"?
Problems with MLE of N-Grams

Data sparseness: many perfectly acceptable n-grams will not be observed. Zero counts will result in an estimated probability of 0.

Remedy: reassign some of the probability mass of frequent events to less frequent (or unseen) events. This is known as smoothing or discounting.

The simplest approach is Laplace ("add-one") smoothing:

P_L(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)

where V is the size of the vocabulary.
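A minimal add-one variant of the same estimator (a sketch only; the counts and the vocabulary size are hard-coded from the toy corpus above so the snippet stands alone):

```python
from collections import Counter

# Toy counts from "I want to go to the beach" (with the <s> marker prepended).
bigram_counts = Counter({("<s>", "I"): 1, ("I", "want"): 1, ("want", "to"): 1,
                         ("to", "go"): 1, ("go", "to"): 1, ("to", "the"): 1,
                         ("the", "beach"): 1})
unigram_counts = Counter({"<s>": 1, "I": 1, "want": 1, "to": 2, "go": 1, "the": 1})
V = 7  # vocabulary size assumed for this toy example

def laplace_bigram(w1, w2):
    """P_L(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V): add-one (Laplace) smoothing."""
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(laplace_bigram("to", "go"))        # seen bigram: (1 + 1) / (2 + 7), roughly 0.22
print(laplace_bigram("Others", "want"))  # unseen bigram, but still non-zero: 1 / 7
```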
Bigram MLE Example with Laplace Smoothing

Others want to go to the beach

w1      w2       C(w1 w2)   C(w1)   P(w2 | w1)   P_L(w2 | w1)
<s>     I
<s>     Others
I       want
Others  want
want    to
to      go
go      to
to      the
the     beach

P_L(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)

The unseen bigrams receive zero MLE probability, but a non-zero smoothed probability.
N-Gram Summary

The likelihood of the next word depends on its context. We can calculate this using the chain rule:

P(w_1^N) = ∏_{i=1}^{N} P(w_i | w_1^{i-1})

In an n-gram model, we approximate this with a Markov chain:

P(w_1^N) ≈ ∏_{i=1}^{N} P(w_i | w_{i-n+1}^{i-1})

We use Maximum Likelihood Estimation to estimate the conditional probabilities. Smoothing techniques are used to avoid zero probabilities.
Parts of Speech

Known by a variety of names: part-of-speech, POS, lexical categories, word classes, morpho-syntactic classes, ...

Traditionally defined semantically (e.g. nouns are "naming words"), but (arguably) more accurately by their distributional properties.

Open classes: new words are created/updated/deleted all the time.

Closed classes: smaller classes, relatively static membership; usually function words.
Open Class Words

Nouns: dog, Oslo, scissors, snow, people, truth, cups
proper or common; countable or uncountable; plural or singular; masculine, feminine, or neuter; ...

Verbs: fly, rained, having, ate, seen
transitive, intransitive, ditransitive; past, present, passive; stative or dynamic; plural or singular; ...

Adjectives: good, smaller, unique, fastest, best, unhappy
comparative or superlative; predicative or attributive; intersective, subsective, or scopal; ...

Adverbs: again, somewhat, slowly, yesterday, aloud
intersective; scopal; discourse; degree; temporal; directional; comparative or superlative; ...
Closed Class Words

Prepositions: on, under, from, at, near, over, ...
Determiners: a, an, the, that, ...
Pronouns: she, who, I, others, ...
Conjunctions: and, but, or, when, ...
Auxiliary verbs: can, may, should, must, ...
Interjections, particles, numerals, negatives, politeness markers, greetings, existential there, ...

(Examples from Jurafsky & Martin, 2008)
POS Tagging

The (automatic) assignment of POS tags to word sequences:
non-trivial where words are ambiguous: fly (v) vs. fly (n)
choice of the correct tag is context-dependent
useful in pre-processing for parsing, etc., but also directly, for example in text-to-speech synthesis: content (n) vs. content (adj)

Difficulty and usefulness can depend on the tagset:
English Penn Treebank (PTB), 45 tags: NNS, NN, NNP, JJ, JJR, JJS, ...
Norwegian Oslo-Bergen Tagset, multi-part: subst appell fem be ent
Labeled Sequences

We are interested in the probability of sequences like:

flies/NNS like/VB the/DT wind/NN   or   flies/VBZ like/P the/DT wind/NN

In normal text, we see the words, but not the tags. Consider the POS tags to be the underlying skeleton of the sentence, unseen but influencing the sentence shape.

A structure like this, consisting of a hidden state sequence and a related observation sequence, can be modelled as a Hidden Markov Model.
Hidden Markov Models

The generative story: starting from <s>, each tag is generated conditioned on the previous tag, and each tag emits a word:

<s>   DT    NN    VBZ    NNS    </s>
      the   cat   eats   mice

P(S, O) = P(DT | <s>) · P(the | DT) · P(NN | DT) · P(cat | NN) · P(VBZ | NN) · P(eats | VBZ) · P(NNS | VBZ) · P(mice | NNS) · P(</s> | NNS)
Hidden Markov Models

For a bigram HMM, with observations O_1^N:

P(S, O) = ∏_{i=1}^{N+1} P(s_i | s_{i-1}) · P(o_i | s_i),   where s_0 = <s> and s_{N+1} = </s>

The transition probabilities model the probabilities of moving from state to state.

The emission probabilities model the probability that a state emits a particular observation.
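As a sketch of this formula (the transition and emission values below are invented purely for illustration, not taken from the slides), the joint probability of a tag sequence and a word sequence can be computed like this:

```python
# Toy transition and emission tables; the numeric values are invented for illustration.
trans = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.8, ("NN", "VBZ"): 0.3,
         ("VBZ", "NNS"): 0.4, ("NNS", "</s>"): 0.2}
emit = {("DT", "the"): 0.6, ("NN", "cat"): 0.01, ("VBZ", "eats"): 0.02,
        ("NNS", "mice"): 0.01}

def joint_probability(tags, words):
    """P(S, O) = product over i of P(s_i | s_{i-1}) * P(o_i | s_i), with <s>/<\\s> pseudo-states."""
    p = 1.0
    prev = "<s>"
    for tag, word in zip(tags, words):
        p *= trans.get((prev, tag), 0.0) * emit.get((tag, word), 0.0)
        prev = tag
    return p * trans.get((prev, "</s>"), 0.0)

print(joint_probability(["DT", "NN", "VBZ", "NNS"], ["the", "cat", "eats", "mice"]))
```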
Using HMMs

The HMM models the process of generating the labeled sequence. We can use this model for a number of tasks:

P(S, O), given S and O
P(O), given O
the S that maximizes P(S | O), given O
P(s_x | O), given O

We can learn the model parameters, given labeled observations.

Our observations will be words (w_i), and our states PoS tags (t_i).
Estimation

As so often in NLP, we learn an HMM from labeled data:

Transition probabilities: based on a training corpus of previously tagged text, with tags as our states, the MLE can be computed from the counts of observed tags:

P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})

Emission probabilities: computed from relative frequencies in the same way, with the words as observations:

P(w_i | t_i) = C(t_i, w_i) / C(t_i)
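A possible sketch of this estimation step, using an invented two-sentence tagged corpus (the data and helper names are mine, not the course's):

```python
from collections import Counter

# A toy tagged corpus: lists of (word, tag) pairs, invented for illustration.
tagged_corpus = [
    [("the", "DT"), ("cat", "NN"), ("eats", "VBZ"), ("mice", "NNS")],
    [("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

tag_bigrams, tag_counts, emissions = Counter(), Counter(), Counter()
for sentence in tagged_corpus:
    tags = ["<s>"] + [t for _, t in sentence] + ["</s>"]
    for t1, t2 in zip(tags, tags[1:]):
        tag_bigrams[(t1, t2)] += 1
        tag_counts[t1] += 1
    for word, tag in sentence:
        emissions[(tag, word)] += 1

def transition(t1, t2):
    """P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})"""
    return tag_bigrams[(t1, t2)] / tag_counts[t1]

def emission(tag, word):
    """P(w_i | t_i) = C(t_i, w_i) / C(t_i)"""
    return emissions[(tag, word)] / tag_counts[tag]

print(transition("DT", "NN"), emission("NN", "cat"))   # 1.0 and 0.5 on this toy corpus
```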
Implementation Considerations

P(S, O) = P(s_1 | <s>) · P(o_1 | s_1) · P(s_2 | s_1) · P(o_2 | s_2) · P(s_3 | s_2) · P(o_3 | s_3) · ...

Multiplying many small probabilities → risk of numeric underflow.

Solution: work in log(arithmic) space:
log(AB) = log(A) + log(B), hence P(A) · P(B) = exp(log P(A) + log P(B))
log P(S, O) = ∑_i [log P(s_i | s_{i-1}) + log P(o_i | s_i)]

Still, the issues related to MLE that we discussed for n-gram models also apply here.
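A small illustration of the underflow problem and the log-space fix (my own example, not from the slides):

```python
import math

probs = [0.001] * 200          # 200 small probabilities

product = 1.0
for p in probs:
    product *= p               # the true value, 1e-600, is below the smallest double

log_sum = sum(math.log(p) for p in probs)   # stays finite: 200 * log(0.001)

print(product)   # 0.0 -- underflow
print(log_sum)   # approximately -1381.55, a perfectly usable score
```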
Ice Cream and Global Warming

Missing records of weather in Baltimore for Summer 2007: Jason likes to eat ice cream. He records his daily ice cream consumption in his diary. The number of ice creams he ate was influenced, but not entirely determined, by the weather. Today's weather is partially predictable from yesterday's.

A Hidden Markov Model with:
Hidden states: {H, C} (plus pseudo-states <s> and </s>)
Observations: {1, 2, 3}

[State-transition diagram over <s>, H, C, and </s>, with emission probabilities P(1 | H) = 0.2, P(2 | H) = 0.4, P(3 | H) = 0.4, P(1 | C) = 0.5, P(2 | C) = 0.4, P(3 | C) = 0.1; the full parameter table is repeated on the Decoding slide below.]
Using HMMs

The HMM models the process of generating the labeled sequence. We can use this model for a number of tasks:

P(S, O), given S and O
P(O), given O
the S that maximizes P(S | O), given O
P(s_x | O), given O

We can also learn the model parameters, given a set of observations.
Part-of-Speech Tagging

We want to find the tag sequence, given a word sequence. With tags as our states and words as our observations, we know:

P(S, O) = ∏_{i=1}^{N+1} P(s_i | s_{i-1}) · P(o_i | s_i)

We want:

P(S | O) = P(S, O) / P(O)

Actually, we want the state sequence Ŝ that maximizes P(S | O):

Ŝ = argmax_S P(S, O) / P(O)

Since P(O) is always the same, we can drop the denominator.
Decoding

Task: find the most likely state sequence Ŝ, given an observation sequence O.

HMM:
P(H | <s>) = 0.8    P(C | <s>) = 0.2
P(H | H) = 0.6      P(C | H) = 0.2
P(H | C) = 0.3      P(C | C) = 0.5
P(</s> | H) = 0.2   P(</s> | C) = 0.2
P(1 | H) = 0.2      P(1 | C) = 0.5
P(2 | H) = 0.4      P(2 | C) = 0.4
P(3 | H) = 0.4      P(3 | C) = 0.1

If O = 3 1 3, the candidate state sequences are:
<s> H H H </s>
<s> H H C </s>
<s> H C H </s>
<s> H C C </s>
<s> C H H </s>
<s> C H C </s>
<s> C C H </s>
<s> C C C </s>
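Before turning to dynamic programming, here is a brute-force sketch that scores all L^N = 2^3 candidate sequences for O = 3 1 3 using the parameter table above; the code is my own rendering of the slide's setup, not course code:

```python
from itertools import product

# HMM parameters from the ice-cream example above.
trans = {("<s>", "H"): 0.8, ("<s>", "C"): 0.2,
         ("H", "H"): 0.6, ("H", "C"): 0.2,
         ("C", "H"): 0.3, ("C", "C"): 0.5,
         ("H", "</s>"): 0.2, ("C", "</s>"): 0.2}
emit = {("H", 1): 0.2, ("H", 2): 0.4, ("H", 3): 0.4,
        ("C", 1): 0.5, ("C", 2): 0.4, ("C", 3): 0.1}

observations = [3, 1, 3]

def joint(states, obs):
    """P(S, O) for one candidate state sequence."""
    p = trans[("<s>", states[0])]
    for i, (s, o) in enumerate(zip(states, obs)):
        if i > 0:
            p *= trans[(states[i - 1], s)]
        p *= emit[(s, o)]
    return p * trans[(states[-1], "</s>")]

# Enumerate all 2^3 sequences and pick the best one.
best = max(product("HC", repeat=len(observations)),
           key=lambda seq: joint(seq, observations))
print(best, joint(best, observations))   # ('H', 'H', 'H') with probability roughly 0.0018432
```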
Dynamic Programming

For (only) two states and a (short) observation sequence of length three, comparing all possible sequences may be workable, but...

for N observations and L states, there are L^N sequences, and we end up doing the same partial calculations over and over again.

Dynamic programming:
records sub-problem solutions for further re-use
useful when a complex problem can be described recursively
examples: Dijkstra's shortest path, minimum edit distance, longest common subsequence, the Viterbi algorithm
Viterbi Algorithm

Recall our problem: maximize P(s_1 ... s_n | o_1 ... o_n), which (having dropped the constant P(O)) means maximizing

P(s_1 | s_0) · P(o_1 | s_1) · P(s_2 | s_1) · P(o_2 | s_2) · ...

Our recursive sub-problem:

v_i(x) = max_{k=1}^{L} [ v_{i-1}(k) · P(x | k) · P(o_i | x) ]

The variable v_i(x) represents the maximum probability that the i-th state is x, given that we have seen the observations o_1 ... o_i.

At each step, we record backpointers showing which previous state led to the maximum probability.
An Example of the Viterbi Algorithm

[Trellis over the states H and C for the observation sequence o_1 o_2 o_3 = 3 1 3, with start state <s> and end state </s>; each arc is labelled with a transition probability times an emission probability, e.g. P(H | H) · P(1 | H).]

v_1(H) = P(H | <s>) · P(3 | H) = 0.32
v_1(C) = P(C | <s>) · P(3 | C) = 0.02

v_2(H) = max(.32 · .12, .02 · .06) = .0384
v_2(C) = max(.32 · .1, .02 · .25) = .032

v_3(H) = max(.0384 · .24, .032 · .12) = .009216
v_3(C) = max(.0384 · .02, .032 · .05) = .0016

v_f(</s>) = max(.009216 · .2, .0016 · .2) = .0018432

Following the backpointers from </s> recovers the best state sequence H H H.
Pseudocode for the Viterbi Algorithm

Input: observations of length N, state set of size L
Output: best-path

create a path probability matrix viterbi[N, L+1]
create a path backpointer matrix backpointer[N, L+1]
foreach state s from 1 to L do
    viterbi[1, s] ← trans(<s>, s) × emit(o_1, s)
    backpointer[1, s] ← 0
end
foreach time step i from 2 to N do
    foreach state s from 1 to L do
        viterbi[i, s] ← max_{s'=1..L} viterbi[i-1, s'] × trans(s', s) × emit(o_i, s)
        backpointer[i, s] ← argmax_{s'=1..L} viterbi[i-1, s'] × trans(s', s)
    end
end
viterbi[N, L+1] ← max_{s=1..L} viterbi[N, s] × trans(s, </s>)
backpointer[N, L+1] ← argmax_{s=1..L} viterbi[N, s] × trans(s, </s>)
return the path by following backpointers from backpointer[N, L+1]
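The pseudocode translates fairly directly into Python. The following is a sketch (not the course's reference implementation) run on the ice-cream HMM from the worked example; it reproduces the path H H H with probability 0.0018432:

```python
def viterbi(observations, states, trans, emit):
    """Return the most likely state sequence and its probability."""
    n = len(observations)
    v = [{} for _ in range(n)]            # v[i][s]: best probability of a path ending in s at step i
    backpointer = [{} for _ in range(n)]

    for s in states:                      # initialisation from the start state
        v[0][s] = trans[("<s>", s)] * emit[(s, observations[0])]
        backpointer[0][s] = None

    for i in range(1, n):                 # recursion over time steps
        for s in states:
            best_prev = max(states, key=lambda sp: v[i - 1][sp] * trans[(sp, s)])
            v[i][s] = v[i - 1][best_prev] * trans[(best_prev, s)] * emit[(s, observations[i])]
            backpointer[i][s] = best_prev

    # termination: transition into the end state
    last = max(states, key=lambda s: v[n - 1][s] * trans[(s, "</s>")])
    best_prob = v[n - 1][last] * trans[(last, "</s>")]

    path = [last]                         # follow backpointers to recover the path
    for i in range(n - 1, 0, -1):
        path.append(backpointer[i][path[-1]])
    return list(reversed(path)), best_prob

trans = {("<s>", "H"): 0.8, ("<s>", "C"): 0.2,
         ("H", "H"): 0.6, ("H", "C"): 0.2,
         ("C", "H"): 0.3, ("C", "C"): 0.5,
         ("H", "</s>"): 0.2, ("C", "</s>"): 0.2}
emit = {("H", 1): 0.2, ("H", 2): 0.4, ("H", 3): 0.4,
        ("C", 1): 0.5, ("C", 2): 0.4, ("C", 3): 0.1}

print(viterbi([3, 1, 3], ["H", "C"], trans, emit))  # best path ['H', 'H', 'H'], roughly 0.0018432
```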
Diversion: Complexity and O(N)

Big-O notation describes the complexity of an algorithm:
it describes the worst-case order of growth in terms of the size of the input
only the largest order term is represented
constant factors are ignored
determined by looking at loops in the code
Pseudocode for the Viterbi Algorithm: Complexity

Annotating the loops of the pseudocode above:
the initialisation loop over the L states is O(L)
the main loop runs over N time steps; for each of the L states it takes a max over the L possible previous states, giving O(L^2 N)
the final transition into </s> is another max over L states: O(L)
following the backpointers is O(N)

Total: O(L) + O(L^2 N) + O(L) + O(N) = O(L^2 N)
INF4820: Algorithms for Artificial Intelligence and Natural Language Processing. Hidden Markov Models
INF4820: Algorithms for Artificial Intelligence and Natural Language Processing Hidden Markov Models Murhaf Fares & Stephan Oepen Language Technology Group (LTG) October 27, 2016 Recap: Probabilistic Language
More informationINF4820: Algorithms for Artificial Intelligence and Natural Language Processing. Hidden Markov Models
INF4820: Algorithms for Artificial Intelligence and Natural Language Processing Hidden Markov Models Murhaf Fares & Stephan Oepen Language Technology Group (LTG) October 18, 2017 Recap: Probabilistic Language
More informationDynamic Programming: Hidden Markov Models
University of Oslo : Department of Informatics Dynamic Programming: Hidden Markov Models Rebecca Dridan 16 October 2013 INF4820: Algorithms for AI and NLP Topics Recap n-grams Parts-of-speech Hidden Markov
More informationStatistical Methods for NLP
Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured
More information10/17/04. Today s Main Points
Part-of-speech Tagging & Hidden Markov Model Intro Lecture #10 Introduction to Natural Language Processing CMPSCI 585, Fall 2004 University of Massachusetts Amherst Andrew McCallum Today s Main Points
More informationEmpirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs
Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21
More informationCS838-1 Advanced NLP: Hidden Markov Models
CS838-1 Advanced NLP: Hidden Markov Models Xiaojin Zhu 2007 Send comments to jerryzhu@cs.wisc.edu 1 Part of Speech Tagging Tag each word in a sentence with its part-of-speech, e.g., The/AT representative/nn
More informationLecture 9: Hidden Markov Model
Lecture 9: Hidden Markov Model Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501 Natural Language Processing 1 This lecture v Hidden Markov
More informationStatistical methods in NLP, lecture 7 Tagging and parsing
Statistical methods in NLP, lecture 7 Tagging and parsing Richard Johansson February 25, 2014 overview of today's lecture HMM tagging recap assignment 3 PCFG recap dependency parsing VG assignment 1 overview
More informationThe Noisy Channel Model and Markov Models
1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle
More informationACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging
ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers
More informationPart of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015
Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about
More informationSequence Labeling: HMMs & Structured Perceptron
Sequence Labeling: HMMs & Structured Perceptron CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu HMM: Formal Specification Q: a finite set of N states Q = {q 0, q 1, q 2, q 3, } N N Transition
More informationPart-of-Speech Tagging
Part-of-Speech Tagging Informatics 2A: Lecture 17 Adam Lopez School of Informatics University of Edinburgh 27 October 2016 1 / 46 Last class We discussed the POS tag lexicon When do words belong to the
More informationLecture 13: Structured Prediction
Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501: NLP 1 Quiz 2 v Lectures 9-13 v Lecture 12: before page
More informationHidden Markov Models
CS 2750: Machine Learning Hidden Markov Models Prof. Adriana Kovashka University of Pittsburgh March 21, 2016 All slides are from Ray Mooney Motivating Example: Part Of Speech Tagging Annotate each word
More informationLecture 3: ASR: HMMs, Forward, Viterbi
Original slides by Dan Jurafsky CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 3: ASR: HMMs, Forward, Viterbi Fun informative read on phonetics The
More informationLECTURER: BURCU CAN Spring
LECTURER: BURCU CAN 2017-2018 Spring Regular Language Hidden Markov Model (HMM) Context Free Language Context Sensitive Language Probabilistic Context Free Grammar (PCFG) Unrestricted Language PCFGs can
More informationLecture 12: Algorithms for HMMs
Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 17 October 2016 updated 9 September 2017 Recap: tagging POS tagging is a
More informationMidterm sample questions
Midterm sample questions CS 585, Brendan O Connor and David Belanger October 12, 2014 1 Topics on the midterm Language concepts Translation issues: word order, multiword translations Human evaluation Parts
More informationProbabilistic Context-free Grammars
Probabilistic Context-free Grammars Computational Linguistics Alexander Koller 24 November 2017 The CKY Recognizer S NP VP NP Det N VP V NP V ate NP John Det a N sandwich i = 1 2 3 4 k = 2 3 4 5 S NP John
More informationMachine Learning for natural language processing
Machine Learning for natural language processing Hidden Markov Models Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 33 Introduction So far, we have classified texts/observations
More informationLecture 12: Algorithms for HMMs
Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 26 February 2018 Recap: tagging POS tagging is a sequence labelling task.
More informationHidden Markov Models (HMMs)
Hidden Markov Models HMMs Raymond J. Mooney University of Texas at Austin 1 Part Of Speech Tagging Annotate each word in a sentence with a part-of-speech marker. Lowest level of syntactic analysis. John
More informationLING 473: Day 10. START THE RECORDING Coding for Probability Hidden Markov Models Formal Grammars
LING 473: Day 10 START THE RECORDING Coding for Probability Hidden Markov Models Formal Grammars 1 Issues with Projects 1. *.sh files must have #!/bin/sh at the top (to run on Condor) 2. If run.sh is supposed
More informationMore on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013
More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013 Summary Parts of speech tagging HMMs: Unsupervised parameter estimation Forward Backward algorithm Bayesian variants Discriminative
More informationSequences and Information
Sequences and Information Rahul Siddharthan The Institute of Mathematical Sciences, Chennai, India http://www.imsc.res.in/ rsidd/ Facets 16, 04/07/2016 This box says something By looking at the symbols
More informationHMM and Part of Speech Tagging. Adam Meyers New York University
HMM and Part of Speech Tagging Adam Meyers New York University Outline Parts of Speech Tagsets Rule-based POS Tagging HMM POS Tagging Transformation-based POS Tagging Part of Speech Tags Standards There
More informationText Mining. March 3, March 3, / 49
Text Mining March 3, 2017 March 3, 2017 1 / 49 Outline Language Identification Tokenisation Part-Of-Speech (POS) tagging Hidden Markov Models - Sequential Taggers Viterbi Algorithm March 3, 2017 2 / 49
More informationHidden Markov Models
CS769 Spring 2010 Advanced Natural Language Processing Hidden Markov Models Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Part-of-Speech Tagging The goal of Part-of-Speech (POS) tagging is to label each
More informationCMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009
CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models Jimmy Lin The ischool University of Maryland Wednesday, September 30, 2009 Today s Agenda The great leap forward in NLP Hidden Markov
More informationPart-of-Speech Tagging
Part-of-Speech Tagging Informatics 2A: Lecture 17 Shay Cohen School of Informatics University of Edinburgh 26 October 2018 1 / 48 Last class We discussed the POS tag lexicon When do words belong to the
More informationLanguage Modeling. Introduction to N-grams. Many Slides are adapted from slides by Dan Jurafsky
Language Modeling Introduction to N-grams Many Slides are adapted from slides by Dan Jurafsky Probabilistic Language Models Today s goal: assign a probability to a sentence Why? Machine Translation: P(high
More informationToday s Agenda. Need to cover lots of background material. Now on to the Map Reduce stuff. Rough conceptual sketch of unsupervised training using EM
Today s Agenda Need to cover lots of background material l Introduction to Statistical Models l Hidden Markov Models l Part of Speech Tagging l Applying HMMs to POS tagging l Expectation-Maximization (EM)
More informationParsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford)
Parsing Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) S N VP V NP D N John hit the ball Levels of analysis Level Morphology/Lexical POS (morpho-synactic), WSD Elements
More informationA brief introduction to Conditional Random Fields
A brief introduction to Conditional Random Fields Mark Johnson Macquarie University April, 2005, updated October 2010 1 Talk outline Graphical models Maximum likelihood and maximum conditional likelihood
More informationParsing with Context-Free Grammars
Parsing with Context-Free Grammars CS 585, Fall 2017 Introduction to Natural Language Processing http://people.cs.umass.edu/~brenocon/inlp2017 Brendan O Connor College of Information and Computer Sciences
More informationBasic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology
Basic Text Analysis Hidden Markov Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakimnivre@lingfiluuse Basic Text Analysis 1(33) Hidden Markov Models Markov models are
More informationMaschinelle Sprachverarbeitung
Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other
More informationCSCI 5832 Natural Language Processing. Today 2/19. Statistical Sequence Classification. Lecture 9
CSCI 5832 Natural Language Processing Jim Martin Lecture 9 1 Today 2/19 Review HMMs for POS tagging Entropy intuition Statistical Sequence classifiers HMMs MaxEnt MEMMs 2 Statistical Sequence Classification
More informationLecture 7: Sequence Labeling
http://courses.engr.illinois.edu/cs447 Lecture 7: Sequence Labeling Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Recap: Statistical POS tagging with HMMs (J. Hockenmaier) 2 Recap: Statistical
More informationCS460/626 : Natural Language
CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 27 SMT Assignment; HMM recap; Probabilistic Parsing cntd) Pushpak Bhattacharyya CSE Dept., IIT Bombay 17 th March, 2011 CMU Pronunciation
More information8: Hidden Markov Models
8: Hidden Markov Models Machine Learning and Real-world Data Helen Yannakoudakis 1 Computer Laboratory University of Cambridge Lent 2018 1 Based on slides created by Simone Teufel So far we ve looked at
More informationHidden Markov Models in Language Processing
Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications
More informationLanguage Modeling. Introduction to N-grams. Many Slides are adapted from slides by Dan Jurafsky
Language Modeling Introduction to N-grams Many Slides are adapted from slides by Dan Jurafsky Probabilistic Language Models Today s goal: assign a probability to a sentence Why? Machine Translation: P(high
More informationNgram Review. CS 136 Lecture 10 Language Modeling. Thanks to Dan Jurafsky for these slides. October13, 2017 Professor Meteer
+ Ngram Review October13, 2017 Professor Meteer CS 136 Lecture 10 Language Modeling Thanks to Dan Jurafsky for these slides + ASR components n Feature Extraction, MFCCs, start of Acoustic n HMMs, the Forward
More informationNatural Language Processing CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Natural Language Processing CS 6840 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Statistical Parsing Define a probabilistic model of syntax P(T S):
More informationA gentle introduction to Hidden Markov Models
A gentle introduction to Hidden Markov Models Mark Johnson Brown University November 2009 1 / 27 Outline What is sequence labeling? Markov models Hidden Markov models Finding the most likely state sequence
More informationMachine Learning for natural language processing
Machine Learning for natural language processing N-grams and language models Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 25 Introduction Goals: Estimate the probability that a
More informationDT2118 Speech and Speaker Recognition
DT2118 Speech and Speaker Recognition Language Modelling Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 56 Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language
More informationMaschinelle Sprachverarbeitung
Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other
More informationN-grams. Motivation. Simple n-grams. Smoothing. Backoff. N-grams L545. Dept. of Linguistics, Indiana University Spring / 24
L545 Dept. of Linguistics, Indiana University Spring 2013 1 / 24 Morphosyntax We just finished talking about morphology (cf. words) And pretty soon we re going to discuss syntax (cf. sentences) In between,
More informationLecture 6: Part-of-speech tagging
CS447: Natural Language Processing http://courses.engr.illinois.edu/cs447 Lecture 6: Part-of-speech tagging Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Smoothing: Reserving mass in P(X Y)
More informationTnT Part of Speech Tagger
TnT Part of Speech Tagger By Thorsten Brants Presented By Arghya Roy Chaudhuri Kevin Patel Satyam July 29, 2014 1 / 31 Outline 1 Why Then? Why Now? 2 Underlying Model Other technicalities 3 Evaluation
More informationStatistical Methods for NLP
Statistical Methods for NLP Stochastic Grammars Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(22) Structured Classification
More informationProbabilistic Graphical Models
CS 1675: Intro to Machine Learning Probabilistic Graphical Models Prof. Adriana Kovashka University of Pittsburgh November 27, 2018 Plan for This Lecture Motivation for probabilistic graphical models Directed
More information8: Hidden Markov Models
8: Hidden Markov Models Machine Learning and Real-world Data Simone Teufel and Ann Copestake Computer Laboratory University of Cambridge Lent 2017 Last session: catchup 1 Research ideas from sentiment
More informationAdvanced Natural Language Processing Syntactic Parsing
Advanced Natural Language Processing Syntactic Parsing Alicia Ageno ageno@cs.upc.edu Universitat Politècnica de Catalunya NLP statistical parsing 1 Parsing Review Statistical Parsing SCFG Inside Algorithm
More informationLecture 11: Viterbi and Forward Algorithms
Lecture 11: iterbi and Forward lgorithms Kai-Wei Chang CS @ University of irginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/lp16 CS6501 atural Language Processing 1 Quiz 1 Quiz 1 30 25
More informationPOS-Tagging. Fabian M. Suchanek
POS-Tagging Fabian M. Suchanek 100 Def: POS A Part-of-Speech (also: POS, POS-tag, word class, lexical class, lexical category) is a set of words with the same grammatical role. Alizée wrote a really great
More informationDegree in Mathematics