INF4820: Algorithms for Artificial Intelligence and Natural Language Processing
Hidden Markov Models
Murhaf Fares & Stephan Oepen, Language Technology Group (LTG), October 27, 2016

Recap: Probabilistic Language Models
Basic probability theory: axioms, joint vs. conditional probability, independence, Bayes' Theorem.
Previous context can help predict the next element of a sequence, for example words in a sentence.
Rather than use the whole previous context, the Markov assumption says that the whole history can be approximated by the last n-1 elements.
An n-gram language model predicts the n-th word, conditioned on the n-1 previous words.
Maximum Likelihood Estimation uses relative frequencies to approximate the conditional probabilities needed for an n-gram model.
Smoothing techniques are used to avoid zero probabilities.

Today
Determining which string is most likely: She studies morphosyntax vs. She studies more faux syntax
which tag sequence is most likely for flies like flowers: NNS VB NNS vs. VBZ P NNS
which syntactic analysis is most likely: [two parse trees for "I ate sushi with tuna", differing in whether the PP "with tuna" attaches to the noun "sushi" or to the verb phrase]

Parts of Speech
Known by a variety of names: part-of-speech, POS, lexical categories, word classes, morphological classes, ...
Traditionally defined semantically (e.g. nouns are "naming words"), but more accurately defined by their distributional properties.
Open classes: new words are created/updated/deleted all the time.
Closed classes: smaller classes with relatively static membership; usually function words.

Open Class Words
Nouns: dog, Oslo, scissors, snow, people, truth, cups (proper or common; countable or uncountable; plural or singular; masculine, feminine or neuter; ...)
Verbs: fly, rained, having, ate, seen (transitive, intransitive, ditransitive; past, present, passive; stative or dynamic; plural or singular; ...)
Adjectives: good, smaller, unique, fastest, best, unhappy (comparative or superlative; predicative or attributive; intersective or non-intersective; ...)
Adverbs: again, somewhat, slowly, yesterday, aloud (intersective; scopal; discourse; degree; temporal; directional; comparative or superlative; ...)

Closed Class Words
Prepositions: on, under, from, at, near, over, ...
Determiners: a, an, the, that, ...
Pronouns: she, who, I, others, ...
Conjunctions: and, but, or, when, ...
Auxiliary verbs: can, may, should, must, ...
Interjections, particles, numerals, negatives, politeness markers, greetings, existential there, ...
(Examples from Jurafsky & Martin, 2008)

POS Tagging
The (automatic) assignment of POS tags to word sequences:
non-trivial where words are ambiguous: fly (v) vs. fly (n); the choice of the correct tag is context-dependent
useful as pre-processing for parsing, etc., but also directly, for example in a text-to-speech (TTS) system: content (n) vs. content (adj)
difficulty and usefulness can depend on the tagset:
English Penn Treebank (PTB), 45 tags: NNS, NN, NNP, JJ, JJR, JJS, ... http://bulba.sdsu.edu/jeanette/thesis/penntags.html
Norwegian Oslo-Bergen tagset, multi-part tags: subst appell fem be ent http://tekstlab.uio.no/obt-ny/english/tags.html

Labelled Sequences
We are interested in the probability of sequences like:
flies like the wind (nns vb dt nn) or flies like the wind (vbz p dt nn)
In normal text, we see the words, but not the tags.
Consider the POS tags to be the underlying skeleton of the sentence: unseen, but influencing the sentence shape.
A structure like this, consisting of a hidden state sequence and a related observation sequence, can be modelled as a Hidden Markov Model.

Hidden Markov Models
The generative story: starting in the pseudo-state <s>, we move through the hidden states DT, NN, VBZ, NNS and finally </s>, emitting the observed words the, cat, eats, mice along the way. Each transition and each emission contributes one factor:

P(S, O) = P(DT | <s>) × P(the | DT) × P(NN | DT) × P(cat | NN) × P(VBZ | NN) × P(eats | VBZ) × P(NNS | VBZ) × P(mice | NNS) × P(</s> | NNS)

Hidden Markov Models
For a bi-gram HMM over an observation sequence O = o_1 ... o_N:

P(S, O) = ∏_{i=1}^{N+1} P(s_i | s_{i-1}) × P(o_i | s_i),  where s_0 = <s> and s_{N+1} = </s>

The transition probabilities model the probabilities of moving from state to state.
The emission probabilities model the probability that a state emits a particular observation.
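
To make the definition concrete, here is a minimal Python sketch (not from the lecture) that multiplies out exactly these transition and emission factors for one tagged sentence. The dictionaries trans and emit, and every probability value in them, are invented placeholders purely for illustration; only the word and tag sequence is taken from the generative story above.

```python
def joint_prob(tags, words, trans, emit, start="<s>", end="</s>"):
    """P(S, O) for a bi-gram HMM: the product of P(s_i | s_{i-1}) * P(o_i | s_i),
    with s_0 = <s> and a final transition into </s>."""
    p, prev = 1.0, start
    for tag, word in zip(tags, words):
        p *= trans[prev][tag] * emit[tag][word]
        prev = tag
    return p * trans[prev][end]

# Made-up toy probabilities, purely for illustration (not taken from the slides):
trans = {"<s>": {"DT": 0.5}, "DT": {"NN": 0.6}, "NN": {"VBZ": 0.3},
         "VBZ": {"NNS": 0.2}, "NNS": {"</s>": 0.4}}
emit = {"DT": {"the": 0.7}, "NN": {"cat": 0.01},
        "VBZ": {"eats": 0.02}, "NNS": {"mice": 0.01}}

print(joint_prob(["DT", "NN", "VBZ", "NNS"], ["the", "cat", "eats", "mice"], trans, emit))
```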

Using HMMs
The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:
P(S, O), given S and O
P(O), given O
the S that maximises P(S | O), given O
We can also learn the model parameters, given a set of observations.
Our observations will be words (w_i), and our states PoS tags (t_i).

Estimation
As so often in NLP, we learn an HMM from labelled data.
Transition probabilities: based on a training corpus of previously tagged text, with the tags as our states, the MLE can be computed from the counts of observed tags:

P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})

Emission probabilities: computed from relative frequencies in the same way, with the words as observations:

P(w_i | t_i) = C(t_i, w_i) / C(t_i)
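
As a minimal sketch of how this counting might look in Python (the function name estimate_hmm, the nested-dictionary layout, and the two-sentence toy corpus are all invented for illustration, not part of the lecture):

```python
from collections import defaultdict

def estimate_hmm(tagged_sentences, start="<s>", end="</s>"):
    """MLE estimates: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}) and
    P(w_i | t_i) = C(t_i, w_i) / C(t_i)."""
    trans_counts = defaultdict(lambda: defaultdict(int))
    emit_counts = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:          # each sentence is a list of (word, tag) pairs
        prev = start
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word] += 1
            prev = tag
        trans_counts[prev][end] += 1           # close every sentence with a transition into </s>
    trans = {t1: {t2: c / sum(nxt.values()) for t2, c in nxt.items()}
             for t1, nxt in trans_counts.items()}
    emit = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
            for t, ws in emit_counts.items()}
    return trans, emit

corpus = [[("the", "DT"), ("cat", "NN"), ("eats", "VBZ"), ("mice", "NNS")],
          [("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ")]]
trans, emit = estimate_hmm(corpus)
print(trans["DT"]["NN"], emit["NN"]["cat"])    # 1.0 0.5
```

Note that unseen words and tag pairs get no probability mass at all here, which is exactly why the smoothing issues mentioned on the next slide matter in practice.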

Implementation Issues
P(S, O) = P(s_1 | <s>) × P(o_1 | s_1) × P(s_2 | s_1) × P(o_2 | s_2) × P(s_3 | s_2) × P(o_3 | s_3) × ...
        = 0.0429 × 0.0031 × 0.0044 × 0.0001 × 0.0072 × ...
Multiplying many small probabilities leads to underflow.
Solution: work in log(arithmic) space:
log(ab) = log(a) + log(b), hence P(A) × P(B) = exp(log(P(A)) + log(P(B)))
log(P(S, O)) = -1.368 + -2.509 + -2.357 + -4 + -2.143 + ...
The issues related to MLE / smoothing that we discussed for n-gram models also apply here.
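
A small Python sketch of the same point, using the factors listed on the slide (base-10 logs, to match the numbers above):

```python
import math

# The probability factors from the slide above.
probs = [0.0429, 0.0031, 0.0044, 0.0001, 0.0072]

direct = math.prod(probs)                     # already ~4.2e-13 after only five factors
log_p = sum(math.log10(p) for p in probs)     # the slide's -1.368 + -2.509 + -2.357 + -4 + -2.143

print(direct)
print(log_p)                                  # about -12.375
print(10 ** log_p)                            # recovers the probability (while it is still representable)
```

For a long sentence the direct product eventually underflows to 0.0, while the sum of logs stays comfortably representable; that is the whole motivation for working in log space.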

Ice Cream and Global Warming
Missing records of weather in Baltimore for Summer 2007. Jason likes to eat ice cream. He records his daily ice cream consumption in his diary. The number of ice creams he ate was influenced, but not entirely determined, by the weather. Today's weather is partially predictable from yesterday's.
A Hidden Markov Model! with:
Hidden states: {H, C} (plus pseudo-states <s> and </s>)
Observations: {1, 2, 3}

[State-transition diagram of the ice-cream HMM:
P(H | <s>) = 0.8, P(C | <s>) = 0.2; P(H | H) = 0.6, P(C | H) = 0.2, P(</s> | H) = 0.2; P(H | C) = 0.3, P(C | C) = 0.5, P(</s> | C) = 0.2;
P(1 | H) = 0.2, P(2 | H) = 0.4, P(3 | H) = 0.4; P(1 | C) = 0.5, P(2 | C) = 0.4, P(3 | C) = 0.1]
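
One straightforward way to encode this model in Python is as nested dictionaries, with the conditioning state as the outer key. This is only a sketch of one possible representation (the names TRANS and EMIT are ours); the probability values are the ones from the diagram above.

```python
# The ice-cream HMM from the slides: transition and emission tables.
TRANS = {"<s>": {"H": 0.8, "C": 0.2},
         "H":   {"H": 0.6, "C": 0.2, "</s>": 0.2},
         "C":   {"H": 0.3, "C": 0.5, "</s>": 0.2}}
EMIT  = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
         "C": {1: 0.5, 2: 0.4, 3: 0.1}}

# Sanity check: every conditional distribution sums to one.
for name, table in (("TRANS", TRANS), ("EMIT", EMIT)):
    for given, dist in table.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9, (name, given)
print("all conditional distributions sum to 1")
```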

Using HMMs
The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:
P(S, O), given S and O
P(O), given O
the S that maximises P(S | O), given O
P(s_x | O), given O
We can also learn the model parameters, given a set of observations.

Part-of-Speech Tagging
We want to find the tag sequence, given a word sequence. With tags as our states and words as our observations, we know:

P(S, O) = ∏_{i=1}^{N+1} P(s_i | s_{i-1}) × P(o_i | s_i)

We want: P(S | O) = P(S, O) / P(O)
Actually, we want the state sequence Ŝ that maximises P(S | O):

Ŝ = argmax_S P(S, O) / P(O)

Since P(O) is always the same, we can drop the denominator:

Ŝ = argmax_S P(S, O)

Decoding
Task: What is the most likely state sequence S, given an observation sequence O and an HMM?

HMM:
P(H | <s>) = 0.8    P(C | <s>) = 0.2
P(H | H) = 0.6      P(C | H) = 0.2
P(H | C) = 0.3      P(C | C) = 0.5
P(</s> | H) = 0.2   P(</s> | C) = 0.2
P(1 | H) = 0.2      P(1 | C) = 0.5
P(2 | H) = 0.4      P(2 | C) = 0.4
P(3 | H) = 0.4      P(3 | C) = 0.1

If O = 3 1 3, scoring every possible state sequence:
<s> H H H </s>   0.0018432
<s> H H C </s>   0.0001536
<s> H C H </s>   0.0007680
<s> H C C </s>   0.0003200
<s> C H H </s>   0.0000576
<s> C H C </s>   0.0000048
<s> C C H </s>   0.0001200
<s> C C C </s>   0.0000500
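
The same enumeration can be written as a short brute-force sketch in Python; the helper joint and the TRANS/EMIT dictionaries are our own encoding of the HMM given above:

```python
from itertools import product

TRANS = {"<s>": {"H": 0.8, "C": 0.2},
         "H":   {"H": 0.6, "C": 0.2, "</s>": 0.2},
         "C":   {"H": 0.3, "C": 0.5, "</s>": 0.2}}
EMIT  = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
         "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint(states, obs):
    """P(S, O): a chain of transition * emission factors, closed off by the </s> transition."""
    p, prev = 1.0, "<s>"
    for s, o in zip(states, obs):
        p *= TRANS[prev][s] * EMIT[s][o]
        prev = s
    return p * TRANS[prev]["</s>"]

obs = (3, 1, 3)
scores = {seq: joint(seq, obs) for seq in product("HC", repeat=len(obs))}
for seq, p in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(seq), f"{p:.7f}")       # H H H comes out on top with 0.0018432
```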

Dynamic Programming
For (only) two states and a (short) observation sequence of length three, comparing all possible sequences is workable, but:
for N observations and L states, there are L^N sequences
we do the same partial calculations over and over again
Dynamic programming: records sub-problem solutions for further re-use; useful when a complex problem can be described recursively; examples: Dijkstra's shortest path, minimum edit distance, longest common subsequence, the Viterbi algorithm.

Viterbi Algorithm
Recall our problem: find the state sequence that maximises P(s_1 ... s_N | o_1 ... o_N), which (since P(O) is constant) amounts to maximising P(s_1 | s_0) × P(o_1 | s_1) × P(s_2 | s_1) × P(o_2 | s_2) × ...

Our recursive sub-problem:

v_i(x) = max_{k=1..L} [ v_{i-1}(k) × P(x | k) × P(o_i | x) ]

The variable v_i(x) represents the maximum probability that the i-th state is x, given that we have seen the first i observations o_1 ... o_i.
At each step, we record backpointers showing which previous state led to the maximum probability.

An Example of the Viterbi Algorithm
Trellis for O = 3 1 3, with states H and C:

v_1(H) = P(H | <s>) × P(3 | H) = 0.8 × 0.4 = 0.32
v_1(C) = P(C | <s>) × P(3 | C) = 0.2 × 0.1 = 0.02
v_2(H) = max(0.32 × 0.12, 0.02 × 0.06) = 0.0384       (factors: P(H | H) P(1 | H) = 0.6 × 0.2; P(H | C) P(1 | H) = 0.3 × 0.2)
v_2(C) = max(0.32 × 0.1, 0.02 × 0.25) = 0.032         (factors: P(C | H) P(1 | C) = 0.2 × 0.5; P(C | C) P(1 | C) = 0.5 × 0.5)
v_3(H) = max(0.0384 × 0.24, 0.032 × 0.12) = 0.009216  (factors: P(H | H) P(3 | H) = 0.6 × 0.4; P(H | C) P(3 | H) = 0.3 × 0.4)
v_3(C) = max(0.0384 × 0.02, 0.032 × 0.05) = 0.0016    (factors: P(C | H) P(3 | C) = 0.2 × 0.1; P(C | C) P(3 | C) = 0.5 × 0.1)
v_f(</s>) = max(0.009216 × 0.2, 0.0016 × 0.2) = 0.0018432   (factors: P(</s> | H) = 0.2; P(</s> | C) = 0.2)

Following the backpointers from </s> recovers the best path: H H H.

Pseudocode for the Viterbi Algorithm
Input: observations of length N, state set of size L
Output: best-path

create a path probability matrix viterbi[N, L+2]
create a path backpointer matrix backpointer[N, L+2]
for each state s from 1 to L do
    viterbi[1, s] ← trans(<s>, s) × emit(o_1, s)
    backpointer[1, s] ← 0
end
for each time step i from 2 to N do
    for each state s from 1 to L do
        viterbi[i, s] ← max_{s'=1..L} viterbi[i-1, s'] × trans(s', s) × emit(o_i, s)
        backpointer[i, s] ← argmax_{s'=1..L} viterbi[i-1, s'] × trans(s', s)
    end
end
viterbi[N, L+1] ← max_{s=1..L} viterbi[N, s] × trans(s, </s>)
backpointer[N, L+1] ← argmax_{s=1..L} viterbi[N, s] × trans(s, </s>)
return the path obtained by following backpointers from backpointer[N, L+1]
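
For reference, here is a small runnable Python version of this pseudocode, using the ice-cream HMM from the earlier slides. The dictionary encoding TRANS/EMIT and the function name viterbi are our own choices; the algorithm itself follows the pseudocode.

```python
TRANS = {"<s>": {"H": 0.8, "C": 0.2},
         "H":   {"H": 0.6, "C": 0.2, "</s>": 0.2},
         "C":   {"H": 0.3, "C": 0.5, "</s>": 0.2}}
EMIT  = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
         "C": {1: 0.5, 2: 0.4, 3: 0.1}}
STATES = ["H", "C"]

def viterbi(obs):
    """Most probable state sequence and its joint probability."""
    # Initialisation: out of <s>, emit the first observation.
    v = [{s: TRANS["<s>"][s] * EMIT[s][obs[0]] for s in STATES}]
    backptr = [{}]
    # Recursion: maximise over the previous state at every time step.
    for i in range(1, len(obs)):
        v.append({})
        backptr.append({})
        for s in STATES:
            best_prev = max(STATES, key=lambda k: v[i - 1][k] * TRANS[k][s])
            v[i][s] = v[i - 1][best_prev] * TRANS[best_prev][s] * EMIT[s][obs[i]]
            backptr[i][s] = best_prev
    # Termination: transition into </s>, then follow the backpointers.
    last = max(STATES, key=lambda k: v[-1][k] * TRANS[k]["</s>"])
    best_prob = v[-1][last] * TRANS[last]["</s>"]
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path)), best_prob

print(viterbi([3, 1, 3]))   # (['H', 'H', 'H'], ~0.0018432), as in the worked example
```

A practical tagger would run the same recursion over log probabilities, for the underflow reasons discussed earlier.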

Diversion: Complexity and O(N)
Big-O notation describes the complexity of an algorithm:
it describes the worst-case order of growth in terms of the size of the input
only the largest-order term is represented
constant factors are ignored
it is determined by looking at the loops in the code

Pseudocode for the Viterbi Algorithm: Complexity
Annotating the pseudocode above, loop by loop:
the initialisation loop over the L states: L
the main loops: for each of the N time steps and each of the L states, a max (and argmax) over L previous states: L² × N
the termination step (transitions into </s>): L
following the backpointers: N
Total: L + L²N + L + N, which is O(L²N).

Using HMMs
The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:
P(S, O), given S and O
P(O), given O
the S that maximises P(S | O), given O
P(s_x | O), given O
We can also learn the model parameters, given a set of observations.

Computing Likelihoods
Task: Given an observation sequence O, determine the likelihood P(O), according to the HMM.
Compute the sum over all possible state sequences:

P(O) = Σ_S P(O, S)

For example, for the ice cream sequence 3 1 3:
P(3 1 3) = P(3 1 3, cold cold cold) + P(3 1 3, cold cold hot) + ... + P(3 1 3, hot hot cold) + ...
Done naively, this is O(L^N × N).
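
A brute-force sketch of this sum in Python, reusing the same dictionary encoding of the ice-cream HMM as before (our own names; the values come from the slides):

```python
from itertools import product

TRANS = {"<s>": {"H": 0.8, "C": 0.2},
         "H":   {"H": 0.6, "C": 0.2, "</s>": 0.2},
         "C":   {"H": 0.3, "C": 0.5, "</s>": 0.2}}
EMIT  = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
         "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint(states, obs):
    p, prev = 1.0, "<s>"
    for s, o in zip(states, obs):
        p *= TRANS[prev][s] * EMIT[s][o]
        prev = s
    return p * TRANS[prev]["</s>"]

obs = (3, 1, 3)
# Sum P(O, S) over all L^N hidden sequences: fine for N = 3, hopeless for long sequences.
print(sum(joint(seq, obs) for seq in product("HC", repeat=len(obs))))   # 0.0033172
```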

The Forward Algorithm
Again, we use dynamic programming, storing and reusing the results of partial computations in a trellis α.
Each cell in the trellis stores the probability of being in state s_x after seeing the first i observations:

α_i(x) = P(o_1 ... o_i, s_i = x) = Σ_{k=1..L} α_{i-1}(k) × P(x | k) × P(o_i | x)

Note the sum, instead of the max in Viterbi.

An Example of the Forward Algorithm
Trellis for O = 3 1 3, with states H and C (the same transition and emission factors as in the Viterbi example, but summed rather than maximised):

α_1(H) = P(H | <s>) × P(3 | H) = 0.8 × 0.4 = 0.32
α_1(C) = P(C | <s>) × P(3 | C) = 0.2 × 0.1 = 0.02
α_2(H) = 0.32 × 0.12 + 0.02 × 0.06 = 0.0396
α_2(C) = 0.32 × 0.1 + 0.02 × 0.25 = 0.037
α_3(H) = 0.0396 × 0.24 + 0.037 × 0.12 = 0.013944
α_3(C) = 0.0396 × 0.02 + 0.037 × 0.05 = 0.002642
α_f(</s>) = 0.013944 × 0.2 + 0.002642 × 0.2 = 0.0033172

P(3 1 3) = 0.0033172

Pseudocode for the Forward Algorithm
Input: observations of length N, state set of size L
Output: forward-probability

create a probability matrix forward[N, L+2]
for each state s from 1 to L do
    forward[1, s] ← trans(<s>, s) × emit(o_1, s)
end
for each time step i from 2 to N do
    for each state s from 1 to L do
        forward[i, s] ← Σ_{s'=1..L} forward[i-1, s'] × trans(s', s) × emit(o_i, s)
    end
end
forward[N, L+1] ← Σ_{s=1..L} forward[N, s] × trans(s, </s>)
return forward[N, L+1]
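
And a matching runnable sketch in Python, again with our own TRANS/EMIT encoding of the ice-cream HMM; the only change from the Viterbi code is that the max over previous states becomes a sum:

```python
TRANS = {"<s>": {"H": 0.8, "C": 0.2},
         "H":   {"H": 0.6, "C": 0.2, "</s>": 0.2},
         "C":   {"H": 0.3, "C": 0.5, "</s>": 0.2}}
EMIT  = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
         "C": {1: 0.5, 2: 0.4, 3: 0.1}}
STATES = ["H", "C"]

def forward(obs):
    """P(O): the same trellis shape as Viterbi, but summing over previous states."""
    alpha = [{s: TRANS["<s>"][s] * EMIT[s][obs[0]] for s in STATES}]
    for i in range(1, len(obs)):
        alpha.append({s: sum(alpha[i - 1][k] * TRANS[k][s] for k in STATES) * EMIT[s][obs[i]]
                      for s in STATES})
    return sum(alpha[-1][s] * TRANS[s]["</s>"] for s in STATES)

print(forward([3, 1, 3]))   # ~0.0033172, matching both the trellis example and the brute-force sum
```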

Tagger Evaluation
To evaluate a part-of-speech tagger (or any classification system) we:
train on a labelled training set
test on a separate test set
For a POS tagger, the standard evaluation metric is tag accuracy:

Acc = number of correct tags / number of words

The other metric sometimes used is the error rate: error rate = 1 - Acc
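
A tiny sketch of the metric in Python (the gold and predicted tag lists are invented, purely for illustration):

```python
def accuracy(gold_tags, predicted_tags):
    """Tag accuracy: number of correct tags divided by number of words."""
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

gold = ["DT", "NN", "VBZ", "NNS"]
pred = ["DT", "NN", "VB", "NNS"]
acc = accuracy(gold, pred)
print(acc, 1 - acc)     # accuracy 0.75, error rate 0.25
```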

Summary
Part-of-speech tagging as an example of sequence labelling.
Hidden Markov Models to model the observation and hidden state sequences.
Learn the parameters of the HMM (i.e. transition and emission probabilities) using MLE.
Use Viterbi for decoding, i.e. finding the S that maximises P(S | O), given O.
Use the Forward algorithm for computing the likelihood, i.e. P(O), given O.