INF4820: Algorithms for Artificial Intelligence and Natural Language Processing: Hidden Markov Models. Murhaf Fares & Stephan Oepen, Language Technology Group (LTG). October 18, 2017

Recap: Probabilistic Language Models 2 Basic probability theory: axioms, joint vs. conditional probability, independence, Bayes' Theorem; Previous context can help predict the next element of a sequence, for example words in a sentence; The Markov assumption says that the whole history can be approximated by the last n−1 elements; An n-gram language model predicts the n-th word, conditioned on the n−1 previous words; Maximum Likelihood Estimation uses relative frequencies to approximate the conditional probabilities needed for an n-gram model; Smoothing techniques are used to avoid zero probabilities.

Today 3 Determining: which string is most likely: "She studies morphosyntax" vs. "She studies more faux syntax"; which tag sequence is most likely for "flies like flowers": NNS VB NNS vs. VBZ P NNS; which syntactic analysis is most likely: two parse trees for "I ate sushi with tuna", one attaching the PP "with tuna" to the VP and one attaching it to the NP "sushi" (tree diagrams not reproduced here).

Parts of Speech 4 Known by a variety of names: part-of-speech, POS, lexical categories, word classes, morphological classes, ... Traditionally defined semantically (e.g. nouns are "naming words"), but more accurately by their distributional properties. Open classes: new words are created/updated/deleted all the time. Closed classes: smaller classes with relatively static membership, usually function words.

Open Class Words 5 Nouns: dog, Oslo, scissors, snow, people, truth, cups proper or common; countable or uncountable; plural or singular; masculine, feminine or neuter;... Verbs: fly, rained, having, ate, seen transitive, intransitive, ditransitive; past, present, passive; stative or dynamic; plural or singular;... Adjectives: good, smaller, unique, fastest, best, unhappy comparative or superlative; predicative or attributive; intersective or non-intersective;... Adverbs: again, somewhat, slowly, yesterday, aloud intersective; scopal; discourse; degree; temporal; directional; comparative or superlative;...

Closed Class Words 6 Prepositions: on, under, from, at, near, over,... Determiners: a, an, the, that,... Pronouns: she, who, I, others,... Conjunctions: and, but, or, when,... Auxiliary verbs: can, may, should, must,... Interjections, particles, numerals, negatives, politeness markers, greetings, existential there... (Examples from Jurafsky & Martin, 2008)

POS Tagging 7 The (automatic) assignment of POS tags to word sequences: non-trivial where words are ambiguous: fly (v) vs. fly (n); the choice of the correct tag is context-dependent; useful in pre-processing for parsing, etc., but also directly, for example in text-to-speech (TTS) systems: content (n) vs. content (adj); difficulty and usefulness can depend on the tagset. English: Penn Treebank (PTB), 45 tags: NNS, NN, NNP, JJ, JJR, JJS, ... www.ling.upenn.edu/courses/fall_2003/ling001/penn_treebank_pos.html Norwegian: Oslo-Bergen Tagset, multi-part tags: subst appell fem be ent http://tekstlab.uio.no/obt-ny/english/tags.html

Labelled Sequences 8 We are interested in the probability of tagged sequences like: flies/NNS like/VB the/DT wind/NN or flies/VBZ like/P the/DT wind/NN. In normal text, we see the words, but not the tags. Consider the POS tags to be the underlying skeleton of the sentence, unseen but influencing the sentence shape. A structure like this, consisting of a hidden state sequence and a related observation sequence, can be modelled as a Hidden Markov Model.

Hidden Markov Models 9 The generative story: the sentence "the cat eats mice" is generated from the hidden tag sequence ⟨S⟩ DT NN VBZ NNS ⟨/S⟩, with transition probabilities P(DT | ⟨S⟩), P(NN | DT), P(VBZ | NN), P(NNS | VBZ), P(⟨/S⟩ | NNS) and emission probabilities P(the | DT), P(cat | NN), P(eats | VBZ), P(mice | NNS):
P(S, O) = P(DT | ⟨S⟩) P(the | DT) P(NN | DT) P(cat | NN) P(VBZ | NN) P(eats | VBZ) P(NNS | VBZ) P(mice | NNS) P(⟨/S⟩ | NNS)

Hidden Markov Models 10 For a bi-gram HMM, with O = o_1 ... o_N:
P(S, O) = ∏_{i=1}^{N+1} P(s_i | s_{i-1}) P(o_i | s_i), where s_0 = ⟨S⟩ and s_{N+1} = ⟨/S⟩
The transition probabilities model the probabilities of moving from state to state. The emission probabilities model the probability that a state emits a particular observation.
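As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this product, assuming the model is stored as nested dictionaries trans[prev][next] and emit[state][obs]:

def joint_probability(states, observations, trans, emit,
                      start="<S>", end="</S>"):
    """P(S, O) for a bigram HMM with start/end pseudo-states."""
    prob = 1.0
    prev = start
    for state, obs in zip(states, observations):
        prob *= trans[prev][state] * emit[state][obs]  # P(s_i | s_{i-1}) P(o_i | s_i)
        prev = state
    prob *= trans[prev][end]  # P(</S> | s_N); the end pseudo-state emits nothing
    return prob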

Using HMMs 11 The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks: P(S, O), given S and O; P(O), given O; the S that maximises P(S | O), given O. We can also learn the model parameters, given a set of observations. Our observations will be words (w_i), and our states POS tags (t_i).

Estimation 12 As so often in NLP, we learn an HMM from labelled data:
Transition probabilities: based on a training corpus of previously tagged text, with tags as our states, the MLE can be computed from the counts of observed tags:
P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
Emission probabilities: computed from relative frequencies in the same way, with the words as observations:
P(w_i | t_i) = C(t_i, w_i) / C(t_i)
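A sketch of these MLE computations in Python, under the assumption that the training corpus is given as a list of sentences, each a list of (word, tag) pairs (the function and variable names are illustrative, not from the lecture):

from collections import defaultdict

def estimate_hmm(tagged_sentences, start="<S>", end="</S>"):
    """MLE transition and emission probabilities from a tagged corpus."""
    tag_count = defaultdict(int)       # C(t)
    bigram_count = defaultdict(int)    # C(t_{i-1}, t_i)
    word_tag_count = defaultdict(int)  # C(t, w)
    for sentence in tagged_sentences:
        prev = start
        tag_count[start] += 1
        for word, tag in sentence:
            bigram_count[(prev, tag)] += 1
            word_tag_count[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
        bigram_count[(prev, end)] += 1
    trans = {(p, t): c / tag_count[p] for (p, t), c in bigram_count.items()}
    emit = {(t, w): c / tag_count[t] for (t, w), c in word_tag_count.items()}
    return trans, emit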

Implementation Issues 13
P(S, O) = P(s_1 | ⟨S⟩) P(o_1 | s_1) P(s_2 | s_1) P(o_2 | s_2) P(s_3 | s_2) P(o_3 | s_3) ...
        = 0.0429 × 0.0031 × 0.0044 × 0.0001 × 0.0072 × ...
Multiplying many small probabilities leads to underflow. Solution: work in log(arithmic) space:
log(AB) = log(A) + log(B), hence P(A) P(B) = exp(log P(A) + log P(B))
log(P(S, O)) = −1.368 − 2.509 − 2.357 − 4 − 2.143 − ...
The issues related to MLE / smoothing that we discussed for n-gram models also apply here.
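A minimal Python sketch of the log-space trick (math.log is the natural logarithm, whereas the numbers on the slide look like base-10 logs; the principle is the same):

import math

probs = [0.0429, 0.0031, 0.0044, 0.0001, 0.0072]

product = 1.0
for p in probs:          # naive product: risks underflow for long sequences
    product *= p

log_prob = sum(math.log(p) for p in probs)   # add logs instead of multiplying
assert math.isclose(product, math.exp(log_prob))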

Ice Cream and Global Warming 14 Missing records of weather in Baltimore for Summer 2007. Jason likes to eat ice cream. He records his daily ice cream consumption in his diary. The number of ice creams he ate was influenced, but not entirely determined, by the weather. Today's weather is partially predictable from yesterday's. A Hidden Markov Model! with: Hidden states: {H, C} (plus pseudo-states ⟨S⟩ and ⟨/S⟩); Observations: {1, 2, 3}

Ice Cream and Global Warming 15 The model's parameters (from the state-transition diagram):
Transitions: P(H | ⟨S⟩) = 0.8, P(C | ⟨S⟩) = 0.2; P(H | H) = 0.6, P(C | H) = 0.2, P(⟨/S⟩ | H) = 0.2; P(H | C) = 0.3, P(C | C) = 0.5, P(⟨/S⟩ | C) = 0.2
Emissions: P(1 | H) = 0.2, P(2 | H) = 0.4, P(3 | H) = 0.4; P(1 | C) = 0.5, P(2 | C) = 0.4, P(3 | C) = 0.1
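For use in the sketches below, the same model written out as Python dictionaries (the names TRANS and EMIT are my own; the numbers are read off the diagram):

TRANS = {                                   # TRANS[previous][next]
    "<S>": {"H": 0.8, "C": 0.2},
    "H":   {"H": 0.6, "C": 0.2, "</S>": 0.2},
    "C":   {"H": 0.3, "C": 0.5, "</S>": 0.2},
}
EMIT = {                                    # EMIT[state][ice creams eaten]
    "H": {1: 0.2, 2: 0.4, 3: 0.4},
    "C": {1: 0.5, 2: 0.4, 3: 0.1},
}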

Using HMMs 16 The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks: P(S, O), given S and O; P(O), given O; the S that maximises P(S | O), given O.

Manipulating the Definition 17 We want to find the tag sequence, given a word sequence. With tags as our states and words as our observations, we know:
P(S, O) = ∏_{i=1}^{N+1} P(s_i | s_{i-1}) P(o_i | s_i)
We want:
P(S | O) = P(S, O) / P(O)
Actually, we want the state sequence Ŝ that maximises P(S | O):
Ŝ = argmax_S P(S, O) / P(O)
Since P(O) is always the same, we can drop the denominator:
Ŝ = argmax_S P(S, O)

Decoding 18
Task: What is the most likely state sequence S, given an observation sequence O and an HMM?

HMM parameters:
P(H | ⟨S⟩) = 0.8    P(C | ⟨S⟩) = 0.2
P(H | H) = 0.6      P(C | H) = 0.2
P(H | C) = 0.3      P(C | C) = 0.5
P(⟨/S⟩ | H) = 0.2   P(⟨/S⟩ | C) = 0.2
P(1 | H) = 0.2      P(1 | C) = 0.5
P(2 | H) = 0.4      P(2 | C) = 0.4
P(3 | H) = 0.4      P(3 | C) = 0.1

If O = 3 1 3:
⟨S⟩ H H H ⟨/S⟩   0.0018432
⟨S⟩ H H C ⟨/S⟩   0.0001536
⟨S⟩ H C H ⟨/S⟩   0.0007680
⟨S⟩ H C C ⟨/S⟩   0.0003200
⟨S⟩ C H H ⟨/S⟩   0.0000576
⟨S⟩ C H C ⟨/S⟩   0.0000048
⟨S⟩ C C H ⟨/S⟩   0.0001200
⟨S⟩ C C C ⟨/S⟩   0.0000500
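The table above can be reproduced by brute force; a sketch, assuming the TRANS and EMIT dictionaries from the ice-cream sketch are in scope:

from itertools import product

def enumerate_sequences(observations, states, trans, emit):
    """Joint probability P(S, O) for every possible state sequence (exponential!)."""
    results = {}
    for seq in product(states, repeat=len(observations)):
        p, prev = 1.0, "<S>"
        for state, obs in zip(seq, observations):
            p *= trans[prev][state] * emit[state][obs]
            prev = state
        results[seq] = p * trans[prev]["</S>"]
    return results

# enumerate_sequences([3, 1, 3], ["H", "C"], TRANS, EMIT)
# maps e.g. ("H", "H", "H") to roughly 0.0018432, matching the table.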

Dynamic Programming 19 For (only) two states and a (short) observation sequence of length three, comparing all possible sequences is workable, but... for N observations and L states, there are L^N sequences, and we do the same partial calculations over and over again. Dynamic programming: records sub-problem solutions for further re-use; useful when a complex problem can be described recursively; examples: Dijkstra's shortest path, minimum edit distance, longest common subsequence, the Viterbi algorithm.

Viterbi Algorithm 20 Recall our problem: maximise
P(s_1 ... s_n, o_1 ... o_n) = P(s_1 | s_0) P(o_1 | s_1) P(s_2 | s_1) P(o_2 | s_2) ...
Our recursive sub-problem:
v_i(x) = max_{k=1..L} [ v_{i-1}(k) P(x | k) P(o_i | x) ]
The variable v_i(x) represents the maximum probability that the i-th state is x, given that we have seen the observations o_1 ... o_i. At each step, we record backpointers showing which previous state led to the maximum probability.

An Example of the Viterbi Algorithm 21 Observations: o_1 o_2 o_3 = 3 1 3
v_1(H) = P(H | ⟨S⟩) P(3 | H) = 0.8 × 0.4 = 0.32
v_1(C) = P(C | ⟨S⟩) P(3 | C) = 0.2 × 0.1 = 0.02
v_2(H) = max(v_1(H) P(H | H) P(1 | H), v_1(C) P(H | C) P(1 | H)) = max(.32 × .12, .02 × .06) = .0384
v_2(C) = max(v_1(H) P(C | H) P(1 | C), v_1(C) P(C | C) P(1 | C)) = max(.32 × .1, .02 × .25) = .032
v_3(H) = max(v_2(H) P(H | H) P(3 | H), v_2(C) P(H | C) P(3 | H)) = max(.0384 × .24, .032 × .12) = .009216
v_3(C) = max(v_2(H) P(C | H) P(3 | C), v_2(C) P(C | C) P(3 | C)) = max(.0384 × .02, .032 × .05) = .0016
v_f(⟨/S⟩) = max(v_3(H) P(⟨/S⟩ | H), v_3(C) P(⟨/S⟩ | C)) = max(.009216 × .2, .0016 × .2) = .0018432
Following the backpointers gives the best path: H H H

Pseudocode for the Viterbi Algorithm 22
Input: observations of length N, state set of size L
Output: best-path

create a path probability matrix viterbi[N, L+2]
create a path backpointer matrix backpointer[N, L+2]
for each state s from 1 to L do
    viterbi[1, s] ← trans(⟨S⟩, s) × emit(o_1, s)
    backpointer[1, s] ← 0
end
for each time step i from 2 to N do
    for each state s from 1 to L do
        viterbi[i, s] ← max_{s'=1..L} viterbi[i−1, s'] × trans(s', s) × emit(o_i, s)
        backpointer[i, s] ← argmax_{s'=1..L} viterbi[i−1, s'] × trans(s', s)
    end
end
viterbi[N, L+1] ← max_{s=1..L} viterbi[N, s] × trans(s, ⟨/S⟩)
backpointer[N, L+1] ← argmax_{s=1..L} viterbi[N, s] × trans(s, ⟨/S⟩)
return the path by following backpointers from backpointer[N, L+1]
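A compact Python rendering of this pseudocode, a sketch under the same nested-dictionary model representation used in the earlier ice-cream sketches (names are illustrative):

def viterbi(observations, states, trans, emit, start="<S>", end="</S>"):
    """Most probable state sequence and its joint probability P(S, O)."""
    # v[i][s]: probability of the best path that ends in state s at step i
    v = [{s: trans[start][s] * emit[s][observations[0]] for s in states}]
    backpointer = [{s: None for s in states}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda k: v[-1][k] * trans[k][s])
            scores[s] = v[-1][best_prev] * trans[best_prev][s] * emit[s][obs]
            pointers[s] = best_prev
        v.append(scores)
        backpointer.append(pointers)
    # Termination: transition into the end pseudo-state
    last = max(states, key=lambda k: v[-1][k] * trans[k][end])
    best_prob = v[-1][last] * trans[last][end]
    # Follow the backpointers to recover the best path
    path = [last]
    for pointers in reversed(backpointer[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_prob

# With the ice-cream model: viterbi([3, 1, 3], ["H", "C"], TRANS, EMIT)
# returns (['H', 'H', 'H'], 0.0018432) up to floating-point rounding.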

Diversion: Complexity and O(N) 23 Big-O notation describes the complexity of an algorithm: it describes the worst-case order of growth in terms of the size of the input; only the largest-order term is represented; constant factors are ignored; it is determined by looking at the loops in the code.

Pseudocode for the Viterbi Algorithm 24 (The same pseudocode as above, now annotated with its complexity.) The initialisation loop over states is O(L); the main loops run over N time steps and L states, with an inner max/argmax over the L previous states, giving O(L²N); the termination step is another O(L) loop, and following the backpointers is O(N). Overall complexity: O(L²N).

Using HMMs 25 The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks: P(S, O), given S and O; P(O), given O; the S that maximises P(S | O), given O.

Computing Likelihoods 26 Task: Given an observation sequence O, determine the likelihood P(O) according to the HMM. Compute the sum over all possible state sequences:
P(O) = Σ_S P(O, S)
For example, for the ice cream sequence 3 1 3:
P(3 1 3) = P(3 1 3, cold cold cold) + P(3 1 3, cold cold hot) + ... + P(3 1 3, hot hot cold) + ...
Computed naively, this is O(L^N × N).
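As a sanity check, the brute-force sketch from the decoding slide gives the same number by summing over all enumerated sequences (assuming enumerate_sequences and the TRANS/EMIT dictionaries are still in scope):

likelihood = sum(enumerate_sequences([3, 1, 3], ["H", "C"], TRANS, EMIT).values())
# roughly 0.0033172, matching the forward-algorithm result below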

The Forward Algorithm 27 Again, we use dynamic programming, storing and reusing the results of partial computations in a trellis α. Each cell in the trellis stores the probability of being in state x after seeing the first i observations:
α_i(x) = P(o_1 ... o_i, s_i = x) = Σ_{k=1..L} α_{i-1}(k) P(x | k) P(o_i | x)
Note the sum, instead of the max in Viterbi.

An Example of the Forward Algorithm 28 Observations: o_1 o_2 o_3 = 3 1 3
α_1(H) = P(H | ⟨S⟩) P(3 | H) = 0.8 × 0.4 = 0.32
α_1(C) = P(C | ⟨S⟩) P(3 | C) = 0.2 × 0.1 = 0.02
α_2(H) = α_1(H) P(H | H) P(1 | H) + α_1(C) P(H | C) P(1 | H) = .32 × .12 + .02 × .06 = .0396
α_2(C) = α_1(H) P(C | H) P(1 | C) + α_1(C) P(C | C) P(1 | C) = .32 × .1 + .02 × .25 = .037
α_3(H) = α_2(H) P(H | H) P(3 | H) + α_2(C) P(H | C) P(3 | H) = .0396 × .24 + .037 × .12 = .013944
α_3(C) = α_2(H) P(C | H) P(3 | C) + α_2(C) P(C | C) P(3 | C) = .0396 × .02 + .037 × .05 = .002642
α_f(⟨/S⟩) = α_3(H) P(⟨/S⟩ | H) + α_3(C) P(⟨/S⟩ | C) = .013944 × .2 + .002642 × .2 = .0033172
P(3 1 3) = 0.0033172

Pseudocode for the Forward Algorithm 29
Input: observations of length N, state set of size L
Output: forward-probability

create a probability matrix forward[N, L+2]
for each state s from 1 to L do
    forward[1, s] ← trans(⟨S⟩, s) × emit(o_1, s)
end
for each time step i from 2 to N do
    for each state s from 1 to L do
        forward[i, s] ← Σ_{s'=1..L} forward[i−1, s'] × trans(s', s) × emit(o_i, s)
    end
end
forward[N, L+1] ← Σ_{s=1..L} forward[N, s] × trans(s, ⟨/S⟩)
return forward[N, L+1]
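And a corresponding Python sketch of the forward algorithm, under the same assumed model representation as the earlier sketches:

def forward(observations, states, trans, emit, start="<S>", end="</S>"):
    """Likelihood P(O) of the observation sequence under the HMM."""
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = {s: trans[start][s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        prev = alpha
        alpha = {s: sum(prev[k] * trans[k][s] for k in states) * emit[s][obs]
                 for s in states}
    # Termination: sum over transitions into the end pseudo-state
    return sum(alpha[s] * trans[s][end] for s in states)

# forward([3, 1, 3], ["H", "C"], TRANS, EMIT) is roughly 0.0033172,
# matching the worked example above.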

Tagger Evaluation 30 To evaluate a part-of-speech tagger (or any classification system) we: train on a labelled training set; test on a separate test set. For a POS tagger, the standard evaluation metric is tag accuracy:
Acc = (number of correct tags) / (number of words)
The other metric sometimes used is the error rate:
error rate = 1 − Acc
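As a small illustration, tag accuracy in Python (a hypothetical helper, not from the lecture):

def tag_accuracy(gold_tags, predicted_tags):
    """Fraction of words whose predicted tag matches the gold tag."""
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

# error rate = 1 - tag_accuracy(gold_tags, predicted_tags)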

Summary 31 Part-of-speech tagging as an example of sequence labelling. Hidden Markov Models to model the observation and hidden state sequences. Learn the parameters of an HMM (i.e. transition and emission probabilities) using MLE. Use Viterbi for decoding, i.e. finding the S that maximises P(S | O), given O. Use the Forward algorithm for computing the likelihood, i.e. P(O), given O.