Part of Speech Tagging: Viterbi, Forward, Backward, Forward-Backward, Baum-Welch. COMP-599, Oct 1, 2015


Announcements
Research skills workshop today, 3pm-4:30pm, Schulich Library room 313.
Start thinking about your project proposal, due Oct 22.

Exercise in Supervised Training
What are the MLE parameter estimates for the following training corpus?
DT NN VBD IN DT NN
the cat sat on the mat
DT NN VBD JJ
the cat was sad
RB VBD DT NN
so was the mat
DT JJ NN VBD IN DT JJ NN
the sad cat was on the sad mat

Outline
Hidden Markov models: review of last class
Things to do with HMMs:
Forward algorithm
Backward algorithm
Viterbi algorithm
Baum-Welch as Expectation Maximization

Parts of Speech in English
Nouns: restaurant, me, dinner
Verbs: find, eat, is
Adjectives: good, vegetarian
Prepositions: in, of, up, above
Adverbs: quickly, well, very
Determiners: the, a, an

Sequence Labelling
Predict labels for an entire sequence of inputs:
?      ?      ? ?  ?     ?   ? ?    ?    ?   ?
Pierre Vinken , 61 years old , will join the board
NNP    NNP    , CD NNS   JJ  , MD   VB   DT  NN
Pierre Vinken , 61 years old , will join the board
Must consider: the current word and the previous context.

Decomposing the Joint Probability
The graph specifies how the joint probability decomposes:
[Graphical model: hidden states Q_1 ... Q_5 in a chain, each Q_t emitting an observation O_t]
P(O, Q) = P(Q_1) ∏_{t=1}^{T-1} P(Q_{t+1} | Q_t) ∏_{t=1}^{T} P(O_t | Q_t)
where P(Q_1) is the initial state probability, P(Q_{t+1} | Q_t) are the state transition probabilities, and P(O_t | Q_t) are the emission probabilities.
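To make the factorization concrete, here is a minimal Python sketch (not from the lecture) that evaluates P(O, Q) for a fully observed, tagged sentence; the toy parameters pi, A, B and the helper name joint_prob are invented for illustration.

```python
def joint_prob(pi, A, B, tags, words):
    """P(O, Q) for a fully observed (tagged) sentence under an HMM."""
    p = pi[tags[0]] * B[tags[0]].get(words[0], 0.0)    # initial state and first emission
    for t in range(1, len(words)):
        p *= A[tags[t - 1]].get(tags[t], 0.0)          # transition Q_{t-1} -> Q_t
        p *= B[tags[t]].get(words[t], 0.0)             # emission O_t from state Q_t
    return p

# Hypothetical toy parameters, just to show the call:
pi = {"DT": 0.7, "NN": 0.2, "VBD": 0.1}
A = {"DT": {"NN": 0.9, "VBD": 0.1},
     "NN": {"VBD": 0.6, "NN": 0.3, "DT": 0.1},
     "VBD": {"DT": 0.5, "NN": 0.3, "VBD": 0.2}}
B = {"DT": {"the": 1.0},
     "NN": {"cat": 0.5, "mat": 0.5},
     "VBD": {"sat": 1.0}}
print(joint_prob(pi, A, B, ["DT", "NN", "VBD"], ["the", "cat", "sat"]))  # 0.189
```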

Model Parameters
Let there be N possible tags and W possible words. The parameter set θ has three components:
1. Initial probabilities for Q_1: Π = {π_1, π_2, ..., π_N} (categorical)
2. Transition probabilities for Q_t to Q_{t+1}: A = {a_ij : i, j ∈ [1, N]} (categorical)
3. Emission probabilities for Q_t to O_t: B = {b_i(w_k) : i ∈ [1, N], k ∈ [1, W]} (categorical)
How many distributions and values of each type are there?

Model Parameters: MLE
Recall the MLE for a categorical distribution: P(outcome i) = #(outcome i) / #(all events)
For our parameters:
π_i = P(Q_1 = i) = #(Q_1 = i) / #(sentences)
a_ij = P(Q_{t+1} = j | Q_t = i) = #(i, j) / #(i)
b_i(k) = P(O_t = k | Q_t = i) = #(word k, tag i) / #(i)
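As a sketch of how these counts could be collected in code, the following Python snippet estimates π, A, and B from the toy tagged corpus in the earlier exercise. The variable names are mine; note that the transition denominator here normalizes over observed outgoing transitions rather than the raw #(i), which differs only for sentence-final tags.

```python
from collections import Counter, defaultdict

# Tagged toy corpus from the earlier exercise slide: lists of (word, tag) pairs.
corpus = [
    [("the","DT"),("cat","NN"),("sat","VBD"),("on","IN"),("the","DT"),("mat","NN")],
    [("the","DT"),("cat","NN"),("was","VBD"),("sad","JJ")],
    [("so","RB"),("was","VBD"),("the","DT"),("mat","NN")],
    [("the","DT"),("sad","JJ"),("cat","NN"),("was","VBD"),("on","IN"),
     ("the","DT"),("sad","JJ"),("mat","NN")],
]

init_counts = Counter()              # #(Q_1 = i)
trans_counts = defaultdict(Counter)  # #(i, j)
emit_counts = defaultdict(Counter)   # #(word k, tag i)
tag_counts = Counter()               # #(i)

for sent in corpus:
    tags = [t for _, t in sent]
    init_counts[tags[0]] += 1
    for (w, t) in sent:
        emit_counts[t][w] += 1
        tag_counts[t] += 1
    for prev, nxt in zip(tags, tags[1:]):
        trans_counts[prev][nxt] += 1

pi = {t: c / len(corpus) for t, c in init_counts.items()}
A = {i: {j: c / sum(row.values()) for j, c in row.items()}
     for i, row in trans_counts.items()}
B = {i: {w: c / tag_counts[i] for w, c in row.items()}
     for i, row in emit_counts.items()}
print(pi["DT"], A["DT"]["NN"], B["NN"]["cat"])   # 0.75, 0.666..., 0.5
```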

Questions for an HMM
1. Compute the likelihood of a sequence of observations, P(O | θ): forward algorithm, backward algorithm
2. What state sequence best explains a sequence of observations? argmax_Q P(Q, O | θ): Viterbi algorithm
3. Given an observation sequence (without labels), what is the best model for it? Forward-backward algorithm, Baum-Welch algorithm, Expectation Maximization

Q1: Compute Likelihood
Marginalize over all possible state sequences: P(O | θ) = Σ_Q P(O, Q | θ)
Problem: exponentially many paths (N^T)
Solution: the forward algorithm
Dynamic programming to avoid unnecessary recalculations
Create a table over the possible state sequences, annotated with probabilities

Forward Algorithm
[Trellis of possible state sequences: rows are states (VB, NN, DT, JJ, CD), columns are time steps O_1 ... O_5; cell (i, t) holds α_i(t)]
α_i(t) is P(O_{1:t}, Q_t = i | θ): the probability of the current tag and the words up to now.

First Column
α_i(t) is P(O_{1:t}, Q_t = i | θ)
Consider the first column of the trellis:
α_j(1) = π_j b_j(O_1)
(the initial state probability times the first emission)

Middle Cells
α_i(t) is P(O_{1:t}, Q_t = i | θ)
Consider a cell in a middle column of the trellis:
α_j(t) = [Σ_{i=1}^{N} α_i(t-1) a_ij] b_j(O_t)
(all possible ways to get to the current state, times the emission from the current state)

After Last Column
α_i(t) is P(O_{1:t}, Q_t = i | θ)
Sum over the last column of the trellis for the overall likelihood:
P(O | θ) = Σ_{j=1}^{N} α_j(T)

Forward Algorithm Summary
Create a trellis α_i(t) for i = 1..N, t = 1..T
α_j(1) = π_j b_j(O_1) for j = 1..N
for t = 2..T:
  for j = 1..N:
    α_j(t) = [Σ_{i=1}^{N} α_i(t-1) a_ij] b_j(O_t)
P(O | θ) = Σ_{j=1}^{N} α_j(T)
Runtime: O(N²T)
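A minimal NumPy sketch of this recursion, assuming states and words have been mapped to integer indices (the function name and array layout are my own, not the lecture's):

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(O_1..O_t, Q_t = i | theta).
    pi: (N,), A: (N, N) with A[i, j] = a_ij, B: (N, W), obs: list of symbol indices."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # alpha_j(1) = pi_j b_j(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # [sum_i alpha_i(t-1) a_ij] b_j(O_t)
    return alpha, alpha[-1].sum()                      # P(O | theta) = sum_j alpha_j(T)
```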

Backward Algorithm
Nothing stops us from going backwards too! This is not just for fun, as we'll see later.
Define a new trellis with cells β_i(t):
β_i(t) = P(O_{t+1:T} | Q_t = i, θ)
Probability of all subsequent words, given that the current tag is i.
Note that unlike α_i(t), it excludes the current word.

Backward Algorithm Summary
Create a trellis β_i(t) for i = 1..N, t = 1..T
β_i(T) = 1 for i = 1..N
for t = T-1..1:
  for i = 1..N:
    β_i(t) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_j(t+1)
P(O | θ) = Σ_{i=1}^{N} π_i b_i(O_1) β_i(1)
Runtime: O(N²T)
Remember that β_j(t+1) does not include b_j(O_{t+1}), so we need to add this factor in!
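A matching backward sketch under the same conventions as the forward sketch above (N states, B indexed as B[state, symbol]); again, this is an illustrative implementation rather than the lecture's code:

```python
import numpy as np

def backward(pi, A, B, obs):
    """beta[t, i] = P(O_{t+1}..O_T | Q_t = i, theta)."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                       # beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])   # sum_j a_ij b_j(O_{t+1}) beta_j(t+1)
    likelihood = np.sum(pi * B[:, obs[0]] * beta[0])     # P(O | theta) from the first column
    return beta, likelihood
```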

Forward and Backward
α_i(t) = P(O_{1:t}, Q_t = i | θ)
β_i(t) = P(O_{t+1:T} | Q_t = i, θ)
That means α_i(t) β_i(t) = P(O, Q_t = i | θ): the probability of the entire sequence of observations, with state i at timestep t.
Thus, P(O | θ) = Σ_{i=1}^{N} α_i(t) β_i(t) for any t = 1..T

Working in the Log Domain
Practical note: we need to avoid underflow, so we work in the log domain.
Products become sums: log ∏_i p_i = Σ_i log p_i = Σ_i a_i, where a_i = log p_i
Log-sum trick for sums: log(Σ_i p_i) = log(Σ_i exp(a_i))
Let b = max_i a_i. Then
log(Σ_i exp(a_i)) = log(exp(b) Σ_i exp(a_i − b)) = b + log Σ_i exp(a_i − b)
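A small Python sketch of the log-sum trick as derived above (SciPy provides an equivalent scipy.special.logsumexp, but writing it out mirrors the derivation):

```python
import numpy as np

def logsumexp(a):
    """Stable log(sum_i exp(a_i)): shift by b = max_i a_i before exponentiating."""
    a = np.asarray(a, dtype=float)
    b = np.max(a)
    return b + np.log(np.sum(np.exp(a - b)))

# Log-domain forward update using the trick (sketch):
#   log_alpha[t, j] = logsumexp(log_alpha[t - 1] + log_A[:, j]) + log_B[j, obs[t]]
```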

Q2: Sequence Labelling
Find the most likely state sequence for a sample: Q* = argmax_Q P(Q, O | θ)
Intuition: use the forward algorithm, but replace the summation with a max. This is now called the Viterbi algorithm.
Trellis cells: δ_i(t) = max_{Q_{1:t-1}} P(Q_{1:t-1}, O_{1:t}, Q_t = i | θ)

Viterbi Algorithm Summary
Create a trellis δ_i(t) for i = 1..N, t = 1..T
δ_j(1) = π_j b_j(O_1) for j = 1..N
for t = 2..T:
  for j = 1..N:
    δ_j(t) = [max_i δ_i(t-1) a_ij] b_j(O_t)
Take max_i δ_i(T)
Runtime: O(N²T)

Backpointers
[Trellis figure: rows are states (VB, NN, DT, JJ, CD), columns are time steps O_1 ... O_5; cell (i, t) holds δ_i(t)]
Keep track of where the max entry to each cell came from.
Work backwards to recover the best label sequence.
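A NumPy sketch of Viterbi with backpointers, following the same conventions as the forward and backward sketches above (illustrative, not the lecture's code):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence argmax_Q P(Q, O | theta), recovered via backpointers."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                    # delta_j(1) = pi_j b_j(O_1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta_i(t-1) a_ij
        back[t] = scores.argmax(axis=0)             # best predecessor i for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]                # start from max_i delta_i(T)
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))         # follow backpointers
    return list(reversed(path)), delta[-1].max()
```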

Exercise
3 states (X, Y, Z), 2 possible emissions (!, @)
π = [0.2, 0.5, 0.3]
A = [0.5 0.4 0.1
     0.2 0.3 0.5
     0.1 0.1 0.8]
B = [0.1 0.9
     0.5 0.5
     0.7 0.3]
Run the forward, backward, and Viterbi algorithms on the emission sequence "!@@"
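Assuming the row/column ordering reconstructed above (rows ordered X, Y, Z; emission columns ordered !, @), the exercise can be checked numerically with the sketches from the previous slides:

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])          # states ordered X, Y, Z
A = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.3, 0.5],
              [0.1, 0.1, 0.8]])
B = np.array([[0.1, 0.9],
              [0.5, 0.5],
              [0.7, 0.3]])              # columns ordered '!', '@'
obs = [0, 1, 1]                         # the emission sequence "!@@"

alpha, p_fwd = forward(pi, A, B, obs)
beta, p_bwd = backward(pi, A, B, obs)
path, best = viterbi(pi, A, B, obs)
print(p_fwd, p_bwd)                     # both equal P(O | theta)
print((alpha * beta).sum(axis=1))       # same value at every t (the forward-backward identity)
print([["X", "Y", "Z"][i] for i in path])   # most likely state sequence
```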

Q3: Unsupervised Training
Problem: no labelled state sequences from which to compute the counts above
Solution: guess the state sequences
Initialize parameters randomly
Repeat for a while:
1. Predict the current state sequences using the current model
2. Update the current parameters based on the current predictions

"Hard EM" or Viterbi EM Initialize parameters randomly Repeat for a while: 1. Predict the current state sequences using the current model with the Viterbi algorithm 2. Update the current parameters using the current predictions as in the supervised learning case Can also use "soft" predictions; i.e., the probabilities of all the possible state sequences 26

Baum-Welch Algorithm
a.k.a. the forward-backward algorithm: the EM (Expectation Maximization) algorithm applied to HMMs.
Expectation: get expected counts for the hidden structures using the current θ^k.
Maximization: find θ^{k+1} to maximize the likelihood of the training data given the expected counts from the E-step.

Responsibilities Needed
Supervised / hard EM: a single tag. Baum-Welch: a distribution over tags.
We call such probabilities responsibilities:
γ_i(t) = P(Q_t = i | O, θ^k): probability of being in state i at time t, given the observed sequence under the current model.
ξ_ij(t) = P(Q_t = i, Q_{t+1} = j | O, θ^k): probability of transitioning from i at time t to j at time t+1.

E-Step: γ
γ_i(t) = P(Q_t = i | O, θ^k)
       = P(Q_t = i, O | θ^k) / P(O | θ^k)
       = α_i(t) β_i(t) / P(O | θ^k)

E-Step: ξ
ξ_ij(t) = P(Q_t = i, Q_{t+1} = j | O, θ^k)
        = P(Q_t = i, Q_{t+1} = j, O | θ^k) / P(O | θ^k)
        = α_i(t) a_ij b_j(O_{t+1}) β_j(t+1) / P(O | θ^k)
(the beginning of the sequence up to state i at time t; the transition from i to j; emitting O_{t+1}; the rest of the sequence)

M-Step "Soft" versions of the MLE updates: π i k+1 = γ i (1) a k+1 ij = t=1 T 1 ξ ij (t) T 1 γ i (t) t=1 T b k+1 i w k = t=1 γ i t Ot =w k γ i (t) T t=1 Compare: # i, j #(i) # word k, tag i #(i) With multiple sentences, sum up expected counts over all sentences 31

When to Stop EM?
Multiple options:
When the training set likelihood stops improving
When prediction performance on a held-out development or validation set stops improving

Why Does EM Work?
EM finds a local optimum of P(O | θ). You can show that after each step of EM:
P(O | θ^{k+1}) ≥ P(O | θ^k)
However, this is not necessarily a global optimum.

Proof of Baum-Welch Correctness
Part 1: Show that in our procedure, one iteration corresponds to this update:
θ^{k+1} = argmax_θ Σ_Q P(Q | O, θ^k) log P(O, Q | θ)
http://www.cs.berkeley.edu/~stephentu/writeups/hmm-baum-welch-derivation.pdf
Part 2: Show that improving the quantity Σ_Q P(Q | O, θ^k) log P(O, Q | θ) corresponds to improving P(O | θ)
https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm#Proof_of_correctness

Dealing with Local Optima
Random restarts: train multiple models with different random initializations, then use model selection on a development set to pick the best one.
Biased initialization: bias your initialization using some external source of knowledge (e.g., external corpus counts, a clustering procedure, or expert knowledge about the domain). Further training will hopefully improve results.

Caveats
Baum-Welch with no labelled data generally gives poor results, at least for linguistic structure (~40% accuracy, according to Johnson (2007)).
Semi-supervised learning: combine small amounts of labelled data with a larger corpus of unlabelled data.

In Practice
Per-token (i.e., per-word) accuracy results on the WSJ corpus:
Most frequent tag baseline: ~90-94%
HMMs (Brants, 2000): 96.5%
Stanford tagger (Manning, 2011): 97.32%

Other Sequence Modelling Tasks
Chunking (a.k.a. shallow parsing): find syntactic chunks in the sentence; not hierarchical.
[NP The chicken] [V crossed] [NP the road] [P across] [NP the lake].
Named-Entity Recognition (NER): identify elements in text that correspond to some high-level categories (e.g., PERSON, ORGANIZATION, LOCATION).
[ORG McGill University] is located in [LOC Montreal, Canada].
Problem: need to detect spans of multiple words.

First Attempt
Simply label words by their category:
ORG     ORG   ORG   -   ORG        -   -       -  LOC
McGill, UQAM, UdeM, and Concordia are located in Montreal.
What is the problem with this scheme?

IOB Tagging
Label whether a word is inside (I), outside (O), or at the beginning (B) of a span as well. For n categories, 2n+1 total tags.
B-ORG  I-ORG      O  O       O  B-LOC     I-LOC
McGill University is located in Montreal, Canada
B-ORG   B-ORG B-ORG O   B-ORG     O   O       O  B-LOC
McGill, UQAM, UdeM, and Concordia are located in Montreal.
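A small illustrative helper (not from the lecture) that recovers category spans from an IOB tag sequence, using the dashed B-/I-/O notation shown above:

```python
def iob_to_spans(words, tags):
    """Collect (category, phrase) spans from a B-/I-/O tag sequence."""
    spans, current = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current: spans.append(current)
            current = (tag[2:], [word])                          # start a new span
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(word)                              # continue the current span
        else:                                                    # O tag (or inconsistent I-)
            if current: spans.append(current)
            current = None
    if current: spans.append(current)
    return [(cat, " ".join(ws)) for cat, ws in spans]

words = ["McGill", "University", "is", "located", "in", "Montreal,", "Canada"]
tags  = ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC", "I-LOC"]
print(iob_to_spans(words, tags))  # [('ORG', 'McGill University'), ('LOC', 'Montreal, Canada')]
```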