Sequence Labeling: HMMs & Structured Perceptron

Sequence Labeling: HMMs & Structured Perceptron CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu

HMM: Formal Specification
Q: a finite set of N states, Q = {q_0, q_1, q_2, q_3, ...}
Transition probability matrix (N × N): A = [a_ij], where a_ij = P(q_j | q_i) and Σ_j a_ij = 1 for each i
Sequence of observations O = o_1, o_2, ..., o_T, each drawn from a given set of symbols (vocabulary V)
Emission probability matrix (N × |V|): B = [b_it], where b_it = b_i(o_t) = P(o_t | q_i) and Σ_t b_it = 1 for each i
Start and end states: an explicit start state q_0, or alternatively a prior distribution over start states {π_1, π_2, π_3, ...} with Σ_i π_i = 1, plus a set of final states q_F
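
As a concrete data-structure view of this specification, here is a minimal Python/NumPy sketch; the state names, vocabulary, and all numbers are illustrative assumptions, not the parameters of the stock-market example used on the later slides.

import numpy as np

# Toy HMM specification (all names and numbers are made up for illustration).
states = ["s1", "s2", "s3"]            # Q: the hidden states
vocab = ["a", "b", "c"]                # V: the observation symbols

pi = np.array([0.5, 0.2, 0.3])         # prior over start states, sums to 1
A = np.array([[0.6, 0.3, 0.1],         # A[i, j] = P(q_j | q_i); each row sums to 1
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
B = np.array([[0.7, 0.2, 0.1],         # B[i, k] = P(o = vocab[k] | q_i); each row sums to 1
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])

assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1) and np.isclose(pi.sum(), 1)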

HMMs: Three Problems
Likelihood: given an HMM λ = (A, B, π) and a sequence of observed events O, find P(O | λ)
Decoding: given an HMM λ = (A, B, π) and an observation sequence O, find the most likely (hidden) state sequence
Learning: given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

Computing Likelihood
λ_stock: an HMM over stock-market states with priors π_1 = 0.5, π_2 = 0.2, π_3 = 0.3 (the full diagram with transition and emission probabilities appears on the slide)
O: a sequence of six observed market movements, t = 1 ... 6
Assuming λ_stock models the stock market, how likely are we to observe this sequence of outputs?

Computing Likelihood
First try: sum over all possible ways in which we could generate O from λ
What's the problem? This takes O(N^T) time to compute!
Right idea, wrong algorithm!

Forward Algorithm
Use an N × T trellis or chart [α_tj]
Forward probabilities: α_tj or α_t(j) = P(being in state j after seeing t observations) = P(o_1, o_2, ..., o_t, q_t = j)
Each cell = extensions of all paths from other cells:
α_t(j) = Σ_i α_{t-1}(i) a_ij b_j(o_t)
  α_{t-1}(i): forward path probability until (t-1)
  a_ij: transition probability of going from state i to j
  b_j(o_t): probability of emitting symbol o_t in state j
P(O | λ) = Σ_i α_T(i)
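
A minimal NumPy sketch of this forward recursion (the function name and array layout are my own; A, B, pi follow the toy specification sketched earlier, and obs is a list of symbol indices):

import numpy as np

def forward(A, B, pi, obs):
    """Return P(O | lambda) and the full alpha trellis (T x N)."""
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # initialization: alpha_1(j) = pi_j * b_j(o_1)
    for t in range(1, T):
        # recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum(), alpha                    # termination: P(O | lambda) = sum_i alpha_T(i)

# Example call with the toy parameters above: forward(A, B, pi, [0, 2, 1])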

Forward Algorithm: Formal Definition
Initialization: α_1(j) = π_j b_j(o_1)
Recursion: α_t(j) = Σ_i α_{t-1}(i) a_ij b_j(o_t)
Termination: P(O | λ) = Σ_i α_T(i)

Forward Algorithm
Given the observation sequence O from the previous slide, find P(O | λ_stock)

Forward Algorithm
The trellis: states Static, Bear, Bull (rows) against time steps t=1, t=2, t=3 (columns)

Forward Algorithm: Initialization (column t=1 of the trellis)
α_1(Static) = 0.3 × 0.3 = 0.09
α_1(Bear) = 0.5 × 0.1 = 0.05
α_1(Bull) = 0.2 × 0.7 = 0.14

Forward Algorithm: Recursion
One term of the sum for the Bull cell at t=2: α_1(Bull) × a_BullBull × b_Bull(o_2) = 0.14 × 0.6 × 0.1 = 0.0084
Summing over all previous states gives α_2(Bull) = 0.0145 ... and so on for the other cells

Forward Algorithm: Recursion
Work through the rest of these numbers: column t=1 is filled in (0.09, 0.05, 0.14), α_2(Bull) = 0.0145, and the remaining cells are left as an exercise

HMMs: Three Problems
Likelihood: given an HMM λ = (A, B, π) and a sequence of observed events O, find P(O | λ)
Decoding: given an HMM λ = (A, B, π) and an observation sequence O, find the most likely (hidden) state sequence
Learning: given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

Decoding
λ_stock with priors π_1 = 0.5, π_2 = 0.2, π_3 = 0.3, and the same sequence of six observations O (t = 1 ... 6)
Given λ_stock as our model and O as our observations, what are the most likely states the market went through to produce O?

Viterbi Algorithm
Use an N × T trellis [v_tj], just like in the forward algorithm
v_tj or v_t(j) = P(in state j after seeing t observations and passing through the most likely state sequence so far) = P(q_1, q_2, ..., q_{t-1}, q_t = j, o_1, o_2, ..., o_t)
Each cell = extension of the most likely path from other cells:
v_t(j) = max_i v_{t-1}(i) a_ij b_j(o_t)
  v_{t-1}(i): Viterbi probability until (t-1)
  a_ij: transition probability of going from state i to j
  b_j(o_t): probability of emitting symbol o_t in state j
P* = max_i v_T(i)

Viterbi vs. Forward
Almost the same algorithm! Key difference: maximize instead of summing over previous paths
We also need to store the most likely path (transition): use backpointers to keep track of the most likely transition
At the end, follow the chain of backpointers to recover the most likely state sequence

Viterbi Algorithm: Formal Definition
Initialization: v_1(j) = π_j b_j(o_1)
Recursion: v_t(j) = max_i v_{t-1}(i) a_ij b_j(o_t), with backpointer bt_t(j) = argmax_i v_{t-1}(i) a_ij
Termination: P* = max_i v_T(i), best final state = argmax_i v_T(i)
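
Here is a corresponding NumPy sketch of Viterbi with backpointers (again a hypothetical implementation, using the same array layout as the forward sketch above):

import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most likely state sequence and its probability."""
    T, N = len(obs), A.shape[0]
    v = np.zeros((T, N))
    bp = np.zeros((T, N), dtype=int)                 # backpointers
    v[0] = pi * B[:, obs[0]]                         # initialization, same as forward
    for t in range(1, T):
        # scores[i, j] = v_{t-1}(i) * a_ij; take the max over i instead of the sum
        scores = v[t - 1][:, None] * A
        bp[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) * B[:, obs[t]]
    # follow the chain of backpointers from the best final state
    best = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        best.append(int(bp[t, best[-1]]))
    return list(reversed(best)), v[-1].max()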

Viterbi Algorithm
Given the observation sequence O, find the most likely state sequence given λ_stock

Viterbi Algorithm
The trellis: states Static, Bear, Bull (rows) against time steps t=1, t=2, t=3 (columns)

Viterbi Algorithm: Initialization (column t=1, identical to the forward initialization)
v_1(Static) = 0.3 × 0.3 = 0.09
v_1(Bear) = 0.5 × 0.1 = 0.05
v_1(Bull) = 0.2 × 0.7 = 0.14

Viterbi Algorithm: Recursion
One candidate for the Bull cell at t=2: v_1(Bull) × a_BullBull × b_Bull(o_2) = 0.14 × 0.6 × 0.1 = 0.0084
Taking the max over previous states gives v_2(Bull) = 0.0084 (contrast with the forward sum, 0.0145)

Viterbi Algorithm: Recursion
Store a backpointer recording which previous state produced the max for v_2(Bull) = 0.0084 ... and so on for the other cells

Viterbi Algorithm: Recursion
Work through the rest of the algorithm: column t=1 is filled in (0.09, 0.05, 0.14), v_2(Bull) = 0.0084, and the remaining cells are left as an exercise

HMMs: Three Problems
Likelihood: given an HMM λ = (A, B, π) and a sequence of observed events O, find P(O | λ)
Decoding: given an HMM λ = (A, B, π) and an observation sequence O, find the most likely (hidden) state sequence
Learning: given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

Learning
HMMs for POS tagging: learning is a supervised task, since a POS-tagged corpus tells us the hidden states!
We can compute Maximum Likelihood Estimates (MLEs) for the various parameters

Supervised Training: Transition Probabilities
Any P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}), from the tagged data
Example: for P(NN | VB), count how many times a noun follows a verb and divide by the total number of times you see a verb

Supervised Training: Emission Probabilities
Any P(w_i | t_i) = C(w_i, t_i) / C(t_i), from the tagged data
For P(bank | NN), count how many times "bank" is tagged as a noun and divide by how many times anything is tagged as a noun

Supervised Training: Priors
Any P(q_1 = t_i) = π_i = C(t_i) / N, from the tagged data
For π_NN, count the number of times NN occurs and divide by the total number of tags (states)
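
The three count-based estimates above can be computed in a few lines. This is a rough sketch (function and variable names are my own, and tagged_sents is assumed to be a list of (word, tag) sequences):

from collections import Counter

def mle_estimates(tagged_sents):
    trans, emit, tag_counts, context = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sents:
        tags = [t for _, t in sent]
        for w, t in sent:
            emit[(t, w)] += 1
            tag_counts[t] += 1
        for prev, curr in zip(tags, tags[1:]):
            trans[(prev, curr)] += 1
            context[prev] += 1
    total_tags = sum(tag_counts.values())
    # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    P_T = {(p, c): n / context[p] for (p, c), n in trans.items()}
    # P(w_i | t_i) = C(w_i, t_i) / C(t_i)
    P_E = {(t, w): n / tag_counts[t] for (t, w), n in emit.items()}
    # prior, as on the slide: pi_t = C(t) / total number of tags in the data
    pi = {t: n / total_tags for t, n in tag_counts.items()}
    return P_T, P_E, pi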

HMMs: Three Problems
Likelihood: given an HMM λ = (A, B, π) and a sequence of observed events O, find P(O | λ)
Decoding: given an HMM λ = (A, B, π) and an observation sequence O, find the most likely (hidden) state sequence
Learning: given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

Approaches to POS tagging
Discriminative classifiers (multiclass classification problem): Logistic Regression / Perceptron, which model context using lots of features
Generative models (structured prediction problem, i.e. sequence labeling): Hidden Markov Models, which model transitions between states/POS tags
Structured perceptron: classification with lots of features over structured models!

Let's restructure HMMs with features
Given a sentence X, predict its part-of-speech sequence Y:
X: Natural language processing ( NLP ) is a field of computer science
Y: JJ NN NN -LRB- NN -RRB- VBZ DT NN IN NN NN

Let's restructure HMMs with features
POS→POS transition probabilities and POS→word emission probabilities:
P(Y) = Π_{i=1}^{I+1} P_T(y_i | y_{i-1}), e.g. P_T(JJ | <s>) × P_T(NN | JJ) × P_T(NN | NN) × ...
P(X | Y) = Π_{i=1}^{I} P_E(x_i | y_i), e.g. P_E(natural | JJ) × P_E(language | NN) × P_E(processing | NN) × ...
Tag sequence: <s> JJ NN NN -LRB- NN -RRB- ... </s>
Word sequence: natural language processing ( nlp ) ...

Restructuring HMMs With Features
Normal HMM: P(X, Y) = Π_{i=1}^{I} P_E(x_i | y_i) × Π_{i=1}^{I+1} P_T(y_i | y_{i-1})
Log likelihood: log P(X, Y) = Σ_{i=1}^{I} log P_E(x_i | y_i) + Σ_{i=1}^{I+1} log P_T(y_i | y_{i-1})
Score: S(X, Y) = Σ_{i=1}^{I} w_{E,y_i,x_i} + Σ_{i=1}^{I+1} w_{T,y_{i-1},y_i}
When w_{E,y_i,x_i} = log P_E(x_i | y_i) and w_{T,y_{i-1},y_i} = log P_T(y_i | y_{i-1}), we have log P(X, Y) = S(X, Y)
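
A small sketch of the score S(X, Y) as a dot product between a weight map and the feature counts φ(X, Y); the dict-based sparse representation is my own choice, not something prescribed by the slides:

def score(w, phi):
    # S(X, Y) = sum_k w_k * phi_k(X, Y); w and phi map feature names to values
    return sum(w.get(name, 0.0) * value for name, value in phi.items())

# If w[("E", y, x)] = log P_E(x | y) and w[("T", y_prev, y)] = log P_T(y | y_prev),
# this score is exactly the HMM log-likelihood log P(X, Y).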

Example
Y_1 = PRP VBD NNP for X = "I visited Nara": φ(X, Y_1) contains
φ_{T,<S>,PRP}(X,Y_1) = 1, φ_{T,PRP,VBD}(X,Y_1) = 1, φ_{T,VBD,NNP}(X,Y_1) = 1, φ_{T,NNP,</S>}(X,Y_1) = 1,
φ_{E,PRP,"I"}(X,Y_1) = 1, φ_{E,VBD,"visited"}(X,Y_1) = 1, φ_{E,NNP,"Nara"}(X,Y_1) = 1,
φ_{CAPS,PRP}(X,Y_1) = 1, φ_{CAPS,NNP}(X,Y_1) = 1, φ_{SUF,VBD,"...ed"}(X,Y_1) = 1
Y_2 = NNP VBD NNP for the same X: φ(X, Y_2) contains
φ_{T,<S>,NNP}(X,Y_2) = 1, φ_{T,NNP,VBD}(X,Y_2) = 1, φ_{T,VBD,NNP}(X,Y_2) = 1, φ_{T,NNP,</S>}(X,Y_2) = 1,
φ_{E,NNP,"I"}(X,Y_2) = 1, φ_{E,VBD,"visited"}(X,Y_2) = 1, φ_{E,NNP,"Nara"}(X,Y_2) = 1,
φ_{CAPS,NNP}(X,Y_2) = 2, φ_{SUF,VBD,"...ed"}(X,Y_2) = 1
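
The feature vectors in this example can be produced by a small feature extractor. Here is a hedged sketch: the feature templates (transitions, emissions, capitalization, the "...ed" suffix) follow the example above, but the function itself and its tuple-based feature names are hypothetical.

from collections import Counter

def create_features(words, tags):
    phi = Counter()
    padded = ["<S>"] + list(tags) + ["</S>"]
    for prev, curr in zip(padded, padded[1:]):
        phi[("T", prev, curr)] += 1                  # transition features
    for word, tag in zip(words, tags):
        phi[("E", tag, word)] += 1                   # emission features
        if word[0].isupper():
            phi[("CAPS", tag)] += 1                  # capitalization feature
        if word.endswith("ed"):
            phi[("SUF", tag, "...ed")] += 1          # suffix feature
    return phi

# create_features("I visited Nara".split(), ["PRP", "VBD", "NNP"]) reproduces phi(X, Y_1) above.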

How to decode?
We must find the POS sequence that satisfies: Ŷ = argmax_Y Σ_i w_i φ_i(X, Y)
Solution: the Viterbi algorithm

HMM Viterbi Algorithm
Forward step: calculate the best path to each node, i.e. the path with the lowest negative log probability
Backtracking step: reproduce the path

HMM Viterbi with Features
Same as with probabilities, but use feature weights. For the first word "I" (trellis edges 0:<S> → 1:NN, 1:JJ, 1:VB, 1:PRP, 1:NNP):
best_score["1 NN"] = w_{T,<S>,NN} + w_{E,NN,I}
best_score["1 JJ"] = w_{T,<S>,JJ} + w_{E,JJ,I}
best_score["1 VB"] = w_{T,<S>,VB} + w_{E,VB,I}
best_score["1 PRP"] = w_{T,<S>,PRP} + w_{E,PRP,I}
best_score["1 NNP"] = w_{T,<S>,NNP} + w_{E,NNP,I}

HMM Viterbi with Features
We can add additional features, e.g. a capitalization feature for "I":
best_score["1 NN"] = w_{T,<S>,NN} + w_{E,NN,I} + w_{CAPS,NN}
best_score["1 JJ"] = w_{T,<S>,JJ} + w_{E,JJ,I} + w_{CAPS,JJ}
best_score["1 VB"] = w_{T,<S>,VB} + w_{E,VB,I} + w_{CAPS,VB}
best_score["1 PRP"] = w_{T,<S>,PRP} + w_{E,PRP,I} + w_{CAPS,PRP}
best_score["1 NNP"] = w_{T,<S>,NNP} + w_{E,NNP,I} + w_{CAPS,NNP}
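
Putting this together, Viterbi decoding over feature weights might look like the following sketch; tag_set, the local feature templates, and the <S>/</S> handling are assumptions consistent with the examples above, not a definitive implementation.

def hmm_viterbi(w, words, tag_set):
    def local_score(prev, tag, word):
        s = w.get(("T", prev, tag), 0.0) + w.get(("E", tag, word), 0.0)
        if word[0].isupper():
            s += w.get(("CAPS", tag), 0.0)           # extra features plug in here
        return s
    best = [{"<S>": 0.0}]                            # best_score for position 0
    back = [{}]
    for i, word in enumerate(words):
        best.append({})
        back.append({})
        for tag in tag_set:
            cand = {p: best[i][p] + local_score(p, tag, word) for p in best[i]}
            prev = max(cand, key=cand.get)
            best[i + 1][tag] = cand[prev]
            back[i + 1][tag] = prev                  # store backpointer
    final = {t: best[-1][t] + w.get(("T", t, "</S>"), 0.0) for t in best[-1]}
    path = [max(final, key=final.get)]
    for i in range(len(words), 1, -1):               # follow backpointers
        path.append(back[i][path[-1]])
    return list(reversed(path))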

Learning: the Structured Perceptron
Recall the perceptron algorithm: if there is a mistake, update w ← w + y φ(x, y)
Update the weights to increase the score of positive examples and decrease the score of negative examples
What are positive/negative examples in the structured perceptron, where y is a whole sequence of labels?

Learning in the Structured Perceptron
Positive example, correct feature vector: φ("I visited Nara", PRP VBD NNP)
Negative example, incorrect feature vector: φ("I visited Nara", NNP VBD NNP)

Problem: there are many incorrect feature vectors!
φ("I visited Nara", NNP VBD NNP), φ("I visited Nara", PRP VBD NN), φ("I visited Nara", PRP VB NNP), ...

Structured Perceptron Update Rule
Use the incorrect answer with the highest score: Ŷ = argmax_Y Σ_i w_i φ_i(X, Y)
The update rule becomes w ← w + φ(X, Y') - φ(X, Ŷ), where Y' is the correct answer
Note: if the highest-scoring answer is correct (Ŷ = Y'), there is no change

Structured Perceptron Algorithm
create map w
for I iterations
  for each labeled pair X, Y_prime in the data
    Y_hat = hmm_viterbi(w, X)
    phi_prime = create_features(X, Y_prime)
    phi_hat = create_features(X, Y_hat)
    w += phi_prime - phi_hat
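
A runnable Python rendering of this pseudocode, reusing the create_features and hmm_viterbi sketches above; the iteration count and the data format are assumptions.

from collections import Counter

def train_structured_perceptron(data, tag_set, iterations=5):
    # data: list of (words, gold_tags) pairs
    w = Counter()                                    # the weight map
    for _ in range(iterations):
        for words, gold in data:
            pred = hmm_viterbi(w, words, tag_set)
            if list(pred) != list(gold):             # update only on mistakes
                for f, v in create_features(words, gold).items():
                    w[f] += v                        # w += phi_prime
                for f, v in create_features(words, pred).items():
                    w[f] -= v                        # w -= phi_hat
    return w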

Recap: POS tagging
An example of a sequence labeling task
Can be addressed with Hidden Markov Models: decoding with the Viterbi algorithm, supervised training with count-based estimates
Or with the Structured Perceptron: decoding with the Viterbi algorithm, supervised training with the perceptron algorithm and structured weight updates
A first example of structured prediction, and a first step toward predicting syntactic structure

Aside: unsupervised training for HMMs?
Baum-Welch algorithm (sketch):
Start with random parameters A, B
Until convergence:
  E-step: determine the probability of the observations being generated by various hidden state sequences (using the forward-backward algorithm)
  M-step: re-estimate A, B using these probabilities
More details here
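
For reference, the backward probabilities needed by the E-step can be computed analogously to the forward pass. This is a sketch of just that piece, using the same array layout as the earlier forward sketch; the full Baum-Welch re-estimation of A and B is omitted.

import numpy as np

def backward(A, B, obs):
    """beta[t, i] = P(o_{t+1}, ..., o_T | q_t = i)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# E-step posteriors: with alpha from the forward sketch,
# gamma[t, i] = alpha[t, i] * beta[t, i] / P(O | lambda)
# gives the probability of being in state i at time t, which is then used to re-estimate A and B.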