Sequential Data Modeling - The Structured Perceptron

Sequential Data Modeling - The Structured Perceptron
Graham Neubig
Nara Institute of Science and Technology (NAIST)

Prediction Problems
Given x, predict y.

Prediction Problems
Given x, predict y:
  A book review ("Oh, man I love this book!" / "This book is so boring...")
    → Is it positive? (yes / no): binary prediction (2 choices)
  A tweet ("On the way to the park!" / "公園に行くなう!")
    → Its language (English / Japanese): multi-class prediction (several choices)
  A sentence ("I read a book")
    → Its parts-of-speech ("N VBD DET NN"): structured prediction (millions of choices);
      sequential prediction is a subset

Simple Prediction: The Perceptron Model

Example we will use:
Given an introductory sentence from Wikipedia, predict whether the article is about a person.

Given: "Gonso was a Sanron sect priest (754-827) in the late Nara and early Heian periods."
Predict: Yes!

Given: "Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture."
Predict: No!

This is binary classification (of course!)

How do We Predict?
"Gonso was a Sanron sect priest ( 754 - 827 ) in the late Nara and early Heian periods."
"Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture."

How do We Predict?
"Gonso was a Sanron sect priest ( 754 - 827 ) in the late Nara and early Heian periods."
  Contains "priest"           → probably a person!
  Contains "(<#>-<#>)"        → probably a person!
"Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture."
  Contains "site"             → probably not a person!
  Contains "Kyoto Prefecture" → probably not a person!

Combining Pieces of Information
Each element that helps us predict is a feature:
  contains "priest", contains "(<#>-<#>)", contains "site", contains "Kyoto Prefecture"
Each feature has a weight, positive if it indicates "yes" and negative if it indicates "no":
  w_contains"priest" = 2          w_contains"(<#>-<#>)" = 1
  w_contains"site" = -3           w_contains"Kyoto Prefecture" = -1
For a new example, sum the weights:
  "Kuya (903-972) was a priest born in Kyoto Prefecture."
  2 + -1 + 1 = 2
If the sum is at least 0: yes, otherwise: no.

Let me Say that in Math!
  y = sign(w · φ(x)) = sign(Σ_{i=1}^{I} w_i φ_i(x))
where:
  x: the input
  φ(x): a vector of feature functions {φ_1(x), φ_2(x), ..., φ_I(x)}
  w: the weight vector {w_1, w_2, ..., w_I}
  y: the prediction, +1 if "yes", -1 if "no"
  (sign(v) is +1 if v >= 0 and -1 otherwise)
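
For concreteness, here is a minimal Python sketch of this prediction rule, with w and φ(x) stored as dictionaries keyed by feature name (the name predict_one follows the pseudocode used later in the tutorial):

    # y = sign(w . phi(x)) over sparse dictionary features.
    def predict_one(w, phi):
        score = sum(w.get(name, 0) * value for name, value in phi.items())
        return 1 if score >= 0 else -1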

Example Feature Functions: Unigram Features
Each feature is equal to the number of times a particular word appears.
  x = "A site , located in Maizuru , Kyoto"
  φ_unigram"A"(x) = 1        φ_unigram"site"(x) = 1     φ_unigram","(x) = 2
  φ_unigram"located"(x) = 1  φ_unigram"in"(x) = 1       φ_unigram"Maizuru"(x) = 1
  φ_unigram"Kyoto"(x) = 1
  φ_unigram"the"(x) = 0      φ_unigram"temple"(x) = 0   (the rest are all 0)
For convenience, we use feature names (φ_unigram"A") instead of feature indexes (φ_1).
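
A sketch of unigram feature extraction in Python, assuming the input sentence is already tokenized so that words can be split on whitespace (the "UNI:" prefix is just an illustrative naming convention):

    from collections import defaultdict

    # phi["UNI:word"] = number of times "word" appears in x.
    def create_features(x):
        phi = defaultdict(int)
        for word in x.split():
            phi["UNI:" + word] += 1
        return phi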

Calculating the Weighted Sum
  x = "A site , located in Maizuru , Kyoto"

  φ_unigram"A"(x) = 1        *   w_unigram"A" = 0          → 0
  φ_unigram"site"(x) = 1     *   w_unigram"site" = -3      → -3
  φ_unigram","(x) = 2        *   w_unigram"," = 0          → 0
  φ_unigram"located"(x) = 1  *   w_unigram"located" = 0    → 0
  φ_unigram"in"(x) = 1       *   w_unigram"in" = 0         → 0
  φ_unigram"Maizuru"(x) = 1  *   w_unigram"Maizuru" = 0    → 0
  φ_unigram"Kyoto"(x) = 1    *   w_unigram"Kyoto" = 0      → 0
  φ_unigram"priest"(x) = 0   *   w_unigram"priest" = 2     → 0
  φ_unigram"black"(x) = 0    *   w_unigram"black" = 0      → 0
                                                  sum      = -3  →  No!
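
The same calculation as a quick self-contained check (only nonzero weights need to be stored; the "UNI:" feature-name prefix is the convention assumed above):

    # Reproducing the weighted sum above.
    w = {"UNI:site": -3, "UNI:priest": 2}
    x = "A site , located in Maizuru , Kyoto"
    phi = {}
    for word in x.split():
        key = "UNI:" + word
        phi[key] = phi.get(key, 0) + 1
    score = sum(w.get(name, 0) * value for name, value in phi.items())
    print(score)                          # -3
    print("Yes" if score >= 0 else "No")  # No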

Learning Weights Using the Perceptron Algorithm

Learning Weights
Manually creating weights is hard: there are many, many potentially useful features,
and changing weights changes the results in unexpected ways.
Instead, we can learn from labeled data:

  y   x
  1   FUJIWARA no Chikamori ( year of birth and death unknown ) was a samurai and poet who lived at the end of the Heian period.
  1   Ryonen ( 1646 - October 29, 1711 ) was a Buddhist nun of the Obaku Sect who lived from the early Edo period to the mid-Edo period.
  -1  A moat settlement is a village surrounded by a moat.
  -1  Fushimi Momoyama Athletic Park is located in Momoyama-cho, Kyoto City, Kyoto Prefecture.

Online Learning

  create map w
  for I iterations
    for each labeled pair x, y in the data
      phi = create_features(x)
      y' = predict_one(w, phi)
      if y' != y
        update_weights(w, phi, y)

In other words: try to classify each training example, and every time we make a
mistake, update the weights. There are many different online learning algorithms;
the simplest is the perceptron.
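
A minimal runnable version of this loop, assuming training data given as (label, sentence) pairs with whitespace-tokenized sentences; the function names follow the pseudocode above, and unigram features are used as sketched earlier:

    from collections import defaultdict

    def create_features(x):
        # Unigram features: phi["UNI:word"] = count.
        phi = defaultdict(int)
        for word in x.split():
            phi["UNI:" + word] += 1
        return phi

    def predict_one(w, phi):
        # y = sign(w . phi(x))
        score = sum(w[name] * value for name, value in phi.items())
        return 1 if score >= 0 else -1

    def update_weights(w, phi, y):
        # Perceptron update: w <- w + y * phi(x)
        for name, value in phi.items():
            w[name] += y * value

    def train_perceptron(data, iterations=10):
        w = defaultdict(int)
        for _ in range(iterations):
            for y, x in data:
                phi = create_features(x)
                if predict_one(w, phi) != y:
                    update_weights(w, phi, y)
        return w

    data = [(-1, "A site , located in Maizuru , Kyoto"),
            (1, "Shoken , monk born in Kyoto")]
    w = train_perceptron(data)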

Perceptron Weight Update
  w ← w + y φ(x)
In other words:
  If y = 1, increase the weights for the features in φ(x)
    (features for positive examples get a higher weight).
  If y = -1, decrease the weights for the features in φ(x)
    (features for negative examples get a lower weight).
Every time we update, our predictions get better!

Example: Initial Update
Initialize w = 0.
  x = "A site , located in Maizuru , Kyoto"    y = -1
  w · φ(x) = 0
  y' = sign(w · φ(x)) = 1
  y' ≠ y, so w ← w + y φ(x):
    w_unigram"A" = -1        w_unigram"site" = -1      w_unigram"," = -2
    w_unigram"located" = -1  w_unigram"in" = -1        w_unigram"Maizuru" = -1
    w_unigram"Kyoto" = -1

Example: Second Update
  x = "Shoken , monk born in Kyoto"    y = 1
  w · φ(x) = -2 + -1 + -1 = -4
  y' = sign(w · φ(x)) = -1
  y' ≠ y, so w ← w + y φ(x):
    w_unigram"A" = -1        w_unigram"site" = -1      w_unigram"," = -1
    w_unigram"located" = -1  w_unigram"in" = 0         w_unigram"Maizuru" = -1
    w_unigram"Kyoto" = 0     w_unigram"Shoken" = 1     w_unigram"monk" = 1
    w_unigram"born" = 1
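
These two updates can be replayed mechanically; the following self-contained sketch uses the words themselves as feature names and reproduces the weights shown above:

    # Replay the two updates, starting from w = 0.
    def phi_of(x):
        counts = {}
        for word in x.split():
            counts[word] = counts.get(word, 0) + 1
        return counts

    w = {}
    examples = [("A site , located in Maizuru , Kyoto", -1),
                ("Shoken , monk born in Kyoto", 1)]
    for x, y in examples:
        phi = phi_of(x)
        score = sum(w.get(word, 0) * value for word, value in phi.items())
        y_pred = 1 if score >= 0 else -1
        if y_pred != y:                       # mistake -> update
            for word, value in phi.items():
                w[word] = w.get(word, 0) + y * value

    print(w)   # ',' -> -1, 'in' -> 0, 'Kyoto' -> 0, 'Shoken' -> 1, 'monk' -> 1, ...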

Review: The HMM Model

Part of Speech (POS) Tagging
Given a sentence X, predict its part of speech sequence Y:
  Natural language processing ( NLP ) is a field of computer science
  JJ NN NN -LRB- NN -RRB- VBZ DT NN IN NN NN
A type of structured prediction, from two weeks ago.
How can we do this? Any ideas?

Probabilistic Model for Tagging
Find the most probable tag sequence, given the sentence:
  Natural language processing ( NLP ) is a field of computer science
  JJ NN NN LRB NN RRB VBZ DT NN IN NN NN
  argmax_Y P(Y|X)
Any ideas?

Generative Sequence Model
First, decompose the probability using Bayes' law:
  argmax_Y P(Y|X) = argmax_Y P(X|Y) P(Y) / P(X)
                  = argmax_Y P(X|Y) P(Y)
P(X|Y): model of word/POS interactions ("natural" is probably a JJ).
P(Y): model of POS/POS interactions (NN comes after DET).
This is also sometimes called the noisy-channel model.

Hidden Markov Models (HMMs) for POS Tagging
POS→POS transition probabilities (like a bigram model!):
  P(Y) ≈ ∏_{i=1}^{I+1} P_T(y_i | y_{i-1})
POS→word emission probabilities:
  P(X|Y) ≈ ∏_{i=1}^{I} P_E(x_i | y_i)
Example:
  <s> JJ      NN       NN         LRB NN  RRB ... </s>
      natural language processing (   nlp )   ...
  P(Y):   P_T(JJ|<s>) * P_T(NN|JJ) * P_T(NN|NN) * ...
  P(X|Y): P_E(natural|JJ) * P_E(language|NN) * P_E(processing|NN) * ...

Learning Markov Models (with tags)
Count the number of occurrences in the corpus:
  natural language processing ( nlp ) is ...
  <s> JJ NN NN LRB NN RRB VB ... </s>
  c(JJ→natural)++   c(NN→language)++   (emission counts)
  c(<s>→JJ)++       c(JJ→NN)++         (transition counts)
Divide by the context to get probabilities:
  P_T(LRB|NN) = c(NN→LRB)/c(NN) = 1/3
  P_E(language|NN) = c(NN→language)/c(NN) = 1/3
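
A sketch of this counting in Python, under the assumption that each training line is a sequence of word_tag tokens separated by spaces:

    from collections import defaultdict

    def train_hmm(lines):
        # lines: iterable of strings like "natural_JJ language_NN processing_NN ..."
        transition = defaultdict(int)   # c(prev -> tag)
        emission = defaultdict(int)     # c(tag -> word)
        context = defaultdict(int)      # c(tag)
        for line in lines:
            previous = "<s>"
            context[previous] += 1
            for token in line.split():
                word, tag = token.rsplit("_", 1)
                transition[(previous, tag)] += 1
                emission[(tag, word)] += 1
                context[tag] += 1
                previous = tag
            transition[(previous, "</s>")] += 1
        # Divide each count by its context count to get probabilities.
        p_t = {key: count / context[key[0]] for key, count in transition.items()}
        p_e = {key: count / context[key[0]] for key, count in emission.items()}
        return p_t, p_e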

Remember: HMM Viterbi Algorithm
Forward step: calculate the best path to each node,
  i.e. find the path to each node with the lowest negative log probability.
Backward step: reproduce the path.
This is easy, almost the same as word segmentation.

Forward Step: Part 1
First, calculate the transition from <S> and the emission of the first word for every POS
(lattice nodes 0:<S> → 1:NN, 1:JJ, 1:VB, 1:PRN, 1:NNP, ... for the first word "I"):

  best_score["1 NN"]  = -log P_T(NN|<S>)  + -log P_E(I|NN)
  best_score["1 JJ"]  = -log P_T(JJ|<S>)  + -log P_E(I|JJ)
  best_score["1 VB"]  = -log P_T(VB|<S>)  + -log P_E(I|VB)
  best_score["1 PRN"] = -log P_T(PRN|<S>) + -log P_E(I|PRN)
  best_score["1 NNP"] = -log P_T(NNP|<S>) + -log P_E(I|NNP)

Forward Step: Middle Parts
For the middle words, calculate the minimum score over all possible previous POS tags
(here for the second word, "visited"):

  best_score["2 NN"] = min(
    best_score["1 NN"]  + -log P_T(NN|NN)  + -log P_E(visited|NN),
    best_score["1 JJ"]  + -log P_T(NN|JJ)  + -log P_E(visited|NN),
    best_score["1 VB"]  + -log P_T(NN|VB)  + -log P_E(visited|NN),
    best_score["1 PRN"] + -log P_T(NN|PRN) + -log P_E(visited|NN),
    best_score["1 NNP"] + -log P_T(NN|NNP) + -log P_E(visited|NN),
    ... )
  best_score["2 JJ"] = min(
    best_score["1 NN"]  + -log P_T(JJ|NN)  + -log P_E(visited|JJ),
    best_score["1 JJ"]  + -log P_T(JJ|JJ)  + -log P_E(visited|JJ),
    best_score["1 VB"]  + -log P_T(JJ|VB)  + -log P_E(visited|JJ),
    ... )
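
Putting the forward and backward steps together, a compact Viterbi sketch over negative log probabilities might look like this; it assumes the probability dictionaries produced by a trainer like the one sketched above, a fixed tag set, and (for brevity) no smoothing for unseen words:

    import math

    def viterbi(words, tags, p_t, p_e):
        # best_score[(i, tag)]: lowest -log probability of any path ending in `tag`
        # at position i; best_edge remembers which previous node achieved it.
        INF = float("inf")
        best_score = {(0, "<s>"): 0.0}
        best_edge = {(0, "<s>"): None}
        prev_tags = ["<s>"]
        for i, word in enumerate(words, start=1):
            for tag in tags:
                best_score[(i, tag)], best_edge[(i, tag)] = INF, None
                for prev in prev_tags:
                    if (prev, tag) in p_t and (tag, word) in p_e:
                        score = (best_score[(i - 1, prev)]
                                 - math.log(p_t[(prev, tag)])
                                 - math.log(p_e[(tag, word)]))
                        if score < best_score[(i, tag)]:
                            best_score[(i, tag)] = score
                            best_edge[(i, tag)] = (i - 1, prev)
            prev_tags = tags
        # Final transition into </s>.
        last, final_score, final_edge = len(words), INF, None
        for prev in prev_tags:
            if (prev, "</s>") in p_t:
                score = best_score[(last, prev)] - math.log(p_t[(prev, "</s>")])
                if score < final_score:
                    final_score, final_edge = score, (last, prev)
        # Backward step: follow the back-pointers to recover the tag sequence.
        path, node = [], final_edge
        while node is not None and node[0] > 0:
            path.append(node[1])
            node = best_edge[node]
        return list(reversed(path))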

The Structured Perceptron

So Far, We Have Learned
Classifiers (the perceptron): lots of features, binary prediction.
Generative models (the HMM): conditional probabilities, structured prediction.

Structured Perceptron
Classifiers (the perceptron): lots of features, binary prediction.
Generative models (the HMM): conditional probabilities, structured prediction.
Structured perceptron: classification with lots of features over structured models!

Why are Features Good?
We can easily try many different ideas:
  Are capitalized words usually nouns?
  Are words that end with -ed usually verbs? With -ing?

Restructuring HMM With Features

Normal HMM:
  P(X,Y) = ∏_{i=1}^{I} P_E(x_i | y_i) · ∏_{i=1}^{I+1} P_T(y_i | y_{i-1})

Log likelihood:
  log P(X,Y) = Σ_{i=1}^{I} log P_E(x_i | y_i) + Σ_{i=1}^{I+1} log P_T(y_i | y_{i-1})

Score:
  S(X,Y) = Σ_{i=1}^{I} w_{E,y_i,x_i} + Σ_{i=1}^{I+1} w_{T,y_{i-1},y_i}

When w_{E,y_i,x_i} = log P_E(x_i | y_i) and w_{T,y_{i-1},y_i} = log P_T(y_i | y_{i-1}),
the two coincide: log P(X,Y) = S(X,Y).

Example

φ( "I visited Nara" , "PRN VBD NNP" ):
  φ_T,<S>,PRN = 1     φ_T,PRN,VBD = 1     φ_T,VBD,NNP = 1     φ_T,NNP,</S> = 1
  φ_E,PRN,I = 1       φ_E,VBD,visited = 1     φ_E,NNP,Nara = 1
  φ_CAPS,PRN = 1      φ_CAPS,NNP = 1      φ_SUF,VBD,...ed = 1

φ( "I visited Nara" , "NNP VBD NNP" ):
  φ_T,<S>,NNP = 1     φ_T,NNP,VBD = 1     φ_T,VBD,NNP = 1     φ_T,NNP,</S> = 1
  φ_E,NNP,I = 1       φ_E,VBD,visited = 1     φ_E,NNP,Nara = 1
  φ_CAPS,NNP = 2      φ_SUF,VBD,...ed = 1
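
A sketch of a create_features(X, Y) that produces this kind of feature vector; the transition, emission, CAPS, and SUF templates follow the slide, while the exact string keys are just one possible naming convention:

    from collections import defaultdict

    def create_features(X, Y):
        # X: list of words, Y: list of tags (same length).
        phi = defaultdict(int)
        tags = ["<S>"] + Y + ["</S>"]
        for i in range(len(tags) - 1):
            phi["T," + tags[i] + "," + tags[i + 1]] += 1      # transition features
        for word, tag in zip(X, Y):
            phi["E," + tag + "," + word] += 1                 # emission features
            if word[0].isupper():
                phi["CAPS," + tag] += 1                       # capitalization feature
            if word.endswith("ed"):
                phi["SUF," + tag + ",...ed"] += 1             # -ed suffix feature
        return phi

For example, create_features("I visited Nara".split(), ["PRN", "VBD", "NNP"]) produces the counterparts of the first feature vector above.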

Finding the Best Solution
We must find the POS sequence that satisfies:
  Ŷ = argmax_Y Σ_i w_i φ_i(X,Y)

HMM Viterbi with Features
Same as with probabilities, but using feature weights:

  best_score["1 NN"]  = w_T,<S>,NN  + w_E,NN,I
  best_score["1 JJ"]  = w_T,<S>,JJ  + w_E,JJ,I
  best_score["1 VB"]  = w_T,<S>,VB  + w_E,VB,I
  best_score["1 PRN"] = w_T,<S>,PRN + w_E,PRN,I
  best_score["1 NNP"] = w_T,<S>,NNP + w_E,NNP,I

HMM Viterbi with Features
We can also add additional features:

  best_score["1 NN"]  = w_T,<S>,NN  + w_E,NN,I  + w_CAPS,NN
  best_score["1 JJ"]  = w_T,<S>,JJ  + w_E,JJ,I  + w_CAPS,JJ
  best_score["1 VB"]  = w_T,<S>,VB  + w_E,VB,I  + w_CAPS,VB
  best_score["1 PRN"] = w_T,<S>,PRN + w_E,PRN,I + w_CAPS,PRN
  best_score["1 NNP"] = w_T,<S>,NNP + w_E,NNP,I + w_CAPS,NNP
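
A Viterbi sketch that scores each edge with feature weights instead of log probabilities; it reuses the feature-name convention from the create_features sketch above and, for simplicity, only computes the local transition, emission, CAPS, and SUF features on each edge (unseen features simply contribute 0):

    def hmm_viterbi(w, X, tags):
        # Find the tag sequence maximizing the summed feature weights, edge by edge.
        def edge_score(prev, tag, word):
            score = w.get("T," + prev + "," + tag, 0)
            score += w.get("E," + tag + "," + word, 0)
            if word[0].isupper():
                score += w.get("CAPS," + tag, 0)
            if word.endswith("ed"):
                score += w.get("SUF," + tag + ",...ed", 0)
            return score

        NEG_INF = float("-inf")
        best_score = {(0, "<S>"): 0.0}
        best_edge = {(0, "<S>"): None}
        prev_tags = ["<S>"]
        for i, word in enumerate(X, start=1):
            for tag in tags:
                best_score[(i, tag)], best_edge[(i, tag)] = NEG_INF, None
                for prev in prev_tags:
                    score = best_score[(i - 1, prev)] + edge_score(prev, tag, word)
                    if score > best_score[(i, tag)]:
                        best_score[(i, tag)], best_edge[(i, tag)] = score, (i - 1, prev)
            prev_tags = tags
        # Final transition into </S>, then follow the back-pointers.
        last, final_score, final_edge = len(X), NEG_INF, None
        for prev in prev_tags:
            score = best_score[(last, prev)] + w.get("T," + prev + ",</S>", 0)
            if score > final_score:
                final_score, final_edge = score, (last, prev)
        Y, node = [], final_edge
        while node is not None and node[0] > 0:
            Y.append(node[1])
            node = best_edge[node]
        return list(reversed(Y))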

Learning in the Structured Perceptron
Remember the perceptron algorithm: if there is a mistake,
  w ← w + y φ(x)
Update the weights to:
  increase the score of positive examples,
  decrease the score of negative examples.
What is positive/negative in the structured perceptron?

Learning in the Structured Perceptron
Positive example, the correct feature vector:
  φ( "I visited Nara" , "PRN VBD NNP" )
Negative example, an incorrect feature vector:
  φ( "I visited Nara" , "NNP VBD NNP" )

Choosing an Incorrect Feature Vector
There are too many incorrect feature vectors!
  φ( "I visited Nara" , "NNP VBD NNP" )
  φ( "I visited Nara" , "PRN VBD NN" )
  φ( "I visited Nara" , "PRN VB NNP" )
Which do we use?

Choosing an Incorrect Feature Vector
Answer: we update using the incorrect answer with the highest score:
  Ŷ = argmax_Y Σ_i w_i φ_i(X,Y)
Our update rule becomes:
  w ← w + φ(X,Y') − φ(X,Ŷ)    (Y' is the correct answer)
Note: if the highest-scoring answer is correct (Ŷ = Y'), there is no change.
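
With dictionary feature vectors this update is element-wise addition and subtraction; a small helper might look like this (a sketch, assuming a defaultdict(int) weight map as used earlier):

    def update_weights(w, phi_prime, phi_hat):
        # w <- w + phi(X, Y') - phi(X, Y_hat)
        for name, value in phi_prime.items():
            w[name] += value
        for name, value in phi_hat.items():
            w[name] -= value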

Example
Subtracting the two feature vectors from before, with Y' = "PRN VBD NNP" (correct) and Ŷ = "NNP VBD NNP" (highest-scoring incorrect):

φ(X,Y'):
  φ_T,<S>,PRN = 1     φ_T,PRN,VBD = 1     φ_T,VBD,NNP = 1     φ_T,NNP,</S> = 1
  φ_E,PRN,I = 1       φ_E,VBD,visited = 1     φ_E,NNP,Nara = 1
  φ_CAPS,PRN = 1      φ_CAPS,NNP = 1      φ_SUF,VBD,...ed = 1

− φ(X,Ŷ):
  φ_T,<S>,NNP = 1     φ_T,NNP,VBD = 1     φ_T,VBD,NNP = 1     φ_T,NNP,</S> = 1
  φ_E,NNP,I = 1       φ_E,VBD,visited = 1     φ_E,NNP,Nara = 1
  φ_CAPS,NNP = 2      φ_SUF,VBD,...ed = 1

= φ(X,Y') − φ(X,Ŷ):
  φ_T,<S>,PRN = 1     φ_T,<S>,NNP = -1    φ_T,PRN,VBD = 1     φ_T,NNP,VBD = -1
  φ_T,VBD,NNP = 0     φ_T,NNP,</S> = 0
  φ_E,PRN,I = 1       φ_E,NNP,I = -1      φ_E,VBD,visited = 0     φ_E,NNP,Nara = 0
  φ_CAPS,PRN = 1      φ_CAPS,NNP = -1     φ_SUF,VBD,...ed = 0

Structured Perceptron Algorithm

  create map w
  for I iterations
    for each labeled pair X, Y_prime in the data
      Y_hat = hmm_viterbi(w, X)
      phi_prime = create_features(X, Y_prime)
      phi_hat = create_features(X, Y_hat)
      w += phi_prime - phi_hat
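
Tying the pieces together, a runnable version of this loop might look like the sketch below; it assumes the create_features(X, Y), hmm_viterbi(w, X, tags), and update_weights(w, phi_prime, phi_hat) sketches given earlier, and training data given as (word list, tag list) pairs:

    from collections import defaultdict

    def train_structured_perceptron(data, tags, iterations=5):
        # data: list of (X, Y) pairs; X is a list of words, Y the gold tag list.
        w = defaultdict(int)
        for _ in range(iterations):
            for X, Y_prime in data:
                Y_hat = hmm_viterbi(w, X, tags)       # current best guess
                if Y_hat != Y_prime:                  # mistake -> perceptron update
                    phi_prime = create_features(X, Y_prime)
                    phi_hat = create_features(X, Y_hat)
                    update_weights(w, phi_prime, phi_hat)
        return w

    data = [(["I", "visited", "Nara"], ["PRN", "VBD", "NNP"])]
    tags = ["NN", "JJ", "VB", "PRN", "NNP", "VBD"]
    w = train_structured_perceptron(data, tags)
    # After training, Viterbi with the learned weights should recover the gold tags:
    print(hmm_viterbi(w, ["I", "visited", "Nara"], tags))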

Conclusion
The structured perceptron is a discriminative structured prediction model:
  HMM: generative structured prediction
  Perceptron: discriminative binary prediction
It can be used for many problems: prediction of ...

Thank You!