NLP Programming Tutorial 11 - The Structured Perceptron

NLP Programming Tutorial 11 - The Structured Perceptron
Graham Neubig
Nara Institute of Science and Technology (NAIST) 1

Prediction Problems
Given x, predict y:
  A book review ("Oh, man I love this book!" / "This book is so boring...") → Is it positive? (yes / no): Binary Prediction (2 choices)
  A tweet ("On the way to the park!" / "公園に行くなう!") → Its language (English / Japanese): Multi-class Prediction (several choices)
  A sentence ("I read a book") → Its syntactic parse (a tree of S, NP, VP, DET, NN, VBD, ... nodes): Structured Prediction (millions of choices)
Most NLP problems are structured prediction! 3

So Far, We Have Learned
Classifiers (Perceptron, SVM, Neural Net): lots of features, binary prediction
Generative Models (HMM for POS tagging, CFG for parsing): conditional probabilities, structured prediction 4

Structured Perceptron
Classifiers (Perceptron, SVM, Neural Net): lots of features, binary prediction
Generative Models (HMM for POS tagging, CFG for parsing): conditional probabilities, structured prediction
Structured perceptron: classification with lots of features over structured models! 5

Uses of Structured Perceptron (or Variants)
POS tagging with HMMs: Collins, "Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms", ACL02
Parsing: Huang+, "Forest Reranking: Discriminative Parsing with Non-Local Features", ACL08
Machine translation: Liang+, "An End-to-End Discriminative Approach to Machine Translation", ACL06 (Neubig+, "Inducing a Discriminative Parser for Machine Translation Reordering", EMNLP12, plug :) )
Discriminative language models: Roark+, "Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm", ACL04 6

Example: Part of Speech (POS) Tagging
Given a sentence X, predict its part of speech sequence Y:
  Natural/JJ language/NN processing/NN (/-LRB- NLP/NN )/-RRB- is/VBZ a/DT field/NN of/IN computer/NN science/NN
A type of structured prediction. 7

Hidden Markov Models (HMMs) for POS Tagging
POS→POS transition probabilities (like a bigram model!):
  P(Y) = \prod_{i=1}^{I+1} P_T(y_i | y_{i-1})
POS→Word emission probabilities:
  P(X|Y) = \prod_{i=1}^{I} P_E(x_i | y_i)
Example:
  <s> JJ NN NN LRB NN RRB ... </s>
  natural language processing ( nlp ) ...
  Transitions: P_T(JJ|<s>) * P_T(NN|JJ) * P_T(NN|NN) * ...
  Emissions:   P_E(natural|JJ) * P_E(language|NN) * P_E(processing|NN) * ... 8
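
As a concrete illustration of the two probability tables, here is a minimal Python sketch that scores a tag sequence under the bigram HMM. The dictionaries trans_prob and emit_prob (and the 1e-10 fallback for unseen pairs) are illustrative assumptions, not the tutorial's reference code.

def hmm_probability(words, tags, trans_prob, emit_prob):
    # trans_prob[(prev_tag, tag)] = P_T(tag | prev_tag); emit_prob[(tag, word)] = P_E(word | tag)
    p = 1.0
    padded = ["<s>"] + tags + ["</s>"]
    for i in range(len(padded) - 1):
        p *= trans_prob.get((padded[i], padded[i + 1]), 1e-10)  # transition term
    for tag, word in zip(tags, words):
        p *= emit_prob.get((tag, word), 1e-10)                  # emission term
    return p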

Why are Features Good?
Can easily try many different ideas:
  Are capitalized words usually nouns?
  Are words that end with -ed usually verbs? With -ing? 9

Restructuring HMM With Features
Normal HMM:
  P(X,Y) = \prod_{i=1}^{I} P_E(x_i | y_i) \prod_{i=1}^{I+1} P_T(y_i | y_{i-1})
Log likelihood:
  \log P(X,Y) = \sum_{i=1}^{I} \log P_E(x_i | y_i) + \sum_{i=1}^{I+1} \log P_T(y_i | y_{i-1})
Score:
  S(X,Y) = \sum_{i=1}^{I} w_{E, y_i, x_i} + \sum_{i=1}^{I+1} w_{T, y_{i-1}, y_i}
When w_{E, y_i, x_i} = \log P_E(x_i | y_i) and w_{T, y_{i-1}, y_i} = \log P_T(y_i | y_{i-1}), we get \log P(X,Y) = S(X,Y). 13
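The score S(X,Y) is just a dot product between the weight vector and a sparse feature map. Below is a minimal sketch; the "T ..."/"E ..." key format mirrors the slide's subscripts but is otherwise an illustrative choice.

def score(w, phi):
    # S(X, Y) = sum_i w_i * phi_i(X, Y); w and phi are dicts keyed by feature name
    return sum(w.get(name, 0.0) * count for name, count in phi.items())

# e.g. phi = {"T <s> PRP": 1, "E PRP I": 1, "T PRP VBD": 1, "E VBD visited": 1, ...}
# If each w value is set to the corresponding log probability, score(w, phi) equals log P(X, Y).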

Example
φ( "I visited Nara", "PRP VBD NNP" ):
  φ_{T,<S>,PRP} = 1   φ_{T,PRP,VBD} = 1   φ_{T,VBD,NNP} = 1   φ_{T,NNP,</S>} = 1
  φ_{E,PRP,"I"} = 1   φ_{E,VBD,"visited"} = 1   φ_{E,NNP,"Nara"} = 1
  φ_{CAPS,PRP} = 1   φ_{CAPS,NNP} = 1   φ_{SUF,VBD,"...ed"} = 1
φ( "I visited Nara", "NNP VBD NNP" ):
  φ_{T,<S>,NNP} = 1   φ_{T,NNP,VBD} = 1   φ_{T,VBD,NNP} = 1   φ_{T,NNP,</S>} = 1
  φ_{E,NNP,"I"} = 1   φ_{E,VBD,"visited"} = 1   φ_{E,NNP,"Nara"} = 1
  φ_{CAPS,NNP} = 2   φ_{SUF,VBD,"...ed"} = 1 14

Finding the Best Solution
We must find the POS sequence that satisfies:
  \hat{Y} = \arg\max_Y \sum_i w_i \phi_i(X,Y) 15

Remember: HMM Viterbi Algorithm
Forward step: calculate the best path to each node, i.e. the path with the lowest negative log probability.
Backward step: reproduce the path.
This is easy, almost the same as word segmentation. 16

Forward Step: Part 1
First, calculate the transition from <S> and the emission of the first word ("I") for every POS:
  best_score["1 NN"]  = -log P_T(NN|<S>)  + -log P_E(I|NN)
  best_score["1 JJ"]  = -log P_T(JJ|<S>)  + -log P_E(I|JJ)
  best_score["1 VB"]  = -log P_T(VB|<S>)  + -log P_E(I|VB)
  best_score["1 PRP"] = -log P_T(PRP|<S>) + -log P_E(I|PRP)
  best_score["1 NNP"] = -log P_T(NNP|<S>) + -log P_E(I|NNP) 17

Forward Step: Middle Parts
For middle words, calculate the minimum score over all possible previous POS tags:
  best_score["2 NN"] = min(
    best_score["1 NN"]  + -log P_T(NN|NN)  + -log P_E(visited|NN),
    best_score["1 JJ"]  + -log P_T(NN|JJ)  + -log P_E(visited|NN),
    best_score["1 VB"]  + -log P_T(NN|VB)  + -log P_E(visited|NN),
    best_score["1 PRP"] + -log P_T(NN|PRP) + -log P_E(visited|NN),
    best_score["1 NNP"] + -log P_T(NN|NNP) + -log P_E(visited|NN),
    ... )
  best_score["2 JJ"] = min(
    best_score["1 NN"]  + -log P_T(JJ|NN)  + -log P_E(visited|JJ),
    best_score["1 JJ"]  + -log P_T(JJ|JJ)  + -log P_E(visited|JJ),
    best_score["1 VB"]  + -log P_T(JJ|VB)  + -log P_E(visited|JJ),
    ... ) 18

HMM Viterbi with Features
Same as with probabilities, but use feature weights:
  best_score["1 NN"]  = w_{T,<S>,NN}  + w_{E,NN,I}
  best_score["1 JJ"]  = w_{T,<S>,JJ}  + w_{E,JJ,I}
  best_score["1 VB"]  = w_{T,<S>,VB}  + w_{E,VB,I}
  best_score["1 PRP"] = w_{T,<S>,PRP} + w_{E,PRP,I}
  best_score["1 NNP"] = w_{T,<S>,NNP} + w_{E,NNP,I} 19

HMM Viterbi with Features
Can add additional features:
  best_score["1 NN"]  = w_{T,<S>,NN}  + w_{E,NN,I}  + w_{CAPS,NN}
  best_score["1 JJ"]  = w_{T,<S>,JJ}  + w_{E,JJ,I}  + w_{CAPS,JJ}
  best_score["1 VB"]  = w_{T,<S>,VB}  + w_{E,VB,I}  + w_{CAPS,VB}
  best_score["1 PRP"] = w_{T,<S>,PRP} + w_{E,PRP,I} + w_{CAPS,PRP}
  best_score["1 NNP"] = w_{T,<S>,NNP} + w_{E,NNP,I} + w_{CAPS,NNP} 20
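
To make these node scores concrete, here is a small sketch that computes best_score for the first word over a candidate tag set, including the capitalization feature. The weight-key format and the candidate tag list are illustrative assumptions consistent with the other sketches in this transcript.

def first_word_scores(word, possible_tags, w):
    # Score of reaching each tag at position 1: transition from <s>, emission, plus a CAPS feature
    scores = {}
    for tag in possible_tags:
        s = w.get("T <s> %s" % tag, 0.0) + w.get("E %s %s" % (tag, word), 0.0)
        if word[0].isupper():
            s += w.get("CAPS %s" % tag, 0.0)  # additional capitalization feature
        scores[tag] = s
    return scores

# e.g. first_word_scores("I", ["NN", "JJ", "VB", "PRP", "NNP"], w)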

Learning in the Structured Perceptron
Remember the perceptron algorithm: if there is a mistake,
  w \leftarrow w + y \phi(x)
Update weights to increase the score of positive examples and decrease the score of negative examples.
What is positive/negative in the structured perceptron? 21

Learning in the Structured Perceptron
Positive example, correct feature vector:   φ( "I visited Nara", "PRP VBD NNP" )
Negative example, incorrect feature vector: φ( "I visited Nara", "NNP VBD NNP" ) 22

Choosing an Incorrect Feature Vector
There are too many incorrect feature vectors!
  φ( "I visited Nara", "NNP VBD NNP" )
  φ( "I visited Nara", "PRP VBD NN" )
  φ( "I visited Nara", "PRP VB NNP" )
Which do we use? 23

Choosing an Incorrect Feature Vector
Answer: we update using the incorrect answer with the highest score:
  \hat{Y} = \arg\max_Y \sum_i w_i \phi_i(X,Y)
Our update rule becomes:
  w \leftarrow w + \phi(X, Y') - \phi(X, \hat{Y})    (Y' is the correct answer)
Note: if the highest-scoring answer is correct, there is no change. 24
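
The update can be written directly over sparse feature maps. A minimal sketch, assuming w is a plain dict and the feature maps are dict-like counts as in the earlier sketches:

def perceptron_update(w, phi_correct, phi_predicted):
    # w <- w + phi(X, Y') - phi(X, Y_hat); only features where the two vectors differ actually change
    for name, count in phi_correct.items():
        w[name] = w.get(name, 0.0) + count
    for name, count in phi_predicted.items():
        w[name] = w.get(name, 0.0) - count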

Structured Perceptron Algorithm
  create map w
  for I iterations
    for each labeled pair X, Y_prime in the data
      Y_hat = hmm_viterbi(w, X)
      phi_prime = create_features(X, Y_prime)
      phi_hat = create_features(X, Y_hat)
      w += phi_prime - phi_hat 25
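
Putting the pieces together, a hedged Python sketch of this training loop. hmm_viterbi and create_features stand for the routines sketched after the following slides, perceptron_update is the sketch above, and the possible_tags argument is an added assumption (the pseudocode keeps the tag set implicit).

def train_structured_perceptron(data, possible_tags, iterations=5):
    # data: list of (words, tags) pairs; returns the learned weight map
    w = {}
    for _ in range(iterations):
        for words, tags_prime in data:
            tags_hat = hmm_viterbi(w, words, possible_tags)   # current best guess
            if tags_hat == tags_prime:
                continue                                      # correct answer: no change
            phi_prime = create_features(words, tags_prime)    # correct feature vector
            phi_hat = create_features(words, tags_hat)        # predicted feature vector
            perceptron_update(w, phi_prime, phi_hat)          # w += phi_prime - phi_hat
    return w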

Creating HMM Features
Make "create features" functions for each transition and emission:
  create_trans("NNP", "VBD")  →  φ["T,NNP,VBD"] = 1
  create_emit("NNP", "Nara")  →  φ["E,NNP,Nara"] = 1, φ["CAPS,NNP"] = 1 26

Creating HMM Features
The create_features function does this for all words:
  create_features(X, Y):
    create map phi
    for i in 0 .. |Y|:
      if i == 0:   first_tag = "<s>"
      else:        first_tag = Y[i-1]
      if i == |Y|: next_tag = "</s>"
      else:        next_tag = Y[i]
      phi += create_trans(first_tag, next_tag)
    for i in 0 .. |Y|-1:
      phi += create_emit(Y[i], X[i])
    return phi 27
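
A runnable Python version of create_features under the same feature-naming scheme. The CAPS and suffix features from the earlier example are included as optional extras; treating them as part of create_emit is my assumption, since the pseudocode above only shows transitions and emissions.

from collections import Counter

def create_trans(first_tag, next_tag):
    return Counter({"T %s %s" % (first_tag, next_tag): 1})

def create_emit(tag, word):
    phi = Counter({"E %s %s" % (tag, word): 1})
    if word[0].isupper():
        phi["CAPS %s" % tag] += 1          # capitalization feature (extra)
    if word.endswith("ed"):
        phi["SUF %s ...ed" % tag] += 1     # -ed suffix feature (extra)
    return phi

def create_features(words, tags):
    phi = Counter()
    for i in range(len(tags) + 1):
        first_tag = "<s>" if i == 0 else tags[i - 1]
        next_tag = "</s>" if i == len(tags) else tags[i]
        phi += create_trans(first_tag, next_tag)
    for i in range(len(tags)):
        phi += create_emit(tags[i], words[i])
    return phi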

Viterbi Algorithm Forward Step
  split line into words
  I = length(words)
  make maps best_score, best_edge
  best_score["0 <s>"] = 0    # Start with <s>
  best_edge["0 <s>"] = NULL
  for i in 0 .. I-1:
    for each prev in keys of possible_tags:
      for each next in keys of possible_tags:
        if best_score["i prev"] exists:
          # instead of -log P_T(next|prev) + -log P_E(word[i]|next), use the feature weights:
          score = best_score["i prev"] + w * (create_trans(prev, next) + create_emit(next, word[i]))
          if best_score["i+1 next"] is new or < score:
            best_score["i+1 next"] = score
            best_edge["i+1 next"] = "i prev"
  # Finally, do the same for </s> 28
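
For completeness, a hedged Python sketch of the feature-based Viterbi search (forward and backward steps). It reuses the score, create_trans, and create_emit sketches above, and passes the tag set in as possible_tags; these choices are assumptions, not the tutorial's reference implementation.

def hmm_viterbi(w, words, possible_tags):
    I = len(words)
    best_score = {"0 <s>": 0.0}
    best_edge = {"0 <s>": None}
    # Forward step: best score for every (position, tag) node
    for i in range(I):
        for prev in (["<s>"] if i == 0 else possible_tags):
            prev_key = "%d %s" % (i, prev)
            if prev_key not in best_score:
                continue
            for nxt in possible_tags:
                phi = create_trans(prev, nxt) + create_emit(nxt, words[i])
                s = best_score[prev_key] + score(w, phi)
                key = "%d %s" % (i + 1, nxt)
                if key not in best_score or s > best_score[key]:   # maximize the score
                    best_score[key] = s
                    best_edge[key] = prev_key
    # Final transition to </s>
    end_key = "%d </s>" % (I + 1)
    for prev in possible_tags:
        prev_key = "%d %s" % (I, prev)
        if prev_key not in best_score:
            continue
        s = best_score[prev_key] + score(w, create_trans(prev, "</s>"))
        if end_key not in best_score or s > best_score[end_key]:
            best_score[end_key] = s
            best_edge[end_key] = prev_key
    # Backward step: follow best_edge pointers back from </s>
    tags = []
    node = best_edge[end_key]
    while node != "0 <s>":
        tags.append(node.split(" ", 1)[1])
        node = best_edge[node]
    tags.reverse()
    return tags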

Exercise 29

Exercise
Write train-hmm-percep and test-hmm-percep.
Test the programs:
  Input:  test/05-{train,test}-input.txt
  Answer: test/05-{train,test}-answer.txt
Train an HMM model on data/wiki-en-train.norm_pos and run the program on data/wiki-en-test.norm.
Measure the accuracy of your tagging with:
  script/gradepos.pl data/wiki-en-test.pos my_answer.pos
Report the accuracy (compare to the standard HMM).
Challenge: create new features; use training with margin or regularization. 30

Thank You! 31