Lecture 13: Structured Prediction

Lecture 13: Structured Prediction
Kai-Wei Chang, CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/nlp16

Quiz 2
- Covers Lectures 9-13 (Lecture 12: before page 44; Lecture 13: before page 33)
- Key points:
  - HMM model
  - Three basic problems
  - Sequential tagging

Three basic problems for HMMs
- Likelihood of the input: forward algorithm (e.g., how likely is it that the sentence "I love cat" occurs?)
- Decoding (tagging) the input: Viterbi algorithm (e.g., what are the POS tags of "I love cat"?)
- Estimation (learning): how do we learn the model? Find the best model parameters
  - Case 1: supervised -- tags are annotated: maximum likelihood estimation (MLE)
  - Case 2: unsupervised -- only unannotated text: forward-backward algorithm

Supervised Learning Setting
- Assume we have annotated examples
- Tag set: DT, JJ, NN, VBD
- POS tagger training example: The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

Sequence tagging problems
- Many problems in NLP (and ML) have data with tag sequences
- Brainstorm: name other sequential tagging problems

OCR example
- (figure only)

Noun phrase (NP) chunking
- Task: identify all non-recursive NP chunks

The BIO encoding
- Define three new tags
  - B-NP: beginning of a noun phrase chunk
  - I-NP: inside of a noun phrase chunk
  - O: outside of a noun phrase chunk
- POS tagging with a restricted tagset?
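As a concrete illustration (my own sketch, not from the slides), chunk spans can be converted to BIO tags as follows; the sentence and the NP span boundaries below are assumptions made for the example.

```python
def chunks_to_bio(n_tokens, chunks, label="NP"):
    """Convert (start, end) chunk spans (end exclusive) into BIO tags."""
    tags = ["O"] * n_tokens
    for start, end in chunks:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["The", "grand", "jury", "commented", "on", "a", "number", "of", "other", "topics", "."]
np_chunks = [(0, 3), (5, 7), (8, 10)]   # hypothetical NP spans
print(list(zip(tokens, chunks_to_bio(len(tokens), np_chunks))))
```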

Shallow parsing
- Task: identify all non-recursive NP, verb (VP), and preposition (PP) chunks

BIO Encoding for Shallow Parsing
- Define new tags
  - B-NP, B-VP, B-PP: beginning of an NP, VP, PP chunk
  - I-NP, I-VP, I-PP: inside of an NP, VP, PP chunk
  - O: outside of any chunk
- POS tagging with a restricted tagset?

Named Entity Recognition
- Task: identify all mentions of named entities (people, organizations, locations, dates)

BIO Encoding for NER
- Define many new tags
  - B-PERS, B-DATE, ...: beginning of a mention of a person/date/...
  - I-PERS, I-DATE, ...: inside of a mention of a person/date/...
  - O: outside of any mention of a named entity

Sequence tagging
- Many NLP tasks are sequence tagging tasks
  - Input: a sequence of tokens/words
  - Output: a sequence of corresponding labels
  - E.g., POS tags, BIO encoding for NER
- Solution: find the most probable label sequence for the given word sequence:
  t* = argmax_t P(t | w)

Sequential tagging vs. independent prediction
- Sequence labeling: t* = argmax_t P(t | w), where t is a vector/matrix
- Independent classifier: y* = argmax_y P(y | x), where y is a single label
- (figure: a chain-structured model over t_i, t_j, ..., w_i, w_j, ... vs. isolated pairs (y_i, x_i), (y_j, x_j))

Sequential tagging vs. independent prediction
- Sequence labeling: t* = argmax_t P(t | w)
  - t is a vector/matrix
  - Dependencies between both (t, w) and (t_i, t_j)
  - Structured output
  - Difficult to solve the inference problem
- Independent classifiers: y* = argmax_y P(y | x)
  - y is a single label
  - Dependency only within (y, x)
  - Independent output
  - Easy to solve the inference problem

Recap: Viterbi Decoding
- Induction: δ_i(q) = P(w_i | t_i = q) · max_{q'} δ_{i-1}(q') · P(t_i = q | t_{i-1} = q')

Recap: Viterbi algorithm
- Store the best tag sequence for w_1 ... w_i that ends in t_j in T[j][i]
  - T[j][i] = max P(w_1 ... w_i, t_1 ..., t_i = t_j)
- Recursively compute T[j][i] from the entries in the previous column T[j][i-1]:
  T[j][i] = P(w_i | t_j) · max_k T[k][i-1] · P(t_j | t_k)
  - P(w_i | t_j): generating the current observation
  - max_k T[k][i-1]: the best tag sequence of length i-1
  - P(t_j | t_k): transition from the previous best ending tag
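To make the recursion concrete, here is a minimal Viterbi sketch (my own illustration, not the lecture's code); `init`, `trans`, and `emit` are assumed to be log-probability tables keyed by tags, and the table `V[i][t]` plays the role of T[t][i] on the slide.

```python
import math

def viterbi(words, tags, init, trans, emit):
    """Most probable tag sequence under an HMM with log-prob tables init/trans/emit."""
    V = [{t: init[t] + emit[t].get(words[0], -math.inf) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            # max over previous tags k of V[i-1][k] + log P(t | k)
            best_prev = max(tags, key=lambda k: V[i - 1][k] + trans[k][t])
            V[i][t] = (V[i - 1][best_prev] + trans[best_prev][t]
                       + emit[t].get(words[i], -math.inf))
            back[i][t] = best_prev
    # follow back-pointers from the best final tag
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```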

Two modeling perspectives
- Generative models
  - Model the joint probability of labels and words
  - t* = argmax_t P(t | w) = argmax_t P(w, t) / P(w) = argmax_t P(t, w)
- Discriminative models
  - Directly model the conditional probability of labels given the words
  - t* = argmax_t P(t | w), often modeled by a softmax function

Generative vs. discriminative models
- Binary classification as an example
- (figures: the generative model's view vs. the discriminative model's view)

Generative vs. discriminative models
- Generative: joint distribution
  - Full probabilistic specification for all the random variables
  - Dependence assumptions have to be specified for P(w | t) and P(t)
  - Can be used in unsupervised learning
- Discriminative: conditional distribution
  - Only explains the target variable
  - Arbitrary features can be incorporated for modeling P(t | w)
  - Needs labeled data; suitable for (semi-)supervised learning

Independent Classifiers
- P(t | w) = ∏_i P(t_i | w_i)
- ~95% accuracy (token-wise)
- (figure: each tag t_1 ... t_4 depends only on its own word w_1 ... w_4)
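As an illustration of this baseline (not the lecture's code), a per-token classifier could be an off-the-shelf logistic regression that looks only at the current word; the tiny training set here is made up.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# hypothetical toy data: one feature dict per token, one tag per token
train_tokens = [{"word": "the"}, {"word": "dog"}, {"word": "runs"},
                {"word": "a"}, {"word": "cat"}, {"word": "sleeps"}]
train_tags = ["DT", "NN", "VBZ", "DT", "NN", "VBZ"]

vec = DictVectorizer()
X = vec.fit_transform(train_tokens)            # one-hot word features
clf = LogisticRegression().fit(X, train_tags)  # each token is classified independently

print(clf.predict(vec.transform([{"word": "dog"}, {"word": "runs"}])))
```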

Maximum entropy Markov models
- MEMMs are discriminative models of the labels t given the observed input sequence w
- P(t | w) = ∏_i P(t_i | w_i, t_{i-1})

Design features
- Emission-like features (example tokens: China/NNP, know/VB)
  - Binary feature functions
    - f_first-letter-capitalized-NNP(China) = 1
    - f_first-letter-capitalized-VB(know) = 0
  - Integer (or real-valued) feature functions
    - f_number-of-vowels-NNP(China) = 2
- Transition-like features
  - Binary feature functions
    - f_first-letter-capitalized-VB-NNP(China) = 1
- The features are not necessarily independent!
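A minimal sketch of such feature functions (my own illustration; the feature names mirror the slide's examples):

```python
VOWELS = set("aeiou")

def features(word, tag, prev_tag):
    """Named feature values for one (word, tag, prev_tag) position."""
    return {
        # emission-like, binary
        f"first-letter-capitalized-{tag}": int(word[0].isupper()),
        # emission-like, integer-valued
        f"number-of-vowels-{tag}": sum(c in VOWELS for c in word.lower()),
        # transition-like, binary: capitalization conjoined with the tag bigram
        f"first-letter-capitalized-{prev_tag}-{tag}": int(word[0].isupper()),
    }

print(features("China", "NNP", "VB"))
# {'first-letter-capitalized-NNP': 1, 'number-of-vowels-NNP': 2,
#  'first-letter-capitalized-VB-NNP': 1}
```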

Parameterization of P(t_i | w_i, t_{i-1})
- Associate a real-valued weight λ with each specific type of feature function
  - e.g., λ_k for f_first-letter-capitalized-NNP(w)
- Define a scoring function: f(t_i, t_{i-1}, w_i) = Σ_k λ_k f_k(t_i, t_{i-1}, w_i)
- Naturally, P(t_i | w_i, t_{i-1}) ∝ exp f(t_i, t_{i-1}, w_i)
- Recall the basic definition of a probability distribution:
  - P(x) > 0
  - Σ_x P(x) = 1

Parameterization of MEMMs
- P(t | w) = ∏_i P(t_i | w_i, t_{i-1}) = ∏_i [ exp f(t_i, t_{i-1}, w_i) / Σ_{t'} exp f(t', t_{i-1}, w_i) ]
- It is a log-linear model (λ: parameters):
  log P(t | w) = Σ_i f(t_i, t_{i-1}, w_i) − C(λ), where C(λ) is a constant related only to λ
- The Viterbi algorithm can be used to decode the most probable label sequence based solely on Σ_i f(t_i, t_{i-1}, w_i)
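For concreteness (my own sketch, not the lecture's implementation), the locally normalized factor P(t_i | w_i, t_{i-1}) is just a softmax of linear scores over the candidate tags; `feature_fn` is any feature extractor with the signature of the `features` sketch above.

```python
import math

def memm_local_prob(weights, word, prev_tag, tags, feature_fn):
    """P(t_i | w_i, t_{i-1}): softmax of the linear scores f(t, prev_tag, word) over tags."""
    def score(tag):
        feats = feature_fn(word, tag, prev_tag)               # f_k(t_i, t_{i-1}, w_i)
        return sum(weights.get(name, 0.0) * v for name, v in feats.items())
    scores = {t: score(t) for t in tags}
    z = sum(math.exp(s) for s in scores.values())             # local normalizer
    return {t: math.exp(s) / z for t, s in scores.items()}

# toy usage with a single capitalization feature and made-up weights
feat = lambda w, t, p: {f"cap-{t}": int(w[0].isupper()), f"bigram-{p}-{t}": 1}
print(memm_local_prob({"cap-NNP": 2.0}, "China", "O", ["NNP", "VB", "O"], feat))
```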

Parameter estimation (intuition)
- The maximum likelihood estimator can be used in a similar way as in HMMs:
  λ* = argmax_λ Σ_{(t,w)} log P(t | w) = argmax_λ Σ_{(t,w)} Σ_i [ f(t_i, t_{i-1}, w_i) − C(λ) ]
- Decompose the training data into such (t_i, t_{i-1}, w_i) units

Parameter estimation (intuition)
- Essentially, we train local classifiers using the previously assigned tags as features

More about MEMMs
- Emission features can go across multiple observations:
  f(t_i, t_{i-1}, w_i) ≝ Σ_k λ_k f_k(t_i, t_{i-1}, w), i.e., features may look at the whole sequence w
- Especially useful for shallow parsing and NER tasks

Label bias problem
- Consider the following tag sequences as the training data:
  - Thomas/B-PER Jefferson/I-PER
  - Thomas/B-LOC Hall/I-LOC
- (figure: state machine over the tags other, B-PER, E-PER, B-LOC, E-LOC)

Label bias problem
- Training data: Thomas/B-PER Jefferson/I-PER; Thomas/B-LOC Hall/I-LOC
- MEMM (locally normalized):
  - P(B-PER | Thomas, other) = 1/2 and P(B-LOC | Thomas, other) = 1/2
  - P(I-PER | Jefferson, B-PER) = 1 and P(I-LOC | Jefferson, B-LOC) = 1
- We should normalize globally!
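To spell out the slide's point with its own numbers (my own worked example), the two complete paths end up tied, so the strong evidence from "Jefferson" never gets a chance to penalize the LOC branch:

```python
# Path probabilities implied by the locally normalized factors above (illustration only)
p_per_path = 0.5 * 1.0   # P(B-PER | Thomas, other) * P(I-PER | Jefferson, B-PER)
p_loc_path = 0.5 * 1.0   # P(B-LOC | Thomas, other) * P(I-LOC | Jefferson, B-LOC)
print(p_per_path, p_loc_path)   # 0.5 0.5 -- local normalization cannot break the tie
```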

Conditional Random Field
- Models global dependencies; scores the entire sequence directly
- P(t | w) ∝ exp S(t, w), i.e., P(t | w) = exp S(t, w) / Σ_{t'} exp S(t', w)
- (figure: chain-structured model over t_1 ... t_4 and w_1 ... w_4)

Conditional Random Field
- S(t, w) = Σ_i ( Σ_k λ_k f_k(t_i, w) + Σ_l γ_l g_l(t_i, t_{i-1}, w) )
- P(t | w) ∝ exp S(t, w) = exp( Σ_i ( Σ_k λ_k f_k(t_i, w) + Σ_l γ_l g_l(t_i, t_{i-1}, w) ) )
- Node features f(t_i, w); edge features g(t_i, t_{i-1}, w)
- (figure: chain over t_1 ... t_4 and w_1 ... w_4 with node and edge factors)
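A minimal sketch of this score function (my own illustration; `node_feats` and `edge_feats` stand for hypothetical feature extractors returning name-to-value dicts). The global normalizer Σ_{t'} exp S(t', w) is not computed here; for a linear chain it can be obtained with a forward-style dynamic program.

```python
def crf_score(tags, words, lam, gam, node_feats, edge_feats):
    """S(t, w) = sum_i ( sum_k lam_k f_k(t_i, w) + sum_l gam_l g_l(t_i, t_{i-1}, w) )."""
    score = 0.0
    prev = "<START>"  # assumed boundary tag for the first position
    for i, tag in enumerate(tags):
        for name, value in node_feats(tag, words, i).items():        # node features f(t_i, w)
            score += lam.get(name, 0.0) * value
        for name, value in edge_feats(tag, prev, words, i).items():  # edge features g(t_i, t_{i-1}, w)
            score += gam.get(name, 0.0) * value
        prev = tag
    return score
```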


General Idea
- We want the score of the correct answer S(t, w) to be higher than all others:
  S(t, w) > S(t', w)  for all t' ∈ T, t' ≠ t
- Different levels of mistakes can be required to be separated by a margin:
  S(t, w) ≥ S(t', w) + Δ(t', t)  for all t' ∈ T
- Several ML models can be used:
  - Structured Perceptron
  - Structured SVM
  - Learning to Search

Log-linear model
- P(t | w) ∝ exp S(t, w)
- S(t, w) = Σ_i ( Σ_k λ_k f_k(t_i, w) + Σ_l γ_l g_l(t_i, t_{i-1}, w) )
          = Σ_k λ_k ( Σ_i f_k(t_i, w) ) + Σ_l γ_l ( Σ_i g_l(t_i, t_{i-1}, w) )
          = θ · F(t, w), where θ = (λ_1, λ_2, ..., γ_1, γ_2, ...) and F(t, w) collects the aggregated features Σ_i f_k(t_i, w) and Σ_i g_l(t_i, t_{i-1}, w)
- Essentially, we aggregate transition and emission patterns as features
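As an illustration of this rearrangement (my own sketch, with the same hypothetical `node_feats`/`edge_feats` extractors as before), F(t, w) can be built by summing the per-position feature dicts, so that S(t, w) becomes a single dot product θ · F(t, w):

```python
from collections import Counter

def global_features(tags, words, node_feats, edge_feats):
    """F(t, w): sum the per-position node and edge features over the whole sequence."""
    F = Counter()
    prev = "<START>"  # assumed boundary tag
    for i, tag in enumerate(tags):
        F.update(node_feats(tag, words, i))        # accumulates sum_i f_k(t_i, w)
        F.update(edge_feats(tag, prev, words, i))  # accumulates sum_i g_l(t_i, t_{i-1}, w)
        prev = tag
    return F

def score(theta, F):
    """S(t, w) = theta . F(t, w)."""
    return sum(theta.get(name, 0.0) * value for name, value in F.items())
```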

MEMM vs. CRF
- The score function can be the same; as on the previous slide, we can rearrange the summations:
  S(t, w) = Σ_i ( Σ_k λ_k f_k(t_i, w) + Σ_l γ_l g_l(t_i, t_{i-1}, w) ) = Σ_i f(t_i, t_{i-1}, w_i)
- MEMM: locally normalized
  P(t | w) = ∏_i P(t_i | w_i, t_{i-1}) = ∏_i [ exp f(t_i, t_{i-1}, w_i) / Σ_{t'} exp f(t', t_{i-1}, w_i) ]
- CRF: globally normalized
  P(t | w) = exp S(t, w) / Σ_{t'} exp S(t', w) = exp( Σ_i f(t_i, t_{i-1}, w_i) ) / Σ_{t'} exp( Σ_i f(t'_i, t'_{i-1}, w_i) )

HMM vs. MEMM vs. CRF
- (figure: graphical models of the three; the HMM models the joint distribution P(X, Y), while the MEMM and CRF model the conditional distribution P(Y | X))

Structured Prediction beyond sequence tagging
- Assign values to a set of interdependent output variables
- Examples (task: input → output):
  - Part-of-speech tagging: "They operate ships and banks." → Pronoun Verb Noun And Noun
  - Dependency parsing: "They operate ships and banks." → a dependency tree over "Root They operate ships and banks."
  - Segmentation: (input and output shown as images on the slide)

Inference
- Find the best-scoring output given the model: argmax_y S(y, x)
- The output space is usually exponentially large
- Inference algorithms:
  - Specific: e.g., Viterbi (linear chain)
  - General: integer linear programming (ILP)
  - Approximate inference algorithms: e.g., belief propagation, dual decomposition
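For intuition about why specialized inference matters (my own illustration, not from the slides), brute-force decoding enumerates the exponentially many label sequences that Viterbi or ILP formulations avoid:

```python
from itertools import product

def brute_force_decode(words, tags, score_fn):
    """argmax over all |tags|**len(words) label sequences -- exponential, for intuition only."""
    best, best_score = None, float("-inf")
    for cand in product(tags, repeat=len(words)):
        s = score_fn(list(cand), words)   # any global score S(y, x), e.g. a CRF score
        if s > best_score:
            best, best_score = list(cand), s
    return best
```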

Learning Structured Models
- (diagram: a loop that alternates between solving inference and updating the model with (stochastic) gradient updates)

Example: Structured Perceptron
- Goal: we want the score of the correct answer S(y, x; θ) to be higher than all others:
  S(y, x; θ) > S(y', x; θ)  for all y' ∈ T, y' ≠ y
- Let S(y, x; θ) = θ · F(y, x)
- Given training data {(y_i, x_i)}, i = 1 ... N:
  - Loop until convergence:
    - For i = 1 ... N:
      - Let y' = argmax_y θ · F(y, x_i)
      - If y' ≠ y_i: θ ← θ + η ( F(y_i, x_i) − F(y', x_i) )
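A minimal sketch of this training loop (my own illustration under the slide's notation; `decode` stands for any exact argmax routine such as the Viterbi sketch earlier, and `feature_fn(y, x)` returns the global feature dict F(y, x), e.g. the aggregator sketched above with its extractors bound in):

```python
def structured_perceptron(data, decode, feature_fn, epochs=10, eta=1.0):
    """Structured perceptron training sketch.
    data:       list of (x, y_gold) pairs
    decode:     decode(theta, x) -> argmax_y theta . F(y, x)
    feature_fn: feature_fn(y, x) -> dict representing F(y, x)"""
    theta = {}
    for _ in range(epochs):                  # "loop until converge" (fixed epoch count here)
        for x, y_gold in data:
            y_pred = decode(theta, x)        # inference step
            if y_pred != y_gold:             # mistake-driven update:
                # theta <- theta + eta * (F(y_gold, x) - F(y_pred, x))
                for name, value in feature_fn(y_gold, x).items():
                    theta[name] = theta.get(name, 0.0) + eta * value
                for name, value in feature_fn(y_pred, x).items():
                    theta[name] = theta.get(name, 0.0) - eta * value
    return theta
```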