Natural Language Processing: Global Linear Models. Based on slides from Michael Collins.

Globally-normalized models. Why do we decompose into a sequence of decisions? Can we directly estimate the probability of an entire sequence?
$$p_\theta(y \mid x) = \frac{\exp(f(x, y)^\top \theta)}{\sum_{y' \in \mathcal{Y}} \exp(f(x, y')^\top \theta)}$$
$\mathcal{Y}$: all tag sequences for x.

Global linear models. Motivations: optimize directly what you care about; flexibility in the feature function; avoid the label bias problem. Question: how do we compute the denominator?

Label bias problem. Consider a history-based model for two-token names that are either PERSON or LOC. States: b-person, e-person, b-loc, e-loc, other.

Label bias problem. Corpus: Harvey Ford (person 9 times, location 1 time); Harvey Park (location 9 times, person 1 time); Myrtle Ford (person 9 times, location 1 time); Myrtle Park (location 9 times, person 1 time). The second word provides good information.

Label bias problem. Conditional probabilities: p(b-person | other, w = Harvey) = 0.5; p(b-loc | other, w = Harvey) = 0.5; p(b-person | other, w = Myrtle) = 0.5; p(b-loc | other, w = Myrtle) = 0.5; p(e-person | b-person, w = Ford) = 1; p(e-person | b-person, w = Park) = 1; p(e-loc | b-loc, w = Ford) = 1; p(e-loc | b-loc, w = Park) = 1. The information from the second word is lost.

Label bias problem. [Lattice over four observations with five states; local transition probabilities shown on the edges.] What the local transition probabilities say: state 1 prefers to go to state 2; state 2 prefers to stay in state 2.

Label bias problem. Probability of path 1→1→1→1: 0.4 × 0.45 × 0.5 = 0.09.

Label bias problem. Probability of path 2→2→2→2: 0.2 × 0.3 × 0.3 = 0.018. Other paths: 1→1→1→1: 0.09.

Label bias problem. Probability of path 1→2→1→2: 0.6 × 0.2 × 0.5 = 0.06. Other paths: 1→1→1→1: 0.09; 2→2→2→2: 0.018.

Label bias problem. Probability of path 1→1→2→2: 0.4 × 0.55 × 0.3 = 0.066. Other paths: 1→1→1→1: 0.09; 2→2→2→2: 0.018; 1→2→1→2: 0.06.

Label bias problem. Most likely path: 1→1→1→1, although locally it seems state 1 wants to go to state 2 and state 2 wants to remain in state 2.
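To make the point concrete, here is a minimal sketch that recomputes the four path scores above from the per-step transition probabilities read off the lattice:

```python
from math import prod

# Per-step transition probabilities for the four paths discussed above,
# read off the lattice in the slides.
paths = {
    "1->1->1->1": [0.4, 0.45, 0.5],
    "2->2->2->2": [0.2, 0.3, 0.3],
    "1->2->1->2": [0.6, 0.2, 0.5],
    "1->1->2->2": [0.4, 0.55, 0.3],
}

for path, probs in sorted(paths.items(), key=lambda kv: -prod(kv[1])):
    print(f"{path}: {prod(probs):.3f}")
# 1->1->1->1 comes out on top (0.090) even though each individual step out of
# state 1 prefers state 2 -- the label bias problem.
```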

Label bias problem: the model prefers states that have low-entropy next-state distributions.

Global linear models. Main difference: we don't look at a history of decisions; instead, we define features over the full structure. This is more flexible (we will see this in parsing), but we directly consider the exponential output space.

Global linear models contain three components: a feature function f mapping a pair (x, y) to a feature vector f(x, y); a generating function GEN enumerating all candidate outputs; and a parameter vector θ.
$$F(x) = \arg\max_{y} f(x, y)^\top \theta$$

Tagging: GEN. Inputs: sentences $x = w_1 \ldots w_n$; tag set T; $GEN(x) = T^n$. In other setups, GEN(x) could be all parse trees, the top-K parse trees produced by a weaker model, or all translations.

Tagging: feature function. A history h is a 4-tuple $\langle t_{i-2}, t_{i-1}, w, i \rangle$: $t_{i-2}, t_{i-1}$ are the previous two tags, w is the input sentence, and i is the index of the word being tagged. Example: Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere. Here $t_{i-2}, t_{i-1}$ = DT, JJ; w = Hispaniola, ..., Hemisphere; i = 6.

Local features. For every history/tag pair (h, t), $g_s(h, t)$ for $s = 1, \ldots, d$ are local features for tagging decision t given history h. For example:
$$g_{100}(h, t) = \begin{cases} 1 & \text{if } w_i = \text{base and } t = \text{VB} \\ 0 & \text{otherwise} \end{cases}$$
$$g_{101}(h, t) = \begin{cases} 1 & \text{if } w_i \text{ ends with ing and } t = \text{VBG} \\ 0 & \text{otherwise} \end{cases}$$
$$g_{102}(h, t) = \begin{cases} 1 & \text{if } (t_{i-2}, t_{i-1}, t) = (\text{DT, JJ, VB}) \\ 0 & \text{otherwise} \end{cases}$$
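A minimal sketch of these three indicator features in code, assuming a history is represented as a tuple (t_minus2, t_minus1, words, i) with 1-indexed i, as on the previous slides (the representation is illustrative, not prescribed by the slides):

```python
def g_100(h, t):
    """1 if the word being tagged is 'base' and the proposed tag is VB."""
    t_minus2, t_minus1, words, i = h
    return 1 if words[i - 1] == "base" and t == "VB" else 0

def g_101(h, t):
    """1 if the word being tagged ends with 'ing' and the proposed tag is VBG."""
    t_minus2, t_minus1, words, i = h
    return 1 if words[i - 1].endswith("ing") and t == "VBG" else 0

def g_102(h, t):
    """1 if the tag trigram (t_{i-2}, t_{i-1}, t) is (DT, JJ, VB)."""
    t_minus2, t_minus1, words, i = h
    return 1 if (t_minus2, t_minus1, t) == ("DT", "JJ", "VB") else 0
```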

Multiple local features. Sentence: Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/NN. The histories and tags extracted from it:

i   t_{i-2}  t_{i-1}  w                  tag
1   *        *        Hispaniola, ...    NNP
2   *        NNP      Hispaniola, ...    RB
3   NNP      RB       Hispaniola, ...    VB
4   RB       VB       Hispaniola, ...    DT
5   VB       DT       Hispaniola, ...    JJ
6   DT       JJ       Hispaniola, ...    NN

Global feature function:
$$f_j(x, y) = \sum_i g_j(h_i, t_i), \qquad f(x, y) = \sum_i g(h_i, t_i)$$
Features are now counts rather than binary values, e.g. how many times a word ending in ing was tagged as VBG.
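Continuing the sketch, the global feature vector simply sums the local features over all positions, so each coordinate becomes a count (helper names are illustrative):

```python
from collections import Counter

def global_features(words, tags, local_feats):
    """f(x, y) = sum_i g(h_i, t_i); every feature value is a count."""
    f = Counter()
    padded = ["*", "*"] + list(tags)     # padded[j + 1] holds t_j, with t_{-1} = t_0 = *
    for i in range(1, len(words) + 1):   # positions are 1-indexed, as in the slides
        h = (padded[i - 1], padded[i], words, i)
        for name, g in local_feats.items():
            f[name] += g(h, tags[i - 1])
    return f

# Usage with the local features sketched above:
# global_features("Hispaniola quickly became an important base".split(),
#                 ["NNP", "RB", "VB", "DT", "JJ", "NN"],
#                 {"g_100": g_100, "g_101": g_101, "g_102": g_102})
```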

Global log-linear model:
$$p_\theta(y \mid x) = \frac{\exp(f(x, y)^\top \theta)}{\sum_{y' \in GEN(x)} \exp(f(x, y')^\top \theta)} = \frac{\exp\big(\sum_i g(x, i, y_{i-2}, y_{i-1}, y_i)^\top \theta\big)}{\sum_{y' \in GEN(x)} \exp\big(\sum_i g(x, i, y'_{i-2}, y'_{i-1}, y'_i)^\top \theta\big)}$$
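For intuition, a brute-force version of this distribution that enumerates GEN(x) = T^n explicitly (only feasible for toy examples; `score(words, y)` stands for f(x, y)·θ and is an assumed helper):

```python
import math
from itertools import product

def brute_force_prob(words, y, tagset, score):
    """p_theta(y | x) computed by summing over all |T|^n candidate tag sequences."""
    numerator = math.exp(score(words, y))
    Z = sum(math.exp(score(words, list(y_prime)))
            for y_prime in product(tagset, repeat=len(words)))
    return numerator / Z
```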

Decoding. As long as the feature function decomposes locally, we can still apply Viterbi. Otherwise we can do greedy/beam decoding.

Decoding. Find the structure that maximizes the linear score:
$$\arg\max_y p_\theta(y \mid x) = \arg\max_y \frac{\exp(f(x, y)^\top \theta)}{\sum_{y' \in GEN(x)} \exp(f(x, y')^\top \theta)} = \arg\max_y \exp(f(x, y)^\top \theta) = \arg\max_y f(x, y)^\top \theta = \arg\max_y \sum_i g(x, i, y_{i-2}, y_{i-1}, y_i)^\top \theta$$

Viterbi still works! The dependencies did not change. Definition: $\mathcal{Y}_i$ is the set of possible tags at position i.
Base case: $\pi(1, *, y) = g(x, 1, *, *, y)^\top \theta$.
For all $i \in \{2 \ldots n\}$, for all $u \in \mathcal{Y}_{i-1}$, $v \in \mathcal{Y}_i$:
$$\pi(i, u, v) = \max_{t \in \mathcal{Y}_{i-2}} \big[ \pi(i-1, t, u) + g(x, i, t, u, v)^\top \theta \big]$$
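A minimal Viterbi sketch using bigram (first-order) scores score(x, i, y_{i-1}, y_i), which matches the bigram factorization used in the learning slides below; the trigram recurrence above just adds one more loop over t ∈ Y_{i-2}:

```python
def viterbi(words, tagset, score):
    """Decode argmax_y sum_i score(words, i, y_{i-1}, y_i), with y_0 = '*'.
    Returns (best_score, best_tag_sequence). Runtime O(n * |T|^2)."""
    n = len(words)
    # pi[y]: best score of any tag prefix ending with tag y at the current position
    pi = {y: score(words, 1, "*", y) for y in tagset}
    backpointers = []
    for i in range(2, n + 1):
        new_pi, bp = {}, {}
        for v in tagset:
            best_u = max(tagset, key=lambda u: pi[u] + score(words, i, u, v))
            new_pi[v] = pi[best_u] + score(words, i, best_u, v)
            bp[v] = best_u
        pi = new_pi
        backpointers.append(bp)
    # Follow the backpointers from the best final tag.
    last = max(tagset, key=lambda y: pi[y])
    seq = [last]
    for bp in reversed(backpointers):
        seq.append(bp[seq[-1]])
    return pi[last], list(reversed(seq))
```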

Learning.
$$L(\theta) = \sum_i \log p_\theta(y^{(i)} \mid x^{(i)})$$
$$\nabla L^{(i)}(\theta) = f(x^{(i)}, y^{(i)}) - \sum_{y'} p_\theta(y' \mid x^{(i)})\, f(x^{(i)}, y')$$
How do we compute the second term?

Learning. Let's look at bigram features only:
$$\sum_y p_\theta(y \mid x) f(x, y) = \sum_y p_\theta(y \mid x) \sum_i g(x, i, y_{i-1}, y_i)$$
$$= \sum_i \sum_{a,b} \sum_{y:\, y_{i-1}=a,\, y_i=b} p_\theta(y \mid x)\, g(x, i, a, b)$$
$$= \sum_i \sum_{a,b} g(x, i, a, b) \sum_{y:\, y_{i-1}=a,\, y_i=b} p_\theta(y \mid x)$$
$$= \sum_i \sum_{a,b} g(x, i, a, b)\, q_i(a, b)$$
The q terms are the probability that the tag sequence has a and b in positions i−1 and i.

Learning. If we compute the q terms efficiently, then we can compute the gradients and learn. The q terms are computed with a dynamic programming algorithm called forward-backward, which is also used in unsupervised parameter estimation for HMMs and is similar to Viterbi.

Forward-backward. Definitions:
$$\psi(y', y, j) = \exp(g(x, j, y', y)^\top \theta)$$
$$\psi(y_1, \ldots, y_m) = \prod_{j=1}^m \psi(y_{j-1}, y_j, j) = \prod_{j=1}^m \exp(g(x, j, y_{j-1}, y_j)^\top \theta) = \exp\Big(\sum_{j=1}^m g(x, j, y_{j-1}, y_j)^\top \theta\Big)$$
$$p(y_1, \ldots, y_m \mid x) = \frac{\psi(y_1, \ldots, y_m)}{\sum_{z_1, \ldots, z_m} \psi(z_1, \ldots, z_m)}$$

Forward-backward. What will be computed?
$$Z = \sum_{y_1, \ldots, y_m} \psi(y_1, \ldots, y_m)$$
$$\mu(j, a) = \sum_{y_1, \ldots, y_m :\, y_j = a} \psi(y_1, \ldots, y_m)$$
$$\mu(j, a, b) = \sum_{y_1, \ldots, y_m :\, y_j = a,\, y_{j+1} = b} \psi(y_1, \ldots, y_m)$$
$$q_j(a, b) = \frac{\mu(j, a, b)}{Z}$$

Forward-backward. What will be computed? $\alpha(j, y)$ is the sum of the scores of all paths that end at tag y in position j:
$$\alpha(j, y) = \sum_{y_1, \ldots, y_j :\, y_j = y}\ \prod_{k=1}^{j} \psi(y_{k-1}, y_k, k)$$
$\beta(j, y)$ is the sum of the scores of all paths that start at tag y in position j:
$$\beta(j, y) = \sum_{y_j, \ldots, y_m :\, y_j = y}\ \prod_{k=j+1}^{m} \psi(y_{k-1}, y_k, k)$$

α terms. Compute with dynamic programming:
$$\alpha(1, y) = \psi(*, y, 1)$$
for $j \in \{2 \ldots m\}$, $y \in \mathcal{Y}$:
$$\alpha(j, y) = \sum_{y' \in \mathcal{Y}} \alpha(j-1, y')\, \psi(y', y, j)$$
[Trellis over positions 1, 2, 3 with tags y1, y2, y3; the first column holds:]
$$\alpha(1, y_1) = \psi(*, y_1, 1), \quad \alpha(1, y_2) = \psi(*, y_2, 1), \quad \alpha(1, y_3) = \psi(*, y_3, 1)$$

α terms. [Trellis: the second column now holds α(2, y1), α(2, y2), α(2, y3).]
$$\alpha(2, y_1) = \alpha(1, y_1)\psi(y_1, y_1, 2) + \alpha(1, y_2)\psi(y_2, y_1, 2) + \alpha(1, y_3)\psi(y_3, y_1, 2)$$
$$= \psi(*, y_1, 1)\psi(y_1, y_1, 2) + \psi(*, y_2, 1)\psi(y_2, y_1, 2) + \psi(*, y_3, 1)\psi(y_3, y_1, 2)$$
$$\alpha(2, y_2) = \psi(*, y_1, 1)\psi(y_1, y_2, 2) + \psi(*, y_2, 1)\psi(y_2, y_2, 2) + \psi(*, y_3, 1)\psi(y_3, y_2, 2)$$
$$\alpha(2, y_3) = \psi(*, y_1, 1)\psi(y_1, y_3, 2) + \psi(*, y_2, 1)\psi(y_2, y_3, 2) + \psi(*, y_3, 1)\psi(y_3, y_3, 2)$$

α terms. The distributive law works. [Trellis: the third column now holds α(3, y1), α(3, y3), ...]
$$\alpha(3, y_1) = \alpha(2, y_1)\psi(y_1, y_1, 3) + \alpha(2, y_2)\psi(y_2, y_1, 3) + \alpha(2, y_3)\psi(y_3, y_1, 3)$$
Expanding:
$$= \psi(*, y_1, 1)\psi(y_1, y_1, 2)\psi(y_1, y_1, 3) + \psi(*, y_2, 1)\psi(y_2, y_1, 2)\psi(y_1, y_1, 3) + \psi(*, y_3, 1)\psi(y_3, y_1, 2)\psi(y_1, y_1, 3)$$
$$+ \psi(*, y_1, 1)\psi(y_1, y_2, 2)\psi(y_2, y_1, 3) + \psi(*, y_2, 1)\psi(y_2, y_2, 2)\psi(y_2, y_1, 3) + \psi(*, y_3, 1)\psi(y_3, y_2, 2)\psi(y_2, y_1, 3)$$
$$+ \psi(*, y_1, 1)\psi(y_1, y_3, 2)\psi(y_3, y_1, 3) + \psi(*, y_2, 1)\psi(y_2, y_3, 2)\psi(y_3, y_1, 3) + \psi(*, y_3, 1)\psi(y_3, y_3, 2)\psi(y_3, y_1, 3)$$
i.e. the sum over all nine length-3 paths that end in $y_1$.

β terms.
$$\beta(m, y) = 1$$
for $j \in \{m-1 \ldots 1\}$, $y \in \mathcal{Y}$:
$$\beta(j, y) = \sum_{y' \in \mathcal{Y}} \psi(y, y', j+1)\, \beta(j+1, y')$$
Use the distributive property again to get the final results:
$$Z = \sum_{y \in \mathcal{Y}} \alpha(m, y), \qquad \mu(j, a) = \alpha(j, a)\, \beta(j, a), \qquad \mu(j, a, b) = \alpha(j, a)\, \psi(a, b, j+1)\, \beta(j+1, b), \qquad q_j(a, b) = \frac{\mu(j, a, b)}{Z}$$
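A minimal sketch of the computations above, working directly with the potentials ψ(y', y, j) = exp(g(x, j, y', y)·θ), passed in as a function `psi`; for real sentences you would do this in log space to avoid overflow:

```python
def forward_backward(m, tagset, psi):
    """Compute alpha, beta, Z and the pairwise marginals q_j(a, b).
    psi(y_prev, y, j): potential for tag y at position j given tag y_prev at j-1
    (with y_prev = '*' at j = 1)."""
    # Forward: alpha[j][y] = sum of scores of all paths ending with tag y at position j.
    alpha = {1: {y: psi("*", y, 1) for y in tagset}}
    for j in range(2, m + 1):
        alpha[j] = {y: sum(alpha[j - 1][yp] * psi(yp, y, j) for yp in tagset)
                    for y in tagset}
    # Backward: beta[j][y] = sum of scores of all path suffixes starting with y at j.
    beta = {m: {y: 1.0 for y in tagset}}
    for j in range(m - 1, 0, -1):
        beta[j] = {y: sum(psi(y, yn, j + 1) * beta[j + 1][yn] for yn in tagset)
                   for y in tagset}
    Z = sum(alpha[m][y] for y in tagset)
    # q[j][(a, b)] = P(y_j = a, y_{j+1} = b | x) = alpha(j,a) psi(a,b,j+1) beta(j+1,b) / Z
    q = {j: {(a, b): alpha[j][a] * psi(a, b, j + 1) * beta[j + 1][b] / Z
             for a in tagset for b in tagset}
         for j in range(1, m)}
    return alpha, beta, Z, q
```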

Complexity. If the length of the sentence is m, the runtime is $O(m\,|\mathcal{Y}|^2)$. We can compute the gradient, apply SGD, and learn. Limitation: the features must decompose.

Structured perceptron. Similar to CRFs, but needs only Viterbi. For every training example (x, y): find the best ŷ according to the model (Viterbi); if ŷ differs from the gold y, update the weights by adding the features of y and subtracting the features of ŷ, i.e. $\theta \leftarrow \theta + f(x, y) - f(x, \hat{y})$. Regularization is needed (averaged perceptron).
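A minimal sketch of the update loop with averaging; `decode` would be the Viterbi sketch above and `feature_fn` the global feature function, both assumed helpers:

```python
from collections import Counter

def structured_perceptron(data, decode, feature_fn, epochs=5):
    """data: list of (words, gold_tags) pairs.
    decode(words, theta): best tag sequence under the current weights (Viterbi).
    feature_fn(words, tags): global feature vector f(x, y) as a Counter.
    Returns the averaged weight vector."""
    theta, total, steps = Counter(), Counter(), 0
    for _ in range(epochs):
        for words, gold in data:
            predicted = decode(words, theta)
            if predicted != gold:
                # theta <- theta + f(x, y_gold) - f(x, y_predicted)
                for k, v in feature_fn(words, gold).items():
                    theta[k] += v
                for k, v in feature_fn(words, predicted).items():
                    theta[k] -= v
            steps += 1
            for k, v in theta.items():       # accumulate for averaging
                total[k] += v
    return Counter({k: v / steps for k, v in total.items()})
```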

Global vs. local models.
$$p_L(y_i \mid x, y_1 \ldots y_{i-1}) = \frac{\exp(s(x, y_1 \ldots y_i))}{Z_L(x, y_1 \ldots y_{i-1})}, \qquad Z_L(x, y_1 \ldots y_{i-1}) = \sum_t \exp(s(x, y_1 \ldots y_{i-1}, t))$$
$$p_L(y_1 \ldots y_n \mid x) = \frac{\exp \sum_{i=1}^n s(x, y_1 \ldots y_i)}{\prod_{i=1}^n Z_L(x, y_1 \ldots y_{i-1})}$$
$$p_G(y_1 \ldots y_n \mid x) = \frac{\exp \sum_{i=1}^n s(x, y_1 \ldots y_i)}{Z_G(x)}, \qquad Z_G(x) = \sum_{y_1 \ldots y_n} \exp \sum_{i=1}^n s(x, y_1 \ldots y_i)$$
It can be shown that some distributions can only be expressed with $p_G$.

Graphical models. Viterbi and forward-backward are instances of the max-product and sum-product algorithms for inference in graphical models (covered in an advanced ML class). [Diagrams: an HMM with hidden states $Y_1 \ldots Y_n$ emitting $X_1 \ldots X_n$; an MEMM with $Y_1 \ldots Y_n$ conditioned on $x_{1:n}$; a linear-chain CRF with $Y_1 \ldots Y_n$ conditioned on $x_{1:n}$.] For the CRF:
$$p(y \mid x) = \frac{1}{Z} \prod_{i=1}^n \psi(x, y_{i-1}, y_i)$$

Deep learning models for tagging

Deep learning models. We now have a more powerful learning model. Can we make things better? Simpler?

Ling et al., 2015: a simple POS-tagging model.
$x_i$: one-hot representation of word i
$e_i = W^{emb} x_i$: word embedding
$h^f_i = \mathrm{LSTM}(h^f_{i-1}, e_i)$, $h^b_i = \mathrm{LSTM}(h^b_{i+1}, e_i)$
$c_i = \sigma(W [h^f_i; h^b_i] + b)$
$p(y_i \mid x) = \mathrm{softmax}(W^s c_i + b)$
$L(\theta) = \sum_i \log p(y_i = y^\star_i \mid x)$
All tag decisions are independent! This works OK for POS tagging (96.97% accuracy, compared to 97.32% with a linear CRF), but it doesn't work for NER tagging. Why?
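A PyTorch sketch of a model in this family: a bidirectional LSTM encoder with an independent softmax over tags at every position (layer sizes, names, and the sigmoid hidden layer are illustrative choices, not taken from the paper):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, n_tags, emb_dim=100, hidden_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)            # e_i = W_emb x_i
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)                 # h_i^f and h_i^b
        self.hidden = nn.Linear(2 * hidden_dim, hidden_dim)     # c_i
        self.out = nn.Linear(hidden_dim, n_tags)                # per-tag scores

    def forward(self, word_ids):                 # word_ids: (batch, seq_len)
        h, _ = self.lstm(self.emb(word_ids))     # (batch, seq_len, 2 * hidden_dim)
        c = torch.sigmoid(self.hidden(h))
        return self.out(c)                       # (batch, seq_len, n_tags) logits

# Each position is classified independently, so training is just a sum of
# per-token cross-entropy terms:
# loss = nn.CrossEntropyLoss()(logits.view(-1, n_tags), gold_tags.view(-1))
```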

Bi-RNNs (Bi-LSTMs) provide a representation of a word in context, and have become the de facto standard for in-context word representations in NLP tasks.

Greedy taggers (Lample et al., 2016). A history is a tuple $\langle t_1 \ldots t_{i-1}, w, i \rangle$. Predict $t_i$ from the history:
$$p(t_i \mid x) \propto \exp(f(x, t_i)^\top \theta), \qquad f(x, t_i)^\top \theta = \mathrm{score}(x, t_i)$$
Just replace the log-linear function with a non-linear neural network: we need to define a neural network that reads the history and produces a label.

Neural structured prediction.
$$s(x, y) = f(x, y)^\top \theta = \sum_i g(x, i, y_{i-1}, y_i)^\top \theta = s(x, 1, y_1, y_2) + s(x, 2, y_2, y_3) + s(x, 3, y_3, y_4)$$
Replace the linear score with a non-linear neural network:
$$s(x, y) = \sum_i \mathrm{NN}(x, i, y_{i-1}, y_i)$$

Neural structured prediction. What has changed? Decoding: as long as the neural network scores decompose, we can still apply Viterbi. Learning:
$$p_\theta(y \mid x) = \frac{\exp\big(\sum_i \mathrm{NN}_\theta(x, i, y_{i-1}, y_i)\big)}{\sum_{y'} \exp\big(\sum_i \mathrm{NN}_\theta(x, i, y'_{i-1}, y'_i)\big)}, \qquad L^{(i)}(\theta) = \log p_\theta(y^{(i)} \mid x^{(i)})$$

Neural structured prediction. How do we compute the denominator (the partition function)? Implement dynamic programming as part of the neural network: all dynamic programming operations are differentiable (+ and ×). In the structured perceptron things are slightly simpler: compute the best structure (outside of the network) and define the loss $\max(0,\ \max_{y'} \mathrm{score}(x, y') - \mathrm{score}(x, y))$.
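A sketch of that idea for a first-order model: the partition function is the forward algorithm run in log space over precomputed neural scores, and since it only uses additions and logsumexp it can sit inside the training loss (the tensor layout is an assumption for illustration):

```python
import torch

def log_partition(start_scores, transition_scores):
    """log Z(x) for a first-order (bigram) model.
    start_scores: (T,) tensor with the score of each tag at position 1.
    transition_scores: (n - 1, T, T) tensor; transition_scores[i, a, b] is the
    neural score of tag a at position i + 1 followed by tag b at position i + 2.
    Every operation here is differentiable, so gradients flow through the DP."""
    log_alpha = start_scores
    for step in transition_scores:                          # step: (T, T)
        # log_alpha[b] = logsumexp_a(log_alpha[a] + step[a, b])
        log_alpha = torch.logsumexp(log_alpha.unsqueeze(1) + step, dim=0)
    return torch.logsumexp(log_alpha, dim=0)
```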

Bi-LSTM CRF for NER (Lample et al., 2016).
$x_i$: one-hot representation of word i
$e_i = W^{emb} x_i$: word embedding
$l_i = \mathrm{LSTM}(l_{i-1}, e_i)$, $r_i = \mathrm{LSTM}(r_{i+1}, e_i)$
$c_i = [l_i; r_i]$
$p_i = W^{(proj)} (W c_i + b)$
$$s(x, y) = \sum_{i=1}^{n} p_{i, y_i} + \sum_{i=0}^{n} A_{y_i, y_{i+1}}$$
A is a learned parameter matrix of size $(\#\text{tags})^2$. SOTA or close to it on POS tagging, NER, and other tasks.
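The sequence score from this slide is straightforward to compute given the emission scores p and the transition matrix A; this sketch gives A two extra rows/columns for the start and stop tags implied by the i = 0 and i = n terms of the sum:

```python
import torch

def sequence_score(p, A, y, start_id, stop_id):
    """s(x, y) = sum_{i=1..n} p[i, y_i] + sum_{i=0..n} A[y_i, y_{i+1}],
    with y_0 = start_id and y_{n+1} = stop_id.
    p: (n, n_tags) per-token scores from the bi-LSTM; y: (n,) gold tag ids."""
    n = p.shape[0]
    emissions = p[torch.arange(n), y].sum()
    padded = torch.cat([torch.tensor([start_id]), y, torch.tensor([stop_id])])
    transitions = A[padded[:-1], padded[1:]].sum()
    return emissions + transitions
```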

State of the art. To outperform the state of the art: regularization techniques, character-based information, and pre-training with a lot of data.

Summary:

             Tagging                      Parsing
Generative   HMMs                         PCFGs
Greedy       Log-linear history-based     Transition-based parsing
Global       Viterbi, forward-backward    CKY, inside-outside