Natural Language Processing
Global linear models
Based on slides from Michael Collins
Globally-normalized models
Why do we decompose into a sequence of decisions? Can we directly estimate the probability of an entire sequence?
p_\theta(y \mid x) = \frac{\exp(f(x, y)^\top \theta)}{\sum_{y' \in \mathcal{Y}} \exp(f(x, y')^\top \theta)}
\mathcal{Y}: all tag sequences for x
Global linear models
Motivations: optimize directly what you care about; flexibility in the feature function; avoid the label bias problem.
Question: how do we compute the denominator?
Label bias problem
Consider a history-based model for names of two tokens, each of which is either PERSON or LOC.
States: b-person, e-person, b-loc, e-loc, other
Label bias problem
Corpus: Harvey Ford (person 9 times, location 1 time); Harvey Park (location 9 times, person 1 time); Myrtle Ford (person 9 times, location 1 time); Myrtle Park (location 9 times, person 1 time).
[State diagram: other -> b-person -> e-person and other -> b-loc -> e-loc]
The second word provides good information.
Label bias problem
Conditional probabilities:
p(b-person | other, w = Harvey) = 0.5, p(b-loc | other, w = Harvey) = 0.5
p(b-person | other, w = Myrtle) = 0.5, p(b-loc | other, w = Myrtle) = 0.5
p(e-person | b-person, w = Ford) = 1, p(e-person | b-person, w = Park) = 1
p(e-loc | b-loc, w = Ford) = 1, p(e-loc | b-loc, w = Park) = 1
[State diagram: other -> b-person -> e-person and other -> b-loc -> e-loc]
The information from the second word is lost.
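To make the effect concrete, here is a minimal Python sketch of the locally-normalized model above (the probabilities are the hypothetical ones from the slide). Because the distribution out of b-person or b-loc must sum to one over a single successor state, the evidence from the second word cannot change the score:

```python
# Hypothetical MEMM from the slide: locally-normalized transition probabilities.
# From "other", the first word decides the split; from b-person/b-loc there is
# only one possible next state, so its probability is 1 regardless of the word.
p_first = {("Harvey", "b-person"): 0.5, ("Harvey", "b-loc"): 0.5,
           ("Myrtle", "b-person"): 0.5, ("Myrtle", "b-loc"): 0.5}
p_second = {("b-person", "Ford", "e-person"): 1.0, ("b-person", "Park", "e-person"): 1.0,
            ("b-loc", "Ford", "e-loc"): 1.0, ("b-loc", "Park", "e-loc"): 1.0}

def memm_prob(w1, w2, tag_seq):
    """Probability of a tag sequence under the locally-normalized model."""
    t1, t2 = tag_seq
    return p_first[(w1, t1)] * p_second[(t1, w2, t2)]

# The evidence from "Ford" (a person 9 times out of 10) is lost:
# both readings score 0.5.
print(memm_prob("Harvey", "Ford", ("b-person", "e-person")))  # 0.5
print(memm_prob("Harvey", "Ford", ("b-loc", "e-loc")))        # 0.5
```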
Label bias problem
[Lattice figure: states 1-5 across observations 1-4, with local transition probabilities.]
What the local transition probabilities say: state 1 prefers to go to state 2; state 2 prefers to stay in state 2.
Label bias problem
[Same lattice figure.]
Probability of path 1 -> 1 -> 1 -> 1: 0.4 x 0.45 x 0.5 = 0.09
Label bias problem
[Same lattice figure.]
Probability of path 2 -> 2 -> 2 -> 2: 0.2 x 0.3 x 0.3 = 0.018
Other paths: 1 -> 1 -> 1 -> 1: 0.09
Label bias problem
[Same lattice figure.]
Probability of path 1 -> 2 -> 1 -> 2: 0.6 x 0.2 x 0.5 = 0.06
Other paths: 1 -> 1 -> 1 -> 1: 0.09; 2 -> 2 -> 2 -> 2: 0.018
Label bias problem
[Same lattice figure.]
Probability of path 1 -> 1 -> 2 -> 2: 0.4 x 0.55 x 0.3 = 0.066
Other paths: 1 -> 1 -> 1 -> 1: 0.09; 2 -> 2 -> 2 -> 2: 0.018; 1 -> 2 -> 1 -> 2: 0.06
Label bias problem
[Same lattice figure.]
Most likely path: 1 -> 1 -> 1 -> 1, even though locally state 1 seems to want to go to state 2 and state 2 wants to remain in state 2.
Label bias problem
Locally-normalized models prefer states that have low-entropy next-state distributions.
Global linear models
Main difference: we don't look at a history of decisions; instead, we define features over the full structure. More flexible! We will see this in parsing. But we directly consider the exponential output space.
Global linear models
Contain three components: a feature function f mapping a pair (x, y) to a feature vector f(x, y); a generating function GEN enumerating all candidate outputs; a parameter vector \theta.
F(x) = \arg\max_{y \in GEN(x)} f(x, y)^\top \theta
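As a toy illustration (not part of the original slides), F(x) can be written directly from the definition by brute-force enumeration of GEN(x). Here f is assumed to return a dictionary of feature counts and theta a dictionary of weights:

```python
from itertools import product

def predict(x, tags, f, theta):
    """F(x) = argmax over GEN(x) of f(x, y)^T theta, with GEN(x) = T^n
    enumerated by brute force (only feasible for tiny examples)."""
    best_y, best_score = None, float("-inf")
    for y in product(tags, repeat=len(x)):            # GEN(x) = T^n
        score = sum(theta.get(feat, 0.0) * count      # f(x, y)^T theta
                    for feat, count in f(x, y).items())
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```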
Tagging: GEN
Inputs: sentences x = w_1 ... w_n. Tag set: T. GEN(x) = T^n.
In other setups, GEN(x) could be: all parse trees; the top-K parse trees produced by a weaker model; all translations.
Tagging: feature function
A history h is a 4-tuple \langle t_{i-2}, t_{i-1}, w, i \rangle:
t_{i-2}, t_{i-1}: the previous two tags
w: the input sentence
i: the index of the word being tagged
Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere.
t_{i-2}, t_{i-1} = DT, JJ; w = Hispaniola, ..., Hemisphere; i = 6
Local features
For every history/tag pair (h, t), g_s(h, t) for s = 1, ..., d are local features for tagging decision t given history h:
g_{100}(h, t) = 1 if w_i = base and t = VB, 0 otherwise
g_{101}(h, t) = 1 if w_i ends with "ing" and t = VBG, 0 otherwise
g_{102}(h, t) = 1 if (t_{i-2}, t_{i-1}, t) = (DT, JJ, VB), 0 otherwise
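A small sketch of what such local features might look like in code; the feature names and the history layout (t_{i-2}, t_{i-1}, sentence, i) are illustrative assumptions:

```python
def g(h, t):
    """Local indicator features for history h = (t2, t1, words, i) and tag t.
    The string keys stand in for the numbered features g_100, g_101, g_102."""
    t2, t1, words, i = h
    feats = {}
    if words[i] == "base" and t == "VB":
        feats["w_i=base & t=VB"] = 1
    if words[i].endswith("ing") and t == "VBG":
        feats["w_i ends with ing & t=VBG"] = 1
    if (t2, t1, t) == ("DT", "JJ", "VB"):
        feats["(t_{i-2},t_{i-1},t)=(DT,JJ,VB)"] = 1
    return feats
```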
Multiple local features
Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/NN

i | t_{i-2} | t_{i-1} | w               | tag
1 | *       | *       | Hispaniola, ... | NNP
2 | *       | NNP     | Hispaniola, ... | RB
3 | NNP     | RB      | Hispaniola, ... | VB
4 | RB      | VB      | Hispaniola, ... | DT
5 | VB      | DT      | Hispaniola, ... | JJ
6 | DT      | JJ      | Hispaniola, ... | NN
Global feature function
f_j(x, y) = \sum_i g_j(h_i, t_i), i.e. f(x, y) = \sum_i g(h_i, t_i)
Features are now counts rather than binary values, e.g. how many times a word ending in "ing" was tagged as VBG.
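One possible way to compute the global feature vector from the local features, again assuming the dictionary-of-counts representation sketched above:

```python
from collections import Counter

def global_features(words, tags, g):
    """f(x, y) = sum_i g(h_i, t_i): feature values become counts over the
    whole sequence (g is the local feature function sketched earlier)."""
    f = Counter()
    for i in range(len(words)):
        t2 = tags[i - 2] if i >= 2 else "*"
        t1 = tags[i - 1] if i >= 1 else "*"
        f.update(g((t2, t1, words, i), tags[i]))
    return f
```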
Global log-linear model
p_\theta(y \mid x) = \frac{\exp(f(x, y)^\top \theta)}{\sum_{y' \in GEN(x)} \exp(f(x, y')^\top \theta)} = \frac{\exp\left(\sum_i g(x, i, y_{i-2}, y_{i-1}, y_i)^\top \theta\right)}{\sum_{y' \in GEN(x)} \exp\left(\sum_i g(x, i, y'_{i-2}, y'_{i-1}, y'_i)^\top \theta\right)}
Decoding
As long as the feature function decomposes locally, we can still apply Viterbi; otherwise we can do greedy/beam decoding.
Decoding
Find the structure that maximizes the linear score:
\arg\max_y p_\theta(y \mid x) = \arg\max_y \frac{\exp(f(x, y)^\top \theta)}{\sum_{y' \in GEN(x)} \exp(f(x, y')^\top \theta)}
= \arg\max_y \exp(f(x, y)^\top \theta)
= \arg\max_y f(x, y)^\top \theta
= \arg\max_y \sum_i g(x, i, y_{i-2}, y_{i-1}, y_i)^\top \theta
Viterbi still works!
The dependencies did not change. Definition: Y_i is the set of possible tags in position i.
Base: \pi(1, *, y) = g(x, 1, *, *, y)^\top \theta
For all i \in \{2 \dots n\}, for all u \in Y_{i-1}, v \in Y_i:
\pi(i, u, v) = \max_{t \in Y_{i-2}} \left[ \pi(i-1, t, u) + g(x, i, t, u, v)^\top \theta \right]
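For readability, here is a sketch of the bigram version of this recursion (the trigram case on the slide just adds one more tag to the max). The callable score(words, i, u, v) is an assumed interface standing in for g(x, i, y_{i-1}, y_i)^\top \theta:

```python
def viterbi(words, tags, score):
    """Bigram Viterbi for the global linear tagger."""
    n = len(words)
    pi = [{"*": 0.0}] + [dict() for _ in range(n)]   # pi[i][v]: best score of a prefix ending in tag v
    back = {}
    for i in range(1, n + 1):
        for v in tags:
            scores = {u: pi[i - 1][u] + score(words, i, u, v) for u in pi[i - 1]}
            best_u = max(scores, key=scores.get)
            pi[i][v] = scores[best_u]
            back[(i, v)] = best_u
    y = [max(pi[n], key=pi[n].get)]                  # best final tag
    for i in range(n, 1, -1):                        # follow back-pointers
        y.append(back[(i, y[-1])])
    return list(reversed(y))
```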
Learning
L(\theta) = \sum_i \log p_\theta(y_i \mid x_i)
\nabla_\theta L_i(\theta) = f(x_i, y_i) - \sum_{y'} p_\theta(y' \mid x_i) f(x_i, y')
How do we compute the second term?
Learning
Let's look at bigram features only:
\sum_y p_\theta(y \mid x) f(x, y) = \sum_y p_\theta(y \mid x) \sum_i g(x, i, y_{i-1}, y_i)
= \sum_i \sum_{a,b} \sum_{y : y_{i-1}=a, y_i=b} p_\theta(y \mid x)\, g(x, i, y_{i-1}, y_i)
= \sum_i \sum_{a,b} g(x, i, a, b) \sum_{y : y_{i-1}=a, y_i=b} p_\theta(y \mid x)
= \sum_i \sum_{a,b} g(x, i, a, b)\, q_i(a, b)
The q terms are the probabilities that the tag sequence has tags a and b in positions i-1 and i.
Learning
If we compute the q terms efficiently, then we can compute the gradients and learn. The q terms are computed with a dynamic programming algorithm called forward-backward, which is also used in unsupervised parameter estimation for HMMs. It is similar to Viterbi.
Forward-backward
Definitions:
\psi(y', y, j) = \exp(g(x, j, y', y)^\top \theta)
\psi(y_1, \dots, y_m) = \prod_{j=1}^m \psi(y_{j-1}, y_j, j) = \prod_{j=1}^m \exp(g(x, j, y_{j-1}, y_j)^\top \theta) = \exp\left(\sum_{j=1}^m g(x, j, y_{j-1}, y_j)^\top \theta\right)
p(y_1, \dots, y_m \mid x) = \frac{\psi(y_1, \dots, y_m)}{\sum_{z_1, \dots, z_m} \psi(z_1, \dots, z_m)}
Forward-backward
What will be computed?
Z = \sum_{y_1, \dots, y_m} \psi(y_1, \dots, y_m)
\mu(j, a) = \sum_{y_1, \dots, y_m : y_j = a} \psi(y_1, \dots, y_m)
\mu(j, a, b) = \sum_{y_1, \dots, y_m : y_j = a, y_{j+1} = b} \psi(y_1, \dots, y_m)
q_j(a, b) = \frac{\mu(j, a, b)}{Z}
Forward-backward
What will be computed?
Sum of the scores of all paths that end at tag y in position j:
\alpha(j, y) = \sum_{y_1, \dots, y_j : y_j = y} \prod_{k=1}^{j} \psi(y_{k-1}, y_k, k)
Sum of the scores of all paths that start at tag y in position j:
\beta(j, y) = \sum_{y_j, \dots, y_m : y_j = y} \prod_{k=j}^{m-1} \psi(y_k, y_{k+1}, k+1)
α terms
Compute with dynamic programming:
\alpha(1, y) = \psi(*, y, 1)
For j \in \{2 \dots m\}, y \in Y: \alpha(j, y) = \sum_{y' \in Y} \alpha(j-1, y')\, \psi(y', y, j)
[Trellis with tags y_1, y_2, y_3 over positions 1, 2, 3; column 1 holds α(1, y_1), α(1, y_2), α(1, y_3).]
\alpha(1, y_1) = \psi(*, y_1, 1); \alpha(1, y_2) = \psi(*, y_2, 1); \alpha(1, y_3) = \psi(*, y_3, 1)
α terms
[Same trellis, with column 2 filled in.]
\alpha(2, y_1) = \alpha(1, y_1)\psi(y_1, y_1, 2) + \alpha(1, y_2)\psi(y_2, y_1, 2) + \alpha(1, y_3)\psi(y_3, y_1, 2)
= \psi(*, y_1, 1)\psi(y_1, y_1, 2) + \psi(*, y_2, 1)\psi(y_2, y_1, 2) + \psi(*, y_3, 1)\psi(y_3, y_1, 2)
\alpha(2, y_2) = \psi(*, y_1, 1)\psi(y_1, y_2, 2) + \psi(*, y_2, 1)\psi(y_2, y_2, 2) + \psi(*, y_3, 1)\psi(y_3, y_2, 2)
\alpha(2, y_3) = \psi(*, y_1, 1)\psi(y_1, y_3, 2) + \psi(*, y_2, 1)\psi(y_2, y_3, 2) + \psi(*, y_3, 1)\psi(y_3, y_3, 2)
α terms
The distributive law works.
[Same trellis, with column 3 filled in.]
\alpha(3, y_1) = \alpha(2, y_1)\psi(y_1, y_1, 3) + \alpha(2, y_2)\psi(y_2, y_1, 3) + \alpha(2, y_3)\psi(y_3, y_1, 3)
= \psi(*, y_1, 1)\psi(y_1, y_1, 2)\psi(y_1, y_1, 3) + \psi(*, y_2, 1)\psi(y_2, y_1, 2)\psi(y_1, y_1, 3) + \psi(*, y_3, 1)\psi(y_3, y_1, 2)\psi(y_1, y_1, 3)
+ \psi(*, y_1, 1)\psi(y_1, y_2, 2)\psi(y_2, y_1, 3) + \psi(*, y_2, 1)\psi(y_2, y_2, 2)\psi(y_2, y_1, 3) + \psi(*, y_3, 1)\psi(y_3, y_2, 2)\psi(y_2, y_1, 3)
+ \psi(*, y_1, 1)\psi(y_1, y_3, 2)\psi(y_3, y_1, 3) + \psi(*, y_2, 1)\psi(y_2, y_3, 2)\psi(y_3, y_1, 3) + \psi(*, y_3, 1)\psi(y_3, y_3, 2)\psi(y_3, y_1, 3)
β terms
\beta(m, y) = 1
For j \in \{m-1 \dots 1\}, y \in Y: \beta(j, y) = \sum_{y' \in Y} \beta(j+1, y')\, \psi(y, y', j+1)
Use the distributive property again to get the final results:
Z = \sum_{y \in Y} \alpha(m, y)
\mu(j, a) = \alpha(j, a)\, \beta(j, a)
\mu(j, a, b) = \alpha(j, a)\, \psi(a, b, j+1)\, \beta(j+1, b)
q_j(a, b) = \frac{\mu(j, a, b)}{Z}
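The recursions above translate almost line by line into code. In this sketch, psi is an assumed callable with psi(a, b, j) = exp(g(x, j, a, b)^\top \theta), where a = "*" when j = 1:

```python
def forward_backward(m, tags, psi):
    """Forward-backward for the chain model: returns Z and the pairwise
    marginals q_j(a, b) needed for the gradient."""
    alpha = {(1, y): psi("*", y, 1) for y in tags}
    for j in range(2, m + 1):
        for y in tags:
            alpha[(j, y)] = sum(alpha[(j - 1, yp)] * psi(yp, y, j) for yp in tags)
    beta = {(m, y): 1.0 for y in tags}
    for j in range(m - 1, 0, -1):
        for y in tags:
            beta[(j, y)] = sum(beta[(j + 1, yp)] * psi(y, yp, j + 1) for yp in tags)
    Z = sum(alpha[(m, y)] for y in tags)
    q = {(j, a, b): alpha[(j, a)] * psi(a, b, j + 1) * beta[(j + 1, b)] / Z
         for j in range(1, m) for a in tags for b in tags}
    return Z, q
```

(In practice the same computation is done in log space to avoid numerical underflow.)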
Complexity
If the length of the sentence is m, the runtime is O(m |Y|^2). We can compute the gradient, so we can apply SGD and learn.
Limitation: the features must decompose.
Structured perceptron
Similar to CRFs, but needs only Viterbi. For every training example (x, y): find the best y' according to the model (Viterbi); if y' differs from the gold y, update the weights by adding the features of y and subtracting the features of y'. Regularization is needed (averaged perceptron). A sketch of the training loop follows.
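A minimal sketch of that loop, assuming a feature function f(x, y) returning a dictionary of counts and a Viterbi decoder with the interface used earlier (averaging and any regularization are omitted):

```python
def structured_perceptron(data, tags, f, viterbi_decode, epochs=5):
    """Structured perceptron training sketch. f and viterbi_decode are
    assumed interfaces; data is a list of (x, gold_tag_sequence) pairs."""
    theta = {}
    for _ in range(epochs):
        for x, y in data:
            y_hat = viterbi_decode(x, tags, theta)
            if list(y_hat) != list(y):
                # Add gold features, subtract predicted features.
                for feat, c in f(x, y).items():
                    theta[feat] = theta.get(feat, 0.0) + c
                for feat, c in f(x, y_hat).items():
                    theta[feat] = theta.get(feat, 0.0) - c
    return theta
```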
Global vs. local models
p_L(y_i \mid x, y_1 \dots y_{i-1}) = \frac{\exp(s(x, y_1 \dots y_i))}{Z_L(x, y_1 \dots y_{i-1})}
Z_L(x, y_1 \dots y_{i-1}) = \sum_t \exp(s(x, y_1 \dots y_{i-1}, t))
p_L(y_1 \dots y_n \mid x) = \frac{\exp\left(\sum_{i=1}^n s(x, y_1 \dots y_i)\right)}{\prod_{i=1}^n Z_L(x, y_1 \dots y_{i-1})}
p_G(y_1 \dots y_n \mid x) = \frac{\exp\left(\sum_{i=1}^n s(x, y_1 \dots y_i)\right)}{Z_G(x)}
Z_G(x) = \sum_{y_1 \dots y_n} \exp\left(\sum_{i=1}^n s(x, y_1 \dots y_i)\right)
It can be shown that some distributions can only be expressed with p_G.
Graphical models
Viterbi and forward-backward are instances of the max-product and sum-product algorithms for inference in graphical models (covered in an advanced ML class).
[Graphical model diagrams: HMM with states Y_1 ... Y_n and observations X_1 ... X_n; MEMM with Y_1 ... Y_n conditioned on x_{1:n}; CRF as an undirected chain over Y_1 ... Y_n conditioned on x_{1:n}.]
CRF: p(y) = \frac{1}{Z} \prod_{i=1}^n \psi(x, y_{i-1}, y_i)
Deep learning models for tagging
Deep learning models
We now have a more powerful learning model. Can we make things better? Simpler?
Ling et al., 2015
A simple POS-tagging model:
x_i: one-hot representation for word i
e_i = W^{emb} x_i: word embedding
h^f_i = LSTM(h^f_{i-1}, e_i)
h^b_i = LSTM(h^b_{i+1}, e_i)
c_i = \sigma(W [h^f_i; h^b_i] + b)
p(y_i \mid x) = \mathrm{softmax}(W^s c_i + b)
L(\theta) = \sum_i \log p(y_i = y^*_i \mid x), where y^*_i is the gold tag.
All tag decisions are independent!! Works OK for POS tagging: 96.97% accuracy, compared to 97.32% with a linear CRF. Doesn't work for NER tagging. Why?
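A rough PyTorch sketch of this kind of independent-decision tagger; the layer sizes are illustrative assumptions and the training loop is omitted:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Bi-directional LSTM followed by a per-position tag classifier."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, word_ids):              # (batch, seq_len)
        e = self.emb(word_ids)                # (batch, seq_len, emb_dim)
        c, _ = self.lstm(e)                   # (batch, seq_len, 2*hidden_dim)
        return self.out(c)                    # per-position tag scores

# Training minimizes per-token cross-entropy, i.e. each tag decision is an
# independent classification given the Bi-LSTM context vector c_i.
model = BiLSTMTagger(vocab_size=10000, num_tags=45)
loss_fn = nn.CrossEntropyLoss()
```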
Bi-RNNs (Bi-LSTMs) Provide a representation of a word in context Have become the de-facto standard for word representation in context for NLP tasks
Lample et al., 2016
Greedy taggers: a history x is a tuple \langle t_1, \dots, t_{i-1}, w, i \rangle. Predict t_i from the history: p(t_i \mid x) \propto \exp(f(x, t_i)^\top \theta). Just replace the log-linear score f(x, t_i)^\top \theta with a non-linear neural network score(x, t_i). We need to define a neural network that reads the history and produces a label.
Neural structured prediction
s(x, y) = f(x, y)^\top \theta = \sum_i g(x, i, y_{i-1}, y_i)^\top \theta = s(x, 1, y_1, y_2) + s(x, 2, y_2, y_3) + s(x, 3, y_3, y_4)   (for a sequence y_1 y_2 y_3 y_4)
Replace the linear score with a non-linear neural network:
s(x, y) = \sum_i NN(x, i, y_{i-1}, y_i)
Neural structured prediction
What has changed?
Decoding: as long as the neural network scores decompose, we can still apply Viterbi.
Learning:
p_\theta(y \mid x) = \frac{\exp\left(\sum_i NN_\theta(x, i, y_{i-1}, y_i)\right)}{\sum_{y'} \exp\left(\sum_i NN_\theta(x, i, y'_{i-1}, y'_i)\right)}
L^{(i)}(\theta) = \log p_\theta(y^{(i)} \mid x^{(i)})
Neural structured prediction
How do we compute the denominator (partition function)? Implement the dynamic program as part of the neural network: all of its operations (+ and x) are differentiable.
In the structured perceptron things are slightly simpler: compute the best structure (outside of the network) and define the loss \max(0, \max_{y'} \mathrm{score}(x, y') - \mathrm{score}(x, y)).
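One way to realize that idea, sketched in PyTorch for the emission + transition decomposition used on the next slide: the forward algorithm, run in log space with only differentiable operations, gives log Z and can sit inside the computation graph:

```python
import torch

def log_partition(emissions, transitions):
    """Log Z via the forward algorithm in log space.
    emissions: (seq_len, num_tags) per-position tag scores;
    transitions: (num_tags, num_tags) scores for (y_{i-1}, y_i).
    Start/stop scores are assumed to be folded into these tensors."""
    alpha = emissions[0]                                   # position 0
    for i in range(1, emissions.size(0)):
        # alpha[b] = logsumexp_a( alpha[a] + transitions[a, b] ) + emissions[i, b]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[i]
    return torch.logsumexp(alpha, dim=0)                   # log Z
```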
Lample et al., 2016
Bi-LSTM CRF for NER:
x_i: one-hot representation for word i
e_i = W^{emb} x_i: word embedding
l_i = LSTM(l_{i-1}, e_i)
r_i = LSTM(r_{i+1}, e_i)
c_i = [l_i; r_i]
p_i = W^{proj} \sigma(W c_i + b)
s(x, y) = \sum_{i=1}^n p_{i, y_i} + \sum_{i=0}^n A_{y_i, y_{i+1}}
A is a learned parameter matrix (#tags^2). SOTA or close to it on POS tagging, NER, and other tasks.
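Given per-position scores p and the transition matrix A, the sequence score is a few lines of PyTorch; in this sketch the start/stop transitions from the slide's sum over i = 0 ... n are omitted for brevity:

```python
import torch

def sequence_score(p, A, y):
    """s(x, y) = sum_i p[i, y_i] + sum_i A[y_i, y_{i+1}] for a Bi-LSTM CRF.
    p: (n, num_tags) per-position scores from the Bi-LSTM;
    A: (num_tags, num_tags) learned transition matrix;
    y: LongTensor of n gold tag indices."""
    emit = p[torch.arange(len(y)), y].sum()
    trans = A[y[:-1], y[1:]].sum()
    return emit + trans

# Training maximizes log p(y|x) = sequence_score(p, A, y) - log_partition(p, A),
# using the differentiable forward algorithm sketched on the previous slide.
```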
State-of-the-art
To reach or outperform the state of the art: regularization techniques; character-based information; pre-training with a lot of data.
Summary

                        | Tagging          | Parsing
Generative              | HMMs             | PCFGs
Greedy                  |                  | Transition-based parsing
Log-linear, history-based | Viterbi        | CKY
Log-linear, global      | forward-backward | Inside-outside