Natural Language Processing

1 Natural Language Processing Global linear models Based on slides from Michael Collins

2 Globally-normalized models Why do we decompose into a sequence of decisions? Can we directly estimate the probability of an entire sequence?
p_\theta(y \mid x) = \frac{\exp(f(x, y)^\top \theta)}{\sum_{y' \in \mathcal{Y}} \exp(f(x, y')^\top \theta)}
where \mathcal{Y} is the set of all tag sequences for x.

3 Global linear models Motivations: optimize directly what you care about; flexibility in the feature function; avoid the label bias problem. Question: how do we compute the denominator?

4 Label bias problem Consider a history-based model for two-token names that are either PERSON or LOC. States: b-person, e-person, b-loc, e-loc, other.

5 Label bias problem Corpus: Harvey Ford (person 9 times, location 1 time); Harvey Park (location 9 times, person 1 time); Myrtle Ford (person 9 times, location 1 time); Myrtle Park (location 9 times, person 1 time). [Figure: state-transition diagram over other, b-person, e-person, b-loc, e-loc] The second word provides good information.

6 Label bias problem Conditional probabilities: p(b-person | other, w = Harvey) = 0.5, p(b-loc | other, w = Harvey) = 0.5, p(b-person | other, w = Myrtle) = 0.5, p(b-loc | other, w = Myrtle) = 0.5, p(e-person | b-person, w = Ford) = 1, p(e-person | b-person, w = Park) = 1, p(e-loc | b-loc, w = Ford) = 1, p(e-loc | b-loc, w = Park) = 1. [Figure: state-transition diagram over other, b-person, e-person, b-loc, e-loc] Information from the second word is lost.
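
To see the problem numerically, here is a tiny sketch (my own illustration using the probabilities above, not part of the slides) that scores both tag sequences for each name; because every local distribution already sums to one, the informative second word cannot change the ranking:

```python
# Minimal illustration of the label bias problem with the slide's probabilities.
# Locally normalized (history-based) conditional probabilities:
p_first = {("b-person", "Harvey"): 0.5, ("b-loc", "Harvey"): 0.5,
           ("b-person", "Myrtle"): 0.5, ("b-loc", "Myrtle"): 0.5}
# Once the first tag is chosen, the next transition is deterministic,
# regardless of whether the second word is "Ford" or "Park":
p_second = {("e-person", "b-person", "Ford"): 1.0, ("e-person", "b-person", "Park"): 1.0,
            ("e-loc", "b-loc", "Ford"): 1.0, ("e-loc", "b-loc", "Park"): 1.0}

def sequence_prob(tags, words):
    """p(tags | words) under the locally normalized model."""
    (t1, t2), (w1, w2) = tags, words
    return p_first[(t1, w1)] * p_second[(t2, t1, w2)]

for words in [("Harvey", "Ford"), ("Harvey", "Park")]:
    person = sequence_prob(("b-person", "e-person"), words)
    loc = sequence_prob(("b-loc", "e-loc"), words)
    print(words, "PERSON:", person, "LOC:", loc)
# Both sequences get probability 0.5 for either second word:
# the informative second word cannot break the tie.
```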

7 Label bias problem [Figure: a lattice of five states over four observations] What the local transition probabilities say: state 1 prefers to go to state 2; state 2 prefers to stay in state 2.

8 Label bias problem [Figure: the same lattice] Probability of path 1->1->1->1: 0.4 x 0.45 x 0.5 = 0.09

9 Label bias problem [Figure: the same lattice] Probability of path 2->2->2->2: 0.2 x 0.3 x 0.3 = 0.018. Other paths: 1->1->1->1: 0.09

10 Label bias problem [Figure: the same lattice] Probability of path 1->2->1->2: 0.6 x 0.2 x 0.5 = 0.06. Other paths: 1->1->1->1: 0.09; 2->2->2->2: 0.018

11 Label bias problem [Figure: the same lattice] Probability of path 1->1->2->2: 0.4 x 0.55 x 0.3 = 0.066. Other paths: 1->1->1->1: 0.09; 2->2->2->2: 0.018; 1->2->1->2: 0.06

12 Label bias problem [Figure: the same lattice] Most likely path: 1->1->1->1, although locally it seems state 1 wants to go to state 2 and state 2 wants to remain in state 2.

13 Label bias problem The model prefers states that have low-entropy next-state distributions.

14 Global linear models Main difference: we don't look at a history of decisions; instead, we define features over the full structure. More flexible! We will see this in parsing. But we directly consider the exponential output space.

15 Global linear models A global linear model contains three components: a feature function f mapping a pair (x, y) to a feature vector f(x, y); a generating function GEN enumerating all candidate outputs; and a parameter vector \theta.
F(x) = \arg\max_{y \in GEN(x)} f(x, y)^\top \theta
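
As a minimal sketch of how the three components fit together (the feature function, GEN, and weights here are hypothetical arguments, and the brute-force search over GEN is only for illustration; real decoding uses Viterbi, as discussed below):

```python
import itertools

def gen(x, tagset):
    """GEN: enumerate all candidate tag sequences for sentence x (exponentially many!)."""
    return itertools.product(tagset, repeat=len(x))

def score(x, y, f, theta):
    """Linear score f(x, y) . theta, where f returns a dict of feature counts."""
    return sum(theta.get(name, 0.0) * value for name, value in f(x, y).items())

def predict(x, tagset, f, theta):
    """F(x) = argmax_{y in GEN(x)} f(x, y) . theta  (brute force, illustration only)."""
    return max(gen(x, tagset), key=lambda y: score(x, y, f, theta))
```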

16 Tagging: GEN Inputs: sentences x = w_1 ... w_n. Tag set: T. GEN(x) = T^n. In other setups, GEN may produce: all parse trees; the top-K parse trees produced by a weaker model; all translations.

17 Tagging: feature function A history h is a 4-tuple <t_{i-2}, t_{i-1}, w, i>: t_{i-2}, t_{i-1} are the previous two tags, w is the input sentence, and i is the index of the word being tagged. Example: Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere. Here t_{i-2}, t_{i-1} = DT, JJ; w = <Hispaniola, ..., Hemisphere>; i = 6.

18 Local features For every history/tag pair (h, t), g_s(h, t) for s = 1, ..., d are local features for tagging decision t given history h:
g_100(h, t) = 1 if w_i = "base" and t = VB, 0 otherwise
g_101(h, t) = 1 if w_i ends with "ing" and t = VBG, 0 otherwise
g_102(h, t) = 1 if (t_{i-2}, t_{i-1}, t) = (DT, JJ, VB), 0 otherwise

19 Multiple local features Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/NN
History <t_{i-2}, t_{i-1}, w, i> and tag t at each position (w is the input sentence):
i = 1: <*, *, w, 1>, t = NNP
i = 2: <*, NNP, w, 2>, t = RB
i = 3: <NNP, RB, w, 3>, t = VB
i = 4: <RB, VB, w, 4>, t = DT
i = 5: <VB, DT, w, 5>, t = JJ
i = 6: <DT, JJ, w, 6>, t = NN

20 Global feature function The global feature function is
f_j(x, y) = \sum_i g_j(h_i, t_i), i.e. f(x, y) = \sum_i g(h_i, t_i)
Features are now counts rather than binary, e.g. how many times a word ending in "ing" was tagged as VBG.
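
A small sketch of local indicator features and of the global count vector obtained by summing them over positions (the feature names are hypothetical, not the course's exact feature set):

```python
from collections import Counter

def local_features(x, y, i):
    """g(h_i, t_i): binary indicator features for tagging decision i."""
    t2 = y[i - 2] if i >= 2 else "*"
    t1 = y[i - 1] if i >= 1 else "*"
    word, tag = x[i], y[i]
    feats = Counter()
    feats[f"word={word},tag={tag}"] = 1
    if word.endswith("ing") and tag == "VBG":
        feats["suffix=ing,tag=VBG"] = 1
    feats[f"trigram={t2},{t1},{tag}"] = 1
    return feats

def global_features(x, y):
    """f(x, y) = sum_i g(h_i, t_i): features become counts over the whole sequence."""
    total = Counter()
    for i in range(len(x)):
        total.update(local_features(x, y, i))
    return total

# Example: how many times a word ending in "ing" was tagged VBG in this sequence.
x = ["Spain", "was", "expanding", "its", "empire"]
y = ["NNP", "VBD", "VBG", "PRP$", "NN"]
print(global_features(x, y)["suffix=ing,tag=VBG"])  # -> 1
```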

21 Global log-linear model
p_\theta(y \mid x) = \frac{\exp(f(x, y)^\top \theta)}{\sum_{y' \in GEN(x)} \exp(f(x, y')^\top \theta)}
             = \frac{\exp\big(\sum_i g(x, i, y_{i-2}, y_{i-1}, y_i)^\top \theta\big)}{\sum_{y' \in GEN(x)} \exp\big(\sum_i g(x, i, y'_{i-2}, y'_{i-1}, y'_i)^\top \theta\big)}

22 Decoding As long as the feature function decomposes locally, we can still apply Viterbi; otherwise we can do greedy/beam decoding.

23 Decoding Find the structure that maximizes the linear score:
\arg\max_y p_\theta(y \mid x) = \arg\max_y \frac{\exp(f(x, y)^\top \theta)}{\sum_{y' \in GEN(x)} \exp(f(x, y')^\top \theta)}
 = \arg\max_y \exp(f(x, y)^\top \theta)
 = \arg\max_y f(x, y)^\top \theta
 = \arg\max_y \sum_i g(x, i, y_{i-2}, y_{i-1}, y_i)^\top \theta

24 Viterbi still works! The dependencies did not change. Definition: Y_i is the set of possible tags in position i.
Base case: \pi(1, *, y) = g(x, 1, *, *, y)^\top \theta
For all i \in \{2 \dots n\}, for all u \in Y_{i-1}, v \in Y_i:
\pi(i, u, v) = \max_{t \in Y_{i-2}} \big[ \pi(i-1, t, u) + g(x, i, t, u, v)^\top \theta \big]
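
As a rough sketch, here is Viterbi for a bigram-factored score (the slide's recursion carries one extra tag of history but has the same shape); the score argument is a hypothetical callable standing in for g(x, i, ., .)^\top \theta, and positions are 0-indexed:

```python
def viterbi(x, tagset, score):
    """Find argmax_y sum_i score(x, i, y[i-1], y[i]) for a bigram-factored score.
    score(x, i, prev_tag, tag) is any local scoring function (e.g. g(...) . theta)."""
    n = len(x)
    # pi[i][t] = best score of any tag prefix of length i+1 ending in tag t
    pi = [{t: score(x, 0, "*", t) for t in tagset}]
    back = [{}]
    for i in range(1, n):
        pi.append({})
        back.append({})
        for t in tagset:
            best_prev = max(tagset, key=lambda u: pi[i - 1][u] + score(x, i, u, t))
            pi[i][t] = pi[i - 1][best_prev] + score(x, i, best_prev, t)
            back[i][t] = best_prev
    # Follow back-pointers from the best final tag.
    last = max(tagset, key=lambda t: pi[n - 1][t])
    tags = [last]
    for i in range(n - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))
```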

25 Learning
L(\theta) = \sum_i \log p_\theta(y_i \mid x_i)
\nabla L_i(\theta) = f(x_i, y_i) - \sum_{y'} p_\theta(y' \mid x_i) f(x_i, y')
How do we compute the second term?

26 Learning Let's look at bigram features only:
\sum_y p_\theta(y \mid x) f(x, y) = \sum_y p_\theta(y \mid x) \sum_i g(x, i, y_{i-1}, y_i)
 = \sum_i \sum_{a,b} \sum_{y: y_{i-1}=a, y_i=b} p_\theta(y \mid x)\, g(x, i, a, b)
 = \sum_i \sum_{a,b} g(x, i, a, b) \sum_{y: y_{i-1}=a, y_i=b} p_\theta(y \mid x)
 = \sum_i \sum_{a,b} g(x, i, a, b)\, q_i(a, b)
The q terms are the probabilities that the tag sequence has a and b in positions i-1 and i.

27 Learning If we can compute the q terms efficiently, then we can compute the gradients and learn. The q terms are computed with a dynamic programming algorithm called forward-backward, which is also used in unsupervised parameter estimation for HMMs and is similar to Viterbi.

28 Forward-backward Definitions:
\psi(y', y, j) = \exp(g(x, j, y', y)^\top \theta)
\psi(y_1, \dots, y_m) = \prod_{j=1}^m \psi(y_{j-1}, y_j, j) = \prod_{j=1}^m \exp(g(x, j, y_{j-1}, y_j)^\top \theta) = \exp\big(\sum_{j=1}^m g(x, j, y_{j-1}, y_j)^\top \theta\big)
p(y_1, \dots, y_m \mid x) = \frac{\psi(y_1, \dots, y_m)}{\sum_{z_1, \dots, z_m} \psi(z_1, \dots, z_m)}

29 Forward-backward What will be computed?
Z = \sum_{y_1, \dots, y_m} \psi(y_1, \dots, y_m)
\mu(j, a) = \sum_{y_1, \dots, y_m: y_j = a} \psi(y_1, \dots, y_m)
\mu(j, a, b) = \sum_{y_1, \dots, y_m: y_j = a, y_{j+1} = b} \psi(y_1, \dots, y_m)
q_j(a, b) = \frac{\mu(j, a, b)}{Z}

30 Forward-backward What will be computed?
Sum of the scores of all paths that end at tag y in position j:
\alpha(j, y) = \sum_{y_1, \dots, y_j: y_j = y} \prod_{k=1}^{j} \psi(y_{k-1}, y_k, k)
Sum of the scores of all paths that start at tag y in position j:
\beta(j, y) = \sum_{y_j, \dots, y_m: y_j = y} \prod_{k=j}^{m-1} \psi(y_k, y_{k+1}, k+1)

31 α terms Compute with dynamic programming:
\alpha(1, y) = \psi(*, y, 1)
for j \in \{2 \dots m\}, y \in Y: \alpha(j, y) = \sum_{y' \in Y} \alpha(j-1, y')\, \psi(y', y, j)
[Figure: a lattice over tags y_1, y_2, y_3 with the first column filled in: \alpha(1, y_1) = \psi(*, y_1, 1), \alpha(1, y_2) = \psi(*, y_2, 1), \alpha(1, y_3) = \psi(*, y_3, 1)]

32 α terms [Figure: the same lattice, now with the \alpha(2, \cdot) column filled in]
\alpha(2, y_1) = \alpha(1, y_1)\psi(y_1, y_1, 2) + \alpha(1, y_2)\psi(y_2, y_1, 2) + \alpha(1, y_3)\psi(y_3, y_1, 2)
             = \psi(*, y_1, 1)\psi(y_1, y_1, 2) + \psi(*, y_2, 1)\psi(y_2, y_1, 2) + \psi(*, y_3, 1)\psi(y_3, y_1, 2)
\alpha(2, y_2) = \psi(*, y_1, 1)\psi(y_1, y_2, 2) + \psi(*, y_2, 1)\psi(y_2, y_2, 2) + \psi(*, y_3, 1)\psi(y_3, y_2, 2)
\alpha(2, y_3) = \psi(*, y_1, 1)\psi(y_1, y_3, 2) + \psi(*, y_2, 1)\psi(y_2, y_3, 2) + \psi(*, y_3, 1)\psi(y_3, y_3, 2)

33 α terms The distributive law works. [Figure: the same lattice, with the \alpha(3, \cdot) column filled in]
\alpha(3, y_1) = \alpha(2, y_1)\psi(y_1, y_1, 3) + \alpha(2, y_2)\psi(y_2, y_1, 3) + \alpha(2, y_3)\psi(y_3, y_1, 3),
which expands into the sum of \psi(*, y_a, 1)\psi(y_a, y_b, 2)\psi(y_b, y_1, 3) over all nine combinations of y_a, y_b \in \{y_1, y_2, y_3\}, i.e. the scores of all length-3 prefixes ending in y_1.

34 β terms
\beta(m, y) = 1
for j \in \{m-1 \dots 1\}, y \in Y: \beta(j, y) = \sum_{y' \in Y} \beta(j+1, y')\, \psi(y, y', j+1)
Use the distributive property again to get the final results:
Z = \sum_{y \in Y} \alpha(m, y)
\mu(j, a) = \alpha(j, a)\, \beta(j, a)
\mu(j, a, b) = \alpha(j, a)\, \psi(a, b, j+1)\, \beta(j+1, b)
q_j(a, b) = \frac{\mu(j, a, b)}{Z}
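
A direct transcription of these recursions into code as a sketch; psi(prev, cur, j) is an assumed callable returning the potential \exp(g(x, j, prev, cur)^\top \theta), with prev = "*" at j = 1, and a real implementation would work in log space to avoid overflow:

```python
def forward_backward(m, tagset, psi):
    """Compute alpha, beta, Z, and the pairwise marginals q_j(a, b) (1-indexed positions)."""
    # Forward: alpha[j][y] = sum of scores of all paths ending with tag y at position j.
    alpha = [dict() for _ in range(m + 1)]
    for y in tagset:
        alpha[1][y] = psi("*", y, 1)
    for j in range(2, m + 1):
        for y in tagset:
            alpha[j][y] = sum(alpha[j - 1][yp] * psi(yp, y, j) for yp in tagset)
    # Backward: beta[j][y] = sum of scores of all paths starting with tag y at position j.
    beta = [dict() for _ in range(m + 1)]
    for y in tagset:
        beta[m][y] = 1.0
    for j in range(m - 1, 0, -1):
        for y in tagset:
            beta[j][y] = sum(beta[j + 1][yp] * psi(y, yp, j + 1) for yp in tagset)
    Z = sum(alpha[m][y] for y in tagset)
    # q_j(a, b) = alpha(j, a) * psi(a, b, j+1) * beta(j+1, b) / Z
    q = {(j, a, b): alpha[j][a] * psi(a, b, j + 1) * beta[j + 1][b] / Z
         for j in range(1, m) for a in tagset for b in tagset}
    return alpha, beta, Z, q
```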

35 Complexity If the length of the sentence is m, the runtime is O(m|Y|^2). We can compute the gradient, apply SGD, and learn. Limitation: the features must decompose.

36 Structured perceptron Similar to CRFs, but needs only Viterbi. For every training example (x, y): find the best y' according to the model (Viterbi); if y' differs from the gold y, update the weights by adding the features of y and subtracting the features of y'. Regularization is needed (averaged perceptron).
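
A minimal sketch of the training loop, assuming hypothetical viterbi_decode(x, tagset, theta) and global_features(x, y) helpers in the spirit of the earlier sketches; the weight averaging that the slide recommends is omitted for brevity:

```python
from collections import defaultdict

def perceptron_train(data, tagset, global_features, viterbi_decode, epochs=5):
    """Structured perceptron: only needs argmax decoding, no partition function.
    data is a list of (words, gold_tags) pairs."""
    theta = defaultdict(float)
    for _ in range(epochs):
        for x, gold in data:
            pred = viterbi_decode(x, tagset, theta)   # best y' under current weights
            if list(pred) != list(gold):
                # Add features of the gold structure, subtract features of the prediction.
                for name, value in global_features(x, gold).items():
                    theta[name] += value
                for name, value in global_features(x, pred).items():
                    theta[name] -= value
    return theta
```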

37 Global vs. Local models
p_L(y_i \mid x, y_1 \dots y_{i-1}) = \frac{\exp(s(x, y_1 \dots y_i))}{Z_L(x, y_1 \dots y_{i-1})}
Z_L(x, y_1 \dots y_{i-1}) = \sum_t \exp(s(x, y_1 \dots y_{i-1}, t))
p_L(y_1 \dots y_n \mid x) = \frac{\exp\big(\sum_{i=1}^n s(x, y_1 \dots y_i)\big)}{\prod_{i=1}^n Z_L(x, y_1 \dots y_{i-1})}
p_G(y_1 \dots y_n \mid x) = \frac{\exp\big(\sum_{i=1}^n s(x, y_1 \dots y_i)\big)}{Z_G(x)}
Z_G(x) = \sum_{y_1 \dots y_n} \exp\big(\sum_{i=1}^n s(x, y_1 \dots y_i)\big)
It can be shown that some distributions can only be expressed with p_G.

38 Graphical models Viterbi and forward-backward are instances of the max-product and sum-product algorithms for inference in graphical models (see an advanced ML class). [Figure: graphical models for an HMM (Y_1 ... Y_n with emissions X_1 ... X_n), an MEMM (Y_1 ... Y_n conditioned on X_{1:n}), and a CRF (Y_1 ... Y_n with the whole x_{1:n})]
CRF: p(y) = \frac{1}{Z} \prod_{i=1}^n \psi(x, y_{i-1}, y_i)

39 Deep learning models for tagging

40 Deep learning model We have a more powerful learning model Can we make things better? simpler?

41 Ling et al., 2015 A simple POS-tagging model:
x_i: one-hot representation of word i
e_i = W^{emb} x_i: word embedding
h^f_i = LSTM(h^f_{i-1}, e_i),  h^b_i = LSTM(h^b_{i+1}, e_i)
c_i = \sigma(W [h^f_i; h^b_i] + b)  (\sigma a nonlinearity)
p(y_i \mid x) = softmax(W^s c_i + b)
L(\theta) = \sum_i \log p(y_i = y^*_i \mid x)  (y^*_i the gold tag)
All tag decisions are independent! Works OK for POS tagging (accuracy comparable to a linear CRF), but doesn't work as well for NER tagging. Why?
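
A rough PyTorch sketch in the spirit of this model (the layer sizes, names, and the tanh nonlinearity are my assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Independent per-token tag classifier over Bi-LSTM states (a sketch)."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)                 # e_i = W_emb x_i
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True,
                              batch_first=True)                      # h_f, h_b
        self.hidden = nn.Linear(2 * hidden_dim, hidden_dim)          # c_i
        self.out = nn.Linear(hidden_dim, num_tags)                   # per-position tag scores

    def forward(self, word_ids):              # word_ids: (batch, seq_len)
        states, _ = self.bilstm(self.emb(word_ids))
        c = torch.tanh(self.hidden(states))
        return self.out(c)                    # logits; one independent softmax per position

# Training uses a per-token cross-entropy loss: every tag decision is independent.
# loss = nn.CrossEntropyLoss()(logits.view(-1, num_tags), gold_tags.view(-1))
```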

42 Bi-RNNs (Bi-LSTMs) Provide a representation of a word in context and have become the de facto standard contextual word representation for NLP tasks.

43 Lample et al., 2016 Greedy taggers: a history x is a tuple <t_1 ... t_{i-1}, w, i>; predict t_i from the history, p(t_i \mid x) \propto \exp(\mathrm{score}(x, t_i)), where in the log-linear case score(x, t_i) = f(x, t_i)^\top \theta. Just replace the log-linear score with a non-linear neural network: we need to define a neural network that reads the history and produces a label.

44 Neural structured prediction
s(x, y) = f(x, y)^\top \theta = \sum_i g(x, i, y_{i-1}, y_i)^\top \theta = s(x, 1, y_1, y_2) + s(x, 2, y_2, y_3) + s(x, 3, y_3, y_4)
Replace the linear score with a non-linear neural network:
s(x, y) = \sum_i NN(x, i, y_{i-1}, y_i)
[Figure: a chain over y_1, y_2, y_3, y_4]

45 Neural structured prediction What has changed? Decoding: as long as the neural network scores decompose, we can still apply Viterbi. Learning:
p_\theta(y \mid x) = \frac{\exp\big(\sum_i NN_\theta(x, i, y_{i-1}, y_i)\big)}{\sum_{y'} \exp\big(\sum_i NN_\theta(x, i, y'_{i-1}, y'_i)\big)}
L^{(i)}(\theta) = \log p_\theta(y^{(i)} \mid x^{(i)})

46 Neural structured prediction How do we compute the denominator (partition function)? Implement dynamic programming as part of the neural network: all dynamic programming operations are differentiable (+ and x). In the structured perceptron things are slightly simpler: compute the best structure y' (outside of the network) and define the loss max(0, max_{y'} score(x, y') - score(x, y)).

47 Lample et al., 2016 Bi-LSTM CRF for NER:
x_i: one-hot representation of word i
e_i = W^{emb} x_i: word embedding
l_i = LSTM(l_{i-1}, e_i),  r_i = LSTM(r_{i+1}, e_i)
c_i = [l_i; r_i]
p_i = W^{(proj)} (W c_i + b)
s(x, y) = \sum_{i=1}^n p_{i, y_i} + \sum_{i=0}^n A_{y_i, y_{i+1}}
A is a learned transition matrix (#tags^2 parameters). SOTA or close to it on POS tagging, NER, and other tasks.
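
To make the CRF layer concrete, here is a small NumPy sketch (my own simplification, without the start/stop tags used in the paper) of the sequence score and of the forward-algorithm partition function computed in log space:

```python
import numpy as np

def sequence_score(emissions, A, tags):
    """s(x, y) = sum_i P[i, y_i] + sum_i A[y_i, y_{i+1}]  (no start/stop symbols here).
    emissions: (seq_len, num_tags) Bi-LSTM scores; A: (num_tags, num_tags); tags: list of ids."""
    emit = sum(emissions[i, t] for i, t in enumerate(tags))
    trans = sum(A[tags[i], tags[i + 1]] for i in range(len(tags) - 1))
    return emit + trans

def log_partition(emissions, A):
    """log Z(x) via the forward algorithm in log space (logsumexp recursion)."""
    log_alpha = emissions[0]                        # shape: (num_tags,)
    for i in range(1, emissions.shape[0]):
        # log_alpha_new[b] = logsumexp_a(log_alpha[a] + A[a, b]) + emissions[i, b]
        scores = log_alpha[:, None] + A             # (num_tags, num_tags)
        m = scores.max(axis=0)
        log_alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + emissions[i]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())

# Training maximizes log p(y | x) = sequence_score(...) - log_partition(...).
```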

48 State-of-the-art To match or outperform the state of the art: regularization techniques, character-based information, and pre-training with a lot of data.

49 Summary (tagging / parsing). Generative: HMMs / PCFGs. Greedy log-linear (history-based): greedy tagging / transition-based parsing. Decoding: Viterbi / CKY. Global learning: forward-backward / inside-outside.
