CS395T: Structured Models for NLP Lecture 5: Sequence Models II

1 CS395T: Structured Models for NLP Lecture 5: Sequence Models II Greg Durrett Some slides adapted from Dan Klein, UC Berkeley

2 Recall: HMMs. Input $x = (x_1, \ldots, x_n)$, output $y = (y_1, \ldots, y_n)$. [Bayes net: $y_1 \rightarrow y_2 \rightarrow \cdots \rightarrow y_n$, with each $y_i$ emitting $x_i$.]
$P(y, x) = P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)$
Training: maximum likelihood estimation (with smoothing). Inference problem: $\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y P(y, x)/P(x)$. Exponentially many possible $y$ here! Viterbi: $\operatorname{score}_i(s) = \max_{y_{i-1}} P(s \mid y_{i-1})\, P(x_i \mid s)\, \operatorname{score}_{i-1}(y_{i-1})$
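As a concrete reference for the recurrence above, here is a minimal Viterbi sketch in log space (not from the slides; assumes numpy and that the HMM tables are given as log-probability arrays):

```python
import numpy as np

def viterbi(init, trans, emit, obs):
    """Most likely tag sequence under an HMM, computed in log space.

    init[s]      = log P(y_1 = s)
    trans[s_, s] = log P(y_i = s | y_{i-1} = s_)
    emit[s, o]   = log P(x_i = o | y_i = s)
    obs          = observation indices x_1..x_n
    """
    n, S = len(obs), init.shape[0]
    score = np.full((n, S), -np.inf)    # score[i, s]: best log prob of a prefix ending in tag s
    back = np.zeros((n, S), dtype=int)  # backpointers for recovering the argmax
    score[0] = init + emit[:, obs[0]]
    for i in range(1, n):
        for s in range(S):
            cand = score[i - 1] + trans[:, s] + emit[s, obs[i]]
            back[i, s] = np.argmax(cand)
            score[i, s] = cand[back[i, s]]
    best = [int(np.argmax(score[n - 1]))]  # follow backpointers from the best final tag
    for i in range(n - 1, 0, -1):
        best.append(int(back[i, best[-1]]))
    return best[::-1]
```

Working in log space turns the products of the recurrence into sums and avoids underflow on long sentences.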

3 This Lecture: generative vs. discriminative models; CRFs for sequence modeling; named entity recognition (NER); structured SVM (if time); beam search.

4 Named Entity Recognition.
B-PER I-PER O O O B-LOC O O O B-ORG O O
Barack Obama will travel to Hangzhou today for the G20 meeting.
(PERSON, LOC, and ORG spans.) BIO tagset: begin, inside, outside. POS tagging is a plausible generative model of language; NER with this vanilla tag set is not. What's different about modeling P(y|x) directly vs. modeling P(x,y) and computing the posterior later?

5 Generative vs. Discriminative Models (slide credit: Dan Klein)

6 Generative vs. Discriminative Models (slide credit: Dan Klein)

7 Generative vs. Discriminative Models. What does the model say when both lights are red? "Lights are working": wrong! (slide credit: Dan Klein)

8 Generative vs. Discriminative Models. What if P(b) were 1/2 instead of 1/7 (the NB estimate)? "Lights are broken": correct! Data likelihood P(x,y) is lower, but the posterior P(y|x) is more accurate. (slide credit: Dan Klein)

9 Conditional Random Fields. HMMs are expressible as Bayes nets (factor graphs): [chain $y_1, \ldots, y_n$ with emissions $x_1, \ldots, x_n$]. This reflects the following decomposition: $P(y, x) = P(y_1) P(x_1 \mid y_1) P(y_2 \mid y_1) P(x_2 \mid y_2) \cdots$ Locally normalized model: each factor is a probability distribution that normalizes.

10 Conditional Random Fields. HMMs: $P(y, x) = P(y_1) P(x_1 \mid y_1) P(y_2 \mid y_1) P(x_2 \mid y_2) \cdots$ CRFs: discriminative models with the following globally-normalized form:
$P(y \mid x) = \frac{1}{Z} \prod_k \exp(\phi_k(x, y))$
where $Z$ is the normalizer and each $\phi_k$ is any real-valued scoring function of its arguments. Naive Bayes : logistic regression :: HMMs : CRFs; local vs. global normalization corresponds to generative vs. discriminative. How do we max over y? Intractable in general; can we fix this?

11 Sequential CRFs. HMMs: $P(y, x) = P(y_1) P(x_1 \mid y_1) P(y_2 \mid y_1) P(x_2 \mid y_2) \cdots$ [Bayes net as before.] CRFs: $P(y \mid x) \propto \prod_k \exp(\phi_k(x, y))$. [Factor graph: an initial factor $\phi_o$ on $y_1$, transition factors $\phi_t$ between adjacent tags, emission factors $\phi_e$ linking each $y_i$ to $x_i$.]
$P(y \mid x) \propto \exp(\phi_o(y_1)) \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(x_i, y_i))$

12 Sequential CRFs. We condition on x, so every factor can depend on all of x: replace $\phi_e(x_i, y_i)$ with $\phi_e(y_i, i, x)$, where the token index lets us look at the current word. (x can't depend arbitrarily on y in a generative model; that would make inference hard.)

13 Sequential CRFs. In fact, we typically don't show x at all. Don't include the initial distribution; it can be baked into the other factors. Sequential CRFs:
$P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$

14 Computing (arg)maxes. $P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$
$\operatorname{argmax}_y P(y \mid x)$: can use Viterbi exactly as in the HMM case.
$\max_{y_1, \ldots, y_n} e^{\phi_t(y_{n-1}, y_n)} e^{\phi_e(y_n, n, x)} \cdots e^{\phi_t(y_1, y_2)} e^{\phi_e(y_2, 2, x)} e^{\phi_e(y_1, 1, x)}$
$= \max_{y_2, \ldots, y_n} e^{\phi_t(y_{n-1}, y_n)} e^{\phi_e(y_n, n, x)} \cdots e^{\phi_e(y_2, 2, x)} \max_{y_1} e^{\phi_t(y_1, y_2)} \underbrace{e^{\phi_e(y_1, 1, x)}}_{\operatorname{score}_1(y_1)}$
$= \max_{y_3, \ldots, y_n} e^{\phi_t(y_{n-1}, y_n)} e^{\phi_e(y_n, n, x)} \cdots \max_{y_2} e^{\phi_t(y_2, y_3)} e^{\phi_e(y_2, 2, x)} \max_{y_1} e^{\phi_t(y_1, y_2)} \operatorname{score}_1(y_1)$
$\exp(\phi_t)$ and $\exp(\phi_e)$ play the role of the Ps now; same dynamic program.

15 Computing Marginals. $P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$
Normalizing constant $Z = \sum_y \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$, analogous to P(x) for HMMs.
For both HMMs and CRFs: $P(y_i = s \mid x) = \frac{\operatorname{forward}_i(s)\, \operatorname{backward}_i(s)}{\sum_{s'} \operatorname{forward}_i(s')\, \operatorname{backward}_i(s')}$; for HMMs, $\operatorname{forward}_i(s)\operatorname{backward}_i(s) = P(y_i = s, x)$. The products sum out the other y's, and the denominator equals Z for CRFs, P(x) for HMMs.
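A minimal sketch of this computation (not the course's reference implementation; assumes numpy/scipy and dense score matrices):

```python
import numpy as np
from scipy.special import logsumexp

def tag_marginals(phi_t, phi_e):
    """Tag marginals P(y_i = s | x) for a sequential CRF, in log space.

    phi_t[s_, s] = transition score phi_t(y_{i-1} = s_, y_i = s)
    phi_e[i, s]  = emission score phi_e(y_i = s, i, x)
    """
    n, S = phi_e.shape
    alpha = np.zeros((n, S))  # alpha[i, s]: log-sum of scores of prefixes ending in tag s
    beta = np.zeros((n, S))   # beta[i, s]:  log-sum of scores of suffixes after tag s at i
    alpha[0] = phi_e[0]
    for i in range(1, n):
        alpha[i] = phi_e[i] + logsumexp(alpha[i - 1][:, None] + phi_t, axis=0)
    for i in range(n - 2, -1, -1):
        beta[i] = logsumexp(phi_t + (phi_e[i + 1] + beta[i + 1])[None, :], axis=1)
    log_Z = logsumexp(alpha[n - 1])
    return np.exp(alpha + beta - log_Z)  # each row sums to 1
```

This is forward-backward with sums replaced by log-sum-exps for numerical stability; `alpha` and `beta` are the forward and backward tables from the slide.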

16 Inference in General CRFs. Can do inference in any tree-structured CRF. Sum-product algorithm: a generalization of forward-backward to arbitrary tree-structured graphs. We'll come back to this in a few lectures when we deal with other kinds of graphs.

17 Feature Functions. $P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$
The phis can have sophisticated features! They generally look like linear models: $\phi_e(y_i, i, x) = w^\top f_e(y_i, i, x)$ and $\phi_t(y_{i-1}, y_i) = w^\top f_t(y_{i-1}, y_i)$, so
$P(y \mid x) \propto \exp\!\left(w^\top \left[\sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x)\right]\right)$
Log-linear model, structurally like logistic regression!

18 Training CRFs. $P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$
Assume $\phi_t$ and $\phi_e$ are both linear feature functions $w^\top f(\text{args})$:
$\mathcal{L}(y, x) = \log P(y \mid x) = \sum_{i=2}^{n} w^\top f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} w^\top f_e(y_i, i, x) - \log Z$
The gradient is gold features minus expected features under the model, as in logistic regression:
$\frac{\partial}{\partial w_j} \mathcal{L}(y, x) = \sum_{i=2}^{n} f_{t,j}(y_{i-1}, y_i) + \sum_{i=1}^{n} f_{e,j}(y_i, i, x) - \mathbb{E}_{y'}\!\left[\sum_{i=2}^{n} f_{t,j}(y'_{i-1}, y'_i) + \sum_{i=1}^{n} f_{e,j}(y'_i, i, x)\right]$
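In code, the emission part of this gradient might look like the following sketch (hedged: the `feats_e` helper returning active feature names and the `marginals` table are illustrative assumptions, not the assignment's actual API):

```python
from collections import Counter

def emission_gradient(sentence, gold_tags, tag_set, feats_e, marginals):
    """Gradient of log P(y|x) w.r.t. emission-feature weights:
    gold feature counts minus counts expected under the model.

    feats_e(tag, i, sentence) -> active feature names (hypothetical helper)
    marginals[i][tag]         -> P(y_i = tag | x) from forward-backward
    """
    grad = Counter()
    for i, gold in enumerate(gold_tags):
        for f in feats_e(gold, i, sentence):
            grad[f] += 1.0              # gold features
        for tag in tag_set:
            p = marginals[i][tag]
            for f in feats_e(tag, i, sentence):
                grad[f] -= p            # expected features
    return grad  # transition features would need pairwise marginals (next slide)
```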

19 Training CRFs. How to compute the expectations? Forward-backward helps you compute $P(y_i = s \mid x)$; take a weighted sum over all features at all tags and positions. Transition features: need to compute $P(y_i = s_1, y_{i+1} = s_2 \mid x)$, using forward-backward as well, but you can build a pretty good system without transition features.

20 Implementation Tips. Often there are many features, but only a few are active on a single sentence, even across many different labels. Maintain the gradient as a sparse vector for efficiency; Counter in utils.py is a way to do this (see the sketch below).
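For instance, a gradient step that only touches the weights whose features actually fired (a sketch; representing `weights` as a plain dict is an assumption):

```python
def sparse_sgd_step(weights, grad, lr=0.1):
    """Apply a sparse gradient (e.g. a Counter) without scanning all weights."""
    for feat, g in grad.items():
        weights[feat] = weights.get(feat, 0.0) + lr * g
```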

21 Basic Features for NER. $P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$
Example (B-LOC on "Hangzhou", O on its neighbors): Barack Obama will travel to Hangzhou today for the G20 meeting.
Transitions: $f_t(y_{i-1}, y_i) = \operatorname{Ind}[y_{i-1} \,\&\, y_i]$
Emissions: $f_e(y_6, 6, x) = \operatorname{Ind}[\text{B-LOC \& current word = Hangzhou}]$, $\operatorname{Ind}[\text{B-LOC \& previous word = to}]$

22 Features for NER. What should $\phi_e(y_i, i, x)$ pick up on?
LOC: Leicestershire is a nice place to visit
PER: Leonardo DiCaprio won an award
LOC: I took a vacation to Boston
ORG: Apple released a new version
LOC, PER: Texas governor Greg Abbott said
ORG: According to the New York Times

23 Features for NER. Word features: capitalization (Leicestershire), word shape, prefixes/suffixes (Boston), lexical indicators. Context features: words before/after (Apple released a new version; According to the New York Times), tags before/after. Word clusters. Gazetteers. A sketch of such features follows.
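A hedged sketch of emission features as string indicators (the feature names and exact set here are illustrative, not the assignment's scheme):

```python
def emission_features(tag, i, words):
    """Indicator features for labeling words[i] with `tag`."""
    w = words[i]
    prev = words[i - 1] if i > 0 else "<s>"
    nxt = words[i + 1] if i + 1 < len(words) else "</s>"
    shape = "".join("X" if c.isupper() else ("d" if c.isdigit() else "x") for c in w)
    return [
        tag + ":word=" + w,
        tag + ":cap=" + str(w[:1].isupper()),   # capitalization
        tag + ":shape=" + shape,                # word shape
        tag + ":prefix3=" + w[:3],              # prefixes/suffixes
        tag + ":suffix3=" + w[-3:],
        tag + ":prev=" + prev,                  # context words
        tag + ":next=" + nxt,
    ]
```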

24 Nonlocal Features. "The news agency Tanjug reported on the outcome of the meeting." vs. "The delegation met the president at the airport, Tanjug said." ORG or PER? Various ways to capture this information; we'll talk about this in a few lectures. Finkel and Manning (2008), Ratinov and Roth (2009)

25 Semi-Markov Models. Barack Obama | will travel to | Hangzhou | today for the | G20 | meeting. with span labels PER O LOC O ORG O. Chunk-level prediction rather than token-level BIO: y is a set of touching spans covering the sentence. Viterbi looks like looping over all spans that could lead to a given point (see the sketch below). Pros: features can look at a whole span at once. Cons: there's an extra factor of n during inference. Sarawagi and Cohen (2004)
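A sketch of that semi-Markov Viterbi loop (the `score_span` scoring hook is a hypothetical stand-in for $w^\top f$ over a whole labeled span; capping the span length bounds the extra factor of n):

```python
def semi_markov_viterbi(score_span, n, labels, max_len):
    """best[j] = score of the best segmentation of words[0:j];
    each step extends the segmentation by one whole labeled span."""
    NEG = float("-inf")
    best = [0.0] + [NEG] * n
    back = [None] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):   # all spans words[i:j] ending at j
            for lab in labels:
                s = best[i] + score_span(i, j, lab)
                if s > best[j]:
                    best[j], back[j] = s, (i, lab)
    spans, j = [], n                               # recover the labeled spans
    while j > 0:
        i, lab = back[j]
        spans.append((lab, i, j))
        j = i
    return spans[::-1]
```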

26 Evaluating NER.
B-PER I-PER O O O B-LOC O O O B-ORG O O
Barack Obama will travel to Hangzhou today for the G20 meeting.
(PERSON, LOC, ORG.) Predicting all O's still gets 66% token accuracy on this example! What we really want to know: how many named entity chunk predictions did we get right? Precision: of the chunks we predicted, how many are right? Recall: of the gold named entities, how many did we find? F-measure: harmonic mean of these two. Partial credit? Typically no, but more complex metrics exist.
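Chunk-level scoring in a minimal sketch (exact-match credit only; the BIO-to-chunk conversion is simplified, e.g. a stray I- tag with no open chunk is just dropped):

```python
def bio_to_chunks(tags):
    """Extract (label, start, end) chunks from a BIO tag sequence."""
    chunks, start, label = set(), None, None
    for i, t in enumerate(list(tags) + ["O"]):  # sentinel closes any final chunk
        if start is not None and not t.startswith("I-"):
            chunks.add((label, start, i))
            start = None
        if t.startswith("B-"):
            start, label = i, t[2:]
    return chunks

def chunk_prf(gold, pred):
    """Precision, recall, and F1 over exact chunk matches; no partial credit."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

On the slide's example, predicting all O's yields zero chunks, so precision and recall are both 0 even though token accuracy is 66%.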

27 Evaluating NER. [Same example as above.] More correct: an ROC-style curve with recall and precision on the axes; measure the area under the curve as a way of evaluating the system holistically.

28 How well do NER systems do? [Results from Ratinov and Roth (2009) and Lample et al. (2016).]

29 Structured SVM. CRF: $\log P(y \mid x) \propto \sum_{i=2}^{n} w^\top f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} w^\top f_e(y_i, i, x)$
We can formulate an SVM using the same features:
$w^\top f(x, y) = \sum_{i=2}^{n} w^\top f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} w^\top f_e(y_i, i, x)$
Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$ subject to $\forall j\ \xi_j \ge 0$ and $\forall j\ \forall y \in \mathcal{Y}\ \; w^\top f(x_j, y_j) \ge w^\top f(x_j, y) + \ell(y, y_j) - \xi_j$

30 Structured SVM. [Same objective and constraints as above.] Exponentially large state space of y! Use Viterbi for the loss-augmented decode: same as normal Viterbi, but boost each wrong label's score by 1 (if using Hamming loss). Only need Viterbi, not forward-backward. A sketch follows.
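A hedged numpy sketch of the loss-augmented decode (assumes the per-position and transition score matrices have already been computed from $w^\top f$):

```python
import numpy as np

def loss_augmented_decode(emit, trans, gold):
    """Viterbi over raw scores, with every wrong tag's score boosted by 1
    (Hamming loss), so the decode finds the most-violated sequence.

    emit[i, s]   = w . f_e(s, i, x);  trans[s_, s] = w . f_t(s_, s)
    gold         = gold tag indices, length n
    """
    n, S = emit.shape
    aug = emit + 1.0
    aug[np.arange(n), gold] -= 1.0       # no loss bonus for the gold tag
    score = np.full((n, S), -np.inf)
    back = np.zeros((n, S), dtype=int)
    score[0] = aug[0]
    for i in range(1, n):
        cand = score[i - 1][:, None] + trans + aug[i][None, :]
        back[i] = np.argmax(cand, axis=0)
        score[i] = cand[back[i], np.arange(S)]
    best = [int(np.argmax(score[-1]))]   # follow backpointers
    for i in range(n - 1, 0, -1):
        best.append(int(back[i, best[-1]]))
    return best[::-1]
```

Dropping the augmentation (`aug = emit`) gives ordinary score-based Viterbi for the CRF.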

31 Viterbi Time Complexity. [Tag lattice over "Fed raises interest rates 0.5 percent": candidate tags such as NNP/VBD/VBN for Fed, VBZ/NNS for raises, NN for interest, VBZ/NNS for rates, CD for 0.5, NN for percent.] For an n-word sentence with s tags to consider, what is the time complexity? $O(n s^2)$: n positions, and at each one we score s current tags against s previous tags. s is ~40 for POS, n is ~20.

32 Viterbi Time Complexity. [Same lattice.] Many tags are totally implausible. Can any of these words be determiners? Prepositions? Adjectives? Features quickly eliminate many outcomes from consideration; we don't need to consider these going forward.

33 Beam Search. Maintain a beam of k plausible states at the current timestep; expand all states, then keep only the top k hypotheses at the new timestep. [Figure: beams over "Fed raises": hypotheses scored e.g. NNP +0.9, VBD +1.2, VBN +0.7, NN +0.3, with low-scoring states like DT -5.3 and PRP -5.8 not expanded; after "raises", VBZ +1.2 and NNS -1.0 survive.] O(nks) time complexity with beam size of k. A sketch follows.
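A minimal beam-search sketch over the same kind of score tables (hypotheses carry full tag prefixes here rather than merged lattice states, which keeps the code short):

```python
import heapq

def beam_search(emit, trans, k):
    """Left-to-right beam search: keep only the k best partial tag
    sequences at each position instead of all Viterbi states.

    emit[i][s]   = score of tag s at position i
    trans[s_][s] = score of tag s following tag s_
    """
    S = len(emit[0])
    beam = heapq.nlargest(k, [(emit[0][s], [s]) for s in range(S)])
    for i in range(1, len(emit)):
        expanded = [(score + trans[seq[-1]][s] + emit[i][s], seq + [s])
                    for score, seq in beam for s in range(S)]
        beam = heapq.nlargest(k, expanded)   # prune to the k best hypotheses
    return max(beam)[1]
```

With k = 1 this reduces to greedy left-to-right decoding; growing k trades time for accuracy, and a large enough beam recovers the exact argmax.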

34 How good is beam search? With a big enough beam size: always exact! It usually works well even with smaller beams. What's the case when k = 1? How about when there's no transition model? It depends on the strength of the nonlocal interactions; we'll come back to this later!

35 Implementation Tips for CRFs. Caching is your friend! Cache feature vectors especially. Try to reduce redundant computation; e.g. if you compute both the gradient and the objective value, don't rerun the dynamic program. Exploit sparsity in feature vectors where possible: the weight vector needs to be stored explicitly, but all features and gradients are typically faster to handle sparsely. Think about your data structures if things are too slow.

36 Debugging Tips for CRFs. It's hard to know whether inference, learning, or the model is broken! Compute the objective: is optimization working? Inference: check the gradient computation (the most likely place for a bug). Are expectations being computed correctly? Do probabilities normalize / do expectations look reasonable? Learning: are you applying the gradient correctly? If the objective is going down but model performance is bad: Inference: check performance when you decode the training set. Model: if dev set performance is bad, work on the features more!

37 Next Time: unsupervised sequence modeling; writing tips as you prepare your report.
