Structured Prediction with Perceptron: Theory and Algorithms
1 Structured Prediction with Perceptron: Theory and Algorithms
Kai Zhao, Dept. of Computer Science, Graduate Center, the City University of New York
[title-slide figure: the tagged sentence "the man bit the dog" (DT NN VBD DT NN), a binary classification example (y = +1 / y = -1), and the Chinese translation 那人咬了狗 ("the man bit the dog")]
2 What is Structured Prediction?
binary classification: output is binary (y = +1 or y = -1)
multiclass classification: output is a (small) number
structured classification: output is a structure (sequence, tree, graph), e.g. the tag sequence DT NN VBD DT NN for "the man bit the dog", its parse tree (S (NP DT NN) (VP VBD (NP DT NN))), or its translation 那人咬了狗
NLP is all about structured prediction! part-of-speech tagging, parsing, summarization, translation
exponentially many classes: search (inference) efficiency is crucial!
3 Learning: Unstructured vs. Structured
binary/multiclass vs. structured learning:
generative: naive Bayes (count & divide) → HMMs
conditional / discriminative: logistic regression (maxent) → CRFs (expectations)
online + Viterbi: perceptron → structured perceptron (argmax)
max margin: SVM → structured SVM (loss-augmented argmax)
4 Why Perceptron (Online Learning)?
because we want scalability on big data!
learning time has to be linear in the number of examples, and we can make only a constant number of passes over the training data
only online learning (perceptron/MIRA) can guarantee this! SVM scales between O(n^2) and O(n^3); CRF has no guarantee
and inference on each example must be super fast
another advantage of perceptron: we just need argmax
5 Perceptron: from binary to structured
binary perceptron (Rosenblatt, 1959): 2 classes, trivial exact inference
multiclass perceptron (Freund/Schapire, 1999): constant # of classes, easy exact inference
structured perceptron (Collins, 2002): exponential # of classes (e.g. all tag sequences for "the man bit the dog", all translations of 那人咬了狗), hard exact inference
in each case: find z by exact inference, update weights if z ≠ y
6 Scalability Challenges
inference (on one example) is too slow (even with DP)
can we sacrifice search exactness for faster learning?
would inexact search interfere with learning?
if so, how should we modify learning?
7 Outline
Overview of Structured Learning
Challenges in Scalability
Structured Perceptron: convergence proof
Structured Perceptron with Inexact Search
Latent-Variable Structured Perceptron
8 Generic Perceptron
perceptron is the simplest machine learning algorithm
online learning: one example at a time
learning by doing: find the best output under the current weights, then update the weights at mistakes
loop: x_i → inference → z_i → update weights w (see the sketch below)
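A minimal sketch of this loop in Python. The helper names `inference` and `features` are placeholders for the task-specific argmax and feature map, not names from the slides:

```python
# Minimal sketch of the generic (structured) perceptron loop.
# inference(w, x) stands for argmax_z w . features(x, z).
from collections import defaultdict

def perceptron(data, inference, features, epochs=5):
    w = defaultdict(float)                 # sparse weight vector
    for _ in range(epochs):
        for x, y in data:                  # one example at a time
            z = inference(w, x)            # best output under current weights
            if z != y:                     # update weights at mistakes
                for f, v in features(x, y).items():
                    w[f] += v              # reward gold features
                for f, v in features(x, z).items():
                    w[f] -= v              # penalize predicted features
    return w
```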
9 Structured Perceptron
x_i: "the man bit the dog" → inference (with weights w) → z_i: DT NN NN DT NN
gold y_i: DT NN VBD DT NN → update weights
10 Example: POS Tagging
gold-standard y: DT NN VBD DT NN ("the man bit the dog")
current output z: DT NN NN DT NN ("the man bit the dog")
update direction: Φ(x, y) - Φ(x, z)
assume only two feature classes: tag bigrams (t_{i-1}, t_i) and word/tag pairs (w_i, t_i)
weights ++: (NN, VBD), (VBD, DT), (VBD, bit)
weights --: (NN, NN), (NN, DT), (NN, bit)
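A sketch of these two feature classes in Python, under the slide's assumptions (the feature-tuple encoding is illustrative):

```python
# The two feature classes from the slide: tag bigrams and word/tag pairs.
def features(words, tags):
    phi = {}
    prev = "<s>"                                 # start-of-sentence tag
    for word, tag in zip(words, tags):
        for f in [("bigram", prev, tag), ("wordtag", word, tag)]:
            phi[f] = phi.get(f, 0) + 1
        prev = tag
    return phi

x = "the man bit the dog".split()
gold = "DT NN VBD DT NN".split()
pred = "DT NN NN DT NN".split()
# The update +features(x, gold) - features(x, pred) gives +1 to
# (NN, VBD), (VBD, DT), (bit, VBD) and -1 to (NN, NN), (NN, DT),
# (bit, NN), exactly as on the slide.
```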
11 Inference: Dynamic Programming
exact inference: z = argmax, update weights if z ≠ y
tagging: O(nT^3) (Viterbi); CKY parsing: O(n^3)
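A bigram Viterbi sketch (O(nT^2); the slide's O(nT^3) corresponds to trigram tag features, the same recurrence one level deeper). The `score` argument is a placeholder for the dot product of the weights with the local features:

```python
# Bigram Viterbi decoding, O(n * T^2).
# score(w, words, i, prev, tag) is a placeholder for w . (local features).
def viterbi(w, words, tags, score):
    n = len(words)
    best = [{"<s>": 0.0}]                  # best[i][t]: best length-i prefix
    back = [{}]                            # ending in tag t, plus backpointers
    for i in range(n):
        best.append({}); back.append({})
        for tag in tags:
            cands = [(best[i][prev] + score(w, words, i, prev, tag), prev)
                     for prev in best[i]]
            best[i + 1][tag], back[i + 1][tag] = max(cands)
    tag = max(best[n], key=best[n].get)    # best tag for the last word
    out = [tag]
    for i in range(n, 1, -1):              # follow backpointers
        tag = back[i][tag]
        out.append(tag)
    return list(reversed(out))
```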
12 Perceptron vs. CRFs
perceptron is an online and Viterbi approximation of CRF: simpler to code; faster to converge; ~same accuracy
CRFs (Lafferty et al., 2001) maximize, over (x, y) ∈ D, the conditional likelihood exp(w · Φ(x, y)) / Z(x), where Z(x) = Σ_{z ∈ GEN(x)} exp(w · Φ(x, z))
going online gives stochastic gradient descent (SGD); the hard/Viterbi approximation replaces the expectations with an argmax
structured perceptron (Collins, 2002) is both: for (x, y) ∈ D, z = argmax_{z ∈ GEN(x)} w · Φ(x, z)
13 Perceptron Convergence Proof
binary classification: converges iff the data is separable
structured prediction: converges iff the data is separable
separability: there is an oracle vector that correctly labels all examples, one vs. the rest (correct label better than all incorrect labels)
theorem: if separable, then # of updates ≤ R^2 / δ^2 (R: diameter; δ: margin)
Novikoff => Freund & Schapire => Collins
14 Geometry of Convergence Proof, part 1
exact inference finds the exact 1-best z; update weights if z ≠ y
perceptron update: w^(k+1) = w^(k) + ΔΦ(x, y, z)
let u be the unit oracle vector; by separation, each update gains at least the margin δ along u, so by induction ‖w^(k+1)‖ ≥ u · w^(k+1) ≥ kδ (part 1: lower bound)
15 Geometry of Convergence Proof, part 2
summary: the proof uses 3 facts: 1. separation (margin); 2. diameter (always finite); 3. violation (guaranteed by exact search)
violation: the incorrect label scored higher, i.e. w^(k) · ΔΦ(x, y, z) ≤ 0 (the update makes an angle ≥ 90° with the current model)
with max diameter R (‖ΔΦ(x, y, z)‖ ≤ R), by induction ‖w^(k+1)‖^2 ≤ kR^2, so ‖w^(k+1)‖ ≤ √k R (part 2: upper bound)
combine with ‖w^(k+1)‖ ≥ kδ (part 1: lower bound): bound on # of updates k ≤ R^2 / δ^2
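A reconstruction of the slide's algebra in LaTeX (the garbled "apple" glyphs in the transcription stand for ≤):

```latex
\begin{align*}
&\text{Part 1 (lower bound): } u \cdot \Delta\Phi(x,y,z) \ge \delta
  \;\Rightarrow\; \|w^{(k+1)}\| \ge u \cdot w^{(k+1)} \ge k\delta \\
&\text{Part 2 (upper bound): } w^{(k)} \cdot \Delta\Phi \le 0 \text{ (violation)},
  \quad \|\Delta\Phi\| \le R \text{ (diameter)} \\
&\quad\Rightarrow\; \|w^{(k+1)}\|^2
  = \|w^{(k)}\|^2 + 2\,w^{(k)} \cdot \Delta\Phi + \|\Delta\Phi\|^2
  \le \|w^{(k)}\|^2 + R^2 \le kR^2 \\
&\text{Combining: } k\delta \le \|w^{(k+1)}\| \le \sqrt{k}\,R
  \;\Rightarrow\; k \le R^2/\delta^2.
\end{align*}
```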
16 Outline
Overview of Structured Learning
Challenges in Scalability
Structured Perceptron: convergence proof
Structured Perceptron with Inexact Search
Latent-Variable Perceptron
17 Scalability Challenge 1: Inference
binary classification (y = +1 / -1): constant # of classes, trivial exact inference
structured classification (e.g. "the man bit the dog" → DT NN VBD DT NN): exponential # of classes, hard exact inference
challenge: search efficiency (exponentially many classes)
often use dynamic programming (DP), but DP is still too slow for repeated use, e.g. parsing is O(n^3)
Q: can we sacrifice search exactness for faster learning?
18 Perceptron w/ Inexact Inference
inexact inference (greedy search, beam search): find z, update weights if z ≠ y. Q: does perceptron still work???
inexact inference is in routine use in NLP (e.g. beam search); how does structured perceptron work with inexact search?
so far most structured learning theory assumes exact search: would search errors break these learning properties?
if so, how to modify learning to accommodate inexact search? (a beam-search sketch follows)
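A beam-search sketch for tagging, reusing the `score` placeholder from the Viterbi sketch; beam size b = 1 reduces to greedy search:

```python
# Beam search for tagging; each hypothesis is (score, tag sequence).
# b = 1 is greedy search. score(...) is the same placeholder as above.
def beam_search(w, words, tags, score, b=4):
    beam = [(0.0, ["<s>"])]
    for i in range(len(words)):
        cands = [(s + score(w, words, i, seq[-1], tag), seq + [tag])
                 for s, seq in beam for tag in tags]
        beam = sorted(cands, reverse=True)[:b]   # prune to top-b hypotheses
    return beam[0][1][1:]                        # best sequence, minus "<s>"
```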
19 Bad News and Good News
A: it no longer works as is, but we can make it work by some magic.
bad news: no more guarantee of convergence; in practice perceptron degrades a lot due to search errors
good news: new update methods guarantee convergence; new perceptron variants that live with search errors; in practice they work really well w/ inexact search
20 Convergence with Exact Search
training example: "time flies" with correct label N V; output space {N, V} × {N, V}
with exact search, each update from w^(k) to w^(k+1) fixes a genuine mistake: structured perceptron converges with exact search
21 No Convergence w/ Greedy Search
same training example ("time flies", correct label N V, output space {N, V} × {N, V}): structured perceptron does not converge with greedy search
which part of the convergence proof no longer holds? the proof only uses 3 facts: 1. separation (margin); 2. diameter (always finite); 3. violation (guaranteed by exact search)
22 Geometry of Convergence Proof, part 2 (revisited)
recall: violation means the incorrect label scored higher, i.e. the update ΔΦ(x, y, z) makes an angle ≥ 90° with w^(k), which gives the upper bound ‖w^(k+1)‖^2 ≤ kR^2 (R: max diameter)
but if z is only an inexact 1-best, inexact search doesn't guarantee violation!
23 Observation: Violation is all we need!
exact search is not really required by the proof; rather, it is only used to ensure violation
violation: incorrect label scored higher; any of the many violations (not just the exact 1-best) yields a valid update
the proof only uses 3 facts: 1. separation (margin); 2. diameter (always finite); 3. violation (but no need for exactness)
24 Violation-Fixing Perceptron
if we guarantee violation, we don't care about exactness!
violation is good b/c we can at least fix a mistake; any violation gives a possible update, with the same mistake bound as before!
standard perceptron: z = exact inference, update weights if z ≠ y
violation-fixing perceptron: find a violation z, update weights (sketch below)
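The training loop changes only in where z comes from. A sketch, where `find_violation` is a placeholder for any strategy (early update, max-violation, ...) that returns a gold/incorrect pair in which the incorrect side scores at least as high:

```python
# Violation-fixing perceptron: identical to the standard loop except that
# find_violation(w, x, y) may return partial structures (prefixes), with
# w . features(x, z_part) >= w . features(x, y_part), or None if no mistake.
from collections import defaultdict

def violation_fixing_perceptron(data, find_violation, features, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            pair = find_violation(w, x, y)
            if pair is None:
                continue                   # no violation: nothing to fix
            y_part, z_part = pair
            for f, v in features(x, y_part).items():
                w[f] += v                  # reward the (partial) gold
            for f, v in features(x, z_part).items():
                w[f] -= v                  # penalize the (partial) mistake
    return w
```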
25 What if we can't guarantee violation?
this is why perceptron doesn't work well w/ inexact search: not every update is guaranteed to be a violation, so the proof breaks and there is no convergence guarantee
example: beam or greedy search; the model might prefer the correct label (under exact search), but the search prunes it away
such a non-violation update is a bad update because it doesn't fix any mistake: the new model still misguides the search
Q: how can we always guarantee violation?
26 Solution 1: Early Update (Collins/Roark 2004)
standard perceptron does not converge with greedy search (same "time flies" example, correct label N V, output space {N, V} × {N, V})
early update: stop and update at the first mistake
27 Early Update: Guarantees Violation
standard update doesn't converge b/c it doesn't guarantee violation
early update: at the first mistake, the incorrect prefix already scores higher: a violation!
28 Early Update: from Greedy to Beam
beam search is a generalization of greedy search (where b = 1): at each stage we keep the top-b hypotheses; widely used in tagging, parsing, translation...
early update: update when the correct label first falls off the beam (is pruned); violation guaranteed: up to this point the incorrect prefix scores higher
standard update (full update): no guarantee!
29 Solution 2: Max-Violation (Huang et al. 2012)
we now have a theory for early update (Collins/Roark), but it learns too slowly due to partial updates
max-violation: update at the prefix where the violation is maximum, the "worst mistake" in the search space
all these update methods are violation-fixing perceptrons (see the sketch below)
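A sketch of early update and max-violation on one beam-search pass. `beam_step` (extend every hypothesis one position and prune to the top b, best first) and `prefix_score` (w · features of a prefix) are hypothetical helpers, not names from the slides:

```python
# Early-update and max-violation search for a violation along the beam.
# beam_step(w, x, beam, b): one decoding step, returns the pruned beam;
# prefix_score(w, x, seq): model score of a prefix.
def find_violation(w, x, gold, beam_step, prefix_score, b=4, mode="max"):
    beam = [["<s>"]]
    best = (float("-inf"), None, None)     # (violation, gold prefix, mistake)
    for i in range(len(gold)):
        beam = beam_step(w, x, beam, b)
        gold_prefix = ["<s>"] + gold[:i + 1]
        viol = prefix_score(w, x, beam[0]) - prefix_score(w, x, gold_prefix)
        if viol > best[0]:
            best = (viol, gold_prefix, beam[0])
        if mode == "early" and gold_prefix not in beam:
            return gold_prefix, beam[0]    # gold just fell off: update here
    if mode == "max" and best[0] >= 0:
        return best[1], best[2]            # prefix with the biggest violation
    return None                            # gold survived; no violation found
```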
30 Four Experiments
part-of-speech tagging ("the man bit the dog" → DT NN VBD DT NN)
incremental parsing
bottom-up parsing w/ cube pruning
machine translation ("the man bit the dog" → 那人咬了狗)
31 Max-Violation > Early >> Standard
exp 1: part-of-speech tagging w/ beam search (on CTB5)
early and max-violation >> standard update at the smallest beams; this advantage shrinks as beam size increases
max-violation converges faster than early (and slightly better)
[plots: tagging accuracy on held-out vs. training time (hours) at beam = 1 (greedy), and best accuracy vs. beam size; curves: max-violation, early, standard]
32 Max-Violation > Early >> Standard (cont'd)
same tagging experiment at a larger beam size; the same ranking holds
[plots: tagging accuracy on held-out vs. training time (hours), and best accuracy vs. beam size; curves: max-violation, early, standard]
33 Max-Violation > Early >> Standard
exp 2: incremental dependency parser (Huang & Sagae 10), beam = 8
standard update is horrible due to search errors (79.0, omitted from the plot)
early update: 38 iterations, 15.4 hours (92.24)
max-violation: 10 iterations, 4.6 hours (92.25); max-violation is 3.3x faster than early update
[plots: parsing accuracy on held-out vs. training time (hours); curves: max-violation, early, standard]
34 Why is standard update so bad for parsing?
standard update works horribly with severe search error, due to a large number of invalid updates (non-violations)
parsing: exact DP is O(n^11), beam search is O(nb); tagging: exact DP is O(nT^3), beam search is O(nb)
[plots: parsing (b = 8) and tagging accuracy on held-out vs. training time (hours); % of invalid updates in standard update vs. beam size, parsing vs. tagging]
35 Exp 3: Bottom-up Parsing
CKY parsing with cube pruning for higher-order features (Zhang et al. 2013)
we extended our framework from graphs to hypergraphs
[plot: UAS on Penn-YM dev vs. epochs; curves: s-max, p-max, skip, standard]
36 Exp 4: Machine Translation
standard perceptron works poorly for machine translation b/c the invalid update ratio is very high (search quality is low)
max-violation converges faster than early update
first truly successful effort in large-scale training for translation (Yu et al. 2013)
[plots: BLEU vs. number of iterations (max-violation, early, standard, local), with non-local features; ratio of invalid updates for the standard perceptron vs. beam size, 50%-90%]
37 Comparison of Four Exps
the harder the search, the more advantageous the violation-fixing updates
[plots recapped from exps 1-4: tagging (accuracy vs. training time), incremental parsing (b = 8; max-violation, early, standard), bottom-up parsing (UAS vs. epochs; s-max, p-max, skip, standard), machine translation (BLEU vs. number of iterations; max-violation, early, standard, local)]
38 Outline
Overview of Structured Learning
Challenges in Scalability
Structured Perceptron: convergence proof
Structured Perceptron with Inexact Search
Latent-Variable Perceptron
39 Learning with Latent Variables
a.k.a. weakly-supervised or partially-observed learning
learning from "natural annotations"; more scalable
examples: translation, transliteration, semantic parsing...
parallel text: 布什与沙龙会谈 ↔ "Bush talked with Sharon", with a latent derivation (Liang et al. 2006; Yu et al. 2013; Xiao and Xiong 2013)
transliteration: コンピューター ↔ ko n p u : ta : ("computer") (Knight & Graehl, 1998; Kondrak et al. 2007, etc.)
QA pairs: "What is the largest state?" ↔ Alaska, with a latent derivation argmax(state, size) (Clark et al. 2010; Liang et al. 2013; Kwiatkowski et al. 2013)
40 Learning Latent Structures
binary/multiclass → structured learning → latent structures:
generative: naive Bayes → HMMs → EM (forward-backward)
conditional / discriminative: logistic regression (maxent) → CRFs → latent CRFs
online + Viterbi: perceptron → structured perceptron → latent perceptron
max margin: SVM → structured SVM → latent structured SVM
41 Latent Structured Perceptron
no explicit positive signal: hallucinate the correct derivation by the current weights during online learning
training example: 那人咬了狗 → "the man bit the dog" (current output z: "the dog bit the man")
d*: highest-scoring gold derivation in the forced decoding space (reward correct)
d̂: highest-scoring derivation in the full (beam) search space (penalize wrong)
update: w ← w + Φ(x, d*) - Φ(x, d̂) (Liang et al. 2006; Yu et al. 2013); a sketch follows
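A sketch of this update. `decode` is a placeholder with a hypothetical `constrain_to` argument that restricts the search to derivations producing the reference (forced decoding); `output_of` extracts the string a derivation yields:

```python
# Latent-variable perceptron update: reward the best reference-producing
# (forced) derivation, penalize the best unconstrained derivation.
# decode(w, x, constrain_to=...), output_of(d), and features(x, d) are
# placeholders for the decoder, its yield, and the feature map.
def latent_update(w, x, y, decode, output_of, features):
    d_star = decode(w, x, constrain_to=y)      # best gold derivation
    d_hat = decode(w, x, constrain_to=None)    # best derivation, full space
    if output_of(d_hat) != y:                  # mistake: fix it
        for f, v in features(x, d_star).items():
            w[f] = w.get(f, 0.0) + v           # reward correct
        for f, v in features(x, d_hat).items():
            w[f] = w.get(f, 0.0) - v           # penalize wrong
    return w
```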
42 Unconstrained Search
example: beam search for phrase-based decoding of "Bushi yu Shalong juxing le huitan"
hypotheses in the beam cover partial translations such as "Bush...", "...Sharon...", "...meeting...", "...talks...", "...talk..."
43 Constrained Search
forced decoding: must produce the exact reference translation "Bush held talks with Sharon" for "Bushi yu Shalong juxing le huitan"
there can be more than one gold derivation: a gold derivation lattice
44 Search Errors in Decoding
(recap of slide 41: no explicit positive signal; hallucinate the correct derivation by the current weights; reward the highest-scoring gold derivation d* from the forced decoding space, penalize the highest-scoring derivation d̂ from the full beam search space)
problem: search errors (Liang et al. 2006; Yu et al. 2013)
45 Search Error: Gold Derivations Pruned
in real decoding, beam search can prune away the entire gold derivation lattice (e.g. "Bush held talks with Sharon"); learning should address search errors here!
46 Fixing Search Error 1: Early Update
early update (Collins/Roark 04): update when the correct derivation falls off the beam; up to this point the incorrect prefix scores higher: that's a violation we want to fix; proof in (Huang et al. 2012)
standard perceptron does not guarantee violation: the correct sequence (pruned) might score higher at the end! such an invalid update reinforces the model error
47 Early Update w/ Latent Variable
the gold-standard derivations are not annotated; we treat any reference-producing derivation as good (a gold derivation lattice)
early update: when all correct derivations fall off the beam, stop decoding and update; violation guaranteed: the incorrect prefix scores higher up to this point
48 Fixing Search Error 2: Max-Violation
standard update is invalid once the correct sequences all fall off the beam
early update works but learns slowly due to partial updates
max-violation: update at the prefix where the violation (best in the beam minus best gold prefix) is maximum, the "worst mistake" in the search space, now extended to handle latent variables
49 Latent-Variable Perceptron
update points along the beam-search trajectory: early (where the correct sequences fall off the beam), max-violation (biggest violation), latest (last valid update), full/standard (at the end: an invalid update if the correct sequence was pruned!)
50 Roadmap of Techniques
structured perceptron (Collins, 2002)
latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009)
perceptron w/ inexact search (Collins & Roark, 2004; Huang et al. 2012)
latent-variable perceptron w/ inexact search (Yu et al. 2013; Zhao et al. 2014): MT, syntactic parsing, semantic parsing, transliteration
51 Experiments: Discriminative Training for MT
standard update (Liang et al.'s "bold") works poorly b/c the invalid update ratio is very high (search quality is low); this explains why Liang et al. 06 failed (std ~ bold; local ~ local)
max-violation converges faster than early update
[plots: BLEU vs. number of iterations (MaxForce, local, early, MERT); ratio of invalid updates for the standard latent-variable perceptron vs. beam size, 50%-90%, with non-local features]
52 Final Conclusions
online structured learning is simple and powerful
search efficiency is the key challenge
search errors do interfere with learning
but we can use the violation-fixing perceptron w/ inexact search
we can extend the perceptron to learn latent structures