Structured Prediction

1 Machine Learning, Fall 2017: Structured Prediction (structured perceptron, HMM, structured SVM). Professor Liang Huang (Chap. 17 of CIML)

2 Running example: x = "the man bit the dog", with outputs shown as a binary label (y = +1 / y = -1), a tag sequence (DT NN VBD DT NN), and a parse tree (S over NP and VP). Binary classification: the output is binary. Multiclass classification: the output is a number (small # of classes). Structured classification: the output is a structure (sequence, tree, graph), e.g. part-of-speech tagging, parsing, summarization, translation. Exponentially many classes: search (inference) efficiency is crucial!

3 Generic Perceptron. Online learning: one example at a time. Learning by doing: update weights at mistakes. Find the best output under the current weights. [Diagram: x_i → inference (with w) → z_i → update weights]
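
A minimal sketch of this loop, with the inference routine and the feature map passed in as placeholders (their concrete form depends on the task; for structured outputs, decode would be a dynamic program such as Viterbi):

```python
from collections import defaultdict

def perceptron_train(data, decode, features, epochs=5):
    """data: list of (x, y) pairs; decode(x, w) returns the best z under w;
    features(x, y) returns a sparse dict of feature counts."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:            # online: one example at a time
            z = decode(x, w)         # find the best output under current weights
            if z != y:               # learning by doing: update only at mistakes
                for f, v in features(x, y).items():
                    w[f] += v        # reward the correct output's features
                for f, v in features(x, z).items():
                    w[f] -= v        # penalize the predicted output's features
    return w
```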

4 Perceptron: from binary to structured. Binary classification: 2 classes (y = +1 / y = -1), inference is trivial. Multiclass classification: constant # of classes, inference is easy. Structured classification: exponential # of classes, inference is hard (e.g. x = "the man bit the dog", z = DT NN VBD DT NN). In every case the loop is the same: x → exact inference → z; update weights if z ≠ y.

5 From Perceptron to SVM (timeline): 1959 Rosenblatt: invention; 1962 Novikoff: convergence proof; 1964 Vapnik & Chervonenkis: max margin (same journal!); 1964 Aizerman et al.: online kernels (kernel perceptron); 1995 Cortes/Vapnik: SVM (+ soft margin; fall of the USSR); 1999 Freund/Schapire: voted/averaged perceptron (revived; inseparable case); 2001 Lafferty et al.: CRF (multinomial logistic regression / max entropy); 2002 Collins: structured perceptron (AT&T Research); 2003 Crammer/Singer: MIRA (online approximate max margin; conservative vs. aggressive updates); 2003 Taskar et al.: M3N; 2004 Tsochantaridis et al.: structured SVM; 2005 McDonald et al.: structured MIRA (ex-AT&T and students); 2006 Singer group: Pegasos (subgradient descent; minibatch/batch).

6 Multiclass Classification: one weight vector ("prototype") for each class. Multiclass decision rule: ŷ = argmax_{z ∈ {1,...,M}} w^(z) · x (best agreement with the prototype). Q1: what about the 2-class case? Q2: do we still need the augmented space?

7 Multiclass Perceptron: on an error, penalize the weight vector of the wrong class, and reward the weight vector of the true class.
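
A minimal sketch of slides 6-7 with dense vectors (NumPy is an implementation choice here, not the lecture's):

```python
import numpy as np

def multiclass_perceptron(X, Y, num_classes, epochs=10):
    """X: (n, d) feature matrix; Y: length-n array of labels in {0..num_classes-1}."""
    n, d = X.shape
    W = np.zeros((num_classes, d))    # one prototype per class
    for _ in range(epochs):
        for x, y in zip(X, Y):
            z = int(np.argmax(W @ x)) # best agreement with a prototype
            if z != y:
                W[y] += x             # reward the true class
                W[z] -= x             # penalize the wrong class
    return W
```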

8 Convergence of Multiclass. Update rule: w ← w + Δɸ(x, y, z), where Δɸ(x, y, z) = ɸ(x, y) − ɸ(x, z). Separability: ∃ u with ‖u‖ = 1 s.t. ∀(x, y) ∈ D and for all z ≠ y, u · Δɸ(x, y, z) ≥ δ.

9 Example: POS Tagging. Gold standard y: DT NN VBD DT NN (the man bit the dog); current output z: DT NN NN DT NN (the man bit the dog). Assume only two feature classes: tag bigrams (t_{i-1}, t_i) and word/tag pairs (w_i, t_i). Weights ++: (NN, VBD), (VBD, DT), (VBD, bit). Weights --: (NN, NN), (NN, DT), (NN, bit). ɸ(x, y) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (VBD, bit): 1, ..., (NN, dog): 1}; ɸ(x, z) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (NN, bit): 1, ..., (NN, dog): 1}; the update direction is ɸ(x, y) − ɸ(x, z).
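
The feature vectors and the update direction on this slide can be computed with a small sketch (the feature-key names are illustrative assumptions):

```python
from collections import defaultdict

def phi(words, tags):
    """Two feature classes as on slide 9: tag bigrams (t_{i-1}, t_i) with <s>/</s>
    boundaries, and word/tag pairs (w_i, t_i)."""
    feats = defaultdict(int)
    padded = ["<s>"] + list(tags) + ["</s>"]
    for prev, cur in zip(padded, padded[1:]):
        feats[("bigram", prev, cur)] += 1
    for w, t in zip(words, tags):
        feats[("word/tag", w, t)] += 1
    return feats

x = "the man bit the dog".split()
gold = ["DT", "NN", "VBD", "DT", "NN"]
pred = ["DT", "NN", "NN", "DT", "NN"]

# Perceptron update direction: phi(x, y) - phi(x, z); everything else cancels.
delta = defaultdict(int, phi(x, gold))
for f, v in phi(x, pred).items():
    delta[f] -= v
print({f: v for f, v in delta.items() if v != 0})
# ++: (NN, VBD), (VBD, DT), (VBD, bit)   --: (NN, NN), (NN, DT), (NN, bit)
```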

10 Structured Perceptron. [Diagram: x_i → inference (argmax with w) → z_i → update weights]

11 Inference: Dynamic Programming. [Diagram: x → exact inference (with w) → z; update weights if z ≠ y]

12 Python implementation. Q: what about top-down recursion + memoization? (cf. CS Lec 5-6: Probs & WFSTs)
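
Slide 12 refers to a Python implementation of this DP; as a stand-in, here is a minimal bottom-up Viterbi sketch for a bigram tagger, using the same illustrative feature names as the phi() sketch above (an assumption, not the lecture's actual code):

```python
def viterbi(words, w, tagset):
    """Best tag sequence under sparse weights w (bigram + word/tag features)."""
    best = {"<s>": (0.0, [])}    # tag -> (best partial score, tag sequence so far)
    for word in words:
        new = {}
        for tag in tagset:
            new[tag] = max(
                (score + w.get(("bigram", prev, tag), 0.0)
                       + w.get(("word/tag", word, tag), 0.0),
                 seq + [tag])
                for prev, (score, seq) in best.items()
            )
        best = new
    # close the sequence with the transition into </s>
    return max(
        (score + w.get(("bigram", tag, "</s>"), 0.0), seq)
        for tag, (score, seq) in best.items()
    )[1]
```

This runs in O(n · |T|²) time; the top-down recursive alternative the slide asks about would memoize on (position, previous tag) and fill in exactly the same table.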

13 Efficiency vs. Expressiveness. Decode: z_i = argmax_{z ∈ GEN(x_i)} w · ɸ(x_i, z); then update weights. The inference (argmax) must be efficient: either the search space GEN(x) is small, or it factors. Features must be local to y (but can be global to x), e.g. a bigram tagger, which can still look at all input words (cf. CRFs).

14 Averaged Perceptron: output the average of all intermediate weight vectors, w̄ = (1/|W|) Σ_j w^(j), instead of the last one. Gives more stable and accurate results; an approximation of the voted perceptron (Freund & Schapire, 1999).

15 Averaging Tricks, Daume (2006, PhD thesis). Use sparse vectors (defaultdict), and maintain the running sums incrementally instead of storing every weight vector: w̄^(1) = w^(1); w̄^(2) = w^(1) + w^(2); w̄^(3) = w^(1) + w^(2) + w^(3); w̄^(4) = w^(1) + w^(2) + w^(3) + w^(4); ...
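
One common way to realize this trick (a sketch in the spirit of Daume's defaultdict-based implementation; the lecture's exact bookkeeping may differ) keeps, besides w, a cumulative correction u weighted by the update time, so the averaged weights fall out as w − u/T at the end:

```python
from collections import defaultdict

def averaged_perceptron(data, decode, features, epochs=5):
    w = defaultdict(float)   # current weights
    u = defaultdict(float)   # cumulative correction, weighted by update time
    t = 0                    # number of examples seen so far
    for _ in range(epochs):
        for x, y in data:
            z = decode(x, w)
            if z != y:
                delta = defaultdict(float, features(x, y))
                for f, v in features(x, z).items():
                    delta[f] -= v
                for f, v in delta.items():
                    w[f] += v
                    u[f] += t * v   # later updates contribute to fewer w^(t)'s
            t += 1
    return {f: w[f] - u[f] / t for f in w}   # average of w after every example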

16 Do we need smoothing? Smoothing is much easier in discriminative models: just make sure that for each feature template, its subset templates are also included. E.g., to include (t0, w0, w-1) you must also include (t0, w0), (t0, w-1), (w0, w-1), and maybe also (t0, t-1), because t is less sparse than w.
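
As an illustration of the template rule, here is a hypothetical feature function that instantiates the conjoined template together with its subset templates at position i (the names and the exact set are assumptions, not the lecture's feature set):

```python
def local_features(words, tags, i):
    """Illustrative subset-template instantiation at position i."""
    w0 = words[i]
    wm1 = words[i - 1] if i > 0 else "<s>"
    t0 = tags[i]
    tm1 = tags[i - 1] if i > 0 else "<s>"
    return {
        ("t0 w0 w-1", t0, w0, wm1): 1,   # the conjoined template ...
        ("t0 w0", t0, w0): 1,            # ... plus all its subset templates
        ("t0 w-1", t0, wm1): 1,
        ("w0 w-1", w0, wm1): 1,
        ("t0 t-1", t0, tm1): 1,          # tags are less sparse than words
    }
```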

17 Convergence with Exact Search. Linear classification: converges iff the data is separable. Structured: converges iff the data is separable and the search is exact. Separable: there is an oracle vector that correctly labels all examples, one vs. the rest (the correct label scores better than all incorrect labels). Theorem: if separable with margin δ, then the # of updates ≤ R²/δ², where R is the diameter of the data. (Rosenblatt ⇒ Collins.)

18 Geometry of Convergence Proof, part 1. At a mistake the perceptron update is w^(k+1) = w^(k) + Δɸ(x, y, z), where w^(k) is the current model and w^(k+1) the new model. Projecting onto the unit oracle vector u, the separation margin δ gives u · w^(k+1) = u · w^(k) + u · Δɸ(x, y, z) ≥ u · w^(k) + δ, so by induction u · w^(k+1) ≥ kδ.

19 Geometry of Convergence Proof, part 2. A violation means the incorrect label scored higher, i.e. w^(k) · Δɸ(x, y, z) ≤ 0. With R the max diameter (R ≥ ‖Δɸ(x, y, z)‖), the perceptron update gives ‖w^(k+1)‖² = ‖w^(k)‖² + 2 w^(k) · Δɸ(x, y, z) + ‖Δɸ(x, y, z)‖² ≤ ‖w^(k)‖² + R², so by induction ‖w^(k+1)‖² ≤ kR² (part 2: upper bound). Parts 1 + 2 ⇒ update bound: k ≤ R²/δ².
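
Chaining the two parts makes the final step explicit (a standard derivation, written out here since the slide only states the result):

```latex
k\,\delta \;\le\; u \cdot w^{(k+1)} \;\le\; \|u\|\,\|w^{(k+1)}\| \;=\; \|w^{(k+1)}\|,
\qquad
\|w^{(k+1)}\|^2 \;\le\; k R^2
\;\Longrightarrow\;
k^2 \delta^2 \;\le\; k R^2
\;\Longrightarrow\;
k \;\le\; R^2 / \delta^2 .
```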

20 Experiments

21 Experiments: Tagging. (Almost) identical features to (Ratnaparkhi, 1996). Trigram tagger: current tag t_i, previous tags t_{i-1}, t_{i-2}; current word w_i and its spelling features; surrounding words w_{i-1}, w_{i+1}, w_{i-2}, w_{i+2}.

22 Experiments: NP Chunking, B-I-O scheme. Example: Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/B signed/O a/B tentative/I agreement/I ... Features: unigram model, surrounding words and POS tags.

23 Experiments: NP Chunking results (Sha and Pereira, 2003), trigram tagger: voted perceptron 94.09% vs. CRF 94.38%.

24 Structured SVM. Structured perceptron: w · Δɸ(x, y, z) > 0. Binary SVM: for all (x, y), functional margin y(w · x) ≥ 1. Structured SVM, version 1 (simple loss): for all (x, y), for all z, margin w · Δɸ(x, y, z) ≥ 1; the correct y has to score higher than any wrong z by 1. Structured SVM, version 2 (structured loss): for all (x, y), for all z, margin w · Δɸ(x, y, z) ≥ Δ(y, z); the correct y has to score higher than any wrong z by Δ(y, z), a distance metric such as Hamming loss.
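
For reference, the version-2 constraints are those of the standard margin-rescaled structured SVM program (stated here with slack variables ξ_i for the non-separable case; the lecture's exact notation may differ):

```latex
\begin{align*}
\min_{w,\;\xi \ge 0} \quad & \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \\
\text{s.t.} \quad & w \cdot \Delta\phi(x_i, y_i, z) \;\ge\; \Delta(y_i, z) - \xi_i
  \qquad \text{for all } i \text{ and all } z \ne y_i
\end{align*}
```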

25 Loss-Augmented Decoding. We want, for all z: w · ɸ(x, y) ≥ w · ɸ(x, z) + Δ(y, z). Same as: w · ɸ(x, y) ≥ max_z [w · ɸ(x, z) + Δ(y, z)]. Loss-augmented decoding: argmax_z [w · ɸ(x, z) + Δ(y, z)]. If Δ(y, z) factors over z (e.g. Hamming loss), just modify the DP. The CIML version: the modified DP should have a learning rate, λ = 1/(2C); it is very similar to Pegasos, but one should use the Pegasos framework instead (next slide).
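
Since Hamming loss decomposes per position, loss-augmented decoding is the same Viterbi recurrence with the per-position loss added to the local score. Below is a minimal sketch in the style of the earlier viterbi()/phi() sketches (the feature names are the same illustrative assumptions, not the lecture's code):

```python
def loss_augmented_viterbi(words, gold_tags, w, tagset):
    """argmax_z [ w . phi(x, z) + Hamming(y, z) ], loss added position by position."""
    best = {"<s>": (0.0, [])}
    for word, gold in zip(words, gold_tags):
        new = {}
        for tag in tagset:
            loss = 0.0 if tag == gold else 1.0    # per-position Hamming term
            new[tag] = max(
                (score + w.get(("bigram", prev, tag), 0.0)
                       + w.get(("word/tag", word, tag), 0.0)
                       + loss,
                 seq + [tag])
                for prev, (score, seq) in best.items()
            )
        best = new
    return max(
        (score + w.get(("bigram", tag, "</s>"), 0.0), seq)
        for tag, (score, seq) in best.items()
    )[1]
```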

26 Correct Version (following Pegasos). We want, for all z: w · ɸ(x, y) ≥ w · ɸ(x, z) + Δ(y, z); same as: w · ɸ(x, y) ≥ max_z [w · ɸ(x, z) + Δ(y, z)]; loss-augmented decoding: argmax_z [w · ɸ(x, z) + Δ(y, z)]; if Δ(y, z) factors over z (e.g. Hamming), just modify the DP. Pegasos-style constants: 1/t and NC/(2t), where N = |D| and C is from the SVM; t += 1 for each example.
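
A rough sketch of what a Pegasos-style training loop for the structured SVM could look like, reusing phi() and loss_augmented_viterbi() from the sketches above. The 1/(λt) step size and the (1 − 1/t) shrinkage follow the generic Pegasos update; the exact mapping of λ to N and C used in the lecture is not reproduced here, so treat lam as a hyperparameter:

```python
def pegasos_structured_train(data, tagset, epochs=5, lam=0.01):
    """data: list of (words, gold_tags) pairs."""
    w, t = {}, 0
    for _ in range(epochs):
        for words, gold in data:
            t += 1
            z = loss_augmented_viterbi(words, gold, w, tagset)
            for f in w:                    # shrink: w <- (1 - 1/t) * w
                w[f] *= 1.0 - 1.0 / t
            if z != gold:                  # margin constraint violated
                delta = dict(phi(words, gold))
                for f, v in phi(words, z).items():
                    delta[f] = delta.get(f, 0) - v
                for f, v in delta.items(): # step of size 1/(lam * t)
                    w[f] = w.get(f, 0.0) + v / (lam * t)
    return w
```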

27 Structured Perceptron vs. Structured SVM: tagging on ATIS (train: 488 sentences); error: SVM < averaged perceptron << perceptron.
Perceptron:
epoch 1: updates 102, |W|=291, train_err 3.90%, dev_err 9.36%, avg_err 6.14%
epoch 2: updates 91, |W|=334, train_err 3.33%, dev_err 8.19%, avg_err 4.97%
epoch 3: updates 78, |W|=347, train_err 2.92%, dev_err 5.85%, avg_err 4.97%
epoch 4: updates 81, |W|=368, train_err 3.11%, dev_err 6.73%, avg_err 5.85%
epoch 5: updates 78, |W|=378, train_err 2.70%, dev_err 6.14%, avg_err 5.56%
epoch 6: updates 63, |W|=385, train_err 2.26%, dev_err 6.14%, avg_err 5.56%
epoch 7: updates 69, |W|=385, train_err 2.43%, dev_err 7.02%, avg_err 5.56%
epoch 8: updates 60, |W|=388, train_err 2.15%, dev_err 6.73%, avg_err 5.56%
epoch 9: updates 59, |W|=390, train_err 2.04%, dev_err 6.14%, avg_err 5.56%
epoch 10: updates 64, |W|=394, train_err 2.15%, dev_err 5.85%, avg_err 5.26%
SVM (C=1):
epoch 1: updates 116, |W|=311, train_err 4.55%, dev_err 5.85%
epoch 2: updates 82, |W|=328, train_err 3.05%, dev_err 4.97%
epoch 3: updates 78, |W|=334, train_err 2.92%, dev_err 5.56%
epoch 4: updates 77, |W|=339, train_err 2.92%, dev_err 5.26%
epoch 5: updates 80, |W|=344, train_err 2.94%, dev_err 5.56%
epoch 6: updates 73, |W|=345, train_err 2.75%, dev_err 4.68%
epoch 7: updates 72, |W|=347, train_err 2.75%, dev_err 4.97%
epoch 8: updates 75, |W|=352, train_err 2.86%, dev_err 4.97%
epoch 9: updates 74, |W|=353, train_err 2.78%, dev_err 4.97%
epoch 10: updates 72, |W|=354, train_err 2.78%, dev_err 4.97%
