Machine Learning Fall 2017 (structured perceptron, HMM, structured SVM) Professor Liang Huang (Chap. 17 of CIML)
- binary classification: output is binary (y = +1 or y = -1)
- multiclass classification: output is a number (small # of classes)
- structured classification: output is a structure (sequence, tree, graph), e.g. part-of-speech tagging, parsing, summarization, translation
- exponentially many classes: search (inference) efficiency is crucial!
(figure: "the man bit the dog" shown as a binary example, as the tag sequence DT NN VBD DT NN, and as a parse tree with S, NP, VP nodes)
Generic Perceptron
- online learning: one example at a time; learning by doing: update weights at mistakes
- find the best output under the current weights
(diagram: x_i goes to inference, producing z_i; compare z_i with y_i; update weights w)
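In code, the generic loop might look like this minimal Python sketch; phi (the feature map) and inference (the argmax decoder) are assumed hooks, not given on the slide:

from collections import defaultdict

def perceptron(data, phi, inference, epochs=10):
    w = defaultdict(float)                    # sparse weight vector
    for _ in range(epochs):
        for x, y in data:                     # online: one example at a time
            z = inference(x, w)               # best output under current weights
            if z != y:                        # update weights at mistakes
                for f, v in phi(x, y).items():
                    w[f] += v                 # reward features of the correct output
                for f, v in phi(x, z).items():
                    w[f] -= v                 # penalize features of the wrong output
    return w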
Perceptron: from binary to structured
- binary classification: 2 classes; inference is trivial (sign of w . x)
- multiclass classification: constant # of classes; inference is easy (argmax over classes)
- structured classification: exponential # of classes; inference is hard (e.g. "the man bit the dog" mapped to DT NN VBD DT NN)
- in all three, the loop is the same: exact inference of z under w, then update weights if z differs from y
From Perceptron to SVM (a timeline)
- 1959 Rosenblatt: invention of the perceptron
- 1962 Novikoff: convergence proof
- 1964 Vapnik & Chervonenkis: max margin; 1964 Aizerman+: kernels (same journal!)
- 1995 Cortes/Vapnik: SVM (kernel perceptron + soft margin; after the fall of the USSR)
- 1999 Freund/Schapire: voted/averaged perceptron (inseparable case; revived the perceptron)
- 2001 Lafferty+: CRF (multinomial logistic regression, i.e. max. entropy; AT&T Research)
- 2002 Collins: structured perceptron (ex-AT&T and students)
- 2003 Crammer/Singer: MIRA (online approx. max margin; conservative updates)
- 2003 Taskar: M3N; 2004 Tsochantaridis+: structured SVM (batch)
- 2005 McDonald+: structured MIRA
- 2006 Singer group: aggressive updates
- 2007-2010 Singer group: Pegasos (subgradient descent, minibatch)
Multiclass Classification
- one weight vector ("prototype") for each class
- multiclass decision rule: ŷ = argmax_{z ∈ 1..M} w^(z) . x (best agreement w/ prototype)
- Q1: what about 2-class? Q2: do we still need augmented space?
Multiclass Perceptron
- on an error, penalize the weight for the wrong class and reward the weight for the true class: w^(y) += x, w^(z) -= x
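A minimal numpy sketch of the decision rule and this update (the array shapes and function name are my assumptions, not from the slide):

import numpy as np

def multiclass_perceptron(X, Y, num_classes, epochs=10):
    W = np.zeros((num_classes, X.shape[1]))   # one prototype weight vector per class
    for _ in range(epochs):
        for x, y in zip(X, Y):
            z = int(np.argmax(W @ x))         # decision rule: best agreement w/ prototype
            if z != y:                        # on an error:
                W[y] += x                     #   reward the weight for the true class
                W[z] -= x                     #   penalize the weight for the wrong class
    return W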
Convergence of Multiclass
- update rule: w ← w + Δφ(x, y, z), where Δφ(x, y, z) = φ(x, y) - φ(x, z)
- separability: ∃ u, δ > 0 s.t. ∀(x, y) ∈ D, u . Δφ(x, y, z) ≥ δ for all z ≠ y
Example: POS Tagging
- gold standard y: DT NN VBD DT NN; current output z: DT NN NN DT NN (for "the man bit the dog")
- assume only two feature classes: tag bigrams (t_{i-1}, t_i) and word/tag pairs (w_i, t_i)
- φ(x, y) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (VBD, bit): 1, ..., (NN, dog): 1}
- φ(x, z) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (NN, bit): 1, ..., (NN, dog): 1}
- the update φ(x, y) - φ(x, z) rewards (++) the weights for (NN, VBD), (VBD, DT), (VBD, bit) and penalizes (--) the weights for (NN, NN), (NN, DT), (NN, bit)
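The feature vectors on this slide can be reproduced with a short Python sketch (the helper name phi is mine; the two feature templates are the slide's):

from collections import defaultdict

def phi(words, tags):
    feats = defaultdict(int)
    for prev, t in zip(["<s>"] + tags, tags + ["</s>"]):
        feats[(prev, t)] += 1                 # tag-bigram features (t_{i-1}, t_i)
    for wd, t in zip(words, tags):
        feats[(t, wd)] += 1                   # word/tag features (t_i, w_i)
    return feats

words = "the man bit the dog".split()
y = ["DT", "NN", "VBD", "DT", "NN"]           # gold standard
z = ["DT", "NN", "NN", "DT", "NN"]            # current (wrong) output
delta = defaultdict(int)
for f, v in phi(words, y).items(): delta[f] += v
for f, v in phi(words, z).items(): delta[f] -= v
print({f: v for f, v in delta.items() if v})  # +1 on (NN,VBD),(VBD,DT),(VBD,bit); -1 on (NN,NN),(NN,DT),(NN,bit)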
Structured Perceptron
(diagram: x_i goes to exact inference under the current w, producing z_i; compare z_i with the gold y_i; update weights on mistakes)
Inference: Dynamic Programming
- the argmax over exponentially many outputs is computed exactly by dynamic programming, e.g. Viterbi for sequence tagging
(diagram: w, x go to exact inference, producing z; update weights if z differs from y)
Python implementation
- Q: what about top-down recursive + memoization? (see CS 562, Lec 5-6: Probs & WFSTs)
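A minimal bottom-up Viterbi sketch in Python; the local scoring hook score(w, prev_tag, tag, words, i) is my assumed interface, not the course code (a top-down memoized version, as the question asks, would fill the same table):

def viterbi(words, tags, score, w):
    n = len(words)
    best = [{} for _ in range(n)]   # best[i][t] = best score of a prefix ending in tag t
    back = [{} for _ in range(n)]   # backpointers for recovering the argmax sequence
    for t in tags:
        best[0][t] = score(w, "<s>", t, words, 0)
    for i in range(1, n):
        for t in tags:
            # max over the previous tag: DP shares prefixes, so this is O(n |T|^2)
            p = max(best[i-1], key=lambda tp: best[i-1][tp] + score(w, tp, t, words, i))
            best[i][t] = best[i-1][p] + score(w, p, t, words, i)
            back[i][t] = p
    last = max(best[n-1], key=best[n-1].get)
    z = [last]
    for i in range(n-1, 0, -1):
        z.append(back[i][z[-1]])
    return list(reversed(z))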
Efficiency vs. Expressiveness
- the inference (argmax over GEN(x)) must be efficient
- either the search space GEN(x) is small, or it is factored
- features must be local to y (but can be global to x), e.g. a bigram tagger can look at all input words (cf. CRFs)
Averaged Perceptron
- output the average of all intermediate weight vectors, w̄ = (1/T) Σ_j w^(j), instead of the final one
- more stable and accurate results
- approximation of the voted perceptron (Freund & Schapire, 1999)
Averaging Tricks
- Daume (2006, PhD thesis): store weights in a sparse vector (defaultdict)
- maintain a running cumulative sum so that w̄^(t) = (1/t)(w^(1) + w^(2) + ... + w^(t)) is available without storing every intermediate w^(j)
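A sketch of the lazy-averaging bookkeeping (class and member names are mine; modulo the usual off-by-one, this is Daume's trick):

from collections import defaultdict

class AveragedWeights:
    def __init__(self):
        self.w = defaultdict(float)       # current weights w^(t)
        self.wa = defaultdict(float)      # time-weighted sum of updates
        self.t = 0                        # examples seen so far

    def tick(self):
        self.t += 1                       # call once per example

    def update(self, feat, value):
        self.w[feat] += value
        self.wa[feat] += self.t * value   # bookkeeping for the lazy average

    def averaged(self):
        # the average of w^(1)..w^(t) equals w - wa/t, so we never sum t vectors
        return {f: v - self.wa[f] / max(self.t, 1) for f, v in self.w.items()}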
Do we need smoothing?
- smoothing is much easier in discriminative models
- just make sure that for each feature template, its subset templates are also included
- e.g., to include (t_0, w_0, w_{-1}) you must also include (t_0, w_0), (t_0, w_{-1}), (w_0, w_{-1}), and maybe also (t_0, t_{-1}), because t is less sparse than w
Convergence with Exact Search
- linear classification: converges iff the data is separable
- structured classification: converges iff the data is separable and the search is exact
- separability: there is an oracle vector that correctly labels all examples, one vs. the rest (the correct label scores better than all incorrect labels)
- theorem: if separable with margin δ, then # of updates ≤ R²/δ² (R: diameter of the data)
- Rosenblatt (1957) carries over to Collins (2002)
(figure: separable binary data with margin δ and diameter R, and the structured analogue)
Geometry of Convergence Proof, part 1
- perceptron update: w^(k+1) = w^(k) + Δφ(x, y, z), moving the current model toward the correct label and away from the incorrect one
- project onto the unit oracle vector u: u . w^(k+1) = u . w^(k) + u . Δφ(x, y, z) ≥ u . w^(k) + δ (separation with margin δ)
- by induction: u . w^(k+1) ≥ kδ after k updates (part 1: lower bound on the projection)
Geometry of Convergence Proof, part 2
- violation: the incorrect label scored higher, i.e. w^(k) . Δφ(x, y, z) ≤ 0 (the angle between w^(k) and the update is at least 90°)
- so ‖w^(k+1)‖² = ‖w^(k)‖² + 2 w^(k) . Δφ(x, y, z) + ‖Δφ(x, y, z)‖² ≤ ‖w^(k)‖² + R² (R: max diameter)
- by induction: ‖w^(k+1)‖² ≤ kR² (part 2: upper bound on the norm)
- parts 1 + 2 give the update bound: k ≤ R²/δ²
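Combining the two parts gives the theorem; a reconstruction of the concluding step in LaTeX (using that u is a unit vector and Cauchy-Schwarz):

k\delta \;\le\; \mathbf{u} \cdot \mathbf{w}^{(k+1)} \;\le\; \|\mathbf{u}\|\,\|\mathbf{w}^{(k+1)}\| \;\le\; \sqrt{k}\,R \quad\Longrightarrow\quad k \;\le\; R^2/\delta^2 .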
Experiments
Experiments: Tagging
- (almost) identical features from (Ratnaparkhi, 1996)
- trigram tagger: current tag t_i, previous tags t_{i-1}, t_{i-2}
- current word w_i and its spelling features
- surrounding words w_{i-1}, w_{i+1}, w_{i-2}, w_{i+2}, ...
Experiments: NP Chunking
- B-I-O scheme:
  Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/B signed/O a/B tentative/I agreement/I ...
- features: unigram model; surrounding words and POS tags
Experiments: NP Chunking results
- (Sha and Pereira, 2003), trigram tagger
- voted perceptron: 94.09% vs. CRF: 94.38%
Structured SVM
- structured perceptron: w . Δφ(x, y, z) > 0
- SVM: for all (x, y), functional margin y (w . x) ≥ 1
- structured SVM version 1 (simple loss): for all (x, y), for all z, margin w . Δφ(x, y, z) ≥ 1; the correct y has to score higher than any wrong z by 1
- structured SVM version 2 (structured loss): for all (x, y), for all z, margin w . Δφ(x, y, z) ≥ Δ(y, z); the correct y has to score higher than any wrong z by Δ(y, z), a distance metric such as Hamming loss
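For reference, version 2 corresponds to the standard margin-rescaled objective; the slack variables ξ below are implicit in the slide's hard constraints (a hedged LaTeX reconstruction, not quoted from the course):

\min_{\mathbf{w},\,\xi \ge 0}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i
\quad \text{s.t.} \quad
\mathbf{w} \cdot \Delta\phi(x_i, y_i, z) \ge \Delta(y_i, z) - \xi_i
\quad \forall i,\ \forall z \ne y_i .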
Loss-Augmented Decoding
- want, for all z: w . φ(x, y) ≥ w . φ(x, z) + Δ(y, z)
- same as: w . φ(x, y) ≥ max_z [w . φ(x, z) + Δ(y, z)]
- loss-augmented decoding: argmax_z [w . φ(x, z) + Δ(y, z)]
- if Δ(y, z) factors in z (e.g. Hamming), just modify the DP
- CIML version: λ = 1/(2C); note the modified DP update should have a learning rate!
- very similar to Pegasos, but one should use the Pegasos framework instead (next slide)
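A minimal sketch of folding Hamming loss into the decoder, reusing the viterbi/score hooks assumed on the Python-implementation slide (the wrapper name is mine):

def loss_augmented_score(score, gold):
    def aug(w, prev_tag, tag, words, i):
        # Hamming loss factors over positions, so it folds into the DP locally
        return score(w, prev_tag, tag, words, i) + (1 if tag != gold[i] else 0)
    return aug

# usage: z = viterbi(words, tags, loss_augmented_score(score, gold), w)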
Correct Version following Pegasos
- same loss-augmented decoding as on the previous slide: argmax_z [w . φ(x, z) + Δ(y, z)]
- but with the Pegasos schedule: the learning rate decays as 1/t, the update coefficient involves NC/(2t), where N = |D| and C is the SVM's C, and t += 1 for each example
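A hedged sketch of the Pegasos-style update on one example (following Shalev-Shwartz et al.'s Pegasos; λ is left as a parameter, to be tied to N and C as on the slide, e.g. λ = 1/(2C) in the CIML version):

def pegasos_update(w, phi_y, phi_z, lam, t):
    # w: sparse dict; phi_y, phi_z: features of the gold y and the loss-augmented argmax z
    eta = 1.0 / (lam * t)                 # step size decays as 1/t
    for f in w:
        w[f] *= (1.0 - eta * lam)         # i.e. (1 - 1/t): the regularizer's shrink
    for f, v in phi_y.items():
        w[f] = w.get(f, 0.0) + eta * v    # reward the gold structure
    for f, v in phi_z.items():
        w[f] = w.get(f, 0.0) - eta * v    # penalize the loss-augmented argmax
    return w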
Struct. Perceptron vs. Struct. SVM
- tagging on ATIS (train: 488 sentences); SVM < avg. perceptron << perceptron

perceptron:
epoch 1  updates 102, |W|=291, train_err 3.90%, dev_err 9.36%, avg_err 6.14%
epoch 2  updates 91,  |W|=334, train_err 3.33%, dev_err 8.19%, avg_err 4.97%
epoch 3  updates 78,  |W|=347, train_err 2.92%, dev_err 5.85%, avg_err 4.97%
epoch 4  updates 81,  |W|=368, train_err 3.11%, dev_err 6.73%, avg_err 5.85%
epoch 5  updates 78,  |W|=378, train_err 2.70%, dev_err 6.14%, avg_err 5.56%
epoch 6  updates 63,  |W|=385, train_err 2.26%, dev_err 6.14%, avg_err 5.56%
epoch 7  updates 69,  |W|=385, train_err 2.43%, dev_err 7.02%, avg_err 5.56%
epoch 8  updates 60,  |W|=388, train_err 2.15%, dev_err 6.73%, avg_err 5.56%
epoch 9  updates 59,  |W|=390, train_err 2.04%, dev_err 6.14%, avg_err 5.56%
epoch 10 updates 64,  |W|=394, train_err 2.15%, dev_err 5.85%, avg_err 5.26%

SVM (C=1):
epoch 1  updates 116, |W|=311, train_err 4.55%, dev_err 5.85%
epoch 2  updates 82,  |W|=328, train_err 3.05%, dev_err 4.97%
epoch 3  updates 78,  |W|=334, train_err 2.92%, dev_err 5.56%
epoch 4  updates 77,  |W|=339, train_err 2.92%, dev_err 5.26%
epoch 5  updates 80,  |W|=344, train_err 2.94%, dev_err 5.56%
epoch 6  updates 73,  |W|=345, train_err 2.75%, dev_err 4.68%
epoch 7  updates 72,  |W|=347, train_err 2.75%, dev_err 4.97%
epoch 8  updates 75,  |W|=352, train_err 2.86%, dev_err 4.97%
epoch 9  updates 74,  |W|=353, train_err 2.78%, dev_err 4.97%
epoch 10 updates 72,  |W|=354, train_err 2.78%, dev_err 4.97%