Structured Prediction Theory and Algorithms


1 Structured Prediction Theory and Algorithms

Mehryar Mohri, Courant Institute & Google Research.
Joint work with Corinna Cortes (Google Research), Vitaly Kuznetsov (Google Research), and Scott Yang (Courant Institute).

2 Structured Prediction

- Structured output: $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_l$.
- Loss function: $L\colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, decomposable.
- Example: Hamming loss, $L(y, y') = \frac{1}{l} \sum_{k=1}^{l} 1_{y_k \neq y'_k}$.
- Example: edit-distance loss, $L(y, y') = \frac{1}{l}\, d_{\mathrm{edit}}(y_1 \cdots y_l,\ y'_1 \cdots y'_l)$.
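
A minimal Python sketch of these two losses (illustrative, not from the talk; both are normalized by the sequence length $l$ as in the formulas above):

```python
def hamming_loss(y, y_prime):
    """Normalized Hamming loss: fraction of positions where the labels differ."""
    assert len(y) == len(y_prime)
    return sum(a != b for a, b in zip(y, y_prime)) / len(y)

def edit_distance_loss(y, y_prime):
    """Normalized edit-distance loss: Levenshtein distance divided by len(y)."""
    m, n = len(y), len(y_prime)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if y[i - 1] == y_prime[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / len(y)

print(hamming_loss("DNVDN", "DNNDN"))        # 0.2
print(edit_distance_loss("DNVDN", "DNNDN"))  # 0.2
```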

3 Examples

- Pronunciation modeling.
- Part-of-speech tagging.
- Named-entity recognition.
- Context-free parsing.
- Dependency parsing.
- Machine translation.
- Image segmentation.

4 Examples: NLP Tasks

- Pronunciation: "I have formulated a" → "ay hh ae v f ow r m y ax l ey t ih d ax".
- POS tagging: "The thief stole a car" → "D N V D N".
- Context-free parsing / dependency parsing: constituency tree (S → NP VP, with NP = "The thief" and VP = "stole a car") and dependency tree rooted at "stole" for "The thief stole a car" (figures on the slide).

5 Examples: Image Segmentation

6 Predictors

- Family of scoring functions $H$ mapping from $\mathcal{X} \times \mathcal{Y}$ to $\mathbb{R}$.
- For any $h \in H$, prediction based on the highest score: $\forall x \in \mathcal{X},\ \mathsf{h}(x) = \operatorname{argmax}_{y \in \mathcal{Y}} h(x, y)$.
- Decomposition as a sum, modeled by factor graphs.
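
A brute-force rendering of this prediction rule (an illustrative sketch, not the talk's code): it enumerates the full output set $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_l$, which is exponential in $l$ and exactly what the factor-graph decompositions below avoid.

```python
import itertools

def predict(h, x, label_set, l):
    """Return argmax over y in Y of h(x, y), by exhaustive enumeration."""
    return max(itertools.product(label_set, repeat=l), key=lambda y: h(x, y))

# Toy scoring function: rewards the tag sequence (D, N, V).
h = lambda x, y: sum(y[s] == t for s, t in enumerate(("D", "N", "V")))
print(predict(h, x=None, label_set=["D", "N", "V"], l=3))  # ('D', 'N', 'V')
```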

7 Factor Graph Examples

- Pairwise Markov network decomposition (chain factor graph over $y_1, y_2, y_3$): $h(x, y) = h_{f_1}(x, y_1, y_2) + h_{f_2}(x, y_2, y_3)$.
- Other decomposition: $h(x, y) = h_{f_1}(x, y_1, y_3) + h_{f_2}(x, y_1, y_2, y_3)$.

8 Factor Graphs

- $G = (V, F, E)$: factor graph.
- $N(f)$: neighborhood of $f$.
- $\mathcal{Y}_f = \prod_{k \in N(f)} \mathcal{Y}_k$: substructure set (cross-product) at $f$.
- Decomposition: $h(x, y) = \sum_{f \in F} h_f(x, y_f)$.
- More generally, example-dependent factor graph: $G_i = G(x_i, y_i) = (V_i, F_i, E_i)$.
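
A small sketch of this decomposition (names and data layout are assumptions, not the talk's code): each factor is a neighborhood of output positions together with a local scoring function, and $h(x, y)$ sums the local scores.

```python
def factor_graph_score(factors, x, y):
    """factors: list of (N_f, h_f) pairs, where N_f is a tuple of output
    positions and h_f(x, y_f) scores the substructure y_f restricted to N_f."""
    return sum(h_f(x, tuple(y[k] for k in N_f)) for N_f, h_f in factors)

# Pairwise Markov decomposition of slide 7 (0-based positions), with
# arbitrary local scorers chosen purely for illustration.
factors = [((0, 1), lambda x, yf: 1.0 if yf == ("D", "N") else 0.0),
           ((1, 2), lambda x, yf: 0.5 if yf[1] == "V" else 0.0)]
print(factor_graph_score(factors, x=None, y=("D", "N", "V")))  # 1.5
```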

9 Linear Hypotheses

- Feature decomposition. Example: bigram decomposition, $\Phi(x, y) = \sum_{s=1}^{l} \Phi(x, s, y_{s-1}, y_s)$.
- Illustration: y = D N V D N, x = "his cat ate the fish"; at position $s = 4$ the feature is $\Phi(x, 4, y_3, y_4)$.
- Hypothesis decomposition: $h(x, y) = w \cdot \Phi(x, y) = \sum_{s=1}^{l} \underbrace{w \cdot \Phi(x, s, y_{s-1}, y_s)}_{h_s(x, y_{s-1}, y_s)}$.
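
With this bigram decomposition the argmax can be computed by dynamic programming instead of brute force. A minimal Viterbi sketch (an assumed implementation, not the talk's code; the local score $h_s(x, y_{s-1}, y_s)$ is passed in as a callable):

```python
def viterbi(score, labels, l, start="<s>"):
    """Return argmax over label sequences y of sum_s score(s, y[s-1], y[s])."""
    best = {start: (0.0, [])}  # best[y] = (score of best prefix ending in y, prefix)
    for s in range(1, l + 1):
        new_best = {}
        for y in labels:
            prev = max(best, key=lambda yp: best[yp][0] + score(s, yp, y))
            new_best[y] = (best[prev][0] + score(s, prev, y), best[prev][1] + [y])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]

# Toy bigram scores favoring the cycle D -> N -> V -> D.
bigram = {("<s>", "D"): 1.0, ("D", "N"): 2.0, ("N", "V"): 2.0, ("V", "D"): 2.0}
score = lambda s, yp, y: bigram.get((yp, y), 0.0)
print(viterbi(score, labels=["D", "N", "V"], l=5))  # ['D', 'N', 'V', 'D', 'N']
```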

10 Structured Prediction Problem

- Training data: sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ drawn i.i.d. from $\mathcal{X} \times \mathcal{Y}$ according to some distribution $D$.
- Problem: find a hypothesis $h\colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ in $H$ with small expected loss: $R(h) = \mathbb{E}_{(x,y) \sim D}\big[L(\mathsf{h}(x), y)\big]$.
- Questions: learning guarantees? role of the factor graph? better algorithms?

11 This Talk

- Theory.
- Voted risk minimization (VRM).
- Algorithms.
- Experiments.

12 Theory

13 Previous Work

- Standard multi-class learning bounds: the number of classes is exponential!
- Structured prediction bounds:
  - covering number bounds: Hamming loss, linear hypotheses (Taskar et al., 2003).
  - PAC-Bayesian bounds (randomized algorithms) (McAllester, 2007).
- Can we derive learning guarantees for general hypothesis sets and general loss functions?

14 Factor Graph Complexity

- Empirical factor graph complexity for hypothesis set $H$ and sample $S = (x_1, \ldots, x_m)$:
  $\widehat{\mathfrak{R}}^G_S(H) = \frac{1}{m}\, \mathbb{E}_{\epsilon}\Big[\sup_{h \in H} \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} \sqrt{|F_i|}\, \epsilon_{i,f,y}\, h_f(x_i, y)\Big]$
  (correlation of $H$ with the random noise $\epsilon_{i,f,y}$).
- Factor graph complexity: $\mathfrak{R}^G_m(H) = \mathbb{E}_{S \sim D^m}\big[\widehat{\mathfrak{R}}^G_S(H)\big]$.
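
The expectation over $\epsilon$ can be estimated by Monte Carlo when $H$ is a small, explicitly enumerable set of hypotheses. A rough sketch (data layout and names are assumptions made for illustration):

```python
import math
import random

def empirical_fg_complexity(hypotheses, sample, n_trials=1000, seed=0):
    """hypotheses: list of dicts h with h[(i, f, y)] = h_f(x_i, y).
    sample: list over examples i of (F_i, Y_f), with F_i a list of factors and
    Y_f a dict mapping each factor to its substructure set.
    Returns a Monte Carlo estimate of the empirical factor graph complexity."""
    rng = random.Random(seed)
    m = len(sample)
    total = 0.0
    for _ in range(n_trials):
        # Draw one Rademacher variable per (example, factor, substructure).
        eps = {(i, f, y): rng.choice((-1.0, 1.0))
               for i, (F_i, Y_f) in enumerate(sample)
               for f in F_i for y in Y_f[f]}
        # Best achievable correlation with this noise over the hypothesis set.
        total += max(sum(math.sqrt(len(sample[i][0])) * e * h.get((i, f, y), 0.0)
                         for (i, f, y), e in eps.items())
                     for h in hypotheses)
    return total / (n_trials * m)
```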

15 Margin

- Definition: the margin of $h$ at a labeled point $(x, y) \in \mathcal{X} \times \mathcal{Y}$ is $\rho_h(x, y) = \min_{y' \neq y} \big[h(x, y) - h(x, y')\big]$.
- Error when $\rho_h(x, y) \le 0$.
- A small margin is interpreted as low confidence.
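
Written out directly (a trivial sketch; $\mathcal{Y}$ is assumed small enough to enumerate):

```python
def margin(h, x, y, Y):
    """rho_h(x, y) = min over y' != y of [h(x, y) - h(x, y')]."""
    return min(h(x, y) - h(x, y_prime) for y_prime in Y if y_prime != y)
```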

16 Empirical Margin Losses

- For any $\rho > 0$,
  $\widehat{R}^{\mathrm{add}}_{S,\rho}(h) = \mathbb{E}_{(x,y) \sim S}\Big[\Phi_M\Big(\max_{y' \neq y} L(y', y) - \tfrac{1}{\rho}\big[h(x, y) - h(x, y')\big]\Big)\Big]$
  $\widehat{R}^{\mathrm{mult}}_{S,\rho}(h) = \mathbb{E}_{(x,y) \sim S}\Big[\Phi_M\Big(\max_{y' \neq y} L(y', y)\Big(1 - \tfrac{h(x, y) - h(x, y')}{\rho}\Big)\Big)\Big]$
  where $\Phi_M(u) = \min(M, \max(0, u))$ (the clipped linear function plotted on the slide).
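
A direct sketch of both quantities (illustrative only; $\Phi_M$ clips its argument to $[0, M]$ as above):

```python
def phi_M(u, M):
    """Clipped linear function: 0 for u <= 0, u on [0, M], M for u >= M."""
    return min(M, max(0.0, u))

def empirical_margin_losses(h, L, sample, Y, rho, M):
    """Return (additive, multiplicative) empirical margin losses over sample."""
    add = mult = 0.0
    for x, y in sample:
        pairs = [(L(yp, y), (h(x, y) - h(x, yp)) / rho) for yp in Y if yp != y]
        add += phi_M(max(loss - m for loss, m in pairs), M)
        mult += phi_M(max(loss * (1.0 - m) for loss, m in pairs), M)
    return add / len(sample), mult / len(sample)
```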

17 Generalization Bounds

- Theorem: for any $\delta > 0$, with probability at least $1 - \delta$, each of the following holds for all $h \in H$:
  $R(h) \le \widehat{R}^{\mathrm{add}}_{S,\rho}(h) + \frac{4\sqrt{2}}{\rho}\,\mathfrak{R}^G_m(H) + M\sqrt{\frac{\log\frac{1}{\delta}}{2m}}$,
  $R(h) \le \widehat{R}^{\mathrm{mult}}_{S,\rho}(h) + \frac{4\sqrt{2}\,M}{\rho}\,\mathfrak{R}^G_m(H) + M\sqrt{\frac{\log\frac{1}{\delta}}{2m}}$.
- Tightest margin bounds for structured prediction.
- Data-dependent.
- Improve upon the bound of (Taskar et al., 2003) by log terms (in the special case they study).
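
For concreteness, the right-hand sides of the two bounds as plain arithmetic (a sketch; all inputs are the quantities defined on the previous slides):

```python
import math

def additive_bound(emp_margin_loss, fg_complexity, rho, M, m, delta):
    return (emp_margin_loss + 4 * math.sqrt(2) / rho * fg_complexity
            + M * math.sqrt(math.log(1 / delta) / (2 * m)))

def multiplicative_bound(emp_margin_loss, fg_complexity, rho, M, m, delta):
    return (emp_margin_loss + 4 * math.sqrt(2) * M / rho * fg_complexity
            + M * math.sqrt(math.log(1 / delta) / (2 * m)))
```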

18 Linear Hypotheses

- Hypothesis set used by most convex structured prediction algorithms (StructSVM, M3N, CRF):
  $H_p = \big\{ (x, y) \mapsto w \cdot \Phi(x, y) : w \in \mathbb{R}^N,\ \|w\|_p \le \Lambda_p \big\}$,
  with $p \ge 1$ and $\Phi(x, y) = \sum_{f \in F} \Phi_f(x, y_f)$.

19 Complexity Bounds

- Bounds on the factor graph complexity of linear hypothesis sets:
  $\widehat{\mathfrak{R}}^G_S(H_1) \le \frac{\Lambda_1 r_\infty}{m}\sqrt{s \log(2N)}$
  $\widehat{\mathfrak{R}}^G_S(H_2) \le \frac{\Lambda_2 r_2}{m}\sqrt{\sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} |F_i|}$
  with $r_q = \max_{i,f,y} \|\Phi_f(x_i, y)\|_q$ and
  $s = \max_{j \in [1,N]} \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} |F_i|\, 1_{\Phi_{f,j}(x_i, y) \neq 0}$.

20 Key Term

- Sparsity parameter:
  $s \le \sum_{i=1}^{m} \sum_{f \in F_i} \sum_{y \in \mathcal{Y}_f} |F_i| \le \sum_{i=1}^{m} |F_i|^2 d_i \le m \max_i |F_i|^2 d_i$,
  where $d_i = \max_{f \in F_i} |\mathcal{Y}_f|$.
- Factor graph complexity for the $H_1$ hypothesis set in $O\big(\sqrt{\log(N)\, \max_i |F_i|^2 d_i / m}\big)$.
- Key term: average factor graph size.

21 NLP Applications

- Features: $\Phi_{f,j}$ is often a binary function, non-zero for a single pair $(x, y) \in \mathcal{X} \times \mathcal{Y}_f$.
- Example: presence of an n-gram (indexed by $j$) at position $f$ of the output, with input sentence $x_i$.
- Complexity term only in $O\big(\max_i |F_i| \sqrt{\log(N)/m}\big)$.

22 Theory Takeaways

- Key generalization terms: average size of the factor graphs; empirical margin loss.
- But is learning with very complex hypothesis sets (large factor graph complexity) possible?
  - richer families are needed for difficult NLP tasks.
  - but the generalization bound indicates a risk of overfitting.
- Voted Risk Minimization (VRM) theory.

23 Voted Risk Minimization

24 Decomposition of H

- Decomposition in terms of sub-families $H_1, H_2, H_3, H_4, H_5, \ldots$ (nested-families figure on the slide).

25 Ensemble Family

- Non-negative linear ensembles $F = \operatorname{conv}\big(\cup_{k=1}^{p} H_k\big)$:
  $f = \sum_{t=1}^{T} \alpha_t h_t$ with $\alpha_t \ge 0$, $\sum_{t=1}^{T} \alpha_t \le 1$, $h_t \in H_{k_t}$.

26 Ideas

- Use hypotheses drawn from the $H_k$ with larger $k$, but allocate more weight to hypotheses drawn from those with smaller $k$ (Cortes, MM, and Syed, 2014).
- How can we determine quantitatively the amount of mixture weight apportioned to the different families?
- Can we provide learning guarantees guiding these choices?

27 Learning Guarantee

- Theorem: fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $f = \sum_{t=1}^{T} \alpha_t h_t \in F$:
  $R(f) \le \widehat{R}^{\mathrm{add}}_{S,\rho,1}(f) + \frac{4\sqrt{2}}{\rho}\sum_{t=1}^{T} \alpha_t\, \mathfrak{R}^G_m(H_{k_t}) + \widetilde{O}\Big(M\sqrt{\tfrac{\log p}{\rho^2 m}}\Big)$
  $R(f) \le \widehat{R}^{\mathrm{mult}}_{S,\rho,1}(f) + \frac{4\sqrt{2}\,M}{\rho}\sum_{t=1}^{T} \alpha_t\, \mathfrak{R}^G_m(H_{k_t}) + \widetilde{O}\Big(M\sqrt{\tfrac{\log p}{\rho^2 m}}\Big)$.

28 Consequences

- Complexity term with an explicit dependency on the mixture weights.
- Quantitative guide for controlling the weights assigned to more complex sub-families.
- The bound can be used to directly define an ensemble algorithm.

29 Algorithms

30 Surrogate Loss Framework

- Lemma: assume that $u\, 1_{v \le 0} \le \Phi_u(v)$ for any $u \in \mathbb{R}_+$ and $v \in \mathbb{R}$. Then, for any $(x, y) \in \mathcal{X} \times \mathcal{Y}$,
  $L(\mathsf{h}(x), y) \le \max_{y' \neq y} \Phi_{L(y', y)}\big(h(x, y) - h(x, y')\big)$.
- Proof: if $\mathsf{h}(x) = y$, then $L(\mathsf{h}(x), y) = 0$ and the result is trivial. Otherwise, $\mathsf{h}(x) \neq y$ and
  $L(\mathsf{h}(x), y) = L(\mathsf{h}(x), y)\, 1_{h(x, y) - \max_{y' \neq y} h(x, y') \le 0}$
  $\le \Phi_{L(\mathsf{h}(x), y)}\big(h(x, y) - \max_{y' \neq y} h(x, y')\big)$   ($\Phi_u(v)$ upper bound on $u\, 1_{v \le 0}$)
  $= \Phi_{L(\mathsf{h}(x), y)}\big(h(x, y) - h(x, \mathsf{h}(x))\big)$   ($\mathsf{h}(x) \neq y$)
  $\le \max_{y' \neq y} \Phi_{L(y', y)}\big(h(x, y) - h(x, y')\big)$.

31 Application

- Convex surrogate losses:
  - $\Phi_u(v) = \max(0, u(1 - v))$: StructSVM (Tsochantaridis et al., 2005).
  - $\Phi_u(v) = \max(0, u - v)$: M3N (Taskar et al., 2003).
  - $\Phi_u(v) = \log(1 + e^{u - v})$: CRF (Lafferty et al., 2001).
  - $\Phi_u(v) = u\, e^{-v}$: StructBoost (Cortes et al., 2016).
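
The four surrogates written out (a direct transcription of the formulas above, with $u$ the loss value $L(y', y)$ and $v$ the score difference $h(x, y) - h(x, y')$):

```python
import math

def struct_svm(u, v):    # Tsochantaridis et al., 2005
    return max(0.0, u * (1.0 - v))

def m3n(u, v):           # Taskar et al., 2003
    return max(0.0, u - v)

def crf(u, v):           # Lafferty et al., 2001
    return math.log1p(math.exp(u - v))

def struct_boost(u, v):  # Cortes et al., 2016
    return u * math.exp(-v)
```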

32 Voted Cond. Random Field

- Hypothesis set: linear functions $h\colon (x, y) \mapsto w \cdot \Phi(x, y)$, with a complex feature vector $\Phi$ decomposed in blocks: $\Phi = (\Phi_1, \ldots, \Phi_p)$.
- Upper bound:
  $\max_{y' \neq y} \log\big(1 + e^{L(y', y) - w \cdot (\Phi(x, y) - \Phi(x, y'))}\big) \le \log \sum_{y' \in \mathcal{Y}} e^{L(y', y) - w \cdot (\Phi(x, y) - \Phi(x, y'))}$.

33 Voted Cond. Random Field

- Optimization problem (VCRF):
  $\min_{w}\ \frac{1}{m} \sum_{i=1}^{m} \log \sum_{y \in \mathcal{Y}} e^{L(y, y_i) - w \cdot (\Phi(x_i, y_i) - \Phi(x_i, y))} + \sum_{k=1}^{p} (\lambda r_k + \beta)\, \|w_k\|_1$,
  with $r_k = r_\infty |F^{(k)}| \sqrt{\log N}$.
- Solution via stochastic gradient descent (SGD).
- Relationship with L1-CRF.
- Other regularization, e.g., L2-VCRF.
- Efficient gradient computation for Markovian features.
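
An illustrative rendering of this objective for a small, explicitly enumerable output set $\mathcal{Y}$ (a sketch only; a real implementation exploits the Markovian feature structure for efficient gradient computation, and all names and shapes here are assumptions):

```python
import numpy as np

def vcrf_objective(w, data, Y, phi, L, blocks, r, lam, beta):
    """data: list of (x_i, y_i); phi(x, y) -> feature vector (np.ndarray);
    L(y, y_i) -> loss; blocks: list of index arrays defining the blocks w_k;
    r[k] = r_k. Returns the regularized VCRF objective value."""
    loss_term = 0.0
    for x_i, y_i in data:
        scores = [L(y, y_i) - w @ (phi(x_i, y_i) - phi(x_i, y)) for y in Y]
        loss_term += np.logaddexp.reduce(scores)
    reg_term = sum((lam * r[k] + beta) * np.abs(w[idx]).sum()
                   for k, idx in enumerate(blocks))
    return loss_term / len(data) + reg_term
```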

34 Experiments

35 Preliminary Experiments

- Part-of-speech tagging.
- Multiple data sets (the slide's table also reports sentence, token, unique-token, and label counts per dataset):
  - Basque: Basque UD Treebank
  - Chinese: Chinese Treebank
  - Dutch: UD Dutch Treebank
  - English: UD English Web Treebank
  - Finnish: Finnish UD Treebank
  - Finnish-FTB: UD Finnish-FTB
  - Hindi: UD Hindi Treebank
  - Tamil: UD Tamil Treebank
  - Turkish: METU-Sabanci Turkish Treebank
  - Twitter: Tweebank

36 Features - Example

- y: DET NN VBD RB JJ
- x: the cat was surprisingly agile
- $h_1(x) = 1_{x_2 = \text{was},\ x_3 = \text{surprisingly},\ x_4 = \text{agile}}(x)$
- $h_2(y) = 1_{y_2 = \text{VBD},\ y_3 = \text{RB}}(y)$
- $h_3(x) = 1_{\text{suf}(x_3, 2) = \text{ly}}(x)$
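
The same three indicator features as executable functions (positions follow the slide's 0-based indexing of the example sentence):

```python
x = ["the", "cat", "was", "surprisingly", "agile"]
y = ["DET", "NN", "VBD", "RB", "JJ"]

def h1(x):  # word indicator at positions 2..4
    return int(x[2:5] == ["was", "surprisingly", "agile"])

def h2(y):  # tag indicator at positions 2..3
    return int(y[2:4] == ["VBD", "RB"])

def h3(x):  # suffix indicator: the length-2 suffix of x_3 is "ly"
    return int(x[3].endswith("ly"))

print(h1(x), h2(y), h3(x))  # 1 1 1
```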

37 Features

- Feature families:
  - definition: for each choice of the window sizes $(k_1, k_2, k_3)$, sum of products of indicators over positions along the sequence.
  - complexity: $r(H_{k_1, k_2, k_3}) \le r\sqrt{2(k_1 \log |V| + k_2 \log m + k_3 \log \cdots)}$.

38 Experiments

- Parameters $\lambda$ and $\beta$ determined via cross-validation.
- Comparison with L1-CRF.
- Two sets of results:
  - original data sets.
  - artificial noise added: for tokens corresponding to features that appear commonly in the dataset (at least five times), POS labels are flipped with some probability (20% noise).
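
A sketch of one way to implement this noise protocol as described (our reading of the slide; the function and parameter names are made up for illustration):

```python
import random
from collections import Counter

def add_label_noise(sentences, label_set, min_count=5, p=0.2, seed=0):
    """sentences: list of lists of (token, tag) pairs. Tokens appearing at least
    min_count times get their tag replaced by a random other label w.p. p."""
    rng = random.Random(seed)
    counts = Counter(tok for sent in sentences for tok, _ in sent)
    noisy = []
    for sent in sentences:
        new_sent = []
        for tok, tag in sent:
            if counts[tok] >= min_count and rng.random() < p:
                tag = rng.choice([t for t in label_set if t != tag])
            new_sent.append((tok, tag))
        noisy.append(new_sent)
    return noisy
```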

39 Experimental Results

Token and sentence error rates (%) with standard deviations, VCRF vs. CRF ("-" marks entries not preserved in the transcription):

Dataset       VCRF Token   VCRF Sentence   CRF Token   CRF Sentence
Basque        7.26 ± -     -               -           - ± 1.39
Chinese       7.38 ± -     -               -           - ± 0.49
Dutch         5.97 ± -     -               -           - ± 1.02
English       5.51 ± -     -               -           - ± 1.31
Finnish       7.48 ± -     -               -           - ± 1.36
Finnish-FTB   9.79 ± -     -               -           - ± 0.75
Hindi         4.84 ± -     -               -           - ± 0.75
Tamil         -            -               -           - ± 1.54
Turkish       -            -               -           - ± 1.01
Twitter       -            -               -           - ± 1.37

40 Average No. of Features

Average number of features selected by VCRF and by CRF, and their ratio, for each dataset (Basque, Chinese, Dutch, English, Finnish, Finnish-FTB, Hindi, Tamil, Turkish, Twitter); the numeric entries were not preserved in the transcription.

41 Experimental Results

Token and sentence error rates (%) with standard deviations, VCRF vs. CRF ("-" marks entries not preserved in the transcription):

Dataset       VCRF Token   VCRF Sentence   CRF Token   CRF Sentence
Basque        9.13 ± -     -               -           - ± 1.08
Chinese       -            -               -           - ± 0.01
Dutch         8.16 ± -     -               -           - ± 0.87
English       8.79 ± -     -               -           - ± 1.18
Finnish       9.38 ± -     -               -           - ± 0.93
Finnish-FTB   -            -               -           - ± 1.19
Hindi         6.63 ± -     -               -           - ± 1.20
Tamil         -            -               -           - ± 1.78
Turkish       -            -               -           - ± 2.04
Twitter       -            -               -           - ± 0.00

42 Conclusion

- Structured prediction theory:
  - tightest margin guarantees for structured prediction.
  - general loss functions, data-dependent.
  - key notion of factor graph complexity.
- VCRF and StructBoost algorithms.
- Favorable preliminary experiments.
- Guarantees for complex hypothesis sets (VRM theory).
- Additionally, tightest margin bounds for standard classification.
