Structured Perceptrons & Structural SVMs


1 Structured Perceptrons & Structural SVMs
4/6/2017
CS 159: Advanced Topics in Machine Learning

2 Recall: Sequence Prediction
Input: x = (x_1, ..., x_M)
Predict: y = (y_1, ..., y_M), where each y_i is one of L labels.
POS tags: Det, Noun, Verb, Adj, Adv, Prep (L = 6)
Examples:
  x = "Fish Sleep"                     y = (N, V)
  x = "The Dog Ate My Homework"        y = (D, N, V, D, N)
  x = "The Fox Jumped Over The Fence"  y = (D, N, V, P, D, N)

3 Recall: 1st Order HMM
x = (x_1, x_2, x_3, x_4, ..., x_M)   (sequence of words)
y = (y_1, y_2, y_3, y_4, ..., y_M)   (sequence of POS tags)
P(x_i | y_i): probability of state y_i generating x_i
P(y_{i+1} | y_i): probability of state y_i transitioning to y_{i+1}
P(y_1 | y_0): y_0 is defined to be the Start state
P(End | y_M): prior probability of y_M being the final state (not always used)

4 HMM Graphical Model Representation
[Figure: chain Y_0 -> Y_1 -> Y_2 -> ... -> Y_M -> End (End optional), with each Y_i emitting X_i.]
P(x, y) = P(End | y_M) * Prod_{i=1}^{M} P(y_i | y_{i-1}) P(x_i | y_i)

5 Most Common Prediction Problem
Given an input sentence, predict the POS tag sequence:
h(x) = argmax_y P(y | x) = argmax_y log P(y | x)
Up to a constant that does not depend on y, log P(y | x) = Sum_i [ log P(x_i | y_i) + log P(y_i | y_{i-1}) ].
Solve using Viterbi (a special case of the max-product algorithm).

6 Simple Example
x = "Fish Sleep", y = (N, V)
F(y, x) = log P(y | x) = Sum_i [ log P(x_i | y_i) + log P(y_i | y_{i-1}) ]
The scores come from two small tables: an emission table log P(x_i | y_i) (rows Fish/Sleep, columns N/V) and a transition table log P(y_i | y_{i-1}) (rows Start/N/V, columns N/V). The next slides rewrite exactly these table entries as a weight vector.

7 New Notation
F(y, x) = Sum_{i=1}^{M} w^T phi(y_i, y_{i-1} | x)
w = [ w_1 ; w_2 ],   phi(a, b | x) = [ phi_1(a | x) ; phi_2(a, b) ]
Unary features (label-word indicators, evaluated at position i):
  phi_1(a | x) = [ 1[a=Noun and x_i=Fish], 1[a=Noun and x_i=Sleep], 1[a=Verb and x_i=Fish], 1[a=Verb and x_i=Sleep] ]
Pairwise transition features (label-pair indicators):
  phi_2(a, b) = [ 1[a=Noun and b=Start], 1[a=Noun and b=Noun], 1[a=Noun and b=Verb], 1[a=Verb and b=Start], 1[a=Verb and b=Noun], 1[a=Verb and b=Verb] ]
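To make the new notation concrete, here is a minimal Python/NumPy sketch of these features for the two-tag toy problem; the function names, the tiny vocabulary, and the index ordering are illustrative choices rather than anything specified in the slides.

```python
import numpy as np

# Toy vocabulary and tag set from the slides (illustrative choices).
WORDS = ["Fish", "Sleep"]          # possible x_i values
TAGS = ["Noun", "Verb"]            # possible current labels a
PREV = ["Start", "Noun", "Verb"]   # possible previous labels b

def phi1(a, x_i):
    """Unary features: one indicator per (label, word) pair."""
    v = np.zeros(len(TAGS) * len(WORDS))
    v[TAGS.index(a) * len(WORDS) + WORDS.index(x_i)] = 1.0
    return v

def phi2(a, b):
    """Pairwise transition features: one indicator per (label, previous label) pair."""
    v = np.zeros(len(TAGS) * len(PREV))
    v[TAGS.index(a) * len(PREV) + PREV.index(b)] = 1.0
    return v

def phi(a, b, x_i):
    """Per-position feature vector phi(a, b | x) = [phi1 ; phi2]."""
    return np.concatenate([phi1(a, x_i), phi2(a, b)])

def Psi(y, x):
    """Joint feature map: sum of per-position features over the whole sequence."""
    prev = "Start"
    total = np.zeros(len(TAGS) * len(WORDS) + len(TAGS) * len(PREV))
    for y_i, x_i in zip(y, x):
        total += phi(y_i, prev, x_i)
        prev = y_i
    return total
```

With this Psi, the sequence score is simply F(y, x) = w @ Psi(y, x), which is what the following slides exploit.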

8 New Notation
Duplicate the word features for each label: the unary vector phi_1(a | x) has one block of word features per label, and only the block for label a is active:
  phi_1(a | x) = [ 1[a=1] phi~(x_i) ; ... ; 1[a=L] phi~(x_i) ]
For the toy example, phi~(x_i) = [ 1[x_i=Fish], 1[x_i=Sleep] ], so the first two entries of phi_1 are the Noun class features and the last two are the Verb class features. For instance, at position i = 1 (x_1 = Fish), phi_1(Noun | x) = (1, 0, 0, 0) and phi_1(Verb | x) = (0, 0, 1, 0).

9 New Notation
One feature for every transition: phi_2(a, b) is a one-hot vector over label pairs, so e.g. phi_2(Noun, Start), phi_2(Verb, Start), and phi_2(Verb, Noun) each activate exactly one of the six transition features listed on slide 7.

10 New Notation
F(y, x) = Sum_{i=1}^{M} w^T phi(y_i, y_{i-1} | x), with w = [ w_1 ; w_2 ] and phi(a, b | x) = [ phi_1(a | x) ; phi_2(a, b) ].
Old notation versus new notation:
  w_1 holds the emission scores: the entry for 1[a=Noun and x_i=Fish] plays the role of log P(Fish | N), and so on.
  w_2 holds the transition scores: its entries play the role of log P(* | Start), log P(* | N), and log P(* | V).

11 Recap: 1st Order Sequential Model
Input: x = (x_1, ..., x_M); predict y = (y_1, ..., y_M), each y_i one of L labels (POS tags: Det, Noun, Verb, Adj, Adv, Prep; L = 6).
Linear model w.r.t. the pairwise features phi(a, b | x):
  F(y, x) = w^T Psi(y, x) = Sum_i w^T phi(y_i, y_{i-1} | x)   (Psi encodes the structure)
Prediction via maximizing F:
  h(x) = argmax_y F(y, x) = argmax_y w^T Psi(y, x)
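The argmax over all L^M sequences can be computed exactly by dynamic programming. Below is a hedged sketch of Viterbi for this linear scoring model; the score-callback signature is my own choice, intended to plug in the phi from the earlier sketch.

```python
def viterbi(x, tags, score):
    """
    Exact argmax_y F(y, x) for a 1st-order sequential model, where
    F(y, x) = sum_i score(y_i, y_{i-1}, x, i) and y_0 = "Start".
    For the linear model, score(a, b, x, i) would be w @ phi(a, b, x[i]).
    """
    M = len(x)
    delta = [{a: score(a, "Start", x, 0) for a in tags}]  # best prefix score ending in tag a
    back = [{a: None for a in tags}]                      # backpointers
    for i in range(1, M):
        delta.append({})
        back.append({})
        for a in tags:
            best_b = max(tags, key=lambda b: delta[i - 1][b] + score(a, b, x, i))
            delta[i][a] = delta[i - 1][best_b] + score(a, best_b, x, i)
            back[i][a] = best_b
    # Backtrack from the best final tag.
    y = [max(tags, key=lambda a: delta[M - 1][a])]
    for i in range(M - 1, 0, -1):
        y.append(back[i][y[-1]])
    return list(reversed(y))
```

For example, h(x) on the toy problem would be viterbi(["Fish", "Sleep"], ["Noun", "Verb"], lambda a, b, x, i: w @ phi(a, b, x[i])).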

12 Example
x = "Fish Sleep", y = (N, V), with w = [ w_1 ; w_2 ] and phi_1, phi_2 as on slide 7.
F(y = (N, V), x = "Fish Sleep")
  = w_1^T phi_1(N | x_1) + w_2^T phi_2(N, Start) + w_1^T phi_1(V | x_2) + w_2^T phi_2(V, N)
  = w_{1,1} + w_{2,1} + w_{1,4} + w_{2,5} = 4
Prediction: argmax_y F(y, x)
  y        F(y, x)
  (N, N)      2
  (N, V)      4   <- predicted
  (V, N)      3
  (V, V)     -2

13 Why New Notation?
Easier to reason about:
  - computing predictions,
  - learning (it is a linear model!),
  - extensions (just generalize phi).
(w = [ w_1 ; w_2 ] and phi(a, b | x) = [ phi_1(a | x) ; phi_2(a, b) ] as before.)

14 Generalizes Multiclass
Stack a weight vector for each class:
  F(y, x) = w^T Psi(y, x),  w = [ w_1 ; w_2 ; ... ; w_L ],  Psi(y, x) = [ 1[y=1] x ; 1[y=2] x ; ... ; 1[y=L] x ]
  h(x) = argmax_y w^T Psi(y, x) = argmax_y w_y^T x
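As a quick sanity check that this really is ordinary multiclass classification in disguise, here is a small sketch; the function names are illustrative.

```python
import numpy as np

def psi_multiclass(y, x, num_classes):
    """Place x into the y-th block of a long vector (zeros elsewhere)."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    v = np.zeros(num_classes * d)
    v[y * d:(y + 1) * d] = x
    return v

def predict_multiclass(w, x, num_classes):
    """argmax_y w^T Psi(y, x), which equals argmax_y w_y^T x."""
    scores = [w @ psi_multiclass(y, x, num_classes) for y in range(num_classes)]
    return int(np.argmax(scores))
```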

15 Learning for Structured Prediction

16 Perceptron Learning Algorithm (Linear Classification Model)
h(x) = sign(w^T x)
Training set: S = { (x_i, y_i) }_{i=1}^{N}, with y_i in {+1, -1}.
  w_1 = 0
  For t = 1, 2, ...:
    Receive example (x, y)   (go through the training set in arbitrary order, e.g. randomly)
    If h(x | w_t) = y:  w_{t+1} = w_t
    Else:               w_{t+1} = w_t + y x
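For reference, the binary algorithm above is only a few lines of code; this sketch follows the slide's update rule, assuming the examples are NumPy vectors with labels in {-1, +1}.

```python
import numpy as np

def perceptron(examples, num_epochs=10):
    """Binary perceptron: `examples` is a list of (x, y) with x an array and y in {-1, +1}."""
    w = np.zeros(len(examples[0][0]))
    for _ in range(num_epochs):
        for x, y in examples:            # arbitrary (e.g. random) order
            if np.sign(w @ x) != y:      # mistake: h(x | w) != y
                w = w + y * x            # nudge w toward classifying (x, y) correctly
    return w
```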

17 Structured Perceptron (Linear Classification Model)
h(x) = argmax_y w^T Psi(y, x)
Training set: S = { (x_i, y_i) }, with structured outputs y_i.
  w_1 = 0
  For t = 1, 2, ...:
    Receive example (x, y)   (arbitrary order, e.g. randomly)
    If h(x | w_t) = y:  w_{t+1} = w_t
    Else:               w_{t+1} = w_t + Psi(y, x) - Psi(h(x | w_t), x)
The argmax prediction and the feature-map update are the only things that change relative to the binary perceptron.
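A sketch of the structured version in the same style; `predict` would be the Viterbi routine from earlier, and the update is written in the usual feature-difference form of the perceptron step.

```python
import numpy as np

def structured_perceptron(examples, predict, Psi, dim, num_epochs=10):
    """
    Structured perceptron sketch.
      examples:      list of (x, y) pairs with structured y (e.g. tag sequences)
      predict(w, x): argmax_y w @ Psi(y, x), e.g. Viterbi
      Psi(y, x):     joint feature map -> np.ndarray of length `dim`
    """
    w = np.zeros(dim)
    for _ in range(num_epochs):
        for x, y in examples:
            y_hat = tuple(predict(w, x))
            if y_hat != tuple(y):
                # Mistake: move toward the true structure, away from the prediction.
                w = w + Psi(y, x) - Psi(y_hat, x)
    return w
```

Compared with the binary loop, only the prediction (an argmax over sequences) and the update direction (a difference of joint feature vectors) change.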

18 Structured Perceptron Method
[Table from the paper: test error (%) vs. number of iterations for averaged and unaveraged perceptron variants and maximum-entropy baselines, e.g. averaged perceptron 2.93% vs. maximum entropy 3.4%.]
Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. Michael Collins, EMNLP 2002.

19 Limitations of Perceptron
Not all mistakes are created equal: one POS tag wrong is treated as being as bad as five!
Even worse for more complicated problems.

20 Comparison
[Table from the paper: test error for HMM, CRF, Perceptron, and SVM on a sequence-labeling benchmark.]
Large Margin Methods for Structured and Interdependent Output Variables. Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun. Journal of Machine Learning Research, Volume 6, pages 1453-1484, 2005.

21 Hamming Loss
Hamming loss: l(y, x, F) = Sum_{j=1}^{M} 1[ h(x)_j != y_j ]   (sometimes normalized by M)
True y  = (D, N, V, D, N)
y'      = (D, N, V, N, N)   Hamming loss 1
y''     = (V, D, N, V, V)   Hamming loss 5 (much worse)
But the Hamming loss is not continuous, so we need to define a continuous surrogate: a structured version of the hinge loss.
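The Hamming loss itself is trivial to compute; a one-liner matching the slide's example:

```python
def hamming_loss(y_true, y_pred):
    """Number of positions where the predicted label differs from the truth."""
    return sum(1 for a, b in zip(y_true, y_pred) if a != b)

# Example from the slide:
# hamming_loss(("D","N","V","D","N"), ("D","N","V","N","N")) == 1
# hamming_loss(("D","N","V","D","N"), ("V","D","N","V","V")) == 5
```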

22 Original Hinge Loss (Support Vector Machine)
argmin_{w,b,xi} (1/2) w^T w + (C/N) Sum_i xi_i
  s.t. for all i: y_i (w^T x_i - b) >= 1 - xi_i,   for all i: xi_i >= 0
Hinge loss: l(y_i, f(x_i)) = max(0, 1 - y_i f(x_i)) = xi_i
[Plot: hinge loss vs. 0/1 loss as a function of y f(x).]

23 Property of Hinge Loss
argmin_{w,b,xi} (1/2) w^T w + (C/N) Sum_i xi_i
  s.t. for all i: y_i (w^T x_i - b) >= 1 - xi_i,   for all i: xi_i >= 0
With h(x) = argmax_{y in {-1,+1}} y f(x) = sign(f(x)):
  h(x_i) != y_i  implies  xi_i >= 1
l(y_i, f(x_i)) = max(0, 1 - y_i f(x_i)) = xi_i
Hinge loss is a continuous upper bound on the 0/1 loss.

24 Hamming Hinge Loss (Structural SVM)
argmin_{w,xi} (1/2) w^T w + (C/N) Sum_i xi_i
  s.t. for all i, for all y': F(y_i, x_i) - F(y', x_i) >= Delta(y_i, y') - xi_i,   for all i: xi_i >= 0
where Delta(y_i, y') = Sum_j 1[ y'_j != y_{i,j} ] is the Hamming loss (sometimes normalized by M).
With the learned predictor h(x) = argmax_y F(y, x):
  F(y_i, x_i) <= F(h(x_i), x_i)  implies  xi_i >= Delta(y_i, h(x_i))
l(y_i, x_i, F) = max_{y'} [ Delta(y_i, y') - ( F(y_i, x_i) - F(y', x_i) ) ] = xi_i
A continuous upper bound on the Hamming loss!
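Spelled out in code, the slack / hinge value for one example is a max over alternative outputs. This sketch enumerates candidates explicitly, which is only feasible for toy problems; the later slides replace the enumeration with loss-augmented inference.

```python
def structured_hinge_loss(w, x, y_true, candidates, Psi, hamming_loss):
    """
    Margin-rescaled hinge loss from the slide:
      l(y, x, F) = max_{y'} [ Delta(y_true, y') - ( F(y_true, x) - F(y', x) ) ],
    which is at least 0 because y' = y_true contributes 0.
    `candidates` enumerates the alternative outputs y' to max over.
    """
    f_true = w @ Psi(y_true, x)
    loss = 0.0
    for y_alt in candidates:
        margin_violation = hamming_loss(y_true, y_alt) - (f_true - w @ Psi(y_alt, x))
        loss = max(loss, margin_violation)
    return loss
```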

25 Hamming Hinge Loss
argmin_{w,xi} (1/2) w^T w + (C/N) Sum_i xi_i
  s.t. for all i, for all y': F(y_i, x_i) - F(y', x_i) >= Delta(y_i, y') - xi_i,   for all i: xi_i >= 0
Read each constraint as: Score(y_i) - Score(y') >= Loss(y') - Slack.
Suppose the model prefers an incorrect y' = h(x_i), i.e. Score(y') >= Score(y_i).
Then xi_i >= Loss(y') - ( Score(y_i) - Score(y') ) >= Loss(y') = Delta(y_i, h(x_i)).
The slack variable upper-bounds the Hamming loss!

26 Structural SVM
argmin_{w,xi} (1/2) w^T w + (C/N) Sum_i xi_i
  s.t. for all i, for all y': F(y_i, x_i) - F(y', x_i) >= Delta(y_i, y') - xi_i,   for all i: xi_i >= 0   (Delta sometimes normalized by M)
Consider y' = argmax_y F(y, x_i), the prediction of the learned model:
  y' != y_i  implies  F(y_i, x_i) <= F(y', x_i)  implies  xi_i >= Delta(y_i, y')
  y' = y_i   implies  xi_i >= 0
Either way, the slack is a continuous upper bound on the Hamming loss!

27 Reduction to Independent Multiclass
argmin_{w,xi} (1/2) w^T w + (C/N) Sum_i xi_i,  s.t. for all i, y': F(y_i, x_i) - F(y', x_i) >= Delta(y_i, y') - xi_i, xi_i >= 0
Suppose there are no pairwise features:
  F(y, x) = Sum_{j=1}^{M} w^T phi_1(y_j | x),  with  phi_1(y_j | x) = [ 1[y_j = 1] phi~(x_j) ; ... ; 1[y_j = L] phi~(x_j) ]   (word features phi~(x_j) stacked L times)
Then the constraints decompose over tokens:
  for all i, j, and all labels a != y_{i,j}:  w_{y_{i,j}}^T phi~(x_{i,j}) - w_a^T phi~(x_{i,j}) >= 1 - xi_{i,j}
i.e. a multiclass hinge loss per token!

28 Example
(Same objective and constraints as above.)
x_i = "Fish Sleep", y_i = (N, V). Solving gives xi_i = 1:
  y        F(y, x_i)   F(y_i, x_i) - F(y, x_i)   Loss Delta(y_i, y)
  (N, N)       2                 2                      1
  (N, V)       4                 0                      0
  (V, N)       3                 1                      2
  (V, V)       3                 1                      1
xi_i = max_y [ Loss - margin ] = 2 - 1 = 1, attained at the (V, N) row.

29 Example 2
(Same objective and constraints.)
x_i = "Fish Sleep", y_i = (N, V). Solving gives xi_i = 2:
  y        F(y, x_i)   F(y_i, x_i) - F(y, x_i)   Loss Delta(y_i, y)
  (N, N)       4                -1                      1
  (N, V)       3                 0                      0
  (V, N)       3                 0                      2
  (V, V)       2                 1                      1
xi_i = max_y [ Loss - margin ] = 2, attained at both (N, N) and (V, N).

30 Example 3
(Same objective and constraints.)
x_i = "Fish Sleep", y_i = (N, V), with the same scores as Example 1, so again xi_i = 1:
  y        F(y, x_i)   F(y_i, x_i) - F(y, x_i)   Loss Delta(y_i, y)
  (N, N)       2                 2                      1
  (N, V)       4                 0                      0
  (V, N)       3                 1                      2
  (V, V)       3                 1                      1

31 When Is Slack Positive?
argmin_{w,xi} (1/2) w^T w + (C/N) Sum_i xi_i,  s.t. for all i, y': F(y_i, x_i) - F(y', x_i) >= Delta(y_i, y') - xi_i, xi_i >= 0
Whenever the margin is not big enough:
  xi_i > 0  if and only if  there exists y' with F(y_i, x_i) - F(y', x_i) < Delta(y_i, y')
Verify that xi_i = max_{y'} [ Delta(y_i, y') - ( F(y_i, x_i) - F(y', x_i) ) ] = l(y_i, x_i, F), the definition from slide 24.

32 Structural SVM: Geometric Interpretation
argmin_{w,xi} (1/2) w^T w + (C/N) Sum_i xi_i,  s.t. for all i, y': F(y_i, x_i) - F(y', x_i) >= Delta(y_i, y') - xi_i, xi_i >= 0
F(y, x) = w^T Psi(y, x): each candidate output y corresponds to a high-dimensional point Psi(y, x).
[Figure: w separates the feature vector of the correct output from those of the incorrect outputs.]
Trade-off between the size of the margin and the size of the margin violations (C controls the trade-off); the margin is scaled by the Hamming loss.

33 Structural SVM Training
argmin_{w,xi} (1/2) w^T w + (C/N) Sum_i xi_i
  s.t. for all i, for all y': F(y_i, x_i) - F(y', x_i) >= Delta(y_i, y') - xi_i,   for all i: xi_i >= 0   <- often exponentially many constraints!
Strictly convex optimization problem, same form as the standard SVM optimization. Easy, right?
Intractable number of constraints!

34 Structural SVM Training
The trick is to not enumerate all constraints.
Only solve the SVM objective over a small subset of constraints (the working set). Efficient!
But some constraints outside the working set might be violated:
  there may exist y' with F(y_i, x_i) < F(y', x_i) + Delta(y_i, y') - xi_i

35 Example
(Same objective, now solved over a working set of constraints.)
x_i = "Fish Sleep", y_i = (N, V), same score table as Example 1:
  y        F(y, x_i)   F(y_i, x_i) - F(y, x_i)   Loss Delta(y_i, y)
  (N, N)       2                 2                      1
  (N, V)       4                 0                      0
  (V, N)       3                 1                      2
  (V, V)       3                 1                      1
With a working set that omits the (V, N) constraint, the reported slack can be smaller than the true hinge loss.

36 Approximate Hinge Loss
Choose a tolerance eps > 0:
argmin_{w,xi} (1/2) w^T w + (C/N) Sum_i xi_i
  s.t. for all i, for all y': F(y_i, x_i) - F(y', x_i) >= Delta(y_i, y') - xi_i - eps,   for all i: xi_i >= 0
Consider y' = argmax_y F(y, x_i), the prediction of the learned model:
  y' != y_i  implies  F(y_i, x_i) <= F(y', x_i)  implies  xi_i >= Delta(y_i, y') - eps
  y' = y_i   implies  xi_i >= 0
The slack is a continuous upper bound on the Hamming loss minus eps.

37 Example
(Same objective with tolerance eps, on the Example 1 table.)
x_i = "Fish Sleep", y_i = (N, V):
  y        F(y, x_i)   F(y_i, x_i) - F(y, x_i)   Loss Delta(y_i, y)
  (N, N)       2                 2                      1
  (N, V)       4                 0                      0
  (V, N)       3                 1                      2
  (V, V)       3                 1                      1
The required slack is now xi_i = max( 0, max_y [ Loss - margin - eps ] ), i.e. the Example 1 value reduced by eps.

38 Structural SVM Training
STEP 0: Specify tolerance eps.
STEP 1: Solve the SVM objective using only the working set of constraints W (initially empty). The trained model is w.
STEP 2: Using w, find the y' whose constraint is most violated.
STEP 3: If that constraint is violated by more than eps, add it to W.
Constraint violation formula: Delta(y_i, y') - ( F(y_i, x_i) - F(y', x_i) ) - xi_i > eps   (normalize Delta by M if desired)
Repeat STEPS 1-3 until no additional constraints are added. Return the most recent model w trained in STEP 1.
*This is known as a cutting plane method.
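The following sketch mirrors the four steps above. It is not the exact solver from the slides: the restricted QP of STEP 1 is only solved approximately here, by subgradient descent on the equivalent hinge-loss objective, and the helper names (`most_violated`, `delta`, `Psi`) are assumptions tying it to the earlier sketches.

```python
import numpy as np

def train_structural_svm(examples, Psi, delta, most_violated, dim,
                         C=1.0, eps=0.01, inner_steps=200, lr=0.01):
    """
    Cutting-plane training sketch.
      examples:               list of (x, y) pairs
      Psi(y, x):              joint feature map -> np.ndarray of length `dim`
      delta(y, y'):           task loss, e.g. Hamming loss
      most_violated(w, x, y): argmax_y' [ delta(y, y') + w @ Psi(y', x) ]
                              (loss-augmented inference, e.g. augmented Viterbi)
    """
    N = len(examples)
    W = [set() for _ in range(N)]   # working set of constraints, one set per example
    w = np.zeros(dim)

    def solve_restricted(w):
        # Approximate STEP 1: subgradient descent on
        #   (1/2)||w||^2 + (C/N) sum_i max(0, max_{y' in W_i} [delta - (F(y_i) - F(y'))])
        for _ in range(inner_steps):
            grad = w.copy()
            for i, (x, y) in enumerate(examples):
                if not W[i]:
                    continue
                yv = max(W[i], key=lambda yp: delta(y, yp) - w @ (Psi(y, x) - Psi(yp, x)))
                slack = delta(y, yv) - w @ (Psi(y, x) - Psi(yv, x))
                if slack > 0:
                    grad += (C / N) * (Psi(yv, x) - Psi(y, x))
            w = w - lr * grad
        return w

    while True:
        added = 0
        for i, (x, y) in enumerate(examples):
            yp = tuple(most_violated(w, x, y))        # STEP 2: loss-augmented inference
            slack_i = max([0.0] + [delta(y, yv) - w @ (Psi(y, x) - Psi(yv, x)) for yv in W[i]])
            violation = delta(y, yp) - w @ (Psi(y, x) - Psi(yp, x))
            if violation > slack_i + eps:             # STEP 3: add if violated by more than eps
                W[i].add(yp)
                added += 1
        if added == 0:
            return w                                  # no constraint violated by more than eps
        w = solve_restricted(w)                       # STEP 1 on the enlarged working set
```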

39 Example
argmin_{w,xi} (1/2) w^T w + (C/N) Sum_i xi_i,  s.t. for all i, all y' in W_i: F(y_i, x_i) - F(y', x_i) >= Delta(y_i, y') - xi_i, xi_i >= 0
Choose a small eps. Initialize W = {} (empty). Solve: with no constraints the solution is w = 0, so xi_i = 0.
x_i = "Fish Sleep", y_i = (N, V). With w = 0, every score is 0 and each candidate's violation equals its Hamming loss:
  y        F(y, x_i)   F(y_i, x_i) - F(y, x_i)   Loss   Violation
  (N, N)       0                 0                 1        1
  (N, V)       0                 0                 0        0
  (V, N)       0                 0                 2        2
  (V, V)       0                 0                 1        1
Constraint violation: Loss - Slack - ( F(y_i, x_i) - F(y, x_i) ) = Violation.

40 Example
The most violated constraint is (V, N), so update the working set: W = { (V, N) }.
(The score table is unchanged until the restricted problem is re-solved.)

41 Example
Re-solve the restricted SVM with W = { (V, N) }: the new solution has xi_i = 0.5.
Under the new w, (N, N) now has the largest constraint violation, computed with the same formula: Loss - Slack - margin.

42 Example
(N, N) is violated by more than eps, so update the working set: W = { (V, N), (N, N) }.

43 Example
Re-solve with W = { (V, N), (N, N) }: the new solution has xi_i = 0.55.
Now no constraint is violated by more than eps, so nothing is added to W and the algorithm terminates.

44-47 Geometric Example (shown over four animation slides)
Naive SVM problem: exponentially many constraints, but most are dominated by a small set of important constraints.
Structural SVM approach: repeatedly find the next most violated constraint until the working set of constraints is a good approximation of the full constraint set.
[Figure: the feasible region carved out by the full constraint set vs. the polyhedral approximation built from the working set.]
*This is known as a cutting plane method.

48 Linear Convergence Rate
Guarantee for any eps > 0:
argmin_{w,xi} (1/2) w^T w + (C/N) Sum_i xi_i,  s.t. for all i, y': F(y_i, x_i) - F(y', x_i) >= Delta(y_i, y') - xi_i - eps, xi_i >= 0
The cutting plane procedure terminates after O(1/eps) iterations.
(Proof in the cutting-plane analyses of structural SVM training.)

49 Finding the Most Violated Constraint
A constraint is violated when: Delta(y_i, y') - ( F(y_i, x_i) - F(y', x_i) ) - xi_i > 0
Finding the most violated constraint reduces to: argmax_{y'} [ F(y', x_i) + Delta(y_i, y') ]
Highly related to prediction, argmax_y F(y, x_i); this is called loss-augmented inference.

50 Augmented Scoring Function
F(y, x_i) = Sum_j w^T phi(y_j, y_{j-1} | x_i)
Goal: argmax_y [ F(y, x_i) + Delta(y_i, y) ]: solve using Viterbi!
Define an augmented model F~(y, x_i, y_i) = Sum_j w~^T phi~(y_j, y_{j-1} | x_i, y_i), with
  phi~(a, b | x_i, y_i) = [ phi(a, b | x_i) ; 1[a != y_{i,j}] ]   (one additional unary feature)
  w~ = [ w ; 1 ]
Then F~(y, x_i, y_i) = F(y, x_i) + Delta(y_i, y), so the goal is argmax_y F~(y, x_i, y_i), which Viterbi solves exactly as for prediction.
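In code, the augmentation is just one extra term in the unary score, so the same Viterbi routine from earlier performs the loss-augmented argmax; again a hedged sketch with assumed helper names.

```python
def loss_augmented_decode(w, x, y_true, tags, phi):
    """
    argmax_y [ w @ Psi(y, x) + Hamming(y_true, y) ]: because the Hamming loss
    decomposes over positions, it adds 1 to the unary score of every candidate
    tag that disagrees with y_true at that position (the slide's extra feature
    with weight 1). Reuses `viterbi` and `phi` from the earlier sketches.
    """
    def augmented_score(a, b, x_seq, i):
        return w @ phi(a, b, x_seq[i]) + (1.0 if a != y_true[i] else 0.0)
    return viterbi(x, tags, augmented_score)
```

This is exactly the `most_violated` routine the cutting-plane sketch above expects.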

51 Structural SVM Recipe
Feature map: Psi(y, x)
Inference: h(x) = argmax_y F(y, x) = argmax_y w^T Psi(y, x)
Loss function: Delta_i(y)
Loss-augmented inference: argmax_y [ w^T Psi(y, x) + Delta_i(y) ]   (finds the most violated constraint)
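Putting the recipe together with the sketches from the earlier slides (all names come from those illustrative sketches, not from the lecture), training and predicting on the toy problem might look like this:

```python
TAGS = ["Noun", "Verb"]
train = [(["Fish", "Sleep"], ("Noun", "Verb"))]
dim = len(Psi(("Noun", "Verb"), ["Fish", "Sleep"]))   # feature dimensionality of the toy model

w = train_structural_svm(
    train,
    Psi=Psi,                                          # feature map
    delta=hamming_loss,                               # loss function
    most_violated=lambda w_, x, y: loss_augmented_decode(w_, x, y, TAGS, phi),
    dim=dim,
)

# Inference with the learned weights:
print(viterbi(["Fish", "Sleep"], TAGS, lambda a, b, x, i: w @ phi(a, b, x[i])))
```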

