Polyhedral Outer Approximations with Application to Natural Language Parsing

Polyhedral Outer Approximations with Application to Natural Language Parsing

André F. T. Martins (1,2), Noah A. Smith (1), Eric P. Xing (1)

(1) Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
(2) Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal

ICML, Montréal, Québec, June 17th, 2009

In a Nutshell

Structured prediction: models interdependence among outputs [Lafferty et al., 2001; Taskar et al., 2003; Tsochantaridis et al., 2004]
Exact inference is only tractable with strong locality assumptions
Often: better (non-local) models with approximate inference
Sometimes the outputs are globally constrained (matchings, permutations, spanning trees)
How does approximate inference affect learning? [Kulesza and Pereira, 2007; Finley and Joachims, 2008]

This paper:
LP-relaxed inference and max-margin learning
Guarantees for algorithmic separability
New interpretation: balancing accuracy and computational cost
Learning bounds via polyhedral characterizations
Application: dependency parsing with rich features

Example: Dependency Parsing

Let x be a sentence in X ⊆ Σ*
Dependency trees: a syntactic representation that captures lexical relationships
Let Y(x) be the set of legal dependency trees of x
Each y ∈ Y(x) is a spanning tree of the complete digraph linking all word pairs
We want to learn a parser h : X → Y, where Y = ⋃_{x ∈ X} Y(x)
This is a structured classification problem involving non-local interactions among the output variables
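
To make the output space concrete, here is a minimal sketch (not from the paper; the toy sentence and brute-force enumeration are illustrative only) that lists Y(x) for a three-word sentence, treating each tree as a spanning arborescence of the complete digraph rooted at a dummy node 0:

```python
from itertools import product

def is_tree(heads):
    """heads[m-1] is the head of word m; check that following head links from
    every word reaches the dummy root 0 without hitting a cycle."""
    for m in range(1, len(heads) + 1):
        seen, node = set(), m
        while node != 0:
            if node in seen:            # cycle: not a valid tree
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

words = ["We", "eat", "fish"]           # toy sentence (illustration only)
n = len(words)

# Y(x): every legal dependency tree, represented as a list of (head, modifier) arcs
trees = [[(heads[m - 1], m) for m in range(1, n + 1)]
         for heads in product(range(n + 1), repeat=n)
         if all(heads[m - 1] != m for m in range(1, n + 1)) and is_tree(heads)]

print(len(trees), "legal trees; one of them:", trees[0])
```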

Outline

1. Structured Classification and LP
2. Learning with LP-Relaxed Inference
3. Experiments
4. Conclusion

Structured Classification and LP

Notation

Input set X
Output set Y
Labeled dataset L ≡ {(x_1, y_1), ..., (x_m, y_m)} ⊆ X × Y, drawn i.i.d. from P(X, Y)
Loss function ℓ : Y × Y → R_+
Goal: learn h : X → Y with small expected loss E ℓ(h(X); Y)
Here: linear classifiers h_w(x) = arg max_{y ∈ Y} w · f(x, y)
Hypothesis space H ≡ {h_w | w ∈ W}, with W ⊆ R^d convex

Decomposition Into Parts

Assumption: each y ∈ Y decomposes into parts
  Example: clique assignments in a Markov network
  Example: arcs in a dependency parse tree
  Example: k-tuples of arcs in a dependency tree (up to some k)
Define a set of parts R
Replace y by an indicator vector z ≡ (z_r)_{r ∈ R} with z_r ≡ I(r ∈ y)
Assumption: the features decompose over the parts,
  f(x, y) ≡ Σ_{r ∈ y} f_r(x) = Σ_{r ∈ R} z_r f_r(x) = F(x) z,
where F(x) ≡ (f_r(x))_{r ∈ R} is a d-by-|R| feature matrix
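
A small made-up illustration of the decomposition with arc parts: z is the 0/1 indicator over all candidate arcs, F(x) stacks one (here random) feature column per arc, and the score w · f(x, y) equals w^T F(x) z:

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d = 3, 5                        # toy sizes (illustration only)

# parts R: all candidate arcs (head, modifier) with a dummy root 0
parts = [(h, m) for h in range(n_words + 1)
                for m in range(1, n_words + 1) if h != m]

F = rng.normal(size=(d, len(parts)))     # feature matrix F(x): one column f_r(x) per part
w = rng.normal(size=d)                   # model weights

tree = {(0, 2), (2, 1), (2, 3)}          # one dependency tree, as a set of arcs
z = np.array([1.0 if r in tree else 0.0 for r in parts])   # indicator vector z

score_by_parts = sum(w @ F[:, i] for i, r in enumerate(parts) if r in tree)
score_matrix = w @ F @ z                 # same quantity in matrix form, w^T F(x) z
assert np.isclose(score_by_parts, score_matrix)
print(score_matrix)
```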

From the Output Set to a Polytope

(figure: each output y ∈ Y is mapped to its indicator vector z; the convex hull of these vectors is a polytope Z, whose vertex set V(Z) consists of the integer points)

Inference

Minkowski-Weyl theorem: there is a representation Z = {z ∈ R^n : Az ≤ b}, where A is a p-by-n matrix and b is a vector in R^p (p, n ∈ N)
Inference becomes an LP [Taskar et al., 2004]:
  max_{y ∈ Y} w · f(x, y) = max_{z ∈ V(Z)} w^T F(x) z = max_{z ∈ Z} s^T z, with s ≡ F(x)^T w
Are we done? No:
  Finding A and b is problem dependent
  In general p = O(exp(n)) (exponentially many constraints)

LP-Relaxed Inference and the Outer Polytope

Often there is a concise representation of an outer polytope Z̄ ⊇ Z such that Z̄ ∩ Z^n = V(Z):
  max_{z ∈ Z} s^T z = max_{z ∈ Z̄, z ∈ Z^n} s^T z ≤ max_{z ∈ Z̄} s^T z
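
Since the relaxation is a generic LP, any LP solver applies. A minimal sketch with scipy.optimize.linprog, using a tiny made-up constraint matrix (A, b) in place of the problem-specific outer-polytope description, and checking whether the solution landed on an integer vertex:

```python
import numpy as np
from scipy.optimize import linprog

def relaxed_inference(s, A, b):
    """Solve max_z s.z subject to A z <= b and 0 <= z <= 1 (the LP over the
    outer polytope Z-bar); linprog minimizes, hence the negated objective."""
    res = linprog(c=-s, A_ub=A, b_ub=b, bounds=[(0.0, 1.0)] * len(s), method="highs")
    return res.x

# toy instance: three parts and one made-up constraint z1 + z2 + z3 <= 2
s = np.array([1.0, 2.0, 0.5])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([2.0])

z = relaxed_inference(s, A, b)
print(z, "integral:", bool(np.all(np.isclose(z, np.round(z)))))
```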

Learning

What about learning?
Assumption: the loss function also decomposes over the parts
Example: the Hamming loss
  ℓ(y'; y) ≡ Σ_{r ∈ R} ( I(r ∈ y') I(r ∉ y) + I(r ∉ y') I(r ∈ y) ) = ‖z' − z‖_1 = p^T z' + q, where p ≡ 1 − 2z and q ≡ 1^T z
The Hamming loss is an affine function of z'
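
A quick numerical check of this identity, with made-up indicator vectors:

```python
import numpy as np

z  = np.array([1, 0, 1, 0, 0])   # gold parts indicator z (toy)
zp = np.array([1, 1, 0, 0, 1])   # predicted parts indicator z' (toy)

hamming = np.abs(zp - z).sum()   # ||z' - z||_1
p, q = 1 - 2 * z, z.sum()        # affine coefficients p = 1 - 2z, q = 1.z
assert hamming == p @ zp + q     # l(y'; y) = p.z' + q
print(hamming)
```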

Learning

Structured SVM:
  min_w  (λ/2) ‖w‖² + (1/m) Σ_{t=1}^m r_t(w)
where the slack r_t(w) is the solution of the loss-augmented inference (LAI) problem
  r_t(w) = max_{y'_t ∈ Y} [ w · f(x_t, y'_t) + ℓ(y'_t; y_t) ] − w · f(x_t, y_t)
         = max_{z'_t ∈ Z} (F_t^T w + p_t)^T z'_t − (F_t^T w)^T z_t + q_t
Also an LP.
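
In code, loss augmentation only shifts the score vector by p_t and adds the constant q_t afterwards. A short sketch: lai_solver is a hypothetical routine returning arg max of s · z over Z (or over Z̄, in which case this computes the relaxed slack), e.g. a closure over the LP sketch above:

```python
import numpy as np

def slack(w, F_t, z_t, lai_solver):
    """r_t(w) = max_z (F_t^T w + p_t).z - (F_t^T w).z_t + q_t, with Hamming loss."""
    s = F_t.T @ w                    # part scores F_t^T w
    p, q = 1 - 2 * z_t, z_t.sum()    # loss coefficients p_t, q_t
    z_hat = lai_solver(s + p)        # loss-augmented inference
    return (s + p) @ z_hat - s @ z_t + q
```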

Learning with LP-Relaxed Inference

Exact and Relaxed Structured SVMs

Structured SVM (exact):
  min_w  (λ/2) ‖w‖² + (1/m) Σ_{t=1}^m r_t(w)
where the slack r_t(w) is the solution of the exact LAI problem
  r_t(w) = max_{z'_t ∈ Z} (F_t^T w + p_t)^T z'_t − (F_t^T w)^T z_t + q_t
Relax.

Structured SVM (relaxed):
  min_w  (λ/2) ‖w‖² + (1/m) Σ_{t=1}^m r̄_t(w)
where the slack r̄_t(w) is the solution of the relaxed LAI problem
  r̄_t(w) = max_{z'_t ∈ Z̄} (F_t^T w + p_t)^T z'_t − (F_t^T w)^T z_t + q_t
r̄_t(w) upper bounds the exact slack
ℓ(h̄_w(x_t), z_t) upper bounds the true loss

Algorithmic Separability

LP-relaxed inference augments the output space: it makes up artificial negative examples
Equivalently: an approximate algorithm Ā_w which sometimes returns fractional solutions
Some definitions [Kulesza and Pereira, 2007]:
  L is separable if ∃w s.t. h_w classifies all data correctly
  L is algorithmically separable if ∃w s.t. Ā_w classifies all data correctly
Remark: L algorithmically separable ⟹ L separable
Margin of separation: the largest γ s.t., for every (x_t, y_t) ∈ L and every y'_t ∈ Y(x_t),
  w · f(x_t, y_t) ≥ w · f(x_t, y'_t) + γ ℓ(y_t, y'_t), with ‖w‖ = 1
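
The margin definition translates directly into code. A sketch with hypothetical callbacks (f for features, loss for the loss, candidates enumerating Y(x), which is only feasible for toy problems) that returns the largest γ achieved by a given w; a nonpositive value means w does not separate the data:

```python
import numpy as np

def margin(w, data, f, loss, candidates):
    """Largest gamma with w.f(x_t, y_t) >= w.f(x_t, y') + gamma * loss(y_t, y')
    for every training pair and every candidate output, with ||w|| = 1."""
    w = w / np.linalg.norm(w)
    gamma = np.inf
    for x_t, y_t in data:
        gold_score = w @ f(x_t, y_t)
        for y in candidates(x_t):
            l = loss(y_t, y)
            if l > 0:                                  # skip candidates with zero loss
                gamma = min(gamma, (gold_score - w @ f(x_t, y)) / l)
    return gamma
```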

A Sufficient Condition for Algorithmic Separability

Zooming in around a fractional vertex of the outer polytope: (figure)

Let L be the radius of the largest loss ball centered at a fractional vertex which does not contain any integer vertex

Let L′ (≥ L) be the radius of the largest loss ball centered at a fractional vertex which contains at most one integer vertex
Assume binary-valued features, and let N_f be the maximum number of active features per part

Proposition: If L is separable with margin γ ≥ L′ N_f, then it is algorithmically separable.

In the paper: bounds for the nonseparable case
Also: bounds for ε-approximate algorithms

Balancing Accuracy and Runtime

Typical goal: minimize the expected loss E ℓ(h(X), Y)
In structured prediction, computational cost is also important:
  Let ℓ_c(h, x) be the cost of computing h(x)
  Let E ℓ_c(h, X) be the average computational cost of h
Alternative goal: minimize E ℓ(h(X), Y) + η E ℓ_c(h, X)  (the expected loss of h plus η times its average cost)

Recall that exact inference in our setting is cast as an ILP:
  max s^T z, with s = F(x)^T w, z ∈ Z̄ ∩ Z^n
(figure: score vectors and the integer vs. fractional vertices of the outer polytope)

Nice Score Vectors and Low-Cost Hypotheses

A nice score vector s is one which hits an integer vertex
At test time, s ~ P(F(X)^T w) is a random variable that depends on X (filtered by the parameters w)
A low-cost hypothesis h_w is one which yields P(F(X)^T w) with large mass on nice score vectors
Idea: approximate the computational cost by the relaxation gap:
  E ℓ_c(h_w, X) ≈ E ℓ(h̄_w(X), h_w(X))
Most ILP solvers (branch-and-bound, Gomory's cuts) converge faster as this gap gets smaller

Balancing Accuracy and Runtime

Add an empirical relaxation gap term to our learning objective: (1/m) Σ_{t=1}^m ( r̄_t(w) − r_t(w) )
The learning problem becomes
  min_w  (λ/2) ‖w‖² + ((1 − η)/m) Σ_{t=1}^m r_t(w)  [exact LAI]  + (η/m) Σ_{t=1}^m r̄_t(w)  [relaxed LAI]
In the paper: a stochastic adaptation of the online subgradient algorithm [Ratliff et al., 2006]
Also a PAC bound with respect to the best exact learner:
  It measures the impact of the approximation on learning
  Previous bounds were in terms of the approximate learner [Kulesza and Pereira, 2007]
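
The mixed objective is easy to evaluate once both slacks are available. A sketch with hypothetical per-example solvers exact_slack (an ILP) and relaxed_slack (the LP), illustrating the objective that the stochastic subgradient algorithm in the extra slides optimizes:

```python
import numpy as np

def mixed_objective(w, examples, exact_slack, relaxed_slack, lam, eta):
    """lambda/2 ||w||^2 + (1 - eta)/m sum_t r_t(w) + eta/m sum_t rbar_t(w)."""
    m = len(examples)
    exact = sum(exact_slack(w, ex) for ex in examples) / m       # exact LAI term
    relaxed = sum(relaxed_slack(w, ex) for ex in examples) / m   # relaxed LAI term
    return 0.5 * lam * float(w @ w) + (1 - eta) * exact + eta * relaxed
```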

Experiments

Experiments

Dependency parsing for seven languages: Danish, Dutch, Portuguese, Slovene, Swedish, Turkish, English
Exact inference is efficient with an arc-factored model: find a maximal spanning tree [McDonald et al., 2005]
Beyond arc-factored: NP-hard [McDonald and Satta, 2007]
Our model: an ILP formulation with non-arc-factored features
  Models grandparent/sibling interactions
  Models valency and nonprojective arcs
  Only O(n³) variables and constraints
More details: [Martins et al., 2009]
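
For reference, arc-factored exact decoding is a maximum spanning arborescence (Chu-Liu-Edmonds). A minimal sketch, assuming networkx's maximum_spanning_arborescence is available; the score matrix here is a random stand-in for F(x)^T w:

```python
import numpy as np
import networkx as nx

def arc_factored_decode(scores):
    """scores[h, m] is the score of arc head h -> modifier m, with node 0 the
    dummy root and 1..n the words; returns the highest-scoring tree as arcs."""
    n = scores.shape[0] - 1
    G = nx.DiGraph()
    for h in range(n + 1):
        for m in range(1, n + 1):
            if h != m:
                G.add_edge(h, m, weight=float(scores[h, m]))
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return sorted(tree.edges())

rng = np.random.default_rng(0)
print(arc_factored_decode(rng.normal(size=(4, 4))))   # toy 3-word sentence
```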

Experiment #1: training with LP-relaxed LAI (η = 1)

Two different decoders at test time:
  Exact decoder (solve an ILP)
  Approximate decoder (solve the relaxed LP; if the solution is fractional, round it in polynomial time by finding a maximal spanning tree on the reweighted graph)
Strong baselines:
  [MP06] approximate second-order parser [McDonald and Pereira, 2006]
  [MDSX08] stacked parser [Martins et al., 2008]
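
One natural reading of the rounding step (a sketch, not the paper's exact recipe): if the LP solution is fractional, reuse the fractional arc values as weights of a reweighted graph and extract a tree with the MST decoder (the hypothetical arc_factored_decode sketched above):

```python
import numpy as np

def approximate_decode(z_frac, parts, n_words):
    """z_frac: LP solution over candidate arcs, aligned with `parts`.
    If it is integral, read the arcs off directly; otherwise round by running
    the maximum-spanning-arborescence decoder on the reweighted graph."""
    if np.all(np.isclose(z_frac, np.round(z_frac))):
        return [r for r, v in zip(parts, z_frac) if v > 0.5]
    weights = np.zeros((n_words + 1, n_words + 1))
    for (h, m), v in zip(parts, z_frac):
        weights[h, m] = v                       # reweight arcs by fractional values
    return arc_factored_decode(weights)         # hypothetical helper sketched earlier
```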

Our models:
  Arc-factored model
  Full model (with exact decoding): much better
  Approximate decoding did not considerably affect accuracy

Where are the baselines?
  [MP06] second-order parser [McDonald and Pereira, 2006]
  [MDSX08] stacked parser [Martins et al., 2008]

Experiment #2: does η really penalize computational cost?

Slovene dataset (with a reduced set of features)
As η increases, the model learns to avoid fractional solutions
Runtime does correlate with the relaxation gap
Yet the approximate decoder is significantly faster
Our full model: same order of magnitude as the baselines (… sec.)

Conclusion

Conclusions and Future Work

We studied the impact of LP-relaxed inference on max-margin learning
We established sufficient conditions for algorithmic separability
As a by-product: a new learning algorithm that penalizes computational cost
We demonstrated the effectiveness of these techniques in dependency parsing with non-arc-factored features

Future work:
  Polyhedral characterizations that guarantee tighter bounds
  Conditions for a vanishing relaxation gap in online learning
  Connections with regularization

References

Chu, Y. J. and Liu, T. H. (1965). On the shortest arborescence of a directed graph. Science Sinica, 14.
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. (2006). Online passive-aggressive algorithms. JMLR, 7.
Edmonds, J. (1967). Optimum branchings. Journal of Research of the National Bureau of Standards, 71B.
Eisner, J. (1996). Three new probabilistic models for dependency parsing: An exploration. In COLING.
Finley, T. and Joachims, T. (2008). Training structural SVMs when exact inference is intractable. In ICML.
Kulesza, A. and Pereira, F. (2007). Structured learning with approximate inference. In NIPS.
Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
Magnanti, T. and Wolsey, L. (1994). Optimal trees. Technical report, Massachusetts Institute of Technology, Operations Research Center.
Martins, A. F. T., Das, D., Smith, N. A., and Xing, E. P. (2008). Stacking dependency parsers. In EMNLP.
Martins, A. F. T., Smith, N. A., and Xing, E. P. (2009). Concise integer linear programming formulations for dependency parsing. In ACL-IJCNLP.
McDonald, R. and Satta, G. (2007). On the complexity of non-projective data-driven dependency parsing. In IWPT.
McDonald, R. T., Pereira, F., Ribarov, K., and Hajic, J. (2005). Non-projective dependency parsing using spanning tree algorithms. In HLT-EMNLP.
McDonald, R. T. and Pereira, F. C. N. (2006). Online learning of approximate dependency parsing algorithms. In EACL.
Ratliff, N., Bagnell, J. A., and Zinkevich, M. (2006). Subgradient methods for maximum margin structured learning. In ICML Workshop on Learning in Structured Output Spaces.
Smith, D. A. and Eisner, J. (2008). Dependency parsing by belief propagation. In EMNLP.
Taskar, B., Chatalbashev, V., and Koller, D. (2004). Learning associative Markov networks. In ICML.
Taskar, B., Guestrin, C., and Koller, D. (2003). Max-margin Markov networks. In NIPS.
Tsochantaridis, I., Hofmann, T., Joachims, T., and Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. In ICML.

Extra Slides: Balancing Accuracy and Runtime

Stochastic Online Subgradient Algorithm (based on [Ratliff et al., 2006])

Input: L, ⟨η_t⟩_t, learning rate sequence ⟨α_t⟩_t
Initialize w_1 ← 0
for t = 1 to m = |L| do
  Pick σ_t ~ Bernoulli(η_t)
  if σ_t = 1 then
    Solve relaxed LAI: ẑ_t ← arg max_{z'_t ∈ Z̄} w_t^T F_t (z'_t − z_t) + ℓ(z'_t; z_t)
  else
    Solve exact LAI: ẑ_t ← arg max_{z'_t ∈ Z} w_t^T F_t (z'_t − z_t) + ℓ(z'_t; z_t)
  end if
  Compute the subgradient g_t ← λ w_t + F_t (ẑ_t − z_t)
  Project and update: w_{t+1} ← Proj_W(w_t − α_t g_t)
end for
Return the averaged model ŵ ← (1/m) Σ_{t=1}^m w_t
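
A compact Python rendering of the algorithm (a sketch under assumptions: exact_lai and relaxed_lai are hypothetical callbacks returning the loss-augmented argmax ẑ_t over Z and Z̄ respectively, each example carries its feature matrix F and gold indicator z, and projection onto W is approximated here by a norm-ball projection):

```python
import numpy as np

def subgradient_train(examples, exact_lai, relaxed_lai, lam, etas, alphas, radius=10.0):
    """Stochastic online subgradient training that mixes exact and relaxed LAI."""
    rng = np.random.default_rng(0)
    d = examples[0]["F"].shape[0]
    w, w_sum = np.zeros(d), np.zeros(d)
    for t, ex in enumerate(examples, start=1):
        F_t, z_t = ex["F"], ex["z"]                    # features and gold parts
        lai = relaxed_lai if rng.random() < etas(t) else exact_lai   # sigma_t ~ Bernoulli(eta_t)
        z_hat = lai(w, ex)                             # loss-augmented argmax
        g = lam * w + F_t @ (z_hat - z_t)              # subgradient g_t
        w = w - alphas(t) * g                          # step with learning rate alpha_t
        norm = np.linalg.norm(w)                       # project back onto W (norm ball)
        if norm > radius:
            w *= radius / norm
        w_sum += w
    return w_sum / len(examples)                       # averaged model w-hat
```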

Generalization Bound

Proposition: Setting α_t = 1/(λt), λ = Θ(√((1 + log m)/m)), and η_t = Θ(t^{−1/2}), the bound
  E ℓ(h_ŵ(X), Y) ≤ (1/m) Σ_{t=1}^m r_t(w*) + L · o(m)/m + O( √((1/m) ln(1/δ)) )
holds with probability ≥ 1 − δ.

Remark: the bound is in terms of what could be achieved with the best exact learner; it measures the impact of the approximation on learning. Previous bounds were in terms of the approximate learner [Kulesza and Pereira, 2007].


More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Machine Learning 4771

Machine Learning 4771 Machine Learning 477 Instructor: Tony Jebara Topic 5 Generalization Guarantees VC-Dimension Nearest Neighbor Classification (infinite VC dimension) Structural Risk Minimization Support Vector Machines

More information

Section Notes 9. IP: Cutting Planes. Applied Math 121. Week of April 12, 2010

Section Notes 9. IP: Cutting Planes. Applied Math 121. Week of April 12, 2010 Section Notes 9 IP: Cutting Planes Applied Math 121 Week of April 12, 2010 Goals for the week understand what a strong formulations is. be familiar with the cutting planes algorithm and the types of cuts

More information

Posterior Regularization

Posterior Regularization Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods

More information

Natural Language Processing CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Natural Language Processing CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Natural Language Processing CS 6840 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Statistical Parsing Define a probabilistic model of syntax P(T S):

More information

Sample and Computationally Efficient Active Learning. Maria-Florina Balcan Carnegie Mellon University

Sample and Computationally Efficient Active Learning. Maria-Florina Balcan Carnegie Mellon University Sample and Computationally Efficient Active Learning Maria-Florina Balcan Carnegie Mellon University Machine Learning is Shaping the World Highly successful discipline with lots of applications. Computational

More information

Probabilistic Graphical Models & Applications

Probabilistic Graphical Models & Applications Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with

More information

Does Better Inference mean Better Learning?

Does Better Inference mean Better Learning? Does Better Inference mean Better Learning? Andrew E. Gelfand, Rina Dechter & Alexander Ihler Department of Computer Science University of California, Irvine {agelfand,dechter,ihler}@ics.uci.edu Abstract

More information

ML4NLP Multiclass Classification

ML4NLP Multiclass Classification ML4NLP Multiclass Classification CS 590NLP Dan Goldwasser Purdue University dgoldwas@purdue.edu Social NLP Last week we discussed the speed-dates paper. Interesting perspective on NLP problems- Can we

More information

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good

More information

Classification Logistic Regression

Classification Logistic Regression Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham

More information

Maximum Margin Bayesian Networks

Maximum Margin Bayesian Networks Maximum Margin Bayesian Networks Yuhong Guo Department of Computing Science University of Alberta yuhong@cs.ualberta.ca Dana Wilkinson School of Computer Science University of Waterloo d3wilkin@cs.uwaterloo.ca

More information

Multi-Class Confidence Weighted Algorithms

Multi-Class Confidence Weighted Algorithms Multi-Class Confidence Weighted Algorithms Koby Crammer Department of Computer and Information Science University of Pennsylvania Philadelphia, PA 19104 Mark Dredze Alex Kulesza Human Language Technology

More information

lecture 6: modeling sequences (final part)

lecture 6: modeling sequences (final part) Natural Language Processing 1 lecture 6: modeling sequences (final part) Ivan Titov Institute for Logic, Language and Computation Outline After a recap: } Few more words about unsupervised estimation of

More information

arxiv: v1 [stat.ml] 13 Oct 2010

arxiv: v1 [stat.ml] 13 Oct 2010 Online Multiple Kernel Learning for Structured Prediction arxiv:00.770v [stat.ml] 3 Oct 00 André F. T. Martins Noah A. Smith Eric P. Xing Pedro M. Q. Aguiar Mário A. T. Figueiredo {afm,nasmith,epxing}@cs.cmu.edu

More information

Support Vector Machines and Bayes Regression

Support Vector Machines and Bayes Regression Statistical Techniques in Robotics (16-831, F11) Lecture #14 (Monday ctober 31th) Support Vector Machines and Bayes Regression Lecturer: Drew Bagnell Scribe: Carl Doersch 1 1 Linear SVMs We begin by considering

More information

Conditional Random Fields for Sequential Supervised Learning

Conditional Random Fields for Sequential Supervised Learning Conditional Random Fields for Sequential Supervised Learning Thomas G. Dietterich Adam Ashenfelter Department of Computer Science Oregon State University Corvallis, Oregon 97331 http://www.eecs.oregonstate.edu/~tgd

More information

Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing

Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for atural Language Processing Alexander M. Rush (based on joint work with Michael Collins, Tommi Jaakkola, Terry Koo, David Sontag) Uncertainty

More information

The Geometry of Statistical Machine Translation

The Geometry of Statistical Machine Translation The Geometry of Statistical Machine Translation Presented by Rory Waite 16th of December 2015 ntroduction Linear Models Convex Geometry The Minkowski Sum Projected MERT Conclusions ntroduction We provide

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

ECE 5424: Introduction to Machine Learning

ECE 5424: Introduction to Machine Learning ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple

More information

Logistic regression and linear classifiers COMS 4771

Logistic regression and linear classifiers COMS 4771 Logistic regression and linear classifiers COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random

More information