Polyhedral Outer Approximations with Application to Natural Language Parsing

Polyhedral Outer Approximations with Application to Natural Language Parsing

André F. T. Martins (1,2), Noah A. Smith (1), Eric P. Xing (1)

(1) Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
(2) Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal

ICML, Montréal, Québec, June 17th, 2009

In a Nutshell

Structured prediction: models interdependence among outputs [Lafferty et al., 2001; Taskar et al., 2003; Tsochantaridis et al., 2004]
Exact inference is only tractable with strong locality assumptions
Often: better (non-local) models with approximate inference
Sometimes the outputs are globally constrained (matchings, permutations, spanning trees)
How does approximate inference affect learning? [Kulesza and Pereira, 2007; Finley and Joachims, 2008]

This paper:
LP-relaxed inference and max-margin learning
Guarantees for algorithmic separability
New interpretation: balancing accuracy and computational cost
Learning bounds via polyhedral characterizations
Application: dependency parsing with rich features

Example: Dependency Parsing

Let x be a sentence in X ⊆ Σ*
Dependency trees: a syntactic representation that captures lexical relationships
Let Y(x) be the set of legal dependency trees of x
Each y ∈ Y(x) is a spanning tree of the complete digraph linking all word pairs
We want to learn a parser h : X → Y, where Y = ⋃_{x ∈ X} Y(x)
This is a structured classification problem involving non-local interactions among the output variables
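
To make the output space concrete, here is a minimal sketch (not from the paper; the toy sentence and brute-force enumeration are illustrative only) that lists Y(x) for a three-word sentence, treating each tree as a spanning arborescence of the complete digraph rooted at a dummy node 0:

```python
from itertools import product

def is_tree(heads):
    """heads[m-1] is the head of word m; check that following head links from
    every word reaches the dummy root 0 without hitting a cycle."""
    for m in range(1, len(heads) + 1):
        seen, node = set(), m
        while node != 0:
            if node in seen:            # cycle: not a valid tree
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

words = ["We", "eat", "fish"]           # toy sentence (illustration only)
n = len(words)

# Y(x): every legal dependency tree, represented as a list of (head, modifier) arcs
trees = [[(heads[m - 1], m) for m in range(1, n + 1)]
         for heads in product(range(n + 1), repeat=n)
         if all(heads[m - 1] != m for m in range(1, n + 1)) and is_tree(heads)]

print(len(trees), "legal trees; one of them:", trees[0])
```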

Outline

1. Structured Classification and LP
2. Learning with LP-Relaxed Inference
3. Experiments
4. Conclusion

Structured Classification and LP

Notation

Input set X
Output set Y
Labeled dataset L ≡ {(x_1, y_1), ..., (x_m, y_m)} ⊆ X × Y, drawn i.i.d. from P(X, Y)
Loss function ℓ : Y × Y → R_+
Goal: learn h : X → Y with small expected loss E ℓ(h(X); Y)
Here: linear classifiers h_w(x) = arg max_{y ∈ Y} w · f(x, y)
Hypothesis space H ≡ {h_w | w ∈ W}, with W ⊆ R^d convex

Decomposition Into Parts

Assumption: each y ∈ Y decomposes into parts
  Example: clique assignments in a Markov network
  Example: arcs in a dependency parse tree
  Example: k-tuples of arcs in a dependency tree (up to some k)
Define a set of parts R
Replace y by an indicator vector z ≡ (z_r)_{r ∈ R} with z_r ≡ I(r ∈ y)
Assumption: the features decompose over the parts,
  f(x, y) ≡ Σ_{r ∈ y} f_r(x) = Σ_{r ∈ R} z_r f_r(x) = F(x) z,
where F(x) ≡ (f_r(x))_{r ∈ R} is a d-by-|R| feature matrix
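
A small made-up illustration of the decomposition with arc parts: z is the 0/1 indicator over all candidate arcs, F(x) stacks one (here random) feature column per arc, and the score w · f(x, y) equals w^T F(x) z:

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d = 3, 5                        # toy sizes (illustration only)

# parts R: all candidate arcs (head, modifier) with a dummy root 0
parts = [(h, m) for h in range(n_words + 1)
                for m in range(1, n_words + 1) if h != m]

F = rng.normal(size=(d, len(parts)))     # feature matrix F(x): one column f_r(x) per part
w = rng.normal(size=d)                   # model weights

tree = {(0, 2), (2, 1), (2, 3)}          # one dependency tree, as a set of arcs
z = np.array([1.0 if r in tree else 0.0 for r in parts])   # indicator vector z

score_by_parts = sum(w @ F[:, i] for i, r in enumerate(parts) if r in tree)
score_matrix = w @ F @ z                 # same quantity in matrix form, w^T F(x) z
assert np.isclose(score_by_parts, score_matrix)
print(score_matrix)
```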

From the Output Set to a Polytope

(figure: each output y ∈ Y is mapped to its indicator vector z; the convex hull of these vectors is a polytope Z, whose vertex set V(Z) consists of the integer points)

Inference

Minkowski-Weyl theorem: there is a representation Z = {z ∈ R^n : Az ≤ b}, where A is a p-by-n matrix and b is a vector in R^p (p, n ∈ N)
Inference becomes an LP [Taskar et al., 2004]:
  max_{y ∈ Y} w · f(x, y) = max_{z ∈ V(Z)} w^T F(x) z = max_{z ∈ Z} s^T z, with s ≡ F(x)^T w
Are we done? No:
  Finding A and b is problem dependent
  In general p = O(exp(n)) (exponentially many constraints)

LP-Relaxed Inference and the Outer Polytope

Often there is a concise representation of an outer polytope Z̄ ⊇ Z such that Z̄ ∩ Z^n = V(Z):
  max_{z ∈ Z} s^T z = max_{z ∈ Z̄, z ∈ Z^n} s^T z ≤ max_{z ∈ Z̄} s^T z
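
Since the relaxation is a generic LP, any LP solver applies. A minimal sketch with scipy.optimize.linprog, using a tiny made-up constraint matrix (A, b) in place of the problem-specific outer-polytope description, and checking whether the solution landed on an integer vertex:

```python
import numpy as np
from scipy.optimize import linprog

def relaxed_inference(s, A, b):
    """Solve max_z s.z subject to A z <= b and 0 <= z <= 1 (the LP over the
    outer polytope Z-bar); linprog minimizes, hence the negated objective."""
    res = linprog(c=-s, A_ub=A, b_ub=b, bounds=[(0.0, 1.0)] * len(s), method="highs")
    return res.x

# toy instance: three parts and one made-up constraint z1 + z2 + z3 <= 2
s = np.array([1.0, 2.0, 0.5])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([2.0])

z = relaxed_inference(s, A, b)
print(z, "integral:", bool(np.all(np.isclose(z, np.round(z)))))
```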

Learning

What about learning?
Assumption: the loss function also decomposes over the parts
Example: the Hamming loss
  ℓ(y'; y) ≡ Σ_{r ∈ R} ( I(r ∈ y') I(r ∉ y) + I(r ∉ y') I(r ∈ y) ) = ‖z' − z‖_1 = p^T z' + q, where p ≡ 1 − 2z and q ≡ 1^T z
The Hamming loss is an affine function of z'
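
A quick numerical check of this identity, with made-up indicator vectors:

```python
import numpy as np

z  = np.array([1, 0, 1, 0, 0])   # gold parts indicator z (toy)
zp = np.array([1, 1, 0, 0, 1])   # predicted parts indicator z' (toy)

hamming = np.abs(zp - z).sum()   # ||z' - z||_1
p, q = 1 - 2 * z, z.sum()        # affine coefficients p = 1 - 2z, q = 1.z
assert hamming == p @ zp + q     # l(y'; y) = p.z' + q
print(hamming)
```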

Learning

Structured SVM:
  min_w  (λ/2) ‖w‖² + (1/m) Σ_{t=1}^m r_t(w)
where the slack r_t(w) is the solution of the loss-augmented inference (LAI) problem
  r_t(w) = max_{y'_t ∈ Y} [ w · f(x_t, y'_t) + ℓ(y'_t; y_t) ] − w · f(x_t, y_t)
         = max_{z'_t ∈ Z} (F_t^T w + p_t)^T z'_t − (F_t^T w)^T z_t + q_t
Also an LP.
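
In code, loss augmentation only shifts the score vector by p_t and adds the constant q_t afterwards. A short sketch: lai_solver is a hypothetical routine returning arg max of s · z over Z (or over Z̄, in which case this computes the relaxed slack), e.g. a closure over the LP sketch above:

```python
import numpy as np

def slack(w, F_t, z_t, lai_solver):
    """r_t(w) = max_z (F_t^T w + p_t).z - (F_t^T w).z_t + q_t, with Hamming loss."""
    s = F_t.T @ w                    # part scores F_t^T w
    p, q = 1 - 2 * z_t, z_t.sum()    # loss coefficients p_t, q_t
    z_hat = lai_solver(s + p)        # loss-augmented inference
    return (s + p) @ z_hat - s @ z_t + q
```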

Learning with LP-Relaxed Inference

Exact and Relaxed Structured SVMs

Structured SVM (exact):
  min_w  (λ/2) ‖w‖² + (1/m) Σ_{t=1}^m r_t(w)
where the slack r_t(w) is the solution of the exact LAI problem
  r_t(w) = max_{z'_t ∈ Z} (F_t^T w + p_t)^T z'_t − (F_t^T w)^T z_t + q_t
Relax.

Structured SVM (relaxed):
  min_w  (λ/2) ‖w‖² + (1/m) Σ_{t=1}^m r̄_t(w)
where the slack r̄_t(w) is the solution of the relaxed LAI problem
  r̄_t(w) = max_{z'_t ∈ Z̄} (F_t^T w + p_t)^T z'_t − (F_t^T w)^T z_t + q_t
r̄_t(w) upper bounds the exact slack
ℓ(h̄_w(x_t), z_t) upper bounds the true loss

Algorithmic Separability

LP-relaxed inference augments the output space: it makes up artificial negative examples
Equivalently: an approximate algorithm Ā_w which sometimes returns fractional solutions
Some definitions [Kulesza and Pereira, 2007]:
  L is separable if ∃w s.t. h_w classifies all data correctly
  L is algorithmically separable if ∃w s.t. Ā_w classifies all data correctly
Remark: L algorithmically separable ⟹ L separable
Margin of separation: the largest γ s.t., for every (x_t, y_t) ∈ L and every y'_t ∈ Y(x_t),
  w · f(x_t, y_t) ≥ w · f(x_t, y'_t) + γ ℓ(y_t, y'_t), with ‖w‖ = 1
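
The margin definition translates directly into code. A sketch with hypothetical callbacks (f for features, loss for the loss, candidates enumerating Y(x), which is only feasible for toy problems) that returns the largest γ achieved by a given w; a nonpositive value means w does not separate the data:

```python
import numpy as np

def margin(w, data, f, loss, candidates):
    """Largest gamma with w.f(x_t, y_t) >= w.f(x_t, y') + gamma * loss(y_t, y')
    for every training pair and every candidate output, with ||w|| = 1."""
    w = w / np.linalg.norm(w)
    gamma = np.inf
    for x_t, y_t in data:
        gold_score = w @ f(x_t, y_t)
        for y in candidates(x_t):
            l = loss(y_t, y)
            if l > 0:                                  # skip candidates with zero loss
                gamma = min(gamma, (gold_score - w @ f(x_t, y)) / l)
    return gamma
```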

A Sufficient Condition for Algorithmic Separability

Zooming in around a fractional vertex of the outer polytope: (figure)

Let L be the radius of the largest loss ball centered at a fractional vertex which does not contain any integer vertex

Let L′ (≥ L) be the radius of the largest loss ball centered at a fractional vertex which contains at most one integer vertex
Assume binary-valued features, and let N_f be the maximum number of active features per part

Proposition: If L is separable with margin γ ≥ L′ N_f, then it is algorithmically separable.

In the paper: bounds for the nonseparable case
Also: bounds for ε-approximate algorithms

Balancing Accuracy and Runtime

Typical goal: minimize the expected loss E ℓ(h(X), Y)
In structured prediction, computational cost is also important:
  Let ℓ_c(h, x) be the cost of computing h(x)
  Let E ℓ_c(h, X) be the average computational cost of h
Alternative goal: minimize E ℓ(h(X), Y) + η E ℓ_c(h, X)  (the expected loss of h plus η times its average cost)

Recall that exact inference in our setting is cast as an ILP:
  max s^T z, with s = F(x)^T w, z ∈ Z̄ ∩ Z^n
(figure: score vectors and the integer vs. fractional vertices of the outer polytope)

Nice Score Vectors and Low-Cost Hypotheses

A nice score vector s is one which hits an integer vertex
At test time, s ~ P(F(X)^T w) is a random variable that depends on X (filtered by the parameters w)
A low-cost hypothesis h_w is one which yields P(F(X)^T w) with large mass on nice score vectors
Idea: approximate the computational cost by the relaxation gap:
  E ℓ_c(h_w, X) ≈ E ℓ(h̄_w(X), h_w(X))
Most ILP solvers (branch-and-bound, Gomory's cuts) converge faster as this gap gets smaller

Balancing Accuracy and Runtime

Add an empirical relaxation gap term to our learning objective: (1/m) Σ_{t=1}^m ( r̄_t(w) − r_t(w) )
The learning problem becomes
  min_w  (λ/2) ‖w‖² + ((1 − η)/m) Σ_{t=1}^m r_t(w)  [exact LAI]  + (η/m) Σ_{t=1}^m r̄_t(w)  [relaxed LAI]
In the paper: a stochastic adaptation of the online subgradient algorithm [Ratliff et al., 2006]
Also a PAC bound with respect to the best exact learner:
  It measures the impact of the approximation on learning
  Previous bounds were in terms of the approximate learner [Kulesza and Pereira, 2007]
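
The mixed objective is easy to evaluate once both slacks are available. A sketch with hypothetical per-example solvers exact_slack (an ILP) and relaxed_slack (the LP), illustrating the objective that the stochastic subgradient algorithm in the extra slides optimizes:

```python
import numpy as np

def mixed_objective(w, examples, exact_slack, relaxed_slack, lam, eta):
    """lambda/2 ||w||^2 + (1 - eta)/m sum_t r_t(w) + eta/m sum_t rbar_t(w)."""
    m = len(examples)
    exact = sum(exact_slack(w, ex) for ex in examples) / m       # exact LAI term
    relaxed = sum(relaxed_slack(w, ex) for ex in examples) / m   # relaxed LAI term
    return 0.5 * lam * float(w @ w) + (1 - eta) * exact + eta * relaxed
```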

Experiments

Experiments

Dependency parsing for seven languages: Danish, Dutch, Portuguese, Slovene, Swedish, Turkish, English
Exact inference is efficient with an arc-factored model: find a maximal spanning tree [McDonald et al., 2005]
Beyond arc-factored: NP-hard [McDonald and Satta, 2007]
Our model: an ILP formulation with non-arc-factored features
  Models grandparent/sibling interactions
  Models valency and nonprojective arcs
  Only O(n³) variables and constraints
More details: [Martins et al., 2009]
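
For reference, arc-factored exact decoding is a maximum spanning arborescence (Chu-Liu-Edmonds). A minimal sketch, assuming networkx's maximum_spanning_arborescence is available; the score matrix here is a random stand-in for F(x)^T w:

```python
import numpy as np
import networkx as nx

def arc_factored_decode(scores):
    """scores[h, m] is the score of arc head h -> modifier m, with node 0 the
    dummy root and 1..n the words; returns the highest-scoring tree as arcs."""
    n = scores.shape[0] - 1
    G = nx.DiGraph()
    for h in range(n + 1):
        for m in range(1, n + 1):
            if h != m:
                G.add_edge(h, m, weight=float(scores[h, m]))
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return sorted(tree.edges())

rng = np.random.default_rng(0)
print(arc_factored_decode(rng.normal(size=(4, 4))))   # toy 3-word sentence
```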

Experiment #1: training with LP-relaxed LAI (η = 1)

Two different decoders at test time:
  Exact decoder (solve an ILP)
  Approximate decoder (solve the relaxed LP; if the solution is fractional, round it in polynomial time by finding a maximal spanning tree on the reweighted graph)
Strong baselines:
  [MP06] approximate second-order parser [McDonald and Pereira, 2006]
  [MDSX08] stacked parser [Martins et al., 2008]
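
One natural reading of the rounding step (a sketch, not the paper's exact recipe): if the LP solution is fractional, reuse the fractional arc values as weights of a reweighted graph and extract a tree with the MST decoder (the hypothetical arc_factored_decode sketched above):

```python
import numpy as np

def approximate_decode(z_frac, parts, n_words):
    """z_frac: LP solution over candidate arcs, aligned with `parts`.
    If it is integral, read the arcs off directly; otherwise round by running
    the maximum-spanning-arborescence decoder on the reweighted graph."""
    if np.all(np.isclose(z_frac, np.round(z_frac))):
        return [r for r, v in zip(parts, z_frac) if v > 0.5]
    weights = np.zeros((n_words + 1, n_words + 1))
    for (h, m), v in zip(parts, z_frac):
        weights[h, m] = v                       # reweight arcs by fractional values
    return arc_factored_decode(weights)         # hypothetical helper sketched earlier
```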

Our models:
  Arc-factored model
  Full model (with exact decoding): much better
  Approximate decoding did not considerably affect accuracy

Where are the baselines?
  [MP06] second-order parser [McDonald and Pereira, 2006]
  [MDSX08] stacked parser [Martins et al., 2008]

Experiment #2: does η really penalize computational cost?

Slovene dataset (with a reduced set of features)
As η increases, the model learns to avoid fractional solutions
Runtime does correlate with the relaxation gap
Yet the approximate decoder is significantly faster
Our full model: same order of magnitude as the baselines (… sec.)

Conclusion

Conclusions and Future Work

We studied the impact of LP-relaxed inference on max-margin learning
We established sufficient conditions for algorithmic separability
As a by-product: a new learning algorithm that penalizes computational cost
We demonstrated the effectiveness of these techniques in dependency parsing with non-arc-factored features

Future work:
  Polyhedral characterizations that guarantee tighter bounds
  Conditions for a vanishing relaxation gap in online learning
  Connections with regularization

References

Chu, Y. J. and Liu, T. H. (1965). On the shortest arborescence of a directed graph. Science Sinica, 14.
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. (2006). Online passive-aggressive algorithms. JMLR, 7.
Edmonds, J. (1967). Optimum branchings. Journal of Research of the National Bureau of Standards, 71B.
Eisner, J. (1996). Three new probabilistic models for dependency parsing: An exploration. In COLING.
Finley, T. and Joachims, T. (2008). Training structural SVMs when exact inference is intractable. In ICML.
Kulesza, A. and Pereira, F. (2007). Structured learning with approximate inference. In NIPS.
Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
Magnanti, T. and Wolsey, L. (1994). Optimal trees. Technical report, Massachusetts Institute of Technology, Operations Research Center.
Martins, A. F. T., Das, D., Smith, N. A., and Xing, E. P. (2008). Stacking dependency parsers. In EMNLP.
Martins, A. F. T., Smith, N. A., and Xing, E. P. (2009). Concise integer linear programming formulations for dependency parsing. In ACL-IJCNLP.
McDonald, R. and Satta, G. (2007). On the complexity of non-projective data-driven dependency parsing. In IWPT.
McDonald, R. T., Pereira, F., Ribarov, K., and Hajic, J. (2005). Non-projective dependency parsing using spanning tree algorithms. In HLT-EMNLP.
McDonald, R. T. and Pereira, F. C. N. (2006). Online learning of approximate dependency parsing algorithms. In EACL.
Ratliff, N., Bagnell, J. A., and Zinkevich, M. (2006). Subgradient methods for maximum margin structured learning. In ICML Workshop on Learning in Structured Output Spaces.
Smith, D. A. and Eisner, J. (2008). Dependency parsing by belief propagation. In EMNLP.
Taskar, B., Chatalbashev, V., and Koller, D. (2004). Learning associative Markov networks. In ICML.
Taskar, B., Guestrin, C., and Koller, D. (2003). Max-margin Markov networks. In NIPS.
Tsochantaridis, I., Hofmann, T., Joachims, T., and Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. In ICML.

Extra Slides: Balancing Accuracy and Runtime

Stochastic Online Subgradient Algorithm (based on [Ratliff et al., 2006])

Input: L, ⟨η_t⟩_t, learning rate sequence ⟨α_t⟩_t
Initialize w_1 ← 0
for t = 1 to m = |L| do
  Pick σ_t ~ Bernoulli(η_t)
  if σ_t = 1 then
    Solve relaxed LAI: ẑ_t ← arg max_{z'_t ∈ Z̄} w_t^T F_t (z'_t − z_t) + ℓ(z'_t; z_t)
  else
    Solve exact LAI: ẑ_t ← arg max_{z'_t ∈ Z} w_t^T F_t (z'_t − z_t) + ℓ(z'_t; z_t)
  end if
  Compute the subgradient g_t ← λ w_t + F_t (ẑ_t − z_t)
  Project and update: w_{t+1} ← Proj_W(w_t − α_t g_t)
end for
Return the averaged model ŵ ← (1/m) Σ_{t=1}^m w_t
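
A compact Python rendering of the algorithm (a sketch under assumptions: exact_lai and relaxed_lai are hypothetical callbacks returning the loss-augmented argmax ẑ_t over Z and Z̄ respectively, each example carries its feature matrix F and gold indicator z, and projection onto W is approximated here by a norm-ball projection):

```python
import numpy as np

def subgradient_train(examples, exact_lai, relaxed_lai, lam, etas, alphas, radius=10.0):
    """Stochastic online subgradient training that mixes exact and relaxed LAI."""
    rng = np.random.default_rng(0)
    d = examples[0]["F"].shape[0]
    w, w_sum = np.zeros(d), np.zeros(d)
    for t, ex in enumerate(examples, start=1):
        F_t, z_t = ex["F"], ex["z"]                    # features and gold parts
        lai = relaxed_lai if rng.random() < etas(t) else exact_lai   # sigma_t ~ Bernoulli(eta_t)
        z_hat = lai(w, ex)                             # loss-augmented argmax
        g = lam * w + F_t @ (z_hat - z_t)              # subgradient g_t
        w = w - alphas(t) * g                          # step with learning rate alpha_t
        norm = np.linalg.norm(w)                       # project back onto W (norm ball)
        if norm > radius:
            w *= radius / norm
        w_sum += w
    return w_sum / len(examples)                       # averaged model w-hat
```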

Generalization Bound

Proposition: Setting α_t = 1/(λt), λ = Θ(√((1 + log m)/m)), and η_t = Θ(t^{−1/2}), the bound
  E ℓ(h_ŵ(X), Y) ≤ (1/m) Σ_{t=1}^m r_t(w*) + L · o(m)/m + O( √((1/m) ln(1/δ)) )
holds with probability ≥ 1 − δ.

Remark: the bound is in terms of what could be achieved with the best exact learner; it measures the impact of the approximation on learning. Previous bounds were in terms of the approximate learner [Kulesza and Pereira, 2007].


More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Machine Learning 4771

Machine Learning 4771 Machine Learning 477 Instructor: Tony Jebara Topic 5 Generalization Guarantees VC-Dimension Nearest Neighbor Classification (infinite VC dimension) Structural Risk Minimization Support Vector Machines

More information

Section Notes 9. IP: Cutting Planes. Applied Math 121. Week of April 12, 2010

Section Notes 9. IP: Cutting Planes. Applied Math 121. Week of April 12, 2010 Section Notes 9 IP: Cutting Planes Applied Math 121 Week of April 12, 2010 Goals for the week understand what a strong formulations is. be familiar with the cutting planes algorithm and the types of cuts

More information

Posterior Regularization

Posterior Regularization Posterior Regularization 1 Introduction One of the key challenges in probabilistic structured learning, is the intractability of the posterior distribution, for fast inference. There are numerous methods

More information

Natural Language Processing CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Natural Language Processing CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Natural Language Processing CS 6840 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Statistical Parsing Define a probabilistic model of syntax P(T S):

More information

Sample and Computationally Efficient Active Learning. Maria-Florina Balcan Carnegie Mellon University

Sample and Computationally Efficient Active Learning. Maria-Florina Balcan Carnegie Mellon University Sample and Computationally Efficient Active Learning Maria-Florina Balcan Carnegie Mellon University Machine Learning is Shaping the World Highly successful discipline with lots of applications. Computational

More information

Probabilistic Graphical Models & Applications

Probabilistic Graphical Models & Applications Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with

More information

Does Better Inference mean Better Learning?

Does Better Inference mean Better Learning? Does Better Inference mean Better Learning? Andrew E. Gelfand, Rina Dechter & Alexander Ihler Department of Computer Science University of California, Irvine {agelfand,dechter,ihler}@ics.uci.edu Abstract

More information

ML4NLP Multiclass Classification

ML4NLP Multiclass Classification ML4NLP Multiclass Classification CS 590NLP Dan Goldwasser Purdue University dgoldwas@purdue.edu Social NLP Last week we discussed the speed-dates paper. Interesting perspective on NLP problems- Can we

More information

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good

More information

Classification Logistic Regression

Classification Logistic Regression Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham

More information

Maximum Margin Bayesian Networks

Maximum Margin Bayesian Networks Maximum Margin Bayesian Networks Yuhong Guo Department of Computing Science University of Alberta yuhong@cs.ualberta.ca Dana Wilkinson School of Computer Science University of Waterloo d3wilkin@cs.uwaterloo.ca

More information

Multi-Class Confidence Weighted Algorithms

Multi-Class Confidence Weighted Algorithms Multi-Class Confidence Weighted Algorithms Koby Crammer Department of Computer and Information Science University of Pennsylvania Philadelphia, PA 19104 Mark Dredze Alex Kulesza Human Language Technology

More information

lecture 6: modeling sequences (final part)

lecture 6: modeling sequences (final part) Natural Language Processing 1 lecture 6: modeling sequences (final part) Ivan Titov Institute for Logic, Language and Computation Outline After a recap: } Few more words about unsupervised estimation of

More information

arxiv: v1 [stat.ml] 13 Oct 2010

arxiv: v1 [stat.ml] 13 Oct 2010 Online Multiple Kernel Learning for Structured Prediction arxiv:00.770v [stat.ml] 3 Oct 00 André F. T. Martins Noah A. Smith Eric P. Xing Pedro M. Q. Aguiar Mário A. T. Figueiredo {afm,nasmith,epxing}@cs.cmu.edu

More information

Support Vector Machines and Bayes Regression

Support Vector Machines and Bayes Regression Statistical Techniques in Robotics (16-831, F11) Lecture #14 (Monday ctober 31th) Support Vector Machines and Bayes Regression Lecturer: Drew Bagnell Scribe: Carl Doersch 1 1 Linear SVMs We begin by considering

More information

Conditional Random Fields for Sequential Supervised Learning

Conditional Random Fields for Sequential Supervised Learning Conditional Random Fields for Sequential Supervised Learning Thomas G. Dietterich Adam Ashenfelter Department of Computer Science Oregon State University Corvallis, Oregon 97331 http://www.eecs.oregonstate.edu/~tgd

More information

Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing

Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for atural Language Processing Alexander M. Rush (based on joint work with Michael Collins, Tommi Jaakkola, Terry Koo, David Sontag) Uncertainty

More information

The Geometry of Statistical Machine Translation

The Geometry of Statistical Machine Translation The Geometry of Statistical Machine Translation Presented by Rory Waite 16th of December 2015 ntroduction Linear Models Convex Geometry The Minkowski Sum Projected MERT Conclusions ntroduction We provide

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

ECE 5424: Introduction to Machine Learning

ECE 5424: Introduction to Machine Learning ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple

More information

Logistic regression and linear classifiers COMS 4771

Logistic regression and linear classifiers COMS 4771 Logistic regression and linear classifiers COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random

More information