Polyhedral Outer Approximations with Application to Natural Language Parsing
André F. T. Martins¹,²  Noah A. Smith¹  Eric P. Xing¹
¹ Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
² Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal
ICML, Montréal, Québec, June 17th, 2009
In a Nutshell
Structured prediction: models interdependence among outputs [Lafferty et al., 2001; Taskar et al., 2003; Tsochantaridis et al., 2004].
Exact inference is only tractable with strong locality assumptions.
Often: better (non-local) models with approximate inference.
Sometimes outputs are globally constrained (matchings, permutations, spanning trees).
How does approximate inference affect learning? [Kulesza and Pereira, 2007; Finley and Joachims, 2008]
This paper: LP-relaxed inference and max-margin learning.
- Guarantees for algorithmic separability
- A new interpretation: balancing accuracy and computational cost
- Learning bounds via polyhedral characterizations
- Application: dependency parsing with rich features
Example: Dependency Parsing
Let $x$ be a sentence in $\mathcal{X} \subseteq \Sigma^*$.
Dependency trees: a syntactic representation that captures lexical relationships.
Let $\mathcal{Y}(x)$ be the set of legal dependency trees of $x$. Each $y \in \mathcal{Y}(x)$ is a spanning tree of the complete digraph linking all word pairs.
We want to learn a parser $h : \mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y} = \bigcup_{x \in \mathcal{X}} \mathcal{Y}(x)$.
This is a structured classification problem involving non-local interactions among output variables.
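Since each legal parse is a spanning tree (arborescence) of the complete digraph, decoding with per-arc scores reduces to a maximum spanning arborescence problem. The sketch below is illustrative only: it assumes the networkx library, and the arc scores are invented (in a trained parser each arc would get a learned score).

```python
# Toy arc-factored decoding: the best dependency tree is a maximum spanning
# arborescence of the complete digraph over the words (Chu-Liu/Edmonds).
# Requires networkx; the arc scores here are invented for illustration.
import networkx as nx

words = ["ROOT", "We", "parse", "sentences"]
good = {(0, 2), (2, 1), (2, 3)}          # arcs we pretend the model prefers

G = nx.DiGraph()
for h in range(len(words)):              # candidate head
    for m in range(1, len(words)):       # candidate modifier (ROOT takes no head)
        if h != m:
            G.add_edge(h, m, weight=1.0 if (h, m) in good else 0.1)

tree = nx.maximum_spanning_arborescence(G)
print(sorted(tree.edges()))              # [(0, 2), (2, 1), (2, 3)]
```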
Outline
1. Structured Classification and LP
2. Learning with LP-Relaxed Inference
3. Experiments
4. Conclusion
Part 1: Structured Classification and LP
Notation
Input set $\mathcal{X}$; output set $\mathcal{Y}$.
Labeled dataset $\mathcal{L} \triangleq \{(x_1, y_1), \ldots, (x_m, y_m)\} \subseteq \mathcal{X} \times \mathcal{Y}$, drawn i.i.d. from $P(X, Y)$.
Loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$.
Goal: learn $h : \mathcal{X} \to \mathcal{Y}$ with small expected loss $\mathbb{E}\,\ell(h(X); Y)$.
Here: linear classifiers $h_{\mathbf{w}}(x) = \arg\max_{y \in \mathcal{Y}} \mathbf{w}^\top \mathbf{f}(x, y)$, with hypothesis space $\mathcal{H} \triangleq \{h_{\mathbf{w}} \mid \mathbf{w} \in \mathcal{W}\}$ and $\mathcal{W} \subseteq \mathbb{R}^d$ convex.
Decomposition Into Parts
Assumption: each $y \in \mathcal{Y}$ decomposes into parts.
- Example: clique assignments in a Markov network
- Example: arcs in a dependency parse tree
- Example: $k$-tuples of arcs in a dependency tree (up to some $k$)
Define a set of parts $R$, and replace $y$ by an indicator vector $\mathbf{z} \triangleq (z_r)_{r \in R}$ with $z_r \triangleq \mathbb{I}(r \in y)$.
Assumption: features decompose over the parts,
$$\mathbf{f}(x, y) \triangleq \sum_{r \in y} \mathbf{f}_r(x) = \sum_{r \in R} z_r\, \mathbf{f}_r(x) = \mathbf{F}(x)\,\mathbf{z},$$
where $\mathbf{F}(x) \triangleq (\mathbf{f}_r(x))_{r \in R}$ is a feature matrix ($d$-by-$|R|$).
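As a quick numeric sanity check of the decomposition (a minimal sketch with toy dimensions, not from the paper): stacking the part feature vectors as columns of $\mathbf{F}(x)$ turns $\mathbf{f}(x, y)$ into a matrix-vector product.

```python
# Minimal numeric check of f(x, y) = F(x) z with toy dimensions.
import numpy as np

d, n_parts = 5, 4
rng = np.random.default_rng(0)
F = rng.random((d, n_parts))        # F(x): one column f_r(x) per part r
z = np.array([1, 0, 1, 0])          # indicator vector of the parts in y
f_xy = F @ z                        # equals the sum of the columns with z_r = 1
assert np.allclose(f_xy, F[:, 0] + F[:, 2])
```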
From the Output Set to a Polytope
[Figure: outputs $y \in \mathcal{Y}$ mapped to indicator vectors $\mathbf{z}$, whose convex hull is the polytope $Z$.]
Inference
Minkowski-Weyl theorem: there is a representation $Z = \{\mathbf{z} \in \mathbb{R}^n \mid \mathbf{A}\mathbf{z} \le \mathbf{b}\}$, where $\mathbf{A}$ is a $p$-by-$n$ matrix and $\mathbf{b}$ is a vector in $\mathbb{R}^p$ ($p, n \in \mathbb{N}$).
Inference becomes an LP [Taskar et al., 2004]:
$$\max_{y \in \mathcal{Y}} \mathbf{w}^\top \mathbf{f}(x, y) = \max_{\mathbf{z} \in V(Z)} \mathbf{w}^\top \mathbf{F}(x)\,\mathbf{z} = \max_{\mathbf{z} \in Z} \mathbf{s}^\top \mathbf{z}, \quad \text{with } \mathbf{s} = \mathbf{F}(x)^\top \mathbf{w}.$$
Are we done? No:
- Finding $\mathbf{A}$ and $\mathbf{b}$ is problem dependent.
- In general $p = O(\exp(n))$ (exponentially many constraints).
LP-Relaxed Inference and Outer Polytope
Often there is a concise representation of an outer polytope $\bar{Z} \supseteq Z$ such that $\bar{Z} \cap \mathbb{Z}^n = V(Z)$:
$$\max_{\mathbf{z} \in Z} \mathbf{s}^\top \mathbf{z} \;=\; \max_{\mathbf{z} \in \bar{Z},\, \mathbf{z} \in \mathbb{Z}^n} \mathbf{s}^\top \mathbf{z} \;\le\; \max_{\mathbf{z} \in \bar{Z}} \mathbf{s}^\top \mathbf{z}.$$
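As a concrete toy instance of relaxed inference, the LP over the outer polytope can be handed to an off-the-shelf solver. The sketch below assumes scipy and an invented one-row constraint system standing in for $\bar{Z}$; real applications use problem-specific constraint matrices (e.g. tree constraints).

```python
# LP-relaxed inference sketch: maximize s.z over Zbar = {0 <= z <= 1, Az <= b}.
# scipy's linprog minimizes, so the scores are negated. Toy data throughout.
import numpy as np
from scipy.optimize import linprog

s = np.array([2.0, 1.0, -0.5])             # s = F(x)^T w (invented)
A = np.array([[1.0, 1.0, 0.0]])             # stand-in constraint: z0 + z1 <= 1
b = np.array([1.0])

res = linprog(-s, A_ub=A, b_ub=b, bounds=[(0, 1)] * len(s))
print(res.x, -res.fun)                      # the optimum may be fractional in general
```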
Learning
What about learning?
Assumption: the loss function also decomposes over the parts.
Example: Hamming loss,
$$\ell(y'; y) \triangleq \sum_{r \in R} \big( \mathbb{I}(r \in y')\,\mathbb{I}(r \notin y) + \mathbb{I}(r \notin y')\,\mathbb{I}(r \in y) \big) = \|\mathbf{z}' - \mathbf{z}\|_1 = \mathbf{p}^\top \mathbf{z}' + q,$$
where $\mathbf{p} \triangleq \mathbf{1} - 2\mathbf{z}$ and $q \triangleq \mathbf{1}^\top \mathbf{z}$. The Hamming loss is an affine function of $\mathbf{z}'$.
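A two-line check of the affine form, on toy binary vectors (nothing paper-specific):

```python
# ||z' - z||_1 = p.z' + q with p = 1 - 2z and q = 1.z, for binary vectors.
import numpy as np

z  = np.array([1, 0, 1, 0])                 # gold parts
zp = np.array([1, 1, 0, 0])                 # competing parts
p, q = 1 - 2 * z, z.sum()
assert np.abs(zp - z).sum() == p @ zp + q   # both sides equal 2
```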
Structured SVM:
$$\min_{\mathbf{w}} \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{m}\sum_{t=1}^m r_t(\mathbf{w}),$$
where the slack $r_t(\mathbf{w})$ is the solution of the loss-augmented inference (LAI) problem
$$r_t(\mathbf{w}) = \max_{y'_t \in \mathcal{Y}} \underbrace{\mathbf{w}^\top \mathbf{f}(x_t, y'_t)}_{\mathbf{w}^\top \mathbf{F}_t \mathbf{z}'_t} - \underbrace{\mathbf{w}^\top \mathbf{f}(x_t, y_t)}_{\mathbf{w}^\top \mathbf{F}_t \mathbf{z}_t} + \underbrace{\ell(y'_t; y_t)}_{\mathbf{p}_t^\top \mathbf{z}'_t + q_t} = \Big( \max_{\mathbf{z}'_t \in Z} (\mathbf{F}_t^\top \mathbf{w} + \mathbf{p}_t)^\top \mathbf{z}'_t \Big) - (\mathbf{F}_t^\top \mathbf{w})^\top \mathbf{z}_t + q_t.$$
Also an LP.
Part 2: Learning with LP-Relaxed Inference
Exact and Relaxed Structured SVMs
Structured SVM (exact):
$$\min_{\mathbf{w}} \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{m}\sum_{t=1}^m r_t(\mathbf{w}),$$
where the slack $r_t(\mathbf{w})$ is the solution of the exact LAI problem
$$r_t(\mathbf{w}) = \Big( \max_{\mathbf{z}'_t \in Z} (\mathbf{F}_t^\top \mathbf{w} + \mathbf{p}_t)^\top \mathbf{z}'_t \Big) - (\mathbf{F}_t^\top \mathbf{w})^\top \mathbf{z}_t + q_t.$$
Relax. Structured SVM (relaxed):
$$\min_{\mathbf{w}} \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{m}\sum_{t=1}^m \tilde{r}_t(\mathbf{w}),$$
where the slack $\tilde{r}_t(\mathbf{w})$ is the solution of the relaxed LAI problem
$$\tilde{r}_t(\mathbf{w}) = \Big( \max_{\mathbf{z}'_t \in \bar{Z}} (\mathbf{F}_t^\top \mathbf{w} + \mathbf{p}_t)^\top \mathbf{z}'_t \Big) - (\mathbf{F}_t^\top \mathbf{w})^\top \mathbf{z}_t + q_t.$$
$\tilde{r}_t(\mathbf{w})$ upper bounds the exact slack; likewise, the loss evaluated at the relaxed prediction upper bounds the true loss.
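To make the upper-bounding concrete, here is a toy comparison of the two maximizations (dropping the constant $-(\mathbf{F}_t^\top \mathbf{w})^\top \mathbf{z}_t + q_t$ shared by both slacks): the exact LAI maximizes over the integer points of $\bar{Z}$, the relaxed LAI over all of $\bar{Z}$, so the latter can only be larger. The constraint system is invented; scipy is assumed.

```python
# Toy check that the relaxed slack upper-bounds the exact one.
import itertools
import numpy as np
from scipy.optimize import linprog

score = np.array([1.0, 1.0])                     # F_t^T w + p_t (invented)
A, b = np.array([[1.0, 1.0]]), np.array([1.5])   # Zbar: z0 + z1 <= 1.5

integer_pts = [np.array(z) for z in itertools.product([0, 1], repeat=2)
               if (A @ np.array(z) <= b).all()]
r_exact = max(score @ z for z in integer_pts)    # = 1.0
r_relax = -linprog(-score, A_ub=A, b_ub=b,
                   bounds=[(0, 1)] * 2).fun      # = 1.5, at a fractional vertex
assert r_relax >= r_exact
```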
Algorithmic Separability
LP-relaxed inference augments the output space: it makes up artificial negative examples. Equivalently: an approximate algorithm $A_{\mathbf{w}}$ which sometimes returns fractional solutions.
Some definitions [Kulesza and Pereira, 2007]:
- $\mathcal{L}$ is separable if $\exists \mathbf{w}$ s.t. $h_{\mathbf{w}}$ classifies all data correctly.
- $\mathcal{L}$ is algorithmically separable if $\exists \mathbf{w}$ s.t. $A_{\mathbf{w}}$ classifies all data correctly.
Remark: $\mathcal{L}$ algorithmically separable $\Rightarrow$ $\mathcal{L}$ separable.
Margin of separation: the maximal $\gamma$ such that $\forall (x_t, y_t) \in \mathcal{L}$ and $\forall y'_t \in \mathcal{Y}(x_t)$:
$$\mathbf{w}^\top \mathbf{f}(x_t, y_t) \ge \mathbf{w}^\top \mathbf{f}(x_t, y'_t) + \gamma\, \ell(y'_t; y_t), \quad \text{with } \|\mathbf{w}\| = 1.$$
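Equivalently, the margin is the smallest ratio of score gap to loss over all examples and competitors. The hypothetical helper below computes that minimum over explicitly enumerated competitors, which is only feasible for tiny toy problems; all names and numbers are illustrative.

```python
# Hypothetical margin computation: min over examples and competitors of
# (score gap) / (loss), with w normalized to unit length.
import numpy as np

def separation_margin(w, examples):
    """examples: list of (f_gold, [(f_alt, loss_alt), ...]) feature vectors."""
    w = w / np.linalg.norm(w)
    return min((w @ f_gold - w @ f_alt) / loss
               for f_gold, alts in examples
               for f_alt, loss in alts if loss > 0)

f_gold = np.array([1.0, 0.0])
alts = [(np.array([0.0, 1.0]), 2.0)]
print(separation_margin(np.array([3.0, -4.0]), [(f_gold, alts)]))  # 0.7
# The dataset is separable iff the returned margin is positive.
```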
A Sufficient Condition for Algorithmic Separability
[Figure: zooming around a fractional vertex of the outer polytope.]
Let $L$ be the radius of the largest loss ball centered at a fractional vertex which does not contain any integer vertex.
Let $\bar{L}$ ($\ge L$) be the radius of the largest loss ball centered at a fractional vertex which contains at most one integer vertex.
Assume binary-valued features, and let $N_f$ be the maximum number of active features per part.
Proposition: If $\mathcal{L}$ is separable with margin $\gamma \ge \bar{L}\, N_f$, then it is algorithmically separable.
In the paper: bounds for the nonseparable case, and bounds for $\epsilon$-approximate algorithms.
Balancing Accuracy and Runtime
Typical goal: minimize the expected loss $\mathbb{E}\,\ell(h(X); Y)$.
In structured prediction, computational cost is also important. Let $\ell_c(h, x)$ be the cost of computing $h(x)$, and let $\mathbb{E}\,\ell_c(h, X)$ be the average computational cost of $h$.
Alternative goal: minimize
$$\underbrace{\mathbb{E}\,\ell(h(X); Y)}_{\text{expected loss of } h} + \eta\, \underbrace{\mathbb{E}\,\ell_c(h, X)}_{\text{average cost of } h}.$$
Recall that exact inference in our setting is cast as an ILP:
$$\max \mathbf{s}^\top \mathbf{z} \quad \text{with } \mathbf{s} = \mathbf{F}(x)^\top \mathbf{w}, \;\; \mathbf{z} \in \bar{Z} \cap \mathbb{Z}^n.$$
[Figure: score vectors for which the LP optimum over $\bar{Z}$ is attained at an integer vs. a fractional vertex.]
Nice Score Vectors and Low-Cost Hypotheses
A nice score vector $\mathbf{s}$ is one which hits an integer vertex.
At test time, $\mathbf{s} \sim P(\mathbf{F}(X)^\top \mathbf{w})$ is a random variable that depends on $X$ (filtered by the parameters $\mathbf{w}$).
A low-cost hypothesis $h_{\mathbf{w}}$ is one which yields $P(\mathbf{F}(X)^\top \mathbf{w})$ with large mass on nice score vectors.
Idea: approximate the computational cost by the relaxation gap,
$$\mathbb{E}\,\ell_c(h_{\mathbf{w}}, X) \approx \mathbb{E}\,\ell(\tilde{h}_{\mathbf{w}}(X); h_{\mathbf{w}}(X)),$$
where $\tilde{h}_{\mathbf{w}}$ denotes the relaxed decoder. Most ILP solvers (branch-and-bound, Gomory's cuts) converge faster when this gap is smaller.
Balancing Accuracy and Runtime
Add an empirical relaxation gap term to the learning objective: $\frac{1}{m} \sum_{t=1}^m \big( \tilde{r}_t(\mathbf{w}) - r_t(\mathbf{w}) \big)$.
The learning problem becomes
$$\min_{\mathbf{w}} \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1-\eta}{m} \underbrace{\sum_{t=1}^m r_t(\mathbf{w})}_{\text{Exact LAI}} + \frac{\eta}{m} \underbrace{\sum_{t=1}^m \tilde{r}_t(\mathbf{w})}_{\text{Relaxed LAI}}.$$
In the paper: a stochastic adaptation of the online subgradient algorithm [Ratliff et al., 2006] (see the extra slides below), and a PAC bound with respect to the best exact learner. The bound measures the impact of the approximation on learning; previous bounds were in terms of the approximate learner [Kulesza and Pereira, 2007].
Part 3: Experiments
Experiments
Dependency parsing for seven languages: Danish, Dutch, Portuguese, Slovene, Swedish, Turkish, English.
Exact inference is efficient with an arc-factored model: find a maximum spanning tree [McDonald et al., 2005]. Beyond that, inference is NP-hard [McDonald and Satta, 2007].
Our model: an ILP formulation with non-arc-factored features.
- Models grandparent/sibling interactions
- Models valency and nonprojective arcs
- Only $O(n^3)$ variables and constraints
More details: [Martins et al., 2009].
Experiment #1: training with LP-relaxed LAI ($\eta = 1$).
Two different decoders at test time:
- Exact decoder (solve an ILP)
- Approximate decoder (solve the relaxed LP; if the solution is fractional, round it in polynomial time by finding a maximum spanning tree on the reweighted graph — sketched below)
Strong baselines:
- [MP06] approximate second-order parser [McDonald and Pereira, 2006]
- [MDSX08] stacked parser [Martins et al., 2008]
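A sketch of that rounding step (networkx assumed; the fractional arc values are invented): reweight the digraph by the LP solution and extract a maximum spanning arborescence, which is guaranteed to be a legal tree.

```python
# Rounding a fractional LP solution: use the fractional arc values as weights
# and take a maximum spanning arborescence (a legal dependency tree).
import networkx as nx

z_frac = {(0, 2): 0.9, (2, 1): 0.6, (1, 3): 0.5,   # fractional arc values
          (2, 3): 0.5, (0, 1): 0.3, (3, 2): 0.1}   # (invented for illustration)

G = nx.DiGraph()
for (head, mod), value in z_frac.items():
    G.add_edge(head, mod, weight=value)

tree = nx.maximum_spanning_arborescence(G)
print(sorted(tree.edges()))                        # an integral, legal tree
```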
Our models:
- Arc-factored
- Full model (with exact decoding): much better
- Approximate decoding did not considerably affect accuracy
Where are the baselines?
- [MP06] second-order parser [McDonald and Pereira, 2006]
- [MDSX08] stacked parser [Martins et al., 2008]
Experiment #2: does $\eta$ really penalize computational cost?
Slovene dataset (with a reduced set of features).
- As $\eta$ increases, the model learns to avoid fractional solutions.
- Runtime does correlate with the relaxation gap.
- Yet the approximate decoder is significantly faster.
- Our full model: same order of magnitude as the baselines (… sec.).
Part 4: Conclusion
Conclusions and Future Work
- We studied the impact of LP-relaxed inference on max-margin learning.
- We established sufficient conditions for algorithmic separability.
- As a by-product: a new learning algorithm that penalizes computational cost.
- We demonstrated the effectiveness of these techniques in dependency parsing with non-arc-factored features.
Future work:
- Polyhedral characterizations that guarantee tighter bounds
- Conditions for a vanishing relaxation gap in online learning
- Connections with regularization
References
- Chu, Y. J. and Liu, T. H. (1965). On the shortest arborescence of a directed graph. Scientia Sinica, 14.
- Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. (2006). Online passive-aggressive algorithms. JMLR, 7.
- Edmonds, J. (1967). Optimum branchings. Journal of Research of the National Bureau of Standards, 71B.
- Eisner, J. (1996). Three new probabilistic models for dependency parsing: An exploration. In COLING.
- Finley, T. and Joachims, T. (2008). Training structural SVMs when exact inference is intractable. In ICML.
- Kulesza, A. and Pereira, F. (2007). Structured learning with approximate inference. In NIPS.
- Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
- Magnanti, T. and Wolsey, L. (1994). Optimal trees. Technical report, Massachusetts Institute of Technology, Operations Research Center.
- Martins, A. F. T., Das, D., Smith, N. A., and Xing, E. P. (2008). Stacking dependency parsers. In EMNLP.
- Martins, A. F. T., Smith, N. A., and Xing, E. P. (2009). Concise integer linear programming formulations for dependency parsing. In ACL-IJCNLP.
- McDonald, R. and Satta, G. (2007). On the complexity of non-projective data-driven dependency parsing. In IWPT.
- McDonald, R. T., Pereira, F., Ribarov, K., and Hajic, J. (2005). Non-projective dependency parsing using spanning tree algorithms. In HLT-EMNLP.
- McDonald, R. T. and Pereira, F. C. N. (2006). Online learning of approximate dependency parsing algorithms. In EACL.
- Ratliff, N., Bagnell, J., and Zinkevich, M. (2006). Subgradient methods for maximum margin structured learning. In ICML Workshop on Learning in Structured Output Spaces.
- Smith, D. A. and Eisner, J. (2008). Dependency parsing by belief propagation. In EMNLP.
- Taskar, B., Chatalbashev, V., and Koller, D. (2004). Learning associative Markov networks. In ICML.
- Taskar, B., Guestrin, C., and Koller, D. (2003). Max-margin Markov networks. In NIPS.
- Tsochantaridis, I., Hofmann, T., Joachims, T., and Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. In ICML.
Extra Slides: Balancing Accuracy and Runtime
Stochastic Online Subgradient Algorithm (based on [Ratliff et al., 2006]):
Input: $\mathcal{L}$, $\langle \eta_t \rangle_t$, learning rate sequence $\langle \alpha_t \rangle_t$
Initialize $\mathbf{w}_1 \leftarrow \mathbf{0}$
for $t = 1$ to $m = |\mathcal{L}|$ do
  Pick $\sigma_t \sim \mathrm{Bernoulli}(\eta_t)$
  if $\sigma_t = 1$ then solve relaxed LAI, $\hat{\mathbf{z}}_t \leftarrow \arg\max_{\tilde{\mathbf{z}}_t \in \bar{Z}} \mathbf{w}_t^\top \mathbf{F}_t (\tilde{\mathbf{z}}_t - \mathbf{z}_t) + \ell(\tilde{\mathbf{z}}_t; \mathbf{z}_t)$
  else solve exact LAI, $\hat{\mathbf{z}}_t \leftarrow \arg\max_{\mathbf{z}'_t \in Z} \mathbf{w}_t^\top \mathbf{F}_t (\mathbf{z}'_t - \mathbf{z}_t) + \ell(\mathbf{z}'_t; \mathbf{z}_t)$
  end if
  Compute the subgradient $\mathbf{g}_t \leftarrow \lambda \mathbf{w}_t + \mathbf{F}_t (\hat{\mathbf{z}}_t - \mathbf{z}_t)$
  Project and update $\mathbf{w}_{t+1} \leftarrow \mathrm{Proj}_{\mathcal{W}}(\mathbf{w}_t - \alpha_t \mathbf{g}_t)$
end for
Return the averaged model $\hat{\mathbf{w}} \leftarrow \frac{1}{m} \sum_{t=1}^m \mathbf{w}_t$.
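A compact Python rendering of this loop (a sketch: the two LAI oracles `lai_exact` and `lai_relaxed` are assumed given, e.g. an ILP solver and an LP solver, and the projection onto $\mathcal{W}$ is omitted):

```python
# Stochastic online subgradient trainer, following the pseudocode above.
import numpy as np

def train(data, eta, alpha, lam, lai_exact, lai_relaxed, dim, seed=0):
    """data: list of (F_t, z_t); eta, alpha: sequences indexed by t."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for t, (F, z) in enumerate(data):
        if rng.random() < eta[t]:              # sigma_t ~ Bernoulli(eta_t)
            z_hat = lai_relaxed(F.T @ w, z)    # argmax over the outer polytope
        else:
            z_hat = lai_exact(F.T @ w, z)      # argmax over the integer points
        g = lam * w + F @ (z_hat - z)          # subgradient of the t-th term
        w = w - alpha[t] * g                   # projection onto W omitted here
        w_sum += w
    return w_sum / len(data)                   # averaged model w-hat
```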
Extra Slides: Generalization Bound
Proposition: Setting $\alpha_t = 1/(\lambda t)$, $\lambda = \Theta\big(\sqrt{(1 + \log m)/m}\big)$, and $\eta_t = \Theta(t^{-1/2})$,
$$\mathbb{E}\,\ell(h_{\hat{\mathbf{w}}}(X); Y) \;\le\; \frac{1}{m} \sum_{t=1}^m r_t(\mathbf{w}^*) + L\,\frac{o(m)}{m} + O\Big( \sqrt{\tfrac{1}{m} \ln \tfrac{1}{\delta}} \Big)$$
holds with probability at least $1 - \delta$.
Remark: the bound is in terms of what could be achieved with the best exact learner, $\mathbf{w}^*$; it measures the impact of the approximation on learning. Previous bounds were in terms of the approximate learner [Kulesza and Pereira, 2007].