Lagrangian Relaxation Algorithms for Inference in Natural Language Processing
1 Lagrangian Relaxation Algorithms for Inference in Natural Language Processing Alexander M. Rush and Michael Collins (based on joint work with Yin-Wen Chang, Tommi Jaakkola, Terry Koo, Roi Reichart, David Sontag)
2 Decoding in NLP
focus: structured prediction for natural language processing
decoding as a combinatorial optimization problem: y* = argmax_{y ∈ Y} f(y), where f is a scoring function and Y is a set of structures
for some problems, simple combinatorial algorithms suffice: dynamic programming, minimum spanning tree, min cut
3 Structured prediction: Tagging — sentence "United flies some large jet" with tag sequence United/N flies/V some/D large/A jet/N
4 Structured prediction: Parsing — parse tree for "United flies some large jet": S → NP VP; NP → N (United); VP → V (flies) NP; NP → D A N (some large jet)
5 Decoding complexity
issue: simple combinatorial algorithms do not scale to richer models for y* = argmax_{y ∈ Y} f(y)
need decoding algorithms for complex natural language tasks
motivation: richer model structure often leads to improved accuracy, but exact decoding for complex models tends to be intractable
6 Structured prediction: Phrase-based translation — das muss unsere sorge gleichermaßen sein → this must also be our concern
7 Structured prediction: Word alignment the ugly dog has red fur le chien laid a fourrure rouge
8 Decoding tasks
high complexity: combined parsing and part-of-speech tagging (Rush et al., 2010); loopy HMM part-of-speech tagging; syntactic machine translation (Rush and Collins, 2011)
NP-hard: symmetric HMM alignment (DeNero and Macherey, 2011); phrase-based translation (Chang and Collins, 2011); higher-order non-projective dependency parsing (Koo et al., 2010)
in practice: approximate decoding methods (coarse-to-fine, beam search, cube pruning, Gibbs sampling, belief propagation) or approximate models (mean field, variational models)
9 Lagrangian relaxation
a general technique for constructing decoding algorithms
solve complicated models y* = argmax_{y ∈ Y} f(y) by decomposing into smaller problems
upshot: can utilize a toolbox of combinatorial algorithms: dynamic programming, minimum spanning tree, shortest path, min cut, ...
10 Lagrangian relaxation algorithms
Simple — uses basic combinatorial algorithms
Efficient — faster than solving the exact decoding problem
Strong guarantees — gives a certificate of optimality when exact; direct connections to linear programming relaxations
11 MAP problem in Markov random fields
given: binary variables x_1, ..., x_n
goal: MAP problem argmax_{x_1...x_n} Σ_{(i,j) ∈ E} f_{i,j}(x_i, x_j), where each f_{i,j}(x_i, x_j) is a local potential for variables x_i, x_j
12 Dual decomposition for MRFs (Komodakis et al., 2010)
goal: argmax_{x_1...x_n} Σ_{(i,j) ∈ E} f_{i,j}(x_i, x_j)
equivalent formulation over two trees T_1, T_2: argmax_{x_1...x_n, y_1...y_n} Σ_{(i,j) ∈ T_1} f_{i,j}(x_i, x_j) + Σ_{(i,j) ∈ T_2} f_{i,j}(y_i, y_j) such that x_i = y_i for i = 1 ... n
Lagrangian: L(u, x, y) = Σ_{(i,j) ∈ T_1} f_{i,j}(x_i, x_j) + Σ_{(i,j) ∈ T_2} f_{i,j}(y_i, y_j) + Σ_i u_i (x_i − y_i)
13 Related work belief propagation using combinatorial algorithms (Duchi et al., 2007; Smith and Eisner, 2008) factored A* search (Klein and Manning, 2003)
14 Tutorial outline
1. worked algorithm for combined parsing and tagging
2. important theorems and formal derivation
3. more examples from parsing and alignment
4. relationship to linear programming relaxations
5. practical considerations for implementation
6. further example from machine translation
15 1. Worked example aim: walk through a Lagrangian relaxation algorithm for combined parsing and part-of-speech tagging introduce formal notation for parsing and tagging give assumptions necessary for decoding step through a run of the Lagrangian relaxation algorithm
16 Combined parsing and part-of-speech tagging
example tree: S → NP VP; NP → N (United); VP → V (flies) NP; NP → D A N (some large jet)
goal: find the parse tree that optimizes score(S → NP VP) + score(VP → V NP) + score(N → V) + score(N → United) + ... (parse rule scores plus tag transition and emission scores)
17 notation: Constituency parsing
Y is the set of constituency parses for the input; y ∈ Y is a valid parse; f(y) scores a parse tree
goal: argmax_{y ∈ Y} f(y)
example: a context-free grammar for constituency parsing (tree over "United flies some large jet")
18 notation: Part-of-speech tagging
Z is the set of tag sequences for the input; z ∈ Z is a valid tag sequence; g(z) scores a tag sequence
goal: argmax_{z ∈ Z} g(z)
example: an HMM for part-of-speech tagging (United/N flies/V some/D large/A jet/N)
19 Identifying tags
notation: identify the tag labels selected by each model
y(i, t) = 1 when parse y selects tag t at position i
z(i, t) = 1 when tag sequence z selects tag t at position i
example: a parse y and tagging z with y(4, A) = 1 and z(4, A) = 1 (both select tag A for "large" at position 4)
20 Combined optimization
goal: argmax_{y ∈ Y, z ∈ Z} f(y) + g(z) such that y(i, t) = z(i, t) for all i = 1 ... n, t ∈ T
i.e. find the best parse and tagging pair that agree on tag labels
equivalent formulation: argmax_{y ∈ Y} f(y) + g(l(y)), where l : Y → Z extracts the tag sequence from a parse tree
21 Exact method: Dynamic programming intersection
can solve by decoding the product of the two models
example: the parsing model is a context-free grammar and the tagging model is a first-order HMM; can solve as a CFG and finite-state automaton intersection
replace VP → V NP with tag-annotated rules such as VP_{N,V} → V_{N,V} NP_{V,N}
22 Intersected parsing and tagging complexity
let G be the number of grammar non-terminals
parsing with the CFG requires O(G³ n³) time with rules like VP → V NP
with intersection: O(G³ n³ T³) time with rules like VP_{N,V} → V_{N,V} NP_{V,N}
becomes O(G³ n³ T⁶) time for a second-order HMM
23 Parsing assumption
assumption: optimization with u can be solved efficiently: argmax_{y ∈ Y} f(y) + Σ_{i,t} u(i, t) y(i, t)
example: a CFG with rule scoring function h, f(y) = Σ_{X → Y Z ∈ y} h(X → Y Z) + Σ_{(i, X) ∈ y} h(X → w_i)
then argmax_{y ∈ Y} f(y) + Σ_{i,t} u(i, t) y(i, t) = argmax_{y ∈ Y} Σ_{X → Y Z ∈ y} h(X → Y Z) + Σ_{(i, X) ∈ y} (h(X → w_i) + u(i, X))
24 Tagging assumption
assumption: optimization with u can be solved efficiently: argmax_{z ∈ Z} g(z) − Σ_{i,t} u(i, t) z(i, t)
example: an HMM with scores for transitions T and observations O, g(z) = Σ_{t → t′ ∈ z} T(t → t′) + Σ_{(i, t) ∈ z} O(t → w_i)
then argmax_{z ∈ Z} g(z) − Σ_{i,t} u(i, t) z(i, t) = argmax_{z ∈ Z} Σ_{t → t′ ∈ z} T(t → t′) + Σ_{(i, t) ∈ z} (O(t → w_i) − u(i, t))
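The tagging assumption is easy to satisfy for an HMM because the penalties only shift the emission scores. A minimal runnable sketch of this (the toy tag set and scores are invented for illustration, not taken from the tutorial):

```python
# Sketch of the tagging subproblem above: a standard first-order Viterbi
# decoder in which the penalty u(i, t) is subtracted from the emission score
# of tag t at position i. The toy tag set and scores below are invented.

def viterbi_with_penalties(words, tags, trans, emit, u):
    """argmax_z g(z) - sum_{i,t} u(i,t) z(i,t) for an HMM-style g."""
    n = len(words)
    # chart[i][t]: best score of a sequence for words[0..i] ending in tag t
    chart = [{t: emit[(t, words[0])] - u.get((0, t), 0.0) for t in tags}]
    back = [{}]
    for i in range(1, n):
        chart.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda s: chart[i - 1][s] + trans[(s, t)])
            chart[i][t] = (chart[i - 1][prev] + trans[(prev, t)]
                           + emit[(t, words[i])] - u.get((i, t), 0.0))
            back[i][t] = prev
    last = max(tags, key=lambda t: chart[n - 1][t])
    z = [last]                        # follow back-pointers to recover z
    for i in range(n - 1, 0, -1):
        z.append(back[i][z[-1]])
    return list(reversed(z))

tags = ["N", "V"]
words = ["a", "b"]
trans = {("N", "N"): 0.0, ("N", "V"): 1.0, ("V", "N"): 0.5, ("V", "V"): 0.0}
emit = {("N", "a"): 1.0, ("V", "a"): 0.0, ("N", "b"): 0.0, ("V", "b"): 1.5}
print(viterbi_with_penalties(words, tags, trans, emit, {}))
print(viterbi_with_penalties(words, tags, trans, emit, {(1, "V"): 10.0}))
```

With no penalties this is plain Viterbi and selects N V; a large penalty u(1, V) steers position 1 away from V, giving N N — exactly the mechanism the relaxation algorithm uses to push the tagger toward agreement.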
25 Lagrangian relaxation algorithm
Set u^(1)(i, t) = 0 for all i, t ∈ T
For k = 1 to K:
y^(k) ← argmax_{y ∈ Y} f(y) + Σ_{i,t} u^(k)(i, t) y(i, t)  [Parsing]
z^(k) ← argmax_{z ∈ Z} g(z) − Σ_{i,t} u^(k)(i, t) z(i, t)  [Tagging]
If y^(k)(i, t) = z^(k)(i, t) for all i, t: Return (y^(k), z^(k))
Else u^(k+1)(i, t) ← u^(k)(i, t) − α_k (y^(k)(i, t) − z^(k)(i, t))
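The loop above can be sketched end-to-end. In this illustrative toy both subproblems are solved by brute-force argmax over small explicit sets; in the real algorithm these calls would be CKY parsing and Viterbi tagging, and all scores here are invented:

```python
# A minimal runnable sketch of the Lagrangian relaxation loop above.
# The two "combinatorial" subproblems are solved by brute-force argmax over
# tiny explicit sets Y and Z of tag sequences; in the real algorithm these
# calls would be CKY parsing and Viterbi tagging. All scores are invented.
from collections import defaultdict

def lagrangian_relaxation(Y, Z, f, g, n, K=50):
    u = defaultdict(float)              # penalties u(i, t), start at zero
    for k in range(1, K + 1):
        alpha = 1.0 / k                 # valid rate: -> 0 and sums to infinity
        y = max(Y, key=lambda y: f[y] + sum(u[(i, y[i])] for i in range(n)))
        z = max(Z, key=lambda z: g[z] - sum(u[(i, z[i])] for i in range(n)))
        if y == z:                      # dual solutions agree:
            return y, z, True           # certificate of optimality
        for i in range(n):              # subgradient update on disagreements
            u[(i, y[i])] -= alpha
            u[(i, z[i])] += alpha
    return y, z, False                  # did not converge within K iterations

# toy instance: each structure is represented by its tag sequence
Y = [("N", "V"), ("A", "N")]
Z = [("N", "V"), ("D", "N")]
f = {("N", "V"): 1.0, ("A", "N"): 2.0}
g = {("N", "V"): 2.0, ("D", "N"): 2.5}
y, z, ok = lagrangian_relaxation(Y, Z, f, g, n=2)
```

Here the separate optima (A, N) and (D, N) disagree; after one round of penalty updates both subproblems select (N, V), which is the best agreeing pair.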
26-38 Worked run (slides 26-38)
key: f(y) CFG, g(z) HMM; Y parse trees, Z taggings; y(i, t) = 1 if y contains tag t at position i
iteration 1: with u(i, t) = 0 for all i, t, CKY parsing returns a tree whose tags are United/A flies/N some/D large/A jet/V, while Viterbi decoding returns the tagging United/N flies/V some/D large/A jet/N; the models disagree at positions 1, 2, and 5, giving the updates u(1, A) −1, u(1, N) +1, u(2, N) −1, u(2, V) +1, u(5, V) −1, u(5, N) +1
iteration 2: each model re-decodes with the updated penalties; the remaining disagreement gives the further updates u(5, V) −1, u(5, N) +1
after these updates both models select the tags N V D A N, the dual solutions agree, and the algorithm has converged: y* = argmax_{y ∈ Y} f(y) + g(l(y))
39 Main theorem
theorem: if at any iteration y^(k)(i, t) = z^(k)(i, t) for all i, t ∈ T, then (y^(k), z^(k)) is the global optimum
proof: focus of the next section
40 Convergence — [figure: % of examples converged vs. number of iterations (≤1, ≤2, ≤3, ≤4, ≤10, ≤20, ≤50)]
41 2. Formal properties
aim: formal derivation of the algorithm given in the previous section
derive the Lagrangian dual
prove three properties: upper bound, convergence, optimality
describe the subgradient method
42 Lagrangian
goal: argmax_{y ∈ Y, z ∈ Z} f(y) + g(z) such that y(i, t) = z(i, t)
Lagrangian: L(u, y, z) = f(y) + g(z) + Σ_{i,t} u(i, t) (y(i, t) − z(i, t))
redistribute terms: L(u, y, z) = (f(y) + Σ_{i,t} u(i, t) y(i, t)) + (g(z) − Σ_{i,t} u(i, t) z(i, t))
43 Lagrangian dual
Lagrangian: L(u, y, z) = (f(y) + Σ_{i,t} u(i, t) y(i, t)) + (g(z) − Σ_{i,t} u(i, t) z(i, t))
Lagrangian dual: L(u) = max_{y ∈ Y, z ∈ Z} L(u, y, z) = max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t)) + max_{z ∈ Z} (g(z) − Σ_{i,t} u(i, t) z(i, t))
44 Theorem 1. Upper bound
define: (y*, z*) is the optimal combined parsing and tagging solution, with y*(i, t) = z*(i, t) for all i, t
theorem: for any value of u, L(u) ≥ f(y*) + g(z*)
L(u) provides an upper bound on the score of the optimal solution
note: the upper bound may be useful as input to branch-and-bound or A* search
45 Theorem 1. Upper bound (proof)
theorem: for any value of u, L(u) ≥ f(y*) + g(z*)
proof:
L(u) = max_{y ∈ Y, z ∈ Z} L(u, y, z)  (1)
≥ max_{y ∈ Y, z ∈ Z: y = z} L(u, y, z)  (2)
= max_{y ∈ Y, z ∈ Z: y = z} f(y) + g(z)  (3)
= f(y*) + g(z*)  (4)
46 Formal algorithm (reminder)
Set u^(1)(i, t) = 0 for all i, t ∈ T
For k = 1 to K:
y^(k) ← argmax_{y ∈ Y} f(y) + Σ_{i,t} u^(k)(i, t) y(i, t)  [Parsing]
z^(k) ← argmax_{z ∈ Z} g(z) − Σ_{i,t} u^(k)(i, t) z(i, t)  [Tagging]
If y^(k)(i, t) = z^(k)(i, t) for all i, t: Return (y^(k), z^(k))
Else u^(k+1)(i, t) ← u^(k)(i, t) − α_k (y^(k)(i, t) − z^(k)(i, t))
47 Theorem 2. Convergence
notation: u^(k+1)(i, t) ← u^(k)(i, t) − α_k (y^(k)(i, t) − z^(k)(i, t)) is the update; u^(k) is the penalty vector at iteration k; α_k > 0 is the update rate at iteration k
theorem: for any sequence α_1, α_2, α_3, ... such that lim_{k → ∞} α_k = 0 and Σ_{k=1}^∞ α_k = ∞, we have lim_{k → ∞} L(u^(k)) = min_u L(u)
i.e. the algorithm converges to the tightest possible upper bound
proof: by subgradient convergence (next section)
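One concrete schedule satisfying both conditions is the harmonic rate (an illustrative choice, not the only valid one):

```latex
\alpha_k = \frac{c}{k}, \quad c > 0:
\qquad \lim_{k \to \infty} \alpha_k = 0,
\qquad \sum_{k=1}^{\infty} \alpha_k \;=\; c \sum_{k=1}^{\infty} \frac{1}{k} \;=\; \infty .
```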
48 Dual solutions
define: for any value of u, y_u = argmax_{y ∈ Y} f(y) + Σ_{i,t} u(i, t) y(i, t) and z_u = argmax_{z ∈ Z} g(z) − Σ_{i,t} u(i, t) z(i, t)
y_u and z_u are the dual solutions for a given u
49 Theorem 3. Optimality
theorem: if there exists u such that y_u(i, t) = z_u(i, t) for all i, t, then f(y_u) + g(z_u) = f(y*) + g(z*)
i.e. if the dual solutions agree, we have an optimal solution (y_u, z_u)
50 Theorem 3. Optimality (proof)
theorem: if there exists u such that y_u(i, t) = z_u(i, t) for all i, t, then f(y_u) + g(z_u) = f(y*) + g(z*)
proof: by the definitions of y_u and z_u,
L(u) = f(y_u) + g(z_u) + Σ_{i,t} u(i, t) (y_u(i, t) − z_u(i, t)) = f(y_u) + g(z_u)
since L(u) ≥ f(y*) + g(z*) for all values of u, f(y_u) + g(z_u) ≥ f(y*) + g(z*)
but y* and z* are optimal, so f(y_u) + g(z_u) ≤ f(y*) + g(z*)
51 Dual optimization
Lagrangian dual: L(u) = max_{y ∈ Y, z ∈ Z} L(u, y, z) = max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t)) + max_{z ∈ Z} (g(z) − Σ_{i,t} u(i, t) z(i, t))
goal: the dual problem is to find the tightest upper bound, min_u L(u)
52 Dual subgradient
L(u) = max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t)) + max_{z ∈ Z} (g(z) − Σ_{i,t} u(i, t) z(i, t))
properties: L(u) is convex in u (no local minima); L(u) is not differentiable (because of the max operator)
handle non-differentiability by using subgradient descent
define: a subgradient of L(u) at u is a vector g_u such that for all v, L(v) ≥ L(u) + g_u · (v − u)
53 Subgradient algorithm
L(u) = max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t)) + max_{z ∈ Z} (g(z) − Σ_{i,t} u(i, t) z(i, t))
recall, y_u and z_u are the argmaxes of the two terms
subgradient: g_u(i, t) = y_u(i, t) − z_u(i, t)
subgradient descent: step against the subgradient, u′(i, t) = u(i, t) − α (y_u(i, t) − z_u(i, t))
guaranteed to find a minimum under the conditions given earlier for α
54 3. More examples aim: demonstrate similar algorithms that can be applied to other decoding applications context-free parsing combined with dependency parsing combined translation alignment
55 Combined constituency and dependency parsing (Rush et al., 2010) setup: assume separate models trained for constituency and dependency parsing problem: find constituency parse that maximizes the sum of the two models example: combine lexicalized CFG with second-order dependency parser
56 notation: Lexicalized constituency parsing
Y is the set of lexicalized constituency parses for the input; y ∈ Y is a valid parse; f(y) scores a parse tree
goal: argmax_{y ∈ Y} f(y)
example: a lexicalized context-free grammar: S(flies) → NP(United) VP(flies); VP(flies) → V NP(jet); NP(jet) → D A N (some large jet)
57 Dependency parsing define: Z is set of dependency parses for input z Z is a valid dependency parse g(z) scores a dependency parse example: * 0 United 1 flies 2 some 3 large 4 jet 5
58 Identifying dependencies
notation: identify the dependencies selected by each model
y(i, j) = 1 when word i modifies word j in constituency parse y
z(i, j) = 1 when word i modifies word j in dependency parse z
example: a constituency parse and a dependency parse with y(3, 5) = 1 and z(3, 5) = 1 ("some" modifies "jet")
59 Combined optimization
goal: argmax_{y ∈ Y, z ∈ Z} f(y) + g(z) such that y(i, j) = z(i, j) for all i = 1 ... n, j = 0 ... n
60-68 Worked run (slides 60-68)
key: f(y) CFG, g(z) dependency model; Y parse trees, Z dependency trees; y(i, j) = 1 if y contains dependency (i, j)
iteration 1: with u(i, j) = 0 for all i, j, the CKY parse and the dependency parse disagree on the dependencies (2, 3) and (5, 3); the updates u(2, 3) −1, u(5, 3) +1 push the two models toward agreement
after the update both models select the same dependencies, the dual solutions agree, and the algorithm has converged: y* = argmax_{y ∈ Y} f(y) + g(l(y)), where l(y) extracts the dependencies from the parse
69 Convergence — [figure: % of examples converged vs. number of iterations (≤1, ≤2, ≤3, ≤4, ≤10, ≤20, ≤50)]
70 Integrated Constituency and Dependency Parsing: Accuracy — [table: F1 scores for Collins (1997) Model 1, for fixed first-best dependencies from Koo (2008), and for dual decomposition]
71 Combined alignment (DeNero and Macherey, 2011) setup: assume separate models trained for English-to-French and French-to-English alignment problem: find an alignment that maximizes the score of both models example: HMM models for both directional alignments (assume correct alignment is one-to-one for simplicity)
72 English-to-French alignment
define: Y is the set of all possible English-to-French alignments; y ∈ Y is a valid alignment; f(y) scores an alignment
example: HMM alignment of "The ugly dog has red fur" to "Le chien laid a fourrure rouge"
73 French-to-English alignment
define: Z is the set of all possible French-to-English alignments; z ∈ Z is a valid alignment; g(z) scores an alignment
example: HMM alignment of "Le chien laid a fourrure rouge" to "The ugly dog has red fur"
74 Identifying word alignments
notation: identify the alignments selected by each model
y(i, j) = 1 when English-to-French alignment y aligns French word i with English word j
z(i, j) = 1 when French-to-English alignment z aligns French word i with English word j
example: two HMM alignment models with y(6, 5) = 1 and z(6, 5) = 1 ("rouge" aligned with "red")
75 Combined optimization
goal: argmax_{y ∈ Y, z ∈ Z} f(y) + g(z) such that y(i, j) = z(i, j) for all i = 1 ... n, j = 1 ... n
76-85 Worked run (slides 76-85)
key: f(y), g(z) HMM alignment models; Y English-to-French alignments, Z French-to-English alignments; y(i, j) = 1 if French word i aligns to English word j
iteration 1: with u(i, j) = 0 for all i, j, the two directional models disagree on how French words 2 and 3 ("chien", "laid") align to English words 2 and 3 ("ugly", "dog"); the updates u(3, 2) −1, u(2, 2) +1, u(2, 3) −1, u(3, 3) +1 push the models toward agreement
after the update the two models select the same alignment and the algorithm has converged
86 4. Linear programming aim: explore the connections between Lagrangian relaxation and linear programming basic optimization over the simplex formal properties of linear programming full example with fractional optimal solutions
87 Simplex
define: Δ_y ⊆ R^|Y| is the simplex over Y, where α ∈ Δ_y implies α_y ≥ 0 and Σ_y α_y = 1; α is a distribution over Y
Δ_z is the simplex over Z
δ_y : Y → Δ_y maps elements to vertices of the simplex
example: Y = {y_1, y_2, y_3}, with vertices δ_y(y_1) = (1, 0, 0), δ_y(y_2) = (0, 1, 0), δ_y(y_3) = (0, 0, 1)
88 Theorem 1. Simplex linear program
optimize over the simplex Δ_y instead of the discrete set Y
theorem: max_{y ∈ Y} f(y) = max_{α ∈ Δ_y} Σ_y α_y f(y)
proof: points in Y correspond to the extreme points of the simplex, {δ_y(y) : y ∈ Y}, and a linear program has an optimum at an extreme point
the proof shows that the best distribution chooses a single parse
89 Combined linear program
optimize over the simplices Δ_y and Δ_z instead of the discrete sets Y and Z
goal: max_{α ∈ Δ_y, β ∈ Δ_z} Σ_y α_y f(y) + Σ_z β_z g(z) such that for all i, t, Σ_y α_y y(i, t) = Σ_z β_z z(i, t)
note: the two distributions must match in the expectation of POS tags; the best distributions α*, β* are possibly no longer a single parse tree or tag sequence
90 Lagrangian
Lagrangian: M(u, α, β) = Σ_y α_y f(y) + Σ_z β_z g(z) + Σ_{i,t} u(i, t) (Σ_y α_y y(i, t) − Σ_z β_z z(i, t))
= (Σ_y α_y f(y) + Σ_{i,t} u(i, t) Σ_y α_y y(i, t)) + (Σ_z β_z g(z) − Σ_{i,t} u(i, t) Σ_z β_z z(i, t))
Lagrangian dual: M(u) = max_{α ∈ Δ_y, β ∈ Δ_z} M(u, α, β)
91 Theorem 2. Strong duality
define: α*, β* is the optimal assignment to α, β in the linear program
theorem: min_u M(u) = Σ_y α*_y f(y) + Σ_z β*_z g(z)
proof: by linear programming duality
92 Theorem 3. Dual relationship
theorem: for any value of u, M(u) = L(u)
note: solving the original Lagrangian dual also solves the dual of the linear program
93 Theorem 3. Dual relationship (proof sketch)
focus on the Y term in each Lagrangian:
L(u) = max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t)) + ...
M(u) = max_{α ∈ Δ_y} (Σ_y α_y f(y) + Σ_{i,t} u(i, t) Σ_y α_y y(i, t)) + ...
by Theorem 1, optimization over Y and over Δ_y have the same max; a similar argument for Z gives L(u) = M(u)
94 Summary
original primal objective: f(y) + g(z); original dual: L(u)
LP primal objective: Σ_y α_y f(y) + Σ_z β_z g(z); LP dual: M(u)
relationship between the LP dual, original dual, and LP primal objective: min_u M(u) = min_u L(u) = Σ_y α*_y f(y) + Σ_z β*_z g(z)
95 Concrete example
Y = {y_1, y_2, y_3}, Z = {z_1, z_2, z_3}, Δ_y ⊆ R³, Δ_z ⊆ R³
each y is a tree over the sentence "He is": y_1 tags it (a, a), y_2 tags it (b, b), y_3 tags it (c, c)
each z is a tag sequence for "He is": z_1 is (a, b), z_2 is (b, a), z_3 is (c, c)
96 Simple solution
choose: α^(1) = (0, 0, 1), i.e. the representation of y_3; β^(1) = (0, 0, 1), i.e. the representation of z_3
confirm: Σ_y α^(1)_y y(i, t) = Σ_z β^(1)_z z(i, t)
α^(1) and β^(1) satisfy the agreement constraint
97 Fractional solution
choose: α^(2) = (0.5, 0.5, 0), a combination of y_1 and y_2; β^(2) = (0.5, 0.5, 0), a combination of z_1 and z_2
confirm: Σ_y α^(2)_y y(i, t) = Σ_z β^(2)_z z(i, t)
α^(2) and β^(2) satisfy the agreement constraint, but are not integral
98 Optimal solution
weights: the choice of f and g determines the optimal solution; if (f, g) favors (α^(2), β^(2)), the optimal solution is fractional
example: f = (1, 1, 0) and g = (1, 1, 0), so f · α^(1) + g · β^(1) = 0 vs. f · α^(2) + g · β^(2) = 2
(α^(2), β^(2)) is optimal, even though it is fractional
summary: dual and LP primal optimal: min_u M(u) = min_u L(u) = Σ_y α^(2)_y f(y) + Σ_z β^(2)_z g(z) = 2; original primal optimal: f(y*) + g(z*) = 0
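The arithmetic in this example can be checked directly. Note the weight vectors below are reconstructed from the quoted values 0 and 2 (the transcription garbles them), so treat them as an assumption:

```python
# Numeric check of the fractional-solution example: the only agreeing
# integral pair (y3, z3) scores 0, while the fractional LP-feasible pair
# (alpha, beta) = ((.5, .5, 0), (.5, .5, 0)) scores 2. The weights are
# reconstructed to match the values quoted on the slide.
f = (1.0, 1.0, 0.0)          # f(y1), f(y2), f(y3)
g = (1.0, 1.0, 0.0)          # g(z1), g(z2), g(z3)
# tag vectors over positions (He, is): y's use aa/bb/cc, z's use ab/ba/cc
Y = [("a", "a"), ("b", "b"), ("c", "c")]
Z = [("a", "b"), ("b", "a"), ("c", "c")]

def expected_tags(dist, structs):
    """Expected tag indicators E[s(i, t)] under a distribution over structures."""
    e = {}
    for p, s in zip(dist, structs):
        for i, t in enumerate(s):
            e[(i, t)] = e.get((i, t), 0.0) + p
    return e

def value(alpha, beta):
    return sum(a * fy for a, fy in zip(alpha, f)) + \
           sum(b * gz for b, gz in zip(beta, g))

integral = ((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))    # picks y3 and z3
fractional = ((0.5, 0.5, 0.0), (0.5, 0.5, 0.0))  # mixes y1/y2 and z1/z2
# both satisfy the agreement constraint in expectation...
assert expected_tags(integral[0], Y) == expected_tags(integral[1], Z)
assert expected_tags(fractional[0], Y) == expected_tags(fractional[1], Z)
# ...but the fractional point scores higher, so the LP optimum is fractional
assert value(*integral) == 0.0 and value(*fractional) == 2.0
```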
99-107 Worked subgradient rounds (slides 99-107)
the dual solutions oscillate round by round: (y_3, z_2), (y_2, z_1), (y_1, z_1), (y_1, z_1), (y_2, z_2), (y_1, z_1), (y_2, z_2), (y_1, z_1), (y_2, z_2)
the dual value decreases toward the LP optimum: L(u^(1)) = L(u^(2)) = L(u^(3)) = 3.00, then the per-term values shrink (2.17 + 0.17, 2.08 + 0.08, ..., 2.03 + 0.03) as L(u^(k)) approaches 2
the dual solutions never agree, since the optimal LP solution is fractional
108 5. Practical issues
tracking the progress of the algorithm: know the current dual value and (possibly) a primal value
choice of update rate α_k: various strategies; success with a rate based on dual progress
lazy update of dual solutions: if updates are sparse, can avoid recomputing solutions from scratch
extracting solutions if the algorithm does not converge: take the best primal feasible solution, or average the solutions
109 Phrase-Based Translation
define: source-language sentence words x_1, ..., x_N; phrase translation p = (s, e, t); translation derivation y = p_1, ..., p_L
example: x_1 ... x_6 = das muss unsere sorge gleichermaßen sein
y = {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)}, yielding "this must also be our concern"
110 Phrase-Based Translation define: source-language sentence words x 1,..., x N phrase translation p = (s, e, t) translation derivation y = p 1,..., p L example: x 1 x 2 x 3 x 4 x 5 x 6 das muss unsere sorge gleichermaßen sein this must p 1 y = {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)}
111 Phrase-Based Translation define: source-language sentence words x 1,..., x N phrase translation p = (s, e, t) translation derivation y = p 1,..., p L example: x 1 x 2 x 3 x 4 x 5 x 6 das muss unsere sorge gleichermaßen sein this must also p 1 p 2 y = {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)}
112 Phrase-Based Translation define: source-language sentence words x 1,..., x N phrase translation p = (s, e, t) translation derivation y = p 1,..., p L example: x 1 x 2 x 3 x 4 x 5 x 6 das muss unsere sorge gleichermaßen sein this must also be p 1 p 2 p 3 y = {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)}
113 Phrase-Based Translation define: source-language sentence words x 1,..., x N phrase translation p = (s, e, t) translation derivation y = p 1,..., p L example: x 1 x 2 x 3 x 4 x 5 x 6 das muss unsere sorge gleichermaßen sein this must also be our concern p 1 p 2 p 3 p 4 y = {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)}
114 Phrase-Based Translation define: source-language sentence words x 1,..., x N phrase translation p = (s, e, t) translation derivation y = p 1,..., p L example: x 1 x 2 x 3 x 4 x 5 x 6 das muss unsere sorge gleichermaßen sein this must also be our concern p 1 p 2 p 3 p 4 y = {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)}
115 Phrase-Based Translation define: source-language sentence words x 1,..., x N phrase translation p = (s, e, t) translation derivation y = p 1,..., p L example: x 1 x 2 x 3 x 4 x 5 x 6 das muss unsere sorge gleichermaßen sein this must also be our concern p 1 p 2 p 3 p 4 y = {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)}
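As a concrete sketch of the representation (writing a phrase as a (start, end, string) triple, matching the example triples; the helper name is ours, not from the slides):

```python
def translation_counts(derivation, n):
    """y(i): number of times source word i is translated (1-based)."""
    y = [0] * (n + 1)                    # index 0 unused
    for start, end, _english in derivation:
        for i in range(start, end + 1):
            y[i] += 1
    return y[1:]

derivation = [(1, 2, "this must"), (5, 5, "also"), (6, 6, "be"),
              (3, 4, "our concern")]
print(translation_counts(derivation, 6))  # each of the 6 source words covered once
```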
116 Scoring Derivations
derivation: y = {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)}
x 1 x 2 x 3 x 4 x 5 x 6 = das muss unsere sorge gleichermaßen sein → this must also be our concern
objective: f(y) = h(e(y)) + Σ_{k=1}^{L} g(p k) + Σ_{k=1}^{L-1} η |t(p k) + 1 - s(p k+1)|
language model score h, phrase translation score g, distortion penalty η
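The distortion term alone can be made concrete (a sketch under our reading of the formula: the jump from one phrase's end position to the next phrase's start; h and g would come from a language model and a phrase table and are omitted):

```python
def distortion_penalty(derivation, eta=1):
    """eta * sum_k |end(p_k) + 1 - start(p_{k+1})| over consecutive phrases."""
    total = 0
    for (_, end1, _), (start2, _, _) in zip(derivation, derivation[1:]):
        total += abs(end1 + 1 - start2)
    return eta * total

derivation = [(1, 2, "this must"), (5, 5, "also"), (6, 6, "be"),
              (3, 4, "our concern")]
print(distortion_penalty(derivation))  # |3-5| + |6-6| + |7-3| = 6
```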
120 Relaxed Problem
Y': only requires the total number of words translated to be N
Y' = {y : Σ_{i=1}^{N} y(i) = N and the distortion limit d is satisfied}
example: x 1 x 2 x 3 x 4 x 5 x 6 = das muss unsere sorge gleichermaßen sein → our concern must be our concern, from (3, 4, our concern)(2, 2, must)(6, 6, be)(3, 4, our concern)
here y(i) = (0, 1, 2, 2, 0, 1), which sums to N = 6 even though no individual y(i) is required to equal 1
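The difference between Y and the relaxed Y' is easy to state in code (a sketch; the distortion-limit condition is left out):

```python
def in_Y(y):
    """Exact set: every source word translated exactly once."""
    return all(count == 1 for count in y)

def in_Y_relaxed(y):
    """Relaxed set: only the total number of translated words must be N."""
    return sum(y) == len(y)

y = [0, 1, 2, 2, 0, 1]   # the slide's "our concern must be our concern"
print(in_Y_relaxed(y), in_Y(y))  # True False
```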
124 Lagrangian Relaxation Method
original: arg max_{y in Y} f(y), with Y = {y : y(i) = 1 for i = 1 ... N} (exact DP is NP-hard)
rewrite: arg max_{y in Y'} f(y) such that y(i) = 1 for i = 1 ... N, with Y' = {y : Σ_{i=1}^{N} y(i) = N} (sum to N)
the relaxed problem over Y' can be solved efficiently by DP; the constraints y(i) = 1 are enforced using Lagrangian relaxation
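Written out (our notation, chosen to match the update rule on the following slides: one multiplier u(i) per constraint y(i) = 1), the Lagrangian dual is:

```latex
L(u) \;=\; \max_{y \in Y'} \Big( f(y) \;+\; \sum_{i=1}^{N} u(i)\,\big(y(i) - 1\big) \Big)
\qquad \text{with} \qquad
L(u) \;\ge\; \max_{y \in Y} f(y) \quad \text{for every } u
```

Minimizing L(u) over u by subgradient descent yields exactly the update u(i) ← u(i) − α(y(i) − 1), where y is the maximizer of the inner problem; if that maximizer happens to satisfy every constraint, it is provably optimal for the original problem.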
129 Algorithm
update u(i): u(i) ← u(i) − α(y(i) − 1)
Iteration 1 (α = 1): das muss unsere sorge gleichermaßen sein → our concern must be our concern
Iteration 2 (α = 0.5): → this must be equally must equally
Iteration 3 (α = 0.5): → this must also be our concern
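One update step, reproduced numerically (the coverage vector is read off iteration 1 above; the helper name is ours):

```python
def subgradient_step(u, y, alpha):
    """u(i) <- u(i) - alpha * (y(i) - 1), elementwise."""
    return [ui - alpha * (yi - 1) for ui, yi in zip(u, y)]

# iteration 1: "our concern must be our concern" covers the words
# (das, muss, unsere, sorge, gleichermaßen, sein) 0, 1, 2, 2, 0, 1 times
u = subgradient_step([0.0] * 6, [0, 1, 2, 2, 0, 1], alpha=1.0)
print(u)  # untranslated words gain a bonus (+1), doubly translated ones a penalty (-1)
```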
137 Tightening the Relaxation
In some cases, we never reach y(i) = 1 for i = 1 ... N
If the dual L(u) is not decreasing fast enough:
run for 10 more iterations
count the number of times each constraint is violated
add the 3 most often violated constraints as hard constraints
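The constraint-selection heuristic can be sketched directly (the counting scheme and k = 3 follow the slide; the function name and the toy history are ours):

```python
from collections import Counter

def most_violated(recent_solutions, k=3):
    """Return the k constraint indices violated most often over the
    last few iterations (y[i] != 1 counts as one violation)."""
    counts = Counter()
    for y in recent_solutions:
        for i, c in enumerate(y):
            if c != 1:
                counts[i] += 1
    return [i for i, _ in counts.most_common(k)]

# alternating infeasible solutions, as in iterations 41-50 on the slides
history = [[1, 1, 1, 1, 0, 2], [1, 1, 1, 1, 2, 0]] * 5
print(most_violated(history, k=2))  # words x5 and x6 (0-based indices 4 and 5)
```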
138 Tightening the Relaxation
Iterations 41-50 oscillate between two infeasible solutions (per-word count(i) and y(i) tables omitted):
iterations 41, 43, ..., 50: das muss unsere sorge gleichermaßen sein → this must also our concern equally
iterations 42, 44, ...: das muss unsere sorge gleichermaßen sein → this must be our concern is
143 Tightening the Relaxation
Iteration 51: count(i) = 0 for all i; y(i) = 1 for all i
das muss unsere sorge gleichermaßen sein → this must also be our concern
Add 2 hard constraints (x 5, x 6) to the dynamic program
145 Experiments: German to English Europarl data: German to English Test on 1,824 sentences with length 1-50 words Converged: 1,818 sentences (99.67%)
146 Experiments: Number of Iterations (plot omitted: percentage of sentences solved within a maximum number of Lagrangian relaxation iterations, broken down by sentence-length bin and overall)
147 Experiments: Number of Hard Constraints Required (plot omitted: percentage of sentences against the number of hard constraints added, broken down by sentence-length bin and overall)
148 Experiments: Mean Time in Seconds (table omitted: mean and median decoding time by sentence-length bin and over all sentences; values lost in transcription)
149 Comparison to ILP Decoding (table omitted: decoding time in seconds for Lagrangian relaxation versus ILP decoding, by sentence length; values garbled in transcription)
150 Summary
presented Lagrangian relaxation as a method for decoding in NLP
formal guarantees: gives a certificate or an approximate solution; can improve approximate solutions by tightening the relaxation
efficient algorithms: uses fast combinatorial algorithms; can improve speed with lazy decoding
widely applicable: demonstrated algorithms for a wide range of NLP tasks (parsing, tagging, alignment, MT decoding)
151 Higher-order non-projective dependency parsing setup: given a model for higher-order non-projective dependency parsing (sibling features) problem: find non-projective dependency parse that maximizes the score of this model difficulty: model is NP-hard to decode complexity of the model comes from enforcing combinatorial constraints strategy: design a decomposition that separates combinatorial constraints from direct implementation of the scoring function
152 Non-Projective Dependency Parsing * 0 John 1 saw 2 a 3 movie 4 today 5 that 6 he 7 liked 8 * 0 John 1 saw 2 a 3 movie 4 today 5 that 6 he 7 liked 8 Important problem in many languages. Problem is NP-Hard for all but the simplest models.
153 Dual Decomposition A classical technique for constructing decoding algorithms. Solve complicated models y = arg max f (y) y by decomposing into smaller problems. Upshot: Can utilize a toolbox of combinatorial algorithms. Dynamic programming Minimum spanning tree Shortest path Min-Cut...
154 Non-Projective Dependency Parsing * 0 John 1 saw 2 a 3 movie 4 today 5 that 6 he 7 liked 8 * 0 John 1 saw 2 a 3 movie 4 today 5 that 6 he 7 liked 8 Starts at the root symbol * Each word has exactly one parent word Produces a tree structure (no cycles) Dependencies can cross
155 Arc-Factored
* 0 John 1 saw 2 a 3 movie 4 today 5 that 6 he 7 liked 8
f(y) = score(head = * 0, mod = saw 2) + score(saw 2, John 1) + score(saw 2, movie 4) + score(saw 2, today 5) + score(movie 4, a 3) + ...
e.g. score(* 0, saw 2) = log p(saw 2 | * 0) (generative model) or score(* 0, saw 2) = w · φ(saw 2, * 0) (CRF/perceptron model)
y* = arg max_y f(y): Minimum Spanning Tree Algorithm
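A minimal sketch of the arc-factored objective (the uniform score function is a stand-in for log p(mod | head) or w · φ; full decoding would run a maximum spanning tree algorithm such as Chu-Liu-Edmonds, which is not shown here):

```python
def arc_factored_score(arcs, score):
    """f(y): sum of independent head -> modifier arc scores."""
    return sum(score(head, mod) for head, mod in arcs)

# part of the slide's tree: *0 -> saw2; saw2 -> John1, movie4, today5; movie4 -> a3
tree = [(0, 2), (2, 1), (2, 4), (2, 5), (4, 3)]
print(arc_factored_score(tree, lambda h, m: 1.0))  # 5 arcs -> 5.0
```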
164 Sibling Models
* 0 John 1 saw 2 a 3 movie 4 today 5 that 6 he 7 liked 8
f(y) = score(head = * 0, prev = NULL, mod = saw 2) + score(saw 2, NULL, John 1) + score(saw 2, NULL, movie 4) + score(saw 2, movie 4, today 5) + ...
e.g. score(saw 2, movie 4, today 5) = log p(today 5 | saw 2, movie 4) or score(saw 2, movie 4, today 5) = w · φ(saw 2, movie 4, today 5)
y* = arg max_y f(y): NP-Hard
172 Thought Experiment: Individual Decoding
* 0 John 1 saw 2 a 3 movie 4 today 5 that 6 he 7 liked 8
candidate modifier sequences for saw 2, e.g.:
score(saw 2, NULL, John 1) + score(saw 2, NULL, movie 4) + score(saw 2, movie 4, today 5)
score(saw 2, NULL, John 1) + score(saw 2, NULL, that 6)
score(saw 2, NULL, a 3) + score(saw 2, a 3, he 7)
2^{n-1} possibilities in total
Under the Sibling Model, can solve for each word with Viterbi decoding.
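The per-head Viterbi decoding can be sketched for one side of a head word (a simplification of the slides' setup: candidates are scanned outward from the head, the DP state is the previously chosen sibling, NULL is represented by None, and the score table is invented for illustration):

```python
def decode_one_side(head, candidates, score):
    """Best-scoring modifier subset on one side of `head` under a
    sibling model; O(n^2) DP with state = last chosen sibling."""
    best = {None: 0.0}                 # last sibling -> best partial score
    for mod in candidates:
        new_best = dict(best)          # option 1: skip `mod`
        for prev, value in best.items():
            cand = value + score(head, prev, mod)   # option 2: attach `mod`
            if cand > new_best.get(mod, float("-inf")):
                new_best[mod] = cand
        best = new_best
    return max(best.values())

# toy scores: saw(2) takes movie(4) first, then today(5) as its sibling
table = {(2, None, 4): 1.0, (2, 4, 5): 2.0}
score = lambda h, p, m: table.get((h, p, m), -5.0)
print(decode_one_side(2, [4, 5, 6], score))  # take 4 then 5: 1.0 + 2.0 = 3.0
```

Running this independently for every head word is exactly the thought experiment that follows: fast, but with no guarantee the chosen modifier sets assemble into a tree.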
178 Thought Experiment Continued
* 0 John 1 saw 2 a 3 movie 4 today 5 that 6 he 7 liked 8
Idea: Do individual decoding for each head word using dynamic programming. If we're lucky, we'll end up with a valid final tree.
More informationProbabilistic Graphical Models
Probabilistic Graphical Models David Sontag New York University Lecture 12, April 23, 2013 David Sontag NYU) Graphical Models Lecture 12, April 23, 2013 1 / 24 What notion of best should learning be optimizing?
More informationAdvanced Natural Language Processing Syntactic Parsing
Advanced Natural Language Processing Syntactic Parsing Alicia Ageno ageno@cs.upc.edu Universitat Politècnica de Catalunya NLP statistical parsing 1 Parsing Review Statistical Parsing SCFG Inside Algorithm
More informationStatistical Methods for NLP
Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationPart of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015
Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about
More informationAdvanced Graph-Based Parsing Techniques
Advanced Graph-Based Parsing Techniques Joakim Nivre Uppsala University Linguistics and Philology Based on previous tutorials with Ryan McDonald Advanced Graph-Based Parsing Techniques 1(33) Introduction
More informationS NP VP 0.9 S VP 0.1 VP V NP 0.5 VP V 0.1 VP V PP 0.1 NP NP NP 0.1 NP NP PP 0.2 NP N 0.7 PP P NP 1.0 VP NP PP 1.0. N people 0.
/6/7 CS 6/CS: Natural Language Processing Instructor: Prof. Lu Wang College of Computer and Information Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang The grammar: Binary, no epsilons,.9..5
More informationSpectral Unsupervised Parsing with Additive Tree Metrics
Spectral Unsupervised Parsing with Additive Tree Metrics Ankur Parikh, Shay Cohen, Eric P. Xing Carnegie Mellon, University of Edinburgh Ankur Parikh 2014 1 Overview Model: We present a novel approach
More informationMVE165/MMG631 Linear and integer optimization with applications Lecture 8 Discrete optimization: theory and algorithms
MVE165/MMG631 Linear and integer optimization with applications Lecture 8 Discrete optimization: theory and algorithms Ann-Brith Strömberg 2017 04 07 Lecture 8 Linear and integer optimization with applications
More informationRelaxed Marginal Inference and its Application to Dependency Parsing
Relaxed Marginal Inference and its Application to Dependency Parsing Sebastian Riedel David A. Smith Department of Computer Science University of Massachusetts, Amherst {riedel,dasmith}@cs.umass.edu Abstract
More informationPart 7: Structured Prediction and Energy Minimization (2/2)
Part 7: Structured Prediction and Energy Minimization (2/2) Colorado Springs, 25th June 2011 G: Worst-case Complexity Hard problem Generality Optimality Worst-case complexity Integrality Determinism G:
More informationMultilevel Coarse-to-Fine PCFG Parsing
Multilevel Coarse-to-Fine PCFG Parsing Eugene Charniak, Mark Johnson, Micha Elsner, Joseph Austerweil, David Ellis, Isaac Haxton, Catherine Hill, Shrivaths Iyengar, Jeremy Moore, Michael Pozar, and Theresa
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationCKY & Earley Parsing. Ling 571 Deep Processing Techniques for NLP January 13, 2016
CKY & Earley Parsing Ling 571 Deep Processing Techniques for NLP January 13, 2016 No Class Monday: Martin Luther King Jr. Day CKY Parsing: Finish the parse Recognizer à Parser Roadmap Earley parsing Motivation:
More informationMarrying Dynamic Programming with Recurrent Neural Networks
Marrying Dynamic Programming with Recurrent Neural Networks I eat sushi with tuna from Japan Liang Huang Oregon State University Structured Prediction Workshop, EMNLP 2017, Copenhagen, Denmark Marrying
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationSequence Labeling: HMMs & Structured Perceptron
Sequence Labeling: HMMs & Structured Perceptron CMSC 723 / LING 723 / INST 725 MARINE CARPUAT marine@cs.umd.edu HMM: Formal Specification Q: a finite set of N states Q = {q 0, q 1, q 2, q 3, } N N Transition
More informationAN ABSTRACT OF THE DISSERTATION OF
AN ABSTRACT OF THE DISSERTATION OF Kai Zhao for the degree of Doctor of Philosophy in Computer Science presented on May 30, 2017. Title: Structured Learning with Latent Variables: Theory and Algorithms
More informationDiscrete Optimization 2010 Lecture 8 Lagrangian Relaxation / P, N P and co-n P
Discrete Optimization 2010 Lecture 8 Lagrangian Relaxation / P, N P and co-n P Marc Uetz University of Twente m.uetz@utwente.nl Lecture 8: sheet 1 / 32 Marc Uetz Discrete Optimization Outline 1 Lagrangian
More informationHidden Markov Models
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 22 April 2, 2018 1 Reminders Homework
More informationSpectral Learning for Non-Deterministic Dependency Parsing
Spectral Learning for Non-Deterministic Dependency Parsing Franco M. Luque 1 Ariadna Quattoni 2 Borja Balle 2 Xavier Carreras 2 1 Universidad Nacional de Córdoba 2 Universitat Politècnica de Catalunya
More informationComputational Genomics and Molecular Biology, Fall
Computational Genomics and Molecular Biology, Fall 2011 1 HMM Lecture Notes Dannie Durand and Rose Hoberman October 11th 1 Hidden Markov Models In the last few lectures, we have focussed on three problems
More informationHidden Markov Models Hamid R. Rabiee
Hidden Markov Models Hamid R. Rabiee 1 Hidden Markov Models (HMMs) In the previous slides, we have seen that in many cases the underlying behavior of nature could be modeled as a Markov process. However
More informationLearning to translate with neural networks. Michael Auli
Learning to translate with neural networks Michael Auli 1 Neural networks for text processing Similar words near each other France Spain dog cat Neural networks for text processing Similar words near each
More informationPersonal Project: Shift-Reduce Dependency Parsing
Personal Project: Shift-Reduce Dependency Parsing 1 Problem Statement The goal of this project is to implement a shift-reduce dependency parser. This entails two subgoals: Inference: We must have a shift-reduce
More informationParsing. Based on presentations from Chris Manning s course on Statistical Parsing (Stanford)
Parsing Based on presentations from Chris Manning s course on Statistical Parsing (Stanford) S N VP V NP D N John hit the ball Levels of analysis Level Morphology/Lexical POS (morpho-synactic), WSD Elements
More information- Well-characterized problems, min-max relations, approximate certificates. - LP problems in the standard form, primal and dual linear programs
LP-Duality ( Approximation Algorithms by V. Vazirani, Chapter 12) - Well-characterized problems, min-max relations, approximate certificates - LP problems in the standard form, primal and dual linear programs
More informationAlgorithms for NLP. Classifica(on III. Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley
Algorithms for NLP Classifica(on III Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley The Perceptron, Again Start with zero weights Visit training instances one by one Try to classify If correct,
More informationLanguage and Statistics II
Language and Statistics II Lecture 19: EM for Models of Structure Noah Smith Epectation-Maimization E step: i,, q i # p r $ t = p r i % ' $ t i, p r $ t i,' soft assignment or voting M step: r t +1 # argma
More information17 Variational Inference
Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms for Inference Fall 2014 17 Variational Inference Prompted by loopy graphs for which exact
More informationCS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning
CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we
More information3.10 Lagrangian relaxation
3.10 Lagrangian relaxation Consider a generic ILP problem min {c t x : Ax b, Dx d, x Z n } with integer coefficients. Suppose Dx d are the complicating constraints. Often the linear relaxation and the
More informationPart A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )
Part A 1. A Markov chain is a discrete-time stochastic process, defined by a set of states, a set of transition probabilities (between states), and a set of initial state probabilities; the process proceeds
More informationMachine Learning for Structured Prediction
Machine Learning for Structured Prediction Grzegorz Chrupa la National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Grzegorz Chrupa la (DCU) Machine Learning for
More informationParsing. A Parsing. Laura Kallmeyer. Winter 2017/18. Heinrich-Heine-Universität Düsseldorf 1 / 27
Parsing A Parsing Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Winter 2017/18 1 / 27 Table of contents 1 Introduction 2 A parsing 3 Computation of the SX estimates 4 Parsing with the SX estimate
More informationAspects of Tree-Based Statistical Machine Translation
Aspects of Tree-Based tatistical Machine Translation Marcello Federico (based on slides by Gabriele Musillo) Human Language Technology FBK-irst 2011 Outline Tree-based translation models: ynchronous context
More informationProcessing/Speech, NLP and the Web
CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 25 Probabilistic Parsing) Pushpak Bhattacharyya CSE Dept., IIT Bombay 14 th March, 2011 Bracketed Structure: Treebank Corpus [ S1[
More informationExpectation Maximization (EM)
Expectation Maximization (EM) The Expectation Maximization (EM) algorithm is one approach to unsupervised, semi-supervised, or lightly supervised learning. In this kind of learning either no labels are
More informationStatistical Machine Translation
Statistical Machine Translation -tree-based models (cont.)- Artem Sokolov Computerlinguistik Universität Heidelberg Sommersemester 2015 material from P. Koehn, S. Riezler, D. Altshuler Bottom-Up Decoding
More informationVariational Decoding for Statistical Machine Translation
Variational Decoding for Statistical Machine Translation Zhifei Li, Jason Eisner, and Sanjeev Khudanpur Center for Language and Speech Processing Computer Science Department Johns Hopkins University 1
More information