Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing


1 Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing Alexander M. Rush (based on joint work with Michael Collins, Tommi Jaakkola, Terry Koo, David Sontag)

2 Uncertainty in language natural language is notoriously ambiguous, even in toy sentences: United flies some large jet the problem becomes very difficult in real-world sentences: Lin made the first start of his surprising N.B.A. career Monday just as the Knicks lost Amar'e Stoudemire and Carmelo Anthony, turning the roster upside down and setting the scene for another strange and wondrous evening at Madison Square Garden.

3 NLP challenges Lin made the first start of his surprising N.B.A. career Monday just as the Knicks lost Amar'e Stoudemire and Carmelo Anthony, turning the roster upside down and setting the scene for another strange and wondrous evening at Madison Square Garden. who lost Stoudemire? what is turning the roster upside down? are strange and wondrous paired? what about down and setting?

4 NLP applications

5 This lecture 1. introduce models for natural language processing in the context of decoding in graphical models 2. describe research in Lagrangian relaxation methods for inference in natural language problems.

6 Decoding in NLP focus: decoding as a combinatorial optimization problem y* = arg max_{y ∈ Y} f(y), where f is a scoring function and Y is a set of structures also known as the MAP problem, f(y; x) = p(y | x)

7 Algorithms for exact decoding coming lectures will explore algorithms for decoding y* = arg max_{y ∈ Y} f(y) for some problems, simple combinatorial algorithms are exact: dynamic programming - e.g. tree-structured MRFs; min cut - for certain types of weight structures; minimum spanning tree - certain NLP problems; can depend on the structure of the graphical model

8 Tagging United flies some large jet N V D A N United 1 flies 2 some 3 large 4 jet 5

9 Parsing United flies some large jet (figure): constituency tree (S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet)))) over the indexed sentence * 0 United 1 flies 2 some 3 large 4 jet 5

10 Decoding complexity issue: simple combinatorial algorithms do not scale to richer models y* = arg max_{y ∈ Y} f(y) need decoding algorithms for complex natural language tasks motivation: richer model structure often leads to improved accuracy; exact decoding for complex models tends to be intractable

11 Phrase-based translation (figure): das muss unsere sorge gleichermaßen sein → this must also be our concern

12 Word alignment the ugly dog has red fur le chien laid a fourrure rouge

13 Decoding tasks high complexity: combined parsing and part-of-speech tagging (Rush et al., 2010), loopy HMM part-of-speech tagging, syntactic machine translation (Rush and Collins, 2011) NP-hard: symmetric HMM alignment (DeNero and Macherey, 2011), phrase-based translation (Chang and Collins, 2011), higher-order non-projective dependency parsing (Koo et al., 2010) in practice: approximate decoding methods (coarse-to-fine, beam search, cube pruning, Gibbs sampling, belief propagation) or approximate models (mean field, variational models)

14 Motivation cannot hope to find exact algorithms (particularly when NP-hard) aim: develop decoding algorithms with formal guarantees method: derive fast algorithms that provide certificates of optimality; show that for practical instances, these algorithms often yield exact solutions; provide strategies for improving solutions or finding approximate solutions when no certificate is found Lagrangian relaxation helps us develop algorithms of this form

15 Lagrangian relaxation a general technique for constructing decoding algorithms solve complicated models y* = arg max_{y ∈ Y} f(y) by decomposing into smaller problems. upshot: can utilize a toolbox of combinatorial algorithms: dynamic programming, minimum spanning tree, shortest path, min cut, ...

16 Lagrangian relaxation algorithms Simple - uses basic combinatorial algorithms Efficient - faster than solving the exact decoding problem Strong guarantees - gives a certificate of optimality when exact; direct connections to linear programming relaxations

17 Outline 1. background on exact NLP decoding 2. worked algorithm for combined parsing and tagging 3. important theorems and formal derivation 4. more examples from parsing and alignment

18 1. Background: Part-of-speech tagging United flies some large jet N V D A N United 1 flies 2 some 3 large 4 jet 5

19 Graphical model formulation given: a sentence of length n and a tag set T one variable for each word, taking values in T edge potentials θ(i−1, i, t', t) for all i ≤ n, t', t ∈ T example: United 1 flies 2 some 3 large 4 jet 5 with T = {A, D, N, V} note: for a probabilistic HMM, θ(i−1, i, t', t) = log(p(w_i | t) p(t | t'))
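A minimal sketch of how the HMM potentials on this slide could be built in Python; the dictionary names emit_prob and trans_prob are assumptions, not part of the lecture:

    import math

    # theta(i-1, i, t', t) = log( p(w_i | t) * p(t | t') ), as on the slide
    def hmm_potentials(words, tags, emit_prob, trans_prob):
        theta = {}
        for i in range(2, len(words) + 1):      # positions are 1-indexed as on the slide
            for t_prev in tags:
                for t in tags:
                    theta[(i, t_prev, t)] = (math.log(emit_prob[(words[i - 1], t)])
                                             + math.log(trans_prob[(t, t_prev)]))
        return theta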

20 Decoding problem let Y = T^n be the set of assignments to the variables decoding problem: arg max_{y ∈ Y} f(y) example: f(y) = Σ_{i=2}^n θ(i−1, i, y(i−1), y(i)) N V D A N United 1 flies 2 some 3 large 4 jet 5

21 Combinatorial solution: Viterbi algorithm Set c(i, t) = −∞ for all i > 1, t ∈ T Set c(1, t) = 0 for all t ∈ T For i = 2 to n, for each t ∈ T: c(i, t) ← max_{t' ∈ T} c(i−1, t') + θ(i−1, i, t', t) Return max_{t' ∈ T} c(n, t') Note: run time O(n |T|^2); can greatly reduce T in practice; special case of belief propagation (next week)
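A small Python sketch of the Viterbi recursion above, with backpointers added so the best tag sequence can be recovered; all names are hypothetical:

    def viterbi(n, T, theta):
        # c[(i, t)] = best score of a tag sequence for positions 1..i ending in tag t
        c = {(1, t): 0.0 for t in T}
        back = {}
        for i in range(2, n + 1):
            for t in T:
                prev = max(T, key=lambda tp: c[(i - 1, tp)] + theta[(i, tp, t)])
                c[(i, t)] = c[(i - 1, prev)] + theta[(i, prev, t)]
                back[(i, t)] = prev
        last = max(T, key=lambda t: c[(n, t)])
        tags = [last]
        for i in range(n, 1, -1):               # trace the backpointers
            tags.append(back[(i, tags[-1])])
        return list(reversed(tags)), c[(n, last)]

As on the slide, the run time is O(n |T|^2).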

22-27 Example run (figures): the Viterbi chart for United 1 flies 2 some 3 large 4 jet 5 is filled in column by column over the tags {A, D, N, V}, and the best path N V D A N is read off at the end.

28 Parsing United flies some large jet (figure): (S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet))))

29 Parse decoding parsing has potentials: θ(A → B C) for each internal rule, θ(A → w) for pre-terminal (bottom) rules not easily represented as a graphical model (complicated independence structure) luckily, there are decoding algorithms similar to Viterbi: the CKY algorithm for context-free parsing - O(G^3 n^3); the Eisner algorithm for dependency parsing - O(n^3)

30 2. Worked example aim: walk through a Lagrangian relaxation algorithm for combined parsing and part-of-speech tagging introduce formal notation for parsing and tagging give assumptions necessary for decoding step through a run of the Lagrangian relaxation algorithm

31 Combined parsing and part-of-speech tagging (figure: the parse tree for United flies some large jet) goal: find the parse tree that optimizes score(S → NP VP) + score(VP → V NP) + ... + score(N V) + score(N → United) + ...

32 Constituency parsing notation: Y is the set of constituency parses for the input; y ∈ Y is a valid parse; f(y) scores a parse tree goal: arg max_{y ∈ Y} f(y) example: a context-free grammar for constituency parsing (figure: the parse tree from the previous slide)

33 Part-of-speech tagging notation: Z is the set of tag sequences for the input; z ∈ Z is a valid tag sequence; g(z) scores a tag sequence goal: arg max_{z ∈ Z} g(z) example: an HMM for part-of-speech tagging N V D A N United 1 flies 2 some 3 large 4 jet 5

34 Identifying tags notation: identify the tag labels selected by each model y(i, t) = 1 when parse y selects tag t at position i z(i, t) = 1 when tag sequence z selects tag t at position i example: a parse y and tagging z with y(4, A) = 1 and z(4, A) = 1 (figure: the parse tree and the tag sequence N V D A N, both tagging large as A at position 4)

35 Combined optimization goal: arg max_{y ∈ Y, z ∈ Z} f(y) + g(z) such that for all i = 1 ... n, t ∈ T, y(i, t) = z(i, t) i.e. find the best parse and tagging pair that agree on tag labels equivalent formulation: arg max_{y ∈ Y} f(y) + g(l(y)), where l : Y → Z extracts the tag sequence from a parse tree

36 Exact method: dynamic programming intersection can solve by solving the product of the two models example: the parsing model is a context-free grammar, the tagging model is a first-order HMM; can solve as a CFG and finite-state automaton intersection: replace VP → V NP with intersected rules of the form VP,V → V,V NP,N (figure: the intersected parse tree)

37 Intersected parsing and tagging complexity let G be the number of grammar non-terminals parsing with the CFG requires O(G^3 n^3) time, with rules such as VP → V NP with intersection, O(G^3 n^3 T^3) time, with rules such as VP,V → V,V NP,N becomes O(G^3 n^3 T^6) time for a second-order HMM

38 Tagging assumption assumption: optimization with u can be solved efficiently: arg max_{z ∈ Z} g(z) − Σ_{i,t} u(i, t) z(i, t) example: an HMM with edge potentials θ, g(z) = Σ_{i=2}^n θ(i−1, i, z(i−1), z(i)), where arg max_{z ∈ Z} g(z) − Σ_{i,t} u(i, t) z(i, t) = arg max_{z ∈ Z} Σ_{i=2}^n (θ(i−1, i, z(i−1), z(i)) − u(i, z(i))) − u(1, z(1))

39 (Modified) Viterbi algorithm Set c(i, t) = −∞ for all i > 1, t ∈ T Set c(1, t) = −u(1, t) for all t ∈ T For i = 2 to n, for each t ∈ T: c(i, t) ← max_{t' ∈ T} c(i−1, t') + θ(i−1, i, t', t) − u(i, t) Return max_{t' ∈ T} c(n, t') Note: same algorithm as before with modified potentials
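A sketch of the modification, reusing the viterbi function from the earlier sketch: subtract the penalties u(i, t) from the potentials (folding the position-1 term into the first transition) and decode as before. Names are assumptions:

    def viterbi_with_penalties(n, T, theta, u):
        theta_mod = {}
        for i in range(2, n + 1):
            for t_prev in T:
                for t in T:
                    theta_mod[(i, t_prev, t)] = theta[(i, t_prev, t)] - u.get((i, t), 0.0)
                    if i == 2:                  # account for the -u(1, z(1)) term
                        theta_mod[(i, t_prev, t)] -= u.get((1, t_prev), 0.0)
        return viterbi(n, T, theta_mod)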

40 Parsing assumption assumption: optimization with u can be solved efficiently: arg max_{y ∈ Y} f(y) + Σ_{i,t} u(i, t) y(i, t) example: a CFG with rule scoring function θ, f(y) = Σ_{X → Y Z ∈ y} θ(X → Y Z) + Σ_{(i,X) ∈ y} θ(X → w_i), where arg max_{y ∈ Y} f(y) + Σ_{i,t} u(i, t) y(i, t) = arg max_{y ∈ Y} Σ_{X → Y Z ∈ y} θ(X → Y Z) + Σ_{(i,X) ∈ y} (θ(X → w_i) + u(i, X))

41 Lagrangian relaxation algorithm Set u^(1)(i, t) = 0 for all i, t ∈ T For k = 1 to K: y^(k) ← arg max_{y ∈ Y} f(y) + Σ_{i,t} u^(k)(i, t) y(i, t) [Parsing] z^(k) ← arg max_{z ∈ Z} g(z) − Σ_{i,t} u^(k)(i, t) z(i, t) [Tagging] If y^(k)(i, t) = z^(k)(i, t) for all i, t: Return (y^(k), z^(k)) Else: u^(k+1)(i, t) ← u^(k)(i, t) − α_k (y^(k)(i, t) − z^(k)(i, t))
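A compact Python sketch of this loop. The two oracles parse_best(u) and tag_best(u) stand for the CKY and Viterbi solvers under the two assumptions above; each returns its structure together with an indicator dictionary mapping (i, t) to 0/1. Everything here is a hypothetical illustration of the algorithm on the slide:

    def dual_decompose(parse_best, tag_best, positions, tags, K=50, c=1.0):
        u = {(i, t): 0.0 for i in positions for t in tags}
        for k in range(1, K + 1):
            y, y_ind = parse_best(u)        # arg max_y f(y) + sum_{i,t} u(i,t) y(i,t)
            z, z_ind = tag_best(u)          # arg max_z g(z) - sum_{i,t} u(i,t) z(i,t)
            if all(y_ind.get(key, 0) == z_ind.get(key, 0) for key in u):
                return y, z, True           # agreement: certificate of optimality
            alpha = c / k                   # one possible step-size schedule
            for key in u:
                u[key] -= alpha * (y_ind.get(key, 0) - z_ind.get(key, 0))
        return y, z, False                  # no certificate within K iterations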

42-54 Worked run (figures): with all penalties u(i, t) = 0, CKY parsing returns a tree tagging the sentence as United/A flies/N some/D large/A jet/V, while Viterbi decoding returns N V D A N; they disagree at positions 1, 2 and 5, so the penalties are updated: u(1, A) −1, u(1, N) +1, u(2, N) −1, u(2, V) +1, u(5, V) −1, u(5, N) +1. At iteration 2 the parser still tags jet as V while the tagger chooses N, giving the updates u(5, V) −1, u(5, N) +1, and at the next iteration both models return the tags N V D A N: the algorithm has converged and y* = arg max_{y ∈ Y} f(y) + g(l(y)) is returned. Key: f(y) CFG, g(z) HMM, Y parse trees, Z taggings, y(i, t) = 1 if y contains tag t at position i.

55 Main theorem theorem: if at any iteration k, for all i, t ∈ T, y^(k)(i, t) = z^(k)(i, t), then (y^(k), z^(k)) is the global optimum proof: focus of the next section

56 Convergence (figure): percentage of examples converged against the number of iterations (≤1, ≤2, ≤3, ≤4, ≤10, ≤20, ≤50).

57 3. Formal properties aim: formal derivation of the algorithm given in the previous section; derive the Lagrangian dual; prove three properties: upper bound, convergence, optimality; describe the subgradient method

58 Lagrangian goal: arg max_{y ∈ Y, z ∈ Z} f(y) + g(z) such that y(i, t) = z(i, t) Lagrangian: L(u, y, z) = f(y) + g(z) + Σ_{i,t} u(i, t) (y(i, t) − z(i, t)) redistribute terms: L(u, y, z) = (f(y) + Σ_{i,t} u(i, t) y(i, t)) + (g(z) − Σ_{i,t} u(i, t) z(i, t))

59 Lagrangian dual Lagrangian: L(u, y, z) = (f(y) + Σ_{i,t} u(i, t) y(i, t)) + (g(z) − Σ_{i,t} u(i, t) z(i, t)) Lagrangian dual: L(u) = max_{y ∈ Y, z ∈ Z} L(u, y, z) = max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t)) + max_{z ∈ Z} (g(z) − Σ_{i,t} u(i, t) z(i, t))

60 Theorem 1. Upper bound define: (y*, z*) is the optimal combined parsing and tagging solution, with y*(i, t) = z*(i, t) for all i, t theorem: for any value of u, L(u) ≥ f(y*) + g(z*), i.e. L(u) provides an upper bound on the score of the optimal solution note: the upper bound may be useful as input to branch-and-bound or A* search

61 Theorem 1. Upper bound (proof) theorem: for any value of u, L(u) ≥ f(y*) + g(z*) proof: L(u) = max_{y ∈ Y, z ∈ Z} L(u, y, z) (1) ≥ max_{y ∈ Y, z ∈ Z : y = z} L(u, y, z) (2) = max_{y ∈ Y, z ∈ Z : y = z} f(y) + g(z) (3) = f(y*) + g(z*) (4)

62 Formal algorithm (reminder) Set u^(1)(i, t) = 0 for all i, t ∈ T For k = 1 to K: y^(k) ← arg max_{y ∈ Y} f(y) + Σ_{i,t} u^(k)(i, t) y(i, t) [Parsing] z^(k) ← arg max_{z ∈ Z} g(z) − Σ_{i,t} u^(k)(i, t) z(i, t) [Tagging] If y^(k)(i, t) = z^(k)(i, t) for all i, t: Return (y^(k), z^(k)) Else: u^(k+1)(i, t) ← u^(k)(i, t) − α_k (y^(k)(i, t) − z^(k)(i, t))

63 Theorem 2. Convergence notation: u^(k+1)(i, t) ← u^(k)(i, t) − α_k (y^(k)(i, t) − z^(k)(i, t)) is the update; u^(k) is the penalty vector at iteration k; α_k > 0 is the update rate at iteration k theorem: for any sequence α_1, α_2, α_3, ... such that lim_{t→∞} α_t = 0 and Σ_{t=1}^∞ α_t = ∞, we have lim_{t→∞} L(u^(t)) = min_u L(u), i.e. the algorithm converges to the tightest possible upper bound proof: by subgradient convergence (next section)

64 Dual solutions define: for any value of u, y_u = arg max_{y ∈ Y} f(y) + Σ_{i,t} u(i, t) y(i, t) and z_u = arg max_{z ∈ Z} g(z) − Σ_{i,t} u(i, t) z(i, t) y_u and z_u are the dual solutions for a given u

65 Theorem 3. Optimality theorem: if there exists u such that y_u(i, t) = z_u(i, t) for all i, t, then f(y_u) + g(z_u) = f(y*) + g(z*) i.e. if the dual solutions agree, we have an optimal solution (y_u, z_u)

66 Theorem 3. Optimality (proof) theorem: if there exists u such that y_u(i, t) = z_u(i, t) for all i, t, then f(y_u) + g(z_u) = f(y*) + g(z*) proof: by the definitions of y_u and z_u, L(u) = f(y_u) + g(z_u) + Σ_{i,t} u(i, t)(y_u(i, t) − z_u(i, t)) = f(y_u) + g(z_u) since L(u) ≥ f(y*) + g(z*) for all values of u, f(y_u) + g(z_u) ≥ f(y*) + g(z*) but y* and z* are optimal, so f(y_u) + g(z_u) ≤ f(y*) + g(z*)

67 Dual optimization Lagrangian dual: L(u) = max_{y ∈ Y, z ∈ Z} L(u, y, z) = max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t)) + max_{z ∈ Z} (g(z) − Σ_{i,t} u(i, t) z(i, t)) goal: the dual problem is to find the tightest upper bound, min_u L(u)

68 Dual subgradient L(u) = max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t)) + max_{z ∈ Z} (g(z) − Σ_{i,t} u(i, t) z(i, t)) properties: L(u) is convex in u (no local minima); L(u) is not differentiable (because of the max operator); handle non-differentiability by using subgradient descent define: a subgradient of L(u) at u is a vector g_u such that for all v, L(v) ≥ L(u) + g_u · (v − u)

69 Subgradient algorithm L(u) = max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t)) + max_{z ∈ Z} (g(z) − Σ_{i,t} u(i, t) z(i, t)) recall, y_u and z_u are the argmaxes of the two terms subgradient: g_u(i, t) = y_u(i, t) − z_u(i, t) subgradient descent: move along the subgradient, u'(i, t) = u(i, t) − α (y_u(i, t) − z_u(i, t)) guaranteed to find a minimum with the conditions given earlier for α
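The same update written as a small Python sketch (names hypothetical): given the dual solutions' indicator dictionaries, the subgradient is their difference and the step moves against it:

    def subgradient_step(u, y_ind, z_ind, alpha):
        # u'(i,t) = u(i,t) - alpha * (y_u(i,t) - z_u(i,t))
        return {key: val - alpha * (y_ind.get(key, 0) - z_ind.get(key, 0))
                for key, val in u.items()}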

70 4. More examples aim: demonstrate similar algorithms that can be applied to other decoding applications: context-free parsing combined with dependency parsing; combined translation alignment

71 Combined constituency and dependency parsing (Rush et al., 2010) setup: assume separate models trained for constituency and dependency parsing problem: find constituency parse that maximizes the sum of the two models example: combine lexicalized CFG with second-order dependency parser

72 Lexicalized constituency parsing notation: Y is the set of lexicalized constituency parses for the input; y ∈ Y is a valid parse; f(y) scores a parse tree goal: arg max_{y ∈ Y} f(y) example: a lexicalized context-free grammar (figure: the lexicalized tree S(flies) → NP(United) VP(flies), VP(flies) → V NP(jet), NP(jet) → D A N over United flies some large jet)

73 Dependency parsing define: Z is the set of dependency parses for the input; z ∈ Z is a valid dependency parse; g(z) scores a dependency parse example: * 0 United 1 flies 2 some 3 large 4 jet 5

74 Identifying dependencies notation: identify the dependencies selected by each model y(i, j) = 1 when word i modifies word j in the constituency parse y z(i, j) = 1 when word i modifies word j in the dependency parse z example: a constituency and dependency parse with y(3, 5) = 1 and z(3, 5) = 1 (figure: the lexicalized parse y and the dependency parse z, both attaching some (word 3) to jet (word 5))

75 Combined optimization goal: arg max_{y ∈ Y, z ∈ Z} f(y) + g(z) such that for all i = 1 ... n, j = 0 ... n, y(i, j) = z(i, j)

76-84 Worked run (figures): with all penalties u(i, j) = 0, CKY parsing and dependency parsing are decoded separately; they disagree on one dependency, giving the updates u(2, 3) −1 and u(5, 3) +1. After re-decoding with these penalties both models select the same dependencies, the algorithm has converged, and y* = arg max_{y ∈ Y} f(y) + g(l(y)) is returned. Key: f(y) CFG, g(z) dependency model, Y parse trees, Z dependency trees, y(i, j) = 1 if y contains dependency (i, j).

85 Convergence (figure): percentage of examples converged against the number of iterations (≤1, ≤2, ≤3, ≤4, ≤10, ≤20, ≤50).

86 Integrated constituency and dependency parsing: accuracy (table): F1 scores for Collins (1997) Model 1, for fixed first-best dependencies from Koo (2008), and for dual decomposition.

87 Corpus-level tagging setup: given a corpus of sentences and a trained sentence-level tagging model problem: find the best tagging for each sentence, while at the same time enforcing inter-sentence soft constraints example: test-time decoding with a trigram tagger and a constraint that each word type prefer a single POS tag

88 Corpus-level tagging English is my first language He studies language arts now Language makes us human beings

89 Sentence-level decoding notation: Y_i is the set of tag sequences for input sentence i; Y = Y_1 × ... × Y_m is the set of tag sequences for the input corpus; Y ∈ Y is a valid tag sequence for the corpus; F(Y) = Σ_i f(Y_i) is the score for tagging the whole corpus goal: arg max_{Y ∈ Y} F(Y) example: decode each sentence with a trigram tagger (figures: English/N is/V my/P first/A language/N; He/P studies/V language/A arts/N now/R)

90 Inter-sentence constraints notation: Z is the set of possible assignments of tags to word types; z ∈ Z is a valid tag assignment; g(z) is a scoring function for assignments to word types example: an MRF model that encourages words of the same type to choose the same tag (figure: assignment z_1 gives all three occurrences of language the same tag, z_2 mixes tags, and g(z_1) > g(z_2))

91 Identifying word tags notation: identify the tag labels selected by each model Y_s(i, t) = 1 when the tagger for sentence s selects tag t at position i z(s, i, t) = 1 when the constraint model assigns tag t at position i of sentence s example: a tagging and constraint assignment with Y_1(5, N) = 1 and z(1, 5, N) = 1 (figure: the taggings of English is my first language and He studies language arts now, with the MRF over the occurrences of language)

92 Combined optimization goal: arg max_{Y ∈ Y, z ∈ Z} F(Y) + g(z) such that for all s = 1 ... m, i = 1 ... n, t ∈ T, Y_s(i, t) = z(s, i, t)

93-104 Worked run (figures): with all penalties u(s, i, t) = 0, the sentence-level taggers tag language as N in sentences 1 and 3 but as A in sentence 2, while the MRF assigns the word type language the single tag A; the updates at iteration 1 are u(1, 5, N) −1, u(1, 5, A) +1, u(3, 1, N) −1, u(3, 1, A) +1, and at iteration 2 u(2, 3, N) +1, u(2, 3, A) −1, after which the taggers and the MRF agree on a single tag for every occurrence of language and the algorithm converges. Key: F(Y) tagging model, g(z) MRF, Y sentence-level taggings, Z inter-sentence constraints, Y_s(i, t) = 1 if sentence s has tag t at position i.

105 Combined alignment (DeNero and Macherey, 2011) setup: assume separate models trained for English-to-French and French-to-English alignment problem: find an alignment that maximizes the score of both models example: HMM models for both directional alignments (assume the correct alignment is one-to-one for simplicity)

106 Alignment the ugly dog has red fur le chien laid a fourrure rouge

107 Graphical model formulation given: a French sentence of length n and an English sentence of length m one variable for each English word, taking values in {1, ..., n} edge potentials θ(i−1, i, f', f) for all i ≤ n, f', f ∈ {1, ..., n} example: The 1 ugly 2 dog 3 has 4 red 5 fur 6

108 English-to-French alignment define: Y is the set of all possible English-to-French alignments; y ∈ Y is a valid alignment; f(y) scores an alignment example: HMM alignment (figure: Le 1 laid 3 chien 2 a 4 rouge 6 fourrure 5 aligned to The 1 ugly 2 dog 3 has 4 red 5 fur 6)

109 French-to-English alignment define: Z is the set of all possible French-to-English alignments; z ∈ Z is a valid alignment; g(z) scores an alignment example: HMM alignment (figure: The 1 ugly 2 dog 3 has 4 fur 6 red 5 aligned to Le 1 chien 2 laid 3 a 4 fourrure 5 rouge 6)

110 Identifying word alignments notation: identify the alignments selected by each model y(i, j) = 1 when the e-to-f alignment y selects French word i to align with English word j z(i, j) = 1 when the f-to-e alignment z selects French word i to align with English word j example: two HMM alignment models with y(6, 5) = 1 and z(6, 5) = 1

111 Combined optimization goal: arg max_{y ∈ Y, z ∈ Z} f(y) + g(z) such that for all i = 1 ... n, j = 1 ... n, y(i, j) = z(i, j)

112-121 Worked run (figures): with all penalties u(i, j) = 0, the English-to-French and French-to-English HMM alignments are decoded separately; they disagree on the alignments of chien and laid, giving the updates u(3, 2) −1, u(2, 2) +1, u(2, 3) −1, u(3, 3) +1, and after re-decoding with these penalties the two directional models produce the same alignment. Key: f(y) HMM alignment, g(z) HMM alignment, Y English-to-French model, Z French-to-English model, y(i, j) = 1 if French word i aligns to English word j.

122 MAP problem in Markov random fields given: binary variables x_1 ... x_n goal: MAP problem arg max_{x_1...x_n} Σ_{(i,j) ∈ E} f_{i,j}(x_i, x_j), where each f_{i,j}(x_i, x_j) is a local potential for variables x_i, x_j

123 Dual decomposition for MRFs (Komodakis et al., 2010) goal: arg max_{x_1...x_n} Σ_{(i,j) ∈ E} f_{i,j}(x_i, x_j) equivalent formulation: arg max_{x_1...x_n, y_1...y_n} Σ_{(i,j) ∈ T_1} f_{i,j}(x_i, x_j) + Σ_{(i,j) ∈ T_2} f_{i,j}(y_i, y_j) such that for i = 1 ... n, x_i = y_i Lagrangian: L(u, x, y) = Σ_{(i,j) ∈ T_1} f_{i,j}(x_i, x_j) + Σ_{(i,j) ∈ T_2} f_{i,j}(y_i, y_j) + Σ_i u_i (x_i − y_i)

124 5. Practical issues aim: overview of practical dual decomposition techniques: tracking the progress of the algorithm; choice of the update rate α_k; lazy update of dual solutions; extracting solutions if the algorithm does not converge

125 Optimization tracking at each stage of the algorithm there are several useful values to track: y^(k), z^(k) are the current dual solutions; L(u^(k)) is the current dual value; y^(k), l(y^(k)) is a potential primal feasible solution; f(y^(k)) + g(l(y^(k))) is the potential primal value

126 Tracking example (figure): the current primal value f(y^(k)) + g(l(y^(k))) and the current dual value L(u^(k)) plotted against the round, for an example run from syntactic machine translation (later in the talk).

127 Optimization progress useful signals: L(u^(k)) − L(u^(k−1)) is the dual change (may be positive); min_k L(u^(k)) is the best dual value (tightest upper bound); max_k f(y^(k)) + g(l(y^(k))) is the best primal value; the optimal value must lie between the best primal and best dual values

128 Progress example (figures): the best primal value max_k f(y^(k)) + g(l(y^(k))) and the best dual value min_k L(u^(k)) plotted against the round, together with the gap min_k L(u^(k)) − max_k f(y^(k)) + g(l(y^(k))).

129 Update rate choice of α_k has important practical consequences: α_k too high causes the dual value to fluctuate; α_k too low means slow progress (figure: dual value against the round for different rates).

130 Update rate practical: find a rate that is robust to varying inputs α_k = c (constant rate): can be very fast, but hard to find a constant that works for all problems α_k = c / k (decreasing rate): often cuts the rate too aggressively, lowering it even when making progress rate based on dual progress: α_k = c / (t + 1), where t < k is the number of iterations on which the dual value increased; robust in practice, reduces the rate when the dual value is fluctuating
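A sketch of the progress-based rate, assuming we keep the list of dual values L(u^(1)), ..., L(u^(k)) seen so far (the constant c and the bookkeeping are assumptions):

    def progress_rate(c, dual_values):
        # t = number of iterations so far on which the dual value increased
        t = sum(1 for prev, cur in zip(dual_values, dual_values[1:]) if cur > prev)
        return c / (t + 1.0)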

131 Lazy decoding idea: don't recompute y^(k) or z^(k) from scratch each iteration lazy decoding: if the subgradient at u^(k) is sparse, then y^(k) may be very easy to compute from y^(k−1) use: helpful if y or z factor naturally into independent components; can be important for fast decompositions

132 Lazy decoding example recall the corpus-level tagging example: at this iteration, only sentence 2 receives a weight update; with lazy decoding, Y_1^(k) = Y_1^(k−1) and Y_3^(k) = Y_3^(k−1)

133 Lazy decoding results lazy decoding is critical for the efficiency of some applications (figure: % of head automata recomputed for the g+s and sib models against the iterations of dual decomposition; recomputation statistics for non-projective dependency parsing).

134 Approximate solution upon agreement the solution is exact, but this may not occur; otherwise, there is an easy way to find an approximate solution choose: the structure y^(k*), where k* = arg max_k f(y^(k)) + g(l(y^(k))) is the iteration with the best primal score guarantee: the solution y^(k*) is non-optimal by at most (min_k L(u^(k))) − (f(y^(k*)) + g(l(y^(k*)))) there are other methods to estimate solutions, for instance by averaging solutions
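A sketch of this fallback, assuming we have recorded the primal score f(y^(k)) + g(l(y^(k))) and the dual value L(u^(k)) at every iteration (list names are assumptions):

    def best_approximate(primal_scores, dual_values):
        k_star = max(range(len(primal_scores)), key=lambda k: primal_scores[k])
        gap = min(dual_values) - primal_scores[k_star]   # y^(k*) is within gap of optimal
        return k_star, gap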

135 Choosing the best solution (figure): best primal, current primal, and current dual values plotted against the round for a non-exact example from syntactic translation; the best approximate primal solution occurs at iteration 63.

136 Early stopping results (figure): for constituency and dependency parsing, the f-score, % certificates, and % match plotted against the maximum number K of dual decomposition iterations.

137 Early stopping results (figure): for non-projective dependency parsing, the validation UAS, % certificates, and % match plotted against the maximum number K of dual decomposition iterations.

138 Phrase-based translation define: source-language sentence words x_1, ..., x_N phrase translation p = (s, e, t) translation derivation y = p_1, ..., p_L example: x 1 x 2 x 3 x 4 x 5 x 6 das muss unsere sorge gleichermaßen sein, with phrases p 1 p 2 p 3 p 4 and y = {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)}

139-144 (figures): the phrases p 1 = (1, 2, this must), p 2 = (5, 5, also), p 3 = (6, 6, be), p 4 = (3, 4, our concern) are added one at a time over das muss unsere sorge gleichermaßen sein, producing the translation this must also be our concern.

145 Scoring derivations derivation: y = {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)} x 1 x 2 x 3 x 4 x 5 x 6 das muss unsere sorge gleichermaßen sein this must also be our concern objective: f(y) = h(e(y)) + Σ_{k=1}^L g(p_k) + Σ_{k=1}^{L−1} η |t(p_k) + 1 − s(p_{k+1})| with language model score h, phrase translation score g, distortion penalty η
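A sketch of this objective in Python, with stand-in callables h (language model score of the English string), g (phrase translation score) and a distortion weight eta; all names are assumptions:

    def derivation_score(phrases, h, g, eta):
        # phrases: list of (s, e, translation) tuples in the order they are used
        english = " ".join(t for (_, _, t) in phrases)
        score = h(english) + sum(g(p) for p in phrases)
        for p, p_next in zip(phrases, phrases[1:]):
            score += eta * abs(p[1] + 1 - p_next[0])     # eta * |t(p_k) + 1 - s(p_{k+1})|
        return score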

149 Relaxed problem Y': only requires the total number of words translated to be N Y' = {y : Σ_{i=1}^N y(i) = N and the distortion limit d is satisfied} example: the derivation (3, 4, our concern)(2, 2, must)(6, 6, be)(3, 4, our concern) gives our concern must be our concern; the y(i) counts sum to 6 even though words 1 and 5 are never translated and words 3 and 4 are translated twice

153 Lagrangian relaxation method original: arg max_{y ∈ Y} f(y), with Y = {y : y(i) = 1 for i = 1 ... N} (exact DP is NP-hard) rewrite: arg max_{y ∈ Y'} f(y) such that y(i) = 1 for i = 1 ... N, where Y' = {y : Σ_{i=1}^N y(i) = N} (sum to N); the arg max over Y' can be solved efficiently by DP, and the constraints y(i) = 1 are handled using Lagrangian relaxation

158-165 Worked run (figures): the update is u(i) ← u(i) − α (y(i) − 1). At iteration 1 (α = 1) the dynamic program returns our concern must be our concern, so the penalties rise for the untranslated words and fall for the doubly translated ones; at iteration 2 (α = 0.5) it returns this must be equally must equally; at iteration 3 (α = 0.5) it returns this must also be our concern, which translates every word exactly once.

166 Tightening the relaxation in some cases, we never reach y(i) = 1 for i = 1 ... N if the dual L(u) is not decreasing fast enough: run for 10 more iterations; count the number of times each constraint is violated; add the 3 most often violated constraints as hard constraints (see the sketch below)
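A sketch of the counting step, assuming y_counts holds, for each of the extra iterations, a dictionary mapping source position i to the number of times word i was translated (all names hypothetical):

    from collections import Counter

    def most_violated(y_counts, top=3):
        violations = Counter()
        for counts in y_counts:
            for i, c in counts.items():
                if c != 1:                  # constraint y(i) = 1 is violated
                    violations[i] += 1
        return [i for i, _ in violations.most_common(top)]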

167-173 Worked run (figures): after the relaxation stalls, the iterations alternate between this must also our concern equally (iterations 41, 43, 50) and this must be our concern is (iterations 42, 44), with the constraints for x 5 and x 6 violated most often; adding 2 hard constraints for (x 5, x 6) to the dynamic program then yields this must also be our concern.

174 Experiments: German to English Europarl data: German to English Test on 1,824 sentences with length 1-50 words Converged: 1,818 sentences (99.67%)

175 Experiments: number of iterations (figure): percentage of sentences converged, broken down by sentence length and over all sentences, against the maximum number of Lagrangian relaxation iterations.

176 Experiments: number of hard constraints required (figure): percentage of sentences, broken down by sentence length and over all sentences, against the number of hard constraints added.

177 Experiments: mean time in seconds (table): mean and median decoding time, broken down by sentence length (# words) and over all sentences.

178 Comparison to ILP decoding (table): decoding time in seconds for Lagrangian relaxation and for ILP decoding.

179 Higher-order non-projective dependency parsing setup: given a model for higher-order non-projective dependency parsing (sibling features) problem: find the non-projective dependency parse that maximizes the score of this model difficulty: the model is NP-hard to decode; the complexity of the model comes from enforcing combinatorial constraints strategy: design a decomposition that separates the combinatorial constraints from the direct implementation of the scoring function

180 Non-projective dependency parsing (figures: two dependency parses of * 0 John 1 saw 2 a 3 movie 4 today 5 that 6 he 7 liked 8) Important problem in many languages. Problem is NP-hard for all but the simplest models.

181 Dual decomposition A classical technique for constructing decoding algorithms. Solve complicated models y* = arg max_y f(y) by decomposing into smaller problems. Upshot: can utilize a toolbox of combinatorial algorithms: dynamic programming, minimum spanning tree, shortest path, min-cut, ...


More information

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging

ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging ACS Introduction to NLP Lecture 2: Part of Speech (POS) Tagging Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk The POS Tagging Problem 2 England NNP s POS fencers

More information

Doctoral Course in Speech Recognition. May 2007 Kjell Elenius

Doctoral Course in Speech Recognition. May 2007 Kjell Elenius Doctoral Course in Speech Recognition May 2007 Kjell Elenius CHAPTER 12 BASIC SEARCH ALGORITHMS State-based search paradigm Triplet S, O, G S, set of initial states O, set of operators applied on a state

More information

Probabilistic Context-Free Grammar

Probabilistic Context-Free Grammar Probabilistic Context-Free Grammar Petr Horáček, Eva Zámečníková and Ivana Burgetová Department of Information Systems Faculty of Information Technology Brno University of Technology Božetěchova 2, 612

More information

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Maximum Entropy Models I Welcome back for the 3rd module

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Computing Lattice BLEU Oracle Scores for Machine Translation

Computing Lattice BLEU Oracle Scores for Machine Translation Computing Lattice Oracle Scores for Machine Translation Artem Sokolov & Guillaume Wisniewski & François Yvon {firstname.lastname}@limsi.fr LIMSI, Orsay, France 1 Introduction 2 Oracle Decoding Task 3 Proposed

More information

Lecture 13: Structured Prediction

Lecture 13: Structured Prediction Lecture 13: Structured Prediction Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/nlp16 CS6501: NLP 1 Quiz 2 v Lectures 9-13 v Lecture 12: before page

More information

Expectation Maximization (EM)

Expectation Maximization (EM) Expectation Maximization (EM) The Expectation Maximization (EM) algorithm is one approach to unsupervised, semi-supervised, or lightly supervised learning. In this kind of learning either no labels are

More information

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015

Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch. COMP-599 Oct 1, 2015 Part of Speech Tagging: Viterbi, Forward, Backward, Forward- Backward, Baum-Welch COMP-599 Oct 1, 2015 Announcements Research skills workshop today 3pm-4:30pm Schulich Library room 313 Start thinking about

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Variational Decoding for Statistical Machine Translation

Variational Decoding for Statistical Machine Translation Variational Decoding for Statistical Machine Translation Zhifei Li, Jason Eisner, and Sanjeev Khudanpur Center for Language and Speech Processing Computer Science Department Johns Hopkins University 1

More information

Hidden Markov Models

Hidden Markov Models 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 19 Nov. 5, 2018 1 Reminders Homework

More information

S NP VP 0.9 S VP 0.1 VP V NP 0.5 VP V 0.1 VP V PP 0.1 NP NP NP 0.1 NP NP PP 0.2 NP N 0.7 PP P NP 1.0 VP NP PP 1.0. N people 0.

S NP VP 0.9 S VP 0.1 VP V NP 0.5 VP V 0.1 VP V PP 0.1 NP NP NP 0.1 NP NP PP 0.2 NP N 0.7 PP P NP 1.0 VP  NP PP 1.0. N people 0. /6/7 CS 6/CS: Natural Language Processing Instructor: Prof. Lu Wang College of Computer and Information Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang The grammar: Binary, no epsilons,.9..5

More information

Decoding and Inference with Syntactic Translation Models

Decoding and Inference with Syntactic Translation Models Decoding and Inference with Syntactic Translation Models March 5, 2013 CFGs S NP VP VP NP V V NP NP CFGs S NP VP S VP NP V V NP NP CFGs S NP VP S VP NP V NP VP V NP NP CFGs S NP VP S VP NP V NP VP V NP

More information

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 ) Part A 1. A Markov chain is a discrete-time stochastic process, defined by a set of states, a set of transition probabilities (between states), and a set of initial state probabilities; the process proceeds

More information

Hidden Markov Models

Hidden Markov Models 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 22 April 2, 2018 1 Reminders Homework

More information

Probabilistic Context-free Grammars

Probabilistic Context-free Grammars Probabilistic Context-free Grammars Computational Linguistics Alexander Koller 24 November 2017 The CKY Recognizer S NP VP NP Det N VP V NP V ate NP John Det a N sandwich i = 1 2 3 4 k = 2 3 4 5 S NP John

More information

Decoding Revisited: Easy-Part-First & MERT. February 26, 2015

Decoding Revisited: Easy-Part-First & MERT. February 26, 2015 Decoding Revisited: Easy-Part-First & MERT February 26, 2015 Translating the Easy Part First? the tourism initiative addresses this for the first time the die tm:-0.19,lm:-0.4, d:0, all:-0.65 tourism touristische

More information

Log-Linear Models, MEMMs, and CRFs

Log-Linear Models, MEMMs, and CRFs Log-Linear Models, MEMMs, and CRFs Michael Collins 1 Notation Throughout this note I ll use underline to denote vectors. For example, w R d will be a vector with components w 1, w 2,... w d. We use expx

More information

Chapter 14 (Partially) Unsupervised Parsing

Chapter 14 (Partially) Unsupervised Parsing Chapter 14 (Partially) Unsupervised Parsing The linguistically-motivated tree transformations we discussed previously are very effective, but when we move to a new language, we may have to come up with

More information

Language and Statistics II

Language and Statistics II Language and Statistics II Lecture 19: EM for Models of Structure Noah Smith Epectation-Maimization E step: i,, q i # p r $ t = p r i % ' $ t i, p r $ t i,' soft assignment or voting M step: r t +1 # argma

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured

More information

Aspects of Tree-Based Statistical Machine Translation

Aspects of Tree-Based Statistical Machine Translation Aspects of Tree-Based Statistical Machine Translation Marcello Federico Human Language Technology FBK 2014 Outline Tree-based translation models: Synchronous context free grammars Hierarchical phrase-based

More information

This kind of reordering is beyond the power of finite transducers, but a synchronous CFG can do this.

This kind of reordering is beyond the power of finite transducers, but a synchronous CFG can do this. Chapter 12 Synchronous CFGs Synchronous context-free grammars are a generalization of CFGs that generate pairs of related strings instead of single strings. They are useful in many situations where one

More information

Probabilistic Models for Sequence Labeling

Probabilistic Models for Sequence Labeling Probabilistic Models for Sequence Labeling Besnik Fetahu June 9, 2011 Besnik Fetahu () Probabilistic Models for Sequence Labeling June 9, 2011 1 / 26 Background & Motivation Problem introduction Generative

More information

Tuning as Linear Regression

Tuning as Linear Regression Tuning as Linear Regression Marzieh Bazrafshan, Tagyoung Chung and Daniel Gildea Department of Computer Science University of Rochester Rochester, NY 14627 Abstract We propose a tuning method for statistical

More information

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26 Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar

More information

17 Variational Inference

17 Variational Inference Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms for Inference Fall 2014 17 Variational Inference Prompted by loopy graphs for which exact

More information

Lecture 9: Decoding. Andreas Maletti. Stuttgart January 20, Statistical Machine Translation. SMT VIII A. Maletti 1

Lecture 9: Decoding. Andreas Maletti. Stuttgart January 20, Statistical Machine Translation. SMT VIII A. Maletti 1 Lecture 9: Decoding Andreas Maletti Statistical Machine Translation Stuttgart January 20, 2012 SMT VIII A. Maletti 1 Lecture 9 Last time Synchronous grammars (tree transducers) Rule extraction Weight training

More information

MVE165/MMG631 Linear and integer optimization with applications Lecture 8 Discrete optimization: theory and algorithms

MVE165/MMG631 Linear and integer optimization with applications Lecture 8 Discrete optimization: theory and algorithms MVE165/MMG631 Linear and integer optimization with applications Lecture 8 Discrete optimization: theory and algorithms Ann-Brith Strömberg 2017 04 07 Lecture 8 Linear and integer optimization with applications

More information

Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields

Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Statistical NLP for the Web Log Linear Models, MEMM, Conditional Random Fields Sameer Maskey Week 13, Nov 28, 2012 1 Announcements Next lecture is the last lecture Wrap up of the semester 2 Final Project

More information

A* Search. 1 Dijkstra Shortest Path

A* Search. 1 Dijkstra Shortest Path A* Search Consider the eight puzzle. There are eight tiles numbered 1 through 8 on a 3 by three grid with nine locations so that one location is left empty. We can move by sliding a tile adjacent to the

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Variational Inference II: Mean Field Method and Variational Principle Junming Yin Lecture 15, March 7, 2012 X 1 X 1 X 1 X 1 X 2 X 3 X 2 X 2 X 3

More information

Quasi-Synchronous Phrase Dependency Grammars for Machine Translation. lti

Quasi-Synchronous Phrase Dependency Grammars for Machine Translation. lti Quasi-Synchronous Phrase Dependency Grammars for Machine Translation Kevin Gimpel Noah A. Smith 1 Introduction MT using dependency grammars on phrases Phrases capture local reordering and idiomatic translations

More information

Learning to translate with neural networks. Michael Auli

Learning to translate with neural networks. Michael Auli Learning to translate with neural networks Michael Auli 1 Neural networks for text processing Similar words near each other France Spain dog cat Neural networks for text processing Similar words near each

More information

Linear Classifiers IV

Linear Classifiers IV Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers IV Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic

More information

Marrying Dynamic Programming with Recurrent Neural Networks

Marrying Dynamic Programming with Recurrent Neural Networks Marrying Dynamic Programming with Recurrent Neural Networks I eat sushi with tuna from Japan Liang Huang Oregon State University Structured Prediction Workshop, EMNLP 2017, Copenhagen, Denmark Marrying

More information

6.864: Lecture 5 (September 22nd, 2005) The EM Algorithm

6.864: Lecture 5 (September 22nd, 2005) The EM Algorithm 6.864: Lecture 5 (September 22nd, 2005) The EM Algorithm Overview The EM algorithm in general form The EM algorithm for hidden markov models (brute force) The EM algorithm for hidden markov models (dynamic

More information

Lecture 9: PGM Learning

Lecture 9: PGM Learning 13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

More information

Lecture 12: Algorithms for HMMs

Lecture 12: Algorithms for HMMs Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 17 October 2016 updated 9 September 2017 Recap: tagging POS tagging is a

More information

1. Markov models. 1.1 Markov-chain

1. Markov models. 1.1 Markov-chain 1. Markov models 1.1 Markov-chain Let X be a random variable X = (X 1,..., X t ) taking values in some set S = {s 1,..., s N }. The sequence is Markov chain if it has the following properties: 1. Limited

More information

Structured Prediction

Structured Prediction Structured Prediction Classification Algorithms Classify objects x X into labels y Y First there was binary: Y = {0, 1} Then multiclass: Y = {1,...,6} The next generation: Structured Labels Structured

More information

Ecient Higher-Order CRFs for Morphological Tagging

Ecient Higher-Order CRFs for Morphological Tagging Ecient Higher-Order CRFs for Morphological Tagging Thomas Müller, Helmut Schmid and Hinrich Schütze Center for Information and Language Processing University of Munich Outline 1 Contributions 2 Motivation

More information

Lecture 12: Algorithms for HMMs

Lecture 12: Algorithms for HMMs Lecture 12: Algorithms for HMMs Nathan Schneider (some slides from Sharon Goldwater; thanks to Jonathan May for bug fixes) ENLP 26 February 2018 Recap: tagging POS tagging is a sequence labelling task.

More information

Advanced Natural Language Processing Syntactic Parsing

Advanced Natural Language Processing Syntactic Parsing Advanced Natural Language Processing Syntactic Parsing Alicia Ageno ageno@cs.upc.edu Universitat Politècnica de Catalunya NLP statistical parsing 1 Parsing Review Statistical Parsing SCFG Inside Algorithm

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2011 1 HMM Lecture Notes Dannie Durand and Rose Hoberman October 11th 1 Hidden Markov Models In the last few lectures, we have focussed on three problems

More information

Analysing Soft Syntax Features and Heuristics for Hierarchical Phrase Based Machine Translation

Analysing Soft Syntax Features and Heuristics for Hierarchical Phrase Based Machine Translation Analysing Soft Syntax Features and Heuristics for Hierarchical Phrase Based Machine Translation David Vilar, Daniel Stein, Hermann Ney IWSLT 2008, Honolulu, Hawaii 20. October 2008 Human Language Technology

More information

Natural Language Processing 1. lecture 7: constituent parsing. Ivan Titov. Institute for Logic, Language and Computation

Natural Language Processing 1. lecture 7: constituent parsing. Ivan Titov. Institute for Logic, Language and Computation atural Language Processing 1 lecture 7: constituent parsing Ivan Titov Institute for Logic, Language and Computation Outline Syntax: intro, CFGs, PCFGs PCFGs: Estimation CFGs: Parsing PCFGs: Parsing Parsing

More information

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs

Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs Empirical Methods in Natural Language Processing Lecture 11 Part-of-speech tagging and HMMs (based on slides by Sharon Goldwater and Philipp Koehn) 21 February 2018 Nathan Schneider ENLP Lecture 11 21

More information

Machine Learning for Structured Prediction

Machine Learning for Structured Prediction Machine Learning for Structured Prediction Grzegorz Chrupa la National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Grzegorz Chrupa la (DCU) Machine Learning for

More information

Relaxed Marginal Inference and its Application to Dependency Parsing

Relaxed Marginal Inference and its Application to Dependency Parsing Relaxed Marginal Inference and its Application to Dependency Parsing Sebastian Riedel David A. Smith Department of Computer Science University of Massachusetts, Amherst {riedel,dasmith}@cs.umass.edu Abstract

More information

Natural Language Processing

Natural Language Processing SFU NatLangLab Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University September 27, 2018 0 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class

More information

Recap: Lexicalized PCFGs (Fall 2007): Lecture 5 Parsing and Syntax III. Recap: Charniak s Model. Recap: Adding Head Words/Tags to Trees

Recap: Lexicalized PCFGs (Fall 2007): Lecture 5 Parsing and Syntax III. Recap: Charniak s Model. Recap: Adding Head Words/Tags to Trees Recap: Lexicalized PCFGs We now need to estimate rule probabilities such as P rob(s(questioned,vt) NP(lawyer,NN) VP(questioned,Vt) S(questioned,Vt)) 6.864 (Fall 2007): Lecture 5 Parsing and Syntax III

More information

Notes on the framework of Ando and Zhang (2005) 1 Beyond learning good functions: learning good spaces

Notes on the framework of Ando and Zhang (2005) 1 Beyond learning good functions: learning good spaces Notes on the framework of Ando and Zhang (2005 Karl Stratos 1 Beyond learning good functions: learning good spaces 1.1 A single binary classification problem Let X denote the problem domain. Suppose we

More information

To make a grammar probabilistic, we need to assign a probability to each context-free rewrite

To make a grammar probabilistic, we need to assign a probability to each context-free rewrite Notes on the Inside-Outside Algorithm To make a grammar probabilistic, we need to assign a probability to each context-free rewrite rule. But how should these probabilities be chosen? It is natural to

More information

3.10 Lagrangian relaxation

3.10 Lagrangian relaxation 3.10 Lagrangian relaxation Consider a generic ILP problem min {c t x : Ax b, Dx d, x Z n } with integer coefficients. Suppose Dx d are the complicating constraints. Often the linear relaxation and the

More information

Penn Treebank Parsing. Advanced Topics in Language Processing Stephen Clark

Penn Treebank Parsing. Advanced Topics in Language Processing Stephen Clark Penn Treebank Parsing Advanced Topics in Language Processing Stephen Clark 1 The Penn Treebank 40,000 sentences of WSJ newspaper text annotated with phrasestructure trees The trees contain some predicate-argument

More information

10/17/04. Today s Main Points

10/17/04. Today s Main Points Part-of-speech Tagging & Hidden Markov Model Intro Lecture #10 Introduction to Natural Language Processing CMPSCI 585, Fall 2004 University of Massachusetts Amherst Andrew McCallum Today s Main Points

More information

Parsing with Context-Free Grammars

Parsing with Context-Free Grammars Parsing with Context-Free Grammars Berlin Chen 2005 References: 1. Natural Language Understanding, chapter 3 (3.1~3.4, 3.6) 2. Speech and Language Processing, chapters 9, 10 NLP-Berlin Chen 1 Grammars

More information

A Context-Free Grammar

A Context-Free Grammar Statistical Parsing A Context-Free Grammar S VP VP Vi VP Vt VP VP PP DT NN PP PP P Vi sleeps Vt saw NN man NN dog NN telescope DT the IN with IN in Ambiguity A sentence of reasonable length can easily

More information

Spectral Learning for Non-Deterministic Dependency Parsing

Spectral Learning for Non-Deterministic Dependency Parsing Spectral Learning for Non-Deterministic Dependency Parsing Franco M. Luque 1 Ariadna Quattoni 2 Borja Balle 2 Xavier Carreras 2 1 Universidad Nacional de Córdoba 2 Universitat Politècnica de Catalunya

More information

Intelligent Systems:

Intelligent Systems: Intelligent Systems: Undirected Graphical models (Factor Graphs) (2 lectures) Carsten Rother 15/01/2015 Intelligent Systems: Probabilistic Inference in DGM and UGM Roadmap for next two lectures Definition

More information

Dual Decomposition for Marginal Inference

Dual Decomposition for Marginal Inference Dual Decomposition for Marginal Inference Justin Domke Rochester Institute of echnology Rochester, NY 14623 Abstract We present a dual decomposition approach to the treereweighted belief propagation objective.

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

6.891: Lecture 8 (October 1st, 2003) Log-Linear Models for Parsing, and the EM Algorithm Part I

6.891: Lecture 8 (October 1st, 2003) Log-Linear Models for Parsing, and the EM Algorithm Part I 6.891: Lecture 8 (October 1st, 2003) Log-Linear Models for Parsing, and EM Algorithm Part I Overview Ratnaparkhi s Maximum-Entropy Parser The EM Algorithm Part I Log-Linear Taggers: Independence Assumptions

More information

Features of Statistical Parsers

Features of Statistical Parsers Features of tatistical Parsers Preliminary results Mark Johnson Brown University TTI, October 2003 Joint work with Michael Collins (MIT) upported by NF grants LI 9720368 and II0095940 1 Talk outline tatistical

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Polyhedral Outer Approximations with Application to Natural Language Parsing

Polyhedral Outer Approximations with Application to Natural Language Parsing Polyhedral Outer Approximations with Application to Natural Language Parsing André F. T. Martins 1,2 Noah A. Smith 1 Eric P. Xing 1 1 Language Technologies Institute School of Computer Science Carnegie

More information