The Geometry of Statistical Machine Translation

1 The Geometry of Statistical Machine Translation
Presented by Rory Waite, 16th of December 2015

2 Introduction
Sections: Introduction, Linear Models, Convex Geometry, The Minkowski Sum, Projected MERT, Conclusions
We provide a novel description of MERT using convex geometry [1].
Convex geometry is a description of linear models.
[1] Ziegler. Lectures on Polytopes.

3 Statistical Machine Translation
We have a source string $f$.
We have a very large number of possible translations, called hypotheses, $\{e, \ldots\}$.
A feature function $h(e, f) \in \mathbb{R}^D$ yields a $D \times 1$ column vector.
Features include language models, word alignment models, and hierarchical phrase-based models.

4 The Linear Model
The dual vector space $(\mathbb{R}^D)^*$ consists of $1 \times D$ weight vectors representing linear maps $\mathbb{R}^D \mapsto \mathbb{R}$.
For a given weight vector $w \in (\mathbb{R}^D)^*$ we can compute a score $w h(e, f)$.
A decoder can be represented by an argmax [2]:
$\hat{e}(f; w) = \operatorname{argmax}_{e} \, w h(e, f)$
Decoding is hard, but it is not the focus of this talk.
[2] Och and Ney. Discriminative training and maximum entropy models for statistical machine translation. ACL.
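
The argmax decoder above can be illustrated on a K-best list. This is a minimal sketch, not the talk's decoder: the feature values, the list size, and the names `score` and `decode` are assumptions made for illustration.

```python
def score(w, h):
    """Model score w h(e, f): a 1 x D weight vector applied to a D x 1 feature vector."""
    return sum(wi * hi for wi, hi in zip(w, h))


def decode(w, kbest):
    """Return the index of the highest-scoring hypothesis in a K-best list."""
    return max(range(len(kbest)), key=lambda i: score(w, kbest[i]))


# Hypothetical feature vectors (D = 2) for K = 3 hypotheses of one source string.
kbest = [[-2.0, -1.0], [-1.0, -3.0], [-1.5, -2.0]]

best_equal_weights = decode([1.0, 1.0], kbest)   # scores: -3.0, -4.0, -3.5
best_skewed_weights = decode([1.0, 0.1], kbest)  # scores: -2.1, -1.3, -1.7
print(best_equal_weights, best_skewed_weights)
```

Changing the weight vector changes which hypothesis wins the argmax, which is the effect that the training methods discussed in this talk exploit.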

5 Current Trends in SMT
The trend is to use linear models with thousands or millions of features [3].
Training methods include MERT, MIRA, PRO, and expected BLEU.
Published results show that these methods give equal performance for a large number of features.
Systems with a large number of features are not used in evaluations.
Published results exhibit overtraining.
[3] Chiang, David. 11,001 new features for statistical machine translation.

6 Minimum Error Rate Training (1)
Consider two K-best lists ranked by $w^{(0)}$, best first:
$f_1$: $e_1$, $e_2$, $e_3$, $e_4$, with scores $w^{(0)} h(e_1, f_1)$, $w^{(0)} h(e_2, f_1)$, $w^{(0)} h(e_3, f_1)$, $w^{(0)} h(e_4, f_1)$
$f_2$: $e_1$, $e_2$, $e_3$, $e_4$, with scores $w^{(0)} h(e_1, f_2)$, $w^{(0)} h(e_2, f_2)$, $w^{(0)} h(e_3, f_2)$, $w^{(0)} h(e_4, f_2)$
$i = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ is the vector of indices of the 1-bests.
$E(f_1, f_2; w^{(0)}) = E(e_{1,i_1}, e_{2,i_2}) = E(e_{1,1}, e_{2,1})$

7 Minimum Error Rate Training (2)
Consider the K-best lists reranked by $w'$, best first:
$f_1$: $e_4$, $e_2$, $e_1$, $e_3$, with scores $w' h(e_4, f_1)$, $w' h(e_2, f_1)$, $w' h(e_1, f_1)$, $w' h(e_3, f_1)$
$f_2$: $e_2$, $e_3$, $e_4$, $e_1$, with scores $w' h(e_2, f_2)$, $w' h(e_3, f_2)$, $w' h(e_4, f_2)$, $w' h(e_1, f_2)$
$i' = \begin{bmatrix} 4 \\ 2 \end{bmatrix}$ is the vector of indices of the 1-bests.
$E(f_1, f_2; w') = E(e_{1,i'_1}, e_{2,i'_2}) = E(e_{1,4}, e_{2,2})$

8 MERT using Line Optimisation
Define a line $w^{(0)} + \gamma d$ in parameter space $(\mathbb{R}^D)^*$.
Compute the model score with respect to $\gamma$:
$(w^{(0)} + \gamma d) h(e, f_s) = w^{(0)} h(e, f_s) + \gamma \, d h(e, f_s) = \ell_e(\gamma)$
Define the upper envelope of model scores:
$\mathrm{Env}(f_s; \gamma) = \max_e \{\ell_e(\gamma) : \gamma \in \mathbb{R}\}$
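
Since each hypothesis contributes a line $\ell_e(\gamma)$ with intercept $w^{(0)} h(e, f_s)$ and slope $d h(e, f_s)$, the upper envelope can be found with a slope-sorted sweep. The sketch below is one common way to do this and is not taken from the talk; the function name `upper_envelope` and the example lines are assumptions.

```python
def upper_envelope(lines):
    """Upper envelope of lines given as (intercept, slope) pairs.

    Returns a list of (gamma_from, line_index): line_index attains the maximum
    from gamma_from up to the gamma_from of the next entry.
    """
    # Sort by slope; among equal slopes the higher intercept comes last.
    order = sorted(range(len(lines)), key=lambda i: (lines[i][1], lines[i][0]))
    hull = []
    for i in order:
        a, b = lines[i]
        if hull and lines[hull[-1][1]][1] == b:
            hull.pop()  # same slope, lower or equal intercept: never the unique maximum
        start = float("-inf")
        while hull:
            g0, j = hull[-1]
            aj, bj = lines[j]
            g = (aj - a) / (b - bj)  # gamma at which line i overtakes line j
            if g <= g0:
                hull.pop()  # line j is overtaken before it ever becomes maximal
            else:
                start = g
                break
        hull.append((start, i))
    return hull


# Hypothetical lines for four hypotheses: (w0 . h, d . h).
lines = [(0.0, -1.0), (1.0, 0.0), (0.0, 1.0), (-5.0, 0.5)]
envelope = upper_envelope(lines)
print(envelope)
```

In this example the fourth line never reaches the envelope, so the corresponding hypothesis cannot become the 1-best anywhere along the search direction.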

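9 The Upper Envelope
[Figure: the upper envelope $\mathrm{Env}(f_s; \gamma)$ formed by the lines $\ell_{e_1}$, $\ell_{e_2}$, $\ell_{e_3}$, $\ell_{e_4}$; beneath it, the error $E(\hat{e}(f_s; \gamma))$ over the intervals where $e_4$, $e_3$, and $e_1$ are the 1-best, separated by $\gamma_1$ and $\gamma_2$.]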

10 Linear Programming MERT
Can $h_{s,i} = h(e_{s,i}, f_s)$ maximise the model?
For the $s$th sentence: $w(h_{s,j} - h_{s,i}) \le 0$ for $1 \le j \le K$.
If a solution $w \ne 0$ exists, then the constraint set is said to be feasible.
If the constraints are infeasible, then $h_{s,i}$ cannot maximise the model.
Feasibility can be tested with a linear program in $O(D^{3.5} K)$.

11 Multi-Sentence LP-MERT
Define $i$ as a vector that contains $S$ elements, $1 \le i_s \le K$.
Define feature vectors of the form $h_i = h_{1,i_1} + \cdots + h_{S,i_S}$.
Test for feasibility.
Worst-case runtime is exponential at $O(K^S)$.

12 Large Margin Training
Define the oracle index vector $\hat{i}$:
$\hat{i}_s = \operatorname{argmin}_{i_s} E(e_{s,i_s})$, for all $1 \le s \le S$
The primal form of the structured SVM is:
maximise $y$
subject to $w(h_{s,j} - h_{s,\hat{i}_s}) + y \le 0$, $1 \le j \le K$, $1 \le s \le S$, $\hat{i}_s \ne j$
$\|w\| = 1$
Relax constraints until feasible.
MIRA is an online variant of this approach.

13 Pairwise Ranking Optimisation
Find a parameter vector $w$ such that the ranking of a K-best list by model score, $w h_{s,i} \ge w h_{s,j} \ge \ldots$, follows the ranking implied by error counts, $E(e_{s,i}) \le E(e_{s,j}) \le \ldots$
For any pair of hypotheses $1 \le i, j \le K$:
$w(h_{s,j} - h_{s,i}) \le 0$ if $E(e_{s,i}) - E(e_{s,j}) \le 0$
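
The pairwise condition can be checked directly for a given weight vector. The sketch below only tests the constraints; it is not the PRO training procedure itself, and the function name `ranking_violations`, the feature vectors, and the error values are assumptions.

```python
def dot(w, h):
    return sum(wi * hi for wi, hi in zip(w, h))


def ranking_violations(w, feats, errors):
    """Return the pairs (i, j) with E(e_i) <= E(e_j) but w (h_j - h_i) > 0."""
    bad = []
    for i in range(len(feats)):
        for j in range(len(feats)):
            if i != j and errors[i] <= errors[j]:
                diff = [hj - hi for hi, hj in zip(feats[i], feats[j])]
                if dot(w, diff) > 0:
                    bad.append((i, j))
    return bad


# Hypothetical K-best list (K = 3, D = 2) with per-hypothesis error values.
feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
errors = [0.1, 0.5, 0.3]

consistent = ranking_violations([1.0, 0.0], feats, errors)    # ranks 0, 2, 1
inconsistent = ranking_violations([0.0, 1.0], feats, errors)  # ranks 1, 2, 0
print(consistent, inconsistent)
```

A weight vector that returns an empty list ranks the whole list in agreement with the error counts.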

14 Convex Geometry
Convex geometry is the study of polytopes and their faces.
A polytope $H_s$ is a convex hull of feature vectors.
Consider a decision boundary where the polytope is fully contained in one half-space.
A face is the intersection of the polytope with the decision boundary.
The parameter vector $w$ defines the decision boundary, and thus the face.

15 Convex Geometry
Two classes of faces are of interest.
Vertex: a face consisting of a single point.
Edge: a face in the form of a line segment between two vertices, $[h_{s,i}, h_{s,j}]$.
LP-MERT finds vertices.
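
In two dimensions the vertices of the polytope can be listed with a standard convex hull routine, which makes the statement that only vertices can be selected as the 1-best concrete. This is a sketch using Andrew's monotone chain rather than the linear program used by LP-MERT; the points are hypothetical.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_vertices(points):
    """Vertices of the convex hull of 2-D points, in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()  # drop points that are interior or collinear
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


# Hypothetical feature vectors (D = 2); (2, 1) lies strictly inside the hull.
points = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1)]
vertices = hull_vertices(points)
print(vertices)
```

The interior point is missing from the result: no weight vector makes it the unique maximiser of the linear model.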

16 Vertex Example
[Figure: the feature vectors $h_1$, $h_2$, $h_3$, $h_4$, $h_5$ in $\mathbb{R}^D$, plotted on axes $h_{TM}$ and $h_{LM}$.]

17 Edge Example
[Figure: the feature vectors $h_1$, $h_2$, $h_3$, $h_4$, $h_5$ in $\mathbb{R}^D$, plotted on axes $h_{TM}$ and $h_{LM}$.]

18 Geometry in $(\mathbb{R}^D)^*$
For a face $F$, the normal cone $N_F$ is the set of parameter vectors that yield the face.
The normal cone for a vertex is the set of feasible parameter vectors that satisfy LP-MERT.
The normal cone for an edge is a decision boundary in $(\mathbb{R}^D)^*$.
The set of all normal cones is called the normal fan $\mathcal{N}(H_s)$.

19 Normal Cone for Edge 1
[Figure: left, the points $h_1, \ldots, h_5$ in $\mathbb{R}^D$; right, the normal cone $N_{[h_4, h_1]}$ and a parameter vector $w$ in $(\mathbb{R}^D)^*$, on axes $w_{TM}$ and $w_{LM}$.]

20 Normal Cone for Edge 2
[Figure: left, the points $h_1, \ldots, h_5$ in $\mathbb{R}^D$; right, the normal cones $N_{[h_4, h_1]}$ and $N_{[h_3, h_1]}$ in $(\mathbb{R}^D)^*$, on axes $w_{TM}$ and $w_{LM}$.]

21 Normal Cone for a Vertex
[Figure: left, the points $h_1, \ldots, h_5$ in $\mathbb{R}^D$; right, the normal cone $N_{\{h_1\}}$ lying between $N_{[h_4, h_1]}$ and $N_{[h_3, h_1]}$ in $(\mathbb{R}^D)^*$, on axes $w_{TM}$ and $w_{LM}$.]

22 The Full Normal Fan
[Figure: left, the points $h_1, \ldots, h_5$ in $\mathbb{R}^D$; right, the full normal fan in $(\mathbb{R}^D)^*$ with vertex cones $N_{\{h_1\}}$, $N_{\{h_2\}}$, $N_{\{h_3\}}$, $N_{\{h_4\}}$ and edge cones $N_{[h_4, h_2]}$, $N_{[h_4, h_1]}$, $N_{[h_3, h_2]}$, $N_{[h_3, h_1]}$.]

23 The Common Refinement
The common refinement for two polytopes is:
$\mathcal{N}(H_s) \wedge \mathcal{N}(H_t) := \{N \cap N' : N \in \mathcal{N}(H_s), N' \in \mathcal{N}(H_t)\}$
The common refinement can be extended to more than two polytopes.

24 The Minkowski Sum
The Minkowski sum for two polytopes is:
$H_s + H_t := \{h + h' : h \in H_s, h' \in H_t\}$
The Minkowski sum is the quantity computed in LP-MERT.
The common refinement and the Minkowski sum have the relationship:
$\mathcal{N}(H_s + H_t) = \mathcal{N}(H_s) \wedge \mathcal{N}(H_t)$
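
For two small 2-D K-best lists the Minkowski sum can be formed by brute force, and the hull of the summed points then shows which index vectors are feasible. This is only an illustrative sketch with hypothetical feature vectors and 0-based indices; it enumerates all pairs and so does not scale, unlike a dedicated Minkowski sum algorithm.

```python
from itertools import product


def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def hull_vertices(points):
    """Convex hull vertices of 2-D points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def feasible_index_vectors(H1, H2):
    """Map each vertex of H1 + H2 to the index pairs (i_1, i_2) that produce it."""
    sums = {}
    for (i, a), (j, b) in product(enumerate(H1), enumerate(H2)):
        sums.setdefault((a[0] + b[0], a[1] + b[1]), []).append((i, j))
    return {v: sums[v] for v in hull_vertices(list(sums))}


# Hypothetical K-best feature vectors for two sentences (K = 4 and K = 3).
H1 = [(0, 0), (2, 0), (2, 2), (0, 2)]
H2 = [(0, 0), (1, 0), (0, 1)]
feasible = feasible_index_vectors(H1, H2)
print(len(feasible), "of", len(H1) * len(H2), "index vectors are feasible")
```

Here 5 of the 12 possible index vectors correspond to vertices of the sum, so the remaining 7 combinations of 1-bests cannot be selected jointly by any single weight vector.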

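25 Minkowski Sum Example
[Figure: top row, in $\mathbb{R}^2$, the polytopes $H_1$ (points $h_{1,1}$, $h_{1,2}$, $h_{1,3}$, $h_{1,4}$), $H_2$ (points $h_{2,1}$, $h_{2,2}$, $h_{2,3}$), and $H_1 + H_2$ with labelled points $h_{1,2} + h_{2,2}$, $h_{1,3} + h_{2,2}$, $\{h_{1,3} + h_{2,1}, h_{1,2} + h_{2,3}\}$, $h_{1,2} + h_{2,1}$, $h_{1,3} + h_{2,3}$, $h_{1,1} + h_{2,1}$, $h_{1,1} + h_{2,2}$, $\{h_{1,4} + h_{2,1}, h_{1,1} + h_{2,3}\}$, $h_{1,4} + h_{2,2}$, $h_{1,4} + h_{2,3}$; bottom row, in $(\mathbb{R}^2)^*$, the normal fans $\mathcal{N}(H_1)$, $\mathcal{N}(H_2)$, and $\mathcal{N}(H_1) \wedge \mathcal{N}(H_2)$.]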

26 A Minkowski Sum Algorithm
Treat the Minkowski sum polytope as a graph $G(V, E)$.
$V$ and $E$ are the sets of vertices and edges in the polytope.
Start from the normal cone that contains $w^{(0)}$.
Enumerate through the adjacent vertices and repeat.
Total runtime is $O(\delta D^{3.5} |V|)$ [4], where $\delta$ is the maximum degree of $G$.
[4] Fukuda, Komei. From the zonotope construction to the Minkowski addition of convex polytopes. Journal of Symbolic Computation.

27 Upper Bounds Theorem
The upper bound on $|V|$ is $O(S^{D-1} K^{2(D-1)})$ [5].
Each vertex maps to a feasible index vector $i$.
For low-dimensional features, i.e. for $D$ such that $S^{D-1} K^{2(D-1)} \ll K^S$, the optimiser is tightly constrained.
The optimiser cannot freely select any hypothesis as the 1-best for each sentence.
We believe this acts as an inherent form of regularisation.
[5] Gritzmann and Sturmfels. Minkowski addition of polytopes: .... SIAM Journal on Discrete Mathematics.

28 Upper Bounds Theorem
The upper bound on $|V|$ is $O(S^{D-1} K^{2(D-1)})$.
Recall that PRO enforces pairwise ranking.
Each pairwise ranking corresponds to an edge.
The upper bound on the number of edges is $|V|^2$.
We believe that ranking methods also benefit from this regularisation.

29 Upper Bounds Example
The upper bound on $|V|$ is $O(S^{D-1} K^{2(D-1)})$.
For the CUED WMT 13 Russian-to-English system: $S = 1502$, $K = 1000$, $D = 12$.
Only a vanishingly small percentage of the $K^S$ index vectors are feasible.
If $D = 493$ then $S^{D-1} K^{2(D-1)} > K^S$.
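
The two quantities in this example are far too large to compare directly, but their base-10 logarithms are easy to compute. The sketch below treats the big-O bound as if its constant were 1, which is an assumption, so the numbers indicate orders of magnitude only; the function names are hypothetical.

```python
import math


def log10_vertex_bound(S, K, D):
    """log10 of S^(D-1) * K^(2(D-1)), ignoring the constant hidden by the big-O."""
    return (D - 1) * math.log10(S) + 2 * (D - 1) * math.log10(K)


def log10_index_vectors(S, K):
    """log10 of K^S, the number of possible index vectors."""
    return S * math.log10(K)


def smallest_saturating_dimension(S, K):
    """Smallest D for which the vertex bound is no longer below K^S."""
    D = 1
    while log10_vertex_bound(S, K, D) < log10_index_vectors(S, K):
        D += 1
    return D


S, K = 1502, 1000
bound_d12 = log10_vertex_bound(S, K, 12)
total = log10_index_vectors(S, K)
print(f"D = 12: bound ~ 10^{bound_d12:.1f}, index vectors = 10^{total:.0f}")
print("bound reaches K^S at D =", smallest_saturating_dimension(S, K))
```

Under this assumption the bound for $D = 12$ is around $10^{101}$ against $10^{4506}$ index vectors, and the bound first exceeds $K^S$ at $D = 493$, in agreement with the slide.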

30 Projected MERT
MERT line optimisation can be represented as an affine projection to an $(M+1)$-dimensional space, $M < D$:
$A_{M+1,D} \, h_{s,i} = \begin{bmatrix} d_1 \\ \vdots \\ d_M \\ w^{(0)} \end{bmatrix} h_{s,i} = \tilde{h}_{s,i}$
This searches over all parameter vectors embedded in the hyperplane given by the directions.
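
The projection stacks the direction vectors and $w^{(0)}$ as rows and applies them to each feature vector. The sketch below checks, for hypothetical values, that scoring the projected vector with $(\gamma_1, \ldots, \gamma_M, 1)$ equals scoring the original vector with $w^{(0)} + \sum_m \gamma_m d_m$; the names `project` and `gammas` are assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def project(directions, w0, h):
    """Apply the (M+1) x D matrix with rows d_1, ..., d_M, w0 to a feature vector h."""
    return [dot(d, h) for d in directions] + [dot(w0, h)]


# Hypothetical setting: D = 3 features, M = 2 search directions.
directions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w0 = [0.5, 0.5, 1.0]
h = [2.0, -1.0, 3.0]
gammas = [0.25, -2.0]

h_projected = project(directions, w0, h)

# Score in the original space with w = w0 + sum_m gamma_m * d_m.
w = [w0[k] + sum(g * d[k] for g, d in zip(gammas, directions)) for k in range(len(w0))]
score_full = dot(w, h)

# Score in the projected space with the parameter vector (gamma_1, ..., gamma_M, 1).
score_projected = dot(gammas + [1.0], h_projected)
print(h_projected, score_full, score_projected)
```

Because the two scores agree, searching over the $\gamma$ values in the projected space is equivalent to searching the affine subspace through $w^{(0)}$ spanned by the directions.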

31 Projected MERT Example
CUED Russian-to-English entry to WMT 13.
Projected to a 3-D affine subspace containing the source-to-target (UtoV) log-probability and the word insertion penalty (WP).
Used the Minkowski sum implementation of Weibel (2010).
Left it running for several weeks.
First application of multi-dimensional MERT in polynomial time.

32 Error Surface in 2-Dimensions
[Figure: BLEU plotted over the WP parameter and the UtoV parameter, with $w^{(0)}$ marked.]

33 Conclusions
Linear models can be represented by convex geometry.
Overtraining is potentially a problem.
We could try non-linear models: neural networks, random decision forests.
