Aspects of Tree-Based Statistical Machine Translation
Marcello Federico (based on slides by Gabriele Musillo)
Human Language Technology, FBK-irst, 2011
Outline
Tree-based translation models:
Synchronous context-free grammars
BTG alignment model
Hierarchical phrase-based model
Decoding with SCFGs: Translation as Parsing
DP-based chart decoding
Integration of language model scores
Learning SCFGs:
Rule extraction from phrase-tables
Tree-Based Translation Models
Levels of Representation in Machine Translation:
[diagram: source and target sides, each as a string (σ) or a tree (π)]
π → σ: tree-to-string
σ → π: string-to-tree
π → π: tree-to-tree
? Appropriate Levels of Representation?
Tree Structures
[parse tree: (S (NP (NNP Pierre) (NNP Vinken)) (VP (MD will) (VP (VB join) (NP (Det the) (NN board)))))]
Syntactic Structures:
rooted ordered trees
internal nodes labeled with syntactic categories
leaf nodes labeled with words
linear and hierarchical relations between nodes
Tree-to-Tree Translation Models
[paired parse trees: "Pierre Vinken will join the board" / "Pierre Vinken wird dem Vorstand beitreten"]
syntactic generalizations over pairs of languages:
isomorphic trees
syntactically informed unbounded reordering
formalized as derivations in synchronous grammars
? Adequacy of Isomorphism Assumption?
Context-Free Grammars
CFG (Chomsky, 1956):
formal model of languages
more expressive than FAs and REs
first used in linguistics to describe embedded and recursive structures
S → 0 S 1
S → ε
CFG Rules:
left-hand side: nonterminal symbol
right-hand side: string of nonterminal or terminal symbols
distinguished start nonterminal symbol
S rewrites as 0 S 1; S rewrites as ε
CFG Derivations
Generative Process:
1. Write down the start nonterminal symbol.
2. Choose a rule whose left-hand side is the left-most written down nonterminal symbol.
3. Replace the symbol with the right-hand side of that rule.
4. Repeat from step 2 while there are nonterminal symbols written down.
Derivation: S ⇒ 0S1 ⇒ 00S11 ⇒ 000S111 ⇒ 000111
Parse tree: [three nested applications of S → 0 S 1, with S → ε at the bottom]
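A minimal Python sketch (not from the slides) of this left-most rewriting process, hard-wired to the grammar S → 0 S 1 | ε:

    def derive(steps):
        # Apply S -> 0S1 'steps' times to the left-most S, then S -> epsilon.
        forms, current = ["S"], "S"
        for _ in range(steps):
            current = current.replace("S", "0S1", 1)  # left-most rewrite
            forms.append(current)
        forms.append(current.replace("S", "", 1))     # final S -> epsilon
        return forms

    print(derive(3))  # ['S', '0S1', '00S11', '000S111', '000111']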
CFG Formal Definitions
CFG G = (V, Σ, R, S):
V: finite set of nonterminal symbols
Σ: finite set of terminals, disjoint from V
R: finite set of rules α → β, with α a nonterminal and β a string of terminals and nonterminals
S: the start nonterminal symbol
Let u, v be strings over (V ∪ Σ)*, and α → β ∈ R; then we say:
uαv yields uβv, written as uαv ⇒ uβv
u derives v, written as u ⇒* v, if u = v or a sequence u1, u2, ..., uk exists for k ≥ 0 and u ⇒ u1 ⇒ u2 ⇒ ... ⇒ uk ⇒ v
L(G) = {w ∈ Σ* : S ⇒* w}
CFG Examples
G1: R = {S → NP VP, NP → N | DET N | N PP, VP → V NP | V NP PP, PP → P NP, DET → the | a, N → Alice | Bob | trumpet, V → chased, P → with}
G3: R = {NP → NP CONJ NP | NP PP | DET N, PP → P NP, P → of, DET → the | two | three, N → mother | pianists | singers, CONJ → and}
? derivations of "the mother of three pianists and two singers"
? derivations of "Alice chased Bob with the trumpet"
the same parse tree can be derived in different ways (different order of rule application)
the same sentence can have different parse trees (different choice of rules)
Transduction Grammars aka Synchronous Grammars
TG (Lewis and Stearns, 1968; Aho and Ullman, 1969):
two or more strings derived simultaneously
more powerful than FSTs
used in NLP to model alignments, unbounded reordering, and mappings from surface forms to logical forms
Synchronous Rules:
left-hand side nonterminal symbol associated with source and target right-hand sides
bijection [·] mapping nonterminals in the source and target right-hand sides
infix to Polish notation:
E → E[1] + E[2] / + E[1] E[2]
E → n / n, n ∈ N
Synchronous CFG
1-to-1 correspondence between nodes
isomorphic derivation trees
uniquely determined word alignment
Bracketing Transduction Grammars
BTG (Wu, 1997): special form of SCFG
only one nonterminal X
nonterminal rules:
X → X[1] X[2] / X[1] X[2] (monotone rule)
X → X[1] X[2] / X[2] X[1] (inversion rule)
preterminal rules, where e ∈ Vt ∪ {ε} and f ∈ Vs ∪ {ε}:
X → f / e (lexical translation rules)
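A minimal Python sketch (tree representation assumed, not from the slides) of how a BTG derivation tree yields a source/target pair: monotone nodes keep the target order, inversion nodes swap it:

    def realize(node):
        # node is a lexical pair (f, e) or a triple (op, left, right),
        # with op "M" (monotone rule) or "I" (inversion rule).
        if len(node) == 2:              # lexical rule X -> f / e
            f, e = node
            return [f], [e]
        op, left, right = node
        fl, el = realize(left)
        fr, er = realize(right)
        if op == "M":                   # monotone: X -> X[1] X[2] / X[1] X[2]
            return fl + fr, el + er
        return fl + fr, er + el         # inversion: X -> X[1] X[2] / X[2] X[1]

    tree = ("I", ("1", "1"), ("I", ("2", "2"), ("3", "3")))
    print(realize(tree))  # (['1', '2', '3'], ['3', '2', '1'])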
SCFG Derivations
Generative Process:
1. Write down the source and target start symbols.
2. Choose a synchronous rule whose left-hand side is the left-most written down source nonterminal symbol.
3. Simultaneously rewrite the source symbol and its corresponding target symbol with the source and the target side of the rule, respectively.
4. Repeat from step 2 while there are written down source and target nonterminal symbols.
X → X[1] X[2] / X[1] X[2]
X → X[1] X[2] / X[2] X[1]
X → k / k, k ∈ {1, 2, 3}
BTG Alignments
⟨X, X⟩
⇒ ⟨X[1] X[2], X[2] X[1]⟩ (inversion rule)
⇒ ⟨X[3] X[4] X[2], X[2] X[4] X[3]⟩ (inversion rule, re-indexed symbols)
⇒ ⟨1 X[4] X[2], X[2] X[4] 1⟩ (X → 1/1)
⇒ ⟨1 2 X[2], X[2] 2 1⟩ (X → 2/2)
⇒ ⟨1 2 3, 3 2 1⟩ (X → 3/3)
resulting alignment: source 1 2 3 / target 3 2 1
Phrase-Based Models and SCFGs
SCFG Formalization of Phrase-Based Translation Models:
for each phrase pair ⟨f̃, ẽ⟩, make a rule X → f̃ / ẽ
make monotone rules X → X[1] X[2] / X[1] X[2] and X → X[1] / X[1]
make reordering rules X → X[1] X[2] / X[2] X[1]
? Completeness? Correctness?
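A minimal sketch (phrase-table format assumed) of this construction in Python:

    def phrase_table_to_scfg(phrase_pairs):
        # One lexical rule X -> f / e per phrase pair, plus the monotone
        # and reordering rules over nonterminals X[1], X[2].
        rules = [("X", [f], [e]) for f, e in phrase_pairs]
        rules.append(("X", ["X[1]", "X[2]"], ["X[1]", "X[2]"]))  # monotone
        rules.append(("X", ["X[1]", "X[2]"], ["X[2]", "X[1]"]))  # reordering
        return rules

    print(phrase_table_to_scfg([("shaoshu guojia", "few countries")]))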
Hierarchical Phrase-Based Models
HPBM (Chiang, 2007):
first tree-to-tree approach to perform better than phrase-based systems in large-scale evaluations
discontinuous phrases
long-range reordering
rules formalized as synchronous context-free grammars
HPBM: Motivations
Typical Phrase-Based Chinese-English Translation:
Chinese VPs follow PPs / English VPs precede PPs:
X → yu X[1] you X[2] / have X[2] with X[1]
Chinese NPs follow RCs / English NPs precede RCs:
X → X[1] de X[2] / the X[2] that X[1]
translation of the zhiyi construct into English word order:
X → X[1] zhiyi / one of X[1]
HPBM: Example Rules
S → X[1] / X[1] (1)
S → S[1] X[2] / S[1] X[2] (2)
X → yu X[1] you X[2] / have X[2] with X[1] (3)
X → X[1] de X[2] / the X[2] that X[1] (4)
X → X[1] zhiyi / one of X[1] (5)
X → Aozhou / Australia (6)
X → Beihan / N. Korea (7)
X → shi / is (8)
X → bangjiao / dipl.rels. (9)
X → shaoshu guojia / few countries (10)
HPBM: Example Translation
Synchronous derivation of "Aozhou shi yu Beihan you bangjiao de shaoshu guojia zhiyi" / "Australia is one of the few countries that have dipl.rels. with N. Korea":
Step 1, rule (2): ⟨S[1] X[2], S[1] X[2]⟩
Step 2, rule (2): ⟨S[3] X[4] X[2], S[3] X[4] X[2]⟩
Step 3, rule (1): ⟨X[5] X[4] X[2], X[5] X[4] X[2]⟩
Step 4, rule (6): ⟨Aozhou X[4] X[2], Australia X[4] X[2]⟩
Step 5, rule (8): ⟨Aozhou shi X[2], Australia is X[2]⟩
Step 6, rule (5): ⟨Aozhou shi X[6] zhiyi, Australia is one of X[6]⟩
Step 7, rule (4): ⟨Aozhou shi X[7] de X[8] zhiyi, Australia is one of the X[8] that X[7]⟩
Step 8, rule (3): ⟨Aozhou shi yu X[9] you X[10] de X[8] zhiyi, Australia is one of the X[8] that have X[10] with X[9]⟩
Step 9, rule (7): ⟨Aozhou shi yu Beihan you X[10] de X[8] zhiyi, Australia is one of the X[8] that have X[10] with N. Korea⟩
Step 10, rule (9): ⟨Aozhou shi yu Beihan you bangjiao de X[8] zhiyi, Australia is one of the X[8] that have dipl.rels. with N. Korea⟩
Step 11, rule (10): ⟨Aozhou shi yu Beihan you bangjiao de shaoshu guojia zhiyi, Australia is one of the few countries that have dipl.rels. with N. Korea⟩
Summary
Synchronous Context-Free Grammars:
formal model to synchronize source and target derivation processes
BTG alignment model
HPBM recursive reordering model
Additional topics (optional):
Decoding SCFGs: Translation as Parsing
Learning SCFGs from phrase tables
Synchronous Context-Free Grammars
SCFGs: CFGs in two dimensions
synchronous derivation of isomorphic* trees (*excluding leaves)
unbounded reordering preserving hierarchy
VB → PRP[1] VB1[2] VB2[3] / PRP[1] VB2[3] VB1[2]
VB2 → VB[1] TO[2] / TO[2] VB[1] ga
TO → TO[1] NN[2] / NN[2] TO[1]
PRP → he / kare ha
VB1 → adores / daisuki desu
VB → listening / kiku no
[paired derivation trees for "he adores listening to music" / "kare ha ongaku wo kiku no ga daisuki desu"]
Weighted SCFGs
rules A → α / β associated with positive weights w_{A→α/β}
derivation trees π = ⟨π1, π2⟩ weighted as W(π) = ∏_{A→α/β ∈ G} w_{A→α/β} ^ f_{A→α/β}(π)
probabilistic SCFGs if the following conditions hold: w_{A→α/β} ∈ [0, 1] and Σ_{α,β} w_{A→α/β} = 1
notice: SCFGs might well include rules of type A → α/β1, ..., A → α/βk
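A minimal Python sketch (representation assumed) of the derivation weight W(π) as a product of rule weights, one factor per rule occurrence:

    import math

    def derivation_weight(rule_counts, weights):
        # rule_counts: {rule: f_rule(pi)}, occurrences of each rule in pi;
        # weights: {rule: w_rule}. W(pi) = prod of w_rule ** f_rule(pi).
        return math.prod(weights[r] ** c for r, c in rule_counts.items())

    print(derivation_weight({"r1": 2, "r2": 1}, {"r1": 0.5, "r2": 0.1}))  # 0.025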
MAP Translation Problem
Maximum A Posteriori Translation:
e* = argmax_e p(e | f) = argmax_e Σ_{π ∈ Π(f,e)} p(e, π | f)
Π(f, e) is the set of synchronous derivation trees yielding ⟨f, e⟩
Exact MAP decoding is NP-hard (Simaan, 1996; Satta and Peserico, 2005)
Viterbi Approximation
Tractable Approximate Decoding:
e* = argmax_e Σ_{π ∈ Π(f,e)} p(e, π | f)
   ≈ argmax_e max_{π ∈ Π(f,e)} p(e, π | f)
   = E(argmax_{π ∈ Π(f)} p(π))
Π(f) is the set of synchronous derivations yielding f
E(π) is the target string resulting from the synchronous derivation π
Translation as Parsing
Parsing Solution: π* = argmax_{π ∈ Π(f)} p(π)
1. compute the most probable derivation tree that generates f using the source dimension of the WSCFG
2. build the translation string e by applying the target dimension of the rules used in the most probable derivation
most probable derivation computed in O(n^3) using dynamic programming algorithms for parsing weighted CFGs
transfer of algorithms and optimizations developed for CFGs to MT
Translation as Parsing: Illustration
Weighted CFGs in Chomsky Normal Form
WCFGs:
rules A → α associated with positive weights w_{A→α}
derivation trees π weighted as W(π) = ∏_{A→α ∈ G} w_{A→α} ^ f_{A→α}(π)
probabilistic CFGs if the following conditions hold: w_{A→α} ∈ [0, 1] and Σ_α w_{A→α} = 1
WCFGs in CNF:
rules in CFGs in Chomsky Normal Form: A → B C or A → a
equivalence between WCFGs and WCFGs in CNF
no analogous equivalence holds for weighted SCFGs
Translation as Weighted CKY Parsing
Given a WSCFG G and a source string f:
1. project G onto its source WCFG G′: A →_w α ∈ G′ if A →_w α/β ∈ G and no rule A →_{w′} α/β′ ∈ G exists with w′ > w
2. transform G′ into its CNF G″
3. solve π″ = argmax_{π ∈ Π_{G″}(f)} p(π) with the CKY algorithm
4. revert π″ into π′, the derivation tree according to G′
5. map π′ into its corresponding target tree π
6. read off the translation e from π
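A minimal Python sketch (rule format assumed) of step 1, projecting a weighted SCFG onto its source dimension and keeping the best-scoring target side per source side:

    def project_source(scfg_rules):
        # scfg_rules: iterable of (lhs, src_rhs, tgt_rhs, weight) tuples,
        # with src_rhs/tgt_rhs hashable (e.g. tuples of symbols).
        best = {}
        for lhs, src, tgt, w in scfg_rules:
            key = (lhs, src)
            if key not in best or w > best[key][0]:
                best[key] = (w, tgt)   # remember max weight and its target side
        return best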
Weighted CKY Parsing
Dynamic Programming:
recursive division of problems into subproblems
optimal solutions compose optimal sub-solutions (Bellman's Principle)
tabulation of subproblems and their solutions
CKY Parsing:
subproblems: parsing substrings of the input string u1 ... un
solutions to subproblems tabulated using a chart
bottom-up algorithm starting with derivations of terminals
O(n^3 |G|) time complexity
widely used to perform statistical inference over random trees
Weighted CKY Parsing
Problem: π* = argmax_{π ∈ Π_G(u = u1...un)} p(π)
DP chart: M_{i,k,A} = maximum probability of A ⇒* u_{i+1} ... u_k
base case, k − i = 1: M_{i−1,i,A} = w_{A→u_i}, 1 ≤ i ≤ n
inductive case, k − i > 1: M_{i,k,A} = max_{B,C, i<j<k} {w_{A→B C} · M_{i,j,B} · M_{j,k,C}}
best derivation built by storing B, C, and j for each M_{i,k,A}
Weighted CKY Parsing
M_{i,k,A} = max_{B,C, i<j<k} {w_{A→B C} · M_{i,j,B} · M_{j,k,C}}
[picture: A spans u_{i+1..k}, split at j into B over u_{i+1..j} and C over u_{j+1..k}]
Weighted CKY Pseudo-Code
for all A and 0 ≤ i, j ≤ n: M_{i,j,A} = 0
for i = 1 to n do {base case: substrings of length 1}
  M_{i−1,i,A} = w_{A→u_i}
for l = 2 to n do {l: length of the substring}
  for i = 0 to n − l do {i: start position of the substring}
    k = i + l {k: end position of the substring}
    for j = i + 1 to k − 1 do {j: split position}
      for all rules A → B C do
        q = w_{A→B C} · M_{i,j,B} · M_{j,k,C}
        if q > M_{i,k,A} then
          M_{i,k,A} = q
          D_{i,k,A} = ⟨j, B, C⟩ {backpointers to build derivation tree}
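A runnable Python version of this chart recurrence, as a minimal sketch (dictionary-based chart; grammar format assumed, CNF only):

    def cky(words, lexical, binary):
        # lexical: {(A, word): weight} for rules A -> a;
        # binary: {(A, B, C): weight} for rules A -> B C.
        n = len(words)
        M, D = {}, {}                                 # chart and backpointers
        for i, w in enumerate(words):                 # base case: length 1
            for (A, word), wt in lexical.items():
                if word == w:
                    M[i, i + 1, A] = wt
        for l in range(2, n + 1):                     # substring length
            for i in range(0, n - l + 1):             # start position
                k = i + l                             # end position
                for j in range(i + 1, k):             # split position
                    for (A, B, C), wt in binary.items():
                        q = wt * M.get((i, j, B), 0) * M.get((j, k, C), 0)
                        if q > M.get((i, k, A), 0):
                            M[i, k, A] = q
                            D[i, k, A] = (j, B, C)
        return M, D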
Parsing SCFGs and Language Modelling
Viterbi Decoding of WSCFGs:
focus on most probable derivation of source (ignoring different target sides associated with the same source side)
derivation weights do not include language model scores
? HOW TO EFFICIENTLY COMPUTE TARGET LANGUAGE MODEL SCORES FOR POSSIBLE DERIVATIONS?
Approaches:
1. rescoring: generate k-best candidate translations and rerank the k-best list with the LM
2. online: integrate target m-gram LM scores into dynamic programming parsing
3. cube pruning (Huang and Chiang, 2007): rescore k-best sub-translations at each node of the parse forest
Online Translation
Online Translation: parsing of the source string and building of the corresponding sub-translations in parallel
PP_{1,3}: (w1, t1)    VP_{3,6}: (w2, t2)
⇒ VP_{1,6}: (w · w1 · w2, t2 t1)
w1, w2: weights of the two antecedents
w: weight of the synchronous rule
t1, t2: translations
LM Online Integration (Wu, 1996)
Bigram Online Integration:
PP^{with ... Sharon}_{1,3}: (w1, t1)    VP^{held ... talk}_{3,6}: (w2, t2)
⇒ VP^{held ... Sharon}_{1,6}: (w · w1 · w2 · p_LM(with | talk), t2 t1)
items carry the boundary target words needed to score the bigram at the junction of t2 and t1
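A minimal Python sketch (bigram LM interface assumed) of combining two chart items while adding the LM cost at the junction of their translations:

    def combine(rule_w, item1, item2, lm_prob):
        # item = (weight, translation as list of words); target order: t2 t1.
        # lm_prob(prev, word) is an assumed bigram probability function.
        w1, t1 = item1
        w2, t2 = item2
        junction = lm_prob(t2[-1], t1[0])   # e.g. p_LM(with | talk)
        return (rule_w * w1 * w2 * junction, t2 + t1)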
Cube Pruning (Huang and Chiang, 2007)
Beam Search:
at each step in the derivation, keep at most k items integrating target sub-translations in a beam
enumerate all possible combinations of LM items
extract the k-best combinations
Cube Pruning:
get the k-best LM items without computing all possible combinations
approximate beam search: prone to search errors (in practice much less significant than the gains from efficient decoding)
Cube Pruning Heuristic
Assumption: best adjacent items lie towards the upper-left corner of the grid
part of the grid can be pruned without computing its cells
Cube Pruning: Example
[grid figures: best-first exploration of combination costs, expanding neighbours of extracted cells]
Cube Pruning: Pseudo-Code
To efficiently compute a small corner of the grid:
push the cost of grid cell (1, 1) onto a priority queue
repeat
  1. extract the best cell from the queue
  2. push the costs of the best cell's neighbours onto the queue
until k cells have been extracted (other termination conditions are possible)
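A runnable Python sketch of this k-best corner search, under the simplifying assumption that cell (i, j) costs costs1[i] + costs2[j] with both lists sorted ascending:

    import heapq

    def cube_prune(costs1, costs2, k):
        # Return up to k cheapest (cost, i, j) cells without filling the grid.
        heap = [(costs1[0] + costs2[0], 0, 0)]
        seen = {(0, 0)}
        out = []
        while heap and len(out) < k:
            cost, i, j = heapq.heappop(heap)      # extract best cell
            out.append((cost, i, j))
            for ni, nj in ((i + 1, j), (i, j + 1)):  # push neighbours
                if ni < len(costs1) and nj < len(costs2) and (ni, nj) not in seen:
                    seen.add((ni, nj))
                    heapq.heappush(heap, (costs1[ni] + costs2[nj], ni, nj))
        return out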
Summary
Translation as Parsing:
Viterbi Approximation
Weighted CKY Parsing
Online LM Integration and Cube Pruning
Next Session: Learning SCFGs and Hiero
Hierarchical Phrase-Based Models
Hiero (Chiang, 2005, 2007):
SCFG of rank 2 with only two nonterminal symbols
discontinuous phrases
long-range reordering rules
Hiero Synchronous Rules
Rule Extraction (see the sketch below):
(a) word-align a sentence pair
(b) extract initial phrase pairs
(c) replace sub-phrases in phrases with the symbol X
Glue Rules:
S → S[1] X[2] / S[1] X[2]
S → X[1] / X[1]
Rule Filtering:
limited length of initial phrases
no adjacent nonterminals on the source side
at least one pair of aligned words in non-glue rules
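A minimal Python sketch (inputs and spans assumed, single gap only) of step (c): replacing one aligned sub-phrase with X[1] on both sides of an initial phrase pair:

    def make_hiero_rule(src, tgt, sub_src_span, sub_tgt_span):
        # src, tgt: lists of words; spans are (start, end), end exclusive.
        (fs, fe), (ts, te) = sub_src_span, sub_tgt_span
        src_rhs = src[:fs] + ["X[1]"] + src[fe:]
        tgt_rhs = tgt[:ts] + ["X[1]"] + tgt[te:]
        return ("X", src_rhs, tgt_rhs)

    # e.g. abstracting ("Beihan", "N. Korea") out of an initial phrase pair:
    print(make_hiero_rule("yu Beihan you bangjiao".split(),
                          "have dipl.rels. with N. Korea".split(),
                          (1, 2), (3, 5)))
    # ('X', ['yu', 'X[1]', 'you', 'bangjiao'], ['have', 'dipl.rels.', 'with', 'X[1]'])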
Hiero: Rule Extraction
[worked example: initial phrase pairs and hierarchical rules extracted from a word-aligned sentence pair]
Hiero: Log-Linear Parametrization
Scoring Rules:
s(A → γ) = λ · h(A → γ)
h(A → γ): feature representation vector ∈ R^m
λ: weight vector ∈ R^m
h_r(A → γ): value of the r-th feature
λ_r: weight of the r-th feature
Scoring Derivations:
s(π) = λ_LM · log p(E(π)) + Σ_{⟨Z→γ, i, j⟩ ∈ π} s(Z → γ)
derivation scores decompose into a sum of rule scores
p(E(π)) is the LM score computed while parsing
Hiero: Feature Representation
Word Translation Features:
h1(X → α/β) = log p(T_β | T_α)
h2(X → α/β) = log p(T_α | T_β)
Word Penalty Feature:
h3(X → α/β) = |T_β|
Synchronous Features:
h4(X → α/β) = log p(β | α)
h5(X → α/β) = log p(α | β)
Glue Penalty Feature:
h6(S → S[1] X[2] / S[1] X[2]) = 1
Phrase Penalty Feature:
h7(X → α/β) = 1
λ_i tuned on a dev set using MERT
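A minimal Python sketch (feature layout assumed) of the rule score s(A → γ) = λ · h(A → γ) as a plain dot product:

    def rule_score(features, lam):
        # features, lam: equal-length lists of floats; returns lambda . h.
        return sum(l * h for l, h in zip(lam, features))

    # e.g. h = [log p(T_b|T_a), log p(T_a|T_b), |T_b|, log p(b|a), log p(a|b),
    #           glue penalty, phrase penalty]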
Summary
Hiero:
Rule Extraction
Log-Linear Parametrization
Feature Representation