Aspects of Tree-Based Statistical Machine Translation

Aspects of Tree-Based Statistical Machine Translation. Marcello Federico (based on slides by Gabriele Musillo), Human Language Technology, FBK-irst, 2011.

Outline. Tree-based translation models: synchronous context-free grammars; BTG alignment model; hierarchical phrase-based model. Decoding with SCFGs: translation as parsing; DP-based chart decoding; integration of language model scores. Learning SCFGs: rule extraction from phrase tables.

Tree-Based Translation Models. Levels of representation in machine translation: the source and target sides can each be represented as a string (σ) or a tree (π): π→σ: tree-to-string; σ→π: string-to-tree; π→π: tree-to-tree. Question: what are the appropriate levels of representation?

Tree Structures. Example: the parse tree (S (NP (NNP Pierre) (NNP Vinken)) (VP (MD will) (VP (VB join) (NP (Det the) (NN board))))). Syntactic structures: rooted ordered trees; internal nodes labeled with syntactic categories; leaf nodes labeled with words; linear and hierarchical relations between nodes.

Tree-to-Tree Translation Models. Example: aligned English and German parse trees for "Pierre Vinken will join the board" / "Pierre Vinken wird dem Vorstand beitreten". Syntactic generalizations over pairs of languages: isomorphic trees; syntactically informed unbounded reordering; formalized as derivations in synchronous grammars. Question: adequacy of the isomorphism assumption?

Context-Free Grammars. CFG (Chomsky, 1956): a formal model of languages, more expressive than FAs and REs, first used in linguistics to describe embedded and recursive structures. Example: S → 0S1, S → ε, generating {0^n 1^n}. CFG rules: the left-hand side is a nonterminal symbol; the right-hand side is a string of nonterminal or terminal symbols; there is a distinguished start nonterminal symbol. Here S rewrites as 0S1 or rewrites as ε.

CFG Derivations. Generative process: 1. Write down the start nonterminal symbol. 2. Choose a rule whose left-hand side is the left-most written-down nonterminal symbol. 3. Replace the symbol with the right-hand side of that rule. 4. Repeat from step 2 while there are nonterminal symbols written down. Example derivation: S ⇒ 0S1 ⇒ 00S11 ⇒ 000S111 ⇒ 000111 (using S → ε in the last step), with the corresponding parse tree over 0 0 0 ε 1 1 1.
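A minimal sketch (not from the slides) replaying this generative process on the toy grammar S → 0S1 | ε:

```python
# Left-most derivation with the toy grammar S -> 0 S 1 | epsilon.
RULES = {"S": [["0", "S", "1"], []]}   # [] encodes the empty string

def derive(rule_choices, start="S"):
    """Apply the chosen rule to the left-most nonterminal at each step."""
    sent = [start]
    steps = [" ".join(sent)]
    for choice in rule_choices:
        idx = next(i for i, sym in enumerate(sent) if sym in RULES)
        sent = sent[:idx] + RULES[sent[idx]][choice] + sent[idx + 1:]
        steps.append(" ".join(sent) or "ε")
    return steps

# choose rule 0 (S -> 0 S 1) three times, then rule 1 (S -> epsilon)
print(" => ".join(derive([0, 0, 0, 1])))
# S => 0 S 1 => 0 0 S 1 1 => 0 0 0 S 1 1 1 => 0 0 0 1 1 1
```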

CFG Formal Definitions. A CFG is a tuple G = (V, Σ, R, S): V is a finite set of nonterminal symbols; Σ is a finite set of terminals, disjoint from V; R is a finite set of rules α → β, with α a nonterminal and β a string of terminals and nonterminals; S is the start nonterminal symbol. Let u, v be strings over V ∪ Σ and α → β ∈ R; then we say: uαv yields uβv, written uαv ⇒ uβv; u derives v, written u ⇒* v, if u = v or a sequence u_1, u_2, ..., u_k exists for k ≥ 0 with u ⇒ u_1 ⇒ u_2 ⇒ ... ⇒ u_k ⇒ v. L(G) = {w ∈ Σ* : S ⇒* w}.

CFG Examples. G1: R = {S → NP VP, NP → N | DET N | N PP, VP → V NP | V NP PP, PP → P NP, DET → the | a, N → Alice | Bob | trumpet, V → chased, P → with}. G3: R = {NP → NP CONJ NP | NP PP | DET N, PP → P NP, P → of, DET → the | two | three, N → mother | pianists | singers, CONJ → and}. Questions: what are the derivations of "the mother of three pianists and two singers"? of "Alice chased Bob with the trumpet"? The same parse tree can be derived in different ways (order of rule applications); the same sentence can have different parse trees (choice of rules).

Transduction Grammars (aka Synchronous Grammars). TG (Lewis and Stearns, 1968; Aho and Ullman, 1969): two or more strings derived simultaneously; more powerful than FSTs; used in NLP to model alignments, unbounded reordering, and mappings from surface forms to logical forms. Example (infix to Polish notation): E → E[1] + E[3] / + E[1] E[3] (and similarly for the other operators), E → n / n for n ∈ N. Synchronous rules: a left-hand side nonterminal symbol associated with source and target right-hand sides; a bijection [·] mapping nonterminals in the source right-hand side to nonterminals in the target right-hand side.
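A minimal sketch of the infix-to-Polish example, reading one assumed derivation tree off in both dimensions (the tree and the * operator are illustrative assumptions):

```python
# Synchronous rules E -> E[1] + E[2] / + E[1] E[2] and E -> n / n:
# one shared derivation tree is read off twice, once per dimension.
def source(expr):      # source side: infix order  E[1] op E[2]
    if isinstance(expr, int):
        return str(expr)                       # rule E -> n / n
    op, left, right = expr
    return f"{source(left)} {op} {source(right)}"

def target(expr):      # target side: Polish order  op E[1] E[2]
    if isinstance(expr, int):
        return str(expr)
    op, left, right = expr
    return f"{op} {target(left)} {target(right)}"

tree = ("+", 1, ("*", 2, 3))    # an assumed derivation tree
print(source(tree))             # 1 + 2 * 3
print(target(tree))             # + 1 * 2 3
```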

Synchronous CFG. 1-to-1 correspondence between nodes; isomorphic derivation trees; uniquely determined word alignment.

Bracketing Transduction Grammars. BTG (Wu, 1997): a special form of SCFG with only one nonterminal, S. Nonterminal rules: S → S[1] S[2] / S[1] S[2] (monotone rule); S → S[1] S[2] / S[2] S[1] (inversion rule). Preterminal rules, where e ∈ V_t ∪ {ε} and f ∈ V_s ∪ {ε}: S → f / e (lexical translation rules).

SCFG Derivations. Generative process: 1. Write down the source and target start symbols. 2. Choose a synchronous rule whose left-hand side is the left-most written-down source nonterminal symbol. 3. Simultaneously rewrite the source symbol and its corresponding target symbol with the source and the target side of the rule, respectively. 4. Repeat from step 2 while there are written-down source and target nonterminal symbols. Example grammar: S → S[1] S[2] / S[1] S[2]; S → S[1] S[2] / S[2] S[1]; S → k / k for k ∈ {1, 2, 3}.

BTG Alignments. Example derivation with the inversion rule S → S[1] S[2] / S[2] S[1] and lexical rules S → k / k: ⟨S, S⟩ ⇒ ⟨S[1] S[2], S[2] S[1]⟩ ⇒ ⟨S[3] S[4] S[2], S[2] S[4] S[3]⟩ (re-indexed symbols) ⇒ ⟨1 S[4] S[2], S[2] S[4] 1⟩ ⇒ ⟨1 2 S[2], S[2] 2 1⟩ ⇒ ⟨1 2 3, 3 2 1⟩. The resulting alignment links source 1 2 3 to target 3 2 1 in reverse order.
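A minimal sketch of reading source and target strings off a BTG derivation tree; the tree below reproduces the inversion-only derivation above (the tuple encoding is an assumption for illustration):

```python
# A BTG derivation tree: either a lexical pair (f, e), or a "mono"/"inv"
# node over two sub-derivations; read off source and target words.
def read_off(node):
    """Return (source_words, target_words) for a BTG derivation tree."""
    if node[0] in ("mono", "inv"):
        _, left, right = node
        ls, lt = read_off(left)
        rs, rt = read_off(right)
        if node[0] == "mono":                  # S -> S1 S2 / S1 S2
            return ls + rs, lt + rt
        return ls + rs, rt + lt                # S -> S1 S2 / S2 S1
    f, e = node                                # lexical rule S -> f / e
    return [f], [e]

# inversion applied at both nonterminal nodes, as in the derivation above
deriv = ("inv", ("inv", ("1", "1"), ("2", "2")), ("3", "3"))
src, tgt = read_off(deriv)
print(" ".join(src), "->", " ".join(tgt))      # 1 2 3 -> 3 2 1
```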

Phrase-Based Models and SCFGs. SCFG formalization of phrase-based translation models: for each phrase pair ⟨f̃, ẽ⟩, make a rule S → f̃ / ẽ; make monotone rules S → S[1] S[2] / S[1] S[2]; make reordering rules S → S[1] / S[1] and S → S[1] S[2] / S[2] S[1]. Questions: completeness? correctness?
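A minimal sketch of this construction over a hypothetical two-entry phrase table (the rule encoding is an assumption for illustration):

```python
# Turn phrase pairs into SCFG rules as described above.
# Rules are (lhs, source_rhs, target_rhs); "S1"/"S2" are linked nonterminals.
phrase_table = [("guten tag", "good day"), ("danke", "thank you")]

rules = []
for f, e in phrase_table:                        # lexical rules  S -> f / e
    rules.append(("S", (f,), (e,)))
rules.append(("S", ("S1", "S2"), ("S1", "S2")))  # monotone rule
rules.append(("S", ("S1", "S2"), ("S2", "S1")))  # reordering rule

for lhs, src, tgt in rules:
    print(f"{lhs} -> {' '.join(src)} / {' '.join(tgt)}")
```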

Hierarchical Phrase-Based Models. HPBM (Chiang, 2007): the first tree-to-tree approach to perform better than phrase-based systems in large-scale evaluations; discontinuous phrases; long-range reordering; rules formalized as synchronous context-free grammars.

HPBM: Motivations. Typical problems in phrase-based Chinese-English translation: Chinese VPs follow PPs while English VPs precede PPs: X → yu X[1] you X[2] / have X[2] with X[1]; Chinese NPs follow relative clauses while English NPs precede them: X → X[1] de X[2] / the X[2] that X[1]; translation of the zhiyi construct in English word order: X → X[1] zhiyi / one of X[1].

HPBM: Example Rules.
(1) S → X[1] / X[1]
(2) S → S[1] X[2] / S[1] X[2]
(3) X → yu X[1] you X[2] / have X[2] with X[1]
(4) X → X[1] de X[2] / the X[2] that X[1]
(5) X → X[1] zhiyi / one of X[1]
(6) X → Aozhou / Australia
(7) X → Beihan / N. Korea
(8) X → shi / is
(9) X → bangjiao / dipl. rels.
(10) X → shaoshu guojia / few countries

HPBM: Example Translation (each slide shows the partial source and target derivation trees; figures omitted). Rules applied at each step:
Step 1: rule (2), S → S[1] X[2] / S[1] X[2]
Step 2: rule (2), S → S[1] X[2] / S[1] X[2]
Step 3: rule (1), S → X[1] / X[1]
Step 4: rule (6), X → Aozhou / Australia
Step 5: rule (8), X → shi / is
Step 6: rule (5), X → X[1] zhiyi / one of X[1]
Step 7: rule (4), X → X[1] de X[2] / the X[2] that X[1]
Step 8: rule (3), X → yu X[1] you X[2] / have X[2] with X[1]
Step 9: rule (7), X → Beihan / N. Korea
Step 10: rule (9), X → bangjiao / dipl. rels.
Step 11: rule (10), X → shaoshu guojia / few countries
Resulting sentence pair: Aozhou shi yu Beihan you bangjiao de shaoshu guojia zhiyi / Australia is one of the few countries that have dipl. rels. with N. Korea.

Summary. Synchronous context-free grammars: a formal model to synchronize source and target derivation processes; BTG alignment model; HPB recursive reordering model. Additional topics (optional): decoding SCFGs (translation as parsing); learning SCFGs from phrase tables.

Synchronous Context-Free Grammars. SCFGs: CFGs in two dimensions; synchronous derivation of isomorphic trees (isomorphic excluding the leaves); unbounded reordering that preserves hierarchy. Example English-Japanese rules: VB → PRP[1] VB1[2] VB2[3] / PRP[1] VB2[3] VB1[2]; VB2 → VB[1] TO[2] / TO[2] VB[1] ga; TO → TO[1] NN[2] / NN[2] TO[1]; PRP → he / kare ha; plus lexical rules for the remaining words. The synchronous derivation yields the pair "he adores listening to music" / "kare ha ongaku wo kiku no ga daisuki desu".

Weighted SCFGs. Rules A → α/β are associated with positive weights w(A → α/β). Derivation trees π = ⟨π_1, π_2⟩ are weighted as W(π) = Π_{r = (A → α/β) ∈ G} w(r)^{f_r(π)}, where f_r(π) is the number of times rule r is used in π. We obtain probabilistic SCFGs if the following conditions hold: w(A → α/β) ∈ [0, 1] and Σ_{α,β} w(A → α/β) = 1. Notice: SCFGs may well include several rules A → α/β_1, ..., A → α/β_k sharing the same source side α.
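A minimal sketch of the derivation weight W(π), with toy rule weights assumed:

```python
import math
from collections import Counter

# W(pi) = product over rules r of w(r) raised to f_r(pi), the usage count.
weights = {
    ("S", ("S1", "S2"), ("S2", "S1")): 0.4,   # inversion rule
    ("S", ("1",), ("1",)): 0.2,
    ("S", ("2",), ("2",)): 0.2,
}

def derivation_weight(rule_sequence):
    counts = Counter(rule_sequence)            # f_r(pi)
    return math.prod(weights[r] ** c for r, c in counts.items())

pi = [("S", ("S1", "S2"), ("S2", "S1")),
      ("S", ("1",), ("1",)),
      ("S", ("2",), ("2",))]
print(derivation_weight(pi))    # 0.4 * 0.2 * 0.2 = 0.016
```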

MAP Translation Problem. Maximum a posteriori translation: e* = argmax_e p(e | f) = argmax_e Σ_{π ∈ Π(f,e)} p(e, π | f), where Π(f, e) is the set of synchronous derivation trees yielding ⟨f, e⟩. Exact MAP decoding is NP-hard (Simaan, 1996; Satta and Peserico, 2005).

Viterbi Approximation. Tractable approximate decoding: e* = argmax_e Σ_{π ∈ Π(f,e)} p(e, π | f) ≈ argmax_e max_{π ∈ Π(f,e)} p(e, π | f) = E(argmax_{π ∈ Π(f)} p(π)), where Π(f) is the set of synchronous derivations yielding f and E(π) is the target string resulting from the synchronous derivation π.

Translation as Parsing. Parsing solution: π* = argmax_{π ∈ Π(f)} p(π). 1. Compute the most probable derivation tree that generates f, using the source dimension of the WSCFG. 2. Build the translation string e by applying the target dimension of the rules used in the most probable derivation. The most probable derivation is computed in O(n^3) using dynamic programming algorithms for parsing weighted CFGs; algorithms and optimizations developed for CFG parsing transfer to MT.

Translation as Parsing: Illustration

Weighted CFGs in Chomsky Normal Form. WCFGs: rules A → α associated with positive weights w(A → α); derivation trees π weighted as W(π) = Π_{r = (A → α) ∈ G} w(r)^{f_r(π)}; probabilistic CFGs if w(A → α) ∈ [0, 1] and Σ_α w(A → α) = 1. WCFGs in CNF: all rules have the form A → B C or A → a. Every WCFG is equivalent to a WCFG in CNF; no analogous equivalence holds for weighted SCFGs.

Translation as Weighted CKY Parsing. Given a WSCFG G and a source string f: 1. Project G onto its source WCFG G': A →_w α ∈ G' if A →_w α/β ∈ G and w ≥ w' for every A →_{w'} α/β' ∈ G (each source side keeps the weight of its best target side). 2. Transform G' into its CNF form G''. 3. Solve π'' = argmax_{π'' ∈ Π_{G''}(f)} p(π'') with the CKY algorithm. 4. Revert π'' into π', the derivation tree according to G'. 5. Map π' into its corresponding target tree π. 6. Read off the translation e from π.
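A minimal sketch of step 1, the Viterbi projection onto the source WCFG, under an assumed rule format (the toy grammar is hypothetical):

```python
# Project a weighted SCFG onto its source WCFG, keeping for each source
# side the weight of its best-scoring target side.
scfg = [
    # (lhs, source_rhs, target_rhs, weight)
    ("X", ("yu", "X1", "you", "X2"), ("have", "X2", "with", "X1"), 0.7),
    ("X", ("yu", "X1", "you", "X2"), ("with", "X1", "have", "X2"), 0.2),
    ("X", ("Aozhou",), ("Australia",), 0.9),
]

source_wcfg = {}
for lhs, src, tgt, w in scfg:
    key = (lhs, src)
    source_wcfg[key] = max(w, source_wcfg.get(key, 0.0))

for (lhs, src), w in source_wcfg.items():
    print(f"{lhs} -> {' '.join(src)}   [{w}]")
```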

Weighted CKY Parsing. Dynamic programming: recursive division of problems into subproblems; optimal solutions compose optimal sub-solutions (Bellman's principle); tabulation of subproblems and their solutions. CKY parsing: the subproblems are parses of substrings of the input string u_1 ... u_n; solutions to subproblems are tabulated in a chart; bottom-up algorithm starting with derivations of terminals; O(n^3 |G|) time complexity; widely used to perform statistical inference over random trees.

Weighted CKY Parsing. Problem: π* = argmax_{π ∈ Π_G(u = u_1 ... u_n)} p(π). DP chart: M[i,k,A] = maximum probability of A ⇒* u_{i+1} ... u_k. Base case (k − i = 1), for 1 ≤ i ≤ n: M[i−1, i, A] = w(A → u_i). Inductive case (k − i > 1): M[i,k,A] = max_{B,C, i<j<k} w(A → B C) · M[i,j,B] · M[j,k,C]. The best derivation is built by storing backpointers B, C, and j for each M[i,k,A].

Weighted CKY Parsing (illustration). M[i,k,A] = max_{B,C, i<j<k} w(A → B C) · M[i,j,B] · M[j,k,C]: A spans u_{i+1} ... u_k, with B covering u_{i+1} ... u_j and C covering u_{j+1} ... u_k.

Weighted CKY Pseudo-Code.
for all A and all 0 ≤ i, j ≤ n: M[i,j,A] = 0
{base case: substrings of length 1}
for i = 1 to n do: M[i−1, i, A] = w(A → u_i)
{inductive case: substrings of length > 1}
for l = 2 to n do                {l: length of the substring}
  for i = 0 to n − l do          {i: start position of the substring}
    k = i + l                    {k: end position of the substring}
    for j = i + 1 to k − 1 do    {j: split position}
      for all rules A → B C do
        q = w(A → B C) · M[i,j,B] · M[j,k,C]
        if q > M[i,k,A] then
          M[i,k,A] = q
          D[i,k,A] = ⟨j, B, C⟩   {backpointers to build the derivation tree}
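A runnable counterpart of the pseudo-code, as a minimal sketch with an assumed toy grammar and a dictionary-based chart (not an optimized implementation):

```python
from collections import defaultdict

# Binary rules: (A, B, C) -> weight; lexical rules: (A, word) -> weight.
binary = {("S", "NP", "VP"): 0.9, ("VP", "V", "NP"): 0.8}
lexical = {("NP", "Alice"): 0.4, ("NP", "Bob"): 0.4, ("V", "chased"): 1.0}

def cky(words):
    n = len(words)
    M = defaultdict(float)          # M[(i, k, A)] = best probability
    D = {}                          # backpointers
    for i, w in enumerate(words, start=1):          # base case
        for (A, word), p in lexical.items():
            if word == w:
                M[(i - 1, i, A)] = p
    for l in range(2, n + 1):                       # substring length
        for i in range(0, n - l + 1):
            k = i + l
            for j in range(i + 1, k):               # split point
                for (A, B, C), p in binary.items():
                    q = p * M[(i, j, B)] * M[(j, k, C)]
                    if q > M[(i, k, A)]:
                        M[(i, k, A)] = q
                        D[(i, k, A)] = (j, B, C)
    return M, D

M, D = cky("Alice chased Bob".split())
print(M[(0, 3, "S")])               # probability of the best S derivation
```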

Parsing SCFGs and Language Modelling. Viterbi decoding of WSCFGs focuses on the most probable derivation of the source (ignoring the different target sides associated with the same source side), and derivation weights do not include language model scores. How can target language model scores be computed efficiently for the possible derivations? Approaches: 1. rescoring: generate k-best candidate translations and rerank the k-best list with the LM; 2. online: integrate target m-gram LM scores into dynamic programming parsing; 3. cube pruning (Huang and Chiang, 2007): rescore k-best sub-translations at each node of the parse forest.

Online Translation. Online translation: parsing of the source string and building of the corresponding sub-translations in parallel. Deduction step: from PP[1,3]: (w_1, t_1) and VP[3,6]: (w_2, t_2), derive VP[1,6]: (w · w_1 · w_2, t_2 t_1), where w_1, w_2 are the weights of the two antecedents, w is the weight of the synchronous rule, and t_1, t_2 are their translations.

LM Online Integration (Wu, 1996). Bigram online integration: items additionally record the boundary words of their partial translations. From PP[1,3]^{with...Sharon}: (w_1, t_1) and VP[3,6]^{held...talk}: (w_2, t_2), derive VP[1,6]^{held...Sharon}: (w · w_1 · w_2 · p_LM(with | talk), t_2 t_1).
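A minimal sketch of this deduction step, with toy weights and a hypothetical bigram table; the item layout (weight plus boundary words) is an assumption for illustration:

```python
# Combine two LM items whose states are their boundary target words.
bigram = {("talk", "with"): 0.05}      # p(with | talk), toy value

def combine(rule_weight, item1, item2):
    """item = (weight, first_word, last_word); target order: item2 then item1."""
    w1, first1, last1 = item1
    w2, first2, last2 = item2
    lm = bigram.get((last2, first1), 1e-6)   # bigram across the seam
    return (rule_weight * w1 * w2 * lm, first2, last1)

pp_1_3 = (0.4, "with", "Sharon")   # PP[1,3]
vp_3_6 = (0.3, "held", "talk")     # VP[3,6]
print(combine(0.9, pp_1_3, vp_3_6))  # VP[1,6]: boundary words (held, Sharon)
```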

Cube Pruning (Huang and Chiang, 2007). Beam search: at each step in the derivation, keep at most k items integrating target sub-translations in a beam; enumerate all possible combinations of LM items; extract the k-best combinations. Cube pruning: get the k-best LM items without computing all possible combinations; an approximate beam search, prone to search errors (in practice, the search errors matter much less than the gain in decoding efficiency).

Cube Pruning. Heuristic assumption: the best combinations of adjacent items lie towards the upper-left corner of the grid, so part of the grid can be pruned without computing its cells.

Cube Pruning: Example (figure sequence omitted: the best cells of the grid are popped one at a time in best-first order).

Cube Pruning: Pseudo-Code. To efficiently compute a small corner of the grid: push the cost of grid cell (1,1) onto a priority queue; repeat: 1. extract the best cell from the queue; 2. push the costs of the best cell's neighbours onto the queue; until k cells have been extracted (other termination conditions are possible).
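A minimal sketch of this loop using a priority queue, assuming toy antecedent scores and taking the score of a grid cell to be the product of its two antecedent scores (the LM correction term is ignored for brevity):

```python
import heapq

left = [0.9, 0.5, 0.1]            # scores of the k-best left antecedents
right = [0.8, 0.4, 0.2]           # scores of the k-best right antecedents

def kbest_combinations(left, right, k):
    # negate scores because heapq is a min-heap
    heap = [(-(left[0] * right[0]), 0, 0)]
    seen, results = {(0, 0)}, []
    while heap and len(results) < k:
        score, i, j = heapq.heappop(heap)          # pop the best cell
        results.append((-score, i, j))
        for ni, nj in ((i + 1, j), (i, j + 1)):    # push its neighbours
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-(left[ni] * right[nj]), ni, nj))
    return results

print(kbest_combinations(left, right, k=3))
# approximately [(0.72, 0, 0), (0.4, 1, 0), (0.36, 0, 1)]
```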

Summary. Translation as parsing: Viterbi approximation; weighted CKY parsing; online LM integration and cube pruning. Next session: learning SCFGs and Hiero.

Hierarchical Phrase-Based Models. Hiero (Chiang, 2005, 2007): an SCFG of rank 2 with only two nonterminal symbols; discontinuous phrases; long-range reordering rules.

Hiero Synchronous Rules. Rule extraction: (a) start from a word-aligned sentence pair; (b) extract initial phrase pairs; (c) replace sub-phrases within phrases with the symbol X (sketched in code below). Glue rules: S → S[1] X[2] / S[1] X[2]; S → X[1] / X[1]. Rule filtering: limited length of initial phrases; no adjacent nonterminals on the source side; at least one pair of aligned words in non-glue rules.

Hiero: Rule Extraction (figure sequence omitted: a word-aligned sentence pair, the initial phrase pairs extracted from it, and the hierarchical rules obtained by replacing sub-phrases with X).
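A minimal sketch of step (c) of rule extraction, with a hypothetical phrase pair and sub-phrase pair; real extractors additionally enforce the alignment-consistency and filtering constraints listed above:

```python
# Turn an initial phrase pair into a hierarchical rule by replacing a
# contained sub-phrase pair with the nonterminal X.
def make_hiero_rule(phrase, subphrase):
    """phrase, subphrase: (source_words, target_words) pairs of word lists."""
    (src, tgt), (sub_src, sub_tgt) = phrase, subphrase

    def replace(words, sub):
        for i in range(len(words) - len(sub) + 1):
            if words[i:i + len(sub)] == sub:
                return words[:i] + ["X1"] + words[i + len(sub):]
        raise ValueError("sub-phrase not found")

    return replace(src, sub_src), replace(tgt, sub_tgt)

phrase = (["yu", "Beihan", "you", "bangjiao"],
          ["have", "dipl.rels.", "with", "N.Korea"])
sub = (["Beihan"], ["N.Korea"])
src_rhs, tgt_rhs = make_hiero_rule(phrase, sub)
print("X ->", " ".join(src_rhs), "/", " ".join(tgt_rhs))
# X -> yu X1 you bangjiao / have dipl.rels. with X1
```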

Hiero: Log-Linear Parametrization. Scoring rules: s(A → γ) = λ · h(A → γ), where h(A → γ) ∈ R^m is the feature representation vector, λ ∈ R^m is the weight vector, h_r(A → γ) is the value of the r-th feature, and λ_r is the weight of the r-th feature. Scoring derivations: s(π) = λ_LM · log p(e(π)) + Σ_{(Z → γ, i, j) ∈ π} s(Z → γ); derivation scores decompose into a sum of rule scores; p(e(π)) is the LM score, computed while parsing.

Hiero: Feature Representation. Word translation features: h_1(X → α/β) = log p(T_β | T_α), h_2(X → α/β) = log p(T_α | T_β), where T_α and T_β are the terminal strings of α and β. Word penalty feature: h_3(X → α/β) = |T_β|. Synchronous features: h_4(X → α/β) = log p(β | α), h_5(X → α/β) = log p(α | β). Glue penalty feature: h_6 = 1 for glue rules. Phrase penalty feature: h_7(X → α/β) = 1. The weights λ_i are tuned on a dev set using MERT.
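A minimal sketch of the log-linear rule score as a dot product, with toy feature values and weights assumed:

```python
import math

# s(rule) = lambda . h(rule): dot product of weights with the feature vector.
weights = [0.3, 0.2, -0.1, 0.25, 0.25, -0.5, -0.15]   # lambda_1 ... lambda_7

def rule_score(features, weights):
    return sum(l * h for l, h in zip(weights, features))

# h_1, h_2: word translation log-probs; h_3: word penalty |T_beta|;
# h_4, h_5: synchronous rule log-probs; h_6: glue penalty; h_7: phrase penalty
features = [math.log(0.4), math.log(0.3), 3, math.log(0.2), math.log(0.25), 0, 1]
print(rule_score(features, weights))
```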

Summary. Hiero: rule extraction; log-linear parametrization; feature representation.