Decoding and Inference with Syntactic Translation Models

March 5, 2013

CFGs
A small context-free grammar: S → NP VP, VP → NP V, plus lexical rules for a verb (V) and two noun phrases (NP). (The source-language words, written in Japanese script on the original slides, did not survive transcription; the same example appears romanized later as jon-ga, ringo-o, and tabeta.) Rewriting S step by step, S ⇒ NP VP ⇒ NP NP V ⇒ ... ⇒ jon-ga ringo-o tabeta, builds a derivation tree whose leaves read off the output sentence.

Synchronous CFGs
S → NP VP : 1 2 (monotonic)
VP → NP V : 2 1 (inverted)
V → tabeta : ate
NP → jon-ga : John
NP → ringo-o : an apple
(As above, the source-side terminals are restored from the romanized forms used in the tree-to-string example below.)

Synchronous generation
Starting from the linked pair (S, S), each rule rewrites linked nonterminals in both trees at once: S → NP VP in both languages (monotonic order 1 2), then VP → NP V on the source side with the inverted order V NP on the target side, and finally the lexical rules for John, ate, and an apple. Output: (jon-ga ringo-o tabeta : John ate an apple).

Translation as parsing
Parse the source sentence with the source projection of the SCFG, building NP, NP, and V, then VP, then S. Then project the completed source parse to the target side through the synchronous rules, which reorders the VP and yields John ate an apple.

A closer look at parsing
Parsing is usually done with dynamic programming: share common computations and structure, and represent an exponential number of alternatives in polynomial space. With SCFGs there are two kinds of ambiguity, source parse ambiguity and translation ambiguity, and parse forests can represent both!

A closer look at parsing
Any monolingual parser can be used (most often CKY or CKY variants). Parsing complexity is O(n³ |G|³): cubic in the length of the sentence (n³) and cubic in the number of non-terminal types (|G|³), so adding nonterminal types increases parsing complexity substantially! With few NTs, exhaustive parsing is tractable.

Parsing as deduction
An inference rule is written with antecedents above the line, the consequent below it, and side conditions alongside:

$$\frac{A : u \qquad B : v}{C : w}\;\phi$$

If A and B are true with weights u and v, and the side condition φ is also true, then C is true with weight w.

Example: CKY
Inputs: f = ⟨f₁, f₂, ..., f_ℓ⟩, the source sentence, and G, a context-free grammar in Chomsky normal form.
Item form: [X, i, j] — a subtree rooted with NT type X spanning i to j has been recognized.

Example: CKY
Goal: [S, 0, ℓ]
Axioms: [X, i−1, i] : w for every rule (X → f_i) ∈ G with weight w
Inference rule:

$$\frac{[X, i, k] : u \qquad [Y, k, j] : v}{[Z, i, j] : u \cdot v \cdot w}\;(Z \to X\,Y) \in G \text{ with weight } w$$
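To make the deduction system concrete, here is a minimal weighted CKY sketch in Python; the grammar encoding, the max-product scoring, and the toy weights are assumptions for illustration, not part of the slides.

```python
from collections import defaultdict

def cky(words, lexical, binary):
    """Weighted CKY (max-product). Items are [X, i, j] with a Viterbi weight.

    lexical: dict word -> list of (X, weight) for rules X -> word
    binary:  list of (Z, X, Y, weight) for rules Z -> X Y (Chomsky normal form)
    """
    n = len(words)
    chart = defaultdict(float)                    # chart[(X, i, j)] = best weight

    # Axioms: [X, i-1, i] : w for each rule (X -> f_i) in G
    for i, word in enumerate(words):
        for X, w in lexical.get(word, []):
            chart[(X, i, i + 1)] = max(chart[(X, i, i + 1)], w)

    # Inference rule: from [X, i, k] : u and [Y, k, j] : v with (Z -> X Y) : w,
    # derive [Z, i, j] : u * v * w
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for Z, X, Y, w in binary:
                    u, v = chart[(X, i, k)], chart[(Y, k, j)]
                    if u > 0.0 and v > 0.0:
                        chart[(Z, i, j)] = max(chart[(Z, i, j)], u * v * w)

    return chart[("S", 0, n)]                     # goal item [S, 0, n]

# The "I saw her duck" grammar from the next slides (the weights are made up).
lexical = {"I": [("PRP", 1.0)], "saw": [("V", 1.0)], "her": [("PRP", 1.0)],
           "duck": [("NN", 0.6), ("V", 0.4)]}
binary = [("S", "PRP", "VP", 1.0), ("VP", "V", "NP", 0.5),
          ("VP", "V", "SBAR", 0.5), ("NP", "PRP", "NN", 1.0),
          ("SBAR", "PRP", "V", 1.0)]
print(cky("I saw her duck".split(), lexical, binary))   # Viterbi weight of [S, 0, 4]
```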

Example: CKY on "I saw her duck"
Grammar: S → PRP VP, VP → V NP, VP → V SBAR, SBAR → PRP V, NP → PRP NN, plus lexical rules giving I the tag PRP, saw the tag V, her the tag PRP, and duck either NN or V.
The axioms fill the chart with [PRP,0,1], [V,1,2], [PRP,2,3], [NN,3,4], and [V,3,4]. The inference rule then derives [NP,2,4] (her duck as a noun phrase), [SBAR,2,4] (her duck as a small clause), [VP,1,4] (via both VP rules), and finally the goal item [S,0,4]. The two VP analyses encode the classic ambiguity of I saw her duck.

What is this object? The completed chart over I saw her duck: nodes [S,0,4], [VP,1,4], [SBAR,2,4], [NP,2,4], [V,3,4], [PRP,0,1], [V,1,2], [PRP,2,3], and [NN,3,4], connected by the rule applications that built them.

Semantics of hypergraphs
A generalization of directed graphs. A special node is designated the goal. Every edge has a single head and 0 or more tails (the arity of the edge is the number of tails). Node labels correspond to the LHSs of CFG rules. A derivation is the generalization of the graph concept of a path to hypergraphs. Weights multiply along edges in the derivation, and add at nodes (cf. semiring parsing).
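The structure described above can be written down directly; here is a minimal sketch, with class and field names that are illustrative rather than taken from any particular toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Edge:
    head: "Node"                  # the single head node
    tails: List["Node"]           # 0 or more tail nodes; arity = len(tails)
    weight: float                 # edge weight (e.g., a rule probability)
    label: str = ""               # e.g. a rule like "1 de 2 : 2 's 1"

@dataclass
class Node:
    label: str                                      # LHS of a CFG rule, e.g. "NP,2,4"
    incoming: List[Edge] = field(default_factory=list)

class Hypergraph:
    def __init__(self):
        self.nodes: List[Node] = []
        self.goal: Optional[Node] = None            # the special goal node

    def add_node(self, label):
        node = Node(label)
        self.nodes.append(node)
        return node

    def add_edge(self, head, tails, weight, label=""):
        edge = Edge(head, tails, weight, label)
        head.incoming.append(edge)                  # edges are indexed by their head
        return edge
```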

Edge labels
Edge labels may be a mix of terminals and substitution sites (non-terminals). In translation hypergraphs, edges are labeled in both the source and target languages. The number of substitution sites must equal the arity of the edge and must be the same in both languages, but the two languages may order the substitution sites differently. There is no restriction on the number of terminal symbols.

Edge labels (example)
A hypergraph with goal node S and edges la lectura : reading, ayer : yesterday, 1 de 2 : 2 's 1, and 1 de 2 : 1 from 2. It derives the pairs {(la lectura de ayer : yesterday 's reading), (la lectura de ayer : reading from yesterday)}.

Inference algorithms
Viterbi, O(|E| + |V|): find the maximum-weighted derivation; requires a partial ordering of weights.
Inside-outside, O(|E| + |V|): compute the marginal (sum) weight of all derivations passing through each edge/node.
k-best derivations, O(|E| + |D_max| k log k): enumerate the k best derivations in the hypergraph; see the IWPT paper by Huang and Chiang (2005).
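A sketch of the Viterbi case, assuming nodes are supplied in topological order and weights form a max-product semiring; the data layout and names are illustrative.

```python
from math import prod

def viterbi(nodes, incoming, goal):
    """Best-derivation weight for each node of an acyclic hypergraph.

    nodes:    node ids in topological order (all tails precede their head)
    incoming: dict node -> list of (tails, weight) hyperedges entering it
    goal:     the goal node id
    """
    best = {}
    for v in nodes:
        edges = incoming.get(v, [])
        if not edges:                              # leaf / axiom node
            best[v] = 1.0
            continue
        # weights multiply along an edge and its tail sub-derivations;
        # the node keeps the max over its incoming edges
        best[v] = max(w * prod(best[t] for t in tails) for tails, w in edges)
    return best[goal]

# Tiny example: the "la lectura / ayer" hypergraph from the edge-label slide,
# with made-up weights for the two competing rules over "de".
nodes = ["lectura", "ayer", "S"]
incoming = {"S": [(("lectura", "ayer"), 0.6),      # 1 de 2 : 2 's 1
                  (("lectura", "ayer"), 0.4)]}     # 1 de 2 : 1 from 2
print(viterbi(nodes, incoming, "S"))               # -> 0.6
```

Replacing max with a sum at each node gives the inside pass of inside-outside.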

Things to keep in mind
Bound on the number of edges: |E| ∈ O(n³ |G|³)
Bound on the number of nodes: |V| ∈ O(n² |G|)

Decoding Again
Translation hypergraphs are a lingua franca for translation search spaces (note that FST lattices are a special case). Decoding problem: how do I build a translation hypergraph?

Representational limits
Consider this very simple SCFG translation model.
Glue rules: S → S S : 1 2 and S → S S : 2 1
Lexical rules: S → tabeta : ate, S → jon-ga : John, S → ringo-o : an apple (source terminals again restored from the tree-to-string example).

Representational limits
Phrase-based decoding runs in exponential time: all permutations of the source are modeled (a traveling salesman problem!), and typically distortion limits are used to mitigate this. But parsing is polynomial... what's going on?

Representational limits
Binary SCFGs cannot model this reordering pattern (an SCFG with a 4-ary rule can): A B C D on the source side aligned to B D A C on the target side. But can't we binarize any grammar? No: synchronous CFGs cannot, in general, be binarized!

Does this matter?
The forbidden pattern is observed in real data (Melamed, 2003). Does this matter?
Learning: phrasal units and higher-rank grammars can account for the pattern, and sentences can be simplified or ignored.
Translation: the pattern does exist, but how often must it exist (i.e., is there a good translation that doesn't violate the SCFG matching property)?

Tree-to-string
How do we generate a hypergraph for a tree-to-string translation model? A simple linear-time (given a fixed translation model) top-down matching algorithm: recursively cover the uncovered sites in the tree, and each node in the input tree becomes a node in the translation forest. For details, see Huang et al. (AMTA, 2006) and Huang et al. (EMNLP, 2010).
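A minimal sketch of the top-down matching idea, assuming single-level rules and the toy grammar from the next slides; real systems match multi-level tree fragments and build a packed forest rather than a single derivation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Tree:
    label: str
    children: List["Tree"] = field(default_factory=list)
    word: Optional[str] = None                 # set only on leaf nodes

# Single-level rules: (root label, child labels) -> target order, where an
# integer i means "the translation of child i".
RULES = {
    ("S",  ("NP", "VP")): [0, 1],              # S(x1:NP x2:VP) -> x1 x2
    ("VP", ("NP", "V")):  [1, 0],              # VP(x1:NP x2:V) -> x2 x1
}
LEX = {"jon-ga": ["John"], "ringo-o": ["an", "apple"], "tabeta": ["ate"]}

def translate(node):
    """Recursively cover the input tree top-down, matching one rule per node."""
    if node.word is not None:                  # leaf: apply a lexical rule
        return LEX[node.word]
    key = (node.label, tuple(c.label for c in node.children))
    out = []
    for sym in RULES[key]:                     # substitute translated children
        out.extend(translate(node.children[sym]))
    return out

tree = Tree("S", [Tree("NP", word="jon-ga"),
                  Tree("VP", [Tree("NP", word="ringo-o"),
                              Tree("V",  word="tabeta")])])
print(" ".join(translate(tree)))               # -> John ate an apple
```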

Tree-to-string grammar:
S(x₁:NP x₂:VP) → x₁ x₂
VP(x₁:NP x₂:V) → x₂ x₁
tabeta → ate
ringo-o → an apple
jon-ga → John

Applying this grammar to the input tree S(NP(jon-ga) VP(NP(ringo-o) V(tabeta))): the S rule matches at the root and emits its children in order 1 2; the VP rule matches below it and emits the inverted order 2 1; the lexical rules rewrite jon-ga, ringo-o, and tabeta as John, an apple, and ate. Each matched tree node becomes a node in the translation forest, and the derivation reads off John ate an apple.

Language Models

Hypergraph review
The same hypergraph as before: goal node S, and edges la lectura : reading, ayer : yesterday, 1 de 2 : 2 's 1, and 1 de 2 : 1 from 2. Each edge carries a source label and a target label, and the numbered items are substitution sites (variables / non-terminals). For LM integration, we ignore the source side! The hypergraph derives the target strings {(yesterday 's reading), (reading from yesterday)}. How can we add the LM score to each string derived by the hypergraph?

LM Integration
If LM features were purely local (a unigram model, a discriminative LM), integration would be a breeze: add an LM feature to every edge. But LM features are non-local!

Why is it hard?
Consider an edge such as de : 1 saw 2. Two problems:
1. What is the content of the variables? Slot 1 might be filled by John or by I think I, and slot 2 by Mary or by somebody, so the words adjacent to saw change with every substitution.
2. What will be the left context when this string is itself substituted somewhere? It might end up under the goal rule <s> 1 </s>, after fortunately, 1, or under a reordering rule such as 2 1.

Naive solution
Extract all (k-best?) translations from the translation model and score them with an LM. What's the problem with this?

Outline of DP solution
Use the n-gram Markov assumption to help us: in an n-gram LM, words more than n−1 positions away do not affect the local (conditional) probability of a word in context. This is not generally true of language, it is just the Markov assumption!
General approach: restructure the hypergraph so that LM probabilities decompose along edges. This solves both problems: we will not know the full value of the variables, but we will know enough, and we defer scoring of the left context until the context is established.

Hypergraph restructuring
Note the following three facts:
1. If you know n or more consecutive words, the conditional probabilities of the nth, (n+1)th, ... words can be computed. Therefore: add a feature weight to the edge for those words.
2. (n−1) words of context to the left is enough to determine the probability of any word. Therefore: split nodes based on the (n−1) words on the right side of the span dominated by every node.
3. (n−1) words on the left side of a span cannot be scored with certainty because their context is not known. Therefore: split nodes based on the (n−1) words on the left side of the span dominated by every node.
In short: split nodes by the (n−1) words on both sides of the spans of the converging edges.

Hypergraph restructuring
Algorithm ("cube intersection"): for each node v (proceeding in topological order through the nodes):
- For each edge e with head node v, compute the (n−1) words on the left and right; call this q_e. Do this by substituting the (n−1)×2-word string from the tail node corresponding to each substitution variable.
- If node v_{q_e} does not exist, create it, duplicating all outgoing edges from v so that they also proceed from v_{q_e}.
- Disconnect e from v and attach it to v_{q_e}.
- Delete v.
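A sketch of the per-edge computation at the heart of this restructuring, for a bigram LM (n = 2): given the edge's target side with each substitution site replaced by the chosen tail node's boundary state, score every bigram that has just become visible and return the new head state q_e. The data layout, the "*" marker for already-scored interior words, and the toy LM are illustrative assumptions.

```python
def combine(target_side, lm):
    """target_side: a list whose items are either a word (str) or a tail state,
    i.e. a tuple of boundary words with "*" standing for the already-scored
    interior (e.g. ("the", "*", "stain"), or ("man",) for a one-word yield).
    Returns (lm_cost, head_state)."""
    seq = []
    for item in target_side:
        if isinstance(item, str):
            seq.append(item)
        else:
            seq.extend(item)
    cost = 1.0
    for prev, word in zip(seq, seq[1:]):
        if prev != "*" and word != "*":        # this bigram is now fully visible
            cost *= lm(word, prev)
    # The leftmost word's own probability stays deferred: it only ever
    # appears as context here, never as a predicted word.
    return cost, (seq[0], seq[-1])

# Toy bigram LM for the example on the next slides.
def lm(word, prev):
    table = {("'s", "stain"): 0.2, ("the", "'s"): 0.4}
    return table.get((word, prev), 0.1)

# Edge "1 de 2 : 2 's 1" with tail 2 in state ("the", "*", "stain")
# and tail 1 in state ("the", "*", "man"):
cost, state = combine([("the", "*", "stain"), "'s", ("the", "*", "man")], lm)
print(cost, state)                             # -> ~0.08, ('the', 'man')
```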

Hypergraph restructuring (example)
A small hypergraph with two lexical nodes: one with translation edges the man (0.6) and the husband (0.4), the other with translation edges la mancha (0.1), the stain (0.7), and the gray stain (0.2). They are combined by the rules 1 de 2 : 2 's 1 (0.6) and 1 de 2 : 1 from 2 (0.4). The −LM Viterbi derivation is the stain 's the man.
Now let's add a bigram language model! Restructuring proceeds bottom-up: bigrams internal to an edge can be scored immediately (e.g., p(mancha | la), p(stain | the), p(gray | the) × p(stain | gray)), and each lexical node is split into copies labeled with their boundary words: la ... mancha, the ... man, the ... husband, the ... stain. After splitting, every node remembers enough for the edges above it to compute their LM costs.

Complexity
What is the run-time of this algorithm? O(|V| · |E|^{2(n−1)}): each node may be split once per possible pair of (n−1)-word left and right boundary strings, so going to longer n-grams is exponentially expensive!

Cube pruning
Expanding every node like this exhaustively is impractical: polynomial time, but really, really big! Cube pruning is a minor tweak on the above algorithm: compute only the k best expansions at each node, and use an estimate (usually a unigram probability) of the unscored left edge to rank them.

Cube pruning
Why "cube" pruning? Cube pruning only involves a cube when arity-2 rules are used! It would more appropriately be called square pruning with arity 1, or hypercube pruning with arity > 2!

Cube Pruning (example from Huang and Chiang, Forest Rescoring)
Combining VP spanning 1-6 from PP spanning 1-3 and VP spanning 3-6. Without the LM, the grid of combination costs is monotonic along each row and column:

                            (PP with Sharon) 1.0   (PP along Sharon) 3.0   (PP with Shalong) 8.0
(VP held meeting) 1.0              2.0                     4.0                     9.0
(VP held talk) 1.1                 2.1                     4.1                     9.1
(VP hold conference) 3.5           4.5                     6.5                    11.5

Adding bigram LM combination costs (e.g., the bigram (meeting, with)) makes the grid non-monotonic:

(VP held meeting) 1.0           2.0 + 0.5               4.0 + 5.0               9.0 + 0.5
(VP held talk) 1.1              2.1 + 0.3               4.1 + 5.4               9.1 + 0.3
(VP hold conference) 3.5        4.5 + 0.6               6.5 + 10.5             11.5 + 0.6

so the combined costs are:

(VP held meeting) 1.0              2.5                     9.0                     9.5
(VP held talk) 1.1                 2.4                     9.5                     9.4
(VP hold conference) 3.5           5.1                    17.0                    12.1

As in k-best parsing (Huang and Chiang, 2005), cube pruning keeps a priority queue of candidates: extract the best candidate from the queue, then push its two successors (its right and lower neighbors in the grid).

Cube pruning
Widely used for phrase-based and syntax-based MT. It may be applied in conjunction with a bottom-up decoder, or as a second rescoring pass. Nodes may also be grouped together (for example, all nodes corresponding to a certain source span). The requirement for a topological ordering means the translation hypergraph may not have cycles.
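A minimal sketch of the frontier exploration from the grid example above: a priority queue pops cells in (approximately) best-first order and pushes each popped cell's right and lower neighbors. The combo_cost callback stands in for the real LM interaction cost; everything else is illustrative.

```python
import heapq

def cube_prune(row_costs, col_costs, combo_cost, k):
    """Return up to k grid cells and their costs in best-first pop order."""
    def score(i, j):
        return row_costs[i] + col_costs[j] + combo_cost(i, j)

    heap = [(score(0, 0), 0, 0)]                 # start from the top-left cell
    seen = {(0, 0)}
    results = []
    while heap and len(results) < k:
        cost, i, j = heapq.heappop(heap)
        results.append(((i, j), cost))
        for ni, nj in ((i + 1, j), (i, j + 1)):  # push the two successors
            if ni < len(row_costs) and nj < len(col_costs) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (score(ni, nj), ni, nj))
    return results

# The VP/PP grid from the slides: rows are VP hypotheses, columns PP hypotheses.
rows = [1.0, 1.1, 3.5]          # held meeting / held talk / hold conference
cols = [1.0, 3.0, 8.0]          # with Sharon / along Sharon / with Shalong
lm_combo = {(0, 0): 0.5, (0, 1): 5.0, (0, 2): 0.5,
            (1, 0): 0.3, (1, 1): 5.4, (1, 2): 0.3,
            (2, 0): 0.6, (2, 1): 10.5, (2, 2): 0.6}
print(cube_prune(rows, cols, lambda i, j: lm_combo[(i, j)], k=4))
```

Because the grid is non-monotonic, cells can pop slightly out of order (here 2.5 before 2.4), which is exactly why cube pruning is an approximation.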

Alignment

Hypergraphs as Grammars
A hypergraph is isomorphic to a (synchronous) CFG. LM integration can be understood as the intersection of an LM and a CFG, and cube pruning approximates this intersection. Is the algorithm optimal?

Constrained decoding
Wu (1997) gives an algorithm that is a generalization of CKY to SCFGs; it requires an approximation of CNF. Alternative solution: parse in one language, then parse the other side of the string pair with the hypergraph grammar.

Input: ⟨dianzi shang de mao, a cat on the mat⟩. With thanks and apologies to Zhifei Li.

Input: ⟨dianzi shang de mao, a cat on the mat⟩
Parsing the source dianzi₀ shang₁ de₂ mao₃ produces a translation hypergraph: lexical edges dianzi shang : the mat (span 0-2) and mao : a cat (span 3-4); four edges for de over span 0-4, namely 0 de 1 : 1 on 0, 0 de 1 : 1 of 0, 0 de 1 : 0 's 1, and 0 de 1 : 0 1; and goal edges into S₀,₄. Reading the nodes off as nonterminals and the edges off as rules gives the isomorphic CFG:
[34] → a cat
[02] → the mat
[04a] → [34] on [02]
[04a] → [34] of [02]
[04b] → [02] 's [34]
[04b] → [02] [34]
[S] → [04a]
[S] → [04b]

Input: ⟨dianzi shang de mao, a cat on the mat⟩
Now parse the other side with the isomorphic CFG: over a cat on the mat, the lexical rules recognize [34] (a cat) and [02] (the mat), the rule [04a] → [34] on [02] covers the whole string, and [S] → [04a] completes the parse. The surviving derivations are exactly those consistent with both sides of the sentence pair.
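As a concrete rendering of the grammar read-off step just illustrated, here is a minimal sketch; the edge encoding and function name are illustrative assumptions, and the tiny hypergraph is the one from these slides.

```python
def hypergraph_to_cfg(edges):
    """edges: list of (head, target_side, tails); target_side mixes words (str)
    and tail indices (int). Returns the target-side CFG as (lhs, rhs) rules."""
    rules = []
    for head, target, tails in edges:
        rhs = tuple(tails[x] if isinstance(x, int) else x for x in target)
        rules.append((head, rhs))
    return rules

edges = [
    ("[34]",  ["a", "cat"],   []),
    ("[02]",  ["the", "mat"], []),
    ("[04a]", [1, "on", 0],   ["[02]", "[34]"]),   # 0 de 1 : 1 on 0
    ("[04a]", [1, "of", 0],   ["[02]", "[34]"]),   # 0 de 1 : 1 of 0
    ("[04b]", [0, "'s", 1],   ["[02]", "[34]"]),   # 0 de 1 : 0 's 1
    ("[04b]", [0, 1],         ["[02]", "[34]"]),   # 0 de 1 : 0 1
    ("[S]",   [0],            ["[04a]"]),
    ("[S]",   [0],            ["[04b]"]),
]
for lhs, rhs in hypergraph_to_cfg(edges):
    print(lhs, "->", " ".join(rhs))
# Parsing "a cat on the mat" with this CFG (after conversion to CNF) keeps
# exactly the derivations consistent with both sides of the sentence pair.
```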

[Plot: seconds per sentence versus source length in words, comparing the ITG algorithm with the two-parse algorithm.]