Advanced Graph-Based Parsing Techniques


Advanced Graph-Based Parsing Techniques
Joakim Nivre
Uppsala University, Linguistics and Philology
Based on previous tutorials with Ryan McDonald
Advanced Graph-Based Parsing Techniques 1(33)

Overall Plan
1. Basic notions of dependency grammar and dependency parsing
2. Graph-based and transition-based dependency parsing
3. Advanced graph-based parsing techniques
4. Advanced transition-based parsing techniques
5. Neural network techniques in dependency parsing
6. Multilingual parsing from raw text to universal dependencies

Plan for this Lecture
Projective parsing: exact higher-order parsing; approximations
Non-projective parsing: NP-completeness; exact higher-order parsing; approximations

Graph-Based Parsing Trade-Off [McDonald and Nivre 2007]
Learning and inference are global:
Decoding is guaranteed to find the highest-scoring tree
Training algorithms use global structure learning
But this is only possible with local feature factorizations:
Must limit the context the statistical model can look at
Can result in bad decisions even on easy cases
The major question in graph-based parsing has been how to increase the scope of features to larger subgraphs without making inference intractable.

Exact Higher-Order Projective Parsing
Two main dimensions of higher-order features:
Vertical: e.g., remain is the grandparent of emeritus
Horizontal: e.g., remain is the first child of will

Exact Higher-Order Projective Parsing
Easy: just modify the chart
Usually an asymptotic increase with each order modeled
But we have a bag of tricks that help

2nd-Order Horizontal Projective Parsing
Score factors by pairs of horizontally adjacent arcs, often called sibling dependencies:
s(i, j, j′) = score of adjacent arcs x_i → x_j and x_i → x_j′
s(T) = Σ_{(i,j),(i,j′) adjacent in A} s(i, j, j′)
     = ... + s(i_0, i_1, i_2) + s(i_0, i_2, i_3) + ... + s(i_0, i_{j−1}, i_j) + s(i_0, i_{j+1}, i_{j+2}) + ... + s(i_0, i_{m−1}, i_m) + ...
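The factorization can be made concrete with a small sketch (the data layout, score table, and NULL-sibling convention here are illustrative assumptions, not the lecture's implementation): a tree's second-order score sums s(i, j, j′) over horizontally adjacent modifiers of each head, with the first modifier on each side paired with a NULL sibling.

```python
NULL = None  # stands in for the absent sibling of the first modifier

def score_chain(head, mods, s):
    """Sum sibling scores along one side's modifier chain, head-outward;
    the first modifier pairs with the NULL sibling."""
    total, prev = 0.0, NULL
    for m in mods:
        total += s.get((head, prev, m), 0.0)
        prev = m
    return total

def tree_score(tree, s):
    """tree: {head: (left_mods, right_mods)}, each list ordered head-outward.
    s: {(head, sibling, modifier): score} for adjacent-arc pairs."""
    return sum(score_chain(h, left, s) + score_chain(h, right, s)
               for h, (left, right) in tree.items())
```

For example, with s = {(0, NULL, 1): 1.0, (0, 1, 3): 2.0, (1, NULL, 2): 0.5}, the tree {0: ([], [1, 3]), 1: ([], [2])} scores 1.0 + 2.0 + 0.5 = 3.5.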

2nd-Order Horizontal Projective Parsing
Add a sibling chart item to stay at O(n^3)
[Diagram: chart items combining spans (i, j) and (j, j′), scored with s(i, j, j′)]

Exact Higher-Order Projective Parsing
People have played this game since 2006:
McDonald and Pereira [2006]: 2nd-order sibling
Carreras [2007]: 2nd-order sibling and grandparent
Koo and Collins [2010]: 3rd-order grand-sibling and tri-sibling
Ma and Zhao [2012]: 4th-order grand-tri-sibling+
[Diagram from the Koo et al. 2010 presentation: factorizations arranged by vertical context (grandparents g) and horizontal context (siblings s), with run-times ranging from O(n^3) to O(n^5)]

Approximate Higher-Order Projective Parsing
Exact higher-order parsing can be done via chart augmentation, but there are drawbacks:
O(n^4), O(n^5), ... is just too slow
Every type of higher-order feature requires specialized chart items and combination rules
This led to research on approximations:
Bohnet [2010]: feature hashing, parallelization
Koo and Collins [2010]: first-order marginal probabilities
Bergsma and Cherry [2010]: classifier arc filtering
Cascades: Rush and Petrov [2012]: structured prediction cascades; He et al. [2013]: dynamic feature selection
Zhang and McDonald [2012], Zhang et al. [2013]: cube pruning

Structured Prediction Cascades [Rush and Petrov 2012]
Lower-order models prune the space for higher-order models
Weiss et al. [2010]: train level n w.r.t. level n + 1
Vine parsing allows a linear-time first stage [Dreyer et al. 2006]
100x+ faster than an unpruned 3rd-order model with a small accuracy loss (93.3 → 93.1) [Rush and Petrov 2012]
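The cascade idea can be sketched on a toy sentence (all names, scores, and the margin are illustrative assumptions; real cascades use trained filter models and efficient decoders, not brute force): a first-order model keeps only the arcs that appear in some tree scoring within a margin of the best, and the higher-order model then searches the pruned space.

```python
from itertools import product

def all_trees(n):
    """Enumerate all dependency trees over words 1..n with root 0
    (brute force; only feasible for toy sentences)."""
    trees = []
    for heads in product(*[[h for h in range(n + 1) if h != m]
                           for m in range(1, n + 1)]):
        ok = True
        for m in range(1, n + 1):           # every word must reach ROOT
            seen, cur = set(), m
            while cur != 0:
                if cur in seen:             # cycle: not a tree
                    ok = False
                    break
                seen.add(cur)
                cur = heads[cur - 1]
            if not ok:
                break
        if ok:
            trees.append(heads)
    return trees

def prune_arcs(n, s1, margin):
    """Keep the arcs appearing in some first-order tree scoring within
    `margin` of the best tree (a max-marginal filter in the spirit of
    structured prediction cascades; `margin` is an illustrative knob)."""
    trees = all_trees(n)
    def score(heads):
        return sum(s1.get((heads[m - 1], m), 0.0) for m in range(1, n + 1))
    best = max(score(t) for t in trees)
    keep = set()
    for t in trees:
        if score(t) >= best - margin:
            keep.update((t[m - 1], m) for m in range(1, n + 1))
    return keep
```

With s1 = {(0, 1): 2.0, (0, 2): 1.0, (1, 2): 3.0, (2, 1): 0.0} and margin 1.0, only the arcs of the best tree, (0, 1) and (1, 2), survive.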

Cube Pruning [Zhang and McDonald 2012, Zhang et al. 2013]
Keep the Eisner O(n^3) chart as a backbone
Use k-best lists in chart items to score higher-order features
[Diagram: grandparent features such as s(i_0 → i_5 → i_4) and s(i_0 → i_5 → i_3) scored over combinations of k-best sub-items]
Always O(n^3) asymptotically
No specialized chart parsing algorithms
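The k-best combination at the heart of cube pruning can be sketched as a lazy heap-based merge of two sorted score lists with a non-local interaction term (a generic sketch, not the cited parsers' chart code): because the non-local score breaks the monotonicity of the grid, the result is only approximately the top-k, which is exactly the trade-off cube pruning accepts.

```python
import heapq

def cube_top_k(left, right, interact, k):
    """Lazily enumerate (approximate) top-k combinations of two
    descending-sorted (score, item) lists. `interact` adds a non-local
    score on combination, so exactness is not guaranteed."""
    if not left or not right:
        return []
    def combined(i, j):
        return left[i][0] + right[j][0] + interact(left[i][1], right[j][1])
    heap = [(-combined(0, 0), 0, 0)]   # max-heap via negated scores
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        neg, i, j = heapq.heappop(heap)
        out.append((-neg, left[i][1], right[j][1]))
        for ni, nj in ((i + 1, j), (i, j + 1)):   # expand the grid frontier
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-combined(ni, nj), ni, nj))
    return out
```

For instance, a bonus on the pair ('b', 'x') can lift a low-ranked combination into the top-2 even though 'b' is not the 1-best item on its side.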

Projective Parsing Summary
Can augment the chart (dynamic program) to increase the scope of features, but this comes at a complexity cost. Solution: use pruning approximations.

Model                   En-UAS  Zh-UAS
1st order exact         91.8    84.4
2nd order exact         92.4    86.6
3rd order exact         93.0    86.8
4th order exact         93.4    87.4
struct. pred. cascades  93.1    –
cube pruning            93.5    87.9

[Koo and Collins 2010], [Ma and Zhao 2012], [Rush and Petrov 2012], [Zhang et al. 2013]
Cube pruning is 2x slower than structured prediction cascades and 5x faster than exact third-order parsing

Higher-Order Non-Projective Parsing
McDonald and Satta [2007]: parsing is NP-hard for all higher-order features (horizontal, vertical, valency, etc.)
Even seemingly simple arc features like "Is this the only modifier?" result in intractability

What to do?
Exact non-projective parsing:
Integer linear programming [Riedel and Clarke 2006, Martins et al. 2009]
Intractable in general, but efficient optimizers are often exact in practice
Higher-order parsing: asymptotic increase in the size of the constraint set
Approximations (some return the optimal solution in practice):
Approximate inference: T ≈ argmax_{T∈G_x} s(T)
Post-processing [McDonald and Pereira 2006, Hall and Novák 2005, Hall 2007]
Dual decomposition
Belief propagation [Smith and Eisner 2008]
LP relaxations [Riedel et al. 2012]
Sampling [Nakagawa 2007]
Approximate search space: T = argmax_{T∈G̃_x} s(T), with G̃_x ⊂ G_x
Mildly non-projective structures

Dual Decomposition
Assume a 2nd-order sibling model:
s(T) = Σ_{(i,j),(i,j′) adjacent in A} s(i, j, j′)
T* = argmax_{T∈G_x} s(T)
Computing T* is hard, so let us try something simpler:
s(D_i) = Σ_{(i,j),(i,j′) adjacent in D_i} s(i, j, j′)
D_i* = argmax_{D_i} s(D_i)
The highest-scoring sequence of dependents for each word x_i can be computed in O(n) time using a semi-Markov model
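The relaxed per-head subproblem can be sketched as a small dynamic program (a toy O(n^2) version; the semi-Markov formulation achieves O(n), and all scores here are illustrative): the DP state is the last chosen modifier, and each head selects its dependent sequence independently of every other head.

```python
def best_dependents(head, cands, score):
    """Best-scoring dependent sequence for one head.
    cands: candidate modifier positions in the order they would appear.
    score(head, prev, m): sibling score; prev is None for the first modifier."""
    best, back = [], []
    for t, m in enumerate(cands):
        b, bp = score(head, None, m), None          # start a new sequence at m
        for u in range(t):                          # or extend one ending at cands[u]
            c = best[u] + score(head, cands[u], m)
            if c > b:
                b, bp = c, u
        best.append(b)
        back.append(bp)
    if not best or max(best) <= 0.0:
        return 0.0, []                              # taking no dependents scores 0
    t = max(range(len(best)), key=best.__getitem__)
    seq = []
    while t is not None:                            # follow backpointers
        seq.append(cands[t])
        t = back[t]
    return max(best), list(reversed(seq))
```

With chain scores favoring 6 → 7 → 8 as siblings of head 5, the DP recovers that full sequence even though 8 is a poor first modifier on its own.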

Dual Decomposition
For each word x_i, find D_i*:
[Diagram: for "ROOT What did economic news have little effect on", each word independently selects its best dependent sequence (root; aux, nsubj, dobj, prep; amod; amod; pobj), and the selections are put together into one structure]

Dual Decomposition
Why does this not work? No tree constraint!
[Diagram: putting the independently chosen dependent sequences together yields a graph that violates the tree constraint, e.g., a word receiving two incoming arcs (prep twice)]

Dual Decomposition
A first-order O(n^2) model with the tree constraint exists: the MST algorithm [Chu and Liu 1965, Edmonds 1967]
A second-order O(n^3) model without the tree constraint exists: the O(n^3) sibling decoding algorithm
Dual decomposition [Koo et al. 2010]:
Add components for each feature type
Independently calculate each efficiently
Tie them together with agreement constraints (penalties)

Dual Decomposition
For a sentence x = x_1 ... x_n, let:
s_1o(T) be the first-order score of a tree T; T*_1o = argmax_{T∈G_x} s_1o(T)
s_2o(G) be the second-order sibling score of a graph G; G*_2o = argmax_{G∈G_x} s_2o(G)
Define structural variables:
t_1o(i, j) = 1 if (i, j) ∈ T*_1o, 0 otherwise
g_2o(i, j) = 1 if (i, j) ∈ G*_2o, 0 otherwise
What we really want to find is T* = argmax_{T∈G_x} s_1o(T) + s_2o(T)
This is equivalent to:
(T, G)* = argmax_{T∈G_x, G∈G_x} s_1o(T) + s_2o(G)  s.t.  t_1o(i, j) = g_2o(i, j) ∀ i, j ≤ n

Dual Decomposition
(T, G)* = argmax_{T∈G_x, G∈G_x} s_1o(T) + s_2o(G),  s.t.  t_1o(i, j) = g_2o(i, j)
Algorithm sketch:
for k = 1 to K
1. T*_1o = argmax_{T∈G_x} s_1o(T) − p  // first-order decoding
2. G*_2o = argmax_{G∈G_x} s_2o(G) + p  // second-order decoding
3. if t_1o(i, j) = g_2o(i, j) ∀ i, j, return T*_1o
4. else update penalties p and go to 1
If K is reached, return T*_1o from the last iteration
What are the penalties p?

Dual Decomposition
Let p(i, j) be a penalty for each candidate arc, updated in the direction of t_1o(i, j) − g_2o(i, j); p is the set of all penalties p(i, j)
We rewrite the decoding objectives as:
T*_1o = argmax_{T∈G_x} s_1o(T) − Σ_{i,j} p(i, j) · t_1o(i, j)
G*_2o = argmax_{G∈G_x} s_2o(G) + Σ_{i,j} p(i, j) · g_2o(i, j)
Reward trees/graphs that agree with the other model
Since t_1o and g_2o are arc-factored indicator variables, we can easily include the penalties in decoding: s(i, j) ← s(i, j) − p(i, j) for the first-order model

Dual Decomposition: 1-Iteration Example
[Diagram: the first-order tree and the second-order sibling graph for "ROOT What did economic news have little effect on", disagreeing on arcs (5, 3), (4, 3), and (7, 8)]
penalties: p(5, 3) = 1, p(4, 3) = −1, p(7, 8) = −1
first-order: s_1o(5, 3) −= 1, s_1o(4, 3) += 1, s_1o(7, 8) += 1
second-order: s_2o(5, *, 3) += 1, s_2o(4, *, 3) −= 1, s_2o(7, *, 8) −= 1
* indicates any sibling, including null if it is the first left/right modifier

Dual Decomposition
Goal: (T, G)* = argmax_{T∈G_x, G∈G_x} s_1o(T) + s_2o(G),  s.t.  t_1o(i, j) = g_2o(i, j)
for k = 1 to K
1. T*_1o = argmax_{T∈G_x} s_1o(T) − p  // first-order decoding
2. G*_2o = argmax_{G∈G_x} s_2o(G) + p  // second-order decoding
3. if t_1o(i, j) = g_2o(i, j) ∀ i, j, return T*_1o
4. else update penalties p and go to 1
Penalties push the scores towards agreement
Theorem: if line 3 holds for any k, then decoding is optimal
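The loop can be sketched end-to-end on a toy sentence (all scores are illustrative; both decoders are brute force, and the second component is simplified to per-arc selection without a tree constraint rather than a true sibling model, so only the agreement machinery matches the slides):

```python
from itertools import product

def all_trees(n):
    """Brute-force enumeration of trees over words 1..n, root 0
    (a toy stand-in for Chu-Liu/Edmonds MST decoding)."""
    trees = []
    for heads in product(*[[h for h in range(n + 1) if h != m]
                           for m in range(1, n + 1)]):
        ok = True
        for m in range(1, n + 1):
            seen, cur = set(), m
            while cur != 0:
                if cur in seen:
                    ok = False
                    break
                seen.add(cur)
                cur = heads[cur - 1]
            if not ok:
                break
        if ok:
            trees.append(heads)
    return trees

def dd_parse(n, s1, s2, K=50, alpha=1.0):
    """Dual decomposition sketch: a first-order tree decoder and a relaxed
    per-arc decoder (no tree constraint) are tied together by per-arc
    penalties p, updated by subgradient steps p += alpha * (t - g)."""
    arcs = [(h, m) for m in range(1, n + 1) for h in range(n + 1) if h != m]
    p = {a: 0.0 for a in arcs}
    trees = all_trees(n)
    T = None
    for _ in range(K):
        # 1. first-order decoding with penalized scores s1 - p
        T = max(trees, key=lambda hs: sum(
            s1.get((hs[m - 1], m), 0.0) - p[(hs[m - 1], m)]
            for m in range(1, n + 1)))
        t_arcs = {(T[m - 1], m) for m in range(1, n + 1)}
        # 2. relaxed decoding with rewarded scores s2 + p (no tree constraint)
        g_arcs = {a for a in arcs if s2.get(a, 0.0) + p[a] > 0.0}
        # 3. agreement is a certificate of optimality
        if t_arcs == g_arcs:
            return T, True
        # 4. push the two components towards agreement
        for a in arcs:
            p[a] += alpha * ((a in t_arcs) - (a in g_arcs))
    return T, False  # K reached: return the last first-order tree
```

On scores where both components already prefer the same arcs, the loop certifies agreement on the first iteration; on conflicting scores, the penalties shift mass until the decoders agree or K is exhausted.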

Dual Decomposition
Koo et al. [2010]: grandparents, grand-siblings, tri-siblings
Martins et al. [2011, 2013]: arbitrary siblings, head bigrams

Model      UAS
1st order  90.52
2nd order  91.85
3rd order  92.41

[Martins et al. 2013]

Mildly Non-Projective Structures
Dual decomposition approximately searches the entire space: T ≈ argmax_{T∈G_x} s(T)
Another approach is to restrict the search space: T = argmax_{T∈G̃_x} s(T), with G̃_x ⊂ G_x
1. Allow efficient decoding
2. Still cover all linguistically plausible structures
Do we really care about scoring structures like this?
[Diagram: a pathological 17-token example whose arcs cross each other many times]

Mildly Non-Projective Structures
Well-nested block-degree 2 [Bodirsky et al. 2005]: LTAG-like algorithms, O(n^7) [Gómez-Rodríguez et al. 2011]
+ 1-inherit: O(n^6) [Pitler et al. 2012], empirical coverage identical to well-nested block-degree 2
+ Head-split: O(n^6) [Satta and Kuhlmann 2013], empirical coverage similar to well-nested block-degree 2
+ Head-split + 1-inherit: O(n^5) [Satta and Kuhlmann 2013]
Gap-minding trees: O(n^5) [Pitler et al. 2012]
1-Endpoint-Crossing: O(n^4) [Pitler et al. 2013]
All run-times are for first-order parsing

1-Endpoint-Crossing [Pitler et al. 2013]
An arc A is 1-endpoint-crossing iff all arcs that cross A share a common endpoint p
An endpoint p is either the head or the modifier of an arc
E.g., (arrived, What) is crossed by (ROOT, think) and (think, ?), which both have the endpoint think
[Diagram: dependency arcs over "ROOT What do you think arrived ?"]
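The defining property is easy to check directly (a sketch; the word indices in the example are an assumed 0-based numbering of the slide's sentence, with ROOT = 0): two arcs cross iff their spans properly interleave, and a structure is 1-endpoint-crossing iff, for every arc, all arcs crossing it share a common endpoint.

```python
def crosses(a, b):
    """Two arcs cross iff their spans properly interleave."""
    l1, r1 = sorted(a)
    l2, r2 = sorted(b)
    return l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1

def is_1ec(arcs):
    """Check the 1-endpoint-crossing property: for every arc, all arcs
    crossing it must share a common endpoint."""
    for a in arcs:
        crossers = [b for b in arcs if b != a and crosses(a, b)]
        if len(crossers) > 1:           # one crosser is trivially fine
            common = set(crossers[0])
            for b in crossers[1:]:
                common &= set(b)        # intersect the endpoint sets
            if not common:
                return False
    return True
```

With indices 0..6 for "ROOT What do you think arrived ?", the arc (5, 1) = (arrived, What) is crossed by (0, 4) = (ROOT, think) and (4, 6) = (think, ?), which share endpoint 4, so the structure passes; arcs (1, 4), (0, 2), (3, 5), where the two crossers of (1, 4) share no endpoint, fail.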

1-Endpoint-Crossing
Can we design an algorithm that parses all and only 1-endpoint-crossing trees? Pitler et al. [2013] provide the solution.
The algorithm works by defining 5 types of intervals:
Interval: a contiguous sub-tree from i to j
Left: only edges with an endpoint at the left end may cross the exterior point p
Right: only edges with an endpoint at the right end may cross the exterior point p
Both / Neither: the analogous combinations
Location of the exterior point, direction of arcs, etc., are controlled via variables, similar to Eisner's [1996] projective formulation

1-Endpoint-Crossing
On the CoNLL-X data sets [Buchholz and Marsi 2006]:

Class                       Tree coverage  Run-time
Projective                  80.8           O(n^3)
Well-nested block-degree 2  98.4           O(n^7)
Gap-Minding                 95.1           O(n^5)
1-Endpoint-Crossing         98.5           O(n^4)

[Pitler et al. 2013]
Macro-average over Arabic, Czech, Danish, Dutch, Portuguese

1-Endpoint-Crossing
Good empirical coverage and low run-time
Can be linguistically motivated [Pitler et al. 2013]
[Example: cross-serial dependencies in "der mer em Hans es huus hälfed aastriiche", with crossing arcs aux, nsubj, amod, amod, prep, root]
Phase-impenetrability condition (PIC) [Chomsky 1998]:
Only the head and edge words of a phrase are accessible to the rest of the sentence
Long-distance elements leave a chain of traces at clause edges
Pitler et al. [2013] conjecture: PIC implies 1-endpoint-crossing

1-Endpoint-Crossing
Pitler [2014]: 1-endpoint-crossing + third-order, a merge of Pitler et al. [2013] and Koo and Collins [2010]
Searches 1-endpoint-crossing trees
Scores higher-order features when no crossing arc is present
O(n^4): identical to third-order projective!
Significant improvements in accuracy

Coming Up Next
1. Basic notions of dependency grammar and dependency parsing
2. Graph-based and transition-based dependency parsing
3. Advanced graph-based parsing techniques
4. Advanced transition-based parsing techniques
5. Neural network techniques in dependency parsing
6. Multilingual parsing from raw text to universal dependencies

References and Further Reading

Shane Bergsma and Colin Cherry. 2010. Fast and accurate arc filtering for dependency parsing. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 53–61.

Manuel Bodirsky, Marco Kuhlmann, and Mathias Möhl. 2005. Well-nested drawings as models of syntactic structure. In Tenth Conference on Formal Grammar and Ninth Meeting on Mathematics of Language.

Bernd Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 89–97.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL), pages 149–164.

Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 957–961.

Noam Chomsky. 1998. Minimalist inquiries: The framework. MIT Working Papers in Linguistics, MIT, Department of Linguistics.

Y. J. Chu and T. H. Liu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400.

Markus Dreyer, David A. Smith, and Noah A. Smith. 2006. Vine parsing and minimum risk reranking for speed and precision. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL), pages 201–205.

J. Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B:233–240.

Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th International Conference on Computational Linguistics (COLING), pages 340–345.

Carlos Gómez-Rodríguez, John Carroll, and David Weir. 2011. Dependency parsing schemata and mildly non-projective dependency parsing. Computational Linguistics, 37(3):541–586.

Keith Hall and Václav Novák. 2005. Corrective modeling for non-projective dependency parsing. In Proceedings of the Ninth International Workshop on Parsing Technology (IWPT), pages 42–52.

Keith Hall. 2007. K-best spanning tree parsing. In Proceedings of the Association for Computational Linguistics (ACL).

He He, Hal Daumé III, and Jason Eisner. 2013. Dynamic feature selection for dependency parsing. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP).

Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1–11.

Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola, and David Sontag. 2010. Dual decomposition for parsing with non-projective head automata. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1288–1298.

Xuezhe Ma and Hai Zhao. 2012. Fourth-order dependency parsing. In Proceedings of the Conference on Computational Linguistics (COLING), pages 785–796.

André F. T. Martins, Noah A. Smith, and Eric P. Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 342–350.

André F. T. Martins, Noah A. Smith, Pedro M. Q. Aguiar, and Mário A. T. Figueiredo. 2011. Dual decomposition with many overlapping components. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 238–249.

André F. T. Martins, Miguel B. Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of the Association for Computational Linguistics (ACL).

Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 81–88.

Ryan McDonald and Giorgio Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Proceedings of the 10th International Conference on Parsing Technologies (IWPT), pages 122–131.

Tetsuji Nakagawa. 2007. Multilingual dependency parsing using global features. In Proceedings of EMNLP-CoNLL, pages 952–956.

Emily Pitler, Sampath Kannan, and Mitchell Marcus. 2012. Dynamic programming for higher order parsing of gap-minding trees. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 478–488.

References and Further Reading Emily Pitler, Sampath Kannan, and Mitchell Marcus. 2013. Finding optimal 1-endpoint-crossing trees. Transactions of the Association for Computational Linguistics (TACL). Emily Pitler. 2014. A crossing-sensitive third-order factorization for dependency parsing. Transactions of the Association for Computational Linguistics (TACL). Sebastian Riedel and James Clarke. 2006. Incremental integer linear programming for non-projective dependency parsing. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 129 137. Association for Computational Linguistics. Sebastian Riedel, David Smith, and Andrew McCallum. 2012. Parse, price and cut: delayed column and row generation for graph based parsers. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 732 743. Association for Computational Linguistics. Alexander M Rush and Slav Petrov. 2012. Vine pruning for efficient multi-pass dependency parsing. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Advanced Graph-Based Parsing Techniques 33(33)

References and Further Reading Computational Linguistics: Human Language Technologies, pages 498 507. Association for Computational Linguistics. Giorgio Satta and Marco Kuhlmann. 2013. Efficient parsing for head-split dependency trees. Transactions of the Association for Computational Linguistics, 1(July):267 278. David A Smith and Jason Eisner. 2008. Dependency parsing by belief propagation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 145 156. Association for Computational Linguistics. David Weiss, Benjamin Sapp, and Ben Taskar. 2010. Sidestepping intractable inference with structured ensemble cascades. In Proceesings of Neural Information Processing Systems (NIPS). Hao Zhang and Ryan McDonald. 2012. Generalized higher-order dependency parsing with cube pruning. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 320 331. Association for Computational Linguistics. Liang Zhang, Huang, Kai Zhao, and Ryan McDonald. 2013. Advanced Graph-Based Parsing Techniques 33(33)

References and Further Reading Online learning for inexact hypergraph search. In Proceedings of Empirical Methods in Natural Language Processing. Advanced Graph-Based Parsing Techniques 33(33)