Vine Pruning for Efficient Multi-Pass Dependency Parsing. Alexander M. Rush and Slav Petrov


Dependency Parsing

Styles of Dependency Parsing. Transition-based parsers (Nivre 2004): greedy O(n), k-best O(kn); fast. Graph-based parsers (Eisner 2000; McDonald 2005): first-order O(n^3), second-order O(n^3), third-order O(n^4); accurate. This work sits in between, aiming for the speed of transition-based parsing with the accuracy of higher-order graph-based models.

Preview: Coarse-to-Fine Cascades (vine, first-order, second-order passes).

linear-size dependency representation

Representation Heads Modifiers

First-Order Feature Calculation

First-Order Feature Calculation (slide shows the full set of first-order feature strings fired for two example arcs: conjunctions of head and modifier POS tags, neighboring and in-between tags, and arc direction and length buckets, e.g. [VBD ADP], [NNS VBD ADP NNP], [VERB IN left 5]).
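
To make the template idea concrete, here is a small, hypothetical Python sketch of how such first-order arc features can be generated; the exact templates, names, and length buckets below are illustrative, not the talk's actual feature set.

    # Hypothetical sketch of first-order arc features for an arc head -> modifier,
    # in the spirit of McDonald (2005) / Koo (2010); not the exact templates above.
    def arc_features(pos_tags, head, mod):
        direction = "left" if mod < head else "right"
        length = min(abs(head - mod), 5)            # bucket long arcs together
        suffix = "%s %d" % (direction, length)      # e.g. "left 5"
        between = pos_tags[min(head, mod) + 1:max(head, mod)]
        feats = ["[%s]" % pos_tags[head],
                 "[%s]" % pos_tags[mod],
                 "[%s %s]" % (pos_tags[head], pos_tags[mod])]
        feats += ["[%s %s %s]" % (pos_tags[head], b, pos_tags[mod]) for b in between]
        # every template is also conjoined with arc direction and bucketed length
        return feats + ["%s %s" % (f, suffix) for f in feats]

    print(arc_features(["ROOT", "NNS", "VBD", "ADJ", "ADP", "NNP"], 2, 4))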

Arc Length by Part-of-Speech (histogram of normalized counts of arc lengths 1 to 6 for NOUN, ADP, DET, VERB, and ADJ).

Arc Length Examples: The bill intends to restrict the RTC to Treasury borrowings only unless the agency receives specific congressional authorization.

Arc Length Examples: This financing system was created in the new law in order to keep the bailout spending from swelling the budget deficit.

Arc Length Examples: But the RTC also requires working capital to maintain the bad assets of thrifts that are sold until the assets can be sold separately.

Arc Length Examples: "It's a problem that clearly has to be resolved," said David Cooke, executive director of the RTC.

Arc Length Examples: "We would have to wait until we have collected on those assets before we can move forward," he said.

Arc Length Examples: The complicated language in the huge new law has muddied the fight.

Arc Length Examples: That secrecy leads to a proposal like the one from Ways and Means, which seems to me sort of draconian, he said.

Arc Length Examples: The RTC is going to have to pay a price of prior consultation on the Hill if they want that kind of flexibility.

Arc Length Heat Map (head position 1 to 9 vs. modifier position 1 to 9).

Banded Matrix

Outer Arc

Coarse-to-Fine: a cascade of passes, vine, then first-order, then second-order, each pruning arcs for the next.
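
The overall multi-pass control loop can be sketched as below; this is a hedged illustration, and the callables passed in, as well as the threshold interpolation, are assumptions about the structure of the system rather than its actual API.

    # Sketch of a coarse-to-fine pruning cascade. Each coarse pass (e.g. vine,
    # first-order, second-order) scores the current arc set, computes per-arc
    # max-marginals and the Viterbi score, and prunes arcs for the next pass.
    def coarse_to_fine_parse(sentence, coarse_passes, final_decode, alpha):
        n = len(sentence)
        arcs = [(h, m) for h in range(n) for m in range(1, n) if h != m]
        for max_marginals in coarse_passes:
            mm, best = max_marginals(sentence, arcs)   # dict arc -> score, Viterbi score
            threshold = alpha * best + (1 - alpha) * sum(mm.values()) / len(mm)
            arcs = [a for a in arcs if mm[a] >= threshold]
        return final_decode(sentence, arcs)            # full model on the surviving arcs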

dynamic programs for parsing

Inference Questions: How do we reduce inference time to O(n)? How do we decide which arcs to prune? Vine Parsing (Eisner and Smith 2005).

Eisner First-Order Rules: an incomplete span (h, m) carrying the arc h -> m is built by joining a complete span (h, r) with a complete span (r+1, m); a complete span (h, e) is built by joining an incomplete span (h, m) with a complete span (m, e).
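
A minimal, unoptimized Python sketch of the dynamic program these rules define (Viterbi score only, no back-pointers); scores[h][m] is assumed to hold the score of arc h -> m, with position 0 as the artificial root.

    # Minimal sketch of Eisner's first-order algorithm (best projective parse score).
    def eisner_best_score(scores):
        n = len(scores)                               # tokens, including the root at 0
        NEG = float("-inf")
        # C[s][t][d]: complete span, I[s][t][d]: incomplete span;
        # d = 1 means the head is at s (right arcs), d = 0 means the head is at t.
        C = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
        I = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
        for width in range(1, n):
            for s in range(n - width):
                t = s + width
                # build an arc between s and t from two back-to-back complete spans
                best = max(C[s][r][1] + C[r + 1][t][0] for r in range(s, t))
                I[s][t][0] = best + scores[t][s]      # arc t -> s
                I[s][t][1] = best + scores[s][t]      # arc s -> t
                # extend an incomplete span with a complete one
                C[s][t][0] = max(C[s][r][0] + I[r][t][0] for r in range(s, t))
                C[s][t][1] = max(I[s][r][1] + C[r][t][1] for r in range(s + 1, t + 1))
        return C[0][n - 1][1]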

First-Order Parsing

Vine Parsing Rules: analogues of the Eisner rules anchored at the root (position 0). A root span (0, e) is extended one word at a time, attaching the new word either with a short arc or as an outer arc, so the number of chart items grows linearly with sentence length.
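
As a rough, hedged illustration of why this gives linear-time behavior: under a vine restriction (in the spirit of Eisner and Smith 2005), only arcs of length at most some bound b, plus attachments to the root, are ever considered, so the candidate arc set grows linearly with sentence length. The bound b below is an assumption chosen for illustration.

    # Hypothetical sketch of the arcs that survive a vine (short-arc) restriction:
    # arcs of length <= b, plus arcs from the root at position 0.
    def vine_arcs(n, b):
        arcs = []
        for h in range(n):
            for m in range(1, n):          # the root is never a modifier
                if h != m and (abs(h - m) <= b or h == 0):
                    arcs.append((h, m))
        return arcs

    print(len(vine_arcs(10, 3)), len(vine_arcs(100, 3)))   # grows linearly in n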

Vine Parsing

Arc Pruning: prune arcs based on max-marginals, maxmarginal(a; w) = max_{y : a in y} w.f(y), i.e. the score of the best parse that contains arc a. Can be computed with the inside-outside algorithm, as a generic algorithm over a parsing hypergraph.
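
Given inside (Viterbi) and outside scores over the parsing hypergraph, the max-marginal of any hyperedge, and hence of the arc it introduces, combines the two; a small sketch under that assumption (the Hyperedge type is illustrative, not the talk's data structure):

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass(frozen=True)
    class Hyperedge:
        head: int                # node built by this edge
        tail: Tuple[int, ...]    # nodes consumed by this edge
        weight: float

    def max_marginal(edge, inside, outside):
        # score of the best full derivation that uses this hyperedge:
        # outside score of its head, plus its weight, plus inside scores of its tails
        return outside[edge.head] + edge.weight + sum(inside[t] for t in edge.tail)

    def prune_edges(edges, inside, outside, threshold):
        # keep only hyperedges (and hence arcs) whose max-marginal clears the threshold
        return [e for e in edges if max_marginal(e, inside, outside) >= threshold]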

Max-Marginals for First-Order Arcs: is maxmarginal(arc) > threshold?

Max-Marginals for Outer Arcs: is maxmarginal(left outer arc) > threshold?

pruning and training

Max-Marginal Pruning. Goal: define a threshold on the max-marginal score. A validation parameter alpha trades off speed against accuracy: t_alpha(w) = alpha * max_y w.f(y) + (1 - alpha) * (1/|A|) * sum_{a in A} maxmarginal(a; w). The highest-scoring parse upper-bounds every max-marginal; assume the average of the max-marginals is lower than the gold parse score.
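
In code, the threshold is a one-line interpolation; a sketch assuming the Viterbi score and the per-arc max-marginals have already been computed:

    # Sketch of the pruning threshold t_alpha: interpolate between the best parse
    # score and the mean of the arc max-marginals; alpha is tuned on validation data.
    def pruning_threshold(alpha, best_score, arc_max_marginals):
        mean_mm = sum(arc_max_marginals) / len(arc_max_marginals)
        return alpha * best_score + (1.0 - alpha) * mean_mm

    # alpha = 1 keeps only arcs in the single best parse;
    # alpha = 0 keeps roughly the arcs with better-than-average max-marginals.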

Pruning Threshold (feature-space illustration: under the weight vector w, the threshold interpolates, via alpha, between the maximum parse score and the average max-marginal).

Structured Cascade Training (Weiss and Taskar 2011). Train a linear model with a loss function tailored to pruning: regularized risk minimization, min_w lambda ||w||^2 + (1/P) sum_{p=1..P} [1 - w.f(y^(p)) + t_alpha^(p)(w)]_+ , i.e. the gold parse y^(p) should beat the pruning threshold by a margin. Can be trained with a simple variant of the perceptron / Pegasos.
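
A hedged sketch of one perceptron-style subgradient step for this loss: if the gold parse does not beat the threshold by a margin of 1, move the weights toward the gold features and away from the features defining the threshold. Representing the threshold's subgradient as an interpolation of the Viterbi-parse features and an average of witness-parse features is an assumption made for illustration, not the talk's exact training procedure.

    # One subgradient step for the cascade loss [1 - w.f(gold) + t_alpha(w)]_+ .
    # All feature vectors are sparse dicts; lr is the step size, lam the L2 strength.
    def cascade_update(w, feat_gold, feat_viterbi, feat_witness_avg, alpha, lr, lam):
        keys = set(feat_viterbi) | set(feat_witness_avg)
        feat_thresh = {k: alpha * feat_viterbi.get(k, 0.0)
                          + (1.0 - alpha) * feat_witness_avg.get(k, 0.0) for k in keys}
        score = lambda f: sum(w.get(k, 0.0) * v for k, v in f.items())
        loss = max(0.0, 1.0 - score(feat_gold) + score(feat_thresh))
        for k in list(w):                       # Pegasos-style L2 shrinkage
            w[k] *= (1.0 - lr * lam)
        if loss > 0.0:                          # margin violated: push gold up, threshold down
            for k, v in feat_gold.items():
                w[k] = w.get(k, 0.0) + lr * v
            for k, v in feat_thresh.items():
                w[k] = w.get(k, 0.0) - lr * v
        return w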

Structured Cascade Training (feature-space illustration: training pushes the gold parse's score above the threshold defined by the maximum score and the average max-marginal).

experiments

Implementation. Inference: experiments use a highly optimized C++ implementation; the baseline first-order parser processes 2,000 tokens/sec; a hypergraph parsing framework with shared inference code. Model: final models are trained with hamming-loss MIRA, using the full collection of dependency parsing features (Koo 2010); the first-, second-, and third-order models match the state of the art.

Baselines:
NoPrune: exhaustive parsing model with no pruning
LocalShort: unstructured classifier over O(n) short arcs (Bergsma and Cherry 2010)
Local: unstructured classifier over O(n^2) arcs (Bergsma and Cherry 2010)
FirstOnly: structured first-order model in the cascade (Koo 2010)
VinePosterior: posterior-pruning cascade trained with L-BFGS
ZhangNivre: reimplementation of the state-of-the-art k-best transition-based parser (Zhang and Nivre 2011)

Speed/Accuracy Experiments: First-Order Parsing (relative speed vs. accuracy, 90 to 94, for NoPrune, Local, FirstOnly, VinePosterior, VineCascade, and ZhangNivre(8)).

Speed/Accuracy Experiments: Second-Order Parsing (relative speed vs. accuracy for NoPrune, Local, FirstOnly, VinePosterior, VineCascade, and ZhangNivre(16)).

Speed/Accuracy Experiments: Third-Order Parsing (relative speed vs. accuracy for NoPrune, Local, FirstOnly, VinePosterior, VineCascade, and ZhangNivre(64)).

Empirical Complexity: First-Order Parsing (parsing time vs. sentence length; empirical exponent 2.8 for NoPrune vs. 1.4 for VineCascade).

Empirical Complexity: Second-Order Parsing (parsing time vs. sentence length; empirical exponent 2.8 for NoPrune vs. 1.8 for VineCascade).

Empirical Complexity: Third-Order Parsing (parsing time vs. sentence length; empirical exponent 3.8 for NoPrune vs. 1.9 for VineCascade).

Multilingual Experiments: First-Order Parsing (relative speed of VineCascade vs. NoPrune for En, Bg, De, Pt, Sw, Zh).

Multilingual Experiments: Second-Order Parsing (relative speed of VineCascade vs. NoPrune for En, Bg, De, Pt, Sw, Zh).

Multilingual Experiments: Third-Order Parsing (relative speed of VineCascade vs. NoPrune for En, Bg, De, Pt, Sw, Zh).

Special thanks to: Ryan McDonald, Hao Zhang, Michael Ringgaard, Terry Koo, Keith Hall, Kuzman Ganchev, Yoav Goldberg, Andre Martins, and the rest of the Google NLP team.