Vine Pruning for Efficient Multi-Pass Dependency Parsing. Alexander M. Rush and Slav Petrov


Dependency Parsing

Styles of Dependency Parsing. Transition-based parsers (Nivre 2004): greedy O(n), k-best O(kn); fast. Graph-based parsers (Eisner 2000; McDonald 2005): first-order O(n^3), second-order O(n^3), third-order O(n^4); accurate. This work sits in between, aiming for the speed of transition-based parsing with the accuracy of higher-order graph-based models.

Preview: Coarse-to-Fine Cascades (vine, first-order, second-order passes).

linear-size dependency representation

Representation Heads Modifiers

First-Order Feature Calculation

First-Order Feature Calculation (slide shows the full set of first-order feature strings fired for two example arcs: conjunctions of head and modifier POS tags, neighboring and in-between tags, and arc direction and length buckets, e.g. [VBD ADP], [NNS VBD ADP NNP], [VERB IN left 5]).
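
To make the template idea concrete, here is a small, hypothetical Python sketch of how such first-order arc features can be generated; the exact templates, names, and length buckets below are illustrative, not the talk's actual feature set.

    # Hypothetical sketch of first-order arc features for an arc head -> modifier,
    # in the spirit of McDonald (2005) / Koo (2010); not the exact templates above.
    def arc_features(pos_tags, head, mod):
        direction = "left" if mod < head else "right"
        length = min(abs(head - mod), 5)            # bucket long arcs together
        suffix = "%s %d" % (direction, length)      # e.g. "left 5"
        between = pos_tags[min(head, mod) + 1:max(head, mod)]
        feats = ["[%s]" % pos_tags[head],
                 "[%s]" % pos_tags[mod],
                 "[%s %s]" % (pos_tags[head], pos_tags[mod])]
        feats += ["[%s %s %s]" % (pos_tags[head], b, pos_tags[mod]) for b in between]
        # every template is also conjoined with arc direction and bucketed length
        return feats + ["%s %s" % (f, suffix) for f in feats]

    print(arc_features(["ROOT", "NNS", "VBD", "ADJ", "ADP", "NNP"], 2, 4))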

Arc Length by Part-of-Speech (histogram of normalized counts of arc lengths 1 to 6 for NOUN, ADP, DET, VERB, and ADJ).

Arc Length Examples: The bill intends to restrict the RTC to Treasury borrowings only unless the agency receives specific congressional authorization.

Arc Length Examples: This financing system was created in the new law in order to keep the bailout spending from swelling the budget deficit.

Arc Length Examples: But the RTC also requires working capital to maintain the bad assets of thrifts that are sold until the assets can be sold separately.

Arc Length Examples: "It's a problem that clearly has to be resolved," said David Cooke, executive director of the RTC.

Arc Length Examples: "We would have to wait until we have collected on those assets before we can move forward," he said.

Arc Length Examples: The complicated language in the huge new law has muddied the fight.

Arc Length Examples: That secrecy leads to a proposal like the one from Ways and Means, which seems to me sort of draconian, he said.

Arc Length Examples: The RTC is going to have to pay a price of prior consultation on the Hill if they want that kind of flexibility.

Arc Length Heat Map (head position 1 to 9 vs. modifier position 1 to 9).

Banded Matrix

Outer Arc

Coarse-to-Fine: a cascade of passes, vine, then first-order, then second-order, each pruning arcs for the next.
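
The overall multi-pass control loop can be sketched as below; this is a hedged illustration, and the callables passed in, as well as the threshold interpolation, are assumptions about the structure of the system rather than its actual API.

    # Sketch of a coarse-to-fine pruning cascade. Each coarse pass (e.g. vine,
    # first-order, second-order) scores the current arc set, computes per-arc
    # max-marginals and the Viterbi score, and prunes arcs for the next pass.
    def coarse_to_fine_parse(sentence, coarse_passes, final_decode, alpha):
        n = len(sentence)
        arcs = [(h, m) for h in range(n) for m in range(1, n) if h != m]
        for max_marginals in coarse_passes:
            mm, best = max_marginals(sentence, arcs)   # dict arc -> score, Viterbi score
            threshold = alpha * best + (1 - alpha) * sum(mm.values()) / len(mm)
            arcs = [a for a in arcs if mm[a] >= threshold]
        return final_decode(sentence, arcs)            # full model on the surviving arcs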

dynamic programs for parsing

Inference Questions: How do we reduce inference time to O(n)? How do we decide which arcs to prune? Vine Parsing (Eisner and Smith 2005).

Eisner First-Order Rules: an incomplete span (h, m) carrying the arc h -> m is built by joining a complete span (h, r) with a complete span (r+1, m); a complete span (h, e) is built by joining an incomplete span (h, m) with a complete span (m, e).
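
A minimal, unoptimized Python sketch of the dynamic program these rules define (Viterbi score only, no back-pointers); scores[h][m] is assumed to hold the score of arc h -> m, with position 0 as the artificial root.

    # Minimal sketch of Eisner's first-order algorithm (best projective parse score).
    def eisner_best_score(scores):
        n = len(scores)                               # tokens, including the root at 0
        NEG = float("-inf")
        # C[s][t][d]: complete span, I[s][t][d]: incomplete span;
        # d = 1 means the head is at s (right arcs), d = 0 means the head is at t.
        C = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
        I = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
        for width in range(1, n):
            for s in range(n - width):
                t = s + width
                # build an arc between s and t from two back-to-back complete spans
                best = max(C[s][r][1] + C[r + 1][t][0] for r in range(s, t))
                I[s][t][0] = best + scores[t][s]      # arc t -> s
                I[s][t][1] = best + scores[s][t]      # arc s -> t
                # extend an incomplete span with a complete one
                C[s][t][0] = max(C[s][r][0] + I[r][t][0] for r in range(s, t))
                C[s][t][1] = max(I[s][r][1] + C[r][t][1] for r in range(s + 1, t + 1))
        return C[0][n - 1][1]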

First-Order Parsing

Vine Parsing Rules: analogues of the Eisner rules anchored at the root (position 0). A root span (0, e) is extended one word at a time, attaching the new word either with a short arc or as an outer arc, so the number of chart items grows linearly with sentence length.
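
As a rough, hedged illustration of why this gives linear-time behavior: under a vine restriction (in the spirit of Eisner and Smith 2005), only arcs of length at most some bound b, plus attachments to the root, are ever considered, so the candidate arc set grows linearly with sentence length. The bound b below is an assumption chosen for illustration.

    # Hypothetical sketch of the arcs that survive a vine (short-arc) restriction:
    # arcs of length <= b, plus arcs from the root at position 0.
    def vine_arcs(n, b):
        arcs = []
        for h in range(n):
            for m in range(1, n):          # the root is never a modifier
                if h != m and (abs(h - m) <= b or h == 0):
                    arcs.append((h, m))
        return arcs

    print(len(vine_arcs(10, 3)), len(vine_arcs(100, 3)))   # grows linearly in n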

Vine Parsing

Arc Pruning: prune arcs based on max-marginals, maxmarginal(a; w) = max_{y : a in y} w.f(y), i.e. the score of the best parse that contains arc a. Can be computed with the inside-outside algorithm, as a generic algorithm over a parsing hypergraph.
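
Given inside (Viterbi) and outside scores over the parsing hypergraph, the max-marginal of any hyperedge, and hence of the arc it introduces, combines the two; a small sketch under that assumption (the Hyperedge type is illustrative, not the talk's data structure):

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass(frozen=True)
    class Hyperedge:
        head: int                # node built by this edge
        tail: Tuple[int, ...]    # nodes consumed by this edge
        weight: float

    def max_marginal(edge, inside, outside):
        # score of the best full derivation that uses this hyperedge:
        # outside score of its head, plus its weight, plus inside scores of its tails
        return outside[edge.head] + edge.weight + sum(inside[t] for t in edge.tail)

    def prune_edges(edges, inside, outside, threshold):
        # keep only hyperedges (and hence arcs) whose max-marginal clears the threshold
        return [e for e in edges if max_marginal(e, inside, outside) >= threshold]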

Max-Marginals for First-Order Arcs: is maxmarginal(arc) > threshold?

Max-Marginals for Outer Arcs: is maxmarginal(left outer arc) > threshold?

pruning and training

Max-Marginal Pruning. Goal: define a threshold on the max-marginal score. A validation parameter alpha trades off speed against accuracy: t_alpha(w) = alpha * max_y w.f(y) + (1 - alpha) * (1/|A|) * sum_{a in A} maxmarginal(a; w). The highest-scoring parse upper-bounds every max-marginal; assume the average of the max-marginals is lower than the gold parse score.
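
In code, the threshold is a one-line interpolation; a sketch assuming the Viterbi score and the per-arc max-marginals have already been computed:

    # Sketch of the pruning threshold t_alpha: interpolate between the best parse
    # score and the mean of the arc max-marginals; alpha is tuned on validation data.
    def pruning_threshold(alpha, best_score, arc_max_marginals):
        mean_mm = sum(arc_max_marginals) / len(arc_max_marginals)
        return alpha * best_score + (1.0 - alpha) * mean_mm

    # alpha = 1 keeps only arcs in the single best parse;
    # alpha = 0 keeps roughly the arcs with better-than-average max-marginals.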

Pruning Threshold (feature-space illustration: under the weight vector w, the threshold interpolates, via alpha, between the maximum parse score and the average max-marginal).

Structured Cascade Training (Weiss and Taskar 2011). Train a linear model with a loss function tailored to pruning: regularized risk minimization, min_w lambda ||w||^2 + (1/P) sum_{p=1..P} [1 - w.f(y^(p)) + t_alpha^(p)(w)]_+ , i.e. the gold parse y^(p) should beat the pruning threshold by a margin. Can be trained with a simple variant of the perceptron / Pegasos.
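
A hedged sketch of one perceptron-style subgradient step for this loss: if the gold parse does not beat the threshold by a margin of 1, move the weights toward the gold features and away from the features defining the threshold. Representing the threshold's subgradient as an interpolation of the Viterbi-parse features and an average of witness-parse features is an assumption made for illustration, not the talk's exact training procedure.

    # One subgradient step for the cascade loss [1 - w.f(gold) + t_alpha(w)]_+ .
    # All feature vectors are sparse dicts; lr is the step size, lam the L2 strength.
    def cascade_update(w, feat_gold, feat_viterbi, feat_witness_avg, alpha, lr, lam):
        keys = set(feat_viterbi) | set(feat_witness_avg)
        feat_thresh = {k: alpha * feat_viterbi.get(k, 0.0)
                          + (1.0 - alpha) * feat_witness_avg.get(k, 0.0) for k in keys}
        score = lambda f: sum(w.get(k, 0.0) * v for k, v in f.items())
        loss = max(0.0, 1.0 - score(feat_gold) + score(feat_thresh))
        for k in list(w):                       # Pegasos-style L2 shrinkage
            w[k] *= (1.0 - lr * lam)
        if loss > 0.0:                          # margin violated: push gold up, threshold down
            for k, v in feat_gold.items():
                w[k] = w.get(k, 0.0) + lr * v
            for k, v in feat_thresh.items():
                w[k] = w.get(k, 0.0) - lr * v
        return w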

Structured Cascade Training (feature-space illustration: training pushes the gold parse's score above the threshold defined by the maximum score and the average max-marginal).

experiments

Implementation. Inference: experiments use a highly optimized C++ implementation; the baseline first-order parser processes 2,000 tokens/sec; a hypergraph parsing framework with shared inference code. Model: final models are trained with hamming-loss MIRA, using the full collection of dependency parsing features (Koo 2010); the first-, second-, and third-order models match the state of the art.

Baselines:
NoPrune: exhaustive parsing model with no pruning
LocalShort: unstructured classifier over O(n) short arcs (Bergsma and Cherry 2010)
Local: unstructured classifier over O(n^2) arcs (Bergsma and Cherry 2010)
FirstOnly: structured first-order model in the cascade (Koo 2010)
VinePosterior: posterior-pruning cascade trained with L-BFGS
ZhangNivre: reimplementation of the state-of-the-art k-best transition-based parser (Zhang and Nivre 2011)

Speed/Accuracy Experiments: First-Order Parsing (relative speed vs. accuracy, 90 to 94, for NoPrune, Local, FirstOnly, VinePosterior, VineCascade, and ZhangNivre(8)).

Speed/Accuracy Experiments: Second-Order Parsing (relative speed vs. accuracy for NoPrune, Local, FirstOnly, VinePosterior, VineCascade, and ZhangNivre(16)).

Speed/Accuracy Experiments: Third-Order Parsing (relative speed vs. accuracy for NoPrune, Local, FirstOnly, VinePosterior, VineCascade, and ZhangNivre(64)).

Empirical Complexity: First-Order Parsing (parsing time vs. sentence length; empirical exponent 2.8 for NoPrune vs. 1.4 for VineCascade).

Empirical Complexity: Second-Order Parsing (parsing time vs. sentence length; empirical exponent 2.8 for NoPrune vs. 1.8 for VineCascade).

Empirical Complexity: Third-Order Parsing (parsing time vs. sentence length; empirical exponent 3.8 for NoPrune vs. 1.9 for VineCascade).

Multilingual Experiments: First-Order Parsing (relative speed of VineCascade vs. NoPrune for En, Bg, De, Pt, Sw, Zh).

Multilingual Experiments: Second-Order Parsing (relative speed of VineCascade vs. NoPrune for En, Bg, De, Pt, Sw, Zh).

Multilingual Experiments: Third-Order Parsing (relative speed of VineCascade vs. NoPrune for En, Bg, De, Pt, Sw, Zh).

Special thanks to: Ryan McDonald, Hao Zhang, Michael Ringgaard, Terry Koo, Keith Hall, Kuzman Ganchev, Yoav Goldberg, Andre Martins, and the rest of the Google NLP team.